Skip to content
Snippets Groups Projects
Commit 07a2b873 authored by Reynold Xin's avatar Reynold Xin Committed by hyukjinkwon
Browse files

[SPARK-21778][SQL] Simpler Dataset.sample API in Scala / Java

## What changes were proposed in this pull request?
Dataset.sample requires a boolean flag withReplacement as the first argument. However, most of the time users simply want to sample some records without replacement. This ticket introduces a new sample function that simply takes in the fraction and seed.

## How was this patch tested?
Tested manually. Not sure yet if we should add a test case for just this wrapper ...

Author: Reynold Xin <rxin@databricks.com>

Closes #18988 from rxin/SPARK-21778.
parent 310454be
No related branches found
No related tags found
No related merge requests found
......@@ -1848,11 +1848,43 @@ class Dataset[T] private[sql](
Except(logicalPlan, other.logicalPlan)
}
/**
* Returns a new [[Dataset]] by sampling a fraction of rows (without replacement),
* using a user-supplied seed.
*
* @param fraction Fraction of rows to generate, range [0.0, 1.0].
* @param seed Seed for sampling.
*
* @note This is NOT guaranteed to provide exactly the fraction of the count
* of the given [[Dataset]].
*
* @group typedrel
* @since 2.3.0
*/
def sample(fraction: Double, seed: Long): Dataset[T] = {
sample(withReplacement = false, fraction = fraction, seed = seed)
}
/**
* Returns a new [[Dataset]] by sampling a fraction of rows (without replacement).
*
* @param fraction Fraction of rows to generate, range [0.0, 1.0].
*
* @note This is NOT guaranteed to provide exactly the fraction of the count
* of the given [[Dataset]].
*
* @group typedrel
* @since 2.3.0
*/
def sample(fraction: Double): Dataset[T] = {
sample(withReplacement = false, fraction = fraction)
}
/**
* Returns a new [[Dataset]] by sampling a fraction of rows, using a user-supplied seed.
*
* @param withReplacement Sample with replacement or not.
* @param fraction Fraction of rows to generate.
* @param fraction Fraction of rows to generate, range [0.0, 1.0].
* @param seed Seed for sampling.
*
* @note This is NOT guaranteed to provide exactly the fraction of the count
......@@ -1871,7 +1903,7 @@ class Dataset[T] private[sql](
* Returns a new [[Dataset]] by sampling a fraction of rows, using a random seed.
*
* @param withReplacement Sample with replacement or not.
* @param fraction Fraction of rows to generate.
* @param fraction Fraction of rows to generate, range [0.0, 1.0].
*
* @note This is NOT guaranteed to provide exactly the fraction of the total count
* of the given [[Dataset]].
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment