diff --git a/docs/ml-features.md b/docs/ml-features.md index ca1ccc40509d0b3c00bf31e790c920a4de74e60b..1d3449746c9be2567895b5c8fc68f4622f8e7219 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -1423,12 +1423,12 @@ for more details on the API. `ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the [Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which -features to choose. It supports three selection methods: `numTopFeatures`, `percentile`, `fpr`: - +features to choose. It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`: * `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. * `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number. * `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection. - +* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold. +* `fwe` chooses all features whose p-value is below a threshold, thus controlling the family-wise error rate of selection. By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. The user can choose a selection method using `setSelectorType`. diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md index 42568c312e70e984b128a9ca1910d36d5925dd45..acd28943132db2fadd03559f73ea02997c90ca9f 100644 --- a/docs/mllib-feature-extraction.md +++ b/docs/mllib-feature-extraction.md @@ -227,11 +227,13 @@ both speed and statistical learning behavior. 
[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the [Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which -features to choose. It supports three selection methods: `numTopFeatures`, `percentile`, `fpr`: +features to choose. It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`: * `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. * `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number. * `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection. +* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold. +* `fwe` chooses all features whose p-value is below a threshold, thus controlling the family-wise error rate of selection. By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. The user can choose a selection method using `setSelectorType`. diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala index 8699929bab7933cda613a5412a653971266ad679..353bd186daf019c9c638453ded6da284da3e41b3 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala @@ -91,9 +91,37 @@ private[feature] trait ChiSqSelectorParams extends Params @Since("2.1.0") def getFpr: Double = $(fpr) + /** + * The upper bound of the expected false discovery rate. 
+ * Only applicable when selectorType = "fdr". + * Default value is 0.05. + * @group param + */ + @Since("2.2.0") + final val fdr = new DoubleParam(this, "fdr", + "The upper bound of the expected false discovery rate.", ParamValidators.inRange(0, 1)) + setDefault(fdr -> 0.05) + + /** @group getParam */ + def getFdr: Double = $(fdr) + + /** + * The upper bound of the expected family-wise error rate. + * Only applicable when selectorType = "fwe". + * Default value is 0.05. + * @group param + */ + @Since("2.2.0") + final val fwe = new DoubleParam(this, "fwe", + "The upper bound of the expected family-wise error rate.", ParamValidators.inRange(0, 1)) + setDefault(fwe -> 0.05) + + /** @group getParam */ + def getFwe: Double = $(fwe) + /** * The selector type of the ChisqSelector. - * Supported options: "numTopFeatures" (default), "percentile", "fpr". + * Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe". * @group param */ @Since("2.1.0") @@ -111,11 +139,17 @@ private[feature] trait ChiSqSelectorParams extends Params /** * Chi-Squared feature selection, which selects categorical features to use for predicting a * categorical label. - * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`. + * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`, + * `fdr`, `fwe`. * - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. * - `percentile` is similar but chooses a fraction of all features instead of a fixed number. * - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false * positive rate of selection. + * - `fdr` uses the [Benjamini-Hochberg procedure] + * (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) + * to choose all features whose false discovery rate is below a threshold. 
+ * - `fwe` chooses all features whose p-value is below a threshold, + * thus controlling the family-wise error rate of selection. * By default, the selection method is `numTopFeatures`, with the default number of top features * set to 50. */ @@ -138,6 +172,14 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str @Since("2.1.0") def setFpr(value: Double): this.type = set(fpr, value) + /** @group setParam */ + @Since("2.2.0") + def setFdr(value: Double): this.type = set(fdr, value) + + /** @group setParam */ + @Since("2.2.0") + def setFwe(value: Double): this.type = set(fwe, value) + /** @group setParam */ @Since("2.1.0") def setSelectorType(value: String): this.type = set(selectorType, value) @@ -167,6 +209,8 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str .setNumTopFeatures($(numTopFeatures)) .setPercentile($(percentile)) .setFpr($(fpr)) + .setFdr($(fdr)) + .setFwe($(fwe)) val model = selector.fit(input) copyValues(new ChiSqSelectorModel(uid, model).setParent(this)) } diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala index 034e3625e8c01a47824cd1072c91492fc2c3a9fe..b32d3f252ae597b6a7693e2ea7ac9f928e929bde 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala @@ -639,12 +639,16 @@ private[python] class PythonMLLibAPI extends Serializable { numTopFeatures: Int, percentile: Double, fpr: Double, + fdr: Double, + fwe: Double, data: JavaRDD[LabeledPoint]): ChiSqSelectorModel = { new ChiSqSelector() .setSelectorType(selectorType) .setNumTopFeatures(numTopFeatures) .setPercentile(percentile) .setFpr(fpr) + .setFdr(fdr) + .setFwe(fwe) .fit(data.rdd) } diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala index 7ef2a95b96f2dc122d70b185eebac88b0be6dc24..9dea3c3e843c4af5fa0dad71ff704cb541aabb60 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala @@ -171,11 +171,17 @@ object ChiSqSelectorModel extends Loader[ChiSqSelectorModel] { /** * Creates a ChiSquared feature selector. - * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`. + * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`, + * `fdr`, `fwe`. * - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. * - `percentile` is similar but chooses a fraction of all features instead of a fixed number. * - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false * positive rate of selection. + * - `fdr` uses the [Benjamini-Hochberg procedure] + * (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) + * to choose all features whose false discovery rate is below a threshold. + * - `fwe` chooses all features whose p-value is below a threshold, + * thus controlling the family-wise error rate of selection. * By default, the selection method is `numTopFeatures`, with the default number of top features * set to 50. 
*/ @@ -184,6 +190,8 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable { var numTopFeatures: Int = 50 var percentile: Double = 0.1 var fpr: Double = 0.05 + var fdr: Double = 0.05 + var fwe: Double = 0.05 var selectorType = ChiSqSelector.NumTopFeatures /** @@ -215,6 +223,20 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable { this } + @Since("2.2.0") + def setFdr(value: Double): this.type = { + require(0.0 <= value && value <= 1.0, "FDR must be in [0,1]") + fdr = value + this + } + + @Since("2.2.0") + def setFwe(value: Double): this.type = { + require(0.0 <= value && value <= 1.0, "FWE must be in [0,1]") + fwe = value + this + } + @Since("2.1.0") def setSelectorType(value: String): this.type = { require(ChiSqSelector.supportedSelectorTypes.contains(value), @@ -245,6 +267,21 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable { case ChiSqSelector.FPR => chiSqTestResult .filter { case (res, _) => res.pValue < fpr } + case ChiSqSelector.FDR => + // This uses the Benjamini-Hochberg procedure. + // https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure + val tempRes = chiSqTestResult + .sortBy { case (res, _) => res.pValue } + val maxIndex = tempRes + .zipWithIndex + .filter { case ((res, _), index) => + res.pValue <= fdr * (index + 1) / chiSqTestResult.length } + .map { case (_, index) => index } + .max + tempRes.take(maxIndex + 1) + case ChiSqSelector.FWE => + chiSqTestResult + .filter { case (res, _) => res.pValue < fwe / chiSqTestResult.length } case errorType => throw new IllegalStateException(s"Unknown ChiSqSelector Type: $errorType") } @@ -255,19 +292,22 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable { private[spark] object ChiSqSelector { - /** - * String name for `numTopFeatures` selector type. - */ - val NumTopFeatures: String = "numTopFeatures" + /** String name for `numTopFeatures` selector type. 
*/ + private[spark] val NumTopFeatures: String = "numTopFeatures" - /** - * String name for `percentile` selector type. - */ - val Percentile: String = "percentile" + /** String name for `percentile` selector type. */ + private[spark] val Percentile: String = "percentile" /** String name for `fpr` selector type. */ - val FPR: String = "fpr" + private[spark] val FPR: String = "fpr" + + /** String name for `fdr` selector type. */ + private[spark] val FDR: String = "fdr" + + /** String name for `fwe` selector type. */ + private[spark] val FWE: String = "fwe" + /** Set of selector types that ChiSqSelector supports. */ - val supportedSelectorTypes: Array[String] = Array(NumTopFeatures, Percentile, FPR) + val supportedSelectorTypes: Array[String] = Array(NumTopFeatures, Percentile, FPR, FDR, FWE) } diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala index 80970fd74488174e84ccfc59d0e853a0b2b18558..f6c68b9314cadbb97338224d0579bf2206d1fe17 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala @@ -79,6 +79,12 @@ class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext ChiSqSelectorSuite.testSelector(selector, dataset) } + test("Test Chi-Square selector: fwe") { + val selector = new ChiSqSelector() + .setOutputCol("filtered").setSelectorType("fwe").setFwe(0.6) + ChiSqSelectorSuite.testSelector(selector, dataset) + } + test("read/write") { def checkModelData(model: ChiSqSelectorModel, model2: ChiSqSelectorModel): Unit = { assert(model.selectedFeatures === model2.selectedFeatures) diff --git a/mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala index 77219e500617df2f9ef88a58b6a65f7b8cd0ccca..305cb4cbbdeeaa1b33e7e880f5df17fb9d124845 100644 
--- a/mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala @@ -27,60 +27,143 @@ class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext { /* * Contingency tables - * feature0 = {8.0, 0.0} + * feature0 = {6.0, 0.0, 8.0} * class 0 1 2 - * 8.0||1|0|1| - * 0.0||0|2|0| + * 6.0||1|0|0| + * 0.0||0|3|0| + * 8.0||0|0|2| + * degree of freedom = 4, statistic = 12, pValue = 0.017 * * feature1 = {7.0, 9.0} * class 0 1 2 * 7.0||1|0|0| - * 9.0||0|2|1| + * 9.0||0|3|2| + * degree of freedom = 2, statistic = 6, pValue = 0.049 * - * feature2 = {0.0, 6.0, 8.0, 5.0} + * feature2 = {0.0, 6.0, 3.0, 8.0} * class 0 1 2 * 0.0||1|0|0| - * 6.0||0|1|0| + * 6.0||0|1|2| + * 3.0||0|1|0| * 8.0||0|1|0| - * 5.0||0|0|1| + * degree of freedom = 6, statistic = 8.66, pValue = 0.193 + * + * feature3 = {7.0, 0.0, 5.0, 4.0} + * class 0 1 2 + * 7.0||1|0|0| + * 0.0||0|2|0| + * 5.0||0|1|1| + * 4.0||0|0|1| + * degree of freedom = 6, statistic = 9.5, pValue = 0.147 + * + * feature4 = {6.0, 5.0, 4.0, 0.0} + * class 0 1 2 + * 6.0||1|1|0| + * 5.0||0|2|0| + * 4.0||0|0|1| + * 0.0||0|0|1| + * degree of freedom = 6, statistic = 8.0, pValue = 0.238 + * + * feature5 = {0.0, 9.0, 5.0, 4.0} + * class 0 1 2 + * 0.0||1|0|1| + * 9.0||0|1|0| + * 5.0||0|1|0| + * 4.0||0|1|1| + * degree of freedom = 6, statistic = 5, pValue = 0.54 * * Use chi-squared calculator from Internet */ - test("ChiSqSelector transform test (sparse & dense vector)") { - val labeledDiscreteData = sc.parallelize( - Seq(LabeledPoint(0.0, Vectors.sparse(3, Array((0, 8.0), (1, 7.0)))), - LabeledPoint(1.0, Vectors.sparse(3, Array((1, 9.0), (2, 6.0)))), - LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 8.0))), - LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0)))), 2) + lazy val labeledDiscreteData = sc.parallelize( + Seq(LabeledPoint(0.0, Vectors.sparse(6, Array((0, 6.0), (1, 7.0), (3, 7.0), (4, 6.0)))), + LabeledPoint(1.0, 
Vectors.sparse(6, Array((1, 9.0), (2, 6.0), (4, 5.0), (5, 9.0)))), + LabeledPoint(1.0, Vectors.sparse(6, Array((1, 9.0), (2, 3.0), (4, 5.0), (5, 5.0)))), + LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 8.0, 5.0, 6.0, 4.0))), + LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 5.0, 4.0, 4.0))), + LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 4.0, 0.0, 0.0)))), 2) + + test("ChiSqSelector transform by numTopFeatures test (sparse & dense vector)") { val preFilteredData = - Seq(LabeledPoint(0.0, Vectors.dense(Array(8.0))), - LabeledPoint(1.0, Vectors.dense(Array(0.0))), - LabeledPoint(1.0, Vectors.dense(Array(0.0))), - LabeledPoint(2.0, Vectors.dense(Array(8.0)))) - val model = new ChiSqSelector(1).fit(labeledDiscreteData) + Set(LabeledPoint(0.0, Vectors.dense(Array(6.0, 7.0, 7.0))), + LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 0.0))), + LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 0.0))), + LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 5.0))), + LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0))), + LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 4.0)))) + + val model = new ChiSqSelector(3).fit(labeledDiscreteData) val filteredData = labeledDiscreteData.map { lp => LabeledPoint(lp.label, model.transform(lp.features)) - }.collect().toSeq + }.collect().toSet assert(filteredData === preFilteredData) } - test("ChiSqSelector by fpr transform test (sparse & dense vector)") { - val labeledDiscreteData = sc.parallelize( - Seq(LabeledPoint(0.0, Vectors.sparse(4, Array((0, 8.0), (1, 7.0)))), - LabeledPoint(1.0, Vectors.sparse(4, Array((1, 9.0), (2, 6.0), (3, 4.0)))), - LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 8.0, 4.0))), - LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0, 9.0)))), 2) + test("ChiSqSelector transform by Percentile test (sparse & dense vector)") { val preFilteredData = - Seq(LabeledPoint(0.0, Vectors.dense(Array(0.0))), - LabeledPoint(1.0, Vectors.dense(Array(4.0))), - LabeledPoint(1.0, Vectors.dense(Array(4.0))), - 
LabeledPoint(2.0, Vectors.dense(Array(9.0)))) - val model: ChiSqSelectorModel = new ChiSqSelector().setSelectorType("fpr") - .setFpr(0.1).fit(labeledDiscreteData) + Set(LabeledPoint(0.0, Vectors.dense(Array(6.0, 7.0, 7.0))), + LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 0.0))), + LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 0.0))), + LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 5.0))), + LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0))), + LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 4.0)))) + + val model = new ChiSqSelector().setSelectorType("percentile").setPercentile(0.5) + .fit(labeledDiscreteData) + val filteredData = labeledDiscreteData.map { lp => + LabeledPoint(lp.label, model.transform(lp.features)) + }.collect().toSet + assert(filteredData === preFilteredData) + } + + test("ChiSqSelector transform by FPR test (sparse & dense vector)") { + val preFilteredData = + Set(LabeledPoint(0.0, Vectors.dense(Array(6.0, 7.0, 7.0))), + LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 0.0))), + LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 0.0))), + LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 5.0))), + LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0))), + LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 4.0)))) + + val model = new ChiSqSelector().setSelectorType("fpr").setFpr(0.15) + .fit(labeledDiscreteData) + val filteredData = labeledDiscreteData.map { lp => + LabeledPoint(lp.label, model.transform(lp.features)) + }.collect().toSet + assert(filteredData === preFilteredData) + } + + test("ChiSqSelector transform by FDR test (sparse & dense vector)") { + val preFilteredData = + Set(LabeledPoint(0.0, Vectors.dense(Array(6.0, 7.0))), + LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0))), + LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0))), + LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0))), + LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0))), + LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0)))) + + val model = new 
ChiSqSelector().setSelectorType("fdr").setFdr(0.15) + .fit(labeledDiscreteData) + val filteredData = labeledDiscreteData.map { lp => + LabeledPoint(lp.label, model.transform(lp.features)) + }.collect().toSet + assert(filteredData === preFilteredData) + } + + test("ChiSqSelector transform by FWE test (sparse & dense vector)") { + val preFilteredData = + Set(LabeledPoint(0.0, Vectors.dense(Array(6.0, 7.0))), + LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0))), + LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0))), + LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0))), + LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0))), + LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0)))) + + val model = new ChiSqSelector().setSelectorType("fwe").setFwe(0.3) + .fit(labeledDiscreteData) val filteredData = labeledDiscreteData.map { lp => LabeledPoint(lp.label, model.transform(lp.features)) - }.collect().toSeq + }.collect().toSet assert(filteredData === preFilteredData) } diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 62c31431b58ff19b68aa97d08a8a9d6fd9063071..dbd17e01d221308d235612bcab0ef9edb9d2840d 100755 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -2629,8 +2629,28 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja """ .. note:: Experimental - Chi-Squared feature selection, which selects categorical features to use for predicting a - categorical label. + Creates a ChiSquared feature selector. + The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`, + `fdr`, `fwe`. + + * `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. + + * `percentile` is similar but chooses a fraction of all features + instead of a fixed number. + + * `fpr` chooses all features whose p-value is below a threshold, + thus controlling the false positive rate of selection. 
+ + * `fdr` uses the `Benjamini-Hochberg procedure <https://en.wikipedia.org/wiki/ + False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_ + to choose all features whose false discovery rate is below a threshold. + + * `fwe` chooses all features whose p-value is below a threshold, + thus controlling the family-wise error rate of selection. + + By default, the selection method is `numTopFeatures`, with the default number of top features + set to 50. + >>> from pyspark.ml.linalg import Vectors >>> df = spark.createDataFrame( @@ -2676,27 +2696,37 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja fpr = Param(Params._dummy(), "fpr", "The highest p-value for features to be kept.", typeConverter=TypeConverters.toFloat) + fdr = Param(Params._dummy(), "fdr", "The upper bound of the expected false discovery rate.", + typeConverter=TypeConverters.toFloat) + + fwe = Param(Params._dummy(), "fwe", "The upper bound of the expected family-wise error rate.", + typeConverter=TypeConverters.toFloat) + @keyword_only def __init__(self, numTopFeatures=50, featuresCol="features", outputCol=None, - labelCol="label", selectorType="numTopFeatures", percentile=0.1, fpr=0.05): + labelCol="label", selectorType="numTopFeatures", percentile=0.1, fpr=0.05, + fdr=0.05, fwe=0.05): """ __init__(self, numTopFeatures=50, featuresCol="features", outputCol=None, \ - labelCol="label", selectorType="numTopFeatures", percentile=0.1, fpr=0.05) + labelCol="label", selectorType="numTopFeatures", percentile=0.1, fpr=0.05, \ + fdr=0.05, fwe=0.05) """ super(ChiSqSelector, self).__init__() self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.ChiSqSelector", self.uid) self._setDefault(numTopFeatures=50, selectorType="numTopFeatures", percentile=0.1, - fpr=0.05) + fpr=0.05, fdr=0.05, fwe=0.05) kwargs = self.__init__._input_kwargs self.setParams(**kwargs) @keyword_only @since("2.0.0") def setParams(self, numTopFeatures=50, featuresCol="features", outputCol=None, 
- labelCol="labels", selectorType="numTopFeatures", percentile=0.1, fpr=0.05): + labelCol="labels", selectorType="numTopFeatures", percentile=0.1, fpr=0.05, + fdr=0.05, fwe=0.05): """ setParams(self, numTopFeatures=50, featuresCol="features", outputCol=None, \ - labelCol="labels", selectorType="numTopFeatures", percentile=0.1, fpr=0.05) + labelCol="labels", selectorType="numTopFeatures", percentile=0.1, fpr=0.05, \ + fdr=0.05, fwe=0.05) Sets params for this ChiSqSelector. """ kwargs = self.setParams._input_kwargs @@ -2761,6 +2791,36 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja """ return self.getOrDefault(self.fpr) + @since("2.2.0") + def setFdr(self, value): + """ + Sets the value of :py:attr:`fdr`. + Only applicable when selectorType = "fdr". + """ + return self._set(fdr=value) + + @since("2.2.0") + def getFdr(self): + """ + Gets the value of fdr or its default value. + """ + return self.getOrDefault(self.fdr) + + @since("2.2.0") + def setFwe(self, value): + """ + Sets the value of :py:attr:`fwe`. + Only applicable when selectorType = "fwe". + """ + return self._set(fwe=value) + + @since("2.2.0") + def getFwe(self): + """ + Gets the value of fwe or its default value. + """ + return self.getOrDefault(self.fwe) + def _create_model(self, java_model): return ChiSqSelectorModel(java_model) diff --git a/python/pyspark/mllib/feature.py b/python/pyspark/mllib/feature.py index bde0f67be775c3d22b5c69668a6416e051af0d90..61f2bc7492ad69fbd2b29044fa97c24a2c27c041 100644 --- a/python/pyspark/mllib/feature.py +++ b/python/pyspark/mllib/feature.py @@ -274,11 +274,24 @@ class ChiSqSelectorModel(JavaVectorTransformer): class ChiSqSelector(object): """ Creates a ChiSquared feature selector. - The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`. - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. 
- `percentile` is similar but chooses a fraction of all features instead of a fixed number. - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false - positive rate of selection. + The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`, + `fdr`, `fwe`. + + * `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. + + * `percentile` is similar but chooses a fraction of all features + instead of a fixed number. + + * `fpr` chooses all features whose p-value is below a threshold, + thus controlling the false positive rate of selection. + + * `fdr` uses the `Benjamini-Hochberg procedure <https://en.wikipedia.org/wiki/ + False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_ + to choose all features whose false discovery rate is below a threshold. + + * `fwe` chooses all features whose p-value is below a threshold, + thus controlling the family-wise error rate of selection. + By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. @@ -305,11 +318,14 @@ class ChiSqSelector(object): .. versionadded:: 1.4.0 """ - def __init__(self, numTopFeatures=50, selectorType="numTopFeatures", percentile=0.1, fpr=0.05): + def __init__(self, numTopFeatures=50, selectorType="numTopFeatures", percentile=0.1, fpr=0.05, + fdr=0.05, fwe=0.05): self.numTopFeatures = numTopFeatures self.selectorType = selectorType self.percentile = percentile self.fpr = fpr + self.fdr = fdr + self.fwe = fwe @since('2.1.0') def setNumTopFeatures(self, numTopFeatures): @@ -338,11 +354,29 @@ class ChiSqSelector(object): self.fpr = float(fpr) return self + @since('2.2.0') + def setFdr(self, fdr): + """ + set FDR [0.0, 1.0] for feature selection by FDR. + Only applicable when selectorType = "fdr". + """ + self.fdr = float(fdr) + return self + + @since('2.2.0') + def setFwe(self, fwe): + """ + set FWE [0.0, 1.0] for feature selection by FWE. 
+ Only applicable when selectorType = "fwe". + """ + self.fwe = float(fwe) + return self + @since('2.1.0') def setSelectorType(self, selectorType): """ set the selector type of the ChisqSelector. - Supported options: "numTopFeatures" (default), "percentile", "fpr". + Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe". """ self.selectorType = str(selectorType) return self @@ -358,7 +392,7 @@ class ChiSqSelector(object): Apply feature discretizer before using this function. """ jmodel = callMLlibFunc("fitChiSqSelector", self.selectorType, self.numTopFeatures, - self.percentile, self.fpr, data) + self.percentile, self.fpr, self.fdr, self.fwe, data) return ChiSqSelectorModel(jmodel)
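
The `fpr`, `fdr`, and `fwe` branches of `ChiSqSelector.fit` in this diff can be checked by hand against the six p-values listed in the test-suite comment. The following is a minimal pure-Python sketch of those three rules, not the Spark implementation. The function names `select_fpr`, `select_fdr`, and `select_fwe` are hypothetical, and two behaviours are choices made for this sketch only: indices are returned in ascending order, and an empty list is returned when no p-value passes the Benjamini-Hochberg test (the Scala `FDR` branch calls `.max` on the filtered indices, which throws on an empty collection).

```python
def select_fpr(p_values, alpha):
    """Keep every feature whose p-value is strictly below alpha."""
    return [i for i, p in enumerate(p_values) if p < alpha]


def select_fdr(p_values, alpha):
    """Benjamini-Hochberg: keep the k smallest p-values, where k is the
    largest rank (1-based) such that p_(k) <= alpha * k / m."""
    m = len(p_values)
    # Feature indices ordered by ascending p-value.
    ranked = sorted(range(m), key=lambda i: p_values[i])
    passing = [rank for rank, i in enumerate(ranked, start=1)
               if p_values[i] <= alpha * rank / m]
    if not passing:
        # Sketch-only choice: select nothing instead of raising.
        return []
    return sorted(ranked[:max(passing)])


def select_fwe(p_values, alpha):
    """Bonferroni-style rule: compare each p-value against alpha / m."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p < alpha / m]


# p-values for feature0..feature5, copied from the test-suite comment.
p_values = [0.017, 0.049, 0.193, 0.147, 0.238, 0.54]

print(select_fpr(p_values, 0.15))  # [0, 1, 3]
print(select_fdr(p_values, 0.15))  # [0, 1]
print(select_fwe(p_values, 0.3))   # [0, 1]
```

These results line up with the expected data in the mllib tests: the FPR test keeps three columns (features 0, 1, and 3), while the FDR and FWE tests keep two (features 0 and 1).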