Commit 021dafc6 authored by Yuhao Yang, committed by Joseph K. Bradley

[SPARK-12026][MLLIB] ChiSqTest gets slower and slower over time when number of features is large

jira: https://issues.apache.org/jira/browse/SPARK-12026

The issue is valid as features.toArray.view.zipWithIndex.slice(startCol, endCol) becomes slower as startCol gets larger.

I tested locally: the change improves performance, and the running time stays stable.
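To illustrate the two access patterns, here is a minimal sketch that uses a plain Array[Double] in place of the MLlib Vector. The object and method names (SliceVsIndex, viaViewSlice, viaDirectIndex) are hypothetical and not part of Spark. The stated cause of the slowdown is an assumption: a sliced view over a zipped view may have to step past the first startCol elements on every traversal, whereas direct indexing touches only the columns in the batch.

```scala
// Hypothetical stand-in for the per-row work in ChiSqTest; not Spark code.
object SliceVsIndex {
  // Old pattern: zip every element with its index lazily, then slice out the batch.
  // Assumption: traversing the sliced view skips over the first startCol elements,
  // so the cost per row grows with startCol.
  def viaViewSlice(features: Array[Double], startCol: Int, endCol: Int): List[(Int, Double)] =
    features.view.zipWithIndex.slice(startCol, endCol).map { case (f, col) => (col, f) }.toList

  // New pattern: index only the columns of the current batch,
  // so the cost per row depends on the batch width (endCol - startCol).
  def viaDirectIndex(features: Array[Double], startCol: Int, endCol: Int): List[(Int, Double)] =
    (startCol until endCol).map { col => (col, features(col)) }.toList
}
```

Both methods return the same (col, feature) pairs for a given batch; only the amount of work needed to reach the batch differs.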

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #10146 from hhbyyh/chiSq.
parent cd81fc9e
@@ -109,7 +109,9 @@ private[stat] object ChiSqTest extends Logging {
   }
   i += 1
   distinctLabels += label
-  features.toArray.view.zipWithIndex.slice(startCol, endCol).map { case (feature, col) =>
+  val brzFeatures = features.toBreeze
+  (startCol until endCol).map { col =>
+    val feature = brzFeatures(col)
     allDistinctFeatures(col) += feature
     (col, feature, label)
   }
@@ -122,7 +124,7 @@ private[stat] object ChiSqTest extends Logging {
     pairCounts.keys.filter(_._1 == startCol).map(_._3).toArray.distinct.zipWithIndex.toMap
   }
   val numLabels = labels.size
-  pairCounts.keys.groupBy(_._1).map { case (col, keys) =>
+  pairCounts.keys.groupBy(_._1).foreach { case (col, keys) =>
     val features = keys.map(_._2).toArray.distinct.zipWithIndex.toMap
     val numRows = features.size
     val contingency = new BDM(numRows, numLabels, new Array[Double](numRows * numLabels))
......