[SPARK-15509][ML][SPARKR] R MLlib algorithms should support input columns "features" and "label"
https://issues.apache.org/jira/browse/SPARK-15509

## What changes were proposed in this pull request?

Currently in SparkR, when you load a LibSVM dataset using the sqlContext and then pass it to an MLlib algorithm, the ML wrappers fail because they try to create a "features" column, which conflicts with the existing "features" column produced by the LibSVM loader. For example, using the "mnist" dataset from LibSVM:

```r
training <- loadDF(sqlContext, ".../mnist", "libsvm")
model <- naiveBayes(label ~ features, training)
```

This fails with:

```
16/05/24 11:52:41 ERROR RBackendHandler: fit on org.apache.spark.ml.r.NaiveBayesWrapper failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.IllegalArgumentException: Output column features already exists.
	at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120)
	at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
	at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
	at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
	at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179)
	at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
	at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131)
	at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169)
	at org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62)
	at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca
```

The same issue appears for the "label" column once you rename the "features" column.

The cause is that `loadDF()` sometimes generates DataFrames with the default column names `"label"` and `"features"`, and these two names conflict with the default column names `setDefault(labelCol, "label")` and `setDefault(featuresCol, "features")` in `SharedParams.scala`.

## How was this patch tested?

Tested on my local machine.

Author: Xin Ren <iamshrek@126.com>

Closes #13584 from keypointt/SPARK-15509.
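The patch adds an `RWrapperUtils` helper that detects such collisions before the `RFormula` is fit. The core idea, picking an output column name that does not clash with the input schema, can be sketched in plain Scala. The helper below is a hypothetical simplification for illustration only; the actual utility in this patch operates on a Spark `RFormula` and `Dataset`, and the `_output` suffix is an assumed convention:

```scala
// Hypothetical sketch of column-name deconfliction (not the actual
// RWrapperUtils implementation, which works on RFormula and Dataset).
object ColumnNameDedup {
  // Given the existing column names of a DataFrame and a desired output
  // column (e.g. "features"), return a name that does not collide, by
  // appending a suffix until the candidate is unique.
  def deconflict(existing: Set[String], desired: String): String = {
    var candidate = desired
    while (existing.contains(candidate)) {
      candidate = candidate + "_output"
    }
    candidate
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    // A LibSVM-loaded DataFrame already has "label" and "features" columns.
    val cols = Set("label", "features")
    println(ColumnNameDedup.deconflict(cols, "features"))   // features_output
    println(ColumnNameDedup.deconflict(cols, "prediction")) // prediction
  }
}
```

With this kind of renaming applied to the wrapper's output column, the `VectorAssembler` inside the `RFormula` pipeline no longer attempts to create a column that already exists.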
Showing 9 changed files:
- mllib/src/main/scala/org/apache/spark/ml/r/AFTSurvivalRegressionWrapper.scala (1 addition, 0 deletions)
- mllib/src/main/scala/org/apache/spark/ml/r/GaussianMixtureWrapper.scala (3 additions, 2 deletions)
- mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala (1 addition, 0 deletions)
- mllib/src/main/scala/org/apache/spark/ml/r/IsotonicRegressionWrapper.scala (3 additions, 2 deletions)
- mllib/src/main/scala/org/apache/spark/ml/r/KMeansWrapper.scala (3 additions, 2 deletions)
- mllib/src/main/scala/org/apache/spark/ml/r/NaiveBayesWrapper.scala (6 additions, 5 deletions)
- mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala (71 additions, 0 deletions)
- mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala (0 additions, 3 deletions)
- mllib/src/test/scala/org/apache/spark/ml/r/RWrapperUtilsSuite.scala (56 additions, 0 deletions)