[SPARK-21326][SPARK-21066][ML] Use TextFileFormat in LibSVMFileFormat and allow multiple input paths for determining numFeatures

## What changes were proposed in this pull request?

This is related to [SPARK-19918](https://issues.apache.org/jira/browse/SPARK-19918) and [SPARK-18362](https://issues.apache.org/jira/browse/SPARK-18362).

This PR proposes to use `TextFileFormat` and to allow multiple input paths (with a warning) when determining the number of features in the LibSVM data source via an extra scan.

There are three points here:

- The main advantage of this change should be removing file-listing bottlenecks on the driver side.
- Another advantage comes from using `FileScanRDD`. For example, I guess we can use the `spark.sql.files.ignoreCorruptFiles` option when determining the number of features.
- We can unify the schema inference code path in text-based data sources. This is also a preparation for [SPARK-21289](https://issues.apache.org/jira/browse/SPARK-21289).

## How was this patch tested?

Unit tests in `LibSVMRelationSuite`.

Closes #18288

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18556 from HyukjinKwon/libsvm-schema.
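To illustrate what "determining the number of features via an extra scan" amounts to, here is a minimal sketch in plain Scala with no Spark dependency. It is not the code from this patch: `NumFeaturesSketch`, `maxIndex`, and `numFeatures` are hypothetical names, and the sketch assumes the usual LibSVM line layout (`label index:value index:value ...` with one-based indices) and that blank lines and lines starting with `#` are skipped.

```scala
// Hypothetical illustration only; this is NOT the implementation in MLUtils.
object NumFeaturesSketch {

  /** Largest one-based feature index on one LibSVM line, or 0 if the line has no features. */
  def maxIndex(line: String): Int = {
    val tokens = line.trim.split("\\s+")
    // tokens(0) is the label; every remaining token is assumed to be "index:value".
    tokens.drop(1)
      .filter(_.nonEmpty)
      .map(_.split(":")(0).toInt)
      .foldLeft(0)((a, b) => math.max(a, b))
  }

  /** One pass over all lines: the number of features is the largest index seen. */
  def numFeatures(lines: Iterator[String]): Int =
    lines
      .map(_.trim)
      .filter(l => l.nonEmpty && !l.startsWith("#")) // assumption: skip blanks and comments
      .map(maxIndex)
      .foldLeft(0)((a, b) => math.max(a, b))
}
```

With several input paths, the same pass would simply run over the lines of all files together, for example `NumFeaturesSketch.numFeatures(linesFromFileA ++ linesFromFileB)`, where both iterators are placeholders for however the lines are read. Passing the `numFeatures` option to the LibSVM data source avoids this extra pass entirely.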
Changed files:
- mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala (13 additions, 13 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala (23 additions, 2 deletions)
- mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala (13 additions, 4 deletions)