[SPARK-1752][MLLIB] Standardize text format for vectors and labeled points
We should standardize the text format used to represent vectors and labeled points. The proposed formats are the following:

1. dense vector: `[v0,v1,..]`
2. sparse vector: `(size,[i0,i1],[v0,v1])`
3. labeled point: `(label,vector)`

where "(..)" indicates a tuple and "[...]" indicates an array.

`loadLabeledPoints` is added to pyspark's `MLUtils`. I didn't add `loadVectors` to pyspark because `RDD.saveAsTextFile` cannot stringify dense vectors in the proposed format automatically.

`MLUtils#saveLabeledData` and `MLUtils#loadLabeledData` are deprecated. Users should use `RDD#saveAsTextFile` and `MLUtils#loadLabeledPoints` instead. In Scala, `MLUtils#loadLabeledPoints` is compatible with the format used by `MLUtils#loadLabeledData`.

CC: @mateiz, @srowen

Author: Xiangrui Meng <meng@databricks.com>

Closes #685 from mengxr/labeled-io and squashes the following commits:

- 2d1116a [Xiangrui Meng] make loadLabeledData/saveLabeledData deprecated since 1.0.1
- 297be75 [Xiangrui Meng] change LabeledPoint.parse to LabeledPointParser.parse to maintain binary compatibility
- d6b1473 [Xiangrui Meng] Merge branch 'master' into labeled-io
- 56746ea [Xiangrui Meng] replace # by .
- 623a5f0 [Xiangrui Meng] merge master
- f06d5ba [Xiangrui Meng] add docs and minor updates
- 640fe0c [Xiangrui Meng] throw SparkException
- 5bcfbc4 [Xiangrui Meng] update test to add scientific notations
- e86bf38 [Xiangrui Meng] remove NumericTokenizer
- 050fca4 [Xiangrui Meng] use StringTokenizer
- 6155b75 [Xiangrui Meng] merge master
- f644438 [Xiangrui Meng] remove parse methods based on eval from pyspark
- a41675a [Xiangrui Meng] python loadLabeledPoint uses Scala's implementation
- ce9a475 [Xiangrui Meng] add deserialize_labeled_point to pyspark with tests
- e9fcd49 [Xiangrui Meng] add serializeLabeledPoint and tests
- aea4ae3 [Xiangrui Meng] minor updates
- 810d6df [Xiangrui Meng] update tokenizer/parser implementation
- 7aac03a [Xiangrui Meng] remove Scala parsers
- c1885c1 [Xiangrui Meng] add headers and minor changes
- b0c50cb [Xiangrui Meng] add customized parser
- d731817 [Xiangrui Meng] style update
- 63dc396 [Xiangrui Meng] add loadLabeledPoints to pyspark
- ea122b5 [Xiangrui Meng] Merge branch 'master' into labeled-io
- cd6c78f [Xiangrui Meng] add __str__ and parse to LabeledPoint
- a7a178e [Xiangrui Meng] add stringify to pyspark's Vectors
- 5c2dbfa [Xiangrui Meng] add parse to pyspark's Vectors
- 7853f88 [Xiangrui Meng] update pyspark's SparseVector.__str__
- e761d32 [Xiangrui Meng] make LabelPoint.parse compatible with the dense format used before v1.0 and deprecate loadLabeledData and saveLabeledData
- 9e63a02 [Xiangrui Meng] add loadVectors and loadLabeledPoints
- 19aa523 [Xiangrui Meng] update toString and add parsers for Vectors and LabeledPoint
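
To make the three proposed text formats concrete, here is a minimal, dependency-free Python sketch that writes and reads them. It is an illustration only, not the `NumericParser` added by this change: the function names `format_dense`, `format_sparse`, `format_labeled_point`, and `parse_numeric` are hypothetical, and the sketch assumes that every number can be read as a float, that `[...]` maps to a list, and that `(...)` maps to a tuple. In line with the commit that removed the `eval`-based parsers from pyspark, it tokenizes by hand instead of evaluating the text.

```python
def format_dense(values):
    """Dense vector text: [v0,v1,..]"""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def format_sparse(size, indices, values):
    """Sparse vector text: (size,[i0,i1],[v0,v1])"""
    idx = ",".join(str(int(i)) for i in indices)
    vals = ",".join(repr(float(v)) for v in values)
    return "(%d,[%s],[%s])" % (size, idx, vals)


def format_labeled_point(label, vector_text):
    """Labeled point text: (label,vector), where vector is already formatted."""
    return "(%s,%s)" % (repr(float(label)), vector_text)


def parse_numeric(text):
    """Parse '[..]' into a list, '(..)' into a tuple, any other token into a float."""
    pos = 0
    n = len(text)

    def skip_ws():
        nonlocal pos
        while pos < n and text[pos].isspace():
            pos += 1

    def parse_value():
        nonlocal pos
        skip_ws()
        if pos >= n:
            raise ValueError("unexpected end of input")
        if text[pos] == "[":
            return parse_seq("]")
        if text[pos] == "(":
            return tuple(parse_seq(")"))
        # Read a numeric token up to the next delimiter or whitespace.
        start = pos
        while pos < n and text[pos] not in ",[]()" and not text[pos].isspace():
            pos += 1
        return float(text[start:pos])  # raises ValueError on an empty or bad token

    def parse_seq(close):
        nonlocal pos
        pos += 1  # consume the opening bracket or parenthesis
        items = []
        skip_ws()
        if pos < n and text[pos] == close:
            pos += 1
            return items
        while True:
            items.append(parse_value())
            skip_ws()
            if pos >= n:
                raise ValueError("missing closing %r" % close)
            if text[pos] == ",":
                pos += 1
            elif text[pos] == close:
                pos += 1
                return items
            else:
                raise ValueError("unexpected %r at position %d" % (text[pos], pos))

    result = parse_value()
    skip_ws()
    if pos != n:
        raise ValueError("trailing characters at position %d" % pos)
    return result
```

In this sketch a labeled point with a sparse feature vector parses to a nested tuple, so a caller would tell dense from sparse by checking whether the second element is a list or a tuple. Sizes and indices come back as floats and would need converting to integers before use.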
- examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala (1 addition, 1 deletion)
- mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala (29 additions, 4 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala (28 additions, 5 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/regression/LabeledPoint.scala (29 additions, 2 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/util/LinearDataGenerator.scala (2 additions, 1 deletion)
- mllib/src/main/scala/org/apache/spark/mllib/util/LogisticRegressionDataGenerator.scala (2 additions, 1 deletion)
- mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala (42 additions, 5 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/util/NumericParser.scala (121 additions, 0 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/util/SVMDataGenerator.scala (1 addition, 1 deletion)
- mllib/src/test/scala/org/apache/spark/mllib/api/python/PythonMLLibAPISuite.scala (60 additions, 0 deletions)
- mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala (25 additions, 0 deletions)
- mllib/src/test/scala/org/apache/spark/mllib/regression/LabeledPointSuite.scala (39 additions, 0 deletions)
- mllib/src/test/scala/org/apache/spark/mllib/util/MLUtilsSuite.scala (29 additions, 1 deletion)
- mllib/src/test/scala/org/apache/spark/mllib/util/NumericParserSuite.scala (42 additions, 0 deletions)
- python/pyspark/mllib/_common.py (51 additions, 21 deletions)
- python/pyspark/mllib/linalg.py (24 additions, 10 deletions)
- python/pyspark/mllib/regression.py (4 additions, 1 deletion)
- python/pyspark/mllib/util.py (50 additions, 19 deletions)