    [SPARK-1752][MLLIB] Standardize text format for vectors and labeled points · 189df165
    Xiangrui Meng authored
    We should standardize the text format used to represent vectors and labeled points. The proposed formats are the following:
    
    1. dense vector: `[v0,v1,..]`
    2. sparse vector: `(size,[i0,i1],[v0,v1])`
    3. labeled point: `(label,vector)`
    
    where "(..)" indicates a tuple and "[..]" indicates an array. `loadLabeledPoints` is added to pyspark's `MLUtils`. I didn't add `loadVectors` to pyspark because `RDD.saveAsTextFile` cannot stringify dense vectors in the proposed format automatically.
    
    `MLUtils#saveLabeledData` and `MLUtils#loadLabeledData` are deprecated. Users should use `RDD#saveAsTextFile` and `MLUtils#loadLabeledPoints` instead. In Scala, `MLUtils#loadLabeledPoints` is compatible with the format used by `MLUtils#loadLabeledData`.
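
The compatibility claim can be illustrated with a short standalone sketch. It assumes the legacy `loadLabeledData` layout is `label,v0 v1 v2` (a label, a comma, then space-separated feature values); that layout is an assumption here, not something stated in this message. The function name `parse_dense_labeled_point` is hypothetical, and the sketch handles only dense vectors in the new format.

```python
def parse_dense_labeled_point(line):
    """Parse a dense labeled point in either the new or the assumed legacy format.

    New format:               "(label,[v0,v1,...])"
    Legacy format (assumed):  "label,v0 v1 v2"
    Returns (label, features) as (float, list of float).
    """
    line = line.strip()
    if line.startswith("("):
        # New format: strip the outer tuple parentheses, then split off the label.
        if not line.endswith(")"):
            raise ValueError("unbalanced parentheses: %r" % line)
        label_text, vector_text = line[1:-1].split(",", 1)
        vector_text = vector_text.strip()
        if not (vector_text.startswith("[") and vector_text.endswith("]")):
            raise ValueError("only dense vectors are handled in this sketch")
        body = vector_text[1:-1].strip()
        features = [float(v) for v in body.split(",")] if body else []
    else:
        # Assumed legacy format: the label comes before the first comma,
        # and the features are separated by whitespace.
        label_text, features_text = line.split(",", 1)
        features = [float(v) for v in features_text.split()]
    return float(label_text), features
```

Dispatching on the leading `(` lets one reader accept both layouts, so files written in the older layout would not need to be converted before loading.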
    
    CC: @mateiz, @srowen
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #685 from mengxr/labeled-io and squashes the following commits:
    
    2d1116a [Xiangrui Meng] make loadLabeledData/saveLabeledData deprecated since 1.0.1
    297be75 [Xiangrui Meng] change LabeledPoint.parse to LabeledPointParser.parse to maintain binary compatibility
    d6b1473 [Xiangrui Meng] Merge branch 'master' into labeled-io
    56746ea [Xiangrui Meng] replace # by .
    623a5f0 [Xiangrui Meng] merge master
    f06d5ba [Xiangrui Meng] add docs and minor updates
    640fe0c [Xiangrui Meng] throw SparkException
    5bcfbc4 [Xiangrui Meng] update test to add scientific notations
    e86bf38 [Xiangrui Meng] remove NumericTokenizer
    050fca4 [Xiangrui Meng] use StringTokenizer
    6155b75 [Xiangrui Meng] merge master
    f644438 [Xiangrui Meng] remove parse methods based on eval from pyspark
    a41675a [Xiangrui Meng] python loadLabeledPoint uses Scala's implementation
    ce9a475 [Xiangrui Meng] add deserialize_labeled_point to pyspark with tests
    e9fcd49 [Xiangrui Meng] add serializeLabeledPoint and tests
    aea4ae3 [Xiangrui Meng] minor updates
    810d6df [Xiangrui Meng] update tokenizer/parser implementation
    7aac03a [Xiangrui Meng] remove Scala parsers
    c1885c1 [Xiangrui Meng] add headers and minor changes
    b0c50cb [Xiangrui Meng] add customized parser
    d731817 [Xiangrui Meng] style update
    63dc396 [Xiangrui Meng] add loadLabeledPoints to pyspark
    ea122b5 [Xiangrui Meng] Merge branch 'master' into labeled-io
    cd6c78f [Xiangrui Meng] add __str__ and parse to LabeledPoint
    a7a178e [Xiangrui Meng] add stringify to pyspark's Vectors
    5c2dbfa [Xiangrui Meng] add parse to pyspark's Vectors
    7853f88 [Xiangrui Meng] update pyspark's SparseVector.__str__
    e761d32 [Xiangrui Meng] make LabelPoint.parse compatible with the dense format used before v1.0 and deprecate loadLabeledData and saveLabeledData
    9e63a02 [Xiangrui Meng] add loadVectors and loadLabeledPoints
    19aa523 [Xiangrui Meng] update toString and add parsers for Vectors and LabeledPoint