Skip to content
Snippets Groups Projects
  • Xiangrui Meng's avatar
    80c29689
    [SPARK-1212] Adding sparse data support and update KMeans · 80c29689
    Xiangrui Meng authored
    Continue our discussions from https://github.com/apache/incubator-spark/pull/575
    
    This PR is WIP because it depends on a SNAPSHOT version of breeze.
    
    Per previous discussions and benchmarks, I switched to breeze for linear algebra operations. @dlwh and I made some improvements to breeze to keep its performance comparable to the bare-bone implementation, including norm computation and squared distance. This is why this PR needs to depend on a SNAPSHOT version of breeze.
    
    @fommil , please find the notice of using netlib-core in `NOTICE`. This is following Apache's instructions on appropriate labeling.
    
    I'm going to update this PR to include:
    
    1. Fast distance computation: using `\|a\|_2^2 + \|b\|_2^2 - 2 a^T b` when it doesn't introduce too much numerical error. The squared norms are pre-computed. Otherwise, computing the distance between the center (dense) and a point (possibly sparse) always takes O(n) time.
    
    2. Some numbers about the performance.
    
    3. A released version of breeze. @dlwh, a minor release of breeze will help this PR get merged early. Do you mind sharing breeze's release plan? Thanks!
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #117 from mengxr/sparse-kmeans and squashes the following commits:
    
    67b368d [Xiangrui Meng] fix SparseVector.toArray
    5eda0de [Xiangrui Meng] update NOTICE
    67abe31 [Xiangrui Meng] move ArrayRDDs to mllib.rdd
    1da1033 [Xiangrui Meng] remove dependency on commons-math3 and compute EPSILON directly
    9bb1b31 [Xiangrui Meng] optimize SparseVector.toArray
    226d2cd [Xiangrui Meng] update Java friendly methods in Vectors
    238ba34 [Xiangrui Meng] add VectorRDDs with a converter from RDD[Array[Double]]
    b28ba2f [Xiangrui Meng] add toArray to Vector
    e69b10c [Xiangrui Meng] remove examples/JavaKMeans.java, which is replaced by mllib/examples/JavaKMeans.java
    72bde33 [Xiangrui Meng] clean up code for distance computation
    712cb88 [Xiangrui Meng] make Vectors.sparse Java friendly
    27858e4 [Xiangrui Meng] update breeze version to 0.7
    07c3cf2 [Xiangrui Meng] change Mahout to breeze in doc use a simple lower bound to avoid unnecessary distance computation
    6f5cdde [Xiangrui Meng] fix a bug in filtering finished runs
    42512f2 [Xiangrui Meng] Merge branch 'master' into sparse-kmeans
    d6e6c07 [Xiangrui Meng] add predict(RDD[Vector]) to KMeansModel
    42b4e50 [Xiangrui Meng] line feed at the end
    a4ace73 [Xiangrui Meng] Merge branch 'fast-dist' into sparse-kmeans
    3ed1a24 [Xiangrui Meng] add doc to BreezeVectorWithSquaredNorm
    0107e19 [Xiangrui Meng] update NOTICE
    87bc755 [Xiangrui Meng] tuned the KMeans code: changed some for loops to while, use view to avoid copying arrays
    0ff8046 [Xiangrui Meng] update KMeans to use fastSquaredDistance
    f355411 [Xiangrui Meng] add BreezeVectorWithSquaredNorm case class
    ab74f67 [Xiangrui Meng] add fastSquaredDistance for KMeans
    4e7d5ca [Xiangrui Meng] minor style update
    07ffaf2 [Xiangrui Meng] add dense/sparse vector data models and conversions to/from breeze vectors use breeze to implement KMeans in order to support both dense and sparse data
    80c29689
    History
    [SPARK-1212] Adding sparse data support and update KMeans
    Xiangrui Meng authored
    Continue our discussions from https://github.com/apache/incubator-spark/pull/575
    
    This PR is WIP because it depends on a SNAPSHOT version of breeze.
    
    Per previous discussions and benchmarks, I switched to breeze for linear algebra operations. @dlwh and I made some improvements to breeze to keep its performance comparable to the bare-bone implementation, including norm computation and squared distance. This is why this PR needs to depend on a SNAPSHOT version of breeze.
    
    @fommil , please find the notice of using netlib-core in `NOTICE`. This is following Apache's instructions on appropriate labeling.
    
    I'm going to update this PR to include:
    
    1. Fast distance computation: using `\|a\|_2^2 + \|b\|_2^2 - 2 a^T b` when it doesn't introduce too much numerical error. The squared norms are pre-computed. Otherwise, computing the distance between the center (dense) and a point (possibly sparse) always takes O(n) time.
    
    2. Some numbers about the performance.
    
    3. A released version of breeze. @dlwh, a minor release of breeze will help this PR get merged early. Do you mind sharing breeze's release plan? Thanks!
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #117 from mengxr/sparse-kmeans and squashes the following commits:
    
    67b368d [Xiangrui Meng] fix SparseVector.toArray
    5eda0de [Xiangrui Meng] update NOTICE
    67abe31 [Xiangrui Meng] move ArrayRDDs to mllib.rdd
    1da1033 [Xiangrui Meng] remove dependency on commons-math3 and compute EPSILON directly
    9bb1b31 [Xiangrui Meng] optimize SparseVector.toArray
    226d2cd [Xiangrui Meng] update Java friendly methods in Vectors
    238ba34 [Xiangrui Meng] add VectorRDDs with a converter from RDD[Array[Double]]
    b28ba2f [Xiangrui Meng] add toArray to Vector
    e69b10c [Xiangrui Meng] remove examples/JavaKMeans.java, which is replaced by mllib/examples/JavaKMeans.java
    72bde33 [Xiangrui Meng] clean up code for distance computation
    712cb88 [Xiangrui Meng] make Vectors.sparse Java friendly
    27858e4 [Xiangrui Meng] update breeze version to 0.7
    07c3cf2 [Xiangrui Meng] change Mahout to breeze in doc use a simple lower bound to avoid unnecessary distance computation
    6f5cdde [Xiangrui Meng] fix a bug in filtering finished runs
    42512f2 [Xiangrui Meng] Merge branch 'master' into sparse-kmeans
    d6e6c07 [Xiangrui Meng] add predict(RDD[Vector]) to KMeansModel
    42b4e50 [Xiangrui Meng] line feed at the end
    a4ace73 [Xiangrui Meng] Merge branch 'fast-dist' into sparse-kmeans
    3ed1a24 [Xiangrui Meng] add doc to BreezeVectorWithSquaredNorm
    0107e19 [Xiangrui Meng] update NOTICE
    87bc755 [Xiangrui Meng] tuned the KMeans code: changed some for loops to while, use view to avoid copying arrays
    0ff8046 [Xiangrui Meng] update KMeans to use fastSquaredDistance
    f355411 [Xiangrui Meng] add BreezeVectorWithSquaredNorm case class
    ab74f67 [Xiangrui Meng] add fastSquaredDistance for KMeans
    4e7d5ca [Xiangrui Meng] minor style update
    07ffaf2 [Xiangrui Meng] add dense/sparse vector data models and conversions to/from breeze vectors use breeze to implement KMeans in order to support both dense and sparse data