    [SPARK-5521] PCA wrapper for easy transform vectors · 8c07c75c
    Kirill A. Korinskiy authored
    This adds a simple PCA wrapper that makes it easy to apply a PCA transform to vectors, for example the features of a LabeledPoint or of another more complex structure.
    
    Example of usage:
    ```
      import org.apache.spark.mllib.regression.LinearRegressionWithSGD
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.feature.PCA
    
      val data = sc.textFile("data/mllib/ridge-data/lpsa.data").map { line =>
        val parts = line.split(',')
        LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
      }.cache()
    
      val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
      val training = splits(0).cache()
      val test = splits(1)
    
      val pca = new PCA(training.first().features.size / 2).fit(data.map(_.features))
      val training_pca = training.map(p => p.copy(features = pca.transform(p.features)))
      val test_pca = test.map(p => p.copy(features = pca.transform(p.features)))
    
      val numIterations = 100
      val model = LinearRegressionWithSGD.train(training, numIterations)
      val model_pca = LinearRegressionWithSGD.train(training_pca, numIterations)
    
      val valuesAndPreds = test.map { point =>
        val score = model.predict(point.features)
        (score, point.label)
      }
    
      val valuesAndPreds_pca = test_pca.map { point =>
        val score = model_pca.predict(point.features)
        (score, point.label)
      }
    
      val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
      val MSE_pca = valuesAndPreds_pca.map{case(v, p) => math.pow((v - p), 2)}.mean()
    
      println("Mean Squared Error = " + MSE)
      println("PCA Mean Squared Error = " + MSE_pca)
    ```
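
To make the `transform` step in the example more concrete, here is a minimal sketch in plain Scala (no Spark dependency) of projecting one feature vector onto a small set of component directions. It is an illustration only, not the code added by this commit: `ProjectionSketch`, `ProjectionSketchDemo`, and the hand-picked component matrix are hypothetical names and values, and the sketch assumes the components are already known rather than fitted from data.

```scala
// Hypothetical stand-in for a fitted PCA model: each inner array is one
// component direction in the original feature space (assumed, not learned).
final case class ProjectionSketch(components: Array[Array[Double]]) {
  require(components.nonEmpty, "need at least one component")

  // Project a vector onto each component with a dot product,
  // producing one output coordinate per component.
  def transform(v: Array[Double]): Array[Double] =
    components.map { c =>
      require(c.length == v.length, s"dimension mismatch: ${c.length} vs ${v.length}")
      c.zip(v).map { case (a, b) => a * b }.sum
    }
}

object ProjectionSketchDemo {
  def main(args: Array[String]): Unit = {
    // Two hand-picked orthonormal directions in 3-D.
    val model = ProjectionSketch(Array(
      Array(1.0, 0.0, 0.0),
      Array(0.0, 1.0, 0.0)
    ))
    // Reduce a 3-dimensional vector to 2 dimensions.
    val reduced = model.transform(Array(3.0, 4.0, 5.0))
    println(reduced.mkString(", ")) // prints "3.0, 4.0"
  }
}
```

In the usage example above, the same idea is applied per record: each `LabeledPoint` keeps its label and only its `features` vector is replaced by the lower-dimensional projection.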
    
    Author: Kirill A. Korinskiy <catap@catap.ru>
    Author: Joseph K. Bradley <joseph@databricks.com>
    
    Closes #4304 from catap/pca and squashes the following commits:
    
    501bcd9 [Joseph K. Bradley] Small updates: removed k from Java-friendly PCA fit().  In PCASuite, converted results to set for comparison. Added an error message for bad k in PCA.
    9dcc02b [Kirill A. Korinskiy] [SPARK-5521] fix scala style
    1892a06 [Kirill A. Korinskiy] [SPARK-5521] PCA wrapper for easy transform vectors