  1. May 12, 2015
  2. May 11, 2015
    • Joshi's avatar
      [SPARK-7435] [SPARKR] Make DataFrame.show() consistent with that of Scala and pySpark · b94a9337
      Joshi authored
      Author: Joshi <rekhajoshm@gmail.com>
      Author: Rekha Joshi <rekhajoshm@gmail.com>
      
      Closes #5989 from rekhajoshm/fix/SPARK-7435 and squashes the following commits:
      
      cfc9e02 [Joshi] Spark-7435[R]: updated patch for review comments
      62becc1 [Joshi] SPARK-7435: Update to DataFrame
      e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
      b94a9337
    • Reynold Xin's avatar
      [SPARK-7509][SQL] DataFrame.drop in Python for dropping columns. · 028ad4bd
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6068 from rxin/drop-column and squashes the following commits:
      
      9d7d5ec [Reynold Xin] [SPARK-7509][SQL] DataFrame.drop in Python for dropping columns.
      028ad4bd
    • Zhongshuai Pei's avatar
      [SPARK-7437] [SQL] Fold "literal in (item1, item2, ..., literal, ...)" into true or false directly · 4b5e1fe9
      Zhongshuai Pei authored
      SQL
      ```
      select key from src where 3 in (4, 5);
      ```
      Before
      ```
      == Optimized Logical Plan ==
      Project [key#12]
       Filter 3 INSET (5,4)
        MetastoreRelation default, src, None
      ```
      
      After
      ```
      == Optimized Logical Plan ==
      LocalRelation [key#228], []
      ```
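
      At its core, the optimization evaluates the membership test once at optimization time and replaces it with a boolean literal, which then lets the whole filter be folded away. A minimal sketch of the idea against Catalyst's expression API (the object name and the simplified null handling are illustrative, not the exact rule added by this patch):

      ```scala
      // Illustrative only: evaluate a `literal IN (literal, ...)` predicate during
      // optimization and replace it with a boolean literal.
      import org.apache.spark.sql.catalyst.expressions.{Expression, In, Literal}

      object FoldLiteralInPredicate {
        def fold(expr: Expression): Expression = expr match {
          case In(value: Literal, list) if list.forall(_.isInstanceOf[Literal]) =>
            Literal(list.map(_.asInstanceOf[Literal].value).contains(value.value))
          case other => other
        }
      }
      ```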
      
      Author: Zhongshuai Pei <799203320@qq.com>
      Author: DoingDone9 <799203320@qq.com>
      
      Closes #5972 from DoingDone9/InToFalse and squashes the following commits:
      
      4c722a2 [Zhongshuai Pei] Update predicates.scala
      abe2bbb [Zhongshuai Pei] Update Optimizer.scala
      fa461a5 [Zhongshuai Pei] Update Optimizer.scala
      e34c28a [Zhongshuai Pei] Update predicates.scala
      24739bd [Zhongshuai Pei] Update ConstantFoldingSuite.scala
      f4dbf50 [Zhongshuai Pei] Update ConstantFoldingSuite.scala
      35ceb7a [Zhongshuai Pei] Update Optimizer.scala
      36c194e [Zhongshuai Pei] Update Optimizer.scala
      2e8f6ca [Zhongshuai Pei] Update Optimizer.scala
      14952e2 [Zhongshuai Pei] Merge pull request #13 from apache/master
      f03fe7f [Zhongshuai Pei] Merge pull request #12 from apache/master
      f12fa50 [Zhongshuai Pei] Merge pull request #10 from apache/master
      f61210c [Zhongshuai Pei] Merge pull request #9 from apache/master
      34b1a9a [Zhongshuai Pei] Merge pull request #8 from apache/master
      802261c [DoingDone9] Merge pull request #7 from apache/master
      d00303b [DoingDone9] Merge pull request #6 from apache/master
      98b134f [DoingDone9] Merge pull request #5 from apache/master
      161cae3 [DoingDone9] Merge pull request #4 from apache/master
      c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
      cb1852d [DoingDone9] Merge pull request #2 from apache/master
      c3f046f [DoingDone9] Merge pull request #1 from apache/master
      4b5e1fe9
    • Cheng Hao's avatar
      [SPARK-7411] [SQL] Support SerDe for HiveQl in CTAS · e35d878b
      Cheng Hao authored
      This is a follow-up of #5876 and should be merged after #5876.
      
      Let's wait for the unit test results from Jenkins.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #5963 from chenghao-intel/useIsolatedClient and squashes the following commits:
      
      f87ace6 [Cheng Hao] remove the TODO and add `resolved condition` for HiveTable
      a8260e8 [Cheng Hao] Update code as feedback
      f4e243f [Cheng Hao] remove the serde setting for SequenceFile
      d166afa [Cheng Hao] style issue
      d25a4aa [Cheng Hao] Add SerDe support for CTAS
      e35d878b
    • Reynold Xin's avatar
      [SPARK-7324] [SQL] DataFrame.dropDuplicates · b6bf4f76
      Reynold Xin authored
      This should also close https://github.com/apache/spark/pull/5870
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6066 from rxin/dropDups and squashes the following commits:
      
      130692f [Reynold Xin] [SPARK-7324][SQL] DataFrame.dropDuplicates
      b6bf4f76
    • Tathagata Das's avatar
      [SPARK-7530] [STREAMING] Added StreamingContext.getState() to expose the... · f9c7580a
      Tathagata Das authored
      [SPARK-7530] [STREAMING] Added StreamingContext.getState() to expose the current state of the context
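
      For reference, a minimal usage sketch of the new accessor (assuming the accompanying StreamingContextState enum exposes INITIALIZED, ACTIVE and STOPPED, as the commits below suggest):

      ```scala
      // Sketch: inspect the context's lifecycle state before acting on it.
      import org.apache.spark.streaming.{StreamingContext, StreamingContextState}

      def startIfNeeded(ssc: StreamingContext): Unit = ssc.getState() match {
        case StreamingContextState.INITIALIZED => ssc.start()
        case StreamingContextState.ACTIVE      => // already running, nothing to do
        case StreamingContextState.STOPPED     =>
          throw new IllegalStateException("StreamingContext has already been stopped")
      }
      ```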
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6058 from tdas/SPARK-7530 and squashes the following commits:
      
      80ee0e6 [Tathagata Das] STARTED --> ACTIVE
      3da6547 [Tathagata Das] Added synchronized
      dd88444 [Tathagata Das] Added more docs
      e1a8505 [Tathagata Das] Fixed comment length
      89f9980 [Tathagata Das] Change to Java enum and added Java test
      7c57351 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7530
      dd4e702 [Tathagata Das] Addressed comments.
      3d56106 [Tathagata Das] Added Mima excludes
      2b86ba1 [Tathagata Das] Added scala docs.
      1722433 [Tathagata Das] Fixed style
      976b094 [Tathagata Das] Added license
      0585130 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7530
      e0f0a05 [Tathagata Das] Added getState and exposed StreamingContextState
      f9c7580a
    • Xusen Yin's avatar
      [SPARK-5893] [ML] Add bucketizer · 35fb42a0
      Xusen Yin authored
      JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893).
      
      One thing to make clear: the `buckets` parameter, which is an array of `Double`, serves as the split points. Say,
      
      ```scala
      buckets = Array(-0.5, 0.0, 0.5)
      ```
      
      splits the real number line into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which are encoded as 0, 1, 2, 3.
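
      For reference, a rough usage sketch of the resulting transformer, using the ml.feature API with explicit infinite endpoints (the parameter ended up being called `splits`, per the commits below; the DataFrame `df` and column names are assumed):

      ```scala
      // Sketch: bucketize a numeric column into 4 buckets using the split points above.
      // `df` is an assumed DataFrame with a numeric "value" column.
      import org.apache.spark.ml.feature.Bucketizer

      val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)
      val bucketizer = new Bucketizer()
        .setInputCol("value")
        .setOutputCol("bucket")
        .setSplits(splits)
      val bucketed = bucketizer.transform(df)   // "bucket" holds 0.0, 1.0, 2.0 or 3.0
      ```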
      
      Author: Xusen Yin <yinxusen@gmail.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5980 from yinxusen/SPARK-5893 and squashes the following commits:
      
      dc8c843 [Xusen Yin] Merge pull request #4 from jkbradley/yinxusen-SPARK-5893
      1ca973a [Joseph K. Bradley] one more bucketizer test
      34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead.
      eacfcfa [Xusen Yin] change ML attribute from splits into buckets
      c3cc770 [Xusen Yin] add more unit test for binary search
      3a16cc2 [Xusen Yin] refine comments and names
      ac77859 [Xusen Yin] fix style error
      fb30d79 [Xusen Yin] fix and test binary search
      2466322 [Xusen Yin] refactor Bucketizer
      11fb00a [Xusen Yin] change it into an Estimator
      998bc87 [Xusen Yin] check buckets
      4024cf1 [Xusen Yin] add test suite
      5fe190e [Xusen Yin] add bucketizer
      35fb42a0
    • Reynold Xin's avatar
      Updated DataFrame.saveAsTable Hive warning to include SPARK-7550 ticket. · 87229c95
      Reynold Xin authored
      So users that are interested in this can track it easily.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6067 from rxin/SPARK-7550 and squashes the following commits:
      
      ee0e34c [Reynold Xin] Updated DataFrame.saveAsTable Hive warning to include SPARK-7550 ticket.
      87229c95
    • Reynold Xin's avatar
      [SPARK-7462][SQL] Update documentation for retaining grouping columns in DataFrames. · 3a9b6997
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6062 from rxin/agg-retain-doc and squashes the following commits:
      
      43e511e [Reynold Xin] [SPARK-7462][SQL] Update documentation for retaining grouping columns in DataFrames.
      3a9b6997
    • madhukar's avatar
      [SPARK-7084] improve saveAsTable documentation · 57255dcd
      madhukar authored
      Author: madhukar <phatak.dev@gmail.com>
      
      Closes #5654 from phatak-dev/master and squashes the following commits:
      
      386f407 [madhukar] #5654 updated for all the methods
      2c997c5 [madhukar] Merge branch 'master' of https://github.com/apache/spark
      00bc819 [madhukar] Merge branch 'master' of https://github.com/apache/spark
      2a802c6 [madhukar] #5654 updated the doc according to comments
      866e8df [madhukar] [SPARK-7084] improve saveAsTable documentation
      57255dcd
    • Reynold Xin's avatar
      [SQL] Show better error messages for incorrect join types in DataFrames. · 4f4dbb03
      Reynold Xin authored
      As a follow-up to https://github.com/apache/spark/pull/5944
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6064 from rxin/jointype-better-error and squashes the following commits:
      
      7629bf7 [Reynold Xin] [SQL] Show better error messages for incorrect join types in DataFrames.
      4f4dbb03
    • Sean Owen's avatar
      [MINOR] [DOCS] Fix the link to test building info on the wiki · 91dc3dfd
      Sean Owen authored
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #6063 from srowen/FixRunningTestsLink and squashes the following commits:
      
      db62018 [Sean Owen] Fix the link to test building info on the wiki
      91dc3dfd
    • LCY Vincent's avatar
      Update Documentation: leftsemi instead of semijoin · a8ea0968
      LCY Vincent authored
      should sync up with here?
      https://github.com/apache/spark/blob/119f45d61d7b48d376cca05e1b4f0c7fcf65bfa8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala#L26
      
      Author: LCY Vincent <lauchunyin@gmail.com>
      
      Closes #5944 from vincentlaucy/master and squashes the following commits:
      
      fc0e454 [LCY Vincent] Update DataFrame.scala
      a8ea0968
    • jerryshao's avatar
      [STREAMING] [MINOR] Close files correctly when iterator is finished in streaming WAL recovery · 25c01c54
      jerryshao authored
      Currently there's no way to close the file correctly after the iteration is finished; change to `CompletionIterator` to avoid resource leakage.
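
      The pattern is to wrap the record iterator in Spark's internal CompletionIterator, which runs a cleanup callback once the wrapped iterator is exhausted. A simplified sketch of the idea (the reader and stream names are illustrative, not the actual WAL reader code):

      ```scala
      // Sketch: close the underlying stream once iteration over its records completes.
      // CompletionIterator lives in org.apache.spark.util and is Spark-internal.
      import java.io.InputStream
      import org.apache.spark.util.CompletionIterator

      def recordsWithCleanup(in: InputStream,
                             readAll: InputStream => Iterator[Array[Byte]]): Iterator[Array[Byte]] = {
        val underlying = readAll(in)
        CompletionIterator[Array[Byte], Iterator[Array[Byte]]](underlying, in.close())
      }
      ```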
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #6050 from jerryshao/close-file-correctly and squashes the following commits:
      
      52dfaf5 [jerryshao] Close files correctly when iterator is finished
      25c01c54
    • gchen's avatar
      [SPARK-7516] [Minor] [DOC] Replace depreciated inferSchema() with createDataFrame() · 8e674331
      gchen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7516
      
      In the sql-programming-guide, the deprecated Python DataFrame API inferSchema() should be replaced by createDataFrame():
      
      schemaPeople = sqlContext.inferSchema(people) ->
      schemaPeople = sqlContext.createDataFrame(people)
      
      Author: gchen <chenguancheng@gmail.com>
      
      Closes #6041 from gchen/python-docs and squashes the following commits:
      
      c27eb7c [gchen] replace inferSchema() with createDataFrame()
      8e674331
    • Kousuke Saruta's avatar
      [SPARK-7515] [DOC] Update documentation for PySpark on YARN with cluster mode · 6e9910c2
      Kousuke Saruta authored
      Now that PySpark on YARN with cluster mode is supported, let's update the doc.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #6040 from sarutak/update-doc-for-pyspark-on-yarn and squashes the following commits:
      
      ad9f88c [Kousuke Saruta] Brushed up sentences
      469fd2e [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into update-doc-for-pyspark-on-yarn
      fcfdb92 [Kousuke Saruta] Updated doc for PySpark on YARN with cluster mode
      6e9910c2
    • Steve Loughran's avatar
      [SPARK-7508] JettyUtils-generated servlets to log & report all errors · 7ce2a33c
      Steve Loughran authored
      Patch for SPARK-7508
      
      This logs a warning, then generates a response that includes the message body and stack trace as text/plain, no-cache. The status code is 500.
      
      In practice (in some tests in SPARK-1537 to be precise), Jetty is getting in between this servlet and the web response the user sees: the body of the response is lost for any error response (500, and even 404 and bad request). The standard Jetty handlers must be getting in the way.
      
      This patch doesn't address that; it ensures that
      1. if the Jetty handlers were put to one side, users would see the errors
      2. at least the exceptions appear in the server-side logs.
      
      This is better than users saying "I saw a 500 error" while you have nothing in the logs to see what went wrong.
      
      Author: Steve Loughran <stevel@hortonworks.com>
      
      Closes #6033 from steveloughran/stevel/feature/SPARK-7508-JettyUtils and squashes the following commits:
      
      584836f [Steve Loughran] SPARK-7508 drop trailing semicolon
      ad6f185 [Steve Loughran] SPARK-7508: jetty handles exception reporting itself; spark just sets this up and logs exceptions before being relayed
      258d9f9 [Steve Loughran] SPARK-7508 fix typo manually-edited before patch pushed
      69c8263 [Steve Loughran] SPARK-7508 JettyUtils-generated servlets to log & report all errors
      7ce2a33c
    • Sandy Ryza's avatar
      [SPARK-6470] [YARN] Add support for YARN node labels. · 82fee9d9
      Sandy Ryza authored
      This is difficult to write a test for because it relies on the latest version of YARN, but I verified manually that the patch does pass along the label expression on this version and containers are successfully launched.
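
      For illustration, the feature is driven by configuration; a hedged sketch of how it might be set from application code, assuming the property introduced here is named spark.yarn.executor.nodeLabelExpression:

      ```scala
      // Sketch: ask YARN to place executors only on nodes carrying the "gpu" label.
      // The property name is an assumption based on this patch's description.
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .setAppName("labeled-executors")
        .set("spark.yarn.executor.nodeLabelExpression", "gpu")
      ```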
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #5242 from sryza/sandy-spark-6470 and squashes the following commits:
      
      6af87b9 [Sandy Ryza] Change info to warning
      6e22d99 [Sandy Ryza] [YARN] SPARK-6470.  Add support for YARN node labels.
      82fee9d9
    • Reynold Xin's avatar
      [SPARK-7462] By default retain group by columns in aggregate · 0a4844f9
      Reynold Xin authored
      Updated Java, Scala, Python, and R.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #5996 from rxin/groupby-retain and squashes the following commits:
      
      aac7119 [Reynold Xin] Merge branch 'groupby-retain' of github.com:rxin/spark into groupby-retain
      f6858f6 [Reynold Xin] Merge branch 'master' into groupby-retain
      5f923c0 [Reynold Xin] Merge pull request #15 from shivaram/sparkr-groupby-retrain
      c1de670 [Shivaram Venkataraman] Revert workaround in SparkR to retain grouped cols Based on reverting code added in commit https://github.com/amplab-extras/spark/commit/9a6be746efc9fafad88122fa2267862ef87aa0e1
      b8b87e1 [Reynold Xin] Fixed DataFrameJoinSuite.
      d910141 [Reynold Xin] Updated rest of the files
      1e6e666 [Reynold Xin] [SPARK-7462] By default retain group by columns in aggregate
      0a4844f9
    • Tathagata Das's avatar
      [SPARK-7361] [STREAMING] Throw unambiguous exception when attempting to start... · 1b465569
      Tathagata Das authored
      [SPARK-7361] [STREAMING] Throw unambiguous exception when attempting to start multiple StreamingContexts in the same JVM
      
      Currently, attempting to start a StreamingContext while another one is running throws a confusing exception saying that the actor name JobScheduler is already registered. Instead, it's best to throw a proper exception, as this is not supported.
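
      A short sketch of the scenario this changes (the exact exception type is not spelled out here, so the comments only state the intent):

      ```scala
      // Sketch: starting a second StreamingContext in the same JVM should now fail
      // fast with a clear message instead of an obscure "name already registered" error.
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val conf = new SparkConf().setMaster("local[2]").setAppName("demo")
      val ssc1 = new StreamingContext(conf, Seconds(1))
      // ... define ssc1's DStreams, then:
      ssc1.start()

      val ssc2 = new StreamingContext(conf, Seconds(1))
      ssc2.start()   // expected: an unambiguous "multiple StreamingContexts" exception
      ```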
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #5907 from tdas/SPARK-7361 and squashes the following commits:
      
      fb81c4a [Tathagata Das] Fix typo
      a9cd5bb [Tathagata Das] Added startSite to StreamingContext
      5fdfc0d [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7361
      5870e2b [Tathagata Das] Added check for multiple streaming contexts
      1b465569
    • Bryan Cutler's avatar
      [SPARK-7522] [EXAMPLES] Removed angle brackets from dataFormat option · 4f8a1551
      Bryan Cutler authored
      As is, to specify this option on the command line, you have to escape the angle brackets.
      
      Author: Bryan Cutler <bjcutler@us.ibm.com>
      
      Closes #6049 from BryanCutler/dataFormat-option-7522 and squashes the following commits:
      
      b34afb4 [Bryan Cutler] [SPARK-7522] Removed angle brackets from dataFormat option
      4f8a1551
    • Yanbo Liang's avatar
      [SPARK-6092] [MLLIB] Add RankingMetrics in PySpark/MLlib · 042dda3c
      Yanbo Liang authored
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #6044 from yanboliang/spark-6092 and squashes the following commits:
      
      726a9b1 [Yanbo Liang] add newRankingMetrics
      33f649c [Yanbo Liang] Add RankingMetrics in PySpark/MLlib
      042dda3c
    • Wesley Miao's avatar
      [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time · d70a0768
      Wesley Miao authored
      tdas
      
      https://issues.apache.org/jira/browse/SPARK-7326
      
      The problem most likely resides in the DStream.slice() implementation, as shown below.
      
      ```scala
      def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
        if (!isInitialized) {
          throw new SparkException(this + " has not been initialized")
        }
        if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
          logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration ("
            + slideDuration + ")")
        }
        if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
          logWarning("toTime (" + fromTime + ") is not a multiple of slideDuration ("
            + slideDuration + ")")
        }
        val alignedToTime = toTime.floor(slideDuration, zeroTime)
        val alignedFromTime = fromTime.floor(slideDuration, zeroTime)

        logInfo("Slicing from " + fromTime + " to " + toTime +
          " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")

        alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
          if (time >= zeroTime) getOrCompute(time) else None
        })
      }
      ```
      
      Here, after performing floor() on both fromTime and toTime, the results (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be multiples of the slideDuration, which makes the isTimeValid() check fail for all the remaining computation.
      
      The fix is to add a new floor() function in Time.scala that respects the zeroTime while performing the floor:
      
      ```scala
      def floor(that: Duration, zeroTime: Time): Time = {
        val t = that.milliseconds
        new Time(((this.millis - zeroTime.milliseconds) / t) * t + zeroTime.milliseconds)
      }
      ```
      
      Then change DStream.slice to call this new floor function, passing in its zeroTime.
      
      ```scala
      val alignedToTime = toTime.floor(slideDuration, zeroTime)
      val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
      ```
      
      This way alignedToTime and alignedFromTime are *really* aligned with respect to zeroTime, whose value is not necessarily 0.
      
      Author: Wesley Miao <wesley.miao@gmail.com>
      Author: Wesley <wesley.miao@autodesk.com>
      
      Closes #5871 from wesleymiao/spark-7326 and squashes the following commits:
      
      82a4d8c [Wesley Miao] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream dosen't work all the time
      48b4dc0 [Wesley] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time
      6ade399 [Wesley] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time
      2611745 [Wesley Miao] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time
      d70a0768
    • tianyi's avatar
      [SPARK-7519] [SQL] fix minor bugs in thrift server UI · 2242ab31
      tianyi authored
      Bug description:
      
      1. There are extra commas at the top of the session list.
      2. The time format in the "Start at:" field is not the same as the others.
      3. The total number of online sessions is wrong.
      
      Author: tianyi <tianyi.asiainfo@gmail.com>
      
      Closes #6048 from tianyi/SPARK-7519 and squashes the following commits:
      
      ed366b7 [tianyi] fix bug
      2242ab31
  3. May 10, 2015
    • Shivaram Venkataraman's avatar
      [SPARK-7512] [SPARKR] Fix RDD's show method to use getJRDD · 0835f1ed
      Shivaram Venkataraman authored
      Since the RDD object might be a Pipelined RDD, we should use `getJRDD` to get the right handle to the Java object.
      
      Fixes the bug reported at
      http://stackoverflow.com/questions/30057702/sparkr-filterrdd-and-flatmap-not-working
      
      cc concretevitamin
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #6035 from shivaram/sparkr-show-bug and squashes the following commits:
      
      d70145c [Shivaram Venkataraman] Fix RDD's show method to use getJRDD Fixes the bug reported at http://stackoverflow.com/questions/30057702/sparkr-filterrdd-and-flatmap-not-working
      0835f1ed
    • Glenn Weidner's avatar
      [SPARK-7427] [PYSPARK] Make sharedParams match in Scala, Python · c5aca0c2
      Glenn Weidner authored
      Modified 2 files:
      python/pyspark/ml/param/_shared_params_code_gen.py
      python/pyspark/ml/param/shared.py
      
      Generated shared.py on Linux using Python 2.6.6 on Redhat Enterprise Linux Server 6.6.
      python _shared_params_code_gen.py > shared.py
      
      Only changed maxIter, regParam, and rawPredictionCol based on strings from SharedParamsCodeGen.scala.  Note that a warning was displayed when committing shared.py:
      warning: LF will be replaced by CRLF in python/pyspark/ml/param/shared.py.
      
      Author: Glenn Weidner <gweidner@us.ibm.com>
      
      Closes #6023 from gweidner/br-7427 and squashes the following commits:
      
      db72e32 [Glenn Weidner] [SPARK-7427] [PySpark] Make sharedParams match in Scala, Python
      825e4a9 [Glenn Weidner] [SPARK-7427] [PySpark] Make sharedParams match in Scala, Python
      e6a865e [Glenn Weidner] [SPARK-7427] [PySpark] Make sharedParams match in Scala, Python
      1eee702 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      1ac10e5 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      cafd104 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      9bea1eb [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      4a35c20 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      9790cbe [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      d9c30f4 [Glenn Weidner] [SPARK-7275] [SQL] [WIP] Make LogicalRelation public
      c5aca0c2
    • Kirill A. Korinskiy's avatar
      [SPARK-5521] PCA wrapper for easy transform vectors · 8c07c75c
      Kirill A. Korinskiy authored
      I implemented a simple PCA wrapper to make it easy to transform vectors with PCA, for example within a LabeledPoint or another more complicated structure.
      
      Example of usage:
      ```
        import org.apache.spark.mllib.regression.LinearRegressionWithSGD
        import org.apache.spark.mllib.regression.LabeledPoint
        import org.apache.spark.mllib.linalg.Vectors
        import org.apache.spark.mllib.feature.PCA
      
        val data = sc.textFile("data/mllib/ridge-data/lpsa.data").map { line =>
          val parts = line.split(',')
          LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
        }.cache()
      
        val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
        val training = splits(0).cache()
        val test = splits(1)
      
        val pca = PCA.create(training.first().features.size/2, data.map(_.features))
        val training_pca = training.map(p => p.copy(features = pca.transform(p.features)))
        val test_pca = test.map(p => p.copy(features = pca.transform(p.features)))
      
        val numIterations = 100
        val model = LinearRegressionWithSGD.train(training, numIterations)
        val model_pca = LinearRegressionWithSGD.train(training_pca, numIterations)
      
        val valuesAndPreds = test.map { point =>
          val score = model.predict(point.features)
          (score, point.label)
        }
      
        val valuesAndPreds_pca = test_pca.map { point =>
          val score = model_pca.predict(point.features)
          (score, point.label)
        }
      
        val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
        val MSE_pca = valuesAndPreds_pca.map{case(v, p) => math.pow((v - p), 2)}.mean()
      
        println("Mean Squared Error = " + MSE)
        println("PCA Mean Squared Error = " + MSE_pca)
      ```
      
      Author: Kirill A. Korinskiy <catap@catap.ru>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4304 from catap/pca and squashes the following commits:
      
      501bcd9 [Joseph K. Bradley] Small updates: removed k from Java-friendly PCA fit().  In PCASuite, converted results to set for comparison. Added an error message for bad k in PCA.
      9dcc02b [Kirill A. Korinskiy] [SPARK-5521] fix scala style
      1892a06 [Kirill A. Korinskiy] [SPARK-5521] PCA wrapper for easy transform vectors
      8c07c75c
    • Joseph K. Bradley's avatar
      [SPARK-7431] [ML] [PYTHON] Made CrossValidatorModel call parent init in PySpark · 3038443e
      Joseph K. Bradley authored
      Fixes a bug where the PySpark cvModel did not have a UID.
      Also made small PySpark fixes: Evaluator should inherit from Params, and MockModel should inherit from Model.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5968 from jkbradley/pyspark-cv-uid and squashes the following commits:
      
      57f13cd [Joseph K. Bradley] Made CrossValidatorModel call parent init in PySpark
      3038443e
    • Cheng Lian's avatar
      [MINOR] [SQL] Fixes variable name typo · 6bf9352f
      Cheng Lian authored
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6038 from liancheng/fix-typo and squashes the following commits:
      
      572c2a4 [Cheng Lian] Fixes variable name typo
      6bf9352f
    • Oleg Sidorkin's avatar
      [SPARK-7345][SQL] Spark cannot detect renamed columns using JDBC connector · d7a37bca
      Oleg Sidorkin authored
      The issue appears when one tries to create a DataFrame using a sqlContext.load("jdbc"...) statement where "dbtable" contains a query with renamed columns.
      If the original column is used in the SQL query once, the resulting DataFrame will contain the non-renamed column.
      If the original column is used in the SQL query several times with different aliases, sqlContext.load will fail.
      The original implementation of JDBCRDD.resolveTable uses getColumnName to detect column names in the RDD schema.
      The suggested implementation uses getColumnLabel, which is aware of the SQL "AS" clause and therefore handles column renames in the SQL statement.
      
      Readings:
      http://stackoverflow.com/questions/4271152/getcolumnlabel-vs-getcolumnname
      http://stackoverflow.com/questions/12259829/jdbc-getcolumnname-getcolumnlabel-db2
      
      The official documentation is unfortunately a bit misleading in its definition of the "suggested title" purpose, but it clearly defines the behavior of the AS keyword in an SQL statement.
      http://docs.oracle.com/javase/7/docs/api/java/sql/ResultSetMetaData.html
      getColumnLabel - Gets the designated column's suggested title for use in printouts and displays. The suggested title is usually specified by the SQL AS clause. If a SQL AS is not specified, the value returned from getColumnLabel will be the same as the value returned by the getColumnName method.
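
      The distinction is visible with plain JDBC metadata; a small illustrative snippet (connection URL, table, and column names are hypothetical):

      ```scala
      // Sketch: getColumnName returns the underlying column name, while
      // getColumnLabel honors the SQL AS alias (falling back to the name otherwise).
      import java.sql.DriverManager

      val conn = DriverManager.getConnection("jdbc:h2:mem:test")   // hypothetical URL
      val rs = conn.createStatement().executeQuery("SELECT id AS renamed_id FROM people")
      val md = rs.getMetaData
      println(md.getColumnName(1))    // "ID" -- the original column
      println(md.getColumnLabel(1))   // "RENAMED_ID" -- respects the alias
      ```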
      
      Author: Oleg Sidorkin <oleg.sidorkin@gmail.com>
      
      Closes #6032 from osidorkin/master and squashes the following commits:
      
      10fc44b [Oleg Sidorkin] [SPARK-7345][SQL] Regression test for JDBCSuite (resolved scala style test error)
      2aaf6f7 [Oleg Sidorkin] [SPARK-7345][SQL] Regression test for JDBCSuite (renamed fields in JDBC query)
      b7d5b22 [Oleg Sidorkin] [SPARK-7345][SQL] Regression test for JDBCSuite
      09559a0 [Oleg Sidorkin] [SPARK-7345][SQL] Spark cannot detect renamed columns using JDBC connector
      d7a37bca
    • Yanbo Liang's avatar
      [SPARK-6091] [MLLIB] Add MulticlassMetrics in PySpark/MLlib · bf7e81a5
      Yanbo Liang authored
      https://issues.apache.org/jira/browse/SPARK-6091
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #6011 from yanboliang/spark-6091 and squashes the following commits:
      
      bb3e4ba [Yanbo Liang] trigger jenkins
      53c045d [Yanbo Liang] keep compatibility for python 2.6
      972d5ac [Yanbo Liang] Add MulticlassMetrics in PySpark/MLlib
      bf7e81a5
  4. May 09, 2015
    • Yuhao Yang's avatar
      [SPARK-7475] [MLLIB] adjust ldaExample for online LDA · b13162b3
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-7475
      
      Add a new argument to specify the optimization algorithm applied to LDA, to demonstrate the basic usage of LDAOptimizer.
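
      For context, choosing the optimizer on the MLlib LDA estimator looks roughly like this (a sketch; `corpus` is an assumed RDD[(Long, Vector)] of (docId, termCounts) pairs):

      ```scala
      // Sketch: switch between the EM and online optimizers when running LDA.
      import org.apache.spark.mllib.clustering.LDA

      val ldaModel = new LDA()
        .setK(10)
        .setOptimizer("online")   // or "em"
        .run(corpus)              // `corpus` prepared as in the example
      ```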
      
      cc jkbradley
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #6000 from hhbyyh/ldaExample and squashes the following commits:
      
      0a7e2bc [Yuhao Yang] fix according to comments
      5810b0f [Yuhao Yang] adjust ldaExample for online LDA
      b13162b3
    • tedyu's avatar
      [BUILD] Reference fasterxml.jackson.version in sql/core/pom.xml · bd74301f
      tedyu authored
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #6031 from tedyu/master and squashes the following commits:
      
      5c2580c [tedyu] Reference fasterxml.jackson.version in sql/core/pom.xml
      ff2a44f [tedyu] Merge branch 'master' of github.com:apache/spark
      28c8394 [tedyu] Upgrade version of jackson-databind in sql/core/pom.xml
      bd74301f
    • tedyu's avatar
      Upgrade version of jackson-databind in sql/core/pom.xml · 3071aac3
      tedyu authored
      Currently the version of jackson-databind in sql/core/pom.xml is 2.3.0.
      
      This is older than the version specified in the root pom.xml.
      
      This PR upgrades the version in sql/core/pom.xml so that they're consistent.
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #6028 from tedyu/master and squashes the following commits:
      
      28c8394 [tedyu] Upgrade version of jackson-databind in sql/core/pom.xml
      3071aac3
    • dobashim's avatar
      [STREAMING] [DOCS] Fix wrong url about API docs of StreamingListener · 7d0f1720
      dobashim authored
      A small fix for a wrong URL in the API documentation (org.apache.spark.streaming.scheduler.StreamingListener).
      
      Author: dobashim <dobashim@oss.nttdata.co.jp>
      
      Closes #6024 from dobashim/master and squashes the following commits:
      
      ac9a955 [dobashim] [STREAMING][DOCS] Fix wrong url about API docs of StreamingListener
      7d0f1720
    • Kousuke Saruta's avatar
      [SPARK-7403] [WEBUI] Link URL in objects on Timeline View is wrong in case of running on YARN · 12b95abc
      Kousuke Saruta authored
      When we use Spark on YARN and access the AllJobPage via the ResourceManager's proxy, the link URL in the objects that represent each job on the timeline view is wrong.
      
      In timeline-view.js, the link is generated as follows.
      ```
      window.location.href = "job/?id=" + getJobId(this);
      ```
      
      This assumes the URL displayed in the web browser ends with "jobs/", but when we access the AllJobPage via the proxy, the URL displayed does not end with "jobs/".
      
      The proxy doesn't return status code 301 or 302, so the URL displayed still indicates the base URL, not "/jobs", even when displaying the AllJobPage.
      
      ![2015-05-07 3 34 37](https://cloud.githubusercontent.com/assets/4736016/7501079/a8507ad6-f46c-11e4-9bed-62abea170f4c.png)
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #5947 from sarutak/fix-link-in-timeline and squashes the following commits:
      
      aaf40e1 [Kousuke Saruta] Added Copyright for vis.js
      01bee7b [Kousuke Saruta] Fixed timeline-view.js in order to get correct href
      12b95abc