  1. Apr 19, 2014
    • Reynold Xin's avatar
      README update · 28238c81
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #443 from rxin/readme and squashes the following commits:
      
      16853de [Reynold Xin] Updated SBT and Scala instructions.
      3ac3ceb [Reynold Xin] README update
      28238c81
  2. Apr 18, 2014
  3. Apr 17, 2014
    • Patrick Wendell's avatar
      HOTFIX: Ignore streaming UI test · 7863ecca
      Patrick Wendell authored
      This is currently causing many builds to hang.
      
      https://issues.apache.org/jira/browse/SPARK-1530
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #440 from pwendell/uitest-fix and squashes the following commits:
      
      9a143dc [Patrick Wendell] Ignore streaming UI test
      7863ecca
    • Patrick Wendell's avatar
      FIX: Don't build Hive in assembly unless running Hive tests. · 6c746ba3
      Patrick Wendell authored
      This will make the tests more stable when not running SQL tests.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #439 from pwendell/hive-tests and squashes the following commits:
      
      88a6032 [Patrick Wendell] FIX: Don't build Hive in assembly unless running Hive tests.
      6c746ba3
    • Thomas Graves's avatar
      SPARK-1408 Modify Spark on Yarn to point to the history server when app ... · 0058b5d2
      Thomas Graves authored
      ...finishes
      
      Note this is dependent on https://github.com/apache/spark/pull/204 to have a working history server, but there are no code dependencies.
      
      This also fixes SPARK-1288 (yarn stable finishApplicationMaster incomplete). Since I was in there, I also made the diagnostic message get passed through properly.
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #362 from tgravescs/SPARK-1408 and squashes the following commits:
      
      ec89705 [Thomas Graves] Fix typo.
      446122d [Thomas Graves] Make config yarn specific
      f5d5373 [Thomas Graves] SPARK-1408 Modify Spark on Yarn to point to the history server when app finishes
      0058b5d2
    • Marcelo Vanzin's avatar
      [SPARK-1395] Allow "local:" URIs to work on Yarn. · 69047506
      Marcelo Vanzin authored
      This only works for the three paths defined in the environment
      (SPARK_JAR, SPARK_YARN_APP_JAR and SPARK_LOG4J_CONF).
      
      Tested by running SparkPi with local: and file: URIs against a Yarn cluster (no "upload" shows up in the logs in the local case).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #303 from vanzin/yarn-local and squashes the following commits:
      
      82219c1 [Marcelo Vanzin] [SPARK-1395] Allow "local:" URIs to work on Yarn.
      69047506
  4. Apr 16, 2014
    • AbhishekKr's avatar
      [python alternative] pyspark require Python2, failing if system default is Py3 from shell.py · bb76eae1
      AbhishekKr authored
      Python alternative for https://github.com/apache/spark/pull/392; managed from shell.py
      
      Author: AbhishekKr <abhikumar163@gmail.com>
      
      Closes #399 from abhishekkr/pyspark_shell and squashes the following commits:
      
      134bdc9 [AbhishekKr] pyspark require Python2, failing if system default is Py3 from shell.py
      bb76eae1
    • Sandeep's avatar
      SPARK-1462: Examples of ML algorithms are using deprecated APIs · 6ad4c549
      Sandeep authored
      This will also fix SPARK-1464: Update MLLib Examples to Use Breeze.
      
      Author: Sandeep <sandeep@techaddict.me>
      
      Closes #416 from techaddict/1462 and squashes the following commits:
      
      a43638e [Sandeep] Some Style Changes
      3ce69c3 [Sandeep] Fix Ordering and Naming of Imports in Examples
      6c7e543 [Sandeep] SPARK-1462: Examples of ML algorithms are using deprecated APIs
      6ad4c549
    • Michael Armbrust's avatar
      Include stack trace for exceptions thrown by user code. · d4916a8e
      Michael Armbrust authored
      It is very confusing when your code throws an exception but the only stack trace shown is from the DAGScheduler. This is a simple patch to include the stack trace for the actual failure in the error message. Suggestions on formatting are welcome.
      
      Before:
      ```
      scala> sc.parallelize(1 :: Nil).map(_ => sys.error("Ahh!")).collect()
      org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:3 failed 1 times (most recent failure: Exception failure in TID 3 on host localhost: java.lang.RuntimeException: Ahh!)
      	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1055)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1039)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1037)
      ...
      ```
      
      After:
      ```
      org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:3 failed 1 times, most recent failure: Exception failure in TID 3 on host localhost: java.lang.RuntimeException: Ahh!
              scala.sys.package$.error(package.scala:27)
              $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13)
              $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13)
              scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
              scala.collection.Iterator$class.foreach(Iterator.scala:727)
              scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
              scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
              scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
              scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
              scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
              scala.collection.AbstractIterator.to(Iterator.scala:1157)
              scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
              scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
              scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
              scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
              org.apache.spark.rdd.RDD$$anonfun$6.apply(RDD.scala:676)
              org.apache.spark.rdd.RDD$$anonfun$6.apply(RDD.scala:676)
              org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1048)
              org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1048)
              org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:110)
              org.apache.spark.scheduler.Task.run(Task.scala:50)
              org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
              org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:46)
              org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
              java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              java.lang.Thread.run(Thread.java:744)
      Driver stacktrace:
      	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1055)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1039)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1037)
      	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1037)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:614)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:614)
      	at scala.Option.foreach(Option.scala:236)
      	at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:614)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:143)
      	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
      	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
      	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
      	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
      	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
      	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
      	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
      	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
      	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
      ```
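
      The idea can be sketched as folding the executor-side stack trace into the failure message that the driver-side exception carries (a minimal, hypothetical helper for illustration, not the actual patch):

      ```
      // Hypothetical sketch: render the task-side stack trace, indented like the
      // "After" output above, and append it to the failure summary.
      def failureMessage(summary: String, cause: Throwable): String = {
        val trace = cause.getStackTrace.map("        " + _).mkString("\n")
        s"$summary: $cause\n$trace"
      }
      ```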
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #409 from marmbrus/stacktraces and squashes the following commits:
      
      3e4eb65 [Michael Armbrust] indent. include header for driver stack trace.
      018b06b [Michael Armbrust] Include stack trace for exceptions in user code.
      d4916a8e
    • baishuo(白硕)'s avatar
      Update ReducedWindowedDStream.scala · 07b7ad30
      baishuo(白硕) authored
      change _slideDuration to _windowDuration
      
      Author: baishuo(白硕) <vc_java@hotmail.com>
      
      Closes #425 from baishuo/master and squashes the following commits:
      
      6f09ea1 [baishuo(白硕)] Update ReducedWindowedDStream.scala
      07b7ad30
    • Chen Chao's avatar
      misleading task number of groupByKey · 9c40b9ea
      Chen Chao authored
      "By default, this uses only 8 parallel tasks to do the grouping." is a big misleading. Please refer to https://github.com/apache/spark/pull/389
      
      The details are in the following code:
      
        def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
          val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
          for (r <- bySize if r.partitioner.isDefined) {
            return r.partitioner.get
          }
          if (rdd.context.conf.contains("spark.default.parallelism")) {
            new HashPartitioner(rdd.context.defaultParallelism)
          } else {
            new HashPartitioner(bySize.head.partitions.size)
          }
        }
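
      When the default is not appropriate, the caller can bypass defaultPartitioner entirely by passing an explicit partition count (a minimal usage sketch, assuming an existing SparkContext named sc):

      ```
      import org.apache.spark.SparkContext._

      // Usage sketch: an explicit number of partitions overrides the default above.
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      val grouped = pairs.groupByKey(16)   // 16 partitions instead of the default
      ```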
      
      Author: Chen Chao <crazyjvm@gmail.com>
      
      Closes #403 from CrazyJvm/patch-4 and squashes the following commits:
      
      42f6c9e [Chen Chao] fix format
      829a995 [Chen Chao] fix format
      1568336 [Chen Chao] misleading task number of groupByKey
      9c40b9ea
    • Kan Zhang's avatar
      Fixing a race condition in event listener unit test · 38877ccf
      Kan Zhang authored
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #401 from kanzhang/fix-1475 and squashes the following commits:
      
      c6058bd [Kan Zhang] Fixing a race condition in event listener unit test
      38877ccf
    • Chen Chao's avatar
      remove unnecessary brace and semicolon in 'putBlockInfo.synchronize' block · 016a8776
      Chen Chao authored
      delete semicolon
      
      Author: Chen Chao <crazyjvm@gmail.com>
      
      Closes #411 from CrazyJvm/patch-5 and squashes the following commits:
      
      72333a3 [Chen Chao] remove unnecessary brace
      de5d9a7 [Chen Chao] style fix
      016a8776
    • Ankur Dave's avatar
      SPARK-1329: Create pid2vid with correct number of partitions · 17d32345
      Ankur Dave authored
      Each vertex partition is co-located with a pid2vid array created in RoutingTable.scala. This array maps edge partition IDs to the list of vertices in the current vertex partition that are mentioned by edges in that partition. Therefore the pid2vid array should have one entry per edge partition.
      
      GraphX currently creates one entry per *vertex* partition, which is a bug that leads to an ArrayIndexOutOfBoundsException when there are more edge partitions than vertex partitions. This commit fixes the bug and adds a test for this case.
      
      Resolves SPARK-1329. Thanks to Daniel Darabos for reporting this bug.
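
      The essence of the fix can be illustrated as sizing the array by edge partitions (an illustrative fragment only, not the actual RoutingTable code):

      ```
      // Illustrative only: pid2vid maps *edge* partition IDs to the vertices
      // referenced by edges in that partition, so it needs one slot per edge
      // partition rather than one per vertex partition.
      val numEdgePartitions = 8
      val pid2vid = Array.fill(numEdgePartitions)(scala.collection.mutable.ArrayBuffer.empty[Long])
      ```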
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #368 from ankurdave/fix-pid2vid-size and squashes the following commits:
      
      5a5c52a [Ankur Dave] SPARK-1329: Create pid2vid with correct number of partitions
      17d32345
    • Ankur Dave's avatar
      Rebuild routing table after Graph.reverse · 235a47ce
      Ankur Dave authored
      GraphImpl.reverse used to reverse edges in each partition of the edge RDD but preserve the routing table and replicated vertex view, since reversing should not affect partitioning.
      
      However, the old routing table would then have incorrect information for srcAttrOnly and dstAttrOnly. These RDDs should be switched.
      
      A simple fix is for Graph.reverse to rebuild the routing table and replicated vertex view.
      
      Thanks to Bogdan Ghidireac for reporting this issue on the [mailing list](http://apache-spark-user-list.1001560.n3.nabble.com/graph-reverse-amp-Pregel-API-td4338.html).
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #431 from ankurdave/fix-reverse-bug and squashes the following commits:
      
      75d63cb [Ankur Dave] Rebuild routing table after Graph.reverse
      235a47ce
    • Patrick Wendell's avatar
      Add clean to build · 987760ec
      Patrick Wendell authored
      987760ec
    • Ye Xianjin's avatar
      [SPARK-1511] use Files.move instead of renameTo in TestUtils.scala · 10b1c59d
      Ye Xianjin authored
      JIRA issue:[SPARK-1511](https://issues.apache.org/jira/browse/SPARK-1511)
      
      The TestUtils.createCompiledClass method uses renameTo() to move files, which fails when the source and destination files are on different disks or partitions. This PR uses Files.move() instead. The move method tries renameTo() first and then falls back to copy() and delete(). I think this should handle the issue.

      I didn't find a test suite for this file, so I added a file existence check after the move.
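
      A minimal sketch of the behavior described (the actual patch simply calls a library Files.move(); this just spells out the rename-then-copy-and-delete idea):

      ```
      import java.io.{File, IOException}
      import java.nio.file.{Files, StandardCopyOption}

      // Sketch: try the fast rename first, fall back to copy + delete, and
      // finally assert that the destination exists (as the added test check does).
      def moveFile(src: File, dest: File): Unit = {
        if (!src.renameTo(dest)) {
          // renameTo can fail when src and dest are on different disks/partitions.
          Files.copy(src.toPath, dest.toPath, StandardCopyOption.REPLACE_EXISTING)
          if (!src.delete()) throw new IOException(s"Could not delete $src after copying it")
        }
        assert(dest.exists(), s"Expected $dest to exist after the move")
      }
      ```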
      
      Author: Ye Xianjin <advancedxy@gmail.com>
      
      Closes #427 from advancedxy/SPARK-1511 and squashes the following commits:
      
      a2b97c7 [Ye Xianjin] Based on @srowen's comment, assert file existence.
      6f95550 [Ye Xianjin] use Files.move instead of renameTo to handle the src and dest files are in different disks or partitions.
      10b1c59d
    • xuan's avatar
      SPARK-1465: Spark compilation is broken with the latest hadoop-2.4.0 release · 725925cf
      xuan authored
      YARN-1824 changes the APIs (addToEnvironment, setEnvFromInputString) in Apps, which causes the Spark build to break when compiling against version 2.4.0. To fix this, create Spark's own functions for that functionality so the build keeps compiling against 2.3 and other 2.x versions.
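
      A hypothetical sketch of what such a Spark-local helper could look like (names and details are assumptions, not the actual patch):

      ```
      import java.io.File

      // Hypothetical helper mirroring Apps.addToEnvironment: append a value to an
      // environment entry using the platform path separator, without calling the
      // Hadoop method whose signature changed in 2.4.0.
      def addToEnvironment(env: scala.collection.mutable.Map[String, String],
                           key: String,
                           value: String): Unit = {
        val newValue = env.get(key) match {
          case Some(existing) => existing + File.pathSeparator + value
          case None => value
        }
        env.put(key, newValue)
      }
      ```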
      
      Author: xuan <xuan@MacBook-Pro.local>
      Author: xuan <xuan@macbook-pro.home>
      
      Closes #396 from xgong/master and squashes the following commits:
      
      42b5984 [xuan] Remove two extra imports
      bc0926f [xuan] Remove usage of org.apache.hadoop.util.Shell
      be89fa7 [xuan] fix Spark compilation is broken with the latest hadoop-2.4.0 release
      725925cf
    • Sandeep's avatar
      SPARK-1469: Scheduler mode should accept lower-case definitions and have... · e269c24d
      Sandeep authored
      ... nicer error messages
      
      There are two improvements to scheduler mode (a sketch of both follows below):
      1. Made the built-in modes case insensitive (fair/FAIR, fifo/FIFO).
      2. If an invalid mode is given, we should print a better error message.
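
      A hypothetical sketch of both improvements (the real SchedulingMode enumeration may differ):

      ```
      // Hypothetical sketch: accept lower-case mode names and fail with a message
      // that lists the valid modes.
      object SchedulingMode extends Enumeration {
        val FAIR, FIFO = Value

        def fromString(name: String): Value =
          values.find(_.toString == name.toUpperCase)
            .getOrElse(throw new IllegalArgumentException(
              s"Unknown scheduling mode '$name'. Valid modes are: ${values.mkString(", ")}"))
      }
      ```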
      
      Author: Sandeep <sandeep@techaddict.me>
      
      Closes #388 from techaddict/1469 and squashes the following commits:
      
      a31bbd5 [Sandeep] SPARK-1469: Scheduler mode should accept lower-case definitions and have nicer error messages There are  two improvements to Scheduler Mode: 1. Made the built in ones case insensitive (fair/FAIR, fifo/FIFO). 2. If an invalid mode is given we should print a better error message.
      e269c24d
    • Patrick Wendell's avatar
      Minor addition to SPARK-1497 · 82349fbd
      Patrick Wendell authored
      82349fbd
    • Sean Owen's avatar
      SPARK-1497. Fix scalastyle warnings in YARN, Hive code · 77f83679
      Sean Owen authored
      (I wasn't sure how to automatically set `SPARK_YARN=true` and `SPARK_HIVE=true` when running scalastyle, but these are the errors that turn up.)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #413 from srowen/SPARK-1497 and squashes the following commits:
      
      f0c9318 [Sean Owen] Fix more scalastyle warnings in yarn
      80bf4c3 [Sean Owen] Add YARN alpha / YARN profile to scalastyle check
      026319c [Sean Owen] Fix scalastyle warnings in YARN, Hive code
      77f83679
    • Holden Karau's avatar
      SPARK-1310: Start adding k-fold cross validation to MLLib [adds kFold to... · c3527a33
      Holden Karau authored
      SPARK-1310: Start adding k-fold cross validation to MLLib [adds kFold to MLUtils & fixes bug in BernoulliSampler]
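
      A usage sketch based on the description above (the exact kFold signature and parameters are assumptions):

      ```
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.mllib.util.MLUtils

      // Hypothetical usage: split an RDD into k (training, validation) pairs.
      val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("kfold-sketch"))
      val data = sc.parallelize(1 to 100)
      val folds = MLUtils.kFold(data, 5, 42)   // assumed: (rdd, numFolds, seed)
      folds.foreach { case (training, validation) =>
        println(s"train=${training.count()} validate=${validation.count()}")
      }
      ```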
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #18 from holdenk/addkfoldcrossvalidation and squashes the following commits:
      
      208db9b [Holden Karau] Fix a bad space
      e84f2fc [Holden Karau] Fix the test, we should be looking at the second element instead
      6ddbf05 [Holden Karau] swap training and validation order
      7157ae9 [Holden Karau] CR feedback
      90896c7 [Holden Karau] New line
      150889c [Holden Karau] Fix up error messages in the MLUtilsSuite
      2cb90b3 [Holden Karau] Fix the names in kFold
      c702a96 [Holden Karau] Fix imports in MLUtils
      e187e35 [Holden Karau] Move { up to same line as whenExecuting(random) in RandomSamplerSuite.scala
      c5b723f [Holden Karau] clean up
      7ebe4d5 [Holden Karau] CR feedback, remove unecessary learners (came back during merge mistake) and insert an empty line
      bb5fa56 [Holden Karau] extra line sadness
      163c5b1 [Holden Karau] code review feedback 1.to -> 1 to and folds -> numFolds
      5a33f1d [Holden Karau] Code review follow up.
      e8741a7 [Holden Karau] CR feedback
      b78804e [Holden Karau] Remove cross validation [TODO in another pull request]
      91eae64 [Holden Karau] Consolidate things in mlutils
      264502a [Holden Karau] Add a test for the bug that was found with BernoulliSampler not copying the complement param
      dd0b737 [Holden Karau] Wrap long lines (oops)
      c0b7fa4 [Holden Karau] Switch FoldedRDD to use BernoulliSampler and PartitionwiseSampledRDD
      08f8e4d [Holden Karau] Fix BernoulliSampler to respect complement
      a751ec6 [Holden Karau] Add k-fold cross validation to MLLib
      c3527a33
    • Chen Chao's avatar
      update spark.default.parallelism · 9edd8878
      Chen Chao authored
      Actually, the value 8 is only valid in Mesos fine-grained mode:

      ```
      override def defaultParallelism() = sc.conf.getInt("spark.default.parallelism", 8)
      ```

      while in coarse-grained mode, including Mesos coarse-grained, the value of the property depends on the number of cores:

      ```
      override def defaultParallelism(): Int = {
        conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
      }
      ```
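
      To avoid depending on the mode-specific default, the property can be set explicitly (a minimal example):

      ```
      import org.apache.spark.{SparkConf, SparkContext}

      // Set spark.default.parallelism explicitly instead of relying on the
      // mode-dependent default described above.
      val conf = new SparkConf()
        .setAppName("parallelism-example")
        .set("spark.default.parallelism", "16")
      val sc = new SparkContext(conf)
      ```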
      
      Author: Chen Chao <crazyjvm@gmail.com>
      
      Closes #389 from CrazyJvm/patch-2 and squashes the following commits:
      
      84a7fe4 [Chen Chao] miss </li> at the end of every single line
      04a9796 [Chen Chao] change format
      ee0fae0 [Chen Chao] update spark.default.parallelism
      9edd8878
    • Cheng Lian's avatar
      Loads test tables when running "sbt hive/console" without HIVE_DEV_HOME · fec462c1
      Cheng Lian authored
      When running Hive tests, the working directory is `$SPARK_HOME/sql/hive`, while when running `sbt hive/console`, it becomes `$SPARK_HOME`, and test tables are not loaded if `HIVE_DEV_HOME` is not defined.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #417 from liancheng/loadTestTables and squashes the following commits:
      
      7cea8d6 [Cheng Lian] Loads test tables when running "sbt hive/console" without HIVE_DEV_HOME
      fec462c1
    • Marcelo Vanzin's avatar
      Make "spark logo" link refer to "/". · c0273d80
      Marcelo Vanzin authored
      This is not an issue with the driver UI, but when you fire
      up the history server, there's currently no way to go back to
      the app listing page without editing the browser's location
      field (since the logo's link points to the root of the
      application's own UI - i.e. the "stages" tab).
      
      The change just points the logo link to "/", which is the app
      listing for the history server, and the stages tab for the
      driver's UI.
      
      Tested with both history server and live driver.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #408 from vanzin/web-ui-root and squashes the following commits:
      
      1b60cb6 [Marcelo Vanzin] Make "spark logo" link refer to "/".
      c0273d80
    • Cheng Lian's avatar
      [SPARK-959] Updated SBT from 0.13.1 to 0.13.2 · 6a10d801
      Cheng Lian authored
      JIRA issue: [SPARK-959](https://spark-project.atlassian.net/browse/SPARK-959)
      
      SBT 0.13.2 has been officially released. This version updated Ivy 2.0 to Ivy 2.3, which fixes [IVY-899](https://issues.apache.org/jira/browse/IVY-899). This PR also removed previous workaround.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #426 from liancheng/updateSbt and squashes the following commits:
      
      95e3dc8 [Cheng Lian] Updated SBT from 0.13.1 to 0.13.2 to fix SPARK-959
      6a10d801
  5. Apr 15, 2014
    • Michael Armbrust's avatar
      [SQL] SPARK-1424 Generalize insertIntoTable functions on SchemaRDDs · 273c2fd0
      Michael Armbrust authored
      This makes it possible to create tables and insert into them using the DSL and SQL for the Scala and Java APIs.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #354 from marmbrus/insertIntoTable and squashes the following commits:
      
      6c6f227 [Michael Armbrust] Create random temporary files in python parquet unit tests.
      f5e6d5c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into insertIntoTable
      765c506 [Michael Armbrust] Add to JavaAPI.
      77b512c [Michael Armbrust] typos.
      5c3ef95 [Michael Armbrust] use names for boolean args.
      882afdf [Michael Armbrust] Change createTableAs to saveAsTable.  Clean up api annotations.
      d07d94b [Michael Armbrust] Add tests, support for creating parquet files and hive tables.
      fa3fe81 [Michael Armbrust] Make insertInto available on JavaSchemaRDD as well.  Add createTableAs function.
      273c2fd0
    • Matei Zaharia's avatar
      [WIP] SPARK-1430: Support sparse data in Python MLlib · 63ca581d
      Matei Zaharia authored
      This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
      
      On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
      
      Some to-do items left:
      - [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
      - [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
      - [x] Explain how to use these in the Python MLlib docs.
      
      CC @mengxr, @joshrosen
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #341 from mateiz/py-ml-update and squashes the following commits:
      
      d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
      ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
      b9f97a3 [Matei Zaharia] Fix test
      1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
      88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs
      37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
      da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
      c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
      a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
      74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
      889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
      ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
      a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
      0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
      eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
      2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
      154f45d [Matei Zaharia] Update docs, name some magic values
      881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
      63ca581d
    • Xiangrui Meng's avatar
      [FIX] update sbt-idea to version 1.6.0 · 8517911e
      Xiangrui Meng authored
      I saw `No "scala-library*.jar" in Scala compiler library` error in IDEA. It seems upgrading `sbt-idea` to 1.6.0 fixed the problem.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #419 from mengxr/idea-plugin and squashes the following commits:
      
      fb3c35f [Xiangrui Meng] update sbt-idea to version 1.6.0
      8517911e
    • Patrick Wendell's avatar
      SPARK-1455: Better isolation for unit tests. · 5aaf9836
      Patrick Wendell authored
      This is a simple first step towards avoiding running the Hive tests
      whenever possible.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #420 from pwendell/test-isolation and squashes the following commits:
      
      350c8af [Patrick Wendell] SPARK-1455: Better isolation for unit tests.
      5aaf9836
    • Manish Amde's avatar
      Decision Tree documentation for MLlib programming guide · 07d72fe6
      Manish Amde authored
      Added documentation for user to use the decision tree algorithms for classification and regression in Spark 1.0 release.
      
      Apart from a general review, I need specific input on the following:
      * I had to move a lot of the existing documentation under the *linear methods* umbrella to accommodate decision trees. I wonder if there is a better way to organize the programming guide given we are so close to the release.
      * I have not looked closely at PySpark, but I am wondering whether new MLlib algorithms are automatically plugged in or whether we need to do some extra work to call MLlib functions from PySpark. I will add to the PySpark examples based on the advice I get.
      
      cc: @mengxr, @hirakendu, @etrain, @atalwalkar
      
      Author: Manish Amde <manish9ue@gmail.com>
      
      Closes #402 from manishamde/tree_doc and squashes the following commits:
      
      022485a [Manish Amde] more documentation
      865826e [Manish Amde] minor: grammar
      dbb0e5e [Manish Amde] minor improvements to text
      b9ef6c4 [Manish Amde] basic decision tree code examples
      6e297d7 [Manish Amde] added subsections
      f427e84 [Manish Amde] renaming sections
      9c0c4be [Manish Amde] split candidate
      6925275 [Manish Amde] impurity and information gain
      94fd2f9 [Manish Amde] more reorg
      b93125c [Manish Amde] more subsection reorg
      3ecb2ad [Manish Amde] minor text addition
      1537dd3 [Manish Amde] added placeholders and some doc
      d06511d [Manish Amde] basic skeleton
      07d72fe6
    • DB Tsai's avatar
      [SPARK-1157][MLlib] L-BFGS Optimizer based on Breeze's implementation. · 6843d637
      DB Tsai authored
      This PR uses Breeze's L-BFGS implementation, and the Breeze dependency has already been introduced by Xiangrui's sparse input format work in SPARK-1212. Nice work, @mengxr !

      When used with a regularized updater, we need to compute the regVal and regGradient (the gradient of the regularized part of the cost function); with the current updater design, we can compute those two values in the following way.
      
      Let's review how the updater works when returning newWeights given the input parameters.

      w' = w - thisIterStepSize * (gradient + regGradient(w)), where regGradient is a function of w.
      If we set gradient = 0 and thisIterStepSize = 1, then
      regGradient(w) = w - w'
      
      As a result, for regVal, it can be computed by
      
          val regVal = updater.compute(
            weights,
            new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2
      and for regGradient, it can be obtained by
      
            val regGradient = weights.sub(
              updater.compute(weights, new DoubleMatrix(initialWeights.length, 1), 1, 1, regParam)._1)
      
      The PR includes the tests which compare the result with SGD with/without regularization.
      
      We did a comparison between L-BFGS and SGD, and we often saw 10x fewer steps with L-BFGS, while the cost per step is the same (just computing the gradient).

      The following paper by Prof. Ng at Stanford compares different optimizers, including L-BFGS and SGD. They use them in the context of deep learning, but it is worth reading as a reference.
      http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #353 from dbtsai/dbtsai-LBFGS and squashes the following commits:
      
      984b18e [DB Tsai] L-BFGS Optimizer based on Breeze's implementation. Also fixed indentation issue in GradientDescent optimizer.
      6843d637
    • William Benton's avatar
      SPARK-1501: Ensure assertions in Graph.apply are asserted. · 2580a3b1
      William Benton authored
      The Graph.apply test in GraphSuite had some assertions in a closure in
      a graph transformation. As a consequence, these assertions never
      actually executed.  Furthermore, these closures had a reference to
      (non-serializable) test harness classes because they called assert(),
      which could be a problem if we proactively check closure serializability
      in the future.
      
      This commit simply changes the Graph.apply test to collect the graph
      triplets so it can assert about each triplet from a map method.
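
      The pattern looks roughly like this (a simplified sketch assuming an existing SparkContext named sc, not the actual GraphSuite code):

      ```
      import org.apache.spark.graphx.{Edge, Graph}

      // Simplified sketch: build the graph, collect its triplets to the driver, and
      // assert there so the assertions actually run and no test-harness classes are
      // captured in an executor-side closure.
      val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b")))
      val edges = sc.parallelize(Seq(Edge(1L, 2L, 1)))
      val graph = Graph(vertices, edges)
      val triplets = graph.triplets.map(t => (t.srcId, t.dstId, t.attr)).collect()
      triplets.foreach { case (_, _, attr) => assert(attr == 1) }
      ```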
      
      Author: William Benton <willb@redhat.com>
      
      Closes #415 from willb/graphsuite-nop-fix and squashes the following commits:
      
      0b63658 [William Benton] Ensure assertions in Graph.apply are asserted.
      2580a3b1