  1. May 21, 2014
  2. May 17, 2014
    • [SPARK-1808] Route bin/pyspark through Spark submit · 4b8ec6fc
      Andrew Or authored
      **Problem.** For `bin/pyspark`, there is currently no way to specify Spark configuration properties other than through `SPARK_JAVA_OPTS` in `conf/spark-env.sh`. However, this mechanism is supposedly deprecated. Instead, `bin/pyspark` needs to pick up configurations explicitly specified in `conf/spark-defaults.conf`.
      
      **Solution.** Have `bin/pyspark` invoke `bin/spark-submit`, like all of its counterparts in Scala land (i.e. `bin/spark-shell`, `bin/run-example`). This has the additional benefit of making the invocation of all the user facing Spark scripts consistent.
      
      **Details.** `bin/pyspark` inherently handles two cases: (1) running python applications and (2) running the python shell. For (1), Spark submit already handles running python applications, so when `bin/pyspark` is given a python file we can simply pass the file directly to Spark submit and let it handle the rest.
      
      For case (2), `bin/pyspark` starts a python process as before, which launches the JVM as a sub-process. The existing code already provides a code path to do this. All we needed to change was to use `bin/spark-submit` instead of `spark-class` to launch the JVM. This requires modifications to Spark submit to handle the pyspark shell as a special case.
      
      This has been tested locally (OSX and Windows 7), on a standalone cluster, and on a YARN cluster. Running IPython also works as before, except now it takes in Spark submit arguments too.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #799 from andrewor14/pyspark-submit and squashes the following commits:
      
      bf37e36 [Andrew Or] Minor changes
      01066fa [Andrew Or] bin/pyspark for Windows
      c8cb3bf [Andrew Or] Handle perverse app names (with escaped quotes)
      1866f85 [Andrew Or] Windows is not cooperating
      456d844 [Andrew Or] Guard against shlex hanging if PYSPARK_SUBMIT_ARGS is not set
      7eebda8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
      b7ba0d8 [Andrew Or] Address a few comments (minor)
      06eb138 [Andrew Or] Use shlex instead of writing our own parser
      05879fa [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
      a823661 [Andrew Or] Fix --die-on-broken-pipe not propagated properly
      6fba412 [Andrew Or] Deal with quotes + address various comments
      fe4c8a7 [Andrew Or] Update --help for bin/pyspark
      afe47bf [Andrew Or] Fix spark shell
      f04aaa4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
      a371d26 [Andrew Or] Route bin/pyspark through Spark submit
      4b8ec6fc
  3. May 15, 2014
  4. May 14, 2014
    • [FIX] do not load defaults when testing SparkConf in pyspark · 94c6c06e
      Xiangrui Meng authored
      The default constructor loads default properties, which can fail the test.
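      
      A minimal sketch of the idea behind the fix (key name is illustrative), using the pyspark `SparkConf(loadDefaults=...)` flag so a test only sees what it sets explicitly:
      
      ```
      from pyspark import SparkConf
      
      # Hedged sketch: skip loading spark.* defaults from system properties,
      # so stray properties in the environment cannot fail the test.
      conf = SparkConf(loadDefaults=False)
      conf.set("spark.test.key", "value")
      assert conf.get("spark.test.key") == "value"
      ```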
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #775 from mengxr/pyspark-conf-fix and squashes the following commits:
      
      83ef6c4 [Xiangrui Meng] do not load defaults when testing SparkConf in pyspark
      94c6c06e
  5. May 13, 2014
  6. May 10, 2014
  7. May 07, 2014
    • [SPARK-1743][MLLIB] add loadLibSVMFile and saveAsLibSVMFile to pyspark · 3188553f
      Xiangrui Meng authored
      Make loading/saving labeled data easier for pyspark users.
      
      Also changed type check in `SparseVector` to allow numpy integers.
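      
      A hedged usage sketch (file paths are illustrative; assumes a running SparkContext `sc` as in the pyspark shell):
      
      ```
      from pyspark.mllib.util import MLUtils
      
      # Load labeled points from a LIBSVM-format text file and write them back out.
      points = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
      MLUtils.saveAsLibSVMFile(points, "out/libsvm")
      ```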
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #672 from mengxr/pyspark-mllib-util and squashes the following commits:
      
      2943fa7 [Xiangrui Meng] format docs
      d61668d [Xiangrui Meng] add loadLibSVMFile and saveAsLibSVMFile to pyspark
      3188553f
    • SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions · 3308722c
      Aaron Davidson authored
      This patch includes several cleanups to PythonRDD, focused around fixing [SPARK-1579](https://issues.apache.org/jira/browse/SPARK-1579) cleanly. Listed in order of approximate importance:
      
      - The Python daemon waits for Spark to close the socket before exiting,
        in order to avoid causing spurious IOExceptions in Spark's
        `PythonRDD::WriterThread`.
      - Removes the Python Monitor Thread, which polled for task cancellations
        in order to kill the Python worker. Instead, we do this in the
        onCompleteCallback, since this is guaranteed to be called during
        cancellation.
      - Adds a "completed" variable to TaskContext to avoid the issue noted in
        [SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019), where onCompleteCallbacks may be execution-order dependent.
        Along with this, I removed the "context.interrupted = true" flag in
        the onCompleteCallback.
      - Extracts PythonRDD::WriterThread to its own class.
      
      Since this patch provides an alternative solution to [SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019), I did test it with
      
      ```
      sc.textFile("latlon.tsv").take(5)
      ```
      
      many times without error.
      
      Additionally, in order to test the unswallowed exceptions, I performed
      
      ```
      sc.textFile("s3n://<big file>").count()
      ```
      
      and cut my internet during execution. Prior to this patch, we got the "stdin writer exited early" message, which was unhelpful. Now, we get the SocketExceptions propagated through Spark to the user and get proper (though unsuccessful) task retries.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #640 from aarondav/pyspark-io and squashes the following commits:
      
      b391ff8 [Aaron Davidson] Detect "clean socket shutdowns" and stop waiting on the socket
      c0c49da [Aaron Davidson] SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions
      3308722c
    • [SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations... · 967635a2
      Kan Zhang authored
      ... that do not change schema
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #448 from kanzhang/SPARK-1460 and squashes the following commits:
      
      111e388 [Kan Zhang] silence MiMa errors in EdgeRDD and VertexRDD
      91dc787 [Kan Zhang] Taking into account newly added Ordering param
      79ed52a [Kan Zhang] [SPARK-1460] Returning SchemaRDD on Set operations that do not change schema
      967635a2
  8. May 06, 2014
    • SPARK-1637: Clean up examples for 1.0 · a000b5c3
      Sandeep authored
      - [x] Move all of them into subpackages of org.apache.spark.examples (right now some are in org.apache.spark.streaming.examples, for instance, and others are in org.apache.spark.examples.mllib)
      - [x] Move Python examples into examples/src/main/python
      - [x] Update docs to reflect these changes
      
      Author: Sandeep <sandeep@techaddict.me>
      
      This patch had conflicts when merged, resolved by
      Committer: Matei Zaharia <matei@databricks.com>
      
      Closes #571 from techaddict/SPARK-1637 and squashes the following commits:
      
      47ef86c [Sandeep] Changes based on Discussions on PR, removing use of RawTextHelper from examples
      8ed2d3f [Sandeep] Docs Updated for changes, Change for java examples
      5f96121 [Sandeep] Move Python examples into examples/src/main/python
      0a8dd77 [Sandeep] Move all Scala Examples to org.apache.spark.examples (some are in org.apache.spark.streaming.examples, for instance, and others are in org.apache.spark.examples.mllib)
      a000b5c3
    • [SPARK-1549] Add Python support to spark-submit · 951a5d93
      Matei Zaharia authored
      This PR updates spark-submit to allow submitting Python scripts (currently only with deploy-mode=client, but that's all that was supported before) and updates the PySpark code to properly find various paths, etc. One significant change is that we assume we can always find the Python files either from the Spark assembly JAR (which will happen with the Maven assembly build in make-distribution.sh) or from SPARK_HOME (which will exist in local mode even if you use sbt assembly, and should be enough for testing). This means we no longer need a weird hack to modify the environment for YARN.
      
      This patch also updates the Python worker manager to run python with -u, which means unbuffered output (send it to our logs right away instead of waiting a while after stuff was written); this should simplify debugging.
      
      In addition, it fixes https://issues.apache.org/jira/browse/SPARK-1709, setting the main class from a JAR's Main-Class attribute if not specified by the user, and fixes a few help strings and style issues in spark-submit.
      
      In the future we may want to make the `pyspark` shell use spark-submit as well, but it seems unnecessary for 1.0.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #664 from mateiz/py-submit and squashes the following commits:
      
      15e9669 [Matei Zaharia] Fix some uses of path.separator property
      051278c [Matei Zaharia] Small style fixes
      0afe886 [Matei Zaharia] Add license headers
      4650412 [Matei Zaharia] Add pyFiles to PYTHONPATH in executors, remove old YARN stuff, add tests
      15f8e1e [Matei Zaharia] Set PYTHONPATH in PythonWorkerFactory in case it wasn't set from outside
      47c0655 [Matei Zaharia] More work to make spark-submit work with Python:
      d4375bd [Matei Zaharia] Clean up description of spark-submit args a bit and add Python ones
      951a5d93
  9. May 05, 2014
    • [SPARK-1594][MLLIB] Cleaning up MLlib APIs and guide · 98750a74
      Xiangrui Meng authored
      Final pass before the v1.0 release.
      
      * Remove `VectorRDDs`
      * Move `BinaryClassificationMetrics` from `evaluation.binary` to `evaluation`
      * Change default value of `addIntercept` to false and allow to add intercept in Ridge and Lasso.
      * Clean `DecisionTree` package doc and test suite.
      * Mark model constructors `private[spark]`
      * Rename `loadLibSVMData` to `loadLibSVMFile` and hide `LabelParser` from users.
      * Add `saveAsLibSVMFile`.
      * Add `appendBias` to `MLUtils`.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #524 from mengxr/mllib-cleaning and squashes the following commits:
      
      295dc8b [Xiangrui Meng] update loadLibSVMFile doc
      1977ac1 [Xiangrui Meng] fix doc of appendBias
      649fcf0 [Xiangrui Meng] rename loadLibSVMData to loadLibSVMFile; hide LabelParser from user APIs
      54b812c [Xiangrui Meng] add appendBias
      a71e7d0 [Xiangrui Meng] add saveAsLibSVMFile
      d976295 [Xiangrui Meng] Merge branch 'master' into mllib-cleaning
      b7e5cec [Xiangrui Meng] remove some experimental annotations and make model constructors private[mllib]
      9b02b93 [Xiangrui Meng] minor code style update
      a593ddc [Xiangrui Meng] fix python tests
      fc28c18 [Xiangrui Meng] mark more classes experimental
      f6cbbff [Xiangrui Meng] fix Java tests
      0af70b0 [Xiangrui Meng] minor
      6e139ef [Xiangrui Meng] Merge branch 'master' into mllib-cleaning
      94e6dce [Xiangrui Meng] move BinaryLabelCounter and BinaryConfusionMatrixImpl to evaluation.binary
      df34907 [Xiangrui Meng] clean DecisionTreeSuite to use LocalSparkContext
      c81807f [Xiangrui Meng] set the default value of AddIntercept to false
      03389c0 [Xiangrui Meng] allow to add intercept in Ridge and Lasso
      c66c56f [Xiangrui Meng] move tree md to package object doc
      a2695df [Xiangrui Meng] update guide for BinaryClassificationMetrics
      9194f4c [Xiangrui Meng] move BinaryClassificationMetrics one level up
      1c1a0e3 [Xiangrui Meng] remove VectorRDDs because it only contains one function that is not necessary for us to maintain
      98750a74
  10. Apr 30, 2014
    • SPARK-1004. PySpark on YARN · ff5be9a4
      Sandy Ryza authored
      This reopens https://github.com/apache/incubator-spark/pull/640 against the new repo
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #30 from sryza/sandy-spark-1004 and squashes the following commits:
      
      89889d4 [Sandy Ryza] Move unzipping py4j to the generate-resources phase so that it gets included in the jar the first time
      5165a02 [Sandy Ryza] Fix docs
      fd0df79 [Sandy Ryza] PySpark on YARN
      ff5be9a4
  11. Apr 29, 2014
    • [SPARK-1674] fix interrupted system call error in pyspark's RDD.pipe · d33df1c1
      Xiangrui Meng authored
      `RDD.pipe`'s doctest throws interrupted system call exception on Mac. It can be fixed by wrapping `pipe.stdout.readline` in an iterator.
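      
      A rough sketch of that pattern (not the exact patch; the `cat` subprocess is only illustrative): reading lines through `iter()` rather than calling `readline` in a bare loop.
      
      ```
      import subprocess
      
      # Hedged sketch of wrapping readline in an iterator, as the fix does
      # for pipe.stdout.readline.
      proc = subprocess.Popen(["cat"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
      proc.stdin.write(b"a\nb\n")
      proc.stdin.close()
      for line in iter(proc.stdout.readline, b""):
          print(line.rstrip())
      ```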
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #594 from mengxr/pyspark-pipe and squashes the following commits:
      
      cc32ac9 [Xiangrui Meng] fix interrupted system call error in pyspark's RDD.pipe
      d33df1c1
    • Minor fix to python table caching API. · 497be3ca
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #585 from marmbrus/pythonCacheTable and squashes the following commits:
      
      7ec1f91 [Michael Armbrust] Minor fix to python table caching API.
      497be3ca
  12. Apr 25, 2014
    • SPARK-1242 Add aggregate to python rdd · e03bc379
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #139 from holdenk/add_aggregate_to_python_api and squashes the following commits:
      
      0f39ae3 [Holden Karau] Merge in master
      4879c75 [Holden Karau] CR feedback, fix issue with empty RDDs in aggregate
      70b4724 [Holden Karau] Style fixes from code review
      96b047b [Holden Karau] Add aggregate to python rdd
      e03bc379
  13. Apr 24, 2014
    • [SPARK-986]: Job cancelation for PySpark · e53eb4f0
      Ahir Reddy authored
      * Additions to the PySpark API to cancel jobs
      * Monitor Thread in PythonRDD to kill Python workers if a task is interrupted
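      
      A hedged sketch of how the new API can be used from the pyspark shell (method names assumed to mirror Scala's setJobGroup/cancelJobGroup; assumes a shell `sc`):
      
      ```
      import threading, time
      
      def slow_job():
          sc.setJobGroup("cancel-me", "a job we may want to cancel")
          try:
              sc.parallelize(range(1000)).map(lambda x: sum(i * i for i in range(100000))).count()
          except Exception as e:
              print("job cancelled: %s" % e)
      
      t = threading.Thread(target=slow_job)
      t.start()
      time.sleep(2)
      sc.cancelJobGroup("cancel-me")  # interrupts the tasks started above
      t.join()
      ```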
      
      Author: Ahir Reddy <ahirreddy@gmail.com>
      
      Closes #541 from ahirreddy/python-cancel and squashes the following commits:
      
      dfdf447 [Ahir Reddy] Changed success -> completed and made logging message clearer
      6c860ab [Ahir Reddy] PR Comments
      4b4100a [Ahir Reddy] Success flag
      adba6ed [Ahir Reddy] Destroy python workers
      27a2f8f [Ahir Reddy] Start the writer thread...
      d422f7b [Ahir Reddy] Remove unnecesssary vals
      adda337 [Ahir Reddy] Busy wait on the ocntext.interrupted flag, and then kill the python worker
      d9e472f [Ahir Reddy] Revert "removed unnecessary vals"
      5b9cae5 [Ahir Reddy] removed unnecessary vals
      07b54d9 [Ahir Reddy] Fix canceling unit test
      8ae9681 [Ahir Reddy] Don't interrupt worker
      7722342 [Ahir Reddy] Monitor Thread for python workers
      db04e16 [Ahir Reddy] Added canceling api to PySpark
      e53eb4f0
    • SPARK-1438 RDD.sample() make seed param optional · 35e3d199
      Arun Ramakrishnan authored
      Copied from the previous pull request https://github.com/apache/spark/pull/462.
      
      It's probably better to let the underlying language implementation take care of the default. This was easier to do in Python, as the default value for seed in random and numpy.random is None.
      
      On the Scala/Java side it might mean propagating an Option or null (oh no!) down the chain to where the Random is constructed. But the convention in some other methods was to use System.nanoTime, so I followed that convention.
      
      This conflicts with the overloaded method sql.SchemaRDD.sample, which also defines default params:
      sample(fraction, withReplacement=false, seed=math.random)
      Scala does not allow more than one overload to have default params. I believe the author intended to override the RDD.sample method, not overload it, so I changed it.
      
      If backward compatibility is important, 3 new methods can be introduced (without default params) like this:
      sample(fraction)
      sample(fraction, withReplacement)
      sample(fraction, withReplacement, seed)
      
      Added some tests for the scala RDD takeSample method.
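      
      A hedged example of the resulting Python signature, where the seed now defaults to None (assumes a shell `sc`):
      
      ```
      rdd = sc.parallelize(range(100))
      s1 = rdd.sample(False, 0.1)           # no seed: a fresh random seed each run
      s2 = rdd.sample(False, 0.1, seed=42)  # explicit seed for reproducibility
      ```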
      
      Author: Arun Ramakrishnan <smartnut007@gmail.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Matei Zaharia <matei@databricks.com>
      
      Closes #477 from smartnut007/master and squashes the following commits:
      
      07bb06e [Arun Ramakrishnan] SPARK-1438 fixing more space formatting issues
      b9ebfe2 [Arun Ramakrishnan] SPARK-1438 removing redundant import of random in python rddsampler
      8d05b1a [Arun Ramakrishnan] SPARK-1438 RDD . Replace System.nanoTime with a Random generated number. python: use a separate instance of Random instead of seeding language api global Random instance.
      69619c6 [Arun Ramakrishnan] SPARK-1438 fix spacing issue
      0c247db [Arun Ramakrishnan] SPARK-1438 RDD language apis to support optional seed in RDD methods sample/takeSample
      35e3d199
  14. Apr 22, 2014
    • fix bugs of dot in python · c919798f
      Xusen Yin authored
      Without a `transpose()` on `self.theta`, a
      
      *ValueError: matrices are not aligned*
      
      occurs. The former test case just ignored this situation.
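      
      A small NumPy illustration of the alignment issue (shapes are illustrative):
      
      ```
      import numpy as np
      
      theta = np.ones((2, 3))        # 2 classes x 3 features (illustrative shapes)
      x = np.array([1.0, 0.0, 2.0])  # one feature vector
      scores = np.dot(x, theta.transpose())   # shape (2,): one score per class
      # np.dot(x, theta) raises ValueError because (3,) and (2, 3) are not aligned
      ```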
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #463 from yinxusen/python-naive-bayes and squashes the following commits:
      
      fcbe3bc [Xusen Yin] fix bugs of dot in python
      c919798f
  15. Apr 21, 2014
    • [SPARK-1439, SPARK-1440] Generate unified Scaladoc across projects and Javadocs · fc783847
      Matei Zaharia authored
      I used the sbt-unidoc plugin (https://github.com/sbt/sbt-unidoc) to create a unified Scaladoc of our public packages, and generate Javadocs as well. One limitation is that I haven't found an easy way to exclude packages in the Javadoc; there is a SBT task that identifies Java sources to run javadoc on, but it's been very difficult to modify it from outside to change what is set in the unidoc package. Some SBT-savvy people should help with this. The Javadoc site also lacks package-level descriptions and things like that, so we may want to look into that. We may decide not to post these right now if it's too limited compared to the Scala one.
      
      Example of the built doc site: http://people.csail.mit.edu/matei/spark-unified-docs/
      
      Author: Matei Zaharia <matei@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Patrick Wendell <pwendell@gmail.com>
      
      Closes #457 from mateiz/better-docs and squashes the following commits:
      
      a63d4a3 [Matei Zaharia] Skip Java/Scala API docs for Python package
      5ea1f43 [Matei Zaharia] Fix links to Java classes in Java guide, fix some JS for scrolling to anchors on page load
      f05abc0 [Matei Zaharia] Don't include java.lang package names
      995e992 [Matei Zaharia] Skip internal packages and class names with $ in JavaDoc
      a14a93c [Matei Zaharia] typo
      76ce64d [Matei Zaharia] Add groups to Javadoc index page, and a first package-info.java
      ed6f994 [Matei Zaharia] Generate JavaDoc as well, add titles, update doc site to use unified docs
      acb993d [Matei Zaharia] Add Unidoc plugin for the projects we want Unidoced
      fc783847
  16. Apr 19, 2014
    • Add insertInto and saveAsTable to Python API. · 10d04213
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #447 from marmbrus/pythonInsert and squashes the following commits:
      
      c7ab692 [Michael Armbrust] Keep docstrings < 72 chars.
      ff62870 [Michael Armbrust] Add insertInto and saveAsTable to Python API.
      10d04213
  17. Apr 18, 2014
    • Fixed broken pyspark shell. · 81a152c5
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #444 from rxin/pyspark and squashes the following commits:
      
      fc11356 [Reynold Xin] Made the PySpark shell version checking compatible with Python 2.6.
      571830b [Reynold Xin] Fixed broken pyspark shell.
      81a152c5
    • SPARK-1483: Rename minSplits to minPartitions in public APIs · e31c8ffc
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-1483
      
      From the original JIRA: " The parameter name is part of the public API in Scala and Python, since you can pass named parameters to a method, so we should name it to this more descriptive term. Everywhere else we refer to "splits" as partitions." - @mateiz
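      
      A hedged example of the renamed keyword argument in Python (file path is illustrative; assumes a shell `sc`):
      
      ```
      # minSplits is now minPartitions when passed by name.
      lines = sc.textFile("README.md", minPartitions=4)
      ```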
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #430 from CodingCat/SPARK-1483 and squashes the following commits:
      
      4b60541 [CodingCat] deprecate defaultMinSplits
      ba2c663 [CodingCat] Rename minSplits to minPartitions in public APIs
      e31c8ffc
  18. Apr 17, 2014
    • FIX: Don't build Hive in assembly unless running Hive tests. · 6c746ba3
      Patrick Wendell authored
      This will make the tests more stable when not running SQL tests.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #439 from pwendell/hive-tests and squashes the following commits:
      
      88a6032 [Patrick Wendell] FIX: Don't build Hive in assembly unless running Hive tests.
      6c746ba3
  19. Apr 16, 2014
  20. Apr 15, 2014
    • [SQL] SPARK-1424 Generalize insertIntoTable functions on SchemaRDDs · 273c2fd0
      Michael Armbrust authored
      This makes it possible to create tables and insert into them using the DSL and SQL for the Scala and Java APIs.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #354 from marmbrus/insertIntoTable and squashes the following commits:
      
      6c6f227 [Michael Armbrust] Create random temporary files in python parquet unit tests.
      f5e6d5c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into insertIntoTable
      765c506 [Michael Armbrust] Add to JavaAPI.
      77b512c [Michael Armbrust] typos.
      5c3ef95 [Michael Armbrust] use names for boolean args.
      882afdf [Michael Armbrust] Change createTableAs to saveAsTable.  Clean up api annotations.
      d07d94b [Michael Armbrust] Add tests, support for creating parquet files and hive tables.
      fa3fe81 [Michael Armbrust] Make insertInto available on JavaSchemaRDD as well.  Add createTableAs function.
      273c2fd0
    • [WIP] SPARK-1430: Support sparse data in Python MLlib · 63ca581d
      Matei Zaharia authored
      This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
      
      On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
      
      Some to-do items left:
      - [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
      - [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
      - [x] Explain how to use these in the Python MLlib docs.
      
      CC @mengxr, @joshrosen
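      
      A hedged sketch of the new Python classes (values are illustrative; import paths assumed):
      
      ```
      from pyspark.mllib.linalg import SparseVector
      from pyspark.mllib.regression import LabeledPoint
      
      # A 10-dimensional vector with non-zeros at indices 2 and 7,
      # wrapped in a labeled training point.
      sv = SparseVector(10, [2, 7], [1.0, 5.5])
      point = LabeledPoint(1.0, sv)
      ```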
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #341 from mateiz/py-ml-update and squashes the following commits:
      
      d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
      ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
      b9f97a3 [Matei Zaharia] Fix test
      1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
      88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs
      37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
      da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
      c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
      a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
      74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
      889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
      ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
      a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
      0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
      eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
      2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
      154f45d [Matei Zaharia] Update docs, name some magic values
      881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
      63ca581d
    • SPARK-1426: Make MLlib work with NumPy versions older than 1.7 · df360917
      Sandeep authored
      Currently it requires NumPy 1.7 due to using the copyto method (http://docs.scipy.org/doc/numpy/reference/generated/numpy.copyto.html) for extracting data out of an array.
      Replace it with a fallback.
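      
      A hedged sketch of what such a fallback can look like (helper name is made up):
      
      ```
      import numpy as np
      
      def copy_into(dst, src):
          # numpy.copyto only exists in NumPy >= 1.7; slice assignment
          # works on older versions as well.
          if hasattr(np, "copyto"):
              np.copyto(dst, src)
          else:
              dst[:] = src
      ```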
      
      Author: Sandeep <sandeep@techaddict.me>
      
      Closes #391 from techaddict/1426 and squashes the following commits:
      
      d365962 [Sandeep] SPARK-1426: Make MLlib work with NumPy versions older than 1.7 Currently it requires NumPy 1.7 due to using the copyto method (http://docs.scipy.org/doc/numpy/reference/generated/numpy.copyto.html) for extracting data out of an array. Replace it with a fallback
      df360917
    • SPARK-1374: PySpark API for SparkSQL · c99bcb7f
      Ahir Reddy authored
      An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD composed of dictionaries, with string keys and primitive values (boolean, float, int, long, string) can be converted into a SchemaRDD that supports sql queries.
      
      ```
      from pyspark.context import SQLContext
      sqlCtx = SQLContext(sc)
      rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"}, {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}])
      srdd = sqlCtx.applySchema(rdd)
      sqlCtx.registerRDDAsTable(srdd, "table1")
      srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 as f2 from table1")
      srdd2.collect()
      ```
      The last line yields ```[{"f1" : 1, "f2" : "row1"}, {"f1" : 2, "f2": "row2"}, {"f1" : 3, "f2": "row3"}]```
      
      Author: Ahir Reddy <ahirreddy@gmail.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #363 from ahirreddy/pysql and squashes the following commits:
      
      0294497 [Ahir Reddy] Updated log4j properties to supress Hive Warns
      307d6e0 [Ahir Reddy] Style fix
      6f7b8f6 [Ahir Reddy] Temporary fix MIMA checker. Since we now assemble Spark jar with Hive, we don't want to check the interfaces of all of our hive dependencies
      3ef074a [Ahir Reddy] Updated documentation because classes moved to sql.py
      29245bf [Ahir Reddy] Cache underlying SchemaRDD instead of generating and caching PythonRDD
      f2312c7 [Ahir Reddy] Moved everything into sql.py
      a19afe4 [Ahir Reddy] Doc fixes
      6d658ba [Ahir Reddy] Remove the metastore directory created by the HiveContext tests in SparkSQL
      521ff6d [Ahir Reddy] Trying to get spark to build with hive
      ab95eba [Ahir Reddy] Set SPARK_HIVE=true on jenkins
      ded03e7 [Ahir Reddy] Added doc test for HiveContext
      22de1d4 [Ahir Reddy] Fixed maven pyrolite dependency
      e4da06c [Ahir Reddy] Display message if hive is not built into spark
      227a0be [Michael Armbrust] Update API links. Fix Hive example.
      58e2aa9 [Michael Armbrust] Build Docs for pyspark SQL Api.  Minor fixes.
      4285340 [Michael Armbrust] Fix building of Hive API Docs.
      38a92b0 [Michael Armbrust] Add note to future non-python developers about python docs.
      337b201 [Ahir Reddy] Changed com.clearspring.analytics stream version from 2.4.0 to 2.5.1 to match SBT build, and added pyrolite to maven build
      40491c9 [Ahir Reddy] PR Changes + Method Visibility
      1836944 [Michael Armbrust] Fix comments.
      e00980f [Michael Armbrust] First draft of python sql programming guide.
      b0192d3 [Ahir Reddy] Added Long, Double and Boolean as usable types + unit test
      f98a422 [Ahir Reddy] HiveContexts
      79621cf [Ahir Reddy] cleaning up cruft
      b406ba0 [Ahir Reddy] doctest formatting
      20936a5 [Ahir Reddy] Added tests and documentation
      e4d21b4 [Ahir Reddy] Added pyrolite dependency
      79f739d [Ahir Reddy] added more tests
      7515ba0 [Ahir Reddy] added more tests :)
      d26ec5e [Ahir Reddy] added test
      e9f5b8d [Ahir Reddy] adding tests
      906d180 [Ahir Reddy] added todo explaining cost of creating Row object in python
      251f99d [Ahir Reddy] for now only allow dictionaries as input
      09b9980 [Ahir Reddy] made jrdd explicitly lazy
      c608947 [Ahir Reddy] SchemaRDD now has all RDD operations
      725c91e [Ahir Reddy] awesome row objects
      55d1c76 [Ahir Reddy] return row objects
      4fe1319 [Ahir Reddy] output dictionaries correctly
      be079de [Ahir Reddy] returning dictionaries works
      cd5f79f [Ahir Reddy] Switched to using Scala SQLContext
      e948bd9 [Ahir Reddy] yippie
      4886052 [Ahir Reddy] even better
      c0fb1c6 [Ahir Reddy] more working
      043ca85 [Ahir Reddy] working
      5496f9f [Ahir Reddy] doesn't crash
      b8b904b [Ahir Reddy] Added schema rdd class
      67ba875 [Ahir Reddy] java to python, and python to java
      bcc0f23 [Ahir Reddy] Java to python
      ab6025d [Ahir Reddy] compiling
      c99bcb7f
  21. Apr 10, 2014
    • Set spark.executor.uri from environment variable (needed by Mesos) · 5cd11d51
      Ivan Wick authored
      The Mesos backend uses this property when setting up a slave process. It is similarly set in the Scala repl (org.apache.spark.repl.SparkILoop), but I couldn't find anything analogous for pyspark.
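      
      A hedged sketch of the behavior (environment variable name assumed):
      
      ```
      import os
      from pyspark import SparkConf
      
      conf = SparkConf()
      if "SPARK_EXECUTOR_URI" in os.environ:
          # Mesos slaves fetch the Spark distribution from this URI.
          conf.set("spark.executor.uri", os.environ["SPARK_EXECUTOR_URI"])
      ```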
      
      Author: Ivan Wick <ivanwick+github@gmail.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Matei Zaharia <matei@databricks.com>
      
      Closes #311 from ivanwick/master and squashes the following commits:
      
      da0c3e4 [Ivan Wick] Set spark.executor.uri from environment variable (needed by Mesos)
      5cd11d51
    • SPARK-1428: MLlib should convert non-float64 NumPy arrays to float64 instead of complaining · 3bd31294
      Sandeep authored
      Author: Sandeep <sandeep@techaddict.me>
      
      Closes #356 from techaddict/1428 and squashes the following commits:
      
      3bdf5f6 [Sandeep] SPARK-1428: MLlib should convert non-float64 NumPy arrays to float64 instead of complaining
      3bd31294
  22. Apr 08, 2014
    • Spark 1271: Co-Group and Group-By should pass Iterable[X] · ce8ec545
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #242 from holdenk/spark-1320-cogroupandgroupshouldpassiterator and squashes the following commits:
      
      f289536 [Holden Karau] Fix bad merge, should have been Iterable rather than Iterator
      77048f8 [Holden Karau] Fix merge up to master
      d3fe909 [Holden Karau] use toSeq instead
      7a092a3 [Holden Karau] switch resultitr to resultiterable
      eb06216 [Holden Karau] maybe I should have had a coffee first. use correct import for guava iterables
      c5075aa [Holden Karau] If guava 14 had iterables
      2d06e10 [Holden Karau] Fix Java 8 cogroup tests for the new API
      11e730c [Holden Karau] Fix streaming tests
      66b583d [Holden Karau] Fix the core test suite to compile
      4ed579b [Holden Karau] Refactor from iterator to iterable
      d052c07 [Holden Karau] Python tests now pass with iterator pandas
      3bcd81d [Holden Karau] Revert "Try and make pickling list iterators work"
      cd1e81c [Holden Karau] Try and make pickling list iterators work
      c60233a [Holden Karau] Start investigating moving to iterators for python API like the Java/Scala one. tl;dr: We will have to write our own iterator since the default one doesn't pickle well
      88a5cef [Holden Karau] Fix cogroup test in JavaAPISuite for streaming
      a5ee714 [Holden Karau] oops, was checking wrong iterator
      e687f21 [Holden Karau] Fix groupbykey test in JavaAPISuite of streaming
      ec8cc3e [Holden Karau] Fix test issues\!
      4b0eeb9 [Holden Karau] Switch cast in PairDStreamFunctions
      fa395c9 [Holden Karau] Revert "Add a join based on the problem in SVD"
      ec99e32 [Holden Karau] Revert "Revert this but for now put things in list pandas"
      b692868 [Holden Karau] Revert
      7e533f7 [Holden Karau] Fix the bug
      8a5153a [Holden Karau] Revert me, but we have some stuff to debug
      b4e86a9 [Holden Karau] Add a join based on the problem in SVD
      c4510e2 [Holden Karau] Revert this but for now put things in list pandas
      b4e0b1d [Holden Karau] Fix style issues
      71e8b9f [Holden Karau] I really need to stop calling size on iterators, it is the path of sadness.
      b1ae51a [Holden Karau] Fix some of the types in the streaming JavaAPI suite. Probably still needs more work
      37888ec [Holden Karau] core/tests now pass
      249abde [Holden Karau] org.apache.spark.rdd.PairRDDFunctionsSuite passes
      6698186 [Holden Karau] Revert "I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy"
      fe992fe [Holden Karau] hmmm try and fix up basic operation suite
      172705c [Holden Karau] Fix Java API suite
      caafa63 [Holden Karau] I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy
      88b3329 [Holden Karau] Fix groupbykey to actually give back an iterator
      4991af6 [Holden Karau] Fix some tests
      be50246 [Holden Karau] Calling size on an iterator is not so good if we want to use it after
      687ffbc [Holden Karau] This is the it compiles point of replacing Seq with Iterator and JList with JIterator in the groupby and cogroup signatures
      ce8ec545
  23. Apr 07, 2014
    • SPARK-1099: Introduce local[*] mode to infer number of cores · 0307db0f
      Aaron Davidson authored
      This is the default mode for running spark-shell and pyspark, intended to allow users running spark for the first time to see the performance benefits of using multiple cores, while not breaking backwards compatibility for users who use "local" mode and expect exactly 1 core.
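      
      A hedged example of the new master string in PySpark (app name is illustrative):
      
      ```
      from pyspark import SparkContext
      
      # "local[*]" uses all available cores; plain "local" keeps exactly one.
      sc = SparkContext("local[*]", "InferCores")
      ```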
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #182 from aarondav/110 and squashes the following commits:
      
      a88294c [Aaron Davidson] Rebased changes for new spark-shell
      a9f393e [Aaron Davidson] SPARK-1099: Introduce local[*] mode to infer number of cores
      0307db0f
  24. Apr 05, 2014
    • SPARK-1421. Make MLlib work on Python 2.6 · 0b855167
      Matei Zaharia authored
      The reason it wasn't working was passing a bytearray to stream.write(), which is not supported in Python 2.6 but is in 2.7. (This array came from NumPy when we converted data to send it over to Java). Now we just convert those bytearrays to strings of bytes, which preserves nonprintable characters as well.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #335 from mateiz/mllib-python-2.6 and squashes the following commits:
      
      f26c59f [Matei Zaharia] Update docs to no longer say we need Python 2.7
      a84d6af [Matei Zaharia] SPARK-1421. Make MLlib work on Python 2.6
      0b855167
  25. Apr 04, 2014
    • SPARK-1305: Support persisting RDD's directly to Tachyon · b50ddfde
      Haoyuan Li authored
      Moves PR #468 of apache-incubator-spark to apache-spark:
      "Adding an option to persist Spark RDD blocks into Tachyon."
      
      Author: Haoyuan Li <haoyuan@cs.berkeley.edu>
      Author: RongGu <gurongwalker@gmail.com>
      
      Closes #158 from RongGu/master and squashes the following commits:
      
      72b7768 [Haoyuan Li] merge master
      9f7fa1b [Haoyuan Li] fix code style
      ae7834b [Haoyuan Li] minor cleanup
      a8b3ec6 [Haoyuan Li] merge master branch
      e0f4891 [Haoyuan Li] better check offheap.
      55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel
      7cd4600 [RongGu] remove some logic code for tachyonstore's replication
      51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore
      8adfcfa [RongGu] address arron's comment on inTachyonSize
      120e48a [RongGu] changed the root-level dir name in Tachyon
      5cc041c [Haoyuan Li] address aaron's comments
      9b97935 [Haoyuan Li] address aaron's comments
      d9a6438 [Haoyuan Li] fix for pspark
      77d2703 [Haoyuan Li] change python api.git status
      3dcace4 [Haoyuan Li] address matei's comments
      91fa09d [Haoyuan Li] address patrick's comments
      589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE
      64348b2 [Haoyuan Li] update conf docs.
      ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1
      619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore
      be79d77 [RongGu] find a way to clean up some unnecessay metods and classed to make the code simpler
      49cc724 [Haoyuan Li] update docs with off_headp option
      4572f9f [RongGu] reserving the old apply function API of StorageLevel
      04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP
      c9aeabf [RongGu] rename the StorgeLevel.TACHYON as StorageLevel.OFF_HEAP
      76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md
      e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments
      fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon;minor code style fix
      939e467 [Haoyuan Li] 0.4.1-thrift from maven central
      86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1
      16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift
      eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
      bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem
      6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
      d827250 [RongGu] fix JsonProtocolSuie test failure
      716e93b [Haoyuan Li] revert the version
      ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift
      2825a13 [RongGu] up-merging to the current master branch of the apache spark
      6a22c1a [Haoyuan Li] fix scalastyle
      8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client.
      77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice.
      1dcadf9 [Haoyuan Li] typo
      bf278fa [Haoyuan Li] fix python tests
      e82909c [Haoyuan Li] minor cleanup
      776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR
      8859371 [Haoyuan Li] various minor fixes and clean up
      e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode.
      fcaeab2 [Haoyuan Li] address Aaron's comment
      e554b1e [Haoyuan Li] add python code
      47304b3 [Haoyuan Li] make tachyonStore in BlockMananger lazy val; add more comments StorageLevels.
      dc8ef24 [Haoyuan Li] add old storelevel constructor
      e01a271 [Haoyuan Li] update tachyon 0.4.1
      8011a96 [RongGu] fix a brought-in mistake in StorageLevel
      70ca182 [RongGu] a bit change in comment
      556978b [RongGu] fix the scalastyle errors
      791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark
      b50ddfde
    • SPARK-1414. Python API for SparkContext.wholeTextFiles · 60e18ce7
      Matei Zaharia authored
      Also clarified the comment on each file having to fit in memory.
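      
      A hedged usage example (path is illustrative; assumes a shell `sc`):
      
      ```
      # Each element is a (path, contents) pair, so every file must fit in memory.
      files = sc.wholeTextFiles("file:///tmp/logs")
      first_path, first_contents = files.first()
      ```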
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #327 from mateiz/py-whole-files and squashes the following commits:
      
      9ad64a5 [Matei Zaharia] SPARK-1414. Python API for SparkContext.wholeTextFiles
      60e18ce7
  26. Apr 03, 2014
    • Spark 1162 Implemented takeOrdered in pyspark. · c1ea3afb
      Prashant Sharma authored
      Since Python does not have a library for a max heap, and the usual tricks like inverting values do not work for all cases, we have our own implementation of a max heap.
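      
      A hedged example of the new method, with the optional key argument assumed to match the Scala API (assumes a shell `sc`):
      
      ```
      sc.parallelize([10, 1, 2, 9, 3, 4]).takeOrdered(3)                    # [1, 2, 3]
      sc.parallelize([10, 1, 2, 9, 3, 4]).takeOrdered(3, key=lambda x: -x)  # [10, 9, 4]
      ```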
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #97 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered2 and squashes the following commits:
      
      35f86ba [Prashant Sharma] code review
      2b1124d [Prashant Sharma] fixed tests
      e8a08e2 [Prashant Sharma] Code review comments.
      49e6ba7 [Prashant Sharma] SPARK-1162 added takeOrdered to pyspark
      c1ea3afb
  27. Apr 02, 2014
    • [SPARK-1212, Part II] Support sparse data in MLlib · 9c65fa76
      Xiangrui Meng authored
      In PR https://github.com/apache/spark/pull/117, we added dense/sparse vector data model and updated KMeans to support sparse input. This PR is to replace all other `Array[Double]` usage by `Vector` in generalized linear models (GLMs) and Naive Bayes. Major changes:
      
      1. `LabeledPoint` becomes `LabeledPoint(Double, Vector)`.
      2. Methods that accept `RDD[Array[Double]]` now accept `RDD[Vector]`. We cannot support both in an elegant way because of type erasure.
      3. Mark 'createModel' and 'predictPoint' protected because they are not for end users.
      4. Add libSVMFile to MLContext.
      5. NaiveBayes can accept arbitrary labels (introducing a breaking change to Python's `NaiveBayesModel`).
      6. Gradient computation no longer creates temp vectors.
      7. Column normalization and centering are removed from Lasso and Ridge because the operation will densify the data. Simple feature transformation can be done before training.
      
      TODO:
      1. ~~Use axpy when possible.~~
      2. ~~Optimize Naive Bayes.~~
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #245 from mengxr/vector and squashes the following commits:
      
      eb6e793 [Xiangrui Meng] move libSVMFile to MLUtils and rename to loadLibSVMData
      c26c4fc [Xiangrui Meng] update DecisionTree to use RDD[Vector]
      11999c7 [Xiangrui Meng] Merge branch 'master' into vector
      f7da54b [Xiangrui Meng] add minSplits to libSVMFile
      da25e24 [Xiangrui Meng] revert the change to default addIntercept because it might change the behavior of existing code without warning
      493f26f [Xiangrui Meng] Merge branch 'master' into vector
      7c1bc01 [Xiangrui Meng] add a TODO to NB
      b9b7ef7 [Xiangrui Meng] change default value of addIntercept to false
      b01df54 [Xiangrui Meng] allow to change or clear threshold in LR and SVM
      4addc50 [Xiangrui Meng] merge master
      4ca5b1b [Xiangrui Meng] remove normalization from Lasso and update tests
      f04fe8a [Xiangrui Meng] remove normalization from RidgeRegression and update tests
      d088552 [Xiangrui Meng] use static constructor for MLContext
      6f59eed [Xiangrui Meng] update libSVMFile to determine number of features automatically
      3432e84 [Xiangrui Meng] update NaiveBayes to support sparse data
      0f8759b [Xiangrui Meng] minor updates to NB
      b11659c [Xiangrui Meng] style update
      78c4671 [Xiangrui Meng] add libSVMFile to MLContext
      f0fe616 [Xiangrui Meng] add a test for sparse linear regression
      44733e1 [Xiangrui Meng] use in-place gradient computation
      e981396 [Xiangrui Meng] use axpy in Updater
      db808a1 [Xiangrui Meng] update JavaLR example
      befa592 [Xiangrui Meng] passed scala/java tests
      75c83a4 [Xiangrui Meng] passed test compile
      1859701 [Xiangrui Meng] passed compile
      834ada2 [Xiangrui Meng] optimized MLUtils.computeStats update some ml algorithms to use Vector (cont.)
      135ab72 [Xiangrui Meng] merge glm
      0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected
      d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used
      3f346ba [Xiangrui Meng] update some ml algorithms to use Vector
      9c65fa76
  28. Mar 30, 2014
    • SPARK-1336 Reducing the output of run-tests script. · df1b9f7b
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Prashant Sharma <scrapcodes@gmail.com>
      
      Closes #262 from ScrapCodes/SPARK-1336/ReduceVerbosity and squashes the following commits:
      
      87dfa54 [Prashant Sharma] Further reduction in noise and made pyspark tests to fail fast.
      811170f [Prashant Sharma] Reducing the ouput of run-tests script.
      df1b9f7b