Skip to content
Snippets Groups Projects
  1. Jul 26, 2014
    • Davies Liu's avatar
      [SPARK-2652] [PySpark] Turning some default configs for PySpark · 75663b57
      Davies Liu authored
      Add several default configs for PySpark, related to serialization in JVM.
      
      spark.serializer = org.apache.spark.serializer.KryoSerializer
      spark.serializer.objectStreamReset = 100
      spark.rdd.compress = True
      
      This will help to reduce the memory usage during RDD.partitionBy()
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1568 from davies/conf and squashes the following commits:
      
      cd316f1 [Davies Liu] remove duplicated line
      f71a355 [Davies Liu] rebase to master, add spark.rdd.compress = True
      8f63f45 [Davies Liu] Merge branch 'master' into conf
      8bc9f08 [Davies Liu] fix unittest
      c04a83d [Davies Liu] some default configs for PySpark
      75663b57
    • Josh Rosen's avatar
      [SPARK-1458] [PySpark] Expose sc.version in Java and PySpark · cf3e9fd8
      Josh Rosen authored
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1596 from JoshRosen/spark-1458 and squashes the following commits:
      
      fdbb0bf [Josh Rosen] Add SparkContext.version to Python & Java [SPARK-1458]
      cf3e9fd8
  2. Jul 24, 2014
  3. Jul 22, 2014
    • Nicholas Chammas's avatar
      [SPARK-2470] PEP8 fixes to PySpark · 5d16d5bb
      Nicholas Chammas authored
      This pull request aims to resolve all outstanding PEP8 violations in PySpark.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1505 from nchammas/master and squashes the following commits:
      
      98171af [Nicholas Chammas] [SPARK-2470] revert PEP 8 fixes to cloudpickle
      cba7768 [Nicholas Chammas] [SPARK-2470] wrap expression list in parentheses
      e178dbe [Nicholas Chammas] [SPARK-2470] style - change position of line break
      9127d2b [Nicholas Chammas] [SPARK-2470] wrap expression lists in parentheses
      22132a4 [Nicholas Chammas] [SPARK-2470] wrap conditionals in parentheses
      24639bc [Nicholas Chammas] [SPARK-2470] fix whitespace for doctest
      7d557b7 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to tests.py
      8f8e4c0 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to storagelevel.py
      b3b96cf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to statcounter.py
      d644477 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to worker.py
      aa3a7b6 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to sql.py
      1916859 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to shell.py
      95d1d95 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to serializers.py
      a0fec2e [Nicholas Chammas] [SPARK-2470] PEP8 fixes to mllib
      c85e1e5 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to join.py
      d14f2f1 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to __init__.py
      81fcb20 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to resultiterable.py
      1bde265 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to java_gateway.py
      7fc849c [Nicholas Chammas] [SPARK-2470] PEP8 fixes to daemon.py
      ca2d28b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to context.py
      f4e0039 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to conf.py
      a6d5e4b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to cloudpickle.py
      f0a7ebf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to rddsampler.py
      4dd148f [nchammas] Merge pull request #5 from apache/master
      f7e4581 [Nicholas Chammas] unrelated pep8 fix
      a36eed0 [Nicholas Chammas] name ec2 instances and security groups consistently
      de7292a [nchammas] Merge pull request #4 from apache/master
      2e4fe00 [nchammas] Merge pull request #3 from apache/master
      89fde08 [nchammas] Merge pull request #2 from apache/master
      69f6e22 [Nicholas Chammas] PEP8 fixes
      2627247 [Nicholas Chammas] broke up lines before they hit 100 chars
      6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names
      69da6cf [nchammas] Merge pull request #1 from apache/master
      5d16d5bb
  4. Jun 20, 2014
    • Anant's avatar
      [SPARK-2061] Made splits deprecated in JavaRDDLike · 010c460d
      Anant authored
      The jira for the issue can be found at: https://issues.apache.org/jira/browse/SPARK-2061
      Most of spark has used over to consistently using `partitions` instead of `splits`. We should do likewise and add a `partitions` method to JavaRDDLike and have `splits` just call that. We should also go through all cases where other API's (e.g. Python) call `splits` and we should change those to use the newer API.
      
      Author: Anant <anant.asty@gmail.com>
      
      Closes #1062 from anantasty/SPARK-2061 and squashes the following commits:
      
      b83ce6b [Anant] Fixed syntax issue
      21f9210 [Anant] Fixed version number in deprecation string
      9315b76 [Anant] made related changes to use partitions in python api
      8c62dd1 [Anant] Made splits deprecated in JavaRDDLike
      010c460d
  5. Jun 10, 2014
    • Nick Pentreath's avatar
      SPARK-1416: PySpark support for SequenceFile and Hadoop InputFormats · f971d6cb
      Nick Pentreath authored
      So I finally resurrected this PR. It seems the old one against the incubator mirror is no longer available, so I cannot reference it.
      
      This adds initial support for reading Hadoop ```SequenceFile```s, as well as arbitrary Hadoop ```InputFormat```s, in PySpark.
      
      # Overview
      The basics are as follows:
      1. ```PythonRDD``` object contains the relevant methods, that are in turn invoked by ```SparkContext``` in PySpark
      2. The SequenceFile or InputFormat is read on the Scala side and converted from ```Writable``` instances to the relevant Scala classes (in the case of primitives)
      3. Pyrolite is used to serialize Java objects. If this fails, the fallback is ```toString```
      4. ```PickleSerializer``` on the Python side deserializes.
      
      This works "out the box" for simple ```Writable```s:
      * ```Text```
      * ```IntWritable```, ```DoubleWritable```, ```FloatWritable```
      * ```NullWritable```
      * ```BooleanWritable```
      * ```BytesWritable```
      * ```MapWritable```
      
      It also works for simple, "struct-like" classes. Due to the way Pyrolite works, this requires that the classes satisfy the JavaBeans convenstions (i.e. with fields and a no-arg constructor and getters/setters). (Perhaps in future some sugar for case classes and reflection could be added).
      
      I've tested it out with ```ESInputFormat```  as an example and it works very nicely:
      ```python
      conf = {"es.resource" : "index/type" }
      rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
      rdd.first()
      ```
      
      I suspect for things like HBase/Cassandra it will be a bit trickier to get it to work out the box.
      
      # Some things still outstanding:
      1. ~~Requires ```msgpack-python``` and will fail without it. As originally discussed with Josh, add a ```as_strings``` argument that defaults to ```False```, that can be used if ```msgpack-python``` is not available~~
      2. ~~I see from https://github.com/apache/spark/pull/363 that Pyrolite is being used there for SerDe between Scala and Python. @ahirreddy @mateiz what is the plan behind this - is Pyrolite preferred? It seems from a cursory glance that adapting the ```msgpack```-based SerDe here to use Pyrolite wouldn't be too hard~~
      3. ~~Support the key and value "wrapper" that would allow a Scala/Java function to be plugged in that would transform whatever the key/value Writable class is into something that can be serialized (e.g. convert some custom Writable to a JavaBean or ```java.util.Map``` that can be easily serialized)~~
      4. Support ```saveAsSequenceFile``` and ```saveAsHadoopFile``` etc. This would require SerDe in the reverse direction, that can be handled by Pyrolite. Will work on this as a separate PR
      
      Author: Nick Pentreath <nick.pentreath@gmail.com>
      
      Closes #455 from MLnick/pyspark-inputformats and squashes the following commits:
      
      268df7e [Nick Pentreath] Documentation changes mer @pwendell comments
      761269b [Nick Pentreath] Address @pwendell comments, simplify default writable conversions and remove registry.
      4c972d8 [Nick Pentreath] Add license headers
      d150431 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      cde6af9 [Nick Pentreath] Parameterize converter trait
      5ebacfa [Nick Pentreath] Update docs for PySpark input formats
      a985492 [Nick Pentreath] Move Converter examples to own package
      365d0be [Nick Pentreath] Make classes private[python]. Add docs and @Experimental annotation to Converter interface.
      eeb8205 [Nick Pentreath] Fix path relative to SPARK_HOME in tests
      1eaa08b [Nick Pentreath] HBase -> Cassandra app name oversight
      3f90c3e [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      2c18513 [Nick Pentreath] Add examples for reading HBase and Cassandra InputFormats from Python
      b65606f [Nick Pentreath] Add converter interface
      5757f6e [Nick Pentreath] Default key/value classes for sequenceFile asre None
      085b55f [Nick Pentreath] Move input format tests to tests.py and clean up docs
      43eb728 [Nick Pentreath] PySpark InputFormats docs into programming guide
      94beedc [Nick Pentreath] Clean up args in PythonRDD. Set key/value converter defaults to None for PySpark context.py methods
      1a4a1d6 [Nick Pentreath] Address @mateiz style comments
      01e0813 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      15a7d07 [Nick Pentreath] Remove default args for key/value classes. Arg names to camelCase
      9fe6bd5 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      84fe8e3 [Nick Pentreath] Python programming guide space formatting
      d0f52b6 [Nick Pentreath] Python programming guide
      7caa73a [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      93ef995 [Nick Pentreath] Add back context.py changes
      9ef1896 [Nick Pentreath] Recover earlier changes lost in previous merge for serializers.py
      077ecb2 [Nick Pentreath] Recover earlier changes lost in previous merge for context.py
      5af4770 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      35b8e3a [Nick Pentreath] Another fix for test ordering
      bef3afb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      e001b94 [Nick Pentreath] Fix test failures due to ordering
      78978d9 [Nick Pentreath] Add doc for SequenceFile and InputFormat support to Python programming guide
      64eb051 [Nick Pentreath] Scalastyle fix
      e7552fa [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      44f2857 [Nick Pentreath] Remove msgpack dependency and switch serialization to Pyrolite, plus some clean up and refactoring
      c0ebfb6 [Nick Pentreath] Change sequencefile test data generator to easily be called from PySpark tests
      1d7c17c [Nick Pentreath] Amend tests to auto-generate sequencefile data in temp dir
      17a656b [Nick Pentreath] remove binary sequencefile for tests
      f60959e [Nick Pentreath] Remove msgpack dependency and serializer from PySpark
      450e0a2 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      31a2fff [Nick Pentreath] Scalastyle fixes
      fc5099e [Nick Pentreath] Add Apache license headers
      4e08983 [Nick Pentreath] Clean up docs for PySpark context methods
      b20ec7e [Nick Pentreath] Clean up merge duplicate dependencies
      951c117 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      f6aac55 [Nick Pentreath] Bring back msgpack
      9d2256e [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      1bbbfb0 [Nick Pentreath] Clean up SparkBuild from merge
      a67dfad [Nick Pentreath] Clean up Msgpack serialization and registering
      7237263 [Nick Pentreath] Add back msgpack serializer and hadoop file code lost during merging
      25da1ca [Nick Pentreath] Add generator for nulls, bools, bytes and maps
      65360d5 [Nick Pentreath] Adding test SequenceFiles
      0c612e5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      d72bf18 [Nick Pentreath] msgpack
      dd57922 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      e67212a [Nick Pentreath] Add back msgpack dependency
      f2d76a0 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      41856a5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      97ef708 [Nick Pentreath] Remove old writeToStream
      2beeedb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      795a763 [Nick Pentreath] Change name to WriteInputFormatTestDataGenerator. Cleanup some var names. Use SPARK_HOME in path for writing test sequencefile data.
      174f520 [Nick Pentreath] Add back graphx settings
      703ee65 [Nick Pentreath] Add back msgpack
      619c0fa [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      1c8efbc [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      eb40036 [Nick Pentreath] Remove unused comment lines
      4d7ef2e [Nick Pentreath] Fix indentation
      f1d73e3 [Nick Pentreath] mergeConfs returns a copy rather than mutating one of the input arguments
      0f5cd84 [Nick Pentreath] Remove unused pair UTF8 class. Add comments to msgpack deserializer
      4294cbb [Nick Pentreath] Add old Hadoop api methods. Clean up and expand comments. Clean up argument names
      818a1e6 [Nick Pentreath] Add seqencefile and Hadoop InputFormat support to PythonRDD
      4e7c9e3 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      c304cc8 [Nick Pentreath] Adding supporting sequncefiles for tests. Cleaning up
      4b0a43f [Nick Pentreath] Refactoring utils into own objects. Cleaning up old commented-out code
      d86325f [Nick Pentreath] Initial WIP of PySpark support for SequenceFile and arbitrary Hadoop InputFormat
      f971d6cb
  6. Jun 03, 2014
    • Kan Zhang's avatar
      [SPARK-1161] Add saveAsPickleFile and SparkContext.pickleFile in Python · 21e40ed8
      Kan Zhang authored
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #755 from kanzhang/SPARK-1161 and squashes the following commits:
      
      24ed8a2 [Kan Zhang] [SPARK-1161] Fixing doc tests
      44e0615 [Kan Zhang] [SPARK-1161] Adding an optional batchSize with default value 10
      d929429 [Kan Zhang] [SPARK-1161] Add saveAsObjectFile and SparkContext.objectFile in Python
      21e40ed8
  7. May 31, 2014
    • Aaron Davidson's avatar
      SPARK-1839: PySpark RDD#take() shouldn't always read from driver · 9909efc1
      Aaron Davidson authored
      This patch simply ports over the Scala implementation of RDD#take(), which reads the first partition at the driver, then decides how many more partitions it needs to read and will possibly start a real job if it's more than 1. (Note that SparkContext#runJob(allowLocal=true) only runs the job locally if there's 1 partition selected and no parent stages.)
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #922 from aarondav/take and squashes the following commits:
      
      fa06df9 [Aaron Davidson] SPARK-1839: PySpark RDD#take() shouldn't always read from driver
      9909efc1
  8. May 29, 2014
    • Jyotiska NK's avatar
      Added doctest and method description in context.py · 9cff1dd2
      Jyotiska NK authored
      Added doctest for method textFile and description for methods _initialize_context and _ensure_initialized in context.py
      
      Author: Jyotiska NK <jyotiska123@gmail.com>
      
      Closes #187 from jyotiska/pyspark_context and squashes the following commits:
      
      356f945 [Jyotiska NK] Added doctest for textFile method in context.py
      5b23686 [Jyotiska NK] Updated context.py with method descriptions
      9cff1dd2
  9. May 24, 2014
    • Andrew Or's avatar
      [SPARK-1900 / 1918] PySpark on YARN is broken · 5081a0a9
      Andrew Or authored
      If I run the following on a YARN cluster
      ```
      bin/spark-submit sheep.py --master yarn-client
      ```
      it fails because of a mismatch in paths: `spark-submit` thinks that `sheep.py` resides on HDFS, and balks when it can't find the file there. A natural workaround is to add the `file:` prefix to the file:
      ```
      bin/spark-submit file:/path/to/sheep.py --master yarn-client
      ```
      However, this also fails. This time it is because python does not understand URI schemes.
      
      This PR fixes this by automatically resolving all paths passed as command line argument to `spark-submit` properly. This has the added benefit of keeping file and jar paths consistent across different cluster modes. For python, we strip the URI scheme before we actually try to run it.
      
      Much of the code is originally written by @mengxr. Tested on YARN cluster. More tests pending.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #853 from andrewor14/submit-paths and squashes the following commits:
      
      0bb097a [Andrew Or] Format path correctly before adding it to PYTHONPATH
      323b45c [Andrew Or] Include --py-files on PYTHONPATH for pyspark shell
      3c36587 [Andrew Or] Improve error messages (minor)
      854aa6a [Andrew Or] Guard against NPE if user gives pathological paths
      6638a6b [Andrew Or] Fix spark-shell jar paths after #849 went in
      3bb0359 [Andrew Or] Update more comments (minor)
      2a1f8a0 [Andrew Or] Update comments (minor)
      6af2c77 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths
      a68c4d1 [Andrew Or] Handle Windows python file path correctly
      427a250 [Andrew Or] Resolve paths properly for Windows
      a591a4a [Andrew Or] Update tests for resolving URIs
      6c8621c [Andrew Or] Move resolveURIs to Utils
      db8255e [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths
      f542dce [Andrew Or] Fix outdated tests
      691c4ce [Andrew Or] Ignore special primary resource names
      5342ac7 [Andrew Or] Add missing space in error message
      02f77f3 [Andrew Or] Resolve command line arguments to spark-submit properly
      5081a0a9
  10. May 21, 2014
  11. May 07, 2014
    • Aaron Davidson's avatar
      SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions · 3308722c
      Aaron Davidson authored
      This patch includes several cleanups to PythonRDD, focused around fixing [SPARK-1579](https://issues.apache.org/jira/browse/SPARK-1579) cleanly. Listed in order of approximate importance:
      
      - The Python daemon waits for Spark to close the socket before exiting,
        in order to avoid causing spurious IOExceptions in Spark's
        `PythonRDD::WriterThread`.
      - Removes the Python Monitor Thread, which polled for task cancellations
        in order to kill the Python worker. Instead, we do this in the
        onCompleteCallback, since this is guaranteed to be called during
        cancellation.
      - Adds a "completed" variable to TaskContext to avoid the issue noted in
        [SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019), where onCompleteCallbacks may be execution-order dependent.
        Along with this, I removed the "context.interrupted = true" flag in
        the onCompleteCallback.
      - Extracts PythonRDD::WriterThread to its own class.
      
      Since this patch provides an alternative solution to [SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019), I did test it with
      
      ```
      sc.textFile("latlon.tsv").take(5)
      ```
      
      many times without error.
      
      Additionally, in order to test the unswallowed exceptions, I performed
      
      ```
      sc.textFile("s3n://<big file>").count()
      ```
      
      and cut my internet during execution. Prior to this patch, we got the "stdin writer exited early" message, which was unhelpful. Now, we get the SocketExceptions propagated through Spark to the user and get proper (though unsuccessful) task retries.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #640 from aarondav/pyspark-io and squashes the following commits:
      
      b391ff8 [Aaron Davidson] Detect "clean socket shutdowns" and stop waiting on the socket
      c0c49da [Aaron Davidson] SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions
      3308722c
  12. May 06, 2014
    • Matei Zaharia's avatar
      [SPARK-1549] Add Python support to spark-submit · 951a5d93
      Matei Zaharia authored
      This PR updates spark-submit to allow submitting Python scripts (currently only with deploy-mode=client, but that's all that was supported before) and updates the PySpark code to properly find various paths, etc. One significant change is that we assume we can always find the Python files either from the Spark assembly JAR (which will happen with the Maven assembly build in make-distribution.sh) or from SPARK_HOME (which will exist in local mode even if you use sbt assembly, and should be enough for testing). This means we no longer need a weird hack to modify the environment for YARN.
      
      This patch also updates the Python worker manager to run python with -u, which means unbuffered output (send it to our logs right away instead of waiting a while after stuff was written); this should simplify debugging.
      
      In addition, it fixes https://issues.apache.org/jira/browse/SPARK-1709, setting the main class from a JAR's Main-Class attribute if not specified by the user, and fixes a few help strings and style issues in spark-submit.
      
      In the future we may want to make the `pyspark` shell use spark-submit as well, but it seems unnecessary for 1.0.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #664 from mateiz/py-submit and squashes the following commits:
      
      15e9669 [Matei Zaharia] Fix some uses of path.separator property
      051278c [Matei Zaharia] Small style fixes
      0afe886 [Matei Zaharia] Add license headers
      4650412 [Matei Zaharia] Add pyFiles to PYTHONPATH in executors, remove old YARN stuff, add tests
      15f8e1e [Matei Zaharia] Set PYTHONPATH in PythonWorkerFactory in case it wasn't set from outside
      47c0655 [Matei Zaharia] More work to make spark-submit work with Python:
      d4375bd [Matei Zaharia] Clean up description of spark-submit args a bit and add Python ones
      951a5d93
  13. Apr 24, 2014
    • Ahir Reddy's avatar
      [SPARK-986]: Job cancelation for PySpark · e53eb4f0
      Ahir Reddy authored
      * Additions to the PySpark API to cancel jobs
      * Monitor Thread in PythonRDD to kill Python workers if a task is interrupted
      
      Author: Ahir Reddy <ahirreddy@gmail.com>
      
      Closes #541 from ahirreddy/python-cancel and squashes the following commits:
      
      dfdf447 [Ahir Reddy] Changed success -> completed and made logging message clearer
      6c860ab [Ahir Reddy] PR Comments
      4b4100a [Ahir Reddy] Success flag
      adba6ed [Ahir Reddy] Destroy python workers
      27a2f8f [Ahir Reddy] Start the writer thread...
      d422f7b [Ahir Reddy] Remove unnecesssary vals
      adda337 [Ahir Reddy] Busy wait on the ocntext.interrupted flag, and then kill the python worker
      d9e472f [Ahir Reddy] Revert "removed unnecessary vals"
      5b9cae5 [Ahir Reddy] removed unnecessary vals
      07b54d9 [Ahir Reddy] Fix canceling unit test
      8ae9681 [Ahir Reddy] Don't interrupt worker
      7722342 [Ahir Reddy] Monitor Thread for python workers
      db04e16 [Ahir Reddy] Added canceling api to PySpark
      e53eb4f0
  14. Apr 18, 2014
    • CodingCat's avatar
      SPARK-1483: Rename minSplits to minPartitions in public APIs · e31c8ffc
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-1483
      
      From the original JIRA: " The parameter name is part of the public API in Scala and Python, since you can pass named parameters to a method, so we should name it to this more descriptive term. Everywhere else we refer to "splits" as partitions." - @mateiz
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #430 from CodingCat/SPARK-1483 and squashes the following commits:
      
      4b60541 [CodingCat] deprecate defaultMinSplits
      ba2c663 [CodingCat] Rename minSplits to minPartitions in public APIs
      e31c8ffc
  15. Apr 04, 2014
    • Haoyuan Li's avatar
      SPARK-1305: Support persisting RDD's directly to Tachyon · b50ddfde
      Haoyuan Li authored
      Move the PR#468 of apache-incubator-spark to the apache-spark
      "Adding an option to persist Spark RDD blocks into Tachyon."
      
      Author: Haoyuan Li <haoyuan@cs.berkeley.edu>
      Author: RongGu <gurongwalker@gmail.com>
      
      Closes #158 from RongGu/master and squashes the following commits:
      
      72b7768 [Haoyuan Li] merge master
      9f7fa1b [Haoyuan Li] fix code style
      ae7834b [Haoyuan Li] minor cleanup
      a8b3ec6 [Haoyuan Li] merge master branch
      e0f4891 [Haoyuan Li] better check offheap.
      55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel
      7cd4600 [RongGu] remove some logic code for tachyonstore's replication
      51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore
      8adfcfa [RongGu] address arron's comment on inTachyonSize
      120e48a [RongGu] changed the root-level dir name in Tachyon
      5cc041c [Haoyuan Li] address aaron's comments
      9b97935 [Haoyuan Li] address aaron's comments
      d9a6438 [Haoyuan Li] fix for pspark
      77d2703 [Haoyuan Li] change python api.git status
      3dcace4 [Haoyuan Li] address matei's comments
      91fa09d [Haoyuan Li] address patrick's comments
      589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE
      64348b2 [Haoyuan Li] update conf docs.
      ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1
      619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore
      be79d77 [RongGu] find a way to clean up some unnecessay metods and classed to make the code simpler
      49cc724 [Haoyuan Li] update docs with off_headp option
      4572f9f [RongGu] reserving the old apply function API of StorageLevel
      04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP
      c9aeabf [RongGu] rename the StorgeLevel.TACHYON as StorageLevel.OFF_HEAP
      76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md
      e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments
      fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon;minor code style fix
      939e467 [Haoyuan Li] 0.4.1-thrift from maven central
      86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1
      16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift
      eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
      bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem
      6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
      d827250 [RongGu] fix JsonProtocolSuie test failure
      716e93b [Haoyuan Li] revert the version
      ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift
      2825a13 [RongGu] up-merging to the current master branch of the apache spark
      6a22c1a [Haoyuan Li] fix scalastyle
      8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client.
      77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice.
      1dcadf9 [Haoyuan Li] typo
      bf278fa [Haoyuan Li] fix python tests
      e82909c [Haoyuan Li] minor cleanup
      776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR
      8859371 [Haoyuan Li] various minor fixes and clean up
      e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode.
      fcaeab2 [Haoyuan Li] address Aaron's comment
      e554b1e [Haoyuan Li] add python code
      47304b3 [Haoyuan Li] make tachyonStore in BlockMananger lazy val; add more comments StorageLevels.
      dc8ef24 [Haoyuan Li] add old storelevel constructor
      e01a271 [Haoyuan Li] update tachyon 0.4.1
      8011a96 [RongGu] fix a brought-in mistake in StorageLevel
      70ca182 [RongGu] a bit change in comment
      556978b [RongGu] fix the scalastyle errors
      791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark
      b50ddfde
    • Matei Zaharia's avatar
      SPARK-1414. Python API for SparkContext.wholeTextFiles · 60e18ce7
      Matei Zaharia authored
      Also clarified comment on each file having to fit in memory
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #327 from mateiz/py-whole-files and squashes the following commits:
      
      9ad64a5 [Matei Zaharia] SPARK-1414. Python API for SparkContext.wholeTextFiles
      60e18ce7
  16. Mar 10, 2014
  17. Mar 06, 2014
    • Prabin Banka's avatar
      SPARK-1187, Added missing Python APIs · 3d3acef0
      Prabin Banka authored
      The following Python APIs are added,
      RDD.id()
      SparkContext.setJobGroup()
      SparkContext.setLocalProperty()
      SparkContext.getLocalProperty()
      SparkContext.sparkUser()
      
      was raised earlier as a part of  apache/incubator-spark#486
      
      Author: Prabin Banka <prabin.banka@imaginea.com>
      
      Closes #75 from prabinb/python-api-backup and squashes the following commits:
      
      cc3c6cd [Prabin Banka] Added missing Python APIs
      3d3acef0
  18. Feb 20, 2014
    • Ahir Reddy's avatar
      SPARK-1114: Allow PySpark to use existing JVM and Gateway · 59b13795
      Ahir Reddy authored
      Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization.
      
      Author: Ahir Reddy <ahirreddy@gmail.com>
      
      Closes #622 from ahirreddy/pyspark-existing-jvm and squashes the following commits:
      
      a86f457 [Ahir Reddy] Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization.
      59b13795
  19. Jan 28, 2014
    • Josh Rosen's avatar
      Switch from MUTF8 to UTF8 in PySpark serializers. · 1381fc72
      Josh Rosen authored
      This fixes SPARK-1043, a bug introduced in 0.9.0
      where PySpark couldn't serialize strings > 64kB.
      
      This fix was written by @tyro89 and @bouk in #512.
      This commit squashes and rebases their pull request
      in order to fix some merge conflicts.
      1381fc72
  20. Jan 01, 2014
  21. Dec 30, 2013
  22. Dec 29, 2013
  23. Dec 28, 2013
  24. Dec 24, 2013
  25. Dec 18, 2013
  26. Nov 10, 2013
  27. Nov 03, 2013
  28. Oct 22, 2013
    • Ewen Cheslack-Postava's avatar
      Pass self to SparkContext._ensure_initialized. · 317a9eb1
      Ewen Cheslack-Postava authored
      The constructor for SparkContext should pass in self so that we track
      the current context and produce errors if another one is created. Add
      a doctest to make sure creating multiple contexts triggers the
      exception.
      317a9eb1
    • Ewen Cheslack-Postava's avatar
      Add classmethod to SparkContext to set system properties. · 56d230e6
      Ewen Cheslack-Postava authored
      Add a new classmethod to SparkContext to set system properties like is
      possible in Scala/Java. Unlike the Java/Scala implementations, there's
      no access to System until the JVM bridge is created. Since
      SparkContext handles that, move the initialization of the JVM
      connection to a separate classmethod that can safely be called
      repeatedly as long as the same instance (or no instance) is provided.
      56d230e6
  29. Sep 08, 2013
  30. Sep 07, 2013
  31. Sep 06, 2013
Loading