  1. Aug 06, 2014
    • [PySpark] Add blanklines to Python docstrings so example code renders correctly · e537b33c
      RJ Nowling authored
      Author: RJ Nowling <rnowling@gmail.com>
      
      Closes #1808 from rnowling/pyspark_docs and squashes the following commits:
      
      c06d774 [RJ Nowling] Add blanklines to Python docstrings so example code renders correctly
      e537b33c
    • [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically · d614967b
      Nicholas Chammas authored
      As described in [SPARK-2627](https://issues.apache.org/jira/browse/SPARK-2627), we'd like Python code to automatically be checked for PEP 8 compliance by Jenkins. This pull request aims to do that.
      
      Notes:
      * We may need to install [`pep8`](https://pypi.python.org/pypi/pep8) on the build server.
      * I'm expecting tests to fail now that PEP 8 compliance is being checked as part of the build. I'm fine with cleaning up any remaining PEP 8 violations as part of this pull request.
      * I did not understand why the RAT and scalastyle reports are saved to text files. I did the same for the PEP 8 check, but only so that the console output style can match those for the RAT and scalastyle checks. The PEP 8 report is removed right after the check is complete.
      * Updates to the ["Contributing to Spark"](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) guide will be submitted elsewhere, as I don't believe that text is part of the Spark repo.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1744 from nchammas/master and squashes the following commits:
      
      274b238 [Nicholas Chammas] [SPARK-2627] [PySpark] minor indentation changes
      983d963 [nchammas] Merge pull request #5 from apache/master
      1db5314 [nchammas] Merge pull request #4 from apache/master
      0e0245f [Nicholas Chammas] [SPARK-2627] undo erroneous whitespace fixes
      bf30942 [Nicholas Chammas] [SPARK-2627] PEP8: comment spacing
      6db9a44 [nchammas] Merge pull request #3 from apache/master
      7b4750e [Nicholas Chammas] merge upstream changes
      91b7584 [Nicholas Chammas] [SPARK-2627] undo unnecessary line breaks
      44e3e56 [Nicholas Chammas] [SPARK-2627] use tox.ini to exclude files
      b09fae2 [Nicholas Chammas] don't wrap comments unnecessarily
      bfb9f9f [Nicholas Chammas] [SPARK-2627] keep up with the PEP 8 fixes
      9da347f [nchammas] Merge pull request #2 from apache/master
      aa5b4b5 [Nicholas Chammas] [SPARK-2627] follow Spark bash style for if blocks
      d0a83b9 [Nicholas Chammas] [SPARK-2627] check that pep8 downloaded fine
      dffb5dd [Nicholas Chammas] [SPARK-2627] download pep8 at runtime
      a1ce7ae [Nicholas Chammas] [SPARK-2627] space out test report sections
      21da538 [Nicholas Chammas] [SPARK-2627] it's PEP 8, not PEP8
      6f4900b [Nicholas Chammas] [SPARK-2627] more misc PEP 8 fixes
      fe57ed0 [Nicholas Chammas] removing merge conflict backups
      9c01d4c [nchammas] Merge pull request #1 from apache/master
      9a66cb0 [Nicholas Chammas] resolving merge conflicts
      a31ccc4 [Nicholas Chammas] [SPARK-2627] miscellaneous PEP 8 fixes
      beaa9ac [Nicholas Chammas] [SPARK-2627] fail check on non-zero status
      723ed39 [Nicholas Chammas] always delete the report file
      0541ebb [Nicholas Chammas] [SPARK-2627] call Python linter from run-tests
      12440fa [Nicholas Chammas] [SPARK-2627] add Scala linter
      61c07b9 [Nicholas Chammas] [SPARK-2627] add Python linter
      75ad552 [Nicholas Chammas] make check output style consistent
      d614967b
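
      For illustration, a minimal sketch of the kind of check this wires into the build, using the pep8 package in-process (the build itself invokes the command-line tool; the file path here is illustrative and assumes pep8 is installed):

      ```
      import pep8

      # check one file and report the number of PEP 8 violations
      style = pep8.StyleGuide()
      report = style.check_files(['python/pyspark/rdd.py'])
      print("PEP 8 violations: %d" % report.total_errors)
      ```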
  2. Aug 01, 2014
    • [SPARK-2010] [PySpark] [SQL] support nested structure in SchemaRDD · 880eabec
      Davies Liu authored
      Convert each Row in a JavaSchemaRDD into an Array[Any] and unpickle it as a tuple in Python, then convert it into a namedtuple, so users can access fields just like attributes.
      
      This lets nested structures be accessed as objects; it also reduces the size of the serialized data and improves performance.
      
      root
       |-- field1: integer (nullable = true)
       |-- field2: string (nullable = true)
       |-- field3: struct (nullable = true)
       |    |-- field4: integer (nullable = true)
       |    |-- field5: array (nullable = true)
       |    |    |-- element: integer (containsNull = false)
       |-- field6: array (nullable = true)
       |    |-- element: struct (containsNull = false)
       |    |    |-- field7: string (nullable = true)
      
      Then we can access them via row.field3.field5[0] or row.field6[5].field7.
      
      It also infers the schema in Python: Row/dict/namedtuple/objects are converted into tuples before serialization, then applySchema is called in the JVM. During inferSchema(), a top-level dict in a row becomes a StructType, but any nested dictionary becomes a MapType.
      
      You can use pyspark.sql.Row to convert an unnamed structure into a Row object, making the RDD's schema inferable. For example:
      
      ctx.inferSchema(rdd.map(lambda x: Row(a=x[0], b=x[1])))
      
      Or you could use Row to create a class just like namedtuple, for example:
      
      Person = Row("name", "age")
      ctx.inferSchema(rdd.map(lambda x: Person(*x)))
      
      Also, you can call applySchema to apply a schema to an RDD of tuples/lists and turn it into a SchemaRDD. The `schema` should be a StructType; see the API docs for details.
      
      schema = StructType([StructField("name", StringType(), True),
                           StructField("age", IntegerType(), True)])
      ctx.applySchema(rdd, schema)
      
      PS: In order to use a namedtuple with inferSchema, you should make the namedtuple picklable.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1598 from davies/nested and squashes the following commits:
      
      f1d15b6 [Davies Liu] verify schema with the first few rows
      8852aaf [Davies Liu] check type of schema
      abe9e6e [Davies Liu] address comments
      61b2292 [Davies Liu] add @deprecated to pythonToJavaMap
      1e5b801 [Davies Liu] improve cache of classes
      51aa135 [Davies Liu] use Row to infer schema
      e9c0d5c [Davies Liu] remove string typed schema
      353a3f2 [Davies Liu] fix code style
      63de8f8 [Davies Liu] fix typo
      c79ca67 [Davies Liu] fix serialization of nested data
      6b258b5 [Davies Liu] fix pep8
      9d8447c [Davies Liu] apply schema provided by string of names
      f5df97f [Davies Liu] refactor, address comments
      9d9af55 [Davies Liu] use array to applySchema and infer schema in Python
      84679b3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into nested
      0eaaf56 [Davies Liu] fix doc tests
      b3559b4 [Davies Liu] use generated Row instead of namedtuple
      c4ddc30 [Davies Liu] fix conflict between name of fields and variables
      7f6f251 [Davies Liu] address all comments
      d69d397 [Davies Liu] refactor
      2cc2d45 [Davies Liu] refactor
      182fb46 [Davies Liu] refactor
      bc6e9e1 [Davies Liu] switch to new Schema API
      547bf3e [Davies Liu] Merge branch 'master' into nested
      a435b5a [Davies Liu] add docs and code refactor
      2c8debc [Davies Liu] Merge branch 'master' into nested
      644665a [Davies Liu] use tuple and namedtuple for schemardd
      880eabec
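
      A minimal sketch of the nested-row access described above, assuming a running SparkContext `sc` and SQLContext `ctx` (field names follow the example schema):

      ```
      from pyspark.sql import Row

      # infer a schema with a nested struct and an array inside it
      rdd = sc.parallelize([Row(field1=1, field3=Row(field4=2, field5=[3, 4]))])
      srdd = ctx.inferSchema(rdd)
      row = srdd.first()
      print(row.field3.field5[0])  # prints 3
      ```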
  3. Jul 30, 2014
    • [SPARK-2024] Add saveAsSequenceFile to PySpark · 94d1f46f
      Kan Zhang authored
      JIRA issue: https://issues.apache.org/jira/browse/SPARK-2024
      
      This PR is a followup to #455 and adds capabilities for saving PySpark RDDs using SequenceFile or any Hadoop OutputFormats.
      
      * Added RDD methods ```saveAsSequenceFile```, ```saveAsHadoopFile``` and ```saveAsHadoopDataset```, for both old and new MapReduce APIs.
      
      * Default converter for converting common data types to Writables. Users may specify custom converters to convert to desired data types.
      
      * No out-of-the-box support for reading/writing arrays, since ArrayWritable itself doesn't have a no-arg constructor for creating an empty instance upon reading. Users need to provide ArrayWritable subtypes. Custom converters for converting arrays to suitable ArrayWritable subtypes are also needed when writing. When reading, the default converter will convert any custom ArrayWritable subtypes to ```Object[]``` and they get pickled to Python tuples.
      
      * Added HBase and Cassandra output examples to show how custom output formats and converters can be used.
      
      cc MLnick mateiz ahirreddy pwendell
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #1338 from kanzhang/SPARK-2024 and squashes the following commits:
      
      c01e3ef [Kan Zhang] [SPARK-2024] code formatting
      6591e37 [Kan Zhang] [SPARK-2024] renaming pickled -> pickledRDD
      d998ad6 [Kan Zhang] [SPARK-2024] refactoring to get method params below 10
      57a7a5e [Kan Zhang] [SPARK-2024] correcting typo
      75ca5bd [Kan Zhang] [SPARK-2024] Better type checking for batch serialized RDD
      0bdec55 [Kan Zhang] [SPARK-2024] Refactoring newly added tests
      9f39ff4 [Kan Zhang] [SPARK-2024] Adding 2 saveAsHadoopDataset tests
      0c134f3 [Kan Zhang] [SPARK-2024] Test refactoring and adding couple unbatched cases
      7a176df [Kan Zhang] [SPARK-2024] Add saveAsSequenceFile to PySpark
      94d1f46f
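
      A round-trip sketch of the new API, assuming a SparkContext `sc`; the output path is illustrative:

      ```
      rdd = sc.parallelize([(1, "a"), (2, "b"), (3, "c")])
      rdd.saveAsSequenceFile("/tmp/pyspark-seqfile-demo")
      # keys/values are converted to Writables by the default converter
      print(sorted(sc.sequenceFile("/tmp/pyspark-seqfile-demo").collect()))
      ```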
  4. Jul 26, 2014
    • [SPARK-2601] [PySpark] Fix Py4J error when transforming pickleFiles · ba46bbed
      Josh Rosen authored
      Similar to SPARK-1034, the problem was that Py4J didn’t cope well with the fake ClassTags used in the Java API.  It doesn’t look like there’s any reason why PythonRDD needs to take a ClassTag, since it just ignores the type of the previous RDD, so I removed the type parameter and we no longer pass ClassTags from Python.
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1605 from JoshRosen/spark-2601 and squashes the following commits:
      
      b68e118 [Josh Rosen] Fix Py4J error when transforming pickleFiles [SPARK-2601]
      ba46bbed
  5. Jul 25, 2014
    • [SPARK-2656] Python version of stratified sampling · 2f75a4a3
      Doris Xin authored
      Exact sample size is not supported for now.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1554 from dorx/pystratified and squashes the following commits:
      
      4ba927a [Doris Xin] use rel diff (+- 50%) instead of abs diff (+- 50)
      bdc3f8b [Doris Xin] updated unit to check sample holistically
      7713c7b [Doris Xin] Python version of stratified sampling
      2f75a4a3
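
      A sketch of the stratified-sampling API this adds (sampleByKey), assuming a SparkContext `sc`; since exact sample size is not supported, the per-key counts are only approximate:

      ```
      fractions = {"a": 0.2, "b": 0.6}
      rdd = sc.parallelize([(k, v) for k in fractions for v in range(100)])
      sample = rdd.sampleByKey(False, fractions).collect()
      # roughly 20 "a" pairs and 60 "b" pairs
      ```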
    • [SPARK-2538] [PySpark] Hash based disk spilling aggregation · 14174abd
      Davies Liu authored
      During aggregation in the Python worker, if memory usage rises above spark.executor.memory, it will perform a disk-spilling aggregation.
      
      It splits the aggregation into multiple stages; in each stage, it partitions the aggregated data by hash and dumps it to disk. After all the data has been aggregated, it merges all the stages together (partition by partition).
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1460 from davies/spill and squashes the following commits:
      
      cad91bf [Davies Liu] call gc.collect() after data.clear() to release memory as much as possible.
      37d71f7 [Davies Liu] balance the partitions
      902f036 [Davies Liu] add shuffle.py into run-tests
      dcf03a9 [Davies Liu] fix memory_info() of psutil
      67e6eba [Davies Liu] comment for MAX_TOTAL_PARTITIONS
      f6bd5d6 [Davies Liu] rollback next_limit() again, the performance difference is huge:
      e74b785 [Davies Liu] fix code style and change next_limit to memory_limit
      400be01 [Davies Liu] address all the comments
      6178844 [Davies Liu] refactor and improve docs
      fdd0a49 [Davies Liu] add long doc string for ExternalMerger
      1a97ce4 [Davies Liu] limit used memory and size of objects in partitionBy()
      e6cc7f9 [Davies Liu] Merge branch 'master' into spill
      3652583 [Davies Liu] address comments
      e78a0a0 [Davies Liu] fix style
      24cec6a [Davies Liu] get local directory by SPARK_LOCAL_DIR
      57ee7ef [Davies Liu] update docs
      286aaff [Davies Liu] let spilled aggregation in Python configurable
      e9a40f6 [Davies Liu] recursive merger
      6edbd1f [Davies Liu] Hash based disk spilling aggregation
      14174abd
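
      A toy sketch of the hash-partitioned spilling idea described above (not Spark's actual shuffle.py; all names are illustrative):

      ```
      import os
      import pickle

      def spill(aggregated, n_partitions, spill_dir):
          # dump the in-memory map to disk, one file per hash partition,
          # so each partition can later be merged on its own
          files = [open(os.path.join(spill_dir, str(i)), "ab")
                   for i in range(n_partitions)]
          for key, value in aggregated.items():
              pickle.dump((key, value), files[hash(key) % n_partitions])
          for f in files:
              f.close()
      ```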
  7. Jul 21, 2014
    • [SPARK-2494] [PySpark] make hash of None consistent across machines · 872538c6
      Davies Liu authored
      In CPython, the hash of None differs across machines, which causes wrong results during a shuffle. This PR fixes that.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1371 from davies/hash_of_none and squashes the following commits:
      
      d01745f [Davies Liu] add comments, remove outdated unit tests
      5467141 [Davies Liu] disable hijack of hash, use it only for partitionBy()
      b7118aa [Davies Liu] use __builtin__ instead of __builtins__
      839e417 [Davies Liu] hijack hash to make hash of None consistent across machines
      872538c6
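
      The idea behind the fix, sketched (the helper name is illustrative): pin None to a fixed hash and use that function only where keys are partitioned.

      ```
      def portable_hash(x):
          # CPython derives hash(None) from a memory address, so it differs
          # from process to process; pin it to a constant instead
          if x is None:
              return 0
          return hash(x)

      # used only for partitioning, e.g. rdd.partitionBy(8, portable_hash)
      ```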
  8. Jul 14, 2014
    • Made rdd.py PEP 8 compliant by using autopep8 and a little manual editing. · aab53496
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #1354 from ScrapCodes/pep8-comp-1 and squashes the following commits:
      
      9858ea8 [Prashant Sharma] Code Review
      d8851b7 [Prashant Sharma] Found # noqa works even inside comment blocks. Not sure if it works with all versions of python.
      10c0cef [Prashant Sharma] Made rdd.py PEP 8 compliant by using autopep8 and a little manual tweaking.
      aab53496
  9. Jun 20, 2014
    • [SPARK-2061] Made splits deprecated in JavaRDDLike · 010c460d
      Anant authored
      The jira for the issue can be found at: https://issues.apache.org/jira/browse/SPARK-2061
      Most of Spark has moved over to consistently using `partitions` instead of `splits`. We should do likewise and add a `partitions` method to JavaRDDLike and have `splits` just call that. We should also go through all the cases where other APIs (e.g. Python) call `splits` and change those to use the newer API.
      
      Author: Anant <anant.asty@gmail.com>
      
      Closes #1062 from anantasty/SPARK-2061 and squashes the following commits:
      
      b83ce6b [Anant] Fixed syntax issue
      21f9210 [Anant] Fixed version number in deprecation string
      9315b76 [Anant] made related changes to use partitions in python api
      8c62dd1 [Anant] Made splits deprecated in JavaRDDLike
      010c460d
    • SPARK-1868: Users should be allowed to cogroup at least 4 RDDs · 6a224c31
      Allan Douglas R. de Oliveira authored
      Adds cogroup for 4 RDDs.
      
      Author: Allan Douglas R. de Oliveira <allandouglas@gmail.com>
      
      Closes #813 from douglaz/more_cogroups and squashes the following commits:
      
      f8d6273 [Allan Douglas R. de Oliveira] Test python groupWith for one more case
      0e9009c [Allan Douglas R. de Oliveira] Added scala tests
      c3ffcdd [Allan Douglas R. de Oliveira] Added java tests
      517a67f [Allan Douglas R. de Oliveira] Added tests for python groupWith
      2f402d5 [Allan Douglas R. de Oliveira] Removed TODO
      17474f4 [Allan Douglas R. de Oliveira] Use new cogroup function
      7877a2a [Allan Douglas R. de Oliveira] Fixed code
      ba02414 [Allan Douglas R. de Oliveira] Added varargs cogroup to pyspark
      c4a8a51 [Allan Douglas R. de Oliveira] Added java cogroup 4
      e94963c [Allan Douglas R. de Oliveira] Fixed spacing
      f1ee57b [Allan Douglas R. de Oliveira] Fixed scala style issues
      d7196f1 [Allan Douglas R. de Oliveira] Allow the cogroup of 4 RDDs
      6a224c31
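
      On the Python side, this surfaces as varargs groupWith; a sketch assuming a SparkContext `sc`:

      ```
      w = sc.parallelize([("a", 5), ("b", 6)])
      x = sc.parallelize([("a", 1), ("b", 4)])
      y = sc.parallelize([("a", 2)])
      z = sc.parallelize([("b", 42)])
      grouped = w.groupWith(x, y, z).collect()  # cogroups all four RDDs by key
      ```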
    • SPARK-2203: PySpark defaults to use same num reduce partitions as map side · f46e02fc
      Aaron Davidson authored
      For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), PySpark will always assume that the default parallelism to use for the reduce side is ctx.defaultParallelism, which is a constant typically determined by the number of cores in the cluster.
      
      In contrast, Spark's Partitioner#defaultPartitioner will use the same number of reduce partitions as map partitions unless the defaultParallelism config is explicitly set. This tends to be a better default in order to avoid OOMs, and should also be the behavior of PySpark.
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-2203
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #1138 from aarondav/pyfix and squashes the following commits:
      
      1bd5751 [Aaron Davidson] SPARK-2203: PySpark defaults to use same num reduce partitions as map partitions
      f46e02fc
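
      A hypothetical helper illustrating the defaulting rule (not Spark's actual code; `_conf` is a private attribute), assuming a PySpark RDD `rdd`:

      ```
      def default_reduce_partitions(rdd):
          # match the map side's partition count unless the user explicitly
          # configured spark.default.parallelism
          if rdd.ctx._conf.contains("spark.default.parallelism"):
              return rdd.ctx.defaultParallelism
          return rdd.getNumPartitions()
      ```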
  10. Jun 17, 2014
    • SPARK-2146. Fix takeOrdered doc · 2794990e
      Sandy Ryza authored
      Removes Python syntax in Scaladoc, corrects result in Scaladoc, and removes irrelevant cache() call in Python doc.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1086 from sryza/sandy-spark-2146 and squashes the following commits:
      
      185ff18 [Sandy Ryza] Use Seq instead of Array
      c996120 [Sandy Ryza] SPARK-2146.  Fix takeOrdered doc
      2794990e
    • SPARK-1063 Add .sortBy(f) method on RDD · b92d16b1
      Andrew Ash authored
      This never got merged from the apache/incubator-spark repo (which is now deleted) but there had been several rounds of code review on this PR there.
      
      I think this is ready for merging.
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Reynold Xin <rxin@apache.org>
      
      Closes #369 from ash211/sortby and squashes the following commits:
      
      d09147a [Andrew Ash] Fix Ordering import
      43d0a53 [Andrew Ash] Fix missing .collect()
      29a54ed [Andrew Ash] Re-enable test by converting to a closure
      5a95348 [Andrew Ash] Add license for RDDSuiteUtils
      64ed6e3 [Andrew Ash] Remove leaked diff
      d4de69a [Andrew Ash] Remove scar tissue
      63638b5 [Andrew Ash] Add Python version of .sortBy()
      45e0fde [Andrew Ash] Add Java version of .sortBy()
      adf84c5 [Andrew Ash] Re-indent to keep line lengths under 100 chars
      9d9b9d8 [Andrew Ash] Use parentheses on .collect() calls
      0457b69 [Andrew Ash] Ignore failing test
      99f0baf [Andrew Ash] Merge branch 'master' into sortby
      222ae97 [Andrew Ash] Try moving Ordering objects out to a different class
      3fd0dd3 [Andrew Ash] Add (failing) test for sortByKey with explicit Ordering
      b8b5bbc [Andrew Ash] Align remove extra spaces that were used to align ='s in test code
      8c53298 [Andrew Ash] Actually use ascending and numPartitions parameters
      381eef2 [Andrew Ash] Correct silly typo
      7db3e84 [Andrew Ash] Support ascending and numPartitions params in sortBy()
      0f685fd [Andrew Ash] Merge remote-tracking branch 'origin/master' into sortby
      ca4490d [Andrew Ash] Add .sortBy(f) method on RDD
      b92d16b1
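
      The Python flavor of the new method, sketched assuming a SparkContext `sc`:

      ```
      tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
      sc.parallelize(tmp).sortBy(lambda x: x[0]).collect()
      # [('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
      sc.parallelize(tmp).sortBy(lambda x: x[1], ascending=False).collect()
      # [('2', 5), ('d', 4), ('1', 3), ('b', 2), ('a', 1)]
      ```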
    • [SPARK-2130] End-user friendly String repr for StorageLevel in Python · d81c08ba
      Kan Zhang authored
      JIRA issue https://issues.apache.org/jira/browse/SPARK-2130
      
      This PR adds an end-user friendly String representation for StorageLevel
      in Python, similar to ```StorageLevel.description``` in Scala.
      ```
      >>> rdd = sc.parallelize([1,2])
      >>> storage_level = rdd.getStorageLevel()
      >>> storage_level
      StorageLevel(False, False, False, False, 1)
      >>> print(storage_level)
      Serialized 1x Replicated
      ```
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #1096 from kanzhang/SPARK-2130 and squashes the following commits:
      
      7c8b98b [Kan Zhang] [SPARK-2130] Prettier epydoc output
      cc5bf45 [Kan Zhang] [SPARK-2130] End-user friendly String representation for StorageLevel in Python
      d81c08ba
  11. Jun 12, 2014
    • SPARK-1939 Refactor takeSample method in RDD to use ScaSRS · 1de1d703
      Doris Xin authored
      Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes a sampling rate greater than sample_size/total to ensure a sufficient sample size with success rate >= 0.9999. Added a unit test for the private method to validate the choice of sampling rate.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      Author: dorx <doris.s.xin@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #916 from dorx/takeSample and squashes the following commits:
      
      5b061ae [Doris Xin] merge master
      444e750 [Doris Xin] edge cases
      3de882b [dorx] Merge pull request #2 from mengxr/SPARK-1939
      82dde31 [Xiangrui Meng] update pyspark's takeSample
      48d954d [Doris Xin] remove unused imports from RDDSuite
      fb1452f [Doris Xin] allowing num to be greater than count in all cases
      1481b01 [Doris Xin] washing test tubes and making coffee
      dc699f3 [Doris Xin] give back imports removed by accident in rdd.py
      64e445b [Doris Xin] logwarnning as soon as it enters the while loop
      55518ed [Doris Xin] added TODO for logging in rdd.py
      eff89e2 [Doris Xin] addressed reviewer comments.
      ecab508 [Doris Xin] "fixed checkstyle violation
      0a9b3e3 [Doris Xin] "reviewer comment addressed"
      f80f270 [Doris Xin] Merge branch 'master' into takeSample
      ae3ad04 [Doris Xin] fixed edge cases to prevent overflow
      065ebcd [Doris Xin] Merge branch 'master' into takeSample
      9bdd36e [Doris Xin] Check sample size and move computeFraction
      e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
      7cab53a [Doris Xin] fixed import bug in rdd.py
      ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
      1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
      1de1d703
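
      One plausible reading of the oversampling rule described above, sketched in Python (the delta constant and the exact bound are illustrative, not necessarily the committed code):

      ```
      import math

      def compute_fraction(sample_size, total, delta=1e-4):
          # oversample so that P(sample >= sample_size) >= 1 - delta (~0.9999)
          fraction = float(sample_size) / total
          gamma = -math.log(delta) / total
          return min(1.0, fraction + gamma +
                     math.sqrt(gamma * gamma + 2 * gamma * fraction))
      ```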
    • SPARK-554. Add aggregateByKey. · ce92a9c1
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #705 from sryza/sandy-spark-554 and squashes the following commits:
      
      2302b8f [Sandy Ryza] Add MIMA exclude
      f52e0ad [Sandy Ryza] Fix Python tests for real
      2f3afa3 [Sandy Ryza] Fix Python test
      0b735e9 [Sandy Ryza] Fix line lengths
      ae56746 [Sandy Ryza] Fix doc (replace T with V)
      c2be415 [Sandy Ryza] Java and Python aggregateByKey
      23bf400 [Sandy Ryza] SPARK-554.  Add aggregateByKey.
      ce92a9c1
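
      A sketch of the Python API, assuming a SparkContext `sc`: compute per-key (sum, count) pairs without materializing a full groupByKey.

      ```
      rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
      seq_op = lambda acc, v: (acc[0] + v, acc[1] + 1)   # fold a value into (sum, count)
      comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])  # merge two partial results
      sorted(rdd.aggregateByKey((0, 0), seq_op, comb_op).collect())
      # [('a', (4, 2)), ('b', (2, 1))]
      ```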
    • fixed typo in docstring for min() · 43d53d51
      Jeff Thompson authored
      Hi, I found this typo while learning Spark and thought I'd do a pull request.
      
      Author: Jeff Thompson <jeffreykeatingthompson@gmail.com>
      
      Closes #1065 from jkthompson/docstring-typo-minmax and squashes the following commits:
      
      29b6a26 [Jeff Thompson] fixed typo in docstring for min()
      43d53d51
  12. Jun 09, 2014
    • [SPARK-1308] Add getNumPartitions to pyspark RDD · 6113ac15
      Syed Hashmi authored
      Add getNumPartitions to the PySpark RDD to provide an intuitive way to get the number of partitions of an RDD, like we can in Scala today.
      
      Author: Syed Hashmi <shashmi@cloudera.com>
      
      Closes #995 from syedhashmi/master and squashes the following commits:
      
      de0ed5e [Syed Hashmi] [SPARK-1308] Add getNumPartitions to pyspark RDD
      6113ac15
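
      Usage is a one-liner, assuming a SparkContext `sc`:

      ```
      rdd = sc.parallelize([1, 2, 3, 4], 2)
      rdd.getNumPartitions()  # 2
      ```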
  13. Jun 03, 2014
    • [SPARK-1161] Add saveAsPickleFile and SparkContext.pickleFile in Python · 21e40ed8
      Kan Zhang authored
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #755 from kanzhang/SPARK-1161 and squashes the following commits:
      
      24ed8a2 [Kan Zhang] [SPARK-1161] Fixing doc tests
      44e0615 [Kan Zhang] [SPARK-1161] Adding an optional batchSize with default value 10
      d929429 [Kan Zhang] [SPARK-1161] Add saveAsObjectFile and SparkContext.objectFile in Python
      21e40ed8
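
      A round-trip sketch of the pair of methods, assuming a SparkContext `sc` (the path is illustrative):

      ```
      path = "/tmp/pyspark-picklefile-demo"
      sc.parallelize(range(10)).saveAsPickleFile(path, batchSize=5)
      sorted(sc.pickleFile(path, 3).collect())
      # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
      ```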
    • [SPARK-1468] Modify the partition function used by partitionBy. · 8edc9d03
      Erik Selin authored
      Make partitionBy use a tweaked version of hash as its default partition function, since the Python hash function does not consistently assign the same value to None across Python processes.
      
      Associated JIRA at https://issues.apache.org/jira/browse/SPARK-1468
      
      Author: Erik Selin <erik.selin@jadedpixel.com>
      
      Closes #371 from tyro89/consistent_hashing and squashes the following commits:
      
      201c301 [Erik Selin] Make partitionBy use a tweaked version of hash as its default partition function since the python hash function does not consistently assign the same value to None across python processes.
      8edc9d03
  14. May 31, 2014
    • SPARK-1839: PySpark RDD#take() shouldn't always read from driver · 9909efc1
      Aaron Davidson authored
      This patch simply ports over the Scala implementation of RDD#take(), which reads the first partition at the driver, then decides how many more partitions it needs to read and will possibly start a real job if it's more than 1. (Note that SparkContext#runJob(allowLocal=true) only runs the job locally if there's 1 partition selected and no parent stages.)
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #922 from aarondav/take and squashes the following commits:
      
      fa06df9 [Aaron Davidson] SPARK-1839: PySpark RDD#take() shouldn't always read from driver
      9909efc1
  17. Apr 29, 2014
    • [SPARK-1674] fix interrupted system call error in pyspark's RDD.pipe · d33df1c1
      Xiangrui Meng authored
      `RDD.pipe`'s doctest throws interrupted system call exception on Mac. It can be fixed by wrapping `pipe.stdout.readline` in an iterator.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #594 from mengxr/pyspark-pipe and squashes the following commits:
      
      cc32ac9 [Xiangrui Meng] fix interrupted system call error in pyspark's RDD.pipe
      d33df1c1
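
      The pattern behind the fix, sketched outside Spark with a plain subprocess (assumes a Unix `cat`; inside RDD.pipe the same iter() wrapping is applied to pipe.stdout.readline):

      ```
      import subprocess

      proc = subprocess.Popen(["cat"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
      proc.stdin.write(b"1\n2\n3\n")
      proc.stdin.close()
      # consume stdout via an iterator over readline instead of a raw generator
      for line in iter(proc.stdout.readline, b""):
          print(line.rstrip())
      ```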
  18. Apr 25, 2014
    • SPARK-1242 Add aggregate to python rdd · e03bc379
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #139 from holdenk/add_aggregate_to_python_api and squashes the following commits:
      
      0f39ae3 [Holden Karau] Merge in master
      4879c75 [Holden Karau] CR feedback, fix issue with empty RDDs in aggregate
      70b4724 [Holden Karau] Style fixes from code review
      96b047b [Holden Karau] Add aggregate to python rdd
      e03bc379
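
      A sketch assuming a SparkContext `sc`: an aggregate whose zero value differs in type from the elements, here a (sum, count) pair.

      ```
      seq_op = lambda acc, v: (acc[0] + v, acc[1] + 1)
      comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])
      sc.parallelize([1, 2, 3, 4]).aggregate((0, 0), seq_op, comb_op)
      # (10, 4)
      ```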
  19. Apr 24, 2014
    • SPARK-1438 RDD.sample() make seed param optional · 35e3d199
      Arun Ramakrishnan authored
      Copying from the previous pull request: https://github.com/apache/spark/pull/462
      
      It's probably better to let the underlying language implementation take care of the default. This was easier to do in Python, as the default value for seed in random and numpy.random is None.
      
      On the Scala/Java side it might mean propagating an Option or null (oh no!) down the chain to where the Random is constructed. But it looks like the convention in some other methods was to use System.nanoTime, so I followed that convention.
      
      This conflicts with the overloaded method sql.SchemaRDD.sample, which also defines default params:
      sample(fraction, withReplacement=false, seed=math.random)
      Scala does not allow more than one overload to have default params. I believe the author intended to override the RDD.sample method, not overload it, so I changed it.
      
      If backward compatibility is important, 3 new methods can be introduced (without default params) like this:
      sample(fraction)
      sample(fraction, withReplacement)
      sample(fraction, withReplacement, seed)
      
      Added some tests for the scala RDD takeSample method.
      
      Author: Arun Ramakrishnan <smartnut007@gmail.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Matei Zaharia <matei@databricks.com>
      
      Closes #477 from smartnut007/master and squashes the following commits:
      
      07bb06e [Arun Ramakrishnan] SPARK-1438 fixing more space formatting issues
      b9ebfe2 [Arun Ramakrishnan] SPARK-1438 removing redundant import of random in python rddsampler
      8d05b1a [Arun Ramakrishnan] SPARK-1438 RDD . Replace System.nanoTime with a Random generated number. python: use a separate instance of Random instead of seeding language api global Random instance.
      69619c6 [Arun Ramakrishnan] SPARK-1438 fix spacing issue
      0c247db [Arun Ramakrishnan] SPARK-1438 RDD language apis to support optional seed in RDD methods sample/takeSample
      35e3d199
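
      On the Python side, the seed simply defaults to None; a sketch assuming a SparkContext `sc`:

      ```
      rdd = sc.parallelize(range(100), 4)
      rdd.sample(False, 0.1).count()        # seed drawn internally
      rdd.sample(False, 0.1, 81).collect()  # explicit seed, reproducible
      ```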
  20. Apr 08, 2014
    • Spark 1271: Co-Group and Group-By should pass Iterable[X] · ce8ec545
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #242 from holdenk/spark-1320-cogroupandgroupshouldpassiterator and squashes the following commits:
      
      f289536 [Holden Karau] Fix bad merge, should have been Iterable rather than Iterator
      77048f8 [Holden Karau] Fix merge up to master
      d3fe909 [Holden Karau] use toSeq instead
      7a092a3 [Holden Karau] switch resultitr to resultiterable
      eb06216 [Holden Karau] maybe I should have had a coffee first. use correct import for guava iterables
      c5075aa [Holden Karau] If guava 14 had iterables
      2d06e10 [Holden Karau] Fix Java 8 cogroup tests for the new API
      11e730c [Holden Karau] Fix streaming tests
      66b583d [Holden Karau] Fix the core test suite to compile
      4ed579b [Holden Karau] Refactor from iterator to iterable
      d052c07 [Holden Karau] Python tests now pass with iterator pandas
      3bcd81d [Holden Karau] Revert "Try and make pickling list iterators work"
      cd1e81c [Holden Karau] Try and make pickling list iterators work
      c60233a [Holden Karau] Start investigating moving to iterators for python API like the Java/Scala one. tl;dr: We will have to write our own iterator since the default one doesn't pickle well
      88a5cef [Holden Karau] Fix cogroup test in JavaAPISuite for streaming
      a5ee714 [Holden Karau] oops, was checking wrong iterator
      e687f21 [Holden Karau] Fix groupbykey test in JavaAPISuite of streaming
      ec8cc3e [Holden Karau] Fix test issues\!
      4b0eeb9 [Holden Karau] Switch cast in PairDStreamFunctions
      fa395c9 [Holden Karau] Revert "Add a join based on the problem in SVD"
      ec99e32 [Holden Karau] Revert "Revert this but for now put things in list pandas"
      b692868 [Holden Karau] Revert
      7e533f7 [Holden Karau] Fix the bug
      8a5153a [Holden Karau] Revert me, but we have some stuff to debug
      b4e86a9 [Holden Karau] Add a join based on the problem in SVD
      c4510e2 [Holden Karau] Revert this but for now put things in list pandas
      b4e0b1d [Holden Karau] Fix style issues
      71e8b9f [Holden Karau] I really need to stop calling size on iterators, it is the path of sadness.
      b1ae51a [Holden Karau] Fix some of the types in the streaming JavaAPI suite. Probably still needs more work
      37888ec [Holden Karau] core/tests now pass
      249abde [Holden Karau] org.apache.spark.rdd.PairRDDFunctionsSuite passes
      6698186 [Holden Karau] Revert "I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy"
      fe992fe [Holden Karau] hmmm try and fix up basic operation suite
      172705c [Holden Karau] Fix Java API suite
      caafa63 [Holden Karau] I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy
      88b3329 [Holden Karau] Fix groupbykey to actually give back an iterator
      4991af6 [Holden Karau] Fix some tests
      be50246 [Holden Karau] Calling size on an iterator is not so good if we want to use it after
      687ffbc [Holden Karau] This is the it compiles point of replacing Seq with Iterator and JList with JIterator in the groupby and cogroup signatures
      ce8ec545
  21. Apr 04, 2014
    • SPARK-1305: Support persisting RDD's directly to Tachyon · b50ddfde
      Haoyuan Li authored
      Moves PR #468 from apache-incubator-spark to apache-spark:
      "Adding an option to persist Spark RDD blocks into Tachyon."
      
      Author: Haoyuan Li <haoyuan@cs.berkeley.edu>
      Author: RongGu <gurongwalker@gmail.com>
      
      Closes #158 from RongGu/master and squashes the following commits:
      
      72b7768 [Haoyuan Li] merge master
      9f7fa1b [Haoyuan Li] fix code style
      ae7834b [Haoyuan Li] minor cleanup
      a8b3ec6 [Haoyuan Li] merge master branch
      e0f4891 [Haoyuan Li] better check offheap.
      55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel
      7cd4600 [RongGu] remove some logic code for tachyonstore's replication
      51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore
      8adfcfa [RongGu] address arron's comment on inTachyonSize
      120e48a [RongGu] changed the root-level dir name in Tachyon
      5cc041c [Haoyuan Li] address aaron's comments
      9b97935 [Haoyuan Li] address aaron's comments
      d9a6438 [Haoyuan Li] fix for pspark
      77d2703 [Haoyuan Li] change python api.git status
      3dcace4 [Haoyuan Li] address matei's comments
      91fa09d [Haoyuan Li] address patrick's comments
      589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE
      64348b2 [Haoyuan Li] update conf docs.
      ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1
      619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore
      be79d77 [RongGu] find a way to clean up some unnecessay metods and classed to make the code simpler
      49cc724 [Haoyuan Li] update docs with off_headp option
      4572f9f [RongGu] reserving the old apply function API of StorageLevel
      04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP
      c9aeabf [RongGu] rename the StorgeLevel.TACHYON as StorageLevel.OFF_HEAP
      76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md
      e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments
      fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon;minor code style fix
      939e467 [Haoyuan Li] 0.4.1-thrift from maven central
      86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1
      16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift
      eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
      bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem
      6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
      d827250 [RongGu] fix JsonProtocolSuie test failure
      716e93b [Haoyuan Li] revert the version
      ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift
      2825a13 [RongGu] up-merging to the current master branch of the apache spark
      6a22c1a [Haoyuan Li] fix scalastyle
      8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client.
      77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice.
      1dcadf9 [Haoyuan Li] typo
      bf278fa [Haoyuan Li] fix python tests
      e82909c [Haoyuan Li] minor cleanup
      776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR
      8859371 [Haoyuan Li] various minor fixes and clean up
      e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode.
      fcaeab2 [Haoyuan Li] address Aaron's comment
      e554b1e [Haoyuan Li] add python code
      47304b3 [Haoyuan Li] make tachyonStore in BlockMananger lazy val; add more comments StorageLevels.
      dc8ef24 [Haoyuan Li] add old storelevel constructor
      e01a271 [Haoyuan Li] update tachyon 0.4.1
      8011a96 [RongGu] fix a brought-in mistake in StorageLevel
      70ca182 [RongGu] a bit change in comment
      556978b [RongGu] fix the scalastyle errors
      791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark
      b50ddfde
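
      From Python, the option surfaces as the OFF_HEAP storage level; a sketch assuming a SparkContext `sc` and a running Tachyon instance:

      ```
      from pyspark import StorageLevel

      rdd = sc.parallelize(range(1000))
      rdd.persist(StorageLevel.OFF_HEAP)  # blocks go to Tachyon, off the JVM heap
      ```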
  22. Apr 03, 2014
    • Spark 1162 Implemented takeOrdered in pyspark. · c1ea3afb
      Prashant Sharma authored
      Since Python does not have a library for a max-heap, and the usual tricks like inverting values do not work for all cases, we have our own max-heap implementation.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #97 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered2 and squashes the following commits:
      
      35f86ba [Prashant Sharma] code review
      2b1124d [Prashant Sharma] fixed tests
      e8a08e2 [Prashant Sharma] Code review comments.
      49e6ba7 [Prashant Sharma] SPARK-1162 added takeOrdered to pyspark
      c1ea3afb
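
      A sketch assuming a SparkContext `sc` (the key parameter mirrors top()):

      ```
      sc.parallelize([10, 1, 2, 9, 3, 4, 5, 6, 7]).takeOrdered(6)
      # [1, 2, 3, 4, 5, 6]
      sc.parallelize([10, 1, 2, 9, 3, 4, 5, 6, 7]).takeOrdered(6, key=lambda x: -x)
      # [10, 9, 7, 6, 5, 4]
      ```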
  24. Mar 19, 2014
    • Added doctest for map function in rdd.py · 67fa71cb
      Jyotiska NK authored
      Doctest added for map in rdd.py
      
      Author: Jyotiska NK <jyotiska123@gmail.com>
      
      Closes #177 from jyotiska/pyspark_rdd_map_doctest and squashes the following commits:
      
      a38527f [Jyotiska NK] Added doctest for map function in rdd.py
      67fa71cb
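
      The doctest in question, roughly (assuming a SparkContext `sc`):

      ```
      rdd = sc.parallelize(["b", "a", "c"])
      sorted(rdd.map(lambda x: (x, 1)).collect())
      # [('a', 1), ('b', 1), ('c', 1)]
      ```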
  25. Mar 18, 2014
    • Spark 1246 add min max to stat counter · e3681f26
      Dan McClary authored
      Here's the addition of min and max to statscounter.py and min and max methods to rdd.py.
      
      Author: Dan McClary <dan.mcclary@gmail.com>
      
      Closes #144 from dwmclary/SPARK-1246-add-min-max-to-stat-counter and squashes the following commits:
      
      fd3fd4b [Dan McClary] fixed  error, updated test
      82cde0e [Dan McClary] flipped incorrectly assigned inf values in StatCounter
      5d96799 [Dan McClary] added max and min to StatCounter repr for pyspark
      21dd366 [Dan McClary] added max and min to StatCounter output, updated doc
      1a97558 [Dan McClary] added max and min to StatCounter output, updated doc
      a5c13b0 [Dan McClary] Added min and max to Scala and Java RDD, added min and max to StatCounter
      ed67136 [Dan McClary] broke min/max out into separate transaction, added to rdd.py
      1e7056d [Dan McClary] added underscore to getBucket
      37a7dea [Dan McClary] cleaned up boundaries for histogram -- uses real min/max when buckets are derived
      29981f2 [Dan McClary] fixed indentation on doctest comment
      eaf89d9 [Dan McClary] added correct doctest for histogram
      4916016 [Dan McClary] added histogram method, added max and min to statscounter
      e3681f26
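
      A sketch of the additions, assuming a SparkContext `sc`:

      ```
      rdd = sc.parallelize([2.0, 5.0, 43.0, 10.0])
      rdd.min(), rdd.max()  # (2.0, 43.0)
      rdd.stats()           # the StatCounter output now reports min and max too
      ```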
  26. Mar 17, 2014
    • SPARK-1240: handle the case of empty RDD when takeSample · dc965463
      CodingCat authored
      https://spark-project.atlassian.net/browse/SPARK-1240
      
      It seems that the current implementation does not handle the empty RDD case when running takeSample.
      
      In this patch, before calling sample() inside the takeSample API, I add a check for this case and return an empty Array when the RDD is empty; in sample(), I also add a check for invalid fraction values.
      
      In the test cases, I also add several lines covering this.
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #135 from CodingCat/SPARK-1240 and squashes the following commits:
      
      fef57d4 [CodingCat] fix the same problem in PySpark
      36db06b [CodingCat] create new test cases for takeSample from an empty RDD
      810948d [CodingCat] further fix
      a40e8fb [CodingCat] replace if with require
      ad483fd [CodingCat] handle the case with empty RDD when take sample
      dc965463
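
      After the fix, sampling an empty RDD returns an empty result instead of failing; a sketch assuming a SparkContext `sc`:

      ```
      sc.parallelize([], 1).takeSample(False, 5)  # []
      ```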
  27. Mar 12, 2014
    • SPARK-1162 Added top in python. · b8afe305
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #93 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered and squashes the following commits:
      
      ece1fa4 [Prashant Sharma] Added top in python.
      b8afe305
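
      A sketch assuming a SparkContext `sc`:

      ```
      sc.parallelize([10, 4, 2, 12, 3]).top(2)
      # [12, 10]
      ```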
    • Spark-1163, Added missing Python RDD functions · af7f2f10
      prabinb authored
      Author: prabinb <prabin.banka@imaginea.com>
      
      Closes #92 from prabinb/python-api-rdd and squashes the following commits:
      
      51129ca [prabinb] Added missing Python RDD functions Added __repr__ function to StorageLevel class. Added doctest for RDD.getStorageLevel().
      af7f2f10
  28. Mar 10, 2014
    • SPARK-1168, Added foldByKey to pyspark. · a59419c2
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #115 from ScrapCodes/SPARK-1168/pyspark-foldByKey and squashes the following commits:
      
      db6f67e [Prashant Sharma] SPARK-1168, Added foldByKey to pyspark.
      a59419c2
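
      A sketch assuming a SparkContext `sc`:

      ```
      from operator import add

      rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
      sorted(rdd.foldByKey(0, add).collect())
      # [('a', 2), ('b', 1)]
      ```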
    • [SPARK-972] Added detailed callsite info for ValueError in context.py (resubmitted) · f5518989
      jyotiska authored
      Author: jyotiska <jyotiska123@gmail.com>
      
      Closes #34 from jyotiska/pyspark_code and squashes the following commits:
      
      c9439be [jyotiska] replaced dict with namedtuple
      a6bf4cd [jyotiska] added callsite info for context.py
      f5518989
    • SPARK-977 Added Python RDD.zip function · e1e09e0e
      Prabin Banka authored
      This was raised earlier as part of apache/incubator-spark#486.
      
      Author: Prabin Banka <prabin.banka@imaginea.com>
      
      Closes #76 from prabinb/python-api-zip and squashes the following commits:
      
      b1a31a0 [Prabin Banka] Added Python RDD.zip function
      e1e09e0e
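
      A sketch assuming a SparkContext `sc`:

      ```
      x = sc.parallelize(range(0, 5))
      y = sc.parallelize(range(1000, 1005))
      x.zip(y).collect()
      # [(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]
      ```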