Commits · 1b74a27da026aba7dbe2088ee64974d772feb23d · cs525-sp18-g07 / spark

Nov 26, 2013
- Removed unused basestring case from dump_stream. · 1b74a27d
  Josh Rosen authored 11 years ago
  
  1b74a27d
Nov 10, 2013

- FramedSerializer: _dumps => dumps, _loads => loads. · 13122ceb
  Josh Rosen authored 11 years ago
  
  13122ceb
- Send PySpark commands as bytes instead of strings. · ffa5bedf
  Josh Rosen authored 11 years ago
  
  ffa5bedf

Add custom serializer support to PySpark. · cbb7f04a

Josh Rosen authored 11 years ago

For now, this only adds MarshalSerializer, but it lays the groundwork
for supporting other custom serializers.  Many of these mechanisms
can also be used to support deserialization of different data formats
sent by Java, such as data encoded by MsgPack.

This also fixes a bug in SparkContext.union().

cbb7f04a
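
As a rough illustration of the idea in the commit above, the sketch below shows a length-framed serializer base class with a marshal-backed subclass. The names FramedSerializer, MarshalSerializer, dumps, loads, and dump_stream come from this log; load_stream and the 4-byte big-endian length prefix are assumptions made for the example, and this is not the code from the commit itself.

```python
import io
import marshal
import struct


class FramedSerializer:
    """Writes each object as a length prefix followed by its payload.

    The 4-byte big-endian prefix is an assumption for this sketch.
    """

    def dumps(self, obj):
        raise NotImplementedError

    def loads(self, data):
        raise NotImplementedError

    def dump_stream(self, iterator, stream):
        for obj in iterator:
            payload = self.dumps(obj)
            stream.write(struct.pack("!i", len(payload)))
            stream.write(payload)

    def load_stream(self, stream):
        # Hypothetical counterpart to dump_stream: read frames until EOF.
        while True:
            header = stream.read(4)
            if len(header) < 4:
                return
            (length,) = struct.unpack("!i", header)
            yield self.loads(stream.read(length))


class MarshalSerializer(FramedSerializer):
    """Uses the standard-library marshal module (built-in types only)."""

    def dumps(self, obj):
        return marshal.dumps(obj)

    def loads(self, data):
        return marshal.loads(data)


buf = io.BytesIO()
MarshalSerializer().dump_stream([1, "two", (3, 4.0)], buf)
buf.seek(0)
roundtrip = list(MarshalSerializer().load_stream(buf))
```

With this split, a different format would only need to override dumps and loads, while the framing logic stays in one place.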

Nov 03, 2013

Remove Pickle-wrapping of Java objects in PySpark. · 7d68a81a

Josh Rosen authored 11 years ago

If we support custom serializers, the Python
worker will know what type of input to expect,
so we won't need to wrap Tuple2 and Strings into
pickled tuples and strings.

7d68a81a

Replace magic lengths with constants in PySpark. · a48d88d2

Josh Rosen authored 11 years ago

Write the length of the accumulators section up-front rather
than terminating it with a negative length.  I find this
easier to read.

a48d88d2
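
The two framing styles contrasted in the commit above can be sketched as follows. The function names, the END_OF_SECTION constant, and the 4-byte integer encoding are hypothetical choices for this example, not the actual PySpark wire format.

```python
import io
import struct

END_OF_SECTION = -1  # hypothetical named constant replacing a magic length


def write_terminated(stream, items):
    # Old style: emit the items, then a negative length as a terminator.
    for item in items:
        stream.write(struct.pack("!i", len(item)))
        stream.write(item)
    stream.write(struct.pack("!i", END_OF_SECTION))


def read_terminated(stream):
    items = []
    while True:
        (length,) = struct.unpack("!i", stream.read(4))
        if length == END_OF_SECTION:
            return items
        items.append(stream.read(length))


def write_counted(stream, items):
    # New style: write the number of items up front (items must be a list).
    stream.write(struct.pack("!i", len(items)))
    for item in items:
        stream.write(struct.pack("!i", len(item)))
        stream.write(item)


def read_counted(stream):
    (count,) = struct.unpack("!i", stream.read(4))
    items = []
    for _ in range(count):
        (length,) = struct.unpack("!i", stream.read(4))
        items.append(stream.read(length))
    return items
```

The counted reader is a plain bounded loop, which is arguably what makes it easier to follow than a loop that watches for a sentinel value.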

Oct 04, 2013

Fixing SPARK-602: PythonPartitioner · c84946fe

Andre Schumacher authored 11 years ago

Currently, PythonPartitioner determines the partition ID by hashing a
byte-array representation of PySpark's key. This PR lets
PythonPartitioner use the actual partition ID, which is required, for
example, for sorting via PySpark.

c84946fe
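
To see why hashing the serialized key is not enough for sorting, consider the sketch below. It is written entirely in Python for illustration; the function names, the use of zlib.crc32 as the byte hash, and the boundary values are all assumptions, not the actual PythonPartitioner logic.

```python
import bisect
import pickle
import zlib


def partition_by_bytes(key, num_partitions):
    # Hash of the serialized key: evenly spread, but unrelated to key order.
    return zlib.crc32(pickle.dumps(key)) % num_partitions


def partition_by_range(key, boundaries):
    # Partition ID chosen explicitly: keys <= boundaries[i] go to partition i,
    # and keys above the last boundary go to the final partition.
    return bisect.bisect_left(boundaries, key)


boundaries = [10, 20]  # hypothetical split points giving 3 partitions
keys = [25, 3, 14, 20, 11]
range_ids = [partition_by_range(k, boundaries) for k in keys]
```

With range_ids in hand, every key in partition 0 is smaller than every key in partition 1, and so on, so sorting each partition locally yields a global order. A byte hash gives no such guarantee, which is why the partitioner has to honor the ID computed for the key.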

Jul 16, 2013
- Add Apache license headers and LICENSE and NOTICE files · af3c9d50
  Matei Zaharia authored 12 years ago
  
  af3c9d50
Jun 21, 2013
- Add Python timing instrumentation · 40afe0d2
  Jey Kottalam authored 12 years ago
  
  40afe0d2
Jan 20, 2013
- Added accumulators to PySpark · 8e7f098a
  Matei Zaharia authored 12 years ago
  
  8e7f098a
Jan 01, 2013
- Rename top-level 'pyspark' directory to 'python' · b58340db
  Josh Rosen authored 12 years ago
  
  b58340db
Dec 29, 2012
- Use batching in pyspark parallelize(); fix cartesian() · 26186e2d
  Josh Rosen authored 12 years ago
  
  26186e2d
- Fix bug in pyspark.serializers.batch; add .gitignore. · 6ee1ff26
  Josh Rosen authored 12 years ago
  
  6ee1ff26
Dec 26, 2012
- Add support for batched serialization of Python objects in PySpark. · e2dad156
  Josh Rosen authored 12 years ago
  
  e2dad156
Dec 24, 2012

Use filesystem to collect RDDs in PySpark. · 4608902f

Josh Rosen authored 12 years ago

Passing large volumes of data through Py4J seems
to be slow.  It appears to be faster to write the
data to the local filesystem and read it back from
Python.

4608902f
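
The file-based hand-off described above might look roughly like this. In the real system the writer is the JVM side; here both halves are Python so the example is self-contained, and the function names and pickle-per-record file layout are assumptions.

```python
import os
import pickle
import tempfile


def write_partition(records, directory):
    # Producer side: dump records to a temp file and return only its path,
    # so just a short string has to cross the inter-process bridge.
    fd, path = tempfile.mkstemp(dir=directory, suffix=".pkl")
    with os.fdopen(fd, "wb") as f:
        for record in records:
            pickle.dump(record, f)
    return path


def read_partition(path):
    # Consumer side: read records back until end of file, then clean up.
    results = []
    with open(path, "rb") as f:
        while True:
            try:
                results.append(pickle.load(f))
            except EOFError:
                break
    os.remove(path)
    return results


with tempfile.TemporaryDirectory() as tmp_dir:
    path = write_partition(range(5), tmp_dir)
    collected = read_partition(path)
```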

Oct 19, 2012
- Update Python API for v0.6.0 compatibility. · 52989c8a
  Josh Rosen authored 12 years ago
  
  52989c8a
Aug 27, 2012
- Simplify Python worker; pipeline the map step of partitionBy(). · 200d248d
  Josh Rosen authored 12 years ago
  
  200d248d
Aug 21, 2012

Use only cPickle for serialization in Python API. · fd94e544

Josh Rosen authored 12 years ago

Objects serialized with JSON can be compared for equality, but JSON can be slow
to serialize and only supports a limited range of data types.

fd94e544
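
The "limited range of data types" point can be seen with a small round trip. The sketch uses Python 3's pickle module, which uses the C implementation automatically and so stands in for Python 2's cPickle; the record value is made up for the example.

```python
import json
import pickle

record = {"point": (1, 2), 3: "three"}

# JSON has no tuple type and only string keys, so the round trip is lossy.
via_json = json.loads(json.dumps(record))

# Pickle preserves both the tuple and the integer key.
via_pickle = pickle.loads(pickle.dumps(record))
```

Here via_json comes back as `{"point": [1, 2], "3": "three"}`, while via_pickle equals the original record.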

Aug 19, 2012
- Add Python API. · 886b39de
  Josh Rosen authored 13 years ago
  
  886b39de