- Nov 26, 2013
-
-
Josh Rosen authored
-
- Nov 10, 2013
-
-
Josh Rosen authored
-
Josh Rosen authored
-
Josh Rosen authored
For now, this only adds MarshalSerializer, but it lays the groundwork for supporting other custom serializers. Many of these mechanisms can also be used to support deserialization of different data formats sent by Java, such as data encoded by MsgPack. This also fixes a bug in SparkContext.union().
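A minimal sketch of what a pluggable serializer pair might look like, assuming a simple dumps/loads interface. The class names `PickleSketchSerializer` and `MarshalSketchSerializer` and the helper `round_trip` are hypothetical illustrations, not the classes added by this commit.

```python
import marshal
import pickle


class PickleSketchSerializer:
    """Hypothetical serializer backed by pickle (handles most Python objects)."""

    def dumps(self, obj):
        return pickle.dumps(obj, protocol=2)

    def loads(self, data):
        return pickle.loads(data)


class MarshalSketchSerializer:
    """Hypothetical serializer backed by marshal (built-in types only)."""

    def dumps(self, obj):
        return marshal.dumps(obj)

    def loads(self, data):
        return marshal.loads(data)


def round_trip(serializer, records):
    """Serialize and deserialize each record with the given serializer."""
    return [serializer.loads(serializer.dumps(record)) for record in records]


records = [(1, "a"), (2, [3.5, None])]
marshal_result = round_trip(MarshalSketchSerializer(), records)
pickle_result = round_trip(PickleSketchSerializer(), records)
```

Because both classes expose the same two methods, calling code can swap one for the other without changes. The marshal module only supports a fixed set of built-in types, so it is a drop-in choice only for data made of those types.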
-
- Nov 03, 2013
-
-
Josh Rosen authored
If we support custom serializers, the Python worker will know what type of input to expect, so we won't need to wrap Tuple2 and Strings into pickled tuples and strings.
-
Josh Rosen authored
Write the length of the accumulators section up-front rather than terminating it with a negative length. I find this easier to read.
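The up-front length approach can be sketched as follows, assuming 4-byte big-endian integers for counts and item lengths. The helper names (`write_int`, `read_int`, `write_section`, `read_section`) and the sample payloads are illustrative assumptions, not the wire format used by the actual worker protocol.

```python
import io
import struct


def write_int(value, stream):
    """Write a 4-byte big-endian signed integer."""
    stream.write(struct.pack("!i", value))


def read_int(stream):
    """Read a 4-byte big-endian signed integer."""
    return struct.unpack("!i", stream.read(4))[0]


def write_section(items, stream):
    """Write the item count first, then each item prefixed by its length."""
    write_int(len(items), stream)
    for item in items:
        write_int(len(item), stream)
        stream.write(item)


def read_section(stream):
    """Read exactly as many items as the leading count announces."""
    count = read_int(stream)
    items = []
    for _ in range(count):
        length = read_int(stream)
        items.append(stream.read(length))
    return items


buf = io.BytesIO()
write_section([b"acc-1", b"acc-22"], buf)
buf.write(b"trailing")  # data belonging to whatever follows the section
buf.seek(0)
section = read_section(buf)
rest = buf.read()
```

With the count written first, the reader uses a plain bounded loop and needs no special-case check for a negative sentinel length.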
-
- Oct 04, 2013
-
-
Andre Schumacher authored
Currently PythonPartitioner determines partition ID by hashing a byte-array representation of PySpark's key. This PR lets PythonPartitioner use the actual partition ID, which is required e.g. for sorting via PySpark.
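To illustrate why sorting needs an explicit partition ID, here is a rough sketch contrasting the two strategies. Everything in it is an assumption for illustration: the function names, the use of `zlib.crc32` as the byte hash, and the range boundaries are hypothetical and not taken from PythonPartitioner.

```python
import bisect
import pickle
import zlib


def partition_by_key_bytes(key, num_partitions):
    """Hash the serialized key bytes; the result ignores key ordering."""
    return zlib.crc32(pickle.dumps(key, protocol=2)) % num_partitions


def partition_by_range(key, boundaries):
    """Pick the partition from sorted boundaries, preserving key order."""
    return bisect.bisect_left(boundaries, key)


boundaries = [10, 20]  # hypothetical split points giving three partitions
keys = [25, 3, 14, 19, 7]

hashed_ids = [partition_by_key_bytes(k, 3) for k in keys]
range_ids = [partition_by_range(k, boundaries) for k in keys]

# Reading the range partitions in order, sorted within each, yields sorted keys.
ordered_keys = [k for _, k in sorted(zip(range_ids, keys))]
```

With range-based IDs, every key in partition 0 is smaller than every key in partition 1, so sorting within each partition produces a globally sorted result. A hash of the key bytes gives no such guarantee.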
-
- Jul 16, 2013
-
-
Matei Zaharia authored
-
- Jun 21, 2013
-
-
Jey Kottalam authored
-
- Jan 20, 2013
-
-
Matei Zaharia authored
-
- Jan 01, 2013
-
-
Josh Rosen authored
-
- Dec 29, 2012
-
-
Josh Rosen authored
-
Josh Rosen authored
-
- Dec 26, 2012
-
-
Josh Rosen authored
-
- Dec 24, 2012
-
-
Josh Rosen authored
Passing large volumes of data through Py4J seems to be slow. It appears to be faster to write the data to the local filesystem and read it back from Python.
-
- Oct 19, 2012
-
-
Josh Rosen authored
-
- Aug 27, 2012
-
-
Josh Rosen authored
-
- Aug 21, 2012
-
-
Josh Rosen authored
Objects serialized with JSON can be compared for equality, but JSON can be slow to serialize and only supports a limited range of data types.
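One reading of this trade-off, sketched with the standard `json` module: a canonical JSON encoding lets equal objects be compared through their serialized bytes, but some Python types do not survive or cannot be encoded at all. The helper `to_json_bytes` and the sample values are hypothetical.

```python
import json


def to_json_bytes(obj):
    """Encode with sorted keys and fixed separators so equal objects give equal bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


# Equal dictionaries serialize to identical bytes regardless of insertion order.
a = to_json_bytes({"k": 1, "v": [1, 2]})
b = to_json_bytes({"v": [1, 2], "k": 1})
same_bytes = a == b

# Limited type support: a tuple comes back as a list.
tuple_round_trip = json.loads(json.dumps((1, 2)))

# A set cannot be encoded by the default encoder.
try:
    json.dumps({1, 2})
    set_supported = True
except TypeError:
    set_supported = False
```

Comparing serialized bytes is useful when the side doing the comparison cannot deserialize the objects, at the cost of the type restrictions shown above.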
-
- Aug 19, 2012
-
-
Josh Rosen authored
-