Commit 0b855167 authored by Matei Zaharia

SPARK-1421. Make MLlib work on Python 2.6

The reason it wasn't working was passing a bytearray to stream.write(), which is not supported in Python 2.6 but is in 2.7. (This array came from NumPy when we converted data to send it over to Java). Now we just convert those bytearrays to strings of bytes, which preserves nonprintable characters as well.
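The conversion described above can be illustrated with a minimal sketch. This is not the code from the commit: `write_payload` is a hypothetical helper, and it calls `bytes()` rather than `str()` so the same snippet also behaves correctly on Python 3 (on Python 2.6 and 2.7, `bytes` is an alias for `str`, so the two calls are equivalent there).

```python
import io


def write_payload(stream, serialized):
    """Hypothetical helper: write serialized data to a binary stream.

    A bytearray is first copied into an immutable byte string, which
    every stream.write() accepts. The copy is byte-for-byte, so
    nonprintable values such as 0x00 and 0xFF survive unchanged.
    """
    if isinstance(serialized, bytearray):
        serialized = bytes(serialized)
    stream.write(serialized)


buf = io.BytesIO()
write_payload(buf, bytearray([0x00, 0xFF, 0x41, 0x0A]))
written = buf.getvalue()
print(repr(written))
```

The type check is only for illustration; the commit itself decides once, based on the interpreter version, whether to convert before writing.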

Author: Matei Zaharia <matei@databricks.com>

Closes #335 from mateiz/mllib-python-2.6 and squashes the following commits:

f26c59f [Matei Zaharia] Update docs to no longer say we need Python 2.7
a84d6af [Matei Zaharia] SPARK-1421. Make MLlib work on Python 2.6
parent 890d63bd
@@ -38,6 +38,5 @@ depends on native Fortran routines. You may need to install the
 if it is not already present on your nodes. MLlib will throw a linking error if it cannot
 detect these libraries automatically.
-To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.7 or newer
-and Python 2.7.
+To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.7 or newer.
@@ -152,7 +152,7 @@ Many of the methods also contain [doctests](http://docs.python.org/2/library/doc
 # Libraries
 [MLlib](mllib-guide.html) is also available in PySpark. To use it, you'll need
-[NumPy](http://www.numpy.org) version 1.7 or newer, and Python 2.7. The [MLlib guide](mllib-guide.html) contains
+[NumPy](http://www.numpy.org) version 1.7 or newer. The [MLlib guide](mllib-guide.html) contains
 some example applications.
 # Where to Go from Here
...
@@ -19,11 +19,7 @@
 Python bindings for MLlib.
 """
-# MLlib currently needs Python 2.7+ and NumPy 1.7+, so complain if lower
-import sys
-if sys.version_info[0:2] < (2, 7):
-    raise Exception("MLlib requires Python 2.7+")
+# MLlib currently needs and NumPy 1.7+, so complain if lower
 import numpy
 if numpy.version.version < '1.7':
...
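For readers comparing the two checks in this hunk, here is a short sketch of how tuple comparison and string comparison of versions differ. `version_tuple` is a hypothetical helper, not part of PySpark or NumPy, and it assumes the first two components of the version string are plain integers.

```python
import sys


def version_tuple(version, parts=2):
    """Hypothetical helper: turn '1.10.4' into (1, 10).

    Assumes the leading components are plain integers; a string such
    as '1.7rc1' would need extra parsing.
    """
    return tuple(int(piece) for piece in version.split(".")[:parts])


# Tuples compare element by element, which is what the removed
# sys.version_info[0:2] < (2, 7) check relied on.
running_on_old_python = sys.version_info[0:2] < (2, 7)

# Strings compare character by character, so '1.10' sorts before '1.7'.
lexicographic = "1.10" < "1.7"

# Parsed tuples compare numerically, so (1, 10) is not below (1, 7).
numeric = version_tuple("1.10.4") < (1, 7)

print(running_on_old_python, lexicographic, numeric)
```

The string form gives the expected answer only while every component is a single digit, which is one reason a numeric comparison may be preferable.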
@@ -64,6 +64,7 @@ import cPickle
 from itertools import chain, izip, product
 import marshal
 import struct
+import sys
 from pyspark import cloudpickle
@@ -113,6 +114,11 @@ class FramedSerializer(Serializer):
     where C{length} is a 32-bit integer and data is C{length} bytes.
     """
+    def __init__(self):
+        # On Python 2.6, we can't write bytearrays to streams, so we need to convert them
+        # to strings first. Check if the version number is that old.
+        self._only_write_strings = sys.version_info[0:2] <= (2, 6)
     def dump_stream(self, iterator, stream):
         for obj in iterator:
             self._write_with_length(obj, stream)
@@ -127,7 +133,10 @@ class FramedSerializer(Serializer):
     def _write_with_length(self, obj, stream):
         serialized = self.dumps(obj)
         write_int(len(serialized), stream)
-        stream.write(serialized)
+        if self._only_write_strings:
+            stream.write(str(serialized))
+        else:
+            stream.write(serialized)
     def _read_with_length(self, stream):
         length = read_int(stream)
...
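The framing that `FramedSerializer` documents (a 32-bit length followed by that many bytes of data) can be sketched as a standalone round trip. This is an illustration under assumptions, not the PySpark source: the function bodies are written from scratch, and the big-endian byte order (`"!i"`) is an assumption that the diff itself does not show.

```python
import io
import struct


def write_int(value, stream):
    # Assumed encoding: signed 32-bit integer in network (big-endian) order.
    stream.write(struct.pack("!i", value))


def read_int(stream):
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("stream ended before a full length prefix was read")
    return struct.unpack("!i", header)[0]


def write_with_length(payload, stream):
    # Copy into an immutable byte string so bytearray input is also accepted.
    payload = bytes(payload)
    write_int(len(payload), stream)
    stream.write(payload)


def read_with_length(stream):
    length = read_int(stream)
    data = stream.read(length)
    if len(data) < length:
        raise EOFError("stream ended before the full frame was read")
    return data


stream = io.BytesIO()
for payload in (b"abc", bytearray(b"\x00\x01"), b""):
    write_with_length(payload, stream)

stream.seek(0)
frames = [read_with_length(stream) for _ in range(3)]
print(frames)
```

Because the length prefix counts bytes, a bytearray has to be written as exactly the same bytes it was measured as, which is why the conversion in `_write_with_length` must not alter the data.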