Skip to content
Snippets Groups Projects
  • Davies Liu's avatar
    04e44b37
    [SPARK-4897] [PySpark] Python 3 support · 04e44b37
    Davies Liu authored
    This PR update PySpark to support Python 3 (tested with 3.4).
    
    Known issue: unpickle array from Pyrolite is broken in Python 3, those tests are skipped.
    
    TODO: ec2/spark-ec2.py is not fully tested with python3.
    
    Author: Davies Liu <davies@databricks.com>
    Author: twneale <twneale@gmail.com>
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #5173 from davies/python3 and squashes the following commits:
    
    d7d6323 [Davies Liu] fix tests
    6c52a98 [Davies Liu] fix mllib test
    99e334f [Davies Liu] update timeout
    b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    cafd5ec [Davies Liu] adddress comments from @mengxr
    bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    179fc8d [Davies Liu] tuning flaky tests
    8c8b957 [Davies Liu] fix ResourceWarning in Python 3
    5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    4006829 [Davies Liu] fix test
    2fc0066 [Davies Liu] add python3 path
    71535e9 [Davies Liu] fix xrange and divide
    5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    ed498c8 [Davies Liu] fix compatibility with python 3
    820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    ad7c374 [Davies Liu] fix mllib test and warning
    ef1fc2f [Davies Liu] fix tests
    4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    59bb492 [Davies Liu] fix tests
    1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    ca0fdd3 [Davies Liu] fix code style
    9563a15 [Davies Liu] add imap back for python 2
    0b1ec04 [Davies Liu] make python examples work with Python 3
    d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    a716d34 [Davies Liu] test with python 3.4
    f1700e8 [Davies Liu] fix test in python3
    671b1db [Davies Liu] fix test in python3
    692ff47 [Davies Liu] fix flaky test
    7b9699f [Davies Liu] invalidate import cache for Python 3.3+
    9c58497 [Davies Liu] fix kill worker
    309bfbf [Davies Liu] keep compatibility
    5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
    8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    f53e1f0 [Davies Liu] fix tests
    70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
    a39167e [Davies Liu] support customize class in __main__
    814c77b [Davies Liu] run unittests with python 3
    7f4476e [Davies Liu] mllib tests passed
    d737924 [Davies Liu] pass ml tests
    375ea17 [Davies Liu] SQL tests pass
    6cc42a9 [Davies Liu] rename
    431a8de [Davies Liu] streaming tests pass
    78901a7 [Davies Liu] fix hash of serializer in Python 3
    24b2f2e [Davies Liu] pass all RDD tests
    35f48fe [Davies Liu] run future again
    1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
    6e3c21d [Davies Liu] make cloudpickle work with Python3
    2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
    1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
    7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
    b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
    f40d925 [twneale] xrange --> range
    e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
    79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
    2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
    854be27 [Josh Rosen] Run `futurize` on Python code:
    7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
    04e44b37
    History
    [SPARK-4897] [PySpark] Python 3 support
    Davies Liu authored
    This PR update PySpark to support Python 3 (tested with 3.4).
    
    Known issue: unpickle array from Pyrolite is broken in Python 3, those tests are skipped.
    
    TODO: ec2/spark-ec2.py is not fully tested with python3.
    
    Author: Davies Liu <davies@databricks.com>
    Author: twneale <twneale@gmail.com>
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #5173 from davies/python3 and squashes the following commits:
    
    d7d6323 [Davies Liu] fix tests
    6c52a98 [Davies Liu] fix mllib test
    99e334f [Davies Liu] update timeout
    b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    cafd5ec [Davies Liu] adddress comments from @mengxr
    bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    179fc8d [Davies Liu] tuning flaky tests
    8c8b957 [Davies Liu] fix ResourceWarning in Python 3
    5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    4006829 [Davies Liu] fix test
    2fc0066 [Davies Liu] add python3 path
    71535e9 [Davies Liu] fix xrange and divide
    5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    ed498c8 [Davies Liu] fix compatibility with python 3
    820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    ad7c374 [Davies Liu] fix mllib test and warning
    ef1fc2f [Davies Liu] fix tests
    4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    59bb492 [Davies Liu] fix tests
    1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    ca0fdd3 [Davies Liu] fix code style
    9563a15 [Davies Liu] add imap back for python 2
    0b1ec04 [Davies Liu] make python examples work with Python 3
    d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    a716d34 [Davies Liu] test with python 3.4
    f1700e8 [Davies Liu] fix test in python3
    671b1db [Davies Liu] fix test in python3
    692ff47 [Davies Liu] fix flaky test
    7b9699f [Davies Liu] invalidate import cache for Python 3.3+
    9c58497 [Davies Liu] fix kill worker
    309bfbf [Davies Liu] keep compatibility
    5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
    8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
    f53e1f0 [Davies Liu] fix tests
    70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
    a39167e [Davies Liu] support customize class in __main__
    814c77b [Davies Liu] run unittests with python 3
    7f4476e [Davies Liu] mllib tests passed
    d737924 [Davies Liu] pass ml tests
    375ea17 [Davies Liu] SQL tests pass
    6cc42a9 [Davies Liu] rename
    431a8de [Davies Liu] streaming tests pass
    78901a7 [Davies Liu] fix hash of serializer in Python 3
    24b2f2e [Davies Liu] pass all RDD tests
    35f48fe [Davies Liu] run future again
    1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
    6e3c21d [Davies Liu] make cloudpickle work with Python3
    2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
    1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
    7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
    b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
    f40d925 [twneale] xrange --> range
    e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
    79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
    2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
    854be27 [Josh Rosen] Run `futurize` on Python code:
    7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
_shared_params_code_gen.py 3.78 KiB
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from __future__ import print_function

header = """#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#"""

# Code generator for shared params (shared.py). Run under this folder with:
# python _shared_params_code_gen.py > shared.py


def _gen_param_code(name, doc, defaultValueStr):
    """
    Generates Python code for a shared param class.

    :param name: param name
    :param doc: param doc
    :param defaultValueStr: string representation of the default value
    :return: code string
    """
    # TODO: How to correctly inherit instance attributes?
    template = '''class Has$Name(Params):
    """
    Mixin for param $name: $doc.
    """

    # a placeholder to make it appear in the generated doc
    $name = Param(Params._dummy(), "$name", "$doc")

    def __init__(self):
        super(Has$Name, self).__init__()
        #: param for $doc
        self.$name = Param(self, "$name", "$doc")
        if $defaultValueStr is not None:
            self._setDefault($name=$defaultValueStr)

    def set$Name(self, value):
        """
        Sets the value of :py:attr:`$name`.
        """
        self.paramMap[self.$name] = value
        return self

    def get$Name(self):
        """
        Gets the value of $name or its default value.
        """
        return self.getOrDefault(self.$name)'''

    Name = name[0].upper() + name[1:]
    return template \
        .replace("$name", name) \
        .replace("$Name", Name) \
        .replace("$doc", doc) \
        .replace("$defaultValueStr", str(defaultValueStr))

if __name__ == "__main__":
    print(header)
    print("\n# DO NOT MODIFY THIS FILE! It was generated by _shared_params_code_gen.py.\n")
    print("from pyspark.ml.param import Param, Params\n\n")
    shared = [
        ("maxIter", "max number of iterations", None),
        ("regParam", "regularization constant", None),
        ("featuresCol", "features column name", "'features'"),
        ("labelCol", "label column name", "'label'"),
        ("predictionCol", "prediction column name", "'prediction'"),
        ("inputCol", "input column name", None),
        ("outputCol", "output column name", None),
        ("numFeatures", "number of features", None)]
    code = []
    for name, doc, defaultValueStr in shared:
        code.append(_gen_param_code(name, doc, defaultValueStr))
    print("\n\n\n".join(code))