    [SPARK-16062] [SPARK-15989] [SQL] Fix two bugs of Python-only UDTs · 146001a9
    Liang-Chi Hsieh authored
    ## What changes were proposed in this pull request?
    
    There are two related bugs of Python-only UDTs. Because the test case for the second one also needs the first fix, I put them into one PR. If that is not appropriate, please let me know.
    
    ### First bug: When MapObjects works on Python-only UDTs
    
    `RowEncoder` uses `PythonUserDefinedType.sqlType` for its deserializer expression. If that sql type is `ArrayType`, `MapObjects` will operate on it. But `MapObjects` doesn't consider `PythonUserDefinedType` as a possible input data type, which causes an error like:
    
        import pyspark.sql.group
        from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
        from pyspark.sql.types import *
    
        schema = StructType().add("key", LongType()).add("val", PythonOnlyUDT())
        df = spark.createDataFrame([(i % 3, PythonOnlyPoint(float(i), float(i))) for i in range(10)], schema=schema)
        df.show()
    
        File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString.
        : java.lang.RuntimeException: Error while decoding: scala.MatchError: org.apache.spark.sql.types.PythonUserDefinedTypef4ceede8 (of class org.apache.spark.sql.types.PythonUserDefinedType)
        ...
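    The failure mode can be illustrated with a minimal pure-Python sketch. The classes and the `decode` function below are hypothetical toy stand-ins for Spark's type objects and `MapObjects` dispatch, not the real pyspark API:

    ```python
    # Toy stand-ins (hypothetical, not the real pyspark classes) for the
    # relevant SQL data types.
    class DoubleType:
        pass

    class ArrayType:
        def __init__(self, element_type):
            self.element_type = element_type

    class PythonUserDefinedType:
        # A Python-only UDT stores its data using an underlying SQL type.
        def __init__(self, sql_type):
            self.sql_type = sql_type

    def decode(data_type, value):
        """Recursively decode a value; mirrors MapObjects' type dispatch."""
        # The essence of the fix: unwrap a UDT to its sqlType before matching.
        if isinstance(data_type, PythonUserDefinedType):
            return decode(data_type.sql_type, value)
        if isinstance(data_type, ArrayType):
            return [decode(data_type.element_type, v) for v in value]
        if isinstance(data_type, DoubleType):
            return float(value)
        # Without the unwrapping branch above, a UDT falls through to here --
        # the analogue of the scala.MatchError in the report.
        raise TypeError("no match for %r" % data_type)

    udt = PythonUserDefinedType(ArrayType(DoubleType()))
    print(decode(udt, [1, 2]))  # [1.0, 2.0]
    ```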
    
    ### Second bug: When a Python-only UDT is the element type of an ArrayType

    The same kind of decoding error occurs when the UDT is nested as the element type of an `ArrayType`:
        import pyspark.sql.group
        from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
        from pyspark.sql.types import *
    
        schema = StructType().add("key", LongType()).add("val", ArrayType(PythonOnlyUDT()))
        df = spark.createDataFrame([(i % 3, [PythonOnlyPoint(float(i), float(i))]) for i in range(10)], schema=schema)
        df.show()
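    As another toy sketch (again using hypothetical stand-in classes, not the real pyspark API), the nested case requires the UDT-to-sqlType unwrapping to happen once per array element:

    ```python
    # Hypothetical toy stand-ins showing the second case: the UDT is the
    # element type of an ArrayType, so decoding must unwrap it per element.
    class DoubleType:
        pass

    class ArrayType:
        def __init__(self, element_type):
            self.element_type = element_type

    class PythonUserDefinedType:
        def __init__(self, sql_type):
            self.sql_type = sql_type

    def decode(dt, v):
        if isinstance(dt, PythonUserDefinedType):  # unwrap UDT -> sqlType
            return decode(dt.sql_type, v)
        if isinstance(dt, ArrayType):
            return [decode(dt.element_type, e) for e in v]
        if isinstance(dt, DoubleType):
            return float(v)
        raise TypeError("no match for %r" % dt)

    # "val" column: an ArrayType whose element is a Python-only UDT backed by
    # ArrayType(DoubleType()) -- a list of points, each stored as two doubles.
    col_type = ArrayType(PythonUserDefinedType(ArrayType(DoubleType())))
    print(decode(col_type, [[0, 0], [1, 1]]))  # [[0.0, 0.0], [1.0, 1.0]]
    ```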
    
    ## How was this patch tested?
    PySpark's SQL tests.
    
    Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
    
    Closes #13778 from viirya/fix-pyudt.