Skip to content
Snippets Groups Projects
  • hyukjinkwon's avatar
    c338c8cf
    [SPARK-23352][PYTHON] Explicitly specify supported types in Pandas UDFs · c338c8cf
    hyukjinkwon authored
    ## What changes were proposed in this pull request?
    
    This PR targets to explicitly specify supported types in Pandas UDFs.
    The main change here is to add a deduplicated and explicit type checking in `returnType` ahead with documenting this; however, it happened to fix multiple things.
    
    1. Currently, we don't support `BinaryType` in Pandas UDFs, for example, see:
    
        ```python
        from pyspark.sql.functions import pandas_udf
        pudf = pandas_udf(lambda x: x, "binary")
        df = spark.createDataFrame([[bytearray(1)]])
        df.select(pudf("_1")).show()
        ```
        ```
        ...
        TypeError: Unsupported type in conversion to Arrow: BinaryType
        ```
    
        We can document this behaviour for its guide.
    
    2. Also, the grouped aggregate Pandas UDF fails fast on `ArrayType` but seems we can support this case.
    
        ```python
        from pyspark.sql.functions import pandas_udf, PandasUDFType
        foo = pandas_udf(lambda v: v.mean(), 'array<double>', PandasUDFType.GROUPED_AGG)
        df = spark.range(100).selectExpr("id", "array(id) as value")
        df.groupBy("id").agg(foo("value")).show()
        ```
    
        ```
        ...
         NotImplementedError: ArrayType, StructType and MapType are not supported with PandasUDFType.GROUPED_AGG
        ```
    
    3. Since we can check the return type ahead, we can fail fast before actual execution.
    
        ```python
        # we can fail fast at this stage because we know the schema ahead
        pandas_udf(lambda x: x, BinaryType())
        ```
    
    ## How was this patch tested?
    
    Manually tested and unit tests for `BinaryType` and `ArrayType(...)` were added.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #20531 from HyukjinKwon/pudf-cleanup.
    c338c8cf
    History
    [SPARK-23352][PYTHON] Explicitly specify supported types in Pandas UDFs
    hyukjinkwon authored
    ## What changes were proposed in this pull request?
    
    This PR targets to explicitly specify supported types in Pandas UDFs.
    The main change here is to add a deduplicated and explicit type checking in `returnType` ahead with documenting this; however, it happened to fix multiple things.
    
    1. Currently, we don't support `BinaryType` in Pandas UDFs, for example, see:
    
        ```python
        from pyspark.sql.functions import pandas_udf
        pudf = pandas_udf(lambda x: x, "binary")
        df = spark.createDataFrame([[bytearray(1)]])
        df.select(pudf("_1")).show()
        ```
        ```
        ...
        TypeError: Unsupported type in conversion to Arrow: BinaryType
        ```
    
        We can document this behaviour for its guide.
    
    2. Also, the grouped aggregate Pandas UDF fails fast on `ArrayType` but seems we can support this case.
    
        ```python
        from pyspark.sql.functions import pandas_udf, PandasUDFType
        foo = pandas_udf(lambda v: v.mean(), 'array<double>', PandasUDFType.GROUPED_AGG)
        df = spark.range(100).selectExpr("id", "array(id) as value")
        df.groupBy("id").agg(foo("value")).show()
        ```
    
        ```
        ...
         NotImplementedError: ArrayType, StructType and MapType are not supported with PandasUDFType.GROUPED_AGG
        ```
    
    3. Since we can check the return type ahead, we can fail fast before actual execution.
    
        ```python
        # we can fail fast at this stage because we know the schema ahead
        pandas_udf(lambda x: x, BinaryType())
        ```
    
    ## How was this patch tested?
    
    Manually tested and unit tests for `BinaryType` and `ArrayType(...)` were added.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #20531 from HyukjinKwon/pudf-cleanup.