Skip to content
Snippets Groups Projects
  • Takuya UESHIN's avatar
    63c5bf13
    [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to... · 63c5bf13
    Takuya UESHIN authored
    [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to handle str type properly in Python 2.
    
    ## What changes were proposed in this pull request?
    
    In Python 2, when `pandas_udf` tries to return string type value created in the udf with `".."`, the execution fails. E.g.,
    
    ```python
    from pyspark.sql.functions import pandas_udf, col
    import pandas as pd
    
    df = spark.range(10)
    str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
    df.select(str_f(col('id'))).show()
    ```
    
    raises the following exception:
    
    ```
    ...
    
    java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: expected StringType, got BinaryType
    	at scala.Predef$.assert(Predef.scala:170)
    	at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93)
    
    ...
    ```
    
    Seems like pyarrow ignores `type` parameter for `pa.Array.from_pandas()` and consider it as binary type when the type is string type and the string values are `str` instead of `unicode` in Python 2.
    
    This pr adds a workaround for the case.
    
    ## How was this patch tested?
    
    Added a test and existing tests.
    
    Author: Takuya UESHIN <ueshin@databricks.com>
    
    Closes #20507 from ueshin/issues/SPARK-23334.
    63c5bf13
    History
    [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to...
    Takuya UESHIN authored
    [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to handle str type properly in Python 2.
    
    ## What changes were proposed in this pull request?
    
    In Python 2, when `pandas_udf` tries to return string type value created in the udf with `".."`, the execution fails. E.g.,
    
    ```python
    from pyspark.sql.functions import pandas_udf, col
    import pandas as pd
    
    df = spark.range(10)
    str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
    df.select(str_f(col('id'))).show()
    ```
    
    raises the following exception:
    
    ```
    ...
    
    java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: expected StringType, got BinaryType
    	at scala.Predef$.assert(Predef.scala:170)
    	at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93)
    
    ...
    ```
    
    Seems like pyarrow ignores `type` parameter for `pa.Array.from_pandas()` and consider it as binary type when the type is string type and the string values are `str` instead of `unicode` in Python 2.
    
    This pr adds a workaround for the case.
    
    ## How was this patch tested?
    
    Added a test and existing tests.
    
    Author: Takuya UESHIN <ueshin@databricks.com>
    
    Closes #20507 from ueshin/issues/SPARK-23334.