    [SPARK-14215] [SQL] [PYSPARK] Support chained Python UDFs · a7a93a11
    Davies Liu authored
    ## What changes were proposed in this pull request?
    
    This PR adds support for chained Python UDFs, for example:
    
    ```sql
    select udf1(udf2(a))
    select udf1(udf2(a) + 3)
    select udf1(udf2(a) + udf3(b))
    ```
    
    In addition, directly chained unary Python UDFs are evaluated in a single batch of Python UDFs; other combinations may require multiple batches.
    
    For example,
    ```python
    >>> sqlContext.sql("select double(double(1))").explain()
    == Physical Plan ==
    WholeStageCodegen
    :  +- Project [pythonUDF#10 AS double(double(1))#9]
    :     +- INPUT
    +- !BatchPythonEvaluation double(double(1)), [pythonUDF#10]
       +- Scan OneRowRelation[]
    >>> sqlContext.sql("select double(double(1) + double(2))").explain()
    == Physical Plan ==
    WholeStageCodegen
    :  +- Project [pythonUDF#19 AS double((double(1) + double(2)))#16]
    :     +- INPUT
    +- !BatchPythonEvaluation double((pythonUDF#17 + pythonUDF#18)), [pythonUDF#17,pythonUDF#18,pythonUDF#19]
       +- !BatchPythonEvaluation double(2), [pythonUDF#17,pythonUDF#18]
          +- !BatchPythonEvaluation double(1), [pythonUDF#17]
             +- Scan OneRowRelation[]
    ```
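
The batching rule described above can be sketched as a toy model in plain Python. This is not Spark's planner code: the `Lit`, `Add`, and `Udf` classes and the `count_batches` and `evaluate` helpers are hypothetical names introduced only for illustration, and the counting rule is an assumption chosen to agree with the two plans shown (one batch for `double(double(1))`, three for `double(double(1) + double(2))`).

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Lit:
    """A literal value (hypothetical stand-in for a Catalyst literal)."""
    value: int


@dataclass
class Add:
    """Addition of two sub-expressions."""
    left: "Expr"
    right: "Expr"


@dataclass
class Udf:
    """A unary Python UDF applied to one child expression."""
    name: str
    child: "Expr"


Expr = Union[Lit, Add, Udf]


def count_batches(expr):
    """Count Python evaluation batches under the assumed rule:
    a UDF whose child is itself a UDF shares that child's batch;
    any other UDF needs a batch of its own."""
    if isinstance(expr, Lit):
        return 0
    if isinstance(expr, Add):
        return count_batches(expr.left) + count_batches(expr.right)
    if isinstance(expr.child, Udf):
        # Directly chained: fold into the child's batch.
        return count_batches(expr.child)
    return 1 + count_batches(expr.child)


def evaluate(expr, udfs):
    """Evaluate the expression, looking UDF bodies up by name."""
    if isinstance(expr, Lit):
        return expr.value
    if isinstance(expr, Add):
        return evaluate(expr.left, udfs) + evaluate(expr.right, udfs)
    return udfs[expr.name](evaluate(expr.child, udfs))


# Assumed body for the `double` UDF; the commit message does not define it.
udfs = {"double": lambda x: x * 2}

chained = Udf("double", Udf("double", Lit(1)))
branched = Udf("double", Add(Udf("double", Lit(1)), Udf("double", Lit(2))))

print(count_batches(chained), evaluate(chained, udfs))
print(count_batches(branched), evaluate(branched, udfs))
```

In this model the two inner `double` calls in the second query each get their own batch because they are unrelated to each other, which is the case the TODO below leaves for a later PR.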
    
    TODO: support multiple unrelated Python UDFs in one batch (in a follow-up PR).
    
    ## How was this patch tested?
    
    Added new unit tests for chained UDFs.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #12014 from davies/py_udfs.