    71cfba04
    hyukjinkwon authored
    [SPARK-23319][TESTS] Explicitly specify Pandas and PyArrow versions in PySpark tests (to skip or test)
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to explicitly specify the required Pandas and PyArrow versions in PySpark tests, so that each test is either skipped or run depending on what is installed.
    
    We already declare these as extra dependencies:
    
    https://github.com/apache/spark/blob/b8bfce51abf28c66ba1fc67b0f25fe1617c81025/python/setup.py#L204
    
    In the case of PyArrow:
    
    Currently we only check whether PyArrow is installed, without checking its version. As a result, tests already fail to run when an older version is present. For example, if PyArrow 0.7.0 is installed:
    
    ```
    ======================================================================
    ERROR: test_vectorized_udf_wrong_return_type (pyspark.sql.tests.ScalarPandasUDF)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/.../spark/python/pyspark/sql/tests.py", line 4019, in test_vectorized_udf_wrong_return_type
        f = pandas_udf(lambda x: x * 1.0, MapType(LongType(), LongType()))
      File "/.../spark/python/pyspark/sql/functions.py", line 2309, in pandas_udf
        return _create_udf(f=f, returnType=return_type, evalType=eval_type)
      File "/.../spark/python/pyspark/sql/udf.py", line 47, in _create_udf
        require_minimum_pyarrow_version()
      File "/.../spark/python/pyspark/sql/utils.py", line 132, in require_minimum_pyarrow_version
        "however, your version was %s." % pyarrow.__version__)
    ImportError: pyarrow >= 0.8.0 must be installed on calling Python process; however, your version was 0.7.0.
    
    ----------------------------------------------------------------------
    Ran 33 tests in 8.098s
    
    FAILED (errors=33)
    ```
    
    In the case of Pandas:
    
    There are a few tests for old Pandas that ran only when the installed Pandas version was lower than required. I rewrote them to run in both cases: when the Pandas version is too low and when Pandas is missing.
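As a rough sketch of how the two kinds of tests can be gated, assuming a single requirement message that is `None` when a suitable Pandas is available: the class names, `to_pandas_like`, and the hard-coded message below are hypothetical stand-ins, not the actual test code.

```python
import unittest

# Hypothetical stand-in: None would mean a suitable Pandas is installed.
# Here we pretend Pandas is missing so that the sketch is deterministic.
_pandas_requirement_message = (
    "Pandas >= 0.19.2 must be installed; however, it was not found.")
_have_pandas = _pandas_requirement_message is None


def to_pandas_like():
    """Hypothetical API that needs Pandas and fails with the same reason."""
    if not _have_pandas:
        raise ImportError(_pandas_requirement_message)
    return "converted"


class NeedsPandasTests(unittest.TestCase):
    # Skipped when Pandas is too old OR missing, with the reason shown.
    @unittest.skipIf(not _have_pandas, _pandas_requirement_message)
    def test_conversion(self):
        self.assertEqual(to_pandas_like(), "converted")


class NoUsablePandasTests(unittest.TestCase):
    # Runs when Pandas is too old OR missing, to check the error path.
    @unittest.skipIf(_have_pandas, "Required Pandas was found.")
    def test_error_is_raised(self):
        with self.assertRaisesRegex(ImportError, "must be installed"):
            to_pandas_like()


def run_sketch():
    """Run both classes and return the unittest.TestResult."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite([
        loader.loadTestsFromTestCase(NeedsPandasTests),
        loader.loadTestsFromTestCase(NoUsablePandasTests),
    ])
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Driving both decorators from the same flag means exactly one of the two groups runs in any environment, so neither the too-old case nor the missing case falls through untested.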
    
    ## How was this patch tested?
    
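    Manually tested by modifying the version conditions (for example, temporarily raising the required versions, as the 1.19.2 and 1.8.0 values below show) and by running with the packages missing: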
    
    ```
    test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
    test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
    test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
    ```
    
    ```
    test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
    test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
    test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
    ```
    
    ```
    test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
    test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
    test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
    ```
    
    ```
    test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
    test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
    test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
    ```
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #20487 from HyukjinKwon/pyarrow-pandas-skip.