    224e0e78
    [SPARK-19701][SQL][PYTHON] Throws a correct exception for 'in' operator against column · 224e0e78
    hyukjinkwon authored
    ## What changes were proposed in this pull request?
    
    This PR removes the incorrect implementation of the `in` operator, which has not worked so far (at least since Spark 1.5.2), and throws an accurate exception instead of one complaining about a conversion to bool. I tested `in` against a column in 1.5.2, 1.6.3, 2.1.0 and the master branch, as shown below:
    
    **1.5.2**
    
    ```python
    >>> df = sqlContext.createDataFrame([[1]])
    >>> 1 in df._1
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark-1.5.2-bin-hadoop2.6/python/pyspark/sql/column.py", line 418, in __nonzero__
        raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
    ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
    ```
    
    **1.6.3**
    
    ```python
    >>> 1 in sqlContext.range(1).id
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark-1.6.3-bin-hadoop2.6/python/pyspark/sql/column.py", line 447, in __nonzero__
        raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
    ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
    ```
    
    **2.1.0**
    
    ```python
    >>> 1 in spark.range(1).id
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/column.py", line 426, in __nonzero__
        raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
    ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
    ```
    
    **Current Master**
    
    ```python
    >>> 1 in spark.range(1).id
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/sql/column.py", line 452, in __nonzero__
        raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
    ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
    ```
    
    **After**
    
    ```python
    >>> 1 in spark.range(1).id
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/sql/column.py", line 184, in __contains__
        raise ValueError("Cannot apply 'in' operator against a column: please use 'contains' "
    ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.
    ```
    
    In more detail:
    
    The implementation seems to have been intended to support the following:
    
    ```python
    1 in df.column
    ```
    
    However, it currently throws the exception below:
    
    ```python
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/sql/column.py", line 426, in __nonzero__
        raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
    ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
    ```
    
    What happens here can be reproduced as follows:
    
    ```python
    class Column(object):
        def __contains__(self, item):
            print "I am contains"
            return Column()
        def __nonzero__(self):
            raise Exception("I am nonzero.")
    
    >>> 1 in Column()
    I am contains
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 6, in __nonzero__
    Exception: I am nonzero.
    ```
    
    It seems that `__contains__` is called first, and then `__nonzero__` or `__bool__` is called on the returned `Column()` to convert it to a bool (or an int, to be specific).
    
    It seems that, unlike other operators, `in` forces the value returned by `__contains__` into a bool through `__nonzero__` (Python 2) or `__bool__` (Python 3). A few references on this:
    
    https://bugs.python.org/issue16011
    http://stackoverflow.com/questions/12244074/python-source-code-for-built-in-in-operator/12244378#12244378
    http://stackoverflow.com/questions/38542543/functionality-of-python-in-vs-contains/38542777
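
The difference from other operators can be seen in a small Python 3 sketch. `FakeColumn` and the `calls` list are hypothetical names used only for illustration; this is not the PySpark `Column` class. The sketch records which special methods run for `==` and for `in`:

```python
calls = []  # records the order in which the special methods run


class FakeColumn:
    """Hypothetical stand-in for a Column; not the PySpark class."""

    def __contains__(self, item):
        calls.append("__contains__")
        return FakeColumn()  # a non-bool result, as the old code intended

    def __eq__(self, other):
        calls.append("__eq__")
        return FakeColumn()  # rich comparisons may return any object

    def __bool__(self):  # this hook is named __nonzero__ on Python 2
        calls.append("__bool__")
        return True


col = FakeColumn()

eq_result = col == 1  # returned as-is: still a FakeColumn
in_result = 1 in col  # the __contains__ result is converted to a bool
```

Here `eq_result` keeps the object returned by `__eq__`, while `in_result` is a plain bool, so an expression object returned from `__contains__` can never reach the caller.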
    
    It seems we cannot override `__nonzero__` or `__bool__` as a workaround, because these methods must themselves return a bool, as shown below:
    
    ```python
    class Column(object):
        def __contains__(self, item):
            print "I am contains"
            return Column()
        def __nonzero__(self):
            return "a"
    
    >>> 1 in Column()
    I am contains
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: __nonzero__ should return bool or int, returned str
    ```
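
Since the result of `in` cannot be anything other than a bool, one option is to make `__contains__` fail immediately with a message about `in` itself. The following is a minimal Python 3 sketch of that idea, not the actual patch: `FakeColumn` and its string-based `contains` method are hypothetical stand-ins, and only the error text is taken from the "After" traceback.

```python
class FakeColumn:
    """Hypothetical stand-in for a Column; not the PySpark class."""

    def __init__(self, name):
        self.name = name

    def contains(self, other):
        # A named method can return an expression object, because Python
        # does not coerce the result of an ordinary method call.
        return FakeColumn("contains({}, {!r})".format(self.name, other))

    def __contains__(self, item):
        # Fail fast with a message about 'in' instead of falling through
        # to the unrelated "Cannot convert column into bool" error.
        raise ValueError(
            "Cannot apply 'in' operator against a column: please use 'contains' "
            "in a string column or 'array_contains' function for an array column.")


col = FakeColumn("id")
expr = col.contains("a")  # builds an expression; nothing is coerced

try:
    1 in col
except ValueError as e:
    message = str(e)
```

With this shape, `1 in col` never reaches the bool conversion, so the user sees guidance about `contains` and `array_contains` rather than about `&`, `|` and `~`.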
    
    ## How was this patch tested?
    
    Added unit tests in `tests.py`.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #17160 from HyukjinKwon/SPARK-19701.