[SPARK-10162] [SQL] Fix timezone omission in the PySpark DataFrame filter() function
This PR addresses [SPARK-10162](https://issues.apache.org/jira/browse/SPARK-10162).
The issue is with the DataFrame filter() function when a datetime.datetime is passed to it:
* The timezone information of the datetime is ignored
* The datetime is assumed to be in the local timezone, which depends on the OS timezone setting
    
The fix includes both a code change and a regression test. Code to reproduce the problem on master:
```python
import pytz
from datetime import datetime
from pyspark.sql import *
from pyspark.sql.types import *

# assumes a running SparkContext `sc`, e.g. inside the pyspark shell
sqc = SQLContext(sc)
df = sqc.createDataFrame([], StructType([StructField("dt", TimestampType())]))

m1 = pytz.timezone('UTC')
m2 = pytz.timezone('Etc/GMT+3')  # a fixed offset of UTC-3

df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m1)).explain()
df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m2)).explain()
```
It gives the same timestamp for both filters, ignoring the time zone:
    ```
>>> df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m1)).explain()
    Filter (dt#0 > 946713600000000)
     Scan PhysicalRDD[dt#0]
    
>>> df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m2)).explain()
    Filter (dt#0 > 946713600000000)
     Scan PhysicalRDD[dt#0]
    ```
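The value 946713600000000 is 2000-01-01 00:00 interpreted in a UTC-8 local timezone (946684800 s is 2000-01-01 00:00 UTC, plus 8 hours), i.e. the tz-aware datetime is handled as if it were naive. A minimal sketch of that pre-fix conversion (my illustration, not the actual Spark code path):
```python
import time
from datetime import datetime

# Pre-fix behavior (illustration): tzinfo is ignored and the wall-clock
# fields are interpreted in the OS local timezone via time.mktime.
dt = datetime(2000, 1, 1)  # any tzinfo would be effectively discarded
micros = int(time.mktime(dt.timetuple())) * 1000000

# On a machine whose local timezone is UTC-8 this prints 946713600000000,
# matching both Filter values above.
print(micros)
```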
    After the fix:
    ```
>>> df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m1)).explain()
    Filter (dt#0 > 946684800000000)
     Scan PhysicalRDD[dt#0]
    
>>> df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m2)).explain()
    Filter (dt#0 > 946695600000000)
     Scan PhysicalRDD[dt#0]
    ```
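For comparison, a standalone sketch of the conversion the fix implies (not the actual patch), using a hypothetical helper `to_internal_micros` that maps a datetime to microseconds since the Unix epoch; it reproduces both values above:
```python
import calendar
import time
from datetime import datetime

import pytz

def to_internal_micros(dt):
    # tz-aware datetimes are converted to UTC seconds since the epoch;
    # naive datetimes still fall back to the OS local timezone.
    if dt.tzinfo is not None:
        seconds = calendar.timegm(dt.utctimetuple())
    else:
        seconds = time.mktime(dt.timetuple())
    return int(seconds) * 1000000 + dt.microsecond

m1 = pytz.timezone('UTC')
m2 = pytz.timezone('Etc/GMT+3')  # a fixed offset of UTC-3

print(to_internal_micros(datetime(2000, 1, 1, tzinfo=m1)))  # 946684800000000
print(to_internal_micros(datetime(2000, 1, 1, tzinfo=m2)))  # 946695600000000
```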
PR [8536](https://github.com/apache/spark/pull/8536) was accidentally closed when I dropped the repo.
    
    Author: 0x0FFF <programmerag@gmail.com>
    
    Closes #8555 from 0x0FFF/SPARK-10162.