commit 64817c42
Takuya UESHIN authored
    [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp values for Pandas to respect session timezone
    
    ## What changes were proposed in this pull request?
    
When converting a Pandas DataFrame/Series from/to a Spark DataFrame using `toPandas()` or pandas UDFs, timestamp values respect the Python system timezone instead of the session timezone.
    
For example, suppose we use `"America/Los_Angeles"` as the session timezone and have a timestamp value `"1970-01-01 00:00:01"` in that timezone. (I'm in Japan, so my Python system timezone is `"Asia/Tokyo"`.)

With the current `toPandas()`, the timestamp value comes back as the following:
    
    ```
    >>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
    >>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) as ts")
    >>> df.show()
    +-------------------+
    |                 ts|
    +-------------------+
    |1970-01-01 00:00:01|
    +-------------------+
    
    >>> df.toPandas()
                       ts
    0 1970-01-01 17:00:01
    ```
    
As you can see, the value becomes `"1970-01-01 17:00:01"` because it respects the Python system timezone.
As we discussed in #18664, we consider this behavior a bug; the value should be `"1970-01-01 00:00:01"`.
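
The arithmetic behind the example above can be sketched in plain pandas (this is an illustration of the expected vs. buggy rendering, not Spark's internal conversion code): the stored value is 28801 epoch seconds, i.e. `1970-01-01 08:00:01` UTC, and the displayed wall-clock time depends entirely on which timezone the naive timestamp is rendered in.

```python
# Sketch (not Spark internals): why 28801 epoch seconds should render as
# 1970-01-01 00:00:01 under the America/Los_Angeles session timezone.
import pandas as pd

epoch_seconds = 28801  # 1970-01-01 08:00:01 UTC

# Expected behavior: interpret in the session timezone (LA is UTC-8 in January),
# then drop tzinfo, since toPandas() returns timezone-naive timestamps.
session_ts = (
    pd.Timestamp(epoch_seconds, unit="s", tz="UTC")
    .tz_convert("America/Los_Angeles")
    .tz_localize(None)
)
print(session_ts)  # 1970-01-01 00:00:01

# Buggy pre-fix behavior: interpret in the Python system timezone instead
# (Asia/Tokyo is UTC+9), producing the 17:00:01 value shown above.
system_ts = (
    pd.Timestamp(epoch_seconds, unit="s", tz="UTC")
    .tz_convert("Asia/Tokyo")
    .tz_localize(None)
)
print(system_ts)  # 1970-01-01 17:00:01
```

The same epoch instant yields two different naive wall-clock values; the fix makes `toPandas()` and pandas UDFs use the session timezone for this conversion rather than whatever timezone the Python driver happens to run in.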
    
    ## How was this patch tested?
    
Added new tests; existing tests also pass.
    
    Author: Takuya UESHIN <ueshin@databricks.com>
    
    Closes #19607 from ueshin/issues/SPARK-22395.