    [SPARK-15244] [PYTHON] Type of column name created with createDataFrame is not consistent. · 0f576a57
    Dongjoon Hyun authored
    ## What changes were proposed in this pull request?
    
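    In Python 2, **createDataFrame** returns column names of inconsistent types: `str` when the schema is a **StructType**, but `unicode` when the schema is a list of names.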
    ```python
    >>> from pyspark.sql.types import StructType, StructField, StringType
    >>> schema = StructType([StructField(u"col", StringType())])
    >>> df1 = spark.createDataFrame([("a",)], schema)
    >>> df1.columns # "col" is str
    ['col']
    >>> df2 = spark.createDataFrame([("a",)], [u"col"])
    >>> df2.columns # "col" is unicode
    [u'col']
    ```
    
    The reason is that only **StructField** contains the following conversion.
    ```python
    if not isinstance(name, str):
        name = name.encode('utf-8')
    ```
    This PR adds the same logic to **createDataFrame** for consistency.
    ```python
    if isinstance(schema, list):
        schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in schema]
    ```
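
The list branch above can be tried in isolation, without a Spark session. The following is a minimal sketch, not the actual patch: `normalize_column_names` is a hypothetical helper name introduced here for illustration, and only the list comprehension is taken from this PR. On Python 2, `unicode` names would be encoded to `str`; on Python 3, text names are already `str` and pass through unchanged.

```python
def normalize_column_names(schema):
    """Hypothetical helper mirroring the list branch added in this PR."""
    if isinstance(schema, list):
        # Encode any name that is not already a native str (Python 2 unicode).
        # On Python 3, text names are already str, so they are left as-is.
        schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in schema]
    # Non-list schemas (e.g. a StructType) are returned untouched.
    return schema


names = normalize_column_names([u"col", "other"])
print(names)
print(all(isinstance(n, str) for n in names))
```

With this normalization, every name in the list is a native `str`, which matches what **StructField** already produces for schemas built from a **StructType**.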
    
    ## How was this patch tested?
    
    Passes the Jenkins tests (including a new Python doctest).
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #13097 from dongjoon-hyun/SPARK-15244.
tests.py 73.69 KiB