    [SPARK-15264][SPARK-15274][SQL] CSV Reader Error on Blank Column Names · 603f4453
    Bill Chambers authored
    ## What changes were proposed in this pull request?
    
    When `header` is set to `true` and the CSV header row begins with:
    - `,,`
    OR
    - `"","",`
    
    that is, when the leading column names are empty or blank strings, each such column name is replaced with `C` followed by the index of that column. For example, if you were to read in the CSV:
    ```
    "","second column"
    "hello", "there"
    ```
    Then the column names would become `"C0", "second column"`.
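
The renaming rule described above can be sketched in plain Python. This is only an illustration, not the code in the patch: the helper name `make_safe_header` and the `prefix` parameter are hypothetical, and it assumes that "empty or blank" means an empty string or a missing value.

```python
import csv
import io


def make_safe_header(row, prefix="C"):
    """Replace empty or missing header cells with prefix + column index.

    Hypothetical helper for illustration; not Spark's implementation.
    """
    return [
        name if name is not None and name != "" else f"{prefix}{index}"
        for index, name in enumerate(row)
    ]


# The sample CSV from the description above.
sample = '"","second column"\n"hello", "there"\n'

# Read only the header row and apply the renaming rule.
header = next(csv.reader(io.StringIO(sample)))
names = make_safe_header(header)
print(names)
```

Non-empty names pass through unchanged, so only the columns that would otherwise be unaddressable get a generated name.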
    
    This behavior aligns with what currently happens when `header` is specified to be `false` in recent versions of Spark.
    
    ### Current Behavior in Spark <=1.6
    In Spark <=1.6, a blank column name in a CSV is read as an empty string, `""`, meaning that the column cannot be accessed. However, the CSV is read without error.
    
    ### Current Behavior in Spark 2.0
    Spark throws a `NullPointerException` and will not read in the file.
    
    #### Reproduction in 2.0
    https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346304/2828750690305044/484361/latest.html
    
    ## How was this patch tested?
    A new test was added to `CSVSuite` to cover this issue. Its assertions check that both the columns with empty names and the regular columns can be selected.
    
    Author: Bill Chambers <bill@databricks.com>
    Author: Bill Chambers <wchambers@ischool.berkeley.edu>
    
    Closes #13041 from anabranch/master.