    [SPARK-14231] [SQL] JSON data source infers floating-point values as a double when they do not fit in a decimal
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-14231
    
Currently, the JSON data source supports inferring `DecimalType` for big numbers, and provides a `floatAsBigDecimal` option that reads floating-point values as `DecimalType`.
    
However, Spark's `DecimalType` has two restrictions:
    
1. The precision cannot be greater than 38.
2. The scale cannot be greater than the precision.
    
Currently, neither restriction is handled during schema inference.
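As a concrete illustration (a hedged sketch, not part of this PR), the two sample values used in the reproduction below each trip one of the restrictions when parsed as `java.math.BigDecimal`, which is roughly how the JSON parser hands numbers to schema inference:

```scala
// Hedged sketch: how the sample values below map onto the two restrictions.
val big = new java.math.BigDecimal("1" + "0" * 38)
big.precision  // 39 -> exceeds DecimalType's maximum precision of 38

val small = new java.math.BigDecimal("0.01")
small.precision  // 1
small.scale      // 2 -> scale (2) greater than precision (1), which DecimalType rejects
```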
    
This PR handles these cases by inferring such values as `DoubleType` instead. Also, the option was renamed from `floatAsBigDecimal` to `prefersDecimal`, as suggested [here](https://issues.apache.org/jira/browse/SPARK-14231?focusedCommentId=15215579&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15215579).
    
So, the code below:
    
```scala
import org.apache.spark.rdd.RDD

// "a" holds 39-digit integers (precision > 38); "b" holds 0.01/0.02 (scale > precision).
def doubleRecords: RDD[String] =
  sqlContext.sparkContext.parallelize(
    s"""{"a": 1${"0" * 38}, "b": 0.01}""" ::
    s"""{"a": 2${"0" * 38}, "b": 0.02}""" :: Nil)

val jsonDF = sqlContext.read
  .option("prefersDecimal", "true")
  .json(doubleRecords)
jsonDF.printSchema()
```
    
produces the following:
    
    - **Before**
    
```
    org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
    	at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44)
    	at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
    	at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
    	at
    ...
    ```
    
    - **After**
    
```
    root
     |-- a: double (nullable = true)
     |-- b: double (nullable = true)
    ```
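For contrast, a hedged sketch (not part of this PR): a floating-point value whose precision and scale do fit `DecimalType`'s limits should still be inferred as a decimal when `prefersDecimal` is enabled:

```scala
// Hedged sketch: 0.1 has precision 1 and scale 1, so it fits DecimalType.
val fitting = sqlContext.sparkContext.parallelize("""{"c": 0.1}""" :: Nil)
sqlContext.read
  .option("prefersDecimal", "true")
  .json(fitting)
  .printSchema()
// Expected to print something like:
// root
//  |-- c: decimal(1,1) (nullable = true)
```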
    
    ## How was this patch tested?
    
Unit tests were added, and `./dev/run_tests` was run for coding style checks.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #12030 from HyukjinKwon/SPARK-14231.