Commit ac84fb64 authored 8 years ago by hyukjinkwon Committed by Wenchen Fan 8 years ago
[SPARK-16434][SQL] Avoid per-record type dispatch in JSON when reading

## What changes were proposed in this pull request?

Currently, `JacksonParser.parse` is doing type-based dispatch for each row to convert the tokens to appropriate values for Spark.
It might not have to be done like this because the schema is already kept.

So, appropriate converters can be created first according to the schema once, and then apply them to each row.

This PR corrects `JacksonParser` so that it creates all converters for the schema once and then applies them to each row rather than type dispatching for every row.

Benchmark was proceeded with the codes below:

#### Parser tests

**Before**

```scala
test("Benchmark for JSON converter") {
  val N = 500 << 8
  val row =
    """{"struct":{"field1": true, "field2": 92233720368547758070},
    "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]},
    "arrayOfString":["str1", "str2"],
    "arrayOfInteger":[1, 2147483647, -2147483648],
    "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808],
    "arrayOfBigInteger":[922337203685477580700, -922337203685477580800],
    "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308],
    "arrayOfBoolean":[true, false, true],
    "arrayOfNull":[null, null, null, null],
    "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}],
    "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]],
    "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]]
   }"""
  val data = List.fill(N)(row)
  val dummyOption = new JSONOptions(Map.empty[String, String])
  val schema =
    InferSchema.infer(spark.sparkContext.parallelize(Seq(row)), "", dummyOption)
  val factory = new JsonFactory()

  val benchmark = new Benchmark("JSON converter", N)
  benchmark.addCase("convert JSON file", 10) { _ =>
    data.foreach { input =>
      val parser = factory.createParser(input)
      parser.nextToken()
      JacksonParser.convertRootField(factory, parser, schema)
    }
  }
  benchmark.run()
}
```

```
JSON converter:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
convert JSON file                             1697 / 1807          0.1       13256.9       1.0X
```

**After**

```scala
test("Benchmark for JSON converter") {
  val N = 500 << 8
  val row =
    """{"struct":{"field1": true, "field2": 92233720368547758070},
    "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]},
    "arrayOfString":["str1", "str2"],
    "arrayOfInteger":[1, 2147483647, -2147483648],
    "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808],
    "arrayOfBigInteger":[922337203685477580700, -922337203685477580800],
    "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308],
    "arrayOfBoolean":[true, false, true],
    "arrayOfNull":[null, null, null, null],
    "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}],
    "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]],
    "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]]
   }"""
  val data = List.fill(N)(row)
  val dummyOption = new JSONOptions(Map.empty[String, String], new SQLConf())
  val schema =
    InferSchema.infer(spark.sparkContext.parallelize(Seq(row)), dummyOption)

  val benchmark = new Benchmark("JSON converter", N)
  benchmark.addCase("convert JSON file", 10) { _ =>
    val parser = new JacksonParser(schema, dummyOption)
    data.foreach { input =>
      parser.parse(input)
    }
  }
  benchmark.run()
}
```

```
JSON converter:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
convert JSON file                             1401 / 1461          0.1       10947.4       1.0X
```

It seems parsing time is improved by roughly ~20%

#### End-to-End test

```scala
test("Benchmark for JSON reader") {
  val N = 500 << 8
  val row =
    """{"struct":{"field1": true, "field2": 92233720368547758070},
    "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]},
    "arrayOfString":["str1", "str2"],
    "arrayOfInteger":[1, 2147483647, -2147483648],
    "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808],
    "arrayOfBigInteger":[922337203685477580700, -922337203685477580800],
    "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308],
    "arrayOfBoolean":[true, false, true],
    "arrayOfNull":[null, null, null, null],
    "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}],
    "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]],
    "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]]
   }"""
  val df = spark.sqlContext.read.json(spark.sparkContext.parallelize(List.fill(N)(row)))
  withTempPath { path =>
    df.write.format("json").save(path.getCanonicalPath)

    val benchmark = new Benchmark("JSON reader", N)
    benchmark.addCase("reading JSON file", 10) { _ =>
      spark.read.format("json").load(path.getCanonicalPath).collect()
    }
    benchmark.run()
  }
}
```

**Before**

```
JSON reader:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading JSON file                             6485 / 6924          0.0       50665.0       1.0X
```

**After**

```
JSON reader:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading JSON file                             6350 / 6529          0.0       49609.3       1.0X
```

## How was this patch tested?

Existing test cases should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14102 from HyukjinKwon/SPARK-16434.
parent 7a9e25c3
No related branches found
No related tags found
No related merge requests found
Expand all Hide whitespace changes
Inline Side-by-side
Showing with 297 additions and 216 deletions
Please register or to comment