[SPARK-5938] [SPARK-5443] [SQL] Improve JsonRDD performance
This patch comprises a few related pieces of work:

* Schema inference is performed directly on the JSON token stream
* `String => Row` conversion populates Spark SQL structures without intermediate types
* Projection pushdown is implemented via CatalystScan for DataFrame queries
* The legacy parser remains available by setting `spark.sql.json.useJacksonStreamingAPI` to `false`

Performance improvements depend on the schema and the queries being executed, but the patched code should be faster across the board. Below are benchmarks using the last.fm Million Song dataset:

```
Command                                            | Baseline | Patched
---------------------------------------------------|----------|--------
import sqlContext.implicits._                      |          |
val df = sqlContext.jsonFile("/tmp/lastfm.json")   | 70.0s    | 14.6s
df.count()                                         | 28.8s    | 6.2s
df.rdd.count()                                     | 35.3s    | 21.5s
df.where($"artist" === "Robert Hood").collect()    | 28.3s    | 16.9s
```

To prepare this dataset for benchmarking, follow these steps:

```
# Fetch the datasets from http://labrosa.ee.columbia.edu/millionsong/lastfm
wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_test.zip \
     http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_train.zip

# Decompress and combine, pipe through `jq -c` to ensure there is one record per line
unzip -p lastfm_test.zip lastfm_train.zip | jq -c . > lastfm.json
```

Author: Nathan Howell <nhowell@godaddy.com>

Closes #5801 from NathanHowell/json-performance and squashes the following commits:

26fea31 [Nathan Howell] Recreate the baseRDD each for each scan operation
a7ebeb2 [Nathan Howell] Increase coverage of inserts into a JSONRelation
e06a1dd [Nathan Howell] Add comments to the `useJacksonStreamingAPI` config flag
6822712 [Nathan Howell] Split up JsonRDD2 into multiple objects
fa8234f [Nathan Howell] Wrap long lines
b31917b [Nathan Howell] Rename `useJsonRDD2` to `useJacksonStreamingAPI`
15c5d1b [Nathan Howell] JSONRelation's baseRDD need not be lazy
f8add6e [Nathan Howell] Add comments on lack of support for precision and scale DecimalTypes
fa0be47 [Nathan Howell] Remove unused default case in the field parser
80dba17 [Nathan Howell] Add comments regarding null handling and empty strings
842846d [Nathan Howell] Point the empty schema inference test at JsonRDD2
ab6ee87 [Nathan Howell] Add projection pushdown support to JsonRDD/JsonRDD2
f636c14 [Nathan Howell] Enable JsonRDD2 by default, add a flag to switch back to JsonRDD
0bbc445 [Nathan Howell] Improve JSON parsing and type inference performance
7ca70c1 [Nathan Howell] Eliminate arrow pattern, replace with pattern matches
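As a rough illustration of what working "directly on the JSON token stream" can look like, here is a minimal sketch that uses Jackson's streaming parser to infer coarse types for top-level fields and to read only a requested subset of fields. It is not the code from this patch: the object `TokenStreamSketch`, its two methods, and the type names it returns are hypothetical, and it assumes `jackson-core` is on the classpath (as it is in a Spark shell).

```scala
import com.fasterxml.jackson.core.{JsonFactory, JsonToken}

// Hypothetical helper, for illustration only; not part of Spark.
object TokenStreamSketch {
  private val factory = new JsonFactory()

  /** Maps each top-level field to a coarse type name, reading one token at a
    * time instead of materializing the record as an intermediate tree. */
  def inferTopLevel(json: String): Map[String, String] = {
    val parser = factory.createParser(json)
    try {
      require(parser.nextToken() == JsonToken.START_OBJECT, "expected a JSON object")
      var fields = Map.empty[String, String]
      while (parser.nextToken() == JsonToken.FIELD_NAME) {
        val name = parser.getCurrentName()
        val typeName = parser.nextToken() match {
          case JsonToken.VALUE_STRING                       => "string"
          case JsonToken.VALUE_NUMBER_INT                   => "long"
          case JsonToken.VALUE_NUMBER_FLOAT                 => "double"
          case JsonToken.VALUE_TRUE | JsonToken.VALUE_FALSE => "boolean"
          case JsonToken.VALUE_NULL                         => "null"
          // Nested values are skipped here; a real implementation would recurse.
          case JsonToken.START_OBJECT => parser.skipChildren(); "struct"
          case JsonToken.START_ARRAY  => parser.skipChildren(); "array"
          case other                  => sys.error(s"unexpected token: $other")
        }
        fields += name -> typeName
      }
      fields
    } finally parser.close()
  }

  /** Returns the text of the requested scalar top-level fields and skips
    * everything else, which is the basic idea behind projection pushdown. */
  def project(json: String, required: Set[String]): Map[String, String] = {
    val parser = factory.createParser(json)
    try {
      require(parser.nextToken() == JsonToken.START_OBJECT, "expected a JSON object")
      var out = Map.empty[String, String]
      while (parser.nextToken() == JsonToken.FIELD_NAME) {
        val name = parser.getCurrentName()
        val value = parser.nextToken()
        if (required(name) && value.isScalarValue) out += name -> parser.getText()
        else parser.skipChildren() // no-op for scalars, skips whole arrays/objects
      }
      out
    } finally parser.close()
  }
}
```

For a made-up record such as `{"artist":"Robert Hood","tags":[["techno","100"]],"n":3}`, `inferTopLevel` reports `artist` as a string, `tags` as an array, and `n` as a long, while `project(record, Set("artist"))` never descends into the `tags` array at all. Skipping unrequested fields at the token level is one plausible reason the filtered query in the benchmark above gets cheaper, though the actual gains depend on the schema and query.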
Showing 13 changed files:
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala (23 additions, 20 deletions)
- sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala (4 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala (2 additions, 2 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala (8 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala (21 additions, 13 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/json/InferSchema.scala (171 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala (78 additions, 21 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/json/JacksonGenerator.scala (77 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/json/JacksonParser.scala (215 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/json/JacksonUtils.scala (32 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala (0 additions, 50 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/json/JsonSuite.scala (37 additions, 14 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala (47 additions, 8 deletions)