  1. Sep 15, 2017
  2. Sep 13, 2017
    • [SPARK-20427][SQL] Read JDBC table use custom schema · 17edfec5
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      The auto-generated Oracle schema is sometimes not what we expect:
      - `number(1)` is auto-mapped to BooleanType, which is sometimes not what we expect, per SPARK-20921.
      - `number` is auto-mapped to Decimal(38,10), which cannot read large values, per SPARK-20427.
      This PR fixes the issue by supporting a custom schema, as follows:
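      import java.util.Properties
      // Pass the desired column types through the customSchema property.
      val props = new Properties()
      props.put("customSchema", "ID decimal(38, 0), N1 int, N2 boolean")
      val dfRead = spark.read.jdbc(jdbcUrl, "tableWithCustomSchema", props)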
      CREATE TEMPORARY VIEW tableWithCustomSchema
      USING org.apache.spark.sql.jdbc
      OPTIONS (url '$jdbcUrl', dbTable 'tableWithCustomSchema', customSchema 'ID decimal(38, 0), N1 int, N2 boolean')
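The idea behind `customSchema` can be illustrated with a minimal, Spark-free sketch: user-specified column types override an automatically inferred mapping. The object `CustomSchemaSketch`, its helper names, the naive string parsing and the case-insensitive name matching are all assumptions made for illustration; this is not how Spark itself parses or applies the option.

```scala
object CustomSchemaSketch {
  // Split on commas that are not inside parentheses, so "decimal(38, 0)" stays intact.
  def splitTopLevel(ddl: String): Seq[String] = {
    val parts = scala.collection.mutable.ArrayBuffer.empty[String]
    val current = new StringBuilder
    var depth = 0
    for (c <- ddl) {
      if (c == ',' && depth == 0) {
        parts += current.toString.trim
        current.clear()
      } else {
        if (c == '(') depth += 1
        if (c == ')') depth -= 1
        current.append(c)
      }
    }
    parts += current.toString.trim
    parts.filter(_.nonEmpty).toList
  }

  // Parse "ID decimal(38, 0), N1 int" into ordered (columnName, typeName) pairs.
  def parse(ddl: String): Seq[(String, String)] =
    splitTopLevel(ddl).map { field =>
      field.split("\\s+", 2) match {
        case Array(name, tpe) => name -> tpe.trim
        case _ => throw new IllegalArgumentException(s"Cannot parse column definition: '$field'")
      }
    }

  // Replace inferred types with user-specified ones; unlisted columns keep their inferred type.
  // Matching column names case-insensitively is an assumption of this sketch.
  def applyOverrides(
      inferred: Seq[(String, String)],
      custom: Seq[(String, String)]): Seq[(String, String)] = {
    val overrides = custom.map { case (n, t) => n.toLowerCase -> t }.toMap
    inferred.map { case (n, t) => n -> overrides.getOrElse(n.toLowerCase, t) }
  }
}
```

For example, `CustomSchemaSketch.parse("ID decimal(38, 0), N1 int, N2 boolean")` yields the three column/type pairs in order, and `applyOverrides` leaves any column that is not mentioned in the custom schema untouched.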
      ## How was this patch tested?
      unit tests
      Author: Yuming Wang <>
      Closes #18266 from wangyum/SPARK-20427.
  3. Sep 10, 2017
    • [SPARK-21610][SQL] Corrupt records are not handled properly when creating a dataframe from a file · 6273a711
      Jen-Ming Chung authored
      ## What changes were proposed in this pull request?
      echo '{"field": 1}
      {"field": 2}
      {"field": "3"}' >/tmp/sample.json
      import org.apache.spark.sql.types._
      val schema = new StructType()
        .add("field", ByteType)
        .add("_corrupt_record", StringType)
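      val file = "/tmp/sample.json"
      val dfFromFile = spark.read.schema(schema).json(file)
      scala> dfFromFile.show(false)
      +-----+---------------+
      |field|_corrupt_record|
      +-----+---------------+
      |1    |null           |
      |2    |null           |
      |null |{"field": "3"} |
      +-----+---------------+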
      scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
      res1: Long = 0
      scala> dfFromFile.filter($"_corrupt_record".isNull).count()
      res2: Long = 3
      When the `requiredSchema` contains only `_corrupt_record`, the derived `actualSchema` is empty and the `_corrupt_record` values are null for all rows. This PR detects this situation and raises an exception with a reasonable workaround message, so that users know what happened and how to fix the query.
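The behaviour described above can be modelled without Spark. The following is a simplified sketch, assuming a parser that only validates the data columns a query actually requires; the names `CorruptRecordSketch`, `scan` and `checkedScan`, the regex-based "parser" and the error message are hypothetical and do not reproduce Spark's code or its actual message.

```scala
object CorruptRecordSketch {
  // A parsed row: the value of `field` if it parsed as a Byte, or the raw text of a corrupt record.
  final case class Row(field: Option[Byte], corruptRecord: Option[String])

  // Accepts only lines of the form {"field": <integer>}; a deliberately tiny stand-in for a JSON parser.
  private val Valid = """\{"field":\s*(-?\d+)\}""".r

  // Any line whose `field` is not a bare integer that fits in a Byte is treated as corrupt.
  def parseLine(line: String): Row = line.trim match {
    case Valid(n) if scala.util.Try(n.toByte).isSuccess => Row(Some(n.toByte), None)
    case other => Row(None, Some(other))
  }

  // Model of the reported behaviour: with no data column required, nothing is parsed,
  // nothing can fail, and the corrupt-record column stays empty for every row.
  def scan(lines: Seq[String], requiredDataColumns: Set[String]): Seq[Row] =
    if (requiredDataColumns.isEmpty) lines.map(_ => Row(None, None))
    else lines.map(parseLine)

  // Model of the fix: fail up front instead of silently returning nulls.
  def checkedScan(lines: Seq[String], requiredDataColumns: Set[String]): Seq[Row] = {
    require(requiredDataColumns.nonEmpty,
      "the required columns contain only the corrupt-record column")
    scan(lines, requiredDataColumns)
  }
}
```

With the three sample lines from the report, `scan(lines, Set("field"))` flags one corrupt row, whereas `scan(lines, Set.empty)` flags none, which mirrors the surprising counts in the report; `checkedScan(lines, Set.empty)` throws instead.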
      ## How was this patch tested?
      Added test case.
      Author: Jen-Ming Chung <>
      Closes #18865 from jmchung/SPARK-21610.
  4. Sep 07, 2017
  5. Sep 05, 2017
  6. Aug 11, 2017
  7. Aug 08, 2017
  8. Aug 01, 2017
    • [SPARK-21589][SQL][DOC] Add documents about Hive UDF/UDTF/UDAF · 110695db
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR adds documentation about unsupported functions in Hive UDF/UDTF/UDAF.
      This PR relates to #18768 and #18527.
      ## How was this patch tested?
      Author: Takeshi Yamamuro <>
      Closes #18792 from maropu/HOTFIX-20170731.
  9. Jul 06, 2017
    • [SPARK-21267][SS][DOCS] Update Structured Streaming Documentation · 0217dfd2
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      A few changes to the Structured Streaming documentation:
      - Clarify that the entire stream input table is not materialized
      - Add information for Ganglia
      - Add Kafka Sink to the main docs
      - Removed a couple of leftover experimental tags
      - Added more associated reading material and talk videos.
      In addition, the link to the RDD programming guide was broken in several places when the page was renamed. This PR fixes those links. cc sameeragarwal cloud-fan
      - Added a redirection to avoid breaking internal and possible external links.
      - Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.
      ## How was this patch tested?
      Author: Tathagata Das <>
      Closes #18485 from tdas/SPARK-21267.
  10. Jun 15, 2017
  11. May 26, 2017
    • [SPARK-20694][DOCS][SQL] Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide · ae33abf7
      zero323 authored
      ## What changes were proposed in this pull request?
      - Add Scala, Python and Java examples for `partitionBy`, `sortBy` and `bucketBy`.
      - Add _Bucketing, Sorting and Partitioning_ section to SQL Programming Guide
      - Remove bucketing from Unsupported Hive Functionalities.
      ## How was this patch tested?
      Manual tests, docs build.
      Author: zero323 <>
      Closes #17938 from zero323/DOCS-BUCKETING-AND-PARTITIONING.
  12. May 25, 2017
  13. Apr 19, 2017
    • Fixed typos in docs · bdc60569
      ymahajan authored
      ## What changes were proposed in this pull request?
      Fixes typos in a couple of places in the docs.
      ## How was this patch tested?
      build including docs
      Author: ymahajan <>
      Closes #17690 from ymahajan/master.
  14. Apr 12, 2017
    • [MINOR][DOCS] JSON APIs related documentation fixes · bca4259f
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      This PR proposes corrections related to JSON APIs as below:
      - Rendering links in Python documentation
      - Replacing `RDD` with `Dataset` in the programming guide
      - Adding missing description about JSON Lines consistently in `DataFrameReader.json` in Python API
      - De-duplicating a little bit of `DataFrameReader.json` in Scala/Java API
      ## How was this patch tested?
      Manually built the documentation via `jekyll build`. Corresponding snapshots will be left on the code.
      Note that there are currently Javadoc 8 breaks in several places. These are proposed to be handled separately, so this PR does not fix them.
      Author: hyukjinkwon <>
      Closes #17602 from HyukjinKwon/minor-json-documentation.
  15. Apr 11, 2017
    • [MINOR][DOCS] Update supported versions for Hive Metastore · cde9e328
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      Since SPARK-18112 and SPARK-13446, Apache Spark supports reading Hive metastore 2.0 ~ 2.1.1. This PR updates the docs accordingly.
      ## How was this patch tested?
      Author: Dongjoon Hyun <>
      Closes #17612 from dongjoon-hyun/metastore.
  16. Mar 23, 2017
    • [SPARK-10849][SQL] Adds option to the JDBC data source write for user to... · c7911807
      sureshthalamati authored
      [SPARK-10849][SQL] Adds option to the JDBC data source write for user to specify database column type for the create table
      ## What changes were proposed in this pull request?
      Currently, the JDBC data source creates tables in the target database using the default type mapping and the JDBC dialect mechanism. If users want to specify a different database data type for only some of the columns, there is no option available. In scenarios where the default mapping does not work, users are forced to create tables on the target database before writing. This workaround is probably not acceptable from a usability point of view. This PR provides a user-defined type mapping for specific columns.
      The solution is to allow users to specify the database column data types for the create table as a JDBC datasource option (`createTableColumnTypes`) on write. Data type information can be specified in the same format as the table schema DDL format (e.g. `name CHAR(64), comments VARCHAR(1024)`).
      Not all supported target database types can be specified; the data types must also be valid Spark SQL data types. For example, users cannot specify the target database CLOB data type. This will be supported in a follow-up PR.
      df.write
        .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
        .jdbc(url, "TEST.DBCOLTYPETEST", properties)
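As a rough illustration of the per-column override described above, the sketch below builds a `CREATE TABLE` statement in which user-specified types take precedence over a default mapping. The object `CreateTableTypesSketch`, the function `createTableSql`, the placeholder default types and the rejection of unknown column names are assumptions for illustration, not the code added by this PR.

```scala
object CreateTableTypesSketch {
  // Build a CREATE TABLE statement, using the user-specified database type for a column
  // when one is given and falling back to the default mapping otherwise.
  def createTableSql(
      table: String,
      defaultTypes: Seq[(String, String)],
      userTypes: Map[String, String]): String = {
    // Rejecting overrides for columns that do not exist is an assumption of this sketch.
    val unknown = userTypes.keySet -- defaultTypes.map(_._1)
    require(unknown.isEmpty, s"Unknown columns in type overrides: ${unknown.mkString(", ")}")
    val columns = defaultTypes.map { case (name, tpe) =>
      s"$name ${userTypes.getOrElse(name, tpe)}"
    }
    s"CREATE TABLE $table (${columns.mkString(", ")})"
  }
}
```

Calling `createTableSql` with overrides for `name` and `comments` only changes those two columns; every other column keeps whatever type the default mapping produced.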
      ## How was this patch tested?
      Added new test cases to the JDBCWriteSuite
      Author: sureshthalamati <>
      Closes #16209 from sureshthalamati/jdbc_custom_dbtype_option_json-spark-10849.
  17. Mar 02, 2017
  18. Feb 25, 2017
    • [MINOR][DOCS] Fixes two problems in the SQL programing guide page · 061bcfb8
      Boaz Mohar authored
      ## What changes were proposed in this pull request?
      Removed duplicated lines in the SQL Python example and fixed a typo.
      ## How was this patch tested?
      Searched for other typos in the page to minimize PRs.
      Author: Boaz Mohar <>
      Closes #17066 from boazmohar/doc-fix.
  19. Feb 14, 2017
  20. Jan 30, 2017
  21. Jan 25, 2017
    • [SPARK-16046][DOCS] Aggregations in the Spark SQL programming guide · 3fdce814
      aokolnychyi authored
      ## What changes were proposed in this pull request?
      - A separate subsection for Aggregations under “Getting Started” in the Spark SQL programming guide. It mentions which aggregate functions are predefined and how users can create their own.
      - Examples of using the `UserDefinedAggregateFunction` abstract class for untyped aggregations in Java and Scala.
      - Examples of using the `Aggregator` abstract class for type-safe aggregations in Java and Scala.
      - Python is not covered.
      - The PR might not resolve the ticket since I do not know what exactly was planned by the author.
      In total, there are four new standalone examples that can be executed via `spark-submit` or `run-example`. The updated Spark SQL programming guide references these examples and does not contain hard-coded snippets.
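To give a feel for what a type-safe aggregation looks like, here is a Spark-free sketch. The trait `SimpleAggregator` is a hypothetical stand-in that mirrors only the `zero`, `reduce`, `merge` and `finish` methods of Spark's `Aggregator` and omits encoders; `MyAverage` and `run` are likewise illustrative names and are not the examples added by this PR.

```scala
object AggregatorSketch {
  // Shape modelled on the four core methods of Spark's Aggregator; encoders are omitted here.
  trait SimpleAggregator[IN, BUF, OUT] {
    def zero: BUF                             // initial, empty buffer
    def reduce(buffer: BUF, input: IN): BUF   // fold one input value into a buffer
    def merge(b1: BUF, b2: BUF): BUF          // combine two partial buffers
    def finish(reduction: BUF): OUT           // turn the final buffer into the result
  }

  final case class Average(sum: Long, count: Long)

  object MyAverage extends SimpleAggregator[Long, Average, Double] {
    def zero: Average = Average(0L, 0L)
    def reduce(buffer: Average, input: Long): Average =
      Average(buffer.sum + input, buffer.count + 1)
    def merge(b1: Average, b2: Average): Average =
      Average(b1.sum + b2.sum, b1.count + b2.count)
    def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  }

  // Simulate a distributed run: reduce each partition independently, then merge the partial buffers.
  def run[IN, BUF, OUT](agg: SimpleAggregator[IN, BUF, OUT], partitions: Seq[Seq[IN]]): OUT = {
    val partials = partitions.map(_.foldLeft(agg.zero)(agg.reduce))
    agg.finish(partials.foldLeft(agg.zero)(agg.merge))
  }
}
```

The separation between `reduce` and `merge` is what lets partial results be computed per partition and combined afterwards; the input, buffer and output types are all checked at compile time.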
      ## How was this patch tested?
      The patch was tested locally by building the docs. The examples were run as well.
      Author: aokolnychyi <>
      Closes #16329 from aokolnychyi/SPARK-16046.
  22. Jan 07, 2017
  23. Jan 05, 2017
  24. Dec 30, 2016
    • [SPARK-19016][SQL][DOC] Document scalable partition handling · 871f6114
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      This PR documents the scalable partition handling feature in the body of the programming guide.
      Before this PR, we only mentioned it in the migration guide. It was not very clear that, since 2.1, external datasource tables require an extra `MSCK REPAIR TABLE` command to have per-partition information persisted.
      ## How was this patch tested?
      Author: Cheng Lian <>
      Closes #16424 from liancheng/scalable-partition-handling-doc.
  25. Dec 06, 2016
  26. Dec 05, 2016
  27. Nov 29, 2016
  28. Nov 26, 2016
    • [WIP][SQL][DOC] Fix incorrect `code` tag · f4a98e42
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      This PR is to fix incorrect `code` tag in ``
      ## How was this patch tested?
      Author: Weiqing Yang <>
      Closes #15941 from weiqingy/fixtag.
  29. Nov 25, 2016
  30. Nov 21, 2016
    • [SPARK-18413][SQL] Add `maxConnections` JDBCOption · 07beb5d2
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      This PR adds a new JDBCOption `maxConnections`, which sets the maximum number of simultaneous JDBC connections allowed. This option applies only to writing; a `coalesce` operation is performed if needed. It defaults to the number of partitions of the RDD. Previously, SQL users could not control this, while Scala/Java/Python users could use the `coalesce` (or `repartition`) API.
      **Reported Scenario**
      In the following case, the number of connections becomes 200 and the database cannot handle all of them.
      USING org.apache.spark.sql.jdbc
      OPTIONS (
        url "jdbc:oracle:thin:",
        dbtable "result",
        user "HIVE",
        password "HIVE"
      )
      -- set spark.sql.shuffle.partitions=200
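The rule described above (coalesce only if needed, defaulting to the number of partitions) can be written as a small Spark-free function. This is a sketch of the stated behaviour only; the object `MaxConnectionsSketch`, the function `writePartitions` and the positivity check are assumptions, not the code in this PR.

```scala
object MaxConnectionsSketch {
  // Number of partitions, and hence simultaneous JDBC connections, used for a write.
  // Without `maxConnections`, every partition opens its own connection.
  def writePartitions(numPartitions: Int, maxConnections: Option[Int]): Int =
    maxConnections match {
      case Some(n) =>
        // Rejecting non-positive values is an assumption of this sketch.
        require(n >= 1, s"maxConnections must be a positive number, got $n")
        // Coalesce only when there are more partitions than allowed connections.
        math.min(numPartitions, n)
      case None => numPartitions
    }
}
```

In the reported scenario, 200 shuffle partitions without the option mean 200 connections, while a limit such as `Some(10)` would reduce the write to 10 partitions. A limit larger than the partition count changes nothing, which matches the `maxConnections '4'` step of the manual test with 3 partitions.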
      ## How was this patch tested?
      Manual. Do the following and check the Spark UI.
      **Step 1 (MySQL)**
      CREATE TABLE t1 (a INT);
      CREATE TABLE data (a INT);
      INSERT INTO data VALUES (1);
      INSERT INTO data VALUES (2);
      INSERT INTO data VALUES (3);
      **Step 2 (Spark)**
      SPARK_HOME=$PWD bin/spark-shell --driver-memory 4G --driver-class-path mysql-connector-java-5.1.40-bin.jar
      scala> sql("SET spark.sql.shuffle.partitions=3")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW data USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 'data', user 'root', password '')")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '1')")
      scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '2')")
      scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '3')")
      scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '4')")
      scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
      Author: Dongjoon Hyun <>
      Closes #15868 from dongjoon-hyun/SPARK-18413.
  31. Nov 16, 2016
    • [MINOR][DOC] Fix typos in the 'configuration', 'monitoring' and... · 241e04bc
      Weiqing Yang authored
      [MINOR][DOC] Fix typos in the 'configuration', 'monitoring' and 'sql-programming-guide' documentation
      ## What changes were proposed in this pull request?
      Fix typos in the 'configuration', 'monitoring' and 'sql-programming-guide' documentation.
      ## How was this patch tested?
      Author: Weiqing Yang <>
      Closes #15886 from weiqingy/fixTypo.
  32. Oct 27, 2016
  33. Oct 24, 2016
    • [SPARK-17810][SQL] Default spark.sql.warehouse.dir is relative to local FS but... · 4ecbe1b9
      Sean Owen authored
      [SPARK-17810][SQL] Default spark.sql.warehouse.dir is relative to local FS but can resolve as HDFS path
      ## What changes were proposed in this pull request?
      Always resolve `spark.sql.warehouse.dir` as a local path, relative to the working directory rather than the home directory.
      ## How was this patch tested?
      Existing tests.
      Author: Sean Owen <>
      Closes #15382 from srowen/SPARK-17810.
  34. Oct 18, 2016
  35. Oct 14, 2016
  36. Oct 11, 2016
    • [SPARK-17719][SPARK-17776][SQL] Unify and tie up options in a single place in... · 0c0ad436
      hyukjinkwon authored
      [SPARK-17719][SPARK-17776][SQL] Unify and tie up options in a single place in JDBC datasource package
      ## What changes were proposed in this pull request?
      This PR proposes to fix arbitrary usages among `Map[String, String]`, `Properties` and `JDBCOptions` instances for options in `execution/jdbc` package and make the connection properties exclude Spark-only options.
      This PR includes some changes as below:
      - Unify `Map[String, String]`, `Properties` and `JDBCOptions` in `execution/jdbc` package to `JDBCOptions`.
      - Move `batchsize`, `fetchsize`, `driver` and `isolationlevel` options into `JDBCOptions` instance.
      - Document `batchSize` and `isolationlevel`, marking which options are read-only and which are write-only. This also includes minor typo fixes and more detailed explanations for some statements, such as the one for url.
      - Throw exceptions fast by checking arguments first rather than at execution time (e.g. for `fetchsize`).
      - Exclude Spark-only options in connection properties.
      ## How was this patch tested?
      Existing tests should cover this.
      Author: hyukjinkwon <>
      Closes #15292 from HyukjinKwon/SPARK-17719.
  37. Oct 10, 2016
    • [SPARK-17338][SQL] add global temp view · 23ddff4b
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      A global temporary view is a cross-session temporary view, which means it's shared among all sessions. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system preserved database `global_temp` (configurable via SparkConf), and we must use the qualified name to refer to a global temp view, e.g. `SELECT * FROM global_temp.view1`.
      changes for `SessionCatalog`:
      1. add a new field `globalTempViews: GlobalTempViewManager`, to access the shared global temp views, and the global temp db name.
      2. `createDatabase` will fail if users try to create `global_temp`, which is system preserved.
      3. `setCurrentDatabase` will fail if users try to set `global_temp`, which is system preserved.
      4. add `createGlobalTempView`, which is used in `CreateViewCommand` to create global temp views.
      5. add `dropGlobalTempView`, which is used in `CatalogImpl` to drop global temp view.
      6. add `alterTempViewDefinition`, which is used in `AlterViewAsCommand` to update the view definition for local/global temp views.
      7. `renameTable`/`dropTable`/`isTemporaryTable`/`lookupRelation`/`getTempViewOrPermanentTableMetadata`/`refreshTable` will handle global temp views.
      changes for SQL commands:
      1. `CreateViewCommand`/`AlterViewAsCommand` is updated to support global temp views
      2. `ShowTablesCommand` outputs a new column `database`, which is used to distinguish global and local temp views.
      3. other commands can also handle global temp views if they call `SessionCatalog` APIs that accept global temp views, e.g. `DropTableCommand`, `AlterTableRenameCommand`, `ShowColumnsCommand`, etc.
      changes for other public API
      1. add a new method `dropGlobalTempView` in `Catalog`
      2. `Catalog.findTable` can find global temp view
      3. add a new method `createGlobalTempView` in `Dataset`
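The visibility rules described above (shared across sessions, reachable only through the `global_temp` qualifier) can be modelled with a small Spark-free registry. The classes `Application` and `Session` and the method `lookup` are hypothetical names for this sketch and are not the `SessionCatalog` or `GlobalTempViewManager` implementation; only the database name `global_temp` is taken from the description.

```scala
object GlobalTempViewSketch {
  // Name of the reserved database, as given in the description above.
  val GlobalTempDb = "global_temp"

  // One application owns the global views; they disappear when the application object does.
  final class Application {
    private val globalViews = scala.collection.mutable.Map.empty[String, String]
    def newSession(): Session = new Session(this)
    def putGlobal(name: String, query: String): Unit = globalViews.update(name, query)
    def getGlobal(name: String): Option[String] = globalViews.get(name)
  }

  final class Session(app: Application) {
    // Local temp views are private to the session that created them.
    private val localViews = scala.collection.mutable.Map.empty[String, String]
    def createTempView(name: String, query: String): Unit = localViews.update(name, query)
    def createGlobalTempView(name: String, query: String): Unit = app.putGlobal(name, query)

    // Unqualified names see only this session's views; global views need the qualifier.
    def lookup(name: String): Option[String] = name.split('.') match {
      case Array(GlobalTempDb, view) => app.getGlobal(view)
      case Array(view) => localViews.get(view)
      case _ => None
    }
  }
}
```

A view created in one session with `createGlobalTempView("view1", ...)` is found from another session as `global_temp.view1` but not as plain `view1`, while a local temp view stays invisible to other sessions.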
      ## How was this patch tested?
      new tests in `SQLViewSuite`
      Author: Wenchen Fan <>
      Closes #14897 from cloud-fan/global-temp-view.
  38. Sep 26, 2016
    • [SPARK-14525][SQL] Make work for jdbc · 50b89d05
      Justin Pihony authored
      ## What changes were proposed in this pull request?
      This change modifies the implementation of such that it works with jdbc, and the call to jdbc merely delegates to save.
      ## How was this patch tested?
      This was tested via unit tests in the JDBCWriteSuite, of which I added one new test to cover this scenario.
      ## Additional details
      rxin This seems to have been most recently touched by you and was also commented on in the JIRA.
      This contribution is my original work and I license the work to the project under the project's open source license.
      Author: Justin Pihony <>
      Closes #12601 from JustinPihony/jdbc_reconciliation.
  39. Sep 17, 2016