  1. Jul 23, 2016
    • [SPARK-16380][EXAMPLES] Update SQL examples and programming guide for Python language binding · 53b2456d
      Cheng Lian authored
      This PR is based on PR #14098 authored by wangmiao1981.
      
      ## What changes were proposed in this pull request?
      
      This PR replaces the original Python Spark SQL example file with the following three files:
      
      - `sql/basic.py`
      
        Demonstrates basic Spark SQL features.
      
      - `sql/datasource.py`
      
        Demonstrates various Spark SQL data sources.
      
      - `sql/hive.py`
      
        Demonstrates Spark SQL Hive interaction.
      
      This PR also removes hard-coded Python example snippets in the SQL programming guide by extracting snippets from the above files using the `include_example` Liquid template tag.
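
      For context, the `include_example` mechanism works by copying a marked region of an example file into the generated HTML. A minimal sketch of what such a marked region might look like in `sql/basic.py` (the label `basic_example` is hypothetical here):

      ```
      # $example on:basic_example$
      from pyspark.sql import SparkSession

      spark = SparkSession \
          .builder \
          .appName("Python Spark SQL basic example") \
          .getOrCreate()

      df = spark.read.json("examples/src/main/resources/people.json")
      df.show()
      # $example off:basic_example$
      ```

      In the guide's Markdown, a tag along the lines of `{% include_example basic_example python/sql/basic.py %}` would then expand to just that region.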
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #14317 from liancheng/py-examples-update.
  2. Jul 19, 2016
    • [SPARK-16568][SQL][DOCUMENTATION] Update SQL programming guide refreshTable API in Python code · 9674af6f
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Updates the `refreshTable` API in the Python code of the SQL programming guide.
      
      This API was added in SPARK-15820.
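
      For reference, a minimal sketch of the updated Python call (the table name `my_table` is a placeholder):

      ```
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # Invalidate and refresh all cached metadata for the given table.
      # In Spark 2.0 this call lives on the new Catalog interface.
      spark.catalog.refreshTable("my_table")
      ```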
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14220 from WeichenXu123/update_sql_doc_catalog.
    • [SPARK-16303][DOCS][EXAMPLES] Minor Scala/Java example update · 1426a080
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR moves the last remaining hard-coded Scala example snippet from the SQL programming guide into `SparkSqlExample.scala`. It also renames all Scala/Java example files so that every "Sql" in the file names becomes "SQL".
      
      ## How was this patch tested?
      
      Manually verified the generated HTML page.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #14245 from liancheng/minor-scala-example-update.
  3. Jul 14, 2016
    • [SPARK-16553][DOCS] Fix SQL example file name in docs · 01c4c1fa
      Shivaram Venkataraman authored
      ## What changes were proposed in this pull request?
      
      Fixes a typo in the SQL programming guide.
      
      ## How was this patch tested?
      
      Building docs locally
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #14208 from shivaram/spark-sql-doc-fix.
  4. Jul 12, 2016
    • [SPARK-15752][SQL] Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators · 5ad68ba5
      Lianhui Wang authored
      
      ## What changes were proposed in this pull request?
      When a query uses only metadata (for example, a partition key), it can return results based on that metadata without scanning files. Hive did this in HIVE-1003.
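
      To illustrate, a hedged sketch of the kind of query this optimization targets (table and column names are made up; the `spark.sql.optimizer.metadataOnly` flag is assumed to be the switch this PR adds):

      ```
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.enableHiveSupport().getOrCreate()

      # A Hive table partitioned by `dt`: the distinct values of `dt` are
      # recorded in the catalog, so queries over the partition key alone
      # can be answered from metadata without scanning any data files.
      spark.sql("CREATE TABLE IF NOT EXISTS events (id INT) PARTITIONED BY (dt STRING)")

      spark.sql("SELECT MAX(dt) FROM events").show()
      spark.sql("SELECT dt FROM events GROUP BY dt").show()
      ```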
      
      ## How was this patch tested?
      Added unit tests.
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: Lianhui Wang <lianhuiwang@users.noreply.github.com>
      
      Closes #13494 from lianhuiwang/metadata-only.
  5. Jun 20, 2016
    • [SPARK-15863][SQL][DOC] Initial SQL programming guide update for Spark 2.0 · 6df8e388
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      Initial SQL programming guide update for Spark 2.0. Contents like the 1.6-to-2.0 migration guide are still incomplete.
      
      We may also want to add more examples for Scala/Java Dataset typed transformations.
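
      As a taste of what the updated guide covers, a minimal sketch of the new unified entry point (the app name is a placeholder):

      ```
      from pyspark.sql import SparkSession

      # In Spark 2.0, SparkSession replaces SQLContext/HiveContext as the
      # entry point for DataFrame and SQL functionality.
      spark = SparkSession \
          .builder \
          .appName("SQL guide example") \
          .getOrCreate()

      df = spark.range(10)
      df.filter(df["id"] % 2 == 0).show()
      ```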
      
      ## How was this patch tested?
      
      N/A
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13592 from liancheng/sql-programming-guide-2.0.
  6. Jun 10, 2016
    • [DOCUMENTATION] Fixed groupby aggregation example for PySpark · 675a7371
      Mortada Mehyar authored
      ## What changes were proposed in this pull request?
      
      Fixes the documentation for the groupBy/agg example in Python.
      
      ## How was this patch tested?
      
      The existing example in the documentation does not contain valid syntax (a missing parenthesis) and does not use `Column` in the expression for `agg()`.
      
      After the fix, here's how I tested it:
      
      ```
      In [1]: from pyspark.sql import Row
      
      In [2]: import pyspark.sql.functions as func
      
      In [3]: %cpaste
      Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
      :records = [{'age': 19, 'department': 1, 'expense': 100},
      : {'age': 20, 'department': 1, 'expense': 200},
      : {'age': 21, 'department': 2, 'expense': 300},
      : {'age': 22, 'department': 2, 'expense': 300},
      : {'age': 23, 'department': 3, 'expense': 300}]
      :--
      
      In [4]: df = sqlContext.createDataFrame([Row(**d) for d in records])
      
      In [5]: df.groupBy("department").agg(df["department"], func.max("age"), func.sum("expense")).show()
      
      +----------+----------+--------+------------+
      |department|department|max(age)|sum(expense)|
      +----------+----------+--------+------------+
      |         1|         1|      20|         300|
      |         2|         2|      22|         600|
      |         3|         3|      23|         300|
      +----------+----------+--------+------------+
      ```
      
      Author: Mortada Mehyar <mortada.mehyar@gmail.com>
      
      Closes #13587 from mortada/groupby_agg_doc_fix.
  7. Apr 29, 2016
    • [SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR. · 4ae9fe09
      Sun Rui authored
      ## What changes were proposed in this pull request?
      
      dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.
      
      The function signature is:
      
      	dapply(df, function(localDF) {}, schema = NULL)
      
      R function input: a local data.frame from the partition on the local node
      R function output: a local data.frame
      
      The schema specifies the Row format of the resulting DataFrame and must match the R function's output.
      If the schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such a DataFrame can then be processed by successive calls to dapply().
      
      ## How was this patch tested?
      SparkR unit tests.
      
      Author: Sun Rui <rui.sun@intel.com>
      Author: Sun Rui <sunrui2016@gmail.com>
      
      Closes #12493 from sun-rui/SPARK-12919.
  8. Apr 25, 2016
    • [SPARK-14883][DOCS] Fix wrong R examples and make them up-to-date · 6ab4d9e0
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules.
      
      - Remove the wrong usage of `map`. `lapply` should be used in `sparkR` instead; however, `lapply` is private so far, so the corrected example will be added later.
      - Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency
      - Fix datatypes in `sparkr.md`.
      - Update a data result in `sparkr.md`.
      - Replace deprecated functions to remove warnings: jsonFile -> read.json, parquetFile -> read.parquet
      - Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, saveAsParquetFile -> write.parquet
      - Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`.
      - Other minor syntax fixes and a typo.
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12649 from dongjoon-hyun/SPARK-14883.
  9. Apr 14, 2016
    • [SPARK-14601][DOC] Minor doc/usage changes related to removal of Spark assembly · ff9ae61a
      Mark Grover authored
      ## What changes were proposed in this pull request?
      
      Removing references to assembly jar in documentation.
      Adding an additional (previously undocumented) usage of spark-submit to run examples.
      
      ## How was this patch tested?
      
      Ran spark-submit usage to ensure formatting was fine. Ran examples using SparkSubmit.
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #12365 from markgrover/spark-14601.
  10. Apr 11, 2016
    • [MINOR][DOCS] Fix wrong data types in JSON Datasets example. · 1a0cca1f
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the `age` data types from `integer` to `long` in `SQL Programming Guide: JSON Datasets`.
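
      The corrected types are easy to reproduce; a minimal sketch (the sample file ships with Spark, and the commented schema is what JSON schema inference produces, since JSON integers map to `LongType`):

      ```
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      df = spark.read.json("examples/src/main/resources/people.json")
      df.printSchema()
      # root
      #  |-- age: long (nullable = true)
      #  |-- name: string (nullable = true)
      ```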
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12290 from dongjoon-hyun/minor_fix_type_in_json_example.
  11. Apr 07, 2016
    • [SPARK-10063][SQL] Remove DirectParquetOutputCommitter · 9ca0760d
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes DirectParquetOutputCommitter. This was initially created by Databricks as a faster way to write Parquet data to S3. However, given how the underlying S3 Hadoop implementation works, this committer only works when there are no failures. If there are multiple attempts of the same task (e.g. speculation or task failures or node failures), the output data can be corrupted. I don't think this performance optimization outweighs the correctness issue.
      
      ## How was this patch tested?
      Removed the related tests also.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12229 from rxin/SPARK-10063.
  12. Apr 04, 2016
    • [SPARK-13579][BUILD] Stop building the main Spark assembly. · 24d7d2e4
      Marcelo Vanzin authored
      This change modifies the "assembly/" module to just copy needed
      dependencies to its build directory, and modifies the packaging
      script to pick those up (and remove duplicate jars packaged in the
      examples module).
      
      I also made some minor adjustments to dependencies to remove some
      test jars from the final packaging, and remove jars that conflict with each
      other when packaged separately (e.g. servlet api).
      
      Also note that this change restores guava in applications' classpaths, even
      though it's still shaded inside Spark. This is now needed for the Hadoop
      libraries that are packaged with Spark, which now are not processed by
      the shade plugin.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11796 from vanzin/SPARK-13579.
  13. Mar 09, 2016
    • [SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic instance creation in Java code. · c3689bc2
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      In order to make `docs/examples` (and other related code) simpler, more readable, and more user-friendly, this PR replaces existing code like the following with the `diamond` operator.
      
      ```
      -    final ArrayList<Product2<Object, Object>> dataToWrite =
      -      new ArrayList<Product2<Object, Object>>();
      +    final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>();
      ```
      
      Java 7 and higher support the **diamond** operator, which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (`<>`). Currently, Spark's Java code uses this inconsistently.
      
      ## How was this patch tested?
      
      Manual.
      Pass the existing tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11541 from dongjoon-hyun/SPARK-13702.
  14. Jan 04, 2016
    • [SPARK-12579][SQL] Force user-specified JDBC driver to take precedence · 6c83d938
      Josh Rosen authored
      Spark SQL's JDBC data source allows users to specify an explicit JDBC driver to load (using the `driver` argument), but in the current code it's possible that the user-specified driver will not be used when it comes time to actually create a JDBC connection.
      
      In a nutshell, the problem is that you might have multiple JDBC drivers on the classpath that claim to be able to handle the same subprotocol, so simply registering the user-provided driver class with our `DriverRegistry` and JDBC's `DriverManager` is not sufficient to ensure that it's actually used when creating the JDBC connection.
      
      This patch addresses this issue by first registering the user-specified driver with the DriverManager, then iterating over the driver manager's loaded drivers in order to obtain the correct driver and use it to create a connection (previously, we just called `DriverManager.getConnection()` directly).
      
      If a user did not specify a JDBC driver to use, then we call `DriverManager.getDriver` to figure out the class of the driver to use, then pass that class's name to executors. This guards against corner-case bugs where the driver and executor JVMs have different sets of JDBC drivers on their classpaths: previously, there was the (rare) potential for `DriverManager.getConnection()` to use different drivers on the driver and executors if the user had not explicitly specified a JDBC driver class and the classpaths differed.
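
      From the user's side, the relevant knob is the `driver` option on the JDBC data source. A hedged sketch (URL, table, and driver class are placeholders):

      ```
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # Pin the JDBC driver class explicitly; with this patch, the specified
      # driver is the one actually used to open connections, even if other
      # drivers on the classpath claim the same subprotocol.
      df = (spark.read.format("jdbc")
            .option("url", "jdbc:postgresql://db-host:5432/mydb")
            .option("dbtable", "public.users")
            .option("driver", "org.postgresql.Driver")
            .load())
      ```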
      
      This patch is inspired by a similar patch that I made to the `spark-redshift` library (https://github.com/databricks/spark-redshift/pull/143), which contains its own modified fork of some of Spark's JDBC data source code (for cross-Spark-version compatibility reasons).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10519 from JoshRosen/jdbc-driver-precedence.
  15. Nov 17, 2015
    • [SPARK-11089][SQL] Adds option for disabling multi-session in Thrift server · 7b1407c7
      Cheng Lian authored
      This PR adds a new option `spark.sql.hive.thriftServer.singleSession` for disabling multi-session support in the Thrift server.
      
      Note that this option is added as a Spark configuration (retrieved from `SparkConf`) rather than a Spark SQL configuration (retrieved from `SQLConf`). This is because all SQL configurations are session-ized. Since multi-session support is on by default, no JDBC connection can modify global configurations like the newly added one.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #9740 from liancheng/spark-11089.single-session-option.
  16. Nov 09, 2015
    • [SPARK-11360][DOC] Loss of nullability when writing parquet files · 2f383788
      gatorsmile authored
      This fix adds one line to explain the current behavior of Spark SQL when writing Parquet files: all columns are forced to be nullable for compatibility reasons.
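
      A minimal sketch demonstrating the documented behavior (the output path is a placeholder):

      ```
      from pyspark.sql import SparkSession
      from pyspark.sql.types import StructType, StructField, LongType

      spark = SparkSession.builder.getOrCreate()

      # Declare a column as non-nullable...
      schema = StructType([StructField("id", LongType(), nullable=False)])
      df = spark.createDataFrame([(1,), (2,)], schema)
      df.write.mode("overwrite").parquet("/tmp/ids.parquet")

      # ...but Parquet files written by Spark SQL mark every column as
      # nullable, so the schema reads back with nullable = true.
      spark.read.parquet("/tmp/ids.parquet").printSchema()
      ```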
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #9314 from gatorsmile/lossNull.
    • [DOC][MINOR][SQL] Fix internal link · b541b316
      Rohit Agarwal authored
      It doesn't show up as a hyperlink currently. It will show up as a hyperlink after this change.
      
      Author: Rohit Agarwal <mindprince@gmail.com>
      
      Closes #9544 from mindprince/patch-2.