  1. May 05, 2016
    • hyukjinkwon's avatar
      [SPARK-15148][SQL] Upgrade Univocity library from 2.0.2 to 2.1.0 · ac12b35d
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-15148
      
      Mainly, it improves performance by roughly 30%-40% according to the [release note](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.1.0). Further details on the motivation are described in the JIRA.
      
      This PR upgrades Univocity library from 2.0.2 to 2.1.0.
      
      ## How was this patch tested?
      
      Existing tests should cover this.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #12923 from HyukjinKwon/SPARK-15148.
      ac12b35d
    • Wenchen Fan's avatar
      [SPARK-14139][SQL] RowEncoder should preserve schema nullability · 55cc1c99
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      The problem: in `RowEncoder`, we use `Invoke` to get the field of an external row, which loses the nullability information. This PR creates a `GetExternalRowField` expression so that we can preserve the nullability info.
      
      TODO: simplify the null handling logic in `RowEncoder` to remove the many if branches, in a follow-up PR.
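      
      A minimal sketch of the behavior this preserves, assuming the internal Catalyst `RowEncoder` API (the schema here is illustrative):
      
      ```
      import org.apache.spark.sql.types._
      import org.apache.spark.sql.catalyst.encoders.RowEncoder
      
      // An external-row schema with an explicitly non-nullable field.
      val schema = new StructType()
        .add("id", LongType, nullable = false)
        .add("name", StringType, nullable = true)
      
      val encoder = RowEncoder(schema)
      // With this change, the encoder is expected to keep `id` non-nullable
      // instead of widening it to nullable.
      assert(!encoder.schema("id").nullable)
      ```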
      
      ## How was this patch tested?
      
      new tests in `RowEncoderSuite`
      
      Note that this PR takes over https://github.com/apache/spark/pull/11980 with a little simplification, so all credit should go to koertkuipers.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: Koert Kuipers <koert@tresata.com>
      
      Closes #12364 from cloud-fan/nullable.
      55cc1c99
    • Jason Moore's avatar
      [SPARK-14915][CORE] Don't re-queue a task if another attempt has already succeeded · 77361a43
      Jason Moore authored
      ## What changes were proposed in this pull request?
      
      Don't re-queue a task if another attempt has already succeeded. This currently happens when a speculative task is denied permission to commit its result because another copy of the task has already succeeded.
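      
      A self-contained sketch of the scheduling rule (the names are illustrative, not the actual `TaskSetManager` fields):
      
      ```
      // Only put a failed or commit-denied attempt back on the pending queue if no
      // attempt of the same task index has already succeeded.
      def shouldRequeue(taskIndex: Int, succeededTaskIndexes: Set[Int]): Boolean =
        !succeededTaskIndexes.contains(taskIndex)
      
      // A commit-denied speculative copy of task 3 is not retried once task 3 succeeded.
      assert(!shouldRequeue(3, succeededTaskIndexes = Set(1, 2, 3)))
      assert(shouldRequeue(4, succeededTaskIndexes = Set(1, 2, 3)))
      ```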
      
      ## How was this patch tested?
      
      I ran a job with enough skew in per-task processing time for speculation to trigger in the last quarter (default settings), causing many commit-denied exceptions to be thrown. Previously, these tasks were retried over and over until the stage eventually completed (while using compute resources on these superfluous tasks). With this change (applied to the 1.6 branch), they are no longer retried and the stage completes successfully without these extra task attempts.
      
      Author: Jason Moore <jasonmoore2k@outlook.com>
      
      Closes #12751 from jasonmoore2k/SPARK-14915.
      77361a43
    • Luciano Resende's avatar
      [SPARK-14589][SQL] Enhance DB2 JDBC Dialect docker tests · 10443022
      Luciano Resende authored
      ## What changes were proposed in this pull request?
      
      Enhance the DB2 JDBC Dialect docker tests, as they seemed to have had some issues in a previous merge that caused some tests to fail.
      
      ## How was this patch tested?
      
      By running the integration tests locally.
      
      Author: Luciano Resende <lresende@apache.org>
      
      Closes #12348 from lresende/SPARK-14589.
      10443022
    • Holden Karau's avatar
      [SPARK-15106][PYSPARK][ML] Add PySpark package doc for ML component & remove "BETA" · 4c0d827c
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Copy the package documentation from Scala/Java to Python for the ML package and remove the beta tags. Not entirely sure if we want to keep the BETA tag, but since we are making ML the default it seems like the right time to remove it (happy to put it back if we want to keep it BETA).
      
      ## How was this patch tested?
      
      Python documentation built locally as HTML and text and verified output.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #12883 from holdenk/SPARK-15106-add-pyspark-package-doc-for-ml.
      4c0d827c
    • mcheah's avatar
      [SPARK-12154] Upgrade to Jersey 2 · b7fdc23c
      mcheah authored
      ## What changes were proposed in this pull request?
      
      Replace com.sun.jersey with org.glassfish.jersey. Changes to the Spark Web UI code were required for it to compile. The changes were relatively standard Jersey migration work.
      
      ## How was this patch tested?
      
      I did a manual test for the standalone web APIs. Although I didn't test the functionality of the security filter itself, the code that changed non-trivially is how we actually register the filter. I attached a debugger to the Spark master and verified that the SecurityFilter code is indeed invoked upon hitting /api/v1/applications.
      
      Author: mcheah <mcheah@palantir.com>
      
      Closes #12715 from mccheah/feature/upgrade-jersey.
      b7fdc23c
    • Lining Sun's avatar
      [SPARK-15123] upgrade org.json4s to 3.2.11 version · 592fc455
      Lining Sun authored
      ## What changes were proposed in this pull request?
      
      We hit this issue when using Snowplow in our Spark applications. Snowplow requires json4s version 3.2.11 while Spark still uses the several-year-old version 3.2.10. The change upgrades the json4s jar to 3.2.11.
      
      ## How was this patch tested?
      
      We built Spark jar and successfully ran our applications in local and cluster modes.
      
      Author: Lining Sun <lining@gmail.com>
      
      Closes #12901 from liningalex/master.
      592fc455
    • Abhinav Gupta's avatar
      [SPARK-15045] [CORE] Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable · 1a5c6fce
      Abhinav Gupta authored
      ## What changes were proposed in this pull request?
      
      Removed the dead code as suggested.
      
      Author: Abhinav Gupta <abhi.951990@gmail.com>
      
      Closes #12829 from abhi951990/master.
      1a5c6fce
    • Kousuke Saruta's avatar
      [SPARK-15132][MINOR][SQL] Debug log for generated code should be printed with proper indentation · 1a9b3415
      Kousuke Saruta authored
      ## What changes were proposed in this pull request?
      
      Similar to #11990, GenerateOrdering and GenerateColumnAccessor should print debug log for generated code with proper indentation.
      
      ## How was this patch tested?
      
      Manually checked.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #12908 from sarutak/SPARK-15132.
      1a9b3415
  2. May 04, 2016
    • Davies Liu's avatar
      [MINOR] remove dead code · 42837419
      Davies Liu authored
      42837419
    • Tathagata Das's avatar
      [SPARK-15131][SQL] Shutdown StateStore management thread when SparkContext has been shutdown · bde27b89
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Make sure that whenever the StateStoreCoordinator cannot be contacted, we assume that the SparkContext and RpcEnv on the driver have been shut down, and therefore stop the StateStore management thread and unload all loaded stores.
      
      ## How was this patch tested?
      
      Updated unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #12905 from tdas/SPARK-15131.
      bde27b89
    • gatorsmile's avatar
      [SPARK-14993][SQL] Fix Partition Discovery Inconsistency when Input is a Path to Parquet File · ef55e46c
      gatorsmile authored
      #### What changes were proposed in this pull request?
      When we load a dataset, if we set the path to ```/path/a=1```, we will not take `a` as the partitioning column. However, if we set the path to ```/path/a=1/file.parquet```, we take `a` as the partitioning column and it shows up in the schema.
      
      This PR is to fix the behavior inconsistency issue.
      
      The base paths are a set of paths that are considered the base directories of the input datasets. The partition discovery logic will stop when it reaches any base path.
      
      By default, the paths of the dataset provided by users will be base paths. Below are three typical cases:
      **Case 1** ```sqlContext.read.parquet("/path/something=true/")```: the base path will be `/path/something=true/`, and the returned DataFrame will not contain a column of `something`.
      **Case 2** ```sqlContext.read.parquet("/path/something=true/a.parquet")```: the base path will still be `/path/something=true/`, and the returned DataFrame will also not contain a column of `something`.
      **Case 3** ```sqlContext.read.parquet("/path/")```: the base path will be `/path/`, and the returned DataFrame will have the column of `something`.
      
      Users also can override the basePath by setting `basePath` in the options to pass the new base
      path to the data source. For example,
      ```sqlContext.read.option("basePath", "/path/").parquet("/path/something=true/")```,
      and the returned DataFrame will have the column of `something`.
      
      The related PRs:
      - https://github.com/apache/spark/pull/9651
      - https://github.com/apache/spark/pull/10211
      
      #### How was this patch tested?
      Added a couple of test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #12828 from gatorsmile/readPartitionedTable.
      ef55e46c
    • Sean Zhong's avatar
      [SPARK-6339][SQL] Supports CREATE TEMPORARY VIEW tableIdentifier AS query · 8fb1463d
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      This PR adds support for the new SQL syntax CREATE TEMPORARY VIEW, for example:
      ```
      CREATE TEMPORARY VIEW viewName AS SELECT * from xx
      CREATE OR REPLACE TEMPORARY VIEW viewName AS SELECT * from xx
      CREATE TEMPORARY VIEW viewName (c1 COMMENT 'blabla', c2 COMMENT 'blabla') AS SELECT * FROM xx
      ```
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: Sean Zhong <clockfly@gmail.com>
      
      Closes #12872 from clockfly/spark-6399.
      8fb1463d
    • Andrew Or's avatar
      [SPARK-14896][SQL] Deprecate HiveContext in python · fa79d346
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      See title.
      
      ## How was this patch tested?
      
      PySpark tests.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12917 from andrewor14/deprecate-hive-context-python.
      fa79d346
    • sethah's avatar
      [MINOR][SQL] Fix typo in DataFrameReader csv documentation · b2813776
      sethah authored
      ## What changes were proposed in this pull request?
      Typo fix
      
      ## How was this patch tested?
      No tests
      
      My apologies for the tiny PR, but I stumbled across this today and wanted to get it corrected for 2.0.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #12912 from sethah/csv_typo.
      b2813776
    • Wenchen Fan's avatar
      [SPARK-15116] In REPL we should create SparkSession first and get SparkContext from it · a432a2b8
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      see https://github.com/apache/spark/pull/12873#discussion_r61993910. The problem is that if we create a `SparkContext` first and then call `SparkSession.builder.enableHiveSupport().getOrCreate()`, we will reuse the existing `SparkContext` and the Hive flag won't be set.
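      
      A minimal sketch of the intended initialization order (not the actual REPL wiring):
      
      ```
      import org.apache.spark.sql.SparkSession
      
      // Create the SparkSession first so that options such as Hive support take
      // effect, then derive the SparkContext from it.
      val spark = SparkSession.builder()
        .appName("repl")
        .enableHiveSupport()   // honored because no SparkContext exists yet
        .getOrCreate()
      val sc = spark.sparkContext
      ```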
      
      ## How was this patch tested?
      
      verified it locally.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12890 from cloud-fan/repl.
      a432a2b8
    • Sebastien Rainville's avatar
      [SPARK-13001][CORE][MESOS] Prevent getting offers when reached max cores · eb019af9
      Sebastien Rainville authored
      Similar to https://github.com/apache/spark/pull/8639
      
      This change rejects offers for 120s when `spark.cores.max` is reached in coarse-grained mode, to mitigate offer starvation. This prevents Mesos from sending us offers again and again, starving other frameworks. It is especially problematic when running many small frameworks on the same Mesos cluster, e.g. many small Spark Streaming jobs, as it causes the bigger Spark jobs to stop receiving offers. By rejecting the offers for a long period of time, they become available to those other frameworks.
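      
      A hedged sketch of the decline-with-filter idea (not the actual scheduler backend code; the helper name is made up):
      
      ```
      import org.apache.mesos.{Protos, SchedulerDriver}
      
      def declineWhileAtMaxCores(driver: SchedulerDriver, offer: Protos.Offer): Unit = {
        // Ask Mesos not to re-offer these resources for 120 seconds.
        val filters = Protos.Filters.newBuilder().setRefuseSeconds(120).build()
        driver.declineOffer(offer.getId, filters)
      }
      ```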
      
      Author: Sebastien Rainville <sebastien@hopper.com>
      
      Closes #10924 from sebastienrainville/master.
      eb019af9
    • Dongjoon Hyun's avatar
      [SPARK-15031][EXAMPLE] Use SparkSession in Scala/Python/Java example. · cdce4e62
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to update Scala/Python/Java examples by replacing `SQLContext` with newly added `SparkSession`.
      
      - Use **SparkSession Builder Pattern** in 154(Scala 55, Java 52, Python 47) files.
      - Add `getConf` in Python SparkContext class: `python/pyspark/context.py`
      - Replace **SQLContext Singleton Pattern** with **SparkSession Singleton Pattern**:
        - `SqlNetworkWordCount.scala`
        - `JavaSqlNetworkWordCount.java`
        - `sql_network_wordcount.py`
      
      Now, `SQLContext` is used only in R examples and the following two Python examples. The Python examples are untouched in this PR since they already fail with some unknown issue.
      - `simple_params_example.py`
      - `aft_survival_regression.py`
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12809 from dongjoon-hyun/SPARK-15031.
      cdce4e62
    • Bryan Cutler's avatar
      [SPARK-12299][CORE] Remove history serving functionality from Master · cf2e9da6
      Bryan Cutler authored
      Remove history server functionality from standalone Master.  Previously, the Master process rebuilt a SparkUI once the application was completed which sometimes caused problems, such as OOM, when the application event log is large (see SPARK-6270).  Keeping this functionality out of the Master will help to simplify the process and increase stability.
      
      Testing for this change included running core unit tests and manually running an application on a standalone cluster to verify that it completed successfully and that the Master UI functioned correctly.  Also added 2 unit tests to verify killing an application and driver from MasterWebUI makes the correct request to the Master.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #10991 from BryanCutler/remove-history-master-SPARK-12299.
      cf2e9da6
    • Thomas Graves's avatar
      [SPARK-15121] Improve logging of external shuffle handler · 0c00391f
      Thomas Graves authored
      ## What changes were proposed in this pull request?
      
      Add more informative logging in the external shuffle service to aid in debugging who is connecting to the YARN Nodemanager when the external shuffle service runs under it.
      
      ## How was this patch tested?
      
      Ran and saw logs coming out in log file.
      
      Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
      
      Closes #12900 from tgravescs/SPARK-15121.
      0c00391f
    • Reynold Xin's avatar
      [SPARK-15126][SQL] RuntimeConfig.set should return Unit · 6ae9fc00
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Currently we return RuntimeConfig itself to facilitate chaining. However, it makes the output in interactive environments (e.g. notebooks, scala repl) weird because it'd show the response of calling set as a RuntimeConfig itself.
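      
      For example, after this change (assuming a SparkSession `spark`):
      
      ```
      // Previously this expression evaluated to the RuntimeConfig itself, which the
      // REPL/notebook would then print; now it simply evaluates to ().
      spark.conf.set("spark.sql.shuffle.partitions", "200")
      ```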
      
      ## How was this patch tested?
      Updated unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12902 from rxin/SPARK-15126.
      6ae9fc00
    • Tathagata Das's avatar
      [SPARK-15103][SQL] Refactored FileCatalog class to allow StreamFileCatalog to infer partitioning · 0fd3a474
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      File Stream Sink writes the list of written files in a metadata log. StreamFileCatalog reads the list of the files for processing. However StreamFileCatalog does not infer partitioning like HDFSFileCatalog.
      
      This PR enables that by refactoring HDFSFileCatalog to create an abstract class PartitioningAwareFileCatalog, that has all the functionality to infer partitions from a list of leaf files.
      - HDFSFileCatalog has been renamed to ListingFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from recursive directory scanning.
      - StreamFileCatalog has been renamed to MetadataLogFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from the metadata log.
      - The above two classes have been moved into their own files, as they are not interfaces that should be in fileSourceInterfaces.scala.
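      
      A rough structural sketch of the resulting hierarchy (simplified, hypothetical signatures, not the exact Spark internals):
      
      ```
      abstract class PartitioningAwareFileCatalog {
        /** All leaf files; partition columns are inferred from their paths. */
        def leafFiles: Seq[String]
      
        def inferPartitionColumns(): Seq[String] =
          leafFiles
            .flatMap(_.split("/").filter(_.contains("=")).map(_.takeWhile(_ != '=')))
            .distinct
      }
      
      // Leaf files come from a recursive directory scan (stubbed here).
      class ListingFileCatalog(root: String) extends PartitioningAwareFileCatalog {
        def leafFiles: Seq[String] = Seq(s"$root/date=2016-05-04/part-00000.parquet")
      }
      
      // Leaf files come from the file stream sink's metadata log.
      class MetadataLogFileCatalog(logEntries: Seq[String]) extends PartitioningAwareFileCatalog {
        def leafFiles: Seq[String] = logEntries
      }
      ```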
      
      ## How was this patch tested?
      - FileStreamSinkSuite was updated to check whether partitioning gets inferred, and whether on read the partitions get pruned correctly based on the query.
      - Other unit tests are unchanged and pass as expected.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #12879 from tdas/SPARK-15103.
      0fd3a474
    • Reynold Xin's avatar
      [SPARK-15115][SQL] Reorganize whole stage codegen benchmark suites · 6274a520
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We currently have a single suite that is very large, making it difficult to maintain and play with specific primitives. This patch reorganizes the file by creating multiple benchmark suites in a single package.
      
      Most of the changes are straightforward move of code. On top of the code moving, I did:
      1. Use SparkSession instead of SQLContext.
      2. Turned most benchmark scenarios into their own test cases, rather than having multiple scenarios in a single test case, which takes forever to run.
      
      ## How was this patch tested?
      This is a test only change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12891 from rxin/SPARK-15115.
      6274a520
    • Zheng RuiFeng's avatar
      [MINOR] Add python3 compatibility in python examples · 4530250f
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Add python3 compatibility in python examples
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12868 from zhengruifeng/fix_gmm_py.
      4530250f
    • Liang-Chi Hsieh's avatar
      [SPARK-14951] [SQL] Support subexpression elimination in TungstenAggregate · b85d21fb
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      We can support subexpression elimination in TungstenAggregate by using current `EquivalentExpressions` which is already used in subexpression elimination for expression codegen.
      
      However, in whole-stage codegen we can't wrap the common expressions' code in functions as before; we simply generate the code snippets for the common expressions. These code snippets are inserted before the common expressions are actually used in the generated Java code.
      
      For multiple `TypedAggregateExpression`s used in an aggregation operator, the input types should be the same, so their `inputDeserializer`s will be the same too. This patch can also reduce redundant input deserialization.
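      
      A hypothetical query that benefits, assuming a SparkSession `spark`: both aggregates share the subexpression `a + b`, which can now be evaluated once per input row in the generated code.
      
      ```
      val df = spark.range(100).selectExpr("id AS a", "id * 2 AS b")
      // sum(a + b) and avg(a + b) share the common subexpression (a + b).
      df.selectExpr("sum(a + b)", "avg(a + b)").show()
      ```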
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #12729 from viirya/subexpr-elimination-tungstenaggregate.
      b85d21fb
    • Reynold Xin's avatar
      [SPARK-15109][SQL] Accept Dataset[_] in joins · d864c55c
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch changes the join API in Dataset so it can accept any Dataset, rather than just DataFrames.
      
      ## How was this patch tested?
      N/A.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12886 from rxin/SPARK-15109.
      d864c55c
    • Liwei Lin's avatar
      [SPARK-15022][SPARK-15023][SQL][STREAMING] Add support for testing against the... · e597ec6f
      Liwei Lin authored
      [SPARK-15022][SPARK-15023][SQL][STREAMING] Add support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `ManualClock`
      
      ## What changes were proposed in this pull request?
      
      Currently in `StreamTest`, we have a `StartStream` which will start a streaming query against the trigger `ProcessingTime(intervalMS = 0)` and `SystemClock`.
      
      We also need to test against `ProcessingTime(intervalMS > 0)`, which often requires `ManualClock`.
      
      This patch:
      - fixes an issue of `ProcessingTimeExecutor`, where for a batch it should run `batchRunner` only once but might run multiple times under certain conditions;
      - adds support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `AdvanceManualClock`, by specifying them as fields for `StartStream`, and by adding an `AdvanceClock` action;
      - adds a test, which takes advantage of the new `StartStream` and `AdvanceManualClock`, to test against [PR#[SPARK-14942] Reduce delay between batch construction and execution ](https://github.com/apache/spark/pull/12725).
      
      ## How was this patch tested?
      
      N/A
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #12797 from lw-lin/add-trigger-test-support.
      e597ec6f
    • Dhruve Ashar's avatar
      [SPARK-4224][CORE][YARN] Support group acls · a4564774
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      Currently only a list of users can be specified for view and modify acls. This change enables a group of admins/devs/users to be provisioned for viewing and modifying Spark jobs.
      
      **Changes Proposed in the fix**
      Three new corresponding config entries have been added where the user can specify the groups to be given access.
      
      ```
      spark.admin.acls.groups
      spark.modify.acls.groups
      spark.ui.view.acls.groups
      ```
      
      New config entries were added because specifying the users and groups explicitly is a better and cleaner way compared to specifying them in the existing config entry using a delimiter.
      
      A generic trait has been introduced to provide the user-to-group mapping, which makes it pluggable to support a variety of mapping protocols, similar to the one used in Hadoop. A default Unix-shell-based implementation has been provided.
      A custom user-to-group mapping protocol can be specified and configured via the entry ```spark.user.groups.mapping```.
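      
      A hedged configuration sketch using the new entries (the group names and the custom provider class are placeholders):
      
      ```
      import org.apache.spark.SparkConf
      
      val conf = new SparkConf()
        .set("spark.admin.acls.groups", "admins")
        .set("spark.modify.acls.groups", "dev_team")
        .set("spark.ui.view.acls.groups", "dev_team,qa_team")
        // Optional and hypothetical: point the mapping entry at a custom provider class.
        .set("spark.user.groups.mapping", "com.example.MyGroupsMappingProvider")
      ```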
      
      **How the patch was Tested**
      We ran different spark jobs setting the config entries in combinations of admin, modify and ui acls. For modify acls we tried killing the job stages from the ui and using yarn commands. For view acls we tried accessing the UI tabs and the logs. Headless accounts were used to launch these jobs and different users tried to modify and view the jobs to ensure that the groups mapping applied correctly.
      
      Additional Unit tests have been added without modifying the existing ones. These test for different ways of setting the acls through configuration and/or API and validate the expected behavior.
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #12760 from dhruve/impr/SPARK-4224.
      a4564774
    • Dominik Jastrzębski's avatar
      [SPARK-14844][ML] Add setFeaturesCol and setPredictionCol to KMeansM… · abecbcd5
      Dominik Jastrzębski authored
      ## What changes were proposed in this pull request?
      
      Introduces setFeaturesCol and setPredictionCol methods on KMeansModel in the ML library.
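      
      A brief usage sketch of the new setters, assuming a DataFrame `trainingDF` with a vector column (column names are placeholders):
      
      ```
      import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
      import org.apache.spark.sql.DataFrame
      
      // `trainingDF` is assumed to have a vector column named "featuresVec".
      def clusterAssignments(trainingDF: DataFrame): DataFrame = {
        val model: KMeansModel = new KMeans().setFeaturesCol("featuresVec").setK(2).fit(trainingDF)
        // New in this change: adjust the model's input/output columns after training.
        model.setFeaturesCol("featuresVec").setPredictionCol("cluster")
        model.transform(trainingDF).select("cluster")
      }
      ```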
      
      ## How was this patch tested?
      
      By running KMeansSuite.
      
      Author: Dominik Jastrzębski <dominik.jastrzebski@codilime.com>
      
      Closes #12609 from dominik-jastrzebski/master.
      abecbcd5
    • Cheng Lian's avatar
      [SPARK-14127][SQL] Native "DESC [EXTENDED | FORMATTED] <table>" DDL command · f152fae3
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR implements native `DESC [EXTENDED | FORMATTED] <table>` DDL command. Sample output:
      
      ```
      scala> spark.sql("desc extended src").show(100, truncate = false)
      +----------------------------+---------------------------------+-------+
      |col_name                    |data_type                        |comment|
      +----------------------------+---------------------------------+-------+
      |key                         |int                              |       |
      |value                       |string                           |       |
      |                            |                                 |       |
      |# Detailed Table Information|CatalogTable(`default`.`src`, ...|       |
      +----------------------------+---------------------------------+-------+
      
      scala> spark.sql("desc formatted src").show(100, truncate = false)
      +----------------------------+----------------------------------------------------------+-------+
      |col_name                    |data_type                                                 |comment|
      +----------------------------+----------------------------------------------------------+-------+
      |key                         |int                                                       |       |
      |value                       |string                                                    |       |
      |                            |                                                          |       |
      |# Detailed Table Information|                                                          |       |
      |Database:                   |default                                                   |       |
      |Owner:                      |lian                                                      |       |
      |Create Time:                |Mon Jan 04 17:06:00 CST 2016                              |       |
      |Last Access Time:           |Thu Jan 01 08:00:00 CST 1970                              |       |
      |Location:                   |hdfs://localhost:9000/user/hive/warehouse_hive121/src     |       |
      |Table Type:                 |MANAGED                                                   |       |
      |Table Parameters:           |                                                          |       |
      |  transient_lastDdlTime     |1451898360                                                |       |
      |                            |                                                          |       |
      |# Storage Information       |                                                          |       |
      |SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe        |       |
      |InputFormat:                |org.apache.hadoop.mapred.TextInputFormat                  |       |
      |OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat|       |
      |Num Buckets:                |-1                                                        |       |
      |Bucket Columns:             |[]                                                        |       |
      |Sort Columns:               |[]                                                        |       |
      |Storage Desc Parameters:    |                                                          |       |
      |  serialization.format      |1                                                         |       |
      +----------------------------+----------------------------------------------------------+-------+
      ```
      
      ## How was this patch tested?
      
      A test case is added to `HiveDDLSuite` to check command output.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #12844 from liancheng/spark-14127-desc-table.
      f152fae3
    • Wenchen Fan's avatar
      [SPARK-15029] improve error message for Generate · 6c12e801
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR improve the error message for `Generate` in 3 cases:
      
      1. the generator is nested in expressions, e.g. `SELECT explode(list) + 1 FROM tbl`
      2. the generator appears more than once in SELECT, e.g. `SELECT explode(list), explode(list) FROM tbl`
      3. the generator appears in an operator other than Project, e.g. `SELECT * FROM tbl SORT BY explode(list)`
      
      ## How was this patch tested?
      
      new tests in `AnalysisErrorSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12810 from cloud-fan/bug.
      6c12e801
    • Cheng Lian's avatar
      [SPARK-14237][SQL] De-duplicate partition value appending logic in various... · bc3760d4
      Cheng Lian authored
      [SPARK-14237][SQL] De-duplicate partition value appending logic in various buildReader() implementations
      
      ## What changes were proposed in this pull request?
      
      Currently, various `FileFormat` data sources share approximately the same code for partition value appending. This PR tries to eliminate this duplication.
      
      A new method `buildReaderWithPartitionValues()` is added to `FileFormat` with a default implementation that appends partition values to `InternalRow`s produced by the reader function returned by `buildReader()`.
      
      Special data sources, like Parquet (which implements partition value appending inside `buildReader()` because of the vectorized reader) and the Text data source (which doesn't support partitioning), override `buildReaderWithPartitionValues()` and simply delegate to `buildReader()`.
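      
      A conceptual, self-contained sketch of the default-plus-override pattern described above (simplified types, not the actual `FileFormat` trait):
      
      ```
      trait FileFormatLike {
        // Each format must know how to read raw rows from a single file.
        def buildReader(file: String): Iterator[Seq[Any]]
      
        // Default: append the partition values to every row produced by buildReader().
        def buildReaderWithPartitionValues(file: String, partitionValues: Seq[Any]): Iterator[Seq[Any]] =
          buildReader(file).map(_ ++ partitionValues)
      }
      
      // A format that already appends partition values itself (as Parquet does with
      // its vectorized reader) overrides the method and simply delegates.
      class VectorizedFormat extends FileFormatLike {
        def buildReader(file: String): Iterator[Seq[Any]] = Iterator(Seq(file, "partCol=1"))
        override def buildReaderWithPartitionValues(file: String, partitionValues: Seq[Any]): Iterator[Seq[Any]] =
          buildReader(file)
      }
      ```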
      
      This PR brings two benefits:
      
      1. Apparently, it de-duplicates partition value appending logic
      
      2. Now the reader function returned by `buildReader()` is only required to produce `InternalRow`s rather than `UnsafeRow`s if the data source doesn't override `buildReaderWithPartitionValues()`.
      
         This is because the safe-to-unsafe conversion is also performed while appending partition values, which makes 3rd-party data sources (e.g. spark-avro) easier to implement since they no longer need to access private APIs involving `UnsafeRow`.
      
      ## How was this patch tested?
      
      Existing tests should do the work.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #12866 from liancheng/spark-14237-simplify-partition-values-appending.
      bc3760d4
    • Reynold Xin's avatar
      [SPARK-15107][SQL] Allow varying # iterations by test case in Benchmark · 695f0e91
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch changes our micro-benchmark util to allow setting different iteration numbers for different test cases. For some of our benchmarks, turning off whole-stage codegen can make the runtime 20X slower, making it very difficult to run a large number of times without substantially shortening the input cardinality.
      
      With this change, I set the default num iterations to 2 for whole stage codegen off, and 5 for whole stage codegen on. I also updated some results.
      
      ## How was this patch tested?
      N/A - this is a test util.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12884 from rxin/SPARK-15107.
      695f0e91
  3. May 03, 2016
    • Davies Liu's avatar
      [SPARK-15095][SQL] remove HiveSessionHook from ThriftServer · 348c1389
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Remove HiveSessionHook
      
      ## How was this patch tested?
      
      No tests needed.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12881 from davies/remove_hooks.
      348c1389
    • Andrew Or's avatar
      [SPARK-14414][SQL] Make DDL exceptions more consistent · 6ba17cd1
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Just a bunch of small tweaks on DDL exception messages.
      
      ## How was this patch tested?
      
      `DDLCommandSuite` et al.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12853 from andrewor14/make-exceptions-consistent.
      6ba17cd1
    • Koert Kuipers's avatar
      [SPARK-15097][SQL] make Dataset.sqlContext a stable identifier for imports · 9e4928b7
      Koert Kuipers authored
      ## What changes were proposed in this pull request?
      Make Dataset.sqlContext a lazy val so that it is a stable identifier and can be used for imports.
      Now this works again:
      import someDataset.sqlContext.implicits._
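      For instance, a minimal sketch assuming a SparkSession `spark`:
      ```
      val someDataset = spark.range(5)
      // Compiles again because sqlContext is now a stable identifier (lazy val).
      import someDataset.sqlContext.implicits._
      val df = Seq(1, 2, 3).toDF("n")
      ```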
      
      ## How was this patch tested?
      Added a unit test to DatasetSuite that uses the import shown above.
      
      Author: Koert Kuipers <koert@tresata.com>
      
      Closes #12877 from koertkuipers/feat-sqlcontext-stable-import.
      9e4928b7
    • Dongjoon Hyun's avatar
      [SPARK-15084][PYTHON][SQL] Use builder pattern to create SparkSession in PySpark. · 0903a185
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This is a Python port of the corresponding Scala builder pattern code. `sql.py` is modified as a target example case.
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12860 from dongjoon-hyun/SPARK-15084.
      0903a185
    • Timothy Chen's avatar
      [SPARK-14645][MESOS] Fix python running on cluster mode mesos to have non local uris · c1839c99
      Timothy Chen authored
      ## What changes were proposed in this pull request?
      
      Fix SparkSubmit to allow non-local Python URIs.
      
      ## How was this patch tested?
      
      Manually tested with mesos-spark-dispatcher
      
      Author: Timothy Chen <tnachen@gmail.com>
      
      Closes #12403 from tnachen/enable_remote_python.
      c1839c99
    • Sandeep Singh's avatar
      [SPARK-14422][SQL] Improve handling of optional configs in SQLConf · a8d56f53
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
      Create a new API for handling optional configs in SQLConf.
      Right now, `getConf` for an `OptionalConfigEntry[T]` returns a value of type `T` and throws an exception if the config doesn't exist. This adds a new method `getOptionalConf` (suggestions on naming welcome) which returns a value of type `Option[T]` (so if the config doesn't exist it returns `None`).
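      
      A self-contained toy model of the proposed distinction (hypothetical classes, not the real `SQLConf` API):
      
      ```
      case class OptionalConfigEntry[T](key: String)
      
      class ToyConf(settings: Map[String, Any]) {
        // Current behavior: throws if the config was never set.
        def getConf[T](entry: OptionalConfigEntry[T]): T =
          settings.getOrElse(entry.key, throw new NoSuchElementException(entry.key)).asInstanceOf[T]
      
        // Proposed addition: absence becomes None instead of an exception.
        def getOptionalConf[T](entry: OptionalConfigEntry[T]): Option[T] =
          settings.get(entry.key).map(_.asInstanceOf[T])
      }
      
      val conf = new ToyConf(Map("spark.sql.some.path" -> "/tmp/x"))
      assert(conf.getOptionalConf(OptionalConfigEntry[String]("missing.key")).isEmpty)
      ```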
      
      ## How was this patch tested?
      Add test and ran tests locally.
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #12846 from techaddict/SPARK-14422.
      a8d56f53
    • Shuai Lin's avatar
      [MINOR][DOC] Fixed some python snippets in mllib data types documentation. · c4e0fde8
      Shuai Lin authored
      ## What changes were proposed in this pull request?
      
      Some Python snippets were using Scala imports and comments.
      
      ## How was this patch tested?
      
      Generated the docs locally with `SKIP_API=1 jekyll build` and viewed the changes in the browser.
      
      Author: Shuai Lin <linshuai2012@gmail.com>
      
      Closes #12869 from lins05/fix-mllib-python-snippets.
      c4e0fde8