  1. May 05, 2016
  2. May 04, 2016
    • Davies Liu's avatar
      [MINOR] remove dead code · 42837419
      Davies Liu authored
      42837419
    • Tathagata Das's avatar
      [SPARK-15131][SQL] Shutdown StateStore management thread when SparkContext has been shutdown · bde27b89
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Whenever the StateStoreCoordinator cannot be contacted, assume that the SparkContext and RpcEnv on the driver have been shut down, and therefore stop the StateStore management thread and unload all loaded stores.
      
      ## How was this patch tested?
      
      Updated unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #12905 from tdas/SPARK-15131.
      bde27b89
    • gatorsmile's avatar
      [SPARK-14993][SQL] Fix Partition Discovery Inconsistency when Input is a Path to Parquet File · ef55e46c
      gatorsmile authored
      #### What changes were proposed in this pull request?
      When we load a dataset, if we set the path to `/path/a=1`, we will not take `a` as the partitioning column. However, if we set the path to `/path/a=1/file.parquet`, we take `a` as the partitioning column and it shows up in the schema.
      
      This PR is to fix the behavior inconsistency issue.
      
      The base path contains a set of paths that are considered as the base dirs of the input datasets. The partitioning discovery logic will make sure it will stop when it reaches any base path.
      
      By default, the paths of the dataset provided by users will be base paths. Below are three typical cases:
      **Case 1** `sqlContext.read.parquet("/path/something=true/")`: the base path will be `/path/something=true/`, and the returned DataFrame will not contain a column of `something`.
      **Case 2** `sqlContext.read.parquet("/path/something=true/a.parquet")`: the base path will still be `/path/something=true/`, and the returned DataFrame will also not contain a column of `something`.
      **Case 3** `sqlContext.read.parquet("/path/")`: the base path will be `/path/`, and the returned DataFrame will have the column `something`.
      
      Users can also override the base path by setting `basePath` in the options to pass the new base path to the data source. For example, with
      `sqlContext.read.option("basePath", "/path/").parquet("/path/something=true/")`,
      the returned DataFrame will have the column `something`.
      
      The related PRs:
      - https://github.com/apache/spark/pull/9651
      - https://github.com/apache/spark/pull/10211
      
      #### How was this patch tested?
      Added a couple of test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #12828 from gatorsmile/readPartitionedTable.
      ef55e46c
    • Sean Zhong's avatar
      [SPARK-6339][SQL] Supports CREATE TEMPORARY VIEW tableIdentifier AS query · 8fb1463d
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      This PR supports the new SQL syntax CREATE TEMPORARY VIEW, for example:
      ```
      CREATE TEMPORARY VIEW viewName AS SELECT * from xx
      CREATE OR REPLACE TEMPORARY VIEW viewName AS SELECT * from xx
      CREATE TEMPORARY VIEW viewName (c1 COMMENT 'blabla', c2 COMMENT 'blabla') AS SELECT * FROM xx
      ```
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: Sean Zhong <clockfly@gmail.com>
      
      Closes #12872 from clockfly/spark-6399.
      8fb1463d
    • Andrew Or's avatar
      [SPARK-14896][SQL] Deprecate HiveContext in python · fa79d346
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      See title.
      
      ## How was this patch tested?
      
      PySpark tests.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12917 from andrewor14/deprecate-hive-context-python.
      fa79d346
    • sethah's avatar
      [MINOR][SQL] Fix typo in DataFrameReader csv documentation · b2813776
      sethah authored
      ## What changes were proposed in this pull request?
      Typo fix
      
      ## How was this patch tested?
      No tests
      
      My apologies for the tiny PR, but I stumbled across this today and wanted to get it corrected for 2.0.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #12912 from sethah/csv_typo.
      b2813776
    • Wenchen Fan's avatar
      [SPARK-15116] In REPL we should create SparkSession first and get SparkContext from it · a432a2b8
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      See https://github.com/apache/spark/pull/12873#discussion_r61993910. The problem is that if we create `SparkContext` first and then call `SparkSession.builder.enableHiveSupport().getOrCreate()`, we will reuse the existing `SparkContext` and the Hive flag won't be set.
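
      A minimal sketch of the ordering this change enforces (the app name and master below are illustrative):
      ```scala
      import org.apache.spark.sql.SparkSession

      // Build the SparkSession first so that builder options such as
      // enableHiveSupport() are honored, then derive the SparkContext from it.
      val spark = SparkSession.builder()
        .appName("repl-sketch")   // illustrative app name
        .master("local[*]")       // illustrative master
        .enableHiveSupport()      // requires Hive classes on the classpath
        .getOrCreate()

      val sc = spark.sparkContext // reuse the context owned by the session
      ```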
      
      ## How was this patch tested?
      
      Verified it locally.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12890 from cloud-fan/repl.
      a432a2b8
    • Sebastien Rainville's avatar
      [SPARK-13001][CORE][MESOS] Prevent getting offers when reached max cores · eb019af9
      Sebastien Rainville authored
      Similar to https://github.com/apache/spark/pull/8639
      
      This change rejects offers for 120s once `spark.cores.max` is reached in coarse-grained mode, to mitigate offer starvation. This prevents Mesos from sending us offers again and again, starving other frameworks. This is especially problematic when running many small frameworks on the same Mesos cluster, e.g. many small Spark Streaming jobs, and causes the bigger Spark jobs to stop receiving offers. By rejecting the offers for a long period of time, they become available to those other frameworks.
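
      For context, a hedged sketch of the relevant configuration (the Mesos master URL and the core count are illustrative):
      ```scala
      import org.apache.spark.SparkConf

      // Illustrative values only: once the job holds 8 cores, further Mesos offers
      // are declined (for 120s after this change) instead of being sent to this
      // framework over and over.
      val conf = new SparkConf()
        .setMaster("mesos://zk://zk1:2181/mesos") // illustrative Mesos master URL
        .setAppName("small-streaming-job")
        .set("spark.cores.max", "8")
      ```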
      
      Author: Sebastien Rainville <sebastien@hopper.com>
      
      Closes #10924 from sebastienrainville/master.
      eb019af9
    • Dongjoon Hyun's avatar
      [SPARK-15031][EXAMPLE] Use SparkSession in Scala/Python/Java example. · cdce4e62
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to update the Scala/Python/Java examples by replacing `SQLContext` with the newly added `SparkSession`.
      
      - Use the **SparkSession Builder Pattern** in 154 (Scala 55, Java 52, Python 47) files (see the sketch below).
      - Add `getConf` in Python SparkContext class: `python/pyspark/context.py`
      - Replace **SQLContext Singleton Pattern** with **SparkSession Singleton Pattern**:
        - `SqlNetworkWordCount.scala`
        - `JavaSqlNetworkWordCount.java`
        - `sql_network_wordcount.py`
      
      Now, `SQLContext` is used only in the R examples and the following two Python examples. The Python examples are untouched in this PR since they already fail due to an unknown issue.
      - `simple_params_example.py`
      - `aft_survival_regression.py`
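
      A minimal sketch of the builder pattern the examples move to (the JSON path and names below are illustrative):
      ```scala
      import org.apache.spark.sql.SparkSession

      // Before: examples created a SparkContext and wrapped it in a SQLContext.
      // After: a single SparkSession is built and used directly.
      val spark = SparkSession.builder()
        .appName("ExampleApp")
        .getOrCreate()

      val df = spark.read.json("examples/src/main/resources/people.json")
      df.show()

      spark.stop()
      ```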
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12809 from dongjoon-hyun/SPARK-15031.
      cdce4e62
    • Bryan Cutler's avatar
      [SPARK-12299][CORE] Remove history serving functionality from Master · cf2e9da6
      Bryan Cutler authored
      Remove history server functionality from the standalone Master. Previously, the Master process rebuilt a SparkUI once an application completed, which sometimes caused problems such as OOM when the application event log was large (see SPARK-6270). Keeping this functionality out of the Master will help to simplify the process and increase stability.
      
      Testing for this change included running core unit tests and manually running an application on a standalone cluster to verify that it completed successfully and that the Master UI functioned correctly.  Also added 2 unit tests to verify killing an application and driver from MasterWebUI makes the correct request to the Master.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #10991 from BryanCutler/remove-history-master-SPARK-12299.
      cf2e9da6
    • Thomas Graves's avatar
      [SPARK-15121] Improve logging of external shuffle handler · 0c00391f
      Thomas Graves authored
      ## What changes were proposed in this pull request?
      
      Add more informative logging in the external shuffle service to aid in debugging who is connecting to the YARN Nodemanager when the external shuffle service runs under it.
      
      ## How was this patch tested?
      
      Ran it and saw the logs coming out in the log file.
      
      Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
      
      Closes #12900 from tgravescs/SPARK-15121.
      0c00391f
    • Reynold Xin's avatar
      [SPARK-15126][SQL] RuntimeConfig.set should return Unit · 6ae9fc00
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Currently we return RuntimeConfig itself to facilitate chaining. However, this makes the output in interactive environments (e.g. notebooks, the Scala REPL) confusing, because it shows the result of calling `set` as a RuntimeConfig itself.
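
      A minimal sketch of the visible difference (the config key and values are illustrative):
      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("conf-sketch").getOrCreate()

      // Previously this expression evaluated to the RuntimeConfig itself, so an
      // interactive cell ending here echoed a RuntimeConfig; it now returns Unit.
      spark.conf.set("spark.sql.shuffle.partitions", "10")

      val n = spark.conf.get("spark.sql.shuffle.partitions") // "10"
      ```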
      
      ## How was this patch tested?
      Updated unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12902 from rxin/SPARK-15126.
      6ae9fc00
    • Tathagata Das's avatar
      [SPARK-15103][SQL] Refactored FileCatalog class to allow StreamFileCatalog to infer partitioning · 0fd3a474
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      The File Stream Sink writes the list of written files to a metadata log. StreamFileCatalog reads that list of files for processing. However, StreamFileCatalog does not infer partitioning the way HDFSFileCatalog does.

      This PR enables that by refactoring HDFSFileCatalog to create an abstract class, PartitioningAwareFileCatalog, which has all the functionality to infer partitions from a list of leaf files (see the sketch below).
      - HDFSFileCatalog has been renamed to ListingFileCatalog and extends PartitioningAwareFileCatalog by providing a list of leaf files from recursive directory scanning.
      - StreamFileCatalog has been renamed to MetadataLogFileCatalog and extends PartitioningAwareFileCatalog by providing a list of leaf files from the metadata log.
      - The above two classes have been moved into their own files, as they are not interfaces that belong in fileSourceInterfaces.scala.
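
      A rough structural sketch of the hierarchy described above (member names and signatures are simplified assumptions, not the actual code):
      ```scala
      import org.apache.hadoop.fs.{FileStatus, Path}

      // Condensed illustration only; the real classes live in
      // org.apache.spark.sql.execution.datasources and carry much more state.
      abstract class PartitioningAwareFileCatalog {
        // Subclasses only say which leaf files exist ...
        protected def leafFiles: Seq[FileStatus]
        // ... while partition inference over those files is shared here.
        def inferPartitioning(): Unit = { /* shared partition-discovery logic */ }
      }

      // Leaf files come from recursively scanning the input directories.
      class ListingFileCatalog(paths: Seq[Path]) extends PartitioningAwareFileCatalog {
        protected def leafFiles: Seq[FileStatus] = Seq.empty // placeholder
      }

      // Leaf files come from the file stream sink's metadata log.
      class MetadataLogFileCatalog(metadataPath: Path) extends PartitioningAwareFileCatalog {
        protected def leafFiles: Seq[FileStatus] = Seq.empty // placeholder
      }
      ```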
      
      ## How was this patch tested?
      - FileStreamSinkSuite was updated to check that partitioning gets inferred, and that on reading the partitions are pruned correctly based on the query.
      - Other unit tests are unchanged and pass as expected.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #12879 from tdas/SPARK-15103.
      0fd3a474
    • Reynold Xin's avatar
      [SPARK-15115][SQL] Reorganize whole stage codegen benchmark suites · 6274a520
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We currently have a single suite that is very large, making it difficult to maintain and play with specific primitives. This patch reorganizes the file by creating multiple benchmark suites in a single package.
      
      Most of the changes are a straightforward move of code. On top of moving the code, I did:
      1. Use SparkSession instead of SQLContext.
      2. Turn most benchmark scenarios into their own test cases, rather than having multiple scenarios in a single test case, which takes forever to run.
      
      ## How was this patch tested?
      This is a test only change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12891 from rxin/SPARK-15115.
      6274a520
    • Zheng RuiFeng's avatar
      [MINOR] Add python3 compatibility in python examples · 4530250f
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Add python3 compatibility in python examples
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12868 from zhengruifeng/fix_gmm_py.
      4530250f
    • Liang-Chi Hsieh's avatar
      [SPARK-14951] [SQL] Support subexpression elimination in TungstenAggregate · b85d21fb
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      We can support subexpression elimination in TungstenAggregate by using the existing `EquivalentExpressions`, which is already used for subexpression elimination in expression codegen.

      However, in whole-stage codegen we can't wrap the common expressions' code in functions as before; we simply generate the code snippets for the common expressions. These snippets are inserted before the common expressions are actually used in the generated Java code.

      When multiple `TypedAggregateExpression`s are used in an aggregation operator, their input types should be the same, so their `inputDeserializer`s will be the same too. This patch can therefore also reduce redundant input deserialization.
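
      As a hedged illustration of the kind of query that benefits (the table and column names are made up), `a * b` below is a common subexpression that can now be computed once per input row inside the generated aggregation code and reused across the aggregates:
      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("subexpr-sketch").getOrCreate()
      import spark.implicits._

      Seq((1, 2), (3, 4)).toDF("a", "b").registerTempTable("tbl")

      // a * b appears in several aggregate expressions of the same aggregation.
      spark.sql("SELECT sum(a * b), avg(a * b), max(a * b) FROM tbl").show()
      ```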
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #12729 from viirya/subexpr-elimination-tungstenaggregate.
      b85d21fb
    • Reynold Xin's avatar
      [SPARK-15109][SQL] Accept Dataset[_] in joins · d864c55c
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch changes the join API in Dataset so it can accept any Dataset, rather than just DataFrames.
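
      A minimal sketch of what the relaxed signature allows (the case class, column names, and data are illustrative):
      ```scala
      import org.apache.spark.sql.SparkSession

      case class User(id: Long, name: String)

      val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
      import spark.implicits._

      val users = Seq(User(1L, "a"), User(2L, "b")).toDS()          // a typed Dataset
      val orders = Seq((1L, 10.0), (2L, 20.0)).toDF("id", "amount") // a DataFrame

      // join now accepts any Dataset[_] on the right-hand side, not just a DataFrame.
      orders.join(users, "id").show()
      ```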
      
      ## How was this patch tested?
      N/A.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12886 from rxin/SPARK-15109.
      d864c55c
    • Liwei Lin's avatar
      [SPARK-15022][SPARK-15023][SQL][STREAMING] Add support for testing against the... · e597ec6f
      Liwei Lin authored
      [SPARK-15022][SPARK-15023][SQL][STREAMING] Add support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `ManualClock`
      
      ## What changes were proposed in this pull request?
      
      Currently in `StreamTest`, we have a `StartStream` which starts a streaming query against the trigger `ProcessingTime(intervalMS = 0)` and `SystemClock`.

      We also need to test against `ProcessingTime(intervalMS > 0)`, which often requires `ManualClock`.

      This patch:
      - fixes an issue in `ProcessingTimeExecutor`, where for a batch it should run `batchRunner` only once but might run it multiple times under certain conditions;
      - adds support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `ManualClock`, by specifying them as fields of `StartStream`, and by adding an `AdvanceManualClock` action;
      - adds a test, which takes advantage of the new `StartStream` and `AdvanceManualClock`, to test against [SPARK-14942] "Reduce delay between batch construction and execution" (https://github.com/apache/spark/pull/12725).
      
      ## How was this patch tested?
      
      N/A
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #12797 from lw-lin/add-trigger-test-support.
      e597ec6f
    • Dhruve Ashar's avatar
      [SPARK-4224][CORE][YARN] Support group acls · a4564774
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      Currently only a list of users can be specified for view and modify acls. This change enables a group of admins/devs/users to be provisioned for viewing and modifying Spark jobs.
      
      **Changes Proposed in the fix**
      Three new corresponding config entries have been added where the user can specify the groups to be given access.
      
      ```
      spark.admin.acls.groups
      spark.modify.acls.groups
      spark.ui.view.acls.groups
      ```
      
      New config entries were added because specifying the users and groups explicitly is a better and cleaner way compared to specifying them in the existing config entry using a delimiter.
      
      A generic trait has been introduced to provide the user-to-group mapping, which makes it pluggable to support a variety of mapping protocols, similar to the one used in Hadoop. A default Unix shell based implementation has been provided.
      A custom user-to-group mapping protocol can be specified and configured via the entry `spark.user.groups.mapping`.
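
      A hedged sketch of wiring the new entries up (the group names are illustrative, and the custom mapping class name is hypothetical):
      ```scala
      import org.apache.spark.SparkConf

      // Illustrative values only.
      val conf = new SparkConf()
        .set("spark.acls.enable", "true")                 // existing switch that turns acls on
        .set("spark.admin.acls.groups", "spark-admins")
        .set("spark.modify.acls.groups", "etl-devs")
        .set("spark.ui.view.acls.groups", "analysts,etl-devs")
        // Optional: plug in a custom user-to-group mapping implementation
        // (this class name is hypothetical).
        .set("spark.user.groups.mapping", "com.example.LdapGroupsMapping")
      ```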
      
      **How the patch was Tested**
      We ran different Spark jobs, setting the config entries in combinations of admin, modify and UI acls. For modify acls we tried killing the job stages from the UI and using YARN commands. For view acls we tried accessing the UI tabs and the logs. Headless accounts were used to launch these jobs, and different users tried to modify and view the jobs to ensure that the group mappings were applied correctly.
      
      Additional Unit tests have been added without modifying the existing ones. These test for different ways of setting the acls through configuration and/or API and validate the expected behavior.
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #12760 from dhruve/impr/SPARK-4224.
      a4564774
    • Dominik Jastrzębski's avatar
      [SPARK-14844][ML] Add setFeaturesCol and setPredictionCol to KMeansM… · abecbcd5
      Dominik Jastrzębski authored
      ## What changes were proposed in this pull request?
      
      Introduction of setFeaturesCol and setPredictionCol methods to KMeansModel in ML library.
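
      A minimal usage sketch of the new setters (the dataset path and column names below are illustrative):
      ```scala
      import org.apache.spark.ml.clustering.KMeans
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("kmeans-sketch").getOrCreate()

      // Illustrative libsvm dataset path.
      val data = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
      val model = new KMeans().setK(2).setSeed(1L).fit(data)

      // The new setters let the fitted model read from / write to differently named columns.
      model
        .setFeaturesCol("features")
        .setPredictionCol("cluster")
        .transform(data)
        .show()
      ```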
      
      ## How was this patch tested?
      
      By running KMeansSuite.
      
      Author: Dominik Jastrzębski <dominik.jastrzebski@codilime.com>
      
      Closes #12609 from dominik-jastrzebski/master.
      abecbcd5
    • Cheng Lian's avatar
      [SPARK-14127][SQL] Native "DESC [EXTENDED | FORMATTED] <table>" DDL command · f152fae3
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR implements native `DESC [EXTENDED | FORMATTED] <table>` DDL command. Sample output:
      
      ```
      scala> spark.sql("desc extended src").show(100, truncate = false)
      +----------------------------+---------------------------------+-------+
      |col_name                    |data_type                        |comment|
      +----------------------------+---------------------------------+-------+
      |key                         |int                              |       |
      |value                       |string                           |       |
      |                            |                                 |       |
      |# Detailed Table Information|CatalogTable(`default`.`src`, ...|       |
      +----------------------------+---------------------------------+-------+
      
      scala> spark.sql("desc formatted src").show(100, truncate = false)
      +----------------------------+----------------------------------------------------------+-------+
      |col_name                    |data_type                                                 |comment|
      +----------------------------+----------------------------------------------------------+-------+
      |key                         |int                                                       |       |
      |value                       |string                                                    |       |
      |                            |                                                          |       |
      |# Detailed Table Information|                                                          |       |
      |Database:                   |default                                                   |       |
      |Owner:                      |lian                                                      |       |
      |Create Time:                |Mon Jan 04 17:06:00 CST 2016                              |       |
      |Last Access Time:           |Thu Jan 01 08:00:00 CST 1970                              |       |
      |Location:                   |hdfs://localhost:9000/user/hive/warehouse_hive121/src     |       |
      |Table Type:                 |MANAGED                                                   |       |
      |Table Parameters:           |                                                          |       |
      |  transient_lastDdlTime     |1451898360                                                |       |
      |                            |                                                          |       |
      |# Storage Information       |                                                          |       |
      |SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe        |       |
      |InputFormat:                |org.apache.hadoop.mapred.TextInputFormat                  |       |
      |OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat|       |
      |Num Buckets:                |-1                                                        |       |
      |Bucket Columns:             |[]                                                        |       |
      |Sort Columns:               |[]                                                        |       |
      |Storage Desc Parameters:    |                                                          |       |
      |  serialization.format      |1                                                         |       |
      +----------------------------+----------------------------------------------------------+-------+
      ```
      
      ## How was this patch tested?
      
      A test case is added to `HiveDDLSuite` to check command output.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #12844 from liancheng/spark-14127-desc-table.
      f152fae3
    • Wenchen Fan's avatar
      [SPARK-15029] improve error message for Generate · 6c12e801
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR improves the error message for `Generate` in 3 cases:

      1. the generator is nested in expressions, e.g. `SELECT explode(list) + 1 FROM tbl`
      2. the generator appears more than once in SELECT, e.g. `SELECT explode(list), explode(list) FROM tbl`
      3. the generator appears in an operator other than Project, e.g. `SELECT * FROM tbl SORT BY explode(list)`
      
      ## How was this patch tested?
      
      new tests in `AnalysisErrorSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12810 from cloud-fan/bug.
      6c12e801
    • Cheng Lian's avatar
      [SPARK-14237][SQL] De-duplicate partition value appending logic in various... · bc3760d4
      Cheng Lian authored
      [SPARK-14237][SQL] De-duplicate partition value appending logic in various buildReader() implementations
      
      ## What changes were proposed in this pull request?
      
      Currently, various `FileFormat` data sources share approximately the same code for partition value appending. This PR tries to eliminate this duplication.
      
      A new method `buildReaderWithPartitionValues()` is added to `FileFormat` with a default implementation that appends partition values to `InternalRow`s produced by the reader function returned by `buildReader()`.
      
      Special data sources like Parquet, which implements partition value appending inside `buildReader()` because of the vectorized reader, and the Text data source, which doesn't support partitioning, override `buildReaderWithPartitionValues()` and simply delegate to `buildReader()`.
      
      This PR brings two benefits:
      
      1. Apparently, it de-duplicates partition value appending logic
      
      2. Now the reader function returned by `buildReader()` is only required to produce `InternalRow`s rather than `UnsafeRow`s if the data source doesn't override `buildReaderWithPartitionValues()`.
      
         This is because the safe-to-unsafe conversion is also performed while appending partition values. This makes 3rd-party data sources (e.g. spark-avro) easier to implement, since they no longer need to access private APIs involving `UnsafeRow`.
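
      A heavily condensed sketch of the shape of the change (the real `FileFormat` methods take schemas, filters, options, and a Hadoop configuration; everything below is simplified for illustration):
      ```scala
      // Condensed illustration only; not the real org.apache.spark.sql FileFormat trait.
      trait FileFormatSketch {
        type Row = Seq[Any]                       // stand-in for InternalRow
        type FileReader = String => Iterator[Row] // stand-in for PartitionedFile => Iterator[InternalRow]

        // Each format implements this: read the data columns of one file.
        def buildReader(): FileReader

        // Shared default: append the file's partition values to every row, so that
        // individual formats no longer duplicate this logic. Formats like Parquet
        // (vectorized) or Text (unpartitioned) can override it and simply delegate.
        def buildReaderWithPartitionValues(partitionValues: Row): FileReader = { file =>
          buildReader()(file).map(row => row ++ partitionValues)
        }
      }
      ```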
      
      ## How was this patch tested?
      
      Existing tests should do the work.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #12866 from liancheng/spark-14237-simplify-partition-values-appending.
      bc3760d4
    • Reynold Xin's avatar
      [SPARK-15107][SQL] Allow varying # iterations by test case in Benchmark · 695f0e91
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch changes our micro-benchmark util to allow setting different iteration numbers for different test cases. For some of our benchmarks, turning off whole-stage codegen can make the runtime 20X slower, making it very difficult to run a large number of times without substantially shortening the input cardinality.
      
      With this change, I set the default num iterations to 2 for whole stage codegen off, and 5 for whole stage codegen on. I also updated some results.
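
      A hedged sketch of what per-case iteration counts could look like (the `Benchmark` class is an internal test utility; treat the exact constructor and `addCase` signature here as assumptions rather than quotes from the patch):
      ```scala
      import org.apache.spark.util.Benchmark

      // Signature details are assumed, not quoted from the patch.
      val benchmark = new Benchmark("range/sum", 500L << 20)

      benchmark.addCase("codegen = false", numIters = 2) { _ =>
        // run the query with whole-stage codegen disabled
      }
      benchmark.addCase("codegen = true", numIters = 5) { _ =>
        // run the query with whole-stage codegen enabled
      }
      benchmark.run()
      ```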
      
      ## How was this patch tested?
      N/A - this is a test util.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12884 from rxin/SPARK-15107.
      695f0e91
  3. May 03, 2016
    • Davies Liu's avatar
      [SPARK-15095][SQL] remove HiveSessionHook from ThriftServer · 348c1389
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Remove HiveSessionHook
      
      ## How was this patch tested?
      
      No tests needed.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12881 from davies/remove_hooks.
      348c1389
    • Andrew Or's avatar
      [SPARK-14414][SQL] Make DDL exceptions more consistent · 6ba17cd1
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Just a bunch of small tweaks on DDL exception messages.
      
      ## How was this patch tested?
      
      `DDLCommandSuite` et al.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12853 from andrewor14/make-exceptions-consistent.
      6ba17cd1
    • Koert Kuipers's avatar
      [SPARK-15097][SQL] make Dataset.sqlContext a stable identifier for imports · 9e4928b7
      Koert Kuipers authored
      ## What changes were proposed in this pull request?
      Make Dataset.sqlContext a lazy val so that it's a stable identifier and can be used for imports.
      Now this works again:
      `import someDataset.sqlContext.implicits._`
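
      A minimal sketch of why the `lazy val` matters: a Scala `import` requires a stable identifier (a `val`/`lazy val` path), so a `def`-based sqlContext cannot be imported from. Names below are illustrative.
      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("stable-id-sketch").getOrCreate()

      val someDataset = spark.range(3) // any Dataset will do

      // Compiles again now that Dataset.sqlContext is a (lazy) val:
      import someDataset.sqlContext.implicits._
      val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
      df.show()
      ```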
      
      ## How was this patch tested?
      Added a unit test to DatasetSuite that uses the import shown above.
      
      Author: Koert Kuipers <koert@tresata.com>
      
      Closes #12877 from koertkuipers/feat-sqlcontext-stable-import.
      9e4928b7
    • Dongjoon Hyun's avatar
      [SPARK-15084][PYTHON][SQL] Use builder pattern to create SparkSession in PySpark. · 0903a185
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This is a Python port of the corresponding Scala builder pattern code. `sql.py` is modified as a target example case.
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12860 from dongjoon-hyun/SPARK-15084.
      0903a185
    • Timothy Chen's avatar
      [SPARK-14645][MESOS] Fix python running on cluster mode mesos to have non local uris · c1839c99
      Timothy Chen authored
      ## What changes were proposed in this pull request?
      
      Fix SparkSubmit to allow non-local Python URIs.
      
      ## How was this patch tested?
      
      Manually tested with mesos-spark-dispatcher
      
      Author: Timothy Chen <tnachen@gmail.com>
      
      Closes #12403 from tnachen/enable_remote_python.
      c1839c99
    • Sandeep Singh's avatar
      [SPARK-14422][SQL] Improve handling of optional configs in SQLConf · a8d56f53
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
      Create a new API for handling optional configs in SQLConf.
      Right now, `getConf` for an `OptionalConfigEntry[T]` returns a value of type `T` and throws an exception if the entry doesn't exist. Add a new method `getOptionalConf` (suggestions on naming welcome) which returns a value of type `Option[T]`, so if the entry doesn't exist it returns `None`.
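
      A self-contained toy model of the idea (this is not Spark's SQLConf; the key names are illustrative): `getConf` throws when an entry is unset, while `getOptionalConf` returns an `Option` instead.
      ```scala
      class ToyConf {
        private val settings = scala.collection.mutable.Map[String, String]()

        def set(key: String, value: String): Unit = settings(key) = value

        // Existing behavior: throw if the entry is missing.
        def getConf(key: String): String =
          settings.getOrElse(key, throw new NoSuchElementException(key))

        // Proposed behavior: return None instead of throwing.
        def getOptionalConf(key: String): Option[String] = settings.get(key)
      }

      val conf = new ToyConf
      conf.set("spark.sql.some.path", "/tmp/x")       // illustrative key
      conf.getOptionalConf("spark.sql.some.path")     // Some("/tmp/x")
      conf.getOptionalConf("spark.sql.missing.entry") // None, instead of an exception
      ```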
      
      ## How was this patch tested?
      Added a test and ran tests locally.
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #12846 from techaddict/SPARK-14422.
      a8d56f53
    • Shuai Lin's avatar
      [MINOR][DOC] Fixed some python snippets in mllib data types documentation. · c4e0fde8
      Shuai Lin authored
      ## What changes were proposed in this pull request?
      
      Some Python snippets were using Scala imports and comments.
      
      ## How was this patch tested?
      
      Generated the docs locally with `SKIP_API=1 jekyll build` and viewed the changes in the browser.
      
      Author: Shuai Lin <linshuai2012@gmail.com>
      
      Closes #12869 from lins05/fix-mllib-python-snippets.
      c4e0fde8
    • Andrew Ash's avatar
      [SPARK-15104] Fix spacing in log line · dbacd999
      Andrew Ash authored
      Otherwise we get logs that look like this (note there is no space before NODE_LOCAL):
      
      ```
      INFO  [2016-05-03 21:18:51,477] org.apache.spark.scheduler.TaskSetManager: Starting task 0.0 in stage 101.0 (TID 7029, localhost, partition 0,NODE_LOCAL, 1894 bytes)
      ```
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #12880 from ash211/patch-7.
      dbacd999
    • Davies Liu's avatar
      [SQL-15102][SQL] remove delegation token support from ThriftServer · 028c6a5d
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      These APIs are only useful for Hadoop and may not work for Spark SQL.

      The APIs are kept for source compatibility.
      
      ## How was this patch tested?
      
      No unit tests needed.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12878 from davies/remove_delegate.
      028c6a5d
    • gatorsmile's avatar
      [SPARK-15056][SQL] Parse Unsupported Sampling Syntax and Issue Better Exceptions · 71296c04
      gatorsmile authored
      #### What changes were proposed in this pull request?
      Compared with the current Spark parser, Hive supports two extra sampling syntaxes:
      - In `ON` clauses, `rand()` is used to indicate sampling on the entire row instead of an individual column. For example,
      
         ```SQL
         SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;
         ```
      - Users can specify the total length to be read. For example,
      
         ```SQL
         SELECT * FROM source TABLESAMPLE(100M) s;
         ```
      
      Below is the link for references:
         https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling
      
      This PR parses and captures these two extra syntaxes, and issues a better error message.
      
      #### How was this patch tested?
      Added test cases to verify the thrown exceptions
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #12838 from gatorsmile/bucketOnRand.
      71296c04
    • yinxusen's avatar
      [SPARK-14973][ML] The CrossValidator and TrainValidationSplit miss the seed when saving and loading · 2e2a6211
      yinxusen authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-14973
      
      Add seed support when saving/loading CrossValidator and TrainValidationSplit.
      
      ## How was this patch tested?
      
      Spark unit test.
      
      Author: yinxusen <yinxusen@gmail.com>
      
      Closes #12825 from yinxusen/SPARK-14973.
      2e2a6211
    • Davies Liu's avatar
      [SPARK-15095][SQL] drop binary mode in ThriftServer · d6c7b2a5
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR drops the support for binary mode in ThriftServer; only HTTP mode is supported now, to reduce the maintenance burden.

      The code to support binary mode is still kept, just in case we want it in the future.
      
      ## How was this patch tested?
      
      Updated tests to use HTTP mode.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12876 from davies/hide_binary.
      d6c7b2a5
    • Andrew Or's avatar
      [SPARK-15073][SQL] Hide SparkSession constructor from the public · 588cac41
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Users should use the builder pattern instead.
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12873 from andrewor14/spark-session-constructor.
      588cac41