  1. May 05, 2016
  2. May 04, 2016
    • Davies Liu's avatar
      [MINOR] remove dead code · 42837419
      Davies Liu authored
      42837419
    • Tathagata Das's avatar
      [SPARK-15131][SQL] Shutdown StateStore management thread when SparkContext has been shutdown · bde27b89
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Whenever the StateStoreCoordinator cannot be contacted, assume that the SparkContext and RpcEnv on the driver have been shut down, and therefore stop the StateStore management thread and unload all loaded stores.
      
      ## How was this patch tested?
      
      Updated unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #12905 from tdas/SPARK-15131.
      bde27b89
    • gatorsmile's avatar
      [SPARK-14993][SQL] Fix Partition Discovery Inconsistency when Input is a Path to Parquet File · ef55e46c
      gatorsmile authored
      #### What changes were proposed in this pull request?
      When we load a dataset, if we set the path to `/path/a=1`, we will not take `a` as the partitioning column. However, if we set the path to `/path/a=1/file.parquet`, we take `a` as the partitioning column and it shows up in the schema.
      
      This PR is to fix the behavior inconsistency issue.
      
      The base path contains a set of paths that are considered as the base dirs of the input datasets. The partitioning discovery logic will make sure it will stop when it reaches any base path.
      
      By default, the paths of the dataset provided by users will be base paths. Below are three typical cases:
      **Case 1** `sqlContext.read.parquet("/path/something=true/")`: the base path will be `/path/something=true/`, and the returned DataFrame will not contain a column of `something`.
      **Case 2** `sqlContext.read.parquet("/path/something=true/a.parquet")`: the base path will still be `/path/something=true/`, and the returned DataFrame will also not contain a column of `something`.
      **Case 3** `sqlContext.read.parquet("/path/")`: the base path will be `/path/`, and the returned DataFrame will have the column `something`.
      
      Users can also override the base path by setting `basePath` in the options to pass the new base path to the data source. For example, with
      `sqlContext.read.option("basePath", "/path/").parquet("/path/something=true/")`,
      the returned DataFrame will have the column `something`.
      
      The related PRs:
      - https://github.com/apache/spark/pull/9651
      - https://github.com/apache/spark/pull/10211
      
      #### How was this patch tested?
      Added a couple of test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #12828 from gatorsmile/readPartitionedTable.
      ef55e46c
    • Sean Zhong's avatar
      [SPARK-6339][SQL] Supports CREATE TEMPORARY VIEW tableIdentifier AS query · 8fb1463d
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      This PR supports the new SQL syntax CREATE TEMPORARY VIEW, for example:
      ```
      CREATE TEMPORARY VIEW viewName AS SELECT * from xx
      CREATE OR REPLACE TEMPORARY VIEW viewName AS SELECT * from xx
      CREATE TEMPORARY VIEW viewName (c1 COMMENT 'blabla', c2 COMMENT 'blabla') AS SELECT * FROM xx
      ```
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: Sean Zhong <clockfly@gmail.com>
      
      Closes #12872 from clockfly/spark-6399.
      8fb1463d
    • Andrew Or's avatar
      [SPARK-14896][SQL] Deprecate HiveContext in python · fa79d346
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      See title.
      
      ## How was this patch tested?
      
      PySpark tests.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12917 from andrewor14/deprecate-hive-context-python.
      fa79d346
    • sethah's avatar
      [MINOR][SQL] Fix typo in DataFrameReader csv documentation · b2813776
      sethah authored
      ## What changes were proposed in this pull request?
      Typo fix
      
      ## How was this patch tested?
      No tests
      
      My apologies for the tiny PR, but I stumbled across this today and wanted to get it corrected for 2.0.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #12912 from sethah/csv_typo.
      b2813776
    • Wenchen Fan's avatar
      [SPARK-15116] In REPL we should create SparkSession first and get SparkContext from it · a432a2b8
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      See https://github.com/apache/spark/pull/12873#discussion_r61993910. The problem is that if we create `SparkContext` first and then call `SparkSession.builder.enableHiveSupport().getOrCreate()`, we will reuse the existing `SparkContext` and the Hive flag won't be set.
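
      A minimal sketch of the ordering this change enforces (the app name and master below are illustrative):
      ```scala
      import org.apache.spark.sql.SparkSession

      // Build the SparkSession first so that builder options such as
      // enableHiveSupport() are honored, then derive the SparkContext from it.
      val spark = SparkSession.builder()
        .appName("repl-sketch")   // illustrative app name
        .master("local[*]")       // illustrative master
        .enableHiveSupport()      // requires Hive classes on the classpath
        .getOrCreate()

      val sc = spark.sparkContext // reuse the context owned by the session
      ```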
      
      ## How was this patch tested?
      
      Verified it locally.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12890 from cloud-fan/repl.
      a432a2b8
    • Sebastien Rainville's avatar
      [SPARK-13001][CORE][MESOS] Prevent getting offers when reached max cores · eb019af9
      Sebastien Rainville authored
      Similar to https://github.com/apache/spark/pull/8639
      
      This change rejects offers for 120s once `spark.cores.max` is reached in coarse-grained mode, to mitigate offer starvation. This prevents Mesos from sending us offers again and again, starving other frameworks. This is especially problematic when running many small frameworks on the same Mesos cluster, e.g. many small Spark Streaming jobs, and causes the bigger Spark jobs to stop receiving offers. By rejecting the offers for a long period of time, they become available to those other frameworks.
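
      For context, a hedged sketch of the relevant configuration (the Mesos master URL and the core count are illustrative):
      ```scala
      import org.apache.spark.SparkConf

      // Illustrative values only: once the job holds 8 cores, further Mesos offers
      // are declined (for 120s after this change) instead of being sent to this
      // framework over and over.
      val conf = new SparkConf()
        .setMaster("mesos://zk://zk1:2181/mesos") // illustrative Mesos master URL
        .setAppName("small-streaming-job")
        .set("spark.cores.max", "8")
      ```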
      
      Author: Sebastien Rainville <sebastien@hopper.com>
      
      Closes #10924 from sebastienrainville/master.
      eb019af9
    • Dongjoon Hyun's avatar
      [SPARK-15031][EXAMPLE] Use SparkSession in Scala/Python/Java example. · cdce4e62
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to update the Scala/Python/Java examples by replacing `SQLContext` with the newly added `SparkSession`.
      
      - Use the **SparkSession Builder Pattern** in 154 (Scala 55, Java 52, Python 47) files (see the sketch below).
      - Add `getConf` in Python SparkContext class: `python/pyspark/context.py`
      - Replace **SQLContext Singleton Pattern** with **SparkSession Singleton Pattern**:
        - `SqlNetworkWordCount.scala`
        - `JavaSqlNetworkWordCount.java`
        - `sql_network_wordcount.py`
      
      Now, `SQLContext` is used only in the R examples and the following two Python examples. The Python examples are untouched in this PR since they already fail due to an unknown issue.
      - `simple_params_example.py`
      - `aft_survival_regression.py`
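
      A minimal sketch of the builder pattern the examples move to (the JSON path and names below are illustrative):
      ```scala
      import org.apache.spark.sql.SparkSession

      // Before: examples created a SparkContext and wrapped it in a SQLContext.
      // After: a single SparkSession is built and used directly.
      val spark = SparkSession.builder()
        .appName("ExampleApp")
        .getOrCreate()

      val df = spark.read.json("examples/src/main/resources/people.json")
      df.show()

      spark.stop()
      ```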
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12809 from dongjoon-hyun/SPARK-15031.
      cdce4e62
    • Bryan Cutler's avatar
      [SPARK-12299][CORE] Remove history serving functionality from Master · cf2e9da6
      Bryan Cutler authored
      Remove history server functionality from the standalone Master. Previously, the Master process rebuilt a SparkUI once an application completed, which sometimes caused problems such as OOM when the application event log was large (see SPARK-6270). Keeping this functionality out of the Master will help to simplify the process and increase stability.
      
      Testing for this change included running core unit tests and manually running an application on a standalone cluster to verify that it completed successfully and that the Master UI functioned correctly.  Also added 2 unit tests to verify killing an application and driver from MasterWebUI makes the correct request to the Master.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #10991 from BryanCutler/remove-history-master-SPARK-12299.
      cf2e9da6
    • Thomas Graves's avatar
      [SPARK-15121] Improve logging of external shuffle handler · 0c00391f
      Thomas Graves authored
      ## What changes were proposed in this pull request?
      
      Add more informative logging in the external shuffle service to aid in debugging who is connecting to the YARN Nodemanager when the external shuffle service runs under it.
      
      ## How was this patch tested?
      
      Ran it and saw the logs coming out in the log file.
      
      Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
      
      Closes #12900 from tgravescs/SPARK-15121.
      0c00391f
    • Reynold Xin's avatar
      [SPARK-15126][SQL] RuntimeConfig.set should return Unit · 6ae9fc00
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Currently we return RuntimeConfig itself to facilitate chaining. However, this makes the output in interactive environments (e.g. notebooks, the Scala REPL) confusing, because it shows the result of calling `set` as a RuntimeConfig itself.
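
      A minimal sketch of the visible difference (the config key and values are illustrative):
      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("conf-sketch").getOrCreate()

      // Previously this expression evaluated to the RuntimeConfig itself, so an
      // interactive cell ending here echoed a RuntimeConfig; it now returns Unit.
      spark.conf.set("spark.sql.shuffle.partitions", "10")

      val n = spark.conf.get("spark.sql.shuffle.partitions") // "10"
      ```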
      
      ## How was this patch tested?
      Updated unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12902 from rxin/SPARK-15126.
      6ae9fc00
    • Tathagata Das's avatar
      [SPARK-15103][SQL] Refactored FileCatalog class to allow StreamFileCatalog to infer partitioning · 0fd3a474
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      The File Stream Sink writes the list of written files to a metadata log. StreamFileCatalog reads that list of files for processing. However, StreamFileCatalog does not infer partitioning the way HDFSFileCatalog does.

      This PR enables that by refactoring HDFSFileCatalog to create an abstract class, PartitioningAwareFileCatalog, which has all the functionality to infer partitions from a list of leaf files (see the sketch below).
      - HDFSFileCatalog has been renamed to ListingFileCatalog and extends PartitioningAwareFileCatalog by providing a list of leaf files from recursive directory scanning.
      - StreamFileCatalog has been renamed to MetadataLogFileCatalog and extends PartitioningAwareFileCatalog by providing a list of leaf files from the metadata log.
      - The above two classes have been moved into their own files, as they are not interfaces that belong in fileSourceInterfaces.scala.
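
      A rough structural sketch of the hierarchy described above (member names and signatures are simplified assumptions, not the actual code):
      ```scala
      import org.apache.hadoop.fs.{FileStatus, Path}

      // Condensed illustration only; the real classes live in
      // org.apache.spark.sql.execution.datasources and carry much more state.
      abstract class PartitioningAwareFileCatalog {
        // Subclasses only say which leaf files exist ...
        protected def leafFiles: Seq[FileStatus]
        // ... while partition inference over those files is shared here.
        def inferPartitioning(): Unit = { /* shared partition-discovery logic */ }
      }

      // Leaf files come from recursively scanning the input directories.
      class ListingFileCatalog(paths: Seq[Path]) extends PartitioningAwareFileCatalog {
        protected def leafFiles: Seq[FileStatus] = Seq.empty // placeholder
      }

      // Leaf files come from the file stream sink's metadata log.
      class MetadataLogFileCatalog(metadataPath: Path) extends PartitioningAwareFileCatalog {
        protected def leafFiles: Seq[FileStatus] = Seq.empty // placeholder
      }
      ```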
      
      ## How was this patch tested?
      - FileStreamSinkSuite was updated to check that partitioning gets inferred, and that on reading the partitions are pruned correctly based on the query.
      - Other unit tests are unchanged and pass as expected.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #12879 from tdas/SPARK-15103.
      0fd3a474
    • Reynold Xin's avatar
      [SPARK-15115][SQL] Reorganize whole stage codegen benchmark suites · 6274a520
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We currently have a single suite that is very large, making it difficult to maintain and play with specific primitives. This patch reorganizes the file by creating multiple benchmark suites in a single package.
      
      Most of the changes are a straightforward move of code. On top of moving the code, I did:
      1. Use SparkSession instead of SQLContext.
      2. Turn most benchmark scenarios into their own test cases, rather than having multiple scenarios in a single test case, which takes forever to run.
      
      ## How was this patch tested?
      This is a test only change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12891 from rxin/SPARK-15115.
      6274a520
    • Zheng RuiFeng's avatar
      [MINOR] Add python3 compatibility in python examples · 4530250f
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Add python3 compatibility in python examples
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12868 from zhengruifeng/fix_gmm_py.
      4530250f
    • Liang-Chi Hsieh's avatar
      [SPARK-14951] [SQL] Support subexpression elimination in TungstenAggregate · b85d21fb
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      We can support subexpression elimination in TungstenAggregate by using the existing `EquivalentExpressions`, which is already used for subexpression elimination in expression codegen.

      However, in whole-stage codegen we can't wrap the common expressions' code in functions as before; we simply generate the code snippets for the common expressions. These snippets are inserted before the common expressions are actually used in the generated Java code.

      When multiple `TypedAggregateExpression`s are used in an aggregation operator, their input types should be the same, so their `inputDeserializer`s will be the same too. This patch can therefore also reduce redundant input deserialization.
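
      As a hedged illustration of the kind of query that benefits (the table and column names are made up), `a * b` below is a common subexpression that can now be computed once per input row inside the generated aggregation code and reused across the aggregates:
      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("subexpr-sketch").getOrCreate()
      import spark.implicits._

      Seq((1, 2), (3, 4)).toDF("a", "b").registerTempTable("tbl")

      // a * b appears in several aggregate expressions of the same aggregation.
      spark.sql("SELECT sum(a * b), avg(a * b), max(a * b) FROM tbl").show()
      ```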
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #12729 from viirya/subexpr-elimination-tungstenaggregate.
      b85d21fb
    • Reynold Xin's avatar
      [SPARK-15109][SQL] Accept Dataset[_] in joins · d864c55c
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch changes the join API in Dataset so it can accept any Dataset, rather than just DataFrames.
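
      A minimal sketch of what the relaxed signature allows (the case class, column names, and data are illustrative):
      ```scala
      import org.apache.spark.sql.SparkSession

      case class User(id: Long, name: String)

      val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
      import spark.implicits._

      val users = Seq(User(1L, "a"), User(2L, "b")).toDS()          // a typed Dataset
      val orders = Seq((1L, 10.0), (2L, 20.0)).toDF("id", "amount") // a DataFrame

      // join now accepts any Dataset[_] on the right-hand side, not just a DataFrame.
      orders.join(users, "id").show()
      ```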
      
      ## How was this patch tested?
      N/A.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12886 from rxin/SPARK-15109.
      d864c55c
    • Liwei Lin's avatar
      [SPARK-15022][SPARK-15023][SQL][STREAMING] Add support for testing against the... · e597ec6f
      Liwei Lin authored
      [SPARK-15022][SPARK-15023][SQL][STREAMING] Add support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `ManualClock`
      
      ## What changes were proposed in this pull request?
      
      Currently in `StreamTest`, we have a `StartStream` which starts a streaming query against the trigger `ProcessingTime(intervalMS = 0)` and `SystemClock`.

      We also need to test against `ProcessingTime(intervalMS > 0)`, which often requires `ManualClock`.

      This patch:
      - fixes an issue in `ProcessingTimeExecutor`, where for a batch it should run `batchRunner` only once but might run it multiple times under certain conditions;
      - adds support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `ManualClock`, by specifying them as fields of `StartStream`, and by adding an `AdvanceManualClock` action;
      - adds a test, which takes advantage of the new `StartStream` and `AdvanceManualClock`, to test against [SPARK-14942] "Reduce delay between batch construction and execution" (https://github.com/apache/spark/pull/12725).
      
      ## How was this patch tested?
      
      N/A
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #12797 from lw-lin/add-trigger-test-support.
      e597ec6f
    • Dhruve Ashar's avatar
      [SPARK-4224][CORE][YARN] Support group acls · a4564774
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      Currently only a list of users can be specified for view and modify acls. This change enables a group of admins/devs/users to be provisioned for viewing and modifying Spark jobs.
      
      **Changes Proposed in the fix**
      Three new corresponding config entries have been added where the user can specify the groups to be given access.
      
      ```
      spark.admin.acls.groups
      spark.modify.acls.groups
      spark.ui.view.acls.groups
      ```
      
      New config entries were added because specifying the users and groups explicitly is a better and cleaner way compared to specifying them in the existing config entry using a delimiter.
      
      A generic trait has been introduced to provide the user-to-group mapping, which makes it pluggable to support a variety of mapping protocols, similar to the one used in Hadoop. A default Unix shell based implementation has been provided.
      A custom user-to-group mapping protocol can be specified and configured via the entry `spark.user.groups.mapping`.
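
      A hedged sketch of wiring the new entries up (the group names are illustrative, and the custom mapping class name is hypothetical):
      ```scala
      import org.apache.spark.SparkConf

      // Illustrative values only.
      val conf = new SparkConf()
        .set("spark.acls.enable", "true")                 // existing switch that turns acls on
        .set("spark.admin.acls.groups", "spark-admins")
        .set("spark.modify.acls.groups", "etl-devs")
        .set("spark.ui.view.acls.groups", "analysts,etl-devs")
        // Optional: plug in a custom user-to-group mapping implementation
        // (this class name is hypothetical).
        .set("spark.user.groups.mapping", "com.example.LdapGroupsMapping")
      ```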
      
      **How the patch was Tested**
      We ran different Spark jobs, setting the config entries in combinations of admin, modify and UI acls. For modify acls we tried killing the job stages from the UI and using YARN commands. For view acls we tried accessing the UI tabs and the logs. Headless accounts were used to launch these jobs, and different users tried to modify and view the jobs to ensure that the group mappings were applied correctly.
      
      Additional Unit tests have been added without modifying the existing ones. These test for different ways of setting the acls through configuration and/or API and validate the expected behavior.
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #12760 from dhruve/impr/SPARK-4224.
      a4564774
    • Dominik Jastrzębski's avatar
      [SPARK-14844][ML] Add setFeaturesCol and setPredictionCol to KMeansM… · abecbcd5
      Dominik Jastrzębski authored
      ## What changes were proposed in this pull request?
      
      Introduction of setFeaturesCol and setPredictionCol methods to KMeansModel in ML library.
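
      A minimal usage sketch of the new setters (the dataset path and column names below are illustrative):
      ```scala
      import org.apache.spark.ml.clustering.KMeans
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("kmeans-sketch").getOrCreate()

      // Illustrative libsvm dataset path.
      val data = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
      val model = new KMeans().setK(2).setSeed(1L).fit(data)

      // The new setters let the fitted model read from / write to differently named columns.
      model
        .setFeaturesCol("features")
        .setPredictionCol("cluster")
        .transform(data)
        .show()
      ```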
      
      ## How was this patch tested?
      
      By running KMeansSuite.
      
      Author: Dominik Jastrzębski <dominik.jastrzebski@codilime.com>
      
      Closes #12609 from dominik-jastrzebski/master.
      abecbcd5
    • Cheng Lian's avatar
      [SPARK-14127][SQL] Native "DESC [EXTENDED | FORMATTED] <table>" DDL command · f152fae3
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR implements native `DESC [EXTENDED | FORMATTED] <table>` DDL command. Sample output:
      
      ```
      scala> spark.sql("desc extended src").show(100, truncate = false)
      +----------------------------+---------------------------------+-------+
      |col_name                    |data_type                        |comment|
      +----------------------------+---------------------------------+-------+
      |key                         |int                              |       |
      |value                       |string                           |       |
      |                            |                                 |       |
      |# Detailed Table Information|CatalogTable(`default`.`src`, ...|       |
      +----------------------------+---------------------------------+-------+
      
      scala> spark.sql("desc formatted src").show(100, truncate = false)
      +----------------------------+----------------------------------------------------------+-------+
      |col_name                    |data_type                                                 |comment|
      +----------------------------+----------------------------------------------------------+-------+
      |key                         |int                                                       |       |
      |value                       |string                                                    |       |
      |                            |                                                          |       |
      |# Detailed Table Information|                                                          |       |
      |Database:                   |default                                                   |       |
      |Owner:                      |lian                                                      |       |
      |Create Time:                |Mon Jan 04 17:06:00 CST 2016                              |       |
      |Last Access Time:           |Thu Jan 01 08:00:00 CST 1970                              |       |
      |Location:                   |hdfs://localhost:9000/user/hive/warehouse_hive121/src     |       |
      |Table Type:                 |MANAGED                                                   |       |
      |Table Parameters:           |                                                          |       |
      |  transient_lastDdlTime     |1451898360                                                |       |
      |                            |                                                          |       |
      |# Storage Information       |                                                          |       |
      |SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe        |       |
      |InputFormat:                |org.apache.hadoop.mapred.TextInputFormat                  |       |
      |OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat|       |
      |Num Buckets:                |-1                                                        |       |
      |Bucket Columns:             |[]                                                        |       |
      |Sort Columns:               |[]                                                        |       |
      |Storage Desc Parameters:    |                                                          |       |
      |  serialization.format      |1                                                         |       |
      +----------------------------+----------------------------------------------------------+-------+
      ```
      
      ## How was this patch tested?
      
      A test case is added to `HiveDDLSuite` to check command output.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #12844 from liancheng/spark-14127-desc-table.
      f152fae3
    • Wenchen Fan's avatar
      [SPARK-15029] improve error message for Generate · 6c12e801
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR improves the error message for `Generate` in 3 cases:

      1. the generator is nested in expressions, e.g. `SELECT explode(list) + 1 FROM tbl`
      2. the generator appears more than once in SELECT, e.g. `SELECT explode(list), explode(list) FROM tbl`
      3. the generator appears in an operator other than Project, e.g. `SELECT * FROM tbl SORT BY explode(list)`
      
      ## How was this patch tested?
      
      new tests in `AnalysisErrorSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12810 from cloud-fan/bug.
      6c12e801
    • Cheng Lian's avatar
      [SPARK-14237][SQL] De-duplicate partition value appending logic in various... · bc3760d4
      Cheng Lian authored
      [SPARK-14237][SQL] De-duplicate partition value appending logic in various buildReader() implementations
      
      ## What changes were proposed in this pull request?
      
      Currently, various `FileFormat` data sources share approximately the same code for partition value appending. This PR tries to eliminate this duplication.
      
      A new method `buildReaderWithPartitionValues()` is added to `FileFormat` with a default implementation that appends partition values to `InternalRow`s produced by the reader function returned by `buildReader()`.
      
      Special data sources like Parquet, which implements partition value appending inside `buildReader()` because of the vectorized reader, and the Text data source, which doesn't support partitioning, override `buildReaderWithPartitionValues()` and simply delegate to `buildReader()`.
      
      This PR brings two benefits:
      
      1. Apparently, it de-duplicates partition value appending logic
      
      2. Now the reader function returned by `buildReader()` is only required to produce `InternalRow`s rather than `UnsafeRow`s if the data source doesn't override `buildReaderWithPartitionValues()`.
      
         This is because the safe-to-unsafe conversion is also performed while appending partition values. This makes 3rd-party data sources (e.g. spark-avro) easier to implement, since they no longer need to access private APIs involving `UnsafeRow`.
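
      A heavily condensed sketch of the shape of the change (the real `FileFormat` methods take schemas, filters, options, and a Hadoop configuration; everything below is simplified for illustration):
      ```scala
      // Condensed illustration only; not the real org.apache.spark.sql FileFormat trait.
      trait FileFormatSketch {
        type Row = Seq[Any]                       // stand-in for InternalRow
        type FileReader = String => Iterator[Row] // stand-in for PartitionedFile => Iterator[InternalRow]

        // Each format implements this: read the data columns of one file.
        def buildReader(): FileReader

        // Shared default: append the file's partition values to every row, so that
        // individual formats no longer duplicate this logic. Formats like Parquet
        // (vectorized) or Text (unpartitioned) can override it and simply delegate.
        def buildReaderWithPartitionValues(partitionValues: Row): FileReader = { file =>
          buildReader()(file).map(row => row ++ partitionValues)
        }
      }
      ```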
      
      ## How was this patch tested?
      
      Existing tests should do the work.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #12866 from liancheng/spark-14237-simplify-partition-values-appending.
      bc3760d4
    • Reynold Xin's avatar
      [SPARK-15107][SQL] Allow varying # iterations by test case in Benchmark · 695f0e91
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch changes our micro-benchmark util to allow setting different iteration numbers for different test cases. For some of our benchmarks, turning off whole-stage codegen can make the runtime 20X slower, making it very difficult to run a large number of times without substantially shortening the input cardinality.
      
      With this change, I set the default num iterations to 2 for whole stage codegen off, and 5 for whole stage codegen on. I also updated some results.
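
      A hedged sketch of what per-case iteration counts could look like (the `Benchmark` class is an internal test utility; treat the exact constructor and `addCase` signature here as assumptions rather than quotes from the patch):
      ```scala
      import org.apache.spark.util.Benchmark

      // Signature details are assumed, not quoted from the patch.
      val benchmark = new Benchmark("range/sum", 500L << 20)

      benchmark.addCase("codegen = false", numIters = 2) { _ =>
        // run the query with whole-stage codegen disabled
      }
      benchmark.addCase("codegen = true", numIters = 5) { _ =>
        // run the query with whole-stage codegen enabled
      }
      benchmark.run()
      ```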
      
      ## How was this patch tested?
      N/A - this is a test util.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12884 from rxin/SPARK-15107.
      695f0e91
  3. May 03, 2016
    • Davies Liu's avatar
      [SPARK-15095][SQL] remove HiveSessionHook from ThriftServer · 348c1389
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Remove HiveSessionHook
      
      ## How was this patch tested?
      
      No tests needed.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12881 from davies/remove_hooks.
      348c1389
    • Andrew Or's avatar
      [SPARK-14414][SQL] Make DDL exceptions more consistent · 6ba17cd1
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Just a bunch of small tweaks on DDL exception messages.
      
      ## How was this patch tested?
      
      `DDLCommandSuite` et al.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12853 from andrewor14/make-exceptions-consistent.
      6ba17cd1
    • Koert Kuipers's avatar
      [SPARK-15097][SQL] make Dataset.sqlContext a stable identifier for imports · 9e4928b7
      Koert Kuipers authored
      ## What changes were proposed in this pull request?
      Make Dataset.sqlContext a lazy val so that it's a stable identifier and can be used for imports.
      Now this works again:
      `import someDataset.sqlContext.implicits._`
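
      A minimal sketch of why the `lazy val` matters: a Scala `import` requires a stable identifier (a `val`/`lazy val` path), so a `def`-based sqlContext cannot be imported from. Names below are illustrative.
      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("stable-id-sketch").getOrCreate()

      val someDataset = spark.range(3) // any Dataset will do

      // Compiles again now that Dataset.sqlContext is a (lazy) val:
      import someDataset.sqlContext.implicits._
      val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
      df.show()
      ```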
      
      ## How was this patch tested?
      Added a unit test to DatasetSuite that uses the import shown above.
      
      Author: Koert Kuipers <koert@tresata.com>
      
      Closes #12877 from koertkuipers/feat-sqlcontext-stable-import.
      9e4928b7
    • Dongjoon Hyun's avatar
      [SPARK-15084][PYTHON][SQL] Use builder pattern to create SparkSession in PySpark. · 0903a185
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This is a Python port of the corresponding Scala builder pattern code. `sql.py` is modified as a target example case.
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12860 from dongjoon-hyun/SPARK-15084.
      0903a185
    • Timothy Chen's avatar
      [SPARK-14645][MESOS] Fix python running on cluster mode mesos to have non local uris · c1839c99
      Timothy Chen authored
      ## What changes were proposed in this pull request?
      
      Fix SparkSubmit to allow non-local Python URIs.
      
      ## How was this patch tested?
      
      Manually tested with mesos-spark-dispatcher
      
      Author: Timothy Chen <tnachen@gmail.com>
      
      Closes #12403 from tnachen/enable_remote_python.
      c1839c99
    • Sandeep Singh's avatar
      [SPARK-14422][SQL] Improve handling of optional configs in SQLConf · a8d56f53
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
      Create a new API for handling optional configs in SQLConf.
      Right now, `getConf` for an `OptionalConfigEntry[T]` returns a value of type `T` and throws an exception if the entry doesn't exist. Add a new method `getOptionalConf` (suggestions on naming welcome) which returns a value of type `Option[T]`, so if the entry doesn't exist it returns `None`.
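
      A self-contained toy model of the idea (this is not Spark's SQLConf; the key names are illustrative): `getConf` throws when an entry is unset, while `getOptionalConf` returns an `Option` instead.
      ```scala
      class ToyConf {
        private val settings = scala.collection.mutable.Map[String, String]()

        def set(key: String, value: String): Unit = settings(key) = value

        // Existing behavior: throw if the entry is missing.
        def getConf(key: String): String =
          settings.getOrElse(key, throw new NoSuchElementException(key))

        // Proposed behavior: return None instead of throwing.
        def getOptionalConf(key: String): Option[String] = settings.get(key)
      }

      val conf = new ToyConf
      conf.set("spark.sql.some.path", "/tmp/x")       // illustrative key
      conf.getOptionalConf("spark.sql.some.path")     // Some("/tmp/x")
      conf.getOptionalConf("spark.sql.missing.entry") // None, instead of an exception
      ```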
      
      ## How was this patch tested?
      Added a test and ran tests locally.
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #12846 from techaddict/SPARK-14422.
      a8d56f53
    • Shuai Lin's avatar
      [MINOR][DOC] Fixed some python snippets in mllib data types documentation. · c4e0fde8
      Shuai Lin authored
      ## What changes were proposed in this pull request?
      
      Some Python snippets were using Scala imports and comments.
      
      ## How was this patch tested?
      
      Generated the docs locally with `SKIP_API=1 jekyll build` and viewed the changes in the browser.
      
      Author: Shuai Lin <linshuai2012@gmail.com>
      
      Closes #12869 from lins05/fix-mllib-python-snippets.
      c4e0fde8
    • Andrew Ash's avatar
      [SPARK-15104] Fix spacing in log line · dbacd999
      Andrew Ash authored
      Otherwise we get logs that look like this (note there is no space before NODE_LOCAL):
      
      ```
      INFO  [2016-05-03 21:18:51,477] org.apache.spark.scheduler.TaskSetManager: Starting task 0.0 in stage 101.0 (TID 7029, localhost, partition 0,NODE_LOCAL, 1894 bytes)
      ```
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #12880 from ash211/patch-7.
      dbacd999
    • Davies Liu's avatar
      [SQL-15102][SQL] remove delegation token support from ThriftServer · 028c6a5d
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      These APIs are only useful for Hadoop and may not work for Spark SQL.

      The APIs are kept for source compatibility.
      
      ## How was this patch tested?
      
      No unit tests needed.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12878 from davies/remove_delegate.
      028c6a5d
    • gatorsmile's avatar
      [SPARK-15056][SQL] Parse Unsupported Sampling Syntax and Issue Better Exceptions · 71296c04
      gatorsmile authored
      #### What changes were proposed in this pull request?
      Compared with the current Spark parser, Hive supports two extra sampling syntaxes:
      - In `ON` clauses, `rand()` is used to indicate sampling on the entire row instead of an individual column. For example,
      
         ```SQL
         SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;
         ```
      - Users can specify the total length to be read. For example,
      
         ```SQL
         SELECT * FROM source TABLESAMPLE(100M) s;
         ```
      
      Below is the link for references:
         https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling
      
      This PR parses and captures these two extra syntaxes, and issues a better error message.
      
      #### How was this patch tested?
      Added test cases to verify the thrown exceptions
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #12838 from gatorsmile/bucketOnRand.
      71296c04
    • yinxusen's avatar
      [SPARK-14973][ML] The CrossValidator and TrainValidationSplit miss the seed when saving and loading · 2e2a6211
      yinxusen authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-14973
      
      Add seed support when saving/loading CrossValidator and TrainValidationSplit.
      
      ## How was this patch tested?
      
      Spark unit test.
      
      Author: yinxusen <yinxusen@gmail.com>
      
      Closes #12825 from yinxusen/SPARK-14973.
      2e2a6211
    • Davies Liu's avatar
      [SPARK-15095][SQL] drop binary mode in ThriftServer · d6c7b2a5
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR drops the support for binary mode in ThriftServer; only HTTP mode is supported now, to reduce the maintenance burden.

      The code to support binary mode is still kept, just in case we want it in the future.
      
      ## How was this patch tested?
      
      Updated tests to use HTTP mode.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12876 from davies/hide_binary.
      d6c7b2a5
    • Andrew Or's avatar
      [SPARK-15073][SQL] Hide SparkSession constructor from the public · 588cac41
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Users should use the builder pattern instead.
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12873 from andrewor14/spark-session-constructor.
      588cac41