  1. Jul 23, 2016
    • [SPARK-16690][TEST] rename SQLTestUtils.withTempTable to withTempView · 86c27520
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      After https://github.com/apache/spark/pull/12945, we renamed `registerTempTable` to `createTempView`, since it actually creates a view. This PR renames `SQLTestUtils.withTempTable` to `withTempView` to reflect this change.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14318 from cloud-fan/minor4.
      86c27520
    • [SPARK-16662][PYSPARK][SQL] fix HiveContext warning bug · ab6e4aea
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Move the `HiveContext` deprecation warning into the `HiveContext` constructor,
      so that the warning appears only when we actually use `HiveContext`;
      otherwise the warning would always appear whenever the pyspark.ml.context code file is merely referenced.
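      
      For illustration, the same idea sketched in Scala (the actual change is in PySpark; the class name and message below are purely illustrative, not Spark code): emit the warning from the deprecated class's constructor so it fires only on instantiation, not when the file defining it is loaded.
      
      ```scala
      // Illustrative only -- the real change lives in PySpark's HiveContext.
      class HiveContextLike {
        // Emitted at construction time rather than at module-load time.
        Console.err.println(
          "HiveContext is deprecated. Please use SparkSession.builder.enableHiveSupport() instead.")
      }
      
      // Merely loading the file defining HiveContextLike prints nothing;
      // only `new HiveContextLike()` triggers the warning.
      ```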
      
      ## How was this patch tested?
      
      Manual.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14301 from WeichenXu123/hiveContext_python_warning_update.
      ab6e4aea
    • [SPARK-16561][MLLIB] fix multivarOnlineSummary min/max bug · 25db5167
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Rename variables to make the code clearer:
      nnz => weightSum
      weightSum => totalWeightSum
      
      Also add a new member vector `nnz` (not the `nnz` of the previous code, which was renamed to `weightSum`) that counts the number of non-zero values in each dimension.
      Use this new `nnz` instead of `weightSum` when calculating min/max, which fixes several numerical errors in some extreme cases.
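      
      A minimal sketch of why an exact per-dimension count matters for min/max (this is not the actual MLlib online summarizer code; the class and the sample-count check below are illustrative):
      
      ```scala
      // Toy online summarizer tracking per-dimension min/max and non-zero counts.
      class ToySummarizer(n: Int) {
        private val currMin = Array.fill(n)(Double.MaxValue)
        private val currMax = Array.fill(n)(Double.MinValue)
        private val nnz = Array.fill(n)(0L) // exact non-zero count per dimension
        private var totalCount = 0L
      
        def add(values: Array[Double]): this.type = {
          var i = 0
          while (i < n) {
            val v = values(i)
            if (v != 0.0) {
              if (v < currMin(i)) currMin(i) = v
              if (v > currMax(i)) currMax(i) = v
              nnz(i) += 1L
            }
            i += 1
          }
          totalCount += 1L
          this
        }
      
        // If a dimension saw any zero entries, zero must compete in min/max. Comparing the
        // exact integer count nnz(i) against totalCount avoids the floating-point round-off
        // that comparing sums of weights can accumulate in extreme cases.
        def max: Array[Double] =
          Array.tabulate(n)(i => if (nnz(i) < totalCount && currMax(i) < 0.0) 0.0 else currMax(i))
        def min: Array[Double] =
          Array.tabulate(n)(i => if (nnz(i) < totalCount && currMin(i) > 0.0) 0.0 else currMin(i))
      }
      ```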
      
      ## How was this patch tested?
      
      A new testcase added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14216 from WeichenXu123/multivarOnlineSummary.
      25db5167
  2. Jul 22, 2016
    • [SPARK-16622][SQL] Fix NullPointerException when the returned value of the... · e10b8741
      Liang-Chi Hsieh authored
      [SPARK-16622][SQL] Fix NullPointerException when the returned value of the called method in Invoke is null
      
      ## What changes were proposed in this pull request?
      
      Currently we don't check the value returned by the called method in `Invoke`. When the returned value is null and is assigned to a variable of a primitive type, a `NullPointerException` will be thrown.
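      
      A small illustration of the failure mode in plain Scala (this is not Catalyst's generated code; the null check is only analogous to the one added in `Invoke`):
      
      ```scala
      object InvokeNullExample {
        // Simulates a called method whose boxed return value may be null.
        def lookup(key: String): java.lang.Integer =
          if (key == "known") Integer.valueOf(1) else null
      
        def main(args: Array[String]): Unit = {
          val boxed = lookup("unknown")
          // val unchecked: Int = boxed        // unboxing null throws NullPointerException
          val checked: Int = if (boxed == null) 0 else boxed.intValue()
          println(checked)                     // prints 0 instead of throwing
        }
      }
      ```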
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #14259 from viirya/agg-empty-ds.
      e10b8741
    • [SPARK-16651][PYSPARK][DOC] Make `withColumnRenamed/drop` description more... · 47f5b88d
      Dongjoon Hyun authored
      [SPARK-16651][PYSPARK][DOC] Make `withColumnRenamed/drop` description more consistent with Scala API
      
      ## What changes were proposed in this pull request?
      
      `withColumnRenamed` and `drop` are no-ops if the given column name does not exist. The Python documentation already describes that, but this PR adds a more explicit line, consistent with the Scala API, to reduce ambiguity.
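      
      For reference, a quick Scala illustration of the no-op behavior being documented (assuming a `SparkSession` named `spark` is available):
      
      ```scala
      val df = spark.range(3).toDF("id")
      
      // Neither call fails when the column does not exist; the DataFrame is returned unchanged.
      df.withColumnRenamed("no_such_col", "renamed").columns   // Array(id)
      df.drop("no_such_col").columns                           // Array(id)
      ```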
      
      ## How was this patch tested?
      
      It's about docs.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14288 from dongjoon-hyun/SPARK-16651.
      47f5b88d
    • [SPARK-16650] Improve documentation of spark.task.maxFailures · 6c56fff1
      Tom Graves authored
      Clarify documentation on spark.task.maxFailures
      
      No tests run, as it is a documentation-only change.
      
      Author: Tom Graves <tgraves@yahoo-inc.com>
      
      Closes #14287 from tgravescs/SPARK-16650.
      6c56fff1
    • [GIT] add pydev & Rstudio project file to gitignore list · b4e16bd5
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Add PyDev & RStudio project files to the gitignore list; I think these two IDEs are used by many developers,
      so this avoids the need for a personal gitignore_global config.
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14293 from WeichenXu123/update_gitignore.
      b4e16bd5
    • [SPARK-16487][STREAMING] Fix some batches might not get marked as fully processed in JobGenerator · 2c72a443
      Ahmed Mahran authored
      ## What changes were proposed in this pull request?
      
      In `JobGenerator`, the code reads as if some batches might not get marked as fully processed. In the following flowchart, the batch should get marked as fully processed before endpoint C; however, it is not. Currently, this does not actually cause an issue, because the condition `(time - zeroTime) is multiple of checkpoint duration?` always evaluates to `true`, since the `checkpoint duration` is always set equal to the `batch duration`.
      
      ![Flowchart](https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png)
      
      This PR fixes this so as to improve code readability and to avoid potential problems in case a future change sets the checkpoint duration to differ from the batch duration.
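      
      A small sketch of why the quoted condition currently always holds (the values are illustrative; durations in milliseconds):
      
      ```scala
      // Batch times are generated as zeroTime + k * batchDuration, and the checkpoint
      // duration is currently always equal to the batch duration, so the modulo check
      // is trivially true for every batch.
      val zeroTime = 0L
      val batchDuration = 1000L
      val checkpointDuration = batchDuration
      
      def shouldCheckpoint(batchIndex: Long): Boolean = {
        val time = zeroTime + batchIndex * batchDuration
        (time - zeroTime) % checkpointDuration == 0
      }
      
      (0L until 5L).map(shouldCheckpoint)   // Vector(true, true, true, true, true)
      ```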
      
      Author: Ahmed Mahran <ahmed.mahran@mashin.io>
      
      Closes #14145 from ahmed-mahran/b-mark-batch-fully-processed.
      2c72a443
    • [SPARK-16287][HOTFIX][BUILD][SQL] Fix annotation argument needs to be a constant · e1bd70f4
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      Build fix for [SPARK-16287][SQL] Implement str_to_map SQL function, which introduced this compilation error:
      
      ```
      /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala:402: error: annotation argument needs to be a constant; found: "_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map after splitting the text ".+("into key/value pairs using delimiters. ").+("Default delimiters are \',\' for pairDelim and \':\' for keyValueDelim.")
          "into key/value pairs using delimiters. " +
                                                    ^
      ```
      
      ## How was this patch tested?
      
      Local build
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #14315 from jaceklaskowski/build-fix-complexTypeCreator.
      e1bd70f4
    • [SPARK-16556][SPARK-16559][SQL] Fix Two Bugs in Bucket Specification · 94f14b52
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      **Issue 1: The Bucket Specification Is Silently Ignored When Creating a Table Using Schema Inference**
      
      When creating a data source table without explicit specification of schema or SELECT clause, we silently ignore the bucket specification (CLUSTERED BY... SORTED BY...) in [the code](https://github.com/apache/spark/blob/ce3b98bae28af72299722f56e4e4ef831f471ec0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala#L339-L354).
      
      For example,
      ```SQL
      CREATE TABLE jsonTable
      USING org.apache.spark.sql.json
      OPTIONS (
        path '${tempDir.getCanonicalPath}'
      )
      CLUSTERED BY (inexistentColumnA) SORTED BY (inexistentColumnB) INTO 2 BUCKETS
      ```
      
      This PR captures it and issues an error message.
      
      **Issue 2: A run-time `java.lang.ArithmeticException` when the number of buckets is set to zero.**
      
      For example,
      ```SQL
      CREATE TABLE t USING PARQUET
      OPTIONS (PATH '${path.toString}')
      CLUSTERED BY (a) SORTED BY (b) INTO 0 BUCKETS
      AS SELECT 1 AS a, 2 AS b
      ```
      The exception we got is
      ```
      ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1.0 (TID 2)
      java.lang.ArithmeticException: / by zero
      ```
      
      This PR captures the misuse and issues an appropriate error message.
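      
      A minimal sketch of the kind of validation this PR adds (the case class mirrors Spark's bucketing spec, but the checks and messages below are illustrative rather than the exact internals):
      
      ```scala
      case class BucketSpec(numBuckets: Int, bucketColumnNames: Seq[String], sortColumnNames: Seq[String])
      
      def validateBucketSpec(spec: Option[BucketSpec], hasUserSpecifiedSchema: Boolean): Unit =
        spec.foreach { s =>
          if (!hasUserSpecifiedSchema) {
            // Issue 1: fail instead of silently dropping CLUSTERED BY ... SORTED BY ...
            throw new IllegalArgumentException(
              "Cannot specify bucketing information when the table schema is not given and will be inferred.")
          }
          if (s.numBuckets <= 0) {
            // Issue 2: reject INTO 0 BUCKETS up front instead of dividing by zero at run time.
            throw new IllegalArgumentException(s"Expected a positive number of buckets, but got ${s.numBuckets}.")
          }
        }
      ```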
      
      ### How was this patch tested?
      Added a test case in DDLSuite
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14210 from gatorsmile/createTableWithoutSchema.
      94f14b52
  3. Jul 21, 2016
  4. Jul 20, 2016
    • [SPARK-16644][SQL] Aggregate should not propagate constraints containing aggregate expressions · cfa5ae84
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Aggregate expressions can only be executed inside `Aggregate`. If we propagate them upwards as constraints, the parent operator cannot execute them and will fail at runtime.
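      
      A toy model of the rule (this is not Catalyst code; the tiny expression ADT below is purely illustrative):
      
      ```scala
      sealed trait Expr
      case class Attr(name: String) extends Expr
      case class AggExpr(fn: String, child: Expr) extends Expr       // e.g. max(a)
      case class GreaterThan(left: Expr, right: Expr) extends Expr
      
      def containsAggregate(e: Expr): Boolean = e match {
        case AggExpr(_, _)     => true
        case GreaterThan(l, r) => containsAggregate(l) || containsAggregate(r)
        case Attr(_)           => false
      }
      
      // Only aggregate-free predicates may be propagated to the parent operator;
      // something like `max(a) > 0` must stay inside the Aggregate that computes it.
      def propagatedConstraints(constraints: Set[Expr]): Set[Expr] =
        constraints.filterNot(containsAggregate)
      ```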
      
      ## How was this patch tested?
      
      new test in SQLQuerySuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #14281 from cloud-fan/bug.
      cfa5ae84
    • [SPARK-16272][CORE] Allow config values to reference conf, env, system props. · 75a06aa2
      Marcelo Vanzin authored
      This allows configuration to be more flexible, for example, when the cluster does
      not have a homogeneous configuration (e.g. packages are installed on different
      paths in different nodes). By allowing one to reference the environment from
      the conf, it becomes possible to work around those in certain cases.
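      
      An illustrative use of the new reference support (the `${env:...}` and `${system:...}` syntax shown here is an assumption about the reference format, not something stated in this commit message):
      
      ```scala
      import org.apache.spark.SparkConf
      
      // Point config values at an environment variable and a system property, so the same
      // configuration works on nodes where packages are installed in different paths.
      val conf = new SparkConf()
        .set("spark.sql.hive.metastore.jars", "${env:HIVE_HOME}/lib/*")
        .set("spark.driver.extraJavaOptions", "-Dlog.dir=${system:user.dir}/logs")
      ```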
      
      As part of the implementation, ConfigEntry now keeps track of all "known" configs
      (i.e. those created through the use of ConfigBuilder), since that list is used
      by the resolution code. This duplicates some code in SQLConf, which could potentially
      be merged with this now. It will also make it simpler to implement some missing
      features such as filtering which configs show up in the UI or in event logs - which
      are not part of this change.
      
      Another change is in the way ConfigEntry reads config data; it now takes a string
      map and a function that reads env variables, so that it can be called both from
      SparkConf and SQLConf. This makes it so both places follow the same read path,
      instead of having to replicate certain logic in SQLConf. There are still a
      couple of methods in SQLConf that peek into fields of ConfigEntry directly,
      though.
      
      Tested via unit tests, and by using the new variable expansion functionality
      in a shell session with a custom spark.sql.hive.metastore.jars value.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #14022 from vanzin/SPARK-16272.
      75a06aa2
    • [SPARK-16344][SQL] Decoding Parquet array of struct with a single field named "element" · e651900b
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      Due to backward-compatibility reasons, the following Parquet schema is ambiguous:
      
      ```
      optional group f (LIST) {
        repeated group list {
          optional group element {
            optional int32 element;
          }
        }
      }
      ```
      
      According to the parquet-format spec, when interpreted as a standard 3-level layout, this type is equivalent to the following SQL type:
      
      ```
      ARRAY<STRUCT<element: INT>>
      ```
      
      However, when interpreted as a legacy 2-level layout, it's equivalent to
      
      ```
      ARRAY<STRUCT<element: STRUCT<element: INT>>>
      ```
      
      Historically, to disambiguate these cases, we employed two methods:
      
      - `ParquetSchemaConverter.isElementType()`
      
        Used to disambiguate the above cases while converting Parquet types to Spark types.
      
      - `ParquetRowConverter.isElementType()`
      
        Used to disambiguate the above cases while instantiating row converters that convert Parquet records to Spark rows.
      
      Unfortunately, these two methods make different decisions about the above problematic Parquet type, which caused SPARK-16344.
      
      `ParquetRowConverter.isElementType()` is necessary for Spark 1.4 and earlier versions because Parquet requested schemata are directly converted from Spark schemata in these versions. The converted Parquet schemata may be incompatible with actual schemata of the underlying physical files when the files are written by a system/library that uses a schema conversion scheme that is different from Spark when writing Parquet LIST and MAP fields.
      
      In Spark 1.5, Parquet requested schemata are always properly tailored from the schemata of the physical files to be read. Thus `ParquetRowConverter.isElementType()` is no longer necessary. This PR replaces this method with a simple yet accurate scheme: whenever an ambiguous Parquet type is hit, convert the type in question back to a Spark type using `ParquetSchemaConverter` and check whether it matches the corresponding Spark type.
      
      ## How was this patch tested?
      
      New test cases added in `ParquetHiveCompatibilitySuite` and `ParquetQuerySuite`.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #14014 from liancheng/spark-16344-for-master-and-2.0.
      e651900b
    • [SPARK-16634][SQL] Workaround JVM bug by moving some code out of ctor. · e3cd5b30
      Marcelo Vanzin authored
      Some 1.7 JVMs have a bug that is triggered by certain Scala-generated
      bytecode. GenericArrayData suffers from that and fails to load in certain
      JVMs.
      
      Moving the offending code out of the constructor and into a helper method
      avoids the issue.
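      
      The general shape of the workaround, sketched with an illustrative class (this is not the actual `GenericArrayData` code):
      
      ```scala
      // Keep the auxiliary constructor trivial and move the collection-handling code
      // (whose Scala-generated bytecode trips the JVM bug) into a companion helper.
      class ArrayWrapper(val values: Array[Any]) {
        def this(seq: Seq[Any]) = this(ArrayWrapper.toArray(seq))
      }
      
      object ArrayWrapper {
        private def toArray(seq: Seq[Any]): Array[Any] = seq.toArray
      }
      ```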
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #14271 from vanzin/SPARK-16634.
      e3cd5b30
    • [SPARK-15951] Change Executors Page to use datatables to support sorting columns and searching · b9bab4dc
      Kishor Patil authored
      1. Create the executorspage-template.html for displaying application information in DataTables.
      2. Added a REST API endpoint "allexecutors" to be able to see all executors created for a particular job.
      3. The executorspage.js uses jQuery to access the data from the /api/v1/applications/appid/allexecutors REST API, and uses DataTables to display executors for the application. It also generates a summary of dead/live and total executors created during the life of the application.
      4. Similar changes are applicable to the Executors Page on the history server for a given application.
      
      Snapshots of how it looks now:
      <img width="938" alt="screen shot 2016-06-14 at 2 45 44 pm" src="https://cloud.githubusercontent.com/assets/6090397/16060092/ad1de03a-324b-11e6-8469-9eaa3f2548b5.png">
      
      New Executors Page screenshot looks like this:
      <img width="1436" alt="screen shot 2016-06-15 at 10 12 01 am" src="https://cloud.githubusercontent.com/assets/6090397/16085514/ee7004f0-32e1-11e6-9340-33d91e407f2b.png">
      
      Author: Kishor Patil <kpatil@yahoo-inc.com>
      
      Closes #13670 from kishorvpatil/execTemplates.
      b9bab4dc
    • [SPARK-16613][CORE] RDD.pipe returns values for empty partitions · 4b079dc3
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Document RDD.pipe semantics; don't execute process for empty input partitions.
      
      Note this includes the fix in https://github.com/apache/spark/pull/14256 because it's necessary to even test this. One or the other will merge the fix.
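      
      A toy sketch of the per-partition decision (this is not the `PipedRDD` implementation; command handling is simplified):
      
      ```scala
      import java.io.ByteArrayInputStream
      import scala.sys.process._
      
      // For an empty partition, return no output and never launch the external process;
      // previously a command such as `wc -l` would still run and emit a line even with no input.
      def pipePartition(lines: Iterator[String], command: Seq[String]): Iterator[String] =
        if (lines.isEmpty) Iterator.empty
        else {
          val input = new ByteArrayInputStream((lines.mkString("\n") + "\n").getBytes("UTF-8"))
          (Process(command) #< input).lineStream.iterator
        }
      ```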
      
      ## How was this patch tested?
      
      Jenkins tests including new test.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14260 from srowen/SPARK-16613.
      4b079dc3
    • [SPARK-15923][YARN] Spark Application rest api returns 'no such app: … · 95abbe53
      Weiqing Yang authored
      [SPARK-15923][YARN] Spark Application rest api returns 'no such app: <appId>'
      
      ## What changes were proposed in this pull request?
      Update monitoring.md.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #14163 from Sherry302/master.
      95abbe53
    • [SPARK-16440][MLLIB] Destroy broadcasted variables even on driver · 0dc79ffd
      Anthony Truchet authored
      ## What changes were proposed in this pull request?
      Broadcast variables that had been left around were unpersisted in a previous PR (#14153). This PR turns those `unpersist()` calls into `destroy()` so that memory is freed even on the driver.
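      
      A quick illustration of the difference (a fragment, assuming a `Broadcast` variable `bc` obtained from `sc.broadcast(...)`):
      
      ```scala
      // unpersist() only removes the copies cached on executors; driver-side memory is kept
      // and the value can be re-broadcast later.
      bc.unpersist(blocking = false)
      
      // destroy() removes all data and metadata for the broadcast, including on the driver,
      // so the memory is actually freed -- but the variable can no longer be used afterwards.
      bc.destroy()
      ```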
      
      ## How was this patch tested?
      Unit Tests in Word2VecSuite were run locally.
      
      This contribution is done on behalf of Criteo, according to the
      terms of the Apache license 2.0.
      
      Author: Anthony Truchet <a.truchet@criteo.com>
      
      Closes #14268 from AnthonyTruchet/SPARK-16440.
      0dc79ffd
    • [SPARK-16632][SQL] Respect Hive schema when merging parquet schema. · 75146be6
      Marcelo Vanzin authored
      When Hive (or at least certain versions of Hive) creates parquet files
      containing tinyint or smallint columns, it stores them as int32, but
      doesn't annotate the parquet field as containing the corresponding
      int8 / int16 data. When Spark reads those files using the vectorized
      reader, it follows the parquet schema for these fields, but when
      actually reading the data it tries to use the type fetched from
      the metastore, and then fails because data has been loaded into the
      wrong fields in OnHeapColumnVector.
      
      So instead of blindly trusting the parquet schema, check whether the
      Catalyst-provided schema disagrees with it, and adjust the types so
      that the necessary metadata is present when loading the data into
      the ColumnVector instance.
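      
      A toy sketch of the reconciliation idea using public Spark SQL types (not the actual schema-merging code):
      
      ```scala
      import org.apache.spark.sql.types._
      
      // When Parquet only records INT32 but the Hive metastore says tinyint/smallint,
      // prefer the metastore (Catalyst) type so the data lands in the right ColumnVector field.
      def reconcile(parquetType: DataType, catalystType: DataType): DataType =
        (parquetType, catalystType) match {
          case (IntegerType, ByteType)  => ByteType   // Hive tinyint stored as int32
          case (IntegerType, ShortType) => ShortType  // Hive smallint stored as int32
          case _                        => parquetType
        }
      ```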
      
      Tested with unit tests and with tests that create byte / short columns
      in Hive and try to read them from Spark.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #14272 from vanzin/SPARK-16632.
      75146be6
  5. Jul 19, 2016
    • [SPARK-10683][SPARK-16510][SPARKR] Move SparkR include jar test to SparkSubmitSuite · fc232636
      Shivaram Venkataraman authored
      ## What changes were proposed in this pull request?
      
      This change moves the include jar test from R to SparkSubmitSuite and uses a dynamically compiled jar. This helps us remove the binary jar from the R package and solves both the CRAN warnings and the lack of source being available for this jar.
      
      ## How was this patch tested?
      SparkR unit tests, SparkSubmitSuite, check-cran.sh
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #14243 from shivaram/sparkr-jar-move.
      fc232636
    • [SPARK-16568][SQL][DOCUMENTATION] update sql programming guide refreshTable API in python code · 9674af6f
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Update the `refreshTable` API in the Python code of the sql-programming-guide.
      
      This API was added in SPARK-15820.
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14220 from WeichenXu123/update_sql_doc_catalog.
      9674af6f
    • [SPARK-14702] Make environment of SparkLauncher launched process more configurable · 004e29cb
      Andrew Duffy authored
      ## What changes were proposed in this pull request?
      
      Adds a few public methods to `SparkLauncher` to allow configuring some extra features of the `ProcessBuilder`, including the working directory, output and error stream redirection.
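      
      A hypothetical usage sketch (the method names `directory`, `redirectOutput` and `redirectError` are guesses at the new `SparkLauncher` surface described above, not confirmed by this commit message):
      
      ```scala
      import java.io.File
      import java.lang.ProcessBuilder.Redirect
      import org.apache.spark.launcher.SparkLauncher
      
      val process = new SparkLauncher()
        .setAppResource("/path/to/app.jar")
        .setMainClass("com.example.Main")
        .directory(new File("/tmp/spark-work"))                       // working directory of the child process
        .redirectOutput(Redirect.appendTo(new File("/tmp/out.log")))  // stdout redirection
        .redirectError(Redirect.appendTo(new File("/tmp/err.log")))   // stderr redirection
        .launch()
      ```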
      
      ## How was this patch tested?
      
      Unit testing + simple Spark driver programs
      
      Author: Andrew Duffy <root@aduffy.org>
      
      Closes #14201 from andreweduffy/feature/launcher.
      004e29cb
    • [SPARK-15705][SQL] Change the default value of spark.sql.hive.convertMetastoreOrc to false. · 2ae7b88a
      Yin Huai authored
      ## What changes were proposed in this pull request?
      In 2.0, we added new logic to convert HiveTableScan on ORC tables to Spark's native code path. However, during this conversion, we drop the original metastore schema (https://issues.apache.org/jira/browse/SPARK-15705). Because of this regression, I am changing the default value of `spark.sql.hive.convertMetastoreOrc` to false.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #14267 from yhuai/SPARK-15705-changeDefaultValue.
      2ae7b88a
    • [SPARK-16602][SQL] `Nvl` function should support numeric-string cases · 162d04a3
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      The `Nvl` function should support numeric-string cases like Hive/Spark 1.6. Currently, `Nvl` finds the tightest common type among numeric types. This PR extends that to consider the `String` type, too.
      
      ```scala
      - TypeCoercion.findTightestCommonTypeOfTwo(left.dataType, right.dataType).map { dtype =>
      + TypeCoercion.findTightestCommonTypeToString(left.dataType, right.dataType).map { dtype =>
      ```
      
      **Before**
      ```scala
      scala> sql("select nvl('0', 1)").collect()
      org.apache.spark.sql.AnalysisException: cannot resolve `nvl("0", 1)` due to data type mismatch:
      input to function coalesce should all be the same type, but it's [string, int]; line 1 pos 7
      ```
      
      **After**
      ```scala
      scala> sql("select nvl('0', 1)").collect()
      res0: Array[org.apache.spark.sql.Row] = Array([0])
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14251 from dongjoon-hyun/SPARK-16602.
      162d04a3
    • [SPARK-16620][CORE] Add back the tokenization process in `RDD.pipe(command: String)` · 0bd76e87
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      Currently `RDD.pipe(command: String)`:
      - works only when the command is specified without any options, such as `RDD.pipe("wc")`
      - does NOT work when the command is specified with some options, such as `RDD.pipe("wc -l")`
      
      This is a regression from Spark 1.6.
      
      This patch adds back the tokenization process in `RDD.pipe(command: String)` to fix this regression.
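      
      A minimal sketch of the kind of tokenization being restored (I believe `PipedRDD` uses a `java.util.StringTokenizer` for this; the helper below is only illustrative):
      
      ```scala
      import java.util.StringTokenizer
      import scala.collection.mutable.ArrayBuffer
      
      def tokenize(command: String): Seq[String] = {
        val buf = new ArrayBuffer[String]
        val tok = new StringTokenizer(command)
        while (tok.hasMoreTokens) buf += tok.nextToken()
        buf
      }
      
      // tokenize("wc -l") == Seq("wc", "-l"), so RDD.pipe("wc -l") can again delegate to the
      // Seq[String] overload instead of treating "wc -l" as a single executable name.
      ```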
      
      ## How was this patch tested?
      Added a test which:
      - would pass in `1.6`
      - _[prior to this patch]_ would fail in `master`
      - _[after this patch]_ would pass in `master`
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #14256 from lw-lin/rdd-pipe.
      0bd76e87
    • [SPARK-16494][ML] Upgrade breeze version to 0.12 · 67089149
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      breeze 0.12 has been released for more than half a year, and it brings lots of new features, performance improvements and bug fixes.
      One of the biggest features is ```LBFGS-B```, which is an implementation of ```LBFGS``` with box constraints and is much faster for some special cases.
      We would like to implement the Huber loss function for ```LinearRegression``` ([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)) and it requires ```LBFGS-B``` as the optimization solver. So we should bump up the dependent breeze version to 0.12.
      For more features, improvements and bug fixes of breeze 0.12, you can refer to the following link:
      https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c
      
      ## How was this patch tested?
      No new tests, should pass the existing ones.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14150 from yanboliang/spark-16494.
      67089149
    • [SPARK-16478] graphX (added graph caching in strongly connected components) · 5d92326b
      Michał Wesołowski authored
      ## What changes were proposed in this pull request?
      
      I added caching in every iteration for the sccGraph that is returned by strongly connected components. Without this cache, strongly connected components returned a graph that had to be recomputed from scratch once some intermediate caches no longer existed.
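      
      A sketch of the per-iteration caching pattern (assuming GraphX is on the classpath; this is not the actual strongly-connected-components code):
      
      ```scala
      import org.apache.spark.graphx.Graph
      
      // Cache and materialize the working graph every round so that later rounds (and the
      // caller) do not replay the whole lineage once earlier caches have been evicted.
      def iterate[VD, ED](initial: Graph[VD, ED], rounds: Int)(step: Graph[VD, ED] => Graph[VD, ED]): Graph[VD, ED] = {
        var g = initial.cache()
        for (_ <- 0 until rounds) {
          val next = step(g).cache()
          next.vertices.count()            // force materialization under the new cache
          next.edges.count()
          g.unpersist(blocking = false)    // drop the previous iteration's cache
          g = next
        }
        g
      }
      ```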
      
      ## How was this patch tested?
      I tested it by running code similar to the one [on Databricks](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4889410027417133/3634650767364730/3117184429335832/latest.html). Basically, I generated a large graph and computed strongly connected components with the changed code, then simply ran count on the vertices and edges. The count after this update takes a few seconds instead of 20 minutes.
      
      # statement
      contribution is my original work and I license the work to the project under the project's open source license.
      
      Author: Michał Wesołowski <michal.wesolowski@bzwbk.pl>
      
      Closes #14137 from wesolowskim/SPARK-16478.
      5d92326b
    • [SPARK-16395][STREAMING] Fail if too many CheckpointWriteHandlers are queued... · 6c4b9f4b
      Sean Owen authored
      [SPARK-16395][STREAMING] Fail if too many CheckpointWriteHandlers are queued up in the fixed thread pool
      
      ## What changes were proposed in this pull request?
      
      Begin failing if checkpoint writes are unlikely to keep up with storage's ability to write them, in order to fail fast instead of slowly filling memory.
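      
      A toy sketch of the fail-fast mechanism (the pool size and queue capacity are illustrative, not Spark's actual values):
      
      ```scala
      import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit}
      
      // A single-threaded executor with a bounded queue; once the queue is full, further
      // submissions are rejected immediately instead of accumulating in memory.
      val checkpointWriter = new ThreadPoolExecutor(
        1, 1, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue[Runnable](1000),
        new ThreadPoolExecutor.AbortPolicy())   // throws RejectedExecutionException when saturated
      ```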
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14152 from srowen/SPARK-16395.
      6c4b9f4b
    • [SPARK-16600][MLLIB] fix some latex formula syntax error · 8310c074
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      `\partial\x` ==> `\partial x`
      `har{x_i}` ==> `hat{x_i}`
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14246 from WeichenXu123/fix_formular_err.
      8310c074
    • [MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar · 6caa2205
      Ahmed Mahran authored
      ## What changes were proposed in this pull request?
      
      Minor fixes correcting some typos, punctuation, and grammar.
      Adding more anchors for easy navigation.
      Fixing minor issues with code snippets.
      
      ## How was this patch tested?
      
      `jekyll serve`
      
      Author: Ahmed Mahran <ahmed.mahran@mashin.io>
      
      Closes #14234 from ahmed-mahran/b-struct-streaming-docs.
      6caa2205
    • [SPARK-16535][BUILD] In pom.xml, remove groupId which is redundant definition... · 21a6dd2a
      Xin Ren authored
      [SPARK-16535][BUILD] In pom.xml, remove groupId which is redundant definition and inherited from the parent
      
      https://issues.apache.org/jira/browse/SPARK-16535
      
      ## What changes were proposed in this pull request?
      
      When I scanned through the pom.xml of the sub-projects, I found the warning below (screenshot attached):
      ```
      Definition of groupId is redundant, because it's inherited from the parent
      ```
      ![screen shot 2016-07-13 at 3 13 11 pm](https://cloud.githubusercontent.com/assets/3925641/16823121/744f893e-4916-11e6-8a52-042f83b9db4e.png)
      
      I've tried to remove some of the lines with groupId definition, and the build on my local machine is still ok.
      ```
      <groupId>org.apache.spark</groupId>
      ```
      As I just found, `<maven.version>3.3.9</maven.version>` is being used in Spark 2.x, and Maven 3 supports versionless parent elements: Maven 3 will remove the need to specify the parent version in sub-modules. This is great (in Maven 3.1).
      
      ref: http://stackoverflow.com/questions/3157240/maven-3-worth-it/3166762#3166762
      
      ## How was this patch tested?
      
      I've tested by re-building the project, and the build succeeded.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #14189 from keypointt/SPARK-16535.
      21a6dd2a