- Aug 17, 2015
-
-
Prayag Chandran authored
Added `@since` tags to mllib.regression. Author: Prayag Chandran <prayagchandran@gmail.com> Closes #7518 from prayagchandran/sinceTags and squashes the following commits:
fa4dda2 [Prayag Chandran] Re-formatting
6c6d584 [Prayag Chandran] Corrected a few tags. Removed a few unnecessary tags
1a0365f [Prayag Chandran] Reformatting and adding a few more tags
89fdb66 [Prayag Chandran] SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression
-
Yanbo Liang authored
Add Python API, user guide and example for ml.feature.ElementwiseProduct. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8061 from yanboliang/SPARK-9768.
-
Cheng Lian authored
[SPARK-9974] [BUILD] [SQL] Makes sure com.twitter:parquet-hadoop-bundle:1.6.0 is in the SBT assembly jar. PR #7967 enables Spark SQL to persist Parquet tables in Hive-compatible format when possible. One of the consequences is that we have to set the input/output classes to `MapredParquetInputFormat`/`MapredParquetOutputFormat`, which rely on com.twitter:parquet-hadoop:1.6.0 bundled with Hive 1.2.1. When loading such a table in Spark SQL, `o.a.h.h.ql.metadata.Table` first loads these input/output format classes, and thus classes in com.twitter:parquet-hadoop:1.6.0. However, the scope of this dependency is defined as "runtime", so it is not packaged into the Spark assembly jar. This results in a `ClassNotFoundException`. The issue can be worked around by asking users to add parquet-hadoop 1.6.0 via the `--driver-class-path` option. However, considering that the Maven build is immune to this problem, I feel it can be confusing and inconvenient for users. So this PR fixes the issue by changing the scope of parquet-hadoop 1.6.0 to "compile". Author: Cheng Lian <lian@databricks.com> Closes #8198 from liancheng/spark-9974/bundle-parquet-1.6.0.
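For illustration, a minimal sbt-style sketch of the scope change described above (the coordinates and build fragment are simplified assumptions; Spark's actual build definition differs):

```scala
// Hypothetical build.sbt fragment, not Spark's actual build.
// Before: "runtime" scope, so the classes are left out of the assembly jar.
libraryDependencies += "com.twitter" % "parquet-hadoop-bundle" % "1.6.0" % "runtime"

// After: default (compile) scope, so the classes end up in the assembly jar and
// MapredParquetInputFormat/MapredParquetOutputFormat can be loaded at runtime.
libraryDependencies += "com.twitter" % "parquet-hadoop-bundle" % "1.6.0"
```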
-
Sameer Abhyankar authored
Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.Samavihome> Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.local> Closes #7729 from sabhyankar/branch_8920.
-
Feynman Liang authored
mengxr jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8255 from feynmanliang/SPARK-10068.
-
Yin Huai authored
https://issues.apache.org/jira/browse/SPARK-9592 #8113 has the fundamental fix. But, if we want to minimize the number of changed lines, we can go with this one. Then, in 1.6, we merge #8113. Author: Yin Huai <yhuai@databricks.com> Closes #8172 from yhuai/lastFix and squashes the following commits:
b28c42a [Yin Huai] Regression test.
af87086 [Yin Huai] Fix last.
-
Yijie Shen authored
JIRA: https://issues.apache.org/jira/browse/SPARK-9526 This PR is a follow-up of #7830, aiming at utilizing randomized tests to reveal more potential bugs in SQL expressions. Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7855 from yjshen/property_check.
-
zsxwing authored
This PR uses `JDBCRDD.getConnector` to load the JDBC driver before creating a connection in `DataFrameReader.jdbc` and `DataFrameWriter.jdbc`. Author: zsxwing <zsxwing@gmail.com> Closes #8232 from zsxwing/SPARK-10036 and squashes the following commits:
adf75de [zsxwing] Add extraOptions to the connection properties
57f59d4 [zsxwing] Load JDBC driver in DataFrameReader.jdbc and DataFrameWriter.jdbc
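The general pattern, as a minimal sketch (plain JDBC, not Spark's actual `JDBCRDD` code; the helper name is made up):

```scala
import java.sql.{Connection, DriverManager}

// Register the driver class on the current classloader before asking
// DriverManager for a connection, so the driver is found even when it was
// only added through an extra classpath option.
def connect(driverClass: String, url: String): Connection = {
  Class.forName(driverClass)        // loads and registers the JDBC driver
  DriverManager.getConnection(url)  // DriverManager can now resolve the URL
}
```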
-
Wenchen Fan authored
This issue has been fixed by https://github.com/apache/spark/pull/8215; this PR adds a regression test for it. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8222 from cloud-fan/minor and squashes the following commits:
0bbfb1c [Wenchen Fan] fix style...
7e2d8d9 [Wenchen Fan] add test
-
Marcelo Vanzin authored
The YARN backend doesn't like it when user code calls `System.exit`, since it cannot know the exit status and thus cannot set an appropriate final status for the application. So, for PySpark, avoid that call and instead throw an exception with the exit code. SparkSubmit handles that exception and exits with the given exit code, while YARN uses the exit code as the failure code for the Spark app. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #7751 from vanzin/SPARK-9416.
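A rough sketch of the idea (the exception and helper names here are assumptions, not the actual change):

```scala
// User code signals its exit status with an exception instead of calling
// System.exit directly; the launcher translates it into a process exit.
class UserAppExitException(val exitCode: Int)
  extends RuntimeException(s"User application exited with code $exitCode")

def runUserApp(userMain: () => Unit): Unit = {
  try {
    userMain()
  } catch {
    case e: UserAppExitException =>
      // A SparkSubmit-like launcher exits with the user's code, while YARN
      // records that code as the application's failure code.
      sys.exit(e.exitCode)
  }
}
```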
-
Rohit Agarwal authored
Author: Rohit Agarwal <rohita@qubole.com> Closes #8153 from mindprince/SPARK-9924.
-
Cheng Lian authored
When inserting data into a `HadoopFsRelation`, if `commitTask()` of the writer container fails, `abortTask()` will be invoked. However, both `commitTask()` and `abortTask()` try to close the output writer(s). The problem is that closing the underlying writers may not be an idempotent operation; e.g., `ParquetRecordWriter.close()` throws an NPE when called twice. Author: Cheng Lian <lian@databricks.com> Closes #8236 from liancheng/spark-7837/double-closing.
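A minimal sketch of the kind of guard that addresses this (not the actual writer container code):

```scala
// Make close() idempotent so that both commitTask() and abortTask() may call
// it without the underlying writer being closed twice.
abstract class CloseOnce {
  private var closed = false

  protected def doClose(): Unit  // the real close, possibly not idempotent

  final def close(): Unit = {
    if (!closed) {
      closed = true  // flip the flag first so a second call is always a no-op
      doClose()
    }
  }
}
```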
-
Feynman Liang authored
mengxr Author: Feynman Liang <fliang@databricks.com> Closes #8206 from feynmanliang/SPARK-9959-arules-java.
-
Calvin Jia authored
Updates the tachyon-client version to the latest release. The main difference between 0.7.0 and 0.7.1 on the client side is support for running Tachyon on the local file system by default. No new non-Tachyon dependencies are added, and no code changes are required since the client API has not changed. Author: Calvin Jia <jia.calvin@gmail.com> Closes #8235 from calvinjia/spark-9199-master.
-
Yu ISHIKAWA authored
### Summary
- Add the `lit` function
- Add the `concat`, `greatest`, and `least` functions

I think we need to improve the `collect` function in order to implement the `struct` function, since `collect` doesn't work with arguments that include a nested `list` variable. It seems that a list passed to `struct` still has `jobj` classes, so it would be better to solve this problem in another issue.

### JIRA
[[SPARK-9871] Add expression functions into SparkR which have a variable parameter - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9871)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8194 from yu-iskw/SPARK-9856.
-
- Aug 16, 2015
-
-
Cheng Lian authored
In case of schema merging, we only handled first level fields when converting Parquet groups to `InternalRow`s. Nested struct fields are not properly handled. For example, the schema of a Parquet file to be read can be:

```
message individual {
  required group f1 {
    optional binary f11 (utf8);
  }
}
```

while the global schema is:

```
message global {
  required group f1 {
    optional binary f11 (utf8);
    optional int32 f12;
  }
}
```

This PR fixes this issue by padding missing fields when creating actual converters. Author: Cheng Lian <lian@databricks.com> Closes #8228 from liancheng/spark-10005/nested-schema-merging.
-
Matei Zaharia authored
The shuffle locality patch made the DAGScheduler aware of shuffle data, but for RDDs that have both narrow and shuffle dependencies, it can cause tasks to be placed based on the shuffle dependency instead of the narrow one. This case is common in iterative join-based algorithms like PageRank and ALS, where one RDD is hash-partitioned and one isn't. Author: Matei Zaharia <matei@databricks.com> Closes #8220 from mateiz/shuffle-loc-fix.
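As an illustration of that case (assuming an existing `SparkContext` named `sc`; the data is made up), the partitioned side of the join contributes a narrow dependency while the other side arrives through a shuffle:

```scala
import org.apache.spark.HashPartitioner

// PageRank-style iteration: `links` is hash-partitioned and cached (narrow
// dependency for the join), while `ranks` is rebuilt through a shuffle each
// iteration. Task placement should follow the cached, narrow parent.
val links = sc.parallelize(Seq((1L, Seq(2L)), (2L, Seq(1L))))
  .partitionBy(new HashPartitioner(4))
  .cache()
var ranks = links.mapValues(_ => 1.0)

for (_ <- 1 to 10) {
  val contribs = links.join(ranks).values.flatMap {
    case (dests, rank) => dests.map(d => (d, rank / dests.size))
  }
  ranks = contribs.reduceByKey(_ + _)
}
```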
-
Sun Rui authored
This is a WIP patch for SPARK-8844, for collecting reviews. The bug is about reading an empty DataFrame: in `readCol()`, `lapply(1:numRows, function(x) {...})` does not take into consideration the case where `numRows = 0`. Will add a unit test case. Author: Sun Rui <rui.sun@intel.com> Closes #7419 from sun-rui/SPARK-8844.
-
Kun Xu authored
The `initialSize` argument of `ColumnBuilder.initialize()` should be the number of rows rather than bytes. However, `InMemoryColumnarTableScan` passes in a byte size, which makes Spark SQL allocate more memory than necessary when building in-memory columnar buffers. Author: Kun Xu <viper_kun@163.com> Closes #8189 from viper-kun/errorSize.
-
- Aug 15, 2015
-
-
Joseph K. Bradley authored
Recently, PySpark ML streaming tests have been flaky, most likely because the batches are not being processed in time. Proposal: replace the use of `_ssc_wait` (which waits for a fixed amount of time) with a method that waits for a fixed amount of time but can terminate early based on a termination condition method. With this, we can extend the waiting period (to make tests less flaky) but also stop early when possible (making tests faster on average, which I verified locally). CC: mengxr tdas freeman-lab Author: Joseph K. Bradley <joseph@databricks.com> Closes #8087 from jkbradley/streaming-ml-tests.
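The waiting helper, as a rough sketch (the actual change is in the Python test utilities; the name and signature here are made up):

```scala
// Wait up to `timeoutMillis`, but return as soon as `condition` holds, so the
// timeout can be generous without making every passing test slow.
def waitUntil(timeoutMillis: Long, pollMillis: Long = 100)(condition: => Boolean): Boolean = {
  val deadline = System.currentTimeMillis() + timeoutMillis
  while (System.currentTimeMillis() < deadline) {
    if (condition) return true
    Thread.sleep(pollMillis)
  }
  condition  // one last check at the deadline
}
```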
-
Wenchen Fan authored
We should skip unresolved `LogicalPlan`s for `PullOutNondeterministic`, as calling `output` on an unresolved `LogicalPlan` will produce a confusing error message. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8203 from cloud-fan/error-msg and squashes the following commits:
1c67ca7 [Wenchen Fan] move test
7593080 [Wenchen Fan] correct error message for aggregate
-
Herman van Hovell authored
Tiny modification to a few comments to make `sbt publishLocal` work again. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #8209 from hvanhovell/SPARK-9980.
-
Davies Liu authored
The BYTE_ARRAY_OFFSET can differ between JVMs with different configurations (for example, with different heap sizes: 24 if the heap is larger than 32G, otherwise 16), so the offset of a UTF8String is not portable; we should handle that during serialization. Author: Davies Liu <davies@databricks.com> Closes #8210 from davies/serialize_utf8string.
-
- Aug 14, 2015
-
-
zc he authored
Author: zc he <farseer90718@gmail.com> Closes #8188 from farseer90718/farseer-patch-1.
-
Reynold Xin authored
This pull request creates a new operator interface that is more similar to traditional database query iterators (with open/close/next/get). These local operators are not currently used anywhere, but will become the basis for SPARK-9983 (local physical operators for query execution). cc zsxwing Author: Reynold Xin <rxin@databricks.com> Closes #8212 from rxin/SPARK-9984.
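A rough sketch of that operator style (the trait and method names are assumptions, not necessarily the exact interface added by this PR):

```scala
// Volcano-style local operator: open, iterate with next()/get(), then close.
trait LocalOperator[T] {
  def open(): Unit     // acquire resources and open child operators
  def next(): Boolean  // advance; returns false when the input is exhausted
  def get(): T         // the current row, valid only after next() returned true
  def close(): Unit    // release resources and close child operators
}

// Typical consumption loop for such an operator.
def drain[T](op: LocalOperator[T])(consume: T => Unit): Unit = {
  op.open()
  try {
    while (op.next()) consume(op.get())
  } finally {
    op.close()
  }
}
```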
-
Yijie Shen authored
This PR enforces dynamic partition column data type requirements by adding analysis rules. JIRA: https://issues.apache.org/jira/browse/SPARK-8887 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #8201 from yjshen/dynamic_partition_columns.
-
Wenchen Fan authored
Also alias the ExtractValue instead of wrapping it with UnresolvedAlias when resolving attributes in LogicalPlan, as this alias will be trimmed if it's unnecessary. Based on #7957 without the changes to mllib, but instead maintaining earlier behavior when using `withColumn` on expressions that already have metadata. Author: Wenchen Fan <cloud0fan@outlook.com> Author: Michael Armbrust <michael@databricks.com> Closes #8215 from marmbrus/pr/7957.
-
Davies Liu authored
Author: Davies Liu <davies@databricks.com> Closes #8219 from davies/fix_typo.
-
Reynold Xin authored
Deprecate NIO ConnectionManager in Spark 1.5.0, before removing it in Spark 1.6.0. Author: Reynold Xin <rxin@databricks.com> Closes #8162 from rxin/SPARK-9934.
-
Yin Huai authored
https://issues.apache.org/jira/browse/SPARK-9949 Author: Yin Huai <yhuai@databricks.com> Closes #8179 from yhuai/SPARK-9949.
-
Tathagata Das authored
When the rate limiter is actually limiting the rate at which data is inserted into the buffer, the synchronized block of `BlockGenerator.addData` stays blocked for a long time. This causes the thread that switches the buffer and generates blocks (synchronized with `addData`) to starve and not generate blocks for seconds. The correct solution is to not block on the rate limiter within the synchronized block while adding data to the buffer. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8204 from tdas/SPARK-9968 and squashes the following commits:
8cbcc1b [Tathagata Das] Removed unused val
a73b645 [Tathagata Das] Reduced time spent within synchronized block
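A minimal sketch of the idea (not the actual `BlockGenerator`; Guava's `RateLimiter` stands in for Spark's internal rate limiter, and the class name is made up):

```scala
import scala.collection.mutable.ArrayBuffer
import com.google.common.util.concurrent.RateLimiter

class BufferSketch(ratePerSecond: Double) {
  private val limiter = RateLimiter.create(ratePerSecond)
  private var currentBuffer = new ArrayBuffer[Any]

  def addData(data: Any): Unit = {
    limiter.acquire()        // potentially long wait, kept outside the lock
    synchronized {
      currentBuffer += data  // the lock is held only for the cheap append
    }
  }

  def swapBuffer(): ArrayBuffer[Any] = synchronized {
    val old = currentBuffer  // the block-generating thread is no longer starved
    currentBuffer = new ArrayBuffer[Any]
    old
  }
}
```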
-
Tathagata Das authored
1. The rate estimator should not estimate any rate when there are no records in the batch, as there is no data to estimate the rate from. In the current state, it estimates and sets the rate to zero, which is incorrect.
2. The rate estimator should never set the rate to zero under any circumstances. Otherwise the system will stop receiving data and stop generating useful estimates (see reason 1).

So the fix is to define a parameter that sets a lower bound on the estimated rate, so that the system always receives some data. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8199 from tdas/SPARK-9966 and squashes the following commits:
829f793 [Tathagata Das] Fixed unit test and added comments
3a994db [Tathagata Das] Added min rate and updated tests in PIDRateEstimator
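Both rules fit in a small sketch (names and signature are illustrative; the real change lives in `PIDRateEstimator`):

```scala
// Skip estimation for empty batches and never publish a rate below the floor.
def boundedRate(numElements: Long, estimated: Double, minRate: Double,
    previous: Option[Double]): Option[Double] = {
  if (numElements == 0) previous           // no data: keep the previous estimate
  else Some(math.max(estimated, minRate))  // clamp to the configured lower bound
}
```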
-
Wenchen Fan authored
This bug is caused by an incorrect column-existence check in `__getitem__` of the PySpark DataFrame. `DataFrame.apply` accepts not only top-level column names but also nested column names like `a.b`, so we should remove that check from `__getitem__`. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8202 from cloud-fan/nested.
-
Joseph K. Bradley authored
Also added a unit test for the integration between StringIndexerModel and IndexToString. CC: holdenk We realized we should have left in your unit test (to catch the issue with removing the inverse() method), so this adds it back. mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8211 from jkbradley/stridx-labels.
-
Davies Liu authored
Author: Davies Liu <davies@databricks.com> Closes #8213 from davies/fix_window.
-
jerryshao authored
A detailed exception log can be seen in [SPARK-9877](https://issues.apache.org/jira/browse/SPARK-9877); the problem is that when `StandaloneRestServer` is created, `self` (`masterEndpoint`) is null. So the fix is to create `StandaloneRestServer` only when `self` is available. Author: jerryshao <sshao@hortonworks.com> Closes #8127 from jerryshao/SPARK-9877.
-
Andrew Or authored
In these tests, we use a custom listener and we assert on fields in the stage / task completion events. However, these events are posted in a separate thread so they're not guaranteed to be posted in time. This commit fixes this flakiness through a job end registration callback. Author: Andrew Or <andrew@databricks.com> Closes #8176 from andrewor14/fix-accumulator-suite.
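A sketch of the technique (assuming an existing `SparkContext` named `sc`; the test body is simplified):

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

// Release a latch when the job-end event arrives, so assertions on completion
// events only run after the listener bus has delivered them.
val jobDone = new CountDownLatch(1)
sc.addSparkListener(new SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = jobDone.countDown()
})

sc.parallelize(1 to 100).count()
assert(jobDone.await(10, TimeUnit.SECONDS), "did not receive the job-end event in time")
// Now it is safe to assert on fields collected from stage/task completion events.
```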
-
Carson Wang authored
When a stage failed and another stage was resubmitted with only part of its partitions to compute, all the tasks failed with the error message `java.util.NoSuchElementException: key not found: peakExecutionMemory`. This is because the internal accumulators are not properly initialized for this stage, while other code assumes the internal accumulators always exist. Author: Carson Wang <carson.wang@intel.com> Closes #8090 from carsonwang/SPARK-9809.
-
MechCoder authored
Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8110 from MechCoder/spark-9828.
-
Andrew Or authored
We can do this now that SPARK-9580 is resolved. Author: Andrew Or <andrew@databricks.com> Closes #8208 from andrewor14/reenable-sql-tests.
-