Commits · 348d7c9a93dd00d3d1859342a8eb0aea2e77f597 · cs525-sp18-g07 / spark

Sep 18, 2015

[SPARK-9808] Remove hash shuffle file consolidation. · 348d7c9a
Reynold Xin authored 9 years ago
```
Author: Reynold Xin <rxin@databricks.com>

Closes #8812 from rxin/SPARK-9808-1.
```
348d7c9a

[SPARK-10449] [SQL] Don't merge decimal types with incompatible precision or scales · 3a22b100

Holden Karau authored 9 years ago

From JIRA: Schema merging should only handle struct fields. But currently we also reconcile decimal precision and scale information.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8634 from holdenk/SPARK-10449-dont-merge-different-precision.

3a22b100

[SPARK-10539] [SQL] Project should not be pushed down through Intersect or Except #8742 · c6f8135e

Yijie Shen authored 9 years ago

Intersect and Except are both set operators, and they use all the columns to compare rows for equality. Pushing their Project parent down changes the relations they are based on, so it is not an equivalent transformation.
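
The non-equivalence can be illustrated with plain Python sets standing in for relations. This is only a sketch: the row values and the `project_id` helper are made up for illustration and are not part of the Spark optimizer.

```python
# Rows are (id, name) tuples; each relation is modelled as a Python set.
left = {(1, "a"), (2, "b")}
right = {(1, "a"), (2, "c")}


def project_id(rows):
    """Hypothetical projection that keeps only the first column of each row."""
    return {(row[0],) for row in rows}


# Project applied above the set operator (the original plan).
intersect_then_project = project_id(left & right)
except_then_project = project_id(left - right)

# Project pushed below the set operator (the invalid rewrite).
project_then_intersect = project_id(left) & project_id(right)
project_then_except = project_id(left) - project_id(right)

print(intersect_then_project, project_then_intersect)
print(except_then_project, project_then_except)
```

Once the `name` column is dropped before the comparison, rows that differed only in `name` become indistinguishable, so both set operators return different results.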

JIRA: https://issues.apache.org/jira/browse/SPARK-10539

I added some comments based on the fix of https://github.com/apache/spark/pull/8742.

Author: Yijie Shen <henry.yijieshen@gmail.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #8823 from yhuai/fix_set_optimization.

c6f8135e

[SPARK-10540] Fixes flaky all-data-type test · 00a2911c

Cheng Lian authored 9 years ago

This PR breaks the original test case into multiple ones (one test case for each data type). In this way, test failure output can be much more readable.

Within each test case, we build a table with two columns: one holds the data type under test, and the other is an "index" column, which is used to sort the DataFrame and work around [SPARK-10591] [1].

[1]: https://issues.apache.org/jira/browse/SPARK-10591

Author: Cheng Lian <lian@databricks.com>

Closes #8768 from liancheng/spark-10540/test-all-data-types.

00a2911c

[SPARK-10615] [PYSPARK] change assertEquals to assertEqual · 35e8ab93

Yanbo Liang authored 9 years ago

Since ```assertEquals``` is deprecated, we need to change ```assertEquals``` to ```assertEqual``` in the existing Python unit tests.
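
A minimal sketch of the supported spelling, using only the standard `unittest` module. The test class and the `run_suite` helper are hypothetical and not taken from the PySpark test suite.

```python
import unittest


class ExampleTest(unittest.TestCase):
    """Hypothetical test case, used only to show the method name."""

    def test_sum(self):
        # assertEqual is the supported method; assertEquals is the deprecated alias.
        self.assertEqual(sum([1, 2, 3]), 6)


def run_suite():
    """Run the example test case and return the unittest.TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_suite()
print(result.testsRun, result.wasSuccessful())
```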

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8814 from yanboliang/spark-10615.

35e8ab93

[SPARK-10451] [SQL] Prevent unnecessary serializations in InMemoryColumnarTableScan · 20fd35df

Yash Datta authored 9 years ago

Many of the fields in InMemoryColumnarTableScan and InMemoryRelation can be made transient.

This reduces my 1000 ms job to about 700 ms. The task size drops from 2.8 MB to ~1300 KB.
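
The effect of excluding fields from serialization can be sketched in Python with `pickle`, as a rough analogue of Scala's `@transient`. The classes and field names below are hypothetical and do not mirror the Spark classes.

```python
import pickle


class NaiveScan:
    """Hypothetical node whose derived field is serialized along with it."""

    def __init__(self, column_names):
        self.column_names = column_names
        # Derived data that the receiver could rebuild on its own.
        self.statistics = [0.0] * 100_000


class LeanScan(NaiveScan):
    """Same node, but the derived field is left out of the pickled state."""

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["statistics"]  # similar in spirit to marking the field transient
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.statistics = None  # to be recomputed by the receiver if needed


full_size = len(pickle.dumps(NaiveScan(["a", "b"])))
lean_size = len(pickle.dumps(LeanScan(["a", "b"])))
print(full_size, lean_size)
```

The trade-off is that the receiving side must be able to rebuild, or do without, whatever was left out.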

Author: Yash Datta <Yash.Datta@guavus.com>

Closes #8604 from saucam/serde.

20fd35df

[SPARK-10684] [SQL] StructType.interpretedOrdering need not be serialized · e3b5d6cb

navis.ryu authored 9 years ago

Kryo fails with buffer overflow even with max value (2G).

```
org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1
Serialization trace:
containsChild (org.apache.spark.sql.catalyst.expressions.BoundReference)
child (org.apache.spark.sql.catalyst.expressions.SortOrder)
array (scala.collection.mutable.ArraySeq)
ordering (org.apache.spark.sql.catalyst.expressions.InterpretedOrdering)
interpretedOrdering (org.apache.spark.sql.types.StructType)
schema (org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema). To avoid this, increase spark.kryoserializer.buffer.max value.
        at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:263)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:240)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
```

Author: navis.ryu <navis@apache.org>

Closes #8808 from navis/SPARK-10684.

e3b5d6cb

Added <code> tag to documentation. · 74d8f7dd
Reynold Xin authored 9 years ago

74d8f7dd

docs/running-on-mesos.md: state default values in default column · 9a56dcdf

Felix Bechstein authored 9 years ago

This PR simply uses the default value column for defaults.

Author: Felix Bechstein <felix.bechstein@otto.de>

Closes #8810 from felixb/fix_mesos_doc.

9a56dcdf

[SPARK-9522] [SQL] SparkSubmit process can not exit if kill application when... · 93c7650a

linweizhong authored 9 years ago

[SPARK-9522] [SQL] SparkSubmit process can not exit if kill application when HiveThriftServer was starting

When we start HiveThriftServer, we start SparkContext first and then HiveServer2. If we kill the application while HiveServer2 is starting, SparkContext stops successfully, but the SparkSubmit process cannot exit.

Author: linweizhong <linweizhong@huawei.com>

Closes #7853 from Sephiroth-Lin/SPARK-9522.

93c7650a

[SPARK-10682] [GRAPHX] Remove Bagel test suites. · d009da2f

Reynold Xin authored 9 years ago

Bagel has been deprecated and we haven't made any changes to it. There is no need to run those tests.

This should speed up tests by 1 min.

Author: Reynold Xin <rxin@databricks.com>

Closes #8807 from rxin/SPARK-10682.

d009da2f

Sep 17, 2015

[SPARK-8518] [ML] Log-linear models for survival analysis · 98f1ea67

Yanbo Liang authored 9 years ago

[Accelerated Failure Time (AFT) model](https://en.wikipedia.org/wiki/Accelerated_failure_time_model) is the most commonly used and easily parallelized method of survival analysis for censored survival data. It is a log-linear model based on the Weibull distribution of the survival time.
Users can refer to the R function [```survreg```](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html) to compare the model and [```predict```](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/predict.survreg.html) to compare the predictions. There are different kinds of model prediction; I have selected only the type ```response```, which is the default in R.
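
For reference, the usual textbook form of the Weibull AFT model is sketched below. The notation is an assumption and is not taken from the commit: $T_i$ is the survival time, $\mathbf{x}_i$ the covariates (with any intercept absorbed into $\boldsymbol{\beta}$), $\sigma$ a scale parameter, and $\epsilon_i$ an error term following the standard extreme value distribution.

```latex
\log T_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \sigma\,\epsilon_i,
\qquad
f(\epsilon) = e^{\epsilon}\exp\!\left(-e^{\epsilon}\right)
```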

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8611 from yanboliang/spark-8518.

98f1ea67

[SPARK-10674] [TESTS] Increase timeouts in SaslIntegrationSuite. · 0f5ef6df

Marcelo Vanzin authored 9 years ago

1s seems to trigger too many times on the jenkins build boxes, so
increase the timeout and cross fingers.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8802 from vanzin/SPARK-10674 and squashes the following commits:

3c93117 [Marcelo Vanzin] Use java 7 syntax.
d667d1b [Marcelo Vanzin] [SPARK-10674] [tests] Increase timeouts in SaslIntegrationSuite.

0f5ef6df

[SPARK-9698] [ML] Add RInteraction transformer for supporting R-style feature interactions · 4fbf3328

Eric Liang authored 9 years ago

This is a pre-req for supporting the ":" operator in the RFormula feature transformer.

Design doc from umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit

mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #7987 from ericl/interaction.

4fbf3328

[SPARK-10657] Remove SCP-based Jenkins log archiving · f1c91155

Josh Rosen authored 9 years ago

As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to use our custom SCP-based mechanism for archiving Jenkins logs on the master machine; this has been superseded by the use of a Jenkins plugin which archives the logs and provides public links to view them.

Per shaneknapp, we should remove this log syncing mechanism if it is no longer necessary; removing the need to SCP from the Jenkins workers to the masters is a desired step as part of some larger Jenkins infra refactoring.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8793 from JoshRosen/remove-jenkins-ssh-to-master.

f1c91155

[SPARK-10394] [ML] Make GBTParams use shared stepSize · 64743870

Yanbo Liang authored 9 years ago

```GBTParams``` currently defines ```stepSize``` as its learning rate.
ML has a shared param class, ```HasStepSize```; ```GBTParams``` can extend it rather than duplicate the implementation.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8552 from yanboliang/spark-10394.

64743870

[SPARK-10639] [SQL] Need to convert UDAF's result from scala to sql type · aad644fb

Yin Huai authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-10639

Author: Yin Huai <yhuai@databricks.com>

Closes #8788 from yhuai/udafConversion.

aad644fb

[SPARK-10650] Clean before building docs · e0dc2bc2

Michael Armbrust authored 9 years ago

The [published docs for 1.5.0](http://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/streaming/) have a bunch of test classes in them. The only way I can reproduce this is to `test:compile` before running `unidoc`. To prevent this from happening again, I've added a clean before doc generation.

Author: Michael Armbrust <michael@databricks.com>

Closes #8787 from marmbrus/testsInDocs.

e0dc2bc2

[SPARK-10531] [CORE] AppId is set as AppName in status rest api · 36d8b278
Jeff Zhang authored 9 years ago
```
Verify it manually.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #8688 from zjffdu/SPARK-10531.
```
36d8b278

[SPARK-10172] [CORE] disable sort in HistoryServer webUI · 81b4db37

Josiah Samuel authored 9 years ago

This pull request is to address the JIRA SPARK-10172 (History Server web UI gets messed up when sorting on any column).
The content of the table gets messed up during sorting because of the rowspan attribute on the table data cells (td).
The table sort library currently used in the Spark UI (sorttable.js) doesn't handle cells (td) with rowspans.
The fix disables table sorting in the web UI when there are jobs listed with multiple attempts.

Author: Josiah Samuel <josiah_sams@in.ibm.com>

Closes #8506 from josiahsams/SPARK-10172.

81b4db37

[SPARK-10642] [PYSPARK] Fix crash when calling rdd.lookup() on tuple keys · 136c77d8

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-10642

When calling `rdd.lookup()` on an RDD with tuple keys, `portable_hash` will return a long. That causes `DAGScheduler.submitJob` to throw `java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer`.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #8796 from viirya/fix-pyrdd-lookup.

136c77d8

[SPARK-10660] Doc describe error in the "Running Spark on YARN" page · c88bb5df

yangping.wu authored 9 years ago

In the Configuration section, the default values of **spark.yarn.driver.memoryOverhead** and **spark.yarn.am.memoryOverhead** should be "driverMemory * 0.10, with minimum of 384" and "AM memory * 0.10, with minimum of 384" respectively, because since Spark 1.4.0 the **MEMORY_OVERHEAD_FACTOR** has been set to 0.10, not 0.07.
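
The documented default can be sketched as a small Python function. The function name is hypothetical, and truncating the product to a whole number of megabytes is an assumption; only the factor and the minimum come from the text above.

```python
MEMORY_OVERHEAD_FACTOR = 0.10  # factor stated in the commit message
MEMORY_OVERHEAD_MIN_MB = 384   # minimum stated in the commit message


def default_memory_overhead_mb(container_memory_mb: int) -> int:
    """Return the default overhead in MB for a driver or AM of the given size.

    Assumption: the scaled value is truncated to an integer before the
    minimum is applied.
    """
    scaled = int(container_memory_mb * MEMORY_OVERHEAD_FACTOR)
    return max(scaled, MEMORY_OVERHEAD_MIN_MB)


small = default_memory_overhead_mb(1024)  # 10% is below the minimum
large = default_memory_overhead_mb(8192)  # 10% is above the minimum
print(small, large)
```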

Author: yangping.wu <wyphao.2007@163.com>

Closes #8797 from 397090770/SparkOnYarnDocError.

c88bb5df

[SPARK-10459] [SQL] Do not need to have ConvertToSafe for PythonUDF · 2a508df2

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-10459

As mentioned in the JIRA, `PythonUDF` actually could process `UnsafeRow`.

Specifically, the rows in `childResults` in `BatchPythonEvaluation` will be projected to a `MutableRow`. So I think we can enable `canProcessUnsafeRows` for `BatchPythonEvaluation` and get rid of the redundant `ConvertToSafe`.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #8616 from viirya/pyudf-unsafe.

2a508df2

[SPARK-10077] [DOCS] [ML] Add package info for java of ml/feature · e51345e1

Holden Karau authored 9 years ago

Should be the same as SPARK-7808 but use Java for the code example.
It would be great to add package doc for `spark.ml.feature`.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8740 from holdenk/SPARK-10077-JAVA-PACKAGE-DOC-FOR-SPARK.ML.FEATURE.

e51345e1

[SPARK-10282] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.recommendation · 268088b8
Yu ISHIKAWA authored 9 years ago
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8692 from yu-iskw/SPARK-10282.
```
268088b8
[SPARK-10274] [MLLIB] Add @since annotation to pyspark.mllib.fpm · c74d38fd
Yu ISHIKAWA authored 9 years ago
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8665 from yu-iskw/SPARK-10274.
```
c74d38fd
[SPARK-10279] [MLLIB] [PYSPARK] [DOCS] Add @since annotation to pyspark.mllib.util · 4a0b56e8
Yu ISHIKAWA authored 9 years ago
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8689 from yu-iskw/SPARK-10279.
```
4a0b56e8
[SPARK-10278] [MLLIB] [PYSPARK] Add @since annotation to pyspark.mllib.tree · 39b44cb5
Yu ISHIKAWA authored 9 years ago
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8685 from yu-iskw/SPARK-10278.
```
39b44cb5
[SPARK-10281] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.clustering · 0ded87a4
Yu ISHIKAWA authored 9 years ago
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8691 from yu-iskw/SPARK-10281.
```
0ded87a4
[SPARK-10283] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.regression · 29bf8aa5
Yu ISHIKAWA authored 9 years ago
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8693 from yu-iskw/SPARK-10283.
```
29bf8aa5
[SPARK-10284] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.tuning · c633ed32
Yu ISHIKAWA authored 9 years ago
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8694 from yu-iskw/SPARK-10284.
```
c633ed32
[MINOR] [CORE] Fixes minor variable name typo · 69c9830d
Cheng Lian authored 9 years ago
```
Author: Cheng Lian <lian@databricks.com>

Closes #8784 from liancheng/typo-fix.
```
69c9830d

Sep 16, 2015

Tiny style fix for d39f15ea. · 49c649fa
Reynold Xin authored 9 years ago

49c649fa

[SPARK-9794] [SQL] Fix datetime parsing in SparkSQL. · d39f15ea

Kevin Cox authored 9 years ago

This fixes https://issues.apache.org/jira/browse/SPARK-9794 by using a real ISO 8601 parser (courtesy of the XML component of the standard Java library).
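
As an illustration of delegating to a standard-library ISO 8601 parser instead of hand-written string handling, here is a Python analogue. It is not the Java code used in the fix, and the `parse_iso8601` wrapper and the sample timestamp are hypothetical.

```python
from datetime import datetime, timedelta


def parse_iso8601(text: str) -> datetime:
    """Parse an ISO 8601 timestamp with the standard library parser."""
    return datetime.fromisoformat(text)


parsed = parse_iso8601("2015-09-16T10:30:00+02:00")
print(parsed.isoformat(), parsed.utcoffset() == timedelta(hours=2))
```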

cc: angelini

Author: Kevin Cox <kevincox@kevincox.ca>

Closes #8396 from kevincox/kevincox-sql-time-parsing.

d39f15ea

[SPARK-10050] [SPARKR] Support collecting data of MapType in DataFrame. · 896edb51

Sun Rui authored 9 years ago

1. Support collecting data of MapType from DataFrame.
2. Support data of MapType in createDataFrame.

Author: Sun Rui <rui.sun@intel.com>

Closes #8711 from sun-rui/SPARK-10050.

896edb51

[SPARK-10589] [WEBUI] Add defense against external site framing · 5dbaf3d3

Sean Owen authored 9 years ago

Set `X-Frame-Options: SAMEORIGIN` to protect against frame-related vulnerabilities.
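
The idea can be sketched as a small WSGI middleware in Python. This is not how the Spark web UI sets the header; `with_frame_protection` and `hello_app` are hypothetical names used only for illustration.

```python
def with_frame_protection(app):
    """Wrap a WSGI app so every response carries X-Frame-Options: SAMEORIGIN."""

    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Drop any existing value, then add the restrictive one.
            kept = [(k, v) for k, v in headers if k.lower() != "x-frame-options"]
            kept.append(("X-Frame-Options", "SAMEORIGIN"))
            return start_response(status, kept, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def hello_app(environ, start_response):
    """Hypothetical application that returns a plain-text body."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


protected_app = with_frame_protection(hello_app)
```

With `SAMEORIGIN`, browsers refuse to render the page inside a frame hosted on a different origin, which is the defense the commit title refers to.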

Author: Sean Owen <sowen@cloudera.com>

Closes #8745 from srowen/SPARK-10589.

5dbaf3d3

[SPARK-10276] [MLLIB] [PYSPARK] Add @since annotation to pyspark.mllib.recommendation · d9b7f3e4
Yu ISHIKAWA authored 9 years ago
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8677 from yu-iskw/SPARK-10276.
```
d9b7f3e4

[SPARK-10511] [BUILD] Reset git repository before packaging source distro · 1894653e

Luciano Resende authored 9 years ago

Calculating the Spark version downloads
Scala and Zinc into the build directory, which
inflates the size of the source distribution.

Resetting the repo before packaging the source
distribution fixes this issue.

Author: Luciano Resende <lresende@apache.org>

Closes #8774 from lresende/spark-10511.

1894653e

[SPARK-10516] [MLLIB] Added values property in DenseVector · 95b6a810
Vinod K C authored 9 years ago
```
Author: Vinod K C <vinod.kc@huawei.com>

Closes #8682 from vinodkc/fix_SPARK-10516.
```
95b6a810

Sep 15, 2015

[SPARK-10595] [ML] [MLLIB] [DOCS] Various ML guide cleanups · b921fe4d

Joseph K. Bradley authored 9 years ago

Various ML guide cleanups.

* ml-guide.md: Make it easier to access the algorithm-specific guides.
* LDA user guide: EM often begins with useless topics, but running longer generally improves them dramatically. E.g., 10 iterations on a Wikipedia dataset produce useless topics, but 50 iterations produce very meaningful topics.
* mllib-feature-extraction.html#elementwiseproduct: “w” parameter should be “scalingVec”
* Clean up Binarizer user guide a little.
* Document in Pipeline that users should not put an instance into the Pipeline in more than 1 place.
* spark.ml Word2Vec user guide: clean up grammar/writing
* Chi Sq Feature Selector docs: Improve text in doc.

CC: mengxr feynmanliang

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8752 from jkbradley/mlguide-fixes-1.5.

b921fe4d