Commits · c88bb5df94f9696677c3a429472114bc66f32a52 · cs525-sp18-g07 / spark

Sep 17, 2015

[SPARK-10660] Doc describe error in the "Running Spark on YARN" page · c88bb5df

yangping.wu authored 9 years ago

In the Configuration section, the **spark.yarn.driver.memoryOverhead** and **spark.yarn.am.memoryOverhead**‘s default value should be "driverMemory * 0.10, with minimum of 384" and "AM memory * 0.10, with minimum of 384" respectively. Because from Spark 1.4.0, the **MEMORY_OVERHEAD_FACTOR** is set to 0.1.0, not 0.07.

Author: yangping.wu <wyphao.2007@163.com>

Closes #8797 from 397090770/SparkOnYarnDocError.

c88bb5df

[SPARK-10459] [SQL] Do not need to have ConvertToSafe for PythonUDF · 2a508df2

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-10459

As mentioned in the JIRA, `PythonUDF` actually could process `UnsafeRow`.

Specially, the rows in `childResults` in `BatchPythonEvaluation` will be projected to a `MutableRow`. So I think we can enable `canProcessUnsafeRows` for `BatchPythonEvaluation` and get rid of redundant `ConvertToSafe`.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #8616 from viirya/pyudf-unsafe.

2a508df2

[SPARK-10077] [DOCS] [ML] Add package info for java of ml/feature · e51345e1

Holden Karau authored 9 years ago

Should be the same as SPARK-7808 but use Java for the code example.
It would be great to add package doc for `spark.ml.feature`.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8740 from holdenk/SPARK-10077-JAVA-PACKAGE-DOC-FOR-SPARK.ML.FEATURE.

e51345e1

[SPARK-10282] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.recommendation · 268088b8
Yu ISHIKAWA authored 9 years ago
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8692 from yu-iskw/SPARK-10282.
```
268088b8
[SPARK-10274] [MLLIB] Add @since annotation to pyspark.mllib.fpm · c74d38fd
Yu ISHIKAWA authored 9 years ago
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8665 from yu-iskw/SPARK-10274.
```
c74d38fd
[SPARK-10279] [MLLIB] [PYSPARK] [DOCS] Add @since annotation to pyspark.mllib.util · 4a0b56e8
Yu ISHIKAWA authored 9 years ago
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8689 from yu-iskw/SPARK-10279.
```
4a0b56e8
[SPARK-10278] [MLLIB] [PYSPARK] Add @since annotation to pyspark.mllib.tree · 39b44cb5
Yu ISHIKAWA authored 9 years ago
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8685 from yu-iskw/SPARK-10278.
```
39b44cb5
[SPARK-10281] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.clustering · 0ded87a4
Yu ISHIKAWA authored 9 years ago
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8691 from yu-iskw/SPARK-10281.
```
0ded87a4
[SPARK-10283] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.regression · 29bf8aa5
Yu ISHIKAWA authored 9 years ago
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8693 from yu-iskw/SPARK-10283.
```
29bf8aa5
[SPARK-10284] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.tuning · c633ed32
Yu ISHIKAWA authored 9 years ago
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8694 from yu-iskw/SPARK-10284.
```
c633ed32
[MINOR] [CORE] Fixes minor variable name typo · 69c9830d
Cheng Lian authored 9 years ago
```
Author: Cheng Lian <lian@databricks.com>

Closes #8784 from liancheng/typo-fix.
```
69c9830d

Sep 16, 2015

Tiny style fix for d39f15ea . · 49c649fa
Reynold Xin authored 9 years ago

49c649fa

[SPARK-9794] [SQL] Fix datetime parsing in SparkSQL. · d39f15ea

Kevin Cox authored 9 years ago

This fixes https://issues.apache.org/jira/browse/SPARK-9794 by using a real ISO8601 parser. (courtesy of the xml component of the standard java library)

cc: angelini

Author: Kevin Cox <kevincox@kevincox.ca>

Closes #8396 from kevincox/kevincox-sql-time-parsing.

d39f15ea

[SPARK-10050] [SPARKR] Support collecting data of MapType in DataFrame. · 896edb51

Sun Rui authored 9 years ago

1. Support collecting data of MapType from DataFrame.
2. Support data of MapType in createDataFrame.

Author: Sun Rui <rui.sun@intel.com>

Closes #8711 from sun-rui/SPARK-10050.

896edb51

[SPARK-10589] [WEBUI] Add defense against external site framing · 5dbaf3d3

Sean Owen authored 9 years ago

Set `X-Frame-Options: SAMEORIGIN` to protect against frame-related vulnerability

Author: Sean Owen <sowen@cloudera.com>

Closes #8745 from srowen/SPARK-10589.

5dbaf3d3

[SPARK-10276] [MLLIB] [PYSPARK] Add @since annotation to pyspark.mllib.recommendation · d9b7f3e4
Yu ISHIKAWA authored 9 years ago
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8677 from yu-iskw/SPARK-10276.
```
d9b7f3e4

[SPARK-10511] [BUILD] Reset git repository before packaging source distro · 1894653e

Luciano Resende authored 9 years ago

The calculation of Spark version is downloading
Scala and Zinc in the build directory which is
inflating the size of the source distribution.

Reseting the repo before packaging the source
distribution fix this issue.

Author: Luciano Resende <lresende@apache.org>

Closes #8774 from lresende/spark-10511.

1894653e

[SPARK-10516] [ MLLIB] Added values property in DenseVector · 95b6a810
Vinod K C authored 9 years ago
```
Author: Vinod K C <vinod.kc@huawei.com>

Closes #8682 from vinodkc/fix_SPARK-10516.
```
95b6a810

Sep 15, 2015

[SPARK-10595] [ML] [MLLIB] [DOCS] Various ML guide cleanups · b921fe4d

Joseph K. Bradley authored 9 years ago

Various ML guide cleanups.

* ml-guide.md: Make it easier to access the algorithm-specific guides.
* LDA user guide: EM often begins with useless topics, but running longer generally improves them dramatically.  E.g., 10 iterations on a Wikipedia dataset produces useless topics, but 50 iterations produces very meaningful topics.
* mllib-feature-extraction.html#elementwiseproduct: “w” parameter should be “scalingVec”
* Clean up Binarizer user guide a little.
* Document in Pipeline that users should not put an instance into the Pipeline in more than 1 place.
* spark.ml Word2Vec user guide: clean up grammar/writing
* Chi Sq Feature Selector docs: Improve text in doc.

CC: mengxr feynmanliang

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8752 from jkbradley/mlguide-fixes-1.5.

b921fe4d

[SPARK-9078] [SQL] Allow jdbc dialects to override the query used to check the table. · 64c29afc

sureshthalamati authored 9 years ago

Current implementation uses query with a LIMIT clause to find if table already exists. This syntax works only in some database systems. This patch changes the default query to the one that is likely to work on most databases, and adds a new method to the JdbcDialect abstract class to allow dialects to override the default query.

I looked at using the JDBC meta data calls, it turns out there is no common way to find the current schema, catalog..etc. There is a new method Connection.getSchema() , but that is available only starting jdk1.7 , and existing jdbc drivers may not have implemented it. Other option was to use jdbc escape syntax clause for LIMIT, not sure on how well this supported in all the databases also. After looking at all the jdbc metadata options my conclusion was most common way is to use the simple select query with 'where 1 =0' , and allow dialects to customize as needed

Author: sureshthalamati <suresh.thalamati@gmail.com>

Closes #8676 from sureshthalamati/table_exists_spark-9078.

64c29afc

[SPARK-10613] [SPARK-10624] [SQL] Reduce LocalNode tests dependency on SQLContext · 35a19f33

Andrew Or authored 9 years ago

Instead of relying on `DataFrames` to verify our answers, we can just use simple arrays. This significantly simplifies the test logic for `LocalNode`s and reduces a lot of code duplicated from `SparkPlanTest`.

This also fixes an additional issue [SPARK-10624](https://issues.apache.org/jira/browse/SPARK-10624) where the output of `TakeOrderedAndProjectNode` is not actually ordered.

Author: Andrew Or <andrew@databricks.com>

Closes #8764 from andrewor14/sql-local-tests-cleanup.

35a19f33

[SPARK-10381] Fix mixup of taskAttemptNumber & attemptId in OutputCommitCoordinator · 38700ea4

Josh Rosen authored 9 years ago

When speculative execution is enabled, consider a scenario where the authorized committer of a particular output partition fails during the OutputCommitter.commitTask() call. In this case, the OutputCommitCoordinator is supposed to release that committer's exclusive lock on committing once that task fails. However, due to a unit mismatch (we used task attempt number in one place and task attempt id in another) the lock will not be released, causing Spark to go into an infinite retry loop.

This bug was masked by the fact that the OutputCommitCoordinator does not have enough end-to-end tests (the current tests use many mocks). Other factors contributing to this bug are the fact that we have many similarly-named identifiers that have different semantics but the same data types (e.g. attemptNumber and taskAttemptId, with inconsistent variable naming which makes them difficult to distinguish).

This patch adds a regression test and fixes this bug by always using task attempt numbers throughout this code.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8544 from JoshRosen/SPARK-10381.

38700ea4

[SPARK-10575] [SPARK CORE] Wrapped RDD.takeSample with Scope · 99ecfa59

vinodkc authored 9 years ago

Remove return statements in RDD.takeSample and wrap it withScope

Author: vinodkc <vinod.kc.in@gmail.com>
Author: vinodkc <vinodkc@users.noreply.github.com>
Author: Vinod K C <vinod.kc@huawei.com>

Closes #8730 from vinodkc/fix_takesample_return.

99ecfa59

[SPARK-10612] [SQL] Add prepare to LocalNode. · a63cdc76

Reynold Xin authored 9 years ago

The idea is that we should separate the function call that does memory reservation (i.e. prepare) from the function call that consumes the input (e.g. open()), so all operators can be a chance to reserve memory before they are all consumed.

Author: Reynold Xin <rxin@databricks.com>

Closes #8761 from rxin/SPARK-10612.

a63cdc76

[SPARK-10548] [SPARK-10563] [SQL] Fix concurrent SQL executions · b6e99863

Andrew Or authored 9 years ago

*Note: this is for master branch only.* The fix for branch-1.5 is at #8721.

The query execution ID is currently passed from a thread to its children, which is not the intended behavior. This led to `IllegalArgumentException: spark.sql.execution.id is already set` when running queries in parallel, e.g.:
```
(1 to 100).par.foreach { _ =>
  sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()
}
```
The cause is `SparkContext`'s local properties are inherited by default. This patch adds a way to exclude keys we don't want to be inherited, and makes SQL go through that code path.

Author: Andrew Or <andrew@databricks.com>

Closes #8710 from andrewor14/concurrent-sql-executions.

b6e99863

[SPARK-7685] [ML] Apply weights to different samples in Logistic Regression · be52faa7

DB Tsai authored 9 years ago

In fraud detection dataset, almost all the samples are negative while only couple of them are positive. This type of high imbalanced data will bias the models toward negative resulting poor performance. In python-scikit, they provide a correction allowing users to Over-/undersample the samples of each class according to the given weights. In auto mode, selects weights inversely proportional to class frequencies in the training set. This can be done in a more efficient way by multiplying the weights into loss and gradient instead of doing actual over/undersampling in the training dataset which is very expensive.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
On the other hand, some of the training data maybe more important like the training samples from tenure users while the training samples from new users maybe less important. We should be able to provide another "weight: Double" information in the LabeledPoint to weight them differently in the learning algorithm.

Author: DB Tsai <dbt@netflix.com>
Author: DB Tsai <dbt@dbs-mac-pro.corp.netflix.com>

Closes #7884 from dbtsai/SPARK-7685.

be52faa7

[SPARK-10475] [SQL] improve column prunning for Project on Sort · 31a229aa

Wenchen Fan authored 9 years ago

Sometimes we can't push down the whole `Project` though `Sort`, but we still have a chance to push down part of it.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8644 from cloud-fan/column-prune.

31a229aa

[SPARK-10437] [SQL] Support aggregation expressions in Order By · 841972e2

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-10437

If an expression in `SortOrder` is a resolved one, such as `count(1)`, the corresponding rule in `Analyzer` to make it work in order by will not be applied.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #8599 from viirya/orderby-agg.

841972e2

Revert "[SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py." · b42059d2
Marcelo Vanzin authored 9 years ago
```
This reverts commit 8abef21d.
```
b42059d2

[DOCS] Small fixes to Spark on Yarn doc · 416003b2

Jacek Laskowski authored 9 years ago

* a follow-up to 16b6d186 as `--num-executors` flag is not suppported.
* links + formatting

Author: Jacek Laskowski <jacek.laskowski@deepsense.io>

Closes #8762 from jaceklaskowski/docs-spark-on-yarn.

416003b2

Closes #8738 · 0d9ab016

Xiangrui Meng authored 9 years ago

Closes #8767
Closes #2491
Closes #6795
Closes #2096
Closes #7722

0d9ab016

[PYSPARK] [MLLIB] [DOCS] Replaced addversion with versionadded in mllib.random · 7ca30b50

noelsmith authored 9 years ago

Missed this when reviewing `pyspark.mllib.random` for SPARK-10275.

Author: noelsmith <mail@noelsmith.com>

Closes #8773 from noel-smith/mllib-random-versionadded-fix.

7ca30b50

[SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py. · 8abef21d

Marcelo Vanzin authored 9 years ago

This change does two things:

- tag a few tests and adds the mechanism in the build to be able to disable those tags,
  both in maven and sbt, for both junit and scalatest suites.
- add some logic to run-tests.py to disable some tags depending on what files have
  changed; that's used to disable expensive tests when a module hasn't explicitly
  been changed, to speed up testing for changes that don't directly affect those
  modules.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8437 from vanzin/test-tags.

8abef21d

[SPARK-10491] [MLLIB] move RowMatrix.dspr to BLAS · c35fdcb7

Yuhao Yang authored 9 years ago

jira: https://issues.apache.org/jira/browse/SPARK-10491

We implemented dspr with sparse vector support in `RowMatrix`. This method is also used in WeightedLeastSquares and other places. It would be useful to move it to `linalg.BLAS`.

Let me know if new UT needed.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #8663 from hhbyyh/movedspr.

c35fdcb7

Update version to 1.6.0-SNAPSHOT. · 09b7e7c1
Reynold Xin authored 9 years ago
```
Author: Reynold Xin <rxin@databricks.com>

Closes #8350 from rxin/1.6.
```
09b7e7c1

[SPARK-10598] [DOCS] · 6503c4b5

Robin East authored 9 years ago

Comments preceding toMessage method state: "The edge partition is encoded in the lower
   * 30 bytes of the Int, and the position is encoded in the upper 2 bytes of the Int.". References to bytes should be changed to bits.

This contribution is my original work and I license the work to the Spark project under it's open source license.

Author: Robin East <robin.east@xense.co.uk>

Closes #8756 from insidedctm/master.

6503c4b5

Small fixes to docs · 833be733

Jacek Laskowski authored 9 years ago

Links work now properly + consistent use of *Spark standalone cluster* (Spark uppercase + lowercase the rest -- seems agreed in the other places in the docs).

Author: Jacek Laskowski <jacek.laskowski@deepsense.io>

Closes #8759 from jaceklaskowski/docs-submitting-apps.

833be733

Sep 14, 2015

[SPARK-10275] [MLLIB] Add @since annotation to pyspark.mllib.random · a2249359
Yu ISHIKAWA authored 9 years ago
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8666 from yu-iskw/SPARK-10275.
```
a2249359

[SPARK-10273] Add @since annotation to pyspark.mllib.feature · 610971ec

noelsmith authored 9 years ago

Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).

Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark).

Author: noelsmith <mail@noelsmith.com>

Closes #8633 from noel-smith/SPARK-10273-since-mllib-feature.

610971ec

[SPARK-9793] [MLLIB] [PYSPARK] PySpark DenseVector, SparseVector implement... · 4ae4d547

Yanbo Liang authored 9 years ago

[SPARK-9793] [MLLIB] [PYSPARK] PySpark DenseVector, SparseVector implement __eq__ and __hash__ correctly

PySpark DenseVector, SparseVector ```__eq__``` method should use semantics equality, and DenseVector can compared with SparseVector.
Implement PySpark DenseVector, SparseVector ```__hash__``` method based on the first 16 entries. That will make PySpark Vector objects can be used in collections.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8166 from yanboliang/spark-9793.

4ae4d547