  Jan 06, 2016
    • [SPARK-12678][CORE] MapPartitionsRDD clearDependencies · b6738520
      Guillaume Poulin authored
      MapPartitionsRDD was keeping a reference to `prev` after a call to
      `clearDependencies`, which could lead to a memory leak.
      
      Author: Guillaume Poulin <poulin.guillaume@gmail.com>
      
      Closes #10623 from gpoulin/map_partition_deps.
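
      A hedged sketch of the kind of change involved: the wrapping RDD keeps its parent in a mutable field so the reference can be released in clearDependencies(). This is illustrative only; signatures are simplified from the real Spark core class.

      ```scala
      import scala.reflect.ClassTag
      import org.apache.spark.{Partition, TaskContext}
      import org.apache.spark.rdd.RDD

      // An RDD that wraps a parent should drop its own reference in clearDependencies(),
      // otherwise the parent cannot be garbage collected even after the dependency
      // list has been cleared.
      class MapPartitionsRDD[U: ClassTag, T: ClassTag](
          var prev: RDD[T], // a var, so the reference can be nulled out
          f: (TaskContext, Int, Iterator[T]) => Iterator[U])
        extends RDD[U](prev) {

        override def getPartitions: Array[Partition] = firstParent[T].partitions

        override def compute(split: Partition, context: TaskContext): Iterator[U] =
          f(context, split.index, firstParent[T].iterator(split, context))

        override def clearDependencies(): Unit = {
          super.clearDependencies()
          prev = null // release the reference that was previously retained
        }
      }
      ```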
    • [SPARK-12673][UI] Add missing uri prepending for job description · 174e72ce
      jerryshao authored
      Otherwise the URL will fail to be proxied to the right one when running in YARN mode. Here is the screenshot:
      
      ![screen shot 2016-01-06 at 5 28 26 pm](https://cloud.githubusercontent.com/assets/850797/12139632/bbe78ecc-b49c-11e5-8932-94e8b3622a09.png)
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #10618 from jerryshao/SPARK-12673.
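
      A hedged sketch of the kind of fix: links built inside the web UI should go through the base-URI helper instead of a hard-coded root-relative path, so they still resolve behind the YARN proxy. The call site and path below are assumptions for illustration.

      ```scala
      import org.apache.spark.ui.UIUtils

      // Illustrative only; UIUtils is private[spark], so real code like this lives
      // inside Spark's UI pages (e.g. the jobs page that renders the description).
      def jobDescriptionLink(basePath: String, jobId: Int): String = {
        // Before (breaks behind a proxy): s"/jobs/job?id=$jobId"
        s"${UIUtils.prependBaseUri(basePath)}/jobs/job?id=$jobId"
      }
      ```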
    • [SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0 · 8e19c766
      Josh Rosen authored
      This PR removes `spark.cleaner.ttl` and the associated TTL-based metadata cleaning code.
      
      Now that we have the `ContextCleaner` and a timer to trigger periodic GCs, I don't think that `spark.cleaner.ttl` is necessary anymore. The TTL-based cleaning isn't enabled by default, isn't included in our end-to-end tests, and has been a source of user confusion when it is misconfigured. If the TTL is set too low, data which is still being used may be evicted or deleted, leading to hard-to-diagnose bugs.
      
      For all of these reasons, I think that we should remove this functionality in Spark 2.0. Additional benefits of doing this include marginally reduced memory usage, since we no longer need to store timestamps in hashmaps, and a handful fewer threads.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10534 from JoshRosen/remove-ttl-based-cleaning.
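
      For reference, a hedged configuration sketch of what remains after this change; the keys below reflect roughly this era of Spark and are shown for illustration only.

      ```scala
      import org.apache.spark.SparkConf

      // Reference-tracking cleanup (the ContextCleaner) replaces TTL-based cleaning.
      val conf = new SparkConf()
        .setAppName("cleaner-demo")
        .set("spark.cleaner.referenceTracking", "true")    // on by default
        .set("spark.cleaner.periodicGC.interval", "30min") // drives the periodic GC timer
      // .set("spark.cleaner.ttl", "3600")                 // no longer supported after this change
      ```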
    • [SPARK-12663][MLLIB] More informative error message in MLUtils.loadLibSVMFile · 6b6d02be
      Robert Dodier authored
      This PR contains one commit, which resolves [SPARK-12663](https://issues.apache.org/jira/browse/SPARK-12663).
      
      For the record, I got a positive response from 2 people when I floated this idea on dev@spark.apache.org on 2015-10-23. [Link to archived discussion.](http://apache-spark-developers-list.1001551.n3.nabble.com/slightly-more-informative-error-message-in-MLUtils-loadLibSVMFile-td14764.html)
      
      Author: Robert Dodier <robert_dodier@users.sourceforge.net>
      
      Closes #10611 from robert-dodier/loadlibsvmfile-error-msg-branch.
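
      A hedged sketch of the kind of message improvement, assuming the usual LIBSVM parsing loop (names and structure here are illustrative, not the exact MLUtils code):

      ```scala
      // Parse one LIBSVM line and fail with an informative message that includes
      // the offending values and the raw line, instead of a bare requirement failure.
      def parseLibSVMRecord(line: String): (Double, Array[Int], Array[Double]) = {
        val items = line.split(' ')
        val label = items.head.toDouble
        val (indices, values) = items.tail.filter(_.nonEmpty).map { item =>
          val Array(i, v) = item.split(':')
          (i.toInt - 1, v.toDouble) // convert one-based index to zero-based
        }.unzip
        var previous = -1
        indices.foreach { current =>
          require(current > previous,
            s"indices should be one-based and in ascending order; " +
            s"found current=${current + 1}, previous=${previous + 1}; line=$line")
          previous = current
        }
        (label, indices.toArray, values.toArray)
      }
      ```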
    • [SPARK-12640][SQL] Add simple benchmarking utility class and add Parquet scan benchmarks. · a74d743c
      Nong Li authored
      
      We've been running ad hoc benchmarks to measure scanner performance. Since we will continue to invest in this, it makes sense to get these benchmarks into code. This adds a simple benchmarking utility to do that.
      
      Author: Nong Li <nong@databricks.com>
      Author: Nong <nongli@gmail.com>
      
      Closes #10589 from nongli/spark-12640.
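
      For illustration, a stripped-down sketch of what such a utility does: time several named cases over the same row count and report a rate. This is not the class added by the PR, just the idea.

      ```scala
      object MiniBenchmark {
        def run(name: String, numRows: Long)(cases: (String, () => Unit)*): Unit = {
          println(s"Benchmark: $name ($numRows rows)")
          cases.foreach { case (caseName, body) =>
            val start = System.nanoTime()
            body()
            val elapsedMs = (System.nanoTime() - start) / 1e6
            val millionsPerSec = (numRows / 1e6) / (elapsedMs / 1e3)
            println(f"  $caseName%-30s $elapsedMs%10.1f ms  $millionsPerSec%8.2f M rows/s")
          }
        }
      }

      // Usage sketch (the scan bodies are hypothetical placeholders):
      // MiniBenchmark.run("Parquet scan", 1000000L)(
      //   "SQL Parquet reader" -> (() => { /* scan with the SQL reader */ }),
      //   "Hive Parquet SerDe" -> (() => { /* scan via the Hive SerDe */ })
      // )
      ```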
    • [SPARK-12604][CORE] Java count(ApproxDistinct)ByKey methods return Scala Long not Java · ac56cf60
      Sean Owen authored
      Change the Java countByKey and countApproxDistinctByKey return types to use Java Long, not Scala Long; update similar methods for consistency on java.lang.Long.valueOf, with no API change.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #10554 from srowen/SPARK-12604.
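
      A hedged sketch of the conversion involved (names are illustrative of the change, not the exact JavaPairRDD code):

      ```scala
      import java.{lang => jl}
      import scala.collection.JavaConverters._

      // The Java API should expose java.lang.Long, not Scala Long, so the Scala
      // result map is converted with java.lang.Long.valueOf before being returned.
      def toJavaCounts[K](counts: Map[K, Long]): java.util.Map[K, jl.Long] =
        counts.mapValues(v => jl.Long.valueOf(v)).toMap.asJava
      ```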
    • [SPARK-12539][SQL] support writing bucketed table · 917d3fc0
      Wenchen Fan authored
      This PR adds bucket write support to Spark SQL. User can specify bucketing columns, numBuckets and sorting columns with or without partition columns. For example:
      ```
      df.write.partitionBy("year").bucketBy(8, "country").sortBy("amount").saveAsTable("sales")
      ```
      
      When bucketing is used, we calculate a bucket id for each record and group the records by bucket id. For each group, we create a file with the bucket id in its name and write the data into it. For each bucket file, if sorting columns are specified, the data is sorted before writing.
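
      Conceptually, each record's bucket id is a hash of the bucketing columns modulo the number of buckets. A minimal sketch of the idea (the real implementation uses Spark's own hash expression, not hashCode, which is why it is not Hive-compatible):

      ```scala
      // Illustrative only: map a record's bucketing-column values to a bucket id
      // in the range [0, numBuckets).
      def bucketIdFor(bucketColumnValues: Seq[Any], numBuckets: Int): Int = {
        val hash = bucketColumnValues.hashCode() // stand-in for Spark's hash expression
        val mod = hash % numBuckets
        if (mod < 0) mod + numBuckets else mod
      }
      ```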
      
      Note that there may be multiple files for one bucket, as the data is distributed.
      
      Currently we store the bucket metadata in the Hive metastore in a non-Hive-compatible way. We use a different bucketing hash function than Hive, so we can't be compatible anyway.
      
      Limitations:
      
      * Can't write bucketed data without hive metastore.
      * Can't insert bucketed data into existing hive tables.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10498 from cloud-fan/bucket-write.
    • [SPARK-12681] [SQL] split IdentifiersParser.g into two files · 6f7ba640
      Davies Liu authored
      This avoids having a huge Java source file (over 64K lines of code) that can't be compiled.
      
      cc hvanhovell
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10624 from davies/split_ident.
    • Revert "[SPARK-12672][STREAMING][UI] Use the uiRoot function instead of... · cbaea959
      Shixiong Zhu authored
      Revert "[SPARK-12672][STREAMING][UI] Use the uiRoot function instead of default root path to gain the streaming batch url."
      
      This reverts commit 19e4e9fe. Will merge #10618 instead.
    • [SPARK-12672][STREAMING][UI] Use the uiRoot function instead of default root... · 19e4e9fe
      huangzhaowei authored
      [SPARK-12672][STREAMING][UI] Use the uiRoot function instead of default root path to gain the streaming batch url.
      
      Author: huangzhaowei <carlmartinmax@gmail.com>
      
      Closes #10617 from SaintBacchus/SPARK-12672.
    • [SPARK-12617][PYSPARK] Move Py4jCallbackConnectionCleaner to Streaming · 1e6648d6
      Shixiong Zhu authored
      Move Py4jCallbackConnectionCleaner to Streaming because the callback server starts only in StreamingContext.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10621 from zsxwing/SPARK-12617-2.
    • [SPARK-12368][ML][DOC] Better doc for the binary classification evaluator's metricName · f82ebb15
      BenFradet authored
      For the BinaryClassificationEvaluator, the scaladoc doesn't mention that "areaUnderPR" is supported, only that the default is "areaUnderROC".
      Also, the documentation says:
      "The default metric used to choose the best ParamMap can be overridden by the setMetric method in each of these evaluators."
      However, the method is called setMetricName.
      
      This PR aims to fix both issues.
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #10328 from BenFradet/SPARK-12368.
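
      A short usage sketch showing both supported metric names and the actual setter name (assuming the spark.ml evaluator API of this era):

      ```scala
      import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

      // "areaUnderROC" is the default; "areaUnderPR" is also supported.
      val evaluator = new BinaryClassificationEvaluator()
        .setMetricName("areaUnderPR") // note: setMetricName, not setMetric
        .setRawPredictionCol("rawPrediction")
        .setLabelCol("label")
      ```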
    • [SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None · fcd013cf
      zero323 authored
      If the initial model passed to GMM is not empty, it causes `net.razorvine.pickle.PickleException`. It can be fixed by converting `initialModel.weights` to `list`.
      
      Author: zero323 <matthew.szymkiewicz@gmail.com>
      
      Closes #9986 from zero323/SPARK-12006.
    • [SPARK-12573][SPARK-12574][SQL] Move SQL Parser from Hive to Catalyst · ea489f14
      Herman van Hovell authored
      This PR moves a major part of the new SQL parser to Catalyst. This is a prelude to start using this parser for all of our SQL parsing. The following key changes have been made:
      
      The ANTLR parser & supporting classes have been moved to the Catalyst project. They are now part of the ```org.apache.spark.sql.catalyst.parser``` package. These classes contained quite a bit of code that was originally from the Hive project; I have added acknowledgements wherever this applied. All Hive dependencies have been factored out. I have also taken this chance to clean up the ```ASTNode``` class and to improve the error handling.
      
      The HiveQl object that provides the functionality to convert an AST into a LogicalPlan has been refactored into three different classes, one for every SQL sub-project:
      - ```CatalystQl```: This implements Query and Expression parsing functionality.
      - ```SparkQl```: This is a subclass of ```CatalystQl``` and provides SQL/Core-only functionality such as Explain and Describe.
      - ```HiveQl```: This is a subclass of ```SparkQl``` and this adds Hive-only functionality to the parser such as Analyze, Drop, Views, CTAS & Transforms. This class still depends on Hive.
      
      cc rxin
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #10583 from hvanhovell/SPARK-12575.
    • [SPARK-11815][ML][PYSPARK] PySpark DecisionTreeClassifier &... · 3aa34882
      Yanbo Liang authored
      [SPARK-11815][ML][PYSPARK] PySpark DecisionTreeClassifier & DecisionTreeRegressor should support setSeed
      
      PySpark ```DecisionTreeClassifier``` & ```DecisionTreeRegressor``` should support ```setSeed``` like what we do on the Scala side.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9807 from yanboliang/spark-11815.
    • [SPARK-11945][ML][PYSPARK] Add computeCost to KMeansModel for PySpark spark.ml · 95eb6516
      Yanbo Liang authored
      Add ```computeCost``` to ```KMeansModel``` as an evaluator for PySpark spark.ml.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9931 from yanboliang/SPARK-11945.
    • [SPARK-11531][ML] SparseVector error Msg · 007da1a9
      Joshi authored
      PySpark SparseVector should raise a "Found duplicate indices" error message.
      
      Author: Joshi <rekhajoshm@gmail.com>
      Author: Rekha Joshi <rekhajoshm@gmail.com>
      
      Closes #9525 from rekhajoshm/SPARK-11531.
    • [SPARK-7675][ML][PYSPARK] sparkml params type conversion · 3b29004d
      Holden Karau authored
      From JIRA:
      Currently, PySpark wrappers for spark.ml Scala classes are brittle when accepting Param types. E.g., Normalizer's "p" param cannot be set to "2" (an integer); it must be set to "2.0" (a float). Fixing this is not trivial since there does not appear to be a natural place to insert the conversion before Python wrappers call Java's Params setter method.
      
      A possible fix would be to include a method "_checkType" in PySpark's Param class which checks the type, prints an error if needed, and converts types when relevant (e.g., int to float, or scipy matrix to array). The Java wrapper method which copies params to Scala can call this method when available.
      
      This fix instead checks the types at set time, since I think failing sooner is better, but I can switch it around to check at copy time if that would be better. So far this only converts int to float; other conversions (like scipy matrix to array) are left for the future.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #9581 from holdenk/SPARK-7675-PySpark-sparkml-Params-type-conversion.
    • [SPARK-11878][SQL] Eliminate distribute by in case group by is present with... · 9061e777
      Yash Datta authored
      [SPARK-11878][SQL] Eliminate distribute by in case group by is present with exactly the same grouping expressions
      
      For queries like:
      select <> from table group by a distribute by a
      we can eliminate the distribute by, since the group by will do a hash partitioning anyway.
      This is also applicable when the user uses the DataFrame API.
      
      Author: Yash Datta <Yash.Datta@guavus.com>
      
      Closes #9858 from saucam/eliminatedistribute.
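
      For illustration, a hedged sketch assuming a SQLContext named sqlContext and a registered table t; with this change, both statements should plan a single hash-partitioning exchange on a.

      ```scala
      // Redundant: DISTRIBUTE BY on exactly the grouping expressions adds nothing,
      // because GROUP BY already hash-partitions by `a`.
      val withDistributeBy =
        sqlContext.sql("SELECT a, count(*) FROM t GROUP BY a DISTRIBUTE BY a")
      val groupByOnly =
        sqlContext.sql("SELECT a, count(*) FROM t GROUP BY a")

      // The same applies to the DataFrame API, e.g.:
      // df.repartition(df("a")).groupBy("a").count()
      ```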
    • [SPARK-12665][CORE][GRAPHX] Remove Vector, VectorSuite and... · 94c202c7
      Kousuke Saruta authored
      [SPARK-12665][CORE][GRAPHX] Remove Vector, VectorSuite and GraphKryoRegistrator which are deprecated and no longer used
      
      The whole of Vector.scala, VectorSuite.scala and GraphKryoRegistrator.scala is no longer used, so it's time to remove them in Spark 2.0.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #10613 from sarutak/SPARK-12665.
    • [SPARK-12340][SQL] fix Int overflow in the SparkPlan.executeTake, RDD.take and... · 5d871ea4
      QiangCai authored
      [SPARK-12340][SQL] fix Int overflow in the SparkPlan.executeTake, RDD.take and AsyncRDDActions.takeAsync
      
      I have closed pull request https://github.com/apache/spark/pull/10487 and created this pull request to resolve the problem.

      Spark JIRA: https://issues.apache.org/jira/browse/SPARK-12340
      
      Author: QiangCai <david.caiq@gmail.com>
      
      Closes #10562 from QiangCai/bugfix.
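
      A hedged sketch of the overflow pattern and the fix: do the scaling arithmetic in Long and clamp before converting back to Int. Variable names only loosely mirror the take() loop.

      ```scala
      // Estimate how many more partitions to scan without overflowing Int when
      // num and partsScanned are large.
      def nextPartsToTry(num: Int, partsScanned: Long, bufSize: Int, totalParts: Int): Int = {
        val estimate: Long =
          if (bufSize == 0) partsScanned * 4L
          else (1.5 * num * partsScanned / bufSize).toLong + 1L
        math.min(estimate, totalParts.toLong).toInt
      }
      ```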
    • [SPARK-12578][SQL] Distinct should not be silently ignored when used in an... · b2467b38
      Liang-Chi Hsieh authored
      [SPARK-12578][SQL] Distinct should not be silently ignored when used in an aggregate function with OVER clause
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-12578
      
      A slight update to the Hive parser. We should keep the DISTINCT keyword when it is used in an aggregate function with an OVER clause, so that CheckAnalysis will detect it and throw an exception later.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #10557 from viirya/keep-distinct-hivesql.
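
      For illustration, a hedged sketch assuming a HiveContext named sqlContext with a table t; with the parser keeping DISTINCT, this query should now be rejected during analysis instead of silently computing count(a) OVER (...).

      ```scala
      import org.apache.spark.sql.AnalysisException

      val query = "SELECT count(DISTINCT a) OVER (PARTITION BY b) FROM t"
      try {
        sqlContext.sql(query) // analysis should now fail instead of dropping DISTINCT
      } catch {
        case e: AnalysisException => println(s"rejected as expected: ${e.getMessage}")
      }
      ```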