Commits · e5904bb5e7d83b3731b312c40f7904c0511019f5 · cs525-sp18-g07 / spark

Jan 10, 2016

[SPARK-12692][BUILD][MLLIB] Scala style: Fix the style violation (Space before "," or ":") · e5904bb5

Kousuke Saruta authored 9 years ago

Fix the style violation (space before , and :).
This PR is a followup for #10643.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10684 from sarutak/SPARK-12692-followup-mllib.

e5904bb5

[SPARK-12736][CORE][DEPLOY] Standalone Master cannot be started due t… · b78e028e

Jacek Laskowski authored 9 years ago

…o NoClassDefFoundError: org/spark-project/guava/collect/Maps

/cc srowen rxin

Author: Jacek Laskowski <jacek@japila.pl>

Closes #10674 from jaceklaskowski/SPARK-12736.

b78e028e

Jan 09, 2016

[SPARK-12735] Consolidate & move spark-ec2 to AMPLab managed repository. · 5b0d5443

Reynold Xin authored 9 years ago

Author: Reynold Xin <rxin@databricks.com>

Closes #10673 from rxin/SPARK-12735.

5b0d5443

Close #10665 · 3efd106e

Reynold Xin authored 9 years ago

3efd106e

[SPARK-12340] Fix overflow in various take functions. · b23c4521

Reynold Xin authored 9 years ago

This is a follow-up for the original patch #10562.

Author: Reynold Xin <rxin@databricks.com>

Closes #10670 from rxin/SPARK-12340.

b23c4521

[SPARK-12645][SPARKR] SparkR support hash function · 3d77cffe

Yanbo Liang authored 9 years ago

Add ```hash``` function for SparkR ```DataFrame```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10597 from yanboliang/spark-12645.

3d77cffe

Jan 08, 2016

[SPARK-12577] [SQL] Better support of parentheses in partition by and order by... · 95cd5d95

Liang-Chi Hsieh authored 9 years ago

[SPARK-12577] [SQL] Better support of parentheses in partition by and order by clause of window function's over clause

JIRA: https://issues.apache.org/jira/browse/SPARK-12577

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10620 from viirya/fix-parentheses.

95cd5d95

[SPARK-4628][BUILD] Remove all non-Maven-Central repositories from build · 090d6913

Josh Rosen authored 9 years ago

This patch removes all non-Maven-central repositories from Spark's build, thereby avoiding any risk of future build-breaks due to us accidentally depending on an artifact which is not present in an immutable public Maven repository.

I tested this by running

```
build/mvn \
        -Phive \
        -Phive-thriftserver \
        -Pkinesis-asl \
        -Pspark-ganglia-lgpl \
        -Pyarn \
        dependency:go-offline
```

inside of a fresh Ubuntu Docker container with no Ivy or Maven caches (I did a similar test for SBT).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10659 from JoshRosen/SPARK-4628.

090d6913

[SPARK-12730][TESTS] De-duplicate some test code in BlockManagerSuite · 1fdf9bbd

Josh Rosen authored 9 years ago

This patch deduplicates some test code in BlockManagerSuite. I'm splitting this change off from a larger PR in order to make things easier to review.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10667 from JoshRosen/block-mgr-tests-cleanup.

1fdf9bbd

[SPARK-12593][SQL] Converts resolved logical plan back to SQL · d9447cac

Cheng Lian authored 9 years ago

This PR tries to enable Spark SQL to convert resolved logical plans back to SQL query strings. For now, the major use case is to canonicalize Spark SQL native view support. The major entry point is `SQLBuilder.toSQL`, which returns an `Option[String]` if the logical plan is recognized.

The current version is still in WIP status, and is quite limited. Known limitations include:

1. The logical plan must be analyzed but not optimized

The optimizer erases `Subquery` operators, which contain necessary scope information for SQL generation. Future versions should be able to recover erased scope information by inserting subqueries when necessary.

1. The logical plan must be created using HiveQL query string

Query plans generated by composing arbitrary DataFrame API combinations are not supported yet. Operators within these query plans need to be rearranged into a canonical form that is more suitable for direct SQL generation. For example, the following query plan

```
Filter (a#1 < 10)
+- MetastoreRelation default, src, None
```

needs to be canonicalized into the following form before SQL generation:

```
Project [a#1, b#2, c#3]
+- Filter (a#1 < 10)
   +- MetastoreRelation default, src, None
```

Otherwise, the SQL generation process will have to handle a large number of special cases.

1. Only a fraction of expressions and basic logical plan operators are supported in this PR

Currently, 95.7% (1720 out of 1798) of the query plans in `HiveCompatibilitySuite` can be successfully converted to SQL query strings.

Known unsupported components are:

- Expressions
  - Part of math expressions
  - Part of string expressions (buggy?)
  - Null expressions
  - Calendar interval literal
  - Part of date time expressions
  - Complex type creators
  - Special `NOT` expressions, e.g. `NOT LIKE` and `NOT IN`
- Logical plan operators/patterns
  - Cube, rollup, and grouping set
  - Script transformation
  - Generator
  - Distinct aggregation patterns that fit `DistinctAggregationRewriter` analysis rule
  - Window functions

Support for window functions, generators, and cubes etc. will be added in follow-up PRs.

This PR leverages `HiveCompatibilitySuite` for testing SQL generation in a "round-trip" manner:

* For each select query, we try to convert it back to SQL
* If the query plan is convertible, we parse the generated SQL into a new logical plan
* Run the new logical plan instead of the original one

If the query plan is inconvertible, the test case simply falls back to the original logic.
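
The round-trip flow described above can be sketched roughly as follows. This is a minimal Python illustration, not the Scala test code from the PR: `to_sql`, `parse`, and `run` are hypothetical stand-ins, and returning `None` plays the role of an empty `Option[String]` from `SQLBuilder.toSQL`.

```python
from typing import Callable, Optional


def round_trip(plan: str,
               to_sql: Callable[[str], Optional[str]],
               parse: Callable[[str], str],
               run: Callable[[str], str]) -> str:
    """Run `plan` via regenerated SQL when possible, else run it directly."""
    sql = to_sql(plan)
    if sql is None:
        # Inconvertible plan: fall back to the original logic.
        return run(plan)
    # Convertible plan: parse the generated SQL and run the new plan instead.
    return run(parse(sql))


# Hypothetical stand-ins: plans are plain strings, and only plans that
# start with "Project" are treated as convertible.
def toy_to_sql(plan: str) -> Optional[str]:
    return f"SELECT /* {plan} */ 1" if plan.startswith("Project") else None


def toy_parse(sql: str) -> str:
    return f"Reparsed({sql})"


def toy_run(plan: str) -> str:
    return f"ran {plan}"
```

With these stand-ins, a convertible plan is executed through its re-parsed form, while an inconvertible one runs unchanged.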

TODO

- [x] Fix failed test cases
- [x] Support for more basic expressions and logical plan operators (e.g. distinct aggregation etc.)
- [x] Comments and documentation

Author: Cheng Lian <lian@databricks.com>

Closes #10541 from liancheng/sql-generation.

d9447cac

[SPARK-4819] Remove Guava's "Optional" from public API · 659fd9d0

Sean Owen authored 9 years ago

Replace Guava `Optional` with (an API clone of) Java 8 `java.util.Optional` (edit: and a clone of Guava `Optional`)

See also https://github.com/apache/spark/pull/10512

Author: Sean Owen <sowen@cloudera.com>

Closes #10513 from srowen/SPARK-4819.

659fd9d0

[SPARK-12654] sc.wholeTextFiles with spark.hadoop.cloneConf=true fail… · 553fd7b9

Thomas Graves authored 9 years ago

…s on secure Hadoop

https://issues.apache.org/jira/browse/SPARK-12654

So the bug here is that `WholeTextFileRDD.getPartitions` has:
`val conf = getConf`
In `getConf`, if `cloneConf=true`, it creates a new Hadoop `Configuration`. Then it uses that to create a new job context via `newJobContext`.
The `newJobContext` call will copy credentials around, but credentials are only present in a `JobConf`, not in a Hadoop `Configuration`. So when it clones the Hadoop configuration, it changes it from a `JobConf` to a `Configuration` and drops the credentials that were there. `NewHadoopRDD` just uses the conf passed in for `getPartitions` (not `getConf`), which is why it works.
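
The failure mode can be pictured with a toy model. The sketch below is plain Python and does not use Hadoop's API: the `Configuration` and `JobConf` classes here are simplified stand-ins, and `clone_as_configuration` is a hypothetical helper that mimics cloning into the base type.

```python
class Configuration:
    """Toy stand-in for a Hadoop Configuration: key/value settings only."""

    def __init__(self, other=None):
        self.props = dict(other.props) if other is not None else {}


class JobConf(Configuration):
    """Toy stand-in for a JobConf: settings plus credentials."""

    def __init__(self, other=None):
        super().__init__(other)
        self.credentials = list(getattr(other, "credentials", []))


def clone_as_configuration(conf):
    # Cloning into the base type copies the settings but has nowhere
    # to keep the credentials, so they are silently lost.
    return Configuration(conf)


def credentials_of(conf):
    return getattr(conf, "credentials", [])


secure = JobConf()
secure.props["fs.defaultFS"] = "hdfs://example"  # illustrative value
secure.credentials.append("token")

cloned = clone_as_configuration(secure)
```

Here `cloned` keeps every setting of `secure` but carries no credentials, which is the same shape of problem the commit message describes.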

Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>

Closes #10651 from tgravescs/SPARK-12654.

553fd7b9

fixed numVertices in transitive closure example · 8c70cb4c

Udo Klein authored 9 years ago

Author: Udo Klein <git@blinkenlight.net>

Closes #10642 from udoklein/patch-2.

8c70cb4c

[DOCUMENTATION] doc fix of job scheduling · 00d92617

Jeff Zhang authored 9 years ago

spark.shuffle.service.enabled is a Spark application configuration, so it is not necessary to set it in yarn-site.xml.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #10657 from zjffdu/doc-fix.

00d92617

[SPARK-12701][CORE] FileAppender should use join to ensure writing thread completion · ea104b8f

Bryan Cutler authored 9 years ago

Changed Logging FileAppender to use join in `awaitTermination` to ensure that thread is properly finished before returning.
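
As a rough illustration of why joining matters, here is a minimal Python sketch. It is not Spark's `FileAppender`: the class name `FileAppenderSketch` and its fields are hypothetical, and a list stands in for the output file.

```python
import threading
import time


class FileAppenderSketch:
    """Copies items to a list on a background thread."""

    def __init__(self, items):
        self.written = []
        self._items = list(items)
        self._thread = threading.Thread(target=self._append_all, daemon=True)

    def start(self):
        self._thread.start()

    def _append_all(self):
        for item in self._items:
            time.sleep(0.01)  # simulate slow I/O
            self.written.append(item)

    def await_termination(self):
        # join() blocks until the writing thread has actually finished,
        # so callers never observe a partially written result.
        self._thread.join()


appender = FileAppenderSketch(["a", "b", "c"])
appender.start()
appender.await_termination()
```

After `await_termination` returns, the writer thread is no longer alive and everything it was asked to write is present.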

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #10654 from BryanCutler/fileAppender-join-thread-SPARK-12701.

ea104b8f

[SPARK-12687] [SQL] Support from clause surrounded by `()`. · cfe1ba56

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-12687

Some queries, such as `(select 1 as a) union (select 2 as a)`, don't work. This patch fixes that.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10660 from viirya/fix-union.

cfe1ba56

[SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition · b9c83533

Sean Owen authored 9 years ago

Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs.

Author: Sean Owen <sowen@cloudera.com>

Closes #10570 from srowen/SPARK-12618.

b9c83533

[SPARK-12692][BUILD] Scala style: check no white space before comma and colon · 794ea553

Kousuke Saruta authored 9 years ago

We should not put a white space before `,` and `:` so let's check it.
Because there are lots of style violations, I'd first like to add a checker, enable it, and set its level to `warning`.
Then, I'd like to fix the style step by step.
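
To make the rule concrete, here is a naive sketch of such a check in Python. It is not the Scalastyle checker added by this commit: `find_violations` is a hypothetical helper based on a plain regular expression, so it would also flag matches inside string literals, comments, and operators such as `::`.

```python
import re

# Matches one or more spaces or tabs immediately before "," or ":".
SPACE_BEFORE_TOKEN = re.compile(r"[ \t]+[,:]")


def find_violations(source: str):
    """Return (line_number, column) pairs, both 1-based, for each violation."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for match in SPACE_BEFORE_TOKEN.finditer(line):
            hits.append((line_no, match.start() + 1))
    return hits


sample = "def f(a : Int, b: Int) = g(a , b)\nval ok: Int = f(1, 2)"
violations = find_violations(sample)  # two hits, both on the first line
```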

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10643 from sarutak/SPARK-12692.

794ea553

Jan 07, 2016

Fix indentation for the previous patch. · 726bd3c4
Reynold Xin authored 9 years ago

726bd3c4

[SPARK-12317][SQL] Support units (m,k,g) in SQLConf · 5028a001

Kevin Yu authored 9 years ago

This PR continues the previously closed PR 10314.

In this PR, SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE accepts memory string conventions as input.

For example, the user can now specify 10g for SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE in SQLConf file.
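
A size string such as `10g` might be turned into a byte count along the following lines. This is only a sketch in Python, not Spark's parser: `parse_size` is a hypothetical helper, and it assumes binary multiples (1k = 1024 bytes) and that a bare number means bytes.

```python
import re

# Assumption: binary multiples (1k = 1024 bytes); a bare number means bytes.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
_SIZE = re.compile(r"^\s*(\d+)\s*(?:([kmg])b?)?\s*$", re.IGNORECASE)


def parse_size(text: str) -> int:
    """Convert strings like '10g', '64m', '512k' or '512' to a byte count."""
    match = _SIZE.match(text)
    if match is None:
        raise ValueError(f"not a size string: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[(unit or "").lower()]


ten_gigabytes = parse_size("10g")
```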

marmbrus srowen: Could you help review these code changes? Thanks.

Author: Kevin Yu <qyu@us.ibm.com>

Closes #10629 from kevinyu98/spark-12317.

5028a001

[SPARK-12591][STREAMING] Register OpenHashMapBasedStateMap for Kryo · 28e0e500

Shixiong Zhu authored 9 years ago

The default serializer in Kryo is FieldSerializer and it ignores transient fields and never calls `writeObject` or `readObject`. So we should register OpenHashMapBasedStateMap using `DefaultSerializer` to make it work with Kryo.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10609 from zsxwing/SPARK-12591.

28e0e500

[SPARK-12507][STREAMING][DOCUMENT] Expose closeFileAfterWrite and... · c94199e9

Shixiong Zhu authored 9 years ago

[SPARK-12507][STREAMING][DOCUMENT] Expose closeFileAfterWrite and allowBatching configurations for Streaming

/cc tdas brkyvz

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10453 from zsxwing/streaming-conf.

c94199e9

[SPARK-12604][CORE] Addendum - use casting vs mapValues for countBy{Key,Value} · 5a402199

Sean Owen authored 9 years ago

Per rxin, let's use the casting for countByKey and countByValue as well. Let's see if this passes.

Author: Sean Owen <sowen@cloudera.com>

Closes #10641 from srowen/SPARK-12604.2.

5a402199

[SPARK-12510][STREAMING] Refactor ActorReceiver to support Java · c0c39750

Shixiong Zhu authored 9 years ago

This PR includes the following changes:

1. Rename `ActorReceiver` to `ActorReceiverSupervisor`
2. Remove `ActorHelper`
3. Add a new `ActorReceiver` for Scala and `JavaActorReceiver` for Java
4. Add `JavaActorWordCount` example

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10457 from zsxwing/java-actor-stream.

c0c39750

[SPARK-12580][SQL] Remove string concatenations from usage and extended in @ExpressionDescription · 34dbc8af

Kazuaki Ishizaki authored 9 years ago

Use multi-line string literals for ExpressionDescription with ``// scalastyle:off line.size.limit`` and ``// scalastyle:on line.size.limit``

The policy is as described at https://github.com/apache/spark/pull/10488:

Let's use multi-line string literals. If we have to have a line with more than 100 characters, let's use ``// scalastyle:off line.size.limit`` and ``// scalastyle:on line.size.limit`` to just bypass the line length requirement.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #10524 from kiszk/SPARK-12580.

34dbc8af

[SPARK-12598][CORE] bug in setMinPartitions · 83465183

Darek Blasiak authored 9 years ago

There is a bug in the calculation of ```maxSplitSize```. The ```totalLen``` should be divided by ```minPartitions``` and not by ```files.size```.
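
The effect of the wrong divisor shows up with a small worked example. The Python sketch below only reproduces the arithmetic: the helper names are hypothetical, it ignores file and block boundaries, and treating a requested minimum of 0 as 1 is an assumption.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without going through floats."""
    return -(-a // b)


def max_split_size(total_len: int, min_partitions: int) -> int:
    # Treating a requested minimum of 0 as 1 is an assumption made here
    # to avoid dividing by zero.
    return ceil_div(total_len, max(min_partitions, 1))


def split_count(total_len: int, split_size: int) -> int:
    return ceil_div(total_len, split_size)


total_len = 1000      # combined size of all input files, in bytes
num_files = 4
min_partitions = 10

buggy = ceil_div(total_len, num_files)             # divides by files.size
fixed = max_split_size(total_len, min_partitions)  # divides by minPartitions
```

Dividing by the file count caps the number of splits at the number of files, so a request for 10 partitions over 4 files cannot be honoured; dividing by `minPartitions` makes the splits small enough.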

Author: Darek Blasiak <darek.blasiak@640labs.com>

Closes #10546 from datafarmer/setminpartitionsbug.

83465183

[STREAMING][MINOR] More contextual information in logs + minor code i… · 1b2c2162

Jacek Laskowski authored 9 years ago

…mprovements

Please review and merge at your convenience. Thanks!

Author: Jacek Laskowski <jacek@japila.pl>

Closes #10595 from jaceklaskowski/streaming-minor-fixes.

1b2c2162

[MINOR] Fix for BUILD FAILURE for Scala 2.11 · 07b314a5

Jacek Laskowski authored 9 years ago

It was introduced in 917d3fc0

/cc cloud-fan rxin

Author: Jacek Laskowski <jacek@japila.pl>

Closes #10636 from jaceklaskowski/fix-for-build-failure-2.11.

07b314a5

[SPARK-12662][SQL] Fix DataFrame.randomSplit to avoid creating overlapping splits · f194d991

Sameer Agarwal authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-12662

cc yhuai

Author: Sameer Agarwal <sameer@databricks.com>

Closes #10626 from sameeragarwal/randomsplit.

f194d991

[SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None · 592f6498

zero323 authored 9 years ago

If the initial model passed to GMM is not empty, it causes a net.razorvine.pickle.PickleException. This can be fixed by converting initialModel.weights to a list.

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #10644 from zero323/SPARK-12006.

592f6498

[STREAMING][DOCS][EXAMPLES] Minor fixes · 8113dbda

Jacek Laskowski authored 9 years ago

Author: Jacek Laskowski <jacek@japila.pl>

Closes #10603 from jaceklaskowski/streaming-actor-custom-receiver.

8113dbda

[SPARK-12542][SQL] support except/intersect in HiveQl · fd1dcfaf

Davies Liu authored 9 years ago

Parse the SQL query with except/intersect in FROM clause for HiveQL.

Author: Davies Liu <davies@databricks.com>

Closes #10622 from davies/intersect.

fd1dcfaf

[SPARK-12295] [SQL] external spilling for window functions · 6a1c864a

Davies Liu authored 9 years ago

This PR manages the memory used by window functions (buffered rows) and also enables external spilling.

After this PR, we can run window functions on a partition with hundreds of millions of rows with only 1G.

Author: Davies Liu <davies@databricks.com>

Closes #10605 from davies/unsafe_window.

6a1c864a

[DOC] fix 'spark.memory.offHeap.enabled' default value to false · 84e77a15

zzcclp authored 9 years ago

modify 'spark.memory.offHeap.enabled' default value to false

Author: zzcclp <xm_zzc@sina.com>

Closes #10633 from zzcclp/fix_spark.memory.offHeap.enabled_default_value.

84e77a15

Revert "[SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None" · e5cde7ab

Yin Huai authored 9 years ago

This reverts commit fcd013cf.

Author: Yin Huai <yhuai@databricks.com>

Closes #10632 from yhuai/pythonStyle.

e5cde7ab

Jan 06, 2016

[SPARK-12678][CORE] MapPartitionsRDD clearDependencies · b6738520

Guillaume Poulin authored 9 years ago

MapPartitionsRDD was keeping a reference to `prev` after a call to
`clearDependencies` which could lead to memory leak.
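
The leak pattern can be shown in miniature. The following Python sketch is an analogy rather than Spark code: `ParentRDD` and `MapPartitionsSketch` are hypothetical classes, and a weak reference is used only to observe whether the parent object can be reclaimed.

```python
import gc
import weakref


class ParentRDD:
    """Stand-in for an upstream RDD that may hold a lot of data."""


class MapPartitionsSketch:
    def __init__(self, prev):
        self.prev = prev            # direct field, like `prev` in the commit
        self.dependencies = [prev]  # lineage information

    def clear_dependencies(self):
        self.dependencies = None
        # Without this line the field above would still pin the parent.
        self.prev = None


parent = ParentRDD()
parent_ref = weakref.ref(parent)
child = MapPartitionsSketch(parent)
del parent                          # only `child` references the parent now
alive_before = parent_ref() is not None

child.clear_dependencies()
gc.collect()                        # make collection deterministic for the demo
alive_after = parent_ref() is not None
```

The parent stays reachable until `clear_dependencies` also drops the `prev` field; clearing the dependency list alone would not release it.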

Author: Guillaume Poulin <poulin.guillaume@gmail.com>

Closes #10623 from gpoulin/map_partition_deps.

b6738520

[SPARK-12673][UI] Add missing uri prepending for job description · 174e72ce

jerryshao authored 9 years ago

Otherwise the URL will fail to be proxied to the right one in YARN mode. Here is the screenshot:

![screen shot 2016-01-06 at 5 28 26 pm](https://cloud.githubusercontent.com/assets/850797/12139632/bbe78ecc-b49c-11e5-8932-94e8b3622a09.png)

Author: jerryshao <sshao@hortonworks.com>

Closes #10618 from jerryshao/SPARK-12673.

174e72ce

[SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0 · 8e19c766

Josh Rosen authored 9 years ago

This PR removes `spark.cleaner.ttl` and the associated TTL-based metadata cleaning code.

Now that we have the `ContextCleaner` and a timer to trigger periodic GCs, I don't think that `spark.cleaner.ttl` is necessary anymore. The TTL-based cleaning isn't enabled by default, isn't included in our end-to-end tests, and has been a source of user confusion when it is misconfigured. If the TTL is set too low, data which is still being used may be evicted / deleted, leading to hard to diagnose bugs.
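
The hazard with a TTL that is too low can be sketched with a small timestamped map. This is a Python illustration only, not the code removed by the PR: `TtlMap` is a hypothetical class, and the clock is passed in explicitly so the example is deterministic.

```python
class TtlMap:
    """Map whose entries expire `ttl` seconds after they were written."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._data = {}  # key -> (value, write_time)

    def put(self, key, value, now):
        self._data[key] = (value, now)

    def get(self, key):
        entry = self._data.get(key)
        return None if entry is None else entry[0]

    def clear_old(self, now):
        # Age is measured from the write time only; the cleaner has no way
        # of knowing whether an entry is still in use.
        expired = [k for k, (_, t) in self._data.items() if now - t > self.ttl]
        for key in expired:
            del self._data[key]


metadata = TtlMap(ttl=60)
metadata.put("rdd_1", "partition info", now=0)
metadata.clear_old(now=30)
still_there = metadata.get("rdd_1")   # within the TTL
metadata.clear_old(now=120)
after_expiry = metadata.get("rdd_1")  # evicted, even if a job still needs it
```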

For all of these reasons, I think that we should remove this functionality in Spark 2.0. Additional benefits of doing this include marginally reduced memory usage, since we no longer need to store timestamps in hashmaps, and a handful fewer threads.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10534 from JoshRosen/remove-ttl-based-cleaning.

8e19c766

[SPARK-12663][MLLIB] More informative error message in MLUtils.loadLibSVMFile · 6b6d02be

Robert Dodier authored 9 years ago

This PR contains 1 commit which resolves [SPARK-12663](https://issues.apache.org/jira/browse/SPARK-12663).

For the record, I got a positive response from 2 people when I floated this idea on dev@spark.apache.org on 2015-10-23. [Link to archived discussion.](http://apache-spark-developers-list.1001551.n3.nabble.com/slightly-more-informative-error-message-in-MLUtils-loadLibSVMFile-td14764.html)

Author: Robert Dodier <robert_dodier@users.sourceforge.net>

Closes #10611 from robert-dodier/loadlibsvmfile-error-msg-branch.

6b6d02be

[SPARK-12640][SQL] Add simple benchmarking utility class and add Parquet scan benchmarks. · a74d743c

Nong Li authored 9 years ago

We've run benchmarks ad hoc to measure the scanner performance. We will continue to invest in this
and it makes sense to get these benchmarks into code. This adds a simple benchmarking utility to do
this.

Author: Nong Li <nong@databricks.com>
Author: Nong <nongli@gmail.com>

Closes #10589 from nongli/spark-12640.

a74d743c