- Nov 08, 2015
Wenchen Fan authored
Author: Wenchen Fan <wenchen@databricks.com> Closes #9521 from cloud-fan/map.
xin Wu authored
Doc change to align with the HiveConf default for where the `warehouse` directory is created. Author: xin Wu <xinwu@us.ibm.com> Closes #9365 from xwu0226/spark-10046-commit.
Herman van Hovell authored
This PR adds support for multiple column in a single count distinct aggregate to the new aggregation path. cc yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9409 from hvanhovell/SPARK-11451.
Rohit Agarwal authored
This snippet seems to have been mistakenly introduced in two places in #5348. Author: Rohit Agarwal <mindprince@gmail.com> Closes #9540 from mindprince/patch-1.
Sean Owen authored
Fix Python example to use normalRDD as advertised Author: Sean Owen <sowen@cloudera.com> Closes #9529 from srowen/SPARK-11476.
- Nov 07, 2015
Liang-Chi Hsieh authored
JIRA: https://issues.apache.org/jira/browse/SPARK-11362 We use scala.collection.mutable.BitSet in BroadcastNestedLoopJoin now. We should use Spark's BitSet. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9316 from viirya/use-spark-bitset.
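A minimal sketch of the long-array-backed bit set idea behind Spark's `BitSet` (an illustrative standalone version, not the actual `org.apache.spark.util.collection.BitSet`):

```scala
// Fixed-size bit set backed by an Array[Long]: bit i lives in word i / 64.
class SimpleBitSet(numBits: Int) {
  private val words = new Array[Long]((numBits + 63) / 64)

  def set(index: Int): Unit = words(index >> 6) |= (1L << (index & 63))
  def get(index: Int): Boolean = (words(index >> 6) & (1L << (index & 63))) != 0
}

val bits = new SimpleBitSet(128)
bits.set(70)
bits.get(70)  // true
```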
Herman van Hovell authored
This PR is a follow up for PR https://github.com/apache/spark/pull/9406. It adds more documentation to the rewriting rule, removes a redundant if expression in the non-distinct aggregation path and adds a multiple distinct test to the AggregationQuerySuite. cc yhuai marmbrus Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9541 from hvanhovell/SPARK-9241-followup.
Yu ISHIKAWA authored
Could jkbradley and davies review it?
- Create a wrapper class `LDAModelWrapper` for `LDAModel`, because we can't deal with the return value of `describeTopics` in Scala from pyspark directly; `Array[(Array[Int], Array[Double])]` is too complicated to convert.
- Add `loadLDAModel` in `PythonMLlibAPI`, since `LDAModel` in Scala is an abstract class and we need to call `load` of `DistributedLDAModel`.

[[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8643 from yu-iskw/SPARK-8467-2.
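To make the conversion problem concrete, here is a rough sketch of the wrapper idea under assumed names (`SimpleLDAWrapper` is hypothetical, not the actual `LDAModelWrapper`): flatten each `(termIndices, termWeights)` pair into a simpler shape before handing it to the Python serializer.

```scala
import org.apache.spark.mllib.clustering.LDAModel

// Hypothetical sketch: expose describeTopics in a flattened shape that is
// easier to serialize to Python than Array[(Array[Int], Array[Double])].
class SimpleLDAWrapper(model: LDAModel) {
  def describeTopics(maxTermsPerTopic: Int): Array[Array[Any]] =
    model.describeTopics(maxTermsPerTopic).map { case (termIndices, termWeights) =>
      Array(termIndices, termWeights)
    }
}
```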
- Nov 06, 2015
Andrew Or authored
<img width="548" alt="screen shot 2015-11-01 at 9 42 33 am" src="https://cloud.githubusercontent.com/assets/2133137/10870343/2a8cd070-807d-11e5-857a-4ebcace77b5b.png"> mateiz sarutak Author: Andrew Or <andrew@databricks.com> Closes #9398 from andrewor14/rdd-callsite.
Josh Rosen authored
In order to lay the groundwork for proper off-heap memory support in SQL / Tungsten, we need to extend our MemoryManager to perform bookkeeping for off-heap memory.

## User-facing changes

This PR introduces a new configuration, `spark.memory.offHeapSize` (name subject to change), which specifies the absolute amount of off-heap memory that Spark and Spark SQL can use. If Tungsten is configured to use off-heap execution memory for allocating data pages, then all data page allocations must fit within this size limit.

## Internals changes

This PR contains a lot of internal refactoring of the MemoryManager. The key change at the heart of this patch is the introduction of a `MemoryPool` class (name subject to change) to manage the bookkeeping for a particular category of memory (storage, on-heap execution, and off-heap execution). These MemoryPools are not fixed-size; they can be dynamically grown and shrunk according to the MemoryManager's policies. In StaticMemoryManager, these pools have fixed sizes, proportional to the legacy `[storage|shuffle].memoryFraction`. In the new UnifiedMemoryManager, the sizes of these pools are dynamically adjusted according to its policies.

There are two subclasses of `MemoryPool`: `StorageMemoryPool` manages storage memory and `ExecutionMemoryPool` manages execution memory. The MemoryManager creates two execution pools, one for on-heap memory and one for off-heap. Instances of `ExecutionMemoryPool` manage the logic for fair sharing of their pooled memory across running tasks (in other words, the ShuffleMemoryManager-like logic has been moved out of MemoryManager and pushed into these ExecutionMemoryPool instances).

I think that this design is substantially easier to understand and reason about than the previous design, where most of these responsibilities were handled by MemoryManager and its subclasses. To see this, take a look at how simple the logic in `UnifiedMemoryManager` has become: it's now very easy to see when memory is dynamically shifted between storage and execution.

## TODOs

- [x] Fix handful of test failures in the MemoryManagerSuites.
- [x] Fix remaining TODO comments in code.
- [ ] Document new configuration.
- [x] Fix commented-out tests / asserts:
  - [x] UnifiedMemoryManagerSuite.
- [x] Write tests that exercise the new off-heap memory management policies.

Author: Josh Rosen <joshrosen@databricks.com> Closes #9344 from JoshRosen/offheap-memory-accounting.
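To make the bookkeeping idea concrete, here is a minimal sketch of a resizable memory pool (field and method names are illustrative, not Spark's exact internals):

```scala
// Minimal sketch of a resizable memory pool: tracks a pool size and the
// amount currently used, and lets the memory manager grow/shrink the pool.
abstract class MemoryPool {
  private var _poolSize: Long = 0L
  protected var memoryUsed: Long = 0L

  def poolSize: Long = _poolSize
  def memoryFree: Long = _poolSize - memoryUsed

  // Pools are not fixed-size: the MemoryManager dynamically shifts capacity
  // between storage and execution by resizing the pools.
  def incrementPoolSize(delta: Long): Unit = { _poolSize += delta }
  def decrementPoolSize(delta: Long): Unit = {
    require(delta <= memoryFree, "cannot shrink a pool below its used memory")
    _poolSize -= delta
  }
}
```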
Michael Armbrust authored
#9527 missed updating the python tests. Author: Michael Armbrust <michael@databricks.com> Closes #9533 from marmbrus/hotfixTextValue.
navis.ryu authored
SparkExecuteStatementOperation logs the result schema on every getNextRowSet() call, which by default happens every 1000 rows, overwhelming the whole log file. Author: navis.ryu <navis@apache.org> Closes #9514 from navis/SPARK-11546.
Herman van Hovell authored
The second PR for SPARK-9241; this adds support for multiple distinct columns to the new aggregation code path. This PR solves the multiple DISTINCT column problem by rewriting these Aggregates into an Expand-Aggregate-Aggregate combination. See the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-9241) for some information on this. The advantages over the - competing - [first PR](https://github.com/apache/spark/pull/9280) are:
- This can use the faster TungstenAggregate code path.
- It is impossible to OOM due to an `OpenHashSet` allocating too much memory.

However, this will multiply the number of input rows by the number of distinct clauses (plus one), and puts a lot more memory pressure on the aggregation code path itself. The location of this Rule is a bit funny, and should probably change when the old aggregation path is changed. cc yhuai - Could you also tell me where to add tests for this? Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9406 from hvanhovell/SPARK-9241-rewriter.
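To make the rewrite concrete, this is the kind of query it targets (table and column names are hypothetical, and this assumes a Spark 1.x `sqlContext` with a registered table `t`). Conceptually, Expand emits one copy of each input row per DISTINCT clause plus one for the regular aggregates, a first Aggregate de-duplicates the distinct values per clause, and a second Aggregate computes the final results:

```scala
// Hypothetical query with two different DISTINCT columns and a regular
// aggregate in one Aggregate -- the shape the Expand-based rewrite handles.
val result = sqlContext.sql(
  "SELECT count(DISTINCT a), count(DISTINCT b), sum(c) FROM t GROUP BY key")
```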
Nong Li authored
…ithinPartitions. Author: Nong Li <nong@databricks.com> Closes #9504 from nongli/spark-11410.
Wenchen Fan authored
This simply brings https://github.com/apache/spark/pull/9358 up-to-date. Author: Wenchen Fan <wenchen@databricks.com> Author: Reynold Xin <rxin@databricks.com> Closes #9528 from rxin/dataset-java.
Thomas Graves authored
I tested the various ways of specifying the number of executors with both spark-submit and spark-class, in both client and cluster mode where applicable: --num-workers, --num-executors, spark.executor.instances, SPARK_EXECUTOR_INSTANCES, and the default with nothing supplied. Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com> Closes #9523 from tgravescs/SPARK-11555.
Xiangrui Meng authored
This PR implements the default save/load for non-meta estimators and transformers using the JSON serialization of param values. The saved metadata includes:
* class name
* uid
* timestamp
* paramMap

The save/load interface is similar to DataFrames. We use the current active context by default, which should be sufficient for most use cases.

~~~scala
instance.save("path")
instance.write.context(sqlContext).overwrite().save("path")
Instance.load("path")
~~~

The param handling is different from the design doc: we didn't save default and user-set params separately, and when we load them back, all parameters are user-set. This does cause issues, but it would also cause other issues if we modified the default params. TODOs:
* [x] Java test
* [ ] a follow-up PR to implement default save/load for all non-meta estimators and transformers

cc jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9454 from mengxr/SPARK-11217.
Reynold Xin authored
Author: Reynold Xin <rxin@databricks.com> Closes #9527 from rxin/SPARK-11561.
Herman van Hovell authored
This PR enables the Expand operator to process and produce Unsafe Rows. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9414 from hvanhovell/SPARK-11450.
Imran Rashid authored
https://issues.apache.org/jira/browse/SPARK-10116 This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`. mengxr mkolod Author: Imran Rashid <irashid@cloudera.com> Closes #8314 from squito/SPARK-10116.
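A minimal sketch of the idea behind the fix (illustrative, not Spark's exact code): derive both 32-bit halves of the returned `long` from hashes of the seed bytes, so the high word also gets random bits instead of staying correlated with the raw seed.

```scala
import java.nio.ByteBuffer
import scala.util.hashing.MurmurHash3

// Hash the seed bytes twice with different hash seeds so that BOTH 32-bit
// halves of the resulting long contain random bits, not just the low word.
def hashSeed(seed: Long): Long = {
  val bytes = ByteBuffer.allocate(8).putLong(seed).array()
  val low   = MurmurHash3.bytesHash(bytes, 42)
  val high  = MurmurHash3.bytesHash(bytes, low)
  (high.toLong << 32) | (low.toLong & 0xFFFFFFFFL)
}
```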
Jacek Laskowski authored
Author: Jacek Laskowski <jacek.laskowski@deepsense.io> Closes #9501 from jaceklaskowski/typos-with-style.
Yin Huai authored
[SPARK-9858][SQL] Add an ExchangeCoordinator to estimate the number of post-shuffle partitions for aggregates and joins (follow-up) https://issues.apache.org/jira/browse/SPARK-9858 This PR is the follow-up work of https://github.com/apache/spark/pull/9276. It addresses JoshRosen's comments. Author: Yin Huai <yhuai@databricks.com> Closes #9453 from yhuai/numReducer-followUp.
Cheng Lian authored
This PR adds test cases that test various column pruning and filter push-down cases. Author: Cheng Lian <lian@databricks.com> Closes #9468 from liancheng/spark-10978.follow-up.
Liang-Chi Hsieh authored
JIRA: https://issues.apache.org/jira/browse/SPARK-9162 Currently ScalaUDF extends CodegenFallback and doesn't provide a code generation implementation. This patch implements code generation for ScalaUDF. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9270 from viirya/scalaudf-codegen.
Shixiong Zhu authored
Just ignored `InputDStream`s that have null `rememberDuration` in `DStreamGraph.getMaxInputStreamRememberDuration`. Author: Shixiong Zhu <shixiong@databricks.com> Closes #9476 from zsxwing/SPARK-11511.
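A minimal sketch of the idea (illustrative only; `rememberDuration` is internal to Spark Streaming, so this is not the exact patch):

```scala
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.dstream.InputDStream

// Skip input streams whose rememberDuration has not been set (still null)
// instead of letting the null break the max computation.
def maxRememberDuration(streams: Seq[InputDStream[_]]): Duration =
  streams.map(_.rememberDuration).filter(_ != null).maxBy(_.milliseconds)
```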
Wenchen Fan authored
A cleanup for https://github.com/apache/spark/pull/9085. `DecimalLit` is very similar to `FloatLit`; we can just keep one of them. Also added a low-level unit test in `SqlParserSuite`. Author: Wenchen Fan <wenchen@databricks.com> Closes #9482 from cloud-fan/parser.
Reynold Xin authored
[SPARK-11541][SQL] Break JdbcDialects.scala into multiple files and mark various dialects as private. Author: Reynold Xin <rxin@databricks.com> Closes #9511 from rxin/SPARK-11541.
- Nov 05, 2015
Michael Armbrust authored
This PR adds the ability to do typed SQL aggregations. We will likely also want to provide an interface to allow users to do aggregations on objects, but this is deferred to another PR.

```scala
val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS()
ds.groupBy(_._1).agg(sum("_2").as[Int]).collect()
res0: Array(("a", 30), ("b", 3), ("c", 1))
```

Author: Michael Armbrust <michael@databricks.com> Closes #9499 from marmbrus/dataset-agg.
Davies Liu authored
This brings support for off-heap memory for the arrays inside BytesToBytesMap and InMemorySorter, so that we can allocate all execution memory off-heap. Closes #8068 Author: Davies Liu <davies@databricks.com> Closes #9477 from davies/unsafe_timsort.
Reynold Xin authored
Author: Reynold Xin <rxin@databricks.com> Closes #9509 from rxin/SPARK-11540.
Marcelo Vanzin authored
sbt's version resolution code always picks the most recent version, and we don't want that for guava. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9508 from vanzin/SPARK-11538.
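A minimal sketch of pinning a dependency in sbt so that "latest version wins" conflict resolution cannot upgrade it (the guava coordinates and version here are illustrative):

```scala
// build.sbt (illustrative): force a specific guava version so sbt's
// conflict resolution does not silently pick a newer one.
dependencyOverrides += "com.google.guava" % "guava" % "14.0.1"
```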
jerryshao authored
Currently the Yarn AM proxy filter configuration is recovered from the checkpoint file when a Spark Streaming application is restarted, which leads to some unwanted behaviors:
1. A wrong RM address if the RM is redeployed after a failure.
2. A wrong proxyBase, since the app id is updated and the old app id used for proxyBase is stale.

So instead of recovering them from the checkpoint file, these configurations should be reloaded each time the app starts. This problem only exists in Yarn cluster mode; in Yarn client mode these configurations are updated with the RPC message `AddWebUIFilter`. Please help to review tdas harishreedharan vanzin, thanks a lot. Author: jerryshao <sshao@hortonworks.com> Closes #9412 from jerryshao/SPARK-11457.
Yu ISHIKAWA authored
cc jkbradley Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9486 from yu-iskw/SPARK-11514.
Reynold Xin authored
This reverts commit 9cf56c96.
Davies Liu authored
Currently, if the Timestamp is before the epoch (1970/01/01), the hours, minutes and seconds will be negative (also rounding up). Author: Davies Liu <davies@databricks.com> Closes #9502 from davies/neg_hour.
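A minimal illustration of the underlying arithmetic (not the actual patch): truncating division/modulo on a negative seconds-since-epoch value produces negative time-of-day parts, while floored arithmetic keeps them in the expected range.

```scala
// 1969-12-31 22:59:59 UTC: one hour and one second before the epoch.
val secs = -3601L

val truncatedHour = (secs % 86400L) / 3600L              // -1: wrong, negative
val flooredHour   = Math.floorMod(secs, 86400L) / 3600L  // 22: the expected hour
```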
Davies Liu authored
Because deparse() breaks long strings into multiple lines, deserialization will fail. Author: Davies Liu <davies@databricks.com> Closes #9510 from davies/fix_glm.
Reynold Xin authored
[SPARK-11536][SQL] Remove the internal implicit conversion from Expression to Column in functions.scala Author: Reynold Xin <rxin@databricks.com> Closes #9505 from rxin/SPARK-11536.
Wenchen Fan authored
The main problem is that we interpret column names with special handling of `.` for DataFrame. This enables us to write something like `df("a.b")` to get the field `b` of `a`. However, we don't need this feature in `DataFrame.apply("*")` or `DataFrame.withColumnRenamed`. In these 2 cases the column name is already the final name, and we don't need extra processing to interpret it. The solution is simple: use `queryExecution.analyzed.output` to get the resolved columns directly, instead of using `DataFrame.resolve`. close https://github.com/apache/spark/pull/8811 Author: Wenchen Fan <wenchen@databricks.com> Closes #9462 from cloud-fan/special-chars.
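A hypothetical illustration of the two behaviors (given some DataFrame `df`; the column and field names are made up):

```scala
// With a struct column `a` containing a field `b`, the dot is interpreted:
df("a.b")                          // resolves field `b` of struct column `a`

// With a top-level column literally named "a.b", renaming should take the
// name as-is instead of attempting nested-field resolution:
df.withColumnRenamed("a.b", "ab")
```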
adrian555 authored
Author: adrian555 <wzhuang@us.ibm.com> Author: Adrian Zhuang <adrian555@users.noreply.github.com> Closes #9443 from adrian555/with.
Reynold Xin authored
Author: Reynold Xin <rxin@databricks.com> Closes #9500 from rxin/SPARK-11532.