- Nov 09, 2015
-
-
Nick Buroojy authored
For now they are thin wrappers around the corresponding Hive UDAFs. One limitation with these in Hive 0.13.0 is that they only support aggregating primitive types. I chose snake_case here instead of camelCase because it seems to be used in the majority of multi-word functions. Do we also want to add these to `functions.py`? This approach was recommended here: https://github.com/apache/spark/pull/8592#issuecomment-154247089 marmbrus rxin Author: Nick Buroojy <nick.buroojy@civitaslearning.com> Closes #9526 from nburoojy/nick/udaf-alias. (cherry picked from commit a6ee4f98) Signed-off-by: Michael Armbrust <michael@databricks.com>
-
Rishabh Bhardwaj authored
Kindly review the changes. Author: Rishabh Bhardwaj <rbnext29@gmail.com> Closes #9519 from rishabhbhardwaj/SPARK-11337.
-
sachin aggarwal authored
I have tested it locally and it is working fine; please review. Author: sachin aggarwal <different.sachin@gmail.com> Closes #9539 from agsachin/SPARK-11552-real.
-
Felix Bechstein authored
This change rejects offers for slaves with unmet constraints for 120s to mitigate offer starvation. This prevents Mesos from sending us these offers again and again; in return, we get more offers for slaves which might meet our constraints, and it enables Mesos to send the rejected offers to other frameworks. Author: Felix Bechstein <felix.bechstein@otto.de> Closes #8639 from felixb/decline_offers_constraint_mismatch.
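A minimal sketch of the mechanism, assuming the Mesos Java API's `SchedulerDriver.declineOffer(OfferID, Filters)` and a hypothetical constraint check; illustrative only, not the actual patch:
```scala
import org.apache.mesos.Protos.{Filters, Offer}
import org.apache.mesos.SchedulerDriver

// Illustrative sketch: decline offers from slaves that do not satisfy our constraints
// with a refuse-seconds filter, so Mesos withholds those offers for a while and can
// route them to other frameworks. `meetsConstraints` is a hypothetical helper.
def handleOffer(driver: SchedulerDriver, offer: Offer, meetsConstraints: Offer => Boolean): Unit = {
  if (!meetsConstraints(offer)) {
    val rejectDuration = Filters.newBuilder().setRefuseSeconds(120.0).build()
    driver.declineOffer(offer.getId, rejectDuration)
  }
}
```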
-
Yu ISHIKAWA authored
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8690 from yu-iskw/SPARK-10280.
-
Bharat Lal authored
Author: Bharat Lal <bharat.iisc@gmail.com> Closes #9560 from bharatl/SPARK-11581.
-
chriskang90 authored
1) kafkaStreams is a list. The list should be unpacked when passing it into the streaming context union method, which accepts a variable number of streams. 2) print() should be pprint() for pyspark. This contribution is my original work, and I license the work to the project under the project's open source license. Author: chriskang90 <jckang@uchicago.edu> Closes #9545 from c-kang/streaming_python_typo.
-
felixcheung authored
Make sample test less flaky by setting the seed. Tested with
```
repeat {
  if (count(sample(df, FALSE, 0.1)) == 3) {
    break
  }
}
```
Author: felixcheung <felixcheung_m@hotmail.com> Closes #9549 from felixcheung/rsample.
-
tedyu authored
As shown in https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1946/console , compilation fails with:
```
[error] /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/storage/RDDInfo.scala:25: in class RDDInfo, multiple overloaded alternatives of constructor RDDInfo define default arguments.
[error] class RDDInfo(
[error]
```
This PR tries to fix the compilation error. Author: tedyu <yuzhihong@gmail.com> Closes #9538 from tedyu/master.
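For context, a minimal sketch of the Scala rule being hit (a hypothetical class, not the actual `RDDInfo` source):
```scala
// Scala allows default arguments on at most one overloaded alternative of a
// constructor. The primary constructor below already defines a default...
class Info(val id: Int, val name: String, val scope: String = "") {
  // ...so uncommenting this auxiliary constructor, which also defines a default,
  // reproduces "multiple overloaded alternatives of constructor Info define default arguments".
  // def this(id: Int, name: String = "unknown") = this(id, name, "")
}
```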
-
Charles Yeh authored
I looked at the other endpoints, and they don't seem to be missing any fields. Added fields:  Author: Charles Yeh <charlesyeh@dropbox.com> Closes #9472 from CharlesYeh/api_vars.
-
fazlan-nazeem authored
The PMML models currently generated do not specify the PMML version in their root node. This is a problem when using these models in other tools, because those tools expect the version attribute to be set explicitly. This fix adds the PMML version attribute to the generated models and sets its value to 4.2. Author: fazlan-nazeem <fazlann@wso2.com> Closes #9558 from fazlan-nazeem/master.
-
Yanbo Liang authored
Add user guide and example code for ```AFTSurvivalRegression```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9491 from yanboliang/spark-10689.
-
Yanbo Liang authored
Expose R-like summary statistics in SparkR::glm for linear regression. The output of ```summary``` looks like:
```
$DevianceResiduals
 Min         Max
 -0.9509607  0.7291832

$Coefficients
                    Estimate    Std. Error  t value    Pr(>|t|)
(Intercept)         1.6765      0.2353597   7.123139   4.456124e-11
Sepal_Length        0.3498801   0.04630128  7.556598   4.187317e-12
Species_versicolor  -0.9833885  0.07207471  -13.64402  0
Species_virginica   -1.00751    0.09330565  -10.79796  0
```
Author: Yanbo Liang <ybliang8@gmail.com> Closes #9561 from yanboliang/spark-11494.
-
Rohit Agarwal authored
It doesn't show up as a hyperlink currently. It will show up as a hyperlink after this change. Author: Rohit Agarwal <mindprince@gmail.com> Closes #9544 from mindprince/patch-2.
-
Charles Yeh authored
Addressing https://issues.apache.org/jira/browse/SPARK-11218, mostly copied start-thriftserver.sh.
```
charlesyeh-mbp:spark charlesyeh$ ./sbin/start-master.sh --help
Usage: Master [options]

Options:
  -i HOST, --ip HOST      Hostname to listen on (deprecated, please use --host or -h)
  -h HOST, --host HOST    Hostname to listen on
  -p PORT, --port PORT    Port to listen on (default: 7077)
  --webui-port PORT       Port for web UI (default: 8080)
  --properties-file FILE  Path to a custom Spark properties file. Default is conf/spark-defaults.conf.
```
```
charlesyeh-mbp:spark charlesyeh$ ./sbin/start-slave.sh
Usage: Worker [options] <master>

Master must be a URL of the form spark://hostname:port

Options:
  -c CORES, --cores CORES  Number of cores to use
  -m MEM, --memory MEM     Amount of memory to use (e.g. 1000M, 2G)
  -d DIR, --work-dir DIR   Directory to run apps in (default: SPARK_HOME/work)
  -i HOST, --ip IP         Hostname to listen on (deprecated, please use --host or -h)
  -h HOST, --host HOST     Hostname to listen on
  -p PORT, --port PORT     Port to listen on (default: random)
  --webui-port PORT        Port for web UI (default: 8081)
  --properties-file FILE   Path to a custom Spark properties file. Default is conf/spark-defaults.conf.
```
Author: Charles Yeh <charlesyeh@dropbox.com> Closes #9432 from CharlesYeh/helpmsg.
-
- Nov 08, 2015
-
-
Wenchen Fan authored
The reason is that: 1. For a partitioned Hive table, we move the partition columns after the data columns (e.g. `<a: Int, b: Int>` partitioned by `a` becomes `<b: Int, a: Int>`). 2. When appending data to a table, we use position to figure out how to match input columns to the table's columns. So when we append data to a partitioned table, we match the wrong columns between the input and the table. A solution is to reorder the input columns before matching by position, like what we did for [`InsertIntoHadoopFsRelation`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L101-L105). Author: Wenchen Fan <wenchen@databricks.com> Closes #9408 from cloud-fan/append.
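A hedged illustration of the mismatch described above; the table and column names are made up, and the snippet only sketches the failure mode:
```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Illustrative only: a table declared with data column `b` and partition column `a`
// is stored with schema <b, a>, because partition columns are moved after data columns.
def illustrateMismatch(sqlContext: SQLContext, df: DataFrame): Unit = {
  sqlContext.sql("CREATE TABLE t (b INT) PARTITIONED BY (a INT)")

  // Before this fix, appending a DataFrame with columns ordered (a, b) was matched
  // purely by position against <b, a>, so a's values ended up in column b and vice versa.
  df.select("a", "b").write.mode("append").insertInto("t")
}
```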
-
Reynold Xin authored
A few changes: 1. Removed fold, since it can be confusing for distributed collections. 2. Created specific interfaces for each Dataset function (e.g. MapFunction, ReduceFunction, MapPartitionsFunction) 3. Added more documentation and test cases. The other thing I'm considering doing is to have a "collector" interface for FlatMapFunction and MapPartitionsFunction, similar to MapReduce's map function. Author: Reynold Xin <rxin@databricks.com> Closes #9531 from rxin/SPARK-11564.
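A rough sketch of what such per-operation interfaces look like; this is a simplified assumption of their shape, not the exact definitions added by the PR:
```scala
// Simplified sketch: one single-abstract-method interface per Dataset operation,
// so each can be implemented concisely (e.g. as a Java 8 lambda) without the
// ambiguity of a single generic function type.
trait MapFunction[T, U] extends Serializable {
  def call(value: T): U
}

trait ReduceFunction[T] extends Serializable {
  def call(v1: T, v2: T): T
}
```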
-
Wenchen Fan authored
Author: Wenchen Fan <wenchen@databricks.com> Closes #9521 from cloud-fan/map.
-
xin Wu authored
Doc change to align with the HiveConf default for where the `warehouse` directory is created. Author: xin Wu <xinwu@us.ibm.com> Closes #9365 from xwu0226/spark-10046-commit.
-
Herman van Hovell authored
This PR adds support for multiple columns in a single count distinct aggregate to the new aggregation path. cc yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9409 from hvanhovell/SPARK-11451.
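For reference, the query shape this covers is a single count-distinct over several columns; the table and column names below are made up:
```scala
import org.apache.spark.sql.SQLContext

// Illustrative query shape only: one COUNT(DISTINCT ...) spanning two columns.
def countDistinctPairs(sqlContext: SQLContext) =
  sqlContext.sql("SELECT COUNT(DISTINCT user_id, item_id) AS distinct_pairs FROM events")
```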
-
Rohit Agarwal authored
This snippet seems to have been mistakenly introduced in two places in #5348. Author: Rohit Agarwal <mindprince@gmail.com> Closes #9540 from mindprince/patch-1.
-
Sean Owen authored
Fix Python example to use normalRDD as advertised Author: Sean Owen <sowen@cloudera.com> Closes #9529 from srowen/SPARK-11476.
-
- Nov 07, 2015
-
-
Liang-Chi Hsieh authored
JIRA: https://issues.apache.org/jira/browse/SPARK-11362 We use scala.collection.mutable.BitSet in BroadcastNestedLoopJoin now. We should use Spark's BitSet. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9316 from viirya/use-spark-bitset.
-
Herman van Hovell authored
This PR is a follow up for PR https://github.com/apache/spark/pull/9406. It adds more documentation to the rewriting rule, removes a redundant if expression in the non-distinct aggregation path and adds a multiple distinct test to the AggregationQuerySuite. cc yhuai marmbrus Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9541 from hvanhovell/SPARK-9241-followup.
-
Yu ISHIKAWA authored
Could jkbradley and davies review it? - Create a wrapper class `LDAModelWrapper` for `LDAModel`, because we can't deal with the return value of `describeTopics` in Scala from PySpark directly; `Array[(Array[Int], Array[Double])]` is too complicated to convert. - Add `loadLDAModel` in `PythonMLlibAPI`, since `LDAModel` in Scala is an abstract class and we need to call `load` of `DistributedLDAModel`. [[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8643 from yu-iskw/SPARK-8467-2.
-
- Nov 06, 2015
-
-
Andrew Or authored
Screenshot: https://cloud.githubusercontent.com/assets/2133137/10870343/2a8cd070-807d-11e5-857a-4ebcace77b5b.png mateiz sarutak Author: Andrew Or <andrew@databricks.com> Closes #9398 from andrewor14/rdd-callsite.
-
Josh Rosen authored
In order to lay the groundwork for proper off-heap memory support in SQL / Tungsten, we need to extend our MemoryManager to perform bookkeeping for off-heap memory.

## User-facing changes

This PR introduces a new configuration, `spark.memory.offHeapSize` (name subject to change), which specifies the absolute amount of off-heap memory that Spark and Spark SQL can use. If Tungsten is configured to use off-heap execution memory for allocating data pages, then all data page allocations must fit within this size limit.

## Internals changes

This PR contains a lot of internal refactoring of the MemoryManager. The key change at the heart of this patch is the introduction of a `MemoryPool` class (name subject to change) to manage the bookkeeping for a particular category of memory (storage, on-heap execution, and off-heap execution). These MemoryPools are not fixed-size; they can be dynamically grown and shrunk according to the MemoryManager's policies. In StaticMemoryManager, these pools have fixed sizes, proportional to the legacy `[storage|shuffle].memoryFraction`. In the new UnifiedMemoryManager, the sizes of these pools are dynamically adjusted according to its policies.

There are two subclasses of `MemoryPool`: `StorageMemoryPool` manages storage memory and `ExecutionMemoryPool` manages execution memory. The MemoryManager creates two execution pools, one for on-heap memory and one for off-heap. Instances of `ExecutionMemoryPool` manage the logic for fair sharing of their pooled memory across running tasks (in other words, the ShuffleMemoryManager-like logic has been moved out of MemoryManager and pushed into these ExecutionMemoryPool instances). I think that this design is substantially easier to understand and reason about than the previous design, where most of these responsibilities were handled by MemoryManager and its subclasses. To see this, take a look at how simple the logic in `UnifiedMemoryManager` has become: it's now very easy to see when memory is dynamically shifted between storage and execution.

## TODOs

- [x] Fix handful of test failures in the MemoryManagerSuites.
- [x] Fix remaining TODO comments in code.
- [ ] Document new configuration.
- [x] Fix commented-out tests / asserts:
  - [x] UnifiedMemoryManagerSuite.
- [x] Write tests that exercise the new off-heap memory management policies.

Author: Josh Rosen <joshrosen@databricks.com> Closes #9344 from JoshRosen/offheap-memory-accounting.
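A minimal sketch of the pool bookkeeping described above, simplified from the description rather than copied from the Spark code:
```scala
// Simplified sketch of the MemoryPool idea: each pool tracks its own capacity and
// usage, and the MemoryManager grows or shrinks pools by moving capacity between them.
abstract class MemoryPool(lock: Object) {
  private[this] var _poolSize: Long = 0

  def poolSize: Long = lock.synchronized { _poolSize }
  def memoryUsed: Long
  def memoryFree: Long = lock.synchronized { _poolSize - memoryUsed }

  def incrementPoolSize(delta: Long): Unit = lock.synchronized {
    require(delta >= 0)
    _poolSize += delta
  }

  def decrementPoolSize(delta: Long): Unit = lock.synchronized {
    require(delta >= 0 && delta <= _poolSize && _poolSize - delta >= memoryUsed)
    _poolSize -= delta
  }
}
```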
-
Michael Armbrust authored
#9527 missed updating the python tests. Author: Michael Armbrust <michael@databricks.com> Closes #9533 from marmbrus/hotfixTextValue.
-
navis.ryu authored
SparkExecuteStatementOperation logs the result schema for each getNextRowSet() call, which by default happens every 1000 rows, overwhelming the whole log file. Author: navis.ryu <navis@apache.org> Closes #9514 from navis/SPARK-11546.
-
Herman van Hovell authored
The second PR for SPARK-9241, this adds support for multiple distinct columns to the new aggregation code path. This PR solves the multiple DISTINCT column problem by rewriting these Aggregates into an Expand-Aggregate-Aggregate combination. See the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-9241) for some information on this. The advantages over the competing [first PR](https://github.com/apache/spark/pull/9280) are:
- This can use the faster TungstenAggregate code path.
- It is impossible to OOM due to an ```OpenHashSet``` allocating too much memory.
However, this will multiply the number of input rows by the number of distinct clauses (plus one), and puts a lot more memory pressure on the aggregation code path itself. The location of this Rule is a bit funny, and should probably change when the old aggregation path is changed. cc yhuai - Could you also tell me where to add tests for this? Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9406 from hvanhovell/SPARK-9241-rewriter.
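To make the rewrite concrete, a hedged example of the query shape it targets (table and column names are made up): each distinct clause gets its own expanded copy of the input rows, followed by a per-clause partial aggregate and a final aggregate that stitches the results together.
```scala
import org.apache.spark.sql.SQLContext

// Illustrative only: a query with two different DISTINCT clauses plus a regular
// aggregate. Conceptually, Expand replicates each input row once per distinct clause
// (plus one for the non-distinct aggregates), which is the memory trade-off noted above.
def multiDistinct(sqlContext: SQLContext) =
  sqlContext.sql(
    """SELECT key,
      |       COUNT(DISTINCT cat1),
      |       COUNT(DISTINCT cat2),
      |       SUM(value)
      |FROM data
      |GROUP BY key""".stripMargin)
```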
-
Nong Li authored
…ithinPartitions. Author: Nong Li <nong@databricks.com> Closes #9504 from nongli/spark-11410.
-
Wenchen Fan authored
This simply brings https://github.com/apache/spark/pull/9358 up-to-date. Author: Wenchen Fan <wenchen@databricks.com> Author: Reynold Xin <rxin@databricks.com> Closes #9528 from rxin/dataset-java.
-
Thomas Graves authored
I tested the various options for specifying the number of executors with both spark-submit and spark-class, in both client and cluster mode where they applied: --num-workers, --num-executors, spark.executor.instances, SPARK_EXECUTOR_INSTANCES, and the default with nothing supplied. Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com> Closes #9523 from tgravescs/SPARK-11555.
-
Xiangrui Meng authored
This PR implements the default save/load for non-meta estimators and transformers using the JSON serialization of param values. The saved metadata includes:
* class name
* uid
* timestamp
* paramMap

The save/load interface is similar to DataFrames. We use the current active context by default, which should be sufficient for most use cases.
~~~scala
instance.save("path")
instance.write.context(sqlContext).overwrite().save("path")
Instance.load("path")
~~~
The param handling is different from the design doc. We didn't save default and user-set params separately, and when we load it back, all parameters are user-set. This does cause issues. But it would also cause other issues if we modified the default params. TODOs:
* [x] Java test
* [ ] a follow-up PR to implement default save/load for all non-meta estimators and transformers

cc jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9454 from mengxr/SPARK-11217.
-
Reynold Xin authored
Author: Reynold Xin <rxin@databricks.com> Closes #9527 from rxin/SPARK-11561.
-
Herman van Hovell authored
This PR enables the Expand operator to process and produce Unsafe Rows. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9414 from hvanhovell/SPARK-11450.
-
Imran Rashid authored
https://issues.apache.org/jira/browse/SPARK-10116 This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`. mengxr mkolod Author: Imran Rashid <irashid@cloudera.com> Closes #8314 from squito/SPARK-10116.
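A hedged sketch of the idea behind the fix, not necessarily the exact patch: hash the seed bytes into both halves of the resulting long so the high bits are random as well.
```scala
import java.nio.ByteBuffer
import scala.util.hashing.MurmurHash3

// Illustrative sketch: derive two independent 32-bit hashes from the seed bytes and
// combine them, so all 64 bits of the returned long are well mixed (instead of only
// the low 32 bits, as a plain Int-to-Long conversion would give).
def hashSeed(seed: Long): Long = {
  val bytes = ByteBuffer.allocate(8).putLong(seed).array()
  val lowBits = MurmurHash3.bytesHash(bytes)
  val highBits = MurmurHash3.bytesHash(bytes, lowBits)
  (highBits.toLong << 32) | (lowBits.toLong & 0xffffffffL)
}
```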
-
Jacek Laskowski authored
Author: Jacek Laskowski <jacek.laskowski@deepsense.io> Closes #9501 from jaceklaskowski/typos-with-style.
-
Yin Huai authored
[SPARK-9858][SQL] Add an ExchangeCoordinator to estimate the number of post-shuffle partitions for aggregates and joins (follow-up) https://issues.apache.org/jira/browse/SPARK-9858 This PR is the follow-up work of https://github.com/apache/spark/pull/9276. It addresses JoshRosen's comments. Author: Yin Huai <yhuai@databricks.com> Closes #9453 from yhuai/numReducer-followUp.
-
Cheng Lian authored
This PR adds test cases that test various column pruning and filter push-down cases. Author: Cheng Lian <lian@databricks.com> Closes #9468 from liancheng/spark-10978.follow-up.
-