Commits · 689386b1c60997e4505749915f7005a52c207de2 · cs525-sp18-g07 / spark

Nov 10, 2015

[SPARK-7841][BUILD] Stop using retrieveManaged to retrieve dependencies in SBT · 689386b1

Josh Rosen authored 9 years ago

This patch modifies Spark's SBT build so that it no longer uses `retrieveManaged` / `lib_managed` to store its dependencies. The motivations for this change are nicely described on the JIRA ticket ([SPARK-7841](https://issues.apache.org/jira/browse/SPARK-7841)); my personal interest in doing this stems from the fact that `lib_managed` has caused me some pain while debugging dependency issues in another PR of mine.

Removing our use of `lib_managed` would be trivial except for one snag: the Datanucleus JARs, required by Spark SQL's Hive integration, cannot be included in assembly JARs due to problems with merging OSGI `plugin.xml` files. As a result, several places in the packaging and deployment pipeline assume that these Datanucleus JARs are copied to `lib_managed/jars`. In the interest of maintaining compatibility, I have chosen to retain the `lib_managed/jars` directory _only_ for these Datanucleus JARs and have added custom code to `SparkBuild.scala` to automatically copy those JARs to that folder as part of the `assembly` task.
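
As a rough illustration of the copy step described above (not the actual code in `SparkBuild.scala`), the sketch below copies JARs whose file names start with `datanucleus-` into a target directory. The object name, the helper `copyDatanucleusJars`, and the file-name prefix are assumptions made for this example.

```scala
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

object CopyDatanucleusJars {
  /** Copies every JAR whose name starts with "datanucleus-" into destDir.
    * Hypothetical helper: the prefix and the name are assumptions for illustration. */
  def copyDatanucleusJars(classpath: Seq[File], destDir: File): Seq[File] = {
    Files.createDirectories(destDir.toPath)
    classpath
      .filter(f => f.getName.startsWith("datanucleus-") && f.getName.endsWith(".jar"))
      .map { jar =>
        val target = new File(destDir, jar.getName)
        // Overwrite stale copies so repeated builds stay consistent.
        Files.copy(jar.toPath, target.toPath, StandardCopyOption.REPLACE_EXISTING)
        target
      }
  }
}
```

In a real build the classpath would come from the build tool; here it is just a list of files passed in by the caller.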

`dev/mima` also depended on `lib_managed` in a hacky way in order to set classpaths when generating MiMa excludes; I've updated this to obtain the classpaths directly from SBT instead.

/cc dragos marmbrus pwendell srowen

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9575 from JoshRosen/SPARK-7841.

689386b1

[SPARK-11382] Replace example code in mllib-decision-tree.md using include_example · a81f47ff

Xusen Yin authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-11382

BTW, I fixed an error in `naive_bayes_example.py`.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9596 from yinxusen/SPARK-11382.

a81f47ff

Fix typo in driver page · 5507a9d0

Paul Chandler authored 9 years ago

"Comamnd property" => "Command property"

Author: Paul Chandler <pestilence669@users.noreply.github.com>

Closes #9578 from pestilence669/fix_spelling.

5507a9d0

[SPARK-11598] [SQL] enable tests for ShuffledHashOuterJoin · 521b3cae

Davies Liu authored 9 years ago

Author: Davies Liu <davies@databricks.com>

Closes #9573 from davies/join_condition.

521b3cae

[SPARK-11599] [SQL] fix NPE when resolve Hive UDF in SQLParser · d6cd3a18

Davies Liu authored 9 years ago

The DataFrame APIs that take a SQL expression always use `SQLParser`, so `HiveFunctionRegistry` is called outside of the Hive state, which causes an NPE if there is no active session state for the current thread (as in PySpark).

cc rxin yhuai

Author: Davies Liu <davies@databricks.com>

Closes #9576 from davies/hive_udf.

d6cd3a18

Nov 09, 2015

[SPARK-11587][SPARKR] Fix the summary generic to match base R · c4e19b38

Shivaram Venkataraman authored 9 years ago

The signature is summary(object, ...) as defined in
https://stat.ethz.ch/R-manual/R-devel/library/base/html/summary.html

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #9582 from shivaram/summary-fix.

c4e19b38

Add mockito as an explicit test dependency to spark-streaming · 1431319e

Burak Yavuz authored 9 years ago

While sbt compiles successfully because it pulls in the mockito dependency properly, Maven builds are broken. We need this in ASAP.
tdas

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #9584 from brkyvz/fix-master.

1431319e

[SPARK-11333][STREAMING] Add executorId to ReceiverInfo and display it in UI · 6502944f

Shixiong Zhu authored 9 years ago

Expose `executorId` in `ReceiverInfo` and the UI, since it's helpful when multiple executors are running on the same host. Screenshot:

<img width="1058" alt="screen shot 2015-11-02 at 10 52 19 am" src="https://cloud.githubusercontent.com/assets/1000778/10890968/2e2f5512-8150-11e5-8d9d-746e826b69e8.png">

Author: Shixiong Zhu <shixiong@databricks.com>
Author: zsxwing <zsxwing@gmail.com>

Closes #9418 from zsxwing/SPARK-11333.

6502944f

[SPARK-11462][STREAMING] Add JavaStreamingListener · 1f0f14ef

zsxwing authored 9 years ago

Currently, `StreamingListener` is not Java friendly because it exposes some Scala collections, such as `Option` and `Map`, directly to Java users.

This PR added a Java version of StreamingListener and a bunch of Java friendly classes for Java users.
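
To make the idea of a Java-friendly class concrete, here is a minimal sketch of converting a Scala-side value that uses `Option` and `Map` into a mirror that uses a sentinel value and `java.util.Map`. The class names, the fields, and the `-1` sentinel are assumptions for illustration, not the classes added by this PR; the example assumes Scala 2.13 or later for `scala.jdk.CollectionConverters`.

```scala
import scala.jdk.CollectionConverters._

object JavaFriendlyInfo {
  // Scala-side value: uses Option and an immutable Scala Map.
  case class BatchInfo(batchTime: Long, processingDelay: Option[Long], records: Map[Int, Long])

  // Java-side mirror (hypothetical): -1 stands in for "absent", and the map is a java.util.Map.
  case class JavaBatchInfo(
      batchTime: Long,
      processingDelay: Long,
      records: java.util.Map[Integer, java.lang.Long])

  def toJava(info: BatchInfo): JavaBatchInfo =
    JavaBatchInfo(
      info.batchTime,
      info.processingDelay.getOrElse(-1L),
      // Box keys and values so Java callers see Integer and Long.
      info.records.map { case (k, v) => (Int.box(k), Long.box(v)) }.asJava)
}
```

A Java caller can then read `processingDelay()` and `records()` without touching `scala.Option` or `scala.collection.Map`.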

Author: zsxwing <zsxwing@gmail.com>
Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9420 from zsxwing/java-streaming-listener.

1f0f14ef

[SPARK-11141][STREAMING] Batch ReceivedBlockTrackerLogEvents for WAL writes · 0ce6f9b2

Burak Yavuz authored 9 years ago

When using S3 as a directory for WALs, the writes take too long. The driver gets very easily bottlenecked when multiple receivers send AddBlock events to the ReceiverTracker. This PR adds batching of events in the ReceivedBlockTracker so that receivers don't get blocked by the driver for too long.
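
The batching idea can be sketched in plain Scala as follows. This is a simplified model under stated assumptions, not the `ReceivedBlockTracker` code: `BatchedWriter` and its `write` callback are hypothetical names, and the example assumes Scala 2.13 or later. Producers enqueue events cheaply, and a single flush writes everything that is pending in one call, so a slow store is hit once per batch rather than once per event.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.jdk.CollectionConverters._

object BatchedLog {
  /** Queues events and hands everything pending to `write` in a single call. */
  class BatchedWriter[E](write: Seq[E] => Unit) {
    private val pending = new LinkedBlockingQueue[E]()

    /** Called by producers; only enqueues, never touches the slow store. */
    def add(event: E): Unit = pending.put(event)

    /** Drains all queued events and writes them as one batch; returns the batch size. */
    def flush(): Int = {
      val buffer = new java.util.ArrayList[E]()
      pending.drainTo(buffer)
      if (!buffer.isEmpty) write(buffer.asScala.toSeq)
      buffer.size
    }
  }
}
```

In this sketch the caller decides when to call `flush()`; a dedicated writer thread looping over `flush()` would be one way to drive it.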

cc zsxwing tdas

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #9143 from brkyvz/batch-wal-writes.

0ce6f9b2

[SPARK-11198][STREAMING][KINESIS] Support de-aggregation of records during recovery · 26062d22

Burak Yavuz authored 9 years ago

While the KCL handles de-aggregation during regular operation, during recovery we use the lower-level API and therefore need to de-aggregate the records ourselves.

tdas Testing is an issue: we need protobuf magic to produce the aggregated records. Maybe we could depend on the KPL for tests?

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #9403 from brkyvz/kinesis-deaggregation.

26062d22

[SPARK-11069][ML] Add RegexTokenizer option to convert to lowercase · 61f9c871

Yuhao Yang authored 9 years ago

jira: https://issues.apache.org/jira/browse/SPARK-11069

Quotes from the JIRA:

Tokenizer converts strings to lowercase automatically, but RegexTokenizer does not. It would be nice to add an option to RegexTokenizer to convert to lowercase. Proposal:

- call the Boolean Param "toLowercase"
- set default to false (so behavior does not change)

Actually sklearn converts to lowercase before tokenizing too
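
The proposed behavior can be modeled in a few lines of plain Scala. This is a sketch only, not the spark.ml `RegexTokenizer`: the function `tokenize` and its `pattern` parameter are hypothetical, while the `toLowercase` flag and its `false` default follow the proposal above.

```scala
object RegexTokenize {
  /** Splits `text` on `pattern`; lowercases the input first only when `toLowercase` is true. */
  def tokenize(text: String, pattern: String = "\\s+", toLowercase: Boolean = false): Seq[String] = {
    val input = if (toLowercase) text.toLowerCase else text
    // Drop empty tokens that arise from leading separators.
    input.split(pattern).filter(_.nonEmpty).toSeq
  }
}
```

With the default, `tokenize("Hello Spark")` keeps the original casing, so existing callers see no change in behavior.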

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9092 from hhbyyh/tokenLower.

61f9c871

[SPARK-11610][MLLIB][PYTHON][DOCS] Make the docs of LDAModel.describeTopics in Python more specific · 7dc9d8db

Yu ISHIKAWA authored 9 years ago

cc jkbradley

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #9577 from yu-iskw/SPARK-11610.

7dc9d8db

[SPARK-11564][SQL] Fix documentation for DataFrame.take/collect · 675c7e72

Reynold Xin authored 9 years ago

Author: Reynold Xin <rxin@databricks.com>

Closes #9557 from rxin/SPARK-11564-1.

675c7e72

[SPARK-11578][SQL] User API for Typed Aggregation · 9c740a9d

Michael Armbrust authored 9 years ago

This PR adds a new interface for user-defined aggregations, that can be used in `DataFrame` and `Dataset` operations to take all of the elements of a group and reduce them to a single value.

For example, the following aggregator extracts an `int` from a specific class and adds them up:

```scala
  case class Data(i: Int)

  val customSummer =  new Aggregator[Data, Int, Int] {
    def prepare(d: Data) = d.i
    def reduce(l: Int, r: Int) = l + r
    def present(r: Int) = r
  }.toColumn()

  val ds: Dataset[Data] = ...
  val aggregated = ds.select(customSummer)
```

By using helper functions, users can make a generic `Aggregator` that works on any input type:

```scala
/** An `Aggregator` that adds up any numeric type returned by the given function. */
class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with Serializable {
  val numeric = implicitly[Numeric[N]]
  override def zero: N = numeric.zero
  override def reduce(b: N, a: I): N = numeric.plus(b, f(a))
  override def present(reduction: N): N = reduction
}

def sum[I, N : Numeric : Encoder](f: I => N): TypedColumn[I, N] = new SumOf(f).toColumn
```

These aggregators can then be used alongside other built-in SQL aggregations.

```scala
val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS()
ds
  .groupBy(_._1)
  .agg(
    sum(_._2),                // The aggregator defined above.
    expr("sum(_2)").as[Int],  // A built-in dynamically typed aggregation.
    count("*"))               // A built-in statically typed aggregation.
  .collect()

res0: ("a", 30, 30, 2L), ("b", 3, 3, 2L), ("c", 1, 1, 1L)
```

The current implementation focuses on integrating this into the typed API, but for now it only supports running aggregations that return a single long value, as explained in `TypedAggregateExpression`. This will be improved in a follow-up PR.

Author: Michael Armbrust <michael@databricks.com>

Closes #9555 from marmbrus/dataset-useragg.

9c740a9d

[SPARK-11360][DOC] Loss of nullability when writing parquet files · 2f383788

gatorsmile authored 9 years ago

This fix adds one line to the documentation explaining the current behavior of Spark SQL when writing Parquet files: all columns are forced to be nullable for compatibility reasons.
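
The effect on a schema can be modeled with a small sketch. The `Field` case class below is a hypothetical stand-in for a schema field, not Spark's own schema types; it only illustrates that the nullability flag is lost while names and types are kept.

```scala
object ForceNullable {
  // Hypothetical, simplified schema field.
  case class Field(name: String, dataType: String, nullable: Boolean)

  /** Returns the schema as it would look after the write: every column nullable. */
  def asNullable(schema: Seq[Field]): Seq[Field] =
    schema.map(_.copy(nullable = true))
}
```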

Author: gatorsmile <gatorsmile@gmail.com>

Closes #9314 from gatorsmile/lossNull.

2f383788

[SPARK-9557][SQL] Refactor ParquetFilterSuite and remove old ParquetFilters code · 9565c246

hyukjinkwon authored 9 years ago

Actually, this was resolved by https://github.com/apache/spark/pull/8275.

However, I found that the JIRA issue is not marked as resolved, because the PR above was made for another issue even though it resolved both.

I commented that this is resolved by the PR above; however, I opened this PR because I would like to add a few small corrections.

In the previous PR, I refactored the test to just collect the filters instead of reducing them; however, this does not properly test the `And` filter (which is then never given to the tests). I unintentionally changed this from the original approach (before the refactoring).

In this PR, I just followed the original way of collecting filters by reducing them.
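
The difference between collecting and reducing can be shown with a small model. The `Filter` hierarchy below is a hypothetical stand-in defined for this sketch, not Spark's data source filter classes: reducing the collected filters yields a single `And` tree, so code that consumes the result is exercised on `And` as well.

```scala
object CollectFilters {
  // Hypothetical, simplified filter types.
  sealed trait Filter
  case class GreaterThan(attribute: String, value: Int) extends Filter
  case class LessThan(attribute: String, value: Int) extends Filter
  case class And(left: Filter, right: Filter) extends Filter

  /** Combines all collected filters into one; None when there is nothing to combine. */
  def combine(filters: Seq[Filter]): Option[Filter] =
    filters.reduceOption[Filter]((l, r) => And(l, r))
}
```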

I am happy to close this if this PR is inappropriate and somebody would prefer to deal with it in a separate, related PR.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #9554 from HyukjinKwon/SPARK-9557.

9565c246

[SPARK-11564][SQL][FOLLOW-UP] improve java api for GroupedDataset · fcb57e9c

Wenchen Fan authored 9 years ago

created `MapGroupFunction`, `FlatMapGroupFunction`, `CoGroupFunction`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9564 from cloud-fan/map.

fcb57e9c

[SPARK-6517][MLLIB] Implement the Algorithm of Hierarchical Clustering · 8a233689

Yu ISHIKAWA authored 9 years ago

I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation, or spark.ml APIs; I am going to send further PRs for those later.
https://issues.apache.org/jira/browse/SPARK-6517

- This implementation is based on bisecting k-means clustering.
    - It derives from freeman-lab's implementation.
- The basic idea is unchanged from the previous version (#2906).
    - However, it is 1000x faster than the previous version thanks to parallel processing.

Thank you for your great cooperation, RJ Nowling(rnowling), Jeremy Freeman(freeman-lab), Xiangrui Meng(mengxr) and Sean Owen(srowen).

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Yu ISHIKAWA <yu-iskw@users.noreply.github.com>

Closes #5267 from yu-iskw/new-hierarchical-clustering.

8a233689

[SPARK-11359][STREAMING][KINESIS] Checkpoint to DynamoDB even when new data doesn't come in · a3a7c910

Burak Yavuz authored 9 years ago

Currently, the checkpoints to DynamoDB occur only when new data comes in, as we update the clock for the checkpointState. This PR makes the checkpoint a scheduled execution based on the `checkpointInterval`.
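
A scheduled checkpoint of this kind can be sketched with a standard JDK scheduler. This is an assumption-laden illustration rather than the Kinesis receiver code: `ScheduledCheckpoint.start` and its `checkpoint` callback are hypothetical names, and the interval is given in milliseconds.

```scala
import java.util.concurrent.{Executors, TimeUnit}

object ScheduledCheckpoint {
  /** Runs `checkpoint` every `intervalMs` milliseconds, whether or not new data arrived.
    * Returns a function that stops the schedule. */
  def start(intervalMs: Long)(checkpoint: () => Unit): () => Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val task = new Runnable { def run(): Unit = checkpoint() }
    scheduler.scheduleAtFixedRate(task, intervalMs, intervalMs, TimeUnit.MILLISECONDS)
    () => { scheduler.shutdownNow(); () }
  }
}
```

Because the timer drives the callback, the checkpoint no longer depends on the arrival of records.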

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #9421 from brkyvz/kinesis-checkpoint.

a3a7c910

[SPARK-11595] [SQL] Fixes ADD JAR when the input path contains URL scheme · 150f6a89

Cheng Lian authored 9 years ago

Author: Cheng Lian <lian@databricks.com>

Closes #9569 from liancheng/spark-11595.fix-add-jar.

150f6a89

[SPARK-9301][SQL] Add collect_set and collect_list aggregate functions · f138cb87

Nick Buroojy authored 9 years ago

For now they are thin wrappers around the corresponding Hive UDAFs.

One limitation with these in Hive 0.13.0 is they only support aggregating primitive types.

I chose snake_case here instead of camelCase because it seems to be used in the majority of the multi-word fns.

Do we also want to add these to `functions.py`?

This approach was recommended here: https://github.com/apache/spark/pull/8592#issuecomment-154247089



marmbrus rxin

Author: Nick Buroojy <nick.buroojy@civitaslearning.com>

Closes #9526 from nburoojy/nick/udaf-alias.

(cherry picked from commit a6ee4f98)
Signed-off-by: Michael Armbrust <michael@databricks.com>

f138cb87

[SPARK-11548][DOCS] Replaced example code in mllib-collaborative-filtering.md using include_example · b7720fa4

Rishabh Bhardwaj authored 9 years ago

Kindly review the changes.

Author: Rishabh Bhardwaj <rbnext29@gmail.com>

Closes #9519 from rishabhbhardwaj/SPARK-11337.

b7720fa4

[SPARK-11552][DOCS][Replaced example code in ml-decision-tree.md using include_example] · 51d41e4b

sachin aggarwal authored 9 years ago

I have tested it locally and it works fine; please review.

Author: sachin aggarwal <different.sachin@gmail.com>

Closes #9539 from agsachin/SPARK-11552-real.

51d41e4b

[SPARK-10471][CORE][MESOS] prevent getting offers for unmet constraints · 5039a49b

Felix Bechstein authored 9 years ago

This change rejects offers for slaves with unmet constraints for 120s to mitigate offer starvation.
This prevents Mesos from sending us these offers again and again.
In return, we get more offers for slaves that might meet our constraints.
It also enables Mesos to send the rejected offers to other frameworks.
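
The decision logic can be sketched without the Mesos client library. Everything below is a simplified model under stated assumptions: `Offer`, `Accept`, `Decline`, and the attribute-based `meets` check are hypothetical types and helpers, and only the 120-second refusal period comes from the description above.

```scala
object DeclineUnmetOffers {
  // Hypothetical, simplified model of a resource offer.
  case class Offer(id: String, attributes: Map[String, String])

  sealed trait Decision
  case class Accept(offerId: String) extends Decision
  case class Decline(offerId: String, refuseSeconds: Double) extends Decision

  // Refusal period for offers whose constraints are not met.
  val RejectSecondsForUnmetConstraints = 120.0

  /** An offer meets the constraints if every constrained attribute has an allowed value. */
  def meets(offer: Offer, constraints: Map[String, Set[String]]): Boolean =
    constraints.forall { case (key, allowed) =>
      offer.attributes.get(key).exists(allowed.contains)
    }

  def decide(offers: Seq[Offer], constraints: Map[String, Set[String]]): Seq[Decision] =
    offers.map { o =>
      if (meets(o, constraints)) Accept(o.id)
      else Decline(o.id, RejectSecondsForUnmetConstraints)
    }
}
```

Declining with a long refusal period, rather than the scheduler's default, is what keeps the same unusable offers from coming straight back.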

Author: Felix Bechstein <felix.bechstein@otto.de>

Closes #8639 from felixb/decline_offers_constraint_mismatch.

5039a49b

[SPARK-10280][MLLIB][PYSPARK][DOCS] Add @since annotation to pyspark.ml.classification · 88a3fdcc

Yu ISHIKAWA authored 9 years ago

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8690 from yu-iskw/SPARK-10280.

88a3fdcc

[SPARK-11581][DOCS] Example mllib code in documentation incorrectly computes MSE · 860ea0d3

Bharat Lal authored 9 years ago

Author: Bharat Lal <bharat.iisc@gmail.com>

Closes #9560 from bharatl/SPARK-11581.

860ea0d3

[DOCS] Fix typo for Python section on unifying Kafka streams · 874cd66d

chriskang90 authored 9 years ago

1) kafkaStreams is a list. The list should be unpacked when passing it into the streaming context union method, which accepts a variable number of streams.
2) print() should be pprint() for pyspark.

This contribution is my original work, and I license the work to the project under the project's open source license.

Author: chriskang90 <jckang@uchicago.edu>

Closes #9545 from c-kang/streaming_python_typo.

874cd66d

[SPARK-9865][SPARKR] Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame · cd174882

felixcheung authored 9 years ago

Make sample test less flaky by setting the seed

Tested with
```
repeat {  if (count(sample(df, FALSE, 0.1)) == 3) { break } }
```
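
Why a fixed seed removes the flakiness can be illustrated with a plain Scala sketch of sampling without replacement. The `sample` function below is a hypothetical helper written for this example, not the SparkR or Spark implementation: with the same seed, the generator produces the same sequence of draws, so the same elements are kept on every run.

```scala
import scala.util.Random

object SeededSample {
  /** Keeps each element independently with probability `fraction`, using a seeded generator. */
  def sample[A](data: Seq[A], fraction: Double, seed: Long): Seq[A] = {
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }
}
```

A test can then assert on the exact sample size for a given seed instead of hoping that a random draw lands on the expected count.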

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9549 from felixcheung/rsample.

cd174882

[SPARK-11112] Fix Scala 2.11 compilation error in RDDInfo.scala · 404a28f4

tedyu authored 9 years ago

As shown in https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1946/console , compilation fails with:
```
[error] /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/storage/RDDInfo.scala:25: in class RDDInfo, multiple overloaded alternatives of constructor RDDInfo define default arguments.
[error] class RDDInfo(
[error]
```
This PR tries to fix the compilation error
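
The compiler rule behind this error can be shown with a minimal example. The class `Info` below is hypothetical and unrelated to the real `RDDInfo` signature; the point is only that at most one overloaded alternative of a constructor may declare default arguments, so the usual fix is to keep the defaults on one alternative and spell out the others.

```scala
object OverloadDefaults {
  // Rejected by the compiler: two constructor alternatives both declare defaults.
  //   class Info(val id: Int, val name: String = "rdd") {
  //     def this(id: Int, name: String, parts: Int = 1) = this(id, name)
  //   }

  // Accepted: only the primary constructor declares a default argument.
  class Info(val id: Int, val name: String, val parts: Int = 1) {
    // The auxiliary constructor passes its value explicitly and declares no defaults.
    def this(id: Int) = this(id, "rdd")
  }
}
```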

Author: tedyu <yuzhihong@gmail.com>

Closes #9538 from tedyu/master.

404a28f4

[SPARK-10565][CORE] add missing web UI stats to /api/v1/applications JSON · 08a7a836

Charles Yeh authored 9 years ago

I looked at the other endpoints, and they don't seem to be missing any fields.
Added fields:
![image](https://cloud.githubusercontent.com/assets/613879/10948801/58159982-82e4-11e5-86dc-62da201af910.png)

Author: Charles Yeh <charlesyeh@dropbox.com>

Closes #9472 from CharlesYeh/api_vars.

08a7a836

[SPARK-11582][MLLIB] specifying pmml version attribute =4.2 in the root node of pmml model · 9b88e1dc

fazlan-nazeem authored 9 years ago

The PMML models currently generated do not specify the PMML version in their root node. This is a problem when using these models in other tools, because they expect the version attribute to be set explicitly. This fix adds the PMML version attribute to the generated PMML models and sets its value to 4.2.

Author: fazlan-nazeem <fazlann@wso2.com>

Closes #9558 from fazlan-nazeem/master.

9b88e1dc

[SPARK-10689][ML][DOC] User guide and example code for AFTSurvivalRegression · d50a66cc

Yanbo Liang authored 9 years ago

Add user guide and example code for `AFTSurvivalRegression`.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9491 from yanboliang/spark-10689.

d50a66cc

[SPARK-11494][ML][R] Expose R-like summary statistics in SparkR::glm for linear regression · 8c0e1b50

Yanbo Liang authored 9 years ago

Expose R-like summary statistics in SparkR::glm for linear regression. The output of `summary` looks like:
```
$DevianceResiduals
 Min        Max
 -0.9509607 0.7291832

$Coefficients
                   Estimate   Std. Error t value   Pr(>|t|)
(Intercept)        1.6765     0.2353597  7.123139  4.456124e-11
Sepal_Length       0.3498801  0.04630128 7.556598  4.187317e-12
Species_versicolor -0.9833885 0.07207471 -13.64402 0
Species_virginica  -1.00751   0.09330565 -10.79796 0
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9561 from yanboliang/spark-11494.

8c0e1b50

[DOC][MINOR][SQL] Fix internal link · b541b316

Rohit Agarwal authored 9 years ago

It doesn't show up as a hyperlink currently. It will show up as a hyperlink after this change.

Author: Rohit Agarwal <mindprince@gmail.com>

Closes #9544 from mindprince/patch-2.

b541b316

[SPARK-11218][CORE] show help messages for start-slave and start-master · 9e48cdfb

Charles Yeh authored 9 years ago

Addressing https://issues.apache.org/jira/browse/SPARK-11218; mostly copied from `start-thriftserver.sh`.
```
charlesyeh-mbp:spark charlesyeh$ ./sbin/start-master.sh --help
Usage: Master [options]

Options:
  -i HOST, --ip HOST     Hostname to listen on (deprecated, please use --host or -h)
  -h HOST, --host HOST   Hostname to listen on
  -p PORT, --port PORT   Port to listen on (default: 7077)
  --webui-port PORT      Port for web UI (default: 8080)
  --properties-file FILE Path to a custom Spark properties file.
                         Default is conf/spark-defaults.conf.
```
```
charlesyeh-mbp:spark charlesyeh$ ./sbin/start-slave.sh
Usage: Worker [options] <master>

Master must be a URL of the form spark://hostname:port

Options:
  -c CORES, --cores CORES  Number of cores to use
  -m MEM, --memory MEM     Amount of memory to use (e.g. 1000M, 2G)
  -d DIR, --work-dir DIR   Directory to run apps in (default: SPARK_HOME/work)
  -i HOST, --ip IP         Hostname to listen on (deprecated, please use --host or -h)
  -h HOST, --host HOST     Hostname to listen on
  -p PORT, --port PORT     Port to listen on (default: random)
  --webui-port PORT        Port for web UI (default: 8081)
  --properties-file FILE   Path to a custom Spark properties file.
                           Default is conf/spark-defaults.conf.
```

Author: Charles Yeh <charlesyeh@dropbox.com>

Closes #9432 from CharlesYeh/helpmsg.

9e48cdfb

Nov 08, 2015

[SPARK-11453][SQL] append data to partitioned table will messes up the result · d8b50f70

Wenchen Fan authored 9 years ago

The reason is that:

1. For a partitioned Hive table, we move the partition columns after the data columns (e.g. `<a: Int, b: Int>` partitioned by `a` becomes `<b: Int, a: Int>`).
2. When appending data to a table, we use position to match the input columns to the table's columns.

So when we append data to a partitioned table, we match the wrong columns between the input and the table. A solution is to reorder the input columns before matching by position, as we do for [`InsertIntoHadoopFsRelation`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L101-L105).
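
The mismatch and the fix can be reproduced with a small plain-Scala model. The helpers `tableColumnOrder` and `reorderRow` are hypothetical names for this sketch, not the functions used in the patch; rows are modeled as name/value pairs so that they can be reordered by name before being matched by position.

```scala
object ReorderForAppend {
  /** Table layout: data columns first, partition columns last. */
  def tableColumnOrder(columns: Seq[String], partitionColumns: Set[String]): Seq[String] =
    columns.filterNot(partitionColumns) ++ columns.filter(partitionColumns)

  /** Reorders one input row (keyed by column name) to match the table's column order. */
  def reorderRow(row: Seq[(String, Any)], tableOrder: Seq[String]): Seq[Any] = {
    val byName = row.toMap
    tableOrder.map(byName)
  }
}
```

For input columns `a, b` partitioned by `a`, the table order is `b, a`; matching an unreordered row by position would put the value of `a` into `b`, while the reordered row lines up correctly.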

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9408 from cloud-fan/append.

d8b50f70

[SPARK-11564][SQL] Dataset Java API audit · 97b7080c

Reynold Xin authored 9 years ago

A few changes:

1. Removed fold, since it can be confusing for distributed collections.
2. Created specific interfaces for each Dataset function (e.g. MapFunction, ReduceFunction, MapPartitionsFunction)
3. Added more documentation and test cases.

The other thing I'm considering doing is to have a "collector" interface for FlatMapFunction and MapPartitionsFunction, similar to MapReduce's map function.
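
To illustrate what one interface per function means in practice, here is a sketch with two single-method traits and helpers that accept them. The trait shapes and the helpers `mapAll` and `reduceAll` are assumptions modeled on the names listed above, not the interfaces added by this PR.

```scala
object FunctionInterfaces {
  // Hypothetical single-method interfaces, easy to implement from Java as anonymous classes.
  trait MapFunction[T, U] extends Serializable { def call(value: T): U }
  trait ReduceFunction[T] extends Serializable { def call(a: T, b: T): T }

  def mapAll[T, U](data: Seq[T], f: MapFunction[T, U]): Seq[U] =
    data.map(v => f.call(v))

  // No zero element is needed, unlike fold, so there is nothing that could be applied more than once.
  def reduceAll[T](data: Seq[T], f: ReduceFunction[T]): T =
    data.reduce[T]((a, b) => f.call(a, b))
}
```

A dedicated interface per operation gives Java callers a named type to implement instead of a generic Scala function type.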

Author: Reynold Xin <rxin@databricks.com>

Closes #9531 from rxin/SPARK-11564.

97b7080c

[SPARK-11554][SQL] add map/flatMap to GroupedDataset · b2d195e1

Wenchen Fan authored 9 years ago

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9521 from cloud-fan/map.

b2d195e1

[SPARK-10046][SQL] Hive warehouse dir not set in current directory when not … · 26739059

xin Wu authored 9 years ago

Doc change to align with HiveConf default in terms of where to create `warehouse` directory.

Author: xin Wu <xinwu@us.ibm.com>

Closes #9365 from xwu0226/spark-10046-commit.

26739059