- Aug 12, 2015
-
-
Burak Yavuz authored
sparkr.zip is now built by SparkSubmit on a need-to-build basis. cc shivaram Author: Burak Yavuz <brkyvz@gmail.com> Closes #8147 from brkyvz/make-dist-fix.
-
Xiangrui Meng authored
There is a chance that the prefixes keep growing to the maximum pattern length, in which case the final local processing step becomes unnecessary. feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8136 from mengxr/SPARK-9903.
-
Joseph K. Bradley authored
Made ProbabilisticClassifier, Identifiable, VectorUDT public. All are annotated as DeveloperApi. CC: mengxr EronWright Author: Joseph K. Bradley <joseph@databricks.com> Closes #8004 from jkbradley/ml-api-public-items and squashes the following commits: 7ebefda [Joseph K. Bradley] update per code review 7ff0768 [Joseph K. Bradley] attempting to add mima fix 756d84c [Joseph K. Bradley] VectorUDT annotated as AlphaComponent ae7767d [Joseph K. Bradley] added another warning 94fd553 [Joseph K. Bradley] Made ProbabilisticClassifier, Identifiable, VectorUDT public APIs
-
Yin Huai authored
https://issues.apache.org/jira/browse/SPARK-9908 Author: Yin Huai <yhuai@databricks.com> Closes #8149 from yhuai/SPARK-9908.
-
Davies Liu authored
Currently, UnsafeRowSerializer does not close the InputStream, which will cause an fd leak if the InputStream has an open fd in it. TODO: the fd could still be leaked if any item in the stream is not consumed; currently it relies on GC to close the fd in this case. cc JoshRosen Author: Davies Liu <davies@databricks.com> Closes #8116 from davies/fd_leak.
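A minimal sketch of the pattern behind this fix, assuming an iterator-based reader (the helper and names below are illustrative, not Spark's actual internals): close the underlying `InputStream` as soon as the last element is read, instead of relying on GC.

```scala
import java.io.{ByteArrayInputStream, InputStream}

// Illustrative only: close the stream eagerly once the last element is read,
// so no file descriptor is left for GC to reclaim. If the iterator is never
// fully consumed, the fd can still leak -- matching the TODO in the commit.
def closingIterator[T](in: InputStream, read: InputStream => Option[T]): Iterator[T] =
  new Iterator[T] {
    private def advance(): Option[T] = {
      val elem = read(in)
      if (elem.isEmpty) in.close() // EOF reached: release the fd immediately
      elem
    }
    private var nextElem: Option[T] = advance()
    override def hasNext: Boolean = nextElem.isDefined
    override def next(): T = {
      val elem = nextElem.get
      nextElem = advance()
      elem
    }
  }

// Usage: read single bytes until EOF; the stream is closed at the last element.
val bytes = closingIterator[Byte](new ByteArrayInputStream(Array[Byte](1, 2, 3)),
  s => { val b = s.read(); if (b == -1) None else Some(b.toByte) })
println(bytes.toList) // List(1, 2, 3)
```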
-
Josh Rosen authored
I think that we should pass additional configuration flags to disable the driver UI and Master REST server in SparkSubmitSuite and HiveSparkSubmitSuite. This might cut down on port-contention-related flakiness in Jenkins. Author: Josh Rosen <joshrosen@databricks.com> Closes #8124 from JoshRosen/disable-ui-in-sparksubmitsuite.
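The flags in question are presumably the standard `spark.ui.enabled` and `spark.master.rest.enabled` keys; a hedged sketch of how a test configuration might set them (whether the suites set them exactly like this is an assumption):

```scala
import org.apache.spark.SparkConf

// Turn off the driver web UI and the standalone Master's REST submission
// server so concurrent test JVMs don't fight over ports in Jenkins.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("SparkSubmitSuite-test")
  .set("spark.ui.enabled", "false")          // no driver UI, no port 4040 contention
  .set("spark.master.rest.enabled", "false") // no REST server on port 6066
```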
-
Yu ISHIKAWA authored
I added lots of expression functions for SparkR. This PR includes only functions whose params are only `(Column)` or `(Column, Column)`. I also think we need to improve how we test those functions, but it would be better to work on that in a separate issue.

## Diff Summary

- Add lots of functions in `functions.R` and their generics in `generic.R`
- Add aliases for `ceiling` and `sign`
- Move expression functions from `column.R` to `functions.R`
- Modify `rdname` from `column` to `functions`

I haven't supported the `not` function, because the name collides with the `testthat` package and I haven't figured out a way to define it.

## New Supported Functions

```
approxCountDistinct ascii base64 bin bitwiseNOT ceil (alias: ceiling) crc32 dayofmonth dayofyear explode factorial hex hour initcap isNaN last_day length log2 ltrim md5 minute month negate quarter reverse round rtrim second sha1 signum (alias: sign) size soundex to_date trim unbase64 unhex weekofyear year datediff levenshtein months_between nanvl pmod
```

## JIRA

[[SPARK-9855] Add expression functions into SparkR whose params are simple - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9855)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8123 from yu-iskw/SPARK-9855.
-
Rohit Agarwal authored
Author: Rohit Agarwal <rohita@qubole.com> Closes #8014 from mindprince/SPARK-9724 and squashes the following commits: a7af5ff [Rohit Agarwal] [SPARK-9724] [WEB UI] Inline attachPrefix and attachPrefixForRedirect. Fix logic of attachPrefix 8a977cd [Rohit Agarwal] [SPARK-9724] [WEB UI] Address review comments: Remove unneeded code, update scaladoc. b257844 [Rohit Agarwal] [SPARK-9724] [WEB UI] Avoid unnecessary redirects in the Spark Web UI.
-
cody koeninger authored
Prevent NPE if KafkaRDD instantiation fails. Author: cody koeninger <cody@koeninger.org> Closes #8133 from koeninger/SPARK-9780 and squashes the following commits: 406259d [cody koeninger] [SPARK-9780][Streaming][Kafka] prevent NPE if KafkaRDD instantiation fails
-
Michael Armbrust authored
Author: Michael Armbrust <michael@databricks.com> Closes #8119 from marmbrus/metastoreInputFiles.
-
Xiangrui Meng authored
hhbyyh Author: Xiangrui Meng <meng@databricks.com> Closes #8141 from mengxr/SPARK-9915.
-
Xiangrui Meng authored
[SPARK-9912] [MLLIB] QRDecomposition should use QType and RType for type names instead of UType and VType hhbyyh Author: Xiangrui Meng <meng@databricks.com> Closes #8140 from mengxr/SPARK-9912.
-
Holden Karau authored
As per the TODO move weightCol to Shared Params. Author: Holden Karau <holden@pigscanfly.ca> Closes #8144 from holdenk/SPARK-9909-move-weightCol-toSharedParams.
-
Xiangrui Meng authored
feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8142 from mengxr/SPARK-9913.
-
Yin Huai authored
https://issues.apache.org/jira/browse/SPARK-9894 Author: Yin Huai <yhuai@databricks.com> Closes #8137 from yhuai/jsonMapData.
-
Michel Lemay authored
Refactor the Utils class and create ShutdownHookManager. NOTE: I wasn't able to run /dev/run-tests on a Windows machine. Manual tests were conducted locally using a custom log4j.properties file with a Redis appender and logstash formatter (bundled in the fat jar submitted to Spark), e.g.:

```
log4j.rootCategory=WARN,console,redis
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.spark.graphx.Pregel=INFO
log4j.appender.redis=com.ryantenney.log4j.FailoverRedisAppender
log4j.appender.redis.endpoints=hostname:port
log4j.appender.redis.key=mykey
log4j.appender.redis.alwaysBatch=false
log4j.appender.redis.layout=net.logstash.log4j.JSONEventLayoutV1
```

Author: michellemay <mlemay@gmail.com> Closes #8109 from michellemay/SPARK-9826.
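As a rough illustration of what a dedicated shutdown-hook manager provides (a toy sketch, not Spark's actual `ShutdownHookManager`): hooks are registered with priorities and run highest-first from a single JVM hook.

```scala
import scala.collection.mutable

// Toy version: register prioritized hooks; a single JVM shutdown hook runs
// them in decreasing priority order, so higher-priority work (e.g. stopping
// a context) happens before lower-priority cleanup (e.g. deleting temp dirs).
object MiniShutdownHookManager {
  private val hooks = mutable.ArrayBuffer.empty[(Int, () => Unit)]

  def addShutdownHook(priority: Int)(hook: () => Unit): Unit = synchronized {
    hooks += ((priority, hook))
  }

  Runtime.getRuntime.addShutdownHook(new Thread(() => {
    hooks.sortBy(-_._1).foreach { case (_, h) => h() }
  }))
}

// Usage: on JVM exit, "stop context" prints before "delete temp dirs".
MiniShutdownHookManager.addShutdownHook(50)(() => println("stop context"))
MiniShutdownHookManager.addShutdownHook(25)(() => println("delete temp dirs"))
```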
-
Niranjan Padmanabhan authored
… allocation are set. Now, dynamic allocation is set to false when num-executors is explicitly specified as an argument. Consequently, the executorAllocationManager is not initialized in the SparkContext. Author: Niranjan Padmanabhan <niranjan.padmanabhan@cloudera.com> Closes #7657 from neurons/SPARK-9092.
-
Reynold Xin authored
Author: Reynold Xin <rxin@databricks.com> Closes #8138 from rxin/SPARK-9907.
-
Xiangrui Meng authored
Add `Since` as a Scala annotation. The benefit is that we can use it without explicit JavaDoc, which is useful for inherited methods. The limitation is that it doesn't show up in the generated Java API documentation; this might be fixed by modifying genjavadoc, and I think we could leave it as a TODO. The PR description includes screenshots comparing the generated Scala doc with the `since` JavaDoc tag and with the `Since` annotation (images omitted here). rxin Author: Xiangrui Meng <meng@databricks.com> Closes #8131 from mengxr/SPARK-8967.
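For context, a minimal sketch of what such a Scala annotation looks like and why it helps inherited methods (the definition here is illustrative; Spark's actual one may differ in visibility and packaging):

```scala
import scala.annotation.StaticAnnotation

// A static annotation carrying a version string, usable where a `@since`
// JavaDoc tag would require writing out an explicit doc comment.
class Since(version: String) extends StaticAnnotation

// Unlike a doc tag, the annotation travels with the member declaration,
// so inherited and overriding methods keep their version information.
abstract class Model {
  @Since("1.5.0")
  def transform(x: Double): Double
}
```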
-
Joseph K. Bradley authored
Reinstated LogisticRegression.threshold Param for binary compatibility. Param thresholds overrides threshold, if set. CC: mengxr dbtsai feynmanliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #8079 from jkbradley/logreg-reinstate-threshold.
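A hedged sketch of the override rule described above, with the two-class conversion written out; the formula mirrors the documented equivalence between `threshold` and `thresholds`, but treat the code as illustrative rather than the exact library logic:

```scala
// `thresholds` (one value per class) overrides the scalar binary `threshold`
// when both are set; otherwise fall back to the scalar, defaulting to 0.5.
def effectiveThreshold(threshold: Option[Double],
                       thresholds: Option[Array[Double]]): Double =
  thresholds match {
    case Some(ts) if ts.length == 2 => 1.0 / (1.0 + ts(0) / ts(1)) // from class weights
    case _                          => threshold.getOrElse(0.5)
  }

println(effectiveThreshold(Some(0.7), Some(Array(0.5, 0.5)))) // 0.5: thresholds wins
println(effectiveThreshold(Some(0.7), None))                  // 0.7
```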
-
Yanbo Liang authored
Check and add missing docs for PySpark ML (this issue only checks missing docs for o.a.s.ml, not o.a.s.mllib). Author: Yanbo Liang <ybliang8@gmail.com> Closes #8059 from yanboliang/SPARK-9766.
-
Brennan Ashton authored
rxin This is my first pull request for Spark, so let me know if I am missing anything. The contribution is my original work and I license the work to the project under the project's open source license. Author: Brennan Ashton <bashton@brennanashton.com> Closes #8016 from btashton/patch-1.
-
Joseph K. Bradley authored
From JIRA: Currently, Params.copyValues copies default parameter values to the paramMap of the target instance, rather than the defaultParamMap. It should copy to the defaultParamMap because explicitly setting a parameter can change the semantics. This issue arose in SPARK-9789, where 2 params "threshold" and "thresholds" for LogisticRegression can have mutually exclusive values. If thresholds is set, then fit() will copy the default value of threshold as well, easily resulting in inconsistent settings for the 2 params. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8115 from jkbradley/copyvalues-fix.
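A toy model of the bug, with hypothetical names: if `copyValues` folds the parent's defaults into the child's explicit map, a default `threshold` suddenly looks user-set and conflicts with an explicitly set `thresholds`.

```scala
// Toy stand-in for ML Params: a map of explicitly set values and a map of defaults.
case class Params(explicit: Map[String, Any], defaults: Map[String, Any])

// Buggy copy: defaults land in the explicit map of the target.
def copyValuesBuggy(from: Params, to: Params): Params =
  to.copy(explicit = to.explicit ++ from.defaults ++ from.explicit)

// Fixed copy: defaults stay defaults; only explicit settings stay explicit.
def copyValuesFixed(from: Params, to: Params): Params =
  to.copy(explicit = to.explicit ++ from.explicit,
          defaults = to.defaults ++ from.defaults)

val parent = Params(explicit = Map("thresholds" -> Array(0.4, 0.6)),
                    defaults = Map("threshold" -> 0.5))
val child  = Params(Map.empty, Map.empty)
// Buggy: both params now appear explicitly set -> mutually exclusive conflict.
println(copyValuesBuggy(parent, child).explicit.keys) // Set(threshold, thresholds)
println(copyValuesFixed(parent, child).explicit.keys) // Set(thresholds)
```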
-
Marcelo Vanzin authored
If the correct parameter is not provided, Hive will run into an error because it calls methods that are specific to the local filesystem to copy the data. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8086 from vanzin/SPARK-9804.
-
Andrew Or authored
This is the sister patch to #8011, but for aggregation. In a nutshell: create the `TungstenAggregationIterator` before computing the parent partition. Internally this creates a `BytesToBytesMap` which acquires a page in the constructor as of this patch. This ensures that the aggregation operator is not starved since we reserve at least 1 page in advance. rxin yhuai Author: Andrew Or <andrew@databricks.com> Closes #8038 from andrewor14/unsafe-starve-memory-agg.
-
Yuhao Yang authored
jira: https://issues.apache.org/jira/browse/SPARK-7583 User guide update for RegexTokenizer Author: Yuhao Yang <hhbyyh@gmail.com> Closes #7828 from hhbyyh/regexTokenizerDoc.
-
Andrew Or authored
This is based on KaiXinXiaoLei's changes in #7716. The issue is that when someone calls `sc.killExecutor("1")` on the same executor twice quickly, then the executor target will be adjusted downwards by 2 instead of 1 even though we're only actually killing one executor. In certain cases where we don't adjust the target back upwards quickly, we'll end up with jobs hanging. This is a common danger because there are many places where this is called: - `HeartbeatReceiver` kills an executor that has not been sending heartbeats - `ExecutorAllocationManager` kills an executor that has been idle - The user code might call this, which may interfere with the previous callers While it's not clear whether this fixes SPARK-9745, fixing this potential race condition seems like a strict improvement. I've added a regression test to illustrate the issue. Author: Andrew Or <andrew@databricks.com> Closes #8078 from andrewor14/da-double-kill.
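A simplified sketch of the guard (not the real scheduler code): remember which executors already have kills in flight so a duplicate request can't lower the target twice.

```scala
import scala.collection.mutable

// Toy tracker: the executor target is decremented exactly once per executor,
// no matter how many callers race to kill the same one.
class KillTracker(var executorTarget: Int) {
  private val pendingToKill = mutable.Set.empty[String]

  def killExecutor(id: String): Unit = synchronized {
    if (pendingToKill.add(id)) { // first kill request for this executor
      executorTarget -= 1
    }                            // duplicates are no-ops
  }
}

val t = new KillTracker(executorTarget = 5)
t.killExecutor("1"); t.killExecutor("1") // the second call is ignored
println(t.executorTarget)                // 4, not 3
```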
-
Tom White authored
This allows clients to retrieve the original exception from the cause field of the SparkException that is thrown by the driver. If the original exception is not in fact Serializable then it will not be returned, but the message and stacktrace will be. (All Java Throwables implement the Serializable interface, but this is no guarantee that a particular implementation can actually be serialized.) Author: Tom White <tom@cloudera.com> Closes #7014 from tomwhite/propagate-user-exceptions.
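A sketch of the rule described above, using illustrative names and a plain `RuntimeException` in place of `SparkException`: attach the user exception as the cause only when it actually serializes, otherwise keep just the message and stack trace.

```scala
import java.io.{ByteArrayOutputStream, IOException, ObjectOutputStream}

// Throwable implements Serializable, but a subclass may hold non-serializable
// fields -- so probe with a real serialization attempt.
def isActuallySerializable(t: Throwable): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(t)
    true
  } catch { case _: IOException => false }

def wrapForDriver(userException: Throwable): Exception =
  if (isActuallySerializable(userException)) {
    new RuntimeException("Task failed", userException) // cause travels to the driver
  } else {
    val e = new RuntimeException(
      s"Task failed: ${userException.getClass.getName}: ${userException.getMessage}")
    e.setStackTrace(userException.getStackTrace) // keep the trace even without a cause
    e
  }
```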
-
Cheng Lian authored
This PR adds a hacky workaround for PARQUET-201, and should be removed once we upgrade to parquet-mr 1.8.1 or higher. In Parquet, not all types of columns can be used for filter push-down optimization. The set of valid column types is controlled by `ValidTypeMap`. Unfortunately, in parquet-mr 1.7.0 and prior versions this limitation is too strict and doesn't allow `BINARY (ENUM)` columns to be pushed down. On the other hand, `BINARY (ENUM)` is commonly seen in Parquet files written by libraries like `parquet-avro`. This restriction is problematic for Spark SQL, because Spark SQL doesn't have a type that maps to Parquet `BINARY (ENUM)` directly, and always converts `BINARY (ENUM)` to Catalyst `StringType`. Thus, a predicate involving a `BINARY (ENUM)` column is recognized as one involving a string field instead and can be pushed down by the query optimizer. Such predicates are actually perfectly legal, except that they fail the `ValidTypeMap` check. The workaround added here relaxes `ValidTypeMap` to include `BINARY (ENUM)`. I also took the chance to simplify `ParquetCompatibilityTest` a little bit while adding the regression test. Author: Cheng Lian <lian@databricks.com> Closes #8107 from liancheng/spark-9407/parquet-enum-filter-push-down.
-
Yijie Shen authored
This PR fixes being unable to push filters down to a JDBC source when a `Cast` gets in the way of pattern matching. When we compare columns of different types, there's a good chance a cast is inserted on the column; the predicate then no longer matches the pattern directly on an Attribute and fails to push down. Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #8049 from yjshen/jdbc_pushdown.
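A toy reconstruction of the matching problem (the expression ADT below is a stand-in for Catalyst, not the real classes): without a case that looks through `Cast`, a casted column never matches, and the filter is never compiled for the JDBC source.

```scala
sealed trait Expr
case class Attr(name: String) extends Expr
case class Cast(child: Expr) extends Expr
case class Lit(v: Any) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr

// Extract a pushable column name from an expression.
def pushableColumn(e: Expr): Option[String] = e match {
  case Attr(name)       => Some(name)
  case Cast(Attr(name)) => Some(name) // the fix: look through the cast
  case _                => None
}

// Compile an equality predicate into a WHERE-clause fragment, if pushable.
def compileFilter(f: Expr): Option[String] = f match {
  case EqualTo(col, Lit(v)) => pushableColumn(col).map(c => s"$c = $v")
  case _                    => None
}

println(compileFilter(EqualTo(Cast(Attr("age")), Lit(30)))) // Some(age = 30)
```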
-
Timothy Chen authored
andrewor14 Author: Timothy Chen <tnachen@gmail.com> Closes #7907 from tnachen/mesos_shuffle.
-
Timothy Chen authored
Some users like to download additional files in their sandbox that they can refer to from their spark program, or even later mount these files to another directory. Author: Timothy Chen <tnachen@gmail.com> Closes #7195 from tnachen/mesos_files.
-
Carson Wang authored
To reproduce the issue, go to the stage page and click DAG Visualization once, then go to the job page to show the job DAG visualization. You will only see the first stage of the job. Root cause: the JavaScript uses local storage to remember your selection. Once you click the stage DAG visualization, local storage sets `expand-dag-viz-arrow-stage` to true. When you go to the job page, the JS checks `expand-dag-viz-arrow-stage` in local storage first and will try to show the stage DAG visualization on the job page. To fix this, I set an id on the DAG span to distinguish the job page from the stage page. In the JS code, we check the id and local storage together to make sure we show the correct DAG visualization. Author: Carson Wang <carson.wang@intel.com> Closes #8104 from carsonwang/SPARK-9426.
-
zsxwing authored
The peak execution memory is not correct because it shows the sum of finished tasks' values when a task finishes. This PR fixes it by using the update value rather than the accumulator value. Author: zsxwing <zsxwing@gmail.com> Closes #8121 from zsxwing/SPARK-9829.
-
Rohit Agarwal authored
Author: Rohit Agarwal <rohita@qubole.com> Closes #8088 from mindprince/SPARK-9806.
-
xutingjun authored
Author: xutingjun <xutingjun@huawei.com> Author: meiyoula <1039320815@qq.com> Closes #6817 from XuTingjun/SPARK-8366.
-
Josh Rosen authored
`RuleExecutor.timeMap` is currently a non-thread-safe mutable HashMap; this can lead to infinite loops if multiple threads are concurrently modifying the map. I believe that this is responsible for some hangs that I've observed in HiveQuerySuite. This patch addresses this by using a Guava `AtomicLongMap`. Author: Josh Rosen <joshrosen@databricks.com> Closes #8120 from JoshRosen/rule-executor-time-map-fix.
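A short sketch of the fix's shape, assuming Guava is on the classpath: `AtomicLongMap` provides thread-safe per-key counters, so concurrent rule executions can record timings without sharing a mutable `HashMap`.

```scala
import com.google.common.util.concurrent.AtomicLongMap

// Per-rule timing counters; addAndGet is an atomic read-modify-write per key,
// so no external synchronization (and no corrupted map) under concurrency.
val timeMap: AtomicLongMap[String] = AtomicLongMap.create[String]()

timeMap.addAndGet("ConstantFolding", 1200L)
timeMap.addAndGet("ConstantFolding", 800L)
println(timeMap.get("ConstantFolding")) // 2000
```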
-
Davies Liu authored
Author: Davies Liu <davies@databricks.com> Closes #8117 from davies/fix_serialization and squashes the following commits: d21ac71 [Davies Liu] fix serialization with empty broadcast
-
- Aug 11, 2015
-
-
Eric Liang authored
This documents the use of R model formulae in the SparkR guide. Also fixes some bugs in the R API docs. mengxr Author: Eric Liang <ekl@databricks.com> Closes #8085 from ericl/docs.
-
Patrick Wendell authored
This update contains some code changes to the release scripts that allow easier nightly publishing. I've been using these new scripts on Jenkins for cutting and publishing nightly snapshots for the last month or so, and it has been going well. I'd like to get them merged back upstream so this can be maintained by the community. The main changes are:

1. Separates the release tagging from the various build possibilities for an already tagged release (`release-tag.sh` and `release-build.sh`).
2. Allows for injecting credentials through the environment, including GPG keys. This is then paired with secure key injection in Jenkins.
3. Supports copying build results to a remote directory, and also "rotating" results, e.g. the ability to keep the last N copies of binary or doc builds.

I'm happy if anyone wants to take a look at this - it's not user facing but an internal utility used for generating releases. Author: Patrick Wendell <patrick@databricks.com> Closes #7411 from pwendell/release-script-updates and squashes the following commits: 74f9beb [Patrick Wendell] Moving maven build command to a variable 233ce85 [Patrick Wendell] [SPARK-1517] Refactor release scripts to facilitate nightly publishing
-