  1. Mar 11, 2015
    • [SPARK-4924] Add a library for launching Spark jobs programmatically. · 517975d8
      Marcelo Vanzin authored
      This change encapsulates all the logic involved in launching a Spark job
      into a small Java library that can be easily embedded into other applications.
      
      The overall goal of this change is twofold, as described in the bug:
      
      - Provide a public API for launching Spark processes. This is a common request
        from users and currently there's no good answer for it.
      
      - Remove a lot of the duplicated code and other coupling that exists in the
        different parts of Spark that deal with launching processes.
      
      A lot of the duplication was due to different code needed to build an
      application's classpath (and the bootstrapper needed to run the driver in
      certain situations), and also different code needed to parse spark-submit
      command line options in different contexts. The change centralizes those
      as much as possible so that all code paths can rely on the library for
      handling those appropriately.
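
      A minimal usage sketch of the new library from Scala (the jar path and main class below are placeholders, not from this change):

      ```scala
      import org.apache.spark.launcher.SparkLauncher

      // Launch a Spark application as a child process and wait for it to exit.
      // "/path/to/app.jar" and "my.example.App" are hypothetical.
      object LaunchExample extends App {
        val process = new SparkLauncher()
          .setAppResource("/path/to/app.jar")
          .setMainClass("my.example.App")
          .setMaster("local[2]")
          .setConf(SparkLauncher.DRIVER_MEMORY, "1g")
          .launch()
        process.waitFor()
      }
      ```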
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3916 from vanzin/SPARK-4924 and squashes the following commits:
      
      18c7e4d [Marcelo Vanzin] Fix make-distribution.sh.
      2ce741f [Marcelo Vanzin] Add lots of quotes.
      3b28a75 [Marcelo Vanzin] Update new pom.
      a1b8af1 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      897141f [Marcelo Vanzin] Review feedback.
      e2367d2 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      28cd35e [Marcelo Vanzin] Remove stale comment.
      b1d86b0 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      00505f9 [Marcelo Vanzin] Add blurb about new API in the programming guide.
      5f4ddcc [Marcelo Vanzin] Better usage messages.
      92a9cfb [Marcelo Vanzin] Fix Win32 launcher, usage.
      6184c07 [Marcelo Vanzin] Rename field.
      4c19196 [Marcelo Vanzin] Update comment.
      7e66c18 [Marcelo Vanzin] Fix pyspark tests.
      0031a8e [Marcelo Vanzin] Review feedback.
      c12d84b [Marcelo Vanzin] Review feedback. And fix spark-submit on Windows.
      e2d4d71 [Marcelo Vanzin] Simplify some code used to launch pyspark.
      43008a7 [Marcelo Vanzin] Don't make builder extend SparkLauncher.
      b4d6912 [Marcelo Vanzin] Use spark-submit script in SparkLauncher.
      28b1434 [Marcelo Vanzin] Add a comment.
      304333a [Marcelo Vanzin] Fix propagation of properties file arg.
      bb67b93 [Marcelo Vanzin] Remove unrelated Yarn change (that is also wrong).
      8ec0243 [Marcelo Vanzin] Add missing newline.
      95ddfa8 [Marcelo Vanzin] Fix handling of --help for spark-class command builder.
      72da7ec [Marcelo Vanzin] Rename SparkClassLauncher.
      62978e4 [Marcelo Vanzin] Minor cleanup of Windows code path.
      9cd5b44 [Marcelo Vanzin] Make all non-public APIs package-private.
      e4c80b6 [Marcelo Vanzin] Reorganize the code so that only SparkLauncher is public.
      e50dc5e [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      de81da2 [Marcelo Vanzin] Fix CommandUtils.
      86a87bf [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      2061967 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      46d46da [Marcelo Vanzin] Clean up a test and make it more future-proof.
      b93692a [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      ad03c48 [Marcelo Vanzin] Revert "Fix a thread-safety issue in "local" mode."
      0b509d0 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      23aa2a9 [Marcelo Vanzin] Read java-opts from conf dir, not spark home.
      7cff919 [Marcelo Vanzin] Javadoc updates.
      eae4d8e [Marcelo Vanzin] Fix new unit tests on Windows.
      e570fb5 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      44cd5f7 [Marcelo Vanzin] Add package-info.java, clean up javadocs.
      f7cacff [Marcelo Vanzin] Remove "launch Spark in new thread" feature.
      7ed8859 [Marcelo Vanzin] Some more feedback.
      54cd4fd [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      61919df [Marcelo Vanzin] Clean leftover debug statement.
      aae5897 [Marcelo Vanzin] Use launcher classes instead of jars in non-release mode.
      e584fc3 [Marcelo Vanzin] Rework command building a little bit.
      525ef5b [Marcelo Vanzin] Rework Unix spark-class to handle argument with newlines.
      8ac4e92 [Marcelo Vanzin] Minor test cleanup.
      e946a99 [Marcelo Vanzin] Merge PySparkLauncher into SparkSubmitCliLauncher.
      c617539 [Marcelo Vanzin] Review feedback round 1.
      fc6a3e2 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      f26556b [Marcelo Vanzin] Fix a thread-safety issue in "local" mode.
      2f4e8b4 [Marcelo Vanzin] Changes needed to make this work with SPARK-4048.
      799fc20 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      bb5d324 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      53faef1 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      a7936ef [Marcelo Vanzin] Fix pyspark tests.
      656374e [Marcelo Vanzin] Mima fixes.
      4d511e7 [Marcelo Vanzin] Fix tools search code.
      7a01e4a [Marcelo Vanzin] Fix pyspark on Yarn.
      1b3f6e9 [Marcelo Vanzin] Call SparkSubmit from spark-class launcher for unknown classes.
      25c5ae6 [Marcelo Vanzin] Centralize SparkSubmit command line parsing.
      27be98a [Marcelo Vanzin] Modify Spark to use launcher lib.
      6f70eea [Marcelo Vanzin] [SPARK-4924] Add a library for launching Spark jobs programmatically.
      517975d8
    • [SPARK-5986][MLLib] Add save/load for k-means · 2d4e00ef
      Xusen Yin authored
      This PR adds save/load for k-means as described in SPARK-5986. A Python version will be added in another PR.
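
      A short usage sketch in Scala (paths are placeholders; `sc` is an existing SparkContext):

      ```scala
      import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
      import org.apache.spark.mllib.linalg.Vectors

      // Train, persist, and reload a k-means model.
      val data = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))
      val model = KMeans.train(data, k = 2, maxIterations = 10)
      model.save(sc, "/tmp/kmeans-model")
      val sameModel = KMeansModel.load(sc, "/tmp/kmeans-model")
      ```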
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #4951 from yinxusen/SPARK-5986 and squashes the following commits:
      
      6dd74a0 [Xusen Yin] rewrite some functions and classes
      cd390fd [Xusen Yin] add indexed point
      b144216 [Xusen Yin] remove invalid comments
      dce7055 [Xusen Yin] add save/load for k-means for SPARK-5986
      2d4e00ef
  2. Mar 10, 2015
    • [SPARK-5183][SQL] Update SQL Docs with JDBC and Migration Guide · 26723741
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4958 from marmbrus/sqlDocs and squashes the following commits:
      
      9351dbc [Michael Armbrust] fix parquet example
      6877e13 [Michael Armbrust] add sql examples
      d81b7e7 [Michael Armbrust] rxins comments
      e393528 [Michael Armbrust] fix order
      19c2735 [Michael Armbrust] more on data source load/store
      00d5914 [Michael Armbrust] Update SQL Docs with JDBC and Migration Guide
      26723741
    • Minor doc: Remove the extra blank line in data types javadoc. · 74fb4337
      Reynold Xin authored
      The extra blank line is preventing the first lines from showing up in the package summary page.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4955 from rxin/datatype-docs and squashes the following commits:
      
      1621114 [Reynold Xin] Minor doc: Remove the extra blank line in data types javadoc.
      74fb4337
    • [SPARK-6186] [EC2] Make Tachyon version configurable in EC2 deployment script · 7c7d2d5e
      cheng chang authored
      This PR comes from Tachyon community to solve the issue:
      https://tachyon.atlassian.net/browse/TACHYON-11
      
      An accompanying PR is in mesos/spark-ec2:
      https://github.com/mesos/spark-ec2/pull/101
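
      A hypothetical invocation using the new flag (the key pair, identity file, and version value are illustrative):

      ```shell
      ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem \
          --tachyon-version=0.5.0 launch my-cluster
      ```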
      
      Author: cheng chang <myairia@gmail.com>
      
      Closes #4901 from uronce-cc/master and squashes the following commits:
      
      313aa36 [cheng chang] minor re-wording
      fd2a48e [cheng chang] Remove Tachyon when deploying through git hash
      1d53c5c [cheng chang] add default value to --tachyon-version
      6f8887e [cheng chang] make tachyon version configurable
      7c7d2d5e
    • [SPARK-6191] [EC2] Generalize ability to download libs · d14df06c
      Nicholas Chammas authored
      Right now we have a method to specifically download boto. This PR generalizes it so it's easy to download additional libraries if we want.
      
      For example, adding new external libraries for spark-ec2 is now as simple as:
      
      ```python
      external_libs = [
          {
              "name": "boto",
              "version": "2.34.0",
              "md5": "5556223d2d0cc4d06dd4829e671dcecd"
          },
          {
              "name": "PyYAML",
              "version": "3.11",
              "md5": "f50e08ef0fe55178479d3a618efe21db"
          },
          {
              "name": "argparse",
              "version": "1.3.0",
              "md5": "9bcf7f612190885c8c85e30ba41db3c7"
          }
      ]
      ```
      Likely use cases:
      * Downloading PyYAML to allow spark-ec2 configs to be persisted as a YAML file. ([SPARK-925](https://issues.apache.org/jira/browse/SPARK-925))
      * Downloading argparse to clean up / modernize our option parsing.
      
      First run output, with PyYAML and argparse added just for demonstration purposes:
      
      ```shell
      $ ./spark-ec2 --version
      Downloading external libraries that spark-ec2 needs from PyPI to /path/to/spark/ec2/lib...
      This should be a one-time operation.
       - Downloading boto...
       - Finished downloading boto.
       - Downloading PyYAML...
       - Finished downloading PyYAML.
       - Downloading argparse...
       - Finished downloading argparse.
      spark-ec2 1.2.1
      ```
      
      Output thereafter:
      
      ```shell
      $ ./spark-ec2 --version
      spark-ec2 1.2.1
      ```
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #4919 from nchammas/setup-ec2-libs and squashes the following commits:
      
      a077955 [Nicholas Chammas] print default region
      c95fb7d [Nicholas Chammas] to docstring
      5448845 [Nicholas Chammas] remove libs added for demo purposes
      60d8c23 [Nicholas Chammas] generalize ability to download libs
      d14df06c
    • [SPARK-6087][CORE] Provide actionable exception if Kryo buffer is not large enough · c4c4b07b
      Lev Khomich authored
      A simple try-catch that wraps KryoException to produce a more informative error.
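
      A sketch of the pattern (assumed shape, not the exact patch): catch Kryo's buffer overflow and rethrow with a pointer to the setting that fixes it.

      ```scala
      import com.esotericsoftware.kryo.{Kryo, KryoException}
      import com.esotericsoftware.kryo.io.Output
      import org.apache.spark.SparkException

      // Turn Kryo's opaque buffer-overflow error into an actionable one.
      def serialize(kryo: Kryo, output: Output, obj: AnyRef): Unit =
        try {
          kryo.writeClassAndObject(output, obj)
        } catch {
          case e: KryoException if e.getMessage.startsWith("Buffer overflow") =>
            throw new SparkException("Kryo serialization failed: " + e.getMessage +
              ". To avoid this, increase the spark.kryoserializer.buffer.max.mb value.", e)
        }
      ```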
      
      Author: Lev Khomich <levkhomich@gmail.com>
      
      Closes #4947 from levkhomich/master and squashes the following commits:
      
      0f7a947 [Lev Khomich] [SPARK-6087][CORE] Provide actionable exception if Kryo buffer is not large enough
      c4c4b07b
    • [SPARK-6177][MLlib] Add note in LDA example to suggest a possible coalesce · 9a0272fb
      Yuhao Yang authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-6177
      Add a comment introducing coalesce to the LDA example, to avoid the potentially massive number of partitions created by `sc.textFile`.

      `sc.textFile` creates an RDD with one partition per file, and a huge number of partitions degrades LDA performance.
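
      The pattern the comment recommends, sketched (the path and partition count are placeholders):

      ```scala
      // Each small file becomes its own partition; coalesce before running LDA.
      val corpus = sc.textFile("hdfs:///data/docs").coalesce(16, shuffle = false)
      ```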
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #4899 from hhbyyh/adjustPartition and squashes the following commits:
      
      a499630 [Yuhao Yang] update comment
      9a2d7b6 [Yuhao Yang] move to comment
      f7fd5d4 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into adjustPartition
      26a564a [Yuhao Yang] add coalesce to LDAExample
      9a0272fb
  3. Mar 09, 2015
    • [SPARK-6194] [SPARK-677] [PySpark] fix memory leak in collect() · 8767565c
      Davies Liu authored
      Because of a circular reference between JavaObject and JavaMember, a Java object cannot be released until Python GC kicks in, which causes a memory leak in collect() that may consume lots of memory in the JVM.

      This PR changes the way we send collected data back into Python from a local file to a socket, which avoids any disk I/O during collect and avoids holding any references to Java objects in Python.
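
      A toy, self-contained sketch of the idea (not the actual Spark code): one thread stands in for the JVM writing bytes to an ephemeral local socket, and the main thread stands in for Python draining it, with no temp file and no lingering Java references involved.

      ```python
      import socket
      import threading

      def serve(data, server_sock):
          # "JVM side": write the serialized partitions and close.
          conn, _ = server_sock.accept()
          conn.sendall(data)
          conn.close()

      server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      server.bind(("127.0.0.1", 0))  # ephemeral port
      server.listen(1)
      port = server.getsockname()[1]
      threading.Thread(target=serve, args=(b"collected-bytes", server)).start()

      # "Python side": connect and drain the socket.
      client = socket.create_connection(("127.0.0.1", port))
      chunks = []
      while True:
          chunk = client.recv(65536)
          if not chunk:
              break
          chunks.append(chunk)
      client.close()
      print(b"".join(chunks))
      ```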
      
      cc JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4923 from davies/fix_collect and squashes the following commits:
      
      d730286 [Davies Liu] address comments
      24c92a4 [Davies Liu] fix style
      ba54614 [Davies Liu] use socket to transfer data from JVM
      9517c8f [Davies Liu] fix memory leak in collect()
      8767565c
    • [SPARK-5310][Doc] Update SQL Programming Guide to include DataFrames. · 3cac1991
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4954 from rxin/df-docs and squashes the following commits:
      
      c592c70 [Reynold Xin] [SPARK-5310][Doc] Update SQL Programming Guide to include DataFrames.
      3cac1991
    • [Docs] Replace references to SchemaRDD with DataFrame · 70f88148
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4952 from rxin/schemardd-df-reference and squashes the following commits:
      
      b2b1dbe [Reynold Xin] [Docs] Replace references to SchemaRDD with DataFrame
      70f88148
    • [EC2] [SPARK-6188] Instance types can be mislabeled when re-starting cluster with default arguments · f7c79920
      Theodore Vasiloudis authored
      As described in https://issues.apache.org/jira/browse/SPARK-6188 and discovered in https://issues.apache.org/jira/browse/SPARK-5838.
      
      When re-starting a cluster, if the user does not provide the instance types (which is currently the recommended behavior in the docs), the instances will be assigned the default type, m1.large. This then affects the setup of the machines.
      
      This PR solves the problem by getting the instance types from the existing instances and overwriting the default options.
      
      EDIT: Further clarification of the issue:
      
      In short, while the instances themselves are the same as launched, their setup is done assuming the default instance type, m1.large.
      
      This means that the machines are assumed to have 2 disks, which leads to the problems described in issue [5838](https://issues.apache.org/jira/browse/SPARK-5838): machines that have one disk end up having shuffle spills in the small (8GB) snapshot partition, which quickly fills up and results in failing jobs due to "No space left on device" errors.
      
      Other instance specific settings that are set in the spark_ec2.py script are likely to be wrong as well.
      
      Author: Theodore Vasiloudis <thvasilo@users.noreply.github.com>
      Author: Theodore Vasiloudis <tvas@sics.se>
      
      Closes #4916 from thvasilo/SPARK-6188]-Instance-types-can-be-mislabeled-when-re-starting-cluster-with-default-arguments and squashes the following commits:
      
      6705b98 [Theodore Vasiloudis] Added comment to clarify setting master instance type to the empty string.
      a3d29fe [Theodore Vasiloudis] More trailing whitespace
      7b32429 [Theodore Vasiloudis] Removed trailing whitespace
      3ebd52a [Theodore Vasiloudis] Make sure that the instance type is correct when relaunching a cluster.
      f7c79920
  4. Mar 08, 2015
    • [GraphX] Improve LiveJournalPageRank example · 55b1b32d
      Jacky Li authored
      1. Removed unnecessary import
      2. Modified usage print since user must specify the --numEPart parameter as it is required in Analytics.main
      
      Author: Jacky Li <jacky.likun@huawei.com>
      
      Closes #4917 from jackylk/import and squashes the following commits:
      
      6c07682 [Jacky Li] fix comment
      c0df8f2 [Jacky Li] fix scalastyle
      b6235e6 [Jacky Li] fix for comment
      87be83b [Jacky Li] remove default value description
      5caae76 [Jacky Li] remove import and modify usage
      55b1b32d
    • SPARK-6205 [CORE] UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError · f16b7b03
      Sean Owen authored
      Add xml-apis to core test deps to work around the UISeleniumSuite classpath issue
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4933 from srowen/SPARK-6205 and squashes the following commits:
      
      ddd4d32 [Sean Owen] Add xml-apis to core test deps to work around UISeleniumSuite classpath issue
      f16b7b03
    • [SPARK-6193] [EC2] Push group filter up to EC2 · 52ed7da1
      Nicholas Chammas authored
      When looking for a cluster, spark-ec2 currently pulls down [info for all instances](https://github.com/apache/spark/blob/eb48fd6e9d55fb034c00e61374bb9c2a86a82fb8/ec2/spark_ec2.py#L620) and filters locally. When working on an AWS account with hundreds of active instances, this step alone can take over 10 seconds.
      
      This PR improves how spark-ec2 searches for clusters by pushing the filter up to EC2.
      
      Basically, the problem (and solution) look like this:
      
      ```python
      >>> timeit.timeit('blah = conn.get_all_reservations()', setup='from __main__ import conn', number=10)
      116.96390509605408
      >>> timeit.timeit('blah = conn.get_all_reservations(filters={"instance.group-name": ["my-cluster-master"]})', setup='from __main__ import conn', number=10)
      4.629754066467285
      ```
      
      Translated to a user-visible action, this looks like (against an AWS account with ~200 active instances):
      
      ```shell
      # master
      $ python -m timeit -n 3 --setup 'import subprocess' 'subprocess.call("./spark-ec2 get-master my-cluster --region us-west-2", shell=True)'
      ...
      3 loops, best of 3: 9.83 sec per loop
      
      # this PR
      $ python -m timeit -n 3 --setup 'import subprocess' 'subprocess.call("./spark-ec2 get-master my-cluster --region us-west-2", shell=True)'
      ...
      3 loops, best of 3: 1.47 sec per loop
      ```
      
      This PR also refactors `get_existing_cluster()` to make it, I hope, simpler.
      
      Finally, this PR fixes some minor grammar issues related to printing status to the user. :tophat: :clap:
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #4922 from nchammas/get-existing-cluster-faster and squashes the following commits:
      
      18802f1 [Nicholas Chammas] ignore shutting-down
      f2a5b9f [Nicholas Chammas] fix grammar
      d96a489 [Nicholas Chammas] push group filter up to EC2
      52ed7da1
  5. Mar 07, 2015
    • [SPARK-5641] [EC2] Allow spark_ec2.py to copy arbitrary files to cluster · 334c5bd1
      Florian Verhein authored
      Give users an easy way to copy (via rsync) a directory structure to the master's / as part of the cluster launch, at a useful point in the workflow (before setup.sh is called on the master).
      
      This is an alternative approach to meeting requirements discussed in https://github.com/apache/spark/pull/4487
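
      A hypothetical launch using the new option (key pair, identity file, and local path are placeholders):

      ```shell
      # Contents of /local/overlay are rsynced to the master's / before setup.sh runs.
      ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem \
          --deploy-root-dir=/local/overlay launch my-cluster
      ```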
      
      Author: Florian Verhein <florian.verhein@gmail.com>
      
      Closes #4583 from florianverhein/master and squashes the following commits:
      
      49dee88 [Florian Verhein] removed addition of trailing / in rsync to give user this option, added documentation in help
      7b8e3d8 [Florian Verhein] remove unused args
      87d922c [Florian Verhein] [SPARK-5641] [EC2] implement --deploy-root-dir
      334c5bd1
    • [Minor] Fix the wrong description · 729c05bd
      WangTaoTheTonic authored
      Found it by accident. I'm not going to file a JIRA for this, as it is a very tiny fix.
      
      Author: WangTaoTheTonic <wangtao111@huawei.com>
      
      Closes #4936 from WangTaoTheTonic/wrongdesc and squashes the following commits:
      
      fb8a8ec [WangTaoTheTonic] fix the wrong description
      aca5596 [WangTaoTheTonic] fix the wrong description
      729c05bd
    • [EC2] Reorder print statements on termination · 2646794f
      Nicholas Chammas authored
      The PR reorders some print statements slightly on cluster termination so that they read better.
      
      For example, from this:
      
      ```
      Are you sure you want to destroy the cluster spark-cluster-test?
      The following instances will be terminated:
      Searching for existing cluster spark-cluster-test in region us-west-2...
      Found 1 master(s), 2 slaves
      > ...
      ALL DATA ON ALL NODES WILL BE LOST!!
      Destroy cluster spark-cluster-test (y/N):
      ```
      
      To this:
      
      ```
      Searching for existing cluster spark-cluster-test in region us-west-2...
      Found 1 master(s), 2 slaves
      The following instances will be terminated:
      > ...
      ALL DATA ON ALL NODES WILL BE LOST!!
      Are you sure you want to destroy the cluster spark-cluster-test? (y/N)
      ```
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #4932 from nchammas/termination-print-order and squashes the following commits:
      
      c23711d [Nicholas Chammas] reorder prints on termination
      2646794f
  6. Mar 06, 2015
    • Fix python typo (+ Scala, Java typos) · 48a723c9
      RobertZK authored
      Author: RobertZK <technoguyrob@gmail.com>
      Author: Robert Krzyzanowski <technoguyrob@gmail.com>
      
      Closes #4840 from robertzk/patch-1 and squashes the following commits:
      
      d286215 [RobertZK] lambda fix per @laserson
      5937989 [Robert Krzyzanowski] Fix python typo
      48a723c9
    • [SPARK-6178][Shuffle] Removed unused imports · dba0b2ea
      Vinod K C authored
      Author: Vinod K C <vinod.kc@huawei.com>
      
      Closes #4900 from vinodkc/unused_imports and squashes the following commits:
      
      5373456 [Vinod K C] Removed empty lines
      9da7438 [Vinod K C] Changed order of import
      594d471 [Vinod K C] Removed unused imports
      dba0b2ea
    • [Minor] Resolve sbt warnings: postfix operator second should be enabled · 05cb6b34
      GuoQiang Li authored
      Resolve sbt warnings:
      
      ```
      [warn] spark/streaming/src/main/scala/org/apache/spark/streaming/util/WriteAheadLogManager.scala:155: postfix operator second should be enabled
      [warn] by making the implicit value scala.language.postfixOps visible.
      [warn] This can be achieved by adding the import clause 'import scala.language.postfixOps'
      [warn] or by setting the compiler option -language:postfixOps.
      [warn] See the Scala docs for value scala.language.postfixOps for a discussion
      [warn] why the feature should be explicitly enabled.
      [warn]         Await.ready(f, 1 second)
      [warn]                          ^
      ```
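
      The fix the warning itself prescribes, in a self-contained form:

      ```scala
      import scala.concurrent.{Await, Future}
      import scala.concurrent.ExecutionContext.Implicits.global
      import scala.concurrent.duration._
      import scala.language.postfixOps  // enables the `1 second` postfix syntax

      object PostfixDemo extends App {
        val f = Future { 42 }
        Await.ready(f, 1 second)  // no longer triggers the warning
      }
      ```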
      
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #4908 from witgo/sbt_warnings and squashes the following commits:
      
      0629af4 [GuoQiang Li] Resolve sbt warnings: postfix operator second should be enabled
      05cb6b34
    • [core] [minor] Don't pollute source directory when running UtilsSuite. · cd7594ca
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #4921 from vanzin/utils-suite and squashes the following commits:
      
      7795dd4 [Marcelo Vanzin] [core] [minor] Don't pollute source directory when running UtilsSuite.
      cd7594ca
    • [CORE, DEPLOY][minor] align arguments order with docs of worker · d8b3da9d
      Zhang, Liye authored
      The help message for starting a `worker` is `Usage: Worker [options] <master>`, while in `start-slaves.sh` the arguments are not aligned with that format, which is confusing at first glance.
      
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #4924 from liyezhang556520/startSlaves and squashes the following commits:
      
      7fd5deb [Zhang, Liye] align arguments order with docs of worker
      d8b3da9d
  7. Mar 05, 2015
    • [SQL] Make Strategies a public developer API · eb48fd6e
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4920 from marmbrus/openStrategies and squashes the following commits:
      
      cbc35c0 [Michael Armbrust] [SQL] Make Strategies a public developer API
      eb48fd6e
    • [SPARK-6163][SQL] jsonFile should be backed by the data source API · 1b4bb25c
      Yin Huai authored
      jira: https://issues.apache.org/jira/browse/SPARK-6163
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4896 from yhuai/SPARK-6163 and squashes the following commits:
      
      45e023e [Yin Huai] Address @chenghao-intel's comment.
      2e8734e [Yin Huai] Use JSON data source for jsonFile.
      92a4a33 [Yin Huai] Test.
      1b4bb25c
    • [SPARK-6145][SQL] fix ORDER BY on nested fields · 5873c713
      Wenchen Fan authored
      Based on #4904 with style errors fixed.
      
      `LogicalPlan#resolve` will not only produce `Attribute`s, but also "`GetField` chains".
      So in `ResolveSortReferences`, after resolving the ordering expressions, we should collect not just the `Attribute` results, but also the `Attribute`s at the bottom of the "`GetField` chains".
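
      A minimal illustration of the query shape this fixes (Spark 1.3-era `SQLContext` API; the names are hypothetical):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext

      case class Inner(c: Int)
      case class Record(a: String, b: Inner)

      object NestedOrderBy extends App {
        val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("demo"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // ORDER BY b.c resolves through a GetField chain to the underlying Attribute.
        val df = sc.parallelize(Seq(Record("x", Inner(2)), Record("y", Inner(1)))).toDF()
        df.registerTempTable("t")
        sqlContext.sql("SELECT a FROM t ORDER BY b.c").show()
      }
      ```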
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4918 from marmbrus/pr/4904 and squashes the following commits:
      
      997f84e [Michael Armbrust] fix style
      3eedbfc [Wenchen Fan] fix 6145
      5873c713
    • [SPARK-6175] Fix standalone executor log links when ephemeral ports or SPARK_PUBLIC_DNS are used · 424a86a1
      Josh Rosen authored
      This patch fixes two issues with the executor log viewing links added in Spark 1.3.  In standalone mode, the log URLs might include a port value of 0 rather than the actual bound port of the UI, which broke the ability to view logs from workers whose web UIs had been configured to bind to ephemeral ports.  In addition, the URLs used workers' local hostnames instead of respecting SPARK_PUBLIC_DNS, which prevented this feature from working properly on Spark EC2 clusters because the links would point to internal DNS names instead of external ones.
      
      I included tests for both of these bugs:
      
      - We now browse to the URLs and verify that they point to the expected pages.
      - To test SPARK_PUBLIC_DNS, I changed the code that reads the environment variable to do so via `SparkConf.getenv`, then used a custom SparkConf subclass to mock the environment variable (this pattern is used elsewhere in Spark's tests).
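
      A sketch of that mocking pattern (assumed shape; `getenv` is package-private on `SparkConf`, so the real test lives inside the `org.apache.spark` package tree):

      ```scala
      package org.apache.spark

      // Fake environment variables by overriding SparkConf.getenv, so a test can
      // set SPARK_PUBLIC_DNS without touching the real process environment.
      class MockEnvSparkConf(env: Map[String, String])
          extends SparkConf(loadDefaults = false) {
        override def getenv(name: String): String =
          env.getOrElse(name, super.getenv(name))
      }

      // val conf = new MockEnvSparkConf(Map("SPARK_PUBLIC_DNS" -> "public.example.com"))
      ```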
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #4903 from JoshRosen/SPARK-6175 and squashes the following commits:
      
      5577f41 [Josh Rosen] Remove println
      cfec135 [Josh Rosen] Use webUi.boundPort and publicAddress in log links
      27918c7 [Josh Rosen] Add failing unit tests for standalone log URL viewing
      c250fbe [Josh Rosen] Respect SparkConf in local-cluster Workers.
      422a2ef [Josh Rosen] Use conf.getenv to read SPARK_PUBLIC_DNS
      424a86a1
    • [SPARK-6090][MLLIB] add a basic BinaryClassificationMetrics to PySpark/MLlib · 0bfacd5c
      Xiangrui Meng authored
      A simple wrapper around the Scala implementation. `DataFrame` is used for serialization/deserialization. Methods that return `RDD`s are not supported in this PR.
      
      davies If we recognize Scala's `Product`s in Py4J, we can easily add wrappers for Scala methods that return `RDD[(Double, Double)]`. Is it easy to register a serializer for `Product` in PySpark?
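
      A minimal usage sketch of the new wrapper (toy data; API as added here):

      ```python
      from pyspark import SparkContext
      from pyspark.mllib.evaluation import BinaryClassificationMetrics

      sc = SparkContext("local", "metrics-demo")
      # RDD of (score, label) pairs.
      score_and_labels = sc.parallelize([(0.1, 0.0), (0.4, 0.0), (0.8, 1.0), (0.9, 1.0)])
      metrics = BinaryClassificationMetrics(score_and_labels)
      print(metrics.areaUnderROC)  # 1.0 for this perfectly separable toy data
      print(metrics.areaUnderPR)
      ```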
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4863 from mengxr/SPARK-6090 and squashes the following commits:
      
      009a3a3 [Xiangrui Meng] provide schema
      dcddab5 [Xiangrui Meng] add a basic BinaryClassificationMetrics to PySpark/MLlib
      0bfacd5c
    • SPARK-6182 [BUILD] spark-parent pom needs to be published for both 2.10 and 2.11 · c9cfba0c
      Sean Owen authored
      Option 1 of 2: Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4912 from srowen/SPARK-6182.1 and squashes the following commits:
      
      eff60de [Sean Owen] Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
      c9cfba0c
    • [SPARK-6153] [SQL] promote guava dep for hive-thriftserver · e06c7dfb
      Daoyuan Wang authored
      For the thriftserver package, Guava is used at runtime.
      
      /cc pwendell
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #4884 from adrian-wang/test and squashes the following commits:
      
      4600ae7 [Daoyuan Wang] only promote for thriftserver
      44dda18 [Daoyuan Wang] promote guava dep for hive
      e06c7dfb
  8. Mar 04, 2015
    • SPARK-5143 [BUILD] [WIP] spark-network-yarn 2.11 depends on spark-network-shuffle 2.10 · 7ac072f7
      Sean Owen authored
      Update `<scala.binary.version>` prop in POM when switching between Scala 2.10/2.11
      
      ScrapCodes for review. This `sed` command is supposed to just replace the first occurrence, but it replaces them all. Are you more of a `sed` wizard than I? It may be a GNU/BSD thing that is throwing me off. Really, just the first instance should be replaced, hence the `[WIP]`.
      
      NB on OS X the original `sed` command here will create files like `pom.xml-e` through the source tree though it otherwise works. It's like `-e` is also the arg to `-i`. I couldn't get rid of that even with `-i""`. No biggie.
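
      For reference, one GNU-only way to replace just the first match (the `0,/regex/` address form is a GNU extension and won't work with BSD/OS X sed):

      ```shell
      sed -i '0,/<scala\.binary\.version>2\.10<\/scala\.binary\.version>/s//<scala.binary.version>2.11<\/scala.binary.version>/' pom.xml
      ```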
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4876 from srowen/SPARK-5143 and squashes the following commits:
      
      b060c44 [Sean Owen] Oops, fixed reversed version numbers!
      e875d4a [Sean Owen] Add note about non-GNU sed; fix new pom.xml update to work as intended on GNU sed
      703e1eb [Sean Owen] Update scala.binary.version prop in POM when switching between Scala 2.10/2.11
      7ac072f7
    • [SPARK-6149] [SQL] [Build] Excludes Guava 15 referenced by jackson-module-scala_2.10 · 1aa90e39
      Cheng Lian authored
      This PR excludes Guava 15.0 from the SBT build, to make Spark SQL CLI (`bin/spark-sql`) work when compiled against Hive 0.12.0.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4890 from liancheng/exclude-guava-15 and squashes the following commits:
      
      91ae9fa [Cheng Lian] Moves Guava 15 exclusion from SBT build to POM
      282bd2a [Cheng Lian] Excludes Guava 15 referenced by jackson-module-scala_2.10
      1aa90e39
    • [SPARK-6144] [core] Fix addFile when source files are on "hdfs:" · 3a35a0df
      Marcelo Vanzin authored
      The code failed in two modes: it complained when it tried to re-create a directory that already existed, and it was placing some files in the wrong parent directory. The patch fixes both issues.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      Author: trystanleftwich <trystan@atscale.com>
      
      Closes #4894 from vanzin/SPARK-6144 and squashes the following commits:
      
      100b3a1 [Marcelo Vanzin] Style fix.
      58266aa [Marcelo Vanzin] Fix fetchHcfs file for directories.
      91733b7 [trystanleftwich] [SPARK-6144]When in cluster mode using ADD JAR with a hdfs:// sourced jar will fail
      3a35a0df
    • [SPARK-6107][CORE] Display inprogress application information for event log... · f6773edc
      Zhang, Liye authored
      [SPARK-6107][CORE] Display inprogress application information for event log history for standalone mode
      
      When an application finishes running abnormally (Ctrl-C, for example), the history event log file still ends with the `.inprogress` suffix, and the application state cannot be shown on the web UI; users can only see "*Application history not found xxxx, Application xxx is still in progress*".

      For an application that did not finish normally, the history will show:
      ![image](https://cloud.githubusercontent.com/assets/4716022/6437137/184f9fc0-c0f5-11e4-88cc-a2eb087e4561.png)
      
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #4848 from liyezhang556520/showLogInprogress and squashes the following commits:
      
      03589ac [Zhang, Liye] change inprogress to in progress
      b55f19f [Zhang, Liye] scala modify after rebase
      8aa66a2 [Zhang, Liye] use softer wording
      b030bd4 [Zhang, Liye] clean code
      79c8cb1 [Zhang, Liye] fix some mistakes
      11cdb68 [Zhang, Liye] add a missing space
      c29205b [Zhang, Liye] refine code according to sean owen's comments
      e9952a7 [Zhang, Liye] scala style fix again
      150502d [Zhang, Liye] scala style fix
      f11a5da [Zhang, Liye] small fix for file path
      22e878b [Zhang, Liye] enable in progress eventlog file
      f6773edc
    • [SPARK-6134][SQL] Fix wrong datatype for casting FloatType and default... · aef8a84e
      Liang-Chi Hsieh authored
      [SPARK-6134][SQL] Fix wrong datatype for casting FloatType and default LongType value in defaultPrimitive
      
      In `CodeGenerator`, the casting on `FloatType` should use `FloatType` instead of `IntegerType`.
      
      Besides, `defaultPrimitive` for `LongType` should be `-1L` instead of `1L`.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4870 from viirya/codegen_type and squashes the following commits:
      
      76311dd [Liang-Chi Hsieh] Fix wrong datatype for casting on FloatType. Fix the wrong value for LongType in defaultPrimitive.
      aef8a84e
    • [SPARK-6136] [SQL] Removed JDBC integration tests which depend on docker-client · 76b472f1
      Cheng Lian authored
      Integration test suites in the JDBC data source (`MySQLIntegration` and `PostgresIntegration`) depend on docker-client 2.7.5, which transitively depends on Guava 17.0. Unfortunately, Guava 17.0 is causing test runtime binary compatibility issues when Spark is compiled against Hive 0.12.0, or Hadoop 2.4.
      
      Considering `MySQLIntegration` and `PostgresIntegration` are ignored right now, I'd suggest moving them from the Spark project to the [Spark integration tests] [1] project. This PR removes both the JDBC data source integration tests and the docker-client test dependency.
      
      [1]: https://github.com/databricks/spark-integration-tests
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4872 from liancheng/remove-docker-client and squashes the following commits:
      
      1f4169e [Cheng Lian] Removes DockerHacks
      159b24a [Cheng Lian] Removed JDBC integration tests which depend on docker-client
      76b472f1
    • [SPARK-3355][Core]: Allow running maven tests in run-tests · 418f38d9
      Brennon York authored
      Added an AMPLAB_JENKINS_BUILD_TOOL env variable to allow differentiation between maven and sbt build / test suites. The only issue I found with this is that, when running maven builds, I wasn't able to get individual package tests running without running a `mvn install` first. Not sure what Jenkins is doing wrt its env, but figured it's much better to just test everything than install packages in the "~/.m2/" directory and only test individual items, esp. if this is predominantly for the Jenkins build. Thoughts / comments would be great!
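
      A hypothetical invocation selecting the build tool for a test run:

      ```shell
      AMPLAB_JENKINS_BUILD_TOOL=maven ./dev/run-tests
      ```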
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #4734 from brennonyork/SPARK-3355 and squashes the following commits:
      
      c813d32 [Brennon York] changed mvn call from 'clean compile
      616ce30 [Brennon York] fixed merge conflicts
      3540de9 [Brennon York] added an AMPLAB_JENKINS_BUILD_TOOL env. variable to allow differentiation between maven and sbt build / test suites
      418f38d9
    • SPARK-6085 Increase default value for memory overhead · 8d3e2414
      tedyu authored
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #4836 from tedyu/master and squashes the following commits:
      
      d65b495 [tedyu] SPARK-6085 Increase default value for memory overhead
      1fdd4df [tedyu] SPARK-6085 Increase default value for memory overhead
      8d3e2414
    • [SPARK-6141][MLlib] Upgrade Breeze from 0.10 to 0.11 to fix convergence bug · 76e20a0a
      Xiangrui Meng authored
      LBFGS and OWLQN in Breeze 0.10 have a convergence-check bug.
      This is fixed in 0.11; see the description in the Breeze project for details:
      
      https://github.com/scalanlp/breeze/pull/373#issuecomment-76879760
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: DB Tsai <dbtsai@alpinenow.com>
      Author: DB Tsai <dbtsai@dbtsai.com>
      
      Closes #4879 from dbtsai/breeze and squashes the following commits:
      
      d848f65 [DB Tsai] Merge pull request #1 from mengxr/AlpineNow-breeze
      c2ca6ac [Xiangrui Meng] upgrade to breeze-0.11.1
      35c2f26 [Xiangrui Meng] fix LRSuite
      397a208 [DB Tsai] upgrade breeze
      76e20a0a
  9. Mar 03, 2015
    • [SPARK-6132][HOTFIX] ContextCleaner InterruptedException should be quiet · d334bfbc
      Andrew Or authored
      If the cleaner is stopped, we shouldn't print a huge stack trace when the cleaner thread is interrupted because we purposefully did this.
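
      The pattern, sketched (assumed shape, not the exact patch): swallow the interrupt only when it was caused by a deliberate stop.

      ```scala
      object CleanerSketch {
        @volatile private var stopped = false

        def stop(cleaner: Thread): Unit = { stopped = true; cleaner.interrupt() }

        def keepCleaning(): Unit =
          try {
            while (!stopped) Thread.sleep(100)  // stand-in for the real cleaning loop
          } catch {
            case _: InterruptedException if stopped =>  // deliberate stop: exit quietly
          }
      }
      ```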
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #4882 from andrewor14/cleaner-interrupt and squashes the following commits:
      
      8652120 [Andrew Or] Just a hot fix
      d334bfbc