Commits · 9b6ebe33db27be38c3036ffeda17096043fb0fb9 · cs525-sp18-g07 / spark

Oct 30, 2014

[SPARK-4120][SQL] Join of multiple tables with syntax like SELECT .. FROM... · 9b6ebe33

ravipesala authored 10 years ago

[SPARK-4120][SQL] Join of multiple tables with syntax like SELECT .. FROM T1,T2,T3.. does not work in SparkSQL

Right now it works for only 2 tables like below query.
sql("SELECT * FROM records1 as a,records2 as b where a.key=b.key ")

But it does not work for more than 2 tables like below query
sql("SELECT * FROM records1 as a,records2 as b,records3 as c where a.key=b.key and a.key=c.key").

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #2987 from ravipesala/multijoin and squashes the following commits:

429b005 [ravipesala] Support multiple joins

9b6ebe33

SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop · 68cb69da

Sean Owen authored 10 years ago

(This is just a look at what completely moving the classes would look like. I know Patrick flagged that as maybe not OK, although, it's private?)

Author: Sean Owen <sowen@cloudera.com>

Closes #2814 from srowen/SPARK-1209 and squashes the following commits:

ead1115 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though?
2d42c1d [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark

68cb69da

[SPARK-3661] Respect spark.*.memory in cluster mode · 2f545438

Andrew Or authored 10 years ago

This also includes minor re-organization of the code. Tested locally in both client and deploy modes.

Author: Andrew Or <andrew@databricks.com>
Author: Andrew Or <andrewor14@gmail.com>

Closes #2697 from andrewor14/memory-cluster-mode and squashes the following commits:

01d78bc [Andrew Or] Merge branch 'master' of github.com:apache/spark into memory-cluster-mode
ccd468b [Andrew Or] Add some comments per Patrick
c956577 [Andrew Or] Tweak wording
2b4afa0 [Andrew Or] Unused import
47a5a88 [Andrew Or] Correct Spark properties precedence order
bf64717 [Andrew Or] Merge branch 'master' of github.com:apache/spark into memory-cluster-mode
dd452d0 [Andrew Or] Respect spark.*.memory in cluster mode

2f545438

[SPARK-4153][WebUI] Update the sort keys for HistoryPage · d3450578

zsxwing authored 10 years ago

Sort "Started", "Completed", "Duration" and "Last Updated" by time.

Author: zsxwing <zsxwing@gmail.com>

Closes #3014 from zsxwing/SPARK-4153 and squashes the following commits:

ec8b9ad [zsxwing] Sort "Started", "Completed", "Duration" and "Last Updated" by time

d3450578

Minor style hot fix after #2711 · 849b43ec

Andrew Or authored 10 years ago

I had planned to fix this when I merged it but I forgot to. witgo

Author: Andrew Or <andrew@databricks.com>

Closes #3018 from andrewor14/command-utils-style and squashes the following commits:

c2959fb [Andrew Or] Style hot fix

849b43ec

[SPARK-4155] Consolidate usages of <driver> · 9334d699

Andrew Or authored 10 years ago

We use "\<driver\>" everywhere. Let's not do that.

Author: Andrew Or <andrew@databricks.com>

Closes #3020 from andrewor14/consolidate-driver and squashes the following commits:

c1c2204 [Andrew Or] Just use "<driver>" for local executor ID
3d751e9 [Andrew Or] Consolidate usages of <driver>

9334d699

[Minor] A few typos in comments and log messages · 5231a3f2

Andrew Or authored 10 years ago

Author: Andrew Or <andrewor14@gmail.com>
Author: Andrew Or <andrew@databricks.com>

Closes #3021 from andrewor14/typos and squashes the following commits:

daaf417 [Andrew Or] Merge branch 'master' of github.com:apache/spark into typos
4838ae4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into typos
026d426 [Andrew Or] Merge branch 'master' of github.com:andrewor14/spark into typos
a81ae8f [Andrew Or] Some typos

5231a3f2

[SPARK-4138][SPARK-4139] Improve dynamic allocation settings · 26f092d4

Andrew Or authored 10 years ago

This should be merged after #2746 (SPARK-3795).

**SPARK-4138**. If the user sets both the number of executors and `spark.dynamicAllocation.enabled`, we should throw an exception.

**SPARK-4139**. If the user sets `spark.dynamicAllocation.enabled`, we should use the max number of executors as the starting number of executors because the first job is likely to run immediately after application startup. If the latter is not set, throw an exception.

Author: Andrew Or <andrew@databricks.com>

Closes #3002 from andrewor14/yarn-set-executors and squashes the following commits:

c528fce [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-set-executors
55d4699 [Andrew Or] Bug fix: `isDynamicAllocationEnabled` was always false
2b0ccec [Andrew Or] Start the number of executors at the max
022bfde [Andrew Or] Guard against incompatible settings of number of executors

26f092d4

[SPARK-3319] [SPARK-3338] Resolve Spark submit config paths · 24c51292

Andrew Or authored 10 years ago

The bulk of this PR is comprised of tests. All changes in functionality are made in `SparkSubmit.scala` (~20 lines).

**SPARK-3319.** There is currently a divergence in behavior when the user passes in additional jars through `--jars` and through setting `spark.jars` in the default properties file. The former will happily resolve the paths (e.g. convert `my.jar` to `file:/absolute/path/to/my.jar`), while the latter does not. We should resolve paths consistently in both cases. This also applies to the following pairs of command line arguments and Spark configs:

- `--jars` ~ `spark.jars`
- `--files` ~ `spark.files` / `spark.yarn.dist.files`
- `--archives` ~ `spark.yarn.dist.archives`
- `--py-files` ~ `spark.submit.pyFiles`

**SPARK-3338.** This PR also fixes the following bug: if the user sets `spark.submit.pyFiles` in his/her properties file, it does not actually get picked up even if `--py-files` is not set. This is simply because the config is overridden by an empty string.

Author: Andrew Or <andrewor14@gmail.com>
Author: Andrew Or <andrew@databricks.com>

Closes #2232 from andrewor14/resolve-config-paths and squashes the following commits:

fff2869 [Andrew Or] Add spark.yarn.jar
da3a1c1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into resolve-config-paths
f0fae64 [Andrew Or] Merge branch 'master' of github.com:apache/spark into resolve-config-paths
05e03d6 [Andrew Or] Add tests for resolving both command line and config paths
460117e [Andrew Or] Resolve config paths properly
fe039d3 [Andrew Or] Beef up tests to test fixed-pointed-ness of Utils.resolveURI(s)

24c51292

[SPARK-4078] New FsPermission instance w/o FsPermission.createImmutable in eventlog · 9142c9b8

Grace authored 10 years ago

By default, Spark builds its package against Hadoop 1.0.4 version. In that version, it has some FsPermission bug (see [HADOOP-7629] (https://issues.apache.org/jira/browse/HADOOP-7629) by Todd Lipcon). This bug got fixed since 1.1 version. By using that FsPermission.createImmutable() API, end-user may see some RPC exception like below (if turn on eventlog over HDFS).  Here proposes a quick fix to avoid certain exception for all hadoop versions.
```
Exception in thread "main" java.io.IOException: Call to sr484/10.1.2.84:54310 failed on local exception: java.io.EOFException
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:1150)
        at org.apache.hadoop.ipc.Client.call(Client.java:1118)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
        at $Proxy6.setPermission(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
        at $Proxy6.setPermission(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.setPermission(DFSClient.java:1285)
        at org.apache.hadoop.hdfs.DistributedFileSystem.setPermission(DistributedFileSystem.java:572)
        at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:138)
        at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
        at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:324)
```

Author: Grace <jie.huang@intel.com>

Closes #2892 from GraceH/eventlog-rpc and squashes the following commits:

58ea038 [Grace] new FsPermission Instance w/o FsPermission.createImmutable

9142c9b8

[SPARK-4027][Streaming] WriteAheadLogBackedBlockRDD to read received either... · fb1fbca2

Tathagata Das authored 10 years ago

[SPARK-4027][Streaming] WriteAheadLogBackedBlockRDD to read received either from BlockManager or WAL in HDFS

As part of the initiative of preventing data loss on streaming driver failure, this sub-task implements a BlockRDD that is backed by HDFS. This BlockRDD can either read data from the Spark's BlockManager, or read the data from file-segments in write ahead log in HDFS.

Most of this code has been written by @harishreedharan

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #2931 from tdas/driver-ha-rdd and squashes the following commits:

209e49c [Tathagata Das] Better fix to style issue.
4a5866f [Tathagata Das] Addressed one more comment.
ed5fbf0 [Tathagata Das] Minor updates.
b0a18b1 [Tathagata Das] Fixed import order.
20aa7c6 [Tathagata Das] Fixed more line length issues.
29aa099 [Tathagata Das] Fixed line length issues.
9e47b5b [Tathagata Das] Renamed class, simplified+added unit tests.
6e1bfb8 [Tathagata Das] Tweaks testuite to create spark contxt lazily to prevent contxt leaks.
9c86a61 [Tathagata Das] Merge pull request #22 from harishreedharan/driver-ha-rdd
2878c38 [Hari Shreedharan] Shutdown spark context after tests. Formatting/minor fixes
c709f2f [Tathagata Das] Merge pull request #21 from harishreedharan/driver-ha-rdd
5cce16f [Hari Shreedharan] Make sure getBlockLocations uses offset and length to find the blocks on HDFS
eadde56 [Tathagata Das] Transferred HDFSBackedBlockRDD for the driver-ha-working branch

fb1fbca2

[SPARK-4028][Streaming] ReceivedBlockHandler interface to abstract the... · 234de923

Tathagata Das authored 10 years ago

[SPARK-4028][Streaming] ReceivedBlockHandler interface to abstract the functionality of storage of received data

As part of the initiative to prevent data loss on streaming driver failure, this JIRA tracks the subtask of implementing a ReceivedBlockHandler, that abstracts the functionality of storage of received data blocks. The default implementation will maintain the current behavior of storing the data into BlockManager. The optional implementation will store the data to both BlockManager as well as a write ahead log.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #2940 from tdas/driver-ha-rbh and squashes the following commits:

78a4aaa [Tathagata Das] Fixed bug causing test failures.
f192f47 [Tathagata Das] Fixed import order.
df5f320 [Tathagata Das] Updated code to use ReceivedBlockStoreResult as the return type for handler's storeBlock
33c30c9 [Tathagata Das] Added license, and organized imports.
2f025b3 [Tathagata Das] Updates based on PR comments.
18aec1e [Tathagata Das] Moved ReceivedBlockInfo back into spark.streaming.scheduler package
95a4987 [Tathagata Das] Added ReceivedBlockHandler and its associated tests

234de923

SPARK-4111 [MLlib] add regression metrics · d9327192

Yanbo Liang authored 10 years ago

Add RegressionMetrics.scala as regression metrics used for evaluation and corresponding test case RegressionMetricsSuite.scala.

Author: Yanbo Liang <yanbohappy@gmail.com>
Author: liangyanbo <liangyanbo@meituan.com>

Closes #2978 from yanbohappy/regression_metrics and squashes the following commits:

730d0a9 [Yanbo Liang] more clearly annotation
3d0bec1 [Yanbo Liang] rename and keep code style
a8ad3e3 [Yanbo Liang] simplify code for keeping style
d454909 [Yanbo Liang] rename parameter and function names, delete unused columns, add reference
2e56282 [liangyanbo] rename r2_score() and remove unused column
43bb12b [liangyanbo] add regression metrics

d9327192

[SPARK-4130][MLlib] Fixing libSVM parser bug with extra whitespace · c7ad0852

Joseph E. Gonzalez authored 10 years ago

This simple patch filters out extra whitespace entries.

Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
Author: Joey <joseph.e.gonzalez@gmail.com>

Closes #2996 from jegonzal/loadLibSVM and squashes the following commits:

e0227ab [Joey] improving readability
e028e84 [Joseph E. Gonzalez] fixing whitespace bug in loadLibSVMFile when parsing libSVM files

c7ad0852

[SPARK-4102] Remove unused ShuffleReader.stop() method. · 6db31574

Kay Ousterhout authored 10 years ago

This method is not implemented by the only subclass
(HashShuffleReader), nor is it ever called. While the
use of Scala's fancy "???" was pretty exciting, the method's
existence can only lead to confusion and it therefore should
be deleted.

mateiz was there a reason for adding this that I'm
missing?

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #2966 from kayousterhout/SPARK-4102 and squashes the following commits:

532c564 [Kay Ousterhout] Added back commented-out method, as per Matei's request
904655e [Kay Ousterhout] [SPARK-4102] Remove unused ShuffleReader.stop() method.

6db31574

[SPARK-1720][SPARK-1719] use LD_LIBRARY_PATH instead of -Djava.library.path · cd739bd7

GuoQiang Li authored 10 years ago

- [X] Standalone
- [X] YARN
- [X] Mesos
- [X]  Mac OS X
- [X] Linux
- [ ]  Windows

This is another implementation about #1031

Author: GuoQiang Li <witgo@qq.com>

Closes #2711 from witgo/SPARK-1719 and squashes the following commits:

c7b26f6 [GuoQiang Li] review commits
4488e41 [GuoQiang Li] Refactoring CommandUtils
a444094 [GuoQiang Li] review commits
40c0b4a [GuoQiang Li] Add buildLocalCommand method
c1a0ddd [GuoQiang Li] fix comments
156ce88 [GuoQiang Li] review commit
38aa377 [GuoQiang Li] Refactor CommandUtils.scala
4269e00 [GuoQiang Li] Refactor SparkSubmitDriverBootstrapper.scala
7a1d634 [GuoQiang Li] use LD_LIBRARY_PATH instead of -Djava.library.path

cd739bd7

Oct 29, 2014

[SPARK-4053][Streaming] Made the ReceiverSuite test more reliable, by fixing... · 12342580

Tathagata Das authored 10 years ago

[SPARK-4053][Streaming] Made the ReceiverSuite test more reliable, by fixing block generator throttling

In the unit test that checked whether blocks generated by throttled block generator had expected number of records, the thresholds are too tight, which sometimes led to the test failing.
This PR fixes it by relaxing the thresholds and the time intervals for testing.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #2900 from tdas/receiver-suite-flakiness and squashes the following commits:

28508a2 [Tathagata Das] Made the ReceiverSuite test more reliable

12342580

[SPARK-3795] Heuristics for dynamically scaling executors · 8d59b37b

Andrew Or authored 10 years ago

This is part of a bigger effort to provide elastic scaling of executors within a Spark application ([SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174)). This PR does not provide any functionality by itself; it is a skeleton that is missing a mechanism to be added later in [SPARK-3822](https://issues.apache.org/jira/browse/SPARK-3822).

Comments and feedback are most welcome. For those of you reviewing this in detail, I highly recommend doing it through your favorite IDE instead of through the diff here.

Author: Andrew Or <andrewor14@gmail.com>
Author: Andrew Or <andrew@databricks.com>

Closes #2746 from andrewor14/scaling-heuristics and squashes the following commits:

8a4fdaa [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
e045df8 [Andrew Or] Add warning message (minor)
dfa31ec [Andrew Or] Fix tests
c0becc4 [Andrew Or] Merging with SPARK-3822
4784f93 [Andrew Or] Reword an awkward log message
181f27f [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
c79e907 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
4672b90 [Andrew Or] It's nano time.
a6a30f2 [Andrew Or] Do not allow min/max executors of 0
c60ec33 [Andrew Or] Rewrite test logic with clocks
b00b680 [Andrew Or] Fix style
c3caa65 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
7f9da14 [Andrew Or] Factor out logic to verify bounds on # executors (minor)
f279019 [Andrew Or] Add time mocking tests for polling loop
685e347 [Andrew Or] Factor out clock in polling loop to facilitate testing
3cea7f7 [Andrew Or] Use PrivateMethodTester to keep original class private
3156d81 [Andrew Or] Update comments and exception messages
92f36f9 [Andrew Or] Address minor review comments
abdea61 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
2aefd09 [Andrew Or] Correct listener behavior
9fe6e44 [Andrew Or] Rename variables and configs + update comments and log messages
149cc32 [Andrew Or] Fix style
254c958 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
5ff829b [Andrew Or] Add tests for ExecutorAllocationManager
19c6c4b [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
5896515 [Andrew Or] Move ExecutorAllocationManager out of scheduler package
9ca8945 [Andrew Or] Rewrite callbacks through the listener interface
5e336b9 [Andrew Or] Remove code from backend to avoid conflict with SPARK-3822
092d1fd [Andrew Or] Remove timeout logic for pending requests
1309fab [Andrew Or] Request executors by specifying the number pending
8bc0e9d [Andrew Or] Add logic to expire pending requests after timeouts
b750ee1 [Andrew Or] Express timers in terms of expiration times + remove retry logic
7f8dd47 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
9d516cc [Andrew Or] Bug fix: Actually trigger the add timer / add retry timer
44f1832 [Andrew Or] Rename configs to include time units
eaae7ef [Andrew Or] Address various review comments
6f8be6c [Andrew Or] Beef up comments on what each of the timers mean
baaa403 [Andrew Or] Simplify variable names (minor)
42beec8 [Andrew Or] Reset whether the add threshold is crossed on cancellation
9bcc0bc [Andrew Or] ExecutorScalingManager -> ExecutorAllocationManager
2784398 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
5a97d9e [Andrew Or] Log retry attempts in INFO + clean up logging
2f55c9f [Andrew Or] Do not keep requesting executors even after max attempts
0acd1cb [Andrew Or] Rewrite timer logic with polling
b3c7d44 [Andrew Or] Start the retry timer for adding executors at the right time
9b5f2ea [Andrew Or] Wording changes in comments and log messages
c2203a5 [Andrew Or] Simplify code to access the scheduler backend
e519d08 [Andrew Or] Simplify initialization code
2cc87a7 [Andrew Or] Add retry logic for removing executors
d0b34a6 [Andrew Or] Add retry logic for adding executors
9cc4649 [Andrew Or] Simplifying synchronization logic
67c03c7 [Andrew Or] Correct semantics of adding executors + update comments
6c48ab0 [Andrew Or] Update synchronization comment
8901900 [Andrew Or] Simplify remove policy + change the semantics of add policy
1cc8444 [Andrew Or] Minor wording change
ae5b64a [Andrew Or] Add synchronization
20ec6b9 [Andrew Or] First cut implementation of removing executors dynamically
4077ae2 [Andrew Or] Minor code re-organization
6f1fa66 [Andrew Or] First cut implementation of adding executors dynamically
b2e6dcc [Andrew Or] Add skeleton interface for requesting / killing executors

8d59b37b

[SPARK-4097] Fix the race condition of 'thread' · e7fd8041

zsxwing authored 10 years ago

There is a chance that `thread` is null when calling `thread.interrupt()`.

```Scala
  override def cancel(): Unit = this.synchronized {
    _cancelled = true
    if (thread != null) {
      thread.interrupt()
    }
  }
```
Should put `thread = null` into a `synchronized` block to fix the race condition.

Author: zsxwing <zsxwing@gmail.com>

Closes #2957 from zsxwing/SPARK-4097 and squashes the following commits:

edf0aee [zsxwing] Add comments to explain the lock
c5cfeca [zsxwing] Fix the race condition of 'thread'

e7fd8041

[SPARK-3822] Executor scaling mechanism for Yarn · 1df05a40

Andrew Or authored 10 years ago

This is part of a broader effort to enable dynamic scaling of executors ([SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174)). This is intended to work alongside SPARK-3795 (#2746), SPARK-3796 and SPARK-3797, but is functionally independently of these other issues.

The logic is built on top of PraveenSeluka's changes at #2798. This is different from the changes there in a few major ways: (1) the mechanism is implemented within the existing scheduler backend framework rather than in new `Actor` classes. This also introduces a parent abstract class `YarnSchedulerBackend` to encapsulate common logic to communicate with the Yarn `ApplicationMaster`. (2) The interface of requesting executors exposed to the `SparkContext` is the same, but the communication between the scheduler backend and the AM uses total number executors desired instead of an incremental number. This is discussed in #2746 and explained in the comments in the code.

I have tested this significantly on a stable Yarn cluster.

------------
A remaining task for this issue is to tone down the error messages emitted when an executor is removed.
Currently, `SparkContext` and its components react as if the executor has failed, resulting in many scary error messages and eventual timeouts. While it's not strictly necessary to fix this as of the first-cut implementation of this mechanism, it would be good to add logic to distinguish this case. I prefer to address this in a separate PR. I have filed a separate JIRA for this task at SPARK-4134.

Author: Andrew Or <andrew@databricks.com>
Author: Andrew Or <andrewor14@gmail.com>

Closes #2840 from andrewor14/yarn-scaling-mechanism and squashes the following commits:

485863e [Andrew Or] Minor log message changes
4920be8 [Andrew Or] Clarify that public API is only for Yarn mode for now
1c57804 [Andrew Or] Reword a few comments + other review comments
6321140 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
02836c0 [Andrew Or] Limit scope of synchronization
4e2ed7f [Andrew Or] Fix bug: keep track of removed executors properly
73ade46 [Andrew Or] Wording changes (minor)
2a7a6da [Andrew Or] Add `sc.killExecutor` as a shorthand (minor)
665f229 [Andrew Or] Mima excludes
79aa2df [Andrew Or] Simplify the request interface by asking for a total
04f625b [Andrew Or] Fix race condition that causes over-allocation of executors
f4783f8 [Andrew Or] Change the semantics of requesting executors
005a124 [Andrew Or] Fix tests
4628b16 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
db4a679 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
572f5c5 [Andrew Or] Unused import (minor)
f30261c [Andrew Or] Kill multiple executors rather than one at a time
de260d9 [Andrew Or] Simplify by skipping useless null check
9c52542 [Andrew Or] Simplify by skipping the TaskSchedulerImpl
97dd1a8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
d987b3e [Andrew Or] Move addWebUIFilters to Yarn scheduler backend
7b76d0a [Andrew Or] Expose mechanism in SparkContext as developer API
47466cd [Andrew Or] Refactor common Yarn scheduler backend logic
c4dfaac [Andrew Or] Avoid thrashing when removing executors
53e8145 [Andrew Or] Start yarn actor early to listen for AM registration message
bbee669 [Andrew Or] Add mechanism in yarn client mode

1df05a40

[SPARK-4003] [SQL] add 3 types for java SQL context · 35354676

Daoyuan Wang authored 10 years ago

In JavaSqlContext, we need to let java program use big decimal, timestamp, date types.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #2850 from adrian-wang/javacontext and squashes the following commits:

4c4292c [Daoyuan Wang] change underlying type of JavaSchemaRDD as scala
bb0508f [Daoyuan Wang] add test cases
3c58b0d [Daoyuan Wang] add 3 types for java SQL context

35354676

[SPARK-3453] Netty-based BlockTransferService, extracted from Spark core · dff01553

Reynold Xin authored 10 years ago

This PR encapsulates #2330, which is itself a continuation of #2240. The first goal of this PR is to provide an alternate, simpler implementation of the ConnectionManager which is based on Netty.

In addition to this goal, however, we want to resolve [SPARK-3796](https://issues.apache.org/jira/browse/SPARK-3796), which calls for a standalone shuffle service which can be integrated into the YARN NodeManager, Standalone Worker, or on its own. This PR makes the first step in this direction by ensuring that the actual Netty service is as small as possible and extracted from Spark core. Given this, we should be able to construct this standalone jar which can be included in other JVMs without incurring significant dependency or runtime issues. The actual work to ensure that such a standalone shuffle service would work in Spark will be left for a future PR, however.

In order to minimize dependencies and allow for the service to be long-running (possibly much longer-running than Spark, and possibly having to support multiple version of Spark simultaneously), the entire service has been ported to Java, where we have full control over the binary compatibility of the components and do not depend on the Scala runtime or version.

These issues: have been addressed by folding in #2330:

SPARK-3453: Refactor Netty module to use BlockTransferService interface
SPARK-3018: Release all buffers upon task completion/failure
SPARK-3002: Create a connection pool and reuse clients across different threads
SPARK-3017: Integration tests and unit tests for connection failures
SPARK-3049: Make sure client doesn't block when server/connection has error(s)
SPARK-3502: SO_RCVBUF and SO_SNDBUF should be bootstrap childOption, not option
SPARK-3503: Disable thread local cache in PooledByteBufAllocator

TODO before mergeable:
- [x] Implement uploadBlock()
- [x] Unit tests for RPC side of code
- [x] Performance testing (see comments [here](https://github.com/apache/spark/pull/2753#issuecomment-59475022))
- [x] Turn OFF by default (currently on for unit testing)

Author: Reynold Xin <rxin@apache.org>
Author: Aaron Davidson <aaron@databricks.com>
Author: cocoatomo <cocoatomo77@gmail.com>
Author: Patrick Wendell <pwendell@gmail.com>
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Davies Liu <davies.liu@gmail.com>
Author: Anand Avati <avati@redhat.com>

Closes #2753 from aarondav/netty and squashes the following commits:

cadfd28 [Aaron Davidson] Turn netty off by default
d7be11b [Aaron Davidson] Turn netty on by default
4a204b8 [Aaron Davidson] Fail block fetches if client connection fails
2b0d1c0 [Aaron Davidson] 100ch
0c5bca2 [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty
14e37f7 [Aaron Davidson] Address Reynold's comments
8dfcceb [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty
322dfc1 [Aaron Davidson] Address Reynold's comments, including major rename
e5675a4 [Aaron Davidson] Fail outstanding RPCs as well
ccd4959 [Aaron Davidson] Don't throw exception if client immediately fails
9da0bc1 [Aaron Davidson] Add RPC unit tests
d236dfd [Aaron Davidson] Remove no-op serializer :)
7b7a26c [Aaron Davidson] Fix Nio compile issue
dd420fd [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty-test
939f276 [Aaron Davidson] Attempt to make comm. bidirectional
aa58f67 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings
8dc1ded [cocoatomo] [SPARK-3867][PySpark] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed
5b5dbe6 [Prashant Sharma] [SPARK-2924] Required by scala 2.11, only one fun/ctor amongst overriden alternatives, can have default argument(s).
2c5d9dc [Patrick Wendell] HOTFIX: Fix build issue with Akka 2.3.4 upgrade.
020691e [Davies Liu] [SPARK-3886] [PySpark] use AutoBatchedSerializer by default
ae4083a [Anand Avati] [SPARK-2805] Upgrade Akka to 2.3.4
29c6dcf [Aaron Davidson] [SPARK-3453] Netty-based BlockTransferService, extracted from Spark core
f7e7568 [Reynold Xin] Fixed spark.shuffle.io.receiveBuffer setting.
5d98ce3 [Reynold Xin] Flip buffer.
f6c220d [Reynold Xin] Merge with latest master.
407e59a [Reynold Xin] Fix style violation.
a0518c7 [Reynold Xin] Implemented block uploads.
4b18db2 [Reynold Xin] Copy the buffer in fetchBlockSync.
bec4ea2 [Reynold Xin] Removed OIO and added num threads settings.
1bdd7ee [Reynold Xin] Fixed tests.
d68f328 [Reynold Xin] Logging close() in case close() fails.
f63fb4c [Reynold Xin] Add more debug message.
6afc435 [Reynold Xin] Added logging.
c066309 [Reynold Xin] Implement java.io.Closeable interface.
519d64d [Reynold Xin] Mark private package visibility and MimaExcludes.
f0a16e9 [Reynold Xin] Fixed test hanging.
14323a5 [Reynold Xin] Removed BlockManager.getLocalShuffleFromDisk.
b2f3281 [Reynold Xin] Added connection pooling.
d23ed7b [Reynold Xin] Incorporated feedback from Norman: - use same pool for boss and worker - remove ioratio - disable caching of byte buf allocator - childoption sendbuf/receivebuf - fire exception through pipeline
9e0cb87 [Reynold Xin] Fixed BlockClientHandlerSuite
5cd33d7 [Reynold Xin] Fixed style violation.
cb589ec [Reynold Xin] Added more test cases covering cleanup when fault happens in ShuffleBlockFetcherIteratorSuite
1be4e8e [Reynold Xin] Shorten NioManagedBuffer and NettyManagedBuffer class names.
108c9ed [Reynold Xin] Forgot to add TestSerializer to the commit list.
b5c8d1f [Reynold Xin] Fixed ShuffleBlockFetcherIteratorSuite.
064747b [Reynold Xin] Reference count buffers and clean them up properly.
2b44cf1 [Reynold Xin] Added more documentation.
1760d32 [Reynold Xin] Use Epoll.isAvailable in BlockServer as well.
165eab1 [Reynold Xin] [SPARK-3453] Refactor Netty module to use BlockTransferService.

dff01553

[SPARK-4129][MLlib] Performance tuning in MultivariateOnlineSummarizer · 51ce9973

DB Tsai authored 10 years ago

In MultivariateOnlineSummarizer, breeze's activeIterator is used
to loop through the nonZero elements in the vector. However,
activeIterator doesn't perform well due to lots of overhead.
In this PR, native while loop is used for both DenseVector and SparseVector.

The benchmark result with 20 executors using mnist8m dataset:
Before:
DenseVector: 48.2 seconds
SparseVector: 16.3 seconds

After:
DenseVector: 17.8 seconds
SparseVector: 11.2 seconds

Since MultivariateOnlineSummarizer is used in several places,
the overall performance gain in mllib library will be significant with this PR.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #2992 from dbtsai/SPARK-4129 and squashes the following commits:

b99db6c [DB Tsai] fixed java.lang.ArrayIndexOutOfBoundsException
2b5e882 [DB Tsai] small refactoring
ebe3e74 [DB Tsai] First commit

51ce9973

[FIX] disable benchmark code · 1559495d

Xiangrui Meng authored 10 years ago

I forgot to disable the benchmark code in #2937, which increased the Jenkins build time by couple minutes.

aarondav

Author: Xiangrui Meng <meng@databricks.com>

Closes #2990 from mengxr/disable-benchmark and squashes the following commits:

c58f070 [Xiangrui Meng] disable benchmark code

1559495d

Oct 28, 2014

[SPARK-4133] [SQL] [PySpark] type conversionfor python udf · 8c0bfd08

Davies Liu authored 10 years ago

Call Python UDF on ArrayType/MapType/PrimitiveType, the returnType can also be ArrayType/MapType/PrimitiveType.

For StructType, it will act as tuple (without attributes). If returnType is StructType, it also should be tuple.

Author: Davies Liu <davies@databricks.com>

Closes #2973 from davies/udf_array and squashes the following commits:

306956e [Davies Liu] Merge branch 'master' of github.com:apache/spark into udf_array
2c00e43 [Davies Liu] fix merge
11395fa [Davies Liu] Merge branch 'master' of github.com:apache/spark into udf_array
9df50a2 [Davies Liu] address comments
79afb4e [Davies Liu] type conversionfor python udf

8c0bfd08

[SPARK-3904] [SQL] add constant objectinspector support for udfs · b5e79bf8

Cheng Hao authored 10 years ago

In HQL, we convert all of the data type into normal `ObjectInspector`s for UDFs, most of cases it works, however, some of the UDF actually requires its children `ObjectInspector` to be the `ConstantObjectInspector`, which will cause exception.
e.g.
select named_struct("x", "str") from src limit 1;

I updated the method `wrap` by adding the one more parameter `ObjectInspector`(to describe what it expects to wrap to, for example: java.lang.Integer or IntWritable).

As well as the `unwrap` method by providing the input `ObjectInspector`.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2762 from chenghao-intel/udf_coi and squashes the following commits:

bcacfd7 [Cheng Hao] Shim for both Hive 0.12 & 0.13.1
2416e5d [Cheng Hao] revert to hive 0.12
5793c01 [Cheng Hao] add space before while
4e56e1b [Cheng Hao] style issue
683d3fd [Cheng Hao] Add golden files
fe591e4 [Cheng Hao] update HiveGenericUdf for set the ObjectInspector while constructing the DeferredObject
f6740fe [Cheng Hao] Support Constant ObjectInspector for Map & List
8814c3a [Cheng Hao] Passing ContantObjectInspector(when necessary) for UDF initializing

b5e79bf8

[SPARK-4008] Fix "kryo with fold" in KryoSerializerSuite · 1536d703

zsxwing authored 10 years ago

`zeroValue` will be serialized by `spark.closure.serializer` but `spark.closure.serializer` only supports the default Java serializer. So it must not be `ClassWithoutNoArgConstructor`, which can not be serialized by the Java serializer.

This PR changed `zeroValue` to null and updated the test to make it work correctly.

Author: zsxwing <zsxwing@gmail.com>

Closes #2856 from zsxwing/SPARK-4008 and squashes the following commits:

51da655 [zsxwing] [SPARK-4008] Fix "kryo with fold" in KryoSerializerSuite

1536d703

[SPARK-4084] Reuse sort key in Sorter · 84e5da87

Xiangrui Meng authored 10 years ago

Sorter uses generic-typed key for sorting. When data is large, it creates lots of key objects, which is not efficient. We should reuse the key in Sorter for memory efficiency. This change is part of the petabyte sort implementation from rxin .

The `Sorter` class was written in Java and marked package private. So it is only available to `org.apache.spark.util.collection`. I renamed it to `TimSort` and add a simple wrapper of it, still called `Sorter`, in Scala, which is `private[spark]`.

The benchmark code is updated, which now resets the array before each run. Here is the result on sorting primitive Int arrays of size 25 million using Sorter:

~~~
[info] - Sorter benchmark for key-value pairs !!! IGNORED !!!
Java Arrays.sort() on non-primitive int array: Took 13237 ms
Java Arrays.sort() on non-primitive int array: Took 13320 ms
Java Arrays.sort() on non-primitive int array: Took 15718 ms
Java Arrays.sort() on non-primitive int array: Took 13283 ms
Java Arrays.sort() on non-primitive int array: Took 13267 ms
Java Arrays.sort() on non-primitive int array: Took 15122 ms
Java Arrays.sort() on non-primitive int array: Took 15495 ms
Java Arrays.sort() on non-primitive int array: Took 14877 ms
Java Arrays.sort() on non-primitive int array: Took 16429 ms
Java Arrays.sort() on non-primitive int array: Took 14250 ms
Java Arrays.sort() on non-primitive int array: (13878 ms first try, 14499 ms average)
Java Arrays.sort() on primitive int array: Took 2683 ms
Java Arrays.sort() on primitive int array: Took 2683 ms
Java Arrays.sort() on primitive int array: Took 2701 ms
Java Arrays.sort() on primitive int array: Took 2746 ms
Java Arrays.sort() on primitive int array: Took 2685 ms
Java Arrays.sort() on primitive int array: Took 2735 ms
Java Arrays.sort() on primitive int array: Took 2669 ms
Java Arrays.sort() on primitive int array: Took 2693 ms
Java Arrays.sort() on primitive int array: Took 2680 ms
Java Arrays.sort() on primitive int array: Took 2642 ms
Java Arrays.sort() on primitive int array: (2948 ms first try, 2691 ms average)
Sorter without key reuse on primitive int array: Took 10732 ms
Sorter without key reuse on primitive int array: Took 12482 ms
Sorter without key reuse on primitive int array: Took 10718 ms
Sorter without key reuse on primitive int array: Took 12650 ms
Sorter without key reuse on primitive int array: Took 10747 ms
Sorter without key reuse on primitive int array: Took 10783 ms
Sorter without key reuse on primitive int array: Took 12721 ms
Sorter without key reuse on primitive int array: Took 10604 ms
Sorter without key reuse on primitive int array: Took 10622 ms
Sorter without key reuse on primitive int array: Took 11843 ms
Sorter without key reuse on primitive int array: (11089 ms first try, 11390 ms average)
Sorter with key reuse on primitive int array: Took 5141 ms
Sorter with key reuse on primitive int array: Took 5298 ms
Sorter with key reuse on primitive int array: Took 5066 ms
Sorter with key reuse on primitive int array: Took 5164 ms
Sorter with key reuse on primitive int array: Took 5203 ms
Sorter with key reuse on primitive int array: Took 5274 ms
Sorter with key reuse on primitive int array: Took 5186 ms
Sorter with key reuse on primitive int array: Took 5159 ms
Sorter with key reuse on primitive int array: Took 5164 ms
Sorter with key reuse on primitive int array: Took 5078 ms
Sorter with key reuse on primitive int array: (5311 ms first try, 5173 ms average)
~~~

So with key reuse, it is faster and less likely to trigger GC.

Author: Xiangrui Meng <meng@databricks.com>
Author: Reynold Xin <rxin@apache.org>

Closes #2937 from mengxr/SPARK-4084 and squashes the following commits:

d73c3d0 [Xiangrui Meng] address comments
0b7b682 [Xiangrui Meng] fix mima
a72f53c [Xiangrui Meng] update timeIt
38ba50c [Xiangrui Meng] update timeIt
720f731 [Xiangrui Meng] add doc about JIT specialization
78f2879 [Xiangrui Meng] update tests
7de2efd [Xiangrui Meng] update the Sorter benchmark code to be correct
8626356 [Xiangrui Meng] add prepare to timeIt and update testsin SorterSuite
5f0d530 [Xiangrui Meng] update method modifiers of SortDataFormat
6ffbe66 [Xiangrui Meng] rename Sorter to TimSort and add a Scala wrapper that is private[spark]
b00db4d [Xiangrui Meng] doc and tests
cf94e8a [Xiangrui Meng] renaming
464ddce [Reynold Xin] cherry-pick rxin's commit

84e5da87

[SPARK-3343] [SQL] Add serde support for CTAS · 4b55482a

Cheng Hao authored 10 years ago

Currently, `CTAS` (Create Table As Select) doesn't support specifying the `SerDe` in HQL. This PR will pass down the `ASTNode` into the physical operator `execution.CreateTableAsSelect`, which will extract the `CreateTableDesc` object via Hive `SemanticAnalyzer`. In the meantime, I also update the `HiveMetastoreCatalog.createTable` to optionally support the `CreateTableDesc` for table creation.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2570 from chenghao-intel/ctas_serde and squashes the following commits:

e011ef5 [Cheng Hao] shim for both 0.12 & 0.13.1
cfb3662 [Cheng Hao] revert to hive 0.12
c8a547d [Cheng Hao] Support SerDe properties within CTAS

4b55482a

[Spark 3922] Refactor spark-core to use Utils.UTF_8 · abcafcfb

zsxwing authored 10 years ago

A global UTF8 constant is very helpful to handle encoding problems when converting between String and bytes. There are several solutions here:

1. Add `val UTF_8 = Charset.forName("UTF-8")` to Utils.scala
2. java.nio.charset.StandardCharsets.UTF_8 (require JDK7)
3. io.netty.util.CharsetUtil.UTF_8
4. com.google.common.base.Charsets.UTF_8
5. org.apache.commons.lang.CharEncoding.UTF_8
6. org.apache.commons.lang3.CharEncoding.UTF_8

IMO, I prefer option 1) because people can find it easily.

This is a PR for option 1) and only fixes Spark Core.

Author: zsxwing <zsxwing@gmail.com>

Closes #2781 from zsxwing/SPARK-3922 and squashes the following commits:

f974edd [zsxwing] Merge branch 'master' into SPARK-3922
2d27423 [zsxwing] Refactor spark-core to use Refactor spark-core to use Utils.UTF_8

abcafcfb

[SPARK-3988][SQL] add public API for date type · 47a40f60

Daoyuan Wang authored 10 years ago

Add json and python api for date type.
By using Pickle, `java.sql.Date` was serialized as calendar, and recognized in python as `datetime.datetime`.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #2901 from adrian-wang/spark3988 and squashes the following commits:

c51a24d [Daoyuan Wang] convert datetime to date
5670626 [Daoyuan Wang] minor line combine
f760d8e [Daoyuan Wang] fix indent
444f100 [Daoyuan Wang] fix a typo
1d74448 [Daoyuan Wang] fix scala style
8d7dd22 [Daoyuan Wang] add json and python api for date type

47a40f60

[SPARK-3814][SQL] Support for Bitwise AND(&), OR(|) ,XOR(^), NOT(~) in Spark HQL and SQL · 5807cb40

ravipesala authored 10 years ago

Currently there is no support of Bitwise & , | in Spark HiveQl and Spark SQL as well. So this PR support the same.
I am closing https://github.com/apache/spark/pull/2926 as it has conflicts to merge. And also added support for Bitwise AND(&), OR(|) ,XOR(^), NOT(~) And I handled all review comments in that PR

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #2961 from ravipesala/SPARK-3814-NEW4 and squashes the following commits:

a391c7a [ravipesala] Rebase with master

5807cb40

[SPARK-4058] [PySpark] Log file name is hard coded even though there is a variable '$LOG_FILE ' · 6c1b981c

Kousuke Saruta authored 10 years ago

In a script 'python/run-tests', log file name is represented by a variable 'LOG_FILE' and it is used in run-tests. But, there are some hard-coded log file name in the script.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2905 from sarutak/SPARK-4058 and squashes the following commits:

7710490 [Kousuke Saruta] Fixed python/run-tests not to use hard-coded log file name

6c1b981c

[SPARK-4065] Add check for IPython on Windows · 2f254dac

Michael Griffiths authored 10 years ago

This issue employs logic similar to the bash launcher (pyspark) to check
if IPTYHON=1, and if so launch ipython with options in IPYTHON_OPTS.
This fix assumes that ipython is available in the system Path, and can
be invoked with a plain "ipython" command.

Author: Michael Griffiths <msjgriffiths@gmail.com>

Closes #2910 from msjgriffiths/pyspark-windows and squashes the following commits:

ef34678 [Michael Griffiths] Change build message to comply with [SPARK-3775]
361e3d8 [Michael Griffiths] [SPARK-4065] Add check for IPython on Windows
9ce72d1 [Michael Griffiths] [SPARK-4065] Add check for IPython on Windows

2f254dac

[SPARK-4089][Doc][Minor] The version number of Spark in _config.yaml is wrong. · 4d52cec2

Kousuke Saruta authored 10 years ago

The version number of Spark in docs/_config.yaml for master branch should be 1.2.0 for now.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2943 from sarutak/SPARK-4089 and squashes the following commits:

aba7fb4 [Kousuke Saruta] Fixed the version number of Spark in _config.yaml

4d52cec2

[SPARK-3657] yarn alpha YarnRMClientImpl throws NPE... · 247c529b

Kousuke Saruta authored 10 years ago

[SPARK-3657] yarn alpha YarnRMClientImpl throws NPE appMasterRequest.setTrackingUrl starting spark-shell

tgravescs reported this issue.

Following is quoted from tgravescs' report.

YarnRMClientImpl.registerApplicationMaster can throw null pointer exception when setting the trackingurl if its empty:

    appMasterRequest.setTrackingUrl(new URI(uiAddress).getAuthority())

I hit this just start spark-shell without the tracking url set.

14/09/23 16:18:34 INFO yarn.YarnRMClientImpl: Connecting to ResourceManager at kryptonitered-jt1.red.ygrid.yahoo.com/98.139.154.99:8030
Exception in thread "main" java.lang.NullPointerException
        at org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterRequestProto$Builder.setTrackingUrl(YarnServiceProtos.java:710)
        at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterRequestPBImpl.setTrackingUrl(RegisterApplicationMasterRequestPBImpl.java:132)
        at org.apache.spark.deploy.yarn.YarnRMClientImpl.registerApplicationMaster(YarnRMClientImpl.scala:102)
        at org.apache.spark.deploy.yarn.YarnRMClientImpl.register(YarnRMClientImpl.scala:55)
        at org.apache.spark.deploy.yarn.YarnRMClientImpl.register(YarnRMClientImpl.scala:38)
        at org.apache.spark.deploy.yarn.ApplicationMaster.registerAM(ApplicationMaster.scala:168)
        at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:206)
        at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:120)

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2981 from sarutak/SPARK-3657-2 and squashes the following commits:

e2fd6bc [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3657
70b8882 [Kousuke Saruta] Fixed NPE thrown

247c529b

[SPARK-4096][YARN]let ApplicationMaster accept executor memory argument in... · 1ea3e3dc

WangTaoTheTonic authored 10 years ago

[SPARK-4096][YARN]let ApplicationMaster accept executor memory argument in same format as JVM memory strings

Here `ApplicationMaster` accept executor memory argument only in number format, we should let it accept JVM style memory strings as well.

Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #2955 from WangTaoTheTonic/modifyDesc and squashes the following commits:

ab98c70 [WangTaoTheTonic] append parameter passed in
3779767 [WangTaoTheTonic] Update executor memory description in the help message

1ea3e3dc

[SPARK-4110] Wrong comments about default settings in spark-daemon.sh · 44d8b45a

Kousuke Saruta authored 10 years ago

In spark-daemon.sh, thare are following comments.

    #   SPARK_CONF_DIR  Alternate conf dir. Default is ${SPARK_PREFIX}/conf.
    #   SPARK_LOG_DIR   Where log files are stored.  PWD by default.

But, I think the default value for SPARK_CONF_DIR is `${SPARK_HOME}/conf` and for SPARK_LOG_DIR is `${SPARK_HOME}/logs`.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2972 from sarutak/SPARK-4110 and squashes the following commits:

5a171a2 [Kousuke Saruta] Fixed wrong comments

44d8b45a

[SPARK-4031] Make torrent broadcast read blocks on use. · 7768a800

Shivaram Venkataraman authored 10 years ago

This avoids reading torrent broadcast variables when they are referenced in the closure but not used in the closure. This is done by using a `lazy val` to read broadcast blocks

cc rxin JoshRosen for review

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #2871 from shivaram/broadcast-read-value and squashes the following commits:

1456d65 [Shivaram Venkataraman] Use getUsedTimeMs and remove readObject
d6c5ee9 [Shivaram Venkataraman] Use laxy val to implement readBroadcastBlock
0b34df7 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/apache/spark into broadcast-read-value
9cec507 [Shivaram Venkataraman] Test if broadcast variables are read lazily
768b40b [Shivaram Venkataraman] Merge branch 'master' of https://github.com/apache/spark into broadcast-read-value
8792ed8 [Shivaram Venkataraman] Make torrent broadcast read blocks on use. This avoids reading broadcast variables when they are referenced in the closure but not used by the code.

7768a800

[SPARK-4098][YARN]use appUIAddress instead of appUIHostPort in yarn-client mode · 0ac52e30

WangTaoTheTonic authored 10 years ago

https://issues.apache.org/jira/browse/SPARK-4098

Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #2958 from WangTaoTheTonic/useAddress and squashes the following commits:

29236e6 [WangTaoTheTonic] use appUIAddress instead of appUIHostPort in yarn-cluster mode

0ac52e30