  1. Aug 28, 2014
    • Ankur Dave
      [SPARK-3190] Avoid overflow in VertexRDD.count() · 96df9290
      Ankur Dave authored
      VertexRDDs with more than 4 billion elements are counted incorrectly due to integer overflow when summing partition sizes. This PR fixes the issue by converting partition sizes to Longs before summing them.
      
      The following code previously returned -10000000. After applying this PR, it returns the correct answer of 5000000000 (5 billion).
      
      ```scala
      val pairs = sc.parallelize(0L until 500L).map(_ * 10000000)
        .flatMap(start => start until (start + 10000000)).map(x => (x, x))
      VertexRDD(pairs).count()
      ```
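
      The essence of the fix, as a minimal sketch (illustrative only; the real change lives inside VertexRDD.count()): widen each partition's Int size to Long before summing, so the total can no longer wrap around.

      ```scala
      // Sketch, not the actual VertexRDD code: summing Int sizes overflows once the
      // total exceeds Int.MaxValue, so widen to Long before adding.
      def totalCount(partitionSizes: Seq[Int]): Long =
        partitionSizes.map(_.toLong).sum
      ```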
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #2106 from ankurdave/SPARK-3190 and squashes the following commits:
      
      641f468 [Ankur Dave] Avoid overflow in VertexRDD.count()
      96df9290
    • Yadong Qi
      [SPARK-3285] [examples] Using values.sum is easier to understand than using... · 39012452
      Yadong Qi authored
      [SPARK-3285] [examples] Using values.sum is easier to understand than using values.foldLeft(0)(_ + _)
      
      def sum[B >: A](implicit num: Numeric[B]): B = foldLeft(num.zero)(num.plus)
      Using values.sum is easier to understand than values.foldLeft(0)(_ + _), so we should use values.sum instead.
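
      For reference, the two forms compute the same result (arbitrary values used here); sum simply states the intent directly:

      ```scala
      val values = Seq(1, 2, 3)
      values.foldLeft(0)(_ + _)  // 6
      values.sum                 // 6
      ```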
      
      Author: Yadong Qi <qiyadong2010@gmail.com>
      
      Closes #2182 from watermen/bug-fix3 and squashes the following commits:
      
      17be9fb [Yadong Qi] Update CheckpointSuite.scala
      714bda5 [Yadong Qi] Update BasicOperationsSuite.scala
      57e704c [Yadong Qi] Update StatefulNetworkWordCount.scala
      39012452
    • Reynold Xin
      [SPARK-3281] Remove Netty specific code in BlockManager / shuffle · be53c54b
      Reynold Xin authored
      Netty functionality will be added back in subsequent PRs by using the BlockTransferService interface.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2181 from rxin/SPARK-3281 and squashes the following commits:
      
      5494b0e [Reynold Xin] Fix extra port.
      ff6d1e1 [Reynold Xin] [SPARK-3281] Remove Netty specific code in BlockManager.
      be53c54b
    • Andrew Or
      [SPARK-3264] Allow users to set executor Spark home in Mesos · 41dc5987
      Andrew Or authored
      The executors and the driver may not share the same Spark home. There is currently one way to set the executor-side Spark home in Mesos: setting `spark.home`. However, this is neither documented nor intuitive. This PR adds a more specific config, `spark.mesos.executor.home`, and exposes it to the user.
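
      A minimal usage sketch of the new config (the executor-side path below is a placeholder):

      ```scala
      import org.apache.spark.SparkConf

      // Point Mesos executors at their own Spark installation, independent of the driver's.
      val conf = new SparkConf()
        .set("spark.mesos.executor.home", "/opt/spark")
      ```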
      
      liancheng tnachen
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2166 from andrewor14/mesos-spark-home and squashes the following commits:
      
      b87965e [Andrew Or] Merge branch 'master' of github.com:apache/spark into mesos-spark-home
      f6abb2e [Andrew Or] Document spark.mesos.executor.home
      ca7846d [Andrew Or] Add more specific configuration for executor Spark home in Mesos
      41dc5987
    • Cheng Lian
      [SPARK-2608][Core] Fixed command line option passing issue over Mesos via SPARK_EXECUTOR_OPTS · 6d392b36
      Cheng Lian authored
      This is another try after #2145 to fix [SPARK-2608](https://issues.apache.org/jira/browse/SPARK-2608).
      
      The basic idea is to pass `extraJavaOpts` and `extraLibraryPath` together via the environment variable `SPARK_EXECUTOR_OPTS`. This variable is recognized by `spark-class` and not used anywhere else. In this way, we still launch Mesos executors with `spark-class`/`spark-executor`, but avoid the executor-side Spark home issue.
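
      A hedged sketch of the idea (not the actual scheduler backend code; it only shows how the two settings could be folded into the single value exported as `SPARK_EXECUTOR_OPTS`):

      ```scala
      import org.apache.spark.SparkConf

      // Sketch only: combine the two executor options into one string.
      val conf = new SparkConf()
      val extraJavaOpts    = conf.getOption("spark.executor.extraJavaOptions")
      val extraLibraryPath = conf.getOption("spark.executor.extraLibraryPath")
        .map(p => s"-Djava.library.path=$p")
      val sparkExecutorOpts = (extraJavaOpts ++ extraLibraryPath).mkString(" ")
      // The backend would then set SPARK_EXECUTOR_OPTS=sparkExecutorOpts in the
      // executor's launch environment, where spark-class picks it up.
      ```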
      
      Quoted strings with spaces are not allowed in either `extraJavaOpts` or `extraLibraryPath` when using Spark over Mesos. The reason is that Mesos passes the whole command line as a single string argument to `sh -c` to start the executor, which makes shell string escaping non-trivial to handle. This should be fixed in a later release.
      
      Classes in the package `org.apache.spark.deploy` shouldn't be used because they assume Spark is deployed in standalone mode and give the wrong executor-side Spark home directory. Please refer to the comments in #2145 for more details.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2161 from liancheng/mesos-fix-with-env-var and squashes the following commits:
      
      ba59190 [Cheng Lian] Added fine grained Mesos executor support
      1174076 [Cheng Lian] Draft fix for CoarseMesosSchedulerBackend
      
      (cherry picked from commit 935bffe3)
      Signed-off-by: Reynold Xin <rxin@apache.org>
      6d392b36
    • Tatiana Borisova
      [SPARK-3150] Fix NullPointerException in Spark recovery: Add initializing... · 70d81466
      Tatiana Borisova authored
      [SPARK-3150] Fix NullPointerException in Spark recovery: Add initializing default values in DriverInfo.init()
      
      The issue happens when Spark runs standalone on a cluster.
      When the master and the driver fail simultaneously on one node of the cluster, the master tries to recover its state and restart the Spark driver.
      While restarting the driver, it fails with an NPE.
      After failing, it restarts, tries to recover its state, and restarts the Spark driver again, over and over in an infinite cycle.
      Namely, Spark tries to read the DriverInfo state from ZooKeeper, but after reading, DriverInfo.worker turns out to be null.
      
      https://issues.apache.org/jira/browse/SPARK-3150
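
      A sketch of the pattern such a fix follows (a hedged illustration, not the verbatim DriverInfo code; field and class names are assumptions): re-initialize the transient fields right after deserialization so they can never come back null.

      ```scala
      // Sketch only: WorkerInfo stands in for whatever the transient field holds.
      class WorkerInfo

      class DriverInfoSketch extends Serializable {
        @transient var worker: Option[WorkerInfo] = None

        init()  // also run from the constructor, so fresh instances start clean

        private def init(): Unit = {
          worker = None
          // reset any other transient fields to safe defaults here
        }

        // Java serialization hook: re-run init() after the object is read back from
        // the persisted state, so transient fields are never left null during recovery.
        private def readObject(in: java.io.ObjectInputStream): Unit = {
          in.defaultReadObject()
          init()
        }
      }
      ```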
      
      Author: Tatiana Borisova <tanyatik@yandex.ru>
      
      Closes #2062 from tanyatik/spark-3150 and squashes the following commits:
      
      9936043 [Tatiana Borisova] Add initializing default values in DriverInfo.init()
      70d81466
    • Michael Armbrust
      [SPARK-3230][SQL] Fix udfs that return structs · 76e3ba42
      Michael Armbrust authored
      We need to convert the case classes into Rows.
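
      For context, a hypothetical UDF of the affected kind (the names and the registration call below are illustrative, not taken from the patch):

      ```scala
      // The UDF's Scala result is a case class, which SQL sees as a struct and which
      // therefore has to be converted into a Row before the query can consume it.
      case class Point(x: Int, y: Int)
      val makePoint = (x: Int, y: Int) => Point(x, y)
      // e.g. registered via the 1.1-era API: sqlContext.registerFunction("makePoint", makePoint)
      ```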
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2133 from marmbrus/structUdfs and squashes the following commits:
      
      189722f [Michael Armbrust] Merge remote-tracking branch 'origin/master' into structUdfs
      8e29b1c [Michael Armbrust] Use existing function
      d8d0b76 [Michael Armbrust] Fix udfs that return structs
      76e3ba42
    • Cheng Lian
      [SQL] Fixed 2 comment typos in SQLConf · 68f75dcd
      Cheng Lian authored
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2172 from liancheng/sqlconf-typo and squashes the following commits:
      
      115cc71 [Cheng Lian] Fixed 2 comment typos in SQLConf
      68f75dcd
    • Michael Armbrust
      [HOTFIX][SQL] Remove cleaning of UDFs · 024178c5
      Michael Armbrust authored
      It is not safe to run the closure cleaner on slaves. #2153 introduced this, which broke all UDF execution on slaves. Cleaning of UDF closures will be re-added in a follow-up PR.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2174 from marmbrus/fixUdfs and squashes the following commits:
      
      55406de [Michael Armbrust] [HOTFIX] Remove cleaning of UDFs
      024178c5
    • Andrew Or
      [HOTFIX] Wait for EOF only for the PySpark shell · dafe3434
      Andrew Or authored
      In `SparkSubmitDriverBootstrapper`, we wait for the parent process to send us an `EOF` before finishing the application. This is applicable for the PySpark shell because we terminate the application the same way. However, if we run a python application, for instance, the JVM never exits unless it receives a manual EOF from the user. This was causing a few tests to time out.
      
      We only need to do this for the PySpark shell because Spark submit runs as a python subprocess only in this case. Thus, the normal Spark shell doesn't need to go through this case even though it is also a REPL.
      
      Thanks davies for reporting this.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2170 from andrewor14/bootstrap-hotfix and squashes the following commits:
      
      42963f5 [Andrew Or] Do not wait for EOF unless this is the pyspark shell
      dafe3434
  2. Aug 27, 2014
    • Rob O'Dwyer
      SPARK-3265 Allow using custom ipython executable with pyspark · f38fab97
      Rob O'Dwyer authored
      Although you can make pyspark use ipython with `IPYTHON=1`, and also change the python executable with `PYSPARK_PYTHON=...`, you can't use both at the same time because it hardcodes the default ipython script.
      
      This makes it use the `PYSPARK_PYTHON` variable if present and fall back to default python, similarly to how the default python executable is handled.
      
      So you can use a custom ipython like so:
      `PYSPARK_PYTHON=./anaconda/bin/ipython IPYTHON_OPTS="notebook" pyspark`
      
      Author: Rob O'Dwyer <odwyerrob@gmail.com>
      
      Closes #2167 from robbles/patch-1 and squashes the following commits:
      
      d98e8a9 [Rob O'Dwyer] Allow using custom ipython executable with pyspark
      f38fab97
    • scwf
      [SPARK-3271] delete unused methods in Utils · b86277c1
      scwf authored
      Delete unused methods in Utils.
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #2160 from scwf/delete-no-use-method and squashes the following commits:
      
      d8f6b0d [scwf] delete no use method in Utils
      b86277c1
    • Matthew Farrellee
      Add line continuation for script to work w/ py2.7.5 · 64d8ecbb
      Matthew Farrellee authored
      Error was -
      
      $ SPARK_HOME=$PWD/dist ./dev/create-release/generate-changelist.py
        File "./dev/create-release/generate-changelist.py", line 128
          if day < SPARK_REPO_CHANGE_DATE1 or
                                            ^
      SyntaxError: invalid syntax
      
      Author: Matthew Farrellee <matt@redhat.com>
      
      Closes #2139 from mattf/master-fix-generate-changelist.py-0 and squashes the following commits:
      
      6b3a900 [Matthew Farrellee] Add line continuation for script to work w/ py2.7.5
      64d8ecbb
    • Patrick Wendell
      8712653f
    • Michael Armbrust
      [SPARK-3235][SQL] Ensure in-memory tables don't always broadcast. · 7d2a7a91
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2147 from marmbrus/inMemDefaultSize and squashes the following commits:
      
      5390360 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into inMemDefaultSize
      14204d3 [Michael Armbrust] Set the context before creating SparkLogicalPlans.
      8da4414 [Michael Armbrust] Make sure we throw errors when leaf nodes fail to provide statistcs
      18ce029 [Michael Armbrust] Ensure in-memory tables don't always broadcast.
      7d2a7a91
    • luogankun
      [SPARK-3065][SQL] Add locale setting to fix results do not match for... · 65253502
      luogankun authored
      [SPARK-3065][SQL] Add a locale setting to fix mismatched results for udf_unix_timestamp with format "yyyy MMM dd h:mm:ss a" when HiveCompatibilitySuite runs in a time zone other than "America/Los_Angeles"

      Running the udf_unix_timestamp test case of org.apache.spark.sql.hive.execution.HiveCompatibilitySuite
      in a time zone other than "America/Los_Angeles" throws an error. [https://issues.apache.org/jira/browse/SPARK-3065]
      This adds a locale setting in the beforeAll and afterAll methods to fix the HiveCompatibilitySuite test case.
      
      Author: luogankun <luogankun@gmail.com>
      
      Closes #1968 from luogankun/SPARK-3065 and squashes the following commits:
      
      c167832 [luogankun] [SPARK-3065][SQL] Add Locale setting to HiveCompatibilitySuite
      0a25e3a [luogankun] [SPARK-3065][SQL] Add Locale setting to HiveCompatibilitySuite
      65253502
    • Aaron Davidson
      [SQL] [SPARK-3236] Reading Parquet tables from Metastore mangles location · cc275f4b
      Aaron Davidson authored
      Currently we do `relation.hiveQlTable.getDataLocation.getPath`, which returns the path-part of the URI (e.g., "s3n://my-bucket/my-path" => "/my-path"). We should do `relation.hiveQlTable.getDataLocation.toString` instead, as a URI's toString returns a faithful representation of the full URI, which can later be passed into a Hadoop Path.
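
      A small illustration of the difference, using java.net.URI for the same location:

      ```scala
      import java.net.URI

      val location = new URI("s3n://my-bucket/my-path")
      location.getPath   // "/my-path"                 -- drops the scheme and bucket
      location.toString  // "s3n://my-bucket/my-path"  -- the full URI, safe to wrap in a Hadoop Path
      ```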
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #2150 from aarondav/parquet-location and squashes the following commits:
      
      459f72c [Aaron Davidson] [SQL] [SPARK-3236] Reading Parquet tables from Metastore mangles location
      cc275f4b
    • viirya
      [SPARK-3252][SQL] Add missing condition for test · 28d41d62
      viirya authored
      According to the test's message text, both relations should be tested, so add the missing condition.
      
      Author: viirya <viirya@gmail.com>
      
      Closes #2159 from viirya/fix_test and squashes the following commits:
      
      b1c0f52 [viirya] add missing condition.
      28d41d62
    • Andrew Or
      [SPARK-3243] Don't use stale spark-driver.* system properties · 63a053ab
      Andrew Or authored
      If we set both `spark.driver.extraClassPath` and `--driver-class-path`, then the latter correctly overrides the former. However, the value of the system property `spark.driver.extraClassPath` still uses the former, which is actually not added to the class path. This may cause some confusion...
      
      Of course, this also affects other options (i.e. java options, library path, memory...).
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2154 from andrewor14/driver-submit-configs-fix and squashes the following commits:
      
      17ec6fc [Andrew Or] Fix tests
      0140836 [Andrew Or] Don't forget spark.driver.memory
      e39d20f [Andrew Or] Also set spark.driver.extra* configs in client mode
      63a053ab
    • Vida Ha
      Spark-3213 Fixes issue with spark-ec2 not detecting slaves created with "Launch More like this" · 7faf755a
      Vida Ha authored
      ... copy the spark_cluster_tag from the spot instance requests over to the instances.
      
      Author: Vida Ha <vida@databricks.com>
      
      Closes #2163 from vidaha/vida/spark-3213 and squashes the following commits:
      
      5070a70 [Vida Ha] Spark-3214 Fix issue with spark-ec2 not detecting slaves created with 'Launch More Like This' and using Spot Requests
      7faf755a
    • Davies Liu
      [SPARK-2871] [PySpark] add RDD.lookup(key) · 4fa2fda8
      Davies Liu authored
      RDD.lookup(key)
      
              Return the list of values in the RDD for key `key`. This operation
              is done efficiently if the RDD has a known partitioner by only
              searching the partition that the key maps to.
      
              >>> l = range(1000)
              >>> rdd = sc.parallelize(zip(l, l), 10)
              >>> rdd.lookup(42)  # slow
              [42]
              >>> sorted = rdd.sortByKey()
              >>> sorted.lookup(42)  # fast
              [42]
      
      It also cleans up the code in RDD.py and fixes several bugs (related to preservesPartitioning).
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2093 from davies/lookup and squashes the following commits:
      
      1789cd4 [Davies Liu] `f` in foreach could be generator or not.
      2871b80 [Davies Liu] Merge branch 'master' into lookup
      c6390ea [Davies Liu] address all comments
      0f1bce8 [Davies Liu] add test case for lookup()
      be0e8ba [Davies Liu] fix preservesPartitioning
      eb1305d [Davies Liu] add RDD.lookup(key)
      4fa2fda8
    • chutium
      [SPARK-3138][SQL] sqlContext.parquetFile should be able to take a single file as parameter · 48f42781
      chutium authored
      ```if (!fs.getFileStatus(path).isDir) throw Exception``` makes no sense after commit #1370.

      If someone is working on SPARK-2551, be careful to make sure the new change passes the test case ```test("Read a parquet file instead of a directory")```.
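
      A hedged sketch of the relaxed check (the helper name is made up; the actual change lives in the Parquet relation code):

      ```scala
      import org.apache.hadoop.fs.{FileSystem, Path}

      // Sketch only: accept either a directory or a single Parquet file instead of
      // rejecting anything that is not a directory.
      def checkPath(fs: FileSystem, path: Path): Unit = {
        require(fs.exists(path), s"Path $path does not exist")
        // previously: require(fs.getFileStatus(path).isDir, ...)  -- rejected single files
      }
      ```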
      
      Author: chutium <teng.qiu@gmail.com>
      
      Closes #2044 from chutium/parquet-singlefile and squashes the following commits:
      
      4ae477f [chutium] [SPARK-3138][SQL] sqlContext.parquetFile should be able to take a single file as parameter
      48f42781
    • Chip Senkbeil
      [SPARK-3256] Added support for :cp <jar> that was broken in Scala 2.10.x for REPL · 191d7cf2
      Chip Senkbeil authored
      As seen with [SI-6502](https://issues.scala-lang.org/browse/SI-6502) of Scala, the _:cp_ command was broken in Scala 2.10.x. As the Spark shell is a friendly wrapper on top of the Scala REPL, it is also affected by this problem.
      
      My solution was to alter the internal classpath and invalidate any new entries. I also had to add the ability to add new entries to the parent classloader of the interpreter (SparkIMain's global).
      
      The advantage of this versus wiping the interpreter and replaying all of the commands is that you don't have to worry about rerunning heavy Spark-related commands (going to the cluster) or potentially reloading data that might have changed. Instead, you get to work from where you left off.
      
      Until this is fixed upstream for 2.10.x, I had to use reflection to alter the internal compiler classpath.
      
      The solution now looks like this:
      ![screen shot 2014-08-13 at 3 46 02 pm](https://cloud.githubusercontent.com/assets/2481802/3912625/f02b1440-232c-11e4-9bf6-bafb3e352d14.png)
      
      Author: Chip Senkbeil <rcsenkbe@us.ibm.com>
      
      Closes #1929 from rcsenkbeil/FixReplClasspathSupport and squashes the following commits:
      
      f420cbf [Chip Senkbeil] Added SparkContext.addJar calls to support executing code on remote clusters
      a826795 [Chip Senkbeil] Updated AddUrlsToClasspath to use 'new Run' suggestion over hackish compiler error
      2ff1d86 [Chip Senkbeil] Added compilation failure on symbols hack to get Scala classes to load correctly
      a220639 [Chip Senkbeil] Added support for :cp <jar> that was broken in Scala 2.10.x for REPL
      191d7cf2
    • Cheng Hao
      [SPARK-3197] [SQL] Reduce the Expression tree object creations for aggregation function (min/max) · 4238c17d
      Cheng Hao authored
      The aggregation functions min/max in Catalyst create an expression tree for each single row; however, expression tree creation is currently quite expensive in a multithreaded environment, hence the very poor performance of min/max.
      Here is the benchmark that I ran locally.
      
      Master | Previous Result (ms) | Current Result (ms)
      ------------ | ------------- | -------------
      local | 3645 | 3416
      local[6] | 3602 | 1002
      
      The Benchmark source code.
      ```
      case class Record(key: Int, value: Int)
      
      object TestHive2 extends HiveContext(new SparkContext("local[6]", "TestSQLContext", new SparkConf()))
      
      object DataPrepare extends App {
        import TestHive2._
      
        val rdd = sparkContext.parallelize((1 to 10000000).map(i => Record(i % 3000, i)), 12)
      
        runSqlHive("SHOW TABLES")
        runSqlHive("DROP TABLE if exists a")
        runSqlHive("DROP TABLE if exists result")
        rdd.registerAsTable("records")
      
        runSqlHive("""CREATE TABLE a (key INT, value INT)
                       | ROW FORMAT SERDE
                       | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                       | STORED AS RCFILE
                     """.stripMargin)
        runSqlHive("""CREATE TABLE result (key INT, value INT)
                       | ROW FORMAT SERDE
                       | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                       | STORED AS RCFILE
                     """.stripMargin)
      
        hql(s"""from records
                   | insert into table a
                   | select key, value
                 """.stripMargin)
      }
      
      object PerformanceTest extends App {
        import TestHive2._
      
        hql("SHOW TABLES")
        hql("set spark.sql.shuffle.partitions=12")
      
        val cmd = "select min(value), max(value) from a group by key"
      
        val results = ("Result1", benchmark(cmd)) ::
                      ("Result2", benchmark(cmd)) ::
                      ("Result3", benchmark(cmd)) :: Nil
        results.foreach { case (prompt, result) => {
            println(s"$prompt: took ${result._1} ms (${result._2} records)")
          }
        }
      
        def benchmark(cmd: String) = {
          val begin = System.currentTimeMillis()
          val count = hql(cmd).count
          val end = System.currentTimeMillis()
          ((end - begin), count)
        }
      }
      ```
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #2113 from chenghao-intel/aggregation_expression_optimization and squashes the following commits:
      
      db40395 [Cheng Hao] remove the transient and add val for the expression property
      d56167d [Cheng Hao] Reduce the Expressions creation
      4238c17d
    • u0jing
      [SPARK-3118][SQL] Add "SHOW TBLPROPERTIES tblname;" and "SHOW COLUMNS (FROM|IN)... · 3b5eb708
      u0jing authored
      [SPARK-3118][SQL] Add "SHOW TBLPROPERTIES tblname;" and "SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name]" support
      
      JIRA issue: [SPARK-3118] https://issues.apache.org/jira/browse/SPARK-3118
      
      eg:
      > SHOW TBLPROPERTIES test;
      SHOW TBLPROPERTIES test;
      numPartitions	0
      numFiles	1
      transient_lastDdlTime	1407923642
      numRows	0
      totalSize	82
      rawDataSize	0
      
      eg:
      > SHOW COLUMNS  in test;
      SHOW COLUMNS  in test;
      OK
      Time taken: 0.304 seconds
      id
      stid
      bo
      
      Author: u0jing <u9jing@gmail.com>
      
      Closes #2034 from u0jing/spark-3118 and squashes the following commits:
      
      b231d87 [u0jing] add golden answer files
      35f4885 [u0jing] add 'show columns' and 'show tblproperties' support
      3b5eb708
    • Allan Douglas R. de Oliveira
      SPARK-3259 - User data should be given to the master · 5ac4093c
      Allan Douglas R. de Oliveira authored
      Author: Allan Douglas R. de Oliveira <allan@chaordicsystems.com>
      
      Closes #2162 from douglaz/user_data_master and squashes the following commits:
      
      10d15f6 [Allan Douglas R. de Oliveira] Give user data also to the master
      5ac4093c
    • uncleGen
      [SPARK-3170][CORE][BUG]:RDD info loss in "StorageTab" and "ExecutorTab" · d8298c46
      uncleGen authored
      A completed stage only needs to remove its own partitions that are no longer cached. However, "StorageTab" may lose some RDDs that are actually still cached. And not only "StorageTab": "ExecutorTab" may also lose RDD info that has been overwritten by the last RDD in the same task.
      1. "StorageTab": when multiple stages run simultaneously, a completed stage removes RDD info belonging to other stages that are still running.
      2. "ExecutorTab": the TaskContext may lose some "updatedBlocks" info of RDDs in a dependency chain. For example:
               val r1 = sc.parallelize(...).cache()
               val r2 = r1.map(...).cache()
               val n = r2.count()

      When counting r2, both r1 and r2 will end up cached. So in CacheManager.getOrCompute, the TaskContext should contain the "updatedBlocks" of both r1 and r2; currently, "updatedBlocks" only contains the info for r2.
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #2131 from uncleGen/master_ui_fix and squashes the following commits:
      
      a6a8a0b [uncleGen] fix some coding style
      3a1bc15 [uncleGen] fix some error in unit test
      56ea488 [uncleGen] there's some line too long
      c82ba82 [uncleGen] Bug Fix: RDD info loss in "StorageTab" and "ExecutorTab"
      d8298c46
    • Marcelo Vanzin
      [SPARK-2933] [yarn] Refactor and cleanup Yarn AM code. · b92d823a
      Marcelo Vanzin authored
      This change modifies the Yarn module so that all the logic related
      to running the ApplicationMaster is localized. Instead of, previously,
      4 different classes with mostly identical code, now we have:
      
      - A single, shared ApplicationMaster class, which can operate both in
        client and cluster mode, and substitutes the old ApplicationMaster
        (for cluster mode) and ExecutorLauncher (for client mode).
      
      The benefit here is that all different execution modes for all supported
      yarn versions use the same shared code for monitoring executor allocation,
      setting up configuration, and monitoring the process's lifecycle.
      
      - A new YarnRMClient interface, which defines basic RM functionality needed
        by the ApplicationMaster. This interface has concrete implementations for
        each supported Yarn version.
      
      - A new YarnAllocator interface, which just abstracts the existing interface
        of the YarnAllocationHandler class. This is to avoid having to touch the
        allocator code too much in this change, although it might benefit from a
        similar effort in the future.
      
      The end result is much easier to understand code, with much less duplication,
      making it much easier to fix bugs, add features, and test everything knowing
      that all supported versions will behave the same.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #2020 from vanzin/SPARK-2933 and squashes the following commits:
      
      3bbf3e7 [Marcelo Vanzin] Merge branch 'master' into SPARK-2933
      ff389ed [Marcelo Vanzin] Do not interrupt reporter thread from within itself.
      3a8ed37 [Marcelo Vanzin] Remote stale comment.
      0f5142c [Marcelo Vanzin] Review feedback.
      41f8c8a [Marcelo Vanzin] Fix app status reporting.
      c0794be [Marcelo Vanzin] Correctly clean up staging directory.
      92770cc [Marcelo Vanzin] Merge branch 'master' into SPARK-2933
      ecaf332 [Marcelo Vanzin] Small fix to shutdown code.
      f02d3f8 [Marcelo Vanzin] Merge branch 'master' into SPARK-2933
      f581122 [Marcelo Vanzin] Review feedback.
      557fdeb [Marcelo Vanzin] Cleanup a couple more constants.
      be6068d [Marcelo Vanzin] Restore shutdown hook to clean up staging dir.
      5150993 [Marcelo Vanzin] Some more cleanup.
      b6289ab [Marcelo Vanzin] Move cluster/client code to separate methods.
      ecb23cd [Marcelo Vanzin] More trivial cleanup.
      34f1e63 [Marcelo Vanzin] Fix some questionable error handling.
      5657c7d [Marcelo Vanzin] Finish app if SparkContext initialization times out.
      0e4be3d [Marcelo Vanzin] Keep "ExecutorLauncher" as the main class for client-mode AM.
      91beabb [Marcelo Vanzin] Fix UI filter registration.
      8c72239 [Marcelo Vanzin] Trivial cleanups.
      99a52d5 [Marcelo Vanzin] Changes to the yarn-alpha project to use common AM code.
      848ca6d [Marcelo Vanzin] [SPARK-2933] [yarn] Refactor and cleanup Yarn AM code.
      b92d823a
    • Hari Shreedharan
      [SPARK-3154][STREAMING] Make FlumePollingInputDStream shutdown cleaner. · 6f671d04
      Hari Shreedharan authored
      Currently a lot of errors get thrown from the Avro IPC layer when the DStream
      or sink is shut down. This PR cleans that up. Some refactoring is done in the
      receiver code to put all of the RPC code into a single Try and just recover
      from that. The sink code has also been cleaned up.
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #2065 from harishreedharan/clean-flume-shutdown and squashes the following commits:
      
      f93a07c [Hari Shreedharan] Formatting fixes.
      d7427cc [Hari Shreedharan] More fixes!
      a0a8852 [Hari Shreedharan] Fix race condition, hopefully! Minor other changes.
      4c9ed02 [Hari Shreedharan] Remove unneeded list in Callback handler. Other misc changes.
      8fee36f [Hari Shreedharan] Scala-library is required, else maven build fails. Also catch InterruptedException in TxnProcessor.
      445e700 [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into clean-flume-shutdown
      87232e0 [Hari Shreedharan] Refactor Flume Input Stream. Clean up code, better error handling.
      9001d26 [Hari Shreedharan] Change log level to debug in TransactionProcessor#shutdown method
      e7b8d82 [Hari Shreedharan] Incorporate review feedback
      598efa7 [Hari Shreedharan] Clean up some exception handling code
      e1027c6 [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into clean-flume-shutdown
      ed608c8 [Hari Shreedharan] [SPARK-3154][STREAMING] Make FlumePollingInputDStream shutdown cleaner.
      6f671d04
    • Joseph K. Bradley
      [SPARK-3227] [mllib] Added migration guide for v1.0 to v1.1 · 171a41cb
      Joseph K. Bradley authored
      The only updates are in DecisionTree.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #2146 from jkbradley/mllib-migration and squashes the following commits:
      
      5a1f487 [Joseph K. Bradley] small edit to doc
      411d6d9 [Joseph K. Bradley] Added migration guide for v1.0 to v1.1.  The only updates are in DecisionTree.
      171a41cb
    • Xiangrui Meng
      [SPARK-2830][MLLIB] doc update for 1.1 · 43dfc84f
      Xiangrui Meng authored
      1. renamed mllib-basics to mllib-data-types
      1. renamed mllib-stats to mllib-statistics
      1. moved random data generation to the bottom of mllib-stats
      1. updated toc accordingly
      
      atalwalkar
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2151 from mengxr/mllib-doc-1.1 and squashes the following commits:
      
      0bd79f3 [Xiangrui Meng] add mllib-data-types
      b64a5d7 [Xiangrui Meng] update the content list of basis statistics in mllib-guide
      f625cc2 [Xiangrui Meng] move mllib-basics to mllib-data-types
      4d69250 [Xiangrui Meng] move random data generation to the bottom of statistics
      e64f3ce [Xiangrui Meng] move mllib-stats.md to mllib-statistics.md
      43dfc84f
    • Michael Armbrust
      [SPARK-3237][SQL] Fix parquet filters with UDFs · e1139dd6
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2153 from marmbrus/parquetFilters and squashes the following commits:
      
      712731a [Michael Armbrust] Use closure serializer for sending filters.
      1e83f80 [Michael Armbrust] Clean udf functions.
      e1139dd6
    • Tathagata Das
      [SPARK-3139] Made ContextCleaner to not block on shuffles · 3e2864e4
      Tathagata Das authored
      As a workaround for SPARK-3015, the ContextCleaner was made "blocking", that is, it cleaned items one-by-one. But shuffles can take a long time to be deleted. Given that the RC for 1.1 is imminent, this PR makes a narrow change in the context cleaner - not wait for shuffle cleanups to complete. Also it changes the error messages on failure to delete to be milder warnings, as exceptions in the delete code path for one item does not really stop the actual functioning of the system.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #2143 from tdas/cleaner-shuffle-fix and squashes the following commits:
      
      9c84202 [Tathagata Das] Restoring default blocking behavior in ContextCleanerSuite, and added docs to identify that spark.cleaner.referenceTracking.blocking does not control shuffle.
      2181329 [Tathagata Das] Mark shuffle cleanup as non-blocking.
      e337cc2 [Tathagata Das] Changed semantics based on PR comments.
      387b578 [Tathagata Das] Made ContextCleaner to not block on shuffles
      3e2864e4
    • Patrick Wendell
      HOTFIX: Minor typo in conf template · 9d65f271
      Patrick Wendell authored
      9d65f271
    • Andrew Or
      [SPARK-3167] Handle special driver configs in Windows · 7557c4cf
      Andrew Or authored
      This is an effort to bring the Windows scripts up to speed after recent splashing changes in #1845.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2129 from andrewor14/windows-config and squashes the following commits:
      
      881a8f0 [Andrew Or] Add reference to Windows taskkill
      92e6047 [Andrew Or] Update a few comments (minor)
      22b1acd [Andrew Or] Fix style again (minor)
      afcffea [Andrew Or] Fix style (minor)
      72004c2 [Andrew Or] Actually respect --driver-java-options
      803218b [Andrew Or] Actually respect SPARK_*_CLASSPATH
      eeb34a0 [Andrew Or] Update outdated comment (minor)
      35caecc [Andrew Or] In Windows, actually kill Java processes on exit
      f97daa2 [Andrew Or] Fix Windows spark shell stdin issue
      83ebe60 [Andrew Or] Parse special driver configs in Windows (broken)
      7557c4cf
    • Reynold Xin
      [SPARK-3224] FetchFailed reduce stages should only show up once in failed stages (in UI) · bf719056
      Reynold Xin authored
      This is a HOTFIX for 1.1.
      
      Author: Reynold Xin <rxin@apache.org>
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #2127 from rxin/SPARK-3224 and squashes the following commits:
      
      effb1ce [Reynold Xin] Move log message.
      49282b3 [Reynold Xin] Kay's feedback.
      3f01847 [Reynold Xin] Merge pull request #2 from kayousterhout/SPARK-3224
      796d282 [Kay Ousterhout] Added unit test for SPARK-3224
      3d3d356 [Reynold Xin] Remove map output loc even for repeated FetchFaileds.
      1dd3eb5 [Reynold Xin] [SPARK-3224] FetchFailed reduce stages should only show up once in the failed stages UI.
      bf719056