Commits · 5fa53c02fc89af7328a659045c954d72bf0b8664 · cs525-sp18-g07 / spark

Feb 13, 2014

Add c3 instance types to Spark EC2 · 5fa53c02

Christian Lundgren authored 11 years ago

The number of disks for the c3 instance types taken from here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#StorageOnInstanceTypes



Author: Christian Lundgren <christian.lundgren@gameanalytics.com>

Closes #595 from chrisavl/branch-0.9 and squashes the following commits:

c8af5f9 [Christian Lundgren] Add c3 instance types to Spark EC2
(cherry picked from commit 19b4bb2b)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>

5fa53c02

Ported hadoopClient jar for < 1.0.1 fix · a3bb8617

Bijay Bisht authored 11 years ago


#522 got messed after i rewrote the branch hadoop_jar_name. So created a new one.

Author: Bijay Bisht <bijay.bisht@gmail.com>

Closes #584 from bijaybisht/hadoop_jar_name_on_0.9.0 and squashes the following commits:

1b6fb3c [Bijay Bisht] Ported hadoopClient jar for < 1.0.1 fix
(cherry picked from commit 8093de1b)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>

a3bb8617

SPARK-1073 Keep GitHub pull request title as commit summary · 6ee0ad8f

Andrew Ash authored 11 years ago

The first line of a git commit message is the line that's used with many git
tools as the most concise textual description of that message.  The most
common use that I see is in the short log, which is a one line per commit
log of recent commits.

This commit moves the line

  Merge pull request #%s from %s.

Lower into the message to reserve the first line of the resulting commit for
the much more important pull request title.

http://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html

Author: Andrew Ash <andrew@andrewash.com>

Closes #574 from ash211/gh-pr-merge-title and squashes the following commits:

b240823 [Andrew Ash] More merge_message improvements
d2986db [Andrew Ash] Keep GitHub pull request title as commit summary

6ee0ad8f

Merge pull request #592 from rxin/test. · 7fe7a55c

Reynold Xin authored 11 years ago

SPARK-1088: Create a script for running tests so we can have version specific testing on Jenkins.

@pwendell

Author: Reynold Xin <rxin@apache.org>

Closes #592 and squashes the following commits:

be02359 [Reynold Xin] SPARK-1088: Create a script for running tests so we can have version specific testing on Jenkins.

7fe7a55c

Feb 12, 2014

Merge pull request #591 from mengxr/transient-new. · 7e29e027

Xiangrui Meng authored 11 years ago

SPARK-1076: [Fix #578] add @transient to some vals

I'll try to be more careful next time.

Author: Xiangrui Meng <meng@databricks.com>

Closes #591 and squashes the following commits:

2b4f044 [Xiangrui Meng] add @transient to prev in ZippedWithIndexRDD add @transient to seed in PartitionwiseSampledRDD

7e29e027

Merge pull request #589 from mengxr/index. · 2bea0709

Xiangrui Meng authored 11 years ago

SPARK-1076: Convert Int to Long to avoid overflow

Patch for PR #578.

Author: Xiangrui Meng <meng@databricks.com>

Closes #589 and squashes the following commits:

98c435e [Xiangrui Meng] cast Int to Long to avoid Int overflow

2bea0709

Merge pull request #578 from mengxr/rank. · e733d655

Xiangrui Meng authored 11 years ago

SPARK-1076: zipWithIndex and zipWithUniqueId to RDD

Assign ranks to an ordered or unordered data set is a common operation. This could be done by first counting records in each partition and then assign ranks in parallel.

The purpose of assigning ranks to an unordered set is usually to get a unique id for each item, e.g., to map feature names to feature indices. In such cases, the assignment could be done without counting records, saving one spark job.

https://spark-project.atlassian.net/browse/SPARK-1076

== update ==
Because assigning ranks is very similar to Scala's zipWithIndex, I changed the method name to zipWithIndex and put the index in the value field.

Author: Xiangrui Meng <meng@databricks.com>

Closes #578 and squashes the following commits:

52a05e1 [Xiangrui Meng] changed assignRanks to zipWithIndex changed assignUniqueIds to zipWithUniqueId minor updates
756881c [Xiangrui Meng] simplified RankedRDD by implementing assignUniqueIds separately moved couting iterator size to Utils do not count items in the last partition and skip counting if there is only one partition
630868c [Xiangrui Meng] newline
21b434b [Xiangrui Meng] add assignRanks and assignUniqueIds to RDD

e733d655

Merge pull request #583 from colorant/zookeeper. · 68b2c0d0

Raymond Liu authored 11 years ago

Minor fix for ZooKeeperPersistenceEngine to use configured working dir

Author: Raymond Liu <raymond.liu@intel.com>

Closes #583 and squashes the following commits:

91b0609 [Raymond Liu] Minor fix for ZooKeeperPersistenceEngine to use configured working dir

68b2c0d0

Feb 11, 2014

Merge pull request #571 from holdenk/switchtobinarysearch. · b0dab1bb

Holden Karau authored 11 years ago

SPARK-1072 Use binary search when needed in RangePartioner

Author: Holden Karau <holden@pigscanfly.ca>

Closes #571 and squashes the following commits:

f31a2e1 [Holden Karau] Swith to using CollectionsUtils in Partitioner
4c7a0c3 [Holden Karau] Add CollectionsUtil as suggested by aarondav
7099962 [Holden Karau] Add the binary search to only init once
1bef01d [Holden Karau] CR feedback
a21e097 [Holden Karau] Use binary search if we have more than 1000 elements inside of RangePartitioner

b0dab1bb

Merge pull request #577 from hsaputra/fix_simple_streaming_doc. · ba38d989

Henry Saputra authored 11 years ago

SPARK-1075 Fix doc in the Spark Streaming custom receiver closing bracket in the class constructor

The closing parentheses in the constructor in the first code block example is reversed:
diff --git a/docs/streaming-custom-receivers.md b/docs/streaming-custom-receivers.md
index 4e27d65..3fb540c 100644
— a/docs/streaming-custom-receivers.md
+++ b/docs/streaming-custom-receivers.md
@@ -14,7 +14,7 @@ This starts with implementing NetworkReceiver(api/streaming/index.html#org.apa
The following is a simple socket text-stream receiver.
{% highlight scala %}
class SocketTextStreamReceiver(host: String, port: Int(
+ class SocketTextStreamReceiver(host: String, port: Int)
extends NetworkReceiverString
{
protected lazy val blocksGenerator: BlockGenerator =

Author: Henry Saputra <henry@platfora.com>

Closes #577 and squashes the following commits:

6508341 [Henry Saputra] SPARK-1075 Fix doc in the Spark Streaming custom receiver.

ba38d989

Merge pull request #579 from CrazyJvm/patch-1. · 4afe6ccf

Chen Chao authored 11 years ago

"in the source DStream" rather than "int the source DStream"

"flatMap is a one-to-many DStream operation that creates a new DStream by generating multiple new records from each record int the source DStream."

Author: Chen Chao <crazyjvm@gmail.com>

Closes #579 and squashes the following commits:

4abcae3 [Chen Chao] in the source DStream

4afe6ccf

Feb 10, 2014

Revert "Merge pull request #560 from pwendell/logging. Closes #560." · d6a9bdc0
Patrick Wendell authored 11 years ago
```
This reverts commit b6d40b78.
```
d6a9bdc0

Merge pull request #567 from ScrapCodes/style2. · 919bd7f6

Prashant Sharma authored 11 years ago

SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build. Pt 2

Continuation of PR #557

With this all scala style errors are fixed across the code base !!

The reason for creating a separate PR was to not interrupt an already reviewed and ready to merge PR. Hope this gets reviewed soon and merged too.

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #567 and squashes the following commits:

3b1ec30 [Prashant Sharma] scala style fixes

919bd7f6

Feb 09, 2014

Merge pull request #566 from martinjaggi/copy-MLlib-d. · 2182aa3c

Martin Jaggi authored 11 years ago

new MLlib documentation for optimization, regression and classification

new documentation with tex formulas, hopefully improving usability and reproducibility of the offered MLlib methods.
also did some minor changes in the code for consistency. scala tests pass.

this is the rebased branch, i deleted the old PR

jira:
https://spark-project.atlassian.net/browse/MLLIB-19

Author: Martin Jaggi <m.jaggi@gmail.com>

Closes #566 and squashes the following commits:

5f0f31e [Martin Jaggi] line wrap at 100 chars
4e094fb [Martin Jaggi] better description of GradientDescent
1d6965d [Martin Jaggi] remove broken url
ea569c3 [Martin Jaggi] telling what updater actually does
964732b [Martin Jaggi] lambda R() in documentation
a6c6228 [Martin Jaggi] better comments in SGD code for regression
b32224a [Martin Jaggi] new optimization documentation
d5dfef7 [Martin Jaggi] new classification and regression documentation
b07ead6 [Martin Jaggi] correct scaling for MSE loss
ba6158c [Martin Jaggi] use d for the number of features
bab2ed2 [Martin Jaggi] renaming LeastSquaresGradient

2182aa3c

Merge pull request #551 from qqsun8819/json-protocol. · afc8f3cb

qqsun8819 authored 11 years ago

[SPARK-1038] Add more fields in JsonProtocol and add tests that verify the JSON itself

This is a PR for SPARK-1038. Two major changes:
1 add some fields to JsonProtocol which is new and important to standalone-related data structures
2 Use Diff in liftweb.json to verity the stringified Json output for detecting someone mod type T to Option[T]

Author: qqsun8819 <jin.oyj@alibaba-inc.com>

Closes #551 and squashes the following commits:

fdf0b4e [qqsun8819] [SPARK-1038] 1. Change code style for more readable according to rxin review 2. change submitdate hard-coded string to a date object toString for more complexiblity
095a26f [qqsun8819] [SPARK-1038] mod according to  review of pwendel, use hard-coded json string for json data validation. Each test use its own json string
0524e41 [qqsun8819] Merge remote-tracking branch 'upstream/master' into json-protocol
d203d5c [qqsun8819] [SPARK-1038] Add more fields in JsonProtocol and add tests that verify the JSON itself

afc8f3cb

Merge pull request #569 from pwendell/merge-fixes. · 94ccf869

Patrick Wendell authored 11 years ago

Fixes bug where merges won't close associated pull request.

Previously we added "Closes #XX" in the title. Github will sometimes
linbreak the title in a way that causes this to not work. This patch
instead adds the line in the body.

This also makes the commit format more concise for merge commits.
We might consider just dropping those in the future.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #569 and squashes the following commits:

732eba1 [Patrick Wendell] Fixes bug where merges won't close associated pull request.

94ccf869

Merge pull request #557 from ScrapCodes/style. Closes #557. · b69f8b2a

Patrick Wendell authored 11 years ago

SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build.

Author: Patrick Wendell <pwendell@gmail.com>
Author: Prashant Sharma <scrapcodes@gmail.com>

== Merge branch commits ==

commit 1a8bd1c059b842cb95cc246aaea74a79fec684f4
Author: Prashant Sharma <scrapcodes@gmail.com>
Date:   Sun Feb 9 17:39:07 2014 +0530

    scala style fixes

commit f91709887a8e0b608c5c2b282db19b8a44d53a43
Author: Patrick Wendell <pwendell@gmail.com>
Date:   Fri Jan 24 11:22:53 2014 -0800

    Adding scalastyle snapshot

b69f8b2a

Merge pull request #556 from CodingCat/JettyUtil. Closes #556. · b6dba10a

CodingCat authored 11 years ago

[SPARK-1060] startJettyServer should explicitly use IP information

https://spark-project.atlassian.net/browse/SPARK-1060

In the current implementation, the webserver in Master/Worker is started with

val (srv, bPort) = JettyUtils.startJettyServer("0.0.0.0", port, handlers)

inside startJettyServer:

val server = new Server(currentPort) //here, the Server will take "0.0.0.0" as the hostname, i.e. will always bind to the IP address of the first NIC

this can cause wrong IP binding, e.g. if the host has two NICs, N1 and N2, the user specify the SPARK_LOCAL_IP as the N2's IP address, however, when starting the web server, for the reason stated above, it will always bind to the N1's address

Author: CodingCat <zhunansjtu@gmail.com>

== Merge branch commits ==

commit 6c6d9a8ccc9ec4590678a3b34cb03df19092029d
Author: CodingCat <zhunansjtu@gmail.com>
Date:   Thu Feb 6 14:53:34 2014 -0500

    startJettyServer should explicitly use IP information

b6dba10a

Merge pull request #562 from jyotiska/master. Closes #562. · 2ef37c93

jyotiska authored 11 years ago

Added example Python code for sort

I added an example Python code for sort. Right now, PySpark has limited examples for new people willing to use the project. This example code sorts integers stored in a file. I was able to sort 5 million, 10 million and 25 million integers with this code.

Author: jyotiska <jyotiska123@gmail.com>

== Merge branch commits ==

commit 8ad8faf6c8e02ae1cd68565d98524edf165f54df
Author: jyotiska <jyotiska123@gmail.com>
Date:   Sun Feb 9 11:00:41 2014 +0530

    Added comments in code on collect() method

commit 6f98f1e313f4472a7c2207d36c4f0fbcebc95a8c
Author: jyotiska <jyotiska123@gmail.com>
Date:   Sat Feb 8 13:12:37 2014 +0530

    Updated python example code sort.py

commit 945e39a5d68daa7e5bab0d96cbd35d7c4b04eafb
Author: jyotiska <jyotiska123@gmail.com>
Date:   Sat Feb 8 12:59:09 2014 +0530

    Added example python code for sort

2ef37c93

Merge pull request #560 from pwendell/logging. Closes #560. · b6d40b78

Patrick Wendell authored 11 years ago

[WIP] SPARK-1067: Default log4j initialization causes errors for those not using log4j

To fix this - we add a check when initializing log4j.

Author: Patrick Wendell <pwendell@gmail.com>

== Merge branch commits ==

commit ffdce513877f64b6eed6d36138c3e0003d392889
Author: Patrick Wendell <pwendell@gmail.com>
Date:   Fri Feb 7 15:22:29 2014 -0800

    Logging fix

b6d40b78

Merge pull request #565 from pwendell/dev-scripts. Closes #565. · f892da87

Patrick Wendell authored 11 years ago

SPARK-1066: Add developer scripts to repository.

These are some developer scripts I've been maintaining in a separate public repo. This patch adds them to the Spark repository so they can evolve here and are clearly accessible to all committers.

I may do some small additional clean-up in this PR, but wanted to put them here in case others want to review. There are a few types of scripts here:

1. A tool to merge pull requests.
2. A script for packaging releases.
3. A script for auditing release candidates.

Author: Patrick Wendell <pwendell@gmail.com>

== Merge branch commits ==

commit 5d5d331d01f6fd59c2eb830f652955119b012173
Author: Patrick Wendell <pwendell@gmail.com>
Date:   Sat Feb 8 22:11:47 2014 -0800

    SPARK-1066: Add developer scripts to repository.

f892da87

Feb 08, 2014

Merge pull request #542 from markhamstra/versionBump. Closes #542. · c2341c92

Mark Hamstra authored 11 years ago

Version number to 1.0.0-SNAPSHOT

Since 0.9.0-incubating is done and out the door, we shouldn't be building 0.9.0-incubating-SNAPSHOT anymore.

@pwendell

Author: Mark Hamstra <markhamstra@gmail.com>

== Merge branch commits ==

commit 1b00a8a7c1a7f251b4bb3774b84b9e64758eaa71
Author: Mark Hamstra <markhamstra@gmail.com>
Date:   Wed Feb 5 09:30:32 2014 -0800

    Version number to 1.0.0-SNAPSHOT

c2341c92

Merge pull request #561 from Qiuzhuang/master. Closes #561. · f0ce736f

Qiuzhuang Lian authored 11 years ago

Kill drivers in postStop() for Worker.

 JIRA SPARK-1068:https://spark-project.atlassian.net/browse/SPARK-1068

Author: Qiuzhuang Lian <Qiuzhuang.Lian@gmail.com>

== Merge branch commits ==

commit 9c19ce63637eee9369edd235979288d3d9fc9105
Author: Qiuzhuang Lian <Qiuzhuang.Lian@gmail.com>
Date:   Sat Feb 8 16:07:39 2014 +0800

    Kill drivers in postStop() for Worker.
     JIRA SPARK-1068:https://spark-project.atlassian.net/browse/SPARK-1068

f0ce736f

Merge pull request #454 from jey/atomic-sbt-download. Closes #454. · 78050805

Jey Kottalam authored 11 years ago

Make sbt download an atomic operation

Modifies the `sbt/sbt` script to gracefully recover when a previous invocation died in the middle of downloading the SBT jar.

Author: Jey Kottalam <jey@cs.berkeley.edu>

== Merge branch commits ==

commit 6c600eb434a2f3e7d70b67831aeebde9b5c0f43b
Author: Jey Kottalam <jey@cs.berkeley.edu>
Date:   Fri Jan 17 10:43:54 2014 -0800

    Make sbt download an atomic operation

78050805

Merge pull request #552 from martinjaggi/master. Closes #552. · fabf1749

Martin Jaggi authored 11 years ago

tex formulas in the documentation

using mathjax.
and spliting the MLlib documentation by techniques

see jira
https://spark-project.atlassian.net/browse/MLLIB-19
and
https://github.com/shivaram/spark/compare/mathjax

Author: Martin Jaggi <m.jaggi@gmail.com>

== Merge branch commits ==

commit 0364bfabbfc347f917216057a20c39b631842481
Author: Martin Jaggi <m.jaggi@gmail.com>
Date:   Fri Feb 7 03:19:38 2014 +0100

    minor polishing, as suggested by @pwendell

commit dcd2142c164b2f602bf472bb152ad55bae82d31a
Author: Martin Jaggi <m.jaggi@gmail.com>
Date:   Thu Feb 6 18:04:26 2014 +0100

    enabling inline latex formulas with $.$

    same mathjax configuration as used in math.stackexchange.com

    sample usage in the linear algebra (SVD) documentation

commit bbafafd2b497a5acaa03a140bb9de1fbb7d67ffa
Author: Martin Jaggi <m.jaggi@gmail.com>
Date:   Thu Feb 6 17:31:29 2014 +0100

    split MLlib documentation by techniques

    and linked from the main mllib-guide.md site

commit d1c5212b93c67436543c2d8ddbbf610fdf0a26eb
Author: Martin Jaggi <m.jaggi@gmail.com>
Date:   Thu Feb 6 16:59:43 2014 +0100

    enable mathjax formula in the .md documentation files

    code by @shivaram

commit d73948db0d9bc36296054e79fec5b1a657b4eab4
Author: Martin Jaggi <m.jaggi@gmail.com>
Date:   Thu Feb 6 16:57:23 2014 +0100

    minor update on how to compile the documentation

fabf1749

Feb 07, 2014

Merge pull request #506 from ash211/intersection. Closes #506. · 3a9d82cc

Andrew Ash authored 11 years ago

SPARK-1062 Add rdd.intersection(otherRdd) method

Author: Andrew Ash <andrew@andrewash.com>

== Merge branch commits ==

commit 5d9982b171b9572649e9828f37ef0b43f0242912
Author: Andrew Ash <andrew@andrewash.com>
Date:   Thu Feb 6 18:11:45 2014 -0800

    Minor fixes

    - style: (v,null) => (v, null)
    - mention the shuffle in Javadoc

commit b86d02f14e810902719cef893cf6bfa18ff9acb0
Author: Andrew Ash <andrew@andrewash.com>
Date:   Sun Feb 2 13:17:40 2014 -0800

    Overload .intersection() for numPartitions and custom Partitioner

commit bcaa34911fcc6bb5bc5e4f9fe46d1df73cb71c09
Author: Andrew Ash <andrew@andrewash.com>
Date:   Sun Feb 2 13:05:40 2014 -0800

    Better naming of parameters in intersection's filter

commit b10a6af2d793ec6e9a06c798007fac3f6b860d89
Author: Andrew Ash <andrew@andrewash.com>
Date:   Sat Jan 25 23:06:26 2014 -0800

    Follow spark code format conventions of tab => 2 spaces

commit 965256e4304cca514bb36a1a36087711dec535ec
Author: Andrew Ash <andrew@andrewash.com>
Date:   Fri Jan 24 00:28:01 2014 -0800

    Add rdd.intersection(otherRdd) method

3a9d82cc

Merge pull request #533 from andrewor14/master. Closes #533. · 1896c6e7

Andrew Or authored 11 years ago

External spilling - generalize batching logic

The existing implementation consists of a hack for Kryo specifically and only works for LZF compression. Introducing an intermediate batch-level stream takes care of pre-fetching and other arbitrary behavior of higher level streams in a more general way.

Author: Andrew Or <andrewor14@gmail.com>

== Merge branch commits ==

commit 3ddeb7ef89a0af2b685fb5d071aa0f71c975cc82
Author: Andrew Or <andrewor14@gmail.com>
Date:   Wed Feb 5 12:09:32 2014 -0800

    Also privatize fields

commit 090544a87a0767effd0c835a53952f72fc8d24f0
Author: Andrew Or <andrewor14@gmail.com>
Date:   Wed Feb 5 10:58:23 2014 -0800

    Privatize methods

commit 13920c918efe22e66a1760b14beceb17a61fd8cc
Author: Andrew Or <andrewor14@gmail.com>
Date:   Tue Feb 4 16:34:15 2014 -0800

    Update docs

commit bd5a1d7350467ed3dc19c2de9b2c9f531f0e6aa3
Author: Andrew Or <andrewor14@gmail.com>
Date:   Tue Feb 4 13:44:24 2014 -0800

    Typo: phyiscal -> physical

commit 287ef44e593ad72f7434b759be3170d9ee2723d2
Author: Andrew Or <andrewor14@gmail.com>
Date:   Tue Feb 4 13:38:32 2014 -0800

    Avoid reading the entire batch into memory; also simplify streaming logic

    Additionally, address formatting comments.

commit 3df700509955f7074821e9aab1e74cb53c58b5a5
Merge: a531d2e 164489d
Author: Andrew Or <andrewor14@gmail.com>
Date:   Mon Feb 3 18:27:49 2014 -0800

    Merge branch 'master' of github.com:andrewor14/incubator-spark

commit a531d2e347acdcecf2d0ab72cd4f965ab5e145d8
Author: Andrew Or <andrewor14@gmail.com>
Date:   Mon Feb 3 18:18:04 2014 -0800

    Relax assumptions on compressors and serializers when batching

    This commit introduces an intermediate layer of an input stream on the batch level.
    This guards against interference from higher level streams (i.e. compression and
    deserialization streams), especially pre-fetching, without specifically targeting
    particular libraries (Kryo) and forcing shuffle spill compression to use LZF.

commit 164489d6f176bdecfa9dabec2dfce5504d1ee8af
Author: Andrew Or <andrewor14@gmail.com>
Date:   Mon Feb 3 18:18:04 2014 -0800

    Relax assumptions on compressors and serializers when batching

    This commit introduces an intermediate layer of an input stream on the batch level.
    This guards against interference from higher level streams (i.e. compression and
    deserialization streams), especially pre-fetching, without specifically targeting
    particular libraries (Kryo) and forcing shuffle spill compression to use LZF.

1896c6e7

Feb 06, 2014

Merge pull request #450 from kayousterhout/fetch_failures. Closes #450. · 0b448df6

Kay Ousterhout authored 11 years ago

Only run ResubmitFailedStages event after a fetch fails

Previously, the ResubmitFailedStages event was called every
200 milliseconds, leading to a lot of unnecessary event processing
and clogged DAGScheduler logs.

Author: Kay Ousterhout <kayousterhout@gmail.com>

== Merge branch commits ==

commit e603784b3a562980e6f1863845097effe2129d3b
Author: Kay Ousterhout <kayousterhout@gmail.com>
Date:   Wed Feb 5 11:34:41 2014 -0800

    Re-add check for empty set of failed stages

commit d258f0ef50caff4bbb19fb95a6b82186db1935bf
Author: Kay Ousterhout <kayousterhout@gmail.com>
Date:   Wed Jan 15 23:35:41 2014 -0800

    Only run ResubmitFailedStages event after a fetch fails

    Previously, the ResubmitFailedStages event was called every
    200 milliseconds, leading to a lot of unnecessary event processing
    and clogged DAGScheduler logs.

0b448df6

Merge pull request #321 from kayousterhout/ui_kill_fix. Closes #321. · 18ad59e2

Kay Ousterhout authored 11 years ago

Inform DAG scheduler about all started/finished tasks.

Previously, the DAG scheduler was not always informed
when tasks started and finished. The simplest example here
is for speculated tasks: the DAGScheduler was only told about
the first attempt of a task, meaning that SparkListeners were
also not told about multiple task attempts, so users can't see
what's going on with speculation in the UI.  The DAGScheduler
also wasn't always told about finished tasks, so in the UI, some
tasks will never be shown as finished (this occurs, for example,
if a task set gets killed).

The other problem is that the fairness accounting was wrong
-- the number of running tasks in a pool was decreased when a
task set was considered done, even if all of its tasks hadn't
yet finished.

Author: Kay Ousterhout <kayousterhout@gmail.com>

== Merge branch commits ==

commit c8d547d0f7a17f5a193bef05f5872b9f475675c5
Author: Kay Ousterhout <kayousterhout@gmail.com>
Date:   Wed Jan 15 16:47:33 2014 -0800

    Addressed Reynold's review comments.

    Always use a TaskEndReason (remove the option), and explicitly
    signal when we don't know the reason. Also, always tell
    DAGScheduler (and associated listeners) about started tasks, even
    when they're speculated.

commit 3fee1e2e3c06b975ff7f95d595448f38cce97a04
Author: Kay Ousterhout <kayousterhout@gmail.com>
Date:   Wed Jan 8 22:58:13 2014 -0800

    Fixed broken test and improved logging

commit ff12fcaa2567c5d02b75a1d5db35687225bcd46f
Author: Kay Ousterhout <kayousterhout@gmail.com>
Date:   Sun Dec 29 21:08:20 2013 -0800

    Inform DAG scheduler about all finished tasks.

    Previously, the DAG scheduler was not always informed
    when tasks finished. For example, when a task set was
    aborted, the DAG scheduler was never told when the tasks
    in that task set finished. The DAG scheduler was also
    never told about the completion of speculated tasks.
    This led to confusion with SparkListeners because information
    about the completion of those tasks was never passed on to
    the listeners (so in the UI, for example, some tasks will never
    be shown as finished).

    The other problem is that the fairness accounting was wrong
    -- the number of running tasks in a pool was decreased when a
    task set was considered done, even if all of its tasks hadn't
    yet finished.

18ad59e2

Merge pull request #554 from sryza/sandy-spark-1056. Closes #554. · 446403b6

Sandy Ryza authored 11 years ago

SPARK-1056. Fix header comment in Executor to not imply that it's only u...

...sed for Mesos and Standalone.

Author: Sandy Ryza <sandy@cloudera.com>

== Merge branch commits ==

commit 1f2443d902a26365a5c23e4af9077e1539ed2eab
Author: Sandy Ryza <sandy@cloudera.com>
Date:   Thu Feb 6 15:03:50 2014 -0800

    SPARK-1056. Fix header comment in Executor to not imply that it's only used for Mesos and Standalone

446403b6

Merge pull request #498 from ScrapCodes/python-api. Closes #498. · 084839ba

Prashant Sharma authored 11 years ago

Python api additions

Author: Prashant Sharma <prashant.s@imaginea.com>

== Merge branch commits ==

commit 8b51591f1a7a79a62c13ee66ff8d83040f7eccd8
Author: Prashant Sharma <prashant.s@imaginea.com>
Date:   Fri Jan 24 11:50:29 2014 +0530

    Josh's and Patricks review comments.

commit d37f9677838e43bef6c18ef61fbf08055ba6d1ca
Author: Prashant Sharma <prashant.s@imaginea.com>
Date:   Thu Jan 23 17:27:17 2014 +0530

    fixed doc tests

commit 27cb54bf5c99b1ea38a73858c291d0a1c43d8b7c
Author: Prashant Sharma <prashant.s@imaginea.com>
Date:   Thu Jan 23 16:48:43 2014 +0530

    Added keys and values methods for PairFunctions in python

commit 4ce76b396fbaefef2386d7a36d611572bdef9b5d
Author: Prashant Sharma <prashant.s@imaginea.com>
Date:   Thu Jan 23 13:51:26 2014 +0530

    Added foreachPartition

commit 05f05341a187cba829ac0e6c2bdf30be49948c89
Author: Prashant Sharma <prashant.s@imaginea.com>
Date:   Thu Jan 23 13:02:59 2014 +0530

    Added coalesce fucntion to python API

commit 6568d2c2fa14845dc56322c0f39ba2e13b3b26dd
Author: Prashant Sharma <prashant.s@imaginea.com>
Date:   Thu Jan 23 12:52:44 2014 +0530

    added repartition function to python API.

084839ba

Merge pull request #545 from kayousterhout/fix_progress. Closes #545. · 79c95527

Kay Ousterhout authored 11 years ago

Fix off-by-one error with task progress info log.

Author: Kay Ousterhout <kayousterhout@gmail.com>

== Merge branch commits ==

commit 29798fc685c4e7e3eb3bf91c75df7fa8ec94a235
Author: Kay Ousterhout <kayousterhout@gmail.com>
Date:   Wed Feb 5 13:40:01 2014 -0800

    Fix off-by-one error with task progress info log.

79c95527

Merge pull request #526 from tgravescs/yarn_client_stop_am_fix. Closes #526. · 38020961

Thomas Graves authored 11 years ago

spark on yarn - yarn-client mode doesn't always exit immediately

https://spark-project.atlassian.net/browse/SPARK-1049

If you run in the yarn-client mode but you don't get all the workers you requested right away and then you exit your application, the application master stays around until it gets the number of workers you initially requested. This is a waste of resources.  The AM should exit immediately upon the client going away.

This fix simply checks to see if the driver closed while its waiting for the initial # of workers.

Author: Thomas Graves <tgraves@apache.org>

== Merge branch commits ==

commit 03f40a62584b6bdd094ba91670cd4aa6afe7cd81
Author: Thomas Graves <tgraves@apache.org>
Date:   Fri Jan 31 11:23:10 2014 -0600

    spark on yarn - yarn-client mode doesn't always exit immediately

38020961

Merge pull request #549 from CodingCat/deadcode_master. Closes #549. · 18c4ee71

CodingCat authored 11 years ago

remove actorToWorker in master.scala, which is actually not used

actorToWorker is actually not used in the code....just remove it

Author: CodingCat <zhunansjtu@gmail.com>

== Merge branch commits ==

commit 52656c2d4bbf9abcd8bef65d454badb9cb14a32c
Author: CodingCat <zhunansjtu@gmail.com>
Date:   Thu Feb 6 00:28:26 2014 -0500

    remove actorToWorker in master.scala, which is actually not used

18c4ee71

Feb 05, 2014

Merge pull request #544 from kayousterhout/fix_test_warnings. Closes #544. · cc14ba97

Kay Ousterhout authored 11 years ago

Fixed warnings in test compilation.

This commit fixes two problems: a redundant import, and a
deprecated function.

Author: Kay Ousterhout <kayousterhout@gmail.com>

== Merge branch commits ==

commit da9d2e13ee4102bc58888df0559c65cb26232a82
Author: Kay Ousterhout <kayousterhout@gmail.com>
Date:   Wed Feb 5 11:41:51 2014 -0800

    Fixed warnings in test compilation.

    This commit fixes two problems: a redundant import, and a
    deprecated function.

cc14ba97

Merge pull request #540 from sslavic/patch-3. Closes #540. · f7fd80d9

Stevo Slavić authored 11 years ago

Fix line end character stripping for Windows

LogQuery Spark example would produce unwanted result when run on Windows platform because of different, platform specific trailing line end characters (not only \n but \r too).

This fix makes use of Scala's standard library string functions to properly strip all trailing line end characters, letting Scala handle the platform specific stuff.

Author: Stevo Slavić <sslavic@gmail.com>

== Merge branch commits ==

commit 1e43ba0ea773cc005cf0aef78b6c1755f8e88b27
Author: Stevo Slavić <sslavic@gmail.com>
Date:   Wed Feb 5 14:48:29 2014 +0100

    Fix line end character stripping for Windows

    LogQuery Spark example would produce unwanted result when run on Windows platform because of different, platform specific trailing line end characters (not only \n but \r too).

    This fix makes use of Scala's standard library string functions to properly strip all trailing line end characters, letting Scala handle the platform specific stuff.

f7fd80d9

Feb 04, 2014

Merge pull request #534 from sslavic/patch-1. Closes #534. · 92092879

Stevo Slavić authored 11 years ago

Fixed wrong path to compute-classpath.cmd

compute-classpath.cmd is in bin, not in sbin directory

Author: Stevo Slavić <sslavic@gmail.com>

== Merge branch commits ==

commit 23deca32b69e9429b33ad31d35b7e1bfc9459f59
Author: Stevo Slavić <sslavic@gmail.com>
Date:   Tue Feb 4 15:01:47 2014 +0100

    Fixed wrong path to compute-classpath.cmd

    compute-classpath.cmd is in bin, not in sbin directory

92092879

Merge pull request #535 from sslavic/patch-2. Closes #535. · 0c05cd37

Stevo Slavić authored 11 years ago

Fixed typo in scaladoc

Author: Stevo Slavić <sslavic@gmail.com>

== Merge branch commits ==

commit 0a77f789e281930f4168543cc0d3b3ffbf5b3764
Author: Stevo Slavić <sslavic@gmail.com>
Date:   Tue Feb 4 15:30:27 2014 +0100

    Fixed typo in scaladoc

0c05cd37

Feb 03, 2014

Merge pull request #528 from mengxr/sample. Closes #528. · 23af00f9

Xiangrui Meng authored 11 years ago

 Refactor RDD sampling and add randomSplit to RDD (update)

Replace SampledRDD by PartitionwiseSampledRDD, which accepts a RandomSampler instance as input. The current sample with/without replacement can be easily integrated via BernoulliSampler and PoissonSampler. The benefits are:

1) RDD.randomSplit is implemented in the same way, related to https://github.com/apache/incubator-spark/pull/513
2) Stratified sampling and importance sampling can be implemented in the same manner as well.

Unit tests are included for samplers and RDD.randomSplit.

This should performance better than my previous request where the BernoulliSampler creates many Iterator instances:
https://github.com/apache/incubator-spark/pull/513

Author: Xiangrui Meng <meng@databricks.com>

== Merge branch commits ==

commit e8ce957e5f0a600f2dec057924f4a2ca6adba373
Author: Xiangrui Meng <meng@databricks.com>
Date:   Mon Feb 3 12:21:08 2014 -0800

    more docs to PartitionwiseSampledRDD

commit fbb4586d0478ff638b24bce95f75ff06f713d43b
Author: Xiangrui Meng <meng@databricks.com>
Date:   Mon Feb 3 00:44:23 2014 -0800

    move XORShiftRandom to util.random and use it in BernoulliSampler

commit 987456b0ee8612fd4f73cb8c40967112dc3c4c2d
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sat Feb 1 11:06:59 2014 -0800

    relax assertions in SortingSuite because the RangePartitioner has large variance in this case

commit 3690aae416b2dc9b2f9ba32efa465ba7948477f4
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sat Feb 1 09:56:28 2014 -0800

    test split ratio of RDD.randomSplit

commit 8a410bc933a60c4d63852606f8bbc812e416d6ae
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sat Feb 1 09:25:22 2014 -0800

    add a test to ensure seed distribution and minor style update

commit ce7e866f674c30ab48a9ceb09da846d5362ab4b6
Author: Xiangrui Meng <meng@databricks.com>
Date:   Fri Jan 31 18:06:22 2014 -0800

    minor style change

commit 750912b4d77596ed807d361347bd2b7e3b9b7a74
Author: Xiangrui Meng <meng@databricks.com>
Date:   Fri Jan 31 18:04:54 2014 -0800

    fix some long lines

commit c446a25c38d81db02821f7f194b0ce5ab4ed7ff5
Author: Xiangrui Meng <meng@databricks.com>
Date:   Fri Jan 31 17:59:59 2014 -0800

    add complement to BernoulliSampler and minor style changes

commit dbe2bc2bd888a7bdccb127ee6595840274499403
Author: Xiangrui Meng <meng@databricks.com>
Date:   Fri Jan 31 17:45:08 2014 -0800

    switch to partition-wise sampling for better performance

commit a1fca5232308feb369339eac67864c787455bb23
Merge: ac712e48 cf6128f
Author: Xiangrui Meng <meng@databricks.com>
Date:   Fri Jan 31 16:33:09 2014 -0800

    Merge branch 'sample' of github.com:mengxr/incubator-spark into sample

commit cf6128fb672e8c589615adbd3eaa3cbdb72bd461
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sun Jan 26 14:40:07 2014 -0800

    set SampledRDD deprecated in 1.0

commit f430f847c3df91a3894687c513f23f823f77c255
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sun Jan 26 14:38:59 2014 -0800

    update code style

commit a8b5e2021a9204e318c80a44d00c5c495f1befb6
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sun Jan 26 12:56:27 2014 -0800

    move package random to util.random

commit ab0fa2c4965033737a9e3a9bf0a59cbb0df6a6f5
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sun Jan 26 12:50:35 2014 -0800

    add Apache headers and update code style

commit 985609fe1a55655ad11966e05a93c18c138a403d
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sun Jan 26 11:49:25 2014 -0800

    add new lines

commit b21bddf29850a2c006a868869b8f91960a029322
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sun Jan 26 11:46:35 2014 -0800

    move samplers to random.IndependentRandomSampler and add tests

commit c02dacb4a941618e434cefc129c002915db08be6
Author: Xiangrui Meng <meng@databricks.com>
Date:   Sat Jan 25 15:20:24 2014 -0800

    add RandomSampler

commit 8ff7ba3c5cf1fc338c29ae8b5fa06c222640e89c
Author: Xiangrui Meng <meng@databricks.com>
Date:   Fri Jan 24 13:23:22 2014 -0800

    init impl of IndependentlySampledRDD

23af00f9

Merge pull request #530 from aarondav/cleanup. Closes #530. · 1625d8c4

Aaron Davidson authored 11 years ago

Remove explicit conversion to PairRDDFunctions in cogroup()

As SparkContext._ is already imported, using the implicit conversion appears to make the code much cleaner. Perhaps there was some sinister reason for doing the conversion explicitly, however.

Author: Aaron Davidson <aaron@databricks.com>

== Merge branch commits ==

commit aa4a63f1bfd5b5178fe67364dd7ce4d84c357996
Author: Aaron Davidson <aaron@databricks.com>
Date:   Sun Feb 2 23:48:04 2014 -0800

    Remove explicit conversion to PairRDDFunctions in cogroup()

    As SparkContext._ is already imported, using the implicit conversion
    appears to make the code much cleaner. Perhaps there was some sinister
    reason for doing the converion explicitly, however.

1625d8c4