- Oct 24, 2013
Tathagata Das authored
- Oct 23, 2013
Tathagata Das authored
Tathagata Das authored
Tathagata Das authored
Fixed a bug in Java transformWith, added more Java test cases for transform and transformWith, added missing variations of Java join and cogroup, and updated various Scala and Java API docs.
Matei Zaharia authored
Add classmethod to SparkContext to set system properties. This adds a new classmethod to SparkContext for setting system properties, as is possible in Scala/Java. Unlike the Java/Scala implementations, there's no access to System until the JVM bridge is created. Since SparkContext handles that, move the initialization of the JVM connection to a separate classmethod that can safely be called repeatedly as long as the same instance (or no instance) is provided.
- Oct 22, 2013
Reynold Xin authored
Remove redundant Java Function call() definitions. This should fix [SPARK-902](https://spark-project.atlassian.net/browse/SPARK-902), an issue where some Java API Function classes could cause AbstractMethodErrors when user code is compiled using the Eclipse compiler. Thanks to @MartinWeindel for diagnosing this problem. (This PR subsumes #30.)
Josh Rosen authored
This should fix SPARK-902, an issue where some Java API Function classes could cause AbstractMethodErrors when user code is compiled using the Eclipse compiler. Thanks to @MartinWeindel for diagnosing this problem. (This PR subsumes / closes #30.)
Patrick Wendell authored
Use correct formatting for comments in StoragePerfTester
Patrick Wendell authored
Ewen Cheslack-Postava authored
Patrick Wendell authored
SPARK-940: Do not directly pass Stage objects to SparkListener. This patch updates the SparkListener interface to pass StageInfo objects rather than Spark's internal Stage objects. The reasoning for this change is explained in detail in SPARK-940.
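A minimal sketch of what a listener written against StageInfo looks like; the event and field names below follow the listener API as it settled in later Spark releases, so the exact names in this patch may differ:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// The listener receives a summary object (StageInfo) rather than the
// scheduler-internal Stage, so user code cannot poke at scheduler state.
class StageSummaryListener extends SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    println(s"Stage ${info.stageId} (${info.name}) ran ${info.numTasks} tasks")
  }
}
```

Such a listener is registered with SparkContext.addSparkListener.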
Ewen Cheslack-Postava authored
The constructor for SparkContext should pass in self so that we track the current context and produce errors if another one is created. Add a doctest to make sure creating multiple contexts triggers the exception.
Patrick Wendell authored
Patrick Wendell authored
Patrick Wendell authored
This patch fixes a bug where the Spark UI didn't display the correct number of total tasks if the number of tasks in a Stage doesn't equal the number of RDD partitions. It also cleans up the listener API a bit by embedding this information in the StageInfo class rather than passing it separately.
Patrick Wendell authored
Matei Zaharia authored
Docs: Fix links to RDD API documentation
Matei Zaharia authored
Split MapOutputTracker into Master/Worker classes. Previously, MapOutputTracker contained fields and methods that were only applicable to the master or worker instances. This commit introduces a MasterMapOutputTracker class to prevent the master-specific methods from being accessed on workers. I also renamed a few methods and made others protected/private.
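An illustrative sketch of the shape of this split (hypothetical class and field names, not the actual Spark internals): shared lookup logic stays in the base class, while registration methods that only make sense on the master move into a master-side subclass, so workers cannot call them by construction.

```scala
import scala.collection.mutable

class WorkerTrackerSketch {
  // Shared state and the read path, available on both master and workers.
  protected val mapStatuses = mutable.HashMap[Int, Array[String]]()

  def getServerStatuses(shuffleId: Int): Option[Array[String]] =
    mapStatuses.synchronized { mapStatuses.get(shuffleId) }
}

class MasterTrackerSketch extends WorkerTrackerSketch {
  // Master-only write path: workers never hold a MasterTrackerSketch, so
  // master-specific methods simply do not exist on them.
  def registerShuffle(shuffleId: Int, serverUris: Array[String]): Unit =
    mapStatuses.synchronized { mapStatuses(shuffleId) = serverUris }
}
```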
Aaron Davidson authored
Ewen Cheslack-Postava authored
Add a new classmethod to SparkContext to set system properties, as is possible in Scala/Java. Unlike the Java/Scala implementations, there's no access to System until the JVM bridge is created. Since SparkContext handles that, move the initialization of the JVM connection to a separate classmethod that can safely be called repeatedly as long as the same instance (or no instance) is provided.
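For comparison, a minimal sketch of the Scala/Java pattern this classmethod mirrors; the property name below is just an example:

```scala
import org.apache.spark.SparkContext

object SysPropsExample {
  def main(args: Array[String]): Unit = {
    // System properties must be set before the SparkContext is constructed,
    // since the context reads them during initialization.
    System.setProperty("spark.executor.memory", "1g")

    val sc = new SparkContext("local[2]", "SysPropsExample")
    println(sc.parallelize(1 to 10).count())
    sc.stop()
  }
}
```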
Matei Zaharia authored
Fix the Worker to use CoarseGrainedExecutorBackend and modify the classpath to be explicit about inclusion of spark.jar and app.jar. Be explicit so that if there are any packaging conflicts between spark.jar and app.jar, we don't get random results due to the classpath containing /*, which can include things in a different order.
Matei Zaharia authored
Upgrade Kafka 0.7.2 to Kafka 0.8.0-beta1 for Spark Streaming.
Conflicts: streaming/pom.xml
Reynold Xin authored
Basic shuffle file consolidation. The Spark shuffle phase can produce a large number of files, as one file is created per mapper per reducer. For large or repeated jobs, this often produces millions of shuffle files, which causes severely degraded performance in the OS file system. This patch seeks to reduce that burden by combining multiple shuffle files into one.

This PR draws upon the work of @jason-dai in https://github.com/mesos/spark/pull/669. However, it simplifies the design in order to get the majority of the gain with less overall intellectual and code burden. The vast majority of code in this pull request is a refactor to allow the insertion of a clean layer of indirection between logical block ids and physical files. This, I feel, provides some design clarity in addition to enabling shuffle file consolidation.

The main goal is to produce one shuffle file per reducer per active mapper thread. This allows us to isolate the mappers (simplifying the failure modes), while still reducing the number of shuffle files tremendously for large jobs. In order to accomplish this, we simply create a new set of shuffle files for every parallel task, and return the files to a pool which will be given out to the next task that runs.

I have run some ad hoc query testing on 5 m1.xlarge EC2 nodes with 2g of executor memory and the following microbenchmark:

```scala
scala> val nums = sc.parallelize(1 to 1000, 1000).flatMap(x => (1 to 1e6.toInt))
scala> def time(x: => Unit) = { val now = System.currentTimeMillis; x; System.currentTimeMillis - now }
scala> (1 to 8).map(_ => time(nums.map(x => (x % 100000, 2000, x)).reduceByKey(_ + _).count) / 1000.0)
```

For this particular workload, with 1000 mappers and 2000 reducers, I saw the old method running at around 15 minutes, with the consolidated shuffle files running at around 4 minutes. There was a very sharp increase in running time for the non-consolidated version after around 1 million total shuffle files; below this threshold, however, there wasn't a significant difference between the two.

Better performance measurement of this patch is warranted, and I plan on doing so in the near future as part of a general investigation of our shuffle file bottlenecks and performance.
Aaron Davidson authored
- Oct 21, 2013
Reynold Xin authored
Minor: Put StoragePerfTester in org/apache/
Aaron Davidson authored
Matei Zaharia authored
Fix Mesos URLs. This was a bug I introduced in https://github.com/apache/incubator-spark/pull/71. Previously, we explicitly removed the mesos:// part; with #71, this no longer occurs.
Aaron Davidson authored
This was a bug I introduced in https://github.com/apache/incubator-spark/pull/71. Previously, we explicitly removed the mesos:// part; with PR 71, this no longer occurs.
Aaron Davidson authored
tgravescs authored
Fix the Worker to use CoarseGrainedExecutorBackend and modify the classpath to be explicit about inclusion of spark.jar and app.jar.
Patrick Wendell authored
Made the following traits/interfaces/classes non-public:
- SparkHadoopWriter
- SparkHadoopMapRedUtil
- SparkHadoopMapReduceUtil
- SparkHadoopUtil
- PythonAccumulatorParam
- BlockManagerSlaveActor
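The mechanism behind this change is Scala's package-qualified access modifier; a minimal sketch with a hypothetical trait name:

```scala
package org.apache.spark.util

// private[spark] makes the trait visible everywhere under org.apache.spark,
// but invisible to user code outside that package.
private[spark] trait InternalHelperSketch {
  def helperValue: Int = 42
}
```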
Tathagata Das authored
Updated TransformDStream to allow n-ary DStream transforms. Added transformWith, leftOuterJoin, and rightOuterJoin operations to DStream for the Scala and Java APIs. Also added n-ary union and n-ary transform operations to StreamingContext for the Scala and Java APIs.
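A minimal sketch of the new transformWith operation; the socket sources and stream names are hypothetical stand-ins for real inputs:

```scala
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TransformWithSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext("local[2]", "TransformWithSketch", Seconds(1))

    // Two keyed streams built from stand-in socket sources.
    val lefts  = ssc.socketTextStream("localhost", 9999).map(w => (w, 1))
    val rights = ssc.socketTextStream("localhost", 9998).map(w => (w, 1))

    // transformWith applies an arbitrary RDD-to-RDD function across two
    // streams, batch by batch -- here a plain RDD join.
    val joined = lefts.transformWith(rights,
      (l: RDD[(String, Int)], r: RDD[(String, Int)]) => l.join(r))

    joined.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```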
Aaron Davidson authored
Aaron Davidson authored
Patrick Wendell authored
Provide Instrumentation for Shuffle Write Performance. Shuffle write performance can have a major impact on the performance of jobs. This patch adds a few pieces of instrumentation related to shuffle writes:

1. A listing of the time spent performing blocking writes for each task. This is implemented by keeping track of the aggregate delay seen by many individual writes.
2. An undocumented option `spark.shuffle.sync` which forces shuffle data to sync to disk. This is necessary for measuring shuffle performance in the absence of the OS buffer cache.
3. An internal utility which micro-benchmarks write throughput for simulated shuffle outputs.

I'm going to do some performance testing on this to see whether these small timing calls add overhead. From a feature perspective, however, I consider this complete. Any feedback is appreciated.
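A note on point 2 above: since the option is a plain system property, a benchmark run can enable it the same way as any other Spark setting of that era; a minimal sketch:

```scala
// For measurement only: force shuffle writes to sync to disk, defeating the
// OS buffer cache. Must be set before the SparkContext is created.
System.setProperty("spark.shuffle.sync", "true")
```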
- Oct 20, 2013
Reynold Xin authored
Don't set up the uncaught exception handler in local mode. This avoids unit test failures for Spark Streaming such as:

```
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.streaming.JobManager$JobHandler@38cf728d rejected from java.util.concurrent.ThreadPoolExecutor@3b69a41e[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
    at org.apache.spark.streaming.JobManager.runJob(JobManager.scala:54)
    at org.apache.spark.streaming.Scheduler$$anonfun$generateJobs$2.apply(Scheduler.scala:108)
    at org.apache.spark.streaming.Scheduler$$anonfun$generateJobs$2.apply(Scheduler.scala:108)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.streaming.Scheduler.generateJobs(Scheduler.scala:108)
    at org.apache.spark.streaming.Scheduler$$anonfun$1.apply$mcVJ$sp(Scheduler.scala:41)
    at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:66)
    at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:34)
```
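A minimal sketch of the guard this implies; `isLocal` and the handler body are stand-ins, not the actual Spark code:

```scala
def maybeInstallExceptionHandler(isLocal: Boolean): Unit = {
  // Only install the process-wide handler on real executors. In local mode
  // the tests share one JVM, and exiting on a task failure would tear down
  // the streaming scheduler's thread pools (producing errors like the one
  // in the stack trace above).
  if (!isLocal) {
    Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      override def uncaughtException(t: Thread, e: Throwable): Unit = {
        System.err.println(s"Uncaught exception in thread ${t.getName}: $e")
        sys.exit(1)
      }
    })
  }
}
```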
Reynold Xin authored
Matei Zaharia authored
Exclusion rules for Maven build files.
Aaron Davidson authored
Matei Zaharia authored
Code de-duplication in BlockManager. The BlockManager has a few methods that duplicate most of their code. This pull request extracts the duplicated code into private doPut(), doGetLocal(), and doGetRemote() methods that unify the storing/reading of bytes or objects. I believe that I preserved the logic of the original code, but I'd appreciate some help in reviewing this.
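An illustrative sketch of the shape of this de-duplication (hypothetical types, not the real BlockManager): the public byte/object variants become thin wrappers around a single private method that owns the shared bookkeeping.

```scala
sealed trait BlockValues
case class ByteValues(bytes: Array[Byte]) extends BlockValues
case class IteratorValues(values: Iterator[Any]) extends BlockValues

class BlockStoreSketch {
  def putBytes(blockId: String, bytes: Array[Byte]): Unit =
    doPut(blockId, ByteValues(bytes))

  def putValues(blockId: String, values: Iterator[Any]): Unit =
    doPut(blockId, IteratorValues(values))

  private def doPut(blockId: String, data: BlockValues): Unit = {
    // Locking, storage-level checks, and replication are written once here,
    // instead of once per public variant.
  }
}
```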