Commits · 9e6f3bdcda1ab48159afa4f54b64d05e42a8688e · cs525-sp18-g07 / spark

Jan 03, 2014
- Changes on top of Prashant's patch. · 9e6f3bdc
  Patrick Wendell authored 11 years ago
  
  Closes #316
  9e6f3bdc
- Restored the previously removed test · bc311bb8
  Prashant Sharma authored 11 years ago
  
  bc311bb8
- fixed review comments · 94f2fffa
  Prashant Sharma authored 11 years ago
  
  94f2fffa
- Merge branch 'master' into spark-1002-remove-jars · b4bb8000
  Prashant Sharma authored 11 years ago
  
  b4bb8000
Jan 02, 2014

Merge pull request #323 from tgravescs/sparkconf_yarn_fix · 498a5f0a

Patrick Wendell authored 11 years ago

fix spark on yarn after the sparkConf changes

This fixes it so that spark on yarn now compiles and works after the sparkConf changes.

There are also other issues I discovered along the way that are broken:
- mvn builds for yarn don't assemble correctly
- unset SPARK_EXAMPLES_JAR isn't handled properly anymore
- I'm pretty sure spark.conf doesn't actually work, as it's not distributed with yarn

Those things can be fixed in a separate PR unless others disagree.

498a5f0a

Merge pull request #320 from kayousterhout/erroneous_failed_msg · 0475ca8f

Reynold Xin authored 11 years ago

Remove erroneous FAILED state for killed tasks.

Currently, when tasks are killed, the Executor first sends a
status update for the task with a "KILLED" state, and then
sends a second status update with a "FAILED" state saying that
the task failed due to an exception. The second FAILED state is
misleading/unnecessary, and occurs due to a NonLocalReturnControl
Exception that gets thrown due to the way we kill tasks. This
commit eliminates that problem.

I'm not at all sure that this is the best way to fix this problem,
so alternate suggestions welcome. @rxin guessing you're the right
person to look at this.
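
The double status update described above can be illustrated with a minimal Python sketch. It is not the actual patch: `KillSignal` is a hypothetical stand-in for the control-flow exception, and the two runner functions only assume that a catch-all handler was what turned the kill into a second FAILED report.

```python
class KillSignal(Exception):
    """Hypothetical control-flow exception that unwinds a killed task."""


def killed_task(report):
    report("KILLED")     # first status update, sent by the kill path
    raise KillSignal()   # leaves the task body, like a non-local return


def run_before_fix(task, report):
    try:
        task(report)
    except Exception:
        # The catch-all also sees the control-flow exception,
        # so a killed task is reported a second time as FAILED.
        report("FAILED")


def run_after_fix(task, report):
    try:
        task(report)
    except KillSignal:
        pass             # already reported as KILLED; nothing more to send
    except Exception:
        report("FAILED")


before, after = [], []
run_before_fix(killed_task, before.append)
run_after_fix(killed_task, after.append)
```

After the run, `before` holds both KILLED and FAILED, while `after` holds only KILLED.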

0475ca8f

fix yarn-client · fced7885
Thomas Graves authored 11 years ago

fced7885
Fix yarn build after sparkConf changes · c6de982b
Thomas Graves authored 11 years ago

c6de982b

Merge pull request #297 from tdas/window-improvement · 588a1695

Patrick Wendell authored 11 years ago

Improvements to DStream window ops and refactoring of Spark's CheckpointSuite

- Added a new RDD - PartitionerAwareUnionRDD. Using this RDD, one can take multiple RDDs partitioned by the same partitioner and unify them into a single RDD while preserving the partitioner. So m RDDs with p partitions each will be unified to a single RDD with p partitions and the same partitioner. The preferred location for each partition of the unified RDD will be the most common preferred location of the corresponding partitions of the parent RDDs. For example, location of partition 0 of the unified RDD will be where most of partition 0 of the parent RDDs are located.
- Improved the performance of DStream's reduceByKeyAndWindow and groupByKeyAndWindow. Both these operations work by doing per-batch reduceByKey/groupByKey and then using PartitionerAwareUnionRDD to union the RDDs across the window. This eliminates a shuffle related to the window operation, which can reduce batch processing time by 30-40% for simple workloads.
- Fixed bugs and simplified Spark's CheckpointSuite. Some of the tests were incorrect and unreliable. Added missing tests for ZippedRDD. I can go into greater detail if necessary.
- Added mapSideCombine option to combineByKeyAndWindow.
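
The partitioner-preserving union described in the first bullet can be sketched in plain Python, with lists standing in for RDDs and partitions. The function names are hypothetical and this is only an illustration of the idea, not the PartitionerAwareUnionRDD code.

```python
from collections import Counter


def union_preserving_partitions(parents):
    """Unify m parents with p partitions each into one result with p partitions.

    Each parent is a list of partitions; each partition is a list of records.
    Partition i of the result is the concatenation of partition i of every
    parent, so records never move to a different partition index.
    """
    num_partitions = len(parents[0])
    if any(len(parent) != num_partitions for parent in parents):
        raise ValueError("all parents must have the same number of partitions")
    return [
        [record for parent in parents for record in parent[i]]
        for i in range(num_partitions)
    ]


def preferred_location(parent_locations):
    """Pick the most common host among the corresponding parent partitions."""
    return Counter(parent_locations).most_common(1)[0][0]


# Two per-batch results, both partitioned the same way by key.
batch1 = [[("a", 1)], [("b", 2)]]
batch2 = [[("a", 3)], [("b", 4)]]
window = union_preserving_partitions([batch1, batch2])
```

Because every key stays at the same partition index, a per-key reduce over `window` can run inside each partition, which is why the window operation no longer needs a shuffle.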

588a1695

Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/incubator-spark · 7bafb68d
Matei Zaharia authored 11 years ago

7bafb68d

Merge pull request #319 from kayousterhout/remove_error_method · 5e67cdc8

Reynold Xin authored 11 years ago

Removed redundant TaskSetManager.error() function.

This function was left over from a while ago, and now just
passes all calls through to the abort() function, so this
commit deletes it.

5e67cdc8

Merge pull request #311 from tmyklebu/master · ca67909c

Matei Zaharia authored 11 years ago

SPARK-991: Report information gleaned from a Python stacktrace in the UI

Scala:

- Added setCallSite/clearCallSite to SparkContext and JavaSparkContext.
  These functions mutate a LocalProperty called "externalCallSite."
- Add a wrapper, getCallSite, that checks for an externalCallSite and, if
  none is found, calls the usual Utils.formatSparkCallSite.
- Change everything that calls Utils.formatSparkCallSite to call
  getCallSite instead (except getCallSite itself).
- Add setCallSite/clearCallSite wrappers to JavaSparkContext.

Python:

- Add a gruesome hack to rdd.py that inspects the traceback and guesses
  what you want to see in the UI.
- Add a RAII wrapper around said gruesome hack that calls
  setCallSite/clearCallSite as appropriate.
- Wire said RAII wrapper up around three calls into the Scala code.
  I'm not sure that I hit all the spots with the RAII wrapper. I'm also
  not sure that my gruesome hack does exactly what we want.

One could also approach this change by refactoring
runJob/submitJob/runApproximateJob to take a call site, then threading
that parameter through everything that needs to know it.

One might object to the pointless-looking wrappers in JavaSparkContext.
Unfortunately, I can't directly access the SparkContext from
Python---or, if I can, I don't know how---so I need to wrap everything
that matters in JavaSparkContext.
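
The RAII wrapper mentioned in the Python list maps naturally onto a context manager. The sketch below is an assumption about its shape, not the code in rdd.py: `FakeContext` is a hypothetical stand-in that exposes only the setCallSite/clearCallSite pair named above, and the traceback inspection is left out.

```python
from contextlib import contextmanager


class FakeContext:
    """Hypothetical stand-in for the JVM-side context object."""

    def __init__(self):
        self.call_site = None

    def setCallSite(self, site):
        self.call_site = site

    def clearCallSite(self):
        self.call_site = None


@contextmanager
def call_site(ctx, description):
    """Set the call site on entry and always clear it on exit."""
    ctx.setCallSite(description)
    try:
        yield
    finally:
        # Runs even if the wrapped call raises, so a stale call site
        # never leaks into the next job.
        ctx.clearCallSite()


ctx = FakeContext()
with call_site(ctx, "collect at example.py:10"):
    seen_inside = ctx.call_site
seen_after = ctx.call_site
```

Wrapping each call into the Scala side in a `with call_site(...)` block keeps the set/clear pairing in one place.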

Conflicts:
	core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala

ca67909c

Remove erroneous FAILED state for killed tasks. · a1b438d9

Kay Ousterhout authored 11 years ago

Currently, when tasks are killed, the Executor first sends a
status update for the task with a "KILLED" state, and then
sends a second status update with a "FAILED" state saying that
the task failed due to an exception. The second FAILED state is
misleading/unnecessary, and occurs due to a NonLocalReturnControl
Exception that gets thrown due to the way we kill tasks. This
commit eliminates that problem.

a1b438d9

Removed redundant TaskSetManager.error() function. · 5a3c00c9

Kay Ousterhout authored 11 years ago

This function was left over from a while ago, and now just
passes all calls through to the abort() function, so this
commit deletes it.

5a3c00c9

Removed a repeated test and changed tests to not use uncommons jar · 08ec10de
Prashant Sharma authored 11 years ago

08ec10de
ignoring tests for now, contrary to what I assumed these tests make sense... · 436f3d28
Prashant Sharma authored 11 years ago
```
Ignoring tests for now; contrary to what I assumed, these tests make sense given what they are testing.
```
436f3d28
Removed sbt folder and changed docs accordingly · 6be4c111
Prashant Sharma authored 11 years ago

6be4c111
Deleted py4j jar and added to assembly dependency · 8821c3a5
Prashant Sharma authored 11 years ago

8821c3a5

Jan 01, 2014

Merge pull request #309 from mateiz/conf2 · 3713f812

Patrick Wendell authored 11 years ago

SPARK-544. Migrate configuration to a SparkConf class

This is still a work in progress based on Prashant and Evan's code. So far I've done the following:

- Got rid of global SparkContext.globalConf
- Passed SparkConf to serializers and compression codecs
- Made SparkConf public instead of private[spark]
- Improved API of SparkContext and SparkConf
- Switched executor environment vars to be passed through SparkConf
- Fixed some places that were still using system properties
- Fixed some tests, though others are still failing

This still fails several tests in core, repl and streaming, likely due to properties not being set or cleared correctly (some of the tests run fine in isolation). But the API at least is hopefully ready for review. Unfortunately there was a lot of global stuff before due to a "SparkContext.globalConf" method that let you set a "default" configuration of sorts, which meant I had to make some pretty big changes.

3713f812

Fix Python code after change of getOrElse · 7e8d2e8a
Matei Zaharia authored 11 years ago

7e8d2e8a
Fixed two uses of conf.get with no default value in Mesos · 0f606073
Matei Zaharia authored 11 years ago

0f606073

Miscellaneous fixes from code review. · e2c68642

Matei Zaharia authored 11 years ago

Also replaced SparkConf.getOrElse with just a "get" that takes a default
value, and added getInt, getLong, etc to make code that uses this
simpler later on.
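
As a rough illustration of the accessor style described above, here is a hypothetical Python stand-in for a string-keyed configuration object. The class, its method names and the key used are assumptions for the sketch, not the SparkConf API.

```python
class SimpleConf:
    """Hypothetical configuration object that stores values as strings."""

    _MISSING = object()   # sentinel meaning "no default was supplied"

    def __init__(self):
        self._settings = {}

    def set(self, key, value):
        self._settings[key] = str(value)
        return self       # returning self allows chained set() calls

    def get(self, key, default=_MISSING):
        if key in self._settings:
            return self._settings[key]
        if default is SimpleConf._MISSING:
            raise KeyError(key)   # no value and no default
        return default

    def get_int(self, key, default):
        # Typed accessor: callers no longer parse the string themselves.
        return int(self.get(key, default))


conf = SimpleConf().set("example.retries", 3)
retries = conf.get_int("example.retries", 1)
timeout = conf.get_int("example.timeout", 30)
```

A single `get` with an optional default replaces a separate get-or-else method, and the typed variants keep parsing out of the calling code.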

e2c68642

Merge remote-tracking branch 'apache/master' into conf2 · 45ff8f41

Matei Zaharia authored 11 years ago

Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
	core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala
	core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala

45ff8f41

Merge pull request #312 from pwendell/log4j-fix-2 · c1d928a8

Patrick Wendell authored 11 years ago

SPARK-1008: Logging improvements

1. Adds a default log4j file that gets loaded if users haven't specified a log4j file.
2. Isolates use of the tools assembly jar. I found this produced SLF4J warnings
after building with SBT (and I've seen similar warnings on the mailing list).

c1d928a8

Merge remote-tracking branch 'apache-github/master' into log4j-fix-2 · f8d245bd
Patrick Wendell authored 11 years ago
```
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
```
f8d245bd
Merge remote-tracking branch 'apache/master' into conf2 · 0e5b2adb
Matei Zaharia authored 11 years ago
```
Conflicts:
	project/SparkBuild.scala
```
0e5b2adb

Dec 31, 2013

Merge pull request #314 from witgo/master · 9a0ff721
Reynold Xin authored 11 years ago
```
restore core/pom.xml file modification
```
9a0ff721
restore core/pom.xml file modification · b5d0b3b0
liguoqiang authored 11 years ago

b5d0b3b0

Merge pull request #73 from falaki/ApproximateDistinctCount · 8b8e70eb

Reynold Xin authored 11 years ago

Approximate distinct count

Added countApproxDistinct() to RDD and countApproxDistinctByKey() to PairRDDFunctions to approximately count the number of distinct elements and the number of distinct values per key, respectively. Both functions use HyperLogLog from stream-lib for counting. Both functions take a parameter that controls the trade-off between accuracy and memory consumption. Also added Scala docs and test suites for both methods.
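
For readers unfamiliar with HyperLogLog, a minimal self-contained Python sketch of the estimator follows. It is not the stream-lib implementation; the class name, the SHA-1 based hash and the `precision_bits` parameter are assumptions chosen for the illustration. More register bits mean more memory and a smaller expected error, which is the trade-off the message refers to.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**precision_bits registers."""

    def __init__(self, precision_bits=12):
        if precision_bits < 7:
            raise ValueError("the alpha constant used here assumes >= 128 registers")
        self.p = precision_bits
        self.m = 1 << precision_bits
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias correction

    def add(self, item):
        digest = hashlib.sha1(repr(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")         # 64-bit hash value
        index = h >> (64 - self.p)                    # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        # Rank is the 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


sketch = HyperLogLog(precision_bits=12)
for value in range(10000):
    sketch.add(value)
    sketch.add(value)    # duplicates do not change the registers
estimate = sketch.count()
```

The estimate is approximate by design, so callers should compare it against a tolerance rather than an exact count.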

8b8e70eb

Adding outer checkout when initializing logging · 37c43c9d
Patrick Wendell authored 11 years ago

37c43c9d
Made the code more compact and readable · bee445c9
Hossein Falaki authored 11 years ago

bee445c9
minor improvements · acb03230
Hossein Falaki authored 11 years ago

acb03230
Fix two compile errors introduced in merge · 42bcfb2b
Matei Zaharia authored 11 years ago

42bcfb2b

Merge remote-tracking branch 'apache/master' into conf2 · ba9338f1

Matei Zaharia authored 11 years ago

Conflicts:
	core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
	streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
	streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala

ba9338f1

Merge pull request #238 from ngbinh/upgradeNetty · 63b411dd

Patrick Wendell authored 11 years ago

upgrade Netty from 4.0.0.Beta2 to 4.0.13.Final

the changes are listed at https://github.com/netty/netty/wiki/New-and-noteworthy

63b411dd

Merge pull request #289 from tdas/filestream-fix · 55b7e2fd

Patrick Wendell authored 11 years ago

Bug fixes for file input stream and checkpointing

- Fixed bugs in the file input stream that led the stream to fail due to transient HDFS errors (e.g., listing files while a background thread is deleting them could fail and cause errors).
- Updated Spark's CheckpointRDD and Streaming's CheckpointWriter to use SparkContext.hadoopConfiguration, to allow checkpoints to be written to any HDFS compatible store requiring special configuration.
- Changed the API of SparkContext.setCheckpointDir() - eliminated the unnecessary 'useExisting' parameter. Now SparkContext will always create a unique subdirectory within the user specified checkpoint directory. This is to ensure that previous checkpoint files are not accidentally overwritten.
- Fixed bug where setting checkpoint directory as a relative local path caused the checkpointing to fail.
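
The "unique subdirectory" behaviour from the third bullet can be sketched on a local filesystem as follows. The function name and the use of a UUID for the subdirectory name are assumptions for illustration; the list above does not say how the unique name is generated.

```python
import os
import tempfile
import uuid


def create_checkpoint_dir(base_dir):
    """Create and return a fresh subdirectory under base_dir."""
    base = os.path.abspath(base_dir)        # resolve relative paths up front
    unique = os.path.join(base, str(uuid.uuid4()))
    os.makedirs(unique)                     # raises if the directory already exists
    return unique


with tempfile.TemporaryDirectory() as base:
    first = create_checkpoint_dir(base)
    second = create_checkpoint_dir(base)
    both_exist = os.path.isdir(first) and os.path.isdir(second)
    same_parent = os.path.dirname(first) == os.path.dirname(second)
```

Because every call gets its own subdirectory, an earlier run's checkpoint files under the same base directory are never overwritten.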

55b7e2fd

Fixed comments and long lines based on comments on PR 289. · fcd17a1e
Tathagata Das authored 11 years ago

fcd17a1e
Tiny typo fix · 4abb0c57
Patrick Wendell authored 11 years ago

4abb0c57
Removing use in test · 4d009dca
Patrick Wendell authored 11 years ago

4d009dca
Minor fixes · 3c254f2e
Patrick Wendell authored 11 years ago

3c254f2e