Commits · 743a31a7ca4421cbbd5b615b773997a06a7ab4ee · cs525-sp18-g07 / spark

Nov 27, 2013

Merge pull request #210 from haitaoyao/http-timeout · 743a31a7

Matei Zaharia authored 11 years ago

add http timeout for httpbroadcast

While pulling task bytecode from the HttpBroadcast server, no timeout value is set. This may cause Spark executor code to hang and other tasks in the same executor process to wait for the lock. I have encountered this issue in my cluster. Here's the stack trace I captured: https://gist.github.com/haitaoyao/7655830

So add a timeout value to ensure the task fails fast.
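
The effect of a missing timeout can be illustrated outside Spark. The following is a minimal Python sketch, not the Scala change made in this pull request: `fetch_with_timeout`, its `opener` parameter, and the 60-second default are hypothetical names and values chosen for illustration. It passes an explicit timeout to `urllib.request.urlopen` so that a stalled server raises an error instead of blocking the caller indefinitely.

```python
import urllib.request


def fetch_with_timeout(url, timeout_s=60.0, opener=urllib.request.urlopen):
    """Fetch `url`, failing fast if the server stalls.

    `timeout_s` is an assumed default, not a value taken from Spark.
    `opener` is injectable so the function can be exercised without a network.
    """
    if timeout_s <= 0:
        raise ValueError("timeout_s must be positive")
    # Without `timeout`, urlopen falls back to the global socket default,
    # which blocks forever unless socket.setdefaulttimeout() was called.
    with opener(url, timeout=timeout_s) as response:
        return response.read()
```

A caller that holds a lock while fetching benefits most from this: the error surfaces after `timeout_s` seconds and the lock can be released, rather than every waiter hanging behind one stuck download.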

743a31a7

Nov 26, 2013

Merge pull request #146 from JoshRosen/pyspark-custom-serializers · fb6875dd

Matei Zaharia authored 11 years ago

Custom Serializers for PySpark

This pull request adds support for custom serializers to PySpark.  For now, all Python-transformed (or parallelize()d) RDDs are serialized with the same serializer that's specified when creating SparkContext.

For now, PySpark includes `PickleSerDe` and `MarshalSerDe` classes for using Python's `pickle` and `marshal` serializers.  It's pretty easy to add support for other serializers, although I still need to add instructions on this.

A few notable changes:

- The Scala `PythonRDD` class no longer manipulates Pickled objects; data from `textFile` is written to Python as MUTF-8 strings.  The Python code performs the appropriate bookkeeping to track which deserializer should be used when reading an underlying JavaRDD.  This mechanism could also be used to support other data exchange formats, such as MsgPack.
- Several magic numbers were refactored into constants.
- Batching is implemented by wrapping / decorating an unbatched SerDe.
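
The "batching as a decorator" idea from the last bullet can be sketched in plain Python. This is an illustration under assumptions, not the PySpark code from this pull request: the class names `PickleSerializer`, `MarshalSerializer`, and `BatchedSerializer`, the `dumps`/`loads`/`dump_stream`/`load_stream` method names, and the 4-byte length-prefixed framing are all hypothetical choices made for the example.

```python
import io
import marshal
import pickle
import struct


class PickleSerializer:
    """Serializes one object at a time with `pickle`."""

    def dumps(self, obj):
        return pickle.dumps(obj, protocol=2)

    def loads(self, data):
        return pickle.loads(data)


class MarshalSerializer:
    """Same interface, backed by `marshal` (supports fewer types than pickle)."""

    def dumps(self, obj):
        return marshal.dumps(obj)

    def loads(self, data):
        return marshal.loads(data)


class BatchedSerializer:
    """Decorates an unbatched serializer: groups items into lists of
    `batch_size` and writes each list as one length-prefixed frame."""

    def __init__(self, inner, batch_size):
        self.inner = inner
        self.batch_size = batch_size

    def _write_frame(self, batch, stream):
        payload = self.inner.dumps(batch)
        stream.write(struct.pack("!i", len(payload)))  # 4-byte big-endian length
        stream.write(payload)

    def dump_stream(self, items, stream):
        batch = []
        for item in items:
            batch.append(item)
            if len(batch) == self.batch_size:
                self._write_frame(batch, stream)
                batch = []
        if batch:  # flush the final, possibly short, batch
            self._write_frame(batch, stream)

    def load_stream(self, stream):
        while True:
            header = stream.read(4)
            if not header:
                return  # clean end of stream
            (length,) = struct.unpack("!i", header)
            yield from self.inner.loads(stream.read(length))
```

Because the wrapper only relies on `dumps` and `loads`, swapping `PickleSerializer()` for `MarshalSerializer()` (or another format) requires no change to the batching logic.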

fb6875dd

Merge pull request #207 from henrydavidge/master · 330ada17

Matei Zaharia authored 11 years ago

Log a warning if a task's serialized size is very big

As per Reynold's instructions, we now create a warning-level log entry if a task's serialized size is too big. "Too big" is currently defined as 100 KB. This warning message is generated at most once for each stage.
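
The "warn at most once per stage" rule can be sketched as follows. This is a Python illustration, not the Scala code from the pull request: `TaskSizeWarner`, `check`, the logger name, and the reading of "100 KB" as 100 * 1024 bytes are assumptions.

```python
import logging

logger = logging.getLogger("task_size_sketch")

TASK_SIZE_TO_WARN = 100 * 1024  # bytes; "100 KB" is read as 100 * 1024 here


class TaskSizeWarner:
    def __init__(self, threshold=TASK_SIZE_TO_WARN):
        self.threshold = threshold
        self._warned_stages = set()  # stage ids that already produced a warning

    def check(self, stage_id, serialized_task):
        """Log a warning for an oversized task; return True if one was logged."""
        size = len(serialized_task)
        if size <= self.threshold or stage_id in self._warned_stages:
            return False
        self._warned_stages.add(stage_id)
        logger.warning(
            "Stage %s contains a task of very large size (%d KB)",
            stage_id,
            size // 1024,
        )
        return True
```

Tracking warned stages in a set keeps the log readable when a stage launches thousands of similarly sized tasks.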

330ada17

Merge pull request #212 from markhamstra/SPARK-963 · 615213fb
Matei Zaharia authored 11 years ago
```
[SPARK-963] Fixed races in JobLoggerSuite
```
615213fb
Removed unused basestring case from dump_stream. · 1b74a27d
Josh Rosen authored 11 years ago

1b74a27d
Emit warning when task size > 100KB · 57579934
hhd authored 11 years ago

57579934
[SPARK-963] Wait for SparkListenerBus eventQueue to be empty before checking jobLogger state · ed7ecb93
Mark Hamstra authored 11 years ago

ed7ecb93
Merge pull request #209 from pwendell/better-docs · cb976dfb
Reynold Xin authored 11 years ago
```
Improve docs for shuffle instrumentation
```
cb976dfb
add http timeout for httpbroadcast · db998a6e
haitao.yao authored 11 years ago

db998a6e

Merge pull request #86 from holdenk/master · 18d6df0e

Matei Zaharia authored 11 years ago

Add histogram functionality to DoubleRDDFunctions

This pull request adds histogram functionality to DoubleRDDFunctions.

18d6df0e

Improve docs for shuffle instrumentation · 297c09d4
Patrick Wendell authored 11 years ago

297c09d4

Nov 25, 2013

Fix the test · 7222ee29
Holden Karau authored 11 years ago

7222ee29

Merge pull request #204 from rxin/hash · 0e2109dd

Matei Zaharia authored 11 years ago

OpenHashSet fixes

Incorporated ideas from pull request #200.
- Use Murmur Hash 3 finalization step to scramble the bits of HashCode
  instead of the simpler version in java.util.HashMap; the latter one
  had trouble with ranges of consecutive integers. Murmur Hash 3 is used
  by fastutil.
- Don't check keys for equality when re-inserting due to growing the
  table; the keys will already be unique.
- Remember the grow threshold instead of recomputing it on each insert

Also added unit tests for size estimation for specialized hash sets and maps.
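
The three ideas in the list above can be sketched together in a few lines of Python. This is an illustration, not Spark's `OpenHashSet`: the class `IntOpenHashSet`, its linear probing, and the 0.7 load factor are assumptions made for the example. Only `fmix32` follows a fixed definition, namely the 32-bit finalization step of MurmurHash3.

```python
MASK32 = 0xFFFFFFFF


def fmix32(h):
    """MurmurHash3 32-bit finalization step (bit scrambler)."""
    h &= MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


class IntOpenHashSet:
    """Open-addressing set of ints with linear probing (illustrative only)."""

    def __init__(self, capacity=8, load_factor=0.7):
        assert capacity > 0 and capacity & (capacity - 1) == 0, "power of two"
        self._slots = [None] * capacity
        self._load_factor = load_factor
        self._size = 0
        # Computed once per resize instead of on every insert.
        self._grow_threshold = int(load_factor * capacity)

    def __len__(self):
        return self._size

    def _find_slot(self, slots, key, check_equal):
        mask = len(slots) - 1
        pos = fmix32(key) & mask
        while slots[pos] is not None:
            if check_equal and slots[pos] == key:
                return pos, True
            pos = (pos + 1) & mask
        return pos, False

    def __contains__(self, key):
        return self._find_slot(self._slots, key, True)[1]

    def add(self, key):
        pos, found = self._find_slot(self._slots, key, True)
        if found:
            return False
        self._slots[pos] = key
        self._size += 1
        if self._size > self._grow_threshold:
            self._grow()
        return True

    def _grow(self):
        new_slots = [None] * (len(self._slots) * 2)
        for key in self._slots:
            if key is not None:
                # Keys are already unique, so skip the equality check.
                pos, _ = self._find_slot(new_slots, key, False)
                new_slots[pos] = key
        self._slots = new_slots
        self._grow_threshold = int(self._load_factor * len(new_slots))
```

Each step of `fmix32` is invertible, so distinct 32-bit inputs always map to distinct outputs; the scrambling only changes which slot a key lands in, which is what spreads runs of consecutive integers across the table.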

0e2109dd

Merge pull request #206 from ash211/patch-2 · c46067f0

Matei Zaharia authored 11 years ago

Update tuning.md

Clarify when serializer is used based on recent user@ mailing list discussion.

c46067f0

Merge pull request #201 from rxin/mappartitions · 14bb465b

Matei Zaharia authored 11 years ago

Use the proper partition index in mapPartitionsWithIndex

mapPartitionsWithIndex uses TaskContext.partitionId as the partition index. TaskContext.partitionId used to be identical to the partition index in an RDD. However, pull request #186 introduced a scenario (with partition pruning) in which the two can differ. This pull request uses the right partition index in all mapPartitionsWithIndex-related calls.

Also removed the extra MapPartitionsWithContextRDD and put all the mapPartitions-related functionality in MapPartitionsRDD.
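
The difference between the two indices can be shown with plain Python lists standing in for partitions. This is a conceptual sketch, not Spark code: `prune` and `map_partitions_with_index` are hypothetical helpers written for this example.

```python
def prune(partitions, keep):
    """Keep only partitions whose parent index satisfies `keep`.

    Each kept partition remembers the index it had in the parent.
    """
    return [(i, part) for i, part in enumerate(partitions) if keep(i)]


def map_partitions_with_index(pruned, f):
    """Call `f(index, partition)` with the index in the *current* collection,
    not the index the partition had in its parent."""
    return [f(index, part) for index, (_parent_index, part) in enumerate(pruned)]


parent = [["a"], ["b"], ["c"], ["d"]]
pruned = prune(parent, lambda i: i in (1, 3))
labelled = map_partitions_with_index(pruned, lambda i, part: (i, part))
```

Here the pruned collection has two partitions, so valid indices are 0 and 1, while the remembered parent indices are 1 and 3. Using a parent index to address something sized for the pruned collection is the kind of out-of-range access the wrong index can produce.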

14bb465b

Update tuning.md · 08afef37

Andrew Ash authored 11 years ago

Clarify when serializer is used based on recent user@ mailing list discussion.

08afef37

Merge pull request #101 from colorant/yarn-client-scheduler · eb4296c8

Matei Zaharia authored 11 years ago

For SPARK-527, Support spark-shell when running on YARN

sync to trunk and resubmit here

In the current YARN mode, the application runs inside the Application Master as a user program, so the whole Spark context is remote.

This approach cannot support applications that involve local interaction and need to run where they are launched.

So in this pull request I have added a YarnClientClusterScheduler and backend.

With this scheduler, the user application is launched locally, while the executors are launched by YARN on remote nodes with a thin AM that only launches the executors and monitors the Driver Actor status, so that when the client app is done, it can finish the YARN application as well.

This enables spark-shell to run on YARN.

This also enables other Spark applications to run the Spark context locally with the master URL "yarn-client". Thus, for example, SparkPi can print its result on the local console instead of in the log of the remote machine where the AM is running.

Docs also updated to show how to use this yarn-client mode.

eb4296c8

Incorporated ideas from pull request #200. · 466fd064

Reynold Xin authored 11 years ago

- Use Murmur Hash 3 finalization step to scramble the bits of HashCode
  instead of the simpler version in java.util.HashMap; the latter one
  had trouble with ranges of consecutive integers. Murmur Hash 3 is used
  by fastutil.

- Don't check keys for equality when re-inserting due to growing the
  table; the keys will already be unique

- Remember the grow threshold instead of recomputing it on each insert

466fd064

Added unit tests for size estimation for specialized hash sets and maps. · 95c55df1
Reynold Xin authored 11 years ago

95c55df1

Nov 24, 2013

Merge pull request #203 from witgo/master · 62889c41
Reynold Xin authored 11 years ago
```
Fix Maven build for metrics-graphite
```
62889c41
Fix Maven build for metrics-graphite · 98920360
LiGuoqiang authored 11 years ago

98920360

Merge pull request #151 from russellcardullo/add-graphite-sink · 859d62dc

Matei Zaharia authored 11 years ago

Add graphite sink for metrics

This adds a metrics sink for graphite.  The sink must
be configured with the host and port of a graphite node
and optionally may be configured with a prefix that will
be prepended to all metrics that are sent to graphite.
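
To make the role of the prefix concrete, the sketch below formats one metric as a line of Graphite's plaintext protocol (metric path, value, Unix timestamp). It is a Python illustration only and does not reproduce the sink added in this pull request: `format_graphite_line` and the example metric names are hypothetical.

```python
import time


def format_graphite_line(name, value, prefix=None, timestamp=None):
    """Build one plaintext-protocol line: metric path, value, and Unix
    timestamp separated by spaces and terminated by a newline."""
    # The optional prefix is prepended to the metric path with a dot.
    path = f"{prefix}.{name}" if prefix else name
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"
```

With a prefix such as `spark.app1`, a metric named `jvm.heap.used` is reported under `spark.app1.jvm.heap.used`, which keeps metrics from several applications apart on a shared Graphite node.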

859d62dc

Merge pull request #185 from mkolod/random-number-generator · 65de73c7

Matei Zaharia authored 11 years ago

XORShift RNG with unit tests and benchmark

This patch was introduced to address SPARK-950 - the discussion below the ticket explains not only the rationale, but also the design and testing decisions: https://spark-project.atlassian.net/browse/SPARK-950

To run the unit tests, start the SBT console and type:

    compile
    test-only org.apache.spark.util.XORShiftRandomSuite

To run the benchmark, type:

    project core
    console

Once the Scala console starts, type:

    org.apache.spark.util.XORShiftRandom.benchmark(100000000)

XORShiftRandom is also an object with a main method taking the number of iterations as an argument, so you can also run it from the command line.
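
For readers unfamiliar with the technique, here is a minimal Python sketch of a 64-bit xorshift generator. It uses Marsaglia's (13, 7, 17) shift triple, which is an assumption for illustration and not necessarily the set of constants used by `XORShiftRandom`; the class name `XorShift64` and its methods are likewise hypothetical.

```python
MASK64 = (1 << 64) - 1


class XorShift64:
    """Marsaglia-style xorshift generator with 64 bits of state."""

    def __init__(self, seed):
        seed &= MASK64
        if seed == 0:
            # Zero is a fixed point: every shift-and-xor of 0 is still 0.
            raise ValueError("seed must be non-zero")
        self._state = seed

    def next_u64(self):
        x = self._state
        x ^= (x << 13) & MASK64  # mask emulates 64-bit overflow
        x ^= x >> 7
        x ^= (x << 17) & MASK64
        self._state = x
        return x

    def next_double(self):
        # Use the top 53 bits so the result is a float in [0, 1).
        return (self.next_u64() >> 11) / float(1 << 53)
```

The appeal of the approach is that each draw costs three shifts and three XORs, with no multiplication or locking.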

65de73c7

Merge pull request #197 from aarondav/patrick-fix · 972171b9

Reynold Xin authored 11 years ago

Fix 'timeWriting' stat for shuffle files

Due to concurrent git branches, changes from shuffle file consolidation patch
caused the shuffle write timing patch to no longer actually measure the time,
since it requires time be measured after the stream has been closed.

972171b9

Consolidated both mapPartitions related RDDs into a single MapPartitionsRDD. · e9ff13ec

Reynold Xin authored 11 years ago

Also changed the semantics of the index parameter in mapPartitionsWithIndex from the partition index of the output partition to the partition index in the current RDD.

e9ff13ec

Nov 23, 2013

Merge pull request #200 from mateiz/hash-fix · 718cc803

Reynold Xin authored 11 years ago

AppendOnlyMap fixes

- Chose a more random reshuffling step for values returned by Object.hashCode to avoid some long chaining that was happening for consecutive integers (e.g. `sc.makeRDD(1 to 100000000, 100).map(t => (t, t)).reduceByKey(_ + _).count`)
- Some other small optimizations throughout (see commit comments)

718cc803

Some other optimizations to AppendOnlyMap: · 9837a602

Matei Zaharia authored 11 years ago

- Don't check keys for equality when re-inserting due to growing the
  table; the keys will already be unique
- Remember the grow threshold instead of recomputing it on each insert

9837a602

Fixes to AppendOnlyMap: · 7535d7fb

Matei Zaharia authored 11 years ago

- Use Murmur Hash 3 finalization step to scramble the bits of HashCode
  instead of the simpler version in java.util.HashMap; the latter one
  had trouble with ranges of consecutive integers. Murmur Hash 3 is used
  by fastutil.
- Use Object.equals() instead of Scala's == to compare keys, because the
  latter does extra casts for numeric types (see the equals method in
  https://github.com/scala/scala/blob/master/src/library/scala/runtime/BoxesRunTime.java)

7535d7fb

Merge pull request #198 from ankurdave/zipPartitions-preservesPartitioning · 51aa9d6e

Reynold Xin authored 11 years ago

Support preservesPartitioning in RDD.zipPartitions

In `RDD.zipPartitions`, add support for a `preservesPartitioning` option (similar to `RDD.mapPartitions`) that reuses the first RDD's partitioner.

51aa9d6e

Support preservesPartitioning in RDD.zipPartitions · c1507afc
Ankur Dave authored 11 years ago

c1507afc

Nov 21, 2013

Fix 'timeWriting' stat for shuffle files · ccea38b7

Aaron Davidson authored 11 years ago

ccea38b7

Merge pull request #193 from aoiwelle/patch-1 · 086b097e

Reynold Xin authored 11 years ago

Fix Kryo Serializer buffer documentation inconsistency

The documentation here is inconsistent with the coded default and other documentation.

086b097e

Merge pull request #196 from pwendell/master · f20093c3

Reynold Xin authored 11 years ago

TimeTrackingOutputStream should pass on calls to close() and flush().

Without this fix you get a huge number of open files when running shuffles.
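
The failure mode is easy to reproduce with any wrapper stream: if the wrapper does not forward `close()`, the underlying file handle stays open. The Python sketch below shows a timing wrapper that forwards `write`, `flush`, and `close`; `TimeTrackingWriter` and `time_writing_ns` are hypothetical names, and whether Spark's class also times `flush` and `close` is not stated here.

```python
import io
import time


class TimeTrackingWriter:
    """Wraps a binary stream, accumulating the time spent in its calls."""

    def __init__(self, inner):
        self._inner = inner
        self.time_writing_ns = 0

    def _timed(self, fn, *args):
        start = time.perf_counter_ns()
        try:
            return fn(*args)
        finally:
            self.time_writing_ns += time.perf_counter_ns() - start

    def write(self, data):
        return self._timed(self._inner.write, data)

    def flush(self):
        # Forwarded: buffered bytes must reach the underlying stream.
        return self._timed(self._inner.flush)

    def close(self):
        # Forwarded: otherwise the underlying handle is never released.
        return self._timed(self._inner.close)
```

Reading `time_writing_ns` only after `close()` returns also captures any work the underlying stream defers until it is closed.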

f20093c3

Add YarnClientClusterScheduler and Backend. · ab3cefde

Raymond Liu authored 11 years ago

With this scheduler, the user application is launched locally,
while the executors are launched by YARN on remote nodes.

This enables spark-shell to run on YARN.

ab3cefde

TimeTrackingOutputStream should pass on calls to close() and flush(). · 53b94ef2
Patrick Wendell authored 11 years ago
```
Without this fix you get a huge number of open files after running
shuffles.
```
53b94ef2

Nov 20, 2013

Fix Kryo Serializer buffer inconsistency · 21b5478e
Neal Wiggins authored 11 years ago
```
The documentation here is inconsistent with the coded default and other documentation.
```
21b5478e

Merge branch 'master' of github.com:tbfenet/incubator-spark · 2fead510

Reynold Xin authored 11 years ago

PartitionPruningRDD is using index from parent

I was getting an ArrayIndexOutOfBoundsException after doing a union on a pruned RDD. The index it was using for the partition was the index in the original RDD, not the one in the new pruned RDD.

2fead510

Merge pull request #191 from hsaputra/removesemicolonscala · 4b895013

Matei Zaharia authored 11 years ago

Cleanup to remove semicolons (;) from Scala code

- The main reason for this PR is to remove semicolons from single statements of Scala code.
- Remove unused imports as I see them.
- Fix the ASF comment header in some files (bad copy-paste, I suppose).

4b895013

Make XORShiftRandom explicit in KMeans and roll it back for RDD · 22724659
Marek Kolodziej authored 11 years ago

22724659

Nov 19, 2013
Formatting and scoping (private[spark]) updates · bcc6ed30
Marek Kolodziej authored 11 years ago

bcc6ed30