- Dec 31, 2013
Hossein Falaki authored
- Dec 30, 2013
Hossein Falaki authored
Hossein Falaki authored
Hossein Falaki authored
Hossein Falaki authored
Hossein Falaki authored
Hossein Falaki authored
- Dec 24, 2013
Reynold Xin authored
Show full stack trace and time taken in unit tests.
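In an sbt + ScalaTest build this is usually just a reporter flag; a minimal sketch of that kind of setting (an assumption about the build, not a quote of Spark's actual build file):

```scala
// ScalaTest standard-output reporter flags: D = show test durations,
// F = show full stack traces on failure.
testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oDF")
```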
- Dec 23, 2013
Reynold Xin authored
Matei Zaharia authored
Refactored the streaming scheduler and added StreamingListener interface
- Refactored the streaming scheduler for cleaner code. Specifically, the JobManager was renamed to JobScheduler, as it does the actual scheduling of Spark jobs to the SparkContext. The earlier Scheduler was renamed to JobGenerator, as it actually generates the jobs from the DStreams. The JobScheduler starts the JobGenerator. Also moved all the scheduler-related code from spark.streaming to the spark.streaming.scheduler package.
- Implemented the StreamingListener interface, similar to SparkListener. The streaming version of StatusReportListener prints the batch processing time statistics (for now); a sketch of such a listener follows below. Added StreamingListenerSuite to test it.
- Refactored the streaming TestSuiteBase to dedupe code in the other streaming test suites.
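As a rough illustration of the listener interface described here, a batch-time listener might look like the sketch below. The class and method names follow the StreamingListener API as later released, so they may differ slightly from this initial version:

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Prints how long each completed batch spent processing, similar in spirit
// to the batch statistics listener mentioned above.
class BatchTimeListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    val delayMs = info.processingDelay.getOrElse(-1L)
    println(s"Batch ${info.batchTime} took $delayMs ms to process")
  }
}

// Registered on a StreamingContext before it is started, e.g.:
// ssc.addStreamingListener(new BatchTimeListener)
```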
Tathagata Das authored
Tathagata Das authored
Tathagata Das authored
Reynold Xin authored
Added SPARK-968 implementation for review
wangda.tan authored
- Dec 22, 2013
wangda.tan authored
- Dec 20, 2013
Tathagata Das authored
Patrick Wendell authored
Minor cleanup for standalone scheduler. See commit messages.
- Dec 19, 2013
Patrick Wendell authored
Track and report task result serialisation time.
- DirectTaskResult now has a ByteBuffer valueBytes instead of a T value.
- DirectTaskResult now has a member function T value() that deserialises valueBytes (see the sketch below).
- Executor serialises the value into a ByteBuffer and passes it to DTR's ctor.
- Executor tracks the time taken to do so and puts it in a new field in TaskMetrics.
- StagePage now reports serialisation time from TaskMetrics along with the other things it reported.
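The pattern being described here (hold the serialized bytes, deserialize lazily, and time the serialization) is roughly the following sketch. It is illustrative only: it uses plain Java serialization instead of Spark's serializer, and SerializedResult is a made-up name, not the real DirectTaskResult:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.nio.ByteBuffer

// Holds the task result as raw bytes, like the ByteBuffer valueBytes above.
class SerializedResult[T](val valueBytes: ByteBuffer) {
  // Deserializes valueBytes on demand, mirroring the T value() member function.
  def value(): T = {
    val bytes = new Array[Byte](valueBytes.remaining())
    valueBytes.duplicate().get(bytes)
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}

object SerializedResult {
  // Serializes the value and reports how long that took -- the quantity the
  // commit adds to TaskMetrics.
  def timed[T](value: T): (SerializedResult[T], Long) = {
    val start = System.currentTimeMillis()
    val bos = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bos)
    out.writeObject(value.asInstanceOf[AnyRef])
    out.close()
    (new SerializedResult[T](ByteBuffer.wrap(bos.toByteArray)), System.currentTimeMillis() - start)
  }
}
```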
Aaron Davidson authored
Aaron Davidson authored
As a lonely child with no one to care for it... we had to put it down.
Aaron Davidson authored
Reynold Xin authored
Add collectPartition to JavaRDD interface. This interface is useful for implementing `take` from other language frontends where the data is serialized. Also remove `takePartition` from PythonRDD and use `collectPartition` in rdd.py. Thanks @concretevitamin for the original change and tests.
Shivaram Venkataraman authored
Shivaram Venkataraman authored
Change the implementation to use runJob instead of PartitionPruningRDD. Also update the unit tests and the python take implementation to use the new interface.
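For reference, collecting just the requested partitions with runJob (instead of wrapping the RDD in a PartitionPruningRDD and calling collect) looks roughly like the sketch below. It uses the runJob overload found in current Spark; the 0.8-era signature also took an allowLocal flag, so treat this as an illustration rather than the actual patch:

```scala
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Runs a job restricted to the given partition ids and returns one array
// of elements per requested partition.
def collectPartitions[T: ClassTag](sc: SparkContext, rdd: RDD[T], partitionIds: Seq[Int]): Array[Array[T]] =
  sc.runJob(rdd, (iter: Iterator[T]) => iter.toArray, partitionIds)
```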
Matei Zaharia authored
Add toString to Java RDD, and __repr__ to Python RDD. Addresses [SPARK-992](https://spark-project.atlassian.net/browse/SPARK-992).
Nick Pentreath authored
Reynold Xin authored
[SPARK-959] Explicitly depend on org.eclipse.jetty.orbit jar. Without this, in some cases, Ivy attempts to download the wrong file and fails, stopping the whole build. See [bug](https://spark-project.atlassian.net/browse/SPARK-959) for more details. Note that this may not be the best solution, as I do not understand the root cause of why this only happens for some people. However, it is reported to work.
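The usual shape of this workaround in an sbt build is to declare the Orbit artifact explicitly and force its type to a plain jar; a hedged sketch is below (the version string is illustrative, not necessarily what SparkBuild.scala pins):

```scala
// Force Ivy to fetch the plain "jar" artifact for the Orbit-packaged servlet
// API instead of the "orbit" artifact type it would otherwise try to resolve.
libraryDependencies += "org.eclipse.jetty.orbit" % "javax.servlet" % "3.0.0.v201112011016" artifacts Artifact("javax.servlet", "jar", "jar")
```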
Tathagata Das authored
Aaron Davidson authored
Without this, in some cases, Ivy attempts to download the wrong file and fails, stopping the whole build. See bug for more details. (This is probably also the beginning of the slow death of our recently prettified dependencies. Form follows function.)
Reynold Xin authored
Increase spark.akka.askTimeout default to 30 seconds. In experimental clusters we've observed that a 10 second timeout was insufficient, despite having a low number of nodes and relatively small workload (16 nodes, <1.5 TB data). This would cause an entire job to fail at the beginning of the reduce phase. There is no particular reason for this value to be small, as a timeout should only occur in an exceptional situation. Also centralized the reading of spark.akka.askTimeout to AkkaUtils (surely this can later be cleaned up to use Typesafe). Finally, deleted some lurking implicits. If anyone can think of a reason they should still be there, please let me know.
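Like other Spark settings of this era, the timeout can still be overridden per application through a system property (value in seconds); a minimal example, assuming the 0.8-style configuration:

```scala
// Hypothetical override: raise the Akka ask timeout further for a very busy
// cluster. Must be set before the SparkContext is created.
System.setProperty("spark.akka.askTimeout", "60")
```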
- Dec 18, 2013
Aaron Davidson authored
In experimental clusters we've observed that a 10 second timeout was insufficient, despite having a low number of nodes and relatively small workload (16 nodes, <1.5 TB data). This would cause an entire job to fail at the beginning of the reduce phase. There is no particular reason for this value to be small, as a timeout should only occur in an exceptional situation. Also centralized the reading of spark.akka.askTimeout to AkkaUtils (surely this can later be cleaned up to use Typesafe). Finally, deleted some lurking implicits. If anyone can think of a reason they should still be there, please let me know.
Tathagata Das authored
Conflicts:
streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/dstream/ForEachDStream.scala
Reynold Xin authored
Fix Cygwin support in several scripts. This allows the spark-shell, spark-class, run-example, make-distribution.sh, and ./bin/start-* scripts to work under Cygwin. Note that this doesn't support PySpark under Cygwin, since that requires many additional `cygpath` calls from within Python and will be non-trivial to implement. This PR was inspired by, and subsumes, #253 (so close #253 after this is merged).
Tathagata Das authored
Reynold Xin authored
Fixed the example link in the Scala programming guide. The old link could not be accessed, so I changed it to the new one.
Shivaram Venkataraman authored
Also remove takePartition from PythonRDD and use collectPartition in rdd.py.
Tor Myklebust authored
fengdong authored
Reynold Xin authored
Fixed a performance problem in RDD.top and BoundedPriorityQueue. BoundedPriorityQueue was actually traversing the entire queue to calculate the size, resulting in bad insertion performance. This should also cherry-pick cleanly into branch-0.8.
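An illustrative sketch of the shape of such a fix (not the actual Spark class): back the bounded queue with java.util.PriorityQueue and answer size by delegation, so insertion no longer pays for a traversal just to check the bound:

```scala
import java.util.{PriorityQueue => JPriorityQueue}
import scala.collection.JavaConverters._

// A bounded priority queue that keeps the maxSize largest elements.
// size is O(1) because it delegates to the underlying queue rather than
// inheriting a size that walks the whole collection.
class BoundedPQ[A](maxSize: Int)(implicit ord: Ordering[A]) extends Iterable[A] {
  require(maxSize >= 1)
  private val underlying = new JPriorityQueue[A](maxSize, ord)

  override def size: Int = underlying.size

  def +=(elem: A): this.type = {
    if (underlying.size < maxSize) {
      underlying.offer(elem)
    } else if (ord.gt(elem, underlying.peek())) {
      // Evict the current minimum to make room for the larger element.
      underlying.poll()
      underlying.offer(elem)
    }
    this
  }

  override def iterator: Iterator[A] = underlying.iterator.asScala
}
```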