- Nov 21, 2013
-
-
dhardy92 authored
-
- Nov 19, 2013
-
-
Matei Zaharia authored
Correct number of tasks in ExecutorsUI: index `a` is not `execId` here.
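For illustration, the bug class here is using a loop index where the executor ID is the right key. Below is a minimal sketch of the corrected lookup pattern; `StorageStatus` and `taskCounts` are simplified stand-ins invented for this sketch, not the actual ExecutorsUI code:
```scala
import scala.collection.mutable

// Hypothetical stand-in for the UI's per-executor bookkeeping.
case class StorageStatus(execId: String)

object ExecutorTaskCountSketch {
  def main(args: Array[String]): Unit = {
    val statuses = Seq(StorageStatus("3"), StorageStatus("7"), StorageStatus("12"))
    val taskCounts = mutable.HashMap("3" -> 5, "7" -> 2, "12" -> 9)

    // Buggy pattern: `a` is just the position in the sequence, not the executor ID.
    val wrong = statuses.indices.map(a => taskCounts.getOrElse(a.toString, 0))

    // Fixed pattern: key the lookup by the executor's own ID.
    val right = statuses.map(s => taskCounts.getOrElse(s.execId, 0))

    println(s"buggy lookup: $wrong")   // all zeros: positions 0, 1, 2 are not executor IDs
    println(s"fixed lookup: $right")   // 5, 2, 9
  }
}
```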
-
Matei Zaharia authored
Improve Spark on YARN error handling. Improve CLI error handling and only allow a certain number of worker failures before failing the application. This will help prevent users' jobs from running forever after a foolish mistake, for instance using 32-bit Java while trying to allocate 8G containers. Without this change that loops forever; now it errors out after a certain number of retries, and the number of tries is configurable. Also increase the frequency at which we ping the RM, so we get containers faster if they die. The YARN MR app defaults to pinging the RM every second, so the default of 5 seconds here is fine, but that is configurable as well in case people want to change it. I do want to make sure there aren't any cases where calling stopExecutors in CoarseGrainedSchedulerBackend would cause problems; I couldn't think of any, and tested on a standalone cluster as well as YARN.
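For illustration only, the core of the change is a bounded failure budget instead of an endless re-request loop. The names below (`BoundedContainerAllocator`, `maxWorkerFailures`) are invented for this sketch and are not the actual YARN allocator classes:
```scala
import java.util.concurrent.atomic.AtomicInteger

// Illustrative only: a failure budget for lost YARN containers.
class BoundedContainerAllocator(maxWorkerFailures: Int) {
  private val failedWorkers = new AtomicInteger(0)

  // Called whenever the ResourceManager reports a failed container.
  def onContainerFailure(): Unit = {
    val failures = failedWorkers.incrementAndGet()
    if (failures > maxWorkerFailures) {
      // Instead of silently re-requesting containers forever (e.g. a 32-bit JVM
      // asked for 8G heaps), give up and fail the whole application.
      throw new IllegalStateException(
        s"Too many worker failures ($failures > $maxWorkerFailures), aborting application")
    }
  }
}

object BoundedContainerAllocator {
  def main(args: Array[String]): Unit = {
    val allocator = new BoundedContainerAllocator(maxWorkerFailures = 3)
    try {
      (1 to 5).foreach(_ => allocator.onContainerFailure())
    } catch {
      case e: IllegalStateException => println(e.getMessage)
    }
  }
}
```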
-
Matei Zaharia authored
Enable the Broadcast examples to work in a cluster setting. Since they rely on println to display results, we need to first collect those results to the driver to have them actually display locally. This issue came up on the mailing lists [here](http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3C2013111909591557147628%40ict.ac.cn%3E).
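A minimal sketch of the pattern being applied, using the standard Spark API rather than the exact example code:
```scala
import org.apache.spark.SparkContext

object BroadcastPrintSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "BroadcastPrintSketch")
    val data = sc.parallelize(1 to 5)

    // On a real cluster this would print on the executors, not the driver's console:
    // data.foreach(println)

    // Collect to the driver first so the output shows up locally:
    data.collect().foreach(println)

    sc.stop()
  }
}
```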
-
tgravescs authored
-
Aaron Davidson authored
Since they rely on println to display results, we need to first collect those results to the driver to have them actually display locally.
-
- Nov 17, 2013
-
-
shiyun.wxm authored
-
Reynold Xin authored
Slightly enhanced PrimitiveVector: 1. Added trim() method. 2. Added size method. 3. Renamed getUnderlyingArray to array. 4. Minor documentation update.
-
Reynold Xin authored
Add PrimitiveVectorSuite and fix bug in resize()
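For context, an append-only primitive vector of this kind is just a manually grown array. The toy sketch below (specialized to Long, invented for illustration, not the real generic PrimitiveVector) shows what trim(), size, array, and a careful resize() look like:
```scala
// Toy append-only vector over Long, illustrating trim/size/array/resize.
class ToyPrimitiveVector(initialCapacity: Int = 16) {
  private var _array = new Array[Long](initialCapacity)
  private var _size = 0

  def size: Int = _size
  def array: Array[Long] = _array          // exposes the backing array directly

  def +=(value: Long): Unit = {
    if (_size == _array.length) resize(_array.length * 2)
    _array(_size) = value
    _size += 1
  }

  // Shrink the backing array to exactly the elements in use.
  def trim(): ToyPrimitiveVector = { resize(_size); this }

  // The bug class a resize() fix guards against: copy only the elements in use,
  // and never copy past the new capacity.
  private def resize(newCapacity: Int): Unit = {
    val newArray = new Array[Long](newCapacity)
    System.arraycopy(_array, 0, newArray, 0, math.min(_size, newCapacity))
    _array = newArray
  }
}

object ToyPrimitiveVector {
  def main(args: Array[String]): Unit = {
    val v = new ToyPrimitiveVector(2)
    (1L to 5L).foreach(v += _)
    println(s"size=${v.size}, backing=${v.array.length}")  // size=5, backing=8
    v.trim()
    println(s"after trim, backing=${v.array.length}")      // 5
  }
}
```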
-
Aaron Davidson authored
-
Reynold Xin authored
-
BlackNiuza authored
-
Reynold Xin authored
1. Added trim() method. 2. Added size method. 3. Renamed getUnderlyingArray to array. 4. Minor documentation update.
-
BlackNiuza authored
-
- Nov 16, 2013
-
-
Matei Zaharia authored
Simple cleanup on Spark's Scala code while testing some modules:
- Remove some unused imports as I found them
- Remove ";" in import statements
- Remove "()" at the end of method calls like size that do not have side effects
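A tiny before/after illustration of the style rules being applied; this is editorial example code, not code from the patch:
```scala
// Before: trailing semicolon in an import, empty parens on a side-effect-free accessor.
// import scala.collection.mutable.ArrayBuffer;
// val n = buffer.size()

// After: no semicolon, and no parens when the call has no side effects.
import scala.collection.mutable.ArrayBuffer

object CleanupStyleSketch {
  def main(args: Array[String]): Unit = {
    val buffer = ArrayBuffer(1, 2, 3)
    val n = buffer.size   // accessor with no side effects, so no ()
    println(n)
  }
}
```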
-
- Nov 15, 2013
-
-
Henry Saputra authored
- Remove some unused imports as I found them
- Remove ";" in import statements
- Remove "()" at the end of method calls like size that do not have side effects
-
Matei Zaharia authored
Fix bug where scheduler could hang after task failure. When a task fails, we need to call reviveOffers() so that the task can be rescheduled on a different machine. In the current code, the state in ClusterTaskSetManager indicating which tasks are pending may be updated after reviveOffers() is called (there's a race condition here), so when reviveOffers() is called, the task set manager does not yet realize that there are failed tasks that need to be relaunched. This isn't currently unit tested but will be once my pull request for merging the cluster and local schedulers goes in, at which point many more of the unit tests will exercise the code paths through the cluster scheduler (currently the failure test suite uses the local scheduler, which is why we didn't see this bug before).
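Schematically, the fix is about ordering: record the failed task as pending again before asking for new offers. The sketch below uses hypothetical names (`TaskSetManagerSketch`, `handleFailedTask`) and is not the real ClusterScheduler code:
```scala
// Schematic only: the ordering that avoids the hang.
trait SchedulerBackendLike {
  def reviveOffers(): Unit
}

class TaskSetManagerSketch(backend: SchedulerBackendLike) {
  private var pendingTasks = List.empty[Long]

  def handleFailedTask(taskId: Long): Unit = {
    // 1. First update the bookkeeping so the task is visible as pending again.
    pendingTasks = taskId :: pendingTasks

    // 2. Only then trigger a new round of resource offers; if the order were
    //    reversed, the offer round could run before the task is marked pending
    //    and nothing would be relaunched.
    backend.reviveOffers()
  }

  def hasPendingTasks: Boolean = pendingTasks.nonEmpty
}

object TaskSetManagerSketch {
  def main(args: Array[String]): Unit = {
    val manager = new TaskSetManagerSketch(new SchedulerBackendLike {
      def reviveOffers(): Unit = println("offers revived")
    })
    manager.handleFailedTask(42L)
    println(s"pending tasks exist: ${manager.hasPendingTasks}")
  }
}
```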
-
- Nov 14, 2013
-
-
Matei Zaharia authored
Don't retry tasks when they fail due to a NotSerializableException. As with my previous pull request, this will be unit tested once the Cluster and Local schedulers get merged.
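Illustratively, the change amounts to treating a serialization failure as non-retryable, since it will fail identically on every attempt. The helper below is a made-up sketch, not the scheduler's actual retry logic:
```scala
import java.io.NotSerializableException

// Illustrative only: classify a task failure as retryable or not.
object TaskFailureSketch {
  def shouldRetry(failure: Throwable): Boolean = failure match {
    // Serialization problems are deterministic: retrying cannot help.
    case _: NotSerializableException => false
    case _                           => true
  }

  def main(args: Array[String]): Unit = {
    println(shouldRetry(new NotSerializableException("closure captured a non-serializable field"))) // false
    println(shouldRetry(new RuntimeException("executor lost")))                                      // true
  }
}
```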
-
Matei Zaharia authored
Write Spark UI URL to driver file on HDFS. This makes the SIMR code path simpler.
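Mechanically this is just writing a small text file through the Hadoop FileSystem API so clients can read the UI address back. A hedged sketch; the path and URL below are placeholders, not the actual SIMR convention:
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object WriteUiUrlSketch {
  // Write a single line (the driver's web UI address) to a file on HDFS.
  def writeUiUrl(uiUrl: String, hdfsPath: String): Unit = {
    val conf = new Configuration()
    val fs = FileSystem.get(new Path(hdfsPath).toUri, conf)
    val out = fs.create(new Path(hdfsPath), /* overwrite = */ true)
    try {
      out.write(uiUrl.getBytes("UTF-8"))
    } finally {
      out.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Placeholder values; the real path and UI port come from the job setup.
    writeUiUrl("http://driver-host:4040", "hdfs:///tmp/simr/driver-ui-url")
  }
}
```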
-
Kay Ousterhout authored
-
Kay Ousterhout authored
When a task fails, we need to call reviveOffers() so that the task can be rescheduled on a different machine. In the current code, the state in ClusterTaskSetManager indicating which tasks are pending may be updated after reviveOffers() is called (there's a race condition here), so when reviveOffers() is called, the task set manager does not yet realize that there are failed tasks that need to be relaunched.
-
Reynold Xin authored
Don't ignore spark.cores.max when using Mesos coarse mode. totalCoresAcquired is decremented but never incremented, causing Spark to effectively ignore spark.cores.max in coarse-grained Mesos mode.
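As a toy illustration of the balanced bookkeeping the fix restores (increment on accepted offers, decrement on lost executors); the class below is invented for this sketch and is not CoarseMesosSchedulerBackend:
```scala
// Toy accounting for a cores cap like spark.cores.max.
class CoreAccountingSketch(maxCores: Int) {
  private var totalCoresAcquired = 0

  // Called when we accept a resource offer: this increment was the missing half.
  def acceptOffer(offeredCores: Int): Int = {
    val toAcquire = math.min(offeredCores, maxCores - totalCoresAcquired)
    totalCoresAcquired += toAcquire
    toAcquire
  }

  // Called when an executor or task is lost.
  def releaseCores(cores: Int): Unit = {
    totalCoresAcquired -= cores
  }

  def acquired: Int = totalCoresAcquired
}

object CoreAccountingSketch {
  def main(args: Array[String]): Unit = {
    val acct = new CoreAccountingSketch(maxCores = 8)
    println(acct.acceptOffer(6))  // 6
    println(acct.acceptOffer(6))  // 2: the cap only holds if acquisitions are tracked
    acct.releaseCores(2)
    println(acct.acquired)        // 6
  }
}
```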
-
Reynold Xin authored
Fixed a scaladoc typo in HadoopRDD.scala
-
Reynold Xin authored
Fixed typos in the CDH4 distributions' version codes. Nothing important, but annoying when doing a copy/paste...
-
RIA-pierre-borckmans authored
-
Lian, Cheng authored
-
Kay Ousterhout authored
-
- Nov 13, 2013
-
-
Matei Zaharia authored
Migrate the daemon thread started by DAGScheduler to an Akka actor. `DAGScheduler` adopts an event queue and a daemon thread that polls it to process events sent to a `DAGScheduler`. This is a classic actor use case. By migrating this thread to an Akka actor, we may benefit from both cleaner code and better performance (the context-switching cost of an Akka actor is much lower than that of a native thread). But things become a little complicated when taking existing test code into consideration. Code in `DAGSchedulerSuite` is somewhat tightly coupled with `DAGScheduler`, and directly calls `DAGScheduler.processEvent` instead of posting event messages to `DAGScheduler`. To minimize code change, I chose to let the actor delegate messages to `processEvent`. Maybe this doesn't follow conventional actor usage, but I tried to make it apparently correct. Another tricky part is that, since `DAGScheduler` depends on the `ActorSystem` provided by its field `env`, `env` cannot be null. But the `dagScheduler` field created in `DAGSchedulerSuite.before` was given a null `env`. What's more, `BlockManager.blockIdsToBlockManagers` checks whether `env` is null to determine whether to run the production code or the test code (bad smell here, huh?). I went through all callers of `BlockManager.blockIdsToBlockManagers`, and made sure that if `env != null` holds, then `blockManagerMaster == null` must also hold. That's the logic behind `BlockManager.scala` [line 896](https://github.com/liancheng/incubator-spark/compare/dagscheduler-actor-refine?expand=1#diff-2b643ea78c1add0381754b1f47eec132L896). Finally, since `DAGScheduler` instances are always `start()`ed after creation, I removed the `start()` method and start the `eventProcessActor` within the constructor.
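Schematically, the change replaces a blocking event queue plus polling thread with an actor whose receive forwards each message to the existing handler. The sketch below assumes a recent Akka and uses placeholder event types; it is not the real DAGScheduler code:
```scala
import akka.actor.{Actor, ActorSystem, Props}

// Placeholder for DAGScheduler's internal event hierarchy.
sealed trait SchedulerEvent
case class JobSubmitted(jobId: Int) extends SchedulerEvent

// The actor simply delegates every event to the existing handler, replacing the
// daemon thread that used to poll a blocking event queue.
class EventProcessActor(processEvent: SchedulerEvent => Unit) extends Actor {
  def receive = {
    case event: SchedulerEvent => processEvent(event)
  }
}

object DagEventActorSketch {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("sketch")

    // Stand-in for the existing DAGScheduler.processEvent.
    def processEvent(event: SchedulerEvent): Unit =
      println(s"processing $event on ${Thread.currentThread().getName}")

    // Created eagerly, mirroring "start the actor in the constructor, drop start()".
    val eventProcessActor = system.actorOf(Props(new EventProcessActor(processEvent)))
    eventProcessActor ! JobSubmitted(1)

    Thread.sleep(500)
    system.terminate()
  }
}
```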
-
Matei Zaharia authored
spark-assembly.jar fails to authenticate with YARN ResourceManager. The META-INF/services/ sbt MergeStrategy was discarding support for Kerberos, among others. This pull request changes to a merge strategy similar to sbt-assembly's default. I've also included an update to sbt-assembly 0.9.2 and a minor fix to its zip file handling.
-
Ahir Reddy authored
-
Matei Zaharia authored
SIMR Backend Scheduler will now write Spark UI URL to HDFS, which is to be retrieved by SIMR clients.
-
- Nov 12, 2013
-
-
Matei Zaharia authored
Allow Spark on YARN to be run from HDFS. Allows the spark.jar, app.jar, and log4j.properties to be put into HDFS. Allows you to specify the files on a different HDFS cluster, and it will copy them over. It makes sure permissions are correct and puts things into the public distributed cache so they can be reused amongst users if their permissions are appropriate. Also adds a bit of error handling for missing arguments.
-
Matei Zaharia authored
Enable stopping and starting a spot cluster Clusters launched using `--spot-price` contain an on-demand master and spot slaves. Because EC2 does not support stopping spot instances, the spark-ec2 script previously could only destroy such clusters. This pull request makes it possible to stop and restart a spot cluster. * The `stop` command works as expected for a spot cluster: the master is stopped and the slaves are terminated. * To start a stopped spot cluster, the user must invoke `launch --use-existing-master`. This launches fresh spot slaves but resumes the existing master.
-
Matei Zaharia authored
Fix bug SPARK-923: fix column sort issue in the UI. https://spark-project.atlassian.net/browse/SPARK-923
Conflicts:
    core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala
    core/src/main/scala/org/apache/spark/ui/jobs/StageTable.scala
-
Ahir Reddy authored
SIMR Backend Scheduler will now write Spark UI URL to HDFS, which is to be retrieved by SIMR clients
-
Nathan Howell authored
-
Nathan Howell authored
sbt-assembly is set up to pick the first META-INF/services/org.apache.hadoop.security.SecurityInfo file instead of merging them. This causes Kerberos authentication to fail; it manifests itself in the "info:null" debug log statements:

    DEBUG SaslRpcClient: Get token info proto:interface org.apache.hadoop.yarn.api.ApplicationClientProtocolPB info:null
    DEBUG SaslRpcClient: Get kerberos info proto:interface org.apache.hadoop.yarn.api.ApplicationClientProtocolPB info:null
    ERROR UserGroupInformation: PriviledgedActionException as:foo@BAR (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
    DEBUG UserGroupInformation: PrivilegedAction as:foo@BAR (auth:KERBEROS) from:org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:583)
    WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
    ERROR UserGroupInformation: PriviledgedActionException as:foo@BAR (auth:KERBEROS) cause:java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]

This file previously contained just a single class:

    $ unzip -c assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-SNAPSHOT-hadoop2.2.0.jar META-INF/services/org.apache.hadoop.security.SecurityInfo
    Archive:  assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-SNAPSHOT-hadoop2.2.0.jar
      inflating: META-INF/services/org.apache.hadoop.security.SecurityInfo
    org.apache.hadoop.security.AnnotatedSecurityInfo

And now has the full list of classes:

    $ unzip -c assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-SNAPSHOT-hadoop2.2.0.jar META-INF/services/org.apache.hadoop.security.SecurityInfo
    Archive:  assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-SNAPSHOT-hadoop2.2.0.jar
      inflating: META-INF/services/org.apache.hadoop.security.SecurityInfo
    org.apache.hadoop.security.AnnotatedSecurityInfo
    org.apache.hadoop.mapreduce.v2.app.MRClientSecurityInfo
    org.apache.hadoop.mapreduce.v2.security.client.ClientHSSecurityInfo
    org.apache.hadoop.yarn.security.client.ClientRMSecurityInfo
    org.apache.hadoop.yarn.security.ContainerManagerSecurityInfo
    org.apache.hadoop.yarn.security.SchedulerSecurityInfo
    org.apache.hadoop.yarn.security.admin.AdminSecurityInfo
    org.apache.hadoop.yarn.server.RMNMSecurityInfoClass
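A sketch of the kind of sbt-assembly setting that addresses this, written against sbt-assembly 0.9.x's documented keys; the exact cases in Spark's build files may differ:
```scala
// Build-definition fragment (sbt 0.12/0.13 era, sbt-assembly 0.9.x).
import sbtassembly.Plugin._
import AssemblyKeys._

mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
  {
    // ServiceLoader registration files must be merged line by line, otherwise
    // only one SecurityInfo provider survives and Kerberos authentication breaks.
    case PathList("META-INF", "services", xs @ _*) => MergeStrategy.filterDistinctLines
    case x => old(x)
  }
}
```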
-
Matei Zaharia authored
Made block generator thread safe to fix Kafka bug. This is a very important bug fix: data could be, and was being, lost in the Kafka input due to this.
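A toy sketch of the synchronization involved: the thread appending records and the thread sealing blocks must lock the same object so nothing is dropped during the buffer swap. This is illustrative only, not the actual BlockGenerator:
```scala
import scala.collection.mutable.ArrayBuffer

// Toy block generator: one thread appends records, another periodically
// seals the current buffer into a "block". Both paths synchronize on `this`
// so no record is lost while the buffer is being swapped out.
class ToyBlockGenerator {
  private var currentBuffer = new ArrayBuffer[String]

  def addData(record: String): Unit = synchronized {
    currentBuffer += record
  }

  def sealBlock(): Seq[String] = synchronized {
    val block = currentBuffer
    currentBuffer = new ArrayBuffer[String]
    block
  }
}

object ToyBlockGenerator {
  def main(args: Array[String]): Unit = {
    val gen = new ToyBlockGenerator
    val producer = new Thread(new Runnable {
      def run(): Unit = (1 to 1000).foreach(i => gen.addData(s"record-$i"))
    })
    producer.start()
    producer.join()
    println(s"sealed block with ${gen.sealBlock().size} records")
  }
}
```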
-
Tathagata Das authored
-
- Nov 11, 2013
-
-
Ankur Dave authored
-