  1. Jan 27, 2014
• Merge pull request #466 from liyinan926/file-overwrite-new · 84670f27
      Reynold Xin authored
      Allow files added through SparkContext.addFile() to be overwritten
      
      This is useful for the cases when a file needs to be refreshed and downloaded by the executors periodically. For example, a possible use case is: the driver periodically renews a Hadoop delegation token and writes it to a token file. The token file needs to be downloaded by the executors whenever it gets renewed. However, the current implementation throws an exception when the target file exists and its contents do not match those of the new source. This PR adds an option to allow files to be overwritten to support use cases similar to the above.
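A minimal sketch of the intended usage, assuming the new option is the `spark.files.overwrite` configuration key introduced by this PR (app name and file path are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    // Allow a re-added file to replace the copy executors already downloaded.
    val conf = new SparkConf()
      .setAppName("TokenRefreshExample")
      .set("spark.files.overwrite", "true")
    val sc = new SparkContext(conf)

    // The driver can periodically rewrite this file and add it again; with the
    // option enabled, executors replace their cached copy instead of failing
    // when the contents differ.
    sc.addFile("/tmp/hadoop.token")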
• Merge pull request #516 from sarutak/master · 3d5c03e2
      Reynold Xin authored
modified SparkPluginBuild.scala to use https protocol for accessing github
      
We cannot build Spark behind a proxy, even when sbt is run with the -Dhttp(s).proxyHost, -Dhttp(s).proxyPort, -Dhttp(s).proxyUser, and -Dhttp(s).proxyPassword options.
This is because the git protocol is used to clone junit_xml_listener.git, and git:// connections do not go through the HTTP(S) proxy.
I could build successfully after modifying SparkPluginBuild.scala to use https instead.
      
      I reported this issue to JIRA.
      https://spark-project.atlassian.net/browse/SPARK-1046
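A sketch of the kind of change involved, assuming the plugin build definition has roughly this shape (not the exact diff):

    import sbt._

    object SparkPluginDef extends Build {
      lazy val root = Project("plugins", file(".")) dependsOn(junitXmlListener)
      // Before: uri("git://github.com/ijuma/junit_xml_listener.git"); the git
      // protocol ignores the -Dhttp(s).proxy* settings passed to sbt, so the
      // clone fails behind a proxy. An https URL goes through the proxy.
      lazy val junitXmlListener = uri("https://github.com/ijuma/junit_xml_listener.git")
    }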
• Merge pull request #490 from hsaputra/modify_checkoption_with_isdefined · f16c21e2
      Reynold Xin authored
      Replace the check for None Option with isDefined and isEmpty in Scala code
      
Propose to replace the Scala check for Option "!= None" with Option.isDefined, and "== None" with Option.isEmpty.

I think this, using an intention-revealing method call rather than an operator comparison against None, will make the Scala code easier to read and understand (see the sketch below).

Compiles and passes tests.
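For illustration, the style change looks like this (hypothetical variable):

    val maybePort: Option[Int] = Some(8080)

    // Before: comparing against None with an operator.
    if (maybePort != None) { /* use the port */ }
    if (maybePort == None) { /* fall back to a default */ }

    // After: intention-revealing method calls.
    if (maybePort.isDefined) { /* use the port */ }
    if (maybePort.isEmpty) { /* fall back to a default */ }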
• Merge pull request #460 from srowen/RandomInitialALSVectors · f67ce3e2
      Sean Owen authored
      Choose initial user/item vectors uniformly on the unit sphere
      
      ...rather than within the unit square to possibly avoid bias in the initial state and improve convergence.
      
The current implementation picks the N vector elements uniformly at random from [0,1). This means they all point into one quadrant of the vector space. As N gets even a little large, the vectors tend strongly to point into the "corner", towards (1,1,...,1). The vectors are not unit vectors either.
      
I suggest choosing the elements as Gaussian ~ N(0,1) and normalizing. This yields choices distributed uniformly on the unit sphere, which is more what's of interest here. It has worked a little better for me in the past. (A minimal sketch appears after the commit list below.)
      
This is pretty minor, but I wanted to warm up by suggesting a few tweaks to ALS.
Please excuse my Scala; I'm pretty new to it.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      == Merge branch commits ==
      
      commit 492b13a7469e5a4ed7591ee8e56d8bd7570dfab6
      Author: Sean Owen <sowen@cloudera.com>
      Date:   Mon Jan 27 08:05:25 2014 +0000
      
          Style: spaces around binary operators
      
      commit ce2b5b5a4fefa0356875701f668f01f02ba4d87e
      Author: Sean Owen <sowen@cloudera.com>
      Date:   Sun Jan 19 22:50:03 2014 +0000
      
          Generate factors with all positive components, per discussion in https://github.com/apache/incubator-spark/pull/460
      
      commit b6f7a8a61643a8209e8bc662e8e81f2d15c710c7
      Author: Sean Owen <sowen@cloudera.com>
      Date:   Sat Jan 18 15:54:42 2014 +0000
      
          Choose initial user/item vectors uniformly on the unit sphere rather than within the unit square to possibly avoid bias in the initial state and improve convergence
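A minimal sketch of the sampling scheme described above (hypothetical helper, not the PR's exact code):

    import java.util.Random

    // Draw i.i.d. Gaussian components and normalize: the direction of such a
    // vector is uniformly distributed on the unit sphere. Wrapping
    // nextGaussian() in math.abs would keep all components positive, as the
    // merged version does per the commit list above.
    def randomUnitVector(rank: Int, rand: Random): Array[Double] = {
      val v = Array.fill(rank)(rand.nextGaussian())
      val norm = math.sqrt(v.map(x => x * x).sum)
      v.map(_ / norm)
    }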
  2. Jan 26, 2014
  3. Jan 25, 2014
• Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040) · 740e865f
      Josh Rosen authored
      This fixes an issue where collectAsMap() could
      fail when called on a JavaPairRDD that was derived
      by transforming a non-JavaPairRDD.
      
      The root problem was that we were creating the
      JavaPairRDD's ClassTag by casting a
      ClassTag[AnyRef] to a ClassTag[Tuple2[K2, V2]].
      To fix this, I cast a ClassTag[Tuple2[_, _]]
      instead, since this actually produces a ClassTag
      of the appropriate type because ClassTags don't
      capture type parameters:
      
      scala> implicitly[ClassTag[Tuple2[_, _]]] == implicitly[ClassTag[Tuple2[Int, Int]]]
      res8: Boolean = true
      
      scala> implicitly[ClassTag[AnyRef]].asInstanceOf[ClassTag[Tuple2[Int, Int]]] == implicitly[ClassTag[Tuple2[Int, Int]]]
      res9: Boolean = false
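A sketch of the core idea (hypothetical helper name):

    import scala.reflect.{ClassTag, classTag}

    // Because ClassTags don't capture type parameters, a tag created for
    // Tuple2[_, _] is a correct tag for any Tuple2[K2, V2]: its runtimeClass
    // is classOf[Tuple2[_, _]], which is what array creation needs.
    def tupleClassTag[K2, V2]: ClassTag[(K2, V2)] =
      classTag[Tuple2[_, _]].asInstanceOf[ClassTag[(K2, V2)]]

    // A ClassTag[AnyRef] cast the same way reports java.lang.Object instead,
    // which later causes the ClassCastException in collectAsMap().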
• Increase JUnit test verbosity under SBT. · 531d9d75
      Josh Rosen authored
      Upgrade junit-interface plugin from 0.9 to 0.10.
      
      I noticed that the JavaAPISuite tests didn't
      appear to display any output locally or under
      Jenkins, making it difficult to know whether they
      were running.  This change increases the verbosity
      to more closely match the ScalaTest tests.
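In sbt terms, the change is along these lines (assumed form for a build.sbt; per the junit-interface docs, -v logs test names as they run and -a shows assertion stack traces):

    // Upgrade the JUnit adapter and bring its output closer to ScalaTest's.
    libraryDependencies += "com.novocode" % "junit-interface" % "0.10" % "test"
    testOptions += Tests.Argument(TestFrameworks.JUnit, "-v", "-a")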
  4. Jan 23, 2014
  5. Jan 22, 2014
• Merge pull request #496 from pwendell/master · a1cd1851
      Patrick Wendell authored
      Fix bug in worker clean-up in UI
      
      Introduced in d5a96fec (/cc @aarondav).
      
      This should be picked into 0.8 and 0.9 as well. The bug causes old (zombie) workers on a node to not disappear immediately from the UI when a new one registers.
• Merge pull request #447 from CodingCat/SPARK-1027 · 034dce2a
      Patrick Wendell authored
      fix for SPARK-1027
      
      fix for SPARK-1027  (https://spark-project.atlassian.net/browse/SPARK-1027)
      
      FIXES
      
1. Change sparkHome from String to Option[String] in ApplicationDesc (see the sketch below)

2. Remove the sparkHome parameter from the LaunchExecutor message

3. Adjust the involved files
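A sketch of fix 1 (assumed reduced shape of the class; other fields elided):

    // None now means "fall back to the worker's own Spark home" instead of
    // requiring the driver to know the executor-side path.
    case class ApplicationDesc(
        name: String,
        sparkHome: Option[String]) // was: sparkHome: String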
• Fix bug in worker clean-up in UI · 62855131
      Patrick Wendell authored
      Introduced in d5a96fec. This should be picked into 0.8 and 0.9 as well.
• refactor sparkHome to val · 2b3c4614
      CodingCat authored
      clean code
• Merge pull request #495 from srowen/GraphXCommonsMathDependency · 3184facd
      Patrick Wendell authored
      Fix graphx Commons Math dependency
      
`graphx` depends on Commons Math (2.x) in `SVDPlusPlus.scala`, but the module doesn't declare this dependency. It happens to work only because Commons Math is pulled in transitively by the Hadoop artifacts. That stopped being true as of a month or so ago: building against recent Hadoop fails. (That's how we noticed.)
      
The simple fix is to declare the dependency, as it should be. But it's also worth noting that `commons-math` is the old-ish 2.x line, while `commons-math3` is where the newer 3.x releases live. It's a drop-in replacement, just a different artifact and package name. Changing this single usage to `commons-math3` works and tests pass, which isn't surprising, so it is probably also worth changing. (A comment in some test code also references `commons-math3`, FWIW.)
      
      It does raise another question though: `mllib` looks like it uses the `jblas` `DoubleMatrix` for general purpose vector/matrix stuff. Should `graphx` really use Commons Math for this? Beyond the tiny scope here but worth asking.
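The fix, expressed as an sbt dependency (version assumed; the Maven build needs a matching <dependency> entry):

    // Declare the dependency explicitly instead of relying on Hadoop's
    // transitive copy, and move to the maintained 3.x line.
    libraryDependencies += "org.apache.commons" % "commons-math3" % "3.2"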
• 4476398f
• Merge pull request #492 from skicavs/master · a1238bb5
      Patrick Wendell authored
      fixed job name and usage information for the JavaSparkPi example
• Depend on Commons Math explicitly instead of accidentally getting it from... · fd0c5b8c
      Sean Owen authored
      Depend on Commons Math explicitly instead of accidentally getting it from Hadoop (which stops working in 2.2.x) and also use the newer commons-math3
• Merge pull request #478 from sryza/sandy-spark-1033 · 576c4a4c
      Patrick Wendell authored
      SPARK-1033. Ask for cores in Yarn container requests
      
      Tested on a pseudo-distributed cluster against the Fair Scheduler and observed a worker taking more than a single core.
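Roughly what asking for cores looks like against the YARN records API (assumed shape, not the PR's exact code; values are hypothetical):

    import org.apache.hadoop.yarn.api.records.Resource
    import org.apache.hadoop.yarn.util.Records

    val executorMemory = 1024 // MB
    val executorCores = 2

    // Request CPU capacity alongside memory so schedulers that account for
    // cores (e.g. the Fair Scheduler) can enforce them.
    val capability = Records.newRecord(classOf[Resource])
    capability.setMemory(executorMemory)
    capability.setVirtualCores(executorCores)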
• Merge pull request #493 from kayousterhout/double_add · 5bcfd798
      Matei Zaharia authored
      Fixed bug where task set managers are added to queue twice
      
      @mateiz can you verify that this is a bug and wasn't intentional? (https://github.com/apache/incubator-spark/commit/90a04dab8d9a2a9a372cea7cdf46cc0fd0f2f76c#diff-7fa4f84a961750c374f2120ca70e96edR551)
      
      This bug leads to a small performance hit because task
      set managers will get offered each rejected resource
      offer twice, but doesn't lead to any incorrect functionality.
      
      Thanks to @hdc1112 for pointing this out.
• Merge pull request #315 from rezazadeh/sparsesvd · d009b17d
      Matei Zaharia authored
      Sparse SVD
      
      # Singular Value Decomposition
      Given an *m x n* matrix *A*, compute matrices *U, S, V* such that
      
      *A = U * S * V^T*
      
      There is no restriction on m, but we require n^2 doubles to fit in memory.
      Further, n should be less than m.
      
The decomposition is computed by first forming *A^T A = V S^2 V^T*,
computing the SVD locally on that (since *n x n* is small),
from which we recover *S* and *V*.
Then we compute *U* via a simple matrix multiplication
as *U = A * V * S^-1*.
      
Only the singular vectors associated with the largest k singular values are recovered.
If there are k such values, then the dimensions of the return will be:
      
* *S* is *k x k* and diagonal, holding the singular values on the diagonal.
* *U* is *m x k* and satisfies *U^T U = eye(k)*.
* *V* is *n x k* and satisfies *V^T V = eye(k)*.
      
All input and output is expected in sparse matrix format: 0-indexed
tuples of the form ((i,j), value), all in RDDs.
      
      # Testing
      Tests included. They test:
      - Decomposition promise (A = USV^T)
      - For small matrices, output is compared to that of jblas
      - Rank 1 matrix test included
      - Full Rank matrix test included
      - Middle-rank matrix forced via k included
      
      # Example Usage
      
      import org.apache.spark.SparkContext
      import org.apache.spark.mllib.linalg.SVD
      import org.apache.spark.mllib.linalg.SparseMatrix
import org.apache.spark.mllib.linalg.MatrixEntry
      
      // Load and parse the data file
      val data = sc.textFile("mllib/data/als/test.data").map { line =>
            val parts = line.split(',')
            MatrixEntry(parts(0).toInt, parts(1).toInt, parts(2).toDouble)
      }
      val m = 4
      val n = 4
      
      // recover top 1 singular vector
      val decomposed = SVD.sparseSVD(SparseMatrix(data, m, n), 1)
      
      println("singular values = " + decomposed.S.data.toArray.mkString)
      
      # Documentation
      Added to docs/mllib-guide.md
• Fixed bug where task set managers are added to queue twice · 19da82c5
      Kay Ousterhout authored
      This bug leads to a small performance hit because task
      set managers will get offered each rejected resource
      offer twice, but doesn't lead to any incorrect functionality.