Commits · bd61f07039064833108070e19b752d4c46045766 · cs525-sp18-g07 / spark

May 01, 2015

[SPARK-5854] personalized page rank · 7d427222

Dan McClary authored 9 years ago

Here's a modification to PageRank which does personalized PageRank. The approach is basically similar to that outlined by Bahmani et al. from 2010 (http://arxiv.org/pdf/1006.2880.pdf).
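
For readers unfamiliar with the idea, here is a minimal plain-Scala sketch of personalized PageRank by power iteration. It is not the code from this commit and does not use GraphX; the object name, the parameter names, and the choice to drop rank at dangling vertices rather than redistribute it are all assumptions made for illustration. The only difference from ordinary PageRank is that the reset (teleport) mass returns to a single source vertex instead of being spread uniformly.

```scala
object PersonalizedPageRankSketch {
  /** Power iteration on an edge list; vertices are numbered 0 until numVertices.
    * `resetProb` is the probability of jumping back to `source`. */
  def run(
      edges: Seq[(Int, Int)],
      numVertices: Int,
      source: Int,
      resetProb: Double = 0.15,
      numIter: Int = 50): Array[Double] = {
    val outDegree = Array.fill(numVertices)(0)
    edges.foreach { case (src, _) => outDegree(src) += 1 }

    // Start with all mass on the source vertex.
    var rank = Array.tabulate(numVertices)(v => if (v == source) 1.0 else 0.0)
    for (_ <- 0 until numIter) {
      val next = Array.fill(numVertices)(0.0)
      // Each vertex spreads its damped rank evenly over its out-edges.
      edges.foreach { case (src, dst) =>
        next(dst) += (1.0 - resetProb) * rank(src) / outDegree(src)
      }
      // Reset mass goes to the source only (simplification: scores are unnormalized).
      next(source) += resetProb
      rank = next
    }
    rank
  }
}
```

On the chain 0 -> 1 -> 2 with source 0, the scores decay with distance from the source, and any vertex that cannot be reached from the source ends at zero.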

I'm sure this needs tuning up or other considerations, so let me know how I can improve this.

Author: Dan McClary <dan.mcclary@gmail.com>
Author: dwmclary <dan.mcclary@gmail.com>

Closes #4774 from dwmclary/SPARK-5854-Personalized-PageRank and squashes the following commits:

8b907db [dwmclary] fixed scalastyle errors in PageRankSuite
2c20e5d [dwmclary] merged with upstream master
d6cebac [dwmclary] updated as per style requests
7d00c23 [Dan McClary] fixed line overrun in personalizedVertexPageRank
d711677 [Dan McClary] updated vertexProgram to restore binary compatibility for inner method
bb8d507 [Dan McClary] Merge branch 'master' of https://github.com/apache/spark into SPARK-5854-Personalized-PageRank
fba0edd [Dan McClary] fixed silly mistakes
de51be2 [Dan McClary] cleaned up whitespace between comments and methods
0c30d0c [Dan McClary] updated to maintain binary compatibility
aaf0b4b [Dan McClary] Merge branch 'master' of https://github.com/apache/spark into SPARK-5854-Personalized-PageRank
76773f6 [Dan McClary] Merge branch 'master' of https://github.com/apache/spark into SPARK-5854-Personalized-PageRank
44ada8e [Dan McClary] updated tolerance on chain PPR
1ffed95 [Dan McClary] updated tolerance on chain PPR
b67ac69 [Dan McClary] updated tolerance on chain PPR
a560942 [Dan McClary] rolled PPR into pregel code for PageRank
6dc2c29 [Dan McClary] initial implementation of personalized page rank

Apr 11, 2015

SPARK-6710 GraphX Fixed Wrong initial bias in GraphX SVDPlusPlus · 1205f7ea

Michael Malak authored 9 years ago

Author: Michael Malak <michaelmalak@yahoo.com>

Closes #5464 from michaelmalak/master and squashes the following commits:

9d942ba [Michael Malak] SPARK-6710 GraphX Fixed Wrong initial bias in GraphX SVDPlusPlus

Apr 09, 2015

[SPARK-6758]block the right jetty package in log · 7d92db34

WangTaoTheTonic authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-6758

I am not sure if it is ok to block them in test resources too (as we shade jetty in assembly?).

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #5406 from WangTaoTheTonic/SPARK-6758 and squashes the following commits:

e09605b [WangTaoTheTonic] block the right jetty package

Apr 08, 2015

[SPARK-6765] Fix test code style for graphx. · 8d812f99

Reynold Xin authored 9 years ago

So we can turn style checker on for test code.

Author: Reynold Xin <rxin@databricks.com>

Closes #5410 from rxin/test-style-graphx and squashes the following commits:

89e253a [Reynold Xin] [SPARK-6765] Fix test code style for graphx.

Apr 07, 2015

[SPARK-6736][GraphX][Doc]Example of Graph#aggregateMessages has error · ae980eb4

Sasaki Toru authored 9 years ago

Example of Graph#aggregateMessages has error.
Since aggregateMessages is a method of Graph, it should be written "rawGraph.aggregateMessages".

Author: Sasaki Toru <sasakitoa@nttdata.co.jp>

Closes #5388 from sasakitoa/aggregateMessagesExample and squashes the following commits:

b1d631b [Sasaki Toru] Example of Graph#aggregateMessages has error

Apr 03, 2015

[SPARK-6428] Turn on explicit type checking for public methods. · 82701ee2

Reynold Xin authored 10 years ago

This builds on my earlier pull requests and turns on the explicit type checking in scalastyle.

Author: Reynold Xin <rxin@databricks.com>

Closes #5342 from rxin/SPARK-6428 and squashes the following commits:

7b531ab [Reynold Xin] import ordering
2d9a8a5 [Reynold Xin] jl
e668b1c [Reynold Xin] override
9b9e119 [Reynold Xin] Parenthesis.
82e0cf5 [Reynold Xin] [SPARK-6428] Turn on explicit type checking for public methods.

Mar 26, 2015

[SPARK-6510][GraphX]: Add Graph#minus method to act as Set#difference · 39fb5796

Brennon York authored 10 years ago

Adds a `Graph#minus` method which will return only unique `VertexId`'s from the calling `VertexRDD`.

To demonstrate a basic example with pseudocode:

```
Set((0L,0),(1L,1)).minus(Set((1L,1),(2L,2)))
> Set((0L,0))
```
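
The pseudocode above can be made runnable with ordinary Scala maps. This is only a sketch of the difference-by-vertex-id semantics described here; `VertexMinusSketch` and its `minus` helper are hypothetical names and not part of GraphX.

```scala
object VertexMinusSketch {
  /** Keeps the entries of `self` whose vertex id does not appear in `other`. */
  def minus[VD](self: Map[Long, VD], other: Map[Long, VD]): Map[Long, VD] =
    self.filter { case (id, _) => !other.contains(id) }
}
```

Only the ids are compared, so a vertex present in both collections is dropped even when its attribute differs between them.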

Author: Brennon York <brennon.york@capitalone.com>

Closes #5175 from brennonyork/SPARK-6510 and squashes the following commits:

248d5c8 [Brennon York] added minus(VertexRDD[VD]) method to avoid createUsingIndex and updated the mask operations to simplify with andNot call
3fb7cce [Brennon York] updated graphx doc to reflect the addition of minus method
6575d92 [Brennon York] updated mima exclude
aaa030b [Brennon York] completed graph#minus functionality
7227c0f [Brennon York] beginning work on minus functionality

Mar 22, 2015

[HOTFIX] Build break due to https://github.com/apache/spark/pull/5128 · 7a0da477

Reynold Xin authored 10 years ago

[SPARK-6455] [docs] Correct some mistakes and typos · ab4f516f

Hangchen Yu authored 10 years ago

Correct some typos. Correct a mistake in lib/PageRank.scala. The first PageRank implementation uses standalone Graph interface, but the second uses Pregel interface. It may mislead the code viewers.

Author: Hangchen Yu <yuhc@gitcafe.com>

Closes #5128 from yuhc/master and squashes the following commits:

53e5432 [Hangchen Yu] Merge branch 'master' of https://github.com/yuhc/spark
67b77b5 [Hangchen Yu] [SPARK-6455] [docs] Correct some mistakes and typos
206f2dc [Hangchen Yu] Correct some mistakes and typos.

Mar 20, 2015

[SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT. · a7456459

Marcelo Vanzin authored 10 years ago

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5056 from vanzin/SPARK-6371 and squashes the following commits:

63220df [Marcelo Vanzin] Merge branch 'master' into SPARK-6371
6506f75 [Marcelo Vanzin] Use more fine-grained exclusion.
178ba71 [Marcelo Vanzin] Oops.
75b2375 [Marcelo Vanzin] Exclude VertexRDD in MiMA.
a45a62c [Marcelo Vanzin] Work around MIMA warning.
1d8a670 [Marcelo Vanzin] Re-group jetty exclusion.
0e8e909 [Marcelo Vanzin] Ignore ml, don't ignore graphx.
cef4603 [Marcelo Vanzin] Indentation.
296cf82 [Marcelo Vanzin] [SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT.

SPARK-6338 [CORE] Use standard temp dir mechanisms in tests to avoid orphaned temp files · 6f80c3e8

Sean Owen authored 10 years ago

Use `Utils.createTempDir()` to replace other temp file mechanisms used in some tests, to further ensure they are cleaned up, and simplify

Author: Sean Owen <sowen@cloudera.com>

Closes #5029 from srowen/SPARK-6338 and squashes the following commits:

27b740a [Sean Owen] Fix hive-thriftserver tests that don't expect an existing dir
4a212fa [Sean Owen] Standardize a bit more temp dir management
9004081 [Sean Owen] Revert some added recursive-delete calls
57609e4 [Sean Owen] Use Utils.createTempDir() to replace other temp file mechanisms used in some tests, to further ensure they are cleaned up, and simplify

Mar 17, 2015

[SPARK-6357][GraphX] Add unapply in EdgeContext · b3e6eca8

Takeshi YAMAMURO authored 10 years ago

This extractor is mainly used for Graph#aggregateMessages*.
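
As a rough illustration of what an `unapply` extractor buys callers, the sketch below defines one on a hypothetical `Ctx` class. `Ctx` is a stand-in invented for this example and has far fewer fields than the real EdgeContext; only the pattern-matching mechanism is the point.

```scala
object ExtractorSketch {
  // Hypothetical stand-in for an edge context; not the GraphX class.
  class Ctx(val srcId: Long, val dstId: Long, val attr: Double)

  object Ctx {
    // Defining unapply lets callers destructure a Ctx inside a pattern.
    def unapply(c: Ctx): Some[(Long, Long, Double)] =
      Some((c.srcId, c.dstId, c.attr))
  }

  /** Uses the extractor to bind all three fields in a single pattern. */
  def describe(c: Ctx): String = c match {
    case Ctx(src, dst, weight) => s"$src -> $dst ($weight)"
  }
}
```

With an extractor in place, a function passed to a message-sending API could be written as a `case` pattern instead of reading each field off the context by name.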

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #5047 from maropu/AddUnapplyInEdgeContext and squashes the following commits:

87e04df [Takeshi YAMAMURO] Add unapply in EdgeContext

Mar 16, 2015

[SPARK-5922][GraphX]: Add diff(other: RDD[VertexId, VD]) in VertexRDD · 45f4c661

Brennon York authored 10 years ago

Changed method invocation of 'diff' to match that of 'innerJoin' and 'leftJoin' from VertexRDD[VD] to RDD[(VertexId, VD)]. This change maintains backwards compatibility and better unifies the VertexRDD methods to match each other.

Author: Brennon York <brennon.york@capitalone.com>

Closes #4733 from brennonyork/SPARK-5922 and squashes the following commits:

e800f08 [Brennon York] fixed merge conflicts
b9274af [Brennon York] fixed merge conflicts
f86375c [Brennon York] fixed minor include line
398ddb4 [Brennon York] fixed merge conflicts
aac1810 [Brennon York] updated to aggregateUsingIndex and added test to ensure that method works properly
2af0b88 [Brennon York] removed deprecation line
753c963 [Brennon York] fixed merge conflicts and set preference to use the diff(other: VertexRDD[VD]) method
2c678c6 [Brennon York] added mima exclude to exclude new public diff method from VertexRDD
93186f3 [Brennon York] added back the original diff method to sustain binary compatibility
f18356e [Brennon York] changed method invocation of 'diff' to match that of 'innerJoin' and 'leftJoin' from VertexRDD[VD] to RDD[(VertexId, VD)]

Mar 14, 2015

[SPARK-5790][GraphX]: VertexRDD's won't zip properly for `diff` capability (added tests) · c49d1566

Brennon York authored 10 years ago

Added tests that maropu [created](https://github.com/maropu/spark/blob/1f64794b2ce33e64f340e383d4e8a60639a7eb4b/graphx/src/test/scala/org/apache/spark/graphx/VertexRDDSuite.scala) for vertices with differing partition counts. Wanted to make sure his work got captured/merged, as it's not in the master branch and I don't believe there's a PR out already for it.

Author: Brennon York <brennon.york@capitalone.com>

Closes #5023 from brennonyork/SPARK-5790 and squashes the following commits:

83bbd29 [Brennon York] added maropu's tests for vertices with differing partition counts

Mar 13, 2015

[SPARK-4600][GraphX]: org.apache.spark.graphx.VertexRDD.diff does not work · b943f5d9

Brennon York authored 10 years ago

Turns out, per the [convo on the JIRA](https://issues.apache.org/jira/browse/SPARK-4600), `diff` is acting exactly as it should. The confusion arose because I thought it meant set difference, when in fact it does not. So I merely updated the `diff` documentation to, hopefully, better reflect its true intent going forward.
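
To make the distinction concrete, here is a plain-Scala sketch contrasting the two behaviours on ordinary maps. It assumes, as I understand the documented semantics, that `diff` reports vertices present on both sides whose values differ and keeps the value from the other collection; the helper names are hypothetical and nothing here calls GraphX.

```scala
object DiffSemanticsSketch {
  /** Vertices present in both maps whose values differ, with the value from `other`. */
  def diff[VD](self: Map[Long, VD], other: Map[Long, VD]): Map[Long, VD] =
    other.filter { case (id, v) => self.get(id).exists(_ != v) }

  /** What set difference would return instead: ids only in `self`. */
  def setDifference[VD](self: Map[Long, VD], other: Map[Long, VD]): Map[Long, VD] =
    self.filter { case (id, _) => !other.contains(id) }
}
```

For `self = {1 -> "a", 2 -> "b"}` and `other = {2 -> "c", 3 -> "d"}`, the first helper reports only vertex 2 with its new value, while the second reports only vertex 1.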

Author: Brennon York <brennon.york@capitalone.com>

Closes #5015 from brennonyork/SPARK-4600 and squashes the following commits:

1e1d1e5 [Brennon York] reverted internal diff docs
92288f7 [Brennon York] reverted both the test suite and the diff function back to its origin functionality
f428623 [Brennon York] updated diff documentation to better represent its function
cc16d65 [Brennon York] Merge remote-tracking branch 'upstream/master' into SPARK-4600
66818b9 [Brennon York] added small secondary diff test
99ad412 [Brennon York] Merge remote-tracking branch 'upstream/master' into SPARK-4600
74b8c95 [Brennon York] corrected method by leveraging bitmask operations to correctly return only the portions of that are different from the calling VertexRDD
9717120 [Brennon York] updated diff impl to cause fewer objects to be created
710a21c [Brennon York] working diff given test case
aa57f83 [Brennon York] updated to set ShortestPaths to run 'forward' rather than 'backward'

Mar 12, 2015

[SPARK-5814][MLLIB][GRAPHX] Remove JBLAS from runtime · 0cba802a

Xiangrui Meng authored 10 years ago

The issue is discussed in https://issues.apache.org/jira/browse/SPARK-5669. Replacing all JBLAS usage by netlib-java gives us a simpler dependency tree and fewer license issues to worry about. I didn't touch the test scope in this PR. The user guide is not modified to avoid merge conflicts with branch-1.3. srowen ankurdave pwendell

Author: Xiangrui Meng <meng@databricks.com>

Closes #4699 from mengxr/SPARK-5814 and squashes the following commits:

48635c6 [Xiangrui Meng] move netlib-java version to parent pom
ca21c74 [Xiangrui Meng] remove jblas from ml-guide
5f7767a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5814
c5c4183 [Xiangrui Meng] merge master
0f20cad [Xiangrui Meng] add mima excludes
e53e9f4 [Xiangrui Meng] remove jblas from mllib runtime
ceaa14d [Xiangrui Meng] replace jblas by netlib-java in graphx
fa7c2ca [Xiangrui Meng] move jblas to test scope

Mar 05, 2015

SPARK-6182 [BUILD] spark-parent pom needs to be published for both 2.10 and 2.11 · c9cfba0c

Sean Owen authored 10 years ago

Option 1 of 2: Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11

Author: Sean Owen <sowen@cloudera.com>

Closes #4912 from srowen/SPARK-6182.1 and squashes the following commits:

eff60de [Sean Owen] Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11

Mar 02, 2015

[SPARK-6103][Graphx]remove unused class to import in EdgeRDDImpl · 49c7a8f6

Lianhui Wang authored 10 years ago

Class TaskContext is unused in EdgeRDDImpl, so we need to remove it from import list.

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #4846 from lianhuiwang/SPARK-6103 and squashes the following commits:

31aed64 [Lianhui Wang] remove unused class to import in EdgeRDDImpl

Feb 25, 2015

[SPARK-1955][GraphX]: VertexRDD can incorrectly assume index sharing · 9f603fce

Brennon York authored 10 years ago

Fixes the issue whereby when VertexRDD's are `diff`ed, `innerJoin`ed, or `leftJoin`ed and have different partition sizes they fail under the `zipPartitions` method. This fix tests whether the partitions are equal or not and, if not, will repartition the other to match the partition size of the calling VertexRDD.

Author: Brennon York <brennon.york@capitalone.com>

Closes #4705 from brennonyork/SPARK-1955 and squashes the following commits:

0882590 [Brennon York] updated to properly handle differently-partitioned vertexRDDs

Feb 16, 2015

SPARK-5815 [MLLIB] Part 2. Deprecate SVDPlusPlus APIs that expose DoubleMatrix from JBLAS · a3afa4a1

Sean Owen authored 10 years ago

Now, deprecated runSVDPlusPlus and update run, for 1.4.0 / master only

Author: Sean Owen <sowen@cloudera.com>

Closes #4625 from srowen/SPARK-5815.2 and squashes the following commits:

6fd2ca5 [Sean Owen] Now, deprecated runSVDPlusPlus and update run, for 1.4.0 / master only

Feb 15, 2015

SPARK-5815 [MLLIB] Deprecate SVDPlusPlus APIs that expose DoubleMatrix from JBLAS · acf2558d

Sean Owen authored 10 years ago

Deprecate SVDPlusPlus.run and introduce SVDPlusPlus.runSVDPlusPlus with return type that doesn't include DoubleMatrix

CC mengxr

Author: Sean Owen <sowen@cloudera.com>

Closes #4614 from srowen/SPARK-5815 and squashes the following commits:

288cb05 [Sean Owen] Clarify deprecation plans in scaladoc
497458e [Sean Owen] Deprecate SVDPlusPlus.run and introduce SVDPlusPlus.runSVDPlusPlus with return type that doesn't include DoubleMatrix

Feb 13, 2015

SPARK-3290 [GRAPHX] No unpersist calls in SVDPlusPlus · 0ce4e430

Sean Owen authored 10 years ago

This just unpersist()s each RDD in this code that was cache()ed.

Author: Sean Owen <sowen@cloudera.com>

Closes #4234 from srowen/SPARK-3290 and squashes the following commits:

66c1e11 [Sean Owen] unpersist() each RDD that was cache()ed

Feb 10, 2015

[SPARK-5343][GraphX]: ShortestPaths traverses backwards · 58209612

Brennon York authored 10 years ago

Corrected the logic with ShortestPaths so that the calculation will run forward rather than backwards. Output before looked like:

```scala
import org.apache.spark.graphx._
val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))), sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,""))))
lib.ShortestPaths.run(g,Array(3)).vertices.collect
// res0: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()), (3,Map(3 -> 0)), (2,Map()))
lib.ShortestPaths.run(g,Array(1)).vertices.collect
// res1: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), (3,Map(1 -> 2)), (2,Map(1 -> 1)))
```

And new output after the changes looks like:

```scala
import org.apache.spark.graphx._
val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))), sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,""))))
lib.ShortestPaths.run(g,Array(3)).vertices.collect
// res0: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(3 -> 2)), (2,Map(3 -> 1)), (3,Map(3 -> 0)))
lib.ShortestPaths.run(g,Array(1)).vertices.collect
// res1: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), (2,Map()), (3,Map()))
```
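
The "forward" behaviour shown in the new output can be reproduced without Spark by a breadth-first search over reversed edges. The sketch below is an assumption-laden illustration (the object and method names are hypothetical, and it handles one landmark at a time), not the Pregel-based implementation in lib.ShortestPaths.

```scala
object ShortestPathsSketch {
  /** Hop counts from every vertex to `landmark`, following edge direction. */
  def hopsTo(edges: Seq[(Long, Long)], landmark: Long): Map[Long, Int] = {
    // Index predecessors: a vertex with an edge into v is one hop further away.
    val preds: Map[Long, Seq[Long]] =
      edges.groupBy(_._2).map { case (dst, es) => dst -> es.map(_._1) }

    var dist: Map[Long, Int] = Map(landmark -> 0)
    var frontier: Set[Long] = Set(landmark)
    var hops = 0
    while (frontier.nonEmpty) {
      hops += 1
      // Unvisited predecessors of the current frontier form the next level.
      val next = frontier
        .flatMap(v => preds.getOrElse(v, Seq.empty[Long]))
        .filterNot(p => dist.contains(p))
      dist = dist ++ next.map(p => p -> hops)
      frontier = next
    }
    dist
  }
}
```

On the graph 1 -> 2 -> 3, calling `hopsTo` with landmark 3 yields a distance for every vertex, while landmark 1 yields an entry only for vertex 1, which matches the corrected output above.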

Author: Brennon York <brennon.york@capitalone.com>

Closes #4478 from brennonyork/SPARK-5343 and squashes the following commits:

aa57f83 [Brennon York] updated to set ShortestPaths to run 'forward' rather than 'backward'

Feb 06, 2015

[SPARK-5380][GraphX] Solve an ArrayIndexOutOfBoundsException when build graph... · 575d2df3

Leolh authored 10 years ago

[SPARK-5380][GraphX]  Solve an ArrayIndexOutOfBoundsException when build graph with a file format error

When I build a graph from a file with a format error, an ArrayIndexOutOfBoundsException is thrown.
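
A minimal sketch of the kind of guard that avoids this failure is shown below. It is plain Scala, the `parseEdge` helper is hypothetical, and raising an IllegalArgumentException is one plausible choice rather than a statement of what this commit does in GraphLoader.

```scala
object EdgeLineSketch {
  /** Parses one "srcId dstId" line; returns None for blank lines and '#' comments. */
  def parseEdge(line: String): Option[(Long, Long)] = {
    val trimmed = line.trim
    if (trimmed.isEmpty || trimmed.startsWith("#")) {
      None
    } else {
      val fields = trimmed.split("\\s+")
      // Reading fields(1) without this check raises ArrayIndexOutOfBoundsException.
      if (fields.length < 2) {
        throw new IllegalArgumentException(s"Invalid line: $line")
      }
      Some((fields(0).toLong, fields(1).toLong))
    }
  }
}
```

Checking the field count before indexing turns an opaque array error into a message that names the offending line.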

Author: Leolh <leosandylh@gmail.com>

Closes #4176 from Leolh/patch-1 and squashes the following commits:

94f6d22 [Leolh] Update GraphLoader.scala
23767f1 [Leolh] [SPARK-3650][GraphX] There will be an ArrayIndexOutOfBoundsException if the format of the source file is wrong

Feb 03, 2015

[SPARK-4795][Core] Redesign the "primitive type => Writable" implicit APIs to... · d37978d8

zsxwing authored 10 years ago

[SPARK-4795][Core] Redesign the "primitive type => Writable" implicit APIs to make them be activated automatically

Try to redesign the "primitive type => Writable" implicit APIs to make them be activated automatically and without breaking binary compatibility.

However, this PR will break source compatibility if people occasionally use `xxxToXxxWritable` directly. See the unit test in `graphx`.

Author: zsxwing <zsxwing@gmail.com>

Closes #3642 from zsxwing/SPARK-4795 and squashes the following commits:

914b2d6 [zsxwing] Add implicit back to the Writables methods
0b9017f [zsxwing] Add some docs
a0e8509 [zsxwing] Merge branch 'master' into SPARK-4795
39343de [zsxwing] Fix the unit test
64853af [zsxwing] Reorganize the rest 'implicit' methods in SparkContext

Feb 02, 2015

[SPARK-5534] [graphx] Graph getStorageLevel fix · f133dece

Joseph K. Bradley authored 10 years ago

This fixes getStorageLevel for EdgeRDDImpl and VertexRDDImpl (and therefore for Graph).

See code example on JIRA which failed before but works with this patch: [https://issues.apache.org/jira/browse/SPARK-5534]
(The added unit tests also failed before but work with this fix.)

Note: I used partitionsRDD, assuming that getStorageLevel will only be called on the driver.

CC: mengxr  (related to LDA PR), rxin  ankurdave   Thanks in advance!

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4317 from jkbradley/graphx-storagelevel and squashes the following commits:

1c21e49 [Joseph K. Bradley] made graph getStorageLevel test more robust
18d64ca [Joseph K. Bradley] Added tests for getStorageLevel in VertexRDDSuite, EdgeRDDSuite, GraphSuite
17b488b [Joseph K. Bradley] overrode getStorageLevel in Vertex/EdgeRDDImpl to use partitionsRDD

[SPARK-5461] [graphx] Add isCheckpointed, getCheckpointedFiles methods to Graph · 842d0003

Joseph K. Bradley authored 10 years ago

Added the 2 methods to Graph and GraphImpl.  Both make calls to the underlying vertex and edge RDDs.

This is needed for another PR (for LDA): [https://github.com/apache/spark/pull/4047]

Notes:
* getCheckpointedFiles is plural and returns a Seq[String] instead of an Option[String].
* I attempted to test to make sure the methods returned the correct values after checkpointing.  It did not work; I guess that checkpointing does not occur quickly enough?  I noticed that there are not checkpointing tests for RDDs; is it just hard to test well?

CC: rxin

CC: mengxr  (since related to LDA)

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4253 from jkbradley/graphx-checkpoint and squashes the following commits:

b680148 [Joseph K. Bradley] added class tag to firstParent call in VertexRDDImpl.isCheckpointed, though not needed to compile
250810e [Joseph K. Bradley] In EdgeRDDImple, VertexRDDImpl, added transient back to partitionsRDD, and made isCheckpointed check firstParent instead of partitionsRDD
695b7a3 [Joseph K. Bradley] changed partitionsRDD in EdgeRDDImpl, VertexRDDImpl to be non-transient
cc00767 [Joseph K. Bradley] added overrides for isCheckpointed, getCheckpointFile in EdgeRDDImpl, VertexRDDImpl. The corresponding Graph methods now work.
188665f [Joseph K. Bradley] improved documentation
235738c [Joseph K. Bradley] Added isCheckpointed and getCheckpointFiles to Graph, GraphImpl

Jan 31, 2015

SPARK-3359 [CORE] [DOCS] `sbt/sbt unidoc` doesn't work with Java 8 · c84d5a10

Sean Owen authored 10 years ago

These are more `javadoc` 8-related changes I spotted while investigating. These should be helpful in any event, but this does not nearly resolve SPARK-3359, which may never be feasible while using `unidoc` and `javadoc` 8.

Author: Sean Owen <sowen@cloudera.com>

Closes #4193 from srowen/SPARK-3359 and squashes the following commits:

5b33f66 [Sean Owen] Additional scaladoc fixes for javadoc 8; still not going to be javadoc 8 compatible

Jan 29, 2015

[SPARK-5466] Add explicit guava dependencies where needed. · f9e56945

Marcelo Vanzin authored 10 years ago

One side-effect of shading guava is that it disappears as a transitive
dependency. For Hadoop 2.x, this was masked by the fact that Hadoop
itself depends on guava. But certain versions of Hadoop 1.x also
shade guava, leaving either no guava or some random version pulled
by another dependency on the classpath.

So be explicit about the dependency in modules that use guava directly,
which is the right thing to do anyway.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #4272 from vanzin/SPARK-5466 and squashes the following commits:

e3f30e5 [Marcelo Vanzin] Dependency for catalyst is not needed.
d3b2c84 [Marcelo Vanzin] [SPARK-5466] Add explicit guava dependencies where needed.

Jan 23, 2015

[SPARK-5351][GraphX] Do not use Partitioner.defaultPartitioner as a partitioner of EdgeRDDImp... · e224dbb0

Takeshi Yamamuro authored 10 years ago

If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition (EdgeRDDImpl),
the following error occurs in ReplicatedVertexView.scala:72:

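```scala
// Imports assumed for a Spark 1.x shell session, where `sc` is predefined.
import scala.reflect.ClassTag
import org.apache.spark.Logging
import org.apache.spark.graphx._

object GraphTest extends Logging {
  def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = {
    graph.aggregateMessages(
      ctx => {
        ctx.sendToSrc(1)
        ctx.sendToDst(2)
      },
      _ + _)
  }
}

val g = GraphLoader.edgeListFile(sc, "graph.txt")
val rdd = GraphTest.run(g)
```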

```
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
	at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
	at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82)
	at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
	at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193)
	at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191)
    ...
```

Author: Takeshi Yamamuro <linguin.m.s@gmail.com>

Closes #4136 from maropu/EdgePartitionBugFix and squashes the following commits:

0cd8942 [Ankur Dave] Use more concise getOrElse
aad4a2c [Ankur Dave] Add unit test for non-default number of edge partitions
0a2f32b [Takeshi Yamamuro] Do not use Partitioner.defaultPartitioner as a partitioner of EdgeRDDImpl

Jan 21, 2015

[SPARK-5064][GraphX] Add numEdges upperbound validation for R-MAT graph... · 3ee3ab59

Kenji Kikushima authored 10 years ago

[SPARK-5064][GraphX] Add numEdges upperbound validation for R-MAT graph generator to prevent infinite loop

I looked into GraphGenerators#chooseCell and found that chooseCell can't generate more edges than pow(2, (2 * (log2(numVertices)-1))) to make a power-law graph (e.g. numVertices: 4, upper bound: 4; numVertices: 8, upper bound: 16; numVertices: 16, upper bound: 64).
If we request more edges than the upper bound, rmatGraph falls into an infinite loop. So, how about adding an argument validation?
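
The bound and the proposed validation can be sketched in a few lines of plain Scala. The names below are hypothetical, and rounding log2 up for vertex counts that are not powers of two is an assumption; the formula itself is the one quoted above.

```scala
object RmatBoundSketch {
  /** Upper bound on distinct edges: 2^(2 * (log2(numVertices) - 1)). */
  def maxEdges(numVertices: Int): Long = {
    require(numVertices >= 2, "need at least two vertices")
    // ceil(log2(numVertices)) via integer arithmetic, avoiding floating-point rounding.
    val log2Ceil = 32 - Integer.numberOfLeadingZeros(numVertices - 1)
    1L << (2 * (log2Ceil - 1))
  }

  /** Fails fast instead of looping forever when too many edges are requested. */
  def validate(numVertices: Int, numEdges: Long): Unit =
    require(numEdges <= maxEdges(numVertices),
      s"numEdges must be <= ${maxEdges(numVertices)} but was $numEdges")
}
```

For 4, 8 and 16 vertices this reproduces the bounds 4, 16 and 64 listed in the message, and a request for 5 edges on 4 vertices is rejected up front.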

Author: Kenji Kikushima <kikushima.kenji@lab.ntt.co.jp>

Closes #3950 from kj-ki/SPARK-5064 and squashes the following commits:

4ee18c7 [Ankur Dave] Reword error message and add unit test
d760bc7 [Kenji Kikushima] Add numEdges upperbound validation for R-MAT graph generator to prevent infinite loop.

Jan 08, 2015

[SPARK-4048] Enhance and extend hadoop-provided profile. · 48cecf67

Marcelo Vanzin authored 10 years ago

This change does a few things to make the hadoop-provided profile more useful:

- Create new profiles for other libraries / services that might be provided by the infrastructure
- Simplify and fix the poms so that the profiles are only activated while building assemblies.
- Fix tests so that they're able to run when the profiles are activated
- Add a new env variable to be used by distributions that use these profiles to provide the runtime
  classpath for Spark jobs and daemons.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #2982 from vanzin/SPARK-4048 and squashes the following commits:

82eb688 [Marcelo Vanzin] Add a comment.
eb228c0 [Marcelo Vanzin] Fix borked merge.
4e38f4e [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
9ef79a3 [Marcelo Vanzin] Alternative way to propagate test classpath to child processes.
371ebee [Marcelo Vanzin] Review feedback.
52f366d [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
83099fc [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
7377e7b [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
322f882 [Marcelo Vanzin] Fix merge fail.
f24e9e7 [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
8b00b6a [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
9640503 [Marcelo Vanzin] Cleanup child process log message.
115fde5 [Marcelo Vanzin] Simplify a comment (and make it consistent with another pom).
e3ab2da [Marcelo Vanzin] Fix hive-thriftserver profile.
7820d58 [Marcelo Vanzin] Fix CliSuite with provided profiles.
1be73d4 [Marcelo Vanzin] Restore flume-provided profile.
d1399ed [Marcelo Vanzin] Restore jetty dependency.
82a54b9 [Marcelo Vanzin] Remove unused profile.
5c54a25 [Marcelo Vanzin] Fix HiveThriftServer2Suite with *-provided profiles.
1fc4d0b [Marcelo Vanzin] Update dependencies for hive-thriftserver.
f7b3bbe [Marcelo Vanzin] Add snappy to hadoop-provided list.
9e4e001 [Marcelo Vanzin] Remove duplicate hive profile.
d928d62 [Marcelo Vanzin] Redirect child stderr to parent's log.
4d67469 [Marcelo Vanzin] Propagate SPARK_DIST_CLASSPATH on Yarn.
417d90e [Marcelo Vanzin] Introduce "SPARK_DIST_CLASSPATH".
2f95f0d [Marcelo Vanzin] Propagate classpath to child processes during testing.
1adf91c [Marcelo Vanzin] Re-enable maven-install-plugin for a few projects.
284dda6 [Marcelo Vanzin] Rework the "hadoop-provided" profile, add new ones.

[SPARK-4917] Add a function to convert into a graph with canonical edges in GraphOps · f825e193

Takeshi Yamamuro authored 10 years ago

Convert bi-directional edges into uni-directional ones instead of 'canonicalOrientation' in GraphLoader.edgeListFile.
This function is useful when a graph is loaded as it is and then is transformed into one with canonical edges.
It rewrites the vertex ids of edges so that srcIds are bigger than dstIds, and merges the duplicated edges.

Author: Takeshi Yamamuro <linguin.m.s@gmail.com>

Closes #3760 from maropu/ConvertToCanonicalEdgesSpike and squashes the following commits:

7f8b580 [Takeshi Yamamuro] Add a function to convert into a graph with canonical edges in GraphOps

Jan 06, 2015

SPARK-4159 [CORE] Maven build doesn't run JUnit test suites · 4cba6eb4

Sean Owen authored 10 years ago

This PR:

- Reenables `surefire`, and copies config from `scalatest` (which is itself an old fork of `surefire`, so similar)
- Tells `surefire` to test only Java tests
- Enables `surefire` and `scalatest` for all children, and in turn eliminates some duplication.

For me this causes the Scala and Java tests to be run once each, it seems, as desired. It doesn't affect the SBT build but works for Maven. I still need to verify that all of the Scala tests and Java tests are being run.

Author: Sean Owen <sowen@cloudera.com>

Closes #3651 from srowen/SPARK-4159 and squashes the following commits:

2e8a0af [Sean Owen] Remove specialized SPARK_HOME setting for REPL, YARN tests as it appears to be obsolete
12e4558 [Sean Owen] Append to unit-test.log instead of overwriting, so that both surefire and scalatest output is preserved. Also standardize/correct comments a bit.
e6f8601 [Sean Owen] Reenable Java tests by reenabling surefire with config cloned from scalatest; centralize test config in the parent

[Minor] Fix comments for GraphX 2D partitioning strategy · 5e3ec111

kj-ki authored 10 years ago

The sum of vertices on matrix (v0 to v11) is 12. And, I think one same block overlaps in this strategy.

This is minor PR, so I didn't file in JIRA.

Author: kj-ki <kikushima.kenji@lab.ntt.co.jp>

Closes #3904 from kj-ki/fix-partitionstrategy-comments and squashes the following commits:

79829d9 [kj-ki] Fix comments for 2D partitioning.

Dec 31, 2014

[SPARK-5038] Add explicit return type for implicit functions. · 7749dd6c

Reynold Xin authored 10 years ago

As we learned in #3580, not explicitly typing implicit functions can lead to compiler bugs and potentially unexpected runtime behavior.
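
As a small illustration of the convention, the sketch below declares an implicit conversion with an explicit return type. The wrapper class and all names are hypothetical and unrelated to Spark's own implicits; the example only shows the annotated form this change standardizes on.

```scala
import scala.language.implicitConversions

object ImplicitSketch {
  // Hypothetical wrapper type used only for this illustration.
  class RichCount(val n: Int) {
    def doubled: Int = n * 2
  }

  // The explicit `: RichCount` return type is the point of the example.
  implicit def toRichCount(n: Int): RichCount = new RichCount(n)

  /** Relies on the conversion above to call `doubled` on a plain Int. */
  def demo(n: Int): Int = n.doubled
}
```

Spelling out the result type means the compiler does not have to infer it while resolving implicits, which is the situation the message above warns about.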

This is a follow up PR for rest of Spark (outside Spark SQL). The original PR for Spark SQL can be found at https://github.com/apache/spark/pull/3859

Author: Reynold Xin <rxin@databricks.com>

Closes #3860 from rxin/implicit and squashes the following commits:

73702f9 [Reynold Xin] [SPARK-5038] Add explicit return type for implicit functions.

Dec 07, 2014

[SPARK-4620] Add unpersist in Graph and GraphImpl · 8817fc7f

Takeshi Yamamuro authored 10 years ago

Add an interface to uncache both the vertices and edges of Graph/GraphImpl.
This interface is useful when iterative graph operations build a new graph in each iteration and the vertices and edges of previous iterations are no longer needed in later iterations.

Author: Takeshi Yamamuro <linguin.m.s@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Ankur Dave <ankurdave@gmail.com>

Closes #3476 from maropu/UnpersistInGraphSpike and squashes the following commits:

77a006a [Takeshi Yamamuro] Add unpersist in Graph and GraphImpl

[SPARK-4646] Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark · 2e6b736b

Takeshi Yamamuro authored 10 years ago

This patch just replaces the native quicksort with Sorter(TimSort) in Spark.
It improved performance by ~8% in my quick experiments.

Author: Takeshi Yamamuro <linguin.m.s@gmail.com>

Closes #3507 from maropu/TimSortInEdgePartitionBuilderSpike and squashes the following commits:

8d4e5d2 [Takeshi Yamamuro] Remove a wildcard import
3527e00 [Takeshi Yamamuro] Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark

Dec 06, 2014

[SPARK-3623][GraphX] GraphX should support the checkpoint operation · e895e0cb

GuoQiang Li authored 10 years ago

Author: GuoQiang Li <witgo@qq.com>

Closes #2631 from witgo/SPARK-3623 and squashes the following commits:

a70c500 [GuoQiang Li] Remove java related
4d1e249 [GuoQiang Li] Add comments
e682724 [GuoQiang Li] Graph should support the checkpoint operation

Dec 02, 2014

[SPARK-4672][GraphX]Non-transient PartitionsRDDs will lead to StackOverflow error · 17c162f6

JerryLead authored 10 years ago

The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672

In a nutshell, if `val partitionsRDD` in EdgeRDDImpl and VertexRDDImpl are non-transient, the serialization chain can become very long in iterative algorithms and finally lead to the StackOverflow error. More details and explanation can be found in the JIRA.

Author: JerryLead <JerryLead@163.com>
Author: Lijie Xu <csxulijie@gmail.com>

Closes #3544 from JerryLead/my_graphX and squashes the following commits:

628f33c [JerryLead] set PartitionsRDD to be transient in EdgeRDDImpl and VertexRDDImpl
c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
52799e3 [Lijie Xu] Merge pull request #1 from apache/master
