Commit 96df9290 authored by Ankur Dave, committed by Josh Rosen

[SPARK-3190] Avoid overflow in VertexRDD.count()

VertexRDDs whose total element count exceeds `Int.MaxValue` (about 2.1 billion) are counted incorrectly because partition sizes are summed as `Int`s, which overflow. This PR fixes the issue by converting partition sizes to `Long`s before summing them.

The following code previously returned -10000000. After applying this PR, it returns the correct answer of 5000000000 (5 billion).

```scala
val pairs = sc.parallelize(0L until 500L).map(_ * 10000000)
  .flatMap(start => start until (start + 10000000)).map(x => (x, x))
VertexRDD(pairs).count()
```
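
The overflow itself can be illustrated without a Spark cluster. The following is a minimal sketch using plain Scala collections, not the GraphX code path: `partitionSizes`, `countAsInt`, and `countAsLong` are hypothetical names introduced only for illustration, and the three sizes are made-up values chosen so that the total exceeds `Int.MaxValue`.

```scala
// Hypothetical per-partition sizes (not taken from a real VertexRDD):
// three partitions of one billion elements each, 3 billion in total.
val partitionSizes: Seq[Int] = Seq(1000000000, 1000000000, 1000000000)

// Mirrors the old pattern: sizes are added as Int, so the running total
// wraps around once it passes Int.MaxValue (2147483647).
def countAsInt(sizes: Seq[Int]): Int = sizes.reduce(_ + _)

// Mirrors the fixed pattern: each size is widened to Long before the sum,
// so the total is computed in 64-bit arithmetic.
def countAsLong(sizes: Seq[Int]): Long = sizes.map(_.toLong).reduce(_ + _)

println(s"Int sum:  ${countAsInt(partitionSizes)}")   // wrapped, negative value
println(s"Long sum: ${countAsLong(partitionSizes)}")  // 3000000000
```

Note that widening only the final result (for example `sizes.reduce(_ + _).toLong`) would not help, since the wraparound has already happened by the time the conversion runs. That is presumably why the conversion is applied inside the `map`, before the `reduce`.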

Author: Ankur Dave <ankurdave@gmail.com>

Closes #2106 from ankurdave/SPARK-3190 and squashes the following commits:

641f468 [Ankur Dave] Avoid overflow in VertexRDD.count()
parent 39012452
@@ -108,7 +108,7 @@ class VertexRDD[@specialized VD: ClassTag](
   /** The number of vertices in the RDD. */
   override def count(): Long = {
-    partitionsRDD.map(_.size).reduce(_ + _)
+    partitionsRDD.map(_.size.toLong).reduce(_ + _)
   }
   /**