Merge pull request #578 from mengxr/rank.
SPARK-1076: zipWithIndex and zipWithUniqueId to RDD

Assigning ranks to an ordered or unordered data set is a common operation. It can be done by first counting the records in each partition and then assigning ranks in parallel. The purpose of assigning ranks to an unordered set is usually to get a unique id for each item, e.g., to map feature names to feature indices. In such cases, the assignment can be done without counting records, saving one Spark job.

https://spark-project.atlassian.net/browse/SPARK-1076

== update ==

Because assigning ranks is very similar to Scala's zipWithIndex, I changed the method name to zipWithIndex and put the index in the value field.

Author: Xiangrui Meng <meng@databricks.com>

Closes #578 and squashes the following commits:

52a05e1 [Xiangrui Meng] changed assignRanks to zipWithIndex, changed assignUniqueIds to zipWithUniqueId, minor updates
756881c [Xiangrui Meng] simplified RankedRDD by implementing assignUniqueIds separately; moved counting iterator size to Utils; do not count items in the last partition and skip counting if there is only one partition
630868c [Xiangrui Meng] newline
21b434b [Xiangrui Meng] add assignRanks and assignUniqueIds to RDD
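A minimal sketch of how the two methods behave, assuming the merged behavior described above and a hypothetical local SparkContext (the setup code is illustrative, not part of the patch):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ZipWithIndexSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local context for illustration only.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("zip-sketch"))

    val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

    // zipWithIndex: consecutive, order-preserving indices 0..n-1 in the
    // value field. Per the commit notes, this runs one Spark job to count
    // the sizes of all partitions except the last, and skips counting
    // entirely when there is only one partition.
    rdd.zipWithIndex().collect().foreach(println)
    // (a,0) (b,1) (c,2) (d,3)

    // zipWithUniqueId: unique but not necessarily consecutive ids,
    // assigned without counting records: the k-th item of partition i
    // receives id k * numPartitions + i.
    rdd.zipWithUniqueId().collect().foreach(println)
    // (a,0) (b,2) (c,1) (d,3)

    sc.stop()
  }
}
```

The id scheme for zipWithUniqueId is what lets it avoid the counting job: each partition can compute its items' ids independently from its own partition index.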
Showing 5 changed files:
- core/src/main/scala/org/apache/spark/rdd/RDD.scala (24 additions, 12 deletions)
- core/src/main/scala/org/apache/spark/rdd/ZippedWithIndexRDD.scala (69 additions, 0 deletions)
- core/src/main/scala/org/apache/spark/util/Utils.scala (13 additions, 0 deletions)
- core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala (26 additions, 0 deletions)
- core/src/test/scala/org/apache/spark/util/UtilsSuite.scala (7 additions, 0 deletions)
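The Utils.scala change corresponds to the commit note "moved counting iterator size to Utils". A minimal sketch of such a helper; the name getIteratorSize is assumed, not copied from the diff:

```scala
// Sketch of an iterator-size helper like the one the commit notes
// describe moving into Utils; the method name is an assumption.
def getIteratorSize[T](iterator: Iterator[T]): Long = {
  var count = 0L
  while (iterator.hasNext) {
    iterator.next()  // consume the element; only the count matters
    count += 1L
  }
  count
}
```

Returning Long rather than Int keeps the count safe for partitions larger than Int.MaxValue elements.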