    [SPARK-5900][MLLIB] make PIC and FPGrowth Java-friendly · 0cfd2ceb
    Xiangrui Meng authored
    In the previous version, PIC (power iteration clustering) stored clustering assignments as an `RDD[(Long, Int)]`. Java sees this as `RDD<Tuple2<Object, Object>>`, so Java users have to cast types manually. We should either create a new method called `javaAssignments` that returns `JavaRDD[(java.lang.Long, java.lang.Integer)]` or wrap the result pair in a class. I chose the latter approach in this PR. Now assignments are stored as an `RDD[Assignment]`, where `Assignment` is a class with `id` and `cluster`.
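
To illustrate the difference from the Java side, here is a minimal, Spark-free sketch. `ErasedPair` is a hypothetical stand-in for how Java sees a Scala tuple of primitives (both fields typed as `Object`), and the `Assignment` class below is only an assumed Java mirror of the wrapper described above, not the actual MLlib source. The accessor names `id()` and `cluster()` follow the field names in this message.

```java
import java.io.Serializable;

// Hypothetical stand-in for a Scala (Long, Int) tuple as Java sees it:
// the element types are erased, so both fields are plain Object.
final class ErasedPair {
    final Object first;
    final Object second;

    ErasedPair(Object first, Object second) {
        this.first = first;
        this.second = second;
    }
}

// Assumed Java mirror of the wrapper class: typed fields, no casts needed.
final class Assignment implements Serializable {
    private static final long serialVersionUID = 1L;
    private final long id;
    private final int cluster;

    Assignment(long id, int cluster) {
        this.id = id;
        this.cluster = cluster;
    }

    long id() { return id; }
    int cluster() { return cluster; }
}

final class AssignmentDemo {
    // Old style: the caller must know the runtime types and cast by hand.
    static long idOfErased(ErasedPair p) { return (Long) p.first; }
    static int clusterOfErased(ErasedPair p) { return (Integer) p.second; }

    public static void main(String[] args) {
        ErasedPair old = new ErasedPair(7L, 2);
        System.out.println(idOfErased(old) + " -> " + clusterOfErased(old));

        // New style: the compiler checks the types.
        Assignment a = new Assignment(7L, 2);
        System.out.println(a.id() + " -> " + a.cluster());
    }
}
```

With the wrapper, a wrong type is a compile error rather than a `ClassCastException` at run time.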
    
    Similarly, in FPGrowth, the frequent itemsets are stored as an `RDD[(Array[Item], Long)]`, which is mapped to `RDD<Tuple2<Object, Object>>`. Although we provide a "Java-friendly" method `javaFreqItemsets` that returns `JavaRDD[(Array[Item], java.lang.Long)]`, it doesn't really work because `Array[Item]` is mapped to `Object` in Java. So in this PR I created a class `FreqItemset` to wrap the results. It has `items` and `freq`, as well as a `javaItems` method that returns `List<Item>` in Java.
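
The shape of the `FreqItemset` wrapper can be sketched in plain Java as follows. This is an assumed mirror for illustration only, not the MLlib class itself: the constructor, the unmodifiable list view, and the `describe` helper are hypothetical. Only the names `items`, `freq`, and `javaItems` come from this message.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Assumed Java mirror of the wrapper: an item array plus its frequency.
final class FreqItemset<Item> implements Serializable {
    private static final long serialVersionUID = 1L;
    private final Item[] items;
    private final long freq;

    FreqItemset(Item[] items, long freq) {
        this.items = items;
        this.freq = freq;
    }

    Item[] items() { return items; }
    long freq() { return freq; }

    // Typed list view of the items, so Java callers never handle a raw Object.
    List<Item> javaItems() {
        return Collections.unmodifiableList(Arrays.asList(items));
    }
}

final class FreqItemsetDemo {
    // Hypothetical helper: formats an itemset without any casts.
    static String describe(FreqItemset<String> s) {
        return s.javaItems() + ": " + s.freq();
    }

    public static void main(String[] args) {
        FreqItemset<String> s = new FreqItemset<>(new String[] {"a", "b"}, 3L);
        System.out.println(describe(s));
    }
}
```

Exposing a `List<Item>` accessor sidesteps the generic-array problem: Java callers get a typed collection instead of an `Object` they would have to cast to an array.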
    
    I'm not certain that the names I chose are appropriate: `Assignment`/`id`/`cluster` and `FreqItemset`/`items`/`freq`. Please let me know if you have better suggestions.
    
    CC: jkbradley
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #4695 from mengxr/SPARK-5900 and squashes the following commits:
    
    865b5ca [Xiangrui Meng] make Assignment serializable
    cffa96e [Xiangrui Meng] fix test
    9c0e590 [Xiangrui Meng] remove unused Tuple2
    1b9db3d [Xiangrui Meng] make PIC and FPGrowth Java-friendly