    [SPARK-5361] Multiple Java RDD <-> Python RDD conversions not working correctly · 453d7999
    Winston Chen authored
    This was found while reading an RDD with `sc.newAPIHadoopRDD` and writing it back with `rdd.saveAsNewAPIHadoopFile` in PySpark.
    
    Whenever an RDD is converted from JavaRDD to PythonRDD and back to JavaRDD more than once, the following exception is thrown:
    
    ```
    15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7)
    java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList
    	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157)
    	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153)
    	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
    ```
    
    The test case code below reproduces it:
    
    ```
    # Run in the pyspark shell, where `sc` is the active SparkContext.
    from pyspark.rdd import RDD
    
    dl = [
        (u'2', {u'director': u'David Lean'}),
        (u'7', {u'director': u'Andrew Dominik'})
    ]
    
    dl_rdd = sc.parallelize(dl)
    tmp = dl_rdd._to_java_object_rdd()
    tmp2 = sc._jvm.SerDe.javaToPython(tmp)
    t = RDD(tmp2, sc)
    t.count()
    
    tmp = t._to_java_object_rdd()
    tmp2 = sc._jvm.SerDe.javaToPython(tmp)
    t = RDD(tmp2, sc)
    t.count()  # fails here, on the second conversion
    ```
    
    Author: Winston Chen <wchen@quid.com>
    
    Closes #4146 from wingchen/master and squashes the following commits:
    
    903df7d [Winston Chen] SPARK-5361, update to toSeq based on the PR
    5d90a83 [Winston Chen] SPARK-5361, make python pretty, so to pass PEP 8 checks
    126be6b [Winston Chen] SPARK-5361, add in test case
    4cf1187 [Winston Chen] SPARK-5361, add in test case
    9f1a097 [Winston Chen] add in tuple handling while converting from python RDD back to JavaRDD
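
The stack trace and the "tuple handling" / "toSeq" commit titles suggest that an unpickled batch can reach the JVM either as a list (`java.util.ArrayList`) or as an array (`Object[]`, which is how a Python tuple arrives), and that the converter has to accept both. The sketch below is only a pure-Python analogy of that idea, not the Scala code changed in `SerDeUtil.scala`, which this commit message does not show. The function name `unpack_batch` and its `batched` flag are hypothetical and exist only for illustration.

```python
def unpack_batch(obj, batched):
    """Return the rows contained in one unpickled object.

    Hypothetical helper: `obj` stands in for the result of unpickling one
    element of the serialized RDD, and `batched` says whether each element
    holds several rows or exactly one.
    """
    if not batched:
        # An unbatched element is a single row, even if that row is a tuple.
        return [obj]
    if isinstance(obj, (list, tuple)):
        # Accept both container types instead of assuming a list.
        return list(obj)
    raise TypeError("unexpected batch type: %s" % type(obj).__name__)


rows = [
    (u'2', {u'director': u'David Lean'}),
    (u'7', {u'director': u'Andrew Dominik'}),
]

# The same rows come back whether the batch is a list or a tuple.
rows_from_list = unpack_batch(rows, batched=True)
rows_from_tuple = unpack_batch(tuple(rows), batched=True)

# A single unbatched row is wrapped, not flattened.
single_row = unpack_batch(rows[0], batched=False)
```

Branching on the container type keeps the conversion independent of how many times the data has already crossed between Python and the JVM, which is the property the reproduction above exercises.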