    [SPARK-19872] [PYTHON] Use the correct deserializer for RDD construction for coalesce/repartition · 7387126f
    hyukjinkwon authored
    ## What changes were proposed in this pull request?
    
This PR proposes to use the correct deserializer, `BatchedSerializer`, for RDD construction in coalesce/repartition when shuffle is enabled. Currently, the `UTF8Deserializer` from the copied RDD is passed through as-is instead of `BatchedSerializer`.
    
With the file `text.txt` below:
    
    ```
    a
    b
    
    d
    e
    f
    g
    h
    i
    j
    k
    l
    
    ```
    
    - Before
    
    ```python
    >>> sc.textFile('text.txt').repartition(1).collect()
    ```
    
    ```
    UTF8Deserializer(True)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/rdd.py", line 811, in collect
        return list(_load_from_socket(port, self._jrdd_deserializer))
      File ".../spark/python/pyspark/serializers.py", line 549, in load_stream
        yield self.loads(stream)
      File ".../spark/python/pyspark/serializers.py", line 544, in loads
        return s.decode("utf-8") if self.use_unicode else s
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
    ```
    
    - After
    
    ```python
    >>> sc.textFile('text.txt').repartition(1).collect()
    ```
    
    ```
    [u'a', u'b', u'', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'']
    ```
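The mismatch above can be reproduced in plain Python, without PySpark: after the shuffle, rows arrive as pickled batches, so decoding the raw bytes as UTF-8 text (what `UTF8Deserializer` effectively does) fails on the pickle protocol header byte `0x80`, exactly as in the traceback, while unpickling recovers the rows. This is a standalone sketch of the failure mode, not the actual PySpark serializer code:

```python
import pickle

# Sketch of a shuffled payload: rows written as a pickled batch.
payload = pickle.dumps([u"a", u"b", u""])

# Wrong deserializer: treating the pickled bytes as UTF-8 text fails
# on the pickle header byte 0x80, mirroring the traceback above.
try:
    payload.decode("utf-8")
except UnicodeDecodeError as e:
    print("UTF-8 decode failed:", e)

# Matching deserializer: unpickling recovers the original rows.
print(pickle.loads(payload))
```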
    
    ## How was this patch tested?
    
    Unit test in `python/pyspark/tests.py`.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #17282 from HyukjinKwon/SPARK-19872.