    [SPARK-17472] [PYSPARK] Better error message for serialization failures of large objects in Python · dbfc7aa4
    Eric Liang authored
    ## What changes were proposed in this pull request?
    
    When serializing large objects, pickle fails with error messages that do not point to the cause. We can wrap these errors to make them slightly more user-friendly:
    
    Example 1:
    ```
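    def run():
      # Assumes `sc` is an active SparkContext (e.g. from the pyspark shell).
      import numpy.random as nr
      # ~8 GB of random bytes, captured in the lambda's closure below.
      b = nr.bytes(8 * 1000000000)
      sc.parallelize(range(1000), 1000).map(lambda x: len(b)).count()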
    
    run()
    ```
    
    Before:
    ```
    error: 'i' format requires -2147483648 <= number <= 2147483647
    ```
    
    After:
    ```
    pickle.PicklingError: Object too large to serialize: 'i' format requires -2147483648 <= number <= 2147483647
    ```
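
The bounds in this message are the range of a signed 32-bit integer, which suggests that a size or length is being packed with `struct`'s `i` format somewhere in the serialization path. That is an inference from the error text, not something this description states. A minimal sketch of the overflow in isolation, where `fits_in_int32` is a hypothetical helper and not part of Spark:

```python
import struct

INT32_MAX = 2**31 - 1            # 2147483647, the upper bound in the error text
payload_len = 8 * 1000000000     # size in bytes of the object from Example 1


def fits_in_int32(n):
    """Return True if n can be packed as a signed 32-bit integer."""
    try:
        struct.pack("!i", n)
        return True
    except struct.error:
        # Raised when n lies outside -2147483648..2147483647.
        return False


print(fits_in_int32(INT32_MAX))    # largest value that still fits
print(fits_in_int32(payload_len))  # 8e9 is well past the limit
```

An 8 GB payload is close to four times the largest length that fits, so any step that records its size as a 32-bit integer would fail.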
    
    Example 2:
    ```
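    def run():
      # Assumes `sc` is an active SparkContext (e.g. from the pyspark shell).
      import numpy.random as nr
      # Same ~8 GB payload, this time shipped as a broadcast variable.
      b = sc.broadcast(nr.bytes(8 * 1000000000))
      sc.parallelize(range(1000), 1000).map(lambda x: len(b.value)).count()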
    
    run()
    ```
    
    Before:
    ```
    SystemError: error return without exception set
    ```
    
    After:
    ```
    cPickle.PicklingError: Could not serialize broadcast: SystemError: error return without exception set
    ```
    
    ## How was this patch tested?
    
    Manually ran the two examples above.
    
    cc davies
    
    Author: Eric Liang <ekl@databricks.com>
    
    Closes #15026 from ericl/spark-17472.