    [SPARK-3478] [PySpark] Profile the Python tasks · c5414b68
    Davies Liu authored
This patch adds profiling support for PySpark; it shows the profiling results
before the driver exits. Here is one example:
    
    ```
    ============================================================
    Profile of RDD<id=3>
    ============================================================
             5146507 function calls (5146487 primitive calls) in 71.094 seconds
    
       Ordered by: internal time, cumulative time
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
           20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
           20    0.017    0.001    0.017    0.001 {cPickle.dumps}
         1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
           20    0.001    0.000    0.001    0.000 {reduce}
           21    0.001    0.000    0.001    0.000 {cPickle.loads}
           20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
           41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
           40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
           62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
           20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
           20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
        40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
           41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
           40    0.000    0.000   71.072    1.777 rdd.py:304(func)
           20    0.000    0.000   71.094    3.555 worker.py:82(process)
    ```
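PySpark's task profiler is built on the standard `cProfile`/`pstats` modules, so the report above is the familiar `pstats` table. A minimal standalone sketch of the same kind of report (plain Python, no Spark required; `merge_stats` is a made-up stand-in for the per-record work being profiled):

```python
import cProfile
import io
import pstats

def merge_stats(values):
    """Toy stand-in for the kind of per-record work profiled above."""
    total = 0
    for v in values:
        total += v * v
    return total

profiler = cProfile.Profile()
profiler.enable()
merge_stats(range(10000))
profiler.disable()

# Print stats ordered the same way as the PySpark report:
# internal time first, then cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("time", "cumulative").print_stats()
print(stream.getvalue())
```

The output has the same `ncalls / tottime / percall / cumtime` columns shown in the profile dumps in this commit message.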
    
Users can also show the profiling results manually with `sc.show_profiles()`, or dump
them to disk with `sc.dump_profiles(path)`, for example:
    
    ```python
    >>> sc._conf.set("spark.python.profile", "true")
    >>> rdd = sc.parallelize(range(100)).map(str)
    >>> rdd.count()
    100
    >>> sc.show_profiles()
    ============================================================
    Profile of RDD<id=1>
    ============================================================
             284 function calls (276 primitive calls) in 0.001 seconds
    
       Ordered by: internal time, cumulative time
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
            4    0.000    0.000    0.000    0.000 {reduce}
         12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
            4    0.000    0.000    0.000    0.000 {cPickle.loads}
            4    0.000    0.000    0.000    0.000 {cPickle.dumps}
          104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
            8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
           12    0.000    0.000    0.000    0.000 rdd.py:303(func)
    ```
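Because the profiler is `cProfile`-based, the files written by `sc.dump_profiles(path)` are ordinary `pstats` dumps that can be reloaded later for offline analysis. A sketch under that assumption (the dump here is created directly with `cProfile` rather than by Spark, and the `rdd_1.pstats` file name is illustrative):

```python
import cProfile
import os
import pstats
import tempfile

# Create a stand-in dump the way cProfile/pstats writes one.
profiler = cProfile.Profile()
profiler.enable()
sum(i * i for i in range(1000))
profiler.disable()

dump_dir = tempfile.mkdtemp()
dump_path = os.path.join(dump_dir, "rdd_1.pstats")  # hypothetical file name
pstats.Stats(profiler).dump_stats(dump_path)

# Later (or on another machine), reload the dump for analysis.
loaded = pstats.Stats(dump_path)
loaded.sort_stats("time", "cumulative").print_stats(5)
```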
Profiling is disabled by default; it can be enabled by setting `spark.python.profile=true`.
    
Users can also have the results dumped to disk automatically for later analysis by setting `spark.python.profile.dump=path_to_dump`.
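Both settings can be supplied at submit time with the usual `--conf` flags (a sketch assuming a standard `spark-submit` invocation; the dump path and script name are illustrative):

```shell
spark-submit \
  --conf spark.python.profile=true \
  --conf spark.python.profile.dump=/tmp/pyspark-profiles \
  my_job.py
```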
    
This is a bugfix of #2351. cc JoshRosen
    
    Author: Davies Liu <davies.liu@gmail.com>
    
    Closes #2556 from davies/profiler and squashes the following commits:
    
    e68df5a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
    858e74c [Davies Liu] compatitable with python 2.6
    7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
    2b0daf2 [Davies Liu] fix docs
    7a56c24 [Davies Liu] bugfix
    cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
    fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
    116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
    09d02c3 [Davies Liu] Merge branch 'master' into profiler
    c23865c [Davies Liu] Merge branch 'master' into profiler
    15d6f18 [Davies Liu] add docs for two configs
    dadee1a [Davies Liu] add docs string and clear profiles after show or dump
    4f8309d [Davies Liu] address comment, add tests
    0a5b6eb [Davies Liu] fix Python UDF
    4b20494 [Davies Liu] add profile for python