4c673c65
[SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper
Sandeep Singh authored
    
    
    ## What changes were proposed in this pull request?
In `JavaWrapper`'s destructor, make the Java gateway dereference the wrapped Java object, using `SparkContext._active_spark_context._gateway.detach`.
Also fix the parameter-copying bug by moving the `copy` method from `JavaModel` to `JavaParams`.
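A minimal sketch of the detach idea (not the actual `pyspark.ml.wrapper` code; it assumes `JavaWrapper` keeps its Py4J reference in `_java_obj` and that the active `SparkContext` exposes its Py4J gateway as `_gateway`):
```python
# Minimal sketch, not the exact upstream implementation.
from pyspark import SparkContext


class JavaWrapper(object):
    """Wraps a reference to a Java-side object obtained through Py4J."""

    def __init__(self, java_obj=None):
        self._java_obj = java_obj

    def __del__(self):
        # When the Python wrapper is garbage-collected, detach the Java
        # object from the Py4J gateway so the JVM drops its strong
        # reference and can reclaim the memory.
        sc = SparkContext._active_spark_context
        if sc is not None and self._java_obj is not None:
            sc._gateway.detach(self._java_obj)
```
Detaching in `__del__` ties the lifetime of the JVM-side object to the Python wrapper, which is why the loop in the test below no longer accumulates `StringIndexer` instances.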
    
    ## How was this patch tested?
```python
    import random, string
    from pyspark.ml.feature import StringIndexer
    
    l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
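# `spark` is the active SparkSession (created automatically in the PySpark shell)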
    df = spark.createDataFrame(l, ['string'])
    
    for i in range(50):
        indexer = StringIndexer(inputCol='string', outputCol='index')
        indexer.fit(df)
    ```
* Before: each loop iteration kept a strong reference to the `StringIndexer`'s Java object, so it could not be garbage-collected; the run halted midway.
* After: the Java objects are detached and garbage-collected, and the computation completes.
* Memory footprint was verified with a profiler.
* Added a test for parameter copying that was failing before this change (a hypothetical sketch follows below).
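For illustration, a hypothetical parameter-copy check in the spirit of that test (the actual test lives in PySpark's ML test suite; the param override used here is just an example):
```python
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol='string', outputCol='index')
# Copy with an extra param override; propagating params on copy is the
# behavior this change fixes for JavaParams subclasses.
copied = indexer.copy({indexer.handleInvalid: 'skip'})

assert copied.getInputCol() == 'string'       # existing params are carried over
assert copied.getHandleInvalid() == 'skip'    # the override is applied
assert indexer.getHandleInvalid() == 'error'  # the original is unchanged
```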
    
    Author: Sandeep Singh <sandeep@techaddict.me>
    Author: jkbradley <joseph.kurata.bradley@gmail.com>
    
    Closes #15843 from techaddict/SPARK-18274.
    
    (cherry picked from commit 78bb7f80)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>