Skip to content
  • yingjieMiao's avatar
    49bbdcb6
    [Spark] RDD take() method: overestimate too much · 49bbdcb6
    yingjieMiao authored
    In the comment (Line 1083), it says: "Otherwise, interpolate the number of partitions we need to try, but overestimate it by 50%."
    
    `(1.5 * num * partsScanned / buf.size).toInt` is the guess of "num of total partitions needed". In every iteration, we should consider the increment `(1.5 * num * partsScanned / buf.size).toInt - partsScanned`
    Existing implementation 'exponentially' grows `partsScanned ` ( roughly: `x_{n+1} >= (1.5 + 1) x_n`)
    
    This could be a performance problem. (unless this is the intended behavior)
    
    Author: yingjieMiao <yingjie@42go.com>
    
    Closes #2648 from yingjieMiao/rdd_take and squashes the following commits:
    
    d758218 [yingjieMiao] scala style fix
    a8e74bb [yingjieMiao] python style fix
    4b6e777 [yingjieMiao] infix operator style fix
    4391d3b [yingjieMiao] typo fix.
    692f4e6 [yingjieMiao] cap numPartsToTry
    c4483dc [yingjieMiao] style fix
    1d2c410 [yingjieMiao] also change in rdd.py and AsyncRDD
    d31ff7e [yingjieMiao] handle the edge case after 1 iteration
    a2aa36b [yingjieMiao] RDD take method: overestimate too much
    49bbdcb6
    [Spark] RDD take() method: overestimate too much
    yingjieMiao authored
    In the comment (Line 1083), it says: "Otherwise, interpolate the number of partitions we need to try, but overestimate it by 50%."
    
    `(1.5 * num * partsScanned / buf.size).toInt` is the guess of "num of total partitions needed". In every iteration, we should consider the increment `(1.5 * num * partsScanned / buf.size).toInt - partsScanned`
    Existing implementation 'exponentially' grows `partsScanned ` ( roughly: `x_{n+1} >= (1.5 + 1) x_n`)
    
    This could be a performance problem. (unless this is the intended behavior)
    
    Author: yingjieMiao <yingjie@42go.com>
    
    Closes #2648 from yingjieMiao/rdd_take and squashes the following commits:
    
    d758218 [yingjieMiao] scala style fix
    a8e74bb [yingjieMiao] python style fix
    4b6e777 [yingjieMiao] infix operator style fix
    4391d3b [yingjieMiao] typo fix.
    692f4e6 [yingjieMiao] cap numPartsToTry
    c4483dc [yingjieMiao] style fix
    1d2c410 [yingjieMiao] also change in rdd.py and AsyncRDD
    d31ff7e [yingjieMiao] handle the edge case after 1 iteration
    a2aa36b [yingjieMiao] RDD take method: overestimate too much
Loading