Skip to content
Snippets Groups Projects
  • Davies Liu's avatar
    6cf50768
    [SPARK-4548] []SPARK-4517] improve performance of python broadcast · 6cf50768
    Davies Liu authored
    Re-implement the Python broadcast using file:
    
    1) serialize the python object using cPickle, write into disks.
    2) Create a wrapper in JVM (for the dumped file), it read data from during serialization
    3) Using TorrentBroadcast or HttpBroadcast to transfer the data (compressed) into executors
    4) During deserialization, writing the data into disk.
    5) Passing the path into Python worker, read data from disk and unpickle it into python object, until the first access.
    
    It fixes the performance regression introduced in #2659, has similar performance as 1.1, but support object larger than 2G, also improve the memory efficiency (only one compressed copy in driver and executor).
    
    Testing with a 500M broadcast and 4 tasks (excluding the benefit from reused worker in 1.2):
    
             name |   1.1   | 1.2 with this patch |  improvement
    ---------|--------|---------|--------
          python-broadcast-w-bytes  |	25.20  |	9.33   |	170.13% |
            python-broadcast-w-set	  |     4.13	   |    4.50  |	-8.35%  |
    
    Testing with 100 tasks (16 CPUs):
    
             name |   1.1   | 1.2 with this patch |  improvement
    ---------|--------|---------|--------
         python-broadcast-w-bytes	| 38.16	| 8.40	 | 353.98%
            python-broadcast-w-set	| 23.29	| 9.59 |	142.80%
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #3417 from davies/pybroadcast and squashes the following commits:
    
    50a58e0 [Davies Liu] address comments
    b98de1d [Davies Liu] disable gc while unpickle
    e5ee6b9 [Davies Liu] support large string
    09303b8 [Davies Liu] read all data into memory
    dde02dd [Davies Liu] improve performance of python broadcast
    6cf50768
    History
    [SPARK-4548] []SPARK-4517] improve performance of python broadcast
    Davies Liu authored
    Re-implement the Python broadcast using file:
    
    1) serialize the python object using cPickle, write into disks.
    2) Create a wrapper in JVM (for the dumped file), it read data from during serialization
    3) Using TorrentBroadcast or HttpBroadcast to transfer the data (compressed) into executors
    4) During deserialization, writing the data into disk.
    5) Passing the path into Python worker, read data from disk and unpickle it into python object, until the first access.
    
    It fixes the performance regression introduced in #2659, has similar performance as 1.1, but support object larger than 2G, also improve the memory efficiency (only one compressed copy in driver and executor).
    
    Testing with a 500M broadcast and 4 tasks (excluding the benefit from reused worker in 1.2):
    
             name |   1.1   | 1.2 with this patch |  improvement
    ---------|--------|---------|--------
          python-broadcast-w-bytes  |	25.20  |	9.33   |	170.13% |
            python-broadcast-w-set	  |     4.13	   |    4.50  |	-8.35%  |
    
    Testing with 100 tasks (16 CPUs):
    
             name |   1.1   | 1.2 with this patch |  improvement
    ---------|--------|---------|--------
         python-broadcast-w-bytes	| 38.16	| 8.40	 | 353.98%
            python-broadcast-w-set	| 23.29	| 9.59 |	142.80%
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #3417 from davies/pybroadcast and squashes the following commits:
    
    50a58e0 [Davies Liu] address comments
    b98de1d [Davies Liu] disable gc while unpickle
    e5ee6b9 [Davies Liu] support large string
    09303b8 [Davies Liu] read all data into memory
    dde02dd [Davies Liu] improve performance of python broadcast
broadcast.py 3.93 KiB
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import os
import cPickle
import gc
from tempfile import NamedTemporaryFile


__all__ = ['Broadcast']


# Holds broadcasted data received from Java, keyed by its id.
_broadcastRegistry = {}


def _from_id(bid):
    from pyspark.broadcast import _broadcastRegistry
    if bid not in _broadcastRegistry:
        raise Exception("Broadcast variable '%s' not loaded!" % bid)
    return _broadcastRegistry[bid]


class Broadcast(object):

    """
    A broadcast variable created with L{SparkContext.broadcast()}.
    Access its value through C{.value}.

    Examples:

    >>> from pyspark.context import SparkContext
    >>> sc = SparkContext('local', 'test')
    >>> b = sc.broadcast([1, 2, 3, 4, 5])
    >>> b.value
    [1, 2, 3, 4, 5]
    >>> sc.parallelize([0, 0]).flatMap(lambda x: b.value).collect()
    [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
    >>> b.unpersist()

    >>> large_broadcast = sc.broadcast(range(10000))
    """

    def __init__(self, sc=None, value=None, pickle_registry=None, path=None):
        """
        Should not be called directly by users -- use L{SparkContext.broadcast()}
        instead.
        """
        if sc is not None:
            f = NamedTemporaryFile(delete=False, dir=sc._temp_dir)
            self._path = self.dump(value, f)
            self._jbroadcast = sc._jvm.PythonRDD.readBroadcastFromFile(sc._jsc, self._path)
            self._pickle_registry = pickle_registry
        else:
            self._jbroadcast = None
            self._path = path

    def dump(self, value, f):
        if isinstance(value, basestring):
            if isinstance(value, unicode):
                f.write('U')
                value = value.encode('utf8')
            else:
                f.write('S')
            f.write(value)
        else:
            f.write('P')
            cPickle.dump(value, f, 2)
        f.close()
        return f.name

    def load(self, path):
        with open(path, 'rb', 1 << 20) as f:
            flag = f.read(1)
            data = f.read()
            if flag == 'P':
                # cPickle.loads() may create lots of objects, disable GC
                # temporary for better performance
                gc.disable()
                try:
                    return cPickle.loads(data)
                finally:
                    gc.enable()
            else:
                return data.decode('utf8') if flag == 'U' else data

    @property
    def value(self):
        """ Return the broadcasted value
        """
        if not hasattr(self, "_value") and self._path is not None:
            self._value = self.load(self._path)
        return self._value

    def unpersist(self, blocking=False):
        """
        Delete cached copies of this broadcast on the executors.
        """
        if self._jbroadcast is None:
            raise Exception("Broadcast can only be unpersisted in driver")
        self._jbroadcast.unpersist(blocking)
        os.unlink(self._path)

    def __reduce__(self):
        if self._jbroadcast is None:
            raise Exception("Broadcast can only be serialized in driver")
        self._pickle_registry.add(self)
        return _from_id, (self._jbroadcast.id(),)


if __name__ == "__main__":
    import doctest
    doctest.testmod()