    [SPARK-4186] add binaryFiles and binaryRecords in Python · b41a39e2
    Davies Liu authored
    add binaryFiles() and binaryRecords() in Python
    ```
    binaryFiles(self, path, minPartitions=None):
        :: Developer API ::
    
        Read a directory of binary files from HDFS, a local file system
        (available on all nodes), or any Hadoop-supported file system URI
        as a byte array. Each file is read as a single record and returned
        in a key-value pair, where the key is the path of each file and the
        value is the content of each file.
    
        Note: Small files are preferred; large files are also allowed, but
        may cause poor performance.
    
    binaryRecords(self, path, recordLength):
        Load data from a flat binary file, assuming each record is a set of numbers
        with the specified numerical format (see ByteBuffer), and the number of
        bytes per record is constant.
    
        :param path: Directory to the input data files
        :param recordLength: The length at which to split the records
    ```
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #3078 from davies/binary and squashes the following commits:
    
    cd0bdbd [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
    3aa349b [Davies Liu] add experimental notes
    24e84b6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
    5ceaa8a [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
    1900085 [Davies Liu] bugfix
    bb22442 [Davies Liu] add binaryFiles and binaryRecords in Python
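
The fixed-length record model described for binaryRecords() above can be illustrated without a cluster. The following is a minimal plain-Python sketch, not Spark code: the record layout (three little-endian 32-bit integers) and the helper names `split_records` and `decode_record` are hypothetical and chosen only for illustration.

```python
import struct

# Hypothetical record layout (an assumption for this sketch):
# three little-endian 32-bit signed integers per record.
RECORD_FORMAT = "<3i"
RECORD_LENGTH = struct.calcsize(RECORD_FORMAT)  # constant number of bytes per record


def split_records(data, record_length):
    """Split a byte string into consecutive records of record_length bytes."""
    if len(data) % record_length != 0:
        # Design choice of this sketch: reject a trailing partial record.
        raise ValueError("data length is not a multiple of record_length")
    return [data[i:i + record_length] for i in range(0, len(data), record_length)]


def decode_record(record):
    """Decode one raw record into a tuple of integers."""
    return struct.unpack(RECORD_FORMAT, record)


# Build a small flat binary payload in memory: three records, back to back.
raw = b"".join(struct.pack(RECORD_FORMAT, n, n * 10, n * 100) for n in range(3))

records = split_records(raw, RECORD_LENGTH)
decoded = [decode_record(r) for r in records]
```

Assuming an existing SparkContext named `sc`, the same decoding step would presumably be applied to each record with something like `sc.binaryRecords(path, RECORD_LENGTH).map(decode_record)`, since each record arrives as a raw byte string of `recordLength` bytes.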