    [SPARK-2024] Add saveAsSequenceFile to PySpark · 94d1f46f
    Kan Zhang authored
    JIRA issue: https://issues.apache.org/jira/browse/SPARK-2024
    
    This PR is a follow-up to #455 and adds support for saving PySpark RDDs as SequenceFiles or through any Hadoop OutputFormat.
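
As a rough illustration of the save path this PR adds, the sketch below writes key-value pairs with `saveAsSequenceFile` and reads them back with `sc.sequenceFile`. The helper name `save_and_reload` is hypothetical, and `sc` is assumed to be a live `pyspark.SparkContext` supplied by the caller; this is a minimal sketch, not code from the PR.

```python
def save_and_reload(sc, pairs, path):
    """Write (key, value) pairs as a SequenceFile and read them back.

    Assumptions (not part of the PR text): `sc` is a live SparkContext,
    `pairs` holds simple types such as int and str, and `path` does not
    exist yet, since Hadoop output formats refuse to overwrite by default.
    """
    rdd = sc.parallelize(pairs)
    # Keys and values go through the default converter to Writables.
    rdd.saveAsSequenceFile(path)
    # Reading converts the Writables back to Python objects.
    return sorted(sc.sequenceFile(path).collect())
```

With a local context, for example `SparkContext("local", "demo")`, calling `save_and_reload(sc, [(1, "a"), (2, "b")], path)` would be expected to return the same pairs, assuming both types are handled by the default converter.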
    
    * Added RDD methods ```saveAsSequenceFile```, ```saveAsHadoopFile``` and ```saveAsHadoopDataset```, for both old and new MapReduce APIs.
    
    * Added a default converter that converts common data types to Writables. Users may specify custom converters to produce other data types.
    
    * No out-of-the-box support for reading or writing arrays, since ArrayWritable itself has no no-arg constructor for creating an empty instance when reading. Users need to provide ArrayWritable subtypes. When writing, custom converters that convert arrays to suitable ArrayWritable subtypes are also needed. When reading, the default converter converts any custom ArrayWritable subtype to ```Object[]```, which is then pickled to a Python tuple.
    
    * Added HBase and Cassandra output examples to show how custom output formats and converters can be used.
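
The default-converter and array bullets above can be pictured as a type lookup. The sketch below is an illustration only, not Spark code: the table follows the Writable/Python type pairs listed in the Spark programming guide, the names `DEFAULT_WRITABLES`, `default_writable_for` and `needs_custom_converter` are hypothetical, and the real conversion runs on the JVM side after unpickling.

```python
# Assumed pairing of Python types with Hadoop Writable classes.
# Integers too large for 32 bits would need LongWritable; omitted here.
DEFAULT_WRITABLES = {
    type(None): "org.apache.hadoop.io.NullWritable",
    bool: "org.apache.hadoop.io.BooleanWritable",
    int: "org.apache.hadoop.io.IntWritable",
    float: "org.apache.hadoop.io.DoubleWritable",
    str: "org.apache.hadoop.io.Text",
    bytearray: "org.apache.hadoop.io.BytesWritable",
    dict: "org.apache.hadoop.io.MapWritable",
}


def default_writable_for(value):
    """Return the assumed default Writable class name, or None if unmapped.

    The lookup uses the exact type, so True maps to BooleanWritable
    rather than IntWritable even though bool subclasses int.
    """
    return DEFAULT_WRITABLES.get(type(value))


def needs_custom_converter(value):
    """True for values with no default mapping, such as tuples and lists."""
    return default_writable_for(value) is None
```

Under these assumptions, `needs_custom_converter((1, 2))` is true, which mirrors the array caveat: array-like values need a user-provided ArrayWritable subtype and a matching converter.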
    
    cc MLnick mateiz ahirreddy pwendell
    
    Author: Kan Zhang <kzhang@apache.org>
    
    Closes #1338 from kanzhang/SPARK-2024 and squashes the following commits:
    
    c01e3ef [Kan Zhang] [SPARK-2024] code formatting
    6591e37 [Kan Zhang] [SPARK-2024] renaming pickled -> pickledRDD
    d998ad6 [Kan Zhang] [SPARK-2024] refectoring to get method params below 10
    57a7a5e [Kan Zhang] [SPARK-2024] correcting typo
    75ca5bd [Kan Zhang] [SPARK-2024] Better type checking for batch serialized RDD
    0bdec55 [Kan Zhang] [SPARK-2024] Refactoring newly added tests
    9f39ff4 [Kan Zhang] [SPARK-2024] Adding 2 saveAsHadoopDataset tests
    0c134f3 [Kan Zhang] [SPARK-2024] Test refactoring and adding couple unbatched cases
    7a176df [Kan Zhang] [SPARK-2024] Add saveAsSequenceFile to PySpark