    [SPARK-5138][SQL] Ensure schema can be inferred from a namedtuple · 1e42e96e
    Gabe Mulley authored
    When inferring the schema of an RDD that contains namedtuples, pyspark fails to recognize the records as namedtuples and raises an error.
    
    Example:
    
    ```python
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from collections import namedtuple
    import os
    
    sc = SparkContext()
    rdd = sc.textFile(os.path.join(os.getenv('SPARK_HOME'), 'README.md'))
    TextLine = namedtuple('TextLine', 'line length')
    tuple_rdd = rdd.map(lambda l: TextLine(line=l, length=len(l)))
    tuple_rdd.take(5)  # This works
    
    sqlc = SQLContext(sc)
    
    # The following line raises an error
    schema_rdd = sqlc.inferSchema(tuple_rdd)
    ```
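
As background for the example above: a namedtuple is an ordinary `tuple` subclass that carries its field names in a `_fields` attribute, so it can be told apart from a plain tuple without Spark. The following is a minimal sketch of that check in plain Python. `is_namedtuple` and `infer_fields` are hypothetical helpers introduced for illustration, not pyspark functions, and the sample record is made up.

```python
from collections import namedtuple


def is_namedtuple(obj):
    """Return True for instances of classes built by collections.namedtuple."""
    # namedtuple classes subclass tuple and expose their field names as _fields
    return isinstance(obj, tuple) and hasattr(obj, "_fields")


def infer_fields(record):
    """Pair each field name with the Python type name of its value (hypothetical helper)."""
    if not is_namedtuple(record):
        # Wrap the record in a 1-tuple so it is formatted as a single argument
        raise ValueError("not a namedtuple: %r" % (record,))
    return [(name, type(value).__name__) for name, value in zip(record._fields, record)]


TextLine = namedtuple("TextLine", "line length")
fields = infer_fields(TextLine(line="# Apache Spark", length=14))
print(fields)
```

None of this touches Spark; the failure reported here comes from the `sqlc.inferSchema(tuple_rdd)` call in the example above.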
    
    The error raised is:
    ```
      File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 107, in main
        process()
      File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 98, in process
        serializer.dump_stream(func(split_index, iterator), outfile)
      File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 227, in dump_stream
        vs = list(itertools.islice(iterator, batch))
      File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/rdd.py", line 1107, in takeUpToNumLeft
        yield next(iterator)
      File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/sql.py", line 816, in convert_struct
        raise ValueError("unexpected tuple: %s" % obj)
    TypeError: not all arguments converted during string formatting
    ```
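
The traceback ends in a `TypeError` even though its last frame is a `raise ValueError(...)`. The likely reading is that the `%` formatting fails before the `ValueError` is ever constructed: a tuple on the right-hand side of `%` is unpacked as the argument list, so a two-field namedtuple supplies two arguments to a single `%s`. The sketch below reproduces that behaviour outside Spark. The function names and the sample record are illustrative assumptions, and wrapping the object in a 1-tuple is just one way to avoid the problem, not necessarily what this commit does.

```python
from collections import namedtuple

TextLine = namedtuple("TextLine", "line length")
record = TextLine(line="# Apache Spark", length=14)


def format_unwrapped(obj):
    # If obj is a tuple, its elements are used as the format arguments
    return "unexpected tuple: %s" % obj


def format_wrapped(obj):
    # A 1-tuple makes obj the single argument for the single %s
    return "unexpected tuple: %s" % (obj,)


try:
    format_unwrapped(record)
    error_name = None
except TypeError as exc:
    # Two tuple elements for one %s placeholder
    error_name = type(exc).__name__

message = format_wrapped(record)
print(error_name)
print(message)
```

With the wrapped form the intended "unexpected tuple" message is built, so the `ValueError` can be raised with its text intact.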
    
    Author: Gabe Mulley <gabe@edx.org>
    
    Closes #3978 from mulby/inferschema-namedtuple and squashes the following commits:
    
    98c61cc [Gabe Mulley] Ensure exception message is populated correctly
    375d96b [Gabe Mulley] Ensure schema can be inferred from a namedtuple