    [SPARK-9733][SQL] Improve physical plan explain for data sources · 05d04e10
    Reynold Xin authored
    All data sources currently show up as "PhysicalRDD" in the physical plan explain output. It would be better to show the name of the data source.
    
    Without this patch:
    ```
    == Physical Plan ==
    NewAggregate with UnsafeHybridAggregationIterator ArrayBuffer(date#0, cat#1) ArrayBuffer((sum(CAST((CAST(count#2, IntegerType) + 1), LongType))2,mode=Final,isDistinct=false))
     Exchange hashpartitioning(date#0,cat#1)
      NewAggregate with UnsafeHybridAggregationIterator ArrayBuffer(date#0, cat#1) ArrayBuffer((sum(CAST((CAST(count#2, IntegerType) + 1), LongType))2,mode=Partial,isDistinct=false))
       PhysicalRDD [date#0,cat#1,count#2], MapPartitionsRDD[3] at
    ```
    
    With this patch:
    ```
    == Physical Plan ==
    TungstenAggregate(key=[date#0,cat#1], value=[(sum(CAST((CAST(count#2, IntegerType) + 1), LongType)),mode=Final,isDistinct=false)]
     Exchange hashpartitioning(date#0,cat#1)
      TungstenAggregate(key=[date#0,cat#1], value=[(sum(CAST((CAST(count#2, IntegerType) + 1), LongType)),mode=Partial,isDistinct=false)]
       ConvertToUnsafe
        Scan ParquetRelation[file:/scratch/rxin/spark/sales4][date#0,cat#1,count#2]
    ```
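
To make the difference between the two plans concrete, here is a minimal sketch in plain Python that reads explain text like the samples above and reports which data source the leaf node names. The helpers `leaf_operator` and `data_source_name` are hypothetical and written only for this illustration; they are not part of Spark. The sketch assumes that child operators are indented by spaces, as in the samples, and that a data source scan line begins with `Scan `.

```python
# Plan text copied from the two samples in this commit message.
# The "before" leaf line is truncated in the source and is kept as-is.
BEFORE = """
== Physical Plan ==
NewAggregate with UnsafeHybridAggregationIterator ArrayBuffer(date#0, cat#1) ArrayBuffer((sum(CAST((CAST(count#2, IntegerType) + 1), LongType))2,mode=Final,isDistinct=false))
 Exchange hashpartitioning(date#0,cat#1)
  NewAggregate with UnsafeHybridAggregationIterator ArrayBuffer(date#0, cat#1) ArrayBuffer((sum(CAST((CAST(count#2, IntegerType) + 1), LongType))2,mode=Partial,isDistinct=false))
   PhysicalRDD [date#0,cat#1,count#2], MapPartitionsRDD[3] at
"""

AFTER = """
== Physical Plan ==
TungstenAggregate(key=[date#0,cat#1], value=[(sum(CAST((CAST(count#2, IntegerType) + 1), LongType)),mode=Final,isDistinct=false)]
 Exchange hashpartitioning(date#0,cat#1)
  TungstenAggregate(key=[date#0,cat#1], value=[(sum(CAST((CAST(count#2, IntegerType) + 1), LongType)),mode=Partial,isDistinct=false)]
   ConvertToUnsafe
    Scan ParquetRelation[file:/scratch/rxin/spark/sales4][date#0,cat#1,count#2]
"""


def leaf_operator(plan_text):
    """Return the most deeply indented operator line, stripped of indentation."""
    best_depth, best_line = -1, None
    for line in plan_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and the "== Physical Plan ==" header.
        if not stripped or stripped.startswith("=="):
            continue
        depth = len(line) - len(line.lstrip(" "))
        if depth > best_depth:
            best_depth, best_line = depth, stripped
    return best_line


def data_source_name(leaf):
    """Return the relation name from a 'Scan <Relation>[...]' line, else None."""
    if leaf is None or not leaf.startswith("Scan "):
        return None
    # Everything between "Scan " and the first "[" is taken as the relation name.
    return leaf[len("Scan "):].split("[", 1)[0]


print(data_source_name(leaf_operator(BEFORE)))
print(data_source_name(leaf_operator(AFTER)))
```

With the old output the leaf is a generic `PhysicalRDD` line, so no data source name can be recovered from it. With the new output the leaf line itself names the relation being scanned.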
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes #8024 from rxin/SPARK-9733 and squashes the following commits:
    
    811b90e [Reynold Xin] Fixed Python test case.
    52cab77 [Reynold Xin] Cast.
    eea9ccc [Reynold Xin] Fix test case.
    fcecb22 [Reynold Xin] [SPARK-9733][SQL] Improve explain message for data source scan node.
dataframe.py 49.16 KiB