    [SPARK-21422][BUILD] Depend on Apache ORC 1.4.0 · 8c54f1eb
    Dongjoon Hyun authored
    ## What changes were proposed in this pull request?
    
    As with Parquet, this PR aims to make Apache Spark 2.3 depend on the latest Apache ORC 1.4. This brings two key benefits:
    
    - Stability: Apache ORC 1.4.0 includes many fixes, and we can rely more on the ORC community.
    - Maintainability: It reduces the Hive dependency, so old legacy code can be removed later.
    
    Later, adding the new ORCFileFormat in SPARK-20728 (#17980) will bring two more key benefits:
    - Usability: Users can use ORC data sources without the hive module, i.e., -Phive.
    - Speed: Spark ColumnarBatch and ORC RowBatch can be used together, which will be faster than the current implementation in Spark.
    
    ## How was this patch tested?
    
    Pass the Jenkins tests.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #18640 from dongjoon-hyun/SPARK-21422.