  1. Jan 19, 2016
  2. Jan 18, 2016
  3. Jan 17, 2016
  4. Jan 16, 2016
    • Davies Liu's avatar
      [SPARK-12796] [SQL] Whole stage codegen · 3c0d2365
      Davies Liu authored
      This is the initial work for whole-stage codegen. It supports Projection/Filter/Range; we will continue working on this to support more physical operators.
      
      A micro benchmark shows that a query with range, filter and projection could be 3X faster than before.
      
      It's turned on by default. For a tree that has at least two chained plans, a WholeStageCodegen node will be inserted into it. For example, the following plan
      ```
      Limit 10
      +- Project [(id#5L + 1) AS (id + 1)#6L]
         +- Filter ((id#5L & 1) = 1)
            +- Range 0, 1, 4, 10, [id#5L]
      ```
      will be translated into
      ```
      Limit 10
      +- WholeStageCodegen
            +- Project [(id#1L + 1) AS (id + 1)#2L]
               +- Filter ((id#1L & 1) = 1)
                  +- Range 0, 1, 4, 10, [id#1L]
      ```
      
      Here is the call graph to generate Java source for A and B (A supports codegen, but B does not):
      
      ```
        *   WholeStageCodegen       Plan A               FakeInput        Plan B
        * =========================================================================
        *
        * -> execute()
        *     |
        *  doExecute() -------->   produce()
        *                             |
        *                          doProduce()  -------> produce()
        *                                                   |
        *                                                doProduce() ---> execute()
        *                                                   |
        *                                                consume()
        *                          doConsume()  ------------|
        *                             |
        *  doConsume()  <-----    consume()
      ```
      
      A SparkPlan that supports codegen needs to implement doProduce() and doConsume():
      
      ```
      def doProduce(ctx: CodegenContext): (RDD[InternalRow], String)
      def doConsume(ctx: CodegenContext, child: SparkPlan, input: Seq[ExprCode]): String
      ```
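      The produce/consume pattern above can be sketched in a self-contained way. The following is a toy illustration only (hypothetical `ToyCodegen`/`ToyRange`/`ToyFilter` names, not Spark's actual API): each parent passes its consume fragment down, so the whole pipeline is fused into a single generated loop instead of one iterator per operator.

      ```scala
      // Toy produce/consume codegen: the parent's consume() code is spliced
      // into the child's loop body, yielding one fused loop of source text.
      trait ToyCodegen {
        def produce(consume: String => String): String
      }

      // Range drives the loop and invokes the downstream consume per row.
      class ToyRange(start: Long, end: Long) extends ToyCodegen {
        def produce(consume: String => String): String =
          s"""var i = ${start}L
             |while (i < ${end}L) {
             |${consume("i")}
             |  i += 1
             |}""".stripMargin
      }

      // Filter wraps the downstream consume in a predicate check.
      class ToyFilter(pred: String => String, child: ToyCodegen) extends ToyCodegen {
        def produce(consume: String => String): String =
          child.produce(row => s"  if (${pred(row)}) {\n  ${consume(row)}\n  }")
      }

      object ToyCodegenDemo {
        def main(args: Array[String]): Unit = {
          val plan = new ToyFilter(r => s"($r & 1) == 1", new ToyRange(0, 10))
          // Prints the "generated" source: one fused while-loop containing
          // the filter condition and the final consume action.
          println(plan.produce(row => s"  out += $row + 1"))
        }
      }
      ```

      In Spark the generated Java source is then compiled (e.g. by Janino) and executed; the toy above only shows how the fragments compose.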
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10735 from davies/whole2.
      3c0d2365
    • Jeff Lam's avatar
      [SPARK-12722][DOCS] Fixed typo in Pipeline example · 86972fa5
      Jeff Lam authored
      http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline
      ```
      val sameModel = Pipeline.load("/tmp/spark-logistic-regression-model")
      ```
      should be
      ```
      val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
      ```
      cc: jkbradley
      
      Author: Jeff Lam <sha0lin@alumni.carnegiemellon.edu>
      
      Closes #10769 from Agent007/SPARK-12722.
      86972fa5
    • Wenchen Fan's avatar
      [SPARK-12856] [SQL] speed up hashCode of unsafe array · 2f7d0b68
      Wenchen Fan authored
      We used to iterate over the bytes to calculate the hashCode, but now we have `Murmur3_x86_32.hashUnsafeBytes`, which does not require the bytes to be word aligned; we should use that instead.
      
      A simple benchmark shows it's about 3X faster. Benchmark code: https://gist.github.com/cloud-fan/fa77713ccebf0823b2ab#file-arrayhashbenchmark-scala
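      The idea can be sketched with the Scala standard library (this uses `scala.util.hashing.MurmurHash3`, not Spark's `Murmur3_x86_32`, so the hash values differ; `slowHash` is a hypothetical stand-in for the old per-byte loop):

      ```scala
      import scala.util.hashing.MurmurHash3

      object ArrayHashDemo {
        // Before: mix the hash one byte at a time, like the old loop did.
        def slowHash(bytes: Array[Byte], seed: Int): Int = {
          var h = seed
          var i = 0
          while (i < bytes.length) {
            h = MurmurHash3.mix(h, bytes(i).toInt)
            i += 1
          }
          MurmurHash3.finalizeHash(h, bytes.length)
        }

        def main(args: Array[String]): Unit = {
          // 13 bytes: deliberately not a multiple of the word size.
          val data = Array.tabulate[Byte](13)(_.toByte)
          // After: hash the whole region in one call; no per-byte loop
          // overhead and no word-alignment requirement on the input.
          println(MurmurHash3.bytesHash(data, 42))
          println(slowHash(data, 42))
        }
      }
      ```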
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10784 from cloud-fan/array-hashcode.
      2f7d0b68
  5. Jan 15, 2016
    • Davies Liu's avatar
      [SPARK-12840] [SQL] Support passing arbitrary objects (not just expressions)... · 242efb75
      Davies Liu authored
      [SPARK-12840] [SQL] Support passing arbitrary objects (not just expressions) into code generated classes
      
      This is a refactor to support codegen for aggregation and broadcast join.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10777 from davies/rename2.
      242efb75
    • Nong Li's avatar
      [SPARK-12644][SQL] Update parquet reader to be vectorized. · 9039333c
      Nong Li authored
      This inlines a few of the Parquet decoders and adds vectorized APIs to support decoding in batch.
      A few properties of the Parquet encodings make this much more efficient. RLE encodings, in particular, are very well suited for batch decoding, as are the Parquet 2.0 encodings.
      
      This is a work in progress and does not affect the current execution. In subsequent patches, we will
      support more encodings and types before enabling this.
      
      Simple benchmarks indicate this can decode single ints more than 3x faster.
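      Why RLE suits batch decoding can be shown with a toy sketch (this is not Parquet's actual RLE/bit-packing format; `decodeBatch` and the `(value, count)` run representation are simplified for illustration): a run is materialized with one bulk fill instead of `count` separate per-value decode calls.

      ```scala
      // Toy run-length decoder: decode runs of (value, repeatCount)
      // into a flat output array in one pass.
      object RleBatchDemo {
        def decodeBatch(runs: Seq[(Int, Int)]): Array[Int] = {
          val out = new Array[Int](runs.map(_._2).sum)
          var pos = 0
          for ((value, count) <- runs) {
            // One bulk fill per run, instead of one call per value.
            java.util.Arrays.fill(out, pos, pos + count, value)
            pos += count
          }
          out
        }

        def main(args: Array[String]): Unit = {
          val decoded = decodeBatch(Seq((7, 3), (0, 2), (9, 1)))
          println(decoded.mkString(",")) // 7,7,7,0,0,9
        }
      }
      ```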
      
      Author: Nong Li <nong@databricks.com>
      Author: Nong <nongli@gmail.com>
      
      Closes #10593 from nongli/spark-12644.
      9039333c
    • Wenchen Fan's avatar
      [SPARK-12649][SQL] support reading bucketed table · 3b5ccb12
      Wenchen Fan authored
      This PR adds the support to read bucketed tables, and correctly populate `outputPartitioning`, so that we can avoid shuffle for some cases.
      
      TODO(follow-up PRs):
      
      * bucket pruning
      * avoid shuffle for bucketed table joins when using any superset of the bucketing key
       (we should revisit this after https://issues.apache.org/jira/browse/SPARK-12704 is fixed)
      * recognize hive bucketed table
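      A minimal sketch of why bucketing lets the planner skip a shuffle (hypothetical `bucketId` helper, not Spark's internals): when both sides of a join were written with the same bucketing function and bucket count, rows with equal keys already sit in the same bucket, so no repartitioning is needed.

      ```scala
      // Toy bucket assignment: equal keys always land in the same bucket,
      // regardless of which table they come from.
      object BucketDemo {
        def bucketId(key: Int, numBuckets: Int): Int = {
          val h = key.hashCode % numBuckets
          if (h < 0) h + numBuckets else h // keep the id non-negative
        }

        def main(args: Array[String]): Unit = {
          val numBuckets = 4
          // Key 42 maps to the same bucket id on both sides of a join,
          // so a bucket-by-bucket join needs no shuffle.
          println(bucketId(42, numBuckets))
        }
      }
      ```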
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10604 from cloud-fan/bucket-read.
      3b5ccb12
    • Josh Rosen's avatar
      [SPARK-12842][TEST-HADOOP2.7] Add Hadoop 2.7 build profile · 8dbbf3e7
      Josh Rosen authored
      This patch adds a Hadoop 2.7 build profile in order to let us automate tests against that version.
      
      /cc rxin srowen
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10775 from JoshRosen/add-hadoop-2.7-profile.
      8dbbf3e7
    • Yin Huai's avatar
      [SPARK-12833][HOT-FIX] Reset the locale after we set it. · f6ddbb36
      Yin Huai authored
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #10778 from yhuai/resetLocale.
      f6ddbb36
    • Yanbo Liang's avatar
      [SPARK-11925][ML][PYSPARK] Add PySpark missing methods for ml.feature during Spark 1.6 QA · 5f843781
      Yanbo Liang authored
      Add PySpark missing methods and params for ml.feature:
      * ```RegexTokenizer``` should support setting ```toLowercase```.
      * ```MinMaxScalerModel``` should support output ```originalMin``` and ```originalMax```.
      * ```PCAModel``` should support output ```pc```.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9908 from yanboliang/spark-11925.
      5f843781
    • Herman van Hovell's avatar
      [SPARK-12575][SQL] Grammar parity with existing SQL parser · 7cd7f220
      Herman van Hovell authored
      In this PR the new CatalystQl parser stack reaches grammar parity with the old Parser-Combinator based SQL Parser. This PR also replaces all uses of the old Parser, and removes it from the code base.
      
      Although the existing Hive and SQL parser dialects were mostly the same, some kinks had to be worked out:
      - The SQL Parser allowed syntax like ```APPROXIMATE(0.01) COUNT(DISTINCT a)```. To make this work we would have needed to hardcode approximate operators in the parser, or create an approximate expression. ```APPROXIMATE_COUNT_DISTINCT(a, 0.01)``` would also do the job and is much easier to maintain, so this PR **removes** this keyword.
      - The old SQL Parser supports ```LIMIT``` clauses in nested queries. This is **not supported** anymore. See https://github.com/apache/spark/pull/10689 for the rationale for this.
      - Hive supports a charset-name/literal combination; for instance the expression ```_ISO-8859-1 0x4341464562616265``` yields the string ```CAFEbabe```. Hive only allows charset names that start with an underscore. This is quite annoying in Spark, because as soon as you use a tuple, names will start with an underscore. In this PR we **remove** this feature from the parser. It would be quite easy to implement such a feature as an Expression later on.
      - Hive and the SQL Parser treat decimal literals differently. Hive will turn any decimal into a ```Double``` whereas the SQL Parser would convert a non-scientific decimal into a ```BigDecimal```, and would turn a scientific decimal into a Double. We follow Hive's behavior here. The new parser supports a big decimal literal, for instance: ```81923801.42BD```, which can be used when a big decimal is needed.
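      The decimal-literal rule described above can be sketched as follows (a hypothetical `parseDecimalLiteral` helper; the real grammar lives in the CatalystQl parser stack): a plain decimal becomes a `Double`, and a `BD` suffix opts in to `BigDecimal`.

      ```scala
      // Toy literal rule: Hive-style Double by default, BigDecimal on "BD".
      object DecimalLiteralDemo {
        def parseDecimalLiteral(s: String): Any =
          if (s.toUpperCase.endsWith("BD")) BigDecimal(s.dropRight(2))
          else s.toDouble // both plain and scientific decimals become Double

        def main(args: Array[String]): Unit = {
          println(parseDecimalLiteral("81923801.42BD")) // BigDecimal
          println(parseDecimalLiteral("81923801.42"))   // Double
          println(parseDecimalLiteral("1e2"))           // scientific -> Double
        }
      }
      ```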
      
      cc rxin viirya marmbrus yhuai cloud-fan
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #10745 from hvanhovell/SPARK-12575-2.
      7cd7f220
    • Wenchen Fan's avatar
      [SQL][MINOR] BoundReference do not need to be NamedExpression · 3f1c58d6
      Wenchen Fan authored
      We made it a `NamedExpression` to work around some hacky cases a long time ago, and now it seems safe to remove it.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10765 from cloud-fan/minor.
      3f1c58d6
    • Alex Bozarth's avatar
      [SPARK-12716][WEB UI] Add a TOTALS row to the Executors Web UI · 61c45876
      Alex Bozarth authored
      Added a Totals table to the top of the page to display the totals of each applicable column in the executors table.
      
      Old Description:
      ~~Created a TOTALS row containing the totals of each column in the executors UI. By default the TOTALS row appears at the top of the table. When a column is sorted the TOTALS row will always sort to either the top or bottom of the table.~~
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #10668 from ajbozarth/spark12716.
      61c45876
    • Julien Baley's avatar
      Fix typo · 0bb73554
      Julien Baley authored
      disvoered => discovered
      
      Author: Julien Baley <julien.baley@gmail.com>
      
      Closes #10773 from julienbaley/patch-1.
      0bb73554
    • Yin Huai's avatar
      [SPARK-12833][HOT-FIX] Fix scala 2.11 compilation. · 513266c0
      Yin Huai authored
      Seems https://github.com/apache/spark/commit/5f83c6991c95616ecbc2878f8860c69b2826f56c breaks scala 2.11 compilation.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #10774 from yhuai/fixScala211Compile.
      513266c0
    • Reynold Xin's avatar
      [SPARK-12667] Remove block manager's internal "external block store" API · ad1503f9
      Reynold Xin authored
      This pull request removes the external block store API. This is rarely used, and the file system interface is actually a better, more standard way to interact with external storage systems.
      
      There are some other things to remove also, as pointed out by JoshRosen. We will do those as follow-up pull requests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10752 from rxin/remove-offheap.
      ad1503f9
    • Hossein's avatar
      [SPARK-12833][SQL] Initial import of spark-csv · 5f83c699
      Hossein authored
      CSV is the most common data format in the "small data" world. It is often the first format people want to try when they see Spark on a single node. Having to rely on a 3rd party component for this leads to poor user experience for new users. This PR merges the popular spark-csv data source package (https://github.com/databricks/spark-csv) with SparkSQL.
      
      This is a first PR to bring the functionality to the Spark 2.0 master branch. We will complete the items outlined in the design document (see JIRA attachment) in follow-up pull requests.
      
      Author: Hossein <hossein@databricks.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10766 from rxin/csv.
      5f83c699
    • Davies Liu's avatar
      [MINOR] [SQL] GeneratedExpressionCode -> ExprCode · c5e7076d
      Davies Liu authored
      GeneratedExpressionCode is too long
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10767 from davies/renaming.
      c5e7076d
    • Oscar D. Lara Yejas's avatar
      [SPARK-11031][SPARKR] Method str() on a DataFrame · ba4a6419
      Oscar D. Lara Yejas authored
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
      Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
      Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com>
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
      
      Closes #9613 from olarayej/SPARK-11031.
      ba4a6419