  1. Jan 19, 2016
    • [SPARK-2750][WEB UI] Add https support to the Web UI · 43f1d59e
      scwf authored
      Author: scwf <wangfei1@huawei.com>
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      Author: WangTaoTheTonic <wangtao111@huawei.com>
      Author: w00228970 <wangfei1@huawei.com>
      
      Closes #10238 from vanzin/SPARK-2750.
    • [BUILD] Runner for spark packages · efd7eed3
      Michael Armbrust authored
      This is a convenience method added to the SBT build for developers, though if people think it's useful we could consider adding an official script that runs using the assembly instead of compiling on demand.  It simply compiles Spark (without requiring an assembly) and invokes Spark Submit to download / run the package.
      
      Example Usage:
      ```
      $ build/sbt
      > sparkPackage com.databricks:spark-sql-perf_2.10:0.2.4 com.databricks.spark.sql.perf.RunBenchmark --help
      ```
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #10834 from marmbrus/sparkPackageRunner.
    • [SPARK-11295] Add packages to JUnit output for Python tests · c6f971b4
      Gábor Lipták authored
      This improves grouping/display of test case results.
      
      Author: Gábor Lipták <gliptak@gmail.com>
      
      Closes #9263 from gliptak/SPARK-11295.
    • [SPARK-12816][SQL] De-alias type when generating schemas · c78e2080
      Jakob Odersky authored
      Call `dealias` on local types to fix schema generation for abstract type members, such as
      
      ```scala
      type KeyValue = (Int, String)
      ```
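      
      A minimal sketch of the effect (illustrative, not the PR's exact code), using Scala reflection's `dealias`:
      
      ```scala
      import scala.reflect.runtime.universe._
      
      type KeyValue = (Int, String)
      
      // The alias itself hides the underlying tuple type; dealias resolves it
      // so schema inference can treat it as a struct of (Int, String).
      println(typeOf[KeyValue])          // prints the alias: KeyValue
      println(typeOf[KeyValue].dealias)  // prints the underlying type: (Int, String)
      ```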
      
      Add simple test
      
      Author: Jakob Odersky <jodersky@gmail.com>
      
      Closes #10749 from jodersky/aliased-schema.
    • [SPARK-12560][SQL] SqlTestUtils.stripSparkFilter needs to copy utf8strings · 4dbd3161
      Imran Rashid authored
      See https://issues.apache.org/jira/browse/SPARK-12560
      
      This isn't causing any problems at the moment because the tests for string predicate pushdown are currently disabled.  I ran into it while trying to turn them back on with a different version of Parquet; figured it was good to fix now in any case.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #10510 from squito/SPARK-12560.
    • [SPARK-12867][SQL] Nullability of Intersect can be stricter · b72e01e8
      gatorsmile authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-12867
      
      When intersecting one nullable column with one non-nullable column, the result will not contain any null. Thus, we can make nullability of `intersect` stricter.
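      
      A hedged sketch of the stricter rule (the PR's exact code may differ): a column of the intersection is nullable only if it is nullable on both sides.
      
      ```scala
      import org.apache.spark.sql.catalyst.expressions.Attribute
      import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
      
      // A null can survive INTERSECT only if both inputs can produce it,
      // so AND the nullability of each pair of columns.
      def intersectOutput(left: LogicalPlan, right: LogicalPlan): Seq[Attribute] =
        left.output.zip(right.output).map { case (l, r) =>
          l.withNullability(l.nullable && r.nullable)
        }
      ```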
      
      liancheng Could you please check if the code changes are appropriate? Also added test cases to verify the results. Thanks!
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #10812 from gatorsmile/nullabilityIntersect.
    • [SPARK-12804][ML] Fix LogisticRegression with FitIntercept on all same label training data · 2388de51
      Feynman Liang authored
      CC jkbradley mengxr dbtsai
      
      Author: Feynman Liang <feynman.liang@gmail.com>
      
      Closes #10743 from feynmanliang/SPARK-12804.
    • [SPARK-12887] Do not expose var's in TaskMetrics · b122c861
      Andrew Or authored
      This is a step in implementing SPARK-10620, which migrates TaskMetrics to accumulators.
      
      TaskMetrics has a bunch of vars; some are fully public, some are `private[spark]`. This is bad coding style that makes it easy to accidentally overwrite previously set metrics. This has happened a few times in the past and caused bugs that were difficult to debug.
      
      Instead, we should have get-or-create semantics, which are more readily understandable. This makes sense in the case of TaskMetrics because these are just aggregated metrics that we want to collect throughout the task, so it doesn't matter who's incrementing them.
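      
      A minimal sketch of the get-or-create pattern (class and method names are illustrative, not the actual TaskMetrics API):
      
      ```scala
      class InputMetrics { var bytesRead: Long = 0L }
      
      class TaskMetricsSketch {
        // The var stays private; callers can only fetch the existing instance
        // or create it exactly once, so a set metric can't be silently replaced.
        private var _inputMetrics: Option[InputMetrics] = None
      
        def registerInputMetrics(): InputMetrics = _inputMetrics.getOrElse {
          val m = new InputMetrics
          _inputMetrics = Some(m)
          m
        }
      }
      ```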
      
      Parent PR: #10717
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: andrewor14 <andrew@databricks.com>
      
      Closes #10815 from andrewor14/get-or-create-metrics.
    • [SPARK-12870][SQL] better format bucket id in file name · e14817b5
      Wenchen Fan authored
      For a normal parquet file without buckets, the file name ends with a jobUUID, which may be all numbers and mistakenly regarded as a bucket id. This PR improves the format of the bucket id in the file name by using a different separator, `_`, so that the regex is more robust.
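      
      A hypothetical illustration (patterns assumed, not the PR's exact regex):
      
      ```scala
      // A digits-only job UUID at the end of a name like "part-r-00000-12345"
      // could match a hyphen-based bucket-id pattern; with "_" as the
      // separator, only a real bucket suffix like "_00002" matches.
      val bucketPattern = """.*_(\d+)""".r
      
      "part-r-00000_00002" match {
        case bucketPattern(id) => println(s"bucket id: ${id.toInt}")
        case _                 => println("not a bucketed file")
      }
      ```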
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10799 from cloud-fan/fix-bucket.
    • [SPARK-11944][PYSPARK][MLLIB] python mllib.clustering.bisecting k means · 0ddba6d8
      Holden Karau authored
      From the coverage issues for 1.6: add a Python API for mllib.clustering.BisectingKMeans.
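      
      The new Python API wraps the existing Scala implementation; a hedged sketch of the equivalent Scala usage:
      
      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.mllib.clustering.BisectingKMeans
      import org.apache.spark.mllib.linalg.Vectors
      
      // Cluster a handful of 2-D points into k = 2 clusters (toy data only).
      def example(sc: SparkContext): Unit = {
        val data = sc.parallelize(Seq(
          Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
          Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)))
        val model = new BisectingKMeans().setK(2).run(data)
        model.clusterCenters.foreach(println)
      }
      ```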
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #10150 from holdenk/SPARK-11937-python-api-coverage-SPARK-11944-python-mllib.clustering.BisectingKMeans.
    • [MLLIB] Fix CholeskyDecomposition assertion's message · ebd9ce0f
      Wojciech Jurczyk authored
      Change the assertion's message so it's consistent with the code. The old message says that the invoked method was lapack.dports, when in fact it was the lapack.dppsv method.
      
      Author: Wojciech Jurczyk <wojtek.jurczyk@gmail.com>
      
      Closes #10818 from wjur/wjur/rename_error_message.
    • [SPARK-7683][PYSPARK] Confusing behavior of fold function of RDD in pyspark · d8c4b00a
      Sean Owen authored
      Fix the order of arguments that PySpark's RDD.fold passes to its op - it should be (acc, obj), as in the other implementations.
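      
      A minimal Scala sketch of the (acc, obj) convention the fix aligns with (the op's first argument is the accumulated value):
      
      ```scala
      import org.apache.spark.SparkContext
      
      // With a single partition, folding strings makes the argument order
      // visible: acc accumulates left-to-right, x is the next element.
      def foldExample(sc: SparkContext): String =
        sc.parallelize(Seq("a", "b", "c"), numSlices = 1)
          .fold("")((acc, x) => acc + x)  // "abc"
      ```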
      
      Obviously, this is a potentially breaking change, so it can only happen in 2.x
      
      CC davies
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #10771 from srowen/SPARK-7683.
    • [SQL][MINOR] Fix one little mismatched comment according to the codes in interface.scala · c00744e6
      proflin authored
      Author: proflin <proflin.me@gmail.com>
      
      Closes #10824 from proflin/master.
  2. Jan 18, 2016
  3. Jan 17, 2016
  4. Jan 16, 2016
    • [SPARK-12796] [SQL] Whole stage codegen · 3c0d2365
      Davies Liu authored
      This is the initial work for whole stage codegen. It supports Projection/Filter/Range; we will continue working on this to support more physical operators.
      
      A micro benchmark shows that a query with range, filter and projection could be 3X faster than before.
      
      It's turned on by default. For a tree that has at least two chained plans, a WholeStageCodegen will be inserted into it. For example, the following plan
      ```
      Limit 10
      +- Project [(id#5L + 1) AS (id + 1)#6L]
         +- Filter ((id#5L & 1) = 1)
            +- Range 0, 1, 4, 10, [id#5L]
      ```
      will be translated into
      ```
      Limit 10
      +- WholeStageCodegen
            +- Project [(id#1L + 1) AS (id + 1)#2L]
               +- Filter ((id#1L & 1) = 1)
                  +- Range 0, 1, 4, 10, [id#1L]
      ```
      
      Here is the call graph for generating Java source for A and B (A supports codegen, but B does not):
      
      ```
        *   WholeStageCodegen       Plan A               FakeInput        Plan B
        * =========================================================================
        *
        * -> execute()
        *     |
        *  doExecute() -------->   produce()
        *                             |
        *                          doProduce()  -------> produce()
        *                                                   |
        *                                                doProduce() ---> execute()
        *                                                   |
        *                                                consume()
        *                          doConsume()  ------------|
        *                             |
        *  doConsume()  <-----    consume()
      ```
      
      A SparkPlan that supports codegen needs to implement doProduce() and doConsume():
      
      ```
      def doProduce(ctx: CodegenContext): (RDD[InternalRow], String)
      def doConsume(ctx: CodegenContext, child: SparkPlan, input: Seq[ExprCode]): String
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10735 from davies/whole2.
    • [SPARK-12722][DOCS] Fixed typo in Pipeline example · 86972fa5
      Jeff Lam authored
      http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline
      ```
      val sameModel = Pipeline.load("/tmp/spark-logistic-regression-model")
      ```
      should be
      ```
      val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
      ```
      cc: jkbradley
      
      Author: Jeff Lam <sha0lin@alumni.carnegiemellon.edu>
      
      Closes #10769 from Agent007/SPARK-12722.
    • [SPARK-12856] [SQL] speed up hashCode of unsafe array · 2f7d0b68
      Wenchen Fan authored
      We used to iterate over the bytes to calculate the hashCode, but now we have `Murmur3_x86_32.hashUnsafeBytes`, which doesn't require the bytes to be word aligned, so we should use that instead.
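      
      A minimal sketch of the replacement call (standard byte-array offset from Spark's `Platform` assumed):
      
      ```scala
      import org.apache.spark.unsafe.Platform
      import org.apache.spark.unsafe.hash.Murmur3_x86_32
      
      // Hash all bytes of the array in one call; unlike a word-based loop,
      // hashUnsafeBytes accepts input that is not word aligned.
      val bytes = Array[Byte](1, 2, 3, 4, 5)
      val hash = Murmur3_x86_32.hashUnsafeBytes(bytes, Platform.BYTE_ARRAY_OFFSET, bytes.length, 42)
      ```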
      
      A simple benchmark shows it's about 3X faster; benchmark code: https://gist.github.com/cloud-fan/fa77713ccebf0823b2ab#file-arrayhashbenchmark-scala
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10784 from cloud-fan/array-hashcode.
  5. Jan 15, 2016
    • [SPARK-12840] [SQL] Support passing arbitrary objects (not just expressions) into code generated classes · 242efb75
      Davies Liu authored
      This is a refactor to support codegen for aggregation and broadcast join.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10777 from davies/rename2.
    • [SPARK-12644][SQL] Update parquet reader to be vectorized. · 9039333c
      Nong Li authored
      This inlines a few of the Parquet decoders and adds vectorized APIs to support decoding in batch.
      A few particulars of the Parquet encodings make this much more efficient: RLE encodings, in
      particular, are very well suited to batch decoding, as are the Parquet 2.0 encodings.
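      
      An illustrative-only sketch (not the reader's actual code) of why RLE suits batch decoding:
      
      ```scala
      // One (value, runLength) pair fills a whole slice of the output vector
      // in a single call, instead of a branch-per-value decode loop.
      def decodeRun(value: Int, runLength: Int, out: Array[Int], offset: Int): Int = {
        java.util.Arrays.fill(out, offset, offset + runLength, value)
        offset + runLength  // next write position
      }
      ```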
      
      This is a work in progress and does not affect the current execution. In subsequent patches, we will
      support more encodings and types before enabling this.
      
      Simple benchmarks indicate this can decode single ints more than 3x faster.
      
      Author: Nong Li <nong@databricks.com>
      Author: Nong <nongli@gmail.com>
      
      Closes #10593 from nongli/spark-12644.
    • [SPARK-12649][SQL] support reading bucketed table · 3b5ccb12
      Wenchen Fan authored
      This PR adds support for reading bucketed tables and correctly populates `outputPartitioning`, so that we can avoid a shuffle in some cases.
      
      TODO(follow-up PRs):
      
      * bucket pruning
      * avoid shuffle for bucketed table joins when using any superset of the bucketing key
       (we should revisit this after https://issues.apache.org/jira/browse/SPARK-12704 is fixed)
      * recognize hive bucketed table
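      
      A hedged sketch of the intended effect (assumes bucketed writes via `bucketBy` are available at this snapshot):
      
      ```scala
      import org.apache.spark.sql.SQLContext
      
      // Bucket both tables by the join key into the same number of buckets;
      // with outputPartitioning populated on the read path, the join on "key"
      // no longer needs to shuffle either side.
      def bucketedJoin(sqlContext: SQLContext): Unit = {
        val df = sqlContext.range(0, 1000).selectExpr("id AS key", "id * 2 AS value")
        df.write.bucketBy(8, "key").saveAsTable("t1")
        df.write.bucketBy(8, "key").saveAsTable("t2")
      
        sqlContext.table("t1").join(sqlContext.table("t2"), "key").explain()
      }
      ```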
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10604 from cloud-fan/bucket-read.
    • [SPARK-12842][TEST-HADOOP2.7] Add Hadoop 2.7 build profile · 8dbbf3e7
      Josh Rosen authored
      This patch adds a Hadoop 2.7 build profile in order to let us automate tests against that version.
      
      /cc rxin srowen
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10775 from JoshRosen/add-hadoop-2.7-profile.