  1. Jan 19, 2016
• [SPARK-12337][SPARKR] Implement dropDuplicates() method of DataFrame in SparkR. · 3ac64828
      Sun Rui authored
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #10309 from sun-rui/SPARK-12337.
• [SPARK-12168][SPARKR] Add automated tests for conflicted function in R · 37fefa66
      felixcheung authored
Currently this is reported when loading the SparkR package in R (we would probably also add is.nan):
      ```
      Loading required package: methods
      
      Attaching package: ‘SparkR’
      
      The following objects are masked from ‘package:stats’:
      
          cov, filter, lag, na.omit, predict, sd, var
      
      The following objects are masked from ‘package:base’:
      
          colnames, colnames<-, intersect, rank, rbind, sample, subset,
          summary, table, transform
      ```
      
Adding this test gives us an automated way to track changes to masked methods.
The second part of this test checks for functions that would not be accessible without a namespace/package prefix.
      
      Incidentally, this might point to how we would fix those inaccessible functions in base or stats.
Looking for feedback on adding this test.
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #10171 from felixcheung/rmaskedtest.
• [SPARK-12770][SQL] Implement rules for branch elimination for CaseWhen · 3e84ef0a
      Reynold Xin authored
      The three optimization cases are:
      
      1. If the first branch's condition is a true literal, remove the CaseWhen and use the value from that branch.
      2. If a branch's condition is a false or null literal, remove that branch.
      3. If only the else branch is left, remove the CaseWhen and use the value from the else branch.
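A minimal, self-contained sketch of how the three rules compose (illustrative types, not Catalyst's actual classes; branches are modeled as (condition, value) pairs):

```scala
sealed trait Expr
case class Literal(value: Any) extends Expr
case class CaseWhen(branches: Seq[(Expr, Expr)], elseValue: Option[Expr]) extends Expr

def simplifyCaseWhen(cw: CaseWhen): Expr = {
  // Rule 2: drop branches whose condition is a false or null literal.
  val kept = cw.branches.filter {
    case (Literal(false) | Literal(null), _) => false
    case _ => true
  }
  kept match {
    // Rule 1: a leading true-literal condition always wins.
    case (Literal(true), value) +: _ => value
    // Rule 3: no branches left, so only the else branch remains.
    case Seq() => cw.elseValue.getOrElse(Literal(null))
    case remaining => CaseWhen(remaining, cw.elseValue)
  }
}
```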
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10827 from rxin/SPARK-12770.
• [SPARK-9716][ML] BinaryClassificationEvaluator should accept Double prediction column · f6f7ca9d
      BenFradet authored
      This PR aims to allow the prediction column of `BinaryClassificationEvaluator` to be of double type.
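A hedged usage sketch (`predictions` is an assumed DataFrame with a `label` column and a Double-typed `prediction` column):

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// "predictions" is assumed to exist; with this change its prediction column
// may be a plain Double instead of a raw-prediction vector.
val evaluator = new BinaryClassificationEvaluator()
  .setRawPredictionCol("prediction")
  .setLabelCol("label")
val auc = evaluator.evaluate(predictions) // defaults to areaUnderROC
```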
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #10472 from BenFradet/SPARK-9716.
• [SPARK-2750][WEB UI] Add https support to the Web UI · 43f1d59e
      scwf authored
      Author: scwf <wangfei1@huawei.com>
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      Author: WangTaoTheTonic <wangtao111@huawei.com>
      Author: w00228970 <wangfei1@huawei.com>
      
      Closes #10238 from vanzin/SPARK-2750.
• [BUILD] Runner for spark packages · efd7eed3
      Michael Armbrust authored
This is a convenience method added to the SBT build for developers, though if people think it's useful we could consider adding an official script that runs using the assembly instead of compiling on demand.  It simply compiles Spark (without requiring an assembly) and invokes Spark Submit to download / run the package.
      
      Example Usage:
      ```
      $ build/sbt
      > sparkPackage com.databricks:spark-sql-perf_2.10:0.2.4 com.databricks.spark.sql.perf.RunBenchmark --help
      ```
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #10834 from marmbrus/sparkPackageRunner.
• [SPARK-11295] Add packages to JUnit output for Python tests · c6f971b4
      Gábor Lipták authored
      
      This improves grouping/display of test case results.
      
      Author: Gábor Lipták <gliptak@gmail.com>
      
      Closes #9263 from gliptak/SPARK-11295.
• [SPARK-12816][SQL] De-alias type when generating schemas · c78e2080
      Jakob Odersky authored
      Call `dealias` on local types to fix schema generation for abstract type members, such as
      
      ```scala
      type KeyValue = (Int, String)
      ```
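A hedged illustration of what `dealias` does, using plain Scala reflection:

```scala
import scala.reflect.runtime.universe._

object Demo {
  type KeyValue = (Int, String)

  def main(args: Array[String]): Unit = {
    val aliased = typeOf[KeyValue]
    // The alias keeps its own name; dealias recovers the underlying tuple
    // type that schema generation needs to inspect.
    println(aliased)         // Demo.KeyValue (the alias)
    println(aliased.dealias) // (Int, String)
  }
}
```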
      
      Add simple test
      
      Author: Jakob Odersky <jodersky@gmail.com>
      
      Closes #10749 from jodersky/aliased-schema.
• [SPARK-12560][SQL] SqlTestUtils.stripSparkFilter needs to copy utf8strings · 4dbd3161
      Imran Rashid authored
      See https://issues.apache.org/jira/browse/SPARK-12560
      
This isn't causing any problems currently because the tests for string predicate pushdown are disabled.  I ran into this while trying to turn them back on with a different version of Parquet, and figured it was good to fix now in any case.
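A hedged sketch of the underlying issue (`df` is an assumed DataFrame; this is not the exact test-util code): a columnar reader may reuse the buffers backing each UTF8String, so rows must be copied before they are collected.

```scala
// Unsafe rows handed out by the scan may share mutable backing buffers,
// so copy each row before materializing the result.
val executedRows = df.queryExecution.executedPlan.execute() // RDD[InternalRow]
val safeRows = executedRows.map(_.copy()).collect()
```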
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #10510 from squito/SPARK-12560.
• [SPARK-12867][SQL] Nullability of Intersect can be stricter · b72e01e8
      gatorsmile authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-12867
      
      When intersecting one nullable column with one non-nullable column, the result will not contain any null. Thus, we can make nullability of `intersect` stricter.
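A hedged sketch of the idea (illustrative types, not Spark's exact code): an output column of INTERSECT can be null only if the column is nullable on both sides.

```scala
case class Attr(name: String, nullable: Boolean)

def intersectOutput(left: Seq[Attr], right: Seq[Attr]): Seq[Attr] =
  left.zip(right).map { case (l, r) =>
    // A row survives the intersection only if it appears on both sides,
    // so a null can survive only when both sides allow nulls.
    Attr(l.name, l.nullable && r.nullable)
  }

// nullable INTERSECT non-nullable => non-nullable output column
intersectOutput(Seq(Attr("a", true)), Seq(Attr("a", false)))
// => List(Attr(a,false))
```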
      
      liancheng Could you please check if the code changes are appropriate? Also added test cases to verify the results. Thanks!
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #10812 from gatorsmile/nullabilityIntersect.
• [SPARK-12804][ML] Fix LogisticRegression with FitIntercept on all same label training data · 2388de51
      Feynman Liang authored
      CC jkbradley mengxr dbtsai
      
      Author: Feynman Liang <feynman.liang@gmail.com>
      
      Closes #10743 from feynmanliang/SPARK-12804.
• [SPARK-12887] Do not expose var's in TaskMetrics · b122c861
      Andrew Or authored
      This is a step in implementing SPARK-10620, which migrates TaskMetrics to accumulators.
      
      TaskMetrics has a bunch of var's, some are fully public, some are `private[spark]`. This is bad coding style that makes it easy to accidentally overwrite previously set metrics. This has happened a few times in the past and caused bugs that were difficult to debug.
      
      Instead, we should have get-or-create semantics, which are more readily understandable. This makes sense in the case of TaskMetrics because these are just aggregated metrics that we want to collect throughout the task, so it doesn't matter who's incrementing them.
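A hedged sketch of get-or-create semantics (illustrative names, not the actual TaskMetrics API): each metric object is created at most once and then mutated in place, so a later caller cannot clobber earlier updates.

```scala
class InputMetrics { var bytesRead: Long = 0L }

class Metrics {
  private var _input: Option[InputMetrics] = None

  // Reuse the existing metrics object if one was already registered;
  // never replace it with a fresh instance.
  def registerInputMetrics(): InputMetrics = _input.getOrElse {
    val created = new InputMetrics
    _input = Some(created)
    created
  }
}
```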
      
      Parent PR: #10717
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: andrewor14 <andrew@databricks.com>
      
      Closes #10815 from andrewor14/get-or-create-metrics.
• [SPARK-12870][SQL] better format bucket id in file name · e14817b5
      Wenchen Fan authored
For a normal Parquet file without buckets, the file name ends with a jobUUID, which may be all numbers and mistakenly regarded as a bucket id. This PR improves the format of the bucket id in the file name by using a different separator, `_`, so that the regex is more robust.
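A hedged sketch of the parsing problem (the exact file-name layout here is an assumption): with `_` as a dedicated separator, an all-digit job UUID can no longer be mistaken for a bucket id.

```scala
// The bucket id, when present, sits after a "_" just before the extension.
val bucketIdPattern = """_(\d+)(?:\..*)?$""".r

def extractBucketId(fileName: String): Option[Int] =
  bucketIdPattern.findFirstMatchIn(fileName).map(_.group(1).toInt)

extractBucketId("part-r-00009-a1b2c3d4_00002.parquet") // Some(2)
extractBucketId("part-r-00009-123456789.parquet")      // None: no "_" marker
```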
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10799 from cloud-fan/fix-bucket.
• [SPARK-11944][PYSPARK][MLLIB] python mllib.clustering.bisecting k means · 0ddba6d8
      Holden Karau authored
From the coverage issues for 1.6: add a Python API for mllib.clustering.BisectingKMeans.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #10150 from holdenk/SPARK-11937-python-api-coverage-SPARK-11944-python-mllib.clustering.BisectingKMeans.
• [MLLIB] Fix CholeskyDecomposition assertion's message · ebd9ce0f
      Wojciech Jurczyk authored
Change the assertion's message so it's consistent with the code. The old message says that the invoked method was lapack.dports, when in fact it was the lapack.dppsv method.
      
      Author: Wojciech Jurczyk <wojtek.jurczyk@gmail.com>
      
      Closes #10818 from wjur/wjur/rename_error_message.
• [SPARK-7683][PYSPARK] Confusing behavior of fold function of RDD in pyspark · d8c4b00a
      Sean Owen authored
Fix the order of arguments that PySpark's RDD.fold passes to its op - it should be (acc, obj), like the other implementations.
      
Obviously, this is a potentially breaking change, so it can only happen in 2.x.
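For reference, a hedged Scala counterpart showing the expected argument order (assumes a live SparkContext `sc`):

```scala
val rdd = sc.parallelize(Seq(1, 2, 3, 4))
// The op receives (accumulator, element): accumulator first.
val sum = rdd.fold(0)((acc, x) => acc + x) // 10
```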
      
      CC davies
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #10771 from srowen/SPARK-7683.
• [SQL][MINOR] Fix one little mismatched comment according to the codes in interface.scala · c00744e6
      proflin authored
      Author: proflin <proflin.me@gmail.com>
      
      Closes #10824 from proflin/master.
  2. Jan 18, 2016
  3. Jan 17, 2016
  4. Jan 16, 2016
• [SPARK-12796] [SQL] Whole stage codegen · 3c0d2365
      Davies Liu authored
This is the initial work for whole-stage codegen. It supports Projection/Filter/Range; we will continue working on this to support more physical operators.
      
A micro benchmark shows that a query with range, filter and projection could be 3X faster than before.
      
It's turned on by default. For a tree that has at least two chained plans, a WholeStageCodegen will be inserted; for example, the following plan
      ```
      Limit 10
      +- Project [(id#5L + 1) AS (id + 1)#6L]
         +- Filter ((id#5L & 1) = 1)
            +- Range 0, 1, 4, 10, [id#5L]
      ```
      will be translated into
      ```
      Limit 10
      +- WholeStageCodegen
            +- Project [(id#1L + 1) AS (id + 1)#2L]
               +- Filter ((id#1L & 1) = 1)
                  +- Range 0, 1, 4, 10, [id#1L]
      ```
      
Here is the call graph to generate Java source for A and B (A supports codegen, but B does not):
      
      ```
  WholeStageCodegen       Plan A               FakeInput        Plan B
=========================================================================

-> execute()
    |
 doExecute() -------->   produce()
                            |
                         doProduce()  -------> produce()
                                                  |
                                               doProduce() ---> execute()
                                                  |
                                               consume()
                         doConsume()  ------------|
                            |
 doConsume()  <-----    consume()
      ```
      
A SparkPlan that supports codegen needs to implement doProduce() and doConsume():
      
      ```
      def doProduce(ctx: CodegenContext): (RDD[InternalRow], String)
      def doConsume(ctx: CodegenContext, child: SparkPlan, input: Seq[ExprCode]): String
      ```
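A quick hedged way to see the effect from spark-shell (assumes the `sqlContext` the shell provides; the flag name is taken from this change, and it is on by default):

```scala
sqlContext.setConf("spark.sql.codegen.wholeStage", "true")
sqlContext.range(0, 1000)
  .filter("(id & 1) = 1")
  .selectExpr("id + 1")
  .explain()
// The printed plan should show a WholeStageCodegen node wrapping
// Project/Filter/Range, as in the example above.
```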
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10735 from davies/whole2.
• [SPARK-12722][DOCS] Fixed typo in Pipeline example · 86972fa5
      Jeff Lam authored
      http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline
      ```
      val sameModel = Pipeline.load("/tmp/spark-logistic-regression-model")
      ```
      should be
      ```
      val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
      ```
      cc: jkbradley
      
      Author: Jeff Lam <sha0lin@alumni.carnegiemellon.edu>
      
      Closes #10769 from Agent007/SPARK-12722.
• [SPARK-12856] [SQL] speed up hashCode of unsafe array · 2f7d0b68
      Wenchen Fan authored
We iterated over the bytes to calculate hashCode before, but now we have `Murmur3_x86_32.hashUnsafeBytes`, which doesn't require the bytes to be word aligned, so we should use that instead.
      
A simple benchmark shows it's about 3X faster; benchmark code: https://gist.github.com/cloud-fan/fa77713ccebf0823b2ab#file-arrayhashbenchmark-scala
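A hedged sketch of the call (the exact signature and argument order are taken from Spark's unsafe module as I understand it; treat them as an assumption):

```scala
import org.apache.spark.unsafe.Platform
import org.apache.spark.unsafe.hash.Murmur3_x86_32

val bytes = Array[Byte](1, 2, 3, 4, 5) // length need not be word aligned
val hash = Murmur3_x86_32.hashUnsafeBytes(
  bytes, Platform.BYTE_ARRAY_OFFSET, bytes.length, 42) // 42 = seed
```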
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10784 from cloud-fan/array-hashcode.