  1. Mar 03, 2016
    • Dongjoon Hyun's avatar
      [MINOR] Fix typos in comments and testcase name of code · 941b270b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes typos in comments and testcase name of code.
      
      ## How was this patch tested?
      
      manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.
      941b270b
    • Sean Owen's avatar
      [SPARK-13423][HOTFIX] Static analysis fixes for 2.x / fixed for Scala 2.10, again · 52035d10
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fixes (another) compile problem due to inadvertent use of Option.contains, only in Scala 2.11
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11496 from srowen/SPARK-13423.3.
      52035d10
    • Yanbo Liang's avatar
      [MINOR][ML][DOC] Remove duplicated periods at the end of some sharedParam · ce58e99a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Remove duplicated periods at the end of some sharedParams in ScalaDoc, such as [here](https://github.com/apache/spark/pull/11344/files#diff-9edc669edcf2c0c7cf1efe4a0a57da80L367)
      cc mengxr srowen
      ## How was this patch tested?
      Documents change, no test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11344 from yanboliang/shared-cleanup.
      ce58e99a
    • hyukjinkwon's avatar
      [SPARK-13543][SQL] Support for specifying compression codec for Parquet/ORC via option() · cf95d728
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds the support to specify compression codecs for both ORC and Parquet.
      
      ## How was this patch tested?
      
      unittests within IDE and code style tests with `dev/run_tests`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11464 from HyukjinKwon/SPARK-13543.
      cf95d728
    • JeremyNixon's avatar
      [SPARK-12877][ML] Add train-validation-split to pyspark · 511d4929
      JeremyNixon authored
      ## What changes were proposed in this pull request?
      The changes proposed were to add train-validation-split to pyspark.ml.tuning.
      
      ## How was this patch tested?
      This patch was tested through unit tests located in pyspark/ml/test.py.
      
      This is my original work and I license it to Spark.
      
      Author: JeremyNixon <jnixon2@gmail.com>
      
      Closes #11335 from JeremyNixon/tvs_pyspark.
      511d4929
    • Steve Loughran's avatar
      [SPARK-13599][BUILD] remove transitive groovy dependencies from Hive · 9a48c656
      Steve Loughran authored
      ## What changes were proposed in this pull request?
      
      Modifies the dependency declarations of all the Hive artifacts to explicitly exclude the groovy-all JAR.
      
      This stops the groovy classes *and everything else in that uber-JAR* from getting into spark-assembly JAR.
      
      ## How was this patch tested?
      
      1. Pre-patch build was made: `mvn clean install -Pyarn,hive,hive-thriftserver`
      2. spark-assembly expanded, observed to have the org.codehaus.groovy packages and JARs
      3. A maven dependency tree was created `mvn dependency:tree -Pyarn,hive,hive-thriftserver  -Dverbose > target/dependencies.txt`
      4. This text file examined to confirm that groovy was being imported as a dependency of `org.spark-project.hive`
      5. Patch applied
      6. Repeated step 1: clean build of project with ` -Pyarn,hive,hive-thriftserver` set
      7. Examined created spark-assembly, verified no org.codehaus packages
      8. Verified that the maven dependency tree no longer references groovy
      
      Note also that the size of the assembly JAR was 181628646 bytes before this patch and 166318515 after, about 15 MB smaller. That's a good metric of things being excluded.
      
      Author: Steve Loughran <stevel@hortonworks.com>
      
      Closes #11449 from steveloughran/fixes/SPARK-13599-groovy-dependency.
      9a48c656
    • Xin Ren's avatar
      [SPARK-13013][DOCS] Replace example code in mllib-clustering.md using include_example · 70f6f964
      Xin Ren authored
      Replace example code in mllib-clustering.md using include_example
      https://issues.apache.org/jira/browse/SPARK-13013
      
      The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.
      
      Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
      `{% include_example scala/org/apache/spark/examples/mllib/KMeansExample.scala %}`
      Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala`, pick the code blocks marked "example", and use them to replace the code block in `{% highlight %}` in the markdown.
      
      See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #11116 from keypointt/SPARK-13013.
      70f6f964
    • Sean Owen's avatar
      [SPARK-13423][HOTFIX] Static analysis fixes for 2.x / fixed for Scala 2.10 · 645c3a85
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fixes compile problem due to inadvertent use of `Option.contains`, only in Scala 2.11. The change should have been to replace `Option.exists(_ == x)` with `== Some(x)`. Replacing exists with contains only makes sense for collections. Replacing use of `Option.exists` still makes sense though as it's misleading.
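
A minimal sketch of the three spellings discussed above, in plain Scala with no Spark dependency. The object and method names are made up for illustration. Only the last method relies on `Option.contains`, which is the call that does not exist on Scala 2.10.

```scala
object OptionEquality {
  // Portable to Scala 2.10: true only when the option holds exactly `x`.
  def viaEquals[A](opt: Option[A], x: A): Boolean = opt == Some(x)

  // Also portable, but reads like a general predicate test rather than equality.
  def viaExists[A](opt: Option[A], x: A): Boolean = opt.exists(_ == x)

  // Option.contains was added in Scala 2.11, so this method fails to compile on 2.10.
  def viaContains[A](opt: Option[A], x: A): Boolean = opt.contains(x)
}
```

All three return the same result for the same inputs; the difference is only which Scala versions accept them and how clearly they state the intent.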
      
      ## How was this patch tested?
      
      Jenkins tests / compilation
      
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11493 from srowen/SPARK-13423.2.
      645c3a85
    • Dongjoon Hyun's avatar
      [SPARK-13583][CORE][STREAMING] Remove unused imports and add checkstyle rule · b5f02d67
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time.
      This issue aims to remove unused imports from Java/Scala code and add the `UnusedImports` checkstyle rule to help developers.
      
      ## How was this patch tested?
      ```
      ./dev/lint-java
      ./build/sbt compile
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11438 from dongjoon-hyun/SPARK-13583.
      b5f02d67
    • Sean Owen's avatar
      [SPARK-13423][WIP][CORE][SQL][STREAMING] Static analysis fixes for 2.x · e97fc7f1
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:
      
      - Inner class should be static
      - Mismatched hashCode/equals
      - Overflow in compareTo
      - Unchecked warnings
      - Misuse of assert, vs junit.assert
      - get(a) + getOrElse(b) -> getOrElse(a,b)
      - Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions
      - Dead code
      - tailrec
      - exists(_ == ) -> contains
      - find + nonEmpty -> exists
      - filter + size -> count
      - reduce(_+_) -> sum
      - map + flatten -> flatMap
      
      The most controversial may be .size -> .length simply because of its size. It is intended to avoid implicits that might be expensive in some places.
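
Several of the rewrites listed above can be illustrated in a few lines of plain Scala. This is only a sketch: the object, the value names and the sample data are invented for illustration and are not taken from the Spark code base.

```scala
object StaticAnalysisRewrites {
  // Overflow in compareTo: subtraction wraps around for values that are far apart.
  def compareBySubtraction(a: Int, b: Int): Int = a - b
  def compareSafely(a: Int, b: Int): Int = java.lang.Integer.compare(a, b)

  val words = Seq("spark", "sql", "streaming")

  // exists(_ == x) -> contains (on a collection)
  val hasSqlBefore = words.exists(_ == "sql")
  val hasSqlAfter = words.contains("sql")

  // find + nonEmpty -> exists
  val hasLongBefore = words.find(_.length > 8).nonEmpty
  val hasLongAfter = words.exists(_.length > 8)

  // filter + size -> count
  val longWordsBefore = words.filter(_.length > 3).size
  val longWordsAfter = words.count(_.length > 3)

  // reduce(_ + _) -> sum
  val totalBefore = words.map(_.length).reduce(_ + _)
  val totalAfter = words.map(_.length).sum

  // map + flatten -> flatMap
  val charsBefore = words.map(_.toSeq).flatten
  val charsAfter = words.flatMap(_.toSeq)

  // get(a) + getOrElse(b) -> getOrElse(a, b)
  val conf = Map("codec" -> "snappy")
  val levelBefore = conf.get("level").getOrElse("default")
  val levelAfter = conf.getOrElse("level", "default")
}
```

Each "before" and "after" pair evaluates to the same value. The exception is the `compareTo` pair, where the subtraction form reports the wrong sign for inputs such as `Int.MinValue` and `1`.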
      
      ## How was this patch tested?
      
      Existing Jenkins unit tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11292 from srowen/SPARK-13423.
      e97fc7f1
    • Dongjoon Hyun's avatar
      [HOT-FIX] Recover some deprecations for 2.10 compatibility. · 02b7677e
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      #11479 [SPARK-13627] broke 2.10 compatibility: [2.10-Build](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-scala-2.10/292/console)
      At this moment, we need to support both 2.10 and 2.11.
      This PR recovers some deprecated methods which were replaced by [SPARK-13627].
      
      ## How was this patch tested?
      
      Jenkins build: Both 2.10, 2.11.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11488 from dongjoon-hyun/hotfix_compatibility_with_2.10.
      02b7677e
    • Liang-Chi Hsieh's avatar
      [SPARK-13466] [SQL] Remove projects that become redundant after column pruning rule · 7b25dc7b
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13466
      
      ## What changes were proposed in this pull request?
      
      With column pruning rule in optimizer, some Project operators will become redundant. We should remove these redundant Projects.
      
      For an example query:
      
          val input = LocalRelation('key.int, 'value.string)
      
          val query =
            Project(Seq($"x.key", $"y.key"),
              Join(
                SubqueryAlias("x", input),
                BroadcastHint(SubqueryAlias("y", input)), Inner, None))
      
      After the first run of column pruning, it would look like:
      
          Project(Seq($"x.key", $"y.key"),
            Join(
              Project(Seq($"x.key"), SubqueryAlias("x", input)),
              Project(Seq($"y.key"),      <-- inserted by the rule
              BroadcastHint(SubqueryAlias("y", input))),
              Inner, None))
      
      Actually we don't need the outside Project now. This patch will remove it:
      
          Join(
            Project(Seq($"x.key"), SubqueryAlias("x", input)),
            Project(Seq($"y.key"),
            BroadcastHint(SubqueryAlias("y", input))),
            Inner, None)
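
The idea can be sketched on a toy plan tree. Everything below is hypothetical: `Plan`, `Relation`, `Project` and `Join` here are simplified stand-ins, not Catalyst's classes, and the rule assumes a `Project` is redundant exactly when it selects its child's output unchanged.

```scala
object RedundantProjectSketch {
  sealed trait Plan { def output: Seq[String] }
  case class Relation(output: Seq[String]) extends Plan
  case class Project(columns: Seq[String], child: Plan) extends Plan {
    def output: Seq[String] = columns
  }
  case class Join(left: Plan, right: Plan) extends Plan {
    def output: Seq[String] = left.output ++ right.output
  }

  // Bottom-up rewrite: drop any Project that selects exactly its child's output.
  def removeRedundantProjects(plan: Plan): Plan = plan match {
    case Project(columns, child) =>
      val newChild = removeRedundantProjects(child)
      if (columns == newChild.output) newChild else Project(columns, newChild)
    case Join(left, right) =>
      Join(removeRedundantProjects(left), removeRedundantProjects(right))
    case relation: Relation => relation
  }
}
```

Applied to a toy version of the pruned plan above, the outer `Project` disappears because the join already outputs only `x.key` and `y.key`. The inner projects remain because they still narrow their inputs.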
      
      ## How was this patch tested?
      
      Unit test is added into ColumnPruningSuite.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #11341 from viirya/remove-redundant-project.
      7b25dc7b
    • Liang-Chi Hsieh's avatar
      [SPARK-13635] [SQL] Enable LimitPushdown optimizer rule because we have... · 1085bd86
      Liang-Chi Hsieh authored
      [SPARK-13635] [SQL] Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-13635
      
      ## What changes were proposed in this pull request?
      
      LimitPushdown optimizer rule has been disabled due to no whole-stage codegen for Limit. As we have whole-stage codegen for Limit now, we should enable it.
      
      ## How was this patch tested?
      
      As we only re-enable LimitPushdown optimizer rule, no need to add new tests for it.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #11483 from viirya/enable-limitpushdown.
      1085bd86
    • Devaraj K's avatar
      [SPARK-13621][CORE] TestExecutor.scala needs to be moved to test package · 56e3d007
      Devaraj K authored
      Moved TestExecutor.scala from src to test package and removed the unused file TestClient.scala.
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #11474 from devaraj-kavali/SPARK-13621.
      56e3d007
    • Liang-Chi Hsieh's avatar
      [SPARK-13616][SQL] Let SQLBuilder convert logical plan without a project on top of it · f87ce050
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13616
      
      ## What changes were proposed in this pull request?
      
      A logical plan may have had the `Project` removed from the top of it, or it may never have had a top `Project` because one was not necessary. Currently the `SQLBuilder` can't convert such plans back to SQL. This change adds that capability.
      
      ## How was this patch tested?
      
      A test is added to `LogicalPlanToSQLSuite`.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #11466 from viirya/sqlbuilder-notopselect.
      f87ce050
  2. Mar 02, 2016
    • Dongjoon Hyun's avatar
      [SPARK-13627][SQL][YARN] Fix simple deprecation warnings. · 9c274ac4
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to fix the following deprecation warnings.
        * MethodSymbolApi.paramss -> paramLists
        * AnnotationApi.tpe -> tree.tpe
        * BufferLike.readOnly -> toList
        * StandardNames.nme -> termNames
        * scala.tools.nsc.interpreter.AbstractFileClassLoader -> scala.reflect.internal.util.AbstractFileClassLoader
        * TypeApi.declarations -> decls
      
      ## How was this patch tested?
      
      Check the compile build log and pass the tests.
      ```
      ./build/sbt
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11479 from dongjoon-hyun/SPARK-13627.
      9c274ac4
    • Wenchen Fan's avatar
      [SPARK-13617][SQL] remove unnecessary GroupingAnalytics trait · b60b8137
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      The `trait GroupingAnalytics` has only one implementation, so it's an unnecessary abstraction. This PR removes it and does some code simplification when resolving `GroupingSet`.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11469 from cloud-fan/groupingset.
      b60b8137
    • Takeshi YAMAMURO's avatar
      [SPARK-13528][SQL] Make the short names of compression codecs consistent in ParquetRelation · 6250cf1e
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
      This PR makes the short names of compression codecs in `ParquetRelation` consistent with the other ones. This PR comes from #11324.
      
      ## How was this patch tested?
      Add more tests in `TextSuite`.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #11408 from maropu/SPARK-13528.
      6250cf1e
    • Wenchen Fan's avatar
      [SPARK-13594][SQL] remove typed operations(e.g. map, flatMap) from python DataFrame · 4dd24811
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Remove `map`, `flatMap`, `mapPartitions` from python DataFrame, to prepare for Dataset API in the future.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11445 from cloud-fan/python-clean.
      4dd24811
    • Nong Li's avatar
      [SPARK-13574] [SQL] Add benchmark to measure string dictionary decode. · e2780ce8
      Nong Li authored
      ## What changes were proposed in this pull request?
      
      Also updated the other benchmarks when the default to use vectorized decode was flipped.
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #11454 from nongli/benchmark.
      e2780ce8
    • Davies Liu's avatar
      [SPARK-13601] call failure callbacks before writer.close() · b5a59a0f
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      In order to tell the OutputStream whether the task has failed, we should call the failure callbacks BEFORE calling writer.close().
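
The ordering can be sketched with plain `try`/`catch`/`finally`. The helper below is hypothetical and not Spark's task code. It only shows that callbacks placed in the `catch` block run before the `close` in the `finally` block.

```scala
import scala.collection.mutable.ArrayBuffer

object FailureCallbackSketch {
  // Runs `write`; on failure, notifies the callbacks BEFORE `close` runs.
  def writeAndClose(
      write: () => Unit,
      close: () => Unit,
      onFailure: Seq[Throwable => Unit]): Unit = {
    try {
      write()
    } catch {
      case t: Throwable =>
        onFailure.foreach(callback => callback(t))
        throw t
    } finally {
      close()
    }
  }

  // Records the order of events for a write that fails.
  def demo(): List[String] = {
    val events = ArrayBuffer.empty[String]
    val notifyFailure: Throwable => Unit = _ => events += "failure-callback"
    try {
      writeAndClose(
        write = () => { events += "write"; throw new RuntimeException("boom") },
        close = () => { events += "close"; () },
        onFailure = Seq(notifyFailure))
    } catch {
      case _: RuntimeException => events += "rethrown"
    }
    events.toList
  }
}
```

With this ordering, a `close()` implementation can check a failure flag set by the callback and, for example, skip committing partial output.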
      
      ## How was this patch tested?
      
      Added new unit tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11450 from davies/callback.
      b5a59a0f
    • gatorsmile's avatar
      [SPARK-13535][SQL] Fix Analysis Exceptions when Using Backticks in Transform Clause · 9e01fe2e
      gatorsmile authored
      #### What changes were proposed in this pull request?
      ```SQL
      FROM
      (FROM test SELECT TRANSFORM(key, value) USING 'cat' AS (`thing1` int, thing2 string)) t
      SELECT thing1 + 1
      ```
      This query returns an analysis error, like:
      ```
      Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`thing1`' given input columns: [`thing1`, thing2]; line 3 pos 7
      'Project [unresolvedalias(('thing1 + 1), None)]
      +- SubqueryAlias t
         +- ScriptTransformation [key#2,value#3], cat, [`thing1`#6,thing2#7], HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,	)),List((field.delim,	)),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
            +- SubqueryAlias test
               +- Project [_1#0 AS key#2,_2#1 AS value#3]
                  +- LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3],[4,4],[5,5]]
      ```
      
      The backticks of \`thing1\` should be cleaned before entering Parser/Analyzer. This PR fixes this issue.
      
      #### How was this patch tested?
      
      Added a test case and modified an existing test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11415 from gatorsmile/scriptTransform.
      9e01fe2e
    • Josh Rosen's avatar
      [SPARK-12817] Add BlockManager.getOrElseUpdate and remove CacheManager · d6969ffc
      Josh Rosen authored
      CacheManager directly calls MemoryStore.unrollSafely() and has its own logic for handling graceful fallback to disk when cached data does not fit in memory. However, this logic also exists inside of the MemoryStore itself, so this appears to be unnecessary duplication.
      
      Thanks to the addition of block-level read/write locks in #10705, we can refactor the code to remove the CacheManager and replace it with an atomic `BlockManager.getOrElseUpdate()` method.
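
As a rough illustration of what an atomic get-or-compute looks like, here is a toy store. It is an assumption-laden sketch: `ToyBlockStore` is invented for this example and uses one coarse lock, whereas the commit relies on the block-level read/write locks from #10705.

```scala
import scala.collection.mutable

object BlockStoreSketch {
  // Toy stand-in for a block store; a single lock replaces per-block locks.
  final class ToyBlockStore[K, V] {
    private val blocks = mutable.HashMap.empty[K, V]

    // Returns the cached block, or computes and stores it, in one atomic step.
    def getOrElseUpdate(id: K)(compute: => V): V = blocks.synchronized {
      blocks.get(id) match {
        case Some(block) => block
        case None =>
          val block = compute
          blocks.put(id, block)
          block
      }
    }
  }
}
```

Because the lookup and the insertion happen under the same lock, two callers asking for the same id cannot both run `compute`.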
      
      This pull request replaces / subsumes #10748.
      
      /cc andrewor14 and nongli for review. Note that this changes the locking semantics of a couple of internal BlockManager methods (`doPut()` and `lockNewBlockForWriting`), so please pay attention to the Scaladoc changes and new test cases for those methods.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11436 from JoshRosen/remove-cachemanager.
      d6969ffc
    • gatorsmile's avatar
      [SPARK-13609] [SQL] Support Column Pruning for MapPartitions · 8f8d8a23
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
      This PR is to prune unnecessary columns when the operator is  `MapPartitions`. The solution is to add an extra `Project` in the child node.
      
      For the other two operators `AppendColumns` and `MapGroups`, it sounds doable. More discussions are required. The major reason is the current implementation of the `inputPlan` of `groupBy` is based on the child of `AppendColumns`. It might be a bug? Thus, will submit a separate PR.
      
      #### How was this patch tested?
      
      Added a test case in ColumnPruningSuite to verify the rule. Added another test case in DatasetSuite.scala to verify the data.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11460 from gatorsmile/datasetPruningNew.
      8f8d8a23
    • lgieron's avatar
      [SPARK-13515] Make FormatNumber work irrespective of locale. · d8afd45f
      lgieron authored
      ## What changes were proposed in this pull request?
      
      Change in class FormatNumber to make it work irrespective of locale.
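
One common way to make number formatting independent of the JVM's default locale is to pin the `DecimalFormatSymbols`. The sketch below assumes that approach and a simple grouping pattern. It is not necessarily how `FormatNumber` itself is implemented, and `formatNumber` here is a made-up helper.

```scala
import java.text.{DecimalFormat, DecimalFormatSymbols}
import java.util.Locale

object LocaleIndependentFormat {
  // Pins the symbols to Locale.US so separators do not follow the default locale.
  def formatNumber(value: Double, decimals: Int): String = {
    val pattern = if (decimals > 0) "#,##0." + ("0" * decimals) else "#,##0"
    new DecimalFormat(pattern, new DecimalFormatSymbols(Locale.US)).format(value)
  }
}
```

Without the explicit symbols, a JVM whose default locale is, say, German would swap the roles of `,` and `.` in the output.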
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: lgieron <lgieron@gmail.com>
      
      Closes #11396 from lgieron/SPARK-13515_Fix_Format_Number.
      d8afd45f
    • Wojciech Jurczyk's avatar
      Fix run-tests.py typos · 75e618de
      Wojciech Jurczyk authored
      ## What changes were proposed in this pull request?
      
      The PR fixes typos in an error message in dev/run-tests.py.
      
      Author: Wojciech Jurczyk <wojciech.jurczyk@codilime.com>
      
      Closes #11467 from wjur/wjur/typos_run_tests.
      75e618de
    • Dongjoon Hyun's avatar
      [MINOR][STREAMING] Replace deprecated `apply` with `create` in example. · 366f26d2
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Twitter Algebird deprecated `apply` in HyperLogLog.scala.
      ```
      @deprecated("Use toHLL", since = "0.10.0 / 2015-05")
      def apply[T <% Array[Byte]](t: T) = create(t)
      ```
      This PR replaces the deprecated usage of `apply` with the new `create`,
      according to the upstream change.
      
      ## How was this patch tested?
      manual.
      ```
      /bin/spark-submit --class org.apache.spark.examples.streaming.TwitterAlgebirdHLL examples/target/scala-2.11/spark-examples-2.0.0-SNAPSHOT-hadoop2.2.0.jar
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11451 from dongjoon-hyun/replace_deprecated_hll_apply.
      366f26d2
  3. Mar 01, 2016
    • jerryshao's avatar
      [BUILD][MINOR] Fix SBT build error with network-yarn module · b4d096de
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      ```
      [error] Expected ID character
      [error] Not a valid command: common (similar: completions)
      [error] Expected project ID
      [error] Expected configuration
      [error] Expected ':' (if selecting a configuration)
      [error] Expected key
      [error] Not a valid key: common (similar: commands)
      [error] common/network-yarn/test
      ```
      
      `common/network-yarn` is not a valid sbt project; we should change it to `network-yarn`.
      
      ## How was this patch tested?
      
      Ran the unit tests locally.
      
      CC rxin , we should either change here, or change the sbt project name.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #11456 from jerryshao/build-fix.
      b4d096de
    • Joseph K. Bradley's avatar
      [SPARK-13008][ML][PYTHON] Put one alg per line in pyspark.ml all lists · 9495c40f
      Joseph K. Bradley authored
      This is to fix a long-time annoyance: Whenever we add a new algorithm to pyspark.ml, we have to add it to the ```__all__``` list at the top.  Since we keep it alphabetized, it often creates a lot more changes than needed.  It is also easy to add the Estimator and forget the Model.  I'm going to switch it to have one algorithm per line.
      
      This also alphabetizes a few out-of-place classes in pyspark.ml.feature.  No changes have been made to the moved classes.
      
      CC: thunterdb
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #10927 from jkbradley/ml-python-all-list.
      9495c40f
    • sureshthalamati's avatar
      [SPARK-13167][SQL] Include rows with null values for partition column when... · e42724b1
      sureshthalamati authored
      [SPARK-13167][SQL] Include rows with null values for partition column when reading from JDBC datasources.
      
      Rows with null values in the partition column are not included in the results because none of the partition
      where clauses specifies an is-null predicate on the partition column. This fix adds an is-null predicate on the partition column to the first JDBC partition where clause.
      
      Example:
      JDBCPartition(THEID < 1 or THEID is null, 0),JDBCPartition(THEID >= 1 AND THEID < 2,1),
      JDBCPartition(THEID >= 2, 2)
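
A simplified sketch of how such where clauses could be generated. The function name, its signature and the stride arithmetic are assumptions made for illustration, and only the multi-partition case is covered.

```scala
object JdbcPartitionSketch {
  // Hypothetical range partitioning on an integer column.
  def whereClauses(
      column: String,
      lower: Long,
      upper: Long,
      numPartitions: Int): Seq[String] = {
    require(numPartitions > 1, "this sketch only covers the multi-partition case")
    val stride = (upper - lower) / numPartitions
    (0 until numPartitions).map { i =>
      val lowerBound = s"$column >= ${lower + i * stride}"
      val upperBound = s"$column < ${lower + (i + 1) * stride}"
      if (i == 0) s"$upperBound or $column is null" // first partition also picks up NULLs
      else if (i == numPartitions - 1) lowerBound   // last partition is open-ended
      else s"$lowerBound AND $upperBound"
    }
  }
}
```

Calling `whereClauses("THEID", 0, 4, 3)` yields the three predicates shown in the example above. The NULL rows land in exactly one partition, so they are neither dropped nor duplicated.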
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #11063 from sureshthalamati/nullable_jdbc_part_col_spark-13167.
      e42724b1
    • Davies Liu's avatar
      [SPARK-13598] [SQL] remove LeftSemiJoinBNL · a640c5b4
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Broadcast left semi join without joining keys is already supported in BroadcastNestedLoopJoin, which has the same implementation as LeftSemiJoinBNL, so we should remove the latter.
      
      ## How was this patch tested?
      
      Updated unit tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11448 from davies/remove_bnl.
      a640c5b4
    • Reynold Xin's avatar
      [SPARK-13548][BUILD] Move tags and unsafe modules into common · b0ee7d43
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch moves tags and unsafe modules into common directory to remove 2 top level non-user-facing directories.
      
      ## How was this patch tested?
      Jenkins should suffice.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11426 from rxin/SPARK-13548.
      b0ee7d43
    • Davies Liu's avatar
      [SPARK-13582] [SQL] defer dictionary decoding in parquet reader · c27ba0d5
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR defers the resolution of a dictionary id to its value until the column is actually accessed (inside getInt/getLong), which is very useful for columns and rows that are filtered out. It's also useful for the binary type, since we will not need to copy all the byte arrays.
      
      This PR also changes the underlying type for small decimals that fit within an Int, in order to use getInt() to look up the value from IntDictionary.
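
The deferred lookup can be sketched with a toy column that keeps only dictionary ids and resolves them inside the accessor. `DictionaryIntColumn` and its `decodeCount` counter are hypothetical and exist only to make the laziness observable; the real reader's classes differ.

```scala
object LazyDictionarySketch {
  // Hypothetical column that stores dictionary ids and decodes only on access.
  final class DictionaryIntColumn(dictionary: Array[Int], ids: Array[Int]) {
    private var lookups = 0

    // Number of id -> value resolutions performed so far.
    def decodeCount: Int = lookups

    // The id -> value lookup happens here, not when the column is loaded.
    def getInt(rowId: Int): Int = {
      lookups += 1
      dictionary(ids(rowId))
    }
  }
}
```

Rows that a filter discards are never passed to `getInt`, so their ids are never resolved.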
      
      ## How was this patch tested?
      
      Manually test TPCDS Q7 with scale factor 10, saw about 30% improvements (after PR #11274).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11437 from davies/decode_dict.
      c27ba0d5
    • Xiangrui Meng's avatar
      Closes #11320 · c37bbb3a
      Xiangrui Meng authored
      Closes #10940
      Closes #11302
      Closes #11430
      Closes #10912
      c37bbb3a
    • Yanbo Liang's avatar
      [SPARK-12811][ML] Estimator for Generalized Linear Models(GLMs) · 5ed48dd8
      Yanbo Liang authored
      Estimator for Generalized Linear Models(GLMs) which will be solved by IRLS.
      
      cc mengxr
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11136 from yanboliang/spark-12811.
      5ed48dd8
    • Liang-Chi Hsieh's avatar
      [SPARK-13511] [SQL] Add wholestage codegen for limit · c43899a0
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13511
      
      ## What changes were proposed in this pull request?
      
      The current limit operator doesn't support wholestage codegen. This PR adds support for it.
      
      In the `doConsume` of `GlobalLimit` and `LocalLimit`, we use a count term to count the processed rows. Once the row count reaches the limit, we set the variable `stopEarly` of `BufferedRowIterator` (newly added in this PR) to `true`, which indicates that we want to stop processing the remaining rows. Then, when the wholestage codegen framework checks `shouldStop()`, it will stop processing the row iterator.
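
The counting and early-stop mechanism can be sketched without any code generation. The classes below are hypothetical stand-ins: `LimitedProcessor` plays the role of the generated consume logic, and the `run` loop plays the role of the framework that checks `shouldStop`.

```scala
object LimitSketch {
  // Hypothetical stand-in for the generated row-processing logic.
  final class LimitedProcessor(limit: Int) {
    private var count = 0
    private var stopEarly = limit <= 0

    def shouldStop: Boolean = stopEarly

    // Counts rows, and raises the flag once the limit has been reached.
    def consume(row: Long, sink: Long => Unit): Unit = {
      if (count < limit) {
        count += 1
        sink(row)
      }
      if (count >= limit) stopEarly = true
    }
  }

  // Returns the emitted rows and how many input rows were pulled.
  def run(input: Iterator[Long], limit: Int): (Vector[Long], Int) = {
    val processor = new LimitedProcessor(limit)
    val out = Vector.newBuilder[Long]
    var pulled = 0
    while (!processor.shouldStop && input.hasNext) {
      pulled += 1
      processor.consume(input.next(), row => { out += row; () })
    }
    (out.result(), pulled)
  }
}
```

Because the loop checks the flag before pulling the next row, even an unbounded input stops after `limit` rows.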
      
      Before this, the executed plan for a query `sqlContext.range(N).limit(100).groupBy().sum()` is:
      
          TungstenAggregate(key=[], functions=[(sum(id#5L),mode=Final,isDistinct=false)], output=[sum(id)#6L])
          +- TungstenAggregate(key=[], functions=[(sum(id#5L),mode=Partial,isDistinct=false)], output=[sum#9L])
             +- GlobalLimit 100
                +- Exchange SinglePartition, None
                   +- LocalLimit 100
                      +- Range 0, 1, 1, 524288000, [id#5L]
      
      After adding wholestage codegen support:
      
          WholeStageCodegen
          :  +- TungstenAggregate(key=[], functions=[(sum(id#40L),mode=Final,isDistinct=false)], output=[sum(id)#41L])
          :     +- TungstenAggregate(key=[], functions=[(sum(id#40L),mode=Partial,isDistinct=false)], output=[sum#44L])
          :        +- GlobalLimit 100
          :           +- INPUT
          +- Exchange SinglePartition, None
             +- WholeStageCodegen
                :  +- LocalLimit 100
                :     +- Range 0, 1, 1, 524288000, [id#40L]
      
      ## How was this patch tested?
      
      A test is added into BenchmarkWholeStageCodegen.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #11391 from viirya/wholestage-limit.
      c43899a0
    • Masayoshi TSUZUKI's avatar
      [SPARK-13592][WINDOWS] fix path of spark-submit2.cmd in spark-submit.cmd · 12a2a57e
      Masayoshi TSUZUKI authored
      ## What changes were proposed in this pull request?
      
      This patch fixes the problem that pyspark fails on Windows because pyspark can't find ```spark-submit2.cmd```.
      
      ## How was this patch tested?
      
      manual tests:
        I ran ```bin\pyspark.cmd``` and checked that pyspark launches correctly after this patch is applied.
      
      Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
      
      Closes #11442 from tsudukim/feature/SPARK-13592.
      12a2a57e
    • Zheng RuiFeng's avatar
      [SPARK-13550][ML] Add java example for ml.clustering.BisectingKMeans · 3c5f5e3b
      Zheng RuiFeng authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13550
      
      ## What changes were proposed in this pull request?
      
      Just add a java example for ml.clustering.BisectingKMeans
      
      ## How was this patch tested?
      
      manual tests were done.
      
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #11428 from zhengruifeng/ml_bkm_je.
      3c5f5e3b
    • Zheng RuiFeng's avatar
      [SPARK-13551][MLLIB] Fix wrong comment and remove meanless lines in... · 0a4b620f
      Zheng RuiFeng authored
      [SPARK-13551][MLLIB] Fix wrong comment and remove meanless lines in mllib.JavaBisectingKMeansExample
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-13551
      
      ## What changes were proposed in this pull request?
      
      Fix wrong comment and remove meaningless lines in mllib.JavaBisectingKMeansExample
      
      ## How was this patch tested?
      
      manual test
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #11429 from zhengruifeng/mllib_bkm_je.
      0a4b620f
  4. Feb 29, 2016
    • Marcelo Vanzin's avatar
      [SPARK-13478][YARN] Use real user when fetching delegation tokens. · c7fccb56
      Marcelo Vanzin authored
      The Hive client library is not smart enough to notice that the current
      user is a proxy user; so when using a proxy user, it fails to fetch
      delegation tokens from the metastore because of a missing kerberos
      TGT for the current user.
      
      To fix it, just run the code that fetches the delegation token as the
      real logged in user.
      
      Tested on a kerberos cluster both submitting normally and with a proxy
      user; Hive and HBase tokens are retrieved correctly in both cases.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11358 from vanzin/SPARK-13478.
      c7fccb56