  1. Nov 10, 2015
    • [SPARK-9818] Re-enable Docker tests for JDBC data source · 1dde39d7
      Josh Rosen authored
      This patch re-enables tests for the Docker JDBC data source. These tests were reverted in #4872 due to transitive dependency conflicts introduced by the `docker-client` library. This patch should avoid those problems by using a version of `docker-client` which shades its transitive dependencies and by performing some build-magic to work around problems with that shaded JAR.
      
      In addition, I significantly refactored the tests to simplify the setup and teardown code and to fix several Docker networking issues which caused problems when running in `boot2docker`.
      
      Closes #8101.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #9503 from JoshRosen/docker-jdbc-tests.
    • [SPARK-11567] [PYTHON] Add Python API for corr Aggregate function · 32790fe7
      felixcheung authored
      For example: `df.agg(corr("col1", "col2"))`
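
      As a rough illustration of what a `corr` aggregate computes, here is a minimal plain-Scala sketch of the Pearson correlation coefficient over two local sequences. It does not use Spark; the object and method names (`CorrSketch`, `pearson`) are hypothetical and are not part of this patch.

      ```scala
      object CorrSketch {
        // Pearson correlation coefficient of two equally long, non-empty sequences.
        // Illustrative only: a local stand-in for what a corr aggregate returns.
        def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
          require(xs.nonEmpty && xs.size == ys.size, "inputs must be non-empty and equally long")
          val n = xs.size
          val meanX = xs.sum / n
          val meanY = ys.sum / n
          // Sum of products of deviations from the means (unnormalized covariance)
          val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
          // Sums of squared deviations (unnormalized variances)
          val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
          val varY = ys.map(y => (y - meanY) * (y - meanY)).sum
          cov / math.sqrt(varX * varY)
        }
      }
      ```

      Two perfectly linearly related columns give a value close to 1.0 (or -1.0 when one decreases as the other increases).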
      
      davies
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #9536 from felixcheung/pyfunc.
    • [SPARK-11550][DOCS] Replace example code in mllib-optimization.md using include_example · 638c51d9
      Pravin Gadakh authored
      Author: Pravin Gadakh <pravingadakh177@gmail.com>
      
      Closes #9516 from pravingadakh/SPARK-11550.
    • [SPARK-11616][SQL] Improve toString for Dataset · 724cf7a3
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #9586 from marmbrus/dataset-toString.
    • [SPARK-7316][MLLIB] RDD sliding window with step · dba1a62c
      unknown authored
      Implementation of step capability for sliding window function in MLlib's RDD.
      
      Though one can use current sliding window with step 1 and then filter every Nth window, it will take more time and space (N*data.count times more than needed). For example, below are the results for various windows and steps on 10M data points:
      
      Window | Step | Time (s) | Windows produced
      ------------ | ------------- | ---------- | ----------
      128 | 1 |  6.38 | 9999873
      128 | 10 | 0.9 | 999988
      128 | 100 | 0.41 | 99999
      1024 | 1 | 44.67 | 9998977
      1024 | 10 | 4.74 | 999898
      1024 | 100 | 0.78 | 99990
      ```
      import org.apache.spark.mllib.rdd.RDDFunctions._

      val rdd = sc.parallelize(1 to 10000000, 10)
      rdd.count  // run one job on the RDD before timing starts
      val window = 1024
      val step = 1

      // Time how long it takes to count all sliding windows, in seconds
      val t = System.nanoTime()
      val windows = rdd.sliding(window, step)
      println(windows.count)
      println((System.nanoTime() - t) / 1e9)
      ```
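
      The "Windows produced" column is consistent with counting only full windows, i.e. `(n - window) / step + 1` with integer division. The sketch below checks that arithmetic on local Scala collections. It is not the RDD implementation from this patch, and `SlidingStepSketch`, `fullWindowCount`, and `slidingFull` are hypothetical names. Note that Scala's own `sliding(size, step)` may emit a shorter trailing window, so the sketch filters it out.

      ```scala
      object SlidingStepSketch {
        // Number of full windows of size `window`, started every `step` elements,
        // that fit into a sequence of `n` elements.
        def fullWindowCount(n: Long, window: Int, step: Int): Long = {
          require(window > 0 && step > 0, "window and step must be positive")
          if (n < window) 0L else (n - window) / step + 1
        }

        // Local-collection analogue: slide with a step and keep only full-size windows.
        def slidingFull[T](data: Seq[T], window: Int, step: Int): Seq[Seq[T]] =
          data.sliding(window, step).filter(_.size == window).toSeq
      }
      ```

      For instance, `SlidingStepSketch.fullWindowCount(10000000L, 128, 10)` evaluates to 999988, matching the second row of the table.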
      
      Author: unknown <ulanov@ULANOV3.americas.hpqcorp.net>
      Author: Alexander Ulanov <nashb@yandex.ru>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5855 from avulanov/SPARK-7316-sliding.
    • [SPARK-11618][ML] Minor refactoring of basic ML import/export · 18350a57
      Joseph K. Bradley authored
      Refactoring
      * separated overwrite and param save logic in DefaultParamsWriter
      * added sparkVersion to DefaultParamsWriter
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #9587 from jkbradley/logreg-io.
    • [ML][R] SparkR::glm summary result to compare with native R · f14e9511
      Yanbo Liang authored
      Follow-up to #9561. Now that [SPARK-11587](https://issues.apache.org/jira/browse/SPARK-11587) has been fixed, we should compare the SparkR::glm summary result with native R output rather than a hard-coded one. mengxr
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9590 from yanboliang/glm-r-test.
    • [SPARK-10371][SQL] Implement subexpr elimination for UnsafeProjections · 87aedc48
      Nong Li authored
      This patch adds the building blocks for codegening subexpr elimination and implements
      it end to end for UnsafeProjection. The building blocks can be used to do the same thing
      for other operators.
      
      It introduces some utilities to compute common sub expressions. Expressions can be added to
      this data structure. The expr and its children will be recursively matched against existing
      expressions (ones previously added) and grouped into common groups. This is built using
      the existing `semanticEquals`. It does not understand things like commutative or associative
      expressions. This can be done as future work.
      
      After building this data structure, the codegen process takes advantage of it by:
        1. Generating a helper function in the generated class that computes the common
           subexpression. This is done for all common subexpressions that have at least
           two occurrences and whose expression tree is sufficiently complex.
        2. When generating the apply() function, if the helper function exists, calling that
           instead of regenerating the expression tree. Repeated calls to the helper function
           short-circuit the evaluation logic.
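
      The idea described above can be sketched on a toy expression tree in plain Scala. This is an interpreter-style illustration, not the generated-code implementation from this patch. Everything here is an assumption: case-class equality stands in for `semanticEquals`, "non-leaf" stands in for "sufficiently complex", and all names (`SubexprSketch`, `commonSubexprs`, `evalAll`) are hypothetical.

      ```scala
      object SubexprSketch {
        sealed trait Expr
        case class Col(name: String) extends Expr
        case class Lit(value: Int) extends Expr
        case class Add(left: Expr, right: Expr) extends Expr
        case class Mul(left: Expr, right: Expr) extends Expr

        def children(e: Expr): Seq[Expr] = e match {
          case Add(l, r) => Seq(l, r)
          case Mul(l, r) => Seq(l, r)
          case _         => Nil
        }

        // Every subtree of `e`, including `e` itself.
        def subtrees(e: Expr): Seq[Expr] = e +: children(e).flatMap(subtrees)

        // Non-leaf subexpressions that occur at least twice across all expressions.
        def commonSubexprs(exprs: Seq[Expr]): Set[Expr] =
          exprs.flatMap(subtrees)
            .filter(e => children(e).nonEmpty)
            .groupBy(identity)
            .collect { case (e, occurrences) if occurrences.size >= 2 => e }
            .toSet

        // Evaluates all expressions against one row. Each common subexpression is
        // computed once and then served from a cache. Returns the results and the
        // number of non-leaf evaluations that were actually performed.
        def evalAll(exprs: Seq[Expr], row: Map[String, Int]): (Seq[Int], Int) = {
          val common = commonSubexprs(exprs)
          val cache = scala.collection.mutable.Map.empty[Expr, Int]
          var computed = 0

          def memo(e: Expr)(compute: => Int): Int =
            cache.get(e) match {
              case Some(v) => v // repeated occurrence: skip the evaluation logic
              case None =>
                computed += 1
                val v = compute
                if (common(e)) cache(e) = v
                v
            }

          def eval(e: Expr): Int = e match {
            case Col(n)    => row(n)
            case Lit(v)    => v
            case Add(l, r) => memo(e)(eval(l) + eval(r))
            case Mul(l, r) => memo(e)(eval(l) * eval(r))
          }

          (exprs.map(eval), computed)
        }
      }
      ```

      With the two projections `a * b + 1` and `(a * b) * 2`, the shared `a * b` is computed once, so three non-leaf evaluations are performed instead of four.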
      
      Author: Nong Li <nong@databricks.com>
      Author: Nong Li <nongli@gmail.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Michael Armbrust <michael@databricks.com>
      
      Closes #9480 from nongli/spark-10371.
    • [SPARK-11590][SQL] use native json_tuple in lateral view · 53600854
      Wenchen Fan authored
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9562 from cloud-fan/json-tuple.
    • [SPARK-11578][SQL][FOLLOW-UP] complete the user facing api for typed aggregation · dfcfcbcc
      Wenchen Fan authored
      Currently the user facing api for typed aggregation has some limitations:
      
      * the customized typed aggregation must be first in the aggregation list
      * the customized typed aggregation can only use long as the buffer type
      * the customized typed aggregation can only use a flat type as the result type
      
      This PR tries to remove these limitations.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9599 from cloud-fan/agg.
    • [SPARK-10863][SPARKR] Method coltypes() (New version) · 47735cdc
      Oscar D. Lara Yejas authored
      This is a follow-up to PR #8984, whose branch was damaged.
      
      Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
      
      Closes #9579 from olarayej/SPARK-10863_NEW14.
    • [SPARK-9830][SQL] Remove AggregateExpression1 and Aggregate Operator used to evaluate AggregateExpression1s · e0701c75
      Yin Huai authored
      
      https://issues.apache.org/jira/browse/SPARK-9830
      
      This PR contains the following main changes.
      * Removing `AggregateExpression1`.
      * Removing `Aggregate` operator, which is used to evaluate `AggregateExpression1`.
      * Removing planner rule used to plan `Aggregate`.
      * Linking `MultipleDistinctRewriter` to analyzer.
      * Renaming `AggregateExpression2` to `AggregateExpression` and `AggregateFunction2` to `AggregateFunction`.
      * Updating places where we create aggregate expression. The way to create aggregate expressions is `AggregateExpression(aggregateFunction, mode, isDistinct)`.
      * Changing `val`s in `DeclarativeAggregate`s that touch children of this function to `lazy val`s (when we create aggregate expression in DataFrame API, children of an aggregate function can be unresolved).
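
      The last point can be illustrated with a minimal plain-Scala sketch of why a `lazy val` helps when a child may still be unresolved at construction time. The classes below (`Child`, `EagerAgg`, `LazyAgg`) are hypothetical stand-ins, not Spark classes.

      ```scala
      object LazyValSketch {
        // Hypothetical stand-in for a child expression whose type is only
        // known once it has been resolved.
        class Child(resolvedType: Option[String]) {
          def dataType: String =
            resolvedType.getOrElse(throw new IllegalStateException("unresolved child"))
        }

        // A plain val touches the child in the constructor, so building the
        // aggregate around an unresolved child fails immediately.
        class EagerAgg(child: Child) {
          val resultType: String = child.dataType
        }

        // A lazy val defers the access until first use, so the aggregate can be
        // constructed first and queried after resolution.
        class LazyAgg(child: Child) {
          lazy val resultType: String = child.dataType
        }
      }
      ```

      Constructing `EagerAgg` with an unresolved child throws, while `LazyAgg` can be constructed and only fails if `resultType` is read before the child is resolved.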
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9556 from yhuai/removeAgg1.
    • [SPARK-11252][NETWORK] ShuffleClient should release connection after fetching blocks had been completed for external shuffle · 6e5fc378
      Lianhui Wang authored
      
      With YARN's external shuffle service, each executor's ExternalShuffleClient keeps its connections to the YARN NodeManager open until the application has completed, so the NodeManager and the executors accumulate many socket connections.
      To reduce network pressure on the NodeManager's shuffle service, the connection to it needs to be closed once registerWithShuffleServer or fetchBlocks has completed in ExternalShuffleClient. andrewor14 rxin vanzin
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #9227 from lianhuiwang/spark-11252.
    • [SPARK-7841][BUILD] Stop using retrieveManaged to retrieve dependencies in SBT · 689386b1
      Josh Rosen authored
      This patch modifies Spark's SBT build so that it no longer uses `retrieveManaged` / `lib_managed` to store its dependencies. The motivations for this change are nicely described on the JIRA ticket ([SPARK-7841](https://issues.apache.org/jira/browse/SPARK-7841)); my personal interest in doing this stems from the fact that `lib_managed` has caused me some pain while debugging dependency issues in another PR of mine.
      
      Removing our use of `lib_managed` would be trivial except for one snag: the Datanucleus JARs, required by Spark SQL's Hive integration, cannot be included in assembly JARs due to problems with merging OSGI `plugin.xml` files. As a result, several places in the packaging and deployment pipeline assume that these Datanucleus JARs are copied to `lib_managed/jars`. In the interest of maintaining compatibility, I have chosen to retain the `lib_managed/jars` directory _only_ for these Datanucleus JARs and have added custom code to `SparkBuild.scala` to automatically copy those JARs to that folder as part of the `assembly` task.
      
      `dev/mima` also depended on `lib_managed` in a hacky way in order to set classpaths when generating MiMa excludes; I've updated this to obtain the classpaths directly from SBT instead.
      
      /cc dragos marmbrus pwendell srowen
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9575 from JoshRosen/SPARK-7841.
    • [SPARK-11382] Replace example code in mllib-decision-tree.md using include_example · a81f47ff
      Xusen Yin authored
      https://issues.apache.org/jira/browse/SPARK-11382
      
      This also fixes an error in naive_bayes_example.py.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #9596 from yinxusen/SPARK-11382.
    • Fix typo in driver page · 5507a9d0
      Paul Chandler authored
      "Comamnd property" => "Command property"
      
      Author: Paul Chandler <pestilence669@users.noreply.github.com>
      
      Closes #9578 from pestilence669/fix_spelling.
    • [SPARK-11598] [SQL] enable tests for ShuffledHashOuterJoin · 521b3cae
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9573 from davies/join_condition.
    • [SPARK-11599] [SQL] fix NPE when resolve Hive UDF in SQLParser · d6cd3a18
      Davies Liu authored
      The DataFrame APIs that take a SQL expression always use SQLParser, so the HiveFunctionRegistry is called outside of the Hive state, which causes an NPE if there is no active Session State for the current thread (in PySpark).
      
      cc rxin yhuai
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9576 from davies/hive_udf.
  2. Nov 09, 2015