  1. Feb 21, 2016
    • [SPARK-13080][SQL] Implement new Catalog API using Hive · 6c3832b2
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      This is a step towards merging `SQLContext` and `HiveContext`. A new internal Catalog API was introduced in #10982 and extended in #11069. This patch introduces an implementation of this API using `HiveClient`, an existing interface to Hive. It also extends `HiveClient` with additional calls to Hive that are needed to complete the catalog implementation.
      
      *Where should I start reviewing?* The new catalog introduced is `HiveCatalog`. This class is relatively simple because it just calls `HiveClientImpl`, where most of the new logic is. I would not start with `HiveClient`, `HiveQl`, or `HiveMetastoreCatalog`, which are modified mainly because of a refactor.
      
      *Why is this patch so big?* I had to refactor `HiveClient` to remove an intermediate representation of databases, tables, partitions etc. After this refactor, `CatalogTable` converts directly to and from `HiveTable` (etc.). Otherwise we would have to first convert `CatalogTable` to the intermediate representation and then convert that to `HiveTable`, which is messy.
      
      The new class hierarchy is as follows:
      ```
      org.apache.spark.sql.catalyst.catalog.Catalog
        - org.apache.spark.sql.catalyst.catalog.InMemoryCatalog
        - org.apache.spark.sql.hive.HiveCatalog
      ```
      
      Note that, as of this patch, none of these classes are used anywhere yet. That will come before the Spark 2.0 release.
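      
      A rough sketch of the delegation described above (names and signatures here are illustrative, not the actual Catalog API):
      ```scala
      // Hypothetical sketch of the delegation pattern; the real Catalog API
      // surface in Spark differs.
      case class CatalogTable(database: String, name: String)

      trait HiveClient {
        def createTable(table: CatalogTable): Unit
        def getTable(db: String, table: String): CatalogTable
      }

      abstract class Catalog {
        def createTable(table: CatalogTable): Unit
        def getTable(db: String, table: String): CatalogTable
      }

      // HiveCatalog stays thin: each call is forwarded to the HiveClient,
      // which talks to the Hive metastore (HiveClientImpl holds the logic).
      class HiveCatalog(client: HiveClient) extends Catalog {
        override def createTable(table: CatalogTable): Unit =
          client.createTable(table)
        override def getTable(db: String, table: String): CatalogTable =
          client.getTable(db, table)
      }
      ```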
      
      ## How was this patch tested?
      All existing unit tests, plus a new `HiveCatalogSuite` that extends `CatalogTestCases`.
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11293 from rxin/hive-catalog.
    • [SPARK-13137][SQL] NullPointerException in schema inference for CSV when the first line is empty · 7eb83fef
      hyukjinkwon authored
      https://issues.apache.org/jira/browse/SPARK-13137
      
      This PR adds a filter in schema inference so that it does not throw a NullPointerException.
      
      Also, I removed `MAX_COMMENT_LINES_IN_HEADER` and instead used monadic chaining with `filter()` and `first()`.
      
      Lastly, I simply added a newline rather than adding a new file for this, so that it is covered by the original tests.
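      
      A minimal sketch of that chaining, assuming the raw input is available as an `RDD[String]` (the method and parameter names here are hypothetical):
      ```scala
      import org.apache.spark.rdd.RDD

      // Skip blank (and comment) lines before taking the first line for
      // schema inference, so an empty first line can no longer lead to a
      // NullPointerException.
      def firstNonEmptyLine(lines: RDD[String], comment: Char): String =
        lines
          .filter(line => line.trim.nonEmpty && !line.startsWith(comment.toString))
          .first()
      ```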
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11023 from HyukjinKwon/SPARK-13137.
    • [SPARK-13136][SQL] Create a dedicated Broadcast exchange operator · b6a873d6
      Herman van Hovell authored
      Quite a few Spark SQL join operators broadcast one side of the join to all nodes. There are a few problems with this:
      
      - This conflates broadcasting (a data exchange) with joining. Data exchanges should be managed by a different operator.
      - All these nodes implement their own (duplicate) broadcasting logic.
      - Re-use of indices is quite hard.
      
      This PR defines both a `BroadcastDistribution` and a `BroadcastPartitioning`; these contain a `BroadcastMode`. The `BroadcastMode` defines the way in which we transform the array of `InternalRow`s into an index. We currently support the following `BroadcastMode`s:
      
      - `IdentityBroadcastMode`: broadcasts the rows in their original form.
      - `HashSetBroadcastMode`: applies a projection to the input rows, deduplicates them, and broadcasts the resulting `Set`.
      - `HashedRelationBroadcastMode`: transforms the input rows into a `HashedRelation` and broadcasts this index.
      
      To match this distribution we implement a `BroadcastExchange` operator, which performs the broadcast for us, and have `EnsureRequirements` plan this operator. The old `Exchange` operator has been renamed to `ShuffleExchange` in order to clearly separate shuffled and broadcast exchanges. Finally, the classes in `Exchange.scala` have been moved to a dedicated package.
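      
      A simplified sketch of the shape of this abstraction (the actual trait lives in Catalyst and the concrete modes carry more state):
      ```scala
      import org.apache.spark.sql.catalyst.InternalRow

      trait BroadcastMode {
        // Transform the collected rows into the structure that gets broadcast.
        def transform(rows: Array[InternalRow]): Any
      }

      // IdentityBroadcastMode ships the rows as-is; the hash-based modes
      // would build a Set or a HashedRelation here instead.
      case object IdentityBroadcastMode extends BroadcastMode {
        override def transform(rows: Array[InternalRow]): Any = rows
      }
      ```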
      
      cc rxin davies
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #11083 from hvanhovell/SPARK-13136.
    • [SPARK-13306][SQL] Addendum to uncorrelated scalar subquery · af441ddb
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This pull request fixes some minor issues (documentation, test flakiness, test organization) with #11190, which was merged earlier tonight.
      
      ## How was this patch tested?
      Unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11285 from rxin/subquery.
    • [SPARK-13420][SQL] Rename Subquery logical plan to SubqueryAlias · 0947f098
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch renames logical.Subquery to logical.SubqueryAlias, which is a more appropriate name for this operator (versus subqueries as expressions).
      
      ## How was this patch tested?
      Unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11288 from rxin/SPARK-13420.
    • [SPARK-13248][STREAMING] Remove deprecated Streaming APIs. · 1a340da8
      Luciano Resende authored
      Remove deprecated Streaming APIs and adjust sample applications.
      
      Author: Luciano Resende <lresende@apache.org>
      
      Closes #11139 from lresende/streaming-deprecated-apis.
    • [SPARK-12799] Simplify various string output for expressions · d9efe63e
      Cheng Lian authored
      This PR introduces several major changes:
      
      1. Replacing `Expression.prettyString` with `Expression.sql`
      
         The `prettyString` method is mostly an internal, developer-facing facility for debugging purposes, and shouldn't be exposed to users.
      
      1. Using SQL-like representations as column names for selected fields that are not named expressions (back-ticks and double quotes should be removed)
      
         Before, we were using `prettyString` for column names when possible, and the resulting column names could sometimes be weird.  Here are several examples:
      
         Expression         | `prettyString` | `sql`      | Note
         ------------------ | -------------- | ---------- | ---------------
         `a && b`           | `a && b`       | `a AND b`  |
         `a.getField("f")`  | `a[f]`         | `a.f`      | `a` is a struct
      
      1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)
      
         `NonSQLExpression.sql` may return an arbitrary user facing string representation of the expression.
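      
      A minimal sketch of the trait split described above (the real Catalyst definitions are considerably richer):
      ```scala
      // Illustrative only, not the actual Catalyst code.
      trait Expression {
        // SQL-like representation, also used for generated column names.
        def sql: String
      }

      // Expressions with no SQL form (e.g. Scala UDFs, object expressions
      // used by encoders) mix this in; their sql may be any user-facing
      // string representation.
      trait NonSQLExpression extends Expression {
        override def sql: String = toString
      }
      ```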
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
    • [SPARK-13416][GraphX] Add positive check for option 'numIter' in StronglyConnectedComponents · d806ed34
      Zheng RuiFeng authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13416
      
      ## What changes were proposed in this pull request?
      
      The output of `StronglyConnectedComponents` with `numIter` no greater than 1 may make no sense, so this patch adds a `require` check for it.
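      
      The guard amounts to something like the following sketch (the exact signature and message in GraphX may differ):
      ```scala
      import org.apache.spark.graphx.Graph

      def run[VD, ED](graph: Graph[VD, ED], numIter: Int): Unit = {
        // Fail fast on a non-positive iteration count instead of silently
        // producing meaningless output.
        require(numIter > 0,
          s"Number of iterations must be greater than 0, but got $numIter")
        // ... the Pregel-based SCC computation would follow here ...
      }
      ```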
      
      ## How was this patch tested?
      
      Unit tests passed.
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #11284 from zhengruifeng/scccheck.
  2. Feb 20, 2016
    • [SPARK-13306] [SQL] uncorrelated scalar subquery · 79250712
      Davies Liu authored
      A scalar subquery is a subquery that generates only a single row and a single column, and it can be used as part of an expression. An uncorrelated scalar subquery has no reference to an outer table.
      
      All uncorrelated scalar subqueries will be executed during `prepare()` of `SparkPlan`.
      
      The plans for the query
      ```sql
      select 1 + (select 2 + (select 3))
      ```
      look like this:
      ```
      == Parsed Logical Plan ==
      'Project [unresolvedalias((1 + subquery#1),None)]
      :- OneRowRelation$
      +- 'Subquery subquery#1
         +- 'Project [unresolvedalias((2 + subquery#0),None)]
            :- OneRowRelation$
            +- 'Subquery subquery#0
               +- 'Project [unresolvedalias(3,None)]
                  +- OneRowRelation$
      
      == Analyzed Logical Plan ==
      _c0: int
      Project [(1 + subquery#1) AS _c0#4]
      :- OneRowRelation$
      +- Subquery subquery#1
         +- Project [(2 + subquery#0) AS _c0#3]
            :- OneRowRelation$
            +- Subquery subquery#0
               +- Project [3 AS _c0#2]
                  +- OneRowRelation$
      
      == Optimized Logical Plan ==
      Project [(1 + subquery#1) AS _c0#4]
      :- OneRowRelation$
      +- Subquery subquery#1
         +- Project [(2 + subquery#0) AS _c0#3]
            :- OneRowRelation$
            +- Subquery subquery#0
               +- Project [3 AS _c0#2]
                  +- OneRowRelation$
      
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [(1 + subquery#1) AS _c0#4]
      :     :- INPUT
      :     +- Subquery subquery#1
      :        +- WholeStageCodegen
      :           :  +- Project [(2 + subquery#0) AS _c0#3]
      :           :     :- INPUT
      :           :     +- Subquery subquery#0
      :           :        +- WholeStageCodegen
      :           :           :  +- Project [3 AS _c0#2]
      :           :           :     +- INPUT
      :           :           +- Scan OneRowRelation[]
      :           +- Scan OneRowRelation[]
      +- Scan OneRowRelation[]
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11190 from davies/scalar_subquery.
    • [SPARK-13310] [SQL] Resolve Missing Sorting Columns in Generate · f88c641b
      gatorsmile authored
      ```scala
      // case 1: missing sort columns are resolvable if join is true
      sql("SELECT explode(a) AS val, b FROM data WHERE b < 2 order by val, c")
      // case 2: missing sort columns are not resolvable if join is false. Thus, issue an error message in this case
      sql("SELECT explode(a) AS val FROM data order by val, c")
      ```
      
      When sort columns are not in `Generate`, we can resolve them when `join` is equal to `true`. I am still trying to add more test cases for the other `UnaryNode` types.
      
      Could you review the changes? davies cloud-fan Thanks!
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11198 from gatorsmile/missingInSort.
    • [SPARK-13414][MESOS] Allow multiple dispatchers to be launched. · a4a081d1
      Timothy Chen authored
      ## What changes were proposed in this pull request?
      
      Users might want to start multiple Mesos dispatchers, as each dispatcher can potentially be part of a different role and used for multi-tenancy.
      
      To allow multiple Mesos dispatchers to be launched, we need to be able to specify an instance number when starting the dispatcher daemon.
      
      ## How was this patch tested?
      
      Manual testing
      
      Author: Timothy Chen <tnachen@gmail.com>
      
      Closes #11281 from tnachen/multiple_cluster_dispatchers.
    • [SPARK-13386][GRAPHX] ConnectedComponents should support maxIteration option · 6ce7c481
      Zheng RuiFeng authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13386
      
      ## What changes were proposed in this pull request?
      
      Add a `maxIteration` option for the `ConnectedComponents` algorithm.
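      
      A usage sketch (the edge-list path is a placeholder, `sc` is assumed to be an existing SparkContext, and the parameter name is assumed from the JIRA):
      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.graphx.GraphLoader
      import org.apache.spark.graphx.lib.ConnectedComponents

      def components(sc: SparkContext): Unit = {
        val graph = GraphLoader.edgeListFile(sc, "/path/to/edges.txt")
        // Stop after at most 10 iterations instead of running to convergence.
        val cc = ConnectedComponents.run(graph, maxIterations = 10)
        println(cc.vertices.take(5).mkString("\n"))
      }
      ```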
      
      ## How was this patch tested?
      
      Unit tests passed.
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #11268 from zhengruifeng/ccwithmax.
    • [SPARK-13302][PYSPARK][TESTS] Move the temp file creation and cleanup outside of the doctests · 9ca79c1e
      Holden Karau authored
      Some of the new doctests in ml/clustering.py have a lot of setup code, so this moves the setup code into the general test init to keep the doctests more example-style.
      In part this is a follow-up to https://github.com/apache/spark/pull/10999
      Note that the same pattern is followed in regression & recommendation, so we might as well clean up all three at the same time.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #11197 from holdenk/SPARK-13302-cleanup-doctests-in-ml-clustering.
    • [SPARK-13408] [CORE] Ignore errors when it's already reported in JobWaiter · dfb2ae2f
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      `JobWaiter.taskSucceeded` will be called for each task. When `resultHandler` throws an exception, `taskSucceeded` will also throw it for each task. DAGScheduler just catches it and reports it like this:
      ```Scala
                        try {
                          job.listener.taskSucceeded(rt.outputId, event.result)
                        } catch {
                          case e: Exception =>
                            // TODO: Perhaps we want to mark the resultStage as failed?
                            job.listener.jobFailed(new SparkDriverExecutionException(e))
                        }
      ```
      Therefore `JobWaiter.jobFailed` may be called multiple times.
      
      So `JobWaiter.jobFailed` should use `Promise.tryFailure` instead of `Promise.failure`, because the latter doesn't support being called multiple times.
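      
      The standard-library behavior that motivates the fix:
      ```scala
      import scala.concurrent.Promise

      val p = Promise[Unit]()
      p.failure(new RuntimeException("first"))      // completes the promise
      p.tryFailure(new RuntimeException("again"))   // returns false, no throw
      // p.failure(new RuntimeException("again"))   // would throw IllegalStateException
      ```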
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11280 from zsxwing/SPARK-13408.
    • Revert "[SPARK-12567] [SQL] Add aes_{encrypt,decrypt} UDFs" · 6624a588
      Reynold Xin authored
      This reverts commit 4f9a6648.
    • [SPARK-12567] [SQL] Add aes_{encrypt,decrypt} UDFs · 4f9a6648
      Kai Jiang authored
      Author: Kai Jiang <jiangkai@gmail.com>
      
      Closes #10527 from vectorijk/spark-12567.
    • [SPARK-12594] [SQL] Outer Join Elimination by Filter Conditions · ec7a1d6e
      gatorsmile authored
      This PR converts outer joins to stricter join types when the predicates in filter conditions restrict the result set so that all null-supplying rows are eliminated:
      
      - `full outer` -> `inner` if both sides have such predicates
      - `left outer` -> `inner` if the right side has such predicates
      - `right outer` -> `inner` if the left side has such predicates
      - `full outer` -> `left outer` if only the left side has such predicates
      - `full outer` -> `right outer` if only the right side has such predicates
      
      If applicable, this can greatly improve performance, since an outer join is much slower than an inner join, and a full outer join is much slower than a left/right outer join.
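      
      As an illustration (the table and column names here are hypothetical), a filter on the null-supplying side lets the optimizer rewrite a left outer join as an inner join:
      ```scala
      // t2 is the null-supplying side; the predicate t2.b > 0 rejects every
      // null-padded row, so LEFT OUTER JOIN is equivalent to INNER JOIN here.
      sql("SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.key = t2.key WHERE t2.b > 0")
      ```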
      
      The original PR is https://github.com/apache/spark/pull/10542
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #10567 from gatorsmile/outerJoinEliminationByFilterCond.
  3. Feb 18, 2016
    • [SPARK-13380][SQL][DOCUMENT] Document Rand(seed) and Randn(seed) Return Indeterministic Results When Data Partitions are not fixed. · c776fce9
      gatorsmile authored
      
      `rand` and `randn` functions with a `seed` argument are commonly used. Common sense suggests that their results should be deterministic when a `seed` value is provided. For example, MS SQL Server also has a `rand` function; regarding the `seed` parameter, its documentation reads: "Seed is an integer expression (tinyint, smallint, or int) that gives the seed value. If seed is not specified, the SQL Server Database Engine assigns a seed value at random. For a specified seed value, the result returned is always the same."
      
      Update: the current implementation is unable to generate deterministic results when the partitions are not fixed. This PR documents this issue in the function descriptions.
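      
      A hedged illustration of the caveat, assuming an existing SQLContext `sqlContext` (each partition derives its generator state from the seed and its partition index, so changing the partitioning changes the values):
      ```scala
      import org.apache.spark.sql.functions.rand

      val a = sqlContext.range(0, 4).repartition(1).select(rand(7))
      val b = sqlContext.range(0, 4).repartition(2).select(rand(7))
      // Same seed, same data, but a.collect() and b.collect() will generally
      // not match row-for-row because the partitioning differs.
      ```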
      
      jkbradley hit an issue and provided an example in the following JIRA: https://issues.apache.org/jira/browse/SPARK-13333
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11232 from gatorsmile/randSeed.
    • [SPARK-13237] [SQL] generated broadcast outer join · 95e1ab22
      Davies Liu authored
      This PR support codegen for broadcast outer join.
      
      In order to reduce the duplicated codes, this PR merge HashJoin and HashOuterJoin together (also BroadcastHashJoin and BroadcastHashOuterJoin).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11130 from davies/gen_out.
    • [SPARK-13351][SQL] fix column pruning on Expand · 26f38bb8
      Davies Liu authored
      Currently, the columns in projects of Expand that are not used by Aggregate are not pruned, this PR fix that.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11225 from davies/fix_pruning_expand.
    • [SPARK-13371][CORE][STRING] TaskSetManager.dequeueSpeculativeTask compares Option and String directly. · 78562535
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
      Fix some comparisons between unequal types that cause IntelliJ warnings and, in at least one case, a likely bug (`TaskSetManager`).
      
      ## How was this patch tested?
      
      Running Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11253 from srowen/SPARK-13371.
  4. Feb 16, 2016
    • [SPARK-11627] Add initial input rate limit for spark streaming backpressure mechanism. · 7218c0eb
      junhao authored
      https://issues.apache.org/jira/browse/SPARK-11627
      
      The Spark Streaming backpressure mechanism has no initial input rate limit, which might cause an OOM exception.
      In the first batch, receivers receive data at the maximum speed they can reach, which might exhaust executor memory. Adding an initial input rate limit value makes sure the streaming job executes successfully in the first batch; after that, the backpressure mechanism can adjust the receiving rate adaptively.
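      
      A configuration sketch of how the new limit would be used together with backpressure (the rate value is an arbitrary example; the config key is `spark.streaming.backpressure.initialRate` to my understanding of this PR):
      ```scala
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val conf = new SparkConf()
        .setAppName("backpressure-initial-rate")
        .set("spark.streaming.backpressure.enabled", "true")
        // Cap the very first batch, before the rate estimator has any
        // processing-time feedback to work from.
        .set("spark.streaming.backpressure.initialRate", "1000")

      val ssc = new StreamingContext(conf, Seconds(1))
      ```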
      
      Author: junhao <junhao@mogujie.com>
      
      Closes #9593 from junhaoMg/junhao-dev.
    • [SPARK-13308] ManagedBuffers passed to OneToOneStreamManager need to be freed in non-error cases · 5f37aad4
      Josh Rosen authored
      ManagedBuffers that are passed to `OneToOneStreamManager.registerStream` need to be freed by the manager once it's done using them. However, the current code only frees them in certain error-cases and not during typical operation. This isn't a major problem today, but it will cause memory leaks after we implement better locking / pinning in the BlockManager (see #10705).
      
      This patch modifies the relevant network code so that the ManagedBuffers are freed as soon as the messages containing them are processed by the lower-level Netty message sending code.
      
      /cc zsxwing for review.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11193 from JoshRosen/add-missing-release-calls-in-network-layer.
    • [SPARK-13280][STREAMING] Use a better logger name for FileBasedWriteAheadLog. · c7d00a24
      Marcelo Vanzin authored
      The new logger name is under the org.apache.spark namespace.
      The detection of the caller name was also enhanced a bit to ignore
      some common things that show up in the call stack.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11165 from vanzin/SPARK-13280.