  1. Feb 22, 2016
    • [SPARK-13399][STREAMING] Fix checkpointsuite type erasure warnings · 1b144455
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Change how CheckpointSuite gets the output streams so the generic type is explicitly marked as unchecked, avoiding the type erasure warnings. This only impacts test code.
      
      Alternatively, we could encode the type tag in TestOutputStreamWithPartitions and filter on the type tag as well - but this is unnecessary, since multiple TestOutputStreams are not registered and the previous code was not actually checking this type.
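      
      A minimal sketch of the technique, with illustrative names (not the suite's exact code): annotating the erased type argument with `@unchecked` suppresses the "type argument is unchecked since it is eliminated by erasure" warning.
      
      ```scala
      // Illustrative stand-in for the test output stream class.
      class Box[T](val value: T)
      
      // Find the first Box, trusting the caller about the element type V:
      // `V @unchecked` tells the compiler not to warn that erasure prevents
      // verifying the type argument at runtime.
      def firstBoxOf[V](items: Seq[Any]): Option[Box[V]] =
        items.collectFirst { case b: Box[V @unchecked] => b }
      ```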
      
      ## How was this patch tested?
      
      unit tests (streaming/testOnly org.apache.spark.streaming.CheckpointSuite)
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #11286 from holdenk/SPARK-13399-checkpointsuite-type-erasure.
      1b144455
    • [SPARK-12153][SPARK-7617][MLLIB] add support of arbitrary length sentence and... · ef1047fc
      Yong Gang Cao authored
      [SPARK-12153][SPARK-7617][MLLIB] add support of arbitrary length sentence and other tuning for Word2Vec
      
      add support for arbitrary-length sentences by using the natural representation of sentences in the input.
      
      add new similarity functions and a normalization option for distances in synonym finding
      add new accessors for internal structures (the vocabulary and word index) for convenience
      
      Instructions are needed on what value to use for the Since annotation for the newly added public functions. 1.5.3?
      
      jira link: https://issues.apache.org/jira/browse/SPARK-12153
      
      Author: Yong Gang Cao <ygcao@amazon.com>
      Author: Yong-Gang Cao <ygcao@users.noreply.github.com>
      
      Closes #10152 from ygcao/improvementForSentenceBoundary.
      ef1047fc
    • [SPARK-13186][STREAMING] migrate away from SynchronizedMap · 8f35d3ea
      Huaxin Gao authored
      The trait SynchronizedMap in package mutable is deprecated: "Synchronization via traits is deprecated as it is inherently unreliable." Change to java.util.concurrent.ConcurrentHashMap instead.
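      
      A minimal sketch of the migration this patch performs (illustrative names, not the patch itself):
      
      ```scala
      import java.util.concurrent.ConcurrentHashMap
      
      // Before (deprecated):
      //   val timeToFile = new mutable.HashMap[Long, String]()
      //     with mutable.SynchronizedMap[Long, String]
      // After: a ConcurrentHashMap needs no external locking.
      val timeToFile = new ConcurrentHashMap[Long, String]()
      timeToFile.put(1L, "part-00000")
      timeToFile.putIfAbsent(1L, "ignored") // atomic check-then-act
      val file = timeToFile.get(1L)         // returns null when absent, so guard accordingly
      ```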
      
      Author: Huaxin Gao <huaxing@us.ibm.com>
      
      Closes #11250 from huaxingao/spark__13186.
      8f35d3ea
    • [SPARK-13426][CORE] Remove the support of SIMR · 39ff1545
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      This PR removes support for SIMR. SIMR has not been actively used or maintained for a long time, and it is not supported from `SparkSubmit`, so this proposes removing it.
      
      ## How was this patch tested?
      
      This patch is tested locally by running unit tests.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #11296 from jerryshao/SPARK-13426.
      39ff1545
  2. Feb 21, 2016
    • [SPARK-13379][MLLIB] Fix MLlib LogisticRegressionWithLBFGS set regularization incorrectly · 8a4ed788
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Fix the MLlib LogisticRegressionWithLBFGS regularization mapping as follows (see the sketch after the list):
      ```SquaredL2Updater``` -> ```elasticNetParam = 0.0```
      ```L1Updater``` -> ```elasticNetParam = 1.0```
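      A minimal sketch of the corrected mapping as a hypothetical helper (not MLlib's actual code): `SquaredL2Updater` is a pure L2 (ridge) penalty and `L1Updater` a pure L1 (lasso) penalty.
      
      ```scala
      import org.apache.spark.mllib.optimization.{L1Updater, SquaredL2Updater, Updater}
      
      // Hypothetical helper mapping an mllib Updater to an ml elasticNetParam.
      def elasticNetParamFor(updater: Updater): Double = updater match {
        case _: SquaredL2Updater => 0.0 // pure L2 penalty
        case _: L1Updater        => 1.0 // pure L1 penalty
        case _                   => 0.0 // no regularization: value is irrelevant
      }
      ```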
      cc dbtsai
      ## How was this patch tested?
      unit tests
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11258 from yanboliang/spark-13379.
      8a4ed788
    • [HOTFIX] Fix compilation break · 9bf6a926
      Reynold Xin authored
      9bf6a926
    • [SPARK-13381][SQL] Support for loading CSV with a single function call · 819b0ea0
      hyukjinkwon authored
      https://issues.apache.org/jira/browse/SPARK-13381
      
      This PR adds support for loading CSV data directly with a single call, given the paths.
      
      Also, I corrected schema inference to refer to all paths rather than only the first path, as the JSON data source does.
      
      Several unit tests were added for each piece of functionality.
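      
      A hypothetical usage sketch (the exact `csv(...)` signature and option names follow the later DataFrameReader API and may differ from this PR's initial form):
      
      ```scala
      import org.apache.spark.sql.SQLContext
      
      // Load CSV in a single call instead of going through format("csv").load(...).
      def loadCsv(sqlContext: SQLContext): Unit = {
        val df = sqlContext.read
          .option("header", "true")
          .csv("data/people.csv") // illustrative path
        df.printSchema()
      }
      ```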
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11262 from HyukjinKwon/SPARK-13381.
      819b0ea0
    • [SPARK-13321][SQL] Support nested UNION in parser · 55d6fdf2
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13321
      
      The following SQL cannot be parsed with the current parser:
      
          SELECT  `u_1`.`id` FROM (((SELECT  `t0`.`id` FROM `default`.`t0`) UNION ALL (SELECT  `t0`.`id` FROM `default`.`t0`)) UNION ALL (SELECT  `t0`.`id` FROM `default`.`t0`)) AS u_1
      
      We should fix it.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #11204 from viirya/nested-union.
      55d6fdf2
    • [SPARK-3650][GRAPHX] Triangle Count handles reverse edges incorrectly · 3d79f106
      Robin East authored
      jegonzal ankurdave please could you review
      
      ## What changes were proposed in this pull request?
      
      Reworking of jegonzal's PR #2495 to address the issue identified in SPARK-3650. The code is amended to use the convertToCanonicalEdges method.
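      
      A minimal sketch of the idea behind the fix (illustrative, not the patched TriangleCount itself): canonicalize edge directions, merging duplicates, before counting triangles so reverse edges are not double-counted.
      
      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}
      
      def countTriangles(sc: SparkContext) = {
        val graph = GraphLoader.edgeListFile(sc, "data/edges.txt") // illustrative path
          .partitionBy(PartitionStrategy.RandomVertexCut)
        // convertToCanonicalEdges orients every edge with srcId < dstId and merges duplicates.
        graph.convertToCanonicalEdges().triangleCount().vertices
      }
      ```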
      
      ## How was this patch tested?
      
      The patch was tested using the unit tests created in PR #2495.
      
      Author: Robin East <robin.east@xense.co.uk>
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #11290 from insidedctm/spark-3650.
      3d79f106
    • [SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns. · 0f90f4e6
      Franklyn D'souza authored
      ## What changes were proposed in this pull request?
      
      This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations.
      
      This was previously causing `AnalysisException: u"unresolved operator 'Union;"` when trying to unionAll two DataFrames with UDT columns, as below.
      
      ```
      from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
      from pyspark.sql import types
      
      schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])
      
      a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
      b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)
      
      c = a.unionAll(b)
      ```
      
      ## How was this patch tested?
      
      Tested using two unit tests in sql/test.py and the DataFrameSuite.
      
      Additional information here : https://issues.apache.org/jira/browse/SPARK-13410
      
      Author: Franklyn D'souza <franklynd@gmail.com>
      
      Closes #11279 from damnMeddlingKid/udt-union-all.
      0f90f4e6
    • [SPARK-13271][SQL] Better error message if 'path' is not specified · 0cbadf28
      Shixiong Zhu authored
      Improved the error message as per discussion in https://github.com/apache/spark/pull/11034#discussion_r52111238. Also made `path` and `metadataPath` in FileStreamSource case insensitive.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11154 from zsxwing/path.
      0cbadf28
    • [SPARK-13405][STREAMING][TESTS] Make sure no messages leak to the next test · 76bd98d9
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Fixed the test failure `org.apache.spark.sql.util.ContinuousQueryListenerSuite.event ordering`: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/202/testReport/junit/org.apache.spark.sql.util/ContinuousQueryListenerSuite/event_ordering/
      
      ```
            org.scalatest.exceptions.TestFailedException:
      Assert failed: : null equaled null onQueryTerminated called before onQueryStarted
      org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
      	org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
      	org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
      	org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector$$anonfun$onQueryTerminated$1.apply$mcV$sp(ContinuousQueryListenerSuite.scala:204)
      	org.scalatest.concurrent.AsyncAssertions$Waiter.apply(AsyncAssertions.scala:349)
      	org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector.onQueryTerminated(ContinuousQueryListenerSuite.scala:203)
      	org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:67)
      	org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:32)
      	org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
      	org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.postToAll(ContinuousQueryListenerBus.scala:32)
      ```
      
      In the previous code, when the test `adding and removing listener` finishes, there may still be some QueryTerminated events in the listener bus queue. Then, when `event ordering` starts to run, it may see these events and throw the above exception.
      
      This PR just adds `waitUntilEmpty` in `after` to make sure all events are consumed after each test.
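      
      A minimal sketch of the pattern (the suite's actual bus and field names differ; `waitUntilEmpty(timeoutMillis)` mirrors the helper on Spark's internal listener buses):
      
      ```scala
      import org.scalatest.{BeforeAndAfter, FunSuite}
      
      class EventOrderingSuite extends FunSuite with BeforeAndAfter {
        // Stand-in for the listener bus under test.
        private class StubBus { def waitUntilEmpty(timeoutMillis: Long): Unit = () }
        private val listenerBus = new StubBus
      
        after {
          // Drain pending events so a QueryTerminated event from one test
          // cannot be observed by the next test.
          listenerBus.waitUntilEmpty(10 * 1000)
        }
      }
      ```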
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11275 from zsxwing/SPARK-13405.
      76bd98d9
    • [MINOR][DOCS] Fix typos in `configuration.md` and `hardware-provisioning.md` · 03e62aa3
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes some typos in the following documentation files.
       * `NOTICE`, `configuration.md`, and `hardware-provisioning.md`.
      
      ## How was this patch tested?
      
      manual tests
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11289 from dongjoon-hyun/minor_fix_typos_notice_and_confdoc.
      03e62aa3
    • [SPARK-13080][SQL] Implement new Catalog API using Hive · 6c3832b2
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      This is a step towards merging `SQLContext` and `HiveContext`. A new internal Catalog API was introduced in #10982 and extended in #11069. This patch introduces an implementation of this API using `HiveClient`, an existing interface to Hive. It also extends `HiveClient` with additional calls to Hive that are needed to complete the catalog implementation.
      
      *Where should I start reviewing?* The new catalog introduced is `HiveCatalog`. This class is relatively simple because it just calls `HiveClientImpl`, where most of the new logic is. I would not start with `HiveClient`, `HiveQl`, or `HiveMetastoreCatalog`, which are modified mainly because of a refactor.
      
      *Why is this patch so big?* I had to refactor HiveClient to remove an intermediate representation of databases, tables, partitions, etc. After this refactor, `CatalogTable` converts directly to and from `HiveTable` (etc.). Otherwise we would have to first convert `CatalogTable` to the intermediate representation and then convert that to `HiveTable`, which is messy.
      
      The new class hierarchy is as follows:
      ```
      org.apache.spark.sql.catalyst.catalog.Catalog
        - org.apache.spark.sql.catalyst.catalog.InMemoryCatalog
        - org.apache.spark.sql.hive.HiveCatalog
      ```
      
      Note that, as of this patch, none of these classes are used anywhere yet. That will come in the future, before the Spark 2.0 release.
      
      ## How was this patch tested?
      All existing unit tests, and HiveCatalogSuite that extends CatalogTestCases.
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11293 from rxin/hive-catalog.
      6c3832b2
    • [SPARK-13137][SQL] NullPointerException in schema inference for CSV when the first line is empty · 7eb83fef
      hyukjinkwon authored
      https://issues.apache.org/jira/browse/SPARK-13137
      
      This PR adds a filter in schema inference so that it does not throw a NullPointerException.
      
      Also, I removed `MAX_COMMENT_LINES_IN_HEADER` and instead used monadic chaining with `filter()` and `first()`.
      
      Lastly, I simply added a newline rather than adding a new file for this so that this is covered with the original tests.
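      
      A minimal sketch of the idea (illustrative, not the data source's exact code): skip blank and comment lines before picking the line used for header/schema inference, instead of assuming the first line is usable.
      
      ```scala
      import org.apache.spark.rdd.RDD
      
      def firstUsableLine(lines: RDD[String], comment: Char = '#'): String =
        lines
          .filter(line => line.trim.nonEmpty && !line.startsWith(comment.toString))
          .first() // still throws if the file contains no usable line at all
      ```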
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11023 from HyukjinKwon/SPARK-13137.
      7eb83fef
    • [SPARK-13136][SQL] Create a dedicated Broadcast exchange operator · b6a873d6
      Herman van Hovell authored
      Quite a few Spark SQL join operators broadcast one side of the join to all nodes. There are a few problems with this:
      
      - This conflates broadcasting (a data exchange) with joining. Data exchanges should be managed by a different operator.
      - All these nodes implement their own (duplicate) broadcasting logic.
      - Re-use of indices is quite hard.
      
      This PR defines both a ```BroadcastDistribution``` and a ```BroadcastPartitioning```; these contain a `BroadcastMode`. The `BroadcastMode` defines the way in which we transform the Array of `InternalRow`s into an index. We currently support the following `BroadcastMode`s:
      
      - IdentityBroadcastMode: This broadcasts the rows in their original form.
      - HashSetBroadcastMode: This applies a projection to the input rows, deduplicates these rows and broadcasts the resulting `Set`.
      - HashedRelationBroadcastMode: This transforms the input rows into a `HashedRelation`, and broadcasts this index.
      
      To match this distribution we implement a ```BroadcastExchange``` operator which will perform the broadcast for us, and have ```EnsureRequirements``` plan this operator. The old Exchange operator has been renamed to ShuffleExchange in order to clearly separate shuffled and broadcast exchanges. Finally, the classes in Exchange.scala have been moved to a dedicated package.
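      
      A minimal sketch of the shape described above (the real classes live in Spark's catalyst package; signatures here are simplified and `InternalRow` is stubbed to keep the sketch self-contained):
      
      ```scala
      trait InternalRow // stub for org.apache.spark.sql.catalyst.InternalRow
      
      trait BroadcastMode {
        // Turn the collected build-side rows into whatever gets broadcast:
        // the rows themselves, a deduplicated Set, or a hashed index.
        def transform(rows: Array[InternalRow]): Any
      }
      
      case object IdentityBroadcastMode extends BroadcastMode {
        override def transform(rows: Array[InternalRow]): Any = rows // rows in their original form
      }
      ```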
      
      cc rxin davies
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #11083 from hvanhovell/SPARK-13136.
      b6a873d6
    • [SPARK-13306][SQL] Addendum to uncorrelated scalar subquery · af441ddb
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This pull request fixes some minor issues (documentation, test flakiness, test organization) with #11190, which was merged earlier tonight.
      
      ## How was this patch tested?
      unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11285 from rxin/subquery.
      af441ddb
    • [SPARK-13420][SQL] Rename Subquery logical plan to SubqueryAlias · 0947f098
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch renames logical.Subquery to logical.SubqueryAlias, which is a more appropriate name for this operator (versus subqueries as expressions).
      
      ## How was this patch tested?
      Unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11288 from rxin/SPARK-13420.
      0947f098
    • [SPARK-13248][STREAMING] Remove deprecated Streaming APIs. · 1a340da8
      Luciano Resende authored
      Remove deprecated Streaming APIs and adjust sample applications.
      
      Author: Luciano Resende <lresende@apache.org>
      
      Closes #11139 from lresende/streaming-deprecated-apis.
      1a340da8
    • [SPARK-12799] Simplify various string output for expressions · d9efe63e
      Cheng Lian authored
      This PR introduces several major changes:
      
      1. Replacing `Expression.prettyString` with `Expression.sql`
      
         The `prettyString` method is mostly an internal, developer-facing facility for debugging purposes, and shouldn't be exposed to users.
      
      1. Using SQL-like representation as column names for selected fields that are not named expressions (back-ticks and double quotes should be removed)
      
         Before, we were using `prettyString` as column names when possible, and sometimes the resulting column names could be weird.  Here are several examples:
      
         Expression         | `prettyString` | `sql`      | Note
         ------------------ | -------------- | ---------- | ---------------
         `a && b`           | `a && b`       | `a AND b`  |
         `a.getField("f")`  | `a[f]`         | `a.f`      | `a` is a struct
      
      1. Adding trait `NonSQLExpression` extending `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)
      
         `NonSQLExpression.sql` may return an arbitrary user-facing string representation of the expression.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
      d9efe63e
    • [SPARK-13416][GraphX] Add positive check for option 'numIter' in StronglyConnectedComponents · d806ed34
      Zheng RuiFeng authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13416
      
      ## What changes were proposed in this pull request?
      
      The output of StronglyConnectedComponents with `numIter` no greater than 1 may make no sense, so I just added a `require` check for it.
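      
      A minimal sketch of the added precondition (the exact bound and message follow the PR; names are illustrative):
      
      ```scala
      import org.apache.spark.graphx.Graph
      
      def run[VD, ED](graph: Graph[VD, ED], numIter: Int): Unit = {
        // Reject non-positive iteration counts up front instead of
        // silently producing meaningless output.
        require(numIter > 0,
          s"Number of iterations must be greater than 0, but got $numIter")
        // ... existing Pregel-based SCC iterations ...
      }
      ```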
      
      ## How was this patch tested?
      
       unit tests passed
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #11284 from zhengruifeng/scccheck.
      d806ed34
  3. Feb 20, 2016
    • [SPARK-13306] [SQL] uncorrelated scalar subquery · 79250712
      Davies Liu authored
      A scalar subquery is a subquery that generates only a single row and a single column, and can be used as part of an expression. An uncorrelated scalar subquery is one that does not reference an external table.
      
      All uncorrelated scalar subqueries will be executed during prepare() of SparkPlan.
      
      The plans for the query
      ```sql
      select 1 + (select 2 + (select 3))
      ```
      look like this:
      ```
      == Parsed Logical Plan ==
      'Project [unresolvedalias((1 + subquery#1),None)]
      :- OneRowRelation$
      +- 'Subquery subquery#1
         +- 'Project [unresolvedalias((2 + subquery#0),None)]
            :- OneRowRelation$
            +- 'Subquery subquery#0
               +- 'Project [unresolvedalias(3,None)]
                  +- OneRowRelation$
      
      == Analyzed Logical Plan ==
      _c0: int
      Project [(1 + subquery#1) AS _c0#4]
      :- OneRowRelation$
      +- Subquery subquery#1
         +- Project [(2 + subquery#0) AS _c0#3]
            :- OneRowRelation$
            +- Subquery subquery#0
               +- Project [3 AS _c0#2]
                  +- OneRowRelation$
      
      == Optimized Logical Plan ==
      Project [(1 + subquery#1) AS _c0#4]
      :- OneRowRelation$
      +- Subquery subquery#1
         +- Project [(2 + subquery#0) AS _c0#3]
            :- OneRowRelation$
            +- Subquery subquery#0
               +- Project [3 AS _c0#2]
                  +- OneRowRelation$
      
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [(1 + subquery#1) AS _c0#4]
      :     :- INPUT
      :     +- Subquery subquery#1
      :        +- WholeStageCodegen
      :           :  +- Project [(2 + subquery#0) AS _c0#3]
      :           :     :- INPUT
      :           :     +- Subquery subquery#0
      :           :        +- WholeStageCodegen
      :           :           :  +- Project [3 AS _c0#2]
      :           :           :     +- INPUT
      :           :           +- Scan OneRowRelation[]
      :           +- Scan OneRowRelation[]
      +- Scan OneRowRelation[]
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11190 from davies/scalar_subquery.
      79250712
    • [SPARK-13310] [SQL] Resolve Missing Sorting Columns in Generate · f88c641b
      gatorsmile authored
      ```scala
      // case 1: missing sort columns are resolvable if join is true
      sql("SELECT explode(a) AS val, b FROM data WHERE b < 2 order by val, c")
      // case 2: missing sort columns are not resolvable if join is false. Thus, issue an error message in this case
      sql("SELECT explode(a) AS val FROM data order by val, c")
      ```
      
      When sort columns are not in `Generate`, we can resolve them when `join` is equal to `true`. Still trying to add more test cases for the other `UnaryNode` types.
      
      Could you review the changes? davies cloud-fan Thanks!
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11198 from gatorsmile/missingInSort.
      f88c641b
    • [SPARK-13414][MESOS] Allow multiple dispatchers to be launched. · a4a081d1
      Timothy Chen authored
      ## What changes were proposed in this pull request?
      
      Users might want to start multiple Mesos dispatchers, as each dispatcher can potentially be part of different roles and used for multi-tenancy.
      
      To allow multiple Mesos dispatchers to be launched, we need to be able to specify an instance number when starting the dispatcher daemon.
      
      ## How was this patch tested?
      
      Manual testing
      
      Author: Timothy Chen <tnachen@gmail.com>
      
      Closes #11281 from tnachen/multiple_cluster_dispatchers.
      a4a081d1
    • [SPARK-13386][GRAPHX] ConnectedComponents should support maxIteration option · 6ce7c481
      Zheng RuiFeng authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13386
      
      ## What changes were proposed in this pull request?
      
      Add a maxIteration option for the ConnectedComponents algorithm, as in the usage sketch below.
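      
      A minimal usage sketch, assuming the parameter is named `maxIterations` as in later GraphX releases (the path is illustrative):
      
      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.graphx.GraphLoader
      import org.apache.spark.graphx.lib.ConnectedComponents
      
      def boundedCC(sc: SparkContext) = {
        val graph = GraphLoader.edgeListFile(sc, "data/edges.txt")
        // Cap the Pregel iterations instead of running to fixpoint convergence.
        ConnectedComponents.run(graph, maxIterations = 10).vertices
      }
      ```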
      
      ## How was the this patch tested?
      
      unit tests passed
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #11268 from zhengruifeng/ccwithmax.
      6ce7c481
    • [SPARK-13302][PYSPARK][TESTS] Move the temp file creation and cleanup outside of the doctests · 9ca79c1e
      Holden Karau authored
      Some of the new doctests in ml/clustering.py have a lot of setup code; move the setup code to the general test init to keep the doctests more example-style looking.
      In part this is a follow up to https://github.com/apache/spark/pull/10999
      Note that the same pattern is followed in regression & recommendation - might as well clean up all three at the same time.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #11197 from holdenk/SPARK-13302-cleanup-doctests-in-ml-clustering.
      9ca79c1e
    • [SPARK-13408] [CORE] Ignore errors when it's already reported in JobWaiter · dfb2ae2f
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      `JobWaiter.taskSucceeded` will be called for each task. When `resultHandler` throws an exception, `taskSucceeded` will also throw it for each task. DAGScheduler just catches it and reports it like this:
      ```Scala
                        try {
                          job.listener.taskSucceeded(rt.outputId, event.result)
                        } catch {
                          case e: Exception =>
                            // TODO: Perhaps we want to mark the resultStage as failed?
                            job.listener.jobFailed(new SparkDriverExecutionException(e))
                        }
      ```
      Therefore `JobWaiter.jobFailed` may be called multiple times.
      
      So `JobWaiter.jobFailed` should use `Promise.tryFailure` instead of `Promise.failure`, because the latter does not support being called multiple times.
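      
      A minimal sketch of the difference in the Scala standard library:
      
      ```scala
      import scala.concurrent.Promise
      
      val p = Promise[Unit]()
      // failure() throws IllegalStateException on an already-completed promise;
      // tryFailure() just returns false instead.
      p.tryFailure(new RuntimeException("first error"))  // completes the promise; returns true
      p.tryFailure(new RuntimeException("second error")) // already completed; returns false
      ```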
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11280 from zsxwing/SPARK-13408.
      dfb2ae2f
    • Revert "[SPARK-12567] [SQL] Add aes_{encrypt,decrypt} UDFs" · 6624a588
      Reynold Xin authored
      This reverts commit 4f9a6648.
      6624a588
    • [SPARK-12567] [SQL] Add aes_{encrypt,decrypt} UDFs · 4f9a6648
      Kai Jiang authored
      Author: Kai Jiang <jiangkai@gmail.com>
      
      Closes #10527 from vectorijk/spark-12567.
      4f9a6648
    • [SPARK-12594] [SQL] Outer Join Elimination by Filter Conditions · ec7a1d6e
      gatorsmile authored
      Convert outer joins to stricter join types when the predicates in filter conditions can restrict the result sets so that all null-supplying rows are eliminated:
      
      - `full outer` -> `inner` if both sides have such predicates
      - `left outer` -> `inner` if the right side has such predicates
      - `right outer` -> `inner` if the left side has such predicates
      - `full outer` -> `left outer` if only the left side has such predicates
      - `full outer` -> `right outer` if only the right side has such predicates
      
      If applicable, this can greatly improve performance, since an outer join is much slower than an inner join, and a full outer join is much slower than a left/right outer join.
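      
      An illustrative sketch of a join this rule can rewrite, from the DataFrame side (column names are hypothetical; this is not the optimizer code):
      
      ```scala
      import org.apache.spark.sql.DataFrame
      
      def eliminableOuterJoin(left: DataFrame, right: DataFrame): DataFrame =
        left.join(right, left("id") === right("id"), "left_outer")
          // A null-intolerant predicate on the right side rejects every
          // null-supplying row, so this plans as an inner join.
          .filter(right("amount") > 0)
      ```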
      
      The original PR is https://github.com/apache/spark/pull/10542
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #10567 from gatorsmile/outerJoinEliminationByFilterCond.
      ec7a1d6e
  4. Feb 19, 2016
  5. Feb 18, 2016
    • [SPARK-13380][SQL][DOCUMENT] Document Rand(seed) and Randn(seed) Return... · c776fce9
      gatorsmile authored
      [SPARK-13380][SQL][DOCUMENT] Document Rand(seed) and Randn(seed) Return Indeterministic Results When Data Partitions are not fixed.
      
      `rand` and `randn` functions with a `seed` argument are commonly used. Based on common sense, the results of `rand` and `randn` should be deterministic if the `seed` parameter value is provided. For example, MS SQL Server also has a `rand` function. Regarding the parameter `seed`, its description reads: ```Seed is an integer expression (tinyint, smallint, or int) that gives the seed value. If seed is not specified, the SQL Server Database Engine assigns a seed value at random. For a specified seed value, the result returned is always the same.```
      
      Update: the current implementation is unable to generate deterministic results when the partitions are not fixed. This PR documents this issue in the function descriptions.
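      
      An illustrative sketch of the documented caveat (assuming `rand(seed)` seeds each partition's generator from the partition index, so the same seed is only reproducible while the partitioning stays fixed):
      
      ```scala
      import org.apache.spark.sql.{DataFrame, Row}
      import org.apache.spark.sql.functions.rand
      
      def seededSamples(df: DataFrame): (Array[Row], Array[Row]) = {
        val a = df.select(rand(42L)).collect()
        val b = df.repartition(7).select(rand(42L)).collect() // may differ from `a`
        (a, b)
      }
      ```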
      
      jkbradley hit an issue and provided an example in the following JIRA: https://issues.apache.org/jira/browse/SPARK-13333
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11232 from gatorsmile/randSeed.
      c776fce9
    • [SPARK-13237] [SQL] generated broadcast outer join · 95e1ab22
      Davies Liu authored
      This PR supports codegen for broadcast outer join.
      
      In order to reduce duplicated code, this PR merges HashJoin and HashOuterJoin together (likewise BroadcastHashJoin and BroadcastHashOuterJoin).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11130 from davies/gen_out.
      95e1ab22
    • [SPARK-13351][SQL] fix column pruning on Expand · 26f38bb8
      Davies Liu authored
      Currently, the columns in the projections of Expand that are not used by Aggregate are not pruned; this PR fixes that.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11225 from davies/fix_pruning_expand.
      26f38bb8