  1. Feb 20, 2016
    • [SPARK-13386][GRAPHX] ConnectedComponents should support maxIteration option · 6ce7c481
      Zheng RuiFeng authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13386
      
      ## What changes were proposed in this pull request?
      
      Add a maxIteration option to the ConnectedComponents algorithm.
      
      ## How was this patch tested?
      
      unit tests passed
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #11268 from zhengruifeng/ccwithmax.
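A minimal pure-Python sketch of the idea behind the option (simple label propagation with an iteration cap, not GraphX's Pregel-based implementation; `connected_components` is an illustrative helper):

```python
def connected_components(edges, num_nodes, max_iterations=float("inf")):
    """Every node starts with its own id as its label and repeatedly adopts
    the smallest label among its neighbors; stop early at max_iterations."""
    labels = list(range(num_nodes))
    iterations = 0
    changed = True
    while changed and iterations < max_iterations:
        changed = False
        for u, v in edges:
            low = min(labels[u], labels[v])
            if labels[u] != low or labels[v] != low:
                labels[u] = labels[v] = low
                changed = True
        iterations += 1
    return labels

# Two components: {0, 1, 2} and {3, 4}.
print(connected_components([(0, 1), (1, 2), (3, 4)], 5))  # [0, 0, 0, 3, 3]
```

Capping the iteration count trades exactness for bounded runtime on large graphs where convergence is slow.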
    • [SPARK-13302][PYSPARK][TESTS] Move the temp file creation and cleanup outside of the doctests · 9ca79c1e
      Holden Karau authored
      Some of the new doctests in ml/clustering.py have a lot of setup code, move the setup code to the general test init to keep the doctest more example-style looking.
      In part this is a follow up to https://github.com/apache/spark/pull/10999
      Note that the same pattern is followed in regression & recommendation - might as well clean up all three at the same time.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #11197 from holdenk/SPARK-13302-cleanup-doctests-in-ml-clustering.
    • [SPARK-13408] [CORE] Ignore errors when it's already reported in JobWaiter · dfb2ae2f
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      `JobWaiter.taskSucceeded` will be called for each task. When `resultHandler` throws an exception, `taskSucceeded` will also throw it for each task. DAGScheduler just catches it and reports it like this:
      ```Scala
                        try {
                          job.listener.taskSucceeded(rt.outputId, event.result)
                        } catch {
                          case e: Exception =>
                            // TODO: Perhaps we want to mark the resultStage as failed?
                            job.listener.jobFailed(new SparkDriverExecutionException(e))
                        }
      ```
      Therefore `JobWaiter.jobFailed` may be called multiple times.
      
      So `JobWaiter.jobFailed` should use `Promise.tryFailure` instead of `Promise.failure` because the latter one doesn't support calling multiple times.
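The difference between the two completion methods can be sketched with a toy promise class (illustrative only, not Scala's `Promise` API): `failure` blows up on a second completion attempt, while `try_failure` simply reports whether this caller won the race.

```python
class Promise:
    """Toy single-completion promise."""
    def __init__(self):
        self.result = None
        self.completed = False

    def failure(self, error):
        if self.completed:
            raise RuntimeError("promise already completed")
        self.result = error
        self.completed = True

    def try_failure(self, error):
        if self.completed:
            return False  # already reported; silently ignore this one
        self.result = error
        self.completed = True
        return True

p = Promise()
print(p.try_failure(ValueError("task 0 failed")))  # True: first report wins
print(p.try_failure(ValueError("task 1 failed")))  # False: ignored
```

Since `taskSucceeded` can fail once per task, only the idempotent variant is safe to call repeatedly.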
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11280 from zsxwing/SPARK-13408.
    • Revert "[SPARK-12567] [SQL] Add aes_{encrypt,decrypt} UDFs" · 6624a588
      Reynold Xin authored
      This reverts commit 4f9a6648.
    • [SPARK-12567] [SQL] Add aes_{encrypt,decrypt} UDFs · 4f9a6648
      Kai Jiang authored
      Author: Kai Jiang <jiangkai@gmail.com>
      
      Closes #10527 from vectorijk/spark-12567.
    • [SPARK-12594] [SQL] Outer Join Elimination by Filter Conditions · ec7a1d6e
      gatorsmile authored
      Convert outer joins to tighter join types when the predicates in filter conditions restrict the result set so that all null-supplying rows are eliminated:
      
      - `full outer` -> `inner` if both sides have such predicates
      - `left outer` -> `inner` if the right side has such predicates
      - `right outer` -> `inner` if the left side has such predicates
      - `full outer` -> `left outer` if only the left side has such predicates
      - `full outer` -> `right outer` if only the right side has such predicates
      
      When applicable, this can greatly improve performance, since an outer join is much slower than an inner join, and a full outer join is much slower than a left/right outer join.
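The intuition can be sketched with toy joins over key/value lists (illustrative Python, not Spark's optimizer): a null-rejecting filter on the null-supplying side discards exactly the extra rows the outer join produces, so an inner join yields the same result.

```python
left = [("a", 1), ("b", 2)]
right = [("a", 10)]

def left_outer_join(l, r):
    rd = dict(r)
    return [(k, v, rd.get(k)) for k, v in l]  # unmatched rows get None

def inner_join(l, r):
    rd = dict(r)
    return [(k, v, rd[k]) for k, v in l if k in rd]

# A null-rejecting predicate on the right side's column...
pred = lambda row: row[2] is not None and row[2] > 5
# ...discards exactly the null-supplying rows, so both plans agree:
outer_then_filter = [row for row in left_outer_join(left, right) if pred(row)]
inner_then_filter = [row for row in inner_join(left, right) if pred(row)]
print(outer_then_filter == inner_then_filter)  # True
```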
      
      The original PR is https://github.com/apache/spark/pull/10542
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #10567 from gatorsmile/outerJoinEliminationByFilterCond.
  2. Feb 19, 2016
  3. Feb 18, 2016
    • [SPARK-13380][SQL][DOCUMENT] Document Rand(seed) and Randn(seed) Return Indeterministic Results When Data Partitions are not fixed. · c776fce9
      gatorsmile authored
      
      `rand` and `randn` functions with a `seed` argument are commonly used. Intuitively, their results should be deterministic when a `seed` value is provided. For example, MS SQL Server also has a `rand` function, and its documentation describes the `seed` parameter as: "Seed is an integer expression (tinyint, smallint, or int) that gives the seed value. If seed is not specified, the SQL Server Database Engine assigns a seed value at random. For a specified seed value, the result returned is always the same."
      
      Update: the current implementation is unable to generate deterministic results when the partitions are not fixed. This PR documents this issue in the function descriptions.
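A sketch of why partitioning matters (pure Python, illustrative only; the per-partition offset `9973` is arbitrary): if each partition seeds its own generator, a row's random value depends on which partition it lands in and its position there, so the same seed yields different results under different partitionings.

```python
import random

def rand_column(partitions, seed):
    out = []
    for i, part in enumerate(partitions):
        rng = random.Random(seed + i * 9973)  # per-partition generator
        out.extend((row, rng.random()) for row in part)
    return out

rows = ["a", "b", "c", "d"]
same = rand_column([rows[:2], rows[2:]], seed=42)
# Deterministic only while the partitioning is fixed:
print(same == rand_column([rows[:2], rows[2:]], seed=42))  # True
print(same == rand_column([rows[:3], rows[3:]], seed=42))  # False
```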
      
      jkbradley hit an issue and provided an example in the following JIRA: https://issues.apache.org/jira/browse/SPARK-13333
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11232 from gatorsmile/randSeed.
    • [SPARK-13237] [SQL] generated broadcast outer join · 95e1ab22
      Davies Liu authored
      This PR supports codegen for broadcast outer join.

      To reduce duplicated code, this PR merges HashJoin and HashOuterJoin (and likewise BroadcastHashJoin and BroadcastHashOuterJoin).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11130 from davies/gen_out.
    • [SPARK-13351][SQL] fix column pruning on Expand · 26f38bb8
      Davies Liu authored
      Currently, the columns in the projections of Expand that are not used by Aggregate are not pruned; this PR fixes that.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11225 from davies/fix_pruning_expand.
    • [SPARK-13371][CORE][STRING] TaskSetManager.dequeueSpeculativeTask compares Option and String directly. · 78562535
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
      Fix some comparisons between unequal types that cause IntelliJ warnings and, in at least one case (TaskSetManager), a likely bug.
      
      ## How was this patch tested?
      
      Running Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11253 from srowen/SPARK-13371.
  4. Feb 17, 2016
  5. Feb 16, 2016
    • [SPARK-11627] Add initial input rate limit for spark streaming backpressure mechanism. · 7218c0eb
      junhao authored
      https://issues.apache.org/jira/browse/SPARK-11627
      
      Spark Streaming's backpressure mechanism has no initial input rate limit, which can cause OOM exceptions.
      In the first batch, receivers receive data at the maximum speed they can reach, which might exhaust executor memory. Adding an initial input rate limit ensures the streaming job succeeds in the first batch, after which the backpressure mechanism can adjust the receiving rate adaptively.
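The policy can be sketched as follows (function and parameter names are illustrative, not Spark's configuration keys): use the configured initial rate until the backpressure estimator produces one, and always respect the static maximum.

```python
def batch_rate(estimated_rate, max_rate, initial_rate):
    # Before the estimator has produced anything, fall back to the initial
    # rate rather than letting receivers run at full speed.
    rate = initial_rate if estimated_rate is None else estimated_rate
    return min(rate, max_rate) if max_rate > 0 else rate

print(batch_rate(None, max_rate=0, initial_rate=100))     # first batch: 100
print(batch_rate(5000, max_rate=1000, initial_rate=100))  # later: capped at 1000
```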
      
      Author: junhao <junhao@mogujie.com>
      
      Closes #9593 from junhaoMg/junhao-dev.
    • [SPARK-13308] ManagedBuffers passed to OneToOneStreamManager need to be freed in non-error cases · 5f37aad4
      Josh Rosen authored
      ManagedBuffers that are passed to `OneToOneStreamManager.registerStream` need to be freed by the manager once it's done using them. However, the current code only frees them in certain error-cases and not during typical operation. This isn't a major problem today, but it will cause memory leaks after we implement better locking / pinning in the BlockManager (see #10705).
      
      This patch modifies the relevant network code so that the ManagedBuffers are freed as soon as the messages containing them are processed by the lower-level Netty message sending code.
      
      /cc zsxwing for review.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11193 from JoshRosen/add-missing-release-calls-in-network-layer.
    • [SPARK-13280][STREAMING] Use a better logger name for FileBasedWriteAheadLog. · c7d00a24
      Marcelo Vanzin authored
      The new logger name is under the org.apache.spark namespace.
      The detection of the caller name was also enhanced a bit to ignore
      some common things that show up in the call stack.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11165 from vanzin/SPARK-13280.
    • [SPARK-12976][SQL] Add LazilyGenerateOrdering and use it for RangePartitioner of Exchange. · 19dc69de
      Takuya UESHIN authored
      Add `LazilyGenerateOrdering` to support generated ordering for `RangePartitioner` of `Exchange` instead of `InterpretedOrdering`.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #10894 from ueshin/issues/SPARK-12976.
    • [SPARK-12247][ML][DOC] Documentation for spark.ml's ALS and collaborative filtering in general · 00c72d27
      BenFradet authored
      This documents the implementation of ALS in `spark.ml` with example code in Scala, Java, and Python.
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #10411 from BenFradet/SPARK-12247.
    • Correct SparseVector.parse documentation · 827ed1c0
      Miles Yucht authored
      There's a small typo in the SparseVector.parse docstring: it says the method returns a DenseVector rather than a SparseVector, which is incorrect.
      
      Author: Miles Yucht <miles@databricks.com>
      
      Closes #11213 from mgyucht/fix-sparsevector-docs.
    • [SPARK-13221] [SQL] Fixing GroupingSets when Aggregate Functions Containing GroupBy Columns · fee739f0
      gatorsmile authored
      Using grouping sets generates wrong results when aggregate functions contain GROUP BY columns.

      This PR fixes it. Since the code changes are very small, maybe we can also merge it into 1.6.
      
      For example, the following query returns a wrong result:
      ```scala
      sql("select course, sum(earnings) as sum from courseSales group by course, earnings" +
           " grouping sets((), (course), (course, earnings))" +
           " order by course, sum").show()
      ```
      Before the fix, the results are like
      ```
      [null,null]
      [Java,null]
      [Java,20000.0]
      [Java,30000.0]
      [dotNET,null]
      [dotNET,5000.0]
      [dotNET,10000.0]
      [dotNET,48000.0]
      ```
      After the fix, the results become correct:
      ```
      [null,113000.0]
      [Java,20000.0]
      [Java,30000.0]
      [Java,50000.0]
      [dotNET,5000.0]
      [dotNET,10000.0]
      [dotNET,48000.0]
      [dotNET,63000.0]
      ```
      
      UPDATE:  This PR also deprecated the external column: GROUPING__ID.
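The intended semantics can be sketched in pure Python (illustrative, not Spark's physical plan): each grouping set is evaluated as its own GROUP BY and the results are unioned, which reproduces the corrected output for the example query.

```python
from collections import defaultdict

sales = [("Java", 20000.0), ("Java", 30000.0),
         ("dotNET", 5000.0), ("dotNET", 10000.0), ("dotNET", 48000.0)]

def grouping_sets_sum(rows, sets):
    # One full GROUP BY per grouping set; results are unioned together.
    result = []
    for keys in sets:
        groups = defaultdict(float)
        for course, earnings in rows:
            vals = {"course": course, "earnings": earnings}
            groups[tuple(vals[k] for k in keys)] += earnings
        for key, total in groups.items():
            # Expose only the course column plus the aggregate, like the query.
            course = key[0] if keys[:1] == ("course",) else None
            result.append((course, total))
    return sorted(result, key=lambda r: (r[0] or "", r[1]))

for row in grouping_sets_sum(sales, [(), ("course",), ("course", "earnings")]):
    print(row)
```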
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11100 from gatorsmile/groupingSets.
  6. Feb 15, 2016
  7. Feb 14, 2016
    • [SPARK-12503][SPARK-12505] Limit pushdown in UNION ALL and OUTER JOIN · a8bbc4f5
      Josh Rosen authored
      This patch adds a new optimizer rule for performing limit pushdown. Limits will now be pushed down in two cases:
      
      - If a limit is on top of a `UNION ALL` operator, then a partition-local limit operator will be pushed to each of the union operator's children.
      - If a limit is on top of an `OUTER JOIN` then a partition-local limit will be pushed to one side of the join. For `LEFT OUTER` and `RIGHT OUTER` joins, the limit will be pushed to the left and right side, respectively. For `FULL OUTER` join, we will only push limits when at most one of the inputs is already limited: if one input is limited we will push a smaller limit on top of it and if neither input is limited then we will limit the input which is estimated to be larger.
      
      These optimizations were proposed previously by gatorsmile in #10451 and #10454, but those earlier PRs were closed and deferred for later because at that time Spark's physical `Limit` operator would trigger a full shuffle to perform global limits so there was a chance that pushdowns could actually harm performance by causing additional shuffles/stages. In #7334, we split the `Limit` operator into separate `LocalLimit` and `GlobalLimit` operators, so we can now push down only local limits (which don't require extra shuffles). This patch is based on both of gatorsmile's patches, with changes and simplifications due to partition-local-limiting.
      
      When we push down the limit, we still keep the original limit in place, so we need a mechanism to ensure that the optimizer rule doesn't keep pattern-matching once the limit has been pushed down. In order to handle this, this patch adds a `maxRows` method to `SparkPlan` which returns the maximum number of rows that the plan can compute, then defines the pushdown rules to only push limits to children if the children's maxRows are greater than the limit's maxRows. This idea is carried over from #10451; see that patch for additional discussion.
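The UNION ALL case can be sketched with plain lists (illustrative, not Spark's operators): pushing a partition-local limit below the union preserves the result while bounding how many rows each child produces.

```python
def take(rows, n):
    return rows[:n]

def union_all(children):
    return [r for child in children for r in child]

partitions = [list(range(100)), list(range(100, 200))]
limit = 5

# Naive plan: materialize the whole union, then limit.
naive = take(union_all(partitions), limit)

# Pushed-down plan: each child produces at most `limit` rows locally,
# and a global limit on top trims the union to the final count.
pushed = take(union_all(take(p, limit) for p in partitions), limit)

print(naive == pushed)  # True, but the pushed plan scans far fewer rows
```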
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11121 from JoshRosen/limit-pushdown-2.
    • [SPARK-13185][SQL] Reuse Calendar object in DateTimeUtils.StringToDate method to improve performance · 7cb4d74c
      Carson Wang authored
      
      The Java `Calendar` object is expensive to create. I have a subquery like this: `SELECT a, b, c FROM table UV WHERE (datediff(UV.visitDate, '1997-01-01')>=0 AND datediff(UV.visitDate, '2015-01-01')<=0)`
      
      The table stores `visitDate` as String type and has 3 billion records. A `Calendar` object is created every time `DateTimeUtils.stringToDate` is called. By reusing the `Calendar` object, I saw about 20 seconds performance improvement for this stage.
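The reuse pattern can be sketched in Python with a stand-in for the expensive object (illustrative only, not Spark's code; the thread-local cache mirrors the need for per-thread safety when sharing a mutable helper):

```python
import threading

class Calendar:
    """Stand-in for an object that is expensive to construct."""
    constructions = 0
    def __init__(self):
        Calendar.constructions += 1
    def reset(self):
        pass  # clearing fields is much cheaper than rebuilding

def string_to_date_naive(s):
    return s, Calendar()          # allocates a fresh instance per call

_local = threading.local()        # one reusable instance per thread

def string_to_date_reusing(s):
    cal = getattr(_local, "cal", None)
    if cal is None:
        cal = _local.cal = Calendar()
    cal.reset()                   # reuse instead of reallocating
    return s, cal

dates = ["1997-01-01", "2015-01-01", "2016-02-14"]
Calendar.constructions = 0
for d in dates:
    string_to_date_naive(d)
print(Calendar.constructions)     # 3: one allocation per call
Calendar.constructions = 0
for d in dates:
    string_to_date_reusing(d)
print(Calendar.constructions)     # 1: a single instance serves every call
```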
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #11090 from carsonwang/SPARK-13185.
    • [SPARK-13278][CORE] Launcher fails to start with JDK 9 EA · 22e9723d
      Claes Redestad authored
      See http://openjdk.java.net/jeps/223 for more information about the JDK 9 version string scheme.
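The incompatibility comes down to version-string parsing. A hedged sketch of handling both schemes (`java_major_version` is an illustrative helper, not Spark's launcher code): the legacy scheme puts the major version in the second field ("1.8.0_66"), while JEP 223 puts it first and may append a pre-release suffix ("9-ea").

```python
def java_major_version(version):
    # Drop any pre-release suffix like "-ea", then split the dotted fields.
    parts = version.split("-")[0].split(".")
    if parts[0] == "1":        # legacy scheme: major version is the 2nd field
        return int(parts[1])
    return int(parts[0])       # JEP 223 scheme: major version comes first

for v in ["1.7.0_79", "1.8.0_66", "9-ea", "9.0.1"]:
    print(v, "->", java_major_version(v))
```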
      
      Author: Claes Redestad <claes.redestad@gmail.com>
      
      Closes #11160 from cl4es/master.