  1. Feb 17, 2016
  2. Feb 16, 2016
    • junhao's avatar
      [SPARK-11627] Add initial input rate limit for spark streaming backpressure mechanism. · 7218c0eb
      junhao authored
      https://issues.apache.org/jira/browse/SPARK-11627
      
      The Spark Streaming backpressure mechanism has no initial input rate limit, which can cause OOM exceptions.
      In the first batch, receivers receive data at the maximum speed they can reach, which can exhaust executor memory. Adding an initial input rate limit ensures the streaming job completes its first batch successfully; after that, the backpressure mechanism can adjust the receiving rate adaptively.
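
      A minimal usage sketch, assuming the new setting is exposed via the `spark.streaming.backpressure.initialRate` configuration key:

      ```scala
      import org.apache.spark.SparkConf

      // Backpressure must be enabled for the initial rate to take effect;
      // the keys below are assumptions for this sketch.
      val conf = new SparkConf()
        .set("spark.streaming.backpressure.enabled", "true")
        .set("spark.streaming.backpressure.initialRate", "1000") // records/sec per receiver
      ```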
      
      Author: junhao <junhao@mogujie.com>
      
      Closes #9593 from junhaoMg/junhao-dev.
      7218c0eb
    • Josh Rosen's avatar
      [SPARK-13308] ManagedBuffers passed to OneToOneStreamManager need to be freed in non-error cases · 5f37aad4
      Josh Rosen authored
      ManagedBuffers that are passed to `OneToOneStreamManager.registerStream` need to be freed by the manager once it's done using them. However, the current code only frees them in certain error-cases and not during typical operation. This isn't a major problem today, but it will cause memory leaks after we implement better locking / pinning in the BlockManager (see #10705).
      
      This patch modifies the relevant network code so that the ManagedBuffers are freed as soon as the messages containing them are processed by the lower-level Netty message sending code.
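
      A hedged sketch of that shape (illustrative only, not the actual Spark network code): release the buffer once Netty reports that the outbound write has completed, in both success and failure cases.

      ```scala
      import io.netty.channel.{Channel, ChannelFuture, ChannelFutureListener}
      import org.apache.spark.network.buffer.ManagedBuffer

      def sendAndRelease(channel: Channel, message: AnyRef, buf: ManagedBuffer): Unit = {
        channel.writeAndFlush(message).addListener(new ChannelFutureListener {
          // Free the buffer whether or not the write succeeded, so the
          // non-error path no longer leaks.
          override def operationComplete(future: ChannelFuture): Unit = { buf.release() }
        })
      }
      ```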
      
      /cc zsxwing for review.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11193 from JoshRosen/add-missing-release-calls-in-network-layer.
      5f37aad4
    • Marcelo Vanzin's avatar
      [SPARK-13280][STREAMING] Use a better logger name for FileBasedWriteAheadLog. · c7d00a24
      Marcelo Vanzin authored
      The new logger name is under the org.apache.spark namespace.
      The detection of the caller name was also enhanced a bit to ignore
      some common things that show up in the call stack.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11165 from vanzin/SPARK-13280.
      c7d00a24
    • Takuya UESHIN's avatar
      [SPARK-12976][SQL] Add LazilyGenerateOrdering and use it for RangePartitioner of Exchange. · 19dc69de
      Takuya UESHIN authored
      Add `LazilyGenerateOrdering` to support generated ordering for `RangePartitioner` of `Exchange` instead of `InterpretedOrdering`.
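
      A rough sketch of the idea (simplified; the class and field names here are illustrative): wrap the sort order and defer code generation until first use, so the ordering can be serialized to executors and regenerated there instead of falling back to `InterpretedOrdering`.

      ```scala
      import org.apache.spark.sql.catalyst.InternalRow
      import org.apache.spark.sql.catalyst.expressions.SortOrder
      import org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering

      // Illustrative: the generated comparator is built lazily and marked
      // @transient, so it is re-generated after deserialization on executors.
      class LazilyGeneratedOrderingSketch(ordering: Seq[SortOrder])
          extends Ordering[InternalRow] with Serializable {
        @transient private lazy val generated = GenerateOrdering.generate(ordering)
        override def compare(a: InternalRow, b: InternalRow): Int = generated.compare(a, b)
      }
      ```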
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #10894 from ueshin/issues/SPARK-12976.
      19dc69de
    • BenFradet's avatar
      [SPARK-12247][ML][DOC] Documentation for spark.ml's ALS and collaborative filtering in general · 00c72d27
      BenFradet authored
      This documents the implementation of ALS in `spark.ml` with example code in Scala, Java, and Python.
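
      A short usage sketch of the documented estimator (the column names are assumptions matching a typical (userId, movieId, rating) schema):

      ```scala
      import org.apache.spark.ml.recommendation.ALS
      import org.apache.spark.sql.DataFrame

      def trainAndScore(training: DataFrame, test: DataFrame): DataFrame = {
        val als = new ALS()
          .setMaxIter(10)
          .setRegParam(0.1)
          .setUserCol("userId")
          .setItemCol("movieId")
          .setRatingCol("rating")
        als.fit(training).transform(test) // adds a "prediction" column
      }
      ```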
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #10411 from BenFradet/SPARK-12247.
      00c72d27
    • Miles Yucht's avatar
      Correct SparseVector.parse documentation · 827ed1c0
      Miles Yucht authored
      There's a small typo in the SparseVector.parse docstring (which says that it returns a DenseVector rather than a SparseVector), which seems to be incorrect.
      
      Author: Miles Yucht <miles@databricks.com>
      
      Closes #11213 from mgyucht/fix-sparsevector-docs.
      827ed1c0
    • gatorsmile's avatar
      [SPARK-13221] [SQL] Fixing GroupingSets when Aggregate Functions Contain GroupBy Columns · fee739f0
      gatorsmile authored
      Using grouping sets generates a wrong result when aggregate functions contain GROUP BY columns.
      
      This PR fixes it. Since the code changes are very small, maybe we can also merge it to 1.6.
      
      For example, the following query returns a wrong result:
      ```scala
      sql("select course, sum(earnings) as sum from courseSales group by course, earnings" +
           " grouping sets((), (course), (course, earnings))" +
           " order by course, sum").show()
      ```
      Before the fix, the results are like
      ```
      [null,null]
      [Java,null]
      [Java,20000.0]
      [Java,30000.0]
      [dotNET,null]
      [dotNET,5000.0]
      [dotNET,10000.0]
      [dotNET,48000.0]
      ```
      After the fix, the results become correct:
      ```
      [null,113000.0]
      [Java,20000.0]
      [Java,30000.0]
      [Java,50000.0]
      [dotNET,5000.0]
      [dotNET,10000.0]
      [dotNET,48000.0]
      [dotNET,63000.0]
      ```
      
      UPDATE: This PR also deprecates the external column GROUPING__ID.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11100 from gatorsmile/groupingSets.
      fee739f0
  3. Feb 15, 2016
  4. Feb 14, 2016
    • Josh Rosen's avatar
      [SPARK-12503][SPARK-12505] Limit pushdown in UNION ALL and OUTER JOIN · a8bbc4f5
      Josh Rosen authored
      This patch adds a new optimizer rule for performing limit pushdown. Limits will now be pushed down in two cases:
      
      - If a limit is on top of a `UNION ALL` operator, then a partition-local limit operator will be pushed to each of the union operator's children (see the sketch after this list).
      - If a limit is on top of an `OUTER JOIN` then a partition-local limit will be pushed to one side of the join. For `LEFT OUTER` and `RIGHT OUTER` joins, the limit will be pushed to the left and right side, respectively. For `FULL OUTER` join, we will only push limits when at most one of the inputs is already limited: if one input is limited we will push a smaller limit on top of it and if neither input is limited then we will limit the input which is estimated to be larger.
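
      A simplified sketch of these two rewrite shapes (names follow Catalyst, but this is illustrative; the real rule also consults `maxRows`, as described below, so it does not keep firing after a pushdown):

      ```scala
      import org.apache.spark.sql.catalyst.plans.LeftOuter
      import org.apache.spark.sql.catalyst.plans.logical._
      import org.apache.spark.sql.catalyst.rules.Rule

      object LimitPushdownSketch extends Rule[LogicalPlan] {
        def apply(plan: LogicalPlan): LogicalPlan = plan transform {
          // Limit over UNION ALL: limit each child locally, keep the global limit on top.
          case Limit(exp, union @ Union(children)) =>
            Limit(exp, union.copy(children = children.map(LocalLimit(exp, _))))
          // Limit over LEFT OUTER JOIN: push a local limit to the left (preserved) side.
          case Limit(exp, join @ Join(left, right, LeftOuter, _)) =>
            Limit(exp, join.copy(left = LocalLimit(exp, left)))
        }
      }
      ```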
      
      These optimizations were proposed previously by gatorsmile in #10451 and #10454, but those earlier PRs were closed and deferred for later because at that time Spark's physical `Limit` operator would trigger a full shuffle to perform global limits so there was a chance that pushdowns could actually harm performance by causing additional shuffles/stages. In #7334, we split the `Limit` operator into separate `LocalLimit` and `GlobalLimit` operators, so we can now push down only local limits (which don't require extra shuffles). This patch is based on both of gatorsmile's patches, with changes and simplifications due to partition-local-limiting.
      
      When we push down the limit, we still keep the original limit in place, so we need a mechanism to ensure that the optimizer rule doesn't keep pattern-matching once the limit has been pushed down. In order to handle this, this patch adds a `maxRows` method to `SparkPlan` which returns the maximum number of rows that the plan can compute, then defines the pushdown rules to only push limits to children if the children's maxRows are greater than the limit's maxRows. This idea is carried over from #10451; see that patch for additional discussion.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11121 from JoshRosen/limit-pushdown-2.
      a8bbc4f5
    • Carson Wang's avatar
      [SPARK-13185][SQL] Reuse Calendar object in DateTimeUtils.StringToDate method... · 7cb4d74c
      Carson Wang authored
      [SPARK-13185][SQL] Reuse Calendar object in DateTimeUtils.StringToDate method to improve performance
      
      The java `Calendar` object is expensive to create. I have a subquery like this: `SELECT a, b, c FROM table UV WHERE (datediff(UV.visitDate, '1997-01-01') >= 0 AND datediff(UV.visitDate, '2015-01-01') <= 0)`
      
      The table stores `visitDate` as String type and has 3 billion records. A `Calendar` object is created every time `DateTimeUtils.stringToDate` is called. By reusing the `Calendar` object, I saw about a 20-second performance improvement for this stage.
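
      A minimal sketch of the reuse idea (the names and date logic are illustrative, not the actual `DateTimeUtils` internals): keep one `Calendar` per thread instead of allocating a new one on every call.

      ```scala
      import java.util.{Calendar, TimeZone}

      object DateParseSketch {
        // One Calendar per thread, created once and reused across calls.
        private val localCalendar = new ThreadLocal[Calendar] {
          override def initialValue(): Calendar =
            Calendar.getInstance(TimeZone.getTimeZone("GMT"))
        }

        /** Parse "yyyy-MM-dd" into days since epoch, reusing the thread-local Calendar. */
        def stringToDays(s: String): Int = {
          val Array(y, m, d) = s.split("-").map(_.toInt)
          val cal = localCalendar.get()
          cal.clear()
          cal.set(y, m - 1, d)
          (cal.getTimeInMillis / (24L * 3600 * 1000)).toInt
        }
      }
      ```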
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #11090 from carsonwang/SPARK-13185.
      7cb4d74c
    • Claes Redestad's avatar
      [SPARK-13278][CORE] Launcher fails to start with JDK 9 EA · 22e9723d
      Claes Redestad authored
      See http://openjdk.java.net/jeps/223 for more information about the JDK 9 version string scheme.
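
      A hedged sketch of the parsing concern (the actual fix lives in the Java launcher code; this helper is illustrative): extract the major version under both the old and the JEP 223 schemes.

      ```scala
      // "1.8.0_66" -> 8 (pre-JEP 223), "9-ea" or "9.0.1" -> 9 (JEP 223).
      def javaMajorVersion(version: String): Int = {
        val parts = version.split("[.\\-+]")
        if (parts(0) == "1") parts(1).toInt else parts(0).toInt
      }
      ```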
      
      Author: Claes Redestad <claes.redestad@gmail.com>
      
      Closes #11160 from cl4es/master.
      22e9723d
    • Amit Dev's avatar
      [SPARK-13300][DOCUMENTATION] Added pygments.rb dependency · 331293c3
      Amit Dev authored
      It looks like the pygments.rb gem is also required for the jekyll build to work. At least on Ubuntu/RHEL I could not build without this dependency, so I added it to the steps.
      
      Author: Amit Dev <amitdev@gmail.com>
      
      Closes #11180 from amitdev/master.
      331293c3
  5. Feb 13, 2016
    • Reynold Xin's avatar
      [SPARK-13296][SQL] Move UserDefinedFunction into sql.expressions. · 354d4c24
      Reynold Xin authored
      This pull request has the following changes:
      
      1. Moved UserDefinedFunction into expressions package. This is more consistent with how we structure the packages for window functions and UDAFs.
      
      2. Moved UserDefinedPythonFunction into execution.python package, so we don't have a random private class in the top level sql package.
      
      3. Move everything in execution/python.scala into the newly created execution.python package.
      
      Most of the diffs are just straight copy-paste.
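
      For downstream code, the visible effect of (1) is just the new import path; a one-line sketch:

      ```scala
      // Previously resolved from the top-level sql package.
      import org.apache.spark.sql.expressions.UserDefinedFunction
      ```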
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11181 from rxin/SPARK-13296.
      354d4c24
    • Sean Owen's avatar
      [SPARK-13172][CORE][SQL] Stop using RichException.getStackTrace; it is deprecated · 388cd9ea
      Sean Owen authored
      Replace `getStackTraceString` with `Utils.exceptionString`
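
      A minimal sketch of the substitution (`Utils` is Spark-internal, so this applies within Spark's own modules):

      ```scala
      import org.apache.spark.util.Utils

      // Format a Throwable's stack trace with Spark's helper instead of the
      // deprecated Scala RichException method.
      def describe(e: Throwable): String =
        Utils.exceptionString(e) // was: e.getStackTraceString
      ```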
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11182 from srowen/SPARK-13172.
      388cd9ea
    • Reynold Xin's avatar
      Closes #11185 · 610196f9
      Reynold Xin authored
      610196f9
    • Liang-Chi Hsieh's avatar
      [SPARK-12363][MLLIB] Remove setRun and fix PowerIterationClustering failed test · e3441e3f
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-12363
      
      This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph were 0.0 rather than the correct values. Setting `TripletFields.All` in `mapTriplets` makes it work.
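
      A hedged sketch of the fix's shape (the normalization function here is illustrative, not the actual PowerIterationClustering code): request all triplet fields so `dstAttr` is populated during `mapTriplets`.

      ```scala
      import org.apache.spark.graphx.{Graph, TripletFields}

      def normalizeSketch(graph: Graph[Double, Double]): Graph[Double, Double] =
        graph.mapTriplets(
          t => t.attr / math.max(t.srcAttr, 1e-12), // illustrative edge normalization
          TripletFields.All)                        // ensure src/dst attrs are shipped
      ```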
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #10539 from viirya/fix-poweriter.
      e3441e3f
    • markpavey's avatar
      [SPARK-13142][WEB UI] Problem accessing Web UI /logPage/ on Microsoft Windows · 374c4b28
      markpavey authored
      Due to being on a Windows platform I have been unable to run the tests as described in the "Contributing to Spark" instructions. As the change is only to two lines of code in the Web UI, which I have manually built and tested, I am submitting this pull request anyway. I hope this is OK.
      
      Is it worth considering also including this fix in any future 1.5.x releases (if any)?
      
      I confirm this is my own original work and license it to the Spark project under its open source license.
      
      Author: markpavey <mark.pavey@thefilter.com>
      
      Closes #11135 from markpavey/JIRA_SPARK-13142_WindowsWebUILogFix.
      374c4b28
  6. Feb 12, 2016
    • Davies Liu's avatar
      [SPARK-13293][SQL] generate Expand · 2228f074
      Davies Liu authored
      Expand suffers from creating the UnsafeRow from the same input multiple times; with codegen, it only needs to copy some of the columns.
      
      After this, we see a 3X improvement (from 43 seconds to 13 seconds) on a TPCDS query (Q67) that has eight columns in its Rollup.
      
      Ideally, we could mask some of the columns based on a bitmask; I'd leave that for the future, because currently Aggregation (50 ns) is much slower than just copying the variables (1-2 ns).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11177 from davies/gen_expand.
      2228f074
    • Michael Gummelt's avatar
      [SPARK-5095] remove flaky test · 62b1c07e
      Michael Gummelt authored
      Overrode the start() method, which was previously starting a thread, causing a race condition. I believe this should fix the flaky test.
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #11164 from mgummelt/fix_mesos_tests.
      62b1c07e
    • Michael Gummelt's avatar
      [SPARK-5095] Fix style in mesos coarse grained scheduler code · 38bc6018
      Michael Gummelt authored
      andrewor14: this addresses your style comments from #10993
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #11187 from mgummelt/fix_mesos_style.
      38bc6018
    • vijaykiran's avatar
      [SPARK-12630][PYSPARK] [DOC] PySpark classification parameter desc to consistent format · 42d65681
      vijaykiran authored
      Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the classification module.
      
      Author: vijaykiran <mail@vijaykiran.com>
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #11183 from BryanCutler/pyspark-consistent-param-classification-SPARK-12630.
      42d65681
    • Yanbo Liang's avatar
      [SPARK-12962] [SQL] [PySpark] PySpark support covar_samp and covar_pop · 90de6b2f
      Yanbo Liang authored
      PySpark now supports `covar_samp` and `covar_pop`.
      
      cc rxin davies marmbrus
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #10876 from yanboliang/spark-12962.
      90de6b2f
    • hyukjinkwon's avatar
      [SPARK-13260][SQL] count(*) does not work with CSV data source · ac7d6af1
      hyukjinkwon authored
      https://issues.apache.org/jira/browse/SPARK-13260
      This is a quick fix for `count(*)`.
      
      When `requiredColumns` is empty, the current code returns `sqlContext.sparkContext.emptyRDD[Row]`, which loses the row count.
      
      Just like the JSON data source, this PR lets the CSV data source count the rows without parsing each set of tokens.
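
      A conceptual sketch of that behavior (not the actual CSV relation code; `tokenize` is a hypothetical stand-in for the real parser):

      ```scala
      import org.apache.spark.rdd.RDD
      import org.apache.spark.sql.Row

      def buildScan(requiredColumns: Array[String], lines: RDD[String]): RDD[Row] =
        if (requiredColumns.isEmpty) lines.map(_ => Row.empty) // count rows, skip parsing
        else lines.map(line => Row.fromSeq(tokenize(line, requiredColumns)))

      // Hypothetical tokenizer for the sketch: a naive comma split.
      def tokenize(line: String, cols: Array[String]): Seq[String] =
        line.split(",").toSeq
      ```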
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11169 from HyukjinKwon/SPARK-13260.
      ac7d6af1
    • Reynold Xin's avatar
      [SPARK-13282][SQL] LogicalPlan toSql should just return a String · c4d5ad80
      Reynold Xin authored
      Previously we were using Option[String] and None to indicate the case when Spark fails to generate SQL. It is easier to just use exceptions to propagate error cases, rather than having for-comprehensions everywhere. I also introduced a "build" function that simplifies string concatenation (i.e. no need to reason about whether we have an extra space or not).
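
      A sketch of what such a helper could look like (the signature is assumed from the description, not copied from the PR):

      ```scala
      // Joins non-empty segments with single spaces, so callers need not
      // track whether each fragment carries a trailing space.
      def build(segments: String*): String =
        segments.map(_.trim).filter(_.nonEmpty).mkString(" ")

      // e.g. build("SELECT", "a, b", "", "FROM t") == "SELECT a, b FROM t"
      ```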
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11171 from rxin/SPARK-13282.
      c4d5ad80
    • Davies Liu's avatar
      [SPARK-12705] [SQL] push missing attributes for Sort · 5b805df2
      Davies Liu authored
      The current implementation of ResolveSortReferences can only push one missing attribute into its child; it failed to analyze TPCDS Q98 because that query has two missing attributes (one from Window, another from Aggregate).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11153 from davies/resolve_sort.
      5b805df2
    • Holden Karau's avatar
      [SPARK-13154][PYTHON] Add linting for pydocs · 64515e5f
      Holden Karau authored
      We should have lint rules using sphinx to automatically catch the pydoc issues that are sometimes introduced.
      
      Right now ./dev/lint-python will skip building the docs if sphinx isn't present, but it might make sense to fail hard; it's just a matter of whether we want to insist that all PySpark developers have sphinx present.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #11109 from holdenk/SPARK-13154-add-pydoc-lint-for-docs.
      64515e5f
    • Yanbo Liang's avatar
      [SPARK-12974][ML][PYSPARK] Add Python API for spark.ml bisecting k-means · a183dda6
      Yanbo Liang authored
      Add Python API for spark.ml bisecting k-means.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #10889 from yanboliang/spark-12974.
      a183dda6
    • Sanket's avatar
      [SPARK-6166] Limit number of in flight outbound requests · 894921d8
      Sanket authored
      This JIRA is related to https://github.com/apache/spark/pull/5852. Had to do some minor rework and testing to make sure it works with the current version of Spark.
      
      Author: Sanket <schintap@untilservice-lm>
      
      Closes #10838 from redsanket/limit-outbound-connections.
      894921d8
  7. Feb 11, 2016
    • Steve Loughran's avatar
      [SPARK-7889][WEBUI] HistoryServer updates UI for incomplete apps · a2c7dcf6
      Steve Loughran authored
      When the HistoryServer is showing an incomplete app, it needs to check if there is a newer version of the app available.  It does this by checking if a version of the app has been loaded with a larger *filesize*.  If so, it detaches the current UI, attaches the new one, and redirects back to the same URL to show the new UI.
      
      https://issues.apache.org/jira/browse/SPARK-7889
      
      Author: Steve Loughran <stevel@hortonworks.com>
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #11118 from squito/SPARK-7889-alternate.
      a2c7dcf6