- Feb 15, 2016
-
JeremyNixon authored
Response to JIRA https://issues.apache.org/jira/browse/SPARK-13312. This contribution is my original work and I license the work to this project. Author: JeremyNixon <jnixon2@gmail.com> Closes #11199 from JeremyNixon/update_train_val_split_example.
-
Takeshi YAMAMURO authored
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #10918 from maropu/RemoveDeprecateInPregel.
-
- Feb 14, 2016
-
Josh Rosen authored
This patch adds a new optimizer rule for performing limit pushdown. Limits will now be pushed down in two cases:
- If a limit is on top of a `UNION ALL` operator, then a partition-local limit operator will be pushed to each of the union operator's children.
- If a limit is on top of an `OUTER JOIN`, then a partition-local limit will be pushed to one side of the join. For `LEFT OUTER` and `RIGHT OUTER` joins, the limit will be pushed to the left and right side, respectively. For `FULL OUTER` joins, we only push limits when at most one of the inputs is already limited: if one input is limited we push a smaller limit on top of it, and if neither input is limited we limit the input that is estimated to be larger.

These optimizations were proposed previously by gatorsmile in #10451 and #10454, but those earlier PRs were closed and deferred because at that time Spark's physical `Limit` operator triggered a full shuffle to perform global limits, so there was a chance that pushdowns could actually harm performance by causing additional shuffles/stages. In #7334, we split the `Limit` operator into separate `LocalLimit` and `GlobalLimit` operators, so we can now push down only local limits (which don't require extra shuffles). This patch is based on both of gatorsmile's patches, with changes and simplifications due to partition-local limiting.

When we push down the limit, we still keep the original limit in place, so we need a mechanism to ensure that the optimizer rule doesn't keep pattern-matching once the limit has been pushed down. To handle this, this patch adds a `maxRows` method to `SparkPlan` which returns the maximum number of rows that the plan can compute, then defines the pushdown rules to only push limits to children whose `maxRows` is greater than the limit's `maxRows`. This idea is carried over from #10451; see that patch for additional discussion.

Author: Josh Rosen <joshrosen@databricks.com> Closes #11121 from JoshRosen/limit-pushdown-2.
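The sketch below uses toy case classes rather than Spark's actual Catalyst types (the names `Relation`, `LocalLimit`, `GlobalLimit`, `Union`, and `pushDownLimit` are illustrative), but it shows the shape of the union case: push a partition-local limit into each union child, and use `maxRows` to stop the rule from re-matching plans it has already rewritten.

```scala
// Toy plan algebra; illustrative only, not Spark's Catalyst classes.
sealed trait Plan { def maxRows: Option[Long] }
case class Relation(name: String, estimatedRows: Long) extends Plan {
  val maxRows = None
}
case class LocalLimit(n: Long, child: Plan) extends Plan {
  val maxRows = Some(n)
}
case class GlobalLimit(n: Long, child: Plan) extends Plan {
  val maxRows = Some(n)
}
case class Union(children: Seq[Plan]) extends Plan {
  val maxRows = None
}

// Push a partition-local limit into each child of a UNION ALL, skipping any
// child already known to produce at most n rows (so the rule doesn't keep
// firing on plans it has already rewritten).
def pushDownLimit(plan: Plan): Plan = plan match {
  case GlobalLimit(n, Union(children)) =>
    GlobalLimit(n, Union(children.map {
      case child if child.maxRows.exists(_ <= n) => child
      case child => LocalLimit(n, child)
    }))
  case other => other
}
```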
-
Carson Wang authored
[SPARK-13185][SQL] Reuse Calendar object in DateTimeUtils.stringToDate method to improve performance. The Java `Calendar` object is expensive to create. I have a subquery like this: `SELECT a, b, c FROM table UV WHERE (datediff(UV.visitDate, '1997-01-01')>=0 AND datediff(UV.visitDate, '2015-01-01')<=0)`. The table stores `visitDate` as String type and has 3 billion records. A `Calendar` object is created every time `DateTimeUtils.stringToDate` is called. By reusing the `Calendar` object, I saw about a 20-second performance improvement for this stage. Author: Carson Wang <carson.wang@intel.com> Closes #11090 from carsonwang/SPARK-13185.
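A minimal sketch of this kind of reuse (illustrative names, not the actual `DateTimeUtils` code): cache one `Calendar` per thread instead of allocating a fresh instance on every call.

```scala
import java.util.{Calendar, TimeZone}

object DateParsing {
  // One Calendar per thread; Calendar is mutable and not thread-safe, so a
  // single shared instance would not be correct.
  private val localCalendar = new ThreadLocal[Calendar] {
    override def initialValue(): Calendar =
      Calendar.getInstance(TimeZone.getTimeZone("GMT"))
  }

  // Convert a year/month/day to days since the Unix epoch, reusing the
  // thread-local Calendar rather than creating a new one per call.
  def toDaysSinceEpoch(year: Int, month: Int, day: Int): Int = {
    val c = localCalendar.get()
    c.clear()                              // wipe state left by the previous call
    c.set(year, month - 1, day, 0, 0, 0)   // Calendar months are zero-based
    (c.getTimeInMillis / (24L * 60 * 60 * 1000)).toInt
  }
}
```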
-
Claes Redestad authored
See http://openjdk.java.net/jeps/223 for more information about the JDK 9 version string scheme. Author: Claes Redestad <claes.redestad@gmail.com> Closes #11160 from cl4es/master.
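For reference, a hedged sketch of version-string handling under both schemes (illustrative only, not necessarily how Spark's build scripts parse it):

```scala
// Extract the major Java version from both the legacy scheme ("1.8.0_72")
// and the JEP 223 scheme introduced by JDK 9 ("9-ea", "9.0.1").
def majorJavaVersion(versionString: String): Int = {
  val normalized =
    if (versionString.startsWith("1.")) versionString.drop(2) // "1.8.0_72" -> "8.0_72"
    else versionString                                        // "9-ea" stays as-is
  normalized.takeWhile(_.isDigit).toInt
}

// majorJavaVersion("1.8.0_72") == 8
// majorJavaVersion("9-ea")     == 9
```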
-
Amit Dev authored
Looks like the pygments.rb gem is also required for the jekyll build to work. At least on Ubuntu/RHEL I could not build without this dependency, so I added it to the steps. Author: Amit Dev <amitdev@gmail.com> Closes #11180 from amitdev/master.
-
- Feb 13, 2016
-
Reynold Xin authored
This pull request has the following changes:
1. Moved UserDefinedFunction into the expressions package. This is more consistent with how we structure the packages for window functions and UDAFs.
2. Moved UserDefinedPythonFunction into the execution.python package, so we don't have a random private class in the top level sql package.
3. Moved everything in execution/python.scala into the newly created execution.python package.

Most of the diffs are just straight copy-paste. Author: Reynold Xin <rxin@databricks.com> Closes #11181 from rxin/SPARK-13296.
-
Sean Owen authored
Replace `getStackTraceString` with `Utils.exceptionString` Author: Sean Owen <sowen@cloudera.com> Closes #11182 from srowen/SPARK-13172.
-
Reynold Xin authored
-
Liang-Chi Hsieh authored
JIRA: https://issues.apache.org/jira/browse/SPARK-12363 This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph are not the correct values but 0.0. Setting `TripletFields.All` in `mapTriplets` makes it work. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #10539 from viirya/fix-poweriter.
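A simplified sketch of the fix described above (the `normalize` function and the division are illustrative, not the actual PowerIterationClustering code): when the map function reads `dstAttr`, the triplet fields passed to `mapTriplets` must include the vertex attributes, otherwise `dstAttr` shows up as a default value.

```scala
import org.apache.spark.graphx.{Graph, TripletFields}

// Rescale each edge weight by its destination vertex attribute. Requesting
// TripletFields.All tells GraphX to ship vertex attributes with the triplet,
// so dstAttr holds the real value rather than a default.
def normalize(graph: Graph[Double, Double]): Graph[Double, Double] =
  graph.mapTriplets(
    triplet => triplet.attr / math.max(triplet.dstAttr, 1e-12),
    TripletFields.All
  )
```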
-
markpavey authored
Due to being on a Windows platform I have been unable to run the tests as described in the "Contributing to Spark" instructions. As the change is only to two lines of code in the Web UI, which I have manually built and tested, I am submitting this pull request anyway. I hope this is OK. Is it worth considering also including this fix in any future 1.5.x releases (if any)? I confirm this is my own original work and license it to the Spark project under its open source license. Author: markpavey <mark.pavey@thefilter.com> Closes #11135 from markpavey/JIRA_SPARK-13142_WindowsWebUILogFix.
-
- Feb 12, 2016
-
Davies Liu authored
Expand suffers from creating the UnsafeRow from the same input multiple times; with codegen, it only needs to copy some of the columns. After this, we can see a 3X improvement (from 43 seconds to 13 seconds) on a TPC-DS query (Q67) that has eight columns in the Rollup. Ideally, we could mask some of the columns based on a bitmask; I'd leave that for the future, because currently Aggregation (50 ns) is much slower than just copying the variables (1-2 ns). Author: Davies Liu <davies@databricks.com> Closes #11177 from davies/gen_expand.
-
Michael Gummelt authored
Overrode the start() method, which was previously starting a thread causing a race condition. I believe this should fix the flaky test. Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #11164 from mgummelt/fix_mesos_tests.
-
Michael Gummelt authored
andrewor14 This addressed your style comments from #10993 Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #11187 from mgummelt/fix_mesos_style.
-
vijaykiran authored
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the classification module. Author: vijaykiran <mail@vijaykiran.com> Author: Bryan Cutler <cutlerb@gmail.com> Closes #11183 from BryanCutler/pyspark-consistent-param-classification-SPARK-12630.
-
Yanbo Liang authored
PySpark now supports `covar_samp` and `covar_pop`. cc rxin davies marmbrus Author: Yanbo Liang <ybliang8@gmail.com> Closes #10876 from yanboliang/spark-12962.
-
hyukjinkwon authored
https://issues.apache.org/jira/browse/SPARK-13260 This is a quick fix for `count(*)`. When `requiredColumns` is empty, it currently returns `sqlContext.sparkContext.emptyRDD[Row]`, which loses the count. Just like the JSON datasource, this PR lets the CSV datasource count the rows without parsing each set of tokens. Author: hyukjinkwon <gurwls223@gmail.com> Closes #11169 from HyukjinKwon/SPARK-13260.
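A sketch of the idea (illustrative, not the actual CSV datasource code): when no columns are requested, emit one empty row per input line instead of tokenizing and converting each line, so the row count is preserved at minimal cost.

```scala
// Assumed signature for illustration; `parse` stands in for the real
// tokenize-and-convert step of the CSV reader.
def buildScan(lines: Iterator[String],
              requiredColumns: Array[String],
              parse: String => Array[Any]): Iterator[Array[Any]] = {
  if (requiredColumns.isEmpty) {
    lines.map(_ => Array.empty[Any]) // keep the row count, skip parsing
  } else {
    lines.map(parse)
  }
}
```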
-
Reynold Xin authored
Previously we were using Option[String] and None to indicate the case when Spark fails to generate SQL. It is easier to just use exceptions to propagate error cases, rather than having for-comprehensions everywhere. I also introduced a "build" function that simplifies string concatenation (i.e. no need to reason about whether we have an extra space or not). Author: Reynold Xin <rxin@databricks.com> Closes #11171 from rxin/SPARK-13282.
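A hedged sketch of what such a `build` helper could look like (illustrative only, not necessarily the exact implementation): join the non-empty SQL fragments with single spaces so callers never have to reason about stray whitespace.

```scala
// Concatenate SQL fragments with single spaces, dropping empty pieces.
def build(segments: String*): String =
  segments.map(_.trim).filter(_.nonEmpty).mkString(" ")

// build("SELECT", "a, b", "FROM", "t", "", "WHERE a > 1")
//   == "SELECT a, b FROM t WHERE a > 1"
```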
-
Davies Liu authored
The current implementation of ResolveSortReferences can only push one missing attribute into its child, so it failed to analyze TPC-DS Q98, because there are two missing attributes in that query (one from Window, another from Aggregate). Author: Davies Liu <davies@databricks.com> Closes #11153 from davies/resolve_sort.
-
Holden Karau authored
We should have lint rules using sphinx to automatically catch the pydoc issues that are sometimes introduced. Right now ./dev/lint-python will skip building the docs if sphinx isn't present, but it might make sense to fail hard; it's just a matter of whether we want to insist that all PySpark developers have sphinx present. Author: Holden Karau <holden@us.ibm.com> Closes #11109 from holdenk/SPARK-13154-add-pydoc-lint-for-docs.
-
Yanbo Liang authored
Add Python API for spark.ml bisecting k-means. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10889 from yanboliang/spark-12974.
-
Sanket authored
This JIRA is related to https://github.com/apache/spark/pull/5852 I had to do some minor rework and testing to make sure it works with the current version of Spark. Author: Sanket <schintap@untilservice-lm> Closes #10838 from redsanket/limit-outbound-connections.
-
- Feb 11, 2016
-
Steve Loughran authored
When the HistoryServer is showing an incomplete app, it needs to check if there is a newer version of the app available. It does this by checking if a version of the app has been loaded with a larger *filesize*. If so, it detaches the current UI, attaches the new one, and redirects back to the same URL to show the new UI. https://issues.apache.org/jira/browse/SPARK-7889 Author: Steve Loughran <stevel@hortonworks.com> Author: Imran Rashid <irashid@cloudera.com> Closes #11118 from squito/SPARK-7889-alternate.
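A toy sketch of the refresh check described above (the `LoadedUI` type and `maybeReload` helper are illustrative, not the HistoryServer's actual API): reload an incomplete application's UI only when a larger event log has been seen since the UI was attached.

```scala
// Minimal model of an attached UI and the log file size it was built from.
case class LoadedUI(appId: String, fileSize: Long)

// If the on-disk event log has grown, detach the stale UI and attach a fresh
// one built from the newer log; otherwise keep serving the current UI.
def maybeReload(current: LoadedUI,
                latestFileSize: Long,
                detach: LoadedUI => Unit,
                attach: String => LoadedUI): LoadedUI =
  if (latestFileSize > current.fileSize) {
    detach(current)
    attach(current.appId)
  } else {
    current
  }
```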
-
Tommy YU authored
Fix this defect by checking whether a default value exists or not. yanboliang Please help to review. Author: Tommy YU <tummyyu@163.com> Closes #11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.
-
Earthson Lu authored
https://issues.apache.org/jira/browse/SPARK-12746 Author: Earthson Lu <Earthson.Lu@gmail.com> Closes #10697 from Earthson/SPARK-12746.
-
Herman van Hovell authored
It is possible to create faulty but legal ANTLR grammars. ANTLR will produce warnings but also a valid compilable parser. This PR makes sure we treat such warnings as build errors. cc rxin / viirya Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #11174 from hvanhovell/ANTLR-warnings-as-errors.
-
Davies Liu authored
This PR adds SQL metrics (numOutputRows) for generated operators (same as non-generated); the cost is about 0.2 nanoseconds per row. Screenshot: https://cloud.githubusercontent.com/assets/40902/12994694/47f5881e-d0d7-11e5-9d47-78229f559ab0.png Author: Davies Liu <davies@databricks.com> Closes #11170 from davies/gen_metric.
-
Liu Xiang authored
https://issues.apache.org/jira/browse/SPARK-12765 Author: Liu Xiang <lxmtlab@gmail.com> Closes #10720 from sloth2012/sloth.
-
sethah authored
Pyspark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is not currently a way of getting a Boolean to indicate if a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.

In Python:
```python
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
print nb.hasParam("smoothing")
print nb.hasParam("notAParam")
```
produces:
> True
> AttributeError: 'NaiveBayes' object has no attribute 'notAParam'

However, in Scala:
```scala
import org.apache.spark.ml.classification.NaiveBayes
val nb = new NaiveBayes()
nb.hasParam("smoothing")
nb.hasParam("notAParam")
```
produces:
> true
> false

cc holdenk Author: sethah <seth.hendrickson16@gmail.com> Closes #10962 from sethah/SPARK-13047.
-
Yanbo Liang authored
PySpark ml.clustering support export/import. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10999 from yanboliang/spark-13035.
-
Yanbo Liang authored
Test cases should be removed from the annotation of ```setXXX``` functions, otherwise they become part of the [Python API docs](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans.setInitMode). cc mengxr jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #10975 from yanboliang/clustering-cleanup.
-
Kai Jiang authored
PySpark ml.recommendation support export/import. Author: Kai Jiang <jiangkai@gmail.com> Closes #11044 from vectorijk/spark-13037.
-
Yu ISHIKAWA authored
cc jkbradley Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9535 from yu-iskw/SPARK-11515.
-
Yu ISHIKAWA authored
jkbradley I tried to improve the function to export a model. When I tried to export a model to S3 under Spark 1.6, it didn't work. So it should support S3 besides HDFS. Can you review it when you have time? Thanks! Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #11151 from yu-iskw/SPARK-13265.
-
Reynold Xin authored
This reverts commit 50fa6fd1.
-
Sital Kedia authored
This commit removes an unnecessary duplicate check in addPendingTask that meant that scheduling a task set took time proportional to (# tasks)^2. Author: Sital Kedia <skedia@fb.com> Closes #11167 from sitalkedia/fix_stuck_driver and squashes the following commits: 3fe1af8 [Sital Kedia] [SPARK-13279] Remove unnecessary duplicate check in addPendingTask function
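A toy illustration of the problem and the fix (the collection and method names here are illustrative, not the real TaskSetManager code): a `contains` scan on a growing buffer inside the per-task loop makes building the pending-task lists O(n^2) in the number of tasks.

```scala
import scala.collection.mutable.ArrayBuffer

// Before: an O(n) duplicate scan per insertion => O(n^2) for the whole task set.
def addPendingTaskQuadratic(pending: ArrayBuffer[Int], taskIndex: Int): Unit = {
  if (!pending.contains(taskIndex)) {
    pending += taskIndex
  }
}

// After: append unconditionally; any duplicates can be handled when the task
// is actually dequeued, so each insertion is O(1).
def addPendingTask(pending: ArrayBuffer[Int], taskIndex: Int): Unit = {
  pending += taskIndex
}
```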
-
jayadevanmurali authored
Add the table name validation at the temp table creation Author: jayadevanmurali <jayadevan.m@tcs.com> Closes #11051 from jayadevanmurali/branch-0.2-SPARK-12982.
-
Liang-Chi Hsieh authored
JIRA: https://issues.apache.org/jira/browse/SPARK-13277 There is an ANTLR warning during compilation:

warning(200): org/apache/spark/sql/catalyst/parser/SparkSqlParser.g:938:7: Decision can match input such as "KW_USING Identifier" using multiple alternatives: 2, 3. As a result, alternative(s) 3 were disabled for that input.

This patch fixes it. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #11168 from viirya/fix-parser-using.
-
Tathagata Das authored
Under some corner cases, the test suite failed to shut down the SparkContext, causing cascaded failures. This fix does two things:
- Makes sure no SparkContext is active after every test
- Makes sure the StreamingContext is always shut down (prevents leaking of StreamingContexts as well, just in case)

Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #11166 from tdas/fix-failuresuite.
-
Alex Bozarth authored
Made sure the old tables continue to use the old css and the new DataTables use the new css. Also fixed it so the Safari Web Inspector doesn't throw errors when on the new DataTables pages. Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #11038 from ajbozarth/spark13124.
-