Commits · 22e9723d6208f2cd2dfa26487ea1c041cb9d7dcd · cs525-sp18-g07 / spark

Feb 14, 2016

[SPARK-13278][CORE] Launcher fails to start with JDK 9 EA · 22e9723d

Claes Redestad authored 9 years ago

See http://openjdk.java.net/jeps/223 for more information about the JDK 9 version string scheme.
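As a rough illustration of why the new scheme can trip up version checks, the sketch below extracts the major version from both legacy strings such as `1.8.0_66` and JEP 223 style strings such as `9-ea`. The function name `java_major_version` is hypothetical and this is not the launcher's actual code, only a minimal sketch of the parsing problem.

```python
import re


def java_major_version(version: str) -> int:
    """Return the Java major version for legacy and JEP 223 version strings.

    Hypothetical helper for illustration; not taken from the Spark launcher.
    """
    # Read the leading dot-separated numbers and ignore suffixes such as
    # "_66", "-ea" or "+100".
    match = re.match(r"(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized Java version string: {version!r}")
    first = int(match.group(1))
    second = int(match.group(2)) if match.group(2) is not None else 0
    # Legacy strings start with "1." ("1.8.0_66" means Java 8); under the
    # new scheme the first number is the major version itself.
    return second if first == 1 else first
```

Code that assumes the string always contains a second numeric component would fail on `9-ea`, which is the kind of input the new scheme produces.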

Author: Claes Redestad <claes.redestad@gmail.com>

Closes #11160 from cl4es/master.


[SPARK-13300][DOCUMENTATION] Added pygments.rb dependancy · 331293c3

Amit Dev authored 9 years ago

The pygments.rb gem also appears to be required for the jekyll build to work. At least on Ubuntu/RHEL, I could not build without this dependency, so I added it to the steps.

Author: Amit Dev <amitdev@gmail.com>

Closes #11180 from amitdev/master.


Feb 13, 2016

[SPARK-13296][SQL] Move UserDefinedFunction into sql.expressions. · 354d4c24

Reynold Xin authored 9 years ago

This pull request has the following changes:

1. Moved UserDefinedFunction into expressions package. This is more consistent with how we structure the packages for window functions and UDAFs.

2. Moved UserDefinedPythonFunction into execution.python package, so we don't have a random private class in the top level sql package.

3. Move everything in execution/python.scala into the newly created execution.python package.

Most of the diffs are just straight copy-paste.

Author: Reynold Xin <rxin@databricks.com>

Closes #11181 from rxin/SPARK-13296.


[SPARK-13172][CORE][SQL] Stop using RichException.getStackTrace it is deprecated · 388cd9ea

Sean Owen authored 9 years ago

Replace `getStackTraceString` with `Utils.exceptionString`

Author: Sean Owen <sowen@cloudera.com>

Closes #11182 from srowen/SPARK-13172.


Closes #11185 · 610196f9

Reynold Xin authored 9 years ago


[SPARK-12363][MLLIB] Remove setRun and fix PowerIterationClustering failed test · e3441e3f

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-12363

This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph are 0.0 instead of the correct values. Setting `TripletFields.All` in `mapTriplets` fixes this.

Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #10539 from viirya/fix-poweriter.


[SPARK-13142][WEB UI] Problem accessing Web UI /logPage/ on Microsoft Windows · 374c4b28

markpavey authored 9 years ago

Due to being on a Windows platform I have been unable to run the tests as described in the "Contributing to Spark" instructions. As the change is only to two lines of code in the Web UI, which I have manually built and tested, I am submitting this pull request anyway. I hope this is OK.

Is it worth considering also including this fix in any future 1.5.x releases (if any)?

I confirm this is my own original work and license it to the Spark project under its open source license.

Author: markpavey <mark.pavey@thefilter.com>

Closes #11135 from markpavey/JIRA_SPARK-13142_WindowsWebUILogFix.


Feb 12, 2016

[SPARK-13293][SQL] generate Expand · 2228f074

Davies Liu authored 9 years ago

Expand suffers from creating the UnsafeRow from the same input multiple times. With codegen, it only needs to copy some of the columns.

After this, we can see a 3X improvement (from 43 seconds to 13 seconds) on a TPCDS query (Q67) that has eight columns in Rollup.

Ideally, we could mask some of the columns based on a bitmask. I'll leave that for the future, because Aggregation (50 ns) is currently much slower than just copying the variables (1-2 ns).

Author: Davies Liu <davies@databricks.com>

Closes #11177 from davies/gen_expand.


[SPARK-5095] remove flaky test · 62b1c07e

Michael Gummelt authored 9 years ago

Overrode the start() method, which was previously starting a thread causing a race condition. I believe this should fix the flaky test.

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #11164 from mgummelt/fix_mesos_tests.


[SPARK-5095] Fix style in mesos coarse grained scheduler code · 38bc6018

Michael Gummelt authored 9 years ago

andrewor14 This addressed your style comments from #10993

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #11187 from mgummelt/fix_mesos_style.


[SPARK-12630][PYSPARK] [DOC] PySpark classification parameter desc to consistent format · 42d65681

vijaykiran authored 9 years ago

Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the classification module.

Author: vijaykiran <mail@vijaykiran.com>
Author: Bryan Cutler <cutlerb@gmail.com>

Closes #11183 from BryanCutler/pyspark-consistent-param-classification-SPARK-12630.


[SPARK-12962] [SQL] [PySpark] PySpark support covar_samp and covar_pop · 90de6b2f

Yanbo Liang authored 9 years ago

PySpark now supports `covar_samp` and `covar_pop`.

cc rxin davies marmbrus

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10876 from yanboliang/spark-12962.


[SPARK-13260][SQL] count(*) does not work with CSV data source · ac7d6af1

hyukjinkwon authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-13260

This is a quick fix for `count(*)`.

When `requiredColumns` is empty, the CSV datasource currently returns `sqlContext.sparkContext.emptyRDD[Row]`, which contains no rows, so the count is lost.

Just like the JSON datasource, this PR lets the CSV datasource count the rows without parsing each set of tokens.
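The idea can be sketched outside Spark as follows. This is a plain-Python illustration, not the datasource code: `scan_csv` and its parameters are hypothetical names, and it assumes one record per input line.

```python
import csv
from typing import Iterable, Iterator, List, Sequence


def scan_csv(lines: Iterable[str], header: Sequence[str],
             required_columns: Sequence[str]) -> Iterator[List[str]]:
    """Yield one row per input line, keeping only the required columns."""
    if not required_columns:
        # No column is needed (for example count(*)): emit one empty row
        # per line so the row count survives, and skip tokenizing.
        for _ in lines:
            yield []
        return
    indices = [header.index(name) for name in required_columns]
    for tokens in csv.reader(lines):
        yield [tokens[i] for i in indices]
```

Returning an empty collection in the first branch, instead of one empty row per line, is what would make the count come out as zero.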

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #11169 from HyukjinKwon/SPARK-13260.


[SPARK-13282][SQL] LogicalPlan toSql should just return a String · c4d5ad80

Reynold Xin authored 9 years ago

Previously we used Option[String] and None to indicate the case where Spark fails to generate SQL. It is easier to use exceptions to propagate error cases than to have for comprehensions everywhere. I also introduced a "build" function that simplifies string concatenation (i.e. no need to reason about whether we have an extra space or not).
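A minimal Python sketch of the two ideas, assuming nothing about the real implementation: `build`, `to_sql` and `UnsupportedPlanError` here are illustrative names, and the SQL generated is deliberately trivial.

```python
from typing import List


class UnsupportedPlanError(Exception):
    """Raised when no SQL can be generated (hypothetical error type)."""


def build(*segments: str) -> str:
    """Join the non-empty segments with exactly one space between them."""
    return " ".join(s.strip() for s in segments if s and s.strip())


def to_sql(table: str, columns: List[str], condition: str = "") -> str:
    """Return a SQL string, or raise instead of returning an empty option."""
    if not columns:
        raise UnsupportedPlanError("cannot generate SQL without output columns")
    where = f"WHERE {condition}" if condition else ""
    # build() drops the empty WHERE segment, so no stray space appears.
    return build("SELECT", ", ".join(columns), "FROM", table, where)
```

Callers get a plain string back and handle failure in one place with an exception handler, rather than unwrapping an optional value at every step.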

Author: Reynold Xin <rxin@databricks.com>

Closes #11171 from rxin/SPARK-13282.


[SPARK-12705] [SQL] push missing attributes for Sort · 5b805df2

Davies Liu authored 9 years ago

The current implementation of ResolveSortReferences can only push one missing attribute into its child. It failed to analyze TPCDS Q98 because that query has two missing attributes (one from Window, another from Aggregate).

Author: Davies Liu <davies@databricks.com>

Closes #11153 from davies/resolve_sort.


[SPARK-13154][PYTHON] Add linting for pydocs · 64515e5f

Holden Karau authored 9 years ago

We should have lint rules using sphinx to automatically catch the pydoc issues that are sometimes introduced.

Right now ./dev/lint-python will skip building the docs if sphinx isn't present - but it might make sense to fail hard - just a matter of if we want to insist all PySpark developers have sphinx present.

Author: Holden Karau <holden@us.ibm.com>

Closes #11109 from holdenk/SPARK-13154-add-pydoc-lint-for-docs.


[SPARK-12974][ML][PYSPARK] Add Python API for spark.ml bisecting k-means · a183dda6

Yanbo Liang authored 9 years ago

Add Python API for spark.ml bisecting k-means.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10889 from yanboliang/spark-12974.


[SPARK-6166] Limit number of in flight outbound requests · 894921d8

Sanket authored 9 years ago

This JIRA is related to https://github.com/apache/spark/pull/5852

I had to do some minor rework and testing to make sure it works with the current version of Spark.

Author: Sanket <schintap@untilservice-lm>

Closes #10838 from redsanket/limit-outbound-connections.


Feb 11, 2016

[SPARK-7889][WEBUI] HistoryServer updates UI for incomplete apps · a2c7dcf6

Steve Loughran authored 9 years ago

When the HistoryServer is showing an incomplete app, it needs to check if there is a newer version of the app available. It does this by checking if a version of the app has been loaded with a larger *filesize*. If so, it detaches the current UI, attaches the new one, and redirects back to the same URL to show the new UI.
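The check described above can be sketched roughly as follows. `AppUI` and `UICache` are hypothetical stand-ins, not HistoryServer classes, and the redirect is left to the caller.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class AppUI:
    """Stand-in for a loaded application UI (hypothetical type)."""
    app_id: str
    log_size: int      # size of the event log this UI was built from
    completed: bool


class UICache:
    """Tracks the UI currently attached for each application."""

    def __init__(self) -> None:
        self._attached: Dict[str, AppUI] = {}

    def attach(self, ui: AppUI) -> None:
        self._attached[ui.app_id] = ui

    def current(self, app_id: str) -> AppUI:
        return self._attached[app_id]

    def refresh_if_newer(self, latest: AppUI) -> bool:
        """Swap in `latest` if the attached UI is incomplete and older.

        Returns True when the caller should redirect to the same URL.
        """
        attached = self._attached[latest.app_id]
        if attached.completed or latest.log_size <= attached.log_size:
            return False
        # Replacing the entry stands in for detaching the old UI and
        # attaching the new one.
        self._attached[latest.app_id] = latest
        return True
```

Comparing sizes is cheap, which is presumably why it is used as the freshness signal here; the sketch assumes the log only ever grows.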

https://issues.apache.org/jira/browse/SPARK-7889

Author: Steve Loughran <stevel@hortonworks.com>
Author: Imran Rashid <irashid@cloudera.com>

Closes #11118 from squito/SPARK-7889-alternate.


[SPARK-13153][PYSPARK] ML persistence failed when handle no default value parameter · d3e2e202

Tommy YU authored 9 years ago

Fix this defect by checking whether a default value exists.

yanboliang Please help to review.

Author: Tommy YU <tummyyu@163.com>

Closes #11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.


[SPARK-12746][ML] ArrayType(_, true) should also accept ArrayType(_, false) · 5f1c3590

Earthson Lu authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-12746

Author: Earthson Lu <Earthson.Lu@gmail.com>

Closes #10697 from Earthson/SPARK-12746.


[SPARK-13277][BUILD] Follow-up ANTLR warnings are treated as build errors · 8121a4b1

Herman van Hovell authored 9 years ago

It is possible to create faulty but legal ANTLR grammars. ANTLR will produce warnings but also a valid, compilable parser. This PR makes sure we treat such warnings as build errors.

cc rxin / viirya

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #11174 from hvanhovell/ANTLR-warnings-as-errors.


[SPARK-12915][SQL] add SQL metrics of numOutputRows for whole stage codegen · b10af5e2

Davies Liu authored 9 years ago

This PR adds SQL metrics (numOutputRows) for generated operators (same as non-generated). The cost is about 0.2 nanoseconds per row.

<img width="806" alt="gen metrics" src="https://cloud.githubusercontent.com/assets/40902/12994694/47f5881e-d0d7-11e5-9d47-78229f559ab0.png">

Author: Davies Liu <davies@databricks.com>

Closes #11170 from davies/gen_metric.


[SPARK-12765][ML][COUNTVECTORIZER] fix CountVectorizer.transform's lost transformSchema · a5257048

Liu Xiang authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-12765

Author: Liu Xiang <lxmtlab@gmail.com>

Closes #10720 from sloth2012/sloth.


[SPARK-13047][PYSPARK][ML] Pyspark Params.hasParam should not throw an error · b3546738

sethah authored 9 years ago

Pyspark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is not currently a way of getting a Boolean to indicate if a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.

In Python:
```python
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
print(nb.hasParam("smoothing"))
print(nb.hasParam("notAParam"))
```
produces:
> True
> AttributeError: 'NaiveBayes' object has no attribute 'notAParam'

However, in Scala:
```scala
import org.apache.spark.ml.classification.NaiveBayes
val nb  = new NaiveBayes()
nb.hasParam("smoothing")
nb.hasParam("notAParam")
```
produces:
> true
> false
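One way to get the Boolean behavior in Python is sketched below with stand-in classes, so it runs without Spark. `Param`, `Params` and `NaiveBayesLike` here are simplified, hypothetical versions written for this example, not the PySpark classes themselves.

```python
class Param:
    """Minimal stand-in for a parameter descriptor."""

    def __init__(self, name: str, doc: str = "") -> None:
        self.name = name
        self.doc = doc


class Params:
    """Minimal stand-in for a class that owns parameters."""

    def hasParam(self, paramName: str) -> bool:
        """Return True only if an attribute of that name is a Param."""
        if not isinstance(paramName, str):
            raise TypeError("hasParam(): paramName must be a string")
        # getattr with a default avoids the AttributeError for unknown
        # names; the isinstance check rejects ordinary attributes.
        return isinstance(getattr(self, paramName, None), Param)


class NaiveBayesLike(Params):
    smoothing = Param("smoothing", "smoothing parameter")
```

With this shape, `NaiveBayesLike().hasParam("notAParam")` returns `False` instead of raising, and a method name such as `"hasParam"` is not mistaken for a parameter.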

cc holdenk

Author: sethah <seth.hendrickson16@gmail.com>

Closes #10962 from sethah/SPARK-13047.


[SPARK-13035][ML][PYSPARK] PySpark ml.clustering support export/import · 30e00955

Yanbo Liang authored 9 years ago

PySpark ml.clustering support export/import.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10999 from yanboliang/spark-13035.


[MINOR][ML][PYSPARK] Cleanup test cases of clustering.py · 2426eb3e

Yanbo Liang authored 9 years ago

Test cases should be removed from the annotation of the `setXXX` functions; otherwise they become part of the [Python API docs](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans.setInitMode).

cc mengxr jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10975 from yanboliang/clustering-cleanup.


[SPARK-13037][ML][PYSPARK] PySpark ml.recommendation support export/import · c8f667d7

Kai Jiang authored 9 years ago

PySpark ml.recommendation support export/import.

Author: Kai Jiang <jiangkai@gmail.com>

Closes #11044 from vectorijk/spark-13037.


[SPARK-11515][ML] QuantileDiscretizer should take random seed · 574571c8

Yu ISHIKAWA authored 9 years ago

cc jkbradley

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #9535 from yu-iskw/SPARK-11515.


[SPARK-13265][ML] Refactoring of basic ML import/export for other file system besides HDFS · efb65e09

Yu ISHIKAWA authored 9 years ago

jkbradley I tried to improve the function that exports a model. When I tried to export a model to S3 under Spark 1.6, it did not work, so the export should support S3 in addition to HDFS. Can you review it when you have time? Thanks!

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #11151 from yu-iskw/SPARK-13265.


Revert "[SPARK-13279] Remove O(n^2) operation from scheduler." · c86009ce

Reynold Xin authored 9 years ago

This reverts commit 50fa6fd1.


[SPARK-13279] Remove O(n^2) operation from scheduler. · 50fa6fd1

Sital Kedia authored 9 years ago

This commit removes an unnecessary duplicate check in addPendingTask that meant
that scheduling a task set took time proportional to (# tasks)^2.
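The cost described above can be illustrated with a small sketch that counts comparisons. It models the pending queue as a plain list and is not the scheduler's code; both function names are hypothetical.

```python
from typing import List


def schedule_with_check(num_tasks: int) -> int:
    """Queue tasks with a duplicate check; return how many comparisons ran."""
    pending: List[int] = []
    comparisons = 0
    for index in range(num_tasks):
        duplicate = False
        for queued in pending:  # linear scan of everything queued so far
            comparisons += 1
            if queued == index:
                duplicate = True
                break
        if not duplicate:
            pending.append(index)
    return comparisons


def schedule_without_check(num_tasks: int) -> List[int]:
    """Queue tasks by appending only: constant work per task."""
    pending: List[int] = []
    for index in range(num_tasks):
        pending.append(index)
    return pending
```

With distinct task indices the checked version performs n(n-1)/2 comparisons, so doubling the number of tasks roughly quadruples the work, while the unchecked version grows linearly.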

Author: Sital Kedia <skedia@fb.com>

Closes #11167 from sitalkedia/fix_stuck_driver and squashes the following commits:

3fe1af8 [Sital Kedia] [SPARK-13279] Remove unnecessary duplicate check in addPendingTask function


[SPARK-12982][SQL] Add table name validation in temp table registration · 0d50a220

jayadevanmurali authored 9 years ago

Add table name validation at temp table creation.

Author: jayadevanmurali <jayadevan.m@tcs.com>

Closes #11051 from jayadevanmurali/branch-0.2-SPARK-12982.


[SPARK-13277][SQL] ANTLR ignores other rule using the USING keyword · e31c8073

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-13277

There is an ANTLR warning during compilation:

    warning(200): org/apache/spark/sql/catalyst/parser/SparkSqlParser.g:938:7:
    Decision can match input such as "KW_USING Identifier" using multiple alternatives: 2, 3

    As a result, alternative(s) 3 were disabled for that input

This patch is to fix it.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #11168 from viirya/fix-parser-using.


[STREAMING][TEST] Fix flaky streaming.FailureSuite · 219a74a7

Tathagata Das authored 9 years ago

In some corner cases, the test suite failed to shut down the SparkContext, causing cascading failures. This fix does two things:
- Makes sure no SparkContext is active after every test
- Makes sure StreamingContext is always shut down (prevents leaking of StreamingContexts as well, just in case)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #11166 from tdas/fix-failuresuite.


[SPARK-13124][WEB UI] Fixed CSS and JS issues caused by addition of JQuery DataTables · 13c17cbb

Alex Bozarth authored 9 years ago

Made sure the old tables continue to use the old css and the new DataTables use the new css. Also fixed it so the Safari Web Inspector doesn't throw errors when on the new DataTables pages.

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #11038 from ajbozarth/spark13124.


[SPARK-13074][CORE] Add JavaSparkContext. getPersistentRDDs method · f9ae99fe

Junyang authored 9 years ago

The "getPersistentRDDs()" is a useful API of SparkContext to get cached RDDs. However, the JavaSparkContext does not have this API.

Add a simple getPersistentRDDs() to get java.util.Map<Integer, JavaRDD> for Java users.

Author: Junyang <fly.shenjy@gmail.com>

Closes #10978 from flyjy/master.


[SPARK-13264][DOC] Removed multi-byte characters in spark-env.sh.template · c2f21d88

Sasaki Toru authored 9 years ago

spark-env.sh.template contains multi-byte characters. This PR removes them.

Author: Sasaki Toru <sasakitoa@nttdata.co.jp>

Closes #11149 from sasakitoa/remove_multibyte_in_sparkenv.


[SPARK-13270][SQL] Remove extra new lines in whole stage codegen and include... · 18bcbbdd

Nong Li authored 9 years ago

[SPARK-13270][SQL] Remove extra new lines in whole stage codegen and include pipeline plan in comments.

Author: Nong Li <nong@databricks.com>

Closes #11155 from nongli/spark-13270.


[SPARK-13235][SQL] Removed an Extra Distinct from the Plan when Using Union in SQL · e88bff12

gatorsmile authored 9 years ago

Currently, the parser adds two `Distinct` operators to the plan when the SQL uses `Union` or `Union Distinct`. This PR removes the extra `Distinct` from the plan.

For example, before the fix, the following query has a plan with two `Distinct` operators:
```scala
sql("select * from t0 union select * from t0").explain(true)
```
```
== Parsed Logical Plan ==
'Project [unresolvedalias(*,None)]
+- 'Subquery u_2
   +- 'Distinct
      +- 'Project [unresolvedalias(*,None)]
         +- 'Subquery u_1
            +- 'Distinct
               +- 'Union
                  :- 'Project [unresolvedalias(*,None)]
                  :  +- 'UnresolvedRelation `t0`, None
                  +- 'Project [unresolvedalias(*,None)]
                     +- 'UnresolvedRelation `t0`, None

== Analyzed Logical Plan ==
id: bigint
Project [id#16L]
+- Subquery u_2
   +- Distinct
      +- Project [id#16L]
         +- Subquery u_1
            +- Distinct
               +- Union
                  :- Project [id#16L]
                  :  +- Subquery t0
                  :     +- Relation[id#16L] ParquetRelation
                  +- Project [id#16L]
                     +- Subquery t0
                        +- Relation[id#16L] ParquetRelation

== Optimized Logical Plan ==
Aggregate [id#16L], [id#16L]
+- Aggregate [id#16L], [id#16L]
   +- Union
      :- Project [id#16L]
      :  +- Relation[id#16L] ParquetRelation
      +- Project [id#16L]
         +- Relation[id#16L] ParquetRelation
```
After the fix, the plan is changed without the extra `Distinct` as follows:
```
== Parsed Logical Plan ==
'Project [unresolvedalias(*,None)]
+- 'Subquery u_1
   +- 'Distinct
      +- 'Union
         :- 'Project [unresolvedalias(*,None)]
         :  +- 'UnresolvedRelation `t0`, None
         +- 'Project [unresolvedalias(*,None)]
           +- 'UnresolvedRelation `t0`, None

== Analyzed Logical Plan ==
id: bigint
Project [id#17L]
+- Subquery u_1
   +- Distinct
      +- Union
        :- Project [id#16L]
        :  +- Subquery t0
        :     +- Relation[id#16L] ParquetRelation
        +- Project [id#16L]
          +- Subquery t0
          +- Relation[id#16L] ParquetRelation

== Optimized Logical Plan ==
Aggregate [id#17L], [id#17L]
+- Union
  :- Project [id#16L]
  :  +- Relation[id#16L] ParquetRelation
  +- Project [id#16L]
    +- Relation[id#16L] ParquetRelation
```

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11120 from gatorsmile/unionDistinct.
