Commits · 11ed2b180ec86523a94679a8b8132fadb911ccd5 · cs525-sp18-g07 / spark

Aug 14, 2015

[SPARK-9978] [PYSPARK] [SQL] fix Window.orderBy and doc of ntile() · 11ed2b18

Davies Liu authored 9 years ago

Author: Davies Liu <davies@databricks.com>

Closes #8213 from davies/fix_window.

11ed2b18

[SPARK-9877] [CORE] Fix StandaloneRestServer NPE when submitting application · 9407baa2

jerryshao authored 9 years ago

A detailed exception log can be seen in [SPARK-9877](https://issues.apache.org/jira/browse/SPARK-9877). The problem is that when `StandaloneRestServer` is created, `self` (`masterEndpoint`) is null. This fix creates `StandaloneRestServer` once `self` is available.

Author: jerryshao <sshao@hortonworks.com>

Closes #8127 from jerryshao/SPARK-9877.

9407baa2

[SPARK-9948] Fix flaky AccumulatorSuite - internal accumulators · 6518ef63

Andrew Or authored 9 years ago

In these tests, we use a custom listener and we assert on fields in the stage / task completion events. However, these events are posted in a separate thread so they're not guaranteed to be posted in time. This commit fixes this flakiness through a job end registration callback.

Author: Andrew Or <andrew@databricks.com>

Closes #8176 from andrewor14/fix-accumulator-suite.

6518ef63
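
The race described in this commit can be reproduced outside Spark. The following is a minimal sketch in plain Python, not the AccumulatorSuite code: `EventBus`, `run_job`, and `record` are hypothetical names, and a `threading.Event` stands in for the job-end callback. Events are delivered on a background thread, so the test waits for the job-end signal before asserting.

```python
import queue
import threading


class EventBus:
    """Delivers posted events to listeners on a background thread."""

    def __init__(self):
        self._events = queue.Queue()
        self._listeners = []
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def add_listener(self, listener):
        self._listeners.append(listener)

    def post(self, event):
        # Returns immediately; delivery happens later on the bus thread.
        self._events.put(event)

    def _run(self):
        while True:
            event = self._events.get()
            for listener in self._listeners:
                listener(event)


def run_job(bus):
    """Post one task-completion event, then the job-end event."""
    bus.post(("task_end", 42))
    bus.post(("job_end", None))


bus = EventBus()
seen = []
job_ended = threading.Event()


def record(event):
    kind, value = event
    if kind == "task_end":
        seen.append(value)
    elif kind == "job_end":
        # Job-end callback: earlier events have been delivered by now.
        job_ended.set()


bus.add_listener(record)
run_job(bus)

# Asserting on `seen` right here would be racy; wait for the callback first.
finished = job_ended.wait(timeout=5)
```

Because a single thread delivers events in posting order, the job-end signal implies that the task-completion event has already been recorded.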

[SPARK-9809] Task crashes because the internal accumulators are not properly initialized · 33bae585

Carson Wang authored 9 years ago

When a stage failed and another stage was resubmitted with only some of its partitions to compute, all the tasks failed with the error message: java.util.NoSuchElementException: key not found: peakExecutionMemory.
This is because the internal accumulators are not properly initialized for this stage, while other code assumes the internal accumulators always exist.

Author: Carson Wang <carson.wang@intel.com>

Closes #8090 from carsonwang/SPARK-9809.

33bae585
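
The failure mode can be sketched in plain Python. This is an illustration under assumptions, not Spark's scheduler code: `Stage`, `submit_missing_tasks`, and the `init_if_missing` flag are hypothetical, and a `KeyError` plays the role of the `NoSuchElementException`.

```python
INTERNAL_ACCUMULATORS = ("peakExecutionMemory",)


class Stage:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.internal_accumulators = {}

    def reset_internal_accumulators(self):
        """Create a fresh zero-valued accumulator for every internal metric."""
        self.internal_accumulators = {name: 0 for name in INTERNAL_ACCUMULATORS}


def submit_missing_tasks(stage, missing_partitions, init_if_missing):
    """Run tasks for `missing_partitions` and return the stage's accumulators."""
    all_partitions = len(missing_partitions) == stage.num_partitions
    needs_init = init_if_missing and not stage.internal_accumulators
    if all_partitions or needs_init:
        stage.reset_internal_accumulators()
    for _ in missing_partitions:
        # Task code assumes the key exists; a missing key raises KeyError.
        stage.internal_accumulators["peakExecutionMemory"] += 1
    return stage.internal_accumulators


def partial_resubmit(init_if_missing):
    """Resubmit a fresh 4-partition stage with only 2 partitions to compute."""
    stage = Stage(num_partitions=4)
    return submit_missing_tasks(stage, [1, 3], init_if_missing)
```

Calling `partial_resubmit(False)` raises `KeyError`, while `partial_resubmit(True)` initializes the accumulators first and completes.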

[SPARK-9828] [PYSPARK] Mutable values should not be default arguments · ffa05c84

MechCoder authored 9 years ago

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #8110 from MechCoder/spark-9828.

ffa05c84
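
The pitfall named in this commit title is a general Python one. The sketch below uses two hypothetical functions, `append_shared` and `append_fresh`, and is not taken from the PySpark code that the commit changed.

```python
def append_shared(item, bucket=[]):
    # The default list is created once, at function definition time,
    # so every call without `bucket` mutates the same object.
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    # Use None as a sentinel and build a new list on each call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


shared_first = list(append_shared("a"))   # snapshot after the first call
shared_second = list(append_shared("b"))  # second call still sees "a"
fresh_first = append_fresh("a")
fresh_second = append_fresh("b")
```

The `None` sentinel keeps calls independent, at the cost of one extra branch per call.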

[SPARK-9561] Re-enable BroadcastJoinSuite · ece00566

Andrew Or authored 9 years ago

We can do this now that SPARK-9580 is resolved.

Author: Andrew Or <andrew@databricks.com>

Closes #8208 from andrewor14/reenable-sql-tests.

ece00566

[SPARK-9946] [SPARK-9589] [SQL] fix NPE and thread-safety in TaskMemoryManager · 3bc55287

Davies Liu authored 9 years ago

Currently, we access `page.pageNumber` after the page is freed, when it could have been modified by another thread, causing an NPE.

The same TaskMemoryManager could be used by multiple threads (for example, Python UDF and TransportScript), so allocating and freeing memory/pages should be thread safe. The underlying Bitset and HashSet are not thread safe, so we should put them inside a synchronized block.

cc JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #8177 from davies/memory_manager.

3bc55287
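
Both points in this message (read the page number before freeing, and guard the shared collections) can be sketched in plain Python. `Page`, `PageTable`, and `worker` are hypothetical names, and a `threading.Lock` stands in for the JVM `synchronized` block; this is not the TaskMemoryManager implementation.

```python
import threading


class Page:
    def __init__(self, page_number):
        self.page_number = page_number


class PageTable:
    """Tracks allocated page numbers; safe to share between threads."""

    def __init__(self, capacity):
        self._lock = threading.Lock()
        self._free_numbers = set(range(capacity))
        self._pages = {}

    def allocate(self):
        # The set and dict are not thread-safe, so mutate them under the lock.
        with self._lock:
            number = self._free_numbers.pop()
            page = Page(number)
            self._pages[number] = page
            return page

    def free(self, page):
        with self._lock:
            # Read the page number before releasing the page; afterwards
            # another thread may reuse or clear it.
            number = page.page_number
            del self._pages[number]
            page.page_number = None
            self._free_numbers.add(number)

    def allocated_count(self):
        with self._lock:
            return len(self._pages)

    def free_count(self):
        with self._lock:
            return len(self._free_numbers)


def worker(table, rounds):
    for _ in range(rounds):
        table.free(table.allocate())


table = PageTable(capacity=8)
threads = [threading.Thread(target=worker, args=(table, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each worker holds at most one page at a time, so four workers never exhaust the eight available page numbers.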

[SPARK-9923] [CORE] ShuffleMapStage.numAvailableOutputs should be an Int instead of Long · 57c2d088

Neelesh Srinivas Salian authored 9 years ago

Modified type of ShuffleMapStage.numAvailableOutputs from Long to Int

Author: Neelesh Srinivas Salian <nsalian@cloudera.com>

Closes #8183 from nssalian/SPARK-9923.

57c2d088

[SPARK-9929] [SQL] support metadata in withColumn · 34d610be

Wenchen Fan authored 9 years ago

In MLlib we sometimes need to set metadata for the new column, so we alias the new column with metadata before calling `withColumn`, and `withColumn` then aliases this column again. Here I overloaded `withColumn` to allow users to set metadata, just like what we did for `Column.as`.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8159 from cloud-fan/withColumn.

34d610be

[SPARK-8744] [ML] Add a public constructor to StringIndexer · a7317ccd

Holden Karau authored 9 years ago

It would be helpful to allow users to pass a pre-computed index to create an indexer, rather than always going through StringIndexer to create the model.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #7267 from holdenk/SPARK-8744-StringIndexerModel-should-have-public-constructor.

a7317ccd

[SPARK-9956] [ML] Make trees work with one-category features · 7ecf0c46

Joseph K. Bradley authored 9 years ago

This modifies DecisionTreeMetadata construction to treat 1-category features as continuous, so that trees do not fail with such features. It is important for the pipelines API, where VectorIndexer can automatically categorize certain features as categorical.

As stated in the JIRA, this is a temp fix which we can improve upon later by automatically filtering out those features. That will take longer, though, since it will require careful indexing.

Targeted for 1.5 and master

CC: manishamde mengxr yanboliang

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8187 from jkbradley/tree-1cat.

7ecf0c46
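
The rule described above (treat a feature with one category as continuous) can be sketched as follows. `split_feature_kinds` and its arguments are hypothetical and only illustrate the idea; they do not mirror the DecisionTreeMetadata API.

```python
def split_feature_kinds(num_features, categorical_arity):
    """Return (categorical, continuous) feature descriptions.

    `categorical_arity` maps feature index -> number of categories.
    """
    categorical = {}
    continuous = []
    for index in range(num_features):
        arity = categorical_arity.get(index)
        if arity is not None and arity > 1:
            categorical[index] = arity
        else:
            # Unlisted features and 1-category features are both continuous.
            continuous.append(index)
    return categorical, continuous


categorical, continuous = split_feature_kinds(
    num_features=4, categorical_arity={0: 3, 2: 1}
)
```

Here feature 2 is declared categorical with a single category, so it ends up in the continuous list alongside the undeclared features.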

[SPARK-9661] [MLLIB] minor clean-up of SPARK-9661 · a0e1abbd

Xiangrui Meng authored 9 years ago

Some minor clean-ups after SPARK-9661. See my inline comments. MechCoder jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #8190 from mengxr/SPARK-9661-fix.

a0e1abbd

[SPARK-9958] [SQL] Make HiveThriftServer2Listener thread-safe and update the tab name to "JDBC/ODBC Server" · c8677d73

zsxwing authored 9 years ago

This PR fixes a thread-safety issue in HiveThriftServer2Listener and changes the tab name to "JDBC/ODBC Server", since the previous name conflicts with the new SQL tab.

<img width="1377" alt="thriftserver" src="https://cloud.githubusercontent.com/assets/1000778/9265707/c46f3f2c-4269-11e5-8d7e-888c9113ab4f.png">

Author: zsxwing <zsxwing@gmail.com>

Closes #8185 from zsxwing/SPARK-9958.

c8677d73

[MINOR] [SQL] Remove canEqual in Row · 7c7c7529

Liang-Chi Hsieh authored 9 years ago

As `InternalRow` does not extend `Row` now, I think we can remove it.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #8170 from viirya/remove_canequal.

7c7c7529

Aug 13, 2015

[SPARK-9945] [SQL] pageSize should be calculated from executor.memory · bd35385d

Davies Liu authored 9 years ago

Currently, the pageSize of TungstenSort is calculated from driver.memory; it should use executor.memory instead.

Also, in the worst case, the safeFactor could be 4 (because of rounding), so this change increases it to 16.

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #8175 from davies/page_size.

bd35385d

[SPARK-9580] [SQL] Replace singletons in SQL tests · 8187b3ae

Andrew Or authored 9 years ago

A fundamental limitation of the existing SQL tests is that *there is simply no way to create your own `SparkContext`*. This is a serious limitation because the user may wish to use a different master or config. As a case in point, `BroadcastJoinSuite` is entirely commented out because there is no way to make it pass with the existing infrastructure.

This patch removes the singletons `TestSQLContext` and `TestData`, and instead introduces a `SharedSQLContext` that starts a context per suite. Unfortunately the singletons were so ingrained in the SQL tests that this patch necessarily needed to touch *all* the SQL test files.

Author: Andrew Or <andrew@databricks.com>

Closes #8111 from andrewor14/sql-tests-refactor.

8187b3ae

[SPARK-9943] [SQL] deserialized UnsafeHashedRelation should be serializable · c50f97da

Davies Liu authored 9 years ago

When free memory in the executor runs low, cached broadcast objects need to be serialized to disk, but currently a deserialized UnsafeHashedRelation can't be serialized and fails with an NPE. This PR fixes that.

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #8174 from davies/serialize_hashed.

c50f97da

[SPARK-8976] [PYSPARK] fix open mode in python3 · 693949ba

Davies Liu authored 9 years ago

This bug only happens on Python 3 and Windows.

I tested this manually with Python 3 and the Python daemon disabled; there is no unit test yet.

Author: Davies Liu <davies@databricks.com>

Closes #8181 from davies/open_mode.

693949ba

[SPARK-9922] [ML] rename StringIndexerReverse to IndexToString · 6c5858bc

Xiangrui Meng authored 9 years ago

What `StringIndexerInverse` does is not strictly associated with `StringIndexer`, and the name does not clearly describe the transformation. Renaming it to `IndexToString` might be better.

~~I also changed `invert` to `inverse` without arguments. `inputCol` and `outputCol` could be set after.~~
I also removed `invert`.

jkbradley holdenk

Author: Xiangrui Meng <meng@databricks.com>

Closes #8152 from mengxr/SPARK-9922.

6c5858bc

[SPARK-9935] [SQL] EqualNotNull not processed in ORC · c2520f50

hyukjinkwon authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-9935

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #8163 from HyukjinKwon/master.

c2520f50

[SPARK-9942] [PYSPARK] [SQL] ignore exceptions while try to import pandas · a8d2f4c5

Davies Liu authored 9 years ago

If pandas is broken (it can't be imported and raises an exception other than ImportError), pyspark can't be imported either. We should ignore all exceptions raised while importing pandas.

Author: Davies Liu <davies@databricks.com>

Closes #8173 from davies/fix_pandas.

a8d2f4c5
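
The approach described in this message amounts to catching more than `ImportError` around an optional import. The following is a minimal sketch; `optional_import` and `has_pandas` are hypothetical names and are not copied from the PySpark source.

```python
import importlib


def optional_import(name):
    """Return the imported module, or None if importing it fails for any reason."""
    try:
        return importlib.import_module(name)
    except Exception:
        # Deliberately broader than ImportError: a broken installation can
        # raise other exceptions while its module body executes.
        return None


pandas = optional_import("pandas")
has_pandas = pandas is not None
```

Code that depends on the optional package can then branch on `has_pandas` instead of failing at import time.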

[SPARK-9661] [MLLIB] [ML] Java compatibility · 864de8ea

MechCoder authored 9 years ago

I skimmed through the docs for various instances of Object and replaced them with Java-compatible versions of the same.

1. Some methods in LDAModel.
2. runMiniBatchSGD
3. kolmogorovSmirnovTest

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #8126 from MechCoder/java_incop.

864de8ea

[SPARK-9649] Fix MasterSuite, third time's a charm · 8815ba2f

Andrew Or authored 9 years ago

This particular test did not load the default configurations so
it continued to start the REST server, which causes port bind
exceptions.

8815ba2f

[MINOR] [DOC] fix mllib pydoc warnings · 65fec798

Xiangrui Meng authored 9 years ago

Switch to correct Sphinx syntax. MechCoder

Author: Xiangrui Meng <meng@databricks.com>

Closes #8169 from mengxr/mllib-pydoc-fix.

65fec798

[MINOR] [ML] change MultilayerPerceptronClassifierModel to MultilayerPerceptronClassificationModel · 4b70798c

Yanbo Liang authored 9 years ago

To follow the naming rule of ML, change `MultilayerPerceptronClassifierModel` to `MultilayerPerceptronClassificationModel` like `DecisionTreeClassificationModel`, `GBTClassificationModel` and so on.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8164 from yanboliang/mlp-name.

4b70798c

[SPARK-8965] [DOCS] Add ml-guide Python Example: Estimator, Transformer, and Param · 7a539ef3

Rosstin authored 9 years ago

Added ml-guide Python Example: Estimator, Transformer, and Param
/docs/_site/ml-guide.html

Author: Rosstin <asterazul@gmail.com>

Closes #8081 from Rosstin/SPARK-8965.

7a539ef3

[SPARK-9073] [ML] spark.ml Models copy() should call setParent when there is a parent · 2932e25d

lewuathe authored 9 years ago

Copied ML models must have the same parent as the original ones.

Author: lewuathe <lewuathe@me.com>
Author: Lewuathe <lewuathe@me.com>

Closes #7447 from Lewuathe/SPARK-9073.

2932e25d
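
The invariant stated in this commit can be sketched with two toy classes. `Estimator`, `Model`, and `set_parent` below are hypothetical Python stand-ins, not the spark.ml classes.

```python
class Estimator:
    def __init__(self, name):
        self.name = name

    def fit(self):
        model = Model(coefficient=1.0)
        model.set_parent(self)
        return model


class Model:
    def __init__(self, coefficient):
        self.coefficient = coefficient
        self.parent = None

    def set_parent(self, parent):
        self.parent = parent
        return self

    def copy(self):
        copied = Model(self.coefficient)
        # Without this step the copy would silently lose its parent.
        if self.parent is not None:
            copied.set_parent(self.parent)
        return copied


estimator = Estimator("lr")
original = estimator.fit()
duplicate = original.copy()
```

A model that never had a parent still copies cleanly, with `parent` left as `None`.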

[SPARK-9757] [SQL] Fixes persistence of Parquet relation with decimal column · 69930310

Cheng Lian authored 9 years ago

PR #7967 enables us to save data source relations to metastore in Hive compatible format when possible. But it fails to persist Parquet relations with decimal column(s) to Hive metastore of versions lower than 1.2.0. This is because `ParquetHiveSerDe` in Hive versions prior to 1.2.0 doesn't support decimal. This PR checks for this case and falls back to Spark SQL specific metastore table format.

Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #8130 from liancheng/spark-9757/old-hive-parquet-decimal.

69930310
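
The fallback decision described above can be sketched as a small function. This is an illustration only: `parse_version`, `choose_table_format`, and the returned labels are hypothetical, and the real check involves more conditions than the decimal case shown here.

```python
def parse_version(text):
    """Turn '1.2.0' into (1, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def choose_table_format(hive_version, column_types):
    """Pick a metastore format for a Parquet relation."""
    has_decimal = any(t.startswith("decimal") for t in column_types)
    if has_decimal and parse_version(hive_version) < (1, 2, 0):
        # Older ParquetHiveSerDe cannot handle decimal columns.
        return "spark-sql-specific"
    return "hive-compatible"
```

Comparing integer tuples avoids the string-comparison trap where "0.13.1" would sort after "0.2.0".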

[SPARK-9885] [SQL] Also pass barrierPrefixes and sharedPrefixes to IsolatedClientLoader when hiveMetastoreJars is set to maven · 84a27916

Yin Huai authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-9885

cc marmbrus liancheng

Author: Yin Huai <yhuai@databricks.com>

Closes #8158 from yhuai/classloaderMaven.

84a27916

[SPARK-9918] [MLLIB] remove runs from k-means and rename epsilon to tol · 68f99571

Xiangrui Meng authored 9 years ago

This requires some discussion. I'm not sure whether `runs` is a useful parameter. It certainly complicates the implementation. We might want to optimize the k-means implementation with block matrix operations. In this case, having `runs` may not be worth the trade-off. Also it increases the communication cost in a single job, which might cause other issues.

This PR also renames `epsilon` to `tol` to have consistent naming among algorithms. The Python constructor is updated to include all parameters.

jkbradley yu-iskw

Author: Xiangrui Meng <meng@databricks.com>

Closes #8148 from mengxr/SPARK-9918 and squashes the following commits:

149b9e5 [Xiangrui Meng] fix constructor in Python and rename epsilon to tol
3cc15b3 [Xiangrui Meng] fix test and change initStep to initSteps in python
a0a0274 [Xiangrui Meng] remove runs from k-means in the pipeline API

68f99571

[SPARK-9927] [SQL] Revert 8049 since it's pushing wrong filter down · d0b18919

Yijie Shen authored 9 years ago

I made a mistake in #8049 by casting the literal value to the attribute's data type, which would simply truncate the literal value and push a wrong filter down.

JIRA: https://issues.apache.org/jira/browse/SPARK-9927

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #8157 from yjshen/rever8049.

d0b18919
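
A small worked example shows why casting the literal changes the result. The column values and the predicate `a < 1.5` are made up for illustration and do not come from #8049.

```python
rows = [0, 1, 2, 3]      # values of an integer column `a`
literal = 1.5            # right-hand side of the predicate `a < 1.5`

# Correct evaluation: compare each integer against the original literal.
expected = [a for a in rows if a < literal]

# Faulty pushdown: cast the literal to the column's type first.
# int(1.5) == 1, so the pushed filter becomes `a < 1`.
pushed = [a for a in rows if a < int(literal)]
```

The row with `a = 1` satisfies the original predicate but is dropped by the truncated one, so the pushed-down filter returns too few rows.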

[SPARK-9914] [ML] define setters explicitly for Java and use setParam group in RFormula · d7eb371e

Xiangrui Meng authored 9 years ago

The problem with defining setters in the base class is that it doesn't return the correct type in Java.

ericl

Author: Xiangrui Meng <meng@databricks.com>

Closes #8143 from mengxr/SPARK-9914 and squashes the following commits:

d36c887 [Xiangrui Meng] remove setters from model
a49021b [Xiangrui Meng] define setters explicitly for Java and use setParam group

d7eb371e

Aug 12, 2015

[SPARK-8922] [DOCUMENTATION, MLLIB] Add @since tags to mllib.evaluation · df543892

shikai.tang authored 9 years ago

Author: shikai.tang <tar.sky06@gmail.com>

Closes #7429 from mosessky/master.

df543892

[SPARK-9917] [ML] add getMin/getMax and doc for originalMin/origianlMax in MinMaxScaler · 5fc058a1

Xiangrui Meng authored 9 years ago

hhbyyh

Author: Xiangrui Meng <meng@databricks.com>

Closes #8145 from mengxr/SPARK-9917.

5fc058a1

[SPARK-9832] [SQL] add a thread-safe lookup for BytesToBytseMap · a8ab2634

Davies Liu authored 9 years ago

This patch adds a thread-safe lookup for BytesToBytesMap and uses it in the broadcasted HashedRelation.

Author: Davies Liu <davies@databricks.com>

Closes #8151 from davies/safeLookup.

a8ab2634

[SPARK-9920] [SQL] The simpleString of TungstenAggregate does not show its output · 22782190

Yin Huai authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-9920

Taking `sqlContext.sql("select i, sum(j1) as sum from testAgg group by i").explain()` as an example, the output of our current master is
```
== Physical Plan ==
TungstenAggregate(key=[i#0], value=[(sum(cast(j1#1 as bigint)),mode=Final,isDistinct=false)]
 TungstenExchange hashpartitioning(i#0)
  TungstenAggregate(key=[i#0], value=[(sum(cast(j1#1 as bigint)),mode=Partial,isDistinct=false)]
   Scan ParquetRelation[file:/user/hive/warehouse/testagg][i#0,j1#1]
```
With this PR, the output will be
```
== Physical Plan ==
TungstenAggregate(key=[i#0], functions=[(sum(cast(j1#1 as bigint)),mode=Final,isDistinct=false)], output=[i#0,sum#18L])
 TungstenExchange hashpartitioning(i#0)
  TungstenAggregate(key=[i#0], functions=[(sum(cast(j1#1 as bigint)),mode=Partial,isDistinct=false)], output=[i#0,currentSum#22L])
   Scan ParquetRelation[file:/user/hive/warehouse/testagg][i#0,j1#1]
```

Author: Yin Huai <yhuai@databricks.com>

Closes #8150 from yhuai/SPARK-9920.

22782190

[SPARK-9916] [BUILD] [SPARKR] removed left-over sparkr.zip copy/create commands from codebase · 2fb4901b

Burak Yavuz authored 9 years ago

sparkr.zip is now built by SparkSubmit on a need-to-build basis.

cc shivaram

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #8147 from brkyvz/make-dist-fix.

2fb4901b

[SPARK-9903] [MLLIB] skip local processing in PrefixSpan if there are no small prefixes · d7053bea

Xiangrui Meng authored 9 years ago

There exists a chance that the prefixes keep growing to the maximum pattern length. Then the final local processing step becomes unnecessary. feynmanliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #8136 from mengxr/SPARK-9903.

d7053bea

[SPARK-9704] [ML] Made ProbabilisticClassifier, Identifiable, VectorUDT public APIs · d2d5e7fe

Joseph K. Bradley authored 9 years ago

Made ProbabilisticClassifier, Identifiable, VectorUDT public. All are annotated as DeveloperApi.

CC: mengxr EronWright

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8004 from jkbradley/ml-api-public-items and squashes the following commits:

7ebefda [Joseph K. Bradley] update per code review
7ff0768 [Joseph K. Bradley] attepting to add mima fix
756d84c [Joseph K. Bradley] VectorUDT annotated as AlphaComponent
ae7767d [Joseph K. Bradley] added another warning
94fd553 [Joseph K. Bradley] Made ProbabilisticClassifier, Identifiable, VectorUDT public APIs

d2d5e7fe

[SPARK-9908] [SQL] When spark.sql.tungsten.enabled is false, broadcast join does not work · 4413d085

Yin Huai authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-9908

Author: Yin Huai <yhuai@databricks.com>

Closes #8149 from yhuai/SPARK-9908.

4413d085