- Jun 05, 2015
Ted Blackman authored
Author: Ted Blackman <ted.blackman@gmail.com> Closes #6656 from belisarius222/branch-1.4 and squashes the following commits: 747cbc2 [Ted Blackman] [SPARK-8116][PYSPARK] Allow sc.range() to take a single argument.
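The single-argument form mirrors Python's built-in `range()`: one argument is treated as the stop value, with start defaulting to 0. A minimal pure-Python sketch of that argument normalization (hypothetical helper name, not the actual PySpark code):

```python
def normalize_range_args(start, stop=None, step=1):
    """Mirror Python's built-in range(): when called with a single
    argument, treat it as the stop value and default start to 0."""
    if stop is None:
        start, stop = 0, start
    return start, stop, step
```

With this, `sc.range(5)` can behave exactly like `sc.range(0, 5, 1)`.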
-
- Jun 04, 2015
Carson Wang authored
The log page should show only the desired number of bytes. Currently it shows bytes from the startIndex to the end of the file, and the "Next" button on the page is always disabled. Author: Carson Wang <carson.wang@intel.com> Closes #6640 from carsonwang/logpage and squashes the following commits: 58cb3fd [Carson Wang] Show correct length of bytes on log page (cherry picked from commit 63bc0c44) Signed-off-by:
Tathagata Das <tathagata.das1565@gmail.com>
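The fix amounts to clamping the requested window to the file instead of reading to end-of-file. A small sketch under assumed names (the real implementation lives in the Spark web UI code):

```python
def log_page_range(start_index, byte_length, file_length):
    """Clamp the requested byte window so only byte_length bytes are
    shown; 'Next' should be enabled whenever end < file_length."""
    start = max(0, min(start_index, file_length))
    end = min(start + byte_length, file_length)
    has_next = end < file_length
    return start, end, has_next
```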
-
Shivaram Venkataraman authored
This also helps us get rid of the sparkr-docs maven profile as docs are now built by just using -Psparkr when the roxygen2 package is available Related to discussion in #6567 cc pwendell srowen -- Let me know if this looks better Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #6593 from shivaram/sparkr-pom-cleanup and squashes the following commits: b282241 [Shivaram Venkataraman] Remove sparkr-docs from release script as well 8f100a5 [Shivaram Venkataraman] Move man pages creation to install-dev.sh This also helps us get rid of the sparkr-docs maven profile as docs are now built by just using -Psparkr when the roxygen2 package is available (cherry picked from commit 3dc00528) Signed-off-by:
Shivaram Venkataraman <shivaram@cs.berkeley.edu>
-
Mike Dusenberry authored
Added a `DataFrame.drop` function that accepts a `Column` reference rather than a `String`, and added associated unit tests. Basically iterates through the `DataFrame` to find a column with an expression that is equivalent to that of the `Column` argument supplied to the function. Author: Mike Dusenberry <dusenberrymw@gmail.com> Closes #6585 from dusenberrymw/SPARK-7969_Drop_method_on_Dataframes_should_handle_Column and squashes the following commits: 514727a [Mike Dusenberry] Updating the @since tag of the drop(Column) function doc to reflect version 1.4.1 instead of 1.4.0. 2f1bb4e [Mike Dusenberry] Adding an additional assert statement to the 'drop column after join' unit test in order to make sure the correct column was indeed left over. 6bf7c0e [Mike Dusenberry] Minor code formatting change. e583888 [Mike Dusenberry] Adding more Python doctests for the df.drop with column reference function to test joined datasets that have columns with the same name. 5f74401 [Mike Dusenberry] Updating DataFrame.drop with column reference function to use logicalPlan.output to prevent ambiguities resulting from columns with the same name. Also added associated unit tests for joined datasets with duplicate column names. 4b8bbe8 [Mike Dusenberry] Adding Python support for Dataframe.drop with a Column reference. 986129c [Mike Dusenberry] Added a DataFrame.drop function that accepts a Column reference rather than a String, and added associated unit tests. Basically iterates through the DataFrame to find a column with an expression that is equivalent to one supplied to the function. (cherry picked from commit df7da07a) Signed-off-by:
Reynold Xin <rxin@databricks.com>
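The key point is matching on expression identity rather than name, so that after a join, dropping one side's column leaves the identically named column from the other side intact. A toy pure-Python illustration (hypothetical classes, not Catalyst's actual attribute types):

```python
class Attr:
    """Stand-in for a logical-plan output attribute."""
    def __init__(self, name, expr_id):
        self.name = name
        self.expr_id = expr_id  # unique id that survives name collisions

def drop(output, col):
    """Drop by expression id, not by name, so same-named columns
    from the other side of a join are kept."""
    return [a for a in output if a.expr_id != col.expr_id]
```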
-
Daniel Darabos authored
If maxTaskFailures is 1, the task set is aborted after 1 task failure. Other documentation and the code support this reading; I think it's just this comment that was off. It's easy to make this mistake, so can you please double-check that I'm correct? Thanks! Author: Daniel Darabos <darabos.daniel@gmail.com> Closes #6621 from darabos/patch-2 and squashes the following commits: dfebdec [Daniel Darabos] Fix comment. (cherry picked from commit 10ba1880) Signed-off-by:
Sean Owen <sowen@cloudera.com>
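The corrected reading: the task set aborts once the failure count reaches maxTaskFailures, not one past it. A sketch of the check (assumed function name, not the scheduler's actual code):

```python
def should_abort(num_failures, max_task_failures):
    # With max_task_failures = 1, the very first failure already
    # aborts: the limit counts allowed attempts, not allowed retries.
    return num_failures >= max_task_failures
```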
-
- Jun 03, 2015
Andrew Or authored
A necessary dependency that is transitively referenced is not provided, causing compilation failures in builds that provide the kinesis-asl profile.
-
Andrew Or authored
This includes the following commits: original: 9eb222c1 hotfix1: 8c997933 hotfix2: a4f24123 scalastyle check: 609c4923 --- Original patch #6441 Branch-1.3 patch #6602 Author: Andrew Or <andrew@databricks.com> Closes #6598 from andrewor14/demarcate-tests-1.4 and squashes the following commits: 4c3c566 [Andrew Or] Merge branch 'branch-1.4' of github.com:apache/spark into demarcate-tests-1.4 e217b78 [Andrew Or] [SPARK-7558] Guard against direct uses of FunSuite / FunSuiteLike 46d4361 [Andrew Or] Various whitespace changes (minor) 3d9bf04 [Andrew Or] Make all test suites extend SparkFunSuite instead of FunSuite eaa520e [Andrew Or] Fix tests? b4d93de [Andrew Or] Fix tests 634a777 [Andrew Or] Fix log message a932e8d [Andrew Or] Fix manual things that cannot be covered through automation 8bc355d [Andrew Or] Add core tests as dependencies in all modules 75d361f [Andrew Or] Introduce base abstract class for all test suites
-
Andrew Or authored
For branch-1.4. This is identical to #6629 and is strictly not necessary. I'm opening this as a PR since it changes Jenkins test behavior and I want to test it out here. Author: Andrew Or <andrew@databricks.com> Closes #6630 from andrewor14/build-check-hive-1.4 and squashes the following commits: 186ec65 [Andrew Or] [BUILD] Use right branch when checking against Hive
-
Andrew Or authored
Currently hive tests alone take 40m. The right thing to do is to reduce the test time. However, that is a bigger project and we currently have PRs blocking on tests not timing out.
-
Shivaram Venkataraman authored
cc shaneknapp pwendell JoshRosen Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #6623 from shivaram/SPARK-8084 and squashes the following commits: 0ec5b26 [Shivaram Venkataraman] Make SparkR scripts fail on error (cherry picked from commit 0576c3c4) Signed-off-by:
Shivaram Venkataraman <shivaram@cs.berkeley.edu>
-
Ryan Williams authored
Author: Ryan Williams <ryan.blake.williams@gmail.com> Closes #6624 from ryan-williams/execs and squashes the following commits: b6f71d4 [Ryan Williams] don't attempt to lower number of executors by 0 (cherry picked from commit 51898b51) Signed-off-by:
Andrew Or <andrew@databricks.com>
-
Andrew Or authored
-
Xiangrui Meng authored
This is just a workaround to a bigger problem. Some pipeline stages may not be effective during prediction, and they should not complain about missing required columns, e.g. `StringIndexerModel`. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #6595 from mengxr/SPARK-8051 and squashes the following commits: b6a36b9 [Xiangrui Meng] add doc f143fd4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-8051 8ee7c7e [Xiangrui Meng] use SparkFunSuite e112394 [Xiangrui Meng] make StringIndexerModel silent if input column does not exist (cherry picked from commit 26c9d7a0) Signed-off-by:
Joseph K. Bradley <joseph@databricks.com>
-
Shivaram Venkataraman authored
cc andrewor14 Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #6424 from shivaram/spark-worker-instances-yarn-ec2 and squashes the following commits: db244ae [Shivaram Venkataraman] Make Python Lint happy 0593d1b [Shivaram Venkataraman] Clear SPARK_WORKER_INSTANCES when using YARN (cherry picked from commit d3e026f8) Signed-off-by:
Andrew Or <andrew@databricks.com>
-
zsxwing authored
[SPARK-7989] [CORE] [TESTS] Fix flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite The flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite will fail if there are not enough executors up before running the jobs. This PR adds `JobProgressListener.waitUntilExecutorsUp`. The tests for the cluster mode can use it to wait until the expected executors are up. Author: zsxwing <zsxwing@gmail.com> Closes #6546 from zsxwing/SPARK-7989 and squashes the following commits: 5560e09 [zsxwing] Fix a typo 3b69840 [zsxwing] Fix flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite (cherry picked from commit f2713478) Signed-off-by:
Andrew Or <andrew@databricks.com> Conflicts: core/src/test/scala/org/apache/spark/broadcast/BroadcastSuite.scala core/src/test/scala/org/apache/spark/scheduler/SparkListenerWithClusterSuite.scala
-
zsxwing authored
Some places forget to call `assert` to check the return value of `AsynchronousListenerBus.waitUntilEmpty`. Instead of adding `assert` in these places, I think it's better to make `AsynchronousListenerBus.waitUntilEmpty` throw `TimeoutException`. Author: zsxwing <zsxwing@gmail.com> Closes #6550 from zsxwing/SPARK-8001 and squashes the following commits: 607674a [zsxwing] Make AsynchronousListenerBus.waitUntilEmpty throw TimeoutException if timeout (cherry picked from commit 1d8669f1) Signed-off-by:
Andrew Or <andrew@databricks.com>
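Raising on timeout, instead of returning a boolean that callers may forget to assert, makes the failure impossible to ignore. A pure-Python sketch of the same idea (hypothetical helper; Python's TimeoutError stands in for Java's TimeoutException):

```python
import time

def wait_until_empty(is_empty, timeout_s):
    """Poll until the listener queue drains; raise rather than
    return False so a forgotten assert cannot hide a timeout."""
    deadline = time.monotonic() + timeout_s
    while not is_empty():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"bus not empty after {timeout_s}s")
        time.sleep(0.01)
```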
-
Timothy Chen authored
Author: Timothy Chen <tnachen@gmail.com> Closes #6615 from tnachen/mesos_driver_path and squashes the following commits: 4f47b7c [Timothy Chen] Use the correct base path in mesos driver page. (cherry picked from commit bfbf12b3) Signed-off-by:
Andrew Or <andrew@databricks.com>
-
Andrew Or authored
It's good practice to check if the input path is in the directory we expect to avoid potentially confusing error messages.
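A common way to implement such a guard in Python: resolve the path first so `..` segments and symlinks cannot escape the expected directory (a sketch with assumed names, not the actual Spark check):

```python
import os

def is_within(base_dir, user_path):
    """Resolve '..' and symlinks before comparing, so a path like
    'logs/../etc/passwd' is rejected instead of silently accepted."""
    base = os.path.realpath(base_dir)
    path = os.path.realpath(os.path.join(base, user_path))
    return path == base or path.startswith(base + os.sep)
```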
-
Joseph K. Bradley authored
Java-friendly APIs added: * GaussianMixture.run() * GaussianMixtureModel.predict() * DistributedLDAModel.javaTopicDistributions() * StreamingKMeans: trainOn, predictOn, predictOnValues * Statistics.corr * params * added doc to w() since Java docs do not inherit doc * removed non-Java-friendly w() from StringArrayParam and DoubleArrayParam * made DoubleArrayParam Java-friendly w() actually Java-friendly I generated the doc and verified all changes. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #6562 from jkbradley/java-api-1.4 and squashes the following commits: c16821b [Joseph K. Bradley] Small fixes based on code review. d955581 [Joseph K. Bradley] unit test fixes 29b6b0d [Joseph K. Bradley] small fixes fe6dcfe [Joseph K. Bradley] Added several Java-friendly APIs + unit tests: NaiveBayes, GaussianMixture, LDA, StreamingKMeans, Statistics.corr, params (cherry picked from commit 20a26b59) Signed-off-by:
Joseph K. Bradley <joseph@databricks.com>
-
Reynold Xin authored
[SPARK-8074] Parquet should throw AnalysisException during setup for data type/name related failures. Author: Reynold Xin <rxin@databricks.com> Closes #6608 from rxin/parquet-analysis and squashes the following commits: b5dc8e2 [Reynold Xin] Code review feedback. 5617cf6 [Reynold Xin] [SPARK-8074] Parquet should throw AnalysisException during setup for data type/name related failures. (cherry picked from commit 939e4f3d) Signed-off-by:
Reynold Xin <rxin@databricks.com>
-
Sun Rui authored
[SPARK-8063] [SPARKR] Spark master URL conflict between MASTER env variable and --master command line option. Author: Sun Rui <rui.sun@intel.com> Closes #6605 from sun-rui/SPARK-8063 and squashes the following commits: 51ca48b [Sun Rui] [SPARK-8063][SPARKR] Spark master URL conflict between MASTER env variable and --master command line option. (cherry picked from commit 708c63bb) Signed-off-by:
Shivaram Venkataraman <shivaram@cs.berkeley.edu>
-
animesh authored
1. range() overloaded in SQLContext.scala 2. range() modified in python sql context.py 3. Tests added accordingly in DataFrameSuite.scala and python sql tests.py Author: animesh <animesh@apache.spark> Closes #6609 from animeshbaranawal/SPARK-7980 and squashes the following commits: 935899c [animesh] SPARK-7980:python+scala changes (cherry picked from commit d053a31b) Signed-off-by:
Reynold Xin <rxin@databricks.com>
-
Yin Huai authored
https://issues.apache.org/jira/browse/SPARK-7973 Author: Yin Huai <yhuai@databricks.com> Closes #6525 from yhuai/SPARK-7973 and squashes the following commits: 763b821 [Yin Huai] Also change the timeout of "Single command with -e" to 2 minutes. e598a08 [Yin Huai] Increase the timeout to 3 minutes. (cherry picked from commit f1646e10) Signed-off-by:
Yin Huai <yhuai@databricks.com>
-
Reynold Xin authored
Author: Reynold Xin <rxin@databricks.com> Closes #6601 from rxin/python-read-write-test-and-doc and squashes the following commits: baa8ad5 [Reynold Xin] Code review feedback. f081d47 [Reynold Xin] More documentation updates. c9902fa [Reynold Xin] [SPARK-8060] Improve DataFrame Python reader/writer interface doc and testing. (cherry picked from commit ce320cb2) Signed-off-by:
Reynold Xin <rxin@databricks.com>
-
MechCoder authored
The current check tests whether the version string `1.x` is less than `1.4`, which fails whenever x has more than one digit: even though x > 4 numerically, the string `1.x` compares less than `1.4`. It fails on my system since I have version `1.10` :P Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6579 from MechCoder/np_ver and squashes the following commits: 15430f8 [MechCoder] fix syntax error 893fb7e [MechCoder] remove equal to e35f0d4 [MechCoder] minor e89376c [MechCoder] Better checking 22703dd [MechCoder] [SPARK-8032] Make version checking for NumPy in MLlib more robust (cherry picked from commit 452eb82d) Signed-off-by:
Xiangrui Meng <meng@databricks.com>
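The underlying bug is lexicographic string comparison: `"1.10" < "1.4"` is true because `'1' < '4'`, even though 1.10 is the newer release. A robust check compares numeric components instead (a sketch with an assumed helper name, not MLlib's actual code):

```python
def numpy_at_least(version, required=(1, 4)):
    """Compare (major, minor) as integers; comparing the raw strings
    would wrongly report '1.10' as older than '1.4'."""
    major, minor = (int(p) for p in version.split(".")[:2])
    return (major, minor) >= required
```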
-
Yuhao Yang authored
jira: https://issues.apache.org/jira/browse/SPARK-8043 I found some issues during testing the save/load examples in markdown Documents, as a part of 1.4 QA plan Author: Yuhao Yang <hhbyyh@gmail.com> Closes #6584 from hhbyyh/naiveDocExample and squashes the following commits: a01a206 [Yuhao Yang] fix for Gaussian mixture 2fb8b96 [Yuhao Yang] update NaiveBayes and SVM examples in doc (cherry picked from commit 43adbd56) Signed-off-by:
Xiangrui Meng <meng@databricks.com>
-
Joseph K. Bradley authored
I searched the Spark codebase for all occurrences of "scalingVector" CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #6596 from jkbradley/scalingVec-rename and squashes the following commits: d3812f8 [Joseph K. Bradley] renamed scalingVector to scalingVec (cherry picked from commit 07c16cb5) Signed-off-by:
Xiangrui Meng <meng@databricks.com>
-
- Jun 02, 2015
DB Tsai authored
This is scala example code for both linear and logistic regression. Python and Java versions are to be added. Author: DB Tsai <dbt@netflix.com> Closes #6576 from dbtsai/elasticNetExample and squashes the following commits: e7ca406 [DB Tsai] fix test 6bb6d77 [DB Tsai] fix suite and remove duplicated setMaxIter 136e0dd [DB Tsai] address feedback 1ec29d4 [DB Tsai] fix style 9462f5f [DB Tsai] add example (cherry picked from commit a86b3e9b) Signed-off-by:
Joseph K. Bradley <joseph@databricks.com>
-
Ram Sriharsha authored
Author: Ram Sriharsha <rsriharsha@hw11853.local> Closes #6358 from harsha2010/SPARK-7387 and squashes the following commits: 63efda2 [Ram Sriharsha] more examples for classifier to distinguish mapreduce from spark properly aeb6bb6 [Ram Sriharsha] Python Style Fix 54a500c [Ram Sriharsha] Merge branch 'master' into SPARK-7387 615e91c [Ram Sriharsha] cleanup 204c4e3 [Ram Sriharsha] Merge branch 'master' into SPARK-7387 7246d35 [Ram Sriharsha] [SPARK-7387][ml][doc] CrossValidator example code in Python (cherry picked from commit c3f4c325) Signed-off-by:
Joseph K. Bradley <joseph@databricks.com>
-
Patrick Wendell authored
-
Cheng Lian authored
This is a follow-up of PR #6493, which has been reverted in branch-1.4 because it uses Java 7 specific APIs and breaks Java 6 build. This PR replaces those APIs with equivalent Guava ones to ensure Java 6 friendliness. cc andrewor14 pwendell, this should also be back ported to branch-1.4. Author: Cheng Lian <lian@databricks.com> Closes #6547 from liancheng/override-log4j and squashes the following commits: c900cfd [Cheng Lian] Addresses Shixiong's comment 72da795 [Cheng Lian] Uses Guava API to ensure Java 6 friendliness (cherry picked from commit 5cd6a63d) Signed-off-by:
Andrew Or <andrew@databricks.com>
-
Cheng Lian authored
[SQL] [TEST] [MINOR] Uses a temporary log4j.properties in HiveThriftServer2Test to ensure expected logging behavior The `HiveThriftServer2Test` relies on proper logging behavior to assert whether the Thrift server daemon process is started successfully. However, some other jar files listed in the classpath may potentially contain an unexpected Log4J configuration file which overrides the logging behavior. This PR writes a temporary `log4j.properties` and prepend it to driver classpath before starting the testing Thrift server process to ensure proper logging behavior. cc andrewor14 yhuai Author: Cheng Lian <lian@databricks.com> Closes #6493 from liancheng/override-log4j and squashes the following commits: c489e0e [Cheng Lian] Fixes minor Scala styling issue b46ef0d [Cheng Lian] Uses a temporary log4j.properties in HiveThriftServer2Test to ensure expected logging behavior
-
Patrick Wendell authored
-
Patrick Wendell authored
-
Xiangrui Meng authored
The temporary column should be dropped after we get the prediction column. harsha2010 Author: Xiangrui Meng <meng@databricks.com> Closes #6592 from mengxr/SPARK-8049 and squashes the following commits: 1d89107 [Xiangrui Meng] use SparkFunSuite 6ee70de [Xiangrui Meng] drop tmp col from OneVsRest output (cherry picked from commit 89f21f66) Signed-off-by:
Xiangrui Meng <meng@databricks.com>
-
Patrick Wendell authored
-
Patrick Wendell authored
-
Davies Liu authored
Thanks ogirardot, closes #6580 cc rxin JoshRosen Author: Davies Liu <davies@databricks.com> Closes #6590 from davies/when and squashes the following commits: c0f2069 [Davies Liu] fix Column.when() and otherwise() (cherry picked from commit 605ddbb2) Signed-off-by:
Reynold Xin <rxin@databricks.com>
-