  1. Sep 05, 2015
  2. Sep 04, 2015
    • [SPARK-9925] [SQL] [TESTS] Set SQLConf.SHUFFLE_PARTITIONS.key correctly for tests · 47058ca5
      Yin Huai authored
      This PR fixes the failed test and resolves the conflict for #8155.
      
      https://issues.apache.org/jira/browse/SPARK-9925
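
      For context, `SQLConf.SHUFFLE_PARTITIONS.key` resolves to `spark.sql.shuffle.partitions`; a minimal sketch of how a test might set it (assuming an ambient `sqlContext`; the value is illustrative):

      ```python
      # Hedged sketch: tests should go through SQLConf.SHUFFLE_PARTITIONS.key,
      # i.e. "spark.sql.shuffle.partitions", rather than a hard-coded variant.
      sqlContext.setConf("spark.sql.shuffle.partitions", "5")
      ```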
      
      Closes #8155
      
      Author: Yin Huai <yhuai@databricks.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8602 from davies/shuffle_partitions.
    • [SPARK-10402] [DOCS] [ML] Add defaults to the scaladoc for params in ml/ · 22eab706
      Holden Karau authored
      We should make sure the scaladoc for params includes their default values throughout the models in ml/.
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #8591 from holdenk/SPARK-10402-add-scaladoc-for-default-values-of-params-in-ml.
    • [SPARK-10311] [STREAMING] Reload appId and attemptId when app starts with checkpoint file in cluster mode · eafe3723
      xutingjun authored
      [SPARK-10311] [STREAMING] Reload appId and attemptId when app starts with checkpoint file in cluster mode
      
      Author: xutingjun <xutingjun@huawei.com>
      
      Closes #8477 from XuTingjun/streaming-attempt.
    • [SPARK-10454] [SPARK CORE] wait for empty event queue · 2e1c1755
      robbins authored
      Author: robbins <robbins@uk.ibm.com>
      
      Closes #8605 from robbinspg/DAGSchedulerSuite-fix.
    • [SPARK-9669] [MESOS] Support PySpark on Mesos cluster mode. · b087d23e
      Timothy Chen authored
      Support running pyspark with cluster mode on Mesos!
      This doesn't upload any scripts, so running against a remote Mesos cluster requires the user to specify the script via an accessible URI.
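
      A hedged illustration of the submission (the dispatcher host and script URL are hypothetical):

      ```
      spark-submit --master mesos://dispatcher-host:7077 \
        --deploy-mode cluster \
        http://example.com/my_pyspark_job.py
      ```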
      
      Author: Timothy Chen <tnachen@gmail.com>
      
      Closes #8349 from tnachen/mesos_python.
    • [SPARK-10450] [SQL] Minor improvements to readability / style / typos etc. · 3339e6f6
      Andrew Or authored
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #8603 from andrewor14/minor-sql-changes.
    • [SPARK-10176] [SQL] Show partially analyzed plans when checkAnswer fails to analyze · c3c0e431
      Wenchen Fan authored
      This PR takes over https://github.com/apache/spark/pull/8389.
      
      This PR improves `checkAnswer` to print the partially analyzed plan in addition to the user-friendly error message, in order to aid debugging failing tests.
      
      In doing so, I ran into a conflict with the various ways that we bring a SQLContext into the tests. Depending on the trait, the current context is referred to as `sqlContext`, `_sqlContext`, `ctx`, or `hiveContext`, with `public`, `protected`, or `private` access modifiers according to the defining class.
      
      I propose we refactor as follows:
      
      1. All tests should refer only to a `protected sqlContext` when testing general features, and to a `protected hiveContext` when testing a method that only exists on `HiveContext`.
      2. All tests should import only `testImplicits._` (i.e., don't import `TestHive.implicits._`).
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #8584 from cloud-fan/cleanupTests.
    • MAINTENANCE: Automated closing of pull requests. · 804a0126
      Michael Armbrust authored
      This commit exists to close the following pull requests on Github:
      
      Closes #1890 (requested by andrewor14, JoshRosen)
      Closes #3558 (requested by JoshRosen, marmbrus)
      Closes #3890 (requested by marmbrus)
      Closes #3895 (requested by andrewor14, marmbrus)
      Closes #4055 (requested by andrewor14)
      Closes #4105 (requested by andrewor14)
      Closes #4812 (requested by marmbrus)
      Closes #5109 (requested by andrewor14)
      Closes #5178 (requested by andrewor14)
      Closes #5298 (requested by marmbrus)
      Closes #5393 (requested by marmbrus)
      Closes #5449 (requested by andrewor14)
      Closes #5468 (requested by marmbrus)
      Closes #5715 (requested by marmbrus)
      Closes #6192 (requested by marmbrus)
      Closes #6319 (requested by marmbrus)
      Closes #6326 (requested by marmbrus)
      Closes #6349 (requested by marmbrus)
      Closes #6380 (requested by andrewor14)
      Closes #6554 (requested by marmbrus)
      Closes #6696 (requested by marmbrus)
      Closes #6868 (requested by marmbrus)
      Closes #6951 (requested by marmbrus)
      Closes #7129 (requested by marmbrus)
      Closes #7188 (requested by marmbrus)
      Closes #7358 (requested by marmbrus)
      Closes #7379 (requested by marmbrus)
      Closes #7628 (requested by marmbrus)
      Closes #7715 (requested by marmbrus)
      Closes #7782 (requested by marmbrus)
      Closes #7914 (requested by andrewor14)
      Closes #8051 (requested by andrewor14)
      Closes #8269 (requested by andrewor14)
      Closes #8448 (requested by andrewor14)
      Closes #8576 (requested by andrewor14)
    • [MINOR] Minor style fix in SparkR · 143e521d
      Shivaram Venkataraman authored
      `dev/lintr-r` passes on my machine now
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #8601 from shivaram/sparkr-style-fix.
  3. Sep 03, 2015
  4. Sep 02, 2015
    • [SPARK-9723] [ML] params getordefault should throw more useful error · 44948a2e
      Holden Karau authored
      Params.getOrDefault should throw a more meaningful exception than what you get from a bad key lookup.
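
      A minimal Python sketch of the idea (names here are hypothetical, not the actual implementation): surface the missing parameter instead of a bare lookup failure.

      ```python
      # Hypothetical sketch: report which param has no value or default,
      # rather than letting a bare key-lookup error escape.
      def get_or_default(param_map, defaults, param):
          if param in param_map:
              return param_map[param]
          if param in defaults:
              return defaults[param]
          raise KeyError("Failed to find a default value for param %r" % param)
      ```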
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #8567 from holdenk/SPARK-9723-params-getordefault-should-throw-more-useful-error.
    • [SPARK-10422] [SQL] String column in InMemoryColumnarCache needs to override clone method · 03f3e91f
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-10422
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8578 from yhuai/SPARK-10422.
    • [SPARK-10417] [SQL] Iterating through Column results in infinite loop · 6cd98c18
      0x0FFF authored
      The `pyspark.sql.column.Column` object has a `__getitem__` method, which makes it iterable in Python. In fact, `__getitem__` exists to handle the case where the column holds a list or dict, so that you can access a particular element of it in the DataFrame API. The ability to iterate over the column is just a side effect that can confuse people getting familiar with Spark DataFrames (since you might iterate this way over a Pandas DataFrame, for instance).
      
      Issue reproduction:
      ```
      df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
      for i in df["name"]: print i
      ```
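
      A hedged sketch of the guard (the exact message may differ): defining `__iter__` to fail fast stops Python from falling back to `__getitem__`-based iteration.

      ```python
      # Sketch: a class with __getitem__ but no __iter__ is iterable via the
      # legacy sequence protocol; raising in __iter__ disables that fallback.
      class Column(object):
          def __getitem__(self, key):
              return "item access for %r" % key  # stand-in for real field/element access

          def __iter__(self):
              raise TypeError("Column is not iterable")
      ```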
      
      Author: 0x0FFF <programmerag@gmail.com>
      
      Closes #8574 from 0x0FFF/SPARK-10417.
    • [SPARK-10004] [SHUFFLE] Perform auth checks when clients read shuffle data. · 2da3a9e9
      Marcelo Vanzin authored
      To correctly isolate applications, when requests to read shuffle data
      arrive at the shuffle service, proper authorization checks need to
      be performed. This change makes sure that only the application that
      created the shuffle data can read from it.
      
      Such checks are only enabled when "spark.authenticate" is enabled,
      otherwise there's no secure way to make sure that the client is really
      who it says it is.
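
      A hedged configuration sketch (the secret value is hypothetical, and cluster managers may distribute the secret differently):

      ```python
      from pyspark import SparkConf

      # The authorization checks on shuffle reads only apply when
      # authentication is enabled.
      conf = (SparkConf()
              .set("spark.authenticate", "true")
              .set("spark.authenticate.secret", "my-shared-secret")  # hypothetical
              .set("spark.shuffle.service.enabled", "true"))
      ```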
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #8218 from vanzin/SPARK-10004.
    • [SPARK-10389] [SQL] support order by non-attribute grouping expression on Aggregate · fc483077
      Wenchen Fan authored
      For example, we can write `SELECT MAX(value) FROM src GROUP BY key + 1 ORDER BY key + 1` in PostgreSQL, and we should support this in Spark SQL.
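
      For instance, a hedged PySpark equivalent (assuming a table `src(key, value)` is registered):

      ```python
      # After this change, the ORDER BY can reuse the non-attribute grouping
      # expression `key + 1` even though it is not a plain output column.
      sqlContext.sql(
          "SELECT MAX(value) FROM src GROUP BY key + 1 ORDER BY key + 1").show()
      ```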
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #8548 from cloud-fan/support-order-by-non-attribute.
    • [SPARK-10034] [SQL] add regression test for Sort on Aggregate · 56c4c172
      Wenchen Fan authored
      Before #8371, there was a bug for `Sort` on `Aggregate`: we couldn't use aggregate expressions named `_aggOrdering`, and we couldn't use more than one ordering expression containing aggregate functions. The reason for this bug is that the aggregate expression in `SortOrder` never gets resolved; we alias it with `_aggOrdering` and call `toAttribute`, which gives us an `UnresolvedAttribute`. So we were actually referencing the aggregate expression by name, not by exprId as we thought. And if there was already an aggregate expression named `_aggOrdering`, or there was more than one ordering expression containing aggregate functions, we had conflicting names and couldn't search by name.
      
      However, after #8371 was merged, the `SortOrder`s are guaranteed to be resolved and we always reference aggregate expressions by exprId. The bug doesn't exist anymore, and this PR adds regression tests for it.
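
      A hedged sketch of the query shape the regression tests cover (the table `src(key, value)` is hypothetical): multiple ordering expressions that contain aggregate functions.

      ```python
      # Previously, two aggregate-bearing ordering expressions could collide on
      # the internal `_aggOrdering` alias; with exprId-based resolution this is safe.
      sqlContext.sql(
          "SELECT key, MAX(value) FROM src GROUP BY key "
          "ORDER BY MAX(value), MIN(value)").show()
      ```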
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #8231 from cloud-fan/sort-agg.
    • [SPARK-7336] [HISTORYSERVER] Fix bug where application status is shown incorrectly on the JobHistory UI. · c3b881a7
      Chuan Shao authored
      Author: ArcherShao <shaochuan@huawei.com>
      
      Closes #5886 from ArcherShao/SPARK-7336.
  5. Sep 01, 2015
    • [SPARK-10392] [SQL] Pyspark - Wrong DateType support on JDBC connection · 00d9af5e
      0x0FFF authored
      This PR addresses issue [SPARK-10392](https://issues.apache.org/jira/browse/SPARK-10392)
      The problem is that for the "start of epoch" date (01 Jan 1970), the PySpark class DateType returns 0 instead of a `datetime.date`, due to the implementation of its return statement.
      
      Issue reproduction on master:
      ```
      >>> from pyspark.sql.types import *
      >>> a = DateType()
      >>> a.fromInternal(0)
      0
      >>> a.fromInternal(1)
      datetime.date(1970, 1, 2)
      ```
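
      The likely culprit is Python's falsy zero; a self-contained sketch of the pattern (illustrative, not the exact pyspark source):

      ```python
      import datetime

      EPOCH_ORDINAL = datetime.datetime(1970, 1, 1).toordinal()

      def from_internal_buggy(v):
          # `v and ...` short-circuits to v itself when v == 0,
          # so day zero leaks out as the int 0.
          return v and datetime.date.fromordinal(v + EPOCH_ORDINAL)

      def from_internal_fixed(v):
          # Test for None explicitly so that 0 (i.e., 1970-01-01) converts correctly.
          if v is not None:
              return datetime.date.fromordinal(v + EPOCH_ORDINAL)

      print(from_internal_buggy(0))  # 0
      print(from_internal_fixed(0))  # 1970-01-01
      ```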
      
      Author: 0x0FFF <programmerag@gmail.com>
      
      Closes #8556 from 0x0FFF/SPARK-10392.
    • [SPARK-10162] [SQL] Fix the timezone omission in the PySpark DataFrame filter function · bf550a4b
      0x0FFF authored
      This PR addresses [SPARK-10162](https://issues.apache.org/jira/browse/SPARK-10162)
      The issue is with the DataFrame filter() function when a datetime.datetime is passed to it:
      * Timezone information of this datetime is ignored
      * This datetime is assumed to be in local timezone, which depends on the OS timezone setting
      
      The fix includes both a code change and a regression test. Problem reproduction code on master:
      ```python
      import pytz
      from datetime import datetime
      from pyspark.sql import *
      from pyspark.sql.types import *
      sqc = SQLContext(sc)
      df = sqc.createDataFrame([], StructType([StructField("dt", TimestampType())]))
      
      m1 = pytz.timezone('UTC')
      m2 = pytz.timezone('Etc/GMT+3')
      
      df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
      df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
      ```
      It gives the same timestamp regardless of the time zone:
      ```
      >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
      Filter (dt#0 > 946713600000000)
       Scan PhysicalRDD[dt#0]
      
      >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
      Filter (dt#0 > 946713600000000)
       Scan PhysicalRDD[dt#0]
      ```
      After the fix:
      ```
      >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
      Filter (dt#0 > 946684800000000)
       Scan PhysicalRDD[dt#0]
      
      >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
      Filter (dt#0 > 946695600000000)
       Scan PhysicalRDD[dt#0]
      ```
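
      The numbers line up with converting through UTC: 946684800 is 2000-01-01 00:00:00 UTC, and Etc/GMT+3 (UTC-3 under POSIX sign convention) adds 10800 seconds. A hedged sketch of the conversion shape (illustrative, not the exact pyspark code):

      ```python
      import calendar
      import time
      from datetime import datetime

      import pytz

      def to_internal_us(dt):
          # Respect tzinfo when present; fall back to the OS-local
          # interpretation otherwise.
          if dt.tzinfo is not None:
              seconds = calendar.timegm(dt.utctimetuple())
          else:
              seconds = time.mktime(dt.timetuple())
          return int(seconds) * 1000000 + dt.microsecond

      print(to_internal_us(datetime(2000, 1, 1, tzinfo=pytz.timezone('UTC'))))        # 946684800000000
      print(to_internal_us(datetime(2000, 1, 1, tzinfo=pytz.timezone('Etc/GMT+3'))))  # 946695600000000
      ```
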
      PR [8536](https://github.com/apache/spark/pull/8536) was accidentally closed by me when I dropped the repo.
      
      Author: 0x0FFF <programmerag@gmail.com>
      
      Closes #8555 from 0x0FFF/SPARK-10162.
    • [SPARK-4223] [CORE] Support * in acls. · ec012805
      zhuol authored
      SPARK-4223.
      
      Currently we support setting view and modify ACLs, but you have to specify a list of users. It would be nice to support `*`, meaning all users have access (see the config sketch after the test list below).
      
      Manual tests verify that "*" works for any user in:
      a. Spark UI: view and kill stage. Done.
      b. Spark history server. Done.
      c. YARN application killing. Done.
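
      A hedged sketch of the wildcard configuration (property names are the standard Spark ACL settings; values illustrative):

      ```python
      from pyspark import SparkConf

      conf = (SparkConf()
              .set("spark.acls.enable", "true")
              .set("spark.ui.view.acls", "*")   # any user may view
              .set("spark.modify.acls", "*"))   # any user may kill jobs/stages
      ```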
      
      Author: zhuol <zhuol@yahoo-inc.com>
      
      Closes #8398 from zhuoliu/4223.
    • [SPARK-10398] [DOCS] Migrate Spark download page to use new lua mirroring scripts · 3f63bd60
      Sean Owen authored
      Migrate Apache download closer.cgi refs to new closer.lua
      
      This is the bit of the change that affects the project docs; I'm implementing the changes to the Apache site separately.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #8557 from srowen/SPARK-10398.