Commits · e8ec2a7b01cc86329a6fbafc3d371bdfd79fc1d6 · cs525-sp18-g07 / spark

Oct 30, 2015

Revert "[SPARK-11236][CORE] Update Tachyon dependency from 0.7.1 -> 0.8.0." · e8ec2a7b
Yin Huai authored 9 years ago
```
This reverts commit 4f5e60c6.
```
e8ec2a7b

[SPARK-11423] remove MapPartitionsWithPreparationRDD · 45029bfd

Davies Liu authored 9 years ago

Since we do not need to preserve a page before calling compute(), MapPartitionsWithPreparationRDD is not needed anymore.

This PR basically reverts #8543, #8511, #8038, and #8011.

Author: Davies Liu <davies@databricks.com>

Closes #9381 from davies/remove_prepare2.

45029bfd

[SPARK-11340][SPARKR] Support setting driver properties when starting Spark... · bb5a2af0

felixcheung authored 9 years ago

[SPARK-11340][SPARKR] Support setting driver properties when starting Spark from R programmatically or from RStudio

Mapping spark.driver.memory from sparkEnvir to spark-submit commandline arguments.

shivaram suggested that we possibly add other spark.driver.* properties - do we want to add all of those? I thought those could be set in SparkConf?
sun-rui

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9290 from felixcheung/rdrivermem.

bb5a2af0

[SPARK-11342][TESTS] Allow to set hadoop profile when running dev/ru… · 729f983e
Jeff Zhang authored 9 years ago
```
…n_tests

Author: Jeff Zhang <zjffdu@apache.org>

Closes #9295 from zjffdu/SPARK-11342.
```
729f983e
[SPARK-11210][SPARKR] Add window functions into SparkR [step 2]. · 40c77fb2
Sun Rui authored 9 years ago
```
Author: Sun Rui <rui.sun@intel.com>

Closes #9196 from sun-rui/SPARK-11210.
```
40c77fb2

[SPARK-11414][SPARKR] Forgot to update usage of 'spark.sparkr.r.command' in... · fab710a9

Sun Rui authored 9 years ago

[SPARK-11414][SPARKR] Forgot to update usage of 'spark.sparkr.r.command' in RRDD in the PR for SPARK-10971.

Author: Sun Rui <rui.sun@intel.com>

Closes #9368 from sun-rui/SPARK-11414.

fab710a9

[SPARK-10986][MESOS] Set the context class loader in the Mesos executor backend. · 0451b001

Iulian Dragos authored 9 years ago

See [SPARK-10986](https://issues.apache.org/jira/browse/SPARK-10986) for details.

This fixes the `ClassNotFoundException` for Spark classes in the serializer.

I am not sure this is the right way to handle the class loader, but I couldn't find any documentation on how the context class loader is used and who relies on it. It seems at least the serializer uses it to instantiate classes during deserialization.

I am open to suggestions (I tried this fix on a real Mesos cluster and it *does* fix the issue).

tnachen andrewor14

Author: Iulian Dragos <jaguarul@gmail.com>

Closes #9282 from dragos/issue/mesos-classloader.

0451b001

[SPARK-11393] [SQL] CoGroupedIterator should respect the fact that... · 14d08b99

Wenchen Fan authored 9 years ago

[SPARK-11393] [SQL] CoGroupedIterator should respect the fact that GroupedIterator.hasNext is not idempotent

When we cogroup 2 `GroupedIterator`s in `CoGroupedIterator`, if the right side is smaller, we will consume the right data and keep the left data unchanged. Then we call `hasNext`, which will call `left.hasNext`. This will make `GroupedIterator` generate an extra group, as the previous one has not been consumed yet.
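
The hazard described above can be illustrated with a toy iterator in plain Python. This is only a sketch, not the Spark implementation: `SideEffectingGroups` and `IdempotentHasNext` are hypothetical names, and the toy simply advances to a new group on every `has_next()` call to mimic a non-idempotent check. The wrapper caches the answer until `next()` is called, which is one common way to make repeated checks safe.

```python
class SideEffectingGroups:
    """Toy stand-in (hypothetical): every has_next() call advances to a new group."""

    def __init__(self, groups):
        self._groups = list(groups)
        self._pos = -1  # index of the group most recently prepared by has_next()

    def has_next(self):
        # Side effect: each call moves on, even if the previous group was never read.
        if self._pos + 1 < len(self._groups):
            self._pos += 1
            return True
        return False

    def next(self):
        return self._groups[self._pos]


class IdempotentHasNext:
    """Wrapper that caches the has_next() answer until next() is called."""

    def __init__(self, inner):
        self._inner = inner
        self._cached = None  # None means "not asked since the last next()"

    def has_next(self):
        if self._cached is None:
            self._cached = self._inner.has_next()
        return self._cached

    def next(self):
        if not self.has_next():
            raise StopIteration
        self._cached = None
        return self._inner.next()


raw = SideEffectingGroups(["a", "b", "c"])
raw.has_next()
raw.has_next()            # the second check silently moves past group "a"
skipped_to = raw.next()

safe = IdempotentHasNext(SideEffectingGroups(["a", "b", "c"]))
safe.has_next()
safe.has_next()           # repeated checks are harmless here
first = safe.next()
```

In the toy, the unwrapped iterator loses its first group after two consecutive checks, while the wrapped one still returns it.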

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9346 from cloud-fan/cogroup and squashes the following commits:

9be67c8 [Wenchen Fan] SPARK-11393

14d08b99

[SPARK-11103][SQL] Filter applied on Merged Parquet schema with new column fails · 59db9e9c

hyukjinkwon authored 9 years ago

When mergedSchema and predicate filtering are both enabled, the query fails because Parquet does not accept pushed-down filters whose columns do not exist in the schema.
This is related to a Parquet issue (https://issues.apache.org/jira/browse/PARQUET-389).

For now, this PR simply disables predicate pushdown when using a merged schema.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #9327 from HyukjinKwon/SPARK-11103.

59db9e9c

[SPARK-11207] [ML] Add test cases for solver selection of LinearRegres… · 86d65265

Lewuathe authored 9 years ago

…sion as followup. This is the follow up work of SPARK-10668.

* Fix minor style issues.
* Add test case for checking whether solver is selected properly.

Author: Lewuathe <lewuathe@me.com>
Author: lewuathe <lewuathe@me.com>

Closes #9180 from Lewuathe/SPARK-11207.

86d65265

[SPARK-11417] [SQL] no @Override in codegen · eb59b94c

Davies Liu authored 9 years ago

Older versions of Janino (>2.7) do not support @Override, so we should not use it in codegen.

Author: Davies Liu <davies@databricks.com>

Closes #9372 from davies/no_override.

eb59b94c

[SPARK-10342] [SPARK-10309] [SPARK-10474] [SPARK-10929] [SQL] Cooperative memory management · 56419cf1

Davies Liu authored 9 years ago

This PR introduces a mechanism to call spill() on those SQL operators that support spilling (for example, BytesToBytesMap, UnsafeExternalSorter and ShuffleExternalSorter) if there is not enough memory for execution. The preserved first page is not needed anymore, so it is removed.

Other Spillable objects in Spark core (ExternalSorter and AppendOnlyMap) are not included in this PR, but they could benefit from this (by triggering others' spilling).

The PrepareRDD may not be needed anymore and could be removed in a follow-up PR.

The following script failed with OOM before this PR; with this PR it finishes in 150 seconds with a 2G heap (it also works in the 1.5 branch, with similar duration).

```python
sqlContext.setConf("spark.sql.shuffle.partitions", "1")
df = sqlContext.range(1<<25).selectExpr("id", "repeat(id, 2) as s")
df2 = df.select(df.id.alias('id2'), df.s.alias('s2'))
j = df.join(df2, df.id==df2.id2).groupBy(df.id).max("id", "id2")
j.explain()
print(j.count())
```

For thread safety, here is what I've got:

1) Without calling spill(), the operators should only be used by a single thread, so there are no safety problems.

2) spill() could be triggered in two cases: by the operator itself, or by other operators. We can check trigger == this in spill(); in that case it's still in the same thread, so there are no safety problems.

3) If it's triggered by other operators (right now cache will not trigger spill()), we only spill the data to disk when it's in the scanning stage (building is finished), so the in-memory sorter or memory pages are read-only; we only need to synchronize the iterator and change it.

4) During scanning, the iterator will only use one record in one page. We can't free this page, because the downstream is currently using it (via UnsafeRow or other objects). In BytesToBytesMap, we just skip the current page and dump all others to disk. In UnsafeExternalSorter, we keep the page that is used by the current record (having the same baseObject) and free it when loading the next record. In ShuffleExternalSorter, spill() will not be triggered during scanning.

5) To avoid deadlock, we don't call acquireMemory during spill (so we reuse the pointer array in InMemorySorter).
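
The general idea of cooperative spilling can be sketched in a few lines of plain Python. This is a simplified, single-threaded toy, not Spark's memory manager: `ToyMemoryManager`, `ToyConsumer`, and their methods are hypothetical names, and sizes are plain integers. When a request cannot be satisfied, the manager asks other consumers to spill first, then the requester itself, and `spill()` never requests memory.

```python
class ToyConsumer:
    """Hypothetical spillable operator that tracks held and spilled bytes."""

    def __init__(self, name, manager):
        self.name = name
        self.held = 0      # bytes currently held in memory
        self.spilled = 0   # bytes written to "disk" so far
        manager.register(self)

    def spill(self, trigger):
        # Release everything without asking the manager for new memory,
        # mirroring the "no acquire during spill" rule from point 5.
        released = self.held
        self.spilled += released
        self.held = 0
        return released


class ToyMemoryManager:
    """Hypothetical manager that triggers spilling when memory runs short."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.consumers = []

    def register(self, consumer):
        self.consumers.append(consumer)

    def free(self):
        return self.capacity - sum(c.held for c in self.consumers)

    def acquire(self, requester, size):
        # Ask the other consumers to spill first, then the requester itself.
        others = [c for c in self.consumers if c is not requester]
        for consumer in others + [requester]:
            if self.free() >= size:
                break
            consumer.spill(trigger=requester)
        granted = min(size, self.free())
        requester.held += granted
        return granted


manager = ToyMemoryManager(capacity=100)
sorter = ToyConsumer("sorter", manager)
hash_map = ToyConsumer("map", manager)

manager.acquire(sorter, 80)
granted = manager.acquire(hash_map, 50)  # forces the sorter to spill
```

Here the second request only succeeds because the first consumer gives up its memory, which is the behaviour the PR description calls triggering others' spilling.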

Author: Davies Liu <davies@databricks.com>

Closes #9241 from davies/force_spill.

56419cf1

Oct 29, 2015

[SPARK-11409][SPARKR] Enable url link in R doc for Persist · d89be0bf

felixcheung authored 9 years ago

Quick one line doc fix
link is not clickable
![image](https://cloud.githubusercontent.com/assets/8969467/10833041/4e91dd7c-7e4c-11e5-8905-713b986dbbde.png)

shivaram

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9363 from felixcheung/rpersistdoc.

d89be0bf

[SPARK-11301] [SQL] fix case sensitivity for filter on partitioned columns · 96cf87f6
Wenchen Fan authored 9 years ago
```
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9271 from cloud-fan/filter.
```
96cf87f6

[SPARK-11236][CORE] Update Tachyon dependency from 0.7.1 -> 0.8.0. · 4f5e60c6

Calvin Jia authored 9 years ago

Upgrades the tachyon-client version to the latest release.

No new dependencies are added and no spark facing APIs are changed. The removal of the `tachyon-underfs-s3` exclusion will enable users to use S3 out of the box and there are no longer any additional external dependencies added by the module.

Author: Calvin Jia <jia.calvin@gmail.com>

Closes #9204 from calvinjia/spark-11236.

4f5e60c6

[SPARK-10532][EC2] Added --profile option to specify the name of profile · f21ef8db

teramonagi authored 9 years ago

"Profiles" let you specify the set of credentials to use when you initialize a connection to AWS.

You can keep multiple sets of credentials in the same credentials file using different profile names.
For example, the "aws cli tool" provides a --profile option for this.

http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html

Author: teramonagi <teramonagi@gmail.com>

Closes #8696 from teramonagi/SPARK-10532.

f21ef8db

[SPARK-10641][SQL] Add Skewness and Kurtosis Support · a01cbf5d

sethah authored 9 years ago

Implements skewness and kurtosis support based on the following algorithm:
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
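
For reference, the one-pass update for higher-order central moments described on the linked page can be sketched in plain Python. This is an illustration of that algorithm, not the code added by this commit; the function names are hypothetical, and the results are the population skewness and excess kurtosis.

```python
import math


def online_moments(values):
    """Single pass over values, returning (n, mean, M2, M3, M4).

    Mk is the running sum of k-th powers of deviations from the mean.
    """
    n = 0
    mean = m2 = m3 = m4 = 0.0
    for x in values:
        n1 = n
        n += 1
        delta = x - mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        mean += delta_n
        # Update order matters: M4 needs the old M3 and M2, M3 needs the old M2.
        m4 += term1 * delta_n2 * (n * n - 3 * n + 3) + 6 * delta_n2 * m2 - 4 * delta_n * m3
        m3 += term1 * delta_n * (n - 2) - 3 * delta_n * m2
        m2 += term1
    return n, mean, m2, m3, m4


def skewness(values):
    # Undefined (division by zero) when all values are equal.
    n, _, m2, m3, _ = online_moments(values)
    return math.sqrt(n) * m3 / m2 ** 1.5


def kurtosis(values):
    # Excess kurtosis; undefined (division by zero) when all values are equal.
    n, _, m2, _, m4 = online_moments(values)
    return n * m4 / (m2 * m2) - 3.0
```

Because each update touches only the running moments, the same state could be kept per partition and merged afterwards, which is presumably what makes this family of formulas attractive for a distributed aggregate.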

Author: sethah <seth.hendrickson16@gmail.com>

Closes #9003 from sethah/SPARK-10641.

a01cbf5d

[SPARK-11188][SQL] Elide stacktraces in bin/spark-sql for AnalysisExceptions · 8185f038

Dilip Biswal authored 9 years ago

Only print the error message to the console for Analysis Exceptions in sql-shell.

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #9194 from dilipbiswal/spark-11188.

8185f038

[SPARK-11246] [SQL] Table cache for Parquet broken in 1.5 · f7a51dee

xin Wu authored 9 years ago

The root cause is that when spark.sql.hive.convertMetastoreParquet=true (the default), the cached InMemoryRelation of the ParquetRelation cannot be looked up from the cachedData of CacheManager, because the key comparison fails even though it is the same LogicalPlan representing the Subquery that wraps the ParquetRelation.
The solution in this PR is to override the LogicalPlan.sameResult function in the Subquery case class to eliminate the subquery node first before directly comparing the child (ParquetRelation), which will find the key to the cached InMemoryRelation.

Author: xin Wu <xinwu@us.ibm.com>

Closes #9326 from xwu0226/spark-11246-commit.

f7a51dee

[SPARK-11388][BUILD] Fix self closing tags. · 3bb2a8d7

Herman van Hovell authored 9 years ago

Java 8 javadoc does not like self closing tags: ```<p/>```, ```<br/>```, ...

This PR fixes those.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #9339 from hvanhovell/SPARK-11388.

3bb2a8d7

[SPARK-11318] Include hive profile in make-distribution.sh command · f304f9c9
tedyu authored 9 years ago
```
Author: tedyu <yuzhihong@gmail.com>

Closes #9281 from tedyu/master.
```
f304f9c9

[SPARK-11370] [SQL] fix a bug in GroupedIterator and create unit test for it · f79ebf2a

Wenchen Fan authored 9 years ago

Before this PR, the user had to consume the iterator of one group before processing the next group, or we would get into an infinite loop.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9330 from cloud-fan/group.

f79ebf2a

[SPARK-11379][SQL] ExpressionEncoder can't handle top level primitive type correctly · 87f28fc2

Wenchen Fan authored 9 years ago

For an inner primitive type (e.g. inside `Product`), we use `schemaFor` to get its catalyst type: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L403.

However, for a top-level primitive type, we use `dataTypeFor`, which is wrong.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9337 from cloud-fan/encoder.

87f28fc2

Oct 28, 2015

[SPARK-11322] [PYSPARK] Keep full stack trace in captured exception · 3dfa4ea5

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-11322

As reported by JoshRosen in [databricks/spark-redshift/issues/89](https://github.com/databricks/spark-redshift/issues/89#issuecomment-149828308), the exception-masking behavior sometimes makes debugging harder. To deal with this issue, we should keep full stack trace in the captured exception.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9283 from viirya/py-exception-stacktrace.

3dfa4ea5

[SPARK-11351] [SQL] support hive interval literal · 0cb7662d
Wenchen Fan authored 9 years ago
```
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9304 from cloud-fan/interval.
```
0cb7662d

[SPARK-11376][SQL] Removes duplicated `mutableRow` field · e5b89978

Cheng Lian authored 9 years ago

This PR fixes a mistake in the code generated by `GenerateColumnAccessor`. Interestingly, although the code is illegal in Java (the class has two fields with the same name), Janino accepts it happily and the code accidentally works properly.

Author: Cheng Lian <lian@databricks.com>

Closes #9335 from liancheng/spark-11376.fix-generated-code.

e5b89978

[SPARK-11363] [SQL] LeftSemiJoin should be LeftSemi in SparkStrategies · 20dfd467

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-11363

In SparkStrategies some places use LeftSemiJoin. It should be LeftSemi.

cc chenghao-intel liancheng

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9318 from viirya/no-left-semi-join.

20dfd467

[SPARK-11292] [SQL] Python API for text data source · 5aa05219

Reynold Xin authored 9 years ago

Adds DataFrameReader.text and DataFrameWriter.text.

Author: Reynold Xin <rxin@databricks.com>

Closes #9259 from rxin/SPARK-11292.

5aa05219

[SPARK-11377] [SQL] withNewChildren should not convert StructType to Seq · 032748bb

Michael Armbrust authored 9 years ago

This is minor, but I ran into it while writing Datasets, and while it wasn't needed for the final solution, it was super confusing, so we should fix it.

Basically we recurse into `Seq` to see if they have children. This breaks because we don't preserve the original subclass of `Seq` (and `StructType <:< Seq[StructField]`). Since a struct can never contain children, let's just not recurse into it.

Author: Michael Armbrust <michael@databricks.com>

Closes #9334 from marmbrus/structMakeCopy.

032748bb

[SPARK-11367][ML][PYSPARK] Python LinearRegression should support setting solver · f92b7b98

Yanbo Liang authored 9 years ago

[SPARK-10668](https://issues.apache.org/jira/browse/SPARK-10668) has provided the ```WeightedLeastSquares``` solver ("normal") in ```LinearRegression``` with L2 regularization in Scala and R. Python ML ```LinearRegression``` should also support setting the solver ("auto", "normal", "l-bfgs").

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9328 from yanboliang/spark-11367.

f92b7b98

[SPARK-11369][ML][R] SparkR glm should support setting standardize · fba9e954

Yanbo Liang authored 9 years ago

SparkR glm currently supports:
```formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0```
We should also support setting standardize, which is defined in the [design documentation](https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit).

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9331 from yanboliang/spark-11369.

fba9e954

Typo in mllib-evaluation-metrics.md · fd9e345c

Mageswaran.D authored 9 years ago

Recall by threshold snippet was using "precisionByThreshold"

Author: Mageswaran.D <mageswaran1989@gmail.com>

Closes #9333 from Mageswaran1989/Typo_in_mllib-evaluation-metrics.md.

fd9e345c

[SPARK-11313][SQL] implement cogroup on DataSets (support 2 datasets) · 075ce491

Wenchen Fan authored 9 years ago

A simpler version of https://github.com/apache/spark/pull/9279 that only supports 2 datasets.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9324 from cloud-fan/cogroup2.

075ce491

[SPARK-11332] [ML] Refactored to use ml.feature.Instance instead of WeightedLeastSquare.Instance · 5f1cee6f

Nakul Jindal authored 9 years ago

WeightedLeastSquares now uses the common Instance class in ml.feature instead of a private one.

Author: Nakul Jindal <njindal@us.ibm.com>

Closes #9325 from nakul02/SPARK-11332_refactor_WeightedLeastSquares_dot_Instance.

5f1cee6f

[MINOR][ML] fix compile warns · 82c1c577

Xiangrui Meng authored 9 years ago

This fixes some compile time warnings.

Author: Xiangrui Meng <meng@databricks.com>

Closes #9319 from mengxr/mllib-compile-warn-20151027.

82c1c577

[SPARK-11302][MLLIB] 2) Multivariate Gaussian Model with Covariance matrix... · 826e1e30

Sean Owen authored 9 years ago

[SPARK-11302][MLLIB] 2) Multivariate Gaussian Model with Covariance matrix returns incorrect answer in some cases

Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test.

Supersedes https://github.com/apache/spark/pull/9293

Author: Sean Owen <sowen@cloudera.com>

Closes #9309 from srowen/SPARK-11302.2.

826e1e30

Oct 27, 2015

[SPARK-10484] [SQL] Optimize the cartesian join with broadcast join for some cases · d9c60398

Cheng Hao authored 9 years ago

In some cases, we can broadcast the smaller relation in a cartesian join, which improves performance significantly.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #8652 from chenghao-intel/cartesian.

d9c60398

[SPARK-11178] Improving naming around task failures. · b960a890

Kay Ousterhout authored 9 years ago

Commit af3bc59d introduced new
functionality so that if an executor dies for a reason that's not
caused by one of the tasks running on the executor (e.g., due to
pre-emption), Spark doesn't count the failure towards the maximum
number of failures for the task.  That commit introduced some vague
naming that this commit attempts to fix; in particular:

(1) The variable "isNormalExit", which was used to refer to cases where
the executor died for a reason unrelated to the tasks running on the
machine, has been renamed (and reversed) to "exitCausedByApp". The problem
with the existing name is that it's not clear (at least to me!) what it
means for an exit to be "normal"; the new name is intended to make the
purpose of this variable more clear.

(2) The variable "shouldEventuallyFailJob" has been renamed to
"countTowardsTaskFailures". This variable is used to determine whether
a task's failure should be counted towards the maximum number of failures
allowed for a task before the associated Stage is aborted. The problem
with the existing name is that it can be confused with implying that
the task's failure should immediately cause the stage to fail because it
is somehow fatal (this is the case for a fetch failure, for example: if
a task fails because of a fetch failure, there's no point in retrying,
and the whole stage should be failed).
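
The intent behind the two new names can be sketched with a small Python toy. This is not Spark's scheduler code: `ToyFailureTracker` and `on_executor_lost` are hypothetical, and the sketch assumes that a lost executor counts towards a task's failures exactly when the exit was caused by the application.

```python
from collections import defaultdict


class ToyFailureTracker:
    """Hypothetical tracker that counts only failures attributable to the app."""

    def __init__(self, max_task_failures):
        self.max_task_failures = max_task_failures
        self.failures = defaultdict(int)

    def on_executor_lost(self, task_id, exit_caused_by_app):
        # Assumption: exits unrelated to the running tasks (e.g. pre-emption)
        # must not use up the task's failure budget.
        count_towards_task_failures = exit_caused_by_app
        if count_towards_task_failures:
            self.failures[task_id] += 1
        # True means the stage should be aborted.
        return self.failures[task_id] >= self.max_task_failures


tracker = ToyFailureTracker(max_task_failures=2)
preempted = tracker.on_executor_lost("task-0", exit_caused_by_app=False)
first_crash = tracker.on_executor_lost("task-0", exit_caused_by_app=True)
second_crash = tracker.on_executor_lost("task-0", exit_caused_by_app=True)
```

With a budget of two failures, the pre-emption is ignored and only the second application-caused exit signals an abort.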

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #9164 from kayousterhout/SPARK-11178.

b960a890

[SPARK-11212][CORE][STREAMING] Make preferred locations support... · 9fbd75ab

zsxwing authored 9 years ago

[SPARK-11212][CORE][STREAMING] Make preferred locations support ExecutorCacheTaskLocation and update…

… ReceiverTracker and ReceiverSchedulingPolicy to use it

This PR includes the following changes:

1. Add a new preferred location format, `executor_<host>_<executorID>` (e.g., "executor_localhost_2"), to support specifying the executor locations for RDD.
2. Use the new preferred location format in `ReceiverTracker` to optimize the starting time of Receivers when there are multiple executors in a host.

The goal of this PR is to enable the streaming scheduler to place receivers (which run as tasks) in specific executors. Basically, I want to have more control over the placement of the receivers so that they are evenly distributed among the executors. We tried to do this without changing the core scheduling logic, but it does not allow specifying a particular executor as the preferred location, only a host. So if there are two executors on the same host, and I want two receivers to run on them (one on each executor), I cannot specify that. The current code only specifies the host as the preference, which may end up launching both receivers on the same executor. We tried to work around this by restarting a receiver when it does not launch in the desired executor, hoping that next time it will be started in the right one. But that causes lots of restarts and delays in correctly launching the receiver.

So this change allows the streaming scheduler to specify the exact executor as the preferred location. This is not exposed to the user; only the streaming scheduler uses it.
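
As an illustration of the `executor_<host>_<executorID>` format, the following Python sketch builds and parses such strings. It is not the Spark implementation: the helper names are hypothetical, and it assumes the host itself contains no underscore, so the first underscore after the prefix separates host from executor ID.

```python
EXECUTOR_PREFIX = "executor_"


def format_executor_location(host, executor_id):
    """Build an executor-level preferred location string."""
    return f"{EXECUTOR_PREFIX}{host}_{executor_id}"


def parse_preferred_location(location):
    """Return (host, executor_id); executor_id is None for host-level locations."""
    if not location.startswith(EXECUTOR_PREFIX):
        return location, None
    # Assumption: the host has no underscore, so split at the first one.
    host, sep, executor_id = location[len(EXECUTOR_PREFIX):].partition("_")
    if not sep or not host or not executor_id:
        raise ValueError(f"malformed executor location: {location!r}")
    return host, executor_id


host, executor_id = parse_preferred_location("executor_localhost_2")
```

A plain host name passes through unchanged, so host-level and executor-level preferences can share one code path in this sketch.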

Author: zsxwing <zsxwing@gmail.com>

Closes #9181 from zsxwing/executor-location.

9fbd75ab

[SPARK-11324][STREAMING] Flag for closing Write Ahead Logs after a write · 4f030b9e

Burak Yavuz authored 9 years ago

Currently the Write Ahead Log in Spark Streaming flushes data as writes need to be made. S3 does not support flushing of data; data is written only once the stream is actually closed.
In case of failure, the data for the last minute (default rolling interval) will not be properly written. Therefore we need a flag to close the stream after the write, so that we achieve read-after-write consistency.

cc tdas zsxwing

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #9285 from brkyvz/caw-wal.

4f030b9e