  1. Apr 20, 2016
    • Marcelo Vanzin's avatar
      [SPARK-14602][YARN] Use SparkConf to propagate the list of cached files. · f47dbf27
      Marcelo Vanzin authored
      This change avoids using the environment to pass this information, since
      with many jars it's easy to hit limits on certain OSes. Instead, it encodes
      the information into the Spark configuration propagated to the AM.
      
      The first problem that needed to be solved is a chicken & egg issue: the
      config file is distributed using the cache, and it needs to contain information
      about the files that are being distributed. To solve that, the code now treats
      the config archive specially, and uses slightly different code to distribute
      it, so that only its cache path needs to be saved to the config file.
      
      The second problem is that the extra information would show up in the Web UI,
      which made the environment tab even more noisy than it already is when lots
      of jars are listed. This is solved by two changes: the list of cached files
      is now read only once in the AM, and propagated down to the ExecutorRunnable
      code (which actually sends the list to the NMs when starting containers). The
      second change is to unset those config entries after the list is read, so that
      the SparkContext never sees them.
      
      Tested with both client and cluster mode by running "run-example SparkPi". This
      uploads a whole lot of files when run from a build dir (instead of a distribution,
      where the list is cleaned up), and I verified that the configs do not show
      up in the UI.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #12487 from vanzin/SPARK-14602.
      f47dbf27
    • Reynold Xin's avatar
      [SPARK-14769][SQL] Create built-in functionality for variable substitution · 334c293e
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      In order to fully merge the Hive parser and the SQL parser, we'd need to support variable substitution in Spark. The implementation of the substitute algorithm is mostly copied from Hive, but I simplified the overall structure quite a bit and added more comprehensive test coverage.
      
      Note that this pull request does not yet use this functionality anywhere.
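
      For illustration only, here is a minimal sketch of the kind of rewrite being added (this is not the new variable substitution class itself): `${key}` references in a SQL string are replaced with values from a configuration map, and unknown keys are left untouched.

      ```scala
      import scala.util.matching.Regex

      object VariableSubstitutionSketch {
        // Replace ${key} references with values from a config map; unknown keys are left as-is.
        def substitute(sql: String, vars: Map[String, String]): String = {
          val pattern = """\$\{([^}]+)\}""".r
          pattern.replaceAllIn(sql, (m: Regex.Match) =>
            Regex.quoteReplacement(vars.getOrElse(m.group(1), m.matched)))
        }

        def main(args: Array[String]): Unit = {
          val conf = Map("day" -> "2016-04-20")
          println(substitute("SELECT * FROM logs WHERE day = '${day}'", conf))
          // prints: SELECT * FROM logs WHERE day = '2016-04-20'
        }
      }
      ```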
      
      ## How was this patch tested?
      Added VariableSubstitutionSuite for unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12538 from rxin/SPARK-14769.
      334c293e
    • Reynold Xin's avatar
      [SPARK-14770][SQL] Remove unused queries in hive module test resources · b28fe448
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We currently have five folders in queries: clientcompare, clientnegative, clientpositive, negative, and positive. Only clientpositive is used. We can remove the rest.
      
      ## How was this patch tested?
      N/A - removing unused test resources.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12540 from rxin/SPARK-14770.
      b28fe448
    • Subhobrata Dey's avatar
      [SPARK-14749][SQL, TESTS] PlannerSuite failed when it run individually · fd826819
      Subhobrata Dey authored
      ## What changes were proposed in this pull request?
      
      Three test cases, namely
      
      ```
      "count is partially aggregated"
      "count distinct is partially aggregated"
      "mixed aggregates are partially aggregated"
      ```
      
      were failing when running PlannerSuite individually.
      The PR provides a fix for this.
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Subhobrata Dey <sbcd90@gmail.com>
      
      Closes #12532 from sbcd90/plannersuitetestsfix.
      fd826819
    • Sheamus K. Parkes's avatar
      [SPARK-13842] [PYSPARK] pyspark.sql.types.StructType accessor enhancements · e7791c4f
      Sheamus K. Parkes authored
      ## What changes were proposed in this pull request?
      
      Expand the possible ways to interact with the contents of a `pyspark.sql.types.StructType` instance.
        - Iterating a `StructType` will iterate its fields
          - `[field.name for field in my_structtype]`
        - Indexing with a string will return a field by name
          - `my_structtype['my_field_name']`
        - Indexing with an integer will return a field by position
          - `my_structtype[0]`
        - Indexing with a slice will return a new `StructType` with just the chosen fields:
          - `my_structtype[1:3]`
        - The length is the number of fields (should also provide "truthiness" for free)
          - `len(my_structtype) == 2`
      
      ## How was this patch tested?
      
      Extended the unit test coverage in the accompanying `tests.py`.
      
      Author: Sheamus K. Parkes <shea.parkes@milliman.com>
      
      Closes #12251 from skparkes/pyspark-structtype-enhance.
      e7791c4f
    • Shixiong Zhu's avatar
      [SPARK-14678][SQL] Add a file sink log to support versioning and compaction · 7bc94855
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR adds a special log for FileStreamSink for two purposes:
      
      - Versioning. A future Spark version should be able to read the metadata of an old FileStreamSink.
      - Compaction. As reading from many small files is usually pretty slow, we should compact small metadata files into big files.
      
      FileStreamSinkLog uses a new log format instead of the Java serialization format. It writes one log file for each batch. The first line of the log file is the version number, followed by multiple JSON lines, each of which is the JSON representation of a FileLog entry.
      
      FileStreamSinkLog compacts log files into a big file every "spark.sql.sink.file.log.compactLen" batches. When compacting, it reads all history logs and merges them with the new batch. During the compaction, it also drops the entries for files that have been deleted (as marked by FileLog.action). When the reader uses allLogs to list all files, this method only returns the visible files (dropping the deleted ones).
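
      As a rough illustration of the layout described above (a sketch, not the `FileStreamSinkLog` implementation), a per-batch log file can be read by treating the first line as the version and every following line as one JSON record:

      ```scala
      import scala.io.Source

      object SinkLogSketch {
        // Read one per-batch log file: the first line is the version, every following
        // line is one JSON-serialized record.
        def readBatchLog(path: String): (Int, Seq[String]) = {
          val source = Source.fromFile(path)
          try {
            val lines = source.getLines().toList
            require(lines.nonEmpty, s"empty sink log file: $path")
            (lines.head.trim.toInt, lines.tail)
          } finally {
            source.close()
          }
        }
      }
      ```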
      
      ## How was this patch tested?
      
      FileStreamSinkLogSuite
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12435 from zsxwing/sink-log.
      7bc94855
    • Yanbo Liang's avatar
      [MINOR][ML][PYSPARK] Fix omissive params which should use TypeConverter · 296c384a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      #11663 added type conversion functionality for parameters in PySpark. This PR finds the remaining ```Param``` definitions that did not pass a corresponding ```TypeConverter``` argument and fixes them. After this PR, all params in pyspark/ml/ use ```TypeConverter```.
      
      ## How was this patch tested?
      Existing tests.
      
      cc jkbradley sethah
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12529 from yanboliang/typeConverter.
      296c384a
    • Andrew Or's avatar
      [SPARK-14720][SPARK-13643] Move Hive-specific methods into HiveSessionState... · 8fc267ab
      Andrew Or authored
      [SPARK-14720][SPARK-13643] Move Hive-specific methods into HiveSessionState and Create a SparkSession class
      
      ## What changes were proposed in this pull request?
      This PR has two main changes.
      1. Move Hive-specific methods from HiveContext to HiveSessionState, which helps the work of removing HiveContext.
      2. Create a SparkSession Class, which will later be the entry point of Spark SQL users.
      
      ## How was this patch tested?
      Existing tests
      
      This PR is trying to fix test failures of https://github.com/apache/spark/pull/12485.
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12522 from yhuai/spark-session.
      8fc267ab
    • Tathagata Das's avatar
      [SPARK-14741][SQL] Fixed error in reading json file stream inside a partitioned directory · cb8ea9e1
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Consider the following directory structure:
      `dir/col=X/some-files`
      If we create a text-format streaming dataframe on `dir/col=X/`, then it should not treat `col` as a partitioning column. The streaming dataframe itself does not, but the generated batch dataframes do pick up `col` as a partitioning column, causing a mismatch between the streaming source schema and the generated df schema. This leads to a runtime failure:
      ```
      18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution: Query query-0 terminated with error
      java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8
      ```
      The reason is that the partition-inferring code has no idea of a base path above which it should not search for partitions. This PR makes sure that the batch DF is generated with the basePath set to the original path on which the file stream source is defined.
      
      ## How was this patch tested?
      
      New unit test
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #12517 from tdas/SPARK-14741.
      cb8ea9e1
    • Joseph K. Bradley's avatar
      [SPARK-14478][ML][MLLIB][DOC] Doc that StandardScaler uses the corrected sample std · acc7e592
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Currently, MLlib's StandardScaler scales columns using the corrected standard deviation (sqrt of unbiased variance). This matches what R's scale package does.
      
      This PR documents this fact.
      
      ## How was this patch tested?
      
      doc only
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12519 from jkbradley/scaler-variance-doc.
      acc7e592
    • Yanbo Liang's avatar
      [MINOR][ML][PYSPARK] Fix omissive param setters which should use _set method · 08f84d7a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      #11939 made Python param setters use the `_set` method. This PR fixes the ones that were missed.
      
      ## How was this patch tested?
      Existing tests.
      
      cc jkbradley sethah
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12531 from yanboliang/setters-omissive.
      08f84d7a
    • jerryshao's avatar
      [SPARK-14725][CORE] Remove HttpServer class · 90cbc82f
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      This proposal removes the `HttpServer` class. With internal file/jar/class transmission moved to the RPC layer, there is no code using `HttpServer` anymore, so this PR removes it.
      
      ## How was this patch tested?
      
      Unit test is verified locally.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #12526 from jerryshao/SPARK-14725.
      90cbc82f
    • Sean Owen's avatar
      [SPARK-14742][DOCS] Redirect spark-ec2 doc to new location · b4e76a9a
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Restore `ec2-scripts.md` as a redirect to amplab/spark-ec2 docs
      
      ## How was this patch tested?
      
      `jekyll build` and checked with the browser
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #12534 from srowen/SPARK-14742.
      b4e76a9a
    • Burak Yavuz's avatar
      [SPARK-14555] First cut of Python API for Structured Streaming · 80bf48f4
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes:
       - ContinuousQuery
       - Trigger
       - ProcessingTime
      in pyspark under `pyspark.sql.streaming`.
      
      In addition, it contains the new methods added under:
       -  `DataFrameWriter`
           a) `startStream`
           b) `trigger`
           c) `queryName`
      
       -  `DataFrameReader`
           a) `stream`
      
       - `DataFrame`
          a) `isStreaming`
      
      This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
       - `exception`
       - `sourceStatuses`
       - `sinkStatus`
      
      They may be added in a follow up.
      
      This PR also contains some very minor doc fixes on the Scala side.
      
      ## How was this patch tested?
      
      Python doc tests
      
      TODO:
       - [ ] verify Python docs look good
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      Author: Burak Yavuz <burak@databricks.com>
      
      Closes #12320 from brkyvz/stream-python.
      80bf48f4
    • Alex Bozarth's avatar
      [SPARK-8171][WEB UI] Javascript based infinite scrolling for the log page · 83427788
      Alex Bozarth authored
      Updated the log page by replacing the current pagination with a javascript-based infinite scroll solution
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #10910 from ajbozarth/spark8171.
      83427788
    • Yuhao Yang's avatar
      [SPARK-14635][ML] Documentation and Examples for TF-IDF only refer to HashingTF · ed9d8038
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.
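
      For reference, a minimal sketch of the CountVectorizer-based variant (the data, app name, and column names here are made up for illustration, not taken from the guide):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.ml.feature.{CountVectorizer, IDF}
      import org.apache.spark.sql.SQLContext

      object CountVectorizerTfIdfSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("tf-idf-sketch").setMaster("local[*]"))
          val sqlContext = new SQLContext(sc)
          import sqlContext.implicits._

          val docs = Seq(
            (0, Seq("spark", "streaming", "spark")),
            (1, Seq("hashing", "tf", "idf"))
          ).toDF("id", "words")

          // Term frequencies from an explicit vocabulary instead of hashed buckets.
          val cvModel = new CountVectorizer().setInputCol("words").setOutputCol("rawFeatures").fit(docs)
          val tf = cvModel.transform(docs)

          // Rescale the term frequencies by inverse document frequency.
          val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf)
          idfModel.transform(tf).select("id", "features").show(false)

          sc.stop()
        }
      }
      ```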
      
      ## How was this patch tested?
      
      unit tests and doc generation
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #12454 from hhbyyh/tfdoc.
      ed9d8038
    • Liwei Lin's avatar
      [SPARK-14687][CORE][SQL][MLLIB] Call path.getFileSystem(conf) instead of call FileSystem.get(conf) · 17db4bfe
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      - replaced `FileSystem.get(conf)` calls with `path.getFileSystem(conf)`
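
      For context, a small sketch of the difference (the paths below are illustrative): `FileSystem.get(conf)` always resolves to the default filesystem in the configuration, while `path.getFileSystem(conf)` resolves to the filesystem that owns that particular path's scheme, which matters when a path lives on a non-default filesystem.

      ```scala
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FileSystem, Path}

      object FsResolutionSketch {
        def main(args: Array[String]): Unit = {
          val conf = new Configuration()
          // Illustrative path; it could live on S3, WASB, or a different HDFS than fs.defaultFS.
          val path = new Path("file:///tmp/data")

          // Always resolves to fs.defaultFS, even when `path` lives on another filesystem.
          val defaultFs: FileSystem = FileSystem.get(conf)

          // Resolves to the filesystem that owns this particular path's scheme,
          // so operations on `path` go to the right place.
          val pathFs: FileSystem = path.getFileSystem(conf)

          println(s"default FS: ${defaultFs.getUri}, path FS: ${pathFs.getUri}")
        }
      }
      ```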
      
      ## How was this patch tested?
      
      N/A
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #12450 from lw-lin/fix-fs-get.
      17db4bfe
    • Ryan Blue's avatar
      [SPARK-14679][UI] Fix UI DAG visualization OOM. · a3451119
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      The DAG visualization can cause an OOM when generating the DOT file.
      This happens because clusters are not correctly deduplicated by a contains
      check, since they rely on the default equals implementation. This adds a
      working equals implementation.
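
      Roughly, the fix amounts to giving the cluster type value-based equality. The class below is a hypothetical stand-in used only to illustrate the pattern, not the actual UI class:

      ```scala
      // Hypothetical stand-in for the UI's cluster type, keyed on stable fields.
      class OperationCluster(val id: String, val name: String) {
        // With the default reference equality, `clusters.contains(c)` never dedupes logically
        // identical clusters, so DOT generation can blow up; value-based equality fixes that.
        override def equals(other: Any): Boolean = other match {
          case that: OperationCluster => this.id == that.id && this.name == that.name
          case _ => false
        }
        override def hashCode(): Int = 31 * id.hashCode + name.hashCode
      }
      ```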
      
      ## How was this patch tested?
      
      This adds a test suite that checks the new equals implementation.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #12437 from rdblue/SPARK-14679-fix-ui-oom.
      a3451119
    • Wenchen Fan's avatar
      [SPARK-9013][SQL] generate MutableProjection directly instead of return a function · 7abe9a65
      Wenchen Fan authored
      `MutableProjection` is not thread-safe and we won't use it in multiple threads. I think the reason that we return `() => MutableProjection` is not about thread safety, but to save the cost of generating code when we need several individual instances of the same mutable projection.
      
      However, I only found one place that uses this [feature](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala#L122-L123), and compared to the trouble it brings, I think we should generate `MutableProjection` directly instead of returning a function.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #7373 from cloud-fan/project.
      7abe9a65
    • Dongjoon Hyun's avatar
      [SPARK-14639] [PYTHON] [R] Add `bround` function in Python/R. · 14869ae6
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This issue aims to expose the Scala `bround` function in the Python/R APIs.
      The `bround` function was implemented in SPARK-14614 by extending the current `round` function.
      We used the following semantics from Hive.
      ```java
      public static double bround(double input, int scale) {
          if (Double.isNaN(input) || Double.isInfinite(input)) {
            return input;
          }
          return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue();
      }
      ```
      
      After this PR, `pyspark` and `sparkR` also support `bround` function.
      
      **PySpark**
      ```python
      >>> from pyspark.sql.functions import bround
      >>> sqlContext.createDataFrame([(2.5,)], ['a']).select(bround('a', 0).alias('r')).collect()
      [Row(r=2.0)]
      ```
      
      **SparkR**
      ```r
      > df = createDataFrame(sqlContext, data.frame(x = c(2.5, 3.5)))
      > head(collect(select(df, bround(df$x, 0))))
        bround(x, 0)
      1            2
      2            4
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including new testcases).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12509 from dongjoon-hyun/SPARK-14639.
      14869ae6
  2. Apr 19, 2016
    • Dongjoon Hyun's avatar
      [MINOR] [SQL] Re-enable `explode()` and `json_tuple()` testcases in ExpressionToSQLSuite · 6f1ec1f2
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Since [SPARK-12719: SQL Generation supports for generators](https://issues.apache.org/jira/browse/SPARK-12719) was resolved, this PR enables the related testcases: `explode()` and `json_tuple()`.
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (with re-enabled test cases).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12329 from dongjoon-hyun/minor_enable_testcases.
      6f1ec1f2
    • Wenchen Fan's avatar
      [SPARK-14600] [SQL] Push predicates through Expand · 856bc465
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-14600
      
      This PR makes `Expand.output` have different attributes from the grouping attributes produced by the underlying `Project`, as they have different meanings, so that we can safely push filters down through `Expand`.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12496 from cloud-fan/expand.
      856bc465
    • Wenchen Fan's avatar
      [SPARK-14704][CORE] create accumulators in TaskMetrics · 85d759ca
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Before this PR, we create accumulators at the driver side (and register them) and send them to the executor side, then we create `TaskMetrics` with these accumulators at the executor side.
      After this PR, we create `TaskMetrics` at the driver side and send it to the executor side, so that we can create accumulators inside `TaskMetrics` directly, which is cleaner.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12472 from cloud-fan/acc.
      85d759ca
    • Luciano Resende's avatar
      [SPARK-13419] [SQL] Update SubquerySuite to use checkAnswer for validation · 78b38109
      Luciano Resende authored
      ## What changes were proposed in this pull request?
      
      Change SubquerySuite to validate test results using the checkAnswer helper method.
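
      For reference, a sketch of the checkAnswer style (the suite name and query below are illustrative, not the actual SubquerySuite contents, and the spark-sql test artifacts are assumed to be on the classpath): expected rows are asserted directly against the returned DataFrame.

      ```scala
      import org.apache.spark.sql.{QueryTest, Row}
      import org.apache.spark.sql.test.SharedSQLContext

      // Illustrative only; not part of the actual SubquerySuite changes.
      class CheckAnswerStyleSuite extends QueryTest with SharedSQLContext {
        test("scalar subquery returns the expected value") {
          checkAnswer(
            sqlContext.sql("SELECT (SELECT 1) AS col"),
            Row(1) :: Nil)
        }
      }
      ```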
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Luciano Resende <lresende@apache.org>
      
      Closes #12269 from lresende/SPARK-13419.
      78b38109
    • Sun Rui's avatar
      [SPARK-13905][SPARKR] Change signature of as.data.frame() to be consistent with the R base package. · 8eedf0b5
      Sun Rui authored
      ## What changes were proposed in this pull request?
      
      Change the signature of as.data.frame() to be consistent with that in the R base package to meet R user's convention.
      
      ## How was this patch tested?
      dev/lint-r
      SparkR unit tests
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #11811 from sun-rui/SPARK-13905.
      8eedf0b5
    • Lianhui Wang's avatar
      [SPARK-14705][YARN] support Multiple FileSystem for YARN STAGING DIR · 4514aebd
      Lianhui Wang authored
      ## What changes were proposed in this pull request?
      SPARK-13063 made the Spark YARN staging dir configurable, but it only supports the default FileSystem. When there are many clusters, different clusters may need different FileSystems for the staging dir.
      
      ## How was this patch tested?
      I have tested it successfully with the following commands:
      MASTER=yarn-client ./bin/spark-shell --conf spark.yarn.stagingDir=hdfs:namenode2/temp
      $SPARK_HOME/bin/spark-submit --conf spark.yarn.stagingDir=hdfs:namenode2/temp
      
      cc tgravescs vanzin andrewor14
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #12473 from lianhuiwang/SPARK-14705.
      4514aebd
    • Joan's avatar
      [SPARK-13929] Use Scala reflection for UDTs · 3ae25f24
      Joan authored
      ## What changes were proposed in this pull request?
      
      Enable ScalaReflection and User Defined Types for plain Scala classes.
      
      This involves the move of `schemaFor` from the `ScalaReflection` trait (which is runtime and compile-time (macros) reflection) to the `ScalaReflection` object (runtime reflection only), as I believe this code wouldn't work at compile time anyway, since it manipulates `Class`es that are not compiled yet.
      
      ## How was this patch tested?
      
      Unit test
      
      Author: Joan <joan@goyeau.com>
      
      Closes #12149 from joan38/SPARK-13929-Scala-reflection.
      3ae25f24
    • Cheng Lian's avatar
      [SPARK-14407][SQL] Hides HadoopFsRelation related data source API into... · 10f273d8
      Cheng Lian authored
      [SPARK-14407][SQL] Hides HadoopFsRelation related data source API into execution/datasources package #12178
      
      ## What changes were proposed in this pull request?
      
      This PR moves `HadoopFsRelation` related data source API into `execution/datasources` package.
      
      Note that to avoid conflicts, this PR is based on #12153. Effective changes for this PR only consist of the last three commits. Will rebase after merging #12153.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Yin Huai <yhuai@databricks.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #12361 from liancheng/spark-14407-hide-hadoop-fs-relation.
      10f273d8
    • felixcheung's avatar
      [SPARK-14717] [PYTHON] Scala, Python APIs for Dataset.unpersist differ in default blocking value · 36641423
      felixcheung authored
      ## What changes were proposed in this pull request?
      
      Change unpersist blocking parameter default value to match Scala
      
      ## How was this patch tested?
      
      unit tests, manual tests
      
      jkbradley davies
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #12507 from felixcheung/pyunpersist.
      36641423
    • felixcheung's avatar
      [SPARK-12224][SPARKR] R support for JDBC source · ecd877e8
      felixcheung authored
      Add R API for `read.jdbc`, `write.jdbc`.
      
      Tested this quite a bit manually with different combinations of parameters. It's not clear if we could have automated tests in R for this - Scala `JDBCSuite` depends on Java H2 in-memory database.
      
      Refactored some code into util so they could be tested.
      
      Core's R SerDe code needs to be updated to allow access to java.util.Properties as a `jobj` handle, which is required by DataFrameReader/Writer's `jdbc` method. It would be possible, though it would require more code, to add a `sql/r/SQLUtils` helper function instead.
      
      Tested:
      ```
      # with postgresql
      ../bin/sparkR --driver-class-path /usr/share/java/postgresql-9.4.1207.jre7.jar
      
      # read.jdbc
      df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
      df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = 12345)
      
      # partitionColumn and numPartitions test
      df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, numPartitions = 4, user = "user", password = 12345)
      a <- SparkR:::toRDD(df)
      SparkR:::getNumPartitions(a)
      [1] 4
      SparkR:::collectPartition(a, 2L)
      
      # defaultParallelism test
      df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, user = "user", password = 12345)
      SparkR:::getNumPartitions(a)
      [1] 2
      
      # predicates test
      df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", predicates = list("did<=105"), user = "user", password = 12345)
      count(df) == 1
      
      # write.jdbc, default save mode "error"
      irisDf <- as.DataFrame(sqlContext, iris)
      write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
      "error, already exists"
      
      write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "iris", user = "user", password = "12345")
      ```
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #10480 from felixcheung/rreadjdbc.
      ecd877e8
    • Eric Liang's avatar
      [SPARK-14733] Allow custom timing control in microbenchmarks · 008a8bbe
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      The current benchmark framework runs a code block for several iterations and reports statistics. However, there is no way to exclude per-iteration setup time from the overall results. This PR adds a timer control object, passed into the closure, that can be used for this purpose.
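
      The general shape of the idea, as a standalone sketch (the names and API below are hypothetical, not Spark's Benchmark class): the measured closure receives a timer it can start and stop around the interesting work, so per-iteration setup stays out of the reported time.

      ```scala
      import scala.util.Random

      object TimerControlSketch {
        // Hypothetical timer handed to the benchmarked closure.
        class IterationTimer {
          private var total = 0L
          private var started = 0L
          def startTiming(): Unit = { started = System.nanoTime() }
          def stopTiming(): Unit = { total += System.nanoTime() - started }
          def totalMillis: Long = total / 1000000
        }

        // Run `body` for the given number of iterations; only explicitly timed sections count.
        def runCase(iterations: Int)(body: IterationTimer => Unit): Long = {
          val timer = new IterationTimer
          (1 to iterations).foreach(_ => body(timer))
          timer.totalMillis
        }

        def main(args: Array[String]): Unit = {
          val millis = runCase(10) { timer =>
            val input = Array.fill(1000000)(Random.nextInt())  // per-iteration setup, excluded
            timer.startTiming()
            java.util.Arrays.sort(input)                       // the work actually being measured
            timer.stopTiming()
          }
          println(s"sorting took $millis ms across 10 iterations")
        }
      }
      ```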
      
      ## How was this patch tested?
      
      Existing benchmark code. Also see https://github.com/apache/spark/pull/12490
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #12502 from ericl/spark-14733.
      008a8bbe
    • Herman van Hovell's avatar
      [SPARK-4226] [SQL] Support IN/EXISTS Subqueries · da885922
      Herman van Hovell authored
      ### What changes were proposed in this pull request?
      This PR adds support for IN/EXISTS predicate subqueries to Spark. Predicate sub-queries are used as a filtering condition in a query (this is the only supported use case). A predicate sub-query comes in two forms, as in the sketch after this list:
      
      - `[NOT] EXISTS(subquery)`
      - `[NOT] IN (subquery)`
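
      As a usage sketch (the tables and data below are made up for illustration):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext

      object PredicateSubquerySketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("subquery-sketch").setMaster("local[*]"))
          val sqlContext = new SQLContext(sc)
          import sqlContext.implicits._

          Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v").registerTempTable("t1")
          Seq(1, 3).map(Tuple1(_)).toDF("id").registerTempTable("t2")

          // EXISTS form: keep t1 rows that have a matching id in t2.
          sqlContext.sql(
            "SELECT * FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.id = t1.id)").show()

          // NOT IN form: keep t1 rows whose id does not appear in t2.
          sqlContext.sql(
            "SELECT * FROM t1 WHERE id NOT IN (SELECT id FROM t2)").show()

          sc.stop()
        }
      }
      ```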
      
      This PR is (loosely) based on the work of davies (https://github.com/apache/spark/pull/10706) and chenghao-intel (https://github.com/apache/spark/pull/9055). They should be credited for the work they did.
      
      ### How was this patch tested?
      Modified parsing unit tests.
      Added tests to `org.apache.spark.sql.SQLQuerySuite`
      
      cc rxin, davies & chenghao-intel
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #12306 from hvanhovell/SPARK-4226.
      da885922
    • Nezih Yigitbasi's avatar
      [SPARK-14042][CORE] Add custom coalescer support · 3c91afec
      Nezih Yigitbasi authored
      ## What changes were proposed in this pull request?
      
      This PR adds support for specifying an optional custom coalescer to the `coalesce()` method. Currently I have only added this feature to the `RDD` interface, and once we sort out the details we can proceed with adding this feature to the other APIs (`Dataset` etc.)
      
      ## How was this patch tested?
      
      Added a unit test for this functionality.
      
      /cc rxin (per our discussion on the mailing list)
      
      Author: Nezih Yigitbasi <nyigitbasi@netflix.com>
      
      Closes #11865 from nezihyigitbasi/custom_coalesce_policy.
      3c91afec
    • Kazuaki Ishizaki's avatar
      [SPARK-14656][CORE] Fix Benchmark.getPorcessorName() always return "Unknown processor" on Linux · 0b8369d8
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      This PR makes ```Benchmark.getPorcessorName()``` return the correct processor name from ```/proc/cpuinfo``` on Linux; currently it returns ```Unknown processor```.
      Since ```Utils.executeAndGetOutput(Seq("which", "grep"))``` returns ```/bin/grep\n```, executing ```/bin/grep\n``` fails. This PR strips the trailing ```\n``` from the result of ```Utils.executeAndGetOutput()```.
      
      Before applying this PR
      ```
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Linux 2.6.32-504.el6.x86_64
      Unknown processor
      back-to-back filter:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      -------------------------------------------------------------------------------------------
      Dataset                                   472 /  503         21.2          47.2       1.0X
      DataFrame                                  51 /   58        198.0           5.1       9.3X
      RDD                                       189 /  211         52.8          18.9       2.5X
      ```
      
      After applying this PR
      ```
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Linux 2.6.32-504.el6.x86_64
      Intel(R) Xeon(R) CPU E5-2667 v2  3.30GHz
      back-to-back filter:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      -------------------------------------------------------------------------------------------
      Dataset                                   490 /  502         20.4          49.0       1.0X
      DataFrame                                  55 /   61        183.4           5.5       9.0X
      RDD                                       210 /  237         47.7          21.0       2.3X
      ```
      
      ## How was this patch tested?
      Run Benchmark programs on Linux by hand
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #12411 from kiszk/SPARK-14656.
      0b8369d8
    • Wenchen Fan's avatar
      [SPARK-14675][SQL] ClassFormatError when use Seq as Aggregator buffer type · 5cb2e336
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      After https://github.com/apache/spark/pull/12067, we now use expressions to do the aggregation in `TypedAggregateExpression`. To implement buffer merge, we produce a new buffer deserializer expression by replacing `AttributeReference` with right-side buffer attribute, like other `DeclarativeAggregate`s do, and finally combine the left and right buffer deserializer with `Invoke`.
      
      However, after https://github.com/apache/spark/pull/12338, we add a loop variable to the class members when codegen-ing `MapObjects`. If the `Aggregator` buffer type is `Seq`, which is implemented by the `MapObjects` expression, we add the same loop variable to the class members twice (via the left and right buffer deserializers), which causes the `ClassFormatError`.
      
      This PR fixes this issue by calling `distinct` before declaring the class members.
      
      ## How was this patch tested?
      
      new regression test in `DatasetAggregatorSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12468 from cloud-fan/bug.
      5cb2e336
    • Josh Rosen's avatar
      [SPARK-14676] Wrap and re-throw Await.result exceptions in order to capture full stacktrace · 947b9020
      Josh Rosen authored
      When `Await.result` throws an exception which originated from a different thread, the resulting stacktrace doesn't include the path leading to the `Await.result` call itself, making it difficult to identify the impact of these exceptions. For example, I've seen cases where broadcast cleaning errors propagate to the main thread and crash it but the resulting stacktrace doesn't include any of the main thread's code, making it difficult to pinpoint which exception crashed that thread.
      
      This patch addresses this issue by explicitly catching, wrapping, and re-throwing exceptions that are thrown by `Await.result`.
      
      I tested this manually using https://github.com/JoshRosen/spark/commit/16b31c825197ee31a50214c6ba3c1df08148f403, a patch which reproduces an issue where an RPC exception which occurs while unpersisting RDDs manages to crash the main thread without any useful stacktrace, and verified that informative, full stacktraces were generated after applying the fix in this PR.
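
      The pattern boils down to something like the following sketch (the helper name and the wrapper exception type are illustrative, not the exact code added): catching at the call site makes the rethrown exception carry the calling thread's stack trace, with the original failure attached as the cause.

      ```scala
      import java.util.concurrent.TimeoutException

      import scala.concurrent.{Await, Awaitable}
      import scala.concurrent.duration.Duration
      import scala.util.control.NonFatal

      object AwaitSketch {
        // Illustrative wrapper around Await.result.
        def awaitResult[T](awaitable: Awaitable[T], atMost: Duration): T = {
          try {
            Await.result(awaitable, atMost)
          } catch {
            // Timeouts and interrupts are propagated unchanged.
            case e @ (_: TimeoutException | _: InterruptedException) => throw e
            // Anything else is wrapped so the stack trace includes this calling thread.
            case NonFatal(t) =>
              throw new RuntimeException("Exception thrown in awaitResult", t)
          }
        }
      }
      ```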
      
      /cc rxin nongli yhuai anabranch
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12433 from JoshRosen/wrap-and-rethrow-await-exceptions.
      947b9020
    • gatorsmile's avatar
      [SPARK-12457] Fixed the Wrong Description and Missing Example in Collection Functions · d9620e76
      gatorsmile authored
      #### What changes were proposed in this pull request?
      https://github.com/apache/spark/pull/12185 contains the original PR I submitted in https://github.com/apache/spark/pull/10418
      
      However, it missed one of the extended examples, a wrong description, and a few typos for collection functions. This PR fixes all of these issues.
      
      #### How was this patch tested?
      The existing test cases already cover it.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #12492 from gatorsmile/expressionUpdate.
      d9620e76
    • tedyu's avatar
      [SPARK-13904] Add exit code parameter to exitExecutor() · e8963360
      tedyu authored
      ## What changes were proposed in this pull request?
      
      This PR adds exit code parameter to exitExecutor() so that caller can specify different exit code.
      
      ## How was this patch tested?
      
      Existing test
      
      rxin hbhanawat
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #12457 from tedyu/master.
      e8963360
    • Wenchen Fan's avatar
      [SPARK-14491] [SQL] refactor object operator framework to make it easy to eliminate serializations · 9ee95b6e
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR tries to separate the serialization and deserialization logic from object operators, so that it's easier to eliminate unnecessary serializations in optimizer.
      
      Typed-aggregate-related operators are special: they deserialize the input row into multiple objects, and it's difficult to abstract that with a simple deserializer operator, so we still mix the deserialization logic there.
      
      ## How was this patch tested?
      
      existing tests and new test in `EliminateSerializationSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12260 from cloud-fan/encoder.
      9ee95b6e