- Dec 15, 2016
-
Patrick Wendell authored
-
Patrick Wendell authored
-
Shivaram Venkataraman authored
## What changes were proposed in this pull request?

For release builds the R_PACKAGE_VERSION and VERSION are the same (e.g., 2.1.0). Thus `cp` throws an error which causes the build to fail.

## How was this patch tested?

Manually by executing the following script:

```
set -o pipefail
set -e
set -x
touch a
R_PACKAGE_VERSION=2.1.0
VERSION=2.1.0
if [ "$R_PACKAGE_VERSION" != "$VERSION" ]; then
  cp a a
fi
```

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #16299 from shivaram/sparkr-cp-fix.

(cherry picked from commit 9634018c)
Signed-off-by: Reynold Xin <rxin@databricks.com>
-
Burak Yavuz authored
## What changes were proposed in this pull request?

Use `recentProgress` instead of `lastProgress` and filter out the last non-zero value. Also add `eventually` to the latest assertQuery, similar to the first `assertQuery`.

## How was this patch tested?

Ran the test 1000 times.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #16287 from brkyvz/SPARK-18868.

(cherry picked from commit 9c7f83b0)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
-
Patrick Wendell authored
-
Patrick Wendell authored
-
Burak Yavuz authored
## What changes were proposed in this pull request?

`_to_seq` wasn't imported.

## How was this patch tested?

Added `partitionBy` to the existing write-path unit test.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #16297 from brkyvz/SPARK-18888.
-
Shixiong Zhu authored
## What changes were proposed in this pull request?

When starting a stream with a lot of backfill and `maxFilesPerTrigger`, users often want to start with the most recent files first. This keeps latency low for recent data while historical data is slowly backfilled. This PR adds a new option, `latestFirst`, to control this behavior. When it is true, `FileStreamSource` sorts the files by modification time from latest to oldest and takes the first `maxFilesPerTrigger` files as a new batch.

## How was this patch tested?

The added test.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16251 from zsxwing/newest-first.

(cherry picked from commit 68a6dc97)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
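A minimal sketch (not from the PR) of how the new `latestFirst` option might be combined with `maxFilesPerTrigger`; the input path is hypothetical and a spark-shell style session where `spark` is available is assumed.

```scala
// Hypothetical path. Newest files are picked up first, older files backfill
// gradually, and each micro-batch is capped at 100 files.
val stream = spark.readStream
  .format("text")
  .option("latestFirst", "true")        // process the most recent files first
  .option("maxFilesPerTrigger", "100")  // cap files taken per micro-batch
  .load("/data/incoming")
```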
-
Tathagata Das authored
## What changes were proposed in this pull request?

Check whether Aggregation operators on a streaming subplan have aggregate expressions with isDistinct = true, since distinct aggregations are not supported on streaming Datasets and such plans should be rejected by the unsupported-operation check.

## How was this patch tested?

Added unit test.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16289 from tdas/SPARK-18870.

(cherry picked from commit 4f7292c8)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
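A sketch (not from the PR) of the kind of query this check is aimed at; the socket source is just a stand-in streaming input and `spark` from spark-shell is assumed.

```scala
import org.apache.spark.sql.functions.{countDistinct, length}
import spark.implicits._

// Stand-in streaming input: lines read from a local socket.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

// countDistinct produces an aggregate expression with isDistinct = true; when
// a query over this plan is started, the check added here should report it as
// unsupported for streaming instead of letting it fail later.
val distinctCounts = lines
  .groupBy(length($"value"))
  .agg(countDistinct($"value"))
```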
-
- Dec 14, 2016
-
Felix Cheung authored
## What changes were proposed in this pull request?

Doc cleanup.

## How was this patch tested?

~~vignettes is not building for me. I'm going to kick off a full clean build and try again and attach output here for review.~~ Output html here: https://felixcheung.github.io/sparkr-vignettes.html

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16286 from felixcheung/rvignettespass.

(cherry picked from commit 7d858bc5)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
-
Dongjoon Hyun authored
## What changes were proposed in this pull request?

Since Apache Spark 1.4.0, the R API documentation page has had a broken link to the `DESCRIPTION file` because the Jekyll plugin script doesn't copy the file. This PR aims to fix that.

- Official latest website: http://spark.apache.org/docs/latest/api/R/index.html
- Apache Spark 2.1.0-rc2: http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html

## How was this patch tested?

Manual.

```bash
cd docs
SKIP_SCALADOC=1 jekyll build
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16292 from dongjoon-hyun/SPARK-18875.

(cherry picked from commit ec0eae48)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
-
Reynold Xin authored
## What changes were proposed in this pull request?

After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] rather than a more specific type. For interactive debugging it is easier to have a function that returns the BaseType.

## How was this patch tested?

N/A - this is a developer-only feature used for interactive debugging. As long as it compiles, it should be good to go. I tested this in spark-shell.

Author: Reynold Xin <rxin@databricks.com>

Closes #16288 from rxin/SPARK-18869.

(cherry picked from commit 5d510c69)
Signed-off-by: Reynold Xin <rxin@databricks.com>
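The summary does not name the new helper, so the sketch below (an editor's illustration, not the patch) only uses the pieces it mentions: numbering a plan and fetching a node by number, with an explicit cast standing in for what the typed helper avoids. A spark-shell session is assumed.

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Build a small analyzed plan to poke at interactively.
val plan = spark.range(100).filter("id > 5").queryExecution.analyzed

// Print the plan with node numbers, then jump to a numbered node.
// apply() returns TreeNode[_]; the helper added by this change returns the
// plan's own type, making the cast below unnecessary.
println(plan.numberedTreeString)
val node = plan(1).asInstanceOf[LogicalPlan]
```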
-
Wenchen Fan authored
## What changes were proposed in this pull request?

In `DataSource`, if the table is not analyzed, we use 0 as the default value for the table size. This is dangerous: we may broadcast a large table and cause OOM. We should use `defaultSizeInBytes` instead.

## How was this patch tested?

New regression test.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16280 from cloud-fan/bug.

(cherry picked from commit d6f11a12)
Signed-off-by: Reynold Xin <rxin@databricks.com>
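Not part of the patch, but a hedged sketch of how to avoid relying on default size estimates at all: collect real statistics, or turn off size-based auto-broadcast. The table name is hypothetical and a spark-shell `spark` session is assumed.

```scala
// Hypothetical table name. With real statistics the planner no longer needs
// a default size when deciding whether to broadcast a join side.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")

// Alternatively, disable size-based auto-broadcast entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```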
-
wm624@hotmail.com authored
## What changes were proposed in this pull request?

While doing the QA work, I found the following issues:

1. `spark.mlp` doesn't include an example;
2. `spark.mlp` and `spark.lda` have redundant parameter explanations;
3. the `spark.lda` documentation misses default values for some parameters.

I also changed the `spark.logit` regParam in the examples, as we discussed in #16222.

## How was this patch tested?

Manual test.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16284 from wangmiao1981/ks.

(cherry picked from commit 32438853)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
-
Reynold Xin authored
## What changes were proposed in this pull request?

This is a bug introduced by subquery handling. numberedTreeString (which uses generateTreeString under the hood) numbers trees including innerChildren (used to print subqueries), but apply (which uses getNodeNumbered) ignores innerChildren. As a result, apply(i) would return the wrong plan node if there are subqueries. This patch fixes the bug.

## How was this patch tested?

Added a test case in SubquerySuite.scala to test both the depth-first traversal of numbering as well as making sure the two methods are consistent.

Author: Reynold Xin <rxin@databricks.com>

Closes #16277 from rxin/SPARK-18854.

(cherry picked from commit ffdd1fcd)
Signed-off-by: Reynold Xin <rxin@databricks.com>
-
Joseph K. Bradley authored
## What changes were proposed in this pull request?

Added a short section for KSTest. Also added the logreg model to the list of ML models in the vignette. (This will be reorganized under SPARK-18849.)

## How was this patch tested?

Manually tested the example locally. Built vignettes locally.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #16283 from jkbradley/ksTest-vignette.

(cherry picked from commit 78627425)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
-
Shixiong Zhu authored
## What changes were proposed in this pull request?

Right now `StreamingQuery.lastProgress` throws NoSuchElementException when no progress has been reported yet, which is hard to work with from Python since the user just sees a Py4JError. This PR makes it return null instead.

## How was this patch tested?

`test("lastProgress should be null when recentProgress is empty")`

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16273 from zsxwing/SPARK-18852.

(cherry picked from commit 1ac6567b)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
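A short sketch (not from the PR) of how callers can use the new behavior: a plain null check is enough before the first progress update. It assumes a streaming DataFrame `lines` (for example, the socket-source sketch earlier in this log); the query name is hypothetical.

```scala
// Start a simple query against the in-memory sink.
val query = lines.writeStream
  .format("memory")
  .queryName("progress_demo")   // hypothetical query name
  .start()

// lastProgress is now null (not an exception) until progress is available,
// so Option(...) handles the "no progress yet" case cleanly.
Option(query.lastProgress).foreach(p => println(p.json))
```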
-
Reynold Xin authored
## What changes were proposed in this pull request?

This patch reduces the default number-of-elements estimate for arrays and maps from 100 to 1. The issue with 100 is that when collections are nested (e.g., an array of maps), 100 * 100 is used as the default size. That sounds like a mere overestimation, which doesn't seem that bad (it is usually better to overestimate than underestimate). However, because of the way we derive the output size for Project (new estimated column size / old estimated column size), this overestimation can turn into an underestimation. In this case it is generally safer to assume 1 default element.

## How was this patch tested?

This should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #16274 from rxin/SPARK-18853.

(cherry picked from commit 5d799473)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
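A rough worked example of the flip described above, with made-up numbers (an editor's illustration, not Spark code or the patch itself).

```scala
// Old default: 100 elements per collection, so an array<map<...>> column is
// sized as roughly 100 * 100 units, dwarfing the other columns in the row.
val nestedColumn = 100 * 100   // 10000 units
val otherColumns = 8           // e.g., a single long column

// Project output size is scaled by (new row size / old row size). A Project
// that drops the nested column keeps only 8 of 10008 units, so everything
// downstream shrinks by roughly 1000x: the upstream overestimate becomes a
// downstream underestimate (e.g., wrongly qualifying a join side for broadcast).
val scale = otherColumns.toDouble / (otherColumns + nestedColumn)  // ~0.0008
```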
-
hyukjinkwon authored
[SPARK-18753][SQL] Keep pushed-down null literal as a filter in Spark-side post-filter for FileFormat datasources

## What changes were proposed in this pull request?

Currently, `FileSourceStrategy` does not handle the case when the pushed-down filter is `Literal(null)` and removes it from the Spark-side post-filter. For example, the code below:

```scala
val df = Seq(Tuple1(Some(true)), Tuple1(None), Tuple1(Some(false))).toDF()
df.filter($"_1" === "true").explain(true)
```

shows that `null` is kept properly:

```
== Parsed Logical Plan ==
'Filter ('_1 = true)
+- LocalRelation [_1#17]

== Analyzed Logical Plan ==
_1: boolean
Filter (cast(_1#17 as double) = cast(true as double))
+- LocalRelation [_1#17]

== Optimized Logical Plan ==
Filter (isnotnull(_1#17) && null)
+- LocalRelation [_1#17]

== Physical Plan ==
*Filter (isnotnull(_1#17) && null)    << Here `null` is there
+- LocalTableScan [_1#17]
```

However, when we read it back from Parquet,

```scala
val path = "/tmp/testfile"
df.write.parquet(path)
spark.read.parquet(path).filter($"_1" === "true").explain(true)
```

`null` is removed at the post-filter:

```
== Parsed Logical Plan ==
'Filter ('_1 = true)
+- Relation[_1#11] parquet

== Analyzed Logical Plan ==
_1: boolean
Filter (cast(_1#11 as double) = cast(true as double))
+- Relation[_1#11] parquet

== Optimized Logical Plan ==
Filter (isnotnull(_1#11) && null)
+- Relation[_1#11] parquet

== Physical Plan ==
*Project [_1#11]
+- *Filter isnotnull(_1#11)    << Here `null` is missing
   +- *FileScan parquet [_1#11] Batched: true, Format: ParquetFormat, Location: InMemoryFileIndex[file:/tmp/testfile], PartitionFilters: [null], PushedFilters: [IsNotNull(_1)], ReadSchema: struct<_1:boolean>
```

This PR fixes it so the `null` filter is kept. In more detail,

```scala
val partitionKeyFilters =
  ExpressionSet(normalizedFilters.filter(_.references.subsetOf(partitionSet)))
```

keeps this `null` in `partitionKeyFilters`, because a `Literal` never has `children`, so its `references` set is empty, which is always a subset of `partitionSet`. And then in

```scala
val afterScanFilters = filterSet -- partitionKeyFilters
```

`null` is always removed from the post-filter. So, if the referenced fields are empty, the filter should be applied to the data columns too. After this PR, it becomes as below:

```
== Parsed Logical Plan ==
'Filter ('_1 = true)
+- Relation[_1#276] parquet

== Analyzed Logical Plan ==
_1: boolean
Filter (cast(_1#276 as double) = cast(true as double))
+- Relation[_1#276] parquet

== Optimized Logical Plan ==
Filter (isnotnull(_1#276) && null)
+- Relation[_1#276] parquet

== Physical Plan ==
*Project [_1#276]
+- *Filter (isnotnull(_1#276) && null)
   +- *FileScan parquet [_1#276] Batched: true, Format: ParquetFormat, Location: InMemoryFileIndex[file:/private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-a5d59bdb-5b..., PartitionFilters: [null], PushedFilters: [IsNotNull(_1)], ReadSchema: struct<_1:boolean>
```

## How was this patch tested?

Unit test in `FileSourceStrategySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16184 from HyukjinKwon/SPARK-18753.

(cherry picked from commit 89ae26dc)
Signed-off-by: Cheng Lian <lian@databricks.com>
-
Cheng Lian authored
## What changes were proposed in this pull request?

Currently, the full console output page of a Spark Jenkins PR build can be as large as several megabytes. It takes a relatively long time to load and may even freeze the browser for quite a while. This PR makes the build script post the test report page link to GitHub instead. The test report page is far more concise and is usually the first page to check when investigating a Jenkins build failure. Note that for builds where a test report is not available (ongoing builds and builds that fail before test execution), the test report link automatically redirects to the build page.

## How was this patch tested?

N/A.

Author: Cheng Lian <lian@databricks.com>

Closes #16163 from liancheng/jenkins-test-report.

(cherry picked from commit ba4aab9b)
Signed-off-by: Reynold Xin <rxin@databricks.com>
-
Nattavut Sutyanyong authored
## What changes were proposed in this pull request?

Move the check of the GROUP BY column in a correlated scalar subquery from CheckAnalysis to Analysis to fix a regression caused by SPARK-18504. The problem can now be reproduced with a simple script:

```scala
Seq((1,1)).toDF("pk","pv").createOrReplaceTempView("p")
Seq((1,1)).toDF("ck","cv").createOrReplaceTempView("c")
sql("select * from p,c where p.pk=c.ck and c.cv = (select avg(c1.cv) from c c1 where c1.ck = p.pk)").show
```

The requirements are:

1. We need to reference the same table twice in both the parent and the subquery. Here it is the table c.
2. We need a correlated predicate, but to a different table. Here it is from c (as c1) in the subquery to p in the parent.
3. We then "deduplicate" c1.ck in the subquery to `ck#<n1>#<n2>` at the `Project` above the `Aggregate` of `avg`. When we compare `ck#<n1>#<n2>` and the original group-by column `ck#<n1>` by their canonicalized forms, #<n2> != #<n1>. That is how the exception added in SPARK-18504 is triggered.

## How was this patch tested?

SubquerySuite and a simplified version of TPCDS-Q32.

Author: Nattavut Sutyanyong <nsy.can@gmail.com>

Closes #16246 from nsyca/18814.

(cherry picked from commit cccd6439)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
-
- Dec 13, 2016
-
wm624@hotmail.com authored
## What changes were proposed in this pull request?

While adding vignettes for kstest, I found some errors in the example:

1. There is a typo in kstest;
2. print.summary.KStest doesn't work with the example.

This PR fixes the example errors and adds a new unit test for print.summary.KStest.

## How was this patch tested?

Manual test; added a new unit test.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16259 from wangmiao1981/ks.

(cherry picked from commit f2ddabfa)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
-
Shixiong Zhu authored
## What changes were proposed in this pull request?

Disable KafkaSourceStressForDontFailOnDataLossSuite for now.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16275 from zsxwing/ignore-flaky-test.

(cherry picked from commit e104e55c)
Signed-off-by: Reynold Xin <rxin@databricks.com>
-
Xiangrui Meng authored
## What changes were proposed in this pull request?

Mention `spark.randomForest` and `spark.gbt` in the vignettes. Keep the content minimal since users can type `?spark.randomForest` to see the full doc.

cc: jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #16264 from mengxr/SPARK-18793.

(cherry picked from commit 594b14f1)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
-
Tathagata Das authored
## What changes were proposed in this pull request?

- Changed `StreamingQueryProgress.watermark` to `StreamingQueryProgress.queryTimestamps`, a `Map[String, String]` containing the following keys: "eventTime.max", "eventTime.min", "eventTime.avg", "processingTime", "watermark". All of them are UTC-formatted strings.
- Renamed `StreamingQuery.timestamp` to `StreamingQueryProgress.triggerTimestamp` to differentiate it from `queryTimestamps`. It holds the timestamp of when the trigger was started.

## How was this patch tested?

Updated tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16258 from tdas/SPARK-18834.

(cherry picked from commit c68fb426)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
-
Shixiong Zhu authored
## What changes were proposed in this pull request?

This PR fixes the timeout value in `awaitResultInForkJoinSafely` for 2.1 and 2.0. Master has been fixed by https://github.com/apache/spark/pull/16230.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16268 from zsxwing/SPARK-18843.
-
Alex Bozarth authored
## What changes were proposed in this pull request?

When I added a visibility check for the logs column on the executors page in #14382, the method I used only ran the check on the initial DataTable creation and not on subsequent page loads. I moved the check out of the table definition so that it now runs on each page load. The jQuery DataTable functionality used is the same.

## How was this patch tested?

Tested manually. No visible UI changes to screenshot.

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #16256 from ajbozarth/spark18816.

(cherry picked from commit aebf44e5)
Signed-off-by: Sean Owen <sowen@cloudera.com>
-
jerryshao authored
[SPARK-18840][YARN] Avoid throwing an exception when getting the token renewal interval in a non-HDFS security environment

## What changes were proposed in this pull request?

Fix the `java.util.NoSuchElementException` thrown when running Spark in a non-HDFS security environment. The current code assumes an `HDFS_DELEGATION_KIND` token will be found in the Credentials, but in some cloud environments HDFS is not required, so we should avoid this exception.

## How was this patch tested?

Manually verified in a local environment.

Author: jerryshao <sshao@hortonworks.com>

Closes #16265 from jerryshao/SPARK-18840.

(cherry picked from commit 43298d15)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
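Not the patch itself, just a sketch of the defensive pattern the description implies: look the token up as an `Option` instead of assuming it exists. The Hadoop classes used here are an assumption about the relevant API, not code from the PR.

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier
import org.apache.hadoop.security.Credentials

// In a non-HDFS deployment the credentials may contain no HDFS delegation
// token at all, so return an Option rather than calling .get on a lookup
// (the assumption behind the NoSuchElementException).
def findHdfsToken(creds: Credentials) =
  creds.getAllTokens.asScala
    .find(_.getKind == DelegationTokenIdentifier.HDFS_DELEGATION_KIND)
```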
-
Marcelo Vanzin authored
This avoids issues during maven tests because of shading.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16260 from vanzin/SPARK-18835.

(cherry picked from commit f280ccf4)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
-
wm624@hotmail.com authored
## What changes were proposed in this pull request?

spark.logit was added in 2.1. We need to update the SparkR vignettes to reflect the changes. This is part of the SparkR QA work.

## How was this patch tested?

Manually built the html. Please see the attached image for the result.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16222 from wangmiao1981/veg.

(cherry picked from commit 2aa16d03)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
-
Shixiong Zhu authored
## What changes were proposed in this pull request?

Major change in this PR:

- Add `pendingQueryNames` and `pendingQueryIds` to track queries that are going to start but are not yet in `activeQueries`, so that we don't need to hold a lock while starting a query.

Minor changes:

- Fix a potential NPE when the user sets `checkpointLocation` using SQLConf but doesn't specify a query name.
- Add missing docs in `StreamingQueryListener`.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16220 from zsxwing/SPARK-18796.

(cherry picked from commit 417e45c5)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
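A hedged sketch (not from the PR) of the per-query alternative to the SQLConf-based checkpoint setting mentioned above: supplying both a query name and a checkpoint location explicitly. It assumes a streaming DataFrame `lines`; the name and paths are hypothetical.

```scala
// Per-query name and checkpoint location, so nothing depends on the
// SQLConf default involved in the NPE described above.
val query = lines.writeStream
  .queryName("clicks")                               // hypothetical name
  .option("checkpointLocation", "/tmp/ckpt/clicks")  // hypothetical path
  .format("parquet")
  .start("/tmp/out/clicks")                          // hypothetical output path
```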
-
- Dec 12, 2016
-
Felix Cheung authored
## What changes were proposed in this pull request?

Support overriding the download URL (including the version directory) in an environment variable, `SPARKR_RELEASE_DOWNLOAD_URL`.

## How was this patch tested?

Unit test and manual testing:

- snapshot build url
- download when spark jar not cached
- when spark jar is cached
- RC build url
- download when spark jar not cached
- when spark jar is cached
- multiple cached spark versions
- starting with sparkR shell

To use this,

```
SPARKR_RELEASE_DOWNLOAD_URL=http://this_is_the_url_to_spark_release_tgz R
```

then in R,

```
library(SparkR) # or specify lib.loc
sparkR.session()
```

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16248 from felixcheung/rinstallurl.

(cherry picked from commit 8a51cfdc)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
-
Yuming Wang authored
## What changes were proposed in this pull request?

Cloudera puts `/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml` as the configuration file for the Hive Metastore Server, where `hive.metastore.try.direct.sql=false`. But Spark isn't reading this configuration file and gets the default value `hive.metastore.try.direct.sql=true`. As mallman said, we should use the `getMetaConf` method to obtain the original configuration from the Hive Metastore Server. I have tested this method a few times and the return value is always consistent with the Hive Metastore Server.

## How was this patch tested?

The existing tests.

Author: Yuming Wang <wgyumg@gmail.com>

Closes #16122 from wangyum/SPARK-18681.

(cherry picked from commit 90abfd15)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
-
Bill Chambers authored
## What changes were proposed in this pull request?

This PR clarifies where accumulators will be displayed.

## How was this patch tested?

No testing.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Bill Chambers <bill@databricks.com>
Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <wchambers@ischool.berkeley.edu>

Closes #16180 from anabranch/improve-acc-docs.

(cherry picked from commit 70ffff21)
Signed-off-by: Sean Owen <sowen@cloudera.com>
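For context, a small sketch (not from the PR) of the behavior the doc change is about: only named accumulators appear in the web UI, on the pages of stages that modify them. A spark-shell session is assumed.

```scala
// A *named* accumulator shows up in the web UI; an unnamed one does not.
val errorCount = spark.sparkContext.longAccumulator("errorCount")

spark.sparkContext.parallelize(1 to 1000).foreach { i =>
  if (i % 100 == 0) errorCount.add(1)   // updated on executors
}

println(errorCount.value)  // read on the driver after the action finishes
```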
-
Tyson Condie authored
## What changes were proposed in this pull request?

Instead of only keeping the minimum number of offsets around, we should keep enough information to allow us to roll back n batches and re-execute the stream starting from a given point. In particular, we should create a config in SQLConf, spark.sql.streaming.retainedBatches, that defaults to 100 and ensures that we keep enough log files in the following places to roll back the specified number of batches:

- the offsets that are present in each batch
- versions of the state store
- the file lists stored for the FileStreamSource
- the metadata log stored by the FileStreamSink

marmbrus zsxwing

## How was this patch tested?

The following tests were added.

### StreamExecution offset metadata

Test added to StreamingQuerySuite that ensures offset metadata is garbage collected according to minBatchesToRetain.

### CompactibleFileStreamLog

Tests added in CompactibleFileStreamLogSuite to ensure that logs are purged starting before the first compaction file that precedes the current batch id - minBatchesToRetain.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tyson Condie <tcondie@gmail.com>

Closes #16219 from tcondie/offset_hist.

(cherry picked from commit 83a42897)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
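A minimal sketch of tuning the retention described above. The exact config key is an assumption: the description proposes spark.sql.streaming.retainedBatches while the tests mention minBatchesToRetain, so check SQLConf in your Spark version before relying on it.

```scala
// Assumed config key (see note above); keeps enough streaming metadata and
// state to roll back and re-execute up to 200 batches.
spark.conf.set("spark.sql.streaming.minBatchesToRetain", "200")
```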
-
- Dec 11, 2016
-
krishnakalyan3 authored
## What changes were proposed in this pull request?

Updated the Scala param and Python param docs to have quotes around the options, making them easier for users to read.

## How was this patch tested?

Manually checked the docstrings.

Author: krishnakalyan3 <krishnakalyan3@gmail.com>

Closes #16242 from krishnakalyan3/doc-string.

(cherry picked from commit c802ad87)
Signed-off-by: Sean Owen <sowen@cloudera.com>
-
Wenchen Fan authored
## What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/15620, all of the Maven-based 2.0 Jenkins jobs time out consistently. As I pointed out in https://github.com/apache/spark/pull/15620#discussion_r91829129, it seems that the regression test is overkill and may hit the constant pool size limitation, which is a known issue that hasn't been fixed yet. Since #15620 only fixes the code size limitation problem, we can simplify the test to avoid hitting the constant pool size limitation.

## How was this patch tested?

Test-only change.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16244 from cloud-fan/minor.

(cherry picked from commit 9abd05b6)
Signed-off-by: Sean Owen <sowen@cloudera.com>
-
- Dec 10, 2016
-
wangzhenhua authored
[SPARK-18815][SQL] Fix NPE when collecting column stats for a string/binary column having only null values

## What changes were proposed in this pull request?

During column stats collection, the average and max length will be null if a column of string/binary type has only null values. To fix this, use the default size when the avg/max length is null.

## How was this patch tested?

Added a test for handling null columns.

Author: wangzhenhua <wangzhenhua@huawei.com>

Closes #16243 from wzhfy/nullStats.

(cherry picked from commit a29ee55a)
Signed-off-by: Reynold Xin <rxin@databricks.com>
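For orientation, a sketch (not from the PR) of the statement that drives the column-stats code path described above. The table and column names are hypothetical and a spark-shell session is assumed.

```scala
// Column-level statistics collection; a string column containing only nulls
// is the case that previously triggered the NPE fixed here.
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS user_agent, referrer")
```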
-
Michal Senkyr authored
## What changes were proposed in this pull request?

The API documentation build was failing when using Java 8 due to an incorrect character `>` in Javadoc. Replace `>` with literals in Javadoc to allow the build to pass.

## How was this patch tested?

Documentation was built and inspected manually to ensure it still displays correctly in the browser:

```
cd docs && jekyll serve
```

Author: Michal Senkyr <mike.senkyr@gmail.com>

Closes #16201 from michalsenkyr/javadoc8-gt-fix.

(cherry picked from commit 11432483)
Signed-off-by: Sean Owen <sowen@cloudera.com>
-
Dongjoon Hyun authored
## What changes were proposed in this pull request?

According to the notice on the Wiki front page, we can safely remove the obsolete wiki pointer from `README.md` and `docs/index.md`, too. These two lines are the last occurrences of that link.

```
All current wiki content has been merged into pages at http://spark.apache.org as of November 2016.
Each page links to the new location of its information on the Spark web site.
Obsolete wiki content is still hosted here, but carries a notice that it is no longer current.
```

## How was this patch tested?

Manual.

- `README.md`: https://github.com/dongjoon-hyun/spark/tree/remove_wiki_from_readme
- `docs/index.md`:

```
cd docs
SKIP_API=1 jekyll build
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16239 from dongjoon-hyun/remove_wiki_from_readme.

(cherry picked from commit f3a3fed7)
Signed-off-by: Sean Owen <sowen@cloudera.com>
-