  1. Apr 26, 2017
    • [SPARK-12868][SQL] Allow adding jars from hdfs · 2ba1eba3
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      Spark 2.2 is going to be cut, and it'll be great if SPARK-12868 can be resolved before that. There have been several PRs for this, like [PR#16324](https://github.com/apache/spark/pull/16324), but all of them have been inactive for a long time or have been closed.
      
      This PR adds a SparkUrlStreamHandlerFactory, which relies on the URL's protocol to choose the appropriate URLStreamHandlerFactory (such as FsUrlStreamHandlerFactory) to create the URLStreamHandler.
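
      A minimal sketch of the protocol-dispatch idea (the wiring below is illustrative, not the exact patch code):
      ```
      import java.net.{URLStreamHandler, URLStreamHandlerFactory}
      import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

      // Illustrative sketch: delegate Hadoop-backed protocols to Hadoop's factory
      // and fall back to the JVM's built-in handlers for everything else.
      class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory {
        private val fsFactory = new FsUrlStreamHandlerFactory()

        override def createURLStreamHandler(protocol: String): URLStreamHandler = protocol match {
          case "hdfs" | "webhdfs" => fsFactory.createURLStreamHandler(protocol)
          case _ => null // null tells the JVM to use its default handler for this protocol
        }
      }

      // The factory can only be registered once per JVM:
      // java.net.URL.setURLStreamHandlerFactory(new SparkUrlStreamHandlerFactory())
      ```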
      
      ## How was this patch tested?
      1. Add a new unit test.
      2. Check manually.
      Before: throws an exception with "failed unknown protocol: hdfs"
      <img width="914" alt="screen shot 2017-03-17 at 9 07 36 pm" src="https://cloud.githubusercontent.com/assets/8546874/24075277/5abe0a7c-0bd5-11e7-900e-ec3d3105da0b.png">
      
      After:
      <img width="1148" alt="screen shot 2017-03-18 at 11 42 18 am" src="https://cloud.githubusercontent.com/assets/8546874/24075283/69382a60-0bd5-11e7-8d30-d9405c3aaaba.png">
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #17342 from weiqingy/SPARK-18910.
    • [SPARK-20474] Fixing OnHeapColumnVector reallocation · a277ae80
      Michal Szafranski authored
      ## What changes were proposed in this pull request?
      OnHeapColumnVector reallocation copies data to the new storage only up to 'elementsAppended'. This variable is only updated when using the ColumnVector.appendX API, while ColumnVector.putX is more commonly used.
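
      A tiny standalone sketch of the failure mode (a hypothetical simplified vector, not the real OnHeapColumnVector code):
      ```
      // Hypothetical simplified column vector: reallocation copies only
      // `elementsAppended` values, but putInt() writes without bumping that counter.
      class ToyIntVector(initialCapacity: Int) {
        private var data = new Array[Int](initialCapacity)
        private var elementsAppended = 0

        def putInt(rowId: Int, value: Int): Unit = data(rowId) = value // counter not updated

        def appendInt(value: Int): Unit = { data(elementsAppended) = value; elementsAppended += 1 }

        def reserve(newCapacity: Int): Unit = {
          val newData = new Array[Int](newCapacity)
          // Buggy copy: values written via putInt() beyond elementsAppended are silently dropped.
          System.arraycopy(data, 0, newData, 0, elementsAppended)
          data = newData
        }

        def getInt(rowId: Int): Int = data(rowId)
      }
      ```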
      
      ## How was this patch tested?
      Tested using existing unit tests.
      
      Author: Michal Szafranski <michal@databricks.com>
      
      Closes #17773 from michal-databricks/spark-20474.
    • [SPARK-20473] Enabling missing types in ColumnVector.Array · 99c6cf9e
      Michal Szafranski authored
      ## What changes were proposed in this pull request?
      ColumnVector implementations originally did not support some Catalyst types (float, short, and boolean). Now that they do, those types should also be added to ColumnVector.Array.
      
      ## How was this patch tested?
      Tested using existing unit tests.
      
      Author: Michal Szafranski <michal@databricks.com>
      
      Closes #17772 from michal-databricks/spark-20473.
    • [SPARK-20391][CORE] Rename memory related fields in ExecutorSummary · 66dd5b83
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up of #14617 to make the names of memory-related fields more meaningful.

      Here, for backward compatibility, I didn't change the `maxMemory` and `memoryUsed` fields.
      
      ## How was this patch tested?
      
      Existing UT and local verification.
      
      CC squito and tgravescs .
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17700 from jerryshao/SPARK-20391.
    • [MINOR][ML] Fix some PySpark & SparkR flaky tests · dbb06c68
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Some PySpark & SparkR tests run with a tiny dataset and a tiny `maxIter`, which means they do not converge. I don't think checking intermediate results during iteration makes sense, as these intermediate results may be fragile and unstable, so we should switch to checking the converged result. We hit this issue at #17746 when we upgraded breeze to 0.13.1.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17757 from yanboliang/flaky-test.
    • [SPARK-19812] YARN shuffle service fails to relocate recovery DB across NFS directories · 7fecf513
      Tom Graves authored
      
      ## What changes were proposed in this pull request?
      
      Change from using Java's Files.move to using Hadoop filesystem operations to move the directories. Java's Files.move does not work when moving directories across NFS mounts, and in fact its documentation says that if the directory has entries you should do a recursive move. We are already using the Hadoop filesystem here, so just use the local filesystem from there, as it handles this properly.
      
      Note that the DB here is actually a directory of files and not just a single file, hence the change in the name of the local var.
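
      A hedged sketch of the approach (illustrative paths; the real change lives in the YARN shuffle service code):
      ```
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FileSystem, Path}

      // Move a recovery DB directory with the Hadoop local filesystem, which copes
      // with non-empty directories and NFS mounts, instead of java.nio.file.Files.move.
      val fs = FileSystem.getLocal(new Configuration())
      val source = new Path("/old/recovery/registeredExecutors.ldb") // illustrative path
      val target = new Path("/new/recovery/registeredExecutors.ldb") // illustrative path
      if (!fs.rename(source, target)) {
        throw new java.io.IOException(s"Failed to move $source to $target")
      }
      ```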
      
      ## How was this patch tested?
      
      Ran YarnShuffleServiceSuite unit tests. Unfortunately, I couldn't easily add a test here since it involves NFS.
      Ran manual tests to verify that the DB directories were properly moved across NFS mounted directories. Have been running this internally for weeks.
      
      Author: Tom Graves <tgraves@apache.org>
      
      Closes #17748 from tgravescs/SPARK-19812.
    • [SPARK-20400][DOCS] Remove References to 3rd Party Vendor Tools · 7a365257
      anabranch authored
      ## What changes were proposed in this pull request?
      
      Simple documentation change to remove explicit vendor references.
      
      ## How was this patch tested?
      
      NA
      
      
      Author: anabranch <bill@databricks.com>
      
      Closes #17695 from anabranch/remove-vendor.
    • [SPARK-20437][R] R wrappers for rollup and cube · df58a95a
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Add `rollup` and `cube` methods and corresponding generics.
      - Add short description to the vignette.
      
      ## How was this patch tested?
      
      - Existing unit tests.
      - Additional unit tests covering new features.
      - `check-cran.sh`.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17728 from zero323/SPARK-20437.
  2. Apr 25, 2017
    • [SPARK-16548][SQL] Inconsistent error handling in JSON parsing SQL functions · 57e1da39
      Eric Wasserman authored
      ## What changes were proposed in this pull request?
      
      change to using Jackson's `com.fasterxml.jackson.core.JsonFactory`
      
          public JsonParser createParser(String content)
      
      ## How was this patch tested?
      
      existing unit tests
      
      
      Author: Eric Wasserman <ericw@sgn.com>
      
      Closes #17693 from ewasserman/SPARK-20314.
    • [SPARK-18127] Add hooks and extension points to Spark · caf39202
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This patch adds support for customizing the Spark session by injecting user-defined custom extensions. This allows a user to add custom analyzer rules/checks, optimizer rules, planning strategies, or even a customized parser.
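
      For example, extensions can be injected when the session is built; a minimal sketch (the rule below is a no-op, purely illustrative):
      ```
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
      import org.apache.spark.sql.catalyst.rules.Rule

      // Illustrative no-op optimizer rule.
      case class MyRule(spark: SparkSession) extends Rule[LogicalPlan] {
        override def apply(plan: LogicalPlan): LogicalPlan = plan
      }

      val spark = SparkSession.builder()
        .master("local[*]")
        .withExtensions { extensions =>
          extensions.injectOptimizerRule(MyRule) // analyzer rules, parsers, strategies go here too
        }
        .getOrCreate()
      ```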
      
      ## How was this patch tested?
      
      Unit Tests in SparkSessionExtensionSuite
      
      Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
      
      Closes #17724 from sameeragarwal/session-extensions.
    • [SPARK-5484][GRAPHX] Periodically do checkpoint in Pregel · 0a7f5f27
      ding authored
      ## What changes were proposed in this pull request?
      
      Pregel-based iterative algorithms with more than ~50 iterations begin to slow down and eventually fail with a StackOverflowError due to Spark's lack of support for long lineage chains.
      
      This PR causes Pregel to checkpoint the graph periodically if the checkpoint directory is set.
      It also moves PeriodicGraphCheckpointer.scala from mllib to graphx, and moves PeriodicRDDCheckpointer.scala and PeriodicCheckpointer.scala from mllib to core.
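
      Usage-wise, a hedged sketch (the checkpoint directory call is standard Spark; the path is illustrative):
      ```
      import org.apache.spark.{SparkConf, SparkContext}

      // Pregel-based GraphX algorithms only checkpoint periodically if a
      // checkpoint directory has been set on the SparkContext.
      val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("pregel-checkpoint"))
      sc.setCheckpointDir("/tmp/spark-checkpoints") // illustrative path

      // ... build a Graph and run a Pregel-based algorithm as usual; with the
      // directory set, long lineage chains are periodically truncated, avoiding
      // the StackOverflowError described above.
      ```
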
      ## How was this patch tested?
      
      unit tests, manual tests
      
      Author: ding <ding@localhost.localdomain>
      Author: dding3 <ding.ding@intel.com>
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #15125 from dding3/cp2_pregel.
    • [SPARK-20449][ML] Upgrade breeze version to 0.13.1 · 67eef47a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Upgrade the breeze version to 0.13.1, which fixed some critical bugs in L-BFGS-B.
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17746 from yanboliang/spark-20449.
    • [SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant · 387565cf
      wangmiao1981 authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up PR of #17478.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: wangmiao1981 <wm624@hotmail.com>
      
      Closes #17754 from wangmiao1981/followup.
    • [SPARK-20404][CORE] Using Option(name) instead of Some(name) · 0bc7a902
      Sergey Zhemzhitsky authored
      Use Option(name) instead of Some(name) to prevent runtime failures when using accumulators created like the following:
      ```
      sparkContext.accumulator(0, null)
      ```
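
      The difference in a nutshell (a standalone illustration, not the patched Spark code):
      ```
      // Option(x) guards against null, Some(x) does not:
      val unsafe: Option[String] = Some(null)   // Some(null) -- using the wrapped value later can NPE
      val safe: Option[String]   = Option(null) // None -- null is converted safely
      ```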
      
      Author: Sergey Zhemzhitsky <szhemzhitski@gmail.com>
      
      Closes #17740 from szhem/SPARK-20404-null-acc-names.
    • [SPARK-20455][DOCS] Fix Broken Docker IT Docs · c8f12195
      Armin Braun authored
      ## What changes were proposed in this pull request?
      
      Just added the Maven `test` goal.
      
      ## How was this patch tested?
      
      No test needed, just a trivial documentation fix.
      
      Author: Armin Braun <me@obrown.io>
      
      Closes #17756 from original-brownbear/SPARK-20455.
    • [SPARK-20451] Filter out nested mapType datatypes from sort order in randomSplit · 31345fde
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      In `randomSplit`, it is possible that the underlying dataset doesn't guarantee the ordering of rows in its constituent partitions each time a split is materialized, which could result in overlapping splits.
      
      To prevent this, as part of SPARK-12662, we explicitly sort each input partition to make the ordering deterministic. Given that `MapTypes` cannot be sorted, this patch explicitly prunes them out of the sort order. Additionally, if the resulting sort order is empty, this patch then materializes the dataset to guarantee determinism.
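
      A hedged sketch of the pruning idea (simplified to top-level columns; the real patch also handles nested maps):
      ```
      import org.apache.spark.sql.{Column, DataFrame}
      import org.apache.spark.sql.functions.col
      import org.apache.spark.sql.types.MapType

      // Build a deterministic sort order from the columns whose types are orderable,
      // skipping MapType columns.
      def deterministicSortOrder(df: DataFrame): Seq[Column] =
        df.schema.fields.filterNot(_.dataType.isInstanceOf[MapType]).map(f => col(f.name)).toSeq
      ```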
      
      ## How was this patch tested?
      
      Extended `randomSplit on reordered partitions` in `DataFrameStatSuite` to also test for dataframes with mapTypes and nested mapTypes.
      
      Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
      
      Closes #17751 from sameeragarwal/randomsplit2.
  3. Apr 24, 2017
    • [SPARK-20453] Bump master branch version to 2.3.0-SNAPSHOT · f44c8a84
      Josh Rosen authored
      This patch bumps the master branch version to `2.3.0-SNAPSHOT`.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #17753 from JoshRosen/SPARK-20453.
    • [SPARK-20239][CORE] Improve HistoryServer's ACL mechanism · 5280d93e
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      The current SHS (Spark History Server) has two different ACLs:

      * ACL of the base URL. This is controlled by "spark.acls.enable" or "spark.ui.acls.enable"; with this enabled, only a user configured in "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user who started the SHS, can list all the applications; otherwise none of them can be listed. This also affects the REST APIs that list the summary of all apps and of a single app.
      * Per-application ACL. This is controlled by "spark.history.ui.acls.enable". With this enabled, only the history admin user and the user/group who ran the app can access the details of that app.
      
      With these two ACLs, we may encounter several unexpected behaviors:

      1. If the base URL's ACL (`spark.acls.enable`) is enabled but user "A" has no view permission, user "A" cannot see the app list but can still access the details of its own app.
      2. If the base URL's ACL (`spark.acls.enable`) is disabled, then user "A" can download any application's event log, even if it was not run by user "A".
      3. Changes to the Live UI's ACL affect the History UI's ACL, which shares the same conf file.

      These unexpected behaviors arise mainly because we have two different ACLs; ideally we should have only one to manage them all.
      
      So, to improve the SHS's ACL mechanism, this PR proposes to:

      1. Disable "spark.acls.enable" and only use "spark.history.ui.acls.enable" for the history server.
      2. Check permissions for the event-log download REST API.
      
      With this PR:
      
      1. An admin user can see/download the list of all applications, as well as application details.
      2. A normal user can see the list of all applications, but can only download and check the details of applications accessible to them.
      
      ## How was this patch tested?
      
      New UTs are added, also verified in real cluster.
      
      CC tgravescs vanzin, please help review; this PR changes the semantics you implemented previously. Thanks a lot.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17582 from jerryshao/SPARK-20239.
    • [SPARK-20438][R] SparkR wrappers for split and repeat · 8a272ddc
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Add wrappers for `o.a.s.sql.functions`:
      
      - `split` as `split_string`
      - `repeat` as `repeat_string`
      
      ## How was this patch tested?
      
      Existing tests, additional unit tests, `check-cran.sh`
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17729 from zero323/SPARK-20438.
    • [SPARK-18901][ML] Require in LR LogisticAggregator is redundant · 90264ace
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      In MultivariateOnlineSummarizer, `add` and `merge` already check weights and feature sizes. The corresponding checks in LR are therefore redundant and are removed in this PR.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #17478 from wangmiao1981/logit.
    • [SPARK-20439][SQL] Fix Catalog API listTables and getTable when failed to fetch table metadata · 776a2c0e
      Xiao Li authored
      ### What changes were proposed in this pull request?
      
      `spark.catalog.listTables` and `spark.catalog.getTable` do not work if we are unable to retrieve table metadata for any reason (e.g., the table serde class is not accessible or the table type is not accepted by Spark SQL). After this PR, the APIs still return the corresponding Table, just without the description and tableType.
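
      The affected calls, for reference (standard Catalog API usage, with an illustrative table name):
      ```
      // Both calls should now succeed even when the table metadata cannot be fully
      // read; the returned Table simply omits description and tableType in that case.
      spark.catalog.listTables("default").show()
      val table = spark.catalog.getTable("default", "my_table") // illustrative table name
      ```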
      
      ### How was this patch tested?
      Added a test case
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17730 from gatorsmile/listTables.
    • [BUILD] Close stale PRs · e9f97154
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR proposes to close stale PRs. Currently, we have 400+ open PRs, and some of them are stale: their JIRA tickets have already been closed, or their JIRA tickets do not exist (also, they do not appear to be minor issues).
      
      // Open PRs whose JIRA tickets have already been closed
      Closes #11785
      Closes #13027
      Closes #13614
      Closes #13761
      Closes #15197
      Closes #14006
      Closes #12576
      Closes #15447
      Closes #13259
      Closes #15616
      Closes #14473
      Closes #16638
      Closes #16146
      Closes #17269
      Closes #17313
      Closes #17418
      Closes #17485
      Closes #17551
      Closes #17463
      Closes #17625
      
      // Open PRs whose JIRA tickets do not exist and which are not minor issues
      Closes #10739
      Closes #15193
      Closes #15344
      Closes #14804
      Closes #16993
      Closes #17040
      Closes #15180
      Closes #17238
      
      ## How was this patch tested?
      N/A
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17734 from maropu/resolved_pr.
  4. Apr 23, 2017
    • [SPARK-20385][WEB-UI] 'Submitted Time' field: the date format needs to be formatted in the Running Drivers and Completed Drivers tables in the master web UI · 2eaf4f3f
      郭小龙 10207633 authored
      
      ## What changes were proposed in this pull request?
      The 'Submitted Time' field's date format **needs to be formatted** in the Running Drivers and Completed Drivers tables in the master web UI.
      Before fixing this problem, e.g.:

      Completed Drivers

      | Submission ID | Submitted Time | Worker | State | Cores | Memory | Main Class |
      | --- | --- | --- | --- | --- | --- | --- |
      | driver-20170419145755-0005 | **Wed Apr 19 14:57:55 CST 2017** | worker-20170419145250-zdh120-40412 | FAILED | 1 | 1024.0 MB | cn.zte.HdfsTest |
      
      Please see the attachment: https://issues.apache.org/jira/secure/attachment/12863977/before_fix.png
      
      After fixing this problem, e.g.:

      Completed Drivers

      | Submission ID | Submitted Time | Worker | State | Cores | Memory | Main Class |
      | --- | --- | --- | --- | --- | --- | --- |
      | driver-20170419145755-0006 | **2017/04/19 16:01:25** | worker-20170419145250-zdh120-40412 | FAILED | 1 | 1024.0 MB | cn.zte.HdfsTest |
      
      Please see the attachment: https://issues.apache.org/jira/secure/attachment/12863976/after_fix.png
      
      The 'Submitted Time' field's date format **has already been formatted** in the Running Applications and Completed Applications tables in the master web UI, **and it is correct.** E.g.:

      Running Applications

      | Application ID | Name | Cores | Memory per Executor | Submitted Time | User | State | Duration |
      | --- | --- | --- | --- | --- | --- | --- | --- |
      | app-20170419160910-0000 (kill) | SparkSQL::10.43.183.120 | 1 | 5.0 GB | **2017/04/19 16:09:10** | root | RUNNING | 53 s |
      
      **The formatted time is easier to read and consistent with the applications table, so I think it's worth fixing.**
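
      A hedged sketch of the kind of conversion involved (plain SimpleDateFormat; not necessarily the exact utility the patch uses):
      ```
      import java.text.SimpleDateFormat
      import java.util.Date

      // Format a raw Date the way the applications table does,
      // e.g. "Wed Apr 19 14:57:55 CST 2017" -> "2017/04/19 14:57:55".
      val fmt = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss")
      val formatted: String = fmt.format(new Date())
      ```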
      
      ## How was this patch tested?
      
      
      Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
      Author: guoxiaolong <guo.xiaolong1@zte.com.cn>
      Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>
      
      Closes #17682 from guoxiaolongzte/SPARK-20385.
  5. Apr 22, 2017
    • [SPARK-20132][DOCS] Add documentation for column string functions · 8765bc17
      Michael Patterson authored
      ## What changes were proposed in this pull request?
      Add docstrings to column.py for the Column functions `rlike`, `like`, `startswith`, and `endswith`. Pass these docstrings through `_bin_op`
      
      There may be a better place to put the docstrings. I put them immediately above the Column class.
      
      ## How was this patch tested?
      
      I ran `make html` on my local computer to remake the documentation, and verified that the html pages were displaying the docstrings correctly. I tried running `dev-tests`, and the formatting tests passed. However, my mvn build didn't work I think due to issues on my computer.
      
      These docstrings are my original work and are freely licensed.
      
      davies has done the most recent work reorganizing `_bin_op`
      
      Author: Michael Patterson <map222@gmail.com>
      
      Closes #17469 from map222/patterson-documentation.
    • [SPARK-20430][SQL] Initialise RangeExec parameters on the driver side · b3c572a6
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR initialises `RangeExec` parameters on the driver side.
      In the current master, the query below throws a `NullPointerException`:
      ```
      sql("SET spark.sql.codegen.wholeStage=false")
      sql("SELECT * FROM range(1)").show
      
      17/04/20 17:11:05 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
      java.lang.NullPointerException
              at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:54)
              at org.apache.spark.sql.execution.RangeExec.numSlices(basicPhysicalOperators.scala:343)
              at org.apache.spark.sql.execution.RangeExec$$anonfun$20.apply(basicPhysicalOperators.scala:506)
              at org.apache.spark.sql.execution.RangeExec$$anonfun$20.apply(basicPhysicalOperators.scala:505)
              at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
              at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
              at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
              at org.apache.spark.scheduler.Task.run(Task.scala:108)
              at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:320)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      ```
      
      ## How was this patch tested?
      Added a test in `DataFrameRangeSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17717 from maropu/SPARK-20430.
    • [SPARK-20386][SPARK CORE] modify the log info if the block exists on the slave already · 05a45149
      eatoncys authored
      ## What changes were proposed in this pull request?
      Modify the reported added memory size to memSize - originalMemSize if the block already exists on the slave, since in that case the memory actually added is memSize - originalMemSize; if originalMemSize is bigger than memSize, then the log info should say "Removed memory", and the removed size should be originalMemSize - memSize.
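
      A hedged sketch of the intended logging logic (names are illustrative, not the actual BlockManager code):
      ```
      // When the block already exists on the slave, log the delta rather than the full new size.
      def memoryChangeMessage(memSize: Long, originalMemSize: Long): String =
        if (memSize >= originalMemSize) {
          s"Added memory, size = ${memSize - originalMemSize}"
        } else {
          s"Removed memory, size = ${originalMemSize - memSize}"
        }
      ```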
      
      ## How was this patch tested?
      Multiple runs on existing unit tests
      
      
      Author: eatoncys <chen.yanshan@zte.com.cn>
      
      Closes #17683 from eatoncys/SPARK-20386.
  6. Apr 21, 2017
    • [SPARK-20401][DOC] In the official Spark configuration documentation, the 'spark.driver.supervise' configuration parameter's specification and default value are necessary · ad290402
      郭小龙 10207633 authored
      
      ## What changes were proposed in this pull request?
      Submit the Spark job using the REST interface, e.g.:
      curl -X POST http://10.43.183.120:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
          "action": "CreateSubmissionRequest",
          "appArgs": [
              "myAppArgument"
          ],
          "appResource": "/home/mr/gxl/test.jar",
          "clientSparkVersion": "2.2.0",
          "environmentVariables": {
              "SPARK_ENV_LOADED": "1"
          },
          "mainClass": "cn.zte.HdfsTest",
          "sparkProperties": {
              "spark.jars": "/home/mr/gxl/test.jar",
              **"spark.driver.supervise": "true",**
              "spark.app.name": "HdfsTest",
              "spark.eventLog.enabled": "false",
              "spark.submit.deployMode": "cluster",
              "spark.master": "spark://10.43.183.120:6066"
          }
      }'
      
      **I want to make sure that the driver is automatically restarted if it fails with a non-zero exit code, but I cannot find the 'spark.driver.supervise' configuration parameter's specification and default value in the official Spark documentation.**
      ## How was this patch tested?
      
      manual tests
      
      
      Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
      Author: guoxiaolong <guo.xiaolong1@zte.com.cn>
      Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>
      
      Closes #17696 from guoxiaolongzte/SPARK-20401.
    • [SPARK-20371][R] Add wrappers for collect_list and collect_set · fd648bff
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Adds wrappers for `collect_list` and `collect_set`.
      
      ## How was this patch tested?
      
      Unit tests, `check-cran.sh`
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17672 from zero323/SPARK-20371.
    • [SPARK-20423][ML] fix MLOR coeffs centering when reg == 0 · eb00378f
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      When reg == 0, MLOR has multiple solutions and we need to centralize the coefficients to get an identical result. BUT the current implementation centralizes the `coefficientMatrix` by the global mean of the coefficients.

      In fact the `coefficientMatrix` should be centralized on each feature index itself, because, according to the MLOR probability distribution function, it can easily be proven that if `{ w0, w1, ... w(K-1) }` make up the `coefficientMatrix`, then `{ w0 + c, w1 + c, ... w(K-1) + c }` is also an equivalent solution, where `c` is an arbitrary vector of `numFeatures` dimensions.
      Reference: https://core.ac.uk/download/pdf/6287975.pdf

      So we need to centralize the `coefficientMatrix` on each feature dimension separately.
      
      **We can also confirm this through the R library `glmnet`: when reg == 0, MLOR in `glmnet` always generates coefficients such that the sum over each feature dimension is `zero`.**
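
      A hedged numerical sketch of per-feature centralization (plain arrays, not the actual ML code):
      ```
      // Centralize a (numClasses x numFeatures) coefficient matrix so that, for each
      // feature column j, the coefficients across the classes sum to zero.
      def centerPerFeature(coefs: Array[Array[Double]]): Array[Array[Double]] = {
        val numClasses = coefs.length
        val numFeatures = coefs.head.length
        val colMeans = Array.tabulate(numFeatures)(j => coefs.map(_(j)).sum / numClasses)
        coefs.map(row => Array.tabulate(numFeatures)(j => row(j) - colMeans(j)))
      }
      ```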
      
      ## How was this patch tested?
      
      Tests added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #17706 from WeichenXu123/mlor_center.
    • [SPARK-20341][SQL] Support BigInt's value that does not fit in long value range · a750a595
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR avoids an exception in the case where `scala.math.BigInt` has a value that does not fit into the long value range (e.g. `Long.MAX_VALUE + 1`). When we run the following code with the current Spark, the exception below is thrown.

      This PR keeps the value as a `BigDecimal` if we detect such an overflow case by catching `ArithmeticException`.
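
      A hedged sketch of that fallback idea (not the actual Decimal.set implementation):
      ```
      import java.math.{BigDecimal => JBigDecimal}

      // Try the fast path that requires the value to fit in a Long, and fall back
      // to a BigDecimal representation when it overflows.
      def toDecimalValue(bigInt: scala.math.BigInt): Either[Long, JBigDecimal] =
        try {
          Left(bigInt.bigInteger.longValueExact())
        } catch {
          case _: ArithmeticException => Right(new JBigDecimal(bigInt.bigInteger))
        }
      ```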
      
      Sample program:
      ```
      case class BigIntWrapper(value: scala.math.BigInt)
      spark.createDataset(BigIntWrapper(scala.math.BigInt("10000000000000000002")) :: Nil).show
      ```
      Exception:
      ```
      Error while encoding: java.lang.ArithmeticException: BigInteger out of long range
      staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), apply, assertnotnull(assertnotnull(input[0, org.apache.spark.sql.BigIntWrapper, true])).value, true) AS value#0
      java.lang.RuntimeException: Error while encoding: java.lang.ArithmeticException: BigInteger out of long range
      staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), apply, assertnotnull(assertnotnull(input[0, org.apache.spark.sql.BigIntWrapper, true])).value, true) AS value#0
      	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
      	at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
      	at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
      	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
      	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
      	at scala.collection.immutable.List.foreach(List.scala:381)
      	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
      	at scala.collection.immutable.List.map(List.scala:285)
      	at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454)
      	at org.apache.spark.sql.Agg$$anonfun$18.apply$mcV$sp(MySuite.scala:192)
      	at org.apache.spark.sql.Agg$$anonfun$18.apply(MySuite.scala:192)
      	at org.apache.spark.sql.Agg$$anonfun$18.apply(MySuite.scala:192)
      	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
      	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
      	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
      	at org.scalatest.Transformer.apply(Transformer.scala:22)
      	at org.scalatest.Transformer.apply(Transformer.scala:20)
      	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
      	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
      	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
      	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
      	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
      ...
      Caused by: java.lang.ArithmeticException: BigInteger out of long range
      	at java.math.BigInteger.longValueExact(BigInteger.java:4531)
      	at org.apache.spark.sql.types.Decimal.set(Decimal.scala:140)
      	at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:434)
      	at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
      	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
      	... 59 more
      ```
      
      ## How was this patch tested?
      
      Add new test suite into `DecimalSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #17684 from kiszk/SPARK-20341.
    • [SPARK-20412] Throw ParseException from visitNonOptionalPartitionSpec instead of returning null values · c9e6035e
      Juliusz Sompolski authored
      
      ## What changes were proposed in this pull request?
      
      If a partitionSpec is supposed to not contain optional values, a ParseException should be thrown, and not nulls returned.
      The nulls can later cause NullPointerExceptions in places not expecting them.
      
      ## How was this patch tested?
      
      A query like "SHOW PARTITIONS tbl PARTITION(col1='val1', col2)" used to throw a NullPointerException.
      Now it throws a ParseException.
      
      Author: Juliusz Sompolski <julek@databricks.com>
      
      Closes #17707 from juliuszsompolski/SPARK-20412.
    • Small rewording about history server use case · 34767997
      Hervé authored
      Hello,
      PR #10991 removed the built-in history view from Spark Standalone, so the history server is no longer useful only to YARN or Mesos.
      
      Author: Hervé <dud225@users.noreply.github.com>
      
      Closes #17709 from dud225/patch-1.
    • [SPARK-20420][SQL] Add events to the external catalog · e2b3d236
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      It is often useful to be able to track changes to the `ExternalCatalog`. This PR makes the `ExternalCatalog` emit events when a catalog object is changed. Events are fired before and after the change.
      
      The following events are fired per object:
      
      - Database
        - CreateDatabasePreEvent: event fired before the database is created.
        - CreateDatabaseEvent: event fired after the database has been created.
        - DropDatabasePreEvent: event fired before the database is dropped.
        - DropDatabaseEvent: event fired after the database has been dropped.
      - Table
        - CreateTablePreEvent: event fired before the table is created.
        - CreateTableEvent: event fired after the table has been created.
        - RenameTablePreEvent: event fired before the table is renamed.
        - RenameTableEvent: event fired after the table has been renamed.
        - DropTablePreEvent: event fired before the table is dropped.
        - DropTableEvent: event fired after the table has been dropped.
      - Function
        - CreateFunctionPreEvent: event fired before the function is created.
        - CreateFunctionEvent: event fired after the function has been created.
        - RenameFunctionPreEvent: event fired before the function is renamed.
        - RenameFunctionEvent: event fired after the function has been renamed.
        - DropFunctionPreEvent: event fired before the function is dropped.
        - DropFunctionEvent: event fired after the function has been dropped.
      
      The events currently only contain the names of the modified object. We can add more events, and more details, at a later point.
      
      A user can monitor changes to the external catalog by adding a listener to the Spark listener bus and checking for `ExternalCatalogEvent`s using the `SparkListener.onOtherEvent` hook. A more direct approach is to add a listener directly to the `ExternalCatalog`.
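
      For example, a listener-bus based monitor might look roughly like this (a sketch; the import path for the event trait is an assumption):
      ```
      import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
      import org.apache.spark.sql.catalyst.catalog.ExternalCatalogEvent // package path assumed

      // Illustrative listener that logs every external catalog change.
      class CatalogChangeListener extends SparkListener {
        override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
          case e: ExternalCatalogEvent => println(s"Catalog event: $e")
          case _ => // ignore everything else
        }
      }

      // spark.sparkContext.addSparkListener(new CatalogChangeListener())
      ```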
      
      ## How was this patch tested?
      Added the `ExternalCatalogEventSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #17710 from hvanhovell/SPARK-20420.
  7. Apr 20, 2017
    • [SPARK-20281][SQL] Print the identical Range parameters of SparkContext APIs and SQL in explain · 48d760d0
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR modified the code to print identical `Range` parameters for the SparkContext APIs and SQL in `explain` output. In the current master, although both internally use `defaultParallelism` for `splits` by default, they print different strings in the explain output:
      
      ```
      scala> spark.range(4).explain
      == Physical Plan ==
      *Range (0, 4, step=1, splits=Some(8))
      
      scala> sql("select * from range(4)").explain
      == Physical Plan ==
      *Range (0, 4, step=1, splits=None)
      ```
      
      ## How was this patch tested?
      Added tests in `SQLQuerySuite` and modified some results in the existing tests.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17670 from maropu/SPARK-20281.
    • [SPARK-20329][SQL] Make timezone aware expression without timezone unresolved · 760c8d08
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      A cast expression with a resolved time zone is not equal to a cast expression without a resolved time zone. The `ResolveAggregateFunction` rule assumed that these expressions were the same, and would fail to resolve `HAVING` clauses which contain a `Cast` expression.
      
      This is in essence caused by the fact that a `TimeZoneAwareExpression` can be resolved without a set time zone. This PR fixes this, and makes a `TimeZoneAwareExpression` unresolved as long as it has no TimeZone set.
      
      ## How was this patch tested?
      Added a regression test to the `SQLQueryTestSuite.having` file.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #17641 from hvanhovell/SPARK-20329.
    • [SPARK-20367] Properly unescape column names of partitioning columns parsed from paths · 0368eb9d
      Juliusz Sompolski authored
      ## What changes were proposed in this pull request?
      
      When inferring the partitioning schema from paths, the column name in parsePartitionColumn should be unescaped with unescapePathName, just as is done in e.g. parsePathFragmentAsSeq.
      
      ## How was this patch tested?
      
      Added a test to FileIndexSuite.
      
      Author: Juliusz Sompolski <julek@databricks.com>
      
      Closes #17703 from juliuszsompolski/SPARK-20367.
    • [SPARK-20172][CORE] Add file permission check when listing files in FsHistoryProvider · 592f5c89
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      In the current Spark HistoryServer we expect to get an `AccessControlException` while listing all the files, but unfortunately it does not work because we don't actually check the access permission and no other calls throw such an exception. What is worse is that this check is deferred until the files are read, which is unnecessary and quite verbose, since the exception is printed out every 10 seconds when checking the files.
      
      So with this fix, we actually check the read permission while listing the files, which avoids unnecessary file reads later on and suppresses the verbose log.
      
      ## How was this patch tested?
      
      Add unit test to verify.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17495 from jerryshao/SPARK-20172.
    • [SPARK-20410][SQL] Make sparkConf a def in SharedSQLContext · 03320635
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      It is kind of annoying that `SharedSQLContext.sparkConf` is a val when overriding test cases, because you cannot call `super` on it. This PR makes it a function.
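
      The pattern this enables, sketched with an illustrative config key:
      ```
      import org.apache.spark.SparkConf

      trait BaseSuite {
        // Now a def, so subclasses can extend it rather than replace it.
        protected def sparkConf: SparkConf = new SparkConf().setMaster("local[*]")
      }

      class MySuite extends BaseSuite {
        protected override def sparkConf: SparkConf =
          super.sparkConf.set("spark.sql.shuffle.partitions", "5") // illustrative setting
      }
      ```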
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #17705 from hvanhovell/SPARK-20410.
    • [SPARK-20334][SQL] Return a better error message when correlated predicates contain an aggregate expression that has a mixture of outer and local references · d95e4d9d
      Dilip Biswal authored
      
      ## What changes were proposed in this pull request?
      Address a follow-up in this [comment](https://github.com/apache/spark/pull/16954#discussion_r105718880).
      Currently, subqueries with correlated predicates containing an aggregate expression that has a mixture of outer and local references generate a codegen error like the following:
      
      ```SQL
      SELECT t1a
      FROM   t1
      GROUP  BY 1
      HAVING EXISTS (SELECT 1
                     FROM  t2
                     WHERE t2a < min(t1a + t2a));
      ```
      Exception snippet.
      ```
      Cannot evaluate expression: min((input[0, int, false] + input[4, int, false]))
      	at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226)
      	at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
      	at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
      	at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
      	at scala.Option.getOrElse(Option.scala:121)
      	at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103)
      
      ```
      After this PR, a better error message is issued.
      ```
      org.apache.spark.sql.AnalysisException
      Error in query: Found an aggregate expression in a correlated
      predicate that has both outer and local references, which is not supported yet.
      Aggregate expression: min((t1.`t1a` + t2.`t2a`)),
      Outer references: t1.`t1a`,
      Local references: t2.`t2a`.;
      ```
      ## How was this patch tested?
      Added tests in SQLQueryTestSuite.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #17636 from dilipbiswal/subquery_followup1.
    • [SPARK-20358][CORE] Executors failing stage on interrupted exception thrown by cancelled tasks · b2ebadfd
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This was a regression introduced by my earlier PR here: https://github.com/apache/spark/pull/17531
      
      It turns out NonFatal() does not in fact catch InterruptedException.
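
      A quick standalone illustration of that behaviour:
      ```
      import scala.util.control.NonFatal

      // InterruptedException is treated as fatal, so the NonFatal extractor does not match it.
      def classify(t: Throwable): String = t match {
        case NonFatal(e) => s"non-fatal: ${e.getClass.getSimpleName}"
        case e           => s"fatal (not matched by NonFatal): ${e.getClass.getSimpleName}"
      }

      println(classify(new RuntimeException("boom")))     // non-fatal: RuntimeException
      println(classify(new InterruptedException("stop"))) // fatal (not matched by NonFatal): InterruptedException
      ```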
      
      ## How was this patch tested?
      
      Extended cancellation unit test coverage. The first test fails before this patch.
      
      cc JoshRosen mridulm
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #17659 from ericl/spark-20358.