Commits · b31b30209259bf0efa571d793a94c0fbc401de3d · cs525-sp18-g07 / spark

Aug 01, 2017

[SPARK-21522][CORE] Fix flakiness in LauncherServerSuite. · b31b3020

Marcelo Vanzin authored 7 years ago


Handle the case where the server closes the socket before the full message
has been written by the client.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #18727 from vanzin/SPARK-21522.

(cherry picked from commit b1335018)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

b31b3020

Jul 29, 2017

[SPARK-21555][SQL] RuntimeReplaceable should be compared semantically by its canonicalized child · 78f7cdfa

Liang-Chi Hsieh authored 7 years ago


## What changes were proposed in this pull request?

When there are aliases (these aliases were added for nested fields) as parameters in `RuntimeReplaceable`, as they are not in the children expression, those aliases can't be cleaned up in analyzer rule `CleanupAliases`.

An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group by query because they contain different aliases.

Because those aliases are not children of `RuntimeReplaceable` which is an `UnaryExpression`. So we can't trim the aliases out by simple transforming the expressions in `CleanupAliases`.

If we want to replace the non-children aliases in `RuntimeReplaceable`, we need to add more codes to `RuntimeReplaceable` and modify all expressions of `RuntimeReplaceable`. It makes the interface ugly IMO.

Consider those aliases will be replaced later at optimization and so they're no harm, this patch chooses to simply override `canonicalized` of `RuntimeReplaceable`.

One concern is about `CleanupAliases`. Because it actually cannot clean up ALL aliases inside a plan. To make caller of this rule notice that, this patch adds a comment to `CleanupAliases`.

## How was this patch tested?

Added test.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #18761 from viirya/SPARK-21555.

(cherry picked from commit 9c8109ef)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

78f7cdfa

Jul 28, 2017
- Revert "[SPARK-21306][ML] OneVsRest should support setWeightCol" · 258ca40c
  Yanbo Liang authored 7 years ago
  
  This reverts commit 8520d7c6.
  258ca40c
Jul 27, 2017

[SPARK-21306][ML] OneVsRest should support setWeightCol · 8520d7c6

Yan Facai (颜发才) authored 7 years ago


## What changes were proposed in this pull request?

add `setWeightCol` method for OneVsRest.

`weightCol` is ignored if classifier doesn't inherit HasWeightCol trait.

## How was this patch tested?

+ [x] add an unit test.

Author: Yan Facai (颜发才) <facai.yan@gmail.com>

Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol.

(cherry picked from commit a5a31899)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>

8520d7c6

Jul 19, 2017

[SPARK-21446][SQL] Fix setAutoCommit never executed · 94987987

DFFuture authored 7 years ago

## What changes were proposed in this pull request?
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-21446


options.asConnectionProperties can not have fetchsize，because fetchsize belongs to Spark-only options, and Spark-only options have been excluded in connection properities.
So change properties of beforeFetch from  options.asConnectionProperties.asScala.toMap to options.asProperties.asScala.toMap

## How was this patch tested?

Author: DFFuture <albert.zhang23@gmail.com>

Closes #18665 from DFFuture/sparksql_pg.

(cherry picked from commit c9729187)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

94987987

[SPARK-21441][SQL] Incorrect Codegen in SortMergeJoinExec results failures in some cases · ac206934

donnyzone authored 7 years ago

## What changes were proposed in this pull request?

https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21441



This issue can be reproduced by the following example:

```
val spark = SparkSession
   .builder()
   .appName("smj-codegen")
   .master("local")
   .config("spark.sql.autoBroadcastJoinThreshold", "1")
   .getOrCreate()
val df1 = spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3))).toDF("key", "int")
val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "str")
val df = df1.join(df2, df1("key") === df2("key"))
   .filter("int = 2 or reflect('java.lang.Integer', 'valueOf', str) = 1")
   .select("int")
   df.show()
```

To conclude, the issue happens when:
(1) SortMergeJoin condition contains CodegenFallback expressions.
(2) In PhysicalPlan tree, SortMergeJoin node  is the child of root node, e.g., the Project in above example.

This patch fixes the logic in `CollapseCodegenStages` rule.

## How was this patch tested?
Unit test and manual verification in our cluster.

Author: donnyzone <wellfengzhu@gmail.com>

Closes #18656 from DonnyZone/Fix_SortMergeJoinExec.

(cherry picked from commit 6b6dd682)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

ac206934

Jul 17, 2017

[SPARK-21332][SQL] Incorrect result type inferred for some decimal expressions · caf32b3c

aokolnychyi authored 7 years ago


## What changes were proposed in this pull request?

This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was in visiting outer nodes before their children. Consider the example below:

```
    val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil)
    val sc = spark.sparkContext
    val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12)))
    val df = spark.createDataFrame(rdd, inputSchema)

    // Works correctly since no nested decimal expression is involved
    // Expected result type: (26, 6) * (26, 6) = (38, 12)
    df.select($"col" * $"col").explain(true)
    df.select($"col" * $"col").printSchema()

    // Gives a wrong result since there is a nested decimal expression that should be visited first
    // Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18)
    df.select($"col" * $"col" * $"col").explain(true)
    df.select($"col" * $"col" * $"col").printSchema()
```

The example above gives the following output:

```
// Correct result without sub-expressions
== Parsed Logical Plan ==
'Project [('col * 'col) AS (col * col)#4]
+- LogicalRDD [col#1]

== Analyzed Logical Plan ==
(col * col): decimal(38,12)
Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)#4]
+- LogicalRDD [col#1]

== Optimized Logical Plan ==
Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
+- LogicalRDD [col#1]

== Physical Plan ==
*Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
+- Scan ExistingRDD[col#1]

// Schema
root
 |-- (col * col): decimal(38,12) (nullable = true)

// Incorrect result with sub-expressions
== Parsed Logical Plan ==
'Project [(('col * 'col) * 'col) AS ((col * col) * col)#11]
+- LogicalRDD [col#1]

== Analyzed Logical Plan ==
((col * col) * col): decimal(38,12)
Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)#11]
+- LogicalRDD [col#1]

== Optimized Logical Plan ==
Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
+- LogicalRDD [col#1]

== Physical Plan ==
*Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
+- Scan ExistingRDD[col#1]

// Schema
root
 |-- ((col * col) * col): decimal(38,12) (nullable = true)
```

## How was this patch tested?

This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios.

Author: aokolnychyi <anton.okolnychyi@sap.com>

Closes #18583 from aokolnychyi/spark-21332.

(cherry picked from commit 0be5fb41)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

caf32b3c

[SPARK-19104][BACKPORT-2.1][SQL] Lambda variables in ExternalMapToCatalyst should be global · a9efce46

Kazuaki Ishizaki authored 7 years ago

## What changes were proposed in this pull request?

This PR is backport of #18418 to Spark 2.1. [SPARK-21391](https://issues.apache.org/jira/browse/SPARK-21391) reported this problem in Spark 2.1.

The issue happens in `ExternalMapToCatalyst`. For example, the following codes create ExternalMap`ExternalMapToCatalyst`ToCatalyst to convert Scala Map to catalyst map format.

```
val data = Seq.tabulate(10)(i => NestedData(1, Map("key" -> InnerData("name", i + 100))))
val ds = spark.createDataset(data)
```
The `valueConverter` in `ExternalMapToCatalyst` looks like:

```
if (isnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true))) null else named_struct(name, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).name, true), value, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).value)
```
There is a `CreateNamedStruct` expression (`named_struct`) to create a row of `InnerData.name` and `InnerData.value` that are referred by `ExternalMapToCatalyst_value52`.

Because `ExternalMapToCatalyst_value52` are local variable, when `CreateNamedStruct` splits expressions to individual functions, the local variable can't be accessed anymore.

## How was this patch tested?

Added a new test suite into `DatasetPrimitiveSuite`

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #18627 from kiszk/SPARK-21391.

a9efce46

Jul 14, 2017

[SPARK-21344][SQL] BinaryType comparison does signed byte array comparison · ca4d2aa3

Kazuaki Ishizaki authored 7 years ago


## What changes were proposed in this pull request?

This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations uses signed operations.

## How was this patch tested?

Added a test suite in `OrderingSuite`.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #18571 from kiszk/SPARK-21344.

(cherry picked from commit ac5d5d79)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

ca4d2aa3

Jul 09, 2017

[SPARK-21083][SQL][BRANCH-2.1] Store zero size and row count when analyzing empty table · 2c284624

Zhenhua Wang authored 7 years ago

## What changes were proposed in this pull request?

We should be able to store zero size and row count after analyzing empty table.
This is a backport for https://github.com/apache/spark/commit/9fccc3627fa41d32fbae6dbbb9bd1521e43eb4f0.

## How was this patch tested?

Added new test.

Author: Zhenhua Wang <wzh_zju@163.com>

Closes #18577 from wzhfy/analyzeEmptyTable-2.1.

2c284624

Jul 08, 2017

[SPARK-21345][SQL][TEST][TEST-MAVEN][BRANCH-2.1] SparkSessionBuilderSuite... · 5e2bfd5b

Dongjoon Hyun authored 7 years ago

[SPARK-21345][SQL][TEST][TEST-MAVEN][BRANCH-2.1] SparkSessionBuilderSuite should clean up stopped sessions.

## What changes were proposed in this pull request?

`SparkSessionBuilderSuite` should clean up stopped sessions. Otherwise, it leaves behind some stopped `SparkContext`s interfereing with other test suites using `ShardSQLContext`.

Recently, master branch fails consequtively.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/

## How was this patch tested?

Pass the Jenkins with a updated suite.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #18572 from dongjoon-hyun/SPARK-21345-BRANCH-2.1.

5e2bfd5b

Jul 06, 2017

[SPARK-21312][SQL] correct offsetInBytes in UnsafeRow.writeToStream · 7f7b63bb

Sumedh Wale authored 7 years ago


## What changes were proposed in this pull request?

Corrects offsetInBytes calculation in UnsafeRow.writeToStream. Known failures include writes to some DataSources that have own SparkPlan implementations and cause EXCHANGE in writes.

## How was this patch tested?

Extended UnsafeRowSuite.writeToStream to include an UnsafeRow over byte array having non-zero offset.

Author: Sumedh Wale <swale@snappydata.io>

Closes #18535 from sumwale/SPARK-21312.

(cherry picked from commit 14a3bb3a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

7f7b63bb

Jul 04, 2017

[SPARK-20256][SQL][BRANCH-2.1] SessionState should be created more lazily · 8f1ca695

Dongjoon Hyun authored 7 years ago

## What changes were proposed in this pull request?

`SessionState` is designed to be created lazily. However, in reality, it created immediately in `SparkSession.Builder.getOrCreate` ([here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L943)).

This PR aims to recover the lazy behavior by keeping the options into `initialSessionOptions`. The benefit is like the following. Users can start `spark-shell` and use RDD operations without any problems.

**BEFORE**
```scala
$ bin/spark-shell
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'
...
Caused by: org.apache.spark.sql.AnalysisException:
    org.apache.hadoop.hive.ql.metadata.HiveException:
       MetaException(message:java.security.AccessControlException:
          Permission denied: user=spark, access=READ,
             inode="/apps/hive/warehouse":hive:hdfs:drwx------
```
As reported in SPARK-20256, this happens when the warehouse directory is not allowed for this user.

**AFTER**
```scala
$ bin/spark-shell
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.2-SNAPSHOT
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc.range(0, 10, 1).count()
res0: Long = 10
```

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #18530 from dongjoon-hyun/SPARK-20256-BRANCH-2.1.

8f1ca695

Jun 30, 2017
- Revert "[SPARK-21258][SQL] Fix WindowExec complex object aggregation with spilling" · 3ecef249
  Wenchen Fan authored 7 years ago
  
  This reverts commit d995dac1.
  3ecef249
Jun 29, 2017

[SPARK-21258][SQL] Fix WindowExec complex object aggregation with spilling · d995dac1

Herman van Hovell authored 7 years ago

## What changes were proposed in this pull request?
`WindowExec` currently improperly stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) during aggregation by keeping a reference in the buffer used by `GeneratedMutableProjections` to the actual input data. Things go wrong when the input object (or the backing bytes) are reused for other things. This could happen in window functions when it starts spilling to disk. When reading the back the spill files the `UnsafeSorterSpillReader` reuses the buffer to which the `UnsafeRow` points, leading to weird corruption scenario's. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`.

This was not seen before, because the spilling logic was not doing actual spills as much and actually used an in-memory page. This page was not cleaned up during window processing and made sure unsafe objects point to their own dedicated memory location. This was changed by https://github.com/apache/spark/pull/16909

, after this PR Spark spills more eagerly.

This PR provides a surgical fix because we are close to releasing Spark 2.2. This change just makes sure that there cannot be any object reuse at the expensive of a little bit of performance. We will follow-up with a more subtle solution at a later point.

## How was this patch tested?
Added a regression test to `DataFrameWindowFunctionsSuite`.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #18470 from hvanhovell/SPARK-21258.

(cherry picked from commit e2f32ee4)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

d995dac1

[SPARK-21176][WEB UI] Limit number of selector threads for admin ui proxy servlets to 8 · 083adb07

IngoSchuster authored 7 years ago

## What changes were proposed in this pull request?
Please see also https://issues.apache.org/jira/browse/SPARK-21176

This change limits the number of selector threads that jetty creates to maximum 8 per proxy servlet (Jetty default is number of processors / 2).
The newHttpClient for Jettys ProxyServlet class is overwritten to avoid the Jetty defaults (which are designed for high-performance http servers).
Once https://github.com/eclipse/jetty.project/issues/1643

 is available, the code could be cleaned up to avoid the method override.

I really need this on v2.1.1 - what is the best way for a backport automatic merge works fine)? Shall I create another PR?

## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
The patch was tested manually on a Spark cluster with a head node that has 88 processors using JMX to verify that the number of selector threads is now limited to 8 per proxy.

gurvindersingh zsxwing can you please review the change?

Author: IngoSchuster <ingo.schuster@de.ibm.com>
Author: Ingo Schuster <ingo.schuster@de.ibm.com>

Closes #18437 from IngoSchuster/master.

(cherry picked from commit 88a536ba)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

083adb07

Jun 25, 2017

Revert "[SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant... · 26f4f340

Wenchen Fan authored 7 years ago

Revert "[SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant Pool Limit - Class Splitting"

This reverts commit 6b37c863.

26f4f340

Jun 24, 2017

[SPARK-21203][SQL] Fix wrong results of insertion of Array of Struct · 0d6b701e

gatorsmile authored 7 years ago


### What changes were proposed in this pull request?
```SQL
CREATE TABLE `tab1`
(`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
USING parquet

INSERT INTO `tab1`
SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))

SELECT custom_fields.id, custom_fields.value FROM tab1
```

The above query always return the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is we always use the same `GenericInternalRow` object when doing the cast.

### How was this patch tested?

Author: gatorsmile <gatorsmile@gmail.com>

Closes #18412 from gatorsmile/castStruct.

(cherry picked from commit 2e1586f6)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

0d6b701e

[SPARK-21159][CORE] Don't try to connect to launcher in standalone cluster mode. · 6750db3f

Marcelo Vanzin authored 7 years ago


Monitoring for standalone cluster mode is not implemented (see SPARK-11033), but
the same scheduler implementation is used, and if it tries to connect to the
launcher it will fail. So fix the scheduler so it only tries that in client mode;
cluster mode applications will be correctly launched and will work, but monitoring
through the launcher handle will not be available.

Tested by running a cluster mode app with "SparkLauncher.startApplication".

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #18397 from vanzin/SPARK-21159.

(cherry picked from commit bfd73a7c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

6750db3f

[SPARK-20555][SQL] Fix mapping of Oracle DECIMAL types to Spark types in read path · f12883e3

Gabor Feher authored 7 years ago

This PR is to revert some code changes in the read path of https://github.com/apache/spark/pull/14377. The original fix is https://github.com/apache/spark/pull/17830

When merging this PR, please give the credit to gaborfeher

Added a test case to OracleIntegrationSuite.scala

Author: Gabor Feher <gabor.feher@lynxanalytics.com>
Author: gatorsmile <gatorsmile@gmail.com>

Closes #18408 from gatorsmile/OracleType.

f12883e3

Jun 23, 2017

[MINOR][DOCS] Docs in DataFrameNaFunctions.scala use wrong method · bcaf06c4

Ong Ming Yang authored 7 years ago


## What changes were proposed in this pull request?

* Following the first few examples in this file, the remaining methods should also be methods of `df.na` not `df`.
* Filled in some missing parentheses

## How was this patch tested?

N/A

Author: Ong Ming Yang <me@ongmingyang.com>

Closes #18398 from ongmingyang/master.

(cherry picked from commit 4cc62951)
Signed-off-by: Xiao Li <gatorsmile@gmail.com>

bcaf06c4

[SPARK-21181] Release byteBuffers to suppress netty error messages · f8fd3b48

Dhruve Ashar authored 7 years ago

## What changes were proposed in this pull request?
We are explicitly calling release on the byteBuf's used to encode the string to Base64 to suppress the memory leak error message reported by netty. This is to make it less confusing for the user.

### Changes proposed in this fix
By explicitly invoking release on the byteBuf's we are decrement the internal reference counts for the wrappedByteBuf's. Now, when the GC kicks in, these would be reclaimed as before, just that netty wouldn't report any memory leak error messages as the internal ref. counts are now 0.

## How was this patch tested?
Ran a few spark-applications and examined the logs. The error message no longer appears.

Original PR was opened against branch-2.1 => https://github.com/apache/spark/pull/18392



Author: Dhruve Ashar <dhruveashar@gmail.com>

Closes #18407 from dhruve/master.

(cherry picked from commit 1ebe7ffe)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

f8fd3b48

Jun 22, 2017

[SPARK-21167][SS] Decode the path generated by File sink to handle special characters · 1a98d5d0

Shixiong Zhu authored 7 years ago


## What changes were proposed in this pull request?

Decode the path generated by File sink to handle special characters.

## How was this patch tested?

The added unit test.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #18381 from zsxwing/SPARK-21167.

(cherry picked from commit d66b143e)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

1a98d5d0

[SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant Pool Limit - Class Splitting · 6b37c863

ALeksander Eskilson authored 7 years ago

## What changes were proposed in this pull request?

This is a backport patch for Spark 2.1.x of the class splitting feature over excess generated code as was merged in #18075.

## How was this patch tested?

The same test provided in #18075 is included in this patch.

Author: ALeksander Eskilson <alek.eskilson@cerner.com>

Closes #18354 from bdrillard/class_splitting_2.1.

6b37c863

Jun 20, 2017

[SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are... · 8923bac1

assafmendelson authored 7 years ago

[SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are in a wrong table - version to fix 2.1

## What changes were proposed in this pull request?

The description for several options of File Source for structured streaming appeared in the File Sink description instead.

This commit continues on PR #18342 and targets the fixes for the documentation of version spark version 2.1

## How was this patch tested?

Built the documentation by SKIP_API=1 jekyll build and visually inspected the structured streaming programming guide.

zsxwing This is the PR to fix version 2.1 as discussed in PR #18342

Author: assafmendelson <assaf.mendelson@gmail.com>

Closes #18363 from assafmendelson/spark-21123-for-spark2.1.

8923bac1

Jun 19, 2017

[SPARK-21138][YARN] Cannot delete staging dir when the clusters of... · 7799f35d

sharkdtu authored 7 years ago

[SPARK-21138][YARN] Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different

## What changes were proposed in this pull request?

When I set different clusters for "spark.hadoop.fs.defaultFS" and "spark.yarn.stagingDir" as follows：
```
spark.hadoop.fs.defaultFS  hdfs://tl-nn-tdw.tencent-distribute.com:54310
spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark
```
The staging dir can not be deleted, it will prompt following message:
```
java.lang.IllegalArgumentException: Wrong FS: hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310


```

## How was this patch tested?

Existing tests

Author: sharkdtu <sharkdtu@tencent.com>

Closes #18352 from sharkdtu/master.

(cherry picked from commit 3d4d11a8)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

7799f35d

[SPARK-19688][STREAMING] Not to read `spark.yarn.credentials.file` from checkpoint. · a44c118e

saturday_s authored 7 years ago

## What changes were proposed in this pull request?

Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint.

## How was this patch tested?

Manual tested with 1.6.3 and 2.1.1.
I didn't test this with master because of some compile problems, but I think it will be the same result.

## Notice

This should be merged into maintenance branches too.

jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008

)

Author: saturday_s <shi.indetail@gmail.com>

Closes #18230 from saturday-shi/SPARK-21008.

(cherry picked from commit e92ffe6f)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

a44c118e

Jun 15, 2017

[SPARK-21114][TEST][2.1] Fix test failure in Spark 2.1/2.0 due to name mismatch · 0ebb3b84

gatorsmile authored 7 years ago

## What changes were proposed in this pull request?
Name mismatch between 2.1/2.0 and 2.2. Thus, the test cases failed after we backport a fix to 2.1/2.0. This PR is to fix the issue.

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-maven-hadoop-2.2/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/

## How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #18319 from gatorsmile/fixDecimal.

0ebb3b84

[SPARK-21072][SQL] TreeNode.mapChildren should only apply to the children node. · 915a2010

Xianyang Liu authored 7 years ago

## What changes were proposed in this pull request?

Just as the function name and comments of `TreeNode.mapChildren` mentioned, the function should be apply to all currently node children. So, the follow code should judge whether it is the children node.

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342



## How was this patch tested?

Existing tests.

Author: Xianyang Liu <xianyang.liu@intel.com>

Closes #18284 from ConeyLiu/treenode.

(cherry picked from commit 87ab0cec)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

915a2010

[SPARK-16251][SPARK-20200][CORE][TEST] Flaky test:... · 62f2b804

Xingbo Jiang authored 7 years ago

[SPARK-16251][SPARK-20200][CORE][TEST] Flaky test: org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails with informative message

## What changes were proposed in this pull request?

Currently we don't wait to confirm the removal of the block from the slave's BlockManager, if the removal takes too much time, we will fail the assertion in this test case.
The failure can be easily reproduced if we sleep for a while before we remove the block in BlockManagerSlaveEndpoint.receiveAndReply().

## How was this patch tested?
N/A

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes #18314 from jiangxb1987/LocalCheckpointSuite.

(cherry picked from commit 7dc3e697)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

62f2b804

Jun 14, 2017

[SPARK-20211][SQL][BACKPORT-2.2] Fix the Precision and Scale of Decimal Values... · a890466b

gatorsmile authored 7 years ago

[SPARK-20211][SQL][BACKPORT-2.2] Fix the Precision and Scale of Decimal Values when the Input is BigDecimal between -1.0 and 1.0

### What changes were proposed in this pull request?

This PR is to backport https://github.com/apache/spark/pull/18244 to 2.2

---

The precision and scale of decimal values are wrong when the input is BigDecimal between -1.0 and 1.0.

The BigDecimal's precision is the digit count starts from the leftmost nonzero digit based on the [JAVA's BigDecimal definition](https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html

). However, our Decimal decision follows the database decimal standard, which is the total number of digits, including both to the left and the right of the decimal point. Thus, this PR is to fix the issue by doing the conversion.

Before this PR, the following queries failed:
```SQL
select 1 > 0.0001
select floor(0.0001)
select ceil(0.0001)
```

### How was this patch tested?
Added test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #18297 from gatorsmile/backport18244.

(cherry picked from commit 62651195)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

a890466b

Jun 13, 2017

[SPARK-21064][CORE][TEST] Fix the default value bug in NettyBlockTransferServiceSuite · ee0e74e6

DjvuLee authored 7 years ago


## What changes were proposed in this pull request?

The default value for `spark.port.maxRetries` is 100,
but we use 10 in the suite file.
So we change it to 100 to avoid test failure.

## How was this patch tested?
No test

Author: DjvuLee <lihu@bytedance.com>

Closes #18280 from djvulee/NettyTestBug.

(cherry picked from commit b36ce2a2)
Signed-off-by: Sean Owen <sowen@cloudera.com>

ee0e74e6

[SPARK-20920][SQL] ForkJoinPool pools are leaked when writing hive tables with many partitions · 58a8a379

Sean Owen authored 7 years ago


## What changes were proposed in this pull request?

Don't leave thread pool running from AlterTableRecoverPartitionsCommand DDL command

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #18216 from srowen/SPARK-20920.

(cherry picked from commit 7b7c85ed)
Signed-off-by: Sean Owen <sowen@cloudera.com>

58a8a379

Jun 08, 2017

[SPARK-20914][DOCS] Javadoc contains code that is invalid · 03cc18ba

Sean Owen authored 7 years ago


## What changes were proposed in this pull request?

Fix Java, Scala Dataset examples in scaladoc, which didn't compile.

## How was this patch tested?

Existing compilation/test

Author: Sean Owen <sowen@cloudera.com>

Closes #18215 from srowen/SPARK-20914.

(cherry picked from commit 847efe12)
Signed-off-by: Sean Owen <sowen@cloudera.com>

03cc18ba

Jun 03, 2017

[SPARK-20974][BUILD] we should run REPL tests if SQL module has code changes · afab8557

Wenchen Fan authored 7 years ago


## What changes were proposed in this pull request?

REPL module depends on SQL module, so we should run REPL tests if SQL module has code changes.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #18191 from cloud-fan/test.

(cherry picked from commit 864d94fe)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

afab8557

Jun 01, 2017

[SPARK-20922][CORE][HOTFIX] Don't use Java 8 lambdas in older branches. · 0b25a7d9
Marcelo Vanzin authored 7 years ago
```
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #18178 from vanzin/SPARK-20922-hotfix.
```
0b25a7d9

[SPARK-20922][CORE] Add whitelist of classes that can be deserialized by the launcher. · 772a9b96

Marcelo Vanzin authored 7 years ago


Blindly deserializing classes using Java serialization opens the code up to
issues in other libraries, since just deserializing data from a stream may
end up execution code (think readObject()).

Since the launcher protocol is pretty self-contained, there's just a handful
of classes it legitimately needs to deserialize, and they're in just two
packages, so add a filter that throws errors if classes from any other
package show up in the stream.

This also maintains backwards compatibility (the updated launcher code can
still communicate with the backend code in older Spark releases).

Tested with new and existing unit tests.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #18166 from vanzin/SPARK-20922.

(cherry picked from commit 8efc6e98)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

772a9b96

May 31, 2017

[SPARK-20940][CORE] Replace IllegalAccessError with IllegalStateException · dade85f7

Shixiong Zhu authored 7 years ago

## What changes were proposed in this pull request?

`IllegalAccessError` is a fatal error (a subclass of LinkageError) and its meaning is `Thrown if an application attempts to access or modify a field, or to call a method that it does not have access to`. Throwing a fatal error for AccumulatorV2 is not necessary and is pretty bad because it usually will just kill executors or SparkContext ([SPARK-20666](https://issues.apache.org/jira/browse/SPARK-20666

) is an example of killing SparkContext due to `IllegalAccessError`). I think the correct type of exception in AccumulatorV2 should be `IllegalStateException`.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #18168 from zsxwing/SPARK-20940.

(cherry picked from commit 24db3582)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

dade85f7

May 30, 2017

[SPARK-20275][UI] Do not display "Completed" column for in-progress applications · 46400867

jerryshao authored 7 years ago

## What changes were proposed in this pull request?

Current HistoryServer will display completed date of in-progress application as `1969-12-31 23:59:59`, which is not so meaningful. Instead of unnecessarily showing this incorrect completed date, here propose to make this column invisible for in-progress applications.

The purpose of only making this column invisible rather than deleting this field is that: this data is fetched through REST API, and in the REST API  the format is like below shows, in which `endTime` matches `endTimeEpoch`. So instead of changing REST API to break backward compatibility, here choosing a simple solution to only make this column invisible.

```
[ {
  "id" : "local-1491805439678",
  "name" : "Spark shell",
  "attempts" : [ {
    "startTime" : "2017-04-10T06:23:57.574GMT",
    "endTime" : "1969-12-31T23:59:59.999GMT",
    "lastUpdated" : "2017-04-10T06:23:57.574GMT",
    "duration" : 0,
    "sparkUser" : "",
    "completed" : false,
    "startTimeEpoch" : 1491805437574,
    "endTimeEpoch" : -1,
    "lastUpdatedEpoch" : 1491805437574
  } ]
} ]%
```

Here is UI before changed:

<img width="1317" alt="screen shot 2017-04-10 at 3 45 57 pm" src="https://cloud.githubusercontent.com/assets/850797/24851938/17d46cc0-1e08-11e7-84c7-90120e171b41.png">

And after:

<img width="1281" alt="screen shot 2017-04-10 at 4 02 35 pm" src="https://cloud.githubusercontent.com/assets/850797/24851945/1fe9da58-1e08-11e7-8d0d-9262324f9074.png

">

## How was this patch tested?

Manual verification.

Author: jerryshao <sshao@hortonworks.com>

Closes #17588 from jerryshao/SPARK-20275.

(cherry picked from commit 52ed9b28)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

46400867

May 27, 2017

[SPARK-20393][WEBU UI] Strengthen Spark to prevent XSS vulnerabilities · 38f37c55

NICHOLAS T. MARION authored 7 years ago


Add stripXSS and stripXSSMap to Spark Core's UIUtils. Calling these functions at any point that getParameter is called against a HttpServletRequest.

Unit tests, IBM Security AppScan Standard no longer showing vulnerabilities, manual verification of WebUI pages.

Author: NICHOLAS T. MARION <nmarion@us.ibm.com>

Closes #17686 from n-marion/xss-fix.

(cherry picked from commit b512233a)
Signed-off-by: Sean Owen <sowen@cloudera.com>

38f37c55