  1. Sep 17, 2017
  2. Sep 13, 2017
  3. Sep 12, 2017
    • [SPARK-21976][DOC] Fix wrong documentation for Mean Absolute Error. · e7696ebe
      FavioVazquez authored
      ## What changes were proposed in this pull request?
      
      Fixed wrong documentation for Mean Absolute Error.
      
      Even though the code is correct for the MAE:
      
      ```scala
      Since("1.2.0")
        def meanAbsoluteError: Double = {
          summary.normL1(1) / summary.count
        }
      ```
      In the documentation the division by N is missing.
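
      For reference, the intended formula (the standard MAE definition, which the code above computes as `normL1 / count`) is:

      ```latex
      \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|
      ```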
      
      ## How was this patch tested?
      
      All of the Spark tests were run.
      
      Author: FavioVazquez <favio.vazquezp@gmail.com>
      Author: faviovazquez <favio.vazquezp@gmail.com>
      Author: Favio André Vázquez <favio.vazquezp@gmail.com>
      
      Closes #19190 from FavioVazquez/mae-fix.
      
      (cherry picked from commit e2ac2f1c)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      e7696ebe
  4. Sep 10, 2017
    • [SPARKR][BACKPORT-2.1] backporting package and test changes · ae4e8ae4
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Cherry-picking or manually porting the package and test changes to branch 2.1.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Wayne Zhang <actuaryzhang@uber.com>
      
      Closes #19165 from felixcheung/rbackportpkg21.
      ae4e8ae4
  5. Sep 08, 2017
  6. Aug 30, 2017
  7. Aug 24, 2017
  8. Aug 20, 2017
  9. Aug 15, 2017
    • [SPARK-21721][SQL][BACKPORT-2.1] Clear FileSystem deleteOnExit cache when... · 6f366fbb
      Liang-Chi Hsieh authored
      [SPARK-21721][SQL][BACKPORT-2.1] Clear FileSystem deleteOnExit cache when paths are successfully removed
      
      ## What changes were proposed in this pull request?
      
      Backport SPARK-21721 to branch 2.1:
      
      We put the staging path to delete into FileSystem's deleteOnExit cache in case the path can't be successfully removed. But when we do successfully remove the path, we don't remove it from the cache. We should, to keep the cache from growing continuously.
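
      A minimal sketch of the idea (the helper name and structure are illustrative, not the actual patch):

      ```scala
      import org.apache.hadoop.fs.{FileSystem, Path}

      // Once the staging path is successfully deleted, also drop it from the
      // deleteOnExit cache so the cache stops growing; only fall back to
      // deleteOnExit when the delete fails.
      def cleanupStagingDir(fs: FileSystem, stagingDir: Path): Unit = {
        if (fs.delete(stagingDir, true)) {
          fs.cancelDeleteOnExit(stagingDir)
        } else {
          fs.deleteOnExit(stagingDir)
        }
      }
      ```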
      
      ## How was this patch tested?
      
      Added test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18947 from viirya/SPARK-21721-backport-2.1.
      6f366fbb
  10. Aug 07, 2017
    • [SPARK-21306][ML] For branch 2.1, OneVsRest should support setWeightCol · 9b749b6c
      Yan Facai (颜发才) authored
      The PR is related to #18554, and is modified for branch 2.1.
      
      ## What changes were proposed in this pull request?
      
      add `setWeightCol` method for OneVsRest.
      
      `weightCol` is ignored if the classifier doesn't inherit the `HasWeightCol` trait.
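
      A usage sketch of the new setter (the column name "weight" is illustrative):

      ```scala
      import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

      val ovr = new OneVsRest()
        .setClassifier(new LogisticRegression()) // LogisticRegression mixes in HasWeightCol
        .setWeightCol("weight")                  // ignored if the classifier lacks HasWeightCol
      ```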
      
      ## How was this patch tested?
      
      + [x] Add a unit test.
      
      Author: Yan Facai (颜发才) <facai.yan@gmail.com>
      
      Closes #18763 from facaiy/BUG/branch-2.1_OneVsRest_support_setWeightCol.
      9b749b6c
    • [SPARK-18535][SPARK-19720][CORE][BACKPORT-2.1] Redact sensitive information · 444cca14
      Mark Grover authored
      ## What changes were proposed in this pull request?
      
      Backporting SPARK-18535 and SPARK-19720 to spark 2.1
      
      It's a backport PR that redacts sensitive information, based on configuration, in the Spark UI and spark-submit console logs.
      
      Using Mark Grover's (mark@apache.org) original PRs as reference.
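
      A simplified sketch of configuration-driven redaction (the regex and helper are illustrative; the real patches wire this through Spark's config system and the UI / console-log rendering):

      ```scala
      val redactionPattern = "(?i)secret|password|token|access[._]key".r

      // Replace the values of any matching keys before they are displayed or logged.
      def redact(kvs: Seq[(String, String)]): Seq[(String, String)] = kvs.map {
        case (key, _) if redactionPattern.findFirstIn(key).isDefined => (key, "*********(redacted)")
        case other => other
      }
      ```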
      
      ## How was this patch tested?
      
      The same tests from the original PRs were applied.
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #18802 from dmvieira/feature-redact.
      444cca14
  11. Aug 06, 2017
    • [SPARK-21588][SQL] SQLContext.getConf(key, null) should return null · 5634fadb
      vinodkc authored
      
      ## What changes were proposed in this pull request?
      
      SQLContext.getConf(key, null), for a key that is not defined in the conf and has no default value defined, throws an NPE. This happens only when the conf entry has a value converter.

      Added a null check on defaultValue inside SQLConf.getConfString to avoid calling entry.valueConverter(defaultValue).
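
      A minimal, self-contained sketch of the fix (names are simplified, not the actual SQLConf internals):

      ```scala
      object ConfSketch {
        private val settings = new java.util.concurrent.ConcurrentHashMap[String, String]()
        private val valueConverters = Map[String, String => Any](
          "spark.sql.shuffle.partitions" -> (_.toInt))

        def getConfString(key: String, defaultValue: String): String = {
          Option(settings.get(key)).getOrElse {
            // Previously the converter ran unconditionally and converter(null) threw an NPE;
            // only validate the default when it is non-null.
            valueConverters.get(key).foreach { convert =>
              if (defaultValue != null) convert(defaultValue)
            }
            defaultValue
          }
        }
      }
      ```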
      
      ## How was this patch tested?
      Added unit test
      
      Author: vinodkc <vinod.kc.in@gmail.com>
      
      Closes #18852 from vinodkc/br_Fix_SPARK-21588.
      
      (cherry picked from commit 1ba967b2)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      5634fadb
  12. Aug 04, 2017
    • [SPARK-21330][SQL] Bad partitioning does not allow to read a JDBC table with... · 734b144d
      Andrew Ray authored
      [SPARK-21330][SQL] Bad partitioning does not allow to read a JDBC table with extreme values on the partition column
      
      ## What changes were proposed in this pull request?
      
      An overflow of the difference of bounds on the partitioning column leads to no data being read. This
      patch checks for this overflow.
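
      An illustrative sketch of the overflow (not the actual JDBC relation code):

      ```scala
      // With extreme Long bounds the raw difference wraps around, so a naive stride
      // computation yields partitions that cover no data; BigInt arithmetic avoids it.
      val lowerBound = Long.MinValue
      val upperBound = Long.MaxValue
      val numPartitions = 4

      val naiveDelta = upperBound - lowerBound                              // wraps to -1
      val safeStride = (BigInt(upperBound) - BigInt(lowerBound)) / numPartitions
      println(s"naive delta = $naiveDelta, safe stride = $safeStride")
      ```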
      
      ## How was this patch tested?
      
      New unit test.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #18800 from aray/SPARK-21330.
      
      (cherry picked from commit 25826c77)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      734b144d
  13. Aug 02, 2017
    • [SPARK-12717][PYTHON][BRANCH-2.1] Adding thread-safe broadcast pickle registry · d93e45b8
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      When using PySpark broadcast variables in a multi-threaded environment, `SparkContext._pickled_broadcast_vars` becomes a shared resource. A race condition can occur when broadcast variables pickled from one thread get added to the shared `_pickled_broadcast_vars` and become part of the Python command from another thread. This PR introduces a thread-safe pickled registry using thread-local storage, so that when a Python command is pickled (causing the broadcast variable to be pickled and added to the registry), each thread has its own view of the pickle registry to retrieve and clear the broadcast variables used.
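
      A conceptual sketch of a thread-local registry (written in Scala for consistency with the other examples here; the actual fix lives in PySpark's Python code):

      ```scala
      import scala.collection.mutable.ArrayBuffer

      // Each thread sees its own buffer, so items registered on one thread never
      // leak into another thread's view.
      class ThreadLocalRegistry[T] {
        private val buffers = new ThreadLocal[ArrayBuffer[T]] {
          override def initialValue(): ArrayBuffer[T] = ArrayBuffer.empty[T]
        }
        def add(item: T): Unit = buffers.get() += item
        def drain(): Seq[T] = {
          val items = buffers.get().toList
          buffers.get().clear()
          items
        }
      }
      ```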
      
      ## How was this patch tested?
      
      Added a unit test that causes this race condition using another thread.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #18825 from BryanCutler/pyspark-bcast-threadsafe-SPARK-12717-2_1.
      d93e45b8
  14. Aug 01, 2017
  15. Jul 29, 2017
    • [SPARK-21555][SQL] RuntimeReplaceable should be compared semantically by its canonicalized child · 78f7cdfa
      Liang-Chi Hsieh authored
      
      ## What changes were proposed in this pull request?
      
      When there are aliases (added for nested fields) as parameters in `RuntimeReplaceable`, those aliases are not among the children expressions, so they can't be cleaned up by the analyzer rule `CleanupAliases`.
      
      An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group by query because they contain different aliases.
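
      For illustration, a query of this shape hits the issue (the schema and names are hypothetical; runnable in spark-shell, where `spark` and its implicits are in scope):

      ```scala
      case class Foo(foo1: String)
      case class Rec(foo: Foo)
      val df = Seq(Rec(Foo("a")), Rec(Foo(null))).toDF()
      df.createOrReplaceTempView("t")
      // The nvl over the nested field appears in both the SELECT list and GROUP BY
      // and must canonicalize to the same expression.
      spark.sql("SELECT nvl(foo.foo1, 'value') AS k, count(*) AS c FROM t GROUP BY nvl(foo.foo1, 'value')").show()
      ```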
      
      Because those aliases are not children of `RuntimeReplaceable`, which is a `UnaryExpression`, we can't trim them out by simply transforming the expressions in `CleanupAliases`.
      
      If we want to replace the non-children aliases in `RuntimeReplaceable`, we need to add more codes to `RuntimeReplaceable` and modify all expressions of `RuntimeReplaceable`. It makes the interface ugly IMO.
      
      Considering that those aliases will be replaced later during optimization and so do no harm, this patch chooses to simply override `canonicalized` of `RuntimeReplaceable`.
      
      One concern is about `CleanupAliases`: it actually cannot clean up ALL aliases inside a plan. To make callers of this rule aware of that, this patch adds a comment to `CleanupAliases`.
      
      ## How was this patch tested?
      
      Added test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18761 from viirya/SPARK-21555.
      
      (cherry picked from commit 9c8109ef)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      78f7cdfa
  16. Jul 28, 2017
  17. Jul 27, 2017
  18. Jul 19, 2017
    • [SPARK-21446][SQL] Fix setAutoCommit never executed · 94987987
      DFFuture authored
      ## What changes were proposed in this pull request?
      JIRA Issue: https://issues.apache.org/jira/browse/SPARK-21446
      
      
      options.asConnectionProperties cannot contain fetchsize, because fetchsize is a Spark-only option, and Spark-only options are excluded from the connection properties.
      So this change switches the properties passed to beforeFetch from options.asConnectionProperties.asScala.toMap to options.asProperties.asScala.toMap.
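
      A usage sketch that this fix affects (connection details are placeholders):

      ```scala
      // "fetchsize" is a Spark-only option, so it only reaches the Postgres dialect's
      // beforeFetch hook when the full option map (asProperties) is passed along.
      val df = spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/testdb")
        .option("dbtable", "public.some_table")
        .option("fetchsize", "1000")
        .load()
      ```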
      
      ## How was this patch tested?
      
      Author: DFFuture <albert.zhang23@gmail.com>
      
      Closes #18665 from DFFuture/sparksql_pg.
      
      (cherry picked from commit c9729187)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      94987987
    • [SPARK-21441][SQL] Incorrect Codegen in SortMergeJoinExec results failures in some cases · ac206934
      donnyzone authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21441
      
      
      
      This issue can be reproduced by the following example:
      
      ```
      val spark = SparkSession
         .builder()
         .appName("smj-codegen")
         .master("local")
         .config("spark.sql.autoBroadcastJoinThreshold", "1")
         .getOrCreate()
      val df1 = spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3))).toDF("key", "int")
      val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "str")
      val df = df1.join(df2, df1("key") === df2("key"))
         .filter("int = 2 or reflect('java.lang.Integer', 'valueOf', str) = 1")
         .select("int")
         df.show()
      ```
      
      To conclude, the issue happens when:
      (1) the SortMergeJoin condition contains CodegenFallback expressions, and
      (2) in the PhysicalPlan tree, the SortMergeJoin node is the child of the root node, e.g., the Project in the above example.
      
      This patch fixes the logic in `CollapseCodegenStages` rule.
      
      ## How was this patch tested?
      Unit test and manual verification in our cluster.
      
      Author: donnyzone <wellfengzhu@gmail.com>
      
      Closes #18656 from DonnyZone/Fix_SortMergeJoinExec.
      
      (cherry picked from commit 6b6dd682)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      ac206934
  19. Jul 17, 2017
    • [SPARK-21332][SQL] Incorrect result type inferred for some decimal expressions · caf32b3c
      aokolnychyi authored
      
      ## What changes were proposed in this pull request?
      
      This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was in visiting outer nodes before their children. Consider the example below:
      
      ```
          val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil)
          val sc = spark.sparkContext
          val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12)))
          val df = spark.createDataFrame(rdd, inputSchema)
      
          // Works correctly since no nested decimal expression is involved
          // Expected result type: (26, 6) * (26, 6) = (38, 12)
          df.select($"col" * $"col").explain(true)
          df.select($"col" * $"col").printSchema()
      
          // Gives a wrong result since there is a nested decimal expression that should be visited first
          // Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18)
          df.select($"col" * $"col" * $"col").explain(true)
          df.select($"col" * $"col" * $"col").printSchema()
      ```
      
      The example above gives the following output:
      
      ```
      // Correct result without sub-expressions
      == Parsed Logical Plan ==
      'Project [('col * 'col) AS (col * col)#4]
      +- LogicalRDD [col#1]
      
      == Analyzed Logical Plan ==
      (col * col): decimal(38,12)
      Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)#4]
      +- LogicalRDD [col#1]
      
      == Optimized Logical Plan ==
      Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
      +- LogicalRDD [col#1]
      
      == Physical Plan ==
      *Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
      +- Scan ExistingRDD[col#1]
      
      // Schema
      root
       |-- (col * col): decimal(38,12) (nullable = true)
      
      // Incorrect result with sub-expressions
      == Parsed Logical Plan ==
      'Project [(('col * 'col) * 'col) AS ((col * col) * col)#11]
      +- LogicalRDD [col#1]
      
      == Analyzed Logical Plan ==
      ((col * col) * col): decimal(38,12)
      Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)#11]
      +- LogicalRDD [col#1]
      
      == Optimized Logical Plan ==
      Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
      +- LogicalRDD [col#1]
      
      == Physical Plan ==
      *Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
      +- Scan ExistingRDD[col#1]
      
      // Schema
      root
       |-- ((col * col) * col): decimal(38,12) (nullable = true)
      ```
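
      For reference, a sketch of the decimal multiplication typing rule applied in the expected results above (the result is then capped at a maximum precision of 38):

      ```latex
      (p_1,\, s_1) \times (p_2,\, s_2) \;\rightarrow\; (p_1 + p_2 + 1,\; s_1 + s_2)
      ```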
      
      ## How was this patch tested?
      
      This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios.
      
      Author: aokolnychyi <anton.okolnychyi@sap.com>
      
      Closes #18583 from aokolnychyi/spark-21332.
      
      (cherry picked from commit 0be5fb41)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      caf32b3c
    • [SPARK-19104][BACKPORT-2.1][SQL] Lambda variables in ExternalMapToCatalyst should be global · a9efce46
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR is backport of #18418 to Spark 2.1. [SPARK-21391](https://issues.apache.org/jira/browse/SPARK-21391) reported this problem in Spark 2.1.
      
      The issue happens in `ExternalMapToCatalyst`. For example, the following code creates an `ExternalMapToCatalyst` to convert a Scala Map to Catalyst map format.
      
      ```
      val data = Seq.tabulate(10)(i => NestedData(1, Map("key" -> InnerData("name", i + 100))))
      val ds = spark.createDataset(data)
      ```
      The `valueConverter` in `ExternalMapToCatalyst` looks like:
      
      ```
      if (isnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true))) null else named_struct(name, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).name, true), value, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).value)
      ```
      There is a `CreateNamedStruct` expression (`named_struct`) to create a row of `InnerData.name` and `InnerData.value` that are referred by `ExternalMapToCatalyst_value52`.
      
      Because `ExternalMapToCatalyst_value52` is a local variable, when `CreateNamedStruct` splits expressions into individual functions, the local variable can't be accessed anymore.
      
      ## How was this patch tested?
      
      Added a new test suite into `DatasetPrimitiveSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #18627 from kiszk/SPARK-21391.
      a9efce46
  20. Jul 14, 2017
  21. Jul 09, 2017
  22. Jul 08, 2017
  23. Jul 06, 2017
    • [SPARK-21312][SQL] correct offsetInBytes in UnsafeRow.writeToStream · 7f7b63bb
      Sumedh Wale authored
      
      ## What changes were proposed in this pull request?
      
      Corrects the offsetInBytes calculation in UnsafeRow.writeToStream. Known failures include writes to some DataSources that have their own SparkPlan implementations and cause an EXCHANGE in writes.
      
      ## How was this patch tested?
      
      Extended UnsafeRowSuite.writeToStream to include an UnsafeRow over byte array having non-zero offset.
      
      Author: Sumedh Wale <swale@snappydata.io>
      
      Closes #18535 from sumwale/SPARK-21312.
      
      (cherry picked from commit 14a3bb3a)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      7f7b63bb
  24. Jul 04, 2017
    • [SPARK-20256][SQL][BRANCH-2.1] SessionState should be created more lazily · 8f1ca695
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      `SessionState` is designed to be created lazily. However, in reality, it is created immediately in `SparkSession.Builder.getOrCreate` ([here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L943)).
      
      This PR aims to recover the lazy behavior by keeping the options in `initialSessionOptions`. The benefit is the following: users can start `spark-shell` and use RDD operations without any problems.
      
      **BEFORE**
      ```scala
      $ bin/spark-shell
      java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'
      ...
      Caused by: org.apache.spark.sql.AnalysisException:
          org.apache.hadoop.hive.ql.metadata.HiveException:
             MetaException(message:java.security.AccessControlException:
                Permission denied: user=spark, access=READ,
                   inode="/apps/hive/warehouse":hive:hdfs:drwx------
      ```
      As reported in SPARK-20256, this happens when the warehouse directory is not allowed for this user.
      
      **AFTER**
      ```scala
      $ bin/spark-shell
      ...
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version 2.1.2-SNAPSHOT
            /_/
      
      Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
      Type in expressions to have them evaluated.
      Type :help for more information.
      
      scala> sc.range(0, 10, 1).count()
      res0: Long = 10
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #18530 from dongjoon-hyun/SPARK-20256-BRANCH-2.1.
      8f1ca695
  25. Jun 30, 2017
  26. Jun 29, 2017
    • [SPARK-21258][SQL] Fix WindowExec complex object aggregation with spilling · d995dac1
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      `WindowExec` currently improperly stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) during aggregation by keeping a reference in the buffer used by `GeneratedMutableProjections` to the actual input data. Things go wrong when the input object (or the backing bytes) is reused for other things. This can happen in window functions when the operator starts spilling to disk. When reading back the spill files, the `UnsafeSorterSpillReader` reuses the buffer to which the `UnsafeRow` points, leading to weird corruption scenarios. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`.
      
      This was not seen before, because the spilling logic was not doing actual spills as much and actually used an in-memory page. This page was not cleaned up during window processing and made sure unsafe objects pointed to their own dedicated memory location. This was changed by https://github.com/apache/spark/pull/16909; after that PR Spark spills more eagerly.
      
      This PR provides a surgical fix because we are close to releasing Spark 2.2. This change just makes sure that there cannot be any object reuse, at the expense of a little bit of performance. We will follow up with a more subtle solution at a later point.
      
      ## How was this patch tested?
      Added a regression test to `DataFrameWindowFunctionsSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #18470 from hvanhovell/SPARK-21258.
      
      (cherry picked from commit e2f32ee4)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      d995dac1
    • [SPARK-21176][WEB UI] Limit number of selector threads for admin ui proxy servlets to 8 · 083adb07
      IngoSchuster authored
      ## What changes were proposed in this pull request?
      Please see also https://issues.apache.org/jira/browse/SPARK-21176
      
      This change limits the number of selector threads that Jetty creates to a maximum of 8 per proxy servlet (the Jetty default is the number of processors / 2).
      The newHttpClient method of Jetty's ProxyServlet class is overridden to avoid the Jetty defaults (which are designed for high-performance HTTP servers).
      Once https://github.com/eclipse/jetty.project/issues/1643 is available, the code could be cleaned up to avoid the method override.
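
      A rough sketch of the override (Jetty 9.x API, simplified from the actual patch):

      ```scala
      import org.eclipse.jetty.client.HttpClient
      import org.eclipse.jetty.client.http.HttpClientTransportOverHTTP
      import org.eclipse.jetty.proxy.ProxyServlet

      class BoundedSelectorProxyServlet extends ProxyServlet {
        // Cap the selector pool at 8 instead of Jetty's availableProcessors / 2 default.
        private val numSelectors = math.min(8, Runtime.getRuntime.availableProcessors())

        override def newHttpClient(): HttpClient =
          new HttpClient(new HttpClientTransportOverHTTP(numSelectors), null)
      }
      ```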
      
      I really need this on v2.1.1 - what is the best way for a backport (automatic merge works fine)? Shall I create another PR?
      
      ## How was this patch tested?
      The patch was tested manually on a Spark cluster with a head node that has 88 processors using JMX to verify that the number of selector threads is now limited to 8 per proxy.
      
      gurvindersingh zsxwing can you please review the change?
      
      Author: IngoSchuster <ingo.schuster@de.ibm.com>
      Author: Ingo Schuster <ingo.schuster@de.ibm.com>
      
      Closes #18437 from IngoSchuster/master.
      
      (cherry picked from commit 88a536ba)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      083adb07
  27. Jun 25, 2017
  28. Jun 24, 2017
    • [SPARK-21203][SQL] Fix wrong results of insertion of Array of Struct · 0d6b701e
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      ```SQL
      CREATE TABLE `tab1`
      (`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
      USING parquet
      
      INSERT INTO `tab1`
      SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))
      
      SELECT custom_fields.id, custom_fields.value FROM tab1
      ```
      
      The above query always returns the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is that we always use the same `GenericInternalRow` object when doing the cast.
      
      ### How was this patch tested?
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18412 from gatorsmile/castStruct.
      
      (cherry picked from commit 2e1586f6)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      0d6b701e
    • [SPARK-21159][CORE] Don't try to connect to launcher in standalone cluster mode. · 6750db3f
      Marcelo Vanzin authored
      
      Monitoring for standalone cluster mode is not implemented (see SPARK-11033), but
      the same scheduler implementation is used, and if it tries to connect to the
      launcher it will fail. So fix the scheduler so it only tries that in client mode;
      cluster mode applications will be correctly launched and will work, but monitoring
      through the launcher handle will not be available.
      
      Tested by running a cluster mode app with "SparkLauncher.startApplication".
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18397 from vanzin/SPARK-21159.
      
      (cherry picked from commit bfd73a7c)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      6750db3f
    • [SPARK-20555][SQL] Fix mapping of Oracle DECIMAL types to Spark types in read path · f12883e3
      Gabor Feher authored
      This PR is to revert some code changes in the read path of https://github.com/apache/spark/pull/14377. The original fix is https://github.com/apache/spark/pull/17830
      
      When merging this PR, please give the credit to gaborfeher
      
      Added a test case to OracleIntegrationSuite.scala
      
      Author: Gabor Feher <gabor.feher@lynxanalytics.com>
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18408 from gatorsmile/OracleType.
      f12883e3
  29. Jun 23, 2017
    • [MINOR][DOCS] Docs in DataFrameNaFunctions.scala use wrong method · bcaf06c4
      Ong Ming Yang authored
      
      ## What changes were proposed in this pull request?
      
      * Following the first few examples in this file, the remaining methods should also be methods of `df.na` not `df`.
      * Filled in some missing parentheses
      
      ## How was this patch tested?
      
      N/A
      
      Author: Ong Ming Yang <me@ongmingyang.com>
      
      Closes #18398 from ongmingyang/master.
      
      (cherry picked from commit 4cc62951)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
      bcaf06c4
    • [SPARK-21181] Release byteBuffers to suppress netty error messages · f8fd3b48
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      We are explicitly calling release on the ByteBufs used to encode the string to Base64, to suppress the memory-leak error message reported by Netty. This is to make it less confusing for the user.
      
      ### Changes proposed in this fix
      By explicitly invoking release on the ByteBufs, we decrement the internal reference counts for the wrapped ByteBufs. Now, when the GC kicks in, these will be reclaimed as before, but Netty won't report any memory-leak error messages since the internal ref counts are now 0.
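
      A minimal sketch of the pattern (Netty API, not the exact Spark call site):

      ```scala
      import io.netty.buffer.{ByteBuf, Unpooled}
      import io.netty.handler.codec.base64.Base64
      import io.netty.util.CharsetUtil

      // Release the intermediate ByteBufs created while Base64-encoding so their
      // reference counts drop to zero and Netty's leak detector stays quiet.
      def encodeToBase64String(bytes: Array[Byte]): String = {
        val raw: ByteBuf = Unpooled.wrappedBuffer(bytes)
        val encoded: ByteBuf = Base64.encode(raw)
        try {
          encoded.toString(CharsetUtil.UTF_8)
        } finally {
          encoded.release()
          raw.release()
        }
      }
      ```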
      
      ## How was this patch tested?
      Ran a few spark-applications and examined the logs. The error message no longer appears.
      
      Original PR was opened against branch-2.1 => https://github.com/apache/spark/pull/18392
      
      
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #18407 from dhruve/master.
      
      (cherry picked from commit 1ebe7ffe)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      f8fd3b48
  30. Jun 22, 2017