  1. Mar 12, 2017
    • uncleGen's avatar
      [SPARK-19853][SS] uppercase kafka topics fail when startingOffsets are SpecificOffsets · 8c460804
      uncleGen authored
      
      When using the KafkaSource with Structured Streaming, consumer assignments are not what the user expects if startingOffsets is set to an explicit set of topics/partitions in JSON where the topic(s) happen to have uppercase characters. When startingOffsets is parsed, the original string value from the options is lowercased (toLowerCase) to make matching on "earliest" and "latest" case-insensitive. However, that lowercased JSON is then passed on to SpecificOffsets, so topic names may no longer be what the user intended by the time assignments are made with the underlying KafkaConsumer.
      
      KafkaSourceProvider.scala:
      ```
      val startingOffsets = caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) match {
          case Some("latest") => LatestOffsets
          case Some("earliest") => EarliestOffsets
          case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json))
          case None => LatestOffsets
        }
      ```
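
      A hedged sketch of one way to fix this, following the shape of the snippet above: match case-insensitively on the "latest"/"earliest" sentinels, but hand the original, non-lowercased JSON to SpecificOffsets so topic names keep their case.

      ```scala
      // Sketch only: same option handling as above, but the JSON string is no
      // longer lowercased before it reaches SpecificOffsets, so topic names such
      // as "MyTopic" survive intact.
      val startingOffsets = caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim) match {
        case Some(offset) if offset.toLowerCase == "latest" => LatestOffsets
        case Some(offset) if offset.toLowerCase == "earliest" => EarliestOffsets
        case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json))
        case None => LatestOffsets
      }
      ```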
      
      Thanks to cbowden for reporting.
      
      Jenkins
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #17209 from uncleGen/SPARK-19853.
      
      (cherry picked from commit 0a4d06a7)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      8c460804
    • uncleGen's avatar
      [DOCS][SS] fix structured streaming python example · f9833c66
      uncleGen authored
      
      ## What changes were proposed in this pull request?
      
      - Structured Streaming Python example: fix `TypeError: 'xxx' object is not callable`
      - some other doc issues
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #17257 from uncleGen/docs-ss-python.
      
      (cherry picked from commit e29a74d5)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      f9833c66
  2. Mar 10, 2017
    • Budde's avatar
      [SPARK-19611][SQL] Introduce configurable table schema inference · e481a738
      Budde authored
      Add a new configuration option that allows Spark SQL to infer a case-sensitive schema from a Hive Metastore table's data files when a case-sensitive schema can't be read from the table properties.
      
      - Add spark.sql.hive.caseSensitiveInferenceMode param to SQLConf
      - Add schemaPreservesCase field to CatalogTable (set to false when schema can't
        successfully be read from Hive table props)
      - Perform schema inference in HiveMetastoreCatalog if schemaPreservesCase is
        false, depending on spark.sql.hive.caseSensitiveInferenceMode
      - Add alterTableSchema() method to the ExternalCatalog interface
      - Add HiveSchemaInferenceSuite tests
      - Refactor and move ParquetFileFormat.mergeMetastoreParquetSchema() as
        HiveMetastoreCatalog.mergeWithMetastoreSchema
      - Move schema merging tests from ParquetSchemaSuite to HiveSchemaInferenceSuite
      
      [JIRA for this change](https://issues.apache.org/jira/browse/SPARK-19611)
      
      The tests in ```HiveSchemaInferenceSuite``` should verify that schema inference is working as expected. ```ExternalCatalogSuite``` has also been extended to cover the new ```alterTableSchema()``` API.
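
      For context, a hedged usage sketch of the new option (the mode names below are the values proposed in this change; treat the exact names and the default as assumptions):

      ```scala
      import org.apache.spark.sql.SparkSession

      // Hive support is required for this option to matter.
      val spark = SparkSession.builder()
        .master("local[2]")
        .appName("schema-inference")
        .enableHiveSupport()
        .getOrCreate()

      // Assumed mode names: INFER_AND_SAVE (infer a case-sensitive schema from the
      // data files and write it back to the table properties), INFER_ONLY (infer
      // but do not save), NEVER_INFER (keep the case-insensitive metastore schema).
      spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
      ```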
      
      Author: Budde <budde@amazon.com>
      
      Closes #17229 from budde/SPARK-19611-2.1.
      e481a738
    • Wenchen Fan's avatar
      [SPARK-19893][SQL] should not run DataFrame set operations with map type · 5a2ad431
      Wenchen Fan authored
      
      In Spark SQL, map types can't be used in equality tests or comparisons. Since `Intersect`/`Except`/`Distinct` need equality tests on all columns, we should not allow map type columns in `Intersect`/`Except`/`Distinct`.
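
      An illustrative sketch of what is now rejected (assumes a local Spark build; the exact error wording is approximate):

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.{lit, map}

      val spark = SparkSession.builder().master("local[2]").appName("map-setops").getOrCreate()
      import spark.implicits._

      // A DataFrame with a map-typed column.
      val df = Seq(1, 2).toDF("id").select($"id", map(lit("k"), $"id").as("m"))

      // After this change each of these fails analysis with an AnalysisException,
      // because map columns cannot be compared for equality:
      //   df.distinct()
      //   df.intersect(df)
      //   df.except(df)
      ```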
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17236 from cloud-fan/map.
      
      (cherry picked from commit fb9beda5)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      5a2ad431
    • Tyson Condie's avatar
      [SPARK-19891][SS] Await Batch Lock notified on stream execution exit · f0d50fd5
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      We need to notify the await-batch lock when the stream exits early, e.g., when an exception has been thrown.
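
      A hedged sketch of the pattern (field names are illustrative, not necessarily StreamExecution's actual members):

      ```scala
      import java.util.concurrent.locks.ReentrantLock

      val awaitBatchLock = new ReentrantLock()
      val awaitBatchLockCondition = awaitBatchLock.newCondition()

      // Signal any thread blocked waiting for the next batch whenever the stream
      // loop exits, whether normally or via an exception, so callers are not left hanging.
      def runStreamLoop(body: => Unit): Unit = {
        try body
        finally {
          awaitBatchLock.lock()
          try awaitBatchLockCondition.signalAll()
          finally awaitBatchLock.unlock()
        }
      }
      ```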
      
      ## How was this patch tested?
      
      Current tests that throw exceptions at runtime will finish faster as a result of this update.
      
      zsxwing
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Tyson Condie <tcondie@gmail.com>
      
      Closes #17231 from tcondie/kafka-writer.
      
      (cherry picked from commit 501b7111)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      f0d50fd5
  3. Mar 09, 2017
    • Burak Yavuz's avatar
      [SPARK-19886] Fix reportDataLoss if statement in SS KafkaSource · a59cc369
      Burak Yavuz authored
      
      ## What changes were proposed in this pull request?
      
      Fix the `throw new IllegalStateException` if statement part.
      
      ## How was this patch tested?
      
      Regression test
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #17228 from brkyvz/kafka-cause-fix.
      
      (cherry picked from commit 82138e09)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      a59cc369
    • uncleGen's avatar
      [SPARK-19861][SS] watermark should not be a negative time. · ffe65b06
      uncleGen authored
      
      ## What changes were proposed in this pull request?
      
      A `watermark` delay should not be negative. Such a value is invalid, so check for it before the query actually runs.
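
      An illustrative sketch (the socket source is used only as a simple streaming input; host, port, and the exact exception raised are assumptions):

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.current_timestamp

      val spark = SparkSession.builder().master("local[2]").appName("neg-watermark").getOrCreate()

      // A streaming DataFrame with an event-time column.
      val events = spark.readStream
        .format("socket").option("host", "localhost").option("port", 9999)
        .load()
        .withColumn("eventTime", current_timestamp())

      // With this change, a negative delay is rejected when the watermark is defined,
      // before the query ever runs:
      //   events.withWatermark("eventTime", "-1 minute")   // now fails fast
      val ok = events.withWatermark("eventTime", "1 minute") // non-negative delays are unaffected
      ```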
      
      ## How was this patch tested?
      
      add new unit test.
      
      Author: uncleGen <hustyugm@gmail.com>
      Author: dylon <hustyugm@gmail.com>
      
      Closes #17202 from uncleGen/SPARK-19861.
      
      (cherry picked from commit 30b18e69)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      ffe65b06
    • Jason White's avatar
      [SPARK-19561][SQL] add int case handling for TimestampType · 2a76e242
      Jason White authored
      ## What changes were proposed in this pull request?
      
      Add handling of input of type `Int` for dataType `TimestampType` to `EvaluatePython.scala`. Py4J serializes ints smaller than MIN_INT or larger than MAX_INT to Long, which are handled correctly already, but values between MIN_INT and MAX_INT are serialized to Int.
      
      These range limits correspond to roughly half an hour on either side of the epoch. As a result, PySpark doesn't allow TimestampType values to be created in this range.
      
      Alternatives attempted: patching the `TimestampType.toInternal` function to cast return values to `long`, so Py4J would always serialize them to Scala Long. Python3 does not have a `long` type, so this approach failed on Python3.
      
      ## How was this patch tested?
      
      Added a new PySpark-side test that fails without the change.
      
      The contribution is my original work and I license the work to the project under the project’s open source license.
      
      Resubmission of https://github.com/apache/spark/pull/16896. The original PR didn't go through Jenkins and broke the build. davies dongjoon-hyun
      
      cloud-fan Could you kick off a Jenkins run for me? It passed everything for me locally, but it's possible something has changed in the last few weeks.
      
      Author: Jason White <jason.white@shopify.com>
      
      Closes #17200 from JasonMWhite/SPARK-19561.
      
      (cherry picked from commit 206030bd)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      2a76e242
    • uncleGen's avatar
      [SPARK-19859][SS][FOLLOW-UP] The new watermark should override the old one. · 0c140c16
      uncleGen authored
      
      ## What changes were proposed in this pull request?
      
      A follow up to SPARK-19859:
      
      - extract the calculation of `delayMs` and reuse it.
      - update EventTimeWatermarkExec
      - use the correct `delayMs` in EventTimeWatermark
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #17221 from uncleGen/SPARK-19859.
      
      (cherry picked from commit eeb1d6db)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      0c140c16
    • Shixiong Zhu's avatar
      [SPARK-19874][BUILD] Hide API docs for org.apache.spark.sql.internal · 00859e14
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      The API docs should not include the "org.apache.spark.sql.internal" package because they are internal private APIs.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17217 from zsxwing/SPARK-19874.
      
      (cherry picked from commit 029e40b4)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      00859e14
  4. Mar 08, 2017
    • Dilip Biswal's avatar
      [MINOR][SQL] The analyzer rules are fired twice for cases when... · 78cc5721
      Dilip Biswal authored
      [MINOR][SQL] The analyzer rules are fired twice for cases when AnalysisException is raised from analyzer.
      
      ## What changes were proposed in this pull request?
      In general we have a checkAnalysis phase which validates the logical plan and throws AnalysisException on semantic errors. However, we can also throw AnalysisException from a few analyzer rules, like ResolveSubquery.
      
      I found that we fire up the analyzer rules twice for queries that throw AnalysisException from one of the analyzer rules. This is a very minor fix; we don't have to strictly fix it. I just got confused seeing the rules getting fired twice when I was not expecting it.
      
      ## How was this patch tested?
      
      Tested manually.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #17214 from dilipbiswal/analyis_twice.
      
      (cherry picked from commit d809ceed)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
      78cc5721
    • Burak Yavuz's avatar
      [SPARK-19813] maxFilesPerTrigger combo latestFirst may miss old files in... · f6c1ad2e
      Burak Yavuz authored
      [SPARK-19813] maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource
      
      ## What changes were proposed in this pull request?
      
      **The Problem**
      There is a file stream source option called maxFileAge which limits how old the files can be, relative to the latest file that has been seen. This is used to limit the files that need to be remembered as "processed". Files older than the latest processed files are ignored. This value defaults to 7 days.
      This causes a problem when both:
      latestFirst = true
      maxFilesPerTrigger > total files to be processed.
      Here is what happens in each combination:
      1) latestFirst = false - Since files are processed in order, there won't be any unprocessed file older than the latest processed file. All files will be processed.
      2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is not set, then all old files get processed in the first batch, and so no file is left behind.
      3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch processes the latest X files. That sets the threshold to (latest file - maxFileAge), so files older than this threshold will never be considered for processing.
      The bug is with case 3.
      
      **The Solution**
      
      Ignore `maxFileAge` when both `maxFilesPerTrigger` and `latestFirst` are set.
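
      An illustrative reader configuration for case 3 (the input path is a placeholder):

      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[2]").appName("file-stream").getOrCreate()

      // With both latestFirst and maxFilesPerTrigger set, maxFileAge is now ignored,
      // so files older than (latest seen file - maxFileAge) are no longer silently skipped.
      val input = spark.readStream
        .format("text")
        .option("latestFirst", "true")
        .option("maxFilesPerTrigger", "10")
        .load("/path/to/input")
      ```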
      
      ## How was this patch tested?
      
      Regression test in `FileStreamSourceSuite`
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #17153 from brkyvz/maxFileAge.
      
      (cherry picked from commit a3648b5d)
      Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
      f6c1ad2e
    • Michael Armbrust's avatar
      [SPARK-18055][SQL] Use correct mirror in ExpressionEncoder · 320eff14
      Michael Armbrust authored
      
      Previously, we were using the mirror of the passed-in `TypeTag` when reflecting to build an encoder.  This fails when the outer class is built-in (i.e. `Seq`'s default mirror is based on the root classloader) but inner classes (i.e. `A` in `Seq[A]`) are defined in the REPL or a library.
      
      This patch changes us to always reflect based on a mirror created using the context classloader.
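
      A minimal sketch of the idea (not the actual ExpressionEncoder code):

      ```scala
      import scala.reflect.runtime.{universe => ru}

      // Build the reflection mirror from the context classloader, which can see
      // REPL-defined and user-library classes, instead of reusing the mirror
      // attached to the incoming TypeTag (which may be rooted elsewhere).
      val mirror: ru.Mirror = ru.runtimeMirror(Thread.currentThread().getContextClassLoader)
      ```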
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #17201 from marmbrus/replSeqEncoder.
      
      (cherry picked from commit 314e48a3)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      320eff14
  5. Mar 07, 2017
    • Bryan Cutler's avatar
      [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe · 0ba9ecbe
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      The `keyword_only` decorator in PySpark is not thread-safe.  It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as `_input_kwargs`.  If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten.  See [SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for reproduction code.
      
      This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition.  It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize.
      
      ## How was this patch tested?
      Added new unit tests for using the keyword_only decorator and a regression test that verifies `_input_kwargs` can be overwritten from different class instances.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #17193 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348-2_1.
      0ba9ecbe
    • Shixiong Zhu's avatar
      [SPARK-19859][SS] The new watermark should override the old one · 3b648a62
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      The new watermark should override the old one. Otherwise, we just pick up the first column that has a watermark, which may be unexpected.
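
      An illustrative sketch (the socket source is used only as a simple streaming input; host, port, and column names are made up):

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.current_timestamp

      val spark = SparkSession.builder().master("local[2]").appName("watermark-override").getOrCreate()

      // Streaming input with two event-time-like columns.
      val events = spark.readStream
        .format("socket").option("host", "localhost").option("port", 9999)
        .load()
        .withColumn("timeA", current_timestamp())
        .withColumn("timeB", current_timestamp())

      // Previously the first watermark could silently remain in effect; with this
      // change, the second call overrides the first.
      val watermarked = events
        .withWatermark("timeA", "10 minutes") // old watermark
        .withWatermark("timeB", "1 minute")   // new watermark: this one now applies
      ```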
      
      ## How was this patch tested?
      
      The new test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17199 from zsxwing/SPARK-19859.
      
      (cherry picked from commit d8830c50)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      3b648a62
    • Wenchen Fan's avatar
      cbc37007
    • Marcelo Vanzin's avatar
      [SPARK-19857][YARN] Correctly calculate next credential update time. · 551b7bdb
      Marcelo Vanzin authored
      
      Add parentheses so that both lines form a single statement; also add
      a log message so that the issue becomes more explicit if it shows up
      again.
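
      The underlying Scala pitfall, illustrated (this is not the actual Spark code): when a long arithmetic expression is broken so that the operator starts the next line, the two lines parse as separate statements unless parentheses tie them together.

      ```scala
      val withoutParens = 1000L
        - 250L            // parsed as a standalone expression; withoutParens == 1000

      val withParens = (1000L
        - 250L)           // parentheses make it one statement; withParens == 750
      ```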
      
      Tested manually with integration test that exercises the feature.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #17198 from vanzin/SPARK-19857.
      
      (cherry picked from commit 8e41c2ee)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      551b7bdb
    • Jason White's avatar
      [SPARK-19561] [PYTHON] cast TimestampType.toInternal output to long · 711addd4
      Jason White authored
      
      ## What changes were proposed in this pull request?
      
      Cast the output of `TimestampType.toInternal` to long to allow for proper Timestamp creation in DataFrames near the epoch.
      
      ## How was this patch tested?
      
      Added a new test that fails without the change.
      
      dongjoon-hyun davies Mind taking a look?
      
      The contribution is my original work and I license the work to the project under the project’s open source license.
      
      Author: Jason White <jason.white@shopify.com>
      
      Closes #16896 from JasonMWhite/SPARK-19561.
      
      (cherry picked from commit 6f468462)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      711addd4
  6. Mar 06, 2017
    • Tyson Condie's avatar
      [SPARK-19719][SS] Kafka writer for both structured streaming and batch queries · fd6c6d5c
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      Add a new Kafka Sink and Kafka Relation for writing streaming and batch queries, respectively, to Apache Kafka.
      ### Streaming Kafka Sink
      - When addBatch is called
      -- If batchId is greater than the last written batch
      --- Write batch to Kafka
      ---- Topic will be taken from the record, if present, or from a topic option, which overrides topic in record.
      -- Else ignore
      
      ### Batch Kafka Sink
      - KafkaSourceProvider will implement CreatableRelationProvider
      - CreatableRelationProvider#createRelation will write the passed-in DataFrame to Kafka
      - Topic will be taken from the record, if present, or from the topic option, which overrides the topic in the record.
      - Save modes Append and ErrorIfExists are supported under identical semantics. Other save modes result in an AnalysisException
      
      tdas zsxwing
      
      ## How was this patch tested?
      
      ### The following unit tests will be included
      - write to stream with topic field: valid stream write with data that includes an existing topic in the schema
      - write structured streaming aggregation w/o topic field, with default topic: valid stream write with data that does not include a topic field, but the configuration includes a default topic
      - write data with bad schema: various cases of writing data that does not conform to a proper schema e.g., 1. no topic field or default topic, and 2. no value field
      - write data with valid schema but wrong types: data with a complete schema but wrong types e.g., key and value types are integers.
      - write to non-existing topic: write a stream to a topic that does not exist in Kafka, which has been configured to not auto-create topics.
      - write batch to kafka: simple write batch to Kafka, which goes through the same code path as streaming scenario, so validity checks will not be redone here.
      
      ### Examples
      ```scala
      // Structured Streaming
      val writer = inputStringStream.map(s => s.get(0).toString.getBytes()).toDF("value")
       .selectExpr("value as key", "value as value")
       .writeStream
       .format("kafka")
       .option("checkpointLocation", checkpointDir)
       .outputMode(OutputMode.Append)
       .option("kafka.bootstrap.servers", brokerAddress)
       .option("topic", topic)
       .queryName("kafkaStream")
       .start()
      
      // Batch
      val df = spark
       .sparkContext
       .parallelize(Seq("1", "2", "3", "4", "5"))
       .map(v => (topic, v))
       .toDF("topic", "value")
      
      df.write
       .format("kafka")
       .option("kafka.bootstrap.servers",brokerAddress)
       .option("topic", topic)
       .save()
      ```
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Tyson Condie <tcondie@gmail.com>
      
      Closes #17043 from tcondie/kafka-writer.
      fd6c6d5c
  7. Mar 05, 2017
    • uncleGen's avatar
      [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOperation: should not... · ca7a7e8a
      uncleGen authored
      [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOperation: should not filter checkpointFilesOfLatestTime with the PATH string.
      
      ## What changes were proposed in this pull request?
      
      https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73800/testReport/
      
      
      
      ```
      sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedDueToTimeoutException: The code
      passed to eventually never returned normally. Attempted 617 times over 10.003740484 seconds.
      Last failure message: 8 did not equal 2.
      	at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
      	at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
      	at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
      	at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:336)
      	at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
      	at org.apache.spark.streaming.DStreamCheckpointTester$class.generateOutput(CheckpointSuite
      .scala:172)
      	at org.apache.spark.streaming.CheckpointSuite.generateOutput(CheckpointSuite.scala:211)
      ```
      
      the check condition is:
      
      ```
      val checkpointFilesOfLatestTime = Checkpoint.getCheckpointFiles(checkpointDir).filter {
           _.toString.contains(clock.getTimeMillis.toString)
      }
      // Checkpoint files are written twice for every batch interval. So assert that both
      // are written to make sure that both of them have been written.
      assert(checkpointFilesOfLatestTime.size === 2)
      ```
      
      the path string may contain `clock.getTimeMillis.toString`, e.g. `3500`:
      
      ```
      file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-500
      file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-1000
      file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-1500
      file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-2000
      file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-2500
      file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3000
      file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3500.bk
      file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3500
                                                             ▲▲▲▲
      ```
      
      so we should check only the file name, not the whole path.
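
      A hedged sketch of the tightened check (matching the snippet above; `getName` looks only at the file name):

      ```scala
      // Match on the file name, not the full path, so a timestamp-like substring in
      // the checkpoint directory name can no longer inflate the count.
      val checkpointFilesOfLatestTime = Checkpoint.getCheckpointFiles(checkpointDir).filter { path =>
        path.getName.contains(clock.getTimeMillis.toString)
      }
      // Checkpoint files are written twice per batch interval, so this should still be 2.
      assert(checkpointFilesOfLatestTime.size === 2)
      ```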
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #17167 from uncleGen/flaky-CheckpointSuite.
      
      (cherry picked from commit 207067ea)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      ca7a7e8a
  8. Mar 03, 2017
    • Shixiong Zhu's avatar
      [SPARK-19816][SQL][TESTS] Fix an issue that DataFrameCallbackSuite doesn't recover the log level · 664c9795
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      "DataFrameCallbackSuite.execute callback functions when a DataFrame action failed" sets the log level to "fatal" but doesn't recover it. Hence, tests running after it won't output any logs except fatal logs.
      
      This PR uses `testQuietly` instead to avoid changing the log level.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17156 from zsxwing/SPARK-19816.
      
      (cherry picked from commit fbc40580)
      Signed-off-by: Yin Huai <yhuai@databricks.com>
      664c9795
    • Burak Yavuz's avatar
      [SPARK-19774] StreamExecution should call stop() on sources when a stream fails · da04d45c
      Burak Yavuz authored
      
      ## What changes were proposed in this pull request?
      
      We currently call stop() on a Structured Streaming Source only when the stream is shut down, i.e. when a user calls streamingQuery.stop(). We should also stop all sources when the stream fails; otherwise we may leak resources, e.g. connections to Kafka.
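
      A self-contained sketch of the pattern (names are illustrative, not StreamExecution's actual structure):

      ```scala
      import scala.util.control.NonFatal

      trait StreamSource { def stop(): Unit }

      // Stop every source on any termination path, including failures, so that
      // resources such as Kafka connections are released even when the stream dies.
      def runStream(sources: Seq[StreamSource])(body: => Unit): Unit = {
        try body
        finally {
          sources.foreach { s =>
            try s.stop()
            catch { case NonFatal(e) => println(s"Failed to stop $s: ${e.getMessage}") }
          }
        }
      }
      ```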
      
      ## How was this patch tested?
      
      Unit tests in `StreamingQuerySuite`.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #17107 from brkyvz/close-source.
      
      (cherry picked from commit 9314c083)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      da04d45c
    • Zhe Sun's avatar
      [SPARK-19797][DOC] ML pipeline document correction · accbed7c
      Zhe Sun authored
      ## What changes were proposed in this pull request?
      The description of pipelines in this paragraph is incorrect: https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works
      
      
      
      > If the Pipeline had more **stages**, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage.
      
      Reason: a Transformer can also be a stage. But only another Estimator will invoke a transform call and pass the data to the next stage. The description in the document misleads ML pipeline users.
      
      ## How was this patch tested?
      This is a tiny modification of **docs/ml-pipelines.md**. I built the docs with jekyll and checked the compiled page.
      
      Author: Zhe Sun <ymwdalex@gmail.com>
      
      Closes #17137 from ymwdalex/SPARK-19797-ML-pipeline-document-correction.
      
      (cherry picked from commit 0bac3e4c)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      accbed7c
  9. Mar 02, 2017
  10. Mar 01, 2017
    • Michael Gummelt's avatar
      [SPARK-19373][MESOS] Base spark.scheduler.minRegisteredResourceRatio … · 27347b5f
      Michael Gummelt authored
      …on registered cores rather than accepted cores
      
      See JIRA
      
      Unit tests, Mesos/Spark integration tests
      
      cc skonto susanxhuynh
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #17045 from mgummelt/SPARK-19373-registered-resources.
      
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #17129 from mgummelt/SPARK-19373-registered-resources-2.1.
      27347b5f
    • Stan Zhai's avatar
      [SPARK-19766][SQL] Constant alias columns in INNER JOIN should not be folded... · bbe0d8ca
      Stan Zhai authored
      [SPARK-19766][SQL] Constant alias columns in INNER JOIN should not be folded by FoldablePropagation rule
      
      ## What changes were proposed in this pull request?
      This PR fixes the code in the Optimizer phase where the constant alias columns of an `INNER JOIN` query are folded by the `FoldablePropagation` rule.
      
      For the following query:
      
      ```
      val sqlA =
        """
          |create temporary view ta as
          |select a, 'a' as tag from t1 union all
          |select a, 'b' as tag from t2
        """.stripMargin
      
      val sqlB =
        """
          |create temporary view tb as
          |select a, 'a' as tag from t3 union all
          |select a, 'b' as tag from t4
        """.stripMargin
      
      val sql =
        """
          |select tb.* from ta inner join tb on
          |ta.a = tb.a and
          |ta.tag = tb.tag
        """.stripMargin
      ```
      
      The tag column is a constant alias column; it's folded by `FoldablePropagation` like this:
      
      ```
      TRACE SparkOptimizer:
      === Applying Rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===
       Project [a#4, tag#14]                              Project [a#4, tag#14]
      !+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14))   +- Join Inner, ((a#0 = a#4) && (a = a))
          :- Union                                           :- Union
          :  :- Project [a#0, a AS tag#8]                    :  :- Project [a#0, a AS tag#8]
          :  :  +- LocalRelation [a#0]                       :  :  +- LocalRelation [a#0]
          :  +- Project [a#2, b AS tag#9]                    :  +- Project [a#2, b AS tag#9]
          :     +- LocalRelation [a#2]                       :     +- LocalRelation [a#2]
          +- Union                                           +- Union
             :- Project [a#4, a AS tag#14]                      :- Project [a#4, a AS tag#14]
             :  +- LocalRelation [a#4]                          :  +- LocalRelation [a#4]
             +- Project [a#6, b AS tag#15]                      +- Project [a#6, b AS tag#15]
                +- LocalRelation [a#6]                             +- LocalRelation [a#6]
      ```
      
      Finally the Result of Batch Operator Optimizations is:
      
      ```
      Project [a#4, tag#14]                              Project [a#4, tag#14]
      !+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14))   +- Join Inner, (a#0 = a#4)
      !   :- SubqueryAlias ta, `ta`                          :- Union
      !   :  +- Union                                        :  :- LocalRelation [a#0]
      !   :     :- Project [a#0, a AS tag#8]                 :  +- LocalRelation [a#2]
      !   :     :  +- SubqueryAlias t1, `t1`                 +- Union
      !   :     :     +- Project [a#0]                          :- LocalRelation [a#4, tag#14]
      !   :     :        +- SubqueryAlias grouping              +- LocalRelation [a#6, tag#15]
      !   :     :           +- LocalRelation [a#0]
      !   :     +- Project [a#2, b AS tag#9]
      !   :        +- SubqueryAlias t2, `t2`
      !   :           +- Project [a#2]
      !   :              +- SubqueryAlias grouping
      !   :                 +- LocalRelation [a#2]
      !   +- SubqueryAlias tb, `tb`
      !      +- Union
      !         :- Project [a#4, a AS tag#14]
      !         :  +- SubqueryAlias t3, `t3`
      !         :     +- Project [a#4]
      !         :        +- SubqueryAlias grouping
      !         :           +- LocalRelation [a#4]
      !         +- Project [a#6, b AS tag#15]
      !            +- SubqueryAlias t4, `t4`
      !               +- Project [a#6]
      !                  +- SubqueryAlias grouping
      !                     +- LocalRelation [a#6]
      ```
      
      The condition `tag#8 = tag#14` of the INNER JOIN has been removed. This leads to wrong results for the inner join.
      
      After fix:
      
      ```
      === Result of Batch LocalRelation ===
       GlobalLimit 21                                           GlobalLimit 21
       +- LocalLimit 21                                         +- LocalLimit 21
          +- Project [a#4, tag#11]                                 +- Project [a#4, tag#11]
             +- Join Inner, ((a#0 = a#4) && (tag#8 = tag#11))         +- Join Inner, ((a#0 = a#4) && (tag#8 = tag#11))
      !         :- SubqueryAlias ta                                      :- Union
      !         :  +- Union                                              :  :- LocalRelation [a#0, tag#8]
      !         :     :- Project [a#0, a AS tag#8]                       :  +- LocalRelation [a#2, tag#9]
      !         :     :  +- SubqueryAlias t1                             +- Union
      !         :     :     +- Project [a#0]                                :- LocalRelation [a#4, tag#11]
      !         :     :        +- SubqueryAlias grouping                    +- LocalRelation [a#6, tag#12]
      !         :     :           +- LocalRelation [a#0]
      !         :     +- Project [a#2, b AS tag#9]
      !         :        +- SubqueryAlias t2
      !         :           +- Project [a#2]
      !         :              +- SubqueryAlias grouping
      !         :                 +- LocalRelation [a#2]
      !         +- SubqueryAlias tb
      !            +- Union
      !               :- Project [a#4, a AS tag#11]
      !               :  +- SubqueryAlias t3
      !               :     +- Project [a#4]
      !               :        +- SubqueryAlias grouping
      !               :           +- LocalRelation [a#4]
      !               +- Project [a#6, b AS tag#12]
      !                  +- SubqueryAlias t4
      !                     +- Project [a#6]
      !                        +- SubqueryAlias grouping
      !                           +- LocalRelation [a#6]
      ```
      
      ## How was this patch tested?
      
      add sql-tests/inputs/inner-join.sql
      All tests passed.
      
      Author: Stan Zhai <zhaishidan@haizhi.com>
      
      Closes #17099 from stanzhai/fix-inner-join.
      
      (cherry picked from commit 5502a9cf)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
      bbe0d8ca
    • Jeff Zhang's avatar
      [SPARK-19572][SPARKR] Allow to disable hive in sparkR shell · f719cccd
      Jeff Zhang authored
      
      ## What changes were proposed in this pull request?
      SPARK-15236 did this for the Scala shell; this ticket is for the sparkR shell. This is not only for sparkR itself, but can also benefit downstream projects like Livy, which uses shell.R for its interactive sessions. For now, Livy has no control over whether to enable Hive or not.
      
      ## How was this patch tested?
      
      Tested it manually: ran `bin/sparkR --master local --conf spark.sql.catalogImplementation=in-memory` and verified that Hive is not enabled.
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #16907 from zjffdu/SPARK-19572.
      
      (cherry picked from commit 73158805)
      Signed-off-by: Felix Cheung <felixcheung@apache.org>
      f719cccd
  11. Feb 28, 2017
    • Michael McCune's avatar
      [SPARK-19769][DOCS] Update quickstart instructions · d887f758
      Michael McCune authored
      
      ## What changes were proposed in this pull request?
      
      This change addresses the renaming of the `simple.sbt` build file to
      `build.sbt`. Newer versions of the sbt tool no longer find the older
      file name and look for `build.sbt` instead. The quickstart
      instructions for self-contained applications are updated with this
      change.
      
      ## How was this patch tested?
      
      As this is a relatively minor change of a few words, the markdown was checked for syntax and spelling. Site was built with `SKIP_API=1 jekyll serve` for testing purposes.
      
      Author: Michael McCune <msm@redhat.com>
      
      Closes #17101 from elmiko/spark-19769.
      
      (cherry picked from commit bf5987cb)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      d887f758
    • Roberto Agostino Vitillo's avatar
      [SPARK-19677][SS] Committing a delta file atop an existing one should not fail on HDFS · 947c0cd9
      Roberto Agostino Vitillo authored
      ## What changes were proposed in this pull request?
      
      HDFSBackedStateStoreProvider fails to rename files on HDFS but not on the local filesystem. According to the [implementation notes](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/filesystem.html) of `rename()`, the behavior of the local filesystem and HDFS varies:
      
      > Destination exists and is a file
      > Renaming a file atop an existing file is specified as failing, raising an exception.
      >    - Local FileSystem : the rename succeeds; the destination file is replaced by the source file.
      >    - HDFS : The rename fails, no exception is raised. Instead the method call simply returns false.
      
      This patch ensures that `rename()` isn't called if the destination file already exists. It's still semantically correct because Structured Streaming requires that rerunning a batch should generate the same output.
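
      A hedged sketch of the guard (the methods used are standard Hadoop `FileSystem` APIs; the surrounding structure and names are illustrative):

      ```scala
      import java.io.IOException
      import org.apache.hadoop.fs.{FileSystem, Path}

      // Only rename when the destination delta file does not already exist; since
      // re-running a batch must produce the same output, an existing committed file
      // can safely be kept and the temporary file discarded.
      def commitDelta(fs: FileSystem, temp: Path, dest: Path): Unit = {
        if (!fs.exists(dest)) {
          if (!fs.rename(temp, dest)) {
            throw new IOException(s"Failed to rename $temp to $dest")
          }
        } else {
          fs.delete(temp, false)
        }
      }
      ```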
      
      ## How was this patch tested?
      
      This patch was tested by running `StateStoreSuite`.
      
      Author: Roberto Agostino Vitillo <ra.vitillo@gmail.com>
      
      Closes #17012 from vitillo/fix_rename.
      
      (cherry picked from commit 9734a928)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      947c0cd9
    • windpiger's avatar
      [SPARK-19748][SQL] refresh function has a wrong order to do cache invalidate... · 4b4c3bf3
      windpiger authored
      [SPARK-19748][SQL] refresh function has a wrong order to do cache invalidate and regenerate the inmemory var for InMemoryFileIndex with FileStatusCache
      
      ## What changes were proposed in this pull request?
      
      If we refresh an InMemoryFileIndex backed by a FileStatusCache, it first uses the FileStatusCache to re-generate cachedLeafFiles etc., and only then calls FileStatusCache.invalidateAll().
      
      Since the order of these two actions is wrong, the refresh does not take effect.
      
      ```
        override def refresh(): Unit = {
          refresh0()
          fileStatusCache.invalidateAll()
        }
      
        private def refresh0(): Unit = {
          val files = listLeafFiles(rootPaths)
          cachedLeafFiles =
            new mutable.LinkedHashMap[Path, FileStatus]() ++= files.map(f => f.getPath -> f)
          cachedLeafDirToChildrenFiles = files.toArray.groupBy(_.getPath.getParent)
          cachedPartitionSpec = null
        }
      ```
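
      A hedged sketch of the corrected ordering (same method names as the snippet above): invalidate the cache first, so the relisting cannot be served from stale cached statuses.

      ```scala
      override def refresh(): Unit = {
        // Drop stale FileStatusCache entries before relisting, so refresh0() reads
        // fresh file statuses from the filesystem.
        fileStatusCache.invalidateAll()
        refresh0()
      }
      ```
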
      ## How was this patch tested?
      unit test added
      
      Author: windpiger <songjun@outlook.com>
      
      Closes #17079 from windpiger/fixInMemoryFileIndexRefresh.
      
      (cherry picked from commit a350bc16)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      4b4c3bf3
  12. Feb 26, 2017
    • Eyal Zituny's avatar
      [SPARK-19594][STRUCTURED STREAMING] StreamingQueryListener fails to handle... · 04fbb9e0
      Eyal Zituny authored
      [SPARK-19594][STRUCTURED STREAMING] StreamingQueryListener fails to handle QueryTerminatedEvent if more then one listeners exists
      
      ## What changes were proposed in this pull request?
      
      Currently, if multiple streaming query listeners exist, then when a QueryTerminatedEvent is triggered only one of the listeners is invoked while the rest ignore the event.
      This is caused by the fact that the streaming query listener bus holds a set of running query ids, and when a termination event is triggered, the terminated query id is removed from the set after the first listener handles the event.
      In this PR, the query id is removed from the set only after all the listeners have handled the event.
      
      ## How was this patch tested?
      
      a test with multiple listeners has been added to StreamingQueryListenerSuite
      
      Author: Eyal Zituny <eyal.zituny@equalum.io>
      
      Closes #16991 from eyalzit/master.
      
      (cherry picked from commit 9f8e3921)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      04fbb9e0
  13. Feb 25, 2017
    • Bryan Cutler's avatar
      [SPARK-14772][PYTHON][ML] Fixed Params.copy method to match Scala implementation · 20a43295
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Fixed the PySpark Params.copy method to behave like the Scala implementation.  The main issue was that it did not account for the _defaultParamMap and merged it into the explicitly created param map.
      
      ## How was this patch tested?
      Added new unit test to verify the copy method behaves correctly for copying uid, explicitly created params, and default params.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #17048 from BryanCutler/pyspark-ml-param_copy-Scala_sync-SPARK-14772-2_1.
      20a43295
    • Boaz Mohar's avatar
      [MINOR][DOCS] Fixes two problems in the SQL programming guide page · 97866e19
      Boaz Mohar authored
      
      ## What changes were proposed in this pull request?
      
      Removed duplicated lines in the SQL Python example and fixed a typo.
      
      ## How was this patch tested?
      
      Searched for other typos on the page to minimize PRs.
      
      Author: Boaz Mohar <boazmohar@gmail.com>
      
      Closes #17066 from boazmohar/doc-fix.
      
      (cherry picked from commit 061bcfb8)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
      97866e19
  14. Feb 24, 2017
    • jerryshao's avatar
      [SPARK-19038][YARN] Avoid overwriting keytab configuration in yarn-client · ed9aaa31
      jerryshao authored
      
      ## What changes were proposed in this pull request?
      
      Because yarn#client resets the `spark.yarn.keytab` configuration to point to the keytab's location in the distributed cache, if the user still uses the old `SparkConf` to create a `SparkSession` with Hive enabled, it will read the keytab from the path in the distributed cache. This is OK for yarn cluster mode, but in yarn client mode, where the driver runs outside a container, fetching the keytab will fail.
      
      So here we should avoid resetting this configuration in yarn#client and only overwrite it for the AM, so that `spark.yarn.keytab` yields the correct keytab path whether running in client mode (keytab in the local fs) or cluster mode (keytab in the distributed cache).
      
      ## How was this patch tested?
      
      Verified in security cluster.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #16923 from jerryshao/SPARK-19038.
      
      (cherry picked from commit a920a436)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      ed9aaa31
    • jerryshao's avatar
      [SPARK-19707][CORE] Improve the invalid path check for sc.addJar · 6da6a27f
      jerryshao authored
      
      ## What changes were proposed in this pull request?
      
      Currently in Spark there are two issues when we add jars with an invalid path:
      
      * If the jar path is an empty string (e.g. `--jars ",dummy.jar"`), then Spark will resolve it to the current directory path and add it to the classpath / file server, which is unwanted. This happened in our programmatic way of submitting Spark applications. From my understanding, Spark should defensively filter out such empty paths.
      * If the jar path is invalid (the file doesn't exist), `addJar` doesn't check it and will still add it to the file server; the exception is delayed until the job runs. This local path could be checked beforehand, with no need to wait until tasks run. We have a similar check in `addFile`, but `addJar` lacks a similar mechanism.
      
      ## How was this patch tested?
      
      Add unit test and local manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17038 from jerryshao/SPARK-19707.
      
      (cherry picked from commit b0a8c16f)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      6da6a27f
    • Takeshi Yamamuro's avatar
      [SPARK-19691][SQL][BRANCH-2.1] Fix ClassCastException when calculating percentile of decimal column · 66a7ca28
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This is a backport of the two following commits: https://github.com/apache/spark/commit/93aa4271596a30752dc5234d869c3ae2f6e8e723
      
      This PR fixes the class-cast exception shown below:
      ```
      scala> spark.range(10).selectExpr("cast (id as decimal) as x").selectExpr("percentile(x, 0.5)").collect()
       java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be cast to java.lang.Number
      	at org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:141)
      	at org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:58)
      	at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.update(interfaces.scala:514)
      	at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
      	at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
      	at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:187)
      	at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:181)
      	at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:151)
      	at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:78)
      	at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
      	at
      ```
      This fix simply converts catalyst values (i.e., `Decimal`) into Scala ones by using `CatalystTypeConverters`.
      
      ## How was this patch tested?
      Added a test in `DataFrameSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17046 from maropu/SPARK-19691-BACKPORT2.1.
      66a7ca28
  15. Feb 23, 2017