- Apr 07, 2017
-
-
actuaryzhang authored
## What changes were proposed in this pull request? Add Tweedie example for SparkR in programming guide. The doc was already updated in #17103. Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #17553 from actuaryzhang/programGuide.
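For readers who want the non-R counterpart, a minimal sketch of fitting a Tweedie GLM with the Scala API; the toy data and parameter values are illustrative and not taken from the guide:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.GeneralizedLinearRegression

import spark.implicits._  // assumes a spark-shell session

// Toy data, purely illustrative.
val df = Seq(
  (1.0, Vectors.dense(1.0, 0.0)),
  (0.0, Vectors.dense(0.0, 1.0)),
  (2.0, Vectors.dense(1.0, 2.0)),
  (3.0, Vectors.dense(2.0, 1.0))
).toDF("label", "features")

val glr = new GeneralizedLinearRegression()
  .setFamily("tweedie")
  .setVariancePower(1.6) // compound Poisson-gamma range (1, 2); value is illustrative
  .setLinkPower(0.0)     // log link

val model = glr.fit(df)
println(model.coefficients)
```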
-
郭小龙 10207633 authored
## What changes were proposed in this pull request? 1. The REST API docs for '/applications/[app-id]/stages' should add the description '?status=[active|complete|pending|failed] list only stages in the given state.' Without this description, users of the API do not know that the status parameter can be used to filter the stage list. 2. For '/applications/[app-id]/stages/[stage-id]' in the REST API, remove the redundant description '?status=[active|complete|pending|failed] list only stages in the given state.', because a single stage is already determined by its stage-id. The relevant handler code:

```scala
@GET
def stageList(@QueryParam("status") statuses: JList[StageStatus]): Seq[StageData] = {
  val listener = ui.jobProgressListener
  val stageAndStatus = AllStagesResource.stagesAndStatus(ui)
  val adjStatuses = {
    if (statuses.isEmpty()) {
      Arrays.asList(StageStatus.values(): _*)
    } else {
      statuses
    }
  }
```

## How was this patch tested? manual tests Please review http://spark.apache.org/contributing.html before opening a pull request. Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn> Closes #17534 from guoxiaolongzte/SPARK-20218.
-
Liang-Chi Hsieh authored
## What changes were proposed in this pull request? The DataFrame-based support for correlation statistics was added in #17108. This patch adds the Python interface for it. ## How was this patch tested? Python unit test. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #17494 from viirya/correlation-python-api.
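For reference, a minimal spark-shell sketch of the DataFrame-based Scala API that the new Python interface wraps; the toy vectors and the `features` column name are illustrative:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.stat.Correlation

import spark.implicits._

// Toy data: a single vector column (values are illustrative).
val df = Seq(
  Vectors.dense(1.0, 0.0, 3.0),
  Vectors.dense(2.0, 5.0, 1.0),
  Vectors.dense(4.0, 2.0, 8.0)
).map(Tuple1.apply).toDF("features")

// Returns a one-row DataFrame holding the Pearson correlation matrix.
val pearson = Correlation.corr(df, "features")
pearson.show(truncate = false)
```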
-
Wenchen Fan authored
## What changes were proposed in this pull request? Currently `LogicalRelation` has an `expectedOutputAttributes` parameter, which makes it hard to reason about what the actual output is. Like other leaf nodes, `LogicalRelation` should also take `output` as a parameter, to simplify the logic. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #17552 from cloud-fan/minor.
-
- Apr 06, 2017
-
-
Reynold Xin authored
## What changes were proposed in this pull request? This is a tiny addendum to SPARK-19495 to remove the private visibility for copy, which is the only package private method in the entire file. ## How was this patch tested? N/A - no semantic change. Author: Reynold Xin <rxin@databricks.com> Closes #17555 from rxin/SPARK-19495-2.
-
Dustin Koupal authored
## What changes were proposed in this pull request? Fix typo in hive examples from "DaraFrames" to "DataFrames" ## How was this patch tested? N/A Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Dustin Koupal <dkoupal@blizzard.com> Closes #17554 from cooper6581/typo-daraframes.
-
jerryshao authored
## What changes were proposed in this pull request? With [SPARK-13992](https://issues.apache.org/jira/browse/SPARK-13992), Spark supports persisting data into off-heap memory, but the usage of on-heap and off-heap memory is not currently exposed, which makes it inconvenient for users to monitor and profile. This change proposes to expose off-heap memory usage alongside on-heap memory usage in several places: 1. Spark UI's executor page displays both on-heap and off-heap memory usage. 2. REST requests return both on-heap and off-heap memory. 3. Both are also available from the MetricsSystem. 4. Both can be obtained programmatically from SparkListener. (A screenshot of the UI changes is attached in the PR.) Backward compatibility is also considered for the event log and REST API: old event logs can still be replayed, with off-heap usage displayed as 0, and the REST API only adds new fields, so JSON backward compatibility is kept. ## How was this patch tested? Unit test added and manual verification. Author: jerryshao <sshao@hortonworks.com> Closes #14617 from jerryshao/SPARK-17019.
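As a hedged illustration of the feature being surfaced here, a spark-shell-style sketch that enables off-heap storage memory (the 512m size is arbitrary) and caches a dataset off-heap, so the new on-heap/off-heap breakdown has something to show:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Illustrative settings.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("offheap-usage-example")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "512m")
  .getOrCreate()

// Cache a dataset off-heap; its usage should then appear in the executor
// page and REST API alongside the on-heap numbers described above.
val ds = spark.range(0L, 1000000L)
ds.persist(StorageLevel.OFF_HEAP)
ds.count()
```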
-
Felix Cheung authored
## What changes were proposed in this pull request? Following up on #17483, add createTable (which is new in 2.2.0) and deprecate createExternalTable, plus a number of minor fixes ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17511 from felixcheung/rceatetable.
-
Felix Cheung authored
[SPARK-20196][PYTHON][SQL] update doc for catalog functions for all languages, add pyspark refreshByPath API ## What changes were proposed in this pull request? Update doc to remove external for createTable, add refreshByPath in python ## How was this patch tested? manual Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17512 from felixcheung/catalogdoc.
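For context, a one-line sketch of the Scala `Catalog.refreshByPath` call that the new PySpark method mirrors; the path is hypothetical and assumes a spark-shell session:

```scala
// Invalidate cached data/metadata for any cached table whose files live under
// this path (e.g. after an external process rewrote them). Path is illustrative.
spark.catalog.refreshByPath("/tmp/warehouse/events_parquet")
```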
-
setjet authored
## What changes were proposed in this pull request? The PySpark version in version.py was lagging behind. Versioning is in line with PEP 440: https://www.python.org/dev/peps/pep-0440/ ## How was this patch tested? Simply rebuild the project with existing tests Author: setjet <rubenljanssen@gmail.com> Author: Ruben Janssen <rubenljanssen@gmail.com> Closes #17523 from setjet/SPARK-20064.
-
Kalvin Chau authored
## What changes were proposed in this pull request? Add the spark.mesos.task.labels configuration option to add Mesos key:value labels to the executor. The format is "k1:v1,k2:v2", with a colon separating key and value and commas listing more than one label. Discussion of labels with mgummelt at #17404 ## How was this patch tested? Added unit tests to verify labels were added correctly, with incorrect labels being ignored, and added a test for the name of the executor. Tested with: `./build/sbt -Pmesos mesos/test` Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Kalvin Chau <kalvin.chau@viasat.com> Closes #17413 from kalvinnchau/mesos-labels.
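A minimal sketch of setting the new option from Scala, using the key:value, comma-separated format described above; the label values are made up:

```scala
import org.apache.spark.SparkConf

// Labels are illustrative; the format is "k1:v1,k2:v2" as described above.
val conf = new SparkConf()
  .setAppName("mesos-labels-example")
  .set("spark.mesos.task.labels", "environment:staging,owner:data-platform")
```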
-
Bryan Cutler authored
## What changes were proposed in this pull request? The ML `RandomForestClassificationModel` and `RandomForestRegressionModel` were not using the estimator parent UID when being fit. This change fixes that so the models can be properly identified with their parents. ## How was this patch tested? Existing tests. Added a check to verify that the model uid matches that of the parent, then renamed `checkCopy` to `checkCopyAndUids` and verified that it was called by one test for each ML algorithm. Author: Bryan Cutler <cutlerb@gmail.com> Closes #17296 from BryanCutler/rfmodels-use-parent-uid-SPARK-19953.
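A small spark-shell sketch of the property the added check verifies, using toy training data and assuming the fixed code:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors

import spark.implicits._

// Tiny illustrative training set.
val training = Seq(
  (0.0, Vectors.dense(0.0, 1.0)),
  (1.0, Vectors.dense(1.0, 0.0))
).toDF("label", "features")

val rf = new RandomForestClassifier().setNumTrees(3)
val model = rf.fit(training)

// With this fix, the fitted model carries its estimator's UID.
assert(model.uid == rf.uid)
```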
-
- Apr 05, 2017
-
-
Eric Liang authored
## What changes were proposed in this pull request? If tasks throw non-interrupted exceptions on kill (e.g. java.nio.channels.ClosedByInterruptException), their death is reported back as TaskFailed instead of TaskKilled. This causes stage failure in some cases. This is reproducible as follows: run the following, and then use SparkContext.killTaskAttempt to kill one of the tasks. The entire stage will fail, since we threw a RuntimeException instead of InterruptedException.

```scala
spark.range(100).repartition(100).foreach { i =>
  try {
    Thread.sleep(10000000)
  } catch {
    case t: InterruptedException => throw new RuntimeException(t)
  }
}
```

Based on the code in TaskSetManager, I think this also affects kills of speculative tasks. However, since the number of speculated tasks is small, and usually you need to fail a task a few times before the stage is cancelled, it is unlikely this would be noticed in production unless speculation was enabled and the number of allowed task failures was 1. We should probably unconditionally return TaskKilled instead of TaskFailed if the task was killed by the driver, regardless of the actual exception thrown. ## How was this patch tested? Unit test. The test fails before the change in Executor.scala. cc JoshRosen Author: Eric Liang <ekl@databricks.com> Closes #17531 from ericl/fix-task-interrupt.
-
Ioana Delaney authored
## What changes were proposed in this pull request? This commit moves star schema code from ```join.scala``` to ```StarSchemaDetection.scala```. It also applies some minor fixes in ```StarJoinReorderSuite.scala```. ## How was this patch tested? Run existing ```StarJoinReorderSuite.scala```. Author: Ioana Delaney <ioanamdelaney@gmail.com> Closes #17544 from ioana-delaney/starSchemaCBOv2.
-
Liang-Chi Hsieh authored
## What changes were proposed in this pull request? `_convert_to_vector` converts a scipy sparse matrix to a csc matrix for initializing `SparseVector`. However, it doesn't guarantee that the converted csc matrix has sorted indices, so a failure happens when you do something like this:

```python
from scipy.sparse import lil_matrix
lil = lil_matrix((4, 1))
lil[1, 0] = 1
lil[3, 0] = 2
_convert_to_vector(lil.todok())
```

```
  File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
    return SparseVector(l.shape[0], csc.indices, csc.data)
  File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
    % (self.indices[i], self.indices[i + 1]))
TypeError: Indices 3 and 1 are not strictly increasing
```

A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices:

```python
>>> from scipy.sparse import lil_matrix
>>> lil = lil_matrix((4, 1))
>>> lil[1, 0] = 1
>>> lil[3, 0] = 2
>>> dok = lil.todok()
>>> csc = dok.tocsc()
>>> csc.has_sorted_indices
0
>>> csc.indices
array([3, 1], dtype=int32)
```

I checked the source code of scipy. The only way to guarantee it is `csc_matrix.tocsr()` and `csr_matrix.tocsc()`. ## How was this patch tested? Existing tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #17532 from viirya/make-sure-sorted-indices.
-
Dilip Biswal authored
## What changes were proposed in this pull request? Make sure SESSION_LOCAL_TIMEZONE reflects a change in the JVM's default timezone setting. Currently several timezone-related tests fail because the change to the default timezone is not picked up by SQLConf. ## How was this patch tested? Added a unit test in ConfigEntrySuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #17537 from dilipbiswal/timezone_debug.
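A hedged spark-shell sketch of the behavior being fixed, assuming the conf key behind SESSION_LOCAL_TIMEZONE is `spark.sql.session.timeZone`:

```scala
import java.util.TimeZone

// Change the JVM default timezone (the zone chosen here is illustrative)...
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))

// ...and, with this fix, the session timezone default (when not set explicitly)
// should reflect the new JVM default.
println(spark.conf.get("spark.sql.session.timeZone"))
```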
-
Tathagata Das authored
## What changes were proposed in this pull request? - Fixed bug in Java API not passing timeout conf to scala API - Updated markdown docs - Updated scala docs - Added scala and Java example ## How was this patch tested? Manually ran examples. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #17539 from tdas/SPARK-20224.
-
zero323 authored
## What changes were proposed in this pull request?
- Allows skipping the `value` argument if `to_replace` is a `dict`:

```python
df = sc.parallelize([("Alice", 1, 3.0)]).toDF()
df.replace({"Alice": "Bob"}).show()
```

- Adds a validation step to ensure homogeneous values / replacements.
- Simplifies internal control flow.
- Improves unit test coverage.

## How was this patch tested? Existing unit tests, additional unit tests, manual testing. Author: zero323 <zero323@users.noreply.github.com> Closes #16793 from zero323/SPARK-19454.
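For comparison, the equivalent operation on the Scala side goes through `DataFrameNaFunctions.replace`; a quick spark-shell sketch with illustrative column names:

```scala
import spark.implicits._

val df = Seq(("Alice", 1, 3.0)).toDF("name", "age", "height")

// Replace values in the "name" column using a map of old -> new values.
df.na.replace("name", Map("Alice" -> "Bob")).show()
```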
-
wangzhenhua authored
## What changes were proposed in this pull request? Fix typo in tpcds q77.sql ## How was this patch tested? N/A Author: wangzhenhua <wangzhenhua@huawei.com> Closes #17538 from wzhfy/typoQ77.
-
shaolinliu authored
## What changes were proposed in this pull request? When a user kills a stage using the web UI (on the Stages page), StagesTab.handleKillRequest requests SparkContext to cancel the stage without giving a reason. SparkContext has cancelStage(stageId: Int, reason: String) that Spark could use to pass the information along for monitoring/debugging purposes. ## How was this patch tested? manual tests Please review http://spark.apache.org/contributing.html before opening a pull request. Author: shaolinliu <liu.shaolin1@zte.com.cn> Author: lvdongr <lv.dongdong@zte.com.cn> Closes #17258 from shaolinliu/SPARK-19807.
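A minimal sketch of the `cancelStage` overload mentioned above (run in spark-shell); the stage id is hypothetical and would normally come from the UI or a SparkListener:

```scala
val sc = spark.sparkContext

// Stage id 3 is purely illustrative.
sc.cancelStage(3, "Killed from the Stages page of the web UI")
```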
-
Oliver Köth authored
With spark.ui.reverseProxy=true, full path URLs like /log will point to the master web endpoint, which serves the worker UI as a reverse proxy. To access a REST endpoint on the worker in reverse proxy mode, the leading /proxy/"target"/ part of the base URI must be retained. Added logic to log-view.js to handle this, similar to executorspage.js. The patch was tested manually. Author: Oliver Köth <okoeth@de.ibm.com> Closes #17370 from okoethibm/master.
-
Tathagata Das authored
[SPARK-20209][SS] Execute next trigger immediately if previous batch took longer than trigger interval ## What changes were proposed in this pull request? For large trigger intervals (e.g. 10 minutes), if a batch takes 11 minutes, then it will wait for 9 minutes before starting the next batch. This does not make sense. The processing-time-based trigger policy should be to process batches as fast as possible, but no faster than one per trigger interval. If batches already take longer than the trigger interval, there is no point waiting an extra trigger interval. In this PR, I modified ProcessingTimeExecutor to do so. Another minor change was to extract StreamManualClock into a separate class so that it can be used outside subclasses of StreamTest. For example, ProcessingTimeExecutorSuite does not need to create any context for testing; it just needs the StreamManualClock. ## How was this patch tested? Added new unit tests to comprehensively test this behavior. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #17525 from tdas/SPARK-20209.
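To make the trigger semantics concrete, a hedged structured-streaming sketch (the rate source and console sink are illustrative, assuming a spark-shell session); with this change, a batch that overruns the interval is followed immediately by the next one:

```scala
import org.apache.spark.sql.streaming.Trigger

// Illustrative streaming source.
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// Process at most one batch per 10 seconds; if a batch takes longer than that,
// the next batch now starts right away instead of waiting for the next tick.
val query = stream.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
```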
-
Reynold Xin authored
-
Felix Cheung authored
## What changes were proposed in this pull request? minor update zero323 Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17526 from felixcheung/rfpgrowthfollowup.
-
- Apr 04, 2017
-
-
Yuhao Yang authored
## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-20003 I was doing some tests and found the issue. ml.fpm.FPGrowthModel `setMinConfidence` should always affect rule generation and transform. Currently associationRules in FPGrowthModel is a lazy val, and `setMinConfidence` in FPGrowthModel has no impact once associationRules has been computed. I cache the associationRules to avoid re-computation when `minConfidence` is not changed, but this makes FPGrowthModel somewhat stateful. Let me know if there's any concern. ## How was this patch tested? New unit test, and I strengthened the unit test for model save/load to ensure the cache mechanism. Author: Yuhao Yang <yuhao.yang@intel.com> Closes #17336 from hhbyyh/fpmodelminconf.
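A short spark-shell sketch of the behavior under discussion, with toy transactions; against the patched code, changing `minConfidence` on the fitted model should now affect the generated rules:

```scala
import org.apache.spark.ml.fpm.FPGrowth

import spark.implicits._

// Toy transactions, purely illustrative.
val dataset = Seq("1 2 5", "1 2 3 5", "1 2")
  .map(_.split(" "))
  .toDF("items")

val model = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
  .fit(dataset)

model.associationRules.show()

// Tightening the confidence threshold on the fitted model is now reflected
// in subsequent rule generation and transform().
model.setMinConfidence(0.9)
model.associationRules.show()
```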
-
Seth Hendrickson authored
## What changes were proposed in this pull request? This is a small piece from https://github.com/apache/spark/pull/16722 which ultimately will add sample weights to decision trees. This is to allow more flexibility in testing outliers since linear models and trees behave differently. Note: The primary author when this is committed should be sethah since this is taken from his code. ## How was this patch tested? Existing tests Author: Joseph K. Bradley <joseph@databricks.com> Closes #17501 from jkbradley/SPARK-20183.
-
Wenchen Fan authored
## What changes were proposed in this pull request? Previously, when we construct the deserializer expression for an array type, we first cast the corresponding field to the expected array type and then apply `MapObjects`. However, by doing that, we lose the opportunity to do by-name resolution for struct types inside the array type. In this PR, I introduce an `UnresolvedMapObjects` to hold the lambda function and the input array expression. Then during analysis, after the input array expression is resolved, we get the actual array element type and apply by-name resolution. We then don't need to add a `Cast` for the array type when constructing the deserializer expression, as the element type is determined later, in the analyzer. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #17398 from cloud-fan/dataset.
-
Wenchen Fan authored
## What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/17285 . ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #17521 from cloud-fan/conf.
-
hyukjinkwon authored
## What changes were proposed in this pull request? It seems the CRAN check script corrects `R/pkg/DESCRIPTION` and follows the order in the `Collate` field. This PR proposes to fix `catalog.R`'s position in that order so that running the script does not produce a small diff in this file every time. ## How was this patch tested? Manually via `./R/check-cran.sh`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17528 from HyukjinKwon/minor-reorder-description.
-
Marcelo Vanzin authored
Current test code tries to override the RackResolver used by setting configuration params, but because YARN libs statically initialize the resolver the first time it's used, that means that those configs don't really take effect during Spark tests. This change adds a wrapper class that easily allows tests to override the behavior of the resolver for the Spark code that uses it. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #17508 from vanzin/SPARK-20191.
-
Anirudh Ramanathan authored
## What changes were proposed in this pull request? Adding documentation to point to Kubernetes cluster scheduler being developed out-of-repo in https://github.com/apache-spark-on-k8s/spark cc rxin srowen tnachen ash211 mccheah erikerlandson ## How was this patch tested? Docs only change Author: Anirudh Ramanathan <foxish@users.noreply.github.com> Author: foxish <ramanathana@google.com> Closes #17522 from foxish/upstream-doc.
-
Xiao Li authored
[SPARK-20198][SQL] Remove the inconsistency in table/function name conventions in SparkSession.Catalog APIs ### What changes were proposed in this pull request? Observed by felixcheung: in `SparkSession`.`Catalog` APIs, we have different conventions/rules for table/function identifiers/names. Most APIs accept the qualified name (i.e., `databaseName`.`tableName` or `databaseName`.`functionName`). However, the following five APIs do not accept it. - def listColumns(tableName: String): Dataset[Column] - def getTable(tableName: String): Table - def getFunction(functionName: String): Function - def tableExists(tableName: String): Boolean - def functionExists(functionName: String): Boolean To make them consistent with the other Catalog APIs, this PR makes the changes, updates the function/API comments, and adds the `params` to clarify the inputs we allow. ### How was this patch tested? Added test cases. Author: Xiao Li <gatorsmile@gmail.com> Closes #17518 from gatorsmile/tableIdentifier.
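A spark-shell sketch of the consistency this brings, using illustrative database/table names:

```scala
spark.sql("CREATE DATABASE IF NOT EXISTS sales")
spark.sql("CREATE TABLE IF NOT EXISTS sales.orders (id INT, amount DOUBLE) USING parquet")

// After this change, qualified names work in these five APIs as well,
// consistent with the rest of the Catalog interface.
println(spark.catalog.tableExists("sales.orders"))
spark.catalog.listColumns("sales.orders").show()
```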
-
guoxiaolongzte authored
…ucceeded|failed|unknown] ## What changes were proposed in this pull request? The status values for '/applications/[app-id]/jobs' in the REST API should be '[running|succeeded|failed|unknown]'. Currently the documented status values are '[complete|succeeded|failed]', but for '/applications/[app-id]/jobs?status=complete' the server returns 'HTTP ERROR 404'. Added '?status=running' and '?status=unknown'. The relevant code: `public enum JobExecutionStatus { RUNNING, SUCCEEDED, FAILED, UNKNOWN; }` ## How was this patch tested? manual tests Please review http://spark.apache.org/contributing.html before opening a pull request. Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn> Closes #17507 from guoxiaolongzte/SPARK-20190.
-
zero323 authored
## What changes were proposed in this pull request? Adds SparkR API for FPGrowth: [SPARK-19825](https://issues.apache.org/jira/browse/SPARK-19825): - `spark.fpGrowth` - model training. - `freqItemsets` and `associationRules` methods with new corresponding generics. - Scala helper: `org.apache.spark.ml.r.FPGrowthWrapper` - unit tests. ## How was this patch tested? Feature-specific unit tests. Author: zero323 <zero323@users.noreply.github.com> Closes #17170 from zero323/SPARK-19825.
-
Xiao Li authored
### What changes were proposed in this pull request? This PR is to unify and clean up the outputs of `DESC EXTENDED/FORMATTED` and `SHOW TABLE EXTENDED` by moving the logic into the Catalog interface. The output formats are improved, and we also add the missing attributes. It impacts DDL commands like `SHOW TABLE EXTENDED`, `DESC EXTENDED` and `DESC FORMATTED`. In addition, by following what we did in the Dataset API `printSchema`, we can use `treeString` to show the schema in a more readable way. Below is the current way:

```
Schema: STRUCT<`a`: STRING (nullable = true), `b`: INT (nullable = true), `c`: STRING (nullable = true), `d`: STRING (nullable = true)>
```

After the change, it should look like:

```
Schema: root
 |-- a: string (nullable = true)
 |-- b: integer (nullable = true)
 |-- c: string (nullable = true)
 |-- d: string (nullable = true)
```

### How was this patch tested? `describe.sql` and `show-tables.sql` Author: Xiao Li <gatorsmile@gmail.com> Closes #17394 from gatorsmile/descFollowUp.
-
- Apr 03, 2017
-
-
Dilip Biswal authored
## What changes were proposed in this pull request? **Description** from JIRA: The TimestampType in Spark SQL is of microsecond precision. Ideally, we should convert Spark SQL timestamp values into Parquet TIMESTAMP_MICROS, but unfortunately parquet-mr hasn't supported it yet. For the read path, we should be able to read TIMESTAMP_MILLIS Parquet values and pad a 0 microsecond part onto the read values. For the write path, we are currently writing timestamps as INT96, similar to Impala and Hive. One alternative is that we can have a separate SQL option to let users write Spark SQL timestamp values as TIMESTAMP_MILLIS. Of course, in this way the microsecond part will be truncated. ## How was this patch tested? Added new tests in ParquetQuerySuite and ParquetIOSuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #15332 from dilipbiswal/parquet-time-millis.
-
Ron Hu authored
## What changes were proposed in this pull request? In SQL queries, we also see predicate expressions involving two columns, such as "column-1 (op) column-2", where column-1 and column-2 belong to the same table. Note that if column-1 and column-2 belong to different tables, then it is a join operator's work, NOT a filter operator's work. This PR estimates filter selectivity on two columns of the same table. For example, multiple TPC-H queries have the predicate "WHERE l_commitdate < l_receiptdate". ## How was this patch tested? We added 6 new test cases to test various logical predicates involving two columns of the same table. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Ron Hu <ron.hu@huawei.com> Author: U-CHINA\r00754707 <r00754707@R00754707-SC04.china.huawei.com> Closes #17415 from ron8hu/filterTwoColumns.
-
samelamin authored
## What changes were proposed in this pull request? Range in SQL should be case insensitive ## How was this patch tested? unit test Author: samelamin <hussam.elamin@gmail.com> Author: samelamin <sam_elamin@discovery.com> Closes #17487 from samelamin/SPARK-20145.
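A one-line spark-shell sketch of what now works:

```scala
// The range table-valued function is matched case-insensitively after this change.
spark.sql("SELECT * FROM RaNgE(5)").show()
```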
-
Adrian Ionescu authored
## What changes were proposed in this pull request? This patch implements `listPartitionsByFilter()` for `InMemoryCatalog` and thus resolves an outstanding TODO causing the `PruneFileSourcePartitions` optimizer rule not to apply when "spark.sql.catalogImplementation" is set to "in-memory" (which is the default). The change is straightforward: it extracts the code for further filtering of the list of partitions returned by the metastore's `getPartitionsByFilter()` out from `HiveExternalCatalog` into `ExternalCatalogUtils` and calls this new function from `InMemoryCatalog` on the whole list of partitions. Now that this method is implemented we can always pass the `CatalogTable` to the `DataSource` in `FindDataSourceTable`, so that the latter is resolved to a relation with a `CatalogFileIndex`, which is what the `PruneFileSourcePartitions` rule matches for. ## How was this patch tested? Ran existing tests and added new test for `listPartitionsByFilter` in `ExternalCatalogSuite`, which is subclassed by both `InMemoryCatalogSuite` and `HiveExternalCatalogSuite`. Author: Adrian Ionescu <adrian@databricks.com> Closes #17510 from adrian-ionescu/InMemoryCatalog.
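To see the rule apply with the default (in-memory) catalog, a hedged spark-shell sketch with an illustrative partitioned table:

```scala
import spark.implicits._

// Illustrative partitioned data source table.
spark.range(10)
  .withColumn("p", $"id" % 2)
  .write
  .partitionBy("p")
  .saveAsTable("t_partitioned")

// With listPartitionsByFilter implemented for InMemoryCatalog,
// PruneFileSourcePartitions can prune the scan down to the matching partition.
spark.table("t_partitioned").where($"p" === 1).explain()
```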
-
hyukjinkwon authored
[SPARK-19641][SQL] JSON schema inference in DROPMALFORMED mode produces incorrect schema for non-array/object JSONs ## What changes were proposed in this pull request? Currently, when we infer the types for valid JSON strings that are not objects or arrays, we produce empty schemas regardless of parse mode, as below:

```scala
scala> spark.read.option("mode", "DROPMALFORMED").json(Seq("""{"a": 1}""", """"a"""").toDS).printSchema()
root
```

```scala
scala> spark.read.option("mode", "FAILFAST").json(Seq("""{"a": 1}""", """"a"""").toDS).printSchema()
root
```

This PR proposes to handle parse modes in type inference. After this PR:

```scala
scala> spark.read.option("mode", "DROPMALFORMED").json(Seq("""{"a": 1}""", """"a"""").toDS).printSchema()
root
 |-- a: long (nullable = true)
```

```
scala> spark.read.option("mode", "FAILFAST").json(Seq("""{"a": 1}""", """"a"""").toDS).printSchema()
java.lang.RuntimeException: Failed to infer a common schema. Struct types are expected but string was found.
```

This PR is based on https://github.com/NathanHowell/spark/commit/e233fd03346a73b3b447fa4c24f3b12c8b2e53ae and NathanHowell and I talked about this in https://issues.apache.org/jira/browse/SPARK-19641 ## How was this patch tested? Unit tests in `JsonSuite` for both `DROPMALFORMED` and `FAILFAST` modes. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17492 from HyukjinKwon/SPARK-19641.
-