Commits · ea662286561aa9fe321cb0a0e10cdeaf60440b90 · cs525-sp18-g07 / spark

Aug 30, 2016

[SPARK-17264][SQL] DataStreamWriter should document that it only supports Parquet for now · befab9c1

Sean Owen authored 8 years ago

## What changes were proposed in this pull request?

Clarify that only parquet files are supported by DataStreamWriter now

## How was this patch tested?

(Doc build -- no functional changes to test)

Author: Sean Owen <sowen@cloudera.com>

Closes #14860 from srowen/SPARK-17264.

befab9c1

Aug 25, 2016

[SPARK-17215][SQL] Method `SQLContext.parseDataType(dataTypeString: String)` could be removed. · 5f02d2e5

jiangxingbo authored 8 years ago

## What changes were proposed in this pull request?

Method `SQLContext.parseDataType(dataTypeString: String)` could be removed, we should use `SparkSession.parseDataType(dataTypeString: String)` instead.
This require updating PySpark.

## How was this patch tested?

Existing test cases.

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes #14790 from jiangxb1987/parseDataType.

5f02d2e5

Aug 24, 2016

[SPARK-16216][SQL] Read/write timestamps and dates in ISO 8601 and... · 29952ed0

hyukjinkwon authored 8 years ago

[SPARK-16216][SQL] Read/write timestamps and dates in ISO 8601 and dateFormat/timestampFormat option for CSV and JSON

## What changes were proposed in this pull request?

### Default - ISO 8601

Currently, CSV datasource is writing `Timestamp` and `Date` as numeric form and JSON datasource is writing both as below:

- CSV
  ```
  // TimestampType
  1414459800000000
  // DateType
  16673
  ```

- Json

  ```
  // TimestampType
  1970-01-01 11:46:40.0
  // DateType
  1970-01-01
  ```

So, for CSV we can't read back what we write and for JSON it becomes ambiguous because the timezone is being missed.

So, this PR make both **write** `Timestamp` and `Date` in ISO 8601 formatted string (please refer the [ISO 8601 specification](https://www.w3.org/TR/NOTE-datetime)).

- For `Timestamp` it becomes as below: (`yyyy-MM-dd'T'HH:mm:ss.SSSZZ`)

  ```
  1970-01-01T02:00:01.000-01:00
  ```

- For `Date` it becomes as below (`yyyy-MM-dd`)

  ```
  1970-01-01
  ```

### Custom date format option - `dateFormat`

This PR also adds the support to write and read dates and timestamps in a formatted string as below:

- **DateType**

  - With `dateFormat` option (e.g. `yyyy/MM/dd`)

    ```
    +----------+
    |      date|
    +----------+
    |2015/08/26|
    |2014/10/27|
    |2016/01/28|
    +----------+
    ```

### Custom date format option - `timestampFormat`

- **TimestampType**

  - With `dateFormat` option (e.g. `dd/MM/yyyy HH:mm`)

    ```
    +----------------+
    |            date|
    +----------------+
    |2015/08/26 18:00|
    |2014/10/27 18:30|
    |2016/01/28 20:00|
    +----------------+
    ```

## How was this patch tested?

Unit tests were added in `CSVSuite` and `JsonSuite`. For JSON, existing tests cover the default cases.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14279 from HyukjinKwon/SPARK-16216-json-csv.

29952ed0

Jul 28, 2016

[SPARK-16772] Correct API doc references to PySpark classes + formatting fixes · 274f3b9e

Nicholas Chammas authored 8 years ago

## What's Been Changed

The PR corrects several broken or missing class references in the Python API docs. It also correct formatting problems.

For example, you can see [here](http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.SQLContext.registerFunction) how Sphinx is not picking up the reference to `DataType`. That's because the reference is relative to the current module, whereas `DataType` is in a different module.

You can also see [here](http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame) how the formatting for byte, tinyint, and so on is italic instead of monospace. That's because in ReST single backticks just make things italic, unlike in Markdown.

## Testing

I tested this PR by [building the Python docs](https://github.com/apache/spark/tree/master/docs#generating-the-documentation-html) and reviewing the results locally in my browser. I confirmed that the broken or missing class references were resolved, and that the formatting was corrected.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #14393 from nchammas/python-docstring-fixes.

274f3b9e

Jul 01, 2016

[SPARK-16335][SQL] Structured streaming should fail if source directory does not exist · d601894c

Reynold Xin authored 9 years ago

## What changes were proposed in this pull request?
In structured streaming, Spark does not report errors when the specified directory does not exist. This is a behavior different from the batch mode. This patch changes the behavior to fail if the directory does not exist (when the path is not a glob pattern).

## How was this patch tested?
Updated unit tests to reflect the new behavior.

Author: Reynold Xin <rxin@databricks.com>

Closes #14002 from rxin/SPARK-16335.

d601894c

Jun 30, 2016

[SPARK-16313][SQL] Spark should not silently drop exceptions in file listing · 3d75a5b2

Reynold Xin authored 9 years ago

## What changes were proposed in this pull request?
Spark silently drops exceptions during file listing. This is a very bad behavior because it can mask legitimate errors and the resulting plan will silently have 0 rows. This patch changes it to not silently drop the errors.

## How was this patch tested?
Manually verified.

Author: Reynold Xin <rxin@databricks.com>

Closes #13987 from rxin/SPARK-16313.

3d75a5b2

Jun 29, 2016

[SPARK-16266][SQL][STREAING] Moved DataStreamReader/Writer from pyspark.sql to... · f454a7f9

Tathagata Das authored 9 years ago

[SPARK-16266][SQL][STREAING] Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming

## What changes were proposed in this pull request?

- Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming to make them consistent with scala packaging
- Exposed the necessary classes in sql.streaming package so that they appear in the docs
- Added pyspark.sql.streaming module to the docs

## How was this patch tested?
- updated unit tests.
- generated docs for testing visibility of pyspark.sql.streaming classes.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13955 from tdas/SPARK-16266.

f454a7f9

Jun 15, 2016

[SPARK-15953][WIP][STREAMING] Renamed ContinuousQuery to StreamingQuery · 9a507199

Tathagata Das authored 9 years ago

Renamed for simplicity, so that its obvious that its related to streaming.

Existing unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13673 from tdas/SPARK-15953.

9a507199

Jun 14, 2016

[SPARK-15935][PYSPARK] Fix a wrong format tag in the error message · 0ee9fd9e

Shixiong Zhu authored 9 years ago

## What changes were proposed in this pull request?

A follow up PR for #13655 to fix a wrong format tag.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13665 from zsxwing/fix.

0ee9fd9e

[SPARK-15933][SQL][STREAMING] Refactored DF reader-writer to use readStream... · 214adb14

Tathagata Das authored 9 years ago

[SPARK-15933][SQL][STREAMING] Refactored DF reader-writer to use readStream and writeStream for streaming DFs

## What changes were proposed in this pull request?
Currently, the DataFrameReader/Writer has method that are needed for streaming and non-streaming DFs. This is quite awkward because each method in them through runtime exception for one case or the other. So rather having half the methods throw runtime exceptions, its just better to have a different reader/writer API for streams.

- [x] Python API!!

## How was this patch tested?
Existing unit tests + two sets of unit tests for DataFrameReader/Writer and DataStreamReader/Writer.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13653 from tdas/SPARK-15933.

214adb14

[SPARK-15935][PYSPARK] Enable test for sql/streaming.py and fix these tests · 96c3500c

Shixiong Zhu authored 9 years ago

## What changes were proposed in this pull request?

This PR just enables tests for sql/streaming.py and also fixes the failures.

## How was this patch tested?

Existing unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13655 from zsxwing/python-streaming-test.

96c3500c

Jun 06, 2016

[MINOR] Fix Typos 'an -> a' · fd8af397

Zheng RuiFeng authored 9 years ago

## What changes were proposed in this pull request?

`an -> a`

Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13515 from zhengruifeng/an_a.

fd8af397

Jun 01, 2016

[SPARK-15686][SQL] Move user-facing streaming classes into sql.streaming · a71d1364

Reynold Xin authored 9 years ago

## What changes were proposed in this pull request?
This patch moves all user-facing structured streaming classes into sql.streaming. As part of this, I also added some since version annotation to methods and classes that don't have them.

## How was this patch tested?
Updated tests to reflect the moves.

Author: Reynold Xin <rxin@databricks.com>

Closes #13429 from rxin/SPARK-15686.

a71d1364

May 04, 2016

[SPARK-14896][SQL] Deprecate HiveContext in python · fa79d346

Andrew Or authored 9 years ago

## What changes were proposed in this pull request?

See title.

## How was this patch tested?

PySpark tests.

Author: Andrew Or <andrew@databricks.com>

Closes #12917 from andrewor14/deprecate-hive-context-python.

fa79d346

Apr 28, 2016

[SPARK-14555] Second cut of Python API for Structured Streaming · 78c8aaf8

Burak Yavuz authored 9 years ago

## What changes were proposed in this pull request?

This PR adds Python APIs for:
 - `ContinuousQueryManager`
 - `ContinuousQueryException`

The `ContinuousQueryException` is a very basic wrapper, it doesn't provide the functionality that the Scala side provides, but it follows the same pattern for `AnalysisException`.

For `ContinuousQueryManager`, all APIs are provided except for registering listeners.

This PR also attempts to fix test flakiness by stopping all active streams just before tests.

## How was this patch tested?

Python Doc tests and unit tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #12673 from brkyvz/pyspark-cqm.

78c8aaf8

Apr 20, 2016

[SPARK-14555] First cut of Python API for Structured Streaming · 80bf48f4

Burak Yavuz authored 9 years ago

## What changes were proposed in this pull request?

This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes:
 - ContinuousQuery
 - Trigger
 - ProcessingTime
in pyspark under `pyspark.sql.streaming`.

In addition, it contains the new methods added under:
 -  `DataFrameWriter`
     a) `startStream`
     b) `trigger`
     c) `queryName`

 -  `DataFrameReader`
     a) `stream`

 - `DataFrame`
    a) `isStreaming`

This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
 - `exception`
 - `sourceStatuses`
 - `sinkStatus`

They may be added in a follow up.

This PR also contains some very minor doc fixes in the Scala side.

## How was this patch tested?

Python doc tests

TODO:
 - [ ] verify Python docs look good

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

Closes #12320 from brkyvz/stream-python.

80bf48f4