Commits · dab10516138867b7c4fc6d42168497e82853b539 · cs525-sp18-g07 / spark

Jun 30, 2016

[SPARK-16328][ML][MLLIB][PYSPARK] Add 'asML' and 'fromML' conversion methods to PySpark linalg · dab10516

Nick Pentreath authored 8 years ago

The move to `ml.linalg` created `asML`/`fromML` utility methods in Scala/Java for converting between representations. These are missing in Python, this PR adds them.

## How was this patch tested?

New doctests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13997 from MLnick/SPARK-16328-python-linalg-convert.

dab10516

[SPARK-16276][SQL] Implement elt SQL function · 85f2303e

petermaxlee authored 8 years ago

## What changes were proposed in this pull request?
This patch implements the elt function, as it is implemented in Hive.

## How was this patch tested?
Added expression unit test in StringExpressionsSuite and end-to-end test in StringFunctionsSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #13966 from petermaxlee/SPARK-16276.

85f2303e

[SPARK-16313][SQL] Spark should not silently drop exceptions in file listing · 3d75a5b2

Reynold Xin authored 8 years ago

## What changes were proposed in this pull request?
Spark silently drops exceptions during file listing. This is a very bad behavior because it can mask legitimate errors and the resulting plan will silently have 0 rows. This patch changes it to not silently drop the errors.

## How was this patch tested?
Manually verified.

Author: Reynold Xin <rxin@databricks.com>

Closes #13987 from rxin/SPARK-16313.

3d75a5b2

[SPARK-16336][SQL] Suggest doing table refresh upon FileNotFoundException · fb41670c

petermaxlee authored 8 years ago

## What changes were proposed in this pull request?
This patch appends a message to suggest users running refresh table or reloading data frames when Spark sees a FileNotFoundException due to stale, cached metadata.

## How was this patch tested?
Added a unit test for this in MetadataCacheSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14003 from petermaxlee/SPARK-16336.

fb41670c

[SPARK-16256][DOCS] Fix window operation diagram · 5d00a7bc
Tathagata Das authored 8 years ago
```
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #14001 from tdas/SPARK-16256-2.
```
5d00a7bc

[SPARK-16212][STREAMING][KAFKA] code cleanup from review feedback · c6226334

cody koeninger authored 8 years ago

## What changes were proposed in this pull request?
code cleanup in kafka-0-8 to match suggested changes for kafka-0-10 branch

## How was this patch tested?
unit tests

Author: cody koeninger <cody@koeninger.org>

Closes #13908 from koeninger/kafka-0-8-cleanup.

c6226334

[SPARK-16289][SQL] Implement posexplode table generating function · 46395db8

Dongjoon Hyun authored 8 years ago

## What changes were proposed in this pull request?

This PR implements `posexplode` table generating function. Currently, master branch raises the following exception for `map` argument. It's different from Hive.

**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```

**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
|  0|  a|    1|
|  1|  b|    2|
+---+---+-----+
```

For `array` argument, `after` is the same with `before`.
```
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
|  0|  1|
|  1|  2|
|  2|  3|
+---+---+
```

## How was this patch tested?

Pass the Jenkins tests with newly added testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13971 from dongjoon-hyun/SPARK-16289.

46395db8

[SPARK-15865][CORE] Blacklist should not result in job hanging with less than 4 executors · fdf9f94f

Imran Rashid authored 8 years ago

## What changes were proposed in this pull request?

Before this change, when you turn on blacklisting with `spark.scheduler.executorTaskBlacklistTime`, but you have fewer than `spark.task.maxFailures` executors, you can end with a job "hung" after some task failures.

Whenever a taskset is unable to schedule anything on resourceOfferSingleTaskSet, we check whether the last pending task can be scheduled on *any* known executor.  If not, the taskset (and any corresponding jobs) are failed.
* Worst case, this is O(maxTaskFailures + numTasks).  But unless many executors are bad, this should be small
* This does not fail as fast as possible -- when a task becomes unschedulable, we keep scheduling other tasks.  This is to avoid an O(numPendingTasks * numExecutors) operation
* Also, it is conceivable this fails too quickly.  You may be 1 millisecond away from unblacklisting a place for a task to run, or acquiring a new executor.

## How was this patch tested?

Added unit test which failed before the change, ran new test 5k times manually, ran all scheduler tests manually, and the full suite via jenkins.

Author: Imran Rashid <irashid@cloudera.com>

Closes #13603 from squito/progress_w_few_execs_and_blacklist.

fdf9f94f

[SPARK-13850] Force the sorter to Spill when number of elements in th… · 07f46afc

Sital Kedia authored 8 years ago

## What changes were proposed in this pull request?

Force the sorter to Spill when number of elements in the pointer array reach a certain size. This is to workaround the issue of timSort failing on large buffer size.

## How was this patch tested?

Tested by running a job which was failing without this change due to TimSort bug.

Author: Sital Kedia <skedia@fb.com>

Closes #13107 from sitalkedia/fix_TimSort.

07f46afc

[SPARK-15820][PYSPARK][SQL] Add Catalog.refreshTable into python API · 5344bade

WeichenXu authored 8 years ago

## What changes were proposed in this pull request?

Add Catalog.refreshTable API into python interface for Spark-SQL.

## How was this patch tested?

Existing test.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13558 from WeichenXu123/update_python_sql_interface_refreshTable.

5344bade

[SPARK-16071][SQL] Checks size limit when doubling the array size in BufferHolder · 5320adc8

Sean Zhong authored 8 years ago

## What changes were proposed in this pull request?

This PR Checks the size limit when doubling the array size in BufferHolder to avoid integer overflow.

## How was this patch tested?

Manual test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13829 from clockfly/SPARK-16071_2.

5320adc8

[SPARK-12177][TEST] Removed test to avoid compilation issue in scala 2.10 · de8ab313

Tathagata Das authored 8 years ago

## What changes were proposed in this pull request?

The commented lines failed scala 2.10 build. This is because of change in behavior of case classes between 2.10 and 2.11. In scala 2.10, if companion object of a case class has explicitly defined apply(), then the implicit apply method is not generated. In scala 2.11 it is generated. Hence, the lines compile fine in 2.11 but not in 2.10.

This simply comments the tests to fix broken build. Correct solution is pending.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13992 from tdas/SPARK-12177.

de8ab313

[SPARK-16241][ML] model loading backward compatibility for ml NaiveBayes · b30a2dc7

zlpmichelle authored 8 years ago

## What changes were proposed in this pull request?

model loading backward compatibility for ml NaiveBayes

## How was this patch tested?

existing ut and manual test for loading models saved by Spark 1.6.

Author: zlpmichelle <zlpmichelle@gmail.com>

Closes #13940 from zlpmichelle/naivebayes.

b30a2dc7

[SPARK-16256][DOCS] Minor fixes on the Structured Streaming Programming Guide · 2c3d9613
Tathagata Das authored 8 years ago
```
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13978 from tdas/SPARK-16256-1.
```
2c3d9613

[SPARK-12177][STREAMING][KAFKA] Update KafkaDStreams to new Kafka 0.10 Consumer API · dedbceec

cody koeninger authored 8 years ago

## What changes were proposed in this pull request?

New Kafka consumer api for the released 0.10 version of Kafka

## How was this patch tested?

Unit tests, manual tests

Author: cody koeninger <cody@koeninger.org>

Closes #11863 from koeninger/kafka-0.9.

dedbceec

[SPARK-16294][SQL] Labelling support for the include_example Jekyll plugin · bde1d6a6

Cheng Lian authored 8 years ago

## What changes were proposed in this pull request?

This PR adds labelling support for the `include_example` Jekyll plugin, so that we may split a single source file into multiple line blocks with different labels, and include them in multiple code snippets in the generated HTML page.

## How was this patch tested?

Manually tested.

<img width="923" alt="screenshot at jun 29 19-53-21" src="https://cloud.githubusercontent.com/assets/230655/16451099/66a76db2-3e33-11e6-84fb-63104c2f0688.png">

Author: Cheng Lian <lian@databricks.com>

Closes #13972 from liancheng/include-example-with-labels.

bde1d6a6

Jun 29, 2016

[SPARK-16274][SQL] Implement xpath_boolean · d3af6731

petermaxlee authored 8 years ago

## What changes were proposed in this pull request?
This patch implements xpath_boolean expression for Spark SQL, a xpath function that returns true or false. The implementation is modelled after Hive's xpath_boolean, except that how the expression handles null inputs. Hive throws a NullPointerException at runtime if either of the input is null. This implementation returns null if either of the input is null.

## How was this patch tested?
Created two new test suites. One for unit tests covering the expression, and the other for end-to-end test in SQL.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #13964 from petermaxlee/SPARK-16274.

d3af6731

[SPARK-16267][TEST] Replace deprecated `CREATE TEMPORARY TABLE ... USING` from testsuites. · 831a04f5

Dongjoon Hyun authored 8 years ago

## What changes were proposed in this pull request?

After SPARK-15674, `DDLStrategy` prints out the following deprecation messages in the testsuites.

```
12:10:53.284 WARN org.apache.spark.sql.execution.SparkStrategies$DDLStrategy:
CREATE TEMPORARY TABLE normal_orc_source USING... is deprecated,
please use CREATE TEMPORARY VIEW viewName USING... instead
```

Total : 40
- JDBCWriteSuite: 14
- DDLSuite: 6
- TableScanSuite: 6
- ParquetSourceSuite: 5
- OrcSourceSuite: 2
- SQLQuerySuite: 2
- HiveCommandSuite: 2
- JsonSuite: 1
- PrunedScanSuite: 1
- FilteredScanSuite  1

This PR replaces `CREATE TEMPORARY TABLE` with `CREATE TEMPORARY VIEW` in order to remove the deprecation messages in the above testsuites except `DDLSuite`, `SQLQuerySuite`, `HiveCommandSuite`.

The Jenkins results shows only remaining 10 messages.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61422/consoleFull

## How was this patch tested?

This is a testsuite-only change.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13956 from dongjoon-hyun/SPARK-16267.

831a04f5

[SPARK-16134][SQL] optimizer rules for typed filter · d063898b

Wenchen Fan authored 8 years ago

## What changes were proposed in this pull request?

This PR adds 3 optimizer rules for typed filter:

1. push typed filter down through `SerializeFromObject` and eliminate the deserialization in filter condition.
2. pull typed filter up through `SerializeFromObject` and eliminate the deserialization in filter condition.
3. combine adjacent typed filters and share the deserialized object among all the condition expressions.

This PR also adds `TypedFilter` logical plan, to separate it from normal filter, so that the concept is more clear and it's easier to write optimizer rules.

## How was this patch tested?

`TypedFilterOptimizationSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13846 from cloud-fan/filter.

d063898b

[SPARK-16228][SQL] HiveSessionCatalog should return `double`-param functions... · 2eaabfa4

Dongjoon Hyun authored 8 years ago

[SPARK-16228][SQL] HiveSessionCatalog should return `double`-param functions for decimal param lookups

## What changes were proposed in this pull request?

This PR supports a fallback lookup by casting `DecimalType` into `DoubleType` for the external functions with `double`-type parameter.

**Reported Error Scenarios**
```scala
scala> sql("select percentile(value, 0.5) from values 1,2,3 T(value)")
org.apache.spark.sql.AnalysisException: ... No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (int, decimal(38,18)). Possible choices: _FUNC_(bigint, array<double>)  _FUNC_(bigint, double)  ; line 1 pos 7

scala> sql("select percentile_approx(value, 0.5) from values 1.0,2.0,3.0 T(value)")
org.apache.spark.sql.AnalysisException: ... Only a float/double or float/double array argument is accepted as parameter 2, but decimal(38,18) was passed instead.; line 1 pos 7
```

## How was this patch tested?

Pass the Jenkins tests (including a new testcase).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13930 from dongjoon-hyun/SPARK-16228.

2eaabfa4

[SPARK-16238] Metrics for generated method and class bytecode size · 23c58653

Eric Liang authored 8 years ago

## What changes were proposed in this pull request?

This extends SPARK-15860 to include metrics for the actual bytecode size of janino-generated methods. They can be accessed in the same way as any other codahale metric, e.g.

```
scala> org.apache.spark.metrics.source.CodegenMetrics.METRIC_GENERATED_CLASS_BYTECODE_SIZE.getSnapshot().getValues()
res7: Array[Long] = Array(532, 532, 532, 542, 1479, 2670, 3585, 3585)

scala> org.apache.spark.metrics.source.CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE.getSnapshot().getValues()
res8: Array[Long] = Array(5, 5, 5, 5, 10, 10, 10, 10, 15, 15, 15, 38, 63, 79, 88, 94, 94, 94, 132, 132, 165, 165, 220, 220)
```

## How was this patch tested?

Small unit test, also verified manually that the performance impact is minimal (<10%). hvanhovell

Author: Eric Liang <ekl@databricks.com>

Closes #13934 from ericl/spark-16238.

23c58653

[SPARK-16006][SQL] Attemping to write empty DataFrame with no fields throws non-intuitive exception · 9b1b3ae7

Dongjoon Hyun authored 8 years ago

## What changes were proposed in this pull request?

This PR allows `emptyDataFrame.write` since the user didn't specify any partition columns.

**Before**
```scala
scala> spark.emptyDataFrame.write.parquet("/tmp/t1")
org.apache.spark.sql.AnalysisException: Cannot use all columns for partition columns;
scala> spark.emptyDataFrame.write.csv("/tmp/t1")
org.apache.spark.sql.AnalysisException: Cannot use all columns for partition columns;
```

After this PR, there occurs no exceptions and the created directory has only one file, `_SUCCESS`, as expected.

## How was this patch tested?

Pass the Jenkins tests including updated test cases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13730 from dongjoon-hyun/SPARK-16006.

9b1b3ae7

[SPARK-16301] [SQL] The analyzer rule for resolving using joins should respect... · 8b5a8b25

Yin Huai authored 8 years ago

[SPARK-16301] [SQL] The analyzer rule for resolving using joins should respect the case sensitivity setting.

## What changes were proposed in this pull request?
The analyzer rule for resolving using joins should respect the case sensitivity setting.

## How was this patch tested?
New tests in ResolveNaturalJoinSuite

Author: Yin Huai <yhuai@databricks.com>

Closes #13977 from yhuai/SPARK-16301.

8b5a8b25

[TRIVIAL] [PYSPARK] Clean up orc compression option as well · d8a87a3e

hyukjinkwon authored 8 years ago

## What changes were proposed in this pull request?

This PR corrects ORC compression option for PySpark as well. I think this was missed mistakenly in https://github.com/apache/spark/pull/13948.

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13963 from HyukjinKwon/minor-orc-compress.

d8a87a3e

[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide · 64132a14
Tathagata Das authored 8 years ago
```
Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.
```
64132a14

[SPARK-14480][SQL] Remove meaningless StringIteratorReader for CSV data source. · cb1b9d34

hyukjinkwon authored 8 years ago

## What changes were proposed in this pull request?

This PR removes meaningless `StringIteratorReader` for CSV data source.

In `CSVParser.scala`, there is an `Reader` wrapping `Iterator` but there are two problems by this.

Firstly, it was actually not faster than processing line by line with Iterator due to additional logics to wrap `Iterator` to `Reader`.
Secondly, this brought a bit of complexity because it needs additional logics to allow every line to be read bytes by bytes. So, it was pretty difficult to figure out issues about parsing, (eg. SPARK-14103).

A benchmark was performed manually and the results were below:

- Original codes with Reader wrapping Iterator

|End-to-end (ns)  |   Parse Time (ns) |
|-----------------------|------------------------|
|14116265034      |2008277960        |

- New codes with Iterator

|End-to-end (ns)  |   Parse Time (ns) |
|-----------------------|------------------------|
|13451699644      | 1549050564       |

For the details for the environment, dataset and methods, please refer the JIRA ticket.

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13808 from HyukjinKwon/SPARK-14480-small.

cb1b9d34

[SPARK-16236][SQL][FOLLOWUP] Add Path Option back to Load API in DataFrameReader · 39f2eb1d

gatorsmile authored 8 years ago

#### What changes were proposed in this pull request?
In Python API, we have the same issue. Thanks for identifying this issue, zsxwing ! Below is an example:
```Python
spark.read.format('json').load('python/test_support/sql/people.json')
```
#### How was this patch tested?
Existing test cases cover the changes by this PR

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13965 from gatorsmile/optionPaths.

39f2eb1d

[SPARK-16140][MLLIB][SPARKR][DOCS] Group k-means method in generated R doc · 8c9cd0a7

Xin Ren authored 8 years ago

https://issues.apache.org/jira/browse/SPARK-16140

## What changes were proposed in this pull request?

Group the R doc of spark.kmeans, predict(KM), summary(KM), read/write.ml(KM) under Rd spark.kmeans. The example code was updated.

## How was this patch tested?

Tested on my local machine

And on my laptop `jekyll build` is failing to build API docs, so here I can only show you the html I manually generated from Rd files, with no CSS applied, but the doc content should be there.

![screenshotkmeans](https://cloud.githubusercontent.com/assets/3925641/16403203/c2c9ca1e-3ca7-11e6-9e29-f2164aee75fc.png)

Author: Xin Ren <iamshrek@126.com>

Closes #13921 from keypointt/SPARK-16140.

8c9cd0a7

[MINOR][SPARKR] Fix arguments of survreg in SparkR · c6a220d7

Yanbo Liang authored 8 years ago

## What changes were proposed in this pull request?
Fix wrong arguments description of ```survreg``` in SparkR.

## How was this patch tested?
```Arguments``` section of ```survreg``` doc before this PR (with wrong description for ```path``` and missing ```overwrite```):
![image](https://cloud.githubusercontent.com/assets/1962026/16447548/fe7a5ed4-3da1-11e6-8b96-b5bf2083b07e.png)

After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/16447617/368e0b18-3da2-11e6-8277-45640fb11859.png)

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13970 from yanboliang/spark-16143-followup.

c6a220d7

[SPARK-15990][YARN] Add rolling log aggregation support for Spark on yarn · 272a2f78

jerryshao authored 8 years ago

## What changes were proposed in this pull request?

Yarn supports rolling log aggregation since 2.6, previously log will only be aggregated to HDFS after application is finished, it is quite painful for long running applications like Spark Streaming, thriftserver. Also out of disk problem will be occurred when log file is too large. So here propose to add support of rolling log aggregation for Spark on yarn.

One limitation for this is that log4j should be set to change to file appender, now in Spark itself uses console appender by default, in which file will not be created again once removed after aggregation. But I think lots of production users should have changed their log4j configuration instead of default on, so this is not a big problem.

## How was this patch tested?

Manually verified with Hadoop 2.7.1.

Author: jerryshao <sshao@hortonworks.com>

Closes #13712 from jerryshao/SPARK-15990.

272a2f78

[SPARK-15858][ML] Fix calculating error by tree stack over flow prob… · 393db655

Mahmoud Rawas authored 8 years ago

## What changes were proposed in this pull request?

What changes were proposed in this pull request?

Improving evaluateEachIteration function in mllib as it fails when trying to calculate error by tree for a model that has more than 500 trees

## How was this patch tested?

the batch tested on productions data set (2K rows x 2K features) training a gradient boosted model without validation with 1000 maxIteration settings, then trying to produce the error by tree, the new patch was able to perform the calculation within 30 seconds, while previously it was take hours then fail.

**PS**: It would be better if this PR can be cherry picked into release branches 1.6.1 and 2.0

Author: Mahmoud Rawas <mhmoudr@gmail.com>
Author: Mahmoud Rawas <Mahmoud.Rawas@quantium.com.au>

Closes #13624 from mhmoudr/SPARK-15858.master.

393db655

[SPARK-16261][EXAMPLES][ML] Fixed incorrect appNames in ML Examples · 21385d02

Bryan Cutler authored 8 years ago

## What changes were proposed in this pull request?

Some appNames in ML examples are incorrect, mostly in PySpark but one in Scala.  This corrects the names.

## How was this patch tested?
Style, local tests

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13949 from BryanCutler/pyspark-example-appNames-fix-SPARK-16261.

21385d02

[SPARK-16157][SQL] Add New Methods for comments in StructField and StructType · 7ee9e39c

gatorsmile authored 8 years ago

#### What changes were proposed in this pull request?
Based on the previous discussion with cloud-fan hvanhovell in another related PR https://github.com/apache/spark/pull/13764#discussion_r67994276, it looks reasonable to add convenience methods for users to add `comment` when defining `StructField`.

Currently, the column-related `comment` attribute is stored in `Metadata` of `StructField`. For example, users can add the `comment` attribute using the following way:
```Scala
StructType(
  StructField(
    "cl1",
    IntegerType,
    nullable = false,
    new MetadataBuilder().putString("comment", "test").build()) :: Nil)
```
This PR is to add more user friendly methods for the `comment` attribute when defining a `StructField`. After the changes, users are provided three different ways to do it:
```Scala
val struct = (new StructType)
  .add("a", "int", true, "test1")

val struct = (new StructType)
  .add("c", StringType, true, "test3")

val struct = (new StructType)
  .add(StructField("d", StringType).withComment("test4"))
```

#### How was this patch tested?
Added test cases:
- `DataTypeSuite` is for testing three types of API changes,
- `DataFrameReaderWriterSuite` is for parquet, json and csv formats - using in-memory catalog
- `OrcQuerySuite.scala` is for orc format using Hive-metastore

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13860 from gatorsmile/newMethodForComment.

7ee9e39c

[SPARK-16291][SQL] CheckAnalysis should capture nested aggregate functions... · d1e81088

Cheng Lian authored 8 years ago

[SPARK-16291][SQL] CheckAnalysis should capture nested aggregate functions that reference no input attributes

## What changes were proposed in this pull request?

`MAX(COUNT(*))` is invalid since aggregate expression can't be nested within another aggregate expression. This case should be captured at analysis phase, but somehow sneaks off to runtime.

The reason is that when checking aggregate expressions in `CheckAnalysis`, a checking branch treats all expressions that reference no input attributes as valid ones. However, `MAX(COUNT(*))` is translated into `MAX(COUNT(1))` at analysis phase and also references no input attribute.

This PR fixes this issue by removing the aforementioned branch.

## How was this patch tested?

New test case added in `AnalysisErrorSuite`.

Author: Cheng Lian <lian@databricks.com>

Closes #13968 from liancheng/spark-16291-nested-agg-functions.

d1e81088

[TRIVIAL][DOCS][STREAMING][SQL] The return type mentioned in the Javadoc is... · 757dc2c0

Holden Karau authored 8 years ago

[TRIVIAL][DOCS][STREAMING][SQL] The return type mentioned in the Javadoc is incorrect for toJavaRDD, …

## What changes were proposed in this pull request?

Change the return type mentioned in the JavaDoc for `toJavaRDD` / `javaRDD` to match the actual return type & be consistent with the scala rdd return type.

## How was this patch tested?

Docs only change.

Author: Holden Karau <holden@us.ibm.com>

Closes #13954 from holdenk/trivial-streaming-tojavardd-doc-fix.

757dc2c0

[SPARK-16266][SQL][STREAING] Moved DataStreamReader/Writer from pyspark.sql to... · f454a7f9

Tathagata Das authored 8 years ago

[SPARK-16266][SQL][STREAING] Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming

## What changes were proposed in this pull request?

- Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming to make them consistent with scala packaging
- Exposed the necessary classes in sql.streaming package so that they appear in the docs
- Added pyspark.sql.streaming module to the docs

## How was this patch tested?
- updated unit tests.
- generated docs for testing visibility of pyspark.sql.streaming classes.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13955 from tdas/SPARK-16266.

f454a7f9

Jun 28, 2016

[SPARK-16271][SQL] Implement Hive's UDFXPathUtil · 153c2f9a

petermaxlee authored 8 years ago

## What changes were proposed in this pull request?
This patch ports Hive's UDFXPathUtil over to Spark, which can be used to implement xpath functionality in Spark in the near future.

## How was this patch tested?
Added two new test suites UDFXPathUtilSuite and ReusableStringReaderSuite. They have been ported over from Hive (but rewritten in Scala in order to leverage ScalaTest).

Author: petermaxlee <petermaxlee@gmail.com>

Closes #13961 from petermaxlee/xpath.

153c2f9a

[SPARK-16245][ML] model loading backward compatibility for ml.feature.PCA · 0df5ce1b

Yanbo Liang authored 8 years ago

## What changes were proposed in this pull request?
model loading backward compatibility for ml.feature.PCA.

## How was this patch tested?
existing ut and manual test for loading models saved by Spark 1.6.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13937 from yanboliang/spark-16245.

0df5ce1b

[SPARK-16248][SQL] Whitelist the list of Hive fallback functions · 363bcede

Reynold Xin authored 8 years ago

## What changes were proposed in this pull request?
This patch removes the blind fallback into Hive for functions. Instead, it creates a whitelist and adds only a small number of functions to the whitelist, i.e. the ones we intend to support in the long run in Spark.

## How was this patch tested?
Updated tests to reflect the change.

Author: Reynold Xin <rxin@databricks.com>

Closes #13939 from rxin/hive-whitelist.

363bcede

[SPARK-16268][PYSPARK] SQLContext should import DataStreamReader · 5bf8881b

Shixiong Zhu authored 8 years ago

## What changes were proposed in this pull request?

Fixed the following error:
```
>>> sqlContext.readStream
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...", line 442, in readStream
    return DataStreamReader(self._wrapped)
NameError: global name 'DataStreamReader' is not defined
```

## How was this patch tested?

The added test.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13958 from zsxwing/fix-import.

5bf8881b