Commits · 4f8ceed59367319300e4bfa5b957c387be81ffa3 · cs525-sp18-g07 / spark

Jul 06, 2016

[SPARK-16371][SQL] Do not push down filters incorrectly when inner name and... · 4f8ceed5

hyukjinkwon authored 8 years ago

[SPARK-16371][SQL] Do not push down filters incorrectly when inner name and outer name are the same in Parquet

## What changes were proposed in this pull request?

Currently, if there is a schema as below:

```
root
  |-- _1: struct (nullable = true)
  |    |-- _1: integer (nullable = true)
```

and if we execute the code below:

```scala
df.filter("_1 IS NOT NULL").count()
```

This pushes down a filter even though the filter is applied to a `StructType` column. (If my understanding is correct, Spark does not push down filters for those.)

The reason is that `ParquetFilters.getFieldMap` produces the results below:

```
(_1,StructType(StructField(_1,IntegerType,true)))
(_1,IntegerType)
```

and then it becomes a `Map`:

```
(_1,IntegerType)
```

Now, because of ` ....lift(dataTypeOf(name)).map(_(name, value))`, this pushes down filters for `_1` which Parquet thinks is `IntegerType`. However, it is actually `StructType`.

So, Parquet filter2 produces incorrect results. For example, the code below:

```scala
df.filter("_1 IS NOT NULL").count()
```

always produces 0.

This PR prevents this by not finding nested fields.
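The name collision described above can be reproduced without Spark. The following is a minimal sketch, assuming simplified stand-in types (`FieldType`, `IntType`, `StructT`) and hypothetical helpers (`flattenAll`, `topLevelOnly`); it is not the code in `ParquetFilters`, only an illustration of why converting the flattened pairs to a `Map` loses the struct entry.

```scala
object FieldMapSketch {
  // Stand-ins for Catalyst data types; NOT Spark's real DataType hierarchy.
  sealed trait FieldType
  case object IntType extends FieldType
  final case class StructT(fields: Seq[(String, FieldType)]) extends FieldType

  // Recursive flattening: nested field names are emitted next to top-level ones.
  def flattenAll(fields: Seq[(String, FieldType)]): Seq[(String, FieldType)] =
    fields.flatMap {
      case (name, s: StructT) => (name, s) +: flattenAll(s.fields)
      case (name, other)      => Seq((name, other))
    }

  // Top-level only: nested fields are never visited, so names cannot collide.
  def topLevelOnly(fields: Seq[(String, FieldType)]): Map[String, FieldType] =
    fields.toMap

  // The schema from the description: _1: struct<_1: int>
  val schema: Seq[(String, FieldType)] =
    Seq(("_1", StructT(Seq(("_1", IntType)))))
}
```

With this sketch, `FieldMapSketch.flattenAll(FieldMapSketch.schema).toMap` keeps only the inner integer entry for `_1`, because a later pair with the same key overwrites the earlier one, while `topLevelOnly` keeps the struct entry.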

## How was this patch tested?

Unit test in `ParquetFilterSuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14067 from HyukjinKwon/SPARK-16371.


[SPARK-16304] LinkageError should not crash Spark executor · 480357cc

petermaxlee authored 8 years ago

## What changes were proposed in this pull request?
This patch updates the failure handling logic so Spark executor does not crash when seeing LinkageError.

## How was this patch tested?
Added an end-to-end test in FailureSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #13982 from petermaxlee/SPARK-16304.


[MINOR][PYSPARK][DOC] Fix wrongly formatted examples in PySpark documentation · 4e14199f

hyukjinkwon authored 8 years ago

## What changes were proposed in this pull request?

This PR fixes wrongly formatted examples in PySpark documentation as below:

- **`SparkSession`**

  - **Before**

    ![2016-07-06 11 34 41](https://cloud.githubusercontent.com/assets/6477701/16605847/ae939526-436d-11e6-8ab8-6ad578362425.png)

  - **After**

    ![2016-07-06 11 33 56](https://cloud.githubusercontent.com/assets/6477701/16605845/ace9ee78-436d-11e6-8923-b76d4fc3e7c3.png)

- **`Builder`**

  - **Before**
    ![2016-07-06 11 34 44](https://cloud.githubusercontent.com/assets/6477701/16605844/aba60dbc-436d-11e6-990a-c87bc0281c6b.png)

  - **After**
    ![2016-07-06 1 26 37](https://cloud.githubusercontent.com/assets/6477701/16607562/586704c0-437d-11e6-9483-e0af93d8f74e.png)

This PR also fixes several similar instances across the documentation in `sql` PySpark module.

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14063 from HyukjinKwon/minor-pyspark-builder.


[DOC][SQL] update out-of-date code snippets using SQLContext in all documents. · b1310425

WeichenXu authored 8 years ago

## What changes were proposed in this pull request?

I searched the whole documents directory for uses of SQLContext and updated the following places:

- docs/configuration.md, SparkR code snippets.
- docs/streaming-programming-guide.md, several code examples.

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14025 from WeichenXu123/WIP_SQLContext_update.


[SPARK-15979][SQL] Renames CatalystWriteSupport to ParquetWriteSupport · 23eff5e5

Cheng Lian authored 8 years ago

## What changes were proposed in this pull request?

PR #13696 renamed various Parquet support classes but left `CatalystWriteSupport` behind. This PR renames it as a follow-up.

## How was this patch tested?

N/A.

Author: Cheng Lian <lian@databricks.com>

Closes #14070 from liancheng/spark-15979-follow-up.


[SPARK-15591][WEBUI] Paginate Stage Table in Stages tab · 478b71d0

Tao Lin authored 8 years ago

## What changes were proposed in this pull request?

This patch adds pagination support for the Stage Tables in the Stages tab. Pagination is provided for all four Stage Tables (active, pending, completed, and failed). In addition, the paged stage tables are also used in JobPage (the detail page for one job) and PoolPage.

Interactions (jumping, sorting, and setting page size) for paged tables are also included.
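As a rough illustration of what jumping, sorting, and setting the page size amount to, here is a minimal sketch of slicing one page out of a sorted row list. `PagingSketch.page` and its parameters are hypothetical names chosen for this example; the actual Web UI paging classes are not shown in this description.

```scala
object PagingSketch {
  // Returns the rows of a 1-based page after sorting by the given key.
  def page[A, K: Ordering](rows: Seq[A], pageNumber: Int, pageSize: Int)(sortKey: A => K): Seq[A] = {
    require(pageNumber >= 1 && pageSize >= 1, "page number and page size must be positive")
    val from = (pageNumber - 1) * pageSize
    // slice tolerates an out-of-range end index and returns fewer (or no) rows.
    rows.sortBy(sortKey).slice(from, from + pageSize)
  }
}
```

For example, `PagingSketch.page(Seq(5, 3, 1, 4, 2), pageNumber = 2, pageSize = 2)(x => x)` sorts the rows and returns the third and fourth of them.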

## How was this patch tested?

Tested manually by checking the Web UI after completing and failing hundreds of jobs. This is the same approach used to test [Paginate Job Table in Jobs tab](https://github.com/apache/spark/pull/13620).

This shows the pagination for completed stages:
![paged stage table](https://cloud.githubusercontent.com/assets/5558370/16125696/5804e35e-3427-11e6-8923-5c5948982648.png)

Author: Tao Lin <nblintao@gmail.com>

Closes #13708 from nblintao/stageTable.


[SPARK-16229][SQL] Drop Empty Table After CREATE TABLE AS SELECT fails · 21eadd1d

gatorsmile authored 8 years ago

#### What changes were proposed in this pull request?
In `CREATE TABLE AS SELECT`, if the `SELECT` query failed, the table should not exist. For example,

```SQL
CREATE TABLE tab
STORED AS TEXTFILE
SELECT 1 AS a, (SELECT a FROM (SELECT 1 AS a UNION ALL SELECT 2 AS a) t) AS b
```
The above query fails as expected, but an empty table `tab` is still created.

This PR is to drop the created table when hitting any non-fatal exception.
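The cleanup pattern can be sketched as follows, assuming a hypothetical in-memory `catalog` and a hypothetical `createTableAsSelect` helper. This is not the actual command implementation, only the shape of "create, populate, and drop again on any non-fatal failure" using `scala.util.control.NonFatal`.

```scala
import scala.collection.mutable
import scala.util.control.NonFatal

object CtasSketch {
  // Hypothetical in-memory "catalog": table name -> rows.
  val catalog: mutable.Map[String, Seq[Int]] = mutable.Map.empty

  def createTableAsSelect(name: String)(select: => Seq[Int]): Unit = {
    catalog(name) = Seq.empty      // the table is created first
    try {
      catalog(name) = select       // then populated by running the query
    } catch {
      case NonFatal(e) =>
        catalog.remove(name)       // drop the empty table that was left behind
        throw e                    // and still surface the original failure
    }
  }
}
```

Fatal errors such as `OutOfMemoryError` are deliberately not matched by `NonFatal`, so they propagate without the cleanup step.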

#### How was this patch tested?
Added a test case to verify the behavior

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13926 from gatorsmile/dropTableAfterException.


[SPARK-16307][ML] Add test to verify the predicted variances of a DT on toy data · 909c6d81

MechCoder authored 8 years ago

## What changes were proposed in this pull request?

The current tests assume that `impurity.calculate()` returns the variance correctly. It would be better to make the tests independent of this assumption. In other words, verify that the computed variance equals the variance computed manually on a small tree.
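A manual variance computation of the kind such a test could compare against might look like the sketch below. It assumes the population variance (dividing by the number of labels) is the quantity of interest; `VarianceSketch.variance` is a hypothetical helper, not part of the test suite.

```scala
object VarianceSketch {
  // Population variance: the mean of squared deviations from the mean.
  def variance(xs: Seq[Double]): Double = {
    require(xs.nonEmpty, "variance of an empty sample is undefined")
    val mean = xs.sum / xs.size
    xs.map(x => (x - mean) * (x - mean)).sum / xs.size
  }
}
```

For a leaf holding the labels 1.0 and 3.0, the mean is 2.0 and each squared deviation is 1.0, so the variance is 1.0.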

## How was this patch tested?

The patch is itself a test.

Author: MechCoder <mks542@nyu.edu>

Closes #13981 from MechCoder/dt_variance.


[SPARK-16388][SQL] Remove spark.sql.nativeView and spark.sql.nativeView.canonical config · 7e28fabd

Reynold Xin authored 8 years ago

## What changes were proposed in this pull request?
These two configs should always be true after Spark 2.0. This patch removes them from the config list. Note that ideally this should've gone into branch-2.0, but due to the timing of the release we should only merge this in master for Spark 2.1.

## How was this patch tested?
Updated test cases.

Author: Reynold Xin <rxin@databricks.com>

Closes #14061 from rxin/SPARK-16388.


[SPARK-16249][ML] Change visibility of Object ml.clustering.LDA to public for loading · 5497242c

Yuhao Yang authored 8 years ago

## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16249
Change the visibility of Object ml.clustering.LDA to public for loading, so that users can invoke LDA.load("path").

## How was this patch tested?

Existing unit tests, plus a manual test of loading a model saved with the current code.

Author: Yuhao Yang <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #13941 from hhbyyh/ldapublic.


[SPARK-16339][CORE] ScriptTransform does not print stderr when outstream is lost · 5f342049

Tejas Patil authored 8 years ago

## What changes were proposed in this pull request?

Currently, if the outstream gets destroyed or closed due to some failure, a later call to `outstream.close()` throws an IOException. Because of this, the `stderrBuffer` does not get logged and there is no way for users to see why the job failed.

The change is to first display the stderr buffer and then try closing the outstream.
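The reordering can be illustrated with a minimal sketch. `StderrFirstSketch.finish` and the `logged` buffer are hypothetical stand-ins for the real logging call; the only point is that the stderr contents are recorded before `close()` gets a chance to throw.

```scala
import java.io.OutputStream
import scala.collection.mutable.ArrayBuffer

object StderrFirstSketch {
  // Collects "log" lines so the ordering can be inspected; a stand-in for a real logger.
  val logged: ArrayBuffer[String] = ArrayBuffer.empty

  def finish(outstream: OutputStream, stderrBuffer: String): Unit = {
    logged += stderrBuffer // 1. surface the child process's stderr first
    outstream.close()      // 2. only then close, which may throw an IOException
  }
}
```

Even when `close()` fails, the stderr contents have already been recorded, so the reason for the failure is not lost.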

## How was this patch tested?

The correct way to test this fix would be to grep the log to see if the `stderrBuffer` gets logged, but I don't think having test cases which do that is a good idea.


…

Author: Tejas Patil <tejasp@fb.com>

Closes #13834 from tejasapatil/script_transform.


[SPARK-16340][SQL] Support column arguments for `regexp_replace` Dataset operation · ec79183a

Dongjoon Hyun authored 8 years ago

## What changes were proposed in this pull request?

Currently, the `regexp_replace` function supports `Column` arguments in a query. This PR supports that in a `Dataset` operation, too.

## How was this patch tested?

Pass the Jenkins tests with an updated test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14060 from dongjoon-hyun/SPARK-16340.


Jul 05, 2016

[SPARK-16389][SQL] Remove MetastoreRelation from SparkHiveWriterContainer and... · ec18cd0a

gatorsmile authored 8 years ago

[SPARK-16389][SQL] Remove MetastoreRelation from SparkHiveWriterContainer and SparkHiveDynamicPartitionWriterContainer

#### What changes were proposed in this pull request?
- Remove useless `MetastoreRelation` from the signature of `SparkHiveWriterContainer` and `SparkHiveDynamicPartitionWriterContainer`.
- Avoid unnecessary metadata retrieval using Hive client in `InsertIntoHiveTable`.

#### How was this patch tested?
Existing test cases already cover it.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14062 from gatorsmile/removeMetastoreRelation.


[SPARK-16286][SQL] Implement stack table generating function · d0d28507

Dongjoon Hyun authored 8 years ago

## What changes were proposed in this pull request?

This PR implements the `stack` table-generating function.

## How was this patch tested?

Pass the Jenkins tests including new testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14033 from dongjoon-hyun/SPARK-16286.


[SPARK-16348][ML][MLLIB][PYTHON] Use full classpaths for pyspark ML JVM calls · fdde7d0a

Joseph K. Bradley authored 8 years ago

## What changes were proposed in this pull request?

Issue: Omitting the full classpath can cause problems when calling JVM methods or classes from pyspark.

This PR: Changed all uses of jvm.X in pyspark.ml and pyspark.mllib to use full classpath for X

## How was this patch tested?

Existing unit tests.  Manual testing in an environment where this was an issue.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #14023 from jkbradley/SPARK-16348.


[SPARK-16385][CORE] Catch correct exception when calling method via reflection. · 59f9c1bd

Marcelo Vanzin authored 8 years ago

Using "Method.invoke" causes an exception to be thrown, not an error, so
Utils.waitForProcess() was always throwing an exception when run on Java 7.
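The distinction between an exception and an error matters because reflective lookups report a missing method with the checked `NoSuchMethodException`, whereas `NoSuchMethodError` is a linkage error raised for directly compiled calls. The sketch below illustrates that general JVM behavior; `ReflectionSketch.callIfPresent` is a hypothetical helper, and the exact handler used in `Utils.waitForProcess()` is not shown in this commit message.

```scala
object ReflectionSketch {
  // Calls a public zero-argument method by name if it exists, otherwise returns None.
  def callIfPresent(target: AnyRef, methodName: String): Option[AnyRef] =
    try {
      Some(target.getClass.getMethod(methodName).invoke(target))
    } catch {
      // Reflection signals a missing method with an Exception, so a handler
      // that only matches NoSuchMethodError would never fire on this path.
      case _: NoSuchMethodException => None
    }
}
```

Calling `ReflectionSketch.callIfPresent("abc", "length")` finds the method, while a name that does not exist on `String` falls through to the `NoSuchMethodException` branch.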

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14056 from vanzin/SPARK-16385.


[SPARK-16383][SQL] Remove `SessionState.executeSql` · 4db63fd2

Dongjoon Hyun authored 8 years ago

## What changes were proposed in this pull request?

This PR removes `SessionState.executeSql` in favor of `SparkSession.sql`. We can remove this safely since the visibility of `SessionState` is `private[sql]` and `executeSql` is only used in one **ignored** test, `test("Multiple Hive Instances")`.

## How was this patch tested?

Pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14055 from dongjoon-hyun/SPARK-16383.


[SPARK-16359][STREAMING][KAFKA] unidoc skip kafka 0.10 · 1f0d0213

cody koeninger authored 8 years ago

## What changes were proposed in this pull request?
During the sbt unidoc task, skip the streamingKafka010 subproject and filter kafka 0.10 classes from the classpath, so that at least the existing kafka 0.8 docs can be included in unidoc without error.

## How was this patch tested?
`sbt spark/scalaunidoc:doc | grep -i error`

Author: cody koeninger <cody@koeninger.org>

Closes #14041 from koeninger/SPARK-16359.


[SPARK-15730][SQL] Respect the --hiveconf in the spark-sql command line · 920cb5fe

Cheng Hao authored 8 years ago

## What changes were proposed in this pull request?
This PR makes spark-sql (backed by SparkSQLCLIDriver) respect confs set by hiveconf, which is what previous versions did. The change is that when we start SparkSQLCLIDriver, we explicitly copy the confs set through --hiveconf into SQLContext's conf (basically treating those confs as SparkSQL confs).

## How was this patch tested?
A new test in CliSuite.

Closes #13542

Author: Cheng Hao <hao.cheng@intel.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #14058 from yhuai/hiveConfThriftServer.


[HOTFIX] Fix build break. · 5b7a1770
Reynold Xin authored 8 years ago


[SPARK-16212][STREAMING][KAFKA] use random port for embedded kafka · 1fca9da9

cody koeninger authored 8 years ago

## What changes were proposed in this pull request?

Testing for 0.10 uncovered an issue with a fixed port number being used in KafkaTestUtils. This PR makes a roughly equivalent fix for the 0.8 connector.
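One common way to avoid a fixed port in tests is to bind to port 0 and let the operating system pick a free one. The sketch below shows that general technique with `java.net.ServerSocket`; `FreePortSketch.findFreePort` is a hypothetical helper and is not claimed to be how KafkaTestUtils obtains its port.

```scala
import java.net.ServerSocket

object FreePortSketch {
  // Binding to port 0 asks the OS for any currently free ephemeral port.
  def findFreePort(): Int = {
    val socket = new ServerSocket(0)
    try socket.getLocalPort finally socket.close()
  }
}
```

Note that the port is released before the caller uses it, so another process could claim it in between. Letting the embedded server itself bind to port 0 and then reading back the bound port avoids that window.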

## How was this patch tested?

Unit tests, manual tests

Author: cody koeninger <cody@koeninger.org>

Closes #14018 from koeninger/kafka-0-8-test-port.


[SPARK-16311][SQL] Metadata refresh should work on temporary views · 16a2a7d7

Reynold Xin authored 8 years ago

## What changes were proposed in this pull request?
This patch fixes the bug that the refresh command does not work on temporary views. This patch is based on https://github.com/apache/spark/pull/13989, but removes the public Dataset.refresh() API and improves test coverage.

Note that I actually think the public refresh() API is very useful. We can in the future implement it by also invalidating the lazy vals in QueryExecution (or alternatively just create a new QueryExecution).

## How was this patch tested?
Re-enabled a previously ignored test, and added a new test suite for Hive testing behavior of temporary views against MetastoreRelation.

Author: Reynold Xin <rxin@databricks.com>
Author: petermaxlee <petermaxlee@gmail.com>

Closes #14009 from rxin/SPARK-16311.


[SPARK-9876][SQL][FOLLOWUP] Enable string and binary tests for Parquet... · 07d9c532

hyukjinkwon authored 8 years ago

[SPARK-9876][SQL][FOLLOWUP] Enable string and binary tests for Parquet predicate pushdown and replace deprecated fromByteArray.

## What changes were proposed in this pull request?

It seems Parquet has been upgraded to 1.8.1 by https://github.com/apache/spark/pull/13280. So, this PR enables string and binary predicate pushdown, which was disabled due to [SPARK-11153](https://issues.apache.org/jira/browse/SPARK-11153) and [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251), and cleans up some comments that were left unremoved (I think by mistake).

This PR also replaces the `fromByteArray()` API, which was deprecated in [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251).

## How was this patch tested?

Unit tests in `ParquetFilters`

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13389 from HyukjinKwon/parquet-1.8-followup.


[SPARK-16360][SQL] Speed up SQL query performance by removing redundant `executePlan` call · 7f7eb393

Dongjoon Hyun authored 8 years ago

## What changes were proposed in this pull request?

Currently, there are a few reports about Spark 2.0 query performance regression for large queries.

This PR speeds up SQL query processing by removing a redundant **consecutive `executePlan`** call in the `Dataset.ofRows` function and in `Dataset` instantiation. Specifically, this PR aims to reduce the overhead of generating the SQL query execution plan, not of the query execution itself, so the result cannot be seen in the Spark Web UI. Please use the following query script. The result is **25.78 sec** -> **12.36 sec**, as expected.

**Sample Query**
```scala
val n = 4000
val values = (1 to n).map(_.toString).mkString(", ")
val columns = (1 to n).map("column" + _).mkString(", ")
val query =
  s"""
     |SELECT $columns
     |FROM VALUES ($values) T($columns)
     |WHERE 1=2 AND 1 IN ($columns)
     |GROUP BY $columns
     |ORDER BY $columns
     |""".stripMargin

def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  println("Elapsed time: " + ((System.nanoTime - t0) / 1e9) + "s")
  result
}
```

**Before**
```scala
scala> time(sql(query))
Elapsed time: 30.138142577s  // First query has a little overhead of initialization.
res0: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
scala> time(sql(query))
Elapsed time: 25.787751452s  // Let's compare this one.
res1: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
```

**After**
```scala
scala> time(sql(query))
Elapsed time: 17.500279659s  // First query has a little overhead of initialization.
res0: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
scala> time(sql(query))
Elapsed time: 12.364812255s  // This shows the real difference. The speed up is about 2 times.
res1: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
```

## How was this patch tested?

Manual by the above script.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14044 from dongjoon-hyun/SPARK-16360.


[SPARK-15198][SQL] Support for pushing down filters for boolean types in ORC data source · 7742d9f1

hyukjinkwon authored 8 years ago

## What changes were proposed in this pull request?

It seems ORC supports all the types in [`PredicateLeaf.Type`](https://github.com/apache/hive/blob/e085b7e9bd059d91aaf013df0db4d71dca90ec6f/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.java#L50-L56), which includes boolean types. So, this was tested first.

This PR adds the support for pushing filters down for `BooleanType` in ORC data source.

This PR also removes `OrcTableScan` class and the companion object, which is not used anymore.

## How was this patch tested?

Unit tests in `OrcFilterSuite` and `OrcQuerySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12972 from HyukjinKwon/SPARK-15198.


Jul 04, 2016

[SPARK-15968][SQL] Nonempty partitioned metastore tables are not cached · 8f6cf00c

Michael Allman authored 8 years ago

(Please note this is a revision of PR #13686, which has been closed in favor of this PR.)

This PR addresses [SPARK-15968](https://issues.apache.org/jira/browse/SPARK-15968).

## What changes were proposed in this pull request?

The `getCached` method of [HiveMetastoreCatalog](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala) computes `pathsInMetastore` from the metastore relation's catalog table. This only returns the table base path, which is incomplete/inaccurate for a nonempty partitioned table. As a result, cached lookups on nonempty partitioned tables always miss.

Rather than get `pathsInMetastore` from

`metastoreRelation.catalogTable.storage.locationUri.toSeq`

I modified the `getCached` method to take a `pathsInMetastore` argument. Calls to this method pass in the paths computed from calls to the Hive metastore. This is how `getCached` was implemented in Spark 1.5:

https://github.com/apache/spark/blob/e0c3212a9b42e3e704b070da4ac25b68c584427f/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L444.

I also added a call in `InsertIntoHiveTable.scala` to invalidate the table from the SQL session catalog.

## How was this patch tested?

I've added a new unit test to `parquetSuites.scala`:

SPARK-15968: nonempty partitioned metastore Parquet table lookup should use cached relation

Note that the only difference between this new test and the one above it in the file is that the new test populates its partitioned table with a single value, while the existing test leaves the table empty. This reveals a subtle, unexpected hole in test coverage present before this patch.

Note I also modified a different but related unit test in `parquetSuites.scala`:

SPARK-15248: explicitly added partitions should be readable

This unit test asserts that Spark SQL should return data from a table partition which has been placed there outside a metastore query immediately after it is added. I changed the test so that, instead of adding the data as a parquet file saved in the partition's location, the data is added through a SQL `INSERT` query. I made this change because I could find no way to efficiently support partitioned table caching without failing that test.

In addition to my primary motivation, I can offer a few reasons I believe this is an acceptable weakening of that test. First, it still validates a fix for [SPARK-15248](https://issues.apache.org/jira/browse/SPARK-15248), the issue for which it was written. Second, the assertion made is stronger than that required for non-partitioned tables. If you write data to the storage location of a non-partitioned metastore table without using a proper SQL DML query, a subsequent call to show that data will not return it. I believe this is an intentional limitation put in place to make table caching feasible, but I'm only speculating.

Building a large `HadoopFsRelation` requires `stat`-ing all of its data files. In our environment, where we have tables with tens of thousands of partitions, the difference between using a cached relation versus a new one is a matter of seconds versus minutes. Caching partitioned table metadata vastly improves the usability of Spark SQL for these cases.

Thanks.

Author: Michael Allman <michael@videoamp.com>

Closes #13818 from mallman/spark-15968.


[SPARK-16353][BUILD][DOC] Missing javadoc options for java unidoc · 7dbffcdd

Michael Allman authored 8 years ago

Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-16353

## What changes were proposed in this pull request?

The javadoc options for the java unidoc generation are ignored when generating the java unidoc. For example, the generated `index.html` has the wrong HTML page title. This can be seen at http://spark.apache.org/docs/latest/api/java/index.html.

I changed the relevant setting scope from `doc` to `(JavaUnidoc, unidoc)`.

## How was this patch tested?

I ran `docs/jekyll build` and verified that the java unidoc `index.html` has the correct HTML page title.

Author: Michael Allman <michael@videoamp.com>

Closes #14031 from mallman/spark-16353.


[MINOR][DOCS] Remove unused images; crush PNGs that could use it for good measure · 18fb57f5

Sean Owen authored 8 years ago

## What changes were proposed in this pull request?

Coincidentally, I discovered that a couple images were unused in `docs/`, and then searched and found more, and then realized some PNGs were pretty big and could be crushed, and before I knew it, had done the same for the ASF site (not committed yet).

No functional change at all, just less superfluous image data.

## How was this patch tested?

`jekyll serve`

Author: Sean Owen <sowen@cloudera.com>

Closes #14029 from srowen/RemoveCompressImages.


[SPARK-16260][ML][EXAMPLE] PySpark ML Example Improvements and Cleanup · a539b724

wm624@hotmail.com authored 8 years ago

## What changes were proposed in this pull request?
1). Remove an unused import in the Scala example;

2). Move the Spark session import outside of example off;

3). Change the parameter settings to match the Scala example;

4). Change the comments to be consistent;

5). Make sure that Scala and Python use the same data set;

I did one pass and fixed the above issues. There are missing examples in Python, which might be added later.

TODO: Some examples have comments on how to run them, but many do not. We can add them later.

## How was this patch tested?


Tested manually.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #14021 from wangmiao1981/ann.


[SPARK-16358][SQL] Remove InsertIntoHiveTable From Logical Plan · 26283339

gatorsmile authored 8 years ago

#### What changes were proposed in this pull request?
LogicalPlan `InsertIntoHiveTable` is useless. Thus, we can remove it from the code base.

#### How was this patch tested?
The existing test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14037 from gatorsmile/InsertIntoHiveTable.


Jul 03, 2016

[SPARK-15204][SQL] improve nullability inference for Aggregator · 8cdb81fa

Koert Kuipers authored 8 years ago

## What changes were proposed in this pull request?

TypedAggregateExpression sets nullable based on the schema of the outputEncoder

## How was this patch tested?

Add test in DatasetAggregatorSuite

Author: Koert Kuipers <koert@tresata.com>

Closes #13532 from koertkuipers/feat-aggregator-nullable.


[SPARK-16288][SQL] Implement inline table generating function · 88134e73

Dongjoon Hyun authored 8 years ago

## What changes were proposed in this pull request?

This PR implements the `inline` table-generating function.

## How was this patch tested?

Pass the Jenkins tests with new testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13976 from dongjoon-hyun/SPARK-16288.


[SPARK-16278][SPARK-16279][SQL] Implement map_keys/map_values SQL functions · 54b27c17

Dongjoon Hyun authored 8 years ago

## What changes were proposed in this pull request?

This PR adds `map_keys` and `map_values` SQL functions in order to remove Hive fallback.

## How was this patch tested?

Pass the Jenkins tests including new testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13967 from dongjoon-hyun/SPARK-16278.


[SPARK-16329][SQL] Star Expansion over Table Containing No Column · ea990f96

gatorsmile authored 8 years ago

#### What changes were proposed in this pull request?
Star expansion over a table containing zero columns has not worked since 1.6, although it works in Spark 1.5.1. This PR fixes the issue in the master branch.

For example,
```scala
val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => Row.empty)
val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
dfNoCols.registerTempTable("temp_table_no_cols")
sqlContext.sql("select * from temp_table_no_cols").show
```

Without the fix, users will get the following exception:
```
java.lang.IllegalArgumentException: requirement failed
        at scala.Predef$.require(Predef.scala:221)
        at org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
```

#### How was this patch tested?
Tests are added

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14007 from gatorsmile/starExpansionTableWithZeroColumn.


Jul 02, 2016

[MINOR][BUILD] Fix Java linter errors · 3000b4b2

Dongjoon Hyun authored 8 years ago

## What changes were proposed in this pull request?

This PR fixes the minor Java linter errors like the following.
```
-    public int read(char cbuf[], int off, int len) throws IOException {
+    public int read(char[] cbuf, int off, int len) throws IOException {
```

## How was this patch tested?

Manual.
```
$ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14017 from dongjoon-hyun/minor_build_java_linter_error.


[SPARK-16345][DOCUMENTATION][EXAMPLES][GRAPHX] Extract graphx programming... · 0bd7cd18

WeichenXu authored 8 years ago

[SPARK-16345][DOCUMENTATION][EXAMPLES][GRAPHX] Extract graphx programming guide example snippets from source files instead of hard code them

## What changes were proposed in this pull request?

I extracted six example programs from the GraphX programming guide and replaced them with `include_example` labels.

The 6 example programs are:
- AggregateMessagesExample.scala
- SSSPExample.scala
- TriangleCountingExample.scala
- ConnectedComponentsExample.scala
- ComprehensiveExample.scala
- PageRankExample.scala

All the example code can be run using
`bin/run-example graphx.EXAMPLE_NAME`

## How was this patch tested?

Manual.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14015 from WeichenXu123/graphx_example_plugin.


[GRAPHX][EXAMPLES] move graphx test data directory and update graphx document · 192d1f9c

WeichenXu authored 8 years ago

## What changes were proposed in this pull request?

Two test data files used by the GraphX examples live in the "graphx/data" directory. I moved them into the "data/" directory, because the "graphx" directory is meant for code files and the other test data files (such as the MLlib and streaming test data) are all in "data/".

I also updated the GraphX documentation where it references the moved data files.

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14010 from WeichenXu123/move_graphx_data_dir.


Jul 01, 2016

[SPARK-16095][YARN] Yarn cluster mode should report correct state to SparkLauncher · bad0f7db

peng.zhang authored 8 years ago

## What changes were proposed in this pull request?
Yarn cluster mode should return correct state for SparkLauncher

## How was this patch tested?
unit test

Author: peng.zhang <peng.zhang@xiaomi.com>

Closes #13962 from renozhang/SPARK-16095-spark-launcher-wrong-state.


[SPARK-16233][R][TEST] ORC test should be enabled only when HiveContext is available. · d17e5f2f

Dongjoon Hyun authored 8 years ago

## What changes were proposed in this pull request?

ORC test should be enabled only when HiveContext is available.

## How was this patch tested?

Manual.
```
$ R/run-tests.sh
...
1. create DataFrame from RDD (test_sparkSQL.R#200) - Hive is not build with SparkSQL, skipped

2. test HiveContext (test_sparkSQL.R#1021) - Hive is not build with SparkSQL, skipped

3. read/write ORC files (test_sparkSQL.R#1728) - Hive is not build with SparkSQL, skipped

4. enableHiveSupport on SparkSession (test_sparkSQL.R#2448) - Hive is not build with SparkSQL, skipped

5. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped

DONE ===========================================================================
Tests passed.
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14019 from dongjoon-hyun/SPARK-16233.


[SPARK-16335][SQL] Structured streaming should fail if source directory does not exist · d601894c

Reynold Xin authored 8 years ago

## What changes were proposed in this pull request?
In structured streaming, Spark does not report an error when the specified directory does not exist. This differs from the behavior in batch mode. This patch changes the behavior to fail if the directory does not exist (when the path is not a glob pattern).
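The rule "fail on a missing directory unless the path is a glob pattern" can be sketched on the local file system as below. Both `isGlobPath` (which treats any path containing a glob metacharacter as a pattern) and `validate` are assumptions made for this illustration; the real check runs against Hadoop file systems and is not reproduced here.

```scala
import java.nio.file.{Files, Paths}

object SourcePathCheck {
  // Assumed definition of a glob pattern: the path contains a glob metacharacter.
  def isGlobPath(path: String): Boolean =
    path.exists(c => "{}[]*?\\".indexOf(c) >= 0)

  // Fails fast for a concrete path that does not exist; glob patterns are
  // allowed through because they may legitimately match nothing yet.
  def validate(path: String): Unit =
    if (!isGlobPath(path) && !Files.exists(Paths.get(path)))
      throw new IllegalArgumentException(s"Path does not exist: $path")
}
```

Under these assumptions, `validate` throws for a concrete missing directory but returns normally for a pattern such as `/data/*/events`.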

## How was this patch tested?
Updated unit tests to reflect the new behavior.

Author: Reynold Xin <rxin@databricks.com>

Closes #14002 from rxin/SPARK-16335.
