  1. Sep 04, 2016
    • Yanbo Liang's avatar
      [MINOR][ML][MLLIB] Remove work around for breeze sparse matrix. · 1b001b52
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Since we have updated the breeze version to 0.12, we should remove the workaround for the breeze sparse matrix bug in v0.11.
      I checked all the MLlib code and found this is the only workaround for breeze 0.11.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14953 from yanboliang/matrices.
      1b001b52
    • Sean Owen's avatar
      [SPARK-17311][MLLIB] Standardize Python-Java MLlib API to accept optional long seeds in all cases · cdeb97a8
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Related to https://github.com/apache/spark/pull/14524 -- just the 'fix' rather than a behavior change.
      
      - PythonMLlibAPI methods that take a seed now always take a `java.lang.Long` consistently, allowing the Python API to specify "no seed"
      - .mllib's Word2VecModel seemed to be the odd one out in .mllib in that it picked its own random seed. Instead it now defaults to None, letting the Scala implementation pick a seed
      - BisectingKMeansModel arguably should not hard-code a seed, for consistency with .mllib, I think. However, I left it as is.
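
      As an illustration of the optional-seed convention in the first bullet, here is a minimal sketch with hypothetical names (not the actual PythonMLlibAPI code):

      ```scala
      // A boxed java.lang.Long lets the Python side pass null to mean "no seed",
      // so the JVM side falls back to choosing its own.
      def resolveSeed(seed: java.lang.Long): Long =
        if (seed != null) seed.longValue()   // caller-supplied seed
        else scala.util.Random.nextLong()    // no seed given: pick one on the JVM side
      ```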
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14826 from srowen/SPARK-16832.2.
      cdeb97a8
    • Shivansh's avatar
      [SPARK-17308] Improved the spark core code by replacing all pattern match on... · e75c162e
      Shivansh authored
      [SPARK-17308] Improved the Spark core code by replacing all pattern matches on boolean values with if/else blocks.
      
      ## What changes were proposed in this pull request?
      Improved the code quality of Spark by replacing all pattern matches on boolean values with if/else blocks.
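
      For illustration, this is the kind of rewrite being applied (a generic sketch, not a specific Spark file):

      ```scala
      val enabled = true

      // Before: pattern matching on a Boolean value
      enabled match {
        case true  => println("feature on")
        case false => println("feature off")
      }

      // After: a plain if/else block
      if (enabled) println("feature on") else println("feature off")
      ```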
      
      ## How was this patch tested?
      
      By running the tests
      
      Author: Shivansh <shiv4nsh@gmail.com>
      
      Closes #14873 from shiv4nsh/SPARK-17308.
      e75c162e
    • gatorsmile's avatar
      [SPARK-17324][SQL] Remove Direct Usage of HiveClient in InsertIntoHiveTable · 6b156e2f
      gatorsmile authored
      ### What changes were proposed in this pull request?
      This is another step toward getting rid of HiveClient from `HiveSessionState`. All the metastore interactions should go through the `ExternalCatalog` interface. However, the existing implementation of `InsertIntoHiveTable` still requires Hive clients. This PR removes HiveClient by moving the metastore interactions into `ExternalCatalog`.
      
      ### How was this patch tested?
      Existing test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14888 from gatorsmile/removeClientFromInsertIntoHiveTable.
      6b156e2f
  2. Sep 03, 2016
    • wm624@hotmail.com's avatar
      [SPARK-16829][SPARKR] sparkR sc.setLogLevel doesn't work · e9b58e9e
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      ```
      ./bin/sparkR
      Launching java with spark-submit command /Users/mwang/spark_ws_0904/bin/spark-submit "sparkr-shell" /var/folders/s_/83b0sgvj2kl2kwq4stvft_pm0000gn/T//RtmpQxJGiZ/backend_porte9474603ed1e
      Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
      Setting default log level to "WARN".
      To adjust logging level use sc.setLogLevel(newLevel).

      > sc.setLogLevel("INFO")
      Error: could not find function "sc.setLogLevel"
      ```
      
      sc.setLogLevel doesn't exist.

      R has a setLogLevel function.

      I renamed the setLogLevel function to sc.setLogLevel.
      
      ## How was this patch tested?
      Changed the unit tests and ran them.
      Manually tested it in the sparkR shell.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #14433 from wangmiao1981/sc.
      e9b58e9e
    • Junyang Qian's avatar
      [SPARK-17315][SPARKR] Kolmogorov-Smirnov test SparkR wrapper · abb2f921
      Junyang Qian authored
      ## What changes were proposed in this pull request?
      
      This PR tries to add a Kolmogorov-Smirnov test wrapper to SparkR. This wrapper implementation only supports the one-sample test against a normal distribution.
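
      For context, MLlib's one-sample Kolmogorov-Smirnov test, which such a wrapper can build on, is callable from Scala roughly as below (a sketch, assuming an existing `SparkContext` named `sc`):

      ```scala
      import org.apache.spark.mllib.stat.Statistics

      val data = sc.parallelize(Seq(0.1, 0.15, 0.2, 0.3, 0.25))
      // One-sample Kolmogorov-Smirnov test against a normal distribution
      // with mean 0.0 and standard deviation 1.0.
      val result = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0)
      println(result.pValue)
      ```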
      
      ## How was this patch tested?
      
      R unit test.
      
      Author: Junyang Qian <junyangq@databricks.com>
      
      Closes #14881 from junyangq/SPARK-17315.
      abb2f921
    • Herman van Hovell's avatar
      [SPARK-17335][SQL] Fix ArrayType and MapType CatalogString. · c2a1576c
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      The `catalogString` for `ArrayType` and `MapType` currently calls the `simpleString` method on its children. This is a problem when the child is a struct: the `struct.simpleString` implementation truncates the list of fields it shows (25 at most). This breaks the generation of a proper `catalogString`, and has been shown to cause errors while writing to Hive.

      This PR fixes this by providing proper `catalogString` implementations for `ArrayType` and `MapType`.
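
      A small sketch of the difference (assuming the truncation behavior described above):

      ```scala
      import org.apache.spark.sql.types._

      // A struct wide enough that simpleString truncates its field list.
      val wideStruct = StructType((1 to 30).map(i => StructField(s"c$i", IntegerType)))
      val arrayOfStruct = ArrayType(wideStruct)

      println(arrayOfStruct.simpleString)   // child rendered with a truncated field list
      println(arrayOfStruct.catalogString)  // after this PR: the full type string, safe for Hive
      ```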
      
      ## How was this patch tested?
      Added testing for `catalogString` to `DataTypeSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #14938 from hvanhovell/SPARK-17335.
      c2a1576c
    • Sandeep Singh's avatar
      [MINOR][SQL] Not dropping all necessary tables · a8a35b39
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
      The suite was not dropping the table `parquet_t3`.
      
      ## How was this patch tested?
      tested `LogicalPlanToSQLSuite` locally
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #13767 from techaddict/minor-8.
      a8a35b39
    • CodingCat's avatar
      [SPARK-17347][SQL][EXAMPLES] Encoder in Dataset example has incorrect type · 97da4103
      CodingCat authored
      ## What changes were proposed in this pull request?
      
      We propose to fix the Encoder type in the Dataset example
      
      ## How was this patch tested?
      
      The PR will be tested with the current unit test cases
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #14901 from CodingCat/SPARK-17347.
      97da4103
    • WeichenXu's avatar
      [SPARK-17363][ML][MLLIB] fix MultivariantOnlineSummerizer.numNonZeros · 7a8a81d7
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Fix the `MultivariantOnlineSummerizer.numNonZeros` method to return the `nnz` array instead of the `weightSum` array.
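
      A minimal sketch of the expected behavior (using the class `org.apache.spark.mllib.stat.MultivariateOnlineSummarizer`; the expected output in the comment is an assumption based on the fix):

      ```scala
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

      val summarizer = new MultivariateOnlineSummarizer()
      summarizer.add(Vectors.dense(1.0, 0.0))
      summarizer.add(Vectors.dense(0.0, 2.0))

      // Should report the per-column non-zero counts, i.e. [1.0, 1.0],
      // rather than the accumulated weight sums.
      println(summarizer.numNonzeros)
      ```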
      
      ## How was this patch tested?
      
      Existing test.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14923 from WeichenXu123/fix_MultivariantOnlineSummerizer_numNonZeros.
      7a8a81d7
  3. Sep 02, 2016
    • Junyang Qian's avatar
      [SPARKR][MINOR] Fix docs for sparkR.session and count · d2fde6b7
      Junyang Qian authored
      ## What changes were proposed in this pull request?
      
      This PR tries to add some more explanation to `sparkR.session`. It also modifies the doc for `count` so that, when grouped into one doc, the description doesn't confuse users.
      
      ## How was this patch tested?
      
      Manual test.
      
      ![screen shot 2016-09-02 at 1 21 36 pm](https://cloud.githubusercontent.com/assets/15318264/18217198/409613ac-7110-11e6-8dae-cb0c8df557bf.png)
      
      Author: Junyang Qian <junyangq@databricks.com>
      
      Closes #14942 from junyangq/fixSparkRSessionDoc.
      d2fde6b7
    • Srinath Shankar's avatar
      [SPARK-17298][SQL] Require explicit CROSS join for cartesian products · e6132a6c
      Srinath Shankar authored
      ## What changes were proposed in this pull request?
      
      Require the use of CROSS join syntax in SQL (and a new crossJoin
      DataFrame API) to specify explicit cartesian products between relations.
      By cartesian product we mean a join between relations R and S where
      there is no join condition involving columns from both R and S.
      
      If a cartesian product is detected in the absence of an explicit CROSS
      join, an error must be thrown. Turning on the
      "spark.sql.crossJoin.enabled" configuration flag will disable this check
      and allow cartesian products without an explicit CROSS join.
      
      The new crossJoin DataFrame API must be used to specify explicit cross
      joins. The existing join(DataFrame) method will produce an INNER join
      that will require a subsequent join condition.
      That is, `df1.join(df2)` is equivalent to `select * from df1, df2`.
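
      A short sketch of the new behavior (assuming a `SparkSession` named `spark`; table and column names are illustrative):

      ```scala
      val df1 = spark.range(3).toDF("a")
      val df2 = spark.range(3).toDF("b")

      // Explicit cartesian products: the new DataFrame API and the SQL CROSS JOIN syntax.
      df1.crossJoin(df2).count()
      spark.sql("SELECT * FROM t1 CROSS JOIN t2")

      // Without an explicit CROSS join (and with spark.sql.crossJoin.enabled left off),
      // a plain df1.join(df2) with no join condition is rejected with an error.
      ```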
      
      ## How was this patch tested?
      
      Added cross-join.sql to the SQLQueryTestSuite to test the check for cartesian products. Added a couple of tests to the DataFrameJoinSuite to test the crossJoin API. Modified various other test suites to explicitly specify a cross join where an INNER join or a comma-separated list was previously used.
      
      Author: Srinath Shankar <srinath@databricks.com>
      
      Closes #14866 from srinathshankar/crossjoin.
      e6132a6c
    • Sameer Agarwal's avatar
      [SPARK-16334] Reusing same dictionary column for decoding consecutive row... · a2c9acb0
      Sameer Agarwal authored
      [SPARK-16334] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error
      
      ## What changes were proposed in this pull request?
      
      This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure.
      
      ## How was this patch tested?
      
      Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!
      
      Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
      
      Closes #14941 from sameeragarwal/parquet-exception-2.
      a2c9acb0
    • Davies Liu's avatar
      [SPARK-17230] [SQL] Should not pass optimized query into QueryExecution in DataFrameWriter · ed9c884d
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Some analyzer rules have assumptions about logical plans, and the optimizer may break those assumptions, so we should not pass an optimized query plan into QueryExecution (it will be analyzed again), otherwise we may hit some weird bugs.

      For example, we have a rule for decimal calculation that promotes the precision before binary operations, and uses PromotePrecision as a placeholder to indicate that the rule should not apply twice. But an optimizer rule will remove this placeholder, breaking the assumption; the rule is then applied twice and produces a wrong result.

      Ideally, we should make all the analyzer rules idempotent, but that may require lots of effort to double-check them one by one (it may not be easy).

      An easier approach is to never feed an optimized plan into the Analyzer. This PR fixes the case for RunnableCommand: during execution, the passed `query` would otherwise be optimized and then passed into QueryExecution again. This PR makes these `query` plans not part of the children, so they will not be optimized and analyzed again.

      Right now, we do not know whether a logical plan has been optimized or not; we could introduce a flag for that, and make sure an optimized logical plan will not be analyzed again.
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #14797 from davies/fix_writer.
      ed9c884d
    • Felix Cheung's avatar
      [SPARK-17376][SPARKR] followup - change since version · eac1d0e9
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Change the since version in the doc.
      
      ## How was this patch tested?
      
      manual
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #14939 from felixcheung/rsparkversion2.
      eac1d0e9
    • Thomas Graves's avatar
      [SPARK-16711] YarnShuffleService doesn't re-init properly on YARN rolling upgrade · e79962f2
      Thomas Graves authored
      The Spark Yarn Shuffle Service doesn't re-initialize the application credentials early enough, which causes any other Spark executors trying to fetch from that node during a rolling upgrade to fail with "java.lang.NullPointerException: Password cannot be null if SASL is enabled". Right now the Spark shuffle service relies on the Yarn NodeManager to re-register the applications; unfortunately this happens after we open the port for other executors to connect. If other executors connect before the re-registration, they get a NullPointerException, which isn't a retryable exception, and they fail pretty quickly. To solve this I added another leveldb file so that the service can save and re-initialize all the applications before opening the port for other executors to connect to it. Adding another leveldb was simpler from the code-structure point of view.

      Most of the code changes are moving things to a common util class.

      The patch was tested manually on a Yarn cluster with a rolling upgrade happening while a Spark job was running. Without the patch I consistently get the NullPointerException; with the patch the job gets a few Connection refused exceptions, but the retries kick in and it succeeds.
      
      Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
      
      Closes #14718 from tgravescs/SPARK-16711.
      e79962f2
    • Felix Cheung's avatar
      [SPARKR][DOC] regexp_extract should doc that it returns empty string when match fails · 419eefd8
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Doc change - see https://issues.apache.org/jira/browse/SPARK-16324
      
      ## How was this patch tested?
      
      manual check
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #14934 from felixcheung/regexpextractdoc.
      419eefd8
    • Felix Cheung's avatar
      [SPARK-17376][SPARKR] Spark version should be available in R · 812333e4
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add sparkR.version() API.
      
      ```
      > sparkR.version()
      [1] "2.1.0-SNAPSHOT"
      ```
      
      ## How was this patch tested?
      
      manual, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #14935 from felixcheung/rsparksessionversion.
      812333e4
    • Jeff Zhang's avatar
      [SPARK-17261] [PYSPARK] Using HiveContext after re-creating SparkContext in... · ea662286
      Jeff Zhang authored
      [SPARK-17261] [PYSPARK] Using HiveContext after re-creating SparkContext in Spark 2.0 throws "Java.lang.illegalStateException: Cannot call methods on a stopped sparkContext"
      
      ## What changes were proposed in this pull request?
      
      Set SparkSession._instantiatedContext to None so that we can recreate SparkSession again.
      
      ## How was this patch tested?
      
      Tested manually using the following command in pyspark shell
      ```
      spark.stop()
      spark = SparkSession.builder.enableHiveSupport().getOrCreate()
      spark.sql("show databases").show()
      ```
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #14857 from zjffdu/SPARK-17261.
      ea662286
    • Josh Rosen's avatar
      [SPARK-17351] Refactor JDBCRDD to expose ResultSet -> Seq[Row] utility methods · 6bcbf9b7
      Josh Rosen authored
      This patch refactors the internals of the JDBC data source in order to allow some of its code to be re-used in an automated comparison testing harness. Here are the key changes:
      
      - Move the JDBC `ResultSetMetadata` to `StructType` conversion logic from `JDBCRDD.resolveTable()` to the `JdbcUtils` object (as a new `getSchema(ResultSet, JdbcDialect)` method), allowing it to be applied on `ResultSet`s that are created elsewhere.
      - Move the `ResultSet` to `InternalRow` conversion methods from `JDBCRDD` to `JdbcUtils`:
        - It makes sense to move the `JDBCValueGetter` type and `makeGetter` functions here given that their write-path counterparts (`JDBCValueSetter`) are already in `JdbcUtils`.
        - Add an internal `resultSetToSparkInternalRows` method which takes a `ResultSet` and schema and returns an `Iterator[InternalRow]`. This effectively extracts the main loop of `JDBCRDD` into its own method.
        - Add a public `resultSetToRows` method to `JdbcUtils`, which wraps the minimal machinery around `resultSetToSparkInternalRows` in order to allow it to be called from outside of a Spark job.
      - Make `JdbcDialect.get` into a `DeveloperApi` (`JdbcDialect` itself is already a `DeveloperApi`).
      
      Put together, these changes enable the following testing pattern:
      
      ```scala
      val jdbResultSet: ResultSet = conn.prepareStatement(query).executeQuery()
      val resultSchema: StructType = JdbcUtils.getSchema(jdbResultSet, JdbcDialects.get("jdbc:postgresql"))
      val jdbcRows: Seq[Row] = JdbcUtils.resultSetToRows(jdbResultSet, resultSchema).toSeq
      checkAnswer(sparkResult, jdbcRows) // in a test case
      ```
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #14907 from JoshRosen/modularize-jdbc-internals.
      6bcbf9b7
    • Robert Kruszewski's avatar
      [SPARK-16984][SQL] don't try whole dataset immediately when first partition doesn't have… · 806d8a8e
      Robert Kruszewski authored
      ## What changes were proposed in this pull request?
      
      Try an increasing number of partitions so we don't immediately revert to scanning the whole dataset.
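
      A hedged sketch of the back-off idea (the names and the growth factor below are illustrative, not the actual `executeTake` code):

      ```scala
      // Instead of jumping from the first partition straight to all partitions,
      // grow the number of partitions scanned on each retry.
      def nextPartitionsToTry(scannedSoFar: Int, totalPartitions: Int, growthFactor: Int = 4): Int =
        math.min(totalPartitions, math.max(1, scannedSoFar * growthFactor))
      ```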
      
      ## How was this patch tested?
      
      Empirically. This is a common-case optimization.
      
      Author: Robert Kruszewski <robertk@palantir.com>
      
      Closes #14573 from robert3005/robertk/execute-take-backoff.
      806d8a8e
    • gatorsmile's avatar
      [SPARK-16935][SQL] Verification of Function-related ExternalCatalog APIs · 247a4faf
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Function-related `HiveExternalCatalog` APIs do not have enough verification logic. After this PR, `HiveExternalCatalog` and `InMemoryCatalog` become consistent in their error handling.
      
      For example, below is the exception we got when calling `renameFunction`.
      ```
      15:13:40.369 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database db1, returning NoSuchObjectException
      15:13:40.377 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database db2, returning NoSuchObjectException
      15:13:40.739 ERROR DataNucleus.Datastore.Persist: Update of object "org.apache.hadoop.hive.metastore.model.MFunction205629e9" using statement "UPDATE FUNCS SET FUNC_NAME=? WHERE FUNC_ID=?" failed : org.apache.derby.shared.common.error.DerbySQLIntegrityConstraintViolationException: The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUEFUNCTION' defined on 'FUNCS'.
      	at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
      	at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
      	at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
      	at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
      ```
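
      A sketch of the kind of up-front check this adds (the helper and its arguments are hypothetical, not the actual `HiveExternalCatalog` code):

      ```scala
      // Fail with a clear, catalog-level error instead of letting the underlying
      // metastore surface a raw Derby constraint violation.
      def requireFunctionExists(db: String, func: String, registered: Set[String]): Unit =
        require(registered.contains(s"$db.$func"),
          s"Function '$func' does not exist in database '$db'")
      ```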
      
      ### How was this patch tested?
      Improved the existing test cases to check whether the messages are right.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14521 from gatorsmile/functionChecking.
      247a4faf
    • Kousuke Saruta's avatar
      [SPARK-17352][WEBUI] Executor computing time can be negative-number because of calculation error · 7ee24dac
      Kousuke Saruta authored
      ## What changes were proposed in this pull request?
      
      In StagePage, the executor computing time is calculated, but a calculation error can potentially occur because it is computed by subtracting floating-point numbers.

      The following capture shows an example.
      
      <img width="949" alt="capture-timeline" src="https://cloud.githubusercontent.com/assets/4736016/18152359/43f07a28-7030-11e6-8cbd-8e73bf4c4c67.png">
      
      ## How was this patch tested?
      
      Manual tests.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #14908 from sarutak/SPARK-17352.
      7ee24dac
    • Jacek Laskowski's avatar
      [SQL][DOC][MINOR] Add (Scala-specific) and (Java-specific) · a3097e2b
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      Adds (Scala-specific) and (Java-specific) annotations to Scaladoc.
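
      For example, the convention looks like this in a Scaladoc comment (an illustrative stub, not a specific method from this PR):

      ```scala
      /** (Scala-specific) Illustrative stub showing where the annotation goes in a Scaladoc comment. */
      def scalaSpecificExample(options: Map[String, String]): Unit = {
        // no-op: this stub only demonstrates the Scaladoc annotation placement
      }
      ```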
      
      ## How was this patch tested?
      
      local build
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #14891 from jaceklaskowski/scala-specifics.
      a3097e2b
    • Xin Ren's avatar
      [SPARK-15509][ML][SPARKR] R MLlib algorithms should support input columns "features" and "label" · 6969dcc7
      Xin Ren authored
      https://issues.apache.org/jira/browse/SPARK-15509
      
      ## What changes were proposed in this pull request?
      
      Currently in SparkR, when you load a LibSVM dataset using the sqlContext and then pass it to an MLlib algorithm, the ML wrappers will fail since they will try to create a "features" column, which conflicts with the existing "features" column from the LibSVM loader. E.g., using the "mnist" dataset from LibSVM:
      `training <- loadDF(sqlContext, ".../mnist", "libsvm")`
      `model <- naiveBayes(label ~ features, training)`
      This fails with:
      ```
      16/05/24 11:52:41 ERROR RBackendHandler: fit on org.apache.spark.ml.r.NaiveBayesWrapper failed
      Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
        java.lang.IllegalArgumentException: Output column features already exists.
      	at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120)
      	at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
      	at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
      	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
      	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
      	at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
      	at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179)
      	at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
      	at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131)
      	at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169)
      	at org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62)
      	at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca
      The same issue appears for the "label" column once you rename the "features" column.
      ```
      The cause is that when using `loadDF()` to generate dataframes, the default column names are sometimes `"label"` and `"features"`, and these two names conflict with the default column names `setDefault(labelCol, "label")` and `setDefault(featuresCol, "features")` in `SharedParams.scala`.
      
      ## How was this patch tested?
      
      Test on my local machine.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #13584 from keypointt/SPARK-15509.
      6969dcc7
    • wm624@hotmail.com's avatar
      [SPARK-16883][SPARKR] SQL decimal type is not properly cast to number when... · 0f30cded
      wm624@hotmail.com authored
      [SPARK-16883][SPARKR] SQL decimal type is not properly cast to number when collecting SparkDataFrame
      
      ## What changes were proposed in this pull request?
      
      
      ```
      registerTempTable(createDataFrame(iris), "iris")
      str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y  from iris limit 5")))

      'data.frame':	5 obs. of  2 variables:
       $ x: num  1 1 1 1 1
       $ y:List of 5
        ..$ : num 2
        ..$ : num 2
        ..$ : num 2
        ..$ : num 2
        ..$ : num 2
      ```
      
      The problem is that Spark returns the `decimal(10, 0)` column type instead of `decimal`. Thus, `decimal(10, 0)` is not handled correctly. It should be handled as "double".
      
      As discussed in the JIRA thread, we have two potential fixes:
      1). Scala-side fix: add a new case when writing the object back. However, I can't use spark.sql.types._ in Spark core due to dependency issues, and I haven't found a way of doing the type case match;

      2). SparkR-side fix: add a helper function to check special types like `"decimal(10, 0)"` and replace them with `double`, which is a PRIMITIVE type. This helper is generic for handling new types in the future.

      I open this PR to discuss the pros and cons of both approaches. If we want to do the Scala-side fix, we need to find a way to match the case of DecimalType and StructType in Spark core.
      
      ## How was this patch tested?
      
      
      Manual test:
      ```
      > str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y  from iris limit 5")))
      'data.frame':	5 obs. of  2 variables:
       $ x: num  1 1 1 1 1
       $ y: num  2 2 2 2 2
      ```
      R unit tests.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #14613 from wangmiao1981/type.
      0f30cded
    • Kousuke Saruta's avatar
      [SPARK-17342][WEBUI] Style of event timeline is broken · 2ab8dbdd
      Kousuke Saruta authored
      ## What changes were proposed in this pull request?
      
      SPARK-15373 (#13158) updated the version of vis.js to 4.16.1. As of 4.0.0, some classes were renamed (e.g. 'timeline' to 'vis-timeline'), but that ticket didn't take care of it, so the style is now broken.
      
      In this PR, I've restored the style by modifying `timeline-view.css` and `timeline-view.js`.
      
      ## How was this patch tested?
      
      manual tests.
      
      
      * Before
      <img width="1258" alt="2016-09-01 1 38 31" src="https://cloud.githubusercontent.com/assets/4736016/18141311/fddf1bac-6ff3-11e6-935f-28b389073b39.png">
      
      * After
      <img width="1256" alt="2016-09-01 3 30 19" src="https://cloud.githubusercontent.com/assets/4736016/18141394/49af65dc-6ff4-11e6-8640-70e20300f3c3.png">
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #14900 from sarutak/SPARK-17342.
      2ab8dbdd
  4. Sep 01, 2016
    • Brian Cho's avatar
      [SPARK-16926][SQL] Add unit test to compare table and partition column metadata. · f2d6e2ef
      Brian Cho authored
      ## What changes were proposed in this pull request?
      
      Add unit test for changes made in PR #14515. It makes sure that a newly created table has the same number of columns in table and partition metadata. This test fails before the changes introduced in #14515.
      
      ## How was this patch tested?
      
      Run new unit test.
      
      Author: Brian Cho <bcho@fb.com>
      
      Closes #14930 from dafrista/partition-metadata-unit-test.
      f2d6e2ef
    • Lianhui Wang's avatar
      [SPARK-16302][SQL] Set the right number of partitions for reading data from a local collection. · 06e33985
      Lianhui Wang authored
      Follow-up to #13137. This PR sets the right number of partitions when reading data from a local collection.
      The query `val df = Seq((1, 2)).toDF("key", "value").count` always uses defaultParallelism tasks, so it runs many empty or small tasks.
      
      Manually tested and checked.
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #13979 from lianhuiwang/localTable-Parallel.
      06e33985
    • Yangyang Liu's avatar
      [SPARK-16619] Add shuffle service metrics entry in monitoring docs · 5bea8757
      Yangyang Liu authored
      After the change in [SPARK-16405](https://github.com/apache/spark/pull/14080), we need to update the docs by adding a shuffle service metrics entry to the list of currently supported metrics.
      
      Author: Yangyang Liu <yangyangliu@fb.com>
      
      Closes #14254 from lovexi/yangyang-monitoring-doc.
      5bea8757
    • Qifan Pu's avatar
      [SPARK-16525] [SQL] Enable Row Based HashMap in HashAggregateExec · 03d77af9
      Qifan Pu authored
      ## What changes were proposed in this pull request?
      
      This PR is the second step for the following feature:
      
      For hash aggregation in Spark SQL, we use a fast aggregation hashmap to act as a "cache" in order to boost aggregation performance. Previously, the hashmap was backed by a `ColumnarBatch`. This has performance issues when we have a wide schema for the aggregation table (a large number of key fields or value fields).
      In this JIRA, we support another implementation of the fast hashmap, which is backed by a `RowBatch`. We then automatically pick between the two implementations based on certain knobs.
      
      In this second-step PR, we enable `RowBasedHashMapGenerator` in `HashAggregateExec`.
      
      ## How was this patch tested?
      
      Added tests: `RowBasedAggregateHashMapSuite` and `VectorizedAggregateHashMapSuite`.
      Additional micro-benchmark tests and TPCDS results will be added in a separate PR in the series.
      
      Author: Qifan Pu <qifan.pu@gmail.com>
      Author: ooq <qifan.pu@gmail.com>
      
      Closes #14176 from ooq/rowbasedfastaggmap-pr2.
      03d77af9
    • Josh Rosen's avatar
      [SPARK-17355] Workaround for HIVE-14684 / HiveResultSetMetaData.isSigned exception · 15539e54
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      Attempting to use Spark SQL's JDBC data source against the Hive ThriftServer results in a `java.sql.SQLException: Method not supported` exception from `org.apache.hive.jdbc.HiveResultSetMetaData.isSigned`. Here are two user reports of this issue:
      
      - https://stackoverflow.com/questions/34067686/spark-1-5-1-not-working-with-hive-jdbc-1-2-0
      - https://stackoverflow.com/questions/32195946/method-not-supported-in-spark
      
      I have filed [HIVE-14684](https://issues.apache.org/jira/browse/HIVE-14684) to attempt to fix this in Hive by implementing the isSigned method, but in the meantime / for compatibility with older JDBC drivers I think we should add special-case error handling to work around this bug.
      
      This patch updates the `ResultSetMetadata`-to-schema conversion in `JDBCRDD` to catch the "Method not supported" exception from Hive and return `isSigned = true`. I believe that this is safe because, as far as I know, Hive does not support unsigned numeric types.
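
      A hedged sketch of the special-case handling (the helper name is hypothetical; the real change lives in the JDBC schema-conversion code):

      ```scala
      import java.sql.{ResultSetMetaData, SQLException}

      // Hive's HiveResultSetMetaData.isSigned throws "Method not supported";
      // treating that case as signed is safe because Hive has no unsigned numeric types.
      def isSignedOrHiveWorkaround(metadata: ResultSetMetaData, column: Int): Boolean =
        try {
          metadata.isSigned(column)
        } catch {
          case e: SQLException if e.getMessage == "Method not supported" => true
        }
      ```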
      
      ## How was this patch tested?
      
      Tested manually against a Spark Thrift Server.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #14911 from JoshRosen/hive-jdbc-workaround.
      15539e54
    • hyukjinkwon's avatar
      [SPARK-16461][SQL] Support partition batch pruning with `<=>` predicate in InMemoryTableScanExec · d314677c
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It seems the `EqualNullSafe` filter was missed when batch-pruning partitions in cached tables.

      It seems supporting this makes the query roughly 5 times faster.

      Running the code below:
      
      ```scala
      test("Null-safe equal comparison") {
        val N = 20000000
        val df = spark.range(N).repartition(20)
        val benchmark = new Benchmark("Null-safe equal comparison", N)
        df.createOrReplaceTempView("t")
        spark.catalog.cacheTable("t")
        sql("select id from t where id <=> 1").collect()
      
        benchmark.addCase("Null-safe equal comparison", 10) { _ =>
          sql("select id from t where id <=> 1").collect()
        }
        benchmark.run()
      }
      ```
      
      produces the results below:
      
      **Before:**
      
      ```
      Running benchmark: Null-safe equal comparison
        Running case: Null-safe equal comparison
        Stopped after 10 iterations, 2098 ms
      
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14 on Mac OS X 10.11.5
      Intel(R) Core(TM) i7-4850HQ CPU  2.30GHz
      
      Null-safe equal comparison:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Null-safe equal comparison                     204 /  210         98.1          10.2       1.0X
      ```
      
      **After:**
      
      ```
      Running benchmark: Null-safe equal comparison
        Running case: Null-safe equal comparison
        Stopped after 10 iterations, 478 ms
      
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14 on Mac OS X 10.11.5
      Intel(R) Core(TM) i7-4850HQ CPU  2.30GHz
      
      Null-safe equal comparison:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Null-safe equal comparison                      42 /   48        474.1           2.1       1.0X
      ```
      
      ## How was this patch tested?
      
      Unit tests in `PartitionBatchPruningSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14117 from HyukjinKwon/SPARK-16461.
      d314677c
    • Yucai Yu's avatar
      [SPARK-16732][SQL] Remove unused codes in subexpressionEliminationForWholeStageCodegen · e388bd54
      Yucai Yu authored
      ## What changes were proposed in this pull request?
      Some code in subexpressionEliminationForWholeStageCodegen is never actually used.
      This PR removes it.
      
      ## How was this patch tested?
      Local unit tests.
      
      Author: Yucai Yu <yucai.yu@intel.com>
      
      Closes #14366 from yucai/subExpr_unused_codes.
      e388bd54
    • Brian Cho's avatar
      [SPARK-16926] [SQL] Remove partition columns from partition metadata. · 473d7864
      Brian Cho authored
      ## What changes were proposed in this pull request?
      
      This removes partition columns from column metadata of partitions to match tables.
      
      A change introduced in SPARK-14388 removed partition columns from the column metadata of tables, but not for partitions. This causes TableReader to believe that the schema is different between table and partition, and create an unnecessary conversion object inspector in TableReader.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Brian Cho <bcho@fb.com>
      
      Closes #14515 from dafrista/partition-columns-metadata.
      473d7864
    • Marcelo Vanzin's avatar
      [SPARK-16533][HOTFIX] Fix compilation on Scala 2.10. · edb45734
      Marcelo Vanzin authored
      No idea why it was failing (the needed import was there), but
      this makes things work.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #14925 from vanzin/SPARK-16533.
      edb45734
    • Sean Owen's avatar
      [SPARK-17331][CORE][MLLIB] Avoid allocating 0-length arrays · 3893e8c5
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Avoid allocating some 0-length arrays, especially in UTF8String, by using Array.empty in Scala instead of Array[T]().
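
      For illustration (a small sketch of the pattern, not a specific call site):

      ```scala
      // Array[String]() goes through the varargs Array.apply overload,
      // while Array.empty[String] builds the 0-length array directly.
      val viaVarargs: Array[String] = Array[String]()
      val viaEmpty: Array[String]   = Array.empty[String]
      ```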
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14895 from srowen/SPARK-17331.
      3893e8c5
    • Herman van Hovell's avatar
      [SPARK-17263][SQL] Add hexadecimal literal parsing · 2be5f8d7
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      This PR adds the ability to parse SQL (hexadecimal) binary literals (AKA bit strings). They follow the syntax `X'[Hexadecimal Characters]+'`; for example, `X'01AB'` would create the binary array `0x01AB`.
      
      If an odd number of hexadecimal characters is passed, then the upper 4 bits of the initial byte are kept empty and the lower 4 bits are filled using the first character. For example, `X'1C7'` would create the binary array `0x01C7`.
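
      Illustrative examples of the syntax (assuming a `SparkSession` named `spark`; the expected values in the comments follow the padding rule just described):

      ```scala
      // Even number of hex characters: bytes map directly.
      spark.sql("SELECT X'01AB'")  // binary value 0x01AB

      // Odd number of hex characters: a leading byte is padded in its upper 4 bits.
      spark.sql("SELECT X'1C7'")   // binary value 0x01C7
      ```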
      
      Binary data (Array[Byte]) does not have proper `hashCode` and `equals` functions. This meant that comparing `Literal`s containing binary data was a pain. I have updated Literal.hashCode and Literal.equals to deal properly with binary data.
      
      ## How was this patch tested?
      Added tests to the `ExpressionParserSuite`, `SQLQueryTestSuite` and `ExpressionSQLBuilderSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #14832 from hvanhovell/SPARK-17263.
      2be5f8d7
    • Angus Gerry's avatar
      [SPARK-16533][CORE] resolve deadlocking in driver when executors die · a0aac4b7
      Angus Gerry authored
      ## What changes were proposed in this pull request?
      This pull request reverts the changes made as a part of #14605, which simply side-steps the deadlock issue. Instead, I propose the following approach:
      * Use `scheduleWithFixedDelay` when calling `ExecutorAllocationManager.schedule` for scheduling executor requests. The intent of this is that if invocations are delayed beyond the default schedule interval on account of lock contention, then we avoid a situation where calls to `schedule` are made back-to-back, potentially releasing and then immediately reacquiring these locks - further exacerbating contention.
      * Replace a number of calls to `askWithRetry` with `ask` inside of message handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us to queue messages with the relevant endpoints, release whatever locks we might be holding, and then block whilst awaiting the response. This change is made at the cost of being able to retry should sending the message fail, as retrying outside of the lock could easily cause race conditions if other conflicting messages have been sent whilst awaiting a response. I believe this to be the lesser of two evils, as in many cases these RPC calls are to process-local components, so failures are more likely to be deterministic and timeouts are more likely to be caused by lock contention.
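
      A sketch of the scheduling change from the first bullet above (the executor wiring and the interval values are illustrative, not the actual ExecutorAllocationManager code):

      ```scala
      import java.util.concurrent.{Executors, TimeUnit}

      val scheduler = Executors.newSingleThreadScheduledExecutor()

      // scheduleWithFixedDelay waits the full delay *after* each run finishes, so runs
      // delayed by lock contention cannot pile up back-to-back the way fixed-rate runs can.
      scheduler.scheduleWithFixedDelay(
        new Runnable { override def run(): Unit = { /* allocationManager.schedule() */ } },
        0L, 100L, TimeUnit.MILLISECONDS)
      ```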
      
      ## How was this patch tested?
      Existing tests, and manual tests under yarn-client mode.
      
      Author: Angus Gerry <angolon@gmail.com>
      
      Closes #14710 from angolon/SPARK-16533.
      a0aac4b7
    • Tejas Patil's avatar
      [SPARK-17271][SQL] Remove redundant `semanticEquals()` from `SortOrder` · adaaffa3
      Tejas Patil authored
      ## What changes were proposed in this pull request?
      
      Removing `semanticEquals()` from `SortOrder` because it can use the `semanticEquals()` provided by its parent class (`Expression`). This was as per suggestion by cloud-fan at https://github.com/apache/spark/pull/14841/files/7192418b3a26a14642fc04fc92bf496a954ffa5d#r77106801
      
      ## How was this patch tested?
      
      Ran the test added in https://github.com/apache/spark/pull/14841
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #14910 from tejasapatil/SPARK-17271_remove_semantic_ordering.
      adaaffa3