Commits · 6e1c55eac4849669e119ce0d51f6d051830deb9f · cs525-sp18-g07 / spark

Dec 09, 2015

[SPARK-12012][SQL] Show more comprehensive PhysicalRDD metadata when visualizing SQL query plan · 6e1c55ea

Cheng Lian authored 9 years ago

This PR adds a `private[sql]` method `metadata` to `SparkPlan`, which can be used to describe detailed information about a physical plan during visualization. Specifically, this PR uses this method to provide details of `PhysicalRDD`s translated from a data source relation. For example, a `ParquetRelation` converted from Hive metastore table `default.psrc` is now shown as in the following screenshot:

![image](https://cloud.githubusercontent.com/assets/230655/11526657/e10cb7e6-9916-11e5-9afa-f108932ec890.png)

And here is the screenshot for a regular `ParquetRelation` (not converted from Hive metastore table) loaded from a really long path:

![output](https://cloud.githubusercontent.com/assets/230655/11680582/37c66460-9e94-11e5-8f50-842db5309d5a.png)

Author: Cheng Lian <lian@databricks.com>

Closes #10004 from liancheng/spark-12012.physical-rdd-metadata.

6e1c55ea

[SPARK-12031][CORE][BUG] Integer overflow when do sampling · a1132168
uncleGen authored 9 years ago

Author: uncleGen <hustyugm@gmail.com>

Closes #10023 from uncleGen/1.6-bugfix.

a1132168

[SPARK-11676][SQL] Parquet filter tests all pass if filters are not really pushed down · f6883bb7

hyukjinkwon authored 9 years ago

Currently, all Parquet predicate tests pass even if filters are not pushed down or push-down is disabled.

To check that filters are evaluated, this PR takes the expression from `expression.Filter` and then tries to create filters just as Spark does.

To check the results, it manually accesses the child RDD (of `expression.Filter`), produces the results, which should be filtered properly, and then compares them to the expected values.

Now, if filters are not pushed down or push-down is disabled, the tests throw exceptions.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #9659 from HyukjinKwon/SPARK-11676.

f6883bb7

Dec 08, 2015

[SPARK-12222] [CORE] Deserialize RoaringBitmap using Kryo serializer throw... · 3934562d

Fei Wang authored 9 years ago

[SPARK-12222] [CORE] Deserialize RoaringBitmap using Kryo serializer throw Buffer underflow exception

Jira: https://issues.apache.org/jira/browse/SPARK-12222

Deserializing a RoaringBitmap with the Kryo serializer throws a buffer underflow exception:
```
com.esotericsoftware.kryo.KryoException: Buffer underflow.
	at com.esotericsoftware.kryo.io.Input.require(Input.java:156)
	at com.esotericsoftware.kryo.io.Input.skip(Input.java:131)
	at com.esotericsoftware.kryo.io.Input.skip(Input.java:264)
```

This is caused by a bug in Kryo's `Input.skip(long count)` (https://github.com/EsotericSoftware/kryo/issues/119), which we call in `KryoInputDataInputBridge`.

Instead of upgrading Kryo, this PR bypasses Kryo's `Input.skip(long count)` by directly calling another `skip` method in Kryo's Input.java (https://github.com/EsotericSoftware/kryo/blob/kryo-2.21/src/com/esotericsoftware/kryo/io/Input.java#L124), i.e. it writes the bug-fixed version of `Input.skip(long count)` in KryoInputDataInputBridge's `skipBytes` method.

More details: https://github.com/apache/spark/pull/9748#issuecomment-162860246
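The workaround above amounts to skipping a large count in bounded steps. A minimal Python sketch of that idea follows. The names `skip_bytes` and `max_chunk`, and the use of `io.BytesIO` as the input, are illustrative assumptions and not the code in `KryoInputDataInputBridge`.

```python
import io

INT_MAX = 2**31 - 1  # largest count a 32-bit int-based skip could take


def skip_bytes(stream, count, max_chunk=INT_MAX):
    """Skip `count` bytes by issuing skips of at most `max_chunk` bytes each."""
    remaining = count
    while remaining > 0:
        # Never request more than what is left, nor more than one chunk.
        step = min(max_chunk, remaining)
        data = stream.read(step)
        if len(data) < step:
            raise EOFError("buffer underflow")
        remaining -= step
    return count


# Hypothetical usage: skip 7 of 10 bytes in steps of 3, 3 and 1.
stream = io.BytesIO(bytes(range(10)))
skip_bytes(stream, 7, max_chunk=3)
next_byte = stream.read(1)
```

The key point of the sketch is that each step is the minimum of the chunk limit and the remaining count, so the loop can never request more bytes than the caller asked for.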

Author: Fei Wang <wangfei1@huawei.com>

Closes #10213 from scwf/patch-1.

3934562d

[SPARK-11343][ML] Documentation of float and double prediction/label columns in RegressionEvaluator · a0046e37

Dominik Dahlem authored 9 years ago

felixcheung , mengxr

Just added a message to require()

Author: Dominik Dahlem <dominik.dahlem@gmail.combination>

Closes #9598 from dahlem/ddahlem_regression_evaluator_double_predictions_message_04112015.

a0046e37

[SPARK-8517][ML][DOC] Reorganizes the spark.ml user guide · 765c67f5

Timothy Hunter authored 9 years ago

This PR moves pieces of the spark.ml user guide to reflect suggestions in SPARK-8517. It does not introduce new content, as requested.

<img width="192" alt="screen shot 2015-12-08 at 11 36 00 am" src="https://cloud.githubusercontent.com/assets/7594753/11666166/e82b84f2-9d9f-11e5-8904-e215424d8444.png">

Author: Timothy Hunter <timhunter@databricks.com>

Closes #10207 from thunterdb/spark-8517.

765c67f5

[SPARK-12069][SQL] Update documentation with Datasets · 39594894
Michael Armbrust authored 9 years ago

Author: Michael Armbrust <michael@databricks.com>

Closes #10060 from marmbrus/docs.

39594894

[SPARK-12187] *MemoryPool classes should not be fully public · 94945216

Andrew Or authored 9 years ago

This patch tightens them to `private[memory]`.

Author: Andrew Or <andrew@databricks.com>

Closes #10182 from andrewor14/memory-visibility.

94945216

[SPARK-3873][BUILD] Add style checker to enforce import ordering. · 2ff17bcf

Marcelo Vanzin authored 9 years ago

The checker tries to follow as closely as possible the guidelines of
the code style document, and makes some decisions where the guide is
not clear. In particular:

- wildcard imports come first when there are other imports in the
  same package
- multi-import blocks come before single imports
- lower-case names inside multi-import blocks come before others

In some projects, such as graphx, there seems to be a convention to
separate o.a.s imports from the project's own; to simplify the
checker, I chose not to allow that, which is a strict interpretation
of the code style guide, even though I think it makes sense.

Since the checks are based on syntax only, some edge cases may
generate spurious warnings; for example, when class names start
with a lower case letter (and are thus treated as a package name
by the checker).

The checker is currently only generating warnings, and since there
are many of those, the build output does get a little noisy. The
idea is to fix the code (and the checker, as needed) little by little
instead of having a huge change that touches everywhere.
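As an illustration of the third rule listed above (lower-case names inside multi-import blocks come before others), here is a small Python sketch. It is one possible reading of the rule, not the checker's code: the function name `order_multi_import` is hypothetical, and sorting alphabetically within each group is an added assumption.

```python
def order_multi_import(names):
    """Order the names of one multi-import block.

    Names starting with a lower-case letter come first, then the rest.
    Alphabetical order within each group is an assumption of this sketch.
    """
    return sorted(names, key=lambda name: (not name[:1].islower(), name))


# Hypothetical selector list taken from a block like `import pkg.{...}`.
ordered = order_multi_import(["Logging", "util", "SparkConf", "broadcast"])
```

Because the decision is based on the first character only, a class whose name starts with a lower-case letter would be grouped with packages, which matches the kind of edge case the message mentions.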

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6502 from vanzin/SPARK-3873.

2ff17bcf

[SPARK-12159][ML] Add user guide section for IndexToString transformer · 06746b30

BenFradet authored 9 years ago

Documentation regarding the `IndexToString` label transformer with code snippets in Scala/Java/Python.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #10166 from BenFradet/SPARK-12159.

06746b30

[SPARK-11605][MLLIB] ML 1.6 QA: API: Java compatibility, docs · 5cb46950

Yuhao Yang authored 9 years ago

jira: https://issues.apache.org/jira/browse/SPARK-11605
Check Java compatibility for MLlib for this release.

fix:

1. `StreamingTest.registerStream` needs a Java-friendly interface.

2. `GradientBoostedTreesModel.computeInitialPredictionAndError` and `GradientBoostedTreesModel.updatePredictionError` have Java compatibility issues. Mark them as `developerAPI`.

TBD:
[updated] no fix for now per discussion.
`org.apache.spark.mllib.classification.LogisticRegressionModel`
`public scala.Option<java.lang.Object> getThreshold();` has the wrong return type for Java invocation.
`SVMModel` has a similar issue.

Yet adding a `scala.Option<java.lang.Double> getThreshold()` would result in an overloading error due to the same function signature, and adding a new function with a different name does not seem necessary.

cc jkbradley feynmanliang

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #10102 from hhbyyh/javaAPI.

5cb46950

[SPARK-12205][SQL] Pivot fails Analysis when aggregate is UnresolvedFunction · 4bcb8949

Andrew Ray authored 9 years ago

Delays application of ResolvePivot until all aggregates are resolved, to prevent problems with UnresolvedFunction, and adds a unit test.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #10202 from aray/sql-pivot-unresolved-function.

4bcb8949

[SPARK-10393] use ML pipeline in LDA example · 872a2ee2

Yuhao Yang authored 9 years ago

jira: https://issues.apache.org/jira/browse/SPARK-10393

Since the logic of the text processing part has been moved to ML estimators/transformers, replace the related code in LDA Example with the ML pipeline.

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: yuhaoyang <yuhao@zhanglipings-iMac.local>

Closes #8551 from hhbyyh/ldaExUpdate.

872a2ee2

[SPARK-12188][SQL] Code refactoring and comment correction in Dataset APIs · 5d96a710

gatorsmile authored 9 years ago

This PR contains the following updates:

- Created a new private variable `boundTEncoder` that can be shared by multiple functions: `RDD`, `select` and `collect`.
- Replaced all uses of `queryExecution.analyzed` with the function call `logicalPlan`.
- Corrected a few API comments that used wrong class names (e.g., `DataFrame`) or parameter names (e.g., `n`).
- Corrected a few API descriptions that were wrong (e.g., `mapPartitions`).

marmbrus rxin cloud-fan Could you take a look and check if they are appropriate? Thank you!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10184 from gatorsmile/datasetClean.

5d96a710

[SPARK-12195][SQL] Adding BigDecimal, Date and Timestamp into Encoder · c0b13d55

gatorsmile authored 9 years ago

This PR adds three more data types to Encoder: `BigDecimal`, `Date` and `Timestamp`.

marmbrus cloud-fan rxin Could you take a quick look at these three types? Not sure if it can be merged to 1.6. Thank you very much!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10188 from gatorsmile/dataTypesinEncoder.

c0b13d55

[SPARK-12201][SQL] add type coercion rule for greatest/least · 381f17b5

Wenchen Fan authored 9 years ago

Checked with Hive: greatest/least should cast their children to the tightest common type,
i.e. `(int, long) => long`, `(int, string) => error`, `(decimal(10,5), decimal(5, 10)) => error`
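The three cases above can be reproduced with a small Python sketch. This is a simplified model and not Spark's analyzer code: the type encoding, the `NUMERIC_WIDTH` table, and the decimal rule (one decimal type is accepted only if it has at least as many integer digits and at least as large a scale as the other) are assumptions made for illustration.

```python
from functools import reduce

# Hypothetical widening order for integral types.
NUMERIC_WIDTH = {"int": 1, "long": 2}


def is_decimal(t):
    # Decimals are modelled as ("decimal", precision, scale).
    return isinstance(t, tuple) and t[0] == "decimal"


def holds(wide, narrow):
    """True if decimal `wide` can represent every value of decimal `narrow`."""
    _, wp, ws = wide
    _, np_, ns = narrow
    return (wp - ws) >= (np_ - ns) and ws >= ns


def tightest_common_type(a, b):
    """Return the common type of a and b, or None if there is none."""
    if a == b:
        return a
    if a in NUMERIC_WIDTH and b in NUMERIC_WIDTH:
        return a if NUMERIC_WIDTH[a] > NUMERIC_WIDTH[b] else b
    if is_decimal(a) and is_decimal(b):
        if holds(a, b):
            return a
        if holds(b, a):
            return b
    return None


def coerce_children(types):
    """Fold the children types into one type, or raise if impossible."""
    def step(acc, t):
        common = tightest_common_type(acc, t)
        if common is None:
            raise TypeError(f"no tightest common type for {acc} and {t}")
        return common
    return reduce(step, types)
```

With this model, `coerce_children(["int", "long"])` yields `"long"`, while `["int", "string"]` and the two decimal types from the message raise an error because neither side can hold the other without loss.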

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10196 from cloud-fan/type-coercion.

381f17b5

[SPARK-12074] Avoid memory copy involving ByteBuffer.wrap(ByteArrayOutputStream.toByteArray) · 75c60bf4

tedyu authored 9 years ago

SPARK-12060 fixed JavaSerializerInstance.serialize
This PR applies the same technique on two other classes.

zsxwing

Author: tedyu <yuzhihong@gmail.com>

Closes #10177 from tedyu/master.

75c60bf4

[SPARK-11155][WEB UI] Stage summary json should include stage duration · 6cb06e87

Xin Ren authored 9 years ago

The JSON endpoint for stages doesn't include information on the stage duration that is present in the UI. This looks like a simple oversight; the information should be included. E.g., the metrics should be included at `api/v1/applications/<appId>/stages`.

Metrics I've added are: submissionTime, firstTaskLaunchedTime and completionTime
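With these fields present, a client can derive the stage duration itself. The Python sketch below shows one way to do that. The sample record and the timestamp layout in `TIME_FORMAT` are assumptions for illustration, so check them against the actual endpoint output before relying on them.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout; adjust it if the endpoint uses a different one.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fGMT"


def stage_duration_ms(stage):
    """Return completionTime - submissionTime in milliseconds, or None."""
    submitted = stage.get("submissionTime")
    completed = stage.get("completionTime")
    if submitted is None or completed is None:
        return None  # the stage has not been submitted or has not finished
    start = datetime.strptime(submitted, TIME_FORMAT)
    end = datetime.strptime(completed, TIME_FORMAT)
    return (end - start) // timedelta(milliseconds=1)


# Hypothetical stage record containing the three added fields.
sample_stage = {
    "submissionTime": "2015-12-08T10:00:00.250GMT",
    "firstTaskLaunchedTime": "2015-12-08T10:00:00.400GMT",
    "completionTime": "2015-12-08T10:00:02.750GMT",
}
duration = stage_duration_ms(sample_stage)
```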

Author: Xin Ren <iamshrek@126.com>

Closes #10107 from keypointt/SPARK-11155.

6cb06e87

[SPARK-11652][CORE] Remote code execution with InvokerTransformer · e3735ce1

Sean Owen authored 9 years ago

Fix commons-collection group ID to commons-collections for version 3.x

Patches earlier PR at https://github.com/apache/spark/pull/9731

Author: Sean Owen <sowen@cloudera.com>

Closes #10198 from srowen/SPARK-11652.2.

e3735ce1

[SPARK-11551][DOC][EXAMPLE] Revert PR #10002 · da2012a0

Cheng Lian authored 9 years ago

This reverts PR #10002, commit 78209b0c.

The original PR wasn't tested on Jenkins before being merged.

Author: Cheng Lian <lian@databricks.com>

Closes #10200 from liancheng/revert-pr-10002.

da2012a0

[SPARK-11439][ML] Optimization of creating sparse feature without dense one · 037b7e76

Nakul Jindal authored 9 years ago

Sparse features generated in LinearDataGenerator no longer create dense vectors as an intermediate step.

Author: Nakul Jindal <njindal@us.ibm.com>

Closes #9756 from nakul02/SPARK-11439_sparse_without_creating_dense_feature.

037b7e76

[SPARK-12166][TEST] Unset hadoop related environment in testing · 70812918
Jeff Zhang authored 9 years ago

Author: Jeff Zhang <zjffdu@apache.org>

Closes #10172 from zjffdu/SPARK-12166.

70812918

[SPARK-12103][STREAMING][KAFKA][DOC] document that K means Key and V … · 48a9804b

cody koeninger authored 9 years ago

[SPARK-12103][STREAMING][KAFKA][DOC] document that K means Key and V means Value

Author: cody koeninger <cody@koeninger.org>

Closes #10132 from koeninger/SPARK-12103.

48a9804b

[SPARK-11958][SPARK-11957][ML][DOC] SQLTransformer user guide and example code · 4a39b5a1

Yanbo Liang authored 9 years ago

Add `SQLTransformer` user guide and example code, and make the Scala API doc clearer.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10006 from yanboliang/spark-11958.

4a39b5a1

[SPARK-10259][ML] Add @since annotation to ml.classification · 7d05a624

Takahashi Hiroshi authored 9 years ago

Add since annotation to ml.classification

Author: Takahashi Hiroshi <takahashi.hiroshi@lab.ntt.co.jp>

Closes #8534 from taishi-oss/issue10259.

7d05a624

Closes #10098 · 73896588
Xiangrui Meng authored 9 years ago

73896588

[SPARK-11551][DOC][EXAMPLE] Replace example code in ml-features.md using include_example · 78209b0c

somideshmukh authored 9 years ago

Made a new patch containing only the markdown examples moved to the example folder.
Only three Java classes were not shifted since they contained compilation errors. These classes are:
1) StandardScale 2) NormalizerExample 3) VectorIndexer

Author: Xusen Yin <yinxusen@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>

Closes #10002 from somideshmukh/SomilBranch1.33.

78209b0c

Dec 07, 2015

[SPARK-12160][MLLIB] Use SQLContext.getOrCreate in MLlib · 3e7e05f5

Joseph K. Bradley authored 9 years ago

Switched from using SQLContext constructor to using getOrCreate, mainly in model save/load methods.

This covers all instances in spark.mllib. There were no uses of the constructor in spark.ml.

CC: mengxr yhuai

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #10161 from jkbradley/mllib-sqlcontext-fix.

3e7e05f5

[SPARK-12184][PYTHON] Make python api doc for pivot consistant with scala doc · 36282f78

Andrew Ray authored 9 years ago

In SPARK-11946 the API for pivot was changed a bit and its doc was updated, but the doc changes were not made for the Python API. This PR updates the Python doc to be consistent.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #10176 from aray/sql-pivot-python-doc.

36282f78

[SPARK-11884] Drop multiple columns in the DataFrame API · 84b80944

tedyu authored 9 years ago

See the thread Ben started:
http://search-hadoop.com/m/q3RTtveEuhjsr7g/

This PR adds a `drop()` method to DataFrame that accepts multiple column names.

Author: tedyu <yuzhihong@gmail.com>

Closes #9862 from ted-yu/master.

84b80944

[SPARK-11963][DOC] Add docs for QuantileDiscretizer · 871e85d9

Xusen Yin authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-11963

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9962 from yinxusen/SPARK-11963.

871e85d9

[SPARK-12060][CORE] Avoid memory copy in JavaSerializerInstance.serialize · 3f4efb5c

Shixiong Zhu authored 9 years ago

Merged #10051 again since #10083 is resolved.

This reverts commit 328b757d.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10167 from zsxwing/merge-SPARK-12060.

3f4efb5c

[SPARK-11932][STREAMING] Partition previous TrackStateRDD if partitioner not present · 5d80d8c6

Tathagata Das authored 9 years ago

The reason is that TrackStateRDDs generated by trackStateByKey expect the previous batch's TrackStateRDDs to have a partitioner. However, when recovering from DStream checkpoints, the RDDs recovered from RDD checkpoints do not have a partitioner attached to them. This is because RDD checkpoints do not preserve the partitioner (SPARK-12004).

While #9983 solves SPARK-12004 by preserving the partitioner through RDD checkpoints, there is still a non-zero chance that saving or recovery fails. To be resilient, this PR repartitions the previous state RDD if the partitioner is not detected.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #9988 from tdas/SPARK-11932.

5d80d8c6

[SPARK-12132] [PYSPARK] raise KeyboardInterrupt inside SIGINT handler · ef3f047c

Davies Liu authored 9 years ago

Currently, the current line is not cleared by Ctrl-C.

After this patch:
```
>>> asdfasdf^C
Traceback (most recent call last):
  File "~/spark/python/pyspark/context.py", line 225, in signal_handler
    raise KeyboardInterrupt()
KeyboardInterrupt
```

It's still worse than 1.5 (and before).
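The traceback above comes from a handler that raises `KeyboardInterrupt` itself. A minimal Python sketch of such a handler is shown below. The factory `make_sigint_handler` and the `cancel_jobs` callback are hypothetical names used for illustration, not the code in `pyspark/context.py`.

```python
import signal


def make_sigint_handler(cancel_jobs):
    """Build a SIGINT handler that runs a callback, then interrupts the caller."""
    def signal_handler(signum, frame):
        cancel_jobs()              # hypothetical clean-up before interrupting
        raise KeyboardInterrupt()  # propagate Ctrl-C to the interactive shell
    return signal_handler


cancelled = []
handler = make_sigint_handler(lambda: cancelled.append(True))

# To install it (only allowed from the main thread):
#     signal.signal(signal.SIGINT, handler)
```

Raising inside the handler means the exception surfaces at whatever statement the main thread was executing, which is why the traceback points into the handler.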

Author: Davies Liu <davies@databricks.com>

Closes #10134 from davies/fix_cltrc.

ef3f047c

[SPARK-12034][SPARKR] Eliminate warnings in SparkR test cases. · 39d677c8

Sun Rui authored 9 years ago

This PR:
1. Suppresses all known warnings.
2. Cleans up test cases and fixes some errors in test cases.
3. Fixes errors in HiveContext related test cases. These test cases were not actually run previously due to a bug in creating TestHiveContext.
4. Supports 'testthat' package version 0.11.0, which prefers that test cases be under 'tests/testthat'.
5. Makes sure the default Hadoop file system is local when running test cases.
6. Turns warnings into errors.

Author: Sun Rui <rui.sun@intel.com>

Closes #10030 from sun-rui/SPARK-12034.

39d677c8

[SPARK-12032] [SQL] Re-order inner joins to do join with conditions first · 9cde7d5f

Davies Liu authored 9 years ago

Currently, the order of joins is exactly the same as in the SQL query. Some conditions may not be pushed down to the correct join, in which case those joins become cross products and are extremely slow.

This patch tries to re-order the inner joins (which are common in SQL queries): it picks the joins that have self-contained conditions first and delays those that do not have conditions.

After this patch, the TPCDS queries Q64/65 can run hundreds of times faster.
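The re-ordering idea can be sketched as a greedy loop in Python. This is a simplified model under stated assumptions, not the optimizer rule from the patch: relations are plain names, each condition is modelled as the set of relation names it references, and `reorder_inner_joins` is a hypothetical function name.

```python
def reorder_inner_joins(relations, conditions):
    """Greedily order relations so each join has a usable condition if possible.

    relations:  relation names in the order written in the query
    conditions: sets of relation names, one set per join predicate
    """
    if not relations:
        return []

    def connects(rel, joined):
        # A condition is usable if it mentions `rel`, mentions something
        # already joined, and mentions nothing outside joined + rel.
        return any(
            rel in cond and bool(cond & joined) and cond <= joined | {rel}
            for cond in conditions
        )

    remaining = list(relations)
    ordered = [remaining.pop(0)]
    while remaining:
        joined = set(ordered)
        # Prefer a relation with a usable condition; otherwise fall back to
        # the next relation in query order, which is a cross product.
        pick = next((r for r in remaining if connects(r, joined)), remaining[0])
        remaining.remove(pick)
        ordered.append(pick)
    return ordered


# Query order a, b, c with predicates on (a, c) and (b, c): joining b right
# after a would be a cross product, so c is moved ahead of b.
plan = reorder_inner_joins(["a", "b", "c"], [{"a", "c"}, {"b", "c"}])
```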

cc marmbrus nongli

Author: Davies Liu <davies@databricks.com>

Closes #10073 from davies/reorder_joins.

9cde7d5f

[SPARK-12106][STREAMING][FLAKY-TEST] BatchedWAL test transiently flaky when Jenkins load is high · 6fd9e70e

Burak Yavuz authored 9 years ago

We need to make sure that the last entry is indeed the last entry in the queue.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #10110 from brkyvz/batch-wal-test-fix.

6fd9e70e

Dec 06, 2015

[SPARK-12152][PROJECT-INFRA] Speed up Scalastyle checks by only invoking SBT once · 80a824d3

Josh Rosen authored 9 years ago

Currently, `dev/scalastyle` invokes SBT four times, but these invocations can be replaced with a single invocation, saving about one minute of build time.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10151 from JoshRosen/speed-up-scalastyle.

80a824d3

[SPARK-12138][SQL] Escape \u in the generated comments of codegen · 49efd03b

gatorsmile authored 9 years ago

When \u appears in a comment block (i.e. in /**/), code gen will break. So, in Expression and CodegenFallback, we escape \u to \\u.

yhuai Please review it. I did reproduce it and it works after the fix. Thanks!
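The breakage happens because Java translates Unicode escapes before any other lexing, so a `\u` sequence is processed even inside a comment. The Python sketch below shows the escaping step on a plain string. The function name `escape_unicode_marker` is hypothetical, and the sketch only models the string replacement, not the codegen classes.

```python
def escape_unicode_marker(comment):
    """Replace each backslash-u with double-backslash-u.

    After the replacement, the generated Java comment no longer contains
    a sequence the compiler would treat as the start of a Unicode escape.
    """
    return comment.replace("\\u", "\\\\u")


# Hypothetical expression text that would end up in a generated comment.
escaped = escape_unicode_marker("pattern \\u0041 in expression")
```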

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10155 from gatorsmile/escapeU.

49efd03b

[SPARK-12048][SQL] Prevent to close JDBC resources twice · 04b67999
gcc authored 9 years ago

Author: gcc <spark-src@condor.rhaag.ip>

Closes #10101 from rh99/master.

04b67999