  1. Dec 02, 2015
  2. Dec 01, 2015
    • Liang-Chi Hsieh's avatar
      [SPARK-11949][SQL] Check bitmasks to set nullable property · 0f37d1d7
      Liang-Chi Hsieh authored
      Following up #10038.
      
      We can use bitmasks to determine which grouping expressions need to be set as nullable.
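      The idea can be sketched as follows (in Python, with hypothetical names — not Spark's actual representation): if bit i of a grouping-set bitmask means expression i is nulled out in that grouping set, then any expression nulled by at least one bitmask must be marked nullable in the output schema.

```python
def nullable_flags(bitmasks, num_exprs):
    # Mark grouping expression i nullable if any grouping-set
    # bitmask nulls it out (i.e. has bit i set).
    flags = [False] * num_exprs
    for mask in bitmasks:
        for i in range(num_exprs):
            if mask & (1 << i):
                flags[i] = True
    return flags

# GROUPING SETS ((a, b), (a)):   masks 0b00 and 0b01 -> only `a`'s
# partner column stays non-null; expr 0 becomes nullable.
```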
      
      cc yhuai
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #10067 from viirya/fix-cube-following.
      0f37d1d7
    • Tathagata Das's avatar
      [SPARK-12087][STREAMING] Create new JobConf for every batch in saveAsHadoopFiles · 8a75a304
      Tathagata Das authored
      The JobConf object created in `DStream.saveAsHadoopFiles` is used concurrently in multiple places:
      * The JobConf is updated by `RDD.saveAsHadoopFile()` before the job is launched
      * The JobConf is serialized as part of the DStream checkpoints.
      These concurrent accesses (one thread updating it while another thread serializes it) can lead to a `ConcurrentModificationException` in the underlying Java `HashMap` used by the internal Hadoop `Configuration` object.
      
      The solution is to create a new JobConf for every batch, which is then updated by `RDD.saveAsHadoopFile()`, while checkpointing serializes the original JobConf.
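      The pattern can be sketched in Python with a plain dict standing in for the JobConf (key names are illustrative, not real Hadoop keys): each batch mutates its own copy, so serializing the original for the checkpoint never races with a job's updates.

```python
import copy

BASE_CONF = {"fs.defaultFS": "hdfs://nn:8020"}  # hypothetical shared config

def run_batch(base_conf, batch_time):
    # Fresh copy per batch: the save path mutates the copy, while
    # checkpointing keeps serializing the untouched original.
    job_conf = copy.deepcopy(base_conf)
    job_conf["mapreduce.output.path"] = f"/out/{batch_time}"  # hypothetical key
    return job_conf

conf1 = run_batch(BASE_CONF, 1)
conf2 = run_batch(BASE_CONF, 2)
```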
      
      Tests to be added in #9988 will fail reliably without this patch. Keeping this patch really small to make sure that it can be added to previous branches.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #10088 from tdas/SPARK-12087.
      8a75a304
    • Davies Liu's avatar
      [SPARK-12077][SQL] change the default plan for single distinct · 96691fea
      Davies Liu authored
      Try to match the behavior of single distinct aggregation in Spark 1.5. That approach is not scalable, though; we should be robust by default and provide a flag to address the performance regression for low-cardinality aggregation.
      
      cc yhuai nongli
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10075 from davies/agg_15.
      96691fea
    • Andrew Or's avatar
      [SPARK-12081] Make unified memory manager work with small heaps · d96f8c99
      Andrew Or authored
      The existing `spark.memory.fraction` (default 0.75) gives the system 25% of the space to work with. For small heaps this is not enough: e.g. a default 1GB heap leaves only 250MB of system memory. This is especially a problem in local mode, where the driver and executor are crammed into the same JVM. Members of the community have reported driver OOMs in such cases.
      
      **New proposal.** We now reserve 300MB before taking the 75%. For 1GB JVMs, this leaves `(1024 - 300) * 0.75 = 543MB` for execution and storage. This is proposal (1) listed in the [JIRA](https://issues.apache.org/jira/browse/SPARK-12081).
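      The arithmetic can be checked with a small sketch (constant and function names are illustrative, not Spark's internals):

```python
RESERVED_SYSTEM_MEMORY_MB = 300   # reserved before the fraction is applied
MEMORY_FRACTION = 0.75            # default spark.memory.fraction

def usable_memory_mb(heap_mb):
    # Reserve 300MB first, then take the configured fraction of the rest
    # for execution and storage.
    return (heap_mb - RESERVED_SYSTEM_MEMORY_MB) * MEMORY_FRACTION

# 1GB heap: (1024 - 300) * 0.75 = 543.0 MB
```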
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #10081 from andrewor14/unified-memory-small-heaps.
      d96f8c99
    • Andrew Or's avatar
      [SPARK-8414] Ensure context cleaner periodic cleanups · 1ce4adf5
      Andrew Or authored
      Garbage collection triggers cleanups. If the driver JVM is huge and there is little memory pressure, we may never clean up shuffle files on executors. This is a problem for long-running applications (e.g. streaming).
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #10070 from andrewor14/periodic-gc.
      1ce4adf5
    • Yin Huai's avatar
      [SPARK-11596][SQL] In TreeNode's argString, if a TreeNode is not a child of... · e96a70d5
      Yin Huai authored
      [SPARK-11596][SQL] In TreeNode's argString, if a TreeNode is not a child of the current TreeNode, we should only return the simpleString.
      
      In TreeNode's argString, if a TreeNode is not a child of the current TreeNode, we will only return the simpleString.
      
      I tested the [following case provided by Cristian](https://issues.apache.org/jira/browse/SPARK-11596?focusedCommentId=15019241&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15019241).
      ```
      val c = (1 to 20).foldLeft[Option[DataFrame]] (None) { (curr, idx) =>
          println(s"PROCESSING >>>>>>>>>>> $idx")
          val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
          val union = curr.map(_.unionAll(df)).getOrElse(df)
          union.cache()
          Some(union)
        }
      
      c.get.explain(true)
      ```
      
      Without the change, `c.get.explain(true)` took 100s. With the change, `c.get.explain(true)` took 26ms.
      
      https://issues.apache.org/jira/browse/SPARK-11596
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #10079 from yhuai/SPARK-11596.
      e96a70d5
    • Yin Huai's avatar
      [SPARK-11352][SQL] Escape */ in the generated comments. · 5872a9d8
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-11352
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #10072 from yhuai/SPARK-11352.
      5872a9d8
    • Huaxin Gao's avatar
      [SPARK-11788][SQL] surround timestamp/date value with quotes in JDBC data source · 5a8b5fdd
      Huaxin Gao authored
      When querying a Timestamp or Date column like the following
      val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" < end)
      the generated SQL query is `TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0`.
      It should have quotes around the Timestamp/Date value, such as `TIMESTAMP_COLUMN >= '2015-01-01 00:00:00.0'`.
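      A minimal sketch of the fix in Python (hypothetical helper names — not Spark's JDBC code): when compiling a pushed-down filter to SQL, temporal and string literals get quoted while numeric literals stay bare.

```python
from datetime import datetime, date

def compile_value(value):
    # Quote timestamp/date (and string) literals; leave numbers bare.
    if isinstance(value, (datetime, date, str)):
        return f"'{value}'"
    return str(value)

def compile_filter(column, op, value):
    # Render one pushed-down comparison as a SQL fragment.
    return f"{column} {op} {compile_value(value)}"
```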
      
      Author: Huaxin Gao <huaxing@oc0558782468.ibm.com>
      
      Closes #9872 from huaxingao/spark-11788.
      5a8b5fdd
    • Nong Li's avatar
      [SPARK-11328][SQL] Improve error message when hitting this issue · 47a0abc3
      Nong Li authored
      The issue is that the output committer is not idempotent, so retry attempts
      fail because the output file already exists. It is not safe to clean up the file,
      as this output committer is by design not retryable. Currently the job fails
      with a confusing file-already-exists error. This patch is a stopgap to tell the user
      to look at the top of the error log for the proper message.
      
      This is difficult to test locally as Spark is hardcoded not to retry. Manually
      verified by upping the retry attempts.
      
      Author: Nong Li <nong@databricks.com>
      Author: Nong Li <nongli@gmail.com>
      
      Closes #10080 from nongli/spark-11328.
      47a0abc3
    • Josh Rosen's avatar
      [SPARK-12075][SQL] Speed up HiveComparisonTest by avoiding / speeding up TestHive.reset() · ef6790fd
      Josh Rosen authored
      When profiling `HiveCompatibilitySuite`, I noticed that most of the time seems to be spent in expensive `TestHive.reset()` calls. This patch speeds up suites based on `HiveComparisonTest`, such as `HiveCompatibilitySuite`, with the following changes:
      
      - Avoid `TestHive.reset()` whenever possible:
        - Use a simple set of heuristics to guess whether we need to call `reset()` in between tests.
        - As a safety-net, automatically re-run failed tests by calling `reset()` before the re-attempt.
      - Speed up the expensive parts of `TestHive.reset()`: loading the `src` and `srcpart` tables took roughly 600ms per test, so we now avoid this with a simple heuristic that only loads those tables for tests that reference them. This is based on simple string matching over the test queries, which errs on the side of loading tables in more situations than strictly necessary.
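      The string-matching heuristic could look something like this sketch (a deliberately crude substring check, per the description — names are illustrative):

```python
def tables_to_load(query, known_tables=("src", "srcpart")):
    # Err on the side of loading too much: a plain substring match over
    # the query text, so false positives only cost extra table loads.
    q = query.lower()
    return [t for t in known_tables if t in q]
```

      Note that a query mentioning `srcpart` also matches the substring `src`, which is exactly the "load more than strictly necessary" trade-off the commit describes.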
      
      After these changes, HiveCompatibilitySuite seems to run in about 10 minutes.
      
      This PR is a revival of #6663, an earlier experimental PR from June, where I played around with several possible speedups for this suite.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10055 from JoshRosen/speculative-testhive-reset.
      ef6790fd
    • jerryshao's avatar
      [SPARK-12002][STREAMING][PYSPARK] Fix python direct stream checkpoint recovery issue · f292018f
      jerryshao authored
      Fixed a minor race condition in #10017
      
      Closes #10017
      
      Author: jerryshao <sshao@hortonworks.com>
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10074 from zsxwing/review-pr10017.
      f292018f
    • Xusen Yin's avatar
      [SPARK-11961][DOC] Add docs of ChiSqSelector · e76431f8
      Xusen Yin authored
      https://issues.apache.org/jira/browse/SPARK-11961
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #9965 from yinxusen/SPARK-11961.
      e76431f8
    • Tathagata Das's avatar
      [SPARK-12004] Preserve the RDD partitioner through RDD checkpointing · 60b541ee
      Tathagata Das authored
      The solution is to save the RDD partitioner in a separate file in the RDD checkpoint directory, that is, `<checkpoint dir>/_partitioner`. In most cases, whether or not the RDD partitioner is recovered does not affect correctness; it only affects performance. So this solution makes a best-effort attempt to save and recover the partitioner. If either step fails, the checkpointing is not affected. This makes the patch safe and backward compatible.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #9983 from tdas/SPARK-12004.
      60b541ee
    • Nong Li's avatar
      [SPARK-12030] Fix Platform.copyMemory to handle overlapping regions. · 2cef1cdf
      Nong Li authored
      This bug was exposed as memory corruption in Timsort, which uses copyMemory to copy
      large regions that can overlap. The prior implementation always copied forward, which
      mishandles roughly half of the overlapping cases and leaves the data corrupt.
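      The fix is the classic memmove rule, sketched here in Python over a bytearray: when the destination starts after the source, copy backward so the source bytes are read before they are overwritten.

```python
def copy_memory(buf, src, dst, length):
    # Choose the copy direction so overlapping regions are not corrupted:
    # dst > src means a forward copy would clobber unread source bytes.
    if dst > src:
        for i in range(length - 1, -1, -1):
            buf[dst + i] = buf[src + i]
    else:
        for i in range(length):
            buf[dst + i] = buf[src + i]

buf = bytearray(b"abcdef")
copy_memory(buf, 0, 2, 4)   # overlapping shift right: backward copy needed
```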
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #10068 from nongli/spark-12030.
      2cef1cdf
    • Josh Rosen's avatar
      [SPARK-12065] Upgrade Tachyon from 0.8.1 to 0.8.2 · 34e7093c
      Josh Rosen authored
      This commit upgrades the Tachyon dependency from 0.8.1 to 0.8.2.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10054 from JoshRosen/upgrade-to-tachyon-0.8.2.
      34e7093c
    • woj-i's avatar
      [SPARK-11821] Propagate Kerberos keytab for all environments · 6a8cf80c
      woj-i authored
      andrewor14: this is the same PR as in branch 1.5.
      cc harishreedharan
      
      Author: woj-i <wojciechindyk@gmail.com>
      
      Closes #9859 from woj-i/master.
      6a8cf80c
    • gatorsmile's avatar
      [SPARK-11905][SQL] Support Persist/Cache and Unpersist in Dataset APIs · 0a7bca2d
      gatorsmile authored
      Persist and unpersist exist in both the RDD and DataFrame APIs. I think they are still very critical in the Dataset APIs. Not sure if my understanding is correct? If so, could you help me check if the implementation is acceptable?
      
      Please provide your opinions. marmbrus rxin cloud-fan
      
      Thank you very much!
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #9889 from gatorsmile/persistDS.
      0a7bca2d
    • Wenchen Fan's avatar
      [SPARK-11954][SQL] Encoder for JavaBeans · fd95eeaf
      Wenchen Fan authored
      Create Java versions of `constructorFor` and `extractorFor` in `JavaTypeInference`.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Michael Armbrust <michael@databricks.com>
      
      Closes #9937 from cloud-fan/pojo.
      fd95eeaf
    • Wenchen Fan's avatar
      [SPARK-11856][SQL] add type cast if the real type is different but compatible with encoder schema · 9df24624
      Wenchen Fan authored
      When we build the `fromRowExpression` for an encoder, we set up a lot of "unresolved" stuff and lose the required data type, which may lead to a runtime error if the real type doesn't match the encoder's schema.
      For example, if we build an encoder for `case class Data(a: Int, b: String)` and the real type is `[a: int, b: long]`, we hit a runtime error saying that we can't construct class `Data` with an int and a long, because we lost the information that `b` should be a string.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9840 from cloud-fan/err-msg.
      9df24624
    • Wenchen Fan's avatar
      [SPARK-12068][SQL] use a single column in Dataset.groupBy and count will fail · 8ddc55f1
      Wenchen Fan authored
      The reason is that, for a single-column `RowEncoder` (or a single-field product encoder), when we use it as the encoder for the grouping key, we should also combine the grouping attributes, even though there is only one grouping attribute.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10059 from cloud-fan/bug.
      8ddc55f1
    • Cheng Lian's avatar
      [SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues · 69dbe6b4
      Cheng Lian authored
      This PR backports PR #10039 to master
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10063 from liancheng/spark-12046.doc-fix.master.
      69dbe6b4
    • Shixiong Zhu's avatar
      [SPARK-12060][CORE] Avoid memory copy in JavaSerializerInstance.serialize · 14011665
      Shixiong Zhu authored
      `JavaSerializerInstance.serialize` uses `ByteArrayOutputStream.toByteArray` to get the serialized data. `ByteArrayOutputStream.toByteArray` needs to copy the content in the internal array to a new array. However, since the array will be converted to `ByteBuffer` at once, we can avoid the memory copy.
      
      This PR added `ByteBufferOutputStream` to access the protected `buf` and convert it to a `ByteBuffer` directly.
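      The Java change has a close Python analogue, shown here as an illustration of the same copy-avoidance idea: `io.BytesIO.getvalue()` copies the internal buffer into a new bytes object, while `getbuffer()` returns a zero-copy view of it, much like wrapping the protected `buf` in a `ByteBuffer` directly.

```python
import io

stream = io.BytesIO()
stream.write(b"serialized data")

# getvalue() would copy the internal buffer; getbuffer() exposes it as a
# zero-copy memoryview instead.
view = stream.getbuffer()
```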
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10051 from zsxwing/SPARK-12060.
      14011665
    • Liang-Chi Hsieh's avatar
      [SPARK-11949][SQL] Set field nullable property for GroupingSets to get correct... · c87531b7
      Liang-Chi Hsieh authored
      [SPARK-11949][SQL] Set field nullable property for GroupingSets to get correct results for null values
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-11949
      
      The result of a cube plan uses an incorrect schema. The schema of the cube result should set the nullable property to true, because the grouping expressions will produce null values.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #10038 from viirya/fix-cube.
      c87531b7
    • Yuhao Yang's avatar
      [SPARK-11898][MLLIB] Use broadcast for the global tables in Word2Vec · a0af0e35
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-11898
      `syn0Global` and `syn1Global` in word2vec are quite large objects, with size (vocab * vectorSize * 8 bytes), yet they are passed to workers using basic task serialization.
      
      Using broadcast can greatly improve performance. My benchmark shows that, for a 1M vocabulary and the default vectorSize of 100, switching to broadcast can:
      
      1. decrease the worker memory consumption by 45%.
      2. decrease running time by 40%.
      
      This will also help extend the upper limit for Word2Vec.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #9878 from hhbyyh/w2vBC.
      a0af0e35
  3. Nov 30, 2015