- Dec 01, 2015
-
Liang-Chi Hsieh authored
[SPARK-11949][SQL] Set field nullable property for GroupingSets to get correct results for null values JIRA: https://issues.apache.org/jira/browse/SPARK-11949 The cube plan's result uses an incorrect schema: the grouping fields should have their nullable property set to true, because the grouping expressions will produce null values. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #10038 from viirya/fix-cube.
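As a minimal illustration of why nullability matters here (a sketch with made-up column names, assuming a 1.6 spark-shell where `sc` and `sqlContext` are predefined):
```scala
import org.apache.spark.sql.functions._
import sqlContext.implicits._ // assumes a spark-shell SQLContext

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
val cubed = df.cube($"key").agg(sum($"value"))
// cube() emits subtotal rows such as (null, 3) for the grand total,
// so the "key" field in the result schema must be nullable = true.
cubed.printSchema()
```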
-
Yuhao Yang authored
jira: https://issues.apache.org/jira/browse/SPARK-11898 syn0Global and syn1Global in word2vec are quite large objects, with size (vocab * vectorSize * 8), yet they are passed to workers using basic task serialization. Using broadcast can greatly improve performance. My benchmark shows that, for a 1M vocabulary and the default vectorSize of 100, switching to broadcast: 1. decreases worker memory consumption by 45%; 2. decreases running time by 40%. This will also help extend the upper limit for Word2Vec. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9878 from hhbyyh/w2vBC.
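A sketch of the pattern (illustrative names and sizes, not code from the patch):
```scala
// Broadcast a large read-only array once instead of shipping it
// inside every task closure.
val bigModel = new Array[Float](10000 * 100) // ~ vocab * vectorSize
val bcModel = sc.broadcast(bigModel)

val lookups = sc.parallelize(0 until 1000).map { i =>
  bcModel.value(i) // each executor fetches the broadcast blocks once
}
```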
-
- Nov 30, 2015
-
Liang-Chi Hsieh authored
JIRA: https://issues.apache.org/jira/browse/SPARK-12018 The common subexpression elimination code can be refactored and simplified, and some unnecessary variables can be removed. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #10009 from viirya/refactor-subexpr-eliminate.
-
Josh Rosen authored
I accidentally omitted these as part of #10049.
-
Sean Owen authored
Avoid potential deadlock with a user app's shutdown hook thread by more narrowly synchronizing access to 'hooks' Author: Sean Owen <sowen@cloudera.com> Closes #10042 from srowen/SPARK-12049.
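The shape of the fix is roughly the following (an illustrative sketch, not the actual ShutdownHookManager code): hold the lock only while touching the hook list, and run the hooks outside it so a blocking user hook cannot deadlock against registration.
```scala
import scala.collection.mutable

// Illustrative sketch: narrow critical section around 'hooks'.
object Hooks {
  private val hooks = mutable.Buffer[() => Unit]()

  def add(hook: () => Unit): Unit = hooks.synchronized { hooks += hook }

  def runAll(): Unit = {
    val snapshot = hooks.synchronized { hooks.toList } // lock held briefly
    snapshot.foreach(h => h())                         // run without the lock
  }
}
```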
-
Marcelo Vanzin authored
This change seems large, but most of it is just replacing `byte[]` with `ByteBuffer` and `new byte[]` with `ByteBuffer.allocate()`, since it changes the network library's API. The following are parts of the code that actually have meaningful changes: - The Message implementations were changed to inherit from a new AbstractMessage that can optionally hold a reference to a body (in the form of a ManagedBuffer); this is similar to how ResponseWithBody worked before, except now it's not restricted to just responses. - The TransportFrameDecoder was pretty much rewritten to avoid copies as much as possible; it doesn't rely on CompositeByteBuf to accumulate incoming data anymore, since CompositeByteBuf has issues when slices are retained. The code now is able to create frames without having to resort to copying bytes except for a few bytes (containing the frame length) in very rare cases. - Some minor changes in the SASL layer to convert things back to `byte[]` since the JDK SASL API operates on those. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9987 from vanzin/SPARK-12007.
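The mechanical part of the change looks roughly like this (a hypothetical before/after, not lines from the patch):
```scala
import java.nio.ByteBuffer

// Before (hypothetical): raw arrays forced a copy at every boundary.
def encodeOld(payload: Array[Byte]): Array[Byte] = payload.clone()

// After (hypothetical): ByteBuffers can wrap or slice existing memory,
// tracking position/limit instead of copying.
def encodeNew(payload: ByteBuffer): ByteBuffer = {
  val buf = ByteBuffer.allocate(payload.remaining())
  buf.put(payload.duplicate()) // duplicate() shares content, not position
  buf.flip()
  buf
}
```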
-
CodingCat authored
https://issues.apache.org/jira/browse/SPARK-12037 A simple fix: change the order of the statements. Author: CodingCat <zhunansjtu@gmail.com> Closes #10032 from CodingCat/SPARK-12037.
-
Xusen Yin authored
https://issues.apache.org/jira/browse/SPARK-12035 When debugging lots of example code files, as in https://github.com/apache/spark/pull/10002, it's hard to know which file causes errors because `include_example.rb` provides limited information. With the filenames printed, we can locate bugs easily. Author: Xusen Yin <yinxusen@gmail.com> Closes #10026 from yinxusen/SPARK-12035.
-
Josh Rosen authored
This pull request fixes multiple issues with API doc generation. - Modify the Jekyll plugin so that the entire doc build fails if API docs cannot be generated. This will make it easy to detect when the doc build breaks, since this will now trigger Jenkins failures. - Change how we handle the `-target` compiler option flag in order to fix `javadoc` generation. - Incorporate doc changes from thunterdb (in #10048). Closes #10048. Author: Josh Rosen <joshrosen@databricks.com> Author: Timothy Hunter <timhunter@databricks.com> Closes #10049 from JoshRosen/fix-doc-build.
-
Shixiong Zhu authored
KinesisStreamTests in test.py is broken because of #9403. See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46896/testReport/(root)/KinesisStreamTests/test_kinesis_stream/ Because Python streaming tests were not working when https://github.com/apache/spark/pull/9403 was merged, the PR build did not actually report the Python test failure. This PR just disables the test to unblock #10039. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10047 from zsxwing/disable-python-kinesis-test.
-
Davies Liu authored
-
Feynman Liang authored
CC jkbradley mengxr josepablocam Author: Feynman Liang <feynman.liang@gmail.com> Closes #10005 from feynmanliang/streaming-test-user-guide.
-
Yanbo Liang authored
Remove duplicated MLlib examples (DT/RF/GBT in Java/Python). Since we have tutorial code for DT/RF/GBT classification/regression in Scala/Java/Python as well as example applications for DT/RF/GBT in Scala, these examples are duplicated and can be removed. mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #9954 from yanboliang/SPARK-11975.
-
Yuhao Yang authored
jira: https://issues.apache.org/jira/browse/SPARK-11689 Add a simple user guide for LDA under spark.ml and example code under examples/. Use include_example to include the example code in the user guide markdown. Check SPARK-11606 for instructions. The original PR was reverted due to a document build error: https://github.com/apache/spark/pull/9722 mengxr feynmanliang yinxusen Sorry for the trouble. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9974 from hhbyyh/ldaMLExample.
-
Teng Qiu authored
```EventLoggingListener.getLogPath``` takes 4 input arguments: https://github.com/apache/spark/blob/v1.6.0-preview2/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L276-L280 The 3rd parameter should be appAttemptId, and the 4th parameter is the compression codec. Author: Teng Qiu <teng.qiu@gmail.com> Closes #10044 from chutium/SPARK-12053.
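A hedged sketch of a correct call (the method is private[spark]; the signature shown is my reading of the 1.6 source and may not be exact):
```scala
import java.net.URI
import org.apache.spark.scheduler.EventLoggingListener

// Approximate 1.6 signature:
//   getLogPath(logBaseDir: URI, appId: String,
//              appAttemptId: Option[String], compressionCodecName: Option[String])
val path = EventLoggingListener.getLogPath(
  new URI("hdfs:///spark-logs"), // hypothetical log directory
  "app-20151130000000-0000",     // hypothetical application id
  None,                          // 3rd argument: the attempt id
  None)                          // 4th argument: the compression codec
```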
-
Josh Rosen authored
This reverts commit cc243a07 / PR #9297 I'm reverting this because it broke SQLListenerMemoryLeakSuite in the master Maven builds. See #9991 for a discussion of why this broke the tests.
-
BenFradet authored
The list in ml-ensembles.md wasn't properly formatted and, as a result, rendered incorrectly (before/after screenshots are in the PR; images omitted here). This PR fixes the formatting. Author: BenFradet <benjamin.fradet@gmail.com> Closes #10025 from BenFradet/ml-ensembles-doc.
-
Davies Liu authored
This PR improves the performance of CartesianProduct by caching the result of the right plan. After this patch, the query time of TPC-DS Q65 goes down from 28 minutes to 4 seconds (420X faster). cc nongli Author: Davies Liu <davies@databricks.com> Closes #9969 from davies/improve_cartesian.
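Conceptually it is the classic optimization below (a generic sketch, not the actual physical-plan code): materialize the right side once per partition rather than recomputing it for every left row.
```scala
// Generic sketch: cache the right-hand rows so the right side is
// evaluated once instead of once per left row.
def cartesian[A, B](left: Iterator[A], right: => Iterator[B]): Iterator[(A, B)] = {
  val cachedRight = right.toIndexedSeq // evaluated a single time
  left.flatMap(a => cachedRight.iterator.map(b => (a, b)))
}
```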
-
Davies Liu authored
In 1.6, we introduced a public API to get a SQLContext for the current thread; SparkPlan should use it. Author: Davies Liu <davies@databricks.com> Closes #9990 from davies/leak_context.
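The API in question is presumably `SQLContext.getOrCreate`; a minimal usage sketch:
```scala
import org.apache.spark.sql.SQLContext

// Returns the SQLContext active for the current thread (or creates one),
// so SparkPlan need not capture a possibly-stale context at plan time.
val ctx = SQLContext.getOrCreate(sc) // assumes an existing SparkContext sc
```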
-
CK50 authored
[SPARK-11989][SQL] Only use commit in JDBC data source if the underlying database supports transactions Fixes [SPARK-11989](https://issues.apache.org/jira/browse/SPARK-11989) Author: CK50 <christian.kurz@oracle.com> Author: Christian Kurz <christian.kurz@oracle.com> Closes #9973 from CK50/branch-1.6_non-transactional. (cherry picked from commit a589736a) Signed-off-by: Reynold Xin <rxin@databricks.com>
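The gist of the check in plain JDBC terms (a hedged sketch, not the actual data source code):
```scala
import java.sql.Connection

// Sketch: only manage transactions when the database reports support
// for them; otherwise stay in auto-commit mode.
def savePartition(conn: Connection): Unit = {
  val supportsTransactions = conn.getMetaData.supportsTransactions()
  if (supportsTransactions) conn.setAutoCommit(false)
  try {
    // ... execute the INSERT batch here ...
    if (supportsTransactions) conn.commit()
  } finally {
    conn.close()
  }
}
```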
-
Prashant Sharma authored
This is a trivial fix, discussed [here](http://stackoverflow.com/questions/28500401/maven-assembly-plugin-warning-the-assembly-descriptor-contains-a-filesystem-roo/). Author: Prashant Sharma <scrapcodes@gmail.com> Closes #10014 from ScrapCodes/assembly-warning.
-
Wieland Hoffmann authored
Top is implemented in terms of takeOrdered, which already maintains the order, so top should, too. Author: Wieland Hoffmann <themineo@gmail.com> Closes #10013 from mineo/top-order.
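A quick illustration of the guarantee (assuming a spark-shell with `sc` defined):
```scala
val rdd = sc.parallelize(Seq(3, 7, 1, 9, 5))
rdd.top(3)         // Array(9, 7, 5) -- descending, as the docs now promise
rdd.takeOrdered(3) // Array(1, 3, 5) -- ascending
```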
-
Prashant Sharma authored
[MINOR][BUILD] Changed the comment to reflect the plugin project is there to support SBT pom reader only. Author: Prashant Sharma <scrapcodes@gmail.com> Closes #10012 from ScrapCodes/minor-build-comment.
-
toddwan authored
[SPARK-11859][MESOS] SparkContext accepts invalid Master URLs in the form zk://host:port for a multi-master Mesos cluster using ZooKeeper * According to the doc below and the validation logic in [SparkSubmit.scala](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L231), the master URL for a Mesos cluster should always start with `mesos://`: http://spark.apache.org/docs/latest/running-on-mesos.html `The Master URLs for Mesos are in the form mesos://host:5050 for a single-master Mesos cluster, or mesos://zk://host:2181 for a multi-master Mesos cluster using ZooKeeper.` * However, [SparkContext.scala](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2749) fails to validate this and accepts master URLs in the form `zk://host:port`. * For master URLs in the form `zk://host:port`, the valid form should be `mesos://zk://host:port`. * This PR restricts the validation in `SparkContext.scala` so that only Mesos master URLs prefixed with `mesos://` are accepted (a sketch of the rule follows below). * This PR also updates the corresponding unit test. Author: toddwan <tawan0109@outlook.com> Closes #9886 from toddwan/S11859.
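The tightened check amounts to something like this (an illustrative sketch of the rule, not the actual SparkContext code):
```scala
// Illustrative: mesos:// URLs (including mesos://zk://...) are accepted,
// while a bare zk://host:port is rejected with a hint.
def checkMesosMaster(master: String): Unit = master match {
  case m if m.startsWith("mesos://") => () // covers mesos://zk://host:2181
  case m if m.startsWith("zk://") =>
    throw new IllegalArgumentException(s"Use mesos://$m instead of $m")
  case m =>
    throw new IllegalArgumentException(s"Not a Mesos master URL: $m")
}
```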
-
- Nov 29, 2015
-
Yin Huai authored
[SPARK-12039] [SQL] Ignore HiveSparkSubmitSuite's "SPARK-9757 Persist Parquet relation with decimal column". https://issues.apache.org/jira/browse/SPARK-12039 Since it is pretty flaky in hadoop 1 tests, we can disable it while we are investigating the cause. Author: Yin Huai <yhuai@databricks.com> Closes #10035 from yhuai/SPARK-12039-ignore.
-
Herman van Hovell authored
In https://github.com/apache/spark/pull/9409 we enabled multi-column counting. The approach taken in that PR introduces a bit of overhead by first creating a row only to check if all of the columns are non-null. This PR fixes that technical debt. Count now takes multiple columns as its input. In order to make this work I have also added support for multiple columns in the single distinct code path. cc yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10015 from hvanhovell/SPARK-12024.
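For reference, the expressions involved look like this (an illustrative example, assuming a 1.6 spark-shell):
```scala
import org.apache.spark.sql.functions._
import sqlContext.implicits._ // assumes a spark-shell SQLContext

val df = Seq((1, "x"), (1, "x"), (2, "y")).toDF("a", "b")
// COUNT(a) = 3; COUNT(DISTINCT a, b) = 2 -- no intermediate row is
// built just to test the columns for null.
df.agg(count($"a"), countDistinct($"a", $"b")).show()
```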
-
Sun Rui authored
Author: Sun Rui <rui.sun@intel.com> Closes #9769 from sun-rui/SPARK-11781.
-
- Nov 28, 2015
-
felixcheung authored
Add support for colnames, colnames<-, and coltypes<-. Also added tests for names and names<-, which had no tests previously. I merged with PR 8984 (coltypes). Clicked the wrong thing and screwed up the PR; recreated it here. Was #9218. shivaram sun-rui Author: felixcheung <felixcheung_m@hotmail.com> Closes #9654 from felixcheung/colnamescoltypes.
-
felixcheung authored
[SPARK-12029][SPARKR] Improve column functions signature, param check, tests, fix doc and add examples shivaram sun-rui Author: felixcheung <felixcheung_m@hotmail.com> Closes #10019 from felixcheung/rfunctionsdoc.
-
gatorsmile authored
When calling `get_json_object` for the following two cases, both results are `"null"`: ```scala val tuple: Seq[(String, String)] = ("5", """{"f1": null}""") :: Nil val df: DataFrame = tuple.toDF("key", "jstring") val res = df.select(functions.get_json_object($"jstring", "$.f1")).collect() ``` ```scala val tuple2: Seq[(String, String)] = ("5", """{"f1": "null"}""") :: Nil val df2: DataFrame = tuple2.toDF("key", "jstring") val res3 = df2.select(functions.get_json_object($"jstring", "$.f1")).collect() ``` Fixed the problem and also added a test case. Author: gatorsmile <gatorsmile@gmail.com> Closes #10018 from gatorsmile/get_json_object.
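Presumably the behavior after the fix is that the first case (a JSON null) returns a SQL NULL while the second (the four-character string "null") returns the string `"null"`, making the two cases distinguishable.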
-
- Nov 27, 2015
-
Yin Huai authored
https://issues.apache.org/jira/browse/SPARK-12020 Author: Yin Huai <yhuai@databricks.com> Closes #10010 from yhuai/SPARK-12020.
-
Shixiong Zhu authored
In StreamingListenerSuite."don't call ssc.stop in listener", after the main thread calls `ssc.stop()`, `StreamingContextStoppingCollector` may call `ssc.stop()` in the listener bus thread, which causes a deadlock. This PR updates `StreamingContextStoppingCollector` to only call `ssc.stop()` in the first batch to avoid the deadlock. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10011 from zsxwing/fix-test-deadlock.
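A sketch of the guard (illustrative only, not the suite's exact code):
```scala
import java.util.concurrent.atomic.AtomicBoolean

// Illustrative: let the listener attempt ssc.stop() at most once,
// on the first completed batch.
val calledStop = new AtomicBoolean(false)

def onBatchCompleted(): Unit = {
  if (calledStop.compareAndSet(false, true)) {
    // ssc.stop() would be invoked here exactly once
  }
}
```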
-
Yanbo Liang authored
Change ```cumeDist -> cume_dist, denseRank -> dense_rank, percentRank -> percent_rank, rowNumber -> row_number``` on the SparkR side. There are two reasons we should make this change: * We should follow the [naming convention rule of R](http://www.inside-r.org/node/230645) * Spark DataFrame has deprecated the old convention (such as ```cumeDist```) and will remove it in Spark 2.0. It's better to fix this issue before the 1.6 release; otherwise we will be making a breaking API change later. cc shivaram sun-rui Author: Yanbo Liang <ybliang8@gmail.com> Closes #10016 from yanboliang/SPARK-12025.
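For comparison, the Scala API already exposes the snake_case names; a small window-function example (assuming a 1.6 spark-shell):
```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import sqlContext.implicits._ // assumes a spark-shell SQLContext

val df = Seq(("alice", "eng", 100), ("bob", "eng", 90)).toDF("name", "dept", "salary")
val w = Window.partitionBy($"dept").orderBy($"salary")
df.select($"name", cume_dist().over(w), dense_rank().over(w), row_number().over(w)).show()
```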
-
- Nov 26, 2015
-
Dilip Biswal authored
Check for partition column null-ability while building the partition spec. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #10001 from dilipbiswal/spark-11997.
-
Reynold Xin authored
-
Jeremy Derr authored
If `--private-ips` is required but not provided, spark_ec2.py may behave inappropriately, including attempting to ssh to localhost while trying to verify ssh connectivity to the cluster. This fixes that behavior by raising a `UsageError` exception if `get_dns_name` is unable to determine a hostname. Author: Jeremy Derr <jcderr@radius.com> Closes #9975 from jcderr/SPARK-11991/ec_spark.py_hostname_check.
-
Huaxin Gao authored
Fix regression test for SPARK-11778. marmbrus Could you please take a look? Thank you very much!! Author: Huaxin Gao <huaxing@oc0558782468.ibm.com> Closes #9890 from huaxingao/spark-11778-regression-test.
-
Jeff Zhang authored
Author: Jeff Zhang <zjffdu@apache.org> Closes #9903 from zjffdu/SPARK-11917.
-
mariusvniekerk authored
Reference: https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor In order for PostgreSQL to honor a non-zero fetchSize setting, its Connection.autoCommit needs to be set to false; otherwise it will just quietly ignore the fetchSize setting. This adds a new side-effecting, dialect-specific beforeFetch method that fires before a select query is run. Author: mariusvniekerk <marius.v.niekerk@gmail.com> Closes #9861 from mariusvniekerk/SPARK-11881.
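The PostgreSQL requirement can be seen with plain JDBC (a sketch with a hypothetical connection URL and table, not Spark's dialect code):
```scala
import java.sql.DriverManager

// PostgreSQL only streams results through a cursor when autoCommit is
// off; with autoCommit on, setFetchSize is silently ignored and the
// whole result set is buffered in driver memory.
val conn = DriverManager.getConnection("jdbc:postgresql://localhost/db") // hypothetical URL
conn.setAutoCommit(false)  // what the new beforeFetch hook does for Postgres
val stmt = conn.createStatement()
stmt.setFetchSize(1000)    // now honored: rows arrive 1000 at a time
val rs = stmt.executeQuery("SELECT * FROM big_table") // hypothetical table
```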
-
Yanbo Liang authored
The Spark SQL aggregate functions ```stddev, stddev_pop, stddev_samp, variance, var_pop, var_samp, skewness, kurtosis, collect_list, collect_set``` should support ```columnName``` arguments like the other aggregate functions (max/min/count/sum). Author: Yanbo Liang <ybliang8@gmail.com> Closes #9994 from yanboliang/SPARK-12011.
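So both call styles should work after this change (an illustrative sketch, assuming a 1.6 spark-shell):
```scala
import org.apache.spark.sql.functions._
import sqlContext.implicits._ // assumes a spark-shell SQLContext

val df = Seq((1.0, "a"), (2.0, "a"), (3.0, "b")).toDF("x", "g")
df.agg(stddev($"x"), variance($"x")).show() // Column arguments
df.agg(stddev("x"), variance("x")).show()   // columnName arguments (this change)
```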
-