- Oct 20, 2015
-
Holden Karau authored
Upgrade to Py4j 0.9. Author: Holden Karau <holden@pigscanfly.ca> Author: Holden Karau <holden@us.ibm.com> Closes #8615 from holdenk/SPARK-10447-upgrade-pyspark-to-py4j0.9.
-
Daoyuan Wang authored
PromotePrecision is not necessary after HiveTypeCoercion is done. Jira: https://issues.apache.org/jira/browse/SPARK-10463 Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #8621 from adrian-wang/promoterm.
-
Jakob Odersky authored
`@transient` annotations on class parameters (not case class parameters or vals) cause compilation errors with Scala 2.11. I understand that transient *parameters* make no sense; however, I don't quite understand why the 2.10 compiler accepted them. Note: if it is preferred to keep the annotations (in case someone wants to redefine the parameters as vals in the future), it would also be possible to just add `val` after the annotation, e.g. `class Foo(@transient x: Int)` becomes `class Foo(@transient private val x: Int)`. I chose to remove the annotation as it also reduces needless clutter, but please feel free to tell me if you prefer the second option and I'll update the PR. Author: Jakob Odersky <jodersky@gmail.com> Closes #9126 from jodersky/sbt-scala-2.11.
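A minimal sketch of the two options (class names here are illustrative, not from the PR):
```scala
// class Broken(@transient x: Int)        // rejected by Scala 2.11: without
//                                        // `val` there is no field to annotate
class Fixed(x: Int)                       // option taken in this PR: drop the annotation
class Alt(@transient private val x: Int)  // alternative: promote the parameter to a val
```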
-
Jean-Baptiste Onofré authored
Author: Jean-Baptiste Onofré <jbonofre@apache.org> Closes #9059 from jbonofre/SPARK-10876.
-
- Oct 19, 2015
-
Cheng Lian authored
`DataSourceStrategy.mergeWithPartitionValues` is essentially a projection implemented in a quite inefficient way. This PR optimizes this method with `UnsafeProjection` to avoid unnecessary boxing costs. Author: Cheng Lian <lian@databricks.com> Closes #9104 from liancheng/spark-11088.faster-partition-values-merging.
-
Ryan Williams authored
I also added some information to container-failure error msgs about what host they failed on, which would have helped me identify the problem that led me to this JIRA and PR sooner. Author: Ryan Williams <ryan.blake.williams@gmail.com> Closes #9147 from ryan-williams/dyn-exec-failures.
-
Chris Bannister authored
[SPARK-9708][MESOS] Spark should create local temporary directories in Mesos sandbox when launched with Mesos This is my own original work and I license this to the project under the project's open source license Author: Chris Bannister <chris.bannister@swiftkey.com> Author: Chris Bannister <chris.bannister@swiftkey.net> Closes #8358 from Zariel/mesos-local-dir.
-
Davies Liu authored
Also added SQLContext.newSession() Author: Davies Liu <davies@databricks.com> Closes #9122 from davies/py_create.
-
Liang-Chi Hsieh authored
JIRA: https://issues.apache.org/jira/browse/SPARK-11051 When an `RDD` is materialized and checkpointed, its partitions and dependencies are cleared. If we then allow local checkpointing on it, `LocalRDDCheckpointData` is assigned to its `checkpointData`, and an error is thrown the next time the RDD is materialized. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9072 from viirya/no-localcheckpoint-after-checkpoint.
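A sketch of the sequence this PR rejects (assumes a SparkContext `sc`, e.g. in spark-shell):
```scala
val rdd = sc.parallelize(1 to 100)
rdd.checkpoint()       // request a reliable checkpoint
rdd.count()            // materialize: partitions and dependencies are cleared
rdd.localCheckpoint()  // now fails fast instead of silently installing
                       // LocalRDDCheckpointData and breaking the next job
```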
-
Marcelo Vanzin authored
Because the registration RPC was not really an RPC, but a bunch of disconnected messages, it was possible for other messages to be sent before the reply to the registration arrived, and that would confuse the Worker. Especially in local-cluster mode, the worker was susceptible to receiving an executor request before it received a message from the master saying registration succeeded. On top of the above, the change also fixes a ClassCastException when the registration fails, which also affects the executor registration protocol. Because the `ask` is issued with a specific return type, if the error message (of a different type) was returned instead, the code would just die with an exception. This is fixed by having a common base trait for these reply messages. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9138 from vanzin/SPARK-11131.
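A sketch of the shape of the fix (message names are illustrative, not the actual Spark types): both replies extend a sealed base trait, so the reply can be received and matched as a single type instead of surfacing as a ClassCastException on failure.
```scala
sealed trait RegisterResponse
case class Registered(masterUrl: String) extends RegisterResponse
case class RegisterFailed(reason: String) extends RegisterResponse

def handleReply(reply: RegisterResponse): Unit = reply match {
  case Registered(url)     => println(s"registered with master at $url")
  case RegisterFailed(why) => println(s"registration failed: $why")
}
```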
-
zsxwing authored
[SPARK-11063] [STREAMING] Change preferredLocations of Receiver's RDD to hosts rather than hostports. An RDD's preferredLocations must be hostnames, but the Streaming Receiver's scheduled executors are identified by hostport, so the location preference doesn't take effect. This PR converts `schedulerExecutors` to `hosts` before creating the Receiver's RDD. Author: zsxwing <zsxwing@gmail.com> Closes #9075 from zsxwing/SPARK-11063.
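The conversion, sketched with an illustrative helper (not the actual code):
```scala
// Strip the port so the value is usable as an RDD preferred location.
def hostPortToHost(hostPort: String): String = {
  val colon = hostPort.lastIndexOf(':')
  if (colon < 0) hostPort else hostPort.substring(0, colon)
}

hostPortToHost("worker-1:34567")  // => "worker-1"
```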
-
Rishabh Bhardwaj authored
Added support for boolean types in fill and replace methods Author: Rishabh Bhardwaj <rbnext29@gmail.com> Closes #9166 from rishabhbhardwaj/master.
-
Wenchen Fan authored
The purpose of this PR is to keep the unsafe format details only inside the unsafe classes themselves, so that when we use them (e.g., an unsafe array inside an unsafe map, or unsafe arrays and maps in the columnar cache) we don't need to understand the format first. Change list: * The unsafe array's 4-byte numElements header is now required (it was optional) and becomes part of the unsafe array format. * Following that change, the `sizeInBytes` of an unsafe array now counts the 4-byte header. * The unsafe map's format used to be `[numElements] [key array numBytes] [key array content(without numElements header)] [value array content(without numElements header)]`, which is a little hacky as it makes the unsafe array's header optional. I think saving 4 bytes is not a big deal, so the format is now: `[key array numBytes] [unsafe key array] [unsafe value array]`. * Following that change, the `sizeInBytes` of an unsafe map now counts both the map's header and the arrays' headers. Author: Wenchen Fan <wenchen@databricks.com> Closes #9131 from cloud-fan/unsafe.
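Illustrative size arithmetic for the new layout (field widths here are assumptions for the sketch, not taken from the PR):
```scala
val numElementsHeader = 4  // the now-required unsafe array header

// An unsafe array's sizeInBytes counts its own header.
def arraySizeInBytes(payloadBytes: Int): Int = numElementsHeader + payloadBytes

// New map format: [key array numBytes][unsafe key array][unsafe value array];
// its sizeInBytes counts the map's header plus both arrays' headers.
val keyArrayNumBytesField = 4  // width assumed for illustration
def mapSizeInBytes(keyPayload: Int, valuePayload: Int): Int =
  keyArrayNumBytesField + arraySizeInBytes(keyPayload) + arraySizeInBytes(valuePayload)
```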
-
lewuathe authored
…2 regularization if the number of features is small Author: lewuathe <lewuathe@me.com> Author: Lewuathe <sasaki@treasure-data.com> Author: Kai Sasaki <sasaki@treasure-data.com> Author: Lewuathe <lewuathe@me.com> Closes #8884 from Lewuathe/SPARK-10668.
-
Alex Angelini authored
Includes: https://github.com/irmen/Pyrolite/pull/23 which fixes datetimes with timezones. JoshRosen https://issues.apache.org/jira/browse/SPARK-9643 Author: Alex Angelini <alex.louis.angelini@gmail.com> Closes #7950 from angelini/upgrade_pyrolite_up.
-
Jacek Laskowski authored
…redNodeLocationData Author: Jacek Laskowski <jacek.laskowski@deepsense.io> Closes #8976 from jaceklaskowski/SPARK-10921.
-
zsxwing authored
The unit test added in #9132 is flaky. This is a follow-up PR to add `listenerBus.waitUntilEmpty` to fix it. Author: zsxwing <zsxwing@gmail.com> Closes #9163 from zsxwing/SPARK-11126-follow-up.
-
Brennon York authored
This commit refactors the `run-tests-jenkins` script into Python. This refactoring was done by brennonyork in #7401; this PR contains a few minor edits from joshrosen in order to bring it up to date with other recent changes. From the original PR description (by brennonyork): Currently a few things are left out that could, and I think should, become smaller JIRAs after this. 1. There are still a few areas where we use environment variables where we don't need to (like `CURRENT_BLOCK`). I might get around to fixing this one in lieu of everything else, but wanted to point that out. 2. The PR tests are still written in bash. I opted not to change those and just rewrite the runner into Python. This is a great follow-on JIRA IMO. 3. All of the linting scripts are still in bash as well; those would likewise make good follow-on JIRAs. Closes #7401. Author: Brennon York <brennon.york@capitalone.com> Closes #9161 from JoshRosen/run-tests-jenkins-refactoring.
-
- Oct 18, 2015
-
zsxwing authored
SQLListener adds all stage infos to `_stageIdToStageMetrics`, but only removes stage infos belonging to SQL executions. This PR fixes that by ignoring stages that don't belong to SQL executions. Reported by Terry Hoo in https://www.mail-archive.com/user@spark.apache.org/msg38810.html Author: zsxwing <zsxwing@gmail.com> Closes #9132 from zsxwing/SPARK-11126.
-
Mahmoud Lababidi authored
[SPARK-11158][SQL] Modified _verify_type() to be more informative on errors by presenting the object. The _verify_type() function raised errors when there were type-conversion issues but left out the object in question. The object is now included in the error to spare the user from debugging through to figure out which object failed the type conversion. The use case for me was a Pandas DataFrame that contained 'nan' as values for columns of strings. Author: Mahmoud Lababidi <mahmoud@thehumangeo.com> Author: Mahmoud Lababidi <lababidi@gmail.com> Closes #9149 from lababidi/master.
-
Patrick Wendell authored
This commit exists to close the following pull requests on Github: Closes #8737 (close requested by 'srowen') Closes #5323 (close requested by 'JoshRosen') Closes #6148 (close requested by 'JoshRosen') Closes #7557 (close requested by 'JoshRosen') Closes #7047 (close requested by 'srowen') Closes #8713 (close requested by 'marmbrus') Closes #5834 (close requested by 'srowen') Closes #7467 (close requested by 'tdas') Closes #8943 (close requested by 'xiaowen147') Closes #4434 (close requested by 'JoshRosen') Closes #8949 (close requested by 'srowen') Closes #5368 (close requested by 'JoshRosen') Closes #8186 (close requested by 'marmbrus') Closes #5147 (close requested by 'JoshRosen')
-
Reynold Xin authored
Our merge script now turns
```
[SPARK-1234][SPARK-1235][SPARK-1236][SQL] description
```
into
```
[SPARK-1234] [SPARK-1235] [SPARK-1236] [SQL] description
```
The extra spaces are more annoying in git since the first line of a git commit is supposed to be very short. Doctest passes with the following command:
```
python -m doctest merge_spark_pr.py
```
Author: Reynold Xin <rxin@databricks.com> Closes #9156 from rxin/SPARK-11169.
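The intended normalization, sketched in Scala (the real merge script is Python; the helper name is illustrative):
```scala
// Collapse "] [" runs back to "][" in PR titles.
def normalizeTitle(title: String): String =
  title.replaceAll("\\]\\s+\\[", "][")

normalizeTitle("[SPARK-1234] [SPARK-1235] [SQL] description")
// => "[SPARK-1234][SPARK-1235][SQL] description"
```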
-
Lukasz Piepiora authored
This patch fixes a small typo in the GraphX programming guide Author: Lukasz Piepiora <lpiepiora@gmail.com> Closes #9160 from lpiepiora/11174-fix-typo-in-graphx-programming-guide.
-
tedyu authored
Author: tedyu <yuzhihong@gmail.com> Closes #9157 from tedyu/master.
-
- Oct 17, 2015
-
huangzhaowei authored
[SPARK-11000] [YARN] Load the `metadata.Hive` class only when `hive.metastore.uris` is set, to avoid booting the database twice. Author: huangzhaowei <carlmartinmax@gmail.com> Closes #9026 from SaintBacchus/SPARK-11000.
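A sketch of the guard (illustrative helper, not the actual YARN client code):
```scala
import org.apache.hadoop.conf.Configuration

// Only load `metadata.Hive` when a remote metastore is configured, so a
// local Derby-backed metastore is never booted just to fetch tokens.
def shouldLoadHiveMetadata(conf: Configuration): Boolean =
  Option(conf.get("hive.metastore.uris")).exists(_.trim.nonEmpty)
```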
-
ph authored
Mesos has a feature for linking to frameworks running on top of Mesos from the Mesos WebUI. This commit enables Spark to make use of this feature so one can directly visit the running Spark WebUIs from the Mesos WebUI. Author: ph <ph@plista.com> Closes #9135 from philipphoffmann/SPARK-11129.
-
Koert Kuipers authored
Make sure comma-separated paths get processed correctly in ResolvedDataSource for a HadoopFsRelationProvider. Author: Koert Kuipers <koert@tresata.com> Closes #8416 from koertkuipers/feat-sql-comma-separated-paths.
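A sketch of the path handling (illustrative helper, not the actual ResolvedDataSource code):
```scala
def splitPaths(paths: String): Seq[String] =
  paths.split(",").map(_.trim).filter(_.nonEmpty).toSeq

splitPaths("/data/2015/10/*, /data/2015/11/*")
// => Seq("/data/2015/10/*", "/data/2015/11/*")
```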
-
Reynold Xin authored
Its classdoc actually says: "NOTE: DO NOT USE this class outside of Spark. It is intended as an internal utility." Author: Reynold Xin <rxin@databricks.com> Closes #9155 from rxin/private-logging-trait.
-
Luvsandondov Lkhamsuren authored
predictNodeIndex is moved to LearningNode and renamed predictImpl for consistency with Node.predictImpl Author: Luvsandondov Lkhamsuren <lkhamsurenl@gmail.com> Closes #8609 from lkhamsurenl/SPARK-9963.
-
Yuhao Yang authored
jira: https://issues.apache.org/jira/browse/SPARK-11029 We should add a method analogous to spark.mllib.clustering.KMeansModel.computeCost to spark.ml.clustering.KMeansModel. This will be a temp fix until we have proper evaluators defined for clustering. Author: Yuhao Yang <hhbyyh@gmail.com> Author: yuhaoyang <yuhao@zhanglipings-iMac.local> Closes #9073 from hhbyyh/computeCost.
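A usage sketch of the new method (assumes a DataFrame `dataset` with a "features" vector column):
```scala
import org.apache.spark.ml.clustering.KMeans

val model = new KMeans().setK(3).fit(dataset)
val wssse = model.computeCost(dataset)  // sum of squared distances to nearest center
println(s"Within Set Sum of Squared Errors = $wssse")
```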
-
- Oct 16, 2015
-
zero323 authored
At the moment `SparseVector.__getitem__` executes `np.searchsorted` first and checks whether the result is in the expected range afterwards. It is possible to check whether the index can hold a non-zero value before executing `np.searchsorted` at all. Author: zero323 <matthew.szymkiewicz@gmail.com> Closes #9098 from zero323/sparse_vector_getitem_improved.
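The same shortcut, sketched in Scala (illustrative, not the PySpark code): since the stored indices are sorted, an index outside their range must be a zero entry, so the binary search can be skipped entirely.
```scala
import java.util.Arrays

def sparseGet(indices: Array[Int], values: Array[Double], i: Int): Double = {
  if (indices.isEmpty || i < indices.head || i > indices.last) 0.0
  else {
    val pos = Arrays.binarySearch(indices, i)
    if (pos >= 0) values(pos) else 0.0
  }
}
```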
-
Burak Yavuz authored
This PR aims to decrease communication costs in BlockMatrix multiplication in two ways: - Simulate the multiplication on the driver, and figure out which blocks actually need to be shuffled - Send the block once to a partition, and join inside the partition rather than sending multiple copies to the same partition **NOTE**: One important note is that right now, the old behavior of checking for multiple blocks with the same index is lost. This is not hard to add, but is a little more expensive than how it was. Initial benchmarking showed promising results (look below), however I did hit some `FileNotFound` exceptions with the new implementation after the shuffle. Size A: 1e5 x 1e5 Size B: 1e5 x 1e5 Block Sizes: 1024 x 1024 Sparsity: 0.01 Old implementation: 1m 13s New implementation: 9s cc avulanov: would you be interested in helping me benchmark this? I used your code from the mailing list (which you sent about 3 months ago?), and the old implementation didn't even run, but the new implementation completed in 268s in a 120 GB / 16 core cluster Author: Burak Yavuz <brkyvz@gmail.com> Closes #8757 from brkyvz/opt-bmm.
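The driver-side simulation, sketched (illustrative, not the actual BlockMatrix code): a result block (i, j) needs A-block (i, k) and B-block (k, j) only when both actually exist, so only those pairs are shuffled.
```scala
def blocksToShuffle(
    aBlocks: Set[(Int, Int)],
    bBlocks: Set[(Int, Int)]): Set[((Int, Int), (Int, Int))] =
  for {
    (i, k)  <- aBlocks
    (k2, j) <- bBlocks
    if k == k2  // inner dimensions must line up for the pair to contribute
  } yield ((i, k), (k2, j))
```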
-
Bhargav Mangipudi authored
…rror message For negative indices in the SparseVector, we update the index value. If we have an incorrect index at this point, the error message reports the incorrect *updated* index instead of the original one. This change fixes that. Author: Bhargav Mangipudi <bhargav.mangipudi@gmail.com> Closes #9069 from bhargav/spark-10759.
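A sketch of the fix (illustrative helper): resolve the negative index for the lookup, but report the caller's original index in the error.
```scala
def resolveIndex(original: Int, size: Int): Int = {
  val resolved = if (original < 0) original + size else original
  if (resolved < 0 || resolved >= size) {
    throw new IndexOutOfBoundsException(
      s"index $original out of range [-$size, $size)")
  }
  resolved
}
```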
-
gweidner authored
Switched from deprecated org.apache.hadoop.fs.permission.AccessControlException to org.apache.hadoop.security.AccessControlException. Author: gweidner <gweidner@us.ibm.com> Closes #9144 from gweidner/SPARK-11109.
-
zsxwing authored
The following deadlock may happen if shutdownHook and StreamingContext.stop are running at the same time.
```
Java stack information for the threads listed above:
===================================================
"Thread-2":
	at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:699)
	- waiting to lock <0x00000005405a1680> (a org.apache.spark.streaming.StreamingContext)
	at org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:729)
	at org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:625)
	at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:266)
	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:236)
	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1697)
	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:236)
	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
	at scala.util.Try$.apply(Try.scala:161)
	at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:236)
	- locked <0x00000005405b6a00> (a org.apache.spark.util.SparkShutdownHookManager)
	at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
"main":
	at org.apache.spark.util.SparkShutdownHookManager.remove(ShutdownHookManager.scala:248)
	- waiting to lock <0x00000005405b6a00> (a org.apache.spark.util.SparkShutdownHookManager)
	at org.apache.spark.util.ShutdownHookManager$.removeShutdownHook(ShutdownHookManager.scala:199)
	at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:712)
	- locked <0x00000005405a1680> (a org.apache.spark.streaming.StreamingContext)
	at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:684)
	- locked <0x00000005405a1680> (a org.apache.spark.streaming.StreamingContext)
	at org.apache.spark.streaming.SessionByKeyBenchmark$.main(SessionByKeyBenchmark.scala:108)
	at org.apache.spark.streaming.SessionByKeyBenchmark.main(SessionByKeyBenchmark.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:680)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```
This PR just moves `ShutdownHookManager.removeShutdownHook` out of `synchronized` to avoid the deadlock. Author: zsxwing <zsxwing@gmail.com> Closes #9116 from zsxwing/stop-deadlock.
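The lock-ordering fix, sketched with illustrative names: the cross-object call leaves the context's synchronized block, so the two locks are never held together and the cycle above cannot form.
```scala
object HookManagerLike {
  def remove(hook: AnyRef): Unit = synchronized { /* drop the hook */ }
}

class ContextLike {
  private val hook = new Object
  def stop(): Unit = {
    HookManagerLike.remove(hook)  // before the fix, this ran inside `synchronized`
    synchronized {
      // teardown guarded only by the context's own lock
    }
  }
}
```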
-
zsxwing authored
[SPARK-10974] [STREAMING] Add progress bar for output operation column and use red dots for failed batches. Screenshot: https://cloud.githubusercontent.com/assets/1000778/10342571/385d9340-6d4c-11e5-8e79-1fa4c3c98f81.png Also fixed the description and duration for output operations that don't have Spark jobs: https://cloud.githubusercontent.com/assets/1000778/10342775/4bd52a0e-6d4d-11e5-99bc-26265a9fc792.png Author: zsxwing <zsxwing@gmail.com> Closes #9010 from zsxwing/output-op-progress-bar.
-
Pravin Gadakh authored
Groups are not resolved properly in scaladoc in the following classes: sql/core/src/main/scala/org/apache/spark/sql/Column.scala sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala sql/core/src/main/scala/org/apache/spark/sql/functions.scala Author: Pravin Gadakh <pravingadakh177@gmail.com> Closes #9148 from pravingadakh/master.
-
navis.ryu authored
Some JSON parsers are not closed: the parser in `JacksonParser#parseJson`, for example. Author: navis.ryu <navis@apache.org> Closes #9130 from navis/SPARK-11124.
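The resource-safety pattern behind the fix, sketched with an illustrative helper (not the actual JacksonParser code):
```scala
import com.fasterxml.jackson.core.{JsonFactory, JsonParser}

// Close the parser even when parsing throws.
def withParser[T](factory: JsonFactory, record: String)(f: JsonParser => T): T = {
  val parser = factory.createParser(record)
  try f(parser) finally parser.close()
}
```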
-
Jakob Odersky authored
Shows that an error is actually due to a fatal warning. Author: Jakob Odersky <jodersky@gmail.com> Closes #9128 from jodersky/fatalwarnings.
-
Jakob Odersky authored
Removes any extra strings from the Java version, fixing subsequent integer parsing. This is required since some OpenJDK versions (specifically in Debian testing) append an extra "-internal" string to the version field. Author: Jakob Odersky <jodersky@gmail.com> Closes #9111 from jodersky/fixtestrunner.
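The normalization, sketched (illustrative helper): keep only the leading version characters so the numeric fields parse cleanly afterwards.
```scala
def stripJavaVersionSuffix(version: String): String =
  version.takeWhile(c => c.isDigit || c == '.' || c == '_')

stripJavaVersionSuffix("1.8.0_66-internal")  // => "1.8.0_66"
```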
-