- Oct 30, 2014
-
-
GuoQiang Li authored
- [X] Standalone
- [X] YARN
- [X] Mesos
- [X] Mac OS X
- [X] Linux
- [ ] Windows

This is another implementation of #1031. Author: GuoQiang Li <witgo@qq.com> Closes #2711 from witgo/SPARK-1719 and squashes the following commits: c7b26f6 [GuoQiang Li] review commits 4488e41 [GuoQiang Li] Refactoring CommandUtils a444094 [GuoQiang Li] review commits 40c0b4a [GuoQiang Li] Add buildLocalCommand method c1a0ddd [GuoQiang Li] fix comments 156ce88 [GuoQiang Li] review commit 38aa377 [GuoQiang Li] Refactor CommandUtils.scala 4269e00 [GuoQiang Li] Refactor SparkSubmitDriverBootstrapper.scala 7a1d634 [GuoQiang Li] use LD_LIBRARY_PATH instead of -Djava.library.path
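The core idea — putting native library directories on the dynamic linker's search path instead of passing the JVM flag — can be sketched roughly as below. This is a minimal illustration with a hypothetical helper, not the actual CommandUtils code:

```scala
import java.io.File

// Hypothetical helper: merge native library dirs into the child process
// environment via LD_LIBRARY_PATH rather than -Djava.library.path, so
// forked processes inherit the search path too.
def withLibraryPath(env: Map[String, String], libPaths: Seq[String]): Map[String, String] = {
  val key = "LD_LIBRARY_PATH" // Mac OS X would use DYLD_LIBRARY_PATH instead
  val merged = (libPaths ++ env.get(key)).mkString(File.pathSeparator)
  env + (key -> merged)
}
```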
-
- Oct 29, 2014
-
-
Tathagata Das authored
[SPARK-4053][Streaming] Made the ReceiverSuite test more reliable by fixing block generator throttling. In the unit test that checked whether blocks generated by the throttled block generator had the expected number of records, the thresholds were too tight, which sometimes led to the test failing. This PR fixes it by relaxing the thresholds and the time intervals used for testing. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #2900 from tdas/receiver-suite-flakiness and squashes the following commits: 28508a2 [Tathagata Das] Made the ReceiverSuite test more reliable
-
Andrew Or authored
This is part of a bigger effort to provide elastic scaling of executors within a Spark application ([SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174)). This PR does not provide any functionality by itself; it is a skeleton that is missing a mechanism to be added later in [SPARK-3822](https://issues.apache.org/jira/browse/SPARK-3822). Comments and feedback are most welcome. For those of you reviewing this in detail, I highly recommend doing it through your favorite IDE instead of through the diff here. Author: Andrew Or <andrewor14@gmail.com> Author: Andrew Or <andrew@databricks.com> Closes #2746 from andrewor14/scaling-heuristics and squashes the following commits: 8a4fdaa [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics e045df8 [Andrew Or] Add warning message (minor) dfa31ec [Andrew Or] Fix tests c0becc4 [Andrew Or] Merging with SPARK-3822 4784f93 [Andrew Or] Reword an awkward log message 181f27f [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics c79e907 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics 4672b90 [Andrew Or] It's nano time. a6a30f2 [Andrew Or] Do not allow min/max executors of 0 c60ec33 [Andrew Or] Rewrite test logic with clocks b00b680 [Andrew Or] Fix style c3caa65 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics 7f9da14 [Andrew Or] Factor out logic to verify bounds on # executors (minor) f279019 [Andrew Or] Add time mocking tests for polling loop 685e347 [Andrew Or] Factor out clock in polling loop to facilitate testing 3cea7f7 [Andrew Or] Use PrivateMethodTester to keep original class private 3156d81 [Andrew Or] Update comments and exception messages 92f36f9 [Andrew Or] Address minor review comments abdea61 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics 2aefd09 [Andrew Or] Correct listener behavior 9fe6e44 [Andrew Or] Rename variables and configs + update comments and log messages 149cc32 [Andrew Or] Fix style 254c958 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics 5ff829b [Andrew Or] Add tests for ExecutorAllocationManager 19c6c4b [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics 5896515 [Andrew Or] Move ExecutorAllocationManager out of scheduler package 9ca8945 [Andrew Or] Rewrite callbacks through the listener interface 5e336b9 [Andrew Or] Remove code from backend to avoid conflict with SPARK-3822 092d1fd [Andrew Or] Remove timeout logic for pending requests 1309fab [Andrew Or] Request executors by specifying the number pending 8bc0e9d [Andrew Or] Add logic to expire pending requests after timeouts b750ee1 [Andrew Or] Express timers in terms of expiration times + remove retry logic 7f8dd47 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics 9d516cc [Andrew Or] Bug fix: Actually trigger the add timer / add retry timer 44f1832 [Andrew Or] Rename configs to include time units eaae7ef [Andrew Or] Address various review comments 6f8be6c [Andrew Or] Beef up comments on what each of the timers mean baaa403 [Andrew Or] Simplify variable names (minor) 42beec8 [Andrew Or] Reset whether the add threshold is crossed on cancellation 9bcc0bc [Andrew Or] ExecutorScalingManager -> ExecutorAllocationManager 2784398 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics 5a97d9e [Andrew Or] Log retry attempts in INFO + clean up logging 2f55c9f [Andrew Or] Do not 
keep requesting executors even after max attempts 0acd1cb [Andrew Or] Rewrite timer logic with polling b3c7d44 [Andrew Or] Start the retry timer for adding executors at the right time 9b5f2ea [Andrew Or] Wording changes in comments and log messages c2203a5 [Andrew Or] Simplify code to access the scheduler backend e519d08 [Andrew Or] Simplify initialization code 2cc87a7 [Andrew Or] Add retry logic for removing executors d0b34a6 [Andrew Or] Add retry logic for adding executors 9cc4649 [Andrew Or] Simplifying synchronization logic 67c03c7 [Andrew Or] Correct semantics of adding executors + update comments 6c48ab0 [Andrew Or] Update synchronization comment 8901900 [Andrew Or] Simplify remove policy + change the semantics of add policy 1cc8444 [Andrew Or] Minor wording change ae5b64a [Andrew Or] Add synchronization 20ec6b9 [Andrew Or] First cut implementation of removing executors dynamically 4077ae2 [Andrew Or] Minor code re-organization 6f1fa66 [Andrew Or] First cut implementation of adding executors dynamically b2e6dcc [Andrew Or] Add skeleton interface for requesting / killing executors
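A rough sketch of the pattern several of these commits describe — timers expressed as expiration times, checked from a polling loop against an injectable clock so tests can mock time. Names here are illustrative, not the actual ExecutorAllocationManager internals:

```scala
// Clock is injectable so tests can advance time manually instead of sleeping.
trait Clock { def getTimeMillis(): Long }
class SystemClock extends Clock {
  override def getTimeMillis(): Long = System.currentTimeMillis()
}

class AllocationPoller(clock: Clock, addIntervalMs: Long) {
  // Sentinel meaning "no add is currently scheduled".
  private val NOT_SET = Long.MaxValue
  private var addTime = NOT_SET

  // Called when the scheduler reports pending tasks: arm the add timer.
  def onPendingTasks(): Unit =
    if (addTime == NOT_SET) addTime = clock.getTimeMillis() + addIntervalMs

  // Called periodically from the polling loop.
  def schedule(): Unit = {
    val now = clock.getTimeMillis()
    if (now >= addTime) {
      // ... request more executors here ...
      addTime = now + addIntervalMs // re-arm for the next round
    }
  }
}
```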
-
zsxwing authored
There is a chance that `thread` is null when calling `thread.interrupt()`:
```scala
override def cancel(): Unit = this.synchronized {
  _cancelled = true
  if (thread != null) {
    thread.interrupt()
  }
}
```
We should put `thread = null` into a `synchronized` block to fix the race condition. Author: zsxwing <zsxwing@gmail.com> Closes #2957 from zsxwing/SPARK-4097 and squashes the following commits: edf0aee [zsxwing] Add comments to explain the lock c5cfeca [zsxwing] Fix the race condition of 'thread'
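A self-contained sketch of the fixed pattern (an illustrative class, not the actual Spark source): both the null-check-plus-interrupt and the clearing of `thread` happen under the same lock, so `cancel()` can never race with the field update:

```scala
class CancellableTask {
  private var _cancelled = false
  private var thread: Thread = _

  def run(): Unit = {
    this.synchronized {
      if (_cancelled) return
      thread = Thread.currentThread()
    }
    try {
      // ... interruptible work ...
    } finally {
      // Clearing the field under the lock makes it atomic with respect to
      // cancel()'s null check.
      this.synchronized { thread = null }
    }
  }

  def cancel(): Unit = this.synchronized {
    _cancelled = true
    if (thread != null) thread.interrupt()
  }
}
```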
-
Andrew Or authored
This is part of a broader effort to enable dynamic scaling of executors ([SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174)). This is intended to work alongside SPARK-3795 (#2746), SPARK-3796 and SPARK-3797, but is functionally independent of those other issues. The logic is built on top of PraveenSeluka's changes at #2798. This is different from the changes there in a few major ways: (1) the mechanism is implemented within the existing scheduler backend framework rather than in new `Actor` classes. This also introduces a parent abstract class `YarnSchedulerBackend` to encapsulate common logic to communicate with the Yarn `ApplicationMaster`. (2) The interface for requesting executors exposed to the `SparkContext` is the same, but the communication between the scheduler backend and the AM uses the total number of executors desired instead of an incremental number. This is discussed in #2746 and explained in the comments in the code. I have tested this significantly on a stable Yarn cluster. ------------ A remaining task for this issue is to tone down the error messages emitted when an executor is removed. Currently, `SparkContext` and its components react as if the executor has failed, resulting in many scary error messages and eventual timeouts. While it's not strictly necessary to fix this in the first-cut implementation of this mechanism, it would be good to add logic to distinguish this case. I prefer to address this in a separate PR. I have filed a separate JIRA for this task at SPARK-4134. Author: Andrew Or <andrew@databricks.com> Author: Andrew Or <andrewor14@gmail.com> Closes #2840 from andrewor14/yarn-scaling-mechanism and squashes the following commits: 485863e [Andrew Or] Minor log message changes 4920be8 [Andrew Or] Clarify that public API is only for Yarn mode for now 1c57804 [Andrew Or] Reword a few comments + other review comments 6321140 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism 02836c0 [Andrew Or] Limit scope of synchronization 4e2ed7f [Andrew Or] Fix bug: keep track of removed executors properly 73ade46 [Andrew Or] Wording changes (minor) 2a7a6da [Andrew Or] Add `sc.killExecutor` as a shorthand (minor) 665f229 [Andrew Or] Mima excludes 79aa2df [Andrew Or] Simplify the request interface by asking for a total 04f625b [Andrew Or] Fix race condition that causes over-allocation of executors f4783f8 [Andrew Or] Change the semantics of requesting executors 005a124 [Andrew Or] Fix tests 4628b16 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism db4a679 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism 572f5c5 [Andrew Or] Unused import (minor) f30261c [Andrew Or] Kill multiple executors rather than one at a time de260d9 [Andrew Or] Simplify by skipping useless null check 9c52542 [Andrew Or] Simplify by skipping the TaskSchedulerImpl 97dd1a8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism d987b3e [Andrew Or] Move addWebUIFilters to Yarn scheduler backend 7b76d0a [Andrew Or] Expose mechanism in SparkContext as developer API 47466cd [Andrew Or] Refactor common Yarn scheduler backend logic c4dfaac [Andrew Or] Avoid thrashing when removing executors 53e8145 [Andrew Or] Start yarn actor early to listen for AM registration message bbee669 [Andrew Or] Add mechanism in yarn client mode
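A hedged usage sketch of the developer API this PR exposes on `SparkContext` (Yarn mode only at this point; method names follow the commit summaries and should be treated as described there, not verified signatures):

```scala
// Ask the cluster manager for two additional executors.
sc.requestExecutors(2)

// Release executors by ID; killExecutor is the single-executor shorthand
// mentioned in the commits.
sc.killExecutors(Seq("1", "2"))
sc.killExecutor("3")
```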
-
Daoyuan Wang authored
In JavaSqlContext, we need to let Java programs use big decimal, timestamp, and date types. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #2850 from adrian-wang/javacontext and squashes the following commits: 4c4292c [Daoyuan Wang] change underlying type of JavaSchemaRDD as scala bb0508f [Daoyuan Wang] add test cases 3c58b0d [Daoyuan Wang] add 3 types for java SQL context
-
Reynold Xin authored
This PR encapsulates #2330, which is itself a continuation of #2240. The first goal of this PR is to provide an alternate, simpler implementation of the ConnectionManager which is based on Netty. In addition to this goal, however, we want to resolve [SPARK-3796](https://issues.apache.org/jira/browse/SPARK-3796), which calls for a standalone shuffle service which can be integrated into the YARN NodeManager, Standalone Worker, or on its own. This PR makes the first step in this direction by ensuring that the actual Netty service is as small as possible and extracted from Spark core. Given this, we should be able to construct this standalone jar which can be included in other JVMs without incurring significant dependency or runtime issues. The actual work to ensure that such a standalone shuffle service would work in Spark will be left for a future PR, however. In order to minimize dependencies and allow for the service to be long-running (possibly much longer-running than Spark, and possibly having to support multiple versions of Spark simultaneously), the entire service has been ported to Java, where we have full control over the binary compatibility of the components and do not depend on the Scala runtime or version. The following issues have been addressed by folding in #2330:

- SPARK-3453: Refactor Netty module to use BlockTransferService interface
- SPARK-3018: Release all buffers upon task completion/failure
- SPARK-3002: Create a connection pool and reuse clients across different threads
- SPARK-3017: Integration tests and unit tests for connection failures
- SPARK-3049: Make sure client doesn't block when server/connection has error(s)
- SPARK-3502: SO_RCVBUF and SO_SNDBUF should be bootstrap childOption, not option
- SPARK-3503: Disable thread local cache in PooledByteBufAllocator

TODO before mergeable:
- [x] Implement uploadBlock()
- [x] Unit tests for RPC side of code
- [x] Performance testing (see comments [here](https://github.com/apache/spark/pull/2753#issuecomment-59475022))
- [x] Turn OFF by default (currently on for unit testing)

Author: Reynold Xin <rxin@apache.org> Author: Aaron Davidson <aaron@databricks.com> Author: cocoatomo <cocoatomo77@gmail.com> Author: Patrick Wendell <pwendell@gmail.com> Author: Prashant Sharma <prashant.s@imaginea.com> Author: Davies Liu <davies.liu@gmail.com> Author: Anand Avati <avati@redhat.com> Closes #2753 from aarondav/netty and squashes the following commits: cadfd28 [Aaron Davidson] Turn netty off by default d7be11b [Aaron Davidson] Turn netty on by default 4a204b8 [Aaron Davidson] Fail block fetches if client connection fails 2b0d1c0 [Aaron Davidson] 100ch 0c5bca2 [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty 14e37f7 [Aaron Davidson] Address Reynold's comments 8dfcceb [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty 322dfc1 [Aaron Davidson] Address Reynold's comments, including major rename e5675a4 [Aaron Davidson] Fail outstanding RPCs as well ccd4959 [Aaron Davidson] Don't throw exception if client immediately fails 9da0bc1 [Aaron Davidson] Add RPC unit tests d236dfd [Aaron Davidson] Remove no-op serializer :) 7b7a26c [Aaron Davidson] Fix Nio compile issue dd420fd [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty-test 939f276 [Aaron Davidson] Attempt to make comm.
bidirectional aa58f67 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings 8dc1ded [cocoatomo] [SPARK-3867][PySpark] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed 5b5dbe6 [Prashant Sharma] [SPARK-2924] Required by scala 2.11, only one fun/ctor amongst overriden alternatives, can have default argument(s). 2c5d9dc [Patrick Wendell] HOTFIX: Fix build issue with Akka 2.3.4 upgrade. 020691e [Davies Liu] [SPARK-3886] [PySpark] use AutoBatchedSerializer by default ae4083a [Anand Avati] [SPARK-2805] Upgrade Akka to 2.3.4 29c6dcf [Aaron Davidson] [SPARK-3453] Netty-based BlockTransferService, extracted from Spark core f7e7568 [Reynold Xin] Fixed spark.shuffle.io.receiveBuffer setting. 5d98ce3 [Reynold Xin] Flip buffer. f6c220d [Reynold Xin] Merge with latest master. 407e59a [Reynold Xin] Fix style violation. a0518c7 [Reynold Xin] Implemented block uploads. 4b18db2 [Reynold Xin] Copy the buffer in fetchBlockSync. bec4ea2 [Reynold Xin] Removed OIO and added num threads settings. 1bdd7ee [Reynold Xin] Fixed tests. d68f328 [Reynold Xin] Logging close() in case close() fails. f63fb4c [Reynold Xin] Add more debug message. 6afc435 [Reynold Xin] Added logging. c066309 [Reynold Xin] Implement java.io.Closeable interface. 519d64d [Reynold Xin] Mark private package visibility and MimaExcludes. f0a16e9 [Reynold Xin] Fixed test hanging. 14323a5 [Reynold Xin] Removed BlockManager.getLocalShuffleFromDisk. b2f3281 [Reynold Xin] Added connection pooling. d23ed7b [Reynold Xin] Incorporated feedback from Norman: - use same pool for boss and worker - remove ioratio - disable caching of byte buf allocator - childoption sendbuf/receivebuf - fire exception through pipeline 9e0cb87 [Reynold Xin] Fixed BlockClientHandlerSuite 5cd33d7 [Reynold Xin] Fixed style violation. cb589ec [Reynold Xin] Added more test cases covering cleanup when fault happens in ShuffleBlockFetcherIteratorSuite 1be4e8e [Reynold Xin] Shorten NioManagedBuffer and NettyManagedBuffer class names. 108c9ed [Reynold Xin] Forgot to add TestSerializer to the commit list. b5c8d1f [Reynold Xin] Fixed ShuffleBlockFetcherIteratorSuite. 064747b [Reynold Xin] Reference count buffers and clean them up properly. 2b44cf1 [Reynold Xin] Added more documentation. 1760d32 [Reynold Xin] Use Epoll.isAvailable in BlockServer as well. 165eab1 [Reynold Xin] [SPARK-3453] Refactor Netty module to use BlockTransferService.
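For trying the new code path, flipping the transfer service on looks roughly like the following. The exact config key is my assumption based on the commit summaries ("Turn netty on/off by default"); the PR leaves it off by default at merge time:

```scala
import org.apache.spark.SparkConf

// Assumed key for selecting the Netty-based transfer service; the default
// remains the existing NIO ConnectionManager.
val conf = new SparkConf()
  .set("spark.shuffle.blockTransferService", "netty")
```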
-
DB Tsai authored
In MultivariateOnlineSummarizer, breeze's activeIterator is used to loop through the nonzero elements in the vector. However, activeIterator doesn't perform well due to lots of overhead. In this PR, a native while loop is used for both DenseVector and SparseVector. The benchmark result with 20 executors using the mnist8m dataset:

Before:
- DenseVector: 48.2 seconds
- SparseVector: 16.3 seconds

After:
- DenseVector: 17.8 seconds
- SparseVector: 11.2 seconds

Since MultivariateOnlineSummarizer is used in several places, the overall performance gain in the mllib library will be significant with this PR. Author: DB Tsai <dbtsai@alpinenow.com> Closes #2992 from dbtsai/SPARK-4129 and squashes the following commits: b99db6c [DB Tsai] fixed java.lang.ArrayIndexOutOfBoundsException 2b5e882 [DB Tsai] small refactoring ebe3e74 [DB Tsai] First commit
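The gist of the optimization, sketched in simplified form (the real MultivariateOnlineSummarizer updates several statistics per element, not just a sum):

```scala
// Dense case: a plain while loop over the values array avoids the
// per-element overhead of breeze's activeIterator.
def addDense(values: Array[Double], sum: Array[Double]): Unit = {
  var i = 0
  while (i < values.length) {
    sum(i) += values(i)
    i += 1
  }
}

// Sparse case: walk only the stored (index, value) pairs.
def addSparse(indices: Array[Int], values: Array[Double], sum: Array[Double]): Unit = {
  var k = 0
  while (k < indices.length) {
    sum(indices(k)) += values(k)
    k += 1
  }
}
```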
-
Xiangrui Meng authored
I forgot to disable the benchmark code in #2937, which increased the Jenkins build time by a couple of minutes. aarondav Author: Xiangrui Meng <meng@databricks.com> Closes #2990 from mengxr/disable-benchmark and squashes the following commits: c58f070 [Xiangrui Meng] disable benchmark code
-
- Oct 28, 2014
-
-
Davies Liu authored
A Python UDF can now be called on ArrayType/MapType/PrimitiveType values, and the returnType can also be ArrayType/MapType/PrimitiveType. A StructType value will act as a tuple (without attributes); if returnType is StructType, the returned value should also be a tuple. Author: Davies Liu <davies@databricks.com> Closes #2973 from davies/udf_array and squashes the following commits: 306956e [Davies Liu] Merge branch 'master' of github.com:apache/spark into udf_array 2c00e43 [Davies Liu] fix merge 11395fa [Davies Liu] Merge branch 'master' of github.com:apache/spark into udf_array 9df50a2 [Davies Liu] address comments 79afb4e [Davies Liu] type conversionfor python udf
-
Cheng Hao authored
In HQL, we convert all of the data types into normal `ObjectInspector`s for UDFs; in most cases this works, however some UDFs actually require their children's `ObjectInspector` to be a `ConstantObjectInspector`, which causes an exception, e.g. `select named_struct("x", "str") from src limit 1;`. I updated the `wrap` method by adding one more `ObjectInspector` parameter (to describe what it is expected to wrap to, for example: java.lang.Integer or IntWritable), as well as the `unwrap` method by providing the input `ObjectInspector`. Author: Cheng Hao <hao.cheng@intel.com> Closes #2762 from chenghao-intel/udf_coi and squashes the following commits: bcacfd7 [Cheng Hao] Shim for both Hive 0.12 & 0.13.1 2416e5d [Cheng Hao] revert to hive 0.12 5793c01 [Cheng Hao] add space before while 4e56e1b [Cheng Hao] style issue 683d3fd [Cheng Hao] Add golden files fe591e4 [Cheng Hao] update HiveGenericUdf for set the ObjectInspector while constructing the DeferredObject f6740fe [Cheng Hao] Support Constant ObjectInspector for Map & List 8814c3a [Cheng Hao] Passing ContantObjectInspector(when necessary) for UDF initializing
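A minimal repro of the failure mode described above, assuming a `HiveContext` named `hiveContext` is in scope (`named_struct` needs constant field names, so the children's ObjectInspectors must be ConstantObjectInspectors):

```scala
// Before this patch, the constant "x" reached the UDF as a plain
// (non-constant) ObjectInspector and UDF initialization threw an exception.
hiveContext.sql("""select named_struct("x", "str") from src limit 1""")
```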
-
zsxwing authored
`zeroValue` will be serialized by `spark.closure.serializer`, but `spark.closure.serializer` only supports the default Java serializer. So it must not be `ClassWithoutNoArgConstructor`, which cannot be serialized by the Java serializer. This PR changed `zeroValue` to null and updated the test to make it work correctly. Author: zsxwing <zsxwing@gmail.com> Closes #2856 from zsxwing/SPARK-4008 and squashes the following commits: 51da655 [zsxwing] [SPARK-4008] Fix "kryo with fold" in KryoSerializerSuite
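To make the constraint concrete, a small hedged sketch of why the zero value must stay Java-serializable even when Kryo is the data serializer:

```scala
// zeroValue rides along inside the task closure, and closures are always
// serialized by spark.closure.serializer (Java serialization only).
val rdd = sc.parallelize(1 to 10)
val total = rdd.fold(0)(_ + _) // 0 is Java-serializable, so this is safe
```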
-
Xiangrui Meng authored
Sorter uses a generic-typed key for sorting. When data is large, it creates lots of key objects, which is not efficient. We should reuse the key in Sorter for memory efficiency. This change is part of the petabyte sort implementation from rxin. The `Sorter` class was written in Java and marked package private, so it is only available to `org.apache.spark.util.collection`. I renamed it to `TimSort` and added a simple wrapper of it, still called `Sorter`, in Scala, which is `private[spark]`. The benchmark code is updated, and now resets the array before each run. Here is the result of sorting primitive Int arrays of size 25 million using Sorter:
~~~
[info] - Sorter benchmark for key-value pairs !!! IGNORED !!!

Java Arrays.sort() on non-primitive int array: Took 13237 ms
Java Arrays.sort() on non-primitive int array: Took 13320 ms
Java Arrays.sort() on non-primitive int array: Took 15718 ms
Java Arrays.sort() on non-primitive int array: Took 13283 ms
Java Arrays.sort() on non-primitive int array: Took 13267 ms
Java Arrays.sort() on non-primitive int array: Took 15122 ms
Java Arrays.sort() on non-primitive int array: Took 15495 ms
Java Arrays.sort() on non-primitive int array: Took 14877 ms
Java Arrays.sort() on non-primitive int array: Took 16429 ms
Java Arrays.sort() on non-primitive int array: Took 14250 ms
Java Arrays.sort() on non-primitive int array: (13878 ms first try, 14499 ms average)

Java Arrays.sort() on primitive int array: Took 2683 ms
Java Arrays.sort() on primitive int array: Took 2683 ms
Java Arrays.sort() on primitive int array: Took 2701 ms
Java Arrays.sort() on primitive int array: Took 2746 ms
Java Arrays.sort() on primitive int array: Took 2685 ms
Java Arrays.sort() on primitive int array: Took 2735 ms
Java Arrays.sort() on primitive int array: Took 2669 ms
Java Arrays.sort() on primitive int array: Took 2693 ms
Java Arrays.sort() on primitive int array: Took 2680 ms
Java Arrays.sort() on primitive int array: Took 2642 ms
Java Arrays.sort() on primitive int array: (2948 ms first try, 2691 ms average)

Sorter without key reuse on primitive int array: Took 10732 ms
Sorter without key reuse on primitive int array: Took 12482 ms
Sorter without key reuse on primitive int array: Took 10718 ms
Sorter without key reuse on primitive int array: Took 12650 ms
Sorter without key reuse on primitive int array: Took 10747 ms
Sorter without key reuse on primitive int array: Took 10783 ms
Sorter without key reuse on primitive int array: Took 12721 ms
Sorter without key reuse on primitive int array: Took 10604 ms
Sorter without key reuse on primitive int array: Took 10622 ms
Sorter without key reuse on primitive int array: Took 11843 ms
Sorter without key reuse on primitive int array: (11089 ms first try, 11390 ms average)

Sorter with key reuse on primitive int array: Took 5141 ms
Sorter with key reuse on primitive int array: Took 5298 ms
Sorter with key reuse on primitive int array: Took 5066 ms
Sorter with key reuse on primitive int array: Took 5164 ms
Sorter with key reuse on primitive int array: Took 5203 ms
Sorter with key reuse on primitive int array: Took 5274 ms
Sorter with key reuse on primitive int array: Took 5186 ms
Sorter with key reuse on primitive int array: Took 5159 ms
Sorter with key reuse on primitive int array: Took 5164 ms
Sorter with key reuse on primitive int array: Took 5078 ms
Sorter with key reuse on primitive int array: (5311 ms first try, 5173 ms average)
~~~
So with key reuse, it is faster and less likely to trigger GC.
Author: Xiangrui Meng <meng@databricks.com> Author: Reynold Xin <rxin@apache.org> Closes #2937 from mengxr/SPARK-4084 and squashes the following commits: d73c3d0 [Xiangrui Meng] address comments 0b7b682 [Xiangrui Meng] fix mima a72f53c [Xiangrui Meng] update timeIt 38ba50c [Xiangrui Meng] update timeIt 720f731 [Xiangrui Meng] add doc about JIT specialization 78f2879 [Xiangrui Meng] update tests 7de2efd [Xiangrui Meng] update the Sorter benchmark code to be correct 8626356 [Xiangrui Meng] add prepare to timeIt and update testsin SorterSuite 5f0d530 [Xiangrui Meng] update method modifiers of SortDataFormat 6ffbe66 [Xiangrui Meng] rename Sorter to TimSort and add a Scala wrapper that is private[spark] b00db4d [Xiangrui Meng] doc and tests cf94e8a [Xiangrui Meng] renaming 464ddce [Reynold Xin] cherry-pick rxin's commit
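The key-reuse idea in miniature. The method shapes follow the PR description; treat the exact signatures as assumptions rather than the real SortDataFormat API:

```scala
// A format that can fill a caller-provided mutable key avoids allocating a
// fresh key object for every comparison during the sort.
abstract class KeyReuseFormat[K, Buffer] {
  def newKey(): K                                  // allocate one mutable key up front
  def getKey(data: Buffer, pos: Int, reuse: K): K  // fill `reuse` instead of allocating
}
```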
-
Cheng Hao authored
Currently, `CTAS` (Create Table As Select) doesn't support specifying the `SerDe` in HQL. This PR will pass down the `ASTNode` into the physical operator `execution.CreateTableAsSelect`, which will extract the `CreateTableDesc` object via Hive `SemanticAnalyzer`. In the meantime, I also update the `HiveMetastoreCatalog.createTable` to optionally support the `CreateTableDesc` for table creation. Author: Cheng Hao <hao.cheng@intel.com> Closes #2570 from chenghao-intel/ctas_serde and squashes the following commits: e011ef5 [Cheng Hao] shim for both 0.12 & 0.13.1 cfb3662 [Cheng Hao] revert to hive 0.12 c8a547d [Cheng Hao] Support SerDe properties within CTAS
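What the feature enables, sketched as a query. The table name is hypothetical and the SerDe class is just a familiar Hive SerDe used for illustration, assuming a `hiveContext` in scope:

```scala
hiveContext.sql("""
  CREATE TABLE ctas_with_serde
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  AS SELECT key, value FROM src
""")
```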
-
zsxwing authored
A global UTF-8 constant is very helpful for handling encoding problems when converting between String and bytes. There are several options here:
1. Add `val UTF_8 = Charset.forName("UTF-8")` to Utils.scala
2. `java.nio.charset.StandardCharsets.UTF_8` (requires JDK 7)
3. `io.netty.util.CharsetUtil.UTF_8`
4. `com.google.common.base.Charsets.UTF_8`
5. `org.apache.commons.lang.CharEncoding.UTF_8`
6. `org.apache.commons.lang3.CharEncoding.UTF_8`

IMO, I prefer option 1 because people can find it easily. This is a PR for option 1 and only fixes Spark Core. Author: zsxwing <zsxwing@gmail.com> Closes #2781 from zsxwing/SPARK-3922 and squashes the following commits: f974edd [zsxwing] Merge branch 'master' into SPARK-3922 2d27423 [zsxwing] Refactor spark-core to use Utils.UTF_8
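Option 1 in miniature — a single constant plus typical call sites:

```scala
import java.nio.charset.Charset

val UTF_8: Charset = Charset.forName("UTF-8")

// Typical usage: always pass the charset explicitly instead of relying on
// the platform default encoding.
val bytes = "héllo".getBytes(UTF_8)
val roundTripped = new String(bytes, UTF_8)
```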
-
Daoyuan Wang authored
Add JSON and Python API for the date type. By using Pickle, `java.sql.Date` is serialized as a calendar object and recognized in Python as `datetime.datetime`. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #2901 from adrian-wang/spark3988 and squashes the following commits: c51a24d [Daoyuan Wang] convert datetime to date 5670626 [Daoyuan Wang] minor line combine f760d8e [Daoyuan Wang] fix indent 444f100 [Daoyuan Wang] fix a typo 1d74448 [Daoyuan Wang] fix scala style 8d7dd22 [Daoyuan Wang] add json and python api for date type
-
ravipesala authored
Currently there is no support for bitwise `&` and `|` in Spark HiveQL or Spark SQL. This PR adds that support. I am closing https://github.com/apache/spark/pull/2926 as it has conflicts to merge. This PR also adds support for bitwise AND (`&`), OR (`|`), XOR (`^`), and NOT (`~`), and handles all review comments from that PR. Author: ravipesala <ravindra.pesala@huawei.com> Closes #2961 from ravipesala/SPARK-3814-NEW4 and squashes the following commits: a391c7a [ravipesala] Rebase with master
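A quick usage sketch of the new operators, assuming a `sqlContext` in scope (literal operands chosen just for illustration):

```scala
// AND, OR, XOR, and NOT now parse in both Spark SQL and HiveQL.
sqlContext.sql("SELECT 3 & 5, 3 | 5, 3 ^ 5, ~3").collect()
```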
-
Kousuke Saruta authored
In the script 'python/run-tests', the log file name is represented by a variable 'LOG_FILE' and it is used in run-tests. But there are some hard-coded log file names in the script. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2905 from sarutak/SPARK-4058 and squashes the following commits: 7710490 [Kousuke Saruta] Fixed python/run-tests not to use hard-coded log file name
-
Michael Griffiths authored
This fix employs logic similar to the bash launcher (pyspark) to check if IPYTHON=1, and if so launch ipython with the options in IPYTHON_OPTS. It assumes that ipython is available on the system Path and can be invoked with a plain "ipython" command. Author: Michael Griffiths <msjgriffiths@gmail.com> Closes #2910 from msjgriffiths/pyspark-windows and squashes the following commits: ef34678 [Michael Griffiths] Change build message to comply with [SPARK-3775] 361e3d8 [Michael Griffiths] [SPARK-4065] Add check for IPython on Windows 9ce72d1 [Michael Griffiths] [SPARK-4065] Add check for IPython on Windows
-
Kousuke Saruta authored
The version number of Spark in docs/_config.yaml for master branch should be 1.2.0 for now. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2943 from sarutak/SPARK-4089 and squashes the following commits: aba7fb4 [Kousuke Saruta] Fixed the version number of Spark in _config.yaml
-
Kousuke Saruta authored
[SPARK-3657] yarn alpha: YarnRMClientImpl throws NPE in appMasterRequest.setTrackingUrl when starting spark-shell. tgravescs reported this issue. The following is quoted from tgravescs' report. YarnRMClientImpl.registerApplicationMaster can throw a null pointer exception when setting the tracking URL if it's empty:

    appMasterRequest.setTrackingUrl(new URI(uiAddress).getAuthority())

I hit this just starting spark-shell without the tracking URL set.

    14/09/23 16:18:34 INFO yarn.YarnRMClientImpl: Connecting to ResourceManager at kryptonitered-jt1.red.ygrid.yahoo.com/98.139.154.99:8030
    Exception in thread "main" java.lang.NullPointerException
        at org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterRequestProto$Builder.setTrackingUrl(YarnServiceProtos.java:710)
        at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterRequestPBImpl.setTrackingUrl(RegisterApplicationMasterRequestPBImpl.java:132)
        at org.apache.spark.deploy.yarn.YarnRMClientImpl.registerApplicationMaster(YarnRMClientImpl.scala:102)
        at org.apache.spark.deploy.yarn.YarnRMClientImpl.register(YarnRMClientImpl.scala:55)
        at org.apache.spark.deploy.yarn.YarnRMClientImpl.register(YarnRMClientImpl.scala:38)
        at org.apache.spark.deploy.yarn.ApplicationMaster.registerAM(ApplicationMaster.scala:168)
        at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:206)
        at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:120)

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2981 from sarutak/SPARK-3657-2 and squashes the following commits: e2fd6bc [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3657 70b8882 [Kousuke Saruta] Fixed NPE thrown
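The shape of the guard, sketched as a standalone function (illustrative; the real fix lives in YarnRMClientImpl.registerApplicationMaster):

```scala
import java.net.URI

// An empty uiAddress (e.g. spark-shell before the UI is up) must map to an
// empty tracking URL rather than a null authority, which is what the
// protobuf setter rejected with an NPE.
def safeTrackingUrl(uiAddress: String): String =
  if (uiAddress == null || uiAddress.isEmpty) ""
  else Option(new URI(uiAddress).getAuthority).getOrElse("")
```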
-
WangTaoTheTonic authored
[SPARK-4096][YARN] Let ApplicationMaster accept the executor memory argument in the same format as JVM memory strings. Currently `ApplicationMaster` accepts the executor memory argument only in numeric format; we should let it accept JVM-style memory strings (e.g. `512m`, `2g`) as well. Author: WangTaoTheTonic <barneystinson@aliyun.com> Closes #2955 from WangTaoTheTonic/modifyDesc and squashes the following commits: ab98c70 [WangTaoTheTonic] append parameter passed in 3779767 [WangTaoTheTonic] Update executor memory description in the help message
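Spark already carries a helper for this conversion in `org.apache.spark.util.Utils`; the wiring into the argument parser sketched here is my assumption, not the exact patch:

```scala
import org.apache.spark.util.Utils

// Accept "512m", "2g", etc. and normalize to megabytes.
val executorMemoryMb = Utils.memoryStringToMb("2g") // => 2048
```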
-
Kousuke Saruta authored
In spark-daemon.sh, there are the following comments:

    # SPARK_CONF_DIR  Alternate conf dir. Default is ${SPARK_PREFIX}/conf.
    # SPARK_LOG_DIR   Where log files are stored. PWD by default.

But I think the default value for SPARK_CONF_DIR is `${SPARK_HOME}/conf` and for SPARK_LOG_DIR it is `${SPARK_HOME}/logs`. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2972 from sarutak/SPARK-4110 and squashes the following commits: 5a171a2 [Kousuke Saruta] Fixed wrong comments
-
Shivaram Venkataraman authored
This avoids reading torrent broadcast variables when they are referenced in the closure but not actually used by it. This is done by using a `lazy val` to read broadcast blocks. cc rxin JoshRosen for review Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #2871 from shivaram/broadcast-read-value and squashes the following commits: 1456d65 [Shivaram Venkataraman] Use getUsedTimeMs and remove readObject d6c5ee9 [Shivaram Venkataraman] Use laxy val to implement readBroadcastBlock 0b34df7 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/apache/spark into broadcast-read-value 9cec507 [Shivaram Venkataraman] Test if broadcast variables are read lazily 768b40b [Shivaram Venkataraman] Merge branch 'master' of https://github.com/apache/spark into broadcast-read-value 8792ed8 [Shivaram Venkataraman] Make torrent broadcast read blocks on use. This avoids reading broadcast variables when they are referenced in the closure but not used by the code.
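The pattern in miniature (simplified; the real TorrentBroadcast reads individual blocks and handles locking and storage levels):

```scala
// Deserialized tasks only pay the block-fetch cost if the closure actually
// touches .value; merely capturing the broadcast reference stays cheap.
class LazyBroadcast[T](fetchBlocks: () => T) extends Serializable {
  @transient private lazy val _value: T = fetchBlocks()
  def value: T = _value
}
```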
-
WangTaoTheTonic authored
https://issues.apache.org/jira/browse/SPARK-4098 Author: WangTaoTheTonic <barneystinson@aliyun.com> Closes #2958 from WangTaoTheTonic/useAddress and squashes the following commits: 29236e6 [WangTaoTheTonic] use appUIAddress instead of appUIHostPort in yarn-cluster mode
-
WangTaoTheTonic authored
Instead of checking whether `args.userClass` is null repeatedly, we extract it to a global val as in `ApplicationMaster`. Author: WangTaoTheTonic <barneystinson@aliyun.com> Closes #2954 from WangTaoTheTonic/MemUnit and squashes the following commits: 13bda20 [WangTaoTheTonic] extract val isLaunchingDriver in ClientBase
-
WangTaoTheTonic authored
Since it was renamed in https://github.com/apache/spark/pull/560, log4j-spark-container.properties has never been used again. I searched for its name globally in the code and found no references. Author: WangTaoTheTonic <barneystinson@aliyun.com> Closes #2977 from WangTaoTheTonic/delLog4j and squashes the following commits: fb2729f [WangTaoTheTonic] delete the log4j file obsoleted
-
Davies Liu authored
Added a complete Python API for mllib.feature: Normalizer, StandardScalerModel, StandardScaler, HashingTF, IDFModel, IDF. cc mengxr Author: Davies Liu <davies@databricks.com> Author: Davies Liu <davies.liu@gmail.com> Closes #2819 from davies/feature and squashes the following commits: 4f48f48 [Davies Liu] add a note for HashingTF 67f6d21 [Davies Liu] address comments b628693 [Davies Liu] rollback changes in Word2Vec efb4f4f [Davies Liu] Merge branch 'master' into feature 806c7c2 [Davies Liu] address comments 3abb8c2 [Davies Liu] address comments 59781b9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into feature a405ae7 [Davies Liu] fix tests 7a1891a [Davies Liu] fix tests 486795f [Davies Liu] update programming guide, HashTF -> HashingTF 8a50584 [Davies Liu] Python API for mllib.feature
-
Josh Rosen authored
`read()` may return fewer bytes than requested; when this occurred, the old code would silently return less data than requested, which might cause stream corruption errors. `skip()` faces similar issues, too. This patch fixes several cases where we mis-handle these methods' return values. Author: Josh Rosen <joshrosen@databricks.com> Closes #2969 from JoshRosen/file-channel-read-fix and squashes the following commits: e724a9f [Josh Rosen] Fix similar issue of not checking skip() return value. cbc03ce [Josh Rosen] Update the other log message, too. 01e6015 [Josh Rosen] file.getName -> file.getAbsolutePath d961d95 [Josh Rosen] Fix another issue in FileServerSuite. b9265d2 [Josh Rosen] Fix a similar (minor) issue in TestUtils. cd9d76f [Josh Rosen] Fix a similar error in Tachyon: 3db0008 [Josh Rosen] Fix a similar read() error in Utils.offsetBytes(). db985ed [Josh Rosen] Fix unsafe usage of FileChannel.read():
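The general shape of the fix — loop until the buffer is full rather than trusting a single `read()` call (a sketch of the idiom, not the patch itself):

```scala
import java.io.EOFException
import java.nio.ByteBuffer
import java.nio.channels.FileChannel

def readFully(channel: FileChannel, buf: ByteBuffer): Unit = {
  while (buf.remaining() > 0) {
    if (channel.read(buf) == -1) {
      // EOF before the requested bytes arrived: fail loudly instead of
      // silently returning short data.
      throw new EOFException("Hit EOF before filling the buffer")
    }
  }
}
```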
-
Ryan Williams authored
seems like `building-spark.html` was renamed to `building-with-maven.html`? Is Maven the blessed build tool these days, or SBT? I couldn't find a building-with-sbt page so I went with the Maven one here. Author: Ryan Williams <ryan.blake.williams@gmail.com> Closes #2859 from ryan-williams/broken-links-readme and squashes the following commits: 7692253 [Ryan Williams] fix broken links in README.md
-
GuoQiang Li authored
cc @rxin Author: GuoQiang Li <witgo@qq.com> Closes #2929 from witgo/SPARK-4064 and squashes the following commits: 20110f2 [GuoQiang Li] Modify the exception msg 3425225 [GuoQiang Li] review commits 2b07e49 [GuoQiang Li] If we create a lot of big broadcast variables, Spark may hang
-
wangxiaojing authored
JIRA issue: [SPARK-3907](https://issues.apache.org/jira/browse/SPARK-3907) Add truncate table support:

    TRUNCATE TABLE table_name [PARTITION partition_spec];

    partition_spec:
      : (partition_col = partition_col_value, partition_col = partition_col_value, ...)

Removes all rows from a table or partition(s). Currently the target table should be a native/managed table or an exception will be thrown. Users can specify a partial partition_spec to truncate multiple partitions at once; omitting partition_spec truncates all partitions in the table. Author: wangxiaojing <u9jing@gmail.com> Closes #2770 from wangxiaojing/spark-3907 and squashes the following commits: 63dbd81 [wangxiaojing] change hive scalastyle 7a03707 [wangxiaojing] add comment f6e710e [wangxiaojing] change truncate table a1f692c [wangxiaojing] Correct spelling mistakes 3b20007 [wangxiaojing] add truncate can not support column err message e483547 [wangxiaojing] add golden file 77b1f20 [wangxiaojing] add truncate table support
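Usage, sketched against a hypothetical partitioned table and assuming a `hiveContext` in scope:

```scala
// Remove all rows from every partition.
hiveContext.sql("TRUNCATE TABLE page_views")

// Remove rows from one partition only.
hiveContext.sql("TRUNCATE TABLE page_views PARTITION (dt = '2014-10-28')")
```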
-
- Oct 27, 2014
-
-
Yin Huai authored
`schemaRDD2` is not tested because `schemaRDD1` is registered again. Author: Yin Huai <huai@cse.ohio-state.edu> Closes #2869 from yhuai/JavaApplySchemaSuite and squashes the following commits: 95fe894 [Yin Huai] Correct variable name.
-
wangfei authored
[SPARK-4041][SQL] Attribute names in table scans should be converted to lowercase when compared with relation attributes. In `MetastoreRelation` the attribute names are lowercase because Hive uses lowercase for field names, so we should convert attribute names in the table scan to lowercase in `indexWhere(_.name == a.name)`. `neededColumnIDs` may be incorrect without this conversion. Author: wangfei <wangfei1@huawei.com> Author: scwf <wangfei1@huawei.com> Closes #2884 from scwf/fixColumnIds and squashes the following commits: 6174046 [scwf] use AttributeMap for this issue dc74a24 [wangfei] use lowerName and add a test case for this issue 3ff3a80 [wangfei] more safer change 294fcb7 [scwf] attributes names in table scan should convert lowercase in neededColumnsIDs
-
Alex Liu authored
Add table properties from the storage handler to the job conf in the SparkHadoopWriter class. Author: Alex Liu <alex_liu68@yahoo.com> Closes #2677 from alexliu68/SPARK-SQL-3816 and squashes the following commits: 79c269b [Alex Liu] [SPARK-3816][SQL] Add table properties from storage handler to job conf
-
Cheng Hao authored
```
explain extended select cos(null) from src limit 1;
```
outputs:
```
Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#5]
 MetastoreRelation default, src, None

== Optimized Logical Plan ==
Limit 1
 Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#5]
  MetastoreRelation default, src, None

== Physical Plan ==
Limit 1
 Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#5]
  HiveTableScan [], (MetastoreRelation default, src, None), None
```
After patching, this PR outputs:
```
== Parsed Logical Plan ==
Limit 1
 Project ['cos(null) AS c_0#0]
  UnresolvedRelation None, src, None

== Analyzed Logical Plan ==
Limit 1
 Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#0]
  MetastoreRelation default, src, None

== Optimized Logical Plan ==
Limit 1
 Project [null AS c_0#0]
  MetastoreRelation default, src, None

== Physical Plan ==
Limit 1
 Project [null AS c_0#0]
  HiveTableScan [], (MetastoreRelation default, src, None), None
```
Author: Cheng Hao <hao.cheng@intel.com> Closes #2771 from chenghao-intel/hive_udf_constant_folding and squashes the following commits: 1379c73 [Cheng Hao] duplicate the PlanTest with catalyst/plans/PlanTest 1e52dda [Cheng Hao] add unit test for hive simple udf constant folding 01609ff [Cheng Hao] support constant folding for HiveSimpleUdf
-
coderxiang authored
Also update step parameter to pass the proposed test Author: coderxiang <shuoxiangpub@gmail.com> Closes #2965 from coderxiang/nnls-test and squashes the following commits: 24b06f9 [coderxiang] add test case on objective value for NNLS; update step parameter to pass the test
-
Sean Owen authored
This change replaces usages of colt with commons-math3 equivalents, and makes some minor necessary adjustments to related code and tests to match. Author: Sean Owen <sowen@cloudera.com> Closes #2928 from srowen/SPARK-4022 and squashes the following commits: 61a232f [Sean Owen] Fix failure due to different sampling in JavaAPISuite.sample() 16d66b8 [Sean Owen] Simplify seeding with call to reseedRandomGenerator a1a78e0 [Sean Owen] Use Well19937c 31c7641 [Sean Owen] Fix Python Poisson test by choosing a different seed; about 88% of seeds should work but 1 didn't, it seems 5c9c67f [Sean Owen] Additional test fixes from review d8f88e0 [Sean Owen] Replace colt with commons-math3. Some tests do not pass yet.
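A hedged sketch of the commons-math3 pieces the commits name (the Well19937c generator and explicit reseeding); parameter values here are just for illustration:

```scala
import org.apache.commons.math3.distribution.PoissonDistribution
import org.apache.commons.math3.random.Well19937c

// Build a Poisson sampler on the Well19937c generator mentioned above.
val rng = new Well19937c()
val poisson = new PoissonDistribution(
  rng, 4.0,
  PoissonDistribution.DEFAULT_EPSILON,
  PoissonDistribution.DEFAULT_MAX_ITERATIONS)

poisson.reseedRandomGenerator(42L) // deterministic draws for tests
val draw = poisson.sample()
```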
-
Cheng Lian authored
PR #2860 refines in-memory table statistics and enables broader broadcast hash join optimization for in-memory tables. This makes `JoinSuite` fail when some test suite caches test table `testData` and gets executed before `JoinSuite`, because the expected `ShuffledHashJoin`s are optimized into `BroadcastHashJoin`s according to collected in-memory table statistics. This PR fixes the issue by clearing the cache before testing join operator selection. A separate test case is also added to test broadcast hash join operator selection. Author: Cheng Lian <lian@databricks.com> Closes #2960 from liancheng/fix-join-suite and squashes the following commits: 715b2de [Cheng Lian] Fixes caching related JoinSuite failure
-
Sandy Ryza authored
The patch takes advantage of an API provided in Hadoop 2.5 that allows getting accurate data on Hadoop FileSystem bytes read. It eliminates the old method, which naively reported the split size as the input bytes. An impact of this change is that input metrics go away when running against Hadoop versions earlier than 2.5. I can add this back in, but my opinion is that no metrics are better than inaccurate metrics. This is difficult to write a test for because we don't usually build against a version of Hadoop that contains the function we need. I've tested it manually on a pseudo-distributed cluster. Author: Sandy Ryza <sandy@cloudera.com> Closes #2087 from sryza/sandy-spark-2621 and squashes the following commits: 23010b8 [Sandy Ryza] Missing style fixes 74fc9bb [Sandy Ryza] Make getFSBytesReadOnThreadCallback private 1ab662d [Sandy Ryza] Clear things up a bit 984631f [Sandy Ryza] Switch from pull to push model and add test 7ef7b22 [Sandy Ryza] Add missing curly braces 219abc9 [Sandy Ryza] Fall back to split size 90dbc14 [Sandy Ryza] SPARK-2621. Update task InputMetrics incrementally
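The Hadoop 2.5 API in question, sketched (the per-thread statistics accessor only exists in 2.5+, which is why older versions fall back to the split size):

```scala
import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

// Bytes actually read by the current thread for a given scheme (e.g. "hdfs"),
// instead of assuming the whole split was read.
def bytesReadOnThread(scheme: String): Option[Long] =
  FileSystem.getAllStatistics.asScala
    .find(_.getScheme == scheme)
    .map(_.getThreadStatistics.getBytesRead)
```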
-