  1. Feb 27, 2016
    • Reynold Xin's avatar
      [SPARK-13521][BUILD] Remove reference to Tachyon in cluster & release scripts · 59e3e10b
      Reynold Xin authored
      ## What changes were proposed in this pull request?
We provide a very limited set of cluster management scripts in Spark for Tachyon, although Tachyon itself provides a much better version of them. Given that Spark users can now simply use Tachyon as a normal file system without extensive configuration, we can remove these management capabilities to simplify Spark's bash scripts.
      
Note that this also reduces coupling between a third-party external system and Spark's release scripts, and eliminates the possibility of failures such as Tachyon being renamed or its tarballs being relocated.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11400 from rxin/release-script.
      59e3e10b
  2. Feb 26, 2016
    • Josh Rosen's avatar
      [SPARK-13474][PROJECT INFRA] Update packaging scripts to push artifacts to home.apache.org · f77dc4e1
      Josh Rosen authored
      Due to the people.apache.org -> home.apache.org migration, we need to update our packaging scripts to publish artifacts to the new server. Because the new server only supports sftp instead of ssh, we need to update the scripts to use lftp instead of ssh + rsync.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11350 from JoshRosen/update-release-scripts-for-apache-home.
      f77dc4e1
    • Shixiong Zhu's avatar
      [SPARK-13519][CORE] Driver should tell Executor to stop itself when cleaning executor's state · ad615291
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
When the driver removes an executor's state, the connection between the driver and the executor may still be alive, so the executor cannot exit automatically (e.g., Master sends RemoveExecutor when a worker is lost but the executor is still alive). The driver should therefore tell the executor to stop itself; otherwise, we leak an executor.
      
      This PR modified the driver to send `StopExecutor` to the executor when it's removed.
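
For illustration, a rough Scala sketch of the driver-side change described above; the surrounding scheduler-backend structure and identifier names are assumptions, not the merged code:

```scala
// Hedged sketch only, assuming the driver keeps an RPC endpoint per executor
// and a StopExecutor message exists (as the description implies).
def removeExecutorState(executorId: String, reason: String): Unit = {
  executorDataMap.get(executorId).foreach { data =>
    // Tell the executor to shut itself down; a still-alive connection would
    // otherwise let it linger after the driver has cleaned up its state.
    data.executorEndpoint.send(StopExecutor)
  }
  // ... existing cleanup of the executor's bookkeeping follows ...
}
```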
      
      ## How was this patch tested?
      
Manual test: increased the worker heartbeat interval so that it always times out, and verified that the leaked executors are gone.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11399 from zsxwing/SPARK-13519.
      ad615291
    • zlpmichelle's avatar
      [SPARK-13505][ML] add python api for MaxAbsScaler · 1e5fcdf9
      zlpmichelle authored
      ## What changes were proposed in this pull request?
      After SPARK-13028, we should add Python API for MaxAbsScaler.
      
      ## How was this patch tested?
      unit test
      
      Author: zlpmichelle <zlpmichelle@gmail.com>
      
      Closes #11393 from zlpmichelle/master.
      1e5fcdf9
    • Reynold Xin's avatar
      [SPARK-13465] Add a task failure listener to TaskContext · 391755dc
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      
      TaskContext supports task completion callback, which gets called regardless of task failures. However, there is no way for the listener to know if there is an error. This patch adds a new listener that gets called when a task fails.
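
A hedged usage sketch of the new failure callback (`rdd` stands for any RDD; the exact package and overloads may differ slightly from what is shown here):

```scala
import org.apache.spark.TaskContext
import org.apache.spark.util.TaskFailureListener

// Register a listener that fires only when the task fails, receiving the error.
rdd.mapPartitions { iter =>
  TaskContext.get().addTaskFailureListener(new TaskFailureListener {
    override def onTaskFailure(context: TaskContext, error: Throwable): Unit = {
      println(s"Task attempt ${context.taskAttemptId()} failed: ${error.getMessage}")
    }
  })
  iter
}
```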
      
## How was this patch tested?
      New unit test case and integration test case covering the code path
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11340 from rxin/SPARK-13465.
      391755dc
    • Nong Li's avatar
      [SPARK-13499] [SQL] Performance improvements for parquet reader. · 0598a2b8
      Nong Li authored
      ## What changes were proposed in this pull request?
      
      This patch includes these performance fixes:
        - Remove unnecessary setNotNull() calls. The NULL bits are cleared already.
        - Speed up RLE group decoding
        - Speed up dictionary decoding by decoding NULLs directly into the result.
      
      ## How was this patch tested?
      
      
      In addition to the updated benchmarks, on TPCDS, the result of these changes
      running Q55 (sf40) is:
      
      ```
      TPCDS:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
      ---------------------------------------------------------------------------------
      q55 (Before)                             6398 / 6616         18.0          55.5
      q55 (After)                              4983 / 5189         23.1          43.3
      ```
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #11375 from nongli/spark-13499.
      0598a2b8
    • Davies Liu's avatar
      [SPARK-12313] [SQL] improve performance of BroadcastNestedLoopJoin · 6df1e55a
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
Currently, BroadcastNestedLoopJoin is implemented for the worst case: it's too slow and can easily hang forever. This PR creates fast paths for some combinations of joinType and buildSide, and also improves the worst case (using much less memory than before).
      
Before this PR, one task requires O(N*K) + O(K) memory in the worst case, where N is the number of rows from one partition of the streamed table and K is the number of rows from the build side; this could hang the job (because of GC).
      
To work around this for InnerJoin, we have to disable auto-broadcast and switch to CartesianProduct; see https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html
      
In this PR, we add fast paths for these joins:
      
       InnerJoin with BuildLeft or BuildRight
       LeftOuterJoin with BuildRight
       RightOuterJoin with BuildLeft
       LeftSemi with BuildRight
      
These fast paths are all stream-based (taking one pass over the streamed table) and require O(1) memory.
      
All other combinations of join type and build side take two passes over the streamed table: one pass to find the matched rows that include the streamed part, which requires O(1) memory, and another pass to find the rows from the build table that have no matching row in the streamed table, which requires O(K) memory, where K is the number of rows from the build side (one bit per row, which should be much smaller than the memory used for the broadcast). The following join types work this way:
      
      LeftOuterJoin with BuildLeft
      RightOuterJoin with BuildRight
      FullOuterJoin with BuildLeft or BuildRight
      LeftSemi with BuildLeft
      
      This PR also added tests for all the join types for BroadcastNestedLoopJoin.
      
After this PR, for InnerJoin with one small table, BroadcastNestedLoopJoin should be faster than CartesianProduct, so we don't need that workaround anymore; see the sketch below.
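
A hedged illustration of the InnerJoin fast path with a broadcast side (table and column names are made up, and `sqlContext` is assumed to be in scope):

```scala
import org.apache.spark.sql.functions.broadcast

// A non-equi inner join with a small broadcast side is the typical case planned
// as BroadcastNestedLoopJoin; with this PR it streams the big side in one pass
// instead of needing the CartesianProduct workaround above.
val events  = sqlContext.range(0, 1000000).toDF("ts")
val windows = sqlContext.range(0, 10).toDF("start")
val joined  = events.join(broadcast(windows), events("ts") >= windows("start"))
```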
      
## How was this patch tested?
      
      Added unit tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11328 from davies/nested_loop.
      6df1e55a
    • Dongjoon Hyun's avatar
      [MINOR][SQL] Fix modifier order. · 727e7801
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
This PR fixes the modifier order from `abstract public` to `public abstract`.
Currently, when we run `./dev/lint-java`, it shows the following error:
      ```
      Checkstyle checks failed at following occurrences:
      [ERROR] src/main/java/org/apache/spark/util/sketch/CountMinSketch.java:[53,10] (modifier) ModifierOrder: 'public' modifier out of order with the JLS suggestions.
      ```
      
      ## How was this patch tested?
      
      ```
      $ ./dev/lint-java
      Checkstyle checks passed.
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11390 from dongjoon-hyun/fix_modifier_order.
      727e7801
    • Dongjoon Hyun's avatar
      [SPARK-11381][DOCS] Replace example code in mllib-linear-methods.md using include_example · 7af0de07
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
This PR replaces the example code in `mllib-linear-methods.md` using `include_example`
by doing the following:
  * Extracts the example code (Scala, Java, Python) as files in the `example` module.
  * Merges some dialog-style examples into a single file.
  * Hides redundant code in HTML for consistency with other docs.
      
## How was this patch tested?
      
Manual test.
This PR can be tested by generating the documentation with `SKIP_API=1 jekyll build`.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11320 from dongjoon-hyun/SPARK-11381.
      7af0de07
    • Bryan Cutler's avatar
      [SPARK-12634][PYSPARK][DOC] PySpark tree parameter desc to consistent format · b33261f9
      Bryan Cutler authored
      Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent.  This is for the tree module.
      
      closes #10601
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      Author: vijaykiran <mail@vijaykiran.com>
      
      Closes #11353 from BryanCutler/param-desc-consistent-tree-SPARK-12634.
      b33261f9
    • Cheng Lian's avatar
      [SPARK-13457][SQL] Removes DataFrame RDD operations · 99dfcedb
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This is another try of PR #11323.
      
      This PR removes DataFrame RDD operations except for `foreach` and `foreachPartitions` (they are actions rather than transformations). Original calls are now replaced by calls to methods of `DataFrame.rdd`.
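
A hedged before/after sketch of the call-site migration (`df` and the column access are illustrative):

```scala
// Before this PR: RDD operations invoked directly on a DataFrame
//   val lengths = df.map(row => row.getString(0).length)
// After this PR: go through DataFrame.rdd explicitly
val lengths = df.rdd.map(row => row.getString(0).length)
```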
      
PR #11323 was reverted because it introduced a regression: both `DataFrame.foreach` and `DataFrame.foreachPartitions` wrap the underlying RDD operations with `withNewExecutionId` to track Spark jobs, but they were removed in #11323.
      
## How was this patch tested?
      
      No extra tests are added. Existing tests should do the work.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11388 from liancheng/remove-df-rdd-ops.
      99dfcedb
    • huangzhaowei's avatar
      [SPARK-12523][YARN] Support long-running of the Spark On HBase and hive meta store. · 5c3912e5
      huangzhaowei authored
Obtain the Hive metastore and HBase tokens as well as the HDFS token in DelegationTokenRenewer to support long-running applications of Spark on HBase or the Thrift server.
      
      Author: huangzhaowei <carlmartinmax@gmail.com>
      
      Closes #10645 from SaintBacchus/SPARK-12523.
      5c3912e5
    • Liwei Lin's avatar
      [MINOR][STREAMING] Fix a minor naming issue in JavaDStreamLike · 318bf411
      Liwei Lin authored
      Author: Liwei Lin <proflin.me@gmail.com>
      
      Closes #11385 from proflin/Fix-minor-naming.
      318bf411
    • hyukjinkwon's avatar
      [SPARK-13503][SQL] Support to specify the (writing) option for compression codec for TEXT · 9812a24a
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-13503
This PR lets the TEXT datasource compress its output via a writer option instead of manually setting Hadoop configurations.
For resolving compression codecs by name, it is similar to https://github.com/apache/spark/pull/10805 and https://github.com/apache/spark/pull/10858.
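
A hedged usage sketch, assuming the option key and codec name follow the JSON/CSV writer options referenced above (`df` and the output path are illustrative):

```scala
// Write text output compressed via a writer option rather than Hadoop configs.
df.select("value")
  .write
  .option("compression", "gzip")
  .text("/tmp/text-output")
```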
      
      ## How was this patch tested?
      
This was tested with unit tests and with `dev/run_tests` for coding style.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11384 from HyukjinKwon/SPARK-13503.
      9812a24a
    • Reynold Xin's avatar
      [SPARK-13487][SQL] User-facing RuntimeConfig interface · 26ac6080
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch creates the public API for runtime configuration and an implementation for it. The public runtime configuration includes configs for existing SQL, as well as Hadoop Configuration.
      
      This new interface is currently dead code. It will be added to SQLContext and a session entry point to Spark when we add that.
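
Since the interface is still dead code here, the following is only a hedged sketch of its rough shape as described in the PR; the trait and method names are assumptions, not the actual API:

```scala
// Rough shape of a user-facing runtime config: get/set over SQL configs and
// the Hadoop Configuration, once wired into a session entry point.
trait RuntimeConfigSketch {
  def set(key: String, value: String): RuntimeConfigSketch
  def get(key: String): String
  def unset(key: String): Unit
}
```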
      
      ## How was this patch tested?
      a new unit test suite
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11378 from rxin/SPARK-13487.
      26ac6080
    • thomastechs's avatar
      [SPARK-12941][SQL][MASTER] Spark-SQL JDBC Oracle dialect fails to map string... · 8afe4914
      thomastechs authored
      [SPARK-12941][SQL][MASTER] Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype
      
      ## What changes were proposed in this pull request?
      
This pull request fixes SPARK-12941 by creating a data type mapping to Oracle for the DataFrame data type StringType. This PR is for the master branch, whereas another PR has already been tested against branch 1.4.
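
A hedged sketch of the kind of JdbcDialect mapping this fix introduces (the object name and VARCHAR2 length are illustrative, not the merged code):

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

private object OracleDialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  // Map the DataFrame StringType to Oracle's VARCHAR2 instead of the default TEXT.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR2(255)", Types.VARCHAR))
    case _          => None
  }
}
```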
      
## How was this patch tested?
      
This patch was tested using the Oracle Docker image, with a new integration suite created for it. The oracle.jdbc jar would normally be downloaded from the Maven repository, but since no JDBC jar is available there, the jar was downloaded manually from the Oracle site and installed into the local repository for testing. So, for a SparkQA test run, the ojdbc jar might need to be manually placed in the local Maven repository (com/oracle/ojdbc6/11.2.0.2.0).
      
      Author: thomastechs <thomas.sebastian@tcs.com>
      
      Closes #11306 from thomastechs/master.
      8afe4914
  3. Feb 25, 2016
    • Yanbo Liang's avatar
      [SPARK-13504] [SPARKR] Add approxQuantile for SparkR · 50e60e36
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Add ```approxQuantile``` for SparkR.
      ## How was this patch tested?
      unit tests
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11383 from yanboliang/spark-13504 and squashes the following commits:
      
      4f17adb [Yanbo Liang] Add approxQuantile for SparkR
      50e60e36
    • Tommy YU's avatar
      [SPARK-13033] [ML] [PYSPARK] Add import/export for ml.regression · f3be369e
      Tommy YU authored
Add export/import for all estimators and transformers (which have a Scala implementation) under pyspark/ml/regression.py.
      
      yanboliang Please help to review.
For the doctest, I thought it's enough to add one since it's common usage, but I can add it to all if we want.
      
      Author: Tommy YU <tummyyu@163.com>
      
      Closes #11000 from Wenpei/spark-13033-ml.regression-exprot-import and squashes the following commits:
      
      3646b36 [Tommy YU] address review comments
      9cddc98 [Tommy YU] change base on review and pr 11197
      cc61d9d [Tommy YU] remove default parameter set
      19535d4 [Tommy YU] add export/import to regression
      44a9dc2 [Tommy YU] add import/export for ml.regression
      f3be369e
    • Yuhao Yang's avatar
      [SPARK-13028] [ML] Add MaxAbsScaler to ML.feature as a transformer · 90d07154
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-13028
MaxAbsScaler works in a very similar way to MinMaxScaler, but scales the training data into the range [-1, 1] by dividing by the maximum absolute value in each feature. The motivation for this scaling includes robustness to very small standard deviations of features and preservation of zero entries in sparse data.
      
      Unlike StandardScaler and MinMaxScaler, MaxAbsScaler does not shift/center the data, and thus does not destroy any sparsity.
      
      Something similar from sklearn:
      http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler
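
A hedged Scala usage sketch (the toy data and column names are illustrative, `sqlContext` is assumed to be in scope, and the vector import reflects the pre-2.0 package layout):

```scala
import org.apache.spark.ml.feature.MaxAbsScaler
import org.apache.spark.mllib.linalg.Vectors

val dataFrame = sqlContext.createDataFrame(Seq(
  (0, Vectors.dense(1.0,  0.1, -8.0)),
  (1, Vectors.dense(2.0,  1.0, -4.0)),
  (2, Vectors.dense(4.0, 10.0,  8.0))
)).toDF("id", "features")

val scaler = new MaxAbsScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

// fit() finds the max absolute value per feature; transform() divides by it,
// so values land in [-1, 1] and zero entries stay zero.
scaler.fit(dataFrame).transform(dataFrame).show()
```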
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #10939 from hhbyyh/maxabs and squashes the following commits:
      
      fd8bdcd [Yuhao Yang] add tag and some optimization on fit
      648fced [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs
      75bebc2 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs
      cb10bb6 [Yuhao Yang] remove minmax
      91ef8f3 [Yuhao Yang] ut added
      8ab0747 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs
      a9215b5 [Yuhao Yang] max abs scaler
      90d07154
    • Takeshi YAMAMURO's avatar
      [SPARK-13361][SQL] Add benchmark codes for Encoder#compress() in CompressionSchemeBenchmark · 1b39fafa
      Takeshi YAMAMURO authored
This PR adds benchmark code for Encoder#compress().
      Also, it replaced the benchmark results with new ones because the output format of `Benchmark` changed.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #11236 from maropu/CompressionSpike.
      1b39fafa
    • Josh Rosen's avatar
      [SPARK-12757] Add block-level read/write locks to BlockManager · 633d63a4
      Josh Rosen authored
      ## Motivation
      
      As a pre-requisite to off-heap caching of blocks, we need a mechanism to prevent pages / blocks from being evicted while they are being read. With on-heap objects, evicting a block while it is being read merely leads to memory-accounting problems (because we assume that an evicted block is a candidate for garbage-collection, which will not be true during a read), but with off-heap memory this will lead to either data corruption or segmentation faults.
      
      ## Changes
      
      ### BlockInfoManager and reader/writer locks
      
      This patch adds block-level read/write locks to the BlockManager. It introduces a new `BlockInfoManager` component, which is contained within the `BlockManager`, holds the `BlockInfo` objects that the `BlockManager` uses for tracking block metadata, and exposes APIs for locking blocks in either shared read or exclusive write modes.
      
`BlockManager`'s `get*()` and `put*()` methods now implicitly acquire the necessary locks. After a `get()` call successfully retrieves a block, that block is locked in a shared read mode. A `put()` call will block until it acquires an exclusive write lock. If the write succeeds, the write lock will be downgraded to a shared read lock before returning to the caller. This `put()` locking behavior allows us to store a block and then immediately turn around and read it without having to worry about it having been evicted between the write and the read, which will allow us to significantly simplify `CacheManager` in the future (see #10748).
      
      See `BlockInfoManagerSuite`'s test cases for a more detailed specification of the locking semantics.
      
      ### Auto-release of locks at the end of tasks
      
      Our locking APIs support explicit release of locks (by calling `unlock()`), but it's not always possible to guarantee that locks will be released prior to the end of the task. One reason for this is our iterator interface: since our iterators don't support an explicit `close()` operator to signal that no more records will be consumed, operations like `take()` or `limit()` don't have a good means to release locks on their input iterators' blocks. Another example is broadcast variables, whose block locks can only be released at the end of the task.
      
      To address this, `BlockInfoManager` uses a pair of maps to track the set of locks acquired by each task. Lock acquisitions automatically record the current task attempt id by obtaining it from `TaskContext`. When a task finishes, code in `Executor` calls `BlockInfoManager.unlockAllLocksForTask(taskAttemptId)` to free locks.
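
A greatly simplified, hedged sketch of that per-task bookkeeping (the real `BlockInfoManager` also maintains reader counts, blocking acquisition, and lock downgrades; all names here are illustrative):

```scala
import scala.collection.mutable

class LockTrackingSketch {
  // taskAttemptId -> blockIds the task currently holds locks on
  private val readLocksByTask  = mutable.HashMap.empty[Long, mutable.Set[String]]
  private val writeLocksByTask = mutable.HashMap.empty[Long, mutable.Set[String]]

  def registerReadLock(taskAttemptId: Long, blockId: String): Unit = synchronized {
    readLocksByTask.getOrElseUpdate(taskAttemptId, mutable.Set.empty) += blockId
  }

  def registerWriteLock(taskAttemptId: Long, blockId: String): Unit = synchronized {
    writeLocksByTask.getOrElseUpdate(taskAttemptId, mutable.Set.empty) += blockId
  }

  // Called from the Executor when a task finishes, so locks never outlive the task.
  def unlockAllLocksForTask(taskAttemptId: Long): Unit = synchronized {
    readLocksByTask.remove(taskAttemptId)
    writeLocksByTask.remove(taskAttemptId)
    // The real implementation also decrements per-block reader counts and notifies waiters.
  }
}
```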
      
      ### Locking and the MemoryStore
      
      In order to prevent in-memory blocks from being evicted while they are being read, the `MemoryStore`'s `evictBlocksToFreeSpace()` method acquires write locks on blocks which it is considering as candidates for eviction. These lock acquisitions are non-blocking, so a block which is being read will not be evicted. By holding write locks until the eviction is performed or skipped (in case evicting the blocks would not free enough memory), we avoid a race where a new reader starts to read a block after the block has been marked as an eviction candidate but before it has been removed.
      
      ### Locking and remote block transfer
      
This patch makes small changes to block transfer and network layer code so that locks acquired by the BlockTransferService are released as soon as block transfer messages are consumed and released by Netty. This builds on top of #11193, a bug fix related to freeing of network layer ManagedBuffers.
      
      ## FAQ
      
      - **Why not use Java's built-in [`ReadWriteLock`](https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReadWriteLock.html)?**
      
        Our locks operate on a per-task rather than per-thread level. Under certain circumstances a task may consist of multiple threads, so using `ReadWriteLock` would mean that we might call `unlock()` from a thread which didn't hold the lock in question, an operation which has undefined semantics. If we could rely on Java 8 classes, we might be able to use [`StampedLock`](https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/StampedLock.html) to work around this issue.
      
      - **Why not detect "leaked" locks in tests?**:
      
        See above notes about `take()` and `limit`.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10705 from JoshRosen/pin-pages.
      633d63a4
    • Timothy Chen's avatar
      [SPARK-13387][MESOS] Add support for SPARK_DAEMON_JAVA_OPTS with MesosClusterDispatcher. · 71299575
      Timothy Chen authored
      ## What changes were proposed in this pull request?
      
      Add support for SPARK_DAEMON_JAVA_OPTS with MesosClusterDispatcher.
      
## How was this patch tested?
      
      Manual testing by launching dispatcher with SPARK_DAEMON_JAVA_OPTS
      
      Author: Timothy Chen <tnachen@gmail.com>
      
      Closes #11277 from tnachen/cluster_dispatcher_opts.
      71299575
    • Josh Rosen's avatar
      [SPARK-13501] Remove use of Guava Stopwatch · f2cfafdf
      Josh Rosen authored
      Our nightly doc snapshot builds are failing due to some issue involving the Guava Stopwatch constructor:
      
      ```
      [error] /home/jenkins/workspace/spark-master-docs/spark/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:496: constructor Stopwatch in class Stopwatch cannot be accessed in class CoarseMesosSchedulerBackend
      [error]     val stopwatch = new Stopwatch()
      [error]                     ^
      ```
      
      This Stopwatch constructor was deprecated in newer versions of Guava (https://github.com/google/guava/commit/fd0cbc2c5c90e85fb22c8e86ea19630032090943) and it's possible that some classpath issues affecting Unidoc could be causing this to trigger compilation failures.
      
      In order to work around this issue, this patch removes this use of Stopwatch since we don't use it anywhere else in the Spark codebase.
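
A hedged sketch of the kind of replacement this implies (the timing target is illustrative):

```scala
// Before: relies on the now-inaccessible Guava Stopwatch constructor
//   val stopwatch = new Stopwatch()
//   stopwatch.start()
// After: plain JDK timing, no Guava dependency
val startTimeNs = System.nanoTime()
// ... the work being timed ...
val elapsedMs = (System.nanoTime() - startTimeNs) / 1000000
```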
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11376 from JoshRosen/remove-stopwatch.
      f2cfafdf
    • hushan's avatar
      [SPARK-12009][YARN] Avoid to re-allocating yarn container while driver want to stop all Executors · 7a6ee8a8
      hushan authored
      Author: hushan <hushan@xiaomi.com>
      
      Closes #9992 from suyanNone/tricky.
      7a6ee8a8
    • Liwei Lin's avatar
      [SPARK-13468][WEB UI] Fix a corner case where the Stage UI page should show DAG but it doesn't show · dc6c5ea4
      Liwei Lin authored
When a user clicks more than once on any stage in the DAG graph on the *Job* web UI page, many new *Stage* web UI pages are opened, but only half of their DAG graphs are expanded.
      
      After this PR's fix, every newly opened *Stage* page's DAG graph is expanded.
      
      Before:
      ![](https://cloud.githubusercontent.com/assets/15843379/13279144/74808e86-db10-11e5-8514-cecf31af8908.png)
      
      After:
      ![](https://cloud.githubusercontent.com/assets/15843379/13279145/77ca5dec-db10-11e5-9457-8e1985461328.png)
      
      ## What changes were proposed in this pull request?
      
- Removed the `expandDagViz` parameter for the _Stage_ page and related code
- Added an `onclick` function setting `expandDagVizArrowKey(false)` to `true`
      
      ## How was this patch tested?
      
Manual tests (with this fix) to verify this fix works:
      - clicked many times on _Job_ Page's DAG Graph → each newly opened Stage page's DAG graph is expanded
      
Manual tests (with this fix) to verify this fix does not break features we already had:
      - refreshed many times for a same _Stage_ page (whose DAG already expanded) → DAG remained expanded upon every refresh
      - refreshed many times for a same _Stage_ page (whose DAG unexpanded) → DAG remained unexpanded upon every refresh
      - refreshed many times for a same _Job_ page (whose DAG already expanded) → DAG remained expanded upon every refresh
      - refreshed many times for a same _Job_ page (whose DAG unexpanded) → DAG remained unexpanded upon every refresh
      
      Author: Liwei Lin <proflin.me@gmail.com>
      
      Closes #11368 from proflin/SPARK-13468.
      dc6c5ea4
    • Yu ISHIKAWA's avatar
      [SPARK-13292] [ML] [PYTHON] QuantileDiscretizer should take random seed in PySpark · 35316cb0
      Yu ISHIKAWA authored
      ## What changes were proposed in this pull request?
      QuantileDiscretizer in Python should also specify a random seed.
      
      ## How was this patch tested?
      unit tests
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #11362 from yu-iskw/SPARK-13292 and squashes the following commits:
      
      02ffa76 [Yu ISHIKAWA] [SPARK-13292][ML][PYTHON] QuantileDiscretizer should take random seed in PySpark
      35316cb0
    • Yu ISHIKAWA's avatar
      [SPARK-12874][ML] ML StringIndexer does not protect itself from column name duplication · 14e2700d
      Yu ISHIKAWA authored
      ## What changes were proposed in this pull request?
      ML StringIndexer does not protect itself from column name duplication.
      
We should still improve the way we validate the schema of `StringIndexer` and `StringIndexerModel`; however, it would be better to address that in another issue.
      
      ## How was this patch tested?
      unit test
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #11370 from yu-iskw/SPARK-12874.
      14e2700d
    • Lin Zhao's avatar
      [SPARK-13069][STREAMING] Add "ask" style store() to ActorReciever · fb8bb047
      Lin Zhao authored
Introduces an "ask"-style ```store``` in ```ActorReceiver``` as a way to allow the actor receiver to be blocked by back pressure or maxRate.
      
      Author: Lin Zhao <lin@exabeam.com>
      
      Closes #11176 from lin-zhao/SPARK-13069.
      fb8bb047
    • Davies Liu's avatar
      Revert "[SPARK-13457][SQL] Removes DataFrame RDD operations" · 751724b1
      Davies Liu authored
      This reverts commit 157fe64f.
      751724b1
    • Shixiong Zhu's avatar
      46f6e793
    • huangzhaowei's avatar
      [SPARK-12316] Wait a minutes to avoid cycle calling. · 5fcf4c2b
      huangzhaowei authored
When the application ends, the AM cleans the staging dir.
But if the driver then tries to update the delegation token, it can't find the right token file and ends up calling the method 'updateCredentialsIfRequired' in an endless cycle.
This leads to a StackOverflowError in the driver.
      https://issues.apache.org/jira/browse/SPARK-12316
      
      Author: huangzhaowei <carlmartinmax@gmail.com>
      
      Closes #10475 from SaintBacchus/SPARK-12316.
      5fcf4c2b
    • Cheng Lian's avatar
      [SPARK-13457][SQL] Removes DataFrame RDD operations · 157fe64f
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR removes DataFrame RDD operations. Original calls are now replaced by calls to methods of `DataFrame.rdd`.
      
## How was this patch tested?
      
      No extra tests are added. Existing tests should do the work.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11323 from liancheng/remove-df-rdd-ops.
      157fe64f
    • Yanbo Liang's avatar
      [SPARK-13490][ML] ML LinearRegression should cache standardization param value · 4460113d
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
Like #11027 for ```LogisticRegression```, ```LinearRegression``` with L1 regularization should also cache the value of the ```standardization``` param rather than re-fetching it from the ```ParamMap``` on every OWLQN iteration.
      cc srowen
      
      ## How was this patch tested?
      No extra tests are added. It should pass all existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11367 from yanboliang/spark-13490.
      4460113d
    • Michael Gummelt's avatar
      [SPARK-13439][MESOS] Document that spark.mesos.uris is comma-separated · c98a93de
      Michael Gummelt authored
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #11311 from mgummelt/document_csv.
      c98a93de
    • Terence Yim's avatar
      [SPARK-13441][YARN] Fix NPE in yarn Client.createConfArchive method · fae88af1
      Terence Yim authored
      ## What changes were proposed in this pull request?
      
Instead of using the result of File.listFiles() directly, which may be null and cause an NPE, check for null first. If it is null, log a warning instead.
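
A hedged sketch of the guard described above (the directory and the log call are illustrative):

```scala
import java.io.File

val dir = new File(System.getProperty("java.io.tmpdir"))
val files = dir.listFiles()   // may be null on I/O error or if dir is not a directory
if (files == null) {
  println(s"Warning: failed to list files under ${dir.getAbsolutePath}")
} else {
  files.foreach(f => println(f.getName))
}
```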
      
## How was this patch tested?
      
      Ran the ./dev/run-tests locally
      Tested manually on a cluster
      
      Author: Terence Yim <terence@cask.co>
      
      Closes #11337 from chtyim/fixes/SPARK-13441-null-check.
      fae88af1
    • Oliver Pierson's avatar
      [SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames · 6f8e835c
      Oliver Pierson authored
      ## What changes were proposed in this pull request?
      
      Change line 113 of QuantileDiscretizer.scala to
      
      `val requiredSamples = math.max(numBins * numBins, 10000.0)`
      
      so that `requiredSamples` is a `Double`.  This will fix the division in line 114 which currently results in zero if `requiredSamples < dataset.count`
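
A hedged illustration of the bug with toy numbers (line references are to QuantileDiscretizer.scala as cited above):

```scala
val numBins = 100
val datasetCount = 5000000L

val requiredSamplesOld = math.max(numBins * numBins, 10000)    // Int
val fractionOld = requiredSamplesOld / datasetCount            // integer division => 0: nothing is sampled

val requiredSamplesNew = math.max(numBins * numBins, 10000.0)  // Double (the fix)
val fractionNew = requiredSamplesNew / datasetCount            // ≈ 0.002, a usable sampling fraction
```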
      
## How was this patch tested?
Manual tests. I was having problems using QuantileDiscretizer with my dataset, and after making this change QuantileDiscretizer behaves as expected.
      
      Author: Oliver Pierson <ocp@gatech.edu>
      Author: Oliver Pierson <opierson@umd.edu>
      
      Closes #11319 from oliverpierson/SPARK-13444.
      6f8e835c
    • Cheng Lian's avatar
      [SPARK-13473][SQL] Don't push predicate through project with nondeterministic field(s) · 3fa6491b
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      Predicates shouldn't be pushed through project with nondeterministic field(s).
      
      See https://github.com/graphframes/graphframes/pull/23 and SPARK-13473 for more details.
      
      This PR targets master, branch-1.6, and branch-1.5.
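
A hedged illustration of why the push-down must not happen (names and the nondeterministic expression are illustrative, and `sqlContext` is assumed to be in scope):

```scala
import org.apache.spark.sql.functions._

// The nondeterministic column's values depend on which rows reach the project,
// so evaluating the filter first (below the project) could change the produced ids.
val df        = sqlContext.range(0, 100).toDF("value")
val projected = df.select(col("value"), monotonicallyIncreasingId().as("id"))
val filtered  = projected.filter(col("value") > 50)   // must stay above the project
```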
      
      ## How was this patch tested?
      
      A test case is added in `FilterPushdownSuite`. It constructs a query plan where a filter is over a project with a nondeterministic field. Optimized query plan shouldn't change in this case.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11348 from liancheng/spark-13473-no-ppd-through-nondeterministic-project-field.
      3fa6491b
    • Devaraj K's avatar
      [SPARK-13117][WEB UI] WebUI should use the local ip not 0.0.0.0 · 2e44031f
      Devaraj K authored
Fixed the HTTP server host name/IP issue, i.e., the HTTP server now takes the
configured host name/IP and not always '0.0.0.0'.
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #11133 from devaraj-kavali/SPARK-13117.
      2e44031f
    • Reynold Xin's avatar
      [SPARK-13486][SQL] Move SQLConf into an internal package · 2b2c8c33
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch moves SQLConf into org.apache.spark.sql.internal package to make it very explicit that it is internal. Soon I will also submit more API work that creates implementations of interfaces in this internal package.
      
      ## How was this patch tested?
      If it compiles, then the refactoring should work.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11363 from rxin/SPARK-13486.
      2b2c8c33
    • Davies Liu's avatar
      [SPARK-13376] [SPARK-13476] [SQL] improve column pruning · 07f92ef1
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
This PR mostly rewrites the ColumnPruning rule to support most of the SQL logical plans (except those for Dataset).
      
This PR also fixes a bug in Generate (it should always output UnsafeRow) and adds a regression test for that.
      
      ## How was this patch tested?
      
This is tested by unit tests, and also manually tested with TPCDS Q78, which could prune all unused columns successfully, improving performance by 78% (from 22s to 12s).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11354 from davies/fix_column_pruning.
      07f92ef1