  1. Dec 14, 2015
    • [SPARK-12281][CORE] Fix a race condition when reporting ExecutorState in the shutdown hook · 2aecda28
      Shixiong Zhu authored
      1. Make sure workers and masters exit so that no worker or master will still be running when triggering the shutdown hook.
      2. Set ExecutorState to FAILED if it's still RUNNING when executing the shutdown hook.
      
      This should fix the potential exceptions when exiting a local cluster:
      ```
      java.lang.AssertionError: assertion failed: executor 4 state transfer from RUNNING to RUNNING is illegal
      	at scala.Predef$.assert(Predef.scala:179)
      	at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260)
      	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
      	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
      	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
      	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      
      java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
      	at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:246)
      	at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:191)
      	at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:180)
      	at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:73)
      	at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:474)
      	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
      	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
      	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
      	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      ```
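
      A simplified sketch of the second point above (illustrative only, not the actual ExecutorRunner/Worker code): when the shutdown hook fires, an executor still marked RUNNING is moved to the terminal FAILED state before it is reported, so the master never receives an illegal RUNNING-to-RUNNING transition.
      ```scala
      // Hypothetical sketch; Spark's real ExecutorState and shutdown-hook plumbing differ in detail.
      object ExecutorStateSketch extends Enumeration {
        val LAUNCHING, RUNNING, KILLED, FAILED, LOST, EXITED = Value
      }

      class ExecutorRunnerSketch(report: ExecutorStateSketch.Value => Unit) {
        @volatile var state: ExecutorStateSketch.Value = ExecutorStateSketch.RUNNING

        sys.addShutdownHook {
          // kill the executor process here, then make sure the final reported state is terminal
          if (state == ExecutorStateSketch.RUNNING) {
            state = ExecutorStateSketch.FAILED
            report(state)
          }
        }
      }
      ```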
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10269 from zsxwing/executor-state.
      2aecda28
  2. Dec 12, 2015
  3. Dec 11, 2015
  4. Dec 10, 2015
    • [SPARK-12258][SQL] passing null into ScalaUDF · b1b4ee7f
      Davies Liu authored
      Check nullability and pass it into ScalaUDF.
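
      For context, a minimal usage sketch (not from this patch; the column and UDF names are made up) of the user-facing situation the fix addresses: a Scala UDF applied to a nullable column can receive null and must behave sensibly when it does.
      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext
      import org.apache.spark.sql.functions.udf

      object NullableUdfExample {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("udf-null").setMaster("local[2]"))
          val sqlContext = new SQLContext(sc)
          import sqlContext.implicits._

          // The "word" column is nullable, so the UDF may be handed a null.
          val df = Seq(("a", "hello"), ("b", null)).toDF("id", "word")
          val shout = udf((s: String) => if (s == null) null else s.toUpperCase)
          df.select(df("id"), shout(df("word"))).show()
          sc.stop()
        }
      }
      ```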
      
      Closes #10249
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10259 from davies/udf_null.
      b1b4ee7f
    • [STREAMING][DOC][MINOR] Update the description of direct Kafka stream doc · 24d3357d
      jerryshao authored
      With the merge of [SPARK-8337](https://issues.apache.org/jira/browse/SPARK-8337), the Python API now has the same functionality as Scala/Java, so this changes the description to make it more precise.

      zsxwing tdas, please review, thanks a lot.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #10246 from jerryshao/direct-kafka-doc-update.
      24d3357d
    • [SPARK-12155][SPARK-12253] Fix executor OOM in unified memory management · 5030923e
      Andrew Or authored
      **Problem.** In unified memory management, acquiring execution memory may lead to eviction of storage memory. However, the space freed from evicting cached blocks is distributed among all active tasks. Thus, an incorrect upper bound on the execution memory per task can cause the acquisition to fail, leading to OOMs and premature spills.
      
      **Example.** Suppose total memory is 1000B, cached blocks occupy 900B, `spark.memory.storageFraction` is 0.4, and there are two active tasks. In this case, the cap on task execution memory is 100B / 2 = 50B. If task A tries to acquire 200B, it will evict 100B of storage but can only acquire 50B because of the incorrect cap. For another example, see this [regression test](https://github.com/andrewor14/spark/blob/fix-oom/core/src/test/scala/org/apache/spark/memory/UnifiedMemoryManagerSuite.scala#L233) that I stole from JoshRosen.
      
      **Solution.** Fix the cap on task execution memory. It should take into account the space that could have been freed by storage in addition to the current amount of memory available to execution. In the example above, the correct cap should have been 600B / 2 = 300B.
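
      A small arithmetic sketch of the example above (numbers only; not the patch's code):
      ```scala
      object MemoryCapExample {
        def main(args: Array[String]): Unit = {
          val maxMemory       = 1000L   // total unified memory
          val storageUsed     = 900L    // bytes held by cached blocks
          val storageFraction = 0.4     // spark.memory.storageFraction
          val numActiveTasks  = 2

          val storageRegion = (maxMemory * storageFraction).toLong   // 400B immune to eviction
          val freeExecution = maxMemory - storageUsed                // 100B currently free

          // Old (incorrect) cap: counts only memory already free for execution.
          val oldCap = freeExecution / numActiveTasks                // 100B / 2 = 50B

          // Fixed cap: also counts storage memory that execution is allowed to evict.
          val evictable = math.max(storageUsed - storageRegion, 0L)  // 500B
          val newCap = (freeExecution + evictable) / numActiveTasks  // 600B / 2 = 300B

          println(s"old cap per task = $oldCap, fixed cap per task = $newCap")
        }
      }
      ```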
      
      This patch also guards against the race condition (SPARK-12253):
      (1) Existing tasks collectively occupy all execution memory
      (2) New task comes in and blocks while existing tasks spill
      (3) After tasks finish spilling, another task jumps in and puts in a large block, stealing the freed memory
      (4) New task still cannot acquire memory and goes back to sleep
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #10240 from andrewor14/fix-oom.
      5030923e
    • [SPARK-12251] Document and improve off-heap memory configurations · 23a9e62b
      Josh Rosen authored
      This patch adds documentation for Spark configurations that affect off-heap memory and makes some naming and validation improvements for those configs.
      
      - Change `spark.memory.offHeapSize` to `spark.memory.offHeap.size`. This is fine because this configuration has not shipped in any Spark release yet (it's new in Spark 1.6).
      - Deprecate `spark.unsafe.offHeap` in favor of a new `spark.memory.offHeap.enabled` configuration. The motivation behind this change is to gather all memory-related configurations under the same prefix.
      - Add a check which prevents users from setting `spark.memory.offHeap.enabled=true` when `spark.memory.offHeap.size == 0`. After SPARK-11389 (#9344), which was committed in Spark 1.6, Spark enforces a hard limit on the amount of off-heap memory that it will allocate to tasks. As a result, enabling off-heap execution memory without setting `spark.memory.offHeap.size` will lead to immediate OOMs. The new configuration validation makes this scenario easier to diagnose, helping to avoid user confusion (a configuration sketch follows this list).
      - Document these configurations on the configuration page.
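
      A hedged configuration sketch of the renamed settings (the app name and size value are placeholders): off-heap memory must be enabled together with a non-zero size, otherwise the new validation rejects the configuration.
      ```scala
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .setAppName("offheap-example")
        .set("spark.memory.offHeap.enabled", "true")    // replaces the deprecated spark.unsafe.offHeap
        .set("spark.memory.offHeap.size", "268435456")  // 256 MB in bytes; must be > 0 when enabled
      ```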
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10237 from JoshRosen/SPARK-12251.
      23a9e62b
    • [SPARK-11713] [PYSPARK] [STREAMING] Initial RDD updateStateByKey for PySpark · 6a6c1fc5
      Bryan Cutler authored
      Adding the ability to define an initial state RDD for use with updateStateByKey in PySpark. Added a unit test and changed the stateful_network_wordcount example to use an initial RDD.
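
      For reference, a minimal sketch of the equivalent Scala API that this change mirrors in PySpark (hostname, port, and checkpoint path are placeholders):
      ```scala
      import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object InitialStateExample {
        def main(args: Array[String]): Unit = {
          val sc  = new SparkContext(new SparkConf().setAppName("initial-state").setMaster("local[2]"))
          val ssc = new StreamingContext(sc, Seconds(1))
          ssc.checkpoint("/tmp/checkpoint")

          // Seed the state with pre-existing counts.
          val initialRDD = sc.parallelize(Seq(("hello", 1), ("world", 1)))

          val lines = ssc.socketTextStream("localhost", 9999)
          val words = lines.flatMap(_.split(" ")).map((_, 1))

          val updateFunc = (values: Seq[Int], state: Option[Int]) => Some(values.sum + state.getOrElse(0))

          val counts = words.updateStateByKey[Int](
            updateFunc, new HashPartitioner(sc.defaultParallelism), initialRDD)

          counts.print()
          ssc.start()
          ssc.awaitTermination()
        }
      }
      ```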
      
      Author: Bryan Cutler <bjcutler@us.ibm.com>
      
      Closes #10082 from BryanCutler/initial-rdd-updateStateByKey-SPARK-11713.
      6a6c1fc5
    • [SPARK-11563][CORE][REPL] Use RpcEnv to transfer REPL-generated classes. · 4a46b885
      Marcelo Vanzin authored
      This avoids bringing up yet another HTTP server on the driver, and
      instead reuses the file server already managed by the driver's
      RpcEnv. As a bonus, the repl now inherits the security features of
      the network library.
      
      There's also a small change to create the directory for storing classes
      under the root temp dir for the application (instead of directly
      under java.io.tmpdir).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9923 from vanzin/SPARK-11563.
      4a46b885
    • [SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, spark.mllib and mllib in the documentation. · 2ecbe02d
      Timothy Hunter authored
      
      Replaces a number of occurrences of `MLlib` in the documentation that were meant to refer to the `spark.mllib` package instead. It should clarify for new users the difference between `spark.mllib` (the package) and MLlib (the umbrella project for ML in Spark).
      
      It also removes some files that I forgot to delete with #10207
      
      Author: Timothy Hunter <timhunter@databricks.com>
      
      Closes #10234 from thunterdb/12212.
      2ecbe02d
    • [SPARK-12228][SQL] Try to run execution hive's derby in memory. · ec5f9ed5
      Yin Huai authored
      This PR tries to make execution Hive's Derby run in memory, since it is a fake metastore and every time we create a HiveContext we switch to a new one. It is possible that it can reduce the flakiness of our tests that need to create a HiveContext (e.g. HiveSparkSubmitSuite). I will test it more.
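
      A hedged illustration of the idea (the exact property wiring in the patch may differ): Derby can run purely in memory by pointing the metastore's JDO connection URL at Derby's `memory:` sub-protocol instead of an on-disk database.
      ```scala
      // Hypothetical values for illustration only.
      val onDiskUrl   = "jdbc:derby:;databaseName=/tmp/metastore_db;create=true"
      val inMemoryUrl = "jdbc:derby:memory:metastore_db;create=true"
      // e.g. set "javax.jdo.option.ConnectionURL" to inMemoryUrl in the execution Hive's conf
      ```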
      
      https://issues.apache.org/jira/browse/SPARK-12228
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #10204 from yhuai/derbyInMemory.
      ec5f9ed5
    • [SPARK-12250][SQL] Allow users to define a UDAF without providing details of its inputSchema · bc5f56aa
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-12250
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #10236 from yhuai/SPARK-12250.
      bc5f56aa
    • [SPARK-12234][SPARKR] Fix ```subset``` function error when only set ```select``` argument · d9d354ed
      Yanbo Liang authored
      Fix the ```subset``` function error when only the ```select``` argument is set. Please refer to the [JIRA](https://issues.apache.org/jira/browse/SPARK-12234) for the error and how to reproduce it.
      
      cc sun-rui felixcheung shivaram
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #10217 from yanboliang/spark-12234.
      d9d354ed
    • [SPARK-11602][MLLIB] Refine visibility for 1.6 scala API audit · 9fba9c80
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-11602
      
      Made a pass over the API changes in 1.6. Opening the PR for efficient discussion.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #9939 from hhbyyh/auditScala.
      9fba9c80
    • [SPARK-12198][SPARKR] SparkR support read.parquet and deprecate parquetFile · eeb58722
      Yanbo Liang authored
      SparkR supports ```read.parquet``` and deprecates ```parquetFile```. This change is similar to #10145 for ```jsonFile```.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #10191 from yanboliang/spark-12198.
      eeb58722
    • [SPARK-11832][CORE] Process arguments in spark-shell for Scala 2.11 · db516524
      Jakob Odersky authored
      Process arguments passed to the spark-shell. Fixes running the spark-shell from within a build environment.
      
      Author: Jakob Odersky <jodersky@gmail.com>
      
      Closes #9824 from jodersky/shell-2.11.
      db516524
    • [SPARK-12242][SQL] Add DataFrame.transform method · 76540b6d
      Reynold Xin authored
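      A minimal usage sketch (the helper function names and data are made up): `transform` lets you chain `DataFrame => DataFrame` functions fluently instead of nesting the calls.
      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.{DataFrame, SQLContext}
      import org.apache.spark.sql.functions.lit

      object TransformExample {
        def withGreeting(df: DataFrame): DataFrame = df.withColumn("greeting", lit("hi"))
        def onlyPositive(df: DataFrame): DataFrame = df.filter(df("value") > 0)

        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("transform").setMaster("local[2]"))
          val sqlContext = new SQLContext(sc)
          import sqlContext.implicits._

          val df = Seq(-1, 2, 3).toDF("value")
          // Equivalent to onlyPositive(withGreeting(df)), but reads left to right.
          df.transform(withGreeting).transform(onlyPositive).show()
          sc.stop()
        }
      }
      ```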
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10226 from rxin/df-transform.
      76540b6d
    • [SPARK-11530][MLLIB] Return eigenvalues with PCA model · 21b3d2a7
      Sean Owen authored
      Add `computePrincipalComponentsAndVariance` to also compute PCA's explained variance.
      
      CC mengxr
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #9736 from srowen/SPARK-11530.
      21b3d2a7
    • [SPARK-12136][STREAMING] rddToFileName does not properly handle prefix and suffix parameters · e29704f9
      bomeng authored
      The original code does not properly handle the case where the prefix is null but the suffix is not: the suffix should be used but is not.

      The fix uses a StringBuilder to construct the proper file name.
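
      A sketch of the corrected logic (not the exact Spark code; a plain Long timestamp is used here instead of Spark's `Time` for brevity): a null or empty prefix/suffix is simply skipped.
      ```scala
      def rddToFileName(prefix: String, suffix: String, timeMillis: Long): String = {
        val sb = new StringBuilder
        if (prefix != null && prefix.nonEmpty) sb.append(prefix).append("-")
        sb.append(timeMillis)
        if (suffix != null && suffix.nonEmpty) sb.append(".").append(suffix)
        sb.toString()
      }

      // e.g. rddToFileName(null, "txt", 1450000000000L) == "1450000000000.txt"
      ```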
      
      Author: bomeng <bmeng@us.ibm.com>
      Author: Bo Meng <mengbo@bos-macbook-pro.usca.ibm.com>
      
      Closes #10185 from bomeng/SPARK-12136.
      e29704f9
    • [SPARK-12252][SPARK-12131][SQL] refactor MapObjects to make it less hacky · d8ec081c
      Wenchen Fan authored
      In https://github.com/apache/spark/pull/10133 we found that we should ensure the children of `TreeNode` are all accessible in the `productIterator`, or the behavior will be very confusing.

      In this PR, I try to fix this problem by exposing the `loopVar`.
      
      This also fixes SPARK-12131 which is caused by the hacky `MapObjects`.
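
      An illustrative (non-Spark) sketch of the underlying Scala mechanics: only constructor parameters of a case class appear in `productIterator`, which is what generic tree transformations rely on to reach a node's children, so any child must be exposed as a constructor parameter.
      ```scala
      case class Node(name: String, child: Option[Node])

      object ProductIteratorExample {
        def main(args: Array[String]): Unit = {
          val parent = Node("parent", Some(Node("leaf", None)))
          // Both constructor fields are visible, so a generic traversal can find the child.
          parent.productIterator.foreach(println)   // prints "parent" and "Some(Node(leaf,None))"
        }
      }
      ```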
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10239 from cloud-fan/map-objects.
      d8ec081c
  5. Dec 09, 2015
    • [SPARK-12244][SPARK-12245][STREAMING] Rename trackStateByKey to mapWithState and change tracking function signature · bd2cd4f5
      Tathagata Das authored
      
      SPARK-12244:
      
      Based on feedback from early users and personal experience attempting to explain it, the name trackStateByKey had two problems:
      - "trackState" is a completely new term which really does not give any intuition about what the operation is;
      - the resultant data stream of objects returned by the function is called the "emitted" data in the docs, for lack of a better term.
      "mapWithState" makes sense because the API is like a mapping function (Key, Value) => T with State as an additional parameter. The resultant data stream is "mapped data". So both problems are solved.
      
      SPARK-12245:
      
      From initial experience, not having the key in the function makes it hard to return mapped records, as the full information of the record is not available. Basically the user is restricted to doing something like mapValues() instead of map(). So the key is added as a parameter.
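
      A minimal sketch of the renamed API after both changes (hostname, port, and checkpoint path are placeholders): the mapping function receives the key as well as the new value and the state, and returns a mapped record.
      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

      object MapWithStateExample {
        def main(args: Array[String]): Unit = {
          val sc  = new SparkContext(new SparkConf().setAppName("map-with-state").setMaster("local[2]"))
          val ssc = new StreamingContext(sc, Seconds(1))
          ssc.checkpoint("/tmp/checkpoint")

          val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))

          // (key, new value, state) => mapped record
          def mappingFunc(word: String, one: Option[Int], state: State[Int]): (String, Int) = {
            val newCount = state.getOption.getOrElse(0) + one.getOrElse(0)
            state.update(newCount)
            (word, newCount)
          }

          val counts = words.mapWithState(StateSpec.function(mappingFunc _))
          counts.print()
          ssc.start()
          ssc.awaitTermination()
        }
      }
      ```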
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #10224 from tdas/rename.
      bd2cd4f5
    • [SPARK-11796] Fix httpclient and httpcore dependency issues related to docker-client · 2166c2a7
      Mark Grover authored
      This commit fixes dependency issues which prevented the Docker-based JDBC integration tests from running in the Maven build.
      
      Author: Mark Grover <mgrover@cloudera.com>
      
      Closes #9876 from markgrover/master_docker.
      2166c2a7
    • [SPARK-11678][SQL][DOCS] Document basePath in the programming guide. · ac8cdf1c
      Yin Huai authored
      This PR adds document for `basePath`, which is a new parameter used by `HadoopFsRelation`.
      
      The compiled doc is shown below.
      ![image](https://cloud.githubusercontent.com/assets/2072857/11673132/1ba01192-9dcb-11e5-98d9-ac0b4e92e98c.png)
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-11678
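
      A usage sketch of the documented option (the paths and partition column are placeholders): when reading only a subset of partition directories, `basePath` tells partition discovery where the table root is, so the partition columns are still recovered.
      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext

      object BasePathExample {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("basePath").setMaster("local[2]"))
          val sqlContext = new SQLContext(sc)

          val df = sqlContext.read
            .option("basePath", "/data/events")        // root of the partitioned table
            .parquet("/data/events/date=2015-12-09")   // read a single partition directory
          df.printSchema()                             // still includes the `date` partition column
          sc.stop()
        }
      }
      ```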
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #10211 from yhuai/basePathDoc.
      ac8cdf1c
    • [SPARK-12165][ADDENDUM] Fix outdated comments on unroll test · 8770bd12
      Andrew Or authored
      JoshRosen
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #10229 from andrewor14/unroll-test-comments.
      8770bd12
    • [SPARK-12211][DOC][GRAPHX] Fix version number in graphx doc for migration from 1.1 · 7a8e587d
      Andrew Ray authored
      The "Migrating from Spark 1.1" section added to the GraphX doc in 1.2.0 (see https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#migrating-from-spark-11) uses {{site.SPARK_VERSION}} as the version where the changes were introduced; it should be just 1.2.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #10206 from aray/graphx-doc-1.1-migration.
      7a8e587d
    • [SPARK-11551][DOC] Replace example code in ml-features.md using include_example · 051c6a06
      Xusen Yin authored
      PR on behalf of somideshmukh, thanks!
      
      Author: Xusen Yin <yinxusen@gmail.com>
      Author: somideshmukh <somilde@us.ibm.com>
      
      Closes #10219 from yinxusen/SPARK-11551.
      051c6a06
    • [SPARK-11824][WEBUI] WebUI does not render descriptions with 'bad' HTML, throws console error · 1eb7c22c
      Sean Owen authored
      Don't warn when a description isn't valid HTML, since it may legitimately be something like "SELECT ... WHERE foo <= 1".

      The tests for this code indicate that it's normal to handle strings like this, which don't contain HTML, as plain strings rather than markup. Hence logging every such instance as a warning is too noisy, since it's not a problem. This is an issue for stages whose names contain SQL like the above.
      
      CC tdas as author of this bit of code
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #10159 from srowen/SPARK-11824.
      1eb7c22c
    • [SPARK-12165][SPARK-12189] Fix bugs in eviction of storage memory by execution · aec5ea00
      Josh Rosen authored
      This patch fixes a bug in the eviction of storage memory by execution.
      
      ## The bug:
      
      In general, execution should be able to evict storage memory when the total storage memory usage is greater than `maxMemory * spark.memory.storageFraction`. Due to a bug, however, Spark might wind up evicting no storage memory in certain cases where the storage memory usage was between `maxMemory * spark.memory.storageFraction` and `maxMemory`. For example, here is a regression test which illustrates the bug:
      
      ```scala
          val maxMemory = 1000L
          val taskAttemptId = 0L
          val (mm, ms) = makeThings(maxMemory)
          // Since we used the default storage fraction (0.5), we should be able to allocate 500 bytes
          // of storage memory which are immune to eviction by execution memory pressure.
      
          // Acquire enough storage memory to exceed the storage region size
          assert(mm.acquireStorageMemory(dummyBlock, 750L, evictedBlocks))
          assertEvictBlocksToFreeSpaceNotCalled(ms)
          assert(mm.executionMemoryUsed === 0L)
          assert(mm.storageMemoryUsed === 750L)
      
          // At this point, storage is using 250 more bytes of memory than it is guaranteed, so execution
          // should be able to reclaim up to 250 bytes of storage memory.
          // Therefore, execution should now be able to require up to 500 bytes of memory:
          assert(mm.acquireExecutionMemory(500L, taskAttemptId, MemoryMode.ON_HEAP) === 500L) // <--- fails by only returning 250L
          assert(mm.storageMemoryUsed === 500L)
          assert(mm.executionMemoryUsed === 500L)
          assertEvictBlocksToFreeSpaceCalled(ms, 250L)
      ```
      
      The problem relates to the control flow / interaction between `StorageMemoryPool.shrinkPoolToReclaimSpace()` and `MemoryStore.ensureFreeSpace()`. While trying to allocate the 500 bytes of execution memory, the `UnifiedMemoryManager` discovers that it will need to reclaim 250 bytes of memory from storage, so it calls `StorageMemoryPool.shrinkPoolToReclaimSpace(250L)`. This method, in turn, calls `MemoryStore.ensureFreeSpace(250L)`. However, `ensureFreeSpace()` first checks whether the requested space is less than `maxStorageMemory - storageMemoryUsed`, which will be true if there is any free execution memory because it turns out that `MemoryStore.maxStorageMemory = (maxMemory - onHeapExecutionMemoryPool.memoryUsed)` when the `UnifiedMemoryManager` is used.
      
      The control flow here is somewhat confusing (it grew to be messy / confusing over time / as a result of the merging / refactoring of several components). In the pre-Spark 1.6 code, `ensureFreeSpace` was called directly by the `MemoryStore` itself, whereas in 1.6 it's involved in a confusing control flow where `MemoryStore` calls `MemoryManager.acquireStorageMemory`, which then calls back into `MemoryStore.ensureFreeSpace`, which, in turn, calls `MemoryManager.freeStorageMemory`.
      
      ## The solution:
      
      The solution implemented in this patch is to remove the confusing circular control flow between `MemoryManager` and `MemoryStore`, making the storage memory acquisition process much more linear / straightforward. The key changes:
      
      - Remove a layer of inheritance which made the memory manager code harder to understand (53841174760a24a0df3eb1562af1f33dbe340eb9).
      - Move some bounds checks earlier in the call chain (13ba7ada77f87ef1ec362aec35c89a924e6987cb).
      - Refactor `ensureFreeSpace()` so that the part which evicts blocks can be called independently from the part which checks whether there is enough free space to avoid eviction (7c68ca09cb1b12f157400866983f753ac863380e).
      - Realize that this lets us remove a layer of overloads from `ensureFreeSpace` (eec4f6c87423d5e482b710e098486b3bbc4daf06).
      - Realize that `ensureFreeSpace()` can simply be replaced with an `evictBlocksToFreeSpace()` method which is called [after we've already figured out](https://github.com/apache/spark/blob/2dc842aea82c8895125d46a00aa43dfb0d121de9/core/src/main/scala/org/apache/spark/memory/StorageMemoryPool.scala#L88) how much memory needs to be reclaimed via eviction; (2dc842aea82c8895125d46a00aa43dfb0d121de9).
      
      Along the way, I fixed some problems with the mocks in `MemoryManagerSuite`: the old mocks would [unconditionally](https://github.com/apache/spark/blob/80a824d36eec9d9a9f092ee1741453851218ec73/core/src/test/scala/org/apache/spark/memory/MemoryManagerSuite.scala#L84) report that a block had been evicted even if there was enough space in the storage pool such that eviction would be avoided.
      
      I also fixed a problem where `StorageMemoryPool._memoryUsed` might become negative due to freed memory being double-counted when execution evicts storage. The problem was that `StorageMemoryPool.shrinkPoolToFreeSpace` would [decrement `_memoryUsed`](https://github.com/apache/spark/commit/7c68ca09cb1b12f157400866983f753ac863380e#diff-935c68a9803be144ed7bafdd2f756a0fL133) even though `StorageMemoryPool.freeMemory` had already decremented it as each evicted block was freed. See SPARK-12189 for details.
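
      A rough sketch of the linearized flow after this patch (simplified names and types, not the exact Spark code): the pool first computes how many bytes must be reclaimed, and only then asks the store to evict blocks for exactly that amount, with no call back into the memory manager.
      ```scala
      class StorageMemoryPoolSketch(poolSize: Long) {
        private var memoryUsed: Long = 0L

        // evictBlocksToFreeSpace is supplied by the MemoryStore; it evicts blocks and
        // returns the number of bytes actually reclaimed.
        def acquireMemory(numBytes: Long, evictBlocksToFreeSpace: Long => Long): Boolean = {
          // Decide up front how much must be freed via eviction.
          val numBytesToFree = math.max(0L, numBytes - (poolSize - memoryUsed))
          if (numBytesToFree > 0) {
            val freed = evictBlocksToFreeSpace(numBytesToFree)
            memoryUsed -= freed   // count the freed bytes exactly once
          }
          val enoughMemory = numBytes <= poolSize - memoryUsed
          if (enoughMemory) memoryUsed += numBytes
          enoughMemory
        }
      }
      ```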
      
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #10170 from JoshRosen/SPARK-12165.
      aec5ea00
    • [SPARK-12241][YARN] Improve failure reporting in Yarn client obtainTokenForHBase() · 442a7715
      Steve Loughran authored
      This lines up the HBase token logic with what was done for Hive in SPARK-11265: reflection, with only ClassNotFoundException being swallowed.
      
      There is a test, one which doesn't try to put HBase on the yarn/test classpath and really do the reflection (the way the Hive introspection does). If people do want that, it could be added with careful POM work.
      
      Also: cut an incorrect comment from the Hive test case before copying it, and removed a couple of imports that may have been related to the Hive test in the past.
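
      An illustrative sketch of the pattern (simplified; the real helper's wiring and return type differ): obtain the HBase delegation token via reflection, swallowing only ClassNotFoundException (HBase not on the classpath) so that any other failure is reported instead of hidden.
      ```scala
      import org.apache.hadoop.conf.Configuration

      object HBaseTokenSketch {
        def obtainTokenForHBase(conf: Configuration): Option[AnyRef] = {
          try {
            val hbaseConf = Class.forName("org.apache.hadoop.hbase.HBaseConfiguration")
              .getMethod("create", classOf[Configuration])
              .invoke(null, conf)
            val obtainToken = Class.forName("org.apache.hadoop.hbase.security.token.TokenUtil")
              .getMethod("obtainToken", classOf[Configuration])
            Some(obtainToken.invoke(null, hbaseConf))
          } catch {
            // HBase classes absent: not an error, just skip the token.
            case _: ClassNotFoundException => None
            // Anything else propagates, so misconfiguration is surfaced instead of hidden.
          }
        }
      }
      ```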
      
      Author: Steve Loughran <stevel@hortonworks.com>
      
      Closes #10227 from steveloughran/stevel/patches/SPARK-12241-obtainTokenForHBase.
      442a7715