  1. Dec 14, 2015
    • [SPARK-12288] [SQL] Support UnsafeRow in Coalesce/Except/Intersect. · 606f99b9
      gatorsmile authored
      Support UnsafeRow in Coalesce/Except/Intersect.
      
      Could you review whether my code changes are OK, davies? Thank you!
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #10285 from gatorsmile/unsafeSupportCIE.
    • [SPARK-12188][SQL][FOLLOW-UP] Code refactoring and comment correction in Dataset APIs · d13ff82c
      gatorsmile authored
      marmbrus, this PR addresses your comment. Thanks for your review!
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #10214 from gatorsmile/followup12188.
    • [SPARK-12274][SQL] WrapOption should not have type constraint for child · 9ea1a8ef
      Wenchen Fan authored
      I think it was a mistake, and we had not caught it until https://github.com/apache/spark/pull/10260, which began to check whether the `fromRowExpression` is resolved.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10263 from cloud-fan/encoder.
    • [SPARK-12327] Disable commented code lintr temporarily · fb3778de
      Shivaram Venkataraman authored
      cc yhuai felixcheung shaneknapp
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #10300 from shivaram/comment-lintr-disable.
    • [SPARK-12016] [MLLIB] [PYSPARK] Wrap Word2VecModel when loading it in pyspark · b51a4cdf
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-12016
      
      We should not use Word2VecModel directly in pyspark. We need to wrap it in a Word2VecModelWrapper when loading it.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #10100 from viirya/fix-load-py-wordvecmodel.
    • [MINOR][DOC] Fix broken word2vec link · e25f1fe4
      BenFradet authored
      Follow-up of [SPARK-12199](https://issues.apache.org/jira/browse/SPARK-12199) and #10193, where a broken link had been left as is.
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #10282 from BenFradet/SPARK-12199.
    • [SPARK-12275][SQL] No plan for BroadcastHint in some condition · ed87f6d3
      yucai authored
      When SparkStrategies.BasicOperators's `case BroadcastHint(child) => apply(child)` is hit, it only recursively invokes BasicOperators.apply on this `child`. Many strategies therefore get no chance to process the plan, which can lead to a "No plan" issue, so we use planLater to run the child through all strategies.
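
      Schematically (a paraphrase of the fix, not the exact diff):
      ```scala
      // Before: only BasicOperators recurses into `child`, so the other
      // strategies never get a chance to plan it, and planning can fail.
      //   case BroadcastHint(child) => apply(child)
      //
      // After: planLater defers `child`, letting every strategy try to plan it.
      //   case BroadcastHint(child) => planLater(child) :: Nil
      ```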
      
      https://issues.apache.org/jira/browse/SPARK-12275
      
      Author: yucai <yucai.yu@intel.com>
      
      Closes #10265 from yucai/broadcast_hint.
    • [SPARK-12213][SQL] use multiple partitions for single distinct query · 834e7148
      Davies Liu authored
      Currently, we can generate two different plans for a query with a single distinct aggregation, depending on `spark.sql.specializeSingleDistinctAggPlanning`: one works better on low-cardinality columns, the other (the default) works better on high-cardinality columns.
      
      This PR changes the planner to generate a single plan (three aggregations and two exchanges) that works well in both cases, so we can safely remove the flag `spark.sql.specializeSingleDistinctAggPlanning` (introduced in 1.6).
      
      A query like `SELECT COUNT(DISTINCT a) FROM table` will be planned as:
      ```
      AGG-4 (count distinct)
        Shuffle to a single reducer
          Partial-AGG-3 (count distinct, no grouping)
            Partial-AGG-2 (grouping on a)
              Shuffle by a
                Partial-AGG-1 (grouping on a)
      ```
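
      Conceptually this is the standard two-step rewrite of a single distinct aggregate. A minimal sketch of the equivalence (assuming a Spark 1.6 `sqlContext` with a registered table `table`):
      ```scala
      // Produces the same answer as SELECT COUNT(DISTINCT a) FROM table:
      // deduplicate `a` first (partial aggregates + shuffle by a), then
      // count the surviving groups (shuffle to one reducer + final agg).
      val distinctCount = sqlContext.table("table")
        .select("a")
        .distinct()
        .count()
      ```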
      
      This PR also includes a large refactoring of aggregation (removing 500+ lines of code).
      
      cc yhuai nongli marmbrus
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10228 from davies/single_distinct.
    • [SPARK-12281][CORE] Fix a race condition when reporting ExecutorState in the shutdown hook · 2aecda28
      Shixiong Zhu authored
      1. Make sure workers and masters exit so that no worker or master is still running when the shutdown hook is triggered.
      2. Set ExecutorState to FAILED if it is still RUNNING when the shutdown hook executes (sketched after the stack traces below).
      
      This should fix the potential exceptions when exiting a local cluster:
      ```
      java.lang.AssertionError: assertion failed: executor 4 state transfer from RUNNING to RUNNING is illegal
      	at scala.Predef$.assert(Predef.scala:179)
      	at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260)
      	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
      	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
      	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
      	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      
      java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
      	at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:246)
      	at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:191)
      	at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:180)
      	at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:73)
      	at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:474)
      	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
      	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
      	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
      	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      ```
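
      The gist of item 2, as a self-contained toy sketch (my illustration, not Spark's actual code):
      ```scala
      object ExecutorStateDemo {
        // Mirror of the executor states involved in the race.
        object ExecutorState extends Enumeration {
          val RUNNING, FAILED, KILLED, EXITED = Value
        }

        @volatile var state = ExecutorState.RUNNING

        // If the executor is still RUNNING when the hook fires, mark it
        // FAILED so the master never sees a RUNNING -> RUNNING transition.
        def install(): Unit = sys.addShutdownHook {
          if (state == ExecutorState.RUNNING) {
            state = ExecutorState.FAILED
          }
        }
      }
      ```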
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10269 from zsxwing/executor-state.
  2. Dec 10, 2015
    • [SPARK-12258][SQL] passing null into ScalaUDF · b1b4ee7f
      Davies Liu authored
      Check the nullability of inputs and pass null values into ScalaUDF.
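
      A minimal sketch of the behavior at stake (my example, assuming a Spark 1.6 `sqlContext`), using a boxed type so the UDF can observe null:
      ```scala
      // A null-tolerant UDF over a boxed Long; the fix ensures null inputs
      // are passed through to ScalaUDF correctly.
      sqlContext.udf.register("plusOne",
        (x: java.lang.Long) => if (x == null) null else java.lang.Long.valueOf(x + 1))

      val df = sqlContext.range(3).selectExpr("IF(id = 1, NULL, id) AS id")
      df.selectExpr("plusOne(id) AS out").show()
      ```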
      
      Closes #10249
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10259 from davies/udf_null.
    • [STREAMING][DOC][MINOR] Update the description of direct Kafka stream doc · 24d3357d
      jerryshao authored
      With the merge of [SPARK-8337](https://issues.apache.org/jira/browse/SPARK-8337), the Python API now has the same functionality as Scala/Java, so this changes the description to make it more precise.
      
      zsxwing tdas, please review, thanks a lot.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #10246 from jerryshao/direct-kafka-doc-update.
    • [SPARK-12155][SPARK-12253] Fix executor OOM in unified memory management · 5030923e
      Andrew Or authored
      **Problem.** In unified memory management, acquiring execution memory may lead to eviction of storage memory. However, the space freed from evicting cached blocks is distributed among all active tasks. Thus, an incorrect upper bound on the execution memory per task can cause the acquisition to fail, leading to OOMs and premature spills.
      
      **Example.** Suppose total memory is 1000B, cached blocks occupy 900B, `spark.memory.storageFraction` is 0.4, and there are two active tasks. In this case, the cap on task execution memory is 100B / 2 = 50B. If task A tries to acquire 200B, it will evict 100B of storage but can only acquire 50B because of the incorrect cap. For another example, see this [regression test](https://github.com/andrewor14/spark/blob/fix-oom/core/src/test/scala/org/apache/spark/memory/UnifiedMemoryManagerSuite.scala#L233) that I stole from JoshRosen.
      
      **Solution.** Fix the cap on task execution memory. It should take into account the space that could have been freed by storage in addition to the current amount of memory available to execution. In the example above, the correct cap should have been 600B / 2 = 300B.
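
      Spelling out the arithmetic from the example above (plain numbers, not Spark code):
      ```scala
      val total = 1000L             // total unified memory, in bytes
      val cached = 900L             // bytes currently used by storage
      val storageFraction = 0.4     // fraction of storage that cannot be evicted
      val numTasks = 2L

      val free = total - cached                                  // 100
      val evictable = cached - (total * storageFraction).toLong  // 900 - 400 = 500

      val oldCap = free / numTasks                // 50: too small, causes OOMs
      val newCap = (free + evictable) / numTasks  // 300: the corrected cap
      ```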
      
      This patch also guards against the race condition (SPARK-12253):
      (1) Existing tasks collectively occupy all execution memory
      (2) New task comes in and blocks while existing tasks spill
      (3) After tasks finish spilling, another task jumps in and puts in a large block, stealing the freed memory
      (4) New task still cannot acquire memory and goes back to sleep
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #10240 from andrewor14/fix-oom.
    • [SPARK-12251] Document and improve off-heap memory configurations · 23a9e62b
      Josh Rosen authored
      This patch adds documentation for Spark configurations that affect off-heap memory and makes some naming and validation improvements for those configs.
      
      - Change `spark.memory.offHeapSize` to `spark.memory.offHeap.size`. This is fine because this configuration has not shipped in any Spark release yet (it's new in Spark 1.6).
      - Deprecate `spark.unsafe.offHeap` in favor of a new `spark.memory.offHeap.enabled` configuration. The motivation behind this change is to gather all memory-related configurations under the same prefix (see the sketch after this list).
      - Add a check which prevents users from setting `spark.memory.offHeap.enabled=true` when `spark.memory.offHeap.size == 0`. After SPARK-11389 (#9344), which was committed in Spark 1.6, Spark enforces a hard limit on the amount of off-heap memory that it will allocate to tasks. As a result, enabling off-heap execution memory without setting `spark.memory.offHeap.size` will lead to immediate OOMs. The new configuration validation makes this scenario easier to diagnose, helping to avoid user confusion.
      - Document these configurations on the configuration page.
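
      A minimal sketch of the renamed settings (the values here are my own, not defaults):
      ```scala
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        // replaces the deprecated spark.unsafe.offHeap flag
        .set("spark.memory.offHeap.enabled", "true")
        // must be non-zero whenever off-heap memory is enabled (bytes)
        .set("spark.memory.offHeap.size", "268435456")
      ```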
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10237 from JoshRosen/SPARK-12251.
    • [SPARK-11713] [PYSPARK] [STREAMING] Initial RDD updateStateByKey for PySpark · 6a6c1fc5
      Bryan Cutler authored
      Adding the ability to define an initial state RDD for use with updateStateByKey in PySpark. Added a unit test and changed the stateful_network_wordcount example to use an initial RDD.
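
      The Scala analogue of the new capability, as a sketch (assuming an existing StreamingContext `ssc` and a `wordCounts` DStream of (word, count) pairs; the `initialRDD` overload already exists in the Scala API):
      ```scala
      import org.apache.spark.HashPartitioner

      // Seed the running state so counting starts from these values.
      val initialRDD = ssc.sparkContext.parallelize(Seq(("hello", 1), ("world", 1)))
      val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
        Some(newValues.sum + state.getOrElse(0))
      val stateDStream = wordCounts.updateStateByKey(
        updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), initialRDD)
      ```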
      
      Author: Bryan Cutler <bjcutler@us.ibm.com>
      
      Closes #10082 from BryanCutler/initial-rdd-updateStateByKey-SPARK-11713.
    • [SPARK-11563][CORE][REPL] Use RpcEnv to transfer REPL-generated classes. · 4a46b885
      Marcelo Vanzin authored
      This avoids bringing up yet another HTTP server on the driver, and
      instead reuses the file server already managed by the driver's
      RpcEnv. As a bonus, the repl now inherits the security features of
      the network library.
      
      There's also a small change to create the directory for storing classes
      under the root temp dir for the application (instead of directly
      under java.io.tmpdir).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9923 from vanzin/SPARK-11563.
    • [SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, spark.mllib... · 2ecbe02d
      Timothy Hunter authored
      [SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, spark.mllib and mllib in the documentation.
      
      Replaces a number of occurrences of `MLlib` in the documentation that were meant to refer to the `spark.mllib` package instead. It should clarify for new users the difference between `spark.mllib` (the package) and MLlib (the umbrella project for ML in Spark).
      
      It also removes some files that I forgot to delete with #10207
      
      Author: Timothy Hunter <timhunter@databricks.com>
      
      Closes #10234 from thunterdb/12212.
    • [SPARK-12228][SQL] Try to run execution hive's derby in memory. · ec5f9ed5
      Yin Huai authored
      This PR tries to make execution Hive's Derby metastore run in memory, since it is a fake metastore and we switch to a new one every time we create a HiveContext. This may reduce the flakiness of our tests that need to create a HiveContext (e.g. HiveSparkSubmitSuite). I will test it more.
      
      https://issues.apache.org/jira/browse/SPARK-12228
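
      A sketch of the idea (my paraphrase; the in-memory URL form is standard Derby, though the PR's exact wiring may differ):
      ```scala
      import org.apache.hadoop.hive.conf.HiveConf

      // Point the throwaway execution metastore at Derby's in-memory
      // database instead of an on-disk metastore_db directory.
      val hiveConf = new HiveConf()
      hiveConf.set("javax.jdo.option.ConnectionURL",
        "jdbc:derby:memory:metastore_db;create=true")
      ```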
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #10204 from yhuai/derbyInMemory.
    • [SPARK-12250][SQL] Allow users to define a UDAF without providing details of its inputSchema · bc5f56aa
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-12250
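
      For reference, a minimal Spark 1.6 UDAF sketch (my own example, not from the PR) showing where `inputSchema` fits in the API:
      ```scala
      import org.apache.spark.sql.Row
      import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
      import org.apache.spark.sql.types._

      class LongSum extends UserDefinedAggregateFunction {
        // The schema this change makes less burdensome to specify in detail.
        def inputSchema: StructType = new StructType().add("value", LongType)
        def bufferSchema: StructType = new StructType().add("sum", LongType)
        def dataType: DataType = LongType
        def deterministic: Boolean = true
        def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
        def update(buffer: MutableAggregationBuffer, input: Row): Unit =
          if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
        def merge(b1: MutableAggregationBuffer, b2: Row): Unit =
          b1(0) = b1.getLong(0) + b2.getLong(0)
        def evaluate(buffer: Row): Any = buffer.getLong(0)
      }
      ```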
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #10236 from yhuai/SPARK-12250.
    • [SPARK-12234][SPARKR] Fix ```subset``` function error when only set ```select``` argument · d9d354ed
      Yanbo Liang authored
      Fix the ```subset``` function error when only the ```select``` argument is set. Please refer to the [JIRA](https://issues.apache.org/jira/browse/SPARK-12234) for the error and how to reproduce it.
      
      cc sun-rui felixcheung shivaram
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #10217 from yanboliang/spark-12234.