  1. Nov 08, 2015
  2. Nov 07, 2015
  3. Nov 06, 2015
    • Andrew Or's avatar
      [SPARK-11112] DAG visualization: display RDD callsite · 7f741905
      Andrew Or authored
      [Screenshot: DAG visualization showing the RDD callsite]
      mateiz sarutak
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #9398 from andrewor14/rdd-callsite.
      7f741905
    • Josh Rosen's avatar
      [SPARK-11389][CORE] Add support for off-heap memory to MemoryManager · 30b706b7
      Josh Rosen authored
      In order to lay the groundwork for proper off-heap memory support in SQL / Tungsten, we need to extend our MemoryManager to perform bookkeeping for off-heap memory.
      
      ## User-facing changes
      
      This PR introduces a new configuration, `spark.memory.offHeapSize` (name subject to change), which specifies the absolute amount of off-heap memory that Spark and Spark SQL can use. If Tungsten is configured to use off-heap execution memory for allocating data pages, then all data page allocations must fit within this size limit.
      
      ## Internals changes
      
      This PR contains a lot of internal refactoring of the MemoryManager. The key change at the heart of this patch is the introduction of a `MemoryPool` class (name subject to change) to manage the bookkeeping for a particular category of memory (storage, on-heap execution, and off-heap execution). These MemoryPools are not fixed-size; they can be dynamically grown and shrunk according to the MemoryManager's policies. In StaticMemoryManager, these pools have fixed sizes, proportional to the legacy `[storage|shuffle].memoryFraction`. In the new UnifiedMemoryManager, the sizes of these pools are dynamically adjusted according to its policies.
      
      There are two subclasses of `MemoryPool`: `StorageMemoryPool` manages storage memory and `ExecutionMemoryPool` manages execution memory. The MemoryManager creates two execution pools, one for on-heap memory and one for off-heap. Instances of `ExecutionMemoryPool` manage the logic for fair sharing of their pooled memory across running tasks (in other words, the ShuffleMemoryManager-like logic has been moved out of MemoryManager and pushed into these ExecutionMemoryPool instances).
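The pool bookkeeping described above can be sketched in a few lines. This is an illustrative model only, not Spark's actual Scala classes; all names and the fair-share policy below are simplified assumptions:

```python
class MemoryPool:
    """Bookkeeping for one category of memory (illustrative sketch)."""
    def __init__(self, pool_size):
        self.pool_size = pool_size      # not fixed: the manager may grow/shrink it
        self.memory_used = 0

    @property
    def memory_free(self):
        return self.pool_size - self.memory_used


class ExecutionMemoryPool(MemoryPool):
    """Fairly shares pooled execution memory across running tasks."""
    def __init__(self, pool_size):
        super().__init__(pool_size)
        self.task_memory = {}           # task id -> bytes currently held

    def acquire(self, task_id, num_bytes):
        # Cap each task at an equal share of the pool (1/N for N active tasks).
        self.task_memory.setdefault(task_id, 0)
        fair_share = self.pool_size // len(self.task_memory)
        grantable = max(0, fair_share - self.task_memory[task_id])
        granted = min(num_bytes, grantable, self.memory_free)
        self.task_memory[task_id] += granted
        self.memory_used += granted
        return granted
```

With a 1000-byte pool, a lone task can be granted up to the whole pool, but once a second task registers, each is capped at 500 bytes; the real pools additionally block or spill rather than simply granting less.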
      
      I think that this design is substantially easier to understand and reason about than the previous design, where most of these responsibilities were handled by MemoryManager and its subclasses. To see this, take a look at how simple the logic in `UnifiedMemoryManager` has become: it's now very easy to see when memory is dynamically shifted between storage and execution.
      
      ## TODOs
      
      - [x] Fix a handful of test failures in the MemoryManagerSuites.
      - [x] Fix remaining TODO comments in code.
      - [ ] Document new configuration.
      - [x] Fix commented-out tests / asserts:
        - [x] UnifiedMemoryManagerSuite.
      - [x] Write tests that exercise the new off-heap memory management policies.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9344 from JoshRosen/offheap-memory-accounting.
      30b706b7
    • Michael Armbrust's avatar
      [HOTFIX] Fix python tests after #9527 · 105732dc
      Michael Armbrust authored
      #9527 missed updating the python tests.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #9533 from marmbrus/hotfixTextValue.
      105732dc
    • navis.ryu's avatar
      [SPARK-11546] Thrift server makes too many logs about result schema · 1c80d66e
      navis.ryu authored
      SparkExecuteStatementOperation logs the result schema on every getNextRowSet() call, which by default happens every 1000 rows, overwhelming the whole log file.
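One way to illustrate the throttling is a log-once guard; this is a minimal sketch with hypothetical names, not the actual Thrift server code:

```python
import logging

class StatementOperation:
    """Sketch of an operation that logs its result schema once,
    instead of on every get_next_row_set() call (hypothetical names)."""
    def __init__(self, schema):
        self.schema = schema
        self._schema_logged = False
        self.fetched = 0

    def get_next_row_set(self, batch_size=1000):
        if not self._schema_logged:
            # Previously this logging ran on every call, i.e. every 1000 rows.
            logging.info("Result schema: %s", self.schema)
            self._schema_logged = True
        self.fetched += batch_size
```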
      
      Author: navis.ryu <navis@apache.org>
      
      Closes #9514 from navis/SPARK-11546.
      1c80d66e
    • Herman van Hovell's avatar
      [SPARK-9241][SQL] Supporting multiple DISTINCT columns (2) - Rewriting Rule · 6d0ead32
      Herman van Hovell authored
      The second PR for SPARK-9241, this adds support for multiple distinct columns to the new aggregation code path.
      
      This PR solves the multiple DISTINCT column problem by rewriting these Aggregates into an Expand-Aggregate-Aggregate combination. See the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-9241) for some information on this. The advantages over the competing [first PR](https://github.com/apache/spark/pull/9280) are:
      - This can use the faster TungstenAggregate code path.
      - It is impossible to OOM due to an ```OpenHashSet``` allocating too much memory. However, this will multiply the number of input rows by the number of distinct clauses (plus one), and put a lot more memory pressure on the aggregation code path itself.
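The rewrite can be illustrated outside Spark with a small sketch: each input row is replicated once per DISTINCT clause and tagged with a group id (the Expand step), and the distinct counts then fall out of two ordinary aggregations. All names here are hypothetical:

```python
from collections import defaultdict

def count_distinct(rows, columns):
    """Compute COUNT(DISTINCT col) for several columns via an
    Expand-Aggregate-Aggregate rewrite (illustrative sketch)."""
    # Expand: replicate each row once per distinct clause, tagged by group id.
    expanded = []
    for row in rows:
        for gid, col in enumerate(columns):
            expanded.append((gid, row[col]))
    # First aggregate: group by (gid, value) to deduplicate within each clause.
    deduped = set(expanded)
    # Second aggregate: count the surviving rows per group id.
    counts = defaultdict(int)
    for gid, _ in deduped:
        counts[gid] += 1
    return {col: counts[gid] for gid, col in enumerate(columns)}
```

Note how 3 input rows with 2 distinct clauses become 6 expanded rows, which is exactly the row-count multiplication (and extra memory pressure) the PR description mentions.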
      
      The location of this Rule is a bit funny, and should probably change when the old aggregation path is changed.
      
      cc yhuai - Could you also tell me where to add tests for this?
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #9406 from hvanhovell/SPARK-9241-rewriter.
      6d0ead32
    • Nong Li's avatar
      [SPARK-11410] [PYSPARK] Add python bindings for repartition and sortWithinPartitions. · 1ab72b08
      Nong Li authored
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #9504 from nongli/spark-11410.
      1ab72b08
    • Wenchen Fan's avatar
      [SPARK-11269][SQL] Java API support & test cases for Dataset · 7e9a9e60
      Wenchen Fan authored
      This simply brings https://github.com/apache/spark/pull/9358 up-to-date.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9528 from rxin/dataset-java.
      7e9a9e60
    • Thomas Graves's avatar
      [SPARK-11555] spark on yarn spark-class --num-workers doesn't work · f6680cdc
      Thomas Graves authored
      I tested the various options for specifying the number of executors with both spark-submit and spark-class, in both client and cluster mode where applicable.
      
      --num-workers, --num-executors, spark.executor.instances, SPARK_EXECUTOR_INSTANCES, default nothing supplied
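For reference, the equivalent ways of setting the executor count look roughly like this; a sketch only, since exact flag support depends on the Spark/YARN version, and the application arguments are elided:

```shell
# Deprecated YARN-specific flag:
spark-submit --num-workers 4 ...
# Preferred flag:
spark-submit --num-executors 4 ...
# Spark configuration property:
spark-submit --conf spark.executor.instances=4 ...
# Environment variable:
SPARK_EXECUTOR_INSTANCES=4 spark-submit ...
```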
      
      Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
      
      Closes #9523 from tgravescs/SPARK-11555.
      f6680cdc
    • Xiangrui Meng's avatar
      [SPARK-11217][ML] save/load for non-meta estimators and transformers · c447c9d5
      Xiangrui Meng authored
      This PR implements the default save/load for non-meta estimators and transformers using the JSON serialization of param values. The saved metadata includes:
      
      * class name
      * uid
      * timestamp
      * paramMap
      
      The save/load interface is similar to DataFrames. We use the current active context by default, which should be sufficient for most use cases.
      
      ~~~scala
      instance.save("path")
      instance.write.context(sqlContext).overwrite().save("path")
      
      Instance.load("path")
      ~~~
      
      The param handling is different from the design doc. We didn't save default and user-set params separately, and when we load them back, all parameters are treated as user-set. This does cause issues, but modifying the default params would cause other issues as well.
      
      TODOs:
      
      * [x] Java test
      * [ ] a follow-up PR to implement default save/load for all non-meta estimators and transformers
      
      cc jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #9454 from mengxr/SPARK-11217.
      c447c9d5
    • Reynold Xin's avatar
      [SPARK-11561][SQL] Rename text data source's column name to value. · 3a652f69
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9527 from rxin/SPARK-11561.
      3a652f69
    • Herman van Hovell's avatar
      [SPARK-11450] [SQL] Add Unsafe Row processing to Expand · f328feda
      Herman van Hovell authored
      This PR enables the Expand operator to process and produce Unsafe Rows.
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #9414 from hvanhovell/SPARK-11450.
      f328feda
    • Imran Rashid's avatar
      [SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits · 49f1a820
      Imran Rashid authored
      https://issues.apache.org/jira/browse/SPARK-10116
      
      This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.
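The issue can be demonstrated outside Spark: naively widening a 32-bit seed to 64 bits leaves the upper 32 bits constant, whereas hashing the seed's bytes spreads entropy across all 64 bits. In this sketch MD5 stands in purely for illustration; it is not the hash Spark uses:

```python
import hashlib
import struct

def naive_seed(seed_32bit):
    """Widen a 32-bit seed to 64 bits: the upper 32 bits stay zero."""
    return seed_32bit & 0xFFFFFFFF

def hashed_seed(seed_32bit):
    """Hash the seed's bytes so entropy reaches the high bits
    (MD5 here is a stand-in for a real mixing hash)."""
    digest = hashlib.md5(struct.pack("<q", seed_32bit)).digest()
    return struct.unpack("<Q", digest[:8])[0]
```

A generator seeded with `naive_seed` never sees randomness in the high bits; the hashed version does not share that bias.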
      
      mengxr mkolod
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #8314 from squito/SPARK-10116.
      49f1a820
    • Jacek Laskowski's avatar
      Typo fixes + code readability improvements · 62bb2907
      Jacek Laskowski authored
      Author: Jacek Laskowski <jacek.laskowski@deepsense.io>
      
      Closes #9501 from jaceklaskowski/typos-with-style.
      62bb2907
    • Yin Huai's avatar
      [SPARK-9858][SQL] Add an ExchangeCoordinator to estimate the number of post-shuffle partitions for aggregates and joins (follow-up) · 8211aab0
      Yin Huai authored
      
      https://issues.apache.org/jira/browse/SPARK-9858
      
      This PR is the follow-up work of https://github.com/apache/spark/pull/9276. It addresses JoshRosen's comments.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9453 from yhuai/numReducer-followUp.
      8211aab0
    • Cheng Lian's avatar
      [SPARK-10978][SQL][FOLLOW-UP] More comprehensive tests for PR #9399 · c048929c
      Cheng Lian authored
      This PR adds test cases that test various column pruning and filter push-down cases.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #9468 from liancheng/spark-10978.follow-up.
      c048929c
    • Liang-Chi Hsieh's avatar
      [SPARK-9162] [SQL] Implement code generation for ScalaUDF · 574141a2
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9162
      
      Currently ScalaUDF extends CodegenFallback and doesn't provide a code generation implementation. This patch implements code generation for ScalaUDF.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #9270 from viirya/scalaudf-codegen.
      574141a2
    • Shixiong Zhu's avatar
      [SPARK-11511][STREAMING] Fix NPE when an InputDStream is not used · cf69ce13
      Shixiong Zhu authored
      The fix simply ignores `InputDStream`s that have a null `rememberDuration` in `DStreamGraph.getMaxInputStreamRememberDuration`.
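The fix amounts to filtering out the unset durations before taking the max; a minimal sketch with hypothetical names, using `None` in place of a null reference:

```python
def max_input_stream_remember_duration(input_streams):
    """Return the largest remember duration among input streams,
    skipping streams whose duration was never set (None) so that
    the max computation never touches a null value."""
    durations = [s.remember_duration
                 for s in input_streams
                 if s.remember_duration is not None]
    return max(durations, default=None)
```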
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #9476 from zsxwing/SPARK-11511.
      cf69ce13
    • Wenchen Fan's avatar
      [SPARK-11453][SQL][FOLLOW-UP] remove DecimalLit · 253e87e8
      Wenchen Fan authored
      A cleanup for https://github.com/apache/spark/pull/9085.
      
      The `DecimalLit` is very similar to `FloatLit`, we can just keep one of them.
      Also added a low-level unit test in `SqlParserSuite`.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9482 from cloud-fan/parser.
      253e87e8
    • Reynold Xin's avatar
      [SPARK-11541][SQL] Break JdbcDialects.scala into multiple files and mark various dialects as private. · bc5d6c03
      Reynold Xin authored
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9511 from rxin/SPARK-11541.
      bc5d6c03
  4. Nov 05, 2015