  1. Oct 30, 2015
    • Yin Huai · e8ec2a7b
    • Davies Liu
      [SPARK-11423] remove MapPartitionsWithPreparationRDD · 45029bfd
      Davies Liu authored
      Since we do not need to preserve a page before calling compute(), MapPartitionsWithPreparationRDD is not needed anymore.
      
      This PR basically reverts #8543, #8511, #8038 and #8011.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9381 from davies/remove_prepare2.
      45029bfd
    • felixcheung
      [SPARK-11340][SPARKR] Support setting driver properties when starting Spark... · bb5a2af0
      felixcheung authored
      [SPARK-11340][SPARKR] Support setting driver properties when starting Spark from R programmatically or from RStudio
      
      Maps spark.driver.memory from sparkEnvir to spark-submit command-line arguments (see the sketch after this entry).
      
      shivaram suggested that we possibly add other spark.driver.* properties - do we want to add all of those? I thought those could be set in SparkConf?
      sun-rui
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #9290 from felixcheung/rdrivermem.
      bb5a2af0
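The change boils down to translating entries in sparkEnvir into spark-submit arguments before the driver JVM is launched, because properties such as spark.driver.memory cannot be applied to an already-running JVM. A minimal Python sketch of that mapping (the helper name and the exact set of mapped properties are assumptions, not the SparkR code):

```python
# Hypothetical sketch: map driver-side properties to spark-submit arguments.
DRIVER_ARG_MAP = {
    "spark.driver.memory": "--driver-memory",
    "spark.driver.extraClassPath": "--driver-class-path",
    "spark.driver.extraJavaOptions": "--driver-java-options",
    "spark.driver.extraLibraryPath": "--driver-library-path",
}

def spark_envir_to_submit_args(spark_envir):
    """Translate a dict of properties into spark-submit command-line args."""
    args = []
    for key, value in spark_envir.items():
        if key in DRIVER_ARG_MAP:
            # Driver properties need dedicated flags, set before JVM start.
            args += [DRIVER_ARG_MAP[key], value]
        else:
            # Everything else can be passed through as --conf key=value.
            args += ["--conf", "%s=%s" % (key, value)]
    return args

print(spark_envir_to_submit_args({"spark.driver.memory": "2g"}))
# ['--driver-memory', '2g']
```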
    • Jeff Zhang
      [SPARK-11342][TESTS] Allow to set hadoop profile when running dev/run_tests · 729f983e
      Jeff Zhang authored
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #9295 from zjffdu/SPARK-11342.
      729f983e
    • Sun Rui
      [SPARK-11210][SPARKR] Add window functions into SparkR [step 2]. · 40c77fb2
      Sun Rui authored
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #9196 from sun-rui/SPARK-11210.
      40c77fb2
    • Sun Rui
      [SPARK-11414][SPARKR] Forgot to update usage of 'spark.sparkr.r.command' in... · fab710a9
      Sun Rui authored
      [SPARK-11414][SPARKR] Forgot to update usage of 'spark.sparkr.r.command' in RRDD in the PR for SPARK-10971.
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #9368 from sun-rui/SPARK-11414.
      fab710a9
    • Iulian Dragos
      [SPARK-10986][MESOS] Set the context class loader in the Mesos executor backend. · 0451b001
      Iulian Dragos authored
      See [SPARK-10986](https://issues.apache.org/jira/browse/SPARK-10986) for details.
      
      This fixes the `ClassNotFoundException` for Spark classes in the serializer.
      
      I am not sure this is the right way to handle the class loader, but I couldn't find any documentation on how the context class loader is used and who relies on it. It seems at least the serializer uses it to instantiate classes during deserialization.
      
      I am open to suggestions (I tried this fix on a real Mesos cluster and it *does* fix the issue).
      
      tnachen andrewor14
      
      Author: Iulian Dragos <jaguarul@gmail.com>
      
      Closes #9282 from dragos/issue/mesos-classloader.
      0451b001
    • Wenchen Fan
      [SPARK-11393] [SQL] CoGroupedIterator should respect the fact that... · 14d08b99
      Wenchen Fan authored
      [SPARK-11393] [SQL] CoGroupedIterator should respect the fact that GroupedIterator.hasNext is not idempotent
      
      When we cogroup 2 `GroupedIterator`s in `CoGroupedIterator`, if the right side is smaller, we will consume the right data and keep the left data unchanged. Then we call `hasNext`, which calls `left.hasNext`. This will make `GroupedIterator` generate an extra group, as the previous one has not been consumed yet (see the sketch after this entry).
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9346 from cloud-fan/cogroup and squashes the following commits:
      
      9be67c8 [Wenchen Fan] SPARK-11393
      14d08b99
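The underlying pitfall is generic: when hasNext itself advances the iterator to materialize the next group, every call consumes a group, so the caller has to cache its result. A minimal Python sketch of that behaviour (hypothetical class, not Spark's GroupedIterator):

```python
from itertools import groupby

class GroupedIterator:
    """Sketch of the pitfall: has_next() is NOT idempotent -- every call
    advances the underlying iterator to the next group."""
    def __init__(self, rows, key):
        self._groups = groupby(rows, key)   # rows must already be sorted by key
        self.current = None

    def has_next(self):
        try:
            k, vs = next(self._groups)
            self.current = (k, list(vs))
            return True
        except StopIteration:
            return False

rows = [(1, "a"), (1, "b"), (2, "c")]
it = GroupedIterator(rows, key=lambda r: r[0])

# Correct usage: cache the result of has_next() and consume `current` once.
more = it.has_next()
while more:
    print(it.current)        # the group for key 1, then the group for key 2
    more = it.has_next()

# Buggy usage (the CoGroupedIterator mistake): calling has_next() again before
# consuming `current` overwrites the buffered group, silently dropping it.
```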
    • hyukjinkwon
      [SPARK-11103][SQL] Filter applied on merged Parquet schema with new column fails · 59db9e9c
      hyukjinkwon authored
      When enabling schema merging and predicate filtering, this fails since Parquet does not accept pushed-down filters whose columns do not exist in the schema.
      This is related to a Parquet issue (https://issues.apache.org/jira/browse/PARQUET-389).
      
      For now, this PR simply disables predicate push-down when a merged schema is used (see the sketch after this entry).
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #9327 from HyukjinKwon/SPARK-11103.
      59db9e9c
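The interaction is visible from the Python API; a hedged repro sketch for a SQLContext session (paths and column names are made up, and the config keys are to the best of my recollection):

```python
# Two Parquet directories with different schemas: the second adds `extra`.
df1 = sqlContext.range(10).selectExpr("id")
df1.write.parquet("/tmp/t/part=1")
df2 = sqlContext.range(10).selectExpr("id", "id * 2 as extra")
df2.write.parquet("/tmp/t/part=2")

# Schema merging picks up `extra`; before this fix, pushing the filter down
# to Parquet failed because `extra` is missing from some of the files.
sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

merged = sqlContext.read.option("mergeSchema", "true").parquet("/tmp/t")
merged.filter("extra > 4").show()
```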
    • Lewuathe
      [SPARK-11207] [ML] Add test cases for solver selection of LinearRegression as followup · 86d65265
      Lewuathe authored
      This is the follow-up work of SPARK-10668.
      
      * Fix minor style issues.
      * Add a test case for checking whether the solver is selected properly (see the sketch after this entry).
      
      Author: Lewuathe <lewuathe@me.com>
      Author: lewuathe <lewuathe@me.com>
      
      Closes #9180 from Lewuathe/SPARK-11207.
      86d65265
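For context, solver selection is exposed as an estimator parameter; a hedged PySpark sketch of forcing each path (treat the accepted values as an assumption about the 1.6-era API):

```python
from pyspark.ml.regression import LinearRegression

# "auto" lets the estimator pick a solver based on the data; "normal" and
# "l-bfgs" force the normal-equation or iterative path, respectively.
lr_auto = LinearRegression(maxIter=10, regParam=0.1, solver="auto")
lr_normal = LinearRegression(maxIter=10, regParam=0.1, solver="normal")
lr_lbfgs = LinearRegression(maxIter=10, regParam=0.1, solver="l-bfgs")

# model = lr_auto.fit(training_df)  # training_df: DataFrame of (features, label)
```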
    • Davies Liu
      [SPARK-11417] [SQL] no @Override in codegen · eb59b94c
      Davies Liu authored
      Older versions of Janino (>2.7) do not support @Override, so we should not use it in codegen.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9372 from davies/no_override.
      eb59b94c
    • Davies Liu
      [SPARK-10342] [SPARK-10309] [SPARK-10474] [SPARK-10929] [SQL] Cooperative memory management · 56419cf1
      Davies Liu authored
      This PR introduces a mechanism to call spill() on those SQL operators that support spilling (for example, BytesToBytesMap, UnsafeExternalSorter and ShuffleExternalSorter) when there is not enough memory for execution. The preserved first page is no longer needed, so it has been removed.
      
      Other Spillable objects in Spark core (ExternalSorter and AppendOnlyMap) are not included in this PR, but they could benefit from it (by triggering others' spilling).
      
      The PrepareRDD may not be needed anymore and could be removed in a follow-up PR.
      
      The following script fails with an OOM before this PR and finishes in 150 seconds with a 2G heap after it (it also works in the 1.5 branch, with a similar duration).
      
      ```python
      # Force everything through a single shuffle partition to stress memory.
      sqlContext.setConf("spark.sql.shuffle.partitions", "1")
      df = sqlContext.range(1<<25).selectExpr("id", "repeat(id, 2) as s")
      df2 = df.select(df.id.alias('id2'), df.s.alias('s2'))
      # Self-join plus aggregation: both the join and the aggregation may
      # need to spill under memory pressure.
      j = df.join(df2, df.id==df2.id2).groupBy(df.id).max("id", "id2")
      j.explain()
      print j.count()
      ```
      
      For thread-safety, here is what I've got:
      
      1) Without calling spill(), the operators are only used by a single thread, so there are no safety problems.
      
      2) spill() can be triggered in two cases: by the operator itself, or by other operators. We can check trigger == this in spill(); in that case we are still on the same thread, so there are no safety problems.
      
      3) If it's triggered by other operators (right now the cache will not trigger spill()), we only spill the data to disk when the operator is in the scanning stage (building is finished), so the in-memory sorter and memory pages are read-only; we only need to synchronize the iterator and switch it over.
      
      4) During scanning, the iterator only uses one record in one page at a time. We can't free this page, because downstream code is currently using it (via UnsafeRow or other objects). In BytesToBytesMap, we just skip the current page and dump all the others to disk. In UnsafeExternalSorter, we keep the page that is used by the current record (the one with the same baseObject) and free it when loading the next record. In ShuffleExternalSorter, spill() is not triggered during scanning.
      
      5) In order to avoid deadlock, we don't call acquireMemory during spill (so we reuse the pointer array in InMemorySorter). A sketch of the spill-trigger check appears after this entry.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9241 from davies/force_spill.
      56419cf1
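A minimal Python sketch of the spill-trigger discipline described in points 1-5 above (class and method names are hypothetical; the real logic lives in the JVM-side operators):

```python
import threading

class SpillableOperator:
    """Cooperative spilling sketch: a memory manager may ask any operator to
    spill; the operator distinguishes self-triggered spills (same thread, no
    locking needed) from spills triggered by other operators."""
    def __init__(self):
        self._lock = threading.Lock()  # guards the switch of the scan iterator

    def spill(self, trigger):
        if trigger is self:
            # Self-triggered: still on the owning thread, dump everything.
            return self._spill_everything()
        # Triggered by another operator: only safe during the scanning stage,
        # when pages are read-only. Keep the page backing the current record
        # and synchronize the swap of the iterator to the spilled data.
        with self._lock:
            return self._spill_all_but_current_page()

    def _spill_everything(self):
        ...  # dump in-memory pages to disk and release the memory

    def _spill_all_but_current_page(self):
        ...  # skip the page backing the current record, spill the rest
```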
  2. Oct 29, 2015
  3. Oct 28, 2015
  4. Oct 27, 2015
    • Cheng Hao
      [SPARK-10484] [SQL] Optimize the cartesian join with broadcast join for some cases · d9c60398
      Cheng Hao authored
      In some cases we can broadcast the smaller relation in a cartesian join, which improves performance significantly (see the sketch after this entry).
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #8652 from chenghao-intel/cartesian.
      d9c60398
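Whether the broadcast variant can be used depends on the size estimate of the smaller side; a hedged PySpark sketch of the shape of such a query (sizes, names and the threshold value are made up):

```python
# Relations below spark.sql.autoBroadcastJoinThreshold can be broadcast.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))

large = sqlContext.range(1000000).selectExpr("id", "id % 7 as bucket")
small = sqlContext.range(5).selectExpr("id as tag")

# No join condition => cartesian product; with this change the small side can
# be broadcast instead of replicating the large one across partitions.
product = large.join(small)
product.explain()
```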
    • Kay Ousterhout
      [SPARK-11178] Improving naming around task failures. · b960a890
      Kay Ousterhout authored
      Commit af3bc59d introduced new
      functionality so that if an executor dies for a reason that's not
      caused by one of the tasks running on the executor (e.g., due to
      pre-emption), Spark doesn't count the failure towards the maximum
      number of failures for the task.  That commit introduced some vague
      naming that this commit attempts to fix; in particular:
      
      (1) The variable "isNormalExit", which was used to refer to cases where
      the executor died for a reason unrelated to the tasks running on the
      machine, has been renamed (and reversed) to "exitCausedByApp". The problem
      with the existing name is that it's not clear (at least to me!) what it
      means for an exit to be "normal"; the new name is intended to make the
      purpose of this variable more clear.
      
      (2) The variable "shouldEventuallyFailJob" has been renamed to
      "countTowardsTaskFailures". This variable is used to determine whether
      a task's failure should be counted towards the maximum number of failures
      allowed for a task before the associated Stage is aborted. The problem
      with the existing name is that it can be confused with implying that
      the task's failure should immediately cause the stage to fail because it
      is somehow fatal (this is the case for a fetch failure, for example: if
      a task fails because of a fetch failure, there's no point in retrying,
      and the whole stage should be failed).
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #9164 from kayousterhout/SPARK-11178.
      b960a890
    • zsxwing
      [SPARK-11212][CORE][STREAMING] Make preferred locations support... · 9fbd75ab
      zsxwing authored
      [SPARK-11212][CORE][STREAMING] Make preferred locations support ExecutorCacheTaskLocation and update ReceiverTracker and ReceiverSchedulingPolicy to use it
      
      This PR includes the following changes:
      
      1. Add a new preferred location format, `executor_<host>_<executorID>` (e.g., "executor_localhost_2"), to support specifying executor locations for an RDD.
      2. Use the new preferred location format in `ReceiverTracker` to optimize the start-up time of receivers when there are multiple executors on a host.
      
      The goal of this PR is to enable the streaming scheduler to place receivers (which run as tasks) on specific executors. Basically, I want more control over the placement of the receivers so that they are evenly distributed among the executors. We tried to do this without changing the core scheduling logic, but that only allows specifying a preferred host, not a particular executor. So if there are two executors on the same host and I want two receivers to run on them (one on each executor), I cannot specify that. The current code only specifies the host as the preference, which may end up launching both receivers on the same executor. We tried to work around it by restarting a receiver when it does not launch on the desired executor, hoping that next time it will be started on the right one. But that causes lots of restarts and delays in correctly launching the receiver.
      
      So this change allows the streaming scheduler to specify the exact executor as the preferred location. It is not exposed to the user; only the streaming scheduler uses it (see the sketch after this entry).
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #9181 from zsxwing/executor-location.
      9fbd75ab
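The new location format is just a string convention that the scheduler parses; a hedged Python sketch of building and parsing it (helper names are hypothetical, and the real parsing in Spark is more defensive):

```python
def executor_location(host, executor_id):
    """Build an executor-level preferred location, e.g. 'executor_localhost_2'."""
    return "executor_%s_%s" % (host, executor_id)

def parse_location(loc):
    """Return (host, executor_id) for executor-level locations,
    or (host, None) for plain host-level preferences."""
    if loc.startswith("executor_"):
        _, host, executor_id = loc.split("_", 2)
        return host, executor_id
    return loc, None

print(executor_location("localhost", 2))       # executor_localhost_2
print(parse_location("executor_localhost_2"))  # ('localhost', '2')
print(parse_location("localhost"))             # ('localhost', None)
```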
    • Burak Yavuz
      [SPARK-11324][STREAMING] Flag for closing Write Ahead Logs after a write · 4f030b9e
      Burak Yavuz authored
      Currently the Write Ahead Log in Spark Streaming flushes data as writes are made. S3 does not support flushing of data; data is only written once the stream is actually closed.
      In case of failure, the data for the last minute (the default rolling interval) will not be properly written. Therefore we need a flag to close the stream after the write, so that we achieve read-after-write consistency (see the sketch after this entry).
      
      cc tdas zsxwing
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #9285 from brkyvz/caw-wal.
      4f030b9e
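From an application this surfaces as write-ahead-log configuration flags; a hedged sketch of enabling them (treat the exact key names as my assumption of what this change introduced):

```python
from pyspark import SparkConf

# Enable the WAL and close each log file after writing, so that object stores
# like S3 (which only persist data on close, not on flush) still see every batch.
conf = (SparkConf()
        .setAppName("wal-on-s3")
        .set("spark.streaming.receiver.writeAheadLog.enable", "true")
        .set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", "true")
        .set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true"))
```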