  1. Jun 13, 2017
    • liuxian's avatar
      [SPARK-21006][TESTS][FOLLOW-UP] Some Worker's RpcEnv is leaked in WorkerSuite · 2aaed0a4
      liuxian authored
      ## What changes were proposed in this pull request?
      
      An RpcEnv that is created and run in a test needs to be shut down afterwards, as in #18226.
      
      ## How was this patch tested?
      unit test
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #18259 from 10110346/wip-lx-0610.
      2aaed0a4
    • Felix Cheung's avatar
      [TEST][SPARKR][CORE] Fix broken SparkSubmitSuite · 278ba7a2
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Fix the test file path. This was broken in #18264 and went undetected because R-only changes don't build core, and the subsequent post-commit build with the change passed (again because it wasn't building core).
      
      AppVeyor actually builds everything, but it doesn't run the Scala suites ...
      
      ## How was this patch tested?
      
      jenkins
      srowen gatorsmile
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #18283 from felixcheung/rsubmitsuite.
      278ba7a2
  2. Jun 11, 2017
    • Josh Rosen's avatar
      [SPARK-20715] Store MapStatuses only in MapOutputTracker, not ShuffleMapStage · 3476390c
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      This PR refactors `ShuffleMapStage` and `MapOutputTracker` in order to simplify the management of `MapStatuses`, reduce driver memory consumption, and remove a potential source of scheduler correctness bugs.
      
      ### Background
      
      In Spark there are currently two places where MapStatuses are tracked:
      
      - The `MapOutputTracker` maintains an `Array[MapStatus]` storing a single location for each map output. This mapping is used by the `DAGScheduler` for determining reduce-task locality preferences (when locality-aware reduce task scheduling is enabled) and is also used to serve map output locations to executors / tasks.
      - Each `ShuffleMapStage` also contains a mapping of `Array[List[MapStatus]]` which holds the complete set of locations where each map output could be available. This mapping is used to determine which map tasks need to be run when constructing `TaskSets` for the stage.
      
      This duplication adds complexity and creates the potential for certain types of correctness bugs.  Bad things can happen if these two copies of the map output locations get out of sync. For instance, if the `MapOutputTracker` is missing locations for a map output but `ShuffleMapStage` believes that locations are available then tasks will fail with `MetadataFetchFailedException` but `ShuffleMapStage` will not be updated to reflect the missing map outputs, leading to situations where the stage will be reattempted (because downstream stages experienced fetch failures) but no task sets will be launched (because `ShuffleMapStage` thinks all maps are available).
      
      I observed this behavior in a real-world deployment. I'm still not quite sure how the state got out of sync in the first place, but we can completely avoid this class of bug if we eliminate the duplicate state.
      
      ### Why we only need to track a single location for each map output
      
      I think that storing an `Array[List[MapStatus]]` in `ShuffleMapStage` is unnecessary.
      
      First, note that this adds memory/object bloat to the driver: we need one extra `List` per task. If you have millions of tasks across all stages, this can add up to a significant amount of resources.
      
      Secondly, I believe that it's extremely uncommon that these lists will ever contain more than one entry. It's not impossible, but is very unlikely given the conditions which must occur for that to happen:
      
      - In normal operation (no task failures) we'll only run each task once and thus will have at most one output.
      - If speculation is enabled then it's possible that we'll have multiple attempts of a task. The TaskSetManager will [kill duplicate attempts of a task](https://github.com/apache/spark/blob/04901dd03a3f8062fd39ea38d585935ff71a9248/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L717) after a task finishes successfully, reducing the likelihood that both the original and speculated task will successfully register map outputs.
      - There is a [comment in `TaskSetManager`](https://github.com/apache/spark/blob/04901dd03a3f8062fd39ea38d585935ff71a9248/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L113) which suggests that running tasks are not killed if a task set becomes a zombie. However:
        - If the task set becomes a zombie due to the job being cancelled then it doesn't matter whether we record map outputs.
        - If the task set became a zombie because of a stage failure (e.g. the map stage itself had a fetch failure from an upstream map stage) then I believe that the "failedEpoch" will be updated, which may cause map outputs from still-running tasks to [be ignored](https://github.com/apache/spark/blob/04901dd03a3f8062fd39ea38d585935ff71a9248/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1213). (I'm not 100% sure on this point, though.)
      - Even if you _do_ manage to record multiple map outputs for a stage, only a single map output is reported to / tracked by the MapOutputTracker. The only situation where the additional output locations could actually be read or used is if a task experienced a `FetchFailure` exception. The most likely cause of a `FetchFailure` exception is a lost executor, which will most likely have caused the loss of several map tasks' output, so saving on the potential re-execution of a single map task isn't a huge win if we're going to have to recompute several other lost map outputs from tasks which ran on that lost executor. Also note that the re-population of MapOutputTracker state from the state in the ShuffleMapStage only happens after the reduce stage has failed; the additional location doesn't help to prevent FetchFailures but, instead, can only reduce the amount of work when recomputing missing parent stages.
      
      Given this, this patch chooses to do away with tracking multiple locations for map outputs and instead stores only a single location. This change removes the main distinction between the `ShuffleMapStage`'s and `MapOutputTracker`'s copies of this state, paving the way for storing it only in the `MapOutputTracker`.
      
      ### Overview of other changes
      
      - Significantly simplified the cache / lock management inside of the `MapOutputTrackerMaster`:
        - The old code had several parallel `HashMap`s which had to be guarded by maps of `Object`s which were used as locks. This code was somewhat complicated to follow.
        - The new code uses a new `ShuffleStatus` class to group together all of the state associated with a particular shuffle, including cached serialized map statuses, significantly simplifying the logic.
      - Moved more code out of the shared `MapOutputTracker` abstract base class and into the `MapOutputTrackerMaster` and `MapOutputTrackerWorker` subclasses. This makes it easier to reason about which functionality needs to be supported only on the driver or executor.
      - Removed a bunch of code from the `DAGScheduler` which was used to synchronize information from the `MapOutputTracker` to `ShuffleMapStage`.
      - Added comments to clarify the role of `MapOutputTrackerMaster`'s `epoch` in invalidating executor-side shuffle map output caches.
      
      I will comment on these changes via inline GitHub review comments.
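      
      As an illustration of the `ShuffleStatus` grouping mentioned above, here is a simplified sketch; the field and method names are illustrative rather than the actual class, and `MapStatus` is Spark's internal per-map-output metadata type:
      
      ```
      import org.apache.spark.scheduler.MapStatus
      
      // All per-shuffle state lives in one object guarded by one lock, replacing the old
      // parallel HashMaps and per-key lock objects in MapOutputTrackerMaster.
      class ShuffleStatusSketch(numMaps: Int) {
        // One location per map output; a null entry means the output is missing.
        private val mapStatuses = new Array[MapStatus](numMaps)
        // Cached serialized statuses, invalidated whenever a status changes.
        private var cachedSerializedStatuses: Option[Array[Byte]] = None
      
        def addMapOutput(mapId: Int, status: MapStatus): Unit = synchronized {
          mapStatuses(mapId) = status
          cachedSerializedStatuses = None
        }
      
        def removeMapOutput(mapId: Int): Unit = synchronized {
          mapStatuses(mapId) = null
          cachedSerializedStatuses = None
        }
      
        def findMissingPartitions(): Seq[Int] = synchronized {
          (0 until numMaps).filter(i => mapStatuses(i) == null)
        }
      }
      ```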
      
      /cc hvanhovell and rxin (whom I discussed this with offline), tgravescs (who recently worked on caching of serialized MapOutputStatuses), and kayousterhout and markhamstra (for scheduler changes).
      
      ## How was this patch tested?
      
      Existing tests. I purposely avoided making interface / API changes which would require significant updates or modifications to test code.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #17955 from JoshRosen/map-output-tracker-rewrite.
      3476390c
  3. Jun 09, 2017
    • guoxiaolong's avatar
      [SPARK-20997][CORE] driver-cores' standalone or Mesos or YARN in Cluster deploy mode only. · 82faacd7
      guoxiaolong authored
      ## What changes were proposed in this pull request?
      
      '--driver-cores' applies only to Standalone, Mesos, or YARN in cluster deploy mode, so the description of it in the spark-submit help output is not accurate.
      
      ## How was this patch tested?
      
      manual tests
      
      Author: guoxiaolong <guo.xiaolong1@zte.com.cn>
      Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
      Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>
      
      Closes #18241 from guoxiaolongzte/SPARK-20997.
      82faacd7
    • Joseph K. Bradley's avatar
      [SPARK-14408][CORE] Changed RDD.treeAggregate to use fold instead of reduce · 5a337188
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Previously, `RDD.treeAggregate` used `reduceByKey` and `reduce` in its implementation, neither of which technically allows the `seq`/`combOps` to modify and return their first arguments.
      
      This PR uses `foldByKey` and `fold` instead and notes that `aggregate` and `treeAggregate` are semantically identical in the Scala doc.
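      
      As a small, self-contained illustration of the property this enables (the RDD contents, parallelism and master URL below are arbitrary), a fold-based `treeAggregate` lets `seqOp`/`combOp` mutate and return their first argument:
      
      ```
      import org.apache.spark.{SparkConf, SparkContext}
      import scala.collection.mutable.ArrayBuffer
      
      val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("treeAggregate-demo"))
      val collected = sc.parallelize(1 to 1000, 10).treeAggregate(ArrayBuffer.empty[Int])(
        seqOp = (buf, x) => { buf += x; buf },   // may mutate and return its first argument
        combOp = (b1, b2) => { b1 ++= b2; b1 },  // safe with the fold-based implementation
        depth = 2)
      assert(collected.size == 1000)
      sc.stop()
      ```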
      
      Note that this initially had some test failures for unknown reasons. That was actually fixed in https://github.com/apache/spark/commit/e3554605b36bdce63ac180cc66dbdee5c1528ec7.
      
      The root cause was that the `zeroValue` now becomes an `AFTAggregator` whose `totalCnt` is 0; it starts merging one by one and keeps returning `this` with `totalCnt` still 0. So this does not look like a bug in the current change.
      
      This is now fixed in that commit, so the tests should pass.
      
      ## How was this patch tested?
      
      Test case added in `RDDSuite`.
      
      Closes #12217
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18198 from HyukjinKwon/SPARK-14408.
      5a337188
  4. Jun 08, 2017
    • Josh Rosen's avatar
      [SPARK-20863] Add metrics/instrumentation to LiveListenerBus · 2a23cdd0
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      This patch adds Coda Hale metrics for instrumenting the `LiveListenerBus` in order to track the number of events received, dropped, and processed. In addition, it adds per-SparkListener-subclass timers to track message processing time. This is useful for identifying when slow third-party SparkListeners cause performance bottlenecks.
      
      See the new `LiveListenerBusMetrics` for a complete description of the new metrics.
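      
      As a hedged illustration (the listener and job below are contrived; the exact metric names are described in `LiveListenerBusMetrics`), a slow third-party listener like this is what the new per-listener timers are intended to expose:
      
      ```
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
      
      // A deliberately slow listener: its processing time would show up in the new
      // per-SparkListener-subclass timers.
      class SlowListener extends SparkListener {
        override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
          Thread.sleep(100) // simulate expensive third-party work on every task-end event
        }
      }
      
      val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("listener-demo"))
      sc.addSparkListener(new SlowListener)
      sc.parallelize(1 to 100, 4).count() // generates task-end events for the listener
      sc.stop()
      ```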
      
      ## How was this patch tested?
      
      New tests in SparkListenerSuite, including a test to ensure proper counting of dropped listener events.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #18083 from JoshRosen/listener-bus-metrics.
      2a23cdd0
    • 10087686's avatar
      [SPARK-21006][TESTS] Create rpcEnv and run later needs shutdown and awaitTermination · 9be79458
      10087686 authored
      Signed-off-by: 10087686 <wang.jiaochun@zte.com.cn>
      
      ## What changes were proposed in this pull request?
      When running the test("port conflict") case, we need to call anotherEnv.shutdown() and anotherEnv.awaitTermination() to free its resources.
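      
      A minimal sketch of the cleanup pattern (createRpcEnv is the suite's helper and is assumed here):
      
      ```
      import org.apache.spark.SparkConf
      
      // Every RpcEnv created inside a test should be shut down and awaited so that its
      // resources (threads, ports) are released before the next test runs.
      val anotherEnv = createRpcEnv(new SparkConf(), "remote", 0) // helper from RpcEnvSuite (assumed)
      try {
        // ... exercise the "port conflict" scenario against the primary env ...
      } finally {
        anotherEnv.shutdown()
        anotherEnv.awaitTermination()
      }
      ```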
      
      ## How was this patch tested?
      Run the RpcEnvSuite unit tests.
      
      Author: 10087686 <wang.jiaochun@zte.com.cn>
      
      Closes #18226 from wangjiaochun/master.
      9be79458
  5. Jun 06, 2017
    • jinxing's avatar
      [SPARK-20985] Stop SparkContext using LocalSparkContext.withSpark · 44de108d
      jinxing authored
      ## What changes were proposed in this pull request?
      A SparkContext should always be stopped after use, so that other tests don't complain that only one `SparkContext` may exist.
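      
      A minimal sketch of the pattern (master/app name are illustrative): `LocalSparkContext.withSpark` stops the context in a finally block, so a test cannot leak it even if the body throws.
      
      ```
      import org.apache.spark.{LocalSparkContext, SparkConf, SparkContext}
      
      LocalSparkContext.withSpark(new SparkContext("local", "test", new SparkConf())) { sc =>
        assert(sc.parallelize(1 to 10).count() == 10)
      } // sc.stop() is called here even if the assertion fails
      ```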
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #18204 from jinxing64/SPARK-20985.
      44de108d
  6. Jun 05, 2017
    • jerryshao's avatar
      [SPARK-20981][SPARKSUBMIT] Add new configuration spark.jars.repositories as... · 06c05441
      jerryshao authored
      [SPARK-20981][SPARKSUBMIT] Add new configuration spark.jars.repositories as equivalence of --repositories
      
      ## What changes were proposed in this pull request?
      
      In our use case of launching Spark applications via REST APIs (Livy), there's no way for the user to specify command-line arguments; all Spark configurations are set through a configuration map. Because "--repositories" has no equivalent Spark configuration, a custom repository cannot be specified that way.
      
      So this PR proposes adding a configuration equivalent to "--repositories" in Spark.
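      
      A minimal sketch of how this could be used (the artifact coordinate and repository URL are hypothetical):
      
      ```
      import org.apache.spark.SparkConf
      
      val conf = new SparkConf()
        .set("spark.jars.packages", "com.example:mylib:1.0.0")            // hypothetical artifact
        .set("spark.jars.repositories", "https://repo.example.com/maven") // equivalent of --repositories
      ```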
      
      ## How was this patch tested?
      
      New UT added.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #18201 from jerryshao/SPARK-20981.
      06c05441
    • liupengcheng's avatar
      [SPARK-20945] Fix TID key not found in TaskSchedulerImpl · 2d39711b
      liupengcheng authored
      ## What changes were proposed in this pull request?
      
      This pull request fixes a TaskSchedulerImpl bug that occurs under certain conditions.
      For details, see:
      https://issues.apache.org/jira/browse/SPARK-20945
      
      ## How was this patch tested?
      manual tests
      
      Author: liupengcheng <liupengcheng@xiaomi.com>
      Author: PengchengLiu <pengchengliu_bupt@163.com>
      
      Closes #18171 from liupc/Fix-tid-key-not-found-in-TaskSchedulerImpl.
      2d39711b
  7. Jun 03, 2017
    • zuotingbing's avatar
      [SPARK-20936][CORE] Lack of an important case about the test of resolveURI in... · 887cf0ec
      zuotingbing authored
      [SPARK-20936][CORE] Lack of an important case about the test of resolveURI in UtilsSuite, and add it as needed.
      
      ## What changes were proposed in this pull request?
      1. Add `assert(resolve(before) === after)` to check `before` against `after` in the resolveURI tests (see the sketch after this list).
      The helper `assertResolves(before: String, after: String)` takes two parameters, meaning we should check that the resolved `before` value equals the `after` value we expect.
      e.g. the resolved value of Utils.resolveURI("hdfs:///root/spark.jar#app.jar").toString should be "hdfs:///root/spark.jar#app.jar" rather than "hdfs:/root/spark.jar#app.jar"; adding `assert(resolve(before) === after)` makes this safer.
      2. Align the cases covered by resolveURI and resolveURIs.
      3. Delete duplicate cases, plus some small fixes to make the suite clearer.
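      
      A minimal sketch of the strengthened check from point 1 (run inside `UtilsSuite`, where `Utils` is accessible):
      
      ```
      import org.apache.spark.util.Utils
      
      // The resolved URI must keep the "hdfs:///" form and the fragment intact,
      // not collapse to "hdfs:/root/spark.jar#app.jar".
      assert(Utils.resolveURI("hdfs:///root/spark.jar#app.jar").toString ==
        "hdfs:///root/spark.jar#app.jar")
      ```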
      
      ## How was this patch tested?
      
      unit tests
      
      Author: zuotingbing <zuo.tingbing9@zte.com.cn>
      
      Closes #18158 from zuotingbing/spark-UtilsSuite.
      887cf0ec
  8. Jun 02, 2017
  9. Jun 01, 2017
    • Dongjoon Hyun's avatar
      [SPARK-20708][CORE] Make `addExclusionRules` up-to-date · 34661d8a
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Since [SPARK-9263](https://issues.apache.org/jira/browse/SPARK-9263), `resolveMavenCoordinates` ignores Spark and Spark's dependencies by using `addExclusionRules`. This PR aims to bring [addExclusionRules](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L956-L974) up to date so the exclusions work correctly, because it currently fails to exclude some components, such as the following.
      
      **mllib (correct)**
      ```
      $ bin/spark-shell --packages org.apache.spark:spark-mllib_2.11:2.1.1
      ...
      ---------------------------------------------------------------------
      |                  |            modules            ||   artifacts   |
      |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
      ---------------------------------------------------------------------
      |      default     |   0   |   0   |   0   |   0   ||   0   |   0   |
      ---------------------------------------------------------------------
      ```
      
      **mllib-local (wrong)**
      ```
      $ bin/spark-shell --packages org.apache.spark:spark-mllib-local_2.11:2.1.1
      ...
      ---------------------------------------------------------------------
      |                  |            modules            ||   artifacts   |
      |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
      ---------------------------------------------------------------------
      |      default     |   15  |   2   |   2   |   0   ||   15  |   2   |
      ---------------------------------------------------------------------
      ```
      
      ## How was this patch tested?
      
      Pass Jenkins with an updated test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #17947 from dongjoon-hyun/SPARK-20708.
      34661d8a
    • jerryshao's avatar
      [SPARK-20244][CORE] Handle incorrect bytesRead metrics when using PySpark · 5854f77c
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      Hadoop FileSystem's statistics are based on thread-local variables, which is fine when the RDD computation chain runs in a single thread. But if a child RDD creates another thread to consume the iterator obtained from a Hadoop RDD, the bytesRead computation will be wrong, because the iterator's `next()` and `close()` may then run in different threads. This can happen when using PySpark with PythonRDD.
      
      So this PR builds a map to track `bytesRead` per thread and sums the values together (a rough sketch follows). This approach applies to three RDDs: `HadoopRDD`, `NewHadoopRDD` and `FileScanRDD`. I assume `FileScanRDD` cannot be called directly, so only `HadoopRDD` and `NewHadoopRDD` are fixed here.
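      
      A rough sketch of the per-thread accounting idea (not the actual patch; names are illustrative):
      
      ```
      object BytesReadTracker {
        private val bytesReadPerThread = scala.collection.mutable.HashMap.empty[Long, Long]
      
        // Called from whichever thread actually reads the Hadoop input.
        def recordBytesRead(bytes: Long): Unit = synchronized {
          val tid = Thread.currentThread().getId
          bytesReadPerThread(tid) = bytesReadPerThread.getOrElse(tid, 0L) + bytes
        }
      
        // Reported metric: the sum over every thread that touched the input.
        def totalBytesRead: Long = synchronized { bytesReadPerThread.values.sum }
      }
      ```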
      
      ## How was this patch tested?
      
      Unit test and local cluster verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17617 from jerryshao/SPARK-20244.
      5854f77c
  10. May 31, 2017
  11. May 30, 2017
    • jerryshao's avatar
      [SPARK-20275][UI] Do not display "Completed" column for in-progress applications · 52ed9b28
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      Currently the HistoryServer displays the completed date of an in-progress application as `1969-12-31 23:59:59`, which is not meaningful. Instead of showing this incorrect completion date, this PR proposes hiding the column for in-progress applications.
      
      The reason for only hiding the column rather than deleting the field is that this data is fetched through the REST API, whose format is shown below, in which `endTime` matches `endTimeEpoch`. So instead of changing the REST API and breaking backward compatibility, the simple solution chosen here is to hide the column.
      
      ```
      [ {
        "id" : "local-1491805439678",
        "name" : "Spark shell",
        "attempts" : [ {
          "startTime" : "2017-04-10T06:23:57.574GMT",
          "endTime" : "1969-12-31T23:59:59.999GMT",
          "lastUpdated" : "2017-04-10T06:23:57.574GMT",
          "duration" : 0,
          "sparkUser" : "",
          "completed" : false,
          "startTimeEpoch" : 1491805437574,
          "endTimeEpoch" : -1,
          "lastUpdatedEpoch" : 1491805437574
        } ]
      } ]
      ```
      
      Here is the UI before the change:
      
      <img width="1317" alt="screen shot 2017-04-10 at 3 45 57 pm" src="https://cloud.githubusercontent.com/assets/850797/24851938/17d46cc0-1e08-11e7-84c7-90120e171b41.png">
      
      And after:
      
      <img width="1281" alt="screen shot 2017-04-10 at 4 02 35 pm" src="https://cloud.githubusercontent.com/assets/850797/24851945/1fe9da58-1e08-11e7-8d0d-9262324f9074.png">
      
      ## How was this patch tested?
      
      Manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17588 from jerryshao/SPARK-20275.
      52ed9b28
    • jinxing's avatar
      [SPARK-20333] HashPartitioner should be compatible with num of child RDD's partitions. · de953c21
      jinxing authored
      ## What changes were proposed in this pull request?
      
      Fix test
      "don't submit stage until its dependencies map outputs are registered (SPARK-5259)" ,
      "run trivial shuffle with out-of-band executor failure and retry",
      "reduce tasks should be placed locally with map output"
      in DAGSchedulerSuite.
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #17634 from jinxing64/SPARK-20333.
      de953c21
  12. May 27, 2017
  13. May 26, 2017
    • Wenchen Fan's avatar
      [SPARK-19659][CORE][FOLLOW-UP] Fetch big blocks to disk when shuffle-read · 1d62f8ac
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR includes some minor improvement for the comments and tests in https://github.com/apache/spark/pull/16989
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18117 from cloud-fan/follow.
      1d62f8ac
    • Yu Peng's avatar
      [SPARK-10643][CORE] Make spark-submit download remote files to local in client mode · 4af37812
      Yu Peng authored
      ## What changes were proposed in this pull request?
      
      This PR makes spark-submit script download remote files to local file system for local/standalone client mode.
      
      ## How was this patch tested?
      
      - Unit tests
      - Manual tests by adding s3a jar and testing against file on s3.
      
      Author: Yu Peng <loneknightpy@gmail.com>
      
      Closes #18078 from loneknightpy/download-jar-in-spark-submit.
      4af37812
    • Sital Kedia's avatar
      [SPARK-20014] Optimize mergeSpillsWithFileStream method · 473d7552
      Sital Kedia authored
      ## What changes were proposed in this pull request?
      
      When the individual partition size in a spill is small, the mergeSpillsWithTransferTo method does many small disk I/Os, which is quite inefficient. One way to improve performance is to use the mergeSpillsWithFileStream method instead, turning off transferTo and using buffered file reads/writes to improve I/O throughput.
      However, the current implementation of mergeSpillsWithFileStream does not buffer its file reads/writes, and in addition it unnecessarily flushes the output file for each partition.
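      
      A minimal sketch of the buffering idea (the real code lives in UnsafeShuffleWriter and is Java; the file names here are hypothetical):
      
      ```
      import java.io.{BufferedInputStream, BufferedOutputStream, File, FileInputStream, FileOutputStream}
      
      // Buffered streams turn many tiny reads/writes into large sequential ones, and the
      // merged output is flushed once at the end rather than once per partition.
      val spillFiles = Seq(new File("spill-0.data"), new File("spill-1.data"))
      val mergedOut = new BufferedOutputStream(new FileOutputStream(new File("merged.data")), 1 << 20)
      try {
        spillFiles.foreach { spill =>
          val in = new BufferedInputStream(new FileInputStream(spill), 1 << 20)
          try {
            val buf = new Array[Byte](1 << 16)
            var n = in.read(buf)
            while (n != -1) { mergedOut.write(buf, 0, n); n = in.read(buf) }
          } finally in.close()
        }
      } finally mergedOut.close()
      ```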
      
      ## How was this patch tested?
      
      Tested this change by running a job on the cluster and the map stage run time was reduced by around 20%.
      
      Author: Sital Kedia <skedia@fb.com>
      
      Closes #17343 from sitalkedia/upstream_mergeSpillsWithFileStream.
      473d7552
    • 10129659's avatar
      [SPARK-20835][CORE] It should exit directly when the --total-executor-cores... · 0fd84b05
      10129659 authored
      [SPARK-20835][CORE] It should exit directly when the --total-executor-cores parameter is set to less than 0 when submitting an application
      
      ## What changes were proposed in this pull request?
      In my test, the submitted app ran without an error when --total-executor-cores was less than 0,
      and only gave the warning:
      "2017-05-22 17:19:36,319 WARN org.apache.spark.scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources";
      
      It should exit directly when the --total-executor-cores parameter is set to less than 0 when submitting an application.
      
      ## How was this patch tested?
      Run the unit tests.
      
      Author: 10129659 <chen.yanshan@zte.com.cn>
      
      Closes #18060 from eatoncys/totalcores.
      0fd84b05
    • Wenchen Fan's avatar
      [SPARK-20887][CORE] support alternative keys in ConfigBuilder · 629f38e1
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      `ConfigBuilder` builds a `ConfigEntry` that can only read its value from a single key, so if we want to rename a config while still honoring the old name, it's hard to do.
      
      This PR introduces `ConfigBuilder.withAlternative` to support reading the config value from alternative keys, and also uses this feature to rename `spark.scheduler.listenerbus.eventqueue.size` to `spark.scheduler.listenerbus.eventqueue.capacity`, per https://github.com/apache/spark/pull/14269#discussion_r118432313
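      
      A minimal sketch of how the new builder method could be used (the default value here is illustrative):
      
      ```
      import org.apache.spark.internal.config.ConfigBuilder
      
      // The entry is read from the new key, falling back to the old (alternative) key.
      val LISTENER_BUS_EVENT_QUEUE_CAPACITY =
        ConfigBuilder("spark.scheduler.listenerbus.eventqueue.capacity")
          .withAlternative("spark.scheduler.listenerbus.eventqueue.size")
          .intConf
          .createWithDefault(10000)
      ```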
      
      ## How was this patch tested?
      
      a new test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18110 from cloud-fan/config.
      629f38e1
    • Wenchen Fan's avatar
      [SPARK-20868][CORE] UnsafeShuffleWriter should verify the position after FileChannel.transferTo · d9ad7890
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Long ago we fixed a [bug](https://issues.apache.org/jira/browse/SPARK-3948) in the shuffle writer related to `FileChannel.transferTo`. We were not very confident about that fix, so we added a position check after the write to try to discover the bug earlier.
      
      However, this check is missing in the new `UnsafeShuffleWriter`; this PR adds it.
      
      https://issues.apache.org/jira/browse/SPARK-18105 may be related to that `FileChannel.transferTo` bug; hopefully we can find the root cause after adding this position check.
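      
      A minimal sketch of the kind of check being added (variable names are illustrative, not the actual UnsafeShuffleWriter code):
      
      ```
      import java.nio.channels.FileChannel
      
      // After transferTo, the target channel's position should have advanced by exactly
      // `count` bytes; if not, the kernel transferTo bug may have been hit.
      def transferFullyWithCheck(src: FileChannel, dst: FileChannel, count: Long): Unit = {
        val initialPos = dst.position()
        var transferred = 0L
        while (transferred < count) {
          transferred += src.transferTo(transferred, count - transferred, dst)
        }
        val expectedPos = initialPos + count
        assert(dst.position() == expectedPos,
          s"Position ${dst.position()} does not equal expected $expectedPos after transferTo")
      }
      ```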
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18091 from cloud-fan/shuffle.
      d9ad7890
  14. May 25, 2017
    • hyukjinkwon's avatar
      [SPARK-19707][SPARK-18922][TESTS][SQL][CORE] Fix test failures/the invalid... · e9f983df
      hyukjinkwon authored
      [SPARK-19707][SPARK-18922][TESTS][SQL][CORE] Fix test failures/the invalid path check for sc.addJar on Windows
      
      ## What changes were proposed in this pull request?
      
      This PR proposes two things:
      
      - A follow up for SPARK-19707 (Improving the invalid path check for sc.addJar on Windows as well).
      
      ```
      org.apache.spark.SparkContextSuite:
       - add jar with invalid path *** FAILED *** (32 milliseconds)
         2 was not equal to 1 (SparkContextSuite.scala:309)
         ...
      ```
      
      - Fix path vs URI related test failures on Windows.
      
      ```
      org.apache.spark.storage.LocalDirsSuite:
       - SPARK_LOCAL_DIRS override also affects driver *** FAILED *** (0 milliseconds)
         new java.io.File("/NONEXISTENT_PATH").exists() was true (LocalDirsSuite.scala:50)
         ...
      
       - Utils.getLocalDir() throws an exception if any temporary directory cannot be retrieved *** FAILED *** (15 milliseconds)
         Expected exception java.io.IOException to be thrown, but no exception was thrown. (LocalDirsSuite.scala:64)
         ...
      ```
      
      ```
      org.apache.spark.sql.hive.HiveSchemaInferenceSuite:
       - orc: schema should be inferred and saved when INFER_AND_SAVE is specified *** FAILED *** (203 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-dae61ab3-a851-4dd3-bf4e-be97c501f254
         ...
      
       - parquet: schema should be inferred and saved when INFER_AND_SAVE is specified *** FAILED *** (203 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-fa3aff89-a66e-4376-9a37-2a9b87596939
         ...
      
       - orc: schema should be inferred but not stored when INFER_ONLY is specified *** FAILED *** (141 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-fb464e59-b049-481b-9c75-f53295c9fc2c
         ...
      
       - parquet: schema should be inferred but not stored when INFER_ONLY is specified *** FAILED *** (125 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-9487568e-80a4-42b3-b0a5-d95314c4ccbc
         ...
      
       - orc: schema should not be inferred when NEVER_INFER is specified *** FAILED *** (156 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-0d2dfa45-1b0f-4958-a8be-1074ed0135a
         ...
      
       - parquet: schema should not be inferred when NEVER_INFER is specified *** FAILED *** (547 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-6d95d64e-613e-4a59-a0f6-d198c5aa51ee
         ...
      ```
      
      ```
      org.apache.spark.sql.execution.command.DDLSuite:
       - create temporary view using *** FAILED *** (15 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-3881d9ca-561b-488d-90b9-97587472b853	mp;
         ...
      
       - insert data to a data source table which has a non-existing location should succeed *** FAILED *** (109 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-4cad3d19-6085-4b75-b407-fe5e9d21df54 did not equal file:///C:/projects/spark/target/tmp/spark-4cad3d19-6085-4b75-b407-fe5e9d21df54 (DDLSuite.scala:1869)
         ...
      
       - insert into a data source table with a non-existing partition location should succeed *** FAILED *** (94 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-4b52e7de-e3aa-42fd-95d4-6d4d58d1d95d did not equal file:///C:/projects/spark/target/tmp/spark-4b52e7de-e3aa-42fd-95d4-6d4d58d1d95d (DDLSuite.scala:1910)
         ...
      
       - read data from a data source table which has a non-existing location should succeed *** FAILED *** (93 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-f8c281e2-08c2-4f73-abbf-f3865b702c34 did not equal file:///C:/projects/spark/target/tmp/spark-f8c281e2-08c2-4f73-abbf-f3865b702c34 (DDLSuite.scala:1937)
         ...
      
       - read data from a data source table with non-existing partition location should succeed *** FAILED *** (110 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - create datasource table with a non-existing location *** FAILED *** (94 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-387316ae-070c-4e78-9b78-19ebf7b29ec8 did not equal file:///C:/projects/spark/target/tmp/spark-387316ae-070c-4e78-9b78-19ebf7b29ec8 (DDLSuite.scala:1982)
         ...
      
       - CTAS for external data source table with a non-existing location *** FAILED *** (16 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - CTAS for external data source table with a existed location *** FAILED *** (15 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - data source table:partition column name containing a b *** FAILED *** (125 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - data source table:partition column name containing a:b *** FAILED *** (143 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - data source table:partition column name containing a%b *** FAILED *** (109 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - data source table:partition column name containing a,b *** FAILED *** (109 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - location uri contains a b for datasource table *** FAILED *** (94 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-5739cda9-b702-4e14-932c-42e8c4174480a%20b did not equal file:///C:/projects/spark/target/tmp/spark-5739cda9-b702-4e14-932c-42e8c4174480/a%20b (DDLSuite.scala:2084)
         ...
      
       - location uri contains a:b for datasource table *** FAILED *** (78 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-9bdd227c-840f-4f08-b7c5-4036638f098da:b did not equal file:///C:/projects/spark/target/tmp/spark-9bdd227c-840f-4f08-b7c5-4036638f098d/a:b (DDLSuite.scala:2084)
         ...
      
       - location uri contains a%b for datasource table *** FAILED *** (78 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-62bb5f1d-fa20-460a-b534-cb2e172a3640a%25b did not equal file:///C:/projects/spark/target/tmp/spark-62bb5f1d-fa20-460a-b534-cb2e172a3640/a%25b (DDLSuite.scala:2084)
         ...
      
       - location uri contains a b for database *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - location uri contains a:b for database *** FAILED *** (15 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - location uri contains a%b for database *** FAILED *** (0 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      ```
      
      ```
      org.apache.spark.sql.hive.execution.HiveDDLSuite:
       - create hive table with a non-existing location *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - CTAS for external hive table with a non-existing location *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - CTAS for external hive table with a existed location *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - partition column name of parquet table containing a b *** FAILED *** (156 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - partition column name of parquet table containing a:b *** FAILED *** (94 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - partition column name of parquet table containing a%b *** FAILED *** (125 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - partition column name of parquet table containing a,b *** FAILED *** (110 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - partition column name of hive table containing a b *** FAILED *** (15 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - partition column name of hive table containing a:b *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - partition column name of hive table containing a%b *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - partition column name of hive table containing a,b *** FAILED *** (0 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - hive table: location uri contains a b *** FAILED *** (0 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - hive table: location uri contains a:b *** FAILED *** (0 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - hive table: location uri contains a%b *** FAILED *** (0 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      ```
      
      ```
      org.apache.spark.sql.sources.PathOptionSuite:
       - path option also exist for write path *** FAILED *** (94 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-2870b281-7ac0-43d6-b6b6-134e01ab6fdc did not equal file:///C:/projects/spark/target/tmp/spark-2870b281-7ac0-43d6-b6b6-134e01ab6fdc (PathOptionSuite.scala:98)
         ...
      ```
      
      ```
      org.apache.spark.sql.CachedTableSuite:
       - SPARK-19765: UNCACHE TABLE should un-cache all cached plans that refer to this table *** FAILED *** (110 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      ```
      
      ```
      org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite:
       - treeString is redacted *** FAILED *** (250 milliseconds)
         "file:/C:/projects/spark/target/tmp/spark-3ecc1fa4-3e76-489c-95f4-f0b0500eae28" did not contain "C:\projects\spark\target\tmp\spark-3ecc1fa4-3e76-489c-95f4-f0b0500eae28" (DataSourceScanExecRedactionSuite.scala:46)
         ...
      ```
      
      ## How was this patch tested?
      
      Tested via AppVeyor for each and checked it passed once each. These should be retested via AppVeyor in this PR.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17987 from HyukjinKwon/windows-20170515.
      e9f983df
    • jinxing's avatar
      [SPARK-19659] Fetch big blocks to disk when shuffle-read. · 3f94e64a
      jinxing authored
      ## What changes were proposed in this pull request?
      
      Currently the whole block is fetched into memory (off-heap by default) during shuffle read. A block is identified by (shuffleId, mapId, reduceId), so it can be large in skewed situations. If an OOM happens during shuffle read, the job is killed and users are advised to "Consider boosting spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more memory can resolve the OOM, but this approach is not well suited to production environments, especially data warehouses.
      When using Spark SQL as the data engine in a warehouse, users want a unified parameter (e.g. memory) with little wasted resource (allocated but not used). This is especially true when migrating the data engine to Spark from another one (e.g. Hive); tuning the parameter for thousands of SQL queries one by one is very time-consuming.
      It's not always easy to predict skew; when it happens, it makes sense to fetch remote blocks to disk for the shuffle read rather than kill the job because of an OOM.
      
      In this PR, I propose fetching big blocks to disk (which is also mentioned in SPARK-3019):
      
      1. Track the average size and also the outliers (blocks larger than 2*avgSize) in MapStatus;
      2. Request memory from `MemoryManager` before fetching blocks, and release the memory back to `MemoryManager` when the `ManagedBuffer` is released;
      3. Fetch remote blocks to disk when acquiring memory from `MemoryManager` fails, otherwise fetch them into memory.
      
      This is an improvement in memory control when shuffling blocks and helps to avoid OOM in scenarios like the following (a rough sketch of the decision follows this list):
      1. A single huge block;
      2. The sizes of many blocks are underestimated in `MapStatus`, so the actual footprint of the blocks is much larger than estimated.
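      
      A rough sketch of the decision described above (not the actual ShuffleBlockFetcherIterator code):
      
      ```
      // The tracked (now more accurate) block size is used to request execution memory;
      // if the request cannot be satisfied, the remote block is streamed to local disk.
      def chooseDestination(estimatedBlockSize: Long, tryAcquireMemory: Long => Boolean): String = {
        if (tryAcquireMemory(estimatedBlockSize)) "memory" else "disk"
      }
      ```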
      
      ## How was this patch tested?
      Added unit test in `MapStatusSuite` and `ShuffleBlockFetcherIteratorSuite`.
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #16989 from jinxing64/SPARK-19659.
      3f94e64a
    • Xianyang Liu's avatar
      [SPARK-20250][CORE] Improper OOM error when a task been killed while spilling data · 731462a0
      Xianyang Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, when a task is calling spill() but receives a kill request from the driver (e.g., for a speculative task), the `TaskMemoryManager` throws an `OOM` exception. We also don't catch fatal exceptions when the error is caused by `Thread.interrupt`. So for a `ClosedByInterruptException`, we should throw a `RuntimeException` instead of an `OutOfMemoryError`.
      
      https://issues.apache.org/jira/browse/SPARK-20250?jql=project%20%3D%20SPARK
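      
      An illustrative sketch of the distinction (the real change is in TaskMemoryManager, which is Java):
      
      ```
      import java.nio.channels.ClosedByInterruptException
      
      // If the spill fails because the task was interrupted/killed, surface a RuntimeException
      // rather than misreporting the failure as an OutOfMemoryError.
      def spillOrFail(spill: () => Long): Long = {
        try {
          spill()
        } catch {
          case e: ClosedByInterruptException =>
            // the task was killed (e.g. a redundant speculative attempt), not out of memory
            throw new RuntimeException("Spill was interrupted because the task was killed", e)
        }
      }
      ```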
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Xianyang Liu <xianyang.liu@intel.com>
      
      Closes #18090 from ConeyLiu/SPARK-20250.
      731462a0
  15. May 24, 2017
    • Marcelo Vanzin's avatar
      [SPARK-20205][CORE] Make sure StageInfo is updated before sending event. · 95aef660
      Marcelo Vanzin authored
      The DAGScheduler was sending a "stage submitted" event before it properly
      updated the event's information. This meant that a listener (e.g. the
      event logging listener) could record wrong information about the event.
      
      This change sets the stage's submission time before the event is submitted,
      when there are tasks to be executed in the stage.
      
      Tested with existing unit tests.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #17925 from vanzin/SPARK-20205.
      95aef660
    • Xingbo Jiang's avatar
      [SPARK-18406][CORE] Race between end-of-task and completion iterator read lock release · d76633e3
      Xingbo Jiang authored
      ## What changes were proposed in this pull request?
      
      When a TaskContext is not propagated properly to all child threads of a task, as in the cases reported in this issue, we fail to get the TID from the TaskContext; that makes it impossible to release the lock and causes assertion failures. To resolve this, we have to explicitly pass the TID value to the `unlock` method.
      
      ## How was this patch tested?
      
      Add new failing regression test case in `RDDSuite`.
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #18076 from jiangxb1987/completion-iterator.
      d76633e3
  16. May 22, 2017
    • James Shuster's avatar
      [SPARK-20815][SPARKR] NullPointerException in RPackageUtils#checkManifestForR · 4dbb63f0
      James Shuster authored
      ## What changes were proposed in this pull request?
      
      - Add a null check to RPackageUtils#checkManifestForR so that jars w/o manifests don't NPE.
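      
      An illustrative sketch of the guard (not the exact patch; the manifest attribute name is assumed): `JarFile.getManifest` returns null for jars without a manifest, so check it before reading attributes.
      
      ```
      import java.util.jar.JarFile
      
      def jarHasRPackage(jar: JarFile): Boolean = {
        val manifest = jar.getManifest // null when the jar has no manifest
        manifest != null &&
          "true".equalsIgnoreCase(manifest.getMainAttributes.getValue("Spark-HasRPackage"))
      }
      ```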
      
      ## How was this patch tested?
      
      - Unit tests and manual tests.
      
      Author: James Shuster <jshuster@palantir.com>
      
      Closes #18040 from jrshust/feature/r-package-utils.
      4dbb63f0
    • jinxing's avatar
      [SPARK-20801] Record accurate size of blocks in MapStatus when it's above threshold. · 2597674b
      jinxing authored
      ## What changes were proposed in this pull request?
      
      Currently, when the number of reducers is above 2000, HighlyCompressedMapStatus is used to store block sizes. In HighlyCompressedMapStatus, only the average size is stored for non-empty blocks, which is not good for memory control when we shuffle blocks. It makes sense to store the accurate size of a block when it's above a threshold.
      
      ## How was this patch tested?
      
      Added test in MapStatusSuite.
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #18031 from jinxing64/SPARK-20801.
      2597674b
    • John Lee's avatar
      [SPARK-20813][WEB UI] Fixed Web UI executor page tab search by status not working · aea73be1
      John Lee authored
      ## What changes were proposed in this pull request?
      On the status column of the table, I removed the condition that forced only the display value to take the values Active, Blacklisted and Dead.
      
      Before the removal, the values used for sorting and filtering that column were True and False.
      ## How was this patch tested?
      
      Tested with Active, Blacklisted and Dead present as current status.
      
      Author: John Lee <jlee2@yahoo-inc.com>
      
      Closes #18036 from yoonlee95/SPARK-20813.
      aea73be1
    • caoxuewen's avatar
      [SPARK-20609][CORE] Run the SortShuffleSuite unit tests have residual spark_* system directory · f1ffc6e7
      caoxuewen authored
      ## What changes were proposed in this pull request?
      This PR fixes the residual spark_* system directories left behind when running the SortShuffleSuite unit tests.
      For example, on Windows 7, after running the SortShuffleSuite unit tests, the system TMP directory still contains
      '..\AppData\Local\Temp\spark-f64121f9-11b4-4ffd-a4f0-cfca66643503', which was not deleted.
      
      ## How was this patch tested?
      Run SortShuffleSuite unit test.
      
      Author: caoxuewen <cao.xuewen@zte.com.cn>
      
      Closes #17869 from heary-cao/SortShuffleSuite.
      f1ffc6e7
    • fjh100456's avatar
      [SPARK-20591][WEB UI] Succeeded tasks num not equal in all jobs page and job... · 190d8b0b
      fjh100456 authored
      [SPARK-20591][WEB UI] Succeeded tasks num not equal in all jobs page and job detail page on spark web ui when speculative task(s) exist.
      
      ## What changes were proposed in this pull request?
      
      Modified the succeeded-task count on the job detail page from "completed = stageData.completedIndices.size" to "completed = stageData.numCompleteTasks", which makes the succeeded-task counts on the all-jobs page and the job detail page consistent and makes it easier to find which stages the speculative task(s) were in.
      
      ## How was this patch tested?
      
      manual tests
      
      Author: fjh100456 <fu.jinhua6@zte.com.cn>
      
      Closes #17923 from fjh100456/master.
      190d8b0b
  17. May 19, 2017
    • caoxuewen's avatar
      [SPARK-20607][CORE] Add new unit tests to ShuffleSuite · f398640d
      caoxuewen authored
      ## What changes were proposed in this pull request?
      
      This PR makes two updates:
      1. Adds new unit tests verifying that when there is no shuffle stage,
         shuffle does not generate the data file or the index files.
      2. Modifies the '[SPARK-4085] rerun map stage if reduce stage cannot find its local shuffle file' unit test:
         the parallelism is 1 rather than 2, and the index file is checked and deleted.
      
      ## How was this patch tested?
      The new unit test.
      
      Author: caoxuewen <cao.xuewen@zte.com.cn>
      
      Closes #17868 from heary-cao/ShuffleSuite.
      f398640d
  18. May 17, 2017
    • Shixiong Zhu's avatar
      [SPARK-13747][CORE] Add ThreadUtils.awaitReady and disallow Await.ready · 324a904d
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Add `ThreadUtils.awaitReady` similar to `ThreadUtils.awaitResult` and disallow `Await.ready`.
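      
      A minimal sketch of the intended usage (assuming `awaitReady` mirrors `awaitResult`'s `(Awaitable, Duration)` signature):
      
      ```
      import scala.concurrent.Future
      import scala.concurrent.duration._
      import scala.concurrent.ExecutionContext.Implicits.global
      import org.apache.spark.util.ThreadUtils
      
      val f: Future[Int] = Future { 40 + 2 }
      // Preferred over Await.ready(f, 10.seconds): failures are handled consistently and
      // the scalastyle ban on Await.ready is satisfied.
      ThreadUtils.awaitReady(f, 10.seconds)
      ```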
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17763 from zsxwing/awaitready.
      324a904d
    • Shixiong Zhu's avatar
      [SPARK-20788][CORE] Fix the Executor task reaper's false alarm warning logs · f8e0f0f4
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      The executor task reaper may fail to detect whether a task has finished when the task is finishing and being killed at the same time.
      
      The fix is simple: just flip the "finished" flag when a task completes successfully.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #18021 from zsxwing/SPARK-20788.
      f8e0f0f4