  1. Dec 19, 2016
    • Wenchen Fan's avatar
      [SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor the error checking when... · f923c849
      Wenchen Fan authored
      [SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor the error checking when append data to an existing table
      
      ## What changes were proposed in this pull request?
      
      When we append data to an existing table with `DataFrameWriter.saveAsTable`, we will do various checks to make sure the appended data is consistent with the existing data.
      
However, we get the information about the existing table by matching the table relation instead of looking at the table metadata. This is error-prone: for example, we only check the number of columns for `HadoopFsRelation`, and we forget to check bucketing, etc.
      
This PR refactors the error checking to look at the metadata of the existing table, and fixes several bugs (a hedged reproduction sketch of the bucketing case follows this list):
* SPARK-18899: We forgot to check whether the specified bucketing matches the existing table's, which may lead to a problematic table that has different bucketing in different data files.
* SPARK-18912: We forgot to check the number of columns for non-file-based data source tables.
* SPARK-18913: We didn't support appending data to a table with special column names.
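A hedged reproduction sketch of the bucketing case (SPARK-18899), assuming a session with Hive support; the table and column names are illustrative:

```scala
// Illustrative only: the append specifies different bucketing than the existing
// table; with this refactoring the mismatch is reported as an error instead of
// silently writing data files with inconsistent bucketing.
val df = spark.range(10).selectExpr("id", "id % 3 AS key")

df.write.bucketBy(4, "key").saveAsTable("t")                  // creates table `t` with 4 buckets
df.write.mode("append").bucketBy(8, "key").saveAsTable("t")   // mismatched bucketing: rejected
```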
      
      ## How was this patch tested?
      new regression test.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16313 from cloud-fan/bug1.
      f923c849
    • Josh Rosen's avatar
      [SPARK-18761][CORE] Introduce "task reaper" to oversee task killing in executors · fa829ce2
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      Spark's current task cancellation / task killing mechanism is "best effort" because some tasks may not be interruptible or may not respond to their "killed" flags being set. If a significant fraction of a cluster's task slots are occupied by tasks that have been marked as killed but remain running then this can lead to a situation where new jobs and tasks are starved of resources that are being used by these zombie tasks.
      
      This patch aims to address this problem by adding a "task reaper" mechanism to executors. At a high-level, task killing now launches a new thread which attempts to kill the task and then watches the task and periodically checks whether it has been killed. The TaskReaper will periodically re-attempt to call `TaskRunner.kill()` and will log warnings if the task keeps running. I modified TaskRunner to rename its thread at the start of the task, allowing TaskReaper to take a thread dump and filter it in order to log stacktraces from the exact task thread that we are waiting to finish. If the task has not stopped after a configurable timeout then the TaskReaper will throw an exception to trigger executor JVM death, thereby forcibly freeing any resources consumed by the zombie tasks.
      
      This feature is flagged off by default and is controlled by four new configurations under the `spark.task.reaper.*` namespace. See the updated `configuration.md` doc for details.
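A hedged sketch of enabling the feature, assuming the configuration keys documented in `configuration.md` (only two of the four settings are shown here):

```scala
import org.apache.spark.SparkConf

// Off by default; turn the reaper on and bound how long a killed task may keep
// running before the executor JVM is forcibly terminated.
val conf = new SparkConf()
  .set("spark.task.reaper.enabled", "true")
  .set("spark.task.reaper.killTimeout", "120s")
```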
      
      ## How was this patch tested?
      
      Tested via a new test case in `JobCancellationSuite`, plus manual testing.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #16189 from JoshRosen/cancellation.
      fa829ce2
    • Josh Rosen's avatar
      [SPARK-18928] Check TaskContext.isInterrupted() in FileScanRDD, JDBCRDD & UnsafeSorter · 5857b9ac
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      In order to respond to task cancellation, Spark tasks must periodically check `TaskContext.isInterrupted()`, but this check is missing on a few critical read paths used in Spark SQL, including `FileScanRDD`, `JDBCRDD`, and UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to continue running and become zombies (as also described in #16189).
      
      This patch aims to fix this problem by adding `TaskContext.isInterrupted()` checks to these paths. Note that I could have used `InterruptibleIterator` to simply wrap a bunch of iterators but in some cases this would have an adverse performance penalty or might not be effective due to certain special uses of Iterators in Spark SQL. Instead, I inlined `InterruptibleIterator`-style logic into existing iterator subclasses.
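A simplified sketch of the inlined check (not the actual `FileScanRDD`/`JDBCRDD` code):

```scala
import org.apache.spark.{TaskContext, TaskKilledException}

// Wraps an iterator and polls the kill flag on each hasNext call, mirroring what
// InterruptibleIterator does, but inlined to avoid extra wrapping in hot paths.
class CancellableIterator[T](context: TaskContext, underlying: Iterator[T]) extends Iterator[T] {
  override def hasNext: Boolean = {
    if (context.isInterrupted()) throw new TaskKilledException
    underlying.hasNext
  }
  override def next(): T = underlying.next()
}
```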
      
      ## How was this patch tested?
      
      Tested manually in `spark-shell` with two different reproductions of non-cancellable tasks, one involving scans of huge files and another involving sort-merge joins that spill to disk. Both causes of zombie tasks are fixed by the changes added here.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #16340 from JoshRosen/sql-task-interruption.
      5857b9ac
    • Shivaram Venkataraman's avatar
      [SPARK-18836][CORE] Serialize one copy of task metrics in DAGScheduler · 4cb49412
      Shivaram Venkataraman authored
      ## What changes were proposed in this pull request?
      
Right now we serialize the empty task metrics once per task. Since these are shared across all tasks, we could reuse the same serialized task metrics for all tasks of a stage.
      
      ## How was this patch tested?
      
      - [x] Run tests on EC2 to measure performance improvement
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #16261 from shivaram/task-metrics-one-copy.
      4cb49412
    • jiangxingbo's avatar
      [SPARK-18624][SQL] Implicit cast ArrayType(InternalType) · 70d495dc
      jiangxingbo authored
      ## What changes were proposed in this pull request?
      
Currently `ImplicitTypeCasts` doesn't handle casts between `ArrayType`s. This is inconvenient, so we should add a rule to enable casting from `ArrayType(InternalType)` to `ArrayType(newInternalType)`; a brief illustrative sketch follows the goals below.
      
      Goals:
      1. Add a rule to `ImplicitTypeCasts` to enable casting between `ArrayType`s;
      2. Simplify `Percentile` and `ApproximatePercentile`.
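An illustrative sketch of the coercion the new rule performs (the table and column names are examples, not from this PR):

```scala
spark.range(100).createOrReplaceTempView("t")
// array(0, 1) is typed array<int>; the new ImplicitTypeCasts rule can widen it
// to the array<double> that percentile's percentage argument expects.
spark.sql("SELECT percentile(id, array(0, 1)) FROM t").show()
```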
      
      ## How was this patch tested?
      
      Updated test cases in `TypeCoercionSuite`.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #16057 from jiangxb1987/implicit-cast-complex-types.
      70d495dc
    • Wenchen Fan's avatar
      [SPARK-18921][SQL] check database existence with Hive.databaseExists instead of getDatabase · 7a75ee1c
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      It's weird that we use `Hive.getDatabase` to check the existence of a database, while Hive has a `databaseExists` interface.
      
      What's worse, `Hive.getDatabase` will produce an error message if the database doesn't exist, which is annoying when we only want to check the database existence.
      
This PR fixes this by using `Hive.databaseExists` to check database existence.
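A purely illustrative before/after sketch of the change; `MetastoreClient` is a stand-in, not the real Hive client API:

```scala
trait MetastoreClient {
  def getDatabase(db: String): AnyRef        // throws (and logs an error) if missing
  def databaseExists(db: String): Boolean
}

// Before: infer existence from whether getDatabase throws.
def existsOld(c: MetastoreClient, db: String): Boolean = scala.util.Try(c.getDatabase(db)).isSuccess

// After: ask directly, with no spurious error message.
def existsNew(c: MetastoreClient, db: String): Boolean = c.databaseExists(db)
```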
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16332 from cloud-fan/minor.
      7a75ee1c
    • xuanyuanking's avatar
      [SPARK-18700][SQL] Add StripedLock for each table's relation in cache · 24482858
      xuanyuanking authored
      ## What changes were proposed in this pull request?
      
As described in the scenario in [SPARK-18700](https://issues.apache.org/jira/browse/SPARK-18700), when `cachedDataSourceTables` is invalidated, the next few queries each fetch all `FileStatus`es in the `listLeafFiles` function. When the table has many partitions, these jobs occupy a lot of driver memory and may eventually cause a driver OOM.

In this patch, we add a striped lock for each table's relation in the cache, rather than locking the whole `cachedDataSourceTables`, so that each table's cache-loading operation is protected by its own lock.
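A simplified sketch of the per-table locking idea using Guava's `Striped` locks; names are illustrative, not the actual `HiveMetastoreCatalog` code:

```scala
import java.util.concurrent.locks.Lock
import com.google.common.util.concurrent.Striped

object RelationCacheLocks {
  // Locks are keyed by table name, instead of one global lock for the whole cache.
  private val tableLocks: Striped[Lock] = Striped.lock(100)

  def withTableLock[A](qualifiedTableName: String)(loadRelation: => A): A = {
    val lock = tableLocks.get(qualifiedTableName)
    lock.lock()
    try loadRelation finally lock.unlock()
  }
}
```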
      
      ## How was this patch tested?
      
Added a multi-threaded table-access test in `PartitionedTablePerfStatsSuite` and checked, via the metrics in `HiveCatalogMetrics`, that the relation is loaded only once.
      
      Author: xuanyuanking <xyliyuanjian@gmail.com>
      
      Closes #16135 from xuanyuanking/SPARK-18700.
      24482858
    • Zakaria_Hili's avatar
      [SPARK-18356][ML] KMeans should cache RDD before training · 7db09abb
      Zakaria_Hili authored
      ## What changes were proposed in this pull request?
      
At the request of Joseph Bradley, I updated my PR https://github.com/apache/spark/pull/15965 to eliminate the extra fit() method.
      
      jkbradley
      ## How was this patch tested?
Passes existing tests.
      
      Author: Zakaria_Hili <zakahili@gmail.com>
      Author: HILI Zakaria <zakahili@gmail.com>
      
      Closes #16295 from ZakariaHili/zakbranch.
      7db09abb
  2. Dec 18, 2016
    • Yuming Wang's avatar
      [SPARK-18827][CORE] Fix cannot read broadcast on disk · 1e5c51f3
      Yuming Wang authored
      ## What changes were proposed in this pull request?
A `NoSuchElementException` has been thrown since https://github.com/apache/spark/pull/15056 when a broadcast cannot be cached in memory, because that change does not cover the `!unrolled.hasNext` case in the `next()` function.

This change covers the `!unrolled.hasNext` case and checks `hasNext` before calling `next` in `blockManager.getLocalValues`, to make it more robust.

With this pull request we can cache and read a broadcast even if it cannot fit in memory; a minimal sketch of the guard follows the exception log below.
      
      Exception log:
      ```
      16/12/10 10:10:04 INFO UnifiedMemoryManager: Will not store broadcast_131 as the required space (1048576 bytes) exceeds our memory limit (122764 bytes)
      16/12/10 10:10:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block broadcast_131 in memory.
      16/12/10 10:10:04 WARN MemoryStore: Not enough space to cache broadcast_131 in memory! (computed 384.0 B so far)
      16/12/10 10:10:04 INFO MemoryStore: Memory use = 95.6 KB (blocks) + 0.0 B (scratch space shared across 0 tasks(s)) = 95.6 KB. Storage limit = 119.9 KB.
      16/12/10 10:10:04 ERROR Utils: Exception encountered
      java.util.NoSuchElementException
      	at org.apache.spark.util.collection.PrimitiveVector$$anon$1.next(PrimitiveVector.scala:58)
      	at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:700)
      	at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
      	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$2.apply(TorrentBroadcast.scala:210)
      	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$2.apply(TorrentBroadcast.scala:210)
      	at scala.Option.map(Option.scala:146)
      	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210)
      	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
      	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
      	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
      	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
      	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
      	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
      	at org.apache.spark.scheduler.Task.run(Task.scala:108)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      16/12/10 10:10:04 ERROR Executor: Exception in task 1.0 in stage 86.0 (TID 134423)
      java.io.IOException: java.util.NoSuchElementException
      	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1276)
      	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
      	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
      	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
      	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
      	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
      	at org.apache.spark.scheduler.Task.run(Task.scala:108)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.util.NoSuchElementException
      	at org.apache.spark.util.collection.PrimitiveVector$$anon$1.next(PrimitiveVector.scala:58)
      	at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:700)
      	at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
      	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$2.apply(TorrentBroadcast.scala:210)
      	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$2.apply(TorrentBroadcast.scala:210)
      	at scala.Option.map(Option.scala:146)
      	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210)
      	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
      	... 12 more
      ```
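
A minimal sketch of the guard added on the read path (names are illustrative, not the exact `TorrentBroadcast` code):

```scala
// Check hasNext before next() so a partially-unrolled, disk-backed block does not
// surface a NoSuchElementException when the in-memory portion is empty.
def readBroadcastValue[T](blockValues: Option[Iterator[T]]): Option[T] =
  blockValues.flatMap(iter => if (iter.hasNext) Some(iter.next()) else None)
```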
      
      ## How was this patch tested?
      
      Add unit test
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #16252 from wangyum/SPARK-18827.
      1e5c51f3
    • gatorsmile's avatar
      [SPARK-18918][DOC] Missing </td> in Configuration page · c0c9e1d2
      gatorsmile authored
      ### What changes were proposed in this pull request?
      The configuration page looks messy now, as shown in the nightly build:
      https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/configuration.html
      
      Starting from the following location:
      
      ![screenshot 2016-12-18 00 26 33](https://cloud.githubusercontent.com/assets/11567269/21292396/ace4719c-c4b8-11e6-8dfd-d9ab95be43d5.png)
      
      ### How was this patch tested?
      Attached is the screenshot generated in my local computer after the fix.
      [Configuration - Spark 2.2.0 Documentation.pdf](https://github.com/apache/spark/files/659315/Configuration.-.Spark.2.2.0.Documentation.pdf)
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16327 from gatorsmile/docFix.
      c0c9e1d2
  3. Dec 17, 2016
  4. Dec 16, 2016
    • hyukjinkwon's avatar
      [SPARK-18895][TESTS] Fix resource-closing-related and path-related test... · 2bc1c951
      hyukjinkwon authored
      [SPARK-18895][TESTS] Fix resource-closing-related and path-related test failures in identified ones on Windows
      
      ## What changes were proposed in this pull request?
      
      There are several tests failing due to resource-closing-related and path-related  problems on Windows as below.
      
      - `RPackageUtilsSuite`:
      
      ```
      - build an R package from a jar end to end *** FAILED *** (1 second, 625 milliseconds)
        java.io.IOException: Unable to delete file: C:\projects\spark\target\tmp\1481729427517-0\a\dep2\d\dep2-d.jar
        at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
        at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
        at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
      
      - faulty R package shows documentation *** FAILED *** (359 milliseconds)
        java.io.IOException: Unable to delete file: C:\projects\spark\target\tmp\1481729428970-0\dep1-c.jar
        at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
        at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
        at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
      
      - SparkR zipping works properly *** FAILED *** (47 milliseconds)
        java.util.regex.PatternSyntaxException: Unknown character property name {r} near index 4
      
      C:\projects\spark\target\tmp\1481729429282-0
      
          ^
        at java.util.regex.Pattern.error(Pattern.java:1955)
        at java.util.regex.Pattern.charPropertyNodeFor(Pattern.java:2781)
      ```
      
      - `InputOutputMetricsSuite`:
      
      ```
      - input metrics for old hadoop with coalesce *** FAILED *** (240 milliseconds)
        java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
      
      - input metrics with cache and coalesce *** FAILED *** (109 milliseconds)
        java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
      
      - input metrics for new Hadoop API with coalesce *** FAILED *** (0 milliseconds)
        java.lang.IllegalArgumentException: Wrong FS: file://C:\projects\spark\target\tmp\spark-9366ec94-dac7-4a5c-a74b-3e7594a692ab\test\InputOutputMetricsSuite.txt, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
        at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462)
        at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:114)
      
      - input metrics when reading text file *** FAILED *** (110 milliseconds)
        java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
      
      - input metrics on records read - simple *** FAILED *** (125 milliseconds)
        java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
      
      - input metrics on records read - more stages *** FAILED *** (110 milliseconds)
        java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
      
      - input metrics on records - New Hadoop API *** FAILED *** (16 milliseconds)
        java.lang.IllegalArgumentException: Wrong FS: file://C:\projects\spark\target\tmp\spark-3f10a1a4-7820-4772-b821-25fd7523bf6f\test\InputOutputMetricsSuite.txt, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
        at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462)
        at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:114)
      
      - input metrics on records read with cache *** FAILED *** (93 milliseconds)
        java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
      
      - input read/write and shuffle read/write metrics all line up *** FAILED *** (93 milliseconds)
        java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
      
      - input metrics with interleaved reads *** FAILED *** (0 milliseconds)
        java.lang.IllegalArgumentException: Wrong FS: file://C:\projects\spark\target\tmp\spark-2638d893-e89b-47ce-acd0-bbaeee78dd9b\InputOutputMetricsSuite_cart.txt, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
        at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462)
        at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:114)
      
      - input metrics with old CombineFileInputFormat *** FAILED *** (157 milliseconds)
        17947 was not greater than or equal to 300000 (InputOutputMetricsSuite.scala:324)
        org.scalatest.exceptions.TestFailedException:
        at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
        at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
        at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
      
      - input metrics with new CombineFileInputFormat *** FAILED *** (16 milliseconds)
        java.lang.IllegalArgumentException: Wrong FS: file://C:\projects\spark\target\tmp\spark-11920c08-19d8-4c7c-9fba-28ed72b79f80\test\InputOutputMetricsSuite.txt, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
        at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462)
        at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:114)
      ```
      
      - `ReplayListenerSuite`:
      
      ```
      - End-to-end replay *** FAILED *** (121 milliseconds)
        java.io.IOException: No FileSystem for scheme: C
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
      
      - End-to-end replay with compression *** FAILED *** (516 milliseconds)
        java.io.IOException: No FileSystem for scheme: C
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
      ```
      
      - `EventLoggingListenerSuite`:
      
      ```
      - End-to-end event logging *** FAILED *** (7 seconds, 435 milliseconds)
        java.io.IOException: No FileSystem for scheme: C
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
      
      - End-to-end event logging with compression *** FAILED *** (1 second)
        java.io.IOException: No FileSystem for scheme: C
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
      
      - Event log name *** FAILED *** (16 milliseconds)
        "file:/[]base-dir/app1" did not equal "file:/[C:/]base-dir/app1" (EventLoggingListenerSuite.scala:123)
        org.scalatest.exceptions.TestFailedException:
        at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
        at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
        at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
      ```
      
This PR proposes to fix these test failures on Windows.
      
      ## How was this patch tested?
      
      Manually tested via AppVeyor
      
      **Before**
      
      `RPackageUtilsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/273-RPackageUtilsSuite-before
      `InputOutputMetricsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/272-InputOutputMetricsSuite-before
      `ReplayListenerSuite`: https://ci.appveyor.com/project/spark-test/spark/build/274-ReplayListenerSuite-before
      `EventLoggingListenerSuite`: https://ci.appveyor.com/project/spark-test/spark/build/275-EventLoggingListenerSuite-before
      
      **After**
      
      `RPackageUtilsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/270-RPackageUtilsSuite
      `InputOutputMetricsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/271-InputOutputMetricsSuite
      `ReplayListenerSuite`: https://ci.appveyor.com/project/spark-test/spark/build/277-ReplayListenerSuite-after
      `EventLoggingListenerSuite`: https://ci.appveyor.com/project/spark-test/spark/build/278-EventLoggingListenerSuite-after
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16305 from HyukjinKwon/RPackageUtilsSuite-InputOutputMetricsSuite.
      2bc1c951
    • Shixiong Zhu's avatar
      [SPARK-18904][SS][TESTS] Merge two FileStreamSourceSuite files · 4faa8a3e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Merge two FileStreamSourceSuite files into one file.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16315 from zsxwing/FileStreamSourceSuite.
      4faa8a3e
    • Mark Hamstra's avatar
      [SPARK-17769][CORE][SCHEDULER] Some FetchFailure refactoring · 295db825
      Mark Hamstra authored
      ## What changes were proposed in this pull request?
      
Readability rewrites.
Changed the order of evaluation of `failedStage.failedOnFetchAndShouldAbort(task.stageAttemptId)` and `disallowStageRetryForTest`.
Changed the stage resubmission guard condition from `failedStages.isEmpty` to `!failedStages.contains(failedStage)`.
Log all resubmissions of stages.
      ## How was this patch tested?
      
      existing tests
      
      Author: Mark Hamstra <markhamstra@gmail.com>
      
      Closes #15335 from markhamstra/SPARK-17769.
      295db825
    • Dongjoon Hyun's avatar
      [SPARK-18897][SPARKR] Fix SparkR SQL Test to drop test table · 1169db44
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
SparkR tests (`R/run-tests.sh`) succeed only once because `test_sparkSQL.R` does not clean up the test table, `people`.

As a result, rows accumulate in the `people` table on every run and the test cases fail.
      
      The following is the failure result for the second run.
      
      ```r
      Failed -------------------------------------------------------------------------
      1. Failure: create DataFrame from RDD (test_sparkSQL.R#204) -------------------
      collect(sql("SELECT age from people WHERE name = 'Bob'"))$age not equal to c(16).
      Lengths differ: 2 vs 1
      
      2. Failure: create DataFrame from RDD (test_sparkSQL.R#206) -------------------
      collect(sql("SELECT height from people WHERE name ='Bob'"))$height not equal to c(176.5).
      Lengths differ: 2 vs 1
      ```
      
      ## How was this patch tested?
      
      Manual. Run `run-tests.sh` twice and check if it passes without failures.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16310 from dongjoon-hyun/SPARK-18897.
      1169db44
    • hyukjinkwon's avatar
      [MINOR][BUILD] Fix lint-check failures and javadoc8 break · ed84cd06
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix lint-check failures and javadoc8 break.
      
A few errors were introduced, as shown below:
      
      **lint-check failures**
      
      ```
      [ERROR] src/test/java/org/apache/spark/network/TransportClientFactorySuite.java:[45,1] (imports) RedundantImport: Duplicate import to line 43 - org.apache.spark.network.util.MapConfigProvider.
      [ERROR] src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java:[255,10] (modifier) RedundantModifier: Redundant 'final' modifier.
      ```
      
      **javadoc8**
      
      ```
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/StreamingQueryProgress.java:19: error: bad use of '>'
      [error]  *                   "max" -> "2016-12-05T20:54:20.827Z"  // maximum event time seen in this trigger
      [error]                             ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/StreamingQueryProgress.java:20: error: bad use of '>'
      [error]  *                   "min" -> "2016-12-05T20:54:20.827Z"  // minimum event time seen in this trigger
      [error]                             ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/StreamingQueryProgress.java:21: error: bad use of '>'
      [error]  *                   "avg" -> "2016-12-05T20:54:20.827Z"  // average event time seen in this trigger
      [error]                             ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/StreamingQueryProgress.java:22: error: bad use of '>'
      [error]  *                   "watermark" -> "2016-12-05T20:54:20.827Z"  // watermark used in this trigger
      [error]
      ```
      
      ## How was this patch tested?
      
      Manually checked as below:
      
      **lint-check failures**
      
      ```
      ./dev/lint-java
      Checkstyle checks passed.
      ```
      
      **javadoc8**
      
This seems hidden in the API doc, but I manually checked after removing the access modifier, as below:

Before this PR, it does not render properly (scaladoc):
      
      ![2016-12-16 3 40 34](https://cloud.githubusercontent.com/assets/6477701/21255175/8df1fe6e-c3ad-11e6-8cda-ce7f76c6677a.png)
      
      After this PR, it renders as below:
      
      - scaladoc
        ![2016-12-16 3 40 23](https://cloud.githubusercontent.com/assets/6477701/21255135/4a11dab6-c3ad-11e6-8ab2-b091c4f45029.png)
      
      - javadoc
        ![2016-12-16 3 41 10](https://cloud.githubusercontent.com/assets/6477701/21255137/4bba1d9c-c3ad-11e6-9b88-62f1f697b56a.png)
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16307 from HyukjinKwon/lint-javadoc8.
      ed84cd06
    • Aliaksandr.Bedrytski's avatar
      [SPARK-18708][CORE] Improvement/improve docs in spark context file · f7a574a6
      Aliaksandr.Bedrytski authored
      ## What changes were proposed in this pull request?
      
      SparkContext.scala was created a long time ago and contains several types of Scaladocs/Javadocs mixed together. Public methods/fields should have a Scaladoc that is formatted in the same way everywhere. This pull request also adds scaladoc to methods/fields that did not have it before.
      
      ## How was this patch tested?
      
      No actual code was modified, only comments.
      
      
      Author: Aliaksandr.Bedrytski <aliaksandr.bedrytski@valtech.co.uk>
      
      Closes #16137 from Mironor/improvement/improve-docs-in-spark-context-file.
      f7a574a6
    • Michal Senkyr's avatar
      [SPARK-18723][DOC] Expanded programming guide information on wholeTex… · 836c95b1
      Michal Senkyr authored
      ## What changes were proposed in this pull request?
      
Added additional information on wholeTextFiles to the Programming Guide, explained how its partitioning policy differs from textFile's and the resulting impact on performance, and added a reference to the underlying CombineFileInputFormat.
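A brief usage sketch of the API the expanded docs cover (the path is a placeholder):

```scala
// Each element is (filePath, fileContent); partitioning is driven by
// CombineFileInputFormat rather than one split per block as with textFile.
val files = sc.wholeTextFiles("hdfs:///path/to/small-files", minPartitions = 4)
files.take(1).foreach { case (path, content) => println(s"$path -> ${content.length} chars") }
```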
      
      ## How was this patch tested?
      
      Manual build of documentation and inspection in browser
      
      ```
      cd docs
      jekyll serve --watch
      ```
      
      Author: Michal Senkyr <mike.senkyr@gmail.com>
      
      Closes #16157 from michalsenkyr/wholeTextFilesExpandedDocs.
      836c95b1
    • Takeshi YAMAMURO's avatar
      [SPARK-18108][SQL] Fix a schema inconsistent bug that makes a parquet reader fail to read data · dc2a4d4a
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
A vectorized parquet reader fails to read column data if the data schema and the partition schema overlap with each other and the inferred types in the partition schema differ from the ones in the data schema. Example code to reproduce this bug is as follows:
      
      ```
      scala> case class A(a: Long, b: Int)
      scala> val as = Seq(A(1, 2))
      scala> spark.createDataFrame(as).write.parquet("/data/a=1/")
      scala> val df = spark.read.parquet("/data/")
      scala> df.printSchema
      root
       |-- a: long (nullable = true)
       |-- b: integer (nullable = true)
      scala> df.collect
      java.lang.NullPointerException
              at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:283)
              at org.apache.spark.sql.execution.vectorized.ColumnarBatch$Row.getLong(ColumnarBatch.java:191)
              at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
              at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
              at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
              at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
      ```
The root cause is that the logical layer (`HadoopFsRelation`) and the physical layer (`VectorizedParquetRecordReader`) make different assumptions about the partition schema: the logical layer trusts the data schema to infer the types of the overlapped partition columns, while the physical layer trusts the partition schema, which is inferred from the path string. To fix this bug, this PR simply updates `HadoopFsRelation.schema` to respect the partition columns' positions in the data schema and their types in the partition schema.
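A simplified sketch of the merging rule described above (not the actual `HadoopFsRelation` code): overlapping columns keep their position from the data schema but take their type from the partition schema.

```scala
import org.apache.spark.sql.types.StructType

def mergedSchema(dataSchema: StructType, partitionSchema: StructType): StructType = {
  val overlapped = partitionSchema.map(f => f.name -> f).toMap
  // Overlapping columns: data-schema position, partition-schema type.
  val fromData = dataSchema.map(f => overlapped.getOrElse(f.name, f))
  // Partition-only columns are appended at the end, as before.
  val partitionOnly = partitionSchema.filterNot(f => dataSchema.fieldNames.contains(f.name))
  StructType(fromData ++ partitionOnly)
}
```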
      
      ## How was this patch tested?
      Add tests in `ParquetPartitionDiscoverySuite`
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #16030 from maropu/SPARK-18108.
      dc2a4d4a
    • root's avatar
      [SPARK-18742][CORE] Clarify that user-defined BroadcastFactory is not supported · 53ab8fb3
      root authored
      ## What changes were proposed in this pull request?
After SPARK-12588 (Remove HTTPBroadcast) [1], the one and only implementation of BroadcastFactory is TorrentBroadcastFactory, and the spark.broadcast.factory conf has been removed.
      
However, the Scaladoc still says [2]:
      
      /**
       * An interface for all the broadcast implementations in Spark (to allow
       * multiple broadcast implementations). SparkContext uses a user-specified
       * BroadcastFactory implementation to instantiate a particular broadcast for the
       * entire Spark job.
       */
      
So we should modify the comment to state that SparkContext will not use a user-specified BroadcastFactory implementation.
      
      [1] https://issues.apache.org/jira/browse/SPARK-12588
      [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30
      
      ## How was this patch tested?
      unit test added
      
      Author: root <root@iZbp1gsnrlfzjxh82cz80vZ.(none)>
      Author: windpiger <songjun@outlook.com>
      
      Closes #16173 from windpiger/addBroadFactoryConf.
      53ab8fb3
    • Shixiong Zhu's avatar
      [SPARK-18850][SS] Make StreamExecution and progress classes serializable · d7f3058e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
This PR adds StreamingQueryWrapper to make StreamExecution and the progress classes serializable, because it is too easy for them to get captured in closures with normal usage. If a StreamingQueryWrapper gets captured in a closure but none of its methods are called, it should not fail the Spark tasks; if its methods are called, this PR makes it throw a better error message.
      
      ## How was this patch tested?
      
      `test("StreamingQuery should be Serializable but cannot be used in executors")`
      `test("progress classes should be Serializable")`
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16272 from zsxwing/SPARK-18850.
      d7f3058e
    • Andrew Ray's avatar
      [SPARK-18845][GRAPHX] PageRank has incorrect initialization value that leads to slow convergence · 78062b85
      Andrew Ray authored
      ## What changes were proposed in this pull request?
      
      Change the initial value in all PageRank implementations to be `1.0` instead of `resetProb` (default `0.15`) and use `outerJoinVertices` instead of `joinVertices` so that source vertices get updated in each iteration.
      
      This seems to have been introduced a long time ago in https://github.com/apache/spark/commit/15a564598fe63003652b1e24527c432080b5976c#diff-b2bf3f97dcd2f19d61c921836159cda9L90
      
With the exception of graphs with sinks (which currently give incorrect results; see SPARK-18847), this gives faster convergence, since the sum of the ranks is already correct (the sum of ranks should equal the number of vertices).

Convergence comparison benchmark for a small graph: http://imgur.com/a/HkkZf
      Code for benchmark: https://gist.github.com/aray/a7de1f3801a810f8b1fa00c271a1fefd
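A minimal sketch of the initialization change (illustrative, not the actual PageRank source):

```scala
import org.apache.spark.graphx._

// Start every vertex at rank 1.0 (previously resetProb = 0.15), so the ranks
// already sum to the number of vertices before the first iteration.
def withInitialRanks[VD, ED](graph: Graph[VD, ED]): Graph[Double, ED] =
  graph.mapVertices((_, _) => 1.0)
```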
      
      ## How was this patch tested?
      
      (corrected) existing unit tests and additional test that verifies against result of igraph and NetworkX on a loop with a source.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #16271 from aray/pagerank-initial-value.
      78062b85
  5. Dec 15, 2016
    • Reynold Xin's avatar
      [SPARK-18892][SQL] Alias percentile_approx approx_percentile · 172a52f5
      Reynold Xin authored
      ## What changes were proposed in this pull request?
percentile_approx is the name used in Hive, and approx_percentile is the name used in Presto. approx_percentile is actually more consistent with our approx_count_distinct. Given that the cost of aliasing SQL functions is low (a one-liner), it'd be better to just alias them so they are easier to use.
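An illustrative check that both names resolve to the same aggregate (the view name is an example):

```scala
spark.range(1000).createOrReplaceTempView("nums")
spark.sql("SELECT percentile_approx(id, 0.5) AS p1, approx_percentile(id, 0.5) AS p2 FROM nums").show()
```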
      
      ## How was this patch tested?
      Technically I could add an end-to-end test to verify this one-line change, but it seemed too trivial to me.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16300 from rxin/SPARK-18892.
      172a52f5
    • Shivaram Venkataraman's avatar
      [MINOR] Handle fact that mv is different on linux, mac · 5a44f18a
      Shivaram Venkataraman authored
Follow-up to https://github.com/apache/spark/commit/ae853e8f3bdbd16427e6f1ffade4f63abaf74abb, as `mv` throws an error on the Jenkins machines if the source and destination are the same.
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #16302 from shivaram/sparkr-no-mv-fix.
      5a44f18a
    • Shivaram Venkataraman's avatar
      [MINOR] Only rename SparkR tar.gz if names mismatch · 9634018c
      Shivaram Venkataraman authored
      ## What changes were proposed in this pull request?
      
      For release builds the R_PACKAGE_VERSION and VERSION are the same (e.g., 2.1.0). Thus `cp` throws an error which causes the build to fail.
      
      ## How was this patch tested?
      
      Manually by executing the following script
      ```
      set -o pipefail
      set -e
      set -x
      
      touch a
      
      R_PACKAGE_VERSION=2.1.0
      VERSION=2.1.0
      
      if [ "$R_PACKAGE_VERSION" != "$VERSION" ]; then
        cp a a
      fi
      ```
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #16299 from shivaram/sparkr-cp-fix.
      9634018c
    • Burak Yavuz's avatar
      [SPARK-18868][FLAKY-TEST] Deflake StreamingQueryListenerSuite: single listener, check trigger... · 9c7f83b0
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
Use `recentProgress` instead of `lastProgress` and filter out the last non-zero value. Also add `eventually` to the latest `assertQuery`, similar to the first `assertQuery`.
      
      ## How was this patch tested?
      
      Ran test 1000 times
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #16287 from brkyvz/SPARK-18868.
      9c7f83b0
    • Imran Rashid's avatar
      [SPARK-8425][SCHEDULER][HOTFIX] fix scala 2.10 compile error · 32ff9645
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
https://github.com/apache/spark/commit/93cdb8a7d0f124b4db069fd8242207c82e263c52 introduced a compile error under Scala 2.10; this fixes that error.
      
      ## How was this patch tested?
      
      locally ran
      ```
      dev/change-version-to-2.10.sh
      build/sbt -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Dscala-2.10 "project yarn" "test-only *YarnAllocatorSuite"
      ```
      (which failed at test compilation before this change)
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #16298 from squito/blacklist-2.10.
      32ff9645
    • Burak Yavuz's avatar
      [SPARK-18888] partitionBy in DataStreamWriter in Python throws _to_seq not defined · 0917c8ee
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      `_to_seq` wasn't imported.
      
      ## How was this patch tested?
      
      Added partitionBy to existing write path unit test
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #16297 from brkyvz/SPARK-18888.
      0917c8ee
    • Shixiong Zhu's avatar
      [SPARK-18826][SS] Add 'latestFirst' option to FileStreamSource · 68a6dc97
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
When starting a stream with a lot of backfill and `maxFilesPerTrigger`, the user may often want to process the most recent files first. This lets you keep low latency for recent data while slowly backfilling historical data.
      
      This PR adds a new option `latestFirst` to control this behavior. When it's true, `FileStreamSource` will sort the files by the modified time from latest to oldest, and take the first `maxFilesPerTrigger` files as a new batch.
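An illustrative use of the new option together with `maxFilesPerTrigger` (the path is a placeholder):

```scala
val stream = spark.readStream
  .format("text")
  .option("latestFirst", "true")         // newest files first while backfilling
  .option("maxFilesPerTrigger", "10")    // at most 10 files per micro-batch
  .load("/path/to/input")
```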
      
      ## How was this patch tested?
      
      The added test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16251 from zsxwing/newest-first.
      68a6dc97
    • Tathagata Das's avatar
      [SPARK-18870] Disallowed Distinct Aggregations on Streaming Datasets · 4f7292c8
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
Check whether Aggregation operators on a streaming subplan have aggregate expressions with `isDistinct = true`, and disallow them.
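An illustrative query that is now rejected as unsupported; the socket source here is just a placeholder streaming input:

```scala
import org.apache.spark.sql.functions.countDistinct

val events = spark.readStream.format("socket")
  .option("host", "localhost").option("port", "9999").load()

// A distinct aggregation on a streaming Dataset; starting a query over `counts`
// now fails with an unsupported-operation error.
val counts = events.groupBy("value").agg(countDistinct("value"))
```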
      
      ## How was this patch tested?
      
      Added unit test
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16289 from tdas/SPARK-18870.
      4f7292c8
    • jiangxingbo's avatar
      [SPARK-17910][SQL] Allow users to update the comment of a column · 01e14bf3
      jiangxingbo authored
      ## What changes were proposed in this pull request?
      
Right now, once a user sets the comment of a column with the CREATE TABLE command, he/she cannot update it. It would be useful to provide a public interface (e.g. SQL) to do that.
      
      This PR implements the following SQL statement:
      ```
      ALTER TABLE table [PARTITION partition_spec]
      CHANGE [COLUMN] column_old_name column_new_name column_dataType
      [COMMENT column_comment]
      [FIRST | AFTER column_name];
      ```
      
For further expansion, we could support altering the `name`/`dataType`/`index` of a column too; a hypothetical end-to-end usage sketch follows.
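A hypothetical end-to-end use of the new statement (table and column names are examples, not from this PR):

```scala
spark.sql("CREATE TABLE people (name STRING COMMENT 'name', age INT) USING parquet")
// Only the comment changes; the column name and data type are repeated unchanged.
spark.sql("ALTER TABLE people CHANGE COLUMN name name STRING COMMENT 'full name of the person'")
```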
      
      ## How was this patch tested?
      
      Add new test cases in `ExternalCatalogSuite` and `SessionCatalogSuite`.
      Add sql file test for `ALTER TABLE CHANGE COLUMN` statement.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #15717 from jiangxb1987/change-column.
      01e14bf3
    • Imran Rashid's avatar
      [SPARK-8425][CORE] Application Level Blacklisting · 93cdb8a7
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      This builds upon the blacklisting introduced in SPARK-17675 to add blacklisting of executors and nodes for an entire Spark application.  Resources are blacklisted based on tasks that fail, in tasksets that eventually complete successfully; they are automatically returned to the pool of active resources based on a timeout.  Full details are available in a design doc attached to the jira.
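A hedged configuration sketch; the full set of `spark.blacklist.*` settings and their semantics is in the design doc and `configuration.md`:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  // Blacklisted executors/nodes return to the active pool after this timeout.
  .set("spark.blacklist.timeout", "1h")
```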
      ## How was this patch tested?
      
      Added unit tests, ran them via Jenkins, also ran a handful of them in a loop to check for flakiness.
      
      The added tests include:
      - verifying BlacklistTracker works correctly
      - verifying TaskSchedulerImpl interacts with BlacklistTracker correctly (via a mock BlacklistTracker)
      - an integration test for the entire scheduler with blacklisting in a few different scenarios
      
      Author: Imran Rashid <irashid@cloudera.com>
      Author: mwws <wei.mao@intel.com>
      
      Closes #14079 from squito/blacklist-SPARK-8425.
      93cdb8a7
  6. Dec 14, 2016
    • Felix Cheung's avatar
      [SPARK-18849][ML][SPARKR][DOC] vignettes final check update · 7d858bc5
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      doc cleanup
      
      ## How was this patch tested?
      
      ~~vignettes is not building for me. I'm going to kick off a full clean build and try again and attach output here for review.~~
      Output html here: https://felixcheung.github.io/sparkr-vignettes.html
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16286 from felixcheung/rvignettespass.
      7d858bc5
    • Dongjoon Hyun's avatar
      [SPARK-18875][SPARKR][DOCS] Fix R API doc generation by adding `DESCRIPTION` file · ec0eae48
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
Since Apache Spark 1.4.0, the R API documentation page has had a broken link to the `DESCRIPTION` file because the Jekyll plugin script doesn't copy the file. This PR aims to fix that.
      
      - Official Latest Website: http://spark.apache.org/docs/latest/api/R/index.html
      - Apache Spark 2.1.0-rc2: http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html
      
      ## How was this patch tested?
      
      Manual.
      
      ```bash
      cd docs
      SKIP_SCALADOC=1 jekyll build
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16292 from dongjoon-hyun/SPARK-18875.
      ec0eae48
    • Reynold Xin's avatar
      [SPARK-18869][SQL] Add TreeNode.p that returns BaseType · 5d510c69
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] rather than a more specific type. It would be easier for interactive debugging to introduce a function that returns the BaseType.
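A tiny self-contained analogue of the idea (not Catalyst's actual `TreeNode`): `apply(i)` loses the specific node type, while `p(i)` recovers it for convenient chaining in the shell.

```scala
abstract class Node[BaseType <: Node[BaseType]] { self: BaseType =>
  def children: Seq[BaseType]
  def apply(i: Int): Node[_] = children(i)                    // erased type, like TreeNode.apply
  def p(i: Int): BaseType = apply(i).asInstanceOf[BaseType]   // convenience cast back to BaseType
}
```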
      
      ## How was this patch tested?
      N/A - this is a developer only feature used for interactive debugging. As long as it compiles, it should be good to go. I tested this in spark-shell.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16288 from rxin/SPARK-18869.
      5d510c69
    • Wenchen Fan's avatar
      [SPARK-18856][SQL] non-empty partitioned table should not report zero size · d6f11a12
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
In `DataSource`, if the table is not analyzed, we use 0 as the default value for the table size. This is dangerous: we may broadcast a large table and cause an OOM. We should use `defaultSizeInBytes` instead.
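A simplified sketch of the fallback (names are illustrative, not the exact `DataSource` code):

```scala
// When a table has no computed statistics, assume it is large (defaultSizeInBytes)
// rather than empty, so the planner does not broadcast it by mistake.
def estimatedSizeInBytes(statsSize: Option[BigInt], defaultSizeInBytes: Long): BigInt =
  statsSize.getOrElse(BigInt(defaultSizeInBytes))
```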
      
      ## How was this patch tested?
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16280 from cloud-fan/bug.
      d6f11a12
    • gatorsmile's avatar
      [SPARK-18703][SQL] Drop Staging Directories and Data Files After each... · 8db4d95c
      gatorsmile authored
      [SPARK-18703][SQL] Drop Staging Directories and Data Files After each Insertion/CTAS of Hive serde Tables
      
      ### What changes were proposed in this pull request?
Below are the files/directories generated for three inserts against a Hive table:
      ```
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/._SUCCESS.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/.part-00000.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/_SUCCESS
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/part-00000
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/._SUCCESS.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/.part-00000.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/_SUCCESS
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/part-00000
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/._SUCCESS.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/.part-00000.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/_SUCCESS
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/part-00000
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-00000.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-00000
      ```
      
The first 18 files are temporary. We do not drop them until JVM termination. If the JVM does not terminate appropriately, these temporary files/directories will not be dropped.
      
      Only the last two files are needed, as shown below.
      ```
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-00000.crc
      /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-00000
      ```
The temporary files/directories can accumulate quickly when we issue many inserts, since each insert generates at least six files. This can eat a lot of space and slow down JVM termination, and when the JVM does not terminate appropriately, the files might not be dropped.
      
      This PR is to drop the created staging files and temporary data files after each insert/CTAS.
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16134 from gatorsmile/deleteFiles.
      8db4d95c
    • wm624@hotmail.com's avatar
      [SPARK-18865][SPARKR] SparkR vignettes MLP and LDA updates · 32438853
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
While doing the QA work, I found the following issues:

1) `spark.mlp` doesn't include an example;
2) `spark.mlp` and `spark.lda` have redundant parameter explanations;
3) the `spark.lda` documentation is missing default values for some parameters.
      
      I also changed the `spark.logit` regParam in the examples, as we discussed in #16222.
      
      ## How was this patch tested?
      
      Manual test
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16284 from wangmiao1981/ks.
      32438853
    • Reynold Xin's avatar
      [SPARK-18854][SQL] numberedTreeString and apply(i) inconsistent for subqueries · ffdd1fcd
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This is a bug introduced by subquery handling. numberedTreeString (which uses generateTreeString under the hood) numbers trees including innerChildren (used to print subqueries), but apply (which uses getNodeNumbered) ignores innerChildren. As a result, apply(i) would return the wrong plan node if there are subqueries.
      
      This patch fixes the bug.
      
      ## How was this patch tested?
      Added a test case in SubquerySuite.scala to test both the depth-first traversal of numbering as well as making sure the two methods are consistent.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16277 from rxin/SPARK-18854.
      ffdd1fcd