  1. Dec 08, 2016
  2. Dec 07, 2016
    • sarutak's avatar
      [SPARK-18762][WEBUI] Web UI should be http:4040 instead of https:4040 · bb94f61a
      sarutak authored
      ## What changes were proposed in this pull request?
      
      When SSL is enabled, the Spark shell shows:
      ```
      Spark context Web UI available at https://192.168.99.1:4040
      ```
      This is wrong because 4040 is http, not https. It redirects to the https port.
      More importantly, this introduces several broken links in the UI. For example, in the master UI, the worker link is https:8081 instead of http:8081 or https:8481.
      
      CC: mengxr liancheng
      
      ## How was this patch tested?
      
      Manually tested by accessing the MasterPage, WorkerPage and HistoryServer with SSL enabled.
      
      Author: sarutak <sarutak@oss.nttdata.co.jp>
      
      Closes #16190 from sarutak/SPARK-18761.
      bb94f61a
    • Shixiong Zhu's avatar
      [SPARK-18764][CORE] Add a warning log when skipping a corrupted file · dbf3e298
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      It's better to add a warning log when skipping a corrupted file, so that we can finish the job first and then find the corrupted files in the log and fix them.
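      
      A minimal sketch of the idea, assuming a generic read loop (the helper and logger names are illustrative, not the actual Spark code path):
      
      ```scala
      import java.io.IOException
      import org.slf4j.LoggerFactory
      
      // Hypothetical read loop: when corrupted files are ignored, the skipped file
      // is still reported with a warning so it can be located in the logs and
      // fixed after the job finishes.
      object SkipCorruptedFiles {
        private val log = LoggerFactory.getLogger(getClass)
      
        def readAll(paths: Seq[String])(read: String => Iterator[String]): Iterator[String] =
          paths.iterator.flatMap { path =>
            try read(path)
            catch {
              case e: IOException =>
                log.warn(s"Skipping the rest of the corrupted file: $path", e)
                Iterator.empty
            }
          }
      }
      ```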
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16192 from zsxwing/SPARK-18764.
      dbf3e298
    • Jie Xiong's avatar
      [SPARK-18208][SHUFFLE] Executor OOM due to a growing LongArray in BytesToBytesMap · c496d03b
      Jie Xiong authored
      ## What changes were proposed in this pull request?
      
      BytesToBytesMap currently does not release the in-memory storage (the longArray variable) after it spills to disk. This is typically not a problem during aggregation because the longArray should be much smaller than the pages, and because we grow the longArray at a conservative rate.
      
      However, this can lead to an OOM when an already running task is allocated more than its fair share, which can happen because of a scheduling delay. In this case the longArray can grow beyond the task's fair share of memory. This becomes problematic when the task spills and the longArray is not freed: subsequent memory allocation requests are then denied by the memory manager, resulting in an OOM.
      
      This PR fixes the issue by freeing the longArray when the BytesToBytesMap spills.
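      
      A simplified Scala model of the fix (the real change is in the Java class BytesToBytesMap; the names below are illustrative):
      
      ```scala
      // A map that keeps an in-memory index (longArray) alongside data pages.
      // On spill, the pages are written out and *both* the pages and the index
      // are released, so later allocation requests can be granted.
      class SpillableMap(var longArray: Array[Long], var pages: List[Array[Byte]]) {
        def spill(writeToDisk: List[Array[Byte]] => Unit): Unit = {
          writeToDisk(pages)
          pages = Nil        // data pages were already released before this PR
          longArray = null   // the fix: also free the lookup index after spilling
        }
      }
      ```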
      
      ## How was this patch tested?
      
      Existing tests, plus testing on real-world workloads.
      
      Author: Jie Xiong <jiexiong@fb.com>
      Author: jiexiong <jiexiong@gmail.com>
      
      Closes #15722 from jiexiong/jie_oom_fix.
      c496d03b
    • Sean Owen's avatar
      [SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils · 79f5f281
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high -- k/(l-1) after the l-th element instead of k/l, which matters for small k.
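      
      For reference, a generic reservoir-sampling sketch with the corrected replacement probability k/l for the l-th element (an illustration, not the SamplingUtils code):
      
      ```scala
      import scala.util.Random
      
      // Reservoir sampling of k items from a stream. Once l > k elements have been
      // seen, the l-th element replaces a random reservoir slot with probability
      // k / l; using k / (l - 1) here is exactly the off-by-one bias being fixed.
      def reservoirSample[T](input: Iterator[T], k: Int, rng: Random = new Random()): Vector[T] = {
        var reservoir = Vector.empty[T]
        var l = 0L
        input.foreach { item =>
          l += 1
          if (l <= k) {
            reservoir = reservoir :+ item
          } else {
            val j = (rng.nextDouble() * l).toLong   // uniform in [0, l)
            if (j < k) reservoir = reservoir.updated(j.toInt, item)
          }
        }
        reservoir
      }
      ```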
      
      ## How was this patch tested?
      
      Existing test plus new test case.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16129 from srowen/SPARK-18678.
      Unverified
      79f5f281
  3. Dec 06, 2016
    • Peter Ableda's avatar
      [SPARK-18740] Log spark.app.name in driver logs · 05d416ff
      Peter Ableda authored
      ## What changes were proposed in this pull request?
      
      Added a simple logInfo line to print out `spark.app.name` in the driver logs.
      
      ## How was this patch tested?
      
      Spark was built and tested with SparkPi app. Example log:
      ```
      16/12/06 05:49:50 INFO spark.SparkContext: Running Spark version 2.0.0
      16/12/06 05:49:52 INFO spark.SparkContext: Submitted application: Spark Pi
      16/12/06 05:49:52 INFO spark.SecurityManager: Changing view acls to: root
      16/12/06 05:49:52 INFO spark.SecurityManager: Changing modify acls to: root
      ```
      
      Author: Peter Ableda <peter.ableda@cloudera.com>
      
      Closes #16172 from peterableda/feature/print_appname.
      05d416ff
  4. Dec 05, 2016
    • hyukjinkwon's avatar
      [SPARK-18672][CORE] Close recordwriter in SparkHadoopMapReduceWriter before committing · b8c7b8d3
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It seems some APIs such as `PairRDDFunctions.saveAsHadoopDataset()` do not close the record writer before issuing the commit for the task.
      
      On Windows, the output file in the temp directory is still open when the output committer tries to rename it from the temp directory to the output directory after writing finishes.
      
      So the rename fails. It seems we should close the writer before committing the task, as other writers such as `FileFormatWriter` already do.
      
      Identified failure was as below:
      
      ```
      FAILURE! - in org.apache.spark.JavaAPISuite
      writeWithNewAPIHadoopFile(org.apache.spark.JavaAPISuite)  Time elapsed: 0.25 sec  <<< ERROR!
      org.apache.spark.SparkException: Job aborted.
      	at org.apache.spark.JavaAPISuite.writeWithNewAPIHadoopFile(JavaAPISuite.java:1231)
      Caused by: org.apache.spark.SparkException:
      Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
      	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:182)
      	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:100)
      	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:99)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
      	at org.apache.spark.scheduler.Task.run(Task.scala:108)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.IOException: Could not rename file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/_temporary/attempt_20161201005155_0000_r_000000_0 to file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/task_20161201005155_0000_r_000000
      	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:436)
      	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:415)
      	at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
      	at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:76)
      	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:153)
      	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:167)
      	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:156)
      	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
      	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:168)
      	... 8 more
      Driver stacktrace:
      	at org.apache.spark.JavaAPISuite.writeWithNewAPIHadoopFile(JavaAPISuite.java:1231)
      Caused by: org.apache.spark.SparkException: Task failed while writing rows
      Caused by: java.io.IOException: Could not rename file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/_temporary/attempt_20161201005155_0000_r_000000_0 to file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/task_20161201005155_0000_r_000000
      ```
      
      This PR proposes to close the writer before committing the task.
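      
      A hedged sketch of the ordering this establishes (the commit step is passed in as a function here; the actual change lives in `SparkHadoopMapReduceWriter`):
      
      ```scala
      import org.apache.hadoop.mapreduce.{RecordWriter, TaskAttemptContext}
      
      // Close the RecordWriter *before* committing the task. On Windows the temp
      // output file cannot be renamed while it is still open, so committing first
      // makes FileOutputCommitter.commitTask fail with "Could not rename file".
      def writeAndCommit[K, V](
          writer: RecordWriter[K, V],
          records: Iterator[(K, V)],
          ctx: TaskAttemptContext)(commit: TaskAttemptContext => Unit): Unit = {
        try {
          records.foreach { case (k, v) => writer.write(k, v) }
        } finally {
          writer.close(ctx)   // release the file handle first
        }
        commit(ctx)           // then rename the temp output into place
      }
      ```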
      
      ## How was this patch tested?
      
      Manually tested via AppVeyor.
      
      **Before**
      
      https://ci.appveyor.com/project/spark-test/spark/build/94-scala-tests
      
      **After**
      
      https://ci.appveyor.com/project/spark-test/spark/build/93-scala-tests
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16098 from HyukjinKwon/close-wirter-first.
      Unverified
      b8c7b8d3
  5. Dec 04, 2016
    • Reynold Xin's avatar
      [SPARK-18702][SQL] input_file_block_start and input_file_block_length · e9730b70
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We currently have the function input_file_name to get the path of the input file, but no functions to get the block start offset and length. This patch introduces two functions (a usage sketch follows the list):
      
      1. input_file_block_start: returns the file block start offset, or -1 if not available.
      
      2. input_file_block_length: returns the file block length, or -1 if not available.
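      
      A hedged usage sketch, assuming the two functions are exposed as SQL expressions alongside input_file_name (the parquet path is illustrative):
      
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().appName("input-file-block-demo").getOrCreate()
      
      // Returns the input file path, the block start offset, and the block length
      // for each row; the latter two are -1 when the information is unavailable.
      spark.read.parquet("/tmp/people.parquet")   // illustrative path
        .selectExpr("input_file_name()", "input_file_block_start()", "input_file_block_length()")
        .show(truncate = false)
      ```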
      
      ## How was this patch tested?
      Updated existing test cases in ColumnExpressionSuite that covered input_file_name to also cover the two new functions.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16133 from rxin/SPARK-18702.
      e9730b70
  6. Dec 02, 2016
    • Reynold Xin's avatar
      [SPARK-18695] Bump master branch version to 2.2.0-SNAPSHOT · c7c72659
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch bumps master branch version to 2.2.0-SNAPSHOT.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16126 from rxin/SPARK-18695.
      c7c72659
    • Eric Liang's avatar
      [SPARK-18679][SQL] Fix regression in file listing performance for non-catalog tables · 294163ee
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. `spark.read.parquet(topLevelDir)`), the top of the tree is only a single directory.
      
      This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, the sub-tree is listed in serial on executors).
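      
      A much-simplified local sketch of deciding per level whether to fan out (generic Scala with parallel collections, not the InMemoryFileIndex code; the threshold is illustrative):
      
      ```scala
      import java.io.File
      
      // Recursively list files, switching to parallel listing at whatever level
      // first has enough child directories. Once a sub-tree is handed to a
      // parallel worker it is listed serially, mirroring how the real code ships
      // a sub-tree to executors and lists it there in serial.
      def listRec(dir: File, threshold: Int = 32, serial: Boolean = false): Seq[File] = {
        val children = Option(dir.listFiles()).map(_.toSeq).getOrElse(Nil)
        val (dirs, files) = children.partition(_.isDirectory)
        val nested =
          if (!serial && dirs.length > threshold)
            dirs.par.flatMap(d => listRec(d, threshold, serial = true)).seq
          else
            dirs.flatMap(d => listRec(d, threshold, serial))
        files ++ nested
      }
      ```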
      
      cc mallman  cloud-fan
      
      ## How was this patch tested?
      
      Checked metrics in unit tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #16112 from ericl/spark-18679.
      294163ee
  7. Dec 01, 2016
  8. Nov 30, 2016
    • wm624@hotmail.com's avatar
      [SPARK-18476][SPARKR][ML] SparkR Logistic Regression should support output of original labels. · 2eb6764f
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      Similar to SPARK-18401, as a classification algorithm, logistic regression should support outputting the original labels instead of the index labels.
      
      In this PR, original label output is supported, and test cases are modified and added. The documentation is also updated.
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #15910 from wangmiao1981/audit.
      2eb6764f
    • Shixiong Zhu's avatar
      [SPARK-18655][SS] Ignore Structured Streaming 2.0.2 logs in history server · c4979f6e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      As `queryStatus` in StreamingQueryListener events was removed in #15954, parsing Structured Streaming logs from 2.0.2 will throw the following error:
      
      ```
      [info]   com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "queryStatus" (class org.apache.spark.sql.streaming.StreamingQueryListener$QueryTerminatedEvent), not marked as ignorable (2 known properties: "id", "exception"])
      [info]  at [Source: {"Event":"org.apache.spark.sql.streaming.StreamingQueryListener$QueryTerminatedEvent","queryStatus":{"name":"query-1","id":1,"timestamp":1480491532753,"inputRate":0.0,"processingRate":0.0,"latency":null,"sourceStatuses":[{"description":"FileStreamSource[file:/Users/zsx/stream]","offsetDesc":"#0","inputRate":0.0,"processingRate":0.0,"triggerDetails":{"latency.getOffset.source":"1","triggerId":"1"}}],"sinkStatus":{"description":"FileSink[/Users/zsx/stream2]","offsetDesc":"[#0]"},"triggerDetails":{}},"exception":null}; line: 1, column: 521] (through reference chain: org.apache.spark.sql.streaming.QueryTerminatedEvent["queryStatus"])
      [info]   at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51)
      [info]   at com.fasterxml.jackson.databind.DeserializationContext.reportUnknownProperty(DeserializationContext.java:839)
      [info]   at com.fasterxml.jackson.databind.deser.std.StdDeserializer.handleUnknownProperty(StdDeserializer.java:1045)
      [info]   at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperty(BeanDeserializerBase.java:1352)
      [info]   at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperties(BeanDeserializerBase.java:1306)
      [info]   at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:453)
      [info]   at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1099)
      ...
      ```
      
      This PR just ignores such errors and adds a test to make sure we can read 2.0.2 logs.
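      
      A generic Jackson sketch of the "ignore unknown fields" approach (the event class and field set here are illustrative; Spark's actual event-log parsing differs in detail):
      
      ```scala
      import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
      import com.fasterxml.jackson.module.scala.DefaultScalaModule
      
      case class TerminatedEvent(id: Long, exception: Option[String])
      
      val mapper = new ObjectMapper()
      mapper.registerModule(DefaultScalaModule)
      // Don't fail when a 2.0.2 log still carries the removed "queryStatus" field.
      mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
      
      val json = """{"id": 1, "queryStatus": {"name": "query-1"}, "exception": null}"""
      val event = mapper.readValue(json, classOf[TerminatedEvent])
      ```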
      
      ## How was this patch tested?
      
      `query-event-logs-version-2.0.2.txt` has all types of events generated by Structured Streaming in Spark 2.0.2. `testQuietly("ReplayListenerBus should ignore broken event jsons generated in 2.0.2")` verified we can load them without any error.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16085 from zsxwing/SPARK-18655.
      c4979f6e
    • Marcelo Vanzin's avatar
      [SPARK-18546][CORE] Fix merging shuffle spills when using encryption. · 93e9d880
      Marcelo Vanzin authored
      The problem exists because it's not possible to just concatenate encrypted
      partition data from different spill files; currently each partition would
      have its own initial vector to set up encryption, and the final merged file
      should contain a single initial vector for each merged partition; otherwise
      iterating over each record becomes really hard.
      
      To fix that, UnsafeShuffleWriter now decrypts the partitions when merging,
      so that the merged file contains a single initial vector at the start of
      the partition data.
      
      Because it's not possible to do that using the fast transferTo path, when
      encryption is enabled UnsafeShuffleWriter will revert back to using file
      streams when merging. It may be possible to use a hybrid approach when
      using encryption, using an intermediate direct buffer when reading from
      files and encrypting the data, but that's better left for a separate patch.
      
      As part of the change I made DiskBlockObjectWriter take a SerializerManager
      instead of a "wrap stream" closure, since that makes it easier to test the
      code without having to mock SerializerManager functionality.
      
      Tested with newly added unit tests (UnsafeShuffleWriterSuite for the write
      side and ExternalAppendOnlyMapSuite for integration), and by running some
      apps that failed without the fix.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #15982 from vanzin/SPARK-18546.
      93e9d880
    • Josh Rosen's avatar
      [SPARK-18640] Add synchronization to TaskScheduler.runningTasksByExecutors · c51c7725
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      The method `TaskSchedulerImpl.runningTasksByExecutors()` accesses the mutable `executorIdToRunningTaskIds` map without proper synchronization. In addition, as markhamstra pointed out in #15986, the signature's use of parentheses is a little odd given that this is a pure getter method.
      
      This patch fixes both issues.
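      
      A minimal sketch of the resulting shape (names follow the description above; heavily simplified):
      
      ```scala
      import scala.collection.mutable
      
      class SchedulerState {
        // mutated by the scheduler while holding this object's lock
        private val executorIdToRunningTaskIds = mutable.HashMap[String, mutable.HashSet[Long]]()
      
        // parameterless getter that reads the mutable map under the same lock
        def runningTasksByExecutors: Map[String, Int] = synchronized {
          executorIdToRunningTaskIds.map { case (exec, tasks) => exec -> tasks.size }.toMap
        }
      }
      ```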
      
      ## How was this patch tested?
      
      Covered by existing tests.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #16073 from JoshRosen/runningTasksByExecutors-thread-safety.
      c51c7725
    • uncleGen's avatar
      [SPARK-18617][CORE][STREAMING] Close "kryo auto pick" feature for Spark Streaming · 56c82eda
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      #15992 provided a solution to fix the bug, i.e. **receiver data cannot be deserialized properly**. As zsxwing said, it is a critical bug, but we should not break APIs between maintenance releases. It may be a reasonable choice to disable the Kryo auto-pick feature for Spark Streaming as a first step. I will continue #15992 to optimize the solution.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #16052 from uncleGen/SPARK-18617.
      56c82eda
  9. Nov 29, 2016
    • Josh Rosen's avatar
      [SPARK-18553][CORE] Fix leak of TaskSetManager following executor loss · 9a02f682
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      _This is the master branch version of #15986; the original description follows:_
      
      This patch fixes a critical resource leak in the TaskScheduler which could cause RDDs and ShuffleDependencies to be kept alive indefinitely if an executor with running tasks is permanently lost and the associated stage fails.
      
      This problem was originally identified by analyzing the heap dump of a driver belonging to a cluster that had run out of shuffle space. This dump contained several `ShuffleDependency` instances that were retained by `TaskSetManager`s inside the scheduler but were not otherwise referenced. Each of these `TaskSetManager`s was considered a "zombie" but had no running tasks and therefore should have been cleaned up. However, these zombie task sets were still referenced by the `TaskSchedulerImpl.taskIdToTaskSetManager` map.
      
      Entries are added to the `taskIdToTaskSetManager` map when tasks are launched and are removed inside of `TaskScheduler.statusUpdate()`, which is invoked by the scheduler backend while processing `StatusUpdate` messages from executors. The problem with this design is that a completely dead executor will never send a `StatusUpdate`. There is [some code](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L338) in `statusUpdate` which handles tasks that exit with the `TaskState.LOST` state (which is supposed to correspond to a task failure triggered by total executor loss), but this state only seems to be used in Mesos fine-grained mode. There doesn't seem to be any code which performs per-task state cleanup for tasks that were running on an executor that completely disappears without sending any sort of final death message. The `executorLost` and [`removeExecutor`](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L527) methods don't appear to perform any cleanup of the `taskId -> *` mappings, causing the leaks observed here.
      
      This patch's fix is to maintain a `executorId -> running task id` mapping so that these `taskId -> *` maps can be properly cleaned up following an executor loss.
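      
      A hedged sketch of that bookkeeping (heavily simplified; the real scheduler also tracks TaskSetManagers and more state):
      
      ```scala
      import scala.collection.mutable
      
      class TaskBookkeeping {
        private val taskIdToTaskSetId = mutable.HashMap[Long, String]()
        private val executorIdToRunningTaskIds = mutable.HashMap[String, mutable.HashSet[Long]]()
      
        def taskLaunched(taskId: Long, taskSetId: String, executorId: String): Unit = synchronized {
          taskIdToTaskSetId(taskId) = taskSetId
          executorIdToRunningTaskIds.getOrElseUpdate(executorId, mutable.HashSet.empty[Long]) += taskId
        }
      
        // Normal path: cleanup happens when a StatusUpdate arrives for the task.
        def taskFinished(taskId: Long, executorId: String): Unit = synchronized {
          taskIdToTaskSetId.remove(taskId)
          executorIdToRunningTaskIds.get(executorId).foreach(_ -= taskId)
        }
      
        // The fix: on total executor loss, drop every taskId -> * entry that the
        // dead executor would never clear via a StatusUpdate message.
        def removeExecutor(executorId: String): Unit = synchronized {
          executorIdToRunningTaskIds.remove(executorId).foreach(_.foreach(taskIdToTaskSetId.remove))
        }
      }
      ```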
      
      There are some potential corner-case interactions that I'm concerned about here, especially some details in [the comment](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L523) in `removeExecutor`, so I'd appreciate a very careful review of these changes.
      
      ## How was this patch tested?
      
      I added a new unit test to `TaskSchedulerImplSuite`.
      
      /cc kayousterhout and markhamstra, who reviewed #15986.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #16045 from JoshRosen/fix-leak-following-total-executor-loss-master.
      9a02f682
    • hyukjinkwon's avatar
      [SPARK-18615][DOCS] Switch to multi-line doc to avoid a genjavadoc bug for backticks · 1a870090
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, a single-line doc comment does not render backticks as `<code>..</code>` but prints them as they are (`` `..` ``). For example, the line below:
      
      ```scala
      /** Return an RDD with the pairs from `this` whose keys are not in `other`. */
      ```
      
      So, we could work around this as below:
      
      ```scala
      /**
       * Return an RDD with the pairs from `this` whose keys are not in `other`.
       */
      ```
      
      - javadoc
      
        - **Before**
          ![2016-11-29 10 39 14](https://cloud.githubusercontent.com/assets/6477701/20693606/e64c8f90-b622-11e6-8dfc-4a029216e23d.png)
      
        - **After**
          ![2016-11-29 10 39 08](https://cloud.githubusercontent.com/assets/6477701/20693607/e7280d36-b622-11e6-8502-d2e21cd5556b.png)
      
      - scaladoc (this one looks fine either way)
      
        - **Before**
          ![2016-11-29 10 38 22](https://cloud.githubusercontent.com/assets/6477701/20693640/12c18aa8-b623-11e6-901a-693e2f6f8066.png)
      
        - **After**
          ![2016-11-29 10 40 05](https://cloud.githubusercontent.com/assets/6477701/20693642/14eb043a-b623-11e6-82ac-7cd0000106d1.png)
      
      I suspect this is related to SPARK-16153 and the genjavadoc issue in typesafehub/genjavadoc#85.
      
      ## How was this patch tested?
      
      I found them via
      
      ```
      grep -r "\/\*\*.*\`" . | grep .scala
      ```
      
      and then checked if each is in the public API documentation with manually built docs (`jekyll build`) with Java 7.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16050 from HyukjinKwon/javadoc-markdown.
      Unverified
      1a870090
    • hyukjinkwon's avatar
      [SPARK-3359][DOCS] Make javadoc8 working for unidoc/genjavadoc compatibility... · f830bb91
      hyukjinkwon authored
      [SPARK-3359][DOCS] Make javadoc8 working for unidoc/genjavadoc compatibility in Java API documentation
      
      ## What changes were proposed in this pull request?
      
      This PR makes `sbt unidoc` complete successfully with Java 8.
      
      This PR roughly includes several fixes as below:
      
      - Fix unrecognisable class and method links in javadoc by changing it from `[[..]]` to `` `...` ``
      
        ```diff
        - * A column that will be computed based on the data in a [[DataFrame]].
        + * A column that will be computed based on the data in a `DataFrame`.
        ```
      
      - Fix throws annotations so that they are recognisable in javadoc
      
      - Fix URL links to `<a href="http..."></a>`.
      
        ```diff
        - * [[http://en.wikipedia.org/wiki/Decision_tree_learning Decision tree]] model for regression.
        + * <a href="http://en.wikipedia.org/wiki/Decision_tree_learning">
        + * Decision tree (Wikipedia)</a> model for regression.
        ```
      
        ```diff
        -   * see http://en.wikipedia.org/wiki/Receiver_operating_characteristic
        +   * see <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic">
        +   * Receiver operating characteristic (Wikipedia)</a>
        ```
      
      - Fix `<` and `>` in doc comments, by either:
      
        - spelling them out as `greater than`/`greater than or equal to` or `less than`/`less than or equal to` where applicable, or
      
        - wrapping them with `{{{...}}}` so they print in javadoc, or using `{code ...}` or `{literal ..}`. Please refer to https://github.com/apache/spark/pull/16013#discussion_r89665558
      
      - Fix `</p>` complaint
      
      ## How was this patch tested?
      
      Manually tested by `jekyll build` with Java 7 and 8
      
      ```
      java version "1.7.0_80"
      Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
      ```
      
      ```
      java version "1.8.0_45"
      Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16013 from HyukjinKwon/SPARK-3359-errors-more.
      Unverified
      f830bb91
    • Davies Liu's avatar
      [SPARK-18188] add checksum for blocks of broadcast · 7d5cb3af
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      A TorrentBroadcast is serialized and compressed first, then split into fixed-size blocks. If any block is corrupted when fetched from a remote executor, decompression/deserialization will fail without knowing which block is corrupt. Also, the corrupt block is kept in the block manager and reported to the driver, so other tasks (in the same executor or in different executors) will also fail because of it.
      
      This PR adds a checksum for each block and checks it after fetching a block from a remote executor, because the corruption most likely happens in the network. When corruption is detected, the block is thrown away and an exception is thrown to fail the task, which will then be retried.
      
      Added a config for it: `spark.broadcast.checksum`, which is true by default.
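      
      A minimal sketch of per-block checksumming with `java.util.zip.Adler32` (the concrete algorithm and layout here are illustrative, not necessarily what TorrentBroadcast uses):
      
      ```scala
      import java.io.IOException
      import java.nio.ByteBuffer
      import java.util.zip.Adler32
      
      // Compute a checksum of a block without consuming the buffer. The sender
      // stores one checksum per block; the receiver recomputes it after fetching
      // a block from a remote executor and fails the task on a mismatch.
      def checksum(block: ByteBuffer): Int = {
        val dup = block.duplicate()
        val bytes = new Array[Byte](dup.remaining())
        dup.get(bytes)
        val adler = new Adler32
        adler.update(bytes)
        adler.getValue.toInt
      }
      
      def verify(block: ByteBuffer, expected: Int): Unit =
        if (checksum(block) != expected) {
          throw new IOException("Corrupt remote block: checksum mismatch, failing the task so it is retried")
        }
      ```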
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #15935 from davies/broadcast_checksum.
      7d5cb3af
  10. Nov 28, 2016
    • Marcelo Vanzin's avatar
      [SPARK-18547][CORE] Propagate I/O encryption key when executors register. · 8b325b17
      Marcelo Vanzin authored
      This change modifies the method used to propagate encryption keys used during
      shuffle. Instead of relying on YARN's UserGroupInformation credential propagation,
      this change explicitly distributes the key using the messages exchanged between
      driver and executor during registration. When RPC encryption is enabled, this means
      key propagation is also secure.
      
      This allows shuffle encryption to work in non-YARN mode, which means that it's
      easier to write unit tests for areas of the code that are affected by the feature.
      
      The key is stored in the SecurityManager; because there are many instances of
      that class used in the code, the key is only guaranteed to exist in the instance
      managed by the SparkEnv. This path was chosen to avoid storing the key in the
      SparkConf, which would risk having the key being written to disk as part of the
      configuration (as, for example, is done when starting YARN applications).
      
      Tested by new and existing unit tests (which were moved from the YARN module to
      core), and by running apps with shuffle encryption enabled.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #15981 from vanzin/SPARK-18547.
      8b325b17
    • Imran Rashid's avatar
      [SPARK-18117][CORE] Add test for TaskSetBlacklist · 8b1609be
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      This adds tests to verify the interaction between TaskSetBlacklist and
      TaskSchedulerImpl.  TaskSetBlacklist was introduced by SPARK-17675 but
      it neglected to add these tests.
      
      This change does not fix any bugs -- it is just for increasing test
      coverage.
      ## How was this patch tested?
      
      Jenkins
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #15644 from squito/taskset_blacklist_test_update.
      8b1609be
    • Mark Grover's avatar
      [SPARK-18535][UI][YARN] Redact sensitive information from Spark logs and UI · 237c3b96
      Mark Grover authored
      ## What changes were proposed in this pull request?
      
      This patch adds a new property called `spark.secret.redactionPattern` that
      allows users to specify a scala regex to decide which Spark configuration
      properties and environment variables in driver and executor environments
      contain sensitive information. When this regex matches the property or
      environment variable name, its value is redacted from the environment UI and
      various logs like YARN and event logs.
      
      This change uses this property to redact information from event logs and YARN
      logs. It also updates the UI code to adhere to this property instead of
      hardcoding the logic to decipher which properties are sensitive.
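      
      A hedged sketch of the redaction step itself (the replacement text follows the examples below; the helper is illustrative):
      
      ```scala
      import scala.util.matching.Regex
      
      val REDACTION_REPLACEMENT = "*********(redacted)"
      
      // Replace the value of any (key, value) pair whose key matches the
      // configured pattern, e.g. a regex like "(?i)secret|password".
      def redact(pattern: Regex, kvs: Seq[(String, String)]): Seq[(String, String)] =
        kvs.map { case (key, value) =>
          if (pattern.findFirstIn(key).isDefined) (key, REDACTION_REPLACEMENT) else (key, value)
        }
      ```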
      
      Here's an image of the UI post-redaction:
      ![image](https://cloud.githubusercontent.com/assets/1709451/20506215/4cc30654-b007-11e6-8aee-4cde253fba2f.png)
      
      Here's the text in the YARN logs, post-redaction:
      ``HADOOP_CREDSTORE_PASSWORD -> *********(redacted)``
      
      Here's the text in the event logs, post-redaction:
      ``...,"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"*********(redacted)","spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD":"*********(redacted)",...``
      
      ## How was this patch tested?
      1. Unit tests are added to ensure that redaction works.
      2. A YARN job reading data off of S3 with confidential information
      (hadoop credential provider password) being provided in the environment
      variables of driver and executor. And, afterwards, logs were grepped to make
      sure that no mention of the secret password was present. It was also ensured that
      the job was able to read the data off of S3 correctly, thereby ensuring that
      the sensitive information was being trickled down to the right places to read
      the data.
      3. The event logs were checked to make sure no mention of secret password was
      present.
      4. UI environment tab was checked to make sure there was no secret information
      being displayed.
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #15971 from markgrover/master_redaction.
      237c3b96
  11. Nov 25, 2016
    • Takuya UESHIN's avatar
      [SPARK-18583][SQL] Fix nullability of InputFileName. · a88329d4
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      The nullability of `InputFileName` should be `false`.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #16007 from ueshin/issues/SPARK-18583.
      a88329d4
    • hyukjinkwon's avatar
      [SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will... · 51b1c155
      hyukjinkwon authored
      [SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility
      
      ## What changes were proposed in this pull request?
      
      This PR only tries to fix things that look pretty straightforward and were already fixed in previous PRs.
      
      This PR roughly fixes several things as below:
      
      - Fix unrecognisable class and method links in javadoc by changing it from `[[..]]` to `` `...` ``
      
        ```
        [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/DataStreamReader.java:226: error: reference not found
        [error]    * Loads text files and returns a {link DataFrame} whose schema starts with a string column named
        ```
      
      - Fix an exception annotation and remove code backticks in `throws` annotation
      
        Currently, sbt unidoc with Java 8 complains as below:
      
        ```
        [error] .../java/org/apache/spark/sql/streaming/StreamingQuery.java:72: error: unexpected text
        [error]    * throws StreamingQueryException, if <code>this</code> query has terminated with an exception.
        ```
      
        `throws` should specify the correct class name from `StreamingQueryException,` to `StreamingQueryException` without backticks. (see [JDK-8007644](https://bugs.openjdk.java.net/browse/JDK-8007644)).
      
      - Fix `[[http..]]` to `<a href="http..."></a>`.
      
        ```diff
        -   * [[https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https Oracle
        -   * blog page]].
        +   * <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">
        +   * Oracle blog page</a>.
        ```
      
         `[[http...]]` link markdown in scaladoc is unrecognisable in javadoc.
      
      - It seems a class can't have a `return` annotation, so two such cases were removed.
      
        ```
        [error] .../java/org/apache/spark/mllib/regression/IsotonicRegression.java:27: error: invalid use of return
        [error]    * return New instance of IsotonicRegression.
        ```
      
      - Fix < to `&lt;` and > to `&gt;` according to HTML rules.
      
      - Fix `</p>` complaint
      
      - Exclude tags unrecognisable in javadoc: `constructor`, `todo` and `groupname`.
      
      ## How was this patch tested?
      
      Manually tested by `jekyll build` with Java 7 and 8
      
      ```
      java version "1.7.0_80"
      Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
      ```
      
      ```
      java version "1.8.0_45"
      Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
      ```
      
      Note: this does not yet make sbt unidoc succeed with Java 8, but it reduces the number of errors.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15999 from HyukjinKwon/SPARK-3359-errors.
      Unverified
      51b1c155
    • n.fraison's avatar
      [SPARK-18119][SPARK-CORE] Namenode safemode check is only performed on one... · f42db0c0
      n.fraison authored
      [SPARK-18119][SPARK-CORE] Namenode safemode check is only performed on one namenode, which can stall the startup of the Spark History server
      
      ## What changes were proposed in this pull request?
      
      Instead of using the setSafeMode method that checks only the first configured namenode, use the variant that allows checking only the active namenodes.
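      
      A hedged sketch of the call in question, using the `DistributedFileSystem.setSafeMode(action, isChecked)` overload where `isChecked = true` restricts the check to active namenodes:
      
      ```scala
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.FileSystem
      import org.apache.hadoop.hdfs.DistributedFileSystem
      import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction
      
      val fs = FileSystem.get(new Configuration())
      val inSafeMode = fs match {
        case dfs: DistributedFileSystem =>
          // isChecked = true: query only the active namenode(s) rather than the
          // first configured one, so a standby NN cannot stall the history server.
          dfs.setSafeMode(SafeModeAction.SAFEMODE_GET, true)
        case _ => false
      }
      ```
      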
      ## How was this patch tested?
      
      manual tests
      
      Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.
      
      This commit is contributed by Criteo SA under the Apache v2 licence.
      
      Author: n.fraison <n.fraison@criteo.com>
      
      Closes #15648 from ashangit/SPARK-18119.
      Unverified
      f42db0c0
  12. Nov 23, 2016
    • Reynold Xin's avatar
      [SPARK-18557] Downgrade confusing memory leak warning message · 9785ed40
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      TaskMemoryManager has a memory leak detector that gets called at task completion callback and checks whether any memory has not been released. If they are not released by the time the callback is invoked, TaskMemoryManager releases them.
      
      The current error message says something like the following:
      ```
      WARN  [Executor task launch worker-0]
      org.apache.spark.memory.TaskMemoryManager - leak 16.3 MB memory from
      org.apache.spark.unsafe.map.BytesToBytesMap33fb6a15
      ```
      
      In practice, there are multiple reasons why these can be triggered in the normal code path (e.g. limit, or task failures), and the fact that these messages are merely logged means the "leak" is already handled by TaskMemoryManager.
      
      To avoid confusing users, this patch downgrades the message from warning to debug level, and avoids using the word "leak" since it is not actually a leak.
      
      ## How was this patch tested?
      N/A - this is a simple logging improvement.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15989 from rxin/SPARK-18557.
      9785ed40
    • Eric Liang's avatar
      [SPARK-18545][SQL] Verify number of hive client RPCs in PartitionedTablePerfStatsSuite · 85235ed6
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This would help catch accidental O(n) calls to the hive client as in https://issues.apache.org/jira/browse/SPARK-18507
      
      ## How was this patch tested?
      
      Checked that the test fails before https://issues.apache.org/jira/browse/SPARK-18507 was patched. cc cloud-fan
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15985 from ericl/spark-18545.
      85235ed6
  13. Nov 19, 2016
    • Kazuaki Ishizaki's avatar
      [SPARK-18458][CORE] Fix signed integer overflow problem at an expression in RadixSort.java · d93b6552
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR prevents the result of an expression from becoming negative due to signed integer overflow (e.g. 0x10?????? * 8 < 0). It casts each operand to `long` before performing the calculation; since the result is then interpreted as a long, the value stays positive.
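      
      A tiny Scala illustration of the overflow and the fix (the real change is in the Java class RadixSort):
      
      ```scala
      // With 32-bit arithmetic the product wraps around and goes negative;
      // widening an operand to Long before multiplying keeps it positive.
      val pos: Int = 0x10000000              // 268435456
      val overflowed: Int = pos * 8          // 0x80000000 wraps to -2147483648
      val correct: Long = pos.toLong * 8L    // 2147483648
      ```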
      
      ## How was this patch tested?
      
      Manually executed query82 of TPC-DS with 100TB
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #15907 from kiszk/SPARK-18458.
      d93b6552
    • Stavros Kontopoulos's avatar
      [SPARK-17062][MESOS] add conf option to mesos dispatcher · ea77c81e
      Stavros Kontopoulos authored
      Adds a --conf option to set Spark configuration properties for the Mesos dispatcher.
      Properties provided with --conf take precedence over properties in the properties file.
      The reason for this PR is that, for simple configuration or testing purposes, we currently need to provide a properties file (ideally a shared one for a cluster) even if we just want to set a single property.
      
      Manually tested.
      
      Author: Stavros Kontopoulos <st.kontopoulos@gmail.com>
      Author: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>
      
      Closes #14650 from skonto/dipatcher_conf.
      ea77c81e
    • Sean Owen's avatar
      [SPARK-18353][CORE] spark.rpc.askTimeout default value is not 120s · 8b1e1088
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Avoid hard-coding spark.rpc.askTimeout to non-default in Client; fix doc about spark.rpc.askTimeout default
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15833 from srowen/SPARK-18353.
      Unverified
      8b1e1088
    • hyukjinkwon's avatar
      [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note... · d5b1d5fc
      hyukjinkwon authored
      [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that`/`'''Note:'''` across Scala/Java API documentation
      
      ## What changes were proposed in this pull request?
      
      It seems that in Scala/Java, the following variants are used interchangeably:
      
      - `Note:`
      - `NOTE:`
      - `Note that`
      - `'''Note:'''`
      - `note`
      
      This PR proposes to fix those to `note` to be consistent.
      
      **Before**
      
      - Scala
        ![2016-11-17 6 16 39](https://cloud.githubusercontent.com/assets/6477701/20383180/1a7aed8c-acf2-11e6-9611-5eaf6d52c2e0.png)
      
      - Java
        ![2016-11-17 6 14 41](https://cloud.githubusercontent.com/assets/6477701/20383096/c8ffc680-acf1-11e6-914a-33460bf1401d.png)
      
      **After**
      
      - Scala
        ![2016-11-17 6 16 44](https://cloud.githubusercontent.com/assets/6477701/20383167/09940490-acf2-11e6-937a-0d5e1dc2cadf.png)
      
      - Java
        ![2016-11-17 6 13 39](https://cloud.githubusercontent.com/assets/6477701/20383132/e7c2a57e-acf1-11e6-9c47-b849674d4d88.png)
      
      ## How was this patch tested?
      
      The notes were found via
      
      ```bash
      grep -r "NOTE: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// NOTE: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \ # note that this is a regular expression. So actual matches were mostly `org/apache/spark/api/java/functions ...`
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note that " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note that " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "'''Note:'''" . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// '''Note:''' " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      And then fixed them one by one, comparing against the API documentation and access modifiers.
      
      After that, manually tested via `jekyll build`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15889 from HyukjinKwon/SPARK-18437.
      Unverified
      d5b1d5fc
  14. Nov 18, 2016
  15. Nov 17, 2016
  16. Nov 16, 2016
    • Xianyang Liu's avatar
      [SPARK-18420][BUILD] Fix the errors caused by lint check in Java · 7569cf6c
      Xianyang Liu authored
      ## What changes were proposed in this pull request?
      
      A small fix for the errors caused by the lint check in Java:
      
      - Clear unused objects and `UnusedImports`.
      - Add comments around the method `finalize` of `NioBufferedFileInputStream` to turn off checkstyle.
      - Cut lines longer than 100 characters into two lines.
      
      ## How was this patch tested?
      Travis CI.
      ```
      $ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
      $ dev/lint-java
      ```
      Before:
      ```
      Checkstyle checks failed at following occurrences:
      [ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[21,8] (imports) UnusedImports: Unused import - org.apache.commons.crypto.cipher.CryptoCipherFactory.
      [ERROR] src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java:[516,5] (modifier) RedundantModifier: Redundant 'public' modifier.
      [ERROR] src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java:[133] (coding) NoFinalizer: Avoid using finalizer method.
      [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeMapData.java:[71] (sizes) LineLength: Line is longer than 100 characters (found 113).
      [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java:[112] (sizes) LineLength: Line is longer than 100 characters (found 110).
      [ERROR] src/test/java/org/apache/spark/sql/catalyst/expressions/HiveHasherSuite.java:[31,17] (modifier) ModifierOrder: 'static' modifier out of order with the JLS suggestions.
      [ERROR]src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103).
      [ERROR] src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[22,8] (imports) UnusedImports: Unused import - org.apache.spark.ml.linalg.Vectors.
      [ERROR] src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[51] (regexp) RegexpSingleline: No trailing whitespace allowed.
      ```
      
      After:
      ```
      $ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
      $ dev/lint-java
      Using `mvn` from path: /home/travis/build/ConeyLiu/spark/build/apache-maven-3.3.9/bin/mvn
      Checkstyle checks passed.
      ```
      
      Author: Xianyang Liu <xyliu0530@icloud.com>
      
      Closes #15865 from ConeyLiu/master.
      Unverified
      7569cf6c
  17. Nov 14, 2016
  18. Nov 12, 2016
    • Guoqiang Li's avatar
      [SPARK-18375][SPARK-18383][BUILD][CORE] Upgrade netty to 4.0.42.Final · bc41d997
      Guoqiang Li authored
      ## What changes were proposed in this pull request?
      
      One of the important changes for 4.0.42.Final is "Support any FileRegion implementation when using epoll transport netty/netty#5825".
      In 4.0.42.Final, `MessageWithHeader` can work properly when `spark.[shuffle|rpc].io.mode` is set to epoll
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Guoqiang Li <witgo@qq.com>
      
      Closes #15830 from witgo/SPARK-18375_netty-4.0.42.
      Unverified
      bc41d997
  19. Nov 11, 2016
    • Weiqing Yang's avatar
      [SPARK-16759][CORE] Add a configuration property to pass caller contexts of... · 3af89451
      Weiqing Yang authored
      [SPARK-16759][CORE] Add a configuration property to pass caller contexts of upstream applications into Spark
      
      ## What changes were proposed in this pull request?
      
      Many applications take Spark as a computing engine and run on it. This PR adds a configuration property `spark.log.callerContext` that can be used by Spark's upstream applications (e.g. Oozie) to set up their caller contexts into Spark. In the end, Spark will combine its own caller context with the caller contexts of its upstream applications, and write them into Yarn RM log and HDFS audit log.
      
      The audit log has a config to truncate the caller contexts passed in (default 128). The caller contexts will be sent over RPC, so they should be concise. The caller context written into the HDFS audit log and YARN log consists of two parts: the information `A` specified by Spark itself and the value `B` of the `spark.log.callerContext` property. Currently `A` typically takes 64 to 74 characters, so `B` can have up to 50 characters (as mentioned in the doc `running-on-yarn.md`).
      ## How was this patch tested?
      
      Manual tests. I have run some Spark applications with `spark.log.callerContext` configuration in Yarn client/cluster mode, and verified that the caller contexts were written into Yarn RM log and HDFS audit log correctly.
      
      The ways to configure `spark.log.callerContext` property:
      - In spark-defaults.conf:
      
      ```
      spark.log.callerContext  infoSpecifiedByUpstreamApp
      ```
      - In app's source code:
      
      ```
      val spark = SparkSession
            .builder
            .appName("SparkKMeans")
            .config("spark.log.callerContext", "infoSpecifiedByUpstreamApp")
            .getOrCreate()
      ```
      
      When running in YARN cluster mode, the driver is unable to pass 'spark.log.callerContext' to the YARN client and AM, since the YARN client and AM have already started before the driver performs `.config("spark.log.callerContext", "infoSpecifiedByUpstreamApp")`.
      
      The following  example shows the command line used to submit a SparkKMeans application and the corresponding records in Yarn RM log and HDFS audit log.
      
      Command:
      
      ```
      ./bin/spark-submit --verbose --executor-cores 3 --num-executors 1 --master yarn --deploy-mode client --class org.apache.spark.examples.SparkKMeans examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar hdfs://localhost:9000/lr_big.txt 2 5
      ```
      
      Yarn RM log:
      
      <img width="1440" alt="screen shot 2016-10-19 at 9 12 03 pm" src="https://cloud.githubusercontent.com/assets/8546874/19547050/7d2f278c-9649-11e6-9df8-8d5ff12609f0.png">
      
      HDFS audit log:
      
      <img width="1400" alt="screen shot 2016-10-19 at 10 18 14 pm" src="https://cloud.githubusercontent.com/assets/8546874/19547102/096060ae-964a-11e6-981a-cb28efd5a058.png">
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #15563 from weiqingy/SPARK-16759.
      3af89451