  1. Mar 31, 2017
• [SPARK-20084][CORE] Remove internal.metrics.updatedBlockStatuses from history files. · e3cec18e
      Ryan Blue authored
      
      ## What changes were proposed in this pull request?
      
      Remove accumulator updates for internal.metrics.updatedBlockStatuses from SparkListenerTaskEnd entries in the history file. These can cause history files to grow to hundreds of GB because the value of the accumulator contains all tracked blocks.
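
A minimal sketch of the idea, using a hypothetical `AccumUpdate` case class rather than Spark's internal event types: drop the oversized accumulator by name before the task-end event is written to the event log.

```
// Hypothetical, simplified model of a task-end event's accumulator updates.
case class AccumUpdate(name: Option[String], value: Any)

object EventLogFilter {
  private val Excluded = "internal.metrics.updatedBlockStatuses"

  // Keep every accumulator update except the block-status one, which can hold
  // an entry per tracked block and bloat the history file.
  def filterAccumUpdates(updates: Seq[AccumUpdate]): Seq[AccumUpdate] =
    updates.filterNot(_.name.contains(Excluded))
}
```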
      
      ## How was this patch tested?
      
      Current History UI tests cover use of the history file.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #17412 from rdblue/SPARK-20084-remove-block-accumulator-info.
      
      (cherry picked from commit c4c03eed)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      e3cec18e
  2. Mar 29, 2017
• [SPARK-20059][YARN] Use the correct classloader for HBaseCredentialProvider · 103ff54d
      jerryshao authored
      
      ## What changes were proposed in this pull request?
      
Currently we use the system classloader to find the HBase jars; if they are supplied via `--jars`, the lookup fails with a `ClassNotFoundException`. This change switches to the child classloader.

Also, the added jars and the main jar are put on the classpath of the submitted application in YARN cluster mode; otherwise HBase jars specified with `--jars` are never honored in cluster mode, and fetching tokens on the client side always fails.
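
A generic sketch of the difference, not the actual `HBaseCredentialProvider` code: resolve the class through the loader that actually saw the `--jars` entries instead of the JVM system loader.

```
import java.net.{URL, URLClassLoader}

object ClassLoaderLookup {
  // Class.forName(name) resolves against the caller's defining loader (effectively
  // the system loader here), so classes shipped via --jars are invisible to it.
  // Resolving through an explicit child loader finds them.
  def loadWith(loader: ClassLoader, name: String): Class[_] =
    Class.forName(name, true, loader)

  // Example: a child loader layered over the current context class loader.
  def childLoader(jars: Array[URL]): ClassLoader =
    new URLClassLoader(jars, Thread.currentThread().getContextClassLoader)
}
```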
      
      ## How was this patch tested?
      
      Unit test and local verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17388 from jerryshao/SPARK-20059.
      
      (cherry picked from commit c622a87c)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      103ff54d
  3. Mar 22, 2017
  4. Mar 20, 2017
• [SPARK-17204][CORE] Fix replicated off heap storage · d205d40a
      Michael Allman authored
(Jira: https://issues.apache.org/jira/browse/SPARK-17204)
      
      ## What changes were proposed in this pull request?
      
      There are a couple of bugs in the `BlockManager` with respect to support for replicated off-heap storage. First, the locally-stored off-heap byte buffer is disposed of when it is replicated. It should not be. Second, the replica byte buffers are stored as heap byte buffers instead of direct byte buffers even when the storage level memory mode is off-heap. This PR addresses both of these problems.
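
A minimal illustration of the second bug using only `java.nio`: a replica destined for off-heap storage should land in a direct buffer, not a heap buffer.

```
import java.nio.ByteBuffer

object ReplicaBuffers {
  sealed trait MemoryMode
  case object OnHeap extends MemoryMode
  case object OffHeap extends MemoryMode

  // Allocate the replica's buffer according to the storage level's memory mode;
  // putting an off-heap replica into a heap buffer silently defeats the level.
  def allocateReplica(size: Int, mode: MemoryMode): ByteBuffer = mode match {
    case OffHeap => ByteBuffer.allocateDirect(size)
    case OnHeap  => ByteBuffer.allocate(size)
  }
}
```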
      
      ## How was this patch tested?
      
      `BlockManagerReplicationSuite` was enhanced to fill in the coverage gaps. It now fails if either of the bugs in this PR exist.
      
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #16499 from mallman/spark-17204-replicated_off_heap_storage.
      
      (cherry picked from commit 7fa116f8)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      d205d40a
  5. Mar 19, 2017
• [SPARK-18817][SPARKR][SQL] change derby log output to temp dir · b60f6902
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
Passes R's `tempdir()` (the R session temp dir, shared with other temp files/dirs) to the JVM and sets the system property for the Derby home dir, so that `derby.log` is moved there.
      
      ## How was this patch tested?
      
      Manually, unit tests
      
With this change, these files are relocated under /tmp:
      ```
      # ls /tmp/RtmpG2M0cB/
      derby.log
      ```
They are removed automatically when the R session ends.
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16330 from felixcheung/rderby.
      
      (cherry picked from commit 422aa67d)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
      b60f6902
  6. Mar 02, 2017
• [SPARK-19750][UI][BRANCH-2.1] Fix redirect issue from http to https · 3a7591ad
      jerryshao authored
      ## What changes were proposed in this pull request?
      
If the Spark UI port (4040) is not set, port 0 is chosen, which makes the HTTPS port also resolve to 0. In Spark 2.1, this HTTPS port (0) is used to build the redirect, so when a redirect is triggered it points to a wrong URL, for example:
      
      ```
      /tmp/temp$ wget http://172.27.25.134:55015
      --2017-02-23 12:13:54--  http://172.27.25.134:55015/
      Connecting to 172.27.25.134:55015... connected.
      HTTP request sent, awaiting response... 302 Found
      Location: https://172.27.25.134:0/ [following]
      --2017-02-23 12:13:54--  https://172.27.25.134:0/
      Connecting to 172.27.25.134:0... failed: Can't assign requested address.
      Retrying.
      
      --2017-02-23 12:13:55--  (try: 2)  https://172.27.25.134:0/
      Connecting to 172.27.25.134:0... failed: Can't assign requested address.
      Retrying.
      
      --2017-02-23 12:13:57--  (try: 3)  https://172.27.25.134:0/
      Connecting to 172.27.25.134:0... failed: Can't assign requested address.
      Retrying.
      
      --2017-02-23 12:14:00--  (try: 4)  https://172.27.25.134:0/
      Connecting to 172.27.25.134:0... failed: Can't assign requested address.
      Retrying.
      
      ```
      
So instead of redirecting to port 0, we should use the actual bound port.

This issue only exists in Spark 2.1 and earlier, and can be reproduced in YARN cluster mode.
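
A minimal, Spark-independent sketch of the underlying point: binding to port 0 makes the OS pick an ephemeral port, and it is that bound port, not 0, that must appear in any redirect URL.

```
import java.net.ServerSocket

object BoundPort {
  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(0)   // port 0 = "give me any free port"
    try {
      // getLocalPort returns the port actually bound; a redirect must embed this
      // value, never the configured 0.
      println(s"redirect should target port ${server.getLocalPort}")
    } finally server.close()
  }
}
```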
      
      ## How was this patch tested?
      
The existing redirect unit test doesn't cover this issue, so it was extended to do the correct verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17083 from jerryshao/SPARK-19750.
      3a7591ad
  7. Feb 26, 2017
• [SPARK-19594][STRUCTURED STREAMING] StreamingQueryListener fails to handle... · 04fbb9e0
      Eyal Zituny authored
[SPARK-19594][STRUCTURED STREAMING] StreamingQueryListener fails to handle QueryTerminatedEvent if more than one listener exists
      
      ## What changes were proposed in this pull request?
      
Currently, if multiple streaming query listeners exist, only one of them is invoked when a QueryTerminatedEvent is triggered; the rest ignore the event.
This happens because the streaming query listener bus holds a set of running query ids and, when a termination event is triggered, the first listener to handle it removes the terminated query id from that set.
With this PR, the query id is removed from the set only after all listeners have handled the event.
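
A minimal sketch of the corrected dispatch order, using plain Scala collections rather than Spark's listener bus: the terminated query id is dropped only after every listener has seen the event.

```
import scala.collection.mutable

object TerminationDispatch {
  final case class QueryTerminatedEvent(queryId: String)
  trait Listener { def onTerminated(event: QueryTerminatedEvent): Unit }

  private val activeQueryIds = mutable.Set[String]()

  def postTerminated(event: QueryTerminatedEvent, listeners: Seq[Listener]): Unit = {
    if (activeQueryIds.contains(event.queryId)) {
      // Deliver to every registered listener first...
      listeners.foreach(_.onTerminated(event))
      // ...and only then forget the query id, so no listener is skipped.
      activeQueryIds -= event.queryId
    }
  }
}
```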
      
      ## How was this patch tested?
      
A test with multiple listeners has been added to StreamingQueryListenerSuite.
      
      Author: Eyal Zituny <eyal.zituny@equalum.io>
      
      Closes #16991 from eyalzit/master.
      
      (cherry picked from commit 9f8e3921)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      04fbb9e0
  8. Feb 24, 2017
• [SPARK-19707][CORE] Improve the invalid path check for sc.addJar · 6da6a27f
      jerryshao authored
      
      ## What changes were proposed in this pull request?
      
Currently there are two issues in Spark when we add jars with an invalid path (see the sketch after this list):

* If the jar path is an empty string (e.g. `--jars ",dummy.jar"`), Spark resolves it to the current directory and adds that to the classpath / file server, which is unwanted. We hit this when submitting Spark applications programmatically; Spark should defensively filter out such empty paths.
* If the jar path is invalid (the file doesn't exist), `addJar` doesn't check it and still adds it to the file server, so the failure only surfaces when the job runs. The local path could be checked beforehand instead of waiting until task time. `addFile` already has such a check, but `addJar` lacks one.
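
A hedged sketch of such a defensive check, using only the standard library (not the actual `addJar` code):

```
import java.io.File
import java.net.URI

object JarPathCheck {
  /** Validate a --jars entry before handing it to the file server.
   *  Returns None for entries that should be rejected up front. */
  def validate(path: String): Option[String] = {
    if (path == null || path.trim.isEmpty) {
      None                                              // e.g. the empty entry in ",dummy.jar"
    } else {
      val uri = new URI(path.trim)
      val local = uri.getScheme == null || uri.getScheme == "file" || uri.getScheme == "local"
      if (local && !new File(uri.getPath).isFile) None  // a local jar must actually exist
      else Some(path)
    }
  }
}
```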
      
      ## How was this patch tested?
      
      Add unit test and local manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17038 from jerryshao/SPARK-19707.
      
      (cherry picked from commit b0a8c16f)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      6da6a27f
  9. Feb 22, 2017
• [SPARK-19652][UI] Do auth checks for REST API access (branch-2.1). · 21afc453
      Marcelo Vanzin authored
      The REST API has a security filter that performs auth checks
      based on the UI root's security manager. That works fine when
      the UI root is the app's UI, but not when it's the history server.
      
      In the SHS case, all users would be allowed to see all applications
      through the REST API, even if the UI itself wouldn't be available
      to them.
      
      This change adds auth checks for each app access through the API
      too, so that only authorized users can see the app's data.
      
      The change also modifies the existing security filter to use
      `HttpServletRequest.getRemoteUser()`, which is used in other
      places. That is not necessarily the same as the principal's
      name; for example, when using Hadoop's SPNEGO auth filter,
      the remote user strips the realm information, which then matches
      the user name registered as the owner of the application.
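
A minimal, hypothetical servlet filter in the same spirit (not the actual Spark filter): check the request's remote user against a per-application view ACL before letting the REST call through.

```
import javax.servlet._
import javax.servlet.http.{HttpServletRequest, HttpServletResponse}

// Hypothetical ACL check; Spark's real filter consults its SecurityManager.
class AppAclFilter(canView: (String, String) => Boolean) extends Filter {
  override def init(config: FilterConfig): Unit = {}
  override def destroy(): Unit = {}

  override def doFilter(req: ServletRequest, res: ServletResponse, chain: FilterChain): Unit = {
    val httpReq = req.asInstanceOf[HttpServletRequest]
    val httpRes = res.asInstanceOf[HttpServletResponse]
    val user = Option(httpReq.getRemoteUser).getOrElse("")  // realm already stripped by SPNEGO
    val path = Option(httpReq.getPathInfo).getOrElse("")    // e.g. "/applications/<appId>/jobs"
    if (canView(user, path)) chain.doFilter(req, res)
    else httpRes.sendError(HttpServletResponse.SC_FORBIDDEN, s"user '$user' cannot view $path")
  }
}
```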
      
      I also renamed the UIRootFromServletContext trait to a more generic
      name since I'm using it to store more context information now.
      
      Tested manually with an authentication filter enabled.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #17019 from vanzin/SPARK-19652_2.1.
      21afc453
  10. Feb 20, 2017
  11. Feb 17, 2017
• [SPARK-19500] [SQL] Fix off-by-one bug in BytesToBytesMap · 6e3abed8
      Davies Liu authored
      
      ## What changes were proposed in this pull request?
      
Radix sort requires half of the array to be free (as temporary space), so we use 0.5 as the scale factor to make sure that BytesToBytesMap will not hold more items than 1/2 of its capacity. It turned out this is not true: the current implementation of append() can leave one more item than the threshold (1/2 of capacity) in the array, which breaks the requirement of radix sort (failing the assert in 2.2, or failing to insert into InMemorySorter in 2.1).

This PR fixes the off-by-one bug in BytesToBytesMap.

It also fixes a bug, introduced by #15722, where the array would never grow if it failed to grow once (staying at its initial capacity).
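
A toy sketch of the invariant with hypothetical names: the map must stop accepting inserts strictly before it crosses half of its capacity, otherwise the radix sort that follows has no scratch space.

```
// Hypothetical, simplified insert guard illustrating the 1/2-capacity invariant.
final class HalfFullMap(initialCapacity: Int) {
  private var capacity = initialCapacity
  private var numKeys = 0

  /** True only while one more insert still keeps numKeys <= capacity / 2. */
  def canAppend: Boolean = numKeys + 1 <= capacity / 2

  def append(): Unit = {
    require(canAppend, "append would exceed half of capacity; grow or spill first")
    numKeys += 1
    // Grow as soon as the next insert would cross the threshold, so canAppend
    // stays true and the invariant is never broken by an off-by-one.
    if (!canAppend) capacity *= 2
  }
}
```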
      
      ## How was this patch tested?
      
      Added regression test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #16844 from davies/off_by_one.
      
      (cherry picked from commit 3d0c3af0)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
      6e3abed8
• [SPARK-19622][WEBUI] Fix a http error in a paged table when using a `Go` button to search. · 55958bcd
      Stan Zhai authored
      ## What changes were proposed in this pull request?
      
The search function of the paged table is not available because we don't strip the hash fragment from the request path.
      
![](https://issues.apache.org/jira/secure/attachment/12852996/screenshot-1.png)
      
      ## How was this patch tested?
      
      Tested manually with my browser.
      
      Author: Stan Zhai <zhaishidan@haizhi.com>
      
      Closes #16953 from stanzhai/fix-webui-paged-table.
      
      (cherry picked from commit 021062af)
Signed-off-by: Sean Owen <sowen@cloudera.com>
      55958bcd
  12. Feb 15, 2017
  13. Feb 13, 2017
• [SPARK-19520][STREAMING] Do not encrypt data written to the WAL. · 7fe3543f
      Marcelo Vanzin authored
      
      Spark's I/O encryption uses an ephemeral key for each driver instance.
      So driver B cannot decrypt data written by driver A since it doesn't
      have the correct key.
      
      The write ahead log is used for recovery, thus needs to be readable by
      a different driver. So it cannot be encrypted by Spark's I/O encryption
      code.
      
      The BlockManager APIs used by the WAL code to write the data automatically
encrypt data, so changes are needed so that callers can opt out of
      encryption.
      
      Aside from that, the "putBytes" API in the BlockManager does not do
encryption, so a separate situation arose where the WAL would write
      unencrypted data to the BM and, when those blocks were read, decryption
      would fail. So the WAL code needs to ask the BM to encrypt that data
      when encryption is enabled; this code is not optimal since it results
      in a (temporary) second copy of the data block in memory, but should be
      OK for now until a more performant solution is added. The non-encryption
      case should not be affected.
      
      Tested with new unit tests, and by running streaming apps that do
      recovery using the WAL data with I/O encryption turned on.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16862 from vanzin/SPARK-19520.
      
      (cherry picked from commit 0169360e)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      7fe3543f
• [SPARK-17714][CORE][TEST-MAVEN][TEST-HADOOP2.6] Avoid using... · 328b2298
      Shixiong Zhu authored
      [SPARK-17714][CORE][TEST-MAVEN][TEST-HADOOP2.6] Avoid using ExecutorClassLoader to load Netty generated classes
      
      ## What changes were proposed in this pull request?
      
Netty's `MessageToMessageEncoder` uses [Javassist](https://github.com/netty/netty/blob/91a0bdc17a8298437d6de08a8958d753799bd4a6/common/src/main/java/io/netty/util/internal/JavassistTypeParameterMatcherGenerator.java#L62) to generate a matcher class, and the implementation calls `Class.forName` to check if this class has already been generated. If `MessageEncoder` or `MessageDecoder` is created inside `ExecutorClassLoader.findClass`, this causes `ClassCircularityError`: loading the Netty-generated class calls `ExecutorClassLoader.findClass` to search for it, and `ExecutorClassLoader` tries to use RPC to load it, which in turn tries to load the not-yet-generated matcher class again. The JVM reports `ClassCircularityError` to prevent such infinite recursion.
      
      ##### Why it only happens in Maven builds
      
It's because Maven and SBT have different class loader trees. The Maven build sets a URLClassLoader as the current context class loader to run the tests, which exposes this issue. The class loader tree is as follows:
      
      ```
      bootstrap class loader ------ ... ----- REPL class loader ---- ExecutorClassLoader
      |
      |
URLClassLoader
      ```
      
The SBT build uses the bootstrap class loader directly and `ReplSuite.test("propagation of local properties")` is the first test in ReplSuite, which happens to load `io/netty/util/internal/__matchers__/org/apache/spark/network/protocol/MessageMatcher` into the bootstrap class loader (note: in the Maven build it's loaded into the URLClassLoader, so it cannot be found by ExecutorClassLoader). The issue can be reproduced in SBT as well; here are the reproduction steps:
      - Enable `hadoop.caller.context.enabled`.
      - Replace `Class.forName` with `Utils.classForName` in `object CallerContext`.
      - Ignore `ReplSuite.test("propagation of local properties")`.
      - Run `ReplSuite` using SBT.
      
      This PR just creates a singleton MessageEncoder and MessageDecoder and makes sure they are created before switching to ExecutorClassLoader. TransportContext will be created when creating RpcEnv and that happens before creating ExecutorClassLoader.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16859 from zsxwing/SPARK-17714.
      
      (cherry picked from commit 905fdf0c)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      328b2298
  14. Feb 09, 2017
  15. Feb 01, 2017
• [SPARK-19432][CORE] Fix an unexpected failure when connecting timeout · 7c23bd49
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
When a connection times out, `ask` may fail with a confusing message:
      
      ```
      17/02/01 23:15:19 INFO Worker: Connecting to master ...
      java.lang.IllegalArgumentException: requirement failed: TransportClient has not yet been set.
              at scala.Predef$.require(Predef.scala:224)
              at org.apache.spark.rpc.netty.RpcOutboxMessage.onTimeout(Outbox.scala:70)
              at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:232)
              at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:231)
              at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
              at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
              at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
      ```
      
      It's better to provide a meaningful message.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16773 from zsxwing/connect-timeout.
      
      (cherry picked from commit 8303e20c)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      7c23bd49
• [SPARK-19377][WEBUI][CORE] Killed tasks should have the status as KILLED · f9464641
      Devaraj K authored
      
      ## What changes were proposed in this pull request?
      
Copying of the killed status was missing when building the newTaskInfo object (which drops unnecessary details to reduce memory usage). This patch copies the killed status into the newTaskInfo object, so the Web UI displays KILLED instead of the wrong status.
      
      ## How was this patch tested?
      
      Current behaviour of displaying tasks in stage UI page,
      
      | Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      |143	|10	|0	|SUCCESS	|NODE_LOCAL	|6 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		| |0.0 B / 0	|TaskKilled (killed intentionally)|
      |156	|11	|0	|SUCCESS	|NODE_LOCAL	|5 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		| |0.0 B / 0	|TaskKilled (killed intentionally)|
      
      Web UI display after applying the patch,
      
      | Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      |143	|10	|0	|KILLED	|NODE_LOCAL	|6 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		|  | 0.0 B / 0	| TaskKilled (killed intentionally)|
      |156	|11	|0	|KILLED	|NODE_LOCAL	|5 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		|  |0.0 B / 0	| TaskKilled (killed intentionally)|
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #16725 from devaraj-kavali/SPARK-19377.
      
      (cherry picked from commit df4a27cc)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      f9464641
  16. Jan 26, 2017
• [SPARK-19220][UI] Make redirection to HTTPS apply to all URIs. (branch-2.1) · 59502bbc
      Marcelo Vanzin authored
      The redirect handler was installed only for the root of the server;
      any other context ended up being served directly through the HTTP
      port. Since every sub page (e.g. application UIs in the history
      server) is a separate servlet context, this meant that everything
      but the root was accessible via HTTP still.
      
      The change adds separate names to each connector, and binds contexts
      to specific connectors so that content is only served through the
      HTTPS connector when it's enabled. In that case, the only thing that
      binds to the HTTP connector is the redirect handler.
      
      Tested with new unit tests and by checking a live history server.
      
      (cherry picked from commit d3dcb63b)
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16711 from vanzin/SPARK-19220_2.1.
      59502bbc
  17. Jan 25, 2017
• [SPARK-14804][SPARK][GRAPHX] Fix checkpointing of VertexRDD/EdgeRDD · 0d7e3852
      Tathagata Das authored
      
      ## What changes were proposed in this pull request?
      
EdgeRDD/VertexRDD override checkpoint() and isCheckpointed() to forward these calls to the internal partitionsRDD, so when checkpoint() is called on them it is the partitionsRDD that actually gets checkpointed. However, since isCheckpointed() is also overridden to call partitionsRDD.isCheckpointed, EdgeRDD/VertexRDD.isCheckpointed returns true even though the RDD itself is not checkpointed.

This would be fine except that the RDD's internal compute logic depends on isCheckpointed(). For VertexRDD/EdgeRDD, since isCheckpointed is true, Spark tries to read checkpoint data of VertexRDD/EdgeRDD during compute even though they are not actually checkpointed. Through a chain of call forwarding, it reads the checkpoint data of the partitionsRDD and tries to cast it to the Vertex/EdgeRDD types, which leads to a ClassCastException.
      
The minimal fix that does not change any public behavior is to modify the RDD internals to not use public, overridable APIs for internal logic.
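
A minimal sketch of the pattern with hypothetical classes (not the actual GraphX code): the wrapper forwards the public isCheckpointed to the wrapped object, while the internal compute path consults a non-overridable flag that reflects whether the wrapper itself was checkpointed.

```
// Hypothetical stand-ins for RDD / VertexRDD to illustrate the shape of the fix.
class BaseData {
  private var checkpointedHere = false            // the truth about *this* object
  def doCheckpoint(): Unit = { checkpointedHere = true }
  def isCheckpointed: Boolean = checkpointedHere  // public, overridable

  // Internal logic must not rely on the overridable isCheckpointed.
  protected final def isCheckpointedInternally: Boolean = checkpointedHere
  def compute(): String =
    if (isCheckpointedInternally) "read checkpoint data" else "recompute"
}

class WrapperData(partitions: BaseData) extends BaseData {
  // Forward the public API to the wrapped object, as VertexRDD/EdgeRDD do...
  override def doCheckpoint(): Unit = partitions.doCheckpoint()
  override def isCheckpointed: Boolean = partitions.isCheckpointed
  // ...but compute() still checks isCheckpointedInternally, so it never tries to
  // read checkpoint data that actually belongs to the wrapped object.
}
```
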
      ## How was this patch tested?
      
      New unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #15396 from tdas/SPARK-14804.
      
      (cherry picked from commit 47d5d0dd)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      0d7e3852
  18. Jan 23, 2017
• [SPARK-19306][CORE] Fix inconsistent state in DiskBlockObject when exception occurred · ed5d1e72
      jerryshao authored
      
      ## What changes were proposed in this pull request?
      
In `DiskBlockObjectWriter`, when an error happens during writing, `revertPartialWritesAndClose` is called. If that method itself fails (for example because the disk is full), it throws an exception without resetting the writer's state, and the revert is skipped. This change fixes that, to give the user a chance to recover from such an issue.
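
A hedged sketch of the pattern (not the actual `DiskBlockObjectWriter` code): reset the writer's state in a `finally` block so it stays consistent even when the revert itself throws.

```
import java.io.{File, FileOutputStream, OutputStream}

// Hypothetical writer illustrating "reset state even if the revert fails".
final class PartialWriter(file: File) {
  private var out: OutputStream = new FileOutputStream(file)
  private var bytesWritten = 0L

  def write(bytes: Array[Byte]): Unit = { out.write(bytes); bytesWritten += bytes.length }

  def revertPartialWritesAndClose(): Unit = {
    try {
      out.close()        // may itself fail, e.g. when the disk is full
      // ... truncate the file back to the last committed position here ...
    } finally {
      // Reset unconditionally so a retry or recovery path sees a clean writer.
      out = null
      bytesWritten = 0L
    }
  }
}
```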
      
      ## How was this patch tested?
      
      Existing test.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #16657 from jerryshao/SPARK-19306.
      
      (cherry picked from commit e4974721)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      ed5d1e72
  19. Jan 16, 2017
• [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parquet · 4f3ce062
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
We have a config `spark.sql.files.ignoreCorruptFiles` that can be used to ignore corrupt files when reading files in SQL. Currently the `ignoreCorruptFiles` config has two issues and doesn't work for Parquet:

1. We only ignore corrupt files in `FileScanRDD`. In fact, we start reading those files as early as inferring the data schema from them. For corrupt files, we can't read the schema and the program fails. A related issue was reported at http://apache-spark-developers-list.1001551.n3.nabble.com/Skip-Corrupted-Parquet-blocks-footer-tc20418.html
2. In `FileScanRDD`, we assume that we only begin reading the files when starting to consume the iterator. However, the files may be read before that, and in that case the `ignoreCorruptFiles` config doesn't work either.
      
      This patch targets Parquet datasource. If this direction is ok, we can address the same issue for other datasources like Orc.
      
      Two main changes in this patch:
      
1. Replace `ParquetFileReader.readAllFootersInParallel` by implementing the logic to read footers in a multi-threaded manner

    We can't ignore corrupt files if we use `ParquetFileReader.readAllFootersInParallel`, so this patch implements similar logic in `readParquetFootersInParallel`.
      
      2. In `FileScanRDD`, we need to ignore corrupt file too when we call `readFunction` to return iterator.
      
      One thing to notice is:
      
We read the schema from the Parquet file's footer. The footer-reading method `ParquetFileReader.readFooter` throws `RuntimeException`, instead of `IOException`, if it can't successfully read the footer (see https://github.com/apache/parquet-mr/blob/df9d8e415436292ae33e1ca0b8da256640de9710/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L470), so this patch catches `RuntimeException`. One concern is that this might also shadow runtime exceptions unrelated to reading corrupt files.
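
A hedged sketch of multi-threaded footer reading with per-file error handling; `readFooter` is passed in as a function so the snippet stays independent of the Parquet API.

```
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object FooterReading {
  // Read footers in parallel; with ignoreCorruptFiles set, a file whose footer
  // cannot be parsed (RuntimeException) is skipped instead of failing the job.
  def readFootersInParallel[F](paths: Seq[String],
                               readFooter: String => F,
                               ignoreCorruptFiles: Boolean): Seq[F] = {
    val pool = Executors.newFixedThreadPool(8)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val futures = paths.map { path =>
        Future {
          try Some(readFooter(path))
          catch { case _: RuntimeException if ignoreCorruptFiles => None }
        }
      }
      Await.result(Future.sequence(futures), 10.minutes).flatten
    } finally pool.shutdown()
  }
}
```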
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16474 from viirya/fix-ignorecorrupted-parquet-files.
      
      (cherry picked from commit 61e48f52)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      4f3ce062
  20. Jan 06, 2017
• [SPARK-19033][CORE] Add admin acls for history server · 4ca17888
      jerryshao authored
      
      ## What changes were proposed in this pull request?
      
Currently the HistoryServer's ACLs are derived from the application event log, which means newly changed ACLs cannot be applied to old data: a newly added admin cannot access old application history UIs, and only new applications pick up the change.

So this proposes adding admin ACLs for the history server: any configured user/group gets view access to all applications, while the view ACLs derived from the application's run time still take effect.
      
      ## How was this patch tested?
      
      Unit test added.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #16470 from jerryshao/SPARK-19033.
      
      (cherry picked from commit 4a4c3dc9)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
      4ca17888
  21. Dec 24, 2016
  22. Dec 23, 2016
• [SPARK-18991][CORE] Change ContextCleaner.referenceBuffer to use... · 5bafdc45
      Shixiong Zhu authored
      [SPARK-18991][CORE] Change ContextCleaner.referenceBuffer to use ConcurrentHashMap to make it faster
      
      ## What changes were proposed in this pull request?
      
The time complexity of ConcurrentHashMap's `remove` is O(1). Changing ContextCleaner.referenceBuffer's type from `ConcurrentLinkedQueue` to `ConcurrentHashMap` makes the removal much faster.
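
A small, generic illustration of why: removing an arbitrary element from a `ConcurrentLinkedQueue` is a linear scan, while a `ConcurrentHashMap`-backed set removes by hash lookup.

```
import java.util.concurrent.{ConcurrentHashMap, ConcurrentLinkedQueue}

object RemovalCost {
  def main(args: Array[String]): Unit = {
    val n = 100000

    // Queue: remove(elem) walks the queue until it finds the element -> O(n) per call.
    val queue = new ConcurrentLinkedQueue[String]()
    (0 until n).foreach(i => queue.add(i.toString))
    queue.remove((n - 1).toString)

    // ConcurrentHashMap-backed set: remove(elem) is a hash lookup -> O(1) per call.
    val set = ConcurrentHashMap.newKeySet[String]()
    (0 until n).foreach(i => set.add(i.toString))
    set.remove((n - 1).toString)
  }
}
```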
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16390 from zsxwing/SPARK-18991.
      
      (cherry picked from commit a848f0ba)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      5bafdc45
  23. Dec 22, 2016
• [SPARK-18972][CORE] Fix the netty thread names for RPC · 1857acc7
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      Right now the name of threads created by Netty for Spark RPC are `shuffle-client-**` and `shuffle-server-**`. It's pretty confusing.
      
This PR just uses the module name in TransportConf to set the thread name. In addition, it also includes the following minor fixes (see the sketch after this list):

- TransportChannelHandler.channelActive and channelInactive should call the corresponding super methods.
- Make ShuffleBlockFetcherIterator throw NoSuchElementException when it has no more elements. Otherwise, if the caller calls `next` without `hasNext`, it just hangs.
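
A tiny, generic sketch of the second fix's contract, independent of Spark's fetcher: `next` past the end must fail fast with `NoSuchElementException` rather than block forever.

```
import java.util.NoSuchElementException
import scala.collection.mutable

// Hypothetical fetch-result iterator; fetched results sit in a local buffer here.
final class FetchIterator[T](buffered: mutable.Queue[T], expected: Int) extends Iterator[T] {
  private var returned = 0

  override def hasNext: Boolean = returned < expected

  override def next(): T = {
    // Fail fast instead of waiting for results that will never arrive.
    if (!hasNext) throw new NoSuchElementException("no more fetched blocks")
    returned += 1
    buffered.dequeue()
  }
}
```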
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16380 from zsxwing/SPARK-18972.
      
      (cherry picked from commit f252cb5d)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      1857acc7
  24. Dec 20, 2016
• [SPARK-18761][CORE] Introduce "task reaper" to oversee task killing in executors · 2971ae56
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      Spark's current task cancellation / task killing mechanism is "best effort" because some tasks may not be interruptible or may not respond to their "killed" flags being set. If a significant fraction of a cluster's task slots are occupied by tasks that have been marked as killed but remain running then this can lead to a situation where new jobs and tasks are starved of resources that are being used by these zombie tasks.
      
      This patch aims to address this problem by adding a "task reaper" mechanism to executors. At a high-level, task killing now launches a new thread which attempts to kill the task and then watches the task and periodically checks whether it has been killed. The TaskReaper will periodically re-attempt to call `TaskRunner.kill()` and will log warnings if the task keeps running. I modified TaskRunner to rename its thread at the start of the task, allowing TaskReaper to take a thread dump and filter it in order to log stacktraces from the exact task thread that we are waiting to finish. If the task has not stopped after a configurable timeout then the TaskReaper will throw an exception to trigger executor JVM death, thereby forcibly freeing any resources consumed by the zombie tasks.
      
      This feature is flagged off by default and is controlled by four new configurations under the `spark.task.reaper.*` namespace. See the updated `configuration.md` doc for details.
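
A heavily simplified, hypothetical sketch of the watchdog idea (not the actual TaskReaper): a daemon thread that re-signals the kill, logs while the task thread stays alive, and escalates after a timeout.

```
object ReaperSketch {
  /** Watch a task thread after a kill request; names and timeouts are illustrative. */
  def reap(taskThread: Thread, requestKill: () => Unit,
           pollMs: Long = 1000L, timeoutMs: Long = 60000L): Thread = {
    val reaper = new Thread(new Runnable {
      override def run(): Unit = {
        val deadline = System.currentTimeMillis() + timeoutMs
        while (taskThread.isAlive && System.currentTimeMillis() < deadline) {
          requestKill()                                 // periodically re-attempt the kill
          System.err.println(s"task thread ${taskThread.getName} is still running")
          Thread.sleep(pollMs)
        }
        if (taskThread.isAlive) {
          // The real feature would kill the executor JVM here to free the slot.
          System.err.println("task did not stop within the timeout; escalating")
        }
      }
    }, "task-reaper")
    reaper.setDaemon(true)
    reaper.start()
    reaper
  }
}
```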
      
      ## How was this patch tested?
      
      Tested via a new test case in `JobCancellationSuite`, plus manual testing.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #16189 from JoshRosen/cancellation.
      2971ae56
  25. Dec 19, 2016
• [SPARK-18928] Check TaskContext.isInterrupted() in FileScanRDD, JDBCRDD & UnsafeSorter · f07e989c
      Josh Rosen authored
      
      ## What changes were proposed in this pull request?
      
      In order to respond to task cancellation, Spark tasks must periodically check `TaskContext.isInterrupted()`, but this check is missing on a few critical read paths used in Spark SQL, including `FileScanRDD`, `JDBCRDD`, and UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to continue running and become zombies (as also described in #16189).
      
      This patch aims to fix this problem by adding `TaskContext.isInterrupted()` checks to these paths. Note that I could have used `InterruptibleIterator` to simply wrap a bunch of iterators but in some cases this would have an adverse performance penalty or might not be effective due to certain special uses of Iterators in Spark SQL. Instead, I inlined `InterruptibleIterator`-style logic into existing iterator subclasses.
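
A minimal sketch of InterruptibleIterator-style logic, with a plain flag standing in for `TaskContext.isInterrupted()`:

```
import java.util.concurrent.atomic.AtomicBoolean

// Wraps an iterator and aborts promptly once the task has been flagged as interrupted.
final class CancellableIterator[T](underlying: Iterator[T],
                                   interrupted: AtomicBoolean) extends Iterator[T] {
  private def checkInterrupted(): Unit =
    if (interrupted.get()) throw new InterruptedException("task was cancelled")

  override def hasNext: Boolean = { checkInterrupted(); underlying.hasNext }
  override def next(): T = { checkInterrupted(); underlying.next() }
}
```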
      
      ## How was this patch tested?
      
      Tested manually in `spark-shell` with two different reproductions of non-cancellable tasks, one involving scans of huge files and another involving sort-merge joins that spill to disk. Both causes of zombie tasks are fixed by the changes added here.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #16340 from JoshRosen/sql-task-interruption.
      
      (cherry picked from commit 5857b9ac)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      f07e989c
  26. Dec 18, 2016
• [SPARK-18827][CORE] Fix cannot read broadcast on disk · a5da8db8
      Yuming Wang authored
      ## What changes were proposed in this pull request?
`NoSuchElementException` has been thrown since https://github.com/apache/spark/pull/15056 if a broadcast cannot be cached in memory, because that change does not cover the `!unrolled.hasNext` case in the `next()` function.

This change covers `!unrolled.hasNext` and checks `hasNext` before calling `next` in `blockManager.getLocalValues` to make it more robust.

With this pull request we can cache and read a broadcast even when it cannot fit in memory.
      
      Exception log:
      ```
      16/12/10 10:10:04 INFO UnifiedMemoryManager: Will not store broadcast_131 as the required space (1048576 bytes) exceeds our memory limit (122764 bytes)
      16/12/10 10:10:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block broadcast_131 in memory.
      16/12/10 10:10:04 WARN MemoryStore: Not enough space to cache broadcast_131 in memory! (computed 384.0 B so far)
      16/12/10 10:10:04 INFO MemoryStore: Memory use = 95.6 KB (blocks) + 0.0 B (scratch space shared across 0 tasks(s)) = 95.6 KB. Storage limit = 119.9 KB.
      16/12/10 10:10:04 ERROR Utils: Exception encountered
      java.util.NoSuchElementException
      	at org.apache.spark.util.collection.PrimitiveVector$$anon$1.next(PrimitiveVector.scala:58)
      	at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:700)
      	at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
      	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$2.apply(TorrentBroadcast.scala:210)
      	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$2.apply(TorrentBroadcast.scala:210)
      	at scala.Option.map(Option.scala:146)
      	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210)
      	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
      	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
      	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
      	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
      	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
      	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
      	at org.apache.spark.scheduler.Task.run(Task.scala:108)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      16/12/10 10:10:04 ERROR Executor: Exception in task 1.0 in stage 86.0 (TID 134423)
      java.io.IOException: java.util.NoSuchElementException
      	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1276)
      	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
      	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
      	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
      	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
      	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
      	at org.apache.spark.scheduler.Task.run(Task.scala:108)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.util.NoSuchElementException
      	at org.apache.spark.util.collection.PrimitiveVector$$anon$1.next(PrimitiveVector.scala:58)
      	at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:700)
      	at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
      	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$2.apply(TorrentBroadcast.scala:210)
      	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$2.apply(TorrentBroadcast.scala:210)
      	at scala.Option.map(Option.scala:146)
      	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210)
      	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
      	... 12 more
      ```
      
      ## How was this patch tested?
      
      Add unit test
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #16252 from wangyum/SPARK-18827.
      
      (cherry picked from commit 1e5c51f3)
Signed-off-by: Sean Owen <sowen@cloudera.com>
      a5da8db8
  27. Dec 13, 2016
  28. Dec 09, 2016
• [SPARK-17822][R] Make JVMObjectTracker a member variable of RBackend · 0c6415ae
      Xiangrui Meng authored
      
      ## What changes were proposed in this pull request?
      
* This PR changes `JVMObjectTracker` from an `object` to a `class` and associates an instance with each RBackend, so we can manage the lifecycle of JVM objects when there are multiple `RBackend` sessions; `RBackend.close` clears the object tracker explicitly (see the sketch after this list).
* I assume that `SQLUtils` and `RRunner` do not need to track JVM instances, which could be wrong.
* Small refactor of `SerDe.sqlSerDe` to increase readability.
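
A hypothetical sketch of the object-to-class move: a per-backend tracker whose entries live and die with the backend that owns it, instead of a process-wide singleton.

```
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

// Hypothetical per-session tracker; the real JVMObjectTracker differs in detail.
final class ObjectTracker {
  private val objects = new ConcurrentHashMap[String, AnyRef]()
  private val counter = new AtomicLong(0L)

  def put(obj: AnyRef): String = {
    val id = counter.getAndIncrement().toString
    objects.put(id, obj)
    id
  }
  def get(id: String): Option[AnyRef] = Option(objects.get(id))
  def clear(): Unit = objects.clear()
}

// Each backend session owns its own tracker and clears it on close.
final class BackendSketch {
  val tracker = new ObjectTracker
  def close(): Unit = tracker.clear()
}
```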
      
      ## How was this patch tested?
      
      * Added unit tests for `JVMObjectTracker`.
      * Wait for Jenkins to run full tests.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #16154 from mengxr/SPARK-17822.
      
      (cherry picked from commit fd48d80a)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
      0c6415ae
• [MINOR][CORE][SQL][DOCS] Typo fixes · b226f10e
      Jacek Laskowski authored
      
      ## What changes were proposed in this pull request?
      
      Typo fixes
      
      ## How was this patch tested?
      
      Local build. Awaiting the official build.
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #16144 from jaceklaskowski/typo-fixes.
      
      (cherry picked from commit b162cc0c)
Signed-off-by: Sean Owen <sowen@cloudera.com>
      b226f10e
  29. Dec 08, 2016
  30. Dec 07, 2016
• [SPARK-18762][WEBUI] Web UI should be http:4040 instead of https:4040 · 76e1f165
      sarutak authored
      ## What changes were proposed in this pull request?
      
      When SSL is enabled, the Spark shell shows:
```
Spark context Web UI available at https://192.168.99.1:4040
```
      This is wrong because 4040 is http, not https. It redirects to the https port.
      More importantly, this introduces several broken links in the UI. For example, in the master UI, the worker link is https:8081 instead of http:8081 or https:8481.
      
      CC: mengxr liancheng
      
I manually tested by accessing MasterPage, WorkerPage and HistoryServer with SSL enabled.
      
      Author: sarutak <sarutak@oss.nttdata.co.jp>
      
      Closes #16190 from sarutak/SPARK-18761.
      
      (cherry picked from commit bb94f61a)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      76e1f165
• [SPARK-18764][CORE] Add a warning log when skipping a corrupted file · acb6ac5d
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
It's better to add a warning log when skipping a corrupted file, so that we can finish the job first, then find the corrupted files in the log and fix them.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16192 from zsxwing/SPARK-18764.
      
      (cherry picked from commit dbf3e298)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      acb6ac5d
• [SPARK-18208][SHUFFLE] Executor OOM due to a growing LongArray in BytesToBytesMap · 4432a2a8
      Jie Xiong authored
      
      ## What changes were proposed in this pull request?
      
      BytesToBytesMap currently does not release the in-memory storage (the longArray variable) after it spills to disk. This is typically not a problem during aggregation because the longArray should be much smaller than the pages, and because we grow the longArray at a conservative rate.
      
However, this can lead to an OOM when an already running task is allocated more than its fair share, which can happen because of a scheduling delay. In that case the longArray can grow beyond the task's fair share of memory. This becomes problematic when the task spills and the longArray is not freed: subsequent memory allocation requests are denied by the memory manager, resulting in an OOM.

This PR fixes the issue by freeing the longArray when the BytesToBytesMap spills.
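
A toy sketch of the shape of the fix (hypothetical names, no real memory manager): release the lookup array together with the spilled pages so its memory can be re-acquired later.

```
// Hypothetical map that drops its lookup array on spill instead of keeping it resident.
final class SpillableMap(slots: Int) {
  private var longArray: Array[Long] = new Array[Long](slots)

  /** Returns the number of bytes released by dropping the array. */
  def spill(): Long = {
    // ... write the data pages to disk here ...
    val released = if (longArray != null) longArray.length.toLong * 8 else 0L
    longArray = null              // the fix: free the array when the map spills
    released
  }

  /** Re-acquire the array lazily when the map is used again after a spill. */
  def reset(): Unit = if (longArray == null) longArray = new Array[Long](slots)
}
```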
      
      ## How was this patch tested?
      
Existing tests, plus testing on real-world workloads.
      
      Author: Jie Xiong <jiexiong@fb.com>
      Author: jiexiong <jiexiong@gmail.com>
      
      Closes #15722 from jiexiong/jie_oom_fix.
      
      (cherry picked from commit c496d03b)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      4432a2a8
• [SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils · 51754d6d
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
Fix reservoir sampling bias for small k. An off-by-one error meant the probability of replacement was slightly too high: k/(l-1) after the l-th element instead of k/l, which matters for small k.
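
A compact sketch of reservoir sampling with the corrected replacement probability k/l for the l-th element, in plain Scala rather than Spark's `SamplingUtils`:

```
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object Reservoir {
  /** Uniformly sample k items from an iterator of unknown length. */
  def sample[T](items: Iterator[T], k: Int, rng: Random = new Random()): Seq[T] = {
    val reservoir = ArrayBuffer.empty[T]
    var l = 0L                                   // number of items seen so far
    items.foreach { item =>
      l += 1
      if (reservoir.size < k) {
        reservoir += item
      } else {
        // Replace an existing entry with probability k / l (not k / (l - 1)),
        // which is exactly the off-by-one this change corrects.
        val j = (rng.nextDouble() * l).toLong
        if (j < k) reservoir(j.toInt) = item
      }
    }
    reservoir.toSeq
  }
}
```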
      
      ## How was this patch tested?
      
      Existing test plus new test case.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16129 from srowen/SPARK-18678.
      
      (cherry picked from commit 79f5f281)
Signed-off-by: Sean Owen <sowen@cloudera.com>
      51754d6d