  1. Feb 21, 2017
    • Marcelo Vanzin's avatar
      [SPARK-19652][UI] Do auth checks for REST API access. · 17d83e1e
      Marcelo Vanzin authored
      The REST API has a security filter that performs auth checks
      based on the UI root's security manager. That works fine when
      the UI root is the app's UI, but not when it's the history server.
      
      In the SHS case, all users would be allowed to see all applications
      through the REST API, even if the UI itself wouldn't be available
      to them.
      
      This change adds auth checks for each app access through the API
      too, so that only authorized users can see the app's data.
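
       As a rough illustration of the kind of per-app check this adds (the helper and its
       signature below are hypothetical, not the actual filter code):

       ```scala
       // Hypothetical sketch: before serving /api/v1/applications/<appId>/..., compare the
       // remote user against the application's owner and view ACLs.
       def canViewApp(remoteUser: String, appOwner: String, viewAcls: Set[String]): Boolean =
         remoteUser == appOwner || viewAcls.contains(remoteUser) || viewAcls.contains("*")
       ```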
      
      The change also modifies the existing security filter to use
      `HttpServletRequest.getRemoteUser()`, which is used in other
       places. That is not necessarily the same as the principal's
       name; for example, when using Hadoop's SPNEGO auth filter,
       the remote user has the realm information stripped, which then
       matches the user name registered as the owner of the application.
      
      I also renamed the UIRootFromServletContext trait to a more generic
      name since I'm using it to store more context information now.
      
      Tested manually with an authentication filter enabled.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16978 from vanzin/SPARK-19652.
      17d83e1e
  2. Feb 20, 2017
    • Sean Owen's avatar
      [SPARK-19646][CORE][STREAMING] binaryRecords replicates records in scala API · d0ecca60
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
       Use `BytesWritable.copyBytes`, not `getBytes`, because `getBytes` returns the underlying array, which may be reused when repeated reads don't need a different size, as is the case with the binaryRecords APIs.
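
       A minimal sketch of the difference (assuming `hadoop-common` on the classpath):

       ```scala
       import org.apache.hadoop.io.BytesWritable

       val bw = new BytesWritable()
       bw.set(Array[Byte](1, 2, 3), 0, 3)

       val alias = bw.getBytes     // backing array: may be longer than 3 bytes and is reused on the next read
       val copy  = bw.copyBytes()  // exactly 3 bytes, safe to keep after the writable is reused

       bw.set(Array[Byte](7, 8, 9), 0, 3)
       // `alias` now reflects the new record, while `copy` still holds the original bytes
       ```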
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16974 from srowen/SPARK-19646.
      d0ecca60
  3. Feb 19, 2017
    • Sean Owen's avatar
      [SPARK-19534][TESTS] Convert Java tests to use lambdas, Java 8 features · 1487c9af
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Convert tests to use Java 8 lambdas, and modest related fixes to surrounding code.
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16964 from srowen/SPARK-19534.
      1487c9af
    • jinxing's avatar
      [SPARK-19450] Replace askWithRetry with askSync. · ba8912e5
      jinxing authored
      ## What changes were proposed in this pull request?
      
      `askSync` is already added in `RpcEndpointRef` (see SPARK-19347 and https://github.com/apache/spark/pull/16690#issuecomment-276850068) and `askWithRetry` is marked as deprecated.
       As mentioned in SPARK-18113 (https://github.com/apache/spark/pull/16503#event-927953218):
      
      >askWithRetry is basically an unneeded API, and a leftover from the akka days that doesn't make sense anymore. It's prone to cause deadlocks (exactly because it's blocking), it imposes restrictions on the caller (e.g. idempotency) and other things that people generally don't pay that much attention to when using it.
      
       Since `askWithRetry` is only used inside Spark and not in user logic, it makes sense to replace all of its uses with `askSync`.
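
       A hedged before/after sketch of the replacement (the endpoint reference and message below
       are illustrative, not a specific call site):

       ```scala
       // before: blocking ask with retries, which requires the receiving endpoint to be idempotent
       // val ok = endpointRef.askWithRetry[Boolean](SomeMessage)
       // after: a single blocking ask, no retries
       // val ok = endpointRef.askSync[Boolean](SomeMessage)
       ```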
      
      ## How was this patch tested?
       This PR doesn't change code logic; the existing unit tests cover it.
      
      Author: jinxing <jinxing@meituan.com>
      
      Closes #16790 from jinxing64/SPARK-19450.
      ba8912e5
  4. Feb 16, 2017
    • Sean Owen's avatar
      [SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support · 0e240549
      Sean Owen authored
       - Move external/java8-tests tests into core, streaming, and sql, and remove the module
      - Remove MaxPermGen and related options
      - Fix some reflection / TODOs around Java 8+ methods
      - Update doc references to 1.7/1.8 differences
      - Remove Java 7/8 related build profiles
      - Update some plugins for better Java 8 compatibility
      - Fix a few Java-related warnings
      
      For the future:
      
      - Update Java 8 examples to fully use Java 8
      - Update Java tests to use lambdas for simplicity
      - Update Java internal implementations to use lambdas
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16871 from srowen/SPARK-19493.
      0e240549
  5. Feb 13, 2017
    • Marcelo Vanzin's avatar
      [SPARK-19520][STREAMING] Do not encrypt data written to the WAL. · 0169360e
      Marcelo Vanzin authored
      Spark's I/O encryption uses an ephemeral key for each driver instance.
      So driver B cannot decrypt data written by driver A since it doesn't
      have the correct key.
      
      The write ahead log is used for recovery, thus needs to be readable by
      a different driver. So it cannot be encrypted by Spark's I/O encryption
      code.
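
       A small illustration of the ephemeral-key issue (plain JCE, not Spark's actual I/O
       encryption code):

       ```scala
       import javax.crypto.KeyGenerator

       // Each driver instance generates its own session key, so a restarted driver
       // cannot decrypt ciphertext produced with the previous driver's key.
       val driverAKey = KeyGenerator.getInstance("AES").generateKey()
       val driverBKey = KeyGenerator.getInstance("AES").generateKey()
       assert(!java.util.Arrays.equals(driverAKey.getEncoded, driverBKey.getEncoded))
       ```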
      
      The BlockManager APIs used by the WAL code to write the data automatically
       encrypt data, so changes are needed so that callers can opt out of
      encryption.
      
      Aside from that, the "putBytes" API in the BlockManager does not do
       encryption, so a separate situation arose where the WAL would write
      unencrypted data to the BM and, when those blocks were read, decryption
      would fail. So the WAL code needs to ask the BM to encrypt that data
      when encryption is enabled; this code is not optimal since it results
      in a (temporary) second copy of the data block in memory, but should be
      OK for now until a more performant solution is added. The non-encryption
      case should not be affected.
      
      Tested with new unit tests, and by running streaming apps that do
      recovery using the WAL data with I/O encryption turned on.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16862 from vanzin/SPARK-19520.
      0169360e
  6. Feb 01, 2017
    • jinxing's avatar
      [SPARK-19347] ReceiverSupervisorImpl can add block to ReceiverTracker multiple... · c5fcb7f6
      jinxing authored
      [SPARK-19347] ReceiverSupervisorImpl can add block to ReceiverTracker multiple times because of askWithRetry.
      
      ## What changes were proposed in this pull request?
      
       `ReceiverSupervisorImpl` on the executor side reports block metadata back to `ReceiverTracker` on the driver side. In the current code, `askWithRetry` is used. However, for `AddBlock`, `ReceiverTracker` is not idempotent, which may result in messages being processed multiple times.
      
      **To reproduce**:
      
       1. Check if it is the first time receiving `AddBlock` in `ReceiverTracker`; if so, sleep long enough (say 200 seconds) so that the first RPC call times out in `askWithRetry` and `AddBlock` is resent.
       2. Rebuild Spark and run the following job:
       ```scala
         import org.apache.spark.SparkConf
         import org.apache.spark.streaming.{Seconds, StreamingContext}

         // masterUrl is the cluster master URL to submit against
         def streamProcessing(masterUrl: String): Unit = {
           val conf = new SparkConf()
             .setAppName("StreamingTest")
             .setMaster(masterUrl)
           val ssc = new StreamingContext(conf, Seconds(200))
           val stream = ssc.socketTextStream("localhost", 1234)
           stream.print()
           ssc.start()
           ssc.awaitTermination()
         }
       ```
      **To fix**:
      
       It makes sense to provide a blocking version of `ask` in RpcEndpointRef, as mentioned in SPARK-18113 (https://github.com/apache/spark/pull/16503#event-927953218), because the Netty RPC layer will not drop messages. `askWithRetry` is a leftover from the akka days. It imposes restrictions on the caller (e.g. idempotency) and other things that people generally don't pay that much attention to when using it.
      
      ## How was this patch tested?
       Tested manually. The scenario described above doesn't happen with this patch.
      
      Author: jinxing <jinxing@meituan.com>
      
      Closes #16690 from jinxing64/SPARK-19347.
      c5fcb7f6
    • hyukjinkwon's avatar
      [SPARK-19402][DOCS] Support LaTex inline formula correctly and fix warnings in... · f1a1f260
      hyukjinkwon authored
      [SPARK-19402][DOCS] Support LaTex inline formula correctly and fix warnings in Scala/Java APIs generation
      
      ## What changes were proposed in this pull request?
      
      This PR proposes three things as below:
      
      - Support LaTex inline-formula, `\( ... \)` in Scala API documentation
        It seems currently,
      
        ```
        \( ... \)
        ```
      
        are rendered as they are, for example,
      
        <img width="345" alt="2017-01-30 10 01 13" src="https://cloud.githubusercontent.com/assets/6477701/22423960/ab37d54a-e737-11e6-9196-4f6229c0189c.png">
      
         It seems more backslashes were mistakenly added.
      
       - Fix warnings in Scaladoc/Javadoc generation
         This PR fixes two types of warnings, as below:
      
        ```
        [warn] .../spark/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala:335: Could not find any member to link for "UnsupportedOperationException".
        [warn]   /**
        [warn]   ^
        ```
      
        ```
        [warn] .../spark/sql/core/src/main/scala/org/apache/spark/sql/internal/VariableSubstitution.scala:24: Variable var undefined in comment for class VariableSubstitution in class VariableSubstitution
        [warn]  * `${var}`, `${system:var}` and `${env:var}`.
        [warn]      ^
        ```
      
      - Fix Javadoc8 break
        ```
        [error] .../spark/mllib/target/java/org/apache/spark/ml/PredictionModel.java:7: error: reference not found
        [error]  *                       E.g., {link VectorUDT} for vector features.
        [error]                                       ^
        [error] .../spark/mllib/target/java/org/apache/spark/ml/PredictorParams.java:12: error: reference not found
        [error]    *                          E.g., {link VectorUDT} for vector features.
        [error]                                            ^
        [error] .../spark/mllib/target/java/org/apache/spark/ml/Predictor.java:10: error: reference not found
        [error]  *                       E.g., {link VectorUDT} for vector features.
        [error]                                       ^
        [error] .../spark/sql/hive/target/java/org/apache/spark/sql/hive/HiveAnalysis.java:5: error: reference not found
        [error]  * Note that, this rule must be run after {link PreprocessTableInsertion}.
        [error]                                                  ^
        ```
      
      ## How was this patch tested?
      
       Manually via `sbt unidoc` and `jekyll build`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16741 from HyukjinKwon/warn-and-break.
      f1a1f260
  7. Jan 24, 2017
  8. Jan 18, 2017
    • uncleGen's avatar
      [SPARK-19182][DSTREAM] Optimize the lock in StreamingJobProgressListener to... · a81e336f
      uncleGen authored
      [SPARK-19182][DSTREAM] Optimize the lock in StreamingJobProgressListener to not block UI when generating Streaming jobs
      
      ## What changes were proposed in this pull request?
      
       When DStreamGraph is generating a job, it holds a lock and blocks other APIs. Because StreamingJobProgressListener (numInactiveReceivers, streamName(streamId: Int), streamIds) needs to call DStreamGraph's methods to access some information, the UI may hang if generating a job is very slow (e.g., talking to a slow Kafka cluster to fetch metadata).
       It's better to optimize the locks in DStreamGraph and StreamingJobProgressListener so that the UI is not blocked by job generation.
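
       A hedged sketch of the locking pattern (names are illustrative, not the actual listener code):

       ```scala
       // Keep a small snapshot of the fields the UI needs under its own cheap lock, so the
       // listener's getters never block on the graph lock held during job generation.
       class StreamSnapshot {
         private val lock = new Object
         private var inputStreamIds: Seq[Int] = Nil

         def update(ids: Seq[Int]): Unit = lock.synchronized { inputStreamIds = ids } // called by the job generator
         def ids: Seq[Int] = lock.synchronized { inputStreamIds }                     // called from the UI, cheap and non-blocking
       }
       ```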
      
      ## How was this patch tested?
       Existing unit tests.
      
      cc zsxwing
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #16601 from uncleGen/SPARK-19182.
      a81e336f
  9. Jan 16, 2017
  10. Jan 10, 2017
    • hyukjinkwon's avatar
      [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all identified tests failed due... · 4e27578f
      hyukjinkwon authored
      [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all identified tests failed due to path and resource-not-closed problems on Windows
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix all the test failures identified by testing with AppVeyor.
      
      **Scala - aborted tests**
      
      ```
      WindowQuerySuite:
        Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.execution.WindowQuerySuite *** ABORTED *** (156 milliseconds)
         org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: C:projectssparksqlhive   argetscala-2.11   est-classesdatafilespart_tiny.txt;
      
      OrcSourceSuite:
       Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.orc.OrcSourceSuite *** ABORTED *** (62 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      ParquetMetastoreSuite:
       Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.ParquetMetastoreSuite *** ABORTED *** (4 seconds, 703 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      ParquetSourceSuite:
       Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.ParquetSourceSuite *** ABORTED *** (3 seconds, 907 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark  arget mpspark-581a6575-454f-4f21-a516-a07f95266143;
      
      KafkaRDDSuite:
       Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.KafkaRDDSuite *** ABORTED *** (5 seconds, 212 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-4722304d-213e-4296-b556-951df1a46807
      
      DirectKafkaStreamSuite:
       Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite *** ABORTED *** (7 seconds, 127 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-d0d3eba7-4215-4e10-b40e-bb797e89338e
         at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
      
      ReliableKafkaStreamSuite
       Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.ReliableKafkaStreamSuite *** ABORTED *** (5 seconds, 498 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-d33e45a0-287e-4bed-acae-ca809a89d888
      
      KafkaStreamSuite:
       Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.KafkaStreamSuite *** ABORTED *** (2 seconds, 892 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-59c9d169-5a56-4519-9ef0-cefdbd3f2e6c
      
      KafkaClusterSuite:
       Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.KafkaClusterSuite *** ABORTED *** (1 second, 690 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-3ef402b0-8689-4a60-85ae-e41e274f179d
      
      DirectKafkaStreamSuite:
       Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite *** ABORTED *** (59 seconds, 626 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-426107da-68cf-4d94-b0d6-1f428f1c53f6
      
      KafkaRDDSuite:
      Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka010.KafkaRDDSuite *** ABORTED *** (2 minutes, 6 seconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-b9ce7929-5dae-46ab-a0c4-9ef6f58fbc2
      ```
      
      **Java - failed tests**
      
      ```
      Test org.apache.spark.streaming.kafka.JavaKafkaRDDSuite.testKafkaRDD failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-1cee32f4-4390-4321-82c9-e8616b3f0fb0, took 9.61 sec
      
      Test org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-f42695dd-242e-4b07-847c-f299b8e4676e, took 11.797 sec
      
      Test org.apache.spark.streaming.kafka.JavaDirectKafkaStreamSuite.testKafkaStream failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-85c0d062-78cf-459c-a2dd-7973572101ce, took 1.581 sec
      
      Test org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite.testKafkaRDD failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-49eb6b5c-8366-47a6-83f2-80c443c48280, took 17.895 sec
      
      org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite.testKafkaStream failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-898cf826-d636-4b1c-a61a-c12a364c02e7, took 8.858 sec
      ```
      
      **Scala - failed tests**
      
      ```
      PartitionProviderCompatibilitySuite:
       - insert overwrite partition of new datasource table overwrites just partition *** FAILED *** (828 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-bb6337b9-4f99-45ab-ad2c-a787ab965c09
      
       - SPARK-18635 special chars in partition values - partition management true *** FAILED *** (5 seconds, 360 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - SPARK-18635 special chars in partition values - partition management false *** FAILED *** (141 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      ```
      
      ```
      UtilsSuite:
       - reading offset bytes of a file (compressed) *** FAILED *** (0 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-ecb2b7d5-db8b-43a7-b268-1bf242b5a491
      
       - reading offset bytes across multiple files (compressed) *** FAILED *** (0 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-25cc47a8-1faa-4da5-8862-cf174df63ce0
      ```
      
      ```
      StatisticsSuite:
       - MetastoreRelations fallback to HDFS for size estimation *** FAILED *** (110 milliseconds)
         org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'csv_table' not found in database 'default';
      ```
      
      ```
      SQLQuerySuite:
       - permanent UDTF *** FAILED *** (125 milliseconds)
         org.apache.spark.sql.AnalysisException: Undefined function: 'udtf_count_temp'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 24
      
       - describe functions - user defined functions *** FAILED *** (125 milliseconds)
         org.apache.spark.sql.AnalysisException: Undefined function: 'udtf_count'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
      
       - CTAS without serde with location *** FAILED *** (16 milliseconds)
         java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:projectsspark%09arget%09mpspark-ed673d73-edfc-404e-829e-2e2b9725d94e/c1
      
       - derived from Hive query file: drop_database_removes_partition_dirs.q *** FAILED *** (47 milliseconds)
         java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:projectsspark%09arget%09mpspark-d2ddf08e-699e-45be-9ebd-3dfe619680fe/drop_database_removes_partition_dirs_table
      
       - derived from Hive query file: drop_table_removes_partition_dirs.q *** FAILED *** (0 milliseconds)
         java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:projectsspark%09arget%09mpspark-d2ddf08e-699e-45be-9ebd-3dfe619680fe/drop_table_removes_partition_dirs_table2
      
       - SPARK-17796 Support wildcard character in filename for LOAD DATA LOCAL INPATH *** FAILED *** (109 milliseconds)
         java.nio.file.InvalidPathException: Illegal char <:> at index 2: /C:/projects/spark/sql/hive/projectsspark	arget	mpspark-1a122f8c-dfb3-46c4-bab1-f30764baee0e/*part-r*
      ```
      
      ```
      HiveDDLSuite:
       - drop external tables in default database *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - add/drop partitions - external table *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - create/drop database - location without pre-created directory *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - create/drop database - location with pre-created directory *** FAILED *** (32 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - drop database containing tables - CASCADE *** FAILED *** (94 milliseconds)
         CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675)
      
       - drop an empty database - CASCADE *** FAILED *** (63 milliseconds)
         CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675)
      
       - drop database containing tables - RESTRICT *** FAILED *** (47 milliseconds)
         CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675)
      
       - drop an empty database - RESTRICT *** FAILED *** (47 milliseconds)
         CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675)
      
       - CREATE TABLE LIKE an external data source table *** FAILED *** (140 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-c5eba16d-07ae-4186-95bb-21c5811cf888;
      
       - CREATE TABLE LIKE an external Hive serde table *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - desc table for data source table - no user-defined schema *** FAILED *** (125 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-e8bf5bf5-721a-4cbe-9d6	at scala.collection.immutable.List.foreach(List.scala:381)d-5543a8301c1d;
      ```
      
      ```
      MetastoreDataSourcesSuite
       - CTAS: persisted bucketed data source table *** FAILED *** (16 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
      ```
      
      ```
      ShowCreateTableSuite:
       - simple external hive table *** FAILED *** (0 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      ```
      
      ```
      PartitionedTablePerfStatsSuite:
       - hive table: partitioned pruned table reports only selected files *** FAILED *** (313 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - datasource table: partitioned pruned table reports only selected files *** FAILED *** (219 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-311f45f8-d064-4023-a4bb-e28235bff64d;
      
       - hive table: lazy partition pruning reads only necessary partition data *** FAILED *** (203 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - datasource table: lazy partition pruning reads only necessary partition data *** FAILED *** (187 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-fde874ca-66bd-4d0b-a40f-a043b65bf957;
      
       - hive table: lazy partition pruning with file status caching enabled *** FAILED *** (188 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - datasource table: lazy partition pruning with file status caching enabled *** FAILED *** (187 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-e6d20183-dd68-4145-acbe-4a509849accd;
      
       - hive table: file status caching respects refresh table and refreshByPath *** FAILED *** (172 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - datasource table: file status caching respects refresh table and refreshByPath *** FAILED *** (203 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-8b2c9651-2adf-4d58-874f-659007e21463;
      
       - hive table: file status cache respects size limit *** FAILED *** (219 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - datasource table: file status cache respects size limit *** FAILED *** (171 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-7835ab57-cb48-4d2c-bb1d-b46d5a4c47e4;
      
       - datasource table: table setup does not scan filesystem *** FAILED *** (266 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-20598d76-c004-42a7-8061-6c56f0eda5e2;
      
       - hive table: table setup does not scan filesystem *** FAILED *** (266 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - hive table: num hive client calls does not scale with partition count *** FAILED *** (2 seconds, 281 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - datasource table: num hive client calls does not scale with partition count *** FAILED *** (2 seconds, 422 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-4cfed321-4d1d-4b48-8d34-5c169afff383;
      
       - hive table: files read and cached when filesource partition management is off *** FAILED *** (234 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - datasource table: all partition data cached in memory when partition management is off *** FAILED *** (203 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-4bcc0398-15c9-4f6a-811e-12d40f3eec12;
      
       - SPARK-18700: table loaded only once even when resolved concurrently *** FAILED *** (1 second, 266 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      ```
      
      ```
      HiveSparkSubmitSuite:
       - temporary Hive UDF: define a UDF and use it *** FAILED *** (2 seconds, 94 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - permanent Hive UDF: define a UDF and use it *** FAILED *** (281 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - permanent Hive UDF: use a already defined permanent function *** FAILED *** (718 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - SPARK-8368: includes jars passed in through --jars *** FAILED *** (3 seconds, 521 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - SPARK-8020: set sql conf in spark conf *** FAILED *** (0 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - SPARK-8489: MissingRequirementError during reflection *** FAILED *** (94 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - SPARK-9757 Persist Parquet relation with decimal column *** FAILED *** (16 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - SPARK-11009 fix wrong result of Window function in cluster mode *** FAILED *** (16 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - SPARK-14244 fix window partition size attribute binding failure *** FAILED *** (78 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - set spark.sql.warehouse.dir *** FAILED *** (16 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - set hive.metastore.warehouse.dir *** FAILED *** (15 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - SPARK-16901: set javax.jdo.option.ConnectionURL *** FAILED *** (16 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - SPARK-18360: default table path of tables in default database should depend on the location of default database *** FAILED *** (15 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      ```
      
      ```
      UtilsSuite:
       - resolveURIs with multiple paths *** FAILED *** (0 milliseconds)
         ".../jar3,file:/C:/pi.py[%23]py.pi,file:/C:/path%..." did not equal ".../jar3,file:/C:/pi.py[#]py.pi,file:/C:/path%..." (UtilsSuite.scala:468)
      ```
      
      ```
      CheckpointSuite:
       - recovery with file input stream *** FAILED *** (10 seconds, 205 milliseconds)
         The code passed to eventually never returned normally. Attempted 660 times over 10.014272499999999 seconds. Last failure message: Unexpected internal error near index 1
         \
          ^. (CheckpointSuite.scala:680)
      ```
      
      ## How was this patch tested?
      
      Manually via AppVeyor as below:
      
      **Scala - aborted tests**
      
      ```
      WindowQuerySuite - all passed
      OrcSourceSuite:
      - SPARK-18220: read Hive orc table with varchar column *** FAILED *** (4 seconds, 417 milliseconds)
        org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
      ParquetMetastoreSuite - all passed
      ParquetSourceSuite - all passed
      KafkaRDDSuite - all passed
      DirectKafkaStreamSuite - all passed
      ReliableKafkaStreamSuite - all passed
      KafkaStreamSuite - all passed
      KafkaClusterSuite - all passed
      DirectKafkaStreamSuite - all passed
      KafkaRDDSuite - all passed
      ```
      
      **Java - failed tests**
      
      ```
      org.apache.spark.streaming.kafka.JavaKafkaRDDSuite - all passed
      org.apache.spark.streaming.kafka.JavaDirectKafkaStreamSuite - all passed
      org.apache.spark.streaming.kafka.JavaKafkaStreamSuite - all passed
      org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite - all passed
      org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite - all passed
      ```
      
      **Scala - failed tests**
      
      ```
      PartitionProviderCompatibilitySuite:
      - insert overwrite partition of new datasource table overwrites just partition (1 second, 953 milliseconds)
      - SPARK-18635 special chars in partition values - partition management true (6 seconds, 31 milliseconds)
      - SPARK-18635 special chars in partition values - partition management false (4 seconds, 578 milliseconds)
      ```
      
      ```
      UtilsSuite:
      - reading offset bytes of a file (compressed) (203 milliseconds)
      - reading offset bytes across multiple files (compressed) (0 milliseconds)
      ```
      
      ```
      StatisticsSuite:
      - MetastoreRelations fallback to HDFS for size estimation (94 milliseconds)
      ```
      
      ```
      SQLQuerySuite:
       - permanent UDTF (407 milliseconds)
       - describe functions - user defined functions (441 milliseconds)
       - CTAS without serde with location (2 seconds, 831 milliseconds)
       - derived from Hive query file: drop_database_removes_partition_dirs.q (734 milliseconds)
       - derived from Hive query file: drop_table_removes_partition_dirs.q (563 milliseconds)
       - SPARK-17796 Support wildcard character in filename for LOAD DATA LOCAL INPATH (453 milliseconds)
      ```
      
      ```
      HiveDDLSuite:
       - drop external tables in default database (3 seconds, 5 milliseconds)
       - add/drop partitions - external table (2 seconds, 750 milliseconds)
       - create/drop database - location without pre-created directory (500 milliseconds)
       - create/drop database - location with pre-created directory (407 milliseconds)
       - drop database containing tables - CASCADE (453 milliseconds)
       - drop an empty database - CASCADE (375 milliseconds)
       - drop database containing tables - RESTRICT (328 milliseconds)
       - drop an empty database - RESTRICT (391 milliseconds)
       - CREATE TABLE LIKE an external data source table (953 milliseconds)
       - CREATE TABLE LIKE an external Hive serde table (3 seconds, 782 milliseconds)
       - desc table for data source table - no user-defined schema (1 second, 150 milliseconds)
      ```
      
      ```
      MetastoreDataSourcesSuite
       - CTAS: persisted bucketed data source table (875 milliseconds)
      ```
      
      ```
      ShowCreateTableSuite:
       - simple external hive table (78 milliseconds)
      ```
      
      ```
      PartitionedTablePerfStatsSuite:
       - hive table: partitioned pruned table reports only selected files (1 second, 109 milliseconds)
      - datasource table: partitioned pruned table reports only selected files (860 milliseconds)
       - hive table: lazy partition pruning reads only necessary partition data (859 milliseconds)
       - datasource table: lazy partition pruning reads only necessary partition data (1 second, 219 milliseconds)
       - hive table: lazy partition pruning with file status caching enabled (875 milliseconds)
       - datasource table: lazy partition pruning with file status caching enabled (890 milliseconds)
       - hive table: file status caching respects refresh table and refreshByPath (922 milliseconds)
       - datasource table: file status caching respects refresh table and refreshByPath (640 milliseconds)
       - hive table: file status cache respects size limit (469 milliseconds)
       - datasource table: file status cache respects size limit (453 milliseconds)
       - datasource table: table setup does not scan filesystem (328 milliseconds)
       - hive table: table setup does not scan filesystem (313 milliseconds)
       - hive table: num hive client calls does not scale with partition count (5 seconds, 431 milliseconds)
       - datasource table: num hive client calls does not scale with partition count (4 seconds, 79 milliseconds)
       - hive table: files read and cached when filesource partition management is off (656 milliseconds)
       - datasource table: all partition data cached in memory when partition management is off (484 milliseconds)
       - SPARK-18700: table loaded only once even when resolved concurrently (2 seconds, 578 milliseconds)
      ```
      
      ```
      HiveSparkSubmitSuite:
       - temporary Hive UDF: define a UDF and use it (1 second, 745 milliseconds)
       - permanent Hive UDF: define a UDF and use it (406 milliseconds)
       - permanent Hive UDF: use a already defined permanent function (375 milliseconds)
       - SPARK-8368: includes jars passed in through --jars (391 milliseconds)
       - SPARK-8020: set sql conf in spark conf (156 milliseconds)
       - SPARK-8489: MissingRequirementError during reflection (187 milliseconds)
       - SPARK-9757 Persist Parquet relation with decimal column (157 milliseconds)
       - SPARK-11009 fix wrong result of Window function in cluster mode (156 milliseconds)
       - SPARK-14244 fix window partition size attribute binding failure (156 milliseconds)
       - set spark.sql.warehouse.dir (172 milliseconds)
       - set hive.metastore.warehouse.dir (156 milliseconds)
       - SPARK-16901: set javax.jdo.option.ConnectionURL (157 milliseconds)
       - SPARK-18360: default table path of tables in default database should depend on the location of default database (172 milliseconds)
      ```
      
      ```
      UtilsSuite:
       - resolveURIs with multiple paths (0 milliseconds)
      ```
      
      ```
      CheckpointSuite:
       - recovery with file input stream (4 seconds, 452 milliseconds)
      ```
      
      Note: after resolving the aborted tests, there is a test failure identified as below:
      
      ```
      OrcSourceSuite:
      - SPARK-18220: read Hive orc table with varchar column *** FAILED *** (4 seconds, 417 milliseconds)
        org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
      ```
      
       This does not look related to this problem, so this PR does not fix it here.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16451 from HyukjinKwon/all-path-resource-fixes.
      4e27578f
  11. Jan 04, 2017
    • Niranjan Padmanabhan's avatar
      [MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo · a1e40b1f
      Niranjan Padmanabhan authored
      ## What changes were proposed in this pull request?
      There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.
      
      ## How was this patch tested?
      N/A since only docs or comments were updated.
      
      Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>
      
      Closes #16455 from neurons/np.structure_streaming_doc.
      a1e40b1f
  12. Dec 22, 2016
    • saturday_s's avatar
      [SPARK-18537][WEB UI] Add a REST api to serve spark streaming information · ce99f51d
      saturday_s authored
      ## What changes were proposed in this pull request?
      
       This PR is carried over from #16000 and completes #15904.
      
      **Description**
      
      - Augment the `org.apache.spark.status.api.v1` package for serving streaming information.
      - Retrieve the streaming information through StreamingJobProgressListener.
      
       > this api should cover exactly the same amount of information as you can get from the web interface
       > the implementation is based on the current REST implementation of spark-core
      > and will be available for running applications only
      >
      > https://issues.apache.org/jira/browse/SPARK-18537
      
      ## How was this patch tested?
      
      Local test.
      
      Author: saturday_s <shi.indetail@gmail.com>
      Author: Chan Chor Pang <ChorPang.Chan@access-company.com>
      Author: peterCPChan <universknight@gmail.com>
      
      Closes #16253 from saturday-shi/SPARK-18537.
      ce99f51d
  13. Dec 21, 2016
  14. Dec 02, 2016
  15. Dec 01, 2016
    • Shixiong Zhu's avatar
      [SPARK-18617][SPARK-18560][TESTS] Fix flaky test: StreamingContextSuite.... · 086b0c8f
      Shixiong Zhu authored
      [SPARK-18617][SPARK-18560][TESTS] Fix flaky test: StreamingContextSuite. Receiver data should be deserialized properly
      
      ## What changes were proposed in this pull request?
      
       Avoid creating multiple threads to stop StreamingContext. Otherwise, the latch added in #16091 can be passed too early.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16105 from zsxwing/SPARK-18617-2.
      086b0c8f
  16. Nov 30, 2016
    • Shixiong Zhu's avatar
      [SPARK-18617][SPARK-18560][TEST] Fix flaky test: StreamingContextSuite.... · 0a811210
      Shixiong Zhu authored
      [SPARK-18617][SPARK-18560][TEST] Fix flaky test: StreamingContextSuite. Receiver data should be deserialized properly
      
      ## What changes were proposed in this pull request?
      
      Fixed the potential SparkContext leak in `StreamingContextSuite.SPARK-18560 Receiver data should be deserialized properly` which was added in #16052. I also removed FakeByteArrayReceiver and used TestReceiver directly.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16091 from zsxwing/SPARK-18617-follow-up.
      0a811210
    • uncleGen's avatar
      [SPARK-18617][CORE][STREAMING] Close "kryo auto pick" feature for Spark Streaming · 56c82eda
      uncleGen authored
      ## What changes were proposed in this pull request?
      
       #15992 provided a solution to fix the bug, i.e. **receiver data can not be deserialized properly**. As zsxwing said, it is a critical bug, but we should not break APIs between maintenance releases. It may be a rational choice to disable the "kryo auto pick" feature for Spark Streaming as a first step. I will continue #15992 to optimize the solution.
      
      ## How was this patch tested?
      
       Existing unit tests.
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #16052 from uncleGen/SPARK-18617.
      56c82eda
  17. Nov 29, 2016
  18. Nov 19, 2016
    • hyukjinkwon's avatar
      [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note... · d5b1d5fc
      hyukjinkwon authored
      [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that`/`'''Note:'''` across Scala/Java API documentation
      
      ## What changes were proposed in this pull request?
      
       It seems that, in Scala/Java, the following forms are all in use:
      
      - `Note:`
      - `NOTE:`
      - `Note that`
      - `'''Note:'''`
      - `note`
      
      This PR proposes to fix those to `note` to be consistent.
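
       For illustration, the consistent form is presumably the Scaladoc `@note` tag; a hypothetical
       example of the target style:

       ```scala
       /**
        * Returns the configured warehouse path.
        *
        * @note The `@note` tag is used instead of a plain "Note:" line so the note renders
        *       consistently in both the Scala and Java API docs.
        */
       def warehousePath: String = "/tmp/warehouse" // illustrative only
       ```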
      
      **Before**
      
      - Scala
        ![2016-11-17 6 16 39](https://cloud.githubusercontent.com/assets/6477701/20383180/1a7aed8c-acf2-11e6-9611-5eaf6d52c2e0.png)
      
      - Java
        ![2016-11-17 6 14 41](https://cloud.githubusercontent.com/assets/6477701/20383096/c8ffc680-acf1-11e6-914a-33460bf1401d.png)
      
      **After**
      
      - Scala
        ![2016-11-17 6 16 44](https://cloud.githubusercontent.com/assets/6477701/20383167/09940490-acf2-11e6-937a-0d5e1dc2cadf.png)
      
      - Java
        ![2016-11-17 6 13 39](https://cloud.githubusercontent.com/assets/6477701/20383132/e7c2a57e-acf1-11e6-9c47-b849674d4d88.png)
      
      ## How was this patch tested?
      
      The notes were found via
      
      ```bash
      grep -r "NOTE: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// NOTE: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \ # note that this is a regular expression. So actual matches were mostly `org/apache/spark/api/java/functions ...`
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note that " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note that " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "'''Note:'''" . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// '''Note:''' " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      And then fixed one by one comparing with API documentation/access modifiers.
      
      After that, manually tested via `jekyll build`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15889 from HyukjinKwon/SPARK-18437.
      d5b1d5fc
  19. Nov 15, 2016
  20. Nov 10, 2016
  21. Nov 08, 2016
    • jiangxingbo's avatar
      [SPARK-18191][CORE] Port RDD API to use commit protocol · 9c419698
      jiangxingbo authored
      ## What changes were proposed in this pull request?
      
       This PR ports the RDD API to use the commit protocol; the changes made here are:
       1. Add a new internal helper class, `SparkNewHadoopWriter`, that saves an RDD using a Hadoop OutputFormat. It's similar to `SparkHadoopWriter` but uses the commit protocol, and it supports the newer `mapreduce` API instead of the old `mapred` API supported by `SparkHadoopWriter`;
       2. Rewrite the `PairRDDFunctions.saveAsNewAPIHadoopDataset` function so it uses the commit protocol now.
      
      ## How was this patch tested?
       Existing test cases.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #15769 from jiangxb1987/rdd-commit.
      9c419698
  22. Nov 07, 2016
    • Hyukjin Kwon's avatar
      [SPARK-14914][CORE] Fix Resource not closed after using, mostly for unit tests · 8f0ea011
      Hyukjin Kwon authored
      ## What changes were proposed in this pull request?
      
       Close `FileStream`s, `ZipFile`s, etc. to release the resources after use. Not closing the resources causes an IOException to be raised while deleting temp files.
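
       A minimal sketch of the pattern being applied (a loan-style helper; the name is illustrative):

       ```scala
       import java.util.zip.ZipFile

       // Always close the resource, even on failure, so that the temp files it holds open
       // can be deleted afterwards (unclosed handles are what make the deletes fail).
       def withZipFile[T](path: String)(f: ZipFile => T): T = {
         val zip = new ZipFile(path)
         try f(zip) finally zip.close()
       }
       ```
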
      ## How was this patch tested?
      
      Existing tests
      
      Author: U-FAREAST\tl <tl@microsoft.com>
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Tao LI <tl@microsoft.com>
      
      Closes #15618 from HyukjinKwon/SPARK-14914-1.
      8f0ea011
  23. Nov 02, 2016
  24. Sep 22, 2016
    • Shixiong Zhu's avatar
      [SPARK-17638][STREAMING] Stop JVM StreamingContext when the Python process is dead · 3cdae0ff
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
       When the Python process is dead, the JVM StreamingContext is still running. Hence we will see a lot of Py4jExceptions before the JVM process exits. It's better to stop the JVM StreamingContext to avoid those annoying logs.
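
       A hedged sketch of the idea (the monitoring wiring below is illustrative, not the
       actual PySpark streaming code):

       ```scala
       // Watch the Python process from a daemon thread and stop the JVM StreamingContext
       // as soon as it exits, instead of letting the context keep running and logging errors.
       def monitorPythonProcess(process: Process, stopContext: () => Unit): Thread = {
         val t = new Thread(() => { process.waitFor(); stopContext() }, "python-process-monitor")
         t.setDaemon(true)
         t.start()
         t
       }
       ```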
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15201 from zsxwing/stop-jvm-ssc.
      3cdae0ff
    • Dhruve Ashar's avatar
      [SPARK-17365][CORE] Remove/Kill multiple executors together to reduce RPC call time. · 17b72d31
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
       We kill multiple executors in a single request instead of iterating over expensive RPC calls that kill one executor at a time.
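
       A hedged before/after sketch of the batching (the surrounding names are illustrative;
       `killExecutor`/`killExecutors` are the existing allocation-client calls):

       ```scala
       // before: one RPC round trip per executor
       // idleExecutorIds.foreach(id => client.killExecutor(id))
       // after: a single RPC carrying the whole batch
       // client.killExecutors(idleExecutorIds)
       ```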
      
      ## How was this patch tested?
      Executed sample spark job to observe executors being killed/removed with dynamic allocation enabled.
      
      Author: Dhruve Ashar <dashar@yahoo-inc.com>
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #15152 from dhruve/impr/SPARK-17365.
      17b72d31
  25. Sep 21, 2016
    • Marcelo Vanzin's avatar
      [SPARK-4563][CORE] Allow driver to advertise a different network address. · 2cd1bfa4
      Marcelo Vanzin authored
       The goal of this feature is to allow the Spark driver to run in an
       isolated environment, such as a docker container, and to use
       the host's port forwarding mechanism to accept connections
       from the outside world.
      
      The change is restricted to the driver: there is no support for achieving
      the same thing on executors (or the YARN AM for that matter). Those still
      need full access to the outside world so that, for example, connections
      can be made to an executor's block manager.
      
       The core of the change is simple: add a new configuration that specifies the
       address the driver should bind to, which can be different from the address
      it advertises to executors (spark.driver.host). Everything else is plumbing
      the new configuration where it's needed.
      
      To use the feature, the host starting the container needs to set up the
      driver's port range to fall into a range that is being forwarded; this
       required a separate block manager port configuration just for
      the driver, which falls back to the existing spark.blockManager.port when
      not set. This way, users can modify the driver settings without affecting
      the executors; it would theoretically be nice to also have different
      retry counts for driver and executors, but given that docker (at least)
      allows forwarding port ranges, we can probably live without that for now.
      
      Because of the nature of the feature it's kinda hard to add unit tests;
      I just added a simple one to make sure the configuration works.
      
      This was tested with a docker image running spark-shell with the following
      command:
      
       docker blah blah blah \
         -p 38000-38100:38000-38100 \
         [image] \
         spark-shell \
           --num-executors 3 \
           --conf spark.shuffle.service.enabled=false \
           --conf spark.dynamicAllocation.enabled=false \
           --conf spark.driver.host=[host's address] \
           --conf spark.driver.port=38000 \
           --conf spark.driver.blockManager.port=38020 \
           --conf spark.ui.port=38040
      
      Running on YARN; verified the driver works, executors start up and listen
      on ephemeral ports (instead of using the driver's config), and that caching
      and shuffling (without the shuffle service) works. Clicked through the UI
      to make sure all pages (including executor thread dumps) worked. Also tested
      apps without docker, and ran unit tests.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #15120 from vanzin/SPARK-4563.
      2cd1bfa4
  26. Sep 07, 2016
    • Liwei Lin's avatar
      [SPARK-17359][SQL][MLLIB] Use ArrayBuffer.+=(A) instead of... · 3ce3a282
      Liwei Lin authored
      [SPARK-17359][SQL][MLLIB] Use ArrayBuffer.+=(A) instead of ArrayBuffer.append(A) in performance critical paths
      
      ## What changes were proposed in this pull request?
      
      We should generally use `ArrayBuffer.+=(A)` rather than `ArrayBuffer.append(A)`, because `append(A)` would involve extra boxing / unboxing.
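
       A tiny example of the two call shapes (in the Scala 2.11/2.12 collections Spark used
       at the time, `append` is a varargs method):

       ```scala
       import scala.collection.mutable.ArrayBuffer

       val buf = ArrayBuffer[Int]()
       buf.append(1) // append(elems: A*) wraps the argument in a Seq before adding it
       buf += 2      // +=(elem: A) takes the element directly, avoiding that overhead
       ```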
      
      ## How was this patch tested?
      
      N/A
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #14914 from lw-lin/append_to_plus_eq_v2.
      3ce3a282
  27. Sep 06, 2016
    • Josh Rosen's avatar
      [SPARK-17110] Fix StreamCorruptionException in BlockManager.getRemoteValues() · 29cfab3f
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      This patch fixes a `java.io.StreamCorruptedException` error affecting remote reads of cached values when certain data types are used. The problem stems from #11801 / SPARK-13990, a patch to have Spark automatically pick the "best" serializer when caching RDDs. If PySpark cached a PythonRDD, then this would be cached as an `RDD[Array[Byte]]` and the automatic serializer selection would pick KryoSerializer for replication and block transfer. However, the `getRemoteValues()` / `getRemoteBytes()` code path did not pass proper class tags in order to enable the same serializer to be used during deserialization, causing Java to be inappropriately used instead of Kryo, leading to the StreamCorruptedException.
      
      We already fixed a similar bug in #14311, which dealt with similar issues in block replication. Prior to that patch, it seems that we had no tests to ensure that block replication actually succeeded. Similarly, prior to this bug fix patch it looks like we had no tests to perform remote reads of cached data, which is why this bug was able to remain latent for so long.
      
       This patch addresses the bug by modifying `BlockManager`'s `get()` and `getRemoteValues()` methods to accept ClassTags, allowing the proper class tag to be threaded in the `getOrElseUpdate` code path (which is used by `rdd.iterator`).
      
      ## How was this patch tested?
      
      Extended the caching tests in `DistributedSuite` to exercise the `getRemoteValues` path, plus manual testing to verify that the PySpark bug reproduction in SPARK-17110 is fixed.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #14952 from JoshRosen/SPARK-17110.
      29cfab3f
  28. Sep 04, 2016
    • Shivansh's avatar
      [SPARK-17308] Improved the spark core code by replacing all pattern match on... · e75c162e
      Shivansh authored
      [SPARK-17308] Improved the spark core code by replacing all pattern match on boolean value by if/else block.
      
      ## What changes were proposed in this pull request?
       Improved the code quality of Spark by replacing all pattern matches on boolean values with if/else blocks.
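
       An illustrative before/after of the rewrite:

       ```scala
       // before:
       // enabled match {
       //   case true  => start()
       //   case false => stop()
       // }
       // after:
       // if (enabled) start() else stop()
       ```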
      
      ## How was this patch tested?
      
      By running the tests
      
      Author: Shivansh <shiv4nsh@gmail.com>
      
      Closes #14873 from shiv4nsh/SPARK-17308.
      e75c162e
  29. Aug 17, 2016
    • Xin Ren's avatar
      [SPARK-17038][STREAMING] fix metrics retrieval source of 'lastReceivedBatch' · e6bef7d5
      Xin Ren authored
      https://issues.apache.org/jira/browse/SPARK-17038
      
      ## What changes were proposed in this pull request?
      
      StreamingSource's lastReceivedBatch_submissionTime, lastReceivedBatch_processingTimeStart, and lastReceivedBatch_processingTimeEnd all use data from lastCompletedBatch instead of lastReceivedBatch.
      
      In particular, this makes it impossible to match lastReceivedBatch_records with a batchID/submission time.
      
      This is apparent when looking at StreamingSource.scala, lines 89-94.
      
      ## How was this patch tested?
      
      Manually running unit tests on local laptop
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #14681 from keypointt/SPARK-17038.
      e6bef7d5
    • Steve Loughran's avatar
      [SPARK-16736][CORE][SQL] purge superfluous fs calls · cc97ea18
      Steve Loughran authored
       This reviews the code, working back from Hadoop's `FileSystem.exists()` and `FileSystem.isDirectory()` code, and removes uses of those calls where they are superfluous:
      
       1. delete is harmless if called on a nonexistent path, so don't do any checks before deletes
       2. any `FileSystem.exists()` check before `getFileStatus()` or `open()` is superfluous, as the operation itself does the check. Instead the `FileNotFoundException` is caught and triggers the downgraded path. Where a `FileNotFoundException` was thrown before, the code still creates a new FNFE with the same error messages, though the inner exceptions are now nested for easier diagnostics (see the sketch below).
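
       A small sketch of the second point, using plain Hadoop `FileSystem` calls (the helper
       name is illustrative):

       ```scala
       import java.io.FileNotFoundException
       import org.apache.hadoop.fs.{FileSystem, Path}

       // Skip the extra exists() round trip and let getFileStatus() signal a missing path
       // through FileNotFoundException instead.
       def fileSizeIfPresent(fs: FileSystem, path: Path): Option[Long] =
         try Some(fs.getFileStatus(path).getLen)
         catch { case _: FileNotFoundException => None }
       ```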
      
      Initially, relying on Jenkins test runs.
      
       One trouble spot here is that some of the code paths are clearly error situations; it's not clear that they have coverage anyway. Creating the failure conditions in tests would be ideal, but it would also be hard.
      
      Author: Steve Loughran <stevel@apache.org>
      
      Closes #14371 from steveloughran/cloud/SPARK-16736-superfluous-fs-calls.
      cc97ea18