Skip to content
Snippets Groups Projects
  1. Sep 01, 2017
    • Sean Owen's avatar
      [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala... · 12ab7f7e
      Sean Owen authored
      [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala 2.12 profiles and enable 2.12 compilation
      
      …build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure
      
      ## What changes were proposed in this pull request?
      
      This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts.
      
      In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11.
      
      It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release.
      
      - Scalatest 2.x -> 3.0.3
      - Chill 0.8.0 -> 0.8.4
      - Clapper 1.0.x -> 1.1.2
      - json4s 3.2.x -> 3.4.2
      - Jackson 2.6.x -> 2.7.9 (required by json4s)
      
      This change does _not_ fully enable a Scala 2.12 build:
      
      - It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here
      - It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too.
      
      What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build.
      
      ## How was this patch tested?
      
      Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18645 from srowen/SPARK-14280.
      12ab7f7e
  2. Aug 30, 2017
    • Yuval Itzchakov's avatar
      [SPARK-21873][SS] - Avoid using `return` inside `CachedKafkaConsumer.get` · 8f0df6bc
      Yuval Itzchakov authored
      During profiling of a structured streaming application with Kafka as the source, I came across this exception:
      
      ![Structured Streaming Kafka Exceptions](https://user-images.githubusercontent.com/3448320/29743366-4149ef78-8a99-11e7-94d6-f0cbb691134a.png)
      
      This is a 1 minute sample, which caused 106K `NonLocalReturnControl` exceptions to be thrown.
      This happens because `CachedKafkaConsumer.get` is ran inside:
      
      `private def runUninterruptiblyIfPossible[T](body: => T): T`
      
      Where `body: => T` is the `get` method. Turning the method into a function means that in order to escape the `while` loop defined in `get` the runtime has to do dirty tricks which involve throwing the above exception.
      
      ## What changes were proposed in this pull request?
      
      Instead of using `return` (which is generally not recommended in Scala), we place the result of the `fetchData` method inside a local variable and use a boolean flag to indicate the status of fetching data, which we monitor as our predicate to the `while` loop.
      
      ## How was this patch tested?
      
      I've ran the `KafkaSourceSuite` to make sure regression passes. Since the exception isn't visible from user code, there is no way (at least that I could think of) to add this as a test to the existing suite.
      
      Author: Yuval Itzchakov <yuval.itzchakov@clicktale.com>
      
      Closes #19059 from YuvalItzchakov/master.
      8f0df6bc
  3. Aug 22, 2017
    • Jose Torres's avatar
      [SPARK-21765] Set isStreaming on leaf nodes for streaming plans. · 3c0c2d09
      Jose Torres authored
      ## What changes were proposed in this pull request?
      All streaming logical plans will now have isStreaming set. This involved adding isStreaming as a case class arg in a few cases, since a node might be logically streaming depending on where it came from.
      
      ## How was this patch tested?
      
      Existing unit tests - no functional change is intended in this PR.
      
      Author: Jose Torres <joseph-torres@databricks.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #18973 from joseph-torres/SPARK-21765.
      3c0c2d09
  4. Aug 21, 2017
  5. Aug 19, 2017
  6. Aug 15, 2017
    • Marcelo Vanzin's avatar
      [SPARK-21731][BUILD] Upgrade scalastyle to 0.9. · 3f958a99
      Marcelo Vanzin authored
      This version fixes a few issues in the import order checker; it provides
      better error messages, and detects more improper ordering (thus the need
      to change a lot of files in this patch). The main fix is that it correctly
      complains about the order of packages vs. classes.
      
      As part of the above, I moved some "SparkSession" import in ML examples
      inside the "$example on$" blocks; that didn't seem consistent across
      different source files to start with, and avoids having to add more on/off blocks
      around specific imports.
      
      The new scalastyle also seems to have a better header detector, so a few
      license headers had to be updated to match the expected indentation.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18943 from vanzin/SPARK-21731.
      3f958a99
  7. Aug 09, 2017
    • Takeshi Yamamuro's avatar
      [SPARK-21276][CORE] Update lz4-java to the latest (v1.4.0) · b78cf13b
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This pr updated `lz4-java` to the latest (v1.4.0) and removed custom `LZ4BlockInputStream`. We currently use custom `LZ4BlockInputStream` to read concatenated byte stream in shuffle. But, this functionality has been implemented in the latest lz4-java (https://github.com/lz4/lz4-java/pull/105). So, we might update the latest to remove the custom `LZ4BlockInputStream`.
      
      Major diffs between the latest release and v1.3.0 in the master are as follows (https://github.com/lz4/lz4-java/compare/62f7547abb0819d1ca1e669645ee1a9d26cd60b0...6d4693f56253fcddfad7b441bb8d917b182efa2d);
      - fixed NPE in XXHashFactory similarly
      - Don't place resources in default package to support shading
      - Fixes ByteBuffer methods failing to apply arrayOffset() for array-backed
      - Try to load lz4-java from java.library.path, then fallback to bundled
      - Add ppc64le binary
      - Add s390x JNI binding
      - Add basic LZ4 Frame v1.5.0 support
      - enable aarch64 support for lz4-java
      - Allow unsafeInstance() for ppc64le archiecture
      - Add unsafeInstance support for AArch64
      - Support 64-bit JNI build on Solaris
      - Avoid over-allocating a buffer
      - Allow EndMark to be incompressible for LZ4FrameInputStream.
      - Concat byte stream
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #18883 from maropu/SPARK-21276.
      b78cf13b
  8. Jul 25, 2017
    • Yash Sharma's avatar
      [SPARK-20855][Docs][DStream] Update the Spark kinesis docs to use the... · 4f77c062
      Yash Sharma authored
      [SPARK-20855][Docs][DStream] Update the Spark kinesis docs to use the KinesisInputDStream builder instead of deprecated KinesisUtils
      
      ## What changes were proposed in this pull request?
      
      The examples and docs for Spark-Kinesis integrations use the deprecated KinesisUtils. We should update the docs to use the KinesisInputDStream builder to create DStreams.
      
      ## How was this patch tested?
      
      The patch primarily updates the documents. The patch will also need to make changes to the Spark-Kinesis examples. The examples need to be tested.
      
      Author: Yash Sharma <ysharma@atlassian.com>
      
      Closes #18071 from yssharma/ysharma/kinesis_docs.
      4f77c062
  9. Jul 20, 2017
    • Tim Van Wassenhove's avatar
      [SPARK-21142][SS] spark-streaming-kafka-0-10 should depend on kafka-clients... · 03367d7a
      Tim Van Wassenhove authored
      [SPARK-21142][SS] spark-streaming-kafka-0-10 should depend on kafka-clients instead of full blown kafka library
      
      ## What changes were proposed in this pull request?
      
      Currently spark-streaming-kafka-0-10 has a dependency on the full kafka distribution (but only uses and requires the kafka-clients library).
      
      The PR fixes that (the library only depends on kafka-clients), and the tests depend on the full kafka.
      
      ## How was this patch tested?
      
      All existing tests still pass.
      
      Author: Tim Van Wassenhove <github@timvw.be>
      
      Closes #18353 from timvw/master.
      03367d7a
  10. Jul 18, 2017
    • Sean Owen's avatar
      [SPARK-21415] Triage scapegoat warnings, part 1 · e26dac5f
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Address scapegoat warnings for:
      - BigDecimal double constructor
      - Catching NPE
      - Finalizer without super
      - List.size is O(n)
      - Prefer Seq.empty
      - Prefer Set.empty
      - reverse.map instead of reverseMap
      - Type shadowing
      - Unnecessary if condition.
      - Use .log1p
      - Var could be val
      
      In some instances like Seq.empty, I avoided making the change even where valid in test code to keep the scope of the change smaller. Those issues are concerned with performance and it won't matter for tests.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18635 from srowen/Scapegoat1.
      e26dac5f
  11. Jul 13, 2017
    • Sean Owen's avatar
      [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 · 425c4ada
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove Scala 2.10 build profiles and support
      - Replace some 2.10 support in scripts with commented placeholders for 2.12 later
      - Remove deprecated API calls from 2.10 support
      - Remove usages of deprecated context bounds where possible
      - Remove Scala 2.10 workarounds like ScalaReflectionLock
      - Other minor Scala warning fixes
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17150 from srowen/SPARK-19810.
      425c4ada
  12. Jul 02, 2017
    • Rui Zha's avatar
      [SPARK-18004][SQL] Make sure the date or timestamp related predicate can be... · d4107196
      Rui Zha authored
      [SPARK-18004][SQL] Make sure the date or timestamp related predicate can be pushed down to Oracle correctly
      
      ## What changes were proposed in this pull request?
      
      Move `compileValue` method in JDBCRDD to JdbcDialect, and override the `compileValue` method in OracleDialect to rewrite the Oracle-specific timestamp and date literals in where clause.
      
      ## How was this patch tested?
      
      An integration test has been added.
      
      Author: Rui Zha <zrdt713@gmail.com>
      Author: Zharui <zrdt713@gmail.com>
      
      Closes #18451 from SharpRay/extend-compileValue-to-dialects.
      d4107196
  13. Jun 23, 2017
  14. Jun 21, 2017
    • sureshthalamati's avatar
      [SPARK-10655][SQL] Adding additional data type mappings to jdbc DB2dialect. · 9ce714dc
      sureshthalamati authored
      This patch adds DB2 specific data type mappings for decfloat, real, xml , and timestamp with time zone (DB2Z specific type)  types on read and for byte, short data types  on write to the to jdbc data source DB2 dialect. Default mapping does not work for these types when reading/writing from DB2 database.
      
      Added docker test, and a JDBC unit test case.
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #9162 from sureshthalamati/db2dialect_enhancements-spark-10655.
      9ce714dc
  15. Jun 08, 2017
    • Mark Grover's avatar
      [SPARK-19185][DSTREAM] Make Kafka consumer cache configurable · 55b8cfe6
      Mark Grover authored
      ## What changes were proposed in this pull request?
      
      Add a new property `spark.streaming.kafka.consumer.cache.enabled` that allows users to enable or disable the cache for Kafka consumers. This property can be especially handy in cases where issues like SPARK-19185 get hit, for which there isn't a solution committed yet. By default, the cache is still on, so this change doesn't change any out-of-box behavior.
      
      ## How was this patch tested?
      Running unit tests
      
      Author: Mark Grover <mark@apache.org>
      Author: Mark Grover <grover.markgrover@gmail.com>
      
      Closes #18234 from markgrover/spark-19185.
      55b8cfe6
  16. May 30, 2017
  17. May 29, 2017
  18. May 19, 2017
  19. May 17, 2017
  20. May 16, 2017
    • Yash Sharma's avatar
      [SPARK-20140][DSTREAM] Remove hardcoded kinesis retry wait and max retries · 38f4e869
      Yash Sharma authored
      ## What changes were proposed in this pull request?
      
      The pull requests proposes to remove the hardcoded values for Amazon Kinesis - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES.
      
      This change is critical for kinesis checkpoint recovery when the kinesis backed rdd is huge.
      Following happens in a typical kinesis recovery :
      - kinesis throttles large number of requests while recovering
      - retries in case of throttling are not able to recover due to the small wait period
      - kinesis throttles per second, the wait period should be configurable for recovery
      
      The patch picks the spark kinesis configs from:
      - spark.streaming.kinesis.retry.wait.time
      - spark.streaming.kinesis.retry.max.attempts
      
      Jira : https://issues.apache.org/jira/browse/SPARK-20140
      
      ## How was this patch tested?
      
      Modified the KinesisBackedBlockRDDSuite.scala to run kinesis tests with the modified configurations. Wasn't able to test the patch with actual throttling.
      
      Author: Yash Sharma <ysharma@atlassian.com>
      
      Closes #17467 from yssharma/ysharma/spark-kinesis-retries.
      38f4e869
  21. May 11, 2017
  22. May 10, 2017
    • Xianyang Liu's avatar
      [MINOR][BUILD] Fix lint-java breaks. · fcb88f92
      Xianyang Liu authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix the lint-breaks as below:
      ```
      [ERROR] src/main/java/org/apache/spark/unsafe/Platform.java:[51] (regexp) RegexpSingleline: No trailing whitespace allowed.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[45,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[62,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[78,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[92,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[102,25] (naming) MethodName: Method name 'Once' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisInputDStreamBuilderSuite.java:[28,8] (imports) UnusedImports: Unused import - org.apache.spark.streaming.api.java.JavaDStream.
      ```
      
      after:
      ```
      dev/lint-java
      Checkstyle checks passed.
      ```
      [Test Result](https://travis-ci.org/ConeyLiu/spark/jobs/229666169)
      
      ## How was this patch tested?
      
      Travis CI
      
      Author: Xianyang Liu <xianyang.liu@intel.com>
      
      Closes #17890 from ConeyLiu/codestyle.
      fcb88f92
  23. May 07, 2017
    • Xiao Li's avatar
      [SPARK-20557][SQL] Support JDBC data type Time with Time Zone · cafca54c
      Xiao Li authored
      ### What changes were proposed in this pull request?
      
      This PR is to support JDBC data type TIME WITH TIME ZONE. It can be converted to TIMESTAMP
      
      In addition, before this PR, for unsupported data types, we simply output the type number instead of the type name.
      
      ```
      java.sql.SQLException: Unsupported type 2014
      ```
      After this PR, the message is like
      ```
      java.sql.SQLException: Unsupported type TIMESTAMP_WITH_TIMEZONE
      ```
      
      - Also upgrade the H2 version to `1.4.195` which has the type fix for "TIMESTAMP WITH TIMEZONE". However, it is not fully supported. Thus, we capture the exception, but we still need it to partially test the support of "TIMESTAMP WITH TIMEZONE", because Docker tests are not regularly run.
      
      ### How was this patch tested?
      Added test cases.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17835 from gatorsmile/h2.
      cafca54c
  24. May 05, 2017
  25. Apr 28, 2017
  26. Apr 27, 2017
    • Shixiong Zhu's avatar
      [SPARK-20452][SS][KAFKA] Fix a potential ConcurrentModificationException for batch Kafka DataFrame · 823baca2
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Cancel a batch Kafka query but one of task cannot be cancelled, and rerun the same DataFrame may cause ConcurrentModificationException because it may launch two tasks sharing the same group id.
      
      This PR always create a new consumer when `reuseKafkaConsumer = false` to avoid ConcurrentModificationException. It also contains other minor fixes.
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17752 from zsxwing/kafka-fix.
      823baca2
    • Shixiong Zhu's avatar
      [SPARK-20461][CORE][SS] Use UninterruptibleThread for Executor and fix the... · 01c999e7
      Shixiong Zhu authored
      [SPARK-20461][CORE][SS] Use UninterruptibleThread for Executor and fix the potential hang in CachedKafkaConsumer
      
      ## What changes were proposed in this pull request?
      
      This PR changes Executor's threads to `UninterruptibleThread` so that we can use `runUninterruptibly` in `CachedKafkaConsumer`. However, this is just best effort to avoid hanging forever. If the user uses`CachedKafkaConsumer` in another thread (e.g., create a new thread or Future), the potential hang may still happen.
      
      ## How was this patch tested?
      
      The new added test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17761 from zsxwing/int.
      01c999e7
  27. Apr 24, 2017
  28. Apr 13, 2017
    • Yash Sharma's avatar
      [SPARK-20189][DSTREAM] Fix spark kinesis testcases to remove deprecated... · ec68d8f8
      Yash Sharma authored
      [SPARK-20189][DSTREAM] Fix spark kinesis testcases to remove deprecated createStream and use Builders
      
      ## What changes were proposed in this pull request?
      
      The spark-kinesis testcases use the KinesisUtils.createStream which are deprecated now. Modify the testcases to use the recommended KinesisInputDStream.builder instead.
      This change will also enable the testcases to automatically use the session tokens automatically.
      
      ## How was this patch tested?
      
      All the existing testcases work fine as expected with the changes.
      
      https://issues.apache.org/jira/browse/SPARK-20189
      
      Author: Yash Sharma <ysharma@atlassian.com>
      
      Closes #17506 from yssharma/ysharma/cleanup_kinesis_testcases.
      ec68d8f8
  29. Apr 10, 2017
    • Sean Owen's avatar
      [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String toLowerCase "Turkish... · a26e3ed5
      Sean Owen authored
      [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String toLowerCase "Turkish locale bug" causes Spark problems
      
      ## What changes were proposed in this pull request?
      
      Add Locale.ROOT to internal calls to String `toLowerCase`, `toUpperCase`, to avoid inadvertent locale-sensitive variation in behavior (aka the "Turkish locale problem").
      
      The change looks large but it is just adding `Locale.ROOT` (the locale with no country or language specified) to every call to these methods.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17527 from srowen/SPARK-20156.
      a26e3ed5
  30. Apr 05, 2017
    • Tathagata Das's avatar
      [SPARK-20209][SS] Execute next trigger immediately if previous batch took... · dad499f3
      Tathagata Das authored
      [SPARK-20209][SS] Execute next trigger immediately if previous batch took longer than trigger interval
      
      ## What changes were proposed in this pull request?
      
      For large trigger intervals (e.g. 10 minutes), if a batch takes 11 minutes, then it will wait for 9 mins before starting the next batch. This does not make sense. The processing time based trigger policy should be to do process batches as fast as possible, but no faster than 1 in every trigger interval. If batches are taking longer than trigger interval anyways, then no point waiting extra trigger interval.
      
      In this PR, I modified the ProcessingTimeExecutor to do so. Another minor change I did was to extract our StreamManualClock into a separate class so that it can be used outside subclasses of StreamTest. For example, ProcessingTimeExecutorSuite does not need to create any context for testing, just needs the StreamManualClock.
      
      ## How was this patch tested?
      Added new unit tests to comprehensively test this behavior.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #17525 from tdas/SPARK-20209.
      dad499f3
  31. Mar 24, 2017
    • Adam Budde's avatar
      [SPARK-19911][STREAMING] Add builder interface for Kinesis DStreams · 707e5018
      Adam Budde authored
      ## What changes were proposed in this pull request?
      
      - Add new KinesisDStream.scala containing KinesisDStream.Builder class
      - Add KinesisDStreamBuilderSuite test suite
      - Make KinesisInputDStream ctor args package private for testing
      - Add JavaKinesisDStreamBuilderSuite test suite
      - Add args to KinesisInputDStream and KinesisReceiver for optional
        service-specific auth (Kinesis, DynamoDB and CloudWatch)
      ## How was this patch tested?
      
      Added ```KinesisDStreamBuilderSuite``` to verify builder class works as expected
      
      Author: Adam Budde <budde@amazon.com>
      
      Closes #17250 from budde/KinesisStreamBuilder.
      707e5018
  32. Mar 23, 2017
    • Tyson Condie's avatar
      [SPARK-19876][SS][WIP] OneTime Trigger Executor · 746a558d
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      An additional trigger and trigger executor that will execute a single trigger only. One can use this OneTime trigger to have more control over the scheduling of triggers.
      
      In addition, this patch requires an optimization to StreamExecution that logs a commit record at the end of successfully processing a batch. This new commit log will be used to determine the next batch (offsets) to process after a restart, instead of using the offset log itself to determine what batch to process next after restart; using the offset log to determine this would process the previously logged batch, always, thus not permitting a OneTime trigger feature.
      
      ## How was this patch tested?
      
      A number of existing tests have been revised. These tests all assumed that when restarting a stream, the last batch in the offset log is to be re-processed. Given that we now have a commit log that will tell us if that last batch was processed successfully, the results/assumptions of those tests needed to be revised accordingly.
      
      In addition, a OneTime trigger test was added to StreamingQuerySuite, which tests:
      - The semantics of OneTime trigger (i.e., on start, execute a single batch, then stop).
      - The case when the commit log was not able to successfully log the completion of a batch before restart, which would mean that we should fall back to what's in the offset log.
      - A OneTime trigger execution that results in an exception being thrown.
      
      marmbrus tdas zsxwing
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Tyson Condie <tcondie@gmail.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #17219 from tcondie/stream-commit.
      746a558d
  33. Mar 16, 2017
    • Liwei Lin's avatar
      [SPARK-19721][SS] Good error message for version mismatch in log files · 2ea214dd
      Liwei Lin authored
      ## Problem
      
      There are several places where we write out version identifiers in various logs for structured streaming (usually `v1`). However, in the places where we check for this, we throw a confusing error message.
      
      ## What changes were proposed in this pull request?
      
      This patch made two major changes:
      1. added a `parseVersion(...)` method, and based on this method, fixed the following places the way they did version checking (no other place needed to do this checking):
      ```
      HDFSMetadataLog
        - CompactibleFileStreamLog  ------------> fixed with this patch
          - FileStreamSourceLog  ---------------> inherited the fix of `CompactibleFileStreamLog`
          - FileStreamSinkLog  -----------------> inherited the fix of `CompactibleFileStreamLog`
        - OffsetSeqLog  ------------------------> fixed with this patch
        - anonymous subclass in KafkaSource  ---> fixed with this patch
      ```
      
      2. changed the type of `FileStreamSinkLog.VERSION`, `FileStreamSourceLog.VERSION` etc. from `String` to `Int`, so that we can identify newer versions via `version > 1` instead of `version != "v1"`
          - note this didn't break any backwards compatibility -- we are still writing out `"v1"` and reading back `"v1"`
      
      ## Exception message with this patch
      ```
      java.lang.IllegalStateException: Failed to read log file /private/var/folders/nn/82rmvkk568sd8p3p8tb33trw0000gn/T/spark-86867b65-0069-4ef1-b0eb-d8bd258ff5b8/0. UnsupportedLogVersion: maximum supported log version is v1, but encountered v99. The log file was produced by a newer version of Spark and cannot be read by this version. Please upgrade.
      	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:202)
      	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(OffsetSeqLogSuite.scala:78)
      	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(OffsetSeqLogSuite.scala:75)
      	at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:133)
      	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite.withTempDir(OffsetSeqLogSuite.scala:26)
      	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3.apply$mcV$sp(OffsetSeqLogSuite.scala:75)
      	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3.apply(OffsetSeqLogSuite.scala:75)
      	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3.apply(OffsetSeqLogSuite.scala:75)
      	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
      	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
      ```
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #17070 from lw-lin/better-msg.
      2ea214dd
  34. Mar 12, 2017
    • uncleGen's avatar
      [SPARK-19853][SS] uppercase kafka topics fail when startingOffsets are SpecificOffsets · 0a4d06a7
      uncleGen authored
      When using the KafkaSource with Structured Streaming, consumer assignments are not what the user expects if startingOffsets is set to an explicit set of topics/partitions in JSON where the topic(s) happen to have uppercase characters. When StartingOffsets is constructed, the original string value from options is transformed toLowerCase to make matching on "earliest" and "latest" case insensitive. However, the toLowerCase JSON is passed to SpecificOffsets for the terminal condition, so topic names may not be what the user intended by the time assignments are made with the underlying KafkaConsumer.
      
      KafkaSourceProvider.scala:
      ```
      val startingOffsets = caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) match {
          case Some("latest") => LatestOffsets
          case Some("earliest") => EarliestOffsets
          case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json))
          case None => LatestOffsets
        }
      ```
      
      Thank cbowden for reporting.
      
      Jenkins
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #17209 from uncleGen/SPARK-19853.
      0a4d06a7
  35. Mar 09, 2017
  36. Mar 06, 2017
    • Tyson Condie's avatar
      [SPARK-19719][SS] Kafka writer for both structured streaming and batch queires · b0a5cd89
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      Add a new Kafka Sink and Kafka Relation for writing streaming and batch queries, respectively, to Apache Kafka.
      ### Streaming Kafka Sink
      - When addBatch is called
      -- If batchId is great than the last written batch
      --- Write batch to Kafka
      ---- Topic will be taken from the record, if present, or from a topic option, which overrides topic in record.
      -- Else ignore
      
      ### Batch Kafka Sink
      - KafkaSourceProvider will implement CreatableRelationProvider
      - CreatableRelationProvider#createRelation will write the passed in Dataframe to a Kafka
      - Topic will be taken from the record, if present, or from topic option, which overrides topic in record.
      - Save modes Append and ErrorIfExist supported under identical semantics. Other save modes result in an AnalysisException
      
      tdas zsxwing
      
      ## How was this patch tested?
      
      ### The following unit tests will be included
      - write to stream with topic field: valid stream write with data that includes an existing topic in the schema
      - write structured streaming aggregation w/o topic field, with default topic: valid stream write with data that does not include a topic field, but the configuration includes a default topic
      - write data with bad schema: various cases of writing data that does not conform to a proper schema e.g., 1. no topic field or default topic, and 2. no value field
      - write data with valid schema but wrong types: data with a complete schema but wrong types e.g., key and value types are integers.
      - write to non-existing topic: write a stream to a topic that does not exist in Kafka, which has been configured to not auto-create topics.
      - write batch to kafka: simple write batch to Kafka, which goes through the same code path as streaming scenario, so validity checks will not be redone here.
      
      ### Examples
      ```scala
      // Structured Streaming
      val writer = inputStringStream.map(s => s.get(0).toString.getBytes()).toDF("value")
       .selectExpr("value as key", "value as value")
       .writeStream
       .format("kafka")
       .option("checkpointLocation", checkpointDir)
       .outputMode(OutputMode.Append)
       .option("kafka.bootstrap.servers", brokerAddress)
       .option("topic", topic)
       .queryName("kafkaStream")
       .start()
      
      // Batch
      val df = spark
       .sparkContext
       .parallelize(Seq("1", "2", "3", "4", "5"))
       .map(v => (topic, v))
       .toDF("topic", "value")
      
      df.write
       .format("kafka")
       .option("kafka.bootstrap.servers",brokerAddress)
       .option("topic", topic)
       .save()
      ```
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Tyson Condie <tcondie@gmail.com>
      
      Closes #17043 from tcondie/kafka-writer.
      b0a5cd89
    • Gaurav's avatar
      [SPARK-19304][STREAMING][KINESIS] fix kinesis slow checkpoint recovery · 46a64d1e
      Gaurav authored
      ## What changes were proposed in this pull request?
      added a limit to getRecords api call call in KinesisBackedBlockRdd. This helps reduce the amount of data returned by kinesis api call making the recovery considerably faster
      
      As we are storing the `fromSeqNum` & `toSeqNum` in checkpoint metadata, we can also store the number of records. Which can later be used for api call.
      
      ## How was this patch tested?
      The patch was manually tested
      
      Apologies for any silly mistakes, opening first pull request
      
      Author: Gaurav <gaurav@techtinium.com>
      
      Closes #16842 from Gauravshah/kinesis_checkpoint_recovery_fix_2_1_0.
      46a64d1e
  37. Feb 27, 2017
    • hyukjinkwon's avatar
      [MINOR][BUILD] Fix lint-java breaks in Java · 4ba9c6c4
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix the lint-breaks as below:
      
      ```
      [ERROR] src/test/java/org/apache/spark/network/TransportResponseHandlerSuite.java:[29,8] (imports) UnusedImports: Unused import - org.apache.spark.network.buffer.ManagedBuffer.
      [ERROR] src/main/java/org/apache/spark/unsafe/types/UTF8String.java:[156,10] (modifier) ModifierOrder: 'Nonnull' annotation modifier does not precede non-annotation modifiers.
      [ERROR] src/main/java/org/apache/spark/SparkFirehoseListener.java:[122] (sizes) LineLength: Line is longer than 100 characters (found 105).
      [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[164,78] (coding) OneStatementPerLine: Only one statement per line allowed.
      [ERROR] src/test/java/test/org/apache/spark/JavaAPISuite.java:[1157] (sizes) LineLength: Line is longer than 100 characters (found 121).
      [ERROR] src/test/java/org/apache/spark/streaming/JavaMapWithStateSuite.java:[149] (sizes) LineLength: Line is longer than 100 characters (found 113).
      [ERROR] src/test/java/test/org/apache/spark/streaming/Java8APISuite.java:[146] (sizes) LineLength: Line is longer than 100 characters (found 122).
      [ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[32,8] (imports) UnusedImports: Unused import - org.apache.spark.streaming.Time.
      [ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[611] (sizes) LineLength: Line is longer than 100 characters (found 101).
      [ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[1317] (sizes) LineLength: Line is longer than 100 characters (found 102).
      [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetAggregatorSuite.java:[91] (sizes) LineLength: Line is longer than 100 characters (found 102).
      [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[113] (sizes) LineLength: Line is longer than 100 characters (found 101).
      [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[164] (sizes) LineLength: Line is longer than 100 characters (found 110).
      [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[212] (sizes) LineLength: Line is longer than 100 characters (found 114).
      [ERROR] src/test/java/org/apache/spark/mllib/tree/JavaDecisionTreeSuite.java:[36] (sizes) LineLength: Line is longer than 100 characters (found 101).
      [ERROR] src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java:[26,8] (imports) UnusedImports: Unused import - com.amazonaws.regions.RegionUtils.
      [ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisStreamSuite.java:[20,8] (imports) UnusedImports: Unused import - com.amazonaws.regions.RegionUtils.
      [ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisStreamSuite.java:[94] (sizes) LineLength: Line is longer than 100 characters (found 103).
      [ERROR] src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java:[30,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.api.java.UDF1.
      [ERROR] src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java:[72] (sizes) LineLength: Line is longer than 100 characters (found 104).
      [ERROR] src/main/java/org/apache/spark/examples/mllib/JavaRankingMetricsExample.java:[121] (sizes) LineLength: Line is longer than 100 characters (found 101).
      [ERROR] src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java:[28,8] (imports) UnusedImports: Unused import - org.apache.spark.api.java.JavaRDD.
      [ERROR] src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java:[29,8] (imports) UnusedImports: Unused import - org.apache.spark.api.java.JavaSparkContext.
      ```
      
      ## How was this patch tested?
      
      Manually via
      
      ```bash
      ./dev/lint-java
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17072 from HyukjinKwon/java-lint.
      4ba9c6c4
Loading