Skip to content
Snippets Groups Projects
  1. Mar 06, 2017
    • Tyson Condie's avatar
      [SPARK-19719][SS] Kafka writer for both structured streaming and batch queires · fd6c6d5c
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      Add a new Kafka Sink and Kafka Relation for writing streaming and batch queries, respectively, to Apache Kafka.
      ### Streaming Kafka Sink
      - When addBatch is called
      -- If batchId is great than the last written batch
      --- Write batch to Kafka
      ---- Topic will be taken from the record, if present, or from a topic option, which overrides topic in record.
      -- Else ignore
      
      ### Batch Kafka Sink
      - KafkaSourceProvider will implement CreatableRelationProvider
      - CreatableRelationProvider#createRelation will write the passed in Dataframe to a Kafka
      - Topic will be taken from the record, if present, or from topic option, which overrides topic in record.
      - Save modes Append and ErrorIfExist supported under identical semantics. Other save modes result in an AnalysisException
      
      tdas zsxwing
      
      ## How was this patch tested?
      
      ### The following unit tests will be included
      - write to stream with topic field: valid stream write with data that includes an existing topic in the schema
      - write structured streaming aggregation w/o topic field, with default topic: valid stream write with data that does not include a topic field, but the configuration includes a default topic
      - write data with bad schema: various cases of writing data that does not conform to a proper schema e.g., 1. no topic field or default topic, and 2. no value field
      - write data with valid schema but wrong types: data with a complete schema but wrong types e.g., key and value types are integers.
      - write to non-existing topic: write a stream to a topic that does not exist in Kafka, which has been configured to not auto-create topics.
      - write batch to kafka: simple write batch to Kafka, which goes through the same code path as streaming scenario, so validity checks will not be redone here.
      
      ### Examples
      ```scala
      // Structured Streaming
      val writer = inputStringStream.map(s => s.get(0).toString.getBytes()).toDF("value")
       .selectExpr("value as key", "value as value")
       .writeStream
       .format("kafka")
       .option("checkpointLocation", checkpointDir)
       .outputMode(OutputMode.Append)
       .option("kafka.bootstrap.servers", brokerAddress)
       .option("topic", topic)
       .queryName("kafkaStream")
       .start()
      
      // Batch
      val df = spark
       .sparkContext
       .parallelize(Seq("1", "2", "3", "4", "5"))
       .map(v => (topic, v))
       .toDF("topic", "value")
      
      df.write
       .format("kafka")
       .option("kafka.bootstrap.servers",brokerAddress)
       .option("topic", topic)
       .save()
      ```
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Tyson Condie <tcondie@gmail.com>
      
      Closes #17043 from tcondie/kafka-writer.
      fd6c6d5c
  2. Feb 17, 2017
    • Roberto Agostino Vitillo's avatar
      [SPARK-19517][SS] KafkaSource fails to initialize partition offsets · b083ec51
      Roberto Agostino Vitillo authored
      ## What changes were proposed in this pull request?
      
      This patch fixes a bug in `KafkaSource` with the (de)serialization of the length of the JSON string that contains the initial partition offsets.
      
      ## How was this patch tested?
      
      I ran the test suite for spark-sql-kafka-0-10.
      
      Author: Roberto Agostino Vitillo <ra.vitillo@gmail.com>
      
      Closes #16857 from vitillo/kafka_source_fix.
      b083ec51
  3. Feb 13, 2017
    • Liwei Lin's avatar
      [SPARK-19564][SPARK-19559][SS][KAFKA] KafkaOffsetReader's consumers should not be in the same group · fe4fcc57
      Liwei Lin authored
      
      ## What changes were proposed in this pull request?
      
      In `KafkaOffsetReader`, when error occurs, we abort the existing consumer and create a new consumer. In our current implementation, the first consumer and the second consumer would be in the same group (which leads to SPARK-19559), **_violating our intention of the two consumers not being in the same group._**
      
      The cause is that, in our current implementation, the first consumer is created before `groupId` and `nextId` are initialized in the constructor. Then even if `groupId` and `nextId` are increased during the creation of that first consumer, `groupId` and `nextId` would still be initialized to default values in the constructor for the second consumer.
      
      We should make sure that `groupId` and `nextId` are initialized before any consumer is created.
      
      ## How was this patch tested?
      
      Ran 100 times of `KafkaSourceSuite`; all passed
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #16902 from lw-lin/SPARK-19564-.
      
      (cherry picked from commit 2bdbc870)
      Signed-off-by: default avatarShixiong Zhu <shixiong@databricks.com>
      fe4fcc57
  4. Feb 07, 2017
    • Tyson Condie's avatar
      [SPARK-18682][SS] Batch Source for Kafka · e642a07d
      Tyson Condie authored
      
      Today, you can start a stream that reads from kafka. However, given kafka's configurable retention period, it seems like sometimes you might just want to read all of the data that is available now. As such we should add a version that works with spark.read as well.
      The options should be the same as the streaming kafka source, with the following differences:
      startingOffsets should default to earliest, and should not allow latest (which would always be empty).
      endingOffsets should also be allowed and should default to latest. the same assign json format as startingOffsets should also be accepted.
      It would be really good, if things like .limit(n) were enough to prevent all the data from being read (this might just work).
      
      KafkaRelationSuite was added for testing batch queries via KafkaUtils.
      
      Author: Tyson Condie <tcondie@gmail.com>
      
      Closes #16686 from tcondie/SPARK-18682.
      
      (cherry picked from commit 8df44440)
      Signed-off-by: default avatarShixiong Zhu <shixiong@databricks.com>
      e642a07d
  5. Dec 22, 2016
  6. Dec 21, 2016
    • Shixiong Zhu's avatar
      [SPARK-18588][SS][KAFKA] Create a new KafkaConsumer when error happens to fix the flaky test · 17ef57fe
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      When KafkaSource fails on Kafka errors, we should create a new consumer to retry rather than using the existing broken one because it's possible that the broken one will fail again.
      
      This PR also assigns a new group id to the new created consumer for a possible race condition:  the broken consumer cannot talk with the Kafka cluster in `close` but the new consumer can talk to Kafka cluster. I'm not sure if this will happen or not. Just for safety to avoid that the Kafka cluster thinks there are two consumers with the same group id in a short time window. (Note: CachedKafkaConsumer doesn't need this fix since `assign` never uses the group id.)
      
      ## How was this patch tested?
      
      In https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70370/console
      
       , it ran this flaky test 120 times and all passed.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16282 from zsxwing/kafka-fix.
      
      (cherry picked from commit 95efc895)
      Signed-off-by: default avatarTathagata Das <tathagata.das1565@gmail.com>
      17ef57fe
  7. Dec 15, 2016
  8. Dec 13, 2016
  9. Dec 08, 2016
    • Tathagata Das's avatar
      [SPARK-18776][SS] Make Offset for FileStreamSource corrected formatted in json · fcd22e53
      Tathagata Das authored
      
      ## What changes were proposed in this pull request?
      
      - Changed FileStreamSource to use new FileStreamSourceOffset rather than LongOffset. The field is named as `logOffset` to make it more clear that this is a offset in the file stream log.
      - Fixed bug in FileStreamSourceLog, the field endId in the FileStreamSourceLog.get(startId, endId) was not being used at all. No test caught it earlier. Only my updated tests caught it.
      
      Other minor changes
      - Dont use batchId in the FileStreamSource, as calling it batch id is extremely miss leading. With multiple sources, it may happen that a new batch has no new data from a file source. So offset of FileStreamSource != batchId after that batch.
      
      ## How was this patch tested?
      
      Updated unit test.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16205 from tdas/SPARK-18776.
      
      (cherry picked from commit 458fa332)
      Signed-off-by: default avatarTathagata Das <tathagata.das1565@gmail.com>
      fcd22e53
    • Patrick Wendell's avatar
      48aa6775
    • Patrick Wendell's avatar
      Preparing Spark release v2.1.0-rc2 · 08071749
      Patrick Wendell authored
      08071749
  10. Dec 07, 2016
    • Michael Armbrust's avatar
      [SPARK-18754][SS] Rename recentProgresses to recentProgress · 1c641971
      Michael Armbrust authored
      
      Based on an informal survey, users find this option easier to understand / remember.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #16182 from marmbrus/renameRecentProgress.
      
      (cherry picked from commit 70b2bf71)
      Signed-off-by: default avatarTathagata Das <tathagata.das1565@gmail.com>
      1c641971
    • Shixiong Zhu's avatar
      [SPARK-18588][TESTS] Fix flaky test: KafkaSourceStressForDontFailOnDataLossSuite · e9b3afac
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      Fixed the following failures:
      
      ```
      org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 3745 times over 1.0000790851666665 minutes. Last failure message: assertion failed: failOnDataLoss-0 not deleted after timeout.
      ```
      
      ```
      sbt.ForkMain$ForkError: org.apache.spark.sql.streaming.StreamingQueryException: Query query-66 terminated with exception: null
      	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252)
      	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:146)
      Caused by: sbt.ForkMain$ForkError: java.lang.NullPointerException: null
      	at java.util.ArrayList.addAll(ArrayList.java:577)
      	at org.apache.kafka.clients.Metadata.getClusterForCurrentTopics(Metadata.java:257)
      	at org.apache.kafka.clients.Metadata.update(Metadata.java:177)
      	at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleResponse(NetworkClient.java:605)
      	at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeHandleCompletedReceive(NetworkClient.java:582)
      	at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:450)
      	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:269)
      	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:360)
      	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
      	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:192)
      	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitPendingRequests(ConsumerNetworkClient.java:260)
      	at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:222)
      	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:366)
      	at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:978)
      	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
      	at
      ...
      ```
      
      ## How was this patch tested?
      
      Tested in #16048 by running many times.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16109 from zsxwing/fix-kafka-flaky-test.
      
      (cherry picked from commit edc87e18)
      Signed-off-by: default avatarTathagata Das <tathagata.das1565@gmail.com>
      e9b3afac
  11. Dec 06, 2016
    • Tathagata Das's avatar
      [SPARK-18671][SS][TEST-MAVEN] Follow up PR to fix test for Maven · 3750c6e9
      Tathagata Das authored
      
      ## What changes were proposed in this pull request?
      
      Maven compilation seem to not allow resource is sql/test to be easily referred to in kafka-0-10-sql tests. So moved the kafka-source-offset-version-2.1.0 from sql test resources to kafka-0-10-sql test resources.
      
      ## How was this patch tested?
      
      Manually ran maven test
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16183 from tdas/SPARK-18671-1.
      
      (cherry picked from commit 5c6bcdbd)
      Signed-off-by: default avatarTathagata Das <tathagata.das1565@gmail.com>
      3750c6e9
    • Tathagata Das's avatar
      [SPARK-18671][SS][TEST] Added tests to ensure stability of that all Structured... · d20e0d6b
      Tathagata Das authored
      [SPARK-18671][SS][TEST] Added tests to ensure stability of that all Structured Streaming log formats
      
      ## What changes were proposed in this pull request?
      
      To be able to restart StreamingQueries across Spark version, we have already made the logs (offset log, file source log, file sink log) use json. We should added tests with actual json files in the Spark such that any incompatible changes in reading the logs is immediately caught. This PR add tests for FileStreamSourceLog, FileStreamSinkLog, and OffsetSeqLog.
      
      ## How was this patch tested?
      new unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16128 from tdas/SPARK-18671.
      
      (cherry picked from commit 1ef6b296)
      Signed-off-by: default avatarShixiong Zhu <shixiong@databricks.com>
      d20e0d6b
  12. Nov 29, 2016
    • Tathagata Das's avatar
      [SPARK-18516][SQL] Split state and progress in streaming · 28b57c8a
      Tathagata Das authored
      
      This PR separates the status of a `StreamingQuery` into two separate APIs:
       - `status` - describes the status of a `StreamingQuery` at this moment, including what phase of processing is currently happening and if data is available.
       - `recentProgress` - an array of statistics about the most recent microbatches that have executed.
      
      A recent progress contains the following information:
      ```
      {
        "id" : "2be8670a-fce1-4859-a530-748f29553bb6",
        "name" : "query-29",
        "timestamp" : 1479705392724,
        "inputRowsPerSecond" : 230.76923076923077,
        "processedRowsPerSecond" : 10.869565217391303,
        "durationMs" : {
          "triggerExecution" : 276,
          "queryPlanning" : 3,
          "getBatch" : 5,
          "getOffset" : 3,
          "addBatch" : 234,
          "walCommit" : 30
        },
        "currentWatermark" : 0,
        "stateOperators" : [ ],
        "sources" : [ {
          "description" : "KafkaSource[Subscribe[topic-14]]",
          "startOffset" : {
            "topic-14" : {
              "2" : 0,
              "4" : 1,
              "1" : 0,
              "3" : 0,
              "0" : 0
            }
          },
          "endOffset" : {
            "topic-14" : {
              "2" : 1,
              "4" : 2,
              "1" : 0,
              "3" : 0,
              "0" : 1
            }
          },
          "numRecords" : 3,
          "inputRowsPerSecond" : 230.76923076923077,
          "processedRowsPerSecond" : 10.869565217391303
        } ]
      }
      ```
      
      Additionally, in order to make it possible to correlate progress updates across restarts, we change the `id` field from an integer that is unique with in the JVM to a `UUID` that is globally unique.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #15954 from marmbrus/queryProgress.
      
      (cherry picked from commit c3d08e2f)
      Signed-off-by: default avatarMichael Armbrust <michael@databricks.com>
      28b57c8a
    • hyukjinkwon's avatar
      [SPARK-3359][DOCS] Make javadoc8 working for unidoc/genjavadoc compatibility... · 84b2af22
      hyukjinkwon authored
      [SPARK-3359][DOCS] Make javadoc8 working for unidoc/genjavadoc compatibility in Java API documentation
      
      ## What changes were proposed in this pull request?
      
      This PR make `sbt unidoc` complete with Java 8.
      
      This PR roughly includes several fixes as below:
      
      - Fix unrecognisable class and method links in javadoc by changing it from `[[..]]` to `` `...` ``
      
        ```diff
        - * A column that will be computed based on the data in a [[DataFrame]].
        + * A column that will be computed based on the data in a `DataFrame`.
        ```
      
      - Fix throws annotations so that they are recognisable in javadoc
      
      - Fix URL links to `<a href="http..."></a>`.
      
        ```diff
        - * [[http://en.wikipedia.org/wiki/Decision_tree_learning Decision tree]] model for regression.
        + * <a href="http://en.wikipedia.org/wiki/Decision_tree_learning">
        + * Decision tree (Wikipedia)</a> model for regression.
        ```
      
        ```diff
        -   * see http://en.wikipedia.org/wiki/Receiver_operating_characteristic
        +   * see <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic">
        +   * Receiver operating characteristic (Wikipedia)</a>
        ```
      
      - Fix < to > to
      
        - `greater than`/`greater than or equal to` or `less than`/`less than or equal to` where applicable.
      
        - Wrap it with `{{{...}}}` to print them in javadoc or use `{code ...}` or `{literal ..}`. Please refer https://github.com/apache/spark/pull/16013#discussion_r89665558
      
      
      
      - Fix `</p>` complaint
      
      ## How was this patch tested?
      
      Manually tested by `jekyll build` with Java 7 and 8
      
      ```
      java version "1.7.0_80"
      Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
      ```
      
      ```
      java version "1.8.0_45"
      Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16013 from HyukjinKwon/SPARK-3359-errors-more.
      
      (cherry picked from commit f830bb91)
      Signed-off-by: default avatarSean Owen <sowen@cloudera.com>
      Unverified
      84b2af22
  13. Nov 28, 2016
  14. Nov 22, 2016
  15. Nov 19, 2016
    • hyukjinkwon's avatar
      [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note... · 4b396a65
      hyukjinkwon authored
      [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that`/`'''Note:'''` across Scala/Java API documentation
      
      It seems in Scala/Java,
      
      - `Note:`
      - `NOTE:`
      - `Note that`
      - `'''Note:'''`
      - `note`
      
      This PR proposes to fix those to `note` to be consistent.
      
      **Before**
      
      - Scala
        ![2016-11-17 6 16 39](https://cloud.githubusercontent.com/assets/6477701/20383180/1a7aed8c-acf2-11e6-9611-5eaf6d52c2e0.png)
      
      - Java
        ![2016-11-17 6 14 41](https://cloud.githubusercontent.com/assets/6477701/20383096/c8ffc680-acf1-11e6-914a-33460bf1401d.png)
      
      **After**
      
      - Scala
        ![2016-11-17 6 16 44](https://cloud.githubusercontent.com/assets/6477701/20383167/09940490-acf2-11e6-937a-0d5e1dc2cadf.png)
      
      - Java
        ![2016-11-17 6 13 39](https://cloud.githubusercontent.com/assets/6477701/20383132/e7c2a57e-acf1-11e6-9c47-b849674d4d88.png
      
      )
      
      The notes were found via
      
      ```bash
      grep -r "NOTE: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// NOTE: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \ # note that this is a regular expression. So actual matches were mostly `org/apache/spark/api/java/functions ...`
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note that " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note that " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "'''Note:'''" . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// '''Note:''' " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      And then fixed one by one comparing with API documentation/access modifiers.
      
      After that, manually tested via `jekyll build`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15889 from HyukjinKwon/SPARK-18437.
      
      (cherry picked from commit d5b1d5fc)
      Signed-off-by: default avatarSean Owen <sowen@cloudera.com>
      Unverified
      4b396a65
  16. Nov 16, 2016
  17. Nov 14, 2016
  18. Nov 09, 2016
    • Tyson Condie's avatar
      [SPARK-17829][SQL] Stable format for offset log · b7d29256
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      Currently we use java serialization for the WAL that stores the offsets contained in each batch. This has two main issues:
      It can break across spark releases (though this is not the only thing preventing us from upgrading a running query)
      It is unnecessarily opaque to the user.
      I'd propose we require offsets to provide a user readable serialization and use that instead. JSON is probably a good option.
      ## How was this patch tested?
      
      Tests were added for KafkaSourceOffset in [KafkaSourceOffsetSuite](external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala) and for LongOffset in [OffsetSuite](sql/core/src/test/scala/org/apache/spark/sql/streaming/OffsetSuite.scala)
      
      Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
      
       before opening a pull request.
      
      zsxwing marmbrus
      
      Author: Tyson Condie <tcondie@gmail.com>
      Author: Tyson Condie <tcondie@clash.local>
      
      Closes #15626 from tcondie/spark-8360.
      
      (cherry picked from commit 3f62e1b5)
      Signed-off-by: default avatarShixiong Zhu <shixiong@databricks.com>
      b7d29256
  19. Nov 07, 2016
  20. Nov 03, 2016
  21. Oct 27, 2016
    • cody koeninger's avatar
      [SPARK-17813][SQL][KAFKA] Maximum data per trigger · 10423258
      cody koeninger authored
      ## What changes were proposed in this pull request?
      
      maxOffsetsPerTrigger option for rate limiting, proportionally based on volume of different topicpartitions.
      
      ## How was this patch tested?
      
      Added unit test
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #15527 from koeninger/SPARK-17813.
      10423258
  22. Oct 21, 2016
    • cody koeninger's avatar
      [SPARK-17812][SQL][KAFKA] Assign and specific startingOffsets for structured stream · 268ccb9a
      cody koeninger authored
      ## What changes were proposed in this pull request?
      
      startingOffsets takes specific per-topicpartition offsets as a json argument, usable with any consumer strategy
      
      assign with specific topicpartitions as a consumer strategy
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #15504 from koeninger/SPARK-17812.
      268ccb9a
  23. Oct 20, 2016
    • jerryshao's avatar
      [SPARK-17999][KAFKA][SQL] Add getPreferredLocations for KafkaSourceRDD · 947f4f25
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      The newly implemented Structured Streaming `KafkaSource` did calculate the preferred locations for each topic partition, but didn't offer this information through RDD's `getPreferredLocations` method. So here propose to add this method in `KafkaSourceRDD`.
      
      ## How was this patch tested?
      
      Manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #15545 from jerryshao/SPARK-17999.
      947f4f25
  24. Oct 18, 2016
    • cody koeninger's avatar
      [SPARK-17841][STREAMING][KAFKA] drain commitQueue · cd106b05
      cody koeninger authored
      ## What changes were proposed in this pull request?
      
      Actually drain commit queue rather than just iterating it.
      iterator() on a concurrent linked queue won't remove items from the queue, poll() will.
      
      ## How was this patch tested?
      Unit tests
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #15407 from koeninger/SPARK-17841.
      cd106b05
  25. Oct 13, 2016
    • Tathagata Das's avatar
      [SPARK-17731][SQL][STREAMING] Metrics for structured streaming · 7106866c
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Metrics are needed for monitoring structured streaming apps. Here is the design doc for implementing the necessary metrics.
      https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing
      
      Specifically, this PR adds the following public APIs changes.
      
      ### New APIs
      - `StreamingQuery.status` returns a `StreamingQueryStatus` object (renamed from `StreamingQueryInfo`, see later)
      
      - `StreamingQueryStatus` has the following important fields
        - inputRate - Current rate (rows/sec) at which data is being generated by all the sources
        - processingRate - Current rate (rows/sec) at which the query is processing data from
                                        all the sources
        - ~~outputRate~~ - *Does not work with wholestage codegen*
        - latency - Current average latency between the data being available in source and the sink writing the corresponding output
        - sourceStatuses: Array[SourceStatus] - Current statuses of the sources
        - sinkStatus: SinkStatus - Current status of the sink
        - triggerStatus - Low-level detailed status of the last completed/currently active trigger
          - latencies - getOffset, getBatch, full trigger, wal writes
          - timestamps - trigger start, finish, after getOffset, after getBatch
          - numRows - input, output, state total/updated rows for aggregations
      
      - `SourceStatus` has the following important fields
        - inputRate - Current rate (rows/sec) at which data is being generated by the source
        - processingRate - Current rate (rows/sec) at which the query is processing data from the source
        - triggerStatus - Low-level detailed status of the last completed/currently active trigger
      
      - Python API for `StreamingQuery.status()`
      
      ### Breaking changes to existing APIs
      **Existing direct public facing APIs**
      - Deprecated direct public-facing APIs `StreamingQuery.sourceStatuses` and `StreamingQuery.sinkStatus` in favour of `StreamingQuery.status.sourceStatuses/sinkStatus`.
        - Branch 2.0 should have it deprecated, master should have it removed.
      
      **Existing advanced listener APIs**
      - `StreamingQueryInfo` renamed to `StreamingQueryStatus` for consistency with `SourceStatus`, `SinkStatus`
         - Earlier StreamingQueryInfo was used only in the advanced listener API, but now it is used in direct public-facing API (StreamingQuery.status)
      
      - Field `queryInfo` in listener events `QueryStarted`, `QueryProgress`, `QueryTerminated` changed have name `queryStatus` and return type `StreamingQueryStatus`.
      
      - Field `offsetDesc` in `SourceStatus` was Option[String], converted it to `String`.
      
      - For `SourceStatus` and `SinkStatus` made constructor private instead of private[sql] to make them more java-safe. Instead added `private[sql] object SourceStatus/SinkStatus.apply()` which are harder to accidentally use in Java.
      
      ## How was this patch tested?
      
      Old and new unit tests.
      - Rate calculation and other internal logic of StreamMetrics tested by StreamMetricsSuite.
      - New info in statuses returned through StreamingQueryListener is tested in StreamingQueryListenerSuite.
      - New and old info returned through StreamingQuery.status is tested in StreamingQuerySuite.
      - Source-specific tests for making sure input rows are counted are is source-specific test suites.
      - Additional tests to test minor additions in LocalTableScanExec, StateStore, etc.
      
      Metrics also manually tested using Ganglia sink
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #15307 from tdas/SPARK-17731.
      7106866c
    • Shixiong Zhu's avatar
      [SPARK-17834][SQL] Fetch the earliest offsets manually in KafkaSource instead... · 08eac356
      Shixiong Zhu authored
      [SPARK-17834][SQL] Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer
      
      ## What changes were proposed in this pull request?
      
      Because `KafkaConsumer.poll(0)` may update the partition offsets, this PR just calls `seekToBeginning` to manually set the earliest offsets for the KafkaSource initial offsets.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15397 from zsxwing/SPARK-17834.
      08eac356
  26. Oct 12, 2016
Loading