[SPARK-17346][SQL] Add Kafka source for Structured Streaming
## What changes were proposed in this pull request?

This PR adds a new project `external/kafka-0-10-sql` for the Structured Streaming Kafka source.

It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing

tdas did most of the work, and part of it was inspired by koeninger's work.

### Introduction

The Kafka source is a Structured Streaming data source that polls data from Kafka. The schema of the data it reads is as follows:

Column | Type
---- | ----
key | binary
value | binary
topic | string
partition | int
offset | long
timestamp | long
timestampType | int

The source can handle topic deletion. However, the user should make sure that no Spark job is processing the data when a topic is deleted.

### Configuration

The user can use `DataStreamReader.option` to set the following configurations.

Kafka Source's options | value | default | meaning
------ | ------- | ------ | -----
startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started: either "earliest", which starts from the earliest offset, or "latest", which starts from the latest offset. Note: this only applies when a new streaming query is started; a resumed query always picks up from where it left off.
failOnDataLost | [true, false] | true | Whether to fail the query when data may have been lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it if it does not behave as you expect.
subscribe | A comma-separated list of topics | (none) | The list of topics to subscribe to. Only one of the "subscribe" and "subscribePattern" options can be specified for the Kafka source.
subscribePattern | Java regex string | (none) | The pattern used to subscribe to topics. Only one of the "subscribe" and "subscribePattern" options can be specified for the Kafka source.
kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors.
fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fetching the latest Kafka offsets.
fetchOffset.retryIntervalMs | long | 10 | Milliseconds to wait before retrying to fetch Kafka offsets.

Kafka's own configurations can be set via `DataStreamReader.option` with the `kafka.` prefix, e.g., `stream.option("kafka.bootstrap.servers", "host:port")`.

### Usage

* Subscribe to 1 topic

```Scala
// Assumes `spark` is an existing SparkSession.
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1")
  .load()
```

* Subscribe to multiple topics

```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1,topic2")
  .load()
```

* Subscribe to a pattern

```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribePattern", "topic.*")
  .load()
```

## How was this patch tested?

The new unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Shixiong Zhu <zsxwing@gmail.com>
Author: cody koeninger <cody@koeninger.org>

Closes #15102 from zsxwing/kafka-source.
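
The option rules described under Configuration (only one of `subscribe` and `subscribePattern`, and Kafka's own settings passed with a `kafka.` prefix) can be illustrated with a small standalone sketch. This is not the code added by this PR: `KafkaOptionsSketch` and `validate` are hypothetical names, and both the stripping of the `kafka.` prefix and the rejection of the case where neither subscription option is given are assumptions made for illustration.

```Scala
// Hypothetical helper for illustration only; not the implementation in this PR.
object KafkaOptionsSketch {
  private val KafkaPrefix = "kafka."

  /** Returns Right(kafkaParams) if the options are consistent, else Left(error message). */
  def validate(options: Map[String, String]): Either[String, Map[String, String]] = {
    // Collect whichever subscription options the caller supplied.
    val strategies = Seq("subscribe", "subscribePattern").filter(options.contains)
    strategies match {
      case Seq(_) =>
        // Keep only "kafka."-prefixed entries and strip the prefix (assumed behavior).
        val kafkaParams = options.collect {
          case (k, v) if k.startsWith(KafkaPrefix) => k.stripPrefix(KafkaPrefix) -> v
        }
        Right(kafkaParams)
      case Seq() =>
        // Assumption: a missing subscription option is treated as an error.
        Left("One of 'subscribe' or 'subscribePattern' must be specified")
      case _ =>
        Left("Only one of 'subscribe' and 'subscribePattern' can be specified")
    }
  }
}
```

With this sketch, `Map("subscribe" -> "topic1", "kafka.bootstrap.servers" -> "host:port")` is accepted, while a map containing both `subscribe` and `subscribePattern` is rejected with an error message.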
Changed files:
- core/src/main/scala/org/apache/spark/util/UninterruptibleThread.scala: 0 additions, 7 deletions
- dev/run-tests.py: 1 addition, 1 deletion
- dev/sparktestsupport/modules.py: 12 additions, 0 deletions
- docs/structured-streaming-kafka-integration.md: 239 additions, 0 deletions
- docs/structured-streaming-programming-guide.md: 6 additions, 1 deletion
- external/kafka-0-10-sql/pom.xml: 82 additions, 0 deletions
- external/kafka-0-10-sql/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister: 1 addition, 0 deletions
- external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala: 152 additions, 0 deletions
- external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala: 399 additions, 0 deletions
- external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceOffset.scala: 54 additions, 0 deletions
- external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala: 282 additions, 0 deletions
- external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceRDD.scala: 148 additions, 0 deletions
- external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/package-info.java: 21 additions, 0 deletions
- external/kafka-0-10-sql/src/test/resources/log4j.properties: 28 additions, 0 deletions
- external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala: 39 additions, 0 deletions
- external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala: 424 additions, 0 deletions
- external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala: 339 additions, 0 deletions
- pom.xml: 1 addition, 0 deletions
- project/SparkBuild.scala: 3 additions, 3 deletions
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala: 7 additions, 1 deletion