[SPARK-5154] [PySpark] [Streaming] Kafka streaming support in Python
    Davies Liu authored
    This PR brings the Python API for Spark Streaming Kafka data source.
    
    ```
        class KafkaUtils(__builtin__.object)
         |  Static methods defined here:
         |
     |  createStream(ssc, zkQuorum, groupId, topics, storageLevel=StorageLevel(True, True, False, False, 2), keyDecoder=<function utf8_decoder>, valueDecoder=<function utf8_decoder>)
         |      Create an input stream that pulls messages from a Kafka Broker.
         |
         |      :param ssc:  StreamingContext object
         |      :param zkQuorum:  Zookeeper quorum (hostname:port,hostname:port,..).
         |      :param groupId:  The group id for this consumer.
         |      :param topics:  Dict of (topic_name -> numPartitions) to consume.
         |                      Each partition is consumed in its own thread.
         |      :param storageLevel:  RDD storage level.
         |      :param keyDecoder:  A function used to decode key
         |      :param valueDecoder:  A function used to decode value
         |      :return: A DStream object
    ```
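The default decoders in the signature above (`utf8_decoder`) turn the raw Kafka message bytes into UTF-8 strings. A minimal plain-Python sketch of what such a decoder plausibly looks like (the `None` passthrough for empty messages is an assumption, not quoted from the source):

```python
def utf8_decoder(s):
    """Decode Kafka message bytes to a UTF-8 string; pass None through.

    Sketch of the default decoder named in the createStream signature;
    the None handling is an assumption for empty/tombstone messages.
    """
    if s is None:
        return None
    return s.decode("utf-8")
```

A custom decoder can be passed instead, e.g. `valueDecoder=json.loads` for JSON payloads, since both decoders are just functions applied to the raw bytes.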
Run the example:
    
    ```
    bin/spark-submit --driver-class-path external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar examples/src/main/python/streaming/kafka_wordcount.py localhost:2181 test
    ```
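The per-batch logic in `kafka_wordcount.py` is a standard streaming word count over the decoded message values. A plain-Python sketch of the equivalent per-batch computation (the helper name `count_words` is invented for illustration; the commented DStream pipeline approximates, rather than quotes, the example script):

```python
def count_words(lines):
    """Count words across an iterable of text lines: a plain-Python
    analogue of the flatMap/map/reduceByKey pipeline run on each batch."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# In the actual PySpark job the same logic runs per micro-batch, roughly:
#   lines = KafkaUtils.createStream(ssc, zkQuorum, "consumer-group",
#                                   {topic: 1}).map(lambda kv: kv[1])
#   counts = lines.flatMap(lambda l: l.split()) \
#                 .map(lambda w: (w, 1)) \
#                 .reduceByKey(lambda a, b: a + b)
```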
    
    Author: Davies Liu <davies@databricks.com>
    Author: Tathagata Das <tdas@databricks.com>
    
    Closes #3715 from davies/kafka and squashes the following commits:
    
    d93bfe0 [Davies Liu] Update make-distribution.sh
    4280d04 [Davies Liu] address comments
    e6d0427 [Davies Liu] Merge branch 'master' of github.com:apache/spark into kafka
    f257071 [Davies Liu] add tests for null in RDD
    23b039a [Davies Liu] address comments
    9af51c4 [Davies Liu] Merge branch 'kafka' of github.com:davies/spark into kafka
    a74da87 [Davies Liu] address comments
    dc1eed0 [Davies Liu] Update kafka_wordcount.py
    31e2317 [Davies Liu] Update kafka_wordcount.py
    370ba61 [Davies Liu] Update kafka.py
    97386b3 [Davies Liu] address comment
    2c567a5 [Davies Liu] update logging and comment
    33730d1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into kafka
    adeeb38 [Davies Liu] Merge pull request #3 from tdas/kafka-python-api
    aea8953 [Tathagata Das] Kafka-assembly for Python API
    eea16a7 [Davies Liu] refactor
    f6ce899 [Davies Liu] add example and fix bugs
    98c8d17 [Davies Liu] fix python style
    5697a01 [Davies Liu] bypass decoder in scala
    048dbe6 [Davies Liu] fix python style
    75d485e [Davies Liu] add mqtt
    07923c4 [Davies Liu] support kafka in Python