[SPARK-19968][SS] Use a cached instance of `KafkaProducer` instead of creating one every batch.
## What changes were proposed in this pull request?

In summary, the cost of recreating a `KafkaProducer` for every batch written is high, because each new instance starts a lot of threads and makes connections, and then closes them again. The Kafka docs promise that a `KafkaProducer` instance is thread safe, and they encourage reusing one instance when writing from multiple threads. Furthermore, I measured a 10x improvement in latency with this patch.

### These are times that addBatch took in ms. Without applying this patch

### These are times that addBatch took in ms. After applying this patch

## How was this patch tested?

Ran distributed benchmarks comparing runs with and without this patch. Added relevant unit tests.

Author: Prashant Sharma <prashsh1@in.ibm.com>

Closes #17308 from ScrapCodes/cached-kafka-producer.
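The caching idea in the summary can be illustrated with a minimal sketch. This is not the code from this patch: `StubProducer` and `ProducerCache` are hypothetical names, and `StubProducer` stands in for a thread-safe producer such as `KafkaProducer` so the example runs without a Kafka dependency. It assumes producers are interchangeable whenever their configuration is identical, so the configuration map can serve as the cache key.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger
import java.util.function.{Function => JFunction}

// Hypothetical stand-in for a thread-safe, expensive-to-create producer.
final class StubProducer(val config: Map[String, String]) {
  @volatile var closed: Boolean = false
  def close(): Unit = { closed = true }
}

// Hypothetical cache: one shared producer per distinct configuration.
object ProducerCache {
  // Counts how many producers were actually constructed (for illustration).
  val created = new AtomicInteger(0)

  private val cache = new ConcurrentHashMap[Map[String, String], StubProducer]()

  // Returns the cached producer for this configuration, creating it at most once.
  def getOrCreate(config: Map[String, String]): StubProducer =
    cache.computeIfAbsent(config, new JFunction[Map[String, String], StubProducer] {
      override def apply(key: Map[String, String]): StubProducer = {
        created.incrementAndGet()
        new StubProducer(key)
      }
    })

  // Removes the producer for this configuration from the cache and closes it.
  def close(config: Map[String, String]): Unit = {
    val producer = cache.remove(config)
    if (producer != null) producer.close()
  }
}
```

With this shape, every batch that calls `ProducerCache.getOrCreate` with the same configuration gets the same instance, so the thread and connection setup cost is paid once rather than per batch. A production cache would also need a policy for expiring idle producers, which this sketch leaves out.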
Changed files:
- external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaProducer.scala (112 additions, 0 deletions)
- external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala (7 additions, 7 deletions)
- external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriteTask.scala (8 additions, 9 deletions)
- external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala (1 addition, 2 deletions)
- external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/CachedKafkaProducerSuite.scala (78 additions, 0 deletions)