Skip to content
Snippets Groups Projects
  1. Apr 05, 2017
  2. Mar 30, 2017
    • Seigneurin, Alexis (CONT)'s avatar
      [DOCS][MINOR] Fixed a few typos in the Structured Streaming documentation · 669a11b6
      Seigneurin, Alexis (CONT) authored
      Fixed a few typos.
      
      There is one more I'm not sure of:
      
      ```
              Append mode uses watermark to drop old aggregation state. But the output of a
              windowed aggregation is delayed the late threshold specified in `withWatermark()` as by
              the modes semantics, rows can be added to the Result Table only once after they are
      ```
      
      Not sure how to change `is delayed the late threshold`.
      
      Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>
      
      Closes #17443 from aseigneurin/typos.
      669a11b6
  3. Mar 22, 2017
    • uncleGen's avatar
      [SPARK-20021][PYSPARK] Miss backslash in python code · facfd608
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      Add backslash for line continuation in python code.
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: uncleGen <hustyugm@gmail.com>
      Author: dylon <hustyugm@gmail.com>
      
      Closes #17352 from uncleGen/python-example-doc.
      facfd608
  4. Mar 12, 2017
    • uncleGen's avatar
      [DOCS][SS] fix structured streaming python example · e29a74d5
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      - SS python example: `TypeError: 'xxx' object is not callable`
      - some other doc issue.
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #17257 from uncleGen/docs-ss-python.
      e29a74d5
  5. Mar 09, 2017
    • Liwei Lin's avatar
      [SPARK-19715][STRUCTURED STREAMING] Option to Strip Paths in FileSource · 40da4d18
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      Today, we compare the whole path when deciding if a file is new in the FileSource for structured streaming. However, this would cause false negatives in the case where the path has changed in a cosmetic way (i.e. changing `s3n` to `s3a`).
      
      This patch adds an option `fileNameOnly` that causes the new file check to be based only on the filename (but still store the whole path in the log).
      
      ## Usage
      
      ```scala
      spark
        .readStream
        .option("fileNameOnly", true)
        .text("s3n://bucket/dir1/dir2")
        .writeStream
        ...
      ```
      ## How was this patch tested?
      
      Added a test case
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #17120 from lw-lin/filename-only.
      40da4d18
  6. Feb 24, 2017
    • Ramkumar Venkataraman's avatar
      [MINOR][DOCS] Fix few typos in structured streaming doc · 1b9ba258
      Ramkumar Venkataraman authored
      ## What changes were proposed in this pull request?
      
      Minor typo in `even-time`, which is changed to `event-time` and a couple of grammatical errors fix.
      
      ## How was this patch tested?
      
      N/A - since this is a doc fix. I did a jekyll build locally though.
      
      Author: Ramkumar Venkataraman <rvenkataraman@paypal.com>
      
      Closes #17037 from ramkumarvenkat/doc-fix.
      1b9ba258
  7. Feb 16, 2017
    • Sean Owen's avatar
      [SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support · 0e240549
      Sean Owen authored
      - Move external/java8-tests tests into core, streaming, sql and remove
      - Remove MaxPermGen and related options
      - Fix some reflection / TODOs around Java 8+ methods
      - Update doc references to 1.7/1.8 differences
      - Remove Java 7/8 related build profiles
      - Update some plugins for better Java 8 compatibility
      - Fix a few Java-related warnings
      
      For the future:
      
      - Update Java 8 examples to fully use Java 8
      - Update Java tests to use lambdas for simplicity
      - Update Java internal implementations to use lambdas
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16871 from srowen/SPARK-19493.
      Unverified
      0e240549
  8. Jan 10, 2017
    • Shixiong Zhu's avatar
      [SPARK-19140][SS] Allow update mode for non-aggregation streaming queries · bc6c56e9
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR allow update mode for non-aggregation streaming queries. It will be same as the append mode if a query has no aggregations.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16520 from zsxwing/update-without-agg.
      bc6c56e9
  9. Jan 06, 2017
  10. Jan 04, 2017
    • Niranjan Padmanabhan's avatar
      [MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo · a1e40b1f
      Niranjan Padmanabhan authored
      ## What changes were proposed in this pull request?
      There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.
      
      ## How was this patch tested?
      N/A since only docs or comments were updated.
      
      Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>
      
      Closes #16455 from neurons/np.structure_streaming_doc.
      Unverified
      a1e40b1f
  11. Jan 02, 2017
  12. Dec 28, 2016
  13. Nov 16, 2016
  14. Nov 13, 2016
  15. Nov 01, 2016
  16. Oct 05, 2016
    • Shixiong Zhu's avatar
      [SPARK-17346][SQL] Add Kafka source for Structured Streaming · 9293734d
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR adds a new project ` external/kafka-0-10-sql` for Structured Streaming Kafka source.
      
      It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
      
      tdas did most of work and part of them was inspired by koeninger's work.
      
      ### Introduction
      
      The Kafka source is a structured streaming data source to poll data from Kafka. The schema of reading data is as follows:
      
      Column | Type
      ---- | ----
      key | binary
      value | binary
      topic | string
      partition | int
      offset | long
      timestamp | long
      timestampType | int
      
      The source can deal with deleting topics. However, the user should make sure there is no Spark job processing the data when deleting a topic.
      
      ### Configuration
      
      The user can use `DataStreamReader.option` to set the following configurations.
      
      Kafka Source's options | value | default | meaning
      ------ | ------- | ------ | -----
      startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest" which is from the earliest offset, or "latest" which is just from the latest offset. Note: This only applies when a new Streaming query is started, and that resuming will always pick up from where the query left off.
      failOnDataLost | [true, false] | true | Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected.
      subscribe | A comma-separated list of topics | (none) | The topic list to subscribe. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
      subscribePattern | Java regex string | (none) | The pattern used to subscribe the topic. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
      kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors
      fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fatch Kafka latest offsets.
      fetchOffset.retryIntervalMs | long | 10 | milliseconds to wait before retrying to fetch Kafka offsets
      
      Kafka's own configurations can be set via `DataStreamReader.option` with `kafka.` prefix, e.g, `stream.option("kafka.bootstrap.servers", "host:port")`
      
      ### Usage
      
      * Subscribe to 1 topic
      ```Scala
      spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribe", "topic1")
        .load()
      ```
      
      * Subscribe to multiple topics
      ```Scala
      spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribe", "topic1,topic2")
        .load()
      ```
      
      * Subscribe to a pattern
      ```Scala
      spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribePattern", "topic.*")
        .load()
      ```
      
      ## How was this patch tested?
      
      The new unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Shixiong Zhu <zsxwing@gmail.com>
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #15102 from zsxwing/kafka-source.
      9293734d
  17. Sep 26, 2016
    • Liang-Chi Hsieh's avatar
      [SPARK-17153][SQL] Should read partition data when reading new files in filestream without globbing · 8135e0e5
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      When reading file stream with non-globbing path, the results return data with all `null`s for the
      partitioned columns. E.g.,
      
          case class A(id: Int, value: Int)
          val data = spark.createDataset(Seq(
            A(1, 1),
            A(2, 2),
            A(2, 3))
          )
          val url = "/tmp/test"
          data.write.partitionBy("id").parquet(url)
          spark.read.parquet(url).show
      
          +-----+---+
          |value| id|
          +-----+---+
          |    2|  2|
          |    3|  2|
          |    1|  1|
          +-----+---+
      
          val s = spark.readStream.schema(spark.read.load(url).schema).parquet(url)
          s.writeStream.queryName("test").format("memory").start()
      
          sql("SELECT * FROM test").show
      
          +-----+----+
          |value|  id|
          +-----+----+
          |    2|null|
          |    3|null|
          |    1|null|
          +-----+----+
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #14803 from viirya/filestreamsource-option.
      8135e0e5
  18. Sep 01, 2016
    • Seigneurin, Alexis (CONT)'s avatar
      fixed typos · dd859f95
      Seigneurin, Alexis (CONT) authored
      fixed 2 typos
      
      Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>
      
      Closes #14877 from aseigneurin/fix-typo-2.
      dd859f95
  19. Aug 30, 2016
    • Dmitriy Sokolov's avatar
      [MINOR][DOCS] Fix minor typos in python example code · d4eee993
      Dmitriy Sokolov authored
      ## What changes were proposed in this pull request?
      
      Fix minor typos python example code in streaming programming guide
      
      ## How was this patch tested?
      
      N/A
      
      Author: Dmitriy Sokolov <silentsokolov@gmail.com>
      
      Closes #14805 from silentsokolov/fix-typos.
      d4eee993
  20. Aug 29, 2016
    • Seigneurin, Alexis (CONT)'s avatar
      fixed a typo · 08913ce0
      Seigneurin, Alexis (CONT) authored
      idempotant -> idempotent
      
      Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>
      
      Closes #14833 from aseigneurin/fix-typo.
      08913ce0
  21. Aug 23, 2016
  22. Aug 22, 2016
  23. Aug 13, 2016
    • Jagadeesan's avatar
      [SPARK-12370][DOCUMENTATION] Documentation should link to examples … · e46cb78b
      Jagadeesan authored
      ## What changes were proposed in this pull request?
      
      When documentation is built is should reference examples from the same build. There are times when the docs have links that point to files in the GitHub head which may not be valid on the current release. Changed that in URLs to make them point to the right tag in git using ```SPARK_VERSION_SHORT```
      
      …from its own release version] [Streaming programming guide]
      
      Author: Jagadeesan <as2@us.ibm.com>
      
      Closes #14596 from jagadeesanas2/SPARK-12370.
      e46cb78b
  24. Aug 11, 2016
    • hyukjinkwon's avatar
      [SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation · 7186e8c3
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.
      
      This PR fixes three things below:
      
       - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.
      
      - Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.
      
      - Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).
      
      ## How was this patch tested?
      
      N/A
      
      Closes https://github.com/apache/spark/pull/14491
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>
      
      Closes #14564 from HyukjinKwon/SPARK-16886.
      7186e8c3
  25. Jul 21, 2016
  26. Jul 19, 2016
    • Ahmed Mahran's avatar
      [MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar · 6caa2205
      Ahmed Mahran authored
      ## What changes were proposed in this pull request?
      
      Minor fixes correcting some typos, punctuations, grammar.
      Adding more anchors for easy navigation.
      Fixing minor issues with code snippets.
      
      ## How was this patch tested?
      
      `jekyll serve`
      
      Author: Ahmed Mahran <ahmed.mahran@mashin.io>
      
      Closes #14234 from ahmed-mahran/b-struct-streaming-docs.
      6caa2205
  27. Jul 13, 2016
    • James Thomas's avatar
      [SPARK-16114][SQL] updated structured streaming guide · 51a6706b
      James Thomas authored
      ## What changes were proposed in this pull request?
      
      Updated structured streaming programming guide with new windowed example.
      
      ## How was this patch tested?
      
      Docs
      
      Author: James Thomas <jamesjoethomas@gmail.com>
      
      Closes #14183 from jjthomas/ss_docs_update.
      51a6706b
  28. Jun 30, 2016
  29. Jun 29, 2016
Loading