  1. May 02, 2016
  2. May 01, 2016
    • [SPARK-13425][SQL] Documentation for CSV datasource options · a832cef1
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds the explanation and documentation for CSV options for reading and writing.
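
      As a quick illustration of the options being documented, a minimal hedged sketch (`header`, `inferSchema`, and `quote` are standard CSV options of the time; the paths are hypothetical):

      ```python
      # Minimal sketch of reading and writing CSV with options; paths are hypothetical.
      df = (spark.read
            .option("header", "true")       # treat the first line as column names
            .option("inferSchema", "true")  # infer column types from the data
            .csv("/tmp/people.csv"))

      (df.write
         .option("header", "true")          # write column names as the first line
         .option("quote", '"')              # quote character for fields
         .csv("/tmp/people_out"))
      ```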
      
      ## How was this patch tested?
      
      Documentation style tests with `./dev/run_tests`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Hyukjin Kwon <gurwls223@gmail.com>
      
      Closes #12817 from HyukjinKwon/SPARK-13425.
  3. Apr 28, 2016
    • [SPARK-14555] Second cut of Python API for Structured Streaming · 78c8aaf8
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      This PR adds Python APIs for:
       - `ContinuousQueryManager`
       - `ContinuousQueryException`
      
      `ContinuousQueryException` is a very basic wrapper; it doesn't provide the functionality that the Scala side provides, but it follows the same pattern as `AnalysisException`.
      
      For `ContinuousQueryManager`, all APIs are provided except for registering listeners.
      
      This PR also attempts to fix test flakiness by stopping all active streams just before tests.
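
      A minimal sketch of how the manager might be used (names such as `streams`, `active`, and `resetTerminated` mirror the Scala API of the time and should be read as assumptions):

      ```python
      # Hedged sketch: the ContinuousQueryManager is reached via sqlContext.streams.
      cqm = sqlContext.streams      # the ContinuousQueryManager
      for query in cqm.active:      # currently active continuous queries
          print(query.name)
      cqm.resetTerminated()         # forget terminated queries so that
                                    # awaitAnyTermination() waits for new ones
      ```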
      
      ## How was this patch tested?
      
      Python Doc tests and unit tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #12673 from brkyvz/pyspark-cqm.
    • [SPARK-14945][PYTHON] SparkSession Python API · 89addd40
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      ```
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /__ / .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
            /_/
      
      Using Python version 2.7.5 (default, Mar  9 2014 22:15:05)
      SparkSession available as 'spark'.
      >>> spark
      <pyspark.sql.session.SparkSession object at 0x101f3bfd0>
      >>> spark.sql("SHOW TABLES").show()
      ...
      +---------+-----------+
      |tableName|isTemporary|
      +---------+-----------+
      |      src|      false|
      +---------+-----------+
      
      >>> spark.range(1, 10, 2).show()
      +---+
      | id|
      +---+
      |  1|
      |  3|
      |  5|
      |  7|
      |  9|
      +---+
      ```
      **Note**: This API is NOT complete in its current state. In particular, for now I left out the `conf` and `catalog` APIs, which were added later in Scala. These will be added before 2.0.
      
      ## How was this patch tested?
      
      Python tests.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12746 from andrewor14/python-spark-session.
  4. Apr 22, 2016
    • [SPARK-13266] [SQL] None reader/writer options were not translated to "null" · 056883e0
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      In Python, the `option` and `options` method of `DataFrameReader` and `DataFrameWriter` were sending the string "None" instead of `null` when passed `None`, therefore making it impossible to send an actual `null`. This fixes that problem.
      
      This is based on #11305 from mathieulongtin.
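
      A hedged sketch of the fixed behaviour (the `nullValue` option key and the path are illustrative, not taken from the patch):

      ```python
      # With this fix, a Python None is sent to the JVM as an actual null
      # rather than the string "None"; option name and path are hypothetical.
      reader = sqlContext.read.format("csv")
      reader = reader.option("nullValue", None)  # now null, not "None"
      df = reader.load("/tmp/data.csv")
      ```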
      
      ## How was this patch tested?
      
      Added test to readwriter.py.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      Author: mathieu longtin <mathieu.longtin@nuance.com>
      
      Closes #12494 from viirya/py-df-none-option.
  5. Apr 20, 2016
    • [SPARK-14555] First cut of Python API for Structured Streaming · 80bf48f4
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      This patch provides a first cut of Python APIs for Structured Streaming. It adds the new classes:
       - ContinuousQuery
       - Trigger
       - ProcessingTime
      in pyspark under `pyspark.sql.streaming`.
      
      In addition, it contains the new methods added under:
       -  `DataFrameWriter`
           a) `startStream`
           b) `trigger`
           c) `queryName`
      
       -  `DataFrameReader`
           a) `stream`
      
       - `DataFrame`
          a) `isStreaming`
      
      This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
       - `exception`
       - `sourceStatuses`
       - `sinkStatus`
      
      They may be added in a follow-up.
      
      This PR also contains some very minor doc fixes on the Scala side.
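
      A hedged sketch tying the new pieces together (exact signatures in this first cut may differ, e.g. whether `trigger()` takes a `ProcessingTime` instance; paths are hypothetical):

      ```python
      # Sketch of the pre-2.0 streaming API added here: stream() on the reader;
      # startStream(), trigger(), and queryName() on the writer.
      from pyspark.sql.streaming import ProcessingTime

      df = sqlContext.read.format("text").stream("/tmp/input")  # streaming DataFrame
      assert df.isStreaming

      query = (df.write
               .format("parquet")
               .option("checkpointLocation", "/tmp/ckpt")
               .queryName("counts")
               .trigger(ProcessingTime("5 seconds"))
               .startStream("/tmp/output"))
      query.stop()
      ```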
      
      ## How was this patch tested?
      
      Python doc tests
      
      TODO:
       - [ ] verify Python docs look good
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      Author: Burak Yavuz <burak@databricks.com>
      
      Closes #12320 from brkyvz/stream-python.
  6. Apr 03, 2016
    • [SPARK-14231] [SQL] JSON data source infers floating-point values as a double... · 2262a933
      hyukjinkwon authored
      [SPARK-14231] [SQL] JSON data source infers floating-point values as a double when they do not fit in a decimal
      
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-14231
      
      Currently, the JSON data source can infer `DecimalType` for big numbers, and its `floatAsBigDecimal` option reads floating-point values as `DecimalType`.
      
      But Spark's `DecimalType` has a few restrictions:
      
      1. The precision cannot be bigger than 38.
      2. The scale cannot be bigger than the precision.
      
      Currently, neither restriction is handled.
      
      This PR handles the cases by inferring them as `DoubleType`. Also, the option name was changed from `floatAsBigDecimal` to `prefersDecimal` as suggested [here](https://issues.apache.org/jira/browse/SPARK-14231?focusedCommentId=15215579&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15215579).
      
      So, the code below:
      
      ```scala
      def doubleRecords: RDD[String] =
        sqlContext.sparkContext.parallelize(
          s"""{"a": 1${"0" * 38}, "b": 0.01}""" ::
          s"""{"a": 2${"0" * 38}, "b": 0.02}""" :: Nil)
      
      val jsonDF = sqlContext.read
        .option("prefersDecimal", "true")
        .json(doubleRecords)
      jsonDF.printSchema()
      ```
      
      produces the following:
      
      - **Before**
      
      ```scala
      org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
      	at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
      	at
      ...
      ```
      
      - **After**
      
      ```scala
      root
       |-- a: double (nullable = true)
       |-- b: double (nullable = true)
      ```
      
      ## How was this patch tested?
      
      Unit tests, plus `./dev/run_tests` for code style checks.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #12030 from HyukjinKwon/SPARK-14231.
  7. Mar 22, 2016
    • [SPARK-13953][SQL] Specifying the field name for corrupted record via option at JSON datasource · 4e09a0d5
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-13953
      
      Currently, the JSON data source creates a new field in `PERMISSIVE` mode for storing the malformed string.
      This field can be renamed via the `spark.sql.columnNameOfCorruptRecord` option, but that is a global configuration.
      
      This PR makes the option applicable per read, specified via `option()`. When set this way, it overrides `spark.sql.columnNameOfCorruptRecord`.
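
      A hedged sketch of the per-read option (the column name `_corrupt` and the path are hypothetical):

      ```python
      # The corrupt-record column name is now a per-read option, overriding
      # the global spark.sql.columnNameOfCorruptRecord configuration.
      df = (sqlContext.read
            .option("columnNameOfCorruptRecord", "_corrupt")
            .json("/tmp/records.json"))
      ```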
      
      ## How was this patch tested?
      
      Unit tests, plus `./dev/run_tests` for code style checks.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11881 from HyukjinKwon/SPARK-13953.
  8. Mar 21, 2016
    • [SPARK-13764][SQL] Parse modes in JSON data source · e4740881
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, there is no way to control the behaviour when the JSON data source fails to parse corrupt records.
      
      This PR adds support for parse modes, just like the CSV data source. There are three modes:
      
      - `PERMISSIVE`: When it fails to parse, this sets the field to `null`. This is the default mode.
      - `DROPMALFORMED`: When it fails to parse, this drops the whole record.
      - `FAILFAST`: When it fails to parse, it just throws an exception.
      
      This PR also makes the JSON data source share `ParseModes` with the CSV data source.
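
      A minimal sketch of the new modes on the JSON reader (the `mode` option key follows the CSV data source; the path is hypothetical):

      ```python
      # PERMISSIVE (the default) nulls out unparseable fields; DROPMALFORMED
      # drops bad records; FAILFAST throws on the first bad record.
      lenient = sqlContext.read.option("mode", "DROPMALFORMED").json("/tmp/records.json")
      strict = sqlContext.read.option("mode", "FAILFAST").json("/tmp/records.json")
      ```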
      
      ## How was this patch tested?
      
      Unit tests, plus `./dev/run_tests` for code style checks.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11756 from HyukjinKwon/SPARK-13764.
  9. Mar 03, 2016
  10. Feb 29, 2016
  11. Jan 28, 2016
  12. Jan 04, 2016
  13. Jan 03, 2016
  14. Dec 17, 2015
    • [SQL] Update SQLContext.read.text doc · 6e077166
      Yanbo Liang authored
      Since we renamed the column from ```text``` to ```value``` for DataFrames loaded by ```SQLContext.read.text```, we need to update the doc.
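
      For reference, a minimal sketch of the renamed column (the path is hypothetical):

      ```python
      # The single column of a text-loaded DataFrame is now called "value".
      df = sqlContext.read.text("/tmp/app.log")
      df.printSchema()
      # root
      #  |-- value: string (nullable = true)
      ```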
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #10349 from yanboliang/text-value.
  15. Nov 24, 2015
  16. Nov 18, 2015
  17. Nov 16, 2015
  18. Nov 06, 2015
  19. Oct 28, 2015
  20. Oct 17, 2015
    • [SPARK-10185] [SQL] Feat sql comma separated paths · 57f83e36
      Koert Kuipers authored
      Make sure comma-separated paths get processed correctly in `ResolvedDataSource` for a `HadoopFsRelationProvider`.
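
      A hedged sketch of the behaviour being fixed (the format and paths are hypothetical):

      ```python
      # Several paths passed to load() as one comma-separated string.
      df = sqlContext.read.format("json").load("/tmp/jan.json,/tmp/feb.json")
      ```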
      
      Author: Koert Kuipers <koert@tresata.com>
      
      Closes #8416 from koertkuipers/feat-sql-comma-separated-paths.
  21. Sep 08, 2015
  22. Aug 27, 2015
  23. Aug 14, 2015
  24. Aug 05, 2015
  25. Jul 21, 2015
    • [SPARK-9100] [SQL] Adds DataFrame reader/writer shortcut methods for ORC · d38c5029
      Cheng Lian authored
      This PR adds DataFrame reader/writer shortcut methods for ORC in both Scala and Python.
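
      A minimal sketch of the Python side (paths are hypothetical; ORC support required a HiveContext at the time):

      ```python
      # The new orc() shortcuts on DataFrameReader and DataFrameWriter.
      df = sqlContext.read.orc("/tmp/table.orc")
      df.write.orc("/tmp/table_out.orc")
      ```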
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7444 from liancheng/spark-9100 and squashes the following commits:
      
      284d043 [Cheng Lian] Fixes PySpark test cases and addresses PR comments
      e0b09fb [Cheng Lian] Adds DataFrame reader/writer shortcut methods for ORC
  26. Jun 29, 2015
    • [SPARK-8698] partitionBy in Python DataFrame reader/writer interface should... · 660c6cec
      Reynold Xin authored
      [SPARK-8698] partitionBy in Python DataFrame reader/writer interface should not default to empty tuple.
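
      A hedged sketch of the corrected default (paths are hypothetical):

      ```python
      # With no partitionBy given, nothing is passed to the JVM (previously an
      # empty tuple was), so the output is simply unpartitioned.
      df.write.parquet("/tmp/out")                              # unpartitioned
      df.write.partitionBy("year").parquet("/tmp/out_by_year")  # explicit partitioning
      ```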
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7079 from rxin/SPARK-8698 and squashes the following commits:
      
      8513e1c [Reynold Xin] [SPARK-8698] partitionBy in Python DataFrame reader/writer interface should not default to empty tuple.
    • [SPARK-8355] [SQL] Python DataFrameReader/Writer should mirror Scala · ac2e17b0
      Cheolsoo Park authored
      I compared the PySpark DataFrameReader/Writer against the Scala ones. The `option` function was missing in both the reader and the writer, but the rest all match.
      
      I added `option` to the reader and writer and updated the `pyspark-sql` test.
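
      A minimal sketch of the mirrored method (the option keys shown are hypothetical examples):

      ```python
      # option(key, value) is now available on both the reader and the writer.
      df = sqlContext.read.format("json").option("samplingRatio", "0.1").load("/tmp/in")
      df.write.format("json").option("someKey", "someValue").save("/tmp/out")
      ```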
      
      Author: Cheolsoo Park <cheolsoop@netflix.com>
      
      Closes #7078 from piaozhexiu/SPARK-8355 and squashes the following commits:
      
      c63d419 [Cheolsoo Park] Fix version
      524e0aa [Cheolsoo Park] Add option function to df reader and writer
  27. Jun 22, 2015
    • [SPARK-8532] [SQL] In Python's DataFrameWriter,... · 5ab9fcfb
      Yin Huai authored
      [SPARK-8532] [SQL] In Python's DataFrameWriter, save/saveAsTable/json/parquet/jdbc always override mode
      
      https://issues.apache.org/jira/browse/SPARK-8532
      
      This PR has two changes. First, it fixes a bug where the save actions (i.e. `save/saveAsTable/json/parquet/jdbc`) always override the mode. Second, it adds the input argument `partitionBy` to `save/saveAsTable/parquet`.
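
      A hedged sketch of both changes (the table, column, and paths are hypothetical; the keyword form of `partitionBy` follows this PR's description):

      ```python
      # mode is honored instead of being overridden, and partitionBy can be
      # passed straight to the save actions.
      df.write.parquet("/tmp/events", mode="overwrite")
      df.write.saveAsTable("events", mode="append", partitionBy=["year"])
      ```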
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6937 from yhuai/SPARK-8532 and squashes the following commits:
      
      f972d5d [Yin Huai] davies's comment.
      d37abd2 [Yin Huai] style.
      d21290a [Yin Huai] Python doc.
      889eb25 [Yin Huai] Minor refactoring and add partitionBy to save, saveAsTable, and parquet.
      7fbc24b [Yin Huai] Use None instead of "error" as the default value of mode since JVM-side already uses "error" as the default value.
      d696dff [Yin Huai] Python style.
      88eb6c4 [Yin Huai] If mode is "error", do not call mode method.
      c40c461 [Yin Huai] Regression test.
  28. Jun 03, 2015
    • [SPARK-8060] Improve DataFrame Python test coverage and documentation. · ce320cb2
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6601 from rxin/python-read-write-test-and-doc and squashes the following commits:
      
      baa8ad5 [Reynold Xin] Code review feedback.
      f081d47 [Reynold Xin] More documentation updates.
      c9902fa [Reynold Xin] [SPARK-8060] Improve DataFrame Python reader/writer interface doc and testing.
  29. Jun 02, 2015
    • [SPARK-8021] [SQL] [PYSPARK] make Python read/write API consistent with Scala · 445647a1
      Davies Liu authored
      Add `schema()`/`format()`/`options()` for the reader; add `mode()`/`format()`/`options()`/`partitionBy()` for the writer.
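
      A hedged sketch combining the added methods (the schema, option keys, and paths are illustrative):

      ```python
      # schema()/format()/options() on the reader;
      # mode()/format()/options()/partitionBy() on the writer.
      from pyspark.sql.types import StructType, StructField, StringType

      schema = StructType([StructField("name", StringType(), True)])
      df = (sqlContext.read
            .format("json")
            .schema(schema)
            .options(samplingRatio="1.0")
            .load("/tmp/in"))
      (df.write
         .format("parquet")
         .mode("overwrite")
         .partitionBy("name")
         .save("/tmp/out"))
      ```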
      
      cc rxin yhuai  pwendell
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6578 from davies/readwrite and squashes the following commits:
      
      720d293 [Davies Liu] address comments
      b65dfa2 [Davies Liu] Update readwriter.py
      1299ab6 [Davies Liu] make Python API consistent with Scala
  30. May 23, 2015
    • [SPARK-7840] add insertInto() to Writer · be47af1b
      Davies Liu authored
      Add tests later.
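
      A minimal sketch of the new method (the table name is hypothetical; the `overwrite` flag is an assumption based on the Scala counterpart):

      ```python
      # insertInto appends the DataFrame's rows into an existing table.
      df.write.insertInto("events")
      df.write.insertInto("events", overwrite=True)
      ```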
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6375 from davies/insertInto and squashes the following commits:
      
      826423e [Davies Liu] add insertInto() to Writer
  31. May 21, 2015
    • [SPARK-7606] [SQL] [PySpark] add version to Python SQL API docs · 8ddcb25b
      Davies Liu authored
      Add version info for public Python SQL API.
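
      A hedged sketch of the pattern this introduces: a `since` decorator that stamps a versionadded note onto each public method's docstring (the import location and exact usage are assumptions):

      ```python
      # Sketch of the versioning pattern; the decorator records the Spark
      # version in which the API appeared.
      from pyspark import since  # assumed location of the decorator

      class DataFrameReader(object):
          @since(1.4)
          def json(self, path):
              """Loads a JSON file and returns the result as a DataFrame."""
              pass
      ```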
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6295 from davies/versions and squashes the following commits:
      
      cfd91e6 [Davies Liu] add more version for DataFrame API
      600834d [Davies Liu] add version to SQL API docs
  32. May 19, 2015
    • [SPARK-7738] [SQL] [PySpark] add reader and writer API in Python · 4de74d26
      Davies Liu authored
      cc rxin, please take a quick look, I'm working on tests.
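
      A minimal sketch of the entry points this adds (format names and paths are hypothetical):

      ```python
      # sqlContext.read returns a DataFrameReader; df.write returns a DataFrameWriter.
      df = sqlContext.read.load("/tmp/in", format="json")
      df.write.save("/tmp/out", format="parquet", mode="error")
      ```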
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6238 from davies/readwrite and squashes the following commits:
      
      c7200eb [Davies Liu] update tests
      9cbf01b [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
      f0c5a04 [Davies Liu] use sqlContext.read.load
      5f68bc8 [Davies Liu] update tests
      6437e9a [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
      bcc6668 [Davies Liu] add reader amd writer API in Python