Skip to content
Snippets Groups Projects
  1. Feb 28, 2017
    • hyukjinkwon's avatar
      [SPARK-19610][SQL] Support parsing multiline CSV files · 7e5359be
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes the support for multiple lines for CSV by resembling the multiline supports in JSON datasource (in case of JSON, per file).
      
      So, this PR introduces `wholeFile` option which makes the format not splittable and reads each whole file. Since Univocity parser can produces each row from a stream, it should be capable of parsing very large documents when the internal rows are fix in the memory.
      
      ## How was this patch tested?
      
      Unit tests in `CSVSuite` and `tests.py`
      
      Manual tests with a single 9GB CSV file in local file system, for example,
      
      ```scala
      spark.read.option("wholeFile", true).option("inferSchema", true).csv("tmp.csv").count()
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16976 from HyukjinKwon/SPARK-19610.
      7e5359be
  2. Feb 16, 2017
    • Nathan Howell's avatar
      [SPARK-18352][SQL] Support parsing multiline json files · 21fde57f
      Nathan Howell authored
      ## What changes were proposed in this pull request?
      
      If a new option `wholeFile` is set to `true` the JSON reader will parse each file (instead of a single line) as a value. This is done with Jackson streaming and it should be capable of parsing very large documents, assuming the row will fit in memory.
      
      Because the file is not buffered in memory the corrupt record handling is also slightly different when `wholeFile` is enabled: the corrupt column will contain the filename instead of the literal JSON if there is a parsing failure. It would be easy to extend this to add the parser location (line, column and byte offsets) to the output if desired.
      
      These changes have allowed types other than `String` to be parsed. Support for `UTF8String` and `Text` have been added (alongside `String` and `InputFormat`) and no longer require a conversion to `String` just for parsing.
      
      I've also included a few other changes that generate slightly better bytecode and (imo) make it more obvious when and where boxing is occurring in the parser. These are included as separate commits, let me know if they should be flattened into this PR or moved to a new one.
      
      ## How was this patch tested?
      
      New and existing unit tests. No performance or load tests have been run.
      
      Author: Nathan Howell <nhowell@godaddy.com>
      
      Closes #16386 from NathanHowell/SPARK-18352.
      21fde57f
  3. Sep 21, 2016
    • Yanbo Liang's avatar
      [SPARK-17585][PYSPARK][CORE] PySpark SparkContext.addFile supports adding files recursively · d3b88697
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Users would like to add a directory as dependency in some cases, they can use ```SparkContext.addFile``` with argument ```recursive=true``` to recursively add all files under the directory by using Scala. But Python users can only add file not directory, we should also make it supported.
      
      ## How was this patch tested?
      Unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15140 from yanboliang/spark-17585.
      d3b88697
  4. Apr 20, 2016
    • Burak Yavuz's avatar
      [SPARK-14555] First cut of Python API for Structured Streaming · 80bf48f4
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes:
       - ContinuousQuery
       - Trigger
       - ProcessingTime
      in pyspark under `pyspark.sql.streaming`.
      
      In addition, it contains the new methods added under:
       -  `DataFrameWriter`
           a) `startStream`
           b) `trigger`
           c) `queryName`
      
       -  `DataFrameReader`
           a) `stream`
      
       - `DataFrame`
          a) `isStreaming`
      
      This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
       - `exception`
       - `sourceStatuses`
       - `sinkStatus`
      
      They may be added in a follow up.
      
      This PR also contains some very minor doc fixes in the Scala side.
      
      ## How was this patch tested?
      
      Python doc tests
      
      TODO:
       - [ ] verify Python docs look good
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      Author: Burak Yavuz <burak@databricks.com>
      
      Closes #12320 from brkyvz/stream-python.
      80bf48f4
  5. Feb 29, 2016
  6. Oct 28, 2015
  7. Oct 17, 2015
    • Koert Kuipers's avatar
      [SPARK-10185] [SQL] Feat sql comma separated paths · 57f83e36
      Koert Kuipers authored
      Make sure comma-separated paths get processed correcly in ResolvedDataSource for a HadoopFsRelationProvider
      
      Author: Koert Kuipers <koert@tresata.com>
      
      Closes #8416 from koertkuipers/feat-sql-comma-separated-paths.
      57f83e36
  8. Sep 22, 2015
  9. Jul 21, 2015
    • Cheng Lian's avatar
      [SPARK-9100] [SQL] Adds DataFrame reader/writer shortcut methods for ORC · d38c5029
      Cheng Lian authored
      This PR adds DataFrame reader/writer shortcut methods for ORC in both Scala and Python.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7444 from liancheng/spark-9100 and squashes the following commits:
      
      284d043 [Cheng Lian] Fixes PySpark test cases and addresses PR comments
      e0b09fb [Cheng Lian] Adds DataFrame reader/writer shortcut methods for ORC
      d38c5029
  10. Jun 03, 2015
    • Reynold Xin's avatar
      [SPARK-8060] Improve DataFrame Python test coverage and documentation. · ce320cb2
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6601 from rxin/python-read-write-test-and-doc and squashes the following commits:
      
      baa8ad5 [Reynold Xin] Code review feedback.
      f081d47 [Reynold Xin] More documentation updates.
      c9902fa [Reynold Xin] [SPARK-8060] Improve DataFrame Python reader/writer interface doc and testing.
      ce320cb2
  11. Apr 16, 2015
    • Davies Liu's avatar
      [SPARK-4897] [PySpark] Python 3 support · 04e44b37
      Davies Liu authored
      This PR update PySpark to support Python 3 (tested with 3.4).
      
      Known issue: unpickle array from Pyrolite is broken in Python 3, those tests are skipped.
      
      TODO: ec2/spark-ec2.py is not fully tested with python3.
      
      Author: Davies Liu <davies@databricks.com>
      Author: twneale <twneale@gmail.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5173 from davies/python3 and squashes the following commits:
      
      d7d6323 [Davies Liu] fix tests
      6c52a98 [Davies Liu] fix mllib test
      99e334f [Davies Liu] update timeout
      b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      cafd5ec [Davies Liu] adddress comments from @mengxr
      bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      179fc8d [Davies Liu] tuning flaky tests
      8c8b957 [Davies Liu] fix ResourceWarning in Python 3
      5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      4006829 [Davies Liu] fix test
      2fc0066 [Davies Liu] add python3 path
      71535e9 [Davies Liu] fix xrange and divide
      5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ed498c8 [Davies Liu] fix compatibility with python 3
      820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ad7c374 [Davies Liu] fix mllib test and warning
      ef1fc2f [Davies Liu] fix tests
      4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      59bb492 [Davies Liu] fix tests
      1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ca0fdd3 [Davies Liu] fix code style
      9563a15 [Davies Liu] add imap back for python 2
      0b1ec04 [Davies Liu] make python examples work with Python 3
      d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      a716d34 [Davies Liu] test with python 3.4
      f1700e8 [Davies Liu] fix test in python3
      671b1db [Davies Liu] fix test in python3
      692ff47 [Davies Liu] fix flaky test
      7b9699f [Davies Liu] invalidate import cache for Python 3.3+
      9c58497 [Davies Liu] fix kill worker
      309bfbf [Davies Liu] keep compatibility
      5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
      8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      f53e1f0 [Davies Liu] fix tests
      70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
      a39167e [Davies Liu] support customize class in __main__
      814c77b [Davies Liu] run unittests with python 3
      7f4476e [Davies Liu] mllib tests passed
      d737924 [Davies Liu] pass ml tests
      375ea17 [Davies Liu] SQL tests pass
      6cc42a9 [Davies Liu] rename
      431a8de [Davies Liu] streaming tests pass
      78901a7 [Davies Liu] fix hash of serializer in Python 3
      24b2f2e [Davies Liu] pass all RDD tests
      35f48fe [Davies Liu] run future again
      1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
      6e3c21d [Davies Liu] make cloudpickle work with Python3
      2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
      1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
      7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
      b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
      f40d925 [twneale] xrange --> range
      e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
      79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
      2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
      854be27 [Josh Rosen] Run `futurize` on Python code:
      7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
      04e44b37
  12. Sep 24, 2014
    • Davies Liu's avatar
      [SPARK-3634] [PySpark] User's module should take precedence over system modules · c854b9fc
      Davies Liu authored
      Python modules added through addPyFile should take precedence over system modules.
      
      This patch put the path for user added module in the front of sys.path (just after '').
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2492 from davies/path and squashes the following commits:
      
      4a2af78 [Davies Liu] fix tests
      f7ff4da [Davies Liu] ad license header
      6b0002f [Davies Liu] add tests
      c16c392 [Davies Liu] put addPyFile in front of sys.path
      c854b9fc
  13. Aug 06, 2014
    • Nicholas Chammas's avatar
      [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically · d614967b
      Nicholas Chammas authored
      As described in [SPARK-2627](https://issues.apache.org/jira/browse/SPARK-2627), we'd like Python code to automatically be checked for PEP 8 compliance by Jenkins. This pull request aims to do that.
      
      Notes:
      * We may need to install [`pep8`](https://pypi.python.org/pypi/pep8) on the build server.
      * I'm expecting tests to fail now that PEP 8 compliance is being checked as part of the build. I'm fine with cleaning up any remaining PEP 8 violations as part of this pull request.
      * I did not understand why the RAT and scalastyle reports are saved to text files. I did the same for the PEP 8 check, but only so that the console output style can match those for the RAT and scalastyle checks. The PEP 8 report is removed right after the check is complete.
      * Updates to the ["Contributing to Spark"](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) guide will be submitted elsewhere, as I don't believe that text is part of the Spark repo.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1744 from nchammas/master and squashes the following commits:
      
      274b238 [Nicholas Chammas] [SPARK-2627] [PySpark] minor indentation changes
      983d963 [nchammas] Merge pull request #5 from apache/master
      1db5314 [nchammas] Merge pull request #4 from apache/master
      0e0245f [Nicholas Chammas] [SPARK-2627] undo erroneous whitespace fixes
      bf30942 [Nicholas Chammas] [SPARK-2627] PEP8: comment spacing
      6db9a44 [nchammas] Merge pull request #3 from apache/master
      7b4750e [Nicholas Chammas] merge upstream changes
      91b7584 [Nicholas Chammas] [SPARK-2627] undo unnecessary line breaks
      44e3e56 [Nicholas Chammas] [SPARK-2627] use tox.ini to exclude files
      b09fae2 [Nicholas Chammas] don't wrap comments unnecessarily
      bfb9f9f [Nicholas Chammas] [SPARK-2627] keep up with the PEP 8 fixes
      9da347f [nchammas] Merge pull request #2 from apache/master
      aa5b4b5 [Nicholas Chammas] [SPARK-2627] follow Spark bash style for if blocks
      d0a83b9 [Nicholas Chammas] [SPARK-2627] check that pep8 downloaded fine
      dffb5dd [Nicholas Chammas] [SPARK-2627] download pep8 at runtime
      a1ce7ae [Nicholas Chammas] [SPARK-2627] space out test report sections
      21da538 [Nicholas Chammas] [SPARK-2627] it's PEP 8, not PEP8
      6f4900b [Nicholas Chammas] [SPARK-2627] more misc PEP 8 fixes
      fe57ed0 [Nicholas Chammas] removing merge conflict backups
      9c01d4c [nchammas] Merge pull request #1 from apache/master
      9a66cb0 [Nicholas Chammas] resolving merge conflicts
      a31ccc4 [Nicholas Chammas] [SPARK-2627] miscellaneous PEP 8 fixes
      beaa9ac [Nicholas Chammas] [SPARK-2627] fail check on non-zero status
      723ed39 [Nicholas Chammas] always delete the report file
      0541ebb [Nicholas Chammas] [SPARK-2627] call Python linter from run-tests
      12440fa [Nicholas Chammas] [SPARK-2627] add Scala linter
      61c07b9 [Nicholas Chammas] [SPARK-2627] add Python linter
      75ad552 [Nicholas Chammas] make check output style consistent
      d614967b
  14. Dec 09, 2013
  15. Aug 16, 2013
  16. Jan 23, 2013
  17. Jan 22, 2013
Loading