  1. Jul 06, 2016
  2. Jul 01, 2016
    • [SPARK-16335][SQL] Structured streaming should fail if source directory does not exist · d601894c
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      In structured streaming, Spark does not report errors when the specified directory does not exist. This is a behavior different from the batch mode. This patch changes the behavior to fail if the directory does not exist (when the path is not a glob pattern).
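
      Below is a minimal PySpark sketch of the new behavior, assuming an active SparkSession named `spark`; the path and schema are hypothetical. With this patch, a non-glob source directory that does not exist is rejected when the streaming DataFrame is created rather than silently yielding no data.
      ```python
      from pyspark.sql.types import StructType, StructField, StringType

      schema = StructType([StructField("value", StringType())])
      try:
          # The directory below does not exist, so creating the source should now fail.
          stream_df = spark.readStream.schema(schema).text("/data/does-not-exist")
      except Exception as e:
          print("stream source rejected:", e)
      ```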
      
      ## How was this patch tested?
      Updated unit tests to reflect the new behavior.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14002 from rxin/SPARK-16335.
      d601894c
  3. Jun 30, 2016
    • [SPARK-15954][SQL] Disable loading test tables in Python tests · 38f4d6f4
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch introduces a flag to disable loading test tables in TestHiveSparkSession and disables that in Python. This fixes an issue in which python/run-tests would fail due to failure to load test tables.
      
      Note that these test tables are not used outside of HiveCompatibilitySuite. In the long run we should probably decouple the loading of test tables from the test Hive setup.
      
      ## How was this patch tested?
      This is a test only change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14005 from rxin/SPARK-15954.
      38f4d6f4
    • [SPARK-16313][SQL] Spark should not silently drop exceptions in file listing · 3d75a5b2
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Spark silently drops exceptions during file listing. This is a very bad behavior because it can mask legitimate errors and the resulting plan will silently have 0 rows. This patch changes it to not silently drop the errors.
      
      ## How was this patch tested?
      Manually verified.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13987 from rxin/SPARK-16313.
      3d75a5b2
    • [SPARK-16289][SQL] Implement posexplode table generating function · 46395db8
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR implements the `posexplode` table generating function. Currently, the master branch raises the following exception for a `map` argument, which differs from Hive.
      
      **Before**
      ```scala
      scala> sql("select posexplode(map('a', 1, 'b', 2))").show
      org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
      ```
      
      **After**
      ```scala
      scala> sql("select posexplode(map('a', 1, 'b', 2))").show
      +---+---+-----+
      |pos|key|value|
      +---+---+-----+
      |  0|  a|    1|
      |  1|  b|    2|
      +---+---+-----+
      ```
      
      For an `array` argument, the `after` behavior is the same as `before`.
      ```
      scala> sql("select posexplode(array(1, 2, 3))").show
      +---+---+
      |pos|col|
      +---+---+
      |  0|  1|
      |  1|  2|
      |  2|  3|
      +---+---+
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with newly added testcases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13971 from dongjoon-hyun/SPARK-16289.
      46395db8
    • [SPARK-15820][PYSPARK][SQL] Add Catalog.refreshTable into python API · 5344bade
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Add the Catalog.refreshTable API to the Python interface for Spark SQL.
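
      A minimal usage sketch, assuming an active SparkSession `spark` and a hypothetical table name:
      ```python
      # Refresh Spark's cached metadata (and cached data) for a table whose
      # underlying files may have changed outside of Spark.
      spark.catalog.refreshTable("my_table")
      ```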
      
      ## How was this patch tested?
      
      Existing test.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #13558 from WeichenXu123/update_python_sql_interface_refreshTable.
      5344bade
  4. Jun 29, 2016
    • [TRIVIAL] [PYSPARK] Clean up orc compression option as well · d8a87a3e
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR corrects the ORC compression option for PySpark as well. I think this was missed mistakenly in https://github.com/apache/spark/pull/13948.
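
      As a rough sketch of the option being cleaned up, assuming a DataFrame `df`; the output path and codec choice are assumptions for illustration.
      ```python
      # Write ORC with an explicit compression codec via the generic option() call.
      df.write.mode("overwrite").option("compression", "snappy").orc("/tmp/orc_out")
      ```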
      
      ## How was this patch tested?
      
      N/A
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #13963 from HyukjinKwon/minor-orc-compress.
      d8a87a3e
    • [SPARK-16236][SQL][FOLLOWUP] Add Path Option back to Load API in DataFrameReader · 39f2eb1d
      gatorsmile authored
      #### What changes were proposed in this pull request?
      In the Python API, we have the same issue. Thanks for identifying this issue, zsxwing! Below is an example:
      ```Python
      spark.read.format('json').load('python/test_support/sql/people.json')
      ```
      #### How was this patch tested?
      Existing test cases cover the changes made by this PR.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13965 from gatorsmile/optionPaths.
      39f2eb1d
    • [SPARK-16266][SQL][STREAMING] Moved DataStreamReader/Writer from pyspark.sql to... · f454a7f9
      Tathagata Das authored
      [SPARK-16266][SQL][STREAMING] Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming
      
      ## What changes were proposed in this pull request?
      
      - Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming to make them consistent with the Scala packaging
      - Exposed the necessary classes in sql.streaming package so that they appear in the docs
      - Added pyspark.sql.streaming module to the docs
      
      ## How was this patch tested?
      - updated unit tests.
      - generated docs for testing visibility of pyspark.sql.streaming classes.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13955 from tdas/SPARK-16266.
      f454a7f9
  5. Jun 28, 2016
    • [SPARK-16268][PYSPARK] SQLContext should import DataStreamReader · 5bf8881b
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Fixed the following error:
      ```
      >>> sqlContext.readStream
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "...", line 442, in readStream
          return DataStreamReader(self._wrapped)
      NameError: global name 'DataStreamReader' is not defined
      ```
      
      ## How was this patch tested?
      
      The added test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13958 from zsxwing/fix-import.
      5bf8881b
    • [MINOR][DOCS][STRUCTURED STREAMING] Minor doc fixes around `DataFrameWriter` and `DataStreamWriter` · 5545b791
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      Fixes a couple of old references to `DataFrameWriter.startStream`, pointing them to `DataStreamWriter.start`.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #13952 from brkyvz/minor-doc-fix.
      5545b791
    • [SPARK-16175] [PYSPARK] handle None for UDT · 35438fb0
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      The Scala UDT bypasses all nulls and does not pass them into the UDT's serialize() and deserialize(); this PR updates the Python UDT to do the same.
      
      ## How was this patch tested?
      
      Added tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13878 from davies/udt_null.
      35438fb0
    • [SPARK-16259][PYSPARK] cleanup options in DataFrame read/write API · 1aad8c6e
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      There is some duplicated code for options in the DataFrame reader/writer API; this PR cleans it up and also fixes a bug for `escapeQuotes` in csv().
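
      A small sketch of the `escapeQuotes` option mentioned above, assuming a DataFrame `df`; the output path is hypothetical.
      ```python
      # escapeQuotes controls whether values containing quote characters are quoted when writing CSV.
      df.write.option("escapeQuotes", "true").csv("/tmp/csv_out")
      ```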
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13948 from davies/csv_options.
      1aad8c6e
    • [SPARK-16224] [SQL] [PYSPARK] SparkSession builder's configs need to be set to... · 0923c4f5
      Yin Huai authored
      [SPARK-16224] [SQL] [PYSPARK] SparkSession builder's configs need to be set to the existing Scala SparkContext's SparkConf
      
      ## What changes were proposed in this pull request?
      When we create a SparkSession on the Python side, it is possible that a SparkContext has already been created. For this case, we need to set the configs of the SparkSession builder on the Scala SparkContext's SparkConf (we need to do so because conf changes on an active Python SparkContext will not be propagated to the JVM side). Otherwise, we may create a wrong SparkSession (e.g. Hive support is not enabled even if enableHiveSupport is called).
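
      A sketch of the scenario being fixed (the app name and config values are made up): a SparkContext already exists on the Python side, and the builder configs must still reach its SparkConf on the JVM side.
      ```python
      from pyspark import SparkContext
      from pyspark.sql import SparkSession

      sc = SparkContext(appName="existing-context")         # pre-existing Python SparkContext
      spark = (SparkSession.builder
               .config("spark.sql.shuffle.partitions", "4")  # must be propagated to sc's SparkConf
               .enableHiveSupport()                          # must actually enable Hive support
               .getOrCreate())
      ```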
      
      ## How was this patch tested?
      New tests and manual tests.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #13931 from yhuai/SPARK-16224.
      0923c4f5
    • [SPARK-16128][SQL] Allow setting length of characters to be truncated to, in Dataset.show function. · f6b497fc
      Prashant Sharma authored
      ## What changes were proposed in this pull request?
      
      Allowing truncation to a specific number of characters is convenient at times, especially while operating from the REPL. Sometimes those last few characters make all the difference, and showing everything brings in a whole lot of noise.
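
      A usage sketch, assuming an active SparkSession `spark` and that the Python `show` accepts an integer `truncate` as described:
      ```python
      df = spark.createDataFrame([("a" * 50, 1)], ["long_text", "n"])
      df.show(truncate=5)   # truncate each cell to at most 5 characters
      ```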
      
      ## How was this patch tested?
      Existing tests. + 1 new test in DataFrameSuite.
      
      For SparkR and pyspark, existing tests and manual testing.
      
      Author: Prashant Sharma <prashsh1@in.ibm.com>
      Author: Prashant Sharma <prashant@apache.org>
      
      Closes #13839 from ScrapCodes/add_truncateTo_DF.show.
      f6b497fc
  6. Jun 27, 2016
  7. Jun 24, 2016
    • [SPARK-16179][PYSPARK] fix bugs for Python udf in generate · 4435de1b
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the bug that occurs when a Python UDF is used in explode (a generator). GenerateExec requires that all the attributes in its expressions be resolvable from its children when it is created, so we should replace the children first and then replace its expressions.
      
      ```
      >>> df.select(explode(f(*df))).show()
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/home/vlad/dev/spark/python/pyspark/sql/dataframe.py", line 286, in show
          print(self._jdf.showString(n, truncate))
        File "/home/vlad/dev/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
        File "/home/vlad/dev/spark/python/pyspark/sql/utils.py", line 63, in deco
          return f(*a, **kw)
        File "/home/vlad/dev/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
      py4j.protocol.Py4JJavaError: An error occurred while calling o52.showString.
      : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
      Generate explode(<lambda>(_1#0L)), false, false, [col#15L]
      +- Scan ExistingRDD[_1#0L]
      
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
      	at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:69)
      	at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:45)
      	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:177)
      	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
      	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:153)
      	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
      	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
      	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
      	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
      	at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
      	at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:95)
      	at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:95)
      	at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
      	at scala.collection.immutable.List.foldLeft(List.scala:84)
      	at org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:95)
      	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:85)
      	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:85)
      	at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2557)
      	at org.apache.spark.sql.Dataset.head(Dataset.scala:1923)
      	at org.apache.spark.sql.Dataset.take(Dataset.scala:2138)
      	at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
      	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
      	at py4j.Gateway.invoke(Gateway.java:280)
      	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
      	at py4j.commands.CallCommand.execute(CallCommand.java:79)
      	at py4j.GatewayConnection.run(GatewayConnection.java:211)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.lang.reflect.InvocationTargetException
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$13.apply(TreeNode.scala:413)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$13.apply(TreeNode.scala:413)
      	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:412)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:387)
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
      	... 42 more
      Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF0#20
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
      	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
      	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
      	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
      	at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
      	at org.apache.spark.sql.execution.GenerateExec.<init>(GenerateExec.scala:63)
      	... 52 more
      Caused by: java.lang.RuntimeException: Couldn't find pythonUDF0#20 in [_1#0L]
      	at scala.sys.package$.error(package.scala:27)
      	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94)
      	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88)
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
      	... 67 more
      ```
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13883 from davies/udf_in_generate.
      4435de1b
  8. Jun 21, 2016
  9. Jun 20, 2016
    • [SPARK-13792][SQL] Limit logging of bad records in CSV data source · c775bf09
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This pull request adds a new option (maxMalformedLogPerPartition) to the CSV reader to limit the maximum number of log messages Spark generates per partition for malformed records.
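
      A usage sketch, assuming an active SparkSession `spark`; the input path is hypothetical.
      ```python
      # Drop malformed rows and cap the per-partition warnings about them at 10.
      df = (spark.read
            .option("mode", "DROPMALFORMED")
            .option("maxMalformedLogPerPartition", "10")
            .csv("/data/events.csv"))
      ```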
      
      The error log looks something like
      ```
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
      16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged.
      ```
      
      Closes #12173
      
      ## How was this patch tested?
      Manually tested.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13795 from rxin/SPARK-13792.
      c775bf09
    • [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6) · a46553cb
      Davies Liu authored
      
      Fix the bug for Python UDF that does not have any arguments.
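
      A sketch of the case being fixed, written against the 2.0-style SparkSession API for brevity; the constant returned is arbitrary.
      ```python
      from pyspark.sql.functions import udf
      from pyspark.sql.types import IntegerType

      answer = udf(lambda: 42, IntegerType())   # a Python UDF with no arguments
      spark.range(3).select(answer().alias("answer")).show()
      ```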
      
      Added regression tests.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #13793 from davies/fix_no_arguments.
      
      (cherry picked from commit abe36c53)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      a46553cb
  10. Jun 18, 2016
    • [SPARK-15973][PYSPARK] Fix GroupedData Documentation · e574c997
      Josh Howes authored
      *This contribution is my original work and I license the work to the project under the project's open source license.*
      
      ## What changes were proposed in this pull request?
      
      Documentation updates to PySpark's GroupedData
      
      ## How was this patch tested?
      
      Manual Tests
      
      Author: Josh Howes <josh.howes@gmail.com>
      Author: Josh Howes <josh.howes@maxpoint.com>
      
      Closes #13724 from josh-howes/bugfix/SPARK-15973.
      e574c997
    • [SPARK-15803] [PYSPARK] Support with statement syntax for SparkSession · 898cb652
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      Support with statement syntax for SparkSession in pyspark
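
      A minimal sketch of the added context-manager support; the app name is made up. The underlying SparkContext is stopped when the block exits.
      ```python
      from pyspark.sql import SparkSession

      with SparkSession.builder.appName("with-demo").getOrCreate() as session:
          session.range(5).show()
      # the underlying SparkContext has been stopped here
      ```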
      
      ## How was this patch tested?
      
      Manually verified. Although I could add a unit test for it, it would affect other unit tests because the SparkContext is stopped after the with statement.
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #13541 from zjffdu/SPARK-15803.
      898cb652
  11. Jun 16, 2016
    • [SPARK-15981][SQL][STREAMING] Fixed bug and added tests in DataStreamReader Python API · 084dca77
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      - Fixed a bug in the Python API of DataStreamReader. Because a single path was being converted to an array before calling the Java DataStreamReader method (which takes a string only), it gave the following error.
      ```
      File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/readwriter.py", line 947, in pyspark.sql.readwriter.DataStreamReader.json
      Failed example:
          json_sdf = spark.readStream.json(os.path.join(tempfile.mkdtemp(), 'data'),                 schema = sdf_schema)
      Exception raised:
          Traceback (most recent call last):
            File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 1253, in __run
              compileflags, 1) in test.globs
            File "<doctest pyspark.sql.readwriter.DataStreamReader.json[0]>", line 1, in <module>
              json_sdf = spark.readStream.json(os.path.join(tempfile.mkdtemp(), 'data'),                 schema = sdf_schema)
            File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/readwriter.py", line 963, in json
              return self._df(self._jreader.json(path))
            File "/Users/tdas/Projects/Spark/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
              answer, self.gateway_client, self.target_id, self.name)
            File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/utils.py", line 63, in deco
              return f(*a, **kw)
            File "/Users/tdas/Projects/Spark/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 316, in get_return_value
              format(target_id, ".", name, value))
          Py4JError: An error occurred while calling o121.json. Trace:
          py4j.Py4JException: Method json([class java.util.ArrayList]) does not exist
          	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
          	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
          	at py4j.Gateway.invoke(Gateway.java:272)
          	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
          	at py4j.commands.CallCommand.execute(CallCommand.java:79)
          	at py4j.GatewayConnection.run(GatewayConnection.java:211)
          	at java.lang.Thread.run(Thread.java:744)
      ```
      
      - Reduced code duplication between DataStreamReader and DataFrameWriter
      - Added missing Python doctests
      
      ## How was this patch tested?
      New tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13703 from tdas/SPARK-15981.
      084dca77
  12. Jun 15, 2016
    • [SPARK-15888] [SQL] fix Python UDF with aggregate · 5389013a
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      After we moved the ExtractPythonUDF rule into the physical plan, Python UDFs could no longer work on top of an aggregate, because they cannot be evaluated before the aggregate; they should be evaluated after it. This PR adds another rule to extract this kind of Python UDF from the logical Aggregate and create a Project on top of the Aggregate.
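
      A sketch of the kind of query this fixes (column names and data are illustrative): a Python UDF applied to the grouping key and to the aggregated result, which is now pulled out into a Project above the Aggregate.
      ```python
      from pyspark.sql.functions import udf, sum as sql_sum
      from pyspark.sql.types import IntegerType

      f = udf(lambda x: x + 1, IntegerType())
      df = spark.createDataFrame([(1, 10), (1, 20), (2, 30)], ["key", "value"])
      result = (df.groupBy(f("key"))
                  .agg(sql_sum("value").alias("s"))
                  .select(f("s").alias("t")))
      result.show()
      ```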
      
      ## How was this patch tested?
      
      Added regression tests. The plan of added test query looks like this:
      ```
      == Parsed Logical Plan ==
      'Project [<lambda>('k, 's) AS t#26]
      +- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L]
         +- LogicalRDD [key#5L, value#6]
      
      == Analyzed Logical Plan ==
      t: int
      Project [<lambda>(k#17, s#22L) AS t#26]
      +- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L]
         +- LogicalRDD [key#5L, value#6]
      
      == Optimized Logical Plan ==
      Project [<lambda>(agg#29, agg#30L) AS t#26]
      +- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS agg#29, sum(cast(<lambda>(value#6) as bigint)) AS agg#30L]
         +- LogicalRDD [key#5L, value#6]
      
      == Physical Plan ==
      *Project [pythonUDF0#37 AS t#26]
      +- BatchEvalPython [<lambda>(agg#29, agg#30L)], [agg#29, agg#30L, pythonUDF0#37]
         +- *HashAggregate(key=[<lambda>(key#5L)#31], functions=[sum(cast(<lambda>(value#6) as bigint))], output=[agg#29,agg#30L])
            +- Exchange hashpartitioning(<lambda>(key#5L)#31, 200)
               +- *HashAggregate(key=[pythonUDF0#34 AS <lambda>(key#5L)#31], functions=[partial_sum(cast(pythonUDF1#35 as bigint))], output=[<lambda>(key#5L)#31,sum#33L])
                  +- BatchEvalPython [<lambda>(key#5L), <lambda>(value#6)], [key#5L, value#6, pythonUDF0#34, pythonUDF1#35]
                     +- Scan ExistingRDD[key#5L,value#6]
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13682 from davies/fix_py_udf.
      5389013a
    • [SPARK-15953][WIP][STREAMING] Renamed ContinuousQuery to StreamingQuery · 9a507199
      Tathagata Das authored
      Renamed for simplicity, so that it's obvious that it's related to streaming.
      
      Existing unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13673 from tdas/SPARK-15953.
      9a507199
  13. Jun 14, 2016
    • [SPARK-15935][PYSPARK] Fix a wrong format tag in the error message · 0ee9fd9e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      A follow up PR for #13655 to fix a wrong format tag.
      
      ## How was this patch tested?
      
      Jenkins unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13665 from zsxwing/fix.
      0ee9fd9e
    • [SPARK-15933][SQL][STREAMING] Refactored DF reader-writer to use readStream... · 214adb14
      Tathagata Das authored
      [SPARK-15933][SQL][STREAMING] Refactored DF reader-writer to use readStream and writeStream for streaming DFs
      
      ## What changes were proposed in this pull request?
      Currently, the DataFrameReader/Writer has methods that are needed for both streaming and non-streaming DFs. This is quite awkward because each of those methods throws a runtime exception for one case or the other. So rather than having half the methods throw runtime exceptions, it's just better to have a separate reader/writer API for streams.
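
      A sketch of the split API, assuming an active SparkSession `spark`; the paths and formats are illustrative only.
      ```python
      # Streaming reads now go through readStream and streaming writes through writeStream,
      # leaving spark.read / df.write for batch DataFrames.
      stream_df = spark.readStream.format("text").load("/data/incoming")
      query = (stream_df.writeStream
               .format("parquet")
               .option("checkpointLocation", "/data/checkpoints")
               .option("path", "/data/out")
               .start())
      ```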
      
      - [x] Python API!!
      
      ## How was this patch tested?
      Existing unit tests + two sets of unit tests for DataFrameReader/Writer and DataStreamReader/Writer.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13653 from tdas/SPARK-15933.
      214adb14
    • [SPARK-15935][PYSPARK] Enable test for sql/streaming.py and fix these tests · 96c3500c
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR just enables tests for sql/streaming.py and also fixes the failures.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13655 from zsxwing/python-streaming-test.
      96c3500c
  14. Jun 13, 2016
    • [SPARK-15663][SQL] SparkSession.catalog.listFunctions shouldn't include the... · 1842cdd4
      Sandeep Singh authored
      [SPARK-15663][SQL] SparkSession.catalog.listFunctions shouldn't include the list of built-in functions
      
      ## What changes were proposed in this pull request?
      SparkSession.catalog.listFunctions currently returns all functions, including the built-in functions. This makes the method less useful, because any time it is run the result set contains over 100 built-in functions.
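
      A usage sketch, assuming an active SparkSession `spark`: after this change, the result contains user-registered and temporary functions rather than the full built-in catalog.
      ```python
      spark.udf.register("plus_one", lambda x: x + 1)
      for fn in spark.catalog.listFunctions():
          print(fn.name, fn.isTemporary)
      ```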
      
      ## How was this patch tested?
      CatalogSuite
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #13413 from techaddict/SPARK-15663.
      1842cdd4
  15. Jun 12, 2016
  16. Jun 11, 2016
    • [SPARK-15585][SQL] Add doc for turning off quotations · cb5d933d
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
      This PR adds documentation for turning off quotations, because this behavior differs from `com.databricks.spark.csv`.
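
      A sketch of the documented usage, assuming an active SparkSession `spark`; the input path is hypothetical.
      ```python
      # Setting the CSV `quote` option to an empty string turns quotation handling off.
      df = spark.read.option("quote", "").csv("/data/raw.csv")
      ```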
      
      ## How was this patch tested?
      Checked the behavior of setting an empty string in the CSV options.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #13616 from maropu/SPARK-15585-2.
      cb5d933d
  17. Jun 06, 2016
  18. Jun 01, 2016
    • [SPARK-15686][SQL] Move user-facing streaming classes into sql.streaming · a71d1364
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch moves all user-facing structured streaming classes into sql.streaming. As part of this, I also added some since-version annotations to methods and classes that did not have them.
      
      ## How was this patch tested?
      Updated tests to reflect the moves.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13429 from rxin/SPARK-15686.
      a71d1364
  19. May 31, 2016
    • [SPARK-15517][SQL][STREAMING] Add support for complete output mode in Structured Streaming · 90b11439
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      Currently structured streaming only supports append output mode.  This PR adds the following.
      
      - Added support for Complete output mode in the internal state store, analyzer and planner.
      - Added public API in Scala and Python for users to specify output mode
      - Added checks for unsupported combinations of output mode and DF operations
        - Plans with no aggregation should support only Append mode
        - Plans with aggregation should support only Update and Complete modes
        - Default output mode is Append mode (**Question: should we change this to automatically set to Complete mode when there is aggregation?**)
      - Added support for Complete output mode in the Memory Sink. The Memory Sink therefore internally supports Append, Complete, and Update, but from the public API only the Complete and Append output modes are supported. A sketch of the new API follows this list.
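
      Below is an illustrative sketch of the new public API, using the reader/writer names as they end up later in this log (readStream/writeStream); the paths and query name are assumptions.
      ```python
      counts = (spark.readStream.format("text").load("/data/incoming")
                .groupBy("value").count())
      query = (counts.writeStream
               .outputMode("complete")      # emit the full updated result on each trigger
               .format("memory")
               .queryName("word_counts")
               .start())
      ```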
      
      ## How was this patch tested?
      Unit tests in various test suites
      - StreamingAggregationSuite: tests for complete mode
      - MemorySinkSuite: tests for checking behavior in Append and Complete modes.
      - UnsupportedOperationSuite: tests for checking unsupported combinations of DF ops and output modes
      - DataFrameReaderWriterSuite: tests for checking that output mode cannot be called on static DFs
      - Python doc test and existing unit tests modified to call write.outputMode.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13286 from tdas/complete-mode.
      90b11439
    • Revert "[SPARK-11753][SQL][TEST-HADOOP2.2] Make allowNonNumericNumbers option work · 9a74de18
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This reverts commit c24b6b67. Sent a PR to run Jenkins tests due to the revert conflicts of `dev/deps/spark-deps-hadoop*`.
      
      ## How was this patch tested?
      
      Jenkins unit tests, integration tests, manual tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13417 from zsxwing/revert-SPARK-11753.
      9a74de18