  1. May 23, 2016
    • Kazuaki Ishizaki's avatar
      [SPARK-15285][SQL] Generated SpecificSafeProjection.apply method grows beyond 64 KB · fa244e5a
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR splits the generated code for ```SafeProjection.apply``` by using ```ctx.splitExpressions()```, because the large code body generated for ```NewInstance``` may grow beyond the 64 KB bytecode limit for the ```apply()``` method.
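
      As a rough illustration of the splitting idea (not the actual codegen internals; `splitIntoMethods` and the chunk size are made up for this sketch), a long run of generated statements can be grouped into helper methods so that no single method exceeds the JVM's 64 KB limit:
      ```scala
      // Illustrative only: `statements` stands in for generated Java statements; the real
      // implementation goes through CodegenContext.splitExpressions().
      def splitIntoMethods(statements: Seq[String], chunkSize: Int = 100): String = {
        val helpers = statements.grouped(chunkSize).zipWithIndex.map { case (chunk, idx) =>
          s"""private void apply_$idx(InternalRow i) {
             |  ${chunk.mkString("\n  ")}
             |}""".stripMargin
        }.toSeq
        // apply() only calls the helpers, keeping its own bytecode small.
        val calls = helpers.indices.map(idx => s"apply_$idx(i);").mkString("\n  ")
        (helpers :+ s"public void apply(InternalRow i) {\n  $calls\n}").mkString("\n\n")
      }
      ```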
      
      ## How was this patch tested?
      
      Added new tests
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #13243 from kiszk/SPARK-15285.
      fa244e5a
    • gatorsmile's avatar
      [SPARK-15485][SQL][DOCS] Spark SQL Configuration · d2077164
      gatorsmile authored
      #### What changes were proposed in this pull request?
      So far, the page Configuration in the official documentation does not have a section for Spark SQL.
      http://spark.apache.org/docs/latest/configuration.html
      
      For Spark users, the information and default values of these public configuration parameters are very useful. This PR adds the missing section to configuration.html.
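
      For reference, these public SQL configuration values can also be inspected and overridden at runtime; a small example, assuming a spark-shell session where `spark` is the pre-built SparkSession:
      ```scala
      spark.sql("SET spark.sql.shuffle.partitions").show(false)   // inspect the current value
      spark.conf.set("spark.sql.shuffle.partitions", "64")        // override it for this session
      println(spark.conf.get("spark.sql.shuffle.partitions"))     // prints 64
      ```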
      
      rxin yhuai marmbrus
      
      #### How was this patch tested?
      Below is the generated webpage.
      <img width="924" alt="screenshot 2016-05-23 11 35 57" src="https://cloud.githubusercontent.com/assets/11567269/15480492/b08fefc4-20da-11e6-9fa2-7cd5b699ed35.png">
      <img width="914" alt="screenshot 2016-05-23 11 37 38" src="https://cloud.githubusercontent.com/assets/11567269/15480499/c5f9482e-20da-11e6-95ff-10821add1af4.png">
      <img width="923" alt="screenshot 2016-05-23 11 36 11" src="https://cloud.githubusercontent.com/assets/11567269/15480506/cbd81644-20da-11e6-9d27-effb716b2fac.png">
      <img width="920" alt="screenshot 2016-05-23 11 36 18" src="https://cloud.githubusercontent.com/assets/11567269/15480511/d013e332-20da-11e6-854a-cf8813c46f36.png">
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13263 from gatorsmile/configurationSQL.
      d2077164
    • WeichenXu's avatar
      [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with... · a15ca553
      WeichenXu authored
      [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code
      
      ## What changes were proposed in this pull request?
      
      Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code.
      
      ## How was this patch tested?
      
      Existing test.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #13242 from WeichenXu123/python_doctest_update_sparksession.
      a15ca553
    • gatorsmile's avatar
      [SPARK-15311][SQL] Disallow DML on Regular Tables when Using In-Memory Catalog · 5afd927a
      gatorsmile authored
      #### What changes were proposed in this pull request?
      So far, when using the in-memory catalog, we allow DDL operations on tables. However, the corresponding DML operations are not supported for tables that are neither temporary nor data source tables. For example,
      ```SQL
      CREATE TABLE tabName(i INT, j STRING)
      SELECT * FROM tabName
      INSERT OVERWRITE TABLE tabName SELECT 1, 'a'
      ```
      In the above example, before this fix, we would get a very confusing exception message for either `SELECT` or `INSERT`:
      ```
      org.apache.spark.sql.AnalysisException: unresolved operator 'SimpleCatalogRelation default, CatalogTable(`default`.`tbl`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(i,int,true,None), CatalogColumn(j,string,true,None)),List(),List(),List(),-1,,1463928681802,-1,Map(),None,None,None,List()), None;
      ```
      
      This PR issues an appropriate exception in this case. The message will look like:
      ```
      org.apache.spark.sql.AnalysisException: Please enable Hive support when operating non-temporary tables: `tbl`;
      ```
      #### How was this patch tested?
      Added a test case in `DDLSuite`.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #13093 from gatorsmile/selectAfterCreate.
      5afd927a
    • Xin Wu's avatar
      [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s) command natively · 01659bc5
      Xin Wu authored
      ## What changes were proposed in this pull request?
      Currently the command `ADD FILE|JAR <filepath | jarpath>` is supported natively in SparkSQL. However, once this command is run, the added file/jar cannot be looked up with a `LIST FILE(s)|JAR(s)` command, because the `LIST` command is passed to the Hive command processor in Spark-SQL or is simply not supported in Spark-shell. There is no way for users to find out what files/jars have been added to the SparkContext.
      Refer to [Hive commands](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli)
      
      This PR is to support following commands:
      `LIST (FILE[s] [filepath ...] | JAR[s] [jarfile ...])`
      
      ### For example:
      ##### LIST FILE(s)
      ```
      scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt")
      res1: org.apache.spark.sql.DataFrame = []
      scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt")
      res2: org.apache.spark.sql.DataFrame = []
      
      scala> spark.sql("list file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt").show(false)
      +----------------------------------------------+
      |result                                        |
      +----------------------------------------------+
      |hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
      +----------------------------------------------+
      
      scala> spark.sql("list files").show(false)
      +----------------------------------------------+
      |result                                        |
      +----------------------------------------------+
      |hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
      |hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt |
      +----------------------------------------------+
      ```
      
      ##### LIST JAR(s)
      ```
      scala> spark.sql("add jar /Users/xinwu/spark/core/src/test/resources/TestUDTF.jar")
      res9: org.apache.spark.sql.DataFrame = [result: int]
      
      scala> spark.sql("list jar TestUDTF.jar").show(false)
      +---------------------------------------------+
      |result                                       |
      +---------------------------------------------+
      |spark://192.168.1.234:50131/jars/TestUDTF.jar|
      +---------------------------------------------+
      
      scala> spark.sql("list jars").show(false)
      +---------------------------------------------+
      |result                                       |
      +---------------------------------------------+
      |spark://192.168.1.234:50131/jars/TestUDTF.jar|
      +---------------------------------------------+
      ```
      ## How was this patch tested?
      New test cases are added for Spark-SQL, Spark-Shell and SparkContext API code path.
      
      Author: Xin Wu <xinwu@us.ibm.com>
      Author: xin Wu <xinwu@us.ibm.com>
      
      Closes #13212 from xwu0226/list_command.
      01659bc5
    • hyukjinkwon's avatar
      [MINOR][SPARKR][DOC] Add a description for running unit tests in Windows · a8e97d17
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds the description for running unit tests in Windows.
      
      ## How was this patch tested?
      
      On a bare machine (Window 7, 32bits), this was manually built and tested.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #13217 from HyukjinKwon/minor-r-doc.
      a8e97d17
    • sureshthalamati's avatar
      [SPARK-15315][SQL] Adding error check to the CSV datasource writer for... · 03c7b7c4
      sureshthalamati authored
      [SPARK-15315][SQL] Adding error check to  the CSV datasource writer for unsupported complex data types.
      
      ## What changes were proposed in this pull request?
      
      Adds error handling to the CSV writer for unsupported complex data types. Currently, garbage gets written to the output CSV files if the DataFrame schema contains complex data types.
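
      A hedged illustration of the failure mode being guarded against (the column name and output path are made up); after this change, writing a schema that contains complex types such as arrays or structs to CSV is expected to fail with an error rather than emit garbage:
      ```scala
      // Hypothetical example: `spark` is a SparkSession.
      import spark.implicits._
      import org.apache.spark.sql.functions.array

      val df = spark.range(3).select(array($"id", $"id" * 2).as("pair"))   // an ArrayType column

      // Before this PR: garbage was written to the CSV output.
      // After this PR: the write is rejected with an error about the unsupported ArrayType column.
      df.write.csv("/tmp/unsupported_csv_output")
      ```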
      
      ## How was this patch tested?
      
      Added new unit test case.
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #13105 from sureshthalamati/csv_complex_types_SPARK-15315.
      03c7b7c4
    • Dongjoon Hyun's avatar
      [MINOR][SQL][DOCS] Add notes of the deterministic assumption on UDF functions · 37c617e4
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Spark assumes that UDF functions are deterministic. This PR adds explicit notes about that.
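
      A small hedged example of why the note matters: because Spark assumes UDFs are deterministic, the optimizer is free to duplicate or re-evaluate them, so a non-deterministic UDF like the one below may not behave as the user expects.
      ```scala
      // Hypothetical illustration: `spark` is a SparkSession; this UDF is intentionally
      // non-deterministic, which violates the assumption documented by this PR.
      import spark.implicits._
      import org.apache.spark.sql.functions.udf

      val randomTag = udf((id: Long) => s"$id-${scala.util.Random.nextInt(100)}")

      val df = spark.range(5).withColumn("tag", randomTag($"id"))
      // Because Spark may duplicate or re-evaluate the expression during optimization,
      // repeated references to "tag" are not guaranteed to see the same value per row.
      df.filter($"tag".endsWith("7")).show()
      ```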
      
      ## How was this patch tested?
      
      It's only about docs.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13087 from dongjoon-hyun/SPARK-15282.
      37c617e4
    • Andrew Or's avatar
      [SPARK-15279][SQL] Catch conflicting SerDe when creating table · 2585d2b3
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      The user may do something like:
      ```
      CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS PARQUET
      CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS ... SERDE 'myserde'
      CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ORC
      CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ... SERDE 'myserde'
      ```
      None of these should be allowed because the SerDes conflict. As of this patch:
      - `ROW FORMAT DELIMITED` is only compatible with `TEXTFILE`
      - `ROW FORMAT SERDE` is only compatible with `TEXTFILE`, `RCFILE` and `SEQUENCEFILE`
      
      ## How was this patch tested?
      
      New tests in `DDLCommandSuite`.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13068 from andrewor14/row-format-conflict.
      2585d2b3
    • Wenchen Fan's avatar
      [SPARK-15471][SQL] ScalaReflection cleanup · 07c36a2f
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      1. simplify the logic of deserializing option type.
      2. simplify the logic of serializing array type, and remove silentSchemaFor
      3. remove some unnecessary code.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13250 from cloud-fan/encoder.
      07c36a2f
    • Davies Liu's avatar
      [SPARK-14031][SQL] speedup CSV writer · 80091b8a
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, we create a CSVWriter for every row, which is very expensive and memory hungry; it took about 15 seconds to write out 1 million rows (two columns).

      This PR writes the rows in batch mode, creating one CSVWriter per 1000 rows, which can write out 1 million rows in about 1 second (15x faster).
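
      A rough sketch of the batching idea, with the uniVocity-based writer Spark actually uses simplified to a generic `CsvLikeWriter` stand-in:
      ```scala
      // Illustrative sketch of reusing one writer per batch instead of one per row.
      class CsvLikeWriter(out: java.io.Writer) {
        def writeRow(fields: Seq[String]): Unit = out.write(fields.mkString(",") + "\n")
      }

      def writeBatched(rows: Iterator[Seq[String]], out: java.io.Writer, batchSize: Int = 1000): Unit = {
        rows.grouped(batchSize).foreach { batch =>
          val writer = new CsvLikeWriter(out)   // one writer per 1000 rows, not per row
          batch.foreach(writer.writeRow)
        }
      }
      ```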
      
      ## How was this patch tested?
      
      Manually benchmark it.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13229 from davies/csv_writer.
      80091b8a
    • Sameer Agarwal's avatar
      [SPARK-15425][SQL] Disallow cross joins by default · dafcb05c
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      In order to prevent users from inadvertently writing queries with cartesian joins, this patch introduces a new conf `spark.sql.crossJoin.enabled` (`false` by default) which, when false, results in a `SparkException` if the query contains one or more cartesian products.
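
      A hedged example of the new behavior (the data and the opt-in step are illustrative):
      ```scala
      // Hypothetical example: `spark` is a SparkSession.
      val left = spark.range(3)
      val right = spark.range(3)

      // With the default spark.sql.crossJoin.enabled=false, a join with no join condition
      // (a cartesian product) is expected to fail with an exception:
      // left.join(right).count()

      // Explicitly opting in allows the cartesian product:
      spark.conf.set("spark.sql.crossJoin.enabled", "true")
      left.join(right).count()   // 9 rows
      ```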
      
      ## How was this patch tested?
      
      Added a test to verify the new behavior in `JoinSuite`. Additionally, `SQLQuerySuite` and `SQLMetricsSuite` were modified to explicitly enable cartesian products.
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #13209 from sameeragarwal/disallow-cartesian.
      dafcb05c
  2. May 22, 2016
    • wangyang's avatar
      [SPARK-15379][SQL] check special invalid date · fc44b694
      wangyang authored
      ## What changes were proposed in this pull request?
      
      When an invalid date string like "2015-02-29 00:00:00" was cast to date or timestamp using Spark SQL, it used to return not null but another valid date (2015-03-01 in this case).
      With this PR, invalid date strings like "2016-02-29" and "2016-04-31" are returned as null when cast to date or timestamp.
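
      For example (output paraphrased, assuming a spark-shell session):
      ```scala
      // After this PR, casting an impossible calendar date yields null instead of rolling over.
      spark.sql("SELECT CAST('2016-04-30' AS DATE) AS ok, CAST('2016-04-31' AS DATE) AS bad").show()
      // Expected: `ok` is 2016-04-30; `bad` is null (April has only 30 days) instead of 2016-05-01.
      ```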
      
      ## How was this patch tested?
      
      Unit tests are added.
      
      Author: wangyang <wangyang@haizhi.com>
      
      Closes #13169 from wangyang1992/invalid_date.
      fc44b694
    • Sandeep Singh's avatar
      [MINOR] More than 100 chars in line in SparkSubmitCommandBuilderSuite · 3eff65f8
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
      Fix a line longer than 100 characters in SparkSubmitCommandBuilderSuite.
      
      ## How was this patch tested?
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #13249 from techaddict/fix-1.
      3eff65f8
    • Bo Meng's avatar
      [SPARK-15468][SQL] fix some typos · 72288fd6
      Bo Meng authored
      ## What changes were proposed in this pull request?
      
      Fix some typos while browsing the codes.
      
      ## How was this patch tested?
      
      None and obvious.
      
      Author: Bo Meng <mengbo@hotmail.com>
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #13246 from bomeng/typo.
      72288fd6
    • Liang-Chi Hsieh's avatar
      [SPARK-15430][SQL] Fix potential ConcurrentModificationException for ListAccumulator · 7920296b
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      In `ListAccumulator` we create an unmodifiable view of the underlying list. However, this does not prevent the underlying list from being modified further, so while we access the unmodifiable view, the underlying list can be modified at the same time. This can cause a `java.util.ConcurrentModificationException`, which we have observed in recent tests.

      To fix it, we can copy the underlying list and create the unmodifiable view of that copy instead.
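
      A minimal sketch of the fix pattern, with the accumulator reduced to its essence (`ListHolder` is an illustrative stand-in, not Spark's actual class):
      ```scala
      import java.util.{ArrayList, Collections, List => JList}

      class ListHolder[T] {
        private val underlying = new ArrayList[T]()

        def add(elem: T): Unit = underlying.synchronized { underlying.add(elem) }

        // Before: Collections.unmodifiableList(underlying) -- the view still reflects concurrent
        // writes, so iterating it can throw ConcurrentModificationException.
        // After: copy first, then wrap the copy, so readers get a stable snapshot.
        def value: JList[T] = underlying.synchronized {
          Collections.unmodifiableList(new ArrayList[T](underlying))
        }
      }
      ```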
      
      ## How was this patch tested?
      The exception might be difficult to reproduce in a test. Existing tests should still pass.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #13211 from viirya/fix-concurrentmodify.
      7920296b
    • Tathagata Das's avatar
      [SPARK-15428][SQL] Disable multiple streaming aggregations · 1ffa608b
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Incrementalizing plans with multiple streaming aggregations is tricky, and we don't have the necessary "delta" support to implement it correctly, so this disables support for multiple streaming aggregations.
      
      ## How was this patch tested?
      Additional unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13210 from tdas/SPARK-15428.
      1ffa608b
    • Reynold Xin's avatar
      [SPARK-15459][SQL] Make Range logical and physical explain consistent · 845e447f
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch simplifies the implementation of the Range operator and makes the explain string consistent between the logical plan and the physical plan. To do this, RangeExec now embeds a Range logical plan.
      
      Before this patch (note that the logical Range and physical Range actually output different information):
      ```
      == Optimized Logical Plan ==
      Range 0, 100, 2, 2, [id#8L]
      
      == Physical Plan ==
      *Range 0, 2, 2, 50, [id#8L]
      ```
      
      After this patch:
      If step size is 1:
      ```
      == Optimized Logical Plan ==
      Range(0, 100, splits=2)
      
      == Physical Plan ==
      *Range(0, 100, splits=2)
      ```
      
      If step size is not 1:
      ```
      == Optimized Logical Plan ==
      Range (0, 100, step=2, splits=2)
      
      == Physical Plan ==
      *Range (0, 100, step=2, splits=2)
      ```
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13239 from rxin/SPARK-15459.
      845e447f
    • gatorsmile's avatar
      [SPARK-15312][SQL] Detect Duplicate Key in Partition Spec and Table Properties · a11175ee
      gatorsmile authored
      #### What changes were proposed in this pull request?
      When there are duplicate keys in the partition specs or table properties, we always use the last value and ignore all the previous values. This is caused by the function call `toMap`.
      
      Partition specs and table properties are widely used in multiple DDL statements.
      
      This PR detects the duplicates and issues an exception if any are found.
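
      A small hedged sketch of the detection idea; the function name and exception type are illustrative, not the ones used in the PR:
      ```scala
      // Illustrative only: detect duplicate keys before collapsing key/value pairs with toMap.
      def checkDuplicateKeys(props: Seq[(String, String)], context: String): Map[String, String] = {
        val duplicates = props.groupBy(_._1).filter(_._2.size > 1).keys
        if (duplicates.nonEmpty) {
          throw new IllegalArgumentException(
            s"Found duplicate keys in $context: ${duplicates.mkString(", ")}")
        }
        props.toMap
      }

      // checkDuplicateKeys(Seq("a" -> "1", "a" -> "2"), "partition spec")   // throws
      // checkDuplicateKeys(Seq("a" -> "1", "b" -> "2"), "table properties") // Map(a -> 1, b -> 2)
      ```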
      
      #### How was this patch tested?
      Added test cases in DDLSuite
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13095 from gatorsmile/detectDuplicate.
      a11175ee
    • Reynold Xin's avatar
      Small documentation and style fix. · 6d0bfb96
      Reynold Xin authored
      6d0bfb96
    • gatorsmile's avatar
      [SPARK-15396][SQL][DOC] It can't connect hive metastore database · 6cb8f836
      gatorsmile authored
      #### What changes were proposed in this pull request?
      The `hive.metastore.warehouse.dir` property in hive-site.xml is deprecated since Spark 2.0.0. Users might not be able to connect to the existing metastore if they do not use the new conf parameter `spark.sql.warehouse.dir`.
      
      This PR updates the documentation and examples to explain the latest changes in how the default database location is configured.
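
      A hedged example of what the updated docs describe (the warehouse path shown is arbitrary):
      ```scala
      import org.apache.spark.sql.SparkSession

      // spark.sql.warehouse.dir replaces the deprecated hive.metastore.warehouse.dir setting;
      // the path below is just an example.
      val spark = SparkSession.builder()
        .appName("warehouse-demo")
        .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
        .enableHiveSupport()
        .getOrCreate()
      ```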
      
      Below is the screenshot of the latest generated docs:
      
      <img width="681" alt="screenshot 2016-05-20 08 38 10" src="https://cloud.githubusercontent.com/assets/11567269/15433296/a05c4ace-1e66-11e6-8d2b-73682b32e9c2.png">
      
      <img width="789" alt="screenshot 2016-05-20 08 53 26" src="https://cloud.githubusercontent.com/assets/11567269/15433734/645dc42e-1e68-11e6-9476-effc9f8721bb.png">
      
      <img width="789" alt="screenshot 2016-05-20 08 53 37" src="https://cloud.githubusercontent.com/assets/11567269/15433738/68569f92-1e68-11e6-83d3-ef5bb221a8d8.png">
      
      No change is made in the R's example.
      
      <img width="860" alt="screenshot 2016-05-20 08 54 38" src="https://cloud.githubusercontent.com/assets/11567269/15433779/965b8312-1e68-11e6-8bc4-53c88ceacde2.png">
      
      #### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13225 from gatorsmile/document.
      6cb8f836
    • Jurriaan Pruis's avatar
      [SPARK-15415][SQL] Fix BroadcastHint when autoBroadcastJoinThreshold is 0 or -1 · 223f6339
      Jurriaan Pruis authored
      ## What changes were proposed in this pull request?
      
      This PR makes BroadcastHint more deterministic by using a special isBroadcastable property
      instead of setting the sizeInBytes to 1.
      
      See https://issues.apache.org/jira/browse/SPARK-15415
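
      A small example of supplying the hint (dataset sizes are arbitrary; the comment paraphrases the intent of this PR):
      ```scala
      import org.apache.spark.sql.functions.broadcast

      val large = spark.range(1000000).toDF("id")
      val small = spark.range(100).toDF("id")

      // With this PR the hint survives even when spark.sql.autoBroadcastJoinThreshold is 0 or -1,
      // because the plan carries an explicit isBroadcastable flag instead of a faked size.
      large.join(broadcast(small), "id").explain()
      ```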
      
      ## How was this patch tested?
      
      Added test cases checking that a broadcast hash join is included in the plan when the BroadcastHint is supplied, as well as tests for propagation of the hint through joins.
      
      Author: Jurriaan Pruis <email@jurriaanpruis.nl>
      
      Closes #13244 from jurriaan/broadcast-hint.
      223f6339
  3. May 21, 2016
    • xin Wu's avatar
      [SPARK-15206][SQL] add testcases for distinct aggregate in having clause · df9adb5e
      xin Wu authored
      ## What changes were proposed in this pull request?
      Add new test cases covering distinct aggregates in the HAVING clause to the 2.0 branch.
      This is a follow-up to PR [#12974](https://github.com/apache/spark/pull/12974), which targeted the 1.6 branch.
      
      Author: xin Wu <xinwu@us.ibm.com>
      
      Closes #12984 from xwu0226/SPARK-15206.
      df9adb5e
    • gatorsmile's avatar
      [SPARK-15330][SQL] Implement Reset Command · 8f0a3d5b
      gatorsmile authored
      #### What changes were proposed in this pull request?
      Like `Set` Command in Hive, `Reset` is also supported by Hive. See the link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
      
      Below is the related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-3202
      
      This PR implements such a command for resetting the SQL-related configuration to its default values. One of the use cases shown in HIVE-3202 is listed below:
      
      > For the purpose of optimization we set various configs per query. It's worthy but all those configs should be reset every time for next query.
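
      For example (the setting and value are illustrative):
      ```scala
      // Tune settings for a particular query...
      spark.sql("SET spark.sql.shuffle.partitions=10")
      // ...run the query, then restore every SQL configuration to its default value in one step.
      spark.sql("RESET")
      ```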
      
      #### How was this patch tested?
      Added a test case.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #13121 from gatorsmile/resetCommand.
      8f0a3d5b
    • Ergin Seyfe's avatar
      [SPARK-15280][Input/Output] Refactored OrcOutputWriter and moved serialization to a new class. · c18fa464
      Ergin Seyfe authored
      ## What changes were proposed in this pull request?
      Refactoring: Separated ORC serialization logic from OrcOutputWriter and moved to a new class called OrcSerializer.
      
      ## How was this patch tested?
      Manual tests & existing tests.
      
      Author: Ergin Seyfe <eseyfe@fb.com>
      
      Closes #13066 from seyfe/orc_serializer.
      c18fa464
    • Reynold Xin's avatar
      [SPARK-15452][SQL] Mark aggregator API as experimental · 201a51f3
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      The Aggregator API was introduced in 2.0 for Dataset. All typed Dataset APIs should still be marked as experimental in 2.0.
      
      ## How was this patch tested?
      N/A - annotation only change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13226 from rxin/SPARK-15452.
      201a51f3
    • Dilip Biswal's avatar
      [SPARK-15114][SQL] Column name generated by typed aggregate is super verbose · 5e1ee289
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
      
      Generate a shorter default alias for `AggregateExpression`. In this PR, the aggregate function name along with an index is used to generate the alias name.
      
      ```SQL
      val ds = Seq(1, 3, 2, 5).toDS()
      ds.select(typed.sum((i: Int) => i), typed.avg((i: Int) => i)).show()
      ```
      
      Output before change.
      ```SQL
      +-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
      |typedsumdouble(unresolveddeserializer(upcast(input[0, int], IntegerType, - root class: "scala.Int"), value#1), upcast(value))|typedaverage(unresolveddeserializer(upcast(input[0, int], IntegerType, - root class: "scala.Int"), value#1), newInstance(class scala.Tuple2))|
      +-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
      |                                                                                                                         11.0|                                                                                                                                         2.75|
      +-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
      ```
      Output after change:
      ```SQL
      +-----------------+---------------+
      |typedsumdouble_c1|typedaverage_c2|
      +-----------------+---------------+
      |             11.0|           2.75|
      +-----------------+---------------+
      ```
      
      Note: There is one test in ParquetSuites.scala which shows that the system-picked alias
      name is not usable and is rejected. [test](https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala#L672-#L687)
      ## How was this patch tested?
      
      A new test was added in DataSetAggregatorSuite.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #13045 from dilipbiswal/spark-15114.
      5e1ee289
    • Dongjoon Hyun's avatar
      [SPARK-15462][SQL][TEST] `unresolved === false` is enough in testcases. · f39621c9
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      In the `catalyst` module alone, there exist 8 evaluation test cases on unresolved expressions. But in real-world situations those cases don't happen, since exceptions are raised before evaluation.
      ```scala
      scala> sql("select format_number(null, 3)")
      res0: org.apache.spark.sql.DataFrame = [format_number(CAST(NULL AS DOUBLE), 3): string]
      scala> sql("select format_number(cast(null as NULL), 3)")
      org.apache.spark.sql.catalyst.parser.ParseException:
      DataType null() is not supported.(line 1, pos 34)
      ```
      
      This PR makes those testcases more realistic.
      ```scala
      -    checkEvaluation(FormatNumber(Literal.create(null, NullType), Literal(3)), null)
      +    assert(FormatNumber(Literal.create(null, NullType), Literal(3)).resolved === false)
      ```
      Also, this PR removes a redundant `resolved` check in the `FoldablePropagation` optimizer.
      
      ## How was this patch tested?
      
      Pass the modified Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13241 from dongjoon-hyun/SPARK-15462.
      f39621c9
    • Sandeep Singh's avatar
      [SPARK-15445][SQL] Build fails for java 1.7 after adding java.math.BigInteger support · 666bf2e8
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
      Use longValue() and then manually check whether the value is within the range of a long.
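
      A hedged sketch of the Java 1.7-compatible check (`longValueExact()` only exists from Java 8, hence the manual range check; the names are illustrative):
      ```scala
      import java.math.BigInteger

      // Manually check that a BigInteger fits into a Long, without Java 8's longValueExact().
      def toLongOrFail(b: BigInteger): Long = {
        val min = BigInteger.valueOf(Long.MinValue)
        val max = BigInteger.valueOf(Long.MaxValue)
        if (b.compareTo(min) < 0 || b.compareTo(max) > 0) {
          throw new ArithmeticException(s"$b does not fit into a Long")
        }
        b.longValue()
      }
      ```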
      
      ## How was this patch tested?
      Existing tests
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #13223 from techaddict/SPARK-15445.
      666bf2e8
    • Reynold Xin's avatar
      [SPARK-15424][SPARK-15437][SPARK-14807][SQL] Revert Create a hivecontext-compatibility module · 45b7557e
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      I initially asked to create a hivecontext-compatibility module to put the HiveContext in. But we are so close to the Spark 2.0 release and there is only a single class in it. It seems overkill, and more inconvenient, to have an entire package for a single class.
      
      ## How was this patch tested?
      Tests were moved.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13207 from rxin/SPARK-15424.
      45b7557e
  4. May 20, 2016
    • Bryan Cutler's avatar
      [SPARK-15456][PYSPARK] Fixed PySpark shell context initialization when HiveConf not present · 021c1970
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      When the PySpark shell cannot find HiveConf, it falls back to creating a SparkSession from a SparkContext. This fixes a bug caused by referencing a SparkContext variable before it was initialized.
      
      ## How was this patch tested?
      
      Manually starting PySpark shell and using the SparkContext
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #13237 from BryanCutler/pyspark-shell-session-context-SPARK-15456.
      021c1970
    • Zheng RuiFeng's avatar
      [SPARK-15031][EXAMPLE] Use SparkSession in examples · 127bf1bb
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Use `SparkSession` according to [SPARK-15031](https://issues.apache.org/jira/browse/SPARK-15031)
      
      `MLLIB` is no longer recommended, so examples in `MLLIB` are ignored in this PR.
      `StreamingContext` cannot be obtained directly from `SparkSession`, so examples in `Streaming` are ignored too.
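
      A minimal sketch of the builder pattern the examples now use (the app name and master are arbitrary):
      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .appName("ExampleApp")
        .master("local[*]")   // for local runs; omit when submitting to a cluster
        .getOrCreate()

      val df = spark.range(10).toDF("id")
      df.show()

      spark.stop()
      ```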
      
      cc andrewor14
      
      ## How was this patch tested?
      manual tests with spark-submit
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #13164 from zhengruifeng/use_sparksession_ii.
      127bf1bb
    • tedyu's avatar
      [SPARK-15273] YarnSparkHadoopUtil#getOutOfMemoryErrorArgument should respect... · 06c9f520
      tedyu authored
      [SPARK-15273] YarnSparkHadoopUtil#getOutOfMemoryErrorArgument should respect OnOutOfMemoryError parameter given by user
      
      ## What changes were proposed in this pull request?
      
      As Nirav reported in this thread:
      http://search-hadoop.com/m/q3RTtdF3yNLMd7u
      
      YarnSparkHadoopUtil#getOutOfMemoryErrorArgument previously specified 'kill %p' unconditionally.
      We should respect the parameter given by the user.
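
      A hedged example of the kind of user-supplied setting that should now be respected (the exact handler command is up to the user):
      ```scala
      import org.apache.spark.SparkConf

      // If the user already supplies an OnOutOfMemoryError handler via the executor Java options,
      // YARN should not add its own unconditional "kill %p" on top of it.
      val conf = new SparkConf()
        .set("spark.executor.extraJavaOptions",
             "-XX:OnOutOfMemoryError='kill -9 %p' -XX:+HeapDumpOnOutOfMemoryError")
      ```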
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #13057 from tedyu/master.
      06c9f520
    • Sameer Agarwal's avatar
      [SPARK-15078] [SQL] Add all TPCDS 1.4 benchmark queries for SparkSQL · a78d6ce3
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      Now that SparkSQL supports all TPC-DS queries, this patch adds all 99 benchmark queries inside SparkSQL.
      
      ## How was this patch tested?
      
      Benchmark only
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #13188 from sameeragarwal/tpcds-all.
      a78d6ce3
    • Reynold Xin's avatar
      [SPARK-15454][SQL] Filter out files starting with _ · dcac8e6f
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Many other systems (e.g. Impala) use _xxx as staging files, and Spark should not be reading those files.
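
      A minimal sketch of the filtering rule itself, separate from the listing code it is wired into (whether specific metadata files are special-cased is not covered here):
      ```scala
      // Files whose names start with "_" (e.g. _SUCCESS, Impala staging output) or "." are
      // treated as hidden and skipped when listing data files.
      def shouldSkip(fileName: String): Boolean =
        fileName.startsWith("_") || fileName.startsWith(".")

      // shouldSkip("_SUCCESS")       // true
      // shouldSkip("part-00000.csv") // false
      ```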
      
      ## How was this patch tested?
      Added a unit test case.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13227 from rxin/SPARK-15454.
      dcac8e6f
    • Davies Liu's avatar
      [SPARK-15438][SQL] improve explain of whole stage codegen · 0e70fd61
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, the explain of a query with whole-stage codegen looks like this
      ```
      >>> df = sqlCtx.range(1000);df2 = sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 'id').explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [id#1L]
      :     +- BroadcastHashJoin [id#1L], [id#4L], Inner, BuildRight, None
      :        :- Range 0, 1, 4, 1000, [id#1L]
      :        +- INPUT
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint]))
         +- WholeStageCodegen
            :  +- Range 0, 1, 4, 1000, [id#4L]
      ```
      
      The problem is that this plan looks very different from the logical plan, making it hard to understand (especially when the logical plan is not shown alongside).
      
      This PR will change it to:
      
      ```
      >>> df = sqlCtx.range(1000);df2 = sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 'id').explain()
      == Physical Plan ==
      *Project [id#0L]
      +- *BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight, None
         :- *Range 0, 1, 4, 1000, [id#0L]
         +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
            +- *Range 0, 1, 4, 1000, [id#3L]
      ```
      
      The `*` before an operator means that it is part of whole-stage codegen, which is easy to understand.
      
      ## How was this patch tested?
      
      Manually ran some queries and check the explain.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13204 from davies/explain_codegen.
      0e70fd61
    • Michael Armbrust's avatar
      [SPARK-10216][SQL] Revert "[] Avoid creating empty files during overwrit… · 2ba3ff04
      Michael Armbrust authored
      This reverts commit 8d05a7a9 from #12855, which seems to have caused regressions when working with empty DataFrames.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #13181 from marmbrus/revert12855.
      2ba3ff04
    • Shixiong Zhu's avatar
      [SPARK-15190][SQL] Support using SQLUserDefinedType for case classes · dfa61f7b
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Right now, inferring the schema for case classes happens before looking up the SQLUserDefinedType annotation, so the SQLUserDefinedType annotation for case classes doesn't work.
      
      This PR simply changes the inferring order to resolve it. I also reenabled the java.math.BigDecimal test and added two tests for `List`.
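
      A rough sketch of the pattern this enables (`Point` and `PointUDT` are made-up names; note the UDT API is not public in Spark 2.0, so this pattern applies inside Spark's own code and tests):
      ```scala
      import org.apache.spark.sql.types._
      import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}

      // With this PR, the annotation below is honored even though Spark could also
      // infer a schema for the case class directly.
      @SQLUserDefinedType(udt = classOf[PointUDT])
      case class Point(x: Double, y: Double)

      class PointUDT extends UserDefinedType[Point] {
        override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)
        override def serialize(p: Point): Any = new GenericArrayData(Array[Any](p.x, p.y))
        override def deserialize(datum: Any): Point = datum match {
          case a: ArrayData => Point(a.getDouble(0), a.getDouble(1))
        }
        override def userClass: Class[Point] = classOf[Point]
      }
      ```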
      
      ## How was this patch tested?
      
      `encodeDecodeTest(UDTCaseClass(new java.net.URI("http://spark.apache.org/")), "udt with case class")`
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12965 from zsxwing/SPARK-15190.
      dfa61f7b
    • Kousuke Saruta's avatar
      [SPARK-15165] [SPARK-15205] [SQL] Introduce place holder for comments in generated code · 22947cd0
      Kousuke Saruta authored
      ## What changes were proposed in this pull request?
      
      This PR introduces placeholders for comments in generated code; the purpose is the same as #12939, but this approach is much safer.

      Generated code that is submitted for compilation doesn't include the actual comments but placeholders instead.

      Placeholders in the generated code are replaced with the actual comments only at logging time.
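
      A simplified sketch of the placeholder idea (the real implementation lives in the codegen context; the class and method names here are illustrative):
      ```scala
      import scala.collection.mutable

      // Generated code carries short placeholder tokens instead of the comment text itself, so
      // comments can never break compilation; they are substituted back only when logging.
      class CommentPlaceholders {
        private val comments = mutable.Map[String, String]()
        private var counter = 0

        def register(text: String): String = {
          counter += 1
          val token = s"/*placeholder_$counter*/"
          comments(token) = s"/* $text */"
          token
        }

        // Only when the generated source is logged are the placeholders expanded.
        def expandForLogging(code: String): String =
          comments.foldLeft(code) { case (acc, (token, comment)) => acc.replace(token, comment) }
      }
      ```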
      
      Also, this PR can resolve SPARK-15205.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #12979 from sarutak/SPARK-15205.
      22947cd0