  1. Mar 05, 2015
    • [SQL] Make Strategies a public developer API · eb48fd6e
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4920 from marmbrus/openStrategies and squashes the following commits:
      
      cbc35c0 [Michael Armbrust] [SQL] Make Strategies a public developer API
      eb48fd6e
    • [SPARK-6163][SQL] jsonFile should be backed by the data source API · 1b4bb25c
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-6163
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4896 from yhuai/SPARK-6163 and squashes the following commits:
      
      45e023e [Yin Huai] Address @chenghao-intel's comment.
      2e8734e [Yin Huai] Use JSON data source for jsonFile.
      92a4a33 [Yin Huai] Test.
      1b4bb25c
    • [SPARK-6145][SQL] fix ORDER BY on nested fields · 5873c713
      Wenchen Fan authored
      Based on #4904 with style errors fixed.
      
      `LogicalPlan#resolve` will produce not only `Attribute`s but also "`GetField` chains".
      So in `ResolveSortReferences`, after resolving the ordering expressions, we should collect not just the resulting `Attribute`s, but also the `Attribute`s at the bottom of those "`GetField` chains".
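      A hypothetical end-to-end illustration of the case this fixes, using the modern SparkSession entry point (all data and names are invented):

      ```scala
      import org.apache.spark.sql.SparkSession

      // Invented data: `address` is a struct, so `address.city` resolves to a
      // GetField chain rooted at the `address` attribute, not a plain Attribute.
      case class Address(city: String)
      case class Person(name: String, address: Address)

      val spark = SparkSession.builder().master("local[*]").appName("nested-orderby").getOrCreate()
      import spark.implicits._

      Seq(Person("b", Address("y")), Person("a", Address("x")))
        .toDF().createOrReplaceTempView("people")

      // Sorting on a nested field exercises the GetField-chain resolution above.
      spark.sql("SELECT name FROM people ORDER BY address.city").show()
      ```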
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4918 from marmbrus/pr/4904 and squashes the following commits:
      
      997f84e [Michael Armbrust] fix style
      3eedbfc [Wenchen Fan] fix 6145
      5873c713
    • SPARK-6182 [BUILD] spark-parent pom needs to be published for both 2.10 and 2.11 · c9cfba0c
      Sean Owen authored
      Option 1 of 2: Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4912 from srowen/SPARK-6182.1 and squashes the following commits:
      
      eff60de [Sean Owen] Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
      c9cfba0c
    • [SPARK-6153] [SQL] promote guava dep for hive-thriftserver · e06c7dfb
      Daoyuan Wang authored
      For the thriftserver package, Guava is used at runtime.
      
      /cc pwendell
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #4884 from adrian-wang/test and squashes the following commits:
      
      4600ae7 [Daoyuan Wang] only promote for thriftserver
      44dda18 [Daoyuan Wang] promote guava dep for hive
      e06c7dfb
  2. Mar 04, 2015
    • [SPARK-6134][SQL] Fix wrong datatype for casting FloatType and default LongType value in defaultPrimitive · aef8a84e
      Liang-Chi Hsieh authored
      In `CodeGenerator`, the casting on `FloatType` should use `FloatType` instead of `IntegerType`.
      
      Besides, `defaultPrimitive` for `LongType` should be `-1L` instead of `1L`.
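      A simplified sketch of the corrected behavior (the real `CodeGenerator` covers many more types; this standalone mapping is only illustrative):

      ```scala
      import org.apache.spark.sql.types._

      // Generated-code default value per primitive type (simplified).
      def defaultPrimitive(dt: DataType): String = dt match {
        case FloatType   => "-1.0f" // the default must stay a Float, not an Integer
        case LongType    => "-1L"   // was "1L" before this fix
        case IntegerType => "-1"
        case _           => "null"
      }
      ```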
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4870 from viirya/codegen_type and squashes the following commits:
      
      76311dd [Liang-Chi Hsieh] Fix wrong datatype for casting on FloatType. Fix the wrong value for LongType in defaultPrimitive.
      aef8a84e
    • [SPARK-6136] [SQL] Removed JDBC integration tests which depend on docker-client · 76b472f1
      Cheng Lian authored
      Integration test suites in the JDBC data source (`MySQLIntegration` and `PostgresIntegration`) depend on docker-client 2.7.5, which transitively depends on Guava 17.0. Unfortunately, Guava 17.0 causes runtime binary compatibility issues in the tests when Spark is compiled against Hive 0.12.0 or Hadoop 2.4.
      
      Considering `MySQLIntegration` and `PostgresIntegration` are ignored right now, I'd suggest moving them from the Spark project to the [Spark integration tests] [1] project. This PR removes both the JDBC data source integration tests and the docker-client test dependency.
      
      [1]: https://github.com/databricks/spark-integration-tests
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4872 from liancheng/remove-docker-client and squashes the following commits:
      
      1f4169e [Cheng Lian] Removes DockerHacks
      159b24a [Cheng Lian] Removed JDBC integration tests which depends on docker-client
      76b472f1
  3. Mar 03, 2015
    • [SPARK-5310][SQL] Fixes to Docs and Datasources API · 54d19689
      Reynold Xin authored
       - Various Fixes to docs
       - Make data source traits actually interfaces
      
      Based on #4862 but with fixed conflicts.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4868 from marmbrus/pr/4862 and squashes the following commits:
      
      fe091ea [Michael Armbrust] Merge remote-tracking branch 'origin/master' into pr/4862
      0208497 [Reynold Xin] Test fixes.
      34e0a28 [Reynold Xin] [SPARK-5310][SQL] Various fixes to Spark SQL docs.
      54d19689
  4. Mar 02, 2015
    • [SPARK-5950][SQL] Insert array into a metastore table saved as parquet should work when using datasource api · 12599942
      Yin Huai authored
      This PR contains the following changes:
      1. Add a new method, `DataType.equalsIgnoreCompatibleNullability`, which is the middle ground between DataType's equality check and `DataType.equalsIgnoreNullability`. For two data types `from` and `to`, it performs `equalsIgnoreNullability` and additionally checks whether the nullability of `from` is compatible with that of `to` (see the sketch after this list). For example, the nullability of `ArrayType(IntegerType, containsNull = false)` is compatible with that of `ArrayType(IntegerType, containsNull = true)` (for an array without null values, we can always say it may contain null values). However, the nullability of `ArrayType(IntegerType, containsNull = true)` is incompatible with that of `ArrayType(IntegerType, containsNull = false)` (for an array that may have null values, we cannot say it does not have null values).
      2. For the `resolved` field of `InsertIntoTable`, use `equalsIgnoreCompatibleNullability` instead of an equality check of the data types.
      3. For our data source write path, when appending data, we always use the schema of the existing table to write the data. This is important for Parquet, since nullability directly impacts how values are encoded and decoded. If we do not do this, we may see corrupted values when reading a set of Parquet files generated with different nullability settings.
      4. When generating a new Parquet table, we always set nullable/containsNull/valueContainsNull to true. So, we will never be unable to append data just because containsNull/valueContainsNull in an Array/Map column of the existing table has already been set to `false`. This change makes the whole data pipeline more robust.
      5. Update the equality check of the JSON relation. Since JSON does not really care about nullability, `equalsIgnoreNullability` seems a better choice for comparing schemata of JSON tables.
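      A minimal, hypothetical re-implementation of the compatibility rule from point 1, restricted to `ArrayType` (the method this PR adds is private, so the helper name below is invented):

      ```scala
      import org.apache.spark.sql.types._

      // `from` may be written into `to` iff the element types match and `from`
      // does not claim more nullability than `to` can hold.
      def arrayNullabilityCompatible(from: ArrayType, to: ArrayType): Boolean =
        from.elementType == to.elementType && (!from.containsNull || to.containsNull)

      arrayNullabilityCompatible(
        ArrayType(IntegerType, containsNull = false),
        ArrayType(IntegerType, containsNull = true))  // true: widening nullability is safe

      arrayNullabilityCompatible(
        ArrayType(IntegerType, containsNull = true),
        ArrayType(IntegerType, containsNull = false)) // false: narrowing is unsafe
      ```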
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-5950
      
      Thanks viirya for the initial work in #4729.
      
      cc marmbrus liancheng
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4826 from yhuai/insertNullabilityCheck and squashes the following commits:
      
      3b61a04 [Yin Huai] Revert change on equals.
      80e487e [Yin Huai] asNullable in UDT.
      587d88b [Yin Huai] Make methods private.
      0cb7ea2 [Yin Huai] marmbrus's comments.
      3cec464 [Yin Huai] Cheng's comments.
      486ed08 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
      d3747d1 [Yin Huai] Remove unnecessary change.
      8360817 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
      8a3f237 [Yin Huai] Use equalsIgnoreNullability instead of equality check.
      0eb5578 [Yin Huai] Fix tests.
      f6ed813 [Yin Huai] Update old parquet path.
      e4f397c [Yin Huai] Unit tests.
      b2c06f8 [Yin Huai] Ignore nullability in JSON relation's equality check.
      8bd008b [Yin Huai] nullable, containsNull, and valueContainsNull will be always true for parquet data.
      bf50d73 [Yin Huai] When appending data, we use the schema of the existing table instead of the schema of the new data.
      0a703e7 [Yin Huai] Test failed again since we cannot read correct content.
      9a26611 [Yin Huai] Make InsertIntoTable happy.
      8f19fe5 [Yin Huai] equalsIgnoreCompatibleNullability
      4ec17fd [Yin Huai] Failed test.
      12599942
    • [SPARK-6082] [SQL] Provides better error message for malformed rows when caching tables · 1a49496b
      Cheng Lian authored
      Constructs like Hive `TRANSFORM` may generate malformed rows (via badly authored external scripts for example). I'm a bit hesitant to have this feature, since it introduces per-tuple cost when caching tables. However, considering caching tables is usually a one-time cost, this is probably worth having.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4842 from liancheng/spark-6082 and squashes the following commits:
      
      b05dbff [Cheng Lian] Provides better error message for malformed rows when caching tables
      1a49496b
    • [SPARK-6114][SQL] Avoid metastore conversions before plan is resolved · 8223ce6a
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4855 from marmbrus/explodeBug and squashes the following commits:
      
      a712249 [Michael Armbrust] [SPARK-6114][SQL] Avoid metastore conversions before plan is resolved
      8223ce6a
    • [SPARK-6040][SQL] Fix the percent bug in tablesample · 582e5a24
      q00251598 authored
      A HiveQL expression like `select count(1) from src tablesample(1 percent);` means taking a 1% sample. But in the current version of Spark it means 100%.
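      A sketch of the intended conversion (the helper name is invented; the actual fix adjusts the fraction when translating the HiveQL clause):

      ```scala
      // TABLESAMPLE(n PERCENT) should become a sample fraction of n / 100.0,
      // clamped to [0, 1], instead of being passed through unscaled.
      def percentToFraction(percent: Double): Double =
        math.min(math.max(percent / 100.0, 0.0), 1.0)

      percentToFraction(1.0)   // 0.01 -> a 1% sample
      percentToFraction(100.0) // 1.0  -> the full table
      ```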
      
      Author: q00251598 <qiyadong@huawei.com>
      
      Closes #4789 from watermen/SPARK-6040 and squashes the following commits:
      
      2453ebe [q00251598] check and adjust the fraction.
      582e5a24
    • [Minor] Fix doc typo for describing primitiveTerm effectiveness condition · 3f9def81
      Liang-Chi Hsieh authored
      It should be `true` instead of `false`?
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4762 from viirya/doc_fix and squashes the following commits:
      
      2e37482 [Liang-Chi Hsieh] Fix doc.
      3f9def81
    • [DOCS] Refactored Dataframe join comment to use correct parameter ordering · d9a8bae7
      Paul Power authored
      The API signature for join requires the JoinType to be the third parameter. The code examples provided for join show the JoinType being passed as the second parameter, resulting in errors (e.g. `df1.join(df2, "outer", $"df1Key" === $"df2Key")`). The correct sample code is `df1.join(df2, $"df1Key" === $"df2Key", "outer")`.
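      A self-contained sketch of the corrected call, using the modern `SparkSession` entry point (the frames and key columns are assumptions):

      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("join-order").getOrCreate()
      import spark.implicits._

      val df1 = Seq((1, "a"), (2, "b")).toDF("df1Key", "v1")
      val df2 = Seq((2, "x"), (3, "y")).toDF("df2Key", "v2")

      // The join type is the third argument, after the join expression:
      df1.join(df2, $"df1Key" === $"df2Key", "outer").show()
      ```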
      
      Author: Paul Power <paul.power@peerside.com>
      
      Closes #4847 from peerside/master and squashes the following commits:
      
      ebc1efa [Paul Power] Merge pull request #1 from peerside/peerside-patch-1
      e353340 [Paul Power] Updated comments use correct sample code for Dataframe joins
      d9a8bae7
    • [SPARK-5741][SQL] Support paths containing commas in HiveContext · 9ce12aaf
      q00251598 authored
      When running `select * from nzhang_part where hr = 'file,';`, it throws `java.lang.IllegalArgumentException: Can not create a Path from an empty string`, because the HDFS path contains a comma and `FileInputFormat.setInputPaths` splits paths on commas.
      
      ### SQL
      ```
      set hive.merge.mapfiles=true;
      set hive.merge.mapredfiles=true;
      set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
      set hive.exec.dynamic.partition=true;
      set hive.exec.dynamic.partition.mode=nonstrict;
      
      create table nzhang_part like srcpart;
      
      insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select key, value, hr from srcpart where ds='2008-04-08';
      
      insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select key, value from srcpart where ds='2008-04-08';
      
      insert overwrite table nzhang_part partition (ds='2010-08-15', hr)
      select * from (
      select key, value, hr from srcpart where ds='2008-04-08'
      union all
      select '1' as key, '1' as value, 'file,' as hr from src limit 1) s;
      
      select * from nzhang_part where hr = 'file,';
      ```
      
      ### Error Log
      ```
      15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part where hr = 'file,']
      java.lang.IllegalArgumentException: Can not create a Path from an empty string
      at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
      at org.apache.hadoop.fs.Path.<init>(Path.java:135)
      at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241)
      at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400)
      at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251)
      at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
      at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
      at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
      at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
      at scala.Option.map(Option.scala:145)
      at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
      at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)
      ```
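      Per commit 1db1a1c, the fix is to use the `Path`-varargs overload so a comma inside a single path is not treated as a separator. A minimal sketch (the wrapper function is invented):

      ```scala
      import org.apache.hadoop.fs.Path
      import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

      def setSingleInputPath(jobConf: JobConf, path: String): Unit = {
        // new Path(path) keeps the comma literal; the String-based overload
        // would split the value on commas and produce an empty path segment.
        FileInputFormat.setInputPaths(jobConf, new Path(path))
      }
      ```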
      
      Author: q00251598 <qiyadong@huawei.com>
      
      Closes #4532 from watermen/SPARK-5741 and squashes the following commits:
      
      9758ab1 [q00251598] fix bug
      1db1a1c [q00251598] use setInputPaths(Job job, Path... inputPaths)
      b788a72 [q00251598] change FileInputFormat.setInputPaths to jobConf.set and add test suite
      9ce12aaf
    • [SPARK-6052][SQL] In JSON schema inference, we should always set containsNull of an ArrayType to true · 3efd8bb6
      Yin Huai authored
      Always set `containsNull = true` when inferring the schema of JSON datasets. If we set `containsNull` based on the records we scanned, we may miss arrays with null values during sampling. Also, because future data can have arrays with null values, always setting `containsNull = true` is the more robust choice if we convert JSON data to Parquet.
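      An illustration with the 2015-era API (assuming a `SparkContext` `sc` and a `SQLContext` `sqlContext` are already in scope):

      ```scala
      val records = sc.parallelize(Seq(
        """{"a": [1, 2]}""",      // a sample that sees only this record would
        """{"a": [3, null]}"""))  // wrongly infer containsNull = false

      val df = sqlContext.jsonRDD(records)
      df.printSchema()
      // root
      //  |-- a: array (nullable = true)
      //  |    |-- element: integer (containsNull = true)  <- always true after this change
      ```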
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-6052
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4806 from yhuai/jsonArrayContainsNull and squashes the following commits:
      
      05eab9d [Yin Huai] Change containsNull to true.
      3efd8bb6
    • [SPARK-6073][SQL] Need to refresh metastore cache after appending data in CreateMetastoreDataSourceAsSelect · 39a54b40
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-6073
      
      liancheng
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4824 from yhuai/refreshCache and squashes the following commits:
      
      b9542ef [Yin Huai] Refresh metadata cache in the Catalog in CreateMetastoreDataSourceAsSelect.
      39a54b40
  5. Mar 01, 2015
    • [SPARK-6074] [sql] Package pyspark sql bindings. · fd8d283e
      Marcelo Vanzin authored
      This is needed for the SQL bindings to work on Yarn.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #4822 from vanzin/SPARK-6074 and squashes the following commits:
      
      fb52001 [Marcelo Vanzin] [SPARK-6074] [sql] Package pyspark sql bindings.
      fd8d283e
  6. Feb 28, 2015
    • [SPARK-5775] [SQL] BugFix: GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table · e6003f0a
      Cheng Lian authored
      This PR adapts anselmevignon's #4697 to master and branch-1.3. Please refer to PR description of #4697 for details.
      
      Author: Cheng Lian <lian@databricks.com>
      Author: Cheng Lian <liancheng@users.noreply.github.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4792 from liancheng/spark-5775 and squashes the following commits:
      
      538f506 [Cheng Lian] Addresses comments
      cee55cf [Cheng Lian] Merge pull request #4 from yhuai/spark-5775-yin
      b0b74fb [Yin Huai] Remove runtime pattern matching.
      ca6e038 [Cheng Lian] Fixes SPARK-5775
      e6003f0a
  7. Feb 27, 2015
    • [SPARK-5751] [SQL] Sets SPARK_HOME as SPARK_PID_DIR when running Thrift server test suites · 8c468a66
      Cheng Lian authored
      This is a follow-up of #4720. By default, `spark-daemon.sh` writes PID files under `/tmp`, which makes it impossible to start multiple server instances simultaneously. This PR sets `SPARK_PID_DIR` to the Spark home directory to work around this problem.
      
      Many thanks to chenghao-intel for pointing out this issue!
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4758 from liancheng/thriftserver-pid-dir and squashes the following commits:
      
      252fa0f [Cheng Lian] Uses temporary directory as Thrift server PID directory
      1b3d1e3 [Cheng Lian] Sets SPARK_HOME as SPARK_PID_DIR when running Thrift server test suites
      8c468a66
  8. Feb 26, 2015
    • [SPARK-6024][SQL] When a data source table has too many columns, its schema cannot be stored in metastore · 5e5ad655
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-6024
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4795 from yhuai/wideSchema and squashes the following commits:
      
      4882e6f [Yin Huai] Address comments.
      73e71b4 [Yin Huai] Address comments.
      143927a [Yin Huai] Simplify code.
      cc1d472 [Yin Huai] Make the schema wider.
      12bacae [Yin Huai] If the JSON string of a schema is too large, split it before storing it in metastore.
      e9b4f70 [Yin Huai] Failed test.
      5e5ad655
    • [SPARK-6037][SQL] Avoiding duplicate Parquet schema merging · 4ad5153f
      Liang-Chi Hsieh authored
      `FilteringParquetRowInputFormat` manually merges Parquet schemas before computing splits. However, this is duplicated work, because the schemas have already been merged in `ParquetRelation2`. We don't need to re-merge them in the `InputFormat`.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4786 from viirya/dup_parquet_schemas_merge and squashes the following commits:
      
      ef78a5a [Liang-Chi Hsieh] Avoiding duplicate Parquet schema merging.
      4ad5153f
    • [SPARK-6007][SQL] Add numRows param in DataFrame.show() · 23586575
      Jacky Li authored
      It is useful to let the user decide the number of rows to show in `DataFrame.show`.
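      A usage sketch (the default row count of 20 is an assumption based on later Spark versions; data is invented):

      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("show-demo").getOrCreate()
      import spark.implicits._

      val df = (1 to 100).toDF("n")
      df.show()   // the default number of rows
      df.show(5)  // with this change, show exactly 5 rows
      ```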
      
      Author: Jacky Li <jacky.likun@huawei.com>
      
      Closes #4767 from jackylk/show and squashes the following commits:
      
      a0e0f4b [Jacky Li] fix testcase
      7cdbe91 [Jacky Li] modify according to comment
      bb54537 [Jacky Li] for Java compatibility
      d7acc18 [Jacky Li] modify according to comments
      981be52 [Jacky Li] add numRows param in DataFrame.show()
      23586575
    • [SPARK-6016][SQL] Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true · 192e42a2
      Yin Huai authored
      Please see JIRA (https://issues.apache.org/jira/browse/SPARK-6016) for details of the bug.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4775 from yhuai/parquetFooterCache and squashes the following commits:
      
      78787b1 [Yin Huai] Remove footerCache in FilteringParquetRowInputFormat.
      dff6fba [Yin Huai] Failed unit test.
      192e42a2
    • [SPARK-6023][SQL] ParquetConversions fails to replace the destination MetastoreRelation of an InsertIntoTable node to ParquetRelation2 · f02394d0
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-6023
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4782 from yhuai/parquetInsertInto and squashes the following commits:
      
      ae7e806 [Yin Huai] Convert MetastoreRelation in InsertIntoTable and InsertIntoHiveTable.
      ba543cd [Yin Huai] More tests.
      50b6d0f [Yin Huai] Update error messages.
      346780c [Yin Huai] Failed test.
      f02394d0
  9. Feb 25, 2015
    • [SPARK-5926] [SQL] make DataFrame.explain leverage queryExecution.logical · 41e2e5ac
      Yanbo Liang authored
      DataFrame.explain returns the wrong result when the query is a DDL command.

      For example, the following two queries should print out the same execution plan, but they do not:
      sql("create table tb as select * from src where key > 490").explain(true)
      sql("explain extended create table tb as select * from src where key > 490")

      This is because DataFrame.explain uses logicalPlan, which has already been forcibly executed; we should use the unexecuted plan, queryExecution.logical, instead.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #4707 from yanboliang/spark-5926 and squashes the following commits:
      
      fa6db63 [Yanbo Liang] logicalPlan is not lazy
      0e40a1b [Yanbo Liang] make DataFrame.explain leverage queryExecution.logical
      41e2e5ac
    • [SPARK-5999][SQL] Remove duplicate Literal matching block · 12dbf98c
      Liang-Chi Hsieh authored
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4760 from viirya/dup_literal and squashes the following commits:
      
      06e7516 [Liang-Chi Hsieh] Remove duplicate Literal matching block.
      12dbf98c
    • [SPARK-6010] [SQL] Merging compatible Parquet schemas before computing splits · e0fdd467
      Cheng Lian authored
      `ReadContext.init` calls `InitContext.getMergedKeyValueMetadata`, which doesn't know how to merge conflicting user-defined key-value metadata and throws an exception. In our case, when dealing with different but compatible schemas, we have different Spark SQL schema JSON strings in different Parquet part-files, which causes this problem. Reading similar Parquet files generated by Hive doesn't suffer from this issue.
      
      In this PR, we manually merge the schemas before passing it to `ReadContext` to avoid the exception.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4768 from liancheng/spark-6010 and squashes the following commits:
      
      9002f0a [Cheng Lian] Fixes SPARK-6010
      e0fdd467
    • [SPARK-5996][SQL] Fix specialized outbound conversions · f84c799e
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4757 from marmbrus/udtConversions and squashes the following commits:
      
      3714aad [Michael Armbrust] [SPARK-5996][SQL] Fix specialized outbound conversions
      f84c799e
  10. Feb 24, 2015
    • [SPARK-5286][SQL] SPARK-5286 followup · 769e092b
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-5286
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4755 from yhuai/SPARK-5286-throwable and squashes the following commits:
      
      4c0c450 [Yin Huai] Catch Throwable instead of Exception.
      769e092b
    • [SPARK-5985][SQL] DataFrame sortBy -> orderBy in Python. · fba11c2f
      Reynold Xin authored
      Also added desc/asc functions for constructing sorting expressions more conveniently, plus a small fix to lift an alias out of a cast expression.
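      A sketch of the resulting style, shown in Scala for consistency with the other examples even though this PR targets the Python API (data and columns are invented):

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.{asc, desc}

      val spark = SparkSession.builder().master("local[*]").appName("orderby-demo").getOrCreate()
      import spark.implicits._

      val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

      // desc/asc build sorting expressions without spelling out column expressions.
      df.orderBy(desc("age"), asc("name")).show()
      ```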
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4752 from rxin/SPARK-5985 and squashes the following commits:
      
      aeda5ae [Reynold Xin] Added Experimental flag to ColumnName.
      047ad03 [Reynold Xin] Lift alias out of cast.
      c9cf17c [Reynold Xin] [SPARK-5985][SQL] DataFrame sortBy -> orderBy in Python.
      fba11c2f
    • [SPARK-5904][SQL] DataFrame Java API test suites. · 53a1ebf3
      Reynold Xin authored
      Added a new test suite to make sure Java DF programs can use varargs properly.
      Also moved all suites into test.org.apache.spark package to make sure the suites also test for method visibility.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4751 from rxin/df-tests and squashes the following commits:
      
      1e8b8e4 [Reynold Xin] Fixed imports and renamed JavaAPISuite.
      a6ca53b [Reynold Xin] [SPARK-5904][SQL] DataFrame Java API test suites.
      53a1ebf3
    • [SPARK-5751] [SQL] [WIP] Revamped HiveThriftServer2Suite for robustness · f816e739
      Cheng Lian authored
      **NOTICE** Do NOT merge this, as we're waiting for #3881 to be merged.
      
      `HiveThriftServer2Suite` has been notorious for its flakiness for a while. This was mostly due to spawning and communicating with external server processes. This PR revamps the test suite for better robustness:
      
      1. Fixes a race condition that occurred while using `tail -f` to check the log file
      
         It's possible that the line we are looking for has already been printed into the log file before we start the `tail -f` process. This PR uses `tail -n +0 -f` to ensure all lines are checked.
      
      2. Retries up to 3 times if the server fails to start
      
         In most cases, the server fails to start because of a port conflict. This PR no longer asks the system to choose an available TCP port; instead, it tries a random port first and retries up to 3 times if the server fails to start (see the sketch after this list).
      
      3. A server instance is reused among all test cases within a single suite
      
         The original `HiveThriftServer2Suite` is split into two test suites, `HiveThriftBinaryServerSuite` and `HiveThriftHttpServerSuite`. Each suite starts a `HiveThriftServer2` instance and reuses it for all of its test cases.
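      A minimal sketch of the random-port retry strategy from point 2 (the helper name and port range are invented for illustration):

      ```scala
      import scala.util.Random

      // Try up to maxAttempts random high ports; return the first port on which
      // `start` succeeds, or None if every attempt fails.
      def startWithRetry(maxAttempts: Int = 3)(start: Int => Boolean): Option[Int] =
        Iterator.continually(10000 + Random.nextInt(50000))
          .take(maxAttempts)
          .find(start)

      // e.g. startWithRetry()(port => tryLaunchThriftServer(port)) // hypothetical launcher
      ```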
      
      **TODO**
      
      - [ ] Starts the Thrift server in foreground once #3881 is merged (adding `--foreground` flag to `spark-daemon.sh`)
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4720 from liancheng/revamp-thrift-server-tests and squashes the following commits:
      
      d6c80eb [Cheng Lian] Relaxes server startup timeout
      6f14eb1 [Cheng Lian] Revamped HiveThriftServer2Suite for robustness
      f816e739
    • [SPARK-5952][SQL] Lock when using hive metastore client · a2b91379
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4746 from marmbrus/hiveLock and squashes the following commits:
      
      8b871cf [Michael Armbrust] [SPARK-5952][SQL] Lock when using hive metastore client
      a2b91379
    • [SPARK-5532][SQL] Repartition should not use external rdd representation · 20123662
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4738 from marmbrus/udtRepart and squashes the following commits:
      
      c06d7b5 [Michael Armbrust] fix compilation
      91c8829 [Michael Armbrust] [SQL][SPARK-5532] Repartition should not use external rdd representation
      20123662
    • [SPARK-5910][SQL] Support for as in selectExpr · 0a59e45e
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4736 from marmbrus/asExprs and squashes the following commits:
      
      5ba97e4 [Michael Armbrust] [SPARK-5910][SQL] Support for as in selectExpr
      0a59e45e
    • [SPARK-5968] [SQL] Suppresses ParquetOutputCommitter WARN logs · 84033313
      Cheng Lian authored
      Please refer to the [JIRA ticket] [1] for the motivation.
      
      [1]: https://issues.apache.org/jira/browse/SPARK-5968
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4744 from liancheng/spark-5968 and squashes the following commits:
      
      caac6a8 [Cheng Lian] Suppresses ParquetOutputCommitter WARN logs
      84033313
  11. Feb 23, 2015
    • [SPARK-5873][SQL] Allow viewing of partially analyzed plans in queryExecution · 1ed57086
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4684 from marmbrus/explainAnalysis and squashes the following commits:
      
      afbaa19 [Michael Armbrust] fix python
      d93278c [Michael Armbrust] fix hive
      e5fa0a4 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explainAnalysis
      52119f2 [Michael Armbrust] more tests
      82a5431 [Michael Armbrust] fix tests
      25753d2 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explainAnalysis
      aee1e6a [Michael Armbrust] fix hive
      b23a844 [Michael Armbrust] newline
      de8dc51 [Michael Armbrust] more comments
      acf620a [Michael Armbrust] [SPARK-5873][SQL] Show partially analyzed plans in query execution
      1ed57086
    • [SPARK-5935][SQL] Accept MapType in the schema provided to a JSON dataset. · 48376bfe
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-5935
      
      Author: Yin Huai <yhuai@databricks.com>
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #4710 from yhuai/jsonMapType and squashes the following commits:
      
      3e40390 [Yin Huai] Remove unnecessary changes.
      f8e6267 [Yin Huai] Fix test.
      baa36e3 [Yin Huai] Accept MapType in the schema provided to jsonFile/jsonRDD.
      48376bfe
  12. Feb 22, 2015
    • [DataFrame] [Typo] Fix the typo · 275b1bef
      Cheng Hao authored
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #4717 from chenghao-intel/typo1 and squashes the following commits:
      
      858d7b0 [Cheng Hao] update the typo
      275b1bef