  1. Sep 07, 2017
    • [SPARK-21939][TEST] Use TimeLimits instead of Timeouts · c26976fe
      Dongjoon Hyun authored
      Since ScalaTest 3.0.0, `org.scalatest.concurrent.Timeouts` is deprecated.
      This PR replaces the deprecated one with `org.scalatest.concurrent.TimeLimits`.
      
      ```scala
      -import org.scalatest.concurrent.Timeouts._
      +import org.scalatest.concurrent.TimeLimits._
      ```
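
      For reference, a minimal sketch of a suite using the new trait (assuming ScalaTest 3.0.x, where `failAfter` also requires an implicit `Signaler`):

      ```scala
      import org.scalatest.FunSuite
      import org.scalatest.concurrent.{Signaler, ThreadSignaler, TimeLimits}
      import org.scalatest.time.SpanSugar._

      class TimeoutExampleSuite extends FunSuite with TimeLimits {
        // Interrupt the test thread when the time limit is exceeded.
        implicit val signaler: Signaler = ThreadSignaler

        test("finishes within the time limit") {
          failAfter(10.seconds) {
            // code under test
          }
        }
      }
      ```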
      
      Pass the existing test suites.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19150 from dongjoon-hyun/SPARK-21939.
      
      Change-Id: I1a1b07f1b97e51e2263dfb34b7eaaa099b2ded5e
    • [SPARK-21912][SQL] ORC/Parquet table should not create invalid column names · eea2b877
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, users hit job abortions while creating or altering ORC/Parquet tables with invalid column names. We had better prevent this by raising an **AnalysisException** that guides users to use aliases instead, as Parquet data source tables already do.
      
      **BEFORE**
      ```scala
      scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
      17/09/04 13:28:21 ERROR Utils: Aborting task
      java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct<a b:int>' but ' ' is found.
      17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted.
      17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
      org.apache.spark.SparkException: Task failed while writing rows.
      ```
      
      **AFTER**
      ```scala
      scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
      17/09/04 13:27:40 ERROR CreateDataSourceTableAsSelectCommand: Failed to write to table orc1
      org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
      ```
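
      The suggested workaround is to alias the column to a valid name, e.g.:

      ```scala
      scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 AS a_b")
      ```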
      
      ## How was this patch tested?
      
      Pass the Jenkins with a new test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19124 from dongjoon-hyun/SPARK-21912.
  2. Sep 05, 2017
    • [SPARK-21845][SQL][TEST-MAVEN] Make codegen fallback of expressions configurable · 2974406d
      gatorsmile authored
      ## What changes were proposed in this pull request?
      We should make the codegen fallback of expressions configurable. So far, it is always on, which can hide compilation bugs in our codegen. Thus, we should also be able to disable the codegen fallback when running test cases.
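
      As a sketch, the fallback is then controlled through a SQL conf along these lines (`spark.sql.codegen.fallback` is my assumption for the exact key):

      ```scala
      // Disable the fallback so codegen compilation bugs fail loudly in tests.
      spark.conf.set("spark.sql.codegen.fallback", "false")
      ```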
      
      ## How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19119 from gatorsmile/fallbackCodegen.
  3. Sep 01, 2017
    • [SPARK-21895][SQL] Support changing database in HiveClient · aba9492d
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Support moving tables across databases in HiveClient's `alterTable`.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19104 from gatorsmile/alterTable.
    • [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala... · 12ab7f7e
      Sean Owen authored
      [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala 2.12 profiles and enable 2.12 compilation build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure
      
      ## What changes were proposed in this pull request?
      
      This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts.
      
      In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11.
      
      It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release.
      
      - Scalatest 2.x -> 3.0.3
      - Chill 0.8.0 -> 0.8.4
      - Clapper 1.0.x -> 1.1.2
      - json4s 3.2.x -> 3.4.2
      - Jackson 2.6.x -> 2.7.9 (required by json4s)
      
      This change does _not_ fully enable a Scala 2.12 build:
      
      - It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here
      - It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too.
      
      What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build.
      
      ## How was this patch tested?
      
      Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18645 from srowen/SPARK-14280.
  4. Aug 31, 2017
    • [SPARK-21878][SQL][TEST] Create SQLMetricsTestUtils · 19b0240d
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Creates `SQLMetricsTestUtils` to hold the utility functions shared by the Hive-specific and the other SQLMetrics test cases.

      Also moves two SQLMetrics test cases from sql/hive to sql/core.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19092 from gatorsmile/rewriteSQLMetrics.
  5. Aug 29, 2017
    • [SPARK-21845][SQL] Make codegen fallback of expressions configurable · 3d0e1742
      gatorsmile authored
      ## What changes were proposed in this pull request?
      We should make the codegen fallback of expressions configurable. So far, it is always on, which can hide compilation bugs in our codegen. Thus, we should also be able to disable the codegen fallback when running test cases.
      
      ## How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19062 from gatorsmile/fallbackCodegen.
    • [SPARK-21848][SQL] Add trait UserDefinedExpression to identify user-defined functions · 8fcbda9c
      Wang Gengliang authored
      ## What changes were proposed in this pull request?
      
      Add trait UserDefinedExpression to identify user-defined functions.
      UDFs can be expensive. In the optimizer we may need to avoid executing a UDF multiple times.
      E.g.
      ```scala
      table.select(UDF as 'a).select('a, ('a + 1) as 'b)
      ```
      If the UDF is expensive in this case, the optimizer should not collapse the projects into
      ```scala
      table.select(UDF as 'a, (UDF+1) as 'b)
      ```
      
      Currently, UDF classes like `PythonUDF` and `HiveGenericUDF` are not defined in Catalyst.
      This PR adds a new trait to make it easier to identify user-defined functions.
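
      A minimal sketch of the idea (the trait name is from this PR; the guard helper below is hypothetical):

      ```scala
      import org.apache.spark.sql.catalyst.expressions.Expression

      // Marker trait mixed into UDF expressions such as PythonUDF or HiveGenericUDF.
      trait UserDefinedExpression

      object UdfGuard {
        // Hypothetical check an optimizer rule could run before collapsing projects.
        def containsUdf(e: Expression): Boolean =
          e.find(_.isInstanceOf[UserDefinedExpression]).isDefined
      }
      ```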
      
      ## How was this patch tested?
      
      Unit test
      
      Author: Wang Gengliang <ltnwgl@gmail.com>
      
      Closes #19064 from gengliangwang/UDFType.
  6. Aug 22, 2017
    • [SPARK-21765] Set isStreaming on leaf nodes for streaming plans. · 3c0c2d09
      Jose Torres authored
      ## What changes were proposed in this pull request?
      All streaming logical plans will now have `isStreaming` set. This involved adding `isStreaming` as a case class argument in a few cases, since a node might be logically streaming or not depending on where it came from.
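
      A hedged sketch of the shape of such a change (the relation name is hypothetical):

      ```scala
      import org.apache.spark.sql.catalyst.expressions.Attribute
      import org.apache.spark.sql.catalyst.plans.logical.LeafNode

      // A leaf that may be streaming or batch depending on where it came from,
      // so isStreaming becomes a constructor argument rather than a fixed value.
      case class ExampleRelation(
          output: Seq[Attribute],
          override val isStreaming: Boolean) extends LeafNode
      ```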
      
      ## How was this patch tested?
      
      Existing unit tests - no functional change is intended in this PR.
      
      Author: Jose Torres <joseph-torres@databricks.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #18973 from joseph-torres/SPARK-21765.
    • [SPARK-21769][SQL] Add a table-specific option for always respecting schemas... · 01a8e462
      gatorsmile authored
      [SPARK-21769][SQL] Add a table-specific option for always respecting schemas inferred/controlled by Spark SQL
      
      ## What changes were proposed in this pull request?
      For Hive-serde tables, we always respect the schema stored in the Hive metastore, because the schema could be altered by other engines that share the same metastore. Thus, we always trust the metastore-controlled schema for Hive-serde tables when the schemas differ (ignoring nullability and case). However, in some scenarios, the Hive metastore can also INCORRECTLY overwrite the schemas, when the table's serde and the Hive metastore's built-in serde differ.
      
      The proposed solution is to introduce a table-specific option for such scenarios. For a specific table, users can make Spark always respect the Spark-inferred/controlled schema instead of trusting the metastore-controlled schema. By default, we trust the Hive metastore-controlled schema.
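
      A sketch of what this could look like (the option key `respectSparkSchema` here is hypothetical; the actual key name is defined in the PR):

      ```scala
      // 'respectSparkSchema' is a hypothetical key used only for illustration.
      sql("""CREATE TABLE t (a INT) STORED AS PARQUET
             TBLPROPERTIES ('respectSparkSchema' = 'true')""")
      ```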
      
      ## How was this patch tested?
      Added a cross-version test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19003 from gatorsmile/respectSparkSchema.
    • [SPARK-21499][SQL] Support creating persistent function for Spark... · 43d71d96
      gatorsmile authored
      [SPARK-21499][SQL] Support creating persistent function for Spark UDAF(UserDefinedAggregateFunction)
      
      ## What changes were proposed in this pull request?
      This PR is to enable users to create persistent Scala UDAF (that extends UserDefinedAggregateFunction).
      
      ```SQL
      CREATE FUNCTION myDoubleAvg AS 'test.org.apache.spark.sql.MyDoubleAvg'
      ```
      
      Before this PR, a Spark UDAF could only be registered through the `spark.udf.register(...)` API.
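
      Once created, the persistent function can be invoked from SQL like a builtin (a sketch, assuming a table `t` with a numeric column `value` and the UDAF class on the classpath):

      ```scala
      scala> sql("SELECT myDoubleAvg(value) FROM t").show()
      ```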
      
      ## How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18700 from gatorsmile/javaUDFinScala.
    • [SPARK-21803][TEST] Remove the HiveDDLCommandSuite · be72b157
      gatorsmile authored
      ## What changes were proposed in this pull request?
      We do not have any Hive-specific parser. It does not make sense to keep a parser-specific test suite `HiveDDLCommandSuite.scala` in the Hive package. This PR is to remove it.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19015 from gatorsmile/combineDDL.
  7. Aug 21, 2017
    • [SPARK-21617][SQL] Store correct table metadata when altering schema in Hive metastore. · 84b5b16e
      Marcelo Vanzin authored
      For Hive tables, the current "replace the schema" code is the correct
      path, except that an exception in that path should result in an error, and
      not in retrying in a different way.
      
      For data source tables, Spark may generate a non-compatible Hive table;
      but for that to work with Hive 2.1, the detection of data source tables needs
      to be fixed in the Hive client, to also consider the raw tables used by code
      such as `alterTableSchema`.
      
      Tested with existing and added unit tests (plus internal tests with a 2.1 metastore).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18849 from vanzin/SPARK-21617.
  8. Aug 18, 2017
    • [SPARK-21213][SQL] Support collecting partition-level statistics: rowCount and sizeInBytes · 23ea8980
      Masha Basmanova authored
      ## What changes were proposed in this pull request?
      
      Added support for the `ANALYZE TABLE [db_name.]tablename PARTITION (partcol1[=val1], partcol2[=val2], ...) COMPUTE STATISTICS [NOSCAN]` SQL command to calculate the total number of rows and the size in bytes for a subset of partitions. The calculated statistics are stored in the Hive Metastore as user-defined properties attached to the partition objects. The property names are the same as the ones used to store table-level statistics: `spark.sql.statistics.totalSize` and `spark.sql.statistics.numRows`.
      
      When the partition specification contains all partition columns with values, the command collects statistics for the single partition that matches the specification. When some partition columns are missing or are listed without their values, the command collects statistics for all partitions that match the specified subset of partition column values.
      
      For example, table t has 4 partitions with the following specs:
      
      * Partition1: (ds='2008-04-08', hr=11)
      * Partition2: (ds='2008-04-08', hr=12)
      * Partition3: (ds='2008-04-09', hr=11)
      * Partition4: (ds='2008-04-09', hr=12)
      
      `ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11)` will collect statistics only for partition 3.

      `ANALYZE TABLE t PARTITION (ds='2008-04-09')` will collect statistics for partitions 3 and 4.

      `ANALYZE TABLE t PARTITION (ds, hr)` will collect statistics for all four partitions.
      
      When the optional parameter NOSCAN is specified, the command doesn't count the number of rows and only gathers the size in bytes.
      
      The statistics gathered by the ANALYZE TABLE command can be fetched using the `DESC EXTENDED [db_name.]tablename PARTITION` command.
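
      Putting it together, a sketch of the flow (table `t` as in the example above):

      ```scala
      // Collect the row count and size in bytes for a single matching partition...
      sql("ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11) COMPUTE STATISTICS")
      // ...then read the stored statistics back.
      sql("DESC EXTENDED t PARTITION (ds='2008-04-09', hr=11)").show(truncate = false)
      ```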
      
      ## How was this patch tested?
      
      Added tests.
      
      Author: Masha Basmanova <mbasmanova@fb.com>
      
      Closes #18421 from mbasmanova/mbasmanova-analyze-partition.
    • [SPARK-21739][SQL] Cast expression should initialize timezoneId when it is... · 310454be
      donnyzone authored
      [SPARK-21739][SQL] Cast expression should initialize timezoneId when it is called statically to convert something into TimestampType
      
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21739
      
      This issue is caused by introducing TimeZoneAwareExpression.
      When the **Cast** expression converts something into TimestampType, it should be resolved with `timezoneId` set. In general, this happens in the LogicalPlan resolution phase.

      However, there are still some places that use the Cast expression statically to convert datatypes without setting `timezoneId`. In such cases, `NoSuchElementException: None.get` will be thrown for TimestampType.

      This PR fixes the issue. We have checked the whole project and found two such usages (i.e., in `TableReader` and `HiveTableScanExec`).
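
      The fix pattern is roughly the following sketch (`child` stands for the expression being converted, `conf` for the active `SQLConf`):

      ```scala
      import org.apache.spark.sql.catalyst.expressions.Cast
      import org.apache.spark.sql.types.TimestampType

      // Pass the session-local timezone instead of leaving timeZoneId unset, which
      // would make the cast throw NoSuchElementException when evaluated.
      val withTz = Cast(child, TimestampType, Option(conf.sessionLocalTimeZone))
      ```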
      
      ## How was this patch tested?
      
      unit test
      
      Author: donnyzone <wellfengzhu@gmail.com>
      
      Closes #18960 from DonnyZone/spark-21739.
  9. Aug 17, 2017
    • [SPARK-21767][TEST][SQL] Add Decimal Test For Avro in VersionSuite · 2caaed97
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Decimal is an Avro logical type. We need to ensure that Hive's Avro serde support works well in Spark.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18977 from gatorsmile/addAvroTest.
    • [SPARK-18394][SQL] Make an AttributeSet.toSeq output order consistent · 6aad02d0
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR sorts output attributes by name and exprId in `AttributeSet.toSeq` to make the order consistent. If the order differs, Spark can generate different code and then miss the cache in `CodeGenerator`; e.g., `GenerateColumnAccessor` generates code that depends on the input attribute order.
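
      A sketch of the change (`attributes` stands for the set's underlying collection):

      ```scala
      // Sort by (name, exprId) so the output order is deterministic.
      attributes.toSeq.sortBy(a => (a.name, a.exprId.id))
      ```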
      
      ## How was this patch tested?
      Added tests in `AttributeSetSuite` and manually checked if the cache worked well in the given query of the JIRA.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #18959 from maropu/SPARK-18394.
    • [SQL][MINOR][TEST] Set spark.unsafe.exceptionOnMemoryLeak to true · ae9e4247
      gatorsmile authored
      ## What changes were proposed in this pull request?
      When running tests in IntelliJ, we are unable to capture the memory-leak detection exception.
      > org.apache.spark.executor.Executor: Managed memory leak detected

      This PR explicitly sets `spark.unsafe.exceptionOnMemoryLeak` in the SparkConf when building the SparkSession, instead of reading it from system properties.
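
      A sketch of the pattern for a test SparkSession:

      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .master("local[2]")
        .config("spark.unsafe.exceptionOnMemoryLeak", "true")  // fail tests on leaks
        .getOrCreate()
      ```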
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18967 from gatorsmile/setExceptionOnMemoryLeak.
    • [SPARK-21428] Turn IsolatedClientLoader off while using builtin Hive jars for... · b83b502c
      Kent Yao authored
      [SPARK-21428] Turn IsolatedClientLoader off while using builtin Hive jars for reusing CliSessionState
      
      ## What changes were proposed in this pull request?
      
      Set `isolated` to false when using the builtin Hive jars and `SessionState.get` returns a `CliSessionState` instance.
      
      ## How was this patch tested?
      
      1. Unit tests
      2. Manually verified: `hive.exec.scratchdir` was only created once because of reusing the `CliSessionState`
      ```
      ➜  spark git:(SPARK-21428) ✗ bin/spark-sql --conf spark.sql.hive.metastore.jars=builtin
      
      log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
      log4j:WARN Please initialize the log4j system properly.
      log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
      Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
      17/07/16 23:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      17/07/16 23:59:27 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
      17/07/16 23:59:27 INFO ObjectStore: ObjectStore, initialize called
      17/07/16 23:59:28 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
      17/07/16 23:59:28 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
      17/07/16 23:59:29 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
      17/07/16 23:59:30 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
      17/07/16 23:59:30 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
      17/07/16 23:59:31 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
      17/07/16 23:59:31 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
      17/07/16 23:59:31 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
      17/07/16 23:59:31 INFO ObjectStore: Initialized ObjectStore
      17/07/16 23:59:31 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
      17/07/16 23:59:31 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
      17/07/16 23:59:32 INFO HiveMetaStore: Added admin role in metastore
      17/07/16 23:59:32 INFO HiveMetaStore: Added public role in metastore
      17/07/16 23:59:32 INFO HiveMetaStore: No user is added in admin role, since config is empty
      17/07/16 23:59:32 INFO HiveMetaStore: 0: get_all_databases
      17/07/16 23:59:32 INFO audit: ugi=Kent	ip=unknown-ip-addr	cmd=get_all_databases
      17/07/16 23:59:32 INFO HiveMetaStore: 0: get_functions: db=default pat=*
      17/07/16 23:59:32 INFO audit: ugi=Kent	ip=unknown-ip-addr	cmd=get_functions: db=default pat=*
      17/07/16 23:59:32 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
      17/07/16 23:59:32 INFO SessionState: Created local directory: /var/folders/k2/04p4k4ws73l6711h_mz2_tq00000gn/T/beea7261-221a-4711-89e8-8b12a9d37370_resources
      17/07/16 23:59:32 INFO SessionState: Created HDFS directory: /tmp/hive/Kent/beea7261-221a-4711-89e8-8b12a9d37370
      17/07/16 23:59:32 INFO SessionState: Created local directory: /var/folders/k2/04p4k4ws73l6711h_mz2_tq00000gn/T/Kent/beea7261-221a-4711-89e8-8b12a9d37370
      17/07/16 23:59:32 INFO SessionState: Created HDFS directory: /tmp/hive/Kent/beea7261-221a-4711-89e8-8b12a9d37370/_tmp_space.db
      17/07/16 23:59:32 INFO SparkContext: Running Spark version 2.3.0-SNAPSHOT
      17/07/16 23:59:32 INFO SparkContext: Submitted application: SparkSQL::10.0.0.8
      17/07/16 23:59:32 INFO SecurityManager: Changing view acls to: Kent
      17/07/16 23:59:32 INFO SecurityManager: Changing modify acls to: Kent
      17/07/16 23:59:32 INFO SecurityManager: Changing view acls groups to:
      17/07/16 23:59:32 INFO SecurityManager: Changing modify acls groups to:
      17/07/16 23:59:32 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(Kent); groups with view permissions: Set(); users  with modify permissions: Set(Kent); groups with modify permissions: Set()
      17/07/16 23:59:33 INFO Utils: Successfully started service 'sparkDriver' on port 51889.
      17/07/16 23:59:33 INFO SparkEnv: Registering MapOutputTracker
      17/07/16 23:59:33 INFO SparkEnv: Registering BlockManagerMaster
      17/07/16 23:59:33 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
      17/07/16 23:59:33 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
      17/07/16 23:59:33 INFO DiskBlockManager: Created local directory at /private/var/folders/k2/04p4k4ws73l6711h_mz2_tq00000gn/T/blockmgr-9cfae28a-01e9-4c73-a1f1-f76fa52fc7a5
      17/07/16 23:59:33 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
      17/07/16 23:59:33 INFO SparkEnv: Registering OutputCommitCoordinator
      17/07/16 23:59:33 INFO Utils: Successfully started service 'SparkUI' on port 4040.
      17/07/16 23:59:33 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.0.8:4040
      17/07/16 23:59:33 INFO Executor: Starting executor ID driver on host localhost
      17/07/16 23:59:33 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 51890.
      17/07/16 23:59:33 INFO NettyBlockTransferService: Server created on 10.0.0.8:51890
      17/07/16 23:59:33 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
      17/07/16 23:59:33 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.0.8, 51890, None)
      17/07/16 23:59:33 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.0.8:51890 with 366.3 MB RAM, BlockManagerId(driver, 10.0.0.8, 51890, None)
      17/07/16 23:59:33 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.0.8, 51890, None)
      17/07/16 23:59:33 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.0.8, 51890, None)
      17/07/16 23:59:34 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/Users/Kent/Documents/spark/spark-warehouse').
      17/07/16 23:59:34 INFO SharedState: Warehouse path is 'file:/Users/Kent/Documents/spark/spark-warehouse'.
      17/07/16 23:59:34 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
      17/07/16 23:59:34 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is /user/hive/warehouse
      17/07/16 23:59:34 INFO HiveMetaStore: 0: get_database: default
      17/07/16 23:59:34 INFO audit: ugi=Kent	ip=unknown-ip-addr	cmd=get_database: default
      17/07/16 23:59:34 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is /user/hive/warehouse
      17/07/16 23:59:34 INFO HiveMetaStore: 0: get_database: global_temp
      17/07/16 23:59:34 INFO audit: ugi=Kent	ip=unknown-ip-addr	cmd=get_database: global_temp
      17/07/16 23:59:34 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
      17/07/16 23:59:34 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is /user/hive/warehouse
      17/07/16 23:59:34 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
      spark-sql>
      
      ```
      cc cloud-fan gatorsmile
      
      Author: Kent Yao <yaooqinn@hotmail.com>
      Author: hzyaoqin <hzyaoqin@corp.netease.com>
      
      Closes #18648 from yaooqinn/SPARK-21428.
  10. Aug 15, 2017
    • [SPARK-21731][BUILD] Upgrade scalastyle to 0.9. · 3f958a99
      Marcelo Vanzin authored
      This version fixes a few issues in the import order checker; it provides
      better error messages, and detects more improper ordering (thus the need
      to change a lot of files in this patch). The main fix is that it correctly
      complains about the order of packages vs. classes.
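
      For reference, a sketch of the import grouping convention the checker enforces (java/javax first, then scala, then third-party, then org.apache.spark, each group sorted and separated by a blank line):

      ```scala
      import java.util.Locale

      import scala.collection.mutable

      import org.apache.spark.sql.SparkSession
      ```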
      
      As part of the above, I moved some "SparkSession" imports in the ML examples
      inside the "$example on$" blocks; that didn't seem consistent across
      source files to start with, and it avoids having to add more on/off blocks
      around specific imports.
      
      The new scalastyle also seems to have a better header detector, so a few
      license headers had to be updated to match the expected indentation.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18943 from vanzin/SPARK-21731.
    • [SPARK-18464][SQL][FOLLOWUP] support old table which doesn't store schema in table properties · 14bdb25f
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up of https://github.com/apache/spark/pull/15900, to fix one more bug:
      When the table schema is empty and needs to be inferred at runtime, we should not resolve parent plans before the schema has been inferred, or the parent plans will be resolved against an empty schema and may produce wrong results for something like `select *`.
      
      The fix logic is: introduce `UnresolvedCatalogRelation` as a placeholder. Then we replace it with `LogicalRelation` or `HiveTableRelation` during analysis, so that it's guaranteed that we won't resolve parent plans until the schema has been inferred.
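
      A simplified sketch of the placeholder (details assumed):

      ```scala
      import org.apache.spark.sql.catalyst.catalog.CatalogTable
      import org.apache.spark.sql.catalyst.expressions.Attribute
      import org.apache.spark.sql.catalyst.plans.logical.LeafNode

      // Stays unresolved until the analyzer replaces it with a LogicalRelation or
      // HiveTableRelation, so parent plans cannot resolve against an empty schema.
      case class UnresolvedCatalogRelation(tableMeta: CatalogTable) extends LeafNode {
        override lazy val resolved: Boolean = false
        override def output: Seq[Attribute] = Nil
      }
      ```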
      
      ## How was this patch tested?
      
      regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18907 from cloud-fan/bug.
    • [SPARK-21721][SQL] Clear FileSystem deleteOnExit cache when paths are successfully removed · 4c3cf1cc
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      We put the staging path to delete into the deleteOnExit cache of `FileSystem`, in case the path can't be removed successfully right away. But when we do remove the path successfully, we don't remove it from the cache. We should, to avoid the cache growing continuously.
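
      A sketch of the fix, using Hadoop's `FileSystem` API:

      ```scala
      import org.apache.hadoop.fs.{FileSystem, Path}

      def deleteStagingDir(fs: FileSystem, stagingPath: Path): Unit = {
        if (fs.delete(stagingPath, true)) {
          // The path is gone, so drop it from the deleteOnExit cache as well.
          fs.cancelDeleteOnExit(stagingPath)
        }
      }
      ```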
      
      ## How was this patch tested?
      
      Added a test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18934 from viirya/SPARK-21721.
  11. Aug 10, 2017
    • [SPARK-21699][SQL] Remove unused getTableOption in ExternalCatalog · 584c7f14
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes the unused `SessionCatalog.getTableMetadataOption` and `ExternalCatalog.getTableOption`.
      
      ## How was this patch tested?
      Removed the test case.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18912 from rxin/remove-getTableOption.
    • [SPARK-21669] Internal API for collecting metrics/stats during FileFormatWriter jobs · 95ad960c
      Adrian Ionescu authored
      ## What changes were proposed in this pull request?
      
      This patch introduces an internal interface for tracking metrics and/or statistics on data on the fly, as it is being written to disk during a `FileFormatWriter` job and partially reimplements SPARK-20703 in terms of it.
      
      The interface basically consists of 3 traits:
      - `WriteTaskStats`: just a tag for classes that represent statistics collected during a `WriteTask`
        The only constraint it adds is that the class should be `Serializable`, as instances of it will be collected on the driver from all executors at the end of the `WriteJob`.
      - `WriteTaskStatsTracker`: a trait for classes that can actually compute statistics based on tuples that are processed by a given `WriteTask` and eventually produce a `WriteTaskStats` instance.
      - `WriteJobStatsTracker`: a trait for classes that act as containers of `Serializable` state that's necessary for instantiating `WriteTaskStatsTracker` on executors and finally process the resulting collection of `WriteTaskStats`, once they're gathered back on the driver.
      
      Potential future use of this interface is e.g. CBO stats maintenance during `INSERT INTO table ... ` operations.
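
      A simplified sketch of the three traits (method names assumed; the real interface has more callbacks, e.g. for new files and partitions):

      ```scala
      import org.apache.spark.sql.catalyst.InternalRow

      trait WriteTaskStats extends Serializable

      trait WriteTaskStatsTracker {
        def newRow(row: InternalRow): Unit   // observe each tuple as it is written
        def getFinalStats(): WriteTaskStats  // summarize when the task finishes
      }

      trait WriteJobStatsTracker extends Serializable {
        def newTaskInstance(): WriteTaskStatsTracker        // instantiated on executors
        def processStats(stats: Seq[WriteTaskStats]): Unit  // gathered on the driver
      }
      ```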
      
      ## How was this patch tested?
      Existing tests for SPARK-20703 exercise the new code: `hive/SQLMetricsSuite`, `sql/JavaDataFrameReaderWriterSuite`, etc.
      
      Author: Adrian Ionescu <adrian@databricks.com>
      
      Closes #18884 from adrian-ionescu/write-stats-tracker-api.
  12. Aug 09, 2017
    • [SPARK-21504][SQL] Add spark version info into table metadata · 2d799d08
      gatorsmile authored
      ## What changes were proposed in this pull request?
      This PR adds the Spark version info to the table metadata. When a table is created, this value is assigned. It can help users find out which version of Spark was used to create the table.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18709 from gatorsmile/addVersion.
  13. Aug 06, 2017
    • [MINOR][BUILD] Remove duplicate test-jar:test spark-sql dependency from Hive module · 39e044e3
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Remove duplicate test-jar:test spark-sql dependency from Hive module; move test-jar dependencies together logically. This generates a big warning at the start of the Maven build otherwise.
      
      ## How was this patch tested?
      
      Existing build. No functional changes here.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18858 from srowen/DupeSqlTestDep.
    • [SPARK-20963][SQL][FOLLOW-UP] Use UnresolvedSubqueryColumnAliases for visitTableName · 74b47845
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR (a follow-up of #18772) uses `UnresolvedSubqueryColumnAliases`, a new unresolved `LogicalPlan` implemented in #18185, for `visitTableName` in `AstBuilder`.
      
      ## How was this patch tested?
      Existing tests
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #18857 from maropu/SPARK-20963-FOLLOWUP.
  14. Aug 05, 2017
    • [SPARK-21637][SPARK-21451][SQL] get `spark.hadoop.*` properties from sysProps to hiveconf · 41568e9a
      hzyaoqin authored
      ## What changes were proposed in this pull request?
      When we use the `bin/spark-sql` command with `--conf spark.hadoop.foo=bar`, the `SparkSQLCliDriver` initializes a HiveConf instance but does not add `foo->bar` to it.
      This PR gets the `spark.hadoop.*` properties from sysProps into this HiveConf.
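
      A sketch of the propagation (the helper name is hypothetical):

      ```scala
      import org.apache.hadoop.conf.Configuration

      // Copy spark.hadoop.* system properties into the conf, stripping the prefix.
      def appendSparkHadoopConfigs(conf: Configuration): Unit = {
        sys.props.foreach { case (key, value) =>
          if (key.startsWith("spark.hadoop.")) {
            conf.set(key.stripPrefix("spark.hadoop."), value)
          }
        }
      }
      ```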
      
      ## How was this patch tested?
      UT
      
      Author: hzyaoqin <hzyaoqin@corp.netease.com>
      Author: Kent Yao <yaooqinn@hotmail.com>
      
      Closes #18668 from yaooqinn/SPARK-21451.
  15. Aug 04, 2017
    • [SPARK-21634][SQL] Change OneRowRelation from a case object to case class · 5ad1796b
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      `OneRowRelation` is the only plan that is a case object, which causes some issues with `makeCopy` using a 0-arg constructor. This patch changes it from a case object to a case class.
      
      This blocks SPARK-21619.
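
      A simplified sketch of the change (the real class overrides a few more members):

      ```scala
      import org.apache.spark.sql.catalyst.expressions.Attribute
      import org.apache.spark.sql.catalyst.plans.logical.LeafNode

      // Was `case object OneRowRelation`, which has no 0-arg constructor for
      // makeCopy; an empty-parameter case class does.
      case class OneRowRelation() extends LeafNode {
        override def output: Seq[Attribute] = Nil
      }
      ```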
      
      ## How was this patch tested?
      Should be covered by existing test cases.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18839 from rxin/SPARK-21634.
  16. Aug 03, 2017
    • [SPARK-21599][SQL] Collecting column statistics for datasource tables may fail... · 13785daa
      Dilip Biswal authored
      [SPARK-21599][SQL] Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException
      
      ## What changes were proposed in this pull request?
      For datasource tables (when they are stored in a non-Hive-compatible way), the schema information is recorded as table properties in the Hive metastore. The `alterTableStats` method needs to get the schema information from the table properties for data source tables before recording the column-level statistics. Currently, we don't get the correct schema information and fail with a `java.util.NoSuchElementException`.
      
      ## How was this patch tested?
      A new test case is added in StatisticsSuite.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #18804 from dilipbiswal/datasource_stats.
  17. Jul 29, 2017
    • [SPARK-19451][SQL] rangeBetween method should accept Long value as boundary · 92d85637
      Xingbo Jiang authored
      ## What changes were proposed in this pull request?
      
      Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert them to Int values; this can cause wrong results, so we should fix it.

      Furthermore, we should accept any legal literal value as a range frame boundary. In this PR, we make this possible for Long values and make support for other DataTypes easy to add.
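
      A sketch of a window spec that exercises Long boundaries:

      ```scala
      import org.apache.spark.sql.expressions.Window

      // A range frame spanning the previous day of millisecond timestamps,
      // with boundaries that only fit in a Long.
      val w = Window.orderBy("ts").rangeBetween(-86400000L, 0L)
      ```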
      
      This PR is mostly based on Herman's previous amazing work: https://github.com/hvanhovell/spark/commit/596f53c339b1b4629f5651070e56a8836a397768
      
      After this is merged, we can close #16818.
      
      ## How was this patch tested?
      
      Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`.
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #18540 from jiangxb1987/rangeFrame.