  1. Sep 12, 2017
    • Jen-Ming Chung's avatar
      [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not handled properly when... · 7d0a3ef4
      Jen-Ming Chung authored
      [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not handled properly when creating a dataframe from a file
      
      ## What changes were proposed in this pull request?
      
      When the `requiredSchema` only contains `_corrupt_record`, the derived `actualSchema` is empty and `_corrupt_record` is null for all rows. This PR captures the above situation and raises an exception with a reasonable workaround message so that users know what happened and how to fix the query.
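
      A minimal sketch of the CSV analogue covered by the new `CSVSuite` test, assuming behaviour analogous to the JSON case shown in the original PR (SPARK-21610); the file path and data are illustrative, not taken from the patch:

      ```scala
      import org.apache.spark.sql.types._

      val csvSchema = new StructType()
        .add("field", ByteType)
        .add("_corrupt_record", StringType)

      val csvDF = spark.read.schema(csvSchema).csv("/tmp/sample.csv")

      // Referencing only `_corrupt_record` is the situation this PR now rejects
      // with an exception instead of silently returning all-null values.
      csvDF.select("_corrupt_record").show()
      ```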
      
      ## How was this patch tested?
      
      Added unit test in `CSVSuite`.
      
      Author: Jen-Ming Chung <jenmingisme@gmail.com>
      
      Closes #19199 from jmchung/SPARK-21610-FOLLOWUP.
      7d0a3ef4
  2. Sep 11, 2017
    • caoxuewen's avatar
      [MINOR][SQL] remove unused import classes · dc74c0e6
      caoxuewen authored
      ## What changes were proposed in this pull request?
      
      This PR removes import statements that are unused.
      
      ## How was this patch tested?
      
      N/A
      
      Author: caoxuewen <cao.xuewen@zte.com.cn>
      
      Closes #19131 from heary-cao/unuse_import.
      dc74c0e6
  3. Sep 10, 2017
    • Jen-Ming Chung's avatar
      [SPARK-21610][SQL] Corrupt records are not handled properly when creating a dataframe from a file · 6273a711
      Jen-Ming Chung authored
      ## What changes were proposed in this pull request?
      ```
      echo '{"field": 1}
      {"field": 2}
      {"field": "3"}' >/tmp/sample.json
      ```
      
      ```scala
      import org.apache.spark.sql.types._
      
      val schema = new StructType()
        .add("field", ByteType)
        .add("_corrupt_record", StringType)
      
      val file = "/tmp/sample.json"
      
      val dfFromFile = spark.read.schema(schema).json(file)
      
      scala> dfFromFile.show(false)
      +-----+---------------+
      |field|_corrupt_record|
      +-----+---------------+
      |1    |null           |
      |2    |null           |
      |null |{"field": "3"} |
      +-----+---------------+
      
      scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
      res1: Long = 0
      
      scala> dfFromFile.filter($"_corrupt_record".isNull).count()
      res2: Long = 3
      ```
      When the `requiredSchema` only contains `_corrupt_record`, the derived `actualSchema` is empty and `_corrupt_record` is null for all rows. This PR captures the above situation and raises an exception with a reasonable workaround message so that users know what happened and how to fix the query.
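
      A minimal sketch of the workaround the new message points users toward, assuming the suggestion is to cache (or save) the fully parsed result first; `schema` and `file` are the values defined in the example above, and the expected counts follow from the sample data:

      ```scala
      // Parse the full schema once and cache it, then query the corrupt records
      // from the cached result instead of re-reading only `_corrupt_record`.
      val cachedDF = spark.read.schema(schema).json(file).cache()

      cachedDF.filter($"_corrupt_record".isNotNull).count()  // expected: 1
      cachedDF.filter($"_corrupt_record".isNull).count()      // expected: 2
      ```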
      
      ## How was this patch tested?
      
      Added test case.
      
      Author: Jen-Ming Chung <jenmingisme@gmail.com>
      
      Closes #18865 from jmchung/SPARK-21610.
      6273a711
  4. Sep 09, 2017
    • Jane Wang's avatar
      [SPARK-4131] Support "Writing data into the filesystem from queries" · f7679055
      Jane Wang authored
      ## What changes were proposed in this pull request?
      
      This PR implements the SQL feature:
      ```
      INSERT OVERWRITE [LOCAL] DIRECTORY directory1
        [ROW FORMAT row_format] [STORED AS file_format]
        SELECT ... FROM ...
      ```
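
      A hedged usage sketch of the new syntax, assuming a Hive-enabled session since the `STORED AS` clause goes through the Hive serde path; the output path, format, and source table are illustrative:

      ```scala
      spark.sql("""
        INSERT OVERWRITE LOCAL DIRECTORY '/tmp/query_output'
        STORED AS parquet
        SELECT id, name FROM people
      """)
      ```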
      
      ## How was this patch tested?
      Added new unit tests and also pulled the code into fb-spark so that we could test writing to an HDFS directory.
      
      Author: Jane Wang <janewang@fb.com>
      
      Closes #18975 from janewangfb/port_local_directory.
      f7679055
    • Yanbo Liang's avatar
      [MINOR][SQL] Correct DataFrame doc. · e4d8f9a3
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Correct DataFrame doc.
      
      ## How was this patch tested?
      Only doc change, no tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #19173 from yanboliang/df-doc.
      e4d8f9a3
    • Liang-Chi Hsieh's avatar
      [SPARK-21954][SQL] JacksonUtils should verify MapType's value type instead of key type · 6b45d7e9
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      `JacksonUtils.verifySchema` verifies if a data type can be converted to JSON. For `MapType`, it now verifies the key type. However, in `JacksonGenerator`, when converting a map to JSON, we only care about its values and create a writer for the values. The keys in a map are treated as strings by calling `toString` on the keys.
      
      Thus, we should change `JacksonUtils.verifySchema` to verify the value type of `MapType`.
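
      A hedged illustration of why only the value type matters, assuming `to_json` supports MapType in this version; the literal values are illustrative:

      ```scala
      import org.apache.spark.sql.functions.{lit, map, to_json}

      // The integer key is rendered via toString, so only the value type has to be
      // JSON-writable for the schema verification to accept the map.
      spark.range(1)
        .select(to_json(map(lit(1), lit("a"))).as("json"))
        .show(false)
      // expected value: {"1":"a"}
      ```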
      
      ## How was this patch tested?
      
      Added tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19167 from viirya/test-jacksonutils.
      6b45d7e9
    • Andrew Ash's avatar
      [SPARK-21941] Stop storing unused attemptId in SQLTaskMetrics · 8a5eb506
      Andrew Ash authored
      ## What changes were proposed in this pull request?
      
      In a driver heap dump containing 390,105 instances of SQLTaskMetrics this
      would have saved me approximately 3.2MB of memory.
      
      Since we're not getting any benefit from storing this unused value, let's
      eliminate it until a future PR makes use of it.
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #19153 from ash211/aash/trim-sql-listener.
      8a5eb506
  5. Sep 08, 2017
    • Kazuaki Ishizaki's avatar
      [SPARK-21946][TEST] fix flaky test: "alter table: rename cached table" in InMemoryCatalogedDDLSuite · 8a4f228d
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR fixes flaky test `InMemoryCatalogedDDLSuite "alter table: rename cached table"`.
      Since this test validates a distributed DataFrame, the result should be checked with `checkAnswer`. The original version used the `df.collect().Seq` method, which does not guarantee the order of the elements in the result.
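
      A minimal sketch of the order-insensitive pattern, assuming a suite that mixes in `QueryTest`/`SQLTestUtils`; the table name and values are illustrative:

      ```scala
      import org.apache.spark.sql.Row

      // checkAnswer compares the result as a set of rows, so it does not depend on
      // the order in which the distributed DataFrame returns them.
      checkAnswer(
        sql("SELECT key, value FROM renamed_cached_table"),
        Seq(Row(1, "a"), Row(2, "b")))
      ```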
      
      ## How was this patch tested?
      
      Use existing test case
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #19159 from kiszk/SPARK-21946.
      8a4f228d
    • Liang-Chi Hsieh's avatar
      [SPARK-21726][SQL][FOLLOW-UP] Check for structural integrity of the plan in Optimizer in test mode · 0dfc1ec5
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      The condition in `Optimizer.isPlanIntegral` is wrong. We should always return `true` if not in test mode.
      
      ## How was this patch tested?
      
      Manually test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19161 from viirya/SPARK-21726-followup.
      0dfc1ec5
    • Wenchen Fan's avatar
      [SPARK-21936][SQL] backward compatibility test framework for HiveExternalCatalog · dbb82412
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      `HiveExternalCatalog` is a semi-public interface. When creating tables, `HiveExternalCatalog` converts the table metadata to the Hive table format and saves it into the Hive metastore. It's very important to guarantee backward compatibility here, i.e., tables created by previous Spark versions should still be readable in newer Spark versions.
      
      Previously we found backward compatibility issues manually, which made it easy to miss bugs. This PR introduces a test framework to automatically test `HiveExternalCatalog` backward compatibility by downloading Spark binaries of different versions, creating tables with those Spark versions, and reading the tables with the current Spark version.
      
      ## How was this patch tested?
      
      test-only change
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #19148 from cloud-fan/test.
      dbb82412
    • Liang-Chi Hsieh's avatar
      [SPARK-21726][SQL] Check for structural integrity of the plan in Optimizer in test mode. · 6e37524a
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      We have many optimization rules in `Optimizer` now. Right now we don't have any checks in the optimizer for the structural integrity of the plan (e.g. whether it is still resolved). When debugging, it is difficult to identify which rules return invalid plans.
      
      It would be great if in test mode, we can check whether a plan is still resolved after the execution of each rule, so we can catch rules that return invalid plans.
      
      ## How was this patch tested?
      
      Added tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18956 from viirya/SPARK-21726.
      6e37524a
    • liuxian's avatar
      [SPARK-21949][TEST] Tables created in unit tests should be dropped after use · f62b20f3
      liuxian authored
      ## What changes were proposed in this pull request?
       Tables should be dropped after use in unit tests.
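      
      A hedged sketch of the cleanup pattern, assuming a suite that mixes in `SQLTestUtils`; the table name and query are illustrative:
      
      ```scala
      import org.apache.spark.sql.Row

      // withTable drops the table at the end of the block, even if the test body
      // throws, so later tests never see leftover tables.
      withTable("t") {
        sql("CREATE TABLE t(i INT) USING parquet")
        checkAnswer(sql("SELECT count(*) FROM t"), Seq(Row(0L)))
      }
      ```
      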
      ## How was this patch tested?
      N/A
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #19155 from 10110346/droptable.
      f62b20f3
  6. Sep 07, 2017
    • Dongjoon Hyun's avatar
      [SPARK-21939][TEST] Use TimeLimits instead of Timeouts · c26976fe
      Dongjoon Hyun authored
      Since ScalaTest 3.0.0, `org.scalatest.concurrent.Timeouts` is deprecated.
      This PR replaces the deprecated one with `org.scalatest.concurrent.TimeLimits`.
      
      ```scala
      -import org.scalatest.concurrent.Timeouts._
      +import org.scalatest.concurrent.TimeLimits._
      ```
      
      Pass the existing test suites.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19150 from dongjoon-hyun/SPARK-21939.
      
      Change-Id: I1a1b07f1b97e51e2263dfb34b7eaaa099b2ded5e
      c26976fe
    • Dongjoon Hyun's avatar
      [SPARK-13656][SQL] Delete spark.sql.parquet.cacheMetadata from SQLConf and docs · e00f1a1d
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Since [SPARK-15639](https://github.com/apache/spark/pull/13701), `spark.sql.parquet.cacheMetadata` and `PARQUET_CACHE_METADATA` are not used. This PR removes them from SQLConf and the docs.
      
      ## How was this patch tested?
      
      Pass the existing Jenkins.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19129 from dongjoon-hyun/SPARK-13656.
      e00f1a1d
    • Dongjoon Hyun's avatar
      [SPARK-21912][SQL] ORC/Parquet table should not create invalid column names · eea2b877
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, users meet job abortions while creating or altering ORC/Parquet tables with invalid column names. We had better prevent this by raising an **AnalysisException** with a guide to use aliases instead, as Parquet data source tables already do.
      
      **BEFORE**
      ```scala
      scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
      17/09/04 13:28:21 ERROR Utils: Aborting task
      java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct<a b:int>' but ' ' is found.
      17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted.
      17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
      org.apache.spark.SparkException: Task failed while writing rows.
      ```
      
      **AFTER**
      ```scala
      scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
      17/09/04 13:27:40 ERROR CreateDataSourceTableAsSelectCommand: Failed to write to table orc1
      org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins with a new test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19124 from dongjoon-hyun/SPARK-21912.
      eea2b877
    • Liang-Chi Hsieh's avatar
      [SPARK-21835][SQL][FOLLOW-UP] RewritePredicateSubquery should not produce unresolved query plans · ce7293c1
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up of #19050 to deal with `ExistenceJoin` case.
      
      ## How was this patch tested?
      
      Added test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19151 from viirya/SPARK-21835-followup.
      ce7293c1
  7. Sep 06, 2017
    • Jacek Laskowski's avatar
      [SPARK-21901][SS] Define toString for StateOperatorProgress · fa0092bd
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      Just `StateOperatorProgress.toString` + few formatting fixes
      
      ## How was this patch tested?
      
      Local build. Waiting for OK from Jenkins.
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #19112 from jaceklaskowski/SPARK-21901-StateOperatorProgress-toString.
      fa0092bd
    • Jose Torres's avatar
      [SPARK-21765] Check that optimization doesn't affect isStreaming bit. · acdf45fb
      Jose Torres authored
      ## What changes were proposed in this pull request?
      
      Add an assert in logical plan optimization that the isStreaming bit stays the same, and fix empty relation rules where that wasn't happening.
      
      ## How was this patch tested?
      
      new and existing unit tests
      
      Author: Jose Torres <joseph.torres@databricks.com>
      Author: Jose Torres <joseph-torres@databricks.com>
      
      Closes #19056 from joseph-torres/SPARK-21765-followup.
      acdf45fb
    • Liang-Chi Hsieh's avatar
      [SPARK-21835][SQL] RewritePredicateSubquery should not produce unresolved query plans · f2e22aeb
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Correlated predicate subqueries are rewritten into `Join` by the rule `RewritePredicateSubquery`  during optimization.
      
      It is possible that the two sides of the `Join` have conflicting attributes. The query plans produced by `RewritePredicateSubquery` become unresolved and break structural integrity.
      
      We should check if there are conflicting attributes in the `Join` and de-duplicate them by adding a `Project`.
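
      A hedged illustration of a query shape that can hit this: when both sides of the rewritten join come from the same relation, they can carry the same attribute ids unless de-duplicated. The table name and data are hypothetical, not from the patch:

      ```scala
      Seq(1, 2, 3).toDF("id").createOrReplaceTempView("t")

      // RewritePredicateSubquery turns the IN predicate into a join whose right side
      // is another scan of `t`, so conflicting `id` attributes may appear on both
      // sides; the added Project de-duplicates them.
      sql("SELECT id FROM t WHERE id IN (SELECT id FROM t WHERE id > 1)").show()
      ```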
      
      ## How was this patch tested?
      
      Added tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19050 from viirya/SPARK-21835.
      f2e22aeb
  8. Sep 05, 2017
    • jerryshao's avatar
      [SPARK-18061][THRIFTSERVER] Add spnego auth support for ThriftServer thrift/http protocol · 6a232544
      jerryshao authored
      Spark ThriftServer doesn't support SPNEGO auth for the thrift/http protocol, which is mainly used in the Knox + ThriftServer scenario. HiveServer2's CLIService already has code to support it, so this change copies that code into the Spark ThriftServer.
      
      Related Hive JIRA HIVE-6697.
      
      Manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #18628 from jerryshao/SPARK-21407.
      
      Change-Id: I61ef0c09f6972bba982475084a6b0ae3a74e385e
      6a232544
    • Xingbo Jiang's avatar
      [SPARK-21652][SQL] Fix rule conflict between InferFiltersFromConstraints and ConstantPropagation · fd60d4fa
      Xingbo Jiang authored
      ## What changes were proposed in this pull request?
      
      For the given example below, the predicate added by `InferFiltersFromConstraints` is folded by `ConstantPropagation` later, which leads to an unconverged optimizer iteration:
      ```scala
      Seq((1, 1)).toDF("col1", "col2").createOrReplaceTempView("t1")
      Seq(1, 2).toDF("col").createOrReplaceTempView("t2")
      sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col")
      ```
      
      We can fix this by adjusting the ordering of the optimizer rules.
      
      ## How was this patch tested?
      
      Add test case that would have failed in `SQLQuerySuite`.
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #19099 from jiangxb1987/unconverge-optimization.
      fd60d4fa
    • gatorsmile's avatar
      [SPARK-21845][SQL][TEST-MAVEN] Make codegen fallback of expressions configurable · 2974406d
      gatorsmile authored
      ## What changes were proposed in this pull request?
      We should make the codegen fallback of expressions configurable. So far, it is always on, which can hide compilation bugs in our codegen. Thus, we should also disable the codegen fallback when running test cases.
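
      A hedged example of toggling the new setting; the config key is my reading of this change, so verify it against `SQLConf`:

      ```scala
      // With fallback disabled, an expression whose generated code fails to compile
      // surfaces as an error instead of silently falling back to interpreted mode.
      spark.conf.set("spark.sql.codegen.fallback", "false")
      ```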
      
      ## How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19119 from gatorsmile/fallbackCodegen.
      2974406d
    • hyukjinkwon's avatar
      [SPARK-20978][SQL] Bump up Univocity version to 2.5.4 · 02a4386a
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      There was a bug in the Univocity parser that caused the issue in SPARK-20978. It was fixed as shown below:
      
      ```scala
      val df = spark.read.schema("a string, b string, unparsed string").option("columnNameOfCorruptRecord", "unparsed").csv(Seq("a").toDS())
      df.show()
      ```
      
      **Before**
      
      ```
      java.lang.NullPointerException
      	at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89)
      	at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29)
      	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56)
      	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
      	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
      ...
      ```
      
      **After**
      
      ```
      +---+----+--------+
      |  a|   b|unparsed|
      +---+----+--------+
      |  a|null|       a|
      +---+----+--------+
      ```
      
      The bug was fixed in 2.5.0, and 2.5.4 has since been released, so it should be safe to upgrade.
      
      ## How was this patch tested?
      
      Unit test added in `CSVSuite.scala`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19113 from HyukjinKwon/bump-up-univocity.
      02a4386a
    • Dongjoon Hyun's avatar
      [SPARK-21913][SQL][TEST] `withDatabase` should drop database with CASCADE · 4e7a29ef
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, `withDatabase` fails if the database is not empty. It would be great if we dropped it cleanly with CASCADE.
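
      For reference, a sketch of the statement the helper can issue so cleanup works even on non-empty databases; the database name is illustrative:

      ```scala
      spark.sql("DROP DATABASE IF EXISTS testdb CASCADE")
      ```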
      
      ## How was this patch tested?
      
      This is a change on test util. Pass the existing Jenkins.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19125 from dongjoon-hyun/SPARK-21913.
      4e7a29ef
  9. Sep 04, 2017
    • Sean Owen's avatar
      [SPARK-21418][SQL] NoSuchElementException: None.get in DataSourceScanExec with... · ca59445a
      Sean Owen authored
      [SPARK-21418][SQL] NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true
      
      ## What changes were proposed in this pull request?
      
      If no SparkConf is available to Utils.redact, simply don't redact.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19123 from srowen/SPARK-21418.
      ca59445a
  10. Sep 03, 2017
    • Liang-Chi Hsieh's avatar
      [SPARK-21654][SQL] Complement SQL predicates expression description · 9f30d928
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      SQL predicates don't have complete expression descriptions. This patch complements the descriptions by adding arguments and examples.
      
      This change also adds related test cases for the SQL predicate expressions.
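
      A hedged way to see the complemented descriptions, assuming the parser accepts a bare operator here (otherwise quote it); the operator is chosen as an illustration:

      ```scala
      // The EXTENDED output now includes the added arguments and examples sections.
      spark.sql("DESCRIBE FUNCTION EXTENDED >").show(false)
      ```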
      
      ## How was this patch tested?
      
      Existing tests. And added predicate test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18869 from viirya/SPARK-21654.
      9f30d928
  11. Sep 02, 2017
    • gatorsmile's avatar
      [SPARK-21891][SQL] Add TBLPROPERTIES to DDL statement: CREATE TABLE USING · acb7fed2
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Add `TBLPROPERTIES` to the DDL statement `CREATE TABLE USING`.
      
      After this change, the DDL becomes
      ```
      CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
      USING table_provider
      [OPTIONS table_property_list]
      [PARTITIONED BY (col_name, col_name, ...)]
      [CLUSTERED BY (col_name, col_name, ...)
       [SORTED BY (col_name [ASC|DESC], ...)]
       INTO num_buckets BUCKETS
      ]
      [LOCATION path]
      [COMMENT table_comment]
      [TBLPROPERTIES (property_name=property_value, ...)]
      [[AS] select_statement];
      ```
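
      A hedged example of the extended DDL; the table and property names are illustrative:

      ```scala
      spark.sql("""
        CREATE TABLE t_props (id INT)
        USING parquet
        TBLPROPERTIES ('created.by' = 'example', 'pipeline.stage' = 'bronze')
      """)

      // The properties should then be visible via SHOW TBLPROPERTIES.
      spark.sql("SHOW TBLPROPERTIES t_props").show(false)
      ```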
      
      ## How was this patch tested?
      Add a few tests
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19100 from gatorsmile/addTablePropsToCreateTableUsing.
      acb7fed2
  12. Sep 01, 2017
    • gatorsmile's avatar
      [SPARK-21895][SQL] Support changing database in HiveClient · aba9492d
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Support moving tables across different databases in HiveClient `alterTable`.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19104 from gatorsmile/alterTable.
      aba9492d
    • Sean Owen's avatar
      [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala... · 12ab7f7e
      Sean Owen authored
      [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala 2.12 profiles and enable 2.12 compilation
      
      …build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure
      
      ## What changes were proposed in this pull request?
      
      This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts.
      
      In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11.
      
      It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release.
      
      - Scalatest 2.x -> 3.0.3
      - Chill 0.8.0 -> 0.8.4
      - Clapper 1.0.x -> 1.1.2
      - json4s 3.2.x -> 3.4.2
      - Jackson 2.6.x -> 2.7.9 (required by json4s)
      
      This change does _not_ fully enable a Scala 2.12 build:
      
      - It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here
      - It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too.
      
      What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build.
      
      ## How was this patch tested?
      
      Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18645 from srowen/SPARK-14280.
      12ab7f7e
    • he.qiao's avatar
      [SPARK-21880][WEB UI] In the SQL table page, modify jobs trace information · 12f0d242
      he.qiao authored
      ## What changes were proposed in this pull request?
      As shown below, when job 5 is running, the current label makes it look as if five jobs were running, so I think it would be more appropriate to change "jobs" to the job id.
      ![image](https://user-images.githubusercontent.com/21355020/29909612-4dc85064-8e59-11e7-87cd-275a869243bb.png)
      
      ## How was this patch tested?
      no need
      
      Author: he.qiao <he.qiao17@zte.com.cn>
      
      Closes #19093 from Geek-He/08_31_sqltable.
      12f0d242
  13. Aug 31, 2017
    • hyukjinkwon's avatar
      [SPARK-21779][PYTHON] Simpler DataFrame.sample API in Python · 5cd8ea99
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR makes `DataFrame.sample(...)` work when `withReplacement` is omitted, defaulting it to `False`, consistent with the equivalent Scala / Java API.
      
      In short, the following examples are allowed:
      
      ```python
      >>> df = spark.range(10)
      >>> df.sample(0.5).count()
      7
      >>> df.sample(fraction=0.5).count()
      3
      >>> df.sample(0.5, seed=42).count()
      5
      >>> df.sample(fraction=0.5, seed=42).count()
      5
      ```
      
      In addition, this PR also adds some type checking logics as below:
      
      ```python
      >>> df = spark.range(10)
      >>> df.sample().count()
      ...
      TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [].
      >>> df.sample(True).count()
      ...
      TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'bool'>].
      >>> df.sample(42).count()
      ...
      TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'int'>].
      >>> df.sample(fraction=False, seed="a").count()
      ...
      TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'bool'>, <type 'str'>].
      >>> df.sample(seed=[1]).count()
      ...
      TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'list'>].
      >>> df.sample(withReplacement="a", fraction=0.5, seed=1)
      ...
      TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'str'>, <type 'float'>, <type 'int'>].
      ```
      
      ## How was this patch tested?
      
      Manually tested, unit tests added in doc tests and manually checked the built documentation for Python.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18999 from HyukjinKwon/SPARK-21779.
      5cd8ea99
    • Andrew Ray's avatar
      [SPARK-21110][SQL] Structs, arrays, and other orderable datatypes should be usable in inequalities · cba69aeb
      Andrew Ray authored
      ## What changes were proposed in this pull request?
      
      Allows `BinaryComparison` operators to work on any data type that actually supports ordering as verified by `TypeUtils.checkForOrderingExpr` instead of relying on the incomplete list `TypeCollection.Ordered` (which is removed by this PR).
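
      A hedged illustration (the data is illustrative): after this change an inequality on struct columns resolves, since structs are orderable per `TypeUtils.checkForOrderingExpr`:

      ```scala
      import org.apache.spark.sql.functions.struct

      val pairs = Seq((1, 2), (3, 1)).toDF("a", "b")

      // Single-field structs compare by their wrapped value, so only the (1, 2) row
      // satisfies the predicate here.
      pairs.filter(struct($"a") < struct($"b")).show()
      ```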
      
      ## How was this patch tested?
      
      Updated unit tests to cover structs and arrays.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #18818 from aray/SPARK-21110.
      cba69aeb
    • gatorsmile's avatar
      [SPARK-17107][SQL][FOLLOW-UP] Remove redundant pushdown rule for Union · 7ce11082
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Also remove useless function `partitionByDeterministic` after the changes of https://github.com/apache/spark/pull/14687
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19097 from gatorsmile/followupSPARK-17107.
      7ce11082
    • Bryan Cutler's avatar
      [SPARK-21583][HOTFIX] Removed intercept in test causing failures · 501370d9
      Bryan Cutler authored
      Removing a check in the ColumnarBatchSuite that depended on a Java assertion. This assertion is being compiled out in the Maven builds, causing the test to fail. This part of the test is not specific to the functionality being tested here.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #19098 from BryanCutler/hotfix-ColumnarBatchSuite-assertion.
      501370d9
    • Jacek Laskowski's avatar
      [SPARK-21886][SQL] Use SparkSession.internalCreateDataFrame to create… · 9696580c
      Jacek Laskowski authored
      … Dataset with LogicalRDD logical operator
      
      ## What changes were proposed in this pull request?
      
      Reusing `SparkSession.internalCreateDataFrame` wherever possible (to cut dups)
      
      ## How was this patch tested?
      
      Local build and waiting for Jenkins
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #19095 from jaceklaskowski/SPARK-21886-internalCreateDataFrame.
      9696580c
    • gatorsmile's avatar
      [SPARK-21878][SQL][TEST] Create SQLMetricsTestUtils · 19b0240d
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Creates `SQLMetricsTestUtils` for the utility functions of both Hive-specific and the other SQLMetrics test cases.
      
      Also, move two SQLMetrics test cases from sql/hive to sql/core.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19092 from gatorsmile/rewriteSQLMetrics.
      19b0240d
  14. Aug 30, 2017
    • Bryan Cutler's avatar
      [SPARK-21583][SQL] Create a ColumnarBatch from ArrowColumnVectors · 964b507c
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      This PR allows the creation of a `ColumnarBatch` from `ReadOnlyColumnVectors` where previously a columnar batch could only allocate vectors internally.  This is useful for using `ArrowColumnVectors` in a batch form to do row-based iteration.  Also added `ArrowConverter.fromPayloadIterator` which converts `ArrowPayload` iterator to `InternalRow` iterator and uses a `ColumnarBatch` internally.
      
      ## How was this patch tested?
      
      Added a new unit test for creating a `ColumnarBatch` with `ReadOnlyColumnVectors` and a test to verify the roundtrip of rows -> ArrowPayload -> rows, using `toPayloadIterator` and `fromPayloadIterator`.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #18787 from BryanCutler/arrow-ColumnarBatch-support-SPARK-21583.
      964b507c
    • Andrew Ash's avatar
      [SPARK-21875][BUILD] Fix Java style bugs · 313c6ca4
      Andrew Ash authored
      ## What changes were proposed in this pull request?
      
      Fix Java code style so `./dev/lint-java` succeeds
      
      ## How was this patch tested?
      
      Run `./dev/lint-java`
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #19088 from ash211/spark-21875-lint-java.
      313c6ca4
    • Dongjoon Hyun's avatar
      [SPARK-21839][SQL] Support SQL config for ORC compression · d8f45408
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to support `spark.sql.orc.compression.codec` like Parquet's `spark.sql.parquet.compression.codec`. Users can use SQLConf to control ORC compression, too.
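
      A hedged usage example, assuming a build with ORC support; the codec value and path are illustrative:

      ```scala
      // The session-level conf applies when no per-write "compression" option is given.
      spark.conf.set("spark.sql.orc.compression.codec", "zlib")
      spark.range(10).write.mode("overwrite").orc("/tmp/orc_zlib_example")
      ```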
      
      ## How was this patch tested?
      
      Pass the Jenkins with new and updated test cases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19055 from dongjoon-hyun/SPARK-21839.
      d8f45408
    • caoxuewen's avatar
      [MINOR][SQL][TEST] Test shuffle hash join while it is not expected · 235d2833
      caoxuewen authored
      ## What changes were proposed in this pull request?
      
      igore("shuffle hash join") is to shuffle hash join to test _case class ShuffledHashJoinExec_.
      But when you 'ignore' -> 'test', the test is _case class BroadcastHashJoinExec_.
      
      Before modified,  as a result of:canBroadcast is true.
      Print information in _canBroadcast(plan: LogicalPlan)_
      ```
      canBroadcast plan.stats.sizeInBytes:6710880
      canBroadcast conf.autoBroadcastJoinThreshold:10000000
      ```
      
      After the modification, plan.stats.sizeInBytes is 11184808.
      Debug output from _canBuildLocalHashMap(plan: LogicalPlan)_
      and _muchSmaller(a: LogicalPlan, b: LogicalPlan)_:
      
      ```
      canBuildLocalHashMap plan.stats.sizeInBytes:11184808
      canBuildLocalHashMap conf.autoBroadcastJoinThreshold:10000000
      canBuildLocalHashMap conf.numShufflePartitions:2
      ```
      ```
      muchSmaller a.stats.sizeInBytes * 3:33554424
      muchSmaller b.stats.sizeInBytes:33554432
      ```
      ## How was this patch tested?
      
      existing test case.
      
      Author: caoxuewen <cao.xuewen@zte.com.cn>
      
      Closes #19069 from heary-cao/shuffle_hash_join.
      235d2833