  1. Dec 17, 2014
  2. Dec 16, 2014
    • scwf's avatar
      [SPARK-4618][SQL] Make foreign DDL commands options case-insensitive · 60698801
      scwf authored
      Lowercase the ```options``` keys to make them case-insensitive, and use lowercase keys when reading values from the parameters.
      With this change, the following command works:
      ```
            create temporary table normal_parquet
            USING org.apache.spark.sql.parquet
            OPTIONS (
              PATH '/xxx/data'
            )
      ```
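
The idea can be sketched in a few lines of Python (an illustration of the approach, not Spark's actual Scala `CaseInsensitiveMap`):

```python
class CaseInsensitiveMap(dict):
    """Minimal sketch of a case-insensitive options map: keys are
    lowercased on both insertion and lookup."""

    def __init__(self, data=None):
        super().__init__()
        for k, v in (data or {}).items():
            self[k] = v

    def __setitem__(self, key, value):
        super().__setitem__(key.lower(), value)

    def __getitem__(self, key):
        return super().__getitem__(key.lower())

# "PATH" was stored uppercase, but a lowercase lookup still finds it.
opts = CaseInsensitiveMap({"PATH": "/tmp/data"})
print(opts["path"])  # -> /tmp/data
```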
      
      Author: scwf <wangfei1@huawei.com>
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3470 from scwf/ddl-ulcase and squashes the following commits:
      
      ae78509 [scwf] address comments
      8f4f585 [wangfei] address comments
      3c132ef [scwf] minor fix
      a0fc20b [scwf] Merge branch 'master' of https://github.com/apache/spark into ddl-ulcase
      4f86401 [scwf] adding CaseInsensitiveMap
      e244e8d [wangfei] using lower case in json
      e0cb017 [wangfei] make options in-casesensitive
      60698801
    • Davies Liu's avatar
      [SPARK-4866] support StructType as key in MapType · ec5c4279
      Davies Liu authored
      This PR adds support for using StructType (and other hashable types) as a key in MapType.
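
The same idea in Python: any hashable composite value can serve as a map key. The `Point` struct below is a hypothetical stand-in for a `StructType` value:

```python
from collections import namedtuple

# A named tuple plays the role of a struct value: it is hashable as
# long as its fields are, so it can be used as a dict (MapType) key.
Point = namedtuple("Point", ["x", "y"])

m = {Point(1, 2): "a", Point(3, 4): "b"}
print(m[Point(1, 2)])  # -> a
```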
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3714 from davies/fix_struct_in_map and squashes the following commits:
      
      68585d7 [Davies Liu] fix primitive types in MapType
      9601534 [Davies Liu] support StructType as key in MapType
      ec5c4279
    • Cheng Hao's avatar
      [SPARK-4375] [SQL] Add 0 argument support for udf · 770d8153
      Cheng Hao authored
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3595 from chenghao-intel/udf0 and squashes the following commits:
      
      a858973 [Cheng Hao] Add 0 arguments support for udf
      770d8153
    • Takuya UESHIN's avatar
      [SPARK-4720][SQL] Remainder should also return null if the divider is 0. · ddc7ba31
      Takuya UESHIN authored
      This is a follow-up of SPARK-4593 (#3443).
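
The intended semantics, sketched in Python (Python's `%` differs from SQL for negative operands; this only illustrates the null-on-zero-divider rule):

```python
def null_safe_remainder(a, b):
    """SQL-style Remainder: return None (null) when either operand is
    None or the divider is 0, instead of raising ZeroDivisionError."""
    if a is None or b is None or b == 0:
        return None
    return a % b

print(null_safe_remainder(7, 3))  # -> 1
print(null_safe_remainder(7, 0))  # -> None
```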
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3581 from ueshin/issues/SPARK-4720 and squashes the following commits:
      
      c3959d4 [Takuya UESHIN] Make Remainder return null if the divider is 0.
      ddc7ba31
    • Cheng Hao's avatar
      [SPARK-4744] [SQL] Short circuit evaluation for AND & OR in CodeGen · 0aa834ad
      Cheng Hao authored
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3606 from chenghao-intel/codegen_short_circuit and squashes the following commits:
      
      f466303 [Cheng Hao] short circuit for AND & OR
      0aa834ad
    • Cheng Lian's avatar
      [SPARK-4798][SQL] A new set of Parquet testing API and test suites · 3b395e10
      Cheng Lian authored
      This PR provides a Parquet testing API (see trait `ParquetTest`) that enables developers to write more concise test cases. A new set of Parquet test suites built upon this API is added and aims to replace the old `ParquetQuerySuite`. To avoid potential merge conflicts, the old testing code is not removed yet. The following classes can be safely removed after most Parquet-related PRs are handled:
      
      - `ParquetQuerySuite`
      - `ParquetTestData`
      
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3644 from liancheng/parquet-tests and squashes the following commits:
      
      800e745 [Cheng Lian] Enforces ordering of test output
      3bb8731 [Cheng Lian] Refactors HiveParquetSuite
      aa2cb2e [Cheng Lian] Decouples ParquetTest and TestSQLContext
      7b43a68 [Cheng Lian] Updates ParquetTest Scaladoc
      7f07af0 [Cheng Lian] Adds a new set of Parquet test suites
      3b395e10
    • Andrew Or's avatar
      [Release] Cache known author translations locally · b85044ec
      Andrew Or authored
      This bypasses unnecessary calls to the GitHub and JIRA APIs.
      Additionally, having a local cache allows us to remember names
      that we had to manually discover ourselves.
      b85044ec
    • Andrew Or's avatar
      [Release] Major improvements to generate contributors script · 6f80b749
      Andrew Or authored
      This commit introduces several major improvements to the script
      that generates the contributors list for release notes, notably:
      
      (1) Use release tags instead of a range of commits. Across branches,
      commits are not actually strictly two-dimensional, and so it is not
      sufficient to specify a start hash and an end hash. Otherwise, we
      end up counting commits that were already merged in an older branch.
      
      (2) Match PR numbers in addition to commit hashes. This is related
      to the first point in that if a PR is already merged in an older
      minor release tag, it should be filtered out here. This requires us
      to do some intelligent regex parsing on the commit description in
      addition to just relying on the GitHub API.
      
      (3) Relax author validity check. The old code fails on a name that
      has many middle names, for instance. The test was just too strict.
      
      (4) Use GitHub authentication. This allows us to make far more
      requests through the GitHub API than before (5000 as opposed to 60
      per hour).
      
      (5) Translate from GitHub username, not commit author name. This is
      important because the commit author name is not always configured
      correctly by the user. For instance, the username "falaki" used to
      resolve to just "Hossein", which was treated as a GitHub username
      and translated to something else that is completely arbitrary.
      
      (6) Add an option to use the untranslated name. If there is not
      a satisfactory candidate to replace the untranslated name with,
      at least allow the user to not translate it.
      6f80b749
    • Jacky Li's avatar
      [SPARK-4269][SQL] make wait time configurable in BroadcastHashJoin · fa66ef6c
      Jacky Li authored
      BroadcastHashJoin currently uses a hard-coded value (5 minutes) to wait for the execution and broadcast of the small table.
      In my opinion, this should be configurable, since the broadcast may exceed 5 minutes in some cases, such as a busy or congested network environment.
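
A minimal Python stand-in for the change: the timeout is read from configuration (the key `spark.sql.broadcastTimeout` is the one named in the squashed commits) rather than hard-coded:

```python
import concurrent.futures

DEFAULT_BROADCAST_TIMEOUT_SECS = 300  # the previously hard-coded 5 minutes

def broadcast_with_timeout(build_side, conf=None):
    """Wait for a (simulated) broadcast of the build side, with the
    timeout taken from configuration instead of a fixed constant."""
    timeout = float((conf or {}).get("spark.sql.broadcastTimeout",
                                     DEFAULT_BROADCAST_TIMEOUT_SECS))
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(list, build_side)  # stand-in for broadcasting
        return future.result(timeout=timeout)

print(broadcast_with_timeout(range(3), {"spark.sql.broadcastTimeout": "10"}))
```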
      
      Author: Jacky Li <jacky.likun@huawei.com>
      
      Closes #3133 from jackylk/timeout-config and squashes the following commits:
      
      733ac08 [Jacky Li] add spark.sql.broadcastTimeout in SQLConf.scala
      557acd4 [Jacky Li] switch to sqlContext.getConf
      81a5e20 [Jacky Li] make wait time configurable in BroadcastHashJoin
      fa66ef6c
    • Michael Armbrust's avatar
      [SPARK-4827][SQL] Fix resolution of deeply nested Project(attr, Project(Star,...)). · a66c23e1
      Michael Armbrust authored
      Since `AttributeReference` resolution and `*` expansion are currently in separate rules, each pair requires a full iteration instead of being resolved in a single pass. Since it's pretty easy to construct queries that have many of these in a row, this PR combines them into a single rule.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3674 from marmbrus/projectStars and squashes the following commits:
      
      d83d6a1 [Michael Armbrust] Fix resolution of deeply nested Project(attr, Project(Star,...)).
      a66c23e1
    • tianyi's avatar
      [SPARK-4483][SQL]Optimization about reduce memory costs during the HashOuterJoin · 30f6b85c
      tianyi authored
      In `HashOuterJoin.scala`, Spark reads data from both sides of the join before zipping them together, which wastes memory. This patch reads data from only one side, puts it into a hash map, and then generates the `JoinedRow`s with data from the other side one by one.
      Currently, this optimization only applies to `left outer join` and `right outer join`. `full outer join` will be handled in another issue.
      
      Benchmark setup:

      - table test_csv contains 1 million records
      - table dim_csv contains 10 thousand records

      SQL:
      `select * from test_csv a left outer join dim_csv b on a.key = b.key`
      
      The results are:
      master:
      ```
      CSV: 12671 ms
      CSV: 9021 ms
      CSV: 9200 ms
      Current Mem Usage:787788984
      ```
      after patch:
      ```
      CSV: 10382 ms
      CSV: 7543 ms
      CSV: 7469 ms
      Current Mem Usage:208145728
      ```
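
The one-sided strategy can be sketched in Python: build a hash map over only one side, then stream the other side through it (an illustration, not the actual `HashOuterJoin` code):

```python
from collections import defaultdict

def left_outer_hash_join(left, right, key):
    """Build a hash map over the right side only, then stream the left
    side through it; unmatched left rows pair with None (null)."""
    table = defaultdict(list)
    for row in right:
        table[key(row)].append(row)
    out = []
    for row in left:
        matches = table.get(key(row))
        if matches:
            out.extend((row, m) for m in matches)
        else:
            out.append((row, None))
    return out

rows = left_outer_hash_join([(1, "a"), (2, "b")], [(1, "x")],
                            key=lambda r: r[0])
print(rows)  # -> [((1, 'a'), (1, 'x')), ((2, 'b'), None)]
```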
      
      Author: tianyi <tianyi@asiainfo-linkage.com>
      Author: tianyi <tianyi.asiainfo@gmail.com>
      
      Closes #3375 from tianyi/SPARK-4483 and squashes the following commits:
      
      72a8aec [tianyi] avoid having mutable state stored inside of the task
      99c5c97 [tianyi] performance optimization
      d2f94d7 [tianyi] fix bug: missing output when the join-key is null.
      2be45d1 [tianyi] fix spell bug
      1f2c6f1 [tianyi] remove commented codes
      a676de6 [tianyi] optimize some codes
      9e7d5b5 [tianyi] remove commented old codes
      838707d [tianyi] Optimization about reduce memory costs during the HashOuterJoin
      30f6b85c
    • wangxiaojing's avatar
      [SPARK-4527][SQl]Add BroadcastNestedLoopJoin operator selection testsuite · ea1315e3
      wangxiaojing authored
      Add a BroadcastNestedLoopJoin operator selection test suite to `JoinSuite`.
      
      Author: wangxiaojing <u9jing@gmail.com>
      
      Closes #3395 from wangxiaojing/SPARK-4527 and squashes the following commits:
      
      ea0e495 [wangxiaojing] change style
      53c3952 [wangxiaojing] Add BroadcastNestedLoopJoin operator selection testsuite
      ea1315e3
    • Holden Karau's avatar
      SPARK-4767: Add support for launching in a specified placement group to spark_ec2 · b0dfdbdd
      Holden Karau authored
      Placement groups are cool and all the cool kids are using them. Let's add support for them to spark_ec2.py because I'm lazy
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #3623 from holdenk/SPARK-4767-add-support-for-launching-in-a-specified-placement-group-to-spark-ec2-scripts and squashes the following commits:
      
      111a5fd [Holden Karau] merge in master
      70ace25 [Holden Karau] Placement groups are cool and all the cool kids are using them. Lets add support for them to spark_ec2.py because I'm lazy
      b0dfdbdd
    • zsxwing's avatar
      [SPARK-4812][SQL] Fix the initialization issue of 'codegenEnabled' · 6530243a
      zsxwing authored
      The problem is that `codegenEnabled` is a `val`, but it uses the `val` `sqlContext`, which can be overridden by subclasses. Here is a simple example to show the issue.
      
      ```Scala
      scala> :paste
      // Entering paste mode (ctrl-D to finish)
      
      abstract class Foo {
      
        protected val sqlContext = "Foo"
      
        val codegenEnabled: Boolean = {
          println(sqlContext) // it will call subclass's `sqlContext` which has not yet been initialized.
          if (sqlContext != null) {
            true
          } else {
            false
          }
        }
      }
      
      class Bar extends Foo {
        override val sqlContext = "Bar"
      }
      
      println(new Bar().codegenEnabled)
      
      // Exiting paste mode, now interpreting.
      
      null
      false
      defined class Foo
      defined class Bar
      ```
      
      We should make `sqlContext` `final` to prevent subclasses from overriding it incorrectly.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3660 from zsxwing/SPARK-4812 and squashes the following commits:
      
      1cbb623 [zsxwing] Make `sqlContext` final to prevent subclasses from overriding it incorrectly
      6530243a
    • jerryshao's avatar
      [SPARK-4847][SQL]Fix "extraStrategies cannot take effect in SQLContext" issue · dc8280dc
      jerryshao authored
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #3698 from jerryshao/SPARK-4847 and squashes the following commits:
      
      4741130 [jerryshao] Make later added extraStrategies effect when calling strategies
      dc8280dc
    • Peter Vandenabeele's avatar
      [DOCS][SQL] Add a Note on jsonFile having separate JSON objects per line · 1a9e35e5
      Peter Vandenabeele authored
      * This commit hopes to avoid the confusion I faced when trying
        to submit a regular, valid multi-line JSON file; see also
      
        http://apache-spark-user-list.1001560.n3.nabble.com/Loading-JSON-Dataset-fails-with-com-fasterxml-jackson-databind-JsonMappingException-td20041.html
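
For example, writing the file with one JSON object per line in Python (hypothetical `people.json` records):

```python
import json
import os
import tempfile

# jsonFile expects one self-contained JSON object per line, not a single
# pretty-printed multi-line document. json.dumps emits no newlines, so
# each record stays on its own line.
records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
path = os.path.join(tempfile.mkdtemp(), "people.json")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

with open(path) as f:
    lines = f.read().splitlines()
print(len(lines))  # -> 2
```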
      
      Author: Peter Vandenabeele <peter@vandenabeele.com>
      
      Closes #3517 from petervandenabeele/pv-docs-note-on-jsonFile-format/01 and squashes the following commits:
      
      1f98e52 [Peter Vandenabeele] Revert to people.json and simple Note text
      6b6e062 [Peter Vandenabeele] Change the "JSON" connotation to "txt"
      fca7dfb [Peter Vandenabeele] Add a Note on jsonFile having separate JSON objects per line
      1a9e35e5
    • Judy Nash's avatar
      [SQL] SPARK-4700: Add HTTP protocol spark thrift server · 17688d14
      Judy Nash authored
      Add HTTP protocol support and test cases to the Spark Thrift server, so users can deploy the Thrift server in both TCP and HTTP modes.
      
      Author: Judy Nash <judynash@microsoft.com>
      Author: judynash <judynash@microsoft.com>
      
      Closes #3672 from judynash/master and squashes the following commits:
      
      526315d [Judy Nash] correct spacing on startThriftServer method
      31a6520 [Judy Nash] fix code style issues and update sql programming guide format issue
      47bf87e [Judy Nash] modify withJdbcStatement method definition to meet less than 100 line length
      2e9c11c [Judy Nash] add thrift server in http mode documentation on sql programming guide
      1cbd305 [Judy Nash] Merge remote-tracking branch 'upstream/master'
      2b1d312 [Judy Nash] updated http thrift server support based on feedback
      377532c [judynash] add HTTP protocol spark thrift server
      17688d14
    • Mike Jennings's avatar
      [SPARK-3405] add subnet-id and vpc-id options to spark_ec2.py · d12c0711
      Mike Jennings authored
      Based on this gist:
      https://gist.github.com/amar-analytx/0b62543621e1f246c0a2
      
      We use security group ids instead of security group to get around this issue:
      https://github.com/boto/boto/issues/350
      
      Author: Mike Jennings <mvj101@gmail.com>
      Author: Mike Jennings <mvj@google.com>
      
      Closes #2872 from mvj101/SPARK-3405 and squashes the following commits:
      
      be9cb43 [Mike Jennings] `pep8 spark_ec2.py` runs cleanly.
      4dc6756 [Mike Jennings] Remove duplicate comment
      731d94c [Mike Jennings] Update for code review.
      ad90a36 [Mike Jennings] Merge branch 'master' of https://github.com/apache/spark into SPARK-3405
      1ebffa1 [Mike Jennings] Merge branch 'master' into SPARK-3405
      52aaeec [Mike Jennings] [SPARK-3405] add subnet-id and vpc-id options to spark_ec2.py
      d12c0711
    • jbencook's avatar
      [SPARK-4855][mllib] testing the Chi-squared hypothesis test · cb484474
      jbencook authored
      This PR tests the pyspark Chi-squared hypothesis test from this commit: c8abddc5 and moves some of the error messaging into Python.
      
      It is a port of the Scala tests here: [HypothesisTestSuite.scala](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/stat/HypothesisTestSuite.scala)
      
      Hopefully, SPARK-2980 can be closed.
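
For reference, the statistic the test is built on is straightforward to compute (a plain-Python sketch, not the pyspark API):

```python
def chi_squared_statistic(observed, expected):
    """Pearson's chi-squared goodness-of-fit statistic:
    sum over categories of (observed - expected)^2 / expected."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(chi_squared_statistic([4, 6], [5, 5]))  # -> 0.4
```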
      
      Author: jbencook <jbenjamincook@gmail.com>
      
      Closes #3679 from jbencook/master and squashes the following commits:
      
      44078e0 [jbencook] checking that bad input throws the correct exceptions
      f12ee10 [jbencook] removing checks for ValueError since input tests are on the Scala side
      7536cf1 [jbencook] removing python checks for invalid input
      a17ee84 [jbencook] [SPARK-2980][mllib] adding unit tests for the pyspark chi-squared test
      3aeb0d9 [jbencook] [SPARK-2980][mllib] bringing Chi-squared error messages to the python side
      cb484474
    • Davies Liu's avatar
      [SPARK-4437] update doc for WholeCombineFileRecordReader · ed362008
      Davies Liu authored
      update doc for WholeCombineFileRecordReader
      
      Author: Davies Liu <davies@databricks.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3301 from davies/fix_doc and squashes the following commits:
      
      1d7422f [Davies Liu] Merge pull request #2 from JoshRosen/whole-text-file-cleanup
      dc3d21a [Josh Rosen] More genericization in ConfigurableCombineFileRecordReader.
      95d13eb [Davies Liu] address comment
      bf800b9 [Davies Liu] update doc for WholeCombineFileRecordReader
      ed362008
    • Davies Liu's avatar
      [SPARK-4841] fix zip with textFile() · c246b95d
      Davies Liu authored
      UTF8Deserializer cannot be used in BatchedSerializer, so always use PickleSerializer() when changing the batch size in zip().

      Also, if two RDDs already have the same batch size, they no longer need to be re-serialized.
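
The batching logic can be sketched in Python (a stand-in for PySpark's serializers, not the actual implementation):

```python
def rebatch(stream, batch_size):
    """Regroup a flat stream into lists of batch_size items -- a
    stand-in for re-serializing an RDD with a different batch size."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def zip_batched(a_batches, a_size, b_batches, b_size):
    """Only re-batch the second side when the batch sizes differ,
    mirroring the 'skip re-serialization when sizes already match' fix."""
    if a_size != b_size:
        flat = (x for batch in b_batches for x in batch)
        b_batches = rebatch(flat, a_size)
    return [list(zip(a, b)) for a, b in zip(a_batches, b_batches)]

result = zip_batched([[1, 2], [3]], 2, [[10], [20], [30]], 1)
print(result)  # -> [[(1, 10), (2, 20)], [(3, 30)]]
```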
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3706 from davies/fix_4841 and squashes the following commits:
      
      20ce3a3 [Davies Liu] fix bug in _reserialize()
      e3ebf7c [Davies Liu] add comment
      379d2c8 [Davies Liu] fix zip with textFile()
      c246b95d
    • meiyoula's avatar
      [SPARK-4792] Add error message when making local dir unsuccessfully · c7628771
      meiyoula authored
      Author: meiyoula <1039320815@qq.com>
      
      Closes #3635 from XuTingjun/master and squashes the following commits:
      
      dd1c66d [meiyoula] when old is deleted, it will throw an exception where call it
      2a55bc2 [meiyoula] Update DiskBlockManager.scala
      1483a4a [meiyoula] Delete multiple retries to make dir
      67f7902 [meiyoula] Try some times to make dir maybe more reasonable
      1c51a0c [meiyoula] Update DiskBlockManager.scala
      c7628771
  3. Dec 15, 2014
    • Sean Owen's avatar
      SPARK-4814 [CORE] Enable assertions in SBT, Maven tests / AssertionError from... · 81112e4b
      Sean Owen authored
      SPARK-4814 [CORE] Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger
      
      This enables assertions for the Maven and SBT build, but overrides the Hive module to not enable assertions.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3692 from srowen/SPARK-4814 and squashes the following commits:
      
      caca704 [Sean Owen] Disable assertions just for Hive
      f71e783 [Sean Owen] Enable assertions for SBT and Maven build
      81112e4b
    • wangfei's avatar
      [Minor][Core] fix comments in MapOutputTracker · 5c24759d
      wangfei authored
      Using driver and executor in the comments of ```MapOutputTracker``` is clearer.
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3700 from scwf/commentFix and squashes the following commits:
      
      aa68524 [wangfei] master and worker should be driver and executor
      5c24759d
    • Sean Owen's avatar
      SPARK-785 [CORE] ClosureCleaner not invoked on most PairRDDFunctions · 2a28bc61
      Sean Owen authored
      This looked like perhaps a simple and important one. `combineByKey` looks like it should clean its arguments' closures, and that in turn covers apparently all remaining functions in `PairRDDFunctions` which delegate to it.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3690 from srowen/SPARK-785 and squashes the following commits:
      
      8df68fe [Sean Owen] Clean context of most remaining functions in PairRDDFunctions, which ultimately call combineByKey
      2a28bc61
    • Ryan Williams's avatar
      [SPARK-4668] Fix some documentation typos. · 8176b7a0
      Ryan Williams authored
      Author: Ryan Williams <ryan.blake.williams@gmail.com>
      
      Closes #3523 from ryan-williams/tweaks and squashes the following commits:
      
      d2eddaa [Ryan Williams] code review feedback
      ce27fc1 [Ryan Williams] CoGroupedRDD comment nit
      c6cfad9 [Ryan Williams] remove unnecessary if statement
      b74ea35 [Ryan Williams] comment fix
      b0221f0 [Ryan Williams] fix a gendered pronoun
      c71ffed [Ryan Williams] use names on a few boolean parameters
      89954aa [Ryan Williams] clarify some comments in {Security,Shuffle}Manager
      e465dac [Ryan Williams] Saved building-spark.md with Dillinger.io
      83e8358 [Ryan Williams] fix pom.xml typo
      dc4662b [Ryan Williams] typo fixes in tuning.md, configuration.md
      8176b7a0
    • Ilya Ganelin's avatar
      [SPARK-1037] The name of findTaskFromList & findTask in TaskSetManager.scala is confusing · 38703bbc
      Ilya Ganelin authored
      Hi all - I've renamed the methods referenced in this JIRA to clarify that they modify the provided arrays (find vs. dequeue).
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      
      Closes #3665 from ilganeli/SPARK-1037B and squashes the following commits:
      
      64c177c [Ilya Ganelin] Renamed deque to dequeue
      f27d85e [Ilya Ganelin] Renamed private methods to clarify that they modify the provided parameters
      683482a [Ilya Ganelin] Renamed private methods to clarify that they modify the provided parameters
      38703bbc
    • Josh Rosen's avatar
      [SPARK-4826] Fix generation of temp file names in WAL tests · f6b8591a
      Josh Rosen authored
      This PR should fix SPARK-4826, an issue where a bug in how we generate temp. file names was causing spurious test failures in the write ahead log suites.
      
      Closes #3695.
      Closes #3701.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3704 from JoshRosen/SPARK-4826 and squashes the following commits:
      
      f2307f5 [Josh Rosen] Use Spark Utils class for directory creation/deletion
      a693ddb [Josh Rosen] remove unused Random import
      b275e41 [Josh Rosen] Move creation of temp. dir to beforeEach/afterEach.
      9362919 [Josh Rosen] [SPARK-4826] Fix bug in generation of temp file names. in WAL suites.
      86c1944 [Josh Rosen] Revert "HOTFIX: Disabling failing block manager test"
      f6b8591a
    • Yuu ISHIKAWA's avatar
      [SPARK-4494][mllib] IDFModel.transform() add support for single vector · 8098fab0
      Yuu ISHIKAWA authored
      I improved `IDFModel.transform` to allow using a single vector.
      
      [[SPARK-4494] IDFModel.transform() add support for single vector - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-4494)
      
      Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #3603 from yu-iskw/idf and squashes the following commits:
      
      256ff3d [Yuu ISHIKAWA] Fix typo
      a3bf566 [Yuu ISHIKAWA] - Fix typo - Optimize import order - Aggregate the assertion tests - Modify `IDFModel.transform` API for pyspark
      d25e49b [Yuu ISHIKAWA] Add the implementation of `IDFModel.transform` for a term frequency vector
      8098fab0
    • Patrick Wendell's avatar
      4c067387
  4. Dec 14, 2014
    • Peter Klipfel's avatar
      fixed spelling errors in documentation · 2a2983f7
      Peter Klipfel authored
      changed "form" to "from" in 3 documentation entries for Kafka integration
      
      Author: Peter Klipfel <peter@klipfel.me>
      
      Closes #3691 from peterklipfel/master and squashes the following commits:
      
      0fe7fc5 [Peter Klipfel] fixed spelling errors in documentation
      2a2983f7
  5. Dec 12, 2014
    • Patrick Wendell's avatar
      MAINTENANCE: Automated closing of pull requests. · ef84dab8
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #3488 (close requested by 'pwendell')
      Closes #2939 (close requested by 'marmbrus')
      Closes #3173 (close requested by 'marmbrus')
      ef84dab8
    • Daoyuan Wang's avatar
      [SPARK-4829] [SQL] add rule to fold count(expr) if expr is not null · 41a3f934
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #3676 from adrian-wang/countexpr and squashes the following commits:
      
      dc5765b [Daoyuan Wang] add rule to fold count(expr) if expr is not null
      41a3f934
    • Sasaki Toru's avatar
      [SPARK-4742][SQL] The name of Parquet File generated by... · 8091dd62
      Sasaki Toru authored
      [SPARK-4742][SQL] The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded
      
      When writing a Parquet file using ParquetOutputFormat#getDefaultWorkFile, the file name is not zero-padded, while RDD#saveAsText does zero-pad.
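
Zero padding itself is a one-liner; the `part-r-` prefix and five-digit width below are illustrative:

```python
def part_file_name(partition_id, extension="parquet"):
    """Zero-pad the partition id to five digits, matching the
    part-00000-style convention saveAsTextFile output uses. The exact
    'part-r-' prefix here is an assumption for illustration."""
    return f"part-r-{partition_id:05d}.{extension}"

print(part_file_name(7))  # -> part-r-00007.parquet
```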
      
      Author: Sasaki Toru <sasakitoa@nttdata.co.jp>
      
      Closes #3602 from sasakitoa/parquet-zeroPadding and squashes the following commits:
      
      6b0e58f [Sasaki Toru] Merge branch 'master' of git://github.com/apache/spark into parquet-zeroPadding
      20dc79d [Sasaki Toru] Fixed the name of Parquet File generated by AppendingParquetOutputFormat
      8091dd62
    • Cheng Hao's avatar
      [SPARK-4825] [SQL] CTAS fails to resolve when created using saveAsTable · 0abbff28
      Cheng Hao authored
      Fix a bug in queries like:
      ```
        test("save join to table") {
          val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString))
          sql("CREATE TABLE test1 (key INT, value STRING)")
          testData.insertInto("test1")
          sql("CREATE TABLE test2 (key INT, value STRING)")
          testData.insertInto("test2")
          testData.insertInto("test2")
          sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").saveAsTable("test")
          checkAnswer(
            table("test"),
            sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").collect().toSeq)
        }
      ```
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3673 from chenghao-intel/spark_4825 and squashes the following commits:
      
      e8cbd56 [Cheng Hao] alternate the pattern matching order for logical plan:CTAS
      e004895 [Cheng Hao] fix bug
      0abbff28
    • Daoyuan Wang's avatar
      [SQL] enable empty aggr test case · cbb634ae
      Daoyuan Wang authored
      This is fixed by SPARK-4318 #3184
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #3445 from adrian-wang/emptyaggr and squashes the following commits:
      
      982575e [Daoyuan Wang] enable empty aggr test case
      cbb634ae
    • Daoyuan Wang's avatar
      [SPARK-4828] [SQL] sum and avg on empty table should always return null · acb3be6b
      Daoyuan Wang authored
      The optimizations are therefore not valid. Also, I think the optimization here is rarely encountered, so removing it will not influence performance.
      
      Can we merge #3445 before I add a comparison test case from this?
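
The null semantics in question, sketched in Python:

```python
def sql_sum(values):
    """SQL SUM: nulls are skipped; an empty or all-null input yields
    null (None), not 0 -- the behavior this patch preserves."""
    vals = [v for v in values if v is not None]
    return sum(vals) if vals else None

def sql_avg(values):
    """SQL AVG: same null semantics as SUM."""
    vals = [v for v in values if v is not None]
    return sum(vals) / len(vals) if vals else None

print(sql_sum([]), sql_avg([None, None]))  # -> None None
print(sql_sum([1, 2, None]))               # -> 3
```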
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #3675 from adrian-wang/sumempty and squashes the following commits:
      
      42df763 [Daoyuan Wang] sum and avg on empty table should always return null
      acb3be6b
    • scwf's avatar
      [SQL] Remove unnecessary case in HiveContext.toHiveString · d8cf6785
      scwf authored
      a follow up of #3547
      /cc marmbrus
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #3563 from scwf/rnc and squashes the following commits:
      
      9395661 [scwf] remove unnecessary condition
      d8cf6785
    • Takuya UESHIN's avatar
      [SPARK-4293][SQL] Make Cast be able to handle complex types. · 33448036
      Takuya UESHIN authored
      Inserting data whose type includes `ArrayType.containsNull == false`, `MapType.valueContainsNull == false`, or `StructType.fields.exists(_.nullable == false)` into a Hive table will fail, because the `Cast` inserted by the `HiveMetastoreCatalog.PreInsertionCasts` rule of the `Analyzer` can't handle these types correctly.
      
      Complex type cast rule proposal:
      
      - Cast for non-complex types should be able to cast the same as before.
      - Cast for `ArrayType` can evaluate if
        - Element type can cast
        - Nullability rule doesn't break
      - Cast for `MapType` can evaluate if
        - Key type can cast
        - Nullability for casted key type is `false`
        - Value type can cast
        - Nullability rule for value type doesn't break
      - Cast for `StructType` can evaluate if
        - The field size is the same
        - Each field can cast
        - Nullability rule for each field doesn't break
      - The nested structure should be the same.
      
      Nullability rule:
      
      - If the casted type is `nullable == true`, the target nullability should be `true`
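
The recursive shape of this rule can be sketched in Python (types modeled as plain tuples for illustration, not Spark's actual type representation; `MapType` would follow the same pattern as the array and struct cases):

```python
def can_cast(from_type, to_type):
    """Sketch of the proposal: a complex cast is allowed only if every
    component can cast and no non-nullable slot receives a nullable
    value. Types are tuples, e.g. ("array", element_type, contains_null)
    or ("struct", [(field_type, nullable), ...])."""
    if from_type[0] != to_type[0]:
        return False  # the nested structure must be the same
    kind = from_type[0]
    if kind == "atomic":
        return True  # non-complex types cast the same as before
    if kind == "array":
        _, f_elem, f_null = from_type
        _, t_elem, t_null = to_type
        # nullability rule: a nullable source needs a nullable target
        return can_cast(f_elem, t_elem) and (t_null or not f_null)
    if kind == "struct":
        f_fields, t_fields = from_type[1], to_type[1]
        if len(f_fields) != len(t_fields):  # field sizes must match
            return False
        return all(can_cast(ft, tt) and (tn or not fn)
                   for (ft, fn), (tt, tn) in zip(f_fields, t_fields))
    return False

atomic = ("atomic",)
print(can_cast(("array", atomic, False), ("array", atomic, True)))  # -> True
print(can_cast(("array", atomic, True), ("array", atomic, False)))  # -> False
```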
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3150 from ueshin/issues/SPARK-4293 and squashes the following commits:
      
      e935939 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4293
      ba14003 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4293
      8999868 [Takuya UESHIN] Fix a test title.
      f677c30 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4293
      287f410 [Takuya UESHIN] Add tests to insert data of types ArrayType / MapType / StructType with nullability is false into Hive table.
      4f71bb8 [Takuya UESHIN] Make Cast be able to handle complex types.
      33448036