  1. Aug 31, 2017
    • Bryan Cutler's avatar
      [SPARK-21583][HOTFIX] Removed intercept in test causing failures · 501370d9
      Bryan Cutler authored
      Removing a check in the ColumnarBatchSuite that depended on a Java assertion. This assertion is compiled out in the Maven builds, causing the test to fail. This part of the test is not specifically related to the functionality being tested here.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #19098 from BryanCutler/hotfix-ColumnarBatchSuite-assertion.
      501370d9
    • Jacek Laskowski's avatar
      [SPARK-21886][SQL] Use SparkSession.internalCreateDataFrame to create… · 9696580c
      Jacek Laskowski authored
      … Dataset with LogicalRDD logical operator
      
      ## What changes were proposed in this pull request?
      
      Reusing `SparkSession.internalCreateDataFrame` wherever possible (to cut dups)
      
      ## How was this patch tested?
      
      Local build and waiting for Jenkins
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #19095 from jaceklaskowski/SPARK-21886-internalCreateDataFrame.
      9696580c
    • gatorsmile's avatar
      [SPARK-21878][SQL][TEST] Create SQLMetricsTestUtils · 19b0240d
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Creates `SQLMetricsTestUtils` to hold the utility functions shared by the Hive-specific and the other SQLMetrics test cases.
      
      Also moves two SQLMetrics test cases from sql/hive to sql/core.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19092 from gatorsmile/rewriteSQLMetrics.
      19b0240d
  2. Aug 30, 2017
    • Bryan Cutler's avatar
      [SPARK-21583][SQL] Create a ColumnarBatch from ArrowColumnVectors · 964b507c
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      This PR allows the creation of a `ColumnarBatch` from `ReadOnlyColumnVectors` where previously a columnar batch could only allocate vectors internally.  This is useful for using `ArrowColumnVectors` in a batch form to do row-based iteration.  Also added `ArrowConverter.fromPayloadIterator` which converts `ArrowPayload` iterator to `InternalRow` iterator and uses a `ColumnarBatch` internally.
      
      ## How was this patch tested?
      
      Added a new unit test for creating a `ColumnarBatch` with `ReadOnlyColumnVectors` and a test to verify the roundtrip of rows -> ArrowPayload -> rows, using `toPayloadIterator` and `fromPayloadIterator`.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #18787 from BryanCutler/arrow-ColumnarBatch-support-SPARK-21583.
      964b507c
    • Andrew Ash's avatar
      [SPARK-21875][BUILD] Fix Java style bugs · 313c6ca4
      Andrew Ash authored
      ## What changes were proposed in this pull request?
      
      Fix Java code style so `./dev/lint-java` succeeds
      
      ## How was this patch tested?
      
      Run `./dev/lint-java`
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #19088 from ash211/spark-21875-lint-java.
      313c6ca4
    • Dongjoon Hyun's avatar
      [SPARK-21839][SQL] Support SQL config for ORC compression · d8f45408
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to support `spark.sql.orc.compression.codec` like Parquet's `spark.sql.parquet.compression.codec`. Users can use SQLConf to control ORC compression, too.
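      
      A hedged illustration of the new knob (assuming a running `spark` session; the codec value and output path are made up):
      ```scala
      // Set the ORC codec via SQLConf, mirroring the existing Parquet option.
      spark.conf.set("spark.sql.orc.compression.codec", "zlib")
      spark.range(10).write.orc("/tmp/orc-zlib-example")
      ```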
      
      ## How was this patch tested?
      
      Pass the Jenkins with new and updated test cases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19055 from dongjoon-hyun/SPARK-21839.
      d8f45408
    • caoxuewen's avatar
      [MINOR][SQL][TEST] Test shuffle hash join when it is not expected · 235d2833
      caoxuewen authored
      ## What changes were proposed in this pull request?
      
      igore("shuffle hash join") is to shuffle hash join to test _case class ShuffledHashJoinExec_.
      But when you 'ignore' -> 'test', the test is _case class BroadcastHashJoinExec_.
      
      Before modified,  as a result of:canBroadcast is true.
      Print information in _canBroadcast(plan: LogicalPlan)_
      ```
      canBroadcast plan.stats.sizeInBytes:6710880
      canBroadcast conf.autoBroadcastJoinThreshold:10000000
      ```
      
      After the change, plan.stats.sizeInBytes is 11184808.
      Print information in _canBuildLocalHashMap(plan: LogicalPlan)_
      and _muchSmaller(a: LogicalPlan, b: LogicalPlan)_:
      
      ```
      canBuildLocalHashMap plan.stats.sizeInBytes:11184808
      canBuildLocalHashMap conf.autoBroadcastJoinThreshold:10000000
      canBuildLocalHashMap conf.numShufflePartitions:2
      ```
      ```
      muchSmaller a.stats.sizeInBytes * 3:33554424
      muchSmaller b.stats.sizeInBytes:33554432
      ```
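      
      For context, a minimal sketch of the two checks whose values are printed above, mirroring their names; this is a simplification, not the actual SparkStrategies code:
      ```scala
      // Broadcast if one side's estimated size is under the threshold.
      def canBroadcast(sizeInBytes: BigInt, autoBroadcastJoinThreshold: Long): Boolean =
        sizeInBytes <= autoBroadcastJoinThreshold
      
      // Shuffle hash join requires the build side to be much smaller than the other side.
      def muchSmaller(aSizeInBytes: BigInt, bSizeInBytes: BigInt): Boolean =
        aSizeInBytes * 3 <= bSizeInBytes
      ```
      With the numbers above, 33554424 <= 33554432, so `muchSmaller` holds and the shuffle hash join path is chosen.
      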
      ## How was this patch tested?
      
      Existing test case.
      
      Author: caoxuewen <cao.xuewen@zte.com.cn>
      
      Closes #19069 from heary-cao/shuffle_hash_join.
      235d2833
    • gatorsmile's avatar
      32d6d9d7
    • hyukjinkwon's avatar
      [SPARK-21764][TESTS] Fix test failures on Windows: resources not being closed and incorrect paths · b30a11a6
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      `org.apache.spark.deploy.RPackageUtilsSuite`
      
      ```
       - jars without manifest return false *** FAILED *** (109 milliseconds)
         java.io.IOException: Unable to delete file: C:\projects\spark\target\tmp\1500266936418-0\dep1-c.jar
      ```
      
      `org.apache.spark.deploy.SparkSubmitSuite`
      
      ```
       - download one file to local *** FAILED *** (16 milliseconds)
         java.net.URISyntaxException: Illegal character in authority at index 6: s3a://C:\projects\spark\target\tmp\test2630198944759847458.jar
      
       - download list of files to local *** FAILED *** (0 milliseconds)
         java.net.URISyntaxException: Illegal character in authority at index 6: s3a://C:\projects\spark\target\tmp\test2783551769392880031.jar
      ```
      
      `org.apache.spark.scheduler.ReplayListenerSuite`
      
      ```
       - Replay compressed inprogress log file succeeding on partial read (156 milliseconds)
         Exception encountered when attempting to run a suite with class name:
         org.apache.spark.scheduler.ReplayListenerSuite *** ABORTED *** (1 second, 391 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-8f3cacd6-faad-4121-b901-ba1bba8025a0
      
       - End-to-end replay *** FAILED *** (62 milliseconds)
         java.io.IOException: No FileSystem for scheme: C
      
       - End-to-end replay with compression *** FAILED *** (110 milliseconds)
         java.io.IOException: No FileSystem for scheme: C
      ```
      
      `org.apache.spark.sql.hive.StatisticsSuite`
      
      ```
       - SPARK-21079 - analyze table with location different than that of individual partitions *** FAILED *** (875 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - SPARK-21079 - analyze partitioned table with only a subset of partitions visible *** FAILED *** (47 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      ```
      
      **Note:** this PR does not fix:
      
      `org.apache.spark.deploy.SparkSubmitSuite`
      
      ```
       - launch simple application with spark-submit with redaction *** FAILED *** (172 milliseconds)
         java.util.NoSuchElementException: next on empty iterator
      ```
      
      I can't reproduce this on my Windows machine, but it apparently fails consistently on AppVeyor. This one is still unclear to me and hard to debug, so I did not include it for now.
      
      **Note:** it looks like there are more instances, but it is hard to identify them, partly due to flakiness and partly due to swarming logs and errors. I will probably go over them one more time if that is fine.
      
      ## How was this patch tested?
      
      Manually via AppVeyor:
      
      **Before**
      
      - `org.apache.spark.deploy.RPackageUtilsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/771-windows-fix/job/8t8ra3lrljuir7q4
      - `org.apache.spark.deploy.SparkSubmitSuite`: https://ci.appveyor.com/project/spark-test/spark/build/771-windows-fix/job/taquy84yudjjen64
      - `org.apache.spark.scheduler.ReplayListenerSuite`: https://ci.appveyor.com/project/spark-test/spark/build/771-windows-fix/job/24omrfn2k0xfa9xq
      - `org.apache.spark.sql.hive.StatisticsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/771-windows-fix/job/2079y1plgj76dc9l
      
      **After**
      
      - `org.apache.spark.deploy.RPackageUtilsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/775-windows-fix/job/3803dbfn89ne1164
      - `org.apache.spark.deploy.SparkSubmitSuite`: https://ci.appveyor.com/project/spark-test/spark/build/775-windows-fix/job/m5l350dp7u9a4xjr
      - `org.apache.spark.scheduler.ReplayListenerSuite`: https://ci.appveyor.com/project/spark-test/spark/build/775-windows-fix/job/565vf74pp6bfdk18
      - `org.apache.spark.sql.hive.StatisticsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/775-windows-fix/job/qm78tsk8c37jb6s4
      
      Jenkins tests are required and AppVeyor tests will be triggered.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18971 from HyukjinKwon/windows-fixes.
      b30a11a6
  3. Aug 29, 2017
    • gatorsmile's avatar
      [SPARK-21845][SQL] Make codegen fallback of expressions configurable · 3d0e1742
      gatorsmile authored
      ## What changes were proposed in this pull request?
      We should make the codegen fallback of expressions configurable. So far, it is always on, and it might hide compilation bugs in our codegen. Thus, we should also disable the codegen fallback when running test cases.
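      
      A minimal sketch of toggling it (assuming the conf key is `spark.sql.codegen.fallback`, and a running `spark` session):
      ```scala
      // Disable the fallback so codegen compilation bugs surface as errors in tests.
      spark.conf.set("spark.sql.codegen.fallback", "false")
      ```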
      
      ## How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19062 from gatorsmile/fallbackCodegen.
      3d0e1742
    • Wenchen Fan's avatar
      [SPARK-21255][SQL] simplify encoder for java enum · 6327ea57
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up for https://github.com/apache/spark/pull/18488, to simplify the code.
      
      The major change is that we now map a Java enum to string type, instead of a struct type with a single string field.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #19066 from cloud-fan/fix.
      6327ea57
    • Wang Gengliang's avatar
      [SPARK-21848][SQL] Add trait UserDefinedExpression to identify user-defined functions · 8fcbda9c
      Wang Gengliang authored
      ## What changes were proposed in this pull request?
      
      Add trait UserDefinedExpression to identify user-defined functions.
      UDFs can be expensive. In the optimizer we may need to avoid executing a UDF multiple times.
      E.g.
      ```scala
      table.select(UDF as 'a).select('a, ('a + 1) as 'b)
      ```
      If the UDF is expensive in this case, the optimizer should not collapse the project into
      ```scala
      table.select(UDF as 'a, (UDF+1) as 'b)
      ```
      
      Currently, UDF classes like PythonUDF and HiveGenericUDF are not defined in Catalyst.
      This PR adds a new trait to make it easier to identify user-defined functions, as sketched below.
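      
      A minimal sketch of the idea (not the actual Catalyst source): a marker trait, plus the kind of check an optimizer rule could use before collapsing adjacent projects:
      ```scala
      // Marker trait: expressions mixing this in are user-defined and possibly expensive.
      trait UserDefinedExpression
      
      // Hypothetical helper: a project-collapse rule could bail out when any
      // expression in the project list is user-defined.
      def containsUserDefined(exprs: Seq[AnyRef]): Boolean =
        exprs.exists(_.isInstanceOf[UserDefinedExpression])
      ```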
      
      ## How was this patch tested?
      
      Unit test
      
      Author: Wang Gengliang <ltnwgl@gmail.com>
      
      Closes #19064 from gengliangwang/UDFType.
      8fcbda9c
    • Takuya UESHIN's avatar
      [SPARK-21781][SQL] Modify DataSourceScanExec to use concrete ColumnVector type. · 32fa0b81
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      As mentioned at https://github.com/apache/spark/pull/18680#issuecomment-316820409, when we have more `ColumnVector` implementations, it might (or might not) have huge performance implications because it might disable inlining, or force virtual dispatches.
      
      As for the read path, one of the major paths is the one generated by `ColumnBatchScan`. Currently it refers to `ColumnVector`, so the penalty will grow as we add more classes, but we can know the concrete type from its usage; e.g., the vectorized Parquet reader uses `OnHeapColumnVector`. We can use the concrete type in the generated code directly to avoid the penalty.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18989 from ueshin/issues/SPARK-21781.
      32fa0b81
  4. Aug 27, 2017
  5. Aug 25, 2017
    • hyukjinkwon's avatar
      [MINOR][DOCS] Minor doc fixes related with doc build and uses script dir in SQL doc gen script · 3b66b1c4
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes both:
      
      - Adds information about Javadoc, SQL docs and a few more details in `docs/README.md`, plus a comment in `docs/_plugins/copy_api_dirs.rb` related to Javadoc.
      
      - Adds some commands so that the script always runs the SQL docs build under the `./sql` directory (for directly running `./sql/create-docs.sh` in the root directory).
      
      ## How was this patch tested?
      
      Manual tests with `jekyll build` and `./sql/create-docs.sh` in the root directory.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19019 from HyukjinKwon/minor-doc-build.
      3b66b1c4
    • Dongjoon Hyun's avatar
      [SPARK-21831][TEST] Remove `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite · 522e1f80
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      [SPARK-19025](https://github.com/apache/spark/pull/16869) removes SQLBuilder, so we don't need the following in HiveCompatibilitySuite.
      
      ```scala
      // Ensures that the plans generation use metastore relation and not OrcRelation
      // Was done because SqlBuilder does not work with plans having logical relation
      TestHive.setConf(HiveUtils.CONVERT_METASTORE_ORC, false)
      ```
      
      ## How was this patch tested?
      
      Pass the existing Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19043 from dongjoon-hyun/SPARK-21831.
      522e1f80
    • Sean Owen's avatar
      [SPARK-21837][SQL][TESTS] UserDefinedTypeSuite Local UDTs not actually testing what it intends · 1a598d71
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Adjust Local UDTs test to assert about results, and fix index of vector column. See JIRA for details.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19053 from srowen/SPARK-21837.
      1a598d71
    • vinodkc's avatar
      [SPARK-21756][SQL] Add JSON option to allow unquoted control characters · 51620e28
      vinodkc authored
      ## What changes were proposed in this pull request?
      
      This patch adds an allowUnquotedControlChars option to the JSON data source, allowing JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters).
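      
      A hedged usage sketch (assuming a running `spark` session; the sample JSON is made up), parsing a string value that contains a raw tab character, which strict JSON would reject:
      ```scala
      import spark.implicits._
      
      // One JSON record whose string value embeds an unquoted control character (a tab).
      val ds = Seq("{\"a\": \"x\ty\"}").toDS()
      spark.read.option("allowUnquotedControlChars", "true").json(ds).show()
      ```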
      
      ## How was this patch tested?
      Add new test cases
      
      Author: vinodkc <vinod.kc.in@gmail.com>
      
      Closes #19008 from vinodkc/br_fix_SPARK-21756.
      51620e28
    • Dongjoon Hyun's avatar
      [SPARK-21832][TEST] Merge SQLBuilderTest into ExpressionSQLBuilderSuite · 1f24ceee
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      After [SPARK-19025](https://github.com/apache/spark/pull/16869), there is no need to keep SQLBuilderTest.
      ExpressionSQLBuilderSuite is the only place to use it.
      This PR aims to remove SQLBuilderTest.
      
      ## How was this patch tested?
      
      Pass the updated `ExpressionSQLBuilderSuite`.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19044 from dongjoon-hyun/SPARK-21832.
      1f24ceee
    • Sean Owen's avatar
      [MINOR][BUILD] Fix build warnings and Java lint errors · de7af295
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix build warnings and Java lint errors. This just helps a bit in evaluating (new) warnings in another PR I have open.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19051 from srowen/JavaWarnings.
      de7af295
    • mike's avatar
      [SPARK-21255][SQL][WIP] Fixed NPE when creating encoder for enum · 7d16776d
      mike authored
      ## What changes were proposed in this pull request?
      
      Fixed NPE when creating encoder for enum.
      
      When you try to create an encoder for an Enum type (or a bean with an enum property) via Encoders.bean(...), it fails with a NullPointerException at TypeToken:495.
      I did a little research, and it turns out that in JavaTypeInference the following code
      ```scala
        def getJavaBeanReadableProperties(beanClass: Class[_]): Array[PropertyDescriptor] = {
          val beanInfo = Introspector.getBeanInfo(beanClass)
          beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
            .filter(_.getReadMethod != null)
        }
      ```
      filters out properties named "class", because we wouldn't want to serialize that. But enum types have another property of type Class named "declaringClass", which we were trying to inspect recursively. Eventually we try to inspect the ClassLoader class, which has a property "defaultAssertionStatus" with no read method, which leads to the NPE at TypeToken:495.
      
      I added the property name "declaringClass" to the filter to resolve this, as sketched below.
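      
      A sketch of the resulting filter, mirroring `getJavaBeanReadableProperties` (illustrative, not necessarily the exact final code):
      ```scala
      import java.beans.{Introspector, PropertyDescriptor}
      
      def getJavaBeanReadableProperties(beanClass: Class[_]): Array[PropertyDescriptor] = {
        val beanInfo = Introspector.getBeanInfo(beanClass)
        beanInfo.getPropertyDescriptors
          // Skip the Class-typed properties that trigger the recursive inspection.
          .filterNot(p => p.getName == "class" || p.getName == "declaringClass")
          .filter(_.getReadMethod != null)
      }
      ```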
      
      ## How was this patch tested?
      Unit test in JavaDatasetSuite which creates an encoder for enum
      
      Author: mike <mike0sv@gmail.com>
      Author: Mikhail Sveshnikov <mike0sv@gmail.com>
      
      Closes #18488 from mike0sv/enum-support.
      7d16776d
  6. Aug 24, 2017
    • Herman van Hovell's avatar
      [SPARK-21830][SQL] Bump ANTLR version and fix a few issues. · 05af2de0
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      This PR bumps the ANTLR version to 4.7, and fixes a number of small parser related issues uncovered by the bump.
      
      The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns if it needs to parse boolean predicates. For example the following query will take forever to parse:
      ```sql
      SELECT *
      FROM RANGE(1000)
      WHERE
      TRUE
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      ```
      
      This is caused by a known bug in ANTLR (https://github.com/antlr/antlr4/issues/994), which was fixed in version 4.6.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #19042 from hvanhovell/SPARK-21830.
      05af2de0
    • Shixiong Zhu's avatar
      [SPARK-21788][SS] Handle more exceptions when stopping a streaming query · d3abb369
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Add more cases we should view as a normal query stop rather than a failure.
      
      ## How was this patch tested?
      
      The new unit tests.
      
      Author: Shixiong Zhu <zsxwing@gmail.com>
      
      Closes #18997 from zsxwing/SPARK-21788.
      d3abb369
    • Wenchen Fan's avatar
      [SPARK-21826][SQL] outer broadcast hash join should not throw NPE · 2dd37d82
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This is a bug introduced by https://github.com/apache/spark/pull/11274/files#diff-7adb688cbfa583b5711801f196a074bbL274 .
      
      Non-equal join condition should only be applied when the equal-join condition matches.
      
      ## How was this patch tested?
      
      regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #19036 from cloud-fan/bug.
      2dd37d82
    • Liang-Chi Hsieh's avatar
      [SPARK-21759][SQL] In.checkInputDataTypes should not wrongly report unresolved... · 183d4cb7
      Liang-Chi Hsieh authored
      [SPARK-21759][SQL] In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery
      
      ## What changes were proposed in this pull request?
      
      With the check for structural integrity proposed in SPARK-21726, it is found that the optimization rule `PullupCorrelatedPredicates` can produce unresolved plans.
      
      For a correlated IN query that looks like:
      
          SELECT t1.a FROM t1
          WHERE
          t1.a IN (SELECT t2.c
                  FROM t2
                  WHERE t1.b < t2.d);
      
      The query plan might look like:
      
          Project [a#0]
          +- Filter a#0 IN (list#4 [b#1])
             :  +- Project [c#2]
             :     +- Filter (outer(b#1) < d#3)
             :        +- LocalRelation <empty>, [c#2, d#3]
             +- LocalRelation <empty>, [a#0, b#1]
      
      After `PullupCorrelatedPredicates`, it produces query plan like:
      
          'Project [a#0]
          +- 'Filter a#0 IN (list#4 [(b#1 < d#3)])
             :  +- Project [c#2, d#3]
             :     +- LocalRelation <empty>, [c#2, d#3]
             +- LocalRelation <empty>, [a#0, b#1]
      
      Because the correlated predicate involves another attribute `d#3` in subquery, it has been pulled out and added into the `Project` on the top of the subquery.
      
      When `list` in `In` contains just one `ListQuery`, `In.checkInputDataTypes` checks whether the number of `value` expressions matches the output size of the subquery. In the above example, there is only one `value` expression while the subquery output has two attributes `c#2, d#3`, so it fails the check and `In.resolved` returns `false`.
      
      We should not let `In.checkInputDataTypes` wrongly report unresolved plans to fail the structural integrity check.
      
      ## How was this patch tested?
      
      Added test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18968 from viirya/SPARK-21759.
      183d4cb7
    • Takuya UESHIN's avatar
      [SPARK-21745][SQL] Refactor ColumnVector hierarchy to make ColumnVector... · 9e33954d
      Takuya UESHIN authored
      [SPARK-21745][SQL] Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce WritableColumnVector.
      
      ## What changes were proposed in this pull request?
      
      This is a refactoring of `ColumnVector` hierarchy and related classes.
      
      1. make `ColumnVector` read-only
      2. introduce `WritableColumnVector` with write interface
      3. remove `ReadOnlyColumnVector`
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18958 from ueshin/issues/SPARK-21745.
      9e33954d
    • Jen-Ming Chung's avatar
      [SPARK-21804][SQL] json_tuple returns null values within repeated columns except the first one · 95713eb4
      Jen-Ming Chung authored
      ## What changes were proposed in this pull request?
      
      When json_tuple extracts values from JSON, it returns null for repeated columns except the first one, as below:
      
      ``` scala
      scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 'a')""").show()
      +---+---+----+
      | c0| c1|  c2|
      +---+---+----+
      |  1|  2|null|
      +---+---+----+
      ```
      
      I think this should be consistent with Hive's implementation:
      ```
      hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a');
      ...
      1    1
      ```
      
      In this PR, we locate all the matched indices in `fieldNames` instead of returning only the first matched index (i.e., using indexOf).
      
      ## How was this patch tested?
      
      Added test in JsonExpressionsSuite.
      
      Author: Jen-Ming Chung <jenmingisme@gmail.com>
      
      Closes #19017 from jmchung/SPARK-21804.
      95713eb4
    • lufei's avatar
      [MINOR][SQL] The comment of class ExchangeCoordinator has a typo and a context error · 846bc61c
      lufei authored
      ## What changes were proposed in this pull request?
      
      The example given in the comment of class ExchangeCoordinator has four post-shuffle partitions, but the current comment says "three".
      
      ## How was this patch tested?
      
      Author: lufei <lu.fei80@zte.com.cn>
      
      Closes #19028 from figo77/SPARK-21816.
      846bc61c
  7. Aug 23, 2017
    • 10129659's avatar
      [SPARK-21807][SQL] Override ++ operation in ExpressionSet to reduce clone time · b8aaef49
      10129659 authored
      ## What changes were proposed in this pull request?
      The getAliasedConstraints function in LogicalPlan.scala clones the expression set every time an element is added,
      which takes a long time. This PR adds a function that adds multiple elements at once to reduce the clone time, as sketched below.
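      
      A minimal sketch of the optimization idea (not the actual ExpressionSet code): `++` clones the backing buffer once for the whole batch, where repeated `+` clones it once per element:
      ```scala
      import scala.collection.mutable.ArrayBuffer
      
      final class CloneOnAddSet(private val buf: ArrayBuffer[String]) {
        // One clone per added element.
        def +(e: String): CloneOnAddSet = { val b = buf.clone(); b += e; new CloneOnAddSet(b) }
      
        // One clone for the whole batch of added elements.
        def ++(es: Iterable[String]): CloneOnAddSet = { val b = buf.clone(); es.foreach(b += _); new CloneOnAddSet(b) }
      }
      ```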
      
      Before the change, the cost of getAliasedConstraints was:
      100 expressions:  41 seconds
      150 expressions:  466 seconds
      
      After the change, the cost of getAliasedConstraints was:
      100 expressions:  1.8 seconds
      150 expressions:  6.5 seconds
      
      The test looks like this:
      ```scala
      test("getAliasedConstraints") {
        val expressionNum = 150
        val aggExpression = (1 to expressionNum).map(i => Alias(Count(Literal(1)), s"cnt$i")())
        val aggPlan = Aggregate(Nil, aggExpression, LocalRelation())
      
        val beginTime = System.currentTimeMillis()
        val expressions = aggPlan.validConstraints
        println(s"validConstraints cost: ${System.currentTimeMillis() - beginTime}ms")
        // The size of the aliased expression set is n * (n - 1) / 2 + n
        assert(expressions.size === expressionNum * (expressionNum - 1) / 2 + expressionNum)
      }
      ```
      
      ## How was this patch tested?
      
      Run the newly added test.
      
      Author: 10129659 <chen.yanshan@zte.com.cn>
      
      Closes #19022 from eatoncys/getAliasedConstraints.
      b8aaef49
    • Takeshi Yamamuro's avatar
      [SPARK-21603][SQL][FOLLOW-UP] Change the default value of maxLinesPerFunction into 4000 · 6942aeeb
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR changes the default value of `maxLinesPerFunction` to `4000`. In #18810, we added this new option to disable code generation for functions that are too long, and I found the option only affected `Q17` and `Q66` in TPC-DS. But `Q66` had a performance regression:
      
      ```
      Q17 w/o #18810, 3224ms --> q17 w/#18810, 2627ms (improvement)
      Q66 w/o #18810, 1712ms --> q66 w/#18810, 3032ms (regression)
      ```
      
      To keep the previous TPC-DS performance, we had better set a higher default value for `maxLinesPerFunction`.
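      
      A hedged sketch of setting it explicitly (assuming the conf key is `spark.sql.codegen.maxLinesPerFunction`, per the thread above, and a running `spark` session):
      ```scala
      // Raise the threshold so long generated functions still get compiled.
      spark.conf.set("spark.sql.codegen.maxLinesPerFunction", "4000")
      ```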
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #19021 from maropu/SPARK-21603-FOLLOWUP-1.
      6942aeeb
  8. Aug 22, 2017
    • Jose Torres's avatar
      [SPARK-21765] Set isStreaming on leaf nodes for streaming plans. · 3c0c2d09
      Jose Torres authored
      ## What changes were proposed in this pull request?
      All streaming logical plans will now have isStreaming set. This involved adding isStreaming as a case class arg in a few cases, since a node might be logically streaming depending on where it came from.
      
      ## How was this patch tested?
      
      Existing unit tests - no functional change is intended in this PR.
      
      Author: Jose Torres <joseph-torres@databricks.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #18973 from joseph-torres/SPARK-21765.
      3c0c2d09
    • gatorsmile's avatar
      [SPARK-21769][SQL] Add a table-specific option for always respecting schemas... · 01a8e462
      gatorsmile authored
      [SPARK-21769][SQL] Add a table-specific option for always respecting schemas inferred/controlled by Spark SQL
      
      ## What changes were proposed in this pull request?
      For Hive-serde tables, we always respect the schema stored in the Hive metastore, because the schema could be altered by other engines that share the same metastore. Thus, when the schemas differ (ignoring nullability and case), we trust the metastore-controlled schema for Hive-serde tables. However, in some scenarios the Hive metastore can also INCORRECTLY overwrite the schema, when the table serde and the Hive metastore built-in serde are different.
      
      The proposed solution is to introduce a table-specific option for such scenarios. For a specific table, users can make Spark always respect Spark-inferred/controlled schema instead of trusting metastore-controlled schema. By default, we trust Hive metastore-controlled schema.
      
      ## How was this patch tested?
      Added a cross-version test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19003 from gatorsmile/respectSparkSchema.
      01a8e462
    • gatorsmile's avatar
      [SPARK-21499][SQL] Support creating persistent function for Spark... · 43d71d96
      gatorsmile authored
      [SPARK-21499][SQL] Support creating persistent function for Spark UDAF(UserDefinedAggregateFunction)
      
      ## What changes were proposed in this pull request?
      This PR is to enable users to create persistent Scala UDAF (that extends UserDefinedAggregateFunction).
      
      ```SQL
      CREATE FUNCTION myDoubleAvg AS 'test.org.apache.spark.sql.MyDoubleAvg'
      ```
      
      Before this PR, a Spark UDAF could only be registered through the API `spark.udf.register(...)`.
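      
      A hedged usage sketch following the CREATE FUNCTION above (assuming a running `spark` session; the table `records` and column `value` are made up):
      ```scala
      // Once created persistently, the UDAF is callable from plain SQL.
      spark.sql("CREATE FUNCTION myDoubleAvg AS 'test.org.apache.spark.sql.MyDoubleAvg'")
      spark.sql("SELECT myDoubleAvg(value) FROM records").show()
      ```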
      
      ## How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18700 from gatorsmile/javaUDFinScala.
      43d71d96
    • gatorsmile's avatar
      [SPARK-21803][TEST] Remove the HiveDDLCommandSuite · be72b157
      gatorsmile authored
      ## What changes were proposed in this pull request?
      We do not have any Hive-specific parser. It does not make sense to keep a parser-specific test suite `HiveDDLCommandSuite.scala` in the Hive package. This PR is to remove it.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19015 from gatorsmile/combineDDL.
      be72b157
  9. Aug 21, 2017
    • Marcelo Vanzin's avatar
      [SPARK-21617][SQL] Store correct table metadata when altering schema in Hive metastore. · 84b5b16e
      Marcelo Vanzin authored
      For Hive tables, the current "replace the schema" code is the correct
      path, except that an exception in that path should result in an error, and
      not in retrying in a different way.
      
      For data source tables, Spark may generate a non-compatible Hive table;
      but for that to work with Hive 2.1, the detection of data source tables needs
      to be fixed in the Hive client, to also consider the raw tables used by code
      such as `alterTableSchema`.
      
      Tested with existing and added unit tests (plus internal tests with a 2.1 metastore).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18849 from vanzin/SPARK-21617.
      84b5b16e
    • Sean Owen's avatar
      [SPARK-21718][SQL] Heavy log of type: "Skipping partition based on stats ..." · b3a07526
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Reduce 'Skipping partitions' message to debug
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19010 from srowen/SPARK-21718.
      b3a07526
  10. Aug 20, 2017
    • Liang-Chi Hsieh's avatar
      [SPARK-21721][SQL][FOLLOWUP] Clear FileSystem deleteOnExit cache when paths... · 28a6cca7
      Liang-Chi Hsieh authored
      [SPARK-21721][SQL][FOLLOWUP] Clear FileSystem deleteOnExit cache when paths are successfully removed
      
      ## What changes were proposed in this pull request?
      
      Fix a typo in test.
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19005 from viirya/SPARK-21721-followup.
      28a6cca7
    • hyukjinkwon's avatar
      [SPARK-21773][BUILD][DOCS] Installs mkdocs if missing in the path in SQL documentation build · 41e0eb71
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to install `mkdocs` by `pip install` if missing in the path. Mainly to fix Jenkins's documentation build failure in `spark-master-docs`. See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-docs/3580/console.
      
      It also adds `mkdocs` as requirements in `docs/README.md`.
      
      ## How was this patch tested?
      
      I manually ran `jekyll build` under `docs` directory after manually removing `mkdocs` via `pip uninstall mkdocs`.
      
      Also, tested this in the same way but on CentOS Linux release 7.3.1611 (Core) where I built Spark few times but never built documentation before and `mkdocs` is not installed.
      
      ```
      ...
      Moving back into docs dir.
      Moving to SQL directory and building docs.
      Missing mkdocs in your path, trying to install mkdocs for SQL documentation generation.
      Collecting mkdocs
        Downloading mkdocs-0.16.3-py2.py3-none-any.whl (1.2MB)
          100% |████████████████████████████████| 1.2MB 574kB/s
      Requirement already satisfied: PyYAML>=3.10 in /usr/lib64/python2.7/site-packages (from mkdocs)
      Collecting livereload>=2.5.1 (from mkdocs)
        Downloading livereload-2.5.1-py2-none-any.whl
      Collecting tornado>=4.1 (from mkdocs)
        Downloading tornado-4.5.1.tar.gz (483kB)
          100% |████████████████████████████████| 491kB 1.4MB/s
      Collecting Markdown>=2.3.1 (from mkdocs)
        Downloading Markdown-2.6.9.tar.gz (271kB)
          100% |████████████████████████████████| 276kB 2.4MB/s
      Collecting click>=3.3 (from mkdocs)
        Downloading click-6.7-py2.py3-none-any.whl (71kB)
          100% |████████████████████████████████| 71kB 2.8MB/s
      Requirement already satisfied: Jinja2>=2.7.1 in /usr/lib/python2.7/site-packages (from mkdocs)
      Requirement already satisfied: six in /usr/lib/python2.7/site-packages (from livereload>=2.5.1->mkdocs)
      Requirement already satisfied: backports.ssl_match_hostname in /usr/lib/python2.7/site-packages (from tornado>=4.1->mkdocs)
      Collecting singledispatch (from tornado>=4.1->mkdocs)
        Downloading singledispatch-3.4.0.3-py2.py3-none-any.whl
      Collecting certifi (from tornado>=4.1->mkdocs)
        Downloading certifi-2017.7.27.1-py2.py3-none-any.whl (349kB)
          100% |████████████████████████████████| 358kB 2.1MB/s
      Collecting backports_abc>=0.4 (from tornado>=4.1->mkdocs)
        Downloading backports_abc-0.5-py2.py3-none-any.whl
      Requirement already satisfied: MarkupSafe>=0.23 in /usr/lib/python2.7/site-packages (from Jinja2>=2.7.1->mkdocs)
      Building wheels for collected packages: tornado, Markdown
        Running setup.py bdist_wheel for tornado ... done
        Stored in directory: /root/.cache/pip/wheels/84/83/cd/6a04602633457269d161344755e6766d24307189b7a67ff4b7
        Running setup.py bdist_wheel for Markdown ... done
        Stored in directory: /root/.cache/pip/wheels/bf/46/10/c93e17ae86ae3b3a919c7b39dad3b5ccf09aeb066419e5c1e5
      Successfully built tornado Markdown
      Installing collected packages: singledispatch, certifi, backports-abc, tornado, livereload, Markdown, click, mkdocs
      Successfully installed Markdown-2.6.9 backports-abc-0.5 certifi-2017.7.27.1 click-6.7 livereload-2.5.1 mkdocs-0.16.3 singledispatch-3.4.0.3 tornado-4.5.1
      Generating markdown files for SQL documentation.
      Generating HTML files for SQL documentation.
      INFO    -  Cleaning site directory
      INFO    -  Building documentation to directory: .../spark/sql/site
      Moving back into docs dir.
      Making directory api/sql
      cp -r ../sql/site/. api/sql
                  Source: .../spark/docs
             Destination: .../spark/docs/_site
            Generating...
                          done.
       Auto-regeneration: disabled. Use --watch to enable.
       ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18984 from HyukjinKwon/sql-doc-mkdocs.
      41e0eb71
  11. Aug 18, 2017
    • Wenchen Fan's avatar
      [SPARK-21743][SQL][FOLLOW-UP] top-most limit should not cause memory leak · 7880909c
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up of https://github.com/apache/spark/pull/18955 , to fix a bug that we break whole stage codegen for `Limit`.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18993 from cloud-fan/bug.
      7880909c
    • Masha Basmanova's avatar
      [SPARK-21213][SQL] Support collecting partition-level statistics: rowCount and sizeInBytes · 23ea8980
      Masha Basmanova authored
      ## What changes were proposed in this pull request?
      
      Added support for ANALYZE TABLE [db_name].tablename PARTITION (partcol1[=val1], partcol2[=val2], ...) COMPUTE STATISTICS [NOSCAN] SQL command to calculate total number of rows and size in bytes for a subset of partitions. Calculated statistics are stored in Hive Metastore as user-defined properties attached to partition objects. Property names are the same as the ones used to store table-level statistics: spark.sql.statistics.totalSize and spark.sql.statistics.numRows.
      
      When partition specification contains all partition columns with values, the command collects statistics for a single partition that matches the specification. When some partition columns are missing or listed without their values, the command collects statistics for all partitions which match a subset of partition column values specified.
      
      For example, table t has 4 partitions with the following specs:
      
      * Partition1: (ds='2008-04-08', hr=11)
      * Partition2: (ds='2008-04-08', hr=12)
      * Partition3: (ds='2008-04-09', hr=11)
      * Partition4: (ds='2008-04-09', hr=12)
      
      'ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11)' command will collect statistics only for partition 3.
      
      'ANALYZE TABLE t PARTITION (ds='2008-04-09')' command will collect statistics for partitions 3 and 4.
      
      'ANALYZE TABLE t PARTITION (ds, hr)' command will collect statistics for all four partitions.
      
      When the optional parameter NOSCAN is specified, the command doesn't count number of rows and only gathers size in bytes.
      
      The statistics gathered by ANALYZE TABLE command can be fetched using DESC EXTENDED [db_name.]tablename PARTITION command.
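      
      A short end-to-end sketch against the example table `t` above (assuming a running `spark` session with Hive support):
      ```scala
      // Collect stats for partition 3 only, then read them back.
      spark.sql("ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11) COMPUTE STATISTICS")
      spark.sql("DESC EXTENDED t PARTITION (ds='2008-04-09', hr=11)").show(truncate = false)
      ```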
      
      ## How was this patch tested?
      
      Added tests.
      
      Author: Masha Basmanova <mbasmanova@fb.com>
      
      Closes #18421 from mbasmanova/mbasmanova-analyze-partition.
      23ea8980