Skip to content
Snippets Groups Projects
  1. Aug 29, 2017
    • Takuya UESHIN's avatar
      [SPARK-21781][SQL] Modify DataSourceScanExec to use concrete ColumnVector type. · 32fa0b81
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      As mentioned at https://github.com/apache/spark/pull/18680#issuecomment-316820409, when we have more `ColumnVector` implementations, it might (or might not) have huge performance implications because it might disable inlining, or force virtual dispatches.
      
      As for read path, one of the major paths is the one generated by `ColumnBatchScan`. Currently it refers `ColumnVector` so the penalty will be bigger as we have more classes, but we can know the concrete type from its usage, e.g. vectorized Parquet reader uses `OnHeapColumnVector`. We can use the concrete type in the generated code directly to avoid the penalty.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18989 from ueshin/issues/SPARK-21781.
      32fa0b81
  2. Aug 27, 2017
  3. Aug 25, 2017
    • hyukjinkwon's avatar
      [MINOR][DOCS] Minor doc fixes related with doc build and uses script dir in SQL doc gen script · 3b66b1c4
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes both:
      
      - Add information about Javadoc, SQL docs and few more information in `docs/README.md` and a comment in `docs/_plugins/copy_api_dirs.rb` related with Javadoc.
      
      - Adds some commands so that the script always runs the SQL docs build under `./sql` directory (for directly running `./sql/create-docs.sh` in the root directory).
      
      ## How was this patch tested?
      
      Manual tests with `jekyll build` and `./sql/create-docs.sh` in the root directory.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19019 from HyukjinKwon/minor-doc-build.
      3b66b1c4
    • Dongjoon Hyun's avatar
      [SPARK-21831][TEST] Remove `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite · 522e1f80
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      [SPARK-19025](https://github.com/apache/spark/pull/16869) removes SQLBuilder, so we don't need the following in HiveCompatibilitySuite.
      
      ```scala
      // Ensures that the plans generation use metastore relation and not OrcRelation
      // Was done because SqlBuilder does not work with plans having logical relation
      TestHive.setConf(HiveUtils.CONVERT_METASTORE_ORC, false)
      ```
      
      ## How was this patch tested?
      
      Pass the existing Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19043 from dongjoon-hyun/SPARK-21831.
      522e1f80
    • Sean Owen's avatar
      [SPARK-21837][SQL][TESTS] UserDefinedTypeSuite Local UDTs not actually testing what it intends · 1a598d71
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Adjust Local UDTs test to assert about results, and fix index of vector column. See JIRA for details.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19053 from srowen/SPARK-21837.
      1a598d71
    • vinodkc's avatar
      [SPARK-21756][SQL] Add JSON option to allow unquoted control characters · 51620e28
      vinodkc authored
      ## What changes were proposed in this pull request?
      
      This patch adds allowUnquotedControlChars option in JSON data source to allow JSON Strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters)
      
      ## How was this patch tested?
      Add new test cases
      
      Author: vinodkc <vinod.kc.in@gmail.com>
      
      Closes #19008 from vinodkc/br_fix_SPARK-21756.
      51620e28
    • Dongjoon Hyun's avatar
      [SPARK-21832][TEST] Merge SQLBuilderTest into ExpressionSQLBuilderSuite · 1f24ceee
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      After [SPARK-19025](https://github.com/apache/spark/pull/16869), there is no need to keep SQLBuilderTest.
      ExpressionSQLBuilderSuite is the only place to use it.
      This PR aims to remove SQLBuilderTest.
      
      ## How was this patch tested?
      
      Pass the updated `ExpressionSQLBuilderSuite`.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19044 from dongjoon-hyun/SPARK-21832.
      1f24ceee
    • Sean Owen's avatar
      [MINOR][BUILD] Fix build warnings and Java lint errors · de7af295
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix build warnings and Java lint errors. This just helps a bit in evaluating (new) warnings in another PR I have open.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19051 from srowen/JavaWarnings.
      de7af295
    • mike's avatar
      [SPARK-21255][SQL][WIP] Fixed NPE when creating encoder for enum · 7d16776d
      mike authored
      ## What changes were proposed in this pull request?
      
      Fixed NPE when creating encoder for enum.
      
      When you try to create an encoder for Enum type (or bean with enum property) via Encoders.bean(...), it fails with NullPointerException at TypeToken:495.
      I did a little research and it turns out, that in JavaTypeInference following code
      ```
        def getJavaBeanReadableProperties(beanClass: Class[_]): Array[PropertyDescriptor] = {
          val beanInfo = Introspector.getBeanInfo(beanClass)
          beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
            .filter(_.getReadMethod != null)
        }
      ```
      filters out properties named "class", because we wouldn't want to serialize that. But enum types have another property of type Class named "declaringClass", which we are trying to inspect recursively. Eventually we try to inspect ClassLoader class, which has property "defaultAssertionStatus" with no read method, which leads to NPE at TypeToken:495.
      
      I added property name "declaringClass" to filtering to resolve this.
      
      ## How was this patch tested?
      Unit test in JavaDatasetSuite which creates an encoder for enum
      
      Author: mike <mike0sv@gmail.com>
      Author: Mikhail Sveshnikov <mike0sv@gmail.com>
      
      Closes #18488 from mike0sv/enum-support.
      7d16776d
  4. Aug 24, 2017
    • Herman van Hovell's avatar
      [SPARK-21830][SQL] Bump ANTLR version and fix a few issues. · 05af2de0
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      This PR bumps the ANTLR version to 4.7, and fixes a number of small parser related issues uncovered by the bump.
      
      The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns if it needs to parse boolean predicates. For example the following query will take forever to parse:
      ```sql
      SELECT *
      FROM RANGE(1000)
      WHERE
      TRUE
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      ```
      
      This is caused by a know bug in ANTLR (https://github.com/antlr/antlr4/issues/994), which was fixed in version 4.6.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #19042 from hvanhovell/SPARK-21830.
      05af2de0
    • Shixiong Zhu's avatar
      [SPARK-21788][SS] Handle more exceptions when stopping a streaming query · d3abb369
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Add more cases we should view as a normal query stop rather than a failure.
      
      ## How was this patch tested?
      
      The new unit tests.
      
      Author: Shixiong Zhu <zsxwing@gmail.com>
      
      Closes #18997 from zsxwing/SPARK-21788.
      d3abb369
    • Wenchen Fan's avatar
      [SPARK-21826][SQL] outer broadcast hash join should not throw NPE · 2dd37d82
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This is a bug introduced by https://github.com/apache/spark/pull/11274/files#diff-7adb688cbfa583b5711801f196a074bbL274 .
      
      Non-equal join condition should only be applied when the equal-join condition matches.
      
      ## How was this patch tested?
      
      regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #19036 from cloud-fan/bug.
      2dd37d82
    • Liang-Chi Hsieh's avatar
      [SPARK-21759][SQL] In.checkInputDataTypes should not wrongly report unresolved... · 183d4cb7
      Liang-Chi Hsieh authored
      [SPARK-21759][SQL] In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery
      
      ## What changes were proposed in this pull request?
      
      With the check for structural integrity proposed in SPARK-21726, it is found that the optimization rule `PullupCorrelatedPredicates` can produce unresolved plans.
      
      For a correlated IN query looks like:
      
          SELECT t1.a FROM t1
          WHERE
          t1.a IN (SELECT t2.c
                  FROM t2
                  WHERE t1.b < t2.d);
      
      The query plan might look like:
      
          Project [a#0]
          +- Filter a#0 IN (list#4 [b#1])
             :  +- Project [c#2]
             :     +- Filter (outer(b#1) < d#3)
             :        +- LocalRelation <empty>, [c#2, d#3]
             +- LocalRelation <empty>, [a#0, b#1]
      
      After `PullupCorrelatedPredicates`, it produces query plan like:
      
          'Project [a#0]
          +- 'Filter a#0 IN (list#4 [(b#1 < d#3)])
             :  +- Project [c#2, d#3]
             :     +- LocalRelation <empty>, [c#2, d#3]
             +- LocalRelation <empty>, [a#0, b#1]
      
      Because the correlated predicate involves another attribute `d#3` in subquery, it has been pulled out and added into the `Project` on the top of the subquery.
      
      When `list` in `In` contains just one `ListQuery`, `In.checkInputDataTypes` checks if the size of `value` expressions matches the output size of subquery. In the above example, there is only `value` expression and the subquery output has two attributes `c#2, d#3`, so it fails the check and `In.resolved` returns `false`.
      
      We should not let `In.checkInputDataTypes` wrongly report unresolved plans to fail the structural integrity check.
      
      ## How was this patch tested?
      
      Added test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18968 from viirya/SPARK-21759.
      183d4cb7
    • Takuya UESHIN's avatar
      [SPARK-21745][SQL] Refactor ColumnVector hierarchy to make ColumnVector... · 9e33954d
      Takuya UESHIN authored
      [SPARK-21745][SQL] Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce WritableColumnVector.
      
      ## What changes were proposed in this pull request?
      
      This is a refactoring of `ColumnVector` hierarchy and related classes.
      
      1. make `ColumnVector` read-only
      2. introduce `WritableColumnVector` with write interface
      3. remove `ReadOnlyColumnVector`
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18958 from ueshin/issues/SPARK-21745.
      9e33954d
    • Jen-Ming Chung's avatar
      [SPARK-21804][SQL] json_tuple returns null values within repeated columns except the first one · 95713eb4
      Jen-Ming Chung authored
      ## What changes were proposed in this pull request?
      
      When json_tuple in extracting values from JSON it returns null values within repeated columns except the first one as below:
      
      ``` scala
      scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 'a')""").show()
      +---+---+----+
      | c0| c1|  c2|
      +---+---+----+
      |  1|  2|null|
      +---+---+----+
      ```
      
      I think this should be consistent with Hive's implementation:
      ```
      hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a');
      ...
      1    1
      ```
      
      In this PR, we located all the matched indices in `fieldNames` instead of returning the first matched index, i.e., indexOf.
      
      ## How was this patch tested?
      
      Added test in JsonExpressionsSuite.
      
      Author: Jen-Ming Chung <jenmingisme@gmail.com>
      
      Closes #19017 from jmchung/SPARK-21804.
      95713eb4
    • lufei's avatar
      [MINOR][SQL] The comment of Class ExchangeCoordinator exist a typing and context error · 846bc61c
      lufei authored
      ## What changes were proposed in this pull request?
      
      The given example in the comment of Class ExchangeCoordinator is exist four post-shuffle partitions,but the current comment is “three”.
      
      ## How was this patch tested?
      
      Author: lufei <lu.fei80@zte.com.cn>
      
      Closes #19028 from figo77/SPARK-21816.
      846bc61c
  5. Aug 23, 2017
    • 10129659's avatar
      [SPARK-21807][SQL] Override ++ operation in ExpressionSet to reduce clone time · b8aaef49
      10129659 authored
      ## What changes were proposed in this pull request?
      The getAliasedConstraints  fuction in LogicalPlan.scala will clone the expression set when an element added,
      and it will take a long time. This PR add a function to add multiple elements at once to reduce the clone time.
      
      Before modified, the cost of getAliasedConstraints is:
      100 expressions:  41 seconds
      150 expressions:  466 seconds
      
      After modified, the cost of getAliasedConstraints is:
      100 expressions:  1.8 seconds
      150 expressions:  6.5 seconds
      
      The test is like this:
      test("getAliasedConstraints") {
          val expressionNum = 150
          val aggExpression = (1 to expressionNum).map(i => Alias(Count(Literal(1)), s"cnt$i")())
          val aggPlan = Aggregate(Nil, aggExpression, LocalRelation())
      
          val beginTime = System.currentTimeMillis()
          val expressions = aggPlan.validConstraints
          println(s"validConstraints cost: ${System.currentTimeMillis() - beginTime}ms")
          // The size of Aliased expression is n * (n - 1) / 2 + n
          assert( expressions.size === expressionNum * (expressionNum - 1) / 2 + expressionNum)
        }
      
      (Please fill in changes proposed in this fix)
      
      ## How was this patch tested?
      
      (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
      (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
      
      Run new added test.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: 10129659 <chen.yanshan@zte.com.cn>
      
      Closes #19022 from eatoncys/getAliasedConstraints.
      b8aaef49
    • Takeshi Yamamuro's avatar
      [SPARK-21603][SQL][FOLLOW-UP] Change the default value of maxLinesPerFunction into 4000 · 6942aeeb
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This pr changed the default value of `maxLinesPerFunction` into `4000`. In #18810, we had this new option to disable code generation for too long functions and I found this option only affected `Q17` and `Q66` in TPC-DS. But, `Q66` had some performance regression:
      
      ```
      Q17 w/o #18810, 3224ms --> q17 w/#18810, 2627ms (improvement)
      Q66 w/o #18810, 1712ms --> q66 w/#18810, 3032ms (regression)
      ```
      
      To keep the previous performance in TPC-DS, we better set higher value at `maxLinesPerFunction` by default.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #19021 from maropu/SPARK-21603-FOLLOWUP-1.
      6942aeeb
  6. Aug 22, 2017
    • Jose Torres's avatar
      [SPARK-21765] Set isStreaming on leaf nodes for streaming plans. · 3c0c2d09
      Jose Torres authored
      ## What changes were proposed in this pull request?
      All streaming logical plans will now have isStreaming set. This involved adding isStreaming as a case class arg in a few cases, since a node might be logically streaming depending on where it came from.
      
      ## How was this patch tested?
      
      Existing unit tests - no functional change is intended in this PR.
      
      Author: Jose Torres <joseph-torres@databricks.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #18973 from joseph-torres/SPARK-21765.
      3c0c2d09
    • gatorsmile's avatar
      [SPARK-21769][SQL] Add a table-specific option for always respecting schemas... · 01a8e462
      gatorsmile authored
      [SPARK-21769][SQL] Add a table-specific option for always respecting schemas inferred/controlled by Spark SQL
      
      ## What changes were proposed in this pull request?
      For Hive-serde tables, we always respect the schema stored in Hive metastore, because the schema could be altered by the other engines that share the same metastore. Thus, we always trust the metastore-controlled schema for Hive-serde tables when the schemas are different (without considering the nullability and cases). However, in some scenarios, Hive metastore also could INCORRECTLY overwrite the schemas when the serde and Hive metastore built-in serde are different.
      
      The proposed solution is to introduce a table-specific option for such scenarios. For a specific table, users can make Spark always respect Spark-inferred/controlled schema instead of trusting metastore-controlled schema. By default, we trust Hive metastore-controlled schema.
      
      ## How was this patch tested?
      Added a cross-version test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19003 from gatorsmile/respectSparkSchema.
      01a8e462
    • gatorsmile's avatar
      [SPARK-21499][SQL] Support creating persistent function for Spark... · 43d71d96
      gatorsmile authored
      [SPARK-21499][SQL] Support creating persistent function for Spark UDAF(UserDefinedAggregateFunction)
      
      ## What changes were proposed in this pull request?
      This PR is to enable users to create persistent Scala UDAF (that extends UserDefinedAggregateFunction).
      
      ```SQL
      CREATE FUNCTION myDoubleAvg AS 'test.org.apache.spark.sql.MyDoubleAvg'
      ```
      
      Before this PR, Spark UDAF only can be registered through the API `spark.udf.register(...)`
      
      ## How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18700 from gatorsmile/javaUDFinScala.
      43d71d96
    • gatorsmile's avatar
      [SPARK-21803][TEST] Remove the HiveDDLCommandSuite · be72b157
      gatorsmile authored
      ## What changes were proposed in this pull request?
      We do not have any Hive-specific parser. It does not make sense to keep a parser-specific test suite `HiveDDLCommandSuite.scala` in the Hive package. This PR is to remove it.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19015 from gatorsmile/combineDDL.
      be72b157
  7. Aug 21, 2017
    • Marcelo Vanzin's avatar
      [SPARK-21617][SQL] Store correct table metadata when altering schema in Hive metastore. · 84b5b16e
      Marcelo Vanzin authored
      For Hive tables, the current "replace the schema" code is the correct
      path, except that an exception in that path should result in an error, and
      not in retrying in a different way.
      
      For data source tables, Spark may generate a non-compatible Hive table;
      but for that to work with Hive 2.1, the detection of data source tables needs
      to be fixed in the Hive client, to also consider the raw tables used by code
      such as `alterTableSchema`.
      
      Tested with existing and added unit tests (plus internal tests with a 2.1 metastore).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18849 from vanzin/SPARK-21617.
      84b5b16e
    • Sean Owen's avatar
      [SPARK-21718][SQL] Heavy log of type: "Skipping partition based on stats ..." · b3a07526
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Reduce 'Skipping partitions' message to debug
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19010 from srowen/SPARK-21718.
      b3a07526
  8. Aug 20, 2017
    • Liang-Chi Hsieh's avatar
      [SPARK-21721][SQL][FOLLOWUP] Clear FileSystem deleteOnExit cache when paths... · 28a6cca7
      Liang-Chi Hsieh authored
      [SPARK-21721][SQL][FOLLOWUP] Clear FileSystem deleteOnExit cache when paths are successfully removed
      
      ## What changes were proposed in this pull request?
      
      Fix a typo in test.
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19005 from viirya/SPARK-21721-followup.
      28a6cca7
    • hyukjinkwon's avatar
      [SPARK-21773][BUILD][DOCS] Installs mkdocs if missing in the path in SQL documentation build · 41e0eb71
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to install `mkdocs` by `pip install` if missing in the path. Mainly to fix Jenkins's documentation build failure in `spark-master-docs`. See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-docs/3580/console.
      
      It also adds `mkdocs` as requirements in `docs/README.md`.
      
      ## How was this patch tested?
      
      I manually ran `jekyll build` under `docs` directory after manually removing `mkdocs` via `pip uninstall mkdocs`.
      
      Also, tested this in the same way but on CentOS Linux release 7.3.1611 (Core) where I built Spark few times but never built documentation before and `mkdocs` is not installed.
      
      ```
      ...
      Moving back into docs dir.
      Moving to SQL directory and building docs.
      Missing mkdocs in your path, trying to install mkdocs for SQL documentation generation.
      Collecting mkdocs
        Downloading mkdocs-0.16.3-py2.py3-none-any.whl (1.2MB)
          100% |████████████████████████████████| 1.2MB 574kB/s
      Requirement already satisfied: PyYAML>=3.10 in /usr/lib64/python2.7/site-packages (from mkdocs)
      Collecting livereload>=2.5.1 (from mkdocs)
        Downloading livereload-2.5.1-py2-none-any.whl
      Collecting tornado>=4.1 (from mkdocs)
        Downloading tornado-4.5.1.tar.gz (483kB)
          100% |████████████████████████████████| 491kB 1.4MB/s
      Collecting Markdown>=2.3.1 (from mkdocs)
        Downloading Markdown-2.6.9.tar.gz (271kB)
          100% |████████████████████████████████| 276kB 2.4MB/s
      Collecting click>=3.3 (from mkdocs)
        Downloading click-6.7-py2.py3-none-any.whl (71kB)
          100% |████████████████████████████████| 71kB 2.8MB/s
      Requirement already satisfied: Jinja2>=2.7.1 in /usr/lib/python2.7/site-packages (from mkdocs)
      Requirement already satisfied: six in /usr/lib/python2.7/site-packages (from livereload>=2.5.1->mkdocs)
      Requirement already satisfied: backports.ssl_match_hostname in /usr/lib/python2.7/site-packages (from tornado>=4.1->mkdocs)
      Collecting singledispatch (from tornado>=4.1->mkdocs)
        Downloading singledispatch-3.4.0.3-py2.py3-none-any.whl
      Collecting certifi (from tornado>=4.1->mkdocs)
        Downloading certifi-2017.7.27.1-py2.py3-none-any.whl (349kB)
          100% |████████████████████████████████| 358kB 2.1MB/s
      Collecting backports_abc>=0.4 (from tornado>=4.1->mkdocs)
        Downloading backports_abc-0.5-py2.py3-none-any.whl
      Requirement already satisfied: MarkupSafe>=0.23 in /usr/lib/python2.7/site-packages (from Jinja2>=2.7.1->mkdocs)
      Building wheels for collected packages: tornado, Markdown
        Running setup.py bdist_wheel for tornado ... done
        Stored in directory: /root/.cache/pip/wheels/84/83/cd/6a04602633457269d161344755e6766d24307189b7a67ff4b7
        Running setup.py bdist_wheel for Markdown ... done
        Stored in directory: /root/.cache/pip/wheels/bf/46/10/c93e17ae86ae3b3a919c7b39dad3b5ccf09aeb066419e5c1e5
      Successfully built tornado Markdown
      Installing collected packages: singledispatch, certifi, backports-abc, tornado, livereload, Markdown, click, mkdocs
      Successfully installed Markdown-2.6.9 backports-abc-0.5 certifi-2017.7.27.1 click-6.7 livereload-2.5.1 mkdocs-0.16.3 singledispatch-3.4.0.3 tornado-4.5.1
      Generating markdown files for SQL documentation.
      Generating HTML files for SQL documentation.
      INFO    -  Cleaning site directory
      INFO    -  Building documentation to directory: .../spark/sql/site
      Moving back into docs dir.
      Making directory api/sql
      cp -r ../sql/site/. api/sql
                  Source: .../spark/docs
             Destination: .../spark/docs/_site
            Generating...
                          done.
       Auto-regeneration: disabled. Use --watch to enable.
       ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18984 from HyukjinKwon/sql-doc-mkdocs.
      41e0eb71
  9. Aug 18, 2017
    • Wenchen Fan's avatar
      [SPARK-21743][SQL][FOLLOW-UP] top-most limit should not cause memory leak · 7880909c
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up of https://github.com/apache/spark/pull/18955 , to fix a bug that we break whole stage codegen for `Limit`.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18993 from cloud-fan/bug.
      7880909c
    • Masha Basmanova's avatar
      [SPARK-21213][SQL] Support collecting partition-level statistics: rowCount and sizeInBytes · 23ea8980
      Masha Basmanova authored
      ## What changes were proposed in this pull request?
      
      Added support for ANALYZE TABLE [db_name].tablename PARTITION (partcol1[=val1], partcol2[=val2], ...) COMPUTE STATISTICS [NOSCAN] SQL command to calculate total number of rows and size in bytes for a subset of partitions. Calculated statistics are stored in Hive Metastore as user-defined properties attached to partition objects. Property names are the same as the ones used to store table-level statistics: spark.sql.statistics.totalSize and spark.sql.statistics.numRows.
      
      When partition specification contains all partition columns with values, the command collects statistics for a single partition that matches the specification. When some partition columns are missing or listed without their values, the command collects statistics for all partitions which match a subset of partition column values specified.
      
      For example, table t has 4 partitions with the following specs:
      
      * Partition1: (ds='2008-04-08', hr=11)
      * Partition2: (ds='2008-04-08', hr=12)
      * Partition3: (ds='2008-04-09', hr=11)
      * Partition4: (ds='2008-04-09', hr=12)
      
      'ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11)' command will collect statistics only for partition 3.
      
      'ANALYZE TABLE t PARTITION (ds='2008-04-09')' command will collect statistics for partitions 3 and 4.
      
      'ANALYZE TABLE t PARTITION (ds, hr)' command will collect statistics for all four partitions.
      
      When the optional parameter NOSCAN is specified, the command doesn't count number of rows and only gathers size in bytes.
      
      The statistics gathered by ANALYZE TABLE command can be fetched using DESC EXTENDED [db_name.]tablename PARTITION command.
      
      ## How was this patch tested?
      
      Added tests.
      
      Author: Masha Basmanova <mbasmanova@fb.com>
      
      Closes #18421 from mbasmanova/mbasmanova-analyze-partition.
      23ea8980
    • Reynold Xin's avatar
      [SPARK-21778][SQL] Simpler Dataset.sample API in Scala / Java · 07a2b873
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Dataset.sample requires a boolean flag withReplacement as the first argument. However, most of the time users simply want to sample some records without replacement. This ticket introduces a new sample function that simply takes in the fraction and seed.
      
      ## How was this patch tested?
      Tested manually. Not sure yet if we should add a test case for just this wrapper ...
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18988 from rxin/SPARK-21778.
      07a2b873
    • donnyzone's avatar
      [SPARK-21739][SQL] Cast expression should initialize timezoneId when it is... · 310454be
      donnyzone authored
      [SPARK-21739][SQL] Cast expression should initialize timezoneId when it is called statically to convert something into TimestampType
      
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21739
      
      This issue is caused by introducing TimeZoneAwareExpression.
      When the **Cast** expression converts something into TimestampType, it should be resolved with setting `timezoneId`. In general, it is resolved in LogicalPlan phase.
      
      However, there are still some places that use Cast expression statically to convert datatypes without setting `timezoneId`. In such cases,  `NoSuchElementException: None.get` will be thrown for TimestampType.
      
      This PR is proposed to fix the issue. We have checked the whole project and found two such usages(i.e., in`TableReader` and `HiveTableScanExec`).
      
      ## How was this patch tested?
      
      unit test
      
      Author: donnyzone <wellfengzhu@gmail.com>
      
      Closes #18960 from DonnyZone/spark-21739.
      310454be
  10. Aug 17, 2017
    • gatorsmile's avatar
      [SPARK-21767][TEST][SQL] Add Decimal Test For Avro in VersionSuite · 2caaed97
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Decimal is a logical type of AVRO. We need to ensure the support of Hive's AVRO serde works well in Spark
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18977 from gatorsmile/addAvroTest.
      2caaed97
    • Jen-Ming Chung's avatar
      [SPARK-21677][SQL] json_tuple throws NullPointException when column is null as string type · 7ab95188
      Jen-Ming Chung authored
      ## What changes were proposed in this pull request?
      ``` scala
      scala> Seq(("""{"Hyukjin": 224, "John": 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show()
      ...
      java.lang.NullPointerException
      	at ...
      ```
      
      Currently the `null` field name will throw NullPointException. As a given field name null can't be matched with any field names in json, we just output null as its column value. This PR achieves it by returning a very unlikely column name `__NullFieldName` in evaluation of the field names.
      
      ## How was this patch tested?
      Added unit test.
      
      Author: Jen-Ming Chung <jenmingisme@gmail.com>
      
      Closes #18930 from jmchung/SPARK-21677.
      7ab95188
    • Takeshi Yamamuro's avatar
      [SPARK-18394][SQL] Make an AttributeSet.toSeq output order consistent · 6aad02d0
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This pr sorted output attributes on their name and exprId in `AttributeSet.toSeq` to make the order consistent.  If the order is different, spark possibly generates different code and then misses cache in `CodeGenerator`, e.g., `GenerateColumnAccessor` generates code depending on an input attribute order.
      
      ## How was this patch tested?
      Added tests in `AttributeSetSuite` and manually checked if the cache worked well in the given query of the JIRA.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #18959 from maropu/SPARK-18394.
      6aad02d0
    • gatorsmile's avatar
      [SQL][MINOR][TEST] Set spark.unsafe.exceptionOnMemoryLeak to true · ae9e4247
      gatorsmile authored
      ## What changes were proposed in this pull request?
      When running IntelliJ, we are unable to capture the exception of memory leak detection.
      > org.apache.spark.executor.Executor: Managed memory leak detected
      
      Explicitly setting `spark.unsafe.exceptionOnMemoryLeak` in SparkConf when building the SparkSession, instead of reading it from system properties.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18967 from gatorsmile/setExceptionOnMemoryLeak.
      ae9e4247
    • Kent Yao's avatar
      [SPARK-21428] Turn IsolatedClientLoader off while using builtin Hive jars for... · b83b502c
      Kent Yao authored
      [SPARK-21428] Turn IsolatedClientLoader off while using builtin Hive jars for reusing CliSessionState
      
      ## What changes were proposed in this pull request?
      
      Set isolated to false while using builtin hive jars and `SessionState.get` returns a `CliSessionState` instance.
      
      ## How was this patch tested?
      
      1 Unit Tests
      2 Manually verified: `hive.exec.strachdir` was only created once because of reusing cliSessionState
      ```java
      ➜  spark git:(SPARK-21428) ✗ bin/spark-sql --conf spark.sql.hive.metastore.jars=builtin
      
      log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
      log4j:WARN Please initialize the log4j system properly.
      log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
      Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
      17/07/16 23:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      17/07/16 23:59:27 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
      17/07/16 23:59:27 INFO ObjectStore: ObjectStore, initialize called
      17/07/16 23:59:28 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
      17/07/16 23:59:28 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
      17/07/16 23:59:29 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
      17/07/16 23:59:30 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
      17/07/16 23:59:30 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
      17/07/16 23:59:31 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
      17/07/16 23:59:31 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
      17/07/16 23:59:31 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
      17/07/16 23:59:31 INFO ObjectStore: Initialized ObjectStore
      17/07/16 23:59:31 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
      17/07/16 23:59:31 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
      17/07/16 23:59:32 INFO HiveMetaStore: Added admin role in metastore
      17/07/16 23:59:32 INFO HiveMetaStore: Added public role in metastore
      17/07/16 23:59:32 INFO HiveMetaStore: No user is added in admin role, since config is empty
      17/07/16 23:59:32 INFO HiveMetaStore: 0: get_all_databases
      17/07/16 23:59:32 INFO audit: ugi=Kent	ip=unknown-ip-addr	cmd=get_all_databases
      17/07/16 23:59:32 INFO HiveMetaStore: 0: get_functions: db=default pat=*
      17/07/16 23:59:32 INFO audit: ugi=Kent	ip=unknown-ip-addr	cmd=get_functions: db=default pat=*
      17/07/16 23:59:32 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
      17/07/16 23:59:32 INFO SessionState: Created local directory: /var/folders/k2/04p4k4ws73l6711h_mz2_tq00000gn/T/beea7261-221a-4711-89e8-8b12a9d37370_resources
      17/07/16 23:59:32 INFO SessionState: Created HDFS directory: /tmp/hive/Kent/beea7261-221a-4711-89e8-8b12a9d37370
      17/07/16 23:59:32 INFO SessionState: Created local directory: /var/folders/k2/04p4k4ws73l6711h_mz2_tq00000gn/T/Kent/beea7261-221a-4711-89e8-8b12a9d37370
      17/07/16 23:59:32 INFO SessionState: Created HDFS directory: /tmp/hive/Kent/beea7261-221a-4711-89e8-8b12a9d37370/_tmp_space.db
      17/07/16 23:59:32 INFO SparkContext: Running Spark version 2.3.0-SNAPSHOT
      17/07/16 23:59:32 INFO SparkContext: Submitted application: SparkSQL::10.0.0.8
      17/07/16 23:59:32 INFO SecurityManager: Changing view acls to: Kent
      17/07/16 23:59:32 INFO SecurityManager: Changing modify acls to: Kent
      17/07/16 23:59:32 INFO SecurityManager: Changing view acls groups to:
      17/07/16 23:59:32 INFO SecurityManager: Changing modify acls groups to:
      17/07/16 23:59:32 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(Kent); groups with view permissions: Set(); users  with modify permissions: Set(Kent); groups with modify permissions: Set()
      17/07/16 23:59:33 INFO Utils: Successfully started service 'sparkDriver' on port 51889.
      17/07/16 23:59:33 INFO SparkEnv: Registering MapOutputTracker
      17/07/16 23:59:33 INFO SparkEnv: Registering BlockManagerMaster
      17/07/16 23:59:33 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
      17/07/16 23:59:33 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
      17/07/16 23:59:33 INFO DiskBlockManager: Created local directory at /private/var/folders/k2/04p4k4ws73l6711h_mz2_tq00000gn/T/blockmgr-9cfae28a-01e9-4c73-a1f1-f76fa52fc7a5
      17/07/16 23:59:33 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
      17/07/16 23:59:33 INFO SparkEnv: Registering OutputCommitCoordinator
      17/07/16 23:59:33 INFO Utils: Successfully started service 'SparkUI' on port 4040.
      17/07/16 23:59:33 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.0.8:4040
      17/07/16 23:59:33 INFO Executor: Starting executor ID driver on host localhost
      17/07/16 23:59:33 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 51890.
      17/07/16 23:59:33 INFO NettyBlockTransferService: Server created on 10.0.0.8:51890
      17/07/16 23:59:33 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
      17/07/16 23:59:33 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.0.8, 51890, None)
      17/07/16 23:59:33 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.0.8:51890 with 366.3 MB RAM, BlockManagerId(driver, 10.0.0.8, 51890, None)
      17/07/16 23:59:33 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.0.8, 51890, None)
      17/07/16 23:59:33 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.0.8, 51890, None)
      17/07/16 23:59:34 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/Users/Kent/Documents/spark/spark-warehouse').
      17/07/16 23:59:34 INFO SharedState: Warehouse path is 'file:/Users/Kent/Documents/spark/spark-warehouse'.
      17/07/16 23:59:34 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
      17/07/16 23:59:34 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is /user/hive/warehouse
      17/07/16 23:59:34 INFO HiveMetaStore: 0: get_database: default
      17/07/16 23:59:34 INFO audit: ugi=Kent	ip=unknown-ip-addr	cmd=get_database: default
      17/07/16 23:59:34 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is /user/hive/warehouse
      17/07/16 23:59:34 INFO HiveMetaStore: 0: get_database: global_temp
      17/07/16 23:59:34 INFO audit: ugi=Kent	ip=unknown-ip-addr	cmd=get_database: global_temp
      17/07/16 23:59:34 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
      17/07/16 23:59:34 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is /user/hive/warehouse
      17/07/16 23:59:34 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
      spark-sql>
      
      ```
      cc cloud-fan gatorsmile
      
      Author: Kent Yao <yaooqinn@hotmail.com>
      Author: hzyaoqin <hzyaoqin@corp.netease.com>
      
      Closes #18648 from yaooqinn/SPARK-21428.
      b83b502c
    • Wenchen Fan's avatar
      [SPARK-21743][SQL] top-most limit should not cause memory leak · a45133b8
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      For top-most limit, we will use a special operator to execute it: `CollectLimitExec`.
      
      `CollectLimitExec` will retrieve `n`(which is the limit) rows from each partition of the child plan output, see https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L311. It's very likely that we don't exhaust the child plan output.
      
      This is fine when whole-stage-codegen is off, as child plan will release the resource via task completion listener. However, when whole-stage codegen is on, the resource can only be released if all output is consumed.
      
      To fix this memory leak, one simple approach is, when `CollectLimitExec` retrieve `n` rows from child plan output, child plan output should only have `n` rows, then the output is exhausted and resource is released. This can be done by wrapping child plan with `LocalLimit`
      
      ## How was this patch tested?
      
      a regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18955 from cloud-fan/leak.
      a45133b8
  11. Aug 16, 2017
    • Marco Gaido's avatar
      [SPARK-21738] Thriftserver doesn't cancel jobs when session is closed · 7add4e98
      Marco Gaido authored
      ## What changes were proposed in this pull request?
      
      When a session is closed the Thriftserver doesn't cancel the jobs which may still be running. This is a huge waste of resources.
      This PR address the problem canceling the pending jobs when a session is closed.
      
      ## How was this patch tested?
      
      The patch was tested manually.
      
      Author: Marco Gaido <mgaido@hortonworks.com>
      
      Closes #18951 from mgaido91/SPARK-21738.
      7add4e98
    • 10129659's avatar
      [SPARK-21603][SQL] The wholestage codegen will be much slower then that is... · 1cce1a3b
      10129659 authored
      [SPARK-21603][SQL] The wholestage codegen will be much slower then that is closed when the function is too long
      
      ## What changes were proposed in this pull request?
      Close the whole stage codegen when the function lines is longer than the maxlines which will be setted by
      spark.sql.codegen.MaxFunctionLength parameter, because when the function is too long , it will not get the JIT  optimizing.
      A benchmark test result is 10x slower when the generated function is too long :
      
      ignore("max function length of wholestagecodegen") {
          val N = 20 << 15
      
          val benchmark = new Benchmark("max function length of wholestagecodegen", N)
          def f(): Unit = sparkSession.range(N)
            .selectExpr(
              "id",
              "(id & 1023) as k1",
              "cast(id & 1023 as double) as k2",
              "cast(id & 1023 as int) as k3",
              "case when id > 100 and id <= 200 then 1 else 0 end as v1",
              "case when id > 200 and id <= 300 then 1 else 0 end as v2",
              "case when id > 300 and id <= 400 then 1 else 0 end as v3",
              "case when id > 400 and id <= 500 then 1 else 0 end as v4",
              "case when id > 500 and id <= 600 then 1 else 0 end as v5",
              "case when id > 600 and id <= 700 then 1 else 0 end as v6",
              "case when id > 700 and id <= 800 then 1 else 0 end as v7",
              "case when id > 800 and id <= 900 then 1 else 0 end as v8",
              "case when id > 900 and id <= 1000 then 1 else 0 end as v9",
              "case when id > 1000 and id <= 1100 then 1 else 0 end as v10",
              "case when id > 1100 and id <= 1200 then 1 else 0 end as v11",
              "case when id > 1200 and id <= 1300 then 1 else 0 end as v12",
              "case when id > 1300 and id <= 1400 then 1 else 0 end as v13",
              "case when id > 1400 and id <= 1500 then 1 else 0 end as v14",
              "case when id > 1500 and id <= 1600 then 1 else 0 end as v15",
              "case when id > 1600 and id <= 1700 then 1 else 0 end as v16",
              "case when id > 1700 and id <= 1800 then 1 else 0 end as v17",
              "case when id > 1800 and id <= 1900 then 1 else 0 end as v18")
            .groupBy("k1", "k2", "k3")
            .sum()
            .collect()
      
          benchmark.addCase(s"codegen = F") { iter =>
            sparkSession.conf.set("spark.sql.codegen.wholeStage", "false")
            f()
          }
      
          benchmark.addCase(s"codegen = T") { iter =>
            sparkSession.conf.set("spark.sql.codegen.wholeStage", "true")
            sparkSession.conf.set("spark.sql.codegen.MaxFunctionLength", "10000")
            f()
          }
      
          benchmark.run()
      
          /*
          Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14 on Windows 7 6.1
          Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
          max function length of wholestagecodegen: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
          ------------------------------------------------------------------------------------------------
          codegen = F                                    443 /  507          1.5         676.0       1.0X
          codegen = T                                   3279 / 3283          0.2        5002.6       0.1X
           */
        }
      
      ## How was this patch tested?
      Run the unit test
      
      Author: 10129659 <chen.yanshan@zte.com.cn>
      
      Closes #18810 from eatoncys/codegen.
      1cce1a3b
    • Dongjoon Hyun's avatar
      [SPARK-21422][BUILD] Depend on Apache ORC 1.4.0 · 8c54f1eb
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Like Parquet, this PR aims to depend on the latest Apache ORC 1.4 for Apache Spark 2.3. There are key benefits for Apache ORC 1.4.
      
      - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC community more.
      - Maintainability: Reduce the Hive dependency and can remove old legacy code later.
      
      Later, we can get the following two key benefits by adding new ORCFileFormat in SPARK-20728 (#17980), too.
      - Usability: User can use ORC data sources without hive module, i.e, -Phive.
      - Speed: Use both Spark ColumnarBatch and ORC RowBatch together. This will be faster than the current implementation in Spark.
      
      ## How was this patch tested?
      
      Pass the jenkins.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #18640 from dongjoon-hyun/SPARK-21422.
      8c54f1eb
  12. Aug 15, 2017
    • WeichenXu's avatar
      [SPARK-19634][ML] Multivariate summarizer - dataframes API · 07549b20
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of MultivariateOnlineSummarizer, it also allows the user to select a subset of the metrics.
      
      ## How was this patch tested?
      
      Testcases added.
      
      ## Performance
      Resolve several performance issues in #17419, further optimization pending on SQL team's work. One of the SQL layer performance issue related to these feature has been resolved in #18712, thanks liancheng and cloud-fan
      
      ### Performance data
      
      (test on my laptop, use 2 partitions. tries out = 20, warm up = 10)
      
      The unit of test results is records/milliseconds (higher is better)
      
      Vector size/records number | 1/10000000 | 10/1000000 | 100/1000000 | 1000/100000 | 10000/10000
      ----|------|----|---|----|----
      Dataframe | 15149  | 7441 | 2118 | 224 | 21
      RDD from Dataframe | 4992  | 4440 | 2328 | 320 | 33
      raw RDD | 53931  | 20683 | 3966 | 528 | 53
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #18798 from WeichenXu123/SPARK-19634-dataframe-summarizer.
      07549b20
Loading