  1. Jan 14, 2017
• [SPARK-19221][PROJECT INFRA][R] Add winutils binaries to the path in AppVeyor... · b6a7aa4f
      hyukjinkwon authored
      [SPARK-19221][PROJECT INFRA][R] Add winutils binaries to the path in AppVeyor tests for Hadoop libraries to call native codes properly
      
      ## What changes were proposed in this pull request?
      
It seems the Hadoop libraries need the winutils binaries on the path in order to call native code.
      
This is not a problem for the tests right now because we only test SparkR on Windows via AppVeyor, but it becomes a problem if we run the Scala tests via AppVeyor, as below:
      
      ```
       - SPARK-18220: read Hive orc table with varchar column *** FAILED *** (3 seconds, 937 milliseconds)
         org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
         at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
         at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:609)
         at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
         ...
      ```
      
This PR proposes to add the winutils binaries to the `Path` for the AppVeyor tests.
      
      ## How was this patch tested?
      
      Manually via AppVeyor.
      
      **Before**
      https://ci.appveyor.com/project/spark-test/spark/build/549-windows-complete/job/gc8a1pjua2bc4i8m
      
      **After**
      https://ci.appveyor.com/project/spark-test/spark/build/572-windows-complete/job/c4vrysr5uvj2hgu7
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16584 from HyukjinKwon/set-path-appveyor.
      b6a7aa4f
  2. Jan 13, 2017
• [SPARK-19180] [SQL] the offset of short should be 2 in OffHeapColumn · ad0dadaa
      Yucai Yu authored
      ## What changes were proposed in this pull request?
      
The element offset used for shorts in OffHeapColumnVector's putShorts is 4 bytes, but it should be 2 bytes (the size of a short).
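
A minimal sketch of the 2-byte stride, assuming `org.apache.spark.unsafe.Platform` for the raw memory calls (this illustrates the element size only; it is not the patched putShorts itself):

```scala
import org.apache.spark.unsafe.Platform

// Copy 4 shorts into off-heap memory; each element is 2 bytes wide,
// so the destination byte offset advances by 2 per row, not 4.
val src  = Array[Short](1, 2, 3, 4)
val data = Platform.allocateMemory(src.length * 2L)

Platform.copyMemory(src, Platform.SHORT_ARRAY_OFFSET, null, data, src.length * 2L)
assert(Platform.getShort(null, data + 2L * 3) == 4)  // element 3 lives at byte offset 6

Platform.freeMemory(data)
```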
      
      ## How was this patch tested?
      
      unit test
      
      Author: Yucai Yu <yucai.yu@intel.com>
      
      Closes #16555 from yucai/offheap_short.
      ad0dadaa
• [SPARK-18335][SPARKR] createDataFrame to support numPartitions parameter · b0e8eb6d
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
To allow specifying the number of partitions when the DataFrame is created.
      
      ## How was this patch tested?
      
      manual, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16512 from felixcheung/rnumpart.
      b0e8eb6d
• [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a... · 285a7798
      Vinayak authored
      [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a new SQLContext object fails with a Derby error
      
This change makes SQLContext reuse the active SparkSession during construction when the supplied SparkContext is the same as the currently active one. Without this change, a new SparkSession is instantiated, which results in a Derby error when attempting to create a DataFrame from a new SQLContext object, even though the SparkContext supplied to the new SQLContext is the same as the currently active one. Refer to https://issues.apache.org/jira/browse/SPARK-18687 for details on the error and a repro.
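
A conceptual sketch of the reuse logic in Scala (the actual patch is in PySpark's SQLContext; the `sessionFor` helper below is hypothetical):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Prefer the active SparkSession when it wraps the same SparkContext,
// instead of instantiating a fresh session (the behaviour that led to
// the Derby error described above).
def sessionFor(sc: SparkContext): SparkSession =
  SparkSession.getActiveSession
    .filter(_.sparkContext eq sc)
    .getOrElse(SparkSession.builder().config(sc.getConf).getOrCreate())
```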
      
      Existing unit tests and a new unit test added to pyspark-sql:
      
      /python/run-tests --python-executables=python --modules=pyspark-sql
      
      Author: Vinayak <vijoshi5@in.ibm.com>
      Author: Vinayak Joshi <vijoshi@users.noreply.github.com>
      
      Closes #16119 from vijoshi/SPARK-18687_master.
      285a7798
• Fix missing close-parens for In filter's toString · b040cef2
      Andrew Ash authored
Otherwise the open parenthesis isn't closed in query plan descriptions of batch scans.
      
          PushedFilters: [In(COL_A, [1,2,4,6,10,16,219,815], IsNotNull(COL_B), ...
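
A hedged sketch of the kind of fix described, assuming the data source `In` filter shape (an attribute name plus an array of values):

```scala
// Custom toString so the values array prints as a list,
// with the trailing ')' closing the parenthesis.
case class In(attribute: String, values: Array[Any]) {
  override def toString: String =
    s"In($attribute, [${values.mkString(",")}])"
}

// In("COL_A", Array(1, 2, 4)).toString == "In(COL_A, [1,2,4])"
```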
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #16558 from ash211/patch-9.
      b040cef2
• [SPARK-19178][SQL] convert string of large numbers to int should return null · 6b34e745
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
When we convert a string to an integral type, we first convert the string to `decimal(20, 0)`, so that a string in decimal format can be turned into a truncated integral, e.g. `CAST('1.2' AS int)` returns `1`.

However, this causes problems when converting a string holding a large number to an integral type, e.g. `CAST('1234567890123' AS int)` returns `1912276171`, while Hive returns null as expected.

This is a long-standing bug (it seems to have been there since the first day Spark SQL was created); this PR fixes it by adding native support for converting a `UTF8String` to an integral type.
      
      ## How was this patch tested?
      
      new regression tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16550 from cloud-fan/string-to-int.
      6b34e745
• [SPARK-19142][SPARKR] spark.kmeans should take seed, initSteps, and tol as parameters · 7f24a0b6
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
spark.kmeans doesn't have an interface to set initSteps, seed, and tol. Since Spark's k-means algorithm doesn't take the same set of parameters as R's kmeans, we should maintain a different interface in spark.kmeans.

Add the missing parameters and the corresponding documentation.
      
      Modified existing unit tests to take additional parameters.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16523 from wangmiao1981/kmeans.
      7f24a0b6
  3. Jan 12, 2017
  4. Jan 11, 2017
• [SPARK-16848][SQL] Check schema validation for user-specified schema in jdbc and table APIs · 24100f16
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
This PR proposes to throw an exception from both the jdbc and table APIs when a user-specified schema is not allowed or is useless.
      
      **DataFrameReader.jdbc(...)**
      
      ``` scala
      spark.read.schema(StructType(Nil)).jdbc(...)
      ```
      
      **DataFrameReader.table(...)**
      
      ```scala
      spark.read.schema(StructType(Nil)).table("usrdb.test")
      ```
      
      ## How was this patch tested?
      
      Unit test in `JDBCSuite` and `DataFrameReaderWriterSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14451 from HyukjinKwon/SPARK-16848.
      24100f16
• [SPARK-19132][SQL] Add test cases for row size estimation and aggregate estimation · 43fa21b3
      wangzhenhua authored
      ## What changes were proposed in this pull request?
      
In this PR, we add more test cases for project and aggregate estimation.
      
      ## How was this patch tested?
      
      Add test cases.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #16551 from wzhfy/addTests.
      43fa21b3
• [SPARK-19149][SQL] Follow-up: simplify cache implementation. · 66fe819a
      Reynold Xin authored
      ## What changes were proposed in this pull request?
This patch slightly simplifies the logical plan statistics cache implementation, as discussed in https://github.com/apache/spark/pull/16529.
      
      ## How was this patch tested?
      N/A - this has no behavior change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16544 from rxin/SPARK-19149.
      66fe819a
• [SPARK-18801][SQL] Support resolve a nested view · 30a07071
      jiangxingbo authored
      ## What changes were proposed in this pull request?
      
We should be able to resolve a nested view. The main advantage is that if you update an underlying view, the current view also gets updated.
The new approach should be compatible with older versions of Spark/Hive, which means:
1. The new approach should be able to resolve views created by older versions of Spark/Hive;
2. The new approach should be able to resolve the views that are currently supported by Spark SQL.

The new approach mainly brings in the following changes:
1. Add a new operator called `View` to keep track of the `CatalogTable` that describes the view, its output attributes, and the child of the view;
2. Update the `ResolveRelations` rule to resolve relations and views; note that a nested view should be resolved correctly;
3. Add a `viewDefaultDatabase` variable to `CatalogTable` to keep track of the default database name used to resolve a view; if the `CatalogTable` is not a view, the variable should be `None`;
4. Add `AnalysisContext` so that we can still support a view created with a CTE/window query;
5. Enable view support without enabling Hive support (i.e., `enableHiveSupport`);
6. Fix a weird behavior: the result of a view query may have a different schema if the referenced table has been changed. After this PR, we try to cast the child output attributes to those from the view schema, and throw an AnalysisException if the cast is not allowed.

Note this is compatible with the views defined by older versions of Spark (before 2.2), which have an empty `defaultDatabase` and in which all the relations in `viewText` have the database part defined.
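
A hedged illustration (not part of the patch) of the nested-view case this enables:

```scala
// v2 references v1, which in turn references a base table, so resolving v2
// requires the analyzer to resolve v1 as well.
spark.sql("CREATE TABLE src (id INT) USING parquet")
spark.sql("CREATE VIEW v1 AS SELECT id FROM src")
spark.sql("CREATE VIEW v2 AS SELECT id FROM v1")
spark.sql("SELECT * FROM v2").show()
```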
      
      ## How was this patch tested?
1. Add new tests in `SessionCatalogSuite` to test the function `lookupRelation`;
2. Add a new test case in `SQLViewSuite` to test resolving a nested view.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #16233 from jiangxb1987/resolve-view.
      30a07071
• [SPARK-17568][CORE][DEPLOY] Add spark-submit option to override ivy settings... · 3bc2eff8
      Bryan Cutler authored
      [SPARK-17568][CORE][DEPLOY] Add spark-submit option to override ivy settings used to resolve packages/artifacts
      
      ## What changes were proposed in this pull request?
      
Adds an option to spark-submit that allows overriding the default IvySettings used to resolve artifacts as part of the Spark Packages functionality. This allows all artifact resolution to go through a centrally managed repository, such as Nexus or Artifactory, where site admins can better approve and control what is used with Spark apps.

This change restructures the creation of the IvySettings object in two distinct ways. First, if the `spark.ivy.settings` option is not defined, `buildIvySettings` creates a default settings instance, as before, with the default repositories (Maven Central) included. Second, if the option is defined, the Ivy settings file is loaded from the given path and only the repositories defined within it are used for artifact resolution.

## How was this patch tested?

Existing tests for the default behaviour; manual tests that load an ivysettings.xml file with local and Nexus repositories defined. Added a new test that loads a simple Ivy settings file with a local filesystem resolver.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      Author: Ian Hummel <ian@themodernlife.net>
      
      Closes #15119 from BryanCutler/spark-custom-IvySettings.
      3bc2eff8
• [SPARK-19130][SPARKR] Support setting literal value as column implicitly · d749c066
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      ```
      df$foo <- 1
      ```
      
      instead of
      ```
      df$foo <- lit(1)
      ```
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16510 from felixcheung/rlitcol.
      d749c066
• [SPARK-19021][YARN] Generailize HDFSCredentialProvider to support non HDFS security filesystems · 4239a108
      jerryshao authored
Currently Spark can only get the token renewal interval from secure HDFS (hdfs://). If Spark runs with other secure file systems such as webHDFS (webhdfs://), WASB (wasb://), or ADLS, it ignores those tokens and does not obtain their renewal intervals, which makes Spark unable to work with those secure clusters. So instead of only checking the HDFS token, we should generalize the provider to support different DelegationTokenIdentifiers.
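
A hedged sketch of the generalization (not the actual provider code): inspect any delegation token whose identifier is an `AbstractDelegationTokenIdentifier`, rather than matching only hdfs:// tokens.

```scala
import org.apache.hadoop.security.token.{Token, TokenIdentifier}
import org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier

// Works for any delegation-token-based filesystem (webhdfs, wasb, adl, ...),
// not just HDFS; returns None for identifier types we don't understand.
def issueDateOf(token: Token[_ <: TokenIdentifier]): Option[Long] =
  token.decodeIdentifier() match {
    case id: AbstractDelegationTokenIdentifier => Some(id.getIssueDate)
    case _ => None
  }
```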
      
      ## How was this patch tested?
      
Manually verified in a secure cluster.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #16432 from jerryshao/SPARK-19021.
      4239a108
• [SPARK-19149][SQL] Unify two sets of statistics in LogicalPlan · a6155135
      wangzhenhua authored
      ## What changes were proposed in this pull request?
      
Currently we have two sets of statistics in LogicalPlan: simple stats and stats estimated by CBO. The computing logic and naming are quite confusing, so we need to unify these two sets of stats.
      
      ## How was this patch tested?
      
      Just modify existing tests.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #16529 from wzhfy/unifyStats.
      a6155135
  5. Jan 10, 2017
• [SPARK-19157][SQL] should be able to change spark.sql.runSQLOnFiles at runtime · 3b19c74e
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
The analyzer rule that supports querying files directly is added to `Analyzer.extendedResolutionRules` when the SparkSession is created, according to the `spark.sql.runSQLOnFiles` flag. If the flag is off when we create the `SparkSession`, the rule is not added and we cannot query files directly, even if we turn on the flag later.
      
      This PR fixes this bug by always adding that rule to `Analyzer.extendedResolutionRules`.
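
A hedged illustration of the behaviour this fixes (the path is a placeholder):

```scala
// Enabling the flag at runtime should now make direct file queries work,
// even if the flag was off when the SparkSession was created.
spark.conf.set("spark.sql.runSQLOnFiles", "true")
spark.sql("SELECT * FROM parquet.`/path/to/data`").show()
```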
      
      ## How was this patch tested?
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16531 from cloud-fan/sql-on-files.
      3b19c74e
• [SPARK-19140][SS] Allow update mode for non-aggregation streaming queries · bc6c56e9
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
This PR allows update mode for non-aggregation streaming queries. It behaves the same as append mode if a query has no aggregations.
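
A hedged example of the case this enables; the socket source, host, and port are placeholders:

```scala
// An aggregation-free streaming query run in update output mode,
// which behaves like append mode since there is no aggregation state to update.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val query = lines.writeStream
  .outputMode("update")   // previously rejected for queries without aggregations
  .format("console")
  .start()
```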
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16520 from zsxwing/update-without-agg.
      bc6c56e9
• [SPARK-18997][CORE] Recommended upgrade libthrift to 0.9.3 · 856bae6a
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Updates to libthrift 0.9.3 to address a CVE.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16530 from srowen/SPARK-18997.
      856bae6a
• [SPARK-19133][SPARKR][ML] fix glm for Gamma, clarify glm family supported · 9bc3507e
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
The R `family` argument accepts a longer list of families than what Spark supports.
      
      ## How was this patch tested?
      
      manual
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16511 from felixcheung/rdocglmfamily.
      9bc3507e
• [SPARK-19137][SQL] Fix `withSQLConf` to reset `OptionalConfigEntry` correctly · d5b1dc93
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
`DataStreamReaderWriterSuite` creates test files in the source folder like the following. Interestingly, the root cause is that `withSQLConf` fails to reset an `OptionalConfigEntry` correctly; in other words, it resets the config to `Some(undefined)`.
      
      ```bash
      $ git status
      Untracked files:
        (use "git add <file>..." to include in what will be committed)
      
              sql/core/%253Cundefined%253E/
              sql/core/%3Cundefined%3E/
      ```
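
A minimal sketch of the reset pattern the fix needs, assuming the public `spark.conf` API (the real `withSQLConf` works on `SQLConf` internally):

```scala
// Save the previous value, set the override for the test body, then restore:
// an optional config that was unset before must be unset again, not written
// back as the literal "<undefined>" default.
val key = "spark.sql.streaming.checkpointLocation"
val previous = spark.conf.getOption(key)
spark.conf.set(key, "/tmp/checkpoints")
try {
  // ... test body ...
} finally {
  previous match {
    case Some(old) => spark.conf.set(key, old)
    case None      => spark.conf.unset(key)
  }
}
```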
      
      ## How was this patch tested?
      
      Manual.
      ```
      build/sbt "project sql" test
      git status
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16522 from dongjoon-hyun/SPARK-19137.
      d5b1dc93
• [SPARK-19113][SS][TESTS] Set UncaughtExceptionHandler in onQueryStarted to... · 3ef183a9
      Shixiong Zhu authored
      [SPARK-19113][SS][TESTS] Set UncaughtExceptionHandler in onQueryStarted to ensure catching fatal errors during query initialization
      
      ## What changes were proposed in this pull request?
      
StreamTest currently sets the `UncaughtExceptionHandler` after starting the query, so it may not be able to catch fatal errors during query initialization. This PR uses the `onQueryStarted` callback to fix it.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16492 from zsxwing/SPARK-19113.
      3ef183a9
• [SPARK-18857][SQL] Don't use `Iterator.duplicate` for `incrementalCollect` in Thrift Server · a2c6adcc
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
To support `FETCH_FIRST`, SPARK-16563 used Scala's `Iterator.duplicate`. However, `Iterator.duplicate` uses a **queue to buffer all items between both iterators**, which causes GC pressure and hangs for queries with a large number of rows. We should not use it, especially for `spark.sql.thriftServer.incrementalCollect`.
      
      https://github.com/scala/scala/blob/2.12.x/src/library/scala/collection/Iterator.scala#L1262-L1300
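
A small hedged illustration of the buffering behaviour:

```scala
// Draining one copy of a duplicated iterator pushes every element into the shared
// gap queue until the other copy consumes it, so a large result set is held in memory.
val (first, second) = Iterator.range(0, 10000000).duplicate
println(first.size)   // consumes `first`; all 10M elements are now queued for `second`
println(second.size)
```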
      
      ## How was this patch tested?
      
      Pass the existing tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16440 from dongjoon-hyun/SPARK-18857.
      a2c6adcc
• [SPARK-19117][TESTS] Skip the tests using script transformation on Windows · 2cfd41ac
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
This PR proposes to skip the script transformation tests that fail on Windows due to the hard-coded bash location.
      
      ```
      SQLQuerySuite:
       - script *** FAILED *** (553 milliseconds)
         org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 56.0 failed 1 times, most recent failure: Lost task 0.0 in stage 56.0 (TID 54, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      
       - Star Expansion - script transform *** FAILED *** (2 seconds, 375 milliseconds)
         org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 389.0 failed 1 times, most recent failure: Lost task 0.0 in stage 389.0 (TID 725, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      
       - test script transform for stdout *** FAILED *** (2 seconds, 813 milliseconds)
         org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 391.0 failed 1 times, most recent failure: Lost task 0.0 in stage 391.0 (TID 726, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      
       - test script transform for stderr *** FAILED *** (2 seconds, 407 milliseconds)
         org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 393.0 failed 1 times, most recent failure: Lost task 0.0 in stage 393.0 (TID 727, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      
       - test script transform data type *** FAILED *** (171 milliseconds)
         org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 395.0 (TID 728, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      ```
      
      ```
      HiveQuerySuite:
       - transform *** FAILED *** (359 milliseconds)
         Failed to execute query using catalyst:
         Error: Job aborted due to stage failure: Task 0 in stage 1347.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1347.0 (TID 2395, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      
       - schema-less transform *** FAILED *** (344 milliseconds)
         Failed to execute query using catalyst:
         Error: Job aborted due to stage failure: Task 0 in stage 1348.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1348.0 (TID 2396, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      
       - transform with custom field delimiter *** FAILED *** (296 milliseconds)
         Failed to execute query using catalyst:
         Error: Job aborted due to stage failure: Task 0 in stage 1349.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1349.0 (TID 2397, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      
       - transform with custom field delimiter2 *** FAILED *** (297 milliseconds)
         Failed to execute query using catalyst:
         Error: Job aborted due to stage failure: Task 0 in stage 1350.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1350.0 (TID 2398, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      
       - transform with custom field delimiter3 *** FAILED *** (312 milliseconds)
         Failed to execute query using catalyst:
         Error: Job aborted due to stage failure: Task 0 in stage 1351.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1351.0 (TID 2399, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      
       - transform with SerDe2 *** FAILED *** (437 milliseconds)
         org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1355.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1355.0 (TID 2403, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      ```
      
      ```
      LogicalPlanToSQLSuite:
       - script transformation - schemaless *** FAILED *** (78 milliseconds)
         ...
         Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1968.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1968.0 (TID 3932, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
        - script transformation - alias list *** FAILED *** (94 milliseconds)
         ...
         Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1969.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1969.0 (TID 3933, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      
       - script transformation - alias list with type *** FAILED *** (93 milliseconds)
         ...
         Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1970.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1970.0 (TID 3934, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      
       - script transformation - row format delimited clause with only one format property *** FAILED *** (78 milliseconds)
         ...
         Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1971.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1971.0 (TID 3935, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      
       - script transformation - row format delimited clause with multiple format properties *** FAILED *** (94 milliseconds)
         ...
         Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1972.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1972.0 (TID 3936, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      
       - script transformation - row format serde clauses with SERDEPROPERTIES *** FAILED *** (78 milliseconds)
         ...
         Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1973.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1973.0 (TID 3937, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      
       - script transformation - row format serde clauses without SERDEPROPERTIES *** FAILED *** (78 milliseconds)
         ...
         Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1974.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1974.0 (TID 3938, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      ```
      
      ```
      ScriptTransformationSuite:
       - cat without SerDe *** FAILED *** (156 milliseconds)
         ...
         Caused by: java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      
       - cat with LazySimpleSerDe *** FAILED *** (63 milliseconds)
          ...
          org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2383.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2383.0 (TID 4819, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      
       - script transformation should not swallow errors from upstream operators (no serde) *** FAILED *** (78 milliseconds)
          ...
          org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2384.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2384.0 (TID 4820, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      
       - script transformation should not swallow errors from upstream operators (with serde) *** FAILED *** (47 milliseconds)
          ...
          org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2385.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2385.0 (TID 4821, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      
       - SPARK-14400 script transformation should fail for bad script command *** FAILED *** (47 milliseconds)
         "Job aborted due to stage failure: Task 0 in stage 2386.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2386.0 (TID 4822, localhost, executor driver): java.io.IOException: Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
      ```
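
A minimal sketch of the skipping approach, assuming a plain filesystem check (the actual patch may use a shared test utility instead):

```scala
import org.scalatest.FunSuite

class ScriptTransformSuite extends FunSuite {
  test("script transform") {
    // Cancel (rather than fail) the test when /bin/bash is unavailable, as on Windows.
    assume(new java.io.File("/bin/bash").canExecute, "/bin/bash is required for this test")
    // ... original test body ...
  }
}
```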
      
      ## How was this patch tested?
      
      AppVeyor as below:
      
      ```
      SQLQuerySuite:
        - script !!! CANCELED !!! (63 milliseconds)
        - Star Expansion - script transform !!! CANCELED !!! (0 milliseconds)
        - test script transform for stdout !!! CANCELED !!! (0 milliseconds)
        - test script transform for stderr !!! CANCELED !!! (0 milliseconds)
        - test script transform data type !!! CANCELED !!! (0 milliseconds)
      ```
      
      ```
      HiveQuerySuite:
        - transform !!! CANCELED !!! (31 milliseconds)
        - schema-less transform !!! CANCELED !!! (0 milliseconds)
        - transform with custom field delimiter !!! CANCELED !!! (0 milliseconds)
        - transform with custom field delimiter2 !!! CANCELED !!! (0 milliseconds)
        - transform with custom field delimiter3 !!! CANCELED !!! (0 milliseconds)
        - transform with SerDe2 !!! CANCELED !!! (0 milliseconds)
      ```
      
      ```
      LogicalPlanToSQLSuite:
        - script transformation - schemaless !!! CANCELED !!! (78 milliseconds)
        - script transformation - alias list !!! CANCELED !!! (0 milliseconds)
        - script transformation - alias list with type !!! CANCELED !!! (0 milliseconds)
        - script transformation - row format delimited clause with only one format property !!! CANCELED !!! (15 milliseconds)
        - script transformation - row format delimited clause with multiple format properties !!! CANCELED !!! (0 milliseconds)
        - script transformation - row format serde clauses with SERDEPROPERTIES !!! CANCELED !!! (0 milliseconds)
        - script transformation - row format serde clauses without SERDEPROPERTIES !!! CANCELED !!! (0 milliseconds)
      ```
      
      ```
      ScriptTransformationSuite:
        - cat without SerDe !!! CANCELED !!! (62 milliseconds)
        - cat with LazySimpleSerDe !!! CANCELED !!! (0 milliseconds)
        - script transformation should not swallow errors from upstream operators (no serde) !!! CANCELED !!! (0 milliseconds)
        - script transformation should not swallow errors from upstream operators (with serde) !!! CANCELED !!! (0 milliseconds)
        - SPARK-14400 script transformation should fail for bad script command !!! CANCELED !!! (0 milliseconds)
      ```
      
      Jenkins tests
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16501 from HyukjinKwon/windows-bash.
      2cfd41ac
• [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all identified tests failed due... · 4e27578f
      hyukjinkwon authored
      [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all identified tests failed due to path and resource-not-closed problems on Windows
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix all the test failures identified by testing with AppVeyor.
      
      **Scala - aborted tests**
      
      ```
      WindowQuerySuite:
        Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.execution.WindowQuerySuite *** ABORTED *** (156 milliseconds)
         org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: C:projectssparksqlhive   argetscala-2.11   est-classesdatafilespart_tiny.txt;
      
      OrcSourceSuite:
       Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.orc.OrcSourceSuite *** ABORTED *** (62 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      ParquetMetastoreSuite:
       Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.ParquetMetastoreSuite *** ABORTED *** (4 seconds, 703 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      ParquetSourceSuite:
       Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.ParquetSourceSuite *** ABORTED *** (3 seconds, 907 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark  arget mpspark-581a6575-454f-4f21-a516-a07f95266143;
      
      KafkaRDDSuite:
       Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.KafkaRDDSuite *** ABORTED *** (5 seconds, 212 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-4722304d-213e-4296-b556-951df1a46807
      
      DirectKafkaStreamSuite:
       Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite *** ABORTED *** (7 seconds, 127 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-d0d3eba7-4215-4e10-b40e-bb797e89338e
         at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
      
      ReliableKafkaStreamSuite
       Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.ReliableKafkaStreamSuite *** ABORTED *** (5 seconds, 498 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-d33e45a0-287e-4bed-acae-ca809a89d888
      
      KafkaStreamSuite:
       Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.KafkaStreamSuite *** ABORTED *** (2 seconds, 892 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-59c9d169-5a56-4519-9ef0-cefdbd3f2e6c
      
      KafkaClusterSuite:
       Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.KafkaClusterSuite *** ABORTED *** (1 second, 690 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-3ef402b0-8689-4a60-85ae-e41e274f179d
      
      DirectKafkaStreamSuite:
       Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite *** ABORTED *** (59 seconds, 626 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-426107da-68cf-4d94-b0d6-1f428f1c53f6
      
      KafkaRDDSuite:
      Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka010.KafkaRDDSuite *** ABORTED *** (2 minutes, 6 seconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-b9ce7929-5dae-46ab-a0c4-9ef6f58fbc2
      ```
      
      **Java - failed tests**
      
      ```
      Test org.apache.spark.streaming.kafka.JavaKafkaRDDSuite.testKafkaRDD failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-1cee32f4-4390-4321-82c9-e8616b3f0fb0, took 9.61 sec
      
      Test org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-f42695dd-242e-4b07-847c-f299b8e4676e, took 11.797 sec
      
      Test org.apache.spark.streaming.kafka.JavaDirectKafkaStreamSuite.testKafkaStream failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-85c0d062-78cf-459c-a2dd-7973572101ce, took 1.581 sec
      
      Test org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite.testKafkaRDD failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-49eb6b5c-8366-47a6-83f2-80c443c48280, took 17.895 sec
      
      org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite.testKafkaStream failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-898cf826-d636-4b1c-a61a-c12a364c02e7, took 8.858 sec
      ```
      
      **Scala - failed tests**
      
      ```
      PartitionProviderCompatibilitySuite:
       - insert overwrite partition of new datasource table overwrites just partition *** FAILED *** (828 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-bb6337b9-4f99-45ab-ad2c-a787ab965c09
      
       - SPARK-18635 special chars in partition values - partition management true *** FAILED *** (5 seconds, 360 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - SPARK-18635 special chars in partition values - partition management false *** FAILED *** (141 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      ```
      
      ```
      UtilsSuite:
       - reading offset bytes of a file (compressed) *** FAILED *** (0 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-ecb2b7d5-db8b-43a7-b268-1bf242b5a491
      
       - reading offset bytes across multiple files (compressed) *** FAILED *** (0 milliseconds)
         java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-25cc47a8-1faa-4da5-8862-cf174df63ce0
      ```
      
      ```
      StatisticsSuite:
       - MetastoreRelations fallback to HDFS for size estimation *** FAILED *** (110 milliseconds)
         org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'csv_table' not found in database 'default';
      ```
      
      ```
      SQLQuerySuite:
       - permanent UDTF *** FAILED *** (125 milliseconds)
         org.apache.spark.sql.AnalysisException: Undefined function: 'udtf_count_temp'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 24
      
       - describe functions - user defined functions *** FAILED *** (125 milliseconds)
         org.apache.spark.sql.AnalysisException: Undefined function: 'udtf_count'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
      
       - CTAS without serde with location *** FAILED *** (16 milliseconds)
         java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:projectsspark%09arget%09mpspark-ed673d73-edfc-404e-829e-2e2b9725d94e/c1
      
       - derived from Hive query file: drop_database_removes_partition_dirs.q *** FAILED *** (47 milliseconds)
         java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:projectsspark%09arget%09mpspark-d2ddf08e-699e-45be-9ebd-3dfe619680fe/drop_database_removes_partition_dirs_table
      
       - derived from Hive query file: drop_table_removes_partition_dirs.q *** FAILED *** (0 milliseconds)
         java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:projectsspark%09arget%09mpspark-d2ddf08e-699e-45be-9ebd-3dfe619680fe/drop_table_removes_partition_dirs_table2
      
       - SPARK-17796 Support wildcard character in filename for LOAD DATA LOCAL INPATH *** FAILED *** (109 milliseconds)
         java.nio.file.InvalidPathException: Illegal char <:> at index 2: /C:/projects/spark/sql/hive/projectsspark	arget	mpspark-1a122f8c-dfb3-46c4-bab1-f30764baee0e/*part-r*
      ```
      
      ```
      HiveDDLSuite:
       - drop external tables in default database *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - add/drop partitions - external table *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - create/drop database - location without pre-created directory *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - create/drop database - location with pre-created directory *** FAILED *** (32 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - drop database containing tables - CASCADE *** FAILED *** (94 milliseconds)
         CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675)
      
       - drop an empty database - CASCADE *** FAILED *** (63 milliseconds)
         CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675)
      
       - drop database containing tables - RESTRICT *** FAILED *** (47 milliseconds)
         CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675)
      
       - drop an empty database - RESTRICT *** FAILED *** (47 milliseconds)
         CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675)
      
       - CREATE TABLE LIKE an external data source table *** FAILED *** (140 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-c5eba16d-07ae-4186-95bb-21c5811cf888;
      
       - CREATE TABLE LIKE an external Hive serde table *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - desc table for data source table - no user-defined schema *** FAILED *** (125 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-e8bf5bf5-721a-4cbe-9d6	at scala.collection.immutable.List.foreach(List.scala:381)d-5543a8301c1d;
      ```
      
      ```
      MetastoreDataSourcesSuite
       - CTAS: persisted bucketed data source table *** FAILED *** (16 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
      ```
      
      ```
      ShowCreateTableSuite:
       - simple external hive table *** FAILED *** (0 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      ```
      
      ```
      PartitionedTablePerfStatsSuite:
       - hive table: partitioned pruned table reports only selected files *** FAILED *** (313 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - datasource table: partitioned pruned table reports only selected files *** FAILED *** (219 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-311f45f8-d064-4023-a4bb-e28235bff64d;
      
       - hive table: lazy partition pruning reads only necessary partition data *** FAILED *** (203 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - datasource table: lazy partition pruning reads only necessary partition data *** FAILED *** (187 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-fde874ca-66bd-4d0b-a40f-a043b65bf957;
      
       - hive table: lazy partition pruning with file status caching enabled *** FAILED *** (188 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - datasource table: lazy partition pruning with file status caching enabled *** FAILED *** (187 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-e6d20183-dd68-4145-acbe-4a509849accd;
      
       - hive table: file status caching respects refresh table and refreshByPath *** FAILED *** (172 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - datasource table: file status caching respects refresh table and refreshByPath *** FAILED *** (203 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-8b2c9651-2adf-4d58-874f-659007e21463;
      
       - hive table: file status cache respects size limit *** FAILED *** (219 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - datasource table: file status cache respects size limit *** FAILED *** (171 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-7835ab57-cb48-4d2c-bb1d-b46d5a4c47e4;
      
       - datasource table: table setup does not scan filesystem *** FAILED *** (266 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-20598d76-c004-42a7-8061-6c56f0eda5e2;
      
       - hive table: table setup does not scan filesystem *** FAILED *** (266 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - hive table: num hive client calls does not scale with partition count *** FAILED *** (2 seconds, 281 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - datasource table: num hive client calls does not scale with partition count *** FAILED *** (2 seconds, 422 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-4cfed321-4d1d-4b48-8d34-5c169afff383;
      
       - hive table: files read and cached when filesource partition management is off *** FAILED *** (234 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - datasource table: all partition data cached in memory when partition management is off *** FAILED *** (203 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-4bcc0398-15c9-4f6a-811e-12d40f3eec12;
      
       - SPARK-18700: table loaded only once even when resolved concurrently *** FAILED *** (1 second, 266 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      ```
      
      ```
      HiveSparkSubmitSuite:
       - temporary Hive UDF: define a UDF and use it *** FAILED *** (2 seconds, 94 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - permanent Hive UDF: define a UDF and use it *** FAILED *** (281 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - permanent Hive UDF: use a already defined permanent function *** FAILED *** (718 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - SPARK-8368: includes jars passed in through --jars *** FAILED *** (3 seconds, 521 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - SPARK-8020: set sql conf in spark conf *** FAILED *** (0 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - SPARK-8489: MissingRequirementError during reflection *** FAILED *** (94 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - SPARK-9757 Persist Parquet relation with decimal column *** FAILED *** (16 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - SPARK-11009 fix wrong result of Window function in cluster mode *** FAILED *** (16 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - SPARK-14244 fix window partition size attribute binding failure *** FAILED *** (78 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - set spark.sql.warehouse.dir *** FAILED *** (16 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - set hive.metastore.warehouse.dir *** FAILED *** (15 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - SPARK-16901: set javax.jdo.option.ConnectionURL *** FAILED *** (16 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
       - SPARK-18360: default table path of tables in default database should depend on the location of default database *** FAILED *** (15 milliseconds)
         java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      ```
      
      ```
      UtilsSuite:
       - resolveURIs with multiple paths *** FAILED *** (0 milliseconds)
         ".../jar3,file:/C:/pi.py[%23]py.pi,file:/C:/path%..." did not equal ".../jar3,file:/C:/pi.py[#]py.pi,file:/C:/path%..." (UtilsSuite.scala:468)
      ```
      
      ```
      CheckpointSuite:
       - recovery with file input stream *** FAILED *** (10 seconds, 205 milliseconds)
         The code passed to eventually never returned normally. Attempted 660 times over 10.014272499999999 seconds. Last failure message: Unexpected internal error near index 1
         \
          ^. (CheckpointSuite.scala:680)
      ```
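
A hedged sketch of the two recurring fix patterns behind many of these failures; the directory and file names below are placeholders:

```scala
import java.io.{File, PrintWriter}

// 1) Build "file:" URIs from java.io.File instead of concatenating strings,
//    so Windows backslashes and drive letters are handled correctly.
// 2) Close resources explicitly; open handles keep files locked on Windows,
//    which is one reason the temp-directory deletions above fail.
val dir  = new File(System.getProperty("java.io.tmpdir"), "spark-windows-demo")
dir.mkdirs()
val data = new File(dir, "data.txt")

val writer = new PrintWriter(data)
try writer.println("hello") finally writer.close()

val uri = data.toURI.toString  // e.g. file:/C:/Users/.../data.txt on Windows
println(uri)

data.delete(); dir.delete()
```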
      
      ## How was this patch tested?
      
      Manually via AppVeyor as below:
      
      **Scala - aborted tests**
      
      ```
      WindowQuerySuite - all passed
      OrcSourceSuite:
      - SPARK-18220: read Hive orc table with varchar column *** FAILED *** (4 seconds, 417 milliseconds)
        org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
      ParquetMetastoreSuite - all passed
      ParquetSourceSuite - all passed
      KafkaRDDSuite - all passed
      DirectKafkaStreamSuite - all passed
      ReliableKafkaStreamSuite - all passed
      KafkaStreamSuite - all passed
      KafkaClusterSuite - all passed
      DirectKafkaStreamSuite - all passed
      KafkaRDDSuite - all passed
      ```
      
      **Java - failed tests**
      
      ```
      org.apache.spark.streaming.kafka.JavaKafkaRDDSuite - all passed
      org.apache.spark.streaming.kafka.JavaDirectKafkaStreamSuite - all passed
      org.apache.spark.streaming.kafka.JavaKafkaStreamSuite - all passed
      org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite - all passed
      org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite - all passed
      ```
      
      **Scala - failed tests**
      
      ```
      PartitionProviderCompatibilitySuite:
      - insert overwrite partition of new datasource table overwrites just partition (1 second, 953 milliseconds)
      - SPARK-18635 special chars in partition values - partition management true (6 seconds, 31 milliseconds)
      - SPARK-18635 special chars in partition values - partition management false (4 seconds, 578 milliseconds)
      ```
      
      ```
      UtilsSuite:
      - reading offset bytes of a file (compressed) (203 milliseconds)
      - reading offset bytes across multiple files (compressed) (0 milliseconds)
      ```
      
      ```
      StatisticsSuite:
      - MetastoreRelations fallback to HDFS for size estimation (94 milliseconds)
      ```
      
      ```
      SQLQuerySuite:
       - permanent UDTF (407 milliseconds)
       - describe functions - user defined functions (441 milliseconds)
       - CTAS without serde with location (2 seconds, 831 milliseconds)
       - derived from Hive query file: drop_database_removes_partition_dirs.q (734 milliseconds)
       - derived from Hive query file: drop_table_removes_partition_dirs.q (563 milliseconds)
       - SPARK-17796 Support wildcard character in filename for LOAD DATA LOCAL INPATH (453 milliseconds)
      ```
      
      ```
      HiveDDLSuite:
       - drop external tables in default database (3 seconds, 5 milliseconds)
       - add/drop partitions - external table (2 seconds, 750 milliseconds)
       - create/drop database - location without pre-created directory (500 milliseconds)
       - create/drop database - location with pre-created directory (407 milliseconds)
       - drop database containing tables - CASCADE (453 milliseconds)
       - drop an empty database - CASCADE (375 milliseconds)
       - drop database containing tables - RESTRICT (328 milliseconds)
       - drop an empty database - RESTRICT (391 milliseconds)
       - CREATE TABLE LIKE an external data source table (953 milliseconds)
       - CREATE TABLE LIKE an external Hive serde table (3 seconds, 782 milliseconds)
       - desc table for data source table - no user-defined schema (1 second, 150 milliseconds)
      ```
      
      ```
      MetastoreDataSourcesSuite
       - CTAS: persisted bucketed data source table (875 milliseconds)
      ```
      
      ```
      ShowCreateTableSuite:
       - simple external hive table (78 milliseconds)
      ```
      
      ```
      PartitionedTablePerfStatsSuite:
       - hive table: partitioned pruned table reports only selected files (1 second, 109 milliseconds)
       - datasource table: partitioned pruned table reports only selected files (860 milliseconds)
       - hive table: lazy partition pruning reads only necessary partition data (859 milliseconds)
       - datasource table: lazy partition pruning reads only necessary partition data (1 second, 219 milliseconds)
       - hive table: lazy partition pruning with file status caching enabled (875 milliseconds)
       - datasource table: lazy partition pruning with file status caching enabled (890 milliseconds)
       - hive table: file status caching respects refresh table and refreshByPath (922 milliseconds)
       - datasource table: file status caching respects refresh table and refreshByPath (640 milliseconds)
       - hive table: file status cache respects size limit (469 milliseconds)
       - datasource table: file status cache respects size limit (453 milliseconds)
       - datasource table: table setup does not scan filesystem (328 milliseconds)
       - hive table: table setup does not scan filesystem (313 milliseconds)
       - hive table: num hive client calls does not scale with partition count (5 seconds, 431 milliseconds)
       - datasource table: num hive client calls does not scale with partition count (4 seconds, 79 milliseconds)
       - hive table: files read and cached when filesource partition management is off (656 milliseconds)
       - datasource table: all partition data cached in memory when partition management is off (484 milliseconds)
       - SPARK-18700: table loaded only once even when resolved concurrently (2 seconds, 578 milliseconds)
      ```
      
      ```
      HiveSparkSubmitSuite:
       - temporary Hive UDF: define a UDF and use it (1 second, 745 milliseconds)
       - permanent Hive UDF: define a UDF and use it (406 milliseconds)
       - permanent Hive UDF: use a already defined permanent function (375 milliseconds)
       - SPARK-8368: includes jars passed in through --jars (391 milliseconds)
       - SPARK-8020: set sql conf in spark conf (156 milliseconds)
       - SPARK-8489: MissingRequirementError during reflection (187 milliseconds)
       - SPARK-9757 Persist Parquet relation with decimal column (157 milliseconds)
       - SPARK-11009 fix wrong result of Window function in cluster mode (156 milliseconds)
       - SPARK-14244 fix window partition size attribute binding failure (156 milliseconds)
       - set spark.sql.warehouse.dir (172 milliseconds)
       - set hive.metastore.warehouse.dir (156 milliseconds)
       - SPARK-16901: set javax.jdo.option.ConnectionURL (157 milliseconds)
       - SPARK-18360: default table path of tables in default database should depend on the location of default database (172 milliseconds)
      ```
      
      ```
      UtilsSuite:
       - resolveURIs with multiple paths (0 milliseconds)
      ```
      
      ```
      CheckpointSuite:
       - recovery with file input stream (4 seconds, 452 milliseconds)
      ```
      
      Note: after resolving the aborted tests, one remaining test failure was identified, as below:
      
      ```
      OrcSourceSuite:
      - SPARK-18220: read Hive orc table with varchar column *** FAILED *** (4 seconds, 417 milliseconds)
        org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
        at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
      ```
      
      This failure does not appear to be caused by the problem addressed here, so this PR does not fix it.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16451 from HyukjinKwon/all-path-resource-fixes.
      4e27578f
    • Peng, Meng's avatar
      [SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor change · 32286ba6
      Peng, Meng authored
      ## What changes were proposed in this pull request?
      Add an FDR test case to ml/feature/ChiSqSelectorSuite.
      Improve some comments in the code.
      This is a follow-up PR for #15212.
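      
      For context, here is a minimal usage sketch of the `fdr` selector type that the new test exercises. This is not code from the PR; the session setup and data are illustrative, and the `setSelectorType("fdr")`/`setFdr` setters are assumed to be the ones introduced by SPARK-17645.
      
      ```scala
      import org.apache.spark.ml.feature.ChiSqSelector
      import org.apache.spark.ml.linalg.Vectors
      import org.apache.spark.sql.SparkSession
      
      object ChiSqFdrSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("chisq-fdr-sketch").master("local[*]").getOrCreate()
          import spark.implicits._
      
          // Tiny dataset; feature values are illustrative only.
          val df = Seq(
            (Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
            (Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
            (Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
          ).toDF("features", "label")
      
          // Keep features whose chi-squared p-values survive the Benjamini-Hochberg
          // false discovery rate procedure at level 0.05.
          val selector = new ChiSqSelector()
            .setSelectorType("fdr")
            .setFdr(0.05)
            .setFeaturesCol("features")
            .setLabelCol("label")
            .setOutputCol("selectedFeatures")
      
          selector.fit(df).transform(df).show(truncate = false)
          spark.stop()
        }
      }
      ```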
      
      ## How was this patch tested?
      Unit tests.
      
      Author: Peng, Meng <peng.meng@intel.com>
      
      Closes #16434 from mpjlu/fdr_fwe_update.
      32286ba6
    • Liwei Lin's avatar
      [SPARK-16845][SQL] `GeneratedClass$SpecificOrdering` grows beyond 64 KB · acfc5f35
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      Prior to this patch, the generated `compare(...)` for `GeneratedClass$SpecificOrdering` looks like the code below, leading to Janino exceptions because the generated method grows beyond 64 KB.
      
      ``` scala
      /* 005 */ class SpecificOrdering extends o.a.s.sql.catalyst.expressions.codegen.BaseOrdering {
      /* ..... */   ...
      /* 10969 */   private int compare(InternalRow a, InternalRow b) {
      /* 10970 */     InternalRow i = null;  // Holds current row being evaluated.
      /* 10971 */
      /* 1.... */     code for comparing field0
      /* 1.... */     code for comparing field1
      /* 1.... */     ...
      /* 1.... */     code for comparing field449
      /* 15012 */
      /* 15013 */     return 0;
      /* 15014 */   }
      /* 15015 */ }
      ```
      
      This patch breaks `compare(...)` into smaller `compare_xxx(...)` methods when necessary; the generated `compare(...)` then looks like:
      
      ``` scala
      /* 001 */ public SpecificOrdering generate(Object[] references) {
      /* 002 */   return new SpecificOrdering(references);
      /* 003 */ }
      /* 004 */
      /* 005 */ class SpecificOrdering extends o.a.s.sql.catalyst.expressions.codegen.BaseOrdering {
      /* 006 */
      /* 007 */     ...
      /* 1.... */
      /* 11290 */   private int compare_0(InternalRow a, InternalRow b) {
      /* 11291 */     InternalRow i = null;  // Holds current row being evaluated.
      /* 11292 */
      /* 11293 */     i = a;
      /* 11294 */     boolean isNullA;
      /* 11295 */     UTF8String primitiveA;
      /* 11296 */     {
      /* 11297 */
      /* 11298 */       Object obj = ((Expression) references[0]).eval(null);
      /* 11299 */       UTF8String value = (UTF8String) obj;
      /* 11300 */       isNullA = false;
      /* 11301 */       primitiveA = value;
      /* 11302 */     }
      /* 11303 */     i = b;
      /* 11304 */     boolean isNullB;
      /* 11305 */     UTF8String primitiveB;
      /* 11306 */     {
      /* 11307 */
      /* 11308 */       Object obj = ((Expression) references[0]).eval(null);
      /* 11309 */       UTF8String value = (UTF8String) obj;
      /* 11310 */       isNullB = false;
      /* 11311 */       primitiveB = value;
      /* 11312 */     }
      /* 11313 */     if (isNullA && isNullB) {
      /* 11314 */       // Nothing
      /* 11315 */     } else if (isNullA) {
      /* 11316 */       return -1;
      /* 11317 */     } else if (isNullB) {
      /* 11318 */       return 1;
      /* 11319 */     } else {
      /* 11320 */       int comp = primitiveA.compare(primitiveB);
      /* 11321 */       if (comp != 0) {
      /* 11322 */         return comp;
      /* 11323 */       }
      /* 11324 */     }
      /* 11325 */
      /* 11326 */
      /* 11327 */     i = a;
      /* 11328 */     boolean isNullA1;
      /* 11329 */     UTF8String primitiveA1;
      /* 11330 */     {
      /* 11331 */
      /* 11332 */       Object obj1 = ((Expression) references[1]).eval(null);
      /* 11333 */       UTF8String value1 = (UTF8String) obj1;
      /* 11334 */       isNullA1 = false;
      /* 11335 */       primitiveA1 = value1;
      /* 11336 */     }
      /* 11337 */     i = b;
      /* 11338 */     boolean isNullB1;
      /* 11339 */     UTF8String primitiveB1;
      /* 11340 */     {
      /* 11341 */
      /* 11342 */       Object obj1 = ((Expression) references[1]).eval(null);
      /* 11343 */       UTF8String value1 = (UTF8String) obj1;
      /* 11344 */       isNullB1 = false;
      /* 11345 */       primitiveB1 = value1;
      /* 11346 */     }
      /* 11347 */     if (isNullA1 && isNullB1) {
      /* 11348 */       // Nothing
      /* 11349 */     } else if (isNullA1) {
      /* 11350 */       return -1;
      /* 11351 */     } else if (isNullB1) {
      /* 11352 */       return 1;
      /* 11353 */     } else {
      /* 11354 */       int comp = primitiveA1.compare(primitiveB1);
      /* 11355 */       if (comp != 0) {
      /* 11356 */         return comp;
      /* 11357 */       }
      /* 11358 */     }
      /* 1.... */
      /* 1.... */   ...
      /* 1.... */
      /* 12652 */     return 0;
      /* 12653 */   }
      /* 1.... */
      /* 1.... */   ...
      /* 15387 */
      /* 15388 */   public int compare(InternalRow a, InternalRow b) {
      /* 15389 */
      /* 15390 */     int comp_0 = compare_0(a, b);
      /* 15391 */     if (comp_0 != 0) {
      /* 15392 */       return comp_0;
      /* 15393 */     }
      /* 15394 */
      /* 15395 */     int comp_1 = compare_1(a, b);
      /* 15396 */     if (comp_1 != 0) {
      /* 15397 */       return comp_1;
      /* 15398 */     }
      /* 1.... */
      /* 1.... */     ...
      /* 1.... */
      /* 15450 */     return 0;
      /* 15451 */   }
      /* 15452 */ }
      ```
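      
      The splitting idea, shown as a standalone sketch rather than the actual `splitExpressions()` refactoring (the 16 KB chunk threshold and helper names below are illustrative assumptions): group the per-field comparison snippets into chunks small enough to compile, emit each chunk as a `compare_i(...)` helper, and have the top-level `compare(...)` chain them.
      
      ```scala
      object CompareSplitSketch {
        // Pack per-field comparison snippets into helper methods so that no single
        // generated method grows beyond Janino's 64 KB limit.
        def splitComparisons(fieldComparisons: Seq[String], maxChunkSize: Int = 16 * 1024): String = {
          // Greedily group snippets into chunks that stay under the size threshold.
          val chunks = fieldComparisons.foldLeft(Vector(Vector.empty[String])) { (acc, snippet) =>
            if ((acc.last :+ snippet).map(_.length).sum < maxChunkSize) acc.init :+ (acc.last :+ snippet)
            else acc :+ Vector(snippet)
          }
      
          // Each chunk becomes a private compare_i(...) helper; the snippets inside it
          // return early as soon as a nonzero comparison is found.
          val helpers = chunks.zipWithIndex.map { case (chunk, i) =>
            s"""private int compare_$i(InternalRow a, InternalRow b) {
               |  ${chunk.mkString("\n  ")}
               |  return 0;
               |}""".stripMargin
          }
      
          // The top-level compare(...) simply chains the helpers.
          val calls = chunks.indices.map { i =>
            s"int comp_$i = compare_$i(a, b);\n  if (comp_$i != 0) return comp_$i;"
          }
      
          helpers.mkString("\n\n") +
            s"""
               |
               |public int compare(InternalRow a, InternalRow b) {
               |  ${calls.mkString("\n  ")}
               |  return 0;
               |}""".stripMargin
        }
      }
      ```
      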
      ## How was this patch tested?
      - a newly added test case which
        - would fail prior to this patch
        - would pass with this patch
      - ordering correctness should already be covered by existing tests like those in `OrderingSuite`
      
      ## Acknowledgement
      
      A major part of this PR, the refactoring of `splitExpression()`, was done by ueshin.
      
      Author: Liwei Lin <lwlin7@gmail.com>
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      Author: Takuya Ueshin <ueshin@happy-camper.st>
      
      Closes #15480 from lw-lin/spec-ordering-64k-.
      acfc5f35
    • Wenchen Fan's avatar
      [SPARK-19107][SQL] support creating hive table with DataFrameWriter and Catalog · b0319c2e
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      After unifying the CREATE TABLE syntax in https://github.com/apache/spark/pull/16296, it is now straightforward to support creating Hive tables with `DataFrameWriter` and `Catalog`.
      
      This PR basically just removes the hive provider check in `DataFrameWriter.saveAsTable` and `Catalog.createExternalTable`, and adds tests.
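      
      A minimal usage sketch of what this enables (not from this PR; the session setup, table names, and path are illustrative, and it assumes a Hive-enabled build of Spark):
      
      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.types.{LongType, StructType}
      
      object HiveTableSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("hive-table-sketch")
            .enableHiveSupport()
            .getOrCreate()
      
          val df = spark.range(10).toDF("id")
      
          // With the provider check removed, saveAsTable can target the Hive serde directly.
          df.write.format("hive").saveAsTable("hive_managed_ids")
      
          // Catalog.createExternalTable can likewise use the "hive" source; the path is hypothetical.
          spark.catalog.createExternalTable(
            "hive_external_ids",
            "hive",
            new StructType().add("id", LongType),
            Map("path" -> "/tmp/hive_external_ids"))
      
          spark.stop()
        }
      }
      ```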
      
      ## How was this patch tested?
      
      new tests in `HiveDDLSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16487 from cloud-fan/hive-table.
      b0319c2e
    • hyukjinkwon's avatar
      [SPARK-19134][EXAMPLE] Fix several sql, mllib and status api examples not working · b0e5840d
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      **binary_classification_metrics_example.py**
      
      LibSVM datasource loads `ml.linalg.SparseVector` whereas the example requires it to be `mllib.linalg.SparseVector`. The equivalent Scala example, `BinaryClassificationMetricsExample.scala`, seems fine.
      
      ```
      ./bin/spark-submit examples/src/main/python/mllib/binary_classification_metrics_example.py
      ```
      
      ```
        File ".../spark/examples/src/main/python/mllib/binary_classification_metrics_example.py", line 39, in <lambda>
          .rdd.map(lambda row: LabeledPoint(row[0], row[1]))
        File ".../spark/python/pyspark/mllib/regression.py", line 54, in __init__
          self.features = _convert_to_vector(features)
        File ".../spark/python/pyspark/mllib/linalg/__init__.py", line 80, in _convert_to_vector
          raise TypeError("Cannot convert type %s into Vector" % type(l))
      TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector
      ```
      
      **status_api_demo.py** (this one does not work on Python 3.4.6)
      
      The module is named `queue` in Python 3+, not `Queue`.
      
      ```
      PYSPARK_PYTHON=python3 ./bin/spark-submit examples/src/main/python/status_api_demo.py
      ```
      
      ```
      Traceback (most recent call last):
        File ".../spark/examples/src/main/python/status_api_demo.py", line 22, in <module>
          import Queue
      ImportError: No module named 'Queue'
      ```
      
      **bisecting_k_means_example.py**
      
      `BisectingKMeansModel` does not implement `save` and `load` in Python.
      
      ```bash
      ./bin/spark-submit examples/src/main/python/mllib/bisecting_k_means_example.py
      ```
      
      ```
      Traceback (most recent call last):
        File ".../spark/examples/src/main/python/mllib/bisecting_k_means_example.py", line 46, in <module>
          model.save(sc, path)
      AttributeError: 'BisectingKMeansModel' object has no attribute 'save'
      ```
      
      **elementwise_product_example.py**
      
      It calls `collect` on the transformed vector, which is backed by a numpy array rather than an RDD.
      
      ```bash
      ./bin/spark-submit examples/src/main/python/mllib/elementwise_product_example.py
      ```
      
      ```
      Traceback (most recent call last):
        File ".../spark/examples/src/main/python/mllib/elementwise_product_example.py", line 48, in <module>
          for each in transformedData2.collect():
        File ".../spark/python/pyspark/mllib/linalg/__init__.py", line 478, in __getattr__
          return getattr(self.array, item)
      AttributeError: 'numpy.ndarray' object has no attribute 'collect'
      ```
      
      **The following three examples throw an exception because of a relative path set in `spark.sql.warehouse.dir`; a sketch of a workaround follows the three reports below.**
      
      **hive.py**
      
      ```
      ./bin/spark-submit examples/src/main/python/sql/hive.py
      ```
      
      ```
      Traceback (most recent call last):
        File ".../spark/examples/src/main/python/sql/hive.py", line 47, in <module>
          spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
        File ".../spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 541, in sql
        File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
        File ".../spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
      pyspark.sql.utils.AnalysisException: 'org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:./spark-warehouse);'
      ```
      
      **SparkHiveExample.scala**
      
      ```
      ./bin/run-example sql.hive.SparkHiveExample
      ```
      
      ```
      Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:./spark-warehouse
      	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:498)
      	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:484)
      	at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1668)
      ```
      
      **JavaSparkHiveExample.java**
      
      ```
      ./bin/run-example sql.hive.JavaSparkHiveExample
      ```
      
      ```
      Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:./spark-warehouse
      	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:498)
      	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:484)
      	at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1668)
      ```
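      
      One way to avoid the relative-path URI issue in these examples (a sketch, not the exact change in this PR; the table name below is taken from hive.py and the rest is illustrative) is to resolve `spark.sql.warehouse.dir` to an absolute location before creating the session:
      
      ```scala
      import java.nio.file.Paths
      
      import org.apache.spark.sql.SparkSession
      
      object WarehouseDirSketch {
        def main(args: Array[String]): Unit = {
          // Resolve an absolute warehouse location so Hive never sees a relative URI
          // such as "file:./spark-warehouse".
          val warehouseLocation = Paths.get("spark-warehouse").toAbsolutePath.toString
      
          val spark = SparkSession.builder()
            .appName("warehouse-dir-sketch")
            .config("spark.sql.warehouse.dir", warehouseLocation)
            .enableHiveSupport()
            .getOrCreate()
      
          spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
          spark.stop()
        }
      }
      ```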
      
      ## How was this patch tested?
      
      Manually via
      
      ```
      ./bin/spark-submit examples/src/main/python/mllib/binary_classification_metrics_example.py
      ```
      
      ```
      PYSPARK_PYTHON=python3 ./bin/spark-submit examples/src/main/python/status_api_demo.py
      ```
      
      ```
      ./bin/spark-submit examples/src/main/python/mllib/bisecting_k_means_example.py
      ```
      
      ```
      ./bin/spark-submit examples/src/main/python/mllib/elementwise_product_example.py
      ```
      
      ```
      ./bin/spark-submit examples/src/main/python/sql/hive.py
      ```
      
      ```
      ./bin/run-example sql.hive.JavaSparkHiveExample
      ```
      
      ```
      ./bin/run-example sql.hive.SparkHiveExample
      ```
      
      These were found via
      
      ```bash
      find ./examples/src/main/python -name "*.py" -exec spark-submit {} \;
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16515 from HyukjinKwon/minor-example-fix.
      b0e5840d
  6. Jan 09, 2017
    • Yanbo Liang's avatar
      [SPARK-17847][ML] Reduce shuffled data size of GaussianMixture & copy the... · 3ef6d98a
      Yanbo Liang authored
      [SPARK-17847][ML] Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml
      
      ## What changes were proposed in this pull request?
      
      Copy the `GaussianMixture` implementation from mllib to ml, so that we can add new features to it.
      I left mllib `GaussianMixture` untouched, rather than making it a wrapper around the ml implementation as was done for some other algorithms, for the following reasons:
      - mllib `GaussianMixture` allows k == 1, but ml does not.
      - mllib `GaussianMixture` supports setting an initial model, but ml does not currently. (We will definitely add this feature to ml in the future.)
      
      We could work around these issues and make mllib a wrapper calling into ml, but I would prefer to leave mllib untouched, which keeps ml clean.
      
      Meanwhile, there is a big performance improvement for `GaussianMixture` in this PR. Since the covariance matrix of a multivariate Gaussian distribution is symmetric, we can store only the upper triangular part of the matrix, which greatly reduces the shuffled data size. In my test, this change reduced the shuffled data size by about 50% and accelerated the job execution.
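      
      To illustrate the saving with a standalone sketch (not code from this PR): an n x n symmetric matrix has only n(n + 1)/2 distinct entries, so shuffling the packed upper triangle roughly halves the payload for large n.
      
      ```scala
      object TriangularPackingSketch {
        // Pack a symmetric n x n matrix (row-major, full n*n array) into its upper triangle.
        def packUpper(n: Int, full: Array[Double]): Array[Double] = {
          val packed = Array.ofDim[Double](n * (n + 1) / 2)
          var k = 0
          for (i <- 0 until n; j <- i until n) {
            packed(k) = full(i * n + j)
            k += 1
          }
          packed
        }
      
        // Restore the full symmetric matrix from the packed upper triangle.
        def unpackUpper(n: Int, packed: Array[Double]): Array[Double] = {
          val full = Array.ofDim[Double](n * n)
          var k = 0
          for (i <- 0 until n; j <- i until n) {
            full(i * n + j) = packed(k)
            full(j * n + i) = packed(k)
            k += 1
          }
          full
        }
      }
      ```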
      
      Before this PR:
      ![image](https://cloud.githubusercontent.com/assets/1962026/19641622/4bb017ac-9996-11e6-8ece-83db184b620a.png)
      After this PR:
      ![image](https://cloud.githubusercontent.com/assets/1962026/19641635/629c21fe-9996-11e6-91e9-83ab74ae0126.png)
      ## How was this patch tested?
      
      Existing tests and added new tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15413 from yanboliang/spark-17847.
      3ef6d98a
    • Burak Yavuz's avatar
      [SPARK-18952] Regex strings not properly escaped in codegen for aggregations · faabe69c
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      If I use the function `regexp_extract` and my regex string contains `\`, i.e. the escape character, codegen fails because the `\` character is not properly escaped in the generated code.
      
      Example stack trace:
      ```
      /* 059 */     private int maxSteps = 2;
      /* 060 */     private int numRows = 0;
      /* 061 */     private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add("date_format(window#325.start, yyyy-MM-dd HH:mm)", org.apache.spark.sql.types.DataTypes.StringType)
      /* 062 */     .add("regexp_extract(source#310.description, ([a-zA-Z]+)\[.*, 1)", org.apache.spark.sql.types.DataTypes.StringType);
      /* 063 */     private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add("sum", org.apache.spark.sql.types.DataTypes.LongType);
      /* 064 */     private Object emptyVBase;
      
      ...
      
      org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 62, Column 58: Invalid escape sequence
      	at org.codehaus.janino.Scanner.scanLiteralCharacter(Scanner.java:918)
      	at org.codehaus.janino.Scanner.produce(Scanner.java:604)
      	at org.codehaus.janino.Parser.peekRead(Parser.java:3239)
      	at org.codehaus.janino.Parser.parseArguments(Parser.java:3055)
      	at org.codehaus.janino.Parser.parseSelector(Parser.java:2914)
      	at org.codehaus.janino.Parser.parseUnaryExpression(Parser.java:2617)
      	at org.codehaus.janino.Parser.parseMultiplicativeExpression(Parser.java:2573)
      	at org.codehaus.janino.Parser.parseAdditiveExpression(Parser.java:2552)
      ```
      
      In the code-generated expression, the literal should use `\\` instead of `\`.
      
      A similar problem was solved here: https://github.com/apache/spark/pull/15156.
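      
      A standalone sketch of the escaping idea (not the actual Spark fix; the helper name is illustrative): before a user-supplied string such as a regex is embedded into generated Java source as a double-quoted literal, backslashes and quotes must themselves be escaped so the literal compiles.
      
      ```scala
      object LiteralEscapingSketch {
        // Escape a string so it can sit safely inside a double-quoted Java literal.
        def escapeForJavaLiteral(s: String): String =
          s.flatMap {
            case '\\' => "\\\\"
            case '"'  => "\\\""
            case '\n' => "\\n"
            case '\r' => "\\r"
            case c    => c.toString
          }
      
        def main(args: Array[String]): Unit = {
          val regex = """([a-zA-Z]+)\[.*"""
      
          // Embedding `regex` verbatim would leave a bare \[ inside the generated Java
          // string literal, which Janino rejects with "Invalid escape sequence".
          val generated =
            s""".add("regexp_extract(description, ${escapeForJavaLiteral(regex)}, 1)", StringType)"""
          println(generated)
        }
      }
      ```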
      
      ## How was this patch tested?
      
      Regression test in `DataFrameAggregationSuite`
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #16361 from brkyvz/reg-break.
      faabe69c
    • Zhenhua Wang's avatar
      [SPARK-19020][SQL] Cardinality estimation of aggregate operator · 15c2bd01
      Zhenhua Wang authored
      ## What changes were proposed in this pull request?
      
      Support cardinality estimation of aggregate operator
      
      ## How was this patch tested?
      
      Add test cases
      
      Author: Zhenhua Wang <wzh_zju@163.com>
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #16431 from wzhfy/aggEstimation.
      15c2bd01