  1. May 30, 2017
    • Wenchen Fan's avatar
      [SPARK-20213][SQL] Fix DataFrameWriter operations in SQL UI tab · 10e526e7
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Currently the `DataFrameWriter` operations have several problems:
      
      1. non-file-format data source writing action doesn't show up in the SQL tab in Spark UI
      2. file-format data source writing action shows a scan node in the SQL tab, without saying anything about writing (streaming also has this issue, but it is not fixed in this PR)
      3. Spark SQL CLI actions don't show up in the SQL tab.
      
      This PR fixes all of them by refactoring `ExecuteCommandExec` so that it has children.
      
      Closes https://github.com/apache/spark/pull/17540
      
      ## How was this patch tested?
      
      existing tests.
      
      Also test the UI manually. For a simple command: `Seq(1 -> "a").toDF("i", "j").write.parquet("/tmp/qwe")`
      
      before this PR:
      <img width="266" alt="qq20170523-035840 2x" src="https://cloud.githubusercontent.com/assets/3182036/26326050/24e18ba2-3f6c-11e7-8817-6dd275bf6ac5.png">
      after this PR:
      <img width="287" alt="qq20170523-035708 2x" src="https://cloud.githubusercontent.com/assets/3182036/26326054/2ad7f460-3f6c-11e7-8053-d68325beb28f.png">
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18064 from cloud-fan/execution.
      10e526e7
    • Tathagata Das's avatar
      [SPARK-20883][SPARK-20376][SS] Refactored StateStore APIs and added conf to choose implementation · fa757ee1
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      A bunch of changes to the StateStore APIs and implementation.
      Current state store API has a bunch of problems that causes too many transient objects causing memory pressure.
      
      - `StateStore.get(): Option` forces creation of Some/None objects for every get. Changed this to return the row or null (see the interface sketch after this list).
      - `StateStore.iterator(): (UnsafeRow, UnsafeRow)` forces creation of a new tuple for each record returned. Changed this to return an UnsafeRowTuple which can be reused across records.
      - `StateStore.updates()` requires the implementation to keep track of updates, while this is used minimally (only by Append mode in streaming aggregations). Removed updates() and updated StateStoreSaveExec accordingly.
      - `StateStore.filter(condition)` and `StateStore.remove(condition)` have been merged into a single API `getRange(start, end)`, which allows a state store to do optimized range queries (i.e. avoid full scans). Stateful operators have been updated accordingly.
      - Removed a lot of unnecessary row copies. Previously, each operator copied rows before calling StateStore.put() even if the implementation did not require them to be copied. Whether to copy the row is now left up to the implementation.
      
      Additionally,
      - Added a name to the StateStoreId so that each operator+partition can use multiple state stores (different names)
      - Added a configuration that allows the user to specify which implementation to use.
      - Added new metrics to understand the time taken to update keys, remove keys and commit all changes to the state store. These metrics will be visible on the plan diagram in the SQL tab of the UI.
      - Refactored unit tests such that they can be reused to test any implementation of StateStore.
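      To make the new surface concrete, here is a rough Scala sketch of the reworked API (this is not the actual trait in `sql/core`; method shapes follow the description above, and `UnsafeRowTuple` follows the naming used here):
      
      ```Scala
      import org.apache.spark.sql.catalyst.expressions.UnsafeRow
      
      // Reusable key/value holder: the same instance can be re-populated for every
      // record, avoiding a fresh tuple allocation per record.
      class UnsafeRowTuple(var key: UnsafeRow = null, var value: UnsafeRow = null) {
        def withRows(k: UnsafeRow, v: UnsafeRow): UnsafeRowTuple = { key = k; value = v; this }
      }
      
      trait StateStore {
        def get(key: UnsafeRow): UnsafeRow                // returns null instead of Option when the key is absent
        def put(key: UnsafeRow, value: UnsafeRow): Unit   // copying the row is left to the implementation
        def getRange(start: Option[UnsafeRow],
                     end: Option[UnsafeRow]): Iterator[UnsafeRowTuple]  // replaces filter(condition)/remove(condition)
        def iterator(): Iterator[UnsafeRowTuple]          // may reuse a single UnsafeRowTuple across records
        def commit(): Long
      }
      ```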
      
      ## How was this patch tested?
      Old and new unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #18107 from tdas/SPARK-20376.
      fa757ee1
    • Xiao Li's avatar
      [SPARK-20924][SQL] Unable to call the function registered in the not-current database · 4bb6a53e
      Xiao Li authored
      ### What changes were proposed in this pull request?
      We are unable to call a function registered in a database other than the current one.
      ```Scala
      sql("CREATE DATABASE dAtABaSe1")
      sql(s"CREATE FUNCTION dAtABaSe1.test_avg AS '${classOf[GenericUDAFAverage].getName}'")
      sql("SELECT dAtABaSe1.test_avg(1)")
      ```
      The above code returns an error:
      ```
      Undefined function: 'dAtABaSe1.test_avg'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
      ```
      
      This PR is to fix the above issue.
      ### How was this patch tested?
      Added test cases.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #18146 from gatorsmile/qualifiedFunction.
      4bb6a53e
    • Josh Rosen's avatar
      798a04fd
    • jinxing's avatar
      [SPARK-20333] HashPartitioner should be compatible with num of child RDD's partitions. · de953c21
      jinxing authored
      ## What changes were proposed in this pull request?
      
      Fix the following tests in `DAGSchedulerSuite`:
      - "don't submit stage until its dependencies map outputs are registered (SPARK-5259)"
      - "run trivial shuffle with out-of-band executor failure and retry"
      - "reduce tasks should be placed locally with map output"
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #17634 from jinxing64/SPARK-20333.
      de953c21
    • Arman's avatar
      [SPARK-19236][CORE] Added createOrReplaceGlobalTempView method · 4d57981c
      Arman authored
      ## What changes were proposed in this pull request?
      
      Added the `createOrReplaceGlobalTempView` method for `Dataset`.
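      A minimal usage sketch (names are illustrative; assumes a SparkSession `spark` and `import spark.implicits._` for `toDS()`):
      
      ```Scala
      val ds = Seq(1, 2, 3).toDS()
      ds.createOrReplaceGlobalTempView("numbers")
      
      // Global temp views live in the reserved `global_temp` database and are visible
      // across sessions until the application terminates.
      spark.sql("SELECT * FROM global_temp.numbers").show()
      spark.newSession().sql("SELECT * FROM global_temp.numbers").show()
      ```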
      
      Author: Arman <arman.yazdani.10@gmail.com>
      
      Closes #16598 from arman1371/patch-1.
      4d57981c
    • actuaryzhang's avatar
      [SPARK-20899][PYSPARK] PySpark supports stringIndexerOrderType in RFormula · ff5676b0
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      
      PySpark supports stringIndexerOrderType in RFormula as in #17967.
      
      ## How was this patch tested?
      docstring test
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #18122 from actuaryzhang/PythonRFormula.
      ff5676b0
    • Liang-Chi Hsieh's avatar
      [SPARK-20916][SQL] Improve error message for unaliased subqueries in FROM clause · 35b644bd
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      We changed the parser to reject unaliased subqueries in the FROM clause in SPARK-20690. However, the error message that we now give isn't very helpful:
      
          scala> sql("""SELECT x FROM (SELECT 1 AS x)""")
          org.apache.spark.sql.catalyst.parser.ParseException:
          mismatched input 'FROM' expecting {<EOF>, 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9)
      
      We should modify the parser to throw a more clear error for such queries:
      
          scala> sql("""SELECT x FROM (SELECT 1 AS x)""")
          org.apache.spark.sql.catalyst.parser.ParseException:
          The unaliased subqueries in the FROM clause are not supported.(line 1, pos 14)
      
      ## How was this patch tested?
      
      Modified existing tests to reflect this change.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18141 from viirya/SPARK-20916.
      35b644bd
    • Yuming Wang's avatar
      [MINOR] Fix some indent issues. · 80fb24b8
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
      Fix some indent issues.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #18133 from wangyum/IndentIssues.
      80fb24b8
    • Yuming Wang's avatar
      [SPARK-20909][SQL] Add built-in SQL function - DAYOFWEEK · d797ed0e
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
      Add built-in SQL function `DAYOFWEEK`.
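      A quick usage sketch, assuming the function follows the common 1 = Sunday ... 7 = Saturday convention (as in MySQL/Hive); `spark` is a SparkSession:
      
      ```Scala
      spark.sql("SELECT dayofweek('2017-05-30')").show()  // 2017-05-30 was a Tuesday, so this returns 3
      ```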
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #18134 from wangyum/SPARK-20909.
      d797ed0e
  2. May 29, 2017
  3. May 28, 2017
    • Zhenhua Wang's avatar
      [SPARK-20881][SQL] Clearly document the mechanism to choose between two sources of statistics · 9d0db5a7
      Zhenhua Wang authored
      ## What changes were proposed in this pull request?
      
      Now we have two sources of statistics, i.e. Spark's stats and Hive's stats. Spark's stats are generated by running the "analyze" command in Spark. Once they are available, we respect them over Hive's.
      
      This PR clearly documents, in the related code, the mechanism for choosing between these two sources of stats.
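      For context, Spark's own stats become available once the table has been analyzed, e.g. (table and column names are illustrative):
      
      ```Scala
      // Collect table-level stats; once present, they take precedence over Hive's stats.
      spark.sql("ANALYZE TABLE some_table COMPUTE STATISTICS")
      // Column-level stats (used by the cost-based optimizer) can be collected as well.
      spark.sql("ANALYZE TABLE some_table COMPUTE STATISTICS FOR COLUMNS key, value")
      ```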
      
      ## How was this patch tested?
      
      Not related.
      
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #18105 from wzhfy/cboSwitchStats.
      9d0db5a7
    • Takeshi Yamamuro's avatar
      [SPARK-20841][SQL] Support table column aliases in FROM clause · 24d34281
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR adds parsing rules to support table column aliases in the FROM clause.
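      An illustrative example of the new syntax (view and column names are made up; `spark` is a SparkSession):
      
      ```Scala
      spark.sql("CREATE TEMPORARY VIEW src AS SELECT 1 AS id, 'a' AS name")
      // Column aliases are given after the table alias in the FROM clause.
      spark.sql("SELECT col1, col2 FROM src AS t(col1, col2)").show()
      ```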
      
      ## How was this patch tested?
      Added tests in `PlanParserSuite` and `SQLQueryTestSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #18079 from maropu/SPARK-20841.
      24d34281
  4. May 27, 2017
    • Xiao Li's avatar
      [SPARK-20908][SQL] Cache Manager: Hint should be ignored in plan matching · 06c155c9
      Xiao Li authored
      ### What changes were proposed in this pull request?
      
      In the cache manager, plan matching should ignore hints.
      ```Scala
            val df1 = spark.range(10).join(broadcast(spark.range(10)))
            df1.cache()
            spark.range(10).join(spark.range(10)).explain()
      ```
      The output plan of the above query shows that the second query is not using the cached data of the first query.
      ```
      BroadcastNestedLoopJoin BuildRight, Inner
      :- *Range (0, 10, step=1, splits=2)
      +- BroadcastExchange IdentityBroadcastMode
         +- *Range (0, 10, step=1, splits=2)
      ```
      
      After the fix, the plan becomes
      ```
      InMemoryTableScan [id#20L, id#23L]
         +- InMemoryRelation [id#20L, id#23L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
               +- BroadcastNestedLoopJoin BuildRight, Inner
                  :- *Range (0, 10, step=1, splits=2)
                  +- BroadcastExchange IdentityBroadcastMode
                     +- *Range (0, 10, step=1, splits=2)
      ```
      
      ### How was this patch tested?
      Added a test.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #18131 from gatorsmile/HintCache.
      06c155c9
    • liuxian's avatar
      [SPARK-20876][SQL] If the input parameter is float type for ceil or floor,the... · 3969a807
      liuxian authored
      [SPARK-20876][SQL] If the input parameter is float type for ceil or floor, the result is not what we expected
      
      ## What changes were proposed in this pull request?
      
      spark-sql> SELECT ceil(cast(12345.1233 as float));
      12345
      For this case, the expected result is `12346`.
      spark-sql> SELECT floor(cast(-12345.1233 as float));
      -12345
      For this case, the expected result is `-12346`.
      
      This happens because `inputTypes` in `Ceil` and `Floor` has no FloatType, so a float input is converted to LongType, truncating the fractional part before the function is applied.
      ## How was this patch tested?
      
      After the modification:
      spark-sql> SELECT ceil(cast(12345.1233 as float));
      12346
      spark-sql> SELECT floor(cast(-12345.1233 as float));
      -12346
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #18103 from 10110346/wip-lx-0525-1.
      3969a807
    • Wenchen Fan's avatar
      [SPARK-20897][SQL] cached self-join should not fail · 08ede46b
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In the failing test case, we have a `SortMergeJoinExec` for a self-join, which means there is a `ReusedExchange` node in the query plan. It works fine without caching, but throws an exception in `SortMergeJoinExec.outputPartitioning` if we cache the query.
      
      The root cause is that `ReusedExchange` doesn't propagate the output partitioning of its child, so in `SortMergeJoinExec.outputPartitioning` we create a `PartitioningCollection` with a hash partitioning and an unknown partitioning, and fail.
      
      This bug is mostly harmless, because inserting the `ReusedExchange` is the last step in preparing the physical plan, so we won't call `SortMergeJoinExec.outputPartitioning` again after that.
      
      However, if the DataFrame is cached, its physical plan becomes `InMemoryTableScanExec`, which contains another physical plan representing the cached query; that inner plan has gone through the entire planning phase and may contain a `ReusedExchange`. The planner then calls `InMemoryTableScanExec.outputPartitioning`, which in turn calls `SortMergeJoinExec.outputPartitioning` and triggers this bug.
      
      ## How was this patch tested?
      
      a new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18121 from cloud-fan/bug.
      08ede46b
    • liuzhaokun's avatar
      [SPARK-20875] Spark should print the log when the directory has been deleted · 8faffc41
      liuzhaokun authored
      [https://issues.apache.org/jira/browse/SPARK-20875](https://issues.apache.org/jira/browse/SPARK-20875)
      When the `deleteRecursively` method is invoked, Spark doesn't print any log saying that the path was deleted. For example, the worker only prints "Removing directory" when it begins cleaning `spark.work.dir`, but never logs that the path has actually been deleted. So if anything goes wrong on the Linux side, I can't tell from the worker's log file whether the path was deleted.
      
      Author: liuzhaokun <liu.zhaokun@zte.com.cn>
      
      Closes #18102 from liu-zhaokun/master_log.
      8faffc41
    • Shixiong Zhu's avatar
      [SPARK-20843][CORE] Add a config to set driver terminate timeout · 6c1dbd6f
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Add a `worker` configuration to set how long to wait before forcibly killing the driver.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #18126 from zsxwing/SPARK-20843.
      6c1dbd6f
  5. May 26, 2017
    • Yuming Wang's avatar
      [SPARK-20748][SQL] Add built-in SQL function CH[A]R. · a0f8a072
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      Add built-in SQL function `CH[A]R`:
      For `CHR(bigint|double n)`, returns the ASCII character whose code is the binary equivalent of `n`. If `n` is larger than 256, the result is equivalent to `CHR(n % 256)`.
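      A quick check of the new function (`spark` is a SparkSession):
      
      ```Scala
      spark.sql("SELECT chr(65)").show()   // returns "A"
      spark.sql("SELECT chr(321)").show()  // 321 % 256 = 65, so this also returns "A"
      ```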
      
      ## How was this patch tested?
      unit tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #18019 from wangyum/SPARK-20748.
      a0f8a072
    • Wenchen Fan's avatar
      [SPARK-19659][CORE][FOLLOW-UP] Fetch big blocks to disk when shuffle-read · 1d62f8ac
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR includes some minor improvements to the comments and tests in https://github.com/apache/spark/pull/16989
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18117 from cloud-fan/follow.
      1d62f8ac
    • Yu Peng's avatar
      [SPARK-10643][CORE] Make spark-submit download remote files to local in client mode · 4af37812
      Yu Peng authored
      ## What changes were proposed in this pull request?
      
      This PR makes spark-submit script download remote files to local file system for local/standalone client mode.
      
      ## How was this patch tested?
      
      - Unit tests
      - Manual tests by adding s3a jar and testing against file on s3.
      
      Author: Yu Peng <loneknightpy@gmail.com>
      
      Closes #18078 from loneknightpy/download-jar-in-spark-submit.
      4af37812
    • setjet's avatar
      [SPARK-20873][SQL] Improve the error message for unsupported Column Type · c491e2ed
      setjet authored
      ## What changes were proposed in this pull request?
      Upon encountering an unsupported column type, the column type object was printed rather than its name.
      This change improves the error message by outputting the type's name.
      
      ## How was this patch tested?
      Added a simple  unit test to verify the contents of the raised exception
      
      Author: setjet <rubenljanssen@gmail.com>
      
      Closes #18097 from setjet/spark-20873.
      c491e2ed
    • zero323's avatar
      [SPARK-20694][DOCS][SQL] Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide · ae33abf7
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Add Scala, Python and Java examples for `partitionBy`, `sortBy` and `bucketBy`.
      - Add _Bucketing, Sorting and Partitioning_ section to SQL Programming Guide
      - Remove bucketing from Unsupported Hive Functionalities.
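      For reference, a minimal Scala sketch of the writer calls being documented (`peopleDF`/`usersDF` are assumed DataFrames; table and column names are illustrative):
      
      ```Scala
      // Bucketing and sorting only apply to persistent tables (saveAsTable).
      peopleDF.write
        .partitionBy("favorite_color")
        .bucketBy(42, "name")
        .sortBy("age")
        .saveAsTable("people_partitioned_bucketed")
      
      // partitionBy can also be used with path-based writes.
      usersDF.write.partitionBy("favorite_color").parquet("users_by_color.parquet")
      ```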
      
      ## How was this patch tested?
      
      Manual tests, docs build.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17938 from zero323/DOCS-BUCKETING-AND-PARTITIONING.
      ae33abf7
    • Sital Kedia's avatar
      [SPARK-20014] Optimize mergeSpillsWithFileStream method · 473d7552
      Sital Kedia authored
      ## What changes were proposed in this pull request?
      
      When the individual partition size in a spill is small, the mergeSpillsWithTransferTo method does many small disk IOs, which is really inefficient. One way to improve performance is to use the mergeSpillsWithFileStream method instead, by turning off transferTo and using buffered file read/write to improve IO throughput.
      However, the current implementation of mergeSpillsWithFileStream does not do buffered read/write of the files, and in addition it unnecessarily flushes the output file for each partition.
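      As an illustration of the buffered approach (this is not Spark's actual `UnsafeShuffleWriter` code; names and buffer sizes are illustrative):
      
      ```Scala
      import java.io._
      
      // Merge spill files through buffered streams so each underlying disk IO moves a
      // large chunk rather than a few bytes per record, and flush only once at the end
      // instead of once per partition.
      def mergeSpills(spills: Seq[File], out: File, bufferSize: Int = 1 << 20): Unit = {
        val output = new BufferedOutputStream(new FileOutputStream(out), bufferSize)
        try {
          val buf = new Array[Byte](8192)
          for (spill <- spills) {
            val input = new BufferedInputStream(new FileInputStream(spill), bufferSize)
            try {
              var n = input.read(buf)
              while (n != -1) {
                output.write(buf, 0, n)
                n = input.read(buf)
              }
            } finally {
              input.close()
            }
          }
        } finally {
          output.close()  // single flush/close at the end
        }
      }
      ```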
      
      ## How was this patch tested?
      
      Tested this change by running a job on the cluster and the map stage run time was reduced by around 20%.
      
      Author: Sital Kedia <skedia@fb.com>
      
      Closes #17343 from sitalkedia/upstream_mergeSpillsWithFileStream.
      473d7552
    • Michael Armbrust's avatar
      [SPARK-20844] Remove experimental from Structured Streaming APIs · d935e0a9
      Michael Armbrust authored
      Now that Structured Streaming has been out for several Spark releases and has large production use cases, the `Experimental` label is no longer appropriate. I've left `InterfaceStability.Evolving`, however, as I think we may make a few changes to the pluggable Source & Sink API in Spark 2.3.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #18065 from marmbrus/streamingGA.
      d935e0a9
    • 10129659's avatar
      [SPARK-20835][CORE] It should exit directly when the --total-executor-cores... · 0fd84b05
      10129659 authored
      [SPARK-20835][CORE] It should exit directly when the --total-executor-cores parameter is set to less than 0 when submitting an application
      
      ## What changes were proposed in this pull request?
      In my test, the submitted app kept running without an error when --total-executor-cores was set to less than 0,
      and only gave the warning:
      "2017-05-22 17:19:36,319 WARN org.apache.spark.scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources"
      
      It should exit directly when the --total-executor-cores parameter is set to less than 0 when submitting an application.
      
      ## How was this patch tested?
      Ran the unit tests.
      
      Author: 10129659 <chen.yanshan@zte.com.cn>
      
      Closes #18060 from eatoncys/totalcores.
      0fd84b05
    • Wenchen Fan's avatar
      [SPARK-20887][CORE] support alternative keys in ConfigBuilder · 629f38e1
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      `ConfigBuilder` builds a `ConfigEntry`, which can only read its value from a single key; if we want to rename a config but still keep the old key working, that is hard to do.
      
      This PR introduces `ConfigBuilder.withAlternative` to support reading a config value from alternative keys. It also renames `spark.scheduler.listenerbus.eventqueue.size` to `spark.scheduler.listenerbus.eventqueue.capacity` using this feature, per https://github.com/apache/spark/pull/14269#discussion_r118432313
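      A rough sketch of how such an entry could look (this uses Spark's internal config DSL, so the snippet is assumed to live inside `org.apache.spark.internal.config`; the default shown is illustrative):
      
      ```Scala
      private[spark] val LISTENER_BUS_EVENT_QUEUE_CAPACITY =
        ConfigBuilder("spark.scheduler.listenerbus.eventqueue.capacity")
          .withAlternative("spark.scheduler.listenerbus.eventqueue.size")  // the old key keeps working
          .intConf
          .createWithDefault(10000)
      ```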
      
      ## How was this patch tested?
      
      a new test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18110 from cloud-fan/config.
      629f38e1
    • Wil Selwood's avatar
      [MINOR] document edge case of updateFunc usage · b6f2017a
      Wil Selwood authored
      ## What changes were proposed in this pull request?
      
      Include documentation of the fact that the updateFunc is sometimes called with no new values. This is documented in the main documentation here: https://spark.apache.org/docs/latest/streaming-programming-guide.html#updatestatebykey-operation; however, from the docs included with the code it is not clear that this is the case.
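      A small Scala sketch of an `updateFunc` that handles this edge case (key/value types are illustrative):
      
      ```Scala
      // updateStateByKey may call this with an empty `newValues` Seq, e.g. when a key
      // has existing state but received no new records in the current batch.
      def updateFunc(newValues: Seq[Int], state: Option[Int]): Option[Int] = {
        if (newValues.isEmpty && state.isEmpty) {
          None                                      // nothing to keep for this key
        } else {
          Some(state.getOrElse(0) + newValues.sum)  // still correct when newValues is empty
        }
      }
      
      // Usage on a DStream[(String, Int)], for example:
      // val stateCounts = wordCounts.updateStateByKey[Int](updateFunc _)
      ```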
      
      ## How was this patch tested?
      
      PR only changes comments. Confirmed code still builds.
      
      Author: Wil Selwood <wil.selwood@sa.catapult.org.uk>
      
      Closes #18088 from wselwood/note-edge-case-in-docs.
      b6f2017a
    • Wenchen Fan's avatar
      [SPARK-20868][CORE] UnsafeShuffleWriter should verify the position after FileChannel.transferTo · d9ad7890
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Long ago we fixed a [bug](https://issues.apache.org/jira/browse/SPARK-3948) in the shuffle writer related to `FileChannel.transferTo`. We were not very confident about that fix, so we added a position check after the write to try to catch the bug earlier.
      
      However, this check is missing in the new `UnsafeShuffleWriter`; this PR adds it.
      
      https://issues.apache.org/jira/browse/SPARK-18105 may be related to that `FileChannel.transferTo` bug; hopefully we can find the root cause after adding this position check.
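      A sketch of the kind of check described (an assumed shape, not the exact code added in this PR):
      
      ```Scala
      import java.nio.channels.FileChannel
      
      // Copy `bytesToCopy` bytes from `input` to `output` via transferTo, then verify
      // that the destination's position advanced by exactly that amount; the old
      // SPARK-3948 issue involved transferTo misbehaving on some kernel versions.
      def copyWithPositionCheck(input: FileChannel, output: FileChannel,
                                startPosition: Long, bytesToCopy: Long): Unit = {
        val initialPos = output.position()
        var count = 0L
        while (count < bytesToCopy) {
          count += input.transferTo(startPosition + count, bytesToCopy - count, output)
        }
        val expectedPos = initialPos + bytesToCopy
        assert(output.position() == expectedPos,
          s"Current position ${output.position()} does not equal expected position $expectedPos " +
            "after transferTo; please check your kernel version for the transferTo bug.")
      }
      ```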
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18091 from cloud-fan/shuffle.
      d9ad7890
    • Zheng RuiFeng's avatar
      [SPARK-20849][DOC][SPARKR] Document R DecisionTree · a97c4970
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      1. Add an example for the SparkR `decisionTree`.
      2. Document it in the user guide.
      
      ## How was this patch tested?
      local submit
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #18067 from zhengruifeng/dt_example.
      a97c4970
    • Liang-Chi Hsieh's avatar
      [SPARK-20392][SQL] Set barrier to prevent re-entering a tree · 8ce0d8ff
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      It is reported that there is a performance downgrade when applying an ML pipeline to a dataset with many columns but few rows.
      
      A big part of the performance downgrade comes from some operations (e.g., `select`) on DataFrame/Dataset that re-create a new DataFrame/Dataset with a new `LogicalPlan`. In typical SQL usage this cost can normally be ignored.
      
      However, it's not rare to chain dozens of pipeline stages in ML. When the query plan grows incrementally while running those stages, the total cost spent on re-creating DataFrames grows too. In particular, the `Analyzer` will walk the whole big query plan even though most of it has already been analyzed.
      
      By eliminating part of the cost, the time to run the example code locally is reduced from about 1min to about 30 secs.
      
      In particular, the time applying the pipeline locally is mostly spent on calling transform of the 137 `Bucketizer`s. Before the change, each call of `Bucketizer`'s transform can cost about 0.4 sec. So the total time spent on all `Bucketizer`s' transform is about 50 secs. After the change, each call only costs about 0.1 sec.
      
      <del>We also make `boundEnc` as lazy variable to reduce unnecessary running time.</del>
      
      ### Performance improvement
      
      The codes and datasets provided by Barry Becker to re-produce this issue and benchmark can be found on the JIRA.
      
      Before this patch: about 1 min
      After this patch: about 20 secs
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #17770 from viirya/SPARK-20392.
      8ce0d8ff
  6. May 25, 2017
    • Wayne Zhang's avatar
      [SPARK-14659][ML] RFormula consistent with R when handling strings · f47700c9
      Wayne Zhang authored
      ## What changes were proposed in this pull request?
      When handling strings, the category dropped by RFormula and R are different:
      - RFormula drops the least frequent level
      - R drops the first level after ascending alphabetical ordering
      
      This PR uses the string ordering types added to StringIndexer in #17879 so that RFormula can drop the same level as R when handling strings, using `stringOrderType = "alphabetDesc"`.
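      A small sketch of the resulting usage (column names are illustrative; the setter name assumes the `stringIndexerOrderType` param referenced above):
      
      ```Scala
      import org.apache.spark.ml.feature.RFormula
      
      val formula = new RFormula()
        .setFormula("label ~ x + strCol")
        .setStringIndexerOrderType("alphabetDesc")  // drop the same reference level R would drop
      
      // val output = formula.fit(df).transform(df)  // `df` is an assumed input DataFrame
      ```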
      
      ## How was this patch tested?
      new tests
      
      Author: Wayne Zhang <actuaryzhang@uber.com>
      
      Closes #17967 from actuaryzhang/RFormula.
      f47700c9
    • setjet's avatar
      [SPARK-20775][SQL] Added scala support from_json · 2dbe0c52
      setjet authored
      ## What changes were proposed in this pull request?
      
      The from_json function was required to take in a java.util.HashMap for its options. For other functions, a Java wrapper is provided that casts a Java HashMap to a Scala Map; only the Java-style variant was provided in this case, forcing Scala users to pass in a java.util.HashMap.
      
      Added the missing wrapper.
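      A minimal sketch of the Scala-friendly call (schema and column names are illustrative; assumes a SparkSession `spark` and `import spark.implicits._`):
      
      ```Scala
      import org.apache.spark.sql.functions.from_json
      import org.apache.spark.sql.types.{IntegerType, StructType}
      
      val schema = new StructType().add("a", IntegerType)
      val df = Seq("""{"a": 1}""").toDF("json")
      
      // Options can now be passed as a plain Scala Map instead of a java.util.HashMap.
      df.select(from_json($"json", schema, Map("mode" -> "PERMISSIVE"))).show()
      ```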
      
      ## How was this patch tested?
      Added a unit test for passing in a scala map
      
      Author: setjet <rubenljanssen@gmail.com>
      
      Closes #18094 from setjet/spark-20775.
      2dbe0c52
    • Michael Allman's avatar
      [SPARK-20888][SQL][DOCS] Document change of default setting of... · c1e7989c
      Michael Allman authored
      [SPARK-20888][SQL][DOCS] Document change of default setting of spark.sql.hive.caseSensitiveInferenceMode
      
      (Link to Jira: https://issues.apache.org/jira/browse/SPARK-20888)
      
      ## What changes were proposed in this pull request?
      
      Document the change of the default setting of the spark.sql.hive.caseSensitiveInferenceMode configuration key from NEVER_INFER to INFER_AND_SAVE in the Spark SQL 2.1 to 2.2 migration notes.
      
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #18112 from mallman/spark-20888-document_infer_and_save.
      c1e7989c
    • Shixiong Zhu's avatar
      [SPARK-20874][EXAMPLES] Add Structured Streaming Kafka Source to examples project · 98c38529
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Add Structured Streaming Kafka Source to the `examples` project so that people can run `bin/run-example StructuredKafkaWordCount ...`.
      
      ## How was this patch tested?
      
      manually tested it.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #18101 from zsxwing/add-missing-example-dep.
      98c38529
    • hyukjinkwon's avatar
      [SPARK-19707][SPARK-18922][TESTS][SQL][CORE] Fix test failures/the invalid... · e9f983df
      hyukjinkwon authored
      [SPARK-19707][SPARK-18922][TESTS][SQL][CORE] Fix test failures/the invalid path check for sc.addJar on Windows
      
      ## What changes were proposed in this pull request?
      
      This PR proposes two things:
      
      - A follow up for SPARK-19707 (Improving the invalid path check for sc.addJar on Windows as well).
      
      ```
      org.apache.spark.SparkContextSuite:
       - add jar with invalid path *** FAILED *** (32 milliseconds)
         2 was not equal to 1 (SparkContextSuite.scala:309)
         ...
      ```
      
      - Fix path vs URI related test failures on Windows.
      
      ```
      org.apache.spark.storage.LocalDirsSuite:
       - SPARK_LOCAL_DIRS override also affects driver *** FAILED *** (0 milliseconds)
         new java.io.File("/NONEXISTENT_PATH").exists() was true (LocalDirsSuite.scala:50)
         ...
      
       - Utils.getLocalDir() throws an exception if any temporary directory cannot be retrieved *** FAILED *** (15 milliseconds)
         Expected exception java.io.IOException to be thrown, but no exception was thrown. (LocalDirsSuite.scala:64)
         ...
      ```
      
      ```
      org.apache.spark.sql.hive.HiveSchemaInferenceSuite:
       - orc: schema should be inferred and saved when INFER_AND_SAVE is specified *** FAILED *** (203 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-dae61ab3-a851-4dd3-bf4e-be97c501f254
         ...
      
       - parquet: schema should be inferred and saved when INFER_AND_SAVE is specified *** FAILED *** (203 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-fa3aff89-a66e-4376-9a37-2a9b87596939
         ...
      
       - orc: schema should be inferred but not stored when INFER_ONLY is specified *** FAILED *** (141 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-fb464e59-b049-481b-9c75-f53295c9fc2c
         ...
      
       - parquet: schema should be inferred but not stored when INFER_ONLY is specified *** FAILED *** (125 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-9487568e-80a4-42b3-b0a5-d95314c4ccbc
         ...
      
       - orc: schema should not be inferred when NEVER_INFER is specified *** FAILED *** (156 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-0d2dfa45-1b0f-4958-a8be-1074ed0135a
         ...
      
       - parquet: schema should not be inferred when NEVER_INFER is specified *** FAILED *** (547 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-6d95d64e-613e-4a59-a0f6-d198c5aa51ee
         ...
      ```
      
      ```
      org.apache.spark.sql.execution.command.DDLSuite:
       - create temporary view using *** FAILED *** (15 milliseconds)
         org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-3881d9ca-561b-488d-90b9-97587472b853	mp;
         ...
      
       - insert data to a data source table which has a non-existing location should succeed *** FAILED *** (109 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-4cad3d19-6085-4b75-b407-fe5e9d21df54 did not equal file:///C:/projects/spark/target/tmp/spark-4cad3d19-6085-4b75-b407-fe5e9d21df54 (DDLSuite.scala:1869)
         ...
      
       - insert into a data source table with a non-existing partition location should succeed *** FAILED *** (94 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-4b52e7de-e3aa-42fd-95d4-6d4d58d1d95d did not equal file:///C:/projects/spark/target/tmp/spark-4b52e7de-e3aa-42fd-95d4-6d4d58d1d95d (DDLSuite.scala:1910)
         ...
      
       - read data from a data source table which has a non-existing location should succeed *** FAILED *** (93 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-f8c281e2-08c2-4f73-abbf-f3865b702c34 did not equal file:///C:/projects/spark/target/tmp/spark-f8c281e2-08c2-4f73-abbf-f3865b702c34 (DDLSuite.scala:1937)
         ...
      
       - read data from a data source table with non-existing partition location should succeed *** FAILED *** (110 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - create datasource table with a non-existing location *** FAILED *** (94 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-387316ae-070c-4e78-9b78-19ebf7b29ec8 did not equal file:///C:/projects/spark/target/tmp/spark-387316ae-070c-4e78-9b78-19ebf7b29ec8 (DDLSuite.scala:1982)
         ...
      
       - CTAS for external data source table with a non-existing location *** FAILED *** (16 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - CTAS for external data source table with a existed location *** FAILED *** (15 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - data source table:partition column name containing a b *** FAILED *** (125 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - data source table:partition column name containing a:b *** FAILED *** (143 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - data source table:partition column name containing a%b *** FAILED *** (109 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - data source table:partition column name containing a,b *** FAILED *** (109 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - location uri contains a b for datasource table *** FAILED *** (94 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-5739cda9-b702-4e14-932c-42e8c4174480a%20b did not equal file:///C:/projects/spark/target/tmp/spark-5739cda9-b702-4e14-932c-42e8c4174480/a%20b (DDLSuite.scala:2084)
         ...
      
       - location uri contains a:b for datasource table *** FAILED *** (78 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-9bdd227c-840f-4f08-b7c5-4036638f098da:b did not equal file:///C:/projects/spark/target/tmp/spark-9bdd227c-840f-4f08-b7c5-4036638f098d/a:b (DDLSuite.scala:2084)
         ...
      
       - location uri contains a%b for datasource table *** FAILED *** (78 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-62bb5f1d-fa20-460a-b534-cb2e172a3640a%25b did not equal file:///C:/projects/spark/target/tmp/spark-62bb5f1d-fa20-460a-b534-cb2e172a3640/a%25b (DDLSuite.scala:2084)
         ...
      
       - location uri contains a b for database *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - location uri contains a:b for database *** FAILED *** (15 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - location uri contains a%b for database *** FAILED *** (0 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      ```
      
      ```
      org.apache.spark.sql.hive.execution.HiveDDLSuite:
       - create hive table with a non-existing location *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - CTAS for external hive table with a non-existing location *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - CTAS for external hive table with a existed location *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - partition column name of parquet table containing a b *** FAILED *** (156 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - partition column name of parquet table containing a:b *** FAILED *** (94 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - partition column name of parquet table containing a%b *** FAILED *** (125 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - partition column name of parquet table containing a,b *** FAILED *** (110 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      
       - partition column name of hive table containing a b *** FAILED *** (15 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - partition column name of hive table containing a:b *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - partition column name of hive table containing a%b *** FAILED *** (16 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - partition column name of hive table containing a,b *** FAILED *** (0 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - hive table: location uri contains a b *** FAILED *** (0 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - hive table: location uri contains a:b *** FAILED *** (0 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      
       - hive table: location uri contains a%b *** FAILED *** (0 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
         ...
      ```
      
      ```
      org.apache.spark.sql.sources.PathOptionSuite:
       - path option also exist for write path *** FAILED *** (94 milliseconds)
         file:/C:projectsspark%09arget%09mpspark-2870b281-7ac0-43d6-b6b6-134e01ab6fdc did not equal file:///C:/projects/spark/target/tmp/spark-2870b281-7ac0-43d6-b6b6-134e01ab6fdc (PathOptionSuite.scala:98)
         ...
      ```
      
      ```
      org.apache.spark.sql.CachedTableSuite:
       - SPARK-19765: UNCACHE TABLE should un-cache all cached plans that refer to this table *** FAILED *** (110 milliseconds)
         java.lang.IllegalArgumentException: Can not create a Path from an empty string
         ...
      ```
      
      ```
      org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite:
       - treeString is redacted *** FAILED *** (250 milliseconds)
         "file:/C:/projects/spark/target/tmp/spark-3ecc1fa4-3e76-489c-95f4-f0b0500eae28" did not contain "C:\projects\spark\target\tmp\spark-3ecc1fa4-3e76-489c-95f4-f0b0500eae28" (DataSourceScanExecRedactionSuite.scala:46)
         ...
      ```
      
      ## How was this patch tested?
      
      Tested via AppVeyor for each and checked it passed once each. These should be retested via AppVeyor in this PR.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17987 from HyukjinKwon/windows-20170515.
      e9f983df