  1. Feb 19, 2017
  2. Feb 18, 2017
    • Ala Luszczak's avatar
      [SPARK-19447] Make Range operator generate "recordsRead" metric · b486ffc8
      Ala Luszczak authored
      ## What changes were proposed in this pull request?
      
      The Range operator was modified to produce the "recordsRead" metric instead of "generated rows". The tests were updated and partially moved to SQLMetricsSuite.
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: Ala Luszczak <ala@databricks.com>
      
      Closes #16960 from ala/range-records-read.
      b486ffc8
    • jinxing's avatar
      [SPARK-19263] DAGScheduler should avoid sending conflicting task set. · 729ce370
      jinxing authored
      In the current `DAGScheduler.handleTaskCompletion` code, when event.reason is `Success`, it first does `stage.pendingPartitions -= task.partitionId`, which may be a bug when `FetchFailed` happens.
      
      **Consider the scenario below**
      
      1.  Stage 0 runs and generates shuffle output data.
      2. Stage 1 reads the output from stage 0 and generates more shuffle data. It has two tasks: ShuffleMapTask1 and ShuffleMapTask2, and these tasks are launched on executorA.
      3. ShuffleMapTask1 fails to fetch blocks locally and sends a FetchFailed to the driver. The driver marks executorA as lost and updates failedEpoch;
      4. The driver resubmits stage 0 so the missing output can be re-generated, and then once it completes, resubmits stage 1 with ShuffleMapTask1x and ShuffleMapTask2x.
      5. ShuffleMapTask2 (from the original attempt of stage 1) successfully finishes on executorA and sends Success back to driver. This causes DAGScheduler::handleTaskCompletion to remove partition 2 from stage.pendingPartitions (line 1149), but it does not add the partition to the set of output locations (line 1192), because the task’s epoch is less than the failure epoch for the executor (because of the earlier failure on executor A)
      6. ShuffleMapTask1x successfully finishes on executorB, causing the driver to remove partition 1 from stage.pendingPartitions. Combined with the previous step, this means that there are no more pending partitions for the stage, so the DAGScheduler marks the stage as finished (line 1196). However, the shuffle stage is not available (line 1215) because the completion for ShuffleMapTask2 was ignored because of its epoch, so the DAGScheduler resubmits the stage.
      7. ShuffleMapTask2x is still running, so when TaskSchedulerImpl::submitTasks is called for the re-submitted stage, it throws an error, because there’s an existing active task set
      
      **In this fix**
      
      If a task completion is from a previous stage attempt and the epoch is too low
      (i.e., it was from a failed executor), don't remove the corresponding partition
      from pendingPartitions.
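
      A minimal sketch of that guard, using simplified stand-in names (`StageState`, `TaskInfo`, and the `failedEpoch` map here are hypothetical placeholders, not the actual DAGScheduler fields):

      ```scala
      // Hypothetical, simplified view of handling a successful ShuffleMapTask;
      // names are placeholders, not the real DAGScheduler code.
      object PendingPartitionsSketch {
        case class TaskInfo(partitionId: Int, epoch: Long, execId: String)
        class StageState(var pendingPartitions: Set[Int])

        def handleShuffleMapTaskSuccess(
            stage: StageState,
            task: TaskInfo,
            failedEpoch: Map[String, Long]): Unit = {
          // Epoch too low: the task ran on an executor that has since been
          // marked as failed, so its output will not be registered.
          val outputIgnored = failedEpoch.get(task.execId).exists(task.epoch <= _)
          if (!outputIgnored) {
            stage.pendingPartitions -= task.partitionId
            // ... register the map output location here ...
          }
          // else: keep the partition in pendingPartitions so the stage is not
          // wrongly considered finished.
        }
      }
      ```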
      
      Author: jinxing <jinxing@meituan.com>
      Author: jinxing <jinxing6042@126.com>
      
      Closes #16620 from jinxing64/SPARK-19263.
      729ce370
    • Moussa Taifi's avatar
      [MLLIB][TYPO] Replace LeastSquaresAggregator with LogisticAggregator · 21c7d3c3
      Moussa Taifi authored
      ## What changes were proposed in this pull request?
      
      Replace LeastSquaresAggregator with LogisticAggregator in the require statement of the merge op.
      
      ## How was this patch tested?
      
      Simple message fix.
      
      Author: Moussa Taifi <moutai10@gmail.com>
      
      Closes #16903 from moutai/master.
      21c7d3c3
    • Shuai Lin's avatar
      [SPARK-19550] Follow-up: fixed a typo that fails the dev/make-distribution.sh script. · e553b1e8
      Shuai Lin authored
      ## What changes were proposed in this pull request?
      
      Fixed a typo in `dev/make-distribution.sh` script that sets the MAVEN_OPTS variable, introduced [here](https://github.com/apache/spark/commit/0e24054#diff-ba2c046d92a1d2b5b417788bfb5cb5f8R149).
      
      ## How was this patch tested?
      
      Run `dev/make-distribution.sh` manually.
      
      Author: Shuai Lin <linshuai2012@gmail.com>
      
      Closes #16984 from lins05/fix-spark-make-distribution-after-removing-java7.
      e553b1e8
  3. Feb 17, 2017
    • wm624@hotmail.com's avatar
      [SPARK-19639][SPARKR][EXAMPLE] Add spark.svmLinear example and update vignettes · 8b57ea4a
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      We recently added the spark.svmLinear API for SparkR, so we need to add an example and update the vignettes.
      
      ## How was this patch tested?
      
      Manually ran the example.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16969 from wangmiao1981/example.
      8b57ea4a
    • Shixiong Zhu's avatar
      [SPARK-19617][SS] Fix the race condition when starting and stopping a query quickly · 15b144d2
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      The streaming thread in StreamExecution uses the following ways to check if it should exit:
      - Catch an InterruptedException.
      - `StreamExecution.state` is TERMINATED.
      
      When starting and stopping a query quickly, the above two checks may both fail:
      - Hit [HADOOP-14084](https://issues.apache.org/jira/browse/HADOOP-14084) and swallow the InterruptedException
      - StreamExecution.stop is called before `state` becomes `ACTIVE`. Then [runBatches](https://github.com/apache/spark/blob/dcc2d540a53f0bd04baead43fdee1c170ef2b9f3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L252) changes the state from `TERMINATED` to `ACTIVE`.
      
      If the above cases both happen, the query will hang forever.
      
      This PR changes `state` to an `AtomicReference` and uses `compareAndSet` to make sure we only change the state from `INITIALIZING` to `ACTIVE`. It also removes the `runUninterruptibly` hack from `HDFSMetadataLog`, because HADOOP-14084 won't cause any problem after we fix the race condition.
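
      A small sketch of the compare-and-set idea (the state names follow the description above; this is not the actual `StreamExecution` code):

      ```scala
      import java.util.concurrent.atomic.AtomicReference

      object StateSketch {
        sealed trait State
        case object INITIALIZING extends State
        case object ACTIVE extends State
        case object TERMINATED extends State

        val state = new AtomicReference[State](INITIALIZING)

        // stop() always moves to TERMINATED, whatever the current state is.
        def stop(): Unit = state.set(TERMINATED)

        // The streaming thread may only move INITIALIZING -> ACTIVE. If stop()
        // already set TERMINATED, the CAS fails and the thread exits immediately
        // instead of resurrecting the query.
        def runBatches(): Unit = {
          if (state.compareAndSet(INITIALIZING, ACTIVE)) {
            // ... run batches until stop() flips the state to TERMINATED ...
          }
        }
      }
      ```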
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16947 from zsxwing/SPARK-19617.
      15b144d2
    • Felix Cheung's avatar
      [SPARKR][EXAMPLES] update examples to stop spark session · 988f6d7e
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Stop the Spark session at the end of each example.
      
      ## How was this patch tested?
      
      manual
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16973 from felixcheung/rexamples.
      988f6d7e
    • Yanbo Liang's avatar
      [SPARK-18285][SPARKR] SparkR approxQuantile supports input multiple columns · b4065983
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      SparkR `approxQuantile` now supports multiple input columns.
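
      For reference, the corresponding multi-column Scala DataFrame API looks like this (a usage sketch; `df` and the column names are hypothetical):

      ```scala
      // Approximate median and 90th percentile for two columns at once;
      // returns one Array[Double] of quantiles per input column.
      val quantiles: Array[Array[Double]] =
        df.stat.approxQuantile(Array("colA", "colB"), Array(0.5, 0.9), 0.01)
      ```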
      
      ## How was this patch tested?
      Unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16951 from yanboliang/spark-19619.
      b4065983
    • Roberto Agostino Vitillo's avatar
      [SPARK-19517][SS] KafkaSource fails to initialize partition offsets · 1a3f5f8c
      Roberto Agostino Vitillo authored
      ## What changes were proposed in this pull request?
      
      This patch fixes a bug in `KafkaSource` with the (de)serialization of the length of the JSON string that contains the initial partition offsets.
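
      For illustration, a length-prefixed round trip looks like the sketch below (a generic example of the technique, not the actual `KafkaSource` code):

      ```scala
      import java.io.{DataInputStream, DataOutputStream}
      import java.nio.charset.StandardCharsets

      object OffsetSerdeSketch {
        // Write the byte length first, then the UTF-8 bytes of the JSON string.
        def writeOffsets(json: String, out: DataOutputStream): Unit = {
          val bytes = json.getBytes(StandardCharsets.UTF_8)
          out.writeInt(bytes.length)
          out.write(bytes)
        }

        // Read back with readFully so a short read cannot truncate the JSON.
        def readOffsets(in: DataInputStream): String = {
          val bytes = new Array[Byte](in.readInt())
          in.readFully(bytes)
          new String(bytes, StandardCharsets.UTF_8)
        }
      }
      ```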
      
      ## How was this patch tested?
      
      I ran the test suite for spark-sql-kafka-0-10.
      
      Author: Roberto Agostino Vitillo <ra.vitillo@gmail.com>
      
      Closes #16857 from vitillo/kafka_source_fix.
      1a3f5f8c
    • Liang-Chi Hsieh's avatar
      [SPARK-18986][CORE] ExternalAppendOnlyMap shouldn't fail when forced to spill... · 4cc06f4e
      Liang-Chi Hsieh authored
      [SPARK-18986][CORE] ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator
      
      ## What changes were proposed in this pull request?
      
      `ExternalAppendOnlyMap.forceSpill` now uses an assert to check that the map's iterator is not null. However, the assertion only holds after the map has been asked for an iterator. Before that, if another memory consumer requests more memory than is currently available, `ExternalAppendOnlyMap.forceSpill` can also be called. In this case, we see a failure like this:
      
          [info]   java.lang.AssertionError: assertion failed
          [info]   at scala.Predef$.assert(Predef.scala:156)
          [info]   at org.apache.spark.util.collection.ExternalAppendOnlyMap.forceSpill(ExternalAppendOnlyMap.scala:196)
          [info]   at org.apache.spark.util.collection.Spillable.spill(Spillable.scala:111)
          [info]   at org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$13.apply$mcV$sp(ExternalAppendOnlyMapSuite.scala:294)
      
      This fix is motivated by http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-AssertionError-assertion-failed-tc20277.html.
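
      A simplified sketch of the idea (hypothetical names, not the actual `ExternalAppendOnlyMap` code): handle a forced spill that arrives before the map has ever been iterated, instead of asserting that an iterator exists.

      ```scala
      // Hypothetical, simplified stand-in for the spillable map.
      class SpillableMapSketch[K, V] {
        private var readingIterator: Iterator[(K, V)] = null
        private var currentMap: Map[K, V] = Map.empty

        def forceSpill(): Boolean = {
          if (readingIterator != null) {
            // Normal case: the collection is already being iterated; spill
            // through the iterator.
            true
          } else if (currentMap.nonEmpty) {
            // Forced to spill before iterator() was called: spill the in-memory
            // map directly rather than failing an assertion.
            currentMap = Map.empty
            true
          } else {
            false // nothing to spill
          }
        }
      }
      ```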
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16387 from viirya/fix-externalappendonlymap.
      4cc06f4e
    • Davies Liu's avatar
      [SPARK-19500] [SQL] Fix off-by-one bug in BytesToBytesMap · 3d0c3af0
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Radix sort requires that half of the array be free (as temporary space), so we use 0.5 as the scale factor to make sure that BytesToBytesMap will not hold more items than 1/2 of its capacity. It turned out this is not true: the current implementation of append() could leave 1 more item than the threshold (1/2 of capacity) in the array, which breaks the requirement of radix sort (failing the assert in 2.2, or failing to insert into InMemorySorter in 2.1).
      
      This PR fixes the off-by-one bug in BytesToBytesMap.
      
      This PR also fixes a bug where the array would never grow if it failed to grow once (staying at the initial capacity), introduced by #15722 .
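
      An illustrative sketch of the invariant being restored (not the actual `BytesToBytesMap` code): with a 0.5 load factor, an append must be refused once half of the capacity is used, so radix sort always has the other half as scratch space.

      ```scala
      class HalfFullMapSketch(capacity: Int) {
        private val growthThreshold = capacity / 2
        private var numKeys = 0

        // Returns false when the caller must grow (or spill) before appending,
        // which guarantees numKeys never exceeds capacity / 2.
        def append(): Boolean = {
          if (numKeys >= growthThreshold) {
            false
          } else {
            numKeys += 1
            true
          }
        }
      }
      ```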
      
      ## How was this patch tested?
      
      Added regression test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #16844 from davies/off_by_one.
      3d0c3af0
    • Stan Zhai's avatar
      [SPARK-19622][WEBUI] Fix a http error in a paged table when using a `Go` button to search. · 021062af
      Stan Zhai authored
      ## What changes were proposed in this pull request?
      
      The search function of the paged table does not work because we do not strip the hash fragment from the request path.
      
      ![](https://issues.apache.org/jira/secure/attachment/12852996/screenshot-1.png)
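
      An illustrative sketch of stripping the fragment before rebuilding the page URL (placeholder names, not the actual `PagedTable` code):

      ```scala
      import java.net.URI

      object PagedTableUrlSketch {
        // Drop the "#..." fragment so the query parameters from the `Go` button
        // are appended to a valid request path.
        def basePath(requestPath: String): String = {
          val uri = new URI(requestPath)
          new URI(uri.getScheme, uri.getAuthority, uri.getPath, uri.getQuery, null).toString
        }
      }
      ```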
      
      ## How was this patch tested?
      
      Tested manually with my browser.
      
      Author: Stan Zhai <zhaishidan@haizhi.com>
      
      Closes #16953 from stanzhai/fix-webui-paged-table.
      021062af
    • Rolando Espinoza's avatar
      [MINOR][PYTHON] Fix typo docstring: 'top' -> 'topic' · 9d2d2204
      Rolando Espinoza authored
      ## What changes were proposed in this pull request?
      
      Fix typo in docstring.
      
      Author: Rolando Espinoza <rndmax84@gmail.com>
      
      Closes #16967 from rolando/pyspark-doc-typo.
      9d2d2204
    • hyukjinkwon's avatar
      [BUILD] Close stale PRs · ed338f72
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to close stale PRs.
      
      What I mean by "stale" here is that reviewers have left review comments but the author has been inactive, without answering them, for more than a month.
      
      I left some comments roughly a week ago to ping the authors, and they still look inactive in the PRs below.
      
      The list below includes some PRs that were suggested to be closed and a PR against another branch, which seems obviously inappropriate.
      
      Given the comments in the last three PRs below, they are probably worth being taken over by anyone who is interested.
      
      Closes #7963
      Closes #8374
      Closes #11192
      Closes #11374
      Closes #11692
      Closes #12243
      Closes #12583
      Closes #12620
      Closes #12675
      Closes #12697
      Closes #12800
      Closes #13715
      Closes #14266
      Closes #15053
      Closes #15159
      Closes #15209
      Closes #15264
      Closes #15267
      Closes #15871
      Closes #15861
      Closes #16319
      Closes #16324
      Closes #16890
      
      Closes #12398
      Closes #12933
      Closes #14517
      
      ## How was this patch tested?
      
      N/A
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16937 from HyukjinKwon/stale-prs-close.
      ed338f72
  4. Feb 16, 2017
    • Wenchen Fan's avatar
      [SPARK-18120][SPARK-19557][SQL] Call QueryExecutionListener callback methods... · 54d23599
      Wenchen Fan authored
      [SPARK-18120][SPARK-19557][SQL] Call QueryExecutionListener callback methods for DataFrameWriter methods
      
      ## What changes were proposed in this pull request?
      
      We only notify `QueryExecutionListener` for several `Dataset` operations, e.g. collect, take, etc. We should also do the notification for `DataFrameWriter` operations.
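
      For context, a listener is registered roughly like this (a usage sketch; the output path is hypothetical). With this change, the `onSuccess`/`onFailure` callbacks should also fire for `DataFrameWriter` operations such as `save`:

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.execution.QueryExecution
      import org.apache.spark.sql.util.QueryExecutionListener

      val spark = SparkSession.builder().master("local[*]").getOrCreate()

      spark.listenerManager.register(new QueryExecutionListener {
        override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
          println(s"succeeded: $funcName")
        override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
          println(s"failed: $funcName")
      })

      // After this patch, a write should notify the listener as well.
      spark.range(10).write.mode("overwrite").parquet("/tmp/listener-sketch")
      ```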
      
      ## How was this patch tested?
      
      new regression test
      
      close https://github.com/apache/spark/pull/16664
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16962 from cloud-fan/insert.
      54d23599
    • Nathan Howell's avatar
      [SPARK-18352][SQL] Support parsing multiline json files · 21fde57f
      Nathan Howell authored
      ## What changes were proposed in this pull request?
      
      If a new option `wholeFile` is set to `true` the JSON reader will parse each file (instead of a single line) as a value. This is done with Jackson streaming and it should be capable of parsing very large documents, assuming the row will fit in memory.
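
      A small usage sketch (the path is hypothetical; the option name is as proposed in this PR):

      ```scala
      // Parse each file as one JSON value instead of one JSON document per line.
      val df = spark.read
        .option("wholeFile", true)
        .json("/path/to/multiline-json")
      df.show()
      ```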
      
      Because the file is not buffered in memory the corrupt record handling is also slightly different when `wholeFile` is enabled: the corrupt column will contain the filename instead of the literal JSON if there is a parsing failure. It would be easy to extend this to add the parser location (line, column and byte offsets) to the output if desired.
      
      These changes have allowed types other than `String` to be parsed. Support for `UTF8String` and `Text` has been added (alongside `String` and `InputFormat`) and no longer requires a conversion to `String` just for parsing.
      
      I've also included a few other changes that generate slightly better bytecode and (imo) make it more obvious when and where boxing is occurring in the parser. These are included as separate commits, let me know if they should be flattened into this PR or moved to a new one.
      
      ## How was this patch tested?
      
      New and existing unit tests. No performance or load tests have been run.
      
      Author: Nathan Howell <nhowell@godaddy.com>
      
      Closes #16386 from NathanHowell/SPARK-18352.
      21fde57f
    • Sean Owen's avatar
      [SPARK-19550][HOTFIX][BUILD] Use JAVA_HOME/bin/java if JAVA_HOME is set in dev/mima · dcc2d540
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Use JAVA_HOME/bin/java if JAVA_HOME is set in dev/mima script to run MiMa
      This follows on https://github.com/apache/spark/pull/16871 -- it's a slightly separate issue, but is currently causing a build failure.
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16957 from srowen/SPARK-19550.2.
      dcc2d540
    • Zheng RuiFeng's avatar
      [SPARK-19436][SQL] Add missing tests for approxQuantile · 54a30c8a
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      1. Check the behavior with illegal `quantiles` and `relativeError`.
      2. Add tests for `relativeError` > 1.
      3. Update tests for `null` data.
      4. Update some docs for javadoc8.
      
      ## How was this patch tested?
      local test in spark-shell
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      Author: Ruifeng Zheng <ruifengz@foxmail.com>
      
      Closes #16776 from zhengruifeng/fix_approxQuantile.
      54a30c8a
    • hyukjinkwon's avatar
      [MINOR][BUILD] Fix javadoc8 break · 3b437687
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      The errors below seem to be caused by unidoc, which does not understand doubly commented blocks.
      
      ```
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:69: error: class, interface, or enum expected
      [error]  * MapGroupsWithStateFunction&lt;String, Integer, Integer, String&gt; mappingFunction =
      [error]                                  ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:69: error: class, interface, or enum expected
      [error]  * MapGroupsWithStateFunction&lt;String, Integer, Integer, String&gt; mappingFunction =
      [error]                                                                       ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:70: error: class, interface, or enum expected
      [error]  *    new MapGroupsWithStateFunction&lt;String, Integer, Integer, String&gt;() {
      [error]                                         ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:70: error: class, interface, or enum expected
      [error]  *    new MapGroupsWithStateFunction&lt;String, Integer, Integer, String&gt;() {
      [error]                                                                             ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:72: error: illegal character: '#'
      [error]  *      &#64;Override
      [error]          ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:72: error: class, interface, or enum expected
      [error]  *      &#64;Override
      [error]              ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
      [error]  *      public String call(String key, Iterator&lt;Integer&gt; value, KeyedState&lt;Integer&gt; state) {
      [error]                ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
      [error]  *      public String call(String key, Iterator&lt;Integer&gt; value, KeyedState&lt;Integer&gt; state) {
      [error]                                                    ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
      [error]  *      public String call(String key, Iterator&lt;Integer&gt; value, KeyedState&lt;Integer&gt; state) {
      [error]                                                                ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
      [error]  *      public String call(String key, Iterator&lt;Integer&gt; value, KeyedState&lt;Integer&gt; state) {
      [error]                                                                                     ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
      [error]  *      public String call(String key, Iterator&lt;Integer&gt; value, KeyedState&lt;Integer&gt; state) {
      [error]                                                                                                 ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:76: error: class, interface, or enum expected
      [error]  *          boolean shouldRemove = ...; // Decide whether to remove the state
      [error]  ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:77: error: class, interface, or enum expected
      [error]  *          if (shouldRemove) {
      [error]  ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:79: error: class, interface, or enum expected
      [error]  *          } else {
      [error]  ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:81: error: class, interface, or enum expected
      [error]  *            state.update(newState); // Set the new state
      [error]  ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:82: error: class, interface, or enum expected
      [error]  *          }
      [error]  ^
      [error] .../forked/spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:85: error: class, interface, or enum expected
      [error]  *          state.update(initialState);
      [error]  ^
      [error] .../forked/spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:86: error: class, interface, or enum expected
      [error]  *        }
      [error]  ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:90: error: class, interface, or enum expected
      [error]  * </code></pre>
      [error]  ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:92: error: class, interface, or enum expected
      [error]  * tparam S User-defined type of the state to be stored for each key. Must be encodable into
      [error]            ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:93: error: class, interface, or enum expected
      [error]  *           Spark SQL types (see {link Encoder} for more details).
      [error]                                          ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:94: error: class, interface, or enum expected
      [error]  * since 2.1.1
      [error]           ^
      ```
      
      And another link seems unrecognisable.
      
      ```
      .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:16: error: reference not found
      [error]  * That is, in every batch of the {link streaming.StreamingQuery StreamingQuery},
      [error]
      ```
      
      Note that this PR does not fix the two breaks as below:
      
      ```
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:43: error: unexpected content
      [error]    * see {link DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile} for
      [error]      ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:52: error: bad use of '>'
      [error]    * param relativeError The relative target precision to achieve (>= 0).
      [error]                                                                     ^
      [error]
      ```
      
      because these will probably be fixed soon in https://github.com/apache/spark/pull/16776 and I intended to avoid potential conflicts.
      
      ## How was this patch tested?
      
      Manually via `jekyll build`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16926 from HyukjinKwon/javadoc-break.
      3b437687
    • Sean Owen's avatar
      [SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support · 0e240549
      Sean Owen authored
      - Move external/java8-tests tests into core, streaming, sql, and remove the module
      - Remove MaxPermGen and related options
      - Fix some reflection / TODOs around Java 8+ methods
      - Update doc references to 1.7/1.8 differences
      - Remove Java 7/8 related build profiles
      - Update some plugins for better Java 8 compatibility
      - Fix a few Java-related warnings
      
      For the future:
      
      - Update Java 8 examples to fully use Java 8
      - Update Java tests to use lambdas for simplicity
      - Update Java internal implementations to use lambdas
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16871 from srowen/SPARK-19493.
      0e240549
    • Kevin Yu's avatar
      [SPARK-18871][SQL][TESTS] New test cases for IN/NOT IN subquery 3rd batch · 3871d94a
      Kevin Yu authored
      ## What changes were proposed in this pull request?
      
      This is the 3rd batch of test cases for IN/NOT IN subqueries. This PR adds the following test files:
      
      `in-having.sql`
      `in-joins.sql`
      `in-multiple-columns.sql`
      
      These are the queries and results from running on DB2.
      [in-having DB2 version](https://github.com/apache/spark/files/772668/in-having.sql.db2.txt)
      [output of in-having](https://github.com/apache/spark/files/772670/in-having.sql.db2.out.txt)
      [in-joins DB2 version](https://github.com/apache/spark/files/772672/in-joins.sql.db2.txt)
      [output of in-joins](https://github.com/apache/spark/files/772673/in-joins.sql.db2.out.txt)
      [in-multiple-columns DB2 version](https://github.com/apache/spark/files/772678/in-multiple-columns.sql.db2.txt)
      [output of in-multiple-columns](https://github.com/apache/spark/files/772680/in-multiple-columns.sql.db2.out.txt)
      
      ## How was this patch tested?
      This PR adds new test cases. We compare the result from Spark with the result from another RDBMS (we used DB2 LUW). If the results are the same, we assume the result is correct.
      
      Author: Kevin Yu <qyu@us.ibm.com>
      
      Closes #16841 from kevinyu98/spark-18871-33.
      3871d94a
    • Tejas Patil's avatar
      [SPARK-19618][SQL] Inconsistency wrt max. buckets allowed from Dataframe API vs SQL · f041e55e
      Tejas Patil authored
      ## What changes were proposed in this pull request?
      
      Jira: https://issues.apache.org/jira/browse/SPARK-19618
      
      Moved the check for validating the number of buckets from `DataFrameWriter` to `BucketSpec` creation.
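
      A simplified sketch of what validating at `BucketSpec` creation looks like (the bound and message are illustrative, not the actual code):

      ```scala
      // Both the DataFrame API and SQL build a bucket spec, so a require() here
      // gives one consistent check and error message.
      case class BucketSpecSketch(
          numBuckets: Int,
          bucketColumnNames: Seq[String],
          sortColumnNames: Seq[String]) {
        require(numBuckets > 0, s"Number of buckets should be greater than 0 but was $numBuckets")
      }
      ```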
      
      ## How was this patch tested?
      
      - Added more unit tests
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #16948 from tejasapatil/SPARK-19618_max_buckets.
      f041e55e
  5. Feb 15, 2017
    • Kevin Yu's avatar
      [SPARK-18871][SQL][TESTS] New test cases for IN/NOT IN subquery 4th batch · 8487902a
      Kevin Yu authored
      ## What changes were proposed in this pull request?
      
      This is the 4th batch of test cases for IN/NOT IN subqueries. This PR adds the following test files:
      
      `in-set-operations.sql`
      `in-with-cte.sql`
      `not-in-joins.sql`
      
      Here are the queries and results from running on DB2.
      
      [in-set-operations DB2 version](https://github.com/apache/spark/files/772846/in-set-operations.sql.db2.txt)
      [Output of in-set-operations](https://github.com/apache/spark/files/772848/in-set-operations.sql.db2.out.txt)
      [in-with-cte DB2 version](https://github.com/apache/spark/files/772849/in-with-cte.sql.db2.txt)
      [Output of in-with-cte](https://github.com/apache/spark/files/772856/in-with-cte.sql.db2.out.txt)
      [not-in-joins DB2 version](https://github.com/apache/spark/files/772851/not-in-joins.sql.db2.txt)
      [Output of not-in-joins](https://github.com/apache/spark/files/772852/not-in-joins.sql.db2.out.txt)
      
      ## How was this patch tested?
      
      This PR adds new test cases. We compare the result from Spark with the result from another RDBMS (we used DB2 LUW). If the results are the same, we assume the result is correct.
      
      Author: Kevin Yu <qyu@us.ibm.com>
      
      Closes #16915 from kevinyu98/spark-18871-44.
      8487902a
    • Shixiong Zhu's avatar
      [SPARK-19603][SS] Fix StreamingQuery explain command · fc02ef95
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      `StreamingQuery.explain` doesn't show the correct streaming physical plan right now because `ExplainCommand` receives a runtime batch plan and its `logicalPlan.isStreaming` is always false.
      
      This PR adds `streaming` parameter to `ExplainCommand` to allow `StreamExecution` to specify that it's a streaming plan.
      
      Examples of the explain outputs:
      
      - streaming DataFrame.explain()
      ```
      == Physical Plan ==
      *HashAggregate(keys=[value#518], functions=[count(1)])
      +- StateStoreSave [value#518], OperatorStateId(<unknown>,0,0), Append, 0
         +- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
            +- StateStoreRestore [value#518], OperatorStateId(<unknown>,0,0)
               +- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
                  +- Exchange hashpartitioning(value#518, 5)
                     +- *HashAggregate(keys=[value#518], functions=[partial_count(1)])
                        +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
                           +- *MapElements <function1>, obj#517: java.lang.String
                              +- *DeserializeToObject value#513.toString, obj#516: java.lang.String
                                 +- StreamingRelation MemoryStream[value#513], [value#513]
      ```
      
      - StreamingQuery.explain(extended = false)
      ```
      == Physical Plan ==
      *HashAggregate(keys=[value#518], functions=[count(1)])
      +- StateStoreSave [value#518], OperatorStateId(...,0,0), Complete, 0
         +- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
            +- StateStoreRestore [value#518], OperatorStateId(...,0,0)
               +- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
                  +- Exchange hashpartitioning(value#518, 5)
                     +- *HashAggregate(keys=[value#518], functions=[partial_count(1)])
                        +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
                           +- *MapElements <function1>, obj#517: java.lang.String
                              +- *DeserializeToObject value#543.toString, obj#516: java.lang.String
                                 +- LocalTableScan [value#543]
      ```
      
      - StreamingQuery.explain(extended = true)
      ```
      == Parsed Logical Plan ==
      Aggregate [value#518], [value#518, count(1) AS count(1)#524L]
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
         +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#517: java.lang.String
            +- DeserializeToObject cast(value#543 as string).toString, obj#516: java.lang.String
               +- LocalRelation [value#543]
      
      == Analyzed Logical Plan ==
      value: string, count(1): bigint
      Aggregate [value#518], [value#518, count(1) AS count(1)#524L]
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
         +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#517: java.lang.String
            +- DeserializeToObject cast(value#543 as string).toString, obj#516: java.lang.String
               +- LocalRelation [value#543]
      
      == Optimized Logical Plan ==
      Aggregate [value#518], [value#518, count(1) AS count(1)#524L]
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
         +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#517: java.lang.String
            +- DeserializeToObject value#543.toString, obj#516: java.lang.String
               +- LocalRelation [value#543]
      
      == Physical Plan ==
      *HashAggregate(keys=[value#518], functions=[count(1)], output=[value#518, count(1)#524L])
      +- StateStoreSave [value#518], OperatorStateId(...,0,0), Complete, 0
         +- *HashAggregate(keys=[value#518], functions=[merge_count(1)], output=[value#518, count#530L])
            +- StateStoreRestore [value#518], OperatorStateId(...,0,0)
               +- *HashAggregate(keys=[value#518], functions=[merge_count(1)], output=[value#518, count#530L])
                  +- Exchange hashpartitioning(value#518, 5)
                     +- *HashAggregate(keys=[value#518], functions=[partial_count(1)], output=[value#518, count#530L])
                        +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
                           +- *MapElements <function1>, obj#517: java.lang.String
                              +- *DeserializeToObject value#543.toString, obj#516: java.lang.String
                                 +- LocalTableScan [value#543]
      ```
      
      ## How was this patch tested?
      
      The updated unit test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16934 from zsxwing/SPARK-19603.
      fc02ef95
    • Yun Ni's avatar
      [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing · 08c1972a
      Yun Ni authored
      ## What changes were proposed in this pull request?
      This pull request includes the Python API and examples for LSH. The API changes are based on yanboliang's PR #15768, with conflicts resolved and updates for the API changes on the Scala side. The examples are consistent with the Scala examples of MinHashLSH and BucketedRandomProjectionLSH.
      
      ## How was this patch tested?
      API and examples are tested using spark-submit:
      `bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py`
      `bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py`
      
      User guide changes are generated and manually inspected:
      `SKIP_API=1 jekyll build`
      
      Author: Yun Ni <yunn@uber.com>
      Author: Yanbo Liang <ybliang8@gmail.com>
      Author: Yunni <Euler57721@gmail.com>
      
      Closes #16715 from Yunni/spark-18080.
      08c1972a
    • Shixiong Zhu's avatar
      [SPARK-19599][SS] Clean up HDFSMetadataLog · 21b4ba2d
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some cleanup for HDFSMetadataLog.
      
      This PR includes the following changes:
      - ~~Remove the workaround codes for HADOOP-10622.~~ Unfortunately, there is another issue [HADOOP-14084](https://issues.apache.org/jira/browse/HADOOP-14084) that prevents us from removing the workaround codes.
      - Remove unnecessary `writer: (T, OutputStream) => Unit` and just call `serialize` directly.
      - Remove catching FileNotFoundException.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16932 from zsxwing/metadata-cleanup.
      21b4ba2d
    • Yin Huai's avatar
      [SPARK-19604][TESTS] Log the start of every Python test · f6c3bba2
      Yin Huai authored
      ## What changes were proposed in this pull request?
      Right now, we only log at info level after we finish the tests of a Python test file. We should also log the start of a test, so if a test is hanging, we can tell which test file is running.
      
      ## How was this patch tested?
      This is a change for python tests.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #16935 from yhuai/SPARK-19604.
      f6c3bba2
    • Takuya UESHIN's avatar
      [SPARK-18937][SQL] Timezone support in CSV/JSON parsing · 865b2fd8
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up pr of #16308.
      
      This pr enables timezone support in CSV/JSON parsing.
      
      We should introduce `timeZone` option for CSV/JSON datasources (the default value of the option is session local timezone).
      
      The datasources should use the `timeZone` option to format/parse timestamp values when writing/reading.
      Notice that while reading, if the `timestampFormat` includes timezone info, the timezone option will not be used because we should respect the timezone in the values.
      
      For example, if you have timestamp `"2016-01-01 00:00:00"` in `GMT`, the values written with the default timezone option (which is `"GMT"` here because the session local timezone is `"GMT"`) are:
      
      ```scala
      scala> spark.conf.set("spark.sql.session.timeZone", "GMT")
      
      scala> val df = Seq(new java.sql.Timestamp(1451606400000L)).toDF("ts")
      df: org.apache.spark.sql.DataFrame = [ts: timestamp]
      
      scala> df.show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 00:00:00|
      +-------------------+
      
      scala> df.write.json("/path/to/gmtjson")
      ```
      
      ```sh
      $ cat /path/to/gmtjson/part-*
      {"ts":"2016-01-01T00:00:00.000Z"}
      ```
      
      whereas setting the option to `"PST"`, they are:
      
      ```scala
      scala> df.write.option("timeZone", "PST").json("/path/to/pstjson")
      ```
      
      ```sh
      $ cat /path/to/pstjson/part-*
      {"ts":"2015-12-31T16:00:00.000-08:00"}
      ```
      
      We can properly read these files even if the timezone option is wrong because the timestamp values have timezone info:
      
      ```scala
      scala> val schema = new StructType().add("ts", TimestampType)
      schema: org.apache.spark.sql.types.StructType = StructType(StructField(ts,TimestampType,true))
      
      scala> spark.read.schema(schema).json("/path/to/gmtjson").show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 00:00:00|
      +-------------------+
      
      scala> spark.read.schema(schema).option("timeZone", "PST").json("/path/to/gmtjson").show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 00:00:00|
      +-------------------+
      ```
      
      And even if `timestampFormat` doesn't contain timezone info, we can properly read the values by setting the correct timezone option:
      
      ```scala
      scala> df.write.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson")
      ```
      
      ```sh
      $ cat /path/to/jstjson/part-*
      {"ts":"2016-01-01T09:00:00"}
      ```
      
      ```scala
      // wrong result
      scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").json("/path/to/jstjson").show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 09:00:00|
      +-------------------+
      
      // correct result
      scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson").show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 00:00:00|
      +-------------------+
      ```
      
      This PR also makes `JsonToStruct` and `StructToJson` extend `TimeZoneAwareExpression` so they can evaluate values with the timezone option.
      
      ## How was this patch tested?
      
      Existing tests and added some tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #16750 from ueshin/issues/SPARK-18937.
      865b2fd8
    • windpiger's avatar
      [SPARK-19329][SQL] Reading from or writing to a datasource table with a non... · 6a9a85b8
      windpiger authored
      [SPARK-19329][SQL] Reading from or writing to a datasource table with a non pre-existing location should succeed
      
      ## What changes were proposed in this pull request?
      
      When we insert data into a datasource table using SQL text, and the table has a non-existing location, this throws an exception.
      
      example:
      
      ```
      spark.sql("create table t(a string, b int) using parquet")
      spark.sql("alter table t set location '/xx'")
      spark.sql("insert into table t select 'c', 1")
      ```
      
      Exception:
      ```
      com.google.common.util.concurrent.UncheckedExecutionException: org.apache.spark.sql.AnalysisException: Path does not exist: /xx;
      at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814)
      at com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830)
      at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122)
      at org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
      at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)
      ```
      
      As discussed in the comments below, we should unify the behavior when reading from or writing to a datasource table with a non-pre-existing location:
      
      1. reading from a datasource table: return 0 rows
      2. writing to a datasource table: write the data successfully
      
      ## How was this patch tested?
      unit test added
      
      Author: windpiger <songjun@outlook.com>
      
      Closes #16672 from windpiger/insertNotExistLocation.
      6a9a85b8
    • Dongjoon Hyun's avatar
      [SPARK-19607][HOTFIX] Finding QueryExecution that matches provided executionId · 59dc26e3
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      #16940 adds a test case which does not stop the Spark job. It causes many failures in other test cases.
      
      - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/2403/consoleFull
      - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/2600/consoleFull
      
      ```
      [info]   org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins test.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16943 from dongjoon-hyun/SPARK-19607-2.
      59dc26e3
    • jiangxingbo's avatar
      [SPARK-19331][SQL][TESTS] Improve the test coverage of SQLViewSuite · 3755da76
      jiangxingbo authored
      Move `SQLViewSuite` from `sql/hive` to `sql/core`, so we can test view support without the Hive metastore. Also moved the test cases that are specific to Hive to `HiveSQLViewSuite`.
      
      Improve the test coverage of SQLViewSuite, cover the following cases:
      1. view resolution (possibly a referenced table/view has changed after the view was created);
      2. handle a view with user specified column names;
      3. improve the test cases for a nested view.
      
      Also added a test case for cyclic view reference, which is a known issue that is not fixed yet.
      
      N/A
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #16674 from jiangxb1987/view-test.
      3755da76
    • Felix Cheung's avatar
      [SPARK-19399][SPARKR] Add R coalesce API for DataFrame and Column · 671bc08e
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add coalesce on DataFrame for reducing the number of partitions without a shuffle, and coalesce on Column.
      
      ## How was this patch tested?
      
      manual, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16739 from felixcheung/rcoalesce.
      671bc08e
    • zero323's avatar
      [SPARK-19160][PYTHON][SQL] Add udf decorator · c97f4e17
      zero323 authored
      ## What changes were proposed in this pull request?
      
      This PR adds `udf` decorator syntax as proposed in [SPARK-19160](https://issues.apache.org/jira/browse/SPARK-19160).
      
      This allows users to define UDF using simplified syntax:
      
      ```python
      from pyspark.sql.types import IntegerType
      from pyspark.sql.decorators import udf
      
      @udf(IntegerType())
      def add_one(x):
          """Adds one"""
          if x is not None:
              return x + 1
      ```
      
      without the need to define a separate function and then wrap it with `udf`.
      
      ## How was this patch tested?
      
      Existing unit tests to ensure backward compatibility and additional unit tests covering new functionality.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16533 from zero323/SPARK-19160.
      c97f4e17
    • VinceShieh's avatar
      [SPARK-19590][PYSPARK][ML] Update the document for QuantileDiscretizer in pyspark · 6eca21ba
      VinceShieh authored
      ## What changes were proposed in this pull request?
      This PR is to document the changes on QuantileDiscretizer in pyspark for PR:
      https://github.com/apache/spark/pull/15428
      
      ## How was this patch tested?
      No test needed
      
      Signed-off-by: VinceShieh <vincent.xie@intel.com>
      
      Author: VinceShieh <vincent.xie@intel.com>
      
      Closes #16922 from VinceShieh/spark-19590.
      6eca21ba
    • Liang-Chi Hsieh's avatar
      [SPARK-16475][SQL] broadcast hint for SQL queries - disallow space as the delimiter · acf71c63
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      A follow-up to disallow space as the delimiter in broadcast hint.
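
      For context, the hint lists its relations with commas (a usage sketch; the table names are hypothetical). A space-delimited list such as `/*+ BROADCAST(small other) */` is no longer accepted:

      ```scala
      val result = spark.sql(
        """
          |SELECT /*+ BROADCAST(small, other) */ *
          |FROM large
          |JOIN small ON large.id = small.id
          |JOIN other ON large.id = other.id
        """.stripMargin)
      ```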
      
      ## How was this patch tested?
      
      Jenkins test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16941 from viirya/disallow-space-delimiter.
      acf71c63
    • Dilip Biswal's avatar
      [SPARK-18872][SQL][TESTS] New test cases for EXISTS subquery (Joins + CTE) · a8a13982
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
      
      This PR adds the third and final set of tests for EXISTS subquery.
      
      File name                        | Brief description
      ------------------------| -----------------
      exists-cte.sql              | Tests EXISTS subqueries referencing a CTE
      exists-joins-and-set-ops.sql| Tests EXISTS subqueries used in joins (both when the join occurs in the outer query block and in the subquery block)
      
      DB2 results are attached here as reference :
      
      [exists-cte-db2.txt](https://github.com/apache/spark/files/752091/exists-cte-db2.txt)
      [exists-joins-and-set-ops-db2.txt](https://github.com/apache/spark/files/753283/exists-joins-and-set-ops-db2.txt) (updated)
      
      ## How was this patch tested?
      The test results are compared with the results from another SQL engine (in this case IBM DB2). If the results are equivalent, we assume they are correct.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #16802 from dilipbiswal/exists-pr3.
      a8a13982