  1. Jan 06, 2017
    • Michal Senkyr's avatar
      [SPARK-16792][SQL] Dataset containing a Case Class with a List type causes a... · 903bb8e8
      Michal Senkyr authored
      [SPARK-16792][SQL] Dataset containing a Case Class with a List type causes a CompileException (converting sequence to list)
      
      ## What changes were proposed in this pull request?
      
      Added a `to` call at the end of the code generated by `ScalaReflection.deserializerFor` if the requested type is not a supertype of `WrappedArray[_]` that uses `CanBuildFrom[_, _, _]` to convert result into an arbitrary subtype of `Seq[_]`.
      
      Care was taken to preserve the original deserialization where possible, to avoid the overhead of conversion in cases where it is not needed.
      
      `ScalaReflection.serializerFor` could already serialize any `Seq[_]`, so it was not altered.
      
      `SQLImplicits` had to be altered and new implicit encoders added to permit serialization of other sequence types.
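      
      For illustration, a minimal sketch (Scala 2.11, not the actual generated code) of the `CanBuildFrom`-based conversion that the appended `to` call performs on a deserialized `WrappedArray[_]`:
      ```scala
      import scala.collection.mutable.WrappedArray
      
      // a WrappedArray is what the deserializer produces before the appended `to` call
      val wrapped: WrappedArray[String] = WrappedArray.make(Array("D", "S", "H"))
      
      // `to` resolves an implicit CanBuildFrom and rebuilds the requested Seq subtype
      val asList: List[String] = wrapped.to[List]
      val asQueue: scala.collection.immutable.Queue[String] = wrapped.to[scala.collection.immutable.Queue]
      ```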
      
      Also fixes [SPARK-16815] Dataset[List[T]] leads to ArrayStoreException
      
      ## How was this patch tested?
      ```bash
      ./build/mvn -DskipTests clean package && ./dev/run-tests
      ```
      
      Also manual execution of the following sets of commands in the Spark shell:
      ```scala
      case class TestCC(key: Int, letters: List[String])
      
      val ds1 = sc.makeRDD(Seq(
      (List("D")),
      (List("S","H")),
      (List("F","H")),
      (List("D","L","L"))
      )).map(x=>(x.length,x)).toDF("key","letters").as[TestCC]
      
      val test1=ds1.map{_.key}
      test1.show
      ```
      
      ```scala
      case class X(l: List[String])
      spark.createDataset(Seq(List("A"))).map(X).show
      ```
      
      ```scala
      spark.sqlContext.createDataset(sc.parallelize(List(1) :: Nil)).collect
      ```
      
      After adding arbitrary sequence support also tested with the following commands:
      
      ```scala
      case class QueueClass(q: scala.collection.immutable.Queue[Int])
      
      spark.createDataset(Seq(List(1,2,3))).map(x => QueueClass(scala.collection.immutable.Queue(x: _*))).map(_.q.dequeue).collect
      ```
      
      Author: Michal Senkyr <mike.senkyr@gmail.com>
      
      Closes #16240 from michalsenkyr/sql-caseclass-list-fix.
      903bb8e8
  2. Jan 05, 2017
    • Kevin Yu's avatar
      [SPARK-18871][SQL] New test cases for IN/NOT IN subquery · bcc510b0
      Kevin Yu authored
      ## What changes were proposed in this pull request?
      This PR extends the existing IN/NOT IN subquery test case coverage and adds more test cases to the IN subquery test suite.
      
      Based on the discussion, we will create a `subquery/in-subquery` sub-structure under the `sql/core/src/test/resources/sql-tests/inputs` directory.
      
      This is the high level grouping for IN subquery:
      
      `subquery/in-subquery/`
      `subquery/in-subquery/simple-in.sql`
      `subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)`
      `subquery/in-subquery/not-in-group-by.sql`
      `subquery/in-subquery/in-order-by.sql`
      `subquery/in-subquery/in-limit.sql`
      `subquery/in-subquery/in-having.sql`
      `subquery/in-subquery/in-joins.sql`
      `subquery/in-subquery/not-in-joins.sql`
      `subquery/in-subquery/in-set-operations.sql`
      `subquery/in-subquery/in-with-cte.sql`
      `subquery/in-subquery/not-in-with-cte.sql`
      `subquery/in-subquery/in-multiple-columns.sql`
      
      We will deliver it through multiple PRs; this first PR for the IN subquery has
      
      `subquery/in-subquery/simple-in.sql`
      `subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)`
      
      These are the results from running on DB2.
      [Modified test file of in-group-by.sql used to run on DB2](https://github.com/apache/spark/files/683367/in-group-by.sql.db2.txt)
      [Output of the run result on DB2](https://github.com/apache/spark/files/683362/in-group-by.sql.db2.out.txt)
      [Modified test file of simple-in.sql used to run on DB2](https://github.com/apache/spark/files/683378/simple-in.sql.db2.txt)
      [Output of the run result on DB2](https://github.com/apache/spark/files/683379/simple-in.sql.db2.out.txt)
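      
      For reference, a minimal example of the kind of IN-subquery with GROUP BY these files exercise, runnable from the Scala shell (view names are illustrative, not taken from the actual test files):
      ```scala
      spark.range(10).createOrReplaceTempView("t1")
      spark.range(5).createOrReplaceTempView("t2")
      
      spark.sql("""
        SELECT id, count(*) AS cnt
        FROM t1
        WHERE id IN (SELECT id FROM t2 WHERE id > 1)
        GROUP BY id
      """).show()
      ```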
      
      ## How was this patch tested?
      
      This patch is adding tests.
      
      Author: Kevin Yu <qyu@us.ibm.com>
      
      Closes #16337 from kevinyu98/spark-18871.
      bcc510b0
    • Yanbo Liang's avatar
      [MINOR] Correct LogisticRegression test case for probability2prediction. · dfc4c935
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Set correct column names for `force to use probability2prediction` in `LogisticRegressionSuite`.
      
      ## How was this patch tested?
      Change unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16477 from yanboliang/lor-pred.
      dfc4c935
    • Wenchen Fan's avatar
      [SPARK-18885][SQL] unify CREATE TABLE syntax for data source and hive serde tables · cca945b6
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Today we have different syntax for creating data source tables and Hive serde tables; we should unify them to avoid confusing users and to take a step toward making Hive a data source.
      
      Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for details.
      
      TODO(for follow-up PRs):
      1. TBLPROPERTIES is not added to the new syntax; we should decide whether we want to add it later.
      2. `SHOW CREATE TABLE` should be updated to use the new syntax.
      3. We should decide whether we want to change the behavior of `SET LOCATION`.
      
      ## How was this patch tested?
      
      new tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16296 from cloud-fan/create-table.
      cca945b6
    • Rui Li's avatar
      [SPARK-14958][CORE] Failed task not handled when there's error deserializing failure reason · f5d18af6
      Rui Li authored
      ## What changes were proposed in this pull request?
      
      TaskResultGetter tries to deserialize the TaskEndReason before handling the failed task. If an error is thrown during deserialization, the failed task won't be handled, which leaves the job hanging.
      The PR proposes to handle the failed task in a finally block.
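      
      A minimal sketch of the try/finally shape (assumed names, not the actual `TaskResultGetter` code): the failed task is handled even if deserializing the failure reason throws.
      ```scala
      import scala.util.Try
      
      def handleTaskFailure(deserializeReason: () => String,
                            handleFailedTask: String => Unit): Unit = {
        var reason = "UnknownReason"
        try {
          reason = deserializeReason()   // may throw, e.g. a NoClassDefFoundError
        } finally {
          handleFailedTask(reason)       // always runs, so the job cannot hang
        }
      }
      
      // the handler still fires even though deserialization blows up
      Try(handleTaskFailure(
        () => throw new RuntimeException("corrupt TaskEndReason"),
        reason => println(s"handled failed task, reason = $reason")))
      ```
      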
      ## How was this patch tested?
      
      In my case I hit a NoClassDefFoundError and the job hung. I manually verified that the patch fixes it.
      
      Author: Rui Li <rui.li@intel.com>
      Author: Rui Li <lirui@apache.org>
      Author: Rui Li <shlr@cn.ibm.com>
      
      Closes #12775 from lirui-intel/SPARK-14958.
      f5d18af6
    • Wenchen Fan's avatar
      [SPARK-19058][SQL] fix partition related behaviors with DataFrameWriter.saveAsTable · 30345c43
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      When we append data to a partitioned table with `DataFrameWriter.saveAsTable`, there are 2 issues:
      1. it doesn't work when a partition has a custom location.
      2. it recovers all partitions.
      
      This PR fixes them by moving the special partition handling code from `DataSourceAnalysis` to `InsertIntoHadoopFsRelationCommand`, so that the `DataFrameWriter.saveAsTable` code path can also benefit from it.
      
      ## How was this patch tested?
      
      newly added regression tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16460 from cloud-fan/append.
      30345c43
  3. Jan 04, 2017
    • uncleGen's avatar
      [SPARK-19009][DOC] Add streaming rest api doc · 6873430c
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      Add the streaming REST API doc.
      
      Related to PR #16253.
      
      cc saturday-shi srowen
      
      ## How was this patch tested?
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #16414 from uncleGen/SPARK-19009.
      6873430c
    • Kay Ousterhout's avatar
      [SPARK-19062] Utils.writeByteBuffer bug fix · 00074b57
      Kay Ousterhout authored
      This commit changes Utils.writeByteBuffer so that it does not change
      the position of the ByteBuffer that it writes out, and adds a unit test for
      this functionality.
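      
      For illustration, a minimal sketch of the intended contract, assuming the simple save-and-restore approach (not the actual Utils code):
      ```scala
      import java.io.ByteArrayOutputStream
      import java.nio.ByteBuffer
      
      def writeByteBuffer(bb: ByteBuffer, out: ByteArrayOutputStream): Unit = {
        val pos = bb.position()                  // remember the caller's position
        val bytes = new Array[Byte](bb.remaining())
        bb.get(bytes)                            // reading advances the position...
        out.write(bytes)
        bb.position(pos)                         // ...so restore it before returning
      }
      
      val bb = ByteBuffer.wrap(Array[Byte](1, 2, 3, 4))
      writeByteBuffer(bb, new ByteArrayOutputStream())
      assert(bb.position() == 0)                 // the buffer is left where the caller had it
      ```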
      
      cc mridulm
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #16462 from kayousterhout/SPARK-19062.
      00074b57
    • Herman van Hovell's avatar
      [SPARK-19070] Clean-up dataset actions · 4262fb0d
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      Dataset actions currently spin off a new `DataFrame` only to track query execution. This PR simplifies this code path by using `Dataset.queryExecution` directly. It also merges the typed and untyped action evaluation paths.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16466 from hvanhovell/SPARK-19070.
      4262fb0d
    • Niranjan Padmanabhan's avatar
      [MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo · a1e40b1f
      Niranjan Padmanabhan authored
      ## What changes were proposed in this pull request?
      There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.
      
      ## How was this patch tested?
      N/A since only docs or comments were updated.
      
      Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>
      
      Closes #16455 from neurons/np.structure_streaming_doc.
      a1e40b1f
    • Zheng RuiFeng's avatar
      [SPARK-19054][ML] Eliminate extra pass in NB · 7a825058
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      eliminate unnecessary extra pass in NB's train
      
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #16453 from zhengruifeng/nb_getNC.
      7a825058
    • Wenchen Fan's avatar
      [SPARK-19060][SQL] remove the supportsPartial flag in AggregateFunction · 101556d0
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Now that all aggregation functions support partial aggregation, we can remove the `supportsPartial` flag in `AggregateFunction`.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16461 from cloud-fan/partial.
      101556d0
    • mingfei's avatar
      [SPARK-19073] LauncherState should be only set to SUBMITTED after the application is submitted · fe1c895e
      mingfei authored
      ## What changes were proposed in this pull request?
      LauncherState should only be set to SUBMITTED after the application is submitted.
      Currently the state is set before the application is actually submitted.
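      
      A hypothetical sketch of the ordering fix (names are illustrative, not the actual launcher API):
      ```scala
      sealed trait LauncherState
      case object UNKNOWN extends LauncherState
      case object SUBMITTED extends LauncherState
      
      class AppHandle { @volatile var state: LauncherState = UNKNOWN }
      
      def submitApplication(handle: AppHandle)(doSubmit: => Unit): Unit = {
        doSubmit()                  // actually submit the application first...
        handle.state = SUBMITTED    // ...and only then report SUBMITTED
      }
      ```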
      
      ## How was this patch tested?
      no test is added in this patch
      
      Author: mingfei <mingfei.smf@alipay.com>
      
      Closes #16459 from shimingfei/fixLauncher.
      fe1c895e
    • Wenchen Fan's avatar
      [SPARK-19072][SQL] codegen of Literal should not output boxed value · cbd11d23
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In https://github.com/apache/spark/pull/16402 we made a mistake: when a double/float is infinity, the `Literal` codegen outputs a boxed value and produces a wrong result.
      
      This PR fixes it by special-casing infinity so that no boxed value is output.
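      
      A hypothetical sketch of special-casing infinity when emitting a Java double literal (illustrative names, not the actual `Literal` codegen):
      ```scala
      def doubleLiteralJavaCode(v: Double): String = v match {
        case Double.PositiveInfinity => "Double.POSITIVE_INFINITY"
        case Double.NegativeInfinity => "Double.NEGATIVE_INFINITY"
        case _ if v.isNaN            => "Double.NaN"
        case _                       => s"${v}D"   // stays a primitive, e.g. "1.5D"
      }
      
      println(doubleLiteralJavaCode(Double.PositiveInfinity))   // Double.POSITIVE_INFINITY
      ```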
      
      ## How was this patch tested?
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16469 from cloud-fan/literal.
      cbd11d23
  4. Jan 03, 2017
    • gatorsmile's avatar
      [SPARK-19048][SQL] Delete Partition Location when Dropping Managed Partitioned... · b67b35f7
      gatorsmile authored
      [SPARK-19048][SQL] Delete Partition Location when Dropping Managed Partitioned Tables in InMemoryCatalog
      
      ### What changes were proposed in this pull request?
      The data in a managed table should be deleted after the table is dropped. However, if the partition location is not under the location of the partitioned table, it is not deleted as expected, since users can specify any location for a partition when adding it.
      
      This PR is to delete partition location when dropping managed partitioned tables stored in `InMemoryCatalog`.
      
      ### How was this patch tested?
      Added test cases for both HiveExternalCatalog and InMemoryCatalog
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16448 from gatorsmile/unsetSerdeProp.
      b67b35f7
    • Devaraj K's avatar
      [SPARK-15555][MESOS] Driver with --supervise option cannot be killed in Mesos mode · 89bf370e
      Devaraj K authored
      ## What changes were proposed in this pull request?
      
      Do not add killed applications for retry.
      ## How was this patch tested?
      
      I have verified manually in a Mesos cluster; with the changes, killed applications move to the Finished Drivers section and are not retried.
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #13323 from devaraj-kavali/SPARK-15555.
      89bf370e
    • Dongjoon Hyun's avatar
      [SPARK-18877][SQL] `CSVInferSchema.inferField` on DecimalType should find a... · 7a2b5f93
      Dongjoon Hyun authored
      [SPARK-18877][SQL] `CSVInferSchema.inferField` on DecimalType should find a common type with `typeSoFar`
      
      ## What changes were proposed in this pull request?
      
      CSV type inference causes `IllegalArgumentException` on decimal numbers with heterogeneous precisions and scales because the current logic uses the last decimal type in a **partition**. Specifically, `inferRowType`, the **seqOp** of **aggregate**, returns the last decimal type. This PR fixes it to use `findTightestCommonType`.
      
      **decimal.csv**
      ```
      9.03E+12
      1.19E+11
      ```
      
      **BEFORE**
      ```scala
      scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
      root
       |-- _c0: decimal(3,-9) (nullable = true)
      
      scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
      16/12/16 14:32:49 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
      java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 3
      ```
      
      **AFTER**
      ```scala
      scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
      root
       |-- _c0: decimal(4,-9) (nullable = true)
      
      scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
      +---------+
      |      _c0|
      +---------+
      |9.030E+12|
      | 1.19E+11|
      +---------+
      ```
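      
      The inferred `decimal(4,-9)` above can be reproduced with a small sketch of the usual decimal-widening rule (assumed logic for illustration; the actual fix delegates to `findTightestCommonType`): `9.03E+12` fits in `decimal(3,-10)` and `1.19E+11` in `decimal(3,-9)`.
      ```scala
      case class Dec(precision: Int, scale: Int)
      
      // widen two decimal types so that both values fit (assumed rule, illustration only)
      def widen(a: Dec, b: Dec): Dec = {
        val scale     = math.max(a.scale, b.scale)
        val intDigits = math.max(a.precision - a.scale, b.precision - b.scale)
        Dec(intDigits + scale, scale)
      }
      
      println(widen(Dec(3, -10), Dec(3, -9)))   // Dec(4,-9), matching the inferred decimal(4,-9)
      ```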
      
      ## How was this patch tested?
      
      Pass the newly added test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16320 from dongjoon-hyun/SPARK-18877.
      7a2b5f93
    • Liang-Chi Hsieh's avatar
      [SPARK-18932][SQL] Support partial aggregation for collect_set/collect_list · 52636226
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Currently the collect_set/collect_list aggregation expressions don't support partial aggregation. This patch enables partial aggregation for them.
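      
      Usage example from the Scala shell (the API is unchanged; with this patch the aggregate below can be planned with partial aggregation):
      ```scala
      import org.apache.spark.sql.functions.collect_list
      
      val df = Seq((1, "a"), (1, "b"), (2, "c")).toDF("k", "v")
      df.groupBy("k").agg(collect_list("v")).show()
      ```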
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16371 from viirya/collect-partial-support.
      52636226
    • Weiqing Yang's avatar
      [MINOR] Add missing sc.stop() to end of examples · e5c307c5
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      
      Add a `finally` clause for `sc.stop()` in `test("register and deregister Spark listener from SparkContext")`.
      
      ## How was this patch tested?
      Pass the build and unit tests.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #16426 from weiqingy/testIssue.
      e5c307c5
  5. Jan 02, 2017
    • Zhenhua Wang's avatar
      [SPARK-18998][SQL] Add a cbo conf to switch between default statistics and estimated statistics · ae83c211
      Zhenhua Wang authored
      ## What changes were proposed in this pull request?
      
      We add a cbo configuration to switch between default stats and estimated stats.
      We also define a new statistics method `planStats` in LogicalPlan with conf as its parameter, in order to pass the cbo switch and other estimation-related configurations in the future. `planStats` is used on the caller side (i.e. in Optimizer and Strategies) to make transformation decisions based on stats.
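      
      A hypothetical sketch of the switch (illustrative names and signature, not the actual `LogicalPlan` API):
      ```scala
      case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)
      
      trait PlanWithStats {
        def defaultStats: Stats     // the pre-existing, size-only statistics
        def estimatedStats: Stats   // cbo-estimated statistics
        // callers (Optimizer/Strategies) pass the conf, so the cbo switch decides which to use
        def planStats(cboEnabled: Boolean): Stats =
          if (cboEnabled) estimatedStats else defaultStats
      }
      ```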
      
      ## How was this patch tested?
      
      Add a test case using a dummy LogicalPlan.
      
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #16401 from wzhfy/cboSwitch.
      ae83c211
    • gatorsmile's avatar
      [SPARK-19029][SQL] Remove databaseName from SimpleCatalogRelation · a6cd9dbc
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Remove the useless `databaseName` from `SimpleCatalogRelation`.
      
      ### How was this patch tested?
      Existing test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16438 from gatorsmile/removeDBFromSimpleCatalogRelation.
      a6cd9dbc
    • hyukjinkwon's avatar
      [SPARK-19002][BUILD][PYTHON] Check pep8 against all Python scripts · 46b21260
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to check pep8 against all other Python scripts and fix the errors as below:
      
      ```bash
      ./dev/create-release/generate-contributors.py
      ./dev/create-release/releaseutils.py
      ./dev/create-release/translate-contributors.py
      ./dev/lint-python
      ./python/docs/epytext.py
      ./examples/src/main/python/mllib/decision_tree_classification_example.py
      ./examples/src/main/python/mllib/decision_tree_regression_example.py
      ./examples/src/main/python/mllib/gradient_boosting_classification_example.py
      ./examples/src/main/python/mllib/gradient_boosting_regression_example.py
      ./examples/src/main/python/mllib/linear_regression_with_sgd_example.py
      ./examples/src/main/python/mllib/logistic_regression_with_lbfgs_example.py
      ./examples/src/main/python/mllib/naive_bayes_example.py
      ./examples/src/main/python/mllib/random_forest_classification_example.py
      ./examples/src/main/python/mllib/random_forest_regression_example.py
      ./examples/src/main/python/mllib/svm_with_sgd_example.py
      ./examples/src/main/python/streaming/network_wordjoinsentiments.py
      ./sql/hive/src/test/resources/data/scripts/cat.py
      ./sql/hive/src/test/resources/data/scripts/cat_error.py
      ./sql/hive/src/test/resources/data/scripts/doubleescapedtab.py
      ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py
      ./sql/hive/src/test/resources/data/scripts/escapedcarriagereturn.py
      ./sql/hive/src/test/resources/data/scripts/escapednewline.py
      ./sql/hive/src/test/resources/data/scripts/escapedtab.py
      ./sql/hive/src/test/resources/data/scripts/input20_script.py
      ./sql/hive/src/test/resources/data/scripts/newline.py
      ```
      
      ## How was this patch tested?
      
      - `./python/docs/epytext.py`
      
        ```bash
        cd ./python/docs && make html
        ```
      
      - pep8 check (Python 2.7 / Python 3.3.6)
      
        ```
        ./dev/lint-python
        ```
      
      - `./dev/merge_spark_pr.py` (Python 2.7 only / Python 3.3.6 not working)
      
        ```bash
        python -m doctest -v ./dev/merge_spark_pr.py
        ```
      
      - `./dev/create-release/releaseutils.py` `./dev/create-release/generate-contributors.py` `./dev/create-release/translate-contributors.py` (Python 2.7 only / Python 3.3.6 not working)
      
        ```bash
        python generate-contributors.py
        python translate-contributors.py
        ```
      
      - Examples (Python 2.7 / Python 3.3.6)
      
        ```bash
        ./bin/spark-submit examples/src/main/python/mllib/decision_tree_classification_example.py
        ./bin/spark-submit examples/src/main/python/mllib/decision_tree_regression_example.py
        ./bin/spark-submit examples/src/main/python/mllib/gradient_boosting_classification_example.py
        ./bin/spark-submit examples/src/main/python/mllib/gradient_boosting_regression_example.py
        ./bin/spark-submit examples/src/main/python/mllib/random_forest_classification_example.py
        ./bin/spark-submit examples/src/main/python/mllib/random_forest_regression_example.py
        ```
      
      - Examples (Python 2.7 only / Python 3.3.6 not working)
        ```
        ./bin/spark-submit examples/src/main/python/mllib/linear_regression_with_sgd_example.py
        ./bin/spark-submit examples/src/main/python/mllib/logistic_regression_with_lbfgs_example.py
        ./bin/spark-submit examples/src/main/python/mllib/naive_bayes_example.py
        ./bin/spark-submit examples/src/main/python/mllib/svm_with_sgd_example.py
        ```
      
      - `sql/hive/src/test/resources/data/scripts/*.py` (Python 2.7 / Python 3.3.6 within suggested changes)
      
        Manually tested only changed ones.
      
      - `./dev/github_jira_sync.py` (Python 2.7 only / Python 3.3.6 not working)
      
        Manually tested this after disabling actually adding comments and links.
      
      And also via Jenkins tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16405 from HyukjinKwon/minor-pep8.
      46b21260
    • hyukjinkwon's avatar
      [SPARK-19022][TESTS] Fix tests dependent on OS due to different newline characters · f1330b1d
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      There are two tests failing on Windows due to the different newlines.
      
      ```
       - StreamingQueryProgress - prettyJson *** FAILED *** (0 milliseconds)
       "{
          "id" : "39788670-6722-48b7-a248-df6ba08722ac",
          "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390",
          "name" : "myName",
          ...
        }" did not equal "{
          "id" : "39788670-6722-48b7-a248-df6ba08722ac",
          "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390",
          "name" : "myName",
          ...
        }"
        ...
      ```
      
      ```
       - StreamingQueryStatus - prettyJson *** FAILED *** (0 milliseconds)
       "{
          "message" : "active",
          "isDataAvailable" : true,
          "isTriggerActive" : false
        }" did not equal "{
          "message" : "active",
          "isDataAvailable" : true,
          "isTriggerActive" : false
        }"
        ...
      ```
      
      The reason is that `pretty` in `org.json4s.pretty` writes OS-dependent newlines, but the strings defined in the tests use `\n`. This ends up with test failures.
      
      This PR proposes to compare these regardless of newline concerns.
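      
      A minimal sketch of a newline-agnostic comparison (assumed approach, not the exact test code):
      ```scala
      def normalize(s: String): String = s.replaceAll("\r\n|\r", "\n")
      
      val expected      = "{\n  \"message\" : \"active\"\n}"
      val actualWindows = "{\r\n  \"message\" : \"active\"\r\n}"
      assert(normalize(actualWindows) == normalize(expected))
      ```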
      
      ## How was this patch tested?
      
      Manually tested via AppVeyor.
      
      **Before**
      https://ci.appveyor.com/project/spark-test/spark/build/417-newlines-fix-before
      
      **After**
      https://ci.appveyor.com/project/spark-test/spark/build/418-newlines-fix
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16433 from HyukjinKwon/tests-StreamingQueryStatusAndProgressSuite.
      f1330b1d
    • Liang-Chi Hsieh's avatar
      [MINOR][DOC] Minor doc change for YARN credential providers · 0ac2f1e7
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      The configuration `spark.yarn.security.tokens.{service}.enabled` is deprecated. Now we should use `spark.yarn.security.credentials.{service}.enabled`. Some places in the doc are not updated yet.
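      
      For example, the non-deprecated key can be set like this (the `hive` service name is just one example):
      ```scala
      val conf = new org.apache.spark.SparkConf()
        .set("spark.yarn.security.credentials.hive.enabled", "false")
      ```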
      
      ## How was this patch tested?
      
      N/A. Just doc change.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16444 from viirya/minor-credential-provider-doc.
      0ac2f1e7
    • Liwei Lin's avatar
      [SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide · 808b84e2
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      Currently some code snippets in the programming guide just do not compile. We should fix them.
      
      ## How was this patch tested?
      
      ```
      SKIP_API=1 jekyll build
      ```
      
      ## Screenshot from part of the change:
      
      ![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #16442 from lw-lin/ss-pro-guide-.
      808b84e2
    • Sean Owen's avatar
      [BUILD] Close stale PRs · ba488126
      Sean Owen authored
      Closes #12968
      Closes #16215
      Closes #16212
      Closes #16086
      Closes #15713
      Closes #16413
      Closes #16396
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16447 from srowen/CloseStalePRs.
      ba488126
  6. Jan 01, 2017
  7. Dec 31, 2016
  8. Dec 30, 2016
    • Cheng Lian's avatar
      [SPARK-19016][SQL][DOC] Document scalable partition handling · 871f6114
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR documents the scalable partition handling feature in the body of the programming guide.
      
      Before this PR, we only mention it in the migration guide. It's not super clear that, since 2.1, external datasource tables require an extra `MSCK REPAIR TABLE` command to have per-partition information persisted.
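      
      For reference, the extra command referred to above, run from the Scala shell (the table name is hypothetical):
      ```scala
      spark.sql("MSCK REPAIR TABLE my_partitioned_table")
      ```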
      
      ## How was this patch tested?
      
      N/A.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #16424 from liancheng/scalable-partition-handling-doc.
      871f6114
    • Dongjoon Hyun's avatar
      [SPARK-18123][SQL] Use db column names instead of RDD column ones during JDBC Writing · b85e2943
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Apache Spark supports the following cases **by quoting RDD column names** while saving through JDBC.
      - Allow reserved keyword as a column name, e.g., 'order'.
      - Allow mixed-case column names like the following, e.g., `[a: int, A: int]`.
      
        ``` scala
        scala> val df = sql("select 1 a, 1 A")
        df: org.apache.spark.sql.DataFrame = [a: int, A: int]
        ...
        scala> df.write.mode("overwrite").format("jdbc").options(option).save()
        scala> df.write.mode("append").format("jdbc").options(option).save()
        ```
      
      This PR aims to use **database column names** instead of RDD column names in order to additionally support the following case.
      Note that this case succeeds with `MySQL`, but previously failed on `Postgres`/`Oracle`.
      
      ``` scala
      val df1 = sql("select 1 a")
      val df2 = sql("select 1 A")
      ...
      df1.write.mode("overwrite").format("jdbc").options(option).save()
      df2.write.mode("append").format("jdbc").options(option).save()
      ```
      ## How was this patch tested?
      
      Pass the Jenkins test with a new testcase.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15664 from dongjoon-hyun/SPARK-18123.
      b85e2943
    • hyukjinkwon's avatar
      [SPARK-18922][TESTS] Fix more path-related test failures on Windows · 852782b8
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix the test failures due to different format of paths on Windows.
      
      Failed tests are as below:
      
      ```
      ColumnExpressionSuite:
      - input_file_name, input_file_block_start, input_file_block_length - FileScanRDD *** FAILED *** (187 milliseconds)
        "file:///C:/projects/spark/target/tmp/spark-0b21b963-6cfa-411c-8d6f-e6a5e1e73bce/part-00001-c083a03a-e55e-4b05-9073-451de352d006.snappy.parquet" did not contain "C:\projects\spark\target\tmp\spark-0b21b963-6cfa-411c-8d6f-e6a5e1e73bce" (ColumnExpressionSuite.scala:545)
      
      - input_file_name, input_file_block_start, input_file_block_length - HadoopRDD *** FAILED *** (172 milliseconds)
        "file:/C:/projects/spark/target/tmp/spark-5d0afa94-7c2f-463b-9db9-2e8403e2bc5f/part-00000-f6530138-9ad3-466d-ab46-0eeb6f85ed0b.txt" did not contain "C:\projects\spark\target\tmp\spark-5d0afa94-7c2f-463b-9db9-2e8403e2bc5f" (ColumnExpressionSuite.scala:569)
      
      - input_file_name, input_file_block_start, input_file_block_length - NewHadoopRDD *** FAILED *** (156 milliseconds)
        "file:/C:/projects/spark/target/tmp/spark-a894c7df-c74d-4d19-82a2-a04744cb3766/part-00000-29674e3f-3fcf-4327-9b04-4dab1d46338d.txt" did not contain "C:\projects\spark\target\tmp\spark-a894c7df-c74d-4d19-82a2-a04744cb3766" (ColumnExpressionSuite.scala:598)
      ```
      
      ```
      DataStreamReaderWriterSuite:
      - source metadataPath *** FAILED *** (62 milliseconds)
        org.mockito.exceptions.verification.junit.ArgumentsAreDifferent: Argument(s) are different! Wanted:
      streamSourceProvider.createSource(
          org.apache.spark.sql.SQLContext3b04133b,
          "C:\projects\spark\target\tmp\streaming.metadata-b05db6ae-c8dc-4ce4-b0d9-1eb8c84876c0/sources/0",
          None,
          "org.apache.spark.sql.streaming.test",
          Map()
      );
      -> at org.apache.spark.sql.streaming.test.DataStreamReaderWriterSuite$$anonfun$12.apply$mcV$sp(DataStreamReaderWriterSuite.scala:374)
      Actual invocation has different arguments:
      streamSourceProvider.createSource(
          org.apache.spark.sql.SQLContext3b04133b,
          "/C:/projects/spark/target/tmp/streaming.metadata-b05db6ae-c8dc-4ce4-b0d9-1eb8c84876c0/sources/0",
          None,
          "org.apache.spark.sql.streaming.test",
          Map()
      );
      ```
      
      ```
      GlobalTempViewSuite:
      - CREATE GLOBAL TEMP VIEW USING *** FAILED *** (110 milliseconds)
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark  arget mpspark-960398ba-a0a1-45f6-a59a-d98533f9f519;
      ```
      
      ```
      CreateTableAsSelectSuite:
      - CREATE TABLE USING AS SELECT *** FAILED *** (0 milliseconds)
        java.lang.IllegalArgumentException: Can not create a Path from an empty string
      
      - create a table, drop it and create another one with the same name *** FAILED *** (16 milliseconds)
        java.lang.IllegalArgumentException: Can not create a Path from an empty string
      
      - create table using as select - with partitioned by *** FAILED *** (0 milliseconds)
        java.lang.IllegalArgumentException: Can not create a Path from an empty string
      
      - create table using as select - with non-zero buckets *** FAILED *** (0 milliseconds)
        java.lang.IllegalArgumentException: Can not create a Path from an empty string
      ```
      
      ```
      HiveMetadataCacheSuite:
      - partitioned table is cached when partition pruning is true *** FAILED *** (532 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      - partitioned table is cached when partition pruning is false *** FAILED *** (297 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      ```
      
      ```
      MultiDatabaseSuite:
      - createExternalTable() to non-default database - with USE *** FAILED *** (954 milliseconds)
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark  arget mpspark-0839d9a7-5e29-467a-9e3e-3e4cd618ee09;
      
      - createExternalTable() to non-default database - without USE *** FAILED *** (500 milliseconds)
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark  arget mpspark-c7e24d73-1d8f-45e8-ab7d-53a83087aec3;
      
       - invalid database name and table names *** FAILED *** (31 milliseconds)
         "Path does not exist: file:/C:projectsspark  arget mpspark-15a2a494-3483-4876-80e5-ec396e704b77;" did not contain "`t:a` is not a valid name for tables/databases. Valid names only contain alphabet characters, numbers and _." (MultiDatabaseSuite.scala:296)
      ```
      
      ```
      OrcQuerySuite:
       - SPARK-8501: Avoids discovery schema from empty ORC files *** FAILED *** (15 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - Verify the ORC conversion parameter: CONVERT_METASTORE_ORC *** FAILED *** (78 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
       - converted ORC table supports resolving mixed case field *** FAILED *** (297 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      ```
      
      ```
      HadoopFsRelationTest - JsonHadoopFsRelationSuite, OrcHadoopFsRelationSuite, ParquetHadoopFsRelationSuite, SimpleTextHadoopFsRelationSuite:
       - Locality support for FileScanRDD *** FAILED *** (15 milliseconds)
         java.lang.IllegalArgumentException: Wrong FS: file://C:\projects\spark\target\tmp\spark-383d1f13-8783-47fd-964d-9c75e5eec50f, expected: file:///
      ```
      
      ```
      HiveQuerySuite:
      - CREATE TEMPORARY FUNCTION *** FAILED *** (0 milliseconds)
         java.net.MalformedURLException: For input string: "%5Cprojects%5Cspark%5Csql%5Chive%5Ctarget%5Cscala-2.11%5Ctest-classes%5CTestUDTF.jar"
      
       - ADD FILE command *** FAILED *** (500 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\sql\hive\target\scala-2.11\test-classes\data\files\v1.txt
      
       - ADD JAR command 2 *** FAILED *** (110 milliseconds)
         org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: C:projectssparksqlhive  argetscala-2.11 est-classesdatafilessample.json;
      ```
      
      ```
      PruneFileSourcePartitionsSuite:
       - PruneFileSourcePartitions should not change the output of LogicalRelation *** FAILED *** (15 milliseconds)
         org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      ```
      
      ```
      HiveCommandSuite:
       - LOAD DATA LOCAL *** FAILED *** (109 milliseconds)
         org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: C:projectssparksqlhive  argetscala-2.11 est-classesdatafilesemployee.dat;
      
       - LOAD DATA *** FAILED *** (93 milliseconds)
         java.net.URISyntaxException: Illegal character in opaque part at index 15: C:projectsspark arget mpemployee.dat7496657117354281006.tmp
      
       - Truncate Table *** FAILED *** (78 milliseconds)
         org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: C:projectssparksqlhive  argetscala-2.11 est-classesdatafilesemployee.dat;
      ```
      
      ```
      HiveExternalCatalogBackwardCompatibilitySuite:
      - make sure we can read table created by old version of Spark *** FAILED *** (0 milliseconds)
        "[/C:/projects/spark/target/tmp/]spark-0554d859-74e1-..." did not equal "[C:\projects\spark\target\tmp\]spark-0554d859-74e1-..." (HiveExternalCatalogBackwardCompatibilitySuite.scala:213)
        org.scalatest.exceptions.TestFailedException
      
      - make sure we can alter table location created by old version of Spark *** FAILED *** (110 milliseconds)
        java.net.URISyntaxException: Illegal character in opaque part at index 15: C:projectsspark	arget	mpspark-0e9b2c5f-49a1-4e38-a32a-c0ab1813a79f
      ```
      
      ```
      ExternalCatalogSuite:
      - create/drop/rename partitions should create/delete/rename the directory *** FAILED *** (610 milliseconds)
        java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-4c24f010-18df-437b-9fed-990c6f9adece
      ```
      
      ```
      SQLQuerySuite:
      - describe functions - temporary user defined functions *** FAILED *** (16 milliseconds)
        java.net.URISyntaxException: Illegal character in opaque part at index 22: C:projectssparksqlhive	argetscala-2.11	est-classesTestUDTF.jar
      
      - specifying database name for a temporary table is not allowed *** FAILED *** (125 milliseconds)
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-a34c9814-a483-43f2-be29-37f616b6df91;
      ```
      
      ```
      PartitionProviderCompatibilitySuite:
      - convert partition provider to hive with repair table *** FAILED *** (281 milliseconds)
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-ee5fc96d-8c7d-4ebf-8571-a1d62736473e;
      
      - when partition management is enabled, new tables have partition provider hive *** FAILED *** (187 milliseconds)
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-803ad4d6-3e8c-498d-9ca5-5cda5d9b2a48;
      
      - when partition management is disabled, new tables have no partition provider *** FAILED *** (172 milliseconds)
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-c9fda9e2-4020-465f-8678-52cd72d0a58f;
      
      - when partition management is disabled, we preserve the old behavior even for new tables *** FAILED *** (203 milliseconds)
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget
      mpspark-f4a518a6-c49d-43d3-b407-0ddd76948e13;
      
      - insert overwrite partition of legacy datasource table *** FAILED *** (188 milliseconds)
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-f4a518a6-c49d-43d3-b407-0ddd76948e79;
      
      - insert overwrite partition of new datasource table overwrites just partition *** FAILED *** (219 milliseconds)
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-6ba3a88d-6f6c-42c5-a9f4-6d924a0616ff;
      
      - SPARK-18544 append with saveAsTable - partition management true *** FAILED *** (173 milliseconds)
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-cd234a6d-9cb4-4d1d-9e51-854ae9543bbd;
      
      - SPARK-18635 special chars in partition values - partition management true *** FAILED *** (2 seconds, 967 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      - SPARK-18635 special chars in partition values - partition management false *** FAILED *** (62 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      - SPARK-18659 insert overwrite table with lowercase - partition management true *** FAILED *** (63 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      - SPARK-18544 append with saveAsTable - partition management false *** FAILED *** (266 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      - SPARK-18659 insert overwrite table files - partition management false *** FAILED *** (63 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      - SPARK-18659 insert overwrite table with lowercase - partition management false *** FAILED *** (78 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      - sanity check table setup *** FAILED *** (31 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      - insert into partial dynamic partitions *** FAILED *** (47 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      - insert into fully dynamic partitions *** FAILED *** (62 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      - insert into static partition *** FAILED *** (78 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      - overwrite partial dynamic partitions *** FAILED *** (63 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      - overwrite fully dynamic partitions *** FAILED *** (47 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      - overwrite static partition *** FAILED *** (63 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      ```
      
      ```
      MetastoreDataSourcesSuite:
      - check change without refresh *** FAILED *** (203 milliseconds)
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-00713fe4-ca04-448c-bfc7-6c5e9a2ad2a1;
      
      - drop, change, recreate *** FAILED *** (78 milliseconds)
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-2030a21b-7d67-4385-a65b-bb5e2bed4861;
      
      - SPARK-15269 external data source table creation *** FAILED *** (78 milliseconds)
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark	arget	mpspark-4d50fd4a-14bc-41d6-9232-9554dd233f86;
      
      - CTAS *** FAILED *** (109 milliseconds)
        java.lang.IllegalArgumentException: Can not create a Path from an empty string
      
      - CTAS with IF NOT EXISTS *** FAILED *** (109 milliseconds)
        java.lang.IllegalArgumentException: Can not create a Path from an empty string
      
      - CTAS: persisted partitioned bucketed data source table *** FAILED *** (0 milliseconds)
        java.lang.IllegalArgumentException: Can not create a Path from an empty string
      
      - SPARK-15025: create datasource table with path with select *** FAILED *** (16 milliseconds)
        java.lang.IllegalArgumentException: Can not create a Path from an empty string
      
      - CTAS: persisted partitioned data source table *** FAILED *** (47 milliseconds)
        java.lang.IllegalArgumentException: Can not create a Path from an empty string
      ```
      
      ```
      HiveMetastoreCatalogSuite:
      - Persist non-partitioned parquet relation into metastore as managed table using CTAS *** FAILED *** (16 milliseconds)
        java.lang.IllegalArgumentException: Can not create a Path from an empty string
      
      - Persist non-partitioned orc relation into metastore as managed table using CTAS *** FAILED *** (16 milliseconds)
        java.lang.IllegalArgumentException: Can not create a Path from an empty string
      ```
      
      ```
      HiveUDFSuite:
      - SPARK-11522 select input_file_name from non-parquet table *** FAILED *** (16 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      ```
      
      ```
      QueryPartitionSuite:
      - SPARK-13709: reading partitioned Avro table with nested schema *** FAILED *** (250 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      ```
      
      ```
      ParquetHiveCompatibilitySuite:
      - simple primitives *** FAILED *** (16 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      - SPARK-10177 timestamp *** FAILED *** (0 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      - array *** FAILED *** (16 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      - map *** FAILED *** (16 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      - struct *** FAILED *** (0 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      
      - SPARK-16344: array of struct with a single field named 'array_element' *** FAILED *** (15 milliseconds)
        org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
      ```
      
      ## How was this patch tested?
      
      Manually tested via AppVeyor.
      
      ```
      ColumnExpressionSuite:
      - input_file_name, input_file_block_start, input_file_block_length - FileScanRDD (234 milliseconds)
      - input_file_name, input_file_block_start, input_file_block_length - HadoopRDD (235 milliseconds)
      - input_file_name, input_file_block_start, input_file_block_length - NewHadoopRDD (203 milliseconds)
      ```
      
      ```
      DataStreamReaderWriterSuite:
      - source metadataPath (63 milliseconds)
      ```
      
      ```
      GlobalTempViewSuite:
       - CREATE GLOBAL TEMP VIEW USING (436 milliseconds)
      ```
      
      ```
      CreateTableAsSelectSuite:
      - CREATE TABLE USING AS SELECT (171 milliseconds)
      - create a table, drop it and create another one with the same name (422 milliseconds)
      - create table using as select - with partitioned by (141 milliseconds)
      - create table using as select - with non-zero buckets (125 milliseconds)
      ```
      
      ```
      HiveMetadataCacheSuite:
      - partitioned table is cached when partition pruning is true (3 seconds, 211 milliseconds)
      - partitioned table is cached when partition pruning is false (1 second, 781 milliseconds)
      ```
      
      ```
      MultiDatabaseSuite:
       - createExternalTable() to non-default database - with USE (797 milliseconds)
       - createExternalTable() to non-default database - without USE (640 milliseconds)
       - invalid database name and table names (62 milliseconds)
      ```
      
      ```
      OrcQuerySuite:
       - SPARK-8501: Avoids discovery schema from empty ORC files (703 milliseconds)
       - Verify the ORC conversion parameter: CONVERT_METASTORE_ORC (750 milliseconds)
       - converted ORC table supports resolving mixed case field (625 milliseconds)
      ```
      
      ```
      HadoopFsRelationTest - JsonHadoopFsRelationSuite, OrcHadoopFsRelationSuite, ParquetHadoopFsRelationSuite, SimpleTextHadoopFsRelationSuite:
       - Locality support for FileScanRDD (296 milliseconds)
      ```
      
      ```
      HiveQuerySuite:
       - CREATE TEMPORARY FUNCTION (125 milliseconds)
       - ADD FILE command (250 milliseconds)
       - ADD JAR command 2 (609 milliseconds)
      ```
      
      ```
      PruneFileSourcePartitionsSuite:
      - PruneFileSourcePartitions should not change the output of LogicalRelation (359 milliseconds)
      ```
      
      ```
      HiveCommandSuite:
       - LOAD DATA LOCAL (1 second, 829 milliseconds)
       - LOAD DATA (1 second, 735 milliseconds)
       - Truncate Table (1 second, 641 milliseconds)
      ```
      
      ```
      HiveExternalCatalogBackwardCompatibilitySuite:
       - make sure we can read table created by old version of Spark (32 milliseconds)
       - make sure we can alter table location created by old version of Spark (125 milliseconds)
       - make sure we can rename table created by old version of Spark (281 milliseconds)
      ```
      
      ```
      ExternalCatalogSuite:
      - create/drop/rename partitions should create/delete/rename the directory (625 milliseconds)
      ```
      
      ```
      SQLQuerySuite:
      - describe functions - temporary user defined functions (31 milliseconds)
      - specifying database name for a temporary table is not allowed (390 milliseconds)
      ```
      
      ```
      PartitionProviderCompatibilitySuite:
       - convert partition provider to hive with repair table (813 milliseconds)
       - when partition management is enabled, new tables have partition provider hive (562 milliseconds)
       - when partition management is disabled, new tables have no partition provider (344 milliseconds)
       - when partition management is disabled, we preserve the old behavior even for new tables (422 milliseconds)
       - insert overwrite partition of legacy datasource table (750 milliseconds)
       - SPARK-18544 append with saveAsTable - partition management true (985 milliseconds)
       - SPARK-18635 special chars in partition values - partition management true (3 seconds, 328 milliseconds)
       - SPARK-18635 special chars in partition values - partition management false (2 seconds, 891 milliseconds)
       - SPARK-18659 insert overwrite table with lowercase - partition management true (750 milliseconds)
       - SPARK-18544 append with saveAsTable - partition management false (656 milliseconds)
       - SPARK-18659 insert overwrite table files - partition management false (922 milliseconds)
       - SPARK-18659 insert overwrite table with lowercase - partition management false (469 milliseconds)
       - sanity check table setup (937 milliseconds)
       - insert into partial dynamic partitions (2 seconds, 985 milliseconds)
       - insert into fully dynamic partitions (1 second, 937 milliseconds)
       - insert into static partition (1 second, 578 milliseconds)
       - overwrite partial dynamic partitions (7 seconds, 561 milliseconds)
       - overwrite fully dynamic partitions (1 second, 766 milliseconds)
       - overwrite static partition (1 second, 797 milliseconds)
      ```
      
      ```
      MetastoreDataSourcesSuite:
       - check change without refresh (610 milliseconds)
       - drop, change, recreate (437 milliseconds)
       - SPARK-15269 external data source table creation (297 milliseconds)
       - CTAS with IF NOT EXISTS (437 milliseconds)
       - CTAS: persisted partitioned bucketed data source table (422 milliseconds)
       - SPARK-15025: create datasource table with path with select (265 milliseconds)
       - CTAS (438 milliseconds)
       - CTAS with IF NOT EXISTS (469 milliseconds)
       - CTAS: persisted partitioned bucketed data source table (406 milliseconds)
      ```
      
      ```
      HiveMetastoreCatalogSuite:
       - Persist non-partitioned parquet relation into metastore as managed table using CTAS (406 milliseconds)
       - Persist non-partitioned orc relation into metastore as managed table using CTAS (313 milliseconds)
      ```
      
      ```
      HiveUDFSuite:
       - SPARK-11522 select input_file_name from non-parquet table (3 seconds, 144 milliseconds)
      ```
      
      ```
      QueryPartitionSuite:
       - SPARK-13709: reading partitioned Avro table with nested schema (1 second, 67 milliseconds)
      ```
      
      ```
      ParquetHiveCompatibilitySuite:
       - simple primitives (745 milliseconds)
       - SPARK-10177 timestamp (375 milliseconds)
       - array (407 milliseconds)
       - map (409 milliseconds)
       - struct (437 milliseconds)
       - SPARK-16344: array of struct with a single field named 'array_element' (391 milliseconds)
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16397 from HyukjinKwon/SPARK-18922-paths.
      852782b8
    • Sean Owen's avatar
      [SPARK-18808][ML][MLLIB] ml.KMeansModel.transform is very inefficient · 56d3a7eb
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      mllib.KMeansModel.clusterCentersWithNorm is a method that ends up being called every time `predict` is called on a single vector, which is bad news for how the ml.KMeansModel Transformer works, since it necessarily transforms one vector at a time.
      
      This change makes the model store the vectors with norms upfront. The extra norms should be small compared to the vectors themselves. This avoids this form of overhead on this and other code paths.
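      
      A minimal sketch of the idea with plain arrays (illustrative only, not the actual KMeansModel code): the per-center norms are computed once at construction and reused by every `predict` call.
      ```scala
      class SimpleKMeansModel(centers: Array[Array[Double]]) {
        private def norm2(v: Array[Double]): Double = v.map(x => x * x).sum
      
        // stored upfront: (center, ||center||^2), computed once rather than on every predict call
        private val centersWithNorm: Array[(Array[Double], Double)] = centers.map(c => (c, norm2(c)))
      
        def predict(point: Array[Double]): Int = {
          val pNorm2 = norm2(point)
          centersWithNorm.zipWithIndex.minBy { case ((c, cNorm2), _) =>
            // ||c - p||^2 = ||c||^2 + ||p||^2 - 2 * (c . p), reusing the precomputed ||c||^2
            cNorm2 + pNorm2 - 2.0 * c.zip(point).map { case (a, b) => a * b }.sum
          }._2
        }
      }
      
      val model = new SimpleKMeansModel(Array(Array(0.0, 0.0), Array(5.0, 5.0)))
      println(model.predict(Array(4.0, 4.5)))   // 1
      ```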
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16328 from srowen/SPARK-18808.
      56d3a7eb
  9. Dec 29, 2016
    • Yin Huai's avatar
      Update known_translations for contributor names and also fix a small issue in... · 63036aee
      Yin Huai authored
      Update known_translations for contributor names and also fix a small issue in translate-contributors.py
      
      ## What changes were proposed in this pull request?
      This PR updates dev/create-release/known_translations to add more contributor name mapping. It also fixes a small issue in translate-contributors.py
      
      ## How was this patch tested?
      manually tested
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #16423 from yhuai/contributors.
      63036aee
    • adesharatushar's avatar
      [SPARK-19003][DOCS] Add Java example in Spark Streaming Guide, section Design... · dba81e1d
      adesharatushar authored
      [SPARK-19003][DOCS] Add Java example in Spark Streaming Guide, section Design Patterns for using foreachRDD
      
      ## What changes were proposed in this pull request?
      
       Added the missing Java example under the section "Design Patterns for using foreachRDD". Now this section has examples in all 3 languages, improving the consistency of the documentation.
      
      ## How was this patch tested?
      
      Manual.
       Generated the docs using the command `SKIP_API=1 jekyll build` and verified the generated HTML page manually.

       The syntax of the example has been tested for correctness using sample code on Java 1.7 and Spark 2.2.0-SNAPSHOT.
      
      Author: adesharatushar <tushar_adeshara@persistent.com>
      
      Closes #16408 from adesharatushar/streaming-doc-fix.
      dba81e1d
    • Ilya Matiach's avatar
      [SPARK-18698][ML] Adding public constructor that takes uid for IndexToString · 87bc4112
      Ilya Matiach authored
      ## What changes were proposed in this pull request?
      
      Based on SPARK-18698, this adds a public constructor that takes a UID for IndexToString.  Other transforms have similar constructors.
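
       A minimal sketch of how the new constructor might be used; the uid string and column names are illustrative, not from the patch.

       ```scala
       import org.apache.spark.ml.feature.IndexToString

       // Fix the stage uid explicitly instead of relying on a randomly generated one.
       val converter = new IndexToString("myIndexToString")
         .setInputCol("categoryIndex")
         .setOutputCol("originalCategory")
       ```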
      
      ## How was this patch tested?
      
      A unit test was added to verify the new functionality.
      
      Author: Ilya Matiach <ilmat@microsoft.com>
      
      Closes #16436 from imatiach-msft/ilmat/fix-indextostring.
      87bc4112
    • Dongjoon Hyun's avatar
      [SPARK-19012][SQL] Fix `createTempViewCommand` to throw AnalysisException instead of ParseException · 752d9eeb
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
       Currently, `createTempView`, `createOrReplaceTempView`, and `createGlobalTempView` show a `ParseException` on invalid table names. We had better show a better error message. This PR also adds and updates the missing descriptions in the API docs.
      
      **BEFORE**
      ```
      scala> spark.range(10).createOrReplaceTempView("11111")
      org.apache.spark.sql.catalyst.parser.ParseException:
      mismatched input '11111' expecting {'SELECT', 'FROM', 'ADD', ...}(line 1, pos 0)
      
      == SQL ==
      11111
      ...
      ```
      
      **AFTER**
      ```
      scala> spark.range(10).createOrReplaceTempView("11111")
      org.apache.spark.sql.AnalysisException: Invalid view name: 11111;
      ...
      ```
      
      ## How was this patch tested?
      
       Pass the Jenkins tests with an updated test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16427 from dongjoon-hyun/SPARK-19012.
      752d9eeb
  10. Dec 28, 2016
    • Wenchen Fan's avatar
      [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelectCommand · 7d19b6ab
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      The `CreateDataSourceTableAsSelectCommand` is quite complex now, as it has a lot of work to do if the table already exists:
      
       1. throw an exception if we don't want to ignore it.
       2. do some checks and adjust the schema if we want to append data.
       3. drop the table and create it again if we want to overwrite.

       Steps 2 and 3 should be done by the analyzer, so that we can also apply them to Hive tables.
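
       A much-simplified, hypothetical sketch of the three cases above in terms of `SaveMode` (the real command works against catalog metadata; the function and messages below are not from the patch):

       ```scala
       import org.apache.spark.sql.SaveMode

       def onExistingTable(mode: SaveMode, table: String): Unit = mode match {
         case SaveMode.ErrorIfExists =>
           throw new IllegalStateException(s"Table $table already exists.")   // case 1
         case SaveMode.Ignore =>
           ()                                                                  // nothing to do
         case SaveMode.Append =>
           println(s"check and adjust the schema, then append to $table")      // case 2
         case SaveMode.Overwrite =>
           println(s"drop $table and recreate it from the query result")       // case 3
       }
       ```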
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15996 from cloud-fan/append.
      7d19b6ab
    • Kazuaki Ishizaki's avatar
      [SPARK-16213][SQL] Reduce runtime overhead of a program that creates an... · 93f35569
      Kazuaki Ishizaki authored
      [SPARK-16213][SQL] Reduce runtime overhead of a program that creates an primitive array in DataFrame
      
      ## What changes were proposed in this pull request?
      
       This PR reduces the runtime overhead of a program that creates a primitive array in DataFrame by using an approach similar to #15044. The generated code performs a boxing operation in an assignment from InternalRow to an `Object[]` temporary array (at lines 051 and 061 in the generated code without this PR). If we know that the type of the array elements is primitive, we apply the following optimizations:
       1. Eliminate the pair of `isNullAt()` and a null assignment
       2. Allocate a primitive array instead of `Object[]` (eliminating boxing operations)
       3. Create `UnsafeArrayData` by using `UnsafeArrayWriter` to keep a primitive array in a row format, instead of doing non-lightweight operations in the constructor of `GenericArrayData`
       The PR also performs the same things for `CreateMap`.
      
       Here are performance results of [DataFrame programs](https://github.com/kiszk/spark/blob/6bf54ec5e227689d69f6db991e9ecbc54e153d0a/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/PrimitiveArrayBenchmark.scala#L83-L112), which improve by up to 17.9x over the version without this PR.
      
      ```
      Without SPARK-16043
      OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.4.11-200.fc22.x86_64
      Intel Xeon E3-12xx v2 (Ivy Bridge)
      Read a primitive array in DataFrame:     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Int                                           3805 / 4150          0.0      507308.9       1.0X
      Double                                        3593 / 3852          0.0      479056.9       1.1X
      
      With SPARK-16043
      Read a primitive array in DataFrame:     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Int                                            213 /  271          0.0       28387.5       1.0X
      Double                                         204 /  223          0.0       27250.9       1.0X
      ```
       Note: #15780 is enabled for these measurements.

       A motivating example:
      
       ``` scala
      val df = sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF
      df.selectExpr("Array(value + 1.1d, value + 2.2d)").show
      ```
      
      Generated code without this PR
      
      ``` java
      /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
      /* 006 */   private Object[] references;
      /* 007 */   private scala.collection.Iterator[] inputs;
      /* 008 */   private scala.collection.Iterator inputadapter_input;
      /* 009 */   private UnsafeRow serializefromobject_result;
      /* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
      /* 011 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
      /* 012 */   private Object[] project_values;
      /* 013 */   private UnsafeRow project_result;
      /* 014 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder;
      /* 015 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter project_rowWriter;
      /* 016 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter project_arrayWriter;
      /* 017 */
      /* 018 */   public GeneratedIterator(Object[] references) {
      /* 019 */     this.references = references;
      /* 020 */   }
      /* 021 */
      /* 022 */   public void init(int index, scala.collection.Iterator[] inputs) {
      /* 023 */     partitionIndex = index;
      /* 024 */     this.inputs = inputs;
      /* 025 */     inputadapter_input = inputs[0];
      /* 026 */     serializefromobject_result = new UnsafeRow(1);
      /* 027 */     this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0);
      /* 028 */     this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1);
      /* 029 */     this.project_values = null;
      /* 030 */     project_result = new UnsafeRow(1);
      /* 031 */     this.project_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result, 32);
      /* 032 */     this.project_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder, 1);
      /* 033 */     this.project_arrayWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
      /* 034 */
      /* 035 */   }
      /* 036 */
      /* 037 */   protected void processNext() throws java.io.IOException {
      /* 038 */     while (inputadapter_input.hasNext()) {
      /* 039 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
      /* 040 */       double inputadapter_value = inputadapter_row.getDouble(0);
      /* 041 */
      /* 042 */       final boolean project_isNull = false;
      /* 043 */       this.project_values = new Object[2];
      /* 044 */       boolean project_isNull1 = false;
      /* 045 */
      /* 046 */       double project_value1 = -1.0;
      /* 047 */       project_value1 = inputadapter_value + 1.1D;
      /* 048 */       if (false) {
      /* 049 */         project_values[0] = null;
      /* 050 */       } else {
      /* 051 */         project_values[0] = project_value1;
      /* 052 */       }
      /* 053 */
      /* 054 */       boolean project_isNull4 = false;
      /* 055 */
      /* 056 */       double project_value4 = -1.0;
      /* 057 */       project_value4 = inputadapter_value + 2.2D;
      /* 058 */       if (false) {
      /* 059 */         project_values[1] = null;
      /* 060 */       } else {
      /* 061 */         project_values[1] = project_value4;
      /* 062 */       }
      /* 063 */
      /* 064 */       final ArrayData project_value = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_values);
      /* 065 */       this.project_values = null;
      /* 066 */       project_holder.reset();
      /* 067 */
      /* 068 */       project_rowWriter.zeroOutNullBytes();
      /* 069 */
      /* 070 */       if (project_isNull) {
      /* 071 */         project_rowWriter.setNullAt(0);
      /* 072 */       } else {
      /* 073 */         // Remember the current cursor so that we can calculate how many bytes are
      /* 074 */         // written later.
      /* 075 */         final int project_tmpCursor = project_holder.cursor;
      /* 076 */
      /* 077 */         if (project_value instanceof UnsafeArrayData) {
      /* 078 */           final int project_sizeInBytes = ((UnsafeArrayData) project_value).getSizeInBytes();
      /* 079 */           // grow the global buffer before writing data.
      /* 080 */           project_holder.grow(project_sizeInBytes);
      /* 081 */           ((UnsafeArrayData) project_value).writeToMemory(project_holder.buffer, project_holder.cursor);
      /* 082 */           project_holder.cursor += project_sizeInBytes;
      /* 083 */
      /* 084 */         } else {
      /* 085 */           final int project_numElements = project_value.numElements();
      /* 086 */           project_arrayWriter.initialize(project_holder, project_numElements, 8);
      /* 087 */
      /* 088 */           for (int project_index = 0; project_index < project_numElements; project_index++) {
      /* 089 */             if (project_value.isNullAt(project_index)) {
      /* 090 */               project_arrayWriter.setNullDouble(project_index);
      /* 091 */             } else {
      /* 092 */               final double project_element = project_value.getDouble(project_index);
      /* 093 */               project_arrayWriter.write(project_index, project_element);
      /* 094 */             }
      /* 095 */           }
      /* 096 */         }
      /* 097 */
      /* 098 */         project_rowWriter.setOffsetAndSize(0, project_tmpCursor, project_holder.cursor - project_tmpCursor);
      /* 099 */       }
      /* 100 */       project_result.setTotalSize(project_holder.totalSize());
      /* 101 */       append(project_result);
      /* 102 */       if (shouldStop()) return;
      /* 103 */     }
      /* 104 */   }
      /* 105 */ }
      ```
      
      Generated code with this PR
      
      ``` java
      /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
      /* 006 */   private Object[] references;
      /* 007 */   private scala.collection.Iterator[] inputs;
      /* 008 */   private scala.collection.Iterator inputadapter_input;
      /* 009 */   private UnsafeRow serializefromobject_result;
      /* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
      /* 011 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
      /* 012 */   private UnsafeArrayData project_arrayData;
      /* 013 */   private UnsafeRow project_result;
      /* 014 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder;
      /* 015 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter project_rowWriter;
      /* 016 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter project_arrayWriter;
      /* 017 */
      /* 018 */   public GeneratedIterator(Object[] references) {
      /* 019 */     this.references = references;
      /* 020 */   }
      /* 021 */
      /* 022 */   public void init(int index, scala.collection.Iterator[] inputs) {
      /* 023 */     partitionIndex = index;
      /* 024 */     this.inputs = inputs;
      /* 025 */     inputadapter_input = inputs[0];
      /* 026 */     serializefromobject_result = new UnsafeRow(1);
      /* 027 */     this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0);
      /* 028 */     this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1);
      /* 029 */
      /* 030 */     project_result = new UnsafeRow(1);
      /* 031 */     this.project_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result, 32);
      /* 032 */     this.project_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder, 1);
      /* 033 */     this.project_arrayWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
      /* 034 */
      /* 035 */   }
      /* 036 */
      /* 037 */   protected void processNext() throws java.io.IOException {
      /* 038 */     while (inputadapter_input.hasNext()) {
      /* 039 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
      /* 040 */       double inputadapter_value = inputadapter_row.getDouble(0);
      /* 041 */
      /* 042 */       byte[] project_array = new byte[32];
      /* 043 */       project_arrayData = new UnsafeArrayData();
      /* 044 */       Platform.putLong(project_array, 16, 2);
      /* 045 */       project_arrayData.pointTo(project_array, 16, 32);
      /* 046 */
      /* 047 */       boolean project_isNull1 = false;
      /* 048 */
      /* 049 */       double project_value1 = -1.0;
      /* 050 */       project_value1 = inputadapter_value + 1.1D;
      /* 051 */       if (false) {
      /* 052 */         project_arrayData.setNullAt(0);
      /* 053 */       } else {
      /* 054 */         project_arrayData.setDouble(0, project_value1);
      /* 055 */       }
      /* 056 */
      /* 057 */       boolean project_isNull4 = false;
      /* 058 */
      /* 059 */       double project_value4 = -1.0;
      /* 060 */       project_value4 = inputadapter_value + 2.2D;
      /* 061 */       if (false) {
      /* 062 */         project_arrayData.setNullAt(1);
      /* 063 */       } else {
      /* 064 */         project_arrayData.setDouble(1, project_value4);
      /* 065 */       }
      /* 066 */       project_holder.reset();
      /* 067 */
      /* 068 */       // Remember the current cursor so that we can calculate how many bytes are
      /* 069 */       // written later.
      /* 070 */       final int project_tmpCursor = project_holder.cursor;
      /* 071 */
      /* 072 */       if (project_arrayData instanceof UnsafeArrayData) {
      /* 073 */         final int project_sizeInBytes = ((UnsafeArrayData) project_arrayData).getSizeInBytes();
      /* 074 */         // grow the global buffer before writing data.
      /* 075 */         project_holder.grow(project_sizeInBytes);
      /* 076 */         ((UnsafeArrayData) project_arrayData).writeToMemory(project_holder.buffer, project_holder.cursor);
      /* 077 */         project_holder.cursor += project_sizeInBytes;
      /* 078 */
      /* 079 */       } else {
      /* 080 */         final int project_numElements = project_arrayData.numElements();
      /* 081 */         project_arrayWriter.initialize(project_holder, project_numElements, 8);
      /* 082 */
      /* 083 */         for (int project_index = 0; project_index < project_numElements; project_index++) {
      /* 084 */           if (project_arrayData.isNullAt(project_index)) {
      /* 085 */             project_arrayWriter.setNullDouble(project_index);
      /* 086 */           } else {
      /* 087 */             final double project_element = project_arrayData.getDouble(project_index);
      /* 088 */             project_arrayWriter.write(project_index, project_element);
      /* 089 */           }
      /* 090 */         }
      /* 091 */       }
      /* 092 */
      /* 093 */       project_rowWriter.setOffsetAndSize(0, project_tmpCursor, project_holder.cursor - project_tmpCursor);
      /* 094 */       project_result.setTotalSize(project_holder.totalSize());
      /* 095 */       append(project_result);
      /* 096 */       if (shouldStop()) return;
      /* 097 */     }
      /* 098 */   }
      /* 099 */ }
      ```
      ## How was this patch tested?
      
      Added unit tests into `DataFrameComplexTypeSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #13909 from kiszk/SPARK-16213.
      93f35569
    • Tathagata Das's avatar
      [SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding... · 092c6725
      Tathagata Das authored
      [SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status
      
      ## What changes were proposed in this pull request?
      
       - Extended the Window operation section with a code snippet and an explanation of watermarking (see the sketch after this list)
       - Extended the Output Mode section with a table showing the compatibility between query type and output mode
       - Rewrote the Monitoring section with updated JSON output generated by `StreamingQuery.progress`/`status`
       - Updated the `StreamingQueryListener` example to reflect the API changes
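
       A minimal sketch of the kind of windowed aggregation with watermarking the updated section describes, assuming a `SparkSession` named `spark`; the source and column names are illustrative only:

       ```scala
       import org.apache.spark.sql.functions.{col, window}

       val lines = spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .option("includeTimestamp", true)
         .load()
         .toDF("word", "eventTime")

       val windowedCounts = lines
         .withWatermark("eventTime", "10 minutes")      // tolerate up to 10 minutes of event-time lateness
         .groupBy(window(col("eventTime"), "5 minutes"), col("word"))
         .count()

       val query = windowedCounts.writeStream
         .outputMode("update")                          // emit only the windows updated since the last trigger
         .format("console")
         .start()
       ```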
      
      TODO
      - [x] Figure showing the watermarking
      
      ## How was this patch tested?
      
      N/A
      
      ## Screenshots
      ### Section: Windowed Aggregation with Event Time
      
      <img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">
      
      ![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)
      
      <img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">
      
      ----------------------------
      ### Section: Output Modes
      ![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)
      
      ----------------------------
      ### Section: Monitoring
      ![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
      ![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16294 from tdas/SPARK-18669.
      092c6725
    • sethah's avatar
      [SPARK-17772][ML][TEST] Add test functions for ML sample weights · 6a475ae4
      sethah authored
      ## What changes were proposed in this pull request?
      
       More and more ML algorithms accept sample weights, and they have been tested rather heterogeneously and with code duplication. This patch adds extensible helper methods to `MLTestingUtils` that can be reused by various algorithms accepting sample weights. Up to now, a few tests have commonly been implemented:
      
      * Check that oversampling is the same as giving the instances sample weights proportional to the number of samples
      * Check that outliers with tiny sample weights do not affect the algorithm's performance
      
      This patch adds an additional test:
      
       * Check that algorithms are invariant to constant scaling of the sample weights, i.e. uniform sample weights with `w_i = 1.0` are effectively the same as uniform sample weights with `w_i = 10000` or `w_i = 0.0001`
      
       Instances of these tests existed in LinearRegression, NaiveBayes, and LogisticRegression. Those tests have been removed or modified to use the new helper methods. These helper functions will be of use when [SPARK-9478](https://issues.apache.org/jira/browse/SPARK-9478) is implemented.
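
       A hypothetical sketch of the constant-scaling check described above, written against LogisticRegression for concreteness; the actual helpers in `MLTestingUtils` are generic over the estimator and use a caller-supplied model comparison:

       ```scala
       import org.apache.spark.ml.classification.LogisticRegression
       import org.apache.spark.sql.DataFrame
       import org.apache.spark.sql.functions.lit

       // Assumes `data` has the usual "label" and "features" columns.
       def checkWeightScalingInvariance(data: DataFrame, scale: Double): Unit = {
         val lr = new LogisticRegression().setWeightCol("weight")

         val baseline = lr.fit(data.withColumn("weight", lit(1.0)))
         val scaled   = lr.fit(data.withColumn("weight", lit(scale)))   // e.g. 10000 or 0.0001

         // Uniformly scaled weights should leave the fitted coefficients (nearly) unchanged.
         val maxDiff = baseline.coefficients.toArray.zip(scaled.coefficients.toArray)
           .map { case (a, b) => math.abs(a - b) }.max
         assert(maxDiff < 1e-6, s"Coefficients changed under uniform weight scaling: $maxDiff")
       }
       ```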
      
      ## How was this patch tested?
      
      This patch only involves modifying test suites.
      
      ## Other notes
      
      Both IsotonicRegression and GeneralizedLinearRegression also extend `HasWeightCol`. I did not modify these test suites because it will make this patch easier to review, and because they did not duplicate the same tests as the three suites that were modified. If we want to change them later, we can create a JIRA for it now, but it's open for debate.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15721 from sethah/SPARK-17772.
      6a475ae4