  1. Oct 27, 2016
      [SPARK-17813][SQL][KAFKA] Maximum data per trigger · 10423258
      cody koeninger authored
      ## What changes were proposed in this pull request?
      
      Adds a `maxOffsetsPerTrigger` option for rate limiting, apportioned proportionally based on the volume of the different topic-partitions.
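
      For context, a minimal sketch of how the option is set on a Kafka stream (the option name is from this PR; the rest of the stream setup is assumed boilerplate):

      ```scala
      // Assumes a SparkSession `spark` and a reachable Kafka broker; illustrative only.
      val stream = spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:9092")
        .option("subscribe", "topic1")
        // Cap the total number of offsets consumed per trigger; the cap is split
        // proportionally across topic-partitions based on their available volume.
        .option("maxOffsetsPerTrigger", "10000")
        .load()
      ```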
      
      ## How was this patch tested?
      
      Added unit test
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #15527 from koeninger/SPARK-17813.
      [SPARK-CORE][TEST][MINOR] Fix the wrong comment in test · 701a9d36
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      While reading the core scheduler code, I found two lines of incorrect comments. This PR simply corrects them.
      
      ## How was this patch tested?
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #15631 from wangmiao1981/Rbug.
      [SQL][DOC] updating doc for JSON source to link to jsonlines.org · 44c8bfda
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      API and programming guide doc changes for Scala, Python and R.
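
      For context, the behavior being documented is that the JSON source reads [JSON Lines](http://jsonlines.org/) input: each line must contain a single, complete JSON object. A small illustrative read (file path hypothetical):

      ```scala
      // people.json, one JSON object per line:
      //   {"name": "Alice", "age": 30}
      //   {"name": "Bob", "age": 25}
      val people = spark.read.json("examples/people.json") // assumes a SparkSession `spark`
      people.printSchema() // schema is inferred, e.g. age: long, name: string
      ```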
      
      ## How was this patch tested?
      
      manual test
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15629 from felixcheung/jsondoc.
      [SPARK-17157][SPARKR][FOLLOW-UP] doc fixes · 1dbe9896
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      A couple of small late-found doc fixes.
      
      ## How was this patch tested?
      
      manually
      wangmiao1981
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15650 from felixcheung/logitfix.
      [SPARK-18132] Fix checkstyle · d3b4831d
      Yin Huai authored
      This PR fixes checkstyle.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #15656 from yhuai/fix-format.
      [SPARK-18009][SQL] Fix ClassCastException while calling toLocalIterator() on... · dd4f088c
      Dilip Biswal authored
      [SPARK-18009][SQL] Fix ClassCastException while calling toLocalIterator() on dataframe produced by RunnableCommand
      
      ## What changes were proposed in this pull request?
      A short code snippet that uses toLocalIterator() on a dataframe produced by a RunnableCommand
      reproduces the problem. toLocalIterator() is called by the Thrift server when
      `spark.sql.thriftServer.incrementalCollect` is set, to handle queries producing large result
      sets.
      
      **Before**
      ```SQL
      scala> spark.sql("show databases")
      res0: org.apache.spark.sql.DataFrame = [databaseName: string]
      
      scala> res0.toLocalIterator()
      16/10/26 03:00:24 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
      java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow
      ```
      
      **After**
      ```SQL
      scala> spark.sql("drop database databases")
      res30: org.apache.spark.sql.DataFrame = []
      
      scala> spark.sql("show databases")
      res31: org.apache.spark.sql.DataFrame = [databaseName: string]
      
      scala> res31.toLocalIterator().asScala foreach println
      [default]
      [parquet]
      ```
      ## How was this patch tested?
      Added a test in DDLSuite
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #15642 from dilipbiswal/SPARK-18009.
  2. Oct 26, 2016
      [SPARK-17770][CATALYST] making ObjectType public · f1aeed8b
      ALeksander Eskilson authored
      ## What changes were proposed in this pull request?
      
      In order to facilitate the writing of additional Encoders, I proposed opening up the ObjectType SQL DataType. This DataType is used extensively in the JavaBean Encoder, but would also be useful in writing other custom encoders.
      
      As mentioned by marmbrus, it is understood that the Expressions API is subject to potential change.
      
      ## How was this patch tested?
      
      The change only affects the visibility of the ObjectType class, and the existing SQL test suite still runs without error.
      
      Author: ALeksander Eskilson <alek.eskilson@cerner.com>
      
      Closes #15453 from bdrillard/master.
      [SPARK-16963][STREAMING][SQL] Changes to Source trait and related implementation classes · 5b27598f
      frreiss authored
      ## What changes were proposed in this pull request?
      
      This PR contains changes to the Source trait such that the scheduler can notify data sources when it is safe to discard buffered data. Summary of changes (a rough interface sketch follows the list):
      * Added a method `commit(end: Offset)` that tells the Source that it is OK to discard all offsets up to `end`, inclusive.
      * Changed the semantics of a `None` value for the `getBatch` method to mean "from the very beginning of the stream"; as opposed to "all data present in the Source's buffer".
      * Added notes that the upper layers of the system will never call `getBatch` with a start value less than the last value passed to `commit`.
      * Added a `lastCommittedOffset` method to allow the scheduler to query the status of each Source on restart. This addition is not strictly necessary, but it seemed like a good idea -- Sources will be maintaining their own persistent state, and there may be bugs in the checkpointing code.
      * The scheduler in `StreamExecution.scala` now calls `commit` on its stream sources after marking each batch as complete in its checkpoint.
      * `MemoryStream` now cleans committed batches out of its internal buffer.
      * `TextSocketSource` now cleans committed batches from its internal buffer.
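
      A rough sketch of the resulting Source contract, using only the methods named above (exact signatures and types in the real trait may differ):

      ```scala
      import org.apache.spark.sql.DataFrame
      import org.apache.spark.sql.execution.streaming.Offset

      trait Source {
        /** Returns the data between `start` and `end`. A `start` of None now means
          * "from the very beginning of the stream", not "everything in the buffer". */
        def getBatch(start: Option[Offset], end: Offset): DataFrame

        /** Tells the source it is OK to discard all offsets up to `end`, inclusive.
          * Upper layers never call getBatch with a start below the last committed end. */
        def commit(end: Offset): Unit

        /** Lets the scheduler query the source's committed position on restart. */
        def lastCommittedOffset: Option[Offset]
      }
      ```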
      
      ## How was this patch tested?
      Existing regression tests already exercise the new code.
      
      Author: frreiss <frreiss@us.ibm.com>
      
      Closes #14553 from frreiss/fred-16963.
      [SPARK-18126][SPARK-CORE] getIteratorZipWithIndex accepts negative value as index · a76846cf
      Miao Wang authored
      ## What changes were proposed in this pull request?
      
      `Utils.getIteratorZipWithIndex` was added to deal with more than 2147483647 records in one partition.

      However, the method `getIteratorZipWithIndex` accepts `startIndex` < 0, which leads to a negative index.

      This PR just adds a defensive check on `startIndex` to make sure it is >= 0, as sketched below.
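
      A minimal sketch of the guarded helper (the real implementation lives in `org.apache.spark.util.Utils`; this version is illustrative):

      ```scala
      // Zip an iterator with a Long index starting at `startIndex`, rejecting
      // negative starting points up front.
      def getIteratorZipWithIndex[T](iter: Iterator[T], startIndex: Long): Iterator[(T, Long)] = {
        require(startIndex >= 0, "startIndex should be >= 0.")
        new Iterator[(T, Long)] {
          private var index: Long = startIndex - 1L
          override def hasNext: Boolean = iter.hasNext
          override def next(): (T, Long) = {
            index += 1L
            (iter.next(), index)
          }
        }
      }
      ```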
      
      ## How was this patch tested?
      
      Add a new unit test.
      
      Author: Miao Wang <miaowang@Miaos-MacBook-Pro.local>
      
      Closes #15639 from wangmiao1981/zip.
      [SPARK-17157][SPARKR] Add multiclass logistic regression SparkR Wrapper · 29cea8f3
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      As we discussed in #14818, I added a separate R wrapper spark.logit for logistic regression.
      
      This single interface supports both binary and multinomial logistic regression. It also has "predict" and "summary" for binary logistic regression.
      
      ## How was this patch tested?
      
      New unit tests are added.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #15365 from wangmiao1981/glm.
      [SPARK-18094][SQL][TESTS] Move group analytics test cases from `SQLQuerySuite`... · 5b7d403c
      jiangxingbo authored
      [SPARK-18094][SQL][TESTS] Move group analytics test cases from `SQLQuerySuite` into a query file test.
      
      ## What changes were proposed in this pull request?
      
      Currently we have several test cases for group analytics (ROLLUP/CUBE/GROUPING SETS) in `SQLQuerySuite`; they are better moved into a query file test.
      The following test cases are moved to `group-analytics.sql`:
      ```
      test("rollup")
      test("grouping sets when aggregate functions containing groupBy columns")
      test("cube")
      test("grouping sets")
      test("grouping and grouping_id")
      test("grouping and grouping_id in having")
      test("grouping and grouping_id in sort")
      ```
      
      This is followup work of #15582
      
      ## How was this patch tested?
      
      Modified query file `group-analytics.sql`, which will be tested by `SQLQueryTestSuite`.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #15624 from jiangxb1987/group-analytics-test.
      [SPARK-14300][DOCS][MLLIB] Scala MLlib examples code merge and clean up · dcdda197
      Xin Ren authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-14300
      
      Duplicated code was found in scala/examples/mllib; all of the following files are deleted in this PR:
      
      - DenseGaussianMixture.scala
      - StreamingLinearRegression.scala
      
      ## delete reasons:
      
      #### delete: mllib/DenseGaussianMixture.scala
      
      - duplicate of mllib/GaussianMixtureExample
      
      #### delete: mllib/StreamingLinearRegression.scala
      
      - duplicate of mllib/StreamingLinearRegressionExample
      
      When merging and cleaning this code, be sure not to disturb the existing example on/off blocks.
      
      ## How was this patch tested?
      
      Tested manually with `SKIP_API=1 jekyll` to make sure it works well.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #12195 from keypointt/SPARK-14300.
      [SPARK-17961][SPARKR][SQL] Add storageLevel to DataFrame for SparkR · fb0a8a8d
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Add storageLevel to DataFrame for SparkR.
      This is similar to this PR: https://github.com/apache/spark/pull/13780,
      but in R I do not create a class for `StorageLevel`;
      instead I add a method `storageToString`.
      
      ## How was this patch tested?
      
      test added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #15516 from WeichenXu123/storageLevel_df_r.
      [MINOR][ML] Refactor clustering summary. · ea3605e8
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Abstract `ClusteringSummary` from `KMeansSummary`, `GaussianMixtureSummary` and `BisectingKMeansSummary`, and eliminate duplicated pieces of code. A rough sketch follows.
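
      A rough sketch of the extracted base class (constructor fields assumed for illustration; the real class may differ):

      ```scala
      import org.apache.spark.sql.DataFrame

      // Shared summary state for the clustering models, so each concrete summary
      // no longer duplicates it.
      abstract class ClusteringSummary(
          val predictions: DataFrame,   // model output on the training data
          val predictionCol: String,    // name of the prediction column
          val featuresCol: String,      // name of the features column
          val k: Int) extends Serializable {

        /** The cluster assignment column of `predictions`. */
        def cluster: DataFrame = predictions.select(predictionCol)
      }
      ```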
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15555 from yanboliang/clustering-summary.
      [SPARK-18104][DOC] Don't build KafkaSource doc · 7d10631c
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      We don't need to build docs for KafkaSource because users should go through the data source APIs; all KafkaSource APIs are internal.
      
      ## How was this patch tested?
      
      Verified manually.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15630 from zsxwing/kafka-unidoc.
      [SPARK-18063][SQL] Failed to infer constraints over multiple aliases · fa7d9d70
      jiangxingbo authored
      ## What changes were proposed in this pull request?
      
      The `UnaryNode.getAliasedConstraints` function fails to replace all expressions by their aliases when a constraint contains more than one expression to be replaced.
      For example:
      ```
      val tr = LocalRelation('a.int, 'b.string, 'c.int)
      val multiAlias = tr.where('a === 'c + 10).select('a.as('x), 'c.as('y))
      multiAlias.analyze.constraints
      ```
      currently outputs:
      ```
      ExpressionSet(Seq(
          IsNotNull(resolveColumn(multiAlias.analyze, "x")),
          IsNotNull(resolveColumn(multiAlias.analyze, "y"))
      ))
      ```
      The constraint `resolveColumn(multiAlias.analyze, "x") === resolveColumn(multiAlias.analyze, "y") + 10)` is missing.
      
      ## How was this patch tested?
      
      Add new test cases in `ConstraintPropagationSuite`.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #15597 from jiangxb1987/alias-constraints.
      [SPARK-13747][SQL] Fix concurrent executions in ForkJoinPool for SQL · 7ac70e7b
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Calling `Await.result` will allow other tasks to be run on the same thread when using ForkJoinPool. However, SQL uses a `ThreadLocal` execution id to trace Spark jobs launched by a query, which doesn't work perfectly in ForkJoinPool.
      
      This PR just uses `Awaitable.result` instead to prevent ForkJoinPool from running other tasks in the current waiting thread.
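
      A sketch of the idea described above (illustrative; modeled on the description, not necessarily the exact code):

      ```scala
      import scala.concurrent.{Awaitable, CanAwait}
      import scala.concurrent.duration.Duration

      // Unlike Await.result, calling Awaitable.result directly does not go through
      // the ForkJoinPool managed-blocking hook, so no other task gets scheduled onto
      // the waiting thread and the ThreadLocal execution id stays intact.
      def awaitResult[T](awaitable: Awaitable[T], atMost: Duration): T = {
        val awaitPermission = null.asInstanceOf[CanAwait]
        awaitable.result(atMost)(awaitPermission)
      }
      ```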
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15520 from zsxwing/SPARK-13747.
      [SPARK-17748][FOLLOW-UP][ML] Reorg variables of WeightedLeastSquares. · 312ea3f7
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      This is follow-up work of #15394.
      Reorganize some variables of `WeightedLeastSquares` and fix one minor issue in `WeightedLeastSquaresSuite`.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15621 from yanboliang/spark-17748.
      [SPARK-18093][SQL] Fix default value test in SQLConfSuite to work rega… · 4bee9540
      Mark Grover authored
      [SPARK-18093][SQL] Fix default value test in SQLConfSuite to work regardless of warehouse dir's existence
      
      ## What changes were proposed in this pull request?
      Append a trailing slash, if there isn't one already, for the sake of
      comparing the two paths. This doesn't take away from the essence of
      the check, but removes any potential mismatch due to a missing
      trailing slash.
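
      A minimal sketch of the normalization (helper name hypothetical):

      ```scala
      // Normalize both paths before comparing, so "/tmp/warehouse" and
      // "/tmp/warehouse/" compare equal.
      def withTrailingSlash(path: String): String =
        if (path.endsWith("/")) path else path + "/"

      assert(withTrailingSlash("/tmp/warehouse") == withTrailingSlash("/tmp/warehouse/"))
      ```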
      
      ## How was this patch tested?
      Ran unit tests and they passed.
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #15623 from markgrover/spark-18093.
      [SPARK-17733][SQL] InferFiltersFromConstraints rule never terminates for query · 3c023570
      jiangxingbo authored
      ## What changes were proposed in this pull request?
      
      The functions `QueryPlan.inferAdditionalConstraints` and `UnaryNode.getAliasedConstraints` can produce a non-converging set of constraints when aliased expressions reference each other recursively. For instance, if we have two constraints of the form (where a is an alias):
      `a = b, a = f(b, c)`
      applying both these rules in the next iteration would infer:
      `f(b, c) = f(f(b, c), c)`
      As this process repeats, the iteration won't converge and the set of constraints will grow larger and larger until OOM.

      ~~To fix this problem, we collect aliases from expressions and skip inferring constraints if we are to transform an `Expression` into another that contains it.~~
      To fix this problem, we apply an additional check in `inferAdditionalConstraints`: when it is possible to generate recursive constraints, we skip generating them.
      
      ## How was this patch tested?
      
      Add new testcase in `SQLQuerySuite`/`InferFiltersFromConstraintsSuite`.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #15319 from jiangxb1987/constraints.
      [SPARK-17802] Improved caller context logging. · 402205dd
      Shuai Lin authored
      ## What changes were proposed in this pull request?
      
      [SPARK-16757](https://issues.apache.org/jira/browse/SPARK-16757) sets the hadoop `CallerContext` when calling hadoop/hdfs APIs to make Spark applications more diagnosable in hadoop/hdfs logs. However, the `org.apache.hadoop.ipc.CallerContext` class has only been added since [hadoop 2.8](https://issues.apache.org/jira/browse/HDFS-9184), which is not officially released yet. So each time `utils.CallerContext.setCurrentContext()` is called (e.g. [when a task is created](https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96)), a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext"
      error is logged, which pollutes the Spark logs when there are lots of tasks.
      
      This patch improves this behaviour by only logging the `ClassNotFoundException` once.
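
      A sketch of the log-only-once pattern described (structure assumed; the real code uses Spark's logging facilities and reflection into the Hadoop class):

      ```scala
      object CallerContextSketch {
        @volatile private var classNotFoundLogged = false

        def setCurrentContext(context: String): Unit = {
          try {
            // Reflectively load the Hadoop 2.8+ class; on older Hadoop this throws.
            Class.forName("org.apache.hadoop.ipc.CallerContext")
            // ... build and install the caller context here ...
          } catch {
            case _: ClassNotFoundException =>
              if (!classNotFoundLogged) {
                println("hadoop CallerContext unavailable (requires Hadoop 2.8+); " +
                  "suppressing further warnings")
                classNotFoundLogged = true
              }
          }
        }
      }
      ```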
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Shuai Lin <linshuai2012@gmail.com>
      
      Closes #15377 from lins05/spark-17802-improve-callercontext-logging.
      [SPARK-4411][WEB UI] Add "kill" link for jobs in the UI · 5d0f81da
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
      Currently users can kill stages via the web UI but not jobs directly (jobs are killed if one of their stages is). I've added the ability to kill jobs via the web UI. This code change is based on #4823 by lianhuiwang and updated to work with the latest code, matching how stages are currently killed. In general I've copied the kill-stage code, warning and note comments and all. I also updated applicable tests and documentation.
      
      ## How was this patch tested?
      
      Manually tested and dev/run-tests
      
      ![screen shot 2016-10-11 at 4 49 43 pm](https://cloud.githubusercontent.com/assets/13952758/19292857/12f1b7c0-8fd4-11e6-8982-210249f7b697.png)
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #15441 from ajbozarth/spark4411.
      [SPARK-18027][YARN] .sparkStaging not clean on RM ApplicationNotFoundException · 29781364
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Cleanup YARN staging dir on all `KILLED`/`FAILED` paths in `monitorApplication`
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15598 from srowen/SPARK-18027.
      [SPARK-18022][SQL] java.lang.NullPointerException instead of real exception when saving DF to MySQL · 6c7d094e
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      When the next exception in a JDBC `SQLException` chain is null, don't attach it as the cause or as a suppressed exception.
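
      A sketch of the guard described (shape assumed; illustrative rather than the actual JDBC write path):

      ```scala
      import java.sql.SQLException

      // Guard against a null getNextException: attaching null can mask the real
      // SQLException (e.g. with a NullPointerException) instead of surfacing it.
      def attachNextException(e: SQLException): SQLException = {
        val next = e.getNextException
        if (next != null) {
          if (e.getCause == null) e.initCause(next) else e.addSuppressed(next)
        }
        e
      }
      ```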
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15599 from srowen/SPARK-18022.
      [SPARK-17693][SQL] Fixed Insert Failure To Data Source Tables when the Schema has the Comment Field · 93b8ad18
      gatorsmile authored
      ### What changes were proposed in this pull request?
      ```SQL
      CREATE TABLE tab1(col1 int COMMENT 'a', col2 int) USING parquet
      INSERT INTO TABLE tab1 SELECT 1, 2
      ```
      The insert attempt will fail if the target table has a column with comments. The error is confusing to external users:
      ```
      assertion failed: No plan for InsertIntoTable Relation[col1#15,col2#16] parquet, false, false
      +- Project [1 AS col1#19, 2 AS col2#20]
         +- OneRowRelation$
      ```
      
      This PR is to fix the above bug by checking the metadata when comparing the schema between the table and the query. If not matched, we also copy the metadata. This is an alternative to https://github.com/apache/spark/pull/15266
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15615 from gatorsmile/insertDataSourceTableWithCommentSolution2.
  3. Oct 25, 2016
      [SPARK-18007][SPARKR][ML] update SparkR MLP - add initialWeights parameter · 12b3e8d2
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Update SparkR MLP: add the initialWeights parameter.
      
      ## How was this patch tested?
      
      test added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #15552 from WeichenXu123/mlp_r_add_initialWeight_param.
      [SPARK-16988][SPARK SHELL] spark history server log needs to be fixed to show... · c329a568
      hayashidac authored
      [SPARK-16988][SPARK SHELL] spark history server log needs to be fixed to show https url when ssl is enabled
      
      Fix the Spark history server log to show the https URL when SSL is enabled.
      
      Author: chie8842 <chie@chie-no-Mac-mini.local>
      
      Closes #15611 from hayashidac/SPARK-16988.
      [SPARK-18019][ML] Add instrumentation to GBTs · 2c7394ad
      sethah authored
      ## What changes were proposed in this pull request?
      
      Add instrumentation for logging in ML GBT, part of umbrella ticket [SPARK-14567](https://issues.apache.org/jira/browse/SPARK-14567)
      
      ## How was this patch tested?
      
      Tested locally:
      
      ````
      16/10/20 10:24:51 INFO Instrumentation: GBTRegressor-gbtr_2b460d3e2e93-1207021668-45: training: numPartitions=1 storageLevel=StorageLevel(1 replicas)
      16/10/20 10:24:51 INFO Instrumentation: GBTRegressor-gbtr_2b460d3e2e93-1207021668-45: {"maxIter":1}
      16/10/20 10:24:51 INFO Instrumentation: GBTRegressor-gbtr_2b460d3e2e93-1207021668-45: {"numFeatures":2}
      16/10/20 10:24:51 INFO Instrumentation: GBTRegressor-gbtr_2b460d3e2e93-1207021668-45: {"numClasses":0}
      ...
      16/10/20 15:54:21 INFO Instrumentation: GBTRegressor-gbtr_065fad465377-1922077832-22: training finished
      ````
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15574 from sethah/gbt_instr.
      [SPARK-18070][SQL] binary operator should not consider nullability when comparing input types · a21791e3
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      A binary operator requires its inputs to be of the same type, but it should not consider nullability; e.g. `EqualTo` should be able to compare an element-nullable array with an element-non-nullable array.
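
      A sketch of nullability-insensitive type comparison (illustrative recursion, not Spark's actual helper):

      ```scala
      import org.apache.spark.sql.types._

      // Two types are "the same" if they match once element/value/field nullability
      // flags are ignored at every level.
      def sameTypeIgnoringNullability(left: DataType, right: DataType): Boolean =
        (left, right) match {
          case (ArrayType(le, _), ArrayType(re, _)) =>
            sameTypeIgnoringNullability(le, re)
          case (MapType(lk, lv, _), MapType(rk, rv, _)) =>
            sameTypeIgnoringNullability(lk, rk) && sameTypeIgnoringNullability(lv, rv)
          case (StructType(lf), StructType(rf)) =>
            lf.length == rf.length && lf.zip(rf).forall { case (f1, f2) =>
              f1.name == f2.name && sameTypeIgnoringNullability(f1.dataType, f2.dataType)
            }
          case _ => left == right
        }
      ```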
      
      ## How was this patch tested?
      
      a regression test in `DataFrameSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15606 from cloud-fan/type-bug.
      [SPARK-18010][CORE] Reduce work performed for building up the application list... · c5fe3dd4
      Vinayak authored
      [SPARK-18010][CORE] Reduce work performed for building up the application list for the History Server app list UI page
      
      ## What changes were proposed in this pull request?
      Allow `ReplayListenerBus` to skip deserializing and replaying certain events using an inexpensive check of the event log entry. Use this to ensure that when event log replay is triggered for building the application list, the `ReplayListenerBus` skips over all but the few events needed for our immediate purpose. Refer to [SPARK-18010] for the motivation behind this change.
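
      A sketch of the inexpensive pre-parse check (the event class names are real SparkListener events; the selection logic here is illustrative):

      ```scala
      // Decide from the raw JSON line whether an event is worth deserializing at all
      // when we only need application-level metadata for the app list page.
      def maybeNeededForAppList(eventJsonLine: String): Boolean =
        eventJsonLine.contains("SparkListenerApplicationStart") ||
        eventJsonLine.contains("SparkListenerApplicationEnd") ||
        eventJsonLine.contains("SparkListenerLogStart")

      // Only lines passing the cheap check are handed to the JSON parser and bus.
      ```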
      
      ## How was this patch tested?
      
      Tested with existing HistoryServer and ReplayListener unit test suites. All tests pass.
      
      Author: Vinayak <vijoshi5@in.ibm.com>
      
      Closes #15556 from vijoshi/SAAS-467_master.
      [SPARK-17748][FOLLOW-UP][ML] Fix build error for Scala 2.10. · ac8ff920
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      #15394 introduced a build error for Scala 2.10; this PR fixes it.
      
      ## How was this patch tested?
      Existing test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15625 from yanboliang/spark-17748-scala.
      [SPARK-14634][ML][FOLLOWUP] Delete superfluous line in BisectingKMeans · 38cdd6cc
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      As commented by jkbradley in https://github.com/apache/spark/pull/12394, `model.setSummary(summary)` is superfluous.
      
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15619 from zhengruifeng/del_superfluous.
      [SPARK-18026][SQL] should not always lowercase partition columns of partition spec in parser · 6f31833d
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Currently we always lowercase the partition columns of a partition spec in the parser, with the assumption that table partition columns are always lowercased.

      However, this is not true for data source tables, which are case preserving. It's safe for now because data source tables don't store the partition spec in the metastore and don't support `ADD PARTITION`, `DROP PARTITION`, `RENAME PARTITION`, but we should make our code future-proof.

      This PR makes the partition spec case-preserving in the parser, and improves the `PreprocessTableInsertion` analyzer rule to normalize the partition columns in the partition spec w.r.t. the table partition columns.
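
      A sketch of the normalization step (modeled on the description; the actual rule lives in the analyzer, and the helper here is illustrative):

      ```scala
      // Map each user-written partition column in the spec onto the table's declared
      // partition columns, using the session's resolver (case-sensitive or not).
      type Resolver = (String, String) => Boolean

      def normalizePartitionSpec(
          spec: Map[String, String],
          tablePartCols: Seq[String],
          resolver: Resolver): Map[String, String] = {
        spec.map { case (key, value) =>
          val normalizedKey = tablePartCols.find(resolver(_, key)).getOrElse {
            throw new IllegalArgumentException(s"$key is not a partition column")
          }
          normalizedKey -> value
        }
      }
      ```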
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15566 from cloud-fan/partition-spec.
      [SPARK-17748][ML] One pass solver for Weighted Least Squares with ElasticNet · 78d740a0
      sethah authored
      ## What changes were proposed in this pull request?
      
      1. Make a pluggable solver interface for `WeightedLeastSquares`
      2. Add a `QuasiNewton` solver to handle elastic net regularization for `WeightedLeastSquares`
      3. Add method `BLAS.dspmv` used by QN solver
      4. Add mechanism for WLS to handle singular covariance matrices by falling back to QN solver when Cholesky fails.
      
      ## How was this patch tested?
      Unit tests - see below.
      
      ## Design choices
      
      **Pluggable Normal Solver**
      
      Before, the `WeightedLeastSquares` package always used the Cholesky decomposition solver to compute the solution to the normal equations. Now, we specify the solver as a constructor argument to the `WeightedLeastSquares`. We introduce a new trait:
      
      ````scala
      private[ml] sealed trait NormalEquationSolver {
      
        def solve(
            bBar: Double,
            bbBar: Double,
            abBar: DenseVector,
            aaBar: DenseVector,
            aBar: DenseVector): NormalEquationSolution
      }
      ````
      
      We extend this trait for different variants of normal equation solvers. In the future, we can easily add others (like QR) using this interface.
      
      **Always train in the standardized space**
      
      The normal solver did not previously standardize the data, but this patch introduces a change such that we always solve the normal equations in the standardized space. We convert back to the original space in the same way that is done for distributed L-BFGS/OWL-QN. We add test cases for zero variance features/labels.
      
      **Use L-BFGS locally to solve normal equations for singular matrix**
      
      When linear regression with the normal solver is called for a singular matrix, we initially try to solve with Cholesky. We use the output of `lapack.dppsv` to determine if the matrix is singular. If it is, we fall back to using L-BFGS locally to solve the normal equations. We add test cases for this as well.
      
      ## Test cases
      I found it helpful to enumerate some of the test cases and hopefully it makes review easier.
      
      **WeightedLeastSquares**
      
      1. Constant columns - Cholesky solver fails with no regularization, Auto solver falls back to QN, and QN trains successfully.
      2. Collinear features - Cholesky solver fails with no regularization, Auto solver falls back to QN, and QN trains successfully.
      3. Label is constant zero - no training is performed regardless of intercept. Coefficients are zero and intercept is zero.
      4. Label is constant - if fitIntercept, then no training is performed and intercept equals label mean. If not fitIntercept, then we train and return an answer that matches R's lm package.
      5. Test with L1 - go through various combinations of L1/L2, standardization, fitIntercept and verify that output matches glmnet.
      6. Initial intercept - verify that setting the initial intercept to label mean is correct by training model with strong L1 regularization so that all coefficients are zero and intercept converges to label mean.
      7. Test diagInvAtWA - since we are standardizing features now during training, we should test that the inverse is computed to match R.
      
      **LinearRegression**
      1. For all existing L1 test cases, test the "normal" solver too.
      2. Check that using the normal solver now handles singular matrices.
      3. Check that using the normal solver with L1 produces an objective history in the model summary, but does not produce the inverse of AtA.
      
      **BLAS**
      1. Test new method `dspmv`.
      
      ## Performance Testing
      This patch will speed up linear regression with L1/elasticnet penalties when the feature size is < 4096. I have not conducted performance tests at scale, only observed by testing locally that there is a speed improvement.
      
      We should decide if this PR needs to be blocked before performance testing is conducted.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15394 from sethah/SPARK-17748.
  4. Oct 24, 2016
      [SPARK-17894][HOTFIX] Fix broken build from · 483c37c5
      Kay Ousterhout authored
      The named parameter in an overridden class isn't supported in Scala 2.10 and so was breaking the build.
      
      cc zsxwing
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #15617 from kayousterhout/hotfix.
      [SPARK-17409][SQL][FOLLOW-UP] Do Not Optimize Query in CTAS More Than Once · d479c526
      gatorsmile authored
      ### What changes were proposed in this pull request?
      This follow-up PR is for addressing the [comment](https://github.com/apache/spark/pull/15048).
      
      We added two test cases based on the suggestion from yhuai. One is a new test case using the `saveAsTable` API to create a data source table. Another is for CTAS on a Hive serde table.
      
      Note: No need to backport this PR to 2.0. Will submit a new PR to backport the whole fix with new test cases to Spark 2.0
      
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15459 from gatorsmile/ctasOptimizedTestCases.
      [SPARK-18028][SQL] simplify TableFileCatalog · 84a33999
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Simplify/cleanup TableFileCatalog:
      
      1. pass a `CatalogTable` instead of `databaseName` and `tableName` into `TableFileCatalog`, so that we don't need to fetch table metadata from metastore again
      2. In `TableFileCatalog.filterPartitions0`, DO NOT set `PartitioningAwareFileCatalog.BASE_PATH_PARAM`. According to the [classdoc](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L189-L209), the default value of `basePath` already satisfies our need. What's more, if we set this parameter, we may break case 2 mentioned in the classdoc.
      3. add `equals` and `hashCode` to `TableFileCatalog`
      4. add `SessionCatalog.listPartitionsByFilter` which handles case sensitivity.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15568 from cloud-fan/table-file-catalog.
      [SPARK-17624][SQL][STREAMING][TEST] Fixed flaky StateStoreSuite.maintenance · 407c3ced
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      The reason for the flakiness was follows. The test starts the maintenance background thread, and then writes 20 versions of the state store. The maintenance thread is expected to create snapshots in the middle, and clean up old files that are not needed any more. The earliest delta file (1.delta) is expected to be deleted as snapshots will ensure that the earliest delta would not be needed.
      
      However, the default configuration for the maintenance thread is to retain files such that last 2 versions can be recovered, and delete the rest. Now while generating the versions, the maintenance thread can kick in and create snapshots anywhere between version 10 and 20 (at least 10 deltas needed for snapshot). Then later it will choose to retain only version 20 and 19 (last 2). There are two cases.
      
      - Common case: One of the version between 10 and 19 gets snapshotted. Then recovering versions 19 and 20 just needs 19.snapshot and 20.delta, so 1.delta gets deleted.
      
      - Uncommon case (reason for flakiness): Only version 20 gets snapshotted. Then recovering version 20 requires 20.snapshot, and recovering version 19 requires all the previous 19...1.delta files. So 1.delta does not get deleted.

      This PR rearranges the checks such that the test creates 20 versions, then waits until there is at least one snapshot, then creates another 20. This ensures that the latest 2 versions cannot require anything older than the first snapshot generated, and therefore 1.delta will be deleted.
      
      In addition, I have added more logs, and comments that I felt would help future debugging and understanding what is going on.
      
      ## How was this patch tested?
      
      Ran the StateStoreSuite > 6K times in a heavily loaded machine (10 instances of tests running in parallel). No failures.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #15592 from tdas/SPARK-17624.
      [SPARK-17894][CORE] Ensure uniqueness of TaskSetManager name. · 81d6933e
      Eren Avsarogullari authored
      `TaskSetManager` should have a unique name to avoid adding duplicate ones to the parent `Pool` via `SchedulableBuilder`. This problem surfaced in the following discussion: [[PR: Avoid adding duplicate schedulables]](https://github.com/apache/spark/pull/15326)
      
      **Proposal** :
      There is a 1:1 relationship between `stageAttemptId` and `TaskSetManager`, so `taskSet.id`, covering both `stageId` and `stageAttemptId`, should be used for the uniqueness of the `TaskSetManager` name instead of just `stageId`.
      
      **Current TaskSetManager Name**:
      `var name = "TaskSet_" + taskSet.stageId.toString`
      **Sample**: TaskSet_0

      **Proposed TaskSetManager Name**:
      `val name = "TaskSet_" + taskSet.id` // `taskSet.id = stageId + "." + stageAttemptId`
      **Sample**: TaskSet_0.0
      
      Added new Unit Test.
      
      Author: erenavsarogullari <erenavsarogullari@gmail.com>
      
      Closes #15463 from erenavsarogullari/SPARK-17894.
      [SPARK-17810][SQL] Default spark.sql.warehouse.dir is relative to local FS but... · 4ecbe1b9
      Sean Owen authored
      [SPARK-17810][SQL] Default spark.sql.warehouse.dir is relative to local FS but can resolve as HDFS path
      
      ## What changes were proposed in this pull request?
      
      Always resolve `spark.sql.warehouse.dir` as a local path, and as relative to the working dir, not the home dir.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15382 from srowen/SPARK-17810.