  1. Oct 30, 2016
  2. Oct 28, 2016
• [SPARK-18167][SQL] Add debug code for SQLQuerySuite flakiness when metastore... · d2d438d1
      Eric Liang authored
      [SPARK-18167][SQL] Add debug code for SQLQuerySuite flakiness when metastore partition pruning is enabled
      
      ## What changes were proposed in this pull request?
      
      org.apache.spark.sql.hive.execution.SQLQuerySuite is flaking when hive partition pruning is enabled.
      Based on the stack traces, it seems to be an old issue where Hive fails to cast a numeric partition column ("Invalid character string format for type DECIMAL"). There are two possibilities here: either we are somehow corrupting the partition table to have non-decimal values in that column, or there is a transient issue with Derby.
      
      This PR logs the result of the retry when this exception is encountered, so we can confirm what is going on.
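
A rough sketch of the kind of debug hook this adds (names and structure here are illustrative, not the PR's actual code):

```scala
import scala.util.Try

// Hypothetical sketch: when the Hive DECIMAL cast error appears, retry once and
// log what the retry returns, to tell a corrupted partition table apart from a
// transient Derby problem.
def pruneWithDebug[T](prune: () => T): T =
  try prune() catch {
    case e: Exception if Option(e.getMessage).exists(_.contains("Invalid character string format")) =>
      val retried = Try(prune())
      println(s"Partition pruning failed ($e); retry returned: $retried") // stand-in for logWarning
      throw e
  }
```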
      
      ## How was this patch tested?
      
      n/a
      
      cc yhuai
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15676 from ericl/spark-18167.
      d2d438d1
• [SPARK-18164][SQL] ForeachSink should fail the Spark job if `process` throws exception · 59cccbda
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Fixed the issue that ForeachSink didn't rethrow the exception.
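
A minimal repro sketch, assuming a local socket source purely for illustration: before this fix, an exception thrown from `process` was swallowed and the query appeared healthy; after it, the Spark job fails.

```scala
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

val spark = SparkSession.builder().master("local[2]").appName("repro").getOrCreate()

val query = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999)
  .load()
  .writeStream
  .foreach(new ForeachWriter[Row] {
    override def open(partitionId: Long, version: Long): Boolean = true
    override def process(value: Row): Unit =
      throw new RuntimeException("error") // must now fail the streaming query
    override def close(errorOrNull: Throwable): Unit = {}
  })
  .start()
```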
      
      ## How was this patch tested?
      
      The fixed unit test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15674 from zsxwing/foreach-sink-error.
      59cccbda
• [SPARK-5992][ML] Locality Sensitive Hashing · ac26e9cf
      Yunni authored
      ## What changes were proposed in this pull request?
      
      Implement Locality Sensitive Hashing along with approximate nearest neighbors and approximate similarity join based on the [design doc](https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit).
      
      Detailed changes are as follows:
      (1) Implement abstract LSH, LSHModel classes as Estimator-Model
(2) Implement approxNearestNeighbors and approxSimilarityJoin in the abstract LSHModel (see the usage sketch below)
      (3) Implement Random Projection as LSH subclass for Euclidean distance, Min Hash for Jaccard Distance
      (4) Implement unit test utility methods including checkLshProperty, checkNearestNeighbor and checkSimilarityJoin
      
      Things that will be implemented in a follow-up PR:
       - Bit Sampling for Hamming Distance, SignRandomProjection for Cosine Distance
       - PySpark Integration for the scala classes and methods.
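
A hypothetical usage sketch of the API described in (2) and (3); the class name follows the PR description, while `setBucketLength`, `datasetA` and `datasetB` are assumptions:

```scala
import org.apache.spark.ml.feature.RandomProjection
import org.apache.spark.ml.linalg.Vectors

// datasetA and datasetB are assumed DataFrames with a "features" vector column.
val rp = new RandomProjection()
  .setInputCol("features")
  .setOutputCol("hashes")
  .setBucketLength(2.0) // assumed parameter

val model = rp.fit(datasetA)

// Approximate nearest neighbors of a key point.
val key = Vectors.dense(1.0, 2.0)
val neighbors = model.approxNearestNeighbors(datasetA, key, numNearestNeighbors = 5)

// Approximate similarity join within a distance threshold.
val joined = model.approxSimilarityJoin(datasetA, datasetB, threshold = 2.5)
```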
      
      ## How was this patch tested?
Unit tests are implemented for all of the new classes and algorithms. A scalability test on Uber's dataset was performed internally.
      
      Tested the methods on [WEX dataset](https://aws.amazon.com/items/2345) from AWS, with the steps and results [here](https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro/edit).
      
      ## References
      Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. "Similarity search in high dimensions via hashing." VLDB 7 Sep. 1999: 518-529.
      Wang, Jingdong et al. "Hashing for similarity search: A survey." arXiv preprint arXiv:1408.2927 (2014).
      
      Author: Yunni <Euler57721@gmail.com>
      Author: Yun Ni <yunn@uber.com>
      
      Closes #15148 from Yunni/SPARK-5992-yunn-lsh.
      ac26e9cf
• [SPARK-18133][EXAMPLES][ML] Python ML Pipeline Example has syntax errors · e9746f87
      Jagadeesan authored
      ## What changes were proposed in this pull request?
      
In Python 3, there is only one integer type (i.e., int), which mostly behaves like the long type in Python 2. Since Python 3 won't accept the "L" suffix, it has been removed from all examples.
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: Jagadeesan <as2@us.ibm.com>
      
      Closes #15660 from jagadeesanas2/SPARK-18133.
      e9746f87
• [SPARK-18109][ML] Add instrumentation to GMM · 569788a5
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      Add instrumentation to GMM
      
      ## How was this patch tested?
      
      Test in spark-shell
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15636 from zhengruifeng/gmm_instr.
      569788a5
  3. Oct 27, 2016
• [SPARK-18121][SQL] Unable to query global temp views when hive support is enabled · ab5f938b
      Sunitha Kambhampati authored
      ## What changes were proposed in this pull request?
      
Issue:
Querying a global temp view throws a "Table or view not found" exception.

Fix:
Update lookupRelation in HiveSessionCatalog to check for global temp views, similar to SessionCatalog.lookupRelation.

Before fix:
Querying a global temp view (e.g. `SELECT * FROM global_temp.v1`) throws a "Table or view not found" exception.

After fix:
The query succeeds and returns the right result.
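
A minimal repro sketch (the view name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").enableHiveSupport().getOrCreate()
spark.range(10).createGlobalTempView("v1")
// Threw "Table or view not found" before the fix; returns rows 0-9 after it.
spark.sql("SELECT * FROM global_temp.v1").show()
```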
      
      ## How was this patch tested?
      - Two unit tests are added to check for global temp view for the code path when hive support is enabled.
      - Regression unit tests were run successfully. ( build/sbt -Phive hive/test, build/sbt sql/test, build/sbt catalyst/test)
      
      Author: Sunitha Kambhampati <skambha@us.ibm.com>
      
      Closes #15649 from skambha/lookuprelationChanges.
      ab5f938b
• [SPARK-17970][SQL] store partition spec in metastore for data source table · ccb11543
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
We should follow Hive tables and also store the partition spec in the metastore for data source tables.
This brings two benefits:

1. It's more flexible to manage the table's data files, as users can use `ADD PARTITION`, `DROP PARTITION` and `RENAME PARTITION` (see the sketch below)
2. We don't need to cache all file statuses for data source tables anymore.
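
For example, benefit 1 enables standard partition DDL against data source tables; assuming a SparkSession `spark` and a hypothetical partitioned table `logs`:

```scala
spark.sql("ALTER TABLE logs ADD PARTITION (dt='2016-10-27')")
spark.sql("ALTER TABLE logs DROP PARTITION (dt='2016-10-26')")
spark.sql("ALTER TABLE logs PARTITION (dt='2016-10-27') RENAME TO PARTITION (dt='2016-10-28')")
```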
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Eric Liang <ekl@databricks.com>
      Author: Michael Allman <michael@videoamp.com>
      Author: Eric Liang <ekhliang@gmail.com>
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15515 from cloud-fan/partition.
      ccb11543
• [SPARK-16963][SQL] Fix test "StreamExecution metadata garbage collection" · 79fd0cc0
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
A follow-up PR for #14553 to fix the flaky test. It's flaky because the file list API doesn't guarantee any ordering of the returned list.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15661 from zsxwing/fix-StreamingQuerySuite.
      79fd0cc0
• [SPARK-17219][ML] enhanced NaN value handling in Bucketizer · 0b076d4c
      VinceShieh authored
      ## What changes were proposed in this pull request?
      
This PR is an enhancement of the PR with commit ID 57dc326b.
NaN is a special value that is commonly treated as invalid, but there are cases where NaN values are meaningful and need special handling. When dealing with NaN values, users now have three options: reserve an extra bucket for NaN values, remove the NaN values, or report an error, by setting handleNaN to "keep", "skip", or "error" (the default), respectively.
      
**Before:**
```scala
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
```
**After:**
```scala
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
  .setHandleNaN("keep")
```
      
      ## How was this patch tested?
      Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite
      
Signed-off-by: VinceShieh <vincent.xie@intel.com>
      
      Author: VinceShieh <vincent.xie@intel.com>
      Author: Vincent Xie <vincent.xie@intel.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #15428 from VinceShieh/spark-17219_followup.
      0b076d4c
• [SPARK-17813][SQL][KAFKA] Maximum data per trigger · 10423258
      cody koeninger authored
      ## What changes were proposed in this pull request?
      
Adds a `maxOffsetsPerTrigger` option for rate limiting, apportioned proportionally to the volume of the different topic-partitions.
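
A usage sketch, assuming a SparkSession `spark` (the broker address and topic are placeholders):

```scala
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic1")
  .option("maxOffsetsPerTrigger", "10000") // cap per micro-batch, split
  .load()                                  // proportionally across topic-partitions
```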
      
      ## How was this patch tested?
      
      Added unit test
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #15527 from koeninger/SPARK-17813.
      10423258
• [SPARK-CORE][TEST][MINOR] Fix the wrong comment in test · 701a9d36
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
While reading the core scheduler code, I found two lines of incorrect comments. This PR simply corrects them.
      
      ## How was this patch tested?
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #15631 from wangmiao1981/Rbug.
      701a9d36
• [SQL][DOC] updating doc for JSON source to link to jsonlines.org · 44c8bfda
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      API and programming guide doc changes for Scala, Python and R.
      
      ## How was this patch tested?
      
      manual test
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15629 from felixcheung/jsondoc.
      44c8bfda
• [SPARK-17157][SPARKR][FOLLOW-UP] doc fixes · 1dbe9896
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
a couple of small, late-discovered doc fixes
      
      ## How was this patch tested?
      
      manually
      wangmiao1981
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15650 from felixcheung/logitfix.
      1dbe9896
• [SPARK-18132] Fix checkstyle · d3b4831d
      Yin Huai authored
      This PR fixes checkstyle.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #15656 from yhuai/fix-format.
      d3b4831d
• [SPARK-18009][SQL] Fix ClassCastException while calling toLocalIterator() on... · dd4f088c
      Dilip Biswal authored
      [SPARK-18009][SQL] Fix ClassCastException while calling toLocalIterator() on dataframe produced by RunnableCommand
      
      ## What changes were proposed in this pull request?
A short code snippet that uses toLocalIterator() on a DataFrame produced by a RunnableCommand
reproduces the problem. toLocalIterator() is called by the Thrift server when
`spark.sql.thriftServer.incrementalCollect` is set, to handle queries that produce a large
result set.
      
      **Before**
```scala
      scala> spark.sql("show databases")
      res0: org.apache.spark.sql.DataFrame = [databaseName: string]
      
      scala> res0.toLocalIterator()
      16/10/26 03:00:24 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
      java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow
      ```
      
      **After**
```scala
      scala> spark.sql("drop database databases")
      res30: org.apache.spark.sql.DataFrame = []
      
      scala> spark.sql("show databases")
      res31: org.apache.spark.sql.DataFrame = [databaseName: string]
      
      scala> res31.toLocalIterator().asScala foreach println
      [default]
      [parquet]
      ```
      ## How was this patch tested?
      Added a test in DDLSuite
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #15642 from dilipbiswal/SPARK-18009.
      dd4f088c
  4. Oct 26, 2016
• [SPARK-17770][CATALYST] making ObjectType public · f1aeed8b
      ALeksander Eskilson authored
      ## What changes were proposed in this pull request?
      
      In order to facilitate the writing of additional Encoders, I proposed opening up the ObjectType SQL DataType. This DataType is used extensively in the JavaBean Encoder, but would also be useful in writing other custom encoders.
      
      As mentioned by marmbrus, it is understood that the Expressions API is subject to potential change.
      
      ## How was this patch tested?
      
      The change only affects the visibility of the ObjectType class, and the existing SQL test suite still runs without error.
      
      Author: ALeksander Eskilson <alek.eskilson@cerner.com>
      
      Closes #15453 from bdrillard/master.
      f1aeed8b
• [SPARK-16963][STREAMING][SQL] Changes to Source trait and related implementation classes · 5b27598f
      frreiss authored
      ## What changes were proposed in this pull request?
      
      This PR contains changes to the Source trait such that the scheduler can notify data sources when it is safe to discard buffered data. Summary of changes:
* Added a method `commit(end: Offset)` that tells the Source it is OK to discard all offsets up to `end`, inclusive (see the sketch after this list).
      * Changed the semantics of a `None` value for the `getBatch` method to mean "from the very beginning of the stream"; as opposed to "all data present in the Source's buffer".
      * Added notes that the upper layers of the system will never call `getBatch` with a start value less than the last value passed to `commit`.
      * Added a `lastCommittedOffset` method to allow the scheduler to query the status of each Source on restart. This addition is not strictly necessary, but it seemed like a good idea -- Sources will be maintaining their own persistent state, and there may be bugs in the checkpointing code.
      * The scheduler in `StreamExecution.scala` now calls `commit` on its stream sources after marking each batch as complete in its checkpoint.
      * `MemoryStream` now cleans committed batches out of its internal buffer.
      * `TextSocketSource` now cleans committed batches from its internal buffer.
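
A simplified sketch of the revised contract (the real trait has more members; signatures follow the description above, not the merged code):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Offset

trait Source {
  /** Data between `start` (exclusive) and `end` (inclusive). A `start` of None
    * now means "from the very beginning of the stream", not "everything
    * currently buffered". */
  def getBatch(start: Option[Offset], end: Offset): DataFrame

  /** Tells the source it is OK to discard all offsets up to `end`, inclusive.
    * The scheduler never calls getBatch with a start below this point. */
  def commit(end: Offset): Unit

  /** Highest committed offset, so the scheduler can query status on restart. */
  def lastCommittedOffset: Option[Offset]
}
```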
      
      ## How was this patch tested?
      Existing regression tests already exercise the new code.
      
      Author: frreiss <frreiss@us.ibm.com>
      
      Closes #14553 from frreiss/fred-16963.
      5b27598f
• [SPARK-18126][SPARK-CORE] getIteratorZipWithIndex accepts negative value as index · a76846cf
      Miao Wang authored
      ## What changes were proposed in this pull request?
      
`Utils.getIteratorZipWithIndex` was added to deal with more than 2147483647 records in one partition.

The method `getIteratorZipWithIndex` accepts a `startIndex` < 0, which leads to a negative index.

This PR just adds a defensive check on `startIndex` to make sure it is >= 0.
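
A minimal sketch of the check, simplified from `Utils.getIteratorZipWithIndex`:

```scala
def getIteratorZipWithIndex[T](iter: Iterator[T], startIndex: Long): Iterator[(T, Long)] = {
  require(startIndex >= 0, "startIndex should be >= 0.")
  new Iterator[(T, Long)] {
    // A Long index avoids overflow past 2147483647 records.
    private var index: Long = startIndex - 1L
    def hasNext: Boolean = iter.hasNext
    def next(): (T, Long) = {
      index += 1
      (iter.next(), index)
    }
  }
}
```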
      
      ## How was this patch tested?
      
      Add a new unit test.
      
      Author: Miao Wang <miaowang@Miaos-MacBook-Pro.local>
      
      Closes #15639 from wangmiao1981/zip.
      a76846cf
• [SPARK-17157][SPARKR] Add multiclass logistic regression SparkR Wrapper · 29cea8f3
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      As we discussed in #14818, I added a separate R wrapper spark.logit for logistic regression.
      
      This single interface supports both binary and multinomial logistic regression. It also has "predict" and "summary" for binary logistic regression.
      
      ## How was this patch tested?
      
      New unit tests are added.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #15365 from wangmiao1981/glm.
      29cea8f3
• [SPARK-18094][SQL][TESTS] Move group analytics test cases from `SQLQuerySuite`... · 5b7d403c
      jiangxingbo authored
      [SPARK-18094][SQL][TESTS] Move group analytics test cases from `SQLQuerySuite` into a query file test.
      
      ## What changes were proposed in this pull request?
      
Currently we have several test cases for group analytics (ROLLUP/CUBE/GROUPING SETS) in `SQLQuerySuite`; they are better moved into a query file test.
      The following test cases are moved to `group-analytics.sql`:
      ```
      test("rollup")
      test("grouping sets when aggregate functions containing groupBy columns")
      test("cube")
      test("grouping sets")
      test("grouping and grouping_id")
      test("grouping and grouping_id in having")
      test("grouping and grouping_id in sort")
      ```
      
      This is followup work of #15582
      
      ## How was this patch tested?
      
      Modified query file `group-analytics.sql`, which will be tested by `SQLQueryTestSuite`.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #15624 from jiangxb1987/group-analytics-test.
      5b7d403c
• [SPARK-14300][DOCS][MLLIB] Scala MLlib examples code merge and clean up · dcdda197
      Xin Ren authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-14300
      
Duplicated code was found in scala/examples/mllib; the following files are all deleted in this PR:
      
      - DenseGaussianMixture.scala
      - StreamingLinearRegression.scala
      
      ## delete reasons:
      
      #### delete: mllib/DenseGaussianMixture.scala
      
      - duplicate of mllib/GaussianMixtureExample
      
      #### delete: mllib/StreamingLinearRegression.scala
      
      - duplicate of mllib/StreamingLinearRegressionExample
      
When merging and cleaning this code, be careful not to disturb the existing example on/off blocks.
      
      ## How was this patch tested?
      
      Test with `SKIP_API=1 jekyll` manually to make sure that works well.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #12195 from keypointt/SPARK-14300.
      dcdda197
• [SPARK-17961][SPARKR][SQL] Add storageLevel to DataFrame for SparkR · fb0a8a8d
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Add storageLevel to DataFrame for SparkR.
This is similar to this PR: https://github.com/apache/spark/pull/13780,
but in R I do not create a class for `StorageLevel`;
instead I add a method `storageToString`.
      
      ## How was this patch tested?
      
      test added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #15516 from WeichenXu123/storageLevel_df_r.
      fb0a8a8d
• [MINOR][ML] Refactor clustering summary. · ea3605e8
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
Abstract `ClusteringSummary` from `KMeansSummary`, `GaussianMixtureSummary` and `BisectingKMeansSummary`, and eliminate duplicated pieces of code.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15555 from yanboliang/clustering-summary.
      ea3605e8
• [SPARK-18104][DOC] Don't build KafkaSource doc · 7d10631c
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
We don't need to build docs for KafkaSource because users should access it through the data source APIs; all KafkaSource APIs are internal.
      
      ## How was this patch tested?
      
      Verified manually.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15630 from zsxwing/kafka-unidoc.
      7d10631c
• [SPARK-18063][SQL] Failed to infer constraints over multiple aliases · fa7d9d70
      jiangxingbo authored
      ## What changes were proposed in this pull request?
      
The `UnaryNode.getAliasedConstraints` function fails to replace all expressions with their aliases when the constraints contain more than one expression to be replaced.
      For example:
      ```
      val tr = LocalRelation('a.int, 'b.string, 'c.int)
      val multiAlias = tr.where('a === 'c + 10).select('a.as('x), 'c.as('y))
      multiAlias.analyze.constraints
      ```
      currently outputs:
      ```
ExpressionSet(Seq(
    IsNotNull(resolveColumn(multiAlias.analyze, "x")),
    IsNotNull(resolveColumn(multiAlias.analyze, "y"))
))
      ```
The constraint `resolveColumn(multiAlias.analyze, "x") === resolveColumn(multiAlias.analyze, "y") + 10` is missing.
      
      ## How was this patch tested?
      
      Add new test cases in `ConstraintPropagationSuite`.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #15597 from jiangxb1987/alias-constraints.
      fa7d9d70
• [SPARK-13747][SQL] Fix concurrent executions in ForkJoinPool for SQL · 7ac70e7b
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Calling `Await.result` will allow other tasks to be run on the same thread when using ForkJoinPool. However, SQL uses a `ThreadLocal` execution id to trace Spark jobs launched by a query, which doesn't work perfectly in ForkJoinPool.
      
This PR just uses `Awaitable.result` instead to prevent ForkJoinPool from running other tasks in the current waiting thread.
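
A sketch of the idea, close to the `ThreadUtils.awaitResult` helper this PR introduces (treat the details as an approximation):

```scala
import scala.concurrent.{Awaitable, CanAwait}
import scala.concurrent.duration.Duration

// Await.result wraps the wait in `blocking { ... }`, which lets a ForkJoinPool
// run other tasks on the waiting thread and corrupts SQL's ThreadLocal
// execution id; calling Awaitable.result directly skips the blocking wrapper.
def awaitResult[T](awaitable: Awaitable[T], atMost: Duration): T = {
  // CanAwait is only a compile-time capability marker, so a null permit works.
  val permit = null.asInstanceOf[CanAwait]
  awaitable.result(atMost)(permit)
}
```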
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15520 from zsxwing/SPARK-13747.
      7ac70e7b
• [SPARK-17748][FOLLOW-UP][ML] Reorg variables of WeightedLeastSquares. · 312ea3f7
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
This is follow-up work of #15394.
Reorganize some variables of `WeightedLeastSquares` and fix one minor issue in `WeightedLeastSquaresSuite`.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15621 from yanboliang/spark-17748.
      312ea3f7
• [SPARK-18093][SQL] Fix default value test in SQLConfSuite to work regardless of warehouse dir's existence · 4bee9540
      Mark Grover authored
      ## What changes were proposed in this pull request?
Append a trailing slash, if one isn't already there, before
comparing the two paths. This doesn't take away from
the essence of the check, but removes any potential mismatch
due to a missing trailing slash.
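
A minimal sketch of the normalization (the helper name is hypothetical):

```scala
def withTrailingSlash(path: String): String =
  if (path.endsWith("/")) path else path + "/"

// Both spellings of the warehouse dir now compare equal.
assert(withTrailingSlash("/tmp/warehouse") == withTrailingSlash("/tmp/warehouse/"))
```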
      
      ## How was this patch tested?
      Ran unit tests and they passed.
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #15623 from markgrover/spark-18093.
      4bee9540
• [SPARK-17733][SQL] InferFiltersFromConstraints rule never terminates for query · 3c023570
      jiangxingbo authored
      ## What changes were proposed in this pull request?
      
The functions `QueryPlan.inferAdditionalConstraints` and `UnaryNode.getAliasedConstraints` can produce a non-converging set of constraints for recursive functions. For instance, if we have two constraints of the form (where `a` is an alias):
`a = b, a = f(b, c)`
Applying both these rules in the next iteration would infer:
`f(b, c) = f(f(b, c), c)`
As this process repeats, the iteration won't converge and the set of constraints grows larger and larger until OOM.
      
      ~~To fix this problem, we collect alias from expressions and skip infer constraints if we are to transform an `Expression` to another which contains it.~~
To fix this problem, we apply an additional check in `inferAdditionalConstraints`: when it is possible to generate recursive constraints, we skip generating them.
      
      ## How was this patch tested?
      
      Add new testcase in `SQLQuerySuite`/`InferFiltersFromConstraintsSuite`.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #15319 from jiangxb1987/constraints.
      3c023570
• [SPARK-17802] Improved caller context logging. · 402205dd
      Shuai Lin authored
      ## What changes were proposed in this pull request?
      
[SPARK-16757](https://issues.apache.org/jira/browse/SPARK-16757) sets the hadoop `CallerContext` when calling hadoop/hdfs apis to make spark applications more diagnosable in hadoop/hdfs logs. However, the `org.apache.hadoop.ipc.CallerContext` class is only added since [hadoop 2.8](https://issues.apache.org/jira/browse/HDFS-9184), which is not officially released yet. So each time `utils.CallerContext.setCurrentContext()` is called (e.g. [when a task is created](https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96)), a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext"
error is logged, which pollutes the Spark logs when there are lots of tasks.
      
      This patch improves this behaviour by only logging the `ClassNotFoundException` once.
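
A sketch of the log-once approach (the structure is an assumption, with `println` standing in for Spark's logging):

```scala
object CallerContextSketch {
  // Probe once for the Hadoop 2.8+ class, so the ClassNotFoundException is
  // reported at most one time instead of once per task.
  private lazy val callerContextSupported: Boolean =
    try {
      Class.forName("org.apache.hadoop.ipc.CallerContext")
      true
    } catch {
      case _: ClassNotFoundException =>
        println("Hadoop CallerContext not available (requires Hadoop 2.8+)") // logged once
        false
    }

  def setCurrentContext(context: String): Unit =
    if (callerContextSupported) {
      // invoke org.apache.hadoop.ipc.CallerContext via reflection here
    }
}
```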
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Shuai Lin <linshuai2012@gmail.com>
      
      Closes #15377 from lins05/spark-17802-improve-callercontext-logging.
      402205dd
• [SPARK-4411][WEB UI] Add "kill" link for jobs in the UI · 5d0f81da
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
      Currently users can kill stages via the web ui but not jobs directly (jobs are killed if one of their stages is). I've added the ability to kill jobs via the web ui. This code change is based on #4823 by lianhuiwang and updated to work with the latest code matching how stages are currently killed. In general I've copied the kill stage code warning and note comments and all. I also updated applicable tests and documentation.
      
      ## How was this patch tested?
      
      Manually tested and dev/run-tests
      
      ![screen shot 2016-10-11 at 4 49 43 pm](https://cloud.githubusercontent.com/assets/13952758/19292857/12f1b7c0-8fd4-11e6-8982-210249f7b697.png)
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #15441 from ajbozarth/spark4411.
      5d0f81da
• [SPARK-18027][YARN] .sparkStaging not clean on RM ApplicationNotFoundException · 29781364
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Cleanup YARN staging dir on all `KILLED`/`FAILED` paths in `monitorApplication`
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15598 from srowen/SPARK-18027.
      29781364
• [SPARK-18022][SQL] java.lang.NullPointerException instead of real exception when saving DF to MySQL · 6c7d094e
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
When the next exception in a JDBC exception chain is null, don't set it as the cause or as a suppressed exception.
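
A sketch of the guard, simplified from the JDBC write path:

```scala
import java.sql.SQLException

// Previously the next exception was attached unconditionally; when it was
// null, that raised a NullPointerException that masked the real error.
def chainNextException(e: SQLException): Unit = {
  val next = e.getNextException
  if (next != null && e.getCause == null) e.initCause(next)
  else if (next != null) e.addSuppressed(next)
}
```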
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15599 from srowen/SPARK-18022.
      6c7d094e
• [SPARK-17693][SQL] Fixed Insert Failure To Data Source Tables when the Schema has the Comment Field · 93b8ad18
      gatorsmile authored
      ### What changes were proposed in this pull request?
      ```SQL
      CREATE TABLE tab1(col1 int COMMENT 'a', col2 int) USING parquet
      INSERT INTO TABLE tab1 SELECT 1, 2
      ```
The insert attempt will fail if the target table has a column with comments. The error is confusing to external users:
      ```
      assertion failed: No plan for InsertIntoTable Relation[col1#15,col2#16] parquet, false, false
      +- Project [1 AS col1#19, 2 AS col2#20]
         +- OneRowRelation$
      ```
      
      This PR is to fix the above bug by checking the metadata when comparing the schema between the table and the query. If not matched, we also copy the metadata. This is an alternative to https://github.com/apache/spark/pull/15266
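
A sketch of the metadata-insensitive comparison (the helper name is hypothetical):

```scala
import org.apache.spark.sql.types.StructType

// Compare table and query schemas while ignoring per-column metadata such as
// comments; only name, data type and nullability must match.
def sameSchemaIgnoringMetadata(a: StructType, b: StructType): Boolean =
  a.fields.map(f => (f.name, f.dataType, f.nullable)).toSeq ==
    b.fields.map(f => (f.name, f.dataType, f.nullable)).toSeq
```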
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15615 from gatorsmile/insertDataSourceTableWithCommentSolution2.
      93b8ad18
  5. Oct 25, 2016
• [SPARK-18007][SPARKR][ML] update SparkR MLP - add initialWeights parameter · 12b3e8d2
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
Update SparkR MLP: add an initialWeights parameter.
      
      ## How was this patch tested?
      
      test added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #15552 from WeichenXu123/mlp_r_add_initialWeight_param.
      12b3e8d2
• [SPARK-16988][SPARK SHELL] spark history server log needs to be fixed to show... · c329a568
      hayashidac authored
      [SPARK-16988][SPARK SHELL] spark history server log needs to be fixed to show https url when ssl is enabled
      
      Author: chie8842 <chie@chie-no-Mac-mini.local>
      
      Closes #15611 from hayashidac/SPARK-16988.
      c329a568
• [SPARK-18019][ML] Add instrumentation to GBTs · 2c7394ad
      sethah authored
      ## What changes were proposed in this pull request?
      
      Add instrumentation for logging in ML GBT, part of umbrella ticket [SPARK-14567](https://issues.apache.org/jira/browse/SPARK-14567)
      
      ## How was this patch tested?
      
      Tested locally:
      
```
      16/10/20 10:24:51 INFO Instrumentation: GBTRegressor-gbtr_2b460d3e2e93-1207021668-45: training: numPartitions=1 storageLevel=StorageLevel(1 replicas)
      16/10/20 10:24:51 INFO Instrumentation: GBTRegressor-gbtr_2b460d3e2e93-1207021668-45: {"maxIter":1}
      16/10/20 10:24:51 INFO Instrumentation: GBTRegressor-gbtr_2b460d3e2e93-1207021668-45: {"numFeatures":2}
      16/10/20 10:24:51 INFO Instrumentation: GBTRegressor-gbtr_2b460d3e2e93-1207021668-45: {"numClasses":0}
      ...
      16/10/20 15:54:21 INFO Instrumentation: GBTRegressor-gbtr_065fad465377-1922077832-22: training finished
```
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15574 from sethah/gbt_instr.
      2c7394ad
• [SPARK-18070][SQL] binary operator should not consider nullability when comparing input types · a21791e3
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
A binary operator requires its inputs to be of the same type, but it should not consider nullability; e.g. `EqualTo` should be able to compare an element-nullable array and an element-non-nullable array.
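
A hypothetical repro of the element-nullability case, assuming an active SparkSession `spark`:

```scala
import spark.implicits._

// `a` encodes as array<int> with non-nullable elements, `b` with nullable
// ones; EqualTo should accept the comparison despite that difference.
val df = Seq((Seq(1, 2), Seq(Option(1), None))).toDF("a", "b")
df.filter($"a" === $"b").show()
```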
      
      ## How was this patch tested?
      
      a regression test in `DataFrameSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15606 from cloud-fan/type-bug.
      a21791e3
• [SPARK-18010][CORE] Reduce work performed for building up the application list... · c5fe3dd4
      Vinayak authored
      [SPARK-18010][CORE] Reduce work performed for building up the application list for the History Server app list UI page
      
      ## What changes were proposed in this pull request?
Allow ReplayListenerBus to skip deserializing and replaying certain events, using an inexpensive check of the raw event log entry. Use this to ensure that when event log replay is triggered to build the application list, ReplayListenerBus skips all but the few events needed for that immediate purpose. Refer to [SPARK-18010] for the motivation behind this change.
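
A sketch of the inexpensive pre-filter (the event names and helper are illustrative, not the PR's exact code):

```scala
// Only a handful of events matter when building the application list; check
// the raw JSON line for their type tags before paying for deserialization.
val neededEvents = Set(
  "SparkListenerApplicationStart",
  "SparkListenerApplicationEnd",
  "SparkListenerEnvironmentUpdate")

def shouldReplay(rawJsonLine: String): Boolean =
  neededEvents.exists(rawJsonLine.contains)
```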
      
      ## How was this patch tested?
      
      Tested with existing HistoryServer and ReplayListener unit test suites. All tests pass.
      
      Author: Vinayak <vijoshi5@in.ibm.com>
      
      Closes #15556 from vijoshi/SAAS-467_master.
      c5fe3dd4