  1. Jun 11, 2017
    • Yuming Wang's avatar
      [SPARK-13933][BUILD] Update hadoop-2.7 profile's curator version to 2.7.1 · 823f1eef
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
      Update the hadoop-2.7 profile's curator version to 2.7.1. For more details, see [SPARK-13933](https://issues.apache.org/jira/browse/SPARK-13933).
      
      ## How was this patch tested?
      
      manual tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #18247 from wangyum/SPARK-13933.
      823f1eef
    • hyukjinkwon's avatar
      [SPARK-20935][STREAMING] Always close WriteAheadLog and make it idempotent · eb3ea3a0
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to always close `WriteAheadLog` when `ReceiverTracker` is stopped, and to make closing `WriteAheadLog` and its implementations idempotent.
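
As a minimal sketch of what an idempotent close means here (not the Spark implementation; `IdempotentLog` and `release_count` are hypothetical names used only for illustration), a second call to `close()` should be a harmless no-op:

```python
import threading


class IdempotentLog:
    """Toy stand-in for a write-ahead log whose close() may be called repeatedly."""

    def __init__(self):
        self._lock = threading.Lock()
        self._closed = False
        self.release_count = 0  # how many times resources were actually released

    def close(self):
        with self._lock:
            if self._closed:
                return  # second and later calls are no-ops
            self._closed = True
            self.release_count += 1


log = IdempotentLog()
log.close()
log.close()  # safe: nothing is released twice
```

The guard flag is what lets a caller such as a tracker close the log unconditionally during shutdown without tracking whether someone else already did.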
      
      ## How was this patch tested?
      
      Added a test in `WriteAheadLogSuite`. Note that the added test appears to pass even if it closes twice (namely, even without the changes in `FileBasedWriteAheadLog` and `BatchedWriteAheadLog`). Both appear to be idempotent already, so this is mainly a sanity check.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18224 from HyukjinKwon/streaming-closing.
      eb3ea3a0
    • Michael Gummelt's avatar
      [SPARK-21000][MESOS] Add Mesos labels support to the Spark Dispatcher · 8da3f704
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
      Add Mesos labels support to the Spark Dispatcher
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #18220 from mgummelt/SPARK-21000-dispatcher-labels.
      8da3f704
    • Felix Cheung's avatar
      [SPARK-20877][SPARKR] refactor tests to basic tests only for CRAN · dc4c3518
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Move all existing tests to a non-installed directory so that they are never run when the SparkR package is installed.
      
      For a follow-up PR:
      - remove all skip_on_cran() calls in tests
      - clean up test timer
      - improve or change basic tests that do run on CRAN (if anyone has suggestion)
      
      It looks like `R CMD build pkg` will still put pkg\tests (i.e., the full tests) into the source package, but `R CMD INSTALL` on such a source package does not install these tests (and so `R CMD check` does not run them).
      
      ## How was this patch tested?
      
      - [x] unit tests, Jenkins
      - [x] AppVeyor
      - [x] make a source package, install it, `R CMD check` it - verify the full tests are not installed or run
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #18264 from felixcheung/rtestset.
      dc4c3518
  2. Jun 10, 2017
    • liuxian's avatar
      [SPARK-20620][TEST] Improve some unit tests for NullExpressionsSuite and TypeCoercionSuite · 5301a19a
      liuxian authored
      ## What changes were proposed in this pull request?
      Add more data types to some unit tests.
      
      ## How was this patch tested?
      unit tests
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #17880 from 10110346/wip_lx_0506.
      5301a19a
    • Xiao Li's avatar
      [SPARK-20211][SQL] Fix the Precision and Scale of Decimal Values when the... · 8e96acf7
      Xiao Li authored
      [SPARK-20211][SQL] Fix the Precision and Scale of Decimal Values when the Input is BigDecimal between -1.0 and 1.0
      
      ### What changes were proposed in this pull request?
      The precision and scale of decimal values are wrong when the input is BigDecimal between -1.0 and 1.0.
      
      According to [Java's BigDecimal definition](https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html), a BigDecimal's precision is the digit count starting from the leftmost nonzero digit. However, our Decimal precision follows the database decimal standard, which is the total number of digits, including those both to the left and to the right of the decimal point. Thus, this PR fixes the issue by doing the conversion.
      
      Before this PR, the following queries failed:
      ```SQL
      select 1 > 0.0001
      select floor(0.0001)
      select ceil(0.0001)
      ```
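
The mismatch can be illustrated with a small sketch. It uses Python's `decimal` module as a stand-in for `java.math.BigDecimal`, and taking the larger of precision and scale is only one simple way to express the conversion; the patch itself may differ in detail. Both helper names are hypothetical.

```python
from decimal import Decimal


def unscaled_precision_and_scale(value: Decimal):
    """Digit count of the unscaled value and the scale, for a finite Decimal."""
    sign, digits, exponent = value.as_tuple()
    return len(digits), -exponent


def sql_precision_and_scale(value: Decimal):
    """SQL-style precision: total digits, never smaller than the scale."""
    precision, scale = unscaled_precision_and_scale(value)
    return max(precision, scale), scale


small = Decimal("0.0001")
print(unscaled_precision_and_scale(small))  # digit count is 1, scale is 4
print(sql_precision_and_scale(small))       # precision raised to match the scale
```

For values with magnitude of at least 1, such as `Decimal("12.50")`, the two definitions already agree, which is why only inputs between -1.0 and 1.0 were affected.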
      
      ### How was this patch tested?
      Added test cases.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #18244 from gatorsmile/bigdecimal.
      8e96acf7
  3. Jun 09, 2017
  4. Jun 08, 2017
    • Josh Rosen's avatar
      [SPARK-20863] Add metrics/instrumentation to LiveListenerBus · 2a23cdd0
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      This patch adds Coda Hale metrics for instrumenting the `LiveListenerBus` in order to track the number of events received, dropped, and processed. In addition, it adds per-SparkListener-subclass timers to track message processing time. This is useful for identifying when slow third-party SparkListeners cause performance bottlenecks.
      
      See the new `LiveListenerBusMetrics` for a complete description of the new metrics.
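
The kind of instrumentation described above can be sketched roughly as follows. This is a toy illustration, not the `LiveListenerBus` code and not the Coda Hale metrics API; `InstrumentedBus`, `SlowListener`, and all attribute names are hypothetical.

```python
import time
from collections import defaultdict
from queue import Full, Queue


class InstrumentedBus:
    """Toy event bus that counts events and times each listener class."""

    def __init__(self, capacity):
        self.queue = Queue(maxsize=capacity)
        self.listeners = []
        self.num_events_posted = 0
        self.num_dropped_events = 0
        self.processing_seconds = defaultdict(float)  # keyed by listener class name

    def post(self, event):
        self.num_events_posted += 1
        try:
            self.queue.put_nowait(event)
        except Full:
            self.num_dropped_events += 1  # queue is full, so the event is dropped

    def drain(self):
        while not self.queue.empty():
            event = self.queue.get_nowait()
            for listener in self.listeners:
                start = time.perf_counter()
                listener.on_event(event)
                elapsed = time.perf_counter() - start
                self.processing_seconds[type(listener).__name__] += elapsed


class SlowListener:
    def on_event(self, event):
        time.sleep(0.001)  # simulate a slow third-party listener


bus = InstrumentedBus(capacity=2)
bus.listeners.append(SlowListener())
for i in range(3):
    bus.post(i)  # the third event does not fit and is counted as dropped
bus.drain()
```

Keying the timers by listener class is what makes a slow listener stand out from the rest.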
      
      ## How was this patch tested?
      
      New tests in SparkListenerSuite, including a test to ensure proper counting of dropped listener events.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #18083 from JoshRosen/listener-bus-metrics.
      2a23cdd0
    • Dongjoon Hyun's avatar
      [SPARK-20954][SQL] `DESCRIBE [EXTENDED]` result should be compatible with previous Spark · 6e95897e
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      After [SPARK-20067](https://issues.apache.org/jira/browse/SPARK-20067), `DESCRIBE` and `DESCRIBE EXTENDED` show the following result, which is incompatible with Spark 2.1.1. This PR removes the column header line for those commands.
      
      **MASTER** and **BRANCH-2.2**
      ```scala
      scala> sql("desc t").show(false)
      +----------+---------+-------+
      |col_name  |data_type|comment|
      +----------+---------+-------+
      |# col_name|data_type|comment|
      |a         |int      |null   |
      +----------+---------+-------+
      ```
      
      **SPARK 2.1.1** and **this PR**
      ```scala
      scala> sql("desc t").show(false)
      +--------+---------+-------+
      |col_name|data_type|comment|
      +--------+---------+-------+
      |a       |int      |null   |
      +--------+---------+-------+
      ```
      
      ## How was this patch tested?
      
      Passes Jenkins with the updated test suites.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #18203 from dongjoon-hyun/SPARK-20954.
      6e95897e
    • Xiao Li's avatar
      [SPARK-20976][SQL] Unify Error Messages for FAILFAST mode · 1a527bde
      Xiao Li authored
      ### What changes were proposed in this pull request?
      Before 2.2, the error message indicated that the job was terminated because of `FAILFAST` mode.
      ```
      Malformed line in FAILFAST mode: {"a":{, b:3}
      ```
      If possible, we should keep it. This PR is to unify the error messages.
      
      ### How was this patch tested?
      Modified the existing messages.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #18196 from gatorsmile/messFailFast.
      1a527bde
    • Mark Grover's avatar
      [SPARK-19185][DSTREAM] Make Kafka consumer cache configurable · 55b8cfe6
      Mark Grover authored
      ## What changes were proposed in this pull request?
      
      Add a new property `spark.streaming.kafka.consumer.cache.enabled` that allows users to enable or disable the cache for Kafka consumers. This property can be especially handy in cases where issues like SPARK-19185 get hit, for which there isn't a solution committed yet. By default, the cache is still on, so this change doesn't change any out-of-box behavior.
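
As a hedged illustration of turning the cache off, the property can be passed like any other Spark setting through `--conf`. The helper below only builds the argument list; `spark_submit_args` and the application file name are hypothetical.

```python
def spark_submit_args(app, conf):
    """Build a spark-submit argument list from a dict of Spark properties."""
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args + [app]


args = spark_submit_args(
    "my_streaming_app.py",  # hypothetical application file
    {"spark.streaming.kafka.consumer.cache.enabled": "false"},
)
print(" ".join(args))
```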
      
      ## How was this patch tested?
      Running unit tests
      
      Author: Mark Grover <mark@apache.org>
      Author: Mark Grover <grover.markgrover@gmail.com>
      
      Closes #18234 from markgrover/spark-19185.
      55b8cfe6
    • hyukjinkwon's avatar
      [INFRA] Close stale PRs · b771fed7
      hyukjinkwon authored
      # What changes were proposed in this pull request?
      
      This PR proposes to close stale PRs, mostly the same kinds of cases as in https://github.com/apache/spark/pull/18017
      
      Closes #11459
      Closes #13833
      Closes #13720
      Closes #12506
      Closes #12456
      Closes #12252
      Closes #17689
      Closes #17791
      Closes #18163
      Closes #17640
      Closes #17926
      Closes #18163
      Closes #12506
      Closes #18044
      Closes #14036
      Closes #15831
      Closes #14461
      Closes #17638
      Closes #18222
      
      Added:
      Closes #18045
      Closes #18061
      Closes #18010
      Closes #18041
      Closes #18124
      Closes #18130
      Closes #12217
      
      Added:
      Closes #16291
      Closes #17480
      Closes #14995
      
      Added:
      Closes #12835
      Closes #17141
      
      ## How was this patch tested?
      
      N/A
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18223 from HyukjinKwon/close-stale-prs.
      b771fed7
    • 10087686's avatar
      [SPARK-21006][TESTS] Create rpcEnv and run later needs shutdown and awaitTermination · 9be79458
      10087686 authored
      Signed-off-by: 10087686 <wang.jiaochun@zte.com.cn>
      
      ## What changes were proposed in this pull request?
      When running the test("port conflict") case, we need to call anotherEnv.shutdown() and anotherEnv.awaitTermination() to free resources.
      
      ## How was this patch tested?
      Ran the unit tests in RpcEnvSuite.scala.
      
      Author: 10087686 <wang.jiaochun@zte.com.cn>
      
      Closes #18226 from wangjiaochun/master.
      9be79458
    • Sean Owen's avatar
      [SPARK-20914][DOCS] Javadoc contains code that is invalid · 847efe12
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix Java, Scala Dataset examples in scaladoc, which didn't compile.
      
      ## How was this patch tested?
      
      Existing compilation/test
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18215 from srowen/SPARK-20914.
      847efe12
  5. Jun 07, 2017
  6. Jun 06, 2017
    • Marcelo Vanzin's avatar
      [SPARK-20641][CORE] Add key-value store abstraction and LevelDB implementation. · 0cba4951
      Marcelo Vanzin authored
      This change adds an abstraction and LevelDB implementation for a key-value
      store that will be used to store UI and SHS data.
      
      The interface is described in KVStore.java (see javadoc). Specifics
      of the LevelDB implementation are discussed in the javadocs of both
      LevelDB.java and LevelDBTypeInfo.java.
      
      Included also are a few small benchmarks just to get some idea of
      latency. Because they're too slow for regular unit test runs, they're
      disabled by default.
      
      Tested with the included unit tests, and also as part of the overall feature
      implementation (including running SHS with hundreds of apps).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #17902 from vanzin/shs-ng/M1.
      0cba4951
    • Reza Safi's avatar
      [SPARK-20926][SQL] Removing exposures to guava library caused by directly... · b61a401d
      Reza Safi authored
      [SPARK-20926][SQL] Removing exposures to guava library caused by directly accessing SessionCatalog's tableRelationCache

      There could be test failures because DataStorageStrategy, HiveMetastoreCatalog and also HiveSchemaInferenceSuite were exposed to the guava library by directly accessing SessionCatalog's tableRelationCache. These failures occur when guava shading is in place.
      
      ## What changes were proposed in this pull request?
      This change removes those guava exposures by introducing new methods in SessionCatalog and also changing DataStorageStrategy, HiveMetastoreCatalog and HiveSchemaInferenceSuite so that they use those proxy methods.
      
      ## How was this patch tested?
      
      Unit tests passed after applying these changes.
      
      Author: Reza Safi <rezasafi@cloudera.com>
      
      Closes #18148 from rezasafi/branch-2.2.
      
      (cherry picked from commit 1388fdd7)
      b61a401d
    • jinxing's avatar
      [SPARK-20985] Stop SparkContext using LocalSparkContext.withSpark · 44de108d
      jinxing authored
      ## What changes were proposed in this pull request?
      `SparkContext` should always be stopped after use, so that other tests do not complain that only one `SparkContext` can exist.
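
The idea behind a `withSpark`-style helper is the loan pattern: the context is lent to the test body and stopped afterwards no matter how the body exits. A minimal sketch, assuming a stand-in `FakeContext` class and a hypothetical `with_context` helper rather than the real `LocalSparkContext` code:

```python
from contextlib import contextmanager


class FakeContext:
    """Stand-in for a SparkContext; only tracks whether stop() was called."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


@contextmanager
def with_context(ctx):
    """Lend ctx to the body and stop it afterwards, even if the body raises."""
    try:
        yield ctx
    finally:
        ctx.stop()


ctx = FakeContext()
try:
    with with_context(ctx) as sc:
        raise RuntimeError("test body failed")
except RuntimeError:
    pass  # the failure still propagates, but the context was stopped first
```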
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #18204 from jinxing64/SPARK-20985.
      44de108d
  7. Jun 05, 2017
    • Feng Liu's avatar
      [SPARK-20991][SQL] BROADCAST_TIMEOUT conf should be a TimeoutConf · 88a23d3d
      Feng Liu authored
      ## What changes were proposed in this pull request?
      
      The construction of BROADCAST_TIMEOUT conf should take the TimeUnit argument as a TimeoutConf.
      
      Author: Feng Liu <fengliu@databricks.com>
      
      Closes #18208 from liufengdb/fix_timeout.
      88a23d3d
    • Shixiong Zhu's avatar
      [SPARK-20957][SS][TESTS] Fix o.a.s.sql.streaming.StreamingQueryManagerSuite listing · bc537e40
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      When stopping StreamingQuery, StreamExecution will set `streamDeathCause` then notify StreamingQueryManager to remove this query. So it's possible that when `q2.exception.isDefined` returns `true`, StreamingQueryManager's active list still has `q2`.
      
      This PR just puts the checks into `eventually` to fix the flaky test.
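
The effect of wrapping checks in `eventually` can be sketched as a retry loop. This is an illustration of the idea only, not ScalaTest's implementation; the `eventually` function below, its parameters, and the simulated `active` set are all assumptions.

```python
import threading
import time


def eventually(check, timeout=5.0, interval=0.05):
    """Retry check() until it stops raising AssertionError or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return check()
        except AssertionError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)


# Simulate a manager that removes the query from its active list slightly
# after the query itself already reports an exception.
active = {"q2"}
threading.Timer(0.1, active.discard, args=("q2",)).start()


def q2_is_gone():
    assert "q2" not in active


eventually(q2_is_gone)  # passes once the asynchronous removal has happened
```

A plain assertion at the same point would race with the removal, which is the flakiness the PR describes.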
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #18180 from zsxwing/SPARK-20957.
      bc537e40
    • jerryshao's avatar
      [SPARK-20981][SPARKSUBMIT] Add new configuration spark.jars.repositories as... · 06c05441
      jerryshao authored
      [SPARK-20981][SPARKSUBMIT] Add new configuration spark.jars.repositories as equivalence of --repositories
      
      ## What changes were proposed in this pull request?
      
      In our use case of launching Spark applications via REST APIs (Livy), there is no way for the user to specify command line arguments; all Spark configurations are set through a configuration map. Because "--repositories" has no equivalent Spark configuration, we cannot specify a custom repository through configuration.

      This PR therefore proposes adding a Spark configuration equivalent to "--repositories".
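
With such a setting, a configuration-only launcher could carry the repository alongside the other properties. The request body below is a hypothetical sketch: the file path, package coordinates, and repository URL are made up, and only the property names `spark.jars.packages` and `spark.jars.repositories` are real.

```python
import json

# Hypothetical request body for a REST-based launcher: all Spark settings go in "conf".
payload = {
    "file": "hdfs:///apps/my_app.jar",  # hypothetical application location
    "conf": {
        "spark.jars.packages": "com.example:my-lib:1.0.0",  # hypothetical coordinates
        "spark.jars.repositories": "https://repo.example.com/maven",  # hypothetical URL
    },
}
print(json.dumps(payload, indent=2))
```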
      
      ## How was this patch tested?
      
      New UT added.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #18201 from jerryshao/SPARK-20981.
      06c05441
    • sethah's avatar
      [SPARK-19762][ML] Hierarchy for consolidating ML aggregator/loss code · 1665b5f7
      sethah authored
      ## What changes were proposed in this pull request?
      
      JIRA: [SPARK-19762](https://issues.apache.org/jira/browse/SPARK-19762)
      
      The larger changes in this patch are:
      
      * Adds a `DifferentiableLossAggregator` trait which is intended to be used as a common parent trait to all Spark ML aggregator classes. It factors out the common methods: `merge, gradient, loss, weight` from the aggregator subclasses.
      * Adds a `RDDLossFunction` which is intended to be the only implementation of Breeze's `DiffFunction` necessary in Spark ML, and can be used by all other algorithms. It takes the aggregator type as a type parameter, and maps the aggregator over an RDD. It additionally takes in an optional regularization loss function for applying the differentiable part of regularization.
      * Factors out the regularization from the data part of the cost function, and treats regularization as a separate independent cost function which can be evaluated and added to the data cost function.
      * Changes `LinearRegression` to use this new hierarchy as a proof of concept.
      * Adds the following new namespaces `o.a.s.ml.optim.loss` and `o.a.s.ml.optim.aggregator`
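
The structure described in the list above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Spark ML code: lists of rows stand in for RDD partitions, and `LeastSquaresAggregator`, `l2_regularization`, and `loss_function` are hypothetical stand-ins for the aggregator, the separate regularization cost, and `RDDLossFunction`.

```python
from functools import reduce


class LeastSquaresAggregator:
    """Accumulates weighted squared-error loss and gradient for fixed coefficients."""

    def __init__(self, coefficients):
        self.coefficients = coefficients
        self.weight_sum = 0.0
        self.loss_sum = 0.0
        self.gradient_sum = [0.0] * len(coefficients)

    def add(self, features, label, weight=1.0):
        error = sum(c * x for c, x in zip(self.coefficients, features)) - label
        self.weight_sum += weight
        self.loss_sum += weight * 0.5 * error * error
        for i, x in enumerate(features):
            self.gradient_sum[i] += weight * error * x
        return self

    def merge(self, other):
        self.weight_sum += other.weight_sum
        self.loss_sum += other.loss_sum
        self.gradient_sum = [a + b for a, b in zip(self.gradient_sum, other.gradient_sum)]
        return self

    @property
    def loss(self):
        return self.loss_sum / self.weight_sum

    @property
    def gradient(self):
        return [g / self.weight_sum for g in self.gradient_sum]


def l2_regularization(coefficients, reg_param):
    """Separate cost function: returns (loss, gradient) of the L2 penalty."""
    loss = 0.5 * reg_param * sum(c * c for c in coefficients)
    return loss, [reg_param * c for c in coefficients]


def loss_function(partitions, coefficients, reg_param):
    """Map an aggregator over each partition, merge, then add the regularization."""

    def aggregate(rows):
        agg = LeastSquaresAggregator(coefficients)
        for features, label in rows:
            agg.add(features, label)
        return agg

    total = reduce(lambda a, b: a.merge(b), map(aggregate, partitions))
    reg_loss, reg_grad = l2_regularization(coefficients, reg_param)
    return total.loss + reg_loss, [g + r for g, r in zip(total.gradient, reg_grad)]


partitions = [[([1.0, 0.0], 1.0)], [([0.0, 1.0], 2.0)]]
loss, grad = loss_function(partitions, [0.0, 0.0], reg_param=0.1)
```

Keeping the regularization as its own function means a different data loss can be plugged into `loss_function` by swapping only the aggregator class.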
      
      Also note that none of these are public-facing changes. All of these classes are internal to Spark ML and remain that way.
      
      **NOTE: The large majority of the "lines added" and "lines deleted" are simply code moving around or unit tests.**
      
      BTW, I also converted LinearSVC to this framework as a way to prove that this new hierarchy is flexible enough for the other algorithms, but I backed those changes out because the PR is large enough as is.
      
      ## How was this patch tested?
      Test suites are added for the new components, and some test suites are also added to provide coverage where there wasn't any before.
      
      * DifferentiableLossAggregatorSuite
      * LeastSquaresAggregatorSuite
      * RDDLossFunctionSuite
      * DifferentiableRegularizationSuite
      
      Below are some performance testing numbers. Run on a 6 node virtual cluster with 44 cores and ~110G RAM, the dataset size is about 37G. These are not "large-scale" tests, but we really want to just make sure the iteration times don't increase with this patch. Notably we are doing the regularization a bit differently than before, but that should cost very little. I think there's very little risk otherwise, and these numbers don't show a difference. Of course I'm happy to add more tests as we think it's necessary, but I think the patch is ready for review now.
      
      **Note:** timings are best of 3 runs.
      
      |    |   numFeatures |   numPoints |   maxIter |   regParam |   elasticNetParam |   SPARK-19762 (sec) |   master (sec) |
      |----|---------------|-------------|-----------|------------|-------------------|---------------------|----------------|
      |  0 |          5000 |       1e+06 |        30 |       0    |               0   |             129.594 |        131.153 |
      |  1 |          5000 |       1e+06 |        30 |       0.1  |               0   |             135.54  |        136.327 |
      |  2 |          5000 |       1e+06 |        30 |       0.01 |               0.5 |             135.148 |        129.771 |
      |  3 |         50000 |  100000     |        30 |       0    |               0   |             145.764 |        144.096 |
      
      ## Follow ups
      
      If this design is accepted, we will convert the other ML algorithms that use this aggregator pattern to this new hierarchy in follow up PRs.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      Author: sethah <shendrickson@cloudera.com>
      
      Closes #17094 from sethah/ml_aggregators.
      1665b5f7
    • Zheng RuiFeng's avatar
      [SPARK-20930][ML] Destroy broadcasted centers after computing cost in KMeans · 98b5ccd3
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Destroy the broadcasted centers after computing the cost.
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #18152 from zhengruifeng/destroy_kmeans_model.
      98b5ccd3
    • liupengcheng's avatar
      [SPARK-20945] Fix TID key not found in TaskSchedulerImpl · 2d39711b
      liupengcheng authored
      ## What changes were proposed in this pull request?
      
      This pull request fixes a bug in TaskSchedulerImpl that occurs under certain conditions.
      For details, see:
      https://issues.apache.org/jira/browse/SPARK-20945
      
      ## How was this patch tested?
      manual tests
      
      Author: liupengcheng <liupengcheng@xiaomi.com>
      Author: PengchengLiu <pengchengliu_bupt@163.com>
      
      Closes #18171 from liupc/Fix-tid-key-not-found-in-TaskSchedulerImpl.
      2d39711b
  8. Jun 04, 2017
  9. Jun 03, 2017
    • Wieland Hoffmann's avatar
      [DOCS] Fix a typo in Encoder.clsTag · c70c38eb
      Wieland Hoffmann authored
      ## What changes were proposed in this pull request?
      
      Fixes a typo: `and` -> `an`
      
      ## How was this patch tested?
      
      Not at all.
      
      Author: Wieland Hoffmann <mineo@users.noreply.github.com>
      
      Closes #17759 from mineo/patch-1.
      c70c38eb
    • zuotingbing's avatar
      [SPARK-20936][CORE] Lack of an important case about the test of resolveURI in... · 887cf0ec
      zuotingbing authored
      [SPARK-20936][CORE] Lack of an important case about the test of resolveURI in UtilsSuite, and add it as needed.
      
      ## What changes were proposed in this pull request?
      1. Add `assert(resolve(before) === after)` to the resolveURI test so that the resolved value of `before` is compared with `after`.
      The function `assertResolves(before: String, after: String)` takes two parameters, which implies that it should check whether the resolved `before` value equals the expected `after` value.
      For example, the value of `Utils.resolveURI("hdfs:///root/spark.jar#app.jar").toString` should be "hdfs:///root/spark.jar#app.jar" rather than "hdfs:/root/spark.jar#app.jar". Adding `assert(resolve(before) === after)` makes the test safer.
      2. Distinguish the cases for resolveURI from those for resolveURIs.
      3. Delete duplicate cases and make some small fixes so that the suite is clearer.
      
      ## How was this patch tested?
      
      unit tests
      
      Author: zuotingbing <zuo.tingbing9@zte.com.cn>
      
      Closes #18158 from zuotingbing/spark-UtilsSuite.
      887cf0ec
    • David Eis's avatar
      [SPARK-20790][MLLIB] Remove extraneous logging in test · 96e6ba6c
      David Eis authored
      ## What changes were proposed in this pull request?
      
      Remove extraneous logging.
      
      ## How was this patch tested?
      
      Unit tests pass.
      
      Author: David Eis <deis@bloomberg.net>
      
      Closes #18188 from davideis/fix-test.
      96e6ba6c
    • Ruben Berenguel Montoro's avatar
      [SPARK-19732][SQL][PYSPARK] Add fill functions for nulls in bool fields of datasets · 6cbc61d1
      Ruben Berenguel Montoro authored
      ## What changes were proposed in this pull request?
      
      Allow fill/replace of NAs with booleans, both in Python and Scala
      
      ## How was this patch tested?
      
      Unit tests, doctests
      
      This PR is original work from me and I license this work to the Spark project
      
      Author: Ruben Berenguel Montoro <ruben@mostlymaths.net>
      Author: Ruben Berenguel <ruben@mostlymaths.net>
      
      Closes #18164 from rberenguel/SPARK-19732-fillna-bools.
      6cbc61d1
  10. Jun 02, 2017
    • Wenchen Fan's avatar
      [SPARK-20974][BUILD] we should run REPL tests if SQL module has code changes · 864d94fe
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      REPL module depends on SQL module, so we should run REPL tests if SQL module has code changes.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18191 from cloud-fan/test.
      864d94fe
    • Zhenhua Wang's avatar
      [SPARK-17078][SQL][FOLLOWUP] Simplify explain cost command · 6de41e95
      Zhenhua Wang authored
      ## What changes were proposed in this pull request?
      
      Usually, when using the explain cost command, users want to see the stats of the plan. Since stats are only shown in the optimized plan, it is more direct and convenient to include only the optimized plan and the physical plan in the output.
      
      ## How was this patch tested?
      
      Enhanced existing test.
      
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #18190 from wzhfy/simplifyExplainCost.
      6de41e95