  1. Jan 25, 2017
    • aokolnychyi's avatar
      [SPARK-16046][DOCS] Aggregations in the Spark SQL programming guide · 3fdce814
      aokolnychyi authored
      ## What changes were proposed in this pull request?
      
      - A separate subsection for Aggregations under “Getting Started” in the Spark SQL programming guide. It mentions which aggregate functions are predefined and how users can create their own.
      - Examples of using the `UserDefinedAggregateFunction` abstract class for untyped aggregations in Java and Scala.
      - Examples of using the `Aggregator` abstract class for type-safe aggregations in Java and Scala.
      - Python is not covered.
      - The PR might not resolve the ticket since I do not know what exactly was planned by the author.
      
      In total, there are four new standalone examples that can be executed via `spark-submit` or `run-example`. The updated Spark SQL programming guide references these examples instead of containing hard-coded snippets.
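
      As a rough illustration of the type-safe approach described above, here is a minimal `Aggregator` sketch in Scala; the case class names and the averaging logic are illustrative only, not taken verbatim from the merged examples:

      ```scala
      import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
      import org.apache.spark.sql.expressions.Aggregator

      case class Employee(name: String, salary: Long)
      case class Average(var sum: Long, var count: Long)

      // A type-safe, user-defined average over a Dataset[Employee].
      object MyAverage extends Aggregator[Employee, Average, Double] {
        def zero: Average = Average(0L, 0L)
        def reduce(buf: Average, e: Employee): Average = { buf.sum += e.salary; buf.count += 1; buf }
        def merge(b1: Average, b2: Average): Average = { b1.sum += b2.sum; b1.count += b2.count; b1 }
        def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
        def bufferEncoder: Encoder[Average] = Encoders.product
        def outputEncoder: Encoder[Double] = Encoders.scalaDouble
      }

      val spark = SparkSession.builder().appName("AggregatorSketch").master("local[2]").getOrCreate()
      import spark.implicits._
      val ds = Seq(Employee("a", 3000L), Employee("b", 4000L)).toDS()
      ds.select(MyAverage.toColumn.name("average_salary")).show()
      ```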
      
      ## How was this patch tested?
      
      The patch was tested locally by building the docs. The examples were run as well.
      
      ![image](https://cloud.githubusercontent.com/assets/6235869/21292915/04d9d084-c515-11e6-811a-999d598dffba.png)
      
      Author: aokolnychyi <okolnychyyanton@gmail.com>
      
      Closes #16329 from aokolnychyi/SPARK-16046.
      3fdce814
  2. Jan 24, 2017
    • Marcelo Vanzin's avatar
      [SPARK-19139][CORE] New auth mechanism for transport library. · 8f3f73ab
      Marcelo Vanzin authored
      This change introduces a new auth mechanism to the transport library,
      to be used when users enable strong encryption. This auth mechanism
      has better security than the currently used DIGEST-MD5.
      
      The new protocol uses symmetric key encryption to mutually authenticate
      the endpoints, and is very loosely based on ISO/IEC 9798.
      
      The new protocol falls back to SASL when it detects that the remote end is old.
      SASL does not support asking the server for multiple auth protocols (which would
      have allowed reusing the existing SASL code by simply adding a new SASL provider),
      so the protocol is implemented outside of the SASL API, avoiding the boilerplate
      of adding a new provider.
      
      Details of the auth protocol are discussed in the included README.md
      file.
      
      This change partly undoes the changes added in SPARK-13331; AES encryption
      is now decoupled from SASL authentication. The encryption code itself,
      though, has been reused as part of this change.
      
      ## How was this patch tested?
      
      - Unit tests
      - Tested Spark 2.2 against Spark 1.6 shuffle service with SASL enabled
      - Tested Spark 2.2 against Spark 2.2 shuffle service with SASL fallback disabled
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16521 from vanzin/SPARK-19139.
      8f3f73ab
    • Parag Chaudhari's avatar
      [SPARK-14049][CORE] Add functionality in spark history server API to query applications by end time · 0ff67a1c
      Parag Chaudhari authored
      ## What changes were proposed in this pull request?
      
      Currently, the Spark history server REST API provides functionality to query applications by start time range via the minDate and maxDate query parameters, but it lacks support for querying applications by their end time. This pull request proposes optional minEndDate and maxEndDate query parameters and the corresponding filtering capability in the history server REST API. This functionality can be used for the following queries:
      1. Applications finished in last 'x' minutes
      2. Applications finished before 'y' time
      3. Applications finished between 'x' time to 'y' time
      4. Applications started from 'x' time and finished before 'y' time.
      
      For backward compatibility, the existing minDate and maxDate query parameters are kept as they are and continue to filter based on the start time range.
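
      As a hedged sketch (host, port, and dates are made up; the parameter names follow the description above), a query for applications that finished in a given window might look like:

      ```scala
      import scala.io.Source

      // Hypothetical history server endpoint; minEndDate/maxEndDate are the new parameters proposed here.
      val url = "http://localhost:18080/api/v1/applications" +
        "?status=completed&minEndDate=2017-01-20&maxEndDate=2017-01-24"
      println(Source.fromURL(url).mkString)
      ```
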
      ## How was this patch tested?
      
      Existing unit tests and 4 new unit tests.
      
      Author: Parag Chaudhari <paragpc@amazon.com>
      
      Closes #11867 from paragpc/master-SHS-query-by-endtime_2.
      0ff67a1c
    • uncleGen's avatar
      [DOCS] Fix typo in docs · 7c61c2a1
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      Fix typo in docs
      
      ## How was this patch tested?
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #16658 from uncleGen/typo-issue.
      7c61c2a1
  3. Jan 23, 2017
  4. Jan 20, 2017
  5. Jan 17, 2017
    • jerryshao's avatar
      [SPARK-19179][YARN] Change spark.yarn.access.namenodes config and update docs · b79cc7ce
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      The `spark.yarn.access.namenodes` configuration name does not actually reflect how it is used: in the code it refers to the Hadoop filesystems from which we obtain tokens, not NameNodes. So this PR proposes updating the name of this configuration and changing the related code and docs accordingly.
      
      ## How was this patch tested?
      
      Local verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #16560 from jerryshao/SPARK-19179.
      b79cc7ce
  6. Jan 15, 2017
    • Maurus Cuelenaere's avatar
      [MINOR][DOC] Document local[*,F] master modes · 3df2d931
      Maurus Cuelenaere authored
      ## What changes were proposed in this pull request?
      
      core/src/main/scala/org/apache/spark/SparkContext.scala contains the LOCAL_N_FAILURES_REGEX master mode, but it was never documented, so this PR documents it.
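
      For illustration (the app name is arbitrary), a master URL of this form can be passed when building a session; the second number is the task-failure tolerance:

      ```scala
      import org.apache.spark.sql.SparkSession

      // "local[4,2]": run locally with 4 worker threads and tolerate up to 2 task failures.
      val spark = SparkSession.builder()
        .master("local[4,2]")
        .appName("LocalWithTaskFailures")
        .getOrCreate()
      ```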
      
      ## How was this patch tested?
      
      By using the Github Markdown preview feature.
      
      Author: Maurus Cuelenaere <mcuelenaere@gmail.com>
      
      Closes #16562 from mcuelenaere/patch-1.
      3df2d931
  7. Jan 11, 2017
    • Bryan Cutler's avatar
      [SPARK-17568][CORE][DEPLOY] Add spark-submit option to override ivy settings... · 3bc2eff8
      Bryan Cutler authored
      [SPARK-17568][CORE][DEPLOY] Add spark-submit option to override ivy settings used to resolve packages/artifacts
      
      ## What changes were proposed in this pull request?
      
      Adds an option to spark-submit that allows overriding the default IvySettings used to resolve artifacts as part of the Spark Packages functionality. This allows all artifact resolution to go through a centrally managed repository, such as Nexus or Artifactory, where site admins can better approve and control what is used with Spark apps.
      
      This change restructures the creation of the IvySettings object in two distinct ways.  First, if the `spark.ivy.settings` option is not defined then `buildIvySettings` will create a default settings instance, as before, with defined repositories (Maven Central) included.  Second, if the option is defined, the ivy settings file will be loaded from the given path and only repositories defined within will be used for artifact resolution.
      ## How was this patch tested?
      
      Existing tests cover the default behaviour; manual tests load an ivysettings.xml file with local and Nexus repositories defined. Added a new test that loads a simple Ivy settings file with a local filesystem resolver.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      Author: Ian Hummel <ian@themodernlife.net>
      
      Closes #15119 from BryanCutler/spark-custom-IvySettings.
      3bc2eff8
    • jerryshao's avatar
      [SPARK-19021][YARN] Generalize HDFSCredentialProvider to support non-HDFS secure filesystems · 4239a108
      jerryshao authored
      Currently Spark can only get the token renewal interval from secure HDFS (hdfs://). If Spark runs with other secure file systems such as webHDFS (webhdfs://), wasb (wasb://), or ADLS, it will ignore those tokens and not obtain renewal intervals from them, which makes Spark unable to work with these secure clusters. So instead of only checking the HDFS token, we should generalize the code to support different DelegationTokenIdentifier types.
      
      ## How was this patch tested?
      
      Manually verified in security cluster.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #16432 from jerryshao/SPARK-19021.
      4239a108
  8. Jan 10, 2017
    • Shixiong Zhu's avatar
      [SPARK-19140][SS] Allow update mode for non-aggregation streaming queries · bc6c56e9
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR allows update mode for non-aggregation streaming queries. It behaves the same as append mode when a query has no aggregations.
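
      A minimal sketch of such a query (the socket source and console sink are chosen only for illustration); with no aggregation, `update` output behaves like `append`:

      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("UpdateModeSketch").master("local[2]").getOrCreate()

      val lines = spark.readStream.format("socket")
        .option("host", "localhost").option("port", 9999).load()

      // "update" is now accepted even though this query has no aggregation.
      val query = lines.writeStream.outputMode("update").format("console").start()
      query.awaitTermination()
      ```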
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16520 from zsxwing/update-without-agg.
      bc6c56e9
    • Peng, Meng's avatar
      [SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor change · 32286ba6
      Peng, Meng authored
      ## What changes were proposed in this pull request?
      Adds an FDR test case in ml/feature/ChiSqSelectorSuite.
      Improves some comments in the code.
      This is a follow-up PR for #15212.
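
      A hedged sketch of exercising the FDR selector type (the data and threshold are made up, and an existing `spark` session is assumed):

      ```scala
      import org.apache.spark.ml.feature.ChiSqSelector
      import org.apache.spark.ml.linalg.Vectors

      val df = spark.createDataFrame(Seq(
        (Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
        (Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
        (Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
      )).toDF("features", "label")

      // Select features whose chi-squared p-values pass a false discovery rate of 0.05.
      val selector = new ChiSqSelector()
        .setSelectorType("fdr")
        .setFdr(0.05)
        .setFeaturesCol("features")
        .setLabelCol("label")
        .setOutputCol("selectedFeatures")

      selector.fit(df).transform(df).show()
      ```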
      
      ## How was this patch tested?
      Unit tests.
      
      Author: Peng, Meng <peng.meng@intel.com>
      
      Closes #16434 from mpjlu/fdr_fwe_update.
      32286ba6
  9. Jan 07, 2017
  10. Jan 06, 2017
  11. Jan 05, 2017
  12. Jan 04, 2017
    • uncleGen's avatar
      [SPARK-19009][DOC] Add streaming rest api doc · 6873430c
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      Add the Streaming REST API documentation.
      
      Related to PR #16253.
      
      cc saturday-shi srowen
      
      ## How was this patch tested?
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #16414 from uncleGen/SPARK-19009.
      6873430c
    • Niranjan Padmanabhan's avatar
      [MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo · a1e40b1f
      Niranjan Padmanabhan authored
      ## What changes were proposed in this pull request?
      There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.
      
      ## How was this patch tested?
      N/A since only docs or comments were updated.
      
      Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>
      
      Closes #16455 from neurons/np.structure_streaming_doc.
      a1e40b1f
  13. Jan 02, 2017
  14. Dec 30, 2016
    • Cheng Lian's avatar
      [SPARK-19016][SQL][DOC] Document scalable partition handling · 871f6114
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR documents the scalable partition handling feature in the body of the programming guide.
      
      Before this PR, we only mention it in the migration guide, and it is not very clear that, since 2.1, external datasource tables require an extra `MSCK REPAIR TABLE` command to have their per-partition information persisted.
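
      For example (hypothetical table name), the extra step the guide now calls out looks like:

      ```scala
      // Required once so that per-partition metadata for an existing partitioned
      // external datasource table is persisted in the metastore.
      spark.sql("MSCK REPAIR TABLE my_partitioned_table")
      ```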
      
      ## How was this patch tested?
      
      N/A.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #16424 from liancheng/scalable-partition-handling-doc.
      871f6114
  15. Dec 29, 2016
    • adesharatushar's avatar
      [SPARK-19003][DOCS] Add Java example in Spark Streaming Guide, section Design... · dba81e1d
      adesharatushar authored
      [SPARK-19003][DOCS] Add Java example in Spark Streaming Guide, section Design Patterns for using foreachRDD
      
      ## What changes were proposed in this pull request?
      
      Added the missing Java example under the section "Design Patterns for using foreachRDD". Now this section has examples in all 3 languages, improving the consistency of the documentation.
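
      For reference, the Scala form of the pattern in that section looks roughly like the sketch below; the socket source, batch interval, and per-partition handling are illustrative only:

      ```scala
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val conf = new SparkConf().setMaster("local[2]").setAppName("ForeachRDDSketch")
      val ssc = new StreamingContext(conf, Seconds(5))
      val lines = ssc.socketTextStream("localhost", 9999)

      lines.foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          // In a real sink, create one connection per partition here instead of per record.
          partition.foreach(record => println(record))
        }
      }

      ssc.start()
      ssc.awaitTermination()
      ```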
      
      ## How was this patch tested?
      
      Manual.
      Generated the docs using the command "SKIP_API=1 jekyll build" and verified the generated HTML page manually.
      
      The syntax of the example has been tested for correctness using sample code on Java 1.7 and Spark 2.2.0-SNAPSHOT.
      
      Author: adesharatushar <tushar_adeshara@persistent.com>
      
      Closes #16408 from adesharatushar/streaming-doc-fix.
      dba81e1d
  16. Dec 28, 2016
  17. Dec 27, 2016
    • Yuexin Zhang's avatar
      [SPARK-19006][DOCS] mention spark.kryoserializer.buffer.max must be less than 2048m in doc · 28ab0ec4
      Yuexin Zhang authored
      ## What changes were proposed in this pull request?
      
      On the configuration doc page (https://spark.apache.org/docs/latest/configuration.html) we describe `spark.kryoserializer.buffer.max` as: "Maximum allowable size of Kryo serialization buffer. This must be larger than any object you attempt to serialize. Increase this if you get a 'buffer limit exceeded' exception inside Kryo."
      However, the source code has a hard-coded upper limit:
      ```
      val maxBufferSizeMb = conf.getSizeAsMb("spark.kryoserializer.buffer.max", "64m").toInt
      if (maxBufferSizeMb >= ByteUnit.GiB.toMiB(2)) {
        throw new IllegalArgumentException("spark.kryoserializer.buffer.max must be less than " +
          s"2048 mb, got: + $maxBufferSizeMb mb.")
      }
      ```
      We should mention that this value must be less than 2048 mb on the configuration doc page as well.
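
      For instance, a value safely under the limit can be set like this (512m is just an arbitrary example):

      ```scala
      import org.apache.spark.SparkConf

      // Must stay below 2048m, per the hard-coded check quoted above.
      val conf = new SparkConf()
        .set("spark.kryoserializer.buffer.max", "512m")
      ```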
      
      ## How was this patch tested?
      
      None, since it's a minor doc change.
      
      Author: Yuexin Zhang <yxzhang@cloudera.com>
      
      Closes #16412 from cnZach/SPARK-19006.
      28ab0ec4
  18. Dec 21, 2016
    • Dongjoon Hyun's avatar
      [SPARK-18923][DOC][BUILD] Support skipping R/Python API docs · ba4468bb
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      We can build the Python API docs with `cd ./python/docs && make html` and the R API docs with `cd ./R && sh create-docs.sh` separately. However, `jekyll` fails in some environments.
      
      This PR aims to support `SKIP_PYTHONDOC` and `SKIP_RDOC` for the documentation build in the `docs` folder. Currently, we can use `SKIP_SCALADOC` or `SKIP_API`. The reason for providing additional options is that the Spark documentation build uses a number of tools to build HTML docs and API docs in Scala, Python and R. Specifically, for Python and R,
      
      - Python API docs requires `sphinx`.
      - R API docs requires `R` installation and `knitr` (and more others libraries).
      
      In other words, we cannot generate Python API docs without an R installation, and we cannot generate R API docs without a Python `sphinx` installation. If Spark provides `SKIP_PYTHONDOC` and `SKIP_RDOC` like `SKIP_SCALADOC`, it would be more convenient.
      
      ## How was this patch tested?
      
      Manual.
      
      **Skipping Scala/Java/Python API Doc Build**
      ```bash
      $ cd docs
      $ SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 jekyll build
      $ ls api
      DESCRIPTION R
      ```
      
      **Skipping Scala/Java/R API Doc Build**
      ```bash
      $ cd docs
      $ SKIP_SCALADOC=1 SKIP_RDOC=1 jekyll build
      $ ls api
      python
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16336 from dongjoon-hyun/SPARK-18923.
      ba4468bb
  19. Dec 19, 2016
    • Josh Rosen's avatar
      [SPARK-18761][CORE] Introduce "task reaper" to oversee task killing in executors · fa829ce2
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      Spark's current task cancellation / task killing mechanism is "best effort" because some tasks may not be interruptible or may not respond to their "killed" flags being set. If a significant fraction of a cluster's task slots are occupied by tasks that have been marked as killed but remain running, then new jobs and tasks can be starved of the resources held by these zombie tasks.
      
      This patch aims to address this problem by adding a "task reaper" mechanism to executors. At a high-level, task killing now launches a new thread which attempts to kill the task and then watches the task and periodically checks whether it has been killed. The TaskReaper will periodically re-attempt to call `TaskRunner.kill()` and will log warnings if the task keeps running. I modified TaskRunner to rename its thread at the start of the task, allowing TaskReaper to take a thread dump and filter it in order to log stacktraces from the exact task thread that we are waiting to finish. If the task has not stopped after a configurable timeout then the TaskReaper will throw an exception to trigger executor JVM death, thereby forcibly freeing any resources consumed by the zombie tasks.
      
      This feature is flagged off by default and is controlled by four new configurations under the `spark.task.reaper.*` namespace. See the updated `configuration.md` doc for details.
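
      As a hedged sketch of turning the feature on (only the `spark.task.reaper.*` namespace is stated above; the specific key names below are assumptions to be checked against `configuration.md`):

      ```scala
      import org.apache.spark.SparkConf

      // Enable the task reaper; the exact keys under spark.task.reaper.* are assumed here.
      val conf = new SparkConf()
        .set("spark.task.reaper.enabled", "true")
        .set("spark.task.reaper.killTimeout", "60s")
      ```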
      
      ## How was this patch tested?
      
      Tested via a new test case in `JobCancellationSuite`, plus manual testing.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #16189 from JoshRosen/cancellation.
      fa829ce2
  20. Dec 18, 2016
  21. Dec 17, 2016
  22. Dec 16, 2016
    • Michal Senkyr's avatar
      [SPARK-18723][DOC] Expanded programming guide information on wholeTex… · 836c95b1
      Michal Senkyr authored
      ## What changes were proposed in this pull request?
      
      Adds additional information on wholeTextFiles to the Programming Guide. Also explains how its partitioning policy differs from that of textFile and the impact on performance.
      
      Also adds a reference to the underlying CombineFileInputFormat.
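
      To illustrate the difference (hypothetical path, and assuming an existing SparkContext `sc`):

      ```scala
      // textFile: one record per line, partitions driven by input splits.
      val perLine = sc.textFile("hdfs:///data/logs")

      // wholeTextFiles: one (path, content) record per file; many small files are
      // packed into fewer partitions via the underlying CombineFileInputFormat.
      val perFile = sc.wholeTextFiles("hdfs:///data/logs", minPartitions = 4)
      ```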
      
      ## How was this patch tested?
      
      Manual build of documentation and inspection in browser
      
      ```
      cd docs
      jekyll serve --watch
      ```
      
      Author: Michal Senkyr <mike.senkyr@gmail.com>
      
      Closes #16157 from michalsenkyr/wholeTextFilesExpandedDocs.
      836c95b1
  23. Dec 15, 2016
    • Imran Rashid's avatar
      [SPARK-8425][CORE] Application Level Blacklisting · 93cdb8a7
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      This builds upon the blacklisting introduced in SPARK-17675 to add blacklisting of executors and nodes for an entire Spark application. Resources are blacklisted based on tasks that fail in tasksets that eventually complete successfully; they are automatically returned to the pool of active resources after a timeout. Full details are available in a design doc attached to the JIRA.
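
      A hedged sketch of enabling it (the flag name and thresholds below are assumptions; see the design doc and `configuration.md` for the authoritative list):

      ```scala
      import org.apache.spark.SparkConf

      // Application-level blacklisting is off by default; these key names are assumed.
      val conf = new SparkConf()
        .set("spark.blacklist.enabled", "true")
        .set("spark.blacklist.application.maxFailedTasksPerExecutor", "2")
        .set("spark.blacklist.timeout", "1h")
      ```
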
      ## How was this patch tested?
      
      Added unit tests, ran them via Jenkins, also ran a handful of them in a loop to check for flakiness.
      
      The added tests include:
      - verifying BlacklistTracker works correctly
      - verifying TaskSchedulerImpl interacts with BlacklistTracker correctly (via a mock BlacklistTracker)
      - an integration test for the entire scheduler with blacklisting in a few different scenarios
      
      Author: Imran Rashid <irashid@cloudera.com>
      Author: mwws <wei.mao@intel.com>
      
      Closes #14079 from squito/blacklist-SPARK-8425.
      93cdb8a7
  24. Dec 14, 2016
  25. Dec 12, 2016
    • Marcelo Vanzin's avatar
      [SPARK-18773][CORE] Make commons-crypto config translation consistent. · bc59951b
      Marcelo Vanzin authored
      This change moves the logic that translates Spark configuration to
      commons-crypto configuration to the network-common module. It also
      extends TransportConf and ConfigProvider to provide the necessary
      interfaces for the translation to work.
      
      As part of the change, I removed SystemPropertyConfigProvider, which
      was mostly used as an "empty config" in unit tests, and adjusted the
      very few tests that required a specific config.
      
      I also changed the config keys for AES encryption to live under the
      "spark.network." namespace, which is more correct than their previous
      names under "spark.authenticate.".
      
      Tested via existing unit tests.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16200 from vanzin/SPARK-18773.
      bc59951b
    • Bill Chambers's avatar
      [DOCS][MINOR] Clarify Where AccumulatorV2s are Displayed · 70ffff21
      Bill Chambers authored
      ## What changes were proposed in this pull request?
      
      This PR clarifies where accumulators will be displayed.
      
      ## How was this patch tested?
      
      No testing.
      
      Author: Bill Chambers <bill@databricks.com>
      Author: anabranch <wac.chambers@gmail.com>
      Author: Bill Chambers <wchambers@ischool.berkeley.edu>
      
      Closes #16180 from anabranch/improve-acc-docs.
      70ffff21
  26. Dec 10, 2016
  27. Dec 09, 2016
    • Xiangrui Meng's avatar
      [SPARK-18812][MLLIB] explain "Spark ML" · d2493a20
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      There has been some confusion around "Spark ML" vs. "MLlib". This PR adds some FAQ-like entries to the MLlib user guide to explain "Spark ML" and reduce the confusion.
      
      I checked the [Spark FAQ page](http://spark.apache.org/faq.html), which seems too high-level for this content, so I added the entries to the MLlib user guide instead.
      
      cc: mateiz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #16241 from mengxr/SPARK-18812.
      d2493a20
    • Jacek Laskowski's avatar
      [MINOR][CORE][SQL][DOCS] Typo fixes · b162cc0c
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      Typo fixes
      
      ## How was this patch tested?
      
      Local build. Awaiting the official build.
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #16144 from jaceklaskowski/typo-fixes.
      b162cc0c
  28. Dec 08, 2016
    • Yanbo Liang's avatar
      [SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide · 9bf8f3cd
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      * Add all R examples for ML wrappers which were added during the 2.1 release cycle.
      * Split the whole ```ml.R``` example file into individual examples for each algorithm, which will make it convenient for users to rerun them.
      * Add corresponding examples to ML user guide.
      * Update ML section of SparkR user guide.
      
      Note: MLlib Scala/Java/Python examples will be consistent; however, SparkR examples may differ from them, since R users may use the algorithms in a different way, for example, using an R ```formula``` to specify ```featuresCol``` and ```labelCol```.
      
      ## How was this patch tested?
      Ran all examples manually.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16148 from yanboliang/spark-18325.
      9bf8f3cd