  1. Jul 26, 2017
    • jinxing's avatar
      [SPARK-21530] Update description of spark.shuffle.maxChunksBeingTransferred. · cfb25b27
      jinxing authored
      ## What changes were proposed in this pull request?
      
Update the description of `spark.shuffle.maxChunksBeingTransferred` to note that new incoming connections will be closed when the max is hit and that the client should have a retry mechanism.
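
As an illustrative sketch (not part of this change), the option could be set like any other Spark property; the value below is arbitrary:

```scala
import org.apache.spark.SparkConf

// Illustrative sketch: cap the number of chunks being transferred concurrently
// on the shuffle service; connections arriving beyond the cap are closed and
// clients are expected to retry. The value 5000 is arbitrary.
val conf = new SparkConf()
  .set("spark.shuffle.maxChunksBeingTransferred", "5000")
```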
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #18735 from jinxing64/SPARK-21530.
      cfb25b27
    • hyukjinkwon's avatar
      [SPARK-21485][SQL][DOCS] Spark SQL documentation generation for built-in functions · 60472dbf
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This generates a documentation for Spark SQL built-in functions.
      
One drawback is that this requires a proper build to generate the built-in function list.
Once built, generating the docs only takes a few seconds via `sql/create-docs.sh`.
      
Please see https://spark-test.github.io/sparksqldoc/, which I hosted to show the output documentation.
      
There is a bit more work to be done to make the documentation prettier, for example separating `Arguments:` and `Examples:`, but I guess this should be done within `ExpressionDescription` and `ExpressionInfo` rather than by manually parsing the text. I will fix these in a follow-up.
      
This requires `pip install mkdocs` to generate HTML from the markdown files.
      
      ## How was this patch tested?
      
      Manually tested:
      
      ```
      cd docs
      jekyll build
      ```
      
      ```
      cd docs
      jekyll serve
      ```
      
      and
      
      ```
      cd sql
      create-docs.sh
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18702 from HyukjinKwon/SPARK-21485.
      60472dbf
  2. Jul 25, 2017
    • jinxing's avatar
      [SPARK-21175] Reject OpenBlocks when memory shortage on shuffle service. · 799e1316
      jinxing authored
      ## What changes were proposed in this pull request?
      
A shuffle service can serve blocks from multiple apps/tasks, so it can suffer high memory usage when many shuffle reads happen at the same time. In my cluster, OOM always happens on the shuffle service. Analyzing a heap dump, the memory used by Netty (ChannelOutboundBufferEntry) can be up to 2~3 GB. It might make sense to reject "open blocks" requests when memory usage on the shuffle service is high.

https://github.com/apache/spark/commit/93dd0c518d040155b04e5ab258c5835aec7776fc and https://github.com/apache/spark/commit/85c6ce61930490e2247fb4b0e22dfebbb8b6a1ee tried to alleviate the memory pressure on the shuffle service but could not solve the root cause. This PR proposes to control the concurrency of shuffle reads.
      
      ## How was this patch tested?
      Added unit test.
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #18388 from jinxing64/SPARK-21175.
      799e1316
    • Trueman's avatar
      [SPARK-21498][EXAMPLES] quick start -> one py demo have some bug in code · 996a809c
      Trueman authored
I found a bug in the quick start guide and created a new issue; Sean Owen asked
me to make a pull request, and here it is.
      
      ## What changes were proposed in this pull request?
      
      (Please fill in changes proposed in this fix)
      
      ## How was this patch tested?
      
      (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
      (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Trueman <lizhaoch@users.noreply.github.com>
      Author: lizhaoch <lizhaoc@163.com>
      
      Closes #18722 from lizhaoch/master.
      996a809c
    • Yash Sharma's avatar
      [SPARK-20855][Docs][DStream] Update the Spark kinesis docs to use the... · 4f77c062
      Yash Sharma authored
      [SPARK-20855][Docs][DStream] Update the Spark kinesis docs to use the KinesisInputDStream builder instead of deprecated KinesisUtils
      
      ## What changes were proposed in this pull request?
      
      The examples and docs for Spark-Kinesis integrations use the deprecated KinesisUtils. We should update the docs to use the KinesisInputDStream builder to create DStreams.
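
For reference, a rough sketch of the builder style being documented, based on the 2.2-era API (stream, region, and app names are placeholders, and an existing SparkContext `sc` is assumed):

```scala
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisInputDStream

// Sketch: create a Kinesis DStream via the builder rather than the deprecated
// KinesisUtils.createStream. All names and URLs below are placeholders.
val ssc = new StreamingContext(sc, Seconds(10))
val kinesisStream = KinesisInputDStream.builder
  .streamingContext(ssc)
  .streamName("myKinesisStream")
  .endpointUrl("https://kinesis.us-east-1.amazonaws.com")
  .regionName("us-east-1")
  .initialPositionInStream(InitialPositionInStream.LATEST)
  .checkpointAppName("myKinesisApp")
  .checkpointInterval(Seconds(10))
  .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
  .build()
```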
      
      ## How was this patch tested?
      
The patch primarily updates the documents. It will also need to change the Spark-Kinesis examples, and those examples need to be tested.
      
      Author: Yash Sharma <ysharma@atlassian.com>
      
      Closes #18071 from yssharma/ysharma/kinesis_docs.
      4f77c062
  3. Jul 21, 2017
    • Holden Karau's avatar
      [SPARK-21434][PYTHON][DOCS] Add pyspark pip documentation. · cc00e99d
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Update the Quickstart and RDD programming guides to mention pip.
      
      ## How was this patch tested?
      
      Built docs locally.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #18698 from holdenk/SPARK-21434-add-pyspark-pip-documentation.
      cc00e99d
    • Liang-Chi Hsieh's avatar
      [MINOR][SS][DOCS] Minor doc change for kafka integration · c57dfaef
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Minor change to kafka integration document for structured streaming.
      
      ## How was this patch tested?
      
      N/A, doc change only.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18550 from viirya/minor-ss-kafka-doc.
      c57dfaef
  4. Jul 20, 2017
    • hyukjinkwon's avatar
      [MINOR][DOCS] Fix some missing notes for Python 2.6 support drop · 5b61cc6d
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
After SPARK-12661, we officially dropped Python 2.6 support. It looks like a few places are missing this note.
      
      I grepped "Python 2.6" and "python 2.6" and the results were below:
      
      ```
      ./core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala:  // Unpickle array.array generated by Python 2.6
      ./docs/index.md:Note that support for Java 7, Python 2.6 and old Hadoop versions before 2.6.5 were removed as of Spark 2.2.0.
      ./docs/rdd-programming-guide.md:Spark {{site.SPARK_VERSION}} works with Python 2.6+ or Python 3.4+. It can use the standard CPython interpreter,
      ./docs/rdd-programming-guide.md:Note that support for Python 2.6 is deprecated as of Spark 2.0.0, and may be removed in Spark 2.2.0.
      ./python/pyspark/context.py:            warnings.warn("Support for Python 2.6 is deprecated as of Spark 2.0.0")
      ./python/pyspark/ml/tests.py:        sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
      ./python/pyspark/mllib/tests.py:        sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
      ./python/pyspark/serializers.py:        # On Python 2.6, we can't write bytearrays to streams, so we need to convert them
      ./python/pyspark/sql/tests.py:        sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
      ./python/pyspark/streaming/tests.py:        sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
      ./python/pyspark/tests.py:        sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
      ./python/pyspark/tests.py:        # NOTE: dict is used instead of collections.Counter for Python 2.6
      ./python/pyspark/tests.py:        # NOTE: dict is used instead of collections.Counter for Python 2.6
      ```
      
This PR only proposes to change the user-visible occurrences, as below:
      
      ```
      ./docs/rdd-programming-guide.md:Spark {{site.SPARK_VERSION}} works with Python 2.6+ or Python 3.4+. It can use the standard CPython interpreter,
      ./docs/rdd-programming-guide.md:Note that support for Python 2.6 is deprecated as of Spark 2.0.0, and may be removed in Spark 2.2.0.
      ./python/pyspark/context.py:            warnings.warn("Support for Python 2.6 is deprecated as of Spark 2.0.0")
      ```
      
      This one is already correct:
      
      ```
      ./docs/index.md:Note that support for Java 7, Python 2.6 and old Hadoop versions before 2.6.5 were removed as of Spark 2.2.0.
      ```
      
      ## How was this patch tested?
      
      ```bash
       grep -r "Python 2.6" .
       grep -r "python 2.6" .
       ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18682 from HyukjinKwon/minor-python.26.
      5b61cc6d
  5. Jul 19, 2017
    • Susan X. Huynh's avatar
      [SPARK-21456][MESOS] Make the driver failover_timeout configurable · c42ef953
      Susan X. Huynh authored
      ## What changes were proposed in this pull request?
      
      Current behavior: in Mesos cluster mode, the driver failover_timeout is set to zero. If the driver temporarily loses connectivity with the Mesos master, the framework will be torn down and all executors killed.
      
      Proposed change: make the failover_timeout configurable via a new option, spark.mesos.driver.failoverTimeout. The default value is still zero.
      
      Note: with non-zero failover_timeout, an explicit teardown is needed in some cases. This is captured in https://issues.apache.org/jira/browse/SPARK-21458
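
As a hedged sketch of how the new option might be used (the timeout value, in seconds, is arbitrary):

```scala
import org.apache.spark.SparkConf

// Sketch: let the Mesos master wait up to an hour for a disconnected driver to
// fail over before tearing down the framework. 3600.0 is an arbitrary value.
val conf = new SparkConf()
  .set("spark.mesos.driver.failoverTimeout", "3600.0")
```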
      
      ## How was this patch tested?
      
      Added a unit test to make sure the config option is set while creating the scheduler driver.
      
      Ran an integration test with mesosphere/spark showing that with a non-zero failover_timeout the Spark job finishes after a driver is disconnected from the master.
      
      Author: Susan X. Huynh <xhuynh@mesosphere.com>
      
      Closes #18674 from susanxhuynh/sh-mesos-failover-timeout.
      c42ef953
    • Dhruve Ashar's avatar
      [SPARK-21243][Core] Limit no. of map outputs in a shuffle fetch · ef617755
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
For configurations with external shuffle enabled, we have observed that if a very large number of blocks are being fetched from a remote host, it puts the NM under extra pressure and can crash it. This change introduces a configuration, `spark.reducer.maxBlocksInFlightPerAddress`, to limit the number of map outputs being fetched from a given remote address. The changes applied here are applicable for both scenarios - when external shuffle is enabled as well as disabled.
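
A minimal sketch of capping per-address fetches (the limit below is arbitrary):

```scala
import org.apache.spark.SparkConf

// Sketch: limit how many map output blocks may be fetched concurrently from a
// single remote address; 100 is an arbitrary value and the default is
// effectively unlimited.
val conf = new SparkConf()
  .set("spark.reducer.maxBlocksInFlightPerAddress", "100")
```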
      
      ## How was this patch tested?
Ran the job with the default configuration, which does not change the existing behavior, and with a few lower values (10, 20, 50, 100). The job ran fine and there is no change in the output. (I will update the metrics related to the NM in some time.)
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #18487 from dhruve/impr/SPARK-21243.
      ef617755
  6. Jul 15, 2017
  7. Jul 13, 2017
    • Sean Owen's avatar
      [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 · 425c4ada
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove Scala 2.10 build profiles and support
      - Replace some 2.10 support in scripts with commented placeholders for 2.12 later
      - Remove deprecated API calls from 2.10 support
      - Remove usages of deprecated context bounds where possible
      - Remove Scala 2.10 workarounds like ScalaReflectionLock
      - Other minor Scala warning fixes
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17150 from srowen/SPARK-19810.
      425c4ada
  8. Jul 12, 2017
  9. Jul 10, 2017
    • Dongjoon Hyun's avatar
      [MINOR][DOC] Remove obsolete `ec2-scripts.md` · c444d108
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
Since this document has become obsolete, we had better remove it for Apache Spark 2.3.0. The original document was removed via SPARK-12735 in January 2016, and it is currently just a redirection page. The only reference on the Apache Spark website will go directly to the destination via https://github.com/apache/spark-website/pull/54.
      
      ## How was this patch tested?
      
      N/A. This is a removal of documentation.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #18578 from dongjoon-hyun/SPARK-REMOVE-EC2.
      c444d108
  10. Jul 09, 2017
    • jerryshao's avatar
      [MINOR][DOC] Improve the docs about how to correctly set configurations · 457dc9cc
      jerryshao authored
      ## What changes were proposed in this pull request?
      
Spark provides several ways to set configurations: from a configuration file, from `spark-submit` command line options, or programmatically through the `SparkConf` class. It may confuse beginners why some configurations set through `SparkConf` do not take effect. So here we add some docs to address this problem and let beginners know how to correctly set configurations.
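
For instance, a minimal sketch of the programmatic route the docs clarify (the app name and property are illustrative): settings on `SparkConf` only take effect if they are applied before the context is created.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: apply settings to SparkConf before constructing the SparkContext;
// mutating the conf afterwards does not affect the running application.
val conf = new SparkConf()
  .setAppName("config-example")  // illustrative app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
```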
      
      ## How was this patch tested?
      
      N/A
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #18552 from jerryshao/improve-doc.
      457dc9cc
  11. Jul 08, 2017
    • jinxing's avatar
      [SPARK-21343] Refine the document for spark.reducer.maxReqSizeShuffleToMem. · 062c336d
      jinxing authored
      ## What changes were proposed in this pull request?
      
In the current code, the reducer can break the old shuffle service when `spark.reducer.maxReqSizeShuffleToMem` is enabled. Let's refine the document.
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #18566 from jinxing64/SPARK-21343.
      062c336d
    • Joachim Hereth's avatar
      Mesos doc fixes · 01f183e8
      Joachim Hereth authored
      ## What changes were proposed in this pull request?
      
      Some link fixes for the documentation [Running Spark on Mesos](https://spark.apache.org/docs/latest/running-on-mesos.html):
      
* Updated the link to Mesos frameworks (projects built on top of Mesos)
* Updated the link to Mesos binaries from Mesosphere (the former link redirected to the DC/OS install page)
      
      ## How was this patch tested?
      
The documentation was built and the changed pages were manually/visually inspected.
      
      No code was changed, hence no dev tests.
      
      Since these changes are rather trivial I did not open a new JIRA ticket.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Joachim Hereth <joachim.hereth@numberfour.eu>
      
      Closes #18564 from daten-kieker/mesos_doc_fixes.
      01f183e8
    • Prashant Sharma's avatar
      [SPARK-21069][SS][DOCS] Add rate source to programming guide. · d0bfc673
      Prashant Sharma authored
      ## What changes were proposed in this pull request?
      
SPARK-20979 added a new structured streaming source: the Rate source. This patch adds the corresponding documentation to the programming guide.
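
For context, a small sketch of the Rate source being documented (assumes an existing SparkSession `spark`; the option value is illustrative):

```scala
// Sketch: the "rate" source emits rows with `timestamp` and `value` columns at
// a configurable rate, which is handy for testing streaming queries.
val rateStream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()
```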
      
      ## How was this patch tested?
      
      Tested by running jekyll locally.
      
      Author: Prashant Sharma <prashant@apache.org>
      Author: Prashant Sharma <prashsh1@in.ibm.com>
      
      Closes #18562 from ScrapCodes/spark-21069/rate-source-docs.
      d0bfc673
  12. Jul 06, 2017
    • Tathagata Das's avatar
      [SPARK-21267][SS][DOCS] Update Structured Streaming Documentation · 0217dfd2
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
A few changes to the Structured Streaming documentation:
      - Clarify that the entire stream input table is not materialized
      - Add information for Ganglia
      - Add Kafka Sink to the main docs
      - Removed a couple of leftover experimental tags
      - Added more associated reading material and talk videos.
      
In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those (cc sameeragarwal, cloud-fan).
      - Added a redirection to avoid breaking internal and possible external links.
      - Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.
      
      ## How was this patch tested?
      
      (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
      (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #18485 from tdas/SPARK-21267.
      0217dfd2
    • jerryshao's avatar
      [SPARK-21012][SUBMIT] Add glob support for resources adding to Spark · 5800144a
      jerryshao authored
      Current "--jars (spark.jars)", "--files (spark.files)", "--py-files (spark.submit.pyFiles)" and "--archives (spark.yarn.dist.archives)" only support non-glob path. This is OK for most of the cases, but when user requires to add more jars, files into Spark, it is too verbose to list one by one. So here propose to add glob path support for resources.
      
      Also improving the code of downloading resources.
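
A hedged sketch of what the change enables (the path is a placeholder):

```scala
import org.apache.spark.SparkConf

// Sketch: with glob support, a wildcard can stand in for a long list of jars.
val conf = new SparkConf()
  .set("spark.jars", "hdfs:///user/me/libs/*.jar")
```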
      
      ## How was this patch tested?
      
      UT added, also verified manually in local cluster.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #18235 from jerryshao/SPARK-21012.
      5800144a
  13. Jul 05, 2017
    • sadikovi's avatar
      [SPARK-20858][DOC][MINOR] Document ListenerBus event queue size · 960298ee
      sadikovi authored
      ## What changes were proposed in this pull request?
      
This change adds a new configuration option `spark.scheduler.listenerbus.eventqueue.size` to the configuration docs to specify the capacity of the Spark listener bus event queue. The default value is 10000.
      
      This is doc PR for [SPARK-15703](https://issues.apache.org/jira/browse/SPARK-15703).
      
I added the option to the `Scheduling` section; however, it might be more related to the `Spark UI` section.
      
      ## How was this patch tested?
      
      Manually verified correct rendering of configuration option.
      
      Author: sadikovi <ivan.sadikov@lincolnuni.ac.nz>
      Author: Ivan Sadikov <ivan.sadikov@team.telstra.com>
      
      Closes #18476 from sadikovi/SPARK-20858.
      960298ee
  14. Jun 29, 2017
    • Shixiong Zhu's avatar
      [SPARK-21253][CORE] Disable spark.reducer.maxReqSizeShuffleToMem · 80f7ac3a
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Disable spark.reducer.maxReqSizeShuffleToMem because it breaks the old shuffle service.
      
      Credits to wangyum
      
      Closes #18466
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #18467 from zsxwing/SPARK-21253.
      80f7ac3a
  15. Jun 26, 2017
    • jerryshao's avatar
      [SPARK-13669][SPARK-20898][CORE] Improve the blacklist mechanism to handle... · 9e50a1d3
      jerryshao authored
      [SPARK-13669][SPARK-20898][CORE] Improve the blacklist mechanism to handle external shuffle service unavailable situation
      
      ## What changes were proposed in this pull request?
      
      Currently we are running into an issue with Yarn work preserving enabled + external shuffle service.
      In the work preserving enabled scenario, the failure of NM will not lead to the exit of executors, so executors can still accept and run the tasks. The problem here is when NM is failed, external shuffle service is actually inaccessible, so reduce tasks will always complain about the “Fetch failure”, and the failure of reduce stage will make the parent stage (map stage) rerun. The tricky thing here is Spark scheduler is not aware of the unavailability of external shuffle service, and will reschedule the map tasks on the executor where NM is failed, and again reduce stage will be failed with “Fetch failure”, and after 4 retries, the job is failed. This could also apply to other cluster manager with external shuffle service.
      
So the main problem is that we should avoid assigning tasks to those bad executors (where the shuffle service is unavailable). Spark's current blacklist mechanism can blacklist executors/nodes based on failed tasks, but it doesn't handle this specific fetch failure scenario. So here we propose to improve the current application blacklist mechanism to handle fetch failures (especially the external-shuffle-service-unavailable case) by blacklisting the executors/nodes where shuffle fetch is unavailable.
      
      ## How was this patch tested?
      
      Unit test and small cluster verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17113 from jerryshao/SPARK-13669.
      9e50a1d3
  16. Jun 21, 2017
    • Yuming Wang's avatar
      [MINOR][DOCS] Add lost <tr> tag for configuration.md · 987eb8fa
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
Add the missing `<tr>` tag in `configuration.md`.
      
      ## How was this patch tested?
      N/A
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #18372 from wangyum/docs-missing-tr.
      987eb8fa
    • Li Yichao's avatar
      [SPARK-20640][CORE] Make rpc timeout and retry for shuffle registration configurable. · d107b3b9
      Li Yichao authored
      ## What changes were proposed in this pull request?
      
Currently the shuffle service registration timeout and retry count have been hardcoded. This works well for small workloads, but under heavy workload, when the shuffle service is busy transferring large amounts of data, we see significant delays in responding to the registration request; as a result we often see executors fail to register with the shuffle service, eventually failing the job. We need to make these two parameters configurable.
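
The summary does not spell out the new keys; assuming they are `spark.shuffle.registration.timeout` and `spark.shuffle.registration.maxAttempts` (and that the timeout is in milliseconds), a sketch might look like:

```scala
import org.apache.spark.SparkConf

// Sketch with assumed key names and arbitrary values: give a busy shuffle
// service more time and more attempts before an executor gives up registering.
val conf = new SparkConf()
  .set("spark.shuffle.registration.timeout", "10000")
  .set("spark.shuffle.registration.maxAttempts", "5")
```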
      
      ## How was this patch tested?
      
* Updated `BlockManagerSuite` to test that the registration timeout and max attempts configuration actually works.
      
      cc sitalkedia
      
      Author: Li Yichao <lyc@zhihu.com>
      
      Closes #18092 from liyichao/SPARK-20640.
      d107b3b9
  17. Jun 19, 2017
    • assafmendelson's avatar
      [SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are in a wrong table · 66a792cd
      assafmendelson authored
      ## What changes were proposed in this pull request?
      
      The description for several options of File Source for structured streaming appeared in the File Sink description instead.
      
This pull request has two commits: the first includes changes to the version as it appeared in Spark 2.1, and the second handles an additional option added for Spark 2.2.
      
      ## How was this patch tested?
      
Built the documentation with `SKIP_API=1 jekyll build` and visually inspected the Structured Streaming programming guide.
      
      The original documentation was written by tdas and lw-lin
      
      Author: assafmendelson <assaf.mendelson@gmail.com>
      
      Closes #18342 from assafmendelson/spark-21123.
      66a792cd
  18. Jun 18, 2017
  19. Jun 16, 2017
    • Yuming Wang's avatar
      [MINOR][DOCS] Improve Running R Tests docs · 45824fb6
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
Update the Running R Tests dependency packages to:
      ```bash
      R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')"
      ```
      
      ## How was this patch tested?
      manual tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #18271 from wangyum/building-spark.
      45824fb6
  20. Jun 15, 2017
    • Michael Gummelt's avatar
      [SPARK-20434][YARN][CORE] Move Hadoop delegation token code from yarn to core · a18d6371
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
Move Hadoop delegation token code from `spark-yarn` to `spark-core`, so that other schedulers (such as Mesos) may use it.  In order to avoid exposing Hadoop interfaces in spark-core, the new Hadoop delegation token classes are kept private.  In order to provide backward compatibility, and to allow YARN users to continue to load their own delegation token providers via Java service loading, the old YARN interfaces, as well as the client code that uses them, have been retained.
      
      Summary:
      - Move registered `yarn.security.ServiceCredentialProvider` classes from `spark-yarn` to `spark-core`.  Moved them into a new, private hierarchy under `HadoopDelegationTokenProvider`.  Client code in `HadoopDelegationTokenManager` now loads credentials from a whitelist of three providers (`HadoopFSDelegationTokenProvider`, `HiveDelegationTokenProvider`, `HBaseDelegationTokenProvider`), instead of service loading, which means that users are not able to implement their own delegation token providers, as they are in the `spark-yarn` module.
      
      - The `yarn.security.ServiceCredentialProvider` interface has been kept for backwards compatibility, and to continue to allow YARN users to implement their own delegation token provider implementations.  Client code in YARN now fetches tokens via the new `YARNHadoopDelegationTokenManager` class, which fetches tokens from the core providers through `HadoopDelegationTokenManager`, as well as service loads them from `yarn.security.ServiceCredentialProvider`.
      
      Old Hierarchy:
      
      ```
      yarn.security.ServiceCredentialProvider (service loaded)
        HadoopFSCredentialProvider
        HiveCredentialProvider
        HBaseCredentialProvider
      yarn.security.ConfigurableCredentialManager
      ```
      
      New Hierarchy:
      
      ```
      HadoopDelegationTokenManager
      HadoopDelegationTokenProvider (not service loaded)
        HadoopFSDelegationTokenProvider
        HiveDelegationTokenProvider
        HBaseDelegationTokenProvider
      
      yarn.security.ServiceCredentialProvider (service loaded)
      yarn.security.YARNHadoopDelegationTokenManager
      ```
      ## How was this patch tested?
      
      unit tests
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      Author: Dr. Stefan Schimanski <sttts@mesosphere.io>
      
      Closes #17723 from mgummelt/SPARK-20434-refactor-kerberos.
      a18d6371
    • Felix Cheung's avatar
      [SPARK-20980][DOCS] update doc to reflect multiLine change · 1bf55e39
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      doc only change
      
      ## How was this patch tested?
      
      manually
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #18312 from felixcheung/sqljsonwholefiledoc.
      1bf55e39
  21. Jun 12, 2017
    • Ziyue Huang's avatar
      [DOCS] Fix error: ambiguous reference to overloaded definition · e6eb02df
      Ziyue Huang authored
      ## What changes were proposed in this pull request?
      
`df.groupBy.count()` should be `df.groupBy().count()`; otherwise there is an error:
      
      ambiguous reference to overloaded definition, both method groupBy in class Dataset of type (col1: String, cols: String*) and method groupBy in class Dataset of type (cols: org.apache.spark.sql.Column*)
      
      ## How was this patch tested?
      
      ```scala
      val df = spark.readStream.schema(...).json(...)
      val dfCounts = df.groupBy().count()
      ```
      
      Author: Ziyue Huang <zyhuang94@gmail.com>
      
      Closes #18272 from ZiyueHuang/master.
      e6eb02df
  22. Jun 11, 2017
  23. Jun 09, 2017
  24. Jun 08, 2017
    • Mark Grover's avatar
      [SPARK-19185][DSTREAM] Make Kafka consumer cache configurable · 55b8cfe6
      Mark Grover authored
      ## What changes were proposed in this pull request?
      
      Add a new property `spark.streaming.kafka.consumer.cache.enabled` that allows users to enable or disable the cache for Kafka consumers. This property can be especially handy in cases where issues like SPARK-19185 get hit, for which there isn't a solution committed yet. By default, the cache is still on, so this change doesn't change any out-of-box behavior.
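
A minimal sketch of turning the cache off when an issue like SPARK-19185 is hit:

```scala
import org.apache.spark.SparkConf

// Sketch: disable the Kafka consumer cache; it remains enabled by default.
val conf = new SparkConf()
  .set("spark.streaming.kafka.consumer.cache.enabled", "false")
```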
      
      ## How was this patch tested?
      Running unit tests
      
      Author: Mark Grover <mark@apache.org>
      Author: Mark Grover <grover.markgrover@gmail.com>
      
      Closes #18234 from markgrover/spark-19185.
      55b8cfe6
  25. Jun 07, 2017
  26. Jun 05, 2017
    • jerryshao's avatar
      [SPARK-20981][SPARKSUBMIT] Add new configuration spark.jars.repositories as... · 06c05441
      jerryshao authored
      [SPARK-20981][SPARKSUBMIT] Add new configuration spark.jars.repositories as equivalence of --repositories
      
      ## What changes were proposed in this pull request?
      
In our use case of launching Spark applications via REST APIs (Livy), there is no way for the user to specify command line arguments; all Spark configurations are set through a configuration map. Because there is no Spark configuration equivalent to "--repositories", we cannot specify a custom repository through configuration.

So here we propose to add a "--repositories"-equivalent configuration in Spark.
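
As an illustrative sketch (the repository URL and package coordinate are placeholders), the new option mirrors `--repositories`:

```scala
import org.apache.spark.SparkConf

// Sketch: point dependency resolution at an extra repository, equivalent to
// passing --repositories on the spark-submit command line.
val conf = new SparkConf()
  .set("spark.jars.repositories", "https://repo.example.com/maven2")
  .set("spark.jars.packages", "com.example:mylib:1.0.0")  // hypothetical coordinate
```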
      
      ## How was this patch tested?
      
      New UT added.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #18201 from jerryshao/SPARK-20981.
      06c05441
  27. May 26, 2017
    • zero323's avatar
      [SPARK-20694][DOCS][SQL] Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide · ae33abf7
      zero323 authored
      ## What changes were proposed in this pull request?
      
- Add Scala, Python and Java examples for `partitionBy`, `sortBy` and `bucketBy` (a short sketch follows this list).
      - Add _Bucketing, Sorting and Partitioning_ section to SQL Programming Guide
      - Remove bucketing from Unsupported Hive Functionalities.
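
A short sketch of the three writer options being documented (assumes an existing DataFrame `df` with `year` and `user_id` columns; table and column names are placeholders):

```scala
// Sketch: partition the output by a column, bucket and sort within buckets,
// then persist as a table. bucketBy/sortBy require saveAsTable rather than save().
df.write
  .partitionBy("year")
  .bucketBy(42, "user_id")
  .sortBy("user_id")
  .saveAsTable("events_bucketed")
```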
      
      ## How was this patch tested?
      
      Manual tests, docs build.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17938 from zero323/DOCS-BUCKETING-AND-PARTITIONING.
      ae33abf7
    • Michael Armbrust's avatar
      [SPARK-20844] Remove experimental from Structured Streaming APIs · d935e0a9
      Michael Armbrust authored
Now that Structured Streaming has been out for several Spark releases and has large production use cases, the `Experimental` label is no longer appropriate.  I've left `InterfaceStability.Evolving`, however, as I think we may make a few changes to the pluggable Source & Sink API in Spark 2.3.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #18065 from marmbrus/streamingGA.
      d935e0a9
    • Zheng RuiFeng's avatar
      [SPARK-20849][DOC][SPARKR] Document R DecisionTree · a97c4970
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
1. Add an example for the SparkR `decisionTree`.
2. Document it in the user guide.
      
      ## How was this patch tested?
      local submit
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #18067 from zhengruifeng/dt_example.
      a97c4970