  1. Sep 07, 2017
  2. Sep 06, 2017
    • [SPARK-19357][ML] Adding parallel model evaluation in ML tuning · 16c4c03c
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Modified `CrossValidator` and `TrainValidationSplit` to be able to evaluate models in parallel for a given parameter grid.  The level of parallelism is controlled by a parameter `numParallelEval`, which schedules a number of models to be trained/evaluated so that the jobs can run concurrently.  This is a naive approach that does not check the cluster for needed resources, so care must be taken by the user to tune the parameter appropriately.  The default value is `1`, which trains/evaluates serially.
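      For illustration, a minimal Scala sketch of the new knob; the setter name `setNumParallelEval` is assumed from the param name above and is not taken from the patch:

      ```scala
      import org.apache.spark.ml.classification.LogisticRegression
      import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
      import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

      val lr = new LogisticRegression()
      val grid = new ParamGridBuilder()
        .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
        .build()

      // Hypothetical setter derived from the numParallelEval param described above.
      val cv = new CrossValidator()
        .setEstimator(lr)
        .setEvaluator(new BinaryClassificationEvaluator())
        .setEstimatorParamMaps(grid)
        .setNumFolds(3)
        .setNumParallelEval(2)  // schedule up to 2 models to train/evaluate concurrently; 1 = serial
      ```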
      
      ## How was this patch tested?
      Added unit tests for CrossValidator and TrainValidationSplit to verify that model selection is the same when run in serial vs parallel.  Manual testing to verify tasks run in parallel when param is > 1. Added parameter usage to relevant examples.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #16774 from BryanCutler/parallel-model-eval-SPARK-19357.
      16c4c03c
    • [SPARK-21924][DOCS] Update structured streaming programming guide doc · 4ee7dfe4
      Riccardo Corbella authored
      ## What changes were proposed in this pull request?
      
      Update the line "For example, the data (12:09, cat) is out of order and late, and it falls in windows 12:05 - 12:15 and 12:10 - 12:20." to read "For example, the data (12:09, cat) is out of order and late, and it falls in windows 12:00 - 12:10 and 12:05 - 12:15." in the Structured Streaming programming guide.
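      For reference, the 10-minute windows sliding every 5 minutes in that example correspond roughly to the following (column names illustrative, `words` an existing streaming DataFrame):

      ```scala
      import org.apache.spark.sql.functions.{col, window}

      // An event stamped 12:09 falls into the windows 12:00 - 12:10 and 12:05 - 12:15,
      // as the corrected sentence states.
      val windowedCounts = words
        .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("word"))
        .count()
      ```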
      
      Author: Riccardo Corbella <r.corbella@reply.it>
      
      Closes #19137 from riccardocorbella/bugfix.
      4ee7dfe4
  3. Sep 05, 2017
  4. Aug 31, 2017
    • [SPARK-20812][MESOS] Add secrets support to the dispatcher · fc45c2c8
      ArtRand authored
      Mesos has secrets primitives for environment- and file-based secrets; this PR adds that functionality to the Spark dispatcher along with the appropriate configuration flags.
      Unit tested and manually tested against a DC/OS cluster with Mesos 1.4.
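      For illustration only, a sketch of how such flags might be set when submitting through the dispatcher; the configuration key names below are assumptions, not taken from this patch:

      ```scala
      import org.apache.spark.SparkConf

      // Assumed key names: expose a secret to the driver either as an environment
      // variable or as a file inside the container.
      val conf = new SparkConf()
        .set("spark.mesos.driver.secret.names", "db-password")         // assumed
        .set("spark.mesos.driver.secret.envkeys", "DB_PASSWORD")       // assumed
        .set("spark.mesos.driver.secret.filenames", "db-password.txt") // assumed
      ```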
      
      Author: ArtRand <arand@soe.ucsc.edu>
      
      Closes #18837 from ArtRand/spark-20812-dispatcher-secrets-and-labels.
      fc45c2c8
  5. Aug 30, 2017
    • [SPARK-11574][CORE] Add metrics StatsD sink · cd5d0f33
      Xiaofeng Lin authored
      This patch adds a StatsD sink to the current metrics system in Spark core.
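      A hedged sketch of wiring the sink up: metrics sinks are normally configured in `conf/metrics.properties`, but the same keys can be passed with the `spark.metrics.conf.` prefix; the sink class name below is an assumption about what this patch adds:

      ```scala
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .set("spark.metrics.conf.*.sink.statsd.class",
             "org.apache.spark.metrics.sink.StatsdSink")   // assumed class name
        .set("spark.metrics.conf.*.sink.statsd.host", "127.0.0.1")
        .set("spark.metrics.conf.*.sink.statsd.port", "8125")
      ```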
      
      Author: Xiaofeng Lin <xlin@twilio.com>
      
      Closes #9518 from xflin/statsd.
      
      Change-Id: Ib8720e86223d4a650df53f51ceb963cd95b49a44
      cd5d0f33
    • [SPARK-21469][ML][EXAMPLES] Adding Examples for FeatureHasher · 4133c1b0
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      This PR adds ML examples for the FeatureHasher transform in Scala, Java, Python.
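      A minimal Scala sketch in the spirit of the added examples (column names illustrative, `dataset` an existing DataFrame):

      ```scala
      import org.apache.spark.ml.feature.FeatureHasher

      // Hash a mix of numeric, boolean and string columns into a single feature vector.
      val hasher = new FeatureHasher()
        .setInputCols("real", "bool", "stringNum", "string")
        .setOutputCol("features")

      val featurized = hasher.transform(dataset)
      featurized.select("features").show(false)
      ```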
      
      ## How was this patch tested?
      
      Manually ran examples and verified that output is consistent for different APIs
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #19024 from BryanCutler/ml-examples-FeatureHasher-SPARK-21810.
      4133c1b0
  6. Aug 28, 2017
    • [SPARK-19662][SCHEDULER][TEST] Add Fair Scheduler Unit Test coverage for different build cases · 73e64f7d
      erenavsarogullari authored
      ## What changes were proposed in this pull request?
      Fair Scheduler can be built via one of the following options:
      - By setting a `spark.scheduler.allocation.file` property,
      - By placing a `fairscheduler.xml` file on the classpath.
      
      These options are checked **in order** and the fair scheduler is built from the first option found. If an invalid path is given, a `FileNotFoundException` is expected.
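      A minimal sketch of the first option (file path illustrative):

      ```scala
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .set("spark.scheduler.mode", "FAIR")
        .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
      // If spark.scheduler.allocation.file is not set, a fairscheduler.xml found on the
      // classpath is used instead, per the order described above.
      ```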
      
      This PR adds unit test coverage for these use cases, and a minor documentation change has been added for the second option (`fairscheduler.xml` on the classpath) to inform users.
      
      Also, this PR is related to #16813 and has been created separately to keep the patch content isolated and to help the reviewers.
      
      ## How was this patch tested?
      Added new Unit Tests.
      
      Author: erenavsarogullari <erenavsarogullari@gmail.com>
      
      Closes #16992 from erenavsarogullari/SPARK-19662.
      73e64f7d
    • [SPARK-21798] No config to replace deprecated SPARK_CLASSPATH config for... · 24e6c187
      pgandhi authored
      [SPARK-21798] No config to replace deprecated SPARK_CLASSPATH config for launching daemons like History Server
      
      History Server launch uses SparkClassCommandBuilder for launching the server. It is observed that SPARK_CLASSPATH has been deprecated and removed. For spark-submit this takes a different route, and spark.driver.extraClassPath takes care of specifying additional jars on the classpath that were previously specified in SPARK_CLASSPATH. Right now the only way to specify additional jars for launching daemons such as the History Server is SPARK_DIST_CLASSPATH (https://spark.apache.org/docs/latest/hadoop-provided.html), but this, I presume, is a distribution classpath. It would be nice to have a config similar to spark.driver.extraClassPath for launching daemons such as the History Server.
      
      Added a new environment variable, SPARK_DAEMON_CLASSPATH, to set the classpath for launching daemons. Tested and verified for the History Server and standalone mode.
      
      ## How was this patch tested?
      Initially, the History Server start script would fail because it could not find the required jars for launching the server on the Java classpath. The same was true for running the Master and Worker in standalone mode. After adding the environment variable SPARK_DAEMON_CLASSPATH to the Java classpath, both kinds of daemons (History Server and the standalone daemons) start up and run.
      
      Author: pgandhi <pgandhi@yahoo-inc.com>
      Author: pgandhi999 <parthkgandhi9@gmail.com>
      
      Closes #19047 from pgandhi999/master.
      24e6c187
  7. Aug 25, 2017
    • [MINOR][DOCS] Minor doc fixes related with doc build and uses script dir in SQL doc gen script · 3b66b1c4
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes both:
      
      - Adds information about Javadoc, SQL docs and a few more details to `docs/README.md`, plus a comment in `docs/_plugins/copy_api_dirs.rb` related to Javadoc.
      
      - Adds some commands so that the script always runs the SQL docs build under the `./sql` directory (so that `./sql/create-docs.sh` can be run directly from the root directory).
      
      ## How was this patch tested?
      
      Manual tests with `jekyll build` and `./sql/create-docs.sh` in the root directory.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19019 from HyukjinKwon/minor-doc-build.
      3b66b1c4
  8. Aug 24, 2017
    • [SPARK-21694][MESOS] Support Mesos CNI network labels · ce0d3bb3
      Susan X. Huynh authored
      JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21694
      
      ## What changes were proposed in this pull request?
      
      Spark already supports launching containers attached to a given CNI network by specifying it via the config `spark.mesos.network.name`.
      
      This PR adds support to pass in network labels to CNI plugins via a new config option `spark.mesos.network.labels`. These network labels are key-value pairs that are set in the `NetworkInfo` of both the driver and executor tasks. More details in the related Mesos documentation:  http://mesos.apache.org/documentation/latest/cni/#mesos-meta-data-to-cni-plugins
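      A minimal sketch of the two options together (network name and label values illustrative; the comma-separated `key:value` label format is an assumption):

      ```scala
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .set("spark.mesos.network.name", "my-cni-network")
        .set("spark.mesos.network.labels", "rack:r1,zone:z1")  // assumed key:value,key:value format
      ```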
      
      ## How was this patch tested?
      
      Unit tests, for both driver and executor tasks.
      Manual integration test to submit a job with the `spark.mesos.network.labels` option, hit the mesos/state.json endpoint, and check that the labels are set in the driver and executor tasks.
      
      ArtRand skonto
      
      Author: Susan X. Huynh <xhuynh@mesosphere.com>
      
      Closes #18910 from susanxhuynh/sh-mesos-cni-labels.
      ce0d3bb3
  9. Aug 23, 2017
    • [SPARK-21501] Change CacheLoader to limit entries based on memory footprint · 1662e931
      Sanket Chintapalli authored
      Right now the Spark shuffle service has a cache for index files. It is based on the number of files cached (spark.shuffle.service.index.cache.entries). This can cause issues if people have a lot of reducers, because the size of each entry can fluctuate based on the number of reducers.
      We saw an issue with a job that had 170000 reducers, which caused the NM running the Spark shuffle service to use 700-800 MB of memory by itself.
      We should change this cache to be memory based and only allow a certain memory size to be used. By memory based I mean the cache should have a limit of, say, 100 MB.
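      For illustration only, assuming the new memory-based limit is exposed as a size-valued setting (the key name below is an assumption, not taken from this patch):

      ```scala
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .set("spark.shuffle.service.index.cache.size", "100m")  // assumed key; cap the cache near 100 MB
      ```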
      
      https://issues.apache.org/jira/browse/SPARK-21501
      
      Manual testing with 170000 reducers has been performed with the cache loaded up to the max 100 MB default limit, with each shuffle index file of size 1.3 MB. Eviction takes place as soon as the total cache size reaches the 100 MB limit, and the objects become ready for garbage collection, thereby avoiding an NM crash. No notable difference in runtime has been observed.
      
      Author: Sanket Chintapalli <schintap@yahoo-inc.com>
      
      Closes #18940 from redsanket/SPARK-21501.
      1662e931
  10. Aug 20, 2017
    • [SPARK-21773][BUILD][DOCS] Installs mkdocs if missing in the path in SQL documentation build · 41e0eb71
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to install `mkdocs` via `pip install` if it is missing from the path, mainly to fix Jenkins's documentation build failure in `spark-master-docs`. See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-docs/3580/console.
      
      It also adds `mkdocs` as requirements in `docs/README.md`.
      
      ## How was this patch tested?
      
      I manually ran `jekyll build` under `docs` directory after manually removing `mkdocs` via `pip uninstall mkdocs`.
      
      Also, I tested this in the same way on CentOS Linux release 7.3.1611 (Core), where I had built Spark a few times but had never built the documentation before and `mkdocs` was not installed.
      
      ```
      ...
      Moving back into docs dir.
      Moving to SQL directory and building docs.
      Missing mkdocs in your path, trying to install mkdocs for SQL documentation generation.
      Collecting mkdocs
        Downloading mkdocs-0.16.3-py2.py3-none-any.whl (1.2MB)
          100% |████████████████████████████████| 1.2MB 574kB/s
      Requirement already satisfied: PyYAML>=3.10 in /usr/lib64/python2.7/site-packages (from mkdocs)
      Collecting livereload>=2.5.1 (from mkdocs)
        Downloading livereload-2.5.1-py2-none-any.whl
      Collecting tornado>=4.1 (from mkdocs)
        Downloading tornado-4.5.1.tar.gz (483kB)
          100% |████████████████████████████████| 491kB 1.4MB/s
      Collecting Markdown>=2.3.1 (from mkdocs)
        Downloading Markdown-2.6.9.tar.gz (271kB)
          100% |████████████████████████████████| 276kB 2.4MB/s
      Collecting click>=3.3 (from mkdocs)
        Downloading click-6.7-py2.py3-none-any.whl (71kB)
          100% |████████████████████████████████| 71kB 2.8MB/s
      Requirement already satisfied: Jinja2>=2.7.1 in /usr/lib/python2.7/site-packages (from mkdocs)
      Requirement already satisfied: six in /usr/lib/python2.7/site-packages (from livereload>=2.5.1->mkdocs)
      Requirement already satisfied: backports.ssl_match_hostname in /usr/lib/python2.7/site-packages (from tornado>=4.1->mkdocs)
      Collecting singledispatch (from tornado>=4.1->mkdocs)
        Downloading singledispatch-3.4.0.3-py2.py3-none-any.whl
      Collecting certifi (from tornado>=4.1->mkdocs)
        Downloading certifi-2017.7.27.1-py2.py3-none-any.whl (349kB)
          100% |████████████████████████████████| 358kB 2.1MB/s
      Collecting backports_abc>=0.4 (from tornado>=4.1->mkdocs)
        Downloading backports_abc-0.5-py2.py3-none-any.whl
      Requirement already satisfied: MarkupSafe>=0.23 in /usr/lib/python2.7/site-packages (from Jinja2>=2.7.1->mkdocs)
      Building wheels for collected packages: tornado, Markdown
        Running setup.py bdist_wheel for tornado ... done
        Stored in directory: /root/.cache/pip/wheels/84/83/cd/6a04602633457269d161344755e6766d24307189b7a67ff4b7
        Running setup.py bdist_wheel for Markdown ... done
        Stored in directory: /root/.cache/pip/wheels/bf/46/10/c93e17ae86ae3b3a919c7b39dad3b5ccf09aeb066419e5c1e5
      Successfully built tornado Markdown
      Installing collected packages: singledispatch, certifi, backports-abc, tornado, livereload, Markdown, click, mkdocs
      Successfully installed Markdown-2.6.9 backports-abc-0.5 certifi-2017.7.27.1 click-6.7 livereload-2.5.1 mkdocs-0.16.3 singledispatch-3.4.0.3 tornado-4.5.1
      Generating markdown files for SQL documentation.
      Generating HTML files for SQL documentation.
      INFO    -  Cleaning site directory
      INFO    -  Building documentation to directory: .../spark/sql/site
      Moving back into docs dir.
      Making directory api/sql
      cp -r ../sql/site/. api/sql
                  Source: .../spark/docs
             Destination: .../spark/docs/_site
            Generating...
                          done.
       Auto-regeneration: disabled. Use --watch to enable.
       ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18984 from HyukjinKwon/sql-doc-mkdocs.
      41e0eb71
  11. Aug 11, 2017
  12. Aug 08, 2017
  13. Aug 07, 2017
  14. Aug 05, 2017
    • [SPARK-21637][SPARK-21451][SQL] get `spark.hadoop.*` properties from sysProps to hiveconf · 41568e9a
      hzyaoqin authored
      ## What changes were proposed in this pull request?
      When we run the `bin/spark-sql` command with `--conf spark.hadoop.foo=bar`, the `SparkSQLCliDriver` initializes an instance of HiveConf but does not add `foo -> bar` to it.
      This PR copies `spark.hadoop.*` properties from sysProps into that HiveConf.
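      A rough sketch of the idea (not the actual SparkSQLCliDriver code): copy `spark.hadoop.*` system properties into a Hadoop/Hive configuration with the prefix stripped.

      ```scala
      import org.apache.hadoop.conf.Configuration

      def copySparkHadoopProps(hadoopConf: Configuration): Unit = {
        sys.props.foreach { case (key, value) =>
          if (key.startsWith("spark.hadoop.")) {
            // spark.hadoop.foo=bar becomes foo=bar in the Hadoop/Hive conf
            hadoopConf.set(key.stripPrefix("spark.hadoop."), value)
          }
        }
      }
      ```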
      
      ## How was this patch tested?
      UT
      
      Author: hzyaoqin <hzyaoqin@corp.netease.com>
      Author: Kent Yao <yaooqinn@hotmail.com>
      
      Closes #18668 from yaooqinn/SPARK-21451.
      41568e9a
  15. Aug 03, 2017
  16. Aug 01, 2017
    • [SPARK-21593][DOCS] Fix 2 rendering errors on configuration page · b1d59e60
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix 2 rendering errors on configuration doc page, due to SPARK-21243 and SPARK-15355.
      
      ## How was this patch tested?
      
      Manually built and viewed docs with jekyll
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18793 from srowen/SPARK-21593.
      b1d59e60
    • [SPARK-21589][SQL][DOC] Add documents about Hive UDF/UDTF/UDAF · 110695db
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR adds documentation about unsupported functions in Hive UDF/UDTF/UDAF.
      It relates to #18768 and #18527.
      
      ## How was this patch tested?
      N/A
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #18792 from maropu/HOTFIX-20170731.
      110695db
  17. Jul 30, 2017
    • [MINOR][DOC] Replace numTasks with numPartitions in programming guide · 6830e90d
      Cheng Wang authored
      In the programming guide, `numTasks` is used in several places as an argument to transformations. However, in the code, `numPartitions` is used. In this fix, I replace `numTasks` with `numPartitions` in the programming guide for consistency.
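      For example, the second argument of `reduceByKey` below is `numPartitions` (the number of result partitions), which the guide previously called `numTasks`:

      ```scala
      // pairs: an existing RDD[(String, Int)]
      val counts = pairs.reduceByKey(_ + _, 10)  // 10 = numPartitions
      ```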
      
      Author: Cheng Wang <chengwang0511@gmail.com>
      
      Closes #18774 from polarke/replace-numtasks-with-numpartitions-in-doc.
      6830e90d
  18. Jul 29, 2017
  19. Jul 26, 2017
    • [SPARK-21530] Update description of spark.shuffle.maxChunksBeingTransferred. · cfb25b27
      jinxing authored
      ## What changes were proposed in this pull request?
      
      Update the description of `spark.shuffle.maxChunksBeingTransferred` to note that new incoming connections will be closed when the max is hit and that the client should have a retry mechanism.
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #18735 from jinxing64/SPARK-21530.
      cfb25b27
    • [SPARK-21485][SQL][DOCS] Spark SQL documentation generation for built-in functions · 60472dbf
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This generates a documentation for Spark SQL built-in functions.
      
      One drawback is that this requires a proper build to generate the built-in function list.
      Once it is built, generating the docs takes only a few seconds via `sql/create-docs.sh`.
      
      Please see https://spark-test.github.io/sparksqldoc/ that I hosted to show the output documentation.
      
      There is a bit more work to be done to make the documentation pretty (for example, separating `Arguments:` and `Examples:`), but I guess this should be done within `ExpressionDescription` and `ExpressionInfo` rather than by manually parsing them. I will fix these in a follow-up.
      
      This requires `pip install mkdocs` to generate HTMLs from markdown files.
      
      ## How was this patch tested?
      
      Manually tested:
      
      ```
      cd docs
      jekyll build
      ```
      ,
      
      ```
      cd docs
      jekyll serve
      ```
      
      and
      
      ```
      cd sql
      create-docs.sh
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18702 from HyukjinKwon/SPARK-21485.
      60472dbf
  20. Jul 25, 2017
    • [SPARK-21175] Reject OpenBlocks when memory shortage on shuffle service. · 799e1316
      jinxing authored
      ## What changes were proposed in this pull request?
      
      A shuffle service can serve blocks from multiple apps/tasks, so the shuffle service can suffer high memory usage when lots of shuffle reads happen at the same time. In my cluster, OOM always happens on the shuffle service. Analyzing a heap dump, the memory cost of Netty (ChannelOutboundBufferEntry) can be up to 2~3 GB. It might make sense to reject "open blocks" requests when memory usage is high on the shuffle service.
      
      https://github.com/apache/spark/commit/93dd0c518d040155b04e5ab258c5835aec7776fc and https://github.com/apache/spark/commit/85c6ce61930490e2247fb4b0e22dfebbb8b6a1ee tried to alleviate the memory pressure on the shuffle service but could not solve the root cause. This PR proposes to control the concurrency of shuffle reads.
      
      ## How was this patch tested?
      Added unit test.
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #18388 from jinxing64/SPARK-21175.
      799e1316
    • [SPARK-21498][EXAMPLES] quick start -> one py demo have some bug in code · 996a809c
      Trueman authored
      I found a bug in the 'quick start' guide and created a new issue; Sean Owen asked me to make a pull request, and I did.
      
      ## What changes were proposed in this pull request?
      
      (Please fill in changes proposed in this fix)
      
      ## How was this patch tested?
      
      (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
      (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Trueman <lizhaoch@users.noreply.github.com>
      Author: lizhaoch <lizhaoc@163.com>
      
      Closes #18722 from lizhaoch/master.
      996a809c
    • [SPARK-20855][Docs][DStream] Update the Spark kinesis docs to use the... · 4f77c062
      Yash Sharma authored
      [SPARK-20855][Docs][DStream] Update the Spark kinesis docs to use the KinesisInputDStream builder instead of deprecated KinesisUtils
      
      ## What changes were proposed in this pull request?
      
      The examples and docs for Spark-Kinesis integrations use the deprecated KinesisUtils. We should update the docs to use the KinesisInputDStream builder to create DStreams.
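      A rough sketch of the builder style the updated docs describe; treat the exact method names as approximate since they vary slightly across Spark versions, and the stream/app names are illustrative:

      ```scala
      import org.apache.spark.storage.StorageLevel
      import org.apache.spark.streaming.Seconds
      import org.apache.spark.streaming.kinesis.KinesisInputDStream

      val kinesisStream = KinesisInputDStream.builder
        .streamingContext(ssc)                // ssc: an existing StreamingContext
        .streamName("my-kinesis-stream")
        .endpointUrl("https://kinesis.us-east-1.amazonaws.com")
        .regionName("us-east-1")
        .checkpointAppName("my-kinesis-app")
        .checkpointInterval(Seconds(10))
        .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
        .build()
      ```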
      
      ## How was this patch tested?
      
      The patch primarily updates the documents. The patch will also need to make changes to the Spark-Kinesis examples. The examples need to be tested.
      
      Author: Yash Sharma <ysharma@atlassian.com>
      
      Closes #18071 from yssharma/ysharma/kinesis_docs.
      4f77c062
  21. Jul 21, 2017
    • [SPARK-21434][PYTHON][DOCS] Add pyspark pip documentation. · cc00e99d
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Update the Quickstart and RDD programming guides to mention pip.
      
      ## How was this patch tested?
      
      Built docs locally.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #18698 from holdenk/SPARK-21434-add-pyspark-pip-documentation.
      cc00e99d
    • [MINOR][SS][DOCS] Minor doc change for kafka integration · c57dfaef
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Minor change to kafka integration document for structured streaming.
      
      ## How was this patch tested?
      
      N/A, doc change only.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18550 from viirya/minor-ss-kafka-doc.
      c57dfaef
  22. Jul 20, 2017
    • [MINOR][DOCS] Fix some missing notes for Python 2.6 support drop · 5b61cc6d
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      After SPARK-12661, I guess we officially dropped Python 2.6 support. It looks like there are a few places missing this note.
      
      I grepped "Python 2.6" and "python 2.6" and the results were below:
      
      ```
      ./core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala:  // Unpickle array.array generated by Python 2.6
      ./docs/index.md:Note that support for Java 7, Python 2.6 and old Hadoop versions before 2.6.5 were removed as of Spark 2.2.0.
      ./docs/rdd-programming-guide.md:Spark {{site.SPARK_VERSION}} works with Python 2.6+ or Python 3.4+. It can use the standard CPython interpreter,
      ./docs/rdd-programming-guide.md:Note that support for Python 2.6 is deprecated as of Spark 2.0.0, and may be removed in Spark 2.2.0.
      ./python/pyspark/context.py:            warnings.warn("Support for Python 2.6 is deprecated as of Spark 2.0.0")
      ./python/pyspark/ml/tests.py:        sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
      ./python/pyspark/mllib/tests.py:        sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
      ./python/pyspark/serializers.py:        # On Python 2.6, we can't write bytearrays to streams, so we need to convert them
      ./python/pyspark/sql/tests.py:        sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
      ./python/pyspark/streaming/tests.py:        sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
      ./python/pyspark/tests.py:        sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
      ./python/pyspark/tests.py:        # NOTE: dict is used instead of collections.Counter for Python 2.6
      ./python/pyspark/tests.py:        # NOTE: dict is used instead of collections.Counter for Python 2.6
      ```
      
      This PR only proposes to change the user-visible occurrences below:
      
      ```
      ./docs/rdd-programming-guide.md:Spark {{site.SPARK_VERSION}} works with Python 2.6+ or Python 3.4+. It can use the standard CPython interpreter,
      ./docs/rdd-programming-guide.md:Note that support for Python 2.6 is deprecated as of Spark 2.0.0, and may be removed in Spark 2.2.0.
      ./python/pyspark/context.py:            warnings.warn("Support for Python 2.6 is deprecated as of Spark 2.0.0")
      ```
      
      This one is already correct:
      
      ```
      ./docs/index.md:Note that support for Java 7, Python 2.6 and old Hadoop versions before 2.6.5 were removed as of Spark 2.2.0.
      ```
      
      ## How was this patch tested?
      
      ```bash
       grep -r "Python 2.6" .
       grep -r "python 2.6" .
       ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18682 from HyukjinKwon/minor-python.26.
      5b61cc6d
  23. Jul 19, 2017
    • [SPARK-21456][MESOS] Make the driver failover_timeout configurable · c42ef953
      Susan X. Huynh authored
      ## What changes were proposed in this pull request?
      
      Current behavior: in Mesos cluster mode, the driver failover_timeout is set to zero. If the driver temporarily loses connectivity with the Mesos master, the framework will be torn down and all executors killed.
      
      Proposed change: make the failover_timeout configurable via a new option, spark.mesos.driver.failoverTimeout. The default value is still zero.
      
      Note: with non-zero failover_timeout, an explicit teardown is needed in some cases. This is captured in https://issues.apache.org/jira/browse/SPARK-21458
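      A minimal sketch of opting in to a non-zero timeout (the value is assumed to be in seconds, matching the Mesos failover_timeout semantics):

      ```scala
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .set("spark.mesos.driver.failoverTimeout", "120.0")  // keep the framework alive for 2 minutes
      ```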
      
      ## How was this patch tested?
      
      Added a unit test to make sure the config option is set while creating the scheduler driver.
      
      Ran an integration test with mesosphere/spark showing that with a non-zero failover_timeout the Spark job finishes after a driver is disconnected from the master.
      
      Author: Susan X. Huynh <xhuynh@mesosphere.com>
      
      Closes #18674 from susanxhuynh/sh-mesos-failover-timeout.
      c42ef953
    • [SPARK-21243][Core] Limit no. of map outputs in a shuffle fetch · ef617755
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      For configurations with external shuffle enabled, we have observed that if a very large number of blocks are being fetched from a remote host, it puts the NM under extra pressure and can crash it. This change introduces a configuration, `spark.reducer.maxBlocksInFlightPerAddress`, to limit the number of map outputs being fetched from a given remote address. The changes applied here are applicable for both scenarios - when external shuffle is enabled as well as disabled.
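      A minimal sketch of setting the new limit (the value is illustrative):

      ```scala
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .set("spark.reducer.maxBlocksInFlightPerAddress", "50")  // at most 50 blocks in flight per remote address
      ```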
      
      ## How was this patch tested?
      Ran the job with the default configuration, which does not change the existing behavior, and ran it with a few lower values - 10, 20, 50, 100. The job ran fine and there is no change in the output. (I will update the metrics related to the NM in some time.)
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #18487 from dhruve/impr/SPARK-21243.
      ef617755
  24. Jul 15, 2017
  25. Jul 13, 2017
    • [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 · 425c4ada
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove Scala 2.10 build profiles and support
      - Replace some 2.10 support in scripts with commented placeholders for 2.12 later
      - Remove deprecated API calls from 2.10 support
      - Remove usages of deprecated context bounds where possible
      - Remove Scala 2.10 workarounds like ScalaReflectionLock
      - Other minor Scala warning fixes
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17150 from srowen/SPARK-19810.
      425c4ada
  26. Jul 12, 2017
  27. Jul 10, 2017
    • [MINOR][DOC] Remove obsolete `ec2-scripts.md` · c444d108
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Since this document became obsolete, we had better remove it for Apache Spark 2.3.0. The original document was removed via SPARK-12735 in January 2016, and it is currently just a redirection page. The only reference on the Apache Spark website will go directly to the destination via https://github.com/apache/spark-website/pull/54.
      
      ## How was this patch tested?
      
      N/A. This is a removal of documentation.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #18578 from dongjoon-hyun/SPARK-REMOVE-EC2.
      c444d108
  28. Jul 09, 2017
    • [MINOR][DOC] Improve the docs about how to correctly set configurations · 457dc9cc
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      Spark provides several ways to set configurations: from a configuration file, from `spark-submit` command line options, or programmatically through the `SparkConf` class. It may confuse beginners why some configurations set through `SparkConf` do not take effect. So this adds some docs to address this problem and let beginners know how to correctly set configurations.
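      A minimal sketch of the distinction those docs explain: settings that affect the driver JVM before it starts (e.g. `spark.driver.memory` in client mode) must come from `spark-submit` or `spark-defaults.conf`, while ordinary runtime settings can be set programmatically:

      ```scala
      import org.apache.spark.SparkConf
      import org.apache.spark.sql.SparkSession

      val conf = new SparkConf()
        .setAppName("config-example")
        .set("spark.executor.memory", "2g")  // fine to set in code before the SparkSession is created

      val spark = SparkSession.builder().config(conf).getOrCreate()
      // Driver-JVM settings belong on the command line instead, e.g.:
      //   spark-submit --conf spark.driver.memory=4g ...
      ```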
      
      ## How was this patch tested?
      
      N/A
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #18552 from jerryshao/improve-doc.
      457dc9cc