  1. Apr 25, 2017
      [SPARK-20455][DOCS] Fix Broken Docker IT Docs · c8f12195
      Armin Braun authored
      ## What changes were proposed in this pull request?
      Just added the Maven `test` goal.
      ## How was this patch tested?
      No test needed, just a trivial documentation fix.
      Author: Armin Braun <>
      Closes #17756 from original-brownbear/SPARK-20455.
  2. Apr 24, 2017
  3. Apr 21, 2017
      [SPARK-20401][DOC] In the spark official configuration document, the... · ad290402
      郭小龙 10207633 authored
      [SPARK-20401][DOC] In the spark official configuration document, the 'spark.driver.supervise' configuration parameter specification and default values are necessary.
      ## What changes were proposed in this pull request?
      Submit the Spark job through the REST interface:
      curl -X POST --header "Content-Type:application/json;charset=UTF-8" --data '{
          "action": "CreateSubmissionRequest",
          "appArgs": [
          "appResource": "/home/mr/gxl/test.jar",
          "clientSparkVersion": "2.2.0",
          "environmentVariables": {
              "SPARK_ENV_LOADED": "1"
          "mainClass": "cn.zte.HdfsTest",
          "sparkProperties": {
              "spark.jars": "/home/mr/gxl/test.jar",
              **"spark.driver.supervise": "true",**
              "": "HdfsTest",
              "spark.eventLog.enabled": "false",
              "spark.submit.deployMode": "cluster",
              "spark.master": "spark://"
      **I want to make sure that the driver is automatically restarted if it fails with a non-zero exit code,
      but I cannot find a description or default value for the 'spark.driver.supervise' configuration parameter in the official Spark documentation.**
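The request body above can also be assembled programmatically. The following is a minimal sketch, not the submitter's actual tooling: the function name `build_submission_request` and the master URL are hypothetical, `appArgs` is left empty because it is elided in the original request, and the property whose key is elided in the original is omitted. The request is only built, not sent.

```python
import json


def build_submission_request(app_resource, main_class, master, app_args=(), supervise=True):
    """Build a CreateSubmissionRequest body shaped like the curl payload above."""
    return {
        "action": "CreateSubmissionRequest",
        "appArgs": list(app_args),
        "appResource": app_resource,
        "clientSparkVersion": "2.2.0",
        "environmentVariables": {"SPARK_ENV_LOADED": "1"},
        "mainClass": main_class,
        "sparkProperties": {
            "spark.jars": app_resource,
            # Property values are passed as strings, as in the original request.
            "spark.driver.supervise": "true" if supervise else "false",
            "spark.eventLog.enabled": "false",
            "spark.submit.deployMode": "cluster",
            "spark.master": master,
        },
    }


payload = build_submission_request(
    app_resource="/home/mr/gxl/test.jar",
    main_class="cn.zte.HdfsTest",
    master="spark://master.example:7077",  # hypothetical placeholder, not from the original
)
body = json.dumps(payload, indent=2)
```

Keeping `spark.driver.supervise` as an explicit argument makes it visible at the call site whether a failed driver is expected to be restarted.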
      ## How was this patch tested?
      manual tests
      Please review before opening a pull request.
      Author: 郭小龙 10207633 <>
      Author: guoxiaolong <>
      Author: guoxiaolongzte <>
      Closes #17696 from guoxiaolongzte/SPARK-20401.
      Small rewording about history server use case · 34767997
      Hervé authored
      PR #10991 removed the built-in history view from Spark Standalone, so the history server is no longer useful only to YARN or Mesos.
      Author: Hervé <>
      Closes #17709 from dud225/patch-1.
  4. Apr 19, 2017
      Fixed typos in docs · bdc60569
      ymahajan authored
      ## What changes were proposed in this pull request?
      Fixed typos in a couple of places in the docs.
      ## How was this patch tested?
      build including docs
      Please review before opening a pull request.
      Author: ymahajan <>
      Closes #17690 from ymahajan/master.
      [SPARK-20036][DOC] Note incompatible dependencies on org.apache.kafka artifacts · 71a8e9df
      cody koeninger authored
      ## What changes were proposed in this pull request?
      Note that you shouldn't manually add dependencies on org.apache.kafka artifacts
      ## How was this patch tested?
      Doc only change, did jekyll build and looked at the page.
      Author: cody koeninger <>
      Closes #17675 from koeninger/SPARK-20036.
  5. Apr 16, 2017
      [SPARK-19740][MESOS] Add support in Spark to pass arbitrary parameters into... · a888fed3
      Ji Yan authored
      [SPARK-19740][MESOS] Add support in Spark to pass arbitrary parameters into docker when running on mesos with docker containerizer
      ## What changes were proposed in this pull request?
      Allow passing arbitrary parameters into docker when launching Spark executors on Mesos with the docker containerizer tnachen
      ## How was this patch tested?
      Manually built and tested with a passed-in parameter
      Author: Ji Yan <jiyan@Jis-MacBook-Air.local>
      Closes #17109 from yanji84/ji/allow_set_docker_user.
  6. Apr 12, 2017
      [MINOR][DOCS] JSON APIs related documentation fixes · bca4259f
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      This PR proposes corrections related to JSON APIs as below:
      - Rendering links in Python documentation
      - Replacing `RDD` with `Dataset` in the programming guide
      - Adding the missing description of JSON Lines consistently in `DataFrameReader.json` in the Python API
      - De-duplicating some of `DataFrameReader.json` in the Scala/Java API
      ## How was this patch tested?
      Manually built the documentation via `jekyll build`. Corresponding snapshots will be left on the code.
      Note that currently there are Javadoc8 breaks in several places. These are proposed to be handled in So, this PR does not fix those.
      Author: hyukjinkwon <>
      Closes #17602 from HyukjinKwon/minor-json-documentation.
      [MINOR][DOCS] Fix spacings in Structured Streaming Programming Guide · b9384382
      Lee Dongjin authored
      ## What changes were proposed in this pull request?
      1. Omitted space between the sentences: `... on static data.The Spark SQL engine will ...` -> `... on static data. The Spark SQL engine will ...`
      2. Omitted colon in Output Model section.
      ## How was this patch tested?
      Author: Lee Dongjin <>
      Closes #17564 from dongjinleekr/feature/fix-programming-guide.
  7. Apr 11, 2017
  8. Apr 07, 2017
      [SPARK-20218][DOC][APP-ID] 'applications/[app-id]/stages' in REST API, add description. · 9e0893b5
      郭小龙 10207633 authored
      ## What changes were proposed in this pull request?
      1. '/applications/[app-id]/stages' in the REST API should add the description '?status=[active|complete|pending|failed] list only stages in the state.'
      Without this description, users of the API do not know that the stage list can be filtered by status.
      2. '/applications/[app-id]/stages/[stage-id]' in the REST API: remove the redundant description '?status=[active|complete|pending|failed] list only stages in the state.'
      A stage-id identifies exactly one stage, so the status filter does not apply there.
        def stageList(QueryParam("status") statuses: JList[StageStatus]): Seq[StageData] = {
          val listener = ui.jobProgressListener
          val stageAndStatus = AllStagesResource.stagesAndStatus(ui)
          val adjStatuses = {
            if (statuses.isEmpty()) {
              Arrays.asList(StageStatus.values(): _*)
            } else {
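The behaviour of the `status` query parameter described above can be illustrated with a small sketch. This is plain Python over hypothetical stage records, not Spark's implementation; it only mirrors the rule visible in the Scala fragment, namely that an empty status list means all statuses.

```python
# The four states named in the REST API description above.
STAGE_STATUSES = ("active", "complete", "pending", "failed")


def filter_stages(stages, statuses=()):
    """Return the stages whose status is requested; no statuses means no filtering."""
    wanted = set(statuses) if statuses else set(STAGE_STATUSES)
    unknown = wanted - set(STAGE_STATUSES)
    if unknown:
        raise ValueError(f"unknown stage status: {sorted(unknown)}")
    return [stage for stage in stages if stage["status"] in wanted]


# Hypothetical stage records, for illustration only.
stages = [
    {"stageId": 0, "status": "complete"},
    {"stageId": 1, "status": "active"},
    {"stageId": 2, "status": "failed"},
]
active_or_failed = filter_stages(stages, ["active", "failed"])
everything = filter_stages(stages)
```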
      ## How was this patch tested?
      manual tests
      Please review before opening a pull request.
      Author: 郭小龙 10207633 <>
      Closes #17534 from guoxiaolongzte/SPARK-20218.
  9. Apr 06, 2017
      [SPARK-20085][MESOS] Configurable mesos labels for executors · c8fc1f3b
      Kalvin Chau authored
      ## What changes were proposed in this pull request?
      Add the spark.mesos.task.labels configuration option to add Mesos key:value labels to the executor.
      The format is "k1:v1,k2:v2": colons separate each key from its value, and commas separate multiple labels.
      Discussion of labels with mgummelt at #17404
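To make the "k1:v1,k2:v2" format concrete, here is a minimal parsing sketch. It is not the code added by this patch: the function name `parse_task_labels` is hypothetical, and the rule for what counts as an incorrect label (no colon, empty key, or empty value) is an assumption made for illustration.

```python
def parse_task_labels(spec):
    """Parse 'k1:v1,k2:v2' into a dict, ignoring malformed entries."""
    labels = {}
    for entry in spec.split(","):
        # Split on the first colon only; later colons stay in the value (assumption).
        key, sep, value = entry.strip().partition(":")
        if not sep or not key or not value:
            continue  # assumed definition of an "incorrect" label
        labels[key] = value
    return labels


labels = parse_task_labels("k1:v1,k2:v2")
```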
      ## How was this patch tested?
      Added unit tests to verify that labels are added correctly and that incorrect labels are ignored, and added a test for the name of the executor.
      Tested with: `./build/sbt -Pmesos mesos/test`
      Please review before opening a pull request.
      Author: Kalvin Chau <>
      Closes #17413 from kalvinnchau/mesos-labels.
  10. Apr 05, 2017
  11. Apr 04, 2017
  12. Apr 03, 2017
      [SPARK-19969][ML] Imputer doc and example · 4d28e843
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      Add docs and examples for Currently scala and Java examples are included. Python example will be added after
      ## How was this patch tested?
      local doc generation and example execution
      Author: Yuhao Yang <>
      Closes #17324 from hhbyyh/imputerdoc.
      [MINOR][DOCS] Replace non-breaking space to normal spaces that breaks rendering markdown · 364b0db7
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      It seems that several non-breaking spaces were inserted into several `.md` files, and they appear to break the rendering of the markdown.
      A non-breaking space and a normal space are different characters. For example, this can be checked via `python` as below:
      >>> " "
      >>> " "
      ' '
      _Note that this PR description seems to replace non-breaking spaces with normal spaces automatically. Please open `vi`, then copy and paste the character into `python` to verify this (do not copy the characters here)._
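Because the two characters look identical on screen, the difference is easier to see when the non-breaking space is written as an escape. The sketch below uses a hypothetical helper, `find_nbsp_lines`, and a made-up sample string; it is an illustration, not the check used in this PR.

```python
NBSP = "\u00a0"  # non-breaking space, code point 160
SPACE = " "      # normal space, code point 32


def find_nbsp_lines(text):
    """Return the 1-based numbers of the lines that contain a non-breaking space."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if NBSP in line
    ]


# Made-up sample: only the second line contains a non-breaking space.
sample = "a normal space\na non-breaking\u00a0space\n"
affected = find_nbsp_lines(sample)
```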
      I checked the rendered output in Safari and Chrome on Mac OS, and in Internet Explorer on Windows 10 (four screenshots taken on 2017-04-03 were attached to the original description).
      ## How was this patch tested?
      Manually checking.
      These instances were found via
      grep --include=*.scala --include=*.python --include=*.java --include=*.r --include=*.R --include=*.md --include=*.r -r -I " " .
      on Mac OS.
      There seem to be several more instances, as below:
      ./docs/        │   ├── ...
      ./docs/        │   │
      ./docs/        │   ├── country=US
      ./docs/        │   │   └── data.parquet
      ./docs/        │   ├── country=CN
      ./docs/        │   │   └── data.parquet
      ./docs/        │   └── ...
      ./docs/            ├── ...
      ./docs/            │
      ./docs/            ├── country=US
      ./docs/            │   └── data.parquet
      ./docs/            ├── country=CN
      ./docs/            │   └── data.parquet
      ./docs/            └── ...
      ./sql/core/src/test/│   ├── *.avdl                  # Testing Avro IDL(s)
      ./sql/core/src/test/│   └── *.avpr                  # !! NO TOUCH !! Protocol files generated from Avro IDL(s)
      ./sql/core/src/test/│   ├──             # Script used to generate Java code for Avro
      ./sql/core/src/test/│   └──           # Script used to generate Java code for Thrift
      These seem to have been generated via the `tree` command, which inserts non-breaking spaces. They do not appear to cause any rendering problems within code blocks, so I did not fix them; this avoids having to replace them manually again whenever the output is regenerated via `tree` in the future.
      Author: hyukjinkwon <>
      Closes #17517 from HyukjinKwon/non-breaking-space.
  13. Apr 01, 2017
      [SPARK-20177] Document about compression way has some little detail ch… · cf5963c9
      郭小龙 10207633 authored
      ## What changes were proposed in this pull request?
      Minor changes to the documentation of the compression settings:
      1. spark.eventLog.compress: add 'Compression will use'
      2. spark.broadcast.compress: add 'Compression will use'
      3. spark.rdd.compress: add 'Compression will use'; add 'event log describe'.
      From the documentation, I could not tell which compression mode is used for the event log.
      ## How was this patch tested?
      manual tests
      Please review before opening a pull request.
      Author: 郭小龙 10207633 <>
      Closes #17498 from guoxiaolongzte/SPARK-20177.
  14. Mar 30, 2017
  15. Mar 27, 2017
  16. Mar 23, 2017
      [SPARK-10849][SQL] Adds option to the JDBC data source write for user to... · c7911807
      sureshthalamati authored
      [SPARK-10849][SQL] Adds option to the JDBC data source write for user to specify database column type for the create table
      ## What changes were proposed in this pull request?
      Currently the JDBC data source creates tables in the target database using the default type mapping and the JDBC dialect mechanism. If users want to specify a different database data type for only some of the columns, there is no option available. In scenarios where the default mapping does not work, users are forced to create tables on the target database before writing. This workaround is probably not acceptable from a usability point of view. This PR provides a user-defined type mapping for specific columns.
      The solution is to allow users to specify the database column data types for the created table as a JDBC data source option (createTableColumnTypes) on write. Data type information can be specified in the same format as the table schema DDL format (e.g. `name CHAR(64), comments VARCHAR(1024)`).
      Not all supported target database types can be specified; the data types must also be valid Spark SQL data types. For example, users cannot specify the target database CLOB data type. This will be supported in a follow-up PR.
      .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
      .jdbc(url, "TEST.DBCOLTYPETEST", properties)
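The option value is an ordinary DDL-style string, so it can be generated from a column-to-type mapping. The helper below, `create_table_column_types`, is a hypothetical convenience and not part of Spark; it only builds the string that would be passed as the `createTableColumnTypes` option.

```python
def create_table_column_types(column_types):
    """Join a {column: database type} mapping into 'col TYPE, col TYPE' form."""
    return ", ".join(f"{name} {db_type}" for name, db_type in column_types.items())


# Same columns and types as in the example above.
option_value = create_table_column_types(
    {"name": "CHAR(64)", "comments": "VARCHAR(1024)"}
)
```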
      ## How was this patch tested?
      Added new test cases to the JDBCWriteSuite
      Author: sureshthalamati <>
      Closes #16209 from sureshthalamati/jdbc_custom_dbtype_option_json-spark-10849.
  17. Mar 22, 2017
      [SPARK-20021][PYSPARK] Miss backslash in python code · facfd608
      uncleGen authored
      ## What changes were proposed in this pull request?
      Add backslash for line continuation in python code.
      ## How was this patch tested?
      Author: uncleGen <>
      Author: dylon <>
      Closes #17352 from uncleGen/python-example-doc.
  18. Mar 21, 2017
      [SPARK-20011][ML][DOCS] Clarify documentation for ALS 'rank' parameter · 7620aed8
      christopher snow authored
      ## What changes were proposed in this pull request?
      API documentation and collaborative filtering documentation page changes to clarify inconsistent description of ALS rank parameter.
       - [DOCS] was previously: "rank is the number of latent factors in the model."
       - [API] was previously:  "rank - number of features to use"
      This change describes rank in both places consistently as:
       - "Number of features to use (also referred to as the number of latent factors)"
      Author: Chris Snow <>
      Author: christopher snow <>
      Closes #17345 from snowch/SPARK-20011.
  19. Mar 20, 2017
  20. Mar 17, 2017
      [SPARK-13369] Add config for number of consecutive fetch failures · 7b5d873a
      Sital Kedia authored
      The previously hardcoded maximum of 4 retries per stage is not suitable for all cluster configurations. Since Spark retries a stage at the first sign of a fetch failure, you can easily end up with many stage retries to discover all the failures. In particular, two scenarios in which this value should change are (1) if there are more than 4 executors per node, in which case it may take 4 retries to discover the problem with each executor on the node, and (2) during cluster maintenance on large clusters, where multiple machines are serviced at once but you cannot afford total cluster downtime. By making this value configurable, cluster managers can tune it to something more appropriate to their cluster configuration.
      Unit tests
      Author: Sital Kedia <>
      Closes #17307 from sitalkedia/SPARK-13369.
  21. Mar 12, 2017
      [DOCS][SS] fix structured streaming python example · e29a74d5
      uncleGen authored
      ## What changes were proposed in this pull request?
      - SS python example: `TypeError: 'xxx' object is not callable`
      - some other doc issue.
      ## How was this patch tested?
      Author: uncleGen <>
      Closes #17257 from uncleGen/docs-ss-python.
  22. Mar 10, 2017
  23. Mar 09, 2017
      [SPARK-19715][STRUCTURED STREAMING] Option to Strip Paths in FileSource · 40da4d18
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      Today, we compare the whole path when deciding if a file is new in the FileSource for structured streaming. However, this would cause false negatives in the case where the path has changed in a cosmetic way (e.g. changing `s3n` to `s3a`).
      This patch adds an option `fileNameOnly` that causes the new-file check to be based only on the filename, while still storing the whole path in the log.
      ## Usage
        .option("fileNameOnly", true)
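The difference between the two checks can be sketched in a few lines. This is an illustration under stated assumptions, not the FileSource implementation: `file_key` and `is_new_file` are hypothetical names, the seen-files log is modelled as an in-memory set, and the bucket and file names are made up.

```python
import posixpath
from urllib.parse import urlparse


def file_key(path, file_name_only):
    """Key used to decide whether a file was already seen."""
    if file_name_only:
        # Drop the scheme, bucket and directories; keep only the file name.
        return posixpath.basename(urlparse(path).path)
    return path


def is_new_file(path, seen_keys, file_name_only=False):
    """Record the file and report whether it had not been seen before."""
    key = file_key(path, file_name_only)
    if key in seen_keys:
        return False
    seen_keys.add(key)
    return True


seen = set()
first = is_new_file("s3n://bucket/data/part-0001.json", seen, file_name_only=True)
second = is_new_file("s3a://bucket/data/part-0001.json", seen, file_name_only=True)
```

With `file_name_only=True`, the second path is treated as already seen even though its scheme changed; with the default whole-path comparison it would be treated as a new file.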
      ## How was this patch tested?
      Added a test case
      Author: Liwei Lin <>
      Closes #17120 from lw-lin/filename-only.
  24. Mar 07, 2017
      [SPARK-19516][DOC] update public doc to use SparkSession instead of SparkContext · d69aeeaf
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      Since Spark 2.0, `SparkSession` is the new entry point for Spark applications. We should update the public documents to reflect this.
      ## How was this patch tested?
      Author: Wenchen Fan <>
      Closes #16856 from cloud-fan/doc.
      [SPARK-17498][ML] StringIndexer enhancement for handling unseen labels · 4a9034b1
      VinceShieh authored
      ## What changes were proposed in this pull request?
      This PR is an enhancement to the ML StringIndexer.
      Before this PR, StringIndexer only supported the "skip" and "error" options for dealing with unseen records.
      But those unseen records might still be useful, and users may want to keep the unseen labels in
      certain use cases. This PR enables StringIndexer to support a third option, "keep", which keeps unseen labels as
      indices [numLabels].
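The three ways of handling unseen labels can be compared with a small sketch. This is plain Python, not the StringIndexer code: `index_labels` is a hypothetical function, and the fitted labels are assumed to be given as an ordered list whose positions are the label indices.

```python
def index_labels(labels, values, handle_invalid="error"):
    """Map values to label indices, handling unseen labels per handle_invalid."""
    if handle_invalid not in ("error", "skip", "keep"):
        raise ValueError(f"unsupported handle_invalid: {handle_invalid}")
    index = {label: position for position, label in enumerate(labels)}
    result = []
    for value in values:
        if value in index:
            result.append(index[value])
        elif handle_invalid == "keep":
            # Every unseen label shares one extra index equal to numLabels.
            result.append(len(labels))
        elif handle_invalid == "skip":
            continue  # drop the record
        else:
            raise ValueError(f"unseen label: {value}")
    return result


kept = index_labels(["a", "b", "c"], ["a", "d", "c"], handle_invalid="keep")
skipped = index_labels(["a", "b", "c"], ["a", "d", "c"], handle_invalid="skip")
```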
      ## How was this patch tested?
      Test added in StringIndexerSuite
      Signed-off-by: VinceShieh <>
      Author: VinceShieh <>
      Closes #16883 from VinceShieh/spark-17498.
  25. Mar 03, 2017
      [MINOR][DOC] Fix doc for web UI https configuration · ba186a84
      jerryshao authored
      ## What changes were proposed in this pull request?
      The doc about enabling web UI https is not correct: "spark.ui.https.enabled" does not exist, and enabling SSL is enough for https.
      ## How was this patch tested?
      Author: jerryshao <>
      Closes #17147 from jerryshao/fix-doc-ssl.
      [SPARK-19797][DOC] ML pipeline document correction · 0bac3e4c
      Zhe Sun authored
      ## What changes were proposed in this pull request?
      The description of the pipeline in this paragraph is incorrect:
      > If the Pipeline had more **stages**, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage.
      Reason: a Transformer can also be a stage, but only another Estimator will invoke a transform call and pass the data to the next stage. The description in the document misleads ML pipeline users.
      ## How was this patch tested?
      This is a tiny modification of **docs/**. I ran `jekyll build` on the modification and checked the compiled document.
      Author: Zhe Sun <>
      Closes #17137 from ymwdalex/SPARK-19797-ML-pipeline-document-correction.
  26. Mar 02, 2017
  27. Feb 28, 2017
  28. Feb 25, 2017
      [MINOR][DOCS] Fixes two problems in the SQL programing guide page · 061bcfb8
      Boaz Mohar authored
      ## What changes were proposed in this pull request?
      Removed duplicated lines in the SQL Python example and fixed a typo.
      ## How was this patch tested?
      Searched for other typos in the page to minimize PRs.
      Author: Boaz Mohar <>
      Closes #17066 from boazmohar/doc-fix.