  1. Mar 03, 2017
    • jerryshao's avatar
      [MINOR][DOC] Fix doc for web UI https configuration · ba186a84
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      The doc about enabling web UI HTTPS is not correct: "spark.ui.https.enabled" does not exist; actually, enabling SSL is enough for HTTPS.
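      
      For reference, a minimal sketch of what enabling SSL for the web UI looks like in `spark-defaults.conf` (paths and the password are placeholders, not taken from this patch):
      
      ```
      spark.ssl.enabled            true
      spark.ssl.keyStore           /path/to/keystore.jks
      spark.ssl.keyStorePassword   changeit
      ```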
      
      ## How was this patch tested?
      
      N/A
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17147 from jerryshao/fix-doc-ssl.
      ba186a84
    • Zhe Sun's avatar
      [SPARK-19797][DOC] ML pipeline document correction · 0bac3e4c
      Zhe Sun authored
      ## What changes were proposed in this pull request?
      Description about pipeline in this paragraph is incorrect https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works
      
      > If the Pipeline had more **stages**, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage.
      
      Reason: a Transformer can also be a stage. But only another Estimator will invoke a transform() call and pass the data to the next stage. The description in the document misleads ML pipeline users.
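      
      For context, a minimal Scala sketch of the kind of pipeline the guide discusses (mirroring its Tokenizer/HashingTF/LogisticRegression example; parameters abbreviated):
      
      ```
      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.classification.LogisticRegression
      import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
      
      // Tokenizer and HashingTF are Transformer stages; LogisticRegression is
      // an Estimator stage. During Pipeline.fit(), each stage either transforms
      // the DataFrame or is fitted on it before data moves to the next stage.
      val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
      val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
      val lr = new LogisticRegression().setMaxIter(10)
      val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
      ```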
      
      ## How was this patch tested?
      This is a tiny modification of **docs/ml-pipelines.md**. I ran a jekyll build with the modification and checked the compiled document.
      
      Author: Zhe Sun <ymwdalex@gmail.com>
      
      Closes #17137 from ymwdalex/SPARK-19797-ML-pipeline-document-correction.
      0bac3e4c
  2. Mar 02, 2017
  3. Feb 28, 2017
  4. Feb 25, 2017
    • Boaz Mohar's avatar
      [MINOR][DOCS] Fixes two problems in the SQL programming guide page · 061bcfb8
      Boaz Mohar authored
      ## What changes were proposed in this pull request?
      
      Removed duplicated lines in the SQL Python example and fixed a typo.
      
      ## How was this patch tested?
      
      Searched for other typos in the page to minimize PRs.
      
      Author: Boaz Mohar <boazmohar@gmail.com>
      
      Closes #17066 from boazmohar/doc-fix.
      061bcfb8
  5. Feb 24, 2017
    • Ramkumar Venkataraman's avatar
      [MINOR][DOCS] Fix few typos in structured streaming doc · 1b9ba258
      Ramkumar Venkataraman authored
      ## What changes were proposed in this pull request?
      
      Fixed a minor typo, `even-time`, which is changed to `event-time`, and a couple of grammatical errors.
      
      ## How was this patch tested?
      
      N/A - since this is a doc fix. I did a jekyll build locally though.
      
      Author: Ramkumar Venkataraman <rvenkataraman@paypal.com>
      
      Closes #17037 from ramkumarvenkat/doc-fix.
      1b9ba258
    • Shubham Chopra's avatar
      [SPARK-15355][CORE] Proactive block replication · fa7c582e
      Shubham Chopra authored
      ## What changes were proposed in this pull request?
      
      We are proposing the addition of proactive block replication in case of executor failures. BlockManagerMasterEndpoint does all the bookkeeping to keep track of all the executors and the blocks they hold. It also keeps track of which executors are alive through heartbeats. When an executor is removed, all this bookkeeping state is updated to reflect the lost executor. This step can be used to identify executors that are still in possession of a copy of the cached data, and a message can be sent to them to use the existing "replicate" function to find and place new replicas on other suitable hosts. Blocks replicated this way will let the master know of their existence.
      
      This happens when an executor is lost, and is thereby proactive, as opposed to replication being done at query time.
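      
      A rough, self-contained sketch of the idea; the names and data structures here are hypothetical stand-ins, not the actual BlockManagerMasterEndpoint code:
      
      ```
      import scala.collection.mutable
      
      object ProactiveReplicationSketch {
        // Hypothetical bookkeeping: blockId -> executors holding a replica.
        val blockLocations = mutable.Map(
          "rdd_0_0" -> mutable.Set("exec-1", "exec-2"),
          "rdd_0_1" -> mutable.Set("exec-2"))
      
        // Stand-in for the RPC that asks an executor to re-replicate a block.
        def replicate(executorId: String, blockId: String): Unit =
          println(s"asking $executorId to re-replicate $blockId")
      
        // On executor loss, a surviving holder of each affected block is asked
        // to place a new replica, instead of waiting until query time.
        def onExecutorRemoved(lost: String): Unit =
          for ((blockId, holders) <- blockLocations if holders.remove(lost))
            holders.headOption.foreach(replicate(_, blockId))
      
        def main(args: Array[String]): Unit = onExecutorRemoved("exec-1")
      }
      ```
      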
      ## How was this patch tested?
      
      This patch was tested with existing unit tests along with new unit tests added to test the functionality.
      
      Author: Shubham Chopra <schopra31@bloomberg.net>
      
      Closes #14412 from shubhamchopra/ProactiveBlockReplication.
      fa7c582e
  6. Feb 23, 2017
  7. Feb 22, 2017
    • Marcelo Vanzin's avatar
      [SPARK-19554][UI,YARN] Allow SHS URL to be used for tracking in YARN RM. · 4661d30b
      Marcelo Vanzin authored
      Allow an application to use the History Server URL as the tracking
      URL in the YARN RM, so there's still a link to the web UI somewhere
      in YARN even if the driver's UI is disabled. This is useful, for
      example, if an admin wants to disable the driver UI by default for
      applications, since it's harder to secure (it involves non-trivial
      SSL certificate and auth management that admins may not want
      to expose to user apps).
      
      This needs to be opt-in, because of the way the YARN proxy works, so
      a new configuration was added to enable the option.
      
      The YARN RM will proxy requests to live AMs instead of redirecting
      the client, so pages in the SHS UI will not render correctly since
      they'll reference invalid paths in the RM UI. The proxy base support
      in the SHS cannot be used since that would prevent direct access to
      the SHS.
      
      So, to solve this problem, for the feature to work end-to-end, a new
      YARN-specific filter was added that detects whether the requests come
      from the proxy and redirects the client appropriately. The SHS admin has
      to add this filter manually if they want the feature to work.
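      
      A sketch of the end-to-end setup this describes, using the names from the patch's documentation (treat as illustrative):
      
      ```
      # Application side: allow the SHS URL to be used as the YARN tracking URL.
      spark.yarn.historyServer.allowTracking  true
      
      # History server side: the YARN-specific redirect filter mentioned above.
      spark.ui.filters  org.apache.spark.deploy.yarn.YarnProxyRedirectFilter
      ```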
      
      Tested with new unit test, and by running with the documented configuration
      set in a test cluster. Also verified the driver UI is used when it's
      enabled.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16946 from vanzin/SPARK-19554.
      4661d30b
  8. Feb 21, 2017
  9. Feb 16, 2017
    • Sean Owen's avatar
      [SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support · 0e240549
      Sean Owen authored
      - Move external/java8-tests tests into core, streaming, sql, and remove that module
      - Remove MaxPermGen and related options
      - Fix some reflection / TODOs around Java 8+ methods
      - Update doc references to 1.7/1.8 differences
      - Remove Java 7/8 related build profiles
      - Update some plugins for better Java 8 compatibility
      - Fix a few Java-related warnings
      
      For the future:
      
      - Update Java 8 examples to fully use Java 8
      - Update Java tests to use lambdas for simplicity
      - Update Java internal implementations to use lambdas
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16871 from srowen/SPARK-19493.
      0e240549
  10. Feb 15, 2017
    • Yun Ni's avatar
      [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing · 08c1972a
      Yun Ni authored
      ## What changes were proposed in this pull request?
      This pull request includes the Python API and examples for LSH. The API changes are based on yanboliang 's PR #15768, with conflicts resolved and API changes matched to the Scala API. The examples are consistent with the Scala examples of MinHashLSH and BucketedRandomProjectionLSH.
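      
      For flavor, a minimal Scala sketch of the MinHashLSH API that these Python examples mirror (the data and column names are illustrative, not taken from the patch):
      
      ```
      import org.apache.spark.ml.feature.MinHashLSH
      import org.apache.spark.ml.linalg.Vectors
      
      // Assumes an existing SparkSession `spark`; rows hold sparse binary vectors.
      val df = spark.createDataFrame(Seq(
        (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0)))),
        (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0))))
      )).toDF("id", "features")
      
      val mh = new MinHashLSH()
        .setNumHashTables(3)
        .setInputCol("features")
        .setOutputCol("hashes")
      val model = mh.fit(df)
      model.transform(df).show()  // appends the LSH hash column
      ```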
      
      ## How was this patch tested?
      API and examples are tested using spark-submit:
      `bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py`
      `bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py`
      
      User guide changes are generated and manually inspected:
      `SKIP_API=1 jekyll build`
      
      Author: Yun Ni <yunn@uber.com>
      Author: Yanbo Liang <ybliang8@gmail.com>
      Author: Yunni <Euler57721@gmail.com>
      
      Closes #16715 from Yunni/spark-18080.
      08c1972a
  11. Feb 14, 2017
  12. Feb 13, 2017
    • Marcelo Vanzin's avatar
      [SPARK-19520][STREAMING] Do not encrypt data written to the WAL. · 0169360e
      Marcelo Vanzin authored
      Spark's I/O encryption uses an ephemeral key for each driver instance.
      So driver B cannot decrypt data written by driver A since it doesn't
      have the correct key.
      
      The write ahead log is used for recovery, thus needs to be readable by
      a different driver. So it cannot be encrypted by Spark's I/O encryption
      code.
      
      The BlockManager APIs used by the WAL code to write the data automatically
      encrypt data, so changes are needed so that callers can opt out of
      encryption.
      
      Aside from that, the "putBytes" API in the BlockManager does not do
      encryption, so a separate situation arose where the WAL would write
      unencrypted data to the BM and, when those blocks were read, decryption
      would fail. So the WAL code needs to ask the BM to encrypt that data
      when encryption is enabled; this code is not optimal since it results
      in a (temporary) second copy of the data block in memory, but should be
      OK for now until a more performant solution is added. The non-encryption
      case should not be affected.
      
      Tested with new unit tests, and by running streaming apps that do
      recovery using the WAL data with I/O encryption turned on.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16862 from vanzin/SPARK-19520.
      0169360e
  13. Feb 10, 2017
    • Hervé's avatar
      Encryption of shuffle files · c5a66356
      Hervé authored
      Hello
      
      According to my understanding of commits 4b4e329e & 8b325b17, one may now encrypt shuffle files regardless of the cluster manager in use.
      
      However, since I have limited understanding of the code, I'm not able to find out whether these changes also cover all "temporary local storage, such as shuffle files, cached data, and other application files".
      
      Please feel free to amend or reject my PR if I'm wrong.
      
      dud
      
      Author: Hervé <dud225@users.noreply.github.com>
      
      Closes #16885 from dud225/patch-1.
      c5a66356
    • jerryshao's avatar
      [SPARK-19545][YARN] Fix compile issue for Spark on Yarn when building against Hadoop 2.6.0~2.6.3 · 8e8afb3a
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      Due to a newly added API in Hadoop 2.6.4+, Spark builds against Hadoop 2.6.0~2.6.3 will hit a compile error. So here we still revert back to using reflection to handle this issue.
      
      ## How was this patch tested?
      
      Manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #16884 from jerryshao/SPARK-19545.
      8e8afb3a
  14. Feb 09, 2017
    • José Hiram Soltren's avatar
      [SPARK-16554][CORE] Automatically Kill Executors and Nodes when they are Blacklisted · 6287c94f
      José Hiram Soltren authored
      ## What changes were proposed in this pull request?
      
      In SPARK-8425, we introduced a mechanism for blacklisting executors and nodes (hosts). After a certain number of failures, these resources would be "blacklisted" and no further work would be assigned to them for some period of time.
      
      In some scenarios, it is better to fail fast, and to simply kill these unreliable resources. This change proposes to do so by having the BlacklistTracker kill unreliable resources when they would otherwise be "blacklisted".
      
      In order to be thread safe, this code depends on the CoarseGrainedSchedulerBackend sending a message to the driver backend in order to do the actual killing. This also helps to prevent a race that would permit work to begin on a resource (executor or node) between the time the resource is marked for killing and the time at which it is finally killed.
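      
      A sketch of turning the new behavior on (flag names per the patch's documentation; both default to off):
      
      ```
      spark.blacklist.enabled                    true
      spark.blacklist.killBlacklistedExecutors   true
      ```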
      
      ## How was this patch tested?
      
      ./dev/run-tests
      Ran https://github.com/jsoltren/jose-utils/blob/master/blacklist/test-blacklist.sh, and checked logs to see executors and nodes being killed.
      
      Testing can likely be improved here; suggestions welcome.
      
      Author: José Hiram Soltren <jose@cloudera.com>
      
      Closes #16650 from jsoltren/SPARK-16554-submit.
      6287c94f
    • Marcelo Vanzin's avatar
      [SPARK-17874][CORE] Add SSL port configuration. · 3fc8e8ca
      Marcelo Vanzin authored
      Make the SSL port configuration explicit, instead of deriving it
      from the non-SSL port, but retain the existing functionality in
      case anyone depends on it.
      
      The change starts the HTTPS and HTTP connectors separately, so
      that it's possible to use independent ports for each. For that to
      work, the initialization of the server needs to be shuffled around
      a bit. The change also makes the initialization of both
      connectors similar, and both end up using the same Scheduler; previously
      only the HTTP connector would use the correct one.
      
      Also fixed some outdated documentation about a couple of services
      that were removed long ago.
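      
      For illustration, with this change the HTTPS port can be pinned explicitly instead of being derived from the HTTP port (assuming the per-namespace `port` key this patch adds):
      
      ```
      # The HTTPS connector gets its own port; HTTP stays on spark.ui.port.
      spark.ssl.ui.enabled  true
      spark.ssl.ui.port     4443
      ```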
      
      Tested with unit tests and by running spark-shell with SSL configs.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16625 from vanzin/SPARK-17874.
      3fc8e8ca
  15. Feb 08, 2017
    • Sean Owen's avatar
      [SPARK-19464][CORE][YARN][TEST-HADOOP2.6] Remove support for Hadoop 2.5 and earlier · e8d3fca4
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove support for Hadoop 2.5 and earlier
      - Remove reflection and code constructs only needed to support multiple versions at once
      - Update docs to reflect newer versions
      - Remove older versions' builds and profiles.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16810 from srowen/SPARK-19464.
      e8d3fca4
  16. Feb 07, 2017
  17. Feb 03, 2017
  18. Feb 01, 2017
    • Zheng RuiFeng's avatar
      [SPARK-19410][DOC] Fix broken links in ml-pipeline and ml-tuning · 04ee8cf6
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Fix broken links in ml-pipeline and ml-tuning
      `<div data-lang="scala">`  ->   `<div data-lang="scala" markdown="1">`
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #16754 from zhengruifeng/doc_api_fix.
      04ee8cf6
    • hyukjinkwon's avatar
      [SPARK-19402][DOCS] Support LaTeX inline formula correctly and fix warnings in... · f1a1f260
      hyukjinkwon authored
      [SPARK-19402][DOCS] Support LaTeX inline formula correctly and fix warnings in Scala/Java APIs generation
      
      ## What changes were proposed in this pull request?
      
      This PR proposes three things as below:
      
      - Support LaTeX inline formula, `\( ... \)`, in Scala API documentation
        It seems currently,
      
        ```
        \( ... \)
        ```
      
        are rendered as they are, for example,
      
        <img width="345" alt="2017-01-30 10 01 13" src="https://cloud.githubusercontent.com/assets/6477701/22423960/ab37d54a-e737-11e6-9196-4f6229c0189c.png">
      
        It seems more backslashes were mistakenly added.
      
      - Fix warnings in Scaladoc/Javadoc generation
        This PR fixes two types of warnings, shown below:
      
        ```
        [warn] .../spark/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala:335: Could not find any member to link for "UnsupportedOperationException".
        [warn]   /**
        [warn]   ^
        ```
      
        ```
        [warn] .../spark/sql/core/src/main/scala/org/apache/spark/sql/internal/VariableSubstitution.scala:24: Variable var undefined in comment for class VariableSubstitution in class VariableSubstitution
        [warn]  * `${var}`, `${system:var}` and `${env:var}`.
        [warn]      ^
        ```
      
      - Fix Javadoc8 break
        ```
        [error] .../spark/mllib/target/java/org/apache/spark/ml/PredictionModel.java:7: error: reference not found
        [error]  *                       E.g., {link VectorUDT} for vector features.
        [error]                                       ^
        [error] .../spark/mllib/target/java/org/apache/spark/ml/PredictorParams.java:12: error: reference not found
        [error]    *                          E.g., {link VectorUDT} for vector features.
        [error]                                            ^
        [error] .../spark/mllib/target/java/org/apache/spark/ml/Predictor.java:10: error: reference not found
        [error]  *                       E.g., {link VectorUDT} for vector features.
        [error]                                       ^
        [error] .../spark/sql/hive/target/java/org/apache/spark/sql/hive/HiveAnalysis.java:5: error: reference not found
        [error]  * Note that, this rule must be run after {link PreprocessTableInsertion}.
        [error]                                                  ^
        ```
      
      ## How was this patch tested?
      
      Manually via `sbt unidoc` and `jekyll build`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16741 from HyukjinKwon/warn-and-break.
      f1a1f260
  19. Jan 30, 2017
  20. Jan 25, 2017
    • aokolnychyi's avatar
      [SPARK-16046][DOCS] Aggregations in the Spark SQL programming guide · 3fdce814
      aokolnychyi authored
      ## What changes were proposed in this pull request?
      
      - A separate subsection for Aggregations under “Getting Started” in the Spark SQL programming guide. It mentions which aggregate functions are predefined and how users can create their own.
      - Examples of using the `UserDefinedAggregateFunction` abstract class for untyped aggregations in Java and Scala.
      - Examples of using the `Aggregator` abstract class for type-safe aggregations in Java and Scala.
      - Python is not covered.
      - The PR might not resolve the ticket since I do not know what exactly was planned by the author.
      
      In total, there are four new standalone examples that can be executed via `spark-submit` or `run-example`. The updated Spark SQL programming guide references these examples and does not contain hard-coded snippets.
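      
      As a taste of the type-safe variant mentioned above, a minimal `Aggregator` in the spirit of the new examples (a simple average; hypothetical, not copied from the patch):
      
      ```
      import org.apache.spark.sql.{Encoder, Encoders}
      import org.apache.spark.sql.expressions.Aggregator
      
      case class Average(var sum: Double, var count: Long)
      
      // Type-safe aggregation: Double input, Average buffer, Double output.
      object MyAverage extends Aggregator[Double, Average, Double] {
        def zero: Average = Average(0.0, 0L)
        def reduce(b: Average, a: Double): Average = { b.sum += a; b.count += 1; b }
        def merge(b1: Average, b2: Average): Average = {
          b1.sum += b2.sum; b1.count += b2.count; b1
        }
        def finish(r: Average): Double = r.sum / r.count
        def bufferEncoder: Encoder[Average] = Encoders.product[Average]
        def outputEncoder: Encoder[Double] = Encoders.scalaDouble
      }
      
      // Usage on a Dataset[Double]: ds.select(MyAverage.toColumn)
      ```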
      
      ## How was this patch tested?
      
      The patch was tested locally by building the docs. The examples were run as well.
      
      ![image](https://cloud.githubusercontent.com/assets/6235869/21292915/04d9d084-c515-11e6-811a-999d598dffba.png)
      
      Author: aokolnychyi <okolnychyyanton@gmail.com>
      
      Closes #16329 from aokolnychyi/SPARK-16046.
      3fdce814
  21. Jan 24, 2017
    • Marcelo Vanzin's avatar
      [SPARK-19139][CORE] New auth mechanism for transport library. · 8f3f73ab
      Marcelo Vanzin authored
      This change introduces a new auth mechanism to the transport library,
      to be used when users enable strong encryption. This auth mechanism
      has better security than the currently used DIGEST-MD5.
      
      The new protocol uses symmetric key encryption to mutually authenticate
      the endpoints, and is very loosely based on ISO/IEC 9798.
      
      The new protocol falls back to SASL when it thinks the remote end is old.
      Because SASL does not support asking the server for multiple auth protocols
      (which would have let us re-use the existing SASL code by just adding a
      new SASL provider), the protocol is implemented outside of the SASL API
      to avoid the boilerplate of adding a new provider.
      
      Details of the auth protocol are discussed in the included README.md
      file.
      
      This change partly undoes the changes added in SPARK-13331; AES encryption
      is now decoupled from SASL authentication. The encryption code itself,
      though, has been re-used as part of this change.
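      
      As a configuration sketch (keys per this change's documentation; treat as illustrative):
      
      ```
      spark.authenticate                  true
      # Use the new auth protocol and AES encryption instead of DIGEST-MD5.
      spark.network.crypto.enabled        true
      # Keep SASL fallback on while old shuffle services are still around.
      spark.network.crypto.saslFallback   true
      ```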
      
      ## How was this patch tested?
      
      - Unit tests
      - Tested Spark 2.2 against Spark 1.6 shuffle service with SASL enabled
      - Tested Spark 2.2 against Spark 2.2 shuffle service with SASL fallback disabled
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16521 from vanzin/SPARK-19139.
      8f3f73ab
    • Parag Chaudhari's avatar
      [SPARK-14049][CORE] Add functionality in spark history sever API to query applications by end time · 0ff67a1c
      Parag Chaudhari authored
      ## What changes were proposed in this pull request?
      
      Currently, the Spark history server REST API provides functionality to query applications by application start time range based on the minDate and maxDate query parameters, but it lacks support for querying applications by their end time. This pull request proposes optional minEndDate and maxEndDate query parameters and filtering capability based on these parameters for the Spark history server REST API. This functionality can be used for the following queries:
      1. Applications finished in last 'x' minutes
      2. Applications finished before 'y' time
      3. Applications finished between 'x' time to 'y' time
      4. Applications started from 'x' time and finished before 'y' time.
      
      For backward compatibility, we can keep the existing minDate and maxDate query parameters as they are, and they will continue to support filtering based on the start time range.
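      
      For example, with the proposed parameters, a query might look like this (host and dates are placeholders):
      
      ```
      # Applications that finished between two points in time:
      http://<shs-host>:18080/api/v1/applications?minEndDate=2017-01-20&maxEndDate=2017-01-24
      ```
      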
      ## How was this patch tested?
      
      Existing unit tests and 4 new unit tests.
      
      Author: Parag Chaudhari <paragpc@amazon.com>
      
      Closes #11867 from paragpc/master-SHS-query-by-endtime_2.
      0ff67a1c
    • uncleGen's avatar
      [DOCS] Fix typo in docs · 7c61c2a1
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      Fix typo in docs
      
      ## How was this patch tested?
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #16658 from uncleGen/typo-issue.
      7c61c2a1
  22. Jan 23, 2017
  23. Jan 20, 2017
  24. Jan 17, 2017
    • jerryshao's avatar
      [SPARK-19179][YARN] Change spark.yarn.access.namenodes config and update docs · b79cc7ce
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      The `spark.yarn.access.namenodes` configuration name does not actually reflect its usage: inside the code, it is the Hadoop filesystems we get tokens for, not NNs. So here we propose to update the name of this configuration, and also change the related code and docs.
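      
      A sketch of the rename, assuming the new key is `spark.yarn.access.hadoopFileSystems` as documented with this change (URIs are placeholders):
      
      ```
      # Old name:
      spark.yarn.access.namenodes          hdfs://nn1:8020,webhdfs://nn2:50070
      # New name, reflecting that these are filesystems, not just NNs:
      spark.yarn.access.hadoopFileSystems  hdfs://nn1:8020,webhdfs://nn2:50070
      ```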
      
      ## How was this patch tested?
      
      Local verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #16560 from jerryshao/SPARK-19179.
      b79cc7ce
  25. Jan 15, 2017
    • Maurus Cuelenaere's avatar
      [MINOR][DOC] Document local[*,F] master modes · 3df2d931
      Maurus Cuelenaere authored
      ## What changes were proposed in this pull request?
      
      core/src/main/scala/org/apache/spark/SparkContext.scala contains the LOCAL_N_FAILURES_REGEX master mode, but this was never documented, so this patch does so.
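      
      For instance, assuming the documented meaning of the regex (N threads, F max task failures):
      
      ```
      # Run locally with as many threads as cores, allowing up to 3 task failures.
      bin/spark-shell --master "local[*,3]"
      ```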
      
      ## How was this patch tested?
      
      By using the Github Markdown preview feature.
      
      Author: Maurus Cuelenaere <mcuelenaere@gmail.com>
      
      Closes #16562 from mcuelenaere/patch-1.
      3df2d931
  26. Jan 11, 2017
    • Bryan Cutler's avatar
      [SPARK-17568][CORE][DEPLOY] Add spark-submit option to override ivy settings... · 3bc2eff8
      Bryan Cutler authored
      [SPARK-17568][CORE][DEPLOY] Add spark-submit option to override ivy settings used to resolve packages/artifacts
      
      ## What changes were proposed in this pull request?
      
      Adding an option to spark-submit to allow overriding the default IvySettings used to resolve artifacts as part of the Spark Packages functionality. This will allow all artifact resolution to go through a centrally managed repository, such as Nexus or Artifactory, where site admins can better approve and control what is used with Spark apps.
      
      This change restructures the creation of the IvySettings object in two distinct ways.  First, if the `spark.ivy.settings` option is not defined then `buildIvySettings` will create a default settings instance, as before, with defined repositories (Maven Central) included.  Second, if the option is defined, the ivy settings file will be loaded from the given path and only repositories defined within will be used for artifact resolution.
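      
      A usage sketch, using the option name as given above (the path and package coordinates are placeholders):
      
      ```
      bin/spark-submit \
        --conf spark.ivy.settings=/etc/spark/ivysettings.xml \
        --packages com.example:mylib:1.0.0 \
        my-app.jar
      ```
      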
      ## How was this patch tested?
      
      Existing tests for the default behaviour; manual tests that load an ivysettings.xml file with local and Nexus repositories defined. Added a new test that loads a simple Ivy settings file with a local filesystem resolver.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      Author: Ian Hummel <ian@themodernlife.net>
      
      Closes #15119 from BryanCutler/spark-custom-IvySettings.
      3bc2eff8
    • jerryshao's avatar
      [SPARK-19021][YARN] Generailize HDFSCredentialProvider to support non HDFS security filesystems · 4239a108
      jerryshao authored
      Currently Spark can only get the token renewal interval from secure HDFS (hdfs://). If Spark runs with other secure file systems like webHDFS (webhdfs://), wasb (wasb://), or ADLS, it will ignore these tokens and not get token renewal intervals from them. This makes Spark unable to work with these secure clusters. So instead of only checking the HDFS token, we should generalize to support different DelegationTokenIdentifiers.
      
      ## How was this patch tested?
      
      Manually verified in security cluster.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #16432 from jerryshao/SPARK-19021.
      4239a108
  27. Jan 10, 2017
    • Shixiong Zhu's avatar
      [SPARK-19140][SS] Allow update mode for non-aggregation streaming queries · bc6c56e9
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR allows update mode for non-aggregation streaming queries. It will be the same as append mode if a query has no aggregations.
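      
      A minimal Scala sketch of such a query (a hypothetical socket-to-console pipeline; with no aggregations, update behaves like append):
      
      ```
      // Assumes an existing SparkSession `spark`. With no aggregation, each
      // trigger's new rows are simply written, just as in append mode.
      val lines = spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load()
      
      val query = lines.writeStream
        .outputMode("update")
        .format("console")
        .start()
      ```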
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16520 from zsxwing/update-without-agg.
      bc6c56e9