  1. Jun 22, 2017
    • [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas · e4469760
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Integrate Apache Arrow with Spark to increase the performance of `DataFrame.toPandas`.  This is done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays, which are then served to the Python process.  The Python side collects the Arrow payloads, combines them, and converts the result to a pandas DataFrame.  All non-complex data types are currently supported; an `UnsupportedOperation` exception is thrown otherwise.
      
      Additions to Spark include a package-private Scala method `Dataset.toArrowPayloadBytes` that converts data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served, and a package-private class/object `ArrowConverters` that provides data type mappings and conversion routines.  In Python, a public method `DataFrame.collectAsArrow` is added to collect Arrow payloads, and an optional flag in `toPandas(useArrow=False)` enables Arrow (the old conversion is used by default).
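
      A minimal PySpark sketch of the API described above (a sketch only, based on this description: the `useArrow` flag and the supported types are as stated here, and pyarrow must be installed):

      ```python
      # Sketch based on the description above; assumes `spark` is an active
      # SparkSession and that pyarrow is installed. `useArrow` defaults to False.
      df = spark.range(1 << 20).selectExpr("id", "rand() AS value")

      pdf_arrow = df.toPandas(useArrow=True)   # Arrow-accelerated conversion
      pdf_plain = df.toPandas()                # legacy row-by-row conversion

      # Only non-complex types are supported on the Arrow path; anything else
      # raises an UnsupportedOperation exception, per the description above.
      ```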
      
      ## How was this patch tested?
      Added a new test suite `ArrowConvertersSuite` that runs tests on conversion of Datasets to Arrow payloads for supported types.  The suite generates a Dataset and matching Arrow JSON data; the Dataset is then converted to an Arrow payload and validated against the JSON data.  This ensures that the schema and data have been converted correctly.
      
      Added PySpark tests to verify that the `toPandas` method produces equal DataFrames with and without pyarrow.  A roundtrip test ensures the pandas DataFrame produced by PySpark is equal to one made directly with pandas.
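
      A sketch of that roundtrip check (hypothetical data; at the time `assert_frame_equal` lived in `pandas.util.testing`, `pandas.testing` in newer releases):

      ```python
      import pandas as pd
      from pandas.util.testing import assert_frame_equal

      expected = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})
      df = spark.createDataFrame(expected)

      # With and without Arrow, toPandas should reproduce the original frame.
      result = df.toPandas(useArrow=True).sort_values("a").reset_index(drop=True)
      assert_frame_equal(result, expected)
      ```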
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      Author: Li Jin <ice.xelloss@gmail.com>
      Author: Li Jin <li.jin@twosigma.com>
      Author: Wes McKinney <wes.mckinney@twosigma.com>
      
      Closes #15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.
  2. Jun 06, 2017
    • [SPARK-20641][CORE] Add key-value store abstraction and LevelDB implementation. · 0cba4951
      Marcelo Vanzin authored
      This change adds an abstraction and LevelDB implementation for a key-value
      store that will be used to store UI and SHS (Spark History Server) data.
      
      The interface is described in KVStore.java (see javadoc). Specifics
      of the LevelDB implementation are discussed in the javadocs of both
      LevelDB.java and LevelDBTypeInfo.java.
      
      Also included are a few small benchmarks, just to get some idea of
      latency. Because they're too slow for regular unit test runs, they're
      disabled by default.
      
      Tested with the included unit tests, and also as part of the overall feature
      implementation (including running SHS with hundreds of apps).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #17902 from vanzin/shs-ng/M1.
  3. May 07, 2017
    • [SPARK-7481][BUILD] Add spark-hadoop-cloud module to pull in object store access. · 2cf83c47
      Steve Loughran authored
      ## What changes were proposed in this pull request?
      
      Add a new `spark-hadoop-cloud` module and maven profile to pull in object store support from the `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies (in particular Jackson) so that everything works.
      
      It restores `s3n://` access to S3, and adds its `s3a://` replacement, OpenStack `swift://`, and Azure `wasb://`.
      
      There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the supplier's own documentation, with specific warnings on security and the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised to be very cautious when trying to use an object store as the destination of data, and to consult the documentation of the storage supplier and the connector.
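
      A minimal PySpark sketch of the kind of access this module enables (the bucket name and credentials are hypothetical placeholders; `fs.s3a.*` are standard Hadoop S3A settings, forwarded here via the `spark.hadoop.*` prefix):

      ```python
      from pyspark.sql import SparkSession

      spark = (SparkSession.builder
               .appName("object-store-demo")
               .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")  # placeholder
               .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")  # placeholder
               .getOrCreate())

      df = spark.read.csv("s3a://example-bucket/input/", header=True)
      # Per the warnings above, be cautious when using an object store as a destination.
      df.write.mode("overwrite").parquet("s3a://example-bucket/output/")
      ```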
      
      (this is the successor to #12004; I can't re-open it)
      
      ## How was this patch tested?
      
      Downstream tests exist in [https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples)
      
      Those verify that the dependencies are sufficient to allow downstream applications to work with s3a, azure wasb and swift storage connectors, and perform basic IO & dataframe operations thereon. All seems well.
      
      Manually: a clean build, verifying that the assembly contains the relevant aws-* and hadoop-* artifacts on Hadoop 2.6, and the azure artifacts with the hadoop-2.7 profile.

      SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package`
      Maven build: `mvn install -Phadoop-cloud -Phadoop-2.7`
      
      This PR *does not* update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible.
      
      Author: Steve Loughran <stevel@apache.org>
      Author: Steve Loughran <stevel@hortonworks.com>
      
      Closes #17834 from steveloughran/cloud/SPARK-7481-current.
  4. May 03, 2017
    • [SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release · 16fab6b0
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix build warnings, primarily related to Breeze 0.13 operator changes and Java style problems.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17803 from srowen/SPARK-20523.
  5. Apr 28, 2017
    • [SPARK-20514][CORE] Upgrade Jetty to 9.3.11.v20160721 · 5d71f3db
      Mark Grover authored
      Upgrade Jetty so it can work with Hadoop 3 (the alpha 2 release, in particular).
      Without this change, Spark fails to compile when built against Hadoop 3,
      because of an incompatibility between Jetty versions.
      
      ## How was this patch tested?
      Unit tests being run.
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #17790 from markgrover/spark-20514.
  6. Apr 25, 2017
    • [SPARK-20449][ML] Upgrade breeze version to 0.13.1 · 67eef47a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Upgrade the breeze version to 0.13.1, which fixes some critical bugs in L-BFGS-B.
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17746 from yanboliang/spark-20449.
  7. Feb 22, 2017
    • [SPARK-19405][STREAMING] Support for cross-account Kinesis reads via STS · e4065376
      Adam Budde authored
      - Add dependency on aws-java-sdk-sts
      - Replace SerializableAWSCredentials with new SerializableCredentialsProvider interface
      - Make KinesisReceiver take SerializableCredentialsProvider as argument and
        pass credential provider to KCL
      - Add new implementations of KinesisUtils.createStream() that take STS
        arguments
      - Make JavaKinesisStreamSuite test the entire KinesisUtils Java API
      - Update KCL/AWS SDK dependencies to 1.7.x/1.11.x
      
      ## What changes were proposed in this pull request?
      
      [JIRA link with detailed description.](https://issues.apache.org/jira/browse/SPARK-19405)
      
      * Replace SerializableAWSCredentials with new SerializableKCLAuthProvider class that takes 5 optional config params for configuring AWS auth and returns the appropriate credential provider object
      * Add new public createStream() APIs for specifying these parameters in KinesisUtils
      
      ## How was this patch tested?
      
      * Manually tested using an explicit keypair and an instance profile to read data from a Kinesis stream in a separate account (it is difficult to write a test orchestrating the creation and assumption of IAM roles across separate accounts)
      * Expanded JavaKinesisStreamSuite to test the entire Java API in KinesisUtils
      
      ## License acknowledgement
      This contribution is my original work and I license the work to the project under the project’s open source license.
      
      Author: Budde <budde@amazon.com>
      
      Closes #16744 from budde/master.
  8. Feb 16, 2017
    • [SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support · 0e240549
      Sean Owen authored
      - Move external/java8-tests tests into core, streaming, and sql, and remove the module
      - Remove MaxPermGen and related options
      - Fix some reflection / TODOs around Java 8+ methods
      - Update doc references to 1.7/1.8 differences
      - Remove Java 7/8 related build profiles
      - Update some plugins for better Java 8 compatibility
      - Fix a few Java-related warnings
      
      For the future:
      
      - Update Java 8 examples to fully use Java 8
      - Update Java tests to use lambdas for simplicity
      - Update Java internal implementations to use lambdas
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16871 from srowen/SPARK-19493.
  9. Feb 08, 2017
    • [SPARK-19464][CORE][YARN][TEST-HADOOP2.6] Remove support for Hadoop 2.5 and earlier · e8d3fca4
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove support for Hadoop 2.5 and earlier
      - Remove reflection and code constructs only needed to support multiple versions at once
      - Update docs to reflect newer versions
      - Remove older versions' builds and profiles.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16810 from srowen/SPARK-19464.
  10. Jan 18, 2017
    • [SPARK-18782][BUILD] Bump Hadoop 2.6 version to use Hadoop 2.6.5 · 17ce0b5b
      Adam Roberts authored
      **What changes were proposed in this pull request?**
      
      Use Hadoop 2.6.5 for the Hadoop 2.6 profile; the release notes list a bunch of fixes, including security ones, that we should pick up.
      
      **How was this patch tested?**
      
      Running the unit tests now with IBM's SDK for Java; we'll see what happens with OpenJDK in the community builder. No trouble is expected, as it is only a minor release.
      
      Author: Adam Roberts <aroberts@uk.ibm.com>
      
      Closes #16616 from a-roberts/Hadoop265Bumper.
  11. Dec 21, 2016
    • [SPARK-17807][CORE] split test-tags into test-JAR · afd9bc1d
      Ryan Williams authored
      Remove spark-tags' compile-scope dependency (and, indirectly, spark-core's compile-scope transitive dependency) on scalatest by splitting the test-oriented tags into spark-tags' test JAR.
      
      Alternative to #16303.
      
      Author: Ryan Williams <ryan.blake.williams@gmail.com>
      
      Closes #16311 from ryan-williams/tt.
    • [SPARK-18951] Upgrade com.thoughtworks.paranamer/paranamer to 2.6 · 1a643889
      Yin Huai authored
      ## What changes were proposed in this pull request?
      I recently hit a bug in com.thoughtworks.paranamer/paranamer that causes jackson to fail to handle a byte array defined in a case class. I then found https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests that it is caused by a bug in paranamer. Let's upgrade paranamer. Since we are using jackson 2.6.5, and jackson-module-paranamer 2.6.5 uses com.thoughtworks.paranamer/paranamer 2.6, I suggest that we upgrade paranamer to 2.6.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #16359 from yhuai/SPARK-18951.
  12. Nov 28, 2016
    • [SPARK-18602] Set the version of org.codehaus.janino:commons-compiler to 3.0.0 to match the version of org.codehaus.janino:janino · eba72775
      Yin Huai authored
      
      ## What changes were proposed in this pull request?
      org.codehaus.janino:janino depends on org.codehaus.janino:commons-compiler, and we have upgraded to org.codehaus.janino:janino 3.0.0.

      However, it seems we are still pulling in org.codehaus.janino:commons-compiler 2.7.6 because of calcite. It looks like an accident, because we exclude janino from calcite (see https://github.com/apache/spark/blob/branch-2.1/pom.xml#L1759). So, this PR upgrades org.codehaus.janino:commons-compiler to 3.0.0.
      
      ## How was this patch tested?
      Jenkins.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #16025 from yhuai/janino-commons-compile.
  13. Nov 25, 2016
    • [SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility · 51b1c155
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      This PR only tries to fix things that look pretty straightforward and were already fixed in other PRs before.
      
      This PR roughly fixes several things as below:
      
      - Fix unrecognisable class and method links in javadoc by changing them from `[[..]]` to `` `...` ``
      
        ```
        [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/DataStreamReader.java:226: error: reference not found
        [error]    * Loads text files and returns a {link DataFrame} whose schema starts with a string column named
        ```
      
      - Fix an exception annotation and remove code backticks in `throws` annotation
      
        Currently, sbt unidoc with Java 8 complains as below:
      
        ```
        [error] .../java/org/apache/spark/sql/streaming/StreamingQuery.java:72: error: unexpected text
        [error]    * throws StreamingQueryException, if <code>this</code> query has terminated with an exception.
        ```
      
        The `throws` tag should specify the correct class name, `StreamingQueryException` rather than `StreamingQueryException,`, without backticks (see [JDK-8007644](https://bugs.openjdk.java.net/browse/JDK-8007644)).
      
      - Fix `[[http..]]` to `<a href="http..."></a>`.
      
        ```diff
        -   * [[https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https Oracle
        -   * blog page]].
        +   * <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">
        +   * Oracle blog page</a>.
        ```
      
         `[[http...]]` link markdown in scaladoc is unrecognisable in javadoc.
      
      - It seems a class can't have a `return` annotation, so two occurrences of this were removed.
      
        ```
        [error] .../java/org/apache/spark/mllib/regression/IsotonicRegression.java:27: error: invalid use of return
        [error]    * return New instance of IsotonicRegression.
        ```
      
      - Fix < to `&lt;` and > to `&gt;` according to HTML rules.
      
      - Fix a `</p>` complaint
      
      - Exclude tags unrecognisable in javadoc: `constructor`, `todo` and `groupname`.
      
      ## How was this patch tested?
      
      Manually tested by `jekyll build` with Java 7 and 8
      
      ```
      java version "1.7.0_80"
      Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
      ```
      
      ```
      java version "1.8.0_45"
      Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
      ```
      
      Note: this does not yet make sbt unidoc succeed with Java 8, but it reduces the number of errors with Java 8.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15999 from HyukjinKwon/SPARK-3359-errors.
  14. Nov 20, 2016
    • [SPARK-3359][BUILD][DOCS] Print examples and disable group and tparam tags in javadoc · c528812c
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes/fixes two things.
      
      - Remove many of the errors when generating javadoc with Java 8 that come from the unrecognisable tags `tparam` and `group`.
      
        ```
        [error] .../spark/mllib/target/java/org/apache/spark/ml/classification/Classifier.java:18: error: unknown tag: group
        [error]   /** group setParam */
        [error]       ^
        [error] .../spark/mllib/target/java/org/apache/spark/ml/classification/Classifier.java:8: error: unknown tag: tparam
        [error]  * tparam FeaturesType  Type of input features.  E.g., <code>Vector</code>
        [error]    ^
        ...
        ```
      
        It does not fully resolve the problem but removes many errors. It seems both `group` and `tparam` are unrecognisable in javadoc. It seems we can't print them nicely in javadoc the way `example` is handled here, because they appear differently (both examples can be found in http://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.ml.classification.Classifier).
      
      - Print `example` in javadoc.
        Currently, there are a few `example` tags in several places.
      
        ```
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This operation might be used to evaluate a graph
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example We might use this operation to change the vertex values
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function might be used to initialize edge
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function might be used to initialize edge
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function might be used to initialize edge
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example We can use this function to compute the in-degree of each
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function is used to update the vertices with new values based on external data.
        ./graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala:   * example Loads a file in the following format:
        ./graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala:   * example This function is used to update the vertices with new
        ./graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala:   * example This function can be used to filter the graph based on some property, without
        ./graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala: * example We can use the Pregel abstraction to implement PageRank:
        ./graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala: * example Construct a `VertexRDD` from a plain RDD:
        ./repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkCommandLine.scala: * example new SparkCommandLine(Nil).settings
        ./repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkIMain.scala:   * example addImports("org.apache.spark.SparkContext")
        ./sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/LiteralGenerator.scala: * example {{{
        ```
      
      **Before**
      
        <img width="505" alt="2016-11-20 2 43 23" src="https://cloud.githubusercontent.com/assets/6477701/20457285/26f07e1c-aecb-11e6-9ae9-d9dee66845f4.png">
      
      **After**
        <img width="499" alt="2016-11-20 1 27 17" src="https://cloud.githubusercontent.com/assets/6477701/20457240/409124e4-aeca-11e6-9a91-0ba514148b52.png">
      
      ## How was this patch tested?
      
      Manually tested by `jekyll build` with Java 7 and 8
      
      ```
      java version "1.7.0_80"
      Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
      ```
      
      ```
      java version "1.8.0_45"
      Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
      ```
      
      Note: this does not make sbt unidoc succeed with Java 8 yet, but it reduces the number of errors with Java 8.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15939 from HyukjinKwon/SPARK-3359-javadoc.
  15. Nov 19, 2016
    • [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that`/`'''Note:'''` across Scala/Java API documentation · d5b1d5fc
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      It seems in Scala/Java, the following variants are used:
      
      - `Note:`
      - `NOTE:`
      - `Note that`
      - `'''Note:'''`
      - `note`
      
      This PR proposes to fix those to `note` to be consistent.
      
      **Before**
      
      - Scala
        ![2016-11-17 6 16 39](https://cloud.githubusercontent.com/assets/6477701/20383180/1a7aed8c-acf2-11e6-9611-5eaf6d52c2e0.png)
      
      - Java
        ![2016-11-17 6 14 41](https://cloud.githubusercontent.com/assets/6477701/20383096/c8ffc680-acf1-11e6-914a-33460bf1401d.png)
      
      **After**
      
      - Scala
        ![2016-11-17 6 16 44](https://cloud.githubusercontent.com/assets/6477701/20383167/09940490-acf2-11e6-937a-0d5e1dc2cadf.png)
      
      - Java
        ![2016-11-17 6 13 39](https://cloud.githubusercontent.com/assets/6477701/20383132/e7c2a57e-acf1-11e6-9c47-b849674d4d88.png)
      
      ## How was this patch tested?
      
      The notes were found via
      
      ```bash
      grep -r "NOTE: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// NOTE: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages that appear in API documentation
      -e 'org.apache.spark.api.java.function' \ # note that this is a regular expression. So actual matches were mostly `org/apache/spark/api/java/functions ...`
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note that " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note that " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages that appear in API documentation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages that appear in API documentation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "'''Note:'''" . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// '''Note:''' " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages that appear in API documentation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      And then fixed them one by one, comparing with the API documentation/access modifiers.
      
      After that, manually tested via `jekyll build`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15889 from HyukjinKwon/SPARK-18437.
  16. Nov 12, 2016
    • [SPARK-18375][SPARK-18383][BUILD][CORE] Upgrade netty to 4.0.42.Final · bc41d997
      Guoqiang Li authored
      ## What changes were proposed in this pull request?
      
      One of the important changes in 4.0.42.Final is "Support any FileRegion implementation when using epoll transport" (netty/netty#5825).
      In 4.0.42.Final, `MessageWithHeader` works properly when `spark.[shuffle|rpc].io.mode` is set to epoll.
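
      A hedged sketch of opting into the epoll transport named above (assumes Linux, where the native epoll transport is available; NIO remains the default):

      ```python
      from pyspark import SparkConf, SparkContext

      # `spark.[shuffle|rpc].io.mode` (named in the description above) selects
      # the netty transport used for shuffle and RPC traffic.
      conf = (SparkConf()
              .setAppName("epoll-demo")
              .set("spark.shuffle.io.mode", "EPOLL")
              .set("spark.rpc.io.mode", "EPOLL"))
      sc = SparkContext(conf=conf)
      ```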
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Guoqiang Li <witgo@qq.com>
      
      Closes #15830 from witgo/SPARK-18375_netty-4.0.42.
  17. Nov 10, 2016
    • [SPARK-18262][BUILD][SQL] JSON.org license is now CatX · 16eaad9d
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Try excluding org.json:json from the hive-exec dependency, as it's Cat X now. It may be that it's not used by the parts of Hive that Spark uses anyway.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15798 from srowen/SPARK-18262.
  18. Nov 02, 2016
    • [SPARK-17058][BUILD] Add maven snapshots-and-staging profile to build/test against staging artifacts · 37d95227
      Steve Loughran authored
      
      ## What changes were proposed in this pull request?
      
      Adds a `snapshots-and-staging` profile so that RCs of projects like Hadoop and HBase can be used in developer-only build and test runs. There's a comment above the profile telling people not to use this in production.
      
      There's no attempt to do the same for SBT, as Ivy is different.
      ## How was this patch tested?
      
      Tested by building against the Hadoop 2.7.3 RC1 JARs.

      Without the profile (and without any local copy of the 2.7.3 artifacts), the build failed:
      
      ```
      mvn install -DskipTests -Pyarn,hadoop-2.7,hive -Dhadoop.version=2.7.3
      
      ...
      
      [INFO] ------------------------------------------------------------------------
      [INFO] Building Spark Project Launcher 2.1.0-SNAPSHOT
      [INFO] ------------------------------------------------------------------------
      Downloading: https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client/2.7.3/hadoop-client-2.7.3.pom
      [WARNING] The POM for org.apache.hadoop:hadoop-client:jar:2.7.3 is missing, no dependency information available
      Downloading: https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client/2.7.3/hadoop-client-2.7.3.jar
      [INFO] ------------------------------------------------------------------------
      [INFO] Reactor Summary:
      [INFO]
      [INFO] Spark Project Parent POM ........................... SUCCESS [  4.482 s]
      [INFO] Spark Project Tags ................................. SUCCESS [ 17.402 s]
      [INFO] Spark Project Sketch ............................... SUCCESS [ 11.252 s]
      [INFO] Spark Project Networking ........................... SUCCESS [ 13.458 s]
      [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  9.043 s]
      [INFO] Spark Project Unsafe ............................... SUCCESS [ 16.027 s]
      [INFO] Spark Project Launcher ............................. FAILURE [  1.653 s]
      [INFO] Spark Project Core ................................. SKIPPED
      ...
      ```
      
      With the profile, the build completed:
      
      ```
      mvn install -DskipTests -Pyarn,hadoop-2.7,hive,snapshots-and-staging -Dhadoop.version=2.7.3
      ```
      
      Author: Steve Loughran <stevel@apache.org>
      
      Closes #14646 from steveloughran/stevel/SPARK-17058-support-asf-snapshots.
  19. Oct 18, 2016
    • Revert "[SPARK-17985][CORE] Bump commons-lang3 version to 3.5." · cd662bc7
      Reynold Xin authored
      This reverts commit bfe7885a.
      
      The commit caused build failures on the Hadoop 2.2 profile:
      
      ```
      [error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1489: value read is not a member of object org.apache.commons.io.IOUtils
      [error]       var numBytes = IOUtils.read(gzInputStream, buf)
      [error]                              ^
      [error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1492: value read is not a member of object org.apache.commons.io.IOUtils
      [error]         numBytes = IOUtils.read(gzInputStream, buf)
      [error]                            ^
      ```
    • [SPARK-17985][CORE] Bump commons-lang3 version to 3.5. · bfe7885a
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      `SerializationUtils.clone()` in commons-lang3 (< 3.5) has a bug that breaks thread safety: it sometimes gets stuck, due to a race condition when initializing a hash map.
      See https://issues.apache.org/jira/browse/LANG-1251.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #15525 from ueshin/issues/SPARK-17985.
  20. Oct 05, 2016
    • [SPARK-17346][SQL] Add Kafka source for Structured Streaming · 9293734d
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR adds a new project `external/kafka-0-10-sql` for the Structured Streaming Kafka source.
      
      It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
      
      tdas did most of the work, and parts of it were inspired by koeninger's work.
      
      ### Introduction
      
      The Kafka source is a structured streaming data source that polls data from Kafka. The schema of the data it reads is as follows:
      
      Column | Type
      ---- | ----
      key | binary
      value | binary
      topic | string
      partition | int
      offset | long
      timestamp | long
      timestampType | int
      
      The source can deal with topics being deleted. However, the user should make sure no Spark job is processing the data when a topic is deleted.
      
      ### Configuration
      
      The user can use `DataStreamReader.option` to set the following configurations.
      
      Kafka Source's options | value | default | meaning
      ------ | ------- | ------ | -----
      startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest", which is from the earliest offset, or "latest", which is just from the latest offset. Note: this only applies when a new streaming query is started; resuming will always pick up from where the query left off.
      failOnDataLoss | [true, false] | true | Whether to fail the query when it's possible that data was lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm; you can disable it when it doesn't work as expected.
      subscribe | A comma-separated list of topics | (none) | The topic list to subscribe to. Only one of "subscribe" and "subscribePattern" options can be specified for the Kafka source.
      subscribePattern | Java regex string | (none) | The pattern used to subscribe to topics. Only one of "subscribe" and "subscribePattern" options can be specified for the Kafka source.
      kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors
      fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fetching Kafka's latest offsets.
      fetchOffset.retryIntervalMs | long | 10 | Milliseconds to wait before retrying to fetch Kafka offsets
      
      Kafka's own configurations can be set via `DataStreamReader.option` with the `kafka.` prefix, e.g., `stream.option("kafka.bootstrap.servers", "host:port")`.
      
      ### Usage
      
      * Subscribe to 1 topic
      ```Scala
      spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribe", "topic1")
        .load()
      ```
      
      * Subscribe to multiple topics
      ```Scala
      spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribe", "topic1,topic2")
        .load()
      ```
      
      * Subscribe to a pattern
      ```Scala
      spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribePattern", "topic.*")
        .load()
      ```
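
      * The same calls work from PySpark; a minimal sketch (option names as listed in the configuration table above; later releases renamed some of them)
      ```python
      df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "host:port")
            .option("subscribePattern", "topic.*")
            .option("startingOffset", "earliest")
            .load())
      ```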
      
      ## How was this patch tested?
      
      The new unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Shixiong Zhu <zsxwing@gmail.com>
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #15102 from zsxwing/kafka-source.
  21. Sep 19, 2016
    • [SPARK-17473][SQL] fixing docker integration tests error due to different versions of jars. · cdea1d13
      sureshthalamati authored
      ## What changes were proposed in this pull request?
      The docker tests are using an older version of the jersey jars (1.19), which was used in older releases of Spark. In the 2.0 releases, Spark was upgraded to use the 2.x version of Jersey. After the upgrade to the new versions, the docker tests fail with AbstractMethodError. Now that Spark is upgraded to the 2.x jersey version, using the shaded docker jars may not be required any more. Removed the exclusions/overrides of jersey-related classes from the pom file, and changed the docker-client to use the regular jar instead of the shaded one.
      
      ## How was this patch tested?
      
      Tested using the existing docker-integration-tests.
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #15114 from sureshthalamati/docker_testfix-spark-17473.
  22. Sep 16, 2016
    • [SPARK-17558] Bump Hadoop 2.7 version from 2.7.2 to 2.7.3 · dca771be
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch bumps the Hadoop version in the hadoop-2.7 profile from 2.7.2 to 2.7.3, which was recently released and contains a number of bug fixes.
      
      ## How was this patch tested?
      The change should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15115 from rxin/SPARK-17558.
  23. Sep 15, 2016
    • [SPARK-17379][BUILD] Upgrade netty-all to 4.0.41 final for bug fixes · 0ad8eeb4
      Adam Roberts authored
      ## What changes were proposed in this pull request?
      Upgrade netty-all to the latest in the 4.0.x line, 4.0.41, which mentions several bug fixes and performance improvements we may find useful; see netty.io/news/2016/08/29/4-0-41-Final-4-1-5-Final.html. Initially tried to use 4.1.5, but noticed it's not backwards compatible.
      
      ## How was this patch tested?
      Existing unit tests against branch-1.6 and branch-2.0 using IBM Java 8 on Intel, Power and Z architectures
      
      Author: Adam Roberts <aroberts@uk.ibm.com>
      
      Closes #14961 from a-roberts/netty.