  1. Sep 13, 2017
    • Sean Owen's avatar
      [SPARK-21893][BUILD][STREAMING][WIP] Put Kafka 0.8 behind a profile · 4fbf748b
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Put Kafka 0.8 support behind a kafka-0-8 profile.
      
      ## How was this patch tested?
      
       Existing tests, but until the PR builder and Jenkins configs are updated, the effect here is not to build or test Kafka 0.8 support at all.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19134 from srowen/SPARK-21893.
      4fbf748b
  2. Sep 05, 2017
    • jerryshao's avatar
      [SPARK-9104][CORE] Expose Netty memory metrics in Spark · 445f1790
      jerryshao authored
      ## What changes were proposed in this pull request?
      
       This PR exposes Netty memory usage for Spark's `TransportClientFactory` and `TransportServer`, including per-arena details for each direct and heap arena as well as aggregated metrics. The purpose of adding the Netty metrics is to better understand the memory usage of Netty in Spark shuffle, RPC and other network communications, and to guide us in configuring executor memory sizes.
      
       This PR doesn't expose these metrics to any sink; to leverage this feature, they still need to be connected to the MetricsSystem or collected back to the driver for display.
      
      ## How was this patch tested?
      
       Added a unit test to verify it, and manually verified it in a real cluster.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #18935 from jerryshao/SPARK-9104.
      445f1790
    • hyukjinkwon's avatar
      [SPARK-20978][SQL] Bump up Univocity version to 2.5.4 · 02a4386a
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
       There was a bug in the Univocity parser that caused the issue in SPARK-20978. It can be reproduced as below:
      
      ```scala
      val df = spark.read.schema("a string, b string, unparsed string").option("columnNameOfCorruptRecord", "unparsed").csv(Seq("a").toDS())
      df.show()
      ```
      
      **Before**
      
      ```
      java.lang.NullPointerException
      	at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89)
      	at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29)
      	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56)
      	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
      	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
      ...
      ```
      
      **After**
      
      ```
      +---+----+--------+
      |  a|   b|unparsed|
      +---+----+--------+
      |  a|null|       a|
      +---+----+--------+
      ```
      
       The bug was fixed in 2.5.0, and 2.5.4 has since been released, so it should be safe to upgrade.
      
      ## How was this patch tested?
      
      Unit test added in `CSVSuite.scala`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #19113 from HyukjinKwon/bump-up-univocity.
      02a4386a
  3. Sep 01, 2017
    • Sean Owen's avatar
      [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala... · 12ab7f7e
      Sean Owen authored
      [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala 2.12 profiles and enable 2.12 compilation
      
      …build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure
      
      ## What changes were proposed in this pull request?
      
      This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts.
      
      In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11.
      
      It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release.
      
      - Scalatest 2.x -> 3.0.3
      - Chill 0.8.0 -> 0.8.4
      - Clapper 1.0.x -> 1.1.2
      - json4s 3.2.x -> 3.4.2
      - Jackson 2.6.x -> 2.7.9 (required by json4s)
      
      This change does _not_ fully enable a Scala 2.12 build:
      
       - It will also require dropping support for Kafka before 0.10. That is easy enough, but it is not done here yet.
       - It will require recreating `SparkILoop` and `Main` for the 2.12 REPL, which is SPARK-14650. That could also be done here.
      
      What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build.
      
      ## How was this patch tested?
      
      Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18645 from srowen/SPARK-14280.
      12ab7f7e
  4. Aug 31, 2017
    • ArtRand's avatar
      [SPARK-20812][MESOS] Add secrets support to the dispatcher · fc45c2c8
      ArtRand authored
       Mesos has secrets primitives for environment- and file-based secrets; this PR adds that functionality to the Spark dispatcher, along with the appropriate configuration flags.
      Unit tested and manually tested against a DC/OS cluster with Mesos 1.4.
      
      Author: ArtRand <arand@soe.ucsc.edu>
      
      Closes #18837 from ArtRand/spark-20812-dispatcher-secrets-and-labels.
      fc45c2c8
  5. Aug 24, 2017
    • Herman van Hovell's avatar
      [SPARK-21830][SQL] Bump ANTLR version and fix a few issues. · 05af2de0
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      This PR bumps the ANTLR version to 4.7, and fixes a number of small parser related issues uncovered by the bump.
      
       The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns when parsing boolean predicates. For example, the following query takes forever to parse:
      ```sql
      SELECT *
      FROM RANGE(1000)
      WHERE
      TRUE
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      ```
      
       This is caused by a known bug in ANTLR (https://github.com/antlr/antlr4/issues/994), which was fixed in version 4.6.
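
       For reference, a minimal PySpark sketch of how this slowdown could be timed (illustrative only, not part of the patch; the session setup and the number of repeated predicates are assumptions):

       ```python
       import time
       from pyspark.sql import SparkSession

       spark = SparkSession.builder.master("local[1]").appName("antlr-parse-repro").getOrCreate()

       # Rebuild the query above: TRUE followed by many identical NOT ... LIKE predicates.
       predicates = "\n".join(["AND NOT upper(DESCRIPTION) LIKE '%FOO%'"] * 18)
       query = "SELECT * FROM RANGE(1000) WHERE\nTRUE\n" + predicates

       start = time.time()
       try:
           spark.sql(query)   # parsing dominates on ANTLR 4.5
       except Exception:
           pass               # RANGE(1000) has no DESCRIPTION column, so analysis may fail quickly on 4.7
       print("elapsed: %.1fs" % (time.time() - start))
       ```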
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #19042 from hvanhovell/SPARK-21830.
      05af2de0
  6. Aug 16, 2017
    • Dongjoon Hyun's avatar
      [SPARK-21422][BUILD] Depend on Apache ORC 1.4.0 · 8c54f1eb
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
       As with Parquet, this PR aims to depend on the latest Apache ORC 1.4 for Apache Spark 2.3. Depending on Apache ORC 1.4 has key benefits:
      
       - Stability: Apache ORC 1.4.0 has many fixes, and we can rely on the ORC community more.
       - Maintainability: It reduces the Hive dependency and lets us remove old legacy code later.
      
       Later, we can also get the following two key benefits by adding the new ORCFileFormat in SPARK-20728 (#17980):
       - Usability: Users can use ORC data sources without the Hive module, i.e. without -Phive (see the sketch after this list).
       - Speed: Spark's ColumnarBatch and ORC's RowBatch can be used together, which will be faster than the current implementation in Spark.
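
       As a usability illustration, a sketch of reading and writing ORC through the plain data source API, assuming the native ORC path from SPARK-20728 is in place (the path is illustrative):

       ```python
       from pyspark.sql import SparkSession

       spark = SparkSession.builder.appName("orc-demo").getOrCreate()

       # Write and read ORC via the generic data source API, with no Hive-specific setup.
       df = spark.range(10)
       df.write.mode("overwrite").orc("/tmp/orc-demo")   # illustrative path
       spark.read.orc("/tmp/orc-demo").show()
       ```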
      
      ## How was this patch tested?
      
       Pass the Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #18640 from dongjoon-hyun/SPARK-21422.
      8c54f1eb
  7. Aug 12, 2017
    • pj.fanning's avatar
      [SPARK-21709][BUILD] sbt 0.13.16 and some plugin updates · c0e333db
      pj.fanning authored
      ## What changes were proposed in this pull request?
      
      Update sbt version to 0.13.16. I think this is a useful stepping stone to getting to sbt 1.0.0.
      
      ## How was this patch tested?
      
      Existing Build.
      
      Author: pj.fanning <pj.fanning@workday.com>
      
      Closes #18921 from pjfanning/SPARK-21709.
      c0e333db
    • Sean Owen's avatar
      [MINOR][BUILD] Download RAT and R version info over HTTPS; use RAT 0.12 · b0bdfce9
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      This is trivial, but bugged me. We should download software over HTTPS.
      And we can use RAT 0.12 while at it to pick up bug fixes.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18927 from srowen/Rat012.
      b0bdfce9
  8. Aug 09, 2017
    • Takeshi Yamamuro's avatar
      [SPARK-21276][CORE] Update lz4-java to the latest (v1.4.0) · b78cf13b
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
       This PR updates `lz4-java` to the latest version (v1.4.0) and removes the custom `LZ4BlockInputStream`. We currently use a custom `LZ4BlockInputStream` to read concatenated byte streams in shuffle, but this functionality has been implemented in the latest lz4-java (https://github.com/lz4/lz4-java/pull/105), so we can update to the latest version and remove the custom `LZ4BlockInputStream`.
      
       Major diffs between the latest release and the v1.3.0 used on master are as follows (https://github.com/lz4/lz4-java/compare/62f7547abb0819d1ca1e669645ee1a9d26cd60b0...6d4693f56253fcddfad7b441bb8d917b182efa2d):
      - fixed NPE in XXHashFactory similarly
      - Don't place resources in default package to support shading
      - Fixes ByteBuffer methods failing to apply arrayOffset() for array-backed
      - Try to load lz4-java from java.library.path, then fallback to bundled
      - Add ppc64le binary
      - Add s390x JNI binding
      - Add basic LZ4 Frame v1.5.0 support
      - enable aarch64 support for lz4-java
       - Allow unsafeInstance() for ppc64le architecture
      - Add unsafeInstance support for AArch64
      - Support 64-bit JNI build on Solaris
      - Avoid over-allocating a buffer
      - Allow EndMark to be incompressible for LZ4FrameInputStream.
      - Concat byte stream
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #18883 from maropu/SPARK-21276.
      b78cf13b
    • WeichenXu's avatar
      [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search · b35660dd
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
       Update breeze to 0.13.2 for an emergency bugfix in strong Wolfe line search
      https://github.com/scalanlp/breeze/pull/651
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #18797 from WeichenXu123/update-breeze.
      b35660dd
  9. Aug 08, 2017
  10. Aug 06, 2017
  11. Jul 30, 2017
    • hyukjinkwon's avatar
      [MINOR] Minor comment fixes in merge_spark_pr.py script · f1a798b5
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
       This PR proposes to fix a few typos in `merge_spark_pr.py`.
      
      - `#   usage: ./apache-pr-merge.py    (see config env vars below)`
        -> `#   usage: ./merge_spark_pr.py    (see config env vars below)`
      
      - `... have local a Spark ...` -> `... have a local Spark ...`
      
      - `... to Apache.` -> `... to Apache Spark.`
      
       I skimmed this file and these look like all I could find.
      
      ## How was this patch tested?
      
      pep8 check (`./dev/lint-python`).
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18776 from HyukjinKwon/minor-merge-script.
      f1a798b5
  12. Jul 18, 2017
  13. Jul 13, 2017
    • Sean Owen's avatar
      [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 · 425c4ada
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove Scala 2.10 build profiles and support
      - Replace some 2.10 support in scripts with commented placeholders for 2.12 later
      - Remove deprecated API calls from 2.10 support
      - Remove usages of deprecated context bounds where possible
      - Remove Scala 2.10 workarounds like ScalaReflectionLock
      - Other minor Scala warning fixes
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17150 from srowen/SPARK-19810.
      425c4ada
  14. Jul 10, 2017
    • Bryan Cutler's avatar
      [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas · d03aebbe
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
       Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`.  This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays, which are then served to the Python process.  The Python side can then collect the Arrow payloads, combine them, and convert them to a Pandas DataFrame.  Data types except complex, date, timestamp, and decimal are currently supported; otherwise an `UnsupportedOperation` exception is thrown.
      
       Additions to Spark include a Scala package-private method `Dataset.toArrowPayload` that converts data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served, and a package-private class/object `ArrowConverters` that provides data type mappings and conversion routines.  In Python, a private method `DataFrame._collectAsArrow` is added to collect Arrow payloads, and a SQLConf "spark.sql.execution.arrow.enable" can be used in `toPandas()` to enable Arrow (the old conversion is used by default).
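
       A minimal usage sketch of the conf described above (the conf name comes from the description; pyarrow must be installed, and the session setup and data are illustrative):

       ```python
       from pyspark.sql import SparkSession

       spark = SparkSession.builder.appName("arrow-topandas-demo").getOrCreate()

       # Opt in to the Arrow-based conversion path described above (disabled by default).
       spark.conf.set("spark.sql.execution.arrow.enable", "true")

       df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
       pdf = df.toPandas()   # partitions are served to Python as Arrow payloads when enabled
       print(pdf)
       ```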
      
      ## How was this patch tested?
       Added a new test suite `ArrowConvertersSuite` that runs tests on conversion of Datasets to Arrow payloads for supported types.  The suite generates a Dataset and matching Arrow JSON data, converts the Dataset to an Arrow payload, and finally validates it against the JSON data.  This ensures that the schema and data have been converted correctly.
      
       Added PySpark tests to verify that the `toPandas` method produces equal DataFrames with and without pyarrow.  A roundtrip test ensures that the pandas DataFrame produced by PySpark is equal to one made directly with pandas.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      Author: Li Jin <ice.xelloss@gmail.com>
      Author: Li Jin <li.jin@twosigma.com>
      Author: Wes McKinney <wes.mckinney@twosigma.com>
      
      Closes #18459 from BryanCutler/toPandas_with_arrow-SPARK-13534.
      d03aebbe
  15. Jul 05, 2017
    • Dongjoon Hyun's avatar
      [SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6 · c8d0aba1
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to bump Py4J in order to fix the following float/double bug.
      Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272) and the latest Py4J is 0.10.6.
      
      **BEFORE**
      ```
      >>> df = spark.range(1)
      >>> df.select(df['id'] + 17.133574204226083).show()
      +--------------------+
      |(id + 17.1335742042)|
      +--------------------+
      |       17.1335742042|
      +--------------------+
      ```
      
      **AFTER**
      ```
      >>> df = spark.range(1)
      >>> df.select(df['id'] + 17.133574204226083).show()
      +-------------------------+
      |(id + 17.133574204226083)|
      +-------------------------+
      |       17.133574204226083|
      +-------------------------+
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #18546 from dongjoon-hyun/SPARK-21278.
      c8d0aba1
  16. Jun 28, 2017
  17. Jun 24, 2017
    • hyukjinkwon's avatar
       [SPARK-21189][INFRA] Handle unknown error codes in Jenkins rather than leaving... · 7c7bc8fc
      hyukjinkwon authored
       [SPARK-21189][INFRA] Handle unknown error codes in Jenkins rather than leaving incomplete comments in PRs
      
      ## What changes were proposed in this pull request?
      
       Recently, Jenkins tests have been unstable for unknown reasons, as shown below:
      
      ```
       /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-r ; process was terminated by signal 9
          test_result_code, test_result_note = run_tests(tests_timeout)
        File "./dev/run-tests-jenkins.py", line 140, in run_tests
          test_result_note = ' * This patch **fails %s**.' % failure_note_by_errcode[test_result_code]
      KeyError: -9
      ```
      
      ```
      Traceback (most recent call last):
        File "./dev/run-tests-jenkins.py", line 226, in <module>
          main()
        File "./dev/run-tests-jenkins.py", line 213, in main
          test_result_code, test_result_note = run_tests(tests_timeout)
        File "./dev/run-tests-jenkins.py", line 140, in run_tests
          test_result_note = ' * This patch **fails %s**.' % failure_note_by_errcode[test_result_code]
      KeyError: -10
      ```
      
       This exception appears to cause the update of the PR comments to fail. For example:
      
      ![2017-06-23 4 19 41](https://user-images.githubusercontent.com/6477701/27470626-d035ecd8-582f-11e7-883e-0ae6941659b7.png)
      
      ![2017-06-23 4 19 50](https://user-images.githubusercontent.com/6477701/27470629-d11ba782-582f-11e7-97e0-64d28cbc19aa.png)
      
       these comments just remain incomplete.
      
       This always imposes overhead on both reviewers and the author to click through and check the logs, which I believe is not really useful.
      
       This PR proposes to include the error code in the PR comment messages and let the comments be updated properly.
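
       As an illustration only (the names and messages below are assumptions, not the actual `dev/run-tests-jenkins.py` code), the fallback amounts to looking up the error code with a default instead of letting `KeyError` escape:

       ```python
       # Known exit codes map to specific failure notes; anything else gets a generic note
       # that still includes the raw code, so the PR comment can always be posted.
       failure_note_by_errcode = {
           1: "some tests",
           # ... other known codes ...
       }

       def failure_note(test_result_code):
           default_note = "due to an unknown error code, %s" % test_result_code
           note = failure_note_by_errcode.get(test_result_code, default_note)
           return " * This patch **fails %s**." % note

       print(failure_note(-9))    # unknown codes such as -9 or -10 no longer raise KeyError
       ```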
      
      ## How was this patch tested?
      
       Jenkins tests below; I manually supplied the error codes to test this.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18399 from HyukjinKwon/jenkins-print-errors.
      7c7bc8fc
  18. Jun 22, 2017
    • Bryan Cutler's avatar
      [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas · e4469760
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
       Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`.  This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays, which are then served to the Python process.  The Python side can then collect the Arrow payloads, combine them, and convert them to a Pandas DataFrame.  All non-complex data types are currently supported; otherwise an `UnsupportedOperation` exception is thrown.
      
       Additions to Spark include a Scala package-private method `Dataset.toArrowPayloadBytes` that converts data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served, and a package-private class/object `ArrowConverters` that provides data type mappings and conversion routines.  In Python, a public method `DataFrame.collectAsArrow` is added to collect Arrow payloads, and an optional flag in `toPandas(useArrow=False)` enables using Arrow (the old conversion is used by default).
      
      ## How was this patch tested?
       Added a new test suite `ArrowConvertersSuite` that runs tests on conversion of Datasets to Arrow payloads for supported types.  The suite generates a Dataset and matching Arrow JSON data, converts the Dataset to an Arrow payload, and finally validates it against the JSON data.  This ensures that the schema and data have been converted correctly.
      
       Added PySpark tests to verify that the `toPandas` method produces equal DataFrames with and without pyarrow.  A roundtrip test ensures that the pandas DataFrame produced by PySpark is equal to one made directly with pandas.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      Author: Li Jin <ice.xelloss@gmail.com>
      Author: Li Jin <li.jin@twosigma.com>
      Author: Wes McKinney <wes.mckinney@twosigma.com>
      
      Closes #15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.
      e4469760
  19. Jun 19, 2017
  20. Jun 15, 2017
    • Michael Gummelt's avatar
      [SPARK-20434][YARN][CORE] Move Hadoop delegation token code from yarn to core · a18d6371
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
       Move Hadoop delegation token code from `spark-yarn` to `spark-core`, so that other schedulers (such as Mesos) may use it.  In order to avoid exposing Hadoop interfaces in spark-core, the new Hadoop delegation token classes are kept private.  In order to provide backward compatibility, and to allow YARN users to continue to load their own delegation token providers via Java service loading, the old YARN interfaces, as well as the client code that uses them, have been retained.
      
      Summary:
       - Move registered `yarn.security.ServiceCredentialProvider` classes from `spark-yarn` to `spark-core`.  Moved them into a new, private hierarchy under `HadoopDelegationTokenProvider`.  Client code in `HadoopDelegationTokenManager` now loads credentials from a whitelist of three providers (`HadoopFSDelegationTokenProvider`, `HiveDelegationTokenProvider`, `HBaseDelegationTokenProvider`) instead of service loading, which means that users are not able to implement their own delegation token providers as they can in the `spark-yarn` module.
      
      - The `yarn.security.ServiceCredentialProvider` interface has been kept for backwards compatibility, and to continue to allow YARN users to implement their own delegation token provider implementations.  Client code in YARN now fetches tokens via the new `YARNHadoopDelegationTokenManager` class, which fetches tokens from the core providers through `HadoopDelegationTokenManager`, as well as service loads them from `yarn.security.ServiceCredentialProvider`.
      
      Old Hierarchy:
      
      ```
      yarn.security.ServiceCredentialProvider (service loaded)
        HadoopFSCredentialProvider
        HiveCredentialProvider
        HBaseCredentialProvider
      yarn.security.ConfigurableCredentialManager
      ```
      
      New Hierarchy:
      
      ```
      HadoopDelegationTokenManager
      HadoopDelegationTokenProvider (not service loaded)
        HadoopFSDelegationTokenProvider
        HiveDelegationTokenProvider
        HBaseDelegationTokenProvider
      
      yarn.security.ServiceCredentialProvider (service loaded)
      yarn.security.YARNHadoopDelegationTokenManager
      ```
      ## How was this patch tested?
      
      unit tests
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      Author: Dr. Stefan Schimanski <sttts@mesosphere.io>
      
      Closes #17723 from mgummelt/SPARK-20434-refactor-kerberos.
      a18d6371
  21. Jun 11, 2017
  22. Jun 02, 2017
    • Wenchen Fan's avatar
      [SPARK-20974][BUILD] we should run REPL tests if SQL module has code changes · 864d94fe
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      REPL module depends on SQL module, so we should run REPL tests if SQL module has code changes.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18191 from cloud-fan/test.
      864d94fe
    • hyukjinkwon's avatar
      [MINOR][PYTHON] Ignore pep8 on test scripts generated in tests in work directory · 0e31e28d
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
       Currently, if we run `./python/run-tests.py` and it is aborted without cleaning up this directory, the pep8 check fails due to some generated Python scripts. For example, https://github.com/apache/spark/blob/7387126f83dc0489eb1df734bfeba705709b7861/python/pyspark/tests.py#L1955-L1968
      
      ```
      PEP8 checks failed.
      ./work/app-20170531190857-0000/0/test.py:5:55: W292 no newline at end of file
      ./work/app-20170531190909-0000/0/test.py:5:55: W292 no newline at end of file
      ./work/app-20170531190924-0000/0/test.py:3:1: E302 expected 2 blank lines, found 1
      ./work/app-20170531190924-0000/0/test.py:7:52: W292 no newline at end of file
      ./work/app-20170531191016-0000/0/test.py:5:55: W292 no newline at end of file
      ./work/app-20170531191030-0000/0/test.py:5:55: W292 no newline at end of file
      ./work/app-20170531191045-0000/0/test.py:3:1: E302 expected 2 blank lines, found 1
      ./work/app-20170531191045-0000/0/test.py:7:52: W292 no newline at end of file
      ```
      
       For me, this is sometimes a bit annoying. This PR proposes to exclude these files (assuming we want to skip them per https://github.com/apache/spark/blob/master/.gitignore#L73).
      
       Also, it moves the other pep8 configurations from the script into an ini configuration file for pep8.
      
      ## How was this patch tested?
      
      Manually tested via `./dev/lint-python`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18161 from HyukjinKwon/work-exclude-pep8.
      0e31e28d
  23. May 10, 2017
    • Xianyang Liu's avatar
      [MINOR][BUILD] Fix lint-java breaks. · fcb88f92
      Xianyang Liu authored
      ## What changes were proposed in this pull request?
      
       This PR proposes to fix the lint breaks shown below:
      ```
      [ERROR] src/main/java/org/apache/spark/unsafe/Platform.java:[51] (regexp) RegexpSingleline: No trailing whitespace allowed.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[45,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[62,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[78,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[92,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[102,25] (naming) MethodName: Method name 'Once' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisInputDStreamBuilderSuite.java:[28,8] (imports) UnusedImports: Unused import - org.apache.spark.streaming.api.java.JavaDStream.
      ```
      
      after:
      ```
      dev/lint-java
      Checkstyle checks passed.
      ```
      [Test Result](https://travis-ci.org/ConeyLiu/spark/jobs/229666169)
      
      ## How was this patch tested?
      
      Travis CI
      
      Author: Xianyang Liu <xianyang.liu@intel.com>
      
      Closes #17890 from ConeyLiu/codestyle.
      fcb88f92
  24. May 09, 2017
    • Holden Karau's avatar
       [SPARK-20627][PYSPARK] Drop the hadoop distribution name from the Python version · 1b85bcd9
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
       Drop the Hadoop distribution name from the Python version (PEP 440 - https://www.python.org/dev/peps/pep-0440/). We've been using the local version string to disambiguate between different Hadoop versions packaged with PySpark, but PEP 440 states that local versions should not be used when publishing upstream. Since we no longer make PySpark pip packages for different Hadoop versions, we can simply drop the Hadoop information. If at a later point we need to start publishing different Hadoop versions, we can look at making different packages or similar.
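
       For illustration (the `packaging` library usage and the version string are assumptions, not part of the patch), the PEP 440 "local version" segment is everything after the `+`:

       ```python
       from packaging.version import Version

       v = Version("2.2.0+hadoop2.7")   # illustrative PySpark-style version string
       print(v.public)   # '2.2.0'      -> the portion suitable for publishing upstream
       print(v.local)    # 'hadoop2.7'  -> the local segment this change drops
       ```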
      
      ## How was this patch tested?
      
      Ran `make-distribution` locally
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #17885 from holdenk/SPARK-20627-remove-pip-local-version-string.
      1b85bcd9
  25. May 03, 2017
    • Sean Owen's avatar
      [SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release · 16fab6b0
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
       Fix build warnings, primarily related to Breeze 0.13 operator changes and Java style problems.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17803 from srowen/SPARK-20523.
      16fab6b0
  26. Apr 25, 2017
    • Yanbo Liang's avatar
      [SPARK-20449][ML] Upgrade breeze version to 0.13.1 · 67eef47a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
       Upgrade the breeze version to 0.13.1, which fixes some critical bugs in L-BFGS-B.
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17746 from yanboliang/spark-20449.
      67eef47a
  27. Apr 19, 2017
  28. Apr 12, 2017
    • hyukjinkwon's avatar
      [SPARK-18692][BUILD][DOCS] Test Java 8 unidoc build on Jenkins · ceaf77ae
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
       This PR proposes to run Spark unidoc to test the Javadoc 8 build, as Javadoc 8 is easily re-broken.
      
      There are several problems with it:
      
       - It adds a little extra time to the test run. In my case, it took about 1.5 minutes more (`Elapsed :[94.8746569157]`). How it was tested is described in "How was this patch tested?".
      
      - > One problem that I noticed was that Unidoc appeared to be processing test sources: if we can find a way to exclude those from being processed in the first place then that might significantly speed things up.
      
        (see  joshrosen's [comment](https://issues.apache.org/jira/browse/SPARK-18692?focusedCommentId=15947627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15947627))
      
       To complete this automated build, it also fixes existing Javadoc breaks, including ones introduced by test code, as described above.
      
       These fixes are similar to instances that were previously fixed. Please refer to https://github.com/apache/spark/pull/15999 and https://github.com/apache/spark/pull/16013
      
       Note that this only fixes **errors**, not **warnings**. Please see my observation in https://github.com/apache/spark/pull/17389#issuecomment-288438704 about spurious errors caused by warnings.
      
      ## How was this patch tested?
      
      Manually via `jekyll build` for building tests. Also, tested via running `./dev/run-tests`.
      
      This was tested via manually adding `time.time()` as below:
      
      ```diff
           profiles_and_goals = build_profiles + sbt_goals
      
           print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: ",
                 " ".join(profiles_and_goals))
      
      +    import time
      +    st = time.time()
           exec_sbt(profiles_and_goals)
      +    print("Elapsed :[%s]" % str(time.time() - st))
      ```
      
      produces
      
      ```
      ...
      ========================================================================
      Building Unidoc API Documentation
      ========================================================================
      ...
      [info] Main Java API documentation successful.
      ...
      Elapsed :[94.8746569157]
       ...
       ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17477 from HyukjinKwon/SPARK-18692.
      ceaf77ae
  29. Apr 11, 2017
    • David Gingrich's avatar
      [SPARK-19505][PYTHON] AttributeError on Exception.message in Python3 · 6297697f
      David Gingrich authored
      ## What changes were proposed in this pull request?
      
      Added `util._message_exception` helper to use `str(e)` when `e.message` is unavailable (Python3).  Grepped for all occurrences of `.message` in `pyspark/` and these were the only occurrences.
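
       A minimal sketch of what such a helper might look like (the actual `util._message_exception` implementation may differ):

       ```python
       def _message_exception(excp):
           # Python 2 exceptions expose .message; Python 3 removed it, so fall back to str().
           if hasattr(excp, "message"):
               return excp.message
           return str(excp)

       try:
           raise ValueError("boom")
       except ValueError as e:
           print(_message_exception(e))   # prints 'boom' on both Python 2 and Python 3
       ```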
      
      ## How was this patch tested?
      
      - Doctests for helper function
      
      ## Legal
      
      This is my original work and I license the work to the project under the project’s open source license.
      
      Author: David Gingrich <david@textio.com>
      
      Closes #16845 from dgingrich/topic-spark-19505-py3-exceptions.
      6297697f
  30. Apr 02, 2017
  31. Mar 29, 2017
    • Holden Karau's avatar
      [SPARK-19955][PYSPARK] Jenkins Python Conda based test. · d6ddfdf6
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Allow Jenkins Python tests to use the installed conda to test Python 2.7 support & test pip installability.
      
      ## How was this patch tested?
      
      Updated shell scripts, ran tests locally with installed conda, ran tests in Jenkins.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #17355 from holdenk/SPARK-19955-support-python-tests-with-conda.
      d6ddfdf6
  32. Mar 28, 2017
  33. Mar 27, 2017
    • Josh Rosen's avatar
      [SPARK-20102] Fix nightly packaging and RC packaging scripts w/ two minor build fixes · 314cf51d
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      The master snapshot publisher builds are currently broken due to two minor build issues:
      
      1. For unknown reasons, the LFTP `mkdir -p` command began throwing errors when the remote directory already exists. This change of behavior might have been caused by configuration changes in the ASF's SFTP server, but I'm not entirely sure of that. To work around this problem, this patch updates the script to ignore errors from the `lftp mkdir -p` commands.
      2. The PySpark `setup.py` file references a non-existent `pyspark.ml.stat` module, causing Python packaging to fail by complaining about a missing directory. The fix is to simply drop that line from the setup script.
      
      ## How was this patch tested?
      
      The LFTP fix was tested by manually running the failing commands on AMPLab Jenkins against the ASF SFTP server. The PySpark fix was tested locally.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #17437 from JoshRosen/spark-20102.
      314cf51d
  34. Mar 26, 2017
    • zero323's avatar
      [SPARK-19281][PYTHON][ML] spark.ml Python API for FPGrowth · 0bc8847a
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Add `HasSupport` and `HasConfidence` `Params`.
      - Add new module `pyspark.ml.fpm`.
       - Add `FPGrowth` / `FPGrowthModel` wrappers (usage sketched after this list).
      - Provide tests for new features.
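
       A usage sketch of the new Python API (the data, session setup and parameter values are illustrative):

       ```python
       from pyspark.ml.fpm import FPGrowth
       from pyspark.sql import SparkSession

       spark = SparkSession.builder.appName("fpgrowth-demo").getOrCreate()

       # Each row is a transaction: an array of items.
       data = spark.createDataFrame(
           [(["a", "b", "c"],), (["a", "b"],), (["a", "c"],)], ["items"])

       fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
       model = fp.fit(data)

       model.freqItemsets.show()        # frequent itemsets and their frequencies
       model.associationRules.show()    # rules with antecedent, consequent and confidence
       model.transform(data).show()     # adds a prediction column per transaction
       ```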
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17218 from zero323/SPARK-19281.
      0bc8847a
  35. Feb 18, 2017