  1. Jun 22, 2017
      [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas · e4469760
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Integrate Apache Arrow with Spark to increase the performance of `DataFrame.toPandas`.  This is done by using Arrow to convert data partitions on the executor JVM into Arrow payload byte arrays, which are then served to the Python process.  The Python DataFrame can then collect the Arrow payloads, combine them, and convert them to a Pandas DataFrame.  All non-complex data types are currently supported; otherwise an `UnsupportedOperation` exception is thrown.
      
      Additions to Spark include a Scala package-private method `Dataset.toArrowPayloadBytes` that converts data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served, and a package-private class/object `ArrowConverters` that provides data type mappings and conversion routines.  In Python, a public method `DataFrame.collectAsArrow` is added to collect Arrow payloads, along with an optional flag in `toPandas(useArrow=False)` to enable Arrow (the old conversion is used by default).
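
      A minimal usage sketch of the API as described above; the `useArrow` flag follows this PR's description (it may differ in later releases), and pyarrow is assumed to be installed:

      ```python
      # Hedged sketch: the useArrow flag is as described in this PR and may
      # have changed in later Spark versions. Requires pyarrow.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("arrow-topandas").getOrCreate()
      df = spark.range(1 << 20).selectExpr("id", "id * 2.0 AS value")

      pdf_default = df.toPandas()             # original row-by-row conversion
      pdf_arrow = df.toPandas(useArrow=True)  # Arrow-based fast path
      assert pdf_default.equals(pdf_arrow)
      ```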
      
      ## How was this patch tested?
      Added a new test suite, `ArrowConvertersSuite`, that runs tests on conversion of Datasets to Arrow payloads for supported types.  The suite generates a Dataset and matching Arrow JSON data; the Dataset is then converted to an Arrow payload and validated against the JSON data.  This ensures that the schema and data have been converted correctly.
      
      Added PySpark tests to verify that the `toPandas` method produces equal DataFrames with and without pyarrow, plus a roundtrip test to ensure the pandas DataFrame produced by PySpark is equal to one made directly with pandas.
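
      An illustrative version of that roundtrip check (not the actual test code; the data here is made up):

      ```python
      # Illustrative roundtrip: a pandas DataFrame sent through Spark and
      # back via toPandas should equal one built directly with pandas.
      import pandas as pd
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      expected = pd.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]})

      result = spark.createDataFrame(expected).toPandas()
      result = result.sort_values("a").reset_index(drop=True)  # stable order
      pd.testing.assert_frame_equal(expected, result)
      ```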
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      Author: Li Jin <ice.xelloss@gmail.com>
      Author: Li Jin <li.jin@twosigma.com>
      Author: Wes McKinney <wes.mckinney@twosigma.com>
      
      Closes #15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.
  2. Jun 19, 2017
  3. Jun 15, 2017
      [SPARK-20434][YARN][CORE] Move Hadoop delegation token code from yarn to core · a18d6371
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
      Move the Hadoop delegation token code from `spark-yarn` to `spark-core`, so that other schedulers (such as Mesos) may use it.  To avoid exposing Hadoop interfaces in spark-core, the new Hadoop delegation token classes are kept private.  To provide backward compatibility, and to allow YARN users to continue to load their own delegation token providers via Java service loading, the old YARN interfaces, as well as the client code that uses them, have been retained.
      
      Summary:
      - Moved the registered `yarn.security.ServiceCredentialProvider` classes from `spark-yarn` to `spark-core`, into a new, private hierarchy under `HadoopDelegationTokenProvider`.  Client code in `HadoopDelegationTokenManager` now loads credentials from a whitelist of three providers (`HadoopFSDelegationTokenProvider`, `HiveDelegationTokenProvider`, `HBaseDelegationTokenProvider`) instead of service loading, which means users cannot implement their own delegation token providers there, as they can in the `spark-yarn` module.
      
      - The `yarn.security.ServiceCredentialProvider` interface has been kept for backwards compatibility, and to continue to allow YARN users to implement their own delegation token providers.  Client code in YARN now fetches tokens via the new `YARNHadoopDelegationTokenManager` class, which fetches tokens from the core providers through `HadoopDelegationTokenManager` and also service-loads them from `yarn.security.ServiceCredentialProvider` implementations.
      
      Old Hierarchy:
      
      ```
      yarn.security.ServiceCredentialProvider (service loaded)
        HadoopFSCredentialProvider
        HiveCredentialProvider
        HBaseCredentialProvider
      yarn.security.ConfigurableCredentialManager
      ```
      
      New Hierarchy:
      
      ```
      HadoopDelegationTokenManager
      HadoopDelegationTokenProvider (not service loaded)
        HadoopFSDelegationTokenProvider
        HiveDelegationTokenProvider
        HBaseDelegationTokenProvider
      
      yarn.security.ServiceCredentialProvider (service loaded)
      yarn.security.YARNHadoopDelegationTokenManager
      ```
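
      A hedged, Python-rendered sketch of the fixed-provider pattern described above (the real classes are private Scala; every name below is illustrative):

      ```python
      # Illustrative only: the manager holds a fixed whitelist of providers
      # instead of discovering them via service loading, so users cannot
      # plug in their own providers here.
      class HadoopFSTokenProvider:
          def obtain_tokens(self, creds): creds.append("hdfs-token")

      class HiveTokenProvider:
          def obtain_tokens(self, creds): creds.append("hive-token")

      class HBaseTokenProvider:
          def obtain_tokens(self, creds): creds.append("hbase-token")

      class DelegationTokenManager:
          def __init__(self):
              # Fixed whitelist of three providers, mirroring the text above.
              self._providers = [HadoopFSTokenProvider(),
                                 HiveTokenProvider(),
                                 HBaseTokenProvider()]

          def obtain_all_tokens(self):
              creds = []
              for provider in self._providers:
                  provider.obtain_tokens(creds)
              return creds

      print(DelegationTokenManager().obtain_all_tokens())
      ```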
      ## How was this patch tested?
      
      Unit tests.
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      Author: Dr. Stefan Schimanski <sttts@mesosphere.io>
      
      Closes #17723 from mgummelt/SPARK-20434-refactor-kerberos.
  4. Jun 11, 2017
  5. Jun 02, 2017
      [SPARK-20974][BUILD] we should run REPL tests if SQL module has code changes · 864d94fe
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      The REPL module depends on the SQL module, so we should run REPL tests if the SQL module has code changes.
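
      A hedged, self-contained sketch of the idea (Spark's real `dev/sparktestsupport/modules.py` differs in detail): modules declare dependencies, and a change in one module triggers tests for everything that depends on it.

      ```python
      # Illustrative module-dependency model, not the actual patch.
      from collections import namedtuple

      Module = namedtuple("Module", ["name", "dependencies"])

      sql = Module("sql", [])
      repl = Module("repl", [sql])  # the gist of this change

      def modules_to_test(changed, all_modules):
          """Every module that transitively depends on a changed one."""
          affected = set(changed)
          grew = True
          while grew:
              grew = False
              for m in all_modules:
                  if m.name not in affected and any(
                          d.name in affected for d in m.dependencies):
                      affected.add(m.name)
                      grew = True
          return affected

      print(modules_to_test({"sql"}, [sql, repl]))  # -> {'sql', 'repl'}
      ```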
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18191 from cloud-fan/test.
      [MINOR][PYTHON] Ignore pep8 on test scripts generated in tests in work directory · 0e31e28d
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, if we run `./python/run-tests.py` and the tests are aborted without cleaning up the `work` directory, the pep8 check fails because of the Python scripts generated there. For example, https://github.com/apache/spark/blob/7387126f83dc0489eb1df734bfeba705709b7861/python/pyspark/tests.py#L1955-L1968
      
      ```
      PEP8 checks failed.
      ./work/app-20170531190857-0000/0/test.py:5:55: W292 no newline at end of file
      ./work/app-20170531190909-0000/0/test.py:5:55: W292 no newline at end of file
      ./work/app-20170531190924-0000/0/test.py:3:1: E302 expected 2 blank lines, found 1
      ./work/app-20170531190924-0000/0/test.py:7:52: W292 no newline at end of file
      ./work/app-20170531191016-0000/0/test.py:5:55: W292 no newline at end of file
      ./work/app-20170531191030-0000/0/test.py:5:55: W292 no newline at end of file
      ./work/app-20170531191045-0000/0/test.py:3:1: E302 expected 2 blank lines, found 1
      ./work/app-20170531191045-0000/0/test.py:7:52: W292 no newline at end of file
      ```
      
      For me, it is sometimes a bit annoying. This PR proposes to exclude these files (assuming we want to skip them, per https://github.com/apache/spark/blob/master/.gitignore#L73).
      
      Also, it moves the other pep8 configuration out of the script and into an ini configuration file for pep8.
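
      A hedged sketch of the effect of the exclusion, expressed as a direct invocation (the PR itself configures this via the ini file; the path argument is illustrative):

      ```python
      # Illustrative only: run the pep8 checker while skipping the generated
      # scripts under ./work, mirroring the exclusion this PR adds.
      import subprocess

      result = subprocess.run(["pep8", "--exclude=work", "python/pyspark"],
                              capture_output=True, text=True)
      print(result.stdout or "PEP8 checks passed.")
      ```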
      
      ## How was this patch tested?
      
      Manually tested via `./dev/lint-python`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18161 from HyukjinKwon/work-exclude-pep8.
  6. May 10, 2017
      [MINOR][BUILD] Fix lint-java breaks. · fcb88f92
      Xianyang Liu authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix the lint-java breaks shown below:
      ```
      [ERROR] src/main/java/org/apache/spark/unsafe/Platform.java:[51] (regexp) RegexpSingleline: No trailing whitespace allowed.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[45,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[62,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[78,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[92,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[102,25] (naming) MethodName: Method name 'Once' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
      [ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisInputDStreamBuilderSuite.java:[28,8] (imports) UnusedImports: Unused import - org.apache.spark.streaming.api.java.JavaDStream.
      ```
      
      after:
      ```
      dev/lint-java
      Checkstyle checks passed.
      ```
      [Test Result](https://travis-ci.org/ConeyLiu/spark/jobs/229666169)
      
      ## How was this patch tested?
      
      Travis CI
      
      Author: Xianyang Liu <xianyang.liu@intel.com>
      
      Closes #17890 from ConeyLiu/codestyle.
  7. May 09, 2017
      [SPARK-20627][PYSPARK] Drop the hadoop distribution name from the Python version · 1b85bcd9
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Drop the hadoop distribution name from the Python version, per PEP 440 (https://www.python.org/dev/peps/pep-0440/). We've been using the local version string to disambiguate between different hadoop versions packaged with PySpark, but PEP 440 states that local versions should not be used when publishing upstream. Since we no longer make PySpark pip packages for different hadoop versions, we can simply drop the hadoop information. If at a later point we need to start publishing different hadoop versions, we can look at making different packages or similar.
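
      A small illustration of the PEP 440 local version segment being dropped (uses the third-party `packaging` library; the version strings are illustrative):

      ```python
      # Illustrative: the PEP 440 "local" segment is the part after "+".
      # It is rejected when publishing upstream, hence this change.
      from packaging.version import Version

      old_style = Version("2.2.0+hadoop2.7")  # local version label
      new_style = Version("2.2.0")            # publishable upstream
      print(old_style.local)  # -> 'hadoop2.7'
      print(new_style.local)  # -> None
      ```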
      
      ## How was this patch tested?
      
      Ran `make-distribution` locally
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #17885 from holdenk/SPARK-20627-remove-pip-local-version-string.
  8. May 03, 2017
      [SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release · 16fab6b0
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix build warnings primarily related to Breeze 0.13 operator changes, Java style problems
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17803 from srowen/SPARK-20523.
  9. Apr 25, 2017
      [SPARK-20449][ML] Upgrade breeze version to 0.13.1 · 67eef47a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Upgrade breeze version to 0.13.1, which fixes some critical bugs in L-BFGS-B.
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17746 from yanboliang/spark-20449.
  10. Apr 19, 2017
  11. Apr 12, 2017
      [SPARK-18692][BUILD][DOCS] Test Java 8 unidoc build on Jenkins · ceaf77ae
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to run Spark unidoc as part of the tests to exercise the Javadoc 8 build, as Javadoc 8 is easily re-broken.
      
      There are several problems with it:
      
      - It introduces a little extra time to run the tests. In my case, it took about 1.5 minutes more (`Elapsed :[94.8746569157]`). How this was measured is described in "How was this patch tested?".
      
      - > One problem that I noticed was that Unidoc appeared to be processing test sources: if we can find a way to exclude those from being processed in the first place then that might significantly speed things up.
      
        (see  joshrosen's [comment](https://issues.apache.org/jira/browse/SPARK-18692?focusedCommentId=15947627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15947627))
      
      To keep this automated build passing, this PR also fixes the existing Javadoc breaks, including ones introduced by test code, as described above.
      
      These fixes are similar to instances previously fixed. Please refer to https://github.com/apache/spark/pull/15999 and https://github.com/apache/spark/pull/16013.
      
      Note that this only fixes **errors**, not **warnings**. Please see my observation in https://github.com/apache/spark/pull/17389#issuecomment-288438704 for spurious errors caused by warnings.
      
      ## How was this patch tested?
      
      Manually via `jekyll build`, and also via running `./dev/run-tests`.
      
      This was tested via manually adding `time.time()` as below:
      
      ```diff
           profiles_and_goals = build_profiles + sbt_goals
      
           print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: ",
                 " ".join(profiles_and_goals))
      
      +    import time
      +    st = time.time()
           exec_sbt(profiles_and_goals)
      +    print("Elapsed :[%s]" % str(time.time() - st))
      ```
      
      produces
      
      ```
      ...
      ========================================================================
      Building Unidoc API Documentation
      ========================================================================
      ...
      [info] Main Java API documentation successful.
      ...
      Elapsed :[94.8746569157]
      ...
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17477 from HyukjinKwon/SPARK-18692.
  12. Apr 11, 2017
      [SPARK-19505][PYTHON] AttributeError on Exception.message in Python3 · 6297697f
      David Gingrich authored
      ## What changes were proposed in this pull request?
      
      Added a `util._message_exception` helper to use `str(e)` when `e.message` is unavailable (Python 3).  Grepped for all occurrences of `.message` in `pyspark/`, and these were the only ones.
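
      A minimal sketch of such a helper (hedged: the actual helper in `pyspark` may differ in detail):

      ```python
      # Hedged sketch: Python 2 exceptions expose .message, Python 3 ones
      # do not, so fall back to str(e).
      def _message_exception(e):
          if hasattr(e, "message") and e.message:
              return e.message
          return str(e)

      try:
          raise ValueError("boom")
      except ValueError as e:
          print(_message_exception(e))  # -> 'boom' on Python 2 and 3
      ```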
      
      ## How was this patch tested?
      
      - Doctests for helper function
      
      ## Legal
      
      This is my original work and I license the work to the project under the project’s open source license.
      
      Author: David Gingrich <david@textio.com>
      
      Closes #16845 from dgingrich/topic-spark-19505-py3-exceptions.
  13. Apr 02, 2017
  14. Mar 29, 2017
      [SPARK-19955][PYSPARK] Jenkins Python Conda based test. · d6ddfdf6
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Allow Jenkins Python tests to use the installed conda to test Python 2.7 support & test pip installability.
      
      ## How was this patch tested?
      
      Updated shell scripts, ran tests locally with installed conda, ran tests in Jenkins.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #17355 from holdenk/SPARK-19955-support-python-tests-with-conda.
  15. Mar 28, 2017
  16. Mar 27, 2017
      [SPARK-20102] Fix nightly packaging and RC packaging scripts w/ two minor build fixes · 314cf51d
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      The master snapshot publisher builds are currently broken due to two minor build issues:
      
      1. For unknown reasons, the LFTP `mkdir -p` command began throwing errors when the remote directory already exists. This change of behavior might have been caused by configuration changes in the ASF's SFTP server, but I'm not entirely sure of that. To work around this problem, this patch updates the script to ignore errors from the `lftp mkdir -p` commands.
      2. The PySpark `setup.py` file references a non-existent `pyspark.ml.stat` module, causing Python packaging to fail by complaining about a missing directory. The fix is to simply drop that line from the setup script.
      
      ## How was this patch tested?
      
      The LFTP fix was tested by manually running the failing commands on AMPLab Jenkins against the ASF SFTP server. The PySpark fix was tested locally.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #17437 from JoshRosen/spark-20102.
  17. Mar 26, 2017
      [SPARK-19281][PYTHON][ML] spark.ml Python API for FPGrowth · 0bc8847a
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Add `HasSupport` and `HasConfidence` `Params`.
      - Add new module `pyspark.ml.fpm`.
      - Add `FPGrowth` / `FPGrowthModel` wrappers (usage sketch after this list).
      - Provide tests for new features.
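
      A hedged usage sketch of the new module (the data is illustrative; the parameter names follow the documented FPGrowth API):

      ```python
      # Usage sketch for the wrappers described above.
      from pyspark.sql import SparkSession
      from pyspark.ml.fpm import FPGrowth

      spark = SparkSession.builder.getOrCreate()
      data = spark.createDataFrame(
          [(0, [1, 2, 5]), (1, [1, 2, 3, 5]), (2, [1, 2])],
          ["id", "items"])

      fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
      model = fp.fit(data)
      model.freqItemsets.show()       # frequent itemsets and their counts
      model.associationRules.show()   # rules meeting minConfidence
      ```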
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17218 from zero323/SPARK-19281.
  18. Feb 18, 2017
  19. Feb 17, 2017
      [SPARK-19517][SS] KafkaSource fails to initialize partition offsets · 1a3f5f8c
      Roberto Agostino Vitillo authored
      ## What changes were proposed in this pull request?
      
      This patch fixes a bug in `KafkaSource` with the (de)serialization of the length of the JSON string that contains the initial partition offsets.
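
      A hedged illustration of the general pitfall behind such a fix (Python for illustration; the real code is Scala and differs in detail): a length prefix must count encoded bytes, not characters.

      ```python
      # Illustrative only: length-prefix the encoded bytes of the JSON
      # payload, not the character count, or the reader consumes the
      # wrong span of the file.
      import json
      import struct

      def serialize(offsets):
          payload = json.dumps(offsets).encode("utf-8")
          return struct.pack(">i", len(payload)) + payload  # bytes, not chars

      def deserialize(buf):
          (n,) = struct.unpack(">i", buf[:4])
          return json.loads(buf[4:4 + n].decode("utf-8"))

      offsets = {"topic-0": 42, "topic-1": 7}
      assert deserialize(serialize(offsets)) == offsets
      ```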
      
      ## How was this patch tested?
      
      I ran the test suite for spark-sql-kafka-0-10.
      
      Author: Roberto Agostino Vitillo <ra.vitillo@gmail.com>
      
      Closes #16857 from vitillo/kafka_source_fix.
  20. Feb 16, 2017
      [SPARK-19550][HOTFIX][BUILD] Use JAVA_HOME/bin/java if JAVA_HOME is set in dev/mima · dcc2d540
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Use `JAVA_HOME/bin/java`, if `JAVA_HOME` is set, in the dev/mima script to run MiMa.
      This follows on https://github.com/apache/spark/pull/16871 -- it's a slightly separate issue, but it is currently causing a build failure.
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16957 from srowen/SPARK-19550.2.
      [SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support · 0e240549
      Sean Owen authored
      - Move external/java8-tests tests into core, streaming, and sql, and remove the java8-tests module
      - Remove MaxPermGen and related options
      - Fix some reflection / TODOs around Java 8+ methods
      - Update doc references to 1.7/1.8 differences
      - Remove Java 7/8 related build profiles
      - Update some plugins for better Java 8 compatibility
      - Fix a few Java-related warnings
      
      For the future:
      
      - Update Java 8 examples to fully use Java 8
      - Update Java tests to use lambdas for simplicity
      - Update Java internal implementations to use lambdas
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16871 from srowen/SPARK-19493.
  21. Feb 14, 2017
  22. Feb 08, 2017
      [SPARK-19464][BUILD][HOTFIX] run-tests should use hadoop2.6 · c618ccdb
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      After SPARK-19464, **SparkPullRequestBuilder** fails because it still tries to use hadoop2.3.
      
      **BEFORE**
      https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console
      ```
      ========================================================================
      Building Spark
      ========================================================================
      [error] Could not find hadoop2.3 in the list. Valid options  are ['hadoop2.6', 'hadoop2.7']
      Attempting to post to Github...
       > Post successful.
      ```
      
      **AFTER**
      https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console
      ```
      ========================================================================
      Building Spark
      ========================================================================
      [info] Building Spark (w/Hive 1.2.1) using SBT with these arguments:  -Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive test:package streaming-kafka-0-8-assembly/assembly streaming-flume-assembly/assembly streaming-kinesis-asl-assembly/assembly
      Using /usr/java/jdk1.8.0_60 as default JAVA_HOME.
      Note, this will be overridden by -java-home if it is set.
      ```
      
      ## How was this patch tested?
      
      Pass the existing test.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16858 from dongjoon-hyun/hotfix_run-tests.
      [SPARK-19464][CORE][YARN][TEST-HADOOP2.6] Remove support for Hadoop 2.5 and earlier · e8d3fca4
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove support for Hadoop 2.5 and earlier
      - Remove reflection and code constructs only needed to support multiple versions at once
      - Update docs to reflect newer versions
      - Remove older versions' builds and profiles.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16810 from srowen/SPARK-19464.
  23. Jan 31, 2017
  24. Jan 25, 2017
      [SPARK-19064][PYSPARK] Fix pip installing of sub components · 965c82d8
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Fix installation of the mllib and ml sub-components, and more eagerly clean up cache files during the test script and make-distribution.
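
      A hedged sketch of why sub-packages break when not listed (the package names are real PySpark sub-packages, but this snippet is illustrative, not the actual `setup.py` diff):

      ```python
      # Illustrative: setuptools installs only the packages named here, so a
      # missing entry is what breaks "import pyspark.mllib" after pip install.
      from setuptools import setup

      setup(
          name="pyspark",
          version="0.0.0",  # illustrative
          packages=[
              "pyspark",
              "pyspark.ml",
              "pyspark.mllib",
              "pyspark.mllib.linalg",
              "pyspark.mllib.stat",
          ],
      )
      ```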
      
      ## How was this patch tested?
      
      Updated sanity test script to import mllib and ml sub-components.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #16465 from holdenk/SPARK-19064-fix-pip-install-sub-components.
  25. Jan 19, 2017
      [SPARK-16654][CORE] Add UI coverage for Application Level Blacklisting · 640f9423
      José Hiram Soltren authored
      Builds on top of work in SPARK-8425 to update Application Level Blacklisting in the scheduler.
      
      ## What changes were proposed in this pull request?
      
      Adds a UI to these patches by:
      - defining new listener events for blacklisting and unblacklisting, nodes and executors;
      - sending said events at the relevant points in BlacklistTracker;
      - adding JSON (de)serialization code for these events;
      - augmenting the Executors UI page to show which, and how many, executors are blacklisted;
      - adding a unit test to make sure events are being fired;
      - adding HistoryServerSuite coverage to verify that the SHS reads these events correctly;
      - updating the Executor UI to show Blacklisted/Active/Dead as a tri-state in Executors Status.
      
      Updates .rat-excludes to pass tests.
      
      
      ## How was this patch tested?
      
      ./dev/run-tests
      testOnly org.apache.spark.util.JsonProtocolSuite
      testOnly org.apache.spark.scheduler.BlacklistTrackerSuite
      testOnly org.apache.spark.deploy.history.HistoryServerSuite
      https://github.com/jsoltren/jose-utils/blob/master/blacklist/test-blacklist.sh
      ![blacklist-20161219](https://cloud.githubusercontent.com/assets/1208477/21335321/9eda320a-c623-11e6-8b8c-9c912a73c276.jpg)
      
      Author: José Hiram Soltren <jose@cloudera.com>
      
      Closes #16346 from jsoltren/SPARK-16654-submit.
  26. Jan 18, 2017
  27. Jan 16, 2017
      [SPARK-18828][SPARKR] Refactor scripts for R · c84f7d3e
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Refactored the scripts to remove duplication and give each script a clearer purpose.
      
      ## How was this patch tested?
      
      manually
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16249 from felixcheung/rscripts.
  28. Jan 15, 2017
  29. Jan 14, 2017
      [SPARK-19221][PROJECT INFRA][R] Add winutils binaries to the path in AppVeyor... · b6a7aa4f
      hyukjinkwon authored
      [SPARK-19221][PROJECT INFRA][R] Add winutils binaries to the path in AppVeyor tests for Hadoop libraries to call native codes properly
      
      ## What changes were proposed in this pull request?
      
      It seems the Hadoop libraries need the winutils binaries on the path in order to call native code.
      
      It is not a problem in tests for now, because we are only testing SparkR on Windows via AppVeyor, but it can become a problem if we run Scala tests via AppVeyor, as below:
      
      ```
       - SPARK-18220: read Hive orc table with varchar column *** FAILED *** (3 seconds, 937 milliseconds)
         org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
         at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
         at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:609)
         at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
         ...
      ```
      
      This PR proposes to add it to the `Path` for AppVeyor tests.
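
      A hedged sketch of the idea (the actual change edits the AppVeyor configuration, not Python; the fallback path is illustrative):

      ```python
      # Illustrative only: ensure %HADOOP_HOME%\bin, where winutils.exe
      # lives, is on PATH so Hadoop can load its native Windows bindings.
      import os

      hadoop_bin = os.path.join(os.environ.get("HADOOP_HOME", r"C:\hadoop"), "bin")
      os.environ["PATH"] = hadoop_bin + os.pathsep + os.environ.get("PATH", "")
      ```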
      
      ## How was this patch tested?
      
      Manually via AppVeyor.
      
      **Before**
      https://ci.appveyor.com/project/spark-test/spark/build/549-windows-complete/job/gc8a1pjua2bc4i8m
      
      **After**
      https://ci.appveyor.com/project/spark-test/spark/build/572-windows-complete/job/c4vrysr5uvj2hgu7
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16584 from HyukjinKwon/set-path-appveyor.
  30. Jan 10, 2017
  31. Jan 02, 2017
      [SPARK-19002][BUILD][PYTHON] Check pep8 against all Python scripts · 46b21260
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to check pep8 against all other Python scripts and fix the errors as below:
      
      ```bash
      ./dev/create-release/generate-contributors.py
      ./dev/create-release/releaseutils.py
      ./dev/create-release/translate-contributors.py
      ./dev/lint-python
      ./python/docs/epytext.py
      ./examples/src/main/python/mllib/decision_tree_classification_example.py
      ./examples/src/main/python/mllib/decision_tree_regression_example.py
      ./examples/src/main/python/mllib/gradient_boosting_classification_example.py
      ./examples/src/main/python/mllib/gradient_boosting_regression_example.py
      ./examples/src/main/python/mllib/linear_regression_with_sgd_example.py
      ./examples/src/main/python/mllib/logistic_regression_with_lbfgs_example.py
      ./examples/src/main/python/mllib/naive_bayes_example.py
      ./examples/src/main/python/mllib/random_forest_classification_example.py
      ./examples/src/main/python/mllib/random_forest_regression_example.py
      ./examples/src/main/python/mllib/svm_with_sgd_example.py
      ./examples/src/main/python/streaming/network_wordjoinsentiments.py
      ./sql/hive/src/test/resources/data/scripts/cat.py
      ./sql/hive/src/test/resources/data/scripts/cat_error.py
      ./sql/hive/src/test/resources/data/scripts/doubleescapedtab.py
      ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py
      ./sql/hive/src/test/resources/data/scripts/escapedcarriagereturn.py
      ./sql/hive/src/test/resources/data/scripts/escapednewline.py
      ./sql/hive/src/test/resources/data/scripts/escapedtab.py
      ./sql/hive/src/test/resources/data/scripts/input20_script.py
      ./sql/hive/src/test/resources/data/scripts/newline.py
      ```
      
      ## How was this patch tested?
      
      - `./python/docs/epytext.py`
      
        ```bash
        cd ./python/docs && make html
        ```
      
      - pep8 check (Python 2.7 / Python 3.3.6)
      
        ```
        ./dev/lint-python
        ```
      
      - `./dev/merge_spark_pr.py` (Python 2.7 only / Python 3.3.6 not working)
      
        ```bash
        python -m doctest -v ./dev/merge_spark_pr.py
        ```
      
      - `./dev/create-release/releaseutils.py` `./dev/create-release/generate-contributors.py` `./dev/create-release/translate-contributors.py` (Python 2.7 only / Python 3.3.6 not working)
      
        ```bash
        python generate-contributors.py
        python translate-contributors.py
        ```
      
      - Examples (Python 2.7 / Python 3.3.6)
      
        ```bash
        ./bin/spark-submit examples/src/main/python/mllib/decision_tree_classification_example.py
        ./bin/spark-submit examples/src/main/python/mllib/decision_tree_regression_example.py
        ./bin/spark-submit examples/src/main/python/mllib/gradient_boosting_classification_example.py
        ./bin/spark-submit examples/src/main/python/mllib/gradient_boosting_regression_example.py
        ./bin/spark-submit examples/src/main/python/mllib/random_forest_classification_example.py
        ./bin/spark-submit examples/src/main/python/mllib/random_forest_regression_example.py
        ```
      
      - Examples (Python 2.7 only / Python 3.3.6 not working)
        ```
        ./bin/spark-submit examples/src/main/python/mllib/linear_regression_with_sgd_example.py
        ./bin/spark-submit examples/src/main/python/mllib/logistic_regression_with_lbfgs_example.py
        ./bin/spark-submit examples/src/main/python/mllib/naive_bayes_example.py
        ./bin/spark-submit examples/src/main/python/mllib/svm_with_sgd_example.py
        ```
      
      - `sql/hive/src/test/resources/data/scripts/*.py` (Python 2.7 / Python 3.3.6 within suggested changes)
      
        Manually tested only changed ones.
      
      - `./dev/github_jira_sync.py` (Python 2.7 only / Python 3.3.6 not working)
      
        Manually tested this after disabling actually adding comments and links.
      
      And also via Jenkins tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16405 from HyukjinKwon/minor-pep8.
  32. Dec 29, 2016
      Update known_translations for contributor names and also fix a small issue in... · 63036aee
      Yin Huai authored
      Update known_translations for contributor names and also fix a small issue in translate-contributors.py
      
      ## What changes were proposed in this pull request?
      This PR updates dev/create-release/known_translations to add more contributor name mapping. It also fixes a small issue in translate-contributors.py
      
      ## How was this patch tested?
      manually tested
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #16423 from yhuai/contributors.
  33. Dec 21, 2016
      [BUILD] make-distribution should find JAVA_HOME for non-RHEL systems · e1b43dc4
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      make-distribution.sh should find JAVA_HOME for Ubuntu, Mac and other non-RHEL systems
      
      ## How was this patch tested?
      
      Manually
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16363 from felixcheung/buildjava.
      [SPARK-18588][SS][KAFKA] Create a new KafkaConsumer when error happens to fix the flaky test · 95efc895
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      When KafkaSource fails on Kafka errors, we should create a new consumer to retry rather than using the existing broken one because it's possible that the broken one will fail again.
      
      This PR also assigns a new group id to the newly created consumer, to guard against a possible race condition: the broken consumer cannot talk to the Kafka cluster in `close`, but the new consumer can.  I'm not sure whether this will actually happen; it is just a safety measure to avoid the Kafka cluster thinking there are two consumers with the same group id in a short time window. (Note: CachedKafkaConsumer doesn't need this fix since `assign` never uses the group id.)
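
      A generic sketch of that retry pattern (Python for illustration; the actual change is in the Scala KafkaSource, and all names here are made up):

      ```python
      # Illustrative: on failure, close and discard the broken client and
      # build a fresh one under a brand-new group id, rather than reusing
      # the instance that just failed.
      import uuid

      def fetch_with_fresh_consumer(make_consumer, fetch, max_attempts=3):
          last_err = None
          for _ in range(max_attempts):
              consumer = make_consumer(group_id="spark-kafka-" + uuid.uuid4().hex)
              try:
                  return fetch(consumer)
              except Exception as e:  # broad catch for illustration only
                  last_err = e
                  consumer.close()
          raise last_err
      ```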
      
      ## How was this patch tested?
      
      In https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70370/console , it ran this flaky test 120 times and all passed.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16282 from zsxwing/kafka-fix.
      [SPARK-18951] Upgrade com.thoughtworks.paranamer/paranamer to 2.6 · 1a643889
      Yin Huai authored
      ## What changes were proposed in this pull request?
      I recently hit a bug in com.thoughtworks.paranamer/paranamer that causes jackson to fail to handle a byte array defined in a case class. I then found https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests it is caused by a bug in paranamer, so let's upgrade it. Since we are using jackson 2.6.5, and jackson-module-paranamer 2.6.5 uses com.thoughtworks.paranamer/paranamer 2.6, I suggest we upgrade paranamer to 2.6.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #16359 from yhuai/SPARK-18951.
  34. Dec 15, 2016