- Jul 18, 2017
-
-
Sean Owen authored
[SPARK-15526][ML][FOLLOWUP] Make JPMML provided scope to avoid including unshaded JARs, and repromote to compile in MLlib Following the comment at https://issues.apache.org/jira/browse/SPARK-15526?focusedCommentId=16086106&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16086106 -- this change actually needed a little more work to be complete. This also marks JPMML as `provided` to make sure its JARs aren't included in the `jars` output, but then scopes to `compile` in `mllib`. This is how Guava is handled. Checked result in `assembly/target/scala-2.11/jars` to verify there are no JPMML jars. Maven and SBT builds still work. Author: Sean Owen <sowen@cloudera.com> Closes #18637 from srowen/SPARK-15526.2.
-
- Jul 13, 2017
-
-
Sean Owen authored
## What changes were proposed in this pull request? - Remove Scala 2.10 build profiles and support - Replace some 2.10 support in scripts with commented placeholders for 2.12 later - Remove deprecated API calls from 2.10 support - Remove usages of deprecated context bounds where possible - Remove Scala 2.10 workarounds like ScalaReflectionLock - Other minor Scala warning fixes ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #17150 from srowen/SPARK-19810.
-
- Jul 10, 2017
-
-
Bryan Cutler authored
## What changes were proposed in this pull request? Integrate Apache Arrow with Spark to increase the performance of `DataFrame.toPandas`. This is done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays, which are then served to the Python process. The Python side then collects the Arrow payloads, combines them, and converts them to a Pandas DataFrame. All data types except complex, date, timestamp, and decimal types are currently supported; otherwise an `UnsupportedOperation` exception is thrown. Additions to Spark include a Scala package-private method `Dataset.toArrowPayload` that converts data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served, and a package-private class/object `ArrowConverters` that provides data type mappings and conversion routines. In Python, a private method `DataFrame._collectAsArrow` is added to collect Arrow payloads, and the SQLConf "spark.sql.execution.arrow.enable" can be used in `toPandas()` to enable Arrow (the old conversion is used by default). ## How was this patch tested? Added a new test suite `ArrowConvertersSuite` that runs tests on conversion of Datasets to Arrow payloads for supported types. The suite generates a Dataset and matching Arrow JSON data, then converts the dataset to an Arrow payload and finally validates it against the JSON data. This ensures that the schema and data have been converted correctly. Added PySpark tests to verify the `toPandas` method produces equal DataFrames with and without pyarrow, and a roundtrip test to ensure the pandas DataFrame produced by pyspark is equal to one made directly with pandas. Author: Bryan Cutler <cutlerb@gmail.com> Author: Li Jin <ice.xelloss@gmail.com> Author: Li Jin <li.jin@twosigma.com> Author: Wes McKinney <wes.mckinney@twosigma.com> Closes #18459 from BryanCutler/toPandas_with_arrow-SPARK-13534.
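A minimal sketch of the user-facing path described above, using only what this commit message states (the SQLConf key "spark.sql.execution.arrow.enable" and `toPandas()`); the session setup and data are illustrative, and the config key was renamed in later Spark releases:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-topandas-demo").getOrCreate()

# Opt in to the Arrow-based conversion; the old row-by-row path stays the default.
spark.conf.set("spark.sql.execution.arrow.enable", "true")

df = spark.range(1 << 20).selectExpr("id", "id * 2 AS doubled")

# With the flag set (and pyarrow installed), toPandas() collects Arrow payloads
# from the executors and assembles them into a single pandas DataFrame.
pdf = df.toPandas()
print(pdf.head())
```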
-
- Jul 05, 2017
-
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? This PR aims to bump Py4J in order to fix the following float/double bug. Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272) and the latest Py4J is 0.10.6. **BEFORE** ``` >>> df = spark.range(1) >>> df.select(df['id'] + 17.133574204226083).show() +--------------------+ |(id + 17.1335742042)| +--------------------+ | 17.1335742042| +--------------------+ ``` **AFTER** ``` >>> df = spark.range(1) >>> df.select(df['id'] + 17.133574204226083).show() +-------------------------+ |(id + 17.133574204226083)| +-------------------------+ | 17.133574204226083| +-------------------------+ ``` ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18546 from dongjoon-hyun/SPARK-21278.
-
- Jun 28, 2017
-
-
Wenchen Fan authored
This reverts commit e4469760.
-
- Jun 24, 2017
-
-
hyukjinkwon authored
[SPARK-21189][INFRA] Handle unknown error codes in Jenkins rather than leaving incomplete comments in PRs ## What changes were proposed in this pull request? Recently, Jenkins tests were unstable for unknown reasons, as below: ``` /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-r ; process was terminated by signal 9 test_result_code, test_result_note = run_tests(tests_timeout) File "./dev/run-tests-jenkins.py", line 140, in run_tests test_result_note = ' * This patch **fails %s**.' % failure_note_by_errcode[test_result_code] KeyError: -9 ``` ``` Traceback (most recent call last): File "./dev/run-tests-jenkins.py", line 226, in <module> main() File "./dev/run-tests-jenkins.py", line 213, in main test_result_code, test_result_note = run_tests(tests_timeout) File "./dev/run-tests-jenkins.py", line 140, in run_tests test_result_note = ' * This patch **fails %s**.' % failure_note_by_errcode[test_result_code] KeyError: -10 ``` This exception appears to prevent the comments in the PR from being updated; for example, such comments just remain incomplete. This always adds overhead for both reviewers and the author: clicking through and checking the logs, which I believe is not really useful. This PR proposes to include the unknown error code in the PR comment message and still let Jenkins update the comments. ## How was this patch tested? Jenkins tests below; I manually gave the error code to test this. Author: hyukjinkwon <gurwls223@gmail.com> Closes #18399 from HyukjinKwon/jenkins-print-errors.
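To make the fix concrete, here is a hypothetical sketch of the fallback described above; the real change to `dev/run-tests-jenkins.py` may differ. The `failure_note_by_errcode` name and the message format come from the traceback, while the helper and its default note are assumptions:

```python
# Known Jenkins result codes mapped to human-readable failure notes (elided here).
failure_note_by_errcode = {
    1: "some tests",
    # ...
}

def build_result_note(test_result_code):
    """Build a PR comment note that stays complete even for unknown error codes."""
    # Fall back to printing the raw code instead of raising KeyError, so the
    # Jenkins comment on the PR is still posted and updated.
    note = failure_note_by_errcode.get(
        test_result_code,
        "due to an unknown error code, %d" % test_result_code)
    return ' * This patch **fails %s**.' % note

print(build_result_note(-9))   # unknown code: note still rendered
print(build_result_note(1))    # known code: normal note
```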
-
- Jun 22, 2017
-
-
Bryan Cutler authored
## What changes were proposed in this pull request? Integrate Apache Arrow with Spark to increase the performance of `DataFrame.toPandas`. This is done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays, which are then served to the Python process. The Python side then collects the Arrow payloads, combines them, and converts them to a Pandas DataFrame. All non-complex data types are currently supported; otherwise an `UnsupportedOperation` exception is thrown. Additions to Spark include a Scala package-private method `Dataset.toArrowPayloadBytes` that converts data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served, and a package-private class/object `ArrowConverters` that provides data type mappings and conversion routines. In Python, a public method `DataFrame.collectAsArrow` is added to collect Arrow payloads, and an optional flag in `toPandas(useArrow=False)` enables Arrow (the old conversion is used by default). ## How was this patch tested? Added a new test suite `ArrowConvertersSuite` that runs tests on conversion of Datasets to Arrow payloads for supported types. The suite generates a Dataset and matching Arrow JSON data, then converts the dataset to an Arrow payload and finally validates it against the JSON data. This ensures that the schema and data have been converted correctly. Added PySpark tests to verify the `toPandas` method produces equal DataFrames with and without pyarrow, and a roundtrip test to ensure the pandas DataFrame produced by pyspark is equal to one made directly with pandas. Author: Bryan Cutler <cutlerb@gmail.com> Author: Li Jin <ice.xelloss@gmail.com> Author: Li Jin <li.jin@twosigma.com> Author: Wes McKinney <wes.mckinney@twosigma.com> Closes #15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.
-
- Jun 19, 2017
-
-
Xianyang Liu authored
## What changes were proposed in this pull request? Fix some typo of the document. ## How was this patch tested? Existing tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Xianyang Liu <xianyang.liu@intel.com> Closes #18350 from ConeyLiu/fixtypo.
-
- Jun 15, 2017
-
-
Michael Gummelt authored
## What changes were proposed in this pull request? Move Hadoop delegation token code from `spark-yarn` to `spark-core`, so that other schedulers (such as Mesos) may use it. In order to avoid exposing Hadoop interfaces in spark-core, the new Hadoop delegation token classes are kept private. In order to provide backward compatibility, and to allow YARN users to continue to load their own delegation token providers via Java service loading, the old YARN interfaces, as well as the client code that uses them, have been retained. Summary: - Moved registered `yarn.security.ServiceCredentialProvider` classes from `spark-yarn` to `spark-core`, into a new, private hierarchy under `HadoopDelegationTokenProvider`. Client code in `HadoopDelegationTokenManager` now loads credentials from a whitelist of three providers (`HadoopFSDelegationTokenProvider`, `HiveDelegationTokenProvider`, `HBaseDelegationTokenProvider`) instead of service loading, which means that users are not able to implement their own delegation token providers there, as they are in the `spark-yarn` module. - The `yarn.security.ServiceCredentialProvider` interface has been kept for backwards compatibility, and to continue to allow YARN users to implement their own delegation token providers. Client code in YARN now fetches tokens via the new `YARNHadoopDelegationTokenManager` class, which fetches tokens from the core providers through `HadoopDelegationTokenManager`, as well as service-loading them from `yarn.security.ServiceCredentialProvider`. Old Hierarchy: ``` yarn.security.ServiceCredentialProvider (service loaded) HadoopFSCredentialProvider HiveCredentialProvider HBaseCredentialProvider yarn.security.ConfigurableCredentialManager ``` New Hierarchy: ``` HadoopDelegationTokenManager HadoopDelegationTokenProvider (not service loaded) HadoopFSDelegationTokenProvider HiveDelegationTokenProvider HBaseDelegationTokenProvider yarn.security.ServiceCredentialProvider (service loaded) yarn.security.YARNHadoopDelegationTokenManager ``` ## How was this patch tested? unit tests Author: Michael Gummelt <mgummelt@mesosphere.io> Author: Dr. Stefan Schimanski <sttts@mesosphere.io> Closes #17723 from mgummelt/SPARK-20434-refactor-kerberos.
-
- Jun 11, 2017
-
-
Yuming Wang authored
## What changes were proposed in this pull request? Update the hadoop-2.7 profile's curator version to 2.7.1; for more, see [SPARK-13933](https://issues.apache.org/jira/browse/SPARK-13933). ## How was this patch tested? manual tests Author: Yuming Wang <wgyumg@gmail.com> Closes #18247 from wangyum/SPARK-13933.
-
- Jun 02, 2017
-
-
Wenchen Fan authored
## What changes were proposed in this pull request? REPL module depends on SQL module, so we should run REPL tests if SQL module has code changes. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #18191 from cloud-fan/test.
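As an illustration of how module-level test dependencies of this kind can be expressed, here is a self-contained sketch; the `Module` class, field names, and regexes below are simplified assumptions for illustration, not the actual contents of `dev/sparktestsupport/modules.py`:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Module:
    """Simplified stand-in for a test module definition."""
    name: str
    dependencies: List["Module"] = field(default_factory=list)
    source_file_regexes: List[str] = field(default_factory=list)
    sbt_test_goals: List[str] = field(default_factory=list)

core = Module(name="core", source_file_regexes=["core/"], sbt_test_goals=["core/test"])
sql = Module(name="sql", dependencies=[core], source_file_regexes=["sql/"], sbt_test_goals=["sql/test"])
# Listing `sql` as a dependency is what makes a SQL-only change select the REPL tests.
repl = Module(name="repl", dependencies=[core, sql], source_file_regexes=["repl/"], sbt_test_goals=["repl/test"])

def modules_to_test(changed: Module, all_modules: List[Module]) -> set:
    """Return the names of every module that (transitively) depends on the changed one."""
    selected = {changed.name}
    grew = True
    while grew:
        grew = False
        for m in all_modules:
            if m.name not in selected and any(d.name in selected for d in m.dependencies):
                selected.add(m.name)
                grew = True
    return selected

print(modules_to_test(sql, [core, sql, repl]))  # expected: {'sql', 'repl'}
```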
-
hyukjinkwon authored
## What changes were proposed in this pull request? Currently, if we run `./python/run-tests.py` and the tests are aborted without cleaning up this directory, the pep8 check fails due to some generated Python scripts. For example, https://github.com/apache/spark/blob/7387126f83dc0489eb1df734bfeba705709b7861/python/pyspark/tests.py#L1955-L1968 ``` PEP8 checks failed. ./work/app-20170531190857-0000/0/test.py:5:55: W292 no newline at end of file ./work/app-20170531190909-0000/0/test.py:5:55: W292 no newline at end of file ./work/app-20170531190924-0000/0/test.py:3:1: E302 expected 2 blank lines, found 1 ./work/app-20170531190924-0000/0/test.py:7:52: W292 no newline at end of file ./work/app-20170531191016-0000/0/test.py:5:55: W292 no newline at end of file ./work/app-20170531191030-0000/0/test.py:5:55: W292 no newline at end of file ./work/app-20170531191045-0000/0/test.py:3:1: E302 expected 2 blank lines, found 1 ./work/app-20170531191045-0000/0/test.py:7:52: W292 no newline at end of file ``` For me, it is sometimes a bit annoying. This PR proposes to exclude these files (assuming we want to skip them, per https://github.com/apache/spark/blob/master/.gitignore#L73). Also, it moves the other pep8 configurations from the script into pep8's ini configuration file. ## How was this patch tested? Manually tested via `./dev/lint-python`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #18161 from HyukjinKwon/work-exclude-pep8.
-
- May 10, 2017
-
-
Xianyang Liu authored
## What changes were proposed in this pull request? This PR proposes to fix the lint-breaks as below: ``` [ERROR] src/main/java/org/apache/spark/unsafe/Platform.java:[51] (regexp) RegexpSingleline: No trailing whitespace allowed. [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[45,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[62,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[78,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[92,25] (naming) MethodName: Method name 'ProcessingTime' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[102,25] (naming) MethodName: Method name 'Once' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. [ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisInputDStreamBuilderSuite.java:[28,8] (imports) UnusedImports: Unused import - org.apache.spark.streaming.api.java.JavaDStream. ``` after: ``` dev/lint-java Checkstyle checks passed. ``` [Test Result](https://travis-ci.org/ConeyLiu/spark/jobs/229666169) ## How was this patch tested? Travis CI Author: Xianyang Liu <xianyang.liu@intel.com> Closes #17890 from ConeyLiu/codestyle.
-
- May 09, 2017
-
-
Holden Karau authored
## What changes were proposed in this pull request? Drop the hadoop distribution name from the Python version (PEP 440 - https://www.python.org/dev/peps/pep-0440/). We've been using the local version string to disambiguate between different hadoop versions packaged with PySpark, but PEP 440 states that local versions should not be used when publishing upstream. Since we no longer make PySpark pip packages for different hadoop versions, we can simply drop the hadoop information. If at a later point we need to start publishing different hadoop versions we can look at making different packages or similar. ## How was this patch tested? Ran `make-distribution` locally Author: Holden Karau <holden@us.ibm.com> Closes #17885 from holdenk/SPARK-20627-remove-pip-local-version-string.
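For context on the local-version label being dropped, a small illustration using the third-party `packaging` library; the version strings here are hypothetical examples, not the values Spark actually ships:

```python
from packaging.version import Version  # third-party 'packaging' library

with_local = Version("2.2.0+hadoop2.7")  # local label used to disambiguate a build
public_only = Version("2.2.0")           # what PEP 440 expects for upstream publishing

print(with_local.public)           # '2.2.0' -- the local label is not part of the public version
print(with_local.local)            # 'hadoop2.7'
print(with_local == public_only)   # False -- local labels still distinguish versions
```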
-
- May 03, 2017
-
-
Sean Owen authored
## What changes were proposed in this pull request? Fix build warnings primarily related to Breeze 0.13 operator changes, Java style problems ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #17803 from srowen/SPARK-20523.
-
- Apr 25, 2017
-
-
Yanbo Liang authored
## What changes were proposed in this pull request? Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B. ## How was this patch tested? Existing unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17746 from yanboliang/spark-20449.
-
- Apr 19, 2017
-
-
hyukjinkwon authored
## What changes were proposed in this pull request? This PR proposes two things as below: - Avoid Unidoc build only if Hadoop 2.6 is explicitly set in SBT build Due to a different dependency resolution in SBT & Unidoc for an unknown reason, the documentation build fails on a specific machine & environment in Jenkins, but I was unable to reproduce it. So, this PR just checks an environment variable `AMPLAB_JENKINS_BUILD_PROFILE` that is set in the Hadoop 2.6 SBT build against branches on Jenkins, and then disables the Unidoc build. **Note that the PR builder will still build it with Hadoop 2.6 & SBT.** ``` ======================================================================== Building Unidoc API Documentation ======================================================================== [info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: -Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive unidoc Using /usr/java/jdk1.8.0_60 as default JAVA_HOME. ... ``` I checked the environment variables from the logs (first bit) as below: - **spark-master-test-sbt-hadoop-2.6** (this one is failing) - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/lastBuild/consoleFull ``` JAVA_HOME=/usr/java/jdk1.8.0_60 JAVA_7_HOME=/usr/java/jdk1.7.0_79 SPARK_BRANCH=master AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.6 <- I use this variable AMPLAB_JENKINS="true" ``` - spark-master-test-sbt-hadoop-2.7 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/lastBuild/consoleFull ``` JAVA_HOME=/usr/java/jdk1.8.0_60 JAVA_7_HOME=/usr/java/jdk1.7.0_79 SPARK_BRANCH=master AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.7 AMPLAB_JENKINS="true" ``` - spark-master-test-maven-hadoop-2.6 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/lastBuild/consoleFull ``` JAVA_HOME=/usr/java/jdk1.8.0_60 JAVA_7_HOME=/usr/java/jdk1.7.0_79 HADOOP_PROFILE=hadoop-2.6 HADOOP_VERSION= SPARK_BRANCH=master AMPLAB_JENKINS="true" ``` - spark-master-test-maven-hadoop-2.7 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/lastBuild/consoleFull ``` JAVA_HOME=/usr/java/jdk1.8.0_60 JAVA_7_HOME=/usr/java/jdk1.7.0_79 HADOOP_PROFILE=hadoop-2.7 HADOOP_VERSION= SPARK_BRANCH=master AMPLAB_JENKINS="true" ``` - PR builder - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75843/consoleFull ``` JENKINS_MASTER_HOSTNAME=amp-jenkins-master JAVA_HOME=/usr/java/jdk1.8.0_60 JAVA_7_HOME=/usr/java/jdk1.7.0_79 ``` Judging from other logs in branch-2.1: - SBT & Hadoop 2.6 against branch-2.1 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.1-test-sbt-hadoop-2.6/lastBuild/consoleFull ``` JAVA_HOME=/usr/java/jdk1.8.0_60 JAVA_7_HOME=/usr/java/jdk1.7.0_79 SPARK_BRANCH=branch-2.1 AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.6 AMPLAB_JENKINS="true" ``` - Maven & Hadoop 2.6 against branch-2.1 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.1-test-maven-hadoop-2.6/lastBuild/consoleFull ``` JAVA_HOME=/usr/java/jdk1.8.0_60 JAVA_7_HOME=/usr/java/jdk1.7.0_79 HADOOP_PROFILE=hadoop-2.6 HADOOP_VERSION= SPARK_BRANCH=branch-2.1 AMPLAB_JENKINS="true" ``` We have been using the same convention for those variables.
These are actually used in the `run-tests.py` script here: https://github.com/apache/spark/blob/master/dev/run-tests.py#L519-L520 - Revert the previous try After https://github.com/apache/spark/pull/17651, it seems the build still fails on SBT Hadoop 2.6 master. I was unable to reproduce this - https://github.com/apache/spark/pull/17477#issuecomment-294094092 - and neither was the reviewer. So, this got merged, as it looks like the only way to verify this currently is to merge it (as no one seems able to reproduce it). ## How was this patch tested? I only checked that `is_hadoop_version_2_6 = os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6"` works as expected, as below: ```python >>> import collections >>> os = collections.namedtuple('os', 'environ')(environ={"AMPLAB_JENKINS_BUILD_PROFILE": "hadoop2.6"}) >>> print(not os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6") False >>> os = collections.namedtuple('os', 'environ')(environ={"AMPLAB_JENKINS_BUILD_PROFILE": "hadoop2.7"}) >>> print(not os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6") True >>> os = collections.namedtuple('os', 'environ')(environ={}) >>> print(not os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6") True ``` I tried many ways but was unable to reproduce this locally. Sean also tried the way I did but was also unable to reproduce it. Please refer to the comments in https://github.com/apache/spark/pull/17477#issuecomment-294094092 Author: hyukjinkwon <gurwls223@gmail.com> Closes #17669 from HyukjinKwon/revert-SPARK-20343.
-
- Apr 12, 2017
-
-
hyukjinkwon authored
## What changes were proposed in this pull request? This PR proposes to run Spark unidoc to test the Javadoc 8 build, as Javadoc 8 is easily re-broken. There are several problems with it: - It adds a little extra time to run the tests. In my case, it took 1.5 mins more (`Elapsed :[94.8746569157]`). How it was tested is described in "How was this patch tested?". - > One problem that I noticed was that Unidoc appeared to be processing test sources: if we can find a way to exclude those from being processed in the first place then that might significantly speed things up. (see joshrosen's [comment](https://issues.apache.org/jira/browse/SPARK-18692?focusedCommentId=15947627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15947627)) To complete this automated build, it also suggests fixing existing Javadoc breaks / ones introduced by test code as described above. These fixes are similar to instances previously fixed. Please refer to https://github.com/apache/spark/pull/15999 and https://github.com/apache/spark/pull/16013 Note that this only fixes **errors**, not **warnings**. Please see my observation at https://github.com/apache/spark/pull/17389#issuecomment-288438704 for spurious errors caused by warnings. ## How was this patch tested? Manually via `jekyll build` for building tests. Also tested by running `./dev/run-tests`. This was tested by manually adding `time.time()` as below: ```diff profiles_and_goals = build_profiles + sbt_goals print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: ", " ".join(profiles_and_goals)) + import time + st = time.time() exec_sbt(profiles_and_goals) + print("Elapsed :[%s]" % str(time.time() - st)) ``` produces ``` ... ======================================================================== Building Unidoc API Documentation ======================================================================== ... [info] Main Java API documentation successful. ... Elapsed :[94.8746569157] ... ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #17477 from HyukjinKwon/SPARK-18692.
-
- Apr 11, 2017
-
-
David Gingrich authored
## What changes were proposed in this pull request? Added a `util._message_exception` helper to use `str(e)` when `e.message` is unavailable (Python 3). Grepped for all occurrences of `.message` in `pyspark/`, and these were the only occurrences. ## How was this patch tested? - Doctests for the helper function ## Legal This is my original work and I license the work to the project under the project’s open source license. Author: David Gingrich <david@textio.com> Closes #16845 from dgingrich/topic-spark-19505-py3-exceptions.
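A minimal sketch of the kind of helper described; the real `util._message_exception` in this patch may differ in details:

```python
def _message_exception(excp):
    """Return excp.message when it exists (Python 2), else fall back to str(excp) (Python 3).

    >>> _message_exception(ValueError("bad input"))
    'bad input'
    """
    if hasattr(excp, "message"):
        return excp.message
    return str(excp)

try:
    raise ValueError("bad input")
except ValueError as e:
    print(_message_exception(e))  # works on both Python 2 and Python 3
```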
-
- Apr 02, 2017
-
-
zuotingbing authored
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-20123 ## What changes were proposed in this pull request? If the $SPARK_HOME or $FWDIR variable contains spaces, then building Spark with "./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn" will fail. ## How was this patch tested? manual tests Author: zuotingbing <zuo.tingbing9@zte.com.cn> Closes #17452 from zuotingbing/spark-bulid.
-
- Mar 29, 2017
-
-
Holden Karau authored
## What changes were proposed in this pull request? Allow Jenkins Python tests to use the installed conda to test Python 2.7 support & test pip installability. ## How was this patch tested? Updated shell scripts, ran tests locally with installed conda, ran tests in Jenkins. Author: Holden Karau <holden@us.ibm.com> Closes #17355 from holdenk/SPARK-19955-support-python-tests-with-conda.
-
- Mar 28, 2017
-
-
Bago Amirbekian authored
## What changes were proposed in this pull request? A pyspark wrapper for spark.ml.stat.ChiSquareTest ## How was this patch tested? unit tests doctests Author: Bago Amirbekian <bago@databricks.com> Closes #17421 from MrBago/chiSquareTestWrapper.
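A short usage sketch of the wrapper added here; the session setup, column names, and data are illustrative:

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chisq-demo").getOrCreate()

# One label and one feature vector per row; values are made up for illustration.
data = [(0.0, Vectors.dense(0.5, 10.0)),
        (0.0, Vectors.dense(1.5, 20.0)),
        (1.0, Vectors.dense(1.5, 30.0)),
        (0.0, Vectors.dense(3.5, 30.0)),
        (1.0, Vectors.dense(3.5, 40.0))]
df = spark.createDataFrame(data, ["label", "features"])

# Pearson's chi-squared test of independence of each feature against the label.
result = ChiSquareTest.test(df, "features", "label").head()
print("pValues:", result.pValues)
print("degreesOfFreedom:", result.degreesOfFreedom)
print("statistics:", result.statistics)
```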
-
- Mar 27, 2017
-
-
Josh Rosen authored
## What changes were proposed in this pull request? The master snapshot publisher builds are currently broken due to two minor build issues: 1. For unknown reasons, the LFTP `mkdir -p` command began throwing errors when the remote directory already exists. This change of behavior might have been caused by configuration changes in the ASF's SFTP server, but I'm not entirely sure of that. To work around this problem, this patch updates the script to ignore errors from the `lftp mkdir -p` commands. 2. The PySpark `setup.py` file references a non-existent `pyspark.ml.stat` module, causing Python packaging to fail by complaining about a missing directory. The fix is to simply drop that line from the setup script. ## How was this patch tested? The LFTP fix was tested by manually running the failing commands on AMPLab Jenkins against the ASF SFTP server. The PySpark fix was tested locally. Author: Josh Rosen <joshrosen@databricks.com> Closes #17437 from JoshRosen/spark-20102.
-
- Mar 26, 2017
-
-
zero323 authored
## What changes were proposed in this pull request? - Add `HasSupport` and `HasConfidence` `Params`. - Add new module `pyspark.ml.fpm`. - Add `FPGrowth` / `FPGrowthModel` wrappers. - Provide tests for new features. ## How was this patch tested? Unit tests. Author: zero323 <zero323@users.noreply.github.com> Closes #17218 from zero323/SPARK-19281.
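A brief usage sketch of the new `pyspark.ml.fpm` wrappers; the data and parameter values are illustrative:

```python
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fpgrowth-demo").getOrCreate()

# Each row is a transaction: an id and a list of items (made up for illustration).
df = spark.createDataFrame([
    (0, ["a", "b", "c"]),
    (1, ["a", "b"]),
    (2, ["a", "c"]),
], ["id", "items"])

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(df)

model.freqItemsets.show()      # frequent itemsets and their frequencies
model.associationRules.show()  # rules passing the minConfidence threshold
model.transform(df).show()     # per-transaction predicted consequents
```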
-
- Feb 18, 2017
-
-
Shuai Lin authored
## What changes were proposed in this pull request? Fixed a typo in `dev/make-distribution.sh` script that sets the MAVEN_OPTS variable, introduced [here](https://github.com/apache/spark/commit/0e24054#diff-ba2c046d92a1d2b5b417788bfb5cb5f8R149). ## How was this patch tested? Run `dev/make-distribution.sh` manually. Author: Shuai Lin <linshuai2012@gmail.com> Closes #16984 from lins05/fix-spark-make-distribution-after-removing-java7.
-
- Feb 17, 2017
-
-
Roberto Agostino Vitillo authored
## What changes were proposed in this pull request? This patch fixes a bug in `KafkaSource` with the (de)serialization of the length of the JSON string that contains the initial partition offsets. ## How was this patch tested? I ran the test suite for spark-sql-kafka-0-10. Author: Roberto Agostino Vitillo <ra.vitillo@gmail.com> Closes #16857 from vitillo/kafka_source_fix.
-
- Feb 16, 2017
-
-
Sean Owen authored
## What changes were proposed in this pull request? Use JAVA_HOME/bin/java, if JAVA_HOME is set, in the dev/mima script to run MiMa. This follows on https://github.com/apache/spark/pull/16871 -- it's a slightly separate issue, but it is currently causing a build failure. ## How was this patch tested? Manually tested. Author: Sean Owen <sowen@cloudera.com> Closes #16957 from srowen/SPARK-19550.2.
-
Sean Owen authored
- Move external/java8-tests tests into core, streaming, sql and remove - Remove MaxPermGen and related options - Fix some reflection / TODOs around Java 8+ methods - Update doc references to 1.7/1.8 differences - Remove Java 7/8 related build profiles - Update some plugins for better Java 8 compatibility - Fix a few Java-related warnings For the future: - Update Java 8 examples to fully use Java 8 - Update Java tests to use lambdas for simplicity - Update Java internal implementations to use lambdas ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #16871 from srowen/SPARK-19493.
-
- Feb 14, 2017
-
-
hyukjinkwon authored
## What changes were proposed in this pull request? It seems winutils for Hadoop 2.6.5 does not exist for now in https://github.com/steveloughran/winutils. This breaks the SparkR tests on Windows, so this PR proposes to use the winutils built for Hadoop 2.6.4 for now. ## How was this patch tested? Manually via AppVeyor **Before** https://ci.appveyor.com/project/spark-test/spark/build/627-r-test-break **After** https://ci.appveyor.com/project/spark-test/spark/build/629-r-test-break Author: hyukjinkwon <gurwls223@gmail.com> Closes #16927 from HyukjinKwon/spark-r-windows-break.
-
- Feb 08, 2017
-
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? After SPARK-19464, **SparkPullRequestBuilder** fails because it still tries to use hadoop2.3. **BEFORE** https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console ``` ======================================================================== Building Spark ======================================================================== [error] Could not find hadoop2.3 in the list. Valid options are ['hadoop2.6', 'hadoop2.7'] Attempting to post to Github... > Post successful. ``` **AFTER** https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark (w/Hive 1.2.1) using SBT with these arguments: -Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive test:package streaming-kafka-0-8-assembly/assembly streaming-flume-assembly/assembly streaming-kinesis-asl-assembly/assembly Using /usr/java/jdk1.8.0_60 as default JAVA_HOME. Note, this will be overridden by -java-home if it is set. ``` ## How was this patch tested? Pass the existing test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16858 from dongjoon-hyun/hotfix_run-tests.
-
Sean Owen authored
## What changes were proposed in this pull request? - Remove support for Hadoop 2.5 and earlier - Remove reflection and code constructs only needed to support multiple versions at once - Update docs to reflect newer versions - Remove older versions' builds and profiles. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #16810 from srowen/SPARK-19464.
-
- Jan 31, 2017
-
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? According to the discussion on #16281, which tried to upgrade toward Apache Parquet 1.9.0, the Apache Spark community prefers to upgrade to 1.8.2 instead of 1.9.0. Apache Parquet 1.8.2 was released officially last week, on 26 Jan, so we can use 1.8.2 now. https://lists.apache.org/thread.html/af0c813f1419899289a336d96ec02b3bbeecaea23aa6ef69f435c142%3Cdev.parquet.apache.org%3E This PR only aims to bump the Parquet version to 1.8.2; it doesn't touch any other code. ## How was this patch tested? Passes the existing tests and was also checked manually by running `./dev/test-dependencies.sh`. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16751 from dongjoon-hyun/SPARK-19409.
-
- Jan 25, 2017
-
-
Holden Karau authored
## What changes were proposed in this pull request? Fix installation of the mllib and ml sub-components, and more eagerly clean up cache files during the test script & make-distribution. ## How was this patch tested? Updated the sanity test script to import the mllib and ml sub-components. Author: Holden Karau <holden@us.ibm.com> Closes #16465 from holdenk/SPARK-19064-fix-pip-install-sub-components.
-
- Jan 19, 2017
-
-
José Hiram Soltren authored
Builds on top of work in SPARK-8425 to update Application Level Blacklisting in the scheduler. ## What changes were proposed in this pull request? Adds a UI to these patches by: - defining new listener events for blacklisting and unblacklisting nodes and executors; - sending said events at the relevant points in BlacklistTracker; - adding JSON (de)serialization code for these events; - augmenting the Executors UI page to show which, and how many, executors are blacklisted; - adding a unit test to make sure events are being fired; - adding HistoryServerSuite coverage to verify that the SHS reads these events correctly; - updating the Executor UI to show Blacklisted/Active/Dead as a tri-state in Executors Status. Updates .rat-excludes to pass tests. ## How was this patch tested? ./dev/run-tests testOnly org.apache.spark.util.JsonProtocolSuite testOnly org.apache.spark.scheduler.BlacklistTrackerSuite testOnly org.apache.spark.deploy.history.HistoryServerSuite https://github.com/jsoltren/jose-utils/blob/master/blacklist/test-blacklist.sh Author: José Hiram Soltren <jose@cloudera.com> Closes #16346 from jsoltren/SPARK-16654-submit.
-
- Jan 18, 2017
-
-
Yin Huai authored
## What changes were proposed in this pull request? Update known_translations per https://github.com/apache/spark/pull/16423#issuecomment-269739634 Author: Yin Huai <yhuai@databricks.com> Closes #16628 from yhuai/known_translations.
-
Adam Roberts authored
**What changes were proposed in this pull request?** Use Hadoop 2.6.5 for the Hadoop 2.6 profile, I see a bunch of fixes including security ones in the release notes that we should pick up **How was this patch tested?** Running the unit tests now with IBM's SDK for Java and let's see what happens with OpenJDK in the community builder - expecting no trouble as it is only a minor release. Author: Adam Roberts <aroberts@uk.ibm.com> Closes #16616 from a-roberts/Hadoop265Bumper.
-
- Jan 16, 2017
-
-
Felix Cheung authored
## What changes were proposed in this pull request? Refactored script to remove duplications and clearer purpose for each script ## How was this patch tested? manually Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16249 from felixcheung/rscripts.
-
- Jan 15, 2017
-
-
Shixiong Zhu authored
## What changes were proposed in this pull request? Upgrade Netty to `4.0.43.Final` to add the fix for https://github.com/netty/netty/issues/6153 ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16568 from zsxwing/SPARK-18971.
-
- Jan 14, 2017
-
-
hyukjinkwon authored
[SPARK-19221][PROJECT INFRA][R] Add winutils binaries to the path in AppVeyor tests for Hadoop libraries to call native codes properly ## What changes were proposed in this pull request? It seems Hadoop libraries need winutils binaries for native libraries in the path. It is not a problem in tests for now because we are only testing SparkR on Windows via AppVeyor but it can be a problem if we run Scala tests via AppVeyor as below: ``` - SPARK-18220: read Hive orc table with varchar column *** FAILED *** (3 seconds, 937 milliseconds) org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:609) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283) ... ``` This PR proposes to add it to the `Path` for AppVeyor tests. ## How was this patch tested? Manually via AppVeyor. **Before** https://ci.appveyor.com/project/spark-test/spark/build/549-windows-complete/job/gc8a1pjua2bc4i8m **After** https://ci.appveyor.com/project/spark-test/spark/build/572-windows-complete/job/c4vrysr5uvj2hgu7 Author: hyukjinkwon <gurwls223@gmail.com> Closes #16584 from HyukjinKwon/set-path-appveyor.
-
- Jan 10, 2017
-
-
Sean Owen authored
## What changes were proposed in this pull request? Updates to libthrift 0.9.3 to address a CVE. ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #16530 from srowen/SPARK-18997.
-