    46b21260
    [SPARK-19002][BUILD][PYTHON] Check pep8 against all Python scripts · 46b21260
    hyukjinkwon authored
    ## What changes were proposed in this pull request?
    
    This PR proposes to check pep8 against all other Python scripts and fix the errors as below:
    
    ```bash
    ./dev/create-release/generate-contributors.py
    ./dev/create-release/releaseutils.py
    ./dev/create-release/translate-contributors.py
    ./dev/lint-python
    ./python/docs/epytext.py
    ./examples/src/main/python/mllib/decision_tree_classification_example.py
    ./examples/src/main/python/mllib/decision_tree_regression_example.py
    ./examples/src/main/python/mllib/gradient_boosting_classification_example.py
    ./examples/src/main/python/mllib/gradient_boosting_regression_example.py
    ./examples/src/main/python/mllib/linear_regression_with_sgd_example.py
    ./examples/src/main/python/mllib/logistic_regression_with_lbfgs_example.py
    ./examples/src/main/python/mllib/naive_bayes_example.py
    ./examples/src/main/python/mllib/random_forest_classification_example.py
    ./examples/src/main/python/mllib/random_forest_regression_example.py
    ./examples/src/main/python/mllib/svm_with_sgd_example.py
    ./examples/src/main/python/streaming/network_wordjoinsentiments.py
    ./sql/hive/src/test/resources/data/scripts/cat.py
    ./sql/hive/src/test/resources/data/scripts/cat_error.py
    ./sql/hive/src/test/resources/data/scripts/doubleescapedtab.py
    ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py
    ./sql/hive/src/test/resources/data/scripts/escapedcarriagereturn.py
    ./sql/hive/src/test/resources/data/scripts/escapednewline.py
    ./sql/hive/src/test/resources/data/scripts/escapedtab.py
    ./sql/hive/src/test/resources/data/scripts/input20_script.py
    ./sql/hive/src/test/resources/data/scripts/newline.py
    ```
    
    ## How was this patch tested?
    
    - `./python/docs/epytext.py`
    
      ```bash
      cd ./python/docs && make html
      ```
    
    - pep8 check (Python 2.7 / Python 3.3.6)
    
      ```
      ./dev/lint-python
      ```
    
    - `./dev/merge_spark_pr.py` (Python 2.7 only / Python 3.3.6 not working)
    
      ```bash
      python -m doctest -v ./dev/merge_spark_pr.py
      ```
    
    - `./dev/create-release/releaseutils.py` `./dev/create-release/generate-contributors.py` `./dev/create-release/translate-contributors.py` (Python 2.7 only / Python 3.3.6 not working)
    
      ```bash
      python generate-contributors.py
      python translate-contributors.py
      ```
    
    - Examples (Python 2.7 / Python 3.3.6)
    
      ```bash
      ./bin/spark-submit examples/src/main/python/mllib/decision_tree_classification_example.py
      ./bin/spark-submit examples/src/main/python/mllib/decision_tree_regression_example.py
      ./bin/spark-submit examples/src/main/python/mllib/gradient_boosting_classification_example.py
      ./bin/spark-submit examples/src/main/python/mllib/gradient_boosting_regression_example.p
      ./bin/spark-submit examples/src/main/python/mllib/random_forest_classification_example.py
      ./bin/spark-submit examples/src/main/python/mllib/random_forest_regression_example.py
      ```
    
    - Examples (Python 2.7 only / Python 3.3.6 not working)
      ```
      ./bin/spark-submit examples/src/main/python/mllib/linear_regression_with_sgd_example.py
      ./bin/spark-submit examples/src/main/python/mllib/logistic_regression_with_lbfgs_example.py
      ./bin/spark-submit examples/src/main/python/mllib/naive_bayes_example.py
      ./bin/spark-submit examples/src/main/python/mllib/svm_with_sgd_example.py
      ```
    
    - `sql/hive/src/test/resources/data/scripts/*.py` (Python 2.7 / Python 3.3.6 within suggested changes)
    
      Manually tested only the changed ones.
    
    - `./dev/github_jira_sync.py` (Python 2.7 only / Python 3.3.6 not working)
    
      Manually tested this after disabling the calls that actually add comments and links.
    
    And also via Jenkins tests.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #16405 from HyukjinKwon/minor-pep8.
github_jira_sync.py 5.17 KiB
#!/usr/bin/env python

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Utility for updating JIRA's with information about Github pull requests

import json
import os
import re
import sys
import urllib2

try:
    import jira.client
except ImportError:
    print("This tool requires the jira-python library")
    print("Install using 'sudo pip install jira'")
    sys.exit(-1)

# User facing configs
GITHUB_API_BASE = os.environ.get("GITHUB_API_BASE", "https://api.github.com/repos/apache/spark")
JIRA_PROJECT_NAME = os.environ.get("JIRA_PROJECT_NAME", "SPARK")
JIRA_API_BASE = os.environ.get("JIRA_API_BASE", "https://issues.apache.org/jira")
JIRA_USERNAME = os.environ.get("JIRA_USERNAME", "apachespark")
JIRA_PASSWORD = os.environ.get("JIRA_PASSWORD", "XXX")
# Maximum number of updates to perform in one run
MAX_UPDATES = int(os.environ.get("MAX_UPDATES", "100000"))
# Cut-off for oldest PR on which to comment. Useful for avoiding
# "notification overload" when running for the first time.
MIN_COMMENT_PR = int(os.environ.get("MIN_COMMENT_PR", "1496"))

# File used as an optimization to store maximum previously seen PR
# Used mostly because accessing ASF JIRA is slow, so we want to avoid checking
# the state of JIRA's that are tied to PR's we've already looked at.
MAX_FILE = ".github-jira-max"


def get_url(url):
    try:
        return urllib2.urlopen(url)
    except urllib2.HTTPError:
        print("Unable to fetch URL, exiting: %s" % url)
        sys.exit(-1)


def get_json(urllib_response):
    return json.load(urllib_response)


# Return a list of (JIRA id, JSON dict) tuples:
# e.g. [('SPARK-1234', {.. json ..}), ('SPARK-5687', {.. json ..})]
def get_jira_prs():
    result = []
    has_next_page = True
    page_num = 0
    while has_next_page:
        page = get_url(GITHUB_API_BASE + "/pulls?page=%s&per_page=100" % page_num)
        page_json = get_json(page)

        for pull in page_json:
            jiras = re.findall(JIRA_PROJECT_NAME + "-[0-9]{4,5}", pull['title'])
            for jira in jiras:
                result.append((jira, pull))

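        # Check if there is another page. A response without a Link header
        # (a single page of results) means there is nothing more to fetch.
        link_headers = [h for h in page.info().headers if h.startswith("Link")]
        if not link_headers or "next" not in link_headers[0]:
            has_next_page = False
        else:
            page_num += 1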
    return result


def set_max_pr(max_val):
    with open(MAX_FILE, 'w') as f:
        f.write("%s" % max_val)
    print("Writing largest PR number seen: %s" % max_val)


def get_max_pr():
    if os.path.exists(MAX_FILE):
        with open(MAX_FILE, 'r') as f:
            result = int(f.read())
        print("Read largest PR number previously seen: %s" % result)
        return result
    else:
        return 0


jira_client = jira.client.JIRA({'server': JIRA_API_BASE},
                               basic_auth=(JIRA_USERNAME, JIRA_PASSWORD))

jira_prs = get_jira_prs()

previous_max = get_max_pr()
print("Retrieved %s JIRA PR's from Github" % len(jira_prs))
jira_prs = [(k, v) for k, v in jira_prs if int(v['number']) > previous_max]
print("%s PR's remain after excluding visited ones" % len(jira_prs))

num_updates = 0
considered = []
for issue, pr in sorted(jira_prs, key=lambda kv: int(kv[1]['number'])):
    if num_updates >= MAX_UPDATES:
        break
    pr_num = int(pr['number'])

    print("Checking issue %s" % issue)
    considered.append(pr_num)

    url = pr['html_url']
    title = "[Github] Pull Request #%s (%s)" % (pr['number'], pr['user']['login'])
    try:
        existing_links = map(lambda l: l.raw['object']['url'], jira_client.remote_links(issue))
    except Exception:
        print("Failure reading JIRA %s (does it exist?)" % issue)
        print(sys.exc_info()[0])
        continue

    if url in existing_links:
        continue

    icon = {"title": "Pull request #%s" % pr['number'],
            "url16x16": "https://assets-cdn.github.com/favicon.ico"}
    destination = {"title": title, "url": url, "icon": icon}
    # For all possible fields see:
    # https://developer.atlassian.com/display/JIRADEV/Fields+in+Remote+Issue+Links
    # application = {"name": "Github pull requests", "type": "org.apache.spark.jira.github"}
    jira_client.add_remote_link(issue, destination)

    comment = "User '%s' has created a pull request for this issue:" % pr['user']['login']
    comment += "\n%s" % pr['html_url']
    if pr_num >= MIN_COMMENT_PR:
        jira_client.add_comment(issue, comment)

    print("Added link %s <-> PR #%s" % (issue, pr['number']))
    num_updates += 1

if len(considered) > 0:
    set_max_pr(max(considered))
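
The two parsing steps inside `get_jira_prs()` can be tried in isolation, without touching GitHub or JIRA. The following is a minimal sketch, not part of the script: `extract_jira_ids` and `has_next_page` are hypothetical helper names introduced here for illustration, and the `Link` header strings passed to them are made-up sample values. The regular expression is the one the script uses, except that the project key is passed through `re.escape` as a precaution.

```python
import re

JIRA_PROJECT_NAME = "SPARK"  # same default as in the script above


def extract_jira_ids(title, project=JIRA_PROJECT_NAME):
    """Return every '<project>-NNNN' style id found in a PR title.

    Hypothetical helper: mirrors the re.findall call in get_jira_prs(),
    which accepts 4 or 5 digits after the project key.
    """
    return re.findall(re.escape(project) + r"-[0-9]{4,5}", title)


def has_next_page(link_header):
    """Return True if a 'Link' header value contains a rel="next" entry.

    Hypothetical helper: the script only tests for the substring "next";
    this sketch checks each comma-separated entry for an explicit rel="next".
    """
    for part in link_header.split(","):
        if 'rel="next"' in part:
            return True
    return False


if __name__ == "__main__":
    # Title taken from the commit message above.
    print(extract_jira_ids(
        "[SPARK-19002][BUILD][PYTHON] Check pep8 against all Python scripts"))
    # Sample header value, assumed for illustration only.
    print(has_next_page(
        '<https://api.github.com/repos/apache/spark/pulls?page=2>; rel="next"'))
```

Because the pattern requires at least four digits, a title that references a three-digit issue such as `SPARK-123` yields no match, and an id with more than five digits is truncated to its first five.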