Commit 6a3cbbc0 authored by Holden Karau, committed by Josh Rosen

[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed

## What changes were proposed in this pull request?

This PR aims to provide a pip-installable PySpark package. It does a bunch of work to copy the jars over and package them with the Python code (to avoid the problems that come from mixing one version of the Python code with a different version of the jars). It does not currently publish to PyPI, but that is the natural follow-up (SPARK-18129).
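
As a rough sketch of the intended workflow (the build profiles and exact sdist path below are illustrative assumptions, not mandated by this PR):

```bash
# Build a Spark distribution and also produce the PySpark sdist (profiles here are just an example)
./dev/make-distribution.sh --name custom-spark --tgz --pip -Phadoop-2.7 -Pyarn

# Install the generated source distribution into the current Python environment
pip install --upgrade python/dist/pyspark-*.tar.gz

# Quick smoke test of the pip installed package
spark-submit dev/pip-sanity-check.py
```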

Done:
- pip installable on conda [manually tested]
- setup.py install on a non-pip-managed system (RHEL) with YARN [manually tested]
- automated testing of this (virtualenv)
- packaging and signing with release-build*

Possible follow up work:
- release-build update to publish to PyPI (SPARK-18128)
- figure out who owns the pyspark package name on prod PyPI (is it someone within the project, or should we ask PyPI, or should we choose a different name to publish with, like ApachePySpark?)
- Windows support and/or testing (SPARK-18136)
- investigate details of wheel caching and see if we can avoid cleaning the wheel cache during our tests
- consider how we want to number our dev/snapshot versions

Explicitly out of scope:
- Using pip installed PySpark to start a standalone cluster
- Using pip installed PySpark for non-Python Spark programs

*I've done some work to verify the release-build changes, but as a non-committer I've only been able to test them locally.

## How was this patch tested?

Automated testing with virtualenv, manual testing with conda, a system wide install, and YARN integration.

The release-build changes were tested locally as a non-committer (no testing of uploading artifacts to Apache staging websites).
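
The packaging test can also be run on its own, assuming `virtualenv` and `pip` are available (the script skips itself otherwise):

```bash
# Runs the pip/virtualenv installability tests added by this PR
./dev/run-pip-tests
```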

Author: Holden Karau <holden@us.ibm.com>
Author: Juliet Hougland <juliet@cloudera.com>
Author: Juliet Hougland <not@myemail.com>

Closes #15659 from holdenk/SPARK-1267-pip-install-pyspark.
parent 95157938
Showing with 267 additions and 21 deletions
@@ -57,6 +57,8 @@ project/plugins/project/build.properties
project/plugins/src_managed/
project/plugins/target/
python/lib/pyspark.zip
python/deps
python/pyspark/python
reports/
scalastyle-on-compile.generated.xml
scalastyle-output.xml
@@ -25,7 +25,7 @@ set -o posix
# Figure out if SPARK_HOME is set
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
source "$(dirname "$0")"/find-spark-home
fi
CLASS="org.apache.hive.beeline.BeeLine"
#!/usr/bin/env bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Attempts to find a proper value for SPARK_HOME. Should be included using "source" directive.
FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "$(dirname "$0")"; pwd)/find_spark_home.py"
# Short circuit if the user already has this set.
if [ ! -z "${SPARK_HOME}" ]; then
exit 0
elif [ ! -f "$FIND_SPARK_HOME_PYTHON_SCRIPT" ]; then
# If we are not in the same directory as find_spark_home.py, we are not pip installed, so we don't
# need to search the different Python directories for a Spark installation.
# Note that, even if the user has pip installed PySpark, when they are directly calling pyspark-shell or
# spark-submit from another Spark directory we want to use that installation rather than the
# pip installed version of PySpark.
export SPARK_HOME="$(cd "$(dirname "$0")"/..; pwd)"
else
# We are pip installed, use the Python script to resolve a reasonable SPARK_HOME
# Default to standard python interpreter unless told otherwise
if [[ -z "$PYSPARK_DRIVER_PYTHON" ]]; then
PYSPARK_DRIVER_PYTHON="${PYSPARK_PYTHON:-"python"}"
fi
export SPARK_HOME=$($PYSPARK_DRIVER_PYTHON "$FIND_SPARK_HOME_PYTHON_SCRIPT")
fi
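As a quick, hypothetical check of this resolution logic (assuming the bin scripts of a pip installed PySpark, including find-spark-home, end up on the PATH):
unset SPARK_HOME
source "$(dirname "$(command -v pyspark)")/find-spark-home"   # hypothetical: relies on pyspark being on the PATH
echo "$SPARK_HOME"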
@@ -23,7 +23,7 @@
# Figure out where Spark is installed
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
source "$(dirname "$0")"/find-spark-home
fi
if [ -z "$SPARK_ENV_LOADED" ]; then
@@ -18,7 +18,7 @@
#
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
source "$(dirname "$0")"/find-spark-home
fi
source "${SPARK_HOME}"/bin/load-spark-env.sh
@@ -46,7 +46,7 @@ WORKS_WITH_IPYTHON=$(python -c 'import sys; print(sys.version_info >= (2, 7, 0))
# Determine the Python executable to use for the executors:
if [[ -z "$PYSPARK_PYTHON" ]]; then
if [[ $PYSPARK_DRIVER_PYTHON == *ipython* && ! WORKS_WITH_IPYTHON ]]; then
if [[ $PYSPARK_DRIVER_PYTHON == *ipython* && ! $WORKS_WITH_IPYTHON ]]; then
echo "IPython requires Python 2.7+; please install python2.7 or set PYSPARK_PYTHON" 1>&2
exit 1
else
@@ -68,7 +68,7 @@ if [[ -n "$SPARK_TESTING" ]]; then
unset YARN_CONF_DIR
unset HADOOP_CONF_DIR
export PYTHONHASHSEED=0
exec "$PYSPARK_DRIVER_PYTHON" -m $1
exec "$PYSPARK_DRIVER_PYTHON" -m "$1"
exit
fi
@@ -18,7 +18,7 @@
#
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
source "$(dirname "$0")"/find-spark-home
fi
export _SPARK_CMD_USAGE="Usage: ./bin/run-example [options] example-class [example args]"
@@ -18,7 +18,7 @@
#
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
source "$(dirname "$0")"/find-spark-home
fi
. "${SPARK_HOME}"/bin/load-spark-env.sh
@@ -27,7 +27,7 @@ fi
if [ -n "${JAVA_HOME}" ]; then
RUNNER="${JAVA_HOME}/bin/java"
else
if [ `command -v java` ]; then
if [ "$(command -v java)" ]; then
RUNNER="java"
else
echo "JAVA_HOME is not set" >&2
@@ -36,7 +36,7 @@ else
fi
# Find Spark jars.
if [ -f "${SPARK_HOME}/RELEASE" ]; then
if [ -d "${SPARK_HOME}/jars" ]; then
SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
@@ -21,7 +21,7 @@
# Shell script for starting the Spark Shell REPL
cygwin=false
case "`uname`" in
case "$(uname)" in
CYGWIN*) cygwin=true;;
esac
@@ -29,7 +29,7 @@ esac
set -o posix
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
source "$(dirname "$0")"/find-spark-home
fi
export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]"
@@ -18,7 +18,7 @@
#
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
source "$(dirname "$0")"/find-spark-home
fi
export _SPARK_CMD_USAGE="Usage: ./bin/spark-sql [options] [cli option]"
@@ -18,7 +18,7 @@
#
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
source "$(dirname "$0")"/find-spark-home
fi
# disable randomized hash for string in Python 3.3+
@@ -18,7 +18,7 @@
#
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
source "$(dirname "$0")"/find-spark-home
fi
source "${SPARK_HOME}"/bin/load-spark-env.sh
@@ -162,14 +162,35 @@ if [[ "$1" == "package" ]]; then
export ZINC_PORT=$ZINC_PORT
echo "Creating distribution: $NAME ($FLAGS)"
# Write out the NAME and VERSION to the PySpark version info; we rewrite the - into a . and SNAPSHOT
# into dev0 to be closer to PEP 440. We use the NAME as a "local version".
PYSPARK_VERSION=`echo "$SPARK_VERSION+$NAME" | sed -r "s/-/./" | sed -r "s/SNAPSHOT/dev0/"`
echo "__version__='$PYSPARK_VERSION'" > python/pyspark/version.py
# Get maven home set by MVN
MVN_HOME=`$MVN -version 2>&1 | grep 'Maven home' | awk '{print $NF}'`
./dev/make-distribution.sh --name $NAME --mvn $MVN_HOME/bin/mvn --tgz $FLAGS \
echo "Creating distribution"
./dev/make-distribution.sh --name $NAME --mvn $MVN_HOME/bin/mvn --tgz --pip $FLAGS \
-DzincPort=$ZINC_PORT 2>&1 > ../binary-release-$NAME.log
cd ..
cp spark-$SPARK_VERSION-bin-$NAME/spark-$SPARK_VERSION-bin-$NAME.tgz .
echo "Copying and signing python distribution"
PYTHON_DIST_NAME=pyspark-$PYSPARK_VERSION.tar.gz
cp spark-$SPARK_VERSION-bin-$NAME/python/dist/$PYTHON_DIST_NAME .
echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --armour \
--output $PYTHON_DIST_NAME.asc \
--detach-sig $PYTHON_DIST_NAME
echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --print-md \
MD5 $PYTHON_DIST_NAME > \
$PYTHON_DIST_NAME.md5
echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --print-md \
SHA512 $PYTHON_DIST_NAME > \
$PYTHON_DIST_NAME.sha
echo "Copying and signing regular binary distribution"
cp spark-$SPARK_VERSION-bin-$NAME/spark-$SPARK_VERSION-bin-$NAME.tgz .
echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --armour \
--output spark-$SPARK_VERSION-bin-$NAME.tgz.asc \
--detach-sig spark-$SPARK_VERSION-bin-$NAME.tgz
@@ -208,6 +229,7 @@ if [[ "$1" == "package" ]]; then
# Re-upload a second time and leave the files in the timestamped upload directory:
LFTP mkdir -p $dest_dir
LFTP mput -O $dest_dir 'spark-*'
LFTP mput -O $dest_dir 'pyspark-*'
exit 0
fi
@@ -65,6 +65,7 @@ sed -i".tmp1" 's/Version.*$/Version: '"$RELEASE_VERSION"'/g' R/pkg/DESCRIPTION
# Set the release version in docs
sed -i".tmp1" 's/SPARK_VERSION:.*$/SPARK_VERSION: '"$RELEASE_VERSION"'/g' docs/_config.yml
sed -i".tmp2" 's/SPARK_VERSION_SHORT:.*$/SPARK_VERSION_SHORT: '"$RELEASE_VERSION"'/g' docs/_config.yml
sed -i".tmp3" 's/__version__ = .*$/__version__ = "'"$RELEASE_VERSION"'"/' python/pyspark/version.py
git commit -a -m "Preparing Spark release $RELEASE_TAG"
echo "Creating tag $RELEASE_TAG at the head of $GIT_BRANCH"
@@ -74,12 +75,16 @@ git tag $RELEASE_TAG
$MVN versions:set -DnewVersion=$NEXT_VERSION | grep -v "no value" # silence logs
# Remove -SNAPSHOT before setting the R version as R expects version strings to only have numbers
R_NEXT_VERSION=`echo $NEXT_VERSION | sed 's/-SNAPSHOT//g'`
sed -i".tmp2" 's/Version.*$/Version: '"$R_NEXT_VERSION"'/g' R/pkg/DESCRIPTION
sed -i".tmp4" 's/Version.*$/Version: '"$R_NEXT_VERSION"'/g' R/pkg/DESCRIPTION
# Write out the R_NEXT_VERSION to the PySpark version info; we use dev0 instead of SNAPSHOT to be
# closer to PEP 440.
sed -i".tmp5" 's/__version__ = .*$/__version__ = "'"$R_NEXT_VERSION.dev0"'"/' python/pyspark/version.py
# Update docs with next version
sed -i".tmp3" 's/SPARK_VERSION:.*$/SPARK_VERSION: '"$NEXT_VERSION"'/g' docs/_config.yml
sed -i".tmp6" 's/SPARK_VERSION:.*$/SPARK_VERSION: '"$NEXT_VERSION"'/g' docs/_config.yml
# Use R version for short version
sed -i".tmp4" 's/SPARK_VERSION_SHORT:.*$/SPARK_VERSION_SHORT: '"$R_NEXT_VERSION"'/g' docs/_config.yml
sed -i".tmp7" 's/SPARK_VERSION_SHORT:.*$/SPARK_VERSION_SHORT: '"$R_NEXT_VERSION"'/g' docs/_config.yml
git commit -a -m "Preparing development version $NEXT_VERSION"
@@ -20,7 +20,9 @@
SCRIPT_DIR="$( cd "$( dirname "$0" )" && pwd )"
SPARK_ROOT_DIR="$(dirname "$SCRIPT_DIR")"
PATHS_TO_CHECK="./python/pyspark/ ./examples/src/main/python/ ./dev/sparktestsupport"
PATHS_TO_CHECK="$PATHS_TO_CHECK ./dev/run-tests.py ./python/run-tests.py ./dev/run-tests-jenkins.py"
# TODO: fix pep8 errors with the rest of the Python scripts under dev
PATHS_TO_CHECK="$PATHS_TO_CHECK ./dev/run-tests.py ./python/*.py ./dev/run-tests-jenkins.py"
PATHS_TO_CHECK="$PATHS_TO_CHECK ./dev/pip-sanity-check.py"
PEP8_REPORT_PATH="$SPARK_ROOT_DIR/dev/pep8-report.txt"
PYLINT_REPORT_PATH="$SPARK_ROOT_DIR/dev/pylint-report.txt"
PYLINT_INSTALL_INFO="$SPARK_ROOT_DIR/dev/pylint-info.txt"
@@ -33,6 +33,7 @@ SPARK_HOME="$(cd "`dirname "$0"`/.."; pwd)"
DISTDIR="$SPARK_HOME/dist"
MAKE_TGZ=false
MAKE_PIP=false
NAME=none
MVN="$SPARK_HOME/build/mvn"
@@ -40,7 +41,7 @@ function exit_with_usage {
echo "make-distribution.sh - tool for making binary distributions of Spark"
echo ""
echo "usage:"
cl_options="[--name] [--tgz] [--mvn <mvn-command>]"
cl_options="[--name] [--tgz] [--pip] [--mvn <mvn-command>]"
echo "make-distribution.sh $cl_options <maven build options>"
echo "See Spark's \"Building Spark\" doc for correct Maven options."
echo ""
@@ -67,6 +68,9 @@ while (( "$#" )); do
--tgz)
MAKE_TGZ=true
;;
--pip)
MAKE_PIP=true
;;
--mvn)
MVN="$2"
shift
@@ -201,6 +205,16 @@ fi
# Copy data files
cp -r "$SPARK_HOME/data" "$DISTDIR"
# Make pip package
if [ "$MAKE_PIP" == "true" ]; then
echo "Building python distribution package"
cd $SPARK_HOME/python
python setup.py sdist
cd ..
else
echo "Skipping creating pip installable PySpark"
fi
# Copy other things
mkdir "$DISTDIR"/conf
cp "$SPARK_HOME"/conf/*.template "$DISTDIR"/conf
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from __future__ import print_function
from pyspark.sql import SparkSession
import sys
if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("PipSanityCheck")\
        .getOrCreate()
    sc = spark.sparkContext
    rdd = sc.parallelize(range(100), 10)
    value = rdd.reduce(lambda x, y: x + y)
    if value != 4950:
        print("Value {0} did not match expected value.".format(value), file=sys.stderr)
        sys.exit(-1)
    print("Successfully ran pip sanity check")
    spark.stop()
#!/usr/bin/env bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Stop on error
set -e
# Set nullglob for when we are checking existence based on globs
shopt -s nullglob
FWDIR="$(cd "$(dirname "$0")"/..; pwd)"
cd "$FWDIR"
echo "Constucting virtual env for testing"
VIRTUALENV_BASE=$(mktemp -d)
# Clean up the temporary virtualenv environment we created.
function delete_virtualenv() {
echo "Cleaning up temporary directory - $VIRTUALENV_BASE"
rm -rf "$VIRTUALENV_BASE"
}
trap delete_virtualenv EXIT
# Some systems don't have pip or virtualenv - in those cases our tests won't work.
if ! hash virtualenv 2>/dev/null; then
echo "Missing virtualenv skipping pip installability tests."
exit 0
fi
if ! hash pip 2>/dev/null; then
echo "Missing pip, skipping pip installability tests."
exit 0
fi
# Figure out which Python execs we should test pip installation with
PYTHON_EXECS=()
if hash python2 2>/dev/null; then
# We do this since we are testing with virtualenv and the default virtual env python
# is in /usr/bin/python
PYTHON_EXECS+=('python2')
elif hash python 2>/dev/null; then
# If python2 isn't installed fallback to python if available
PYTHON_EXECS+=('python')
fi
if hash python3 2>/dev/null; then
PYTHON_EXECS+=('python3')
fi
# Determine which version of PySpark we are building, to derive the archive name
PYSPARK_VERSION=$(python -c "exec(open('python/pyspark/version.py').read());print __version__")
PYSPARK_DIST="$FWDIR/python/dist/pyspark-$PYSPARK_VERSION.tar.gz"
# The pip install options we use for all the pip commands
PIP_OPTIONS="--upgrade --no-cache-dir --force-reinstall "
# Test both regular user and edit/dev install modes.
PIP_COMMANDS=("pip install $PIP_OPTIONS $PYSPARK_DIST"
"pip install $PIP_OPTIONS -e python/")
for python in "${PYTHON_EXECS[@]}"; do
for install_command in "${PIP_COMMANDS[@]}"; do
echo "Testing pip installation with python $python"
# Create a per-interpreter working directory under the temporary virtualenv base (cleaned up by the EXIT trap)
echo "Using $VIRTUALENV_BASE for virtualenv"
VIRTUALENV_PATH="$VIRTUALENV_BASE"/$python
rm -rf "$VIRTUALENV_PATH"
mkdir -p "$VIRTUALENV_PATH"
virtualenv --python=$python "$VIRTUALENV_PATH"
source "$VIRTUALENV_PATH"/bin/activate
# Upgrade pip
pip install --upgrade pip
echo "Creating pip installable source dist"
cd "$FWDIR"/python
$python setup.py sdist
echo "Installing dist into virtual env"
cd dist
# Verify that the dist directory only contains one thing to install
sdists=(*.tar.gz)
if [ ${#sdists[@]} -ne 1 ]; then
echo "Unexpected number of targets found in dist directory - please cleanup existing sdists first."
exit -1
fi
# Do the actual installation
cd "$FWDIR"
$install_command
cd /
echo "Run basic sanity check on pip installed version with spark-submit"
spark-submit "$FWDIR"/dev/pip-sanity-check.py
echo "Run basic sanity check with import based"
python "$FWDIR"/dev/pip-sanity-check.py
echo "Run the tests for context.py"
python "$FWDIR"/python/pyspark/context.py
cd "$FWDIR"
done
done
exit 0
@@ -128,6 +128,7 @@ def run_tests(tests_timeout):
ERROR_CODES["BLOCK_MIMA"]: 'MiMa tests',
ERROR_CODES["BLOCK_SPARK_UNIT_TESTS"]: 'Spark unit tests',
ERROR_CODES["BLOCK_PYSPARK_UNIT_TESTS"]: 'PySpark unit tests',
ERROR_CODES["BLOCK_PYSPARK_PIP_TESTS"]: 'PySpark pip packaging tests',
ERROR_CODES["BLOCK_SPARKR_UNIT_TESTS"]: 'SparkR unit tests',
ERROR_CODES["BLOCK_TIMEOUT"]: 'from timeout after a configured wait of \`%s\`' % (
tests_timeout)
@@ -432,6 +432,12 @@ def run_python_tests(test_modules, parallelism):
run_cmd(command)
def run_python_packaging_tests():
    set_title_and_block("Running PySpark packaging tests", "BLOCK_PYSPARK_PIP_TESTS")
    command = [os.path.join(SPARK_HOME, "dev", "run-pip-tests")]
    run_cmd(command)


def run_build_tests():
    set_title_and_block("Running build tests", "BLOCK_BUILD_TESTS")
    run_cmd([os.path.join(SPARK_HOME, "dev", "test-dependencies.sh")])
@@ -583,6 +589,7 @@ def main():
    modules_with_python_tests = [m for m in test_modules if m.python_test_goals]
    if modules_with_python_tests:
        run_python_tests(modules_with_python_tests, opts.parallelism)
        run_python_packaging_tests()
    if any(m.should_run_r_tests for m in test_modules):
        run_sparkr_tests()
@@ -33,5 +33,6 @@ ERROR_CODES = {
"BLOCK_SPARKR_UNIT_TESTS": 20,
"BLOCK_JAVA_STYLE": 21,
"BLOCK_BUILD_TESTS": 22,
"BLOCK_PYSPARK_PIP_TESTS": 23,
"BLOCK_TIMEOUT": 124
}