    [SPARK-1900 / 1918] PySpark on YARN is broken · 5081a0a9
    Andrew Or authored
    If I run the following on a YARN cluster
    ```
    bin/spark-submit sheep.py --master yarn-client
    ```
    it fails because of a mismatch in paths: `spark-submit` thinks that `sheep.py` resides on HDFS, and balks when it can't find the file there. A natural workaround is to add the `file:` prefix to the file:
    ```
    bin/spark-submit file:/path/to/sheep.py --master yarn-client
    ```
    However, this also fails, this time because Python does not understand URI schemes.
    
    This PR fixes both failures by automatically resolving every path passed as a command-line argument to `spark-submit` into a proper URI. This has the added benefit of keeping file and jar paths consistent across cluster modes. For Python, we strip the URI scheme before we actually run the file.
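
The two steps described above (resolve each path to a URI, then strip the scheme before handing the file to Python) can be sketched roughly as follows. This is an illustrative Python sketch only, not the code in this PR: the function names `resolve_uri`, `resolve_uris`, and `strip_scheme` are hypothetical, the Windows handling is simplified, and the actual change lives in Spark's Scala code (`resolveURIs` in `Utils`, per the commit list below).

```python
import os
import re
from urllib.parse import unquote, urlparse

# Matches Windows drive paths such as C:\dir\app.py, which would otherwise
# be misread as a URI with the scheme "c".
_WINDOWS_DRIVE = re.compile(r"^[A-Za-z]:[\\/]")


def resolve_uri(path):
    """Return `path` as a URI, adding `file:` to paths that have no scheme.

    Hypothetical helper for illustration; not Spark's implementation.
    """
    if _WINDOWS_DRIVE.match(path):
        # Assumption: treat drive-letter paths as local files.
        return "file:/" + path.replace("\\", "/")
    if urlparse(path).scheme:
        # Already a URI (hdfs:, file:, ...): leave it untouched.
        return path
    # No scheme: assume a local file and make the path absolute.
    return "file:" + os.path.abspath(path)


def resolve_uris(paths):
    """Resolve a comma-delimited list of paths, skipping empty entries."""
    return ",".join(resolve_uri(p) for p in paths.split(",") if p)


def strip_scheme(uri):
    """Return a plain local path for `file:` URIs so Python can open it.

    URIs with any other scheme are returned unchanged.
    """
    parsed = urlparse(uri)
    if parsed.scheme == "file":
        return unquote(parsed.path)
    return uri


if __name__ == "__main__":
    resolved = resolve_uri("/path/to/sheep.py")
    print(resolved)
    print(strip_scheme(resolved))
    print(resolve_uris("/tmp/a.jar,hdfs://namenode/b.jar"))
```

Under these assumptions, a bare local path gains a `file:` scheme so that it is no longer mistaken for an HDFS path, while a path that already carries a scheme passes through unchanged.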
    
    Much of the code was originally written by @mengxr. Tested on a YARN cluster. More tests pending.
    
    Author: Andrew Or <andrewor14@gmail.com>
    
    Closes #853 from andrewor14/submit-paths and squashes the following commits:
    
    0bb097a [Andrew Or] Format path correctly before adding it to PYTHONPATH
    323b45c [Andrew Or] Include --py-files on PYTHONPATH for pyspark shell
    3c36587 [Andrew Or] Improve error messages (minor)
    854aa6a [Andrew Or] Guard against NPE if user gives pathological paths
    6638a6b [Andrew Or] Fix spark-shell jar paths after #849 went in
    3bb0359 [Andrew Or] Update more comments (minor)
    2a1f8a0 [Andrew Or] Update comments (minor)
    6af2c77 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths
    a68c4d1 [Andrew Or] Handle Windows python file path correctly
    427a250 [Andrew Or] Resolve paths properly for Windows
    a591a4a [Andrew Or] Update tests for resolving URIs
    6c8621c [Andrew Or] Move resolveURIs to Utils
    db8255e [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths
    f542dce [Andrew Or] Fix outdated tests
    691c4ce [Andrew Or] Ignore special primary resource names
    5342ac7 [Andrew Or] Add missing space in error message
    02f77f3 [Andrew Or] Resolve command line arguments to spark-submit properly