    [SPARK-6193] [EC2] Push group filter up to EC2 · 52ed7da1
    Nicholas Chammas authored
    When looking for a cluster, spark-ec2 currently pulls down [info for all instances](https://github.com/apache/spark/blob/eb48fd6e9d55fb034c00e61374bb9c2a86a82fb8/ec2/spark_ec2.py#L620) and filters locally. When working on an AWS account with hundreds of active instances, this step alone can take over 10 seconds.
    
    This PR improves how spark-ec2 searches for clusters by pushing the filter up to EC2.
    
    Basically, the problem (an unfiltered call) and the solution (a call filtered by security group) look like this, timed over 10 calls each:
    
    ```python
    >>> timeit.timeit('blah = conn.get_all_reservations()', setup='from __main__ import conn', number=10)
    116.96390509605408
    >>> timeit.timeit('blah = conn.get_all_reservations(filters={"instance.group-name": ["my-cluster-master"]})', setup='from __main__ import conn', number=10)
    4.629754066467285
    ```
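
For illustration, here is a minimal sketch of what pushing the group filter up to EC2 can look like in a helper. It is not the code from this PR. `FakeConnection`, `FakeReservation`, and `FakeInstance` are hypothetical stand-ins so the example runs without AWS credentials, and the helper name `get_instances` is an assumption. The only boto call it relies on is `get_all_reservations(filters=...)`, as used in the timings above. The state check mirrors the "ignore shutting-down" commit and assumes terminated instances should be skipped as well.

```python
import itertools
from collections import namedtuple

# Hypothetical stand-ins for boto's reservation and instance objects.
FakeInstance = namedtuple("FakeInstance", ["id", "state", "groups"])
FakeReservation = namedtuple("FakeReservation", ["instances"])


class FakeConnection(object):
    """Mimics only get_all_reservations(filters=...) for this sketch."""

    def __init__(self, reservations):
        self._reservations = reservations

    def get_all_reservations(self, filters=None):
        wanted = set((filters or {}).get("instance.group-name", []))
        if not wanted:
            # No filter: return everything, like the slow path above.
            return list(self._reservations)
        # Rough approximation of server-side filtering by security group.
        return [
            r for r in self._reservations
            if any(wanted & set(i.groups) for i in r.instances)
        ]


def get_instances(conn, group_names):
    """Return live instances in any of the given security groups.

    The group filter is sent with the request, so only matching
    reservations come back; only the state check happens locally.
    """
    reservations = conn.get_all_reservations(
        filters={"instance.group-name": group_names})
    instances = itertools.chain.from_iterable(
        r.instances for r in reservations)
    return [i for i in instances
            if i.state not in ("shutting-down", "terminated")]


conn = FakeConnection([
    FakeReservation([FakeInstance("i-1", "running", ["my-cluster-master"])]),
    FakeReservation([FakeInstance("i-2", "terminated", ["my-cluster-master"])]),
    FakeReservation([FakeInstance("i-3", "running", ["other-cluster-slaves"])]),
])
masters = get_instances(conn, ["my-cluster-master"])
print([i.id for i in masters])
```

With a real boto EC2 connection in place of `FakeConnection`, the same helper would let EC2 do the group matching, which is where the speedup in the timings above comes from.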
    
    Translated to a user-visible action, this looks like (against an AWS account with ~200 active instances):
    
    ```shell
    # master branch
    $ python -m timeit -n 3 --setup 'import subprocess' 'subprocess.call("./spark-ec2 get-master my-cluster --region us-west-2", shell=True)'
    ...
    3 loops, best of 3: 9.83 sec per loop
    
    # this PR
    $ python -m timeit -n 3 --setup 'import subprocess' 'subprocess.call("./spark-ec2 get-master my-cluster --region us-west-2", shell=True)'
    ...
    3 loops, best of 3: 1.47 sec per loop
    ```
    
    This PR also refactors `get_existing_cluster()` to make it, I hope, simpler.
    
    Finally, this PR fixes some minor grammar issues related to printing status to the user. 🎩 👏
    
    Author: Nicholas Chammas <nicholas.chammas@gmail.com>
    
    Closes #4922 from nchammas/get-existing-cluster-faster and squashes the following commits:
    
    18802f1 [Nicholas Chammas] ignore shutting-down
    f2a5b9f [Nicholas Chammas] fix grammar
    d96a489 [Nicholas Chammas] push group filter up to EC2