RL-based Multi-dimensional Autoscaling

Create MPA Objects for Application Deployment

Step 1. Create MPAs for the deployed application

See examples in multidimensional-pod-autoscaler/examples:

# change the MPA name and the namespace in the YAML file if needed
kubectl create -f multidimensional-pod-autoscaler/examples/hamster.yaml
kubectl create -f multidimensional-pod-autoscaler/examples/hamster-mpa.yaml
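To confirm the objects were created, list them (this assumes the MPA CRD registers the short name mpa; otherwise use the full resource name from the CRD):

# confirm the example deployment and the MPA object exist
kubectl get deployments
kubectl get mpa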

Step 2. Make sure the Prometheus environment variables are set

export PROM_SECRET=`oc get secrets -n <monitoring-ns> | grep prometheus-k8s-token | head -n 1 | awk '{print $1}'`
export PROM_TOKEN=`echo $(oc get secret $PROM_SECRET -n <monitoring-ns> -o json | jq -r '.data.token') | base64 -d`
export PROM_HOST=`oc get route prometheus-k8s -n <monitoring-ns> -o json | jq -r '.spec.host'`
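To sanity-check these values, you can query the Prometheus HTTP API directly (this assumes the route serves HTTPS and the token is authorized to query Prometheus):

# a JSON success response confirms the host and token are valid
curl -k -H "Authorization: Bearer $PROM_TOKEN" "https://$PROM_HOST/api/v1/query?query=up"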

Note that the application metrics (e.g., latency) are exported to Prometheus to support SLO-driven autoscaling.
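For example, an SLO-oriented query for 95th-percentile latency might look like the following (the metric name request_duration_seconds_bucket is only illustrative; substitute whatever your application actually exports):

# example: p95 latency over the last minute (metric name is hypothetical)
curl -k -H "Authorization: Bearer $PROM_TOKEN" "https://$PROM_HOST/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket[1m])) by (le))'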

Run the RL Controller

Step 1. Install packages

cd rl-controller
pip3 install -r requirements.txt
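If you prefer to keep the controller's dependencies isolated, create and activate a virtual environment before installing:

# optional: install into a virtual environment
python3 -m venv .venv
source .venv/bin/activate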

Step 2. Start the RL policy-training stage

# training from scratch
python main.py --app_name <app_name> --app_namespace <app_ns> --mpa_name <mpa_name> --mpa_namespace <mpa_ns>

# using a checkpoint (bootstrapping)
# --use_checkpoint --checkpoint './checkpoints/ppo-ep-xxx.pth.tar'
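For instance, a training run against the hamster example above might look like this (the application and MPA names assume the defaults in the example YAML files; adjust them if yours differ):

# illustrative invocation for the hamster example
python main.py --app_name hamster --app_namespace default --mpa_name hamster-mpa --mpa_namespace default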

The logs (for metrics and trajectories) will be stored in ./rl-controller/logs, and model checkpoints will be saved in ./rl-controller/checkpoints.

Step 3. Start the RL policy-serving (inference) stage

python main.py --inference --use_checkpoint --checkpoint './checkpoints/ppo-ep-xxx.pth.tar'

During serving, the agent will detect whether retraining is needed based on the specified reward thresholds.