Adding new machines to Enterprise quickly in AWS using warm pools

06-27-2022 07:02 AM
by NCCGIA, New Contributor II

Our current Enterprise deployment in AWS takes about 20-25 minutes to spin up an additional machine in our auto-scaling group and have it fully added to the site. We are experimenting with a warm pool that adds new machines and keeps them in a stopped state so that we can respond more dynamically to rapid increases in system load. Has anyone considered this or already implemented it successfully?
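For reference, the pool we're testing was created along these lines; the group name and minimum size below are placeholders:

      # Create a warm pool of pre-initialized instances kept in a Stopped state
      # (group name and --min-size are placeholders)
      aws autoscaling put-warm-pool \
          --auto-scaling-group-name my-enterprise-asg \
          --pool-state Stopped \
          --min-size 2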

6 Replies
MikeSchonlau
Occasional Contributor III

We have not tested a warm pool, but would be very interested in the results. We have had the same experience with our auto-scaling group - about 20-30 minutes before a new instance has fully joined the site.

NCCGIA
New Contributor II

We have gotten as far as preliminary testing. We created the warm pool of instances; they joined the site and were then put into a "stopped" state (i.e., incurring no AWS compute costs). We decreased the instance-launch wait period to 60s and manually increased the desired number of instances, and the new instances successfully joined the site in less than 10 minutes. While that's a significant improvement, we'd still like to get it down further. We did experience an issue when we decreased the instance count to force the group to scale in (we wanted to see whether another instance would be added to the pool after termination): we ran into a problem with Enterprise but haven't tracked down whether the warm pool had something to do with it. We're also not sure whether we can improve how quickly ArcGIS Server recognizes the new machine. If anyone has suggestions, we'll be happy to try them out. I'll be sure to update this thread with any additional results/info we gather.
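For anyone reproducing the scaling part of the test, the calls look roughly like this. The group name and counts are placeholders, and it assumes the "wait period" we changed maps to the group's default instance warmup:

      # Shorten the warmup to 60s (assumption: this is the "launch wait
      # period" mentioned above); group name is a placeholder
      aws autoscaling update-auto-scaling-group \
          --auto-scaling-group-name my-enterprise-asg \
          --default-instance-warmup 60

      # Manually raise the desired capacity to pull instances from the pool
      aws autoscaling set-desired-capacity \
          --auto-scaling-group-name my-enterprise-asg \
          --desired-capacity 4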

NCCGIA
New Contributor II

I've run a few tests today and haven't had much success in getting the machines that are put into service from the warm pool to join the site. When it seems the machines should be joining, the site goes down. Manually deleting the instances that were brought into service, and the warm pool itself, allows the site to become accessible again.
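For the record, the cleanup that made the site reachable again was roughly the following; the group name is a placeholder, and --force-delete also terminates the pooled instances:

      # Remove the warm pool and its stopped instances (group name is a placeholder)
      aws autoscaling delete-warm-pool \
          --auto-scaling-group-name my-enterprise-asg \
          --force-delete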

NCOneMap
New Contributor III

We've deployed a solution that shows promise in testing. Our time to add a new machine to the site has gone from ~20 min to ~4 min. Our process:

  1. Set the maximum number of instances for the auto-scaling group (e.g., 6)
  2. Create an AWS warm pool and set its maximum prepared capacity equal to the value entered in step #1
  3. Use the AWS CLI to set the "instance reuse" policy for the warm pool so that instances are returned to the pool on scale-in rather than terminated. Setting this policy is not supported in the console. (If using Windows, use AWS CloudShell; the command does not work from the CLI in a CMD window.) Example:

       aws autoscaling put-warm-pool \
           --auto-scaling-group-name agsSuper-ServerStack-AutoScalingGroup-1K7QS9RIGCV3K \
           --pool-state Stopped \
           --instance-reuse-policy '{"ReuseOnScaleIn": true}'

  4. Use the ArcGIS Server Admin REST API to change the "remove from site" setting so that machines are not removed (a sketch of this call follows the list)
  5. Confirm that all instances have joined the site
  6. Decrease the desired number of instances in the auto-scaling group to the number we want in use during normal traffic (e.g., 2)
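For step 4, a rough sketch of the call through the ArcGIS Server Administrator Directory is below, assuming the setting is exposed as a server property. The hostname and credentials are placeholders, <REMOVE_FROM_SITE_PROPERTY> stands in for the actual property name, and jq is assumed to be available for parsing the token response:

      # Fetch an admin token (hostname and credentials are placeholders)
      TOKEN=$(curl -sk "https://gisserver.example.com:6443/arcgis/admin/generateToken" \
          -d "username=siteadmin" -d "password=CHANGEME" \
          -d "client=requestip" -d "f=json" | jq -r '.token')

      # Update the server property that controls machine removal;
      # <REMOVE_FROM_SITE_PROPERTY> is a placeholder for the actual name
      curl -sk "https://gisserver.example.com:6443/arcgis/admin/system/properties/update" \
          -d "token=${TOKEN}" -d "f=json" \
          -d 'properties={"<REMOVE_FROM_SITE_PROPERTY>": false}'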

The unused instances are returned to the warm pool but still appear in ArcGIS Server Manager. When the group needs to scale out again, the machines simply change state (e.g., stopped to running) and are part of the AGS site as soon as AGS gets a heartbeat. This avoids the ~20 min spin-up time needed to create a new instance and run Chef and any "user data" scripts; because everything is "pre-configured," the machines are ready to go and just need to change state. It also avoids potential IP address issues in Enterprise when attempting to use auto-scaling and warm pools (more info).

The only potential issue we've identified with our approach is IP address conflicts if the combined number of active and warm pool instances is less than the maximum number of instances set for the auto-scaling group. As scaling events occur, there could be mismatches between the IP addresses AWS assigns to genuinely new instances (i.e., not coming from the warm pool) and what Enterprise thinks is in use. It's a bit convoluted to explain, but the takeaway is that by 1) configuring ArcGIS not to release the IP addresses in the site and 2) setting the warm pool max count equal to the auto-scaling group's max count, we mitigate that concern.
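A quick sanity check that the two counts match, reusing the group name from the example above:

      # The warm pool's max prepared capacity should equal the group's MaxSize
      aws autoscaling describe-warm-pool \
          --auto-scaling-group-name agsSuper-ServerStack-AutoScalingGroup-1K7QS9RIGCV3K \
          --query 'WarmPoolConfiguration.MaxGroupPreparedCapacity'

      aws autoscaling describe-auto-scaling-groups \
          --auto-scaling-group-names agsSuper-ServerStack-AutoScalingGroup-1K7QS9RIGCV3K \
          --query 'AutoScalingGroups[0].MaxSize'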

georisk
New Contributor

Hi,

We have a similar problem and are interested in your approach. Can you provide details on how you did step #4?

Thanks!

NitinChauhan
New Contributor

Hi Everyone.

Can somebody share how you have automated the site join for new EC2 instances launched as part of the ASG? I believe you have created a custom AMI built up to the point of joining the site. How is it taken forward after that?
