I've been chasing something weird in devstack: when creating hundreds of
instances in a single request, past some limit things blow up in an
unexpected way during scheduling and all of the instances are put into
ERROR state. Given the environment I was running in, this shouldn't
have been happening, and today we figured out what was
actually happening. To summarize, we retry scheduling requests on RPC
timeout, so you can have up to scheduler_max_attempts greenthreads
running concurrently, each trying to schedule 1000 instances, and melt
your scheduler.
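
To make the failure mode concrete, here is a rough sketch of the timing
(this is not Nova code). It assumes a deliberately simplified model: the
caller retries as soon as the RPC call times out, the scheduler keeps
working on a timed-out request anyway, and each attempt takes a fixed
amount of time no matter how loaded the scheduler is. The function name
and the numbers are made up for illustration.

```python
def peak_concurrent_attempts(work_time, rpc_timeout, max_attempts):
    """Peak number of overlapping scheduling jobs for a single request.

    Hypothetical model, not Nova's implementation:
    - work_time: seconds the scheduler needs to finish one attempt
    - rpc_timeout: seconds the caller waits before giving up and retrying
    - max_attempts: total attempts allowed (the retry limit)
    """
    # Work out when each attempt starts. A retry only happens if the
    # previous attempt did not answer within the RPC timeout.
    starts = []
    now = 0.0
    for _ in range(max_attempts):
        starts.append(now)
        if work_time <= rpc_timeout:
            break  # reply arrived in time, so no retry
        now += rpc_timeout

    # The scheduler never hears about the timeout, so every attempt
    # runs to completion. Sweep over start/end events to find the peak.
    events = []
    for start in starts:
        events.append((start, 1))
        events.append((start + work_time, -1))
    # At equal timestamps, process ends (-1) before starts (+1).
    events.sort()

    peak = running = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


if __name__ == "__main__":
    # Slow request: every attempt times out and the work piles up.
    print(peak_concurrent_attempts(work_time=150, rpc_timeout=60, max_attempts=3))
    # Fast request: one attempt, no retries.
    print(peak_concurrent_attempts(work_time=30, rpc_timeout=60, max_attempts=3))
```

With those illustrative numbers the slow request ends up with 3 attempts
in flight at once and the fast one with 1. In a real deployment the
overlap would presumably be worse than this model suggests, since each
extra attempt slows down the ones already running.
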
I've started a spec which goes into the details of the actual issue:
It also proposes a solution, but I don't feel it's the greatest
solution, so there are also some alternatives in there.
I'm really interested in operator feedback on this because I assume that
people are dealing with stuff like this in production already, and have
had to come up with ways to solve it.
OpenStack Development Mailing List (not for usage questions)