We're trying to figure out why, sometimes, rpc_loop takes over 10 seconds
to process an iteration when booting instances. So we deployed devstack on
a 8GB, 4vCPU VM and did some profiling on the following command:
nova boot --flavor m1.nano --image cirros-0.3.4-x86_64-uec --nic
net-name=private --min-count 8 instance
(network private has portsecurityenabled set to False to avoid the
overhead of setting up sgs)
Logs showed that sometimes, the network-vif-plugged event was sent by the
server ~12 seconds after the vif was detected by ovsdb monitor. Usually,
first and second events come faster while the rest take longer. Further
analysis showed that rpc_loop iterations take several seconds to complete
so, if the vif is detected while iteration X is running, it won't be
processed until iteration X+1.
As an example, I've attached a simplified sequence diagram  to show what
happened in a particular iteration of my debug (I have full logs and pstat
files for this session for those interested). In this example, iteration 76
is going to process two ports while some of the previous spawned machines
are being managed by libvirt and so on... At the beginning of iteration 76,
a new vif is detected by ovsdb monitor but it won't be processed until 12
seconds later in iteration 77.
Profiling files show that aggregated CPU time for neutron workers is 97
seconds, while CPU time for ovs agent is 2.1 seconds. Most of its time is
spent waiting for RPC so it looks like there's apparently some room for
optimization and multiprocessing here.
According to dstat.log, CPU is at ~90% and there's ~1GB of free RAM. I
can't tell whether the hypervisor was swapping or not since I didn't have
access to it.
----system---- ----total-cpu-usage---- ------memory-usage----- -net/total->
time |usr sys idl wai hiq siq| used buff cach free| recv send>
05-02 14:22:50| 89 11 0 0 0 0|5553M 0 1151M 1119M|1808B 1462B>
05-02 14:22:51| 90 10 0 0 0 0|5567M 0 1151M 1106M|1588B 836B>
05-02 14:22:52| 89 11 1 0 0 0|5581M 0 1151M 1092M|3233B 2346B>
05-02 14:22:53| 89 10 0 0 0 0|5598M 0 1151M 1075M|2676B 2038B>
05-02 14:22:54| 90 10 0 0 0 0|5600M 0 1151M 1073M| 20k 14k>
05-02 14:22:55| 90 9 0 0 0 0|5601M 0 1151M 1072M| 22k 16k>
Also, while having a look at server profiling, around the 33% of the time
was spent building SQL queries . Mike Bayer went through this and
suggested having a look at baked queries and also submitted a sketch of his
I wanted to share these findings with you (probably most of you knew but
I'm quite new to OpenStack so It's been a really nice exercise for me to
better understand how things work) and gather your feedback about how
things can be improved. Also, I'll be happy to share the results and
discuss further if you think it's worth during the PTG next week.
Thanks a lot for reading and apologies for such a long email!
OpenStack Development Mailing List (not for usage questions)