
[openstack-dev] realtime kvm cpu affinities


Hi,

We are using OpenStack for managing realtime guests. We modified
it and contributed to discussions on how to model the realtime
feature. More recent versions of OpenStack have support for realtime,
and there are a few proposals on how to improve that further.

But there is still no full answer on how to distribute threads across
host-cores. The vcpus are easy, but for the emulation and io-threads
there are multiple options. I would like to collect the constraints
from a qemu/kvm perspective first, and then possibly influence the
OpenStack development.

I will put the summary/questions first, the text below provides more
context to where the questions come from.
- How do you distribute your threads when reaching really low
cyclictest results in the guests? In [3] Rik talked about problems
like lock holder preemption, starvation etc., but not where/how to
schedule emulators and io
- Is it ok to put a vcpu and emulator thread on the same core as long as
the guest knows about it? Any funny behaving guest, not just Linux.
- Is it ok to make the emulators potentially slow by running them on
busy best-effort cores, or will they quickly be on the critical path
if you do more than just cyclictest? - our experience says we don't
need them reactive even with rt-networking involved

Our goal is to reach a high packing density of realtime VMs. Our
pragmatic first choice was to run all non-vcpu-threads on a shared set
of pcpus where we also run best-effort VMs and host load.
Now the OpenStack guys are not too happy with that because that is load
outside the assigned resources, which leads to quota and accounting
problems.

So the current OpenStack model is to run those threads next to one
or more vcpu-threads. [1] You will need to remember that the vcpus in
question should not be your rt-cpus in the guest. I.e. if vcpu0 shares
its pcpu with the hypervisor noise your preemptrt-guest would use
isolcpus=1.
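For a concrete picture, the [1] model maps to a libvirt cputune section roughly like this (the pcpu numbers are made up; vcpu0 shares its pcpu with the emulator thread, so the guest runs its rt load on vcpu1 only and boots with isolcpus=1):

```xml
<vcpu placement='static'>2</vcpu>
<cputune>
  <!-- vcpu0 absorbs the hypervisor noise; vcpu1 is the rt vcpu -->
  <vcpupin vcpu='0' cpuset='2'/>
  <vcpupin vcpu='1' cpuset='3'/>
  <!-- emulator (and io) threads pinned next to vcpu0 -->
  <emulatorpin cpuset='2'/>
</cputune>
```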

Is that kind of pcpu sharing really a good idea? I could imagine
things like smp housekeeping (cache invalidation etc.) eventually
causing vcpu1 to wait for the emulator stuck in IO.
Or maybe a busy polling vcpu0 starving its own emulator causing high
latency or even deadlocks.
Even if it happens to work for Linux guests it seems like a strong
assumption that an rt-guest that has noise cores can deal with even more
noise one scheduling level below.

More recent proposals [2] suggest a scheme where the emulator and io
threads are on a separate core. That sounds more reasonable /
conservative but dramatically increases the per VM cost. And the pcpus
hosting the hypervisor threads will probably be idle most of the time.
I guess in this context the most important question is whether qemu is
ever involved in "regular operation" if you avoid the obvious IO
problems on your critical path.

My guess is that just [1] has serious hidden latency problems and [2]
is taking it a step too far by wasting whole cores for idle emulators.
We would like to suggest some other way in between that is a little
easier on the core count. Our current solution seems to work fine but
has the mentioned quota problems.
With this mail I am hoping to collect some constraints to derive a
suggestion from. Or maybe collect some information that could be added
to the current blueprints as reasoning/documentation.

Sorry if you receive this mail a second time; I was not subscribed to
openstack-dev the first time.

best regards,
Henning

[1]
https://specs.openstack.org/openstack/nova-specs/specs/mitaka/implemented/libvirt-real-time.html
[2]
https://specs.openstack.org/openstack/nova-specs/specs/ocata/approved/libvirt-emulator-threads-policy.html
[3]
http://events.linuxfoundation.org/sites/events/files/slides/kvmforum2015-realtimekvm.pdf


OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
asked Jul 6, 2017 in openstack-dev by Henning_Schild

31 Responses


On 06/20/2017 01:48 AM, Henning Schild wrote:
Hi,

We are using OpenStack for managing realtime guests. We modified
it and contributed to discussions on how to model the realtime
feature. More recent versions of OpenStack have support for realtime,
and there are a few proposals on how to improve that further.

But there is still no full answer on how to distribute threads across
host-cores. The vcpus are easy, but for the emulation and io-threads
there are multiple options. I would like to collect the constraints
from a qemu/kvm perspective first, and then possibly influence the
OpenStack development.

I will put the summary/questions first, the text below provides more
context to where the questions come from.
- How do you distribute your threads when reaching really low
cyclictest results in the guests? In [3] Rik talked about problems
like lock holder preemption, starvation etc., but not where/how to
schedule emulators and io
- Is it ok to put a vcpu and emulator thread on the same core as long as
the guest knows about it? Any funny behaving guest, not just Linux.
- Is it ok to make the emulators potentially slow by running them on
busy best-effort cores, or will they quickly be on the critical path
if you do more than just cyclictest? - our experience says we don't
need them reactive even with rt-networking involved

Our goal is to reach a high packing density of realtime VMs. Our
pragmatic first choice was to run all non-vcpu-threads on a shared set
of pcpus where we also run best-effort VMs and host load.
Now the OpenStack guys are not too happy with that because that is load
outside the assigned resources, which leads to quota and accounting
problems.

If you wanted to go this route, you could just edit the "vcpu_pin_set" entry in
nova.conf on the compute nodes so that nova doesn't actually know about all of
the host vCPUs. Then you could run host load and emulator threads on the pCPUs
that nova doesn't know about, and there will be no quota/accounting issues in nova.
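As a sketch (the pCPU ranges are made up), that would look like this in nova.conf on an 8-pCPU compute node:

```ini
[DEFAULT]
# Only pCPUs 2-7 are handed to nova for instance vCPUs; host
# load and emulator/io threads can then live on pCPUs 0-1
# without showing up in nova's accounting.
vcpu_pin_set = 2-7
```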

Chris


responded Jun 20, 2017 by Chris_Friesen

Am Tue, 20 Jun 2017 10:41:44 -0600
schrieb Chris Friesen chris.friesen@windriver.com:

On 06/20/2017 01:48 AM, Henning Schild wrote:

Hi,

We are using OpenStack for managing realtime guests. We modified
it and contributed to discussions on how to model the realtime
feature. More recent versions of OpenStack have support for
realtime, and there are a few proposals on how to improve that
further.

But there is still no full answer on how to distribute threads
across host-cores. The vcpus are easy, but for the emulation and
io-threads there are multiple options. I would like to collect the
constraints from a qemu/kvm perspective first, and then possibly
influence the OpenStack development.

I will put the summary/questions first, the text below provides more
context to where the questions come from.
- How do you distribute your threads when reaching really low
cyclictest results in the guests? In [3] Rik talked about
problems like lock holder preemption, starvation etc., but not
where/how to schedule emulators and io
- Is it ok to put a vcpu and emulator thread on the same core as
long as the guest knows about it? Any funny behaving guest, not
just Linux.
- Is it ok to make the emulators potentially slow by running them on
busy best-effort cores, or will they quickly be on the critical
path if you do more than just cyclictest? - our experience says we
don't need them reactive even with rt-networking involved

Our goal is to reach a high packing density of realtime VMs. Our
pragmatic first choice was to run all non-vcpu-threads on a shared
set of pcpus where we also run best-effort VMs and host load.
Now the OpenStack guys are not too happy with that because that is
load outside the assigned resources, which leads to quota and
accounting problems.

If you wanted to go this route, you could just edit the
"vcpu_pin_set" entry in nova.conf on the compute nodes so that nova
doesn't actually know about all of the host vCPUs. Then you could
run host load and emulator threads on the pCPUs that nova doesn't
know about, and there will be no quota/accounting issues in nova.

Exactly that is the idea, but OpenStack currently does not allow it.
No thread will ever end up on a core outside vcpu_pin_set, and
emulator/io-threads are controlled by OpenStack/libvirt. And you need a
way to specify exactly which cores outside vcpu_pin_set are allowed for
breaking out of that set.
On our compute nodes we also have cores for host-realtime tasks i.e.
dpdk-based rt-networking.

Henning

Chris


responded Jun 21, 2017 by Henning_Schild

Am Tue, 20 Jun 2017 10:04:30 -0400
schrieb Luiz Capitulino lcapitulino@redhat.com:

On Tue, 20 Jun 2017 09:48:23 +0200
Henning Schild henning.schild@siemens.com wrote:

Hi,

We are using OpenStack for managing realtime guests. We modified
it and contributed to discussions on how to model the realtime
feature. More recent versions of OpenStack have support for
realtime, and there are a few proposals on how to improve that
further.

But there is still no full answer on how to distribute threads
across host-cores. The vcpus are easy, but for the emulation and
io-threads there are multiple options. I would like to collect the
constraints from a qemu/kvm perspective first, and then possibly
influence the OpenStack development.

I will put the summary/questions first, the text below provides more
context to where the questions come from.
- How do you distribute your threads when reaching really low
cyclictest results in the guests? In [3] Rik talked about problems
like lock holder preemption, starvation etc., but not where/how to
schedule emulators and io

We put emulator threads and io-threads in housekeeping cores in
the host. I think housekeeping cores is what you're calling
best-effort cores, those are non-isolated cores that will run host
load.

As expected, any best-effort/housekeeping core will do, but overlap with
the vcpu-cores is a bad idea.

- Is it ok to put a vcpu and emulator thread on the same core as
long as the guest knows about it? Any funny behaving guest, not
just Linux.

We can't do this for KVM-RT because we run all vcpu threads with
FIFO priority.

Same point as above, meaning the "hw:cpu_realtime_mask" approach is
wrong for realtime.

However, we have another project with DPDK whose goal is to achieve
zero-loss networking. The configuration required by this project is
very similar to the one required by KVM-RT. One difference though is
that we don't use RT and hence don't use FIFO priority.

In this project we've been running with the emulator thread and a
vcpu sharing the same core. As long as the guest housekeeping CPUs
are idle, we don't get any packet drops (most of the time, what
causes packet drops in this test-case would cause spikes in
cyclictest). However, we're seeing some packet drops for certain
guest workloads which we are still debugging.

Ok, but that seems to be a different scenario where
hw:cpu_policy=dedicated should be sufficient. However, if the placement
of the io and emulators has to be on a subset of the dedicated cpus,
something like hw:cpu_realtime_mask would be required.

- Is it ok to make the emulators potentially slow by running them on
busy best-effort cores, or will they quickly be on the critical
path if you do more than just cyclictest? - our experience says we
don't need them reactive even with rt-networking involved

I believe it is ok.

Ok.

Our goal is to reach a high packing density of realtime VMs. Our
pragmatic first choice was to run all non-vcpu-threads on a shared
set of pcpus where we also run best-effort VMs and host load.
Now the OpenStack guys are not too happy with that because that is
load outside the assigned resources, which leads to quota and
accounting problems.

So the current OpenStack model is to run those threads next to one
or more vcpu-threads. [1] You will need to remember that the vcpus
in question should not be your rt-cpus in the guest. I.e. if vcpu0
shares its pcpu with the hypervisor noise your preemptrt-guest
would use isolcpus=1.

Is that kind of sharing a pcpu really a good idea? I could imagine
things like smp housekeeping (cache invalidation etc.) to eventually
cause vcpu1 having to wait for the emulator stuck in IO.

Agreed. IIRC, in the beginning of KVM-RT we saw a problem where
running vcpu0 on a non-isolated core and without FIFO priority
caused spikes in vcpu1. I guess we debugged this down to vcpu1
waiting a few dozen microseconds for vcpu0 for some reason. Running
vcpu0 on an isolated core with FIFO priority fixed this (again, this
was years ago, I don't remember all the details).

Or maybe a busy polling vcpu0 starving its own emulator causing high
latency or even deadlocks.

This will probably happen if you run vcpu0 with FIFO priority.

Two more points that indicate that hw:cpu_realtime_mask (putting
emulators/io next to any vcpu) does not work for general rt.

Even if it happens to work for Linux guests it seems like a strong
assumption that an rt-guest that has noise cores can deal with even
more noise one scheduling level below.

More recent proposals [2] suggest a scheme where the emulator and io
threads are on a separate core. That sounds more reasonable /
conservative but dramatically increases the per VM cost. And the
pcpus hosting the hypervisor threads will probably be idle most of
the time.

I don't know how to solve this problem. Maybe if we dedicate only one
core for all emulator threads and io-threads of a VM would mitigate
this? Of course we'd have to test it to see if this doesn't give
spikes.

[2] suggests exactly that, but it is a waste of pcpus. Say a vcpu needs
1.0 cores and all other threads need 0.05 cores. The real need of a
1-core rt-vm would be 1.05; for two it would be 2.05.
With [1] we pack 2.05 onto 2 pcpus, which does not work. With [2] we
need 3 and waste 0.95.
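For what it's worth, the arithmetic above can be written down as a tiny sketch (the 1.0/0.05 figures are the made-up numbers from the text; nothing OpenStack-specific):

```python
# Made-up numbers from above: a vcpu needs 1.0 cores and all
# emulator/io threads of a VM need 0.05 cores combined.
VCPU, OTHER = 1.0, 0.05

def real_need(vcpus):
    # true resource need of an rt-VM with this many vcpus
    return vcpus * VCPU + OTHER

def pcpus_scheme1(vcpus):
    # [1]: emulator/io share a pcpu with a vcpu -> overcommitted
    return vcpus

def pcpus_scheme2(vcpus):
    # [2]: one extra, mostly idle, dedicated pcpu per VM
    return vcpus + 1

for n in (1, 2):
    print(n, round(real_need(n), 2), pcpus_scheme1(n), pcpus_scheme2(n))
```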

I guess in this context the most important question is whether qemu
is ever involved in "regular operation" if you avoid the obvious IO
problems on your critical path.

My guess is that just [1] has serious hidden latency problems and
[2] is taking it a step too far by wasting whole cores for idle
emulators. We would like to suggest some other way in between that
is a little easier on the core count. Our current solution seems to
work fine but has the mentioned quota problems.

What is your solution?

We have a kilo-based prototype that introduced emulator_pin_set in
nova.conf. All vcpu threads will be scheduled on vcpu_pin_set and
emulators and IO of all VMs will share emulator_pin_set.
vcpu_pin_set contains isolcpus from the host and emulator_pin_set
contains best-effort cores from the host.
That basically means you put all emulators and io of all VMs onto a set
of cores that the host potentially also uses for other stuff. Sticking
with the made-up numbers from above, all the 0.05s can share pcpus.
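In nova.conf terms the prototype looks roughly like this (the cpu ranges are made up, and emulator_pin_set is our prototype option, not an upstream one):

```ini
[DEFAULT]
# isolcpus cores of the host: rt vcpu threads only
vcpu_pin_set = 4-15
# best-effort host cores shared by the emulator and io threads
# of all VMs (option from our kilo-based prototype, not upstream)
emulator_pin_set = 2-3
```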

With the current implementation in mitaka (hw:cpu_realtime_mask) you
cannot have a single-core rt-vm because you cannot put 1.05 into 1
without overcommitting. You can put 2.05 into 2, but as you confirmed,
the overcommitted core could still slow down the truly exclusive one.
On a 4-core host you get a maximum of 1 rt-VM (2-3 cores).

With [2], which is not implemented yet, the overcommitting is avoided.
But now you waste a lot of pcpus: 1.05 becomes 2, 2.05 becomes 3.
On a 4-core host you get a maximum of 1 rt-VM (1-2 cores).

With our approach it might be hard to account for emulator and
io-threads because they share pcpus. But you do not run into
overcommitting and do not waste pcpus at the same time.
On a 4-core host you get a maximum of 3 rt-VMs (1 core each), or 1
rt-VM (2-3 cores).
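A minimal sketch of those 4-core host numbers, assuming one core is always reserved for host load (which, in our approach, doubles as the shared emulator/io pool):

```python
def max_vms_mitaka(host_cores, vm_cores=2):
    # [1] hw:cpu_realtime_mask: smallest rt-VM has 2 vcpus
    # (vcpu0 is the non-rt one that absorbs the noise)
    return (host_cores - 1) // vm_cores

def max_vms_emulator_core(host_cores, rt_vcpus=1):
    # [2]: rt vcpus plus one dedicated emulator core per VM
    return (host_cores - 1) // (rt_vcpus + 1)

def max_vms_shared_pool(host_cores, rt_vcpus=1):
    # ours: emulator/io of all VMs share the reserved core
    return (host_cores - 1) // rt_vcpus

print(max_vms_mitaka(4), max_vms_emulator_core(4), max_vms_shared_pool(4))
```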

Henning

With this mail i am hoping to collect some constraints to derive a
suggestion from. Or maybe collect some information that could be
added to the current blueprints as reasoning/documentation.

Sorry if you receive this mail a second time, i was not subscribed
to openstack-dev the first time.

best regards,
Henning

[1]
https://specs.openstack.org/openstack/nova-specs/specs/mitaka/implemented/libvirt-real-time.html
[2]
https://specs.openstack.org/openstack/nova-specs/specs/ocata/approved/libvirt-emulator-threads-policy.html
[3]
http://events.linuxfoundation.org/sites/events/files/slides/kvmforum2015-realtimekvm.pdf


responded Jun 21, 2017 by Henning_Schild

Am Wed, 21 Jun 2017 09:32:42 -0400
schrieb Luiz Capitulino lcapitulino@redhat.com:

On Wed, 21 Jun 2017 12:47:27 +0200
Henning Schild henning.schild@siemens.com wrote:

What is your solution?

We have a kilo-based prototype that introduced emulator_pin_set in
nova.conf. All vcpu threads will be scheduled on vcpu_pin_set and
emulators and IO of all VMs will share emulator_pin_set.
vcpu_pin_set contains isolcpus from the host and emulator_pin_set
contains best-effort cores from the host.

You lost me here a bit as I'm not familiar with OpenStack
configuration.

Does not matter, I guess you got the point, and some other people might
find that useful.

That basically means you put all emulators and io of all VMs onto a
set of cores that the host potentially also uses for other stuff.
Sticking with the made up numbers from above, all the 0.05s can
share pcpus.

So, this seems to be the way we use KVM-RT without OpenStack: emulator
threads and io threads run on the host housekeeping cores, where all
other host processes will run. IOW, you only reserve pcpus for vcpu
threads.

Thanks for the input. I think you confirmed that the current
implementation in openstack cannot work and that the new proposal and
our approach should work.
Now we will have to see how to proceed with that information in the
openstack community.

I can't comment on OpenStack accounting trade-off/implications of
doing this, but from KVM-RT perspective this is probably the best
solution. I say "probably" because so far we have only tested with
cyclictest and simple applications. I don't know if more complex
applications would have different needs wrt I/O threads for example.

We have a networking ping/pong cyclictest kind of thing and much more
complex setups. Emulators and IO are not on the critical path in our
examples.

PS: OpenStack devel list refuses emails from non-subscribers. I won't
subscribe for a one-time discussion, so my emails are not
reaching the list...

Yeah, I had the same problem, also with their gerrit. Let's just call it
Stack ... I kept all your text in my replies, and they end up on the
list.

Henning


responded Jun 21, 2017 by Henning_Schild

On 06/21/2017 02:42 AM, Henning Schild wrote:
Am Tue, 20 Jun 2017 10:41:44 -0600
schrieb Chris Friesen chris.friesen@windriver.com:

Our goal is to reach a high packing density of realtime VMs. Our
pragmatic first choice was to run all non-vcpu-threads on a shared
set of pcpus where we also run best-effort VMs and host load.
Now the OpenStack guys are not too happy with that because that is
load outside the assigned resources, which leads to quota and
accounting problems.

If you wanted to go this route, you could just edit the
"vcpu_pin_set" entry in nova.conf on the compute nodes so that nova
doesn't actually know about all of the host vCPUs. Then you could
run host load and emulator threads on the pCPUs that nova doesn't
know about, and there will be no quota/accounting issues in nova.

Exactly that is the idea, but OpenStack currently does not allow it.
No thread will ever end up on a core outside vcpu_pin_set, and
emulator/io-threads are controlled by OpenStack/libvirt.

Ah, right. This will isolate the host load from the guest load, but it will
leave the guest emulator work running on the same pCPUs as one or more vCPU threads.

Your emulator_pin_set idea is interesting...it might be worth proposing in nova.

Chris


responded Jun 21, 2017 by Chris_Friesen

Am Wed, 21 Jun 2017 10:04:52 -0600
schrieb Chris Friesen chris.friesen@windriver.com:

On 06/21/2017 09:45 AM, Chris Friesen wrote:

On 06/21/2017 02:42 AM, Henning Schild wrote:

Am Tue, 20 Jun 2017 10:41:44 -0600
schrieb Chris Friesen chris.friesen@windriver.com:

Our goal is to reach a high packing density of realtime VMs. Our
pragmatic first choice was to run all non-vcpu-threads on a
shared set of pcpus where we also run best-effort VMs and host
load. Now the OpenStack guys are not too happy with that because
that is load outside the assigned resources, which leads to
quota and accounting problems.

If you wanted to go this route, you could just edit the
"vcpu_pin_set" entry in nova.conf on the compute nodes so that
nova doesn't actually know about all of the host vCPUs. Then you
could run host load and emulator threads on the pCPUs that nova
doesn't know about, and there will be no quota/accounting issues
in nova.

Exactly that is the idea, but OpenStack currently does not allow
it. No thread will ever end up on a core outside
vcpu_pin_set, and emulator/io-threads are controlled by
OpenStack/libvirt.

Ah, right. This will isolate the host load from the guest load,
but it will leave the guest emulator work running on the same pCPUs
as one or more vCPU threads.

Your emulator_pin_set idea is interesting...it might be worth
proposing in nova.

Actually, based on [1] it appears they considered it and decided that
it didn't provide enough isolation between realtime VMs.

Hey Chris,

I guess you are talking about that section from [1]:

We could use a host level tunable to just reserve a set of host
pCPUs for running emulator threads globally, instead of trying to
account for it per instance. This would work in the simple case,
but when NUMA is used, it is highly desirable to have more fine
grained config to control emulator thread placement. When real-time
or dedicated CPUs are used, it will be critical to separate
emulator threads for different KVM instances.

I know it has been considered, but I would like to bring the topic up
again, because doing it that way allows for many more rt-VMs on a host,
and I am not sure I fully understood why the idea was discarded in the
end.

I do not really see the influence of NUMA here. Say the
emulator_pin_set is used only for realtime VMs; we know that the
emulators and IOs can be "slow", so crossing numa-nodes should not be an
issue. Or you could say the set needs to contain at least one core per
numa-node and schedule emulators next to their vcpus.

As we know from our setup, and as Luiz confirmed - it is not "critical
to separate emulator threads for different KVM instances".
They have to be separated from the vcpu-cores but not from each other.
At least not on the "cpuset" basis, maybe "blkio" and cgroups like that.

Henning

Chris

[1]
https://specs.openstack.org/openstack/nova-specs/specs/ocata/approved/libvirt-emulator-threads-policy.html


responded Jun 21, 2017 by Henning_Schild

On 06/21/2017 10:46 AM, Henning Schild wrote:
Am Wed, 21 Jun 2017 10:04:52 -0600
schrieb Chris Friesen chris.friesen@windriver.com:

I guess you are talking about that section from [1]:

We could use a host level tunable to just reserve a set of host
pCPUs for running emulator threads globally, instead of trying to
account for it per instance. This would work in the simple case,
but when NUMA is used, it is highly desirable to have more fine
grained config to control emulator thread placement. When real-time
or dedicated CPUs are used, it will be critical to separate
emulator threads for different KVM instances.

Yes, that's the relevant section.

I know it has been considered, but I would like to bring the topic up
again, because doing it that way allows for many more rt-VMs on a host,
and I am not sure I fully understood why the idea was discarded in the
end.

I do not really see the influence of NUMA here. Say the
emulator_pin_set is used only for realtime VMs; we know that the
emulators and IOs can be "slow", so crossing numa-nodes should not be an
issue. Or you could say the set needs to contain at least one core per
numa-node and schedule emulators next to their vcpus.

As we know from our setup, and as Luiz confirmed - it is not "critical
to separate emulator threads for different KVM instances".
They have to be separated from the vcpu-cores but not from each other.
At least not on the "cpuset" basis, maybe "blkio" and cgroups like that.

I'm reluctant to say conclusively that we don't need to separate emulator
threads since I don't think we've considered all the cases. For example, what
happens if one or more of the instances are being live-migrated? The migration
thread for those instances will be very busy scanning for dirty pages, which
could delay the emulator threads for other instances and also cause significant
cross-NUMA traffic unless we ensure at least one core per NUMA-node.

Also, I don't think we've determined how much CPU time is needed for the
emulator threads. If we have ~60 CPUs available for instances split across two
NUMA nodes, can we safely run the emulator threads of 30 instances all together
on a single CPU? If not, how much "emulator overcommit" is allowable?

Chris


responded Jun 21, 2017 by Chris_Friesen

Am Wed, 21 Jun 2017 11:40:14 -0600
schrieb Chris Friesen chris.friesen@windriver.com:

On 06/21/2017 10:46 AM, Henning Schild wrote:

Am Wed, 21 Jun 2017 10:04:52 -0600
schrieb Chris Friesen chris.friesen@windriver.com:

I guess you are talking about that section from [1]:

We could use a host level tunable to just reserve a set of host
pCPUs for running emulator threads globally, instead of trying to
account for it per instance. This would work in the simple case,
but when NUMA is used, it is highly desirable to have more fine
grained config to control emulator thread placement. When
real-time or dedicated CPUs are used, it will be critical to
separate emulator threads for different KVM instances.

Yes, that's the relevant section.

I know it has been considered, but I would like to bring the topic
up again, because doing it that way allows for many more rt-VMs on
a host, and I am not sure I fully understood why the idea was
discarded in the end.

I do not really see the influence of NUMA here. Say the
emulator_pin_set is used only for realtime VMs; we know that the
emulators and IOs can be "slow", so crossing numa-nodes should not
be an issue. Or you could say the set needs to contain at least one
core per numa-node and schedule emulators next to their vcpus.

As we know from our setup, and as Luiz confirmed - it is not
"critical to separate emulator threads for different KVM instances".
They have to be separated from the vcpu-cores but not from each
other. At least not on the "cpuset" basis, maybe "blkio" and
cgroups like that.

I'm reluctant to say conclusively that we don't need to separate
emulator threads since I don't think we've considered all the cases.
For example, what happens if one or more of the instances are being
live-migrated? The migration thread for those instances will be very
busy scanning for dirty pages, which could delay the emulator threads
for other instances and also cause significant cross-NUMA traffic
unless we ensure at least one core per NUMA-node.

Realtime instances cannot be live-migrated. We are talking about
threads that cannot even be moved between two cores on one numa-node
without missing a deadline. But your point is good because it could
mean that such an emulator_set - if defined - should not be used for all
VMs.

Also, I don't think we've determined how much CPU time is needed for
the emulator threads. If we have ~60 CPUs available for instances
split across two NUMA nodes, can we safely run the emulator threads
of 30 instances all together on a single CPU? If not, how much
"emulator overcommit" is allowable?

That depends on how much IO your VMs are issuing and cannot be
answered in general. All VMs can cause high load with IO/emulation;
rt-VMs are probably less likely to do so.
Say your 64-cpu compute-node would be used for both rt and regular. To
mix, you would have two instances of nova running on that machine. One
gets node0 (32 cpus) for regular VMs. The emulator-pin-set would not
be defined here (so it would equal the vcpu_pin_set, full overlap).
The other nova would get node1 and disable hyperthreads for all rt
cores (17 cpus left). It would need at least one core for housekeeping
and io/emulation threads. So you are down to a maximum of 15 VMs putting
their IO on that one core and its hyperthread, i.e. 7.5 per cpu.

In the same setup with [2] we would get a max of 7 single-cpu VMs,
instead of 15! And 15 vs 31 if you dedicate the whole box to rt.
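Those numbers can be reproduced with a small sketch (hyperthreading factor 2; one core, plus its hyperthread, kept for housekeeping and io):

```python
def rt_capacity(cpus, ht=2, dedicated_emulator=False):
    # hyperthreads are disabled on rt cores, so cpus -> cores
    cores = cpus // ht
    usable = cores - 1          # one core for housekeeping and io
    # scheme [2] needs an extra dedicated emulator core per VM
    return usable // (2 if dedicated_emulator else 1)

print(rt_capacity(32))                           # node1, shared io core: 15
print(rt_capacity(32, dedicated_emulator=True))  # same node with [2]: 7
print(rt_capacity(64))                           # whole box rt: 31
```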

Henning

Chris


responded Jun 22, 2017 by Henning_Schild

On 06/22/2017 01:47 AM, Henning Schild wrote:
Am Wed, 21 Jun 2017 11:40:14 -0600
schrieb Chris Friesen chris.friesen@windriver.com:

On 06/21/2017 10:46 AM, Henning Schild wrote:

As we know from our setup, and as Luiz confirmed - it is not
"critical to separate emulator threads for different KVM instances".
They have to be separated from the vcpu-cores but not from each
other. At least not on the "cpuset" basis, maybe "blkio" and
cgroups like that.

I'm reluctant to say conclusively that we don't need to separate
emulator threads since I don't think we've considered all the cases.
For example, what happens if one or more of the instances are being
live-migrated? The migration thread for those instances will be very
busy scanning for dirty pages, which could delay the emulator threads
for other instances and also cause significant cross-NUMA traffic
unless we ensure at least one core per NUMA-node.

Realtime instances cannot be live-migrated. We are talking about
threads that cannot even be moved between two cores on one numa-node
without missing a deadline. But your point is good because it could
mean that such an emulator_set - if defined - should not be used for all
VMs.

I'd suggest that realtime instances cannot be live-migrated while
meeting realtime commitments. There may be reasons to live-migrate
realtime instances that aren't currently providing service.

Also, I don't think we've determined how much CPU time is needed for
the emulator threads. If we have ~60 CPUs available for instances
split across two NUMA nodes, can we safely run the emulator threads
of 30 instances all together on a single CPU? If not, how much
"emulator overcommit" is allowable?

That depends on how much IO your VMs are issuing and can not be
answered in general. All VMs can cause high load with IO/emulation,
rt-VMs are probably less likely to do so.

I think the result of this is that in addition to "rt_emulator_pin_set"
you'd probably want a config option for "rt_emulator_overcommit_ratio"
or something similar.
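Such an overcommit knob would amount to a simple admission check. A minimal sketch, assuming a hypothetical `rt_emulator_overcommit_ratio` option (the name is only a suggestion from this thread, not an existing nova setting):

```python
def emulator_slots(emulator_cpus, overcommit_ratio):
    """How many VMs' emulator threads may share the given cpus under a
    configured overcommit ratio (hypothetical knob from this thread)."""
    return int(emulator_cpus * overcommit_ratio)

# The scenario questioned above: 30 instances' emulator threads on one
# CPU is effectively a ratio of 30; a cautious deployment might cap it.
print(emulator_slots(1, 30))   # 30
print(emulator_slots(2, 8))    # 16
```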

Chris


responded Jun 22, 2017 by Chris_Friesen

On Wed, Jun 21, 2017 at 12:47:27PM +0200, Henning Schild wrote:
Am Tue, 20 Jun 2017 10:04:30 -0400
schrieb Luiz Capitulino lcapitulino@redhat.com:

On Tue, 20 Jun 2017 09:48:23 +0200
Henning Schild henning.schild@siemens.com wrote:

Hi,

We are using OpenStack for managing realtime guests. We modified
it and contributed to discussions on how to model the realtime
feature. More recent versions of OpenStack have support for
realtime, and there are a few proposals on how to improve that
further.

But there is still no full answer on how to distribute threads
across host-cores. The vcpus are easy but for the emulation and
io-threads there are multiple options. I would like to collect the
constraints from a qemu/kvm perspective first, and then possibly
influence the OpenStack development.

I will put the summary/questions first, the text below provides more
context to where the questions come from.
- How do you distribute your threads when reaching the really low
cyclictest results in the guests? In [3] Rik talked about problems
like lock holder preemption, starvation etc. but not where/how to
schedule emulators and io

We put emulator threads and io-threads in housekeeping cores in
the host. I think housekeeping cores is what you're calling
best-effort cores, those are non-isolated cores that will run host
load.

As expected, any best-effort/housekeeping core will do but overlap with
the vcpu-cores is a bad idea.

- Is it ok to put a vcpu and emulator thread on the same core as
  long as the guest knows about it? Any funny behaving guest, not
  just Linux.

We can't do this for KVM-RT because we run all vcpu threads with
FIFO priority.

Same point as above, meaning the "hw:cpu_realtime_mask" approach is
wrong for realtime.

However, we have another project with DPDK whose goal is to achieve
zero-loss networking. The configuration required by this project is
very similar to the one required by KVM-RT. One difference though is
that we don't use RT and hence don't use FIFO priority.

In this project we've been running with the emulator thread and a
vcpu sharing the same core. As long as the guest housekeeping CPUs
are idle, we don't get any packet drops (most of the time, what
causes packet drops in this test-case would cause spikes in
cyclictest). However, we're seeing some packet drops for certain
guest workloads which we are still debugging.

Ok but that seems to be a different scenario where hw:cpu_policy=dedicated
should be sufficient. However, if the placement of the io and
emulators has to be on a subset of the dedicated cpus, something like
hw:cpu_realtime_mask would be required.

- Is it ok to make the emulators potentially slow by running them on
  busy best-effort cores, or will they quickly be on the critical
  path if you do more than just cyclictest? - our experience says we
  don't need them reactive even with rt-networking involved

I believe it is ok.

Ok.

Our goal is to reach a high packing density of realtime VMs. Our
pragmatic first choice was to run all non-vcpu-threads on a shared
set of pcpus where we also run best-effort VMs and host load.
Now the OpenStack guys are not too happy with that because that is
load outside the assigned resources, which leads to quota and
accounting problems.

So the current OpenStack model is to run those threads next to one
or more vcpu-threads. [1] You will need to remember that the vcpus
in question should not be your rt-cpus in the guest. I.e. if vcpu0
shares its pcpu with the hypervisor noise your preemptrt-guest
would use isolcpus=1.

Is that kind of sharing a pcpu really a good idea? I could imagine
things like smp housekeeping (cache invalidation etc.) to eventually
cause vcpu1 having to wait for the emulator stuck in IO.

Agreed. IIRC, in the beginning of KVM-RT we saw a problem where
running vcpu0 on a non-isolated core and without FIFO priority
caused spikes in vcpu1. I guess we debugged this down to vcpu1
waiting a few dozen microseconds for vcpu0 for some reason. Running
vcpu0 on an isolated core with FIFO priority fixed this (again, this
was years ago, I don't remember all the details).

Or maybe a busy polling vcpu0 starving its own emulator causing high
latency or even deadlocks.

This will probably happen if you run vcpu0 with FIFO priority.

Two more points that indicate that hw:cpu_realtime_mask (putting
emulators/io next to any vcpu) does not work for general rt.

Even if it happens to work for Linux guests it seems like a strong
assumption that an rt-guest that has noise cores can deal with even
more noise one scheduling level below.

More recent proposals [2] suggest a scheme where the emulator and io
threads are on a separate core. That sounds more reasonable /
conservative but dramatically increases the per VM cost. And the
pcpus hosting the hypervisor threads will probably be idle most of
the time.

I don't know how to solve this problem. Maybe if we dedicate only one
core for all emulator threads and io-threads of a VM would mitigate
this? Of course we'd have to test it to see if this doesn't give
spikes.

[2] suggests exactly that but it is a waste of pcpus. Say a vcpu needs
1.0 cores and all other threads need 0.05 cores. The real need of a
1-core rt-VM would be 1.05, for two it would be 2.05.
With [1] we pack 2.05 onto 2 pcpus, that does not work. With [2] we
need 3 and waste 0.95.

I guess in this context the most important question is whether qemu
is ever involved in "regular operation" if you avoid the obvious IO
problems on your critical path.

My guess is that just [1] has serious hidden latency problems and
[2] is taking it a step too far by wasting whole cores for idle
emulators. We would like to suggest some other way in between, that
is a little easier on the core count. Our current solution seems to
work fine but has the mentioned quota problems.

What is your solution?

We have a kilo-based prototype that introduced emulator_pin_set in
nova.conf. All vcpu threads will be scheduled on vcpu_pin_set and
emulators and IO of all VMs will share emulator_pin_set.
vcpu_pin_set contains isolcpus from the host and emulator_pin_set
contains best-effort cores from the host.
That basically means you put all emulators and io of all VMs onto a set
of cores that the host potentially also uses for other stuff. Sticking
with the made up numbers from above, all the 0.05s can share pcpus.
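Both sets are plain nova-style CPU range strings in that prototype. A minimal sketch of parsing them and checking the required disjointness (`parse_pin_set` is an illustrative helper, not actual nova code):

```python
def parse_pin_set(spec):
    """Parse a nova-style CPU set string such as "4-7,12-15" into a
    set of pcpu ids (illustrative helper, not nova's own parser)."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

vcpu_pin_set = parse_pin_set("4-7,12-15")   # isolated rt cores (isolcpus)
emulator_pin_set = parse_pin_set("0-3")     # best-effort host cores
# The scheme only makes sense if the two sets do not overlap:
assert vcpu_pin_set.isdisjoint(emulator_pin_set)
```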

With the current implementation in mitaka (hw:cpu_realtime_mask) you
can not have a single-core rt-VM because you can not put 1.05 into 1
without overcommitting. You can put 2.05 into 2, but as you confirmed
the overcommitted core could still slow down the truly exclusive one.
On a 4-core host you get a maximum of 1 rt-VM (2-3 cores).

With [2], which is not implemented yet, the overcommitting is avoided.
But now you waste a lot of pcpus: 1.05 needs 2, 2.05 needs 3.
On a 4-core host you get a maximum of 1 rt-VM (1-2 cores).

With our approach it might be hard to account for emulator and
io-threads because they share pcpus. But you do not run into
overcommitting and don't waste pcpus at the same time.
On a 4-core host you get a maximum of 3 rt-VMs (1 core each) or 1
rt-VM (2-3 cores).
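The 4-core comparison can be sketched numerically (assuming, as above, 1.0 cores per vcpu, 0.05 per emulator/io bundle, and one host core kept for housekeeping):

```python
import math

host_cores = 4
housekeeping = 1                   # host load / shared emulator core
usable = host_cores - housekeeping
vcpu_need, misc_need = 1.0, 0.05   # per-VM costs assumed in this thread

# [2]: each single-cpu VM needs ceil(1.05) = 2 dedicated cores -> 1 VM.
vms_scheme2 = usable // math.ceil(vcpu_need + misc_need)

# Shared emulator set: the 0.05s of all VMs ride on the housekeeping
# core, so a single-cpu VM costs exactly 1 rt core -> 3 VMs.
vms_shared = usable // int(vcpu_need)

print(vms_scheme2, vms_shared)   # 1 3
```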

I think your solution is good.

In Linux RT context, and as you mentioned, the non-RT vCPU can acquire
some guest kernel lock, then be pre-empted by emulator thread while
holding this lock. This situation blocks the RT vCPUs from doing their
work. So that is why we have implemented [2]. For DPDK I don't think
we have such problems because it's running in userland.

So for the DPDK context I think we could have a mask like we have for
RT, basically considering vCPU0 to handle best-effort work (emulator
threads, SSH, ...). I think it's the current pattern used by DPDK users.

For RT we have to isolate the emulator threads to an additional pCPU
per guest or, as you are suggesting, to a set of pCPUs shared by all
the running guests.

I think we should introduce a new option:

- hw:cpu_emulator_threads_mask=^1

If set in nova.conf, that mask will be applied to the set of all host
CPUs (vcpu_pin_set) to basically pack the emulator threads of all VMs
running there (useful for the RT context).

If set in the flavor extra-specs, it will be applied to the vCPUs
dedicated to the guest (useful for the DPDK context).
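The "^" exclusion syntax is the one nova already uses for hw:cpu_realtime_mask. A minimal sketch of how such a mask could be applied to a CPU set (`apply_mask` is an illustrative helper, not nova's actual parser):

```python
def apply_mask(cpu_ids, mask):
    """Apply a nova-style exclusion mask like "^1" or "^0-1" to a set
    of CPU ids (illustrative helper, not nova's real implementation)."""
    excluded = set()
    for part in mask.split(","):
        part = part.strip()
        if part.startswith("^"):
            body = part[1:]
            if "-" in body:
                lo, hi = body.split("-")
                excluded.update(range(int(lo), int(hi) + 1))
            else:
                excluded.add(int(body))
    return set(cpu_ids) - excluded

# In nova.conf the mask would be applied against the host's vcpu_pin_set;
# in flavor extra-specs, against the guest's own vcpu ids:
print(sorted(apply_mask(range(4), "^1")))   # [0, 2, 3]
```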

s.

Henning

With this mail I am hoping to collect some constraints to derive a
suggestion from. Or maybe collect some information that could be
added to the current blueprints as reasoning/documentation.

Sorry if you receive this mail a second time, I was not subscribed
to openstack-dev the first time.

best regards,
Henning

[1]
https://specs.openstack.org/openstack/nova-specs/specs/mitaka/implemented/libvirt-real-time.html
[2]
https://specs.openstack.org/openstack/nova-specs/specs/ocata/approved/libvirt-emulator-threads-policy.html
[3]
http://events.linuxfoundation.org/sites/events/files/slides/kvmforum2015-realtimekvm.pdf


responded Jun 23, 2017 by Sahid_Orentino_Ferdj