
[openstack-dev] [nova] Running large instances with CPU pinning and OOM

0 votes

Hello everyone,

We're experiencing issues with running large instances (~60GB RAM) on
fairly large NUMA hosts (4 CPUs, 256GB RAM, i.e. roughly 64GB per NUMA
node) while using CPU pinning. The problem is that in some extreme
cases qemu/KVM can have significant memory overhead (10-15%?) which
the nova-compute service doesn't take into account when launching VMs.
Using our configuration as an example: imagine running two VMs with
30GB RAM each on one NUMA node (because we use CPU pinning), therefore
using 60GB out of the 64GB of the given NUMA domain. When both VMs
consume their entire memory, the ~10% KVM overhead pushes the node
over its limit and the OOM killer takes action (despite there being
plenty of free RAM on other NUMA nodes). (The numbers are arbitrary;
the point is that nova-scheduler places the instance on the host
because the memory looks 'free enough', while the specific NUMA node
may lack any memory reserve.)
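
To make the arithmetic concrete, here is a minimal sketch of the
per-node accounting (the numbers are the illustrative ones above; the
overhead percentage is an assumption, not something nova reports):

    # Illustrative per-NUMA-node accounting with an assumed QEMU/KVM overhead.
    node_capacity_gb = 64     # RAM of one NUMA node
    guest_ram_gb = 30         # flavor memory per instance
    qemu_overhead = 0.10      # assumed ~10% per-instance overhead, unknown to nova
    guests_per_node = 2

    nova_view = guests_per_node * guest_ram_gb                          # 60 GB -> "fits"
    worst_case = guests_per_node * guest_ram_gb * (1 + qemu_overhead)   # 66 GB

    print("nova thinks: %d GB of %d GB used" % (nova_view, node_capacity_gb))
    print("worst case:  %.0f GB needed -> %s"
          % (worst_case, "OOM risk" if worst_case > node_capacity_gb else "ok"))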

Our initial solution was to use ram_allocation_ratio < 1 to ensure
having some reserved memory - this didn't work. Upon studying the nova
source, it turns out that ram_allocation_ratio is ignored when using
CPU pinning (see
https://github.com/openstack/nova/blob/mitaka-eol/nova/virt/hardware.py#L859
and
https://github.com/openstack/nova/blob/mitaka-eol/nova/virt/hardware.py#L821
). We're running Mitaka, but this piece of code is implemented in the
same way in Ocata.
We're considering creating a patch to take ram_allocation_ratio into
account.
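
For context, a heavily simplified paraphrase of the check behind those
links (not the actual nova code) looks roughly like this; the point is
that on the pinned/NUMA path no allocation ratio is consulted at all:

    # Sketch (paraphrased, not the real nova implementation) of the
    # per-NUMA-cell memory check used for pinned instances.
    def cell_fits(host_cell_mb, host_cell_used_mb, instance_cell_mb,
                  ram_allocation_ratio=1.0):
        # Non-pinned path elsewhere in nova would scale the capacity:
        #   limit = host_cell_mb * ram_allocation_ratio
        # Pinned/NUMA path: the raw node size is the limit, and
        # ram_allocation_ratio (even a value < 1.0 meant as a safety
        # margin) is never applied.
        limit = host_cell_mb
        return host_cell_used_mb + instance_cell_mb <= limit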

My question is - is ram_allocation_ratio ignored on purpose when using
CPU pinning? If yes, what is the reasoning behind it? And what would be
the right solution to ensure having reserved RAM on the NUMA nodes?

Thanks.

Regards,

Jakub Jursa


asked Sep 28, 2017 in openstack-dev by Jakub_Jursa

21 Responses

0 votes

On 09/27/2017 03:12 AM, Jakub Jursa wrote:

On 27.09.2017 10:40, Blair Bethwaite wrote:

On 27 September 2017 at 18:14, Stephen Finucane sfinucan@redhat.com wrote:

What you're probably looking for is the 'reserved_host_memory_mb'
option. This defaults to 512 (at least in the latest master), so if you
up this to 4192 or similar you should resolve the issue.

I don't see how this would help given the problem description -
reserved_host_memory_mb would only help avoid causing OOM when
launching the last guest that would otherwise fit on a host based on
Nova's simplified notion of memory capacity. It sounds like both CPU
and NUMA pinning are in play here, otherwise the host would have no
problem allocating RAM on a different NUMA node and OOM would be
avoided.

I'm not quite sure if/how OpenStack handles NUMA pinning (why is the
VM being killed by OOM rather than having memory allocated on a
different NUMA node). Anyway, good point, thank you, I should have a
look at the exact parameters passed to QEMU when using CPU pinning.

OpenStack uses strict memory pinning when using CPU pinning and/or memory
hugepages, so all allocations are supposed to be local. When it can't allocate
locally, it triggers OOM.
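
For anyone who wants to confirm what mode actually got applied to a
running instance, one option is to dump the domain XML through the
libvirt Python bindings and look at the <numatune> element (a minimal
sketch; the instance name below is a placeholder):

    # Sketch: inspect the NUMA memory mode libvirt applied to a guest.
    # Requires the libvirt-python bindings and access to the local hypervisor.
    import libvirt

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("instance-00000001")   # placeholder name
    xml = dom.XMLDesc(0)

    # Expect something like <memory mode='strict' nodeset='0'/> inside <numatune>.
    for line in xml.splitlines():
        if "numatune" in line or "memory mode" in line:
            print(line.strip())
    conn.close()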

Chris


responded Sep 27, 2017 by Chris_Friesen
0 votes

On 09/27/2017 08:01 AM, Blair Bethwaite wrote:
On 27 September 2017 at 23:19, Jakub Jursa jakub.jursa@chillisys.com wrote:

'hw:cpu_policy=dedicated' (while NOT setting 'hw:numa_nodes') results
in libvirt pinning CPU in 'strict' memory mode

(from libvirt xml for given instance)
...

So yeah, the instance is not able to allocate memory from another NUMA node.

I can't recall what the docs say on this but I wouldn't be surprised
if that was a bug. Though I do think most users would want CPU & NUMA
pinning together (you haven't shared your use case but perhaps you do
too?).

Not a bug. Once you enable CPU pinning we assume you care about performance,
and for max performance you need NUMA affinity as well. (And hugepages are
beneficial too.)

I'm not quite sure what you mean by 'memory will be locked for the
guest'. Also, aren't huge pages enabled in the kernel by default?

I think that suggestion was probably referring to static hugepages,
which can be reserved (per NUMA node) at boot and then (assuming your
host is configured correctly) QEMU will be able to back guest RAM with
them.

One nice thing about static hugepages is that you pre-allocate them at startup,
so you can decide on a per-NUMA-node basis how much 4K memory you want to leave
for incidental host stuff and qemu overhead. This lets you specify different
amounts of "host-reserved" memory on different NUMA nodes.

In order to use static hugepages for the guest you need to explicitly ask for a
page size of 2MB. (1GB is possible as well but in most cases doesn't buy you
much compared to 2MB.)
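
For reference, the flavor settings implied here would be something
along these lines (shown as the extra-specs dict set on the flavor;
the 2048 value is the page size in KiB, i.e. 2MB):

    # Flavor extra specs requesting CPU pinning plus 2MB hugepage-backed RAM.
    # hw:mem_page_size also accepts "small", "large" or "any".
    extra_specs = {
        "hw:cpu_policy": "dedicated",
        "hw:mem_page_size": "2048",
    }
    # e.g. applied with:
    #   openstack flavor set <flavor> --property hw:cpu_policy=dedicated \
    #                                 --property hw:mem_page_size=2048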

Lastly, qemu has overhead that varies depending on what you're doing in the
guest. In particular, there are various IO queues that can consume significant
amounts of memory. The company that I work for put in a good bit of effort
engineering things so that they work more reliably, and part of that was
determining how much memory to reserve for the host.
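
A rough way to see that overhead on a live host is to compare each
qemu process's resident set size against the guest RAM it was started
with (a minimal sketch that reads /proc directly; it only sees memory
the guest has actually touched, so treat the result as a lower bound):

    # Sketch: estimate per-guest QEMU overhead as process RSS minus the guest
    # RAM size passed on the qemu command line (-m <MiB>).
    import os, re

    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/cmdline" % pid, "rb") as f:
                argv = f.read().decode(errors="replace").split("\x00")
            if not argv or "qemu" not in argv[0]:
                continue
            guest_mib = int(argv[argv.index("-m") + 1].split(",")[0])
            with open("/proc/%s/status" % pid) as f:
                rss_kib = int(re.search(r"VmRSS:\s+(\d+)", f.read()).group(1))
            print("pid %s: guest %d MiB, RSS %d MiB, overhead ~%d MiB"
                  % (pid, guest_mib, rss_kib // 1024, rss_kib // 1024 - guest_mib))
        except (ValueError, OSError, AttributeError, IndexError):
            continue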

Chris


responded Sep 27, 2017 by Chris_Friesen
0 votes

Lastly, qemu has overhead that varies depending on what you're doing in the
guest. In particular, there are various IO queues that can consume
significant amounts of memory. The company that I work for put in a good
bit of effort engineering things so that they work more reliably, and part
of that was determining how much memory to reserve for the host.

Chris

Hi, I work with Jakub (the OP of this thread) and here are my two
cents: I think what is critical to realize is that KVM virtual
machines can have substantial memory overhead, up to 25% on top of
the memory allocated to the virtual machine itself. This overhead
memory is not considered in the nova code when calculating whether
the instance being provisioned actually fits into the host's
available resources (only the memory configured in the instance's
flavor is considered). This is especially a problem when CPU pinning
is used, as memory allocation is then bounded by the limits of a
specific NUMA node (due to the strict memory allocation mode). This
renders the global reservation parameter reserved_host_memory_mb
useless, as it doesn't take NUMA into account.

This KVM virtual machine overhead is what is causing the OOMs in our
infrastructure and that's what we need to fix.

Regards,
Prema


responded Sep 27, 2017 by Premysl_Kouril
0 votes

On 09/27/2017 03:10 PM, Premysl Kouril wrote:

Lastly, qemu has overhead that varies depending on what you're doing in the
guest. In particular, there are various IO queues that can consume
significant amounts of memory. The company that I work for put in a good
bit of effort engineering things so that they work more reliably, and part
of that was determining how much memory to reserve for the host.

Chris

Hi, I work with Jakub (the OP of this thread) and here are my two
cents: I think what is critical to realize is that KVM virtual
machines can have substantial memory overhead, up to 25% on top of
the memory allocated to the virtual machine itself. This overhead
memory is not considered in the nova code when calculating whether
the instance being provisioned actually fits into the host's
available resources (only the memory configured in the instance's
flavor is considered). This is especially a problem when CPU pinning
is used, as memory allocation is then bounded by the limits of a
specific NUMA node (due to the strict memory allocation mode). This
renders the global reservation parameter reserved_host_memory_mb
useless, as it doesn't take NUMA into account.

This KVM virtual machine overhead is what is causing the OOMs in our
infrastructure and that's what we need to fix.

Feel free to report a bug against nova... maybe reserved_host_memory_mb
should be a list of per-numa-node values.

It's a bit of a hack, but if you use hugepages for all the guests you can
control the amount of per-numa-node memory reserved for host overhead.

Since the KVM overhead memory is allocated from 4K pages (in my
experience), you can just choose to leave some memory on each host
NUMA node as 4K pages instead of allocating it as hugepages.
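
As a rough illustration of that hack, the per-node sizing could be
planned along these lines (a sketch; the headroom figure is an
assumption you would tune based on your own overhead measurements):

    # Sketch: how many 2MB hugepages to allocate on a NUMA node while leaving
    # some 4K-page memory as headroom for host services and QEMU overhead.
    PAGE_MB = 2

    def hugepages_for_node(node_total_mb, headroom_mb):
        # Pages to reserve, keeping headroom_mb available as 4K pages.
        return max(node_total_mb - headroom_mb, 0) // PAGE_MB

    # Example: 64GB node, keep ~6GB as 4K pages (assumed figure) -> 29696 pages,
    # which would then be written to
    # /sys/devices/system/node/node<N>/hugepages/hugepages-2048kB/nr_hugepages
    print(hugepages_for_node(node_total_mb=64 * 1024, headroom_mb=6 * 1024))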

Chris


responded Sep 27, 2017 by Chris_Friesen
0 votes

Hi Prema

On 28 September 2017 at 07:10, Premysl Kouril premysl.kouril@gmail.com wrote:
Hi, I work with Jakub (the OP of this thread) and here are my two
cents: I think what is critical to realize is that KVM virtual
machines can have substantial memory overhead, up to 25% on top of
the memory allocated to the virtual machine itself. This overhead
memory is not

I'm curious what sort of VM configuration causes such high overheads.
Is this when using highly tuned virt devices with very large buffers?

This KVM virtual machine overhead is what is causing the OOMs in our
infrastructure and that's what we need to fix.

If you are pinning multiple guests per NUMA node in a multi-NUMA node
system then you might also have issues with uneven distribution of
system overheads across nodes, depending on how close to the sun you
are flying.

--
Cheers,
~Blairo


responded Sep 27, 2017 by Blair_Bethwaite
0 votes

On 09/27/2017 04:55 PM, Blair Bethwaite wrote:
Hi Prema

On 28 September 2017 at 07:10, Premysl Kouril premysl.kouril@gmail.com wrote:

Hi, I work with Jakub (the OP of this thread) and here are my two
cents: I think what is critical to realize is that KVM virtual
machines can have substantial memory overhead, up to 25% on top of
the memory allocated to the virtual machine itself. This overhead
memory is not

I'm curious what sort of VM configuration causes such high overheads.
Is this when using highly tuned virt devices with very large buffers?

For what it's worth, we ran into issues a couple of years back with
I/O to RBD-backed disks in writethrough/writeback. There was a bug
that allowed a very large number of in-flight operations if the Ceph
server couldn't keep up with the aggregate load. We hacked a local
solution; I'm not sure if it's been dealt with upstream.

I think virtio networking has also caused issues, though not as bad. (But
noticeable when running close to the line.)

Chris


responded Sep 27, 2017 by Chris_Friesen
0 votes

On Wed, Sep 27, 2017 at 11:10:40PM +0200, Premysl Kouril wrote:

Lastly, qemu has overhead that varies depending on what you're doing in the
guest. In particular, there are various IO queues that can consume
significant amounts of memory. The company that I work for put in a good
bit of effort engineering things so that they work more reliably, and part
of that was determining how much memory to reserve for the host.

Chris

Hi, I work with Jakub (the OP of this thread) and here are my two
cents: I think what is critical to realize is that KVM virtual
machines can have substantial memory overhead, up to 25% on top of
the memory allocated to the virtual machine itself. This overhead
memory is not considered in the nova code when calculating whether
the instance being provisioned actually fits into the host's
available resources (only the memory configured in the instance's
flavor is considered). This is especially a problem when CPU pinning
is used, as memory allocation is then bounded by the limits of a
specific NUMA node (due to the strict memory allocation mode). This
renders the global reservation parameter reserved_host_memory_mb
useless, as it doesn't take NUMA into account.

Only the memory mapped for the guest is strictly allocated from the
selected NUMA node. The QEMU overhead should float across the host
NUMA nodes. So it seems that "reserved_host_memory_mb" is enough.

This KVM virtual machine overhead is what is causing the OOMs in our
infrastructure and that's what we need to fix.

Regards,
Prema


responded Sep 28, 2017 by Sahid_Orentino_Ferdj
0 votes

On 09/28/2017 05:29 AM, Sahid Orentino Ferdjaoui wrote:

Only the memory mapped for the guest is strictly allocated from the
selected NUMA node. The QEMU overhead should float across the host
NUMA nodes. So it seems that "reserved_host_memory_mb" is enough.

What I see in the code/docs doesn't match that, but it's entirely possible I'm
missing something.

nova uses LibvirtConfigGuestNUMATuneMemory with a mode of "strict" and a nodeset
of "the host NUMA nodes used by a guest".

For a guest with a single NUMA node, I think this would map to libvirt
XML something like a <numatune> element containing
<memory mode="strict" nodeset="0"/>.

The docs at https://libvirt.org/formatdomain.html#elementsNUMATuning say, "The
optional memory element specifies how to allocate memory for the domain process
on a NUMA host."

That seems to me that the qemu overhead would be NUMA-affined, no? (If you had
a multi-NUMA-node guest, then the qemu overhead would float across all the NUMA
nodes used by the guest.)

Chris


responded Sep 28, 2017 by Chris_Friesen
0 votes

Only the memory mapped for the guest is strictly allocated from the
selected NUMA node. The QEMU overhead should float across the host
NUMA nodes. So it seems that "reserved_host_memory_mb" is enough.

Even if that were true and the overhead memory could float across
NUMA nodes, it generally doesn't prevent us from running into OOM
trouble. No matter where (in which NUMA node) the overhead memory
gets allocated, it is not included in the available-memory
calculation for that NUMA node when provisioning a new instance, and
thus can cause OOM (once the guest operating system of the newly
provisioned instance actually starts allocating memory, which can
only come from its assigned NUMA node).

Prema


responded Sep 28, 2017 by Premysl_Kouril
0 votes

On Thu, 28 Sep 2017, Premysl Kouril wrote:

Only the memory mapped for the guest is strictly allocated from the
selected NUMA node. The QEMU overhead should float across the host
NUMA nodes. So it seems that "reserved_host_memory_mb" is enough.

Even if that were true and the overhead memory could float across
NUMA nodes, it generally doesn't prevent us from running into OOM
trouble. No matter where (in which NUMA node) the overhead memory
gets allocated, it is not included in the available-memory
calculation for that NUMA node when provisioning a new instance, and
thus can cause OOM (once the guest operating system of the newly
provisioned instance actually starts allocating memory, which can
only come from its assigned NUMA node).

Some of the discussion on this bug may be relevant:

https://bugs.launchpad.net/nova/+bug/1683858

--
Chris Dent (⊙_⊙') https://anticdent.org/
freenode: cdent    tw: @anticdent

responded Sep 28, 2017 by cdent_plus_os_at_ant
...