
[openstack-dev] [nova] Running large instances with CPU pinning and OOM


Hello everyone,

We're experiencing issues with running large instances (~60GB RAM) on
fairly large NUMA nodes (4 CPUs, 256GB RAM) while using CPU pinning. The
problem is that in some extreme cases qemu/KVM can have significant
memory overhead (10-15%?) which the nova-compute service doesn't take
into account when launching VMs. Using our configuration as an example:
imagine running two VMs with 30GB RAM each on one NUMA node (because we
use CPU pinning), therefore using 60GB out of the 64GB available in that
NUMA domain. When both VMs consume their entire memory (given a 10% KVM
overhead), the OOM killer takes action, despite there being plenty of
free RAM on the other NUMA nodes. (The numbers are just arbitrary; the
point is that nova-scheduler schedules the instance to run on the host
because the memory seems 'free enough', but the specific NUMA node can
be lacking the memory reserve.)
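To make the numbers concrete, here is a rough back-of-the-envelope
calculation (the 10% overhead is just our assumption, not a measured
QEMU figure):

    # Rough per-NUMA-node accounting (numbers are illustrative)
    node_total_mb = 64 * 1024          # one NUMA node of a 4-node / 256GB host
    guest_ram_mb = 30 * 1024           # RAM requested per instance
    qemu_overhead = 0.10               # assumed qemu/KVM overhead per guest

    # What the memory accounting sees: only the guest RAM counts
    accounted_mb = 2 * guest_ram_mb
    print(accounted_mb <= node_total_mb)   # True -> both instances land on the node

    # What the kernel sees once both guests touch all of their memory
    actual_mb = 2 * int(guest_ram_mb * (1 + qemu_overhead))
    print(actual_mb <= node_total_mb)      # False -> ~66GB > 64GB, OOM on this node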

Our initial solution was to use ram_allocation_ratio < 1 to ensure
having some reserved memory - this didn't work. Upon studying the nova
source code, it turns out that ram_allocation_ratio is ignored when
using CPU pinning. (see
https://github.com/openstack/nova/blob/mitaka-eol/nova/virt/hardware.py#L859
and
https://github.com/openstack/nova/blob/mitaka-eol/nova/virt/hardware.py#L821
). We're running Mitaka, but this piece of code is implemented in Ocata
in the same way.
We're considering creating a patch that takes ram_allocation_ratio
into account; a rough sketch of what we have in mind follows below.
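To illustrate the idea (this is only a simplified standalone sketch, not
the actual nova code and not a proposed diff), the memory fit check for
a pinned instance could apply the ratio to the NUMA cell capacity:

    def cell_has_enough_memory(cell_total_mb, cell_used_mb, requested_mb,
                               ram_allocation_ratio=1.0):
        """Simplified fit check for a NUMA cell.

        With ram_allocation_ratio < 1.0 part of the cell is held back as
        headroom for qemu/KVM overhead; the current pinned code path
        effectively behaves as if the ratio were 1.0.
        """
        usable_mb = cell_total_mb * ram_allocation_ratio
        return cell_used_mb + requested_mb <= usable_mb

    # ratio 1.0 admits a second 30GB guest onto a 64GB cell...
    print(cell_has_enough_memory(64 * 1024, 30 * 1024, 30 * 1024, 1.0))  # True
    # ...while ratio 0.9 keeps ~6.4GB back and rejects it
    print(cell_has_enough_memory(64 * 1024, 30 * 1024, 30 * 1024, 0.9))  # False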

My question is - is ram_allocation_ratio ignored on purpose when using
CPU pinning? If so, what is the reasoning behind it? And what would be
the right solution to ensure having reserved RAM on the NUMA nodes?

Thanks.

Regards,

Jakub Jursa


asked Sep 28, 2017 in openstack-dev by Jakub_Jursa

21 Responses


On Thu, Sep 28, 2017 at 11:10:38PM +0200, Premysl Kouril wrote:

Only the memory mapped for the guest is strictly allocated from the
NUMA node selected. The QEMU overhead should float on the host NUMA
nodes. So it seems that "reserved_host_memory_mb" is enough.

Even if that were true and the overhead memory could float across NUMA
nodes, it generally doesn't prevent us from running into OOM trouble.
No matter where (in which NUMA node) the overhead memory gets
allocated, it is not included in the available memory calculation for
that NUMA node when provisioning a new instance, and thus can cause OOM
(once the guest operating system of the newly provisioned instance
actually starts allocating memory, which can only be allocated from its
assigned NUMA node).

That is why you need to use Huge Pages. The memory will be reserved
and locked for the guest.
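For example, something along these lines (just a sketch; the flavor
name and sizes are made up, the option and extra specs are the standard
nova ones):

    # nova.conf on the compute node: keep some RAM back for the host/QEMU
    [DEFAULT]
    reserved_host_memory_mb = 4096

    # flavor backed by hugepages, so guest RAM is preallocated and locked
    $ openstack flavor set pinned.large --property hw:cpu_policy=dedicated \
        --property hw:mem_page_size=large

With hugepages preallocated per NUMA node on the host, the guest memory
is claimed up front when the instance starts, so it cannot push the
node into OOM later.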

Prema




responded Sep 29, 2017 by Sahid_Orentino_Ferdj
...