
[openstack-dev] [nova] Running large instances with CPU pinning and OOM


Hello everyone,

We're experiencing issues with running large instances (~60GB RAM) on
fairly large NUMA nodes (4 CPUs, 256GB RAM) while using cpu pinning. The
problem is that in some extreme cases qemu/KVM can have significant
memory overhead (10-15%?) which the nova-compute service doesn't take
into account when launching VMs. Using our configuration as an
example - imagine running two VMs with 30GB RAM each on one NUMA node
(because we use cpu pinning) - therefore using 60GB out of 64GB for the
given NUMA domain. When both VMs consume their entire memory
(given ~10% KVM overhead), the OOM killer takes action (despite there
being plenty of free RAM on other NUMA nodes). (The numbers are just
arbitrary; the point is that nova-scheduler schedules the instance to
run on the node because the memory seems 'free enough', but the specific
NUMA node can be lacking the memory reserve.)
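
To make the arithmetic explicit (with the illustrative numbers above):

  2 VMs x 30GB x ~1.10 qemu/KVM overhead ~= 66GB needed on that NUMA node
  vs. 64GB physically present on the node
  => the node comes up ~2GB short and the kernel OOM-kills a qemu process,
     even though the host as a whole still has plenty of free memory.

nova-scheduler only checks that 60GB <= 64GB for the node, so the
placement looks fine on paper.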

Our initial solution was to use ram_allocation_ratio < 1 to ensure
having some reserved memory - this didn't work. Upon studying the nova
source, it turns out that ram_allocation_ratio is ignored when using cpu
pinning (see
https://github.com/openstack/nova/blob/mitaka-eol/nova/virt/hardware.py#L859
and
https://github.com/openstack/nova/blob/mitaka-eol/nova/virt/hardware.py#L821
). We're running Mitaka, but this piece of code is implemented in the
same way in Ocata.
We're considering creating a patch to take ram_allocation_ratio into
account.
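
For reference, what we tried looks like this in nova.conf on the compute
nodes (the exact value is arbitrary and only for illustration):

  [DEFAULT]
  # leave ~10% of RAM unallocated to absorb qemu/KVM overhead
  ram_allocation_ratio = 0.9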

My question is - is ram_allocation_ratio ignored on purpose when using
cpu pinning? If yes, what is the reasoning behind it? And what would be
the right solution to ensure having reserved RAM on the NUMA nodes?

Thanks.

Regards,

Jakub Jursa


asked Sep 28, 2017 in openstack-dev by Jakub_Jursa

21 Responses


On Mon, 2017-09-25 at 17:36 +0200, Jakub Jursa wrote:
Hello everyone,

We're experiencing issues with running large instances (~60GB RAM) on
fairly large NUMA nodes (4 CPUs, 256GB RAM) while using cpu pinning. The
problem is that it seems that in some extreme cases qemu/KVM can have
significant memory overhead (10-15%?) which nova-compute service doesn't
take in to the account when launching VMs. Using our configuration as an
example - imagine running two VMs with 30GB RAM on one NUMA node
(because we use cpu pinning) - therefore using 60GB out of 64GB for
given NUMA domain. When both VMs would consume their entire memory
(given 10% KVM overhead) OOM killer takes an action (despite having
plenty of free RAM in other NUMA nodes). (the numbers are just
arbitrary, the point is that nova-scheduler schedules the instance to
run on the node because the memory seems 'free enough', but specific
NUMA node can be lacking the memory reserve).

Our initial solution was to use ram_allocation_ratio < 1 to ensure
having some reserved memory - this didn't work. Upon studying the nova
source, it turns out that ram_allocation_ratio is ignored when using cpu
pinning (see
https://github.com/openstack/nova/blob/mitaka-eol/nova/virt/hardware.py#L859
and
https://github.com/openstack/nova/blob/mitaka-eol/nova/virt/hardware.py#L821
). We're running Mitaka, but this piece of code is implemented in the
same way in Ocata.
We're considering creating a patch to take ram_allocation_ratio into
account.

My question is - is ram_allocation_ratio ignored on purpose when using
cpu pinning? If yes, what is the reasoning behind it? And what would be
the right solution to ensure having reserved RAM on the NUMA nodes?

Both 'ram_allocation_ratio' and 'cpu_allocation_ratio' are ignored when using
pinned CPUs because they don't make much sense: you want a high performance VM
and have assigned dedicated cores to the instance for this purpose, yet you're
telling nova to over-schedule and schedule multiple instances to some of these
same cores.

What you're probably looking for is the 'reserved_host_memory_mb' option. This
defaults to 512 (at least in the latest master) so if you up this to 4192 or
similar you should resolve the issue.
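
That is, something along these lines in nova.conf on the compute nodes
(the value is only an example):

  [DEFAULT]
  # memory kept back for the host OS and qemu overhead,
  # never handed out to guests
  reserved_host_memory_mb = 4096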

Hope this helps,
Stephen


responded Sep 27, 2017 by Stephen_Finucane

On 27 September 2017 at 18:14, Stephen Finucane sfinucan@redhat.com wrote:
What you're probably looking for is the 'reserved_host_memory_mb' option. This
defaults to 512 (at least in the latest master) so if you up this to 4192 or
similar you should resolve the issue.

I don't see how this would help given the problem description -
reserved_host_memory_mb would only help avoid causing OOM when
launching the last guest that would otherwise fit on a host based on
Nova's simplified notion of memory capacity. It sounds like both CPU
and NUMA pinning are in play here, otherwise the host would have no
problem allocating RAM on a different NUMA node and OOM would be
avoided.

Jakub, your numbers sound reasonable to me, i.e., using 60 out of 64GB
when only considering QEMU overhead - however I would expect that
might be a problem on NUMA node0, where there will be extra reserved
memory regions for the kernel and devices. In such a configuration, where
you are wanting to pin multiple guests into each of multiple NUMA
nodes, I think you may end up needing different flavor/instance-type
configs (using less RAM) for node0 versus the other NUMA nodes. I suggest
freshly booting one of your hypervisors and then, with no guests
running, taking a look at e.g. /proc/buddyinfo and /proc/zoneinfo to see
what memory is used/available and where.
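
For example (commands are only suggestions, untested here):

  # per-NUMA-node totals and free memory as the kernel sees them
  numactl --hardware

  # free pages per node/zone/order - shows fragmentation at a glance
  cat /proc/buddyinfo

  # detailed per-zone counters, including present vs. managed pages
  grep -E 'Node|present|managed|nr_free_pages' /proc/zoneinfo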

--
Cheers,
~Blairo


responded Sep 27, 2017 by Blair_Bethwaite

Also CC-ing os-ops as someone else may have encountered this before
and have further/better advice...

On 27 September 2017 at 18:40, Blair Bethwaite
blair.bethwaite@gmail.com wrote:
On 27 September 2017 at 18:14, Stephen Finucane sfinucan@redhat.com wrote:

What you're probably looking for is the 'reserved_host_memory_mb' option. This
defaults to 512 (at least in the latest master) so if you up this to 4192 or
similar you should resolve the issue.

I don't see how this would help given the problem description -
reserved_host_memory_mb would only help avoid causing OOM when
launching the last guest that would otherwise fit on a host based on
Nova's simplified notion of memory capacity. It sounds like both CPU
and NUMA pinning are in play here, otherwise the host would have no
problem allocating RAM on a different NUMA node and OOM would be
avoided.

Jakub, your numbers sound reasonable to me, i.e., use 60 out of 64GB
when only considering QEMU overhead - however I would expect that
might be a problem on NUMA node0 where there will be extra reserved
memory regions for kernel and devices. In such a configuration where
you are wanting to pin multiple guests into each of multiple NUMA
nodes I think you may end up needing different flavor/instance-type
configs (using less RAM) for node0 versus other NUMA nodes. Suggest
freshly booting one of your hypervisors and then with no guests
running take a look at e.g. /proc/buddyinfo/ and /proc/zoneinfo to see
what memory is used/available and where.

--
Cheers,
~Blairo

--
Cheers,
~Blairo


responded Sep 27, 2017 by Blair_Bethwaite

On 27.09.2017 10:14, Stephen Finucane wrote:
On Mon, 2017-09-25 at 17:36 +0200, Jakub Jursa wrote:

Hello everyone,

We're experiencing issues with running large instances (~60GB RAM) on
fairly large NUMA nodes (4 CPUs, 256GB RAM) while using cpu pinning. The
problem is that it seems that in some extreme cases qemu/KVM can have
significant memory overhead (10-15%?) which nova-compute service doesn't
take in to the account when launching VMs. Using our configuration as an
example - imagine running two VMs with 30GB RAM on one NUMA node
(because we use cpu pinning) - therefore using 60GB out of 64GB for
given NUMA domain. When both VMs would consume their entire memory
(given 10% KVM overhead) OOM killer takes an action (despite having
plenty of free RAM in other NUMA nodes). (the numbers are just
arbitrary, the point is that nova-scheduler schedules the instance to
run on the node because the memory seems 'free enough', but specific
NUMA node can be lacking the memory reserve).

Our initial solution was to use ram_allocation_ratio < 1 to ensure
having some reserved memory - this didn't work. Upon studying the nova
source, it turns out that ram_allocation_ratio is ignored when using cpu
pinning (see
https://github.com/openstack/nova/blob/mitaka-eol/nova/virt/hardware.py#L859
and
https://github.com/openstack/nova/blob/mitaka-eol/nova/virt/hardware.py#L821
). We're running Mitaka, but this piece of code is implemented in the
same way in Ocata.
We're considering creating a patch to take ram_allocation_ratio into
account.

My question is - is ram_allocation_ratio ignored on purpose when using
cpu pinning? If yes, what is the reasoning behind it? And what would be
the right solution to ensure having reserved RAM on the NUMA nodes?

Both 'ram_allocation_ratio' and 'cpu_allocation_ratio' are ignored when using
pinned CPUs because they don't make much sense: you want a high performance VM
and have assigned dedicated cores to the instance for this purpose, yet you're
telling nova to over-schedule and schedule multiple instances to some of these
same cores.

I wanted to use 'ram_allocation_ratio' with a value of, for example, 0.8 to
force 'under-scheduling' of the host, i.e. to create a reserve on the host.

What you're probably looking for is the 'reserved_host_memory_mb' option. This
defaults to 512 (at least in the latest master) so if you up this to 4192 or
similar you should resolve the issue.

I'm afraid that this won't help, as this option doesn't take NUMA nodes
into account (e.g. there would be 'reserved_host_memory_mb' of free
memory on the physical node, but not on all of its NUMA nodes).

Hope this helps,
Stephen

Regards,

Jakub


responded Sep 27, 2017 by Jakub_Jursa

On 27.09.2017 10:40, Blair Bethwaite wrote:
On 27 September 2017 at 18:14, Stephen Finucane sfinucan@redhat.com wrote:

What you're probably looking for is the 'reserved_host_memory_mb' option. This
defaults to 512 (at least in the latest master) so if you up this to 4192 or
similar you should resolve the issue.

I don't see how this would help given the problem description -
reserved_host_memory_mb would only help avoid causing OOM when
launching the last guest that would otherwise fit on a host based on
Nova's simplified notion of memory capacity. It sounds like both CPU
and NUMA pinning are in play here, otherwise the host would have no
problem allocating RAM on a different NUMA node and OOM would be
avoided.

I'm not quite sure if/how OpenStack handles NUMA pinning (i.e. why the VM
is being killed by the OOM killer rather than having memory allocated on a
different NUMA node). Anyway, good point, thank you, I should have a look
at the exact parameters passed to QEMU when using CPU pinning.

Jakub, your numbers sound reasonable to me, i.e., use 60 out of 64GB

Hm, but the question is, how to prevent some smaller instance
(e.g. 2GB RAM) from being scheduled on such a NUMA node?

when only considering QEMU overhead - however I would expect that
might be a problem on NUMA node0 where there will be extra reserved
memory regions for kernel and devices. In such a configuration where
you are wanting to pin multiple guests into each of multiple NUMA
nodes I think you may end up needing different flavor/instance-type
configs (using less RAM) for node0 versus other NUMA nodes. Suggest

What do you mean by using a different flavor? From what I understand (
http://specs.openstack.org/openstack/nova-specs/specs/juno/implemented/virt-driver-numa-placement.html
https://docs.openstack.org/nova/pike/admin/cpu-topologies.html ) a flavor
can specify that it 'wants' a different amount of memory from each of its
(virtual) NUMA nodes, but the vCPU <-> pCPU mapping is more or less
arbitrary (meaning that there is no way to specify, for NUMA node0 on a
physical host, that it has less memory available for VM allocation).
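
As far as I can tell the most a flavor can express is something like this
(hypothetical flavor name, sizes made up):

  openstack flavor set pinned.large \
    --property hw:cpu_policy=dedicated \
    --property hw:numa_nodes=2 \
    --property hw:numa_mem.0=28672 \
    --property hw:numa_mem.1=32768 \
    --property hw:numa_cpus.0=0-7 \
    --property hw:numa_cpus.1=8-15

which splits the guest's own RAM unevenly across its *virtual* NUMA nodes,
but says nothing about which *physical* NUMA node ends up hosting which
part.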

freshly booting one of your hypervisors and then with no guests
running take a look at e.g. /proc/buddyinfo/ and /proc/zoneinfo to see
what memory is used/available and where.

Thanks, I'll look into it.

Regards,

Jakub


responded Sep 27, 2017 by Jakub_Jursa

On 27.09.2017 11:12, Jakub Jursa wrote:

On 27.09.2017 10:40, Blair Bethwaite wrote:

On 27 September 2017 at 18:14, Stephen Finucane sfinucan@redhat.com wrote:

What you're probably looking for is the 'reserved_host_memory_mb' option. This
defaults to 512 (at least in the latest master) so if you up this to 4192 or
similar you should resolve the issue.

I don't see how this would help given the problem description -
reserved_host_memory_mb would only help avoid causing OOM when
launching the last guest that would otherwise fit on a host based on
Nova's simplified notion of memory capacity. It sounds like both CPU
and NUMA pinning are in play here, otherwise the host would have no
problem allocating RAM on a different NUMA node and OOM would be
avoided.

I'm not quite sure if/how OpenStack handles NUMA pinning (why is VM
being killed by OOM rather than having memory allocated on different
NUMA node). Anyway, good point, thank you, I should have a look at exact
parameters passed to QEMU when using CPU pinning.

Jakub, your numbers sound reasonable to me, i.e., use 60 out of 64GB

Hm, but the question is, how to prevent having some smaller instance
(e.g. 2GB RAM) scheduled on such NUMA node?

when only considering QEMU overhead - however I would expect that
might be a problem on NUMA node0 where there will be extra reserved
memory regions for kernel and devices. In such a configuration where
you are wanting to pin multiple guests into each of multiple NUMA
nodes I think you may end up needing different flavor/instance-type
configs (using less RAM) for node0 versus other NUMA nodes. Suggest

What do you mean using different flavor? From what I understand (
http://specs.openstack.org/openstack/nova-specs/specs/juno/implemented/virt-driver-numa-placement.html
https://docs.openstack.org/nova/pike/admin/cpu-topologies.html ) it can
be specified that flavor 'wants' different amount memory from its
(virtual) NUMA nodes, but mapping vCPU <-> pCPU is more or less
arbitrary (meaning that there is no way how to specify for NUMA node0 on
physical host that it has less memory available for VM allocation)

Can't the 'reserved_huge_pages' option be used to reserve memory on certain
NUMA nodes?
https://docs.openstack.org/ocata/config-reference/compute/config-options.html

freshly booting one of your hypervisors and then with no guests
running take a look at e.g. /proc/buddyinfo/ and /proc/zoneinfo to see
what memory is used/available and where.

Thanks, I'll look into it.

Regards,

Jakub


responded Sep 27, 2017 by Jakub_Jursa

On Wed, Sep 27, 2017 at 11:58 AM, Jakub Jursa
jakub.jursa@chillisys.com wrote:

On 27.09.2017 11:12, Jakub Jursa wrote:

On 27.09.2017 10:40, Blair Bethwaite wrote:

On 27 September 2017 at 18:14, Stephen Finucane
sfinucan@redhat.com wrote:

What you're probably looking for is the 'reserved_host_memory_mb' option.
This defaults to 512 (at least in the latest master) so if you up this to
4192 or similar you should resolve the issue.

I don't see how this would help given the problem description -
reserved_host_memory_mb would only help avoid causing OOM when
launching the last guest that would otherwise fit on a host based on
Nova's simplified notion of memory capacity. It sounds like both CPU
and NUMA pinning are in play here, otherwise the host would have no
problem allocating RAM on a different NUMA node and OOM would be
avoided.

I'm not quite sure if/how OpenStack handles NUMA pinning (why is VM
being killed by OOM rather than having memory allocated on different
NUMA node). Anyway, good point, thank you, I should have a look at
exact parameters passed to QEMU when using CPU pinning.

Jakub, your numbers sound reasonable to me, i.e., use 60 out of 64GB

Hm, but the question is, how to prevent having some smaller instance
(e.g. 2GB RAM) scheduled on such NUMA node?

when only considering QEMU overhead - however I would expect that
might be a problem on NUMA node0 where there will be extra reserved
memory regions for kernel and devices. In such a configuration where
you are wanting to pin multiple guests into each of multiple NUMA
nodes I think you may end up needing different flavor/instance-type
configs (using less RAM) for node0 versus other NUMA nodes. Suggest

What do you mean using different flavor? From what I understand (
http://specs.openstack.org/openstack/nova-specs/specs/juno/implemented/virt-driver-numa-placement.html
https://docs.openstack.org/nova/pike/admin/cpu-topologies.html ) it can
be specified that flavor 'wants' different amount memory from its
(virtual) NUMA nodes, but mapping vCPU <-> pCPU is more or less
arbitrary (meaning that there is no way how to specify for NUMA node0 on
physical host that it has less memory available for VM allocation)

Can't the 'reserved_huge_pages' option be used to reserve memory on
certain NUMA nodes?
https://docs.openstack.org/ocata/config-reference/compute/config-options.html

I think the qemu memory overhead is allocated from the 4k memory pool,
so the question is whether it is possible to reserve 4k pages with the
reserved_huge_pages config option. I don't find any restriction in the
code base on 4k pages (even though a 4k page is not considered a large
page by definition), so in theory you can do it. However, this also means
you have to enable the NUMATopologyFilter.
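
Untested, but based on the config reference the syntax would be something
along these lines (the counts are made up):

  [DEFAULT]
  # reserve 4GB worth of 4k pages per NUMA node for host/qemu overhead
  reserved_huge_pages = node:0,size:4,count:1048576
  reserved_huge_pages = node:1,size:4,count:1048576

and NUMATopologyFilter has to be added to the scheduler filters
(scheduler_default_filters in Mitaka, enabled_filters in newer releases).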

Cheers,
gibi

freshly booting one of your hypervisors and then with no guests
running take a look at e.g. /proc/buddyinfo and /proc/zoneinfo to see
what memory is used/available and where.

Thanks, I'll look into it.

Regards,

Jakub


responded Sep 27, 2017 by Balázs Gibizer

On Mon, Sep 25, 2017 at 05:36:44PM +0200, Jakub Jursa wrote:
Hello everyone,

We're experiencing issues with running large instances (~60GB RAM) on
fairly large NUMA nodes (4 CPUs, 256GB RAM) while using cpu pinning. The
problem is that it seems that in some extreme cases qemu/KVM can have
significant memory overhead (10-15%?) which nova-compute service doesn't
take in to the account when launching VMs. Using our configuration as an
example - imagine running two VMs with 30GB RAM on one NUMA node
(because we use cpu pinning) - therefore using 60GB out of 64GB for
given NUMA domain. When both VMs would consume their entire memory
(given 10% KVM overhead) OOM killer takes an action (despite having
plenty of free RAM in other NUMA nodes). (the numbers are just
arbitrary, the point is that nova-scheduler schedules the instance to
run on the node because the memory seems 'free enough', but specific
NUMA node can be lacking the memory reserve).

In Nova, when using NUMA we do pin the memory to the host NUMA nodes
selected during scheduling. In your case it seems that you have
specifically requested a guest with 1 NUMA node. It will not be possible
for the process to grab memory on another host NUMA node, but some
other processes could be running on that host NUMA node and consuming
memory.

What you need is to use huge pages; in that case the memory will be
locked for the guest.
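
That is, back the guest RAM with hugepages via the flavor, e.g. (flavor
name hypothetical):

  openstack flavor set pinned.large --property hw:mem_page_size=large

('large' picks the biggest page size available on the host; an explicit
size such as 1GB can also be requested, provided such pages have been
reserved on the host).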

Our initial solution was to use ram_allocation_ratio < 1 to ensure
having some reserved memory - this didn't work. Upon studying the nova
source, it turns out that ram_allocation_ratio is ignored when using cpu
pinning (see
https://github.com/openstack/nova/blob/mitaka-eol/nova/virt/hardware.py#L859
and
https://github.com/openstack/nova/blob/mitaka-eol/nova/virt/hardware.py#L821
). We're running Mitaka, but this piece of code is implemented in the
same way in Ocata.
We're considering creating a patch to take ram_allocation_ratio into
account.

My question is - is ram_allocation_ratio ignored on purpose when using
cpu pinning? If yes, what is the reasoning behind it? And what would be
the right solution to ensure having reserved RAM on the NUMA nodes?

Thanks.

Regards,

Jakub Jursa


responded Sep 27, 2017 by Sahid_Orentino_Ferdjaoui

On 27.09.2017 14:46, Sahid Orentino Ferdjaoui wrote:
On Mon, Sep 25, 2017 at 05:36:44PM +0200, Jakub Jursa wrote:

Hello everyone,

We're experiencing issues with running large instances (~60GB RAM) on
fairly large NUMA nodes (4 CPUs, 256GB RAM) while using cpu pinning. The
problem is that it seems that in some extreme cases qemu/KVM can have
significant memory overhead (10-15%?) which nova-compute service doesn't
take in to the account when launching VMs. Using our configuration as an
example - imagine running two VMs with 30GB RAM on one NUMA node
(because we use cpu pinning) - therefore using 60GB out of 64GB for
given NUMA domain. When both VMs would consume their entire memory
(given 10% KVM overhead) OOM killer takes an action (despite having
plenty of free RAM in other NUMA nodes). (the numbers are just
arbitrary, the point is that nova-scheduler schedules the instance to
run on the node because the memory seems 'free enough', but specific
NUMA node can be lacking the memory reserve).

In Nova when using NUMA we do pin the memory on the host NUMA nodes
selected during scheduling. In your case it seems that you have
specificly requested a guest with 1 NUMA node. It will be not possible
for the process to grab memory on an other host NUMA node but some
other processes could be running in that host NUMA node and consume
memory.

Yes, that is very likely the case - that some other processes consume
the memory on the given NUMA node. It seems that setting the flavor
metadata 'hw:cpu_policy=dedicated' (while NOT setting 'hw:numa_nodes')
results in libvirt pinning the CPUs and using 'strict' memory mode

(from the libvirt xml for the given instance)
[... XML snippet not preserved ...]

So yeah, the instance is not able to allocate memory from another NUMA node.
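
For reference, the relevant part of the domain XML for such a pinned guest
looks roughly like this (nodeset/cellid values are host-specific):

  <numatune>
    <memory mode='strict' nodeset='0'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
  </numatune>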

What you need is to use Huge Pages, in such case the memory will be
locked for the guest.

I'm not quite sure what you mean by 'memory will be locked for the
guest'. Also, aren't huge pages enabled in the kernel by default?

Our initial solution was to use ram_allocation_ratio < 1 to ensure
having some reserved memory - this didn't work. Upon studying the nova
source, it turns out that ram_allocation_ratio is ignored when using cpu
pinning (see
https://github.com/openstack/nova/blob/mitaka-eol/nova/virt/hardware.py#L859
and
https://github.com/openstack/nova/blob/mitaka-eol/nova/virt/hardware.py#L821
). We're running Mitaka, but this piece of code is implemented in the
same way in Ocata.
We're considering creating a patch to take ram_allocation_ratio into
account.

My question is - is ram_allocation_ratio ignored on purpose when using
cpu pinning? If yes, what is the reasoning behind it? And what would be
the right solution to ensure having reserved RAM on the NUMA nodes?

Thanks.

Regards,

Jakub Jursa


responded Sep 27, 2017 by Jakub_Jursa

On 27 September 2017 at 23:19, Jakub Jursa jakub.jursa@chillisys.com wrote:
'hw:cpu_policy=dedicated' (while NOT setting 'hw:numa_nodes') results in
libvirt pinning the CPUs and using 'strict' memory mode

(from the libvirt xml for the given instance)
[... XML snippet not preserved ...]

So yeah, the instance is not able to allocate memory from another NUMA node.

I can't recall what the docs say on this but I wouldn't be surprised
if that was a bug. Though I do think most users would want CPU & NUMA
pinning together (you haven't shared your use case but perhaps you do
too?).

I'm not quite sure what do you mean by 'memory will be locked for the
guest'. Also, aren't huge pages enabled in kernel by default?

I think that suggestion was probably referring to static hugepages,
which can be reserved (per NUMA node) at boot, and then (assuming your
host is configured correctly) QEMU will be able to back guest RAM with
them.

You are probably thinking of THP (transparent huge pages), which are
now on by default in Linux but can be somewhat hit & miss if you have
a long-running host where memory has become fragmented or the
pagecache is large. In our experience performance can be severely
degraded by just missing hugepage backing for a small fraction of guest
memory, and we have noticed behaviour from memory management where THP
allocations fail when the pagecache is highly utilised despite none of
it being dirty (so it should be able to be dropped immediately).
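
For completeness: static hugepages can be reserved either on the kernel
command line at boot, e.g.:

  default_hugepagesz=1G hugepagesz=1G hugepages=56

or at runtime per NUMA node via sysfs (works best before memory gets
fragmented; the counts here are examples only):

  echo 28 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
  echo 28 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

and then the flavor needs hw:mem_page_size set so nova/QEMU actually
backs the guest with them.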

--
Cheers,
~Blairo


responded Sep 27, 2017 by Blair_Bethwaite
...