
[Openstack] QEMU/KVM crash when mixing cpu_policy:dedicated and non-dedicated flavors?

0 votes

Hi
I just noticed a strange (?) issue when I tried to create an instance with
a flavor with hw:cpu_policy=dedicated. The instance failed with error:

Unable to read from monitor: Connection reset by peer', u'code': 500,
u'details': u'  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py",
line 1926, in _do_build_and_run_instance\n    filter_properties)
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py",
line 2116, in _build_and_run_instance\n    instance_uuid=instance.uuid,
reason=six.text_type(e))

And all other instances were shut down, even those running on compute
hosts other than the one the new instance was scheduled to. A quick search
suggests this could be due to the hypervisor crashing (though why would it
crash on unrelated compute hosts??).

The only odd thing I can think of is that the existing instances did
-not- use the dedicated CPU policy -- can problems like this occur when
attempting to mix dedicated and non-dedicated policies?
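
For reference, the two kinds of flavors being mixed would look roughly like
this (flavor names and sizing are illustrative; hw:cpu_policy is the actual
extra spec):

    openstack flavor create --vcpus 4 --ram 4096 --disk 20 m1.pinned
    openstack flavor set --property hw:cpu_policy=dedicated m1.pinned
    # A non-dedicated flavor simply omits the property (shared is the default):
    openstack flavor create --vcpus 4 --ram 4096 --disk 20 m1.shared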

This was with Mitaka.

/Tomas


asked Sep 18, 2017 in openstack by Tomas_Brännström (200 points)

2 Responses

0 votes

----- Original Message -----
From: "Tomas Brännström" tomas.a.brannstrom@tieto.com
To: openstack@lists.openstack.org
Sent: Friday, September 15, 2017 5:56:34 AM
Subject: [Openstack] QEMU/KVM crash when mixing cpu_policy:dedicated and non-dedicated flavors?

> Hi
> I just noticed a strange (?) issue when I tried to create an instance with
> a flavor with hw:cpu_policy=dedicated. The instance failed with error:
>
> Unable to read from monitor: Connection reset by peer', u'code': 500,
> u'details': u'  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py",
> line 1926, in _do_build_and_run_instance\n    filter_properties)
>   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py",
> line 2116, in _build_and_run_instance\n    instance_uuid=instance.uuid,
> reason=six.text_type(e))
>
> And all other instances were shut down, even those running on compute
> hosts other than the one the new instance was scheduled to. A quick search
> suggests this could be due to the hypervisor crashing (though why would it
> crash on unrelated compute hosts??).

Are there any more specific messages in the system logs or elsewhere? Check /var/log/libvirt/* in particular; though I suspect it will be the original source of the above message, it may have some additional useful information earlier.
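
For example, something along these lines (the instance log file name is
hypothetical; the paths are the usual libvirt defaults):

    # Daemon-level errors:
    grep -i error /var/log/libvirt/libvirtd.log
    # Per-instance QEMU log, named after the libvirt domain:
    less /var/log/libvirt/qemu/instance-00000a4f.log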

> The only odd thing I can think of is that the existing instances did
> -not- use the dedicated CPU policy -- can problems like this occur when
> attempting to mix dedicated and non-dedicated policies?

The main problem if you mix them on the same node is that Nova won't account properly for this when placing guests. The current design assumes that a node will be used either for "normal" instances (with CPU overcommit) or "dedicated" instances (no CPU overcommit, pinning), and that the two will be separated via the use of host aggregates and flavors. This in and of itself should not result in a QEMU crash, though it may eventually cause issues w.r.t. scheduling/placement decisions. If instances on other nodes went down at the same time I'd be looking for a broader issue: what is your storage and networking setup like?
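
A minimal sketch of that separation, assuming the
AggregateInstanceExtraSpecsFilter is enabled in the scheduler (aggregate,
host, and flavor names here are illustrative):

    # Group the hosts reserved for pinned guests into an aggregate:
    openstack aggregate create --property pinned=true pinned-hosts
    openstack aggregate add host pinned-hosts compute-1
    # Tie the dedicated flavor to that aggregate so the scheduler keeps
    # pinned and non-pinned guests apart:
    openstack flavor set \
        --property hw:cpu_policy=dedicated \
        --property aggregate_instance_extra_specs:pinned=true m1.pinned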

-Steve

> This was with Mitaka.
>
> /Tomas



--
Steve Gordon,
Principal Product Manager,
Red Hat OpenStack Platform

responded Sep 16, 2017 by Steve_Gordon (9,680 points)
0 votes

We use Fuel for deployment, with a fairly simple network configuration
(the controller and network node roles are co-located) and OpenDaylight as
the Neutron driver. However, we also have SR-IOV configured for some NICs,
and there might be something interesting here.

The instance was created with an SR-IOV port, and in the logs I see
"Assigning a pci device without numa affinity to instance
389109a4-540e-48d9-82b1-873b02cb4d31 which has numa topology". Shortly
after that, creation fails and the hypervisor seems to crash.

So today I tried to create an instance without SR-IOV but with
hw:cpu_policy=dedicated, and it worked fine. Then I did the same but added
an SR-IOV port, and I got the same crash (though not across all nodes this
time...)
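
For anyone trying to reproduce, the failing case amounts to roughly the
following with the Mitaka-era CLIs (network, flavor, and image names are
illustrative):

    # An SR-IOV port is a Neutron port with vnic_type=direct:
    neutron port-create sriov-net --name sriov-port --binding:vnic_type direct
    nova boot --flavor m1.pinned --image cirros \
        --nic port-id=<PORT_ID> sriov-test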

I assume we have some kind of misconfiguration somewhere, though the entire
hypervisor crashing doesn't seem correct either :-)

/Tomas

responded Sep 18, 2017 by Tomas_Brännström (200 points)