On 12/8/2014 3:12 PM, Jeremy Stanley wrote:
On 2014-12-08 11:45:36 +0100 (+0100), Kashyap Chamarthy wrote:
As Dan Berrangé noted, it's nearly impossible to reproduce this issue
independently outside of OpenStack Gating environment. I brought this up
at the recently concluded KVM Forum earlier this October. To debug this
any further, one of the QEMU block layer developers asked if we can get
the QEMU instance in a Gate run running under gdb (IIRC, danpb suggested
this too, previously) to get further tracing details.
We document thoroughly how to reproduce the environments we use for
testing OpenStack. There's nothing rarified about "a Gate run" that
anyone with access to a public cloud provider would be unable to
reproduce, save being able to run it over and over enough times to
expose less frequent failures.
FWIW, I myself couldn't reproduce it independently via libvirt
alone or via QMP (QEMU Machine Protocol) commands.
Dan's workaround ("enable it permanently, except for under the
gate") sounds sensible to me.
I'm dubious of this as it basically says "we know this breaks
sometimes, so we're going to stop testing that it works at all and
possibly let it get even more broken, but you should be safe to rely
on it anyway."
The QA team tries very hard to make our integration testing
environment mimic real-world deployment configurations as closely as
possible. If these sorts of bugs emerge more often because of,
for example, resource constraints in the test environment then it
should be entirely likely they'd also be seen in production with the
same frequency if run on similarly constrained equipment. And as
we've observed in the past, any code path we stop testing quickly
accumulates new bugs that go unnoticed until they impact someone's
production environment at 3am.
Bringing this back up since Jesse Keating was asking about this in IRC
again today. Sounds like we've heard from a few people who are running
this in labs without problems. Maybe they are patching libvirt/qemu, I
don't know. But we have other things that we know have broken parts, and
that's why they run on the experimental queue, e.g. cells, nova +
ceph/rbd. We also know we're a bit busted in the ec2 API right now with
the latest boto release (2.35.1), so we have a cap on that.
These issues are being worked on, but regarding this particular way that
we've disabled the function (with a version cap in the code), someone
has to go in and patch that out, which kind of sucks if they could have
just used a config option to enable it at their own risk.
That's why I'm proposing something like an [experimental] group. We
could put this into the [workarounds] group but this isn't really a
workaround for anything so that doesn't really make sense to me.
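To make the idea concrete, here is a rough sketch of an opt-in flag that overrides a hard-coded version cap. It uses Python's standard configparser rather than the project's real config machinery, and the option name enable_untested_feature, the MIN_VERSION_FOR_FEATURE cap and the feature_enabled helper are all made up for illustration. None of them are names from the actual code.

```python
import configparser
import logging

LOG = logging.getLogger(__name__)

# Hypothetical stand-in for a version cap hard-coded in the driver. It is
# set deliberately high so the feature is effectively disabled by default.
MIN_VERSION_FOR_FEATURE = (99, 0, 0)


def feature_enabled(conf_text, hypervisor_version):
    """Decide whether the capped feature may be used.

    conf_text is the raw text of a config file; hypervisor_version is a
    tuple such as (1, 2, 2).
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)

    # Assumed option: [experimental]/enable_untested_feature, default False.
    opt_in = parser.getboolean(
        "experimental", "enable_untested_feature", fallback=False)

    if opt_in:
        # The operator explicitly opted in, so warn loudly and allow it.
        LOG.warning("enable_untested_feature is set; this code path is "
                    "not currently tested in the gate.")
        return True

    # Without the opt-in, fall back to the version cap.
    return hypervisor_version >= MIN_VERSION_FOR_FEATURE
```

With an empty config the version cap still applies, so feature_enabled("", (1, 2, 2)) is False. A deployer who adds "enable_untested_feature = true" under [experimental] gets the feature, plus a warning in the log, without patching the code.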
I'd personally be OK with putting it into the [libvirt] group with a
warning in the config option help and code that this isn't currently
tested in the gate so we aren't sure it's going to work, which we've
done for cells and some of the virt drivers, e.g. libvirt on