[openstack-dev] [all][qa][glance] some recent tempest problems

This isn't a glance-specific problem, though we've encountered it quite
a few times recently.

Briefly, we're gating on Tempest jobs that tempest itself does not
gate on. This leads to a situation where new tests can be merged in
tempest, but wind up breaking our gate. We aren't claiming that the
added tests are bad or don't provide value; the problem is that we
have to drop everything and fix the gate. This interrupts our current
work and forces us to prioritize bugs to fix based not on what makes
the most sense for the project given current priorities and resources,
but based on whatever we can do to get the gates un-blocked.

As we said earlier, this situation seems to be impacting multiple projects.

One solution for this is to change our gating so that Glance
repositories do not run any Tempest jobs that Tempest itself does not
also gate on. That would in theory open a regression path, which is why
we haven't put up a patch yet. Another way this could be addressed is
by the Tempest team changing the non-voting jobs causing this
situation into voting jobs, which would prevent such changes from
being merged in the first place. The key issue here is that we need
to be able to prioritize bugs based on what's most important to each
project.

We want to be clear that we appreciate the work the Tempest team does.
We abhor bugs and want to squash them too. The problem is just that
we're stretched pretty thin with resources right now, and being forced
to prioritize bug fixes that will get our gate un-blocked is
interfering with our ability to work on issues that may have a higher
impact on end users.

The point of this email is to find out whether anyone has a better
suggestion for how to handle this situation.

Thanks!

Erno Kuvaja
Glance Release Czar

Brian Rosmaita
Glance PTL


asked Jun 26, 2017 in openstack-dev by rosmaita.fossdev_at_

20 Responses

On 06/16/2017 10:21 AM, Sean McGinnis wrote:

I don't think merging tests that are showing failures, then blacklisting
them, is the right approach. And as Eric points out, this isn't
necessarily just a failure with Ceph. There is a legitimate logical
issue with what this particular test is doing.

But in general, to get back to some of the earlier points, I don't think
we should be merging tests with known breakages until those breakages
can be first addressed.

As another example, this was the last round of this, in May:

https://review.openstack.org/#/c/332670/

which is a new tempest test for a Cinder API that is not supported by
all drivers. The Ceph job failed on the tempest patch, correctly, the
test was merged, then the Ceph jobs broke:

https://bugs.launchpad.net/glance/+bug/1687538
https://review.openstack.org/#/c/461625/

This is really not a sustainable model.

And this is the easy case, since Ceph jobs run in OpenStack infra and
are easily visible and trackable. I'm not sure what the impact is on
Cinder third-party CI for other drivers.


responded Jun 16, 2017 by Eric_Harney

On 06/16/2017 09:51 AM, Sean McGinnis wrote:

It would be useful to provide detailed examples. Everything is
trade-offs, and having the conversation in the abstract makes it very
difficult to understand those trade-offs.

-Sean

We've had this issue in Cinder and os-brick. Usually around Ceph, but if
you follow the user survey, that's the most popular backend.

The problem we see is that the tempest job covering this is non-voting.
And there have been several cases so far where this non-voting job does
not pass, due to a legitimate failure, but the tempest patch merges anyway.

To be fair, these failures usually do point out actual problems that need
to be fixed. Not always, but at least in a few cases. But instead of it
being addressed first to make sure there is no disruption, it's suddenly
a blocking issue that holds up everything until it's either reverted, skipped,
or the problem is resolved.

Here's one recent instance: https://review.openstack.org/#/c/471352/

So, before we go further, ceph seems to be -nv on all projects right
now, right? So I get there is some debate on that patch, but is it
blocking anything?

Again, we seem to be missing specifics and a concrete sequence of events
here; lacking that, everyone is left guessing what the problems are, which
I don't think is effective.

-Sean

--
Sean Dague
http://dague.net


responded Jun 16, 2017 by Sean_Dague

On 06/16/2017 10:46 AM, Eric Harney wrote:
On 06/16/2017 10:21 AM, Sean McGinnis wrote:

I don't think merging tests that are showing failures, then blacklisting
them, is the right approach. And as Eric points out, this isn't
necessarily just a failure with Ceph. There is a legitimate logical
issue with what this particular test is doing.

But in general, to get back to some of the earlier points, I don't think
we should be merging tests with known breakages until those breakages
can be first addressed.

As another example, this was the last round of this, in May:

https://review.openstack.org/#/c/332670/

which is a new tempest test for a Cinder API that is not supported by
all drivers. The Ceph job failed on the tempest patch, correctly, the
test was merged, then the Ceph jobs broke:

https://bugs.launchpad.net/glance/+bug/1687538
https://review.openstack.org/#/c/461625/

This is really not a sustainable model.

And this is the easy case, since Ceph jobs run in OpenStack infra and
are easily visible and trackable. I'm not sure what the impact is on
Cinder third-party CI for other drivers.

Ah, so the issue is that
gate-tempest-dsvm-full-ceph-plugin-src-glance_store-ubuntu-xenial is
still voting, because when the regex was made to stop ceph jobs from
voting (which they aren't on Nova, Tempest, Glance, or Cinder), it wasn't
applied there.

It's also a question of why a library is doing backend-specific testing
through full-stack testing, instead of through more targeted and
controlled tests, which I think is probably also less than ideal.

Both would be good things to fix.

-Sean

--
Sean Dague
http://dague.net


responded Jun 16, 2017 by Sean_Dague

So, before we go further, ceph seems to be -nv on all projects right
now, right? So I get there is some debate on that patch, but is it
blocking anything?

Ceph is voting on os-brick patches. So it does block some things when
we run into this situation.

But again, we should avoid getting into this situation in the first
place, voting or no.


responded Jun 16, 2017 by Sean_McGinnis

On 6/16/2017 3:32 PM, Sean McGinnis wrote:

So, before we go further, ceph seems to be -nv on all projects right
now, right? So I get there is some debate on that patch, but is it
blocking anything?

Ceph is voting on os-brick patches. So it does block some things when
we run into this situation.

But again, we should avoid getting into this situation in the first
place, voting or no.



Yeah, there is a distinction between the ceph nv job that runs on
nova/cinder/glance changes and the ceph job that runs on os-brick and
glance_store changes. When we made the tempest dsvm ceph job non-voting
we failed to mirror that in the os-brick/glance_store jobs. We should do
that.

--

Thanks,

Matt


responded Jun 17, 2017 by mriedemos_at_gmail.c

On 6/16/2017 8:13 PM, Matt Riedemann wrote:
Yeah, there is a distinction between the ceph nv job that runs on
nova/cinder/glance changes and the ceph job that runs on os-brick and
glance_store changes. When we made the tempest dsvm ceph job non-voting
we failed to mirror that in the os-brick/glance_store jobs. We should do
that.

Here you go:

https://review.openstack.org/#/c/475095/

--

Thanks,

Matt


responded Jun 17, 2017 by mriedemos_at_gmail.c

On 6/16/2017 9:46 AM, Eric Harney wrote:
On 06/16/2017 10:21 AM, Sean McGinnis wrote:

I don't think merging tests that are showing failures, then blacklisting
them, is the right approach. And as Eric points out, this isn't
necessarily just a failure with Ceph. There is a legitimate logical
issue with what this particular test is doing.

But in general, to get back to some of the earlier points, I don't think
we should be merging tests with known breakages until those breakages
can be first addressed.

As another example, this was the last round of this, in May:

https://review.openstack.org/#/c/332670/

which is a new tempest test for a Cinder API that is not supported by
all drivers. The Ceph job failed on the tempest patch, correctly, the
test was merged, then the Ceph jobs broke:

https://bugs.launchpad.net/glance/+bug/1687538
https://review.openstack.org/#/c/461625/

This is really not a sustainable model.

And this is the easy case, since Ceph jobs run in OpenStack infra and
are easily visible and trackable. I'm not sure what the impact is on
Cinder third-party CI for other drivers.



This is generally why we have config options in Tempest to not run tests
that certain backends don't implement, like all of the backup/snapshot
volume tests that the NFS job was failing on forever.

I think it's perfectly valid to have tests in Tempest for things that
not all backends implement as long as they are configurable. It's up to
the various CI jobs to configure Tempest properly for what they support
and then work on reducing the number of things they don't support. We've
been doing that for ages now.
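
For concreteness, the usual pattern looks roughly like this -- a sketch
only, not an existing test, with the class and option names as I remember
them from the volume tests (the backup flag is just one of the
volume-feature-enabled options):

    import testtools

    from tempest.api.volume import base
    from tempest import config

    CONF = config.CONF


    class VolumeBackupSketchTest(base.BaseVolumeTest):
        """Illustrative only: skips cleanly when the backend lacks backup."""

        @testtools.skipUnless(CONF.volume_feature_enabled.backup,
                              "Cinder backup feature is disabled")
        def test_backup_create_and_restore(self):
            # Create a volume, back it up, and restore it. Body elided;
            # the point is the skipUnless guard above.
            pass

A CI job whose backend doesn't support the feature sets backup = False in
the [volume-feature-enabled] section of its tempest.conf, and the test is
skipped there instead of failing the job.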

--

Thanks,

Matt


responded Jun 17, 2017 by mriedemos_at_gmail.c

On 6/16/2017 8:58 AM, Eric Harney wrote:
I'm not convinced yet that this failure is purely Ceph-specific, at a
quick look.

I think what happens here is, unshelve performs an asynchronous delete
of a glance image, and returns as successful before the delete has
necessarily completed. The check in tempest then sees that the image
still exists, and fails -- but this isn't valid, because the unshelve
API doesn't guarantee that this image is no longer there at the time it
returns. This would fail on any image delete that isn't instantaneous.

Is there a guarantee anywhere that the unshelve API behaves how this
tempest test expects it to?

There are no guarantees, no. The unshelve API reference is here [1]. The
asynchronous postconditions section just says:

"After you successfully shelve a server, its status changes to ACTIVE.
The server appears on the compute node.

The shelved image is deleted from the list of images returned by an API
call."

It doesn't say the image is deleted immediately, or that it waits for
the image to be gone before changing the instance status to ACTIVE.

I see there is also a typo in there, that should say after you
successfully unshelve a server.

From an API user point of view, this is all asynchronous because it's
an RPC cast from the nova-api service to the nova-conductor and finally
nova-compute service when unshelving the instance.
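
Roughly, the cast/call difference looks like this -- generic
oslo.messaging usage for illustration, not nova's actual rpcapi code:

    from oslo_config import cfg
    import oslo_messaging

    transport = oslo_messaging.get_rpc_transport(cfg.CONF)
    target = oslo_messaging.Target(topic='compute')
    client = oslo_messaging.RPCClient(transport, target)
    ctxt = {}  # stand-in for a real request context

    # cast() is fire-and-forget: it returns as soon as the message is
    # queued, so the caller (here, the API service) reports success
    # before the remote side has done any of the work.
    client.cast(ctxt, 'do_long_running_thing')

    # call() would instead block until the remote method returns.
    # result = client.call(ctxt, 'do_synchronous_thing')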

So I think the test is making some wrong assumptions on how fast the
image is going to be deleted when the instance is active.

As Ken'ichi pointed out in the Tempest change, Glance returns a 204 when
deleting an image in the v2 API [2]. If the image delete is asynchronous
then that should probably be a 202.

Either way the Tempest test should probably be in a wait loop for the
image to be gone if it's really going to assert this.

[1]
https://developer.openstack.org/api-ref/compute/?expanded=unshelve-restore-shelved-server-unshelve-action-detail#unshelve-restore-shelved-server-unshelve-action
[2]
https://developer.openstack.org/api-ref/image/v2/index.html?expanded=delete-an-image-detail#delete-an-image
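
To be concrete about the wait loop, something like this would do it -- a
rough sketch only; the helper is made up, and it assumes the test has the
v2 images client handy:

    import time

    from tempest.lib import exceptions as lib_exc


    def wait_for_image_deleted(images_client, image_id, timeout=60, interval=1):
        # Poll Glance until the image 404s instead of asserting right
        # after unshelve that it is already gone.
        start = time.time()
        while time.time() - start < timeout:
            try:
                images_client.show_image(image_id)
            except lib_exc.NotFound:
                return
            time.sleep(interval)
        raise lib_exc.TimeoutException(
            'Image %s was not deleted within %d seconds' % (image_id, timeout))

The test could then call this after the unshelve, rather than asserting
immediately that the image is gone.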

--

Thanks,

Matt


responded Jun 19, 2017 by mriedemos_at_gmail.c

On 06/19/2017 09:22 AM, Matt Riedemann wrote:
On 6/16/2017 8:58 AM, Eric Harney wrote:

I'm not convinced yet that this failure is purely Ceph-specific, at a
quick look.

I think what happens here is, unshelve performs an asynchronous delete
of a glance image, and returns as successful before the delete has
necessarily completed. The check in tempest then sees that the image
still exists, and fails -- but this isn't valid, because the unshelve
API doesn't guarantee that this image is no longer there at the time it
returns. This would fail on any image delete that isn't instantaneous.

Is there a guarantee anywhere that the unshelve API behaves how this
tempest test expects it to?

There are no guarantees, no. The unshelve API reference is here [1]. The
asynchronous postconditions section just says:

"After you successfully shelve a server, its status changes to ACTIVE.
The server appears on the compute node.

The shelved image is deleted from the list of images returned by an API
call."

It doesn't say the image is deleted immediately, or that it waits for
the image to be gone before changing the instance status to ACTIVE.

I see there is also a typo in there, that should say after you
successfully unshelve a server.

From an API user point of view, this is all asynchronous because it's an
RPC cast from the nova-api service to the nova-conductor and finally
nova-compute service when unshelving the instance.

So I think the test is making some wrong assumptions on how fast the
image is going to be deleted when the instance is active.

As Ken'ichi pointed out in the Tempest change, Glance returns a 204 when
deleting an image in the v2 API [2]. If the image delete is asynchronous
then that should probably be a 202.

Either way the Tempest test should probably be in a wait loop for the
image to be gone if it's really going to assert this.

Thanks for confirming this.

What do we need to do to get this fixed in Tempest? Nobody from Tempest
Core has responded to the revert patch [3] since this explanation was
posted.

IMO we should revert this for now and someone can implement a fixed
version if this test is needed.

[3] https://review.openstack.org/#/c/471352/

responded Jun 26, 2017 by Eric_Harney

On Mon, Jun 26, 2017 at 11:58 PM, Eric Harney eharney@redhat.com wrote:
On 06/19/2017 09:22 AM, Matt Riedemann wrote:

On 6/16/2017 8:58 AM, Eric Harney wrote:

I'm not convinced yet that this failure is purely Ceph-specific, at a
quick look.

I think what happens here is, unshelve performs an asynchronous delete
of a glance image, and returns as successful before the delete has
necessarily completed. The check in tempest then sees that the image
still exists, and fails -- but this isn't valid, because the unshelve
API doesn't guarantee that this image is no longer there at the time it
returns. This would fail on any image delete that isn't instantaneous.

Is there a guarantee anywhere that the unshelve API behaves how this
tempest test expects it to?

There are no guarantees, no. The unshelve API reference is here [1]. The
asynchronous postconditions section just says:

"After you successfully shelve a server, its status changes to ACTIVE.
The server appears on the compute node.

The shelved image is deleted from the list of images returned by an API
call."

It doesn't say the image is deleted immediately, or that it waits for
the image to be gone before changing the instance status to ACTIVE.

I see there is also a typo in there, that should say after you
successfully unshelve a server.

From an API user point of view, this is all asynchronous because it's an
RPC cast from the nova-api service to the nova-conductor and finally
nova-compute service when unshelving the instance.

So I think the test is making some wrong assumptions on how fast the
image is going to be deleted when the instance is active.

As Ken'ichi pointed out in the Tempest change, Glance returns a 204 when
deleting an image in the v2 API [2]. If the image delete is asynchronous
then that should probably be a 202.

Either way the Tempest test should probably be in a wait loop for the
image to be gone if it's really going to assert this.

Thanks for confirming this.

What do we need to do to get this fixed in Tempest? Nobody from Tempest
Core has responded to the revert patch [3] since this explanation was
posted.

IMO we should revert this for now and someone can implement a fixed
version if this test is needed.

Sorry for the delay. Let's fix this instead of reverting it:
https://review.openstack.org/#/c/477821/

-gmann

[3] https://review.openstack.org/#/c/471352/


responded Jun 27, 2017 by GHANSHYAM_MANN
...