
[openstack-dev] [all][qa][glance] some recent tempest problems


This isn't a glance-specific problem, though we've encountered it quite
a few times recently.

Briefly, we're gating on Tempest jobs that Tempest itself does not
gate on. This leads to a situation where new tests can be merged in
Tempest but wind up breaking our gate. We aren't claiming that the
added tests are bad or don't provide value; the problem is that we
have to drop everything and fix the gate. This interrupts our current
work and forces us to prioritize bug fixes not on what makes the most
sense for the project given current priorities and resources, but on
whatever we can do to get the gates unblocked.

As we said earlier, this situation seems to be impacting multiple projects.

One solution is to change our gating so that we do not run, against
Glance repositories, any Tempest jobs that Tempest itself does not also
gate on. That would in theory open a regression path, which is why we
haven't put up a patch yet. Another way to address this would be for
the Tempest team to change the non-voting jobs causing this situation
into voting jobs, which would prevent such changes from being merged in
the first place. The key issue is that we need to be able to prioritize
bugs based on what's most important to each project.

We want to be clear that we appreciate the work the Tempest team does.
We abhor bugs and want to squash them too. The problem is just that
we're stretched pretty thin with resources right now, and being forced
to prioritize the bug fixes that will get our gate unblocked is
interfering with our ability to work on issues that may have a higher
impact on end users.

The point of this email is to find out whether anyone has a better
suggestion for how to handle this situation.

Thanks!

Erno Kuvaja
Glance Release Czar

Brian Rosmaita
Glance PTL


asked Jun 26, 2017 in openstack-dev by rosmaita.fossdev_at_

20 Responses


Excerpts from Brian Rosmaita's message of 2017-06-15 13:04:39 -0400:


Asymmetric gating definitely has a way of introducing these problems.

Which jobs are involved?

Doug


responded Jun 15, 2017 by Doug_Hellmann

On 06/15/2017 01:04 PM, Brian Rosmaita wrote:

It would be useful to provide detailed examples. Everything is trade-offs,
and having the conversation in the abstract makes it very difficult to
understand those trade-offs.

-Sean

--
Sean Dague
http://dague.net


responded Jun 15, 2017 by Sean_Dague

https://review.openstack.org/#/c/471352/ may be an example



responded Jun 16, 2017 by zhu.fanglei_at_zte.c

On Fri, Jun 16, 2017 at 9:43 AM, zhu.fanglei@zte.com.cn wrote:
https://review.openstack.org/#/c/471352/ may be an example

If this is the Ceph-related case, I think we have already discussed this
kind of case, where functionality depends on the backend storage, and how
to handle the corresponding test failures [1].

The solution there was that the Ceph job should exclude, by regex, the
test cases whose functionality is not implemented/supported in Ceph. Jon
Bernard is working on this test blacklist [2].

If there is any other job or case, then we can discuss/think about having
that job run on the Tempest gate as well, which I think we do in most
cases.

And about making the Ceph job voting: I remember we did not do that due to
the stability of the job. The Ceph job fails frequently; once Jon's patches
merge and the job is consistently stable, we can make it voting.


[1] http://lists.openstack.org/pipermail/openstack-dev/2017-May/116172.html

[2] https://review.openstack.org/#/c/459774/ ,
https://review.openstack.org/#/c/459445/

-gmann


responded Jun 16, 2017 by GHANSHYAM_MANN

It would be useful to provide detailed examples. Everything is trade-offs,
and having the conversation in the abstract makes it very difficult to
understand those trade-offs.

-Sean

We've had this issue in Cinder and os-brick, usually around Ceph, which,
if you follow the user survey, is the most popular backend.

The problem we see is that the tempest job that covers this is non-voting,
and there have been several cases so far where this non-voting job does
not pass, due to a legitimate failure, but the tempest patch merges anyway.

To be fair, these failures usually do point out actual problems that need
to be fixed. Not always, but in at least a few cases. But instead of the
problem being addressed first to make sure there is no disruption, it
suddenly becomes a blocking issue that holds up everything until the test
is either reverted or skipped, or the problem is resolved.

Here's one recent instance: https://review.openstack.org/#/c/471352/

Sean


responded Jun 16, 2017 by Sean_McGinnis

On 06/16/2017 09:51 AM, Sean McGinnis wrote:


Sure, if ceph is the primary concern, that feels like it should be a
reasonably specific thing to fix. It's not a grand issue; it's a specific
mismatch about which configs should be common.

-Sean

--
Sean Dague
http://dague.net


responded Jun 16, 2017 by Sean_Dague

On 06/15/2017 10:51 PM, Ghanshyam Mann wrote:

I'm not convinced yet that this failure is purely Ceph-specific, at a
quick look.

I think what happens here is, unshelve performs an asynchronous delete
of a glance image, and returns as successful before the delete has
necessarily completed. The check in tempest then sees that the image
still exists, and fails -- but this isn't valid, because the unshelve
API doesn't guarantee that this image is no longer there at the time it
returns. This would fail on any image delete that isn't instantaneous.

Is there a guarantee anywhere that the unshelve API behaves how this
tempest test expects it to?
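If the intent is to verify that the image eventually goes away, the check
would need to poll for deletion rather than assert immediately after the
call returns. A minimal sketch of that kind of wait loop follows; the
image_exists() helper and the timeout values are hypothetical, not the
actual tempest code:

    # Minimal sketch: wait for an asynchronously-deleted image to disappear
    # instead of asserting right after unshelve returns.
    import time

    def wait_for_image_deleted(image_exists, image_id, timeout=60, interval=2):
        """Poll until image_exists(image_id) is False, or fail on timeout."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            if not image_exists(image_id):
                return
            time.sleep(interval)
        raise AssertionError(
            "Image %s still present after %s seconds; the delete is "
            "asynchronous, so an immediate existence check is not valid" %
            (image_id, timeout))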




responded Jun 16, 2017 by Eric_Harney

On Fri, Jun 16, 2017 at 10:57 PM, Sean Dague sean@dague.net wrote:

Sure, if ceph is the primary concern, that feels like it should be a
reasonably specific thing to fix. It's not a grand issue; it's a specific
mismatch about which configs should be common.

Yeah, we had such cases and decided to have a blacklist of tests not
suitable for Ceph; the Ceph job will exclude the tests failing on Ceph.
Jon is working on this - https://review.openstack.org/#/c/459774/

This approach solves the problem without limiting the scope of the tests. [1]

[1] http://lists.openstack.org/pipermail/openstack-dev/2017-May/116172.html

-gmann

responded Jun 16, 2017 by GHANSHYAM_MANN

Excerpts from Ghanshyam Mann's message of 2017-06-16 23:05:08 +0900:


Is ceph behaving in an unexpected way, or are the tests making
implicit assumptions that might also cause trouble for other backends
if these tests ever make it into the suite used by the interop team?

Doug


responded Jun 16, 2017 by Doug_Hellmann

Yeah, we had such cases and decided to have a blacklist of tests not
suitable for Ceph; the Ceph job will exclude the tests failing on Ceph.
Jon is working on this - https://review.openstack.org/#/c/459774/

I don't think merging tests that are showing failures, then blacklisting
them, is the right approach. And as Eric points out, this isn't
necessarily just a failure with Ceph. There is a legitimate logical
issue with what this particular test is doing.

But in general, to get back to some of the earlier points, I don't think
we should be merging tests with known breakages until those breakages
can be first addressed.


responded Jun 16, 2017 by Sean_McGinnis
...