
[openstack-dev] [heat][infra] Help needed! high gate failure rate

0 votes

Hi all,

We're facing a high failure rate in Heat's gates [1]: four of our gate jobs
have had failure rates ranging from 6% to nearly 20% over the last 14 days,
which leaves most of our patches stuck in the gate.

gate-heat-dsvm-functional-convg-mysql-lbaasv2-ubuntu-xenial(19.67%)
gate-heat-dsvm-functional-convg-mysql-lbaasv2-non-apache-ubuntu-xenial(9.09%)
gate-heat-dsvm-functional-orig-mysql-lbaasv2-ubuntu-xenial(8.47%)
gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial(6.00%)

We're still trying to find the cause, but (IMO) it seems like something might
be wrong with our infra. We need some help from the infra team: is there any
clue about this failure rate? Could some change in heat or infra have caused
this? Does this only happen in heat's gates? (We do see some failures from
other teams, but not as bad as heat's.)

Thanks for any kind of help

[1]
http://status.openstack.org/openstack-health/#/g/project/openstack~2Fheat?duration=P14D

--
May The Force of OpenStack Be With You,

Rico Lin
irc: ricolin


OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
asked Aug 10, 2017 in openstack-dev by rico.lin.guanyu_at_g (1,740 points)   2

6 Responses

0 votes

On 08/10/2017 06:18 PM, Rico Lin wrote:
We're facing a high failure rate in Heat's gates [1]: four of our gate jobs
have had failure rates ranging from 6% to nearly 20% over the last 14 days,
which leaves most of our patches stuck in the gate.

There has been a confluence of things causing some problems recently.
The loss of OSIC has distributed more load over everything else, and
we have seen an increase in job timeouts and intermittent networking
issues (especially if you're downloading large things from remote
sites). There have also been some issues with the mirror in rax-ord
[1].

gate-heat-dsvm-functional-convg-mysql-lbaasv2-ubuntu-xenial(19.67%)
gate-heat-dsvm-functional-convg-mysql-lbaasv2-non-apache-ubuntu-xenial(9.09%)
gate-heat-dsvm-functional-orig-mysql-lbaasv2-ubuntu-xenial(8.47%)
gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial(6.00%)

We're still trying to find the cause, but (IMO) it seems like something might
be wrong with our infra. We need some help from the infra team: is there any
clue about this failure rate?

The reality is you're just going to have to triage this and be a lot
more specific with issues. I find opening an etherpad and going
through the failures one-by-one helpful (e.g. I keep [2] for centos
jobs I'm interested in).

Looking at the top of the console.html log you'll find the host and
provider/region stamped in there. If it's timeouts, reporting to infra
the time, provider and region of failing jobs will help; if it's
network issues, similar details will help. Finding patterns is the
first step to understanding what needs fixing.
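
If it helps to script that first pass, here is a rough Python sketch (just an
illustration, not an infra tool) that pulls down the console.html of a few
failing jobs and greps out the node name (which embeds provider/region) plus a
build-timeout marker. Both regexes are assumptions about the log wording, so
adjust them to whatever your jobs actually print:

#!/usr/bin/env python3
# Rough triage helper (a sketch, not an infra tool): scan console.html logs
# for the node name and for a timeout marker. Both patterns below are
# assumptions about the log wording -- tweak them to match your jobs.
import re
import sys
from urllib.request import urlopen

# Assumed nodepool-style hostname, e.g. ubuntu-xenial-<provider>-<region>-<id>
NODE_RE = re.compile(r'ubuntu-xenial-[a-z0-9-]+-\d+')
# Assumed timeout wording; the exact phrase may differ per job
TIMEOUT_RE = re.compile(r'Build timed out|Killed\s+timeout', re.IGNORECASE)

def triage(console_url):
    text = urlopen(console_url).read().decode('utf-8', 'replace')
    node = NODE_RE.search(text)
    timed_out = TIMEOUT_RE.search(text)
    print(console_url)
    print('  node: %s' % (node.group(0) if node else 'not found'))
    print('  timed out: %s' % ('yes' if timed_out else 'no'))

if __name__ == '__main__':
    # Pass one or more console.html URLs collected from failing jobs.
    for url in sys.argv[1:]:
        triage(url)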

If it's due to issues with remote transfers, we can look at either
adding specific things to mirrors (containers, images and packages are
all things we've added recently) or adding a caching reverse-proxy for
them ([3], [4] are some examples).

Questions in #openstack-infra will usually get a helpful response too

Good luck :)

-i

[1] https://bugs.launchpad.net/openstack-gate/+bug/1708707/
[2] https://etherpad.openstack.org/p/centos7-dsvm-triage
[3] https://review.openstack.org/491800
[4] https://review.openstack.org/491466


responded Aug 10, 2017 by Ian_Wienand (3,620 points)   4 5
0 votes

On Thu, Aug 10, 2017 at 2:51 PM, Ian Wienand iwienand@redhat.com wrote:

The reality is you're just going to have to triage this and be a lot
more specific with issues.

One of the issues we've seen recently is that many jobs are killed midway
through the tests when the job times out (120 mins). It seems jobs are often
scheduled to very slow nodes, where setting up devstack takes more than 80
mins [1].

[1]
http://logs.openstack.org/49/492149/2/check/gate-heat-dsvm-functional-orig-mysql-lbaasv2-ubuntu-xenial/03b05dd/console.html#_2017-08-10_05_55_49_035693
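
To put a number on that for other failed runs, here is a rough Python sketch
that estimates the devstack wall-clock time by diffing the first and last
timestamps found in devstacklog.txt. It assumes the "YYYY-MM-DD HH:MM:SS.mmm |"
line prefix seen in the log above and that the file may be served as .txt.gz,
so treat it as an illustration only:

# Hedged sketch: estimate how long the devstack phase took by diffing the
# first and last timestamped lines of devstacklog.txt. Assumes the
# "YYYY-MM-DD HH:MM:SS.mmm |" prefix format used in the log linked above.
import gzip
import re
import sys
from datetime import datetime
from urllib.request import urlopen

TS_RE = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)')

def devstack_duration(log_url):
    raw = urlopen(log_url).read()
    try:
        raw = gzip.decompress(raw)   # handle devstacklog.txt.gz
    except OSError:
        pass                         # already plain text
    stamps = []
    for line in raw.decode('utf-8', 'replace').splitlines():
        m = TS_RE.match(line)
        if m:
            stamps.append(datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S.%f'))
    if len(stamps) < 2:
        return None
    return stamps[-1] - stamps[0]

if __name__ == '__main__':
    # Usage: python3 devstack_duration.py <devstacklog.txt URL from a failed job>
    for url in sys.argv[1:]:
        print(url, devstack_duration(url))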


--
Regards,
Rabi Mishra


responded Aug 10, 2017 by Rabi_Mishra (2,140 points)   2 5
0 votes

The reality is you're just going to have to triage this and be a lot
more specific with issues. I find opening an etherpad and going
through the failures one-by-one helpful (e.g. I keep [2] for centos
jobs I'm interested in).

Looking at the top of the console.html log you'll find the host and
provider/region stamped in there. If it's timeouts, reporting to infra
the time, provider and region of failing jobs will help; if it's
network issues, similar details will help. Finding patterns is the
first step to understanding what needs fixing.

Here [1] I've collected some failure records from the gate.
As we can tell, in most of them the environment set-up becomes really slow and
then fails at some point with a timeout error.
In [1] I've also collected information about the failed nodes. Hope you can
find some clue in it.

[1] https://etherpad.openstack.org/p/heat-gate-fail-2017-08

responded Aug 10, 2017 by rico.lin.guanyu_at_g (1,740 points)   2
0 votes

On Thu, Aug 10, 2017 at 4:34 PM, Rabi Mishra ramishra@redhat.com wrote:


One of the issues we've seen recently is that many jobs are killed midway
through the tests when the job times out (120 mins). It seems jobs are often
scheduled to very slow nodes, where setting up devstack takes more than 80
mins [1].

[1] http://logs.openstack.org/49/492149/2/check/gate-heat-dsvm-functional-orig-mysql-lbaasv2-ubuntu-xenial/03b05dd/console.html#_2017-08-10_05_55_49_035693

We download an image from a fedora mirror and it seems to take more than
1hr.

http://logs.openstack.org/41/484741/7/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/a797010/logs/devstacklog.txt.gz#_2017-08-10_04_13_14_400

Probably an issue with the specific mirror or some infra network bandwidth
issue. I've submitted a patch to change the mirror to see if that helps.


--
Regards,
Rabi Mishra


responded Aug 10, 2017 by Rabi_Mishra (2,140 points)   2 5
0 votes

On Thu, Aug 10, 2017 at 07:22:42PM +0530, Rabi Mishra wrote:

We download an image from a fedora mirror and it seems to take more than
1hr.

http://logs.openstack.org/41/484741/7/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/a797010/logs/devstacklog.txt.gz#_2017-08-10_04_13_14_400

Probably an issue with the specific mirror or some infra network bandwidth
issue. I've submitted a patch to change the mirror to see if that helps.

Today we mirror both fedora-26[1] and fedora-25 (to be removed shortly). So if
you want to consider bumping your image for testing, you can fetch it from our
AFS mirrors.

You can source /etc/ci/mirror_info.sh to get information about things we mirror.

[1] http://mirror.regionone.infracloud-vanilla.openstack.org/fedora/releases/26/CloudImages/x86_64/images/
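
If you would rather poke at that from Python than shell, something like the
sketch below could read the KEY=value pairs out of /etc/ci/mirror_info.sh.
Which variables actually exist on a node (and their exact names) depends on
what infra ships there, and values containing nested shell variables are not
expanded, so inspect the file rather than assuming any particular entry:

# Minimal sketch (not an infra-blessed interface): collect the KEY=value
# assignments from /etc/ci/mirror_info.sh so a Python helper can build
# mirror URLs. Variable names are whatever the node's file defines; values
# that reference other shell variables are left unexpanded.
import re

MIRROR_INFO = '/etc/ci/mirror_info.sh'
ASSIGN_RE = re.compile(r'^(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)=["\']?(.*?)["\']?\s*$')

def load_mirror_info(path=MIRROR_INFO):
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            m = ASSIGN_RE.match(line)
            if m:
                values[m.group(1)] = m.group(2)
    return values

if __name__ == '__main__':
    for key, value in sorted(load_mirror_info().items()):
        print('%s=%s' % (key, value))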

responded Aug 10, 2017 by pabelanger_at_redhat (6,560 points)   1 1 2
0 votes

On Thu, Aug 10, 2017 at 12:04 PM, Paul Belanger pabelanger@redhat.com wrote:

Today we mirror both fedora-26[1] and fedora-25 (to be removed shortly). So if
you want to consider bumping your image for testing, you can fetch it from our
AFS mirrors.

You can source /etc/ci/mirror_info.sh to get information about things we mirror.

[1] http://mirror.regionone.infracloud-vanilla.openstack.org/fedora/releases/26/CloudImages/x86_64/images/

In order to make the gate happy, I've taken the time to submit this
patch; I'd appreciate it if it could be reviewed so we can reduce the churn
on our instances:

https://review.openstack.org/#/c/492634/

responded Aug 10, 2017 by Mohammed_Naser (3,860 points)   1 3
...