
[openstack-dev] [tripleo] critical situation with CI / upgrade jobs

0 votes

So far, we have 3 critical issues that we all need to address as soon as
we can.

Problem #1: Upgrade jobs time out from Newton to Ocata
https://bugs.launchpad.net/tripleo/+bug/1702955
Today I spent an hour looking at it and here's what I've found so far:
depending on which public cloud the TripleO CI jobs run on, they either
time out or they don't.
Here's an example of the Heat resources that run in our CI:
https://www.diffchecker.com/VTXkNFuk
On the left are the resources from a job that failed (running on internap);
on the right, a job that worked (running on citycloud).
I've been through all the upgrade steps and I haven't seen specific tasks
that take more time in one place or another, just many small differences
that add up to a big one at the end (which makes this hard to debug).
Note: both jobs use AFS mirrors.
Help on that front would be very welcome.
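
To make the "many small differences" point concrete, here is a rough Python
sketch (not an existing TripleO tool; the resource names and durations are
made up) of how one could diff per-resource Heat timings between a slow run
and a fast run to see whether one step dominates or the delay is spread
across many steps:

# A rough sketch: given per-resource durations (in seconds) from a slow run
# and a fast run, print the total extra time and the largest per-resource
# deltas. All names and numbers below are invented for illustration.

def compare_timings(slow_run, fast_run, top_n=10):
    deltas = {name: slow_run[name] - fast_run.get(name, 0.0) for name in slow_run}
    total = sum(deltas.values())
    print("Total extra time on the slow run: %.1f min" % (total / 60))
    for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top_n]:
        print("%-40s +%.1fs" % (name, delta))

# Hypothetical data: no single resource is dramatically slower, but the small
# differences add up to several extra minutes for the whole job.
compare_timings(
    slow_run={"UpgradeStep%d" % i: 180.0 + i * 5 for i in range(12)},
    fast_run={"UpgradeStep%d" % i: 150.0 for i in range(12)},
)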

Problem #2: from Ocata to Pike (containerized) missing container upload step
https://bugs.launchpad.net/tripleo/+bug/1710938
Wes has a patch (thanks!) that is currently in the gate:
https://review.openstack.org/#/c/493972
Thanks to that work, we managed to find problem #3.

Problem #3: from Ocata to Pike: all container images are
uploaded/specified, even for services not deployed
https://bugs.launchpad.net/tripleo/+bug/1710992
The CI jobs are timing out during the upgrade process because
downloading and uploading all the containers into the local cache takes
more than 20 minutes.
So this is where we are now: upgrade jobs time out on that. Steve Baker
is currently looking at it, but we'll probably offer some help.
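
As an illustration of what the fix for problem #3 amounts to, here is a
minimal sketch (not the actual prepare code; both mappings below are
hypothetical): only pull and push the images whose service is part of the
deployment, instead of the full list.

ALL_IMAGES = {
    "keystone": "tripleoupstream/centos-binary-keystone:latest",
    "nova-api": "tripleoupstream/centos-binary-nova-api:latest",
    "neutron-server": "tripleoupstream/centos-binary-neutron-server:latest",
    "manila-share": "tripleoupstream/centos-binary-manila-share:latest",
}

DEPLOYED_SERVICES = {"keystone", "nova-api", "neutron-server"}

def images_to_upload(all_images, deployed_services):
    # Keep only the images that belong to services actually deployed.
    return [image for service, image in all_images.items() if service in deployed_services]

print(images_to_upload(ALL_IMAGES, DEPLOYED_SERVICES))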

Solutions:
- for stable/ocata: make upgrade jobs non-voting
- for pike: keep upgrade jobs non-voting and release without upgrade testing

Risks:
- for stable/ocata: it's very likely that regressions will slip in if the
jobs are no longer voting.
- for pike: the quality of the release won't be good enough in terms of
CI coverage compared to Ocata.

Mitigations:
- for stable/ocata: make the jobs non-voting and ask our core reviewers
to pay extra attention to what lands. This should be temporary, until we
manage to fix the CI jobs.
- for master: release RC1 without upgrade jobs and keep making progress
- Run TripleO upgrade scenarios as third-party CI in RDO Cloud or
somewhere else with enough resources and no timeout constraints.

I would like some feedback on the proposal so we can move forward this week,
Thanks.
--
Emilien Macchi


OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
asked Aug 20, 2017 in openstack-dev by emilien_at_redhat.co

13 Responses

0 votes

On Tue, Aug 15, 2017 at 9:33 PM, Emilien Macchi emilien@redhat.com wrote:

[...]

I think, due to some of the limitations with run times upstream, we may need
to rethink the workflow for upgrade tests upstream. It's not very clear to
me what can be done with the multinode nodepool jobs beyond what is
already being done. I think we do have some choices with the OVB jobs. I'm
not going to try to solve this in this email, but rethinking how we CI upgrades
in the upstream infrastructure should be a focus for the Queens PTG. We
will need to focus on bringing run times down significantly, as it's
incredibly difficult to run two installs in 175 minutes across all the
upstream cloud providers.

Thanks Emilien for all the work you have done around upgrades!


responded Aug 16, 2017 by Wesley_Hayutin
0 votes

On Wed, Aug 16, 2017 at 4:33 AM, Emilien Macchi emilien@redhat.com wrote:

[...]

Solutions:
- for stable/ocata: make upgrade jobs non-voting
- for pike: keep upgrade jobs non-voting and release without upgrade
testing

+1, but for Ocata to Pike it sounds like the container/image-related problems
2 and 3 above are both in progress or being looked at (weshay/sbaker ++), in
which case we might be able to fix the O...P jobs at least?

For Newton to Ocata, is it consistent which clouds we are timing out on? I've
looked at https://bugs.launchpad.net/tripleo/+bug/1702955 before,
and I know other folks from upgrades have too, but we couldn't find a root
cause, or any upgrade operations taking too long, timing out, erroring, etc. If
it is consistent which clouds time out, we can use that info to guide us in
the case that we make the jobs non-voting for N...O (e.g. a known list of
'timing out clouds' to decide whether we should inspect the CI logs more closely
before merging a patch). Obviously only until/unless we actually root
cause that one (I will also find some time to check again).

Risks:
- for stable/ocata: it's highly possible to inject regression if jobs
aren't voting anymore.
- for pike: the quality of the release won't be good enough in terms of
CI coverage compared to Ocata.

Mitigations:
- for stable/ocata: make the jobs non-voting and ask our core reviewers
to pay extra attention to what lands. This should be temporary, until we
manage to fix the CI jobs.
- for master: release RC1 without upgrade jobs and make progress

for master, +1. I think this is essentially what I am saying above for O...P:
sounds like problem 2 is well in progress from weshay and the other
container/image-related problem 3 is the main outstanding item. Since RC1
is this week I think what you are proposing as mitigation is fair, so we
re-evaluate making these jobs voting before the final RCs at the end of August.

- Run TripleO upgrade scenarios as third-party CI in RDO Cloud or
somewhere else with enough resources and no timeout constraints.

I would like some feedback on the proposal so we can move forward this
week,
Thanks.

Thanks for putting this together. I think if we really had to pick one, the
O..P CI obviously has priority this week (!). I think the container/image
related issues for O...P are both expected teething issues from the huge
amount of work done by the containerization team and can hopefully be
resolved quickly.

marios



responded Aug 16, 2017 by Marios_Andreou
0 votes

On 16.08.2017 5:06, Wesley Hayutin wrote:

On Tue, Aug 15, 2017 at 9:33 PM, Emilien Macchi <emilien@redhat.com> wrote:

[...]

I think due to some of the limitations with run times upstream we may
need to rethink the workflow with upgrade tests upstream. It's not very
clear to me what can be done with the multinode nodepool jobs outside of
what is already being done. I think we do have some choices with ovb

We could limit the scope of the upstream multinode jobs to upgrade
testing of only a couple of the deployed services, like Keystone, Nova,
and Neutron, or so.

jobs. I'm not going to try and solve in this email but rethinking how
we CI upgrades in the upstream infrastructure should be a focus for the
Queens PTG. We will need to focus on bringing run times significantly
down as it's incredibly difficult to run two installs in 175 minutes
across all the upstream cloud providers.

Thanks Emilien for all the work you have done around upgrades!


--
Best regards,
Bogdan Dobrelya,
Irc #bogdando


responded Aug 16, 2017 by bdobreli_at_redhat.c
0 votes

On 16.08.2017 3:33, Emilien Macchi wrote:
[...]

Solutions:
- for stable/ocata: make upgrade jobs non-voting
- for pike: keep upgrade jobs non-voting and release without upgrade testing

This doesn't look like a viable option to me. I'd prefer to reduce the
scope of the upgrade testing (the services deployed and tested during the
upgrade), but release only with it passing for that scope.

[...]

--
Best regards,
Bogdan Dobrelya,
Irc #bogdando


responded Aug 16, 2017 by bdobreli_at_redhat.c
0 votes

On Wed, Aug 16, 2017 at 12:37 AM, Marios Andreou mandreou@redhat.com wrote:
For Newton to Ocata, is it consistent which clouds we are timing out on?

It's not consistent, but the failure rate is very high:
http://cistatus.tripleo.org/

gate-tripleo-ci-centos-7-multinode-upgrades: 30% success this week
gate-tripleo-ci-centos-7-scenario001-multinode-upgrades: 13% success this week
gate-tripleo-ci-centos-7-scenario002-multinode-upgrades: 34% success this week
gate-tripleo-ci-centos-7-scenario003-multinode-upgrades: 78% success this week

(results on stable/ocata)

So as you can see results are not good at all for gate jobs.

for master, +1 I think this is essentially what I am saying above for O...P
- sounds like problem 2 is well in progress from weshay and the other
container/image related problem 3 is the main outstanding item. Since RC1 is
this week I think what you are proposing as mitigation is fair. So we
re-evaluate making these jobs voting before the final RCs end of August

We might need to help him, and see how we can accelerate this work now.

thanks for putting this together. I think if we really had to pick one the
O..P ci has priority obviously this week (!)... I think the container/images
related issues for O...P are both expected/teething issues from the huge
amount of work done by the containerization team and can hopefully be
resolved quickly.

I agree, the priority is O..P for now, and getting these upgrade jobs working.
Note that the upgrade scenarios are not working correctly yet on
master; we'll need to figure that out as well. If you can help take a
look, that would be awesome.

Thanks,
--
Emilien Macchi


responded Aug 16, 2017 by emilien_at_redhat.co
0 votes

On Wed, Aug 16, 2017 at 3:17 AM, Bogdan Dobrelya bdobreli@redhat.com wrote:
We could limit the upstream multinode jobs scope to only do upgrade
testing of a couple of the services deployed, like keystone and nova and
neutron, or so.

That would be a huge regression in our CI. Strong -2 on this idea.
We worked hard to get pretty decent coverage during Ocata; we're
not going to give it up easily.
--
Emilien Macchi


responded Aug 16, 2017 by emilien_at_redhat.co
0 votes

On Tue, Aug 15, 2017 at 11:06:20PM -0400, Wesley Hayutin wrote:
On Tue, Aug 15, 2017 at 9:33 PM, Emilien Macchi emilien@redhat.com wrote:

[...]

I think due to some of the limitations with run times upstream we may need
to rethink the workflow with upgrade tests upstream. It's not very clear to
me what can be done with the multinode nodepool jobs outside of what is
already being done. I think we do have some choices with ovb jobs. I'm
not going to try and solve in this email but rethinking how we CI upgrades
in the upstream infrastructure should be a focus for the Queens PTG. We
will need to focus on bringing run times significantly down as it's
incredibly difficult to run two installs in 175 minutes across all the
upstream cloud providers.

Can you explain in more detail where the bottlenecks are for the 175 minutes?
That's just shy of 3 hours, which seems like more than enough time.

Not that it can be solved now, but maybe it is time to look at these jobs the
other way around: how can we make them faster, and what optimizations need to
be made?

One example: we spend a lot of time rebuilding RPM packages with DLRN. It is
possible that in Zuul v3 we'll be able to make changes to the CI workflow so
that only one node builds the packages and all the other jobs download them
from that node.

Another thing we can look at is more parallel testing in place of serial. I
can't point to anything specific, but it would be helpful to sit down with
somebody to better understand all the back and forth between the undercloud /
overcloud / multinodes / etc.
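
For what it's worth, the "build once, download everywhere" idea could look
roughly like the sketch below. This is only an illustration: the artifact URL
layout is hypothetical, and the fallback to a local DLRN build is left to the
job's existing step.

import os
import shutil
import urllib.error
import urllib.request

ARTIFACT_BASE = "https://artifacts.example.org/dlrn"  # hypothetical publisher

def fetch_prebuilt_repo(change_id, dest="delorean-ci.repo"):
    """Try to reuse a repo file published by a sibling job; return True on success."""
    url = "%s/%s/delorean.repo" % (ARTIFACT_BASE, change_id)
    try:
        with urllib.request.urlopen(url, timeout=30) as resp, open(dest, "wb") as out:
            shutil.copyfileobj(resp, out)
        return True
    except (urllib.error.URLError, OSError):
        return False

if not fetch_prebuilt_repo(os.environ.get("ZUUL_CHANGE", "unknown")):
    print("No prebuilt repo found; falling back to building packages with DLRN locally.")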



responded Aug 16, 2017 by pabelanger_at_redhat
0 votes

Here's an update on the situation.

On Tue, Aug 15, 2017 at 6:33 PM, Emilien Macchi emilien@redhat.com wrote:
Problem #1: Upgrade jobs timeout from Newton to Ocata
https://bugs.launchpad.net/tripleo/+bug/1702955
[...]

We still need some help to find out why the upgrade jobs time out so much
in stable/ocata.

Problem #2: from Ocata to Pike (containerized) missing container upload step
https://bugs.launchpad.net/tripleo/+bug/1710938
Wes has a patch (thanks!) that is currently in the gate:
https://review.openstack.org/#/c/493972
[...]

The patch worked and helped! We've got a successful job running today:
http://logs.openstack.org/00/461000/32/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/2f13627/console.html#_2017-08-16_01_31_32_009061

We're now pushing to the next step: testing the upgrade with pingtest.
See https://review.openstack.org/#/c/494268/ and the Depends-On: on
https://review.openstack.org/#/c/461000/.

If pingtest proves to work, that would be good news and would show that we
have a basic workflow in place on which we can iterate.

The next iterations afterward would be to work on the 4 scenarios that
are also going to be upgraded from Ocata to Pike (001 to 004).
For that, we'll need Problems #1 and #2 resolved before we make any
progress here, so we don't hit the same issues as before.

Problem #3: from Ocata to Pike: all container images are
uploaded/specified, even for services not deployed
https://bugs.launchpad.net/tripleo/+bug/1710992
The CI jobs are timing out during the upgrade process because
downloading and uploading all the containers into the local cache takes
more than 20 minutes.
So this is where we are now, upgrade jobs timeout on that. Steve Baker
is currently looking at it but we'll probably offer some help.

Steve is still working on it: https://review.openstack.org/#/c/448328/
Steve, if you need any help (reviewing or coding), please let us
know, as we consider this important and it would probably be good to
have it in Pike.

Thanks,
--
Emilien Macchi


responded Aug 16, 2017 by emilien_at_redhat.co
0 votes

On Wed, Aug 16, 2017 at 3:47 PM, Emilien Macchi emilien@redhat.com wrote:
[...]

We're now pushing to the next step: testing the upgrade with pingtest.
See https://review.openstack.org/#/c/494268/ and the Depends-On: on
https://review.openstack.org/#/c/461000/.

If pingtest proves to work, that would be good news and would show that we
have a basic workflow in place on which we can iterate.

Pingtest doesn't work:
http://logs.openstack.org/00/461000/37/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/1beac0e/logs/undercloud/home/jenkins/overcloud_validate.log.txt.gz#_2017-08-17_01_03_09

We need to investigate and find out why.
If nobody looks at it before then, I'll take a look at it tomorrow.

[...]

--
Emilien Macchi


responded Aug 17, 2017 by emilien_at_redhat.co
0 votes

On Thu, Aug 17, 2017 at 10:47 AM, Emilien Macchi emilien@redhat.com wrote:

Problem #3: from Ocata to Pike: all container images are
uploaded/specified, even for services not deployed
https://bugs.launchpad.net/tripleo/+bug/1710992
The CI jobs are timing out during the upgrade process because
downloading and uploading all the containers into the local cache takes
more than 20 minutes.
So this is where we are now, upgrade jobs timeout on that. Steve Baker
is currently looking at it but we'll probably offer some help.

Steve is still working on it: https://review.openstack.org/#/c/448328/
Steve, if you need any help (reviewing or coding) - please let us
know, as we consider this thing important to have and probably good to
have in Pike.

I have a couple of changes up now: one to capture the relationship between
images and services [1], and another to add an argument to the prepare
command to filter the image list based on which services are containerised
[2]. Once these land, all the calls to prepare in CI can be modified to also
specify these Heat environment files, which will reduce uploads to
only the images required.

[1] https://review.openstack.org/#/c/448328/
[2] https://review.openstack.org/#/c/494367/
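
For anyone wondering what "filter the image list based on which services are
containerised" could mean in practice, here is an illustrative sketch only
(not the actual tripleo-common code; the file name and key layout are
assumptions): read the resource_registry from a Heat environment file, treat
entries pointing at docker/services templates as containerised, and use that
set to filter the images.

import yaml  # PyYAML

def containerised_services(environment_path):
    # Collect service names whose resource_registry entry points at a
    # docker/services template, e.g.
    #   OS::TripleO::Services::Keystone: ../docker/services/keystone.yaml
    with open(environment_path) as f:
        env = yaml.safe_load(f) or {}
    registry = env.get("resource_registry", {})
    return {
        resource.split("::")[-1].lower()
        for resource, template in registry.items()
        if isinstance(template, str) and "docker/services" in template
    }

# Hypothetical usage:
# services = containerised_services("environments/docker.yaml")
# images = [img for svc, img in image_map.items() if svc in services]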


responded Aug 17, 2017 by Steve_Baker
...