
[openstack-dev] [tripleo] container jobs are unstable

0 votes

Hey,

I've noticed that container jobs look pretty unstable lately; to me,
it sounds like a timeout:
http://logs.openstack.org/19/447319/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/bca496a/console.html#_2017-03-22_00_08_55_358973

If anyone could file a bug and see how we can bring it back as soon as
possible, I think we want to maintain this job in stable shape.
I remember Container squad wanted it voting because it was supposed to
be stable, but I'm not sure that's the case today.

Also, it would be great to have the container jobs in
http://tripleo.org/cistatus.html - what do you think?

Thanks for your help,
--
Emilien Macchi


OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
asked Apr 7, 2017 in openstack-dev by emilien_at_redhat.co (36,940 points)   2 6 10

15 Responses

0 votes

On Wed, 2017-03-29 at 22:07 -0400, Paul Belanger wrote:
On Thu, Mar 30, 2017 at 09:56:59AM +1300, Steve Baker wrote:

On Thu, Mar 30, 2017 at 9:39 AM, Emilien Macchi <emilien@redhat.com> wrote:

On Mon, Mar 27, 2017 at 8:00 AM, Flavio Percoco <flavio@redhat.com> wrote:

On 23/03/17 16:24 +0100, Martin André wrote:

On Wed, Mar 22, 2017 at 2:20 PM, Dan Prince <dprince@redhat.com> wrote:

On Wed, 2017-03-22 at 13:35 +0100, Flavio Percoco wrote:

On 22/03/17 13:32 +0100, Flavio Percoco wrote:

On 21/03/17 23:15 -0400, Emilien Macchi wrote:

Hey,

I've noticed that container jobs look pretty unstable lately; to me,
it sounds like a timeout:
http://logs.openstack.org/19/447319/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/bca496a/console.html#_2017-03-22_00_08_55_358973

There are different hypotheses on what is going on here. Some patches have
landed to improve the write performance on containers by using hostpath mounts,
but we think the real slowness is coming from the image downloads.

That said, this is still under investigation and the containers squad will
report back as soon as there are new findings.

Also, to be more precise, Martin André is looking into this. He also fixed the
gate in the last 2 weeks.

I spoke w/ Martin on IRC. He seems to think this is the cause of some
of the failures:

http://logs.openstack.org/32/446432/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/543bc80/logs/oooq/overcloud-controller-0/var/log/extra/docker/containers/heat_engine/log/heat/heat-engine.log.txt.gz#_2017-03-21_20_26_29_697

Looks like Heat isn't able to create Nova instances in the overcloud
due to "Host 'overcloud-novacompute-0' is not mapped to any cell". This
means our cells initialization code for containers may not be quite
right... or there is a race somewhere.
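
If the failure really is a race between the compute service registering and the cell mapping step, one common mitigation is to re-run host discovery until the host shows up. The following is only a minimal sketch under that assumption. `nova-manage cell_v2 discover_hosts` is the real Nova command, but `wait_for_mapping`, the caller-supplied `is_mapped` check, and the timeout and interval values are hypothetical, and this is not taken from the TripleO templates.

```python
import subprocess
import time
from typing import Callable

# Real Nova command that maps newly registered compute hosts to a cell.
DISCOVER_CMD = ("nova-manage", "cell_v2", "discover_hosts")


def run_discover() -> None:
    """Run host discovery once; raises CalledProcessError on failure."""
    subprocess.run(DISCOVER_CMD, check=True)


def wait_for_mapping(
    is_mapped: Callable[[], bool],
    discover: Callable[[], None] = run_discover,
    timeout: float = 300.0,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Re-run discovery until is_mapped() is true or the timeout expires.

    is_mapped is a hypothetical caller-supplied check, for example one that
    tries to schedule onto the host and reports whether it succeeded.
    """
    deadline = clock() + timeout
    while True:
        discover()
        if is_mapped():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The injectable `discover`, `sleep` and `clock` arguments exist only so the loop can be exercised without a running Nova.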

Here are some findings. I've looked at time measures from CI for
https://review.openstack.org/#/c/448533/ which provided the most
recent results (all times in minutes):

  • gate-tripleo-ci-centos-7-ovb-ha [1]
       undercloud install: 23
       overcloud deploy: 72
       total time: 125
  • gate-tripleo-ci-centos-7-ovb-nonha [2]
       undercloud install: 25
       overcloud deploy: 48
       total time: 122
  • gate-tripleo-ci-centos-7-ovb-updates [3]
       undercloud install: 24
       overcloud deploy: 57
       total time: 152
  • gate-tripleo-ci-centos-7-ovb-containers-oooq-nv [4]
       undercloud install: 28
       overcloud deploy: 48
       total time: 165 (timeout)

Looking at the undercloud and overcloud install times, the most
time-consuming tasks, the containers job isn't doing that badly compared to
other OVB jobs. But looking closer I could see that:
  • the containers job pulls docker images from dockerhub; this process
    takes roughly 18 min.

I think we can optimize this a bit by having the script that populates the
local registry in the overcloud job run in parallel. The docker daemon can do
multiple pulls w/o problems.

+A
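
The parallel-pull idea above could look roughly like the sketch below. It assumes the `docker` CLI is on the PATH and only covers the pull step, not the tagging and pushing into the local registry that the real script also has to do. `pull_all`, the worker count, and any image names passed in are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def docker_pull(image: str) -> int:
    """Pull one image with the docker CLI and return its exit code."""
    return subprocess.run(["docker", "pull", image]).returncode


def pull_all(
    images: Iterable[str],
    pull: Callable[[str], int] = docker_pull,
    workers: int = 4,
) -> Dict[str, int]:
    """Pull images concurrently; map each image name to its exit code."""
    images = list(images)
    # Threads are enough here: each worker just waits on a docker subprocess.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(pull, images))
    return dict(zip(images, codes))
```

A caller would check the returned mapping for non-zero exit codes and retry or fail the job on those images.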

  • the postci takes a long time with quickstart, 13 min (4 min alone
    spent on docker log collection) whereas it takes only 3 min when using
    tripleo.sh

mmh, does this have anything to do with ansible being in between? Or is that
time specifically for the part that gets the logs?

Adding all these numbers, we're at about 40 min of additional time for the
oooq containers job, which is enough to cross the CI job limit.
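
One way to read the numbers quoted above is to subtract the two big phases from each job's total and compare what is left. This is only an illustrative calculation: the figures are the ones reported in the thread, but using the nonha job as the baseline (it has the same 48 min overcloud deploy) is an assumption, and it may not be how the "about 40 min" estimate was originally derived.

```python
# Minutes reported in the thread for https://review.openstack.org/#/c/448533/
JOBS = {
    "ovb-ha": {"undercloud": 23, "overcloud": 72, "total": 125},
    "ovb-nonha": {"undercloud": 25, "overcloud": 48, "total": 122},
    "ovb-updates": {"undercloud": 24, "overcloud": 57, "total": 152},
    "ovb-containers-oooq-nv": {"undercloud": 28, "overcloud": 48, "total": 165},
}


def other_time(job: dict) -> int:
    """Minutes not spent in undercloud install or overcloud deploy."""
    return job["total"] - job["undercloud"] - job["overcloud"]


overhead = {name: other_time(job) for name, job in JOBS.items()}
# Assumed baseline: nonha, which has the same overcloud deploy time.
extra_vs_nonha = overhead["ovb-containers-oooq-nv"] - overhead["ovb-nonha"]

for name, minutes in overhead.items():
    print(f"{name}: {minutes} min outside install/deploy")
print(f"containers vs nonha: +{extra_vs_nonha} min")
```

With these inputs the containers job spends 89 min outside install and deploy against 49 min for nonha, a 40 min gap.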

There is certainly a lot of room for optimization here and there and
I'll explore how we can speed up the containers CI job over the next

Thanks a lot for the update. The time breakdown is fantastic,
Flavio

TBH the problem is far from being solved:

  1. Click on https://status-tripleoci.rhcloud.com/
  2. Select gate-tripleo-ci-centos-7-ovb-containers-oooq-nv

The container job has been failing more than 55% of the time.

For reference:
gate-tripleo-ci-centos-7-ovb-nonha has a 90% success rate.
gate-tripleo-ci-centos-7-ovb-ha has a 64% success rate.

It clearly means the ovb-containers job was not, and is not, ready to be run
in the check pipeline; it's not reliable enough.

The current queue time in TripleO OVB is 11 hours. This is not
acceptable for TripleO developers and we need a short-term solution,
which is disabling this job from the check pipeline:
https://review.openstack.org/#/c/451546/

Yes, given resource constraints I don't see an alternative in the short
term.

In the long term, we need to:

  • Stabilize ovb-containers, which is AFAIK already WIP by Martin (kudos
    to him). My hope is Martin gets enough help from the Container squad to
    work on this topic.
  • Remove the ovb-nonha scenario from the check pipeline, and probably
    keep it periodic. Dan Prince started some work on it:
    https://review.openstack.org/#/c/449791/ and
    https://review.openstack.org/#/c/449785/ but there has not been much
    progress on it in recent days.
  • Start work on getting multinode-scenario(001,002,003,004) jobs
    for containers, so we don't need many OVB jobs (probably only one) for
    container scenarios.

Another work item in progress which should help with the stability of the
ovb containers job: Dan has set up a docker-distribution based registry
on a node in rhcloud. Once jobs are pulling images from this there should
be fewer timeouts due to image pull speed.

Before we go and stand up private infrastructure for tripleo to depend on, can
we please work on solving this for all openstack projects upstream? We do
want to run regional mirrors for docker things, however we need to address
issues with how to integrate this with AFS.

We are trying to break the cycle of tripleo standing up private infrastructure
and to consume more community-based infrastructure. So far we are making good
progress, however I would see this effort as a step backwards, not forward.

I would propose that we do both. Let's set up resources in-rack that help
us efficiently cache containers from dockerhub. And let's also do the
same within infra so that jobs running there benefit as well.

IMO a local, in-rack proxy/mirror that requires little to no
maintenance (which is all we are setting up here really) is a very good
pattern.

Are there other ideas that will allow us to avoid the overhead of
continually pulling images into our Rack from dockerhub?

Dan

I know everyone is busy working on container support in composable
services, but we might need to assign more resources to CI work here,
otherwise I'm not sure how we're going to stabilize the CI.

Any feedback is very welcome.

[1] tripleo-ci-centos-7-ovb-ha/d2c1b16/
[2] tripleo-ci-centos-7-ovb-nonha/d6df760/
[3] tripleo-ci-centos-7-ovb-updates/3b1f795/
[4] tripleo-ci-centos-7-ovb-containers-oooq-nv/b816f20/

Dan

Flavio




responded Mar 30, 2017 by Dan_Prince (8,160 points)   1 5 7
0 votes

On Thu, Mar 30, 2017 at 03:08:57PM +0100, Steven Hardy wrote:
On Wed, Mar 29, 2017 at 10:07:24PM -0400, Paul Belanger wrote:

Before we go and stand up private infrastructure for tripleo to depend on, can
we please work on solving this for all openstack projects upstream? We do
want to run regional mirrors for docker things, however we need to address
issues with how to integrate this with AFS.

We are trying to break the cycle of tripleo standing up private infrastructure
and to consume more community-based infrastructure. So far we are making good
progress, however I would see this effort as a step backwards, not forward.

To be fair, we discussed this on IRC yesterday. Everyone agreed an
infra-supported docker cache/registry was a great idea, but you said there was
no known timeline for it actually getting done.

So while we all want to see that happen, and potentially help out with the
effort, we're also trying to mitigate the fact that the work isn't done by
working around it in our OVB environment.

FWIW I think we absolutely need multinode container jobs, e.g. using infra
resources, as that has worked out great for our puppet-based CI, but we
really need to work out how to optimize the container download speed in
that environment before that will work well AFAIK.

You referenced https://review.openstack.org/#/c/447524/ in your other
reply, which AFAICS is a spec about publishing to dockerhub. That sounds
great, but we have the opposite problem: we need to consume those published
images during our CI runs, and currently downloading images takes too long.
So we ideally need some sort of local registry/pull-through cache that
speeds up that process.

How can we move forward here? Is there anyone on the infra side we can work
with to discuss further?

Yes, I am currently working with clarkb to address some of these concerns. Today
we are looking at setting up our cloud mirrors to cache[1] specific URLs; for
example, we are testing this with http://trunk.rdoproject.org. This is not a
long-term solution for projects, but a short-term one. It will be opt-in for
now, rather than us setting it up for all jobs. Long term, we move
rdoproject.org into AFS.

I have been trying to see if we can do the same for docker hub, and continue to
run it. The main issue, at least for me, is we don't want to depend on docker
tooling for this. I'd rather not install docker into our control plane at this
point in time.

So, all of that to say, it will take some time. I understand it is a high
priority, but let's solve the current mirroring issues with tripleo first (RDO,
gems, github), and let's see if the apache cache proxy will work for
hub.docker.com too.

[1] https://review.openstack.org/451554
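
On the job side, an opt-in caching mirror like the one described above usually comes down to rewriting upstream URLs to point at the mirror. The sketch below is purely illustrative: the `CACHED_HOSTS` table, the `/rdo` path prefix, and the mirror base URL are hypothetical and are not taken from the infra change referenced in [1].

```python
from urllib.parse import urlsplit

# Hypothetical opt-in table: upstream host -> path prefix on the mirror.
CACHED_HOSTS = {"trunk.rdoproject.org": "/rdo"}


def to_mirror(url: str, mirror_base: str, opt_in: bool = True) -> str:
    """Rewrite url to go through the caching mirror when its host is listed."""
    parts = urlsplit(url)
    prefix = CACHED_HOSTS.get(parts.hostname or "")
    if not opt_in or prefix is None:
        return url  # job did not opt in, or the host is not cached
    rewritten = mirror_base.rstrip("/") + prefix + parts.path
    if parts.query:
        rewritten += "?" + parts.query
    return rewritten
```

Jobs that do not opt in, and URLs for hosts that are not in the table, pass through unchanged, which matches the opt-in behaviour described in the message.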

Thanks!

Steve


responded Mar 30, 2017 by pabelanger_at_redhat (6,560 points)   1 1 2
0 votes

On Thu, Mar 30, 2017 at 11:01:08AM -0400, Paul Belanger wrote:

Yes, I am currently working with clarkb to address some of these concerns. Today
we are looking at setting up our cloud mirrors to cache[1] specific URLs; for
example, we are testing this with http://trunk.rdoproject.org. This is not a
long-term solution for projects, but a short-term one. It will be opt-in for
now, rather than us setting it up for all jobs. Long term, we move
rdoproject.org into AFS.

I have been trying to see if we can do the same for docker hub, and continue to
run it. The main issue, at least for me, is we don't want to depend on docker
tooling for this. I'd rather not install docker into our control plane at this
point in time.

So, all of that to say, it will take some time. I understand it is a high
priority, but let's solve the current mirroring issues with tripleo first (RDO,
gems, github), and let's see if the apache cache proxy will work for
hub.docker.com too.

[1] https://review.openstack.org/451554

Wanted to follow up on this thread: we managed to get a reverse proxy cache[2]
for https://registry-1.docker.io working. So far, I've just tested the ubuntu,
fedora, and centos images, but the caching works. Once we land this, any jobs
using docker can take advantage of the mirror.

[2] https://review.openstack.org/#/c/453811
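
One way a job could consume such a mirror is through the Docker daemon's `registry-mirrors` setting in `daemon.json`, which applies to Docker Hub pulls. The sketch below only builds that config text. The mirror URL passed in is hypothetical, and this is not necessarily how the change in [2] or the TripleO jobs wire it up.

```python
import json
from typing import Optional


def daemon_config(mirror_url: str, existing: Optional[dict] = None) -> str:
    """Return daemon.json text with mirror_url added to registry-mirrors."""
    config = dict(existing or {})
    mirrors = list(config.get("registry-mirrors", []))
    if mirror_url not in mirrors:
        mirrors.append(mirror_url)  # keep any mirrors already configured
    config["registry-mirrors"] = mirrors
    return json.dumps(config, indent=2, sort_keys=True)
```

The result would typically be written to `/etc/docker/daemon.json`, followed by a restart of the docker service so the daemon picks it up.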


responded Apr 6, 2017 by pabelanger_at_redhat (6,560 points)   1 1 2
0 votes

On Thu, Mar 30, 2017 at 10:08 AM, Steven Hardy shardy@redhat.com wrote:

FWIW I think we absolutely need multinode container jobs, e.g. using infra
resources, as that has worked out great for our puppet-based CI, but we
really need to work out how to optimize the container download speed in
that environment before that will work well AFAIK.

Gabriele has started working on this:
https://review.openstack.org/#/c/454152/

responded Apr 6, 2017 by Wesley_Hayutin (2,320 points)   2
0 votes

On Thu, 2017-04-06 at 15:32 -0400, Paul Belanger wrote:

Wanted to follow up on this thread: we managed to get a reverse proxy cache[2]
for https://registry-1.docker.io working. So far, I've just tested the ubuntu,
fedora, and centos images, but the caching works. Once we land this, any jobs
using docker can take advantage of the mirror.

[2] https://review.openstack.org/#/c/453811

Thanks for your help with this, Paul.

A reverse proxy cache wasn't exactly what I was expecting so it took a
few more patches to get all this initially wired into the TripleO OVB
jobs (6 patches so far). Once we have this we can duplicate a similar
setup for the multinode patches as well.

I created a quick etherpad below [1] to track the status of these
patches. I think they mostly need to land in the order they are listed
in the etherpad...

[1] https://etherpad.openstack.org/p/tripleo-docker-registry-mirror



responded Apr 7, 2017 by Dan_Prince (8,160 points)   1 5 7
...