
[openstack-dev] [tripleo] Validations before upgrades and updates


Hi folks, after some discussion locally with colleagues about improving the
upgrades experience, one of the items that came up was pre-upgrade and
pre-update validations. I took an action item to look at the current status
of tripleo-validations [0] and posted a simple WIP [1] that is intended to be
run before an undercloud update/upgrade and just checks service status.
shardy pointed out that for such checks it is better to continue using the
per-service manifests where possible, like [2] for example, where we check
service status before the N..O (Newton to Ocata) major upgrade. There may
still be some undercloud-specific validations that we can land in the
tripleo-validations repo (thinking about things like the neutron
networks/ports, validating the current nova nodes state, etc.).

So do folks have any thoughts on this subject, for example the kinds of
things we should be checking? Steve said he had some reviews in progress
for collecting the overcloud ansible puppet/docker config into an ansible
playbook that the operator can invoke to upgrade the 'manual' nodes
(for example compute in the N..O workflow). The point is that we can
add more per-service ansible validation tasks into the service manifests,
to be executed when the operator runs the play. I'll let Steve
point at and talk about those.

cheers, marios

[0] https://github.com/openstack/tripleo-validations
[1] https://review.openstack.org/#/c/462918/
[2] https://github.com/openstack/tripleo-heat-templates/blob/stable/ocata/puppet/services/neutron-api.yaml#L197


OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
asked May 24, 2017 in openstack-dev by Marios_Andreou (3,200 points)   3 4

10 Responses


On 05/08/2017 01:45 PM, Marios Andreou wrote:
> Hi folks, after some discussion locally with colleagues about
> improving the upgrades experience, one of the items that came up was
> pre-upgrade and update validations. I took an AI to look at the
> current status of tripleo-validations [0] and posted a simple WIP [1]
> intended to be run before an undercloud update/upgrade and which just
> checks service status. It was pointed out by shardy that for such
> checks it is better to instead continue to use the per-service
> manifests where possible like [2] for example where we check status
> before N..O major upgrade. There may still be some undercloud specific
> validations that we can land into the tripleo-validations repo
> (thinking about things like the neutron networks/ports, validating the
> current nova nodes state etc?).

Yes, I think we need a bunch of validations: DB state, service state, and
network connectivity (external and internal).

> So do folks have any thoughts about this subject - for example the
> kinds of things we should be checking - Steve said he had some reviews
> in progress for collecting the overcloud ansible puppet/docker config
> into an ansible playbook that the operator can invoke for upgrade of
> the 'manual' nodes (for example compute in the N..O workflow) - the
> point being that we can add more per-service ansible validation tasks
> into the service manifests for execution when the play is run by the
> operator - but I'll let Steve point at and talk about those.

I have a WIP review about that [1], but I need to revisit it a bit to add
a part to the mistral workflow. I'm also writing a POC [2] that creates a
mistral workbook for the major upgrade and validates minor/major upgrades
before starting. I have a third one in progress, not pushed yet, to
implement the major upgrade option in the CLI:

[1] https://review.openstack.org/444224
[2] https://review.openstack.org/#/c/462961

responded May 9, 2017 by mathieu_bultel (640 points)  

It looks like a good idea to me. I don't think our operators want to
update / upgrade OpenStack if the cloud is not in a consistent working
state before.

Here are the things we could test:
- Pacemaker cluster health
- Ceph health
- Database
- APIs healthcheck
- RabbitMQ health

--
Emilien Macchi


responded May 12, 2017 by emilien_at_redhat.co (36,940 points)   3 8 13

We had a similar discussion regarding controller node replacement
because starting that process with the overcloud in an inconsistent
state tends to end badly. Unfortunately those docs are only available
downstream at this time, but the basics were:

- Verify that the stack is in a *_COMPLETE state (this may seem obvious,
  but we've had people try to do these major processes while the stack is
  in a broken state)
- Verify undercloud disk space. For node replacement we recommended a
  minimum of 10 GB free.
- Verify that all pacemaker services are up.
- Check Galera and Rabbit clusters and verify all nodes are up.
- For node replacement we also disabled stonith. That might be a good
  idea during upgrades as well in case some services take a while to come
  back up. You really don't want a node getting killed during the process.
- General undercloud service checks (nova, ironic, etc.)
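
The first two checks in the list above could be sketched as an Ansible play run against the undercloud. This is only an illustration under stated assumptions, not an existing tripleo-validations playbook: the host group `undercloud`, the stack name `overcloud`, the variable names, and the choice of the `/` mount are all hypothetical, and the `openstack` command needs undercloud credentials in its environment (for example a sourced stackrc).

```yaml
---
# Hypothetical pre-upgrade checks; names and thresholds are illustrative.
- hosts: undercloud
  gather_facts: true          # needed for ansible_mounts below
  vars:
    stack_name: overcloud     # assumed stack name
    min_free_gb: 10           # minimum suggested in the list above
  tasks:
    - name: Read the Heat stack status
      command: openstack stack show {{ stack_name }} -f value -c stack_status
      register: stack_status
      changed_when: false

    - name: Fail unless the stack is in a *_COMPLETE state
      fail:
        msg: "Stack {{ stack_name }} is {{ stack_status.stdout }}"
      when: not stack_status.stdout.endswith('_COMPLETE')

    - name: Fail if the root filesystem has too little free space
      fail:
        msg: "Less than {{ min_free_gb }} GB free on /"
      when:
        - item.mount == '/'
        - item.size_available < (min_free_gb | int) * 1024 * 1024 * 1024
      with_items: "{{ ansible_mounts }}"
```

The Galera, Rabbit and pacemaker checks would have to run on the overcloud controllers rather than the undercloud, so they are left out of this play.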


responded May 15, 2017 by Ben_Nemec (19,660 points)   2 3 5

Thanks for starting this thread Marios, sorry for the slow reply due to
Summit etc.

As we discussed, I think adding validations is great, but I'd prefer we
kept any overcloud validations specific to services in t-h-t instead of
trying to manage service specific things over multiple repos.

This would also help with the idea of per-step validations, I think, where
e.g. you could have an "is service active" test and run it after the step
where we expect the service to start. A blueprint was raised a while back
asking for exactly that:

https://blueprints.launchpad.net/tripleo/+spec/step-by-step-validation

One way we could achieve this is to add ansible tasks that perform some
validation after each step, where we combine the tasks for all services,
similar to how we already do upgrade_tasks and host_prep_tasks:

https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/database/redis.yaml#L92

With the benefit of hindsight, using ansible tags for upgrade_tasks wasn't
the best approach, because you can't change the tags via SoftwareDeployment
(e.g. you need a SoftwareConfig per step). It's better if we either generate
the list of tasks by merging maps, e.g.

validation_tasks:
  step3:
    - sometask

or use ansible conditionals, where we pass a step value in to each run of
the tasks:

validation_tasks:
  - name: sometask
    when: step == 3

The latter approach is probably my preference, because it'll require less
complex merging in the heat layer.
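
To make the conditional option concrete, here is a minimal sketch of step-conditional validation tasks written as a plain Ansible play. The task names, the redis service check, and passing `step` on the command line are assumptions for illustration; `validation_tasks` is only the section name proposed in this thread, so nothing here reflects an existing t-h-t interface.

```yaml
---
# Illustrative only. Run as: ansible-playbook -e step=3 validate.yaml
- hosts: all
  gather_facts: false
  tasks:
    - name: Check that redis is active once step 3 has completed
      # systemctl is-active exits non-zero when the unit is not active,
      # which makes the task (and the run) fail.
      command: systemctl is-active redis
      changed_when: false
      when: step | int == 3

    - name: Placeholder for step 4 validations
      debug:
        msg: "step 4 validations would go here"
      when: step | int == 4
```

Because every task carries its own condition, the same flat task list can be reused for each step and only the `step` value changes between runs, which is what avoids the per-step merging.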

As you mentioned, I've been working on ways to make the deployment steps
more ansible driven, so having these tasks integrated with the t-h-t model
would be well aligned with that I think:

https://review.openstack.org/#/c/454816/

https://review.openstack.org/#/c/462211/

Happy to discuss further when you're ready to start integrating some
overcloud validations.

Thanks!

Steve


responded May 15, 2017 by Steven_Hardy (16,900 points)   2 7 13

Maybe these are two kinds of pre-upgrade validations that serve
different purposes.

The more general validations (like checking connectivity, making sure
the stack is in good shape, repos are available, etc.) should give
operators a fair amount of confidence that all basic prerequisites to
start an update are met before the upgrade is started. They could be
run from the UI or CLI and would fit well into the tripleo-validations
repo. Similar to the existing tripleo-validations, failures don't
prevent operators from doing something.

The service-specific validations, on the other hand, are closely tied to
the upgrade process and will stop further progress when they fail. They
are fundamentally different from the tripleo-validations and could
therefore live in t-h-t.

I personally don't see why we shouldn't have pre-upgrade validations
both in tripleo-validations and in t-h-t, as long as we know which
ones go where. If everything that's tied to a specific overcloud
service or upgrade step goes into t-h-t, I could see these two groups
(using the validations suggested earlier in this thread):

tripleo-validations:
- Undercloud service check
- Verify that the stack is in a *_COMPLETE state
- Verify undercloud disk space. For node replacement we recommended a
minimum of 10 GB free.
- Network/repo availability check (undercloud and overcloud)
- Verify we're at the latest version of the current release
- ...

tripleo-heat-templates:
- Pacemaker cluster health
- Ceph health
- APIs healthcheck (per overcloud service)
- Check Galera and Rabbit clusters and verify all nodes are up.
- Disabling stonith.
- ...
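
As a rough illustration of the overcloud-side group, the cluster health checks might look like the following Ansible tasks. This is a sketch under assumptions: the `controllers` inventory group is hypothetical, `ceph health` only works on a node that has the Ceph client configuration and keyring, and `pcs status` failing only tells you the cluster is not running, not that every resource is healthy.

```yaml
---
# Hypothetical overcloud health checks; group name is illustrative.
- hosts: controllers
  become: true
  gather_facts: false
  tasks:
    - name: Pacemaker cluster is running
      command: pcs status
      changed_when: false

    - name: Ceph reports HEALTH_OK
      command: ceph health
      register: ceph_health
      changed_when: false
      failed_when: "'HEALTH_OK' not in ceph_health.stdout"

    - name: RabbitMQ cluster status can be queried
      command: rabbitmqctl cluster_status
      changed_when: false
```

Checking that all Galera and Rabbit nodes are actually members would need the command output parsed, which is omitted here.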

In theory I could imagine another variety of pre-upgrade validations:
Ones that are general in nature (not tied to an overcloud service),
but are specific to a particular version jump (so they would be run
before a N..O upgrade, but wouldn't make sense for an O..P jump).
These could still live in the tripleo-validations repo, but would only
exist as backports to the relevant "from"-version. But lacking a good
example, this is probably a bit academic for now. :-)

Any thoughts?

Thanks
Florian


responded May 16, 2017 by Florian_Fuchs (520 points)  

ack, I had a pass at those when you first sent this and will look again. I
think our discussions have highlighted a few things:

  • lack of tripleo-common/client support for the upgrades workflow (e.g. so
    we can kill the -e for every invocation)

  • better and more pre-upgrade/update validations - the undercloud
    especially doesn't have much/any (we already have some pre-upgrade
    validations), and minor update doesn't have much/any coverage either.

  • improving the 'manual node upgrade' - instead of running
    upgrade-non-controller.sh, use a playbook that can be run by the operator
    (or even automated?)

  • better logging/tracking of the current upgrade step/progress, for when
    failures happen

  • related to ^^^: when the upgrade/update completes on the current node,
    write a /etc/tripleorelease or somesuch to say 'mitaka' or 'ocata' or
    whatever version it has just been upgraded to.
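
The release marker from the last bullet could be written by a single Ansible task at the end of a node's upgrade. This is a sketch only: the path is the placeholder suggested above ("or somesuch"), and the variable names are made up for the example.

```yaml
---
# Hypothetical: record which release a node was just upgraded to.
- hosts: all
  become: true
  gather_facts: false
  vars:
    release_file: /etc/tripleorelease   # placeholder path from this thread
    release_name: ocata                 # set to the release just installed
  tasks:
    - name: Write the release marker file
      copy:
        content: "{{ release_name }}\n"
        dest: "{{ release_file }}"
        mode: "0644"
```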

I recall Emilien suggesting a blueprint for tracking. We should discuss
some more and probably get these down as one or more blueprints, assuming
folks agree with that list and with what we can add/remove/change. Let's
work together to split up the blueprints if you like, but let's discuss
them a bit here first; or, if there are no more comments, let's see :)

thanks

cheers, marios

responded May 17, 2017 by Marios_Andreou (3,200 points)   3 4

On Mon, May 15, 2017 at 7:27 PM, Steven Hardy shardy@redhat.com wrote:

> Thanks for starting this thread Marios, sorry for the slow reply due to
> Summit etc.
>
> As we discussed, I think adding validations is great, but I'd prefer we
> kept any overcloud validations specific to services in t-h-t instead of
> trying to manage service specific things over multiple repos.
>
> This would also help with the idea of per-step validations I think, where
> e.g you could have a "is service active" test and run it after the step
> where we expect the service to start, a blueprint was raised a while back
> asking for exactly that:
>
> https://blueprints.launchpad.net/tripleo/+spec/step-by-step-validation

thanks for the blueprint link, we can use it, so that's one less to file :D

and ack on the overcloud vs undercloud split - sounds like
tripleo-validations is the right place, and folks agree in general to
having some ansible tasks there to validate stuff (thanks very much
@bnemec and @emilien for the suggestions).

thanks


responded May 17, 2017 by Marios_Andreou (3,200 points)   3 4

On Tue, May 16, 2017 at 4:28 PM, Florian Fuchs flfuchs@redhat.com wrote:

> Maybe these are two kinds of pre-upgrade validations that serve
> different purposes.
>
> The more general validations (like checking connectivity, making sure
> the stack is in good shape, repos are available, etc.) should give
> operators a fair amount of confidence that all basic prerequisites to
> start an update are met before the upgrade is started. They could be
> run from the UI or CLI and would fit well into the tripleo-validations
> repo. Similar to the existing tripleo-validations, failures don't
> prevent operators from doing something.
>
> The service-specific validations otoh are closely tied to the upgrade
> process and will stop further progress when failing.

yeah - you could also argue that the current overcloud service upgrade
validations (which just check 'is this service running OK' at step0 of the
upgrade) are also pre-upgrade, since we haven't done anything yet; it is
literally step0. Note that as of the upgrade to stable/ocata you can
disable these, for example if you need to re-run the upgrade step, so that
it doesn't fail on the service checks.

> They are
> fundamentally different to the tripleo-validations and could therefore
> live in t-h-t.

ACK, yeah, this seems to be the general consensus forming here:
tripleo-validations for checking things, especially on the undercloud, and
continue to use tht for the overcloud service validations. There are many
reasons for that, especially that we get the benefit of the 'auto'
generated list of services currently deployed per node, so a per-service
validation runs only if the service is deployed.

> I personally don't see why we shouldn't have pre-upgrade validations
> both in tripleo-validations and in t-h-t, as long as we know which
> ones go where. If everything that's tied to a specific overcloud
> service or upgrade step goes into t-h-t, I could see these two groups
> (using the validations suggested earlier in this thread):
>
> tripleo-validations:
> - Undercloud service check
> - Verify that the stack is in a *_COMPLETE state
> - Verify undercloud disk space. For node replacement we recommended a
>   minimum of 10 GB free.
> - Network/repo availability check (undercloud and overcloud)
> - Verify we're at the latest version of the current release
> - ...
>
> tripleo-heat-templates:
> - Pacemaker cluster health
> - Ceph health
> - APIs healthcheck (per overcloud service)
> - Check Galera and Rabbit clusters and verify all nodes are up.
> - Disabling stonith.
> - ...

thanks, these all seem like good things to be checking and the split seems
reasonable to me.

> In theory I could imagine another variety of pre-upgrade validations:
> Ones that are general in nature (not tied to an overcloud service),
> but are specific to a particular version jump (so they would be run
> before a N..O upgrade, but wouldn't make sense for an O..P jump).

for sure, we have things like migrations: for example, service foo-api is
deprecated and the foo service is now served by apache instead, and that
will happen only in a specific upgrade version.

> These could still live in the tripleo-validations repo, but would only
> exist as backports to the relevant "from"-version. But lacking a good
> example, this is probably a bit academic for now. :-)
>
> Any thoughts?
>
> Thanks
> Florian


responded May 17, 2017 by Marios_Andreou (3,200 points)   3 4

Hi Marios,

Forgive me if I misunderstand here, but it looks like part of this goal
is to do things like ensure the overcloud is in a decent state before an
upgrade/update is executed.

How would this work in a situation where I have hit an OpenStack bug
which causes my cinder service to stop working/fail, and a fix has
been created/packaged, ready for me to update my overcloud with, but the
validations bomb out because cinder isn't running (and I can't update my
overcloud to the newest package with the fix because the validation fails)?

Regards,

Graeme

--
Graeme Gillies
Principal Systems Administrator
Openstack Infrastructure
Red Hat Australia

responded May 24, 2017 by Graeme_Gillies (200 points)  
0 votes

On Wed, May 24, 2017 at 4:09 AM, Graeme Gillies ggillies@redhat.com wrote:


Hi Marios,

Forgive me if I misunderstand here, but it looks like part of this goal
is to do things like ensure the overcloud is in a decent state before an
upgrade/update is executed.

How would this work in a situation where I have hit an OpenStack bug
which causes my cinder service to stop working/fail, and a fix has
been created/packaged, ready for me to update my overcloud with, but the
validations bomb out because cinder isn't running (and I can't update my
overcloud to the newest package with the fix because the validation fails)?

o/ right... so there are roughly two groups of things here: validations
for the undercloud (of which we don't have many, and we want to add some)
and validations for the overcloud. For the former we are targeting
tripleo-validations, and for the latter we are adding to the existing service
checks in the tripleo-heat-templates service manifests for execution during
the upgrade.

For both we need a way to disable them; one of the key concerns is the
scenario you describe. For the overcloud service checks we already have that,
at least for the current simple "is the service running" check (grep for
SkipUpgradeConfigTags at
https://docs.openstack.org/developer/tripleo-docs/post_deployment/upgrade.html).
For tripleo-validations I believe there is already a 'validations fatal' type
flag that you can pass to the client.

hope it answers your concern
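
To make the two escape hatches concrete (skipping a check, and making failures non-fatal), here is a minimal sketch of a runner. It does not reproduce the real SkipUpgradeConfigTags handling or the client flag; `run_validations`, `skip` and `fatal` are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Iterable, List


class ValidationError(Exception):
    """Raised when a validation fails and failures are fatal."""


def run_validations(
    checks: Dict[str, Callable[[], bool]],
    skip: Iterable[str] = (),
    fatal: bool = True,
) -> List[str]:
    """Run every check not listed in `skip`; return the names that failed."""
    skipped = set(skip)
    failed = [
        name
        for name, check in checks.items()
        if name not in skipped and not check()
    ]
    if failed and fatal:
        raise ValidationError("failed validations: " + ", ".join(failed))
    return failed
```

In the broken-cinder scenario, the operator would either pass `skip=["cinder"]` or run with `fatal=False`, so the update that carries the fix can still proceed.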


responded May 24, 2017 by Marios_Andreou (3,200 points)   3 4
...