
[openstack-dev] [puppet][Fuel] OpenstackLib Client Provider Better Exception Handling

0 votes

Hi, folks

  • Intro

Per our discussion at Meeting #54 [0] I would like to propose a uniform
approach to exception handling for all puppet-openstack providers that access
any type of OpenStack API.

  • Problem Description

While deploying multi-node, HA-aware environments with Fuel, we have faced
many intermittent operational issues, e.g.:

* 401/403 authentication failures while scaling OpenStack controllers, due to
  a difference in hashing view between keystone instances
* 502/503/504 errors due to temporary connectivity issues
* non-idempotent operations such as deletion or creation - e.g. if you are
  deleting an endpoint and someone else deletes it on another node first, you
  get a 404 and should continue as a success instead of failing; a 409 Conflict
  error should likewise signal us to re-fetch the resource parameters and then
  decide what to do with them

Obviously, it is not optimal to rerun puppet to correct such errors when we
can just handle an exception properly.
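
To illustrate the kind of handling I mean, here is a minimal Ruby sketch; the
error classes and the api_delete call in the usage comment are hypothetical
placeholders, not an existing puppet-openstack API:

    # Hypothetical sketch: tolerate the failure modes listed above instead of
    # failing the whole Puppet run.
    class TransientApiError < StandardError; end  # e.g. 401 desync, 502/503/504
    class NotFoundError     < StandardError; end  # 404

    def delete_with_tolerance(max_attempts: 3)
      attempt = 0
      begin
        attempt += 1
        yield                      # the actual delete request goes here
      rescue NotFoundError
        :already_gone              # someone else deleted it first - that is fine
      rescue TransientApiError
        raise if attempt >= max_attempts
        sleep(2**attempt)          # back off, then retry the same request
        retry
      end
    end

    # Usage (api_delete is a placeholder for the real call):
    #   delete_with_tolerance { api_delete("/v3/endpoints/#{endpoint_id}") }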

  • Current State of the Art

There is some exception handling, but it does not cover all the
aforementioned use cases.

  • Proposed solution

Introduce a library of exception handling methods that is shared by all
puppet-openstack providers, as these exceptions appear to be generic. Then,
for each provider, we can introduce a provider-specific library that inherits
from this one.
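
As a very rough sketch of the shape this could take (module, class and method
names below are invented for illustration, not an existing library):

    # Illustration only: a generic handler shared by all providers, plus a thin
    # provider-specific subclass that refines it.
    module PuppetOpenstack
      class ApiErrorHandler
        def retryable_statuses
          [502, 503, 504]                    # temporary gateway/backend trouble
        end

        def handle(status, operation)
          return :retry            if retryable_statuses.include?(status)
          return :treat_as_success if status == 404 && operation == :delete
          return :refetch          if status == 409
          :fail
        end
      end

      # Keystone also retries 401, since token validation can fail transiently
      # while controllers are being added or removed.
      class KeystoneErrorHandler < ApiErrorHandler
        def retryable_statuses
          super + [401]
        end
      end
    end

    # PuppetOpenstack::KeystoneErrorHandler.new.handle(404, :delete)  #=> :treat_as_success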

Our mos-puppet team could add this to their backlog and work on it upstream,
or start on it downstream and then propose it upstream.

What do you think about that, puppet folks?

[0]
http://eavesdrop.openstack.org/meetings/puppet_openstack/2015/puppet_openstack.2015-10-06-15.00.html

--
Yours Faithfully,
Vladimir Kuklin,
Fuel Library Tech Lead,
Mirantis, Inc.
+7 (495) 640-49-04
+7 (926) 702-39-68
Skype kuklinvv
35bk3, Vorontsovskaya Str.
Moscow, Russia,
www.mirantis.com
www.mirantis.ru
vkuklin@mirantis.com


asked Oct 8, 2015 in openstack-dev by Vladimir_Kuklin (7,320 points)   1 3 4

8 Responses

0 votes

On 10/08/2015 07:38 AM, Vladimir Kuklin wrote:
[...]
* Proposed solution

Introduce a library of exception handling methods which should be the
same for all puppet openstack providers as these exceptions seem to be
generic. Then, for each of the providers we can introduce
provider-specific libraries that will inherit from this one.

Our mos-puppet team could add this into their backlog and could work on
that in upstream or downstream and propose it upstream.

What do you think on that, puppet folks?

This is excellent feedback on how the modules behave in Fuel, and I'm sure
you're not alone: everybody deploying OpenStack with Puppet is hitting
these issues.

You might want to refactor [1] and handle more use cases.
If you plan to work on it, I would suggest using our upstream backlog
[2] so we can involve the whole group in that work.

[1]
https://github.com/openstack/puppet-openstacklib/blob/master/lib/puppet/provider/openstack.rb
[2] https://trello.com/b/4X3zxWRZ/on-going-effort
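
To make that concrete, one possible direction (purely illustrative, not the
current openstack.rb code; the method name and error patterns are assumptions)
would be to centralise a "does this CLI failure look transient?" check around
whatever shells out to the openstack command:

    # Illustration only: since the CLI hands back text rather than status codes,
    # classify its output and retry the command when the failure looks transient.
    require 'open3'

    RETRYABLE_OUTPUT = [
      /HTTP 50[234]/,
      /Unable to establish connection/,
      /Connection refused/
    ].freeze

    def run_openstack(*args, attempts: 3)
      attempts.times do |i|
        out, err, status = Open3.capture3('openstack', *args)
        return out if status.success?
        transient = RETRYABLE_OUTPUT.any? { |p| err =~ p || out =~ p }
        raise "openstack #{args.join(' ')} failed: #{err}" unless transient
        sleep(2**i)               # looks transient: back off and try again
      end
      raise "openstack #{args.join(' ')} still failing after #{attempts} attempts"
    end

    # run_openstack('endpoint', 'list', '-f', 'json')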

Thanks for taking care of that,
--
Emilien Macchi



responded Oct 13, 2015 by emilien_at_redhat.co (36,940 points)   2 6 10
0 votes

On 10/13/2015 12:57 PM, Emilien Macchi wrote:

On 10/08/2015 07:38 AM, Vladimir Kuklin wrote:
[...]

  • Proposed solution

Introduce a library of exception handling methods which should be the
same for all puppet openstack providers as these exceptions seem to be
generic. Then, for each of the providers we can introduce
provider-specific libraries that will inherit from this one.

Our mos-puppet team could add this into their backlog and could work on
that in upstream or downstream and propose it upstream.

What do you think on that, puppet folks?
This is excellent feedback from how modules work in Fuel and I'm sure
you're not alone, everybody deploying OpenStack with Puppet is hitting
these issues.

You might want to refactor [1] and manage more use-cases.
If you plan to work on it, I would suggest to use our upstream backlog
[2] so we can involve the whole group in that work.

[1]
https://github.com/openstack/puppet-openstacklib/blob/master/lib/puppet/provider/openstack.rb
[2] https://trello.com/b/4X3zxWRZ/on-going-effort

If the issue is that openstackclient output is hard to parse, we should
tell openstackclient to output JSON -
https://bugs.launchpad.net/puppet-openstacklib/+bug/1479387
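
For what it's worth, once the output is JSON the parsing side becomes trivial;
a small sketch, assuming the openstack CLI is on the path and credentials are
in the environment:

    require 'json'
    require 'open3'

    # Ask openstackclient for machine-readable output instead of a table.
    out, err, status = Open3.capture3('openstack', 'endpoint', 'list', '-f', 'json')
    raise "openstack failed: #{err}" unless status.success?

    JSON.parse(out).each do |endpoint|
      # Keys follow the CLI's column headers, e.g. "ID" and "URL".
      puts "#{endpoint['ID']} #{endpoint['URL']}"
    end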

Thanks for taking care of that,


responded Oct 13, 2015 by Rich_Megginson (3,020 points)   2 5
0 votes

On Thu, Oct 8, 2015 at 5:38 AM, Vladimir Kuklin vkuklin@mirantis.com
wrote:

Hi, folks

  • Intro

Per our discussion at Meeting #54 [0] I would like to propose the uniform
approach of exception handling for all puppet-openstack providers accessing
any types of OpenStack APIs.

  • Problem Description

While working on Fuel during deployment of multi-node HA-aware
environments we faced many intermittent operational issues, e.g.:

401/403 authentication failures when we were doing scaling of OpenStack
controllers due to difference in hashing view between keystone instances
503/502/504 errors due to temporary connectivity issues
non-idempotent operations like deletion or creation - e.g. if you are
deleting an endpoint and someone is deleting on the other node and you get
404 - you should continue with success instead of failing. 409 Conflict
error should also signal us to re-fetch resource parameters and then decide
what to do with them.

Obviously, it is not optimal to rerun puppet to correct such errors when
we can just handle an exception properly.

  • Current State of Art

There is some exception handling, but it does not cover all the
aforementioned use cases.

  • Proposed solution

Introduce a library of exception handling methods which should be the same
for all puppet openstack providers as these exceptions seem to be generic.
Then, for each of the providers we can introduce provider-specific
libraries that will inherit from this one.

Our mos-puppet team could add this into their backlog and could work on
that in upstream or downstream and propose it upstream.

What do you think on that, puppet folks?

[0]
http://eavesdrop.openstack.org/meetings/puppet_openstack/2015/puppet_openstack.2015-10-06-15.00.html

I think that we should look into some solutions here as I'm generally for
something we can solve once and re-use. Currently we solve some of this at
TWC by serializing our deploys and disabling puppet site wide while we do
so. This avoids the issue of Keystone on one node removing an endpoint
while the other nodes (which still have old code) keep trying to add it back.

For connectivity issues especially after service restarts, we're using
puppet-healthcheck [0] and I'd like to discuss that more in Tokyo as an
alternative to explicit retries and delays. It's in the etherpad so
hopefully you can attend.

[0] - https://github.com/puppet-community/puppet-healthcheck
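
(For those who have not seen it, the rough idea is something like the
plain-Ruby sketch below - poll the service until it actually answers before
letting dependent work run; this is only an illustration of the concept, not
the module's actual interface, and the keystone URL is a placeholder.)

    require 'net/http'
    require 'uri'

    # Rough idea only (not puppet-healthcheck's actual interface): poll the
    # endpoint until it answers successfully, with an overall deadline.
    def wait_until_healthy(url, timeout: 60, interval: 2)
      deadline = Time.now + timeout
      uri = URI(url)
      loop do
        begin
          return true if Net::HTTP.get_response(uri).is_a?(Net::HTTPSuccess)
        rescue SystemCallError, Net::OpenTimeout
          # service not reachable yet (connection refused, no route, ...)
        end
        raise "#{url} not healthy after #{timeout}s" if Time.now > deadline
        sleep interval
      end
    end

    # wait_until_healthy('http://127.0.0.1:5000/v3')   # placeholder keystone URL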


responded Oct 15, 2015 by Matt_Fischer (9,340 points)   1 4 8
0 votes

On 15/10/15 12:42, Matt Fischer wrote:

On Thu, Oct 8, 2015 at 5:38 AM, Vladimir Kuklin <vkuklin@mirantis.com> wrote:

Hi, folks

* Intro

Per our discussion at Meeting #54 [0] I would like to propose the
uniform approach of exception handling for all puppet-openstack
providers accessing any types of OpenStack APIs.

* Problem Description

While working on Fuel during deployment of multi-node HA-aware
environments we faced many intermittent operational issues, e.g.:

401/403 authentication failures when we were doing scaling of
OpenStack controllers due to difference in hashing view between
keystone instances
503/502/504 errors due to temporary connectivity issues

The 5xx errors are not connectivity issues:

500 Internal Server Error
501 Not Implemented
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
505 HTTP Version Not Supported

I believe nothing should be done to trap them.

The connectivity issues are a different matter (to be addressed as
mentioned by Matt)

non-idempotent operations like deletion or creation - e.g. if you
are deleting an endpoint and someone is deleting on the other node
and you get 404 - you should continue with success instead of
failing. 409 Conflict error should also signal us to re-fetch
resource parameters and then decide what to do with them.

Obviously, it is not optimal to rerun puppet to correct such errors
when we can just handle an exception properly.

* Current State of Art

There is some exception handling, but it does not cover all the
aforementioned use cases.

* Proposed solution

Introduce a library of exception handling methods which should be
the same for all puppet openstack providers as these exceptions seem
to be generic. Then, for each of the providers we can introduce
provider-specific libraries that will inherit from this one.

Our mos-puppet team could add this into their backlog and could work
on that in upstream or downstream and propose it upstream.

What do you think on that, puppet folks?

The real issue is that we're dealing with openstackclient, a CLI tool
and not an API, so no error propagation is to be expected.

Using REST interfaces for all OpenStack APIs would surface all HTTP errors:

Check for "HTTP Response Classes" in
http://ruby-doc.org/stdlib-2.2.3/libdoc/net/http/rdoc/Net/HTTP.html
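
A minimal illustration of those response classes (the URL below is just a
placeholder): the class of the response object already tells you which family
of error you got, no text parsing required.

    require 'net/http'
    require 'uri'

    # The response object's class encodes the error family directly.
    response = Net::HTTP.get_response(URI('http://127.0.0.1:5000/v3'))  # placeholder

    case response
    when Net::HTTPSuccess     then puts "OK: #{response.code}"
    when Net::HTTPClientError then puts "4xx, our request is wrong: #{response.code}"
    when Net::HTTPServerError then puts "5xx, maybe worth a retry: #{response.code}"
    else puts "unexpected: #{response.class}"
    end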

[0] http://eavesdrop.openstack.org/meetings/puppet_openstack/2015/puppet_openstack.2015-10-06-15.00.html

I think that we should look into some solutions here as I'm generally
for something we can solve once and re-use. Currently we solve some of
this at TWC by serializing our deploys and disabling puppet site wide
while we do so. This avoids the issue of Keystone on one node removing
and endpoint while the other nodes (who still have old code) keep trying
to add it back.

For connectivity issues especially after service restarts, we're using
puppet-healthcheck [0] and I'd like to discuss that more in Tokyo as an
alternative to explicit retries and delays. It's in the etherpad so
hopefully you can attend.

+1

[0] - https://github.com/puppet-community/puppet-healthcheck


responded Oct 15, 2015 by Gilles_Dubreuil (1,420 points)   2
0 votes

Gilles,

5xx errors like 503 and 502/504 can well be intermittent operational
issues. E.g. when you access your keystone backends through some proxy and
there is a connectivity issue between the proxy and the backends that
disappears within 10 seconds, you do not need to rerun puppet completely -
just retry the request.

Regarding "REST interfaces for all Openstack API" - this is very close to
another topic that I raised ([0]) - using native Ruby application and
handle the exceptions. Otherwise whenever we have an OpenStack client
(generic or neutron/glance/etc. one) sending us a message like '[111]
Connection refused' this message is very much determined by the framework
that OpenStack is using within this release for clients. It could be
requests or any other type of framework which sends different text
message depending on its version. So it is very bothersome to write a bunch
of 'if' clauses or gigantic regexps instead of handling simple Ruby
exception. So I agree with you here - we need to work with the API
directly. And, by the way, if you also support switching to native Ruby
OpenStack API client, please feel free to support movement towards it in
the thread [0]
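
To illustrate the difference (the URL is a placeholder and this assumes the
openstack CLI is installed): a native call raises a typed exception we can
rescue precisely, while the CLI only hands back whatever text its current
framework prints.

    require 'net/http'
    require 'uri'

    # Native HTTP call: the failure arrives as a typed, stable Ruby exception.
    begin
      Net::HTTP.get_response(URI('http://127.0.0.1:5000/v3'))   # placeholder URL
    rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH, Net::OpenTimeout => e
      puts "transient, retry later: #{e.class}"
    end

    # CLI call: all we get is version-dependent text, so the "handling"
    # degenerates into fragile pattern matching on messages like
    # '[111] Connection refused'.
    output = `openstack endpoint list 2>&1`
    puts 'transient, retry later?' if output =~ /Connection refused|HTTP 50[234]/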

Matt and Gilles,

Regarding puppet-healthcheck - I do not think that puppet-healthcheck
handles exactly what I am mentioning here - it does not run at exactly
the same time as we run the request.

E.g. 10 seconds ago everything was OK, then we had a temporary connectivity
issue, then everything is ok again in 10 seconds. Could you please describe
how puppet-healthcheck can help us solve this problem?

Or another example - there was an issue with keystone accessing the token
database when you have several keystone instances running, or there was some
desync between these instances: e.g. you fetched the token at keystone #1 and
then verified it against keystone #2, and keystone #2 had trouble verifying it
not because the token was bad, but because keystone #2 itself had some issues.
We would get a 401 error, and instead of rerunning puppet we would just need
to handle this issue locally by retrying the request.

[0] http://permalink.gmane.org/gmane.comp.cloud.openstack.devel/66423

On Thu, Oct 15, 2015 at 12:23 PM, Gilles Dubreuil gilles@redhat.com wrote:

On 15/10/15 12:42, Matt Fischer wrote:

On Thu, Oct 8, 2015 at 5:38 AM, Vladimir Kuklin <vkuklin@mirantis.com> wrote:

Hi, folks

* Intro

Per our discussion at Meeting #54 [0] I would like to propose the
uniform approach of exception handling for all puppet-openstack
providers accessing any types of OpenStack APIs.

* Problem Description

While working on Fuel during deployment of multi-node HA-aware
environments we faced many intermittent operational issues, e.g.:

401/403 authentication failures when we were doing scaling of
OpenStack controllers due to difference in hashing view between
keystone instances
503/502/504 errors due to temporary connectivity issues

The 5xx errors are not connectivity issues:

500 Internal Server Error
501 Not Implemented
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
505 HTTP Version Not Supported

I believe nothing should be done to trap them.

The connectivity issues are different matter (to be addressed as
mentioned by Matt)

non-idempotent operations like deletion or creation - e.g. if you
are deleting an endpoint and someone is deleting on the other node
and you get 404 - you should continue with success instead of
failing. 409 Conflict error should also signal us to re-fetch
resource parameters and then decide what to do with them.

Obviously, it is not optimal to rerun puppet to correct such errors
when we can just handle an exception properly.

* Current State of Art

There is some exception handling, but it does not cover all the
aforementioned use cases.

* Proposed solution

Introduce a library of exception handling methods which should be
the same for all puppet openstack providers as these exceptions seem
to be generic. Then, for each of the providers we can introduce
provider-specific libraries that will inherit from this one.

Our mos-puppet team could add this into their backlog and could work
on that in upstream or downstream and propose it upstream.

What do you think on that, puppet folks?

The real issue is because we're dealing with openstackclient, a CLI tool
and not an API. Therefore no error propagation is expected.

Using REST interfaces for all Openstack API would provide all HTTP errors:

Check for "HTTP Response Classes" in
http://ruby-doc.org/stdlib-2.2.3/libdoc/net/http/rdoc/Net/HTTP.html

[0]

http://eavesdrop.openstack.org/meetings/puppet_openstack/2015/puppet_openstack.2015-10-06-15.00.html

I think that we should look into some solutions here as I'm generally
for something we can solve once and re-use. Currently we solve some of
this at TWC by serializing our deploys and disabling puppet site wide
while we do so. This avoids the issue of Keystone on one node removing
and endpoint while the other nodes (who still have old code) keep trying
to add it back.

For connectivity issues especially after service restarts, we're using
puppet-healthcheck [0] and I'd like to discuss that more in Tokyo as an
alternative to explicit retries and delays. It's in the etherpad so
hopefully you can attend.

+1



--
Yours Faithfully,
Vladimir Kuklin,
Fuel Library Tech Lead,
Mirantis, Inc.
+7 (495) 640-49-04
+7 (926) 702-39-68
Skype kuklinvv
35bk3, Vorontsovskaya Str.
Moscow, Russia,
www.mirantis.com
www.mirantis.ru
vkuklin@mirantis.com


responded Oct 15, 2015 by Vladimir_Kuklin (7,320 points)   1 3 4
0 votes

On Thu, Oct 15, 2015 at 4:10 AM, Vladimir Kuklin vkuklin@mirantis.com
wrote:

Gilles,

5xx errors like 503 and 502/504 could always be intermittent operational
issues. E.g. when you access your keystone backends through some proxy and
there is a connectivity issue between the proxy and backends which
disappears in 10 seconds, you do not need to rerun the puppet completely -
just retry the request.

Regarding "REST interfaces for all Openstack API" - this is very close to
another topic that I raised ([0]) - using native Ruby application and
handle the exceptions. Otherwise whenever we have an OpenStack client
(generic or neutron/glance/etc. one) sending us a message like '[111]
Connection refused' this message is very much determined by the framework
that OpenStack is using within this release for clients. It could be
requests or any other type of framework which sends different text
message depending on its version. So it is very bothersome to write a bunch
of 'if' clauses or gigantic regexps instead of handling simple Ruby
exception. So I agree with you here - we need to work with the API
directly. And, by the way, if you also support switching to native Ruby
OpenStack API client, please feel free to support movement towards it in
the thread [0]

Matt and Gilles,

Regarding puppet-healthcheck - I do not think that puppet-healtcheck
handles exactly what I am mentioning here - it is not running exactly at
the same time as we run the request.

E.g. 10 seconds ago everything was OK, then we had a temporary
connectivity issue, then everything is ok again in 10 seconds. Could you
please describe how puppet-healthcheck can help us solve this problem?

You are right, it probably won't. At that point you are using puppet to
work around some fundamental issues in your OpenStack deployment.

Or another example - there was an issue with keystone accessing the token
database when you have several keystone instances running, or there was some
desync between these instances: e.g. you fetched the token at keystone #1 and
then verified it against keystone #2, and keystone #2 had trouble verifying it
not because the token was bad, but because keystone #2 itself had some issues.
We would get a 401 error, and instead of rerunning puppet we would just need
to handle this issue locally by retrying the request.

[0] http://permalink.gmane.org/gmane.comp.cloud.openstack.devel/66423

Another one that is a deployment architecture problem. We solved this by
configuring the load balancer to direct keystone traffic to a single db
node; now we solve it with Fernet tokens. If you have this specific issue,
it is going to manifest in all kinds of strange ways and can even hit
control services like neutron/nova as well. Which means that even if we get
puppet to pass with a bunch of retries, OpenStack is not healthy and the
users will not be happy about it.

I don't want to give the impression that I am completely opposed to
retries, but on the other hand, when my deployment is broken, I want to
know quickly, not after 10 minutes of retries, so we need to balance that.
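
One way to strike that balance (the names and numbers below are only
illustrative) is to cap both the attempt count and the total time spent
retrying, so a genuinely broken deployment still fails within seconds rather
than minutes:

    # Illustration only: retry briefly, but give up once a small time budget is
    # spent, so a broken deployment surfaces quickly instead of after many minutes.
    def with_bounded_retries(max_attempts: 3, budget_seconds: 30)
      started = Time.now
      attempt = 0
      begin
        attempt += 1
        yield
      rescue StandardError
        raise if attempt >= max_attempts || (Time.now - started) > budget_seconds
        sleep([2**attempt, 10].min)      # capped exponential backoff
        retry
      end
    end

    # with_bounded_retries { create_endpoint }   # create_endpoint is hypothetical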


responded Oct 15, 2015 by Matt_Fischer (9,340 points)   1 4 8
0 votes

Matt

You are right, it probably won't. At that point you are using puppet to
work around some fundamental issues in your OpenStack deployment.

Actually, as you know, with Fuel we are shipping our code to people who
have their own infrastructure. We do not have any control over that
infrastructure, or any information about it. So we should expect the worst:
that such issues will sometimes happen, and that we need to take care of
them in the best possible way, e.g. someone tripped over a cable and then
plugged it back into the switch. And it seems we can do that right in the
puppet code instead of making the user wait for a puppet rerun.

Another one that is a deployment architecture problem. We solved this by
configuring the load balancer to direct keystone traffic to a single db
node, now we solve it with Fernet tokens. If you have this
specific issue above it's going to manifest in all kinds of strange ways
and can even happen to control services like neutron/nova etc as well.
Which means even if we get puppet to pass with a bunch of
retries, OpenStack is not healthy and the users will not be happy about
it.

Again, what you described is the case where the system was in some
undesirable state, like reading from the wrong database, and then settled
into a persistently working state. And you solve that by making the load
balancer aware of which backend to send requests to. But I am talking about
sporadic failures which, statistically, look negligible and should not be
handled by the load balancer. Imagine the situation where the load balancer
is happy with a backend, but that backend hits an intermittent operational
issue like returning a garbled response or tripping over a bug in the code.
This is a sporadic failure that will not be caught by the load balancer,
because if you make the balancer that sensitive to such issues it will
behave poorly. So I think the best option here is to handle such issues at
the application level.

On Thu, Oct 15, 2015 at 4:37 PM, Matt Fischer matt@mattfischer.com wrote:

On Thu, Oct 15, 2015 at 4:10 AM, Vladimir Kuklin vkuklin@mirantis.com
wrote:

Gilles,

5xx errors like 503 and 502/504 could always be intermittent operational
issues. E.g. when you access your keystone backends through some proxy and
there is a connectivity issue between the proxy and backends which
disappears in 10 seconds, you do not need to rerun the puppet completely -
just retry the request.

Regarding "REST interfaces for all Openstack API" - this is very close
to another topic that I raised ([0]) - using native Ruby application and
handle the exceptions. Otherwise whenever we have an OpenStack client
(generic or neutron/glance/etc. one) sending us a message like '[111]
Connection refused' this message is very much determined by the framework
that OpenStack is using within this release for clients. It could be
requests or any other type of framework which sends different text
message depending on its version. So it is very bothersome to write a bunch
of 'if' clauses or gigantic regexps instead of handling simple Ruby
exception. So I agree with you here - we need to work with the API
directly. And, by the way, if you also support switching to native Ruby
OpenStack API client, please feel free to support movement towards it in
the thread [0]

Matt and Gilles,

Regarding puppet-healthcheck - I do not think that puppet-healtcheck
handles exactly what I am mentioning here - it is not running exactly at
the same time as we run the request.

E.g. 10 seconds ago everything was OK, then we had a temporary
connectivity issue, then everything is ok again in 10 seconds. Could you
please describe how puppet-healthcheck can help us solve this problem?

You are right, it probably won't. At that point you are using puppet to
work around some fundamental issues in your OpenStack deployment.

Or another example - there was an issue with keystone accessing the token
database when you have several keystone instances running, or there was some
desync between these instances: e.g. you fetched the token at keystone #1 and
then verified it against keystone #2, and keystone #2 had trouble verifying it
not because the token was bad, but because keystone #2 itself had some issues.
We would get a 401 error, and instead of rerunning puppet we would just need
to handle this issue locally by retrying the request.

[0] http://permalink.gmane.org/gmane.comp.cloud.openstack.devel/66423

Another one that is a deployment architecture problem. We solved this by
configuring the load balancer to direct keystone traffic to a single db
node, now we solve it with Fernet tokens. If you have this specific issue
above it's going to manifest in all kinds of strange ways and can even
happen to control services like neutron/nova etc as well. Which means even
if we get puppet to pass with a bunch of retries, OpenStack is not healthy
and the users will not be happy about it.

I don't want to give them impression that I am completely opposed to
retries, but on the other hand, when my deployment is broken, I want to
know quickly, not after 10 minutes of retries, so we need to balance that.



--
Yours Faithfully,
Vladimir Kuklin,
Fuel Library Tech Lead,
Mirantis, Inc.
+7 (495) 640-49-04
+7 (926) 702-39-68
Skype kuklinvv
35bk3, Vorontsovskaya Str.
Moscow, Russia,
www.mirantis.com
www.mirantis.ru
vkuklin@mirantis.com


responded Oct 15, 2015 by Vladimir_Kuklin (7,320 points)   1 3 4
0 votes

On 15/10/15 21:10, Vladimir Kuklin wrote:
Gilles,

5xx errors like 503 and 502/504 could always be intermittent operational
issues. E.g. when you access your keystone backends through some proxy
and there is a connectivity issue between the proxy and backends which
disappears in 10 seconds, you do not need to rerun the puppet completely
- just retry the request.

Look, I don't have much experience with those errors in real-world
scenarios. And this is just a detail for my understanding: those
errors are coming from a running HTTP service, therefore this is not a
connectivity issue to the service itself but something wrong beyond it.

Regarding "REST interfaces for all Openstack API" - this is very close
to another topic that I raised ([0]) - using native Ruby application and
handle the exceptions. Otherwise whenever we have an OpenStack client
(generic or neutron/glance/etc. one) sending us a message like '[111]
Connection refused' this message is very much determined by the
framework that OpenStack is using within this release for clients. It
could be requests or any other type of framework which sends different
text message depending on its version. So it is very bothersome to write
a bunch of 'if' clauses or gigantic regexps instead of handling simple
Ruby exception. So I agree with you here - we need to work with the API
directly. And, by the way, if you also support switching to native Ruby
OpenStack API client, please feel free to support movement towards it in
the thread [0]

Yes, I totally agree with you on that approach (a native Ruby lib).
That is why I mentioned it here: for me, the exception handling would
then be solved at once.

Matt and Gilles,

Regarding puppet-healthcheck - I do not think that puppet-healtcheck
handles exactly what I am mentioning here - it is not running exactly at
the same time as we run the request.

E.g. 10 seconds ago everything was OK, then we had a temporary
connectivity issue, then everything is ok again in 10 seconds. Could you
please describe how puppet-healthcheck can help us solve this problem?

Or another example - there was an issue with keystone accessing token
database when you have several keystone instances running, or there was
some desync between these instances, e.g. you fetched the token at
keystone #1 and then you verify it again keystone #2. Keystone #2 had
some issues verifying it not due to the fact that token was bad, but due
to the fact that that keystone #2 had some issues. We would get 401
error and instead of trying to rerun the puppet we would need just to
handle this issue locally by retrying the request.

[0] http://permalink.gmane.org/gmane.comp.cloud.openstack.devel/66423

On Thu, Oct 15, 2015 at 12:23 PM, Gilles Dubreuil <gilles@redhat.com> wrote:

On 15/10/15 12:42, Matt Fischer wrote:
>
>
> On Thu, Oct 8, 2015 at 5:38 AM, Vladimir Kuklin <vkuklin@mirantis.com> wrote:
>
>     Hi, folks
>
>     * Intro
>
>     Per our discussion at Meeting #54 [0] I would like to propose the
>     uniform approach of exception handling for all puppet-openstack
>     providers accessing any types of OpenStack APIs.
>
>     * Problem Description
>
>     While working on Fuel during deployment of multi-node HA-aware
>     environments we faced many intermittent operational issues, e.g.:
>
>     401/403 authentication failures when we were doing scaling of
>     OpenStack controllers due to difference in hashing view between
>     keystone instances
>     503/502/504 errors due to temporary connectivity issues

The 5xx errors are not connectivity issues:

500 Internal Server Error
501 Not Implemented
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
505 HTTP Version Not Supported

I believe nothing should be done to trap them.

The connectivity issues are different matter (to be addressed as
mentioned by Matt)

>     non-idempotent operations like deletion or creation - e.g. if you
>     are deleting an endpoint and someone is deleting on the other node
>     and you get 404 - you should continue with success instead of
>     failing. 409 Conflict error should also signal us to re-fetch
>     resource parameters and then decide what to do with them.
>
>     Obviously, it is not optimal to rerun puppet to correct such errors
>     when we can just handle an exception properly.
>
>     * Current State of Art
>
>     There is some exception handling, but it does not cover all the
>     aforementioned use cases.
>
>     * Proposed solution
>
>     Introduce a library of exception handling methods which should be
>     the same for all puppet openstack providers as these exceptions seem
>     to be generic. Then, for each of the providers we can introduce
>     provider-specific libraries that will inherit from this one.
>
>     Our mos-puppet team could add this into their backlog and could work
>     on that in upstream or downstream and propose it upstream.
>
>     What do you think on that, puppet folks?
>

The real issue is because we're dealing with openstackclient, a CLI tool
and not an API. Therefore no error propagation is expected.

Using REST interfaces for all Openstack API would provide all HTTP
errors:

Check for "HTTP Response Classes" in
http://ruby-doc.org/stdlib-2.2.3/libdoc/net/http/rdoc/Net/HTTP.html


>     [0] http://eavesdrop.openstack.org/meetings/puppet_openstack/2015/puppet_openstack.2015-10-06-15.00.html
>
>
> I think that we should look into some solutions here as I'm generally
> for something we can solve once and re-use. Currently we solve some of
> this at TWC by serializing our deploys and disabling puppet site wide
> while we do so. This avoids the issue of Keystone on one node removing
> and endpoint while the other nodes (who still have old code) keep trying
> to add it back.
>
> For connectivity issues especially after service restarts, we're using
> puppet-healthcheck [0] and I'd like to discuss that more in Tokyo as an
> alternative to explicit retries and delays. It's in the etherpad so
> hopefully you can attend.

+1

>
> [0] - https://github.com/puppet-community/puppet-healthcheck
>
>
>
>

--
Yours Faithfully,
Vladimir Kuklin,
Fuel Library Tech Lead,
Mirantis, Inc.
+7 (495) 640-49-04
+7 (926) 702-39-68
Skype kuklinvv
35bk3, Vorontsovskaya Str.
Moscow, Russia,
www.mirantis.com
www.mirantis.ru
vkuklin@mirantis.com


responded Oct 16, 2015 by Gilles_Dubreuil (1,420 points)   2
...