[openstack-dev] [tripleo] rh1 outage today

Hi all,

As you may or may not have noticed, all ovb jobs on rh1 started failing
sometime last night. After some investigation today I found a few issues.

First, our nova db archiving wasn't working. This was due to the
auto-increment counter issue described by melwitt in
http://lists.openstack.org/pipermail/openstack-dev/2017-September/122903.html
Deleting the problematic rows from the shadow table got us past that.
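
(For context, the archiving job is basically a cron wrapper around
nova-manage. A minimal Python sketch of that kind of wrapper follows; the
batch size and error handling are illustrative assumptions, not the exact
job running on rh1.)

    # Minimal sketch of a cron wrapper around nova's db archiving.
    # The --max_rows batch size here is an illustrative assumption.
    import subprocess
    import sys

    cmd = ["nova-manage", "db", "archive_deleted_rows", "--max_rows", "1000"]
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    print(result.stdout)
    # Propagate failures (e.g. the duplicate-key error hit when the shadow
    # table already holds conflicting rows) instead of letting cron swallow
    # them silently.
    sys.exit(result.returncode)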

On another db-related note, we seem to have turned ceilometer back on at
some point in rh1. I think that was intentional to avoid notification
queues backing up, but it led to a different problem. We had
approximately 400 GB of mongodb data from ceilometer that we don't
actually care about. I cleaned that up and set a TTL in ceilometer so
hopefully this won't happen again.
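
(The TTL referred to here is presumably ceilometer's metering_time_to_live
option. For spotting this kind of growth in the first place, a quick pymongo
check of per-database storage size is enough; the connection URI and the
100 GB threshold below are placeholder assumptions.)

    # Quick sketch for spotting runaway mongodb growth. The connection URI
    # and the 100 GB threshold are placeholder assumptions.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    for name in client.list_database_names():
        stats = client[name].command("dbStats")
        size_gb = stats["storageSize"] / (1024 ** 3)
        print("%-20s %8.1f GB" % (name, size_gb))
        if size_gb > 100:
            print("  WARNING: %s looks suspiciously large" % name)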

Unfortunately neither of these things completely resolved the extreme
slowness in the cloud that was causing every testenv to fail. After
trying a number of things that made no difference, the culprit seems to
have been rabbitmq. There was nothing obviously wrong with it according
to the web interface, the queues were all short and messages seemed to
be getting delivered. However, when I ran rabbitmqctl status at the CLI
it reported that the node was down. Since something was clearly wrong I
went ahead and restarted it. After that everything seems to be back to
normal.
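
(A simple out-of-band check that would have flagged this is to trust the
rabbitmqctl exit status rather than the web interface; a sketch, with the
action on failure left as a placeholder:)

    # Sketch of an out-of-band rabbitmq health check: rabbitmqctl status
    # exits non-zero when it cannot reach the node, even if the web UI
    # still looks healthy. What to do on failure is left as a placeholder.
    import subprocess

    def rabbit_node_up():
        return subprocess.call(["rabbitmqctl", "status"],
                               stdout=subprocess.DEVNULL,
                               stderr=subprocess.DEVNULL) == 0

    if __name__ == "__main__":
        if not rabbit_node_up():
            print("rabbitmq node reports down; restart or page someone")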

I'm not sure exactly what the cause of all this was. We did get kind of
inundated with jobs yesterday after a zuul restart, which I think is what
probably pushed us over the edge, but that has happened before without
bringing the cloud down. It was probably a combination of some
previously unnoticed issues stacking up over time and the large number
of testenvs requested all at once.

In any case, testenvs are creating successfully again and the jobs in
the queue look good so far. If you notice any problems, please let me
know, though. I'm hoping this will help with the job timeouts, but that
remains to be seen.

-Ben


asked Oct 30, 2017 in openstack-dev by Ben_Nemec

3 Responses


Thanks for the postmortem; it's always a good read to learn stuff :)

On 28 Oct 2017 00:11, "Ben Nemec" <openstack@nemebean.com> wrote:

Hi all,

As you may or may not have noticed all ovb jobs on rh1 started failing
sometime last night. After some investigation today I found a few issues.

First, our nova db archiving wasn't working. This was due to the
auto-increment counter issue described by melwitt in
http://lists.openstack.org/pipermail/openstack-dev/2017-September/122903.html
Deleting the problematic rows from the shadow table got us past that.

On another db-related note, we seem to have turned ceilometer back on at
some point in rh1. I think that was intentional to avoid notification
queues backing up, but it led to a different problem. We had approximately
400 GB of mongodb data from ceilometer that we don't actually care about.
I cleaned that up and set a TTL in ceilometer so hopefully this won't
happen again.

Is there an alarm or something we could set to get notified about this kind
of stuff? Or better yet, something we could automate to avoid this? What's
using mongodb nowadays?

Unfortunately neither of these things completely resolved the extreme
slowness in the cloud that was causing every testenv to fail. After trying
a number of things that made no difference, the culprit seems to have been
rabbitmq. There was nothing obviously wrong with it according to the web
interface, the queues were all short and messages seemed to be getting
delivered. However, when I ran rabbitmqctl status at the CLI it reported
that the node was down. Since something was clearly wrong I went ahead and
restarted it. After that everything seems to be back to normal.

Same question as above: could we set an alarm or automate the node
recovery?

responded Oct 28, 2017 by Juan_Antonio_Osorio

It turns out this wasn't quite resolved yet. I was still seeing some
excessively long stack creation times today, and the culprit was one of
our compute nodes with virtualization turned off. This caused all of its
instances to fail and need a retry. Once I disabled the compute service
on that node, stacks seemed to be creating in a normal amount of time
again.

This happened because the node had some hardware issues, and apparently
the fix was to replace the system board, so we got it back with
everything set to default. I fixed the setting and re-enabled the node,
and all seems well again.
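
(For what it's worth, a check like the following on each compute node would
catch a box coming back from maintenance with virtualization disabled: when
VT-x/AMD-V is off in firmware the kvm module can't load, so /dev/kvm never
shows up. Just a sketch.)

    # Sketch: flag a compute node that came back from maintenance with
    # hardware virtualization disabled. With VT-x/AMD-V off in firmware
    # the kvm_intel/kvm_amd module fails to load and /dev/kvm is absent.
    import os
    import sys

    if not os.path.exists("/dev/kvm"):
        print("WARNING: /dev/kvm missing; KVM unavailable on this node")
        sys.exit(1)
    print("/dev/kvm present; KVM acceleration available")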

On 10/28/2017 02:07 AM, Juan Antonio Osorio wrote:
Thanks for the postmortem; it's always a good read to learn stuff :)

On 28 Oct 2017 00:11, "Ben Nemec" <openstack@nemebean.com> wrote:

Hi all,

As you may or may not have noticed all ovb jobs on rh1 started
failing sometime last night.  After some investigation today I found
a few issues.

First, our nova db archiving wasn't working.  This was due to the
auto-increment counter issue described by melwitt in
http://lists.openstack.org/pipermail/openstack-dev/2017-September/122903.html
Deleting the problematic rows from the shadow table got us past that.

On another db-related note, we seem to have turned ceilometer back
on at some point in rh1.  I think that was intentional to avoid
notification queues backing up, but it led to a different problem. 
We had approximately 400 GB of mongodb data from ceilometer that we
don't actually care about.  I cleaned that up and set a TTL in
ceilometer so hopefully this won't happen again.

Is there an alarm or something we could set to get notified about this
kind of stuff? Or better yet, something we could automate to avoid this?
What's using mongodb nowadays?

Setting a TTL should avoid this in the future. Note that I don't think
mongo is still used by default, but in our old Mitaka version it was.

For the nova archiving thing I think we'd have to set up email
notifications for failed cron jobs. That would be a good RFE.
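
(The RFE could be as simple as a wrapper that mails the job output on a
non-zero exit; a rough sketch, with the addresses and SMTP host as
placeholders:)

    # Rough sketch of the failed-cron-job notification idea: run the
    # archiving command and mail its output if it exits non-zero. The
    # addresses and SMTP host are placeholders.
    import smtplib
    import subprocess
    from email.mime.text import MIMEText

    cmd = ["nova-manage", "db", "archive_deleted_rows", "--max_rows", "1000"]
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, universal_newlines=True)
    if proc.returncode != 0:
        msg = MIMEText(proc.stdout)
        msg["Subject"] = "rh1: nova db archiving failed"
        msg["From"] = "cron@example.com"
        msg["To"] = "ops@example.com"
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)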

Unfortunately neither of these things completely resolved the
extreme slowness in the cloud that was causing every testenv to
fail.  After trying a number of things that made no difference, the
culprit seems to have been rabbitmq.  There was nothing obviously
wrong with it according to the web interface, the queues were all
short and messages seemed to be getting delivered.  However, when I
ran rabbitmqctl status at the CLI it reported that the node was
down.  Since something was clearly wrong I went ahead and restarted
it.  After that everything seems to be back to normal.

Same question as above: could we set an alarm or automate the node
recovery?

On this one I have no idea. As I noted, when I looked at the rabbit web
ui everything looked fine. This isn't like the notification queue
problem where one look at the queue lengths made it obvious something
was wrong. Messages were being delivered successfully, just very, very
slowly. Maybe looking at messages per second would help, but that would
be hard to automate. You'd have to know if there were few messages
going through because of performance issues or if the cloud is just
under light load.

I guess it's also worth noting that at some point this cloud is going
away in favor of RDO cloud. Of course we said that back in December
when we discussed the OVS port exhaustion issue and now 11 months later
it still hasn't happened. That's why I haven't been too inclined to
pursue extensive monitoring for the existing cloud though.
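
(If anyone does want to experiment, the rabbitmq management API exposes
those per-second rates; a rough sketch below, with host and credentials as
placeholders. Deciding what rate counts as "too low" is still the hard
part.)

    # Rough sketch of pulling message rates from the rabbitmq management
    # API. Host and credentials are placeholders; picking a sane threshold
    # is the hard part described above.
    import requests

    resp = requests.get("http://rabbit.example.com:15672/api/overview",
                        auth=("guest", "guest"))
    resp.raise_for_status()
    stats = resp.json().get("message_stats", {})
    publish = stats.get("publish_details", {}).get("rate", 0.0)
    deliver = stats.get("deliver_get_details", {}).get("rate", 0.0)
    print("publish: %.1f msg/s, deliver: %.1f msg/s" % (publish, deliver))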

responded Oct 30, 2017 by Ben_Nemec

On 10/30/2017 05:14 PM, Ben Nemec wrote:

For the nova archiving thing I think we'd have to set up email
notifications for failed cron jobs. That would be a good RFE.

And done: https://bugs.launchpad.net/tripleo/+bug/1728737

responded Oct 30, 2017 by Ben_Nemec