
[openstack-dev] [nova][oslo][openstack-ansible] DB deadlocks, Mitaka, and You


Hello all,

As some folks may know, the OSIC cloud1 just upgraded to Mitaka last
week (I know, what's old is new again, sorry). Since the upgrade things
have been running fairly smoothly; however, there's been one issue that
has left me scratching my head. When attempting a scale-out test we ran
into issues where the nova client returned a message indicating it had
encountered a DB deadlock [0]. To gather more information we enabled
debug logging and collected the following [1] (here is a pretty version
of the tracebacks [4]). This is the command we're using to build all of
the VMs [2]; it was used to build 3 VMs per compute node across 242
compute nodes. Once the instances are all online we grab each node's
IPv6 address and ensure we're able to SSH to it. While we're happy to
report that almost all of the VMs came online in one shot, we did run
into a few of these DB deadlock messages and would like to see if folks
out on the interwebs have seen or experienced such problems. If so,
we'd love to know if there's some remediation we can use to make this
all even happier. It should be noted that this is not a major issue at
this point (only 3 out of 726 VMs didn't come online due to the
problem), but it would be amazing if there was something we could do to
resolve this for good.
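
For reference, a rough Python sketch of this kind of boot-and-check
loop is below. It uses the OpenStack SDK purely as an illustration; the
cloud name, image, flavor, and network IDs are placeholders, not the
actual command from [2].

import socket

import openstack  # openstacksdk

conn = openstack.connect(cloud="cloud1")  # assumes a clouds.yaml entry


def boot_and_check(name, image_id, flavor_id, network_id):
    server = conn.compute.create_server(
        name=name,
        image_id=image_id,
        flavor_id=flavor_id,
        networks=[{"uuid": network_id}],
    )
    # Block until nova reports the instance ACTIVE (raises on ERROR/timeout).
    server = conn.compute.wait_for_server(server, status="ACTIVE", wait=600)

    # Grab the instance's IPv6 address and make sure SSH (port 22) answers.
    for addrs in server.addresses.values():
        for addr in addrs:
            if addr.get("version") == 6:
                sock = socket.create_connection((addr["addr"], 22), timeout=30)
                sock.close()
                return True
    return False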

Other potentially interesting things:
* DB is MariaDB ("mariadb-galera-server-10.0"). The replication system
uses xtrabackup from Percona, version "2.3.5-1.trusty".
* The DB is a 3-node cluster; however, the connection to the cluster
goes through a VIP on our load balancer, and the services are only ever
connected to one of the three nodes, for both reads and writes. Should
a node become unhappy, the load balancer promotes and demotes backends,
always ensuring only one node is being connected to.
* While reproducing this issue repeatedly we've watched wsrep to see if
nodes were dropping or otherwise having a bad day. To our dismay there
were no unhappy nodes and the wsrep state seemed to remain OPERATIONAL
with minimal latency (example [3]; a minimal polling sketch follows
this list).
* All of the OpenStack services use PyMySQL as their DB driver. We were
using mysql-python, but I believe OpenStack-Ansible switched in Kilo
due to DB deadlock issues we were experiencing.
* Cloud1 is running nova at commit
"dd30603f91e6fd3d1a4db452f20a51ba8820e1f4", which was the HEAD of
"stable/mitaka" on 29-09-2016. This code is in production today with
absolutely no modifications.
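
On the wsrep point above, here is a minimal sketch of that kind of
status polling; it uses the same PyMySQL driver mentioned in the list,
and the host and credentials are placeholders.

import time

import pymysql

WSREP_KEYS = ("wsrep_local_state_comment", "wsrep_cluster_status",
              "wsrep_cluster_size", "wsrep_local_recv_queue_avg")


def poll_wsrep(host, user, password, interval=5):
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        while True:
            with conn.cursor() as cur:
                # Each row comes back as (Variable_name, Value).
                cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_%'")
                status = dict(cur.fetchall())
            print({key: status.get(key) for key in WSREP_KEYS})
            time.sleep(interval)
    finally:
        conn.close()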

Any insight would be greatly appreciated. Thanks again everyone!

[0] http://paste.openstack.org/show/586302/
[1] http://paste.openstack.org/show/586309/
[2] http://paste.openstack.org/show/586298/
[3] http://paste.openstack.org/show/586306/
[4] http://paste.openstack.org/show/586307/


asked Oct 19, 2016 in openstack-dev by Carter,_Kevin

5 Responses


On 10/18/2016 9:16 PM, Carter, Kevin wrote:
[original message quoted in full; snipped]

It doesn't look like the same stack trace, but do you have this patch?

https://review.openstack.org/#/c/367508/

That was a known deadlock when creating an instance that we'd see in the
upstream CI system.

It looks like you're failing in the compute manager, so it's confusing
to me how you'd get a DBDeadlock back in the nova CLI, given we've
already returned the API response by the time we cast to a compute node
to build the instance.
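
For anyone following along, the "cast" here is a fire-and-forget RPC,
which is why a later failure has no reply path back to the CLI. A rough
sketch of the cast/call distinction with oslo.messaging is below; this
is not nova's actual rpcapi code, and the transport URL, topic, and
method arguments are placeholders.

import oslo_messaging
from oslo_config import cfg

transport = oslo_messaging.get_transport(
    cfg.CONF, url="rabbit://guest:guest@127.0.0.1:5672/")
target = oslo_messaging.Target(topic="compute", server="compute-001")
client = oslo_messaging.RPCClient(transport, target)
ctxt = {}  # request context; a plain dict is enough for this sketch

# cast(): fire-and-forget. nova-api returns 202 to the client right
# after casting, so an exception raised later on the compute node (e.g.
# DBDeadlock in the compute manager) cannot surface in the CLI response.
client.cast(ctxt, "build_and_run_instance")  # real arguments elided

# call(): blocks for a reply, so a remote exception would be re-raised
# here; instance builds do not use this pattern.
# result = client.call(ctxt, "some_synchronous_method")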

--

Thanks,

Matt Riedemann


responded Oct 19, 2016 by Matt_Riedemann

-----Original Message-----
From: Carter, Kevin kevin@cloudnull.com
Reply: OpenStack Development Mailing List (not for usage questions)
Date: October 18, 2016 at 21:18:13
To: OpenStack Development Mailing List (not for usage questions)
Subject: [openstack-dev] [nova][oslo][openstack-ansible] DB deadlocks, Mitaka, and You

[original message quoted in full; snipped]

Hey Kevin,

So just looking at the pastes you have here, I'm inclined to believe
this is actually a bug in oslo.db/SQLAlchemy. If you follow the trace,
there's a PyMySQL InternalError not being handled inside of SQLAlchemy.
I'm not sure if SQLAlchemy considers InternalErrors to be something it
cannot retry, or something that the user should decide how to handle,
but I would start chatting with the folks who work on oslo.db and
SQLAlchemy in the community.

That said, this also looks like something that should be reported to
Nova. Something causing an unhandled exception is definitely bug-worthy
(even if the fix belongs somewhere in one of its dependencies).

--
Ian Cordasco


responded Oct 19, 2016 by sigmavirus24_at_gmai

On 10/19/2016 08:36 AM, Ian Cordasco wrote:
Hey Kevin,

So just looking at the pastes you have here, I'm inclined to believe
this is actually a bug in oslo.db/SQLAlchemy. If you follow the trace,
there's a PyMySQL InternalError not being handled inside of SQLAlchemy.
I'm not sure if SQLAlchemy considers InternalErrors to be something it
cannot retry, or something that the user should decide how to handle,
but I would start chatting with the folks who work on oslo.db and
SQLAlchemy in the community.

SQLAlchemy itself does not retry transactions. A retry is typically at
the method level where the calling application (nova in this case) would
make use of the oslo retry decorator, seen here:
https://github.com/openstack/oslo.db/blob/master/oslo_db/api.py#L85 .
This decorator is configured to retry based on specific oslo-level
exceptions being intercepted, of which DBDeadlock is the primary
exception this function was written for.

In this case, both stack traces illustrate the error being thrown is
DBDeadlock, which is an oslo-db-specific error that is the result of the
correct handling of this PyMySQL error code. The original error object
is maintained as a data member of DBDeadlock so that the source of the
DBDeadlock can be seen. The declaration of this interception is here:
https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/exc_filters.py#L56
. SQLAlchemy re-throws this user-generated exception in the context of
the original, so in Python 2 where stack traces are still a confusing
affair, it's hard to see that this interception occurred, but DBDeadlock
indicates that it has.
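
To make that pattern concrete, here is a minimal sketch of how a caller
opts into retries and how the wrapped driver error stays reachable.
This is not nova's actual code; the connection URL, table, and function
are made up for illustration.

from oslo_db import api as oslo_db_api
from oslo_db import exception as db_exc
from oslo_db.sqlalchemy import engines
from sqlalchemy import text

# Engines created through oslo.db have the exc_filters handlers
# attached, so driver error 1213 ("Deadlock found ...") surfaces as
# DBDeadlock rather than a raw PyMySQL exception.
engine = engines.create_engine("mysql+pymysql://nova:secret@127.0.0.1/nova")


@oslo_db_api.wrap_db_retry(max_retries=5, retry_on_deadlock=True)
def bump_usage(project_id):
    # If this statement deadlocks, oslo.db rewraps the driver error as
    # DBDeadlock and the decorator retries the whole call (with backoff)
    # instead of letting it propagate.
    with engine.begin() as conn:
        conn.execute(
            text("UPDATE quota_usages SET in_use = in_use + 1"
                 " WHERE project_id = :pid"),
            {"pid": project_id},
        )


try:
    bump_usage("some-project")
except db_exc.DBDeadlock as exc:
    # Retries exhausted; the original driver exception is still reachable
    # as exc.inner_exception, which is how the PyMySQL InternalError shows
    # up underneath DBDeadlock in the pasted tracebacks.
    print("still deadlocked after retries:", exc.inner_exception)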

That said, this also looks like something that should be reported to
Nova. Something causing an unhandled exception is definitely bug-worthy
(even if the fix belongs somewhere in one of its dependencies).

--
Ian Cordasco


responded Oct 19, 2016 by Mike_Bayer

Hi Matt and thanks for the reply,

We do have that commit as found here: [
https://github.com/openstack/nova/blob/dd30603f91e6fd3d1a4db452f20a51ba8820e1f4/nova/db/sqlalchemy/api.py#L1846
]. If there's anything you'd like to see as we're trying to figure
this out, I'd be happy to provide {any,every}thing.


responded Oct 19, 2016 by Carter,_Kevin

On 10/19/2016 9:47 AM, Carter, Kevin wrote:
Hi Matt and thanks for the reply,

We do have that commit as found here: [
https://github.com/openstack/nova/blob/dd30603f91e6fd3d1a4db452f20a51ba8820e1f4/nova/db/sqlalchemy/api.py#L1846
]. If there's anything you'd like to see as we're trying to figure
this out, I'd be happy to provide {any,every}thing.



Well, I still don't understand how you get a DBDeadlock failure in
n-cpu while building the VM that then comes back to the client via
n-api: by the time you hit a failure in n-cpu building the VM, we
should have already cast from the API and responded with a 202 to the
client.

--

Thanks,

Matt Riedemann


responded Oct 19, 2016 by Matt_Riedemann