
[openstack-dev] [Neutron] Issue with pymysql

0 votes

Hi neutrons,

I'd like to draw your attention to an issue discovered by rally gate job:
http://logs.openstack.org/96/190796/4/check/gate-rally-dsvm-neutron-rally/7a18e43/logs/screen-q-svc.txt.gz?level=TRACE

I don't have bandwidth to take a deep look at it, but my first impression is
that it is some issue with nested transaction support on either the sqlalchemy
or the pymysql side.
Also, besides the errors with nested transactions, there are a lot of Lock wait
timeouts.

I think it makes sense to start with reverting the patch that moves to
pymysql.

Thanks,
Eugene.


OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
asked Jun 12, 2015 in openstack-dev by Eugene_Nikanorov (7,480 points)   1 3 7

16 Responses

0 votes

It is however interesting that both "lock wait timeouts" and "missing
savepoint" errors occur in operations pertaining the same table -
securitygroups in this case.
I wonder if the switch to pymysl has not actually uncovered some other bug
in Neutron.

I have no opposition to a revert, but since this will affect most projects,
it's probably worth finding some time to investigate what is triggering
this failure when sqlalchemy is backed by pymysql before doing that.

Salvatore

On 12 June 2015 at 03:32, Eugene Nikanorov enikanorov@mirantis.com wrote:

Hi neutrons,

I'd like to draw your attention to an issue discovered by rally gate job:

http://logs.openstack.org/96/190796/4/check/gate-rally-dsvm-neutron-rally/7a18e43/logs/screen-q-svc.txt.gz?level=TRACE

I don't have bandwidth to take a deep look at it, but first impression is
that it is some issue with nested transaction support either on sqlalchemy
or pymysql side.
Also, besides errors with nested transactions, there are a lot of Lock
wait timeouts.

I think it makes sense to start with reverting the patch that moves to
pymysql.

Thanks,
Eugene.




OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
responded Jun 12, 2015 by Salvatore_Orlando (12,280 points)   2 5 8
0 votes

On 06/12/2015 01:17 AM, Salvatore Orlando wrote:
It is however interesting that both "lock wait timeouts" and "missing
savepoint" errors occur in operations pertaining the same table -
securitygroups in this case.
I wonder if the switch to pymysl has not actually uncovered some other
bug in Neutron.

I have no opposition to a revert, but since this will affect most
projects, it's probably worth finding some time to investigate what is
triggering this failure when sqlalchemy is backed by pymysql before
doing that.

Right, we knew that the db driver would move some bugs around because
we're no longer blocking python processes on db access (so there used to
be a pseudo synchronization point before you ever got to the database).
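
To make that concrete, here is a minimal sketch (illustrative only; the table,
credentials, and pool sizes are invented, and this is not Neutron code) of the
kind of concurrency the pure-Python driver enables: once eventlet
monkeypatching applies to the driver, greenlets can yield inside database I/O,
so many transactions really do hit MySQL at the same time instead of being
serialized by blocking C calls.

    # Illustrative sketch only -- not Neutron code. Table name, credentials
    # and pool sizes are invented. With the old C driver each DB call blocked
    # the whole process, acting as a crude synchronization point; with
    # PyMySQL the calls below interleave, so lock contention inside MySQL
    # (lock wait timeouts, deadlocks) becomes visible.
    import eventlet
    eventlet.monkey_patch()  # patch sockets/threads before importing drivers

    import sqlalchemy as sa

    engine = sa.create_engine(
        "mysql+pymysql://user:secret@localhost/demo", pool_size=20)

    def create_security_group(i):
        """One green thread: insert a row inside its own transaction."""
        with engine.connect() as conn:
            with conn.begin():
                conn.execute(
                    sa.text("INSERT INTO securitygroups (id, name) "
                            "VALUES (:id, :name)"),
                    {"id": "sg-%d" % i, "name": "default"},
                )

    pool = eventlet.GreenPool(20)
    for i in range(100):
        pool.spawn_n(create_security_group, i)
    pool.waitall()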

My feeling is this should be looked into before it is straight reverted
(are jobs failing beyond Rally?). There are a number of benefits with
the new driver, and we can't get to python3 with the old one.

Rally failing is also an indicator that just such an implicit lock was
behavior that was depended on before, because it will be sending a bunch
of similar operations all at once as a kind of stress test. It would
tend to expose issues like this first.

-Sean

--
Sean Dague
http://dague.net


OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
responded Jun 12, 2015 by Sean_Dague (66,200 points)   4 8 14
0 votes

On Fri, Jun 12, 2015 at 7:13 PM, Sean Dague sean@dague.net wrote:

On 06/12/2015 01:17 AM, Salvatore Orlando wrote:

It is however interesting that both "lock wait timeouts" and "missing
savepoint" errors occur in operations pertaining the same table -
securitygroups in this case.
I wonder if the switch to pymysl has not actually uncovered some other
bug in Neutron.

I have no opposition to a revert, but since this will affect most
projects, it's probably worth finding some time to investigate what is
triggering this failure when sqlalchemy is backed by pymysql before
doing that.

Right, we knew that the db driver would move some bugs around because
we're no longer blocking python processes on db access (so there used to
be a pseudo synchronization point before you ever got to the database).

My feeling is this should be looked into before it is straight reverted
(are jobs failing beyond Rally?). There are a number of benefits with

A quick look at logstash.openstack.org shows some of the stacktraces are
happening in other neutron jobs as well.

the new driver, and we can't get to python3 with the old one.

Agreed, pymysql is not to blame; it looks like we have hit some neutron
issues, so let's try to fix neutron. Just because neutron reverts the
default sql connector doesn't mean operators won't end up trying pymysql.

Rally failing is also an indicator that just such an implicit lock was
behavior that was depended on before, because it will be sending a bunch
of similar operations all at once as a kind of stress test. It would
tend to expose issues like this first.

Glad to see us catch these issues early.

    -Sean

--
Sean Dague
http://dague.net




OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
responded Jun 12, 2015 by Joe_Gordon (24,620 points)   2 5 8
0 votes

On 06/12/2015 06:31 AM, Joe Gordon wrote:

On Fri, Jun 12, 2015 at 7:13 PM, Sean Dague <sean@dague.net> wrote:

On 06/12/2015 01:17 AM, Salvatore Orlando wrote:
> It is however interesting that both "lock wait timeouts" and "missing
> savepoint" errors occur in operations pertaining the same table -
> securitygroups in this case.
> I wonder if the switch to pymysl has not actually uncovered some other
> bug in Neutron.
>
> I have no opposition to a revert, but since this will affect most
> projects, it's probably worth finding some time to investigate what is
> triggering this failure when sqlalchemy is backed by pymysql before
> doing that.

Right, we knew that the db driver would move some bugs around because
we're no longer blocking python processes on db access (so there used to
be a pseudo synchronization point before you ever got to the database).

My feeling is this should be looked into before it is straight reverted
(are jobs failing beyond Rally?). There are a number of benefits with

A quick look at logstash.openstack.org
shows some of the stacktraces are happening in other neutron jobs as well.

the new driver, and we can't get to python3 with the old one.

Agreed, pymysql is not to blame; it looks like we have hit some neutron
issues, so let's try to fix neutron. Just because neutron reverts the
default sql connector doesn't mean operators won't end up trying pymysql.

Rally failing is also an indicator that just such an implicit lock was
behavior that was depended on before, because it will be sending a bunch
of similar operations all at once as a kind of stress test. It would
tend to expose issues like this first.

Glad to see us catch these issues early.

So, looking a little deeper, this is also showing up across the board. I
filed this bug - https://bugs.launchpad.net/neutron/+bug/1464612.

I'm going to trigger and push the revert, as it's blocking fixes for
stable/kilo from being able to merge (so it's cascading).

We can decide if we bring this back for all non-neutron jobs once things
are working again (it lives behind a config var in devstack, so easy to
set it per job).

-Sean

--
Sean Dague
http://dague.net


OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
responded Jun 12, 2015 by Sean_Dague (66,200 points)   4 8 14
0 votes

On Fri, 12 Jun 2015, Joe Gordon wrote:

Glad to see us catch these issues early.

Yes! CI is doing exactly the job it is supposed to be doing here. It
is finding bugs in code. When that happens we should fix the bugs,
not revert. Even if it stalls other stuff.

--
Chris Dent tw:@anticdent freenode:cdent
https://tank.peermore.com/tanks/cdent


OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
responded Jun 12, 2015 by Chris_Dent (7,940 points)   1 4 7
0 votes

On 6/11/15 9:32 PM, Eugene Nikanorov wrote:
Hi neutrons,

I'd like to draw your attention to an issue discovered by rally gate job:
http://logs.openstack.org/96/190796/4/check/gate-rally-dsvm-neutron-rally/7a18e43/logs/screen-q-svc.txt.gz?level=TRACE

I don't have bandwidth to take a deep look at it, but first impression
is that it is some issue with nested transaction support either on
sqlalchemy or pymysql side.
Also, besides errors with nested transactions, there are a lot of Lock
wait timeouts.

I think it makes sense to start with reverting the patch that moves to
pymysql.

My immediate reaction is that this is perhaps a concurrency-related
issue; because PyMySQL is pure python and allows for full blown eventlet
monkeypatching, I wonder if somehow the same PyMySQL connection is being
used in multiple contexts. E.g. one greenlet starts up a savepoint,
using identifier "_3" which is based on a counter that is local to the
SQLAlchemy Connection, but then another greenlet shares that PyMySQL
connection somehow with another SQLAlchemy Connection that uses the same
identifier.

I'm not saying this is a bug in PyMySQL or Eventlet necessarily; it
could be a bug in Neutron itself, since none of this code has ever been
used with a true context-switching greenlet environment at the database
connection level.
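
For anyone less familiar with the mechanism being described, this is roughly
what the nested-transaction pattern looks like at the SQLAlchemy level. It is
a minimal sketch only; the table name and URL are invented, and this is not
the Neutron code path.

    # Minimal sketch of SQLAlchemy's SAVEPOINT support over PyMySQL.
    # Illustrative only: table name and connection URL are invented.
    import sqlalchemy as sa
    from sqlalchemy import exc

    engine = sa.create_engine("mysql+pymysql://user:secret@localhost/demo")

    with engine.connect() as conn:
        with conn.begin():                # outer transaction: BEGIN
            nested = conn.begin_nested()  # SAVEPOINT named from a per-Connection counter
            try:
                conn.execute(
                    sa.text("INSERT INTO securitygroups (id, name) "
                            "VALUES (:id, :name)"),
                    {"id": "sg-1", "name": "default"},
                )
                nested.commit()           # RELEASE SAVEPOINT
            except exc.DBAPIError:
                nested.rollback()         # ROLLBACK TO SAVEPOINT
                raise

If two greenlets ever shared the underlying PyMySQL connection through two
different SQLAlchemy Connections, their per-Connection counters could collide
and a RELEASE or ROLLBACK could name a savepoint the server no longer knows
about, which would be consistent with the "missing savepoint" errors in the
trace.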

Thanks,
Eugene.




OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
responded Jun 12, 2015 by Mike_Bayer (15,260 points)   1 5 6
0 votes

On 6/12/15 6:42 AM, Sean Dague wrote:
On 06/12/2015 06:31 AM, Joe Gordon wrote:

On Fri, Jun 12, 2015 at 7:13 PM, Sean Dague <sean@dague.net> wrote:

 On 06/12/2015 01:17 AM, Salvatore Orlando wrote:
 > It is however interesting that both "lock wait timeouts" and "missing
 > savepoint" errors occur in operations pertaining the same table -
 > securitygroups in this case.
 > I wonder if the switch to pymysl has not actually uncovered some other
 > bug in Neutron.
 >
 > I have no opposition to a revert, but since this will affect most
 > projects, it's probably worth finding some time to investigate what is
 > triggering this failure when sqlalchemy is backed by pymysql before
 > doing that.

 Right, we knew that the db driver would move some bugs around because
 we're no longer blocking python processes on db access (so there used to
 be a pseudo synchronization point before you ever got to the database).

 My feeling is this should be looked into before it is straight reverted
 (are jobs failing beyond Rally?). There are a number of benefits with

A quick look at logstash.openstack.org
shows some of the stacktraces are happening in other neutron jobs as well.

 the new driver, and we can't get to python3 with the old one.

Agreed, pymysql is not to blame; it looks like we have hit some neutron
issues, so let's try to fix neutron. Just because neutron reverts the
default sql connector doesn't mean operators won't end up trying pymysql.

 Rally failing is also an indicator that just such an implicit lock was
 behavior that was depended on before, because it will be sending a bunch
 of similar operations all at once as a kind of stress test. It would
 tend to expose issues like this first.

Glad to see us catch these issues early.
So, looking a little deeper, this is also showing up across the board. I
filed this bug - https://bugs.launchpad.net/neutron/+bug/1464612.

I'm going to trigger and push the revert, as it's blocking fixes for
stable/kilo from being able to merge (so it's cascading).

Please keep me very deeply in the loop on this because it will save
everyone a lot of time if I can decipher the interaction with SQLAlchemy
internals for folks.

We can decide if we bring this back for all non-neutron jobs once things
are working again (it lives behind a config var in devstack, so easy to
set it per job).

-Sean


OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
responded Jun 12, 2015 by Mike_Bayer (15,260 points)   1 5 6
0 votes

Excerpts from Chris Dent's message of 2015-06-12 03:47:02 -0700:

On Fri, 12 Jun 2015, Joe Gordon wrote:

Glad to see us catch these issues early.

Yes! CI is doing exactly the job it is supposed to be doing here. It
is finding bugs in code. When that happens we should fix the bugs,
not revert. Even if it stalls other stuff.

I'd like to offer an alternative path. Please do remember that people
deploy OpenStack from trunk, and we want them to do that, because if an
organization is focused and prepared, continuous deployment will produce
a vibrant OpenStack experience for users, and will keep those operators
in a position to contribute.

We also have hundreds of developers who need the system to work well.
There are maybe only a handful who will work on this issue. It does not
make any sense to me to stall hundreds of developers while a handful
work through something that may take weeks.

So, consider reverting quickly, landing a test that fails at least more
reliably with the current patch, and then have that handful of users who
are capable of fixing it work in their own branch until the problem is
resolved to a level that it won't stand in the way of progress for CD
operators and the development community at large.


OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
responded Jun 12, 2015 by Clint_Byrum (40,940 points)   4 5 9
0 votes

This is exactly what I did in [1].

Cheers,
Armando

[1]
https://review.openstack.org/#/q/status:open+branch:master+topic:neutron-unstable,n,z

On 12 June 2015 at 09:00, Clint Byrum clint@fewbar.com wrote:

Excerpts from Chris Dent's message of 2015-06-12 03:47:02 -0700:

On Fri, 12 Jun 2015, Joe Gordon wrote:

Glad to see us catch these issues early.

Yes! CI is doing exactly the job it is supposed to be doing here. It
is finding bugs in code. When that happens we should fix the bugs,
not revert. Even if it stalls other stuff.

I'd like to offer an alternative path. Please do remember that people
deploy OpenStack from trunk, and we want them to do that, because if an
organization is focused and prepared, continuous deployment will produce
a vibrant OpenStack experience for users, and will keep those operators
in a position to contribute.

We also have hundreds of developers who need the system to work well.
There are maybe only a handful who will work on this issue. It does not
make any sense to me to stall hundreds of developers while a handful
work through something that may take weeks.

So, consider reverting quickly, landing a test that fails at least more
reliably with the current patch, and then have that handful of users who
are capable of fixing it work in their own branch until the problem is
resolved to a level that it won't stand in the way of progress for CD
operators and the development community at large.




OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
responded Jun 12, 2015 by Armando_M. (23,560 points)   2 4 7
0 votes

On 6/12/15 11:37 AM, Mike Bayer wrote:

On 6/11/15 9:32 PM, Eugene Nikanorov wrote:

Hi neutrons,

I'd like to draw your attention to an issue discovered by rally gate job:
http://logs.openstack.org/96/190796/4/check/gate-rally-dsvm-neutron-rally/7a18e43/logs/screen-q-svc.txt.gz?level=TRACE

I don't have bandwidth to take a deep look at it, but first
impression is that it is some issue with nested transaction support
either on sqlalchemy or pymysql side.
Also, besides errors with nested transactions, there are a lot of
Lock wait timeouts.

I think it makes sense to start with reverting the patch that moves
to pymysql.
My immediate reaction is that this is perhaps a concurrency-related
issue; because PyMySQL is pure python and allows for full blown
eventlet monkeypatching, I wonder if somehow the same PyMySQL
connection is being used in multiple contexts. E.g. one greenlet
starts up a savepoint, using identifier "_3" which is based on a
counter that is local to the SQLAlchemy Connection, but then another
greenlet shares that PyMySQL connection somehow with another
SQLAlchemy Connection that uses the same identifier.

Reading more of the log, it seems the main issue is just that there's a
deadlock on inserting into the securitygroups table. The deadlock on
insert can occur because of an index being locked.

I'd be curious to know how many greenlets are running concurrently here,
and what the overall transaction looks like within the operation that is
failing (e.g. does each transaction insert multiple rows into
securitygroups? That would make a deadlock seem more likely).
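
Whatever the root cause turns out to be, one common mitigation for this class
of failure is to retry the whole transaction when MySQL reports a deadlock or
a lock wait timeout. A rough sketch of that pattern follows; the table, URL,
and retry policy are invented for illustration and this is not Neutron's
actual retry machinery.

    # Illustrative sketch, not Neutron code: retry a transaction when MySQL
    # reports a deadlock (errno 1213) or a lock wait timeout (errno 1205).
    import time

    import sqlalchemy as sa
    from sqlalchemy import exc

    engine = sa.create_engine("mysql+pymysql://user:secret@localhost/demo")

    RETRIABLE_MYSQL_ERRORS = {1205, 1213}  # lock wait timeout, deadlock

    def create_security_group(group_id, name="default", attempts=5):
        for attempt in range(attempts):
            try:
                with engine.connect() as conn:
                    with conn.begin():
                        conn.execute(
                            sa.text("INSERT INTO securitygroups (id, name) "
                                    "VALUES (:id, :name)"),
                            {"id": group_id, "name": name},
                        )
                return
            except exc.DBAPIError as err:
                # err.orig is the PyMySQL exception; args[0] is the MySQL errno.
                errno = getattr(err.orig, "args", [None])[0]
                if errno not in RETRIABLE_MYSQL_ERRORS:
                    raise
                time.sleep(0.1 * (attempt + 1))  # crude backoff before retrying
        raise RuntimeError("gave up after %d deadlocked attempts" % attempts)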


OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
responded Jun 12, 2015 by Mike_Bayer (15,260 points)   1 5 6
...