
[openstack-dev] [nova] [all] Excessively high greenlet default + excessively low connection pool defaults leads to connection pool latency, timeout errors, idle database connections / workers


Hi all -

Let me start out with the assumptions I'm going from for what I want to
talk about.

  1. I'm looking at Nova right now, but I think similar things are going
    on in other Openstack apps.

  2. Settings that we see in nova.conf, including:

wsgi_default_pool_size = 1000

max_pool_size =

max_overflow =

osapi_compute_workers =

metadata_workers =

are often not understood by deployers, and/or are left unchanged in a
wide variety of scenarios. If you are in fact working for deployers
that do change these values to something totally different, then you
might not be impacted here; and if it turns out that everyone changes
all these settings in real-world scenarios and zzzeek is just being
silly thinking nobody sets them appropriately, then fooey for me, I guess.

  3. There's talk about more Openstack services, at least Nova from what I
    heard the other day, moving to be based on a real webserver deployment
    in any case, the same way Keystone is. To the degree this is true, it
    would also mitigate what I'm seeing, but still, there are good changes
    that can be made here.

Basically, the syndrome I want to talk about can be mostly mitigated
just by changing the numbers around in #2, but I don't really know that
people know any of this, and also I think some of the defaults here
should just be changed completely as their current values are useless in
pretty much all cases.

Suppose we run on a 24-core machine, and therefore have 24 API worker
processes. Each worker represents a WSGI server, which will use an
eventlet greenlet pool with 1000 greenlets.

Then, assuming neither max_pool_size nor max_overflow is changed, the
most database connections that a single SQLAlchemy Engine will allow at
one time is 15: pool_size defaults to 5 and max_overflow defaults to 10.
We get our Engine from oslo.db; however, oslo.db does not change these
defaults, which ultimately come from SQLAlchemy itself.
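
To illustrate where the 15 comes from, here is a minimal standalone
sketch (not Nova or oslo.db code; the database URL is just a
placeholder) showing the SQLAlchemy defaults in question:

    # QueuePool allows at most pool_size + max_overflow connections at once.
    from sqlalchemy import create_engine

    engine = create_engine(
        "mysql+pymysql://user:pass@localhost/nova",  # placeholder URL
        pool_size=5,      # SQLAlchemy default
        max_overflow=10,  # SQLAlchemy default
        pool_timeout=30,  # default seconds to wait before raising TimeoutError
    )
    # => at most 5 + 10 == 15 database connections per Engine, i.e. per worker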

The problem is then basically that 1000 greenlets is way, way more than
15, meaning hundreds of requests can all pile up on a process and all be
blocked, waiting for a connection pool that's been configured to only
allow 15 database connections at most.

But wait! You say. We have twenty-four worker processes. So if we
had 100 concurrent requests, these requests would not all pile up on
just one process, they'd be distributed among the workers. Any
additional requests beyond the 15 * 24 == 360 that we can handle
(assuming a simplistic relationship between API requests and database
connections, which is not really the case) would just queue up as they
do anyway, so it makes no difference. Right? Right???

It does make a difference! Because show me where in the Nova source
code the algorithm is that knows how to distribute requests evenly
among the workers... There is no such logic! Some months ago, I began
thinking and fretting: how the heck does this work? There are 24
workers, one socket.accept(), requests come in and sockets are somehow
divvied up to child forks, but how? I asked some of the deep unix
gurus locally here, and the best answer we could come up with is: it's
random!

Cue the mythbusters music. "Nova receives WSGI requests and sends them
to workers with a random distribution, meaning that under load, some
workers will have too many requests and be waiting on DB access, which
can in fact cause pool timeout issues under high-latency circumstances,
while others will be more idle than they should be".

As we begin the show, we cut to a background segment where we show
that, in fact, Mike and some other folks doing load testing already
see connection pool timeout errors in the logs, on a 24-core machine,
even though we see hundreds of idle connections at the same time (just
to note, the error we are talking about is "QueuePool limit of size 5
overflow 5 reached, connection timed out, timeout 5"). The fact that we
see this happening in an actual test situation is what led me to
finally write a test suite for the whole thing.

Here's the test suite!
https://gist.github.com/zzzeek/c69138fd0d0b3e553a1f I've tried to make
this as super-simple as possible to read, use, and understand. It uses
Nova's nova.wsgi.Server directly with a simple "hello-world" style app,
as well as oslo_service.service.Service and service.launch() the same
way I see in nova/service.py (please confirm I'm using all the right
code and things here just like in Nova, thanks!). The "hello world"
app gets a connection from the pool, does nothing with it, waits a few
seconds then returns it. All the while counting everything going on
and reporting on its metrics every 10 requests.
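
In rough terms, the shape of that app is something like the following
(a simplified sketch, not the actual gist code, which uses
nova.wsgi.Server and oslo.service as described above; the pool and
timeout numbers are the ones discussed in the next paragraph):

    # Simplified stand-in for the "hello world" app: check out a DB
    # connection, hold it for a random "work" interval, give it back, and
    # serve requests from an eventlet greenlet pool.
    import random
    import eventlet
    eventlet.monkey_patch()

    from eventlet import wsgi
    from sqlalchemy import create_engine, text
    from sqlalchemy.pool import QueuePool

    engine = create_engine(
        "sqlite:///loadtest.db", poolclass=QueuePool,
        pool_size=5, max_overflow=5, pool_timeout=10,
    )

    def app(environ, start_response):
        with engine.connect() as conn:            # may raise TimeoutError under load
            conn.execute(text("SELECT 1"))
            eventlet.sleep(random.uniform(0, 5))  # simulated "work" holding the connection
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"hello world\n"]

    if __name__ == "__main__":
        wsgi.server(eventlet.listen(("0.0.0.0", 8080)), app,
                    custom_pool=eventlet.GreenPool(1000))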

The "hello world" app uses a SQLAlchemy connection pool with a little
bit lower number of connections, and a timeout of only ten seconds
instead of thirty by default (but feel free to change it on the command
line), and a "work" operation that takes a random amount of time between
zero and five seconds, just to make the problem more obviously
reproducible on any hardware. When we leave the default greenlets at
1000 and hit the server with Apache ab and concurrency of at least 75,
there are connection pool timeouts galore, and the metrics also show
workers waiting anywhere from a full second to 5 seconds (before timing
out) for a database connection:

INFO:loadtest:Status for pid 32625: avg wait time for connection
1.4358 sec;
worst wait time 3.9267 sec; connection failures 5;
num requests over the limit: 29; max concurrency seen: 25

ERROR:loadtest:error in pid 32630: QueuePool limit of size 5 overflow 5
reached, connection timed out, timeout 5

Bring the number of greenlets down to ten (yes, only ten) and the
errors go to zero, and the ab test completes the given number of
requests faster than it does with the 1000-greenlet version. The
average time a worker spends waiting for a database connection drops by
two orders of magnitude:

INFO:loadtest:Status for pid 460: avg wait time for connection
0.0140 sec;
worst wait time 0.0540 sec; connection failures 0;
num requests over the limit: 0; max concurrency seen: 11

That's even though our worker's "fake" work requests are still taking as
long as 5 seconds per request to complete.

But if we only have a super low number of greenlets and only a couple
dozen workers, what happens if we have more than 240 requests come in
at once? Aren't those connections going to get rejected? No way!
eventlet's networking system is better than that; those connection
requests just get queued up in any case, waiting for a greenlet to
become available. Play with the script and its settings to see.

But if we're blocking any connection attempts based on what's available
at the database level, aren't we under-utilizing for API calls that need
to do a lot of other things besides DB access? The answer is that this
may very well be true! That makes the guidance more complicated
depending on what service we are talking about. So here, my guidance
is oriented towards those Openstack services that do database access as
their primary work.

Given the above caveat, I'm hoping people can look at this and verify my
assumptions and the results. Assuming I am not just drunk on eggnog,
what would my recommendations be? Basically:

  1. at least for DB-oriented services, the number of 1000 greenlets
    should be way, way lower, and we most likely should allow for a lot
    more connections to be used temporarily within a particular worker,
    which means I'd take the max_overflow setting and default it to
    something like 50 or 100. The greenlet number should then be very
    similar to the max_overflow number, and maybe even a little less, as
    Nova API calls right now will often use more than one connection
    concurrently (see the config sketch after this list).

  2. longer term, let's please drop the eventlet pool thing and just use a
    real web server! (but still tune the connection pool appropriately). A
    real web server will at least know how to efficiently direct requests to
    worker processes. If all Openstack workers were configurable under a
    single web server config, that would also be a nice way to centralize
    tuning and profiling overall.
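
To make recommendation #1 concrete, here's the kind of tuning I have in
mind, expressed as a nova.conf sketch (the numbers are illustrative
only and would need real load testing; they are not proposed defaults):

    [DEFAULT]
    # keep the greenlet pool roughly in line with the number of DB
    # connections a single worker can actually use
    wsgi_default_pool_size = 100
    # one API worker per core, as before
    osapi_compute_workers = 24

    [database]
    max_pool_size = 10
    # allow temporary bursts of connections within a worker
    max_overflow = 100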

Thanks for reading!


OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
asked Dec 18, 2015 in openstack-dev by Mike_Bayer (15,260 points)   1 6 8

20 Responses


Hi Mike,

Thank you for this brilliant analysis! We've been seeing such timeout
errors in downstream periodically and this is the first time someone
has analysed the root cause thoroughly.

On Fri, Dec 18, 2015 at 10:33 PM, Mike Bayer mbayer@redhat.com wrote:
Hi all -

Let me start out with the assumptions I'm going from for what I want to
talk about.

  1. I'm looking at Nova right now, but I think similar things are going
    on in other Openstack apps.

  2. Settings that we see in nova.conf, including:

wsgi_default_pool_size = 1000

max_pool_size =

max_overflow =

osapi_compute_workers =

metadata_workers =

are often not understood by deployers, and/or are left unchanged in a
wide variety of scenarios. If you are in fact working for deployers
that do change these values to something totally different, then you
might not be impacted here, and if it turns out that everyone changes
all these settings in real-world scenarios and zzzeek you are just being
silly thinking nobody sets these appropriately, then fooey for me, I guess.

My understanding is that DB connection pool / workers number options
are usually changed, while the number of eventlet greenlets is not:

http://codesearch.openstack.org/?q=wsgi_default_pool_size&i=nope&files=&repos=
http://codesearch.openstack.org/?q=max_pool_size&i=nope&files=&repos=

I think that's for "historical" reasons, from when MySQL-Python was
considered to be the default DB API driver and we had to work around
its concurrency issues with eventlet by using multiple forks of services.

But as you point out, even with a non-blocking DB API driver like
PyMySQL we are still having problems with timeouts due to the pool size
vs. greenlet count settings.

  3. There's talk about more Openstack services, at least Nova from what I
    heard the other day, moving to be based on a real webserver deployment
    in any case, the same way Keystone is. To the degree this is true
    would also mitigate what I'm seeing but still, there's good changes that
    can be made here.

I think ideally we'd like to have "WSGI container agnostic" apps not
coupled to eventlet or anything else, so that it will be up to a
deployer to choose the application server.

But if we only have a super low number of greenlets and only a few dozen
workers, what happens if we have more than 240 requests come in at once,
aren't those connections going to get rejected? No way! eventlet's
networking system is better than that, those connection requests just
get queued up in any case, waiting for a greenlet to be available. Play
with the script and its settings to see.

Right, it must be controlled by the backlog argument value here:

https://github.com/openstack/oslo.service/blob/master/oslo_service/wsgi.py#L80

But if we're blocking any connection attempts based on what's available
at the database level, aren't we under-utilizing for API calls that need
to do a lot of other things besides DB access? The answer is that may
very well be true! Which makes the guidance more complicated based on
what service we are talking about. So here, my guidance is oriented
towards those Openstack services that are primarily doing database
access as their primary work.

I believe all our APIs are pretty much DB-oriented.

Given the above caveat, I'm hoping people can look at this and verify my
assumptions and the results. Assuming I am not just drunk on eggnog,
what would my recommendations be? Basically:

  1. at least for DB-oriented services, the number of 1000 greenlets
    should be way way lower, and we most likely should allow for a lot
    more connections to be used temporarily within a particular worker,
    which means I'd take the max_overflow setting and default it to like 50,
    or 100. The greenlet number should then be very similar to the
    max_overflow number, and maybe even a little less, as Nova API calls
    right now often will use more than one connection concurrently.

I suggest we tweak the config option values in both oslo.service and
oslo.db to provide reasonable production defaults, and document the
"correlation" between the DB connection pool and greenlet pool size
settings.

  2. longer term, let's please drop the eventlet pool thing and just use a
    real web server! (but still tune the connection pool appropriately). A
    real web server will at least know how to efficiently direct requests to
    worker processes. If all Openstack workers were configurable under a
    single web server config, that would also be a nice way to centralize
    tuning and profiling overall.

I'd rather we simply not couple to eventlet unconditionally and allow
deployers to choose the WSGI container they want to use.

Thanks,
Roman


responded Jan 6, 2016 by Roman_Podoliaka (3,620 points)   2 2

On 01/06/2016 09:11 AM, Roman Podoliaka wrote:
Hi Mike,

Thank you for this brilliant analysis! We've been seeing such timeout
errors in downstream periodically and this is the first time someone
has analysed the root cause thoroughly.

On Fri, Dec 18, 2015 at 10:33 PM, Mike Bayer mbayer@redhat.com wrote:

But if we only have a super low number of greenlets and only a few dozen
workers, what happens if we have more than 240 requests come in at once,
aren't those connections going to get rejected? No way! eventlet's
networking system is better than that, those connection requests just
get queued up in any case, waiting for a greenlet to be available. Play
with the script and its settings to see.

Right, it must be controlled by the backlog argument value here:

https://github.com/openstack/oslo.service/blob/master/oslo_service/wsgi.py#L80

Oh wow, totally missed that! But how does backlog here interact with
multiple processes? E.g. if all workers are saturated, will it place a
waiting connection onto a random greenlet which then has to wait? It
would be better if the "backlog" were pushed up to the parent process;
not sure if that's possible?


responded Jan 6, 2016 by Mike_Bayer (15,260 points)   1 6 8

Actually we already do that in the parent process. The parent process:

1) starts and creates a socket

2) binds the socket and calls listen() on it passing the backlog value
(http://linux.die.net/man/2/listen)

3) passes the socket to the eventlet WSGI server
(https://github.com/openstack/oslo.service/blob/master/oslo_service/wsgi.py#L177-L192)

4) forks $*_workers times (child processes inherit all open file
descriptors including the socket one)

5) child processes call accept() in a loop

Linux gurus please correct me here, but my understanding is that Linux
kernel queues up to $backlog number of connections per socket. In
our case child processes inherited the FD of the socket, so they will
accept() connections from the same queue in the kernel, i.e. the
backlog value is for all child processes, not per process.
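
As a minimal illustration of that pattern (purely a sketch of the
mechanism, not the oslo.service code; port and worker count are
arbitrary):

    import os
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("0.0.0.0", 8774))
    sock.listen(128)      # $backlog: the pending-connection queue lives in the kernel

    for _ in range(4):    # fork $*_workers children; they inherit the listening FD
        if os.fork() == 0:
            while True:   # every child accept()s from the same shared kernel queue
                client, addr = sock.accept()
                client.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 3\r\n\r\nok\n")
                client.close()

    for _ in range(4):
        os.wait()         # the parent just waits on its children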

E.g. all workers are saturated, it will place a waiting connection onto a random greenlet which then has to wait?

In each child process, the eventlet WSGI server calls accept() in a
loop to get a client socket from the kernel and then hands it to a
greenlet from a pool for processing:

https://github.com/eventlet/eventlet/blob/master/eventlet/wsgi.py#L846-L853

The "saturation" point for a child process in our case will be when we
run out of available greenlets in the pool, so that pool.spawn_n()
call will block and it won't call accept() anymore, until one or more
greenlets finishes processing of previous requests.

Or, a particular greenlet can make a blocking call, which won't yield
the execution context back to the event loop, so that the eventlet WSGI
server green thread won't get a chance to be executed and call
accept() (e.g. a call to MySQL-Python without tpool).

The kernel will queue up to $backlog connections for us until we call
accept() in one of the child processes.
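
A tiny self-contained example of that saturation behaviour (it assumes
eventlet is installed; the numbers are arbitrary):

    import time
    import eventlet

    pool = eventlet.GreenPool(size=2)

    def handle(i):
        eventlet.sleep(1)         # stand-in for handling one request

    start = time.time()
    for i in range(4):
        pool.spawn_n(handle, i)   # the 3rd spawn_n() blocks until a greenlet frees up,
                                  # which is exactly when the WSGI server stops accept()ing
    print("time spent just spawning: %.1fs" % (time.time() - start))
    pool.waitall()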

On Thu, Jan 7, 2016 at 12:02 AM, Mike Bayer mbayer@redhat.com wrote:

On 01/06/2016 09:11 AM, Roman Podoliaka wrote:

Hi Mike,

Thank you for this brilliant analysis! We've been seeing such timeout
errors in downstream periodically and this is the first time someone
has analysed the root cause thoroughly.

On Fri, Dec 18, 2015 at 10:33 PM, Mike Bayer mbayer@redhat.com wrote:

But if we only have a super low number of greenlets and only a few dozen
workers, what happens if we have more than 240 requests come in at once,
aren't those connections going to get rejected? No way! eventlet's
networking system is better than that, those connection requests just
get queued up in any case, waiting for a greenlet to be available. Play
with the script and its settings to see.

Right, it must be controlled by the backlog argument value here:

https://github.com/openstack/oslo.service/blob/master/oslo_service/wsgi.py#L80

oh wow, totally missed that! But, how does backlog here interact with
multiple processes? E.g. all workers are saturated, it will place a
waiting connection onto a random greenlet which then has to wait? It
would be better if the "backlog" were pushed up to the parent process,
not sure if that's possible?




responded Jan 7, 2016 by Roman_Podoliaka (3,620 points)   2 2

On Thu, Jan 7, 2016 at 2:49 AM, Roman Podoliaka rpodolyaka@mirantis.com wrote:

Linux gurus please correct me here, but my understanding is that Linux
kernel queues up to $backlog number of connections per socket. In
our case child processes inherited the FD of the socket, so they will
accept() connections from the same queue in the kernel, i.e. the
backlog value is for all child processes, not per process.

Yes, it will be shared across all children.

In each child process eventlet WSGI server calls accept() in a loop to
get a client socket from the kernel and then puts into a greenlet from
a pool for processing:

It’s worse than that. What I’ve seen (via strace) is that eventlet actually
converts the socket into a non-blocking socket, then converts that accept()
into an epoll()/accept() pair in every child. Then when a connection comes
in, every child process wakes up out of poll and races to try to accept on
the non-blocking socket, and all but one of them fails.

This means that every time there is a request, every child process is woken
up, scheduled on CPU and then put back to sleep. This is one of the
reasons we’re (slowly) moving to uWSGI.
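
For anyone who wants to see that race in isolation, here is a rough
approximation of the pattern (Linux-only, and just an illustration of
the behaviour described above, not eventlet's actual code):

    import errno
    import os
    import select
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("127.0.0.1", 8775))
    sock.listen(128)
    sock.setblocking(False)                    # non-blocking, as eventlet does

    for _ in range(4):
        if os.fork() == 0:
            poller = select.epoll()
            poller.register(sock.fileno(), select.EPOLLIN)
            while True:
                poller.poll()                  # every child wakes up for each new connection...
                try:
                    client, _ = sock.accept()  # ...but only one of them wins the accept() race
                    client.close()
                except OSError as exc:
                    # the losers see EAGAIN and simply go back to polling
                    if exc.errno != errno.EAGAIN:
                        raise

    for _ in range(4):
        os.wait()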


responded Jan 7, 2016 by Clayton_O'Neill (2,060 points)   2

On Thu, Jan 7, 2016 at 6:39 AM, Clayton O'Neill clayton@oneill.net wrote:

On Thu, Jan 7, 2016 at 2:49 AM, Roman Podoliaka rpodolyaka@mirantis.com
wrote:

Linux gurus please correct me here, but my understanding is that Linux
kernel queues up to $backlog number of connections per socket. In
our case child processes inherited the FD of the socket, so they will
accept() connections from the same queue in the kernel, i.e. the
backlog value is for all child processes, not per process.

Yes, it will be shared across all children.

In each child process eventlet WSGI server calls accept() in a loop to
get a client socket from the kernel and then puts into a greenlet from
a pool for processing:

It’s worse than that. What I’ve seen (via strace) is that eventlet actually
converts the socket into a non-blocking socket, then converts that accept()
into an epoll()/accept() pair in every child. Then when a connection comes
in, every child process wakes up out of poll and races to try to accept on
the non-blocking socket, and all but one of them fails.

This means that every time there is a request, every child process is woken
up, scheduled on CPU and then put back to sleep. This is one of the
reasons we’re (slowly) moving to uWSGI.

I just want to note that I've got a change proposed to devstack that adds a
config option to run keystone in uwsgi (rather than under eventlet or in
apache httpd mod_wsgi), see https://review.openstack.org/#/c/257571/ . It's
specific to keystone since I didn't think other projects were moving away
from eventlet, too.

  • Brant




responded Jan 7, 2016 by Brant_Knudson (5,640 points)   1 2 2

On Thu, Jan 07 2016, Brant Knudson wrote:

I just want to note that I've got a change proposed to devstack that adds a
config option to run keystone in uwsgi (rather than under eventlet or in
apache httpd mod_wsgi), see https://review.openstack.org/#/c/257571/ . It's
specific to keystone since I didn't think other projects were moving away
from eventlet, too.

Well, all the telemetry projects (Ceilometer, Aodh and Gnocchi) have
either ditched eventlet or never used it, so it'd be cool to have that
be more generic, I guess. We'd be glad to be able to use it.

--
Julien Danjou
;; Free Software hacker
;; https://julien.danjou.info



responded Jan 7, 2016 by Julien_Danjou (20,500 points)   2 4 7

On 01/07/2016 07:39 AM, Clayton O'Neill wrote:
On Thu, Jan 7, 2016 at 2:49 AM, Roman Podoliaka rpodolyaka@mirantis.com wrote:

Linux gurus please correct me here, but my understanding is that Linux
kernel queues up to $backlog number of connections per socket. In
our case child processes inherited the FD of the socket, so they will
accept() connections from the same queue in the kernel, i.e. the
backlog value is for all child processes, not per process.

Yes, it will be shared across all children.

In each child process eventlet WSGI server calls accept() in a loop to
get a client socket from the kernel and then puts into a greenlet from
a pool for processing:

It’s worse than that. What I’ve seen (via strace) is that eventlet actually
converts socket into a non-blocking socket, then converts that accept() into a
epoll()/accept() pair in every child. Then when a connection comes in, every
child process wakes up out of poll and races to try to accept on the
non-blocking socket, and all but one of them fails.

Is that eventlet-specific, or would we see the same thing in gevent?

This means that every time there is a request, every child process is woken
up, scheduled on CPU and then put back to sleep. This is one of the
reasons we’re (slowly) moving to uWSGI.




responded Jan 7, 2016 by Mike_Bayer (15,260 points)   1 6 8

On Thu, Jan 7, 2016 at 10:44 AM, Mike Bayer mbayer@redhat.com wrote:
On 01/07/2016 07:39 AM, Clayton O'Neill wrote:

On Thu, Jan 7, 2016 at 2:49 AM, Roman Podoliaka rpodolyaka@mirantis.com wrote:

In each child process eventlet WSGI server calls accept() in a loop to
get a client socket from the kernel and then puts into a greenlet from
a pool for processing:

It’s worse than that. What I’ve seen (via strace) is that eventlet actually
converts socket into a non-blocking socket, then converts that accept() into a
epoll()/accept() pair in every child. Then when a connection comes in, every
child process wakes up out of poll and races to try to accept on the
non-blocking socket, and all but one of them fails.

is that eventlet-specific or would we see the same thing in gevent ?

I’m not sure. For eventlet it’s a natural consequence of most of this being
implemented in Python. It looks like some of this is implemented in C in
gevent, so they may handle the situation differently.


responded Jan 7, 2016 by Clayton_O'Neill (2,060 points)   2

On 01/07/2016 09:56 AM, Brant Knudson wrote:

On Thu, Jan 7, 2016 at 6:39 AM, Clayton O'Neill <clayton@oneill.net
clayton@oneill.net> wrote:

On Thu, Jan 7, 2016 at 2:49 AM, Roman Podoliaka
<rpodolyaka@mirantis.com <mailto:rpodolyaka@mirantis.com>> wrote:
>
> Linux gurus please correct me here, but my understanding is that Linux
> kernel queues up to $backlog number of connections *per socket*. In
> our case child processes inherited the FD of the socket, so they will
> accept() connections from the same queue in the kernel, i.e. the
> backlog value is for *all* child processes, not *per* process.


Yes, it will be shared across all children.

>
> In each child process eventlet WSGI server calls accept() in a loop to
> get a client socket from the kernel and then puts into a greenlet from
> a pool for processing:

It’s worse than that.  What I’ve seen (via strace) is that eventlet
actually
converts socket into a non-blocking socket, then converts that
accept() into a
epoll()/accept() pair in every child.  Then when a connection comes
in, every
child process wakes up out of poll and races to try to accept on the
non-blocking socket, and all but one of them fails.

This means that every time there is a request, every child process
is woken
up, scheduled on CPU and then put back to sleep.  This is one of the
reasons we’re (slowly) moving to uWSGI.

I just want to note that I've got a change proposed to devstack that
adds a config option to run keystone in uwsgi (rather than under
eventlet or in apache httpd mod_wsgi), see
https://review.openstack.org/#/c/257571/ . It's specific to keystone
since I didn't think other projects were moving away from eventlet, too.

I feel like this is a confused point that keeps being brought up.

The preferred long term direction of all API services is to be deployed
on a real web server platform. It's a natural fit for those services as
they are accepting HTTP requests and doing things with them.

Most OpenStack projects have worker services beyond just an HTTP server.
(Keystone is one of the very few exceptions here). Nova has nearly a
dozen of these worker services. These don't naturally fit as wsgi apps;
they are more traditional daemons, which accept requests over the
network but also have periodic jobs internally and self-initiate
actions. They are not just call/response. There is no long-term
direction for these to move off of eventlet.

-Sean

--
Sean Dague
http://dague.net


responded Jan 7, 2016 by Sean_Dague (66,200 points)   4 11 18

On 01/07/2016 11:02 AM, Sean Dague wrote:
On 01/07/2016 09:56 AM, Brant Knudson wrote:

On Thu, Jan 7, 2016 at 6:39 AM, Clayton O'Neill <clayton@oneill.net
clayton@oneill.net> wrote:

On Thu, Jan 7, 2016 at 2:49 AM, Roman Podoliaka
<rpodolyaka@mirantis.com <mailto:rpodolyaka@mirantis.com>> wrote:
>
> Linux gurus please correct me here, but my understanding is that Linux
> kernel queues up to $backlog number of connections *per socket*. In
> our case child processes inherited the FD of the socket, so they will
> accept() connections from the same queue in the kernel, i.e. the
> backlog value is for *all* child processes, not *per* process.


Yes, it will be shared across all children.

>
> In each child process eventlet WSGI server calls accept() in a loop to
> get a client socket from the kernel and then puts into a greenlet from
> a pool for processing:

It’s worse than that.  What I’ve seen (via strace) is that eventlet
actually
converts socket into a non-blocking socket, then converts that
accept() into a
epoll()/accept() pair in every child.  Then when a connection comes
in, every
child process wakes up out of poll and races to try to accept on the
non-blocking socket, and all but one of them fails.

This means that every time there is a request, every child process
is woken
up, scheduled on CPU and then put back to sleep.  This is one of the
reasons we’re (slowly) moving to uWSGI.

I just want to note that I've got a change proposed to devstack that
adds a config option to run keystone in uwsgi (rather than under
eventlet or in apache httpd mod_wsgi), see
https://review.openstack.org/#/c/257571/ . It's specific to keystone
since I didn't think other projects were moving away from eventlet, too.

I feel like this is a confused point that keeps being brought up.

The preferred long term direction of all API services is to be deployed
on a real web server platform. It's a natural fit for those services as
they are accepting HTTP requests and doing things with them.

Most OpenStack projects have worker services beyond just an HTTP server.
(Keystone is one of the very few exceptions here). Nova has nearly a
dozen of these worker services. These don't naturally fit as wsgi apps,
they are more traditional daemons, which accept requests over the
network, but also have periodic jobs internally and self initiate
actions. They are not just call / response. There is no long term
direction for these to move off of eventlet.

This is totally speaking as an outsider, without taking into account all
the history of these decisions, but the notion that "Python + we're a
daemon" == "we must use eventlet" seems a little bit rigid. Also, the
notion that "we have background tasks" == "we can't run in a web
server" is also not clear. If a service intends to serve HTTP requests,
that portion of the service should be deployed in a web server; if the
system has other "background tasks", ideally those are in a separate
daemon altogether, but even if you're under something like mod_wsgi,
you can spawn a child process or worker thread regardless. You always
have a Python interpreter running and all the things it can do.
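
As a trivial sketch of that last point (illustrative only, not a claim
about how any particular Openstack service should be structured):

    import threading
    import time

    def periodic_task():
        while True:
            time.sleep(60)
            # ... do whatever housekeeping the daemon would normally do ...

    # started at module import time, so it runs under mod_wsgi/uWSGI right
    # alongside the request handlers
    threading.Thread(target=periodic_task, daemon=True).start()

    def application(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"ok\n"]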

-Sean


responded Jan 7, 2016 by Mike_Bayer (15,260 points)   1 6 8
...