
[openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job

0 votes

Hi,

recently I noticed we got oom-killer in action in one of our jobs [1]. I
have seen it several times, so far only with the linux bridge job. The
consequence is that usually mysqld gets killed, as it is the process that
consumes the most memory; sometimes even nova-api gets killed.

Does anybody know whether we can bump memory on nodes in the gate
without losing resources for running other jobs?
Has anybody experience with memory consumption being higher when using
linux bridge agents?

Any other ideas?

Thanks,
Jakub

[1]
http://logs.openstack.org/73/373973/13/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/295d92f/logs/syslog.txt.gz#_Jan_11_13_56_32


OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
asked Jan 19, 2017 in openstack-dev by jlibosva_at_redhat.c (1,180 points)   1 4

10 Responses

0 votes

On 2017-01-13 16:48:26 +0100 (+0100), Jakub Libosvar wrote:
[...]
Does anybody know whether we can bump memory on nodes in the gate without
losing resources for running other jobs?
[...]

We picked 8gb back when typical devstack-gate jobs only used around
2gb of memory, to make sure there was a hard upper limit developers
could expect when trying to recreate the same tests locally on their
systems. It would take a lot of convincing to raise that further
(and yes it would reduce the number of test instances we can run in
most of our providers since memory is generally the limiting factor
for our nova quotas).
--
Jeremy Stanley


responded Jan 13, 2017 by Jeremy_Stanley (56,700 points)   3 5 7
0 votes

On Fri, Jan 13, 2017, at 07:48 AM, Jakub Libosvar wrote:
Does anybody know whether we can bump memory on nodes in the gate
without losing resources for running other jobs?
Has anybody experience with memory consumption being higher when using
linux bridge agents?

Any other ideas?

Ideally I think we would see more work to reduce memory consumption.
Heat has been able to more than halve their memory usage recently [0].
Perhaps start by identifying the biggest memory hogs and go from there?

[0]
http://lists.openstack.org/pipermail/openstack-dev/2017-January/109748.html
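One quick way to start that identification is to aggregate RSS per command
name from ps output. A rough sketch (the process names and numbers below are
made up for illustration):

```python
from collections import defaultdict

def rss_by_process(ps_output):
    """Aggregate RSS (in kB) per command name from `ps -eo rss,comm` output."""
    totals = defaultdict(int)
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        rss, comm = line.split(None, 1)
        totals[comm.strip()] += int(rss)
    return sorted(totals.items(), key=lambda kv: -kv[1])

# Illustrative sample; on a real node you would feed in the output of
# `ps -eo rss,comm`:
sample = """  RSS COMMAND
300000 mysqld
102400 nova-api
102400 nova-api
 51200 neutron-linuxbr
"""
print(rss_by_process(sample))
# [('mysqld', 300000), ('nova-api', 204800), ('neutron-linuxbr', 51200)]
```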

Clark


responded Jan 13, 2017 by Clark_Boylan (8,800 points)   1 2 4
0 votes

Sounds like we must have a memory leak in the Linux bridge agent if that's
the only difference between the Linux bridge job and the ovs ones. Is there
a bug tracking this?

On Jan 13, 2017 08:58, "Clark Boylan" cboylan@sapwetik.org wrote:

On Fri, Jan 13, 2017, at 07:48 AM, Jakub Libosvar wrote:

Does anybody know whether we can bump memory on nodes in the gate
without losing resources for running other jobs?
Has anybody experience with memory consumption being higher when using
linux bridge agents?

Any other ideas?

Ideally I think we would see more work to reduce memory consumption.
Heat has been able to more than halve their memory usage recently [0].
Perhaps start by identifying the biggest memory hogs and go from there?

[0]
http://lists.openstack.org/pipermail/openstack-dev/2017-January/109748.html

Clark


responded Jan 13, 2017 by kevin_at_benton.pub (15,600 points)   2 3 4
0 votes

2017-01-13 11:13 GMT-06:00 Kevin Benton kevin@benton.pub:

Sounds like we must have a memory leak in the Linux bridge agent if that's
the only difference between the Linux bridge job and the ovs ones. Is there
a bug tracking this?

Just created one [1]. For now, this issue was observed in two cases
(mentioned in bug description).

[1] https://bugs.launchpad.net/neutron/+bug/1656386


responded Jan 13, 2017 by Darek_Smigiel (1,960 points)   1
0 votes

2017-01-13 17:56 GMT+01:00 Clark Boylan cboylan@sapwetik.org:

On Fri, Jan 13, 2017, at 07:48 AM, Jakub Libosvar wrote:

Does anybody know whether we can bump memory on nodes in the gate
without losing resources for running other jobs?
Has anybody experience with memory consumption being higher when using
linux bridge agents?

Any other ideas?

Ideally I think we would see more work to reduce memory consumption.
Heat has been able to more than halve their memory usage recently [0].
Perhaps start by identifying the biggest memory hogs and go from there?

[0]
http://lists.openstack.org/pipermail/openstack-dev/2017-January/109748.html

In order to have some real data, I've run reproduce.sh for a random
full tempest check and aggregated the memory usage from ps output
during the tempest run [1].
To me it looks like the times of 2G are long gone; Nova is using
almost 2G all by itself. And 8G may be getting tight if additional
stuff like Ceph is being added.

As a side note, we are seeing consistent failures for the Chef
OpenStack Cookbook integration tests on infra. We have set up an
external CI now running on 12G instances and are getting successful
results there. [2]

[1] http://paste.openstack.org/show/595348/
[2] https://review.openstack.org/409900


responded Jan 18, 2017 by Dr._Jens_Rosenboom (1,680 points)   3
0 votes

On 1/13/2017 9:48 AM, Jakub Libosvar wrote:
Hi,

recently I noticed we got oom-killer in action in one of our jobs [1]. I
have seen it several times, so far only with the linux bridge job. The
consequence is that usually mysqld gets killed, as it is the process that
consumes the most memory; sometimes even nova-api gets killed.

Does anybody know whether we can bump memory on nodes in the gate
without losing resources for running other jobs?
Has anybody experience with memory consumption being higher when using
linux bridge agents?

Any other ideas?

Thanks,
Jakub

[1]
http://logs.openstack.org/73/373973/13/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/295d92f/logs/syslog.txt.gz#_Jan_11_13_56_32



I don't think it's just the linuxbridge job, see:

http://status.openstack.org//elastic-recheck/index.html#1656850

And the linked logstash query, then expand by build_name.

I also tracked in logstash that this started around 1/10, which was
within our 10 days of logs, so something happened around then to start
tipping us over. I had some leads in the bug report but I think the
keystone team took over from there.

--

Thanks,

Matt Riedemann


responded Jan 18, 2017 by Matt_Riedemann (48,320 points)   3 7 21
0 votes

On 1/18/2017 4:53 AM, Jens Rosenboom wrote:
To me it looks like the times of 2G are long gone, Nova is using
almost 2G all by itself. And 8G may be getting tight if additional
stuff like Ceph is being added.

I'm not really surprised at all about Nova being a memory hog with the
versioned object stuff we have, which does its own nesting of objects.

What tools do people use to profile the memory usage by the types of
objects in memory while this is running?

--

Thanks,

Matt Riedemann


responded Jan 18, 2017 by Matt_Riedemann (48,320 points)   3 7 21
0 votes

On 01/14/2017 02:48 AM, Jakub Libosvar wrote:
recently I noticed we got oom-killer in action in one of our jobs [1].

Any other ideas?

I spent quite a while chasing down similar things with centos a while
ago. I do have some ideas :)

The symptom is probably that mysql gets chosen by the OOM killer, but
it's unlikely to be mysql's fault; it's just big and a good target.

If the system is going offline, I added the ability to turn on the
netconsole in devstack-gate with [1]. As the comment mentions, you can
put little tests that stream data into /dev/kmsg and they will
generally get off the host, even if ssh has been killed. I found this
very useful for getting the initial oops data (I've used this several
times for other gate oopses, including other kernel issues we've
seen).
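As a sketch of the idea (the helper name and message format here are made
up, and writing to the real /dev/kmsg requires root):

```python
import os
import tempfile
import time

def kmsg_heartbeat(path='/dev/kmsg', note=''):
    """Write a timestamped marker into the kernel ring buffer so that it
    travels over netconsole even if userspace (sshd included) has died."""
    with open(path, 'w') as f:
        f.write('memdebug[%d]: %s\n' % (int(time.time()), note))

# Demo against a throwaway file; on a gate node the target would be /dev/kmsg:
fd, path = tempfile.mkstemp()
os.close(fd)
kmsg_heartbeat(path, note='MemFree check')
print(open(path).read())
```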

For starting to pin down what is really consuming the memory, the
first thing I did was to write a peak-memory-usage tracker that gave me
stats on memory growth during the devstack run [2]. You have to
enable this with "enable_service peakmem_tracker". This starts to
give you the big picture of where memory is starting to go.
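The core of such a tracker can be sketched in a few lines of Python (a
simplified illustration, not the actual peakmem_tracker.sh logic):

```python
def mem_available_kb(meminfo_text):
    """Pull the MemAvailable value (in kB) out of /proc/meminfo contents."""
    for line in meminfo_text.splitlines():
        if line.startswith('MemAvailable:'):
            return int(line.split()[1])
    raise ValueError('no MemAvailable line found')

# A sampling loop would poll /proc/meminfo and keep the low-water mark, e.g.:
#   low_water = min(low_water, mem_available_kb(open('/proc/meminfo').read()))
sample = "MemTotal:        8388608 kB\nMemAvailable:    1048576 kB\n"
print(mem_available_kb(sample))  # 1048576
```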

At this point, you should have a rough idea of the real cause, and
you're going to want to start dumping /proc/pid/smaps of target
processes to get an idea of where the memory they're allocating is
going, or at the very least what libraries might be involved. The
next step is going to depend on what you need to target...
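For example, a minimal helper to total the Private_Dirty pages from an
smaps dump might look like this (illustrative only; the sample excerpt is
made up):

```python
import re

def private_dirty_kb(smaps_text):
    """Sum the Private_Dirty fields (kB) from a /proc/<pid>/smaps dump."""
    return sum(int(kb) for kb in
               re.findall(r'^Private_Dirty:\s+(\d+) kB', smaps_text,
                          re.MULTILINE))

# Made-up smaps excerpt; a real dump comes from /proc/<pid>/smaps:
sample = (
    "7f0000000000-7f0000100000 rw-p 00000000 00:00 0 [heap]\n"
    "Private_Dirty:      4096 kB\n"
    "7f0000200000-7f0000300000 r-xp 00000000 08:01 123 /usr/lib/libfoo.so\n"
    "Private_Dirty:       128 kB\n"
)
print(private_dirty_kb(sample))  # 4224
```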

If it's python, it can get a bit tricky to see where the memory is
going, but there are a number of approaches. Despite it being mostly
unmaintained at the time, I had some success with guppy [3]. In
my case, for example, I managed to hook into swift's wsgi startup and
run that under guppy, giving me the ability to get some heap stats.
From my notes [4], that looked something like:


import signal, sys

from guppy import hpy

def handler(signum, frame):
    # On SIGUSR1, dump guppy's heap statistics to a file.
    f = open('/tmp/heap.txt', 'w+')
    f.write("testing\n")
    hp = hpy()
    f.write(str(hp.heap()))
    f.close()

if __name__ == '__main__':
    # parse_options, run_wsgi and server come from swift's own
    # object-server startup code.
    conf_file, options = parse_options()
    signal.signal(signal.SIGUSR1, handler)

    sys.exit(run_wsgi(conf_file, 'object-server',
                      global_conf_callback=server.global_conf_callback,
                      **options))

There are of course other tools from gdb to malloc tracers, etc.

But that was enough that I could try different things and compare the
heap usage. Once you've got the smoking gun ... well then the hard
work starts of fixing it :) In my case it was pycparser and we came up
with a good solution [5].

Hopefully those are some useful tips ... #openstack-infra can of course
help with holding VMs etc. as required.

-i

[1] http://git.openstack.org/cgit/openstack-infra/devstack-gate/tree/devstack-vm-gate-wrap.sh#n438
[2] https://git.openstack.org/cgit/openstack-dev/devstack/tree/tools/peakmem_tracker.sh
[3] https://pypi.python.org/pypi/guppy/
[4] https://etherpad.openstack.org/p/oom-in-rax-centos7-CI-job
[5] https://github.com/eliben/pycparser/issues/72


responded Jan 19, 2017 by Ian_Wienand (3,620 points)   4 5
0 votes

On Thu, Jan 19, 2017 at 10:27 AM, Matt Riedemann
<mriedem@linux.vnet.ibm.com> wrote:

On 1/18/2017 4:53 AM, Jens Rosenboom wrote:

To me it looks like the times of 2G are long gone, Nova is using
almost 2G all by itself. And 8G may be getting tight if additional
stuff like Ceph is being added.

I'm not really surprised at all about Nova being a memory hog with the
versioned object stuff we have, which does its own nesting of objects.

What tools do people use to profile the memory usage by the types of
objects in memory while this is running?

objgraph and guppy/heapy

http://smira.ru/wp-content/uploads/2011/08/heapy.html

https://www.huyng.com/posts/python-performance-analysis

You can also use gc.get_objects() (
https://docs.python.org/2/library/gc.html#gc.get_objects) to get a list of
all objects in memory and go from there.
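For instance, a quick type histogram of the live heap (a small sketch
using only the stdlib):

```python
import gc
from collections import Counter

def top_types(n=10):
    """Histogram of live objects by type, as seen by the garbage collector."""
    return Counter(type(o).__name__ for o in gc.get_objects()).most_common(n)

for name, count in top_types():
    print(name, count)
```

Comparing two such histograms taken before and after a suspect operation is
a cheap way to spot which object types are accumulating.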

Slots (https://docs.python.org/2/reference/datamodel.html#slots) are useful
for reducing the memory usage of objects.
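A minimal illustration of where the slots saving comes from (the class
names are made up):

```python
import sys

class Plain(object):
    def __init__(self):
        self.a, self.b = 1, 2

class Slotted(object):
    __slots__ = ('a', 'b')
    def __init__(self):
        self.a, self.b = 1, 2

p, s = Plain(), Slotted()
# Slotted instances carry no per-instance __dict__ at all, which is
# where the per-object saving comes from:
print(hasattr(p, '__dict__'))  # True
print(hasattr(s, '__dict__'))  # False
print(sys.getsizeof(p.__dict__))  # overhead every Plain instance pays
```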

--

Thanks,

Matt Riedemann


responded Jan 19, 2017 by Joe_Gordon (24,620 points)   2 5 8
0 votes

What I don't understand is why the OOM killer is being invoked when there
is almost no swap space being used at all. Check out the memory output when
it's killed:

http://logs.openstack.org/59/382659/26/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/7de01d0/logs/syslog.txt.gz#_Jan_11_15_54_36

"Jan 11 15:54:36 ubuntu-xenial-rax-ord-6599274 kernel: Free swap = 7994832kB
Jan 11 15:54:36 ubuntu-xenial-rax-ord-6599274 kernel: Total swap = 7999020kB"

Do we have something set that is effectively disabling the usage of swap
space?
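One setting worth checking is the vm.swappiness sysctl (a sketch; the
helper is hypothetical and on a live host would read
/proc/sys/vm/swappiness):

```python
def read_swappiness(text):
    """Parse /proc/sys/vm/swappiness contents; 0 means the kernel avoids
    swapping anonymous pages almost entirely, so the OOM killer can fire
    while swap sits unused."""
    return int(text.strip())

# On a live node: read_swappiness(open('/proc/sys/vm/swappiness').read())
print(read_swappiness('0'))   # would match the symptom described above
print(read_swappiness('60'))  # the usual kernel default
```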

On Wed, Jan 18, 2017 at 4:13 PM, Joe Gordon joe.gordon0@gmail.com wrote:

On Thu, Jan 19, 2017 at 10:27 AM, Matt Riedemann
<mriedem@linux.vnet.ibm.com> wrote:

On 1/18/2017 4:53 AM, Jens Rosenboom wrote:

To me it looks like the times of 2G are long gone, Nova is using
almost 2G all by itself. And 8G may be getting tight if additional
stuff like Ceph is being added.

I'm not really surprised at all about Nova being a memory hog with the
versioned object stuff we have, which does its own nesting of objects.

What tools do people use to profile the memory usage by the types of
objects in memory while this is running?

objgraph and guppy/heapy

http://smira.ru/wp-content/uploads/2011/08/heapy.html

https://www.huyng.com/posts/python-performance-analysis

You can also use gc.get_objects()
(https://docs.python.org/2/library/gc.html#gc.get_objects) to get a
list of all objects in memory and go from there.

Slots (https://docs.python.org/2/reference/datamodel.html#slots) are
useful for reducing the memory usage of objects.

--

Thanks,

Matt Riedemann



responded Jan 24, 2017 by kevin_at_benton.pub (15,600 points)   2 3 4
...