
[Openstack-operators] [Ceilometer] Real world experience with Ceilometer deployments - Feedback requested

0 votes

Ceilometer is in a sad state.

  1. The collector leaks memory. We ran it on the same host as mongo, and it
    grabbed 29 GB out of 32, leaving mongo with less than a gigabyte of memory.
  2. The metering agent causes a huge load on neutron-server: O(n) in metering
    rules and tenants. A few bugs have been reported, and one bugfix is in review.
  3. The metering agent simply does not work on installations with multiple network nodes.
    It expects all routers to be on the same host. Fixed or not, I don't know; we
    have our own crude fix.
  4. Many rough edges. Ceilometer is much less tested than nova. Sometimes it
    throws a traceback and skips counting. Fresh example: if metadata has '.' in the
    name, ceilometer throws a traceback on it and does not count it in glance usage.
  5. Very slow on reports (using mongo's mapreduce).
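
One plausible cause of the failure in item 4, assuming the samples end up in MongoDB, is that MongoDB has historically rejected field names containing '.'. A minimal sketch of a workaround is to escape such keys before the document is stored. The function name `escape_dotted_keys`, the replacement character and the sample metadata are illustrative assumptions, not Ceilometer code.

```python
def escape_dotted_keys(doc, replacement="_"):
    """Return a copy of doc with '.' in dict keys replaced, recursively."""
    if isinstance(doc, dict):
        return {
            str(key).replace(".", replacement): escape_dotted_keys(value, replacement)
            for key, value in doc.items()
        }
    if isinstance(doc, list):
        return [escape_dotted_keys(item, replacement) for item in doc]
    # Scalars (strings, numbers, None) are returned unchanged.
    return doc


# Hypothetical image metadata with a dotted property name.
image_metadata = {"name": "cirros", "properties": {"org.example.build": "42"}}
safe_metadata = escape_dotted_keys(image_metadata)
```

Note that this simple replacement can make two distinct keys collide (for example `a.b` and `a_b`), so a real fix would need a reversible escaping scheme.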

Overall feeling: barely usable, but given my experience with cloud
billing systems, not the worst thing I have seen in my life.

About load: apart from reporting and the memory leak, it uses a rather small
amount of resources.

On 02/11/2015 09:37 PM, Maish Saidel-Keesing wrote:

Is Ceilometer ready for prime time?

I would be interested in hearing from people who have deployed
OpenStack clouds with Ceilometer, and their experience. Some of the
topics I am looking for feedback on are:

  • Database Size
  • MongoDB management, Sharding, replica sets etc.
  • Replication strategies
  • Database backup/restore
  • Overall usability
  • Gripes, pains and problems (things to look out for)
  • Possible replacements for Ceilometer that you have used instead

If you are willing to share - I am sure it will be beneficial to the
whole community.

Thanks in Advance

With best regards,

Maish Saidel-Keesing
Platform Architect
Cisco


OpenStack-operators mailing list
OpenStack-operators at lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
asked Feb 12, 2015 in openstack-operators by George_Shuklin (4,720 points)   2 10 13

18 Responses

0 votes

@Maish Saidel-Keesing,

Hi Maish, I'm from eBay Inc, and we're enabling 1000+ ceilometer compute agents. Hope our experience can help.

We chose an OpenTSDB backend instead of MongoDB from the start, so we avoided most of the issues related to MongoDB.

However, during deployment we still ran into many issues, as listed below:

  1. The libvirt inspector didn't work in nova-cell mode. We fixed it by using the instance UUID to identify the VM, and submitted the fix upstream. (https://bugs.launchpad.net/ceilometer/+bug/1396473)
  2. There's a huge load on the nova/glance clients, enough to drag them down. We resolved it in the following 3 ways to reduce the load:

Our original thinking about MongoDB was to store only some metadata definitions in it, and to put most other metrics into a time series DB.

So all in all, we think you could consider changing your main storage backend from MongoDB, which may improve your Ceilometer performance.
Also, some performance-related code enhancements or modifications based on your conditions would help.

Thanks,
Bryant(Cloud Team, eBay Inc)

responded Feb 12, 2015 by Zeng,_Bryant (140 points)   1
0 votes

Unfortunately, I can only confirm the sorry state of Ceilometer.
We tried it on a very small setup (6 compute nodes) and ran into so many issues that we dropped it and created our own solution based on a mix of scripts that read from the nova/neutron DB, iptables and collectd data. There is no need for more collection agents than the ones we are already running for systems monitoring.
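
The scripts themselves are not shown in this reply. Purely as an illustration of the iptables part, here is a minimal sketch that sums byte counters per address from a listing in the column layout printed by `iptables -L -v -n -x`. It assumes every accounting rule has a target, so the columns do not shift; the sample listing, the addresses and the function name `bytes_per_address` are made up.

```python
from collections import defaultdict

# Hypothetical listing, in the layout of `iptables -L FORWARD -v -n -x`.
SAMPLE = """\
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
    pkts      bytes target     prot opt in     out     source               destination
     120    98304 ACCEPT     all  --  *      *       10.0.0.5             0.0.0.0/0
      80    40960 ACCEPT     all  --  *      *       0.0.0.0/0            10.0.0.5
"""

ANY = "0.0.0.0/0"


def bytes_per_address(listing):
    """Sum egress/ingress byte counters for each non-wildcard address."""
    usage = defaultdict(lambda: {"egress": 0, "ingress": 0})
    for line in listing.splitlines():
        fields = line.split()
        # Rule rows have at least 9 columns and start with a packet count;
        # this skips the "Chain ..." line and the column header.
        if len(fields) < 9 or not fields[0].isdigit():
            continue
        byte_count, source, destination = int(fields[1]), fields[7], fields[8]
        if source != ANY:
            usage[source]["egress"] += byte_count
        if destination != ANY:
            usage[destination]["ingress"] += byte_count
    return dict(usage)


usage = bytes_per_address(SAMPLE)
```

In practice the listing would come from running the command on the network node, and the addresses would be mapped to tenants using the nova/neutron DB.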

We tried the version in Havana and, later, the one in Icehouse. For starters, the documentation suggested MySQL as the default backend. MySQL lasts just a few days and then breaks down under the size of the tables. We tried MongoDB, but were still not satisfied with the performance on such a small cluster.
Then there is the metering agent. It is yet another daemon, not integrated in Neutron, and there is no documentation about what it actually measures. What if I have multiple routers? Ingress and egress? From which point of view?
The same applies to Cinder: it requires an external agent (to be run via cron!).

Some metrics were not recorded and we couldn't understand why. Again, there was no documentation and no tooling to help us understand whether we were just missing some config options somewhere in nova-compute or whether there was some other problem with the KVM/libvirt versions.
And even when we had some data and wanted to generate just a proof-of-concept report with some information about tenant resource usage, we found problems with the API. The fact that no one had bothered to write a simple proof-of-concept script that uses the API to actually do something useful was really off-putting.
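
For illustration only, the kind of proof-of-concept report described above could be as small as the sketch below. It does not call the Ceilometer API: the records are plain dictionaries, and the field names (`tenant`, `meter`, `volume`) and values are hypothetical stand-ins for whatever the metering backend returns.

```python
from collections import defaultdict

# Hypothetical usage records, in whole units to avoid rounding issues.
samples = [
    {"tenant": "alpha", "meter": "instance_hours", "volume": 24},
    {"tenant": "alpha", "meter": "instance_hours", "volume": 12},
    {"tenant": "beta", "meter": "volume_gb_hours", "volume": 100},
]


def usage_report(records):
    """Total the recorded volume per (tenant, meter) pair."""
    totals = defaultdict(int)
    for record in records:
        totals[(record["tenant"], record["meter"])] += record["volume"]
    return dict(totals)


report = usage_report(samples)
for (tenant, meter), total in sorted(report.items()):
    print(f"{tenant:10} {meter:20} {total:10}")
```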

We had to dig in libvirt to understand what some of the metrics actually mean.
We found that we could read those same metrics from our (more efficient, well-known) monitoring system.

For some time we ran just the agents and aggregated the data in an Elasticsearch instance through the UDP msgpack pipeline (more bugs: the message format is inconsistent, and different agents generate different fields in slightly different formats).
It works. But for our needs it was just too much work. Most of the data is already available from other sources with well-known APIs.

Ah, also there is a long-standing open bug: Sahara and Ceilometer cannot be used together. And we use Sahara.

I opened bugs for some of these issues, but I have since lost interest.

In the end, I think it really depends on what kind of data you need and what (developer) resources you can throw at the problem.
Unless things changed dramatically in Juno, Ceilometer will not work out of the box. You will lose time because of the non-existent documentation, you will have to develop code and scripts anyway, and finally you will have to create something between your billing system and the ceilometer API, because to the best of my knowledge there is nothing that uses it.

eBay has the resources to do all that. We don't.

responded Feb 12, 2015 by daniele.venzano_at_e (660 points)   1 2
0 votes

Does anyone have any proposals regarding

  • Possible replacements for Ceilometer that you have used instead

It seems that many sites have written their own systems. The stacktach/monasca teams are due to demo at the operators meetup in Philadelphia in March.

Does anyone have experience to share comparing ceilometer with stacktach?

Tim

responded Feb 12, 2015 by Tim_Bell (16,440 points)   1 9 11
0 votes

I thought stacktach was more in the vein of diagnostics, not billable resources.

responded Feb 12, 2015 by matt (5,480 points)   1 2 3
0 votes

Event-based Monitoring & Billing solution for OpenStack

Unsure what it's checking for billing, though.


Kris Lindgren
Senior Linux Systems Engineer
GoDaddy, LLC.

responded Feb 12, 2015 by Kris_G._Lindgren (7,740 points)   1 7 12
0 votes

So my understanding is that the billing part of ceilometer has to meet
financial requirements on how and what is reported in metrics, i.e. you
cannot bill for rounded values or heuristics. Stacktach has no such
need to solve for financial requirements and is really intended to help
an operator isolate or identify issues and potential issues.

that's a pretty big difference in use case.
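
The point about rounded values can be illustrated with a small sketch: summing charges as binary floats drifts, while Python's `decimal` module keeps the totals exact. The hourly rate and the number of samples are hypothetical.

```python
from decimal import Decimal

HOURLY_RATE = "0.10"  # hypothetical price per instance-hour
HOURS = 3             # three one-hour samples

# Binary floats cannot represent 0.10 exactly, so the sum drifts.
float_total = sum(float(HOURLY_RATE) for _ in range(HOURS))

# Decimal keeps the value exactly as written, so the sum is exact.
exact_total = sum(Decimal(HOURLY_RATE) for _ in range(HOURS))

print(float_total == 0.3)               # False
print(exact_total == Decimal("0.30"))   # True
```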

-mayy

On Thu, Feb 12, 2015 at 11:30 AM, Kris G. Lindgren
wrote:

Event-based Monitoring & Billing solution for OpenStack

Unsure what its checking out for billing though.


Kris Lindgren
Senior Linux Systems Engineer
GoDaddy, LLC.

On 2/12/15, 9:17 AM, "Matt Joyce" wrote:

I thought stacktach was more in the vein of diagnostic. Not billable
resources.

On Feb 12, 2015 10:47 AM, Tim Bell <Tim.Bell at cern.ch> wrote:

Does anyone have any proposals regarding

  • Possible replacements for Ceilometer that you have used instead

It seems that many sites have written their own systems. The
stacktach/monasca teams are due to demo to the operators meetup in
Philadelphia in March.

Does anyone have experience to share comparing ceilometer with
stacktach ?

Tim

-----Original Message-----
From: Daniele Venzano [mailto:daniele.venzano at eurecom.fr]
Sent: 12 February 2015 12:24
To: openstack-operators at lists.openstack.org
Subject: Re: [Openstack-operators] [Ceilometer] Real world experience
with
Ceilometer deployments - Feedback requested

Unfortunately, I can only confirm the sorry state of Ceilometer.
We tried it on a very small setup (6 compute nodes) and run in so
many issues,
we dropped it and created our own solution based on a mix of scripts
that read
from the nova/neutron DB, iptables and collectd data. No need for
more
collection agents than what we are already running for the systems
monitoring.

We tried the version in Havana and, later, in Icehouse. For starters
the
documentation was suggesting MySQL as default backend. MySQL will
last just a
few days and then break down under the size of the tables. We tried
MongoDB,
but were still not satisfied with performance on such a small
cluster.
Then there is the metering agent. It is yet another daemon, not
integrated in
Neutron and there is no documentation about what it is actually
measuring.
What if I have multiple routers? Ingress and Egress? From which point
of view?
The same applies to Cinder, it requires and external agent (to be run
via cron!).

Some metrics were not recorded, we couldn't understand why and,
again, no
documentation and no tooling to help us understand whether we were
just
missing some config options somewhere in nova-compute or there was
some
other problem with KVM/libvirt versions.
And even when we had some data and wanted to generate just a
proof-of-
concept report with some information about tenant resource usage, we
found
problems with the API. The fact that no one had bothered to write a
simple
proof of concept script that uses the API to actually do something
useful was
really off-putting.

We had to dig in libvirt to understand what some of the metrics
actually mean.
We found that we could read those same metrics from our (more
efficient, well-
known) monitoring system.

For some time we run just the agents and aggregated the data in an
elasticsearch instance through the UDP msgpack pipeline (more bugs,
message
format is inconsistent, different agents generate different fields,
in slightly
different formats).
It works. But for our needs it was just too much work. Most of the
data is
already available from other sources with well-known APIs.

Ah, also there is a long standing bug open: Sahara and Ceilometer
cannot be
used together. And we use Sahara.

I opened bugs for some of these issues, but since then I lost
interest.

In the end, I think it really depends on what kind of data you need
and what
(developer) resources you can throw at the problem.
Unless in Juno things changed dramatically, Ceilometer will not work
out of the
box. You will have to lose time because of the non-existent
documentation, you
will have to develop code and scripts anyway and finally you will
have to create
something between your billing system and the ceilometer API, because
to the
best of my knowledge there is nothing that uses it.

eBay has the resources to do all that. We don't.

responded Feb 12, 2015 by matt (5,480 points)   1 2 3
0 votes

Hey Tim!

Thanks for the mention. I'm keen to hear the responses on this as well.

I haven't been very active on the ML recently, so perhaps it's a good time
for an update (or an intro for those not familiar with StackTach [1]).

StackTach started out as a diagnostics tool. It consumes notifications
from Nova and Glance and gives you tools for watching "operations" as
they flow through the system. An operation might be "create instance",
or "migrate" or "add network", etc. Pretty handy stuff. Especially if
you're in the process of standing up a new OpenStack deploy.
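
To make the idea of an "operation" concrete, here is a minimal sketch (not StackTach's actual implementation) that pairs `*.start` and `*.end` notifications sharing a request ID and reports how long each operation took. It assumes notifications have already been consumed into dicts carrying `event_type`, `timestamp`, and `_context_request_id` keys, with timestamps in the `YYYY-MM-DD HH:MM:SS.ffffff` layout.

```python
from datetime import datetime


def parse_ts(value):
    """Parse 'YYYY-MM-DD HH:MM:SS.ffffff' timestamps."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S.%f")


def operation_timings(notifications):
    """Pair *.start / *.end events per request and return their durations.

    Returns a dict mapping (request_id, operation) -> seconds elapsed.
    Operations with no matching end event are omitted.
    """
    starts = {}
    durations = {}
    # Notifications can arrive out of order, so sort by timestamp first.
    for note in sorted(notifications, key=lambda n: parse_ts(n["timestamp"])):
        # "compute.instance.create.start" -> ("compute.instance.create", "start")
        operation, _, phase = note["event_type"].rpartition(".")
        key = (note["_context_request_id"], operation)
        if phase == "start":
            starts[key] = parse_ts(note["timestamp"])
        elif phase == "end" and key in starts:
            elapsed = parse_ts(note["timestamp"]) - starts.pop(key)
            durations[key] = elapsed.total_seconds()
    return durations
```

Keying on the request ID rather than the instance ID keeps two successive operations on the same instance from being merged into one.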

We quickly found we could get some other really cool information from
these notifications. Performance monitoring, auditing, billing and
usage data ... lots of cool stuff. Within Rax we have StackTach
deployed in all of our regions and use it for all these purposes.

StackTach doesn't really compare with Ceilometer or Monasca. We are
100% focused on notification/event management and not metrics
(CPU=80%). Monasca would be a better comparison in that case.

But, StackTach is not great. It takes some real care and feeding to run at
scale. Particularly with the workers. StackTach has no provisions for
horizontal scaling. And there are no provisions for long term
archiving. We do it, but it's fragile.

So, about a year ago, we started working on StackTach version 3 (STv3)
to address these problems [2]. We're currently rolling this out within
Rax. We're still in the "driving a car with square wheels" phase, but
it's getting better. We're horizontally scalable. We have Ansible
deploy scripts. We support long term archiving to Swift, and soon to
HDFS. We're highly componentized so you can pick and choose the pieces
you want to use (as Monasca is doing, wrapping many of our libraries
to fit their model). And we should be able to support most
notification types ... not just Nova and Glance and not just
OpenStack. We're aiming to make this a broad solution.

Hopefully we'll be able to show more at the Ops meetup :)

That said, I'd love to hear about headaches and failures of the older
StackTach release and how people are using it, or hope to use it.

Cheers!
-S

PS> I'm behind on my screencast series. Hopefully I'll get them updated once
we get past pre-prod. :)

[1] https://github.com/stackforge?query=stacktach
[2] https://www.youtube.com/playlist?list=PLmyM48VxCGaW5pPdyFNWCuwVT1bCBV5p3


From: Tim Bell [Tim.Bell at cern.ch]
Sent: Thursday, February 12, 2015 11:47 AM
To: Daniele Venzano; openstack-operators at lists.openstack.org
Subject: Re: [Openstack-operators] [Ceilometer] Real world experience with Ceilometer deployments - Feedback requested

Does anyone have any proposals regarding

  • Possible replacements for Ceilometer that you have used instead

It seems that many sites have written their own systems. The stacktach/monasca teams are due to demo to the operators meetup in Philadelphia in March.

Does anyone have experience to share comparing ceilometer with stacktach?

Tim

responded Feb 12, 2015 by Sandy_Walsh (3,660 points)   2 4
0 votes

Hi Sandy,

That said, I'd love to hear about headaches and failures of the older
StackTach release and how people are using it, or hope to use it.

We have two StackTach v2 environments, one of which has been running for
almost 3 years. For that particular environment, it can be a bear to do
queries, sometimes taking up to a few minutes. This is understandable with
how the information is stored in the db.

Another issue we've seen is that the workers sometimes fail to reconnect to
Rabbit after a WAN outage. The remedy for that is to restart the workers
from a cron.

But other than that, it runs great. Our environments are definitely not at
the same scale as, say, eBay or CERN, and so operating StackTach has been
manageable.

We use StackTach only as a troubleshooting tool. If a user is having an
issue, we'll bring up their event history and review their timeline. I
think this alone makes it an invaluable tool.

I reviewed all of your StackTach v3 stuff the other week. At first glance,
there's definitely a lot more moving parts than with v2, but after reading
about each one, they all make sense. I'm looking forward to trying some of
it out.

I'd be happy to talk more in Philadelphia if you'd like. :)

Thanks,
Joe

responded Feb 12, 2015 by Joe_Topjian (5,780 points)   1 7 10
0 votes

btw> if you want to know how StackTach handles billing, here's the salient part from our Hong Kong presentation [1], from back when we were attempting Ceilometer integration.

[1] http://youtu.be/c8zZtSL0t00?t=8m26s

responded Feb 12, 2015 by Sandy_Walsh (3,660 points)   2 4
0 votes

Hi Tim,

Does anyone have any proposals regarding

  • Possible replacements for Ceilometer that you have used instead

It seems that many sites have written their own systems.

Sorry - I should have appended this at the end of my last post.

I need to preface this with "I have never used Ceilometer nor do our
environments require billing". But we're already collecting a lot of
information that could be used for billing.

The nova usage-list command reports a tenant's compute resource
allocation per 24 hour period.

For per-instance metrics, I've posted a script that will collect them here:

https://github.com/osops/tools-generic/blob/master/libvirt/instance_metrics.rb

I recently discovered that the nova diagnostics command reports almost
the same information, minus the CPU usage that I'm polling via ps. This
might not be needed for most environments, though, and so nova diagnostics
alone should be fine.

So between all of this information, we're able to create a good picture of
a tenant's compute usage. Of course, if we were to do billing, this would
all need to be fed into a billing system of some sort. Plus, the 24 hour
resolution might be too coarse.

But hopefully it gives a good indication that polling some basic metrics of
compute usage doesn't require a lot of resources. :)
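
As a sketch of the "fed into a billing system" step, the snippet below prices daily usage rows of the kind nova usage-list reports (CPU hours, RAM MB-hours, disk GB-hours per tenant). It assumes the rows have already been parsed into dicts. The key names and the rates are invented for illustration and are not from this thread.

```python
from decimal import Decimal

# Assumption: made-up prices per unit-hour, purely for illustration.
RATES = {
    "cpu_hours": Decimal("0.02"),
    "ram_mb_hours": Decimal("0.00001"),
    "disk_gb_hours": Decimal("0.0005"),
}


def charge(usage_row, rates=RATES):
    """Price one tenant's usage row, rounded to two decimal places."""
    total = Decimal("0")
    for field, rate in rates.items():
        # Go through str() so float inputs do not carry binary noise into Decimal.
        total += Decimal(str(usage_row[field])) * rate
    return total.quantize(Decimal("0.01"))


def invoice(rows, rates=RATES):
    """Sum the (already rounded) daily charges per tenant."""
    totals = {}
    for row in rows:
        tenant = row["tenant_id"]
        totals[tenant] = totals.get(tenant, Decimal("0")) + charge(row, rates)
    return totals
```

Rounding each daily row before summing is one possible policy; a real billing system might round only the final total.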

responded Feb 12, 2015 by Joe_Topjian (5,780 points)   1 7 10
...