
[openstack-dev] Scheduler proposal


Several months ago I proposed an experiment [0] to see if switching the data model for the Nova scheduler to use Cassandra as the backend would be a significant improvement as opposed to the current design using multiple copies of the same data (compute_node in MySQL DB, HostState in memory in the scheduler, ResourceTracker in memory in the compute node) and trying to keep them all in sync via passing messages. It was discussed at the Nova mid-cycle, and while there was certainly not an immediate rejection, it was felt very strongly that there was too much work for us to focus on, and such an experiment, no matter how potentially beneficial, would prevent us from accomplishing the tasks we had undertaken for Liberty. And, as disappointed as I was, I had to agree. So I promised that I would write up my ideas on why I thought that such an experiment was worthwhile.

I've finally gotten around to finishing writing up that proposal [1], and I'd like to hope that it would be the basis for future discussions about addressing some of the underlying issues that exist in OpenStack for historical reasons, and how we might rethink these choices today. I'd prefer comments and discussion here on the dev list, so that all can see your ideas, but I will be in Tokyo for the summit, and would also welcome some informal discussion there, too.

-- Ed Leafe

[0] http://lists.openstack.org/pipermail/openstack-dev/2015-July/069593.html
[1] http://blog.leafe.com/reimagining_scheduler/



asked Oct 7, 2015 in openstack-dev by Ed_Leafe

107 Responses


Just a question,

Why Cassandra?

I'm curious what drew you to using that for an experiment?

/me not saying it's a bad choice, just curious...

Ed Leafe wrote:
Several months ago I proposed an experiment [0] to see if switching the data model for the Nova scheduler to use Cassandra as the backend would be a significant improvement as opposed to the current design [...]

responded Oct 7, 2015 by Joshua_Harlow

On 07/10/15 13:36, Ed Leafe wrote:
Several months ago I proposed an experiment [0] to see if switching the data model for the Nova scheduler to use Cassandra as the backend would be a significant improvement as opposed to the current design using multiple copies of the same data (compute_node in MySQL DB, HostState in memory in the scheduler, ResourceTracker in memory in the compute node) and trying to keep them all in sync via passing messages.

It seems to me (disclaimer: not a Nova dev) that which database to use
is completely irrelevant to your proposal, which is really about moving
the scheduling from a distributed collection of Python processes with
ad-hoc (or sometimes completely missing) synchronisation into the
database to take advantage of its well-defined semantics. But you've
framed it in such a way as to guarantee that this never gets discussed,
because everyone will be too busy arguing about whether or not Cassandra
is better than Galera.

cheers,
Zane.


responded Oct 7, 2015 by Zane_Bitter

Excerpts from Zane Bitter's message of 2015-10-07 12:28:36 -0700:

It seems to me (disclaimer: not a Nova dev) that which database to use
is completely irrelevant to your proposal, which is really about moving
the scheduling from a distributed collection of Python processes with
ad-hoc (or sometimes completely missing) synchronisation into the
database to take advantage of its well-defined semantics. But you've
framed it in such a way as to guarantee that this never gets discussed,
because everyone will be too busy arguing about whether or not Cassandra
is better than Galera.

Your point is valid, Zane: the idea is more about having a synchronized
view of the scheduling state, and not about Cassandra.

I think Cassandra makes the proposal more realistic and easier to think
about, though, as Cassandra is focused on problems of the scale that this
represents. Galera won't do this well at any kind of scale without the
added complexity and inefficiency of cells. So whatever a single Galera
node's capacity to handle the write churn of a truly synchronized
scheduler turns out to be, that would be the maximum capacity of one cell.

I like the concrete nature of this proposal, and suggest people review
it as a whole, and not try to reduce it to its components without an
extremely strong reason to do so.


responded Oct 7, 2015 by Clint_Byrum

On 10/07/2015 11:36 AM, Ed Leafe wrote:

I've finally gotten around to finishing writing up that proposal [1], and I'd
like to hope that it would be the basis for future discussions about
addressing some of the underlying issues that exist in OpenStack for
historical reasons, and how we might rethink these choices today. I'd prefer
comments and discussion here on the dev list, so that all can see your ideas,
but I will be in Tokyo for the summit, and would also welcome some informal
discussion there, too.

-- Ed Leafe

[1] http://blog.leafe.com/reimagining_scheduler/

I've wondered for a while (ever since I looked at the scheduler code, really)
why we couldn't implement more of the scheduler as database transactions.

I haven't used Cassandra, so maybe you can clarify something about updates
across a distributed DB. I just read up on lightweight transactions, and it
says that they're restricted to a single partition. Is that an acceptable
limitation for this usage?

Some points that might warrant further discussion:

1) Some resources (RAM) only require tracking amounts. Other resources (CPUs,
PCI devices) require tracking allocation of specific individual host resources
(for CPU pinning, PCI device allocation, etc.). Presumably for the latter we
would have to actually do the allocation of resources at the time of the
scheduling operation in order to update the database with the claimed resources
in a race-free way.

2) Are you suggesting that all of nova switch to Cassandra, or just the
scheduler and resource tracking portions? If the latter, how would we handle
things like pinned CPUs and PCI devices that are currently associated with
specific instances in the nova DB?

3) The concept of the compute node updating the DB when things change is really
orthogonal to the new scheduling model. The current scheduling model would
benefit from that as well.

4) It seems to me that to avoid races we need to do one of the following. Which
are you proposing?
a) Serialize the entire scheduling operation so that only one instance can
schedule at once.
b) Make the evaluation of filters and claiming of resources a single atomic DB
transaction.
c) Do a loop where we evaluate the filters, pick a destination, try to claim the
resources in the DB, and retry the whole thing if the resources have already
been claimed.

Chris


responded Oct 7, 2015 by Chris_Friesen

On Oct 7, 2015, at 2:28 PM, Zane Bitter zbitter@redhat.com wrote:

It seems to me (disclaimer: not a Nova dev) that which database to use is completely irrelevant to your proposal,

Well, not entirely. What separates Cassandra from other DBs is exactly the feature that we need. The solution to the scheduler isn't simply to "use a database".

which is really about moving the scheduling from a distributed collection of Python processes with ad-hoc (or sometimes completely missing) synchronisation into the database to take advantage of its well-defined semantics. But you've framed it in such a way as to guarantee that this never gets discussed, because everyone will be too busy arguing about whether or not Cassandra is better than Galera.

Understood - all one has to do is review the original thread from back in July to see this happening. But the reason that I framed it then as an experiment in which we would come up with measures of success we could all agree on up-front was so that if someone else thought that Product Foo would be even better, we could set up a similar test bed and try it out. IOW, instead of bikeshedding, if you want a different color, you build another shed and we can all have a look.

-- Ed Leafe



responded Oct 8, 2015 by Ed_Leafe

I think if you went ahead and did the experiment, and had good results from it, the discussion would start to progress whether or not folks were fond of Cassandra or ...

Thanks,
Kevin


From: Ed Leafe [ed@leafe.com]
Sent: Wednesday, October 07, 2015 5:24 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] Scheduler proposal

[Ed's reply to Zane, quoted in full above, trimmed]
responded Oct 8, 2015 by Fox,_Kevin_M

On 7 October 2015 at 16:00, Chris Friesen chris.friesen@windriver.com wrote:

1) Some resources (RAM) only require tracking amounts. Other resources
(CPUs, PCI devices) require tracking allocation of specific individual host
resources (for CPU pinning, PCI device allocation, etc.). Presumably for
the latter we would have to actually do the allocation of resources at the
time of the scheduling operation in order to update the database with the
claimed resources in a race-free way.

The whole process is inherently racy (and this is inevitable, and correct),
which is why the scheduler works the way it does:

  • scheduler guesses at a host based on (guaranteed - hello distributed
    systems!) outdated information
  • VM is scheduled to a host that looks like it might work, and host
    attempts to run it
  • VM run may fail (because the information was outdated or has become
    outdated), in which case we retry the schedule
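
To make that flow concrete, here is a minimal sketch of the loop in Python
(all names are invented for illustration; this is not the actual Nova code
path):

    # Illustrative sketch only: schedule from a stale view, let the host
    # re-check against reality, and retry when the guess was wrong.
    import random

    class NoValidHost(Exception):
        pass

    stale_view = {"host1": 4096, "host2": 8192}  # free RAM in MB, may be outdated

    def actual_free(host):
        # Simulate drift between the scheduler's view and the host's reality.
        return stale_view[host] - random.choice([0, 2048])

    def spawn_on(host, ram_needed):
        # The compute host checks its *actual* state and may refuse the VM.
        if actual_free(host) < ram_needed:
            raise NoValidHost(host)

    def schedule(ram_needed, max_retries=3):
        for _ in range(max_retries):
            candidates = [h for h, free in stale_view.items() if free >= ram_needed]
            if not candidates:
                break
            host = random.choice(candidates)  # best guess from outdated data
            try:
                spawn_on(host, ram_needed)
                return host
            except NoValidHost:
                continue  # our information was outdated; reschedule
        raise NoValidHost("retries exhausted")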

In fact, with PCI devices the code has been written rather carefully to
make sure that they fit into this model. There is central per-device
tracking (which, fwiw, I argued against back in the day) but that's not how
allocation works (or, considering how long it is since I looked, worked).

PCI devices are actually allocated from pools of equivalent devices, and
allocation works in the same manner as other scheduling: you work out from
the nova boot call what constraints a host must satisfy (in this case, in
number of PCI devices in specific pools), you check your best guess at
global host state against those constraints, and you pick one of the hosts
that meets the constraints to schedule on.

So: yes, there is a central registry of devices, which we try to keep up to
date - but this is for admins to refer to, it's not a necessity of
scheduling. The scheduler input is the pool counts, which work largely the
same way as the available memory works as regards scheduling and updating.

No idea on CPUs, sorry, but again I'm not sure why the behaviour would be
any different: compare suspected host state against needs, schedule if it
fits, hope you got it right and tolerate if you didn't.

That being the case, it's worth noting that the database can be eventually
consistent and doesn't need to be transactional. It's also worth
considering that the database can have multiple (mutually inconsistent)
copies. There's no need to use a central datastore if you don't want to -
one theoretical example is to run multiple schedulers and let each
scheduler attempt to collate cloud state from unreliable messages from the
compute hosts. This is not quite what happens today, because messages we
send over Rabbit are reliable and therefore costly.
--
Ian.


responded Oct 8, 2015 by Ian_Wells

On Oct 7, 2015, at 6:00 PM, Chris Friesen chris.friesen@windriver.com wrote:

I've wondered for a while (ever since I looked at the scheduler code, really) why we couldn't implement more of the scheduler as database transactions.

I haven't used Cassandra, so maybe you can clarify something about updates across a distributed DB. I just read up on lightweight transactions, and it says that they're restricted to a single partition. Is that an acceptable limitation for this usage?

An implementation detail. A partition is defined by the partition key, not by any physical arrangement of nodes. The partition key would have to depend on the resource type, and whatever other columns would make such a query unique.
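
As a minimal sketch of how such a claim might look with the DataStax Python
driver (the keyspace, table, and column names here are all hypothetical):

    # Sketch only: a lightweight transaction as a compare-and-set claim.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("scheduler")  # hypothetical keyspace

    def try_claim(host, ram_requested, free_seen):
        # The IF clause makes this a lightweight (Paxos-backed) transaction,
        # which only has to coordinate within the single 'host' partition.
        result = session.execute(
            "UPDATE host_state SET free_ram_mb = %s "
            "WHERE host = %s IF free_ram_mb = %s",
            (free_seen - ram_requested, host, free_seen))
        return result[0][0]  # first column of an LWT result is the [applied] boolean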

Some points that might warrant further discussion:

1) Some resources (RAM) only require tracking amounts. Other resources (CPUs, PCI devices) require tracking allocation of specific individual host resources (for CPU pinning, PCI device allocation, etc.). Presumably for the latter we would have to actually do the allocation of resources at the time of the scheduling operation in order to update the database with the claimed resources in a race-free way.

Yes, that's correct. A lot of thought would have to be put into how to best represent these different types of resources, and that's something that I have ideas about, but would feel a whole lot better defining only after talking these concepts over with others who understand the underlying concepts better than I do.
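
Purely as a strawman for that distinction - and assuming the same
hypothetical session as the sketch above - amount-only resources could live
in a single column per host, while individually-allocated resources would
need a row per unit:

    # Strawman only: two hypothetical table shapes, nothing here is settled.
    session.execute("""
        CREATE TABLE IF NOT EXISTS host_state (
            host text PRIMARY KEY,
            free_ram_mb int              -- amount-only: a single number per host
        )""")
    session.execute("""
        CREATE TABLE IF NOT EXISTS host_pci_device (
            host text,
            device_addr text,            -- identity matters: one row per device
            allocated_to uuid,           -- unset until claimed by an instance
            PRIMARY KEY (host, device_addr)
        )""")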

2) Are you suggesting that all of nova switch to Cassandra, or just the scheduler and resource tracking portions? If the latter, how would we handle things like pinned CPUs and PCI devices that are currently associated with specific instances in the nova DB?

I am only thinking of the scheduler as a separate service. Perhaps Nova as a whole might benefit from switching to Cassandra for its database needs, but I haven't really thought about that at all.

3) The concept of the compute node updating the DB when things change is really orthogonal to the new scheduling model. The current scheduling model would benefit from that as well.

Actually, it isn't that different. Compute nodes send updates to the scheduler when instances are created/deleted/resized/etc., so this isn't much of a stretch.

4) It seems to me that to avoid races we need to do one of the following. Which are you proposing?
a) Serialize the entire scheduling operation so that only one instance can schedule at once.
b) Make the evaluation of filters and claiming of resources a single atomic DB transaction.
c) Do a loop where we evaluate the filters, pick a destination, try to claim the resources in the DB, and retry the whole thing if the resources have already been claimed.

Probably a combination of b) and c). Filters would, for lack of a better term, add CQL WHERE clauses to the query, which would return a set of acceptable hosts. Weighers would order these hosts in terms of desirability, and then the claim would be attempted. If the claim failed because the host had changed, the next acceptable host would be selected, etc. I don't imagine that "retrying the whole thing" would be an efficient option, unless there were no other acceptable hosts returned from the original filtering query.

Put another way: if we are in a racy situation, and two scheduler processes are trying to place a similar instance, both processes would most likely come up with the same set of hosts ordered in the same way. One of those processes would "win", and claim the first choice. The other would fail the transaction, and would then claim the second choice on the list. IMO, this is how you best deal with race conditions.
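
Roughly, and reusing the hypothetical host_state table and try_claim() from
the sketch above (ALLOW FILTERING is only to keep the example short; a real
layout would be indexed or denormalized to support the filters):

    # Sketch only: filter via WHERE, weigh by ordering, then claim down the list.
    def select_host(ram_requested):
        rows = session.execute(
            "SELECT host, free_ram_mb FROM host_state "
            "WHERE free_ram_mb >= %s ALLOW FILTERING",
            (ram_requested,))
        # "Weighing": most free RAM first, as one possible desirability order.
        candidates = sorted(rows, key=lambda r: r.free_ram_mb, reverse=True)
        for row in candidates:
            # A losing racer falls through to its next choice rather than
            # re-running the whole query.
            if try_claim(row.host, ram_requested, row.free_ram_mb):
                return row.host
        return None  # only now would we re-run the filtering query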

-- Ed Leafe



responded Oct 8, 2015 by Ed_Leafe

On 10/07/2015 07:23 PM, Ian Wells wrote:

The whole process is inherently racy (and this is inevitable, and correct),
which is why the scheduler works the way it does:

  • scheduler guesses at a host based on (guaranteed - hello distributed systems!)
    outdated information
  • VM is scheduled to a host that looks like it might work, and host attempts to
    run it
  • VM run may fail (because the information was outdated or has become outdated),
    in which case we retry the schedule

Why is it inevitable?

Theoretically if the DB knew about what resources were originally available and
what resources have been consumed, then it should be able to allocate resources
race-free (possibly with some retries involved if racing against other
schedulers updating the DB, but that would be internal to the scheduler itself).

Or does that just not scale enough and we need to use inherently racy models?

Chris


responded Oct 8, 2015 by Chris_Friesen

Forgive the top-post.

Cross-posting to openstack-operators for their feedback as well.

Ed, the work seems very promising, and I am interested to see how this
evolves.

With my operator hat on I have one piece of feedback.

By adding in a new database solution (Cassandra) we are now up to three
different database solutions in use in OpenStack:

MySQL (practically everything)
MongoDB (Ceilometer)
Cassandra.

Not to mention two different message queues:
Kafka (Monasca)
RabbitMQ (everything else)

Operational overhead has a real cost - maintaining three different database
tools, backing them up, providing HA, and so on.

This is not to say that this cannot be managed, but it should be taken
into consideration.

And if they can be consolidated into an agreed solution across the
whole of OpenStack - that would be highly beneficial (IMHO).

--
Best Regards,
Maish Saidel-Keesing

On 10/08/15 03:24, Ed Leafe wrote:
[Ed's reply to Zane, quoted in full above, trimmed]


responded Oct 8, 2015 by maishsk_at_maishsk.c
...