
[openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"


Hi,

I've uploaded a prototype, https://review.openstack.org/#/c/280047/, to demonstrate its design goals of improved accuracy, performance, reliability and compatibility. It will also be an Austin Summit session if the talk is elected: https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316

I want to gather opinions about this idea:
1. Is it possible for this feature to be accepted in the Newton release?
2. Suggestions to improve its design and compatibility.
3. Possibilities to integrate with the resource-provider bp series: I know resource-provider is the major direction of the Nova scheduler, and there will be fundamental changes in the future, especially according to the bp https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst. However, this prototype proposes a much faster and compatible way to make scheduling decisions based on scheduler caches. The in-memory decisions are made at the same speed as in the caching scheduler, but the caches are kept consistent with compute nodes as quickly as possible, without db refreshing.

Here is the detailed design of the mentioned prototype:


Background:
The host state cache maintained by the host manager is the scheduler's resource view during scheduling decisions. It is updated whenever a request is received[1], and all the compute node records are retrieved from the db every time. There are several problems with this update model, demonstrated in experiments[3] (a simplified sketch of this refresh-per-request model follows the list below):
1. Performance: The scheduler performance is largely affected by the db access needed to retrieve compute node records. The db block time of a single request averages 355ms in a deployment of 3 compute nodes, compared with only 3ms for the in-memory decision-making itself. Imagine a deployment with 1k or even 10k nodes in the future.
2. Race conditions: This is not only a parallel-scheduler problem, but also a problem when using only one scheduler. A detailed analysis of the one-scheduler problem is in the bug analysis[2]. In short, there is a gap between the moment the scheduler makes a decision in its host state cache and the moment the compute node updates its in-db resource record according to that decision in the resource tracker. Because of this gap, a recent resource consumption in the scheduler cache can be lost and overwritten by compute node data, resulting in cache inconsistency and unexpected retries. In a one-scheduler experiment using a 3-node deployment, 7 retries out of 31 concurrent schedule requests were recorded, resulting in 22.6% extra performance overhead.
3. Parallel scheduler support: The design of the filter scheduler leads to an even worse result when using parallel schedulers. In the same experiment with 4 schedulers on separate machines, the average db block time increases to 697ms per request and there are 16 retries out of 31 schedule requests, i.e. 51.6% extra overhead.
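A minimal sketch of the refresh-per-request model described above; the names loosely follow the host manager code referenced in [1], but this is a simplified illustration of the update model, not the actual Nova code:

class DbRefreshHostManager(object):
    """Sketch of the current update model: rebuild the cache from the db
    for every schedule request."""

    def __init__(self, db_api):
        self.db_api = db_api        # stand-in for the nova db/objects layer
        self.host_state_map = {}    # hostname -> dict of resource fields

    def get_all_host_states(self, context):
        # Every incoming request pays this db round trip: all compute node
        # records are re-read and the whole cache is rebuilt, which is the
        # db block time measured in the experiments above.
        for node in self.db_api.compute_node_get_all(context):
            state = self.host_state_map.setdefault(node['host'], {})
            state.update({
                'free_ram_mb': node['free_ram_mb'],
                'free_disk_gb': node['free_disk_gb'],
                'vcpus_used': node['vcpus_used'],
            })
        return list(self.host_state_map.values())

The prototype below keeps the same kind of map up to date from pushed updates instead of rebuilding it from the db on every request.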

Improvements:
This prototype addresses the issues above by implementing a new update model for the scheduler host state cache. Instead of refreshing caches from the db, every compute node maintains its own accurate version of the host state cache, updated by the resource tracker, and sends incremental updates directly to schedulers. The scheduler caches are therefore synchronized to the correct state as soon as possible, with the lowest overhead. In addition, the scheduler sends a resource claim together with its decision to the target compute node. The compute node can decide immediately from its local host state cache whether the claim is successful and send the response back ASAP. With all claims tracked from schedulers to compute nodes, no false overwrites happen, and the gap between the scheduler cache and the real compute node state is minimized (a sketch of this claim flow follows the list below). The benefits are clear in the recorded experiments[3] compared with the caching scheduler and the filter scheduler:
1. There is no db block time during scheduler decision making; the average decision time per request is about 3ms in both single- and multiple-scheduler scenarios, which is equal to the in-memory decision time of the filter scheduler and the caching scheduler.
2. Since the scheduler claims are tracked and the "false overwrite" is eliminated, there should be 0 retries in a one-scheduler deployment, as demonstrated in the experiment. Thanks to the quick claim-response implementation, there are only 2 retries out of 31 requests in the 4-scheduler experiment.
3. All the filtering and weighing algorithms are compatible because the data structure of HostState is unchanged. In fact, this prototype even supports the filter scheduler running at the same time (already tested). Operations with resource changes such as migration, resizing or shelving make claims in the resource tracker directly and update the compute-node host state immediately, so they need no major changes.
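A minimal sketch of this claim flow; SchedulerCache, ComputeNodeState and their methods are illustrative names rather than the prototype's actual classes, and only RAM is modeled to keep the example short:

class SchedulerCache(object):
    """Scheduler-side host state cache, kept fresh by pushed updates."""

    def __init__(self, rpc):
        self.rpc = rpc              # messaging client offering cast() only
        self.host_states = {}       # hostname -> dict of resource fields

    def apply_update(self, hostname, delta):
        # Incremental update pushed by a compute node's resource tracker.
        self.host_states.setdefault(hostname, {}).update(delta)

    def schedule(self, chosen_host, claim_id, memory_mb):
        # Consume the resources locally first, then cast the claim to the
        # chosen node; nothing blocks the decision path.
        self.host_states[chosen_host]['free_ram_mb'] -= memory_mb
        self.rpc.cast(chosen_host, 'claim',
                      claim_id=claim_id, memory_mb=memory_mb)


class ComputeNodeState(object):
    """Compute-side authoritative host state maintained by the RT."""

    def __init__(self, rpc, free_ram_mb):
        self.rpc = rpc
        self.free_ram_mb = free_ram_mb

    def handle_claim(self, scheduler, claim_id, memory_mb):
        # Accept or reject the claim immediately from the local view and
        # cast the result back, so false overwrites cannot happen.
        accepted = self.free_ram_mb >= memory_mb
        if accepted:
            self.free_ram_mb -= memory_mb
        self.rpc.cast(scheduler, 'claim_reply',
                      claim_id=claim_id, accepted=accepted)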

Extra features:
More effort was made to adapt the implementation to real-world scenarios, such as network issues, services going down unexpectedly, and bursts of messages:
1. The communication between schedulers and compute nodes uses only casts; there are no RPC calls and thus no blocking during scheduling.
2. All updates from nodes to schedulers are labelled with an incremental seed, so any message reordering, loss or duplication due to network issues can be detected immediately by MessageWindow, and the inconsistent cache can be detected and refreshed correctly (sketched after this list).
3. Bursts of messages are compressed by MessagePipe in its async mode: there is no need to send every message one by one through the MQ, because they can be merged before being sent to schedulers (also sketched after this list).
4. When a new service comes up or recovers, it sends notifications to all known remotes for quick cache synchronization, even before the service record is available in the db. And if a remote service is unexpectedly down according to the service group records, no more messages are sent to it. The ComputeFilter is also removed because of this feature; the scheduler can detect remote compute nodes by itself.
5. In fact, claims are tracked not only from schedulers to compute nodes, but also from the compute-node host state to the resource tracker. One reason is that there is still a gap between a claim being acknowledged by the compute-node host state and the claim succeeding in the resource tracker, so it is necessary to track those unhandled claims to keep the host state accurate. The second reason is to separate schedulers from compute nodes and resource trackers: the scheduler only exposes the limited interfaces update_from_compute and handle_rt_claim_failure to the compute service and the RT, so testing and reuse are easier with clear boundaries.
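MessageWindow and MessagePipe are component names from the prototype, but the snippets below only illustrate the underlying ideas (sequence-number gap detection and update batching), not the prototype's actual implementation:

class MessageWindow(object):
    """Detect lost, duplicated or reordered updates via the incremental seed."""

    def __init__(self):
        self.expected_seed = 0

    def accept(self, seed):
        if seed == self.expected_seed:
            self.expected_seed += 1
            return 'ok'
        if seed < self.expected_seed:
            return 'duplicate'   # already applied, safe to drop
        return 'gap'             # an update was lost: the cache may be
                                 # inconsistent, so request a full refresh


class MessagePipe(object):
    """Merge pending incremental updates before casting them to schedulers."""

    def __init__(self, send_fn):
        self.send_fn = send_fn
        self.pending = {}

    def push(self, delta):
        # Later values for the same field overwrite earlier ones, so a burst
        # of updates collapses into a single message.
        self.pending.update(delta)

    def flush(self):
        if self.pending:
            self.send_fn(self.pending)
            self.pending = {}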

TODOs:
There are still many features to be implemented; the most important are unit tests and incremental updates to PCI and NUMA resources, all of which are marked inline.

References:
[1] https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104
[2] https://bugs.launchpad.net/nova/+bug/1341420/comments/24
[3] http://paste.openstack.org/show/486929/
----------------------------<<

The original commit history of this prototype is located at https://github.com/cyx1231st/nova/commits/shared-scheduler
For instructions on installing and testing this prototype, please refer to the commit message of https://review.openstack.org/#/c/280047/

Regards,
-Yingxin


asked Feb 15, 2016 in openstack-dev by Cheng,_Yingxin (1,120 points)  

41 Responses


Yingxin,

This looks quite similar to the work of this bp:
https://blueprints.launchpad.net/nova/+spec/no-db-scheduler

It's really nice that somebody is still trying to push scheduler
refactoring in this way.
Thanks.

Best regards,
Boris Pavlovic

responded Feb 15, 2016 by boris_at_pavlovic.me (6,900 points)   1 4 6

Thanks Boris. The ideas are quite similar in "do not have db accesses during scheduler decision making": db accesses introduce blocking, which is very bad for the lock-free design of the nova scheduler.

Another important idea is that "only the compute node knows its own final compute-node resource view", or "the accurate resource view only exists at the place where it is actually consumed." That is, incremental updates can only come from the actual "consumption" action, no matter where it happens (e.g. compute node, storage service, network service, etc.). To borrow the terms from resource-provider: a compute node can maintain an accurate version of its "compute-node-inventory" cache and send incremental updates because it actually consumes compute resources; similarly, a storage service can maintain an accurate version of a "storage-inventory" cache and send incremental updates if it actually consumes storage resources. If there are central services in charge of consuming all the resources, the accurate cache and the updates must come from them.

The third idea is "compatibility". This prototype keeps a very small scope by only introducing a new host manager driver, "sharedhostmanager", with minor other changes. The driver can be switched back to "hostmanager" very easily, and it can also run alongside filter schedulers and caching schedulers. Most importantly, the filtering and weighing algorithms are kept unchanged. So the complete version of the "shared state scheduler" can be introduced gradually, with more changes added over time.

Regards,
-Yingxin

responded Feb 15, 2016 by Cheng,_Yingxin (1,120 points)  

Yingxin,

Basically, what we implemented was the following:

  • The scheduler consumes RPC updates from computes
  • The scheduler keeps the world state in memory (each message from a compute is treated like an incremental update)
  • Incremental updates are shared across multiple scheduler instances (so one message from a compute is only consumed once)
  • Schema-less host state (to be able to use a single scheduler service for all resources); a rough sketch of this model follows the list
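A rough sketch of the in-memory, schema-less world state fed by RPC updates that these bullets describe (class and method names here are illustrative, not the actual no-db-scheduler code):

class InMemoryScheduler(object):
    """Scheduler that never touches the db during decision making."""

    def __init__(self):
        self.world_state = {}    # hostname -> arbitrary dict (schema-less)

    def consume_update(self, hostname, delta):
        # Each RPC message from a compute is treated as an incremental
        # update merged into the in-memory world state.
        self.world_state.setdefault(hostname, {}).update(delta)

    def select_host(self, requirements):
        # Pick any host whose advertised fields satisfy the request; the
        # schema-less state lets the same service handle any resource type.
        for host, state in self.world_state.items():
            if all(state.get(key, 0) >= amount
                   for key, amount in requirements.items()):
                return host
        return None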

All of this was done in a backward-compatible way, and it was really easy to migrate.

If this had been accepted, we were planning to work on making the scheduler independent from Nova (which is actually quite a simple task after those changes) and moving that code outside of Nova.

So the solutions are quite similar overall.
I hope you'll have more luck getting them upstream.

Best regards,
Boris Pavlovic

responded Feb 15, 2016 by boris_at_pavlovic.me (6,900 points)   1 4 6

On 15/02/2016 06:21, Cheng, Yingxin wrote:

I want to gather opinions about this idea:

  1. Is it possible for this feature to be accepted in the Newton release?

Such a feature requires a spec file to be written:
http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged

Ideally, I'd like to see your ideas below written in that spec file, as that would be the best way to discuss the design.

  2. Suggestions to improve its design and compatibility.

I don't want to go into details here (that's rather the goal of the spec), but my biggest concerns when reviewing the spec would be:
- how this can meet the OpenStack mission statement (i.e. a ubiquitous solution that is easy to install and massively scalable)
- how this can be integrated with the existing pieces (filters, weighers) to provide a clean and simple upgrade path for operators
- how this can support rolling upgrades (old computes sending updates to new schedulers)
- how we can test it
- whether the feature can be optional for operators

  3. Possibilities to integrate with the resource-provider bp series: I know resource-provider is the major direction of the Nova scheduler, and there will be fundamental changes in the future, especially according to the bp https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst. However, this prototype proposes a much faster and compatible way to make scheduling decisions based on scheduler caches. The in-memory decisions are made at the same speed as in the caching scheduler, but the caches are kept consistent with compute nodes as quickly as possible, without db refreshing.

That's the key point, thanks for noting our priorities. So, you know that our resource modeling is drastically subject to change in Mitaka and Newton. That is the new game, so I'd love to see how you plan to interact with that.
Ideally, I'd appreciate it if Jay Pipes, Chris Dent and you could share your ideas, because all of you have great ideas for improving a currently frustrating solution.

-Sylvain

responded Feb 15, 2016 by Sylvain_Bauza (14,100 points)   1 3 5

Thanks Sylvain,

  1. The ideas below will be extended into a spec ASAP.

  2. Thanks for raising concerns I had not yet thought of; they will be addressed in the spec soon.

  3. Let me copy my thoughts from another thread about the integration with resource-provider:
    The idea is that "only the compute node knows its own final compute-node resource view", or "the accurate resource view only exists at the place where it is actually consumed." That is, incremental updates can only come from the actual "consumption" action, no matter where it happens (e.g. compute node, storage service, network service, etc.). To borrow the terms from resource-provider: a compute node can maintain an accurate version of its "compute-node-inventory" cache and send incremental updates because it actually consumes compute resources; similarly, a storage service can maintain an accurate version of a "storage-inventory" cache and send incremental updates if it actually consumes storage resources. If there are central services in charge of consuming all the resources, the accurate cache and the updates must come from them.

Regards,
-Yingxin

responded Feb 15, 2016 by Cheng,_Yingxin (1,120 points)  
0 votes

On 15/02/2016 10:48, Cheng, Yingxin wrote:

Thanks Sylvain,

  1. The below ideas will be extended to a spec ASAP.

Nice, looking forward to it then :-)

  2. Thanks for raising concerns I have not thought of yet; they will be
    addressed in the spec soon.

  3. Let me copy my thoughts from another thread about the integration
    with resource-provider:

The idea is that “only the compute node knows its own final
compute-node resource view”, or “the accurate resource view only
exists at the place where it is actually consumed.” That is, the
incremental updates can only come from the actual “consumption”
action, no matter where it happens (e.g. compute node, storage
service, network service, etc.). Borrowing the terms from
resource-provider: compute nodes can maintain an accurate version of
the “compute-node-inventory” cache and can send incremental updates
because they actually consume compute resources; likewise, a storage
service can maintain an accurate version of a “storage-inventory”
cache and send incremental updates if it consumes storage resources.
If there are central services in charge of consuming all the
resources, the accurate cache and updates must come from them.

That is one of the things I'd like to see in your spec, and how you
could interact with the new model.
Thanks,
-Sylvain

Regards,

-Yingxin

From: Sylvain Bauza [mailto:sbauza@redhat.com]
Sent: Monday, February 15, 2016 5:28 PM
To: OpenStack Development Mailing List (not for usage questions)
openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [nova] A prototype implementation
towards the "shared state scheduler"

On 15/02/2016 06:21, Cheng, Yingxin wrote:

I want to gather opinions about this idea:

1. Is this feature possible to be accepted in the Newton release?

Such a feature requires a spec file to be written:
http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged

Ideally, I'd like to see your ideas below written in that spec file, as
that would be the best way to discuss the design.

2. Suggestions to improve its design and compatibility.

I don't want to go into details here (that's rather the goal of the
spec), but my biggest concerns when reviewing the spec would be:
- how this can meet the OpenStack mission statement (i.e. a ubiquitous
solution that would be easy to install and massively scalable)
- how this can be integrated with the existing filters and weighers to
provide a clean and simple upgrade path for operators
- how this can support rolling upgrades (old computes sending
updates to a new scheduler)
- how can we test it
- can we have the feature optional for operators

3. Possibilities to integrate with resource-provider bp series: I
know resource-provider is the major direction of Nova scheduler,
and there will be fundamental changes in the future, especially
according to the bp
https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
However, this prototype proposes a much faster and compatible way
to make schedule decisions based on scheduler caches. The
in-memory decisions are made at the same speed with the caching
scheduler, but the caches are kept consistent with compute nodes
as quickly as possible without db refreshing.

That's the key point, thanks for noticing our priorities. So, you know
that our resource modeling is drastically subject to change in Mitaka
and Newton. That is the new game, so I'd love to see how you plan to
interact with that.
Ideally, I'd appreciate it if Jay Pipes, Chris Dent and you could share
your ideas, because all of you have great ideas for improving a
currently frustrating solution.

-Sylvain

responded Feb 15, 2016 by Sylvain_Bauza (14,100 points)   1 3 5
0 votes


On 02/15/2016 03:27 AM, Sylvain Bauza wrote:

  • can we have the feature optional for operators

One thing that concerns me is the lesson learned from simply having a
compute node's instance information sent and persisted in memory. That
was resisted by several large operators, due to overhead. This
proposal will have to store that and more in memory.


responded Feb 15, 2016 by Ed_Leafe (11,720 points)   1 3 6
0 votes

To better illustrate the differences between the shared-state, resource-provider and legacy schedulers, I've drawn 3 simplified pictures [1] emphasizing the location of the resource view, the location of claims and resource consumption, and the resource update/refresh pattern in the three kinds of schedulers. I hope I'm correct in the "resource-provider scheduler" part.

My point of view from analysing and comparing the three schedulers (before real experiments):
1. Performance: The performance bottleneck of the resource-provider and legacy schedulers comes from the centralized db and scheduler cache refreshing. It can be alleviated by moving to a stand-alone high-performance database, and the cache refreshing is designed to be replaced by direct SQL queries according to the resource-provider scheduler spec [2]. The performance bottleneck of the shared-state scheduler may come from overwhelming update messages; it can likewise be alleviated by moving to a stand-alone distributed message queue and by using the "MessagePipe" to merge messages.
2. Final decision accuracy: I think the accuracy of the final decision is high in all three schedulers, because the consistent resource view and the final resource consumption with claims are still in the same place: the resource trackers in the shared-state and legacy schedulers, and the resource-provider db in the resource-provider scheduler.
3. Scheduler decision accuracy: IMO the order of accuracy of a single schedule decision is resource-provider > shared-state >> legacy scheduler. The resource-provider scheduler gets the accurate resource view directly from the db. The shared-state scheduler keeps its resource view as accurate as possible by constantly collecting updates from resource trackers and by tracking the claims sent from schedulers to RTs. The legacy scheduler's decision is the worst because it doesn't track its claims and gets its resource view from compute-node records, which are not that accurate.
4. Design goal difference:
The fundamental design goals of the two new schedulers are different. To copy my views from [2], I think it is the choice between "loose distributed consistency with retries" and "strict centralized consistency with locks".

As can be seen in the illustrations [1], the main compatibility issue between the shared-state and resource-provider schedulers is caused by the different location of claims/consumption and of the assumed consistent resource view. IMO, unless claims are allowed to happen in both places (the resource tracker and the resource-provider db), it seems difficult to make the shared-state and resource-provider schedulers work together.

[1] https://docs.google.com/document/d/1iNUkmnfEH3bJHDSenmwE4A1Sgm3vnqa6oZJWtTumAkw/edit?usp=sharing
[2] https://review.openstack.org/#/c/271823/

Regards,
-Yingxin

responded Feb 17, 2016 by Cheng,_Yingxin (1,120 points)  
0 votes

On Wed, 17 Feb 2016, Cheng, Yingxin wrote:

To better illustrate the differences between shared-state, resource-
provider and legacy scheduler, I've drew 3 simplified pictures [1] in
emphasizing the location of resource view, the location of claim and
resource consumption, and the resource update/refresh pattern in three
kinds of schedulers. Hoping I'm correct in the "resource-provider
scheduler" part.

That's a useful visual aid, thank you. It aligns pretty well with my
understanding of each idea.

A thing that may be missing, which may help in exploring the usefulness
of each idea, is a representation of resources which are separate
from compute nodes and shared by them, such as shared disk or pools
of network addresses. In addition some would argue that we need to
see bare-metal nodes for a complete picture.

One of the driving motivations of the resource-provider work is to
make it possible to adequately and accurately track and consume the
shared resources. The legacy scheduler currently fails to do that
well. As you correctly point out, it does this by having "strict
centralized consistency" as a design goal.

As can be seen in the illustrations [1], the main compatibility issue
between shared-state and resource-provider scheduler is caused by the
different location of claim/consumption and the assumed consistent
resource view. IMO unless the claims are allowed to happen in both
places(resource tracker and resource-provider db), it seems difficult
to make shared-state and resource-provider scheduler work together.

Yes, but doing claims twice feels intuitively redundant.

As I've explored this space I've often wondered why we feel it is
necessary to persist the resource data at all. Your shared-state
model is appealing because it lets the concrete resource(-provider)
be the authority about its own resources. That is information which
it can broadcast as it changes or at intervals (or both) to other
things which need that information. That feels like the correct
architecture in a massively distributed system, especially one where
resources are not scarce.

The advantage of a centralized datastore for that information is
that it provides administrative control (e.g. reserving resources for
other needs) and visibility. That level of command and control seems
to be something people really want (unfortunately).

--
Chris Dent (╯°□°)╯︵┻━┻ http://anticdent.org/
freenode: cdent tw: @anticdent

responded Feb 17, 2016 by cdent_plus_os_at_ant (12,800 points)   2 2 5
0 votes

On 17/02/2016 12:59, Chris Dent wrote:

On Wed, 17 Feb 2016, Cheng, Yingxin wrote:

To better illustrate the differences between shared-state, resource-
provider and legacy scheduler, I've drew 3 simplified pictures [1] in
emphasizing the location of resource view, the location of claim and
resource consumption, and the resource update/refresh pattern in three
kinds of schedulers. Hoping I'm correct in the "resource-provider
scheduler" part.

That's a useful visual aid, thank you. It aligns pretty well with my
understanding of each idea.

A thing that may be missing, which may help in exploring the usefulness
of each idea, is a representation of resources which are separate
from compute nodes and shared by them, such as shared disk or pools
of network addresses. In addition some would argue that we need to
see bare-metal nodes for a complete picture.

One of the driving motivations of the resource-provider work is to
make it possible to adequately and accurately track and consume the
shared resources. The legacy scheduler currently fails to do that
well. As you correctly points out it does this by having "strict
centralized consistency" as a design goal.

So, to be clear, I'm really happy to see the resource-providers series
for many reasons:
- it will give us a nice facade for getting the resources and
attributing them
- it will help shared-storage deployments by making sure that we
don't have resource problems when a resource is shared
- it will make it possible for external resource providers to
provide some resource types to Nova so the Nova scheduler can use them
(like Neutron-related resources)

I really want to have that implemented in Mitaka and Newton, and I'm
totally on board and supporting it.

To be clear, the only problem I see with the series is [2], not the whole series.

As can be seen in the illustrations [1], the main compatibility issue
between shared-state and resource-provider scheduler is caused by the
different location of claim/consumption and the assumed consistent
resource view. IMO unless the claims are allowed to happen in both
places(resource tracker and resource-provider db), it seems difficult
to make shared-state and resource-provider scheduler work together.

Yes, but doing claims twice feels intuitively redundant.

As I've explored this space I've often wondered why we feel it is
necessary to persist the resource data at all. Your shared-state
model is appealing because it lets the concrete resource(-provider)
be the authority about its own resources. That is information which
it can broadcast as it changes or on intervals (or both) to other
things which need that information. That feels like the correct
architecture in a massively distributed system, especially one where
resources are not scarce.

So, IMHO, only the compute nodes should be the authority for
allocating resources. There are many reasons for that, which I provided
in the spec review, but I can repeat them here:

  • #1 If we consider that an external system, as a resource provider,
    will provide usage of a single resource class (like network segment
    availability), it will still require the instance to be spawned
    to consume that resource class, even if the scheduler accounts
    for it. That would mean that the scheduler would have to manage a
    list of allocations with a TTL, and periodically verify that the
    allocation succeeded by asking the external system (or getting
    feedback from it). See, that's racy.
  • #2 The scheduler is just a decision maker; in any case it doesn't
    perform the real instance creation (it doesn't hold ownership of
    the instance). Having it be accountable for the instances' usage is
    very difficult. Take for example a request for CPU pinning or NUMA
    affinity: the user can't really express which pCPU pin he will get;
    that's something the compute node does for him. Of course, the
    scheduler will help pick a host that can fit the request, but the
    real pinning happens on the compute node.

Also, I'm very interested in keeping an optimistic scheduler which
wouldn't lock the entire view of the world every time a request comes in.
There are many papers showing different architectures and benchmarks
of the different possibilities, and TBH I'm very concerned about the
scaling effect.
We should also keep in mind our new paradigm called Cells V2, which
implies a global distributed scheduler handling all requests. Having
it follow the same design tenets as OpenStack [3] by being an
"eventually consistent shared state" makes my gut say that I'd love
to see that.

The advantage of a centralized datastore for that information is
that it provides administrative control (e.g. reserving resources for
other needs) and visibility. That level of command and control seems
to be something people really want (unfortunately).

My point is that while I truly understand the need for an API
resource like "scheduler, tell me how much of my cloud is free", that
doesn't necessarily need to be accurate; eventually consistent is enough.
If operators want to do capacity planning, they need trends and
thresholds, not exact knowledge of precise amounts that can change
every time a request comes in.

-Sylvain

[2] https://review.openstack.org/#/c/271823/
[3] https://wiki.openstack.org/wiki/BasicDesignTenets


responded Feb 17, 2016 by Sylvain_Bauza (14,100 points)   1 3 5
...