
[Openstack-operators] Fwd: Re: [nova][ironic][scheduler][placement] IMPORTANT: Getting rid of the automated reschedule functionality

0 votes

Oops replied to the wrong list.

-------- Forwarded Message --------
Subject: Re: [Openstack-operators] [nova][ironic][scheduler][placement]
IMPORTANT: Getting rid of the automated reschedule functionality
Date: Mon, 22 May 2017 14:12:49 -0500
From: Matt Riedemann mriedem.os@gmail.com
To: openstack-dev@lists.openstack.org

On 5/22/2017 1:50 PM, Matt Riedemann wrote:

On 5/22/2017 12:54 PM, Jay Pipes wrote:

Hi Ops,

I need your feedback on a very important direction we would like to
pursue. I realize that there were Forum sessions about this topic at
the summit in Boston and that there were some decisions that were
reached.

I'd like to revisit that decision and explain why I'd like your
support for getting rid of the automatic reschedule behaviour entirely
in Nova for Pike.

== The current situation and why it sucks ==

Nova currently attempts to "reschedule" instances when any of the
following events occur:

a) the "claim resources" process that occurs on the nova-compute
worker results in the chosen compute node exceeding its own capacity

b) in between the time a compute node was chosen by the scheduler,
another process launched an instance that would violate an affinity
constraint

c) an "unknown" exception occurs during the spawn process. In
practice, this really only is seen when the Ironic baremetal node that
was chosen by the scheduler turns out to be unreliable (IPMI issues,
BMC failures, etc) and wasn't able to launch the instance. [1]

The logic for handling these reschedules makes the Nova conductor,
scheduler and compute worker code very complex. With the new cellsv2
architecture in Nova, child cells are not able to communicate with the
Nova scheduler (and thus "ask for a reschedule").

To be clear, they are able to communicate, and do, as long as you
configure them to be able to do so. The long-term goal is that you don't
have to configure them to be able to do so, so we're trying to design
and work in that mode toward that goal.

We (the Nova team) would like to get rid of the automated rescheduling
behaviour that Nova currently exposes because we could eliminate a
large amount of complexity (which leads to bugs) from the
already-complicated dance of communication that occurs between
internal Nova components.

== What we would like to do ==

With the move of the resource claim to the Nova scheduler [2], we can
entirely eliminate the a) class of Reschedule causes.

This leaves class b) and c) causes of Rescheduling.

For class b) causes, we should be able to solve this issue when the
placement service understands affinity/anti-affinity (maybe
Queens/Rocky). Until then, we propose that instead of raising a
Reschedule when an affinity constraint was last-minute violated due to
a racing scheduler decision, that we simply set the instance to an
ERROR state.

Personally, I have only ever seen anti-affinity/affinity use cases in
relation to NFV deployments, and in every NFV deployment of OpenStack
there is a VNFM or MANO solution that is responsible for the
orchestration of instances belonging to various service function
chains. I think it is reasonable to expect the MANO system to be
responsible for attempting a re-launch of an instance that was set to
ERROR due to a last-minute affinity violation.

Operators, do you agree with the above?

Finally, for class c) Reschedule causes, I do not believe that we
should be attempting automated rescheduling when "unknown" errors
occur. I just don't believe this is something Nova should be doing.

I recognize that large Ironic users expressed their concerns about
IPMI/BMC communication being unreliable and not wanting to have users
manually retry a baremetal instance launch. But, on this particular
point, I'm of the opinion that Nova should just do one thing and do it well.
Nova isn't an orchestrator, nor is it intending to be a "just
continually try to get me to this eventual state" system like Kubernetes.

If we removed Reschedule for class c) failures entirely, large Ironic
deployers would have to train users to manually retry a failed launch
or would need to write a simple retry mechanism into whatever
client/UI that they expose to their users.
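
(To be concrete, by "a simple retry mechanism" I mean something on the
order of the rough sketch below. It assumes openstacksdk and a clouds.yaml
entry named "mycloud"; the function name and parameters are illustrative
only, not a recommendation.)

    import time
    import openstack

    def boot_with_retries(name, image_id, flavor_id, attempts=3):
        # Rough client-side retry loop; real code would also pass networks,
        # keys, etc., and likely back off between attempts.
        conn = openstack.connect(cloud='mycloud')
        for _ in range(attempts):
            server = conn.compute.create_server(
                name=name, image_id=image_id, flavor_id=flavor_id)
            while server.status not in ('ACTIVE', 'ERROR'):
                time.sleep(5)
                server = conn.compute.get_server(server.id)
            if server.status == 'ACTIVE':
                return server
            # Bad node (dead BMC, IPMI flake, ...): clean up and try again.
            conn.compute.delete_server(server.id)
        raise RuntimeError('boot failed after %d attempts' % attempts)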

Ironic operators, would the above decision force you to abandon Nova
as the multi-tenant BMaaS facility?

Thanks in advance for your consideration and feedback.

Best,
-jay

[1] This really does not occur with any frequency for hypervisor virt
drivers, since the exceptions those hypervisors throw are caught by
the nova-compute worker and handled without raising a Reschedule.

Are you sure about that?

https://github.com/openstack/nova/blob/931c3f48188e57e71aa6518d5253e1a5bd9a27c0/nova/compute/manager.py#L2041-L2049

The compute manager handles anything non-specific that leaks up from the
virt driver.spawn() method and reschedules it. Think
ProcessExecutionError when vif plugging fails in the libvirt driver
because the command blew up for some reason (sudo on the host is
wrong?). I'm not saying it should, as I'm guessing most of these types
of failures are due to misconfiguration, but it is how things currently
work today.
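
(For anyone skimming, here is a boiled-down, self-contained sketch of that
pattern. The class and function names are made up for illustration; this is
not the actual Nova code at that link.)

    # Toy illustration of the "diaper" catch-all described above: anything
    # non-specific escaping driver.spawn() becomes a reschedule request.
    class KnownFatalError(Exception):
        """Stand-in for exceptions Nova deliberately treats as terminal."""

    class RescheduledException(Exception):
        def __init__(self, instance_uuid, reason):
            super().__init__("reschedule %s: %s" % (instance_uuid, reason))
            self.instance_uuid = instance_uuid
            self.reason = reason

    def build_and_run(driver, instance):
        try:
            driver.spawn(instance)        # the virt driver does the real work
        except KnownFatalError:
            raise                         # specific failures stay fatal
        except Exception as exc:
            # e.g. a ProcessExecutionError from failed vif plugging lands
            # here and gets retried on another host, even if the root cause
            # is a misconfiguration that will fail everywhere.
            raise RescheduledException(instance.uuid, str(exc)) from exc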

Not to sound like we don't have a united front here, but I want to
restate the concern I expressed this morning when talking about this.

I'm not an operator and don't have the background or experience there.

The 95% number thrown around at the summit was made up, as far as I
know. There is no published data that I'm aware of which says someone
tested reschedules at scale and in 95% of cases they were due to the
situation described in (a) above.

We're less than three weeks from the p-2 milestone. Feature freeze is
July 27. That is plenty of time (ideally) to get this code done and
merged. However, I don't want to underestimate the number of weird
things that are going to come out of this pretty large change in how
things work, especially when multiple cells and quotas changes are
happening.

Therefore I'm on the side of being conservative here and allowing
reschedules within a cell for now. I think long-term it'd be a good idea
to disable reschedules by default for new installs, and to let people who
really need them (or feel more secure by having them) turn them on. But I'd
rather see that gradually phased out once we see how things are working for
a while (at least a release).

Yes that means possible duplication and technical debt, but I think
we've always accepted some of that, at least temporarily, for large
changes so we can ease the transition.

--

Thanks,

Matt


asked May 23, 2017 in openstack-operators by mriedemos_at_gmail.c (15,720 points)   2 4 10

31 Responses

0 votes

On Mon, May 22, 2017 at 10:54 AM, Jay Pipes jaypipes@gmail.com wrote:

Hi Ops,

Hi!

For class b) causes, we should be able to solve this issue when the
placement service understands affinity/anti-affinity (maybe Queens/Rocky).
Until then, we propose that instead of raising a Reschedule when an
affinity constraint was last-minute violated due to a racing scheduler
decision, that we simply set the instance to an ERROR state.

Personally, I have only ever seen anti-affinity/affinity use cases in
relation to NFV deployments, and in every NFV deployment of OpenStack there
is a VNFM or MANO solution that is responsible for the orchestration of
instances belonging to various service function chains. I think it is
reasonable to expect the MANO system to be responsible for attempting a
re-launch of an instance that was set to ERROR due to a last-minute
affinity violation.

Operators, do you agree with the above?

I do not. My affinity and anti-affinity use cases reflect the need to build
large applications across failure domains in a datacenter.

Anti-affinity: Most anti-affinity use cases relate to the ability to
guarantee that instances are scheduled across failure domains, others
relate to security compliance.

Affinity: Hadoop/Big data deployments have affinity use cases, where nodes
processing data need to be in the same rack as the nodes which house the
data. This is a common setup for large hadoop deployers.

I recognize that large Ironic users expressed their concerns about
IPMI/BMC communication being unreliable and not wanting to have users
manually retry a baremetal instance launch. But, on this particular point,
I'm of the opinion that Nova should just do one thing and do it well. Nova isn't
an orchestrator, nor is it intending to be a "just continually try to get
me to this eventual state" system like Kubernetes.

Kubernetes is a larger orchestration platform that provides autoscale. I
don't expect Nova to provide autoscale, but

I agree that Nova should do one thing and do it really well, and in my mind
that thing is reliable provisioning of compute resources. Kubernetes does
autoscale among other things. I'm not asking for Nova to provide Autoscale,
I -AM- asking OpenStack's compute platform to provision a discrete compute
resource reliably. This means overcoming common and simple error cases. As
a deployer of OpenStack I'm trying to build a cloud that wraps the chaos of
infrastructure, and present a reliable facade. When my users issue a boot
request, I want to see it fulfilled. I don't expect it to be a 100%
guarantee across any possible failure, but I expect (and my users demand)
that my "Infrastructure as a service" API make reasonable accommodation to
overcome common failures.

If we removed Reschedule for class c) failures entirely, large Ironic
deployers would have to train users to manually retry a failed launch or
would need to write a simple retry mechanism into whatever client/UI that
they expose to their users.

Ironic operators, would the above decision force you to abandon Nova as
the multi-tenant BMaaS facility?

I just glanced at one of my production clusters and found there are around
7K users defined, many of whom use OpenStack on a daily basis. When they
issue a boot call, they expect that request to be honored. From their
perspective, if they call AWS, they get what they ask for. If you remove
reschedules, you're not just breaking the expectations of a single deployer,
but those of the thousands of engineers who rely on OpenStack every day to
manage their stacks.

I don't have an "I'll take my football and go home" mentality. But if you
remove the ability for the compute provisioning API to present a reliable
facade over infrastructure, I have to go write something else, or patch it
back in. Now it's even harder for me to get and stay current with OpenStack.

During the summit the agreement was, if I recall, that reschedules would
happen within a cell, and not between the parent and cell. That was
completely acceptable to me.

-James


responded May 22, 2017 by jpenick_at_gmail.com (860 points)  
0 votes

On 5/22/2017 12:54 PM, Jay Pipes wrote:
Hi Ops,

I need your feedback on a very important direction we would like to
pursue. I realize that there were Forum sessions about this topic at the
summit in Boston and that there were some decisions that were reached.

I'd like to revisit that decision and explain why I'd like your support
for getting rid of the automatic reschedule behaviour entirely in Nova
for Pike.

== The current situation and why it sucks ==

Nova currently attempts to "reschedule" instances when any of the
following events occur:

a) the "claim resources" process that occurs on the nova-compute worker
results in the chosen compute node exceeding its own capacity

b) in between the time a compute node was chosen by the scheduler,
another process launched an instance that would violate an affinity
constraint

c) an "unknown" exception occurs during the spawn process. In practice,
this really only is seen when the Ironic baremetal node that was chosen
by the scheduler turns out to be unreliable (IPMI issues, BMC failures,
etc) and wasn't able to launch the instance. [1]

The logic for handling these reschedules makes the Nova conductor,
scheduler and compute worker code very complex. With the new cellsv2
architecture in Nova, child cells are not able to communicate with the
Nova scheduler (and thus "ask for a reschedule").

To be clear, they are able to communicate, and do, as long as you
configure them to be able to do so. The long-term goal is that you don't
have to configure them to be able to do so, so we're trying to design
and work in that mode toward that goal.

We (the Nova team) would like to get rid of the automated rescheduling
behaviour that Nova currently exposes because we could eliminate a large
amount of complexity (which leads to bugs) from the already-complicated
dance of communication that occurs between internal Nova components.

== What we would like to do ==

With the move of the resource claim to the Nova scheduler [2], we can
entirely eliminate the a) class of Reschedule causes.

This leaves class b) and c) causes of Rescheduling.

For class b) causes, we should be able to solve this issue when the
placement service understands affinity/anti-affinity (maybe
Queens/Rocky). Until then, we propose that instead of raising a
Reschedule when an affinity constraint was last-minute violated due to a
racing scheduler decision, that we simply set the instance to an ERROR
state.

Personally, I have only ever seen anti-affinity/affinity use cases in
relation to NFV deployments, and in every NFV deployment of OpenStack
there is a VNFM or MANO solution that is responsible for the
orchestration of instances belonging to various service function chains.
I think it is reasonable to expect the MANO system to be responsible for
attempting a re-launch of an instance that was set to ERROR due to a
last-minute affinity violation.

Operators, do you agree with the above?

Finally, for class c) Reschedule causes, I do not believe that we should
be attempting automated rescheduling when "unknown" errors occur. I just
don't believe this is something Nova should be doing.

I recognize that large Ironic users expressed their concerns about
IPMI/BMC communication being unreliable and not wanting to have users
manually retry a baremetal instance launch. But, on this particular
point, I'm of the opinion that Nova should just do one thing and do it well.
Nova isn't an orchestrator, nor is it intending to be a "just
continually try to get me to this eventual state" system like Kubernetes.

If we removed Reschedule for class c) failures entirely, large Ironic
deployers would have to train users to manually retry a failed launch or
would need to write a simple retry mechanism into whatever client/UI
that they expose to their users.

Ironic operators, would the above decision force you to abandon Nova
as the multi-tenant BMaaS facility?

Thanks in advance for your consideration and feedback.

Best,
-jay

[1] This really does not occur with any frequency for hypervisor virt
drivers, since the exceptions those hypervisors throw are caught by the
nova-compute worker and handled without raising a Reschedule.

Are you sure about that?

https://github.com/openstack/nova/blob/931c3f48188e57e71aa6518d5253e1a5bd9a27c0/nova/compute/manager.py#L2041-L2049

The compute manager handles anything non-specific that leaks up from the
virt driver.spawn() method and reschedules it. Think
ProcessExecutionError when vif plugging fails in the libvirt driver
because the command blew up for some reason (sudo on the host is
wrong?). I'm not saying it should, as I'm guessing most of these types
of failures are due to misconfiguration, but it is how things currently
work today.

--

Thanks,

Matt


responded May 22, 2017 by mriedemos_at_gmail.c (15,720 points)   2 4 10
0 votes

On Mon, May 22, 2017 at 11:45:33AM -0700, James Penick wrote:
:On Mon, May 22, 2017 at 10:54 AM, Jay Pipes jaypipes@gmail.com wrote:
:
:> Hi Ops,
:>
:> Hi!
:
:
:>
:> For class b) causes, we should be able to solve this issue when the
:> placement service understands affinity/anti-affinity (maybe Queens/Rocky).
:> Until then, we propose that instead of raising a Reschedule when an
:> affinity constraint was last-minute violated due to a racing scheduler
:> decision, that we simply set the instance to an ERROR state.
:>
:> Personally, I have only ever seen anti-affinity/affinity use cases in
:> relation to NFV deployments, and in every NFV deployment of OpenStack there
:> is a VNFM or MANO solution that is responsible for the orchestration of
:> instances belonging to various service function chains. I think it is
:> reasonable to expect the MANO system to be responsible for attempting a
:> re-launch of an instance that was set to ERROR due to a last-minute
:> affinity violation.
:>
:
:
:> Operators, do you agree with the above?
:>
:
:I do not. My affinity and anti-affinity use cases reflect the need to build
:large applications across failure domains in a datacenter.
:
:Anti-affinity: Most anti-affinity use cases relate to the ability to
:guarantee that instances are scheduled across failure domains, others
:relate to security compliance.
:
:Affinity: Hadoop/Big data deployments have affinity use cases, where nodes
:processing data need to be in the same rack as the nodes which house the
:data. This is a common setup for large hadoop deployers.

James describes my use case as well.

I would also rather see a reschedule, if we're having a really bad day
and reach max retries then see ERR

-Jon


responded May 22, 2017 by jon_at_csail.mit.edu (4,720 points)   1 4 7
0 votes

To be clear, they are able to communicate, and do, as long as you
configure them to be able to do so. The long-term goal is that you don't
have to configure them to be able to do so, so we're trying to design
and work in that mode toward that goal.

No, the cell conductor doesn't have a way to communicate with the
scheduler. It's more than just a "it's not configured to" thing.

If you have multiple cells, then your conductors within a cell point to
the cell MQ as the default transport for all kinds of stuff. If they
call compute to do something, they don't target (and can't, since they don't
have the ability to look up the cell mapping); they just ask on their
default bus.

So, unless scheduler and compute are on the same bus, conductor can't
talk to both at the same time (for non-super conductor operations like
build that expect to target, but then they can't do the non-targeted
operations). If you do that, then you're not doing cellsv2.
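
(A toy illustration of that constraint, written directly against
oslo.messaging rather than Nova's RPC layer; the URLs and topics are
placeholders, not real deployment values.)

    import oslo_messaging as messaging
    from oslo_config import cfg

    # An RPC client is bound to exactly one transport. A cell conductor whose
    # default transport is the cell MQ has no path to a scheduler that only
    # listens on the API-level MQ.
    cell_transport = messaging.get_transport(
        cfg.CONF, url='rabbit://guest:guest@cell1-mq:5672/')
    api_transport = messaging.get_transport(
        cfg.CONF, url='rabbit://guest:guest@api-mq:5672/')

    # Default client on the cell bus: can reach nova-compute in this cell.
    compute_client = messaging.RPCClient(
        cell_transport, messaging.Target(topic='compute'))

    # Reaching the scheduler would require a client built on api_transport,
    # which the cell conductor isn't configured with.
    scheduler_client = messaging.RPCClient(
        api_transport, messaging.Target(topic='scheduler'))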

[1] This really does not occur with any frequency for hypervisor virt
drivers, since the exceptions those hypervisors throw are caught by
the nova-compute worker and handled without raising a Reschedule.

Are you sure about that?

https://github.com/openstack/nova/blob/931c3f48188e57e71aa6518d5253e1a5bd9a27c0/nova/compute/manager.py#L2041-L2049

Sure, the diaper exception is rescheduled currently. That should
basically only catch things like misconfiguration. Rescheduling
papers over those issues, which I don't like, but in the room it surely
seemed like operators thought that they still needed to be handled.

--Dan


responded May 22, 2017 by Dan_Smith (9,860 points)   1 2 4
0 votes

On 05/22/2017 02:45 PM, James Penick wrote:

I recognize that large Ironic users expressed their concerns about
IPMI/BMC communication being unreliable and not wanting to have
users manually retry a baremetal instance launch. But, on this
particular point, I'm of the opinion that Nova should just do one thing and
do it well. Nova isn't an orchestrator, nor is it intending to be a
"just continually try to get me to this eventual state" system like
Kubernetes.

Kubernetes is a larger orchestration platform that provides autoscale. I
don't expect Nova to provide autoscale, but

I agree that Nova should do one thing and do it really well, and in my
mind that thing is reliable provisioning of compute resources.
Kubernetes does autoscale among other things. I'm not asking for Nova to
provide Autoscale, I -AM- asking OpenStack's compute platform to
provision a discrete compute resource reliably. This means overcoming
common and simple error cases. As a deployer of OpenStack I'm trying to
build a cloud that wraps the chaos of infrastructure, and present a
reliable facade. When my users issue a boot request, I want to see it
fulfilled. I don't expect it to be a 100% guarantee across any possible
failure, but I expect (and my users demand) that my "Infrastructure as a
service" API make reasonable accommodation to overcome common failures.

Right, I think this hits my major queasiness about throwing the baby out with
the bathwater here. I feel like Nova's job is to give me a compute when
asked for computes. Yes, like malloc, things could fail. But honestly if
Nova can recover from that scenario, it should try to. The baremetal and
affinity cases are pretty good instances where Nova can catch and
recover, and not just export that complexity up.

It would make me sad to just export that complexity to users and, instead
of handling those cases internally, make every SDK, app, and simple script
build its own retry loop.

-Sean

--
Sean Dague
http://dague.net


responded May 22, 2017 by Sean_Dague (66,200 points)   4 8 14
0 votes

On 05/22/2017 02:45 PM, James Penick wrote:

During the summit the agreement was, if I recall, that reschedules would
happen within a cell, and not between the parent and cell. That was
completely acceptable to me.

Follow on question (just because the right folks are in this thread, and
it could impact paths forward). I know that some of the inability to
have upcalls in the system is based around firewalling that both Yahoo
and RAX did blocking the compute workers from communicating out.

If the compute worker or cell conductor wanted to make an HTTP call back
to nova-api (through the public interface), with the user context, is
that a network path that would or could be accessible in your case?

-Sean

--
Sean Dague
http://dague.net


responded May 22, 2017 by Sean_Dague (66,200 points)   4 8 14
0 votes

Hi Ops,

I need your feedback on a very important direction we would like to
pursue. I realize that there were Forum sessions about this topic at the
summit in Boston and that there were some decisions that were reached.

I'd like to revisit that decision and explain why I'd like your support
for getting rid of the automatic reschedule behaviour entirely in Nova
for Pike.

== The current situation and why it sucks ==

Nova currently attempts to "reschedule" instances when any of the
following events occur:

a) the "claim resources" process that occurs on the nova-compute worker
results in the chosen compute node exceeding its own capacity

b) in between the time a compute node was chosen by the scheduler,
another process launched an instance that would violate an affinity
constraint

c) an "unknown" exception occurs during the spawn process. In practice,
this really only is seen when the Ironic baremetal node that was chosen
by the scheduler turns out to be unreliable (IPMI issues, BMC failures,
etc) and wasn't able to launch the instance. [1]

The logic for handling these reschedules makes the Nova conductor,
scheduler and compute worker code very complex. With the new cellsv2
architecture in Nova, child cells are not able to communicate with the
Nova scheduler (and thus "ask for a reschedule").

We (the Nova team) would like to get rid of the automated rescheduling
behaviour that Nova currently exposes because we could eliminate a large
amount of complexity (which leads to bugs) from the already-complicated
dance of communication that occurs between internal Nova components.

== What we would like to do ==

With the move of the resource claim to the Nova scheduler [2], we can
entirely eliminate the a) class of Reschedule causes.

This leaves class b) and c) causes of Rescheduling.

For class b) causes, we should be able to solve this issue when the
placement service understands affinity/anti-affinity (maybe
Queens/Rocky). Until then, we propose that instead of raising a
Reschedule when an affinity constraint was last-minute violated due to a
racing scheduler decision, that we simply set the instance to an ERROR
state.

Personally, I have only ever seen anti-affinity/affinity use cases in
relation to NFV deployments, and in every NFV deployment of OpenStack
there is a VNFM or MANO solution that is responsible for the
orchestration of instances belonging to various service function chains.
I think it is reasonable to expect the MANO system to be responsible for
attempting a re-launch of an instance that was set to ERROR due to a
last-minute affinity violation.

Operators, do you agree with the above?

Finally, for class c) Reschedule causes, I do not believe that we should
be attempting automated rescheduling when "unknown" errors occur. I just
don't believe this is something Nova should be doing.

I recognize that large Ironic users expressed their concerns about
IPMI/BMC communication being unreliable and not wanting to have users
manually retry a baremetal instance launch. But, on this particular
point, I'm of the opinion that Nova should just do one thing and do it well.
Nova isn't an orchestrator, nor is it intending to be a "just
continually try to get me to this eventual state" system like Kubernetes.

If we removed Reschedule for class c) failures entirely, large Ironic
deployers would have to train users to manually retry a failed launch or
would need to write a simple retry mechanism into whatever client/UI
that they expose to their users.

Ironic operators, would the above decision force you to abandon Nova
as the multi-tenant BMaaS facility?

Thanks in advance for your consideration and feedback.

Best,
-jay

[1] This really does not occur with any frequency for hypervisor virt
drivers, since the exceptions those hypervisors throw are caught by the
nova-compute worker and handled without raising a Reschedule.

[2]
http://specs.openstack.org/openstack/nova-specs/specs/pike/approved/placement-claims.html


responded May 22, 2017 by Jay_Pipes (59,760 points)   3 11 14
0 votes

To be clear on my view of the whole proposal

most of my Rescheduling that I've seen and want are of type "A" where
claim exceeds resources. At least I think they are type "A" and not
"C" unknown.

The exact case is that I oversubscribe RAM (1.5x); my users typically
over-claim, so this is OK (my worst case is a hypervisor using only 10% of
claimed RAM). But there are some hotspots where proportional
utilization is high, so libvirt won't start more VMs because it really
doesn't have the memory.
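
(In made-up round numbers, the hotspot case looks like this; the figures
are illustrative only:)

    physical_ram_gb = 256
    ram_allocation_ratio = 1.5        # the oversubscription knob
    schedulable_gb = physical_ram_gb * ram_allocation_ratio   # 384 GB "on paper"

    claimed_gb = 300      # instances already placed on this hypervisor
    really_used_gb = 240  # a hotspot: these guests actually use their claims

    new_claim_gb = 64
    fits_for_scheduler = claimed_gb + new_claim_gb <= schedulable_gb    # True
    fits_in_reality = really_used_gb + new_claim_gb <= physical_ram_gb  # False

    # Today the failed spawn gets rescheduled to another host; without
    # reschedules it would just go to ERROR.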

If that's solved (or will be by the time reschedule goes away), the
cases I've actually experienced would be solved.

The anti-affinity use cases are currently the most important to me of the
affinity scheduling cases, and I haven't (to my knowledge) seen collisions in
that direction. So I could live with that race because for me it is
uncommon (though I imagine for others where positive affinity is
important the race may get lost more frequently).

-Jon

On Mon, May 22, 2017 at 03:00:09PM -0400, Jonathan Proulx wrote:
:On Mon, May 22, 2017 at 11:45:33AM -0700, James Penick wrote:
::On Mon, May 22, 2017 at 10:54 AM, Jay Pipes jaypipes@gmail.com wrote:
::
::> Hi Ops,
::>
::> Hi!
::
::
::>
::> For class b) causes, we should be able to solve this issue when the
::> placement service understands affinity/anti-affinity (maybe Queens/Rocky).
::> Until then, we propose that instead of raising a Reschedule when an
::> affinity constraint was last-minute violated due to a racing scheduler
::> decision, that we simply set the instance to an ERROR state.
::>
::> Personally, I have only ever seen anti-affinity/affinity use cases in
::> relation to NFV deployments, and in every NFV deployment of OpenStack there
::> is a VNFM or MANO solution that is responsible for the orchestration of
::> instances belonging to various service function chains. I think it is
::> reasonable to expect the MANO system to be responsible for attempting a
::> re-launch of an instance that was set to ERROR due to a last-minute
::> affinity violation.
::>
::
::
::> Operators, do you agree with the above?
::>
::
::I do not. My affinity and anti-affinity use cases reflect the need to build
::large applications across failure domains in a datacenter.
::
::Anti-affinity: Most anti-affinity use cases relate to the ability to
::guarantee that instances are scheduled across failure domains, others
::relate to security compliance.
::
::Affinity: Hadoop/Big data deployments have affinity use cases, where nodes
::processing data need to be in the same rack as the nodes which house the
::data. This is a common setup for large hadoop deployers.
:
:James describes my use case as well.
:
:I would also rather see a reschedule, if we're having a really bad day
:and reach max retries then see ERR
:
:-Jon

--


responded May 22, 2017 by jon_at_csail.mit.edu (4,720 points)   1 4 7
0 votes

On 05/22/2017 03:53 PM, Jonathan Proulx wrote:
To be clear on my view of the whole proposal

most of my Rescheduling that I've seen and want are of type "A" where
claim exceeds resources. At least I think they are type "A" and not
"C" unknown.

The exact case is that I oversubscribe RAM (1.5x); my users typically
over-claim, so this is OK (my worst case is a hypervisor using only 10% of
claimed RAM). But there are some hotspots where proportional
utilization is high, so libvirt won't start more VMs because it really
doesn't have the memory.

If that's solved (or will be by the time reschedule goes away), the
cases I've actually experienced would be solved.

The anti-affinity use cases are currently the most important to me of the
affinity scheduling cases, and I haven't (to my knowledge) seen collisions in
that direction. So I could live with that race because for me it is
uncommon (though I imagine for others where positive affinity is
important the race may get lost more frequently).

Thanks for the feedback, Jon.

For the record, affinity really doesn't have much of a race condition at
all. It's really only anti-affinity that has much of a chance of
last-minute violation.

Best,
-jay

On Mon, May 22, 2017 at 03:00:09PM -0400, Jonathan Proulx wrote:
:On Mon, May 22, 2017 at 11:45:33AM -0700, James Penick wrote:
::On Mon, May 22, 2017 at 10:54 AM, Jay Pipes jaypipes@gmail.com wrote:
::
::> Hi Ops,
::>
::> Hi!
::
::
::>
::> For class b) causes, we should be able to solve this issue when the
::> placement service understands affinity/anti-affinity (maybe Queens/Rocky).
::> Until then, we propose that instead of raising a Reschedule when an
::> affinity constraint was last-minute violated due to a racing scheduler
::> decision, that we simply set the instance to an ERROR state.
::>
::> Personally, I have only ever seen anti-affinity/affinity use cases in
::> relation to NFV deployments, and in every NFV deployment of OpenStack there
::> is a VNFM or MANO solution that is responsible for the orchestration of
::> instances belonging to various service function chains. I think it is
::> reasonable to expect the MANO system to be responsible for attempting a
::> re-launch of an instance that was set to ERROR due to a last-minute
::> affinity violation.
::>
::
::
::> Operators, do you agree with the above?
::>
::
::I do not. My affinity and anti-affinity use cases reflect the need to build
::large applications across failure domains in a datacenter.
::
::Anti-affinity: Most anti-affinity use cases relate to the ability to
::guarantee that instances are scheduled across failure domains, others
::relate to security compliance.
::
::Affinity: Hadoop/Big data deployments have affinity use cases, where nodes
::processing data need to be in the same rack as the nodes which house the
::data. This is a common setup for large hadoop deployers.
:
:James describes my use case as well.
:
:I would also rather see a reschedule, if we're having a really bad day
:and reach max retries then see ERR
:
:-Jon


responded May 22, 2017 by Jay_Pipes (59,760 points)   3 11 14
0 votes

That depends.
I differentiate between a compute worker running on a hypervisor, and one
running as a service in the control plane (like the compute worker in an
Ironic cluster).

A compute worker that is running on a hypervisor has highly restricted
network access. But if the compute worker is a service in the control
plane, such as it is with my Ironic installations, that's totally ok. It
really comes down to the fact that I don't want any real or logical network
access between an instance and the heart of the control plane.

I'll allow a child cell control plane to call a parent cell, just not a
hypervisor within the child cell.

On Mon, May 22, 2017 at 12:42 PM, Sean Dague sean@dague.net wrote:

On 05/22/2017 02:45 PM, James Penick wrote:

During the summit the agreement was, if I recall, that reschedules would
happen within a cell, and not between the parent and cell. That was
completely acceptable to me.

Follow on question (just because the right folks are in this thread, and
it could impact paths forward). I know that some of the inability to
have upcalls in the system is based around firewalling that both Yahoo
and RAX did blocking the compute workers from communicating out.

If the compute worker or cell conductor wanted to make an HTTP call back
to nova-api (through the public interface), with the user context, is
that a network path that would or could be accessible in your case?

    -Sean

--
Sean Dague
http://dague.net


responded May 22, 2017 by jpenick_at_gmail.com (860 points)  
...