
[openstack-dev] [nova] Interesting bug when unshelving an instance in an AZ and the AZ is gone


This is interesting from the user point of view:

https://bugs.launchpad.net/nova/+bug/1723880

  • The user creates an instance in a non-default AZ.
  • They shelve offload the instance.
  • The admin deletes the AZ that the instance was using, for whatever reason.
  • The user unshelves the instance which goes back through scheduling and
    fails with NoValidHost because the AZ on the original request spec no
    longer exists.
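
To make the failure concrete, here is a minimal toy model of the steps above. It is not Nova code: Host, RequestSpec, select_host and the AZ names are made up for illustration, and only the NoValidHost name comes from the bug report.

```python
from dataclasses import dataclass
from typing import List, Optional


class NoValidHost(Exception):
    """Toy stand-in for the scheduler error named in the bug."""


@dataclass
class Host:
    name: str
    availability_zone: str


@dataclass
class RequestSpec:
    # The AZ the user asked for at create time; None means no preference.
    availability_zone: Optional[str] = None


def select_host(hosts: List[Host], spec: RequestSpec) -> Host:
    """Return the first host whose AZ matches the request spec, if one is set."""
    candidates = [
        h for h in hosts
        if spec.availability_zone is None
        or h.availability_zone == spec.availability_zone
    ]
    if not candidates:
        raise NoValidHost(f"no host in AZ {spec.availability_zone!r}")
    return candidates[0]


# The instance was created in "az-east" and then shelve offloaded.
spec = RequestSpec(availability_zone="az-east")

# Later the admin removes "az-east"; only hosts in "az-west" remain.
hosts = [Host("compute1", "az-west")]

# Unshelve goes back through scheduling with the original request spec.
try:
    select_host(hosts, spec)
    outcome = "scheduled"
except NoValidHost:
    outcome = "NoValidHost"

print(outcome)
```

The same hosts would accept a request spec with no AZ at all, which is why the options discussed below all revolve around what happens to the stored AZ.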

Now the question is what, if anything, do we do about this bug? Some notes:

  1. How reasonable is it for a user to expect in a stable production
    environment that AZs are going to be deleted from under them? We
    actually have a spec related to this but with AZ renames:

https://review.openstack.org/#/c/446446/

  2. Should we null out the instance.availability_zone when it's shelved
    offloaded like we do for the instance.host and instance.node attributes?
    Similarly, we would not take into account the
    RequestSpec.availability_zone when scheduling during unshelve. I tend to
    prefer this option because once you shelve offload an instance, it's
    no longer associated with a host and therefore no longer associated with
    an AZ. However, is it reasonable to assume that the user doesn't care
    that the instance, once unshelved, is no longer in the originally
    requested AZ? Probably not a safe assumption.

  3. When a user unshelves, they can't propose a new AZ (and I don't think
    we want to add that capability to the unshelve API). So if the original
    AZ is gone, should we automatically remove the
    RequestSpec.availability_zone when scheduling? I tend to not like this
    as it's very implicit and the user could see the AZ on their instance
    change before and after unshelve and be confused.

  4. We could simply do nothing about this specific bug and assert the
    behavior is correct. The user requested an instance in a specific AZ,
    shelved that instance and when they wanted to unshelve it, it's no
    longer available so it fails. The user would have to delete the instance
    and create a new instance from the shelve snapshot image in a new AZ. If
    we implemented Sylvain's spec in #1 above, maybe we don't have this
    problem going forward since you couldn't remove/delete an AZ when there
    are even shelved offloaded instances still tied to it.
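
The choices in the list above mostly differ in which AZ the scheduler would see at unshelve time. A rough sketch of that difference follows; the function names are made up and nothing here claims to be how Nova would implement any of them.

```python
from typing import Optional, Set


def az_cleared_at_shelve_offload(requested_az: Optional[str]) -> Optional[str]:
    """Forget the AZ whenever the instance is shelved offloaded."""
    return None


def az_dropped_only_if_gone(requested_az: Optional[str],
                            existing_azs: Set[str]) -> Optional[str]:
    """Keep the requested AZ unless it no longer exists at unshelve time."""
    return requested_az if requested_az in existing_azs else None


def az_left_alone(requested_az: Optional[str]) -> Optional[str]:
    """Do nothing: the scheduler still sees the original AZ and may fail."""
    return requested_az


existing = {"az-west"}  # "az-east" has been deleted by the admin
for requested in ("az-west", "az-east"):
    print(requested,
          az_cleared_at_shelve_offload(requested),
          az_dropped_only_if_gone(requested, existing),
          az_left_alone(requested))
```

For an AZ that still exists, only the first policy loses the user's choice. For a deleted AZ, the first two policies silently let the instance land anywhere, and the third keeps the request intact at the cost of a scheduling failure.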

Other options?

--

Thanks,

Matt


OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
asked Oct 22, 2017 in openstack-dev by mriedemos_at_gmail.c (15,720 points)   2 4 5

4 Responses


[not having a dog in this hunt, this is what I would expect as a cloud consumer]

On Mon, Oct 16, 2017 at 10:22 AM, Matt Riedemann mriedemos@gmail.com wrote:
- The user creates an instance in a non-default AZ.
- They shelve offload the instance.
- The admin deletes the AZ that the instance was using, for whatever reason.
- The user unshelves the instance which goes back through scheduling and
fails with NoValidHost because the AZ on the original request spec no longer
exists.

  1. How reasonable is it for a user to expect in a stable production
    environment that AZs are going to be deleted from under them? We actually
    have a spec related to this but with AZ renames:

Change happens...

  2. Should we null out the instance.availability_zone when it's shelved
    offloaded like we do for the instance.host and instance.node attributes?
    Similarly, we would not take into account the RequestSpec.availability_zone
    when scheduling during unshelve. I tend to prefer this option because once
    you shelve offload an instance, it's no longer associated with a host and
    therefore no longer associated with an AZ. However, is it reasonable to
    assume that the user doesn't care that the instance, once unshelved, is no
    longer in the originally requested AZ? Probably not a safe assumption.

Agreed, unless we keep track that the user specified a default or no
AZ at create.

I think nulling the AZ when the original doesn't exist would be
reasonable from a user standpoint, but I'd feel handcuffed if that
happens and I can not select a new AZ. Or throwing a specific error
and letting the user handle it in #3 below:

  3. When a user unshelves, they can't propose a new AZ (and I don't think we
    want to add that capability to the unshelve API). So if the original AZ is

Here is my question... if I can specify an AZ on create, why not on
unshelve? Is it the image location movement under the hood?

gone, should we automatically remove the RequestSpec.availability_zone when
scheduling? I tend to not like this as it's very implicit and the user could
see the AZ on their instance change before and after unshelve and be
confused.

Agreed that explicit is better than implicit.

  4. We could simply do nothing about this specific bug and assert the
    behavior is correct. The user requested an instance in a specific AZ,
    shelved that instance and when they wanted to unshelve it, it's no longer
    available so it fails. The user would have to delete the instance and create
    a new instance from the shelve snapshot image in a new AZ. If we implemented

I do not have the list of things in my head that are preserved in
shelve/unshelve that would be lost in a recreate, but that's where my
worry would come. Presumably that is why I shelved in the first place
rather than snapshotting the server and removing it. Depends on the
cost models too, if I lose my grandfathered-in pricing by being forced
to recreate I may be unhappy.

Sylvain's spec in #1 above, maybe we don't have this problem going forward
since you couldn't remove/delete an AZ when there are even shelved offloaded
instances still tied to it.

As a user I probably do not mind this, as an operator I'd likely be unhappy.

dt

--

Dean Troyer
dtroyer@gmail.com


responded Oct 16, 2017 by Dean_Troyer (13,100 points)   1 3 3

On 10/16/2017 11:00 AM, Dean Troyer wrote:
[not having a dog in this hunt, this is what I would expect as a cloud consumer]

Thanks for the user perspective, that's what I'm looking for here, and
operator perspective of course.

On Mon, Oct 16, 2017 at 10:22 AM, Matt Riedemann mriedemos@gmail.com wrote:

  • The user creates an instance in a non-default AZ.
  • They shelve offload the instance.
  • The admin deletes the AZ that the instance was using, for whatever reason.
  • The user unshelves the instance which goes back through scheduling and
    fails with NoValidHost because the AZ on the original request spec no longer
    exists.
  1. How reasonable is it for a user to expect in a stable production
    environment that AZs are going to be deleted from under them? We actually
    have a spec related to this but with AZ renames:

Change happens...

  2. Should we null out the instance.availability_zone when it's shelved
    offloaded like we do for the instance.host and instance.node attributes?
    Similarly, we would not take into account the RequestSpec.availability_zone
    when scheduling during unshelve. I tend to prefer this option because once
    you shelve offload an instance, it's no longer associated with a host and
    therefore no longer associated with an AZ. However, is it reasonable to
    assume that the user doesn't care that the instance, once unshelved, is no
    longer in the originally requested AZ? Probably not a safe assumption.

Agreed, unless we keep track that the user specified a default or no
AZ at create.

We do keep track of what the user originally requested, that is this
RequestSpec object thing I keep referring to.

I think nulling the AZ when the original doesn't exist would be
reasonable from a user standpoint, but I'd feel handcuffed if that
happens and I can not select a new AZ. Or throwing a specific error
and letting the user handle it in #3 below:

At the point of failure, the API has done an RPC cast and returned a 202
to the user, so the only way to provide a message like this to the user
would be to check if the original AZ still exists in the API. We could
do that, it would just be something to be aware of.
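
A sketch of what such an API-side check could look like, assuming the API can cheaply get the set of existing AZ names. The unshelve function, the RequestedAZNotFound error and the rpc_cast callable are all hypothetical; only the 202 and the cast come from the description above.

```python
from typing import Callable, Optional, Set


class RequestedAZNotFound(Exception):
    """Hypothetical error the API could turn into a 4xx response."""


def unshelve(instance_uuid: str,
             requested_az: Optional[str],
             existing_azs: Set[str],
             rpc_cast: Callable[[str, str], None]) -> int:
    """Check the AZ synchronously, then hand the real work off asynchronously."""
    if requested_az is not None and requested_az not in existing_azs:
        # Fail here, while the user is still waiting on the HTTP response.
        raise RequestedAZNotFound(
            f"availability zone {requested_az!r} no longer exists")
    rpc_cast("unshelve_instance", instance_uuid)
    return 202  # accepted; a later scheduling failure cannot change this


sent = []


def record(method: str, uuid: str) -> None:
    """Stand-in for the RPC layer: just remember what was cast."""
    sent.append((method, uuid))


status = unshelve("inst-1", "az-west", {"az-west"}, record)

try:
    unshelve("inst-2", "az-east", {"az-west"}, record)
    error = None
except RequestedAZNotFound as exc:
    error = str(exc)
```

A check like this is inherently racy, since the AZ could still disappear between the API check and scheduling, so it would improve the common case rather than remove the NoValidHost path.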

  3. When a user unshelves, they can't propose a new AZ (and I don't think we
    want to add that capability to the unshelve API). So if the original AZ is

Here is my question... if I can specify an AZ on create, why not on
unshelve? Is it the image location movement under the hood?

I just don't think it's ever come up. The reason I hesitate to add the
ability to the unshelve API is more or less rooted in my bias toward not
liking shelve/unshelve in general because of how complicated and
half-baked it is (we've had a lot of bugs from these APIs, some of which
are still unresolved). That's not the user's fault though, so one could
argue that if we're not going to deprecate these APIs, we need to make
them more robust. We, as developers, also don't have any idea how many
users are actually using the shelve API, so it's hard to know if we
should spend any time on improving it.

gone, should we automatically remove the RequestSpec.availability_zone when
scheduling? I tend to not like this as it's very implicit and the user could
see the AZ on their instance change before and after unshelve and be
confused.

Agreed that explicit is better than implicit.

  4. We could simply do nothing about this specific bug and assert the
    behavior is correct. The user requested an instance in a specific AZ,
    shelved that instance and when they wanted to unshelve it, it's no longer
    available so it fails. The user would have to delete the instance and create
    a new instance from the shelve snapshot image in a new AZ. If we implemented

I do not have the list of things in my head that are preserved in
shelve/unshelve that would be lost in a recreate, but that's where my
worry would come. Presumably that is why I shelved in the first place
rather than snapshotting the server and removing it. Depends on the
cost models too, if I lose my grandfathered-in pricing by being forced
to recreate I may be unhappy.

The volumes and ports remain attached to the shelved instance, only the
guest on the hypervisor is destroyed. It doesn't change anything about
quota - you retain quota usage for a shelved instance so you have room
in your quota to unshelve it later.

From what I can tell, the os-simple-tenant-usage API will still count
the instance and its consumed disk/ram/cpu against you even though the
guest is deleted from the hypervisor while the instance is shelved
offloaded. So the operator is happy about shelved offloaded instances
because that means they have more free capacity for new instances and
moving things, but the user is still getting charged the same, if your
billing model is based on os-simple-tenant-usage (which Telemetry uses I
believe).

Sylvain's spec in #1 above, maybe we don't have this problem going forward
since you couldn't remove/delete an AZ when there are even shelved offloaded
instances still tied to it.

As a user I probably do not mind this, as an operator I'd likely be unhappy.

dt

--

Thanks,

Matt


responded Oct 16, 2017 by mriedemos_at_gmail.c (15,720 points)   2 4 5

On 10/16/2017 09:22 AM, Matt Riedemann wrote:

  2. Should we null out the instance.availability_zone when it's shelved offloaded
    like we do for the instance.host and instance.node attributes? Similarly, we
    would not take into account the RequestSpec.availability_zone when scheduling
    during unshelve. I tend to prefer this option because once you shelve offload
    an instance, it's no longer associated with a host and therefore no longer
    associated with an AZ.

This statement isn't true in the case where the user specifically requested a
non-default AZ at boot time.

However, is it reasonable to assume that the user doesn't
care that the instance, once unshelved, is no longer in the originally requested
AZ? Probably not a safe assumption.

If they didn't request a non-default AZ then I think we could remove it.

  3. When a user unshelves, they can't propose a new AZ (and I don't think we want
    to add that capability to the unshelve API). So if the original AZ is gone,
    should we automatically remove the RequestSpec.availability_zone when
    scheduling? I tend to not like this as it's very implicit and the user could see
    the AZ on their instance change before and after unshelve and be confused.

I think allowing the user to specify an AZ on unshelve might be a reasonable
option. Or maybe just allow modifying the AZ of a shelved instance without
unshelving it via a PUT on /servers/{server_id}.

  4. We could simply do nothing about this specific bug and assert the behavior is
    correct. The user requested an instance in a specific AZ, shelved that instance
    and when they wanted to unshelve it, it's no longer available so it fails. The
    user would have to delete the instance and create a new instance from the shelve
    snapshot image in a new AZ.

I'm inclined to feel that this is operator error. If they want to delete an AZ
that has shelved instances then they should talk with their customers and update
the stored AZ in the DB to a new "valid" one. (Though currently this would
require manual DB operations.)

If we implemented Sylvain's spec in #1 above, maybe
we don't have this problem going forward since you couldn't remove/delete an AZ
when there are even shelved offloaded instances still tied to it.

I kind of think it would be okay to disallow deleting AZs with shelved instances
in them.
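
As a rough illustration of that guard, assuming the AZ string stays on the instance while it is shelved offloaded. The Instance class and instances_blocking_az_delete are made up for this sketch and are not Nova objects.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    uuid: str
    vm_state: str
    host: Optional[str]               # None once shelved offloaded
    availability_zone: Optional[str]  # still recorded while shelved


def instances_blocking_az_delete(az: str,
                                 instances: List[Instance]) -> List[str]:
    """Return UUIDs of instances still tied to the AZ, hosted or not."""
    return [i.uuid for i in instances if i.availability_zone == az]


instances = [
    Instance("inst-1", "active", "compute1", "az-west"),
    Instance("inst-2", "shelved_offloaded", None, "az-east"),
]

blockers = instances_blocking_az_delete("az-east", instances)
if blockers:
    print("refusing to delete az-east, still used by:", blockers)
```

The check has to key on the stored AZ string rather than on hosts, because a shelved offloaded instance has no host and would otherwise be missed.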

Chris


responded Oct 16, 2017 by Chris_Friesen (20,420 points)   3 15 24

On 10/16/2017 11:22 AM, Matt Riedemann wrote:
This is interesting from the user point of view:

https://bugs.launchpad.net/nova/+bug/1723880

  • The user creates an instance in a non-default AZ.
  • They shelve offload the instance.
  • The admin deletes the AZ that the instance was using, for whatever
    reason.
  • The user unshelves the instance which goes back through scheduling and
    fails with NoValidHost because the AZ on the original request spec no
    longer exists.

Now the question is what, if anything, do we do about this bug? Some notes:

  1. How reasonable is it for a user to expect in a stable production
    environment that AZs are going to be deleted from under them? We
    actually have a spec related to this but with AZ renames:

https://review.openstack.org/#/c/446446/

I don't think it's reasonable for a user to expect an AZ suddenly gets
deleted from under them, no.

That said, I think it's reasonable for operators to want to rename an
AZ. And because AZs in Nova aren't really things [1], attempting to
change the name of an AZ involves a bunch of nasty DB updates (including
shadow tables). [2]

  2. Should we null out the instance.availability_zone when it's shelved
    offloaded like we do for the instance.host and instance.node attributes?
    Similarly, we would not take into account the
    RequestSpec.availability_zone when scheduling during unshelve. I tend to
    prefer this option because once you shelve offload an instance, it's
    no longer associated with a host and therefore no longer associated with
    an AZ. However, is it reasonable to assume that the user doesn't care
    that the instance, once unshelved, is no longer in the originally
    requested AZ? Probably not a safe assumption.

Yeah, I don't think this is appropriate.

  3. When a user unshelves, they can't propose a new AZ (and I don't think
    we want to add that capability to the unshelve API). So if the original
    AZ is gone, should we automatically remove the
    RequestSpec.availability_zone when scheduling? I tend to not like this
    as it's very implicit and the user could see the AZ on their instance
    change before and after unshelve and be confused.

I don't think this is something we should add to the public API (for
reasons Matt stated in a followup email to Dean). Instead, I think the
"rename AZ" functionality should do the needful DB-related tasks to
change the instance.availability_zone for shelved instances to the new
AZ name...

  4. We could simply do nothing about this specific bug and assert the
    behavior is correct. The user requested an instance in a specific AZ,
    shelved that instance and when they wanted to unshelve it, it's no
    longer available so it fails. The user would have to delete the instance
    and create a new instance from the shelve snapshot image in a new AZ. If
    we implemented Sylvain's spec in #1 above, maybe we don't have this
    problem going forward since you couldn't remove/delete an AZ when there
    are even shelved offloaded instances still tied to it.

I think it's reasonable to prevent deletion of an AZ (whatever that
actually means... see [1]) when the AZ "has instances in it" (whatever
that means... see [1])

Best,
-jay

Other options?

[1] AZs in Nova are just metadata key/values on aggregates and string
values in the instance.availability_zone DB table field that have no FK
relationship to said metadata key/values

[2] Note that, as I've said before, the entire concept of an
availability zone in Nova/Cinder/Neutron is completely fictional and
improperly pretending to be an AWS EC2 availability zone. AZs in Nova
pretend to be failure domains. They are not anything of the sort.
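
For readers who have not looked at how AZs are stored, footnote [1] can be pictured roughly like this. The dictionaries below are a simplified stand-in for aggregate metadata and the instance table, not the real schema, and rename_az and dangling are hypothetical helpers.

```python
aggregates = [
    {"name": "agg1", "hosts": ["compute1"],
     "metadata": {"availability_zone": "az-east"}},
]
instances = [
    # Shelved offloaded: no host, but the AZ string is still recorded.
    {"uuid": "inst-1", "host": None, "availability_zone": "az-east"},
]


def existing_azs(aggregates):
    """AZs 'exist' only as metadata values on aggregates."""
    return {a["metadata"]["availability_zone"]
            for a in aggregates if "availability_zone" in a["metadata"]}


def rename_az(aggregates, instances, old, new, update_instances):
    """Rename an AZ; nothing forces the instance strings to follow."""
    for agg in aggregates:
        if agg["metadata"].get("availability_zone") == old:
            agg["metadata"]["availability_zone"] = new
    if update_instances:
        for inst in instances:
            if inst["availability_zone"] == old:
                inst["availability_zone"] = new


def dangling(aggregates, instances):
    """UUIDs of instances whose AZ string matches no aggregate."""
    azs = existing_azs(aggregates)
    return [i["uuid"] for i in instances
            if i["availability_zone"] is not None
            and i["availability_zone"] not in azs]


# Rename only the aggregate metadata: inst-1 still says "az-east".
rename_az(aggregates, instances, "az-east", "az-1", update_instances=False)
stale = dangling(aggregates, instances)

# A rename that also rewrites the instance strings leaves nothing dangling.
rename_az(aggregates, instances, "az-east", "az-1", update_instances=True)
fixed = dangling(aggregates, instances)
```

With no foreign key between the two, the first rename leaves inst-1 pointing at an AZ that no longer exists, which is the same state the bug describes after a delete.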


responded Oct 22, 2017 by Jay_Pipes (59,760 points)   3 10 13
...