
[openstack-dev] [neutron] writable mtu


Heya,

we have https://bugs.launchpad.net/neutron/+bug/1671634 approved for
Pike, which allows setting the MTU for a network on creation (but not on
update, as per the latest comment from Kevin there). I already see a use
case for modifying the MTU of an existing network (for example, where you
enable jumbo frames in the underlying infrastructure and want to raise the
ceiling; another special case is migrating between encapsulation
technologies, as in the ml2/ovs to networking-ovn migration, where the
latter supports only Geneve, not VXLAN).

If I go and implement the RFE as-is, and later in Queens we pursue
updating MTU for existing networks, we will have three extensions for
the same thing.

  • net-mtu (existing read only attribute)
  • net-mtu-enhanced (allow write on create)
  • net-mtu-enhanced-enhanced (allow updates)

Not to mention the potential addition of a per-port MTU that some folks
keep asking for (and that we keep pushing back against so far).

So, I wonder if we can instead lay the groundwork for an updatable MTU right
away, with allow_post: True from the start, even while implementing
create-only as phase 1. Then we can revisit the decision if needed
without touching the API. What do you think?
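
For concreteness, a minimal sketch of what the extended attribute map could
look like under this proposal, loosely modeled on the existing net_mtu
extension; the validator, default and constant names here are assumptions,
not an agreed definition:

    # Sketch only -- loosely modeled on the net_mtu extension's attribute map.
    MTU = 'mtu'

    EXTENDED_ATTRIBUTES_2_0 = {
        'networks': {
            MTU: {
                'allow_post': True,   # writable on network create (phase 1)
                'allow_put': True,    # laid out now so later updates need no API change
                'validate': {'type:non_negative': None},
                'default': 0,         # assumption: 0 == let the plugin pick the MTU
                'is_visible': True,
            },
        },
    }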

Another related question: how do we expose both the old and the new
extensions at the same time? I would imagine that implementations capable
of writing to the mtu attribute would advertise both the old and the new
extensions. Is that correct? Does the neutron API layer allow for
overlapping attribute maps?

Ihar


asked Jul 7, 2017 in openstack-dev by Ihar_Hrachyshka

5 Responses


On 5 July 2017 at 14:14, Ihar Hrachyshka ihrachys@redhat.com wrote:

Heya,

we have https://bugs.launchpad.net/neutron/+bug/1671634 approved for
Pike that allows setting MTU for network on creation.

This was actually in the very first MTU spec (in case no one looked),
though it never got implemented. The spec details a whole bunch of stuff
about how to calculate whether the proposed MTU will fit within the encap,
incidentally, and will reject network creations when it doesn't.

Note that the MTU attribute was intended to represent an MTU that will
definitely transit. I guess no-one would actually rely on this, but to
clarify, it's not intended to indicate that bigger packets will be dropped,
only that smaller packets will not be dropped (which is the guarantee you
need for two VMs to talk to each other). Thus the MTU doesn't need to be
increased just because the infrastructure MTU has become larger; it just
means that future networks can be created with larger MTUs from this point
on, and the current MTU will still be valid.

This is also the MTU that all VMs on that network will be told, because
they need to use the same value to function. If you change it, VMs after
the event will have problems talking to their earlier friends because they
will now disagree on MTU (and routers will have problems talking to at
least one of those sets).

(but not update,
as per latest comment from Kevin there) I already see a use case to
modify MTU for an existing network (for example, where you enable
Jumbo frames for underlying infrastructure, and want to raise the
ceiling; another special case is when you migrate between different
encapsulation technologies, like in case of ml2/ovs to networking-ovn
migration where the latter doesn't support VXLAN but Geneve only).

It looks like you're changing the read-only segmentation type of the
network in this migration - presumably in the DB directly - so you're
changing non-writeable fields already. Couldn't the MTU be changed in a
similarly offline manner?

That said: what will you do with existing VMs that have been told the MTU
of their network already?

Put a different way, a change to the infrastructure can affect MTUs in two
ways:

  • I increase the MTU that a network can pass (by, for instance, increasing
    the infrastructure MTU under the encap). I don't need to change its MTU because
    VMs that run on it will continue to work. I have no means to tell the VMs
    they have a bigger MTU now, and whatever method I might use needs to be
    100% certain to work or left-out VMs will become impossible to talk to, so
    leaving the MTU alone is sane.
  • I decrease the MTU that a network can pass (by, for instance, using an
    encap with larger headers). The network comprehensively breaks; VMs
    frequently fail to communicate regardless of whether I change the network
    MTU property, because running VMs have already learned their MTU value and,
    again, there's no way to update their idea of what it is reliably.
    Basically, this is not a migration that can be done with running VMs.

If I go and implement the RFE as-is, and later in Queens we pursue
updating MTU for existing networks, we will have three extensions for
the same thing.

  • net-mtu (existing read only attribute)
  • net-mtu-enhanced (allow write on create)
  • net-mtu-enhanced-enhanced (allow updates)

Not to mention potential addition of per-port MTU that some folks keep
asking for (and we keep pushing against so far).

So, I wonder if we can instead lay the ground for updatable MTU right
away, and allow_post: True from the start, even while implementing
create only as a phase-1. Then we can revisit the decision if needed
without touching api. What do you think?

It's trivially detectable that an MTU value can't be set at all, or can be
set initially but not changed. Could we use that approach? That way, we
don't need multiple extensions; the current one is sufficient (and, on the
assumption that you don't rely on 'read-only attribute' errors in normal
code, I think we can call this backward compatible).

Another related question is, how do we expose both old and new
extensions at the same time? I would imagine that implementations
capable of writing to the mtu attribute would advertise both old and
new extensions. Is it correct? Does neutron api layer allow for
overlapping attribute maps?

  • Extension net-mtu: the mtu attr exists but can't be set at all; passing
    an MTU returns a bad-argument error.
  • Extension net-mtu: the mtu attr exists and can be set on creation; a
    failed (too big) MTU value returns a more specific "MTU too big" error.
  • Extension net-mtu: the mtu attr exists and can be set after creation; a
    failed update is rejected in the same way as a failed create-time write
    (which it appears you already have in mind).
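
As a rough illustration of that error-driven detection, a client could
probe which of the three behaviours a deployment implements along these
lines (sketch only; it assumes openstacksdk, a hypothetical cloud name,
and that the failures surface as BadRequest-style errors):

    # Sketch: probe MTU writability purely by reacting to API errors.
    import openstack
    from openstack import exceptions

    conn = openstack.connect(cloud='mycloud')  # hypothetical cloud name

    def mtu_write_support(conn):
        try:
            net = conn.network.create_network(name='mtu-probe', mtu=1400)
        except exceptions.BadRequestException:
            return 'read-only'       # behaviour 1: MTU can't be set at all
        try:
            conn.network.update_network(net, mtu=1400)
            return 'writable'        # behaviour 3: updatable after creation
        except exceptions.BadRequestException:
            return 'create-only'     # behaviour 2: settable on create only
        finally:
            conn.network.delete_network(net)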

--
Ian.


responded Jul 6, 2017 by Ian_Wells

OK, so I should read before writing...

On 5 July 2017 at 18:11, Ian Wells ijw.ubuntu@cack.org.uk wrote:

On 5 July 2017 at 14:14, Ihar Hrachyshka ihrachys@redhat.com wrote:

Heya,

we have https://bugs.launchpad.net/neutron/+bug/1671634 approved for
Pike that allows setting MTU for network on creation.

This was actually in the very first MTU spec (in case no one looked),
though it never got implemented. The spec details a whole bunch of stuff
about how to calculate whether the proposed MTU will fit within the encap,
incidentally, and will reject network creations when it doesn't.

OK, even referenced in the bug, so apologies, we're all good.

So, I wonder if we can instead lay the ground for updatable MTU right
away, and allow_post: True from the start, even while implementing
create only as a phase-1. Then we can revisit the decision if needed
without touching api. What do you think?

I think I misinterpreted: you'd enable all options and then deal with the
consequences in the backend code, which has to implement one of the
previously listed behaviours? That seems sane to me, provided the required
behaviours are documented somewhere where a driver implementer has to trip
over them.
--
Ian.


responded Jul 6, 2017 by Ian_Wells

On Wed, Jul 5, 2017 at 6:11 PM, Ian Wells ijw.ubuntu@cack.org.uk wrote:
On 5 July 2017 at 14:14, Ihar Hrachyshka ihrachys@redhat.com wrote:

Heya,

we have https://bugs.launchpad.net/neutron/+bug/1671634 approved for
Pike that allows setting MTU for network on creation.

This was actually in the very first MTU spec (in case no one looked), though
it never got implemented. The spec details a whole bunch of stuff about how
to calculate whether the proposed MTU will fit within the encap,
incidentally, and will reject network creations when it doesn't.

Note that the MTU attribute was intended to represent an MTU that will
definitely transit. I guess no-one would actually rely on this, but to
clarify, it's not intended to indicate that bigger packets will be dropped,
only that smaller packets will not be dropped (which is the guarantee you
need for two VMs to talk to each other). Thus the MTU doesn't need to be
increased just because the infrastructure MTU has become larger; it just
means that future networks can be created with larger MTUs from this point
on, and the current MTU will still be valid.

This is also the MTU that all VMs on that network will be told, because they
need to use the same value to function. If you change it, VMs after the
event will have problems talking to their earlier friends because they will
now disagree on MTU (and routers will have problems talking to at least one
of those sets).

(but not update,
as per latest comment from Kevin there) I already see a use case to
modify MTU for an existing network (for example, where you enable
Jumbo frames for underlying infrastructure, and want to raise the
ceiling; another special case is when you migrate between different
encapsulation technologies, like in case of ml2/ovs to networking-ovn
migration where the latter doesn't support VXLAN but Geneve only).

It looks like you're changing the read-only segmentation type of the network
on this migration - presumably in the DB directly - so you're changing
non-writeable fields already. Couldn't the MTU be changed in a similarly
offline manner?

Yeah, you are correct, but we may also hack around it in
networking-ovn by pretending all tunneled networks are actually Geneve
despite the type in the database. (I understand that's rather hackish, but
the very idea of migrating to a driver that doesn't natively support
your tunnel type is hackish af.)

Nevertheless, the case where operators want to increase MTU for
existing networks after infrastructure MTU upgrade still stands.

That said: what will you do with existing VMs that have been told the MTU of
their network already?

Same as we do right now when modifying the configuration options that define
the underlying MTU: change it at the API layer, update the data path with
the new value (tap to brq to router/dhcp legs) and hope the instances will
get there too (by means of a dhcp lease refresh eventually happening,
rebooting instances, or similar). There is no silver bullet here; we have
no way to tell instances to update their interface MTUs.

At least not until we get both a new ovs and a virtio-net in the guests
that know how to deal with MTU hints:
https://bugzilla.redhat.com/show_bug.cgi?id=1408701
https://bugzilla.redhat.com/show_bug.cgi?id=1366919
(there should also be an ovs integration piece, but I can't find it right away.)

Though even with that, I don't know whether the guest will be notified about
changes happening during its execution, or only on boot (that probably
depends on whether virtio polls the MTU storage). And anyway, it depends on
the guest kernel, so no luck for Windows guests and such.

Put a different way, a change to the infrastructure can affect MTUs in two
ways:

  • I increase the MTU that a network can pass (by, for instance, increasing
    the infrastructure of the encap). I don't need to change its MTU because
    VMs that run on it will continue to work. I have no means to tell the VMs
    they have a bigger MTU now, and whatever method I might use needs to be 100%
    certain to work or left-out VMs will become impossible to talk to, so
    leaving the MTU alone is sane.

In this scenario, it sounds like you assume everything will work just
fine. But you don't consider neutron routers, which will enforce the new,
larger MTU for fragmentation and may end up sending frames to unaware VMs
of a size they can't handle.

  • I decrease the MTU that a network can pass (by, for instance, using an
    encap with larger headers). The network comprehensively breaks; VMs
    frequently fail to communicate regardless of whether I change the network
    MTU property, because running VMs have already learned their MTU value and,
    again, there's no way to update their idea of what it is reliably.
    Basically, this is not a migration that can be done with running VMs.

Yeah. You may need to do some multi-step dance (rough sketch below), like:

  • before the MTU reduction, lower dhcp_lease_duration to 3 mins;
  • wait until all leases are refreshed;
  • lower the MTU on the network;
  • wait 3 minutes until all instances refresh their leases and update their MTUs;
  • restore the original value of dhcp_lease_duration.
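
A rough sketch of that dance with openstacksdk (it assumes the operator has
already lowered dhcp_lease_duration in neutron.conf and restarted the
relevant services, that the MTU update this thread is discussing is
available, and that 'mycloud' and the network name are placeholders):

    import time
    import openstack

    LEASE_DURATION = 180  # the temporarily lowered dhcp_lease_duration, in seconds

    conn = openstack.connect(cloud='mycloud')

    # 1. wait one full (shortened) lease period so every instance has refreshed
    time.sleep(LEASE_DURATION)

    # 2. lower the MTU on the network (the update this thread proposes)
    net = conn.network.find_network('my-tenant-net')
    conn.network.update_network(net, mtu=1400)

    # 3. wait another lease period so instances pick up the new MTU via DHCP
    time.sleep(LEASE_DURATION)

    # 4. now restore the original dhcp_lease_duration in neutron.conf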

If I go and implement the RFE as-is, and later in Queens we pursue
updating MTU for existing networks, we will have three extensions for
the same thing.

  • net-mtu (existing read only attribute)
  • net-mtu-enhanced (allow write on create)
  • net-mtu-enhanced-enhanced (allow updates)

Not to mention potential addition of per-port MTU that some folks keep
asking for (and we keep pushing against so far).

So, I wonder if we can instead lay the ground for updatable MTU right
away, and allow_post: True from the start, even while implementing
create only as a phase-1. Then we can revisit the decision if needed
without touching api. What do you think?

It's trivially detectable that an MTU value can't be set at all, or can be
set initially but not changed. Could we use that approach? That way, we
don't need multiple extensions, the current one is sufficient (and - on the
assumption that you don't rely on 'read-only attribute' errors in normal
code, I think we can call this backward compatible).

You mean we just set allow_post: True, allow_put: True on the existing
extension? That's fine, but we need some way to detect whether
updating/setting MTU will work that does not involve catching an error on
the API user side. We can probably mess with the attribute map of the
existing extension, but we will still need separate 'flag' extensions to
detect the change gracefully.
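
For illustration, such a flag extension could be as small as the sketch
below; it assumes neutron-lib's ExtensionDescriptor base class, and the
alias and class name are placeholders rather than anything agreed:

    # Sketch: a 'flag' extension that adds no attributes and exists only so
    # clients can discover that the mtu attribute is writable.
    from neutron_lib.api import extensions


    class Net_mtu_writable(extensions.ExtensionDescriptor):

        def get_name(self):
            return "Network MTU (writable)"

        def get_alias(self):
            return "net-mtu-writable"

        def get_description(self):
            return "Network mtu attribute is writable on create and update."

        def get_updated(self):
            return "2017-07-07T00:00:00-00:00"

        def get_extended_resources(self, version):
            # No new attributes; mtu itself stays defined by net-mtu.
            return {}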

Another related question is, how do we expose both old and new
extensions at the same time? I would imagine that implementations
capable of writing to the mtu attribute would advertise both old and
new extensions. Is it correct? Does neutron api layer allow for
overlapping attribute maps?

  • Extension net-mtu: the mtu attr exists but can't be set at all; passing
    an MTU returns a bad-argument error.
  • Extension net-mtu: the mtu attr exists and can be set on creation; a
    failed (too big) MTU value returns a more specific "MTU too big" error.
  • Extension net-mtu: the mtu attr exists and can be set after creation; a
    failed update is rejected in the same way as a failed create-time write
    (which it appears you already have in mind).

--
Ian.




responded Jul 7, 2017 by Ihar_Hrachyshka

On Wed, Jul 5, 2017 at 6:43 PM, Ian Wells ijw.ubuntu@cack.org.uk wrote:
I think I misinterpreted: you'd enable all options and then deal with the
consequences in the backend code, which has to implement one of the
previously listed behaviours? That seems sane to me, provided the required
behaviours are documented somewhere where a driver implementer has to trip
over them.

Yeah. Only with the twist that I believe we also need a 'flag' extension
that would indicate that MTU is writable.

Ihar


responded Jul 7, 2017 by Ihar_Hrachyshka

On 7 July 2017 at 12:14, Ihar Hrachyshka ihrachys@redhat.com wrote:

That said: what will you do with existing VMs that have been told the MTU
of their network already?

Same as we do right now when modifying configuration options defining
underlying MTU: change it on API layer, update data path with the new
value (tap to brq to router/dhcp legs) and hope instances will get
there too (by means of dhcp lease refresh eventually happening, or
rebooting instances, or else). There is no silver bullet here, we have
no way to tell instances to update their interface MTUs.

Indeed, and I think that's my point.

Let me propose an option 2.

Refuse to migrate if it would invalidate the MTU property on an existing
network. If this happens, the operator can delete such networks, or clear
them out and recreate them with a smaller MTU. The point being, since the
automation can't reliably fix the MTU of the running VMs, the automation
shouldn't change the MTU of the network - it's not in the power of the
network control code to get the results right - and you should instead tell
the operator that there are decisions to make about whether VMs have to be
restarted, networks deleted or recreated, etc. that can't be judged
automatically.

However, explain in the documentation how to make a migration that won't
invalidate your existing virtual networks' MTUs, allowing you to preserve
all your networks with the same MTU they already have. If you migrate
encap-A to bigger-encap-B (and you lose some more bytes from the infra MTU),
it would refuse to migrate most networks unless you simultaneously
increased the path_mtu to allow for the extra bytes. So, say B takes 10
extra bytes: you fiddle with your switches to increase their MTU by 10, your
auto-migration itself fiddles with the MTUs on host interfaces and
vswitches, and the MTU of the virtual network remains the same (because
phys MTU - encap >= biggest allowed virtual network MTU before the upgrade).
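
To make the arithmetic concrete, a small worked example with made-up
numbers (the per-encap overheads are illustrative assumptions, not the
exact figures neutron uses):

    # Keep the virtual network MTU constant across an encap migration by
    # growing the infra MTU by the new encap's extra overhead.
    VXLAN_OVERHEAD = 50    # assumed overhead of the old encap
    GENEVE_OVERHEAD = 60   # assumed overhead of the new encap (10 bytes more)

    old_infra_mtu = 1550
    new_infra_mtu = old_infra_mtu + (GENEVE_OVERHEAD - VXLAN_OVERHEAD)  # 1560

    old_net_mtu = old_infra_mtu - VXLAN_OVERHEAD    # 1500
    new_net_mtu = new_infra_mtu - GENEVE_OVERHEAD   # 1500 -- unchanged
    assert old_net_mtu == new_net_mtu               # existing networks stay valid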

At least not till we get both new ovs and virtio-net in the guests
that will know how to deal with MTU hints:
https://bugzilla.redhat.com/show_bug.cgi?id=1408701
https://bugzilla.redhat.com/show_bug.cgi?id=1366919
(there should also be ovs integration piece but I can't find it right
away.)

... and every OS on the planet actually uses it, and no-one uses an e1000
NIC or an SRIOV NIC, and and and...

Though even with that, I don't know if guest will be notified about
changes happening during its execution, or only on boot (that probably
depends on whether virtio polls the mtu storage). And anyway, it
depends on guest kernel, so no luck for windows guests and such.

Put a different way, a change to the infrastructure can affect MTUs in two
ways:

  • I increase the MTU that a network can pass (by, for instance, increasing
    the infrastructure of the encap). I don't need to change its MTU because
    VMs that run on it will continue to work. I have no means to tell the VMs
    they have a bigger MTU now, and whatever method I might use needs to be
    100% certain to work or left-out VMs will become impossible to talk to,
    so leaving the MTU alone is sane.

In this scenario, it sounds like you assume everything will work just
fine. But you don't consider neutron routers, which will enforce the new,
larger MTU for fragmentation and may end up sending frames to unaware VMs
of a size they can't handle.

Actually, no. I'm saying here that I increase the MTU that the network can
pass - for instance, I change the MTU on my physical switch from 1500
to 9000 - but I don't change anything about my OpenStack network
properties. Thus if I were to send a packet of 9000 (and the property on
the virtual network still says the MTU is 1500) it gets to its destination,
because the API doesn't guarantee that the packets are dropped; it just
makes no guarantee that the packet will be passed, so this is undefined
behaviour territory. The virtual network's MTU property is still 1500,
we can still guarantee that the network will pass packets up to and
including 1500 bytes, and the router interfaces, just like VM interfaces,
are set from the MTU property to a 1500 MTU - so they emit transmissible
packets and they all agree on the MTU size, which is what's necessary for a
network to work. The fact that the fabric will now pass 9000 byte packets
isn't relevant.

  • I decrease the MTU that a network can pass (by, for instance, using an
    encap with larger headers). The network comprehensively breaks; VMs
    frequently fail to communicate regardless of whether I change the network
    MTU property, because running VMs have already learned their MTU value
    and,
    again, there's no way to update their idea of what it is reliably.
    Basically, this is not a migration that can be done with running VMs.

Yeah. You may need to do some multi-step dance, like:

  • before the MTU reduction, lower dhcp_lease_duration to 3 mins;
  • wait until all leases are refreshed;

... hope and pray that the DHCP agent in the host checks the MTU on every
lease renewal - I'm not saying for definite that it doesn't, but I don't
think anyone usually designs for the MTU to change after interface-up...


responded Jul 7, 2017 by Ian_Wells
...