
[Openstack] HA router fail-over time

0 votes

Openstack version: Mitaka

So we're running VMs on compute nodes and HA routers on network nodes (old
school, I know...)

Here's the test setup I'm using:

Test structure: an OpenStack VM is assigned to a tenant network that has a
public IP, and I have SNAT turned off on the OpenStack HA routers.

  • From the OpenStack VM itself: I ping its default gateway (the active HA
    router on the network node)

  • From outside our network: I ping the VM's public IP

Both work great.

Then, I reboot the network node that's currently active for the tenant
network subnet.

Result:

  • VM to Default GW ping: just about 20 seconds of outage
  • From outside the network, the outage is approximately 40 seconds

a) how do I make the failover time faster? (VRRP, etc)
b) why are they different times?

Thanks in advance!!

Steve


Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
Post to : openstack@lists.openstack.org
Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
asked Mar 30, 2017 in openstack by Sterdnot_Shaken

1 Response

0 votes

Hi,



      At 2:10, Sterdnot Shaken wrote:
        > b) why are they different times?
        Just an idea, but this used to happen because of the underlying
        switches' CAM table expiration (it's configurable on industrial
        devices; on our Brocade it's 300 seconds by default).

To solve this problem, VRRP routers have to broadcast gratuitous ARPs with
the virtual MAC and the IP (which in effect forces the switch to learn the
new state).
Check out this: https://tools.ietf.org/html/rfc3768#section-8.2
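To make the mechanism concrete, here's a minimal sketch of the gratuitous ARP frame a VRRP master broadcasts after a failover, per RFC 3768 section 8.2. This is not Neutron's or keepalived's actual code, just an illustration of the wire format; the VRID and VIP values are made up:

```python
import socket
import struct

def gratuitous_arp_frame(vrid: int, vip: str) -> bytes:
    """Build the gratuitous ARP frame a new VRRP master broadcasts:
    the source MAC is the *virtual* MAC 00:00:5E:00:01:{VRID}, and the
    sender and target IPs are both the virtual IP, so switches relearn
    which port the virtual MAC now lives behind."""
    vmac = bytes.fromhex("00005e0001") + bytes([vrid])  # VRRP virtual MAC
    bcast = b"\xff" * 6                                 # Ethernet broadcast
    ip = socket.inet_aton(vip)
    eth = bcast + vmac + struct.pack("!H", 0x0806)      # EtherType = ARP
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)     # htype, ptype, hlen, plen, op=request
    arp += vmac + ip                                    # sender MAC/IP: the virtual pair
    arp += b"\x00" * 6 + ip                             # target MAC unknown, target IP = VIP
    return eth + arp

frame = gratuitous_arp_frame(1, "203.0.113.1")
print(len(frame))  # → 42 (14-byte Ethernet header + 28-byte ARP payload)
```

Sending this on the wire would need a raw socket and root privileges; the point here is just that the frame carries the virtual MAC as its source, which is what refreshes the switch's CAM table.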

On the Neutron side, there is a config option in the neutron.conf [DEFAULT]
section called "send_arp_for_ha". This is an integer that sets how many
gratuitous ARPs are sent.
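For reference, that option (with the underscores that got stripped above restored) looks like this; the value 3 is just an example, check your deployment's default:

```ini
[DEFAULT]
# Number of gratuitous ARPs the l3 agent sends for an HA router's
# addresses (example value)
send_arp_for_ha = 3
```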

BTW, we have this corrected, but sometimes it takes a while for Neutron
to come up and make things work (we use Mitaka), because if I
understand right, the l3 agent handles the state changes on a
single thread now, and they introduced a
ha_keepalived_state_change_server_threads option in Newton, which is:

"A new option hakeepalivedstatechangeserver_threads has been added
to configure the number of concurrent threads spawned for keepalived
server connection requests. Higher values increase the CPU load on the
agent nodes. The default value is half of the number of CPUs present on
the node. This allows operators to tune the number of threads to suit
their environment. With more threads, simultaneous requests for multiple
HA routers state change can be handled faster."

Source:
https://docs.openstack.org/releasenotes/neutron/newton.html#upgrade-notes
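With the underscores restored, that Newton-era option would be set in the l3 agent's configuration (e.g. the [DEFAULT] section); the value 4 here is an illustrative example, since the release note says the default is half the node's CPU count:

```ini
[DEFAULT]
# Threads handling keepalived state-change notifications (Newton+).
# Higher values increase CPU load on the agent node. (example value)
ha_keepalived_state_change_server_threads = 4
```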

My other (and last) guess is the reboot/boot process of your network
node... how did you do it? (I mean, graceful, or a hard reset?) It may
add a few seconds too, since graceful shutdown of service
dependencies can create a few seconds where some of the required services
are not running... (just an example from our boot process, as I saw after a
bit of debugging):
- Start OVS (without any dynamic data)
- Start the L3 agent (and it starts keepalived)
- The new keepalived instance connects to the OVS HA network (and the L3
agent starts to push the dynamic config to OVS)
- Until the HA network is up and running on both network nodes
(so the keepalived daemons can talk to each other), the two Neutron routers
are in master-master state on the same subnet
- When they can talk, one of them goes to backup (and the gratuitous
ARP comes in again, but it may just mix things up, since the few seconds of
master-master state confuse the CAM table of your switch(es))

Hope that gives you some clue about where to start debugging :)

Regards:
Peter


responded Mar 30, 2017 by Erdősi_Péter