
[openstack-dev] [openstack-ansible] L3HA problem

0 votes

Hi,

We deployed our OpenStack infrastructure with your "exciting" project openstack-ansible (Mitaka 13.1.2), but we have some problems with L3HA after creating a router.

Our infra (close to the doc):
3 controller nodes (with bond0 (br-mgmt, br-storage) and bond1 (br-vxlan, br-vlan))
2 compute nodes (same network setup)

We created an external network (vlan type), an internal network (vxlan type) and a router connected to both networks.
When we launch an instance (cirros), the VM doesn't receive an IP.
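
For reference, the topology was created with the standard neutron CLI, along these lines (a sketch: the physical network label and VLAN ID are illustrative assumptions, not our exact values):

neutron net-create ext-net --router:external \
    --provider:network_type vlan \
    --provider:physical_network physnet1 \
    --provider:segmentation_id 240
neutron subnet-create ext-net 147.210.240.0/23 --name ext-subnet --disable-dhcp
neutron net-create int-net                    # tenant network, vxlan type here
neutron subnet-create int-net 192.168.100.0/24 --name int-subnet
neutron router-create router-bim              # HA router, since l3_ha = True
neutron router-gateway-set router-bim ext-net
neutron router-interface-add router-bim int-subnet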

We have:

root@p-osinfra03-utility-container-783041da:~# neutron l3-agent-list-hosting-router router-bim
+--------------------------------------+-----------------------------------------------+----------------+-------+----------+
| id                                   | host                                          | admin_state_up | alive | ha_state |
+--------------------------------------+-----------------------------------------------+----------------+-------+----------+
| 3c7918e5-3ad6-4f82-a81b-700790e3c016 | p-osinfra01-neutron-agents-container-f1ab9c14 | True           | :-)   | active   |
| f2bf385a-f210-4dbc-8d7d-4b7b845c09b0 | p-osinfra02-neutron-agents-container-48142ffe | True           | :-)   | active   |
| 55350fac-16aa-488e-91fd-a7db38179c62 | p-osinfra03-neutron-agents-container-2f6557f0 | True           | :-)   | active   |
+--------------------------------------+-----------------------------------------------+----------------+-------+----------+

I know I have a problem here, because I should see :-) active, :-) standby, :-) standby… Snif...

root@p-osinfra01-neutron-agents-container-f1ab9c14:~# ip netns
qrouter-eeb2147a-5cc6-4b5e-b97c-07cfc141e8e6
qdhcp-0ba266fb-15c4-4566-ae88-92d4c8fd2036

root@p-osinfra01-neutron-agents-container-f1ab9c14:~# ip netns exec qrouter-eeb2147a-5cc6-4b5e-b97c-07cfc141e8e6 ip a sh
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ha-4a5f0287-91@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:c2:67:a9 brd ff:ff:ff:ff:ff:ff
    inet 169.254.192.1/18 brd 169.254.255.255 scope global ha-4a5f0287-91
       valid_lft forever preferred_lft forever
    inet 169.254.0.1/24 scope global ha-4a5f0287-91
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fec2:67a9/64 scope link
       valid_lft forever preferred_lft forever
3: qr-44804d69-88@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:a5:8c:f2 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.254/24 scope global qr-44804d69-88
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fea5:8cf2/64 scope link
       valid_lft forever preferred_lft forever
4: qg-c5c7378e-1d@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:b6:4c:97 brd ff:ff:ff:ff:ff:ff
    inet 147.210.240.11/23 scope global qg-c5c7378e-1d
       valid_lft forever preferred_lft forever
    inet 147.210.240.12/32 scope global qg-c5c7378e-1d
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:feb6:4c97/64 scope link
       valid_lft forever preferred_lft forever

Same result on infra02 and infra03: the qr and qg interfaces carry the same IPs, and the ha interfaces all hold the address 169.254.0.1.
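
To see what keepalived itself believes on each node, the L3 agent keeps a per-router directory with the generated configuration and a small state file; the path and filename below are my assumption based on the default neutron state_path (/var/lib/neutron):

# run inside each neutron-agents container (path/filename assumed)
cat /var/lib/neutron/ha_confs/eeb2147a-5cc6-4b5e-b97c-07cfc141e8e6/state
# a healthy HA router shows "master" on exactly one node, "backup" on the others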

If we stop two neutron agents containers (p-osinfra02, p-osinfra03) and restart the first (p-osinfra01), we can reboot the instance and we get an IP, a floating IP, and we can reach the VM over SSH from the internet. (Note: after a while we lose that connectivity too.)

But if we restart the two other containers, their ha_state is "standby" until all three become "active", and then we have the problem again.

The three routers on infra 01/02/03 are seen as master.

If we ping from our instance to the router (internal network, 192.168.100.4 to 192.168.100.254), we see ARP requests:
ARP, Request who-has 192.168.100.254 tell 192.168.100.4, length 28
ARP, Request who-has 192.168.100.254 tell 192.168.100.4, length 28
ARP, Request who-has 192.168.100.254 tell 192.168.100.4, length 28

On the compute node we see all these frames on the various interfaces (tap / vxlan-89 / br-vxlan / bond1.vxlanvlan / bond1 / em2), but nothing comes back.

We also see, on the ha interface of each router, the VRRP traffic (heartbeat packets over a hidden project network that connects all the HA routers, vxlan 70). A priori, as expected in that situation, each router thinks it is the master.
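
The keepalived configuration that neutron generates for this router can also be inspected directly (same assumed default state_path as above):

# inside a neutron-agents container
cat /var/lib/neutron/ha_confs/eeb2147a-5cc6-4b5e-b97c-07cfc141e8e6/keepalived.conf
# the vrrp_instance stanza names the ha-* interface, the virtual_router_id,
# the priority (50 in the captures below) and the 169.254.0.1/24 VIP seen above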

root@p-osinfra01-neutron-agents-container-f1ab9c14:~# ip netns exec qrouter-eeb2147a-5cc6-4b5e-b97c-07cfc141e8e6 tcpdump -nl -i ha-4a5f0287-91
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ha-4a5f0287-91, link-type EN10MB (Ethernet), capture size 65535 bytes
IP 169.254.192.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
IP 169.254.192.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
IP 169.254.192.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
IP 169.254.192.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20

root@p-osinfra02-neutron-agents-container-48142ffe:~# ip netns exec qrouter-eeb2147a-5cc6-4b5e-b97c-07cfc141e8e6 tcpdump -nt -i ha-4ee5f8d0-7f
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ha-4ee5f8d0-7f, link-type EN10MB (Ethernet), capture size 65535 bytes
IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20

Could someone tell me if they have already encountered this problem?
The infra and compute nodes are connected to a Nexus 9000 switch.

Thank you in advance for taking the time to study my request.

Fabrice Grelaud
Université de Bordeaux


asked Jun 22, 2016 in openstack-dev by Fabrice_Grelaud

10 Responses

0 votes

Hi!

What Keepalived version is used?

On Wed, Jun 22, 2016 at 4:24 PM, fabrice grelaud <fabrice.grelaud@u-bordeaux.fr> wrote:

[original message quoted in full; snipped]

--
Regards,
Ann Kamyshnikova
Mirantis, Inc


responded Jun 22, 2016 by Anna_Kamyshnikova
0 votes

On Wed, Jun 22, 2016 at 9:24 AM, fabrice grelaud <fabrice.grelaud@u-bordeaux.fr> wrote:
[original message quoted in full; snipped]

Are you seeing VRRP advertisements crossing nodes though? That tcpdump only shows advertisements from the local node. If nodes aren't receiving VRRP messages from other nodes, keepalived will declare itself as master (as expected). Can you ping the 'ha' interface from one router namespace to the other?
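
For example (a sketch; interface names and the 169.254.192.x addresses are taken from the captures above):

# on infra01: show only VRRP advertisements that do NOT originate locally
ip netns exec qrouter-eeb2147a-5cc6-4b5e-b97c-07cfc141e8e6 \
    tcpdump -nl -i ha-4a5f0287-91 'ip proto 112 and not src host 169.254.192.1'
# (112 = VRRP; an empty capture means no peer advertisements arrive)

# plain unicast test between the ha interfaces
ip netns exec qrouter-eeb2147a-5cc6-4b5e-b97c-07cfc141e8e6 ping -c 3 169.254.192.3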



responded Jun 22, 2016 by assaf_at_redhat.com
0 votes

Hi,

keepalived 1:1.2.7-1ubuntu

On 22 June 2016 at 15:41, Anna Kamyshnikova <akamyshnikova@mirantis.com> wrote:

Hi!

What Keepalived version is used?

[earlier messages quoted in full; snipped]
responded Jun 22, 2016 by Fabrice_Grelaud
0 votes

Keepalived 1.2.7 is a bad version. Please see the comments in this bug: https://bugs.launchpad.net/neutron/+bug/1497272. I suggest you try one of the latest versions of Keepalived.
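
On Ubuntu 14.04 that means pulling the package from somewhere newer than the archive default, e.g. trusty-backports (a sketch; the agent service name is my assumption for a standard neutron-agents container):

# inside each neutron-agents container, with trusty-backports enabled in apt
apt-get update
apt-get install -t trusty-backports keepalived
service neutron-l3-agent restart    # so existing HA routers are reprocessed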

[earlier messages quoted in full; snipped]

--
Regards,
Ann Kamyshnikova
Mirantis, Inc


responded Jun 22, 2016 by Anna_Kamyshnikova
0 votes

On 22 June 2016 at 15:45, Assaf Muller <assaf@redhat.com> wrote:

[original message quoted in full; snipped]

Are you seeing VRRP advertisements crossing nodes though? That tcpdump only shows advertisements from the local node. If nodes aren't receiving VRRP messages from other nodes, keepalived will declare itself as master (as expected). Can you ping the 'ha' interface from one router namespace to the other?

I stopped the three neutron agents containers.
Then I restarted the one on infra01, then the one on infra02.

I can see VRRP frames from infra01 (169.254.192.1 -> 224.0.0.18) being received by infra02.

root@p-osinfra02:~# tcpdump -nl -i em2 | grep 169.254
tcpdump: WARNING: em2: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on em2, link-type EN10MB (Ethernet), capture size 65535 bytes
IP 169.254.192.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
IP 169.254.192.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
...
then I see:
IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20

No more 169.254.192.1 from infra01, only the IP of the ha interface of the router on infra02.

And no VRRP advertisements cross the nodes any more.
On each infra node, we see VRRP advertisements from the node itself but nothing from the others.

Otherwise, I can ping the ha interface from one router namespace to the other:
root@p-osinfra01-neutron-agents-container-f1ab9c14:~# ip netns exec qrouter-eeb2147a-5cc6-4b5e-b97c-07cfc141e8e6 ping 169.254.192.3
PING 169.254.192.3 (169.254.192.3) 56(84) bytes of data.
64 bytes from 169.254.192.3: icmp_seq=1 ttl=64 time=0.297 ms
64 bytes from 169.254.192.3: icmp_seq=2 ttl=64 time=0.239 ms
64 bytes from 169.254.192.3: icmp_seq=3 ttl=64 time=0.264 ms
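
Since unicast clearly crosses the nodes while the 224.0.0.18 multicast stops being delivered, one more check worth running (plain iproute2, nothing deployment-specific assumed) is the multicast group membership on the ha interface:

ip netns exec qrouter-eeb2147a-5cc6-4b5e-b97c-07cfc141e8e6 \
    ip maddr show dev ha-4a5f0287-91
# 224.0.0.18 should appear in the inet list while keepalived is running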

I'm going to test with another version of keepalived (current version here: 1.2.7-1 on Ubuntu 14.04).

Thanks for the help.


responded Jun 22, 2016 by Fabrice_Grelaud
0 votes

Thanks. I will test…

Do you think trusty-backports is enough (1:1.2.13-1~ubuntu14.04.1)?
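
Before installing, something like this confirms what the backports pocket actually offers (output depends on your mirror):

apt-cache policy keepalived    # compare installed vs. candidate versions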

On 22 June 2016 at 16:21, Anna Kamyshnikova <akamyshnikova@mirantis.com> wrote:

[earlier messages quoted in full; snipped]


responded Jun 22, 2016 by Fabrice_Grelaud
0 votes

On 22 June 2016 at 17:35, fabrice grelaud <fabrice.grelaud@u-bordeaux.fr> wrote:

Le 22 juin 2016 à 15:45, Assaf Muller <assaf@redhat.com assaf@redhat.com> a écrit :

On Wed, Jun 22, 2016 at 9:24 AM, fabrice grelaud
<fabrice.grelaud@u-bordeaux.fr fabrice.grelaud@u-bordeaux.fr> wrote:

….

But if we restart the two containers, ha_state goes to « standby » until all three become « active », and then we have the problem again.

The three routers on infra01/02/03 each see themselves as master.

If we ping from our instance to the router (internal network, 192.168.100.4 to 192.168.100.254) we see repeated ARP requests:
ARP, Request who-has 192.168.100.254 tell 192.168.100.4, length 28
ARP, Request who-has 192.168.100.254 tell 192.168.100.4, length 28
ARP, Request who-has 192.168.100.254 tell 192.168.100.4, length 28

On the compute node we see all these frames on the various interfaces (tap / vxlan-89 / br-vxlan / bond1.vxlanvlan / bond1 / em2), but nothing comes back.

We also see, on each router's ha interface, the VRRP communication (heartbeat packets over a hidden project network, vxlan 70, that connects all the HA routers). A priori that looks normal, except that each router believes it is the master.
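
(As a side note, the VRRP speaker here is a keepalived process that neutron-l3-agent spawns per HA router inside each agents container, and its generated configuration and current state can be inspected directly. A minimal sketch, assuming the default neutron state path /var/lib/neutron and the Mitaka-era ha_confs layout:

root@p-osinfra01-neutron-agents-container-f1ab9c14:~# cat /var/lib/neutron/ha_confs/eeb2147a-5cc6-4b5e-b97c-07cfc141e8e6/keepalived.conf
root@p-osinfra01-neutron-agents-container-f1ab9c14:~# cat /var/lib/neutron/ha_confs/eeb2147a-5cc6-4b5e-b97c-07cfc141e8e6/state

The first file shows the vrrp_instance definition (priority 50, advert_int 2, the VIPs listed above); the second, if the state-change monitor is running, should read master on exactly one node and backup on the others, matching the ha_state column of the agent listing.)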

root@p-osinfra01-neutron-agents-container-f1ab9c14:~# ip netns exec qrouter-eeb2147a-5cc6-4b5e-b97c-07cfc141e8e6 tcpdump -nl -i ha-4a5f0287-91
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ha-4a5f0287-91, link-type EN10MB (Ethernet), capture size 65535 bytes
IP 169.254.192.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
IP 169.254.192.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
IP 169.254.192.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
IP 169.254.192.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20

root@p-osinfra02-neutron-agents-container-48142ffe:~# ip netns exec qrouter-eeb2147a-5cc6-4b5e-b97c-07cfc141e8e6 tcpdump -nt -i ha-4ee5f8d0-7f
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ha-4ee5f8d0-7f, link-type EN10MB (Ethernet), capture size 65535 bytes
IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
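
(For comparison, on a healthy HA set only the master advertises; backups stay silent as long as they hear it. So a capture on infra02's ha interface should have shown only frames sourced from infra01's address, along the lines of:

IP 169.254.192.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20

Each node advertising its own 169.254.192.x address, as above, is the classic signature of a multicast partition between the nodes.)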

Are you seeing VRRP advertisements crossing nodes, though? That tcpdump only shows advertisements from the local node. If nodes aren't receiving VRRP messages from other nodes, keepalived will declare itself master (as expected). Can you ping the 'ha' interface from one router namespace to the other?

I stopped the three neutron agents containers, then restarted the one on infra01 and then the one on infra02.

I can see VRRP frames from infra01 (169.254.192.1 -> 224.0.0.18) being received by infra02.

root@p-osinfra02:~# tcpdump -nl -i em2 | grep 169.254
tcpdump: WARNING: em2: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on em2, link-type EN10MB (Ethernet), capture size 65535 bytes
IP 169.254.192.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
IP 169.254.192.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20
….
….
then I see:
IP 169.254.192.3 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 50, authtype simple, intvl 2s, length 20

No more 169.254.192.1 from infra01; only the IP of the HA interface of the router on infra02.

After that, no VRRP advertisements cross the nodes at all: on each infra node we see VRRP advertisements from the node itself but nothing from the others.
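
(Since the hidden HA network rides on VXLAN, the advertisements only cross nodes if the underlying multicast does. A quick sanity check, sketched under the assumption that the linuxbridge agent's vxlan_group is 239.1.1.1 and that vxlan-70 carries the HA network, both consistent with this deployment:

root@p-osinfra02:~# ip -d link show vxlan-70    # prints the VNI and the multicast group the interface joined
root@p-osinfra02:~# tcpdump -nli bond1 host 239.1.1.1    # is encapsulated traffic to the group arriving from the other nodes?

If the first command shows the group but the second only ever shows locally sourced packets, the underlay is not delivering the multicast between hosts.)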

Otherwise, I can ping the ha interface from one router namespace to the other:
root@p-osinfra01-neutron-agents-container-f1ab9c14:~# ip netns exec qrouter-eeb2147a-5cc6-4b5e-b97c-07cfc141e8e6 ping 169.254.192.3
PING 169.254.192.3 (169.254.192.3) 56(84) bytes of data.
64 bytes from 169.254.192.3: icmp_seq=1 ttl=64 time=0.297 ms
64 bytes from 169.254.192.3: icmp_seq=2 ttl=64 time=0.239 ms
64 bytes from 169.254.192.3: icmp_seq=3 ttl=64 time=0.264 ms

I'm going to test with another version of keepalived (current version here: 1.2.7-1 on Ubuntu 14.04).

Thanks for your help.

Note:
I said I can ping between the ha interfaces, but not for long. At one point, I can't anymore… :-(

Could someone tell me if they have already encountered this problem?
The infra and compute nodes are connected to a Nexus 9000 switch.

Thank you in advance for taking the time to study my request.

Fabrice Grelaud
Université de Bordeaux


responded Jun 22, 2016 by Fabrice_Grelaud

On Wed, Jun 22, 2016 at 12:02 PM, fabrice grelaud <fabrice.grelaud@u-bordeaux.fr> wrote:

….
Note:
I said I can ping between the ha interfaces, but not for long. At one point, I can't anymore… :-(

That's the problem. This becomes a normal Neutron troubleshooting exercise: why can't one port ping the other? This might help:
https://assafmuller.com/2015/08/31/neutron-troubleshooting/
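
(For a linuxbridge/VXLAN deployment like this one, that exercise usually means walking down from the namespace to the wire. A rough sketch; the brq<network-id> bridge name is the linuxbridge convention, and the MAC to grep for would be the peer ha port's fa:16:3e:... address:

root@p-osinfra01-neutron-agents-container-f1ab9c14:~# brctl show    # is the ha tap plugged into the expected brq bridge next to vxlan-70?
root@p-osinfra01-neutron-agents-container-f1ab9c14:~# bridge fdb show | grep fa:16:3e    # are peer MACs learned over the vxlan interface, and do they expire?
root@p-osinfra01:~# tcpdump -nli bond1 host 239.1.1.1 or udp port 4789    # 8472 on older kernels

A ping that dies at the same moment the peer's FDB entry expires points at the underlay dropping the multicast-learned VXLAN traffic rather than at Neutron itself.)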



responded Jun 22, 2016 by assaf_at_redhat.com

Keepalived version 1.2.13 is reliable.
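
(For reference, Ubuntu 14.04 ships keepalived 1.2.7, so getting 1.2.13 into the agents containers means pulling a backported package. A hypothetical sequence; the PPA name below is an illustration, not a tested source, and any repository carrying >= 1.2.13 would do:

root@p-osinfra01-neutron-agents-container-f1ab9c14:~# add-apt-repository ppa:keepalived/stable    # hypothetical PPA
root@p-osinfra01-neutron-agents-container-f1ab9c14:~# apt-get update && apt-get install keepalived
root@p-osinfra01-neutron-agents-container-f1ab9c14:~# keepalived --version

followed by a restart of neutron-l3-agent so it respawns keepalived with the new binary.)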

On Wed, Jun 22, 2016 at 8:40 PM, Assaf Muller <assaf@redhat.com> wrote:

….

--
Regards,
Ann Kamyshnikova
Mirantis, Inc


responded Jun 23, 2016 by Anna_Kamyshnikova

On 22 June 2016 at 19:40, Assaf Muller <assaf@redhat.com> wrote:

….

That's the problem. This becomes a normal Neutron troubleshooting exercise: why can't one port ping the other? This might help:
https://assafmuller.com/2015/08/31/neutron-troubleshooting/

Hi,

thanks for the link. I had already looked at it (more or less), and in the end I suspected a problem on the switch (Nexus) side.

After some investigation and tcpdump captures, we saw that packets to 239.1.1.1, the VXLAN multicast group, were not being forwarded by the switch.
In fact, « igmp snooping » is enabled by default on the Nexus switch, which caused the bad behaviour: without an IGMP querier on that VLAN, the switch apparently never learned any receivers for the group and so dropped the VXLAN multicast, taking the VRRP heartbeats with it.

We disabled igmp snooping and that's all, folks!
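
(For anyone hitting the same symptom: on NX-OS the fix is either disabling snooping on the VXLAN transport VLAN or, arguably cleaner, keeping snooping and configuring an IGMP snooping querier so the switch learns which ports subscribe to the group. A sketch, with VLAN 100 standing in for the transport VLAN and 192.0.2.1 as a placeholder querier address; exact syntax can vary by NX-OS release:

switch(config)# vlan configuration 100
switch(config-vlan-config)# no ip igmp snooping
switch(config-vlan-config)# ip igmp snooping querier 192.0.2.1    # alternative to disabling snooping

Disabling snooping floods the 239.1.1.1 traffic to every port in the VLAN, which is acceptable on a dedicated transport VLAN.)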

Cordially,

Fabrice Grelaud
Université de Bordeaux



responded Jun 24, 2016 by Fabrice_Grelaud
...