os-brick 1.4 was released over the weekend, and was the first os-brick
to include privsep. We got a really odd failure rate in the
grenade-multinode jobs (1/3 - 1/2) afterwards, and the cause was far
from obvious. Hemma looks to have figured it out (this is a summary of
what I've seen on IRC, to pull it all together).
Remembering the following:
- New code must work with N-1 configs. So this is master running with
  the previous release's configs.
- privsep requires a sudo rule or rootwrap rule (to get to sudo) to allow
  the privsep daemon to be spawned for volume actions.
During gate testing we have a blanket sudoer rule for the stack user
during the run of grenade.sh. It has to do system level modifications
broadly to perform the upgrade. This sudoer rule is deleted at the end
of the grenade.sh run before Tempest tests are run, so that Tempest
tests don't accidentally require root privs on their target environment.
Grenade also makes sure that some resources live across the upgrade
boundary. This includes a boot from volume guest, which is torn down
before testing starts. And this is where things get interesting.
This means there is a volume teardown needed before grenade ends. But
there is only one. In single node grenade this happens about 30 seconds
from the end of the script, triggers the privsep daemon start, and then
we're done. And the 50stacksh sudoers file is removed. In multinode,
if the boot from volume server is on the upgrade node, then the same
thing happens. However, if it instead ended up on the subnode, which
is not upgraded, then the volume teardown is on the old node. No
os-brick calls are made on the upgraded node before grenade finishes.
The 50stacksh sudoers file is removed, as expected.
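And now all volume tests on those nodes fail, because the privsep daemon
was never started on the upgraded node and can no longer be spawned
without the sudoers rule.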
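To make the ordering problem concrete, here is a toy model of one run as
seen from the upgraded node. It is only a sketch of the sequence described
above, not grenade or os-brick code; the function and variable names are
made up for illustration.

```python
def volume_tests_pass(bfv_guest_on_upgraded_node: bool) -> bool:
    """Toy model of one grenade run (hypothetical, not real grenade code)."""
    blanket_sudoers_rule = True      # in place while grenade.sh runs
    privsep_daemon_running = False   # new os-brick spawns it on first use

    # Grenade tears down the boot from volume guest before it finishes.
    # Only a teardown on the upgraded node goes through the new os-brick,
    # and spawning the daemon needs a sudo path at that moment.
    if bfv_guest_on_upgraded_node and blanket_sudoers_rule:
        privsep_daemon_running = True

    # grenade.sh ends and the blanket sudoers file is removed.
    blanket_sudoers_rule = False

    # Tempest volume tests need a running daemon, or a way to spawn one.
    return privsep_daemon_running or blanket_sudoers_rule


for on_upgraded in (True, False):
    print(on_upgraded, volume_tests_pass(on_upgraded))
```

In this model the tests only pass when the guest happened to land on the
upgraded node, which is the accidental dependency on the blanket rule.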
Which is what should happen. The point is that in production no one is
going to put a blanket sudoers rule like that in place. It's just that we
needed it for this activity, and because the services run as the same
userid as the shell user (which is not root), they were able to use this
fallback rule.
The crux of the problem is that os-brick 1.4 and privsep can't be used
without a config file change during the upgrade. Which violates our
policy, because it breaks rolling upgrades.
So... we have a few options:
1) make an exception here with release notes, because it's the only way
to move forward.
2) have some way for os-brick to use either mode for a transition period
(depending on whether privsep is configured to work)
3) Something else.... ?
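As a rough illustration of what option 2 could look like, here is a minimal
sketch that tries the privsep path first and falls back to the old
rootwrap path if the daemon can't be started. Every name in it
(pick_runner, start_privsep, rootwrap_runner) is hypothetical; none of this
is existing os-brick or oslo.privsep API, and a real version would need to
decide how long the fallback stays around.

```python
from typing import Callable, List

# A "runner" takes a command and returns its output (hypothetical type).
Runner = Callable[[List[str]], str]


def pick_runner(start_privsep: Callable[[], Runner],
                rootwrap_runner: Runner) -> Runner:
    """Use privsep if its daemon can be started, else the old rootwrap path."""
    try:
        return start_privsep()
    except OSError:
        # Covers PermissionError: no sudo or rootwrap rule for the daemon.
        return rootwrap_runner


def failing_start() -> Runner:
    # Stand-in for an N-1 config that has no rule allowing the daemon spawn.
    raise PermissionError("no sudo or rootwrap rule for the privsep daemon")


def working_start() -> Runner:
    # Stand-in for a config where the daemon spawn is allowed.
    return lambda cmd: "privsep: " + " ".join(cmd)


def legacy_rootwrap(cmd: List[str]) -> str:
    # Stand-in for the pre-privsep execution path.
    return "rootwrap: " + " ".join(cmd)


runner = pick_runner(failing_start, legacy_rootwrap)
print(runner(["iscsiadm", "-m", "session"]))
```

The trade-off is that the old path has to be kept working for the whole
transition period.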
https://bugs.launchpad.net/os-brick/+bug/1592043 is the bug we've got on
this. We should probably sort out the path forward here on the ML as
there are a bunch of folks in a bunch of different time zones that have
important perspectives here.