Warning: wall of text incoming :-)
On 26/05/2017 03:55, Carter, Kevin wrote:
> If you've taken on an adventure like this, how did you approach
> it? Did it work? Any known issues, gotchas, or things folks should be
> generally aware of?
We're fresh out of a Juno-to-Mitaka upgrade. It worked, but it required
significant downtime of the user VMs for an OS upgrade on all compute
nodes (we had fallen behind the CentOS update schedule because some code
required specific kernel versions, so we could not perform a no-downtime
upgrade even though we use LinuxBridge for the data plane).
We took a significant amount of time to automate almost everything (OS
updates, OpenStack updates and configuration management), but the
control plane migration was performed manually with a lot of
verification steps to ensure the databases would not end up in shambles
(the procedure was carefully written in a runbook and tested on a
separate testbed and on a snapshot of all production databases).
As I said, the update worked but we hit a few snags:
1. the glance and neutron DBs were created with latin1 as the default
charset, so we had to convert both to UTF-8 (dump, iconv, fix the table
definitions, restore) - this is an operational issue on our side, though
2. on the testbed we found that nova created duplicate entries for all
hypervisors after starting all services; we traced that down to
compute_nodes.host being NULL for all HVs
3. [cache]/enable in nova.conf *must* be set to true if there are
multiple instances of nova-consoleauth/nova-novncproxy; in previous
releases we'd just point nova to our memcache servers and it would work
(we probably overlooked something in the docs)
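For reference, snag 1's dump/iconv/fix/restore cycle can be sketched
roughly as below. This is a hypothetical sketch, not our actual runbook:
the function name, file names and mysql invocations are illustrative.

```shell
#!/bin/sh
# Rough sketch of the latin1 -> utf8 conversion described in snag 1.
# Database name and file names are illustrative.
convert_db_charset() {
    db="$1"
    # Dump without letting the client re-encode or emit SET NAMES
    mysqldump --default-character-set=latin1 --skip-set-charset "$db" > "$db.latin1.sql"
    # Re-encode the dump itself
    iconv -f latin1 -t utf8 "$db.latin1.sql" > "$db.utf8.sql"
    # Fix the table definitions embedded in the dump
    sed 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/g' "$db.utf8.sql" > "$db.fixed.sql"
    # Recreate the database with the right default charset and restore
    mysql -e "DROP DATABASE $db; CREATE DATABASE $db CHARACTER SET utf8;"
    mysql --default-character-set=utf8 "$db" < "$db.fixed.sql"
}

# Usage (needs working MySQL credentials, e.g. in ~/.my.cnf):
# convert_db_charset glance
# convert_db_charset neutron
```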
> During our chat today we generally landed on an in-place upgrade with
> known API service downtime and as little data plane downtime as
> possible. The process discussed was basically:
> a1. Create a utility "thing-a-me" (container, venv, etc.) which contains
> the code needed to run a service through all of the required upgrades.
> a2. Stop the service(s).
> a3. Run the migration(s)/upgrade(s) for all releases using the utility.
> a4. Repeat for all services.
> b1. Once all required migrations are complete, run a deployment using the
> b2. Ensure all services are restarted.
> b3. Ensure the cloud is functional.
That was our basic workflow, except the "thing-a-me" was myself :-)
Joking aside, we kept one controller host out of the "mass upgrade" loop
and carefully performed single-version upgrades of the packages, running
all required DB migrations for each version.
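That per-release loop on the held-back controller looked roughly like
the sketch below. To be clear, this is a hedged illustration: the
release list, package names and db-sync commands are examples, and the
real runbook had many manual verification steps between iterations.

```shell
#!/bin/sh
# Illustrative single-version upgrade loop (CentOS/RDO style).
# Release list and package names are examples only.
upgrade_one_release() {
    release="$1"   # e.g. kilo, liberty, mitaka
    # Switch the package repositories to this release
    yum install -y "centos-release-openstack-${release}" || return 1
    yum upgrade -y 'openstack-*' || return 1
    # Run every DB migration for this release before moving to the next
    nova-manage db sync || return 1
    glance-manage db_sync || return 1
    neutron-db-manage upgrade heads || return 1
}

# for release in kilo liberty mitaka; do
#     upgrade_one_release "$release" || break
# done
```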
> Also, the tooling is not very general purpose or portable outside of
> OSA but it could serve as a guide or just a general talking point.
> Are there other tools out there that solve for the multi-release upgrade?
Not that I know of. AFAIR, the BlueBox guys (now IBM) had some
Ansible-based tooling for automating single-version upgrades; I don't
know if they ever considered skip-level upgrades.
In general, our advice boils down to:
- automate as much as possible
- use a configuration management tool to deploy the final configuration
to all nodes (Puppet, Ansible, Chef...)
- have a testing environment that resembles the production environment
as closely as possible
- simulate all migrations on a snapshot of all production databases to
catch any issues early
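The last point can be sketched as below. The function name, dump file
and the nova example are mine for illustration; the point is simply to
run the real migration against a throwaway copy of production data.

```shell
#!/bin/sh
# Sketch of rehearsing a migration against a snapshot of a production DB.
# Function name, dump file and database names are illustrative.
rehearse_migration() {
    dump="$1"   # e.g. nova.prod-snapshot.sql
    db="$2"     # e.g. nova
    # Load the production snapshot into a throwaway database on the testbed
    mysql -e "DROP DATABASE IF EXISTS $db; CREATE DATABASE $db;" || return 1
    mysql "$db" < "$dump" || return 1
    # Run the real migration and fail loudly so issues surface before prod
    nova-manage db sync || { echo "migration failed on snapshot of $db" >&2; return 1; }
}
```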
> Do folks believe tools are the right way to solve this or would
> comprehensive upgrade documentation be better for the general community?
Both, actually. A generic upgrade tool would need to cover a lot of
deployment scenarios, so it would probably end up being a "reference
implementation".
Comprehensive skip-level upgrade documentation would be optimal (in our
case we had to rebuild Kilo and Liberty docs from sources).
> As most of the upgrade issues center around database migrations, we
> discussed some of the potential pitfalls at length. One approach was to
> roll up all DB migrations into a single repository and run all upgrades
> for a given project in one step. Another was to simply have multiple
> Python virtual environments and just run in-line migrations from a
> version-specific venv (this is what the OSA tooling does). Does one way
> work better than the other? Any thoughts on how this could be better?
> Would having N+2/3 migrations addressable within the projects, even if
> they're not tested any longer, be helpful?
Some projects apparently keep shipping all migrations, even though
they're not supported.
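For illustration, the version-specific-venv approach could look roughly
like the sketch below. The version numbers and paths are my own
examples (roughly Kilo, Liberty, Mitaka for nova); the idea is that each
venv's nova-manage carries exactly that release's migration scripts.

```shell
#!/bin/sh
# Sketch of running migrations from one virtualenv per release.
# Versions and paths are illustrative.
migrate_through_releases() {
    for version in 2015.1.4 12.0.6 13.1.4; do
        venv="/opt/venvs/nova-${version}"
        python -m virtualenv "$venv" || return 1
        "$venv/bin/pip" install "nova==${version}" || return 1
        # This nova-manage only knows this release's migrations
        "$venv/bin/nova-manage" --config-file /etc/nova/nova.conf db sync || return 1
    done
}
```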
> It was our general thought that folks would be interested in having the
> ability to skip releases so we'd like to hear from the community to
> validate our thinking.
That's good to know :-)
Via Ranzani 13/2 c - 40127 Bologna, Italy
Phone: +39 051 609 2903
OpenStack-operators mailing list