We have been working on deploying a highly-available Barbican at
Rackspace for a while now. We just recently made it publicly available
through an early access program.
We don't have a full deployment of Barbican yet. Our early access
deployment does not include a Rabbit queue or barbican-worker processes,
for example. This means that we don't yet have the ability to process
/orders requests, but we do support secret storage backed by SafeNet Luna
SA HSMs via the PKCS#11 Cryptographic Plugin.
Our goal is to be able to provide 99.95% availability with a minimum
throughput of 100 req/sec once we move to Unlimited Availability later
this year, but we still have some work to get us there.
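For a sense of scale, the arithmetic behind those two targets can be
sketched in a few lines of Python. The function name and the choice of
a 30-day month are illustrative assumptions; only the 99.95% and
100 req/sec figures come from the goal stated above.

```python
def downtime_budget_minutes(availability_pct: float, period_hours: float) -> float:
    """Minutes of allowed downtime in a period at a given availability."""
    return period_hours * 60 * (1 - availability_pct / 100)


# 99.95% availability over a 30-day month and over a 365-day year.
per_month_minutes = downtime_budget_minutes(99.95, 30 * 24)
per_year_hours = downtime_budget_minutes(99.95, 365 * 24) / 60

# A sustained 100 req/sec expressed as requests per day.
requests_per_day = 100 * 24 * 60 * 60

print(f"downtime budget per 30-day month: {per_month_minutes:.1f} minutes")
print(f"downtime budget per year: {per_year_hours:.2f} hours")
print(f"requests per day at 100 req/sec: {requests_per_day}")
```

In other words, the budget is roughly 21.6 minutes of downtime per
month, or about 4.4 hours per year.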
To give you a better idea of what our deployment looks like, here's what
we have in production today:
On the front end we have two sets of VM pairs running haproxy and
keepalived using a shared IP address per set. The two sets
represent the blue and green node sets for blue-green zero-downtime
deployments.  Our DNS entry is pointed to the shared IP of the green
lb pair. The blue lb set is only accessible from our control plane, and
is used for functional testing of code before being promoted to green.
At any given time only one VM in each lb set is working and the other is
a hot standby that keepalived can instantly promote if needed while
keeping the same IP address. This gives us the ability to fail over
haproxy faster than DNS can propagate.
Requests are then load-balanced to at least two "API Nodes". These are
VMs set up as docker hosts. We are running Repose, the barbican-api
process, and plight, each inside its own container. Repose is used
for rate-limiting, token-validation, and access control. Plight is used
to designate an API node as either blue or green. Each haproxy set is
configured to route only to the api nodes that match its color; however,
both sets constantly query all nodes for blue/green status (more on this later).
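The color-matching rule can be sketched roughly as follows. This is
only an illustration: the ApiNode class, the backends_for function and
the node names are hypothetical, and the sketch does not show how
haproxy actually reads the status that plight reports.

```python
from dataclasses import dataclass


@dataclass
class ApiNode:
    name: str
    color: str  # "blue" or "green", as the node itself reports it


def backends_for(lb_color: str, nodes: list) -> list:
    """Names of the nodes a load balancer of the given color routes to."""
    return [node.name for node in nodes if node.color == lb_color]


# Hypothetical fleet: every load balancer polls all four nodes.
nodes = [
    ApiNode("api-1", "green"),
    ApiNode("api-2", "green"),
    ApiNode("api-3", "blue"),
    ApiNode("api-4", "blue"),
]

green_before = backends_for("green", nodes)
blue_before = backends_for("blue", nodes)

# Because every node is polled, a node that changes color moves
# between pools without the node list itself changing.
nodes[2].color = "green"
green_after = backends_for("green", nodes)
blue_after = backends_for("blue", nodes)
```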
For data storage we are running a MariaDB Galera Cluster in
multi-master mode with 3x VM nodes. The cluster sits behind yet another
haproxy+keepalived pair, so that our db connections are load-balanced to
all three masters. This was mainly driven by our decision to host our
control plane in our public cloud, since the multi-master setup gives us
better fault tolerance in the likely event of losing one of the DB
nodes. Previous to this cloud-based deployment we were using PostgreSQL
in a master+slave configuration, but we didn't have a good solution for
fully automatic failovers.
Choosing the right Cryptographic Plugin/Backend is probably going to be
the hardest part about planning for a highly-available deployment. For
our deployments we are using pairs of Luna SA HSMs in HA mode.  This
is currently our bottleneck, and for Newton we plan to focus most of our
development effort on improving the performance of the PKCS#11 Plugin.
Originally we wanted to store one key per project in the HSM itself.
However, we found out early on that the amount of storage in the Lunas is
very limited, and completely inadequate for the scale we want to operate
at. This led to the development of the pkek-wrapping model that the
PKCS#11 plugin is currently using. This came with the cost of having to
make more hops to the HSM for a single transaction.
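The storage-versus-hops trade-off can be illustrated with a toy model
of key wrapping. Everything here is an assumption made for
illustration: ToyHSM, toy_xor and the dict standing in for the database
are hypothetical, the "cipher" is a SHA-256 keystream that is NOT
secure, and the real plugin's PKCS#11 calls and hop counts may differ.
The point is only that the HSM holds a single master key, the wrapped
per-project keys live outside it, and every operation pays for an
extra unwrap.

```python
import hashlib
import os


def toy_xor(key: bytes, data: bytes) -> bytes:
    """Toy SHA-256 keystream XOR. NOT secure; stands in for HSM crypto."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        block = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        stream.extend(block)
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


class ToyHSM:
    """Hypothetical HSM stand-in: one master key, counts round trips."""

    def __init__(self) -> None:
        self._master_key = os.urandom(32)  # never leaves the "HSM"
        self.round_trips = 0

    def new_wrapped_key(self) -> bytes:
        """Generate a project key and return it wrapped by the master key."""
        self.round_trips += 1
        return toy_xor(self._master_key, os.urandom(32))

    def _unwrap(self, wrapped_key: bytes) -> bytes:
        # The extra hop that a key stored in the HSM would not need.
        self.round_trips += 1
        return toy_xor(self._master_key, wrapped_key)

    def encrypt(self, wrapped_key: bytes, plaintext: bytes) -> bytes:
        project_key = self._unwrap(wrapped_key)
        self.round_trips += 1
        return toy_xor(project_key, plaintext)

    def decrypt(self, wrapped_key: bytes, ciphertext: bytes) -> bytes:
        project_key = self._unwrap(wrapped_key)
        self.round_trips += 1
        return toy_xor(project_key, ciphertext)


database = {}  # stands in for the database holding wrapped project keys
hsm = ToyHSM()
database["project-a"] = hsm.new_wrapped_key()

ciphertext = hsm.encrypt(database["project-a"], b"my secret")
recovered = hsm.decrypt(database["project-a"], ciphertext)
```

In this toy model the HSM stores one key no matter how many projects
exist, at the price of two round trips per encrypt or decrypt instead
of one.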
The KMIP Plugin does not use the pkek-wrapping model, and as such is
limited by the amount of storage available in the KMIP device that is
used. Note that when deploying Barbican with the KMIP Plugin, the
database capacity is not relevant.
I'm not super familiar with DogTag, so I can't speak to the limitations
of choosing the DogTag Plugin.
Lastly, since our Lunas are racked in a dedicated environment, we have
physical firewalls (F5s) in front of them. The barbican-api containers
in the api nodes connect to the HSMs over a VPN tunnel from our public
cloud environment to the dedicated environment.
We have two identical environments right now (staging and production),
and we will be adding more production environments in other data centers
later this year.
We deploy new code often, and production usually runs only a week or two
behind the barbican master branch.
For zero-downtime deployments, we've asked our community to stagger
database schema changes across separate commits. The idea is that the
schema change should be introduced first in a separate commit. This
ensures that the current codebase can continue to operate with the new
schema. The actual code changes are made in a follow-up patch.
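A small sketch of why the split helps, using the sqlite3 module from
the Python standard library purely so the example is self-contained.
The table, the column and the function names are hypothetical and are
not taken from the Barbican schema or its migration tooling. The
schema-only change is additive, so code written against the old schema
keeps working after it is applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE secrets (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)


def old_code_insert(db: sqlite3.Connection, name: str) -> None:
    """Currently deployed code: only knows about the 'name' column."""
    db.execute("INSERT INTO secrets (name) VALUES (?)", (name,))


def old_code_list(db: sqlite3.Connection) -> list:
    rows = db.execute("SELECT name FROM secrets ORDER BY id")
    return [row[0] for row in rows]


old_code_insert(conn, "before-migration")

# First commit: schema change only (an additive, nullable column).
conn.execute("ALTER TABLE secrets ADD COLUMN expiration TEXT")

# The old code still runs unchanged against the new schema.
old_code_insert(conn, "after-migration")
names = old_code_list(conn)


# Follow-up commit: new code that actually uses the new column.
def new_code_insert(db: sqlite3.Connection, name: str, expiration: str) -> None:
    db.execute(
        "INSERT INTO secrets (name, expiration) VALUES (?, ?)",
        (name, expiration),
    )


new_code_insert(conn, "with-expiration", "2016-12-31")
row_count = conn.execute("SELECT COUNT(*) FROM secrets").fetchone()[0]
```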
When we prepare to deploy, we first update the database schema. This is
the only potentially disruptive operation we currently have. In theory
the existing api nodes continue to function with the new schema. We
then build up new blue API nodes with the new code to be rolled out.
All the new nodes are accessible through our blue lb, and this is where
we run our test suite to make sure everything is still good. If the
tests all pass the new blue set is promoted to green, and the
previously-green set is slowly demoted to blue.
We keep the now-blue nodes around in case something breaks and we need
to quickly roll back to the previous version. If all goes well in
staging we do it all over again in prod. The whole thing is driven
through Jenkins using ansible for configuration management. It's not
fully automated in the sense that someone still has to push the button
in Jenkins to get things going, but once we mature our pipeline a bit
more we plan to set it on cruise control.
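The order of operations can be summarized in a short sketch. This is
not our Jenkins or ansible code; the deploy and rollback functions, the
colors dict and the node names are all hypothetical, and the sketch
swaps colors in one step where the real demotion is gradual.

```python
def deploy(colors: dict, new_nodes: list, migrate_schema, tests_pass) -> bool:
    """Schema first, then new blue nodes, then promote if tests pass."""
    migrate_schema()
    for node in new_nodes:
        colors[node] = "blue"  # reachable only through the blue lb
    if not tests_pass(new_nodes):
        for node in new_nodes:
            del colors[node]  # green was never touched
        return False
    old_green = [n for n, c in colors.items() if c == "green"]
    for node in new_nodes:
        colors[node] = "green"
    for node in old_green:
        colors[node] = "blue"  # kept around for a quick rollback
    return True


def rollback(colors: dict) -> None:
    """Swap blue and green back to the previous arrangement."""
    for node, color in list(colors.items()):
        colors[node] = "blue" if color == "green" else "green"


steps = []
colors = {"api-1": "green", "api-2": "green"}
promoted = deploy(
    colors,
    ["api-3", "api-4"],
    migrate_schema=lambda: steps.append("schema updated"),
    tests_pass=lambda nodes: True,
)
after_deploy = dict(colors)

rollback(colors)
after_rollback = dict(colors)
```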
Next steps for us after we sort out PKCS#11 performance will be to
deploy an HA RabbitMQ, and N api-workers. I don't think we'll be
setting up the keystone-listeners any time soon.
I hope that gives you a good starting point for planning your
HA-Barbican deployment. Let me know if you have any more questions.
On 3/21/16 1:23 PM, Daneyon Hansen (danehans) wrote:
Does anyone have experience deploying Barbican in a highly-available
fashion? If so, I'm interested in learning from your experience. Any
insight you can provide is greatly appreciated.
OpenStack Development Mailing List (not for usage questions)