
[Openstack-operators] GPU passthrough success and failure records


Hi all,

I've been (very slowly) working on some docs detailing how to set up an
OpenStack Nova Libvirt+QEMU-KVM deployment to provide GPU-accelerated
instances. In Boston I hope to chat to some of the docs team and
figure out an appropriate upstream guide to fit that into. One of the
things I'd like to provide is a community record (better than ML
archives) of what works and doesn't. I've started a first attempt at
collating some basics here:
https://etherpad.openstack.org/p/GPU-passthrough-model-success-failure
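
For anyone after a quick starting point before those docs exist, the Nova
side boils down to whitelisting and aliasing the device, then exposing the
alias via a flavor. The snippet below is only a sketch against Ocata-era
option names; the PCI IDs are for a Tesla K80 (10de:102d) and the alias and
flavor names are made up for the example, so substitute whatever "lspci -nn"
reports on your compute nodes:

# /etc/nova/nova.conf on the compute nodes (the alias is also needed on
# the API/scheduler hosts). Example only - IDs below are a Tesla K80.
[pci]
passthrough_whitelist = { "vendor_id": "10de", "product_id": "102d" }
alias = { "vendor_id": "10de", "product_id": "102d", "device_type": "type-PCI", "name": "k80" }

# Make sure PciPassthroughFilter is in the scheduler's enabled filters,
# then hang the alias off a flavor:
$ openstack flavor set gpu.k80 --property "pci_passthrough:alias"="k80:1"

# Host prerequisites, roughly: VT-d/IOMMU enabled in firmware and
# intel_iommu=on on the kernel command line, with the GPU bound to
# vfio-pci (or pci-stub) rather than the nvidia/nouveau driver.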

I know there are at least a few lurkers out there doing this too, so
please share your own experience. Once there is a bit more data there,
it probably makes sense to convert it to a tabular format of some kind
(though it wasn't immediately obvious to me how that should look, given
there are several long list fields).

--
Cheers,
~Blairo


OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
asked Aug 10, 2017 in openstack-operators by Blair_Bethwaite

1 Response


Hi folks,

Related to this, I wonder if anyone has ever seen something like a PCI
bus error on a GPU node? We have a fleet of Dell R730s with dual K80s,
and we periodically see a host reset with the hardware log recording a
message like:
"A fatal error was detected on a component at bus 4 device 8 function 0."

Which in this case refers to:
$ lspci -t -d 10b5:8747
-+-[0000:82]---00.0-[83-85]--+-08.0-[84]--
 |                           \-10.0-[85]--
 +-[0000:03]---00.0-[04-06]--+-08.0-[05]--
 |                           \-10.0-[06]--

That is one of the downstream(?) PCIe endpoint-facing ports, i.e., the GPU
side of the PCIe switch.
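
In case it helps anyone else decode one of these, the rough mapping from
the hardware-log address back to the devices is something like the below
(bus/slot numbers are obviously specific to our hosts):

$ lspci -vv -s 04:08.0    # "bus 4 device 8 function 0": the PLX downstream port
$ lspci -nnk -s 05:00.0   # should be the GPU sitting behind that port
$ sudo lspci -vvv -s 04:08.0 | grep -A4 'Advanced Error Reporting'
                          # the port's AER capability, if it is exposed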

This error causes the host to unceremoniously reset. No error is to be
found anywhere host-side, just in the hardware log. These are currently
Ubuntu Trusty hosts with a 4.4 kernel. GPU burn testing does not seem to
trigger it and the host can go back into production and never (so far)
see the issue again. But we've now seen this about 10 times over the
last 12-18 months across a fleet of ~30 of these hosts (sometimes
twice on the same host months apart, but several distinct hosts
overall).
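
For anyone wanting to compare notes, the host-side places we'd expect
something like this to surface (empty in our case so far) are along these
lines:

$ dmesg | grep -iE 'aer|pcieport|machine check'
$ grep -iE 'aer|pcieport|machine check' /var/log/kern.log*   # persists across the reset, unlike dmesg
$ ipmitool sel elist | tail    # the iDRAC/SEL hardware log, which is where the message does show up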

Cheers,
~Blairo


responded Aug 10, 2017 by Blair_Bethwaite