
[openstack-dev] vGPUs support for Nova - Implementation


Please consider the support of MDEV for the /pci framework, which
provides support for vGPUs [0].

According to the discussion in [1]:

With this first implementation, which could be used as a skeleton for
implementing PCI devices in the Resource Tracker, we provide support
for attaching vGPUs to guests, as well as affinity per NUMA node.
Another important point is that this implementation can take advantage
of ongoing specs like the PCI NUMA policies.

  • The Implementation [0]

[PATCH 01/13] pci: update PciDevice object field 'address' to accept
[PATCH 02/13] pci: add for PciDevice object new field mdev
[PATCH 03/13] pci: generalize object unit-tests for different
[PATCH 04/13] pci: add support for mdev device type request
[PATCH 05/13] pci: generalize stats unit-tests for different
[PATCH 06/13] pci: add support for mdev devices type devspec
[PATCH 07/13] pci: add support for resource pool stats of mdev
[PATCH 08/13] pci: make manager to accept handling mdev devices

In this series of patches we generalize the PCI framework to handle
MDEV devices. We acknowledge it's a lot of patches, but most of them
are small and the logic behind them is basically to make the framework
understand two new fields, MDEV_PF and MDEV_VF.

[PATCH 09/13] libvirt: update PCI node device to report mdev devices
[PATCH 10/13] libvirt: report mdev resources
[PATCH 11/13] libvirt: add support to start vm with using mdev (vGPU)

In this series of patches we make the libvirt driver, as usual, report
the resources and attach the devices returned by the PCI manager. This
part can be reused for resource providers.

[PATCH 12/13] functional: rework fakelibvirt host pci devices
[PATCH 13/13] libvirt: resuse SRIOV funtional tests for MDEV devices

Here we reuse 100% of the functional tests used for SR-IOV devices.
Again, this part can be reused for resource providers.

  • The Usage

There is no difference between SR-IOV and MDEV from the operator's
point of view: operators who know how to expose SR-IOV devices in Nova
already know how to expose MDEV devices (vGPUs).

Operators will be able to expose MDEV devices in the same manner as
they expose SR-IOV:

1/ Configure whitelist devices

['{"vendor_id":"10de"}']

2/ Create aliases

[{"vendor_id":"10de", "name":"vGPU"}]

3/ Configure the flavor

openstack flavor set --property "pci_passthrough:alias"="vGPU:1"
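Putting those three steps together, a minimal sketch of what this might
look like (illustrative only; I'm assuming the [pci] passthrough_whitelist
and alias options, "10de" is just an example vendor ID, and the flavor
name is made up):

# nova.conf on the compute node
[pci]
passthrough_whitelist = {"vendor_id":"10de"}
alias = {"vendor_id":"10de", "name":"vGPU"}

# then, on the API side
$ openstack flavor create --ram 4096 --disk 20 --vcpus 2 vgpu-flavor
$ openstack flavor set --property "pci_passthrough:alias"="vGPU:1" vgpu-flavor
$ openstack server create --flavor vgpu-flavor --image <image> vgpu-test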

  • Limitations

MDEV devices do not provide a 'product_id' but an 'mdev_type', which
should be used to identify exactly which resource users can request,
e.g. nvidia-10. To provide that support we have to add a new field
'mdev_type', so aliases could look something like:

{"vendor_id":"10de", "mdev_type":"nvidia-10", "name":"alias-nvidia-10"}
{"vendor_id":"10de", "mdev_type":"nvidia-11", "name":"alias-nvidia-11"}

I do have a plan to add this, but first I need support from upstream
to continue that work.
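With such type-specific aliases, an operator could then tie a flavor to
a particular vGPU type, for example (illustrative only; 'mdev_type' is
the proposed new field and 'vgpu-small' is a made-up flavor name):

$ openstack flavor set --property "pci_passthrough:alias"="alias-nvidia-10:1" vgpu-small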

[0] https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:pci-mdev-support
[1] http://lists.openstack.org/pipermail/openstack-dev/2017-September/122591.html


asked Oct 2, 2017 in openstack-dev by Sahid_Orentino_Ferdj

14 Responses


On 09/28/2017 11:37 AM, Sahid Orentino Ferdjaoui wrote:
Please consider the support of MDEV for the /pci framework, which
provides support for vGPUs [0].

According to the discussion in [1]:

With this first implementation, which could be used as a skeleton for
implementing PCI devices in the Resource Tracker,

I'm not entirely sure what you're referring to above as "implementing
PCI devices in Resource Tracker". Could you elaborate? The resource
tracker already embeds a PciManager object that manages PCI devices, as
you know. Perhaps you meant "implement PCI devices as Resource Providers"?

we provide support for attaching vGPUs to guests, as well as affinity
per NUMA node. Another important point is that this implementation can
take advantage of ongoing specs like the PCI NUMA policies.

  • The Implementation [0]

[PATCH 01/13] pci: update PciDevice object field 'address' to accept
[PATCH 02/13] pci: add for PciDevice object new field mdev
[PATCH 03/13] pci: generalize object unit-tests for different
[PATCH 04/13] pci: add support for mdev device type request
[PATCH 05/13] pci: generalize stats unit-tests for different
[PATCH 06/13] pci: add support for mdev devices type devspec
[PATCH 07/13] pci: add support for resource pool stats of mdev
[PATCH 08/13] pci: make manager to accept handling mdev devices

In this series of patches we generalize the PCI framework to handle
MDEV devices. We acknowledge it's a lot of patches, but most of them
are small and the logic behind them is basically to make the framework
understand two new fields, MDEV_PF and MDEV_VF.

That's not really "generalizing the PCI framework to handle MDEV
devices" :) More like it's just changing the /pci module to understand a
different device management API, but ok.

[PATCH 09/13] libvirt: update PCI node device to report mdev devices
[PATCH 10/13] libvirt: report mdev resources
[PATCH 11/13] libvirt: add support to start vm with using mdev (vGPU)

In this series of patches we make the libvirt driver, as usual, report
the resources and attach the devices returned by the PCI manager. This
part can be reused for resource providers.

Perhaps, but the idea behind the resource providers framework is to
treat devices as generic things. Placement doesn't need to know about
the particular device attachment status.

[PATCH 12/13] functional: rework fakelibvirt host pci devices
[PATCH 13/13] libvirt: resuse SRIOV funtional tests for MDEV devices

Here we reuse 100% of the functional tests used for SR-IOV devices.
Again, this part can be reused for resource providers.

Probably not, but I'll take a look :)

For the record, I have zero confidence in any existing "functional"
tests for NUMA, SR-IOV, CPU pinning, huge pages, and the like.
Unfortunately, these features often require hardware that either the
upstream community CI lacks or that depends on libraries, drivers and
kernel versions that really aren't available to non-bleeding-edge
users (or users with very deep pockets).

  • The Usage

There is no difference between SR-IOV and MDEV from the operator's
point of view: operators who know how to expose SR-IOV devices in Nova
already know how to expose MDEV devices (vGPUs).

Operators will be able to expose MDEV devices in the same manner as
they expose SR-IOV:

1/ Configure whitelist devices

['{"vendor_id":"10de"}']

2/ Create aliases

[{"vendor_id":"10de", "name":"vGPU"}]

3/ Configure the flavor

openstack flavor set --property "pci_passthrough:alias"="vGPU:1"

  • Limitations

MDEV devices do not provide a 'product_id' but an 'mdev_type', which
should be used to identify exactly which resource users can request,
e.g. nvidia-10. To provide that support we have to add a new field
'mdev_type', so aliases could look something like:

{"vendor_id":"10de", "mdev_type":"nvidia-10", "name":"alias-nvidia-10"}
{"vendor_id":"10de", "mdev_type":"nvidia-11", "name":"alias-nvidia-11"}

I do have a plan to add this, but first I need support from upstream
to continue that work.

As mentioned in IRC and the previous ML discussion, my focus is on the
nested resource providers work and reviews, along with the other two
top-priority scheduler items (move operations and alternate hosts).

I'll do my best to look at your patch series, but please note it's lower
priority than a number of other items.

One thing that would be very useful, Sahid, if you could get with Eric
Fried (efried) on IRC and discuss with him the "generic device
management" system that was discussed at the PTG. It's likely that the
/pci module is going to be overhauled in Rocky and it would be good to
have the mdev device management API requirements included in that
discussion.

Best,
-jay

[0] https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:pci-mdev-support
[1] http://lists.openstack.org/pipermail/openstack-dev/2017-September/122591.html


responded Sep 28, 2017 by Jay_Pipes

In this series of patches we generalize the PCI framework to handle
MDEV devices. We acknowledge it's a lot of patches, but most of them
are small and the logic behind them is basically to make the framework
understand two new fields, MDEV_PF and MDEV_VF.

That's not really "generalizing the PCI framework to handle MDEV
devices" :) More like it's just changing the /pci module to understand a
different device management API, but ok.

Yeah, the series is adding more fields to our PCI structure to allow for
more variations in the kinds of things we lump into those tables. This
is my primary complaint with this approach, and has been since the topic
first came up. I really want to avoid building any more dependency on
the existing pci-passthrough mechanisms and focus any new effort on
using resource providers for this. The existing pci-passthrough code is
almost universally hated, poorly understood and tested, and something we
should not be further building upon.

In this series of patches we make the libvirt driver, as usual, report
the resources and attach the devices returned by the PCI manager. This
part can be reused for resource providers.

Perhaps, but the idea behind the resource providers framework is to
treat devices as generic things. Placement doesn't need to know about
the particular device attachment status.

I quickly went through the patches and left a few comments. The base
work of pulling some of this out of libvirt is there, but it's all
focused on the act of populating pci structures from the vgpu
information we get from libvirt. That code could be made to instead
populate a resource inventory, but that's about the extent of the set
that looks applicable to the placement-based approach.

As mentioned in IRC and the previous ML discussion, my focus is on the
nested resource providers work and reviews, along with the other two
top-priority scheduler items (move operations and alternate hosts).

I'll do my best to look at your patch series, but please note it's lower
priority than a number of other items.

FWIW, I'm not really planning to spend any time reviewing it
until/unless it is retooled to generate an inventory from the virt driver.

With the two patches that report vgpus and then create guests with them
when asked converted to resource providers, I think that would be enough
to have basic vgpu support immediately. No DB migrations, model changes,
etc required. After that, helping to get the nested-rps and traits work
landed gets us the ability to expose attributes of different types of
those vgpus and opens up a lot of possibilities. IMHO, that's work I'm
interested in reviewing.
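For what it's worth, a rough sketch (Python, purely illustrative -- the
'VGPU' resource class name and the exact virt-driver hook were still
under discussion at this point, so none of this is actual Nova code) of
what the driver-side reporting could look like:

# Hypothetical sketch: build a placement inventory for the vGPUs a host
# can offer. The inventory keys (total, min_unit, ...) follow the
# placement inventory format; the helper and class name are made up.

VGPU_RESOURCE_CLASS = 'VGPU'  # assumed resource class name

def vgpu_inventory(total_vgpus):
    """Return an inventory dict for the compute node's vGPU capacity."""
    if total_vgpus <= 0:
        return {}
    return {
        VGPU_RESOURCE_CLASS: {
            'total': total_vgpus,
            'min_unit': 1,
            'max_unit': total_vgpus,
            'step_size': 1,
            'allocation_ratio': 1.0,
            'reserved': 0,
        },
    }

# e.g. the virt driver's inventory-reporting hook could return
# vgpu_inventory(total_vgpus=16)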

One thing that would be very useful, Sahid, if you could get with Eric
Fried (efried) on IRC and discuss with him the "generic device
management" system that was discussed at the PTG. It's likely that the
/pci module is going to be overhauled in Rocky and it would be good to
have the mdev device management API requirements included in that
discussion.

Definitely this.

--Dan


responded Sep 29, 2017 by Dan_Smith

On Thu, Sep 28, 2017 at 05:06:16PM -0400, Jay Pipes wrote:
On 09/28/2017 11:37 AM, Sahid Orentino Ferdjaoui wrote:

Please consider the support of MDEV for the /pci framework, which
provides support for vGPUs [0].

According to the discussion in [1]:

With this first implementation, which could be used as a skeleton for
implementing PCI devices in the Resource Tracker,

I'm not entirely sure what you're referring to above as "implementing PCI
devices in Resource Tracker". Could you elaborate? The resource tracker
already embeds a PciManager object that manages PCI devices, as you know.
Perhaps you meant "implement PCI devices as Resource Providers"?

A PciManager? I know that we have a PCI_DEVICE field :) - I guess a
virt driver can return an inventory with the total number of PCI
devices. As for the manager, I'm not sure.

You still have to define "traits". Basically, for physical network
devices, users want to select a device according to its physical
network, its placement on the host (NUMA), its bandwidth capability...
For GPUs it's the same story. And I have not even mentioned devices
which support virtual functions.

So that is what you plan to do for this release :) - Realistically, I
don't think we are close to having something ready for production.

Jay, I have a question: why don't you start by exposing NUMA?

we provide support for attaching vGPUs to guests, as well as affinity
per NUMA node. Another important point is that this implementation can
take advantage of ongoing specs like the PCI NUMA policies.

  • The Implementation [0]

[PATCH 01/13] pci: update PciDevice object field 'address' to accept
[PATCH 02/13] pci: add for PciDevice object new field mdev
[PATCH 03/13] pci: generalize object unit-tests for different
[PATCH 04/13] pci: add support for mdev device type request
[PATCH 05/13] pci: generalize stats unit-tests for different
[PATCH 06/13] pci: add support for mdev devices type devspec
[PATCH 07/13] pci: add support for resource pool stats of mdev
[PATCH 08/13] pci: make manager to accept handling mdev devices

In this series of patches we generalize the PCI framework to handle
MDEV devices. We acknowledge it's a lot of patches, but most of them
are small and the logic behind them is basically to make the framework
understand two new fields, MDEV_PF and MDEV_VF.

That's not really "generalizing the PCI framework to handle MDEV devices" :)
More like it's just changing the /pci module to understand a different
device management API, but ok.

If you prefer to call it like that :) - The point is that /pci manages
physical devices; it can pass through the whole device or its virtual
functions exposed through SR-IOV or MDEV.

[PATCH 09/13] libvirt: update PCI node device to report mdev devices
[PATCH 10/13] libvirt: report mdev resources
[PATCH 11/13] libvirt: add support to start vm with using mdev (vGPU)

In this series of patches we make the libvirt driver, as usual, report
the resources and attach the devices returned by the PCI manager. This
part can be reused for resource providers.

Perhaps, but the idea behind the resource providers framework is to treat
devices as generic things. Placement doesn't need to know about the
particular device attachment status.

[PATCH 12/13] functional: rework fakelibvirt host pci devices
[PATCH 13/13] libvirt: resuse SRIOV funtional tests for MDEV devices

Here we reuse 100% of the functional tests used for SR-IOV devices.
Again, this part can be reused for resource providers.

Probably not, but I'll take a look :)

For the record, I have zero confidence in any existing "functional" tests
for NUMA, SR-IOV, CPU pinning, huge pages, and the like.
Unfortunately, these features often require hardware that either the
upstream community CI lacks or that depends on libraries, drivers and
kernel versions that really aren't available to non-bleeding-edge
users (or users with very deep pockets).

That's a good point. If you are not confident in them, don't you think
it's premature to move forward on implementing new things without
well-trusted functional tests?

  • The Usage

There is no difference between SR-IOV and MDEV from the operator's
point of view: operators who know how to expose SR-IOV devices in Nova
already know how to expose MDEV devices (vGPUs).

Operators will be able to expose MDEV devices in the same manner as
they expose SR-IOV:

1/ Configure whitelist devices

['{"vendor_id":"10de"}']

2/ Create aliases

[{"vendor_id":"10de", "name":"vGPU"}]

3/ Configure the flavor

openstack flavor set --property "pci_passthrough:alias"="vGPU:1"

  • Limitations

MDEV devices do not provide a 'product_id' but an 'mdev_type', which
should be used to identify exactly which resource users can request,
e.g. nvidia-10. To provide that support we have to add a new field
'mdev_type', so aliases could look something like:

{"vendor_id":"10de", "mdev_type":"nvidia-10", "name":"alias-nvidia-10"}
{"vendor_id":"10de", "mdev_type":"nvidia-11", "name":"alias-nvidia-11"}

I do have a plan to add this, but first I need support from upstream
to continue that work.

As mentioned in IRC and the previous ML discussion, my focus is on the
nested resource providers work and reviews, along with the other two
top-priority scheduler items (move operations and alternate hosts).

I'll do my best to look at your patch series, but please note it's lower
priority than a number of other items.

No worries, the code is here, tested, fully functional and
production-ready; I made an effort to make it available at the very
beginning of the release. With some goodwill we could fix any bugs and
have support for vGPUs in Queens.

One thing that would be very useful, Sahid, if you could get with Eric Fried
(efried) on IRC and discuss with him the "generic device management" system
that was discussed at the PTG. It's likely that the /pci module is going to
be overhauled in Rocky and it would be good to have the mdev device
management API requirements included in that discussion.

Best,
-jay


responded Sep 29, 2017 by Sahid_Orentino_Ferdj

On Fri, Sep 29, 2017 at 2:32 AM, Dan Smith dms@danplanet.com wrote:

In this series of patches we generalize the PCI framework to handle
MDEV devices. We acknowledge it's a lot of patches, but most of them
are small and the logic behind them is basically to make the framework
understand two new fields, MDEV_PF and MDEV_VF.

That's not really "generalizing the PCI framework to handle MDEV devices"
:) More like it's just changing the /pci module to understand a different
device management API, but ok.

Yeah, the series is adding more fields to our PCI structure to allow for
more variations in the kinds of things we lump into those tables. This is
my primary complaint with this approach, and has been since the topic first
came up. I really want to avoid building any more dependency on the
existing pci-passthrough mechanisms and focus any new effort on using
resource providers for this. The existing pci-passthrough code is almost
universally hated, poorly understood and tested, and something we should
not be further building upon.

In this series of patches we make the libvirt driver, as usual, report
the resources and attach the devices returned by the PCI manager. This
part can be reused for resource providers.

Perhaps, but the idea behind the resource providers framework is to treat
devices as generic things. Placement doesn't need to know about the
particular device attachment status.

I quickly went through the patches and left a few comments. The base work
of pulling some of this out of libvirt is there, but it's all focused on
the act of populating pci structures from the vgpu information we get from
libvirt. That code could be made to instead populate a resource inventory,
but that's about the extent of the set that looks applicable to the
placement-based approach.

I'll review them too.

As mentioned in IRC and the previous ML discussion, my focus is on the
nested resource providers work and reviews, along with the other two
top-priority scheduler items (move operations and alternate hosts).

I'll do my best to look at your patch series, but please note it's lower
priority than a number of other items.

FWIW, I'm not really planning to spend any time reviewing it until/unless
it is retooled to generate an inventory from the virt driver.

With the two patches that report vgpus and then create guests with them
when asked converted to resource providers, I think that would be enough to
have basic vgpu support immediately. No DB migrations, model changes, etc
required. After that, helping to get the nested-rps and traits work landed
gets us the ability to expose attributes of different types of those vgpus
and opens up a lot of possibilities. IMHO, that's work I'm interested in
reviewing.

That's exactly what I would like to provide for Queens, so operators
would have the possibility of flavors asking for vGPU resources in
Queens, even if they couldn't yet ask for a specific vGPU type (or ask
to be in the same NUMA cell as the CPU). The latter definitely needs
nested resource providers, but the former (just having vGPU resource
classes provided by the virt driver) is possible for Queens.

One thing that would be very useful, Sahid, if you could get with Eric
Fried (efried) on IRC and discuss with him the "generic device
management" system that was discussed at the PTG. It's likely that the
/pci module is going to be overhauled in Rocky and it would be good to
have the mdev device management API requirements included in that
discussion.

Definitely this.

++

--Dan


responded Sep 29, 2017 by Sylvain_Bauza

Hi Sahid,

Please consider the support of MDEV for the /pci framework which provides support for vGPUs [0].

From my understanding, this MDEV implementation for vGPU would be entirely specific to libvirt, is that correct?

XenServer's implementation for vGPU is based on a pooled device model (as described in http://lists.openstack.org/pipermail/openstack-dev/2017-September/122702.html) and directly interfaces with the card using DEMU ("Discrete EMU") as a second device emulator alongside QEMU. There is no mdev integration. I'm concerned about how much mdev-specific functionality would have to be faked up in the XenServer-specific driver for vGPU to be used in this way.

I'm not familiar with mdev, but it looks Linux-specific, so it would not be usable by Hyper-V?
I've also not been able to find suggestions that VMWare can make use of mdev, although I don't know the architecture of VMWare's integration.

The concepts of PCI and SR-IOV are, of course, generic, but I think out of principle we should avoid a hypervisor-specific integration for vGPU (indeed Citrix has been clear from the beginning that the vGPU integration we are proposing is intentionally hypervisor-agnostic).
I also think there is value in exposing vGPU in a generic way, irrespective of the underlying implementation (whether it is DEMU, mdev, SR-IOV or whatever approach Hyper-V/VMWare use).

It's quite difficult for me to see how this will work for other hypervisors. Do you also have a draft alternate spec where more details can be discussed?

Bob


responded Sep 29, 2017 by Bob_Ball

The concepts of PCI and SR-IOV are, of course, generic

They are, although the PowerVM guys have already pointed out that they
don't even refer to virtual devices by PCI address and thus anything
based on that subsystem isn't going to help them.

but I think out of principle we should avoid a hypervisor-specific
integration for vGPU (indeed Citrix has been clear from the beginning
that the vGPU integration we are proposing is intentionally
hypervisor agnostic) I also think there is value in exposing vGPU in
a generic way, irrespective of the underlying implementation (whether
it is DEMU, mdev, SR-IOV or whatever approach Hyper-V/VMWare use).

I very much agree, of course.

--Dan


responded Sep 29, 2017 by Dan_Smith

Hi Sahid, comments inline. :)

On 09/29/2017 04:53 AM, Sahid Orentino Ferdjaoui wrote:
On Thu, Sep 28, 2017 at 05:06:16PM -0400, Jay Pipes wrote:

On 09/28/2017 11:37 AM, Sahid Orentino Ferdjaoui wrote:

Please consider the support of MDEV for the /pci framework, which
provides support for vGPUs [0].

According to the discussion in [1]:

With this first implementation, which could be used as a skeleton for
implementing PCI devices in the Resource Tracker,

I'm not entirely sure what you're referring to above as "implementing PCI
devices in Resource Tracker". Could you elaborate? The resource tracker
already embeds a PciManager object that manages PCI devices, as you know.
Perhaps you meant "implement PCI devices as Resource Providers"?

A PciManager? I know that we have a PCI_DEVICE field :) - I guess a
virt driver can return an inventory with the total number of PCI
devices. As for the manager, I'm not sure.

I'm referring to this:

https://github.com/openstack/nova/blob/master/nova/pci/manager.py#L33

The PciDevTracker class is instantiated in the resource tracker when the
first ComputeNode object managed by the resource tracker is init'd:

https://github.com/openstack/nova/blob/master/nova/compute/resource_tracker.py#L578

On initialization, the PciDevTracker inventories the compute node's
collection of PCI devices by grabbing a list of records from the
pci_devices table in the cell database:

https://github.com/openstack/nova/blob/master/nova/pci/manager.py#L69

and then comparing those DB records with information the hypervisor
returns about PCI devices:

https://github.com/openstack/nova/blob/master/nova/pci/manager.py#L160

Each hypervisor returns something different for the list of pci devices,
as you know. For libvirt, the call that returns PCI device information
is here:

https://github.com/openstack/nova/blob/master/nova/virt/libvirt/host.py#L842

The results of that are jammed into a "pci_passthrough_devices" key in
the returned result of the virt driver's get_available_resource() call.
For libvirt, that's here:

https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L5809

It is that piece that Eric and myself have been talking about
standardizing into a "generic device management" interface that would
have an update_inventory() method that accepts a ProviderTree object [1]

[1]
https://github.com/openstack/nova/blob/master/nova/compute/provider_tree.py

and would add resource providers corresponding to devices that are made
available to guests for use.
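To sketch what that could look like (purely illustrative Python; the
class and method names, the assumed libvirt helper, and the exact
ProviderTree calls are assumptions for discussion rather than the agreed
interface):

import abc

class GenericDeviceManager(abc.ABC):
    """Hypothetical 'generic device management' interface, sketched for
    discussion only; not actual Nova code."""

    @abc.abstractmethod
    def update_inventory(self, provider_tree, compute_rp_name):
        """Add/update child resource providers (with inventories and
        traits) under the compute node provider in provider_tree."""

class MdevDeviceManager(GenericDeviceManager):
    def __init__(self, host):
        self.host = host  # e.g. a libvirt host wrapper (assumed helper)

    def update_inventory(self, provider_tree, compute_rp_name):
        # One child provider per mdev-capable physical GPU, each with a
        # vGPU inventory; the ProviderTree method names used here are
        # assumed, check the actual ProviderTree API.
        for pgpu in self.host.list_mdev_capable_devices():
            rp_name = '%s_%s' % (compute_rp_name, pgpu.address)
            if not provider_tree.exists(rp_name):
                provider_tree.new_child(rp_name, compute_rp_name)
            provider_tree.update_inventory(rp_name, {
                'VGPU': {'total': pgpu.available_instances,
                         'min_unit': 1,
                         'max_unit': pgpu.available_instances,
                         'step_size': 1,
                         'allocation_ratio': 1.0,
                         'reserved': 0},
            })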

You still have to define "traits". Basically, for physical network
devices, users want to select a device according to its physical
network, its placement on the host (NUMA), its bandwidth capability...
For GPUs it's the same story. And I have not even mentioned devices
which support virtual functions.

Yes, the generic device manager would be responsible for associating
traits to the resource providers it adds to the ProviderTree provided to
it in the update_inventory() call.

So that is what you plan to do for this release :) - Realistically, I
don't think we are close to having something ready for production.

I don't disagree with you that this is a huge amount of refactoring to
undertake over the next couple releases. :)

Jay, I have a question: why don't you start by exposing NUMA?

I believe you're asking here why we don't start by modeling NUMA nodes
as child resource providers of the compute node? Instead of starting by
modeling PCI devices as child providers of the compute node? If that's
not what you're asking, please do clarify...

We're starting with modeling PCI devices as child providers of the
compute node because they are easier to deal with as a whole than NUMA
nodes and we have the potential of being able to remove the
PciPassthroughFilter from the scheduler in Queens.

I don't see us being able to remove the NUMATopologyFilter from the
scheduler in Queens because of the complexity involved in how coupled
the NUMA topology resource handling is to CPU pinning, huge page
support, and IO emulation thread pinning.

Hope that answers that question; again, lemme know if that's not the
question you were asking! :)

For the record, I have zero confidence in any existing "functional" tests
for NUMA, SR-IOV, CPU pinning, huge pages, and the like.
Unfortunately, these features often require hardware that either the
upstream community CI lacks or that depends on libraries, drivers and
kernel versions that really aren't available to non-bleeding-edge
users (or users with very deep pockets).

That's a good point. If you are not confident in them, don't you think
it's premature to move forward on implementing new things without
well-trusted functional tests?

Completely agree with you. I would rather see functional integration
tests that are proven to actually test these complex hardware devices
gating Nova patches before adding any new functionality to Nova.

We're adding lots of functional tests of the placement and resource
providers modeling. I could definitely use some assistance from folks
with access to this specialized hardware to set up and maintain the CI
systems that can prove they are actually exercising these code paths.

  • The Usage

There is no difference between SR-IOV and MDEV from the operator's
point of view: operators who know how to expose SR-IOV devices in Nova
already know how to expose MDEV devices (vGPUs).

Operators will be able to expose MDEV devices in the same manner as
they expose SR-IOV:

1/ Configure whitelist devices

['{"vendor_id":"10de"}']

2/ Create aliases

[{"vendor_id":"10de", "name":"vGPU"}]

3/ Configure the flavor

openstack flavor set --property "pci_passthrough:alias"="vGPU:1"

  • Limitations

MDEV devices do not provide a 'product_id' but an 'mdev_type', which
should be used to identify exactly which resource users can request,
e.g. nvidia-10. To provide that support we have to add a new field
'mdev_type', so aliases could look something like:

{"vendor_id":"10de", "mdev_type":"nvidia-10", "name":"alias-nvidia-10"}
{"vendor_id":"10de", "mdev_type":"nvidia-11", "name":"alias-nvidia-11"}

I do have a plan to add this, but first I need support from upstream
to continue that work.

As mentioned in IRC and the previous ML discussion, my focus is on the
nested resource providers work and reviews, along with the other two
top-priority scheduler items (move operations and alternate hosts).

I'll do my best to look at your patch series, but please note it's lower
priority than a number of other items.

No worries, the code is here, tested, fully functional and
production-ready; I made an effort to make it available at the very
beginning of the release. With some goodwill we could fix any bugs and
have support for vGPUs in Queens.

You cannot say it's tested, fully functional and production-ready until
we see functional integration tests proving that :)

One thing that would be very useful, Sahid, if you could get with Eric Fried
(efried) on IRC and discuss with him the "generic device management" system
that was discussed at the PTG. It's likely that the /pci module is going to
be overhauled in Rocky and it would be good to have the mdev device
management API requirements included in that discussion.

Perhaps you missed the above part of my response. I'd like to repeat
that it would be great to get your input on the generic device
management ideas we've been throwing around.

All the best,
-jay


responded Sep 29, 2017 by Jay_Pipes

On Fri, Sep 29, 2017 at 12:26:07PM +0000, Bob Ball wrote:
Hi Sahid,

Please consider the support of MDEV for the /pci framework which provides support for vGPUs [0].

From my understanding, this MDEV implementation for vGPU would be
entirely specific to libvirt, is that correct?

No, but it is Linux-specific, yes. Windows supports SR-IOV.

XenServer's implementation for vGPU is based on a pooled device
model (as described in
http://lists.openstack.org/pipermail/openstack-dev/2017-September/122702.html)

That topic refers to something which I guess everyone understands now
- it's basically why I have added support for MDEV in /pci: to make it
work regardless of how the virtual devices are exposed, SR-IOV or
MDEV.

a second device emulator alongside QEMU. There is no mdev
integration. I'm concerned about how much mdev-specific
functionality would have to be faked up in the XenServer-specific
driver for vGPU to be used in this way.

What you are referring to with your DEMU is what QEMU/KVM has with
vfio-pci. XenServer is reading through MDEV, since the vendors provide
drivers on Linux using the MDEV framework.

MDEV is a kernel layer used to expose hardware; it's not
hypervisor-specific.

I'm not familiar with mdev, but it looks Linux-specific, so it would not be usable by Hyper-V?
I've also not been able to find suggestions that VMWare can make use of mdev, although I don't know the architecture of VMWare's integration.

The concepts of PCI and SR-IOV are, of course, generic, but I think out of principle we should avoid a hypervisor-specific integration for vGPU (indeed Citrix has been clear from the beginning that the vGPU integration we are proposing is intentionally hypervisor-agnostic).
I also think there is value in exposing vGPU in a generic way, irrespective of the underlying implementation (whether it is DEMU, mdev, SR-IOV or whatever approach Hyper-V/VMWare use).

It's quite difficult for me to see how this will work for other
hypervisors. Do you also have a draft alternate spec where more
details can be discussed?

I would expect XenServer to provide the MDEV UUID; then it's easy to
ask sysfs if you need to get the NUMA node of the physical device or
the mdev_type.
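For example, something along these lines (a sketch only; the sysfs
paths follow the upstream vfio-mediated-device documentation and error
handling is omitted):

import os

def mdev_info(mdev_uuid):
    """Read the mdev_type and NUMA node of a mediated device from sysfs."""
    dev = '/sys/bus/mdev/devices/%s' % mdev_uuid
    # 'mdev_type' is a link to the type directory under the parent
    # device's mdev_supported_types/, e.g. .../nvidia-10
    mdev_type = os.path.basename(os.readlink(os.path.join(dev, 'mdev_type')))
    # the mdev device sits under its parent (PCI) device, which exposes
    # its NUMA node
    parent = os.path.realpath(os.path.join(dev, '..'))
    with open(os.path.join(parent, 'numa_node')) as f:
        numa_node = int(f.read().strip())
    return mdev_type, numa_node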

Bob


responded Sep 29, 2017 by Sahid_Orentino_Ferdj

Hi Sahid,

a second device emulator along-side QEMU. There is no mdev
integration. I'm concerned about how much mdev-specific functionality
would have to be faked up in the XenServer-specific driver for vGPU to
be used in this way.

What you are referring to with your DEMU is what QEMU/KVM has with vfio-pci. XenServer is
reading through MDEV, since the vendors provide drivers on Linux using the MDEV framework.
MDEV is a kernel layer used to expose hardware; it's not hypervisor-specific.

It is possible that the vendor's userspace libraries use mdev; however, DEMU has no concept of mdev at all. If the vendor's userspace libraries do use mdev, then this is entirely abstracted away from XenServer's integration.
While I don't have access to the vendor's source for the userspace libraries or the kernel module, my understanding was that the kernel module in XenServer's integration is there for the userspace libraries to talk to via ioctls.

My reading of mdev implies that /sys/class/mdev_bus should exist for it to be used? It does not exist in XenServer, which to me implies that the vendor's drivers for XenServer do not use mdev?

Bob


responded Sep 29, 2017 by Bob_Ball

On Fri, Sep 29, 2017 at 11:16:43AM -0400, Jay Pipes wrote:
Hi Sahid, comments inline. :)

On 09/29/2017 04:53 AM, Sahid Orentino Ferdjaoui wrote:

On Thu, Sep 28, 2017 at 05:06:16PM -0400, Jay Pipes wrote:

On 09/28/2017 11:37 AM, Sahid Orentino Ferdjaoui wrote:

Please consider the support of MDEV for the /pci framework, which
provides support for vGPUs [0].

According to the discussion in [1]:

With this first implementation, which could be used as a skeleton for
implementing PCI devices in the Resource Tracker,

I'm not entirely sure what you're referring to above as "implementing PCI
devices in Resource Tracker". Could you elaborate? The resource tracker
already embeds a PciManager object that manages PCI devices, as you know.
Perhaps you meant "implement PCI devices as Resource Providers"?

A PciManager? I know that we have a PCI_DEVICE field :) - I guess a
virt driver can return an inventory with the total number of PCI
devices. As for the manager, I'm not sure.

I'm referring to this:

https://github.com/openstack/nova/blob/master/nova/pci/manager.py#L33

[SNIP]

It is that piece that Eric and myself have been talking about standardizing
into a "generic device management" interface that would have an
update_inventory() method that accepts a ProviderTree object [1]

Jay, all of that looks perfectly sane to me, even if it's not clear
what you want to make so generic. That part of the code is for the
virt layers, and you can't just treat GPU or NET as a generic piece;
they have characteristics which are requirements for the virt layers.

In that method 'update_inventory(provider_tree)' which you are going
to introduce for /pci/PciManager, would a first step be to convert the
objects to an understandable dict for the whole logic, or do you have
another plan?

In any case, from my POV I don't see any blocker; both pieces of work
can co-exist without any pain. And adding features in the current /pci
module is not going to add heavy work, but it is going to give us a
clear view of what is needed.

[1]
https://github.com/openstack/nova/blob/master/nova/compute/provider_tree.py

and would add resource providers corresponding to devices that are made
available to guests for use.

You still have to define "traits". Basically, for physical network
devices, users want to select a device according to its physical
network, its placement on the host (NUMA), its bandwidth capability...
For GPUs it's the same story. And I have not even mentioned devices
which support virtual functions.

Yes, the generic device manager would be responsible for associating traits
to the resource providers it adds to the ProviderTree provided to it in the
update_inventory() call.

So that is what you plan to do for this release :) - Realistically, I
don't think we are close to having something ready for production.

I don't disagree with you that this is a huge amount of refactoring to
undertake over the next couple releases. :)

Yes, and that is the point. We are going to block work on the /pci
module during a period where we can see a lot of interest in such
support.

Jay, I have a question: why don't you start by exposing NUMA?

I believe you're asking here why we don't start by modeling NUMA nodes as
child resource providers of the compute node? Instead of starting by
modeling PCI devices as child providers of the compute node? If that's not
what you're asking, please do clarify...

We're starting with modeling PCI devices as child providers of the compute
node because they are easier to deal with as a whole than NUMA nodes and we
have the potential of being able to remove the PciPassthroughFilter from the
scheduler in Queens.

I don't see us being able to remove the NUMATopologyFilter from the
scheduler in Queens because of the complexity involved in how coupled the
NUMA topology resource handling is to CPU pinning, huge page support, and IO
emulation thread pinning.

Hope that answers that question; again, lemme know if that's not the
question you were asking! :)

Yes, that was the question and you answered it perfectly, thanks. I
will try to be clearer in the future :)

As you have noticed, NUMA support will be quite difficult and it is
not on the TODO list right now, which makes me think that we are going
to block development on the pci module and, on top of that, end up
providing less support (no NUMA awareness). Is that reasonable?

For the record, I have zero confidence in any existing "functional" tests
for NUMA, SR-IOV, CPU pinning, huge pages, and the like.
Unfortunately, these features often require hardware that either the
upstream community CI lacks or that depends on libraries, drivers and
kernel versions that really aren't available to non-bleeding-edge
users (or users with very deep pockets).

That's a good point. If you are not confident in them, don't you think
it's premature to move forward on implementing new things without
well-trusted functional tests?

Completely agree with you. I would rather see functional integration tests
that are proven to actually test these complex hardware devices gating
Nova patches before adding any new functionality to Nova.

I plan to rework a bit the work initiated by Vladik (thanks to him),
even if I think those tests already exercise the complexity well.

We're adding lots of functional tests of the placement and resource
providers modeling. I could definitely use some assistance from folks with
access to this specialized hardware to set up and maintain the CI systems
that can prove they are actually exercising these code paths.

+1

  • The Usage

There is no difference between SR-IOV and MDEV from the operator's
point of view: operators who know how to expose SR-IOV devices in Nova
already know how to expose MDEV devices (vGPUs).

Operators will be able to expose MDEV devices in the same manner as
they expose SR-IOV:

1/ Configure whitelist devices

['{"vendor_id":"10de"}']

2/ Create aliases

[{"vendor_id":"10de", "name":"vGPU"}]

3/ Configure the flavor

openstack flavor set --property "pci_passthrough:alias"="vGPU:1"

  • Limitations

MDEV devices do not provide a 'product_id' but an 'mdev_type', which
should be used to identify exactly which resource users can request,
e.g. nvidia-10. To provide that support we have to add a new field
'mdev_type', so aliases could look something like:

{"vendor_id":"10de", "mdev_type":"nvidia-10", "name":"alias-nvidia-10"}
{"vendor_id":"10de", "mdev_type":"nvidia-11", "name":"alias-nvidia-11"}

I do have a plan to add this, but first I need support from upstream
to continue that work.

As mentioned in IRC and the previous ML discussion, my focus is on the
nested resource providers work and reviews, along with the other two
top-priority scheduler items (move operations and alternate hosts).

I'll do my best to look at your patch series, but please note it's lower
priority than a number of other items.

No worries, the code is here, tested, fully functional and
production-ready; I made an effort to make it available at the very
beginning of the release. With some goodwill we could fix any bugs and
have support for vGPUs in Queens.

You cannot say it's tested, fully functional and production-ready until we
see functional integration tests proving that :)

OK I accept that point :)

One thing that would be very useful, Sahid, if you could get with Eric Fried
(efried) on IRC and discuss with him the "generic device management" system
that was discussed at the PTG. It's likely that the /pci module is going to
be overhauled in Rocky and it would be good to have the mdev device
management API requirements included in that discussion.

Perhaps you missed the above part of my response. I'd like to repeat that it
would be great to get your input on the generic device management ideas
we've been throwing around.

Jay, I can help with this, sure, even if it's still not clear when you
are going to consider the characteristics of devices in your generic
management ideas :)

If I can get some explanations, I could start working on reporting the
resources via 'get_inventory()' and rewriting the PciManager to handle
that new 'update_from_inventory()'.

That said, for vGPUs I'm not sure about the spec you have approved, or
perhaps it's a long-term view. I think you have considered vGPUs as
dynamic resources, which may be the case for some hypervisors, probably
XenServer, but it's not, or at least not yet, the case for
libvirt/QEMU.

I think the first implementation should care only about the type of
vGPU and NUMA placement. We should call that resource MDEV_GPU, as we
have SRIOV_NET. The operator allocates the vGPU resources based on
requirements and configures flavors based on type/name. That would be
basic support.

All the best,
-jay


responded Oct 2, 2017 by Sahid_Orentino_Ferdj
...