Monday, August 25, 2014

IOMMU Groups, inside and out

Sometimes VFIO users are befuddled that they aren't able to separate devices between host and guest or between multiple guests due to IOMMU grouping, and revert to using legacy KVM device assignment, or, as is the case with many VFIO-VGA users, apply the PCIe ACS override patch to avoid the problem.  Let's take a moment to look at what this grouping is really doing.

Hopefully we all have at least some vague notion of what an IOMMU does in a system: it allows mapping of an I/O virtual address (IOVA) to a physical memory address.  Without an IOMMU, all devices share a flat view of physical memory without any memory translation operation.  With an IOMMU we have a new address space, the IOVA space, that we can put to use.

Different IOMMUs have different levels of functionality.  Before the proliferation of virtualization, IOMMUs often provided only translation, and often only for a small aperture or window of the address space.  These IOMMUs mostly provided two capabilities: avoiding bounce buffers and creating contiguous DMA operations.  Bounce buffers are necessary when the addressing capabilities of the device are less than those of the platform, for instance if the device can only address 4GB of memory, but your system supports 8GB.  If the driver allocates a buffer above 4GB, the device cannot directly DMA to it.  A bounce buffer is buffer space in lower memory, where the device can temporarily DMA, which is then copied to the driver-allocated buffer on completion.  An IOMMU can avoid the extra buffer and copy operation by providing an IOVA within the device's address space, backed by the driver's buffer that is outside of the device's address space.  Creating contiguous DMA operations comes into play when the driver makes use of multiple buffers, scattered throughout the physical address space, and gathered together for a single I/O operation.  The IOMMU can take these scatter-gather lists and map them into the IOVA space to form a contiguous DMA operation for the device.  In the simplest example, a driver may allocate two 4KB buffers that are not contiguous in the physical memory space.  The IOMMU can allocate a contiguous IOVA range for these buffers, allowing the I/O device to do a single 8KB DMA rather than two separate 4KB DMAs.

Both of these features are still important for high performance I/O on the host, but the IOMMU feature we love from a virtualization perspective is the isolation capability of modern IOMMUs.  Isolation wasn't possible on a wide scale prior to PCI-Express because conventional PCI does not tag transactions with an ID of the requesting device (requester ID).  PCI-X included some degree of a requester ID, but rules for interconnecting devices taking ownership of the transaction made the support incomplete for isolation.  With PCIe, each device tags transactions with a requester ID unique to the device (the PCI bus/device/function number, BDF), which is used to reference a unique IOVA table for that device.  Suddenly we go from having a shared IOVA space used to offload unreachable memory and consolidate memory, to a per-device IOVA space that we can not only use for those features, but also to restrict DMA access from the device.  For assignment to a virtual machine, we now simply need to populate the IOVA space for the assigned device with the guest physical to host physical memory mappings for the VM, and the device can transparently perform DMA in the guest address space.

Back to IOMMU groups: IOMMU groups try to describe the smallest sets of devices which can be considered isolated from the perspective of the IOMMU.  The first step in doing this is that each device must associate with a unique IOVA space.  That is, if multiple devices alias to the same IOVA space, then the IOMMU cannot distinguish between them.  This is the reason that a typical x86 PC will group all conventional PCI devices together: all of them alias to the same PCIe-to-PCI bridge.  Legacy KVM device assignment will allow a user to assign these devices separately, but the configuration is guaranteed to fail.  VFIO is governed by IOMMU groups and therefore prevents configurations which violate this most basic requirement of IOMMU granularity.

Beyond this first step of being able to simply differentiate one device from another, we next need to determine whether the transactions from a device actually reach the IOMMU.  The PCIe specification allows for transactions to be re-routed within the interconnect fabric.  A PCIe downstream port can re-route a transaction from one downstream device to another.  The downstream ports of a PCIe switch may be interconnected to allow re-routing from one port to another.  Even within a multifunction endpoint device, a transaction from one function may be delivered directly to another function.  These transactions from one device to another are called peer-to-peer transactions and can be bad news for devices operating in separate IOVA spaces.  Imagine for instance if the network interface card assigned to your guest attempted a DMA write to a guest physical address (IOVA) that matched the MMIO space for a peer disk controller owned by the host.  An interconnect attempting to optimize the data path of that transaction could send the DMA write straight to the disk controller before it gets to the IOMMU for translation.

This is where PCIe Access Control Services (ACS) comes into play.  ACS provides us with the ability to determine whether these redirects are possible as well as the ability to disable them.  This is an essential component in being able to isolate devices from one another and sadly one that is too often missing in interconnects and multifunction endpoints.  Without ACS support at every step from the device to the IOMMU, we must assume that redirection is possible at the highest upstream device lacking ACS, thereby breaking isolation of all devices below that point in the topology.  IOMMU groups in a PCI environment take this isolation into account, grouping together devices which are capable of untranslated peer-to-peer DMA.

Combining these two things, the IOMMU group represents the smallest set of devices for which the IOMMU has visibility and which is isolated from other groups.  VFIO uses this information to enforce safe ownership of devices for userspace.  With the exception of bridges, root ports, and switches (ie. interconnect fabric), all devices within an IOMMU group must be bound to a VFIO device driver or a known safe stub driver.  For PCI, these drivers are vfio-pci and pci-stub.  We allow pci-stub simply because it's known that the host does not interact with devices via this driver (using legacy KVM device assignment on such devices while the group is in use with VFIO for a different VM is strongly discouraged).  If, when attempting to use VFIO, you see an error message indicating the group is not viable, it relates to this rule: all of the devices in the group need to be bound to an appropriate host driver.
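
For illustration, a quick way to audit this from the shell is to print the driver currently bound to each member of a group; this is only a sketch (group 7 refers to the example sysfs listing shown below), not anything VFIO itself requires:

$ for dev in /sys/kernel/iommu_groups/7/devices/*; do
      drv=$(readlink $dev/driver 2>/dev/null || echo none)
      echo "$(basename $dev): $(basename $drv)"
  done

If every endpoint reports vfio-pci or pci-stub (bridges and root ports will typically show pcieport), the group should be viable.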

IOMMU groups are visible to the user through sysfs:

$ find /sys/kernel/iommu_groups/ -type l
/sys/kernel/iommu_groups/0/devices/0000:00:00.0
/sys/kernel/iommu_groups/1/devices/0000:00:02.0
/sys/kernel/iommu_groups/2/devices/0000:00:14.0
/sys/kernel/iommu_groups/3/devices/0000:00:16.0
/sys/kernel/iommu_groups/4/devices/0000:00:19.0
/sys/kernel/iommu_groups/5/devices/0000:00:1a.0
/sys/kernel/iommu_groups/6/devices/0000:00:1b.0
/sys/kernel/iommu_groups/7/devices/0000:00:1c.0
/sys/kernel/iommu_groups/7/devices/0000:00:1c.1
/sys/kernel/iommu_groups/7/devices/0000:00:1c.2
/sys/kernel/iommu_groups/7/devices/0000:02:00.0
/sys/kernel/iommu_groups/7/devices/0000:03:00.0
/sys/kernel/iommu_groups/8/devices/0000:00:1d.0
/sys/kernel/iommu_groups/9/devices/0000:00:1f.0
/sys/kernel/iommu_groups/9/devices/0000:00:1f.2
/sys/kernel/iommu_groups/9/devices/0000:00:1f.3

Here we see that devices like the audio controller (0000:00:1b.0) have their own IOMMU group, while a wireless adapter (0000:03:00.0) and flash card reader (0000:02:00.0) share an IOMMU group.  The latter is a result of the lack of ACS support at the PCIe root ports (0000:00:1c.*).  Each device also has links back to its IOMMU group:

$ readlink -f /sys/bus/pci/devices/0000\:03\:00.0/iommu_group/
/sys/kernel/iommu_groups/7

The set of devices can thus be found using:

$ ls /sys/bus/pci/devices/0000\:03\:00.0/iommu_group/devices/
0000:00:1c.0  0000:00:1c.1  0000:00:1c.2  0000:02:00.0  0000:03:00.0

Using this example, if I wanted to assign the wireless adapter (0000:03:00.0) to a guest, I would also need to bind the flash card reader (0000:02:00.0) to either vfio-pci or pci-stub in order to make the group viable.  An important point here is that the flash card reader does not also need to be assigned to the guest; it simply needs to be held by a driver which is known to either participate in VFIO, like vfio-pci, or known not to do DMA, like pci-stub.  Newer kernels than the one used for this example will split this IOMMU group, as support has been added to expose the isolation capabilities of this chipset, even though it does not support PCIe ACS directly.
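
For the record, doing that binding by hand might look roughly like the following; this is only a sketch, assuming a kernel new enough to provide the per-device driver_override file and that the vfio-pci module is already loaded (older kernels can instead use the new_id interface, which comes up in the comments below):

$ # Tell the PCI core that only vfio-pci should match this device
$ echo vfio-pci | sudo tee /sys/bus/pci/devices/0000:02:00.0/driver_override
$ # Release the device from its current host driver, if one is bound
$ echo 0000:02:00.0 | sudo tee /sys/bus/pci/devices/0000:02:00.0/driver/unbind
$ # Re-probe the device, which now binds it to vfio-pci
$ echo 0000:02:00.0 | sudo tee /sys/bus/pci/drivers_probe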

In closing, let's discuss strategies for dealing with IOMMU groups that contain more devices than desired.  For a plug-in card, the first option would be to determine whether installing the card into a different slot may produce the desired grouping.  On a typical Intel chipset, PCIe root ports are provided via both the processor and the PCH (Platform Controller Hub).  The capabilities of these root ports can be very different.  On the latest Linux kernels we have support for exposing the isolation of the PCH root ports, even though many of them do not have native PCIe ACS support.  These are therefore often a good target for creating smaller IOMMU groups.  On Xeon class processors (except the E3-1200 series), the processor-based PCIe root ports typically support ACS.  Client processors, such as the Core i5/i7, do not support ACS, but we can hope that future products from Intel will update this support.
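
When evaluating a slot change, it can also help to see which port a device currently sits under and whether that port advertises an ACS capability.  For illustration only, using addresses from the example above (substitute your own):

$ # Which upstream port is the wireless adapter plugged into?
$ dirname "$(readlink -f /sys/bus/pci/devices/0000:03:00.0)"
$ # Does that port advertise PCIe Access Control Services?  (substitute the port found above)
$ sudo lspci -vvv -s 00:1c.0 | grep -A2 'Access Control Services'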

Another option that many users have found is a kernel patch which overrides PCIe ACS support in the kernel, allowing command line options to falsely expose isolation capabilities of various components.  In many cases this appears to work well, but without vendor confirmation, we cannot be sure that the devices are truly isolated.  The occurrence of a misdirected DMA may be sufficiently rare to mask association with this option.  We may also find differences in chipset programming or address assignment between vendors that allows relatively safe use of this override on one system, while other systems may experience issues.  Adjusting slot usage and using a platform with proper isolation support are clearly the best options.

The final option is to work with the vendor to determine whether isolation is present and quirk the kernel to recognize this isolation.  This is generally a matter of determining whether internal peer-to-peer between functions is possible, or in the case of downstream ports, also determining whether redirection is possible.  Multifunction endpoints that do not support peer-to-peer can expose this using a single static ACS table in configuration space, exposing no capabilities.

Hopefully this entry helps to describe why we have IOMMU groups, why they take the shape that they do, and how they operate with VFIO.  Please comment if I can provide further clarification anywhere.

37 comments:

  1. Dear Alex,

    Do you have any idea how I could further split my IOMMU groups?
    I have a HighPoint RocketU USB controller PCIe card that has 4 ASMedia USB controllers behind a PLX switch.
    I have tried already with ACS override but can't get them to split into different groups.

    IOMMU groups

    /sys/kernel/iommu_groups/17/devices/0000:00:1f.0
    /sys/kernel/iommu_groups/17/devices/0000:00:1f.2
    /sys/kernel/iommu_groups/17/devices/0000:00:1f.3
    /sys/kernel/iommu_groups/18/devices/0000:02:00.0
    /sys/kernel/iommu_groups/18/devices/0000:02:00.1
    /sys/kernel/iommu_groups/18/devices/0000:03:01.0
    /sys/kernel/iommu_groups/18/devices/0000:03:05.0
    /sys/kernel/iommu_groups/18/devices/0000:03:07.0
    /sys/kernel/iommu_groups/18/devices/0000:03:09.0
    /sys/kernel/iommu_groups/18/devices/0000:04:00.0
    /sys/kernel/iommu_groups/18/devices/0000:05:00.0
    /sys/kernel/iommu_groups/18/devices/0000:06:00.0
    /sys/kernel/iommu_groups/18/devices/0000:07:00.0
    /sys/kernel/iommu_groups/19/devices/0000:08:00.0
    /sys/kernel/iommu_groups/19/devices/0000:08:00.1


    lspci output


    00:1d.0 USB controller: Intel Corporation C600/X79 series chipset USB2 Enhanced Host Controller #1 (rev 06)
    00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a6)
    00:1f.0 ISA bridge: Intel Corporation C600/X79 series chipset LPC Controller (rev 06)
    00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 06)
    00:1f.3 SMBus: Intel Corporation C600/X79 series chipset SMBus Host Controller (rev 06)
    02:00.0 PCI bridge: PLX Technology, Inc. PEX 8609 8-lane, 8-Port PCI Express Gen 2 (5.0 GT/s) Switch with DMA (rev ba)
    02:00.1 System peripheral: PLX Technology, Inc. PEX 8609 8-lane, 8-Port PCI Express Gen 2 (5.0 GT/s) Switch with DMA (rev ba)
    03:01.0 PCI bridge: PLX Technology, Inc. PEX 8609 8-lane, 8-Port PCI Express Gen 2 (5.0 GT/s) Switch with DMA (rev ba)
    03:05.0 PCI bridge: PLX Technology, Inc. PEX 8609 8-lane, 8-Port PCI Express Gen 2 (5.0 GT/s) Switch with DMA (rev ba)
    03:07.0 PCI bridge: PLX Technology, Inc. PEX 8609 8-lane, 8-Port PCI Express Gen 2 (5.0 GT/s) Switch with DMA (rev ba)
    03:09.0 PCI bridge: PLX Technology, Inc. PEX 8609 8-lane, 8-Port PCI Express Gen 2 (5.0 GT/s) Switch with DMA (rev ba)
    04:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller
    05:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller
    06:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller
    07:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller
    08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Caicos [Radeon HD 6450/7450/8450 / R5 230 OEM]
    08:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Caicos HDMI Audio [Radeon HD 6400 Series]


    lspci tree output


    \-[0000:00]-+-00.0
                +-01.0-[01]--
                +-01.1-[02-07]--+-00.0-[03-07]--+-01.0-[04]----00.0
                |               |               +-05.0-[05]----00.0
                |               |               +-07.0-[06]----00.0
                |               |               \-09.0-[07]----00.0
                |               \-00.1
                +-02.0-[08]--+-00.0


    Kind regards;
    Robin Sandra

    Replies
    1. Please use the contact form or pastebin to provide "sudo lspci -vvvnn". Thanks.

  2. Hi Alex,

    We have two similar servers, both Dell R730 with Intel cards. In one of them we installed RH 7.1 and in the other Ubuntu 14.04.02.

    The iommu grouping is different, for the same hardware:

    RHEL 7.1:
    n2@nfv105 ~$ find /sys/kernel/iommu_groups/ -type l|grep "0000:81"
    /sys/kernel/iommu_groups/46/devices/0000:81:00.0
    /sys/kernel/iommu_groups/47/devices/0000:81:00.1
    /sys/kernel/iommu_groups/115/devices/0000:81:10.0
    /sys/kernel/iommu_groups/116/devices/0000:81:10.2
    /sys/kernel/iommu_groups/117/devices/0000:81:10.4
    /sys/kernel/iommu_groups/118/devices/0000:81:10.6
    /sys/kernel/iommu_groups/119/devices/0000:81:11.0
    /sys/kernel/iommu_groups/120/devices/0000:81:11.2
    /sys/kernel/iommu_groups/121/devices/0000:81:11.4
    /sys/kernel/iommu_groups/122/devices/0000:81:11.6
    /sys/kernel/iommu_groups/123/devices/0000:81:10.1
    /sys/kernel/iommu_groups/124/devices/0000:81:10.3
    /sys/kernel/iommu_groups/125/devices/0000:81:10.5
    /sys/kernel/iommu_groups/126/devices/0000:81:10.7
    /sys/kernel/iommu_groups/127/devices/0000:81:11.1
    /sys/kernel/iommu_groups/128/devices/0000:81:11.3
    /sys/kernel/iommu_groups/129/devices/0000:81:11.5
    /sys/kernel/iommu_groups/130/devices/0000:81:11.7
    n2@nfv105 ~$ uname -a
    Linux nfv105 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux

    UBUNTU 14.04.02:
    n2@nfv106:/boot$ find /sys/kernel/iommu_groups/ -type l|grep "0000:81"
    /sys/kernel/iommu_groups/40/devices/0000:81:00.0
    /sys/kernel/iommu_groups/40/devices/0000:81:00.1
    /sys/kernel/iommu_groups/91/devices/0000:81:10.0
    /sys/kernel/iommu_groups/92/devices/0000:81:10.2
    /sys/kernel/iommu_groups/93/devices/0000:81:10.4
    /sys/kernel/iommu_groups/94/devices/0000:81:10.6
    /sys/kernel/iommu_groups/95/devices/0000:81:11.0
    /sys/kernel/iommu_groups/96/devices/0000:81:11.2
    /sys/kernel/iommu_groups/97/devices/0000:81:11.4
    /sys/kernel/iommu_groups/98/devices/0000:81:11.6
    /sys/kernel/iommu_groups/99/devices/0000:81:10.1
    /sys/kernel/iommu_groups/100/devices/0000:81:10.3
    /sys/kernel/iommu_groups/101/devices/0000:81:10.5
    /sys/kernel/iommu_groups/102/devices/0000:81:10.7
    /sys/kernel/iommu_groups/103/devices/0000:81:11.1
    /sys/kernel/iommu_groups/104/devices/0000:81:11.3
    /sys/kernel/iommu_groups/105/devices/0000:81:11.5
    /sys/kernel/iommu_groups/106/devices/0000:81:11.7
    n2@nfv106:/boot$ uname -a
    Linux nfv106 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:43:14 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

    The problem is that in Ubuntu we cannot passthrough just a single port to a guest, because both ports are in the same iommu group 40:
    /sys/kernel/iommu_groups/40/devices/0000:81:00.0
    /sys/kernel/iommu_groups/40/devices/0000:81:00.1

    In RH we do not have that problem, as the iommu group is different:
    /sys/kernel/iommu_groups/46/devices/0000:81:00.0
    /sys/kernel/iommu_groups/47/devices/0000:81:00.1

    Could you advise on how to fix this problem in Ubuntu?
    Is this a kernel issue that could be fixed with a kernel patch?
    Do you know if a vanilla kernel would support the correct behaviour, as in RH 7.1?

    Thanks
    Antonio López

    Replies
    1. Many multi-function devices do not support ACS to indicate isolation between ports, causing the ports to be grouped together. In this case, Red Hat has worked with Intel to confirm that isolation exists between ports of this particular NIC, pushed those changes upstream and also backported them to our RHEL kernel to enable our customers. I can't speak for Ubuntu.

    2. Thanks Alex!

      I guess the patch is this: https://lkml.org/lkml/2013/5/30/513 Right?

      How can I find out in which kernel version it has been released?

      Antonio

    3. No, that's the override patch. There are no plans to include that upstream. Depending on the NIC, you need one of these:

      http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/pci/quirks.c?id=d748804f5be8ca4dd97a4167fcf84867dca7c116
      http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/pci/quirks.c?id=100ebb2c48eaddd6a7ce9602d5d4c37f0a3c9232

      The latter was included in v3.18, the former in v4.1.

  3. Alex, I have read your blog many times and thanks for the insight; I finally understand IOMMU groups. Right now I have run into a Mellanox NIC with SR-IOV capability where the IOMMU groups put the PF device and the VF devices in the same group, which they shouldn't be.

    root@vm-ha:~# uname -r
    4.2.3-1-pve

    iommu boot kernel flag
    GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on mlx4_core.port_type_array=1,1 mlx4_core.num_vfs=8 mlx4_core.probe_vf=0"

    dmesg | grep -e DMAR -e IOMMU
    [ 0.000000] ACPI: DMAR 0x00000000CFF60442 000110 (v01 Intel OEMDMAR 06040000 LOHR 00000001)
    [ 0.000000] DMAR: IOMMU enabled
    .......

    lspci -vnn|grep Mellanox
    07:00.0 Network controller [0280]: Mellanox Technologies MT27500 Family [ConnectX-3] [15b3:1003]
    Subsystem: Mellanox Technologies Device [15b3:0024]
    07:00.1 Network controller [0280]: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function] [15b3:1004]
    Subsystem: Mellanox Technologies Device [15b3:61b0]
    07:00.2 Network controller [0280]: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function] [15b3:1004]
    Subsystem: Mellanox Technologies Device [15b3:61b0]
    07:00.3 Network controller [0280]: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function] [15b3:1004]
    Subsystem: Mellanox Technologies Device [15b3:61b0]
    07:00.4 Network controller [0280]: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function] [15b3:1004]
    Subsystem: Mellanox Technologies Device [15b3:61b0]
    ......

    root@vm-ha:~# find /sys/kernel/iommu_groups/ -type l|grep :07:
    /sys/kernel/iommu_groups/4/devices/0000:07:00.0
    /sys/kernel/iommu_groups/4/devices/0000:07:00.1
    /sys/kernel/iommu_groups/4/devices/0000:07:00.2
    /sys/kernel/iommu_groups/4/devices/0000:07:00.3
    /sys/kernel/iommu_groups/4/devices/0000:07:00.4
    /sys/kernel/iommu_groups/4/devices/0000:07:00.5
    /sys/kernel/iommu_groups/4/devices/0000:07:00.6
    /sys/kernel/iommu_groups/4/devices/0000:07:00.7
    /sys/kernel/iommu_groups/4/devices/0000:07:01.0

    root@vm-ha:~# lspci | grep -i 'root' | cut -d ' ' -f 1 | xargs -I {} sudo lspci -vvvnn -s {}
    returns nothing

    Intel Xeon CPU E5472

    -[0000:00]-+-00.0
               +-01.0-[01]--
               +-03.0-[02-05]--+-00.0-[03-04]----00.0-[04]--
               |               \-00.3-[05]--
               +-05.0-[06]----00.0
               +-07.0-[07]--+-00.0
               |            +-00.1
               |            +-00.2
               |            +-00.3
               |            +-00.4
               |            +-00.5
               |            +-00.6
               |            +-00.7
               |            \-01.0
    ...............

    Replies
    1. Is the root port at 00:07.0 also in that IOMMU group? If so, it probably means that it lacks ACS and we've grouped everything downstream of it together because redirections can occur at the root port, allowing peer-to-peer between VFs without IOMMU translation.

      The other possibility is that this device isn't really a true SR-IOV device, but instead manages VFs in device firmware. Does 07:00.0 actually have an SR-IOV capability? If this is the case, then Linux doesn't have any reason to handle these "VFs" differently from a multifunction device and expects ACS per function to expose the isolation. If so, then we'd need Mellanox to vouch for the isolation between ports so that it can be quirked.

    2. Dear Alex

      PCI device 07:00.0 should be the PF, but it is in the same IOMMU group 4 as the VFs

      root@vm-ha:~/Downloads# lspci -vvnn -s 07:00.0
      07:00.0 Network controller [0280]: Mellanox Technologies MT27500 Family [ConnectX-3] [15b3:1003]
      Subsystem: Mellanox Technologies Device [15b3:0024]
      Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
      Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
      ARICap: MFVC- ACS-, Next Function: 0
      ARICtl: MFVC- ACS-, Function Group: 0
      Capabilities: [148 v1] Device Serial Number 00-02-c9-03-00-f7-a9-90
      Capabilities: [108 v1] Single Root I/O Virtualization (SR-IOV)
      IOVCap: Migration-, Interrupt Message Number: 000
      IOVCtl: Enable+ Migration- Interrupt- MSE+ ARIHierarchy-
      IOVSta: Migration-
      Initial VFs: 8, Total VFs: 8, Number of VFs: 8, Function Dependency Link: 00
      VF offset: 1, stride: 1, Device ID: 1004
      Supported Page Size: 000007ff, System Page Size: 00000001
      Region 2: Memory at 00000000d9800000 (64-bit, prefetchable)
      VF Migration: offset: 00000000, BIR: 0
      Capabilities: [154 v2] Advanced Error Reporting
      UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
      UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
      UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
      CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
      CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
      AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
      Capabilities: [18c v1] #19
      Kernel driver in use: mlx4_core

      The card is a Mellanox MCX354A-FCBT and according to Mellanox, the card definitely supports SR-IOV

      https://community.mellanox.com/docs/DOC-1317
      https://community.mellanox.com/docs/DOC-1484

    3. But is the root port at 00:07.0 also in that group (that's 00:07.0 not 07:00.0). If the root port doesn't have ACS, then everything behind it will be in the same group.
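
      For instance, something along these lines will answer that (purely illustrative, reusing the addresses from this thread):

      # Which IOMMU group does the root port itself belong to?
      readlink -f /sys/bus/pci/devices/0000:00:07.0/iommu_group
      # And which devices share that group?
      ls /sys/bus/pci/devices/0000:00:07.0/iommu_group/devices/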

  4. You are right... so it's a mobo issue...
    /sys/kernel/iommu_groups/4/devices/0000:00:07.0
    /sys/kernel/iommu_groups/4/devices/0000:07:00.0
    /sys/kernel/iommu_groups/4/devices/0000:07:00.1
    /sys/kernel/iommu_groups/4/devices/0000:07:00.2
    /sys/kernel/iommu_groups/4/devices/0000:07:00.3
    /sys/kernel/iommu_groups/4/devices/0000:07:00.4
    /sys/kernel/iommu_groups/4/devices/0000:07:00.5
    /sys/kernel/iommu_groups/4/devices/0000:07:00.6
    /sys/kernel/iommu_groups/4/devices/0000:07:00.7
    /sys/kernel/iommu_groups/4/devices/0000:07:01.0

  5. Well, more of a processor issue in this case given the root port you're using. X58/ICH systems seem to be too old for Intel to investigate whether the processor or chipset provides any isolation and to give us a procedure for enabling it via quirks if it does. Your options are probably either to patch your kernel with the ACS override patch, risking that the isolation isn't actually there, use legacy KVM device assignment, which is deprecated upstream and carries the same isolation risk, or upgrade your hardware. If you want to use this card in a processor root port on Intel with VFs assignable to separate VMs, you'll need a High End Desktop processor or a Xeon E5 or better, as I describe in a recent post. If you can make do with installing the card in a chipset PCH root port, nearly any more modern platform will work (I think we have quirks going back to 5-series, plus X79 & X99 - nothing for Skylake yet).

  6. Can anyone please tell me how to create an IOMMU group and how to bind a device to an IOMMU group?

    Replies
    1. IOMMU groups are created by the IOMMU driver in the kernel; their composition cannot be manipulated by the user other than by turning the IOMMU on or off.

  7. By IOMMU driver, do you mean the vfio-pci driver?

    Replies
    1. No, the VT-d or AMD-Vi driver (intel_iommu=on or amd_iommu=on)

    2. I had changed the BIOS setting and the following command gives me this:
      dmesg | grep -i iommu
      [ 0.052478] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c90780106f0462 ecap f020e3

      Now my objective is to make the following ethernet ports use the IOMMU:
      0000:01:00.0 'Ethernet Controller 10-Gigabit X540-AT2' unused=vfio-pci
      0000:01:00.1 'Ethernet Controller 10-Gigabit X540-AT2' if=eth3 drv=ixgbe unused=vfio-pci

      I am not able to bind these devices to the vfio-pci driver.
      I tried
      echo "0000:01:00.0" > /sys/bus/pci/drivers/vfio-pci/bind
      but it gave
      bash: echo: write error: No such device

      Could you please help me fulfill my objective?

      Thanks

    3. You're entirely missing how Linux matches PCI devices to drivers. You either need to make the driver match the device by echo'ing the PCI device and vendor ID to the new_id file or make the device match the driver by echo'ing the driver name to the driver_override file of the device in sysfs.

    4. lspci -n -s 0000:01:00.0
      01:00.0 0200: 8086:1528 (rev 01)
      echo "8086 1528" > /sys/bus/pci/drivers/vfio-pci/new_id
      echo "vfio-pci" > /sys/bus/pci/devices/0000\:01\:00.0/driver_override
      echo 0000:01:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
      bash: echo: write error: No such device

    5. The device needs to be unbound from any other driver (ixgbe) before it can be bound to vfio-pci.
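
      In other words, the full sequence looks roughly like this (a sketch only, reusing the addresses from this thread; run as root):

      # Release the port from ixgbe first
      echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind
      # Teach vfio-pci to match this vendor/device ID (from lspci -n)
      echo "8086 1528" > /sys/bus/pci/drivers/vfio-pci/new_id
      # If it didn't bind automatically after new_id, bind it explicitly
      echo 0000:01:00.0 > /sys/bus/pci/drivers/vfio-pci/bind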

  8. As shown in my comment @3:13 pm
    I had already unbound 0000:01:00.0.
    The problem is that during binding, Ubuntu is not recognising the device.

    As DPDK provides a loadable igb_uio module, can I similarly get the vfio module as a loadable module so that I could make changes in the module and control the IOMMU in my own way?

    Replies
    1. I don't see any unbind in your comment @3:13pm. If I try to bind a device to vfio-pci that's already bound to another driver, I get the same "No such device" error. The vfio modules are already loadable modules, depending on how your distribution builds them.

    2. I found my mistake: I didn't make the changes in /etc/default/grub properly, because of which no IOMMU group was assigned to the device, and hence vfio was not able to recognize the device.
      Thank you for your time,

  9. Do you know how to set a different page size for the pages used by the IOMMU?
    Specifically, what ioctl command should I use? I have support for 2MB hugepages only.

    Replies
    1. The vfio type1 iommu driver will pin pages and pass the largest contiguous chunk available to the IOMMU API for mapping by the hardware specific iommu drivers (ex. VT-d). The IOMMU API layer breaks this down into page sizes reported by the hardware driver. Therefore, to make use of iommu superpages, use hugepages for allocating the buffers to be mapped and perform VFIO_IOMMU_MAP_DMA with size and alignment sufficient to at least map a full hugepage, and superpage use by the hardware iommu will be automatic. This method also facilitates opportunistic use of iommu superpages and enables iommu hardware that supports arbitrary page sizes, like AMD-Vi. There's a disable_hugepages module option to vfio_iommu_type1 that will skip the check for contiguous pages and map using only PAGE_SIZE chunks for testing and debugging purposes.
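
      As a host-side illustration only (an assumed setup, not part of the type1 driver itself): reserve 2MB hugepages and back the buffers you intend to map with them, and superpage use follows automatically; the module option is purely for testing:

      # Reserve 1024 x 2MB hugepages to back the buffers that will be mapped
      echo 1024 > /proc/sys/vm/nr_hugepages
      # Testing/debugging only: make type1 map with PAGE_SIZE chunks instead
      modprobe vfio_iommu_type1 disable_hugepages=1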

  10. From the above, what I understood is:
    if my
    hugepages are of size 4k, 16k, 2M
    and IOMMU superpages are of size 4k, 8k, 4M
    then
    a 4k hugepage is mapped to one 4k superpage
    a 16k hugepage is mapped to two 8k superpages
    a 2M hugepage is mapped to 256 8k superpages.

    So now for a 2M hugepage there will be 256 TLB misses, although this could be reduced to 1 TLB miss if the IOMMU had supported a 2M superpage.

    Replies
    1. That's correct, but note that if you happen to get contiguous pages of any kind sufficient for the next step, we'll opportunistically use the larger iommu superpage. For instance, 2 contiguous 4k hugepages will use an 8k iommu superpage, 2 contiguous 2M hugepages will use a 4M iommu superpage. BTW, that's really unusual for the processor and iommu to support different size pages.

    2. Suppose we have a DMA transfer of 4 hugepages of 2M each in one transaction, and out of the 4, two happen to be contiguous; then 3 TLB entries would be created for that DMA transaction.
      Assume that the IOMMU supports 4M and 2M pages.

  11. Dear Alex, I sent a message to the libvirt list (https://www.redhat.com/archives/libvir-list/2015-December/msg00818.html) about multi-function PCI hot-plug. Laine explained to me that my assumptions were incomplete, especially on IOMMU grouping.

    The most intriguing sentence was: "(the most commonly used example is if the controller for the host's main disk happens to be in the same iommu group as some device that you're trying to assign)". But thanks to this post I understand that it can happen due to the lack of ACS support.

    However, in a scenario like that there is no way to have a 'safe' PCI passthrough. Even a cold-plug attachment would require all the devices under the same IOMMU group to be unbound from the host drivers and bound to the vfio driver before starting the guest, right? If I get it right, maybe it would be feasible to offer multi-function hot-plug support only for devices supporting ACS. What do you think?

    Thank you for sharing all this information; basically all that I know of it I have learned from your posts and patches! :)

    Replies
    1. A cold-plug attachment has the same constraints with IOMMU group isolation as hotplug. Endpoints within a group may not be split between host and guest usage.

      The issues of IOMMU groups and multifunction hot-plug are mostly orthogonal. IOMMU groups always contain one or more functions in the PCI space. An IOMMU group may contain all the functions of a multi-function device, or it may contain all the functions within an entire PCI hierarchy. So basing hotplug on an IOMMU group makes very little sense.

      QEMU now supports hot-add of multifunction devices using the methodology Laine described, exposing the device to the guest only when PCI function 0 is added. vfio-pci devices can take advantage of this just like any other QEMU PCI device. It seems to me that libvirt simply needs to be updated to support this. The only vfio or IOMMU group special case may be that libvirt might need to bind all of the devices for the slot before attaching any of them to QEMU, otherwise we might get into the scenario of attempting to split an IOMMU group between host and guest while we're in the process of filling the PCI slot.

  12. I am trying to create a patched kernel for Fedora 23. I have source for 4.5.1; I can compile it and run the 4.5.1 kernel no problem. But I can't get an ACS patch to compile. I have tried a few different versions. To compile the kernel I am doing make menuconfig, make rpm. Then I can install the RPM and boot from it. What I am doing for ACS is make menuconfig, then cat pathtofile/acs_patch.diff | patch -p1. At this point some of them succeed but fail on make rpm; some ACS patch versions fail at the patch -p1 step, for example:

    patching file Documentation/kernel-parameters.txt
    Hunk #1 FAILED at 2819.
    1 out of 1 hunk FAILED -- saving rejects to file Documentation/kernel-parameters.txt.rej
    patching file drivers/pci/quirks.c
    Hunk #1 succeeded at 3677 with fuzz 2 (offset 16 lines).
    Hunk #2 FAILED at 3855.
    1 out of 2 hunks FAILED -- saving rejects to file drivers/pci/quirks.c.rej

    This is my first time compiling a kernel, so I could be way off in the process. Anything you can suggest for how to work through this? I have spent a lot of time trying to find info on how to do this and information is sparse; it all seems to assume you know a ton about patching, compiling kernels and the like.

    Any help would be appreciated.

    Replies
    1. If the kernel has changed too much, the patch file cannot be applied. This is what the "FAILED" messages mean.

      You need to find a version of the patch that matches a similar kernel (if one exists).

  13. Hi Alex. I am testing VFIO on an x86 computer, but I find that there are no iommu_groups under /sys/kernel/iommu_groups. I have inserted all the vfio_* modules. Could you give me a hint? Did I miss some configuration when I compiled the kernel image and modules?

    Replies
    1. 1. Enable VT-d support in BIOS
      2. Add "intel_iommu=on" to boot option
      3. Reboot, check "dmesg | grep IOMMU" to see if it is supported
      4. Insert all vfio_* modules


Comments are not a support forum. For help with problems, please try the vfio-users mailing list (https://www.redhat.com/mailman/listinfo/vfio-users)