Fri Jun 24 13:07:52 2022 UTC (bouyer)

Apply patches for Xen security advisories XSA-397 through XSA-402, plus
XSA-404 (XSA-403 has not yet been released).

Bump PKGREVISION.

diff -r1.5 -r1.6 pkgsrc/sysutils/xenkernel415/Makefile
diff -r1.5 -r1.6 pkgsrc/sysutils/xenkernel415/distinfo
diff -r0 -r1.1 pkgsrc/sysutils/xenkernel415/patches/patch-XSA397
diff -r0 -r1.1 pkgsrc/sysutils/xenkernel415/patches/patch-XSA398
diff -r0 -r1.1 pkgsrc/sysutils/xenkernel415/patches/patch-XSA399
diff -r0 -r1.1 pkgsrc/sysutils/xenkernel415/patches/patch-XSA400
diff -r0 -r1.1 pkgsrc/sysutils/xenkernel415/patches/patch-XSA401
diff -r0 -r1.1 pkgsrc/sysutils/xenkernel415/patches/patch-XSA402
diff -r0 -r1.1 pkgsrc/sysutils/xenkernel415/patches/patch-XSA404

cvs diff -r1.5 -r1.6 pkgsrc/sysutils/xenkernel415/Makefile

--- pkgsrc/sysutils/xenkernel415/Makefile 2022/04/30 00:21:15 1.5
+++ pkgsrc/sysutils/xenkernel415/Makefile 2022/06/24 13:07:52 1.6
@@ -1,19 +1,19 @@
-# $NetBSD: Makefile,v 1.5 2022/04/30 00:21:15 khorben Exp $
+# $NetBSD: Makefile,v 1.6 2022/06/24 13:07:52 bouyer Exp $
 
 VERSION=	4.15.2
 DISTNAME=	xen-${VERSION}
 PKGNAME=	xenkernel415-${VERSION}
-PKGREVISION=	1
+PKGREVISION=	2
 CATEGORIES=	sysutils
 MASTER_SITES=	https://downloads.xenproject.org/release/xen/${VERSION}/
 DIST_SUBDIR=	xen415
 
 MAINTAINER=	bouyer@NetBSD.org
 HOMEPAGE=	https://xenproject.org/
 COMMENT=	Xen 4.15.x Kernel
 
 LICENSE=	gnu-gpl-v2
 
 ONLY_FOR_PLATFORM=	NetBSD-*.*-x86_64
 
 SSP_SUPPORTED=	no

cvs diff -r1.5 -r1.6 pkgsrc/sysutils/xenkernel415/distinfo

--- pkgsrc/sysutils/xenkernel415/distinfo 2022/03/04 17:54:08 1.5
+++ pkgsrc/sysutils/xenkernel415/distinfo 2022/06/24 13:07:52 1.6
@@ -1,14 +1,21 @@
-$NetBSD: distinfo,v 1.5 2022/03/04 17:54:08 bouyer Exp $
+$NetBSD: distinfo,v 1.6 2022/06/24 13:07:52 bouyer Exp $
 
 BLAKE2s (xen415/xen-4.15.2.tar.gz) = f6e3d354a144c9ff49a198ebcafbd5e8a4414690d5672b3e2ed394c461ab8ab0
 SHA512 (xen415/xen-4.15.2.tar.gz) = 1cbf988fa8ed38b7ad724978958092ca0e5506e38c709c7d1af196fb8cb8ec0197a79867782761ef230b268624b3d7a0d5d0cd186f37d25f495085c71bf70d54
 Size (xen415/xen-4.15.2.tar.gz) = 40773378 bytes
 SHA1 (patch-Config.mk) = 9372a09efd05c9fbdbc06f8121e411fcb7c7ba65
+SHA1 (patch-XSA397) = caf9698a8817ae0728da9be6f2018392c9ab6634
+SHA1 (patch-XSA398) = e4fff05675bcf231f9fdf99e9773d1389cd0660c
+SHA1 (patch-XSA399) = c9ab4473654810ca2701dfc38c26e91a0d7f2eb5
+SHA1 (patch-XSA400) = 33d3ae929427ef3e8c74f9e1c36fc1d7e742a8f3
+SHA1 (patch-XSA401) = 8589aa9465c9416b4266beaad37a843de9906add
+SHA1 (patch-XSA402) = 5fe64577fcc249e202591d3a88ab423dbaf0ae42
+SHA1 (patch-XSA404) = ffb441cb248988b679707387e878ad0908082131
 SHA1 (patch-xen_Makefile) = 465388d80de414ca3bb84faefa0f52d817e423a6
 SHA1 (patch-xen_Rules.mk) = c743dc63f51fc280d529a7d9e08650292c171dac
 SHA1 (patch-xen_arch_x86_Kconfig) = df14bfa09b9a0008ca59d53c938d43a644822dd9
 SHA1 (patch-xen_arch_x86_Rules.mk) = 54392a7d719a21bc625a96b673056f88b96df97d
 SHA1 (patch-xen_arch_x86_boot_build32.mk) = c7e57ee41ebf29ced32146945bddca7b482c6a49
 SHA1 (patch-xen_arch_x86_extable.c) = e439e6f3fe704d9b2894fc6b9e8f23f321a00f05
 SHA1 (patch-xen_arch_x86_mm_p2m.c) = 6e9b84dc8448eca9677f184e720bbfcb3c6d314e
 SHA1 (patch-xen_drivers_passthrough_x86_iommu.c) = 8b3a36a019490b1d125ea1f74274435382797da1

File Added: pkgsrc/sysutils/xenkernel415/patches/Attic/patch-XSA397
$NetBSD: patch-XSA397,v 1.1 2022/06/24 13:07:52 bouyer Exp $

From: Roger Pau Monne <roger.pau@citrix.com>
Subject: x86/hap: do not switch on log dirty for VRAM tracking

XEN_DMOP_track_dirty_vram possibly calls into paging_log_dirty_enable
when using HAP mode, and it can interact badly with other ongoing
paging domctls, as XEN_DMOP_track_dirty_vram is not holding the domctl
lock.

This was detected as a result of the following assert triggering when
doing repeated migrations of a HAP HVM domain with a stubdom:

Assertion 'd->arch.paging.log_dirty.allocs == 0' failed at paging.c:198
----[ Xen-4.17-unstable  x86_64  debug=y  Not tainted ]----
CPU:    34
RIP:    e008:[<ffff82d040314b3b>] arch/x86/mm/paging.c#paging_free_log_dirty_bitmap+0x606/0x63a
RFLAGS: 0000000000010206   CONTEXT: hypervisor (d0v23)
[...]
Xen call trace:
   [<ffff82d040314b3b>] R arch/x86/mm/paging.c#paging_free_log_dirty_bitmap+0x606/0x63a
   [<ffff82d040279f96>] S xsm/flask/hooks.c#domain_has_perm+0x5a/0x67
   [<ffff82d04031577f>] F paging_domctl+0x251/0xd41
   [<ffff82d04031640c>] F paging_domctl_continuation+0x19d/0x202
   [<ffff82d0403202fa>] F pv_hypercall+0x150/0x2a7
   [<ffff82d0403a729d>] F lstar_enter+0x12d/0x140

This assertion triggered because the stubdom used
XEN_DMOP_track_dirty_vram while dom0 was in the middle of executing
XEN_DOMCTL_SHADOW_OP_OFF, and so log dirty became enabled while the old
structures were being retired, thus leading to new entries being
populated in already-cleared slots.

Fix this by not enabling log dirty for VRAM tracking, similar to what
is done when using shadow instead of HAP. Call
p2m_enable_hardware_log_dirty when enabling VRAM tracking in order to
get some hardware assistance if available. As a side effect the memory
pressure on the p2m pool should go down if only VRAM tracking is
enabled, as the dirty bitmap is no longer allocated.

Note that paging_log_dirty_range (used to get the dirty bitmap for
VRAM tracking) doesn't use the log dirty bitmap, and instead relies on
checking whether each gfn on the range has been switched from
p2m_ram_logdirty to p2m_ram_rw in order to account for dirty pages.

This is CVE-2022-26356 / XSA-397.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
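
The accounting described above can be illustrated with a small standalone C
sketch (illustrative names only, not the Xen implementation): frames are
"armed" as logdirty, a write flips them back to rw, and the peek reports
exactly the flipped frames, so no separate dirty bitmap is needed.

/*
 * Minimal sketch of dirty-VRAM accounting via p2m type changes.
 * A frame is dirty iff its type switched from logdirty back to rw.
 */
#include <stdio.h>

enum p2m_type { P2M_RAM_RW, P2M_RAM_LOGDIRTY };

#define NR_FRAMES 8
static enum p2m_type frame_type[NR_FRAMES];

/* Arm tracking: mark every frame in the range logdirty (read-only). */
static void track_dirty_vram_arm(unsigned int start, unsigned int nr)
{
    for (unsigned int i = start; i < start + nr; i++)
        frame_type[i] = P2M_RAM_LOGDIRTY;
}

/* A guest write faults on a logdirty frame; the handler flips the type
 * back to rw, which is what later marks the frame as dirty. */
static void guest_write(unsigned int frame)
{
    if (frame_type[frame] == P2M_RAM_LOGDIRTY)
        frame_type[frame] = P2M_RAM_RW;
}

/* Report dirty frames into a bitmap and re-arm them. */
static void peek_dirty(unsigned int start, unsigned int nr,
                       unsigned char *bitmap)
{
    for (unsigned int i = start; i < start + nr; i++)
        if (frame_type[i] == P2M_RAM_RW) {
            bitmap[(i - start) / 8] |= 1u << ((i - start) % 8);
            frame_type[i] = P2M_RAM_LOGDIRTY;
        }
}

int main(void)
{
    unsigned char bitmap[1] = { 0 };

    track_dirty_vram_arm(0, NR_FRAMES);
    guest_write(2);
    guest_write(5);
    peek_dirty(0, NR_FRAMES, bitmap);
    printf("dirty bitmap: 0x%02x\n", bitmap[0]);  /* 0x24: frames 2 and 5 */
    return 0;
}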

--- xen/include/asm-x86/paging.h.orig
+++ xen/include/asm-x86/paging.h
@@ -162,9 +162,6 @@ void paging_log_dirty_range(struct domai
                             unsigned long nr,
                             uint8_t *dirty_bitmap);
 
-/* enable log dirty */
-int paging_log_dirty_enable(struct domain *d, bool log_global);
-
 /* log dirty initialization */
 void paging_log_dirty_init(struct domain *d, const struct log_dirty_ops *ops);
 
--- xen/arch/x86/mm/hap/hap.c.orig
+++ xen/arch/x86/mm/hap/hap.c
@@ -69,13 +69,6 @@ int hap_track_dirty_vram(struct domain *
     {
         unsigned int size = DIV_ROUND_UP(nr_frames, BITS_PER_BYTE);
 
-        if ( !paging_mode_log_dirty(d) )
-        {
-            rc = paging_log_dirty_enable(d, false);
-            if ( rc )
-                goto out;
-        }
-
         rc = -ENOMEM;
         dirty_bitmap = vzalloc(size);
         if ( !dirty_bitmap )
@@ -107,6 +100,10 @@ int hap_track_dirty_vram(struct domain *
 
             paging_unlock(d);
 
+            domain_pause(d);
+            p2m_enable_hardware_log_dirty(d);
+            domain_unpause(d);
+
             if ( oend > ostart )
                 p2m_change_type_range(d, ostart, oend,
                                       p2m_ram_logdirty, p2m_ram_rw);
--- xen/arch/x86/mm/paging.c.orig
+++ xen/arch/x86/mm/paging.c
@@ -211,7 +211,7 @@ static int paging_free_log_dirty_bitmap(
     return rc;
 }
 
-int paging_log_dirty_enable(struct domain *d, bool log_global)
+static int paging_log_dirty_enable(struct domain *d, bool log_global)
 {
     int ret;
 

File Added: pkgsrc/sysutils/xenkernel415/patches/Attic/patch-XSA398
$NetBSD: patch-XSA398,v 1.1 2022/06/24 13:07:52 bouyer Exp $

From 1b50f41b3bd800eb72064063da0c64b86d629f3a Mon Sep 17 00:00:00 2001
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Mon, 7 Mar 2022 16:35:52 +0000
Subject: x86/spec-ctrl: Cease using thunk=lfence on AMD

AMD have updated their Spectre v2 guidance, and lfence/jmp is no longer
considered safe.  AMD are recommending using retpoline everywhere.

Retpoline is incompatible with CET.  All CET-capable hardware has efficient
IBRS (specifically, not something retrofitted in microcode), so use IBRS (and
STIBP for consistency's sake).

This is a logical change on AMD, but not on Intel as the default calculations
would end up with these settings anyway.  Leave behind a message if IBRS is
found to be missing.

Also update the default heuristics to never select THUNK_LFENCE.  This causes
AMD CPUs to change their default to retpoline.

Also update the printed message to include the AMD MSR_SPEC_CTRL settings, and
STIBP now that we set it for consistency's sake.

This is part of XSA-398 / CVE-2021-26401.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 8d03080d2a339840d3a59e0932a94f804e45110d)
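
For reference, the three thunk flavours named above look roughly as follows.
This is a hedged, standalone x86-64 sketch (GCC/Clang inline asm, illustrative
symbol names), not Xen's generated code; the thunks expect the branch target
in %rax, mirroring __x86_indirect_thunk_rax.

#include <stdio.h>

asm (".text\n\t"
     /* retpoline: trap speculation in the pause/lfence loop, then
      * overwrite the return address so execution reaches *%rax. */
     ".globl thunk_retpoline_rax\n"
     "thunk_retpoline_rax:\n\t"
     "call 1f\n"
     "2:\tpause\n\t"
     "lfence\n\t"
     "jmp 2b\n"
     "1:\tmov %rax, (%rsp)\n\t"
     "ret\n"
     /* lfence; jmp: dispatch-serialising on AMD, but per this advisory
      * no longer considered a sufficient Spectre v2 mitigation. */
     ".globl thunk_lfence_rax\n"
     "thunk_lfence_rax:\n\t"
     "lfence\n\t"
     "jmp *%rax\n"
     /* plain jmp: minimal overhead, no mitigation at all. */
     ".globl thunk_jmp_rax\n"
     "thunk_jmp_rax:\n\t"
     "jmp *%rax\n");

static long target(void) { return 42; }

/* Call target() through the retpoline thunk, loading it into %rax the
 * way compiler-patched indirect call sites would. */
static long call_through_retpoline(void)
{
    long ret;

    asm volatile ("call thunk_retpoline_rax"
                  : "=a" (ret)
                  : "0" ((long)target)
                  : "rcx", "rdx", "rsi", "rdi",
                    "r8", "r9", "r10", "r11", "memory", "cc");
    return ret;
}

int main(void)
{
    printf("via retpoline: %ld\n", call_through_retpoline());
    return 0;
}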

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 443802b3d2e5..2392537954c8 100644
--- docs/misc/xen-command-line.pandoc.orig
+++ docs/misc/xen-command-line.pandoc
@@ -2205,9 +2205,9 @@ to use.
 
 If Xen was compiled with INDIRECT_THUNK support, `bti-thunk=` can be used to
 select which of the thunks gets patched into the `__x86_indirect_thunk_%reg`
-locations.  The default thunk is `retpoline` (generally preferred for Intel
-hardware), with the alternatives being `jmp` (a `jmp *%reg` gadget, minimal
-overhead), and `lfence` (an `lfence; jmp *%reg` gadget, preferred for AMD).
+locations.  The default thunk is `retpoline` (generally preferred), with the
+alternatives being `jmp` (a `jmp *%reg` gadget, minimal overhead), and
+`lfence` (an `lfence; jmp *%reg` gadget).
 
 On hardware supporting IBRS (Indirect Branch Restricted Speculation), the
 `ibrs=` option can be used to force or prevent Xen using the feature itself.
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 9301d95bd705..7ded6ecba197 100644
--- xen/arch/x86/spec_ctrl.c.orig
+++ xen/arch/x86/spec_ctrl.c
@@ -367,14 +367,19 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
                "\n");
 
     /* Settings for Xen's protection, irrespective of guests. */
-    printk("  Xen settings: BTI-Thunk %s, SPEC_CTRL: %s%s%s, Other:%s%s%s%s%s\n",
+    printk("  Xen settings: BTI-Thunk %s, SPEC_CTRL: %s%s%s%s, Other:%s%s%s%s%s\n",
            thunk == THUNK_NONE      ? "N/A" :
            thunk == THUNK_RETPOLINE ? "RETPOLINE" :
            thunk == THUNK_LFENCE    ? "LFENCE" :
            thunk == THUNK_JMP       ? "JMP" : "?",
-           !boot_cpu_has(X86_FEATURE_IBRSB)          ? "No" :
+           (!boot_cpu_has(X86_FEATURE_IBRSB) &&
+            !boot_cpu_has(X86_FEATURE_IBRS))         ? "No" :
            (default_xen_spec_ctrl & SPEC_CTRL_IBRS)  ? "IBRS+" :  "IBRS-",
-           !boot_cpu_has(X86_FEATURE_SSBD)           ? "" :
+           (!boot_cpu_has(X86_FEATURE_STIBP) &&
+            !boot_cpu_has(X86_FEATURE_AMD_STIBP))    ? "" :
+           (default_xen_spec_ctrl & SPEC_CTRL_STIBP) ? " STIBP+" : " STIBP-",
+           (!boot_cpu_has(X86_FEATURE_SSBD) &&
+            !boot_cpu_has(X86_FEATURE_AMD_SSBD))     ? "" :
            (default_xen_spec_ctrl & SPEC_CTRL_SSBD)  ? " SSBD+" : " SSBD-",
            !(caps & ARCH_CAPS_TSX_CTRL)              ? "" :
            (opt_tsx & 1)                             ? " TSX+" : " TSX-",
@@ -916,10 +921,23 @@ void __init init_speculation_mitigations(void)
     /*
      * First, disable the use of retpolines if Xen is using shadow stacks, as
      * they are incompatible.
+     *
+     * In the absence of retpolines, IBRS needs to be used for speculative
+     * safety.  All CET-capable hardware has efficient IBRS.
      */
-    if ( cpu_has_xen_shstk &&
-         (opt_thunk == THUNK_DEFAULT || opt_thunk == THUNK_RETPOLINE) )
-        thunk = THUNK_JMP;
+    if ( cpu_has_xen_shstk )
+    {
+        if ( !boot_cpu_has(X86_FEATURE_IBRSB) )
+            printk(XENLOG_WARNING "?!? CET active, but no MSR_SPEC_CTRL?\n");
+        else if ( opt_ibrs == -1 )
+        {
+            opt_ibrs = ibrs = true;
+            default_xen_spec_ctrl |= SPEC_CTRL_IBRS | SPEC_CTRL_STIBP;
+        }
+
+        if ( opt_thunk == THUNK_DEFAULT || opt_thunk == THUNK_RETPOLINE )
+            thunk = THUNK_JMP;
+    }
 
     /*
      * Has the user specified any custom BTI mitigations?  If so, follow their
@@ -951,16 +951,10 @@
         if ( IS_ENABLED(CONFIG_INDIRECT_THUNK) )
         {
             /*
-             * AMD's recommended mitigation is to set lfence as being dispatch
-             * serialising, and to use IND_THUNK_LFENCE.
-             */
-            if ( cpu_has_lfence_dispatch )
-                thunk = THUNK_LFENCE;
-            /*
-             * On Intel hardware, we'd like to use retpoline in preference to
+             * On all hardware, we'd like to use retpoline in preference to
              * IBRS, but only if it is safe on this hardware.
              */
-            else if ( retpoline_safe(caps) )
+            if ( retpoline_safe(caps) )
                 thunk = THUNK_RETPOLINE;
             else if ( boot_cpu_has(X86_FEATURE_IBRSB) )
                 ibrs = true;

File Added: pkgsrc/sysutils/xenkernel415/patches/Attic/patch-XSA399
$NetBSD: patch-XSA399,v 1.1 2022/06/24 13:07:52 bouyer Exp $

From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: correct ordering of operations in cleanup_domid_map()

The function may be called without any locks held (leaving aside the
domctl one, which we surely don't want to depend on here), so needs to
play safe wrt other accesses to domid_map[] and domid_bitmap[]. This is
to avoid context_set_domain_id()'s writing of domid_map[] to be reset to
zero right away in the case of it racing the freeing of a DID.

For the interaction with context_set_domain_id() and ->domid_map[] reads
see the code comment.

{check_,}cleanup_domid_map() are called with pcidevs_lock held or during
domain cleanup only (and pcidevs_lock is also held around
context_set_domain_id()), i.e. racing calls with the same (dom, iommu)
tuple cannot occur.

domain_iommu_domid(), besides its use by cleanup_domid_map(), has its
result used only to control flushing, and hence a stale result would
only lead to a stray extra flush.

This is CVE-2022-26357 / XSA-399.

Fixes: b9c20c78789f ("VT-d: per-iommu domain-id")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
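
A standalone sketch (C11 atomics, illustrative names, not the Xen code) of
the ordering the patch enforces: the map slot is invalidated before the
bitmap bit is cleared, so a racing allocator that immediately reuses the DID
never has its freshly written domid_map[] entry clobbered by a stale reset,
and readers see DOMID_INVALID rather than a stale domain ID.

#include <stdatomic.h>
#include <stdint.h>

#define DOMID_INVALID 0x7ff4u   /* Xen's invalid-domid sentinel */
#define NR_DIDS 16

static _Atomic uint16_t domid_map[NR_DIDS];
static _Atomic unsigned long domid_bitmap;   /* bit set => DID in use */

static void cleanup_domid_map(unsigned int did)
{
    /* Update domid_map[] /before/ domid_bitmap[], as in the patch:
     * once the bit is clear the DID may be re-allocated at any time. */
    atomic_store_explicit(&domid_map[did], DOMID_INVALID,
                          memory_order_release);
    atomic_fetch_and_explicit(&domid_bitmap, ~(1UL << did),
                              memory_order_release);
}

static int context_set_domain_id(uint16_t domid)
{
    for (unsigned int did = 0; did < NR_DIDS; did++) {
        /* Claim a free bit first; only the owner of the bit writes
         * the corresponding map slot. */
        unsigned long old = atomic_fetch_or_explicit(
            &domid_bitmap, 1UL << did, memory_order_acquire);

        if (!(old & (1UL << did))) {
            atomic_store_explicit(&domid_map[did], domid,
                                  memory_order_release);
            return (int)did;
        }
    }
    return -1;
}

int main(void)
{
    int did = context_set_domain_id(5);

    if (did >= 0)
        cleanup_domid_map((unsigned int)did);
    return 0;
}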

--- xen/drivers/passthrough/vtd/iommu.c.orig
+++ xen/drivers/passthrough/vtd/iommu.c
@@ -152,8 +152,14 @@ static void cleanup_domid_map(struct dom
 
     if ( iommu_domid >= 0 )
     {
+        /*
+         * Update domid_map[] /before/ domid_bitmap[] to avoid a race with
+         * context_set_domain_id(), setting the slot to DOMID_INVALID for
+         * ->domid_map[] reads to produce a suitable value while the bit is
+         * still set.
+         */
+        iommu->domid_map[iommu_domid] = DOMID_INVALID;
         clear_bit(iommu_domid, iommu->domid_bitmap);
-        iommu->domid_map[iommu_domid] = 0;
     }
 }
 

File Added: pkgsrc/sysutils/xenkernel415/patches/Attic/patch-XSA400
$NetBSD: patch-XSA400,v 1.1 2022/06/24 13:07:52 bouyer Exp $

From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: fix (de)assign ordering when RMRRs are in use

In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully updated.

Also adjust a related log message.

This is CVE-2022-26358 / part of XSA-400.

Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
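
The ordering the patch establishes reduces to a simple bracket: the new
owner's RMRR mappings go in before the context entry flips, and the old
mappings come out only after it has flipped successfully.  A runnable
stand-in sketch (all functions are illustrative stubs, not the Xen
primitives):

#include <stdio.h>

/* Illustrative stubs standing in for the real mapping primitives. */
static int map_rmrr_identity(int domid)
{ printf("map RMRRs for d%d\n", domid); return 0; }
static void unmap_rmrr_identity(int domid)
{ printf("unmap RMRRs for d%d\n", domid); }
static int update_context_entry(int domid)
{ printf("context entry -> d%d\n", domid); return 0; }

static int assign_device(int source, int target)
{
    /* 1. Establish the new owner's RMRR mappings first: the device may
     *    touch the reserved regions as soon as its context entry flips,
     *    so they must already be in place. */
    int rc = map_rmrr_identity(target);

    if (rc)
        return rc;

    /* 2. Only then update the context entry. */
    rc = update_context_entry(target);
    if (rc) {
        unmap_rmrr_identity(target);   /* roll back on failure */
        return rc;
    }

    /* 3. The old owner's mappings are torn down only after the context
     *    entry was successfully updated, never before. */
    unmap_rmrr_identity(source);
    return 0;
}

int main(void)
{
    return assign_device(0, 1);
}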

--- xen/drivers/passthrough/vtd/iommu.c.orig
+++ xen/drivers/passthrough/vtd/iommu.c
@@ -2392,6 +2392,10 @@ static int reassign_device_ownership(
 {
     int ret;
 
+    ret = domain_context_unmap(source, devfn, pdev);
+    if ( ret )
+        return ret;
+
     /*
      * Devices assigned to untrusted domains (here assumed to be any domU)
      * can attempt to send arbitrary LAPIC/MSI messages. We are unprotected
@@ -2428,10 +2432,6 @@ static int reassign_device_ownership(
             }
     }
 
-    ret = domain_context_unmap(source, devfn, pdev);
-    if ( ret )
-        return ret;
-
     if ( devfn == pdev->devfn && pdev->domain != dom_io )
     {
         list_move(&pdev->domain_list, &dom_io->pdev_list);
@@ -2507,9 +2507,8 @@ static int intel_iommu_assign_device(
         }
     }
 
-    ret = reassign_device_ownership(s, d, devfn, pdev);
-    if ( ret || d == dom_io )
-        return ret;
+    if ( d == dom_io )
+        return reassign_device_ownership(s, d, devfn, pdev);
 
     /* Setup rmrr identity mapping */
     for_each_rmrr_device( rmrr, bdf, i )
@@ -2522,20 +2521,37 @@ static int intel_iommu_assign_device(
                                          rmrr->end_address, flag);
             if ( ret )
             {
-                int rc;
-
-                rc = reassign_device_ownership(d, s, devfn, pdev);
                 printk(XENLOG_G_ERR VTDPREFIX
-                       " cannot map reserved region (%"PRIx64",%"PRIx64"] for Dom%d (%d)\n",
-                       rmrr->base_address, rmrr->end_address,
-                       d->domain_id, ret);
-                if ( rc )
-                {
-                    printk(XENLOG_ERR VTDPREFIX
-                           " failed to reclaim %pp from %pd (%d)\n",
-                           &PCI_SBDF3(seg, bus, devfn), d, rc);
-                    domain_crash(d);
-                }
+                       "%pd: cannot map reserved region [%"PRIx64",%"PRIx64"]: %d\n",
+                       d, rmrr->base_address, rmrr->end_address, ret);
+                break;
+            }
+        }
+    }
+
+    if ( !ret )
+        ret = reassign_device_ownership(s, d, devfn, pdev);
+
+    /* See reassign_device_ownership() for the hwdom aspect. */
+    if ( !ret || is_hardware_domain(d) )
+        return ret;
+
+    for_each_rmrr_device( rmrr, bdf, i )
+    {
+        if ( rmrr->segment == seg &&
+             PCI_BUS(bdf) == bus &&
+             PCI_DEVFN2(bdf) == devfn )
+        {
+            int rc = iommu_identity_mapping(d, p2m_access_x,
+                                            rmrr->base_address,
+                                            rmrr->end_address, 0);
+
+            if ( rc && rc != -ENOENT )
+            {
+                printk(XENLOG_ERR VTDPREFIX
+                       "%pd: cannot unmap reserved region [%"PRIx64",%"PRIx64"]: %d\n",
+                       d, rmrr->base_address, rmrr->end_address, rc);
+                domain_crash(d);
                 break;
             }
         }
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: fix add/remove ordering when RMRRs are in use

In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully cleared.

Also switch to %pd in related log messages.

Fixes: fa88cfadf918 ("vt-d: Map RMRR in intel_iommu_add_device() if the device has RMRR")
Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- xen/drivers/passthrough/vtd/iommu.c.orig
+++ xen/drivers/passthrough/vtd/iommu.c
@@ -1981,14 +1981,6 @@ static int intel_iommu_add_device(u8 dev
     if ( !pdev->domain )
         return -EINVAL;
 
-    ret = domain_context_mapping(pdev->domain, devfn, pdev);
-    if ( ret )
-    {
-        dprintk(XENLOG_ERR VTDPREFIX, "d%d: context mapping failed\n",
-                pdev->domain->domain_id);
-        return ret;
-    }
-
     for_each_rmrr_device ( rmrr, bdf, i )
     {
         if ( rmrr->segment == pdev->seg &&
@@ -2005,12 +1997,17 @@ static int intel_iommu_add_device(u8 dev
                                          rmrr->base_address, rmrr->end_address,
                                          0);
             if ( ret )
-                dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n",
-                        pdev->domain->domain_id);
+                dprintk(XENLOG_ERR VTDPREFIX, "%pd: RMRR mapping failed\n",
+                        pdev->domain);
         }
     }
 
-    return 0;
+    ret = domain_context_mapping(pdev->domain, devfn, pdev);
+    if ( ret )
+        dprintk(XENLOG_ERR VTDPREFIX, "%pd: context mapping failed\n",
+                pdev->domain);
+
+    return ret;
 }
 
 static int intel_iommu_enable_device(struct pci_dev *pdev)
@@ -2032,11 +2029,15 @@ static int intel_iommu_remove_device(u8
 {
     struct acpi_rmrr_unit *rmrr;
     u16 bdf;
-    int i;
+    int ret, i;
 
     if ( !pdev->domain )
         return -EINVAL;
 
+    ret = domain_context_unmap(pdev->domain, devfn, pdev);
+    if ( ret )
+        return ret;
+
     for_each_rmrr_device ( rmrr, bdf, i )
     {
         if ( rmrr->segment != pdev->seg ||
@@ -2052,7 +2053,7 @@ static int intel_iommu_remove_device(u8
                                rmrr->end_address, 0);
     }
 
-    return domain_context_unmap(pdev->domain, devfn, pdev);
+    return 0;
 }
 
 static int __hwdom_init setup_hwdom_device(u8 devfn, struct pci_dev *pdev)
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: tighten iommu_alloc_pgtable()'s parameter

This is to make more obvious that nothing outside of domain_iommu(d)
actually changes or is otherwise needed by the function.

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- xen/include/asm-x86/iommu.h.orig
+++ xen/include/asm-x86/iommu.h
@@ -143,7 +143,8 @@ int pi_update_irte(const struct pi_desc
 })
 
 int __must_check iommu_free_pgtables(struct domain *d);
-struct page_info *__must_check iommu_alloc_pgtable(struct domain *d);
+struct domain_iommu;
+struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */
 /*
--- xen/drivers/passthrough/amd/iommu_map.c.orig
+++ xen/drivers/passthrough/amd/iommu_map.c
@@ -184,7 +184,7 @@ static int iommu_pde_from_dfn(struct dom
     unsigned long  next_table_mfn;
     unsigned int level;
     struct page_info *table;
-    const struct domain_iommu *hd = dom_iommu(d);
+    struct domain_iommu *hd = dom_iommu(d);
 
     table = hd->arch.amd.root_table;
     level = hd->arch.amd.paging_mode;
@@ -220,7 +220,7 @@ static int iommu_pde_from_dfn(struct dom
             mfn = next_table_mfn;
 
             /* allocate lower level page table */
-            table = iommu_alloc_pgtable(d);
+            table = iommu_alloc_pgtable(hd);
             if ( table == NULL )
             {
                 AMD_IOMMU_DEBUG("Cannot allocate I/O page table\n");
@@ -250,7 +250,7 @@ static int iommu_pde_from_dfn(struct dom
 
             if ( next_table_mfn == 0 )
             {
-                table = iommu_alloc_pgtable(d);
+                table = iommu_alloc_pgtable(hd);
                 if ( table == NULL )
                 {
                     AMD_IOMMU_DEBUG("Cannot allocate I/O page table\n");
@@ -483,7 +483,7 @@ int __init amd_iommu_quarantine_init(str
 
     spin_lock(&hd->arch.mapping_lock);
 
-    hd->arch.amd.root_table = iommu_alloc_pgtable(d);
+    hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
     if ( !hd->arch.amd.root_table )
         goto out;
 
@@ -498,7 +498,7 @@ int __init amd_iommu_quarantine_init(str
          * page table pages, and the resulting allocations are always
          * zeroed.
          */
-        pg = iommu_alloc_pgtable(d);
+        pg = iommu_alloc_pgtable(hd);
         if ( !pg )
             break;
 
--- xen/drivers/passthrough/amd/pci_amd_iommu.c.orig
+++ xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -208,7 +208,7 @@ int amd_iommu_alloc_root(struct domain *
 
     if ( unlikely(!hd->arch.amd.root_table) )
     {
-        hd->arch.amd.root_table = iommu_alloc_pgtable(d);
+        hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
         if ( !hd->arch.amd.root_table )
             return -ENOMEM;
     }
--- xen/drivers/passthrough/vtd/iommu.c.orig
+++ xen/drivers/passthrough/vtd/iommu.c
@@ -327,7 +327,7 @@ static u64 addr_to_dma_page_maddr(struct
     {
         struct page_info *pg;
 
-        if ( !alloc || !(pg = iommu_alloc_pgtable(domain)) )
+        if ( !alloc || !(pg = iommu_alloc_pgtable(hd)) )
             goto out;
 
         hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
@@ -347,7 +347,7 @@ static u64 addr_to_dma_page_maddr(struct
             if ( !alloc )
                 break;
 
-            pg = iommu_alloc_pgtable(domain);
+            pg = iommu_alloc_pgtable(hd);
             if ( !pg )
                 break;
 
@@ -2761,7 +2761,7 @@ static int __init intel_iommu_quarantine
         goto out;
     }
 
-    pg = iommu_alloc_pgtable(d);
+    pg = iommu_alloc_pgtable(hd);
 
     rc = -ENOMEM;
     if ( !pg )
@@ -2780,7 +2780,7 @@ static int __init intel_iommu_quarantine
          * page table pages, and the resulting allocations are always
          * zeroed.
          */
-        pg = iommu_alloc_pgtable(d);
+        pg = iommu_alloc_pgtable(hd);
 
         if ( !pg )
             goto out;
--- xen/drivers/passthrough/x86/iommu.c.orig
+++ xen/drivers/passthrough/x86/iommu.c
@@ -415,9 +415,8 @@ int iommu_free_pgtables(struct domain *d
     return 0;
 }
 
-struct page_info *iommu_alloc_pgtable(struct domain *d)
+struct page_info *iommu_alloc_pgtable(struct domain_iommu *hd)
 {
-    struct domain_iommu *hd = dom_iommu(d);
     unsigned int memflags = 0;
     struct page_info *pg;
     void *p;
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: drop ownership checking from domain_context_mapping_one()

Despite putting in quite a bit of effort it was not possible to
establish why exactly this code exists (beyond possibly sanity
checking). Instead of a subsequent change further complicating this
logic, simply get rid of it.

Take the opportunity and move the respective unmap_vtd_domain_page() out
of the locked region.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- xen/drivers/passthrough/vtd/iommu.c.orig
+++ xen/drivers/passthrough/vtd/iommu.c
@@ -121,28 +121,6 @@ static int context_set_domain_id(struct
     return 0;
 }
 
-static int context_get_domain_id(struct context_entry *context,
-                                 struct vtd_iommu *iommu)
-{
-    unsigned long dom_index, nr_dom;
-    int domid = -1;
-
-    if (iommu && context)
-    {
-        nr_dom = cap_ndoms(iommu->cap);
-
-        dom_index = context_domain_id(*context);
-
-        if ( dom_index < nr_dom && iommu->domid_map )
-            domid = iommu->domid_map[dom_index];
-        else
-            dprintk(XENLOG_DEBUG VTDPREFIX,
-                    "dom_index %lu exceeds nr_dom %lu or iommu has no domid_map\n",
-                    dom_index, nr_dom);
-    }
-    return domid;
-}
-
 static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu)
 {
     int iommu_domid = domain_iommu_domid(domain, iommu);
@@ -1404,44 +1382,9 @@ int domain_context_mapping_one(
 
     if ( context_present(*context) )
     {
-        int res = 0;
-
-        /* Try to get domain ownership from device structure.  If that's
-         * not available, try to read it from the context itself. */
-        if ( pdev )
-        {
-            if ( pdev->domain != domain )
-            {
-                printk(XENLOG_G_INFO VTDPREFIX "%pd: %pp owned by %pd",
-                       domain, &PCI_SBDF3(seg, bus, devfn),
-                       pdev->domain);
-                res = -EINVAL;
-            }
-        }
-        else
-        {
-            int cdomain;
-            cdomain = context_get_domain_id(context, iommu);
-            
-            if ( cdomain < 0 )
-            {
-                printk(XENLOG_G_WARNING VTDPREFIX
-                       "%pd: %pp mapped, but can't find owner\n",
-                       domain, &PCI_SBDF3(seg, bus, devfn));
-                res = -EINVAL;
-            }
-            else if ( cdomain != domain->domain_id )
-            {
-                printk(XENLOG_G_INFO VTDPREFIX
-                       "%pd: %pp already mapped to d%d",
-                       domain, &PCI_SBDF3(seg, bus, devfn), cdomain);
-                res = -EINVAL;
-            }
-        }
-
-        unmap_vtd_domain_page(context_entries);
         spin_unlock(&iommu->lock);
-        return res;
+        unmap_vtd_domain_page(context_entries);
+        return 0;
     }
 
     if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: re-assign devices directly

Devices with RMRRs, due to it being unspecified how/when the specified
memory regions may get accessed, may not be left disconnected from their
respective mappings (as long as it's not certain that the device has
been fully quiesced). Hence rather than unmapping the old context and
then mapping the new one, re-assignment needs to be done in a single
step.

This is CVE-2022-26359 / part of XSA-400.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>

Similarly quarantining scratch-page mode relies on page tables to be
continuously wired up.

To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any RMRRs. The
main difference is when it comes to updating context entries, which need
to be atomic when there are RMRRs. Yet atomicity can only be achieved
with CMPXCHG16B, the availability of which we can't take for granted.

The seemingly complicated choice of non-negative return values for
domain_context_mapping_one() is to limit code churn: This way callers
passing NULL for pdev don't need fiddling with.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
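
A standalone x86-64 sketch (not the Xen code) of the atomic 128-bit
context-entry update the patch performs when CMPXCHG16B is available; the
local copy plays the role of "lctxt" in the hunks below.

#include <stdint.h>
#include <stdio.h>

/* Two-word context entry; cmpxchg16b requires 16-byte alignment. */
struct context_entry {
    union {
        struct { uint64_t lo, hi; };
        unsigned __int128 full;
    };
} __attribute__((aligned(16)));

/* lock cmpxchg16b wrapper: returns the value found at *ptr; the store
 * took place iff the returned value equals "old". */
static unsigned __int128 cmpxchg16b(volatile struct context_entry *ptr,
                                    unsigned __int128 old,
                                    unsigned __int128 new)
{
    uint64_t old_lo = (uint64_t)old, old_hi = (uint64_t)(old >> 64);

    asm volatile ("lock cmpxchg16b %0"
                  : "+m" (*ptr), "+a" (old_lo), "+d" (old_hi)
                  : "b" ((uint64_t)new), "c" ((uint64_t)(new >> 64))
                  : "cc");
    return ((unsigned __int128)old_hi << 64) | old_lo;
}

int main(void)
{
    static struct context_entry ctx = { .lo = 0x1, .hi = 0x2 };
    struct context_entry lctxt = ctx;   /* local copy, as "lctxt" below */
    unsigned __int128 old = ctx.full;

    lctxt.lo = 0x1000;   /* new root and flags (made-up values) ...   */
    lctxt.hi = 0x42;     /* ... and new DID, replaced in one shot     */

    if (cmpxchg16b(&ctx, old, lctxt.full) == old)
        printf("entry replaced atomically: lo=%#lx hi=%#lx\n",
               (unsigned long)ctx.lo, (unsigned long)ctx.hi);
    return 0;
}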

--- xen/drivers/passthrough/vtd/extern.h.orig
+++ xen/drivers/passthrough/vtd/extern.h
@@ -84,7 +84,8 @@ void free_pgtable_maddr(u64 maddr);
 void *map_vtd_domain_page(u64 maddr);
 void unmap_vtd_domain_page(const void *va);
 int domain_context_mapping_one(struct domain *domain, struct vtd_iommu *iommu,
-                               u8 bus, u8 devfn, const struct pci_dev *);
+                               uint8_t bus, uint8_t devfn,
+                               const struct pci_dev *pdev, unsigned int mode);
 int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iommu,
                              u8 bus, u8 devfn);
 int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
@@ -103,8 +104,8 @@ int is_igd_vt_enabled_quirk(void);
 void platform_quirks_init(void);
 void vtd_ops_preamble_quirk(struct vtd_iommu *iommu);
 void vtd_ops_postamble_quirk(struct vtd_iommu *iommu);
-int __must_check me_wifi_quirk(struct domain *domain,
-                               u8 bus, u8 devfn, int map);
+int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus,
+                               uint8_t devfn, unsigned int mode);
 void pci_vtd_quirk(const struct pci_dev *);
 void quirk_iommu_caps(struct vtd_iommu *iommu);
 
--- xen/drivers/passthrough/vtd/iommu.c.orig
+++ xen/drivers/passthrough/vtd/iommu.c
@@ -117,6 +117,7 @@ static int context_set_domain_id(struct
     }
 
     set_bit(i, iommu->domid_bitmap);
+    context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
     context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
     return 0;
 }
@@ -1362,15 +1363,27 @@ static void __hwdom_init intel_iommu_hwd
     }
 }
 
+/*
+ * This function returns
+ * - a negative errno value upon error,
+ * - zero upon success when previously the entry was non-present, or this isn't
+ *   the "main" request for a device (pdev == NULL), or for no-op quarantining
+ *   assignments,
+ * - positive (one) upon success when previously the entry was present and this
+ *   is the "main" request for a device (pdev != NULL).
+ */
 int domain_context_mapping_one(
     struct domain *domain,
     struct vtd_iommu *iommu,
-    u8 bus, u8 devfn, const struct pci_dev *pdev)
+    uint8_t bus, uint8_t devfn, const struct pci_dev *pdev,
+    unsigned int mode)
 {
     struct domain_iommu *hd = dom_iommu(domain);
-    struct context_entry *context, *context_entries;
+    struct context_entry *context, *context_entries, lctxt;
+    __uint128_t old;
     u64 maddr, pgd_maddr;
-    u16 seg = iommu->drhd->segment;
+    uint16_t seg = iommu->drhd->segment, prev_did = 0;
+    struct domain *prev_dom = NULL;
     int rc, ret;
     bool_t flush_dev_iotlb;
 
@@ -1379,17 +1392,32 @@ int domain_context_mapping_one(
     maddr = bus_to_context_maddr(iommu, bus);
     context_entries = (struct context_entry *)map_vtd_domain_page(maddr);
     context = &context_entries[devfn];
+    old = (lctxt = *context).full;
 
-    if ( context_present(*context) )
+    if ( context_present(lctxt) )
     {
-        spin_unlock(&iommu->lock);
-        unmap_vtd_domain_page(context_entries);
-        return 0;
+        domid_t domid;
+
+        prev_did = context_domain_id(lctxt);
+        domid = iommu->domid_map[prev_did];
+        if ( domid < DOMID_FIRST_RESERVED )
+            prev_dom = rcu_lock_domain_by_id(domid);
+        else if ( domid == DOMID_IO )
+            prev_dom = rcu_lock_domain(dom_io);
+        if ( !prev_dom )
+        {
+            spin_unlock(&iommu->lock);
+            unmap_vtd_domain_page(context_entries);
+            dprintk(XENLOG_DEBUG VTDPREFIX,
+                    "no domain for did %u (nr_dom %u)\n",
+                    prev_did, cap_ndoms(iommu->cap));
+            return -ESRCH;
+        }
     }
 
     if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
     {
-        context_set_translation_type(*context, CONTEXT_TT_PASS_THRU);
+        context_set_translation_type(lctxt, CONTEXT_TT_PASS_THRU);
     }
     else
     {
@@ -1401,36 +1429,107 @@ int domain_context_mapping_one(
             spin_unlock(&hd->arch.mapping_lock);
             spin_unlock(&iommu->lock);
             unmap_vtd_domain_page(context_entries);
+            if ( prev_dom )
+                rcu_unlock_domain(prev_dom);
             return -ENOMEM;
         }
 
-        context_set_address_root(*context, pgd_maddr);
+        context_set_address_root(lctxt, pgd_maddr);
         if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) )
-            context_set_translation_type(*context, CONTEXT_TT_DEV_IOTLB);
+            context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB);
         else
-            context_set_translation_type(*context, CONTEXT_TT_MULTI_LEVEL);
+            context_set_translation_type(lctxt, CONTEXT_TT_MULTI_LEVEL);
 
         spin_unlock(&hd->arch.mapping_lock);
     }
 
-    if ( context_set_domain_id(context, domain, iommu) )
+    if ( context_set_domain_id(&lctxt, domain, iommu) )
     {
+    unlock:
         spin_unlock(&iommu->lock);
         unmap_vtd_domain_page(context_entries);
+        if ( prev_dom )
+            rcu_unlock_domain(prev_dom);
         return -EFAULT;
     }
 
-    context_set_address_width(*context, level_to_agaw(iommu->nr_pt_levels));
-    context_set_fault_enable(*context);
-    context_set_present(*context);
+    if ( !prev_dom )
+    {
+        context_set_address_width(lctxt, level_to_agaw(iommu->nr_pt_levels));
+        context_set_fault_enable(lctxt);
+        context_set_present(lctxt);
+    }
+    else if ( prev_dom == domain )
+    {
+        ASSERT(lctxt.full == context->full);
+        rc = !!pdev;
+        goto unlock;
+    }
+    else
+    {
+        ASSERT(context_address_width(lctxt) ==
+               level_to_agaw(iommu->nr_pt_levels));
+        ASSERT(!context_fault_disable(lctxt));
+    }
+
+    if ( cpu_has_cx16 )
+    {
+        __uint128_t res = cmpxchg16b(context, &old, &lctxt.full);
+
+        /*
+         * Hardware does not update the context entry behind our backs,
+         * so the return value should match "old".
+         */
+        if ( res != old )
+        {
+            if ( pdev )
+                check_cleanup_domid_map(domain, pdev, iommu);
+            printk(XENLOG_ERR
+                   "%pp: unexpected context entry %016lx_%016lx (expected %016lx_%016lx)\n",
+                   &PCI_SBDF3(pdev->seg, pdev->bus, devfn),
+                   (uint64_t)(res >> 64), (uint64_t)res,
+                   (uint64_t)(old >> 64), (uint64_t)old);
+            rc = -EILSEQ;
+            goto unlock;
+        }
+    }
+    else if ( !prev_dom || !(mode & MAP_WITH_RMRR) )
+    {
+        context_clear_present(*context);
+        iommu_sync_cache(context, sizeof(*context));
+
+        write_atomic(&context->hi, lctxt.hi);
+        /* No barrier should be needed between these two. */
+        write_atomic(&context->lo, lctxt.lo);
+    }
+    else /* Best effort, updating DID last. */
+    {
+         /*
+          * By non-atomically updating the context entry's DID field last,
+          * during a short window in time TLB entries with the old domain ID
+          * but the new page tables may be inserted.  This could affect I/O
+          * of other devices using this same (old) domain ID.  Such updating
+          * therefore is not a problem if this was the only device associated
+          * with the old domain ID.  Diverting I/O of any of a dying domain's
+          * devices to the quarantine page tables is intended anyway.
+          */
+        if ( !(mode & (MAP_OWNER_DYING | MAP_SINGLE_DEVICE)) )
+            printk(XENLOG_WARNING VTDPREFIX
+                   " %pp: reassignment may cause %pd data corruption\n",
+                   &PCI_SBDF3(seg, bus, devfn), prev_dom);
+
+        write_atomic(&context->lo, lctxt.lo);
+        /* No barrier should be needed between these two. */
+        write_atomic(&context->hi, lctxt.hi);
+    }
+
     iommu_sync_cache(context, sizeof(struct context_entry));
     spin_unlock(&iommu->lock);
 
-    /* Context entry was previously non-present (with domid 0). */
-    rc = iommu_flush_context_device(iommu, 0, PCI_BDF2(bus, devfn),
-                                    DMA_CCMD_MASK_NOBIT, 1);
+    rc = iommu_flush_context_device(iommu, prev_did, PCI_BDF2(bus, devfn),
+                                    DMA_CCMD_MASK_NOBIT, !prev_dom);
     flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
-    ret = iommu_flush_iotlb_dsi(iommu, 0, 1, flush_dev_iotlb);
+    ret = iommu_flush_iotlb_dsi(iommu, prev_did, !prev_dom, flush_dev_iotlb);
 
     /*
      * The current logic for returns:
@@ -1451,17 +1550,26 @@ int domain_context_mapping_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !seg && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, MAP_ME_PHANTOM_FUNC);
+        rc = me_wifi_quirk(domain, bus, devfn, mode);
 
     if ( rc )
     {
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        if ( !prev_dom )
+            ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        else if ( prev_dom != domain ) /* Avoid infinite recursion. */
+            ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
+                                             mode & MAP_WITH_RMRR) < 0;
+        else
+            ret = 1;
 
         if ( !ret && pdev && pdev->devfn == devfn )
             check_cleanup_domid_map(domain, pdev, iommu);
     }
 
-    return rc;
+    if ( prev_dom )
+        rcu_unlock_domain(prev_dom);
+
+    return rc ?: pdev && prev_dom;
 }
 
 static int domain_context_unmap(struct domain *d, uint8_t devfn,
@@ -1471,8 +1579,10 @@ static int domain_context_mapping(struct
                                   struct pci_dev *pdev)
 {
     struct acpi_drhd_unit *drhd;
+    const struct acpi_rmrr_unit *rmrr;
     int ret = 0;
-    uint16_t seg = pdev->seg;
+    unsigned int i, mode = 0;
+    uint16_t seg = pdev->seg, bdf;
     uint8_t bus = pdev->bus, secbus;
 
     drhd = acpi_find_matched_drhd_unit(pdev);
@@ -1492,8 +1602,29 @@ static int domain_context_mapping(struct
 
     ASSERT(pcidevs_locked());
 
+    for_each_rmrr_device( rmrr, bdf, i )
+    {
+        if ( rmrr->segment != pdev->seg || bdf != pdev->sbdf.bdf )
+            continue;
+
+        mode |= MAP_WITH_RMRR;
+        break;
+    }
+
+    if ( domain != pdev->domain )
+    {
+        if ( pdev->domain->is_dying )
+            mode |= MAP_OWNER_DYING;
+        else if ( drhd &&
+                  !any_pdev_behind_iommu(pdev->domain, pdev, drhd->iommu) &&
+                  !pdev->phantom_stride )
+            mode |= MAP_SINGLE_DEVICE;
+    }
+
     switch ( pdev->type )
     {
+        bool prev_present;
+
     case DEV_TYPE_PCI_HOST_BRIDGE:
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:Hostbridge: skip %pp map\n",
@@ -1512,7 +1643,9 @@ static int domain_context_mapping(struct
             printk(VTDPREFIX "%pd:PCIe: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev);
+                                         pdev, mode);
+        if ( ret > 0 )
+            ret = 0;
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             enable_ats_device(pdev, &drhd->iommu->ats_devices);
 
@@ -1524,9 +1657,10 @@ static int domain_context_mapping(struct
                    domain, &PCI_SBDF3(seg, bus, devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev);
-        if ( ret )
+                                         pdev, mode);
+        if ( ret < 0 )
             break;
+        prev_present = ret;
 
         if ( (ret = find_upstream_bridge(seg, &bus, &devfn, &secbus)) < 1 )
         {
@@ -1534,6 +1668,15 @@ static int domain_context_mapping(struct
                 break;
             ret = -ENXIO;
         }
+        /*
+         * Strictly speaking if the device is the only one behind this bridge
+         * and the only one with this (secbus,0,0) tuple, it could be allowed
+         * to be re-assigned regardless of RMRR presence.  But let's deal with
+         * that case only if it is actually found in the wild.
+         */
+        else if ( prev_present && (mode & MAP_WITH_RMRR) &&
+                  domain != pdev->domain )
+            ret = -EOPNOTSUPP;
 
         /*
          * Mapping a bridge should, if anything, pass the struct pci_dev of
@@ -1542,7 +1685,7 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL);
+                                             NULL, mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1557,10 +1700,15 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL);
+                                             NULL, mode);
 
         if ( ret )
-            domain_context_unmap(domain, devfn, pdev);
+        {
+            if ( !prev_present )
+                domain_context_unmap(domain, devfn, pdev);
+            else if ( pdev->domain != domain ) /* Avoid infinite recursion. */
+                domain_context_mapping(pdev->domain, devfn, pdev);
+        }
 
         break;
 
@@ -2336,9 +2484,8 @@ static int reassign_device_ownership(
 {
     int ret;
 
-    ret = domain_context_unmap(source, devfn, pdev);
-    if ( ret )
-        return ret;
+    if ( !has_arch_pdevs(target) )
+        vmx_pi_hooks_assign(target);
 
     /*
      * Devices assigned to untrusted domains (here assumed to be any domU)
@@ -2348,6 +2495,31 @@ static int reassign_device_ownership(
     if ( (target != hardware_domain) && !iommu_intremap )
         untrusted_msi = true;
 
+    ret = domain_context_mapping(target, devfn, pdev);
+    if ( ret )
+    {
+        if ( !has_arch_pdevs(target) )
+            vmx_pi_hooks_deassign(target);
+        return ret;
+    }
+
+    if ( pdev->devfn == devfn )
+    {
+        const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
+
+        if ( drhd )
+            check_cleanup_domid_map(source, pdev, drhd->iommu);
+    }
+
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->pdev_list);
+        pdev->domain = target;
+    }
+
+    if ( !has_arch_pdevs(source) )
+        vmx_pi_hooks_deassign(source);
+
     /*
      * If the device belongs to the hardware domain, and it has RMRR, don't
      * remove it from the hardware domain, because BIOS may use RMRR at
@@ -2376,34 +2548,7 @@ static int reassign_device_ownership(
             }
     }
 
-    if ( devfn == pdev->devfn && pdev->domain != dom_io )
-    {
-        list_move(&pdev->domain_list, &dom_io->pdev_list);
-        pdev->domain = dom_io;
-    }
-
-    if ( !has_arch_pdevs(source) )
-        vmx_pi_hooks_deassign(source);
-
-    if ( !has_arch_pdevs(target) )
-        vmx_pi_hooks_assign(target);
-
-    ret = domain_context_mapping(target, devfn, pdev);
-    if ( ret )
-    {
-        if ( !has_arch_pdevs(target) )
-            vmx_pi_hooks_deassign(target);
-
-        return ret;
-    }
-
-    if ( devfn == pdev->devfn && pdev->domain != target )
-    {
-        list_move(&pdev->domain_list, &target->pdev_list);
-        pdev->domain = target;
-    }
-
-    return ret;
+    return 0;
 }
 
 static int intel_iommu_assign_device(
--- xen/drivers/passthrough/vtd/iommu.h.orig
+++ xen/drivers/passthrough/vtd/iommu.h
@@ -202,8 +202,12 @@ struct root_entry {
     do {(root).val |= ((value) & PAGE_MASK_4K);} while(0)
 
 struct context_entry {
-    u64 lo;
-    u64 hi;
+    union {
+        struct {
+            uint64_t lo, hi;
+        };
+        __uint128_t full;
+    };
 };
 #define ROOT_ENTRY_NR (PAGE_SIZE_4K/sizeof(struct root_entry))
 #define context_present(c) ((c).lo & 1)
--- xen/drivers/passthrough/vtd/quirks.c.orig
+++ xen/drivers/passthrough/vtd/quirks.c
@@ -344,7 +344,8 @@ void __init platform_quirks_init(void)
  */
 
 static int __must_check map_me_phantom_function(struct domain *domain,
-                                                u32 dev, int map)
+                                                unsigned int dev,
+                                                unsigned int mode)
 {
     struct acpi_drhd_unit *drhd;
     struct pci_dev *pdev;
@@ -355,9 +356,9 @@ static int __must_check map_me_phantom_f
     drhd = acpi_find_matched_drhd_unit(pdev);
 
     /* map or unmap ME phantom function */
-    if ( map )
+    if ( !(mode & UNMAP_ME_PHANTOM_FUNC) )
         rc = domain_context_mapping_one(domain, drhd->iommu, 0,
-                                        PCI_DEVFN(dev, 7), NULL);
+                                        PCI_DEVFN(dev, 7), NULL, mode);
     else
         rc = domain_context_unmap_one(domain, drhd->iommu, 0,
                                       PCI_DEVFN(dev, 7));
@@ -365,7 +366,8 @@ static int __must_check map_me_phantom_f
     return rc;
 }
 
-int me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map)
+int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
+                  unsigned int mode)
 {
     u32 id;
     int rc = 0;
@@ -389,7 +391,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x423b8086:
             case 0x423c8086:
             case 0x423d8086:
-                rc = map_me_phantom_function(domain, 3, map);
+                rc = map_me_phantom_function(domain, 3, mode);
                 break;
             default:
                 break;
@@ -415,7 +417,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x42388086:        /* Puma Peak */
             case 0x422b8086:
             case 0x422c8086:
-                rc = map_me_phantom_function(domain, 22, map);
+                rc = map_me_phantom_function(domain, 22, mode);
                 break;
             default:
                 break;
--- xen/drivers/passthrough/vtd/vtd.h.orig
+++ xen/drivers/passthrough/vtd/vtd.h
@@ -22,8 +22,14 @@
 
 #include <xen/iommu.h>
 
-#define MAP_ME_PHANTOM_FUNC      1
-#define UNMAP_ME_PHANTOM_FUNC    0
+/*
+ * Values for domain_context_mapping_one()'s and me_wifi_quirk()'s "mode"
+ * parameters.
+ */
+#define MAP_WITH_RMRR         (1u << 0)
+#define MAP_OWNER_DYING       (1u << 1)
+#define MAP_SINGLE_DEVICE     (1u << 2)
+#define UNMAP_ME_PHANTOM_FUNC (1u << 3)
 
 /* Allow for both IOAPIC and IOSAPIC. */
 #define IO_xAPIC_route_entry IO_APIC_route_entry
From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-assign devices directly

Devices with unity map ranges, due to it being unspecified how/when
these memory ranges may get accessed, may not be left disconnected from
their unity mappings (as long as it's not certain that the device has
been fully quiesced). Hence rather than tearing down the old root page
table pointer and then establishing the new one, re-assignment needs to
be done in a single step.

This is CVE-2022-26360 / part of XSA-400.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>

Similarly quarantining scratch-page mode relies on page tables to be
continuously wired up.

To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any unity map
ranges.  The main difference is when it comes to updating DTEs, which need
to be atomic when there are unity mappings. Yet atomicity can only be
achieved with CMPXCHG16B, the availability of which we can't take for granted.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
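
When CMPXCHG16B cannot be used, the patch falls back to two ordered 64-bit
stores with the word carrying the domain ID written last, so at worst a
stale DID is briefly paired with the new page tables.  A minimal standalone
sketch of that fallback (illustrative field layout, not the real DTE):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* 256-bit AMD DTE, modelled as four 64-bit words.  Word 0 carries
 * v/tv/pt_root, word 1 carries domain_id (among other fields). */
struct dte { _Atomic uint64_t raw[4]; };

static void dte_update_fallback(struct dte *dte, const uint64_t new[4])
{
    /* "Best effort, updating domain_id last": store the root-bearing
     * word first, then the DID-bearing word, as ordered atomic
     * stores with no further barrier in between. */
    atomic_store_explicit(&dte->raw[0], new[0], memory_order_release);
    atomic_store_explicit(&dte->raw[1], new[1], memory_order_release);
}

int main(void)
{
    struct dte d = { .raw = { 0 } };
    const uint64_t new[4] = {
        0x8000000000000003ull,   /* new pt_root plus v/tv bits (made up) */
        0x0000000000000007ull,   /* new domain_id (made up) */
        0, 0,
    };

    dte_update_fallback(&d, new);
    printf("dte: %#llx %#llx\n",
           (unsigned long long)d.raw[0], (unsigned long long)d.raw[1]);
    return 0;
}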

--- xen/drivers/passthrough/amd/iommu.h.orig
+++ xen/drivers/passthrough/amd/iommu.h
@@ -247,9 +247,13 @@ void amd_iommu_set_intremap_table(struct
                                   const void *ptr,
                                   const struct amd_iommu *iommu,
                                   bool valid);
-void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
-				   uint64_t root_ptr, uint16_t domain_id,
-				   uint8_t paging_mode, bool valid);
+#define SET_ROOT_VALID          (1u << 0)
+#define SET_ROOT_WITH_UNITY_MAP (1u << 1)
+int __must_check amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
+                                               uint64_t root_ptr,
+                                               uint16_t domain_id,
+                                               uint8_t paging_mode,
+                                               unsigned int flags);
 void iommu_dte_add_device_entry(struct amd_iommu_dte *dte,
                                 const struct ivrs_mappings *ivrs_dev);
 
--- xen/drivers/passthrough/amd/iommu_map.c.orig
+++ xen/drivers/passthrough/amd/iommu_map.c
@@ -114,10 +114,69 @@ static unsigned int set_iommu_ptes_prese
     return flush_flags;
 }
 
-void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
-                                   uint64_t root_ptr, uint16_t domain_id,
-                                   uint8_t paging_mode, bool valid)
+/*
+ * This function returns
+ * - -errno for errors,
+ * - 0 for a successful update, atomic when necessary
+ * - 1 for a successful but non-atomic update, which may need to be warned
+ *   about by the caller.
+ */
+int amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
+                                  uint64_t root_ptr, uint16_t domain_id,
+                                  uint8_t paging_mode, unsigned int flags)
 {
+    bool valid = flags & SET_ROOT_VALID;
+
+    if ( dte->v && dte->tv &&
+         (cpu_has_cx16 || (flags & SET_ROOT_WITH_UNITY_MAP)) )
+    {
+        union {
+            struct amd_iommu_dte dte;
+            uint64_t raw64[4];
+            __uint128_t raw128[2];
+        } ldte = { .dte = *dte };
+        __uint128_t old = ldte.raw128[0];
+        int ret = 0;
+
+        ldte.dte.domain_id = domain_id;
+        ldte.dte.pt_root = paddr_to_pfn(root_ptr);
+        ldte.dte.iw = true;
+        ldte.dte.ir = true;
+        ldte.dte.paging_mode = paging_mode;
+        ldte.dte.v = valid;
+
+        if ( cpu_has_cx16 )
+        {
+            __uint128_t res = cmpxchg16b(dte, &old, &ldte.raw128[0]);
+
+            /*
+             * Hardware does not update the DTE behind our backs, so the
+             * return value should match "old".
+             */
+            if ( res != old )
+            {
+                printk(XENLOG_ERR
+                       "Dom%d: unexpected DTE %016lx_%016lx (expected %016lx_%016lx)\n",
+                       domain_id,
+                       (uint64_t)(res >> 64), (uint64_t)res,
+                       (uint64_t)(old >> 64), (uint64_t)old);
+                ret = -EILSEQ;
+            }
+        }
+        else /* Best effort, updating domain_id last. */
+        {
+            uint64_t *ptr = (void *)dte;
+
+            write_atomic(ptr + 0, ldte.raw64[0]);
+            /* No barrier should be needed between these two. */
+            write_atomic(ptr + 1, ldte.raw64[1]);
+
+            ret = 1;
+        }
+
+        return ret;
+    }
+
     if ( valid || dte->v )
     {
         dte->tv = false;
@@ -132,6 +191,8 @@ void amd_iommu_set_root_page_table(struc
     smp_wmb();
     dte->tv = true;
     dte->v = valid;
+
+    return 0;
 }
 
 void amd_iommu_set_intremap_table(
--- xen/drivers/passthrough/amd/pci_amd_iommu.c.orig
+++ xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -81,41 +81,82 @@ int get_dma_requestor_id(uint16_t seg, u
     return req_id;
 }
 
-static void amd_iommu_setup_domain_device(
+static int __must_check allocate_domain_resources(struct domain *d)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+    int rc;
+
+    spin_lock(&hd->arch.mapping_lock);
+    rc = amd_iommu_alloc_root(d);
+    spin_unlock(&hd->arch.mapping_lock);
+
+    return rc;
+}
+
+static bool any_pdev_behind_iommu(const struct domain *d,
+                                  const struct pci_dev *exclude,
+                                  const struct amd_iommu *iommu)
+{
+    const struct pci_dev *pdev;
+
+    for_each_pdev ( d, pdev )
+    {
+        if ( pdev == exclude )
+            continue;
+
+        if ( find_iommu_for_device(pdev->seg, pdev->sbdf.bdf) == iommu )
+            return true;
+    }
+
+    return false;
+}
+
+static int __must_check amd_iommu_setup_domain_device(
     struct domain *domain, struct amd_iommu *iommu,
     uint8_t devfn, struct pci_dev *pdev)
 {
     struct amd_iommu_dte *table, *dte;
     unsigned long flags;
-    int req_id, valid = 1;
+    unsigned int req_id, sr_flags;
+    int rc;
     u8 bus = pdev->bus;
     const struct domain_iommu *hd = dom_iommu(domain);
+    const struct ivrs_mappings *ivrs_dev;
+
+    BUG_ON(!hd->arch.amd.paging_mode || !iommu->dev_table.buffer);
 
-    BUG_ON( !hd->arch.amd.root_table ||
-            !hd->arch.amd.paging_mode ||
-            !iommu->dev_table.buffer );
+    rc = allocate_domain_resources(domain);
+    if ( rc )
+        return rc;
 
-    if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
-        valid = 0;
+    req_id = get_dma_requestor_id(iommu->seg, pdev->sbdf.bdf);
+    ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
+    sr_flags = (iommu_hwdom_passthrough && is_hardware_domain(domain)
+                ? 0 : SET_ROOT_VALID)
+               | (ivrs_dev->unity_map ? SET_ROOT_WITH_UNITY_MAP : 0);
 
     /* get device-table entry */
     req_id = get_dma_requestor_id(iommu->seg, PCI_BDF2(bus, devfn));
     table = iommu->dev_table.buffer;
     dte = &table[req_id];
+    ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
 
     spin_lock_irqsave(&iommu->lock, flags);
 
     if ( !dte->v || !dte->tv )
     {
-        const struct ivrs_mappings *ivrs_dev;
-
         /* bind DTE to domain page-tables */
-        amd_iommu_set_root_page_table(
-            dte, page_to_maddr(hd->arch.amd.root_table),
-            domain->domain_id, hd->arch.amd.paging_mode, valid);
+        rc = amd_iommu_set_root_page_table(
+                 dte, page_to_maddr(hd->arch.amd.root_table),
+                 domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
+        if ( rc )
+        {
+            ASSERT(rc < 0);
+            spin_unlock_irqrestore(&iommu->lock, flags);
+            return rc;
+        }
 
         /* Undo what amd_iommu_disable_domain_device() may have done. */
-        ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
         if ( dte->it_root )
         {
             dte->int_ctl = IOMMU_DEV_TABLE_INT_CONTROL_TRANSLATED;
@@ -130,17 +171,73 @@ static void amd_iommu_setup_domain_devic
             dte->i = ats_enabled;
 
         amd_iommu_flush_device(iommu, req_id);
+    }
+    else if ( dte->pt_root != mfn_x(page_to_mfn(hd->arch.amd.root_table)) )
+    {
+        /*
+         * Strictly speaking if the device is the only one with this requestor
+         * ID, it could be allowed to be re-assigned regardless of unity map
+         * presence.  But let's deal with that case only if it is actually
+         * found in the wild.
+         */
+        if ( req_id != PCI_BDF2(bus, devfn) &&
+             (sr_flags & SET_ROOT_WITH_UNITY_MAP) )
+            rc = -EOPNOTSUPP;
+        else
+            rc = amd_iommu_set_root_page_table(
+                     dte, page_to_maddr(hd->arch.amd.root_table),
+                     domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
+        if ( rc < 0 )
+        {
+            spin_unlock_irqrestore(&iommu->lock, flags);
+            return rc;
+        }
+        if ( rc &&
+             domain != pdev->domain &&
+             /*
+              * Since the DTE's domain ID field is updated non-atomically and
+              * last, there is a short window during which TLB entries with
+              * the old domain ID but the new page tables may have been
+              * inserted.  This could affect I/O of other devices using this
+              * same (old) domain ID.  Such updating therefore is not a
+              * problem if this was the only device associated with the old
+              * domain ID.  Diverting I/O of any of a dying domain's devices
+              * to the quarantine page tables is intended anyway.
+              */
+             !pdev->domain->is_dying &&
+             (any_pdev_behind_iommu(pdev->domain, pdev, iommu) ||
+              pdev->phantom_stride) )
+            printk(" %pp: reassignment may cause %pd data corruption\n",
+                   &PCI_SBDF3(pdev->seg, bus, devfn), pdev->domain);
+
+        /*
+         * Check remaining settings are still in place from an earlier call
+         * here. They're all independent of the domain, so should not have
+         * changed.
+         */
+        if ( dte->it_root )
+            ASSERT(dte->int_ctl == IOMMU_DEV_TABLE_INT_CONTROL_TRANSLATED);
+        ASSERT(dte->iv == iommu_intremap);
+        ASSERT(dte->ex == ivrs_dev->dte_allow_exclusion);
+        ASSERT(dte->sys_mgt == MASK_EXTR(ivrs_dev->device_flags,
+                                         ACPI_IVHD_SYSTEM_MGMT));
 
-        AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
-                        "root table = %#"PRIx64", "
-                        "domain = %d, paging mode = %d\n",
-                        req_id, pdev->type,
-                        page_to_maddr(hd->arch.amd.root_table),
-                        domain->domain_id, hd->arch.amd.paging_mode);
+        if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
+             iommu_has_cap(iommu, PCI_CAP_IOTLB_SHIFT) )
+            ASSERT(dte->i == ats_enabled);
+
+        amd_iommu_flush_device(iommu, req_id);
     }
 
     spin_unlock_irqrestore(&iommu->lock, flags);
 
+    AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
+                    "root table = %#"PRIx64", "
+                    "domain = %d, paging mode = %d\n",
+                    req_id, pdev->type,
+                    page_to_maddr(hd->arch.amd.root_table),
+                    domain->domain_id, hd->arch.amd.paging_mode);
+
     ASSERT(pcidevs_locked());
 
     if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
@@ -151,6 +248,8 @@ static void amd_iommu_setup_domain_devic
 
         amd_iommu_flush_iotlb(devfn, pdev, INV_IOMMU_ALL_PAGES_ADDRESS, 0);
     }
+
+    return 0;
 }
 
 int __init acpi_ivrs_init(void)
@@ -216,18 +315,6 @@ int amd_iommu_alloc_root(struct domain *
     return 0;
 }
 
-static int __must_check allocate_domain_resources(struct domain *d)
-{
-    struct domain_iommu *hd = dom_iommu(d);
-    int rc;
-
-    spin_lock(&hd->arch.mapping_lock);
-    rc = amd_iommu_alloc_root(d);
-    spin_unlock(&hd->arch.mapping_lock);
-
-    return rc;
-}
-
 int __read_mostly amd_iommu_min_paging_mode = 1;
 
 static int amd_iommu_domain_init(struct domain *d)
@@ -340,7 +427,15 @@ static int reassign_device(struct domain
         return -ENODEV;
     }
 
-    amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
+    rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
+    if ( rc )
+        return rc;
+
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->pdev_list);
+        pdev->domain = target;
+    }
 
     /*
      * If the device belongs to the hardware domain, and it has a unity mapping,
@@ -356,26 +451,9 @@ static int reassign_device(struct domain
             return rc;
     }
 
-    if ( devfn == pdev->devfn && pdev->domain != dom_io )
-    {
-        list_move(&pdev->domain_list, &dom_io->pdev_list);
-        pdev->domain = dom_io;
-    }
-
-    rc = allocate_domain_resources(target);
-    if ( rc )
-        return rc;
-
-    amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
     AMD_IOMMU_DEBUG("Re-assign %pp from dom%d to dom%d\n",
                     &pdev->sbdf, source->domain_id, target->domain_id);
 
-    if ( devfn == pdev->devfn && pdev->domain != target )
-    {
-        list_move(&pdev->domain_list, &target->pdev_list);
-        pdev->domain = target;
-    }
-
     return 0;
 }
 
@@ -490,8 +568,7 @@ static int amd_iommu_add_device(u8 devfn
         spin_unlock_irqrestore(&iommu->lock, flags);
     }
 
-    amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
-    return 0;
+    return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
 }
 
 static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
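
The heart of the AMD-side change above is the lockless 128-bit update of the
DTE's low half: the new entry is composed locally and installed with a single
cmpxchg16b, with a fallback of two ordered 64-bit writes when CX16 is
unavailable. A minimal standalone sketch of the atomic path, assuming GCC/clang
builtins in place of Xen's cmpxchg16b() wrapper and a plain 128-bit integer
standing in for struct amd_iommu_dte:

    #include <stdbool.h>
    #include <stdint.h>

    typedef unsigned __int128 dte_lo_t; /* stand-in for the DTE's low 128 bits */

    /*
     * Compose the new entry locally, then install it with one compare-and-
     * exchange so hardware never observes a half-written entry.  Compile
     * with -mcx16 on x86-64 for this to map to the cmpxchg16b instruction.
     */
    static bool install_dte_lo(dte_lo_t *slot, dte_lo_t new_lo)
    {
        dte_lo_t old = *slot; /* hardware doesn't modify it behind our back */

        return __atomic_compare_exchange_n(slot, &old, new_lo, false,
                                           __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }

The non-CX16 fallback in the patch instead writes the two 64-bit halves
individually, deliberately updating the half holding the domain ID last; the
"reassignment may cause data corruption" warning further up accounts for the
window this opens.
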
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: prepare for per-device quarantine page tables (part I)

Arrange for domain ID and page table root to be passed around, the latter
in particular to domain_pgd_maddr(), so that the default of taking the root
from the per-domain fields can be overridden.

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- xen/drivers/passthrough/vtd/extern.h.orig
+++ xen/drivers/passthrough/vtd/extern.h
@@ -85,9 +85,10 @@ void *map_vtd_domain_page(u64 maddr);
 void unmap_vtd_domain_page(const void *va);
 int domain_context_mapping_one(struct domain *domain, struct vtd_iommu *iommu,
                                uint8_t bus, uint8_t devfn,
-                               const struct pci_dev *pdev, unsigned int mode);
+                               const struct pci_dev *pdev, domid_t domid,
+                               paddr_t pgd_maddr, unsigned int mode);
 int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iommu,
-                             u8 bus, u8 devfn);
+                             uint8_t bus, uint8_t devfn, domid_t domid);
 int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
 
 unsigned int io_apic_read_remap_rte(unsigned int apic, unsigned int reg);
@@ -105,7 +106,8 @@ void platform_quirks_init(void);
 void vtd_ops_preamble_quirk(struct vtd_iommu *iommu);
 void vtd_ops_postamble_quirk(struct vtd_iommu *iommu);
 int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus,
-                               uint8_t devfn, unsigned int mode);
+                               uint8_t devfn, domid_t domid, paddr_t pgd_maddr,
+                               unsigned int mode);
 void pci_vtd_quirk(const struct pci_dev *);
 void quirk_iommu_caps(struct vtd_iommu *iommu);
 
--- xen/drivers/passthrough/vtd/iommu.c.orig
+++ xen/drivers/passthrough/vtd/iommu.c
@@ -355,15 +355,17 @@ static u64 addr_to_dma_page_maddr(struct
     return pte_maddr;
 }
 
-static uint64_t domain_pgd_maddr(struct domain *d, unsigned int nr_pt_levels)
+static paddr_t domain_pgd_maddr(struct domain *d, paddr_t pgd_maddr,
+                                unsigned int nr_pt_levels)
 {
     struct domain_iommu *hd = dom_iommu(d);
-    uint64_t pgd_maddr;
     unsigned int agaw;
 
     ASSERT(spin_is_locked(&hd->arch.mapping_lock));
 
-    if ( iommu_use_hap_pt(d) )
+    if ( pgd_maddr )
+        /* nothing */;
+    else if ( iommu_use_hap_pt(d) )
     {
         pagetable_t pgt = p2m_get_pagetable(p2m_get_hostp2m(d));
 
@@ -1376,12 +1378,12 @@ int domain_context_mapping_one(
     struct domain *domain,
     struct vtd_iommu *iommu,
     uint8_t bus, uint8_t devfn, const struct pci_dev *pdev,
-    unsigned int mode)
+    domid_t domid, paddr_t pgd_maddr, unsigned int mode)
 {
     struct domain_iommu *hd = dom_iommu(domain);
     struct context_entry *context, *context_entries, lctxt;
     __uint128_t old;
-    u64 maddr, pgd_maddr;
+    uint64_t maddr;
     uint16_t seg = iommu->drhd->segment, prev_did = 0;
     struct domain *prev_dom = NULL;
     int rc, ret;
@@ -1421,10 +1423,12 @@ int domain_context_mapping_one(
     }
     else
     {
+        paddr_t root;
+
         spin_lock(&hd->arch.mapping_lock);
 
-        pgd_maddr = domain_pgd_maddr(domain, iommu->nr_pt_levels);
-        if ( !pgd_maddr )
+        root = domain_pgd_maddr(domain, pgd_maddr, iommu->nr_pt_levels);
+        if ( !root )
         {
             spin_unlock(&hd->arch.mapping_lock);
             spin_unlock(&iommu->lock);
@@ -1434,7 +1438,7 @@ int domain_context_mapping_one(
             return -ENOMEM;
         }
 
-        context_set_address_root(lctxt, pgd_maddr);
+        context_set_address_root(lctxt, root);
         if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) )
             context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB);
         else
@@ -1550,15 +1554,21 @@ int domain_context_mapping_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !seg && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, mode);
+        rc = me_wifi_quirk(domain, bus, devfn, domid, pgd_maddr, mode);
 
     if ( rc )
     {
         if ( !prev_dom )
-            ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+            ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                           domain->domain_id);
         else if ( prev_dom != domain ) /* Avoid infinite recursion. */
+        {
+            hd = dom_iommu(prev_dom);
             ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
+                                             domain->domain_id,
+                                             hd->arch.vtd.pgd_maddr,
                                              mode & MAP_WITH_RMRR) < 0;
+        }
         else
             ret = 1;
 
@@ -1580,6 +1590,7 @@ static int domain_context_mapping(struct
 {
     struct acpi_drhd_unit *drhd;
     const struct acpi_rmrr_unit *rmrr;
+    paddr_t pgd_maddr = dom_iommu(domain)->arch.vtd.pgd_maddr;
     int ret = 0;
     unsigned int i, mode = 0;
     uint16_t seg = pdev->seg, bdf;
@@ -1643,7 +1654,8 @@ static int domain_context_mapping(struct
             printk(VTDPREFIX "%pd:PCIe: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, mode);
+                                         pdev, domain->domain_id, pgd_maddr,
+                                         mode);
         if ( ret > 0 )
             ret = 0;
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
@@ -1657,7 +1669,8 @@ static int domain_context_mapping(struct
                    domain, &PCI_SBDF3(seg, bus, devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, mode);
+                                         pdev, domain->domain_id, pgd_maddr,
+                                         mode);
         if ( ret < 0 )
             break;
         prev_present = ret;
@@ -1685,7 +1698,8 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL, mode);
+                                             NULL, domain->domain_id, pgd_maddr,
+                                             mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1700,7 +1714,8 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL, mode);
+                                             NULL, domain->domain_id, pgd_maddr,
+                                             mode);
 
         if ( ret )
         {
@@ -1728,7 +1743,7 @@ static int domain_context_mapping(struct
 int domain_context_unmap_one(
     struct domain *domain,
     struct vtd_iommu *iommu,
-    u8 bus, u8 devfn)
+    uint8_t bus, uint8_t devfn, domid_t domid)
 {
     struct context_entry *context, *context_entries;
     u64 maddr;
@@ -1786,7 +1801,7 @@ int domain_context_unmap_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !iommu->drhd->segment && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, UNMAP_ME_PHANTOM_FUNC);
+        rc = me_wifi_quirk(domain, bus, devfn, domid, 0, UNMAP_ME_PHANTOM_FUNC);
 
     if ( rc && !is_hardware_domain(domain) && domain != dom_io )
     {
@@ -1837,7 +1852,8 @@ static int domain_context_unmap(struct d
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCIe: unmap %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                       domain->domain_id);
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             disable_ats_device(pdev);
 
@@ -1847,7 +1863,8 @@ static int domain_context_unmap(struct d
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCI: unmap %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                       domain->domain_id);
         if ( ret )
             break;
 
@@ -1873,12 +1890,15 @@ static int domain_context_unmap(struct d
         /* PCIe to PCI/PCIx bridge */
         if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
         {
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
+            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                           domain->domain_id);
             if ( !ret )
-                ret = domain_context_unmap_one(domain, iommu, secbus, 0);
+                ret = domain_context_unmap_one(domain, iommu, secbus, 0,
+                                               domain->domain_id);
         }
         else /* Legacy PCI bridge */
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
+            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                           domain->domain_id);
 
         break;
 
--- xen/drivers/passthrough/vtd/quirks.c.orig
+++ xen/drivers/passthrough/vtd/quirks.c
@@ -345,6 +345,8 @@ void __init platform_quirks_init(void)
 
 static int __must_check map_me_phantom_function(struct domain *domain,
                                                 unsigned int dev,
+                                                domid_t domid,
+                                                paddr_t pgd_maddr,
                                                 unsigned int mode)
 {
     struct acpi_drhd_unit *drhd;
@@ -358,16 +360,17 @@ static int __must_check map_me_phantom_f
     /* map or unmap ME phantom function */
     if ( !(mode & UNMAP_ME_PHANTOM_FUNC) )
         rc = domain_context_mapping_one(domain, drhd->iommu, 0,
-                                        PCI_DEVFN(dev, 7), NULL, mode);
+                                        PCI_DEVFN(dev, 7), NULL,
+                                        domid, pgd_maddr, mode);
     else
         rc = domain_context_unmap_one(domain, drhd->iommu, 0,
-                                      PCI_DEVFN(dev, 7));
+                                      PCI_DEVFN(dev, 7), domid);
 
     return rc;
 }
 
 int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
-                  unsigned int mode)
+                  domid_t domid, paddr_t pgd_maddr, unsigned int mode)
 {
     u32 id;
     int rc = 0;
@@ -391,7 +394,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x423b8086:
             case 0x423c8086:
             case 0x423d8086:
-                rc = map_me_phantom_function(domain, 3, mode);
+                rc = map_me_phantom_function(domain, 3, domid, pgd_maddr, mode);
                 break;
             default:
                 break;
@@ -417,7 +420,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x42388086:        /* Puma Peak */
             case 0x422b8086:
             case 0x422c8086:
-                rc = map_me_phantom_function(domain, 22, mode);
+                rc = map_me_phantom_function(domain, 22, domid, pgd_maddr, mode);
                 break;
             default:
                 break;
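
Threaded through all of the call sites above, the new pgd_maddr parameter acts
purely as an override: a non-zero value takes precedence, and only zero falls
back to the domain's own page table root (or the shared HAP tables). A hedged
sketch of that selection, with a simplified stand-in for the per-domain state:

    #include <stdint.h>

    struct dom_ctx { uint64_t pgd_maddr; }; /* stand-in for per-domain state */

    /* 0 means "no override": fall back to the domain's own root. */
    static uint64_t pick_pgd_maddr(const struct dom_ctx *d, uint64_t override)
    {
        return override ? override : d->pgd_maddr;
    }
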
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: prepare for per-device quarantine page tables (part II)

Replace the passing of struct domain * by domid_t, in preparation for
per-device quarantine page tables also requiring per-device pseudo
domain IDs, which aren't going to be associated with any struct domain
instances.

No functional change intended (except for slightly adjusted log message
text).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

--- xen/drivers/passthrough/vtd/iommu.c.orig
+++ xen/drivers/passthrough/vtd/iommu.c
@@ -59,8 +59,8 @@ static struct tasklet vtd_fault_tasklet;
 static int setup_hwdom_device(u8 devfn, struct pci_dev *);
 static void setup_hwdom_rmrr(struct domain *d);
 
-static int domain_iommu_domid(struct domain *d,
-                              struct vtd_iommu *iommu)
+static int get_iommu_did(domid_t domid, const struct vtd_iommu *iommu,
+                         bool warn)
 {
     unsigned long nr_dom, i;
 
@@ -68,16 +68,16 @@ static int domain_iommu_domid(struct dom
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
     while ( i < nr_dom )
     {
-        if ( iommu->domid_map[i] == d->domain_id )
+        if ( iommu->domid_map[i] == domid )
             return i;
 
         i = find_next_bit(iommu->domid_bitmap, nr_dom, i+1);
     }
 
-    if ( !d->is_dying )
+    if ( warn )
         dprintk(XENLOG_ERR VTDPREFIX,
-                "Cannot get valid iommu %u domid: %pd\n",
-                iommu->index, d);
+                "No valid iommu %u domid for Dom%d\n",
+                iommu->index, domid);
 
     return -1;
 }
@@ -85,8 +85,7 @@ static int domain_iommu_domid(struct dom
 #define DID_FIELD_WIDTH 16
 #define DID_HIGH_OFFSET 8
 static int context_set_domain_id(struct context_entry *context,
-                                 struct domain *d,
-                                 struct vtd_iommu *iommu)
+                                 domid_t domid, struct vtd_iommu *iommu)
 {
     unsigned long nr_dom, i;
     int found = 0;
@@ -97,7 +96,7 @@ static int context_set_domain_id(struct
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
     while ( i < nr_dom )
     {
-        if ( iommu->domid_map[i] == d->domain_id )
+        if ( iommu->domid_map[i] == domid )
         {
             found = 1;
             break;
@@ -113,7 +112,7 @@ static int context_set_domain_id(struct
             dprintk(XENLOG_ERR VTDPREFIX, "IOMMU: no free domain ids\n");
             return -EFAULT;
         }
-        iommu->domid_map[i] = d->domain_id;
+        iommu->domid_map[i] = domid;
     }
 
     set_bit(i, iommu->domid_bitmap);
@@ -122,9 +121,9 @@ static int context_set_domain_id(struct
     return 0;
 }
 
-static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu)
+static void cleanup_domid_map(domid_t domid, struct vtd_iommu *iommu)
 {
-    int iommu_domid = domain_iommu_domid(domain, iommu);
+    int iommu_domid = get_iommu_did(domid, iommu, false);
 
     if ( iommu_domid >= 0 )
     {
@@ -180,7 +179,7 @@ static void check_cleanup_domid_map(stru
     if ( !found )
     {
         clear_bit(iommu->index, &dom_iommu(d)->arch.vtd.iommu_bitmap);
-        cleanup_domid_map(d, iommu);
+        cleanup_domid_map(d->domain_id, iommu);
     }
 }
 
@@ -687,7 +686,7 @@ static int __must_check iommu_flush_iotl
             continue;
 
         flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
-        iommu_domid= domain_iommu_domid(d, iommu);
+        iommu_domid = get_iommu_did(d->domain_id, iommu, !d->is_dying);
         if ( iommu_domid == -1 )
             continue;
 
@@ -1447,7 +1446,7 @@ int domain_context_mapping_one(
         spin_unlock(&hd->arch.mapping_lock);
     }
 
-    if ( context_set_domain_id(&lctxt, domain, iommu) )
+    if ( context_set_domain_id(&lctxt, domid, iommu) )
     {
     unlock:
         spin_unlock(&iommu->lock);
@@ -1768,7 +1767,7 @@ int domain_context_unmap_one(
     context_clear_entry(*context);
     iommu_sync_cache(context, sizeof(struct context_entry));
 
-    iommu_domid= domain_iommu_domid(domain, iommu);
+    iommu_domid = get_iommu_did(domid, iommu, !domain->is_dying);
     if ( iommu_domid == -1 )
     {
         spin_unlock(&iommu->lock);
@@ -1938,7 +1937,7 @@ static void iommu_domain_teardown(struct
     ASSERT(!hd->arch.vtd.pgd_maddr);
 
     for_each_drhd_unit ( drhd )
-        cleanup_domid_map(d, drhd->iommu);
+        cleanup_domid_map(d->domain_id, drhd->iommu);
 }
 
 static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,
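
With part II applied, the DID lookup is a pure function of a domid_t and the
per-IOMMU map, which is exactly what later permits pseudo IDs that have no
struct domain behind them. A simplified, self-contained sketch of the lookup,
with a plain array scan replacing Xen's domid_bitmap/domid_map pair:

    #include <stdint.h>

    typedef uint16_t domid_t;

    /*
     * Return the hardware DID (slot index) mapped to "domid", or -1.
     * Unused slots are assumed to hold an ID no caller ever asks for.
     */
    static int get_did(const domid_t *map, unsigned int nr_dom, domid_t domid)
    {
        for (unsigned int i = 0; i < nr_dom; ++i)
            if (map[i] == domid)
                return (int)i;

        return -1;
    }
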
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: maintain a per-device pseudo domain ID

In order to subsequently enable per-device quarantine page tables, we'll
need domain-ID-like identifiers to be inserted in the respective device
(AMD) or context (Intel) table entries alongside the per-device page
table root addresses.

Make use of "real" domain IDs occupying only half of the value range
coverable by domid_t.

Note that in VT-d's iommu_alloc() I didn't want to introduce new memory
leaks in case of error, but existing ones don't get plugged - that'll be
the subject of a later change.

The VT-d changes are slightly asymmetric, but this way we can avoid
assigning pseudo domain IDs to devices which would never be mapped, while
still avoiding the addition of a new parameter to domain_context_unmap().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
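
The value-range split this relies on, assuming DOMID_MASK is 0x7fff (which the
BUILD_BUG_ON() in the patch effectively pins down): real domain IDs occupy
0x0000-0x7fff, while iommu_alloc_domid() returns idx | (DOMID_MASK + 1), i.e.
pseudo IDs 0x8000-0xffff, from a map sized UINT16_MAX - DOMID_MASK = 0x8000
slots. A quick standalone check of that arithmetic:

    #include <assert.h>
    #include <stdint.h>

    #define DOMID_MASK 0x7fffU /* assumed, per the BUILD_BUG_ON in the patch */

    int main(void)
    {
        unsigned int nr_pseudo = UINT16_MAX - DOMID_MASK;   /* 0x8000 slots */

        assert(DOMID_MASK * 2U < UINT16_MAX);               /* both halves fit */
        assert((0U | (DOMID_MASK + 1)) == 0x8000);          /* first pseudo ID */
        assert(((nr_pseudo - 1) | (DOMID_MASK + 1)) == 0xffff); /* last one */

        return 0;
    }
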

--- xen/include/asm-x86/iommu.h.orig
+++ xen/include/asm-x86/iommu.h
@@ -142,6 +142,10 @@ int pi_update_irte(const struct pi_desc
         iommu_vcall(ops, sync_cache, addr, size);       \
 })
 
+unsigned long *iommu_init_domid(void);
+domid_t iommu_alloc_domid(unsigned long *map);
+void iommu_free_domid(domid_t domid, unsigned long *map);
+
 int __must_check iommu_free_pgtables(struct domain *d);
 struct domain_iommu;
 struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd);
--- xen/include/asm-x86/pci.h.orig
+++ xen/include/asm-x86/pci.h
@@ -15,6 +15,12 @@
 
 struct arch_pci_dev {
     vmask_t used_vectors;
+    /*
+     * These fields are (de)initialized under pcidevs-lock. Other uses of
+     * them don't race (de)initialization and hence don't strictly need any
+     * locking.
+     */
+    domid_t pseudo_domid;
 };
 
 int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
--- xen/drivers/passthrough/amd/iommu.h.orig
+++ xen/drivers/passthrough/amd/iommu.h
@@ -96,6 +96,7 @@ struct amd_iommu {
     struct ring_buffer cmd_buffer;
     struct ring_buffer event_log;
     struct ring_buffer ppr_log;
+    unsigned long *domid_map;
 
     int exclusion_enable;
     int exclusion_allow_all;
--- xen/drivers/passthrough/amd/iommu_detect.c.orig
+++ xen/drivers/passthrough/amd/iommu_detect.c
@@ -180,6 +180,11 @@ int __init amd_iommu_detect_one_acpi(
     if ( rt )
         goto out;
 
+    iommu->domid_map = iommu_init_domid();
+    rt = -ENOMEM;
+    if ( !iommu->domid_map )
+        goto out;
+
     rt = pci_ro_device(iommu->seg, bus, PCI_DEVFN(dev, func));
     if ( rt )
         printk(XENLOG_ERR "Could not mark config space of %pp read-only (%d)\n",
@@ -190,7 +195,10 @@ int __init amd_iommu_detect_one_acpi(
 
  out:
     if ( rt )
+    {
+        xfree(iommu->domid_map);
         xfree(iommu);
+    }
 
     return rt;
 }
--- xen/drivers/passthrough/amd/pci_amd_iommu.c.orig
+++ xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -508,6 +508,8 @@ static int amd_iommu_add_device(u8 devfn
     struct amd_iommu *iommu;
     u16 bdf;
     struct ivrs_mappings *ivrs_mappings;
+    bool fresh_domid = false;
+    int ret;
 
     if ( !pdev->domain )
         return -EINVAL;
@@ -568,7 +570,22 @@ static int amd_iommu_add_device(u8 devfn
         spin_unlock_irqrestore(&iommu->lock, flags);
     }
 
-    return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
+    if ( iommu_quarantine && pdev->arch.pseudo_domid == DOMID_INVALID )
+    {
+        pdev->arch.pseudo_domid = iommu_alloc_domid(iommu->domid_map);
+        if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+            return -ENOSPC;
+        fresh_domid = true;
+    }
+
+    ret = amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
+    if ( ret && fresh_domid )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
+    return ret;
 }
 
 static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
@@ -591,6 +608,9 @@ static int amd_iommu_remove_device(u8 de
 
     amd_iommu_disable_domain_device(pdev->domain, iommu, devfn, pdev);
 
+    iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
+    pdev->arch.pseudo_domid = DOMID_INVALID;
+
     ivrs_mappings = get_ivrs_mappings(pdev->seg);
     bdf = PCI_BDF2(pdev->bus, devfn);
     if ( amd_iommu_perdev_intremap &&
--- xen/drivers/passthrough/pci.c.orig
+++ xen/drivers/passthrough/pci.c
@@ -327,6 +327,7 @@ static struct pci_dev *alloc_pdev(struct
     *((u8*) &pdev->bus) = bus;
     *((u8*) &pdev->devfn) = devfn;
     pdev->domain = NULL;
+    pdev->arch.pseudo_domid = DOMID_INVALID;
     INIT_LIST_HEAD(&pdev->msi_list);
 
     pos = pci_find_cap_offset(pseg->nr, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
@@ -1276,8 +1277,12 @@ static int _dump_pci_devices(struct pci_
 
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
     {
-        printk("%pp - %pd - node %-3d - MSIs < ",
-               &pdev->sbdf, pdev->domain,
+        printk("%pp - ", &pdev->sbdf);
+        if ( pdev->domain == dom_io )
+            printk("DomIO:%x", pdev->arch.pseudo_domid);
+        else
+            printk("%pd", pdev->domain);
+        printk(" - node %-3d - MSIs < ",
                (pdev->node != NUMA_NO_NODE) ? pdev->node : -1);
         list_for_each_entry ( msi, &pdev->msi_list, list )
                printk("%d ", msi->irq);
--- xen/drivers/passthrough/vtd/iommu.c.orig
+++ xen/drivers/passthrough/vtd/iommu.c
@@ -22,6 +22,7 @@
 #include <xen/sched.h>
 #include <xen/xmalloc.h>
 #include <xen/domain_page.h>
+#include <xen/err.h>
 #include <xen/iocap.h>
 #include <xen/iommu.h>
 #include <xen/numa.h>
@@ -1204,7 +1205,7 @@ int __init iommu_alloc(struct acpi_drhd_
 {
     struct vtd_iommu *iommu;
     unsigned long sagaw, nr_dom;
-    int agaw;
+    int agaw, rc;
 
     if ( nr_iommus >= MAX_IOMMUS )
     {
@@ -1297,7 +1298,16 @@ int __init iommu_alloc(struct acpi_drhd_
     if ( !iommu->domid_map )
         return -ENOMEM;
 
+    iommu->pseudo_domid_map = iommu_init_domid();
+    rc = -ENOMEM;
+    if ( !iommu->pseudo_domid_map )
+        goto free;
+
     return 0;
+
+ free:
+    iommu_free(drhd);
+    return rc;
 }
 
 void __init iommu_free(struct acpi_drhd_unit *drhd)
@@ -1320,6 +1330,7 @@ void __init iommu_free(struct acpi_drhd_
 
     xfree(iommu->domid_bitmap);
     xfree(iommu->domid_map);
+    xfree(iommu->pseudo_domid_map);
 
     if ( iommu->msi.irq >= 0 )
         destroy_irq(iommu->msi.irq);
@@ -1581,8 +1592,8 @@ int domain_context_mapping_one(
     return rc ?: pdev && prev_dom;
 }
 
-static int domain_context_unmap(struct domain *d, uint8_t devfn,
-                                struct pci_dev *pdev);
+static const struct acpi_drhd_unit *domain_context_unmap(
+    struct domain *d, uint8_t devfn, struct pci_dev *pdev);
 
 static int domain_context_mapping(struct domain *domain, u8 devfn,
                                   struct pci_dev *pdev)
@@ -1590,6 +1601,7 @@ static int domain_context_mapping(struct
     struct acpi_drhd_unit *drhd;
     const struct acpi_rmrr_unit *rmrr;
     paddr_t pgd_maddr = dom_iommu(domain)->arch.vtd.pgd_maddr;
+    domid_t orig_domid = pdev->arch.pseudo_domid;
     int ret = 0;
     unsigned int i, mode = 0;
     uint16_t seg = pdev->seg, bdf;
@@ -1649,6 +1661,14 @@ static int domain_context_mapping(struct
         break;
 
     case DEV_TYPE_PCIe_ENDPOINT:
+        if ( iommu_quarantine && orig_domid == DOMID_INVALID )
+        {
+            pdev->arch.pseudo_domid =
+                iommu_alloc_domid(drhd->iommu->pseudo_domid_map);
+            if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+                return -ENOSPC;
+        }
+
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCIe: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
@@ -1663,6 +1683,14 @@ static int domain_context_mapping(struct
         break;
 
     case DEV_TYPE_PCI:
+        if ( iommu_quarantine && orig_domid == DOMID_INVALID )
+        {
+            pdev->arch.pseudo_domid =
+                iommu_alloc_domid(drhd->iommu->pseudo_domid_map);
+            if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+                return -ENOSPC;
+        }
+
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCI: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
@@ -1736,6 +1764,13 @@ static int domain_context_mapping(struct
     if ( !ret && devfn == pdev->devfn )
         pci_vtd_quirk(pdev);
 
+    if ( ret && drhd && orig_domid == DOMID_INVALID )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid,
+                         drhd->iommu->pseudo_domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
     return ret;
 }
 
@@ -1818,8 +1853,10 @@ int domain_context_unmap_one(
     return rc;
 }
 
-static int domain_context_unmap(struct domain *domain, u8 devfn,
-                                struct pci_dev *pdev)
+static const struct acpi_drhd_unit *domain_context_unmap(
+    struct domain *domain,
+    uint8_t devfn,
+    struct pci_dev *pdev)
 {
     struct acpi_drhd_unit *drhd;
     struct vtd_iommu *iommu;
@@ -1829,7 +1866,7 @@ static int domain_context_unmap(struct d
 
     drhd = acpi_find_matched_drhd_unit(pdev);
     if ( !drhd )
-        return -ENODEV;
+        return ERR_PTR(-ENODEV);
     iommu = drhd->iommu;
 
     switch ( pdev->type )
@@ -1839,7 +1876,7 @@ static int domain_context_unmap(struct d
             printk(VTDPREFIX "%pd:Hostbridge: skip %pp unmap\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
         if ( !is_hardware_domain(domain) )
-            return -EPERM;
+            return ERR_PTR(-EPERM);
         goto out;
 
     case DEV_TYPE_PCIe_BRIDGE:
@@ -1912,7 +1949,7 @@ static int domain_context_unmap(struct d
         check_cleanup_domid_map(domain, pdev, iommu);
 
 out:
-    return ret;
+    return ret ? ERR_PTR(ret) : drhd;
 }
 
 static void iommu_clear_root_pgtable(struct domain *d)
@@ -2137,16 +2174,17 @@ static int intel_iommu_enable_device(str
 
 static int intel_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
 {
+    const struct acpi_drhd_unit *drhd;
     struct acpi_rmrr_unit *rmrr;
     u16 bdf;
-    int ret, i;
+    unsigned int i;
 
     if ( !pdev->domain )
         return -EINVAL;
 
-    ret = domain_context_unmap(pdev->domain, devfn, pdev);
-    if ( ret )
-        return ret;
+    drhd = domain_context_unmap(pdev->domain, devfn, pdev);
+    if ( IS_ERR(drhd) )
+        return PTR_ERR(drhd);
 
     for_each_rmrr_device ( rmrr, bdf, i )
     {
@@ -2163,6 +2201,13 @@ static int intel_iommu_remove_device(u8
                                rmrr->end_address, 0);
     }
 
+    if ( drhd )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid,
+                         drhd->iommu->pseudo_domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
     return 0;
 }
 
--- xen/drivers/passthrough/vtd/iommu.h.orig
+++ xen/drivers/passthrough/vtd/iommu.h
@@ -535,6 +535,7 @@ struct vtd_iommu {
     } flush;
 
     struct list_head ats_devices;
+    unsigned long *pseudo_domid_map; /* "pseudo" domain id bitmap */
     unsigned long *domid_bitmap;  /* domain id bitmap */
     u16 *domid_map;               /* domain id mapping array */
     uint32_t version;
--- xen/drivers/passthrough/x86/iommu.c.orig
+++ xen/drivers/passthrough/x86/iommu.c
@@ -386,6 +386,53 @@ void __hwdom_init arch_iommu_hwdom_init(
         return;
 }
 
+unsigned long *__init iommu_init_domid(void)
+{
+    if ( !iommu_quarantine )
+        return ZERO_BLOCK_PTR;
+
+    BUILD_BUG_ON(DOMID_MASK * 2U >= UINT16_MAX);
+
+    return xzalloc_array(unsigned long,
+                         BITS_TO_LONGS(UINT16_MAX - DOMID_MASK));
+}
+
+domid_t iommu_alloc_domid(unsigned long *map)
+{
+    /*
+     * This is used uniformly across all IOMMUs, such that on typical
+     * systems we wouldn't re-use the same ID very quickly (perhaps never).
+     */
+    static unsigned int start;
+    unsigned int idx = find_next_zero_bit(map, UINT16_MAX - DOMID_MASK, start);
+
+    ASSERT(pcidevs_locked());
+
+    if ( idx >= UINT16_MAX - DOMID_MASK )
+        idx = find_first_zero_bit(map, UINT16_MAX - DOMID_MASK);
+    if ( idx >= UINT16_MAX - DOMID_MASK )
+        return DOMID_INVALID;
+
+    __set_bit(idx, map);
+
+    start = idx + 1;
+
+    return idx | (DOMID_MASK + 1);
+}
+
+void iommu_free_domid(domid_t domid, unsigned long *map)
+{
+    ASSERT(pcidevs_locked());
+
+    if ( domid == DOMID_INVALID )
+        return;
+
+    ASSERT(domid > DOMID_MASK);
+
+    if ( !__test_and_clear_bit(domid & DOMID_MASK, map) )
+        BUG();
+}
+
 int iommu_free_pgtables(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
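
Taken together, iommu_init_domid(), iommu_alloc_domid() and iommu_free_domid()
form a small per-IOMMU allocator: a bitmap of pseudo-ID slots plus a static
cursor that rotates forward so recently freed IDs aren't reused immediately. A
hedged, self-contained re-implementation of the same rotation behaviour in
plain C (a byte array replaces Xen's bitmap helpers, and a tiny map size plus
an ad-hoc sentinel are used purely for illustration):

    #include <stdint.h>
    #include <stdio.h>

    #define NR_IDS   8        /* tiny map, for illustration only */
    #define ID_BASE  0x8000U  /* DOMID_MASK + 1 under the patch's layout */
    #define ID_NONE  0xffffU  /* "exhausted" sentinel; Xen uses DOMID_INVALID */

    static unsigned char map[NR_IDS];
    static unsigned int start;

    static uint16_t alloc_id(void)
    {
        for (unsigned int n = 0; n < NR_IDS; ++n) {
            unsigned int idx = (start + n) % NR_IDS;

            if (!map[idx]) {
                map[idx] = 1;
                start = idx + 1;  /* rotate, like the static cursor */
                return (uint16_t)(idx | ID_BASE);
            }
        }
        return ID_NONE;
    }

    static void free_id(uint16_t id)
    {
        map[id & (ID_BASE - 1)] = 0;
    }

    int main(void)
    {
        uint16_t a = alloc_id(), b = alloc_id();

        printf("%#x %#x\n", a, b);   /* 0x8000 0x8001 */
        free_id(a);
        printf("%#x\n", alloc_id()); /* 0x8002: the cursor avoids slot 0 */
        return 0;
    }
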
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: drop TLB flushes from quarantine_init() hooks

The page tables just created aren't hooked up yet anywhere, so there's
nothing that could be present in any TLB, and hence nothing to flush.
Dropping this flush is, at least on the VT-d side, a prerequisite for per-
device domain ID use when quarantining devices, as dom_io isn't going
to be assigned a DID anymore: the warning in get_iommu_did() would
otherwise trigger.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- xen/drivers/passthrough/amd/iommu_map.c.orig
+++ xen/drivers/passthrough/amd/iommu_map.c
@@ -584,8 +584,6 @@ int __init amd_iommu_quarantine_init(str
  out:
     spin_unlock(&hd->arch.mapping_lock);
 
-    amd_iommu_flush_all_pages(d);
-
     /* Pages leaked in failure case */
     return level ? -ENOMEM : 0;
 }
--- xen/drivers/passthrough/vtd/iommu.c.orig
+++ xen/drivers/passthrough/vtd/iommu.c
@@ -2958,9 +2958,6 @@ static int __init intel_iommu_quarantine
  out:
     spin_unlock(&hd->arch.mapping_lock);
 
-    if ( !rc )
-        rc = iommu_flush_iotlb_all(d);
-
     /* Pages may be leaked in failure case */
     return rc;
 }
From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: abstract maximum number of page table levels

We will want to use the constant elsewhere.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- xen/drivers/passthrough/amd/iommu.h.orig
+++ xen/drivers/passthrough/amd/iommu.h
@@ -358,7 +358,7 @@ static inline int amd_iommu_get_paging_m
     while ( max_frames > PTE_PER_TABLE_SIZE )
     {
         max_frames = PTE_PER_TABLE_ALIGN(max_frames) >> PTE_PER_TABLE_SHIFT;
-        if ( ++level > 6 )
+        if ( ++level > IOMMU_MAX_PT_LEVELS )
             return -ENOMEM;
     }
 
--- xen/drivers/passthrough/amd/iommu-defs.h.orig
+++ xen/drivers/passthrough/amd/iommu-defs.h
@@ -106,6 +106,7 @@ struct amd_iommu_dte {
     bool tv:1;
     unsigned int :5;
     unsigned int had:2;
+#define IOMMU_MAX_PT_LEVELS 6
     unsigned int paging_mode:3;
     uint64_t pt_root:40;
     bool ppr:1;
--- xen/drivers/passthrough/amd/iommu_map.c.orig
+++ xen/drivers/passthrough/amd/iommu_map.c
@@ -250,7 +250,7 @@ static int iommu_pde_from_dfn(struct dom
     table = hd->arch.amd.root_table;
     level = hd->arch.amd.paging_mode;
 
-    BUG_ON( table == NULL || level < 1 || level > 6 );
+    BUG_ON( table == NULL || level < 1 || level > IOMMU_MAX_PT_LEVELS );
 
     /*
      * A frame number past what the current page tables can represent can't
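
For reference, the computation this constant now bounds: each AMD IOMMU page
table level resolves 9 address bits (512 entries per table), so a level-n
hierarchy covers 512^n frames, and amd_iommu_get_paging_mode() picks the
smallest level whose coverage reaches the requested frame count. A hedged
standalone sketch of the same arithmetic (64-bit unsigned long assumed):

    #include <stdio.h>

    #define PTE_PER_TABLE 512UL
    #define MAX_LEVELS    6    /* mirrors IOMMU_MAX_PT_LEVELS */

    /* Smallest paging mode (>= 1) covering "frames" frames, or -1. */
    static int paging_mode(unsigned long frames)
    {
        int level = 1;
        unsigned long cover = PTE_PER_TABLE;

        while (cover < frames) {
            if (++level > MAX_LEVELS)
                return -1;
            cover *= PTE_PER_TABLE;
        }
        return level;
    }

    int main(void)
    {
        /* A 48-bit address space is 2^36 frames, needing 4 levels. */
        printf("%d\n", paging_mode(1UL << 36));
        return 0;
    }
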
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: use per-device page tables for quarantining

Since it is unspecified how/when devices with RMRRs / unity mapped
regions may access these memory regions, such devices may not be left
disconnected from the mappings of these regions (as long as it's not
certain that the device has been fully quiesced). Hence even the page
tables used when quarantining such devices need to contain mappings of
those regions. This implies installing page tables in the first place
even when not in scratch-page quarantining mode.

This is CVE-2022-26361 / part of XSA-400.

While for the purpose here it would be sufficient to have devices with
RMRRs / unity mapped regions use per-device page tables, extend this to
all devices (in scratch-page quarantining mode). This allows the leaf
pages to be mapped r/w, thus also covering memory writes (rather than
just reads) issued by non-quiescent devices.

Set up quarantine page tables as late as possible, yet early enough to
not encounter failure during de-assign. This means setup generally
happens in assign_device(), while (for now) the one in deassign_device()
is there mainly to be on the safe side.

In VT-d's DID allocation function, don't require the IOMMU lock to be
held anymore: all involved code paths hold pcidevs_lock, so this way we
avoid the need to acquire the IOMMU lock around the new call to
context_set_domain_id().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
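
The recurring pattern below is a two-way selection: when the subject domain is
dom_io (i.e. the device is being quarantined), the per-device pseudo DID and
private page table root are used; otherwise the domain's own. The VT-d side
wraps this in the DEVICE_DOMID()/DEVICE_PGTABLE() macros, the AMD side
open-codes it. A minimal sketch with simplified stand-in types:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint16_t domid_t;

    struct dom  { domid_t id; uint64_t pgd_maddr; bool is_dom_io; };
    struct pdev { domid_t pseudo_domid; uint64_t pgd_maddr; };

    /* Quarantined devices get their own identity and page tables. */
    static domid_t device_domid(const struct dom *d, const struct pdev *p)
    {
        return d->is_dom_io ? p->pseudo_domid : d->id;
    }

    static uint64_t device_pgtable(const struct dom *d, const struct pdev *p)
    {
        return d->is_dom_io ? p->pgd_maddr : d->pgd_maddr;
    }
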

--- xen/arch/x86/mm/p2m.c.orig
+++ xen/arch/x86/mm/p2m.c
@@ -1468,7 +1468,7 @@ int set_identity_p2m_entry(struct domain
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
     int ret;
 
-    if ( !paging_mode_translate(p2m->domain) )
+    if ( !paging_mode_translate(d) )
     {
         if ( !is_iommu_enabled(d) )
             return 0;
--- xen/include/asm-x86/pci.h.orig
+++ xen/include/asm-x86/pci.h
@@ -1,6 +1,8 @@
 #ifndef __X86_PCI_H__
 #define __X86_PCI_H__
 
+#include <xen/mm.h>
+
 #define CF8_BDF(cf8)     (  ((cf8) & 0x00ffff00) >> 8)
 #define CF8_ADDR_LO(cf8) (   (cf8) & 0x000000fc)
 #define CF8_ADDR_HI(cf8) (  ((cf8) & 0x0f000000) >> 16)
@@ -20,7 +22,18 @@ struct arch_pci_dev {
      * them don't race (de)initialization and hence don't strictly need any
      * locking.
      */
+    union {
+        /* Subset of struct arch_iommu's fields, to be used in dom_io. */
+        struct {
+            uint64_t pgd_maddr;
+        } vtd;
+        struct {
+            struct page_info *root_table;
+        } amd;
+    };
     domid_t pseudo_domid;
+    mfn_t leaf_mfn;
+    struct page_list_head pgtables_list;
 };
 
 int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
--- xen/drivers/passthrough/amd/iommu.h.orig
+++ xen/drivers/passthrough/amd/iommu.h
@@ -223,7 +223,8 @@ int amd_iommu_init_late(void);
 int amd_iommu_update_ivrs_mapping_acpi(void);
 int iov_adjust_irq_affinities(void);
 
-int amd_iommu_quarantine_init(struct domain *d);
+int amd_iommu_quarantine_init(struct pci_dev *pdev);
+void amd_iommu_quarantine_teardown(struct pci_dev *pdev);
 
 /* mapping functions */
 int __must_check amd_iommu_map_page(struct domain *d, dfn_t dfn,
--- xen/drivers/passthrough/amd/iommu_map.c.orig
+++ xen/drivers/passthrough/amd/iommu_map.c
@@ -528,64 +528,135 @@ int amd_iommu_reserve_domain_unity_unmap
     return rc;
 }
 
-int __init amd_iommu_quarantine_init(struct domain *d)
+static int fill_qpt(union amd_iommu_pte *this, unsigned int level,
+                    struct page_info *pgs[IOMMU_MAX_PT_LEVELS])
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    struct domain_iommu *hd = dom_iommu(dom_io);
+    unsigned int i;
+    int rc = 0;
+
+    for ( i = 0; !rc && i < PTE_PER_TABLE_SIZE; ++i )
+    {
+        union amd_iommu_pte *pte = &this[i], *next;
+
+        if ( !pte->pr )
+        {
+            if ( !pgs[level] )
+            {
+                /*
+                 * The pgtable allocator is fine for the leaf page, as well as
+                 * page table pages, and the resulting allocations are always
+                 * zeroed.
+                 */
+                pgs[level] = iommu_alloc_pgtable(hd);
+                if ( !pgs[level] )
+                {
+                    rc = -ENOMEM;
+                    break;
+                }
+
+                if ( level )
+                {
+                    next = __map_domain_page(pgs[level]);
+                    rc = fill_qpt(next, level - 1, pgs);
+                    unmap_domain_page(next);
+                }
+            }
+
+            /*
+             * PDEs are essentially a subset of PTEs, so this function
+             * is fine to use even at the leaf.
+             */
+            set_iommu_pde_present(pte, mfn_x(page_to_mfn(pgs[level])), level,
+                                  true, true);
+        }
+        else if ( level && pte->next_level )
+        {
+            next = map_domain_page(_mfn(pte->mfn));
+            rc = fill_qpt(next, level - 1, pgs);
+            unmap_domain_page(next);
+        }
+    }
+
+    return rc;
+}
+
+int amd_iommu_quarantine_init(struct pci_dev *pdev)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
     unsigned long end_gfn =
         1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT);
     unsigned int level = amd_iommu_get_paging_mode(end_gfn);
-    union amd_iommu_pte *table;
+    unsigned int req_id = get_dma_requestor_id(pdev->seg, pdev->sbdf.bdf);
+    const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
+    int rc;
 
-    if ( hd->arch.amd.root_table )
+    ASSERT(pcidevs_locked());
+    ASSERT(!hd->arch.amd.root_table);
+    ASSERT(page_list_empty(&hd->arch.pgtables.list));
+
+    ASSERT(pdev->arch.pseudo_domid != DOMID_INVALID);
+
+    if ( pdev->arch.amd.root_table )
     {
-        ASSERT_UNREACHABLE();
+        clear_domain_page(pdev->arch.leaf_mfn);
         return 0;
     }
 
-    spin_lock(&hd->arch.mapping_lock);
-
-    hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
-    if ( !hd->arch.amd.root_table )
-        goto out;
-
-    table = __map_domain_page(hd->arch.amd.root_table);
-    while ( level )
+    pdev->arch.amd.root_table = iommu_alloc_pgtable(hd);
+    if ( !pdev->arch.amd.root_table )
+        return -ENOMEM;
+
+    /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
+    hd->arch.amd.root_table = pdev->arch.amd.root_table;
+
+    rc = amd_iommu_reserve_domain_unity_map(dom_io,
+                                            ivrs_mappings[req_id].unity_map,
+                                            0);
+
+    iommu_identity_map_teardown(dom_io);
+    hd->arch.amd.root_table = NULL;
+
+    if ( rc )
+        printk("%pp: quarantine unity mapping failed\n", &pdev->sbdf);
+    else
     {
-        struct page_info *pg;
-        unsigned int i;
+        union amd_iommu_pte *root;
+        struct page_info *pgs[IOMMU_MAX_PT_LEVELS] = {};
 
-        /*
-         * The pgtable allocator is fine for the leaf page, as well as
-         * page table pages, and the resulting allocations are always
-         * zeroed.
-         */
-        pg = iommu_alloc_pgtable(hd);
-        if ( !pg )
-            break;
+        spin_lock(&hd->arch.mapping_lock);
 
-        for ( i = 0; i < PTE_PER_TABLE_SIZE; i++ )
-        {
-            union amd_iommu_pte *pde = &table[i];
+        root = __map_domain_page(pdev->arch.amd.root_table);
+        rc = fill_qpt(root, level - 1, pgs);
+        unmap_domain_page(root);
 
-            /*
-             * PDEs are essentially a subset of PTEs, so this function
-             * is fine to use even at the leaf.
-             */
-            set_iommu_pde_present(pde, mfn_x(page_to_mfn(pg)), level - 1,
-                                  false, true);
-        }
+        pdev->arch.leaf_mfn = page_to_mfn(pgs[0]);
 
-        unmap_domain_page(table);
-        table = __map_domain_page(pg);
-        level--;
+        spin_unlock(&hd->arch.mapping_lock);
     }
-    unmap_domain_page(table);
 
- out:
-    spin_unlock(&hd->arch.mapping_lock);
+    page_list_move(&pdev->arch.pgtables_list, &hd->arch.pgtables.list);
+
+    if ( rc )
+        amd_iommu_quarantine_teardown(pdev);
+
+    return rc;
+}
+
+void amd_iommu_quarantine_teardown(struct pci_dev *pdev)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
+
+    ASSERT(pcidevs_locked());
+
+    if ( !pdev->arch.amd.root_table )
+        return;
 
-    /* Pages leaked in failure case */
-    return level ? -ENOMEM : 0;
+    ASSERT(page_list_empty(&hd->arch.pgtables.list));
+    page_list_move(&hd->arch.pgtables.list, &pdev->arch.pgtables_list);
+    while ( iommu_free_pgtables(dom_io) == -ERESTART )
+        /* nothing */;
+    pdev->arch.amd.root_table = NULL;
 }
 
 /*
--- xen/drivers/passthrough/amd/pci_amd_iommu.c.orig
+++ xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -122,6 +122,8 @@ static int __must_check amd_iommu_setup_
     u8 bus = pdev->bus;
     const struct domain_iommu *hd = dom_iommu(domain);
     const struct ivrs_mappings *ivrs_dev;
+    const struct page_info *root_pg;
+    domid_t domid;
 
     BUG_ON(!hd->arch.amd.paging_mode || !iommu->dev_table.buffer);
 
@@ -141,14 +143,25 @@ static int __must_check amd_iommu_setup_
     dte = &table[req_id];
     ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
 
+    if ( domain != dom_io )
+    {
+        root_pg = hd->arch.amd.root_table;
+        domid = domain->domain_id;
+    }
+    else
+    {
+        root_pg = pdev->arch.amd.root_table;
+        domid = pdev->arch.pseudo_domid;
+    }
+
     spin_lock_irqsave(&iommu->lock, flags);
 
     if ( !dte->v || !dte->tv )
     {
         /* bind DTE to domain page-tables */
         rc = amd_iommu_set_root_page_table(
-                 dte, page_to_maddr(hd->arch.amd.root_table),
-                 domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
+                 dte, page_to_maddr(root_pg), domid,
+                 hd->arch.amd.paging_mode, sr_flags);
         if ( rc )
         {
             ASSERT(rc < 0);
@@ -172,7 +185,7 @@ static int __must_check amd_iommu_setup_
 
         amd_iommu_flush_device(iommu, req_id);
     }
-    else if ( dte->pt_root != mfn_x(page_to_mfn(hd->arch.amd.root_table)) )
+    else if ( dte->pt_root != mfn_x(page_to_mfn(root_pg)) )
     {
         /*
          * Strictly speaking if the device is the only one with this requestor
@@ -185,8 +198,8 @@ static int __must_check amd_iommu_setup_
             rc = -EOPNOTSUPP;
         else
             rc = amd_iommu_set_root_page_table(
-                     dte, page_to_maddr(hd->arch.amd.root_table),
-                     domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
+                     dte, page_to_maddr(root_pg), domid,
+                     hd->arch.amd.paging_mode, sr_flags);
         if ( rc < 0 )
         {
             spin_unlock_irqrestore(&iommu->lock, flags);
@@ -205,6 +218,7 @@ static int __must_check amd_iommu_setup_
               * intended anyway.
               */
              !pdev->domain->is_dying &&
+             pdev->domain != dom_io &&
              (any_pdev_behind_iommu(pdev->domain, pdev, iommu) ||
               pdev->phantom_stride) )
             printk(" %pp: reassignment may cause %pd data corruption\n",
@@ -234,9 +248,8 @@ static int __must_check amd_iommu_setup_
     AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
                     "root table = %#"PRIx64", "
                     "domain = %d, paging mode = %d\n",
-                    req_id, pdev->type,
-                    page_to_maddr(hd->arch.amd.root_table),
-                    domain->domain_id, hd->arch.amd.paging_mode);
+                    req_id, pdev->type, page_to_maddr(root_pg),
+                    domid, hd->arch.amd.paging_mode);
 
     ASSERT(pcidevs_locked());
 
@@ -305,7 +318,7 @@ int amd_iommu_alloc_root(struct domain *
 {
     struct domain_iommu *hd = dom_iommu(d);
 
-    if ( unlikely(!hd->arch.amd.root_table) )
+    if ( unlikely(!hd->arch.amd.root_table) && d != dom_io )
     {
         hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
         if ( !hd->arch.amd.root_table )
@@ -396,7 +409,7 @@ static void amd_iommu_disable_domain_dev
 
         AMD_IOMMU_DEBUG("Disable: device id = %#x, "
                         "domain = %d, paging mode = %d\n",
-                        req_id,  domain->domain_id,
+                        req_id, dte->domain_id,
                         dom_iommu(domain)->arch.amd.paging_mode);
     }
     spin_unlock_irqrestore(&iommu->lock, flags);
@@ -608,6 +621,8 @@ static int amd_iommu_remove_device(u8 de
 
     amd_iommu_disable_domain_device(pdev->domain, iommu, devfn, pdev);
 
+    amd_iommu_quarantine_teardown(pdev);
+
     iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
     pdev->arch.pseudo_domid = DOMID_INVALID;
 
--- xen/drivers/passthrough/iommu.c.orig
+++ xen/drivers/passthrough/iommu.c
@@ -424,21 +424,21 @@ int iommu_iotlb_flush_all(struct domain
     return rc;
 }
 
-static int __init iommu_quarantine_init(void)
+int iommu_quarantine_dev_init(device_t *dev)
 {
     const struct domain_iommu *hd = dom_iommu(dom_io);
-    int rc;
 
-    dom_io->options |= XEN_DOMCTL_CDF_iommu;
+    if ( !iommu_quarantine || !hd->platform_ops->quarantine_init )
+        return 0;
 
-    rc = iommu_domain_init(dom_io, 0);
-    if ( rc )
-        return rc;
+    return iommu_call(hd->platform_ops, quarantine_init, dev);
+}
 
-    if ( !hd->platform_ops->quarantine_init )
-        return 0;
+static int __init iommu_quarantine_init(void)
+{
+    dom_io->options |= XEN_DOMCTL_CDF_iommu;
 
-    return hd->platform_ops->quarantine_init(dom_io);
+    return iommu_domain_init(dom_io, 0);
 }
 
 int __init iommu_setup(void)
--- xen/drivers/passthrough/pci.c.orig
+++ xen/drivers/passthrough/pci.c
@@ -858,9 +858,16 @@ static int deassign_device(struct domain
         return -ENODEV;
 
     /* De-assignment from dom_io should de-quarantine the device */
-    target = ((pdev->quarantine || iommu_quarantine) &&
-              pdev->domain != dom_io) ?
-        dom_io : hardware_domain;
+    if ( (pdev->quarantine || iommu_quarantine) && pdev->domain != dom_io )
+    {
+        ret = iommu_quarantine_dev_init(pci_to_dev(pdev));
+        if ( ret )
+            return ret;
+
+        target = dom_io;
+    }
+    else
+        target = hardware_domain;
 
     while ( pdev->phantom_stride )
     {
@@ -1441,6 +1448,13 @@ static int assign_device(struct domain *
         msixtbl_init(d);
     }
 
+    if ( pdev->domain != dom_io )
+    {
+        rc = iommu_quarantine_dev_init(pci_to_dev(pdev));
+        if ( rc )
+            goto done;
+    }
+
     pdev->fault.count = 0;
 
     if ( (rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev), flag)) )
--- xen/drivers/passthrough/vtd/iommu.c.orig
+++ xen/drivers/passthrough/vtd/iommu.c
@@ -43,6 +43,12 @@
 #include "vtd.h"
 #include "../ats.h"
 
+#define DEVICE_DOMID(d, pdev) ((d) != dom_io ? (d)->domain_id \
+                                             : (pdev)->arch.pseudo_domid)
+#define DEVICE_PGTABLE(d, pdev) ((d) != dom_io \
+                                 ? dom_iommu(d)->arch.vtd.pgd_maddr \
+                                 : (pdev)->arch.vtd.pgd_maddr)
+
 /* Possible unfiltered LAPIC/MSI messages from untrusted sources? */
 bool __read_mostly untrusted_msi;
 
@@ -85,13 +91,18 @@ static int get_iommu_did(domid_t domid,
 
 #define DID_FIELD_WIDTH 16
 #define DID_HIGH_OFFSET 8
+
+/*
+ * This function may be passed NULL for "context", merely to obtain a DID
+ * for "domid".
+ */
 static int context_set_domain_id(struct context_entry *context,
                                  domid_t domid, struct vtd_iommu *iommu)
 {
     unsigned long nr_dom, i;
     int found = 0;
 
-    ASSERT(spin_is_locked(&iommu->lock));
+    ASSERT(pcidevs_locked());
 
     nr_dom = cap_ndoms(iommu->cap);
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
@@ -117,8 +128,13 @@ static int context_set_domain_id(struct
     }
 
     set_bit(i, iommu->domid_bitmap);
-    context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
-    context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
+
+    if ( context )
+    {
+        context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
+        context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
+    }
+
     return 0;
 }
 
@@ -168,8 +184,12 @@ static void check_cleanup_domid_map(stru
                                     const struct pci_dev *exclude,
                                     struct vtd_iommu *iommu)
 {
-    bool found = any_pdev_behind_iommu(d, exclude, iommu);
+    bool found;
+
+    if ( d == dom_io )
+        return;
 
+    found = any_pdev_behind_iommu(d, exclude, iommu);
     /*
      * Hidden devices are associated with DomXEN but usable by the hardware
      * domain. Hence they need considering here as well.
@@ -1414,7 +1434,7 @@ int domain_context_mapping_one(
         domid = iommu->domid_map[prev_did];
         if ( domid < DOMID_FIRST_RESERVED )
             prev_dom = rcu_lock_domain_by_id(domid);
-        else if ( domid == DOMID_IO )
+        else if ( pdev ? domid == pdev->arch.pseudo_domid : domid > DOMID_MASK )
             prev_dom = rcu_lock_domain(dom_io);
         if ( !prev_dom )
         {
@@ -1570,15 +1590,12 @@ int domain_context_mapping_one(
     {
         if ( !prev_dom )
             ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                           domain->domain_id);
+                                           DEVICE_DOMID(domain, pdev));
         else if ( prev_dom != domain ) /* Avoid infinite recursion. */
-        {
-            hd = dom_iommu(prev_dom);
             ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
-                                             domain->domain_id,
-                                             hd->arch.vtd.pgd_maddr,
+                                             DEVICE_DOMID(prev_dom, pdev),
+                                             DEVICE_PGTABLE(prev_dom, pdev),
                                              mode & MAP_WITH_RMRR) < 0;
-        }
         else
             ret = 1;
 
@@ -1600,7 +1617,7 @@ static int domain_context_mapping(struct
 {
     struct acpi_drhd_unit *drhd;
     const struct acpi_rmrr_unit *rmrr;
-    paddr_t pgd_maddr = dom_iommu(domain)->arch.vtd.pgd_maddr;
+    paddr_t pgd_maddr = DEVICE_PGTABLE(domain, pdev);
     domid_t orig_domid = pdev->arch.pseudo_domid;
     int ret = 0;
     unsigned int i, mode = 0;
@@ -1633,7 +1650,7 @@ static int domain_context_mapping(struct
         break;
     }
 
-    if ( domain != pdev->domain )
+    if ( domain != pdev->domain && pdev->domain != dom_io )
     {
         if ( pdev->domain->is_dying )
             mode |= MAP_OWNER_DYING;
@@ -1672,8 +1689,8 @@ static int domain_context_mapping(struct
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCIe: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
-        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, domain->domain_id, pgd_maddr,
+        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, pdev,
+                                         DEVICE_DOMID(domain, pdev), pgd_maddr,
                                          mode);
         if ( ret > 0 )
             ret = 0;
@@ -1696,8 +1713,8 @@ static int domain_context_mapping(struct
                    domain, &PCI_SBDF3(seg, bus, devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, domain->domain_id, pgd_maddr,
-                                         mode);
+                                         pdev, DEVICE_DOMID(domain, pdev),
+                                         pgd_maddr, mode);
         if ( ret < 0 )
             break;
         prev_present = ret;
@@ -1725,8 +1742,8 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL, domain->domain_id, pgd_maddr,
-                                             mode);
+                                             NULL, DEVICE_DOMID(domain, pdev),
+                                             pgd_maddr, mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1741,8 +1758,8 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL, domain->domain_id, pgd_maddr,
-                                             mode);
+                                             NULL, DEVICE_DOMID(domain, pdev),
+                                             pgd_maddr, mode);
 
         if ( ret )
         {
@@ -1889,7 +1906,7 @@ static const struct acpi_drhd_unit *doma
             printk(VTDPREFIX "%pd:PCIe: unmap %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
         ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                       domain->domain_id);
+                                       DEVICE_DOMID(domain, pdev));
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             disable_ats_device(pdev);
 
@@ -1900,7 +1917,7 @@ static const struct acpi_drhd_unit *doma
             printk(VTDPREFIX "%pd:PCI: unmap %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
         ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                       domain->domain_id);
+                                       DEVICE_DOMID(domain, pdev));
         if ( ret )
             break;
 
@@ -1923,18 +1940,12 @@ static const struct acpi_drhd_unit *doma
             break;
         }
 
+        ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                       DEVICE_DOMID(domain, pdev));
         /* PCIe to PCI/PCIx bridge */
-        if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
-        {
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
-                                           domain->domain_id);
-            if ( !ret )
-                ret = domain_context_unmap_one(domain, iommu, secbus, 0,
-                                               domain->domain_id);
-        }
-        else /* Legacy PCI bridge */
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
-                                           domain->domain_id);
+        if ( !ret && pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
+            ret = domain_context_unmap_one(domain, iommu, secbus, 0,
+                                           DEVICE_DOMID(domain, pdev));
 
         break;
 
@@ -1977,6 +1988,26 @@ static void iommu_domain_teardown(struct
         cleanup_domid_map(d->domain_id, drhd->iommu);
 }
 
+static void quarantine_teardown(struct pci_dev *pdev,
+                                const struct acpi_drhd_unit *drhd)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
+
+    ASSERT(pcidevs_locked());
+
+    if ( !pdev->arch.vtd.pgd_maddr )
+        return;
+
+    ASSERT(page_list_empty(&hd->arch.pgtables.list));
+    page_list_move(&hd->arch.pgtables.list, &pdev->arch.pgtables_list);
+    while ( iommu_free_pgtables(dom_io) == -ERESTART )
+        /* nothing */;
+    pdev->arch.vtd.pgd_maddr = 0;
+
+    if ( drhd )
+        cleanup_domid_map(pdev->arch.pseudo_domid, drhd->iommu);
+}
+
 static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,
                                              mfn_t mfn, unsigned int flags,
                                              unsigned int *flush_flags)
@@ -2201,6 +2232,8 @@ static int intel_iommu_remove_device(u8
                                rmrr->end_address, 0);
     }
 
+    quarantine_teardown(pdev, drhd);
+
     if ( drhd )
     {
         iommu_free_domid(pdev->arch.pseudo_domid,
@@ -2896,69 +2929,135 @@ static void vtd_dump_page_tables(struct
                               agaw_to_level(hd->arch.vtd.agaw), 0, 0);
 }
 
-static int __init intel_iommu_quarantine_init(struct domain *d)
+static int fill_qpt(struct dma_pte *this, unsigned int level,
+                    struct page_info *pgs[6])
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    struct domain_iommu *hd = dom_iommu(dom_io);
+    unsigned int i;
+    int rc = 0;
+
+    for ( i = 0; !rc && i < PTE_NUM; ++i )
+    {
+        struct dma_pte *pte = &this[i], *next;
+
+        if ( !dma_pte_present(*pte) )
+        {
+            if ( !pgs[level] )
+            {
+                /*
+                 * The pgtable allocator is fine for the leaf page, as well as
+                 * page table pages, and the resulting allocations are always
+                 * zeroed.
+                 */
+                pgs[level] = iommu_alloc_pgtable(hd);
+                if ( !pgs[level] )
+                {
+                    rc = -ENOMEM;
+                    break;
+                }
+
+                if ( level )
+                {
+                    next = map_vtd_domain_page(page_to_maddr(pgs[level]));
+                    rc = fill_qpt(next, level - 1, pgs);
+                    unmap_vtd_domain_page(next);
+                }
+            }
+
+            dma_set_pte_addr(*pte, page_to_maddr(pgs[level]));
+            dma_set_pte_readable(*pte);
+            dma_set_pte_writable(*pte);
+        }
+        else if ( level && !dma_pte_superpage(*pte) )
+        {
+            next = map_vtd_domain_page(dma_pte_addr(*pte));
+            rc = fill_qpt(next, level - 1, pgs);
+            unmap_vtd_domain_page(next);
+        }
+    }
+
+    return rc;
+}
+
+static int intel_iommu_quarantine_init(struct pci_dev *pdev)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
     struct page_info *pg;
-    struct dma_pte *parent;
     unsigned int agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
     unsigned int level = agaw_to_level(agaw);
-    int rc = 0;
+    const struct acpi_drhd_unit *drhd;
+    const struct acpi_rmrr_unit *rmrr;
+    unsigned int i, bdf;
+    bool rmrr_found = false;
+    int rc;
 
-    spin_lock(&hd->arch.mapping_lock);
+    ASSERT(pcidevs_locked());
+    ASSERT(!hd->arch.vtd.pgd_maddr);
+    ASSERT(page_list_empty(&hd->arch.pgtables.list));
 
-    if ( hd->arch.vtd.pgd_maddr )
+    if ( pdev->arch.vtd.pgd_maddr )
     {
-        ASSERT_UNREACHABLE();
-        goto out;
+        clear_domain_page(pdev->arch.leaf_mfn);
+        return 0;
     }
 
-    pg = iommu_alloc_pgtable(hd);
+    drhd = acpi_find_matched_drhd_unit(pdev);
+    if ( !drhd )
+        return -ENODEV;
 
-    rc = -ENOMEM;
+    pg = iommu_alloc_pgtable(hd);
     if ( !pg )
-        goto out;
+        return -ENOMEM;
 
+    rc = context_set_domain_id(NULL, pdev->arch.pseudo_domid, drhd->iommu);
+
+    /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
     hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
 
-    parent = map_vtd_domain_page(hd->arch.vtd.pgd_maddr);
-    while ( level )
+    for_each_rmrr_device ( rmrr, bdf, i )
     {
-        uint64_t maddr;
-        unsigned int offset;
-
-        /*
-         * The pgtable allocator is fine for the leaf page, as well as
-         * page table pages, and the resulting allocations are always
-         * zeroed.
-         */
-        pg = iommu_alloc_pgtable(hd);
-
-        if ( !pg )
-            goto out;
+        if ( rc )
+            break;
 
-        maddr = page_to_maddr(pg);
-        for ( offset = 0; offset < PTE_NUM; offset++ )
+        if ( rmrr->segment == pdev->seg && bdf == pdev->sbdf.bdf )
         {
-            struct dma_pte *pte = &parent[offset];
+            rmrr_found = true;
 
-            dma_set_pte_addr(*pte, maddr);
-            dma_set_pte_readable(*pte);
+            rc = iommu_identity_mapping(dom_io, p2m_access_rw,
+                                        rmrr->base_address, rmrr->end_address,
+                                        0);
+            if ( rc )
+                printk(XENLOG_ERR VTDPREFIX
+                       "%pp: RMRR quarantine mapping failed\n",
+                       &pdev->sbdf);
         }
-        iommu_sync_cache(parent, PAGE_SIZE);
+    }
 
-        unmap_vtd_domain_page(parent);
-        parent = map_vtd_domain_page(maddr);
-        level--;
+    iommu_identity_map_teardown(dom_io);
+    hd->arch.vtd.pgd_maddr = 0;
+    pdev->arch.vtd.pgd_maddr = page_to_maddr(pg);
+
+    if ( !rc )
+    {
+        struct dma_pte *root;
+        struct page_info *pgs[6] = {};
+
+        spin_lock(&hd->arch.mapping_lock);
+
+        root = map_vtd_domain_page(pdev->arch.vtd.pgd_maddr);
+        rc = fill_qpt(root, level - 1, pgs);
+        unmap_vtd_domain_page(root);
+
+        pdev->arch.leaf_mfn = page_to_mfn(pgs[0]);
+
+        spin_unlock(&hd->arch.mapping_lock);
     }
-    unmap_vtd_domain_page(parent);
 
-    rc = 0;
+    page_list_move(&pdev->arch.pgtables_list, &hd->arch.pgtables.list);
 
- out:
-    spin_unlock(&hd->arch.mapping_lock);
+    if ( rc )
+        quarantine_teardown(pdev, drhd);
 
-    /* Pages may be leaked in failure case */
     return rc;
 }
 
--- xen/drivers/passthrough/vtd/iommu.h.orig
+++ xen/drivers/passthrough/vtd/iommu.h
@@ -509,7 +509,7 @@ struct vtd_iommu {
     u32 nr_pt_levels;
     u64	cap;
     u64	ecap;
-    spinlock_t lock; /* protect context, domain ids */
+    spinlock_t lock; /* protect context */
     spinlock_t register_lock; /* protect iommu register handling */
     u64 root_maddr; /* root entry machine address */
     nodeid_t node;
--- xen/include/xen/iommu.h.orig
+++ xen/include/xen/iommu.h
@@ -234,7 +234,7 @@ typedef int iommu_grdm_t(xen_pfn_t start
 struct iommu_ops {
     int (*init)(struct domain *d);
     void (*hwdom_init)(struct domain *d);
-    int (*quarantine_init)(struct domain *d);
+    int (*quarantine_init)(device_t *dev);
     int (*add_device)(u8 devfn, device_t *dev);
     int (*enable_device)(device_t *dev);
     int (*remove_device)(u8 devfn, device_t *dev);
@@ -352,6 +352,7 @@ int __must_check iommu_suspend(void);
 void iommu_resume(void);
 void iommu_crash_shutdown(void);
 int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
+int iommu_quarantine_dev_init(device_t *dev);
 
 #ifdef CONFIG_HAS_PCI
 int iommu_do_pci_domctl(struct xen_domctl *, struct domain *d,

File Added: pkgsrc/sysutils/xenkernel415/patches/Attic/patch-XSA401
$NetBSD: patch-XSA401,v 1.1 2022/06/24 13:07:52 bouyer Exp $

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Clean up _get_page_type()

Various fixes for clarity, ahead of making complicated changes.

 * Split the overflow check out of the if/else chain for type handling, as
   it's somewhat unrelated.
 * Comment the main if/else chain to explain what is going on.  Adjust one
   ASSERT() and state the bit layout for validate-locked and partial states.
 * Correct the comment about TLB flushing, as it's backwards.  The problem
   case is when writeable mappings are retained to a page which is becoming
   read-only, as that allows the guest to bypass Xen's safety checks for
   updates.
 * Reduce the scope of 'y'.  It is an artefact of the cmpxchg loop and not
   valid for use by subsequent logic.  Switch to using ACCESS_ONCE() to treat
   all reads as explicitly volatile.  The only thing preventing the validated
   wait-loop from being infinite is the compiler barrier hidden in
   cpu_relax().
 * Replace one page_get_owner(page) with the already-calculated 'd' in scope.

No functional change.
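
As a side note, the two pre-validation states called out by the new comments
can be sketched as predicates.  This is an illustration only: the bit
positions below are assumptions, not Xen's real type_info layout.

    #include <stdbool.h>
    #include <stdint.h>

    #define PGT_validated  (1ull << 62)            /* illustrative bit */
    #define PGT_partial    (1ull << 61)            /* illustrative bit */
    #define PGT_count_mask ((1ull << 56) - 1)      /* illustrative mask */

    /* Validate-locked: PGT_[type] | 1 - a concurrent caller is validating. */
    static bool is_validate_locked(uint64_t x)
    {
        return !(x & (PGT_validated | PGT_partial)) &&
               (x & PGT_count_mask) == 1;
    }

    /* Partial: PGT_[type] | PGT_partial | 1 - validation was preempted. */
    static bool is_partial(uint64_t x)
    {
        return (x & PGT_partial) && (x & PGT_count_mask) == 1;
    }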

This is part of XSA-401 / CVE-2022-26362.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 796faca64103..ddd32f88c798 100644
--- xen/arch/x86/mm.c.orig
+++ xen/arch/x86/mm.c
@@ -2935,16 +2935,17 @@ static int _put_page_type(struct page_info *page, unsigned int flags,
 static int _get_page_type(struct page_info *page, unsigned long type,
                           bool preemptible)
 {
-    unsigned long nx, x, y = page->u.inuse.type_info;
+    unsigned long nx, x;
     int rc = 0;
 
     ASSERT(!(type & ~(PGT_type_mask | PGT_pae_xen_l2)));
     ASSERT(!in_irq());
 
-    for ( ; ; )
+    for ( unsigned long y = ACCESS_ONCE(page->u.inuse.type_info); ; )
     {
         x  = y;
         nx = x + 1;
+
         if ( unlikely((nx & PGT_count_mask) == 0) )
         {
             gdprintk(XENLOG_WARNING,
@@ -2952,8 +2953,15 @@ static int _get_page_type(struct page_info *page, unsigned long type,
                      mfn_x(page_to_mfn(page)));
             return -EINVAL;
         }
-        else if ( unlikely((x & PGT_count_mask) == 0) )
+
+        if ( unlikely((x & PGT_count_mask) == 0) )
         {
+            /*
+             * Typeref 0 -> 1.
+             *
+             * Type changes are permitted when the typeref is 0.  If the type
+             * actually changes, the page needs re-validating.
+             */
             struct domain *d = page_get_owner(page);
 
             if ( d && shadow_mode_enabled(d) )
@@ -2964,8 +2972,8 @@ static int _get_page_type(struct page_info *page, unsigned long type,
             {
                 /*
                  * On type change we check to flush stale TLB entries. It is
-                 * vital that no other CPUs are left with mappings of a frame
-                 * which is about to become writeable to the guest.
+                 * vital that no other CPUs are left with writeable mappings
+                 * to a frame which is intending to become pgtable/segdesc.
                  */
                 cpumask_t *mask = this_cpu(scratch_cpumask);
 
@@ -2977,7 +2985,7 @@ static int _get_page_type(struct page_info *page, unsigned long type,
 
                 if ( unlikely(!cpumask_empty(mask)) &&
                      /* Shadow mode: track only writable pages. */
-                     (!shadow_mode_enabled(page_get_owner(page)) ||
+                     (!shadow_mode_enabled(d) ||
                       ((nx & PGT_type_mask) == PGT_writable_page)) )
                 {
                     perfc_incr(need_flush_tlb_flush);
@@ -3008,7 +3016,14 @@ static int _get_page_type(struct page_info *page, unsigned long type,
         }
         else if ( unlikely((x & (PGT_type_mask|PGT_pae_xen_l2)) != type) )
         {
-            /* Don't log failure if it could be a recursive-mapping attempt. */
+            /*
+             * else, we're trying to take a new reference, of the wrong type.
+             *
+             * This (being able to prohibit use of the wrong type) is what the
+             * typeref system exists for, but skip printing the failure if it
+             * looks like a recursive mapping, as subsequent logic might
+             * ultimately permit the attempt.
+             */
             if ( ((x & PGT_type_mask) == PGT_l2_page_table) &&
                  (type == PGT_l1_page_table) )
                 return -EINVAL;
@@ -3027,18 +3042,46 @@ static int _get_page_type(struct page_info *page, unsigned long type,
         }
         else if ( unlikely(!(x & PGT_validated)) )
         {
+            /*
+             * else, the count is non-zero, and we're grabbing the right type;
+             * but the page hasn't been validated yet.
+             *
+             * The page is in one of two states (depending on PGT_partial),
+             * and should have exactly one reference.
+             */
+            ASSERT((x & (PGT_type_mask | PGT_count_mask)) == (type | 1));
+
             if ( !(x & PGT_partial) )
             {
-                /* Someone else is updating validation of this page. Wait... */
+                /*
+                 * The page has been left in the "validate locked" state
+                 * (i.e. PGT_[type] | 1) which means that a concurrent caller
+                 * of _get_page_type() is in the middle of validation.
+                 *
+                 * Spin waiting for the concurrent user to complete (partial
+                 * or fully validated), then restart our attempt to acquire a
+                 * type reference.
+                 */
                 do {
                     if ( preemptible && hypercall_preempt_check() )
                         return -EINTR;
                     cpu_relax();
-                } while ( (y = page->u.inuse.type_info) == x );
+                } while ( (y = ACCESS_ONCE(page->u.inuse.type_info)) == x );
                 continue;
             }
-            /* Type ref count was left at 1 when PGT_partial got set. */
-            ASSERT((x & PGT_count_mask) == 1);
+
+            /*
+             * The page has been left in the "partial" state
+             * (i.e., PGT_[type] | PGT_partial | 1).
+             *
+             * Rather than bumping the type count, we need to try to grab the
+             * validation lock; if we succeed, we need to validate the page,
+             * then drop the general ref associated with the PGT_partial bit.
+             *
+             * We grab the validation lock by setting nx to (PGT_[type] | 1)
+             * (i.e., non-zero type count, neither PGT_validated nor
+             * PGT_partial set).
+             */
             nx = x & ~PGT_partial;
         }
 
@@ -3087,6 +3130,13 @@ static int _get_page_type(struct page_info *page, unsigned long type,
     }
 
  out:
+    /*
+     * Did we drop the PGT_partial bit when acquiring the typeref?  If so,
+     * drop the general reference that went along with it.
+     *
+     * N.B. validate_page() may have re-set PGT_partial, not reflected in
+     * nx, but will have taken an extra ref when doing so.
+     */
     if ( (x & PGT_partial) && !(nx & PGT_partial) )
         put_page(page);
 
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Fix ABAC cmpxchg() race in _get_page_type()

_get_page_type() suffers from a race condition where it incorrectly assumes
that because 'x' was read and a subsequent cmpxchg() succeeds, the type
cannot have changed in-between.  Consider:

CPU A:
  1. Creates an L2e referencing pg
     `-> _get_page_type(pg, PGT_l1_page_table), sees count 0, type PGT_writable_page
  2.     Issues flush_tlb_mask()
CPU B:
  3. Creates a writeable mapping of pg
     `-> _get_page_type(pg, PGT_writable_page), count increases to 1
  4. Writes into new mapping, creating a TLB entry for pg
  5. Removes the writeable mapping of pg
     `-> _put_page_type(pg), count goes back down to 0
CPU A:
  6.     Issues cmpxchg(), setting count 1, type PGT_l1_page_table

CPU B now has a writeable mapping to pg, which Xen believes is a pagetable and
suitably protected (i.e. read-only).  The TLB flush in step 2 must be deferred
until after the guest is prohibited from creating new writeable mappings,
which is after step 6.

Defer all safety actions until after the cmpxchg() has successfully taken the
intended typeref, because that is what prevents concurrent users from using
the old type.
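
A minimal standalone sketch of the resulting ordering, using C11 atomics in
place of Xen's cmpxchg loop; flush_stale_tlbs() is a hypothetical stand-in
for flush_tlb_mask():

    #include <stdatomic.h>
    #include <stdint.h>

    extern void flush_stale_tlbs(void);    /* stand-in, assumption */

    static void take_typeref_then_flush(_Atomic uint64_t *type_info)
    {
        uint64_t x = atomic_load(type_info);

        /* Retry until the cmpxchg succeeds; x is refreshed on failure. */
        while ( !atomic_compare_exchange_weak(type_info, &x, x + 1) )
            ;

        /*
         * The new typeref is now globally visible, so the guest can no
         * longer create new writeable mappings; flushing stale ones is
         * only safe from this point on.
         */
        flush_stale_tlbs();
    }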

Also remove the early validation for writeable and shared pages.  This removes
race conditions where one half of a parallel mapping attempt can return
successfully before:
 * The IOMMU pagetables are in sync with the new page type
 * Writeable mappings to shared pages have been torn down

This is part of XSA-401 / CVE-2022-26362.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index ddd32f88c798..1693b580b152 100644
--- xen/arch/x86/mm.c.orig
+++ xen/arch/x86/mm.c
@@ -2962,56 +2962,12 @@ static int _get_page_type(struct page_info *page, unsigned long type,
              * Type changes are permitted when the typeref is 0.  If the type
              * actually changes, the page needs re-validating.
              */
-            struct domain *d = page_get_owner(page);
-
-            if ( d && shadow_mode_enabled(d) )
-               shadow_prepare_page_type_change(d, page, type);
 
             ASSERT(!(x & PGT_pae_xen_l2));
             if ( (x & PGT_type_mask) != type )
             {
-                /*
-                 * On type change we check to flush stale TLB entries. It is
-                 * vital that no other CPUs are left with writeable mappings
-                 * to a frame which is intending to become pgtable/segdesc.
-                 */
-                cpumask_t *mask = this_cpu(scratch_cpumask);
-
-                BUG_ON(in_irq());
-                cpumask_copy(mask, d->dirty_cpumask);
-
-                /* Don't flush if the timestamp is old enough */
-                tlbflush_filter(mask, page->tlbflush_timestamp);
-
-                if ( unlikely(!cpumask_empty(mask)) &&
-                     /* Shadow mode: track only writable pages. */
-                     (!shadow_mode_enabled(d) ||
-                      ((nx & PGT_type_mask) == PGT_writable_page)) )
-                {
-                    perfc_incr(need_flush_tlb_flush);
-                    /*
-                     * If page was a page table make sure the flush is
-                     * performed using an IPI in order to avoid changing the
-                     * type of a page table page under the feet of
-                     * spurious_page_fault().
-                     */
-                    flush_mask(mask,
-                               (x & PGT_type_mask) &&
-                               (x & PGT_type_mask) <= PGT_root_page_table
-                               ? FLUSH_TLB | FLUSH_FORCE_IPI
-                               : FLUSH_TLB);
-                }
-
-                /* We lose existing type and validity. */
                 nx &= ~(PGT_type_mask | PGT_validated);
                 nx |= type;
-
-                /*
-                 * No special validation needed for writable pages.
-                 * Page tables and GDT/LDT need to be scanned for validity.
-                 */
-                if ( type == PGT_writable_page || type == PGT_shared_page )
-                    nx |= PGT_validated;
             }
         }
         else if ( unlikely((x & (PGT_type_mask|PGT_pae_xen_l2)) != type) )
@@ -3092,6 +3048,56 @@ static int _get_page_type(struct page_info *page, unsigned long type,
             return -EINTR;
     }
 
+    /*
+     * One typeref has been taken and is now globally visible.
+     *
+     * The page is either in the "validate locked" state (PGT_[type] | 1) or
+     * fully validated (PGT_[type] | PGT_validated | >0).
+     */
+
+    if ( unlikely((x & PGT_count_mask) == 0) )
+    {
+        struct domain *d = page_get_owner(page);
+
+        if ( d && shadow_mode_enabled(d) )
+            shadow_prepare_page_type_change(d, page, type);
+
+        if ( (x & PGT_type_mask) != type )
+        {
+            /*
+             * On type change we check to flush stale TLB entries. It is
+             * vital that no other CPUs are left with writeable mappings
+             * to a frame which is intending to become pgtable/segdesc.
+             */
+            cpumask_t *mask = this_cpu(scratch_cpumask);
+
+            BUG_ON(in_irq());
+            cpumask_copy(mask, d->dirty_cpumask);
+
+            /* Don't flush if the timestamp is old enough */
+            tlbflush_filter(mask, page->tlbflush_timestamp);
+
+            if ( unlikely(!cpumask_empty(mask)) &&
+                 /* Shadow mode: track only writable pages. */
+                 (!shadow_mode_enabled(d) ||
+                  ((nx & PGT_type_mask) == PGT_writable_page)) )
+            {
+                perfc_incr(need_flush_tlb_flush);
+                /*
+                 * If page was a page table make sure the flush is
+                 * performed using an IPI in order to avoid changing the
+                 * type of a page table page under the feet of
+                 * spurious_page_fault().
+                 */
+                flush_mask(mask,
+                           (x & PGT_type_mask) &&
+                           (x & PGT_type_mask) <= PGT_root_page_table
+                           ? FLUSH_TLB | FLUSH_FORCE_IPI
+                           : FLUSH_TLB);
+            }
+        }
+    }
+
     if ( unlikely(((x & PGT_type_mask) == PGT_writable_page) !=
                   (type == PGT_writable_page)) )
     {
@@ -3120,13 +3126,25 @@ static int _get_page_type(struct page_info *page, unsigned long type,
 
     if ( unlikely(!(nx & PGT_validated)) )
     {
-        if ( !(x & PGT_partial) )
+        /*
+         * No special validation needed for writable or shared pages.  Page
+         * tables and GDT/LDT need to have their contents audited.
+         *
+         * per validate_page(), non-atomic updates are fine here.
+         */
+        if ( type == PGT_writable_page || type == PGT_shared_page )
+            page->u.inuse.type_info |= PGT_validated;
+        else
         {
-            page->nr_validated_ptes = 0;
-            page->partial_flags = 0;
-            page->linear_pt_count = 0;
+            if ( !(x & PGT_partial) )
+            {
+                page->nr_validated_ptes = 0;
+                page->partial_flags = 0;
+                page->linear_pt_count = 0;
+            }
+
+            rc = validate_page(page, type, preemptible);
         }
-        rc = validate_page(page, type, preemptible);
     }
 
  out:

File Added: pkgsrc/sysutils/xenkernel415/patches/Attic/patch-XSA402
$NetBSD: patch-XSA402,v 1.1 2022/06/24 13:07:52 bouyer Exp $

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/page: Introduce _PAGE_* constants for memory types

... rather than opencoding the PAT/PCD/PWT attributes in __PAGE_HYPERVISOR_*
constants.  These are going to be needed by forthcoming logic.

No functional change.
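
As a usage sketch, the constants compose into PTE flags as below.  The bit
positions are the architectural x86 PAT/PCD/PWT bits; the index-to-type
comments assume Xen's choice of MSR_PAT, per the diff that follows.

    #define _PAGE_PWT 0x008u                   /* bit 3 */
    #define _PAGE_PCD 0x010u                   /* bit 4 */
    #define _PAGE_PAT 0x080u                   /* bit 7 */

    #define _PAGE_UC  (_PAGE_PCD | _PAGE_PWT)  /* PAT index 3 -> UC */
    #define _PAGE_WC  (_PAGE_PAT)              /* PAT index 4 -> WC */

    /* e.g. an uncacheable mapping: __PAGE_HYPERVISOR | _PAGE_UC */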

This is part of XSA-402.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/include/asm-x86/page.h b/xen/include/asm-x86/page.h
index 4c7f2cb70c69..534bc1f403b3 100644
--- xen/include/asm-x86/page.h.orig
+++ xen/include/asm-x86/page.h
@@ -336,6 +336,14 @@ void efi_update_l4_pgtable(unsigned int l4idx, l4_pgentry_t);
 
 #define PAGE_CACHE_ATTRS (_PAGE_PAT | _PAGE_PCD | _PAGE_PWT)
 
+/* Memory types, encoded under Xen's choice of MSR_PAT. */
+#define _PAGE_WB         (                                0)
+#define _PAGE_WT         (                        _PAGE_PWT)
+#define _PAGE_UCM        (            _PAGE_PCD            )
+#define _PAGE_UC         (            _PAGE_PCD | _PAGE_PWT)
+#define _PAGE_WC         (_PAGE_PAT                        )
+#define _PAGE_WP         (_PAGE_PAT |             _PAGE_PWT)
+
 /*
  * Debug option: Ensure that granted mappings are not implicitly unmapped.
  * WARNING: This will need to be disabled to run OSes that use the spare PTE
@@ -354,8 +362,8 @@ void efi_update_l4_pgtable(unsigned int l4idx, l4_pgentry_t);
 #define __PAGE_HYPERVISOR_RX      (_PAGE_PRESENT | _PAGE_ACCESSED)
 #define __PAGE_HYPERVISOR         (__PAGE_HYPERVISOR_RX | \
                                    _PAGE_DIRTY | _PAGE_RW)
-#define __PAGE_HYPERVISOR_UCMINUS (__PAGE_HYPERVISOR | _PAGE_PCD)
-#define __PAGE_HYPERVISOR_UC      (__PAGE_HYPERVISOR | _PAGE_PCD | _PAGE_PWT)
+#define __PAGE_HYPERVISOR_UCMINUS (__PAGE_HYPERVISOR | _PAGE_UCM)
+#define __PAGE_HYPERVISOR_UC      (__PAGE_HYPERVISOR | _PAGE_UC)
 #define __PAGE_HYPERVISOR_SHSTK   (__PAGE_HYPERVISOR_RO | _PAGE_DIRTY)
 
 #define MAP_SMALL_PAGES _PAGE_AVAIL0 /* don't use superpages mappings */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86: Don't change the cacheability of the directmap

Changeset 55f97f49b7ce ("x86: Change cache attributes of Xen 1:1 page mappings
in response to guest mapping requests") attempted to keep the cacheability
consistent between different mappings of the same page.

The reason wasn't described in the changelog, but it is understood to be in
regard to a concern over machine check exceptions, owing to errata when using
mixed cacheabilities.  It did this primarily by updating Xen's mapping of the
page in the direct map when the guest mapped a page with reduced cacheability.

Unfortunately, the logic didn't actually prevent mixed cacheability from
occurring:
 * A guest could map a page normally, and then map the same page with
   different cacheability; nothing prevented this.
 * The cacheability of the directmap was always latest-takes-precedence in
   terms of guest requests.
 * Grant-mapped frames with lesser cacheability didn't adjust the page's
   cacheattr settings.
 * The map_domain_page() function still unconditionally created WB mappings,
   irrespective of the page's cacheattr settings.

Additionally, update_xen_mappings() had a bug where the alias calculation was
wrong for MFNs which were .init content, which should have been treated as
fully guest pages, not Xen pages.

Worse yet, the logic introduced a vulnerability whereby necessary
pagetable/segdesc adjustments made by Xen in the validation logic could become
non-coherent between the cache and main memory.  The CPU could subsequently
operate on the stale value in the cache, rather than the safe value in main
memory.

The directmap contains primarily mappings of RAM.  PAT/MTRR conflict
resolution is asymmetric, and generally for MTRR=WB ranges, PAT of lesser
cacheability resolves to being coherent.  The special case is WC mappings,
which are non-coherent against MTRR=WB regions (except for fully-coherent
CPUs).

Xen must not have any WC cacheability in the directmap, to prevent Xen's
actions from creating non-coherency.  (Guest actions creating non-coherency
are dealt with in subsequent patches.)  All memory types for MTRR=WB ranges
inter-operate coherently, so leave Xen's directmap mappings as WB.

Only PV guests with access to devices can use reduced-cacheability mappings to
begin with, and they're trusted not to mount DoSs against the system anyway.

Drop PGC_cacheattr_{base,mask} entirely, and the logic to manipulate them.
Shift the later PGC_* constants up, to gain 3 extra bits in the main reference
count.  Retain the is_special_page() check in get_page_from_l1e(), because a
guest has no business using reduced cacheability on these pages.

This reverts changeset 55f97f49b7ce6c3520c555d19caac6cf3f9a5df0

This is CVE-2022-26363, part of XSA-402.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 2644b9f0337c..6ce8c19dcecc 100644
--- xen/arch/x86/mm.c.orig
+++ xen/arch/x86/mm.c
@@ -783,28 +783,6 @@ bool is_iomem_page(mfn_t mfn)
     return (page_get_owner(page) == dom_io);
 }
 
-static int update_xen_mappings(unsigned long mfn, unsigned int cacheattr)
-{
-    int err = 0;
-    bool alias = mfn >= PFN_DOWN(xen_phys_start) &&
-         mfn < PFN_UP(xen_phys_start + xen_virt_end - XEN_VIRT_START);
-    unsigned long xen_va =
-        XEN_VIRT_START + ((mfn - PFN_DOWN(xen_phys_start)) << PAGE_SHIFT);
-
-    if ( boot_cpu_has(X86_FEATURE_XEN_SELFSNOOP) )
-        return 0;
-
-    if ( unlikely(alias) && cacheattr )
-        err = map_pages_to_xen(xen_va, _mfn(mfn), 1, 0);
-    if ( !err )
-        err = map_pages_to_xen((unsigned long)mfn_to_virt(mfn), _mfn(mfn), 1,
-                     PAGE_HYPERVISOR | cacheattr_to_pte_flags(cacheattr));
-    if ( unlikely(alias) && !cacheattr && !err )
-        err = map_pages_to_xen(xen_va, _mfn(mfn), 1, PAGE_HYPERVISOR);
-
-    return err;
-}
-
 #ifndef NDEBUG
 struct mmio_emul_range_ctxt {
     const struct domain *d;
@@ -1009,47 +987,14 @@ get_page_from_l1e(
         goto could_not_pin;
     }
 
-    if ( pte_flags_to_cacheattr(l1f) !=
-         ((page->count_info & PGC_cacheattr_mask) >> PGC_cacheattr_base) )
+    if ( (l1f & PAGE_CACHE_ATTRS) != _PAGE_WB && is_special_page(page) )
     {
-        unsigned long x, nx, y = page->count_info;
-        unsigned long cacheattr = pte_flags_to_cacheattr(l1f);
-        int err;
-
-        if ( is_special_page(page) )
-        {
-            if ( write )
-                put_page_type(page);
-            put_page(page);
-            gdprintk(XENLOG_WARNING,
-                     "Attempt to change cache attributes of Xen heap page\n");
-            return -EACCES;
-        }
-
-        do {
-            x  = y;
-            nx = (x & ~PGC_cacheattr_mask) | (cacheattr << PGC_cacheattr_base);
-        } while ( (y = cmpxchg(&page->count_info, x, nx)) != x );
-
-        err = update_xen_mappings(mfn, cacheattr);
-        if ( unlikely(err) )
-        {
-            cacheattr = y & PGC_cacheattr_mask;
-            do {
-                x  = y;
-                nx = (x & ~PGC_cacheattr_mask) | cacheattr;
-            } while ( (y = cmpxchg(&page->count_info, x, nx)) != x );
-
-            if ( write )
-                put_page_type(page);
-            put_page(page);
-
-            gdprintk(XENLOG_WARNING, "Error updating mappings for mfn %" PRI_mfn
-                     " (pfn %" PRI_pfn ", from L1 entry %" PRIpte ") for d%d\n",
-                     mfn, get_gpfn_from_mfn(mfn),
-                     l1e_get_intpte(l1e), l1e_owner->domain_id);
-            return err;
-        }
+        if ( write )
+            put_page_type(page);
+        put_page(page);
+        gdprintk(XENLOG_WARNING,
+                 "Attempt to change cache attributes of Xen heap page\n");
+        return -EACCES;
     }
 
     return 0;
@@ -2455,25 +2400,10 @@ static int mod_l4_entry(l4_pgentry_t *pl4e,
  */
 static int cleanup_page_mappings(struct page_info *page)
 {
-    unsigned int cacheattr =
-        (page->count_info & PGC_cacheattr_mask) >> PGC_cacheattr_base;
     int rc = 0;
     unsigned long mfn = mfn_x(page_to_mfn(page));
 
     /*
-     * If we've modified xen mappings as a result of guest cache
-     * attributes, restore them to the "normal" state.
-     */
-    if ( unlikely(cacheattr) )
-    {
-        page->count_info &= ~PGC_cacheattr_mask;
-
-        BUG_ON(is_special_page(page));
-
-        rc = update_xen_mappings(mfn, 0);
-    }
-
-    /*
      * If this may be in a PV domain's IOMMU, remove it.
      *
      * NB that writable xenheap pages have their type set and cleared by
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 041c158f03f6..f5b8862b8374 100644
--- xen/include/asm-x86/mm.h.orig
+++ xen/include/asm-x86/mm.h
@@ -69,25 +69,22 @@
  /* Set when is using a page as a page table */
 #define _PGC_page_table   PG_shift(3)
 #define PGC_page_table    PG_mask(1, 3)
- /* 3-bit PAT/PCD/PWT cache-attribute hint. */
-#define PGC_cacheattr_base PG_shift(6)
-#define PGC_cacheattr_mask PG_mask(7, 6)
  /* Page is broken? */
-#define _PGC_broken       PG_shift(7)
-#define PGC_broken        PG_mask(1, 7)
+#define _PGC_broken       PG_shift(4)
+#define PGC_broken        PG_mask(1, 4)
  /* Mutually-exclusive page states: { inuse, offlining, offlined, free }. */
-#define PGC_state         PG_mask(3, 9)
-#define PGC_state_inuse   PG_mask(0, 9)
-#define PGC_state_offlining PG_mask(1, 9)
-#define PGC_state_offlined PG_mask(2, 9)
-#define PGC_state_free    PG_mask(3, 9)
+#define PGC_state           PG_mask(3, 6)
+#define PGC_state_inuse     PG_mask(0, 6)
+#define PGC_state_offlining PG_mask(1, 6)
+#define PGC_state_offlined  PG_mask(2, 6)
+#define PGC_state_free      PG_mask(3, 6)
 #define page_state_is(pg, st) (((pg)->count_info&PGC_state) == PGC_state_##st)
 /* Page is not reference counted */
-#define _PGC_extra        PG_shift(10)
-#define PGC_extra         PG_mask(1, 10)
+#define _PGC_extra        PG_shift(7)
+#define PGC_extra         PG_mask(1, 7)
 
 /* Count of references to this frame. */
-#define PGC_count_width   PG_shift(10)
+#define PGC_count_width   PG_shift(7)
 #define PGC_count_mask    ((1UL<<PGC_count_width)-1)
 
 /*
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86: Split cache_flush() out of cache_writeback()

Subsequent changes will want a fully flushing version.

Use the new helper rather than opencoding it in flush_area_local().  This
resolves an outstanding issue where the conditional sfence is on the wrong
side of the clflushopt loop.  clflushopt is ordered with respect to older
stores, not to younger stores.
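
A sketch of the corrected ordering (64 is an assumed cache-line size; the
real code derives it from CPUID and uses the alternatives framework):

    #include <stddef.h>
    #include <stdint.h>

    static void sketch_cache_flush(const void *addr, size_t size)
    {
        const char *p = (const char *)((uintptr_t)addr & ~(uintptr_t)63);
        const char *end = (const char *)addr + size;

        for ( ; p < end; p += 64 )
            asm volatile ( "clflushopt %0" :: "m" (*p) );

        /*
         * CLFLUSHOPT is ordered against older stores but not younger
         * ones, so the fence must follow the loop, not precede it.
         */
        asm volatile ( "sfence" ::: "memory" );
    }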

Rename gnttab_cache_flush()'s helper to avoid a name collision:
grant_table.c can see the prototype from cache.h, so the build fails
otherwise.

This is part of XSA-402.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Xen 4.16 and earlier:
 * Also backport half of c/s 3330013e67396 "VT-d / x86: re-arrange cache
   syncing" to split cache_writeback() out of the IOMMU logic, but without the
   associated hooks changes.

diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 25798df50f54..0c912b8669f8 100644
--- xen/arch/x86/flushtlb.c.orig
+++ xen/arch/x86/flushtlb.c
@@ -234,7 +234,7 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
     if ( flags & FLUSH_CACHE )
     {
         const struct cpuinfo_x86 *c = &current_cpu_data;
-        unsigned long i, sz = 0;
+        unsigned long sz = 0;
 
         if ( order < (BITS_PER_LONG - PAGE_SHIFT) )
             sz = 1UL << (order + PAGE_SHIFT);
@@ -244,13 +244,7 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
              c->x86_clflush_size && c->x86_cache_size && sz &&
              ((sz >> 10) < c->x86_cache_size) )
         {
-            alternative("", "sfence", X86_FEATURE_CLFLUSHOPT);
-            for ( i = 0; i < sz; i += c->x86_clflush_size )
-                alternative_input(".byte " __stringify(NOP_DS_PREFIX) ";"
-                                  " clflush %0",
-                                  "data16 clflush %0",      /* clflushopt */
-                                  X86_FEATURE_CLFLUSHOPT,
-                                  "m" (((const char *)va)[i]));
+            cache_flush(va, sz);
             flags &= ~FLUSH_CACHE;
         }
         else
@@ -265,6 +259,80 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
     return flags;
 }
 
+void cache_flush(const void *addr, unsigned int size)
+{
+    /*
+     * This function may be called before current_cpu_data is established.
+     * Hence a fallback is needed to prevent the loop below becoming infinite.
+     */
+    unsigned int clflush_size = current_cpu_data.x86_clflush_size ?: 16;
+    const void *end = addr + size;
+
+    addr -= (unsigned long)addr & (clflush_size - 1);
+    for ( ; addr < end; addr += clflush_size )
+    {
+        /*
+         * Note regarding the "ds" prefix use: it's faster to do a clflush
+         * + prefix than a clflush + nop, and hence the prefix is added instead
+         * of letting the alternative framework fill the gap by appending nops.
+         */
+        alternative_io("ds; clflush %[p]",
+                       "data16 clflush %[p]", /* clflushopt */
+                       X86_FEATURE_CLFLUSHOPT,
+                       /* no outputs */,
+                       [p] "m" (*(const char *)(addr)));
+    }
+
+    alternative("", "sfence", X86_FEATURE_CLFLUSHOPT);
+}
+
+void cache_writeback(const void *addr, unsigned int size)
+{
+    unsigned int clflush_size;
+    const void *end = addr + size;
+
+    /* Fall back to CLFLUSH{,OPT} when CLWB isn't available. */
+    if ( !boot_cpu_has(X86_FEATURE_CLWB) )
+        return cache_flush(addr, size);
+
+    /*
+     * This function may be called before current_cpu_data is established.
+     * Hence a fallback is needed to prevent the loop below becoming infinite.
+     */
+    clflush_size = current_cpu_data.x86_clflush_size ?: 16;
+    addr -= (unsigned long)addr & (clflush_size - 1);
+    for ( ; addr < end; addr += clflush_size )
+    {
+/*
+ * The arguments to a macro must not include preprocessor directives. Doing so
+ * results in undefined behavior, so we have to create some defines here in
+ * order to avoid it.
+ */
+#if defined(HAVE_AS_CLWB)
+# define CLWB_ENCODING "clwb %[p]"
+#elif defined(HAVE_AS_XSAVEOPT)
+# define CLWB_ENCODING "data16 xsaveopt %[p]" /* clwb */
+#else
+# define CLWB_ENCODING ".byte 0x66, 0x0f, 0xae, 0x30" /* clwb (%%rax) */
+#endif
+
+#define BASE_INPUT(addr) [p] "m" (*(const char *)(addr))
+#if defined(HAVE_AS_CLWB) || defined(HAVE_AS_XSAVEOPT)
+# define INPUT BASE_INPUT
+#else
+# define INPUT(addr) "a" (addr), BASE_INPUT(addr)
+#endif
+
+        asm volatile (CLWB_ENCODING :: INPUT(addr));
+
+#undef INPUT
+#undef BASE_INPUT
+#undef CLWB_ENCODING
+    }
+
+    asm volatile ("sfence" ::: "memory");
+}
+
 unsigned int guest_flush_tlb_flags(const struct domain *d)
 {
     bool shadow = paging_mode_shadow(d);
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index 47b019c75017..77bba9806937 100644
--- xen/common/grant_table.c.orig
+++ xen/common/grant_table.c
@@ -3423,7 +3423,7 @@ gnttab_swap_grant_ref(XEN_GUEST_HANDLE_PARAM(gnttab_swap_grant_ref_t) uop,
     return 0;
 }
 
-static int cache_flush(const gnttab_cache_flush_t *cflush, grant_ref_t *cur_ref)
+static int _cache_flush(const gnttab_cache_flush_t *cflush, grant_ref_t *cur_ref)
 {
     struct domain *d, *owner;
     struct page_info *page;
@@ -3517,7 +3517,7 @@ gnttab_cache_flush(XEN_GUEST_HANDLE_PARAM(gnttab_cache_flush_t) uop,
             return -EFAULT;
         for ( ; ; )
         {
-            int ret = cache_flush(&op, cur_ref);
+            int ret = _cache_flush(&op, cur_ref);
 
             if ( ret < 0 )
                 return ret;
diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index cf4d2218fa8b..8f70ae727b86 100644
--- xen/drivers/passthrough/vtd/extern.h.orig
+++ xen/drivers/passthrough/vtd/extern.h
@@ -76,7 +76,6 @@ int __must_check qinval_device_iotlb_sync(struct vtd_iommu *iommu,
                                           struct pci_dev *pdev,
                                           u16 did, u16 size, u64 addr);
 
-unsigned int get_cache_line_size(void);
 void flush_all_cache(void);
 
 uint64_t alloc_pgtable_maddr(unsigned long npages, nodeid_t node);
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index a063462cff5a..68a658930a6a 100644
--- xen/drivers/passthrough/vtd/iommu.c.orig
+++ xen/drivers/passthrough/vtd/iommu.c
@@ -31,6 +31,7 @@
 #include <xen/pci.h>
 #include <xen/pci_regs.h>
 #include <xen/keyhandler.h>
+#include <asm/cache.h>
 #include <asm/msi.h>
 #include <asm/nops.h>
 #include <asm/irq.h>
@@ -204,54 +205,6 @@ static void check_cleanup_domid_map(struct domain *d,
     }
 }
 
-static void sync_cache(const void *addr, unsigned int size)
-{
-    static unsigned long clflush_size = 0;
-    const void *end = addr + size;
-
-    if ( clflush_size == 0 )
-        clflush_size = get_cache_line_size();
-
-    addr -= (unsigned long)addr & (clflush_size - 1);
-    for ( ; addr < end; addr += clflush_size )
-/*
- * The arguments to a macro must not include preprocessor directives. Doing so
- * results in undefined behavior, so we have to create some defines here in
- * order to avoid it.
- */
-#if defined(HAVE_AS_CLWB)
-# define CLWB_ENCODING "clwb %[p]"
-#elif defined(HAVE_AS_XSAVEOPT)
-# define CLWB_ENCODING "data16 xsaveopt %[p]" /* clwb */
-#else
-# define CLWB_ENCODING ".byte 0x66, 0x0f, 0xae, 0x30" /* clwb (%%rax) */
-#endif
-
-#define BASE_INPUT(addr) [p] "m" (*(const char *)(addr))
-#if defined(HAVE_AS_CLWB) || defined(HAVE_AS_XSAVEOPT)
-# define INPUT BASE_INPUT
-#else
-# define INPUT(addr) "a" (addr), BASE_INPUT(addr)
-#endif
-        /*
-         * Note regarding the use of NOP_DS_PREFIX: it's faster to do a clflush
-         * + prefix than a clflush + nop, and hence the prefix is added instead
-         * of letting the alternative framework fill the gap by appending nops.
-         */
-        alternative_io_2(".byte " __stringify(NOP_DS_PREFIX) "; clflush %[p]",
-                         "data16 clflush %[p]", /* clflushopt */
-                         X86_FEATURE_CLFLUSHOPT,
-                         CLWB_ENCODING,
-                         X86_FEATURE_CLWB, /* no outputs */,
-                         INPUT(addr));
-#undef INPUT
-#undef BASE_INPUT
-#undef CLWB_ENCODING
-
-    alternative_2("", "sfence", X86_FEATURE_CLFLUSHOPT,
-                      "sfence", X86_FEATURE_CLWB);
-}
-
 /* Allocate page table, return its machine address */
 uint64_t alloc_pgtable_maddr(unsigned long npages, nodeid_t node)
 {
@@ -271,7 +224,7 @@ uint64_t alloc_pgtable_maddr(unsigned long npages, nodeid_t node)
         clear_page(vaddr);
 
         if ( (iommu_ops.init ? &iommu_ops : &vtd_ops)->sync_cache )
-            sync_cache(vaddr, PAGE_SIZE);
+            cache_writeback(vaddr, PAGE_SIZE);
         unmap_domain_page(vaddr);
         cur_pg++;
     }
@@ -1252,7 +1252,7 @@
     iommu->nr_pt_levels = agaw_to_level(agaw);
 
     if ( !ecap_coherent(iommu->ecap) )
-        vtd_ops.sync_cache = sync_cache;
+        vtd_ops.sync_cache = cache_writeback;
 
     /* allocate domain id bitmap */
     nr_dom = cap_ndoms(iommu->cap);
diff --git a/xen/drivers/passthrough/vtd/x86/vtd.c b/xen/drivers/passthrough/vtd/x86/vtd.c
index 6681dccd6970..55f0faa521cb 100644
--- xen/drivers/passthrough/vtd/x86/vtd.c.orig
+++ xen/drivers/passthrough/vtd/x86/vtd.c
@@ -47,11 +47,6 @@ void unmap_vtd_domain_page(const void *va)
     unmap_domain_page(va);
 }
 
-unsigned int get_cache_line_size(void)
-{
-    return ((cpuid_ebx(1) >> 8) & 0xff) * 8;
-}
-
 void flush_all_cache()
 {
     wbinvd();
diff --git a/xen/include/asm-x86/cache.h b/xen/include/asm-x86/cache.h
index 1f7173d8c72c..e4770efb22b9 100644
--- xen/include/asm-x86/cache.h.orig
+++ xen/include/asm-x86/cache.h
@@ -11,4 +11,11 @@
 
 #define __read_mostly __section(".data.read_mostly")
 
+#ifndef __ASSEMBLY__
+
+void cache_flush(const void *addr, unsigned int size);
+void cache_writeback(const void *addr, unsigned int size);
+
+#endif
+
 #endif
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/amd: Work around CLFLUSH ordering on older parts

On pre-CLFLUSHOPT AMD CPUs, CLFLUSH is weakly ordered with everything,
including reads and writes to the address, and LFENCE/SFENCE instructions.

This creates a multitude of problematic corner cases, laid out in the manual.
Arrange to use MFENCE on both sides of the CLFLUSH to force proper ordering.
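
Sketched standalone (64 is an assumed cache-line size; the real code applies
the MFENCEs via alternatives keyed on X86_BUG_CLFLUSH_MFENCE):

    #include <stddef.h>

    static void sketch_amd_cache_flush(const char *addr, size_t size)
    {
        const char *end = addr + size;

        asm volatile ( "mfence" ::: "memory" ); /* order vs older accesses */

        for ( ; addr < end; addr += 64 )
            asm volatile ( "clflush %0" :: "m" (*addr) );

        asm volatile ( "mfence" ::: "memory" ); /* order vs younger accesses */
    }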

This is part of XSA-402.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c
index 1ee687d0d224..986672a072b7 100644
--- xen/arch/x86/cpu/amd.c.orig
+++ xen/arch/x86/cpu/amd.c
@@ -787,6 +787,14 @@ static void init_amd(struct cpuinfo_x86 *c)
 	if (!cpu_has_lfence_dispatch)
 		__set_bit(X86_FEATURE_MFENCE_RDTSC, c->x86_capability);
 
+	/*
+	 * On pre-CLFLUSHOPT AMD CPUs, CLFLUSH is weakly ordered with
+	 * everything, including reads and writes to address, and
+	 * LFENCE/SFENCE instructions.
+	 */
+	if (!cpu_has_clflushopt)
+		setup_force_cpu_cap(X86_BUG_CLFLUSH_MFENCE);
+
 	switch(c->x86)
 	{
 	case 0xf ... 0x11:
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 0c912b8669f8..dcbb4064012e 100644
--- xen/arch/x86/flushtlb.c.orig
+++ xen/arch/x86/flushtlb.c
@@ -259,6 +259,13 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
     return flags;
 }
 
+/*
+ * On pre-CLFLUSHOPT AMD CPUs, CLFLUSH is weakly ordered with everything,
+ * including reads and writes to address, and LFENCE/SFENCE instructions.
+ *
+ * This function only works safely after alternatives have run.  Luckily, at
+ * the time of writing, we don't flush the caches that early.
+ */
 void cache_flush(const void *addr, unsigned int size)
 {
     /*
@@ -268,6 +275,8 @@ void cache_flush(const void *addr, unsigned int size)
     unsigned int clflush_size = current_cpu_data.x86_clflush_size ?: 16;
     const void *end = addr + size;
 
+    alternative("", "mfence", X86_BUG_CLFLUSH_MFENCE);
+
     addr -= (unsigned long)addr & (clflush_size - 1);
     for ( ; addr < end; addr += clflush_size )
     {
@@ -283,7 +292,9 @@ void cache_flush(const void *addr, unsigned int size)
                        [p] "m" (*(const char *)(addr)));
     }
 
-    alternative("", "sfence", X86_FEATURE_CLFLUSHOPT);
+    alternative_2("",
+                  "sfence", X86_FEATURE_CLFLUSHOPT,
+                  "mfence", X86_BUG_CLFLUSH_MFENCE);
 }
 
 void cache_writeback(const void *addr, unsigned int size)
diff --git a/xen/include/asm-x86/cpufeatures.h b/xen/include/asm-x86/cpufeatures.h
index fe2f97354fb6..09f619459bc7 100644
--- xen/include/asm-x86/cpufeatures.h.orig
+++ xen/include/asm-x86/cpufeatures.h
@@ -46,6 +46,7 @@ XEN_CPUFEATURE(XEN_IBT,           X86_SYNTH(27)) /* Xen uses CET Indirect Branch
 #define X86_BUG(x) ((FSCAPINTS + X86_NR_SYNTH) * 32 + (x))
 
 #define X86_BUG_FPU_PTRS          X86_BUG( 0) /* (F)X{SAVE,RSTOR} doesn't save/restore FOP/FIP/FDP. */
+#define X86_BUG_CLFLUSH_MFENCE    X86_BUG( 2) /* MFENCE needed to serialise CLFLUSH */
 
 /* Total number of capability words, inc synth and bug words. */
 #define NCAPINTS (FSCAPINTS + X86_NR_SYNTH + X86_NR_BUG) /* N 32-bit words worth of info */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Track and flush non-coherent mappings of RAM

There are legitimate uses of WC mappings of RAM, e.g. for DMA buffers with
devices that make non-coherent writes.  The Linux sound subsystem makes
extensive use of this technique.

For such usecases, the guest's DMA buffer is mapped and consistently used as
WC, and Xen doesn't interact with the buffer.

However, a mischievous guest can use WC mappings to deliberately create
non-coherency between the cache and RAM, and use this to trick Xen into
validating a pagetable which isn't actually safe.

Allocate a new PGT_non_coherent to track the non-coherency of mappings.  Set
it whenever a non-coherent writeable mapping is created.  If the page is used
as anything other than PGT_writable_page, force a cache flush before
validation.  Also force a cache flush before the page is returned to the heap.
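
The shape of the pre-validation check is roughly as follows; the constants
are illustrative stand-ins and cache_flush() is the helper introduced
earlier in this series:

    extern void cache_flush(const void *addr, unsigned int size);

    #define PGT_writable_page 0x7000000000000000ull  /* illustrative */
    #define PGT_non_coherent  0x0040000000000000ull  /* illustrative */

    static void make_coherent_before_validation(unsigned long *type_info,
                                                unsigned long type, void *va)
    {
        /* Used as anything but a plain writeable page? Flush first. */
        if ( (*type_info & PGT_non_coherent) && type != PGT_writable_page )
        {
            cache_flush(va, 4096);
            *type_info &= ~PGT_non_coherent;
        }
    }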

This is CVE-2022-26364, part of XSA-402.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 6ce8c19dcecc..1759b84ba97c 100644
--- xen/arch/x86/mm.c.orig
+++ xen/arch/x86/mm.c
@@ -997,6 +997,15 @@ get_page_from_l1e(
         return -EACCES;
     }
 
+    /*
+     * Track writeable non-coherent mappings to RAM pages, to trigger a cache
+     * flush later if the target is used as anything but a PGT_writeable page.
+     * We care about all writeable mappings, including foreign mappings.
+     */
+    if ( !boot_cpu_has(X86_FEATURE_XEN_SELFSNOOP) &&
+         (l1f & (PAGE_CACHE_ATTRS | _PAGE_RW)) == (_PAGE_WC | _PAGE_RW) )
+        set_bit(_PGT_non_coherent, &page->u.inuse.type_info);
+
     return 0;
 
  could_not_pin:
@@ -2442,6 +2451,19 @@ static int cleanup_page_mappings(struct page_info *page)
         }
     }
 
+    /*
+     * Flush the cache if there were previously non-coherent writeable
+     * mappings of this page.  This forces the page to be coherent before it
+     * is freed back to the heap.
+     */
+    if ( __test_and_clear_bit(_PGT_non_coherent, &page->u.inuse.type_info) )
+    {
+        void *addr = __map_domain_page(page);
+
+        cache_flush(addr, PAGE_SIZE);
+        unmap_domain_page(addr);
+    }
+
     return rc;
 }
 
@@ -3016,6 +3038,22 @@ static int _get_page_type(struct page_info *page, unsigned long type,
     if ( unlikely(!(nx & PGT_validated)) )
     {
         /*
+         * Flush the cache if there were previously non-coherent mappings of
+         * this page, and we're trying to use it as anything other than a
+         * writeable page.  This forces the page to be coherent before we
+         * validate its contents for safety.
+         */
+        if ( (nx & PGT_non_coherent) && type != PGT_writable_page )
+        {
+            void *addr = __map_domain_page(page);
+
+            cache_flush(addr, PAGE_SIZE);
+            unmap_domain_page(addr);
+
+            page->u.inuse.type_info &= ~PGT_non_coherent;
+        }
+
+        /*
          * No special validation needed for writable or shared pages.  Page
          * tables and GDT/LDT need to have their contents audited.
          *
diff --git a/xen/arch/x86/pv/grant_table.c b/xen/arch/x86/pv/grant_table.c
index 0325618c9883..81c72e61ed55 100644
--- xen/arch/x86/pv/grant_table.c.orig
+++ xen/arch/x86/pv/grant_table.c
@@ -109,7 +109,17 @@ int create_grant_pv_mapping(uint64_t addr, mfn_t frame,
 
     ol1e = *pl1e;
     if ( UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, curr, 0) )
+    {
+        /*
+         * We always create mappings in this path.  However, our caller,
+         * map_grant_ref(), only passes potentially non-zero cache_flags for
+         * MMIO frames, so this path doesn't create non-coherent mappings of
+         * RAM frames and there's no need to calculate PGT_non_coherent.
+         */
+        ASSERT(!cache_flags || is_iomem_page(frame));
+
         rc = GNTST_okay;
+    }
 
  out_unlock:
     page_unlock(page);
@@ -294,7 +304,18 @@ int replace_grant_pv_mapping(uint64_t addr, mfn_t frame,
                  l1e_get_flags(ol1e), addr, grant_pte_flags);
 
     if ( UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, curr, 0) )
+    {
+        /*
+         * Generally, replace_grant_pv_mapping() is used to destroy mappings
+         * (nl1e = l1e_empty()), but it can be a present mapping on the
+         * GNTABOP_unmap_and_replace path.
+         *
+         * In such cases, the PTE is fully transplanted from its old location
+         * via steal_linear_address(), so we need not perform PGT_non_coherent
+         * checking here.
+         */
         rc = GNTST_okay;
+    }
 
  out_unlock:
     page_unlock(page);
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index f5b8862b8374..5c19b71eca70 100644
--- xen/include/asm-x86/mm.h.orig
+++ xen/include/asm-x86/mm.h
@@ -53,8 +53,12 @@
 #define _PGT_partial      PG_shift(8)
 #define PGT_partial       PG_mask(1, 8)
 
+/* Has this page been mapped writeable with a non-coherent memory type? */
+#define _PGT_non_coherent PG_shift(9)
+#define PGT_non_coherent  PG_mask(1, 9)
+
  /* Count of uses of this frame as its current type. */
-#define PGT_count_width   PG_shift(8)
+#define PGT_count_width   PG_shift(9)
 #define PGT_count_mask    ((1UL<<PGT_count_width)-1)
 
 /* Are the 'type mask' bits identical? */

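Note on the hunk above: the PG_shift()/PG_mask() helpers carve metadata bits
downward from the top of the type_info word, so adding a ninth bit moves the
count field down by one and halves the maximum type refcount.  Assuming the
usual definitions (BITS_PER_LONG == 64 on x86_64), the resulting layout is
roughly:

/* Sketch of the bit arithmetic; PG_mask() is simplified from Xen's
 * token-pasting form. */
#define PG_shift(idx)    (64 - (idx))
#define PG_mask(x, idx)  ((unsigned long)(x) << PG_shift(idx))

#define PGT_partial      PG_mask(1, 8)   /* bit 56 */
#define PGT_non_coherent PG_mask(1, 9)   /* bit 55, new */
#define PGT_count_width  PG_shift(9)     /* 55: count loses its top bit */
#define PGT_count_mask   ((1UL << PGT_count_width) - 1)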
File Added: pkgsrc/sysutils/xenkernel415/patches/Attic/patch-XSA404
$NetBSD: patch-XSA404,v 1.1 2022/06/24 13:07:52 bouyer Exp $

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Make VERW flushing runtime conditional

Currently, VERW flushing to mitigate MDS is boot time conditional per domain
type.  However, to provide mitigations for DRPW (CVE-2022-21166), we need to
conditionally use VERW based on the trustworthiness of the guest, and the
devices passed through.

Remove the PV/HVM alternatives and instead issue a VERW on the return-to-guest
path depending on the SCF_verw bit in cpuinfo spec_ctrl_flags.

Introduce spec_ctrl_init_domain() and d->arch.verw to calculate the VERW
disposition at domain creation time, and context switch the SCF_verw bit.

For now, VERW flushing is used and controlled exactly as before, but later
patches will add per-domain cases too.

No change in behaviour.

This is part of XSA-404.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

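In C terms, the new conditional VERW (DO_SPEC_CTRL_COND_VERW in the hunks
below) behaves roughly like this sketch.  Struct and field names mirror the
patch, but this is an illustration, not the hypervisor's actual
implementation:

/* Illustration only: rough C equivalent of DO_SPEC_CTRL_COND_VERW. */
struct cpu_info_sketch {
    unsigned int verw_sel;          /* selector operand for VERW (low 16 bits) */
    unsigned int spec_ctrl_flags;   /* holds the SCF_* bits */
};

#define SCF_verw (1 << 3)

static inline void cond_verw(const struct cpu_info_sketch *info)
{
    if ( info->spec_ctrl_flags & SCF_verw )
        __asm__ volatile ( "verw %0" :: "m" (info->verw_sel) );
}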
diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 1cab26fef61f..e4c820e17053 100644
--- docs/misc/xen-command-line.pandoc.orig
+++ docs/misc/xen-command-line.pandoc
@@ -2194,9 +2194,8 @@ in place for guests to use.
 Use of a positive boolean value for either of these options is invalid.
 
 The booleans `pv=`, `hvm=`, `msr-sc=`, `rsb=` and `md-clear=` offer fine
-grained control over the alternative blocks used by Xen.  These impact Xen's
-ability to protect itself, and Xen's ability to virtualise support for guests
-to use.
+grained control over the primitives used by Xen.  These impact Xen's ability
+to protect itself, and Xen's ability to virtualise support for guests to use.
 
 * `pv=` and `hvm=` offer control over all suboptions for PV and HVM guests
   respectively.
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index b21272988006..4a61e951facf 100644
--- xen/arch/x86/domain.c.orig
+++ xen/arch/x86/domain.c
@@ -861,6 +861,8 @@ int arch_domain_create(struct domain *d,
 
     d->arch.msr_relaxed = config->arch.misc_flags & XEN_X86_MSR_RELAXED;
 
+    spec_ctrl_init_domain(d);
+
     return 0;
 
  fail:
@@ -1994,14 +1996,15 @@ static void __context_switch(void)
 void context_switch(struct vcpu *prev, struct vcpu *next)
 {
     unsigned int cpu = smp_processor_id();
+    struct cpu_info *info = get_cpu_info();
     const struct domain *prevd = prev->domain, *nextd = next->domain;
     unsigned int dirty_cpu = read_atomic(&next->dirty_cpu);
 
     ASSERT(prev != next);
     ASSERT(local_irq_is_enabled());
 
-    get_cpu_info()->use_pv_cr3 = false;
-    get_cpu_info()->xen_cr3 = 0;
+    info->use_pv_cr3 = false;
+    info->xen_cr3 = 0;
 
     if ( unlikely(dirty_cpu != cpu) && dirty_cpu != VCPU_CPU_CLEAN )
     {
@@ -2065,6 +2068,11 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
                 *last_id = next_id;
             }
         }
+
+        /* Update the top-of-stack block with the VERW disposition. */
+        info->spec_ctrl_flags &= ~SCF_verw;
+        if ( nextd->arch.verw )
+            info->spec_ctrl_flags |= SCF_verw;
     }
 
     sched_context_switched(prev, next);
diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S
index 49651f3c435a..5f5de45a1309 100644
--- xen/arch/x86/hvm/vmx/entry.S.orig
+++ xen/arch/x86/hvm/vmx/entry.S
@@ -87,7 +87,7 @@ UNLIKELY_END(realmode)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
         /* SPEC_CTRL_EXIT_TO_VMX   Req: %rsp=regs/cpuinfo              Clob:    */
-        ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)), X86_FEATURE_SC_VERW_HVM
+        DO_SPEC_CTRL_COND_VERW
 
         mov  VCPU_hvm_guest_cr2(%rbx),%rax
 
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 1e226102d399..b4efc940aa2b 100644
--- xen/arch/x86/spec_ctrl.c.orig
+++ xen/arch/x86/spec_ctrl.c
@@ -36,8 +36,8 @@ static bool __initdata opt_msr_sc_pv = true;
 static bool __initdata opt_msr_sc_hvm = true;
 static bool __initdata opt_rsb_pv = true;
 static bool __initdata opt_rsb_hvm = true;
-static int8_t __initdata opt_md_clear_pv = -1;
-static int8_t __initdata opt_md_clear_hvm = -1;
+static int8_t __read_mostly opt_md_clear_pv = -1;
+static int8_t __read_mostly opt_md_clear_hvm = -1;
 
 /* Cmdline controls for Xen's speculative settings. */
 static enum ind_thunk {
@@ -903,6 +903,13 @@ static __init void mds_calculations(uint64_t caps)
     }
 }
 
+void spec_ctrl_init_domain(struct domain *d)
+{
+    bool pv = is_pv_domain(d);
+
+    d->arch.verw = pv ? opt_md_clear_pv : opt_md_clear_hvm;
+}
+
 void __init init_speculation_mitigations(void)
 {
     enum ind_thunk thunk = THUNK_DEFAULT;
@@ -1148,21 +1155,20 @@ void __init init_speculation_mitigations(void)
                             boot_cpu_has(X86_FEATURE_MD_CLEAR));
 
     /*
-     * Enable MDS defences as applicable.  The PV blocks need using all the
-     * time, and the Idle blocks need using if either PV or HVM defences are
-     * used.
+     * Enable MDS defences as applicable.  The Idle blocks need using if
+     * either PV or HVM defences are used.
      *
      * HVM is more complicated.  The MD_CLEAR microcode extends L1D_FLUSH with
-     * equivelent semantics to avoid needing to perform both flushes on the
-     * HVM path.  The HVM blocks don't need activating if our hypervisor told
-     * us it was handling L1D_FLUSH, or we are using L1D_FLUSH ourselves.
+     * equivalent semantics to avoid needing to perform both flushes on the
+     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH.
+     *
+     * After calculating the appropriate idle setting, simplify
+     * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
+     * guests", so spec_ctrl_init_domain() can calculate suitable settings.
      */
-    if ( opt_md_clear_pv )
-        setup_force_cpu_cap(X86_FEATURE_SC_VERW_PV);
     if ( opt_md_clear_pv || opt_md_clear_hvm )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
-    if ( opt_md_clear_hvm && !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush )
-        setup_force_cpu_cap(X86_FEATURE_SC_VERW_HVM);
+    opt_md_clear_hvm &= !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush;
 
     /*
      * Warn the user if they are on MLPDS/MFBDS-vulnerable hardware with HT
diff --git a/xen/include/asm-x86/cpufeatures.h b/xen/include/asm-x86/cpufeatures.h
index 09f619459bc7..9eaab7a2a1fa 100644
--- xen/include/asm-x86/cpufeatures.h.orig	2022-06-23 19:50:27.080499703 +0200
+++ xen/include/asm-x86/cpufeatures.h	2022-06-23 19:51:20.975755594 +0200
@@ -35,8 +35,7 @@
 XEN_CPUFEATURE(XEN_SELFSNOOP,     X86_SYNTH(20)) /* SELFSNOOP gets used by Xen itself */
 XEN_CPUFEATURE(SC_MSR_IDLE,       X86_SYNTH(21)) /* (SC_MSR_PV || SC_MSR_HVM) && default_xen_spec_ctrl */
 XEN_CPUFEATURE(XEN_LBR,           X86_SYNTH(22)) /* Xen uses MSR_DEBUGCTL.LBR */
-XEN_CPUFEATURE(SC_VERW_PV,        X86_SYNTH(23)) /* VERW used by Xen for PV */
-XEN_CPUFEATURE(SC_VERW_HVM,       X86_SYNTH(24)) /* VERW used by Xen for HVM */
+/* Bits 23,24 unused. */
 XEN_CPUFEATURE(SC_VERW_IDLE,      X86_SYNTH(25)) /* VERW used by Xen for idle */
 XEN_CPUFEATURE(XEN_SHSTK,         X86_SYNTH(26)) /* Xen uses CET Shadow Stacks */
 
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 7213d184b016..d0df7f83aa0c 100644
--- xen/include/asm-x86/domain.h.orig
+++ xen/include/asm-x86/domain.h
@@ -319,6 +319,9 @@ struct arch_domain
     uint32_t pci_cf8;
     uint8_t cmos_idx;
 
+    /* Use VERW on return-to-guest for its flushing side effect. */
+    bool verw;
+
     union {
         struct pv_domain pv;
         struct hvm_domain hvm;
diff --git a/xen/include/asm-x86/spec_ctrl.h b/xen/include/asm-x86/spec_ctrl.h
index 9caecddfec96..68f6c46c470c 100644
--- xen/include/asm-x86/spec_ctrl.h.orig
+++ xen/include/asm-x86/spec_ctrl.h
@@ -24,6 +24,7 @@
 #define SCF_use_shadow (1 << 0)
 #define SCF_ist_wrmsr  (1 << 1)
 #define SCF_ist_rsb    (1 << 2)
+#define SCF_verw       (1 << 3)
 
 #ifndef __ASSEMBLY__
 
@@ -32,6 +33,7 @@
 #include <asm/msr-index.h>
 
 void init_speculation_mitigations(void);
+void spec_ctrl_init_domain(struct domain *d);
 
 extern bool opt_ibpb;
 extern bool opt_ssbd;
diff --git a/xen/include/asm-x86/spec_ctrl_asm.h b/xen/include/asm-x86/spec_ctrl_asm.h
index 02b3b18ce69f..5a590bac44aa 100644
--- xen/include/asm-x86/spec_ctrl_asm.h.orig
+++ xen/include/asm-x86/spec_ctrl_asm.h
@@ -136,6 +136,19 @@
 #endif
 .endm
 
+.macro DO_SPEC_CTRL_COND_VERW
+/*
+ * Requires %rsp=cpuinfo
+ *
+ * Issue a VERW for its flushing side effect, if indicated.  This is a Spectre
+ * v1 gadget, but the IRET/VMEntry is serialising.
+ */
+    testb $SCF_verw, CPUINFO_spec_ctrl_flags(%rsp)
+    jz .L\@_verw_skip
+    verw CPUINFO_verw_sel(%rsp)
+.L\@_verw_skip:
+.endm
+
 .macro DO_SPEC_CTRL_ENTRY maybexen:req
 /*
  * Requires %rsp=regs (also cpuinfo if !maybexen)
@@ -231,8 +244,7 @@
 #define SPEC_CTRL_EXIT_TO_PV                                            \
     ALTERNATIVE "",                                                     \
         DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_PV;              \
-    ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)),           \
-        X86_FEATURE_SC_VERW_PV
+    DO_SPEC_CTRL_COND_VERW
 
 /*
  * Use in IST interrupt/exception context.  May interrupt Xen or PV context.
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Enumeration for MMIO Stale Data controls

The three *_NO bits indicate non-susceptibility to the SSDP, FBSDP and PSDP
data movement primitives.

FB_CLEAR indicates that the VERW instruction has re-gained its Fill Buffer
flushing side effect.  This is only enumerated on parts where VERW had
previously lost its flushing side effect due to the MDS/TAA vulnerabilities
being fixed in hardware.

FB_CLEAR_CTRL is available on a subset of FB_CLEAR parts where the Fill Buffer
clearing side effect of VERW can be turned off for performance reasons.

This is part of XSA-404.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

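All of these are bits in MSR_ARCH_CAPABILITIES (0x10a).  As a hedged sketch of
how such enumerations are typically consumed, using the existing Xen helpers
cpu_has_arch_caps/rdmsrl and the bit names defined in the msr-index.h hunk
below:

/* Fragment for illustration; not a hunk from this patch. */
uint64_t caps = 0;

if ( cpu_has_arch_caps )
    rdmsrl(MSR_ARCH_CAPABILITIES, caps);

if ( caps & ARCH_CAPS_FB_CLEAR )
{
    /* VERW has regained its fill-buffer flush on this part. */
}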
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index b4efc940aa2b..38e0cc2847e0 100644
--- xen/arch/x86/spec_ctrl.c.orig
+++ xen/arch/x86/spec_ctrl.c
@@ -323,7 +323,7 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
      * Hardware read-only information, stating immunity to certain issues, or
      * suggestions of which mitigation to use.
      */
-    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (caps & ARCH_CAPS_RDCL_NO)                        ? " RDCL_NO"        : "",
            (caps & ARCH_CAPS_IBRS_ALL)                       ? " IBRS_ALL"       : "",
            (caps & ARCH_CAPS_RSBA)                           ? " RSBA"           : "",
@@ -332,13 +332,16 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
            (caps & ARCH_CAPS_SSB_NO)                         ? " SSB_NO"         : "",
            (caps & ARCH_CAPS_MDS_NO)                         ? " MDS_NO"         : "",
            (caps & ARCH_CAPS_TAA_NO)                         ? " TAA_NO"         : "",
+           (caps & ARCH_CAPS_SBDR_SSDP_NO)                   ? " SBDR_SSDP_NO"   : "",
+           (caps & ARCH_CAPS_FBSDP_NO)                       ? " FBSDP_NO"       : "",
+           (caps & ARCH_CAPS_PSDP_NO)                        ? " PSDP_NO"        : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_ALWAYS))    ? " IBRS_ALWAYS"    : "",
            (e8b  & cpufeat_mask(X86_FEATURE_STIBP_ALWAYS))   ? " STIBP_ALWAYS"   : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_FAST))      ? " IBRS_FAST"      : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_SAME_MODE)) ? " IBRS_SAME_MODE" : "");
 
     /* Hardware features which need driving to mitigate issues. */
-    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (e8b  & cpufeat_mask(X86_FEATURE_IBPB)) ||
            (_7d0 & cpufeat_mask(X86_FEATURE_IBRSB))          ? " IBPB"           : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS)) ||
@@ -353,7 +356,9 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
            (_7d0 & cpufeat_mask(X86_FEATURE_MD_CLEAR))       ? " MD_CLEAR"       : "",
            (_7d0 & cpufeat_mask(X86_FEATURE_SRBDS_CTRL))     ? " SRBDS_CTRL"     : "",
            (e8b  & cpufeat_mask(X86_FEATURE_VIRT_SSBD))      ? " VIRT_SSBD"      : "",
-           (caps & ARCH_CAPS_TSX_CTRL)                       ? " TSX_CTRL"       : "");
+           (caps & ARCH_CAPS_TSX_CTRL)                       ? " TSX_CTRL"       : "",
+           (caps & ARCH_CAPS_FB_CLEAR)                       ? " FB_CLEAR"       : "",
+           (caps & ARCH_CAPS_FB_CLEAR_CTRL)                  ? " FB_CLEAR_CTRL"  : "");
 
     /* Compiled-in support which pertains to mitigations. */
     if ( IS_ENABLED(CONFIG_INDIRECT_THUNK) || IS_ENABLED(CONFIG_SHADOW_PAGING) )
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index 947778105fb6..1e743461e91d 100644
--- xen/include/asm-x86/msr-index.h.orig
+++ xen/include/asm-x86/msr-index.h
@@ -59,6 +59,11 @@
 #define  ARCH_CAPS_IF_PSCHANGE_MC_NO        (_AC(1, ULL) <<  6)
 #define  ARCH_CAPS_TSX_CTRL                 (_AC(1, ULL) <<  7)
 #define  ARCH_CAPS_TAA_NO                   (_AC(1, ULL) <<  8)
+#define  ARCH_CAPS_SBDR_SSDP_NO             (_AC(1, ULL) << 13)
+#define  ARCH_CAPS_FBSDP_NO                 (_AC(1, ULL) << 14)
+#define  ARCH_CAPS_PSDP_NO                  (_AC(1, ULL) << 15)
+#define  ARCH_CAPS_FB_CLEAR                 (_AC(1, ULL) << 17)
+#define  ARCH_CAPS_FB_CLEAR_CTRL            (_AC(1, ULL) << 18)
 
 #define MSR_FLUSH_CMD                       0x0000010b
 #define  FLUSH_CMD_L1D                      (_AC(1, ULL) <<  0)
@@ -76,4 +81,5 @@
 #define  MCU_OPT_CTRL_RNGDS_MITG_DIS        (_AC(1, ULL) <<  0)
+#define  MCU_OPT_CTRL_FB_CLEAR_DIS          (_AC(1, ULL) <<  3)
 
 #define MSR_RTIT_OUTPUT_BASE                0x00000560
 #define MSR_RTIT_OUTPUT_MASK                0x00000561
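With the enumeration and print_details() changes applied, the new bits appear
in Xen's boot-time speculation summary (viewable with `xl dmesg`).
Illustrative output only; the exact set depends on the part and microcode:

    (XEN)   Hardware hints: RDCL_NO IBRS_ALL MDS_NO TAA_NO PSDP_NO
    (XEN)   Hardware features: IBPB IBRS STIBP SSBD L1D_FLUSH MD_CLEAR SRBDS_CTRL TSX_CTRL FB_CLEAR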
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Add spec-ctrl=unpriv-mmio

Per Xen's support statement, PCI passthrough should be to trusted domains
because the overall system security depends on factors outside of Xen's
control.

As such, Xen, in a supported configuration, is not vulnerable to DRPW/SBDR.

However, users who have risk assessed their configuration may be happy with
the risk of DoS, but unhappy with the risk of cross-domain data leakage.  Such
users should enable this option.

On CPUs vulnerable to MDS, the existing mitigations are the best we can do to
mitigate MMIO cross-domain data leakage.

On CPUs fixed against MDS but still vulnerable to MMIO stale data leakage,
this option:

 * On CPUs susceptible to FBSDP, mitigates cross-domain fill buffer leakage
   using FB_CLEAR.
 * On CPUs susceptible to SBDR, mitigates RNG data recovery by engaging the
   srb-lock, previously used to mitigate SRBDS.

Both mitigations require microcode from IPU 2022.1, May 2022.

This is part of XSA-404.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
Backporting note: For Xen 4.7 and earlier with bool_t not aliasing bool, the
ARCH_CAPS_FB_CLEAR hunk needs !!

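As a usage illustration (hypothetical values; adapt kernel/hypervisor paths
and memory sizing to the local setup), a NetBSD dom0 whose guests get PCI
passthrough would enable the new option from boot.cfg, e.g.:

    menu=Xen:load /netbsd-XEN3_DOM0 console=pc;multiboot /xen.gz dom0_mem=1024M spec-ctrl=unpriv-mmio

Only the `spec-ctrl=unpriv-mmio` token comes from this patch; the rest of the
line is illustrative.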
diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index e4c820e17053..e17a835ed254 100644
--- docs/misc/xen-command-line.pandoc.orig
+++ docs/misc/xen-command-line.pandoc
@@ -2171,7 +2171,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
 ### spec-ctrl (x86)
 > `= List of [ <bool>, xen=<bool>, {pv,hvm,msr-sc,rsb,md-clear}=<bool>,
 >              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,eager-fpu,
->              l1d-flush,branch-harden,srb-lock}=<bool> ]`
+>              l1d-flush,branch-harden,srb-lock,unpriv-mmio}=<bool> ]`
 
 Controls for speculative execution sidechannel mitigations.  By default, Xen
 will pick the most appropriate mitigations based on compiled in support,
@@ -2250,8 +2250,16 @@ Xen will enable this mitigation.
 On hardware supporting SRBDS_CTRL, the `srb-lock=` option can be used to force
or prevent Xen from protecting the Special Register Buffer from leaking stale
 data. By default, Xen will enable this mitigation, except on parts where MDS
-is fixed and TAA is fixed/mitigated (in which case, there is believed to be no
-way for an attacker to obtain the stale data).
+is fixed and TAA is fixed/mitigated and there are no unprivileged MMIO
+mappings (in which case, there is believed to be no way for an attacker to
+obtain stale data).
+
+The `unpriv-mmio=` boolean indicates whether the system has (or will have)
+less than fully privileged domains granted access to MMIO devices.  By
+default, this option is disabled.  If enabled, Xen will use the `FB_CLEAR`
+and/or `SRBDS_CTRL` functionality available in the Intel May 2022 microcode
+release to mitigate cross-domain leakage of data via the MMIO Stale Data
+vulnerabilities.
 
 ### sync_console
 > `= <boolean>`
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 38e0cc2847e0..83b856fa9158 100644
--- xen/arch/x86/spec_ctrl.c.orig
+++ xen/arch/x86/spec_ctrl.c
@@ -67,7 +67,9 @@ static bool __initdata cpu_has_bug_msbds_only; /* => minimal HT impact. */
 static bool __initdata cpu_has_bug_mds; /* Any other M{LP,SB,FB}DS combination. */
 
 static int8_t __initdata opt_srb_lock = -1;
 uint64_t __read_mostly default_xen_mcu_opt_ctrl;
+static bool __initdata opt_unpriv_mmio;
+static bool __read_mostly opt_fb_clear_mmio;
 
 static int __init parse_spec_ctrl(const char *s)
 {
@@ -184,6 +186,8 @@ static int __init parse_spec_ctrl(const char *s)
             opt_branch_harden = val;
         else if ( (val = parse_boolean("srb-lock", s, ss)) >= 0 )
             opt_srb_lock = val;
+        else if ( (val = parse_boolean("unpriv-mmio", s, ss)) >= 0 )
+            opt_unpriv_mmio = val;
         else
             rc = -EINVAL;
 
@@ -392,7 +396,8 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
            opt_srb_lock                              ? " SRB_LOCK+" : " SRB_LOCK-",
            opt_ibpb                                  ? " IBPB"  : "",
            opt_l1d_flush                             ? " L1D_FLUSH" : "",
-           opt_md_clear_pv || opt_md_clear_hvm       ? " VERW"  : "",
+           opt_md_clear_pv || opt_md_clear_hvm ||
+           opt_fb_clear_mmio                         ? " VERW"  : "",
            opt_branch_harden                         ? " BRANCH_HARDEN" : "");
 
     /* L1TF diagnostics, printed if vulnerable or PV shadowing is in use. */
@@ -912,7 +917,9 @@ void spec_ctrl_init_domain(struct domain *d)
 {
     bool pv = is_pv_domain(d);
 
-    d->arch.verw = pv ? opt_md_clear_pv : opt_md_clear_hvm;
+    d->arch.verw =
+        (pv ? opt_md_clear_pv : opt_md_clear_hvm) ||
+        (opt_fb_clear_mmio && is_iommu_enabled(d));
 }
 
 void __init init_speculation_mitigations(void)
@@ -1148,6 +1155,18 @@ void __init init_speculation_mitigations(void)
     mds_calculations(caps);
 
     /*
+     * Parts which enumerate FB_CLEAR are those which are post-MDS_NO and have
+     * reintroduced the VERW fill buffer flushing side effect because of a
+     * susceptibility to FBSDP.
+     *
+     * If unprivileged guests have (or will have) MMIO mappings, we can
+     * mitigate cross-domain leakage of fill buffer data by issuing VERW on
+     * the return-to-guest path.
+     */
+    if ( opt_unpriv_mmio )
+        opt_fb_clear_mmio = caps & ARCH_CAPS_FB_CLEAR;
+
+    /*
      * By default, enable PV and HVM mitigations on MDS-vulnerable hardware.
      * This will only be a token effort for MLPDS/MFBDS when HT is enabled,
      * but it is somewhat better than nothing.
@@ -1160,18 +1179,20 @@ void __init init_speculation_mitigations(void)
                             boot_cpu_has(X86_FEATURE_MD_CLEAR));
 
     /*
-     * Enable MDS defences as applicable.  The Idle blocks need using if
-     * either PV or HVM defences are used.
+     * Enable MDS/MMIO defences as applicable.  The Idle blocks need using if
+     * either the PV or HVM MDS defences are used, or if we may give MMIO
+     * access to untrusted guests.
      *
      * HVM is more complicated.  The MD_CLEAR microcode extends L1D_FLUSH with
      * equivalent semantics to avoid needing to perform both flushes on the
-     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH.
+     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH (for
+     * MDS mitigations; L1D_FLUSH is not safe for MMIO mitigations).
      *
      * After calculating the appropriate idle setting, simplify
      * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
      * guests", so spec_ctrl_init_domain() can calculate suitable settings.
      */
-    if ( opt_md_clear_pv || opt_md_clear_hvm )
+    if ( opt_md_clear_pv || opt_md_clear_hvm || opt_fb_clear_mmio )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
     opt_md_clear_hvm &= !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush;
 
@@ -1213,12 +1234,16 @@
          * On some SRBDS-affected hardware, it may be safe to relax srb-lock
          * by default.
          *
-         * On parts which enumerate MDS_NO and not TAA_NO, TSX is the only way
-         * to access the Fill Buffer.  If TSX isn't available (inc. SKU
-         * reasons on some models), or TSX is explicitly disabled, then there
-         * is no need for the extra overhead to protect RDRAND/RDSEED.
+         * All parts with SRBDS_CTRL suffer SSDP, the mechanism by which stale
+         * RNG data becomes available to other contexts.  To recover the data,
+         * an attacker needs to use:
+         *  - SBDS (MDS or TAA to sample the core's fill buffer)
+         *  - SBDR (Architecturally retrieve stale transaction buffer contents)
+         *  - DRPW (Architecturally latch stale fill buffer data)
+         *
+         * On MDS_NO parts, with TAA_NO or TSX unavailable/disabled, and with
+         * no unprivileged MMIO access, the RNG data doesn't need protecting.
          */
-        if ( opt_srb_lock == -1 &&
+        if ( opt_srb_lock == -1 && !opt_unpriv_mmio &&
              (caps & (ARCH_CAPS_MDS_NO|ARCH_CAPS_TAA_NO)) == ARCH_CAPS_MDS_NO &&
              (!cpu_has_hle || ((caps & ARCH_CAPS_TSX_CTRL) && rtm_disabled)) )
             opt_srb_lock = 0;