summaryrefslogtreecommitdiff
path: root/arch/x86/kernel
AgeCommit message (Collapse)AuthorFilesLines
2021-01-27x86/hyperv: Fix kexec panic/hang issuesDexuan Cui1-0/+18
[ Upstream commit dfe94d4086e40e92b1926bddcefa629b791e9b28 ] Currently the kexec kernel can panic or hang due to 2 causes: 1) hv_cpu_die() is not called upon kexec, so the hypervisor corrupts the old VP Assist Pages when the kexec kernel runs. The same issue is fixed for hibernation in commit 421f090c819d ("x86/hyperv: Suspend/resume the VP assist page for hibernation"). Now fix it for kexec. 2) hyperv_cleanup() is called too early. In the kexec path, the other CPUs are stopped in hv_machine_shutdown() -> native_machine_shutdown(), so between hv_kexec_handler() and native_machine_shutdown(), the other CPUs can still try to access the hypercall page and cause panic. The workaround "hv_hypercall_pg = NULL;" in hyperv_cleanup() is unreliabe. Move hyperv_cleanup() to a better place. Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20201222065541.24312-1-decui@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-19x86/sev-es: Fix SEV-ES OUT/IN immediate opcode vc handlingPeter Gonda1-2/+2
[ Upstream commit a8f7e08a81708920a928664a865208fdf451c49f ] The IN and OUT instructions with port address as an immediate operand only use an 8-bit immediate (imm8). The current VC handler uses the entire 32-bit immediate value but these instructions only set the first bytes. Cast the operand to an u8 for that. [ bp: Massage commit message. ] Fixes: 25189d08e5168 ("x86/sev-es: Add support for handling IOIO exceptions") Signed-off-by: Peter Gonda <pgonda@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: David Rientjes <rientjes@google.com> Link: https://lkml.kernel.org/r/20210105163311.221490-1-pgonda@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-12x86/mtrr: Correct the range check before performing MTRR type lookupsYing-Tsun Huang1-3/+3
commit cb7f4a8b1fb426a175d1708f05581939c61329d4 upstream. In mtrr_type_lookup(), if the input memory address region is not in the MTRR, over 4GB, and not over the top of memory, a write-back attribute is returned. These condition checks are for ensuring the input memory address region is actually mapped to the physical memory. However, if the end address is just aligned with the top of memory, the condition check treats the address is over the top of memory, and write-back attribute is not returned. And this hits in a real use case with NVDIMM: the nd_pmem module tries to map NVDIMMs as cacheable memories when NVDIMMs are connected. If a NVDIMM is the last of the DIMMs, the performance of this NVDIMM becomes very low since it is aligned with the top of memory and its memory type is uncached-minus. Move the input end address change to inclusive up into mtrr_type_lookup(), before checking for the top of memory in either mtrr_type_lookup_{variable,fixed}() helpers. [ bp: Massage commit message. ] Fixes: 0cc705f56e40 ("x86/mm/mtrr: Clean up mtrr_type_lookup()") Signed-off-by: Ying-Tsun Huang <ying-tsun.huang@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201215070721.4349-1-ying-tsun.huang@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12x86/resctrl: Don't move a task to the same resource groupFenghua Yu1-0/+7
commit a0195f314a25582b38993bf30db11c300f4f4611 upstream. Shakeel Butt reported in [1] that a user can request a task to be moved to a resource group even if the task is already in the group. It just wastes time to do the move operation which could be costly to send IPI to a different CPU. Add a sanity check to ensure that the move operation only happens when the task is not already in the resource group. [1] https://lore.kernel.org/lkml/CALvZod7E9zzHwenzf7objzGKsdBmVwTgEJ0nPgs0LUFU3SN5Pw@mail.gmail.com/ Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files") Reported-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/962ede65d8e95be793cb61102cca37f7bb018e66.1608243147.git.reinette.chatre@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSRFenghua Yu1-69/+43
commit ae28d1aae48a1258bd09a6f707ebb4231d79a761 upstream. Currently, when moving a task to a resource group the PQR_ASSOC MSR is updated with the new closid and rmid in an added task callback. If the task is running, the work is run as soon as possible. If the task is not running, the work is executed later in the kernel exit path when the kernel returns to the task again. Updating the PQR_ASSOC MSR as soon as possible on the CPU a moved task is running is the right thing to do. Queueing work for a task that is not running is unnecessary (the PQR_ASSOC MSR is already updated when the task is scheduled in) and causing system resource waste with the way in which it is implemented: Work to update the PQR_ASSOC register is queued every time the user writes a task id to the "tasks" file, even if the task already belongs to the resource group. This could result in multiple pending work items associated with a single task even if they are all identical and even though only a single update with most recent values is needed. Specifically, even if a task is moved between different resource groups while it is sleeping then it is only the last move that is relevant but yet a work item is queued during each move. This unnecessary queueing of work items could result in significant system resource waste, especially on tasks sleeping for a long time. For example, as demonstrated by Shakeel Butt in [1] writing the same task id to the "tasks" file can quickly consume significant memory. The same problem (wasted system resources) occurs when moving a task between different resource groups. As pointed out by Valentin Schneider in [2] there is an additional issue with the way in which the queueing of work is done in that the task_struct update is currently done after the work is queued, resulting in a race with the register update possibly done before the data needed by the update is available. To solve these issues, update the PQR_ASSOC MSR in a synchronous way right after the new closid and rmid are ready during the task movement, only if the task is running. If a moved task is not running nothing is done since the PQR_ASSOC MSR will be updated next time the task is scheduled. This is the same way used to update the register when tasks are moved as part of resource group removal. [1] https://lore.kernel.org/lkml/CALvZod7E9zzHwenzf7objzGKsdBmVwTgEJ0nPgs0LUFU3SN5Pw@mail.gmail.com/ [2] https://lore.kernel.org/lkml/20201123022433.17905-1-valentin.schneider@arm.com [ bp: Massage commit message and drop the two update_task_closid_rmid() variants. ] Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files") Reported-by: Shakeel Butt <shakeelb@google.com> Reported-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: James Morse <james.morse@arm.com> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/17aa2fb38fc12ce7bb710106b3e7c7b45acb9e94.1608243147.git.reinette.chatre@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30x86/CPU/AMD: Save AMD NodeId as cpu_die_idYazen Ghannam3-15/+13
[ Upstream commit 028c221ed1904af9ac3c5162ee98f48966de6b3d ] AMD systems provide a "NodeId" value that represents a global ID indicating to which "Node" a logical CPU belongs. The "Node" is a physical structure equivalent to a Die, and it should not be confused with logical structures like NUMA nodes. Logical nodes can be adjusted based on firmware or other settings whereas the physical nodes/dies are fixed based on hardware topology. The NodeId value can be used when a physical ID is needed by software. Save the AMD NodeId to struct cpuinfo_x86.cpu_die_id. Use the value from CPUID or MSR as appropriate. Default to phys_proc_id otherwise. Do so for both AMD and Hygon systems. Drop the node_id parameter from cacheinfo_*_init_llc_id() as it is no longer needed. Update the x86 topology documentation. Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201109210659.754018-2-Yazen.Ghannam@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30mm/gup: prevent gup_fast from racing with COW during forkJason Gunthorpe1-0/+1
[ Upstream commit 57efa1fe5957694fa541c9062de0a127f0b9acb0 ] Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes") pages under a FOLL_PIN will not be write protected during COW for fork. This means that pages returned from pin_user_pages(FOLL_WRITE) should not become write protected while the pin is active. However, there is a small race where get_user_pages_fast(FOLL_PIN) can establish a FOLL_PIN at the same time copy_present_page() is write protecting it: CPU 0 CPU 1 get_user_pages_fast() internal_get_user_pages_fast() copy_page_range() pte_alloc_map_lock() copy_present_page() atomic_read(has_pinned) == 0 page_maybe_dma_pinned() == false atomic_set(has_pinned, 1); gup_pgd_range() gup_pte_range() pte_t pte = gup_get_pte(ptep) pte_access_permitted(pte) try_grab_compound_head() pte = pte_wrprotect(pte) set_pte_at(); pte_unmap_unlock() // GUP now returns with a write protected page The first attempt to resolve this by using the write protect caused problems (and was missing a barrrier), see commit f3c64eda3e50 ("mm: avoid early COW write protect games during fork()") Instead wrap copy_p4d_range() with the write side of a seqcount and check the read side around gup_pgd_range(). If there is a collision then get_user_pages_fast() fails and falls back to slow GUP. Slow GUP is safe against this race because copy_page_range() is only called while holding the exclusive side of the mmap_lock on the src mm_struct. [akpm@linux-foundation.org: coding style fixes] Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com Link: https://lkml.kernel.org/r/2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Peter Xu <peterx@redhat.com> Acked-by: "Ahmed S. Darwish" <a.darwish@linutronix.de> [seqcount_t parts] Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30x86/kprobes: Restore BTF if the single-stepping is cancelledMasami Hiramatsu1-0/+5
[ Upstream commit 78ff2733ff352175eb7f4418a34654346e1b6cd2 ] Fix to restore BTF if single-stepping causes a page fault and it is cancelled. Usually the BTF flag was restored when the single stepping is done (in resume_execution()). However, if a page fault happens on the single stepping instruction, the fault handler is invoked and the single stepping is cancelled. Thus, the BTF flag is not restored. Fixes: 1ecc798c6764 ("x86: debugctlmsr kprobes") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/160389546985.106936.12727996109376240993.stgit@devnote2 Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30x86/mce: Correct the detection of invalid notifier prioritiesZhen Lei1-1/+2
[ Upstream commit 15af36596ae305aefc8c502c2d3e8c58221709eb ] Commit c9c6d216ed28 ("x86/mce: Rename "first" function as "early"") changed the enumeration of MCE notifier priorities. Correct the check for notifier priorities to cover the new range. [ bp: Rewrite commit message, remove superfluous brackets in conditional. ] Fixes: c9c6d216ed28 ("x86/mce: Rename "first" function as "early"") Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201106141216.2062-2-thunder.leizhen@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30x86/apic: Fix x2apic enablement without interrupt remappingDavid Woodhouse2-6/+17
[ Upstream commit 26573a97746c7a99f394f9d398ce91a8853b3b89 ] Currently, Linux as a hypervisor guest will enable x2apic only if there are no CPUs present at boot time with an APIC ID above 255. Hotplugging a CPU later with a higher APIC ID would result in a CPU which cannot be targeted by external interrupts. Add a filter in x2apic_apic_id_valid() which can be used to prevent such CPUs from coming online, and allow x2apic to be enabled even if they are present at boot time. Fixes: ce69a784504 ("x86/apic: Enable x2APIC without interrupt remapping under KVM") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20201024213535.443185-2-dwmw2@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-26x86/split-lock: Avoid returning with interrupts enabledAndi Kleen1-1/+2
commit e14fd4ba8fb47fcf5f244366ec01ae94490cd86a upstream. When a split lock is detected always make sure to disable interrupts before returning from the trap handler. The kernel exit code assumes that all exits run with interrupts disabled, otherwise the SWAPGS sequence can race against interrupts and cause recursing page faults and later panics. The problem will only happen on CPUs with split lock disable functionality, so Icelake Server, Tiger Lake, Snow Ridge, Jacobsville. Fixes: ca4c6a9858c2 ("x86/traps: Make interrupt enable/disable symmetric in C code") Fixes: bce9b042ec73 ("x86/traps: Disable interrupts in exc_aligment_check()") # v5.8+ Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Tony Luck <tony.luck@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-12x86/kprobes: Fix optprobe to detect INT3 padding correctlyMasami Hiramatsu1-2/+20
Commit 7705dc855797 ("x86/vmlinux: Use INT3 instead of NOP for linker fill bytes") changed the padding bytes between functions from NOP to INT3. However, when optprobe decodes a target function it finds INT3 and gives up the jump optimization. Instead of giving up any INT3 detection, check whether the rest of the bytes to the end of the function are INT3. If all of them are INT3, those come from the linker. In that case, continue the optprobe jump optimization. [ bp: Massage commit message. ] Fixes: 7705dc855797 ("x86/vmlinux: Use INT3 instead of NOP for linker fill bytes") Reported-by: Adam Zabrocki <pi3@pi3.com.pl> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/160767025681.3880685.16021570341428835411.stgit@devnote2
2020-12-10x86/apic/vector: Fix ordering in vector assignmentThomas Gleixner1-10/+14
Prarit reported that depending on the affinity setting the ' irq $N: Affinity broken due to vector space exhaustion.' message is showing up in dmesg, but the vector space on the CPUs in the affinity mask is definitely not exhausted. Shung-Hsi provided traces and analysis which pinpoints the problem: The ordering of trying to assign an interrupt vector in assign_irq_vector_any_locked() is simply wrong if the interrupt data has a valid node assigned. It does: 1) Try the intersection of affinity mask and node mask 2) Try the node mask 3) Try the full affinity mask 4) Try the full online mask Obviously #2 and #3 are in the wrong order as the requested affinity mask has to take precedence. In the observed cases #1 failed because the affinity mask did not contain CPUs from node 0. That made it allocate a vector from node 0, thereby breaking affinity and emitting the misleading message. Revert the order of #2 and #3 so the full affinity mask without the node intersection is tried before actually affinity is broken. If no node is assigned then only the full affinity mask and if that fails the full online mask is tried. Fixes: d6ffc6ac83b1 ("x86/vector: Respect affinity mask in irq descriptor") Reported-by: Prarit Bhargava <prarit@redhat.com> Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/87ft4djtyp.fsf@nanos.tec.linutronix.de
2020-12-10x86/resctrl: Fix incorrect local bandwidth when mba_sc is enabledXiaochen Shen1-4/+2
The MBA software controller (mba_sc) is a feedback loop which periodically reads MBM counters and tries to restrict the bandwidth below a user-specified value. It tags along the MBM counter overflow handler to do the updates with 1s interval in mbm_update() and update_mba_bw(). The purpose of mbm_update() is to periodically read the MBM counters to make sure that the hardware counter doesn't wrap around more than once between user samplings. mbm_update() calls __mon_event_count() for local bandwidth updating when mba_sc is not enabled, but calls mbm_bw_count() instead when mba_sc is enabled. __mon_event_count() will not be called for local bandwidth updating in MBM counter overflow handler, but it is still called when reading MBM local bandwidth counter file 'mbm_local_bytes', the call path is as below: rdtgroup_mondata_show() mon_event_read() mon_event_count() __mon_event_count() In __mon_event_count(), m->chunks is updated by delta chunks which is calculated from previous MSR value (m->prev_msr) and current MSR value. When mba_sc is enabled, m->chunks is also updated in mbm_update() by mistake by the delta chunks which is calculated from m->prev_bw_msr instead of m->prev_msr. But m->chunks is not used in update_mba_bw() in the mba_sc feedback loop. When reading MBM local bandwidth counter file, m->chunks was changed unexpectedly by mbm_bw_count(). As a result, the incorrect local bandwidth counter which calculated from incorrect m->chunks is shown to the user. Fix this by removing incorrect m->chunks updating in mbm_bw_count() in MBM counter overflow handler, and always calling __mon_event_count() in mbm_update() to make sure that the hardware local bandwidth counter doesn't wrap around. Test steps: # Run workload with aggressive memory bandwidth (e.g., 10 GB/s) git clone https://github.com/intel/intel-cmt-cat && cd intel-cmt-cat && make ./tools/membw/membw -c 0 -b 10000 --read # Enable MBA software controller mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl # Create control group c1 mkdir /sys/fs/resctrl/c1 # Set MB throttle to 6 GB/s echo "MB:0=6000;1=6000" > /sys/fs/resctrl/c1/schemata # Write PID of the workload to tasks file echo `pidof membw` > /sys/fs/resctrl/c1/tasks # Read local bytes counters twice with 1s interval, the calculated # local bandwidth is not as expected (approaching to 6 GB/s): local_1=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes` sleep 1 local_2=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes` echo "local b/w (bytes/s):" `expr $local_2 - $local_1` Before fix: local b/w (bytes/s): 11076796416 After fix: local b/w (bytes/s): 5465014272 Fixes: ba0f26d8529c (x86/intel_rdt/mba_sc: Prepare for feedback loop) Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/1607063279-19437-1-git-send-email-xiaochen.shen@intel.com
2020-12-06Merge tag 'x86-urgent-2020-12-06' of ↵Linus Torvalds5-7/+21
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A set of fixes for x86: - Make the AMD L3 QoS code and data priorization enable/disable mechanism work correctly. The control bit was only set/cleared on one of the CPUs in a L3 domain, but it has to be modified on all CPUs in the domain. The initial documentation was not clear about this, but the updated one from Oct 2020 spells it out. - Fix an off by one in the UV platform detection code which causes the UV hubs to be identified wrongly. The chip revisions start at 1 not at 0. - Fix a long standing bug in the evaluation of prefixes in the uprobes code which fails to handle repeated prefixes properly. The aggregate size of the prefixes can be larger than the bytes array but the code blindly iterated over the aggregate size beyond the array boundary. Add a macro to handle this case properly and use it at the affected places" * tag 'x86-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sev-es: Use new for_each_insn_prefix() macro to loop over prefixes bytes x86/insn-eval: Use new for_each_insn_prefix() macro to loop over prefixes bytes x86/uprobes: Do not use prefixes.nbytes when looping over prefixes.bytes x86/platform/uv: Fix UV4 hub revision adjustment x86/resctrl: Fix AMD L3 QOS CDP enable/disable
2020-12-06x86/uprobes: Do not use prefixes.nbytes when looping over prefixes.bytesMasami Hiramatsu1-4/+6
Since insn.prefixes.nbytes can be bigger than the size of insn.prefixes.bytes[] when a prefix is repeated, the proper check must be insn.prefixes.bytes[i] != 0 and i < 4 instead of using insn.prefixes.nbytes. Introduce a for_each_insn_prefix() macro for this purpose. Debugged by Kees Cook <keescook@chromium.org>. [ bp: Massage commit message, sync with the respective header in tools/ and drop "we". ] Fixes: 2b1444983508 ("uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints") Reported-by: syzbot+9b64b619f10f19d19a7c@syzkaller.appspotmail.com Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/160697103739.3146288.7437620795200799020.stgit@devnote2
2020-12-03x86/platform/uv: Fix UV4 hub revision adjustmentMike Travis1-1/+1
Currently, UV4 is incorrectly identified as UV4A and UV4A as UV5. Hub chip starts with revision 1, fix it. [ bp: Massage commit message. ] Fixes: 647128f1536e ("x86/platform/uv: Update UV MMRs for UV5") Signed-off-by: Mike Travis <mike.travis@hpe.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Steve Wahl <steve.wahl@hpe.com> Acked-by: Dimitri Sivanich <dimitri.sivanich@hpe.com> Link: https://lkml.kernel.org/r/20201203152252.371199-1-mike.travis@hpe.com
2020-12-01x86/resctrl: Fix AMD L3 QOS CDP enable/disableBabu Moger3-2/+14
When the AMD QoS feature CDP (code and data prioritization) is enabled or disabled, the CDP bit in MSR 0000_0C81 is written on one of the CPUs in an L3 domain (core complex). That is not correct - the CDP bit needs to be updated on all the logical CPUs in the domain. This was not spelled out clearly in the spec earlier. The specification has been updated and the updated document, "AMD64 Technology Platform Quality of Service Extensions Publication # 56375 Revision: 1.02 Issue Date: October 2020" is available now. Refer the section: Code and Data Prioritization. Fix the issue by adding a new flag arch_has_per_cpu_cfg in rdt_cache data structure. The documentation can be obtained at: https://developer.amd.com/wp-content/resources/56375.pdf Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 [ bp: Massage commit message. ] Fixes: 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature") Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lkml.kernel.org/r/160675180380.15628.3309402017215002347.stgit@bmoger-ubuntu
2020-11-29Merge tag 'locking-urgent-2020-11-29' of ↵Linus Torvalds1-5/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Thomas Gleixner: "Two more places which invoke tracing from RCU disabled regions in the idle path. Similar to the entry path the low level idle functions have to be non-instrumentable" * tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: intel_idle: Fix intel_idle() vs tracing sched/idle: Fix arch_cpu_idle() vs tracing
2020-11-29Merge tag 'x86_urgent_for_v5.10-rc6' of ↵Linus Torvalds3-43/+32
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: "A couple of urgent fixes which accumulated this last week: - Two resctrl fixes to prevent refcount leaks when manipulating the resctrl fs (Xiaochen Shen) - Correct prctl(PR_GET_SPECULATION_CTRL) reporting (Anand K Mistry) - A fix to not lose already seen MCE severity which determines whether the machine can recover (Gabriele Paoloni)" * tag 'x86_urgent_for_v5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Do not overwrite no_way_out if mce_end() fails x86/speculation: Fix prctl() when spectre_v2_user={seccomp,prctl},ibpb x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leak x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak
2020-11-27Merge tag 'iommu-fixes' of ↵Linus Torvalds1-4/+1
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull iommu fixes from Will Deacon: "Here's another round of IOMMU fixes for -rc6 consisting mainly of a bunch of independent driver fixes. Thomas agreed for me to take the x86 'tboot' fix here, as it fixes a regression introduced by a vt-d change. - Fix intel iommu driver when running on devices without VCCAP_REG - Fix swiotlb and "iommu=pt" interaction under TXT (tboot) - Fix missing return value check during device probe() - Fix probe ordering for Qualcomm SMMU implementation - Ensure page-sized mappings are used for AMD IOMMU buffers with SNP RMP" * tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: iommu/vt-d: Don't read VCCAP register unless it exists x86/tboot: Don't disable swiotlb when iommu is forced on iommu: Check return of __iommu_attach_device() arm-smmu-qcom: Ensure the qcom_scm driver has finished probing iommu/amd: Enforce 4k mapping for certain IOMMU data structures
2020-11-27x86/mce: Do not overwrite no_way_out if mce_end() failsGabriele Paoloni1-2/+4
Currently, if mce_end() fails, no_way_out - the variable denoting whether the machine can recover from this MCE - is determined by whether the worst severity that was found across the MCA banks associated with the current CPU, is of panic severity. However, at this point no_way_out could have been already set by mca_start() after looking at all severities of all CPUs that entered the MCE handler. If mce_end() fails, check first if no_way_out is already set and, if so, stick to it, otherwise use the local worst value. [ bp: Massage. ] Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201127161819.3106432-2-gabriele.paoloni@intel.com
2020-11-25x86/speculation: Fix prctl() when spectre_v2_user={seccomp,prctl},ibpbAnand K Mistry1-2/+2
When spectre_v2_user={seccomp,prctl},ibpb is specified on the command line, IBPB is force-enabled and STIPB is conditionally-enabled (or not available). However, since 21998a351512 ("x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.") the spectre_v2_user_ibpb variable is set to SPECTRE_V2_USER_{PRCTL,SECCOMP} instead of SPECTRE_V2_USER_STRICT, which is the actual behaviour. Because the issuing of IBPB relies on the switch_mm_*_ibpb static branches, the mitigations behave as expected. Since 1978b3a53a74 ("x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP") this discrepency caused the misreporting of IB speculation via prctl(). On CPUs with STIBP always-on and spectre_v2_user=seccomp,ibpb, prctl(PR_GET_SPECULATION_CTRL) would return PR_SPEC_PRCTL | PR_SPEC_ENABLE instead of PR_SPEC_DISABLE since both IBPB and STIPB are always on. It also allowed prctl(PR_SET_SPECULATION_CTRL) to set the IB speculation mode, even though the flag is ignored. Similarly, for CPUs without SMT, prctl(PR_GET_SPECULATION_CTRL) should also return PR_SPEC_DISABLE since IBPB is always on and STIBP is not available. [ bp: Massage commit message. ] Fixes: 21998a351512 ("x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.") Fixes: 1978b3a53a74 ("x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP") Signed-off-by: Anand K Mistry <amistry@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201110123349.1.Id0cbf996d2151f4c143c90f9028651a5b49a5908@changeid
2020-11-25x86/tboot: Don't disable swiotlb when iommu is forced onLu Baolu1-4/+1
After commit 327d5b2fee91c ("iommu/vt-d: Allow 32bit devices to uses DMA domain"), swiotlb could also be used for direct memory access if IOMMU is enabled but a device is configured to pass through the DMA translation. Keep swiotlb when IOMMU is forced on, otherwise, some devices won't work if "iommu=pt" kernel parameter is used. Fixes: 327d5b2fee91 ("iommu/vt-d: Allow 32bit devices to uses DMA domain") Reported-and-tested-by: Adrian Huang <ahuang12@lenovo.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Link: https://lore.kernel.org/r/20201125014124.4070776-1-baolu.lu@linux.intel.com Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210237 Signed-off-by: Will Deacon <will@kernel.org>
2020-11-24sched/idle: Fix arch_cpu_idle() vs tracingPeter Zijlstra1-5/+7
We call arch_cpu_idle() with RCU disabled, but then use local_irq_{en,dis}able(), which invokes tracing, which relies on RCU. Switch all arch_cpu_idle() implementations to use raw_local_irq_{en,dis}able() and carefully manage the lockdep,rcu,tracing state like we do in entry. (XXX: we really should change arch_cpu_idle() to not return with interrupts enabled) Reported-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Link: https://lkml.kernel.org/r/20201120114925.594122626@infradead.org
2020-11-24x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leakXiaochen Shen1-7/+25
On resource group creation via a mkdir an extra kernfs_node reference is obtained by kernfs_get() to ensure that the rdtgroup structure remains accessible for the rdtgroup_kn_unlock() calls where it is removed on deletion. Currently the extra kernfs_node reference count is only dropped by kernfs_put() in rdtgroup_kn_unlock() while the rdtgroup structure is removed in a few other locations that lack the matching reference drop. In call paths of rmdir and umount, when a control group is removed, kernfs_remove() is called to remove the whole kernfs nodes tree of the control group (including the kernfs nodes trees of all child monitoring groups), and then rdtgroup structure is freed by kfree(). The rdtgroup structures of all child monitoring groups under the control group are freed by kfree() in free_all_child_rdtgrp(). Before calling kfree() to free the rdtgroup structures, the kernfs node of the control group itself as well as the kernfs nodes of all child monitoring groups still take the extra references which will never be dropped to 0 and the kernfs nodes will never be freed. It leads to reference count leak and kernfs_node_cache memory leak. For example, reference count leak is observed in these two cases: (1) mount -t resctrl resctrl /sys/fs/resctrl mkdir /sys/fs/resctrl/c1 mkdir /sys/fs/resctrl/c1/mon_groups/m1 umount /sys/fs/resctrl (2) mkdir /sys/fs/resctrl/c1 mkdir /sys/fs/resctrl/c1/mon_groups/m1 rmdir /sys/fs/resctrl/c1 The same reference count leak issue also exists in the error exit paths of mkdir in mkdir_rdt_prepare() and rdtgroup_mkdir_ctrl_mon(). Fix this issue by following changes to make sure the extra kernfs_node reference on rdtgroup is dropped before freeing the rdtgroup structure. (1) Introduce rdtgroup removal helper rdtgroup_remove() to wrap up kernfs_put() and kfree(). (2) Call rdtgroup_remove() in rdtgroup removal path where the rdtgroup structure is about to be freed by kfree(). (3) Call rdtgroup_remove() or kernfs_put() as appropriate in the error exit paths of mkdir where an extra reference is taken by kernfs_get(). Fixes: f3cbeacaa06e ("x86/intel_rdt/cqm: Add rmdir support") Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files") Fixes: 60cf5e101fd4 ("x86/intel_rdt: Add mkdir to resctrl file system") Reported-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1604085088-31707-1-git-send-email-xiaochen.shen@intel.com
2020-11-24x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leakXiaochen Shen1-33/+2
Willem reported growing of kernfs_node_cache entries in slabtop when repeatedly creating and removing resctrl subdirectories as well as when repeatedly mounting and unmounting the resctrl filesystem. On resource group (control as well as monitoring) creation via a mkdir an extra kernfs_node reference is obtained to ensure that the rdtgroup structure remains accessible for the rdtgroup_kn_unlock() calls where it is removed on deletion. The kernfs_node reference count is dropped by kernfs_put() in rdtgroup_kn_unlock(). With the above explaining the need for one kernfs_get()/kernfs_put() pair in resctrl there are more places where a kernfs_node reference is obtained without a corresponding release. The excessive amount of reference count on kernfs nodes will never be dropped to 0 and the kernfs nodes will never be freed in the call paths of rmdir and umount. It leads to reference count leak and kernfs_node_cache memory leak. Remove the superfluous kernfs_get() calls and expand the existing comments surrounding the remaining kernfs_get()/kernfs_put() pair that remains in use. Superfluous kernfs_get() calls are removed from two areas: (1) In call paths of mount and mkdir, when kernfs nodes for "info", "mon_groups" and "mon_data" directories and sub-directories are created, the reference count of newly created kernfs node is set to 1. But after kernfs_create_dir() returns, superfluous kernfs_get() are called to take an additional reference. (2) kernfs_get() calls in rmdir call paths. Fixes: 17eafd076291 ("x86/intel_rdt: Split resource group removal in two") Fixes: 4af4a88e0c92 ("x86/intel_rdt/cqm: Add mount,umount support") Fixes: f3cbeacaa06e ("x86/intel_rdt/cqm: Add rmdir support") Fixes: d89b7379015f ("x86/intel_rdt/cqm: Add mon_data") Fixes: c7d9aac61311 ("x86/intel_rdt/cqm: Add mkdir support for RDT monitoring") Fixes: 5dc1d5c6bac2 ("x86/intel_rdt: Simplify info and base file lists") Fixes: 60cf5e101fd4 ("x86/intel_rdt: Add mkdir to resctrl file system") Fixes: 4e978d06dedb ("x86/intel_rdt: Add "info" files to resctrl file system") Reported-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Willem de Bruijn <willemb@google.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1604085053-31639-1-git-send-email-xiaochen.shen@intel.com
2020-11-22Merge tag 'x86_urgent_for_v5.10-rc5' of ↵Linus Torvalds2-57/+29
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - An IOMMU VT-d build fix when CONFIG_PCI_ATS=n along with a revert of same because the proper one is going through the IOMMU tree (Thomas Gleixner) - An Intel microcode loader fix to save the correct microcode patch to apply during resume (Chen Yu) - A fix to not access user memory of other processes when dumping opcode bytes (Thomas Gleixner) * tag 'x86_urgent_for_v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Revert "iommu/vt-d: Take CONFIG_PCI_ATS into account" x86/dumpstack: Do not try to access user space code of other tasks x86/microcode/intel: Check patch signature before saving microcode for early loading iommu/vt-d: Take CONFIG_PCI_ATS into account
2020-11-20Merge tag 'iommu-fixes' of ↵Linus Torvalds1-3/+0
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull iommu fixes from Will Deacon: "Two straightforward vt-d fixes: - Fix boot when intel iommu initialisation fails under TXT (tboot) - Fix intel iommu compilation error when DMAR is enabled without ATS and temporarily update IOMMU MAINTAINERs entry" * tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: MAINTAINERS: Temporarily add myself to the IOMMU entry iommu/vt-d: Fix compile error with CONFIG_PCI_ATS not set iommu/vt-d: Avoid panic if iommu init fails in tboot system
2020-11-19Merge tag 'x86-urgent-2020-11-15' of ↵Will Deacon1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-next/iommu/fixes Pull in x86 fixes from Thomas, as they include a change to the Intel DMAR code on which we depend: * tag 'x86-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: iommu/vt-d: Cure VF irqdomain hickup x86/platform/uv: Fix copied UV5 output archtype x86/platform/uv: Drop last traces of uv_flush_tlb_others
2020-11-18iommu/vt-d: Avoid panic if iommu init fails in tboot systemZhenzhong Duan1-3/+0
"intel_iommu=off" command line is used to disable iommu but iommu is force enabled in a tboot system for security reason. However for better performance on high speed network device, a new option "intel_iommu=tboot_noforce" is introduced to disable the force on. By default kernel should panic if iommu init fail in tboot for security reason, but it's unnecessory if we use "intel_iommu=tboot_noforce,off". Fix the code setting force_on and move intel_iommu_tboot_noforce from tboot code to intel iommu code. Fixes: 7304e8f28bb2 ("iommu/vt-d: Correctly disable Intel IOMMU force on") Signed-off-by: Zhenzhong Duan <zhenzhong.duan@gmail.com> Tested-by: Lukasz Hawrylko <lukasz.hawrylko@linux.intel.com> Acked-by: Lu Baolu <baolu.lu@linux.intel.com> Link: https://lore.kernel.org/r/20201110071908.3133-1-zhenzhong.duan@gmail.com Signed-off-by: Will Deacon <will@kernel.org>
2020-11-18x86/dumpstack: Do not try to access user space code of other tasksThomas Gleixner1-4/+19
sysrq-t ends up invoking show_opcodes() for each task which tries to access the user space code of other processes, which is obviously bogus. It either manages to dump where the foreign task's regs->ip points to in a valid mapping of the current task or triggers a pagefault and prints "Code: Bad RIP value.". Both is just wrong. Add a safeguard in copy_code() and check whether the @regs pointer matches currents pt_regs. If not, do not even try to access it. While at it, add commentary why using copy_from_user_nmi() is safe in copy_code() even if the function name suggests otherwise. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Acked-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201117202753.667274723@linutronix.de
2020-11-17x86/microcode/intel: Check patch signature before saving microcode for early ↵Chen Yu1-53/+10
loading Currently, scan_microcode() leverages microcode_matches() to check if the microcode matches the CPU by comparing the family and model. However, the processor stepping and flags of the microcode signature should also be considered when saving a microcode patch for early update. Use find_matching_signature() in scan_microcode() and get rid of the now-unused microcode_matches() which is a good cleanup in itself. Complete the verification of the patch being saved for early loading in save_microcode_patch() directly. This needs to be done there too because save_mc_for_early() will call save_microcode_patch() too. The second reason why this needs to be done is because the loader still tries to support, at least hypothetically, mixed-steppings systems and thus adds all patches to the cache that belong to the same CPU model albeit with different steppings. For example: microcode: CPU: sig=0x906ec, pf=0x2, rev=0xd6 microcode: mc_saved[0]: sig=0x906e9, pf=0x2a, rev=0xd6, total size=0x19400, date = 2020-04-23 microcode: mc_saved[1]: sig=0x906ea, pf=0x22, rev=0xd6, total size=0x19000, date = 2020-04-27 microcode: mc_saved[2]: sig=0x906eb, pf=0x2, rev=0xd6, total size=0x19400, date = 2020-04-23 microcode: mc_saved[3]: sig=0x906ec, pf=0x22, rev=0xd6, total size=0x19000, date = 2020-04-27 microcode: mc_saved[4]: sig=0x906ed, pf=0x22, rev=0xd6, total size=0x19400, date = 2020-04-23 The patch which is being saved for early loading, however, can only be the one which fits the CPU this runs on so do the signature verification before saving. [ bp: Do signature verification in save_microcode_patch() and rewrite commit message. ] Fixes: ec400ddeff20 ("x86/microcode_intel_early.c: Early update ucode on Intel's CPU") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=208535 Link: https://lkml.kernel.org/r/20201113015923.13960-1-yu.c.chen@intel.com
2020-11-15Merge tag 'x86-urgent-2020-11-15' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A small set of fixes for x86: - Cure the fallout from the MSI irqdomain overhaul which missed that the Intel IOMMU does not register virtual function devices and therefore never reaches the point where the MSI interrupt domain is assigned. This made the VF devices use the non-remapped MSI domain which is trapped by the IOMMU/remap unit - Remove an extra space in the SGI_UV architecture type procfs output for UV5 - Remove a unused function which was missed when removing the UV BAU TLB shootdown handler" * tag 'x86-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: iommu/vt-d: Cure VF irqdomain hickup x86/platform/uv: Fix copied UV5 output archtype x86/platform/uv: Drop last traces of uv_flush_tlb_others
2020-11-15Merge tag 'perf-urgent-2020-11-15' of ↵Linus Torvalds1-4/+11
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "A set of fixes for perf: - A set of commits which reduce the stack usage of various perf event handling functions which allocated large data structs on stack causing stack overflows in the worst case - Use the proper mechanism for detecting soft interrupts in the recursion protection - Make the resursion protection simpler and more robust - Simplify the scheduling of event groups to make the code more robust and prepare for fixing the issues vs. scheduling of exclusive event groups - Prevent event multiplexing and rotation for exclusive event groups - Correct the perf event attribute exclusive semantics to take pinned events, e.g. the PMU watchdog, into account - Make the anythread filtering conditional for Intel's generic PMU counters as it is not longer guaranteed to be supported on newer CPUs. Check the corresponding CPUID leaf to make sure - Fixup a duplicate initialization in an array which was probably caused by the usual 'copy & paste - forgot to edit' mishap" * tag 'perf-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel/uncore: Fix Add BW copypasta perf/x86/intel: Make anythread filter support conditional perf: Tweak perf_event_attr::exclusive semantics perf: Fix event multiplexing for exclusive groups perf: Simplify group_sched_in() perf: Simplify group_sched_out() perf/x86: Make dummy_iregs static perf/arch: Remove perf_sample_data::regs_user_copy perf: Optimize get_recursion_context() perf: Fix get_recursion_context() perf/x86: Reduce stack usage for x86_pmu::drain_pebs() perf: Reduce stack usage of perf_output_begin()
2020-11-13x86/platform/uv: Fix copied UV5 output archtypeMike Travis1-3/+3
A test shows that