linux.git/arch/x86/kernel/process_32.c, branch v5.10.258

x86/resctl: fix scheduler confusion with 'current'

2023-03-11T15:40:19+00:00

commit 7fef099702527c3b2c5234a2ea6a24411485a13a upstream.

The implementation of 'current' on x86 is very intentionally special: it
is a very common thing to look up, and it uses 'this_cpu_read_stable()'
to get the current thread pointer efficiently from per-cpu storage.

And the keyword in there is 'stable': the current thread pointer never
changes as far as a single thread is concerned.  Even when a thread is
preempted, or moved to another CPU, or even across an explicit call to
'schedule()', that thread will still have the same value for 'current'.

It is, after all, the kernel base pointer to thread-local storage.
That's why it's stable to begin with, but it's also why it's important
enough that we have that special 'this_cpu_read_stable()' access for it.

So this is all done very intentionally to allow the compiler to treat
'current' as a value that never visibly changes, so that the compiler
can do CSE and combine multiple different 'current' accesses into one.

However, there is obviously one very special situation when the
currently running thread does actually change: inside the scheduler
itself.

So the scheduler code paths are special, and do not have a 'current'
thread at all.  Instead there are _two_ threads: the previous and the
next thread - typically called 'prev' and 'next' (or prev_p/next_p)
internally.

So this is all actually quite straightforward and simple, and not all
that complicated.

Except for when you then have special code that is run in scheduler
context, that code then has to be aware that 'current' isn't really a
valid thing.  Did you mean 'prev'? Did you mean 'next'?

In fact, even if you then look at the code, and you use 'current' after the
new value has been assigned to the percpu variable, we have explicitly
told the compiler that 'current' is magical and always stable.  So the
compiler is quite free to use an older (or newer) value of 'current',
and the actual assignment to the percpu storage is not relevant even if
it might look that way.

Which is exactly what happened in the resctl code, that blithely used
'current' in '__resctrl_sched_in()' when it really wanted the new
process state (as implied by the name: we're scheduling 'into' that new
resctl state).  And clang would end up just using the old thread pointer
value at least in some configurations.

This could have happened with gcc too, and purely depends on random
compiler details.  Clang just seems to have been more aggressive about
moving the read of the per-cpu current_task pointer around.

The fix is trivial: just make the resctl code adhere to the scheduler
rules of using the prev/next thread pointer explicitly, instead of using
'current' in a situation where it just wasn't valid.

That same code is then also used outside of the scheduler context (when
a thread resctl state is explicitly changed), and then we will just pass
in 'current' as that pointer, of course.  There is no ambiguity in that
case.
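
The hazard and the fix described above can be sketched in plain user-space
C.  This is only an illustration, not the kernel code: 'struct task',
'cpu_current', 'sched_in_stale()', 'sched_in_fixed()' and
'switch_to_sketch()' are hypothetical names, and the compiler's freedom to
reuse an earlier read of 'current' is modelled by an explicit cached copy.

```c
/* Hypothetical stand-in for task_struct: only the field the hook needs. */
struct task {
	int closid;	/* resource-control class this task wants */
};

/* Models the per-cpu current_task slot. */
static struct task *cpu_current;

/*
 * Buggy pattern: the hook consults a value of 'current' that was read
 * before the per-cpu slot was updated.  'cached' models what the compiler
 * may legally reuse, because 'current' is declared stable.
 */
static int sched_in_stale(const struct task *cached)
{
	return cached->closid;
}

/* Fixed pattern: the caller names the incoming task explicitly. */
static int sched_in_fixed(const struct task *next)
{
	return next->closid;
}

/* Minimal context switch: returns the closid the hook would program. */
static int switch_to_sketch(struct task *prev, struct task *next, int fixed)
{
	struct task *cached = cpu_current;	/* early read, equals prev */

	(void)prev;
	cpu_current = next;			/* the per-cpu update */
	return fixed ? sched_in_fixed(next) : sched_in_stale(cached);
}
```

With 'fixed' set to 0 the sketch programs the outgoing task's class even
though the per-cpu slot already points at 'next'; with 'fixed' set to 1 it
programs the incoming task's class, which is what the hook wanted.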

The fix may be trivial, but noticing and figuring out what went wrong
was not.  The credit for that goes to Stephane Eranian.

Reported-by: Stephane Eranian 
Link: https://lore.kernel.org/lkml/20230303231133.1486085-1-eranian@google.com/
Link: https://lore.kernel.org/lkml/alpine.LFD.2.01.0908011214330.3304@localhost.localdomain/
Reviewed-by: Nick Desaulniers 
Tested-by: Tony Luck 
Tested-by: Stephane Eranian 
Tested-by: Babu Moger 
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

x86/fpu: Correct pkru/xstate inconsistency

2022-03-02T10:42:47+00:00

When eagerly switching PKRU in switch_fpu_finish() it checks that
current is not a kernel thread as kernel threads will never use PKRU.
It's possible that this_cpu_read_stable() on current_task
(ie. get_current()) is returning an old cached value. To resolve this
reference next_p directly rather than relying on current.

As written it's possible when switching from a kernel thread to a
userspace thread to observe a cached PF_KTHREAD flag and never restore
the PKRU. And as a result this issue only occurs when switching
from a kernel thread to a userspace thread, switching from a non kernel
thread works perfectly fine because all that is considered in that
situation are the flags from some other non kernel task and the next fpu
is passed in to switch_fpu_finish().
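
A minimal user-space sketch of the corrected check, under stated
assumptions: 'struct task_sketch' and 'pkru_after_switch()' are hypothetical
names, the PF_KTHREAD value is assumed to mirror the kernel's flag bit, and
the fallback 'default_pkru' for kernel threads is an assumption of this
sketch rather than a statement about the real switch_fpu_finish().

```c
/* Assumed to mirror the kernel's PF_KTHREAD flag bit. */
#define PF_KTHREAD 0x00200000u

/* Hypothetical stand-in for the task fields the decision depends on. */
struct task_sketch {
	unsigned int flags;	/* PF_* flags of this task */
	unsigned int pkru;	/* PKRU value saved for this task */
};

/*
 * Decide from the incoming task itself (next_p in the commit text), never
 * from a possibly stale 'current'.  Returns the PKRU value that would be
 * live after the switch.
 */
static unsigned int pkru_after_switch(const struct task_sketch *next,
				      unsigned int default_pkru)
{
	if (!(next->flags & PF_KTHREAD))
		return next->pkru;	/* user thread: restore its own PKRU */
	return default_pkru;		/* kernel thread: nothing to restore */
}
```

If the flags were instead taken from the outgoing kernel thread, the first
branch would be skipped for an incoming user thread, which is the failure
the commit describes.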

This behavior only exists between 5.2 and 5.13 when it was fixed by a
rewrite decoupling PKRU from xstate, in:
  commit 954436989cc5 ("x86/fpu: Remove PKRU handling from switch_fpu_finish()")

Unfortunately backporting the fix from 5.13 is probably not realistic as
it's part of a 60+ patch series which rewrites most of the PKRU handling.

Fixes: 0cecca9d03c9 ("x86/fpu: Eager switch PKRU state")
Signed-off-by: Brian Geffon 
Signed-off-by: Willis Kung 
Tested-by: Willis Kung 
Cc: <stable@vger.kernel.org> # v5.4.x
Cc: <stable@vger.kernel.org> # v5.10.x
Acked-by: Dave Hansen 
Signed-off-by: Greg Kroah-Hartman

x86/dumpstack: Add log_lvl to __show_regs()

2020-07-22T21:56:53+00:00

show_trace_log_lvl() provides an x86 platform-specific way to unwind a
backtrace with a given log level. Unfortunately, register dumps are
not printed with the same log level - instead, KERN_DEFAULT is always
used.

Arista's switches use a quite common setup with rsyslog, where only
urgent messages go to the console (console_log_level=KERN_ERR) and
everything else goes into /var/log/, as the console baud rate is often
indecently slow (9600 bps).

Backtrace dumps without registers printed have proven to be as useful as
morning standups. Furthermore, in order to introduce KERN_UNSUPPRESSED
(which I believe is still the most elegant way to fix raciness of sysrq[1])
the log level should be passed down the stack to register dumping
functions. Besides, there is a potential use-case for printing traces
with KERN_DEBUG level [2] (where registers dump shouldn't appear with
higher log level).

Add log_lvl parameter to __show_regs().
Keep the log level currently in use unchanged, so that the visible change
is made separately.
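
The idea of passing the log level down to the register dump can be sketched
in user-space C.  Everything here is hypothetical and only illustrates the
shape of the change: 'struct regs_sketch' and 'show_regs_sketch()' are
invented names, the output goes to a caller-supplied buffer instead of
printk(), and the prefix is an arbitrary string chosen by the caller.

```c
#include <stdio.h>

/* Hypothetical register snapshot; the real pt_regs has many more fields. */
struct regs_sketch {
	unsigned long ip;
	unsigned long sp;
};

/*
 * Format a register dump with the caller's log-level prefix, so the dump
 * carries the same level as the backtrace it accompanies.
 * Returns whatever snprintf() reports.
 */
static int show_regs_sketch(char *buf, size_t len,
			    const struct regs_sketch *regs,
			    const char *log_lvl)
{
	return snprintf(buf, len, "%sIP: %08lx SP: %08lx\n",
			log_lvl, regs->ip, regs->sp);
}
```

A caller that unwinds at one level passes the same prefix to the register
dump, instead of the dump choosing a fixed level on its own.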

[1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/
[2]: https://lore.kernel.org/linux-doc/20190724170249.9644-1-dima@arista.com/

Signed-off-by: Dmitry Safonov 
Signed-off-by: Thomas Gleixner 
Acked-by: Petr Mladek 
Link: https://lkml.kernel.org/r/20200629144847.492794-3-dima@arista.com

mm: don't include asm/pgtable.h if linux/mm.h is already included

2020-06-09T16:39:13+00:00

Patch series "mm: consolidate definitions of page table accessors", v2.

The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once.  For
instance, we have 31 definitions of pgd_offset() for 25 supported
architectures.

Most of these definitions are actually identical and typically it boils
down to, e.g.

static inline unsigned long pmd_index(unsigned long address)
{
        return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}

static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
        return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}

These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
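
As a worked example of the index computation, the sketch below reuses the
pmd_index() definition quoted above with assumed constants (PMD_SHIFT of 21
and 512 entries per table); these values are chosen for illustration and are
not claimed for any particular architecture.

```c
/* Assumed constants for illustration: 2 MiB per PMD entry, 512 entries. */
#define PMD_SHIFT	21
#define PTRS_PER_PMD	512

/* Index of the PMD entry that covers 'address' within its table. */
static inline unsigned long pmd_index(unsigned long address)
{
	return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}
```

With these constants, address 0x40200000 shifted right by 21 gives 513, and
masking with 511 leaves index 1: the mask wraps the index around at the
table size.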

For architectures that really need a custom version there is always the
possibility to override the generic version with the usual ifdefs magic.

These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.

This patch (of 12):

The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g.  pte_alloc() and
pmd_alloc().  So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.

The include statements in such cases are removed with a simple loop:

	for f in $(git grep -l "include <linux/mm.h>") ; do
		sed -i -e '/include <asm\/pgtable.h>/ d' $f
	done

Signed-off-by: Mike Rapoport 
Signed-off-by: Andrew Morton 
Cc: Arnd Bergmann 
Cc: Borislav Petkov 
Cc: Brian Cain 
Cc: Catalin Marinas 
Cc: Chris Zankel 
Cc: "David S. Miller" 
Cc: Geert Uytterhoeven 
Cc: Greentime Hu 
Cc: Greg Ungerer 
Cc: Guan Xuetao 
Cc: Guo Ren 
Cc: Heiko Carstens 
Cc: Helge Deller 
Cc: Ingo Molnar 
Cc: Ley Foon Tan 
Cc: Mark Salter 
Cc: Matthew Wilcox 
Cc: Matt Turner 
Cc: Max Filippov 
Cc: Michael Ellerman 
Cc: Michal Simek 
Cc: Mike Rapoport 
Cc: Nick Hu 
Cc: Paul Walmsley 
Cc: Richard Weinberger 
Cc: Rich Felker 
Cc: Russell King 
Cc: Stafford Horne 
Cc: Thomas Bogendoerfer 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Vincent Chen 
Cc: Vineet Gupta 
Cc: Will Deacon 
Cc: Yoshinori Sato 
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds

x86/resctrl: Rename asm/resctrl_sched.h to asm/resctrl.h

2020-05-06T15:45:22+00:00

asm/resctrl_sched.h is dedicated to the code used for configuration
of the CPU resource control state when a task is scheduled.

Rename resctrl_sched.h to resctrl.h in preparation of additions that
will no longer make this file dedicated to work done during scheduling.

No functional change.

Suggested-by: Borislav Petkov 
Signed-off-by: Reinette Chatre 
Signed-off-by: Borislav Petkov 
Link: https://lkml.kernel.org/r/6914e0ef880b539a82a6d889f9423496d471ad1d.1588715690.git.reinette.chatre@intel.com

x86: Remove unneeded includes

2020-03-21T15:03:25+00:00

Clean up includes of and in 

Signed-off-by: Brian Gerst 
Signed-off-by: Thomas Gleixner 
Link: https://lkml.kernel.org/r/20200313195144.164260-19-brgerst@gmail.com

x86: Remove force_iret()

2020-01-08T18:40:51+00:00

force_iret() was originally intended to prevent the return to user mode with
the SYSRET or SYSEXIT instructions, in cases where the register state could
have been changed to be incompatible with those instructions.  The entry code
has been significantly reworked since then, and register state is validated
before SYSRET or SYSEXIT are used.  force_iret() no longer serves its original
purpose and can be eliminated.

Signed-off-by: Brian Gerst 
Signed-off-by: Borislav Petkov 
Acked-by: Oleg Nesterov 
Link: https://lkml.kernel.org/r/20191219115812.102620-1-brgerst@gmail.com

x86/iopl: Remove legacy IOPL option

2019-11-16T10:24:05+00:00

The IOPL emulation via the I/O bitmap is sufficient. Remove the legacy
cruft dealing with the (e)flags based IOPL mechanism.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Juergen Gross  (Paravirt and Xen parts)
Acked-by: Andy Lutomirski

x86/process: Unify copy_thread_tls()

2019-11-16T10:23:59+00:00

While looking at the TSS io bitmap it turned out that any change in that
area would require identical changes to copy_thread_tls(). The 32 and 64
bit variants share sufficient code to consolidate them into a common
function to avoid duplication of upcoming modifications.

Signed-off-by: Thomas Gleixner 
Acked-by: Andy Lutomirski

x86/stackframe/32: Provide consistent pt_regs

2019-06-25T08:23:47+00:00

Currently pt_regs on x86_32 has an oddity in that kernel regs
(!user_mode(regs)) are short two entries (esp/ss). This means that any
code trying to use them (typically: regs->sp) needs to jump through
some unfortunate hoops.

Change the entry code to fix this up and create a full pt_regs frame.

This then simplifies various trampolines in ftrace and kprobes, the
stack unwinder, ptrace, kdump and kgdb.

Much thanks to Josh for help with the cleanups!

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Josh Poimboeuf 
Acked-by: Masami Hiramatsu 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Signed-off-by: Ingo Molnar