summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2019-04-20Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds3-2/+11
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: "Misc clocksource driver fixes, and a sched-clock wrapping fix" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers/sched_clock: Prevent generic sched_clock wrap caused by tick_freeze() clocksource/drivers/timer-ti-dm: Remove omap_dm_timer_set_load_start clocksource/drivers/oxnas: Fix OX820 compatible clocksource/drivers/arm_arch_timer: Remove unneeded pr_fmt macro clocksource/drivers/npcm: select TIMER_OF
2019-04-20Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds4-39/+43
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Misc fixes: - various tooling fixes - kretprobe fixes - kprobes annotation fixes - kprobes error checking fix - fix the default events for AMD Family 17h CPUs - PEBS fix - AUX record fix - address filtering fix" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kprobes: Avoid kretprobe recursion bug kprobes: Mark ftrace mcount handler functions nokprobe x86/kprobes: Verify stack frame on kretprobe perf/x86/amd: Add event map for AMD Family 17h perf bpf: Return NULL when RB tree lookup fails in perf_env__find_btf() perf tools: Fix map reference counting perf evlist: Fix side band thread draining perf tools: Check maps for bpf programs perf bpf: Return NULL when RB tree lookup fails in perf_env__find_bpf_prog_info() tools include uapi: Sync sound/asound.h copy perf top: Always sample time to satisfy needs of use of ordered queuing perf evsel: Use hweight64() instead of hweight_long(attr.sample_regs_user) tools lib traceevent: Fix missing equality check for strcmp perf stat: Disable DIR_FORMAT feature for 'perf stat record' perf scripts python: export-to-sqlite.py: Fix use of parent_id in calls_view perf header: Fix lock/unlock imbalances when processing BPF/BTF info perf/x86: Fix incorrect PEBS_REGS perf/ring_buffer: Fix AUX record suppression perf/core: Fix the address filtering fix kprobes: Fix error check when reusing optimized probes
2019-04-20Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds2-2/+26
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "A deadline scheduler warning/race fix, and a cfs_period_us quota calculation workaround where the real fix looks too involved to merge immediately" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Correctly handle active 0-lag timers sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup
2019-04-20Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds1-4/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: "A lockdep warning fix and a script execution fix when atomics are generated" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/atomics: Don't assume that scripts are executable locking/lockdep: Make lockdep_unregister_key() honor 'debug_locks' again
2019-04-19kernel/watchdog_hld.c: hard lockup message should end with a newlineSergey Senozhatsky1-1/+2
Separate print_modules() and hard lockup error message. Before the patch: NMI watchdog: Watchdog detected hard LOCKUP on cpu 1Modules linked in: nls_cp437 Link: http://lkml.kernel.org/r/20190412062557.2700-1-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-19kprobes: Mark ftrace mcount handler functions nokprobeMasami Hiramatsu1-1/+5
Mark ftrace mcount handler functions nokprobe since probing on these functions with kretprobe pushes return address incorrectly on kretprobe shadow stack. Reported-by: Francis Deslauriers <francis.deslauriers@efficios.com> Tested-by: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/155094062044.6137.6419622920568680640.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-18signal: use fdget() since we don't allow O_PATHChristian Brauner1-1/+1
As stated in the original commit for pidfd_send_signal() we don't allow to signal processes through O_PATH file descriptors since it is semantically equivalent to a write on the pidfd. We already correctly error out right now and return EBADF if an O_PATH fd is passed. This is because we use file->f_op to detect whether a pidfd is passed and O_PATH fds have their file->f_op set to empty_fops in do_dentry_open() and thus fail the test. Thus, there is no regression. It's just semantically correct to use fdget() and return an error right from there instead of taking a reference and returning an error later. Signed-off-by: Christian Brauner <christian@brauner.io> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jann Horn <jann@thejh.net> Cc: David Howells <dhowells@redhat.com> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-18timers/sched_clock: Prevent generic sched_clock wrap caused by tick_freeze()Chang-An Chen3-2/+11
tick_freeze() introduced by suspend-to-idle in commit 124cf9117c5f ("PM / sleep: Make it possible to quiesce timers during suspend-to-idle") uses timekeeping_suspend() instead of syscore_suspend() during suspend-to-idle. As a consequence generic sched_clock will keep going because sched_clock_suspend() and sched_clock_resume() are not invoked during suspend-to-idle which can result in a generic sched_clock wrap. On a ARM system with suspend-to-idle enabled, sched_clock is registered as "56 bits at 13MHz, resolution 76ns, wraps every 4398046511101ns", which means the real wrapping duration is 8796093022202ns. [ 134.551779] suspend-to-idle suspend (timekeeping_suspend()) [ 1204.912239] suspend-to-idle resume (timekeeping_resume()) ...... [ 1206.912239] suspend-to-idle suspend (timekeeping_suspend()) [ 5880.502807] suspend-to-idle resume (timekeeping_resume()) ...... [ 6000.403724] suspend-to-idle suspend (timekeeping_suspend()) [ 8035.753167] suspend-to-idle resume (timekeeping_resume()) ...... [ 8795.786684] (2)[321:charger_thread]...... [ 8795.788387] (2)[321:charger_thread]...... [ 0.057226] (0)[0:swapper/0]...... [ 0.061447] (2)[0:swapper/2]...... sched_clock was not stopped during suspend-to-idle, and sched_clock_poll hrtimer was not expired because timekeeping_suspend() was invoked during suspend-to-idle. It makes sched_clock wrap at kernel time 8796s. To prevent this, invoke sched_clock_suspend() and sched_clock_resume() in tick_freeze() together with timekeeping_suspend() and timekeeping_resume(). Fixes: 124cf9117c5f (PM / sleep: Make it possible to quiesce timers during suspend-to-idle) Signed-off-by: Chang-An Chen <chang-an.chen@mediatek.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Kees Cook <keescook@chromium.org> Cc: Corey Minyard <cminyard@mvista.com> Cc: <linux-mediatek@lists.infradead.org> Cc: <linux-arm-kernel@lists.infradead.org> Cc: Stanley Chu <stanley.chu@mediatek.com> Cc: <kuohong.wang@mediatek.com> Cc: <freddy.hsin@mediatek.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1553828349-8914-1-git-send-email-chang-an.chen@mediatek.com
2019-04-16sched/deadline: Correctly handle active 0-lag timersluca abeni1-2/+1
syzbot reported the following warning: [ ] WARNING: CPU: 4 PID: 17089 at kernel/sched/deadline.c:255 task_non_contending+0xae0/0x1950 line 255 of deadline.c is: WARN_ON(hrtimer_active(&dl_se->inactive_timer)); in task_non_contending(). Unfortunately, in some cases (for example, a deadline task continuosly blocking and waking immediately) it can happen that a task blocks (and task_non_contending() is called) while the 0-lag timer is still active. In this case, the safest thing to do is to immediately decrease the running bandwidth of the task, without trying to re-arm the 0-lag timer. Signed-off-by: luca abeni <luca.abeni@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: chengjian (D) <cj.chengjian@huawei.com> Link: https://lkml.kernel.org/r/20190325131530.34706-1-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockupPhil Auld1-0/+25
With extremely short cfs_period_us setting on a parent task group with a large number of children the for loop in sched_cfs_period_timer() can run until the watchdog fires. There is no guarantee that the call to hrtimer_forward_now() will ever return 0. The large number of children can make do_sched_cfs_period_timer() take longer than the period. NMI watchdog: Watchdog detected hard LOCKUP on cpu 24 RIP: 0010:tg_nop+0x0/0x10 <IRQ> walk_tg_tree_from+0x29/0xb0 unthrottle_cfs_rq+0xe0/0x1a0 distribute_cfs_runtime+0xd3/0xf0 sched_cfs_period_timer+0xcb/0x160 ? sched_cfs_slack_timer+0xd0/0xd0 __hrtimer_run_queues+0xfb/0x270 hrtimer_interrupt+0x122/0x270 smp_apic_timer_interrupt+0x6a/0x140 apic_timer_interrupt+0xf/0x20 </IRQ> To prevent this we add protection to the loop that detects when the loop has run too many times and scales the period and quota up, proportionally, so that the timer can complete before then next period expires. This preserves the relative runtime quota while preventing the hard lockup. A warning is issued reporting this state and the new values. Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190319130005.25492-1-pauld@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16perf/ring_buffer: Fix AUX record suppressionAlexander Shishkin1-18/+15
The following commit: 1627314fb54a33e ("perf: Suppress AUX/OVERWRITE records") has an unintended side-effect of also suppressing all AUX records with no flags and non-zero size, so all the regular records in the full trace mode. This breaks some use cases for people. Fix this by restoring "regular" AUX records. Reported-by: Ben Gainey <Ben.Gainey@arm.com> Tested-by: Ben Gainey <Ben.Gainey@arm.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: 1627314fb54a33e ("perf: Suppress AUX/OVERWRITE records") Link: https://lkml.kernel.org/r/20190329091338.29999-1-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16perf/core: Fix the address filtering fixAlexander Shishkin1-16/+21
The following recent commit: c60f83b813e5 ("perf, pt, coresight: Fix address filters for vmas with non-zero offset") changes the address filtering logic to communicate filter ranges to the PMU driver via a single address range object, instead of having the driver do the final bit of math. That change forgets to take into account kernel filters, which are not calculated the same way as DSO based filters. Fix that by passing the kernel filters the same way as file-based filters. This doesn't require any additional changes in the drivers. Reported-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: c60f83b813e5 ("perf, pt, coresight: Fix address filters for vmas with non-zero offset") Link: https://lkml.kernel.org/r/20190329091212.29870-1-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16kprobes: Fix error check when reusing optimized probesMasami Hiramatsu1-4/+2
The following commit introduced a bug in one of our error paths: 819319fc9346 ("kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()") it missed to handle the return value of kprobe_optready() as error-value. In reality, the kprobe_optready() returns a bool result, so "true" case must be passed instead of 0. This causes some errors on kprobe boot-time selftests on ARM: [ ] Beginning kprobe tests... [ ] Probe ARM code [ ] kprobe [ ] kretprobe [ ] ARM instruction simulation [ ] Check decoding tables [ ] Run test cases [ ] FAIL: test_case_handler not run [ ] FAIL: Test andge r10, r11, r14, asr r7 [ ] FAIL: Scenario 11 ... [ ] FAIL: Scenario 7 [ ] Total instruction simulation tests=1631, pass=1433 fail=198 [ ] kprobe tests failed This can happen if an optimized probe is unregistered and next kprobe is registered on same address until the previous probe is not reclaimed. If this happens, a hidden aggregated probe may be kept in memory, and no new kprobe can probe same address. Also, in that case register_kprobe() will return "1" instead of minus error value, which can mislead caller logic. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S . Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org # v5.0+ Fixes: 819319fc9346 ("kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()") Link: http://lkml.kernel.org/r/155530808559.32517.539898325433642204.stgit@devnote2 Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16locking/lockdep: Make lockdep_unregister_key() honor 'debug_locks' againBart Van Assche1-4/+5
If lockdep_register_key() and lockdep_unregister_key() are called with debug_locks == false then the following warning is reported: WARNING: CPU: 2 PID: 15145 at kernel/locking/lockdep.c:4920 lockdep_unregister_key+0x1ad/0x240 That warning is reported because lockdep_unregister_key() ignores the value of 'debug_locks' and because the behavior of lockdep_register_key() depends on whether or not 'debug_locks' is set. Fix this inconsistency by making lockdep_unregister_key() take 'debug_locks' again into account. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: shenghui <shhuiw@foxmail.com> Fixes: 90c1cba2b3b3 ("locking/lockdep: Zap lock classes even with lock debugging disabled") Link: http://lkml.kernel.org/r/20190415170538.23491-1-bvanassche@acm.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-14Merge branch 'page-refs' (page ref overflow)Linus Torvalds1-1/+5
Merge page ref overflow branch. Jann Horn reported that he can overflow the page ref count with sufficient memory (and a filesystem that is intentionally extremely slow). Admittedly it's not exactly easy. To have more than four billion references to a page requires a minimum of 32GB of kernel memory just for the pointers to the pages, much less any metadata to keep track of those pointers. Jann needed a total of 140GB of memory and a specially crafted filesystem that leaves all reads pending (in order to not ever free the page references and just keep adding more). Still, we have a fairly straightforward way to limit the two obvious user-controllable sources of page references: direct-IO like page references gotten through get_user_pages(), and the splice pipe page duplication. So let's just do that. * branch page-refs: fs: prevent page refcount overflow in pipe_buf_get mm: prevent get_user_pages() from overflowing page refcount mm: add 'try_get_page()' helper function mm: make page ref count overflow check tighter and more explicit
2019-04-14fs: prevent page refcount overflow in pipe_buf_getMatthew Wilcox1-1/+5
Change pipe_buf_get() to return a bool indicating whether it succeeded in raising the refcount of the page (if the thing in the pipe is a page). This removes another mechanism for overflowing the page refcount. All callers converted to handle a failure. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Matthew Wilcox <willy@infradead.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-12Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix the alarm_timer_remaining() return value" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: alarmtimer: Return correct remaining time
2019-04-12Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a NULL pointer dereference crash in certain environments" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Do not re-read ->h_load_next during hierarchical load calculation
2019-04-12Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds2-11/+45
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Six kernel side fixes: three related to NMI handling on AMD systems, a race fix, a kexec initialization fix and a PEBS sampling fix" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix perf_event_disable_inatomic() race x86/perf/amd: Remove need to check "running" bit in NMI handler x86/perf/amd: Resolve NMI latency issues for active PMCs x86/perf/amd: Resolve race condition when disabling PMC perf/x86/intel: Initialize TFA MSR perf/x86/intel: Fix handling of wakeup_events for multi-entry PEBS
2019-04-12Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds1-17/+12
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Ingo Molnar: "Fixes a crash when accessing /proc/lockdep" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/lockdep: Zap lock classes even with lock debugging disabled
2019-04-12Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds2-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Ingo Molnar: "Two genirq fixes, plus an irqchip driver error handling fix" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Respect IRQCHIP_SKIP_SET_WAKE in irq_chip_set_wake_parent() genirq: Initialize request_mutex if CONFIG_SPARSE_IRQ=n irqchip/irq-ls1x: Missing error code in ls1x_intc_of_init()
2019-04-12perf/core: Fix perf_event_disable_inatomic() racePeter Zijlstra2-11/+45
Thomas-Mich Richter reported he triggered a WARN()ing from event_function_local() on his s390. The problem boils down to: CPU-A CPU-B perf_event_overflow() perf_event_disable_inatomic() @pending_disable = 1 irq_work_queue(); sched-out event_sched_out() @pending_disable = 0 sched-in perf_event_overflow() perf_event_disable_inatomic() @pending_disable = 1; irq_work_queue(); // FAILS irq_work_run() perf_pending_event() if (@pending_disable) perf_event_disable_local(); // WHOOPS The problem exists in generic, but s390 is particularly sensitive because it doesn't implement arch_irq_work_raise(), nor does it call irq_work_run() from it's PMU interrupt handler (nor would that be sufficient in this case, because s390 also generates perf_event_overflow() from pmu::stop). Add to that the fact that s390 is a virtual architecture and (virtual) CPU-A can stall long enough for the above race to happen, even if it would self-IPI. Adding a irq_work_sync() to event_sched_in() would work for all hardare PMUs that properly use irq_work_run() but fails for software PMUs. Instead encode the CPU number in @pending_disable, such that we can tell which CPU requested the disable. This then allows us to detect the above scenario and even redirect the IPI to make up for the failed queue. Reported-by: Thomas-Mich Richter <tmricht@linux.ibm.com> Tested-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Hendrik Brueckner <brueckner@linux.ibm.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-11dma-debug: only skip one stackframe entryScott Wood1-1/+1
With skip set to 1, I get a traceback like this: [ 106.867637] DMA-API: Mapped at: [ 106.870784] afu_dma_map_region+0x2cd/0x4f0 [dfl_afu] [ 106.875839] afu_ioctl+0x258/0x380 [dfl_afu] [ 106.880108] do_vfs_ioctl+0xa9/0x720 [ 106.883688] ksys_ioctl+0x60/0x90 [ 106.887007] __x64_sys_ioctl+0x16/0x20 With the previous value of 2, afu_dma_map_region was being omitted. I suspect that the code paths have simply changed since the value of 2 was chosen a decade ago, but it's also possible that it varies based on which mapping function was used, compiler inlining choices, etc. In any case, it's best to err on the side of skipping less. Signed-off-by: Scott Wood <swood@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-10alarmtimer: Return correct remaining timeAndrei Vagin1-1/+1
To calculate a remaining time, it's required to subtract the current time from the expiration time. In alarm_timer_remaining() the arguments of ktime_sub are swapped. Fixes: d653d8457c76 ("alarmtimer: Implement remaining callback") Signed-off-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Cc: Stephen Boyd <sboyd@kernel.org> Cc: John Stultz <john.stultz@linaro.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190408041542.26338-1-avagin@gmail.com
2019-04-10locking/lockdep: Zap lock classes even with lock debugging disabledBart Van Assche1-17/+12
The following commit: a0b0fd53e1e6 ("locking/lockdep: Free lock classes that are no longer in use") changed the behavior of lockdep_free_key_range() from unconditionally zapping lock classes into only zapping lock classes if debug_lock == true. Not zapping lock classes if debug_lock == false leaves dangling pointers in several lockdep datastructures, e.g. lock_class::name in the all_lock_classes list. The shell command "cat /proc/lockdep" causes the kernel to iterate the all_lock_classes list. Hence the "unable to handle kernel paging request" cash that Shenghui encountered by running cat /proc/lockdep. Since the new behavior can cause cat /proc/lockdep to crash, restore the pre-v5.1 behavior. This patch avoids that cat /proc/lockdep triggers the following crash with debug_lock == false: BUG: unable to handle kernel paging request at fffffbfff40ca448 RIP: 0010:__asan_load1+0x28/0x50 Call Trace: string+0xac/0x180 vsnprintf+0x23e/0x820 seq_vprintf+0x82/0xc0 seq_printf+0x92/0xb0 print_name+0x34/0xb0 l_show+0x184/0x200 seq_read+0x59e/0x6c0 proc_reg_read+0x11f/0x170 __vfs_read+0x4d/0x90 vfs_read+0xc5/0x1f0 ksys_read+0xab/0x130 __x64_sys_read+0x43/0x50 do_syscall_64+0x71/0x210 entry_SYSCALL_64_after_hwframe+0x49/0xbe Reported-by: shenghui <shhuiw@foxmail.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Fixes: a0b0fd53e1e6 ("locking/lockdep: Free lock classes that are no longer in use") # v5.1-rc1. Link: https://lkml.kernel.org/r/20190403233552.124673-1-bvanassche@acm.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-05Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+2
Merge misc fixes from Andrew Morton: "14 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: kernel/sysctl.c: fix out-of-bounds access when setting file-max mm/util.c: fix strndup_user() comment sh: fix multiple function definition build errors MAINTAINERS: add maintainer and replacing reviewer ARM/NUVOTON NPCM MAINTAINERS: fix bad pattern in ARM/NUVOTON NPCM mm: writeback: use exact memcg dirty counts psi: clarify the units used in pressure files mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd() hugetlbfs: fix memory leak for resv_map mm: fix vm_fault_t cast in VM_FAULT_GET_HINDEX() lib/lzo: fix bugs for very short or empty input include/linux/bitrev.h: fix constant bitrev kmemleak: powerpc: skip scanning holes in the .bss section lib/string.c: implement a basic bcmp
2019-04-05kernel/sysctl.c: fix out-of-bounds access when setting file-maxWill Deacon1-1/+2
Commit 32a5ad9c2285 ("sysctl: handle overflow for file-max") hooked up min/max values for the file-max sysctl parameter via the .extra1 and .extra2 fields in the corresponding struct ctl_table entry. Unfortunately, the minimum value points at the global 'zero' variable, which is an int. This results in a KASAN splat when accessed as a long by proc_doulongvec_minmax on 64-bit architectures: | BUG: KASAN: global-out-of-bounds in __do_proc_doulongvec_minmax+0x5d8/0x6a0 | Read of size 8 at addr ffff2000133d1c20 by task systemd/1 | | CPU: 0 PID: 1 Comm: systemd Not tainted 5.1.0-rc3-00012-g40b114779944 #2 | Hardware name: linux,dummy-virt (DT) | Call trace: | dump_backtrace+0x0/0x228 | show_stack+0x14/0x20 | dump_stack+0xe8/0x124 | print_address_description+0x60/0x258 | kasan_report+0x140/0x1a0 | __asan_report_load8_noabort+0x18/0x20 | __do_proc_doulongvec_minmax+0x5d8/0x6a0 | proc_doulongvec_minmax+0x4c/0x78 | proc_sys_call_handler.isra.19+0x144/0x1d8 | proc_sys_write+0x34/0x58 | __vfs_write+0x54/0xe8 | vfs_write+0x124/0x3c0 | ksys_write+0xbc/0x168 | __arm64_sys_write+0x68/0x98 | el0_svc_common+0x100/0x258 | el0_svc_handler+0x48/0xc0 | el0_svc+0x8/0xc | | The buggy address belongs to the variable: | zero+0x0/0x40 | | Memory state around the buggy address: | ffff2000133d1b00: 00 00 00 00 00 00 00 00 fa fa fa fa 04 fa fa fa | ffff2000133d1b80: fa fa fa fa 04 fa fa fa fa fa fa fa 04 fa fa fa | >ffff2000133d1c00: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 00 00 | ^ | ffff2000133d1c80: fa fa fa fa 00 fa fa fa fa fa fa fa 00 00 00 00 | ffff2000133d1d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Fix the splat by introducing a unsigned long 'zero_ul' and using that instead. Link: http://lkml.kernel.org/r/20190403153409.17307-1-will.deacon@arm.com Fixes: 32a5ad9c2285 ("sysctl: handle overflow for file-max") Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Christian Brauner <christian@brauner.io> Cc: Kees Cook <keescook@chromium.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matteo Croce <mcroce@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-05Merge tag 'trace-5.1-rc3' of ↵Linus Torvalds2-4/+7
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull syscall-get-arguments cleanup and fixes from Steven Rostedt: "Andy Lutomirski approached me to tell me that the syscall_get_arguments() implementation in x86 was horrible and gcc certainly gets it wrong. He said that since the tracepoints only pass in 0 and 6 for i and n repectively, it should be optimized for that case. Inspecting the kernel, I discovered that all users pass in 0 for i and only one file passing in something other than 6 for the number of arguments. That code happens to be my own code used for the special syscall tracing. That can easily be converted to just using 0 and 6 as well, and only copying what is needed. Which is probably the faster path anyway for that case. Along the way, a couple of real fixes came from this as the syscall_get_arguments() function was incorrect for csky and riscv. x86 has been optimized to for the new interface that removes the variable number of arguments, but the other architectures could still use some loving and take more advantage of the simpler interface" * tag 'trace-5.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: syscalls: Remove start and number from syscall_set_arguments() args syscalls: Remove start and number from syscall_get_arguments() args csky: Fix syscall_get_arguments() and syscall_set_arguments() riscv: Fix syscall_get_arguments() and syscall_set_arguments() tracing/syscalls: Pass in hardcoded 6 into syscall_get_arguments() ptrace: Remove maxargs from task_current_syscall()
2019-04-05genirq: Respect IRQCHIP_SKIP_SET_WAKE in irq_chip_set_wake_parent()Stephen Boyd1-0/+4
If a child irqchip calls irq_chip_set_wake_parent() but its parent irqchip has the IRQCHIP_SKIP_SET_WAKE flag set an error is returned. This is inconsistent behaviour vs. set_irq_wake_real() which returns 0 when the irqchip has the IRQCHIP_SKIP_SET_WAKE flag set. It doesn't attempt to walk the chain of parents and set irq wake on any chips that don't have the flag set either. If the intent is to call the .irq_set_wake() callback of the parent irqchip, then we expect irqchip implementations to omit the IRQCHIP_SKIP_SET_WAKE flag and implement an .irq_set_wake() function that calls irq_chip_set_wake_parent(). The problem has been observed on a Qualcomm sdm845 device where set wake fails on any GPIO interrupts after applying work in progress wakeup irq patches to the GPIO driver. The chain of chips looks like this: QCOM GPIO -> QCOM PDC (SKIP) -> ARM GIC (SKIP) The GPIO controllers parent is the QCOM PDC irqchip which in turn has ARM GIC as parent. The QCOM PDC irqchip has the IRQCHIP_SKIP_SET_WAKE flag set, and so does the grandparent ARM GIC. The GPIO driver doesn't know if the parent needs to set wake or not, so it unconditionally calls irq_chip_set_wake_parent() causing this function to return a failure because the parent irqchip (PDC) doesn't have the .irq_set_wake() callback set. Returning 0 instead makes everything work and irqs from the GPIO controller can be configured for wakeup. Make it consistent by returning 0 (success) from irq_chip_set_wake_parent() when a parent chip has IRQCHIP_SKIP_SET_WAKE set. [ tglx: Massaged changelog ] Fixes: 08b55e2a9208e ("genirq: Add irqchip_set_wake_parent") Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-gpio@vger.kernel.org Cc: Lina Iyer <ilina@codeaurora.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190325181026.247796-1-swboyd@chromium.org
2019-04-05syscalls: Remove start and number from syscall_get_arguments() argsSteven Rostedt (Red Hat)2-3/+3
At Linux Plumbers, Andy Lutomirski approached me and pointed out that the function call syscall_get_arguments() implemented in x86 was horribly written and not optimized for the standard case of passing in 0 and 6 for the starting index and the number of system calls to get. When looking at all the users of this function, I discovered that all instances pass in only 0 and 6 for these arguments. Instead of having this function handle different cases that are never used, simply rewrite it to return the first 6 arguments of a system call. This should help out the performance of tracing system calls by ptrace, ftrace and perf. Link: http://lkml.kernel.org/r/20161107213233.754809394@goodmis.org Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Dave Martin <dave.martin@arm.com> Cc: "Dmitry V. Levin" <ldv@altlinux.org> Cc: x86@kernel.org Cc: linux-snps-arc@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-c6x-dev@linux-c6x.org Cc: uclinux-h8-devel@lists.sourceforge.jp Cc: linux-hexagon@vger.kernel.org Cc: linux-ia64@vger.kernel.org Cc: linux-mips@vger.kernel.org Cc: nios2-dev@lists.rocketboards.org Cc: openrisc@lists.librecores.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-riscv@lists.infradead.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-um@lists.infradead.org Cc: linux-xtensa@linux-xtensa.org Cc: linux-arch@vger.kernel.org Acked-by: Paul Burton <paul.burton@mips.com> # MIPS parts Acked-by: Max Filippov <jcmvbkbc@gmail.com> # For xtensa changes Acked-by: Will Deacon <will.deacon@arm.com> # For the arm64 bits Reviewed-by: Thomas Gleixner <tglx@linutronix.de> # for x86 Reviewed-by: Dmitry V. Levin <ldv@altlinux.org> Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-05genirq: Initialize request_mutex if CONFIG_SPARSE_IRQ=nKefeng Wang1-0/+1
When CONFIG_SPARSE_IRQ is disable, the request_mutex in struct irq_desc is not initialized which causes malfunction. Fixes: 9114014cf4e6 ("genirq: Add mutex to irq desc to serialize request/free_irq()") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: <linux-arm-kernel@lists.infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190404074512.145533-1-wangkefeng.wang@huawei.com
2019-04-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds3-19/+31
Pull networking fixes from David Miller: 1) Several hash table refcount fixes in batman-adv, from Sven Eckelmann. 2) Use after free in bpf_evict_inode(), from Daniel Borkmann. 3) Fix mdio bus registration in ixgbe, from Ivan Vecera. 4) Unbounded loop in __skb_try_recv_datagram(), from Paolo Abeni. 5) ila rhashtable corruption fix from Herbert Xu. 6) Don't allow upper-devices to be added to vrf devices, from Sabrina Dubroca. 7) Add qmi_wwan device ID for Olicard 600, from Bjørn Mork. 8) Don't leave skb->next poisoned in __netif_receive_skb_list_ptype, from Alexander Lobakin. 9) Missing IDR checks in mlx5 driver, from Aditya Pakki. 10) Fix false connection termination in ktls, from Jakub Kicinski. 11) Work around some ASPM issues with r8169 by disabling rx interrupt coalescing on certain chips. From Heiner Kallweit. 12) Properly use per-cpu qstat values on NOLOCK qdiscs, from Paolo Abeni. 13) Fully initialize sockaddr_in structures in SCTP, from Xin Long. 14) Various BPF flow dissector fixes from Stanislav Fomichev. 15) Divide by zero in act_sample, from Davide Caratti. 16) Fix bridging multicast regression introduced by rhashtable conversion, from Nikolay Aleksandrov. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits) ibmvnic: Fix completion structure initialization ipv6: sit: reset ip header pointer in ipip6_rcv net: bridge: always clear mcast matching struct on reports and leaves libcxgb: fix incorrect ppmax calculation vlan: conditional inclusion of FCoE hooks to match netdevice.h and bnx2x sch_cake: Make sure we can write the IP header before changing DSCP bits sch_cake: Use tc_skb_protocol() helper for getting packet protocol tcp: Ensure DCTCP reacts to losses net/sched: act_sample: fix divide by zero in the traffic path net: thunderx: fix NULL pointer dereference in nicvf_open/nicvf_stop net: hns: Fix sparse: some warnings in HNS drivers net: hns: Fix WARNING when remove HNS driver with SMMU enabled net: hns: fix ICMP6 neighbor solicitation messages discard problem net: hns: Fix probabilistic memory overwrite when HNS driver initialized net: hns: Use NAPI_POLL_WEIGHT for hns driver net: hns: fix KASAN: use-after-free in hns_nic_net_xmit_hw() flow_dissector: rst'ify documentation ipv6: Fix dangling pointer when ipv6 fragment net-gro: Fix GRO flush when receiving a GSO packet. flow_dissector: document BPF flow dissector environment ...
2019-04-04tracing/syscalls: Pass in hardcoded 6 into syscall_get_arguments()Steven Rostedt (Red Hat)1-3/+6
The only users that calls syscall_get_arguments() with a variable and not a hard coded '6' is ftrace_syscall_enter(). syscall_get_arguments() can be optimized by removing a variable input, and always grabbing 6 arguments regardless of what the system call actually uses. Change ftrace_syscall_enter() to pass the 6 args into a local stack array and copy the necessary arguments into the trace event as needed. This is needed to remove two parameters from syscall_get_arguments(). Link: http://lkml.kernel.org/r/20161107213233.627583542@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-03sched/fair: Do not re-read ->h_load_next during hierarchical load calculationMel Gorman1-3/+3
A NULL pointer dereference bug was reported on a distribution kernel but the same issue should be present on mainline kernel. It occured on s390 but should not be arch-specific. A partial oops looks like: Unable to handle kernel pointer dereference in virtual kernel address space ... Call Trace: ... try_to_wake_up+0xfc/0x450 vhost_poll_wakeup+0x3a/0x50 [vhost] __wake_up_common+0xbc/0x178 __wake_up_common_lock+0x9e/0x160 __wake_up_sync_key+0x4e/0x60 sock_def_readable+0x5e/0x98 The bug hits any time between 1 hour to 3 days. The dereference occurs in update_cfs_rq_h_load when accumulating h_load. The problem is that cfq_rq->h_load_next is not protected by any locking and can be updated by parallel calls to task_h_load. Depending on the compiler, code may be generated that re-reads cfq_rq->h_load_next after the check for NULL and then oops when reading se->avg.load_avg. The dissassembly showed that it was possible to reread h_load_next after the check for NULL. While this does not appear to be an issue for later compilers, it's still an accident if the correct code is generated. Full locking in this path would have high overhead so this patch uses READ_ONCE to read h_load_next only once and check for NULL before dereferencing. It was confirmed that there were no further oops after 10 days of testing. As Peter pointed out, it is also necessary to use WRITE_ONCE() to avoid any potential problems with store tearing. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Fixes: 685207963be9 ("sched: Move h_load calculation to task_h_load()") Link: https://lkml.kernel.org/r/20190319123610.nsivgf3mjbjjesxb@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-01signal: don't silently convert SI_USER signals to non-current pidfdJann Horn1-9/+4
The current sys_pidfd_send_signal() silently turns signals with explicit SI_USER context that are sent to non-current tasks into signals with kernel-generated siginfo. This is unlike do_rt_sigqueueinfo(), which returns -EPERM in this case. If a user actually wants to send a signal with kernel-provided siginfo, they can do that with pidfd_send_signal(pidfd, sig, NULL, 0); so allowing this case is unnecessary. Instead of silently replacing the siginfo, just bail out with an error; this is consistent with other interfaces and avoids special-casing behavior based on security checks. Fixes: 3eb39f47934f ("signal: add pidfd_send_signal() syscall") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Christian Brauner <christian@brauner.io>
2019-03-31Merge branch 'smp-urgent-for-linus' of ↵Linus Torvalds1-2/+18
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull CPU hotplug fixes from Thomas Gleixner: "Two SMT/hotplug related fixes: - Prevent crash when HOTPLUG_CPU is disabled and the CPU bringup aborts. This is triggered with the 'nosmt' command line option, but can happen by any abort condition. As the real unplug code is not compiled in, prevent the fail by keeping the CPU in zombie state. - Enforce HOTPLUG_CPU for SMP on x86 to avoid the above situation completely. With 'nosmt' being a popular option it's required to unplug the half brought up sibling CPUs (due to the MCE wreckage) completely" * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/smp: Enforce CONFIG_HOTPLUG_CPU when SMP=y cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n
2019-03-31Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds1-2/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core fixes from Thomas Gleixner: "A small set of core updates: - Make the watchdog respect the selected CPU mask again. That was broken by the rework of the watchdog thread management and caused inconsistent state and NMI watchdog being unstoppable. - Ensure that the objtool build can find the libelf location. - Remove dead kcore stub code" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: watchdog: Respect watchdog cpumask on CPU hotplug objtool: Query pkg-config for libelf location proc/kcore: Remove unused kclist_add_remap()
2019-03-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller3-19/+31
Daniel Borkmann says: ==================== pull-request: bpf 2019-03-29 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Bug fix in BTF deduplication that was mishandling an equivalence comparison, from Andrii. 2) libbpf Makefile fixes to properly link against libelf for the shared object and to actually export AF_XDP's xsk.h header, from Björn. 3) Fix use after free in bpf inode eviction, from Daniel. 4) Fix a bug in skb creation out of cpumap redirect, from Jesper. 5) Remove an unnecessary and triggerable WARN_ONCE() in max number of call stack frames checking in verifier, from Paul. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29xdp: fix cpumap redirect SKB creation bugJesper Dangaard Brouer1-3/+10
We want to avoid leaking pointer info from xdp_frame (that is placed in top of frame) like commit 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page reuse"), and followup commit 97e19cce05e5 ("bpf: reserve xdp_frame size in xdp headroom") that reserve this headroom. These changes also affected how cpumap constructed SKBs, as xdpf->headroom size changed, the skb data starting point were in-effect shifted with 32 bytes (sizeof xdp_frame). This was still okay, as the cpumap frame_size calculation also included xdpf->headroom which were reduced by same amount. A bug was introduced in commit 77ea5f4cbe20 ("bpf/cpumap: make sure frame_size for build_skb is aligned if headroom isn't"), where the xdpf->headroom became part of the SKB_DATA_ALIGN rounding up. This round-up to find the frame_size i