path: root/kernel/time
Age | Commit message | Author | Files | Lines
2 days | alarmtimer: Fix argument order in alarm_timer_forward() | Zhan Xusheng | 1 | -1/+1
commit 5d16467ae56343b9205caedf85e3a131e0914ad8 upstream. alarm_timer_forward() passes arguments to alarm_forward() in the wrong order: alarm_forward(alarm, timr->it_interval, now); However, alarm_forward() is defined as: u64 alarm_forward(struct alarm *alarm, ktime_t now, ktime_t interval); and uses the second argument as the current time: delta = ktime_sub(now, alarm->node.expires); Passing the interval as "now" results in incorrect delta computation, which can lead to missed expirations or incorrect overrun accounting. This issue has been present since the introduction of alarm_timer_forward(). Fix this by swapping the arguments. Fixes: e7561f1633ac ("alarmtimer: Implement forward callback") Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260323061130.29991-1-zhanxusheng@xiaomi.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
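For illustration, the call site sketched with both argument orders (a simplified sketch hedged from the prototypes quoted above, not the verbatim upstream diff):

    static u64 alarm_timer_forward(struct k_itimer *timr, ktime_t now)
    {
            struct alarm *alarm = &timr->it.alarm.alarmtimer;

            /*
             * Buggy: the interval lands in alarm_forward()'s 'now' slot, so
             * delta = ktime_sub(now, alarm->node.expires) is computed against
             * the interval instead of the current time.
             */
            /* return alarm_forward(alarm, timr->it_interval, now); */

            /* Fixed: arguments in the declared order (now, then interval). */
            return alarm_forward(alarm, now, timr->it_interval);
    }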
10 days | time/jiffies: Mark jiffies_64_to_clock_t() notrace | Steven Rostedt | 1 | -1/+1
[ Upstream commit 755a648e78f12574482d4698d877375793867fa1 ] The trace_clock_jiffies() function that handles the "uptime" clock for tracing calls jiffies_64_to_clock_t(). This causes the function tracer to constantly recurse when the tracing clock is set to "uptime". Mark it notrace to prevent unnecessary recursion when using the "uptime" clock. Fixes: 58d4e21e50ff3 ("tracing: Fix wraparound problems in "uptime" trace clock") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260306212403.72270bb2@robin Signed-off-by: Sasha Levin <sashal@kernel.org>
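The change itself is a single annotation; sketched below with the conversion body elided:

    /* Before: visible to the function tracer */
    u64 jiffies_64_to_clock_t(u64 x) { ... }

    /*
     * After: notrace keeps the function tracer out of this helper.
     * trace_clock_jiffies() calls it for every "uptime" timestamp, so an
     * instrumented version would recurse into the tracer on each event.
     */
    u64 notrace jiffies_64_to_clock_t(u64 x) { ... }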
2026-03-04 | hrtimer: Fix trace oddity | Thomas Gleixner | 1 | -1/+1
[ Upstream commit 5d6446f409da00e5a389125ddb5ce09f5bc404c9 ] It turns out that __run_hrtimer() will trace like:

  <idle>-0 [032] d.h2. 20705.474563: hrtimer_cancel: hrtimer=0xff2db8f77f8226e8
  <idle>-0 [032] d.h1. 20705.474563: hrtimer_expire_entry: hrtimer=0xff2db8f77f8226e8 now=20699452001850 function=tick_nohz_handler/0x0

This is a bit nonsensical, as the timer doesn't get canceled on expiration. The cause is the use of the incorrect debug helper. Fixes: c6a2a1770245 ("hrtimer: Add tracepoint for hrtimers") Reported-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260121143208.219595606@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
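A sketch of the kind of change described; the helper names are hedged from the hrtimer code (debug_deactivate() is the variant that also fires the hrtimer_cancel tracepoint), the upstream patch is authoritative:

    static void __run_hrtimer(/* ... */ struct hrtimer *timer /* ... */)
    {
            /* Buggy: also emits trace_hrtimer_cancel() on plain expiry. */
            /* debug_deactivate(timer); */

            /* Fixed: only the debug-object bookkeeping, no cancel event. */
            debug_hrtimer_deactivate(timer);

            /* ... expiry continues, emitting hrtimer_expire_entry ... */
    }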
2026-01-30 | ptp: Add PHC file mode checks. Allow RO adjtime() without FMODE_WRITE. | Wojtek Wasko | 1 | -1/+1
[ Upstream commit b4e53b15c04e3852949003752f48f7a14ae39e86 ] Many devices implement highly accurate clocks, which the kernel manages as PTP Hardware Clocks (PHCs). Userspace applications rely on these clocks to timestamp events, trace workload execution, correlate timescales across devices, and keep various clocks in sync. The kernel’s current implementation of PTP clocks does not enforce file permissions checks for most device operations except for POSIX clock operations, where file mode is verified in the POSIX layer before forwarding the call to the PTP subsystem. Consequently, it is common practice to not give unprivileged userspace applications any access to PTP clocks whatsoever by giving the PTP chardevs 600 permissions. An example of users running into this limitation is documented in [1]. Additionally, the POSIX layer requires WRITE permission even for read-only adjtime() calls, which are used in the PTP layer to return the current frequency offset applied to the PHC. Add permission checks for functions that modify the state of a PTP device. Continue enforcing permission checks for POSIX clock operations (settime, adjtime) in the POSIX layer. Only require WRITE access for dynamic clocks adjtime() if any flags are set in the modes field. [1] https://lists.nwtime.org/sympa/arc/linuxptp-users/2024-01/msg00036.html

Changes in v4:
 - Require FMODE_WRITE in adjtime() only for calls modifying the clock in any way.

Acked-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: Wojtek Wasko <wwasko@nvidia.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
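A sketch of the adjtime() permission logic described above (simplified from the dynamic-clock path in kernel/time/posix-clock.c; locking around the callback elided):

    static int pc_clock_adjtime(clockid_t id, struct __kernel_timex *tx)
    {
            struct posix_clock_desc cd;
            int err = get_clock_desc(id, &cd);

            if (err)
                    return err;

            /*
             * Demand FMODE_WRITE only when the caller modifies the clock;
             * a pure readout (tx->modes == 0) is allowed on a read-only fd.
             */
            if (tx->modes && !(cd.fp->f_mode & FMODE_WRITE)) {
                    err = -EACCES;
                    goto out;
            }

            /* ... forward to the clock's adjtime callback ... */
    out:
            put_clock_desc(&cd);
            return err;
    }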
2026-01-30 | posix-clock: Store file pointer in struct posix_clock_context | Wojtek Wasko | 1 | -0/+1
[ Upstream commit e859d375d1694488015e6804bfeea527a0b25b9f ] File descriptor based pc_clock_*() operations of dynamic posix clocks have access to the file pointer and implement permission checks in the generic code before invoking the relevant dynamic clock callback. Character device operations (open, read, poll, ioctl) do not implement a generic permission control and the dynamic clock callbacks have no access to the file pointer to implement them. Extend struct posix_clock_context with a struct file pointer and initialize it in posix_clock_open(), so that all dynamic clock callbacks can access it. Acked-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Wojtek Wasko <wwasko@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
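The resulting shape of the context structure, hedged from the commit description:

    struct posix_clock_context {
            struct posix_clock *clk;   /* the shared clock device */
            struct file *fp;           /* new: set in posix_clock_open(), so
                                        * chardev callbacks can check f_mode */
            void *private_clkdata;     /* per-open-instance driver data */
    };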
2026-01-30 | Fix memory leak in posix_clock_open() | Linus Torvalds | 1 | -7/+9
[ Upstream commit 5b4cdd9c5676559b8a7c944ac5269b914b8c0bb8 ] If the clk ops.open() function returns an error, we don't release the pccontext we allocated for this clock. Re-organize the code slightly to make it all more obvious. Reported-by: Rohit Keshri <rkeshri@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Fixes: 60c6946675fc ("posix-clock: introduce posix_clock_context concept") Cc: Jakub Kicinski <kuba@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linuxfoundation.org> Stable-dep-of: e859d375d169 ("posix-clock: Store file pointer in struct posix_clock_context") Signed-off-by: Sasha Levin <sashal@kernel.org>
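A sketch of the reorganized open path (simplified; clock lookup, reference counting and locking elided):

    static int posix_clock_open(struct inode *inode, struct file *fp)
    {
            /* ... look up 'clk' for this chardev ... */
            struct posix_clock_context *pccontext;
            int err = 0;

            pccontext = kzalloc(sizeof(*pccontext), GFP_KERNEL);
            if (!pccontext)
                    return -ENOMEM;

            pccontext->clk = clk;
            if (clk->ops.open) {
                    err = clk->ops.open(pccontext, fp->f_mode);
                    if (err) {
                            /* the previously missing cleanup */
                            kfree(pccontext);
                            return err;
                    }
            }
            fp->private_data = pccontext;
            return 0;
    }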
2026-01-30 | posix-clock: introduce posix_clock_context concept | Xabier Marquiegui | 1 | -9/+27
[ Upstream commit 60c6946675fc06dd2fd2b7a4b6fd1c1f046f1056 ] Add the necessary structure to support custom private-data per posix-clock user. The previous implementation of posix-clock assumed all file open instances need access to the same clock structure on private_data. The need for individual data structures per file open instance has been identified when developing support for multiple timestamp event queue users for ptp_clock. Signed-off-by: Xabier Marquiegui <reibax@gmail.com> Suggested-by: Richard Cochran <richardcochran@gmail.com> Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: e859d375d169 ("posix-clock: Store file pointer in struct posix_clock_context") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-30 | hrtimer: Fix softirq base check in update_needs_ipi() | Thomas Weißschuh | 1 | -1/+1
commit 05dc4a9fc8b36d4c99d76bbc02aa9ec0132de4c2 upstream. The 'clockid' field is not the correct way to check for a softirq base. Fix the check to correctly compare the base type instead of the clockid. Fixes: 1e7f7fbcd40c ("hrtimer: Avoid more SMP function calls in clock_was_set()") Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260107-hrtimer-clock-base-check-v1-1-afb5dbce94a1@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
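The shape of the fix, hedged from the hrtimer base layout: clockid values repeat between the hard and soft bases (e.g. CLOCK_MONOTONIC exists in both), while the base index uniquely identifies the softirq bases at the end of the base array:

    /* Buggy shape: a clockid comparison cannot identify a softirq base. */
    /* if (clock_b->clockid == ...) */

    /* Fixed shape: classify by base type/index instead. */
    if (clock_b->index >= HRTIMER_BASE_MONOTONIC_SOFT) {
            /* softirq-expired base */
    }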
2025-12-01 | timers: Fix NULL function pointer race in timer_shutdown_sync() | Yipeng Zou | 1 | -3/+4
commit 20739af07383e6eb1ec59dcd70b72ebfa9ac362c upstream. There is a race condition between timer_shutdown_sync() and timer expiration that can lead to hitting a WARN_ON in expire_timers(). The issue occurs when timer_shutdown_sync() clears the timer function to NULL while the timer is still running on another CPU. The race scenario looks like this:

  CPU0                                            CPU1
                                                  <SOFTIRQ>
                                                  lock_timer_base()
                                                  expire_timers()
                                                    base->running_timer = timer;
                                                  unlock_timer_base()
                                                    [call_timer_fn enter]
                                                      mod_timer()
  ...
  timer_shutdown_sync()
    lock_timer_base()
    // For now, will not detach the timer
    // but only clear its function to NULL
    if (base->running_timer != timer)
      ret = detach_if_pending(timer, base, true);
    if (shutdown)
      timer->function = NULL;
    unlock_timer_base()
                                                    [call_timer_fn exit]
                                                  lock_timer_base()
                                                  base->running_timer = NULL;
                                                  unlock_timer_base()
  ...
  // Now timer is pending while its function set to NULL.
  // next timer trigger
                                                  <SOFTIRQ>
                                                  expire_timers()
                                                    WARN_ON_ONCE(!fn) // hit
  ...
  lock_timer_base()
  // Now timer will detach
  if (base->running_timer != timer)
    ret = detach_if_pending(timer, base, true);
  if (shutdown)
    timer->function = NULL;
  unlock_timer_base()

The problem is that timer_shutdown_sync() clears the timer function regardless of whether the timer is currently running. This can leave a pending timer with a NULL function pointer, which triggers the WARN_ON_ONCE(!fn) check in expire_timers(). Fix this by only clearing the timer function when actually detaching the timer. If the timer is running, leave the function pointer intact, which is safe because the timer will be properly detached when it finishes running. Fixes: 0cc04e80458a ("timers: Add shutdown mechanism to the internal functions") Signed-off-by: Yipeng Zou <zouyipeng@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20251122093942.301559-1-zouyipeng@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
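A sketch of the fixed internal shutdown path (simplified; 'shutdown' is the internal flag that distinguishes timer_shutdown_sync() from timer_delete_sync()):

    if (base->running_timer != timer) {
            ret = detach_if_pending(timer, base, true);
            /*
             * Clear the callback only when the timer was actually detached
             * here. If it is still running on another CPU, timer->function
             * stays intact, so expire_timers() never sees a pending timer
             * with a NULL callback; the final detach happens once the
             * running callback finishes.
             */
            if (shutdown)
                    timer->function = NULL;
    }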
2025-09-19 | hrtimers: Unconditionally update target CPU base after offline timer migration | Xiongfeng Wang | 1 | -8/+3
[ Upstream commit e895f8e29119c8c966ea794af9e9100b10becb88 ] When testing softirq based hrtimers on an ARM32 board, with high resolution mode and NOHZ inactive, softirq based hrtimers fail to expire after being moved away from an offline CPU:

  CPU0                                    CPU1
                                          hrtimer_start(..., HRTIMER_MODE_SOFT);
  cpu_down(CPU1)
    ...
                                          hrtimers_cpu_dying()
                                            // Migrate timers to CPU0
                                            smp_call_function_single(CPU0, retrigger_next_event);
  retrigger_next_event()
    if (!highres && !nohz)
      return;

As retrigger_next_event() is a NOOP when both high resolution timers and NOHZ are inactive, CPU0's hrtimer_cpu_base::softirq_expires_next is not updated and the migrated softirq timers never expire unless there is a softirq based hrtimer queued on CPU0 later. Fix this by removing the hrtimer_hres_active() and tick_nohz_active() check in retrigger_next_event(), which enforces a full update of the CPU base. As this is not a fast path the extra cost does not matter. [ tglx: Massaged change log ] Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Co-developed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250805081025.54235-1-wangxiongfeng2@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
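A sketch of the resulting function (hedged, not the verbatim upstream body; the point is that the early return is gone):

    static void retrigger_next_event(void *arg)
    {
            struct hrtimer_cpu_base *base = this_cpu_ptr(&hrtimer_bases);

            /*
             * No more "if (!hrtimer_hres_active(base) && !tick_nohz_active)
             * return;" -- that shortcut skipped refreshing
             * softirq_expires_next for timers just migrated from an offline
             * CPU. This is not a fast path, so always rebuild the state.
             */
            raw_spin_lock(&base->lock);
            hrtimer_update_base(base);
            hrtimer_force_reprogram(base, 0);
            raw_spin_unlock(&base->lock);
    }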
2025-09-19 | hrtimer: Rename __hrtimer_hres_active() to hrtimer_hres_active() | Jiapeng Chong | 1 | -13/+8
[ Upstream commit b7c8e1f8a7b4352c1d0b4310686385e3cf6c104a ] The wrapper hrtimer_hres_active() is defined in hrtimer.c but not called anywhere else:

  kernel/time/hrtimer.c:653:19: warning: unused function 'hrtimer_hres_active'

Rename __hrtimer_hres_active() to hrtimer_hres_active() and remove the old, unused hrtimer_hres_active() function. Fixes: 82ccdf062a64 ("hrtimer: Remove unused function") Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Link: https://lore.kernel.org/r/20240418023000.130324-1-jiapeng.chong@linux.alibaba.com Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8778 Stable-dep-of: e895f8e29119 ("hrtimers: Unconditionally update target CPU base after offline timer migration") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-19 | hrtimer: Remove unused function | Jiapeng Chong | 1 | -19/+1
[ Upstream commit 82ccdf062a64f3c4ac575c16179ce68edbbbe8e4 ] The function is defined, but not called anywhere:

  kernel/time/hrtimer.c:1880:20: warning: unused function '__hrtimer_peek_ahead_timers'

Remove it. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240322070441.29646-1-jiapeng.chong@linux.alibaba.com Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8611 Stable-dep-of: e895f8e29119 ("hrtimers: Unconditionally update target CPU base after offline timer migration") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27 | clocksource: Fix the CPUs' choice in the watchdog per CPU verification | Guilherme G. Piccoli | 1 | -1/+1
[ Upstream commit 08d7becc1a6b8c936e25d827becabfe3bff72a36 ] Right now, if the clocksource watchdog detects a clocksource skew, it might perform a per CPU check, for example in the TSC case on x86. In other words: supposing TSC is detected as unstable by the clocksource watchdog running at CPU1, as part of marking TSC unstable the kernel will also run a check of TSC readings on some CPUs to be sure it is synced between them all. But that check happens only on some CPUs, not all of them; this choice is based on the parameter "verify_n_cpus" and on some random cpumask calculation. So, the watchdog runs such per CPU checks on up to "verify_n_cpus" random CPUs among all online CPUs, with the risk of repeating CPUs (which then aren't double checked) in the random cpumask calculation. But if "verify_n_cpus" > num_online_cpus(), it should skip the random calculation and just go ahead and check the clocksource sync between all online CPUs, without the risk of skipping some CPUs due to duplicates in the random cpumask calculation. Tests on a 4 CPU laptop with TSC skew detected led to some cases of the per CPU verification skipping some CPU, even with verify_n_cpus=8, due to duplicates in the random cpumask generation. Skipping the randomization when the number of online CPUs is smaller than verify_n_cpus solves that. Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/all/20250323173857.372390-1-gpiccoli@igalia.com Signed-off-by: Sasha Levin <sashal@kernel.org>
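A sketch of the selection logic (simplified from clocksource_verify_choose_cpus()):

    if (num_online_cpus() <= verify_n_cpus) {
            /* Check every online CPU: no sampling, hence no duplicates. */
            cpumask_copy(&cpus_chosen, cpu_online_mask);
            cpumask_clear_cpu(smp_processor_id(), &cpus_chosen);
            return;
    }
    /* Otherwise fall back to picking verify_n_cpus CPUs at random. */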
2025-06-19 | posix-cpu-timers: fix race between handle_posix_cpu_timers() and posix_cpu_timer_del() | Oleg Nesterov | 1 | -0/+9
commit f90fff1e152dedf52b932240ebbd670d83330eca upstream. If an exiting non-autoreaping task has already passed exit_notify() and calls handle_posix_cpu_timers() from IRQ, it can be reaped by its parent or debugger right after unlock_task_sighand(). If a concurrent posix_cpu_timer_del() runs at that moment, it won't be able to detect timer->it.cpu.firing != 0: cpu_timer_task_rcu() and/or lock_task_sighand() will fail. Add the tsk->exit_state check into run_posix_cpu_timers() to fix this. This fix is not needed if CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y, because exit_task_work() is called before exit_notify(). But the check still makes sense, task_work_add(&tsk->posix_cputimers_work.work) will fail anyway in this case. Cc: stable@vger.kernel.org Reported-by: Benoît Sevens <bsevens@google.com> Fixes: 0bdd2ed4138e ("sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
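The fix is a guard at the top of the IRQ-side entry point, sketched from the commit description (surrounding logic elided):

    void run_posix_cpu_timers(void)
    {
            struct task_struct *tsk = current;

            lockdep_assert_irqs_disabled();

            /*
             * A task past exit_notify() can be reaped by its parent right
             * after unlock_task_sighand(); a concurrent posix_cpu_timer_del()
             * would then miss it.cpu.firing because cpu_timer_task_rcu()
             * and/or lock_task_sighand() fail.
             */
            if (tsk->exit_state)
                    return;

            /* ... proceed with expiry handling ... */
    }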
2025-06-04 | hrtimers: Force migrate away hrtimers queued after CPUHP_AP_HRTIMERS_DYING | Frederic Weisbecker | 1 | -21/+82
commit 53dac345395c0d2493cbc2f4c85fe38aef5b63f5 upstream. hrtimers are migrated away from the dying CPU to any online target at the CPUHP_AP_HRTIMERS_DYING stage in order not to delay bandwidth timers handling tasks involved in the CPU hotplug forward progress. However wakeups can still be performed by the outgoing CPU after CPUHP_AP_HRTIMERS_DYING. Those can result again in bandwidth timers being armed. Depending on several considerations (crystal ball power management based election, earliest timer already enqueued, timer migration enabled or not), the target may eventually be the current CPU even if offline. If that happens, the timer is eventually ignored. The most notable example is RCU which had to deal with each and every of those wake-ups by deferring them to an online CPU, along with related workarounds:

 _ e787644caf76 (rcu: Defer RCU kthreads wakeup when CPU is dying)
 _ 9139f93209d1 (rcu/nocb: Fix RT throttling hrtimer armed from offline CPU)
 _ f7345ccc62a4 (rcu/nocb: Fix rcuog wake-up from offline softirq)

The problem isn't confined to RCU though as the stop machine kthread (which runs CPUHP_AP_HRTIMERS_DYING) reports its completion at the end of its work through cpu_stop_signal_done() and performs a wake up that eventually arms the deadline server timer:

  WARNING: CPU: 94 PID: 588 at kernel/time/hrtimer.c:1086 hrtimer_start_range_ns+0x289/0x2d0
  CPU: 94 UID: 0 PID: 588 Comm: migration/94 Not tainted
  Stopper: multi_cpu_stop+0x0/0x120 <- stop_machine_cpuslocked+0x66/0xc0
  RIP: 0010:hrtimer_start_range_ns+0x289/0x2d0
  Call Trace:
  <TASK>
    start_dl_timer
    enqueue_dl_entity
    dl_server_start
    enqueue_task_fair
    enqueue_task
    ttwu_do_activate
    try_to_wake_up
    complete
    cpu_stopper_thread

Instead of providing yet another bandaid to work around the situation, fix it in the hrtimers infrastructure instead: always migrate away a timer to an online target whenever it is enqueued from an offline CPU. This will also allow to revert all the above RCU disgraceful hacks. Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Reported-by: Vlad Poenaru <vlad.wing@gmail.com> Reported-by: Usama Arif <usamaarif642@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Tested-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/all/20250117232433.24027-1-frederic@kernel.org Closes: 20241213203739.1519801-1-usamaarif642@gmail.com Signed-off-by: Zhaoyang Li <lizy04@hust.edu.cn> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-04 | timer_list: Don't use %pK through printk() | Thomas Weißschuh | 1 | -2/+2
[ Upstream commit a52067c24ccf6ee4c85acffa0f155e9714f9adce ] This reverts commit f590308536db ("timer debug: Hide kernel addresses via %pK in /proc/timer_list"). The timer list helper SEQ_printf() uses either the real seq_printf() for procfs output or vprintk() to print to the kernel log, when invoked from SysRq-q. It uses %pK for printing pointers. In the past %pK was preferred over %p as it would not leak raw pointer values into the kernel log. Since commit ad67b74d2469 ("printk: hash addresses printed with %p") the regular %p has been improved to avoid this issue. Furthermore, restricted pointers ("%pK") were never meant to be used through printk(). They can still unintentionally leak raw pointers or acquire sleeping locks in atomic contexts. Switch to the regular pointer formatting which is safer, easier to reason about and sufficient here. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20250113171731-dc10e3c1-da64-4af0-b767-7c7070468023@linutronix.de/ Link: https://lore.kernel.org/all/20250311-restricted-pointers-timer-v1-1-6626b91e54ab@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
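The per-line change is mechanical; an illustrative line in the style of timer_list.c:

    /* Before: restricted pointers, unsafe through printk() */
    SEQ_printf(m, "  .base:       %pK\n", base);

    /* After: plain %p, hashed since commit ad67b74d2469 */
    SEQ_printf(m, "  .base:       %p\n", base);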
2025-06-04 | posix-timers: Add cond_resched() to posix_timer_add() search loop | Eric Dumazet | 1 | -0/+1
[ Upstream commit 5f2909c6cd13564a07ae692a95457f52295c4f22 ] With a large number of POSIX timers the search for a valid ID might cause a soft lockup on PREEMPT_NONE/VOLUNTARY kernels. Add cond_resched() to the loop to prevent that. [ tglx: Split out from Eric's series ] Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250214135911.2037402-2-edumazet@google.com Link: https://lore.kernel.org/all/20250308155623.635612865@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
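A sketch of the loop with the added call (structure simplified; try_claim_next_timer_id() is an illustrative stand-in for the hash-insert attempt):

    static int posix_timer_add(struct k_itimer *timer)
    {
            int ret;

            do {
                    spin_lock(&hash_lock);
                    ret = try_claim_next_timer_id(timer); /* illustrative name */
                    spin_unlock(&hash_lock);

                    /*
                     * A long search over a densely used ID space must not
                     * hog the CPU on PREEMPT_NONE/VOLUNTARY kernels.
                     */
                    cond_resched();
            } while (ret == -ENOENT);

            return ret;
    }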
2025-05-02 | timekeeping: Add a lockdep override in tick_freeze() | Sebastian Andrzej Siewior | 1 | -0/+22
[ Upstream commit 92e250c624ea37fde64bfd624fd2556f0d846f18 ] tick_freeze() acquires a raw spinlock (tick_freeze_lock). Later in the callchain (timekeeping_suspend() -> mc146818_avoid_UIP()) the RTC driver acquires a spinlock which becomes a sleeping lock on PREEMPT_RT. Lockdep complains about this lock nesting. Add a lockdep override for this special case and a comment explaining why it is okay. Reported-by: Borislav Petkov <bp@alien8.de> Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20250404133429.pnAzf-eF@linutronix.de Closes: https://lore.kernel.org/all/20250330113202.GAZ-krsjAnurOlTcp-@fat_crate.local/ Closes: https://lore.kernel.org/all/CAP-bSRZ0CWyZZsMtx046YV8L28LhY0fson2g4EqcwRAVN1Jk+Q@mail.gmail.com/ Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-22 | hrtimers: Mark is_migration_base() with __always_inline | Andy Shevchenko | 1 | -10/+12
[ Upstream commit 27af31e44949fa85550176520ef7086a0d00fd7b ] When is_migration_base() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y:

  kernel/time/hrtimer.c:156:20: error: unused function 'is_migration_base' [-Werror,-Wunused-function]
    156 | static inline bool is_migration_base(struct hrtimer_clock_base *base)
        |                    ^~~~~~~~~~~~~~~~~

Fix this by marking it with __always_inline. [ tglx: Use __always_inline instead of __maybe_unused and move it into the usage sites conditional ] Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250116160745.243358-1-andriy.shevchenko@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
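The fixed helper (SMP variant; body as in hrtimer.c):

    static __always_inline bool is_migration_base(struct hrtimer_clock_base *base)
    {
            return base == &migration_base;
    }

__always_inline guarantees the symbol never survives as an unreferenced static function, so clang's -Wunused-function stays quiet in configurations that do not call it.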
2025-03-22 | hrtimer: Use and report correct timerslack values for realtime tasks | Felix Moessbauer | 1 | -15/+3
commit ed4fb6d7ef68111bb539283561953e5c6e9a6e38 upstream. The timerslack_ns setting is used to specify how much the hardware timers should be delayed, to potentially dispatch multiple timers in a single interrupt. This is a performance optimization. Timers of realtime tasks (having a realtime scheduling policy) should not be delayed. This logic was inconsistently applied to the hrtimers, leading to delays of realtime tasks which used timed waits for events (e.g. condition variables). Due to the downstream override of the slack for rt tasks, the procfs reported incorrect (non-zero) timerslack_ns values. This is changed by setting the timer_slack_ns task attribute to 0 for all tasks with a rt policy. By that, downstream users do not need to specially handle rt tasks (w.r.t. the slack), and the procfs entry shows the correct value of "0". Setting non-zero slack values (either via procfs or PR_SET_TIMERSLACK) on tasks with a rt policy is ignored, as stated in "man 2 PR_SET_TIMERSLACK": Timer slack is not applied to threads that are scheduled under a real-time scheduling policy (see sched_setscheduler(2)). The special handling of timerslack on rt tasks in downstream users is removed as well. Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240814121032.368444-2-felix.moessbauer@siemens.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
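A sketch of the idea; the hook lives in the scheduler's policy-setting path and the helper name here is purely illustrative:

    /* illustrative helper, not an upstream function name */
    static void normalize_timer_slack(struct task_struct *p)
    {
            /*
             * RT tasks must never have their timers delayed. Forcing the
             * task attribute itself to 0 makes procfs report the true value
             * and lets downstream slack users drop their rt special-casing.
             */
            if (rt_policy(p->policy) || dl_policy(p->policy))
                    p->timer_slack_ns = 0;
    }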
2025-02-21 | clocksource: Use migrate_disable() to avoid calling get_random_u32() in atomic context | Waiman Long | 1 | -2/+4
[ Upstream commit 6bb05a33337b2c842373857b63de5c9bf1ae2a09 ] The following bug report happened with a PREEMPT_RT kernel:

  BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 2012, name: kwatchdog
  preempt_count: 1, expected: 0
  RCU nest depth: 0, expected: 0
  get_random_u32+0x4f/0x110
  clocksource_verify_choose_cpus+0xab/0x1a0
  clocksource_verify_percpu.part.0+0x6b/0x330
  clocksource_watchdog_kthread+0x193/0x1a0

It is due to the fact that clocksource_verify_choose_cpus() is invoked with preemption disabled. This function invokes get_random_u32() to obtain random numbers for choosing CPUs. The batched_entropy_32 local lock and/or the base_crng.lock spinlock in driver/char/random.c will be acquired during the call. In a PREEMPT_RT kernel, they are both sleeping locks and so cannot be acquired in atomic context. Fix this problem by using migrate_disable() to allow smp_processor_id() to be reliably used without introducing atomic context. preempt_disable() is then called after clocksource_verify_choose_cpus() but before the clocksource measurement is being run to avoid introducing unexpected latency. Fixes: 7560c02bdffb ("clocksource: Check per-CPU clock synchronization when marked unstable") Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/all/20250131173323.891943-2-longman@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
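A sketch of the reordering in clocksource_verify_percpu() (simplified):

    migrate_disable();                 /* smp_processor_id() stays valid... */
    clocksource_verify_choose_cpus();  /* ...while get_random_u32() may take
                                        * sleeping locks on PREEMPT_RT */

    preempt_disable();                 /* only the measurement needs this */
    /* ... read and compare the clocksource on the chosen CPUs ... */
    preempt_enable();

    migrate_enable();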
2025-02-21 | clocksource: Use pr_info() for "Checking clocksource synchronization" message | Waiman Long | 1 | -1/+2
[ Upstream commit 1f566840a82982141f94086061927a90e79440e5 ] The "Checking clocksource synchronization" message is normally printed when clocksource_verify_percpu() is called for a given clocksource if both the CLOCK_SOURCE_UNSTABLE and CLOCK_SOURCE_VERIFY_PERCPU flags are set. It is an informational message and so pr_info() is the correct choice. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250125015442.3740588-1-longman@redhat.com Stable-dep-of: 6bb05a33337b ("clocksource: Use migrate_disable() to avoid calling get_random_u32() in atomic context") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-23 | hrtimers: Handle CPU state correctly on hotplug | Koichiro Den | 1 | -1/+10
commit 2f8dea1692eef2b7ba6a256246ed82c365fdc686 upstream. Consider a scenario where a CPU transitions from CPUHP_ONLINE to halfway through a CPU hotunplug down to CPUHP_HRTIMERS_PREPARE, and then back to CPUHP_ONLINE: Since hrtimers_prepare_cpu() does not run, cpu_base.hres_active remains set to 1 throughout. However, during a CPU unplug operation, the tick and the clockevents are shut down at CPUHP_AP_TICK_DYING. On return to the online state, for instance CFS incorrectly assumes that the hrtick is already active, and the chance of the clockevent device to transition to oneshot mode is also lost forever for the CPU, unless it goes back to a lower state than CPUHP_HRTIMERS_PREPARE once. This round-trip reveals another issue; cpu_base.online is not set to 1 after the transition, which appears as a WARN_ON_ONCE in enqueue_hrtimer(). Aside of that, the bulk of the per CPU state is not reset either, which means there are dangling pointers in the worst case. Address this by adding a corresponding startup() callback, which resets the stale per CPU state and sets the online flag. [ tglx: Make the new callback unconditionally available, remove the online modification in the prepare() callback and clear the remaining state in the starting callback instead of the prepare callback ] Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Signed-off-by: Koichiro Den <koichiro.den@canonical.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20241220134421.3809834-1-koichiro.den@canonical.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-14 | ntp: Remove invalid cast in time offset math | Marcelo Dalmas | 1 | -1/+1
commit f5807b0606da7ac7c1b74a386b22134ec7702d05 upstream. Due to an unsigned cast, adjtimex() returns the wrong offset when using ADJ_MICRO and the offset is negative. In this case a small negative offset returns approximately 4.29 seconds (~ 2^32/1000 milliseconds) due to the unsigned cast of the negative offset. This cast was added when the kernel internal struct timex was changed to use type long long for the time offset value to address the problem of a 64bit/32bit division on 32bit systems. The correct cast would have been (s32), which is correct as time_offset can only be in the range of [INT_MIN..INT_MAX] because the shift constant used for calculating it is 32. But that's non-obvious. Remove the cast and use div_s64() to cure the issue. [ tglx: Fix white space damage, use div_s64() and amend the change log ] Fixes: ead25417f82e ("timex: use __kernel_timex internally") Signed-off-by: Marcelo Dalmas <marcelo.dalmas@ge.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/SJ0P101MB03687BF7D5A10FD3C49C51E5F42E2@SJ0P101MB0368.NAMP101.PROD.OUTLOOK.COM Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
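The shape of the change, hedged from the commit description (variable names assumed from the adjtimex() readout path):

    if (!(time_status & STA_NANO)) {
            /* Before: the unsigned cast wraps a small negative offset
             * to ~2^32 microseconds. */
            /* txc->offset = (u32)txc->offset / NSEC_PER_USEC; */

            /* After: signed 64-bit division keeps negatives negative. */
            txc->offset = div_s64(txc->offset, NSEC_PER_USEC);
    }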
2024-12-09 | time: Fix references to _msecs_to_jiffies() handling of values | Miguel Ojeda | 1 | -1/+1
[ Upstream commit 92b043fd995a63a57aae29ff85a39b6f30cd440c ] The details about the handling of the "normal" values were moved to the _msecs_to_jiffies() helpers in commit ca42aaf0c861 ("time: Refactor msecs_to_jiffies"). However, the same commit still mentioned __msecs_to_jiffies() in the added documentation. Thus point to _msecs_to_jiffies() instead. Fixes: ca42aaf0c861 ("time: Refactor msecs_to_jiffies") Signed-off-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241025110141.157205-2-ojeda@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-09 | time: Partially revert cleanup on msecs_to_jiffies() documentation | Miguel Ojeda | 1 | -1/+1
[ Upstream commit b05aefc1f5886c8aece650c9c1639c87b976191a ] The documentation's intention is to compare msecs_to_jiffies() (first sentence) with __msecs_to_jiffies() (second sentence), which is what the original documentation did. One of the cleanups in commit f3cb80804b82 ("time: Fix various kernel-doc problems") may have thought the paragraph was talking about the latter since that is what it is being documented. Thus revert that part of the change. Fixes: f3cb80804b82 ("time: Fix various kernel-doc problems") Signed-off-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241025110141.157205-1-ojeda@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-01 | posix-clock: posix-clock: Fix unbalanced locking in pc_clock_settime() | Jinjie Ruan | 1 | -3/+3
[ Upstream commit 6e62807c7fbb3c758d233018caf94dfea9c65dbd ] If get_clock_desc() succeeds, it calls fget() for the clockid's fd and takes the clk->rwsem read lock, so the error path must release the lock to keep the locking balanced and fput() the clockid's fd to keep the refcount balanced and release the fd related resources. However, the commit below left the error path locked, resulting in unbalanced locking. Check timespec64_valid_strict() before get_clock_desc() to fix it, because the "ts" is not changed after that. Fixes: d8794ac20a29 ("posix-clock: Fix missing timespec64 check in pc_clock_settime()") Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Anna-Maria Behnsen <anna-maria@linutronix.de> [pabeni@redhat.com: fixed commit message typo] Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
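The reordered function, sketched (simplified from kernel/time/posix-clock.c):

    static int pc_clock_settime(clockid_t id, const struct timespec64 *ts)
    {
            struct posix_clock_desc cd;
            int err;

            /*
             * Validate first: "ts" cannot change afterwards, and failing
             * before get_clock_desc() leaves no lock or refcount to unwind.
             */
            if (!timespec64_valid_strict(ts))
                    return -EINVAL;

            err = get_clock_desc(id, &cd);
            if (err)
                    return err;

            /* ... invoke the clock's settime callback under clk->rwsem ... */

            put_clock_desc(&cd);   /* balanced fput() + rwsem release */
            return err;
    }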
2024-10-22 | posix-clock: Fix missing timespec64 check in pc_clock_settime() | Jinjie Ruan | 1 | -0/+3
commit d8794ac20a299b647ba9958f6d657051fc51a540 upstream. As Andrew pointed out, it makes sense for the PTP core to check the timespec64 struct's tv_sec and tv_nsec range before calling ptp->info->settime64(). As the clock_settime() man page says, if tp.tv_sec is negative or tp.tv_nsec is outside the range [0..999,999,999], it should return EINVAL; this includes dynamic clocks, which handle PTP clocks, and the condition is consistent with timespec64_valid(). As Thomas suggested, timespec64_valid() only checks that the timespec is valid, but does not ensure that the time is in a valid range, so check it ahead using timespec64_valid_strict() in pc_clock_settime() and return -EINVAL if not valid. There are some drivers that use tp->tv_sec and tp->tv_nsec directly to write registers without validity checks and assume that the higher layer has checked them, which is dangerous and will benefit from this, such as hclge_ptp_settime(), igb_ptp_settime_i210() and _rcar_gen4_ptp_settime(); some drivers can now remove their own checks. Cc: stable@vger.kernel.org Fixes: 0606f422b453 ("posix clocks: Introduce dynamic clocks") Acked-by: Richard Cochran <richardcochran@gmail.com> Suggested-by: Andrew Lunn <andrew@lunn.ch> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Link: https://patch.msgid.link/20241009072302.1754567-2-ruanjinjie@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-29 | hrtimer: Prevent queuing of hrtimer without a function callback | Phil Chang | 1 | -0/+2
[ Upstream commit 5a830bbce3af16833fe0092dec47b6dd30279825 ] The hrtimer function callback must not be NULL. It has to be specified by the call side but it is not validated by the hrtimer code. When a hrtimer is queued without a function callback, the kernel crashes with a null pointer dereference when trying to execute the callback in __run_hrtimer(). Introduce a validation before queuing the hrtimer in hrtimer_start_range_ns(). [anna-maria: Rephrase commit message] Signed-off-by: Phil Chang <phil.chang@mediatek.com> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
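The added guard, sketched (placement simplified):

    void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
                                u64 delta_ns, const enum hrtimer_mode mode)
    {
            /*
             * Refuse to queue a timer whose expiry would dereference a
             * NULL callback in __run_hrtimer().
             */
            if (WARN_ON_ONCE(!timer->function))
                    return;

            /* ... lock the base and enqueue as before ... */
    }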
2024-08-29 | clocksource: Make watchdog and suspend-timing multiplication overflow safe | Adrian Hunter | 1 | -22/+20
[ Upstream commit d0304569fb019d1bcfbbbce1ce6df6b96f04079b ] Kernel timekeeping is designed to keep the change in cycles (since the last timer interrupt) below max_cycles, which prevents multiplication overflow when converting cycles to nanoseconds. However, if timer interrupts stop, the clocksource_cyc2ns() calculation will eventually overflow. Add protection against that. Simplify by folding together clocksource_delta() and clocksource_cyc2ns() into cycles_to_nsec_safe(). Check against max_cycles, falling back to a slower higher precision calculation. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240325064023.2997-20-adrian.hunter@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
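The folded helper, sketched from the commit description (mul_u64_u32_shr() uses a 128-bit intermediate, so the slow path cannot overflow):

    static u64 cycles_to_nsec_safe(struct clocksource *cs, u64 start, u64 end)
    {
            u64 delta = clocksource_delta(end, start, cs->mask);

            /* Fast path: the 64-bit multiply is known not to overflow. */
            if (likely(delta < cs->max_cycles))
                    return clocksource_cyc2ns(delta, cs->mult, cs->shift);

            /* Slow path: higher precision, overflow safe. */
            return mul_u64_u32_shr(delta, cs->mult, cs->shift);
    }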
2024-08-29 | tick: Move got_idle_tick away from common flags | Frederic Weisbecker | 1 | -1/+1
[ Upstream commit 3ce74f1a8566dbbc9774f85fb0ce781fe290fd32 ] tick_nohz_idle_got_tick() is called by cpuidle_reflect() within the idle loop with interrupts enabled. This function modifies the struct tick_sched's bitfield "got_idle_tick". However this bitfield is stored within the same mask as other bitfields that can be modified from interrupts. Fortunately, so far it looks like the only race that can happen is the following: while ->got_idle_tick is being written to 0, an interrupt fires and writes the ->idle_active field to 0. It's then possible that the interrupted write to ->got_idle_tick writes back the old value of ->idle_active back to 1. However if that happens, the worst possible outcome is that the time spent between that interrupt and the upcoming call to tick_nohz_idle_exit() is accounted as idle, which is a negligible quantity. Still, all the bitfield writes within this struct tick_sched's shadow mask should be IRQ-safe. Therefore move this bitfield out to its own storage to avoid further surprises. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-12-frederic@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
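A sketch of the layout change (field set abridged; the point is the separate storage unit):

    struct tick_sched {
            /*
             * got_idle_tick gets its own word: bitfield updates are
             * read-modify-write on the shared storage unit, so an IRQ
             * writing a neighbouring flag could clobber it otherwise.
             */
            unsigned int got_idle_tick : 1;

            /* The flags modified from IRQ context stay in another word. */
            unsigned int idle_active   : 1;
            /* ... remaining bitfield flags ... */
    };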
2024-08-29 | hrtimer: Select housekeeping CPU during migration | Costa Shulyupin | 1 | -1/+2
[ Upstream commit 56c2cb10120894be40c40a9bf0ce798da14c50f6 ] During CPU-down hotplug, hrtimers may migrate to isolated CPUs, compromising CPU isolation. Address this issue by masking valid CPUs for hrtimers using housekeeping_cpumask(HK_TYPE_TIMER). Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Costa Shulyupin <costa.shul@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20240222200856.569036-1-costa.shul@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
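The migration-target choice, sketched (simplified from the hotplug path):

    static int hrtimers_cpu_dying(unsigned int dying_cpu)
    {
            /* Before: first active CPU, possibly an isolated one. */
            /* int ncpu = cpumask_first(cpu_active_mask); */

            /* After: constrain the target to housekeeping CPUs. */
            int ncpu = cpumask_any_and(cpu_active_mask,
                                       housekeeping_cpumask(HK_TYPE_TIMER));

            /* ... migrate all clock bases from dying_cpu to ncpu ... */
    }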
2024-08-14 | timekeeping: Fix bogus clock_was_set() invocation in do_adjtimex() | Thomas Gleixner | 1 | -1/+1
commit 5916be8a53de6401871bdd953f6c60237b47d6d3 upstream. The addition of the bases argument to clock_was_set() fixed up all call sites correctly except for do_adjtimex(). This uses CLOCK_REALTIME instead of CLOCK_SET_WALL as argument. CLOCK_REALTIME is 0. As a result the effect of that clock_was_set() notification is incomplete and might result in timers expiring late because the hrtimer code does not re-evaluate the affected clock bases. Use CLOCK_SET_WALL instead of CLOCK_REALTIME to tell the hrtimers code which clock bases need to be re-evaluated. Fixes: 17a1b8826b45 ("hrtimer: Add bases argument to clock_was_set()") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/877ccx7igo.ffs@tglx Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
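The one-line fix, sketched:

    /* do_adjtimex(): CLOCK_REALTIME is 0, so the notification named no
     * clock bases and the hrtimer code skipped re-evaluation. */
    /* clock_was_set(CLOCK_REALTIME); */
    clock_was_set(CLOCK_SET_WALL);   /* re-evaluate the wall clock bases */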
2024-08-14 | ntp: Safeguard against time_constant overflow | Justin Stitt | 1 | -3/+2
commit 06c03c8edce333b9ad9c6b207d93d3a5ae7c10c0 upstream. Using syzkaller with the recently reintroduced signed integer overflow sanitizer produces this UBSAN report:

  UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:738:18
  9223372036854775806 + 4 cannot be represented in type 'long'
  Call Trace:
   handle_overflow+0x171/0x1b0
   __do_adjtimex+0x1236/0x1440
   do_adjtimex+0x2be/0x740

The user supplied time_constant value is incremented by four and then clamped to the operating range. Before commit eea83d896e31 ("ntp: NTP4 user space bits update") the user supplied value was sanity checked to be in the operating range. That change removed the sanity check and relied on clamping after incrementing which does not work correctly when the user supplied value is in the overflow zone of the '+ 4' operation. The operation requires CAP_SYS_TIME and the side effect of the overflow is NTP getting out of sync. Similar to the fixups for time_maxerror and time_esterror, clamp the user space supplied value to the operating range. [ tglx: Switch to clamping ] Fixes: eea83d896e31 ("ntp: NTP4 user space bits update") Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20240517-b4-sio-ntp-c-v2-1-f3a80096f36f@google.com Closes: https://github.com/KSPP/linux/issues/352 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
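The clamping, sketched from the commit description:

    if (txc->modes & ADJ_TIMECONST) {
            /* Clamp before the '+ 4' so the increment cannot overflow. */
            time_constant = clamp(txc->constant, 0, MAXTC);
            if (!(time_status & STA_NANO))
                    time_constant += 4;
            time_constant = clamp(time_constant, 0, MAXTC);
    }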
2024-08-14 | clocksource: Fix brown-bag boolean thinko in cs_watchdog_read() | Paul E. McKenney | 1 | -1/+1
[ Upstream commit f2655ac2c06a15558e51ed6529de280e1553c86e ] The current "nretries > 1 || nretries >= max_retries" check in cs_watchdog_read() will always evaluate to true, and thus pr_warn(), if nretries is greater than 1. The intent is instead to never warn on the first try, but otherwise warn if the successful retry was the last retry. Therefore, change that "||" to "&&". Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected") Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20240802154618.4149953-2-paulmck@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
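The one-character fix in context (warning text approximate):

    /* Before: true for every nretries > 1, so it warned on any retry. */
    /* if (nretries > 1 || nretries >= max_retries) */

    /* After: warn only when success came on the final permitted retry. */
    if (nretries > 1 && nretries >= max_retries)
            pr_warn("timekeeping watchdog on CPU%d: %s read retried %d times\n",
                    smp_processor_id(), watchdog->name, nretries);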
2024-08-14 | clocksource: Scale the watchdog read retries automatically | Feng Tang | 2 | -12/+11
[ Upstream commit 2ed08e4bc53298db3f87b528cd804cb0cce066a9 ] On an 8-socket server the TSC is wrongly marked as 'unstable' and disabled during boot time on about one out of 120 boot attempts:

  clocksource: timekeeping watchdog on CPU227: wd-tsc-wd excessive read-back delay of 153560ns vs. limit of 125000ns, wd-wd read-back delay only 11440ns, attempt 3, marking tsc unstable
  tsc: Marking TSC unstable due to clocksource watchdog
  TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
  sched_clock: Marking unstable (119294969739, 159204297)<-(125446229205, -5992055152)
  clocksource: Checking clocksource tsc synchronization from CPU 319 to CPUs 0,99,136,180,210,542,601,896.
  clocksource: Switched to clocksource hpet

The reason is that for platforms with a large number of CPUs, there are sporadic big or huge read latencies while reading the watchdog/clocksource during boot or when the system is under stress work load, and the frequency and maximum value of the latency goes up with the number of online CPUs. The current code already has logic to detect and filter such high latency cases by reading the watchdog twice and checking the two deltas. Due to the randomness of the latency, there is a low probability that the first delta (latency) is big, but the second delta is small and looks valid. The watchdog code retries the readouts by default twice, which is not necessarily sufficient for systems with a large number of CPUs. There is a command line parameter 'max_cswd_read_retries' which allows to increase the number of retries, but that's not user friendly as it needs to be tweaked per system. As the number of required retries is proportional to the number of online CPUs, this parameter can be calculated at runtime. Scale and enlarge the number of retries according to the number of online CPUs and remove the command line parameter completely. [ tglx: Massaged change log and comments ] Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jin Wang <jin1.wang@intel.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20240221060859.1027450-1-feng.tang@intel.com Stable-dep-of: f2655ac2c06a ("clocksource: Fix brown-bag boolean thinko in cs_watchdog_read()") Signed-off-by: Sasha Levin <sashal@kernel.org>
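A sketch of the runtime scaling; the helper name and exact formula below are illustrative only, the upstream commit is authoritative:

    /* illustrative: retries grow roughly with log2 of the CPU count */
    static inline long cs_watchdog_max_retries(void)
    {
            return 2 + ilog2(num_online_cpus());
    }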
2024-08-14 | ntp: Clamp maxerror and esterror to operating range | Justin Stitt | 1 | -2/+2
[ Upstream commit 87d571d6fb77ec342a985afa8744bb9bb75b3622 ] Using syzkaller alongside the newly reintroduced signed integer overflow sanitizer spits out this report:

  UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:461:16
  9223372036854775807 + 500 cannot be represented in type 'long'
  Call Trace:
   handle_overflow+0x171/0x1b0
   second_overflow+0x2d6/0x500
   accumulate_nsecs_to_secs+0x60/0x160
   timekeeping_advance+0x1fe/0x890
   update_wall_time+0x10/0x30

time_maxerror is unconditionally incremented and the result is checked against NTP_PHASE_LIMIT, but the increment itself can overflow, resulting in wrap-around to negative space. Before commit eea83d896e31 ("ntp: NTP4 user space bits update") the user supplied value was sanity checked to be in the operating range. That change removed the sanity check and relied on clamping in handle_overflow() which does not work correctly when the user supplied value is in the overflow zone of the '+ 500' operation. The operation requires CAP_SYS_TIME and the side effect of the overflow is NTP getting out of sync. Miroslav confirmed that the input value should be clamped to the operating range and the same applies to time_esterror. The latter is not used by the kernel, but the value still should be in the operating range, as it was before the sanity check was removed.
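The clamping described above, sketched:

    if (txc->modes & ADJ_MAXERROR)
            time_maxerror = clamp(txc->maxerror, 0, NTP_PHASE_LIMIT);

    if (txc->modes & ADJ_ESTERROR)
            time_esterror = clamp(txc->esterror, 0, NTP_PHASE_LIMIT);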