From 2b144498350860b6ee9dc57ff27a93ad488de5dc Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Thu, 9 Feb 2012 14:56:42 +0530 Subject: uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints Add uprobes support to the core kernel, with x86 support. This commit adds the kernel facilities, the actual uprobes user-space ABI and perf probe support comes in later commits. General design: Uprobes are maintained in an rb-tree indexed by inode and offset (the offset here is from the start of the mapping). For a unique (inode, offset) tuple, there can be at most one uprobe in the rb-tree. Since the (inode, offset) tuple identifies a unique uprobe, more than one user may be interested in the same uprobe. This provides the ability to connect multiple 'consumers' to the same uprobe. Each consumer defines a handler and a filter (optional). The 'handler' is run every time the uprobe is hit, if it matches the 'filter' criteria. The first consumer of a uprobe causes the breakpoint to be inserted at the specified address and subsequent consumers are appended to this list. On subsequent probes, the consumer gets appended to the existing list of consumers. The breakpoint is removed when the last consumer unregisters. For all other unregisterations, the consumer is removed from the list of consumers. Given a inode, we get a list of the mms that have mapped the inode. Do the actual registration if mm maps the page where a probe needs to be inserted/removed. We use a temporary list to walk through the vmas that map the inode. - The number of maps that map the inode, is not known before we walk the rmap and keeps changing. - extending vm_area_struct wasn't recommended, it's a size-critical data structure. - There can be more than one maps of the inode in the same mm. We add callbacks to the mmap methods to keep an eye on text vmas that are of interest to uprobes. When a vma of interest is mapped, we insert the breakpoint at the right address. Uprobe works by replacing the instruction at the address defined by (inode, offset) with the arch specific breakpoint instruction. We save a copy of the original instruction at the uprobed address. This is needed for: a. executing the instruction out-of-line (xol). b. instruction analysis for any subsequent fixups. c. restoring the instruction back when the uprobe is unregistered. We insert or delete a breakpoint instruction, and this breakpoint instruction is assumed to be the smallest instruction available on the platform. For fixed size instruction platforms this is trivially true, for variable size instruction platforms the breakpoint instruction is typically the smallest (often a single byte). Writing the instruction is done by COWing the page and changing the instruction during the copy, this even though most platforms allow atomic writes of the breakpoint instruction. This also mirrors the behaviour of a ptrace() memory write to a PRIVATE file map. The core worker is derived from KSM's replace_page() logic. In essence, similar to KSM: a. allocate a new page and copy over contents of the page that has the uprobed vaddr b. modify the copy and insert the breakpoint at the required address c. switch the original page with the copy containing the breakpoint d. flush page tables. replace_page() is being replicated here because of some minor changes in the type of pages and also because Hugh Dickins had plans to improve replace_page() for KSM specific work. Instruction analysis on x86 is based on instruction decoder and determines if an instruction can be probed and determines the necessary fixups after singlestep. Instruction analysis is done at probe insertion time so that we avoid having to repeat the same analysis every time a probe is hit. A lot of code here is due to the improvement/suggestions/inputs from Peter Zijlstra. Changelog: (v10): - Add code to clear REX.B prefix as suggested by Denys Vlasenko and Masami Hiramatsu. (v9): - Use insn_offset_modrm as suggested by Masami Hiramatsu. (v7): Handle comments from Peter Zijlstra: - Dont take reference to inode. (expect inode to uprobe_register to be sane). - Use PTR_ERR to set the return value. - No need to take reference to inode. - use PTR_ERR to return error value. - register and uprobe_unregister share code. (v5): - Modified del_consumer as per comments from Peter. - Drop reference to inode before dropping reference to uprobe. - Use i_size_read(inode) instead of inode->i_size. - Ensure uprobe->consumers is NULL, before __uprobe_unregister() is called. - Includes errno.h as recommended by Stephen Rothwell to fix a build issue on sparc defconfig - Remove restrictions while unregistering. - Earlier code leaked inode references under some conditions while registering/unregistering. - Continue the vma-rmap walk even if the intermediate vma doesnt meet the requirements. - Validate the vma found by find_vma before inserting/removing the breakpoint - Call del_consumer under mutex_lock. - Use hash locks. - Handle mremap. - Introduce find_least_offset_node() instead of close match logic in find_uprobe - Uprobes no more depends on MM_OWNER; No reference to task_structs while inserting/removing a probe. - Uses read_mapping_page instead of grab_cache_page so that the pages have valid content. - pass NULL to get_user_pages for the task parameter. - call SetPageUptodate on the new page allocated in write_opcode. - fix leaking a reference to the new page under certain conditions. - Include Instruction Decoder if Uprobes gets defined. - Remove const attributes for instruction prefix arrays. - Uses mm_context to know if the application is 32 bit. Signed-off-by: Srikar Dronamraju Also-written-by: Jim Keniston Reviewed-by: Peter Zijlstra Cc: Oleg Nesterov Cc: Andi Kleen Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Roland McGrath Cc: Masami Hiramatsu Cc: Arnaldo Carvalho de Melo Cc: Anton Arapov Cc: Ananth N Mavinakayanahalli Cc: Stephen Rothwell Cc: Denys Vlasenko Cc: Peter Zijlstra Cc: Linus Torvalds Cc: Andrew Morton Cc: Linux-mm Link: http://lkml.kernel.org/r/20120209092642.GE16600@linux.vnet.ibm.com [ Made various small edits to the commit log ] Signed-off-by: Ingo Molnar --- include/linux/uprobes.h | 98 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 98 insertions(+) create mode 100644 include/linux/uprobes.h (limited to 'include/linux') diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h new file mode 100644 index 000000000000..f1d13fd140f2 --- /dev/null +++ b/include/linux/uprobes.h @@ -0,0 +1,98 @@ +#ifndef _LINUX_UPROBES_H +#define _LINUX_UPROBES_H +/* + * Userspace Probes (UProbes) + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * Copyright (C) IBM Corporation, 2008-2011 + * Authors: + * Srikar Dronamraju + * Jim Keniston + */ + +#include +#include + +struct vm_area_struct; +#ifdef CONFIG_ARCH_SUPPORTS_UPROBES +#include +#else + +typedef u8 uprobe_opcode_t; +struct uprobe_arch_info {}; + +#define MAX_UINSN_BYTES 4 +#endif + +#define uprobe_opcode_sz sizeof(uprobe_opcode_t) + +/* flags that denote/change uprobes behaviour */ +/* Have a copy of original instruction */ +#define UPROBES_COPY_INSN 0x1 +/* Dont run handlers when first register/ last unregister in progress*/ +#define UPROBES_RUN_HANDLER 0x2 + +struct uprobe_consumer { + int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs); + /* + * filter is optional; If a filter exists, handler is run + * if and only if filter returns true. + */ + bool (*filter)(struct uprobe_consumer *self, struct task_struct *task); + + struct uprobe_consumer *next; +}; + +struct uprobe { + struct rb_node rb_node; /* node in the rb tree */ + atomic_t ref; + struct rw_semaphore consumer_rwsem; + struct list_head pending_list; + struct uprobe_arch_info arch_info; + struct uprobe_consumer *consumers; + struct inode *inode; /* Also hold a ref to inode */ + loff_t offset; + int flags; + u8 insn[MAX_UINSN_BYTES]; +}; + +#ifdef CONFIG_UPROBES +extern int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe, + unsigned long vaddr); +extern int __weak set_orig_insn(struct mm_struct *mm, struct uprobe *uprobe, + unsigned long vaddr, bool verify); +extern bool __weak is_bkpt_insn(uprobe_opcode_t *insn); +extern int register_uprobe(struct inode *inode, loff_t offset, + struct uprobe_consumer *consumer); +extern void unregister_uprobe(struct inode *inode, loff_t offset, + struct uprobe_consumer *consumer); +extern int mmap_uprobe(struct vm_area_struct *vma); +#else /* CONFIG_UPROBES is not defined */ +static inline int register_uprobe(struct inode *inode, loff_t offset, + struct uprobe_consumer *consumer) +{ + return -ENOSYS; +} +static inline void unregister_uprobe(struct inode *inode, loff_t offset, + struct uprobe_consumer *consumer) +{ +} +static inline int mmap_uprobe(struct vm_area_struct *vma) +{ + return 0; +} +#endif /* CONFIG_UPROBES */ +#endif /* _LINUX_UPROBES_H */ -- cgit v1.2.3 From 7b2d81d48a2d8e37efb6ce7b4d5ef58822b30d89 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Fri, 17 Feb 2012 09:27:41 +0100 Subject: uprobes/core: Clean up, refactor and improve the code Make the uprobes code readable to me: - improve the Kconfig text so that a mere mortal gets some idea what CONFIG_UPROBES=y is really about - do trivial renames to standardize around the uprobes_*() namespace - clean up and simplify various code flow details - separate basic blocks of functionality - line break artifact and white space related removal - use standard local varible definition blocks - use vertical spacing to make things more readable - remove unnecessary volatile - restructure comment blocks to make them more uniform and more readable in general Cc: Srikar Dronamraju Cc: Jim Keniston Cc: Peter Zijlstra Cc: Oleg Nesterov Cc: Masami Hiramatsu Cc: Arnaldo Carvalho de Melo Cc: Anton Arapov Cc: Ananth N Mavinakayanahalli Link: http://lkml.kernel.org/n/tip-ewbwhb8o6navvllsauu7k07p@git.kernel.org Signed-off-by: Ingo Molnar --- include/linux/uprobes.h | 28 +++++++++++++--------------- 1 file changed, 13 insertions(+), 15 deletions(-) (limited to 'include/linux') diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index f1d13fd140f2..64e45f116b2a 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -1,7 +1,7 @@ #ifndef _LINUX_UPROBES_H #define _LINUX_UPROBES_H /* - * Userspace Probes (UProbes) + * User-space Probes (UProbes) * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -40,8 +40,10 @@ struct uprobe_arch_info {}; #define uprobe_opcode_sz sizeof(uprobe_opcode_t) /* flags that denote/change uprobes behaviour */ + /* Have a copy of original instruction */ #define UPROBES_COPY_INSN 0x1 + /* Dont run handlers when first register/ last unregister in progress*/ #define UPROBES_RUN_HANDLER 0x2 @@ -70,27 +72,23 @@ struct uprobe { }; #ifdef CONFIG_UPROBES -extern int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe, - unsigned long vaddr); -extern int __weak set_orig_insn(struct mm_struct *mm, struct uprobe *uprobe, - unsigned long vaddr, bool verify); +extern int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe, unsigned long vaddr); +extern int __weak set_orig_insn(struct mm_struct *mm, struct uprobe *uprobe, unsigned long vaddr, bool verify); extern bool __weak is_bkpt_insn(uprobe_opcode_t *insn); -extern int register_uprobe(struct inode *inode, loff_t offset, - struct uprobe_consumer *consumer); -extern void unregister_uprobe(struct inode *inode, loff_t offset, - struct uprobe_consumer *consumer); -extern int mmap_uprobe(struct vm_area_struct *vma); +extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *consumer); +extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *consumer); +extern int uprobe_mmap(struct vm_area_struct *vma); #else /* CONFIG_UPROBES is not defined */ -static inline int register_uprobe(struct inode *inode, loff_t offset, - struct uprobe_consumer *consumer) +static inline int +uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *consumer) { return -ENOSYS; } -static inline void unregister_uprobe(struct inode *inode, loff_t offset, - struct uprobe_consumer *consumer) +static inline void +uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *consumer) { } -static inline int mmap_uprobe(struct vm_area_struct *vma) +static inline int uprobe_mmap(struct vm_area_struct *vma) { return 0; } -- cgit v1.2.3 From 96379f60075c75b261328aa7830ef8aa158247ac Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Wed, 22 Feb 2012 14:45:49 +0530 Subject: uprobes/core: Remove uprobe_opcode_sz uprobe_opcode_sz refers to the smallest instruction size for that architecture. UPROBES_BKPT_INSN_SIZE refers to the size of the breakpoint instruction for that architecture. For now we are assuming that both uprobe_opcode_sz and UPROBES_BKPT_INSN_SIZE are the same for all archs and hence removing uprobe_opcode_sz in favour of UPROBES_BKPT_INSN_SIZE. However if we have to support architectures where the smallest instruction size is different from the size of breakpoint instruction, we may have to re-introduce uprobe_opcode_sz. Signed-off-by: Srikar Dronamraju Cc: Peter Zijlstra Cc: Linus Torvalds Cc: Oleg Nesterov Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Anton Arapov Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Jiri Olsa Cc: Josh Stone Link: http://lkml.kernel.org/r/20120222091549.15880.67020.sendpatchset@srdronam.in.ibm.com Signed-off-by: Ingo Molnar --- include/linux/uprobes.h | 2 -- 1 file changed, 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index 64e45f116b2a..fd45b70750d4 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -37,8 +37,6 @@ struct uprobe_arch_info {}; #define MAX_UINSN_BYTES 4 #endif -#define uprobe_opcode_sz sizeof(uprobe_opcode_t) - /* flags that denote/change uprobes behaviour */ /* Have a copy of original instruction */ -- cgit v1.2.3 From 3ff54efdfaace9e9b2b7c1959a865be6b91de96c Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Wed, 22 Feb 2012 14:46:02 +0530 Subject: uprobes/core: Move insn to arch specific structure Few cleanups suggested by Ingo Molnar. - Rename struct uprobe_arch_info to struct arch_uprobe. - Move insn from struct uprobe to struct arch_uprobe. - Make arch specific uprobe functions to accept struct arch_uprobe instead of struct uprobe. - Move struct uprobe to kernel/uprobes.c from include/linux/uprobes.h Signed-off-by: Srikar Dronamraju Cc: Peter Zijlstra Cc: Linus Torvalds Cc: Oleg Nesterov Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Anton Arapov Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Jiri Olsa Cc: Josh Stone Link: http://lkml.kernel.org/r/20120222091602.15880.40249.sendpatchset@srdronam.in.ibm.com [ Made various small improvements ] Signed-off-by: Ingo Molnar --- include/linux/uprobes.h | 23 ++--------------------- 1 file changed, 2 insertions(+), 21 deletions(-) (limited to 'include/linux') diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index fd45b70750d4..9c6be62787ed 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -29,12 +29,6 @@ struct vm_area_struct; #ifdef CONFIG_ARCH_SUPPORTS_UPROBES #include -#else - -typedef u8 uprobe_opcode_t; -struct uprobe_arch_info {}; - -#define MAX_UINSN_BYTES 4 #endif /* flags that denote/change uprobes behaviour */ @@ -56,22 +50,9 @@ struct uprobe_consumer { struct uprobe_consumer *next; }; -struct uprobe { - struct rb_node rb_node; /* node in the rb tree */ - atomic_t ref; - struct rw_semaphore consumer_rwsem; - struct list_head pending_list; - struct uprobe_arch_info arch_info; - struct uprobe_consumer *consumers; - struct inode *inode; /* Also hold a ref to inode */ - loff_t offset; - int flags; - u8 insn[MAX_UINSN_BYTES]; -}; - #ifdef CONFIG_UPROBES -extern int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe, unsigned long vaddr); -extern int __weak set_orig_insn(struct mm_struct *mm, struct uprobe *uprobe, unsigned long vaddr, bool verify); +extern int __weak set_bkpt(struct mm_struct *mm, struct arch_uprobe *auprobe, unsigned long vaddr); +extern int __weak set_orig_insn(struct mm_struct *mm, struct arch_uprobe *auprobe, unsigned long vaddr, bool verify); extern bool __weak is_bkpt_insn(uprobe_opcode_t *insn); extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *consumer); extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *consumer); -- cgit v1.2.3 From 35aa621b5ab9d08767f7bc8d209b696df281d715 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Wed, 22 Feb 2012 11:37:29 +0100 Subject: uprobes: Update copyright notices Add Peter Zijlstra's copyright to the uprobes code, whose contributions to the uprobes code are not visible in the Git history, because they were backmerged. Also update existing copyright notices to the year 2012. Acked-by: Srikar Dronamraju Cc: Peter Zijlstra Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Link: http://lkml.kernel.org/n/tip-vjqxst502pc1efz7ah8cyht4@git.kernel.org Signed-off-by: Ingo Molnar --- include/linux/uprobes.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index 9c6be62787ed..f85797e1ccd4 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -17,10 +17,11 @@ * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. * - * Copyright (C) IBM Corporation, 2008-2011 + * Copyright (C) IBM Corporation, 2008-2012 * Authors: * Srikar Dronamraju * Jim Keniston + * Copyright (C) 2011-2012 Red Hat, Inc., Peter Zijlstra */ #include -- cgit v1.2.3 From b2fab5acd28ead6f0dd6c3996ba23f0ef1772f15 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Mon, 5 Mar 2012 13:14:57 -0800 Subject: elevator: make elevator_init_fn() return 0/-errno elevator_ops->elevator_init_fn() has a weird return value. It returns a void * which the caller should assign to q->elevator->elevator_data and %NULL return denotes init failure. Update such that it returns integer 0/-errno and sets elevator_data directly as necessary. This makes the interface more conventional and eases further cleanup. Signed-off-by: Tejun Heo Cc: Vivek Goyal Signed-off-by: Jens Axboe --- include/linux/elevator.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/elevator.h b/include/linux/elevator.h index 7d4e0356f329..97fb2557a18c 100644 --- a/include/linux/elevator.h +++ b/include/linux/elevator.h @@ -33,7 +33,7 @@ typedef void (elevator_put_req_fn) (struct request *); typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *); typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *); -typedef void *(elevator_init_fn) (struct request_queue *); +typedef int (elevator_init_fn) (struct request_queue *); typedef void (elevator_exit_fn) (struct elevator_queue *); struct elevator_ops -- cgit v1.2.3 From d732580b4eb31553c63744a47d590f770cafb8f0 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Mon, 5 Mar 2012 13:14:58 -0800 Subject: block: implement blk_queue_bypass_start/end() Rename and extend elv_queisce_start/end() to blk_queue_bypass_start/end() which are exported and supports nesting via @q->bypass_depth. Also add blk_queue_bypass() to test bypass state. This will be further extended and used for blkio_group management. Signed-off-by: Tejun Heo Cc: Vivek Goyal Signed-off-by: Jens Axboe --- include/linux/blkdev.h | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 606cf339bb56..315db1d91bc4 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -389,6 +389,8 @@ struct request_queue { struct mutex sysfs_lock; + int bypass_depth; + #if defined(CONFIG_BLK_DEV_BSG) bsg_job_fn *bsg_job_fn; int bsg_job_size; @@ -406,7 +408,7 @@ struct request_queue { #define QUEUE_FLAG_SYNCFULL 3 /* read queue has been filled */ #define QUEUE_FLAG_ASYNCFULL 4 /* write queue has been filled */ #define QUEUE_FLAG_DEAD 5 /* queue being torn down */ -#define QUEUE_FLAG_ELVSWITCH 6 /* don't use elevator, just do FIFO */ +#define QUEUE_FLAG_BYPASS 6 /* act as dumb FIFO queue */ #define QUEUE_FLAG_BIDI 7 /* queue supports bidi requests */ #define QUEUE_FLAG_NOMERGES 8 /* disable merge attempts */ #define QUEUE_FLAG_SAME_COMP 9 /* complete on same CPU-group */ @@ -494,6 +496,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q) #define blk_queue_tagged(q) test_bit(QUEUE_FLAG_QUEUED, &(q)->queue_flags) #define blk_queue_stopped(q) test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags) #define blk_queue_dead(q) test_bit(QUEUE_FLAG_DEAD, &(q)->queue_flags) +#define blk_queue_bypass(q) test_bit(QUEUE_FLAG_BYPASS, &(q)->queue_flags) #define blk_queue_nomerges(q) test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags) #define blk_queue_noxmerges(q) \ test_bit(QUEUE_FLAG_NOXMERGES, &(q)->queue_flags) -- cgit v1.2.3 From 923adde1be1df57cebd80c563058e503376645e8 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Mon, 5 Mar 2012 13:15:13 -0800 Subject: blkcg: clear all request_queues on blkcg policy [un]registrations Keep track of all request_queues which have blkcg initialized and turn on bypass and invoke blkcg_clear_queue() on all before making changes to blkcg policies. This is to prepare for moving blkg management into blkcg core. Note that this uses more brute force than necessary. Finer grained shoot down will be implemented later and given that policy [un]registration almost never happens on running systems (blk-throtl can't be built as a module and cfq usually is the builtin default iosched), this shouldn't be a problem for the time being. Signed-off-by: Tejun Heo Cc: Vivek Goyal Signed-off-by: Jens Axboe --- include/linux/blkdev.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include/linux') diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 315db1d91bc4..e8c0bbd06b9a 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -397,6 +397,9 @@ struct request_queue { struct bsg_class_device bsg_dev; #endif +#ifdef CONFIG_BLK_CGROUP + struct list_head all_q_node; +#endif #ifdef CONFIG_BLK_DEV_THROTTLING /* Throttle data */ struct throtl_data *td; -- cgit v1.2.3 From 4eef3049986e8397d5003916aed8cad6567a5e02 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Mon, 5 Mar 2012 13:15:18 -0800 Subject: blkcg: move per-queue blkg list heads and counters to queue and blkg Currently, specific policy implementations are responsible for maintaining list and number of blkgs. This duplicates code unnecessarily, and hinders factoring common code and providing blkcg API with better defined semantics. After this patch, request_queue hosts list heads and counters and blkg has list nodes for both policies. This patch only relocates the necessary fields and the next patch will actually move management code into blkcg core. Note that request_queue->blkg_list[] and ->nr_blkgs[] are hardcoded to have 2 elements. This is to avoid include dependency and will be removed by the next patch. This patch doesn't introduce any behavior change. -v2: Now unnecessary conditional on CONFIG_BLK_CGROUP_MODULE removed as pointed out by Vivek. Signed-off-by: Tejun Heo Cc: Vivek Goyal Signed-off-by: Jens Axboe --- include/linux/blkdev.h | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'include/linux') diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index e8c0bbd06b9a..f4e35edea70f 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -362,6 +362,11 @@ struct request_queue { struct list_head timeout_list; struct list_head icq_list; +#ifdef CONFIG_BLK_CGROUP + /* XXX: array size hardcoded to avoid include dependency (temporary) */ + struct list_head blkg_list[2]; + int nr_blkgs[2]; +#endif struct queue_limits limits; -- cgit v1.2.3 From 03aa264ac15637b6f98374270bcdf31400965505 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Mon, 5 Mar 2012 13:15:19 -0800 Subject: blkcg: let blkcg core manage per-queue blkg list and counter With the previous patch to move blkg list heads and counters to request_queue and blkg, logic to manage them in both policies are almost identical and can be moved to blkcg core. This patch moves blkg link logic into blkg_lookup_create(), implements common blkg unlink code in blkg_destroy(), and updates blkg_destory_all() so that it's policy specific and can skip root group. The updated blkg_destroy_all() is now used to both clear queue for bypassing and elv switching, and release all blkgs on q exit. This patch introduces a race window where policy [de]registration may race against queue blkg clearing. This can only be a problem on cfq unload and shouldn't be a real problem in practice (and we have many other places where this race already exists). Future patches will remove these unlikely races. Signed-off-by: Tejun Heo Cc: Vivek Goyal Signed-off-by: Jens Axboe --- include/linux/blkdev.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index f4e35edea70f..b4d1d4bfc168 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -364,8 +364,8 @@ struct request_queue { struct list_head icq_list; #ifdef CONFIG_BLK_CGROUP /* XXX: array size hardcoded to avoid include dependency (temporary) */ - struct list_head blkg_list[2]; - int nr_blkgs[2]; + struct list_head blkg_list; + int nr_blkgs; #endif struct queue_limits limits; -- cgit v1.2.3 From c875f4d0250a1f070fa26087a73bdd8f54c48100 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Mon, 5 Mar 2012 13:15:22 -0800 Subject: blkcg: drop unnecessary RCU locking Now that blkg additions / removals are always done under both q and blkcg locks, the only places RCU locking is necessary are blkg_lookup[_create]() for lookup w/o blkcg lock. This patch drops unncessary RCU locking replacing it with plain blkcg locking as necessary. * blkiocg_pre_destroy() already perform proper locking and don't need RCU. Dropped. * blkio_read_blkg_stats() now uses blkcg->lock instead of RCU read lock. This isn't a hot path. * Now unnecessary synchronize_rcu() from queue exit paths removed. This makes q->nr_blkgs unnecessary. Dropped. * RCU annotation on blkg->q removed. -v2: Vivek pointed out that blkg_lookup_create() still needs to be called under rcu_read_lock(). Updated. -v3: After the update, stats_lock locking in blkio_read_blkg_stats() shouldn't be using _irq variant as it otherwise ends up enabling irq while blkcg->lock is locked. Fixed. Signed-off-by: Tejun Heo Cc: Vivek Goyal Signed-off-by: Jens Axboe --- include/linux/blkdev.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index b4d1d4bfc168..33f1b29e53f4 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -365,7 +365,6 @@ struct request_queue { #ifdef CONFIG_BLK_CGROUP /* XXX: array size hardcoded to avoid include dependency (temporary) */ struct list_head blkg_list; - int nr_blkgs; #endif struct queue_limits limits; -- cgit v1.2.3 From 3d48749d93a3dce732dd30a14002ab90ec4355f3 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Mon, 5 Mar 2012 13:15:25 -0800 Subject: block: ioc_task_link() can't fail ioc_task_link() is used to share %current's ioc on clone. If %current->io_context is set, %current is guaranteed to have refcount on the ioc and, thus, ioc_task_link() can't fail. Replace error checking in ioc_task_link() with WARN_ON_ONCE() and make it just increment refcount and nr_tasks. -v2: Description typo fix (Vivek). Signed-off-by: Tejun Heo Cc: Vivek Goyal Signed-off-by: Jens Axboe --- include/linux/iocontext.h | 16 +++++----------- 1 file changed, 5 insertions(+), 11 deletions(-) (limited to 'include/linux') diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h index 1a3018063034..81a8870ac224 100644 --- a/include/linux/iocontext.h +++ b/include/linux/iocontext.h @@ -120,18 +120,12 @@ struct io_context { struct work_struct release_work; }; -static inline struct io_context *ioc_task_link(struct io_context *ioc) +static inline void ioc_task_link(struct io_context *ioc) { - /* - * if ref count is zero, don't allow sharing (ioc is going away, it's - * a race). - */ - if (ioc && atomic_long_inc_not_zero(&ioc->refcount)) { - atomic_inc(&ioc->nr_tasks); - return ioc; - } - - return NULL; + WARN_ON_ONCE(atomic_long_read(&ioc->refcount) <= 0); + WARN_ON_ONCE(atomic_read(&ioc->nr_tasks) <= 0); + atomic_long_inc(&ioc->refcount); + atomic_inc(&ioc->nr_tasks); } struct task_struct; -- cgit v1.2.3 From f6e8d01bee036460e03bd4f6a79d014f98ba712e Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Mon, 5 Mar 2012 13:15:26 -0800 Subject: block: add io_context->active_ref Currently ioc->nr_tasks is used to decide two things - whether an ioc is done issuing IOs and whether it's shared by multiple tasks. This patch separate out the first into ioc->active_ref, which is acquired and released using {get|put}_io_context_active() respectively. This will be used to associate bio's with a given task. This patch doesn't introduce any visible behavior change. Signed-off-by: Tejun Heo Cc: Vivek Goyal Signed-off-by: Jens Axboe --- include/linux/iocontext.h | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h index 81a8870ac224..6f1a2608e91f 100644 --- a/include/linux/iocontext.h +++ b/include/linux/iocontext.h @@ -100,6 +100,7 @@ struct io_cq { */ struct io_context { atomic_long_t refcount; + atomic_t active_ref; atomic_t nr_tasks; /* all the fields below are protected by this lock */ @@ -120,17 +121,34 @@ struct io_context { struct work_struct release_work; }; -static inline void ioc_task_link(struct io_context *ioc) +/** + * get_io_context_active - get active reference on ioc + * @ioc: ioc of interest + * + * Only iocs with active reference can issue new IOs. This function + * acquires an active reference on @ioc. The caller must already have an + * active reference on @ioc. + */ +static inline void get_io_context_active(struct io_context *ioc) { WARN_ON_ONCE(atomic_long_read(&ioc->refcount) <= 0); - WARN_ON_ONCE(atomic_read(&ioc->nr_tasks) <= 0); + WARN_ON_ONCE(atomic_read(&ioc->active_ref) <= 0); atomic_long_inc(&ioc->refcount); + atomic_inc(&ioc->active_ref); +} + +static inline void ioc_task_link(struct io_context *ioc) +{ + get_io_context_active(ioc); + + WARN_ON_ONCE(atomic_read(&ioc->nr_tasks) <= 0); atomic_inc(&ioc->nr_tasks); } struct task_struct; #ifdef CONFIG_BLOCK void put_io_context(struct io_context *ioc); +void put_io_context_active(struct io_context *ioc); void exit_io_context(struct task_struct *task); struct io_context *get_task_io_context(struct task_struct *task, gfp_t gfp_flags, int node); -- cgit v1.2.3 From 852c788f8365062c8a383c5a93f7f7289977cb50 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Mon, 5 Mar 2012 13:15:27 -0800 Subject: block: implement bio_associate_current() IO scheduling and cgroup are tied to the issuing task via io_context and cgroup of %current. Unfortunately, there are cases where IOs need to be routed via a different task which makes scheduling and cgroup limit enforcement applied completely incorrectly. For example, all bios delayed by blk-throttle end up being issued by a delayed work item and get assigned the io_context of the worker task which happens to serve the work item and dumped to the default block cgroup. This is double confusing as bios which aren't delayed end up in the correct cgroup and makes using blk-throttle and cfq propio together impossible. Any code which punts IO issuing to another task is affected which is getting more and more common (e.g. btrfs). As both io_context and cgroup are firmly tied to task including userland visible APIs to manipulate them, it makes a lot of sense to match up tasks to bios. This patch implements bio_associate_current() which associates the specified bio with %current. The bio will record the associated ioc and blkcg at that point and block layer will use the recorded ones regardless of which task actually ends up issuing the bio. bio release puts the associated ioc and blkcg. It grabs and remembers ioc and blkcg instead of the task itself because task may already be dead by the time the bio is issued making ioc and blkcg inaccessible and those are all block layer cares about. elevator_set_req_fn() is updated such that the bio elvdata is being allocated for is available to the elevator. This doesn't update block cgroup policies yet. Further patches will implement the support. -v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in rq_ioc() to fix build breakage. Signed-off-by: Tejun Heo Cc: Vivek Goyal Cc: Kent Overstreet Signed-off-by: Jens Axboe --- include/linux/bio.h | 8 ++++++++ include/linux/blk_types.h | 10 ++++++++++ include/linux/elevator.h | 6 ++++-- 3 files changed, 22 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/bio.h b/include/linux/bio.h index 129a9c097958..692d3d5b49f5 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -268,6 +268,14 @@ extern struct bio_vec *bvec_alloc_bs(gfp_t, int, unsigned long *, struct bio_set extern void bvec_free_bs(struct bio_set *, struct bio_vec *, unsigned int); extern unsigned int bvec_nr_vecs(unsigned short idx); +#ifdef CONFIG_BLK_CGROUP +int bio_associate_current(struct bio *bio); +void bio_disassociate_task(struct bio *bio); +#else /* CONFIG_BLK_CGROUP */ +static inline int bio_associate_current(struct bio *bio) { return -ENOENT; } +static inline void bio_disassociate_task(struct bio *bio) { } +#endif /* CONFIG_BLK_CGROUP */ + /* * bio_set is used to allow other portions of the IO system to * allocate their own private memory pools for bio and iovec structures. diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 4053cbd4490e..0edb65dd8edd 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -14,6 +14,8 @@ struct bio; struct bio_integrity_payload; struct page; struct block_device; +struct io_context; +struct cgroup_subsys_state; typedef void (bio_end_io_t) (struct bio *, int); typedef void (bio_destructor_t) (struct bio *); @@ -66,6 +68,14 @@ struct bio { bio_end_io_t *bi_end_io; void *bi_private; +#ifdef CONFIG_BLK_CGROUP + /* + * Optional ioc and css associated with this bio. Put on bio + * release. Read comment on top of bio_associate_current(). + */ + struct io_context *bi_ioc; + struct cgroup_subsys_state *bi_css; +#endif #if defined(CONFIG_BLK_DEV_INTEGRITY) struct bio_integrity_payload *bi_integrity; /* data integrity */ #endif diff --git a/include/linux/elevator.h b/include/linux/elevator.h index 97fb2557a18c..c03af7687bb4 100644 --- a/include/linux/elevator.h +++ b/include/linux/elevator.h @@ -28,7 +28,8 @@ typedef int (elevator_may_queue_fn) (struct request_queue *, int); typedef void (elevator_init_icq_fn) (struct io_cq *); typedef void (elevator_exit_icq_fn) (struct io_cq *); -typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t); +typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, + struct bio *, gfp_t); typedef void (elevator_put_req_fn) (struct request *); typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *); typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *); @@ -129,7 +130,8 @@ extern void elv_unregister_queue(struct request_queue *q); extern int elv_may_queue(struct request_queue *, int); extern void elv_abort_queue(struct request_queue *); extern void elv_completed_request(struct request_queue *, struct request *); -extern int elv_set_request(struct request_queue *, struct request *, gfp_t); +extern int elv_set_request(struct request_queue *q, struct request *rq, + struct bio *bio, gfp_t gfp_mask); extern void elv_put_request(struct request_queue *, struct request *); extern void elv_drain_elevator(struct request_queue *); -- cgit v1.2.3 From 900771a483ef28915a48066d7895d8252315607a Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Mon, 12 Mar 2012 14:55:14 +0530 Subject: uprobes/core: Make macro names consistent Rename macros that refer to individual uprobe to start with UPROBE_ instead of UPROBES_. This is pure cleanup, no functional change intended. Signed-off-by: Srikar Dronamraju Cc: Linus Torvalds Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Linux-mm Cc: Oleg Nesterov Cc: Andi Kleen Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20120312092514.5379.36595.sendpatchset@srdronam.in.ibm.com Signed-off-by: Ingo Molnar --- include/linux/uprobes.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index f85797e1ccd4..838fb312926a 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -35,10 +35,10 @@ struct vm_area_struct; /* flags that denote/change uprobes behaviour */ /* Have a copy of original instruction */ -#define UPROBES_COPY_INSN 0x1 +#define UPROBE_COPY_INSN 0x1 /* Dont run handlers when first register/ last unregister in progress*/ -#define UPROBES_RUN_HANDLER 0x2 +#define UPROBE_RUN_HANDLER 0x2 struct uprobe_consumer { int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs); -- cgit v1.2.3 From e3343e6a2819ff5d0dfc4bb5c9fb7f9a4d04da73 Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Mon, 12 Mar 2012 14:55:30 +0530 Subject: uprobes/core: Make order of function parameters consistent across functions If a function takes struct uprobe or struct arch_uprobe, then it is passed as the first parameter. This is pure cleanup, no functional change intended. Signed-off-by: Srikar Dronamraju Cc: Linus Torvalds Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Linux-mm Cc: Oleg Nesterov Cc: Andi Kleen Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20120312092530.5379.18394.sendpatchset@srdronam.in.ibm.com Signed-off-by: Ingo Molnar --- include/linux/uprobes.h | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'include/linux') diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index 838fb312926a..58699182e9a7 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -52,20 +52,20 @@ struct uprobe_consumer { }; #ifdef CONFIG_UPROBES -extern int __weak set_bkpt(struct mm_struct *mm, struct arch_uprobe *auprobe, unsigned long vaddr); -extern int __weak set_orig_insn(struct mm_struct *mm, struct arch_uprobe *auprobe, unsigned long vaddr, bool verify); +extern int __weak set_bkpt(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr); +extern int __weak set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr, bool verify); extern bool __weak is_bkpt_insn(uprobe_opcode_t *insn); -extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *consumer); -extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *consumer); +extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); +extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); extern int uprobe_mmap(struct vm_area_struct *vma); #else /* CONFIG_UPROBES is not defined */ static inline int -uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *consumer) +uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) { return -ENOSYS; } static inline void -uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *consumer) +uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) { } static inline int uprobe_mmap(struct vm_area_struct *vma) -- cgit v1.2.3 From 5cb4ac3a583d4ee18c8682ab857e093c4a0d0895 Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Mon, 12 Mar 2012 14:55:45 +0530 Subject: uprobes/core: Rename bkpt to swbp bkpt doesnt seem to be a correct abbrevation for breakpoint. Choice was between bp and breakpoint. Since bp can refer to things other than breakpoint, use swbp to refer to breakpoints. This is pure cleanup, no functional change intended. Signed-off-by: Srikar Dronamraju Cc: Linus Torvalds Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Linux-mm Cc: Oleg Nesterov Cc: Andi Kleen Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20120312092545.5379.91251.sendpatchset@srdronam.in.ibm.com Signed-off-by: Ingo Molnar --- include/linux/uprobes.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index 58699182e9a7..eac525f41b94 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -52,9 +52,9 @@ struct uprobe_consumer { }; #ifdef CONFIG_UPROBES -extern int __weak set_bkpt(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr); +extern int __weak set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr); extern int __weak set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr, bool verify); -extern bool __weak is_bkpt_insn(uprobe_opcode_t *insn); +extern bool __weak is_swbp_insn(uprobe_opcode_t *insn); extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); extern int uprobe_mmap(struct vm_area_struct *vma); -- cgit v1.2.3 From 0326f5a94ddea33fa331b2519f4172f4fb387baa Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Tue, 13 Mar 2012 23:30:11 +0530 Subject: uprobes/core: Handle breakpoint and singlestep exceptions Uprobes uses exception notifiers to get to know if a thread hit a breakpoint or a singlestep exception. When a thread hits a uprobe or is singlestepping post a uprobe hit, the uprobe exception notifier sets its TIF_UPROBE bit, which will then be checked on its return to userspace path (do_notify_resume() ->uprobe_notify_resume()), where the consumers handlers are run (in task context) based on the defined filters. Uprobe hits are thread specific and hence we need to maintain information about if a task hit a uprobe, what uprobe was hit, the slot where the original instruction was copied for xol so that it can be singlestepped with appropriate fixups. In some cases, special care is needed for instructions that are executed out of line (xol). These are architecture specific artefacts, such as handling RIP relative instructions on x86_64. Since the instruction at which the uprobe was inserted is executed out of line, architecture specific fixups are added so that the thread continues normal execution in the presence of a uprobe. Postpone the signals until we execute the probed insn. post_xol() path does a recalc_sigpending() before return to user-mode, this ensures the signal can't be lost. Uprobes relies on DIE_DEBUG notification to notify if a singlestep is complete. Adds x86 specific uprobe exception notifiers and appropriate hooks needed to determine a uprobe hit and subsequent post processing. Add requisite x86 fixups for xol for uprobes. Specific cases needing fixups include relative jumps (x86_64), calls, etc. Where possible, we check and skip singlestepping the breakpointed instructions. For now we skip single byte as well as few multibyte nop instructions. However this can be extended to other instructions too. Credits to Oleg Nesterov for suggestions/patches related to signal, breakpoint, singlestep handling code. Signed-off-by: Srikar Dronamraju Cc: Linus Torvalds Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Linux-mm Cc: Oleg Nesterov Cc: Andi Kleen Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20120313180011.29771.89027.sendpatchset@srdronam.in.ibm.com [ Performed various cleanliness edits ] Signed-off-by: Ingo Molnar --- include/linux/sched.h | 4 ++++ include/linux/uprobes.h | 55 ++++++++++++++++++++++++++++++++++++++++++++++--- 2 files changed, 56 insertions(+), 3 deletions(-) (limited to 'include/linux') diff --git a/include/linux/sched.h b/include/linux/sched.h index 7d379a6bfd88..8379e3771690 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1590,6 +1590,10 @@ struct task_struct { #ifdef CONFIG_HAVE_HW_BREAKPOINT atomic_t ptrace_bp_refcnt; #endif +#ifdef CONFIG_UPROBES + struct uprobe_task *utask; + int uprobe_srcu_id; +#endif }; /* Future-safe accessor for struct task_struct's cpus_allowed. */ diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index eac525f41b94..5ec778fdce6f 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -28,8 +28,9 @@ #include struct vm_area_struct; + #ifdef CONFIG_ARCH_SUPPORTS_UPROBES -#include +# include #endif /* flags that denote/change uprobes behaviour */ @@ -39,6 +40,8 @@ struct vm_area_struct; /* Dont run handlers when first register/ last unregister in progress*/ #define UPROBE_RUN_HANDLER 0x2 +/* Can skip singlestep */ +#define UPROBE_SKIP_SSTEP 0x4 struct uprobe_consumer { int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs); @@ -52,13 +55,42 @@ struct uprobe_consumer { }; #ifdef CONFIG_UPROBES +enum uprobe_task_state { + UTASK_RUNNING, + UTASK_BP_HIT, + UTASK_SSTEP, + UTASK_SSTEP_ACK, + UTASK_SSTEP_TRAPPED, +}; + +/* + * uprobe_task: Metadata of a task while it singlesteps. + */ +struct uprobe_task { + enum uprobe_task_state state; + struct arch_uprobe_task autask; + + struct uprobe *active_uprobe; + + unsigned long xol_vaddr; + unsigned long vaddr; +}; + extern int __weak set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr); extern int __weak set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr, bool verify); extern bool __weak is_swbp_insn(uprobe_opcode_t *insn); extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); extern int uprobe_mmap(struct vm_area_struct *vma); -#else /* CONFIG_UPROBES is not defined */ +extern void uprobe_free_utask(struct task_struct *t); +extern void uprobe_copy_process(struct task_struct *t); +extern unsigned long __weak uprobe_get_swbp_addr(struct pt_regs *regs); +extern int uprobe_post_sstep_notifier(struct pt_regs *regs); +extern int uprobe_pre_sstep_notifier(struct pt_regs *regs); +extern void uprobe_notify_resume(struct pt_regs *regs); +extern bool uprobe_deny_signal(void); +extern bool __weak arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs); +#else /* !CONFIG_UPROBES */ static inline int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) { @@ -72,5 +104,22 @@ static inline int uprobe_mmap(struct vm_area_struct *vma) { return 0; } -#endif /* CONFIG_UPROBES */ +static inline void uprobe_notify_resume(struct pt_regs *regs) +{ +} +static inline bool uprobe_deny_signal(void) +{ + return false; +} +static inline unsigned long uprobe_get_swbp_addr(struct pt_regs *regs) +{ + return 0; +} +static inline void uprobe_free_utask(struct task_struct *t) +{ +} +static inline void uprobe_copy_process(struct task_struct *t) +{ +} +#endif /* !CONFIG_UPROBES */ #endif /* _LINUX_UPROBES_H */ -- cgit v1.2.3 From 598971bfbdfdc8701337dc1636c7919c44699914 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Mon, 19 Mar 2012 15:10:58 -0700 Subject: cfq: don't use icq_get_changed() cfq caches the associated cfqq's for a given cic. The cache needs to be flushed if the cic's ioprio or blkcg has changed. It is currently done by requiring the changing action to set the respective ICQ_*_CHANGED bit in the icq and testing it from cfq_set_request(), which involves iterating through all the affected icqs. All cfq wants to know is whether ioprio and/or blkcg have changed since the last flush and can be easily achieved by just remembering the current ioprio and blkcg ID in cic. This patch adds cic->{ioprio|blkcg_id}, updates all ioprio users to use the remembered value instead, and updates cfq_set_request() path such that, instead of using icq_get_changed(), the current values are compared against the remembered ones and trigger appropriate flush action if not. Condition tests are moved inside both _changed functions which are now named check_ioprio_changed() and check_blkcg_changed(). ioprio.h::task_ioprio*() can't be used anymore and replaced with open-coded IOPRIO_CLASS_NONE case in cfq_async_queue_prio(). Signed-off-by: Tejun Heo Cc: Vivek Goyal Signed-off-by: Jens Axboe --- include/linux/ioprio.h | 22 +++++----------------- 1 file changed, 5 insertions(+), 17 deletions(-) (limited to 'include/linux') diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h index 76dad4808847..beb9ce1c2c23 100644 --- a/include/linux/ioprio.h +++ b/include/linux/ioprio.h @@ -42,26 +42,14 @@ enum { }; /* - * if process has set io priority explicitly, use that. if not, convert - * the cpu scheduler nice value to an io priority + * Fallback BE priority */ #define IOPRIO_NORM (4) -static inline int task_ioprio(struct io_context *ioc) -{ - if (ioprio_valid(ioc->ioprio)) - return IOPRIO_PRIO_DATA(ioc->ioprio); - - return IOPRIO_NORM; -} - -static inline int task_ioprio_class(struct io_context *ioc) -{ - if (ioprio_valid(ioc->ioprio)) - return IOPRIO_PRIO_CLASS(ioc->ioprio); - - return IOPRIO_CLASS_BE; -} +/* + * if process has set io priority explicitly, use that. if not, convert + * the cpu scheduler nice value to an io priority + */ static inline int task_nice_ioprio(struct task_struct *task) { return (task_nice(task) + 20) / 5; -- cgit v1.2.3 From 2b566fa55b9a94b53217c2818e6c5e5756eeb1a1 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Mon, 19 Mar 2012 15:10:59 -0700 Subject: block: remove ioc_*_changed() After the previous patch to cfq, there's no ioc_get_changed() user left. This patch yanks out ioc_{ioprio|cgroup|get}_changed() and all related stuff. Signed-off-by: Tejun Heo Cc: Vivek Goyal Signed-off-by: Jens Axboe --- include/linux/iocontext.h | 7 ------- 1 file changed, 7 deletions(-) (limited to 'include/linux') diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h index 6f1a2608e91f..df38db2ef45b 100644 --- a/include/linux/iocontext.h +++ b/include/linux/iocontext.h @@ -6,11 +6,7 @@ #include enum { - ICQ_IOPRIO_CHANGED = 1 << 0, - ICQ_CGROUP_CHANGED = 1 << 1, ICQ_EXITED = 1 << 2, - - ICQ_CHANGED_MASK = ICQ_IOPRIO_CHANGED | ICQ_CGROUP_CHANGED, }; /* @@ -152,9 +148,6 @@ void put_io_context_active(struct io_context *ioc); void exit_io_context(struct task_struct *task); struct io_context *get_task_io_context(struct task_struct *task, gfp_t gfp_flags, int node); -void ioc_ioprio_changed(struct io_context *ioc, int ioprio); -void ioc_cgroup_changed(struct io_context *ioc); -unsigned int icq_get_changed(struct io_cq *icq); #else struct io_context; static inline void put_io_context(struct io_context *ioc) { } -- cgit v1.2.3 From f56f821feb7b36223f309e0ec05986bb137ce418 Mon Sep 17 00:00:00 2001 From: Daniel Vetter Date: Sun, 25 Mar 2012 19:47:41 +0200 Subject: mm: extend prefault helpers to fault in more than PAGE_SIZE drm/i915 wants to read/write more than one page in its fastpath and hence needs to prefault more than PAGE_SIZE bytes. Add new functions in filemap.h to make that possible. Also kill a copy&pasted spurious space in both functions while at it. v2: As suggested by Andrew Morton, add a multipage parameter to both functions to avoid the additional branch for the pagemap.c hotpath. My gcc 4.6 here seems to dtrt and indeed reap these branches where not needed. v3: Becaus I couldn't find a way around adding a uaddr += PAGE_SIZE to the filemap.c hotpaths (that the compiler couldn't remove again), let's go with separate new functions for the multipage use-case. v4: Adjust comment to CodingStlye and fix spelling. Acked-by: Andrew Morton Signed-off-by: Daniel Vetter --- include/linux/pagemap.h | 64 +++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 62 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index cfaaa6949b8b..c93a9a9bcd35 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -426,7 +426,7 @@ static inline int fault_in_pages_writeable(char __user *uaddr, int size) */ if (((unsigned long)uaddr & PAGE_MASK) != ((unsigned long)end & PAGE_MASK)) - ret = __put_user(0, end); + ret = __put_user(0, end); } return ret; } @@ -445,13 +445,73 @@ static inline int fault_in_pages_readable(const char __user *uaddr, int size) if (((unsigned long)uaddr & PAGE_MASK) != ((unsigned long)end & PAGE_MASK)) { - ret = __get_user(c, end); + ret = __get_user(c, end); (void)c; } } return ret; } +/* + * Multipage variants of the above prefault helpers, useful if more than + * PAGE_SIZE of data needs to be prefaulted. These are separate from the above + * functions (which only handle up to PAGE_SIZE) to avoid clobbering the + * filemap.c hotpaths. + */ +static inline int fault_in_multipages_writeable(char __user *uaddr, int size) +{ + int ret; + const char __user *end = uaddr + size - 1; + + if (unlikely(size == 0)) + return 0; + + /* + * Writing zeroes into userspace here is OK, because we know that if + * the zero gets there, we'll be overwriting it. + */ + while (uaddr <= end) { + ret = __put_user(0, uaddr); + if (ret != 0) + return ret; + uaddr += PAGE_SIZE; + } + + /* Check whether the range spilled into the next page. */ + if (((unsigned long)uaddr & PAGE_MASK) == + ((unsigned long)end & PAGE_MASK)) + ret = __put_user(0, end); + + return ret; +} + +static inline int fault_in_multipages_readable(const char __user *uaddr, + int size) +{ + volatile char c; + int ret; + const char __user *end = uaddr + size - 1; + + if (unlikely(size == 0)) + return 0; + + while (uaddr <= end) { + ret = __get_user(c, uaddr); + if (ret != 0) + return ret; + uaddr += PAGE_SIZE; + } + + /* Check whether the range spilled into the next page. */ + if (((unsigned long)uaddr & PAGE_MASK) == + ((unsigned long)end & PAGE_MASK)) { + ret = __get_user(c, end); + (void)c; + } + + return ret; +} + int add_to_page_cache_locked(struct page *page, struct address_space *mapping, pgoff_t index, gfp_t gfp_mask); int add_to_page_cache_lru(struct page *page, struct address_space *mapping, -- cgit v1.2.3 From d4b3b6384f98f8692ad0209891ccdbc7e78bbefe Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Fri, 30 Mar 2012 23:56:31 +0530 Subject: uprobes/core: Allocate XOL slots for uprobes use Uprobes executes the original instruction at a probed location out of line. For this, we allocate a page (per mm) upon the first uprobe hit, in the process user address space, divide it into slots that are used to store the actual instructions to be singlestepped. These slots are known as xol (execution out of line) slots. Care is taken to ensure that the allocation is in an unmapped area as close to the top of the user address space as possible, with appropriate permission settings to keep selinux like frameworks happy. Upon a uprobe hit, a free slot is acquired, and is released after the singlestep completes. Lots of improvements courtesy suggestions/inputs from Peter and Oleg. [ Folded a fix for build issue on powerpc fixed and reported by Stephen Rothwell. ] Signed-off-by: Srikar Dronamraju Cc: Linus Torvalds Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Linux-mm Cc: Oleg Nesterov Cc: Andi Kleen Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Anton Arapov Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20120330182631.10018.48175.sendpatchset@srdronam.in.ibm.com Signed-off-by: Ingo Molnar --- include/linux/mm_types.h | 2 ++ include/linux/uprobes.h | 34 ++++++++++++++++++++++++++++++++++ 2 files changed, 36 insertions(+) (limited to 'include/linux') diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 3cc3062b3767..26574c726121 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -12,6 +12,7 @@ #include #include #include +#include #include #include @@ -388,6 +389,7 @@ struct mm_struct { #ifdef CONFIG_CPUMASK_OFFSTACK struct cpumask cpumask_allocation; #endif + struct uprobes_state uprobes_state; }; static inline void mm_init_cpumask(struct mm_struct *mm) diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index 5ec778fdce6f..a111460c07d5 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -28,6 +28,8 @@ #include struct vm_area_struct; +struct mm_struct; +struct inode; #ifdef CONFIG_ARCH_SUPPORTS_UPROBES # include @@ -76,6 +78,28 @@ struct uprobe_task { unsigned long vaddr; }; +/* + * On a breakpoint hit, thread contests for a slot. It frees the + * slot after singlestep. Currently a fixed number of slots are + * allocated. + */ +struct xol_area { + wait_queue_head_t wq; /* if all slots are busy */ + atomic_t slot_count; /* number of in-use slots */ + unsigned long *bitmap; /* 0 = free slot */ + struct page *page; + + /* + * We keep the vma's vm_start rather than a pointer to the vma + * itself. The probed process or a naughty kernel module could make + * the vma go away, and we must handle that reasonably gracefully. + */ + unsigned long vaddr; /* Page(s) of instruction slots */ +}; + +struct uprobes_state { + struct xol_area *xol_area; +}; extern int __weak set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr); extern int __weak set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr, bool verify); extern bool __weak is_swbp_insn(uprobe_opcode_t *insn); @@ -90,7 +114,11 @@ extern int uprobe_pre_sstep_notifier(struct pt_regs *regs); extern void uprobe_notify_resume(struct pt_regs *regs); extern bool uprobe_deny_signal(void); extern bool __weak arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs); +extern void uprobe_clear_state(struct mm_struct *mm); +extern void uprobe_reset_state(struct mm_struct *mm); #else /* !CONFIG_UPROBES */ +struct uprobes_state { +}; static inline int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) { @@ -121,5 +149,11 @@ static inline void uprobe_free_utask(struct task_struct *t) static inline void uprobe_copy_proce