Age | Commit message (Collapse) | Author | Files | Lines |
|
This helps ensure key cache reclaim isn't contending with threads
waiting for the key cache to be helped, and fixes a severe performance
bug.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
for_each_btree_node() now works similarly to for_each_btree_key(), where
the loop body is passed as an argument to be passed to lockrestart_do().
This now calls trans_begin() on every loop iteration - which fixes an
SRCU warning in backpointers fsck.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
forward compat fix
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
bch2_trigger_alloc was assuming that the new key would always be newly
created and thus always an alloc_v4 key, but - not when called from
btree_gc.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
bch2_btree_path_traverse_cached() was previously checking if it could
just relock the path, which is a common idiom in path traversal.
However, it was using btree_node_relock(), not btree_path_relock();
btree_path_relock() only succeeds if the path was in state
BTREE_ITER_NEED_RELOCK.
If the path was in state BTREE_ITER_NEED_TRAVERSE a full traversal is
needed; this led to a null ptr deref in
bch2_btree_path_traverse_cached().
And the short circuit check here isn't needed, since it was already done
in the main bch2_btree_path_traverse_one().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
bcachefs_metadata_version_disk_accounting_v2 erroneously had padding
bytes in disk_accounting_key, which is a problem because we have to
guarantee that all unused bytes in disk_accounting_key are zeroed.
Fortunately 6.11 isn't out yet, so it's cheap to fix this by spinning a
new version.
Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add a line for capacity
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Implement bch2_accounting_invalid(); check for junk at the end, and
replicas accounting entries in particular need to be checked or we'll
pop asserts later.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
.set_acl() requires a dentry, and if one isn't passed it marks the VFS
inode as not having an ACL.
This has been causing inodes with ACLs to have them "disappear" on
bcachefs filesystem, depending on which path those inodes get pulled
into the cache from.
Switching to .get_inode_acl(), like other local filesystems, fixes this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
If the allocator gets stuck, we need to know why.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Limit these messages to once every 2 minutes to avoid spamming logs;
with multiple devices the output can be quite significant.
Also, up the default timeout to 30 seconds from 10 seconds.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This fixes a bug exposed by the next path - we pop an assert in
path_set_should_be_locked().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This fixes a device removal deadlock when using erasure coding.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
chasing down a device removal deadlock with erasure coding
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We've had bugs in the past with incorrect integer conversions in disk
accounting code, which is why bucket helpers now always return s64s; add
a comment explaining this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
implicit integer conversion is a fertile source of bugs, and we really
would rather not have the min()/max() macros doing it implicitly.
bcachefs appears to be the only place in the kernel where this happens,
so let's fix it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: ffcbec6076 ("bcachefs: Kill opts.buckets_nouse")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Pull bcachefs fixes from Kent Overstreet:
- another fix for fsck getting stuck, from marcin
- small syzbot fix
- another undefined shift fix
* tag 'bcachefs-2024-07-22' of https://evilpiepirate.org/git/bcachefs:
bcachefs: Fix printbuf usage while atomic
bcachefs: More informative error message in reattach_inode()
bcachefs: kill btree_trans_too_many_iters() in bch2_bucket_alloc_freelist()
bcachefs: mean_and_variance: Avoid too-large shift amounts
|
|
Reported-by: syzbot+f765e51170cf13493f0b@syzkaller.appspotmail.com
Fixes: f12410bb7ddd ("bcachefs: Add an error message for insufficient rw journal devs")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- In the series "treewide: Refactor heap related implementation",
Kuan-Wei Chiu has significantly reworked the min_heap library code
and has taught bcachefs to use the new more generic implementation.
- Yury Norov's series "Cleanup cpumask.h inclusion in core headers"
reworks the cpumask and nodemask headers to make things generally
more rational.
- Kuan-Wei Chiu has sent along some maintenance work against our
sorting library code in the series "lib/sort: Optimizations and
cleanups".
- More library maintainance work from Christophe Jaillet in the series
"Remove usage of the deprecated ida_simple_xx() API".
- Ryusuke Konishi continues with the nilfs2 fixes and clanups in the
series "nilfs2: eliminate the call to inode_attach_wb()".
- Kuan-Ying Lee has some fixes to the gdb scripts in the series "Fix
GDB command error".
- Plus the usual shower of singleton patches all over the place. Please
see the relevant changelogs for details.
* tag 'mm-nonmm-stable-2024-07-21-15-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (98 commits)
ia64: scrub ia64 from poison.h
watchdog/perf: properly initialize the turbo mode timestamp and rearm counter
tsacct: replace strncpy() with strscpy()
lib/bch.c: use swap() to improve code
test_bpf: convert comma to semicolon
init/modpost: conditionally check section mismatch to __meminit*
init: remove unused __MEMINIT* macros
nilfs2: Constify struct kobj_type
nilfs2: avoid undefined behavior in nilfs_cnt32_ge macro
math: rational: add missing MODULE_DESCRIPTION() macro
lib/zlib: add missing MODULE_DESCRIPTION() macro
fs: ufs: add MODULE_DESCRIPTION()
lib/rbtree.c: fix the example typo
ocfs2: add bounds checking to ocfs2_check_dir_entry()
fs: add kernel-doc comments to ocfs2_prepare_orphan_dir()
coredump: simplify zap_process()
selftests/fpu: add missing MODULE_DESCRIPTION() macro
compiler.h: simplify data_race() macro
build-id: require program headers to be right after ELF header
resource: add missing MODULE_DESCRIPTION()
...
|
|
When we're called via
trans commit -> btree split -> allocator
We may have already arbitrarily many btree_paths, for the transaction
commit we're trying to do; when this happens, the
btree_trans_too_many_iters() call causes us to livelock.
Since the allocator calls btree_iter_dontneed to release paths as it
iterates, this shouldn't cause any problems.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Pull bcachefs updates from Kent Overstreet:
- Metadata version 1.8: Stripe sectors accounting, BCH_DATA_unstriped
This splits out the accounting of dirty sectors and stripe sectors in
alloc keys; this lets us see stripe buckets that still have unstriped
data in them.
This is needed for ensuring that erasure coding is working correctly,
as well as completing stripe creation after a crash.
- Metadata version 1.9: Disk accounting rewrite
The previous disk accounting scheme relied heavily on percpu counters
that were also sharded by outstanding journal buffer; it was fast but
not extensible or scalable, and meant that all accounting counters
were recorded in every journal entry.
The new disk accounting scheme stores accounting as normal btree
keys; updates are deltas until they are flushed by the btree write
buffer.
This means we have no practical limit on the number of counters, and
a new tagged union format that's easy to extend.
We now have counters for compression type/ratio, per-snapshot-id
usage, per-btree-id usage, and pending rebalance work.
- Self healing on read IO/checksum error
Data is now automatically rewritten if we get a read error and then a
successful retry
- Mount API conversion (thanks to Thomas Bertschinger)
- Better lockdep coverage
Previously, btree node locks were tracked individually by lockdep,
like any other lock. But we may take _many_ btree node locks
simultaneously, we easily blow through the limit of 48 locks that
lockdep can track, leading to lockdep turning itself off.
Tracking each btree node lock individually isn't really necessary
since we have our own cycle detector for deadlock avoidance and
centralized tracking of btree node locks, so we now have a single
lockdep_map in btree_trans for "any btree nodes are locked".
- Some more small incremental work towards online check_allocations
- Lots more debugging improvements
- Fixes, including:
- undefined behaviour fixes, originally noted as breaking userspace
LTO builds
- fix a spurious warning in fsck_err, reported by Marcin
- fix an integer overflow on trans->nr_updates, also reported by
Marcin; this broke during deletion of highly fragmented indirect
extents
* tag 'bcachefs-2024-07-18.2' of https://evilpiepirate.org/git/bcachefs: (120 commits)
lockdep: Add comments for lockdep_set_no{validate,track}_class()
bcachefs: Fix integer overflow on trans->nr_updates
bcachefs: silence silly kdoc warning
bcachefs: Fix fsck warning about btree_trans not passed to fsck error
bcachefs: Add an error message for insufficient rw journal devs
bcachefs: varint: Avoid left-shift of a negative value
bcachefs: darray: Don't pass NULL to memcpy()
bcachefs: Kill bch2_assert_btree_nodes_not_locked()
bcachefs: Rename BCH_WRITE_DONE -> BCH_WRITE_SUBMITTED
bcachefs: __bch2_read(): call trans_begin() on every loop iter
bcachefs: show none if label is not set
bcachefs: drop packed, aligned from bkey_inode_buf
bcachefs: btree node scan: fall back to comparing by journal seq
bcachefs: Add lockdep support for btree node locks
lockdep: lockdep_set_notrack_class()
bcachefs: Improve copygc_wait_to_text()
bcachefs: Convert clock code to u64s
bcachefs: Improve startup message
bcachefs: Self healing on read IO error
bcachefs: Make read_only a mount option again, but hidden
...
|
|
Shifting a value by the width of its type or more is undefined.
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We can't have more updates than paths, so btree_path_idx_t is the
correct type to use.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
If a btree_trans is in use it's supposed to be passed to fsck_err so
that it can be unlocked if we're waiting on userspace input; but the
btree IO paths do call fsck errors where a btree_trans exists on the
stack but it's not passed through.
But it's ok, because it's unlocked while doing IO.
Fixes: a850bde6498b ("bcachefs: fsck_err() may now take a btree_trans")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This causes us to go read-only - need an error message saying why.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Shifting a negative value left is undefined.
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode / dentry updates from Christian Brauner:
"This contains smaller performance improvements to inodes and dentries:
inode:
- Add rcu based inode lookup variants.
They avoid one inode hash lock acquire in the common case thereby
significantly reducing contention. We already support RCU-based
operations but didn't take advantage of them during inode
insertion.
Callers of iget_locked() get the improvement without any code
changes. Callers that need a custom callback can switch to
iget5_locked_rcu() as e.g., did btrfs.
With 20 threads each walking a dedicated 1000 dirs * 1000 files
directory tree to stat(2) on a 32 core + 24GB ram vm:
before: 3.54s user 892.30s system 1966% cpu 45.549 total
after: 3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)
Long-term we should pick up the effort to introduce more
fine-grained locking and possibly improve on the currently used
hash implementation.
- Start zeroing i_state in inode_init_always() instead of doing it in
individual filesystems.
This allows us to remove an unneeded lock acquire in new_inode()
and not burden individual filesystems with this.
dcache:
- Move d_lockref out of the area used by RCU lookup to avoid
cacheline ping poing because the embedded name is sharing a
cacheline with d_lockref.
- Fix dentry size on 32bit with CONFIG_SMP=y so it does actually end
up with 128 bytes in total"
* tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: fix dentry size
vfs: move d_lockref out of the area used by RCU lookup
bcachefs: remove now spurious i_state initialization
xfs: remove now spurious i_state initialization in xfs_inode_alloc
vfs: partially sanitize i_state zeroing on inode creation
xfs: preserve i_state around inode_init_always in xfs_reinit_inode
btrfs: use iget5_locked_rcu
vfs: add rcu-based find_inode variants for iget ops
|
|
memcpy's second parameter must not be NULL, even if size is zero.
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We no longer track individual btree node locks with lockdep, so this
will never be enabled.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
perusal of /sys/kernel/debug/bcachefs/*/btree_transaction_stats shows
that the read path has been acculumalating unneeded paths on the reflink
btree, which we don't want.
The solution is to call bch2_trans_begin(), which drops paths not used
on previous loop iteration.
bch2_readahead:
Max mem used: 0
Transaction duration:
count: 194235
since mount recent
duration of events
min: 150 ns
max: 9 ms
total: 838 ms
mean: 4 us 6 us
stddev: 34 us 7 us
time between events
min: 10 ns
max: 15 h
mean: 2 s 12 s
stddev: 2 s 3 ms
Maximum allocated btree paths (193):
path: idx 2 ref 0:0 P btree=extents l=0 pos 270943112:392:U32_MAX locks 0
path: idx 3 ref 1:0 S btree=extents l=0 pos 270943112:24578:U32_MAX locks 1
path: idx 4 ref 0:0 P btree=reflink l=0 pos 0:24773509:0 locks 0
path: idx 5 ref 0:0 P S btree=reflink l=0 pos 0:24773631:0 locks 1
path: idx 6 ref 0:0 P S btree=reflink l=0 pos 0:24773759:0 locks 1
path: idx 7 ref 0:0 P S btree=reflink l=0 pos 0:24773887:0 locks 1
path: idx 8 ref 0:0 P S btree=reflink l=0 pos 0:24774015:0 locks 1
path: idx 9 ref 0:0 P S btree=reflink l=0 pos 0:24774143:0 locks 1
path: idx 10 ref 0:0 P S btree=reflink l=0 pos 0:24774271:0 locks 1
<many more reflink paths>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
If label is not set, the Label tag in superblock info show '(none)'.
```
[Before]
Device index: 0
Label:
Version: 1.4: member_seq
[After]
Device index: 0
Label: (none)
Version: 1.4: member_seq
```
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Unnecessary here, and this broke the rust bindings:
error[E0588]: packed type cannot transitively contain a `#[repr(align)]` type
--> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:29025:1
|
29025 | pub struct bkey_i_inode_v3 {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^
|
note: `bch_inode_v3` has a `#[repr(align)]` attribute
--> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:8949:1
|
8949 | pub struct bch_inode_v3 {
| ^^^^^^^^^^^^^^^^^^^^^^^
error[E0588]: packed type cannot transitively contain a `#[repr(align)]` type
--> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:32826:1
|
32826 | pub struct bkey_inode_buf {
| ^^^^^^^^^^^^^^^^^^^^^^^^^
|
note: `bch_inode_v3` has a `#[repr(align)]` attribute
--> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:8949:1
|
8949 | pub struct bch_inode_v3 {
| ^^^^^^^^^^^^^^^^^^^^^^^
note: `bkey_inode_buf` contains a field of type `bkey_i_inode_v3`
--> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:32827:9
|
32827 | pub inode: bkey_i_inode_v3,
| ^^^^^
note: ...which contains a field of type `bch_inode_v3`
--> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:29027:9
|
29027 | pub v: bch_inode_v3,
| ^
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
highly damaged filesystems, or filesystems that have been damaged and
repair and damaged again, may have sequence numbers we can't fully trust
- which in itself is something we need to debug.
Add a journal_seq fallback so that repair doesn't get stuck.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This adds lockdep tracking for held btree locks with a single dep_map in
btree_trans, i.e. tracking all held btree locks as one object.
This is more practical and more useful than having lockdep track held
btree locks individually, because
- we can take more locks than lockdep can track (unbounded, now that we
have dynamically resizable btree paths)
- there's no lock ordering between btree locks for lockdep to track (we
do cycle detection)
- and this makes it easy to teach lockdep that btree locks are not safe
to hold while invoking memory reclaim.
The last rule is one that lockdep would never learn, because we only do
trylock() from within shrinkers - but we very much do not want to be
invoking memory reclaim while holding btree node locks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add a new helper to disable lockdep tracking entirely for a given class.
This is needed for bcachefs, which takes too many btree node locks for
lockdep to track. Instead, we have a single lockdep_map for "btree_trans
has any btree nodes locked", which makes more since given that we have
centralized lock management and a cycle detector.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
printing the raw values can occasionally be very useful
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Eliminate possible integer truncation bugs on 32 bit
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We're not always mounting when we start the filesystem
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This repurposes the promote path, which already knows how to call
data_update() after a read: we now automatically rewrite bad data when
we get a read error and then successfully retry from a different
replica.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
fsck passes read_only as a mount option, and it's required for
nochanges, which it also uses.
Usually read_only is handled by the VFS, but we need to be able to
handle it too; we just don't want to print it out twice, so mark it as a
hidden option.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|