linux.git/fs/jbd2/commit.c, branch v6.12.80

jbd2: flush filesystem device before updating tail sequence

2025-01-17T12:40:33+00:00

[ Upstream commit a0851ea9cd555c333795b85ddd908898b937c4e1 ]

When committing transaction in jbd2_journal_commit_transaction(), the
disk caches for the filesystem device should be flushed before updating
the journal tail sequence. However, this step is missed if the journal
is not located on the filesystem device. As a result, the filesystem may
become inconsistent following a power failure or system crash. Fix it by
ensuring that the filesystem device is flushed appropriately.

Fixes: 3339578f0578 ("jbd2: cleanup journal tail after transaction commit")
Signed-off-by: Zhang Yi 
Link: https://lore.kernel.org/r/20241203014407.805916-3-yi.zhang@huaweicloud.com
Reviewed-by: Jan Kara 
Signed-off-by: Christian Brauner 
Signed-off-by: Sasha Levin

jbd2: make jbd2_journal_get_max_txn_bufs() internal

2024-07-09T03:59:37+00:00

There's no reason to have jbd2_journal_get_max_txn_bufs() public
function. Currently all users are internal and can use
journal->j_max_transaction_buffers instead. This saves some unnecessary
recomputations of the limit as a bonus which becomes important as this
function gets more complex in the following patch.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara 
Reviewed-by: Zhang Yi 
Link: https://patch.msgid.link/20240624170127.3253-1-jack@suse.cz
Signed-off-by: Theodore Ts'o

jbd2: speed up jbd2_transaction_committed()

2024-07-05T16:45:36+00:00

jbd2_transaction_committed() is used to check whether a transaction with
the given tid has already committed, it holds j_state_lock in read mode
and check the tid of current running transaction and committing
transaction, but holding the j_state_lock is expensive.

We have already stored the sequence number of the most recently
committed transaction in journal t->j_commit_sequence, we could do this
check by comparing it with the given tid instead. If the given tid isn't
smaller than j_commit_sequence, we can ensure that the given transaction
has been committed. That way we could drop the expensive lock and
achieve about 10% ~ 20% performance gains in concurrent DIOs on may
virtual machine with 100G ramdisk.

fio -filename=/mnt/foo -direct=1 -iodepth=10 -rw=$rw -ioengine=libaio \
    -bs=4k -size=10G -numjobs=10 -runtime=60 -overwrite=1 -name=test \
    -group_reporting

Before:
  overwrite       IOPS=88.2k, BW=344MiB/s
  read            IOPS=95.7k, BW=374MiB/s
  rand overwrite  IOPS=98.7k, BW=386MiB/s
  randread        IOPS=102k, BW=397MiB/s

After:
  overwrite       IOPS=105k, BW=410MiB/s
  read            IOPS=112k, BW=436MiB/s
  rand overwrite  IOPS=104k, BW=404MiB/s
  randread        IOPS=111k, BW=432MiB/s

CC: Dave Chinner 
Suggested-by: Dave Chinner 
Link: https://lore.kernel.org/linux-ext4/ZjILCPNZRHeazSqV@dread.disaster.area/
Signed-off-by: Zhang Yi 
Reviewed-by: Jan Kara 
Reviewed-by: Ritesh Harjani (IBM) 
Link: https://patch.msgid.link/20240520131831.2910790-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o

jbd2: remove unused return info from jbd2_journal_write_metadata_buffer

2024-06-27T14:20:26+00:00

The done_copy_out info from jbd2_journal_write_metadata_buffer is not
used. Simply remove it.

Signed-off-by: Kemeng Shi 
Reviewed-by: Zhang Yi 
Reviewed-by: Jan Kara 
Link: https://patch.msgid.link/20240514112438.1269037-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o

jbd2: add prefix 'jbd2' for 'shrink_type'

2024-05-09T14:52:31+00:00

As 'shrink_type' is exported. The module prefix 'jbd2' is added to
distinguish from memory reclamation.

Signed-off-by: Ye Bin 
Reviewed-by: Jan Kara 
Reviewed-by: Zhang Yi 
Link: https://lore.kernel.org/r/20240407065355.1528580-3-yebin10@huawei.com
Signed-off-by: Theodore Ts'o

jbd2: use shrink_type type instead of bool type for __jbd2_journal_clean_checkpoint_list()

2024-05-09T14:52:31+00:00

"enum shrink_type" can clearly express the meaning of the parameter of
__jbd2_journal_clean_checkpoint_list(), and there is no need to use the
bool type.

Signed-off-by: Ye Bin 
Reviewed-by: Jan Kara 
Reviewed-by: Zhang Yi 
Link: https://lore.kernel.org/r/20240407065355.1528580-2-yebin10@huawei.com
Signed-off-by: Theodore Ts'o

jbd2: remove redundant assignement to variable err

2024-05-07T19:52:20+00:00

The variable err is being assigned a value that is never read, it
is being re-assigned inside the following while loop and also
after the while loop. The assignment is redundant and can be
removed.

Cleans up clang scan build warning:
fs/jbd2/commit.c:574:2: warning: Value stored to 'err' is never
read [deadcode.DeadStores]

Signed-off-by: Colin Ian King 
Reviewed-by: Jan Kara 
Link: https://lore.kernel.org/r/20240410112803.232993-1-colin.i.king@gmail.com
Signed-off-by: Theodore Ts'o

jbd2: fix soft lockup in journal_finish_inode_data_buffers()

2023-12-12T15:25:46+00:00

There's issue when do io test:
WARN: soft lockup - CPU#45 stuck for 11s! [jbd2/dm-2-8:4170]
CPU: 45 PID: 4170 Comm: jbd2/dm-2-8 Kdump: loaded Tainted: G  OE
Call trace:
 dump_backtrace+0x0/0x1a0
 show_stack+0x24/0x30
 dump_stack+0xb0/0x100
 watchdog_timer_fn+0x254/0x3f8
 __hrtimer_run_queues+0x11c/0x380
 hrtimer_interrupt+0xfc/0x2f8
 arch_timer_handler_phys+0x38/0x58
 handle_percpu_devid_irq+0x90/0x248
 generic_handle_irq+0x3c/0x58
 __handle_domain_irq+0x68/0xc0
 gic_handle_irq+0x90/0x320
 el1_irq+0xcc/0x180
 queued_spin_lock_slowpath+0x1d8/0x320
 jbd2_journal_commit_transaction+0x10f4/0x1c78 [jbd2]
 kjournald2+0xec/0x2f0 [jbd2]
 kthread+0x134/0x138
 ret_from_fork+0x10/0x18

Analyzed informations from vmcore as follows:
(1) There are about 5k+ jbd2_inode in 'commit_transaction->t_inode_list';
(2) Now is processing the 855th jbd2_inode;
(3) JBD2 task has TIF_NEED_RESCHED flag;
(4) There's no pags in address_space around the 855th jbd2_inode;
(5) There are some process is doing drop caches;
(6) Mounted with 'nodioread_nolock' option;
(7) 128 CPUs;

According to informations from vmcore we know 'journal->j_list_lock' spin lock
competition is fierce. So journal_finish_inode_data_buffers() maybe process
slowly. Theoretically, there is scheduling point in the filemap_fdatawait_range_keep_errors().
However, if inode's address_space has no pages which taged with PAGECACHE_TAG_WRITEBACK,
will not call cond_resched(). So may lead to soft lockup.
journal_finish_inode_data_buffers
  filemap_fdatawait_range_keep_errors
    __filemap_fdatawait_range
      while (index <= end)
        nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end, PAGECACHE_TAG_WRITEBACK);
        if (!nr_pages)
           break;    --> If 'nr_pages' is equal zero will break, then will not call cond_resched()
        for (i = 0; i < nr_pages; i++)
          wait_on_page_writeback(page);
        cond_resched();

To solve above issue, add scheduling point in the journal_finish_inode_data_buffers();

Signed-off-by: Ye Bin 
Reviewed-by: Jan Kara 
Link: https://lore.kernel.org/r/20231211112544.3879780-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o

jbd2: increase the journal IO's priority

2023-12-01T04:27:39+00:00

Current jbd2 only add REQ_SYNC for descriptor block, metadata log
buffer, commit buffer and superblock buffer, the submitted IO could be
throttled by writeback throttle in block layer, that could lead to
priority inversion in some cases. The log IO looks like a kind of high
priority metadata IO, so it should not be throttled by WBT like QOS
policies in block layer, let's add REQ_SYNC | REQ_IDLE to exempt from
writeback throttle, and also add REQ_META together indicates it's a
metadata IO.

Signed-off-by: Zhang Yi 
Reviewed-by: Jan Kara 
Link: https://lore.kernel.org/r/20231129114740.2686201-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o

jbd2: Remove page size assumptions

2023-09-07T19:17:02+00:00

jbd2_alloc() allocates a buffer from slab when the block size is smaller
than PAGE_SIZE, and slab may be using a compound page.  Before commit
8147c4c4546f, we set b_page to the precise page containing the buffer
and this code worked well.  Now we set b_page to the head page of the
allocation, so we can no longer use offset_in_page().  While we could
do a 1:1 replacement with offset_in_folio(), use the more idiomatic
bh_offset() and the folio APIs to map the buffer.

This isn't enough to support a b_size larger than PAGE_SIZE on HIGHMEM
machines, but this is good enough to fix the actual bug we're seeing.

Fixes: 8147c4c4546f ("jbd2: use a folio in jbd2_journal_write_metadata_buffer()")
Reported-by: Zorro Lang 
Signed-off-by: Ritesh Harjani (IBM) 
[converted to be more folio]
Signed-off-by: Matthew Wilcox (Oracle)