[PATCH 03/23] Don't overwrite op.val_size on btree delete
Valerie Aurora
val at versity.com
Thu Apr 10 04:59:31 PDT 2025
On Wed, Apr 9, 2025 at 9:14 PM Zach Brown <zab at zabbo.net> wrote:
>
> On Fri, Apr 04, 2025 at 08:45:19PM +0200, Valerie Aurora wrote:
> > When the btree operation delete flag is set, the operation's val_size
> > is overwritten with the current size of the btree item. This causes a
> > bug on a simultaneous delete/insert btree operation in which the old
> > value size controls how many bytes of the new value are copied into
> > the value, instead of the new value size. Since delete-only btree
> > operations ignore op.val_size, just don't overwrite the operation
> > val_size at all.
>
> Deletion really doesn't ignore op.val_size. It's a shame that the diff
> context cut off when it did, 'cause the very next line shows it being
> passed into the split/merge logic. When deleting items, the block will
> be split if item population in the block falls under the low water mark
> after the deletion of an item who's size we describe by passing in
> op.val_size.
>
> Anyway, that's just the start of the problem with the combined
> insert/delete operation. Look just a bit further down and we see that
> it does either insert_item or delete_item, never both.
>
> Those item insertion and deletion functions expand or contract the item
> headers. That's expensive enough that we don't want to just call both
> when we're updating the value payload of an existing item.
>
> Additionally, we'd like to support increasing the size of the value.
> Put all this together and we get the outline of changes that I think
> it'd take to support changing the values of existing items:
>
> - "replacing" by specifying both insert and delete is confusing and
> potentially accident prone. It's probably time to turn the op into a
> naturally mutually exclusive enum with an additional _CHANGE_VALUE
> operation.
>
> - When setting the value, set op.val_size to the potential change in
> block space consumption by the new value. When setting a smaller
> value, it's rounding up the absolute difference to the value alignment.
> When setting a larger value, it's the total new value size. When
> The sizes are the same it's 0. (This will make sense after the
> next bullet item, promimse :)). This gets passed to the split/merge
> logic.
>
> - like insert_item and delete_item, make a specific change_item_value
> helper. This will only modify the existing item header to change the
> value. If the new value is smaller than the old, it can copy the value
> into the current value payload. And it might need to update the record
> of internal free space (see delete_item) for any unused space at the end
> of the old value. If the new value is greater than the old, it'd
> allocate a new value payload (see insert_item), point the existing item
> header at that, and update the free space accounting. The op block
> calling this would then be done with the item and increment the
> index, like the insertion op does.
>
> If you're excited about getting your hands dirty in the btree code
> you're welcome to take a swing at this. Or I can throw it together if
> you'd rather not.
Thanks for the detailed notes! Let me give it a swing, I'd like to
have a better grasp of the btree code, and I really like the idea of
having the replace operation, especially given your experience with
CPU usage by btree traversal in scoutfs.
Valerie
More information about the ngnfs-devel
mailing list