[PATCH 03/23] Don't overwrite op.val_size on btree delete

Valerie Aurora val at versity.com
Thu Apr 10 04:59:31 PDT 2025


On Wed, Apr 9, 2025 at 9:14 PM Zach Brown <zab at zabbo.net> wrote:
>
> On Fri, Apr 04, 2025 at 08:45:19PM +0200, Valerie Aurora wrote:
> > When the btree operation delete flag is set, the operation's val_size
> > is overwritten with the current size of the btree item. This causes a
> > bug on a simultaneous delete/insert btree operation in which the old
> > value size controls how many bytes of the new value are copied into
> > the value, instead of the new value size. Since delete-only btree
> > operations ignore op.val_size, just don't overwrite the operation
> > val_size at all.
>
> Deletion really doesn't ignore op.val_size.  It's a shame that the diff
> context cut off when it did, 'cause the very next line shows it being
> passed into the split/merge logic.  When deleting items, the block will
> be split if item population in the block falls under the low water mark
> after the deletion of an item who's size we describe by passing in
> op.val_size.
>
> Anyway, that's just the start of the problem with the combined
> insert/delete operation.  Look just a bit further down and we see that
> it does either insert_item or delete_item, never both.
>
> Those item insertion and deletion functions expand or contract the item
> headers.  That's expensive enough that we don't want to just call both
> when we're updating the value payload of an existing item.
>
> Additionally, we'd like to support increasing the size of the value.
> Put all this together and we get the outline of changes that I think
> it'd take to support changing the values of existing items:
>
>  - "replacing" by specifying both insert and delete is confusing and
>    potentially accident prone.  It's probably time to turn the op into a
>    naturally mutually exclusive enum with an additional _CHANGE_VALUE
>    operation.
>
>  - When setting the value, set op.val_size to the potential change in
>    block space consumption by the new value.  When setting a smaller
>    value, it's rounding up the absolute difference to the value alignment.
>    When setting a larger value, it's the total new value size.  When
>    The sizes are the same it's 0.  (This will make sense after the
>    next bullet item, promimse :)).  This gets passed to the split/merge
>    logic.
>
>  - like insert_item and delete_item, make a specific change_item_value
>    helper.  This will only modify the existing item header to change the
>    value.  If the new value is smaller than the old, it can copy the value
>    into the current value payload.  And it might need to update the record
>    of internal free space (see delete_item) for any unused space at the end
>    of the old value.  If the new value is greater than the old, it'd
>    allocate a new value payload (see insert_item), point the existing item
>    header at that, and update the free space accounting.  The op block
>    calling this would then be done with the item and increment the
>    index, like the insertion op does.
>
> If you're excited about getting your hands dirty in the btree code
> you're welcome to take a swing at this.  Or I can throw it together if
> you'd rather not.

Thanks for the detailed notes! Let me give it a swing, I'd like to
have a better grasp of the btree code, and I really like the idea of
having the replace operation, especially given your experience with
CPU usage by btree traversal in scoutfs.

Valerie



More information about the ngnfs-devel mailing list