[PATCH 03/23] Don't overwrite op.val_size on btree delete

Zach Brown zab at zabbo.net
Wed Apr 9 12:14:50 PDT 2025


On Fri, Apr 04, 2025 at 08:45:19PM +0200, Valerie Aurora wrote:
> When the btree operation delete flag is set, the operation's val_size
> is overwritten with the current size of the btree item. This causes a
> bug on a simultaneous delete/insert btree operation in which the old
> value size controls how many bytes of the new value are copied into
> the value, instead of the new value size. Since delete-only btree
> operations ignore op.val_size, just don't overwrite the operation
> val_size at all.

Deletion really doesn't ignore op.val_size.  It's a shame that the diff
context cut off when it did, 'cause the very next line shows it being
passed into the split/merge logic.  When deleting items, the block will
be split if item population in the block falls under the low water mark
after the deletion of an item who's size we describe by passing in
op.val_size.

Anyway, that's just the start of the problem with the combined
insert/delete operation.  Look just a bit further down and we see that
it does either insert_item or delete_item, never both.

Those item insertion and deletion functions expand or contract the item
headers.  That's expensive enough that we don't want to just call both
when we're updating the value payload of an existing item.

Additionally, we'd like to support increasing the size of the value.
Put all this together and we get the outline of changes that I think
it'd take to support changing the values of existing items:

 - "replacing" by specifying both insert and delete is confusing and
   potentially accident prone.  It's probably time to turn the op into a
   naturally mutually exclusive enum with an additional _CHANGE_VALUE
   operation.

 - When setting the value, set op.val_size to the potential change in
   block space consumption by the new value.  When setting a smaller
   value, it's rounding up the absolute difference to the value alignment.
   When setting a larger value, it's the total new value size.  When
   The sizes are the same it's 0.  (This will make sense after the
   next bullet item, promimse :)).  This gets passed to the split/merge
   logic.

 - like insert_item and delete_item, make a specific change_item_value
   helper.  This will only modify the existing item header to change the
   value.  If the new value is smaller than the old, it can copy the value
   into the current value payload.  And it might need to update the record
   of internal free space (see delete_item) for any unused space at the end
   of the old value.  If the new value is greater than the old, it'd
   allocate a new value payload (see insert_item), point the existing item
   header at that, and update the free space accounting.  The op block
   calling this would then be done with the item and increment the
   index, like the insertion op does.

If you're excited about getting your hands dirty in the btree code
you're welcome to take a swing at this.  Or I can throw it together if
you'd rather not.

- z



More information about the ngnfs-devel mailing list