[PATCH rdma-next 4/4] nvme-rdma: add more error details when a QP moves to an error state
Mark Zhang
markzhang at nvidia.com
Tue Nov 1 02:12:17 PDT 2022
On 9/8/2022 1:39 AM, Leon Romanovsky wrote:
> External email: Use caution opening links or attachments
>
>
> On Wed, Sep 07, 2022 at 05:18:18PM +0200, Christoph Hellwig wrote:
>> On Wed, Sep 07, 2022 at 06:16:05PM +0300, Sagi Grimberg wrote:
>>>>>
>>>>> This entire code needs to move to the rdma core instead
>>>>> of being leaked to ulps.
>>>>
>>>> We can move, but you will lose connection between queue number,
>>>> caller and error itself.
>>>
>>> That still doesn't explain why nvme-rdma is special.
>>>
>>> In any event, the ulp can log the qpn so the context can be interrogated
>>> if that is important.
>>
>> I also don't see why the QP event handler can't be called
>> from user context to start with. I see absolutely no reason to
>> add boilerplate code to drivers for reporting slighly more verbose
>> errors on one specific piece of hrdware. I'd say clean up the mess
>> that is the QP event handler first, and then once error reporting
>> becomes trivial we can just do it.
>
> I don't know, Chuck documented it in 2018:
> eb93c82ed8c7 ("RDMA/core: Document QP @event_handler function")
>
> 1164 struct ib_qp_init_attr {
> 1165 /* Consumer's event_handler callback must not block */
> 1166 void (*event_handler)(struct ib_event *, void *);
Looks like driver calls it in an atomic way, e.g.:
mlx5_ib_qp_event() -> ibqp->event_handler(&event, ibqp->qp_context);
Could driver also report it as an IB async event, as WQ_CATAS_ERROR is
defined as an async event (IB spec C11-39), and QP_FATAL is also an
event of enum ib_event_type? Is it a problem that one event is reported
twice?
If it is acceptable then ulp could register this event handler with
ib_register_event_handler(), which is non-atomic.
More information about the Linux-nvme
mailing list