[PATCH rdma-next 4/4] nvme-rdma: add more error details when a QP moves to an error state

Mark Zhang markzhang at nvidia.com
Tue Nov 1 02:12:17 PDT 2022


On 9/8/2022 1:39 AM, Leon Romanovsky wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Wed, Sep 07, 2022 at 05:18:18PM +0200, Christoph Hellwig wrote:
>> On Wed, Sep 07, 2022 at 06:16:05PM +0300, Sagi Grimberg wrote:
>>>>>
>>>>> This entire code needs to move to the rdma core instead
>>>>> of being leaked to ulps.
>>>>
>>>> We can move, but you will lose connection between queue number,
>>>> caller and error itself.
>>>
>>> That still doesn't explain why nvme-rdma is special.
>>>
>>> In any event, the ulp can log the qpn so the context can be interrogated
>>> if that is important.
>>
>> I also don't see why the QP event handler can't be called
>> from user context to start with.  I see absolutely no reason to
>> add boilerplate code to drivers for reporting slighly more verbose
>> errors on one specific piece of hrdware.  I'd say clean up the mess
>> that is the QP event handler first, and then once error reporting
>> becomes trivial we can just do it.
> 
> I don't know, Chuck documented it in 2018:
> eb93c82ed8c7 ("RDMA/core: Document QP @event_handler function")
> 
>    1164 struct ib_qp_init_attr {
>    1165         /* Consumer's event_handler callback must not block */
>    1166         void                  (*event_handler)(struct ib_event *, void *);

Looks like driver calls it in an atomic way, e.g.:
   mlx5_ib_qp_event() -> ibqp->event_handler(&event, ibqp->qp_context);

Could driver also report it as an IB async event, as WQ_CATAS_ERROR is 
defined as an async event (IB spec C11-39), and QP_FATAL is also an 
event of enum ib_event_type? Is it a problem that one event is reported 
twice?

If it is acceptable then ulp could register this event handler with 
ib_register_event_handler(), which is non-atomic.



More information about the Linux-nvme mailing list