[PATCH rdma-next 4/4] nvme-rdma: add more error details when a QP moves to an error state

Mark Zhang markzhang at nvidia.com
Tue Nov 1 18:56:42 PDT 2022


On 11/1/2022 5:12 PM, Mark Zhang wrote:
> External email: Use caution opening links or attachments
> 
> 
> On 9/8/2022 1:39 AM, Leon Romanovsky wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On Wed, Sep 07, 2022 at 05:18:18PM +0200, Christoph Hellwig wrote:
>>> On Wed, Sep 07, 2022 at 06:16:05PM +0300, Sagi Grimberg wrote:
>>>>>>
>>>>>> This entire code needs to move to the rdma core instead
>>>>>> of being leaked to ulps.
>>>>>
>>>>> We can move, but you will lose connection between queue number,
>>>>> caller and error itself.
>>>>
>>>> That still doesn't explain why nvme-rdma is special.
>>>>
>>>> In any event, the ulp can log the qpn so the context can be 
>>>> interrogated
>>>> if that is important.
>>>
>>> I also don't see why the QP event handler can't be called
>>> from user context to start with.  I see absolutely no reason to
>>> add boilerplate code to drivers for reporting slighly more verbose
>>> errors on one specific piece of hrdware.  I'd say clean up the mess
>>> that is the QP event handler first, and then once error reporting
>>> becomes trivial we can just do it.
>>
>> I don't know, Chuck documented it in 2018:
>> eb93c82ed8c7 ("RDMA/core: Document QP @event_handler function")
>>
>>    1164 struct ib_qp_init_attr {
>>    1165         /* Consumer's event_handler callback must not block */
>>    1166         void                  (*event_handler)(struct ib_event 
>> *, void *);
> 
> Looks like driver calls it in an atomic way, e.g.:
>    mlx5_ib_qp_event() -> ibqp->event_handler(&event, ibqp->qp_context);
> 
> Could driver also report it as an IB async event, as WQ_CATAS_ERROR is
> defined as an async event (IB spec C11-39), and QP_FATAL is also an
> event of enum ib_event_type? Is it a problem that one event is reported
> twice?
> 
> If it is acceptable then ulp could register this event handler with
> ib_register_event_handler(), which is non-atomic.

Or move qp event handler to non-atomic as Christoph suggested? This 
means to fix the mlx4/mlx5 driver, to call it in a work queue.



More information about the Linux-nvme mailing list