[PATCH v10 1/2] scsi: ufs: Enable power management for wlun
Asutosh Das (asd)
asutoshd at codeaurora.org
Fri Mar 19 00:40:39 GMT 2021
On 3/18/2021 12:16 PM, Adrian Hunter wrote:
> On 18/03/21 7:58 pm, Asutosh Das (asd) wrote:
>> On 3/18/2021 10:54 AM, Rafael J. Wysocki wrote:
>>> On Thu, Mar 18, 2021 at 6:33 PM Asutosh Das (asd)
>>> <asutoshd at codeaurora.org> wrote:
>>>>
>>>> On 3/18/2021 7:00 AM, Rafael J. Wysocki wrote:
>>>>> On Wed, Mar 17, 2021 at 7:37 AM Adrian Hunter <adrian.hunter at intel.com> wrote:
>>>>>>
>>>>>> On 16/03/21 10:35 pm, Asutosh Das (asd) wrote:
>>>>>>> On 3/16/2021 12:48 AM, Adrian Hunter wrote:
>>>>>>>> On 16/03/21 12:22 am, Asutosh Das (asd) wrote:
>>>>>>>>> On 3/14/2021 1:11 AM, Adrian Hunter wrote:
>>>>>>>>>> On 10/03/21 5:04 am, Asutosh Das (asd) wrote:
>>>>>>>>>>> On 3/9/2021 7:56 AM, Asutosh Das (asd) wrote:
>>>>>>>>>>>> On 3/8/2021 9:17 AM, Rafael J. Wysocki wrote:
>>>>>>>>>>>>> On Mon, Mar 8, 2021 at 5:21 PM Rafael J. Wysocki <rafael at kernel.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Mar 6, 2021 at 5:17 PM Alan Stern <stern at rowland.harvard.edu> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Mar 05, 2021 at 06:54:24PM -0800, Asutosh Das (asd) wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Now during my testing I see a weird issue sometimes (1 in 7).
>>>>>>>>>>>>>>>> Scenario - bootups
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Issue:
>>>>>>>>>>>>>>>> The supplier 'ufs_device_wlun 0:0:0:49488' goes into runtime suspend even
>>>>>>>>>>>>>>>> when one/more of its consumers are in RPM_ACTIVE state.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Log:
>>>>>>>>>>>>>>>> [ 10.056379][ T206] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
>>>>>>>>>>>>>>>> [ 10.062497][ T113] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
>>>>>>>>>>>>>>>> [ 10.356600][ T32] sd 0:0:0:7: [sdh] Synchronizing SCSI cache
>>>>>>>>>>>>>>>> [ 10.362944][ T174] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
>>>>>>>>>>>>>>>> [ 10.696627][ T83] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
>>>>>>>>>>>>>>>> [ 10.704562][ T170] sd 0:0:0:6: [sdg] Synchronizing SCSI cache
>>>>>>>>>>>>>>>> [ 10.980602][ T5] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /** Printing all the consumer nodes of supplier **/
>>>>>>>>>>>>>>>> [ 10.987327][ T5] ufs_device_wlun 0:0:0:49488: usage-count @ suspend: 0
>>>>>>>>>>>>>>>> <-- this is the usage_count
>>>>>>>>>>>>>>>> [ 10.994440][ T5] ufs_rpmb_wlun 0:0:0:49476: PM state - 2
>>>>>>>>>>>>>>>> [ 11.000402][ T5] scsi 0:0:0:49456: PM state - 2
>>>>>>>>>>>>>>>> [ 11.005453][ T5] sd 0:0:0:0: PM state - 2
>>>>>>>>>>>>>>>> [ 11.009958][ T5] sd 0:0:0:1: PM state - 2
>>>>>>>>>>>>>>>> [ 11.014469][ T5] sd 0:0:0:2: PM state - 2
>>>>>>>>>>>>>>>> [ 11.019072][ T5] sd 0:0:0:3: PM state - 2
>>>>>>>>>>>>>>>> [ 11.023595][ T5] sd 0:0:0:4: PM state - 0 << RPM_ACTIVE
>>>>>>>>>>>>>>>> [ 11.353298][ T5] sd 0:0:0:5: PM state - 2
>>>>>>>>>>>>>>>> [ 11.357726][ T5] sd 0:0:0:6: PM state - 2
>>>>>>>>>>>>>>>> [ 11.362155][ T5] sd 0:0:0:7: PM state - 2
>>>>>>>>>>>>>>>> [ 11.366584][ T5] ufshcd-qcom 1d84000.ufshc: __ufshcd_wl_suspend - 8709
>>>>>>>>>>>>>>>> [ 11.374366][ T5] ufs_device_wlun 0:0:0:49488: __ufshcd_wl_suspend -
>>>>>>>>>>>>>>>> (0) has rpm_active flags
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you mean that rpm_active of the link between the consumer and the
>>>>>>>>>>>>>> supplier is greater than 0 at this point and the consumer is
>>>>>>>>>>>>>
>>>>>>>>>>>>> I mean is rpm_active of the link greater than 1 (because 1 means "no
>>>>>>>>>>>>> active references to the supplier")?
>>>>>>>>>>>> Hi Rafael:
>>>>>>>>>>>> No - it is not greater than 1.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm trying to understand what's going on in it; will update when I've something.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> RPM_ACTIVE, but the supplier suspends successfully nevertheless?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [ 11.383376][ T5] ufs_device_wlun 0:0:0:49488:
>>>>>>>>>>>>>>>> ufshcd_wl_runtime_suspend <-- Supplier suspends fine.
>>>>>>>>>>>>>>>> [ 12.977318][ T174] sd 0:0:0:4: [sde] Synchronizing SCSI cache
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And the the suspend of sde is stuck now:
>>>>>>>>>>>>>>>> schedule+0x9c/0xe0
>>>>>>>>>>>>>>>> schedule_timeout+0x40/0x128
>>>>>>>>>>>>>>>> io_schedule_timeout+0x44/0x68
>>>>>>>>>>>>>>>> wait_for_common_io+0x7c/0x100
>>>>>>>>>>>>>>>> wait_for_completion_io+0x14/0x20
>>>>>>>>>>>>>>>> blk_execute_rq+0x90/0xcc
>>>>>>>>>>>>>>>> __scsi_execute+0x104/0x1c4
>>>>>>>>>>>>>>>> sd_sync_cache+0xf8/0x2a0
>>>>>>>>>>>>>>>> sd_suspend_common+0x74/0x11c
>>>>>>>>>>>>>>>> sd_suspend_runtime+0x14/0x20
>>>>>>>>>>>>>>>> scsi_runtime_suspend+0x64/0x94
>>>>>>>>>>>>>>>> __rpm_callback+0x80/0x2a4
>>>>>>>>>>>>>>>> rpm_suspend+0x308/0x614
>>>>>>>>>>>>>>>> pm_runtime_work+0x98/0xa8
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I added 'DL_FLAG_RPM_ACTIVE' while creating links.
>>>>>>>>>>>>>>>> if (hba->sdev_ufs_device) {
>>>>>>>>>>>>>>>> link = device_link_add(&sdev->sdev_gendev,
>>>>>>>>>>>>>>>> &hba->sdev_ufs_device->sdev_gendev,
>>>>>>>>>>>>>>>> DL_FLAG_PM_RUNTIME|DL_FLAG_RPM_ACTIVE);
>>>>>>>>>>>>>>>> I didn't expect this to resolve the issue anyway and it didn't.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Another interesting point here is when I resume any of the above suspended
>>>>>>>>>>>>>>>> consumers, it all goes back to normal, which is kind of expected. I tried
>>>>>>>>>>>>>>>> resuming the consumer and the supplier is resumed and the supplier is
>>>>>>>>>>>>>>>> suspended when all the consumers are suspended.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Any pointers on this issue please?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Bart/@Alan - Do you've any pointers please?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's very noticeable that although you seem to have isolated a bug in
>>>>>>>>>>>>>>> the power management subsystem (supplier goes into runtime suspend
>>>>>>>>>>>>>>> even when one of its consumers is still active), you did not CC the
>>>>>>>>>>>>>>> power management maintainer or mailing list.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have added the appropriate CC's.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks Alan!
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hello
>>>>>>>>>>> I & Can (thanks CanG) debugged this further:
>>>>>>>>>>>
>>>>>>>>>>> Looks like this issue can occur if the sd probe is asynchronous.
>>>>>>>>>>>
>>>>>>>>>>> Essentially, the sd_probe() is done asynchronously and driver_probe_device() invokes pm_runtime_get_suppliers() before invoking sd_probe().
>>>>>>>>>>>
>>>>>>>>>>> But scsi_probe_and_add_lun() runs in a separate context.
>>>>>>>>>>> So the scsi_autopm_put_device() invoked from scsi_scan_host() context reduces the link->rpm_active to 1. And sd_probe() invokes scsi_autopm_put_device() and starts a timer. And then driver_probe_device() invoked from __device_attach_async_helper context reduces the link->rpm_active to 1 thus enabling the supplier to suspend before the consumer suspends.
>>>>>>>>>>>
>>>>>>>>>>> So if:
>>>>>>>>>>> Context T1:
>>>>>>>>>>> [1] scsi_probe_and_add_lun()
>>>>>>>>>>> [2] |- scsi_autopm_put_device() - reduce the link->rpm_active to 1
>>>>>>>>>>>
>>>>>>>>>>> Context T2:
>>>>>>>>>>> __device_attach_async_helper()
>>>>>>>>>>> |- driver_probe_device()
>>>>>>>>>>> |- sd_probe()
>>>>>>>>>>> In between [1] and [2] say, driver_probe_device() -> sd_probe() is invoked in a separate context from __device_attach_async_helper().
>>>>>>>>>>> The driver_probe_device() -> pm_runtime_get_suppliers() but [2] would reduce link->rpm_active to 1.
>>>>>>>>>>> Then sd_probe() would invoke rpm_resume() and proceed as is.
>>>>>>>>>>> When sd_probe() invokes scsi_autopm_put_device() it'd start a timer, dev->power.timer_autosuspends = 1.
>>>>>>>>>>>
>>>>>>>>>>> Now then, pm_runtime_put_suppliers() is invoked from driver_probe_device() and that makes the link->rpm_active = 1.
>>>>>>>>>>> But by now, the corresponding 'sd dev' (consumer) usage_count = 0, state = RPM_ACTIVE and link->rpm_active = 1.
>>>>>>>>>>> At this point of time, all other 'sd dev' (consumers) _may_ be suspended or active but would have the link->rpm_active = 1.
>>>>>>>>>>
>>>>>>>>>> Is this with DL_FLAG_RPM_ACTIVE? In that case, wouldn't active
>>>>>>>>>> consumers have link->rpm_active = 2 and also have incremented
>>>>>>>>>> the supplier's usage_count?
>>>>>>>
>>>>>>> Yes this is with DL_FLAG_RPM_ACTIVE.
>>>>>>>
>>>>>>> Please let me share a log here:
>>>>>>> BEF means - Before, AFT means After.
>>>>>>>
>>>>>>> [ 6.843445][ T7] scsi 0:0:0:4: [UFSDBG]: ufshcd_setup_links:4779: supp: usage_cnt: 3 Link - 0:0:0:49488 link-rpm_active: 2 avail_luns: 5
>>>>>>> [ 6.892545][ T7] scsi 0:0:0:4: pm_runtime_get_suppliers: (0:0:0:49488): supp: usage_count: 5 rpm_active: 4
>>>>>>>
>>>>>>> In the above log, T7 is the context in which this scsi device is being added - scsi_sysfs_add_sdev()
>>>>>>>
>>>>>>> [ 6.931846][ T7] ufs_rpmb_wlun 0:0:0:4: [UFSDBG]: ufshcd_rpmb_probe:9692: invoked
>>>>>>> [ 6.941246][ T7] scsi 0:0:0:4: pm_runtime_put_suppliers: rpm_active: 4
>>>>>>>
>>>>>>> [ 6.941246][ T7] scsi 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [BEF] usage_count: 5
>>>>>>> [ 6.941247][ T7] scsi 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [AFT] usage_count: 4 rpm_active: 3
>>>>>>>
>>>>>>> [ 6.941267][ T7] scsi 0:0:0:4: rpm_put_suppliers: [BEF] Supp (0:0:0:49488) usage_count: 4 rpm_active: 3
>>>>>>>
>>>>>>> ------ T196 Context comes in while T7 is running ----------
>>>>>>> [ 6.941466][ T196] scsi 0:0:0:4: pm_runtime_get_suppliers: (0:0:0:49488): supp: usage_count: 5 rpm_active: 4
>>>>>>> --------------------------------------------------------------
>>>>>>>
>>>>>>> [ 7.788397][ T7] scsi 0:0:0:4: rpm_put_suppliers: [AFT] Supp (0:0:0:49488) usage_count: 2 rpm_active: 1
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> T196 is the context in which sd_probe() is invoked for this scsi device.
>>>>>>>
>>>>>>> [ 7.974410][ T196] sd 0:0:0:4: [sde] Attached SCSI disk
>>>>>>> [ 7.984188][ T196] sd 0:0:0:4: pm_runtime_put_suppliers: rpm_active: 2
>>>>>>> [ 7.998424][ T196] sd 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [BEF] usage_count: 4
>>>>>>> [ 8.017320][ T196] sd 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [AFT] usage_count: 1 rpm_active: 1
>>>>>>>
>>>>>>> The reference to the link is released after sd_probe() is completed.
>>>>>>> At this point, the rpm_active should be 2. And the rpm_active should become 1 when sd 0:0:0:4 actually suspends. But at the end of sd_probe() the suspend is only scheduled. However the supplier is now free to suspend.
>>>>>>>
>>>>>>> In this log, the usage_count of supplier becomes 0 here:
>>>>>>> [ 11.963885][ T117] sd 0:0:0:7: rpm_put_suppliers: [BEF] Supp (0:0:0:49488) usage_count: 1 rpm_active: 2
>>>>>>> [ 11.973821][ T117] sd 0:0:0:7: rpm_put_suppliers: [AFT] Supp (0:0:0:49488) usage_count: 0 rpm_active: 1
>>>>>>>
>>>>>>> However, the consumer sd 0:0:0:4 is still active but has released the reference to the supplier:
>>>>>>
>>>>>> If that is the case, then it is an error in PM not UFS.
>>>>>>
>>>>>> A second look at the code around rpm_put_suppliers() does look
>>>>>> potentially racy, since there does not appear to be anything
>>>>>> stopping the runtime_status changing between
>>>>>> spin_unlock_irq(&dev->power.lock) and device_links_read_lock().
>>>>>>
>>>>>> Rafael, can you comment?
>>>>>
>>>>> Indeed, if the device is suspending, changing its PM-runtime status to
>>>>> RPM_SUSPENDED and dropping its power.lock allows a concurrent
>>>>> rpm_resume() to run and resume the device before the suppliers can be
>>>>> suspended.
>>>>>
>>>>> That's incorrect and has been introduced by commit 44cc89f76464 ("PM:
>>>>> runtime: Update device status before letting suppliers suspend").
>>>>>
>>>>> It is probably better to revert that commit and address the original
>>>>> issue in a different way.
>>>>>
>>>> Hello,
>>>> One approach to address the original issue could be to prevent the scsi
>>>> devices from suspending until the probe is completed, perhaps?
>>>
>>> I was talking about the original issue that commit 44cc89f76464
>>> attempted to address.
>>>
>>> I'm not sure if and how it is related to the issue you have been debugging.
>>>
>> Hi Rafael
>> Thanks for clarifying that.
>> Understood.
>> I was referring to the issue that I've been discussing with Adrian.
>
> For test purposes, you could try reverting 44cc89f76464, making the
> other changes to the UFS driver, and see if the device_links issue
> goes away.
>
Hi Adrian
Thanks for the review and comments.
I addressed the comments and tested it. I still see the device_links
issue. I've pushed the change that I tested as v12.
# grep -H . /sys/bus/scsi/devices/*/power/runtime_status
<
/sys/bus/scsi/devices/0:0:0:0/power/runtime_status:suspending <
/sys/bus/scsi/devices/0:0:0:1/power/runtime_status:suspended
/sys/bus/scsi/devices/0:0:0:2/power/runtime_status:suspended
/sys/bus/scsi/devices/0:0:0:3/power/runtime_status:suspending <
/sys/bus/scsi/devices/0:0:0:4/power/runtime_status:suspending <
/sys/bus/scsi/devices/0:0:0:49456/power/runtime_status:suspended
/sys/bus/scsi/devices/0:0:0:49476/power/runtime_status:suspended
/sys/bus/scsi/devices/0:0:0:49488/power/runtime_status:suspended <
I think it couldn't resolve the issue because we're not stopping the
scsi devices from suspending after the async probe is scheduled.
-asd
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
Linux Foundation Collaborative Project
More information about the Linux-mediatek
mailing list