Deadlock under load with Linux 5.9 and other recent kernels
Patrik Nilsson
nipatriknilsson at gmail.com
Mon Sep 28 07:06:04 EDT 2020
Hi!
To me this bug description is very similar to what I'm struggling with
on an amd64 platform.
When too much data is sent over USB, it seems that the USB control message
is delayed until it times out, and the block device is then unmounted.
I have been working on my related bug for a long time, trying to make it
easily reproducible, but have failed. It is there all the time. New
hardware is on its way so I can continue my testing.
Maybe you can test the patch I'm using to see if it works better for you?
In the meantime, here is my description of my bug:
> I have stress tested the usb system. Seven mechanical hard disks and
> two ssd disks are now connected over USB. Six processes are writing
> random data to the disks at the same time. One of them writes to the
> ssd disk that I couldn't write to before without it failing. The
> other usb-ssd disk holds my root partition.
>
> Before I applied the patch, my root partition sometimes failed to be
> kept mounted. Now I have not had any crashes.
>
> This is a quick fix for hard disks, but it works. It continued to work
> when I started three virtualbox guests and let them do work as well.
> The guests' hard disks are on my usb-root partition.
>
> It doesn't work if I also use my usb2ethernet adapter (ID 2001:4a00
> D-Link Corp.), although my root partition and two randomize tests
> survived. Maybe a much larger timeout would help in this case? But I
> don't consider that a good solution.
>
> The behavior is the same on the other (much slower) computer with a
> different usb hub. I have also tested it with exactly the same setup
> as earlier, with no mechanical hard disks, and it works with the patch
> and not without it.
Best regards,
Patrik
---start of diff---
diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 5b768b80d1ee..3c550934815c 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -105,7 +105,7 @@ MODULE_PARM_DESC(use_both_schemes,
DECLARE_RWSEM(ehci_cf_port_reset_rwsem);
EXPORT_SYMBOL_GPL(ehci_cf_port_reset_rwsem);
-#define HUB_DEBOUNCE_TIMEOUT 2000
+#define HUB_DEBOUNCE_TIMEOUT 10000
#define HUB_DEBOUNCE_STEP 25
#define HUB_DEBOUNCE_STABLE 100
diff --git a/include/linux/usb.h b/include/linux/usb.h
index 20c555db4621..e64d441bb78f 100644
--- a/include/linux/usb.h
+++ b/include/linux/usb.h
@@ -1841,8 +1841,8 @@ extern int usb_set_configuration(struct usb_device *dev, int configuration);
* USB identifies 5 second timeouts, maybe more in a few cases, and a few
* slow devices (like some MGE Ellipse UPSes) actually push that limit.
*/
-#define USB_CTRL_GET_TIMEOUT 5000
-#define USB_CTRL_SET_TIMEOUT 5000
+#define USB_CTRL_GET_TIMEOUT 10000
+#define USB_CTRL_SET_TIMEOUT 10000
/**
---end of diff---
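To illustrate what the first hunk is meant to change, here is a simplified user-space model of a debounce loop. It is only a sketch, not the kernel code: it assumes the hub driver polls the port every HUB_DEBOUNCE_STEP ms and gives up once HUB_DEBOUNCE_TIMEOUT ms have passed without the connection staying stable for HUB_DEBOUNCE_STABLE ms. The names debounce_model(), port_connected_at() and settle_ms are hypothetical and exist only for this example.

```c
#include <stdbool.h>

/* Step and stable values as they appear in the diff context above (ms). */
#define HUB_DEBOUNCE_STEP     25
#define HUB_DEBOUNCE_STABLE  100

/*
 * Hypothetical stand-in for reading the port status: the port is assumed
 * to report a steady connection only from settle_ms onwards.
 */
static bool port_connected_at(int now_ms, int settle_ms)
{
	return now_ms >= settle_ms;
}

/*
 * Simplified model of the debounce loop: poll every HUB_DEBOUNCE_STEP ms
 * and succeed once the connection has been seen stable for
 * HUB_DEBOUNCE_STABLE ms. Returns the elapsed time in ms on success,
 * or -1 if timeout_ms runs out first.
 */
static int debounce_model(int settle_ms, int timeout_ms)
{
	int stable_ms = 0;

	for (int total_ms = 0; ; total_ms += HUB_DEBOUNCE_STEP) {
		if (port_connected_at(total_ms, settle_ms)) {
			stable_ms += HUB_DEBOUNCE_STEP;
			if (stable_ms >= HUB_DEBOUNCE_STABLE)
				return total_ms;
		} else {
			/* Connection not steady yet, restart the stable count. */
			stable_ms = 0;
		}
		if (total_ms >= timeout_ms)
			return -1;
	}
}
```

In this model, a port that needs 3000 ms to settle fails with the stock 2000 ms timeout (debounce_model(3000, 2000) returns -1) but passes with the patched 10000 ms value. The 3000 ms figure is made up for the example and is not a measurement from my system.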
On 28/09/2020 03:37, Christian Hewitt wrote:
>> On 26 Sep 2020, at 4:28 pm, Christian Hewitt <christianshewitt at gmail.com> wrote:
>>
>>> On 26 Sep 2020, at 4:13 pm, Jens Axboe <axboe at kernel.dk> wrote:
>>>
>>> On 9/26/20 5:55 AM, Christian Hewitt wrote:
>>>>> On 26 Sep 2020, at 2:51 pm, Jens Axboe <axboe at kernel.dk> wrote:
>>>>>
>>>>> On 9/26/20 1:55 AM, Christian Hewitt wrote:
>>>>>> I am using an ARM SBC device with Amlogic S922X chip (Beelink
>>>>>> GS-King-X, an Android STB) to boot the Kodi mediacentre distro
>>>>>> LibreELEC (which I work on) although the issue is also reproducible
>>>>>> with Manjaro and Armbian on the same hardware, and with the GT-King
>>>>>> and GT-King Pro devices from the same vendor - all three devices are
>>>>>> using a common dtsi:
>>>>>>
>>>>>> https://github.com/chewitt/linux/blob/amlogic-5.9-integ/arch/arm64/boot/dts/amlogic/meson-g12b-gsking-x.dts
>>>>>> https://github.com/chewitt/linux/blob/amlogic-5.9-integ/arch/arm64/boot/dts/amlogic/meson-g12b-gtking-pro.dts
>>>>>> https://github.com/chewitt/linux/blob/amlogic-5.9-integ/arch/arm64/boot/dts/amlogic/meson-g12b-gtking.dts
>>>>>> https://github.com/chewitt/linux/blob/amlogic-5.9-integ/arch/arm64/boot/dts/amlogic/meson-g12b-w400.dtsi
>>>>>>
>>>>>> I have schematics for the devices, but can only share those privately
>>>>>> on request.
>>>>>>
>>>>>> For testing I am booting LibreELEC from SD card. The box has a 4TB
>>>>>> SATA drive internally connected with a USB > SATA bridge, see dmesg:
>>>>>> http://ix.io/2yLh and I connect a USB stick with a 4GB ISO file that I
>>>>>> copy to the internal SATA drive. Within 10-20 seconds of starting the
>>>>>> copy the box deadlocks needing a hard power cycle to recover. The
>>>>>> timing of the deadlock is variable but the device _always_ deadlocks.
>>>>>> Although I am using a simple copy use-case, there are similar reports
>>>>>> in Armbian forums performing tasks like installs/updates that involve
>>>>>> I/O loads.
>>>>>>
>>>>>> Following advice in the #linux-amlogic IRC channel I added
>>>>>> CONFIG_SOFTLOCKUP_DETECTOR and CONFIG_DETECT_HUNG_TASK and was able to
>>>>>> get output on the HDMI screen (it is not possible to connect to UART
>>>>>> pins without destroying the box case). If you advance the following
>>>>>> video frame by frame in VLC you can see the output:
>>>>>>
>>>>>> https://www.dropbox.com/s/klvcizim8cs5lze/lockup_clip.mov?dl=0
>>>>> Try with this patch:
>>>>>
>>>>> https://lore.kernel.org/linux-block/20200925191902.543953-1-shakeelb@google.com/
>>>> It still locks up approx. 25 seconds into the copy operation. Here’s the output in video again (a little blurry):
>>>>
>>>> https://www.dropbox.com/s/3j2czaq509arg6g/lockup_clip2.mov?dl=0
>>> Can you try and set CONFIG_SLUB in your .config instead of CONFIG_SLAB?
>> CONFIG_SLUB is already set, here’s the full defconfig http://paste.ubuntu.com/p/5BNdZv6J3c/
>>
>> # dmesg | grep -i slub
>> [ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=6, Nodes=1
>>
>>> Also, just take a picture, should be easier to get readable than a video.
>>> And the static trace is all that is needed.
>> This is from a GT-King Pro which someone reminded me has a large RS232 port on the rear:
>>
>> https://pastebin.com/raw/sGtzgreN
> from 5.9-rc7 https://pastebin.com/raw/nbHJmrqe
>
> Christian
>
>
>
>
--
PGP-key fingerprint: 1B30 7F61 AF9E 538A FCD6 2BE7 CED7 B0E4 3BF9 8D6C