ahci_imx: can read corrupted data successfully
Russell King - ARM Linux
linux at arm.linux.org.uk
Mon Aug 24 03:55:01 PDT 2015
I've no idea how this is happening, but it appears that it's possible
to read corrupted data over an eSATA link.
The setup is a Solid-run Cubox-i4pro connected to an external eSATA 2.5"
enclosure with a Corsair Neutron 128GB SSD.
The underlying cause seems to be the generally poor quality of eSATA cables
(I have two eSATA enclosures, and the other enclosure's eSATA cable
interferes with a Logitech wireless receiver, so I've had to wrap that
cable in aluminium foil). However, even with a poor cable it shouldn't be
possible to successfully read faulty data, and that is what I find most
worrying.
What I've noticed is that at boot the SSD is sometimes detected properly
and sometimes not:
[reboot at 10am]
ata1: softreset failed (device not ready)
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-8: Corsair Neutron SSD, M311, max UDMA/133
ata1.00: 250069680 sectors, multi 1: LBA48 NCQ (depth 31/32)
ata1.00: configured for UDMA/133
scsi 0:0:0:0: Direct-Access ATA Corsair Neutron M311 PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 250069680 512-byte logical blocks: (128 GB/119 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO
sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
[half an hour passes and a reboot later]
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-8: , , max UDMA/133
ata1.00: 250069680 sectors, multi 1: LBA48
ata1.00: configured for UDMA/133
scsi 0:0:0:0: Direct-Access ATA n/a PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 250069680 512-byte logical blocks: (128 GB/119 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO
sd 0:0:0:0: [sda] Attached SCSI disk
[replugging the eSATA connector]
ata1: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0xe frozen
ata1: irq_stat 0x00000040, connection status changed
ata1: SError: { RecovComm PHYRdyChg CommWake DevExch }
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-8: Corsair Neutron SSD, M311, max UDMA/133
ata1.00: 250069680 sectors, multi 1: LBA48 NCQ (depth 31/32)
ata1.00: configured for UDMA/133
ata1: EH complete
scsi 0:0:0:0: Direct-Access ATA Corsair Neutron M311 PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 250069680 512-byte logical blocks: (128 GB/119 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO
sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
[another reboot]
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-8: Corsair Neutron SSD, M311, max UDMA/133
ata1.00: 250069680 sectors, multi 1: LBA48 NCQ (depth 31/32)
ata1.00: configured for UDMA/133
scsi 0:0:0:0: Direct-Access ATA Corsair Neutron M311 PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 250069680 512-byte logical blocks: (128 GB/119 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO
sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
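The SErr value in the replug exception (0x4050002) is the AHCI PxSERR register, and the SError line is libata's decoding of it. As a minimal sketch, that decoding can be reproduced in C. The bit positions follow the SATA/AHCI SError layout and the short names are the ones libata's error handler prints, but the table and the decode_serror()/append() helpers are hypothetical, written for illustration rather than taken from libata. A CRC error on a received FIS would be flagged as BadCRC (bit 21).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One SError (AHCI PxSERR) bit and the short name libata prints for it.
 * This table is a re-implementation for illustration, not libata's code. */
struct serr_bit {
    uint32_t mask;
    const char *name;
};

static const struct serr_bit serr_bits[] = {
    { 1u << 0,  "RecovData" },   /* ERR.I: recovered data integrity error */
    { 1u << 1,  "RecovComm" },   /* ERR.M: recovered communications error */
    { 1u << 8,  "UnrecovData" }, /* ERR.T: transient data integrity error */
    { 1u << 9,  "Persist" },     /* ERR.C: persistent comm/data integrity error */
    { 1u << 10, "Proto" },       /* ERR.P: protocol error */
    { 1u << 11, "HostInt" },     /* ERR.E: host internal error */
    { 1u << 16, "PHYRdyChg" },   /* DIAG.N: PhyRdy changed */
    { 1u << 17, "PHYInt" },      /* DIAG.I: PHY internal error */
    { 1u << 18, "CommWake" },    /* DIAG.W: COMWAKE detected */
    { 1u << 19, "10B8B" },       /* DIAG.B: 10b-to-8b decode error */
    { 1u << 20, "Dispar" },      /* DIAG.D: running disparity error */
    { 1u << 21, "BadCRC" },      /* DIAG.C: CRC error in a received FIS */
    { 1u << 22, "Handshk" },     /* DIAG.H: R_ERR received for a sent frame */
    { 1u << 23, "LinkSeq" },     /* DIAG.S: link sequence error */
    { 1u << 24, "TrStaTrns" },   /* DIAG.T: transport state transition error */
    { 1u << 25, "UnrecFIS" },    /* DIAG.F: unknown FIS type */
    { 1u << 26, "DevExch" },     /* DIAG.X: device exchanged (hotplug) */
};

/* Append s to out if it fits, keeping out NUL-terminated; text that does
 * not fit is silently dropped. */
static void append(char *out, size_t size, const char *s)
{
    size_t len = strlen(out), add = strlen(s);
    if (len + add < size)
        memcpy(out + len, s, add + 1);
}

/* Render serror as "{ Name Name ... }", the format used in the log. */
static void decode_serror(uint32_t serror, char *out, size_t size)
{
    assert(out != NULL && size > 0);
    out[0] = '\0';
    append(out, size, "{ ");
    for (size_t i = 0; i < sizeof serr_bits / sizeof serr_bits[0]; i++) {
        if (serror & serr_bits[i].mask) {
            append(out, size, serr_bits[i].name);
            append(out, size, " ");
        }
    }
    append(out, size, "}");
}
```

For 0x4050002 this yields "{ RecovComm PHYRdyChg CommWake DevExch }", matching the log: the bits a hotplug would be expected to set, with none of the data-integrity bits present.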
When the SSD is mis-detected, as in the second boot above, the partition
table isn't recognised (there is no "sda: sda1 sda2 sda3" line). Reading
sector 0 shows:
00000000 18 f0 9f e5 18 f0 9f e5 18 f0 9f e5 18 f0 9f e5 |................|
00000010 18 f0 9f e5 00 00 a0 e1 14 f0 9f e5 14 f0 9f e5 |................|
00000020 f8 09 00 00 3c 00 00 00 40 00 00 00 1c f2 00 00 |....<...@.......|
00000030 04 f2 00 00 5c 12 00 00 90 12 00 00 fe ff ff ea |....\...........|
00000040 fe ff ff ea 00 00 00 ea da 0d 00 ea 28 00 8f e2 |............(...|
00000050 00 0c 90 e8 00 a0 8a e0 01 70 4a e2 00 b0 8b e0 |.........pJ.....|
which is ARM code. It looks like early boot code, of the kind normally
found in the vector page.
What should be there is the partition table:
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000001b0 00 00 00 00 00 00 00 00 bf 64 9b 0b 00 00 00 20 |.........d..... |
000001c0 21 00 83 53 09 0a 00 08 00 00 00 80 02 00 00 53 |!..S...........S|
000001d0 0a 0a 83 a8 0a 1e 00 88 02 00 00 00 00 01 00 a8 |................|
000001e0 0b 1e 83 1d 3f ce 00 88 02 01 b0 3a e5 0d 00 00 |....?......:....|
000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 55 aa |..............U.|
The SSD has had no ARM executables copied onto it since it was
purchased. It contains a DOS partition and swap space, so ARM code could
conceivably have been swapped out to it. However, grepping the swap
partition for the initial 8-byte sequence (18 f0 9f e5 18 f0 9f e5)
finds nothing.
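Grepping the swap partition amounts to a raw byte search. The following is a minimal sketch in C, assuming the partition is opened as an ordinary stream (e.g. fopen("/dev/sdaN", "rb"), where the device name is a placeholder). find_bytes() is a hypothetical helper. It reads in 64 KiB chunks and carries the last len - 1 bytes over, so a match straddling a chunk boundary isn't missed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Return the stream offset of the first occurrence of pat[0..len), or -1.
 * Patterns up to 64 bytes are supported. */
static long long find_bytes(FILE *f, const uint8_t *pat, size_t len)
{
    enum { CHUNK = 1 << 16 };
    uint8_t buf[CHUNK + 64];
    size_t keep = 0;          /* bytes carried over from the previous chunk */
    long long base = 0;       /* stream offset of buf[0] */

    assert(f != NULL && pat != NULL && len > 0 && len <= 64);
    for (;;) {
        size_t got = fread(buf + keep, 1, CHUNK, f);
        size_t have = keep + got;
        if (got == 0)
            return -1;
        /* naive scan; memmem() would do, but it is a GNU extension */
        for (size_t i = 0; i + len <= have; i++)
            if (memcmp(buf + i, pat, len) == 0)
                return base + (long long)i;
        /* keep the tail that could still begin a match in the next chunk */
        keep = have < len - 1 ? have : len - 1;
        memmove(buf, buf + have - keep, keep);
        base += (long long)(have - keep);
    }
}
```

A match at any byte offset is found, so a negative result doesn't depend on how the code might have been aligned within a swapped-out page.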
Another possibility is that we didn't actually read any data from the SSD
at all, but ended up copying it from somewhere else. As I said above, it
looks like what one would expect to find in the ARM vector page. It
doesn't match either the iMX6 ROM or the uboot image on the SD card, and
it's not the kernel's vectors either. It could be the SSD firmware (I
haven't checked, but I wouldn't be surprised if the SSD has an ARM CPU
on it), but that would point towards a very weird failure of the SSD,
unless it is receiving corrupted commands and nothing verifies them.
Without a firmware image for the SSD (which isn't going to be possible
to get hold of) it's impossible to know.
What concerns me is that this incorrect data was, allegedly, successfully
read from the device. What would happen if a write were to occur? What
would we be writing to?
Also, the identify information is clearly screwed, although some of it
does seem to be correct: the capacity (250069680 sectors) matches, but
the model and firmware strings come back empty and NCQ isn't reported.
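The fields libata prints at detection time come from the 256-word IDENTIFY DEVICE data. The model number is words 27-46, the firmware revision is words 23-26, NCQ support is word 76 bit 8, and the 48-bit capacity occupies words 100-103. The following is a minimal sketch of extracting them, assuming that layout; the helper names are hypothetical, not libata's. Each word stores two characters, with the first character in the high byte.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Word offsets into the 256-word IDENTIFY DEVICE data. */
enum {
    ID_FW_REV    = 23,   /* firmware revision, 4 words */
    ID_PROD      = 27,   /* model number, 20 words */
    ID_SATA_CAP  = 76,   /* bit 8: NCQ supported */
    ID_LBA48_CAP = 100,  /* words 100-103: 48-bit addressable sectors */
};

/* Copy a string field of `words` words starting at `first` into out,
 * which must hold 2 * words + 1 bytes.  Trailing spaces are trimmed. */
static void id_string(const uint16_t *id, int first, int words, char *out)
{
    assert(id != NULL && out != NULL && words > 0);
    int n = 0;
    for (int i = 0; i < words; i++) {
        out[n++] = (char)(id[first + i] >> 8);    /* first char: high byte */
        out[n++] = (char)(id[first + i] & 0xff);  /* second char: low byte */
    }
    out[n] = '\0';
    n = (int)strlen(out);              /* stop at an embedded NUL, if any */
    while (n > 0 && out[n - 1] == ' ')
        out[--n] = '\0';
}

/* 48-bit sector count, least significant word first. */
static uint64_t id_lba48_sectors(const uint16_t *id)
{
    return (uint64_t)id[ID_LBA48_CAP] |
           (uint64_t)id[ID_LBA48_CAP + 1] << 16 |
           (uint64_t)id[ID_LBA48_CAP + 2] << 32 |
           (uint64_t)id[ID_LBA48_CAP + 3] << 48;
}

static int id_has_ncq(const uint16_t *id)
{
    return (id[ID_SATA_CAP] >> 8) & 1;
}
```

All-space or zeroed string words decode to empty strings, which would produce the ", ," seen in the mis-detected boot, while the capacity words can still be intact. That pattern is consistent with only part of the IDENTIFY data arriving correctly, though it doesn't show where the corruption happened.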
How does SATA ensure command and data integrity over the link? I'd
assume that there's a CRC present on the data, like UDMA on PATA. How
are CRC errors supposed to be reported? Is it possible that ahci_imx
and other layers are not properly checking for CRC errors?
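For what it's worth, SATA does protect every FIS with a 32-bit CRC at the link layer. The generator is the same polynomial as Ethernet's CRC-32 (0x04C11DB7), and the register starts at 0x52325032. A receiver that sees a bad CRC answers with R_ERR rather than R_OK. On an AHCI host this surfaces in PxSERR, as the BadCRC and Handshk bits in libata's report. The sketch below computes such a CRC over a FIS given as 32-bit Dwords. It assumes each Dword is fed most significant bit first with no final inversion, which should be checked against the SATA specification before relying on it. sata_crc() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SATA_CRC_POLY 0x04C11DB7u  /* same generator as Ethernet CRC-32 */
#define SATA_CRC_INIT 0x52325032u  /* initial register value for SATA */

/* Bitwise CRC over n Dwords.  Assumption of this sketch: each Dword is
 * processed MSB first and no final XOR is applied. */
static uint32_t sata_crc(const uint32_t *dw, size_t n)
{
    assert(dw != NULL || n == 0);
    uint32_t crc = SATA_CRC_INIT;
    for (size_t i = 0; i < n; i++) {
        crc ^= dw[i];
        for (int b = 0; b < 32; b++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ SATA_CRC_POLY : crc << 1;
    }
    return crc;
}
```

With this formulation, appending the CRC to the Dwords and recomputing gives zero, which is the receiver-side check, and a single flipped bit always changes the result. It cannot, however, catch a device that returns the wrong data with a valid CRC.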
Any ideas what to look at?
Anyone got any suggestions on where to get a good quality, but not
stupidly expensive eSATA cable from?
I'm waiting for it to happen again, and I'll dump out more of the drive's
"contents" when the cable is bad. If it is the drive's firmware, it
should contain the manufacturer name/model somewhere in the image.
--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.