NVM and swap device
Stephen Hemminger
stephen at networkplumber.org
Fri Jan 15 10:18:19 PST 2016
On Fri, 15 Jan 2016 17:42:36 +0000
Keith Busch <keith.busch at intel.com> wrote:
> On Tue, Jan 12, 2016 at 07:40:30PM -0800, Stephen Hemminger wrote:
> > I have a nice shiny new Intel NVM PCI card; decided to use it for a filesystem and swap.
> > The filesystem (btrfs) is doing fine, but the swap device was throwing occasional
> > random errors. Suspect a driver problem rather than hardware.
> >
> > I am using 4.4 kernel without patches.
> >
> > kern.log:Jan 12 08:11:57 xeon-e3 kernel: [159474.037390] Read-error on swap-device (259:0:17597808)
> > kern.log.1:Jan 7 08:32:10 xeon-e3 kernel: [87938.855526] Read-error on swap-device (259:0:11355648)
> > kern.log.1:Jan 7 08:32:10 xeon-e3 kernel: [87938.855530] Read-error on swap-device (259:0:11355656)
> > kern.log.1:Jan 7 08:32:10 xeon-e3 kernel: [87939.855467] Read-error on swap-device (259:0:16180824)
> > kern.log.1:Jan 8 08:24:07 xeon-e3 kernel: [63670.777981] Read-error on swap-device (259:0:32690768)
> > kern.log.1:Jan 9 09:25:02 xeon-e3 kernel: [153720.919325] Read-error on swap-device (259:0:220488)
> > kern.log.1:Jan 9 16:40:05 xeon-e3 kernel: [179820.957675] Read-error on swap-device (259:0:24476232)
> > kern.log.1:Jan 9 16:40:05 xeon-e3 kernel: [179820.962673] Read-error on swap-device (259:0:33292816)
> >
> > The swap device was being added via /etc/fstab by UUID.
>
> If you don't have any further insights into the issue, could you check the
> device's health? I'd be surprised if there were a problem there, since
> you mentioned it's a new card, but I'd like to rule that out if the
> other testing has hit a dead end.
>
> For that, we need smart logs. There are various tools available that
> can read those logs. Here's an open source version:
>
> https://github.com/linux-nvme/nvme-cli
>
> Here's example output from one of my drives with the above tool:
>
> # nvme smart-log /dev/nvme0
> Smart Log for NVME device:/dev/nvme0 namespace-id:ffffffff
> critical_warning : 0
> temperature : 29 C
> available_spare : 100%
> available_spare_threshold : 10%
> percentage_used : 0%
> data_units_read : 577,600
> data_units_written : 3,182,404
> host_read_commands : 4,537,801
> host_write_commands : 18,713,235
> controller_busy_time : 17
> power_cycles : 1
> power_on_hours : 163
> unsafe_shutdowns : 1
> media_errors : 0
> num_err_log_entries : 0
I wanted to run more stress tests before reporting back.
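For gathering what's being asked for above, a minimal sketch, assuming the card shows up as /dev/nvme0 (adjust the device names to match the actual system):

# blkid | grep -i swap        (confirm which partition the fstab UUID resolves to)
# swapon --show               (show the swap device actually in use; `swapon -s` on older util-linux)
# nvme smart-log /dev/nvme0   (media_errors and num_err_log_entries are the fields of interest here)
# nvme error-log /dev/nvme0   (per-command error entries recorded by the controller)

A swap line by UUID in /etc/fstab takes the usual form; the UUID value below is hypothetical:

UUID=0f3a9e2c-1b7d-4c5e-9a6b-2d8e4f1c7a90  none  swap  sw  0  0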