[PATCH 1/1] PMFS: Add experimental Persistent Memory Block Driver
Nicholas Moulin
nicholas.w.moulin at linux.intel.com
Thu May 9 15:08:09 EDT 2013
From: Nicholas Moulin <nicholas.w.moulin at linux.intel.com>
Initial version of PMBD, the persistent memory block driver
This patch has been rebased to Linux 3.9.
Signed-off-by: Nicholas Moulin <nicholas.w.moulin at intel.com>
---
Documentation/blockdev/00-INDEX | 2 +
Documentation/blockdev/pmbd.txt | 185 ++
drivers/block/Kconfig | 10 +
drivers/block/Makefile | 2 +
drivers/block/pmbd.c | 4541 +++++++++++++++++++++++++++++
include/linux/pmbd.h | 509 ++++
diff --git a/Documentation/blockdev/00-INDEX b/Documentation/blockdev/00-INDEX
index c08df56..2e8f5b2 100644
--- a/Documentation/blockdev/00-INDEX
+++ b/Documentation/blockdev/00-INDEX
@@ -16,3 +16,5 @@ paride.txt
- information about the parallel port IDE subsystem.
ramdisk.txt
- short guide on how to set up and use the RAM disk.
+pmbd.txt
+ - information about Persistent Memory Block Driver.
diff --git a/Documentation/blockdev/pmbd.txt b/Documentation/blockdev/pmbd.txt
new file mode 100644
index 0000000..244820f
--- /dev/null
+++ b/Documentation/blockdev/pmbd.txt
@@ -0,0 +1,185 @@
+===============================================================================
+ INTEL PERSISTENT MEMORY BLOCK DRIVER (PMBD) v0.9
+===============================================================================
+
+This software implements a block device driver for persistent memory (PM).
+This module provides a block-based logical interface to manage PM that is
+physically attached to the system memory bus.
+
+The assumed architecture is as follows. Both DRAM and PM DIMMs are directly
+attached to the host memory bus. The PM space is presented to the operating
+system as a contiguous range of physical memory addresses at the high end.
+
+There are three major design considerations: (1) Data protection - private
+mapping is used to prevent stray pointers (from kernel/driver bugs) from
+accidentally corrupting persistent PM data. (2) Data persistence - non-temporal
+store and fence instructions are used to leverage the processor store buffer
+and avoid polluting the CPU cache. (3) Write ordering - write barriers are
+supported to ensure a correct ordering of writes.
+
+This module also includes other (experimental) features, such as PM speed
+emulation, checksums for page integrity, partial page updates, and write
+verification. Please refer to the module help page (modinfo pmbd) for details.
+
+
+===============================================================================
+ COMPILING AND INSTALLING THE PMBD DRIVER
+===============================================================================
+
+1. Compile the PMBD driver:
+
+ $ make
+
+2. Install the PMBD driver:
+
+ $ sudo make install
+
+3. Check available driver information:
+
+ $ modinfo pmbd
+
+===============================================================================
+ QUICK USER'S GUIDE TO THE PMBD DRIVER
+===============================================================================
+
+1. Modify /etc/grub.conf to set the physical memory address range that
+ is to be simulated as PM.
+
+ Add the following to the boot option line:
+
+ memmap=<PM_SIZE_GB>G$<DRAM_SIZE_GB>G numa=off
+
+ NOTE:
+
+ PM_SIZE_GB - the PM space size (in GBs)
+ DRAM_SIZE_GB - the DRAM space size (in GBs)
+
+ Example:
+
+ Assuming a total memory capacity of 24GB, to use 16GB as PM and 8GB as
+ DRAM, the option should be "memmap=16G$8G".
+
+2. Reboot and check if the memory size is set as expected.
+
+ $ sudo reboot
+ $ free
+
+3. Load the device driver module
+
+ Load the driver module into the kernel with private mapping, non-temp store,
+ and write barrier enabled (*** RECOMMENDED CONFIG ***):
+
+ $ modprobe pmbd mode="pmbd<PM_SIZE_GB>;hmo<DRAM_SIZE_GB>;hms<PM_SIZE_GB>; \
+ pmapY;ntsY;wbY;"
+
+ Check the kernel message output:
+
+ $ dmesg
+
+ After loading the module, a block device (/dev/pma) should appear. From
+ this point on, it can be used like any other block device (e.g., with
+ fdisk, mkfs, etc.).
+
+4. Unload the device driver
+
+ $ rmmod pmbd
+
+===============================================================================
+ OTHER CONFIGURATION OPTIONS OF THE PERSISTENT MEMORY DEVICE DRIVER MODULE
+===============================================================================
+
+usage: $ modprobe pmbd mode="pmbd<#>;hmo<#>;hms<#>;[Option1];[Option2];..."
+
+GENERAL OPTIONS:
+ pmbd<#,#..> set pmbd size (GBs)
+ HM|VM use high memory (HM default) or vmalloc (VM)
+ hmo<#> high memory starting offset (GB)
+ hms<#> high memory size (GBs)
+ pmap<Y|N> use private mapping (Y) or not (N default) - (note: must
+ enable HM and wrprotN)
+ nts<Y|N> use non-temporal store (MOVNTQ) and sfence to do memcpy (Y),
+ or regular memcpy (N default)
+ wb<Y|N> use write barrier (Y) or not (N default)
+ fua<Y|N> use WRITE_FUA (Y default) or not (N)
+ ntl<Y|N> use non-temporal load (MOVNTDQA) to do memcpy (Y), or
+ regular memcpy (N default) - this option enforces memory type
+ of write combining
+
+
+SIMULATION:
+ simmode<#,#..> apply the simulated speed to the whole device (0 default)
+ or to the PM space only (1)
+ rdlat<#,#..> set read access latency (ns)
+ wrlat<#,#..> set write access latency (ns)
+ rdbw<#,#..> set read bandwidth (MB/sec) (if set 0, no emulation)
+ wrbw<#,#..> set write bandwidth (MB/sec) (if set 0, no emulation)
+ rdsx<#,#..> set the relative slowdown (x) for read
+ wrsx<#,#..> set the relative slowdown (x) for write
+ rdpause<#,.> set a pause (cycles per 4KB) for each read
+ wrpause<#,.> set a pause (cycles per 4KB) for each write
+ adj<#> set an adjustment to the system overhead (nanoseconds)
+
+WRITE PROTECTION:
+ wrprot<Y|N> use write protection for PM pages? (Y or N)
+ wpmode<#,#,..> write protection mode: use the PTE change (0 default) or flip
+ CR0/WP bit (1)
+ clflush<Y|N> use clflush to flush CPU cache for each write to PM space?
+ (Y or N)
+ wrverify<Y|N> use write verification for PM pages? (Y or N)
+ checksum<Y|N> use checksum to protect PM pages? (Y or N)
+ bufsize<#,#,..> the buffer size (MBs) (0 - no buffer; at least 4MB)
+ bufnum<#> the number of buffers for a PMBD device (default 16; at least
+ 1 if buffering is used, 0 - no buffer)
+ bufstride<#> the number of contiguous blocks (4KB) mapped into one buffer
+ (bucket size for round-robin mapping) (default 1024)
+ batch<#,#> the batch size (num of pages) for flushing PMBD buffer (1 means
+ no batching)
+
+MISC:
+ mgb<Y|N> mergeable? (Y or N)
+ lock<Y|N> lock the on-access page to serialize accesses? (Y or N)
+ cache<WB|WC|UC> use which CPU cache policy? Write Back (WB), Write
+ Combining (WC), or Uncacheable (UC)
+ subupdate<Y|N> only update the changed cachelines of a page? (Y or N) (check
+ PMBD_CACHELINE_SIZE)
+ timestat<Y|N> enable the detailed timing statistics (/proc/pmbd/pmbdstat)?
+ This will cause significant performance slowdown (Y or N)
+
+NOTE:
+ (1) The rdlat/wrlat options only specify minimum access times. Real access
+ times can be higher.
+ (2) If rdsx/wrsx is specified, the rdlat/wrlat/rdbw/wrbw settings are
+ ignored.
+ (3) Option simmode1 applies the simulated speed to the PM space only,
+ rather than to the whole device, which may include a buffer.
+
+WARNING:
+ (1) When using simmode1 to simulate a slow PM space, soft-lockup warnings
+ may appear. Use the "nosoftlockup" boot option to disable them.
+ (2) Enabling timestat may cause performance degradation.
+ (3) FUA is supported, but if a buffer is used (for PTE-based protection),
+ enabling FUA lowers performance due to double writes.
+ (4) Changing CPU-cache-related PTE attributes is not supported for VM-based
+ PMBD (it causes RCU stalls).
+
+PROC ENTRIES:
+ /proc/pmbd/pmbdcfg: config info about the PMBD devices
+ /proc/pmbd/pmbdstat: statistics of the PMBD devices (if timestat is enabled)
+
+EXAMPLE:
+ Assuming a 16GB PM space with physical memory addresses from 8GB to 24GB:
+ (1) Basic (Ramdisk):
+ $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;"
+
+ (2) Protected (with private mapping):
+ $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;pmapY;"
+
+ (3) Protected and synced (with private mapping, non-temp store):
+ $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;pmapY;ntsY;"
+
+ (4) *** RECOMMENDED CONFIGURATION ***
+ Protected, synced, and ordered (with private mapping, nt-store, write
+ barrier):
+ $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;pmapY;ntsY;wbY;"
+
+
+
+
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index b81ddfe..47dbb6d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -540,5 +540,15 @@ config BLK_DEV_RSXX
To compile this driver as a module, choose M here: the
module will be called rsxx.
+
+config BLK_DEV_PMBD
+ tristate "Persistent Memory Block Driver"
+ depends on m
+
+ default n
+ help
+ Say M here if you want to include the Persistent Memory Block Driver.
+
+ If unsure, say N.
endif # BLK_DEV
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index a3b4023..6ac1cbe 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -42,4 +42,6 @@ obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX) += mtip32xx/
obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
+obj-$(CONFIG_BLK_DEV_PMBD) += pmbd.o
+
swim_mod-y := swim.o swim_asm.o
diff --git a/drivers/block/pmbd.c b/drivers/block/pmbd.c
new file mode 100644
index 0000000..62d61f7
--- /dev/null
+++ b/drivers/block/pmbd.c
@@ -0,0 +1,4541 @@
+/*
+ * Intel Persistent Memory Block Driver
+ * Copyright (c) <2011-2013>, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+/*
+ * Intel Persistent Memory Block Driver (v0.9)
+ *
+ * Parts derived with changes from drivers/block/brd.c, lib/crc32.c, and
+ * arch/x86/lib/mmx_32.c
+ *
+ * Intel Corporation <linux-pmbd at intel.com>
+ * 03/24/2011
+ *
+ * Authors
+ * 2013 - Released the open-source version 0.9 (fchen)
+ * 2012 - Ported to Linux 3.2.1 (fchen)
+ * 2011 - Feng Chen (Intel) implemented version 1 of PMBD for Linux 2.6.34.
+ */
+
+
+/*
+ *******************************************************************************
+ * Persistent Memory Block Device Driver
+ *
+ * USAGE:
+ * % sudo modprobe pmbd mode="pmbd<#>;hmo<#>;hms<#>;[OPTION1];[OPTION2];..."
+ *
+ * GENERAL OPTIONS:
+ * - pmbd<#,..>: a sequence of integer numbers setting PMBD device sizes (in
+ * units of GBs). For example, mode="pmbd4,1" means creating a
+ * 4GB and a 1GB PMBD device (/dev/pma and /dev/pmb).
+ *
+ * - HM|VM: choose two types of PMBD devices
+ * - VM: vmalloc() based
+ * - HM: HIGH_MEM based (default)
+ * - In /boot/grub/grub.conf, add "mem=<n>G memmap=<m>G$<n>G"
+ * to reserve the high m GBs for PM, starting from offset n
+ * GBs in physical memory
+ *
+ * - hmo<#>: if HM is set, setting the starting physical mem address
+ * (in units of GBs).
+ *
+ * - hms<#>: if HM is set, setting the remapping memory size (in GBs)
+ *
+ * - pmap<Y|N> set private mapping (Y) or not (N default). Uses
+ * pmap_atomic_pfn() to dynamically map/unmap each
+ * to-be-accessed PM page for protection purposes.
+ * This option must be used with HM enabled, and the "mem"
+ * option must be removed from the Linux boot options.
+ *
+ * - nts<Y|N> set non-temporal store/sfence (Y) or not (N default).
+ *
+ * - wb<Y|N>: use write barrier (Y) or not (N default)
+ *
+ * - fua<Y|N> use WRITE_FUA (Y default) or not (N)
+ * FUA with PT-based protection (with buffer) incurs
+ * double-write overhead
+ *
+ * SIMULATION OPTIONS:
+ *
+ * - simmode<#,#..> set the simulation mode for each PMBD device
+ * - 0 for simulating the whole device
+ * - 1 for simulating the PM space only
+ * Note that simulating the PM space may trigger system
+ * soft-lockup warnings. To disable them, add "nosoftlockup"
+ * to the boot options.
+ *
+ * - rdlat<#,#..>: a sequence of integer numbers setting emulated read
+ * latencies (in units of nanoseconds) for reading each
+ * sector. Each number corresponds to a device. Default
+ * value is 0.
+ *
+ * - wrlat<#,#..>: set emulated write access latencies (see rdlat)
+ *
+ * - rdbw<#,#..>: a sequence of integer numbers setting emulated read
+ * bandwidth (in units of MB/sec) for reading each sector.
+ * Each number corresponds to a device. Default value is 0;
+ *
+ * - wrbw<#,#..>: set emulated write bandwidth (see rdbw)
+ *
+ * - rdsx<#,#..>: set the slowdown ratio (x) for reads as compared to DRAM
+ *
+ * - wrsx<#,#..>: set the slowdown ratio (x) for writes as compared to DRAM
+ *
+ * - rdpause<#,#..>: set the injected delay (cycles per page) for reads
+ * (not an emulation target; it simply injects a latency
+ * for each read of a page)
+ *
+ * - wrpause<#,#..>: set the injected delay (cycles per page) for writes
+ * (not an emulation target; it simply injects a latency
+ * for each write of a page).
+ *
+ * - adj<#>: compensate for the estimated system overhead. The default
+ * is 4us, but this can vary from system to system.
+ *
+ * WRITE PROTECTION:
+ *
+ * - wrprot<Y|N>: provide write protection for the PM space by setting its
+ * pages read-only (default: N).
+ * This option is incompatible with pmap.
+ *
+ * - wpmode<#,#,..> write protection mode: use the PTE change (0 default) or
+ * switch CR0/WP bit (1)
+ *
+ * - wrverify<Y|N>: read out the data for verification after writing into PM
+ * space
+ *
+ * - clflush<Y|N>: flush CPU cache or not (default: N)
+ *
+ * - checksum<Y|N>: use checksum to provide further protection from data
+ * corruption (default: N)
+ *
+ * - lock<Y|N>: lock the on-access PM page to serialize accesses
+ * (default: Y)
+ *
+ * - bufsize<#,#,#.#...> -- the buffer size in MBs (for speeding up write
+ * protection) 0 means no buffer, minimum size is 16 MBs
+ *
+ * - bufnum<#> the number of buffers for a pmbd device (16 buffers, at
+ * least 1 if using buffering, 0 will disable buffer mode)
+ *
+ * - bufstride<#> the number of contiguous blocks(4KB) mapped into one
+ * buffer (the bucket size for round-robin mapping)
+ * (1024 in default)
+ *
+ * - batch<#,#> the batch size (num of pages) for flushing PMBD buffer (1
+ * means no batching)
+ *
+ * MISC OPTIONS:
+ *
+ * - subupdate<Y|N> only update changed cachelines of a page (check
+ * PMBD_CACHELINE_SIZE, default: N)
+ *
+ * - mgb<Y|N>: setting mergeable or not (default: Y)
+ *
+ * - cache<WB|WC|UM|UC>:
+ * WB -- write back (both read/write caching; default)
+ * WC -- write combining (writes buffered and combined)
+ * UM -- uncacheable minus (UC-)
+ * UC -- uncacheable
+ * Changing CPU cache flags is not supported
+ * with vmalloc() based PMBD
+ *
+ * - timestat<Y|N> enable the detailed timing statistics (/proc/pmbd/pmbdstat) or
+ * not (default: N). This will cause significant performance loss.
+ *
+ * EXAMPLE:
+ * mode="pmbd2,1;rdlat100,2000;wrlat500,4000;rdbw100,100;wrbw100,100;HM;hmo4;hms3;
+ * mgbY;clflushY;cacheWB;wrprotY;wrverifyY;checksumY;lockY;bufsize16,0;
+ * subupdateY;"
+ *
+ * Explanation: Create two PMBD devices, /dev/pma (2GB) and /dev/pmb (1GB).
+ * Insert 100ns and 500ns for reading and writing a sector on /dev/pma,
+ * respectively, and 2000ns and 4000ns for reading and writing a sector on
+ * /dev/pmb. Limit the read/write bandwidth of both devices to 100MB/sec.
+ * No system overhead adjustment is applied. Use 3GB of high memory for
+ * the PMBD devices, starting at the 4GB physical memory address. Make the
+ * devices mergeable, use write-back caching and flush the CPU cache for
+ * the PM space, write-protect the PM space by marking it read-only,
+ * verify each write by reading back the written data, use checksums to
+ * protect the PM space, use spinlocks to guard against corruption from
+ * concurrent accesses, give the first device a 16MB buffer and the second
+ * device none, and enable sub-page updates.
+ *
+ * NOTE:
+ * - We can create no more than 26 devices, 4 partitions each.
+ *
+ * FIXME:
+ * (1) We use an unoccupied major device num (261) temporarily
+ *******************************************************************************
+ */
+
+#include <linux/init.h>
+#include <linux/version.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/major.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <asm/uaccess.h>
+#include <linux/time.h>
+#include <asm/timer.h>
+#include <linux/cpufreq.h>
+#include <linux/crc32.h>
+#include <linux/string.h>
+#include <linux/ctype.h>
+#include <linux/kthread.h>
+#include <linux/sort.h>
+#include <linux/timex.h>
+#include <linux/proc_fs.h>
+#include <asm/tlbflush.h>
+#include <asm/i387.h>
+#include <asm/asm.h>
+#include <linux/pmbd.h>
+#include <linux/delay.h>
+
+/* device configs */
+static int max_part = 4; /* maximum num of partitions */
+static int part_shift = 0; /* partition shift */
+static LIST_HEAD(pmbd_devices); /* device list */
+static DEFINE_MUTEX(pmbd_devices_mutex); /* device mutex */
+
+/* /proc file system entry */
+static struct proc_dir_entry* proc_pmbd = NULL;
+static struct proc_dir_entry* proc_pmbdstat = NULL;
+static struct proc_dir_entry* proc_pmbdcfg = NULL;
+
+/* pmbd device default configuration */
+static unsigned g_pmbd_type = PMBD_CONFIG_HIGHMEM; /* vmalloc(PMBD_CONFIG_VMALLOC) or reserve highmem (PMBD_CONFIG_HIGHMEM default) */
+static unsigned g_pmbd_pmap = FALSE; /* use pmap_atomic() to map/unmap space on demand */
+static unsigned g_pmbd_nts = FALSE; /* use non-temporal store (movntq) */
+static unsigned g_pmbd_wb = FALSE; /* use write barrier */
+static unsigned g_pmbd_fua = TRUE; /* use fua support (Linux 3.2.1) */
+static unsigned g_pmbd_mergeable = TRUE; /* mergeable or not */
+static unsigned g_pmbd_cpu_cache_clflush= FALSE; /* flush CPU cache or not*/
+static unsigned g_pmbd_wr_protect = FALSE; /* flip PTE R/W bits for write protection */
+static unsigned g_pmbd_wr_verify = FALSE; /* read out written data for verification */
+static unsigned g_pmbd_checksum = FALSE; /* do checksum on PM data */
+static unsigned g_pmbd_lock = TRUE; /* do spinlock on accessing a PM page */
+static unsigned g_pmbd_subpage_update = FALSE; /* do subpage update (only write changed content) */
+static unsigned g_pmbd_timestat = FALSE; /* do a detailed timestamp breakdown statistics */
+static unsigned g_pmbd_ntl = FALSE; /* use non-temporal load (movntdqa)*/
+static unsigned long g_pmbd_cpu_cache_flag = _PAGE_CACHE_WB; /* CPU cache flag (default - write back) */
+
+/* high memory configs */
+static unsigned long g_highmem_size = 0; /* size of the reserved physical mem space (bytes) */
+static phys_addr_t g_highmem_phys_addr = 0; /* beginning of the reserved phy mem space (bytes)*/
+static void* g_highmem_virt_addr = NULL; /* beginning of the reserve HIGH_MEM space */
+static void* g_highmem_curr_addr = NULL; /* beginning of the available HIGH_MEM space for alloc*/
+
+/* module parameters */
+static unsigned g_pmbd_nr = 0; /* num of PMBD devices */
+static unsigned long long g_pmbd_size[PMBD_MAX_NUM_DEVICES]; /* PMBD device sizes in units of GBs */
+static unsigned long long g_pmbd_rdlat[PMBD_MAX_NUM_DEVICES]; /* access latency for read (nanosecs) */
+static unsigned long long g_pmbd_wrlat[PMBD_MAX_NUM_DEVICES]; /* access latency for write nanosecs) */
+static unsigned long long g_pmbd_rdbw[PMBD_MAX_NUM_DEVICES]; /* bandwidth for read (MB/sec) */
+static unsigned long long g_pmbd_wrbw[PMBD_MAX_NUM_DEVICES]; /* bandwidth for write (MB/sec)*/
+static unsigned long long g_pmbd_rdsx[PMBD_MAX_NUM_DEVICES]; /* read slowdown (x) */
+static unsigned long long g_pmbd_wrsx[PMBD_MAX_NUM_DEVICES]; /* write slowdown (x)*/
+static unsigned long long g_pmbd_rdpause[PMBD_MAX_NUM_DEVICES]; /* read pause (cycles per page) */
+static unsigned long long g_pmbd_wrpause[PMBD_MAX_NUM_DEVICES]; /* write pause (cycles per page)*/
+static unsigned long long g_pmbd_simmode[PMBD_MAX_NUM_DEVICES]; /* simulating PM space (1) or the whole device (0 default) */
+static unsigned long long g_pmbd_adjust_ns = 0; /* nanosec of adjustment to offset system overhead */
+static unsigned long long g_pmbd_rammode[PMBD_MAX_NUM_DEVICES]; /* do write optimization or not */
+static unsigned long long g_pmbd_bufsize[PMBD_MAX_NUM_DEVICES]; /* the buffer size (in MBs) */
+static unsigned long long g_pmbd_buffer_batch_size[PMBD_MAX_NUM_DEVICES]; /* the batch size (num of pages) for flushing PMBD buffer */
+static unsigned long long g_pmbd_wpmode[PMBD_MAX_NUM_DEVICES]; /* write protection mode: PTE change (0 default) and CR0 Switch (1)*/
+
+static unsigned long long g_pmbd_num_buffers = 0; /* number of individual buffers */
+static unsigned long long g_pmbd_buffer_stride = 1024; /* number of contiguous PBNs belonging to the same buffer */
+
+/* definition of functions */
+static inline uint64_t cycle_to_ns(uint64_t cycle);
+static inline void sync_slowdown_cycles(uint64_t cycles);
+static uint64_t emul_start(PMBD_DEVICE_T* pmbd, int num_sectors, int rw);
+static uint64_t emul_end(PMBD_DEVICE_T* pmbd, int num_sectors, int rw, uint64_t start);
+
+/*
+ * *************************************************************************
+ * parse module parameters functions
+ * *************************************************************************
+ */
+static char *mode = "";
+module_param(mode, charp, 0444);
+MODULE_PARM_DESC(mode, USAGE_INFO);
+
+/* print pmbd configuration info */
+static void pmbd_print_conf(void)
+{
+ int i;
+#ifndef CONFIG_X86
+ printk(KERN_INFO "pmbd: running on a non-x86 platform, check ioremap()...\n");
+#endif
+ printk(KERN_INFO "pmbd: cacheline_size=%d\n", PMBD_CACHELINE_SIZE);
+ printk(KERN_INFO "pmbd: PMBD_SECTOR_SIZE=%lu, PMBD_PAGE_SIZE=%lu\n", PMBD_SECTOR_SIZE, PMBD_PAGE_SIZE);
+ printk(KERN_INFO "pmbd: g_pmbd_type = %s\n", PMBD_USE_VMALLOC()? "VMALLOC" : "HIGH_MEM");
+ printk(KERN_INFO "pmbd: g_pmbd_mergeable = %s\n", PMBD_IS_MERGEABLE()? "YES" : "NO");
+ printk(KERN_INFO "pmbd: g_pmbd_cpu_cache_clflush = %s\n", PMBD_USE_CLFLUSH()? "YES" : "NO");
+ printk(KERN_INFO "pmbd: g_pmbd_cpu_cache_flag = %s\n", PMBD_CPU_CACHE_FLAG());
+ printk(KERN_INFO "pmbd: g_pmbd_wr_protect = %s\n", PMBD_USE_WRITE_PROTECTION()? "YES" : "NO");
+ printk(KERN_INFO "pmbd: g_pmbd_wr_verify = %s\n", PMBD_USE_WRITE_VERIFICATION()? "YES" : "NO");
+ printk(KERN_INFO "pmbd: g_pmbd_checksum = %s\n", PMBD_USE_CHECKSUM()? "YES" : "NO");
+ printk(KERN_INFO "pmbd: g_pmbd_lock = %s\n", PMBD_USE_LOCK()? "YES" : "NO");
+ printk(KERN_INFO "pmbd: g_pmbd_subpage_update = %s\n", PMBD_USE_SUBPAGE_UPDATE()? "YES" : "NO");
+ printk(KERN_INFO "pmbd: g_pmbd_adjust_ns = %llu ns\n", g_pmbd_adjust_ns);
+ printk(KERN_INFO "pmbd: g_pmbd_num_buffers = %llu\n", g_pmbd_num_buffers);
+ printk(KERN_INFO "pmbd: g_pmbd_buffer_stride = %llu blocks\n", g_pmbd_buffer_stride);
+ printk(KERN_INFO "pmbd: g_pmbd_timestat = %u \n", g_pmbd_timestat);
+ printk(KERN_INFO "pmbd: HIGHMEM offset [%llu] size [%lu] Private Mapping (%s) (%s) (%s) Write Barrier(%s) FUA(%s)\n",
+ g_highmem_phys_addr, g_highmem_size, (PMBD_USE_PMAP()? "Enabled" : "Disabled"),
+ (PMBD_USE_NTS()? "Non-Temporal Store":"Temporal Store"),
+ (PMBD_USE_NTL()? "Non-Temporal Load":"Temporal Load"),
+ (PMBD_USE_WB()? "Enabled": "Disabled"),
+ (PMBD_USE_FUA()? "Enabled":"Disabled"));
+
+ /* for each pmbd device */
+ for (i = 0; i < g_pmbd_nr; i ++) {
+ printk(KERN_INFO "pmbd: /dev/pm%c (%d)[%llu GB] read[%llu ns %llu MB/sec (%llux) (pause %llu cyc/pg)] write[%llu ns %llu MB/sec (%llux) (pause %llu cyc/pg)] [%s] [Buf: %llu MBs, batch %llu pages] [%s] [%s]\n",
+ 'a'+i, i, g_pmbd_size[i], g_pmbd_rdlat[i], g_pmbd_rdbw[i], g_pmbd_rdsx[i], g_pmbd_rdpause[i], g_pmbd_wrlat[i], g_pmbd_wrbw[i], g_pmbd_wrsx[i], g_pmbd_wrpause[i],\
+ (g_pmbd_rammode[i] ? "RAM" : "PMBD"), g_pmbd_bufsize[i], g_pmbd_buffer_batch_size[i], \
+ (g_pmbd_simmode[i] ? "Simulating PM only" : "Simulating the whole device"), \
+ (PMBD_USE_PMAP() ? "PMAP" : (g_pmbd_wpmode[i] ? "WP-CR0/WP" : "WP-PTE")));
+
+ if (g_pmbd_simmode[i] > 0){
+ printk(KERN_INFO "pmbd: ********************************* WARNING **************************************\n");
+ printk(KERN_INFO "pmbd: Using simmode%llu to simulate a slowed-down PM space may cause system soft lockup.\n", g_pmbd_simmode[i]);
+ printk(KERN_INFO "pmbd: To disable the warning message, please add \"nosoftlockup\" in the boot option. \n");
+ printk(KERN_INFO "pmbd: ********************************************************************************\n");
+ }
+ }
+
+ printk(KERN_INFO "pmbd: ****************************** WARNING ***********************************\n");
+ printk(KERN_INFO "pmbd: 1. Checksum mismatch can be detected but not handled \n");
+ printk(KERN_INFO "pmbd: 2. PMAP is incompatible with \"wrprotY\"\n");
+ printk(KERN_INFO "pmbd: **************************************************************************\n");
+
+ return;
+}
+
+/*
+ * Parse a string with config for multiple devices (e.g. mode="pmbd4,1,3;")
+ * @mode: input option string
+ * @tag: the tag being looked for (e.g. pmbd)
+ * @data: output in an array
+ */
+static int _pmbd_parse_multi(char* mode, char* tag, unsigned long long data[])
+{
+ int nr = 0;
+ if (strlen(mode)) {
+ char* head = mode;
+ char* tail = mode;
+ char* end = mode + strlen(mode);
+ char tmp[128];
+
+ if ((head = strstr(mode, tag))) {
+ head = head + strlen(tag);
+ tail = head;
+ while(head < end){
+ int len = 0;
+
+ /* locate the position of the first non-number char */
+ for(tail = head; IS_DIGIT(*tail) && tail < end; tail++) {};
+
+ /* pick up the numbers */
+ len = tail - head;
+ if(len > 0) {
+ nr ++;
+ if (nr > PMBD_MAX_NUM_DEVICES) {
+ printk(KERN_ERR "pmbd: %s(%d) - too many (%d) device config for %s\n",
+ __FUNCTION__, __LINE__, nr, tag);
+ return -1;
+ }
+ strncpy(tmp, head, len); tmp[len] = '\0';
+ data[nr - 1] = simple_strtoull(tmp, NULL, 0);
+ }
+
+ /* check the next sequence of numbers */
+ for(; !IS_DIGIT(*tail) && tail < end; tail++) {
+ /* if we meet the first alpha char or space, clause ends */
+ if(IS_ALPHA(*tail) || IS_SPACE(*tail))
+ goto done;
+ };
+
+ /* move head to the next sequence of numbers */
+ head = tail;
+ }
+ }
+ }
+done:
+ return nr;
+}
+
+/*
+ * Parse a string with config for all devices (e.g. mode="adj1000")
+ * @mode: input option string
+ * @tag: the tag being looked for (e.g. pmbd)
+ * @data: output
+ */
+static int _pmbd_parse_single(char* mode, char* tag, unsigned long long* data)
+{
+ if (strlen(mode)) {
+ char* head = mode;
+ char* tail = mode;
+ char tmp[128];
+
+ if (strstr(mode, tag)) {
+ head = strstr(mode, tag) + strlen(tag);
+ for(tail=head; IS_DIGIT(*tail); tail++) {};
+ if(tail == head) {
+ return -1;
+ } else {
+ int len = tail - head;
+ strncpy(tmp, head, len); tmp[len] = '\0';
+ *data = simple_strtoull(tmp, NULL, 0);
+ }
+ }
+ }
+ return 0;
+}
+
+static void load_default_conf(void)
+{
+ int i = 0;
+ for (i = 0; i < PMBD_MAX_NUM_DEVICES; i ++)
+ g_pmbd_buffer_batch_size[i] = PMBD_BUFFER_BATCH_SIZE_DEFAULT;
+}
+
+/* parse the module parameters (mode) */
+static void pmbd_parse_conf(void)
+{
+ int i = 0;
+ static unsigned enforce_cache_wc = FALSE;
+
+ load_default_conf();
+
+ if (strlen(mode)) {
+ unsigned long long data = 0;
+
+ /* check pmbd size/usable */
+ if (strstr(mode, "pmbd")) {
+ if( (g_pmbd_nr = _pmbd_parse_multi(mode, "pmbd", g_pmbd_size)) <= 0)
+ goto fail;
+ } else {
+ printk(KERN_ERR "pmbd: no pmbd size set\n");
+ goto fail;
+ }
+
+ /* rdlat/wrlat (emulated read/write latency) in nanosec */
+ if (strstr(mode, "rdlat"))
+ if (_pmbd_parse_multi(mode, "rdlat", g_pmbd_rdlat) < 0)
+ goto fail;
+ if (strstr(mode, "wrlat"))
+ if (_pmbd_parse_multi(mode, "wrlat", g_pmbd_wrlat) < 0)
+ goto fail;
+
+ /* rdbw/wrbw (emulated read/write bandwidth) in MB/sec*/
+ if (strstr(mode, "rdbw"))
+ if (_pmbd_parse_multi(mode, "rdbw", g_pmbd_rdbw) < 0)
+ goto fail;
+ if (strstr(mode, "wrbw"))
+ if (_pmbd_parse_multi(mode, "wrbw", g_pmbd_wrbw) < 0)
+ goto fail;
+
+ /* rdsx/wrsx (emulated read/write slowdown X) */
+ if (strstr(mode, "rdsx"))
+ if (_pmbd_parse_multi(mode, "rdsx", g_pmbd_rdsx) < 0)
+ goto fail;
+ if (strstr(mode, "wrsx"))
+ if (_pmbd_parse_multi(mode, "wrsx", g_pmbd_wrsx) < 0)
+ goto fail;
+
+ /* rdpause/wrpause (injected read/write pauses) */
+ if (strstr(mode, "rdpause"))
+ if (_pmbd_parse_multi(mode, "rdpause", g_pmbd_rdpause) < 0)
+ goto fail;
+ if (strstr(mode, "wrpause"))
+ if (_pmbd_parse_multi(mode, "wrpause", g_pmbd_wrpause) < 0)
+ goto fail;
+
+ /* rammode (write optimization) has been removed */
+ if (strstr(mode, "rammode")){
+ printk(KERN_ERR "pmbd: rammode removed\n");
+ goto fail;
+ }
+
+ if (strstr(mode, "bufsize")){
+ if (_pmbd_parse_multi(mode, "bufsize", g_pmbd_bufsize) < 0)
+ goto fail;
+ for (i = 0; i < PMBD_MAX_NUM_DEVICES; i ++) {
+ if (g_pmbd_bufsize[i] > 0 && g_pmbd_bufsize[i] < PMBD_BUFFER_MIN_BUFSIZE){
+ printk(KERN_ERR "pmbd: bufsize cannot be smaller than %d MBs. Setting 0 to disable PMBD buffer.\n", PMBD_BUFFER_MIN_BUFSIZE);
+ goto fail;
+ }
+ }
+ }
+
+ /* numbuf and bufstride*/
+ if (strstr(mode, "bufnum")) {
+ if(_pmbd_parse_single(mode, "bufnum", &data) < 0) {
+ printk(KERN_ERR "pmbd: incorrect bufnum (must be at least 1)\n");
+ goto fail;
+ } else {
+ g_pmbd_num_buffers = data;
+ }
+ }
+ if (strstr(mode, "bufstride")) {
+ if(_pmbd_parse_single(mode, "bufstride", &data) < 0) {
+ printk(KERN_ERR "pmbd: incorrect bufstride (must be at least 1)\n");
+ goto fail;
+ } else {
+ g_pmbd_buffer_stride = data;
+ }
+ }
+
+ /* check the nanoseconds of overhead to compensate */
+ if (strstr(mode, "adj")) {
+ if(_pmbd_parse_single(mode, "adj", &data) < 0) {
+ printk(KERN_ERR "pmbd: incorrect adj\n");
+ goto fail;
+ } else {
+ g_pmbd_adjust_ns = data;
+ }
+ }
+
+ /* check PMBD device type */
+ if ((strstr(mode, "VM"))) {
+ g_pmbd_type = PMBD_CONFIG_VMALLOC;
+ } else if ((strstr(mode, "HM"))) {
+ g_pmbd_type = PMBD_CONFIG_HIGHMEM;
+ }
+
+ /* use pmap*/
+ if ((strstr(mode, "pmapY"))) {
+ g_pmbd_pmap = TRUE;
+ } else if ((strstr(mode, "pmapN"))) {
+ g_pmbd_pmap = FALSE;
+ }
+ if ((strstr(mode, "PMAP"))){
+ printk(KERN_WARNING "pmbd: PMAP is no longer supported (use pmapY)\n");
+ goto fail;
+ }
+
+ /* use nts*/
+ if ((strstr(mode, "ntsY"))) {
+ g_pmbd_nts = TRUE;
+ } else if ((strstr(mode, "ntsN"))) {
+ g_pmbd_nts = FALSE;
+ }
+ if ((strstr(mode, "NTS"))){
+ printk(KERN_WARNING "pmbd: NTS is no longer supported (use ntsY)\n");
+ goto fail;
+ }
+
+ /* use ntl*/
+ if ((strstr(mode, "ntlY"))) {
+ g_pmbd_ntl = TRUE;
+ enforce_cache_wc = TRUE;
+ } else if ((strstr(mode, "ntlN"))) {
+ g_pmbd_ntl = FALSE;
+ }
+
+ /* timestat */
+ if ((strstr(mode, "timestatY"))) {
+ g_pmbd_timestat = TRUE;
+ } else if ((strstr(mode, "timestatN"))) {
+ g_pmbd_timestat = FALSE;
+ }
+
+
+ /* write barrier */
+ if ((strstr(mode, "wbY"))) {
+ g_pmbd_wb = TRUE;
+ } else if ((strstr(mode, "wbN"))) {
+ g_pmbd_wb = FALSE;
+ }
+
+ /* FUA */
+ if ((strstr(mode, "fuaY"))) {
+ g_pmbd_fua = TRUE;
+ } else if ((strstr(mode, "fuaN"))) {
+ g_pmbd_fua = FALSE;
+ }
+
+
+ /* check if HIGH_MEM PMBD is configured */
+ if (PMBD_USE_HIGHMEM()) {
+ if (strstr(mode, "hmo") && strstr(mode, "hms")) {
+ /* parse reserved HIGH_MEM offset */
+ if(_pmbd_parse_single(mode, "hmo", &data) < 0){
+ printk(KERN_ERR "pmbd: incorrect hmo\n");
+ g_highmem_phys_addr = 0;
+ goto fail;
+ } else {
+ g_highmem_phys_addr = data * 1024 * 1024 * 1024;
+ }
+
+ /* parse reserved HIGH_MEM size */
+ if(_pmbd_parse_single(mode, "hms", &data) < 0 || data == 0){
+ printk(KERN_ERR "pmbd: incorrect hms\n");
+ g_highmem_size = 0;
+ goto fail;
+ } else {
+ g_highmem_size = data * 1024 * 1024 * 1024;
+ }
+ } else {
+ printk(KERN_ERR "pmbd: hmo or hms not set\n");
+ goto fail;
+ }
+
+
+ }
+
+
+ /* check if mergeable */
+ if((strstr(mode,"mgbY")))
+ g_pmbd_mergeable = TRUE;
+ else if((strstr(mode,"mgbN")))
+ g_pmbd_mergeable = FALSE;
+
+ /* CPU cache flushing */
+ if((strstr(mode,"clflushY")))
+ g_pmbd_cpu_cache_clflush = TRUE;
+ else if((strstr(mode,"clflushN")))
+ g_pmbd_cpu_cache_clflush = FALSE;
+
+ /* CPU cache setting */
+ if((strstr(mode,"cacheWB"))) /* write back */
+ g_pmbd_cpu_cache_flag = _PAGE_CACHE_WB;
+ else if((strstr(mode,"cacheWC"))) /* write combining */
+ g_pmbd_cpu_cache_flag = _PAGE_CACHE_WC;
+ else if((strstr(mode,"cacheUM"))) /* uncacheable minus (UC-) */
+ g_pmbd_cpu_cache_flag = _PAGE_CACHE_UC_MINUS;
+ else if((strstr(mode,"cacheUC"))) /* uncacheable */
+ g_pmbd_cpu_cache_flag = _PAGE_CACHE_UC;
+
+
+ /* write protectable */
+ if((strstr(mode,"wrprotY")))
+ g_pmbd_wr_protect = TRUE;
+ else if((strstr(mode,"wrprotN")))
+ g_pmbd_wr_protect = FALSE;
+
+ /* write verification */
+ if((strstr(mode,"wrverifyY")))
+ g_pmbd_wr_verify = TRUE;
+ else if((strstr(mode,"wrverifyN")))
+ g_pmbd_wr_verify = FALSE;
+
+ /* checksum */
+ if((strstr(mode,"checksumY")))
+ g_pmbd_checksum = TRUE;
+ else if((strstr(mode,"checksumN")))
+ g_pmbd_checksum = FALSE;
+
+ /* locking */
+ if((strstr(mode,"lockY")))
+ g_pmbd_lock = TRUE;
+ else if((strstr(mode,"lockN")))
+ g_pmbd_lock = FALSE;
+
+ /* sub-page update */
+ if((strstr(mode,"subupdateY")))
+ g_pmbd_subpage_update = TRUE;
+ else if((strstr(mode,"subupdateN")))
+ g_pmbd_subpage_update = FALSE;
+
+
+ /* batch */
+ if (strstr(mode, "batch")){
+ if (_pmbd_parse_multi(mode, "batch", g_pmbd_buffer_batch_size) < 0)
+ goto fail;
+ /* check if any batch size is set too small */
+ for (i = 0; i < PMBD_MAX_NUM_DEVICES; i ++) {
+ if (g_pmbd_buffer_batch_size[i] < 1){
+ printk(KERN_ERR "pmbd: buffer batch size cannot be smaller than 1 page (default: 1024 pages)\n");
+ goto fail;
+ }
+ }
+ }
+
+ /* simmode */
+ if (strstr(mode, "simmode")){
+ if (_pmbd_parse_multi(mode, "simmode", g_pmbd_simmode) < 0)
+ goto fail;
+ }
+
+ /* wpmode */
+ if (strstr(mode, "wpmode")){
+ if (_pmbd_parse_multi(mode, "wpmode", g_pmbd_wpmode) < 0)
+ goto fail;
+ }
+
+ } else {
+ goto fail;
+ }
+
+ /* apply some enforced configuration */
+ if (enforce_cache_wc) /* if ntl is used, we must use WC */
+ g_pmbd_cpu_cache_flag = _PAGE_CACHE_WC;
+
+ /* Done, print input options */
+ pmbd_print_conf();
+ return;
+
+fail:
+ printk(KERN_ERR "pmbd: wrong mode config! Check modinfo\n\n");
+ g_pmbd_nr = 0;
+ return;
+}
+
+/*
+ * *****************************************************************
+ * simple emulation API functions
+ * pmbd_rdwr_pause - pause reads/writes for a given number of cycles per page
+ * pmbd_rdwr_slowdown - slow down reads/writes proportionally to DRAM speed
+ * *****************************************************************/
+
+/* handle rdpause and wrpause options*/
+static void pmbd_rdwr_pause(PMBD_DEVICE_T* pmbd, size_t bytes, unsigned rw)
+{
+ uint64_t cycles = 0;
+ uint64_t time_p1, time_p2;
+
+ /* sanity check */
+ if (pmbd->rdpause == 0 && pmbd->wrpause == 0)
+ return;
+
+ /* start */
+ TIMESTAT_POINT(time_p1);
+
+ /* calculate the cycles to pause */
+ if (rw == READ && pmbd->rdpause){
+ cycles = MAX_OF((BYTE_TO_PAGE(bytes) * pmbd->rdpause), pmbd->rdpause);
+ } else if (rw == WRITE && pmbd->wrpause){
+ cycles = MAX_OF((BYTE_TO_PAGE(bytes) * pmbd->wrpause), pmbd->wrpause);
+ }
+
+ /* slow down now */
+ if (cycles)
+ sync_slowdown_cycles(cycles);
+
+ TIMESTAT_POINT(time_p2);
+
+ if(PMBD_USE_TIMESTAT()){
+ int cid = CUR_CPU_ID();
+ PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+ pmbd_stat->cycles_pause[rw][cid] += time_p2 - time_p1;
+ }
+
+ return;
+}
+
+
+/* handle rdsx and wrsx options */
+static void pmbd_rdwr_slowdown(PMBD_DEVICE_T* pmbd, int rw, uint64_t start, uint64_t end)
+{
+ uint64_t cycles = 0;
+ uint64_t time_p1, time_p2;
+
+ /* sanity check */
+ if ( !((rw == READ && pmbd->rdsx > 1) || (rw == WRITE && pmbd->wrsx > 1)))
+ return;
+
+ if (end < start){
+ printk(KERN_WARNING "pmbd: %s(%d) end (%llu) is earlier than start (%llu)\n", \
+ __FUNCTION__, __LINE__, (unsigned long long) start, (unsigned long long)end);
+ return;
+ }
+
+ /* start */
+ TIMESTAT_POINT(time_p1);
+
+ /* FIXME: should we allow async slowdown? */
+ cycles = (end-start)*((rw == READ) ? (pmbd->rdsx - 1) : (pmbd->wrsx -1));
+
+ /* FIXME: should we subtract some slack here (80-100 cycles)? */
+ if (cycles)
+ sync_slowdown_cycles(cycles);
+
+ TIMESTAT_POINT(time_p2);
+
+ /* updating statistics */
+ if(PMBD_USE_TIMESTAT()){
+ int cid = CUR_CPU_ID();
+ PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+ pmbd_stat->cycles_slowdown[rw][cid] += time_p2 - time_p1;
+ }
+
+ return;
+}
+
+
+/*
+ * set page's cache flags
+ * @vaddr: start virtual address
+ * @num_pages: the range size
+ */
+static void set_pages_cache_flags(unsigned long vaddr, int num_pages)
+{
+ switch (g_pmbd_cpu_cache_flag) {
+ case _PAGE_CACHE_WB:
+ printk(KERN_INFO "pmbd: set PM pages cache flags (WB)\n");
+ set_memory_wb(vaddr, num_pages);
+ break;
+ case _PAGE_CACHE_WC:
+ printk(KERN_INFO "pmbd: set PM pages cache flags (WC)\n");
+ set_memory_wc(vaddr, num_pages);
+ break;
+ case _PAGE_CACHE_UC:
+ printk(KERN_INFO "pmbd: set PM pages cache flags (UC)\n");
+ set_memory_uc(vaddr, num_pages);
+ break;
+ case _PAGE_CACHE_UC_MINUS:
+ printk(KERN_INFO "pmbd: set PM pages cache flags (UM)\n");
+ set_memory_uc(vaddr, num_pages);
+ break;
+ default:
+ set_memory_wb(vaddr, num_pages);
+ printk(KERN_WARNING "pmbd: PM page attribute is not set - use WB\n");
+ break;
+ }
+ return;
+}
+
+
+/*
+ * *************************************************************************
+ * PMAP - Private mapping interface APIs
+ * *************************************************************************
+ *
+ * The private mapping provides write protection -- a PM page is mapped into
+ * the kernel virtual memory space only when we need to access it, and it is
+ * unmapped as soon as we are done, so the spatial and temporal window exposed
+ * to stray writes is very small.
+ *
+ * Note: pmap works similarly to kmap_atomic*. It does the following:
+ * (1) pmap_create(): allocate 128 pages with vmalloc; the 128 pte mappings
+ * are saved to a backup area and then cleared to prevent accidental accesses.
+ * Each page corresponds to the CPU ID on which the calling thread is running,
+ * so we support at most 128 CPU IDs.
+ * (2) pmap_atomic_pfn(): map the specified pfn into the entry indexed by the
+ * ID of the CPU on which the current thread is running. The pfn is loaded
+ * into the corresponding pte entry and the corresponding TLB entry is flushed.
+ * (3) punmap_atomic(): the specified pte entry is cleared, and the TLB entry
+ * is flushed.
+ * (4) pmap_destroy(): the saved pte mappings of the 128 pages are restored,
+ * and vfree() is called to release the 128 pages allocated through vmalloc().
+ *
+ */
+
+#define PMAP_NR_PAGES (128)
+static unsigned int pmap_nr_pages = 0; /* the total number of available pages for private mapping */
+static void* pmap_va_start = NULL; /* the first PMAP virtual address */
+static pte_t* pmap_ptep[PMAP_NR_PAGES]; /* the array of PTE entries */
+static unsigned long pmap_pfn[PMAP_NR_PAGES]; /* the array of page frame numbers for restoring */
+static pgprot_t pmap_prot[PMAP_NR_PAGES]; /* the array of page protection fields */
+#define PMAP_VA(IDX) (pmap_va_start + (IDX) * PAGE_SIZE)
+#define PMAP_IDX(VA) (((unsigned long)(VA) - (unsigned long)pmap_va_start) >> PAGE_SHIFT)
+
+static inline void pmap_flush_tlb_single(unsigned long addr)
+{
+ asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
+}
+
+static inline void* update_pmap_pfn(unsigned long pfn, unsigned int idx)
+{
+ void* va = PMAP_VA(idx);
+ pte_t* ptep = pmap_ptep[idx];
+ pte_t old_pte = *ptep;
+ pte_t new_pte = pfn_pte(pfn, pmap_prot[idx]);
+
+ if (pte_val(old_pte) == pte_val(new_pte))
+ return va;
+
+ /* update the pte entry */
+ set_pte_atomic(ptep, new_pte);
+// set_pte(ptep, new_pte);
+
+ /* flush one single tlb */
+ __flush_tlb_one((unsigned long) va);
+// pmap_flush_tlb_single((unsigned long) va);
+
+ /* return the old one for bkup */
+ return va;
+}
+
+static inline void clear_pmap_pfn(unsigned idx)
+{
+ if (idx < pmap_nr_pages){
+
+ void* va = PMAP_VA(idx);
+ pte_t* ptep = pmap_ptep[idx];
+
+ /* clear the mapping */
+ pte_clear(NULL, (unsigned long) va, ptep);
+ __flush_tlb_one((unsigned long) va);
+
+ } else {
+ panic("%s(%d) illegal pmap idx\n", __FUNCTION__, __LINE__);
+ }
+}
+
+static int pmap_atomic_init(void)
+{
+ unsigned int i;
+
+ /* checking */
+ if (pmap_va_start)
+ panic("%s(%d) something is wrong\n", __FUNCTION__, __LINE__);
+
+ /* allocate an array of dummy pages as pmap virtual addresses */
+ pmap_va_start = vmalloc(PAGE_SIZE * PMAP_NR_PAGES);
+ if (!pmap_va_start){
+ printk(KERN_ERR "pmbd:%s(%d) pmap_va_start cannot be initialized\n", __FUNCTION__, __LINE__);
+ return -ENOMEM;
+ }
+ pmap_nr_pages = PMAP_NR_PAGES;
+
+ /* set pages' cache flags; the flag is saved into pmap_prot and will
+ * also be applied to dynamically mapped pages (01/12/2012) */
+ set_pages_cache_flags((unsigned long)pmap_va_start, pmap_nr_pages);
+
+ /* save the dummy pages' ptep, pfn, and prot info */
+ printk(KERN_INFO "pmbd: saving dummy pmap entries\n");
+ for (i = 0; i < pmap_nr_pages; i ++){
+ pte_t old_pte;
+ unsigned int level;
+ void* va = PMAP_VA(i);
+
+ /* get the ptep */
+ pte_t* ptep = lookup_address((unsigned long)(va), &level);
+
+ /* sanity check */
+ if (!ptep)
+ panic("%s(%d) mapping not found\n", __FUNCTION__, __LINE__);
+
+ old_pte = *ptep;
+ if (!pte_val(old_pte))
+ panic("%s(%d) invalid pte value\n", __FUNCTION__, __LINE__);
+
+ if (level != PG_LEVEL_4K)
+ panic("%s(%d) not PG_LEVEL_4K \n", __FUNCTION__, __LINE__);
+
+ /* save dummy entries */
+ pmap_ptep[i] = ptep;
+ pmap_pfn[i] = pte_pfn(old_pte);
+ pmap_prot[i] = pte_pgprot(old_pte);
+
+/* printk(KERN_INFO "%s(%d): saving dummy pmap entries: %u va=%p pfn=%lx\n", \
+ __FUNCTION__, __LINE__, i, va, pmap_pfn[i]);
+*/
+ }
+
+ /* clear the pte to make it illegal to access */
+ for (i = 0; i < pmap_nr_pages; i ++)
+ clear_pmap_pfn(i);
+
+ return 0;
+}
+
+static void pmap_atomic_done(void)
+{
+ int i;
+
+ /* restore the dummy pages' pte */
+ printk(KERN_INFO "pmbd: restoring dummy pmap entries\n");
+ for (i = 0; i < pmap_nr_pages; i ++){
+/* void* va = PMAP_VA(i);
+ printk(KERN_INFO "%s(%d): restoring dummy pmap entries: %d va=%p pfn=%lx\n", \
+ __FUNCTION__, __LINE__, i, va, pmap_pfn[i]);
+*/
+ /* restore the old pfn */
+ update_pmap_pfn(pmap_pfn[i], i);
+ pmap_ptep[i]= NULL;
+ pmap_pfn[i] = 0;
+ }
+
+ /* free the dummy pages*/
+ if (pmap_va_start)
+ vfree(pmap_va_start);
+ else
+ panic("%s(%d): freeing dummy pages failed\n", __FUNCTION__, __LINE__);
+
+ pmap_va_start = NULL;
+ pmap_nr_pages = 0;
+ return;
+}
+
+static void* pmap_atomic_pfn(unsigned long pfn, PMBD_DEVICE_T* pmbd, unsigned rw)
+{
+ void* va = NULL;
+ unsigned int idx = CUR_CPU_ID();
+ uint64_t time_p1 = 0;
+ uint64_t time_p2 = 0;
+
+ TIMESTAMP(time_p1);
+
+ /* disable page fault temporarily */
+ pagefault_disable();
+
+ /* change the mapping to the specified pfn*/
+ va = update_pmap_pfn(pfn, idx);
+
+ TIMESTAMP(time_p2);
+
+ /* update time statistics */
+ if(PMBD_USE_TIMESTAT()){
+ int cid = CUR_CPU_ID();
+ PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+ pmbd_stat->cycles_pmap[rw][cid] += time_p2 - time_p1;
+ }
+
+ return va;
+}
+
+static void punmap_atomic(void* va, PMBD_DEVICE_T* pmbd, unsigned rw)
+{
+ unsigned int idx = PMAP_IDX(va);
+ uint64_t time_p1 = 0;
+ uint64_t time_p2 = 0;
+
+ TIMESTAMP(time_p1);
+
+ /* clear the mapping */
+ clear_pmap_pfn(idx);
+
+ /* re-enable the page fault */
+ pagefault_enable();
+
+ TIMESTAMP(time_p2);
+
+ /* update time statistics */
+ if(PMBD_USE_TIMESTAT()){
+ int cid = CUR_CPU_ID();
+ PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+ pmbd_stat->cycles_punmap[rw][cid] += time_p2 - time_p1;
+ }
+
+ return;
+}
+
+/* create the dummy pmap space */
+static int pmap_create(void)
+{
+ pmap_atomic_init();
+ return 0;
+}
+
+/* destroy the dummy pmap space */
+static void pmap_destroy(void)
+{
+ pmap_atomic_done();
+ return;
+}
+
+/*
+ * *************************************************************************
+ * Non-temporal memcpy
+ * *************************************************************************
+ * Non-temporal memcpy does the following:
+ * (1) use movntdq to copy into the PM space
+ * (2) use sfence to flush the data to the memory controller
+ *
+ * Compared to a regular temporal memcpy, it provides several benefits:
+ * (1) writes to PM bypass the CPU cache, which avoids polluting the cache
+ * (2) reads from PM still benefit from the CPU cache
+ * (3) the sfence after each write guarantees data is flushed out of the
+ * write-combining buffers
+ */
+
+static void nts_memcpy_64bytes_v2(void* to, void* from, size_t size)
+{
+ int i;
+ unsigned bs = 64; /* copy unit size: 64 bytes */
+
+ if (size < bs)
+ panic("%s(%d) size (%zu) is smaller than %u\n", __FUNCTION__, __LINE__, size, bs);
+
+ if (((unsigned long) from & 63UL) || ((unsigned long)to & 63UL))
+ panic("%s(%d) not aligned\n", __FUNCTION__, __LINE__);
+
+ /* start */
+ kernel_fpu_begin();
+
+ /* do the non-temporal mov */
+ for (i = 0; i < size; i += bs){
+ __asm__ __volatile__ (
+ "movdqa (%0), %%xmm0\n"
+ "movdqa 16(%0), %%xmm1\n"
+ "movdqa 32(%0), %%xmm2\n"
+ "movdqa 48(%0), %%xmm3\n"
+ "movntdq %%xmm0, (%1)\n"
+ "movntdq %%xmm1, 16(%1)\n"
+ "movntdq %%xmm2, 32(%1)\n"
+ "movntdq %%xmm3, 48(%1)\n"
+ :
+ : "r" (from), "r" (to)
+ : "memory");
+
+ to += bs;
+ from += bs;
+ }
+
+ /* do sfence to push data out */
+ __asm__ __volatile__ (
+ " sfence\n" : :
+ );
+
+ /* end */
+ kernel_fpu_end();
+
+ /* NOTE: we assume the size is a multiple of 64 bytes */
+ if (i != size)
+ panic("%s:%s:%d size (%zu) is not a multiple of 64 bytes\n", __FILE__, __FUNCTION__, __LINE__, size);
+
+ return;
+}
+
+/* non-temporal store */
+static void nts_memcpy(void* to, void* from, size_t size)
+{
+ if (size < 64){
+ panic("no support for nt store smaller than 64 bytes yet\n");
+ } else {
+ nts_memcpy_64bytes_v2(to, from, size);
+ }
+}
+
+
+static void ntl_memcpy_64bytes(void* to, void* from, size_t size)
+{
+ int i;
+ unsigned bs = 64; /* copy unit size: 64 bytes */
+
+ if (size < bs)
+ panic("%s(%d) size (%zu) is smaller than %u\n", __FUNCTION__, __LINE__, size, bs);
+
+ if (((unsigned long) from & 63UL) || ((unsigned long)to & 63UL))
+ panic("%s(%d) not aligned\n", __FUNCTION__, __LINE__);
+
+ /* start */
+ kernel_fpu_begin();
+
+ /* do the non-temporal mov */
+ for (i = 0; i < size; i += bs){
+ __asm__ __volatile__ (
+ "movntdqa (%0), %%xmm0\n"
+ "movntdqa 16(%0), %%xmm1\n"
+ "movntdqa 32(%0), %%xmm2\n"
+ "movntdqa 48(%0), %%xmm3\n"
+ "movdqa %%xmm0, (%1)\n"
+ "movdqa %%xmm1, 16(%1)\n"
+ "movdqa %%xmm2, 32(%1)\n"
+ "movdqa %%xmm3, 48(%1)\n"
+ :
+ : "r" (from), "r" (to)
+ : "memory");
+
+ to += bs;
+ from += bs;
+ }
+
+ /* end */
+ kernel_fpu_end();
+
+ /* NOTE: we assume the size is a multiple of 64 bytes (at least 512 bytes) */
+ if (i != size)
+ panic("%s:%s:%d size (%zu) is not a multiple of 64 bytes\n", __FILE__, __FUNCTION__, __LINE__, size);
+
+ return;
+}
+
+/* non-temporal load */
+static void ntl_memcpy(void* to, void* from, size_t size)
+{
+ if (size < 64){
+ panic("no support for nt load smaller than 64 bytes yet\n");
+ } else {
+ ntl_memcpy_64bytes(to, from, size);
+ }
+}
+
+
+/*
+ * *************************************************************************
+ * COPY TO/FROM PM
+ * *************************************************************************
+ *
+ * NOTE: copying into PM needs particular care; we use two different solutions:
+ * (1) pmap: we map/unmap PM pages only when we need to access them, which
+ * provides the most protection, for both reads and writes.
+ * (2) non-pmap: we keep every page mapped into the kernel space, but apply
+ * protection for writes only. In both cases, PM pages are initialized as
+ * read-only:
+ * - PTE manipulation: before each write, the page's writable bit is enabled,
+ * and disabled right after the write operation is done.
+ * - CR0/WP switch: before each write, the WP bit in the CR0 register is
+ * turned off, and turned back on right after the write operation is done.
+ * While the CR0/WP bit is off, the local CPU does not check the writable
+ * bit in the TLB, so this is a way to work around the cost of changing
+ * page permissions on every write.
+ *
+ */
+
+#define PMBD_PMAP_DUMMY_BASE_VA (4096)
+#define PMBD_PMAP_VA_TO_PA(VA) (g_highmem_phys_addr + (VA) - PMBD_PMAP_DUMMY_BASE_VA)
+/*
+ * copying from/to a contiguous PM space using pmap
+ * @ram_va: the RAM virtual address
+ * @pmbd_dummy_va: the dummy PM virtual address (for converting to phys addr)
+ * @rw: 0 - read, 1 - write
+ */
+
+#define MEMCPY_TO_PMBD(dst, src, bytes) { if (PMBD_USE_NTS()) \
+ nts_memcpy((dst), (src), (bytes)); \
+ else \
+ memcpy((dst), (src), (bytes));}
+
+#define MEMCPY_FROM_PMBD(dst, src, bytes) { if (PMBD_USE_NTL()) \
+ ntl_memcpy((dst), (src), (bytes)); \
+ else \
+ memcpy((dst), (src), (bytes));}
+
+static inline int _memcpy_pmbd_pmap(PMBD_DEVICE_T* pmbd, void* ram_va, void* pmbd_dummy_va, size_t bytes, unsigned rw, unsigned do_fua)
+{
+ unsigned long flags = 0;
+ uint64_t pa = (uint64_t) PMBD_PMAP_VA_TO_PA(pmbd_dummy_va);
+
+ /* disable interrupt (PMAP entry is shared) */
+ DISABLE_SAVE_IRQ(flags);
+
+ /* do the real work */
+ while(bytes){
+ uint64_t time_p1 = 0;
+ uint64_t time_p2 = 0;
+
+ unsigned long pfn = (pa >> PAGE_SHIFT); /* page frame number */
+ unsigned off = pa & (~PAGE_MASK); /* offset in one page */
+ unsigned size = MIN_OF((PAGE_SIZE - off), bytes);/* the size to copy */
+
+ /* map it */
+ void * map = pmap_atomic_pfn(pfn, pmbd, rw);
+ void * pmbd_va = map + off;
+
+ /* do memcopy */
+ TIMESTAMP(time_p1);
+ if (rw == READ) {
+ MEMCPY_FROM_PMBD(ram_va, pmbd_va, size);
+ } else {
+ if (PMBD_USE_SUBPAGE_UPDATE()) {
+ /* if we do subpage write, write a cacheline each time */
+ /* FIXME: we probably need to check the alignment here */
+ size = MIN_OF(size, PMBD_CACHELINE_SIZE);
+ if (memcmp(pmbd_va, ram_va, size)){
+ MEMCPY_TO_PMBD(pmbd_va, ram_va, size);
+ }
+ } else {
+ MEMCPY_TO_PMBD(pmbd_va, ram_va, size);
+ }
+ }
+ TIMESTAMP(time_p2);
+
+ /* emulating slowdown*/
+ if(PMBD_DEV_USE_SLOWDOWN(pmbd))
+ pmbd_rdwr_slowdown((pmbd), rw, time_p1, time_p2);
+
+ /* for write check if we need to do clflush or do FUA*/
+ if (rw == WRITE){
+ if (PMBD_USE_CLFLUSH() || (do_fua && PMBD_CPU_CACHE_USE_WB() && !PMBD_USE_NTS()))
+ pmbd_clflush_range(pmbd, pmbd_va, (size));
+ }
+
+ /* if write combine is used, we need to do sfence (like in ntstore) */
+ if (PMBD_CPU_CACHE_USE_WC() || PMBD_CPU_CACHE_USE_UM())
+ sfence();
+
+ /* update time statistics */
+ if(PMBD_USE_TIMESTAT()){
+ int cid = CUR_CPU_ID();
+ PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+ pmbd_stat->cycles_memcpy[rw][cid] += time_p2 - time_p1;
+ }
+
+ /* unmap it */
+ punmap_atomic(map, pmbd, rw);
+
+ /* prepare the next iteration */
+ ram_va += size;
+ bytes -= size;
+ pa += size;
+ }
+
+ /* re-enable interrupt */
+ ENABLE_RESTORE_IRQ(flags);
+
+ return 0;
+}
+
+static inline int memcpy_from_pmbd_pmap(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes)
+{
+ return _memcpy_pmbd_pmap(pmbd, dst, src, bytes, READ, FALSE);
+}
+
+static inline int memcpy_to_pmbd_pmap(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes, unsigned do_fua)
+{
+ return _memcpy_pmbd_pmap(pmbd, src, dst, bytes, WRITE, do_fua);
+}
+
+
+/*
+ * memcpy from/to PM without using pmap
+ */
+
+#define DISABLE_CR0_WP(CR0,FLAGS) {\
+ if (PMBD_USE_WRITE_PROTECTION()){\
+ DISABLE_SAVE_IRQ((FLAGS));\
+ (CR0) = read_cr0();\
+ write_cr0((CR0) & ~X86_CR0_WP);\
+ }\
+ }
+#define ENABLE_CR0_WP(CR0,FLAGS) {\
+ if (PMBD_USE_WRITE_PROTECTION()){\
+ write_cr0((CR0));\
+ ENABLE_RESTORE_IRQ((FLAGS));\
+ }\
+ }
+
+static inline int memcpy_from_pmbd_nopmap(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes)
+{
+ uint64_t time_p1 = 0;
+ uint64_t time_p2 = 0;
+
+ /* start memcpy */
+ TIMESTAMP(time_p1);
+#if 0
+ if (PMBD_DEV_USE_VMALLOC((pmbd)))
+ memcpy((dst), (src), (bytes));
+ else if (PMBD_DEV_USE_HIGHMEM((pmbd)))
+ memcpy_fromio((dst), (src), (bytes));
+#endif
+ MEMCPY_FROM_PMBD(dst, src, bytes);
+
+ TIMESTAMP(time_p2);
+
+ /* emulating slowdown*/
+ if(PMBD_DEV_USE_SLOWDOWN(pmbd))
+ pmbd_rdwr_slowdown((pmbd), READ, time_p1, time_p2);
+
+ /* update time statistics */
+ if(PMBD_USE_TIMESTAT()){
+ int cid = CUR_CPU_ID();
+ PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+ pmbd_stat->cycles_memcpy[READ][cid] += time_p2 - time_p1;
+ }
+
+ return 0;
+}
+
+static int memcpy_to_pmbd_nopmap(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes, unsigned do_fua)
+{
+
+ unsigned long cr0 = 0;
+ unsigned long flags = 0;
+ size_t left = bytes;
+
+
+ /* get a bkup copy of the CR0 (to allow writable)*/
+ if (PMBD_DEV_USE_WPMODE_CR0(pmbd))
+ DISABLE_CR0_WP(cr0, flags);
+
+ /* do the real work */
+ while(left){
+ size_t size = left; /* the size to copy */
+ uint64_t time_p1 = 0;
+ uint64_t time_p2 = 0;
+
+ TIMESTAMP(time_p1);
+ /* do memcopy */
+ if (PMBD_USE_SUBPAGE_UPDATE()) {
+ /* if we do subpage write, write a cacheline each time */
+ size = MIN_OF(size, PMBD_CACHELINE_SIZE);
+
+ if (memcmp(dst, src, size)){
+ MEMCPY_TO_PMBD(dst, src, size);
+ }
+ } else {
+ MEMCPY_TO_PMBD(dst, src, size);
+ }
+ TIMESTAMP(time_p2);
+
+ /* emulating slowdown*/
+ if(PMBD_DEV_USE_SLOWDOWN(pmbd))
+ pmbd_rdwr_slowdown((pmbd), WRITE, time_p1, time_p2);
+
+ /* if write, check if we need to do clflush or we do FUA */
+ if (PMBD_USE_CLFLUSH() || (do_fua && PMBD_CPU_CACHE_USE_WB() && !PMBD_USE_NTS()))
+ pmbd_clflush_range(pmbd, dst, (size));
+
+ /* if write combine is used, we need to do sfence (like in ntstore) */
+ if (PMBD_CPU_CACHE_USE_WC() || PMBD_CPU_CACHE_USE_UM())
+ sfence();
+
+ /* update time statistics */
+ if(PMBD_USE_TIMESTAT()){
+ int cid = CUR_CPU_ID();
+ PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+ pmbd_stat->cycles_memcpy[WRITE][cid] += time_p2 - time_p1;
+ }
+
+ /* prepare the next iteration */
+ dst += size;
+ src += size;
+ left -= size;
+ }
+
+ /* restore the CR0 */
+ if (PMBD_DEV_USE_WPMODE_CR0(pmbd))
+ ENABLE_CR0_WP(cr0, flags);
+
+ return 0;
+}
+
+static int memcpy_to_pmbd(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes, unsigned do_fua)
+{
+ uint64_t start = 0;
+ uint64_t end = 0;
+
+ /* start simulation timing */
+ if (PMBD_DEV_SIM_PMBD((pmbd)))
+ start = emul_start((pmbd), BYTE_TO_SECTOR((bytes)), WRITE);
+
+ /* do memcpy now */
+ if (PMBD_USE_PMAP()){
+ memcpy_to_pmbd_pmap(pmbd, dst, src, bytes, do_fua);
+ } else {
+ memcpy_to_pmbd_nopmap(pmbd, dst, src, bytes, do_fua);
+ }
+
+ /* stop simulation timing */
+ if (PMBD_DEV_SIM_PMBD((pmbd)))
+ end = emul_end((pmbd), BYTE_TO_SECTOR((bytes)), WRITE, start);
+
+ /* pause write for a while*/
+ pmbd_rdwr_pause(pmbd, bytes, WRITE);
+
+ return 0;
+}
+
+
+
+static int memcpy_from_pmbd(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes)
+{
+ uint64_t start = 0;
+ uint64_t end = 0;
+
+ /* start simulation timing */
+ if (PMBD_DEV_SIM_PMBD((pmbd)))
+ start = emul_start((pmbd), BYTE_TO_SECTOR((bytes)), READ);
+
+ /* do memcpy here */
+ if (PMBD_USE_PMAP()){
+ memcpy_from_pmbd_pmap(pmbd, dst, src, bytes);
+ }else{
+ memcpy_from_pmbd_nopmap(pmbd, dst, src, bytes);
+ }
+
+ /* stop simulation timing */
+ if (PMBD_DEV_SIM_PMBD((pmbd)))
+ end = emul_end((pmbd), BYTE_TO_SECTOR((bytes)), READ, start);
+
+ /* pause read for a while */
+ pmbd_rdwr_pause(pmbd, bytes, READ);
+
+ return 0;
+}
+
+
+
+/*
+ * *************************************************************************
+ * PMBD device buffer management
+ * *************************************************************************
+ *
+ * Since write protection incurs high performance overhead (due to TLB
+ * shootdowns and other system-wide locking and linked-list scan overhead in
+ * the set_memory_* functions), we cannot change page table attributes for
+ * each incoming write to PM space. To address this issue, we added a DRAM
+ * buffer to temporarily hold incoming writes, and a syncer daemon that
+ * periodically flushes dirty pages from the buffer to the PM storage. This
+ * brings two benefits: first, more contiguous pages can be clustered
+ * together, so we need only one page attribute change per cluster; second,
+ * the high overhead is hidden in the background, since the writes become
+ * asynchronous.
+ *
+ */
+
+
+/* support functions to sort the bbi entries */
+static int compare_bbi_sort_entries(const void* m, const void* n)
+{
+ PMBD_BSORT_ENTRY_T* a = (PMBD_BSORT_ENTRY_T*) m;
+ PMBD_BSORT_ENTRY_T* b = (PMBD_BSORT_ENTRY_T*) n;
+ if (a->pbn < b->pbn)
+ return -1;
+ else if (a->pbn == b->pbn)
+ return 0;
+ else
+ return 1;
+
+}
+
+static void swap_bbi_sort_entries(void* m, void* n, int size)
+{
+ PMBD_BSORT_ENTRY_T* a = (PMBD_BSORT_ENTRY_T*) m;
+ PMBD_BSORT_ENTRY_T* b = (PMBD_BSORT_ENTRY_T*) n;
+ PMBD_BSORT_ENTRY_T tmp;
+ tmp = *a;
+ *a = *b;
+ *b = tmp;
+ return;
+}
+
+
+/*
+ * get the aligned in-block offsets for a given request
+ * @pmbd: the pmbd device
+ * @sector: the starting offset (in sectors) of the incoming request
+ * @bytes: the size of the incoming request
+ *
+ * return: the in-block offset of the starting sector in the request
+ *
+ * Since the block size (4096 bytes) is larger than the sector size (512 bytes),
+ * if the incoming request is not completely aligned in units of blocks, then
+ * we need to pull the whole block from PM space into the buffer, and apply
+ * changes to partial blocks. This function is needed to calculate the offset
+ * for the beginning and ending sectors.
+ *
+ * For example: assuming sector size is 1024 and buffer block size is 4096, a
+ * request at sector 5 with size 2048 covers sectors 5 and 6 of the block
+ * spanning sectors 4-7, so the returned start offset is 1 (the second sector
+ * in the buffer block) and the returned end offset is 2 (the third sector in
+ * the buffer block):
+ *
+ * offset_s -----v v--- offset_e
+ * ----------------------------------
+ * | |*****| | |
+ * ----------------------------------
+ *
+ */
+
+static sector_t pmbd_buffer_aligned_request_start(PMBD_DEVICE_T* pmbd, sector_t sector, size_t bytes)
+{
+ sector_t sector_s = sector;
+ PBN_T pbn_s = SECTOR_TO_PBN(pmbd, sector_s);
+ sector_t block_s = PBN_TO_SECTOR(pmbd, pbn_s); /* the block's starting offset (in sector) */
+ sector_t offset_s = 0;
+ if (sector_s >= block_s) /* if not aligned */
+ offset_s = sector_s - block_s;
+ return offset_s;
+}
+
+static sector_t pmbd_buffer_aligned_request_end(PMBD_DEVICE_T* pmbd, sector_t sector, size_t bytes)
+{
+ sector_t sector_e = sector + BYTE_TO_SECTOR(bytes) - 1;
+ PBN_T pbn_e = SECTOR_TO_PBN(pmbd, sector_e);
+ sector_t block_e = PBN_TO_SECTOR(pmbd, pbn_e); /* the block's starting offset (in sector) */
+ sector_t offset_e = PBN_TO_SECTOR(pmbd, 1) - 1;
+
+ if (sector_e >= block_e) /* if not aligned */
+ offset_e = (sector_e - block_e);
+ return offset_e;
+}
+
+
+/*
+ * check and see if a physical block (pbn) is buffered
+ * @pmbd: pmbd device
+ * @pbn: buffer block number
+ *
+ * NOTE: The caller must hold the pbi->lock
+ */
+static PMBD_BBI_T* _pmbd_buffer_lookup(PMBD_BUFFER_T* buffer, PBN_T pbn)
+{
+ PMBD_BBI_T* bbi = NULL;
+ PMBD_DEVICE_T* pmbd = buffer->pmbd;
+ PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
+
+ if (PMBD_BLOCK_IS_BUFFERED(pmbd, pbn)) {
+ bbi = PMBD_BUFFER_BBI(buffer, pbi->bbn);
+ }
+ return bbi;
+}
+
+/*
+ * Alloc/flush buffer functions
+ */
+
+/*
+ * flushing a range of contiguous physical blocks from buffer to PM space
+ * @pmbd: pmbd device
+ * @pbn_s: the first physical block number to flush (start)
+ * @pbn_e: the last physical block number to flush (end)
+ *
+ * This function only flushes blocks from the buffer to PM and unlinks (frees)
+ * the corresponding buffer blocks and physical PM blocks; it does not update
+ * the buffer control info (num_dirty, pos_dirty). This is because after
+ * sorting, the processing order of buffer blocks (BBNs) may differ from the
+ * spatial order of the buffer blocks, which makes it impossible to move
+ * pos_dirty forward exactly one by one. In other words, pos_dirty only
+ * points to the end of the dirty range, and we may flush a dirty block in the
+ * middle of the range rather than from the end first.
+ *
+ * NOTE: The caller must hold the flush_lock; only one thread is allowed to do
+ * this sync; we also assume all the physical blocks in the specified range are
+ * buffered.
+ *
+ */
+
+static unsigned long _pmbd_buffer_flush_range(PMBD_BUFFER_T* buffer, PBN_T pbn_s, PBN_T pbn_e)
+{
+ PBN_T pbn = 0;
+ unsigned long num_cleaned = 0;
+ PMBD_DEVICE_T* pmbd = buffer->pmbd;
+ void* dst = PMBD_BLOCK_VADDR(pmbd, pbn_s);
+ size_t bytes = PBN_TO_BYTE(pmbd, (pbn_e - pbn_s + 1));
+
+ /* NOTE: we are protected by the flush_lock here, no-one else here */
+
+ /* set the pages readwriteable */
+ /* if we use CR0/WP to temporarily switch the writable permission,
+ * we don't have to change the PTE attributes directly */
+ if (PMBD_DEV_USE_WPMODE_PTE(pmbd))
+ pmbd_set_pages_rw(pmbd, dst, bytes, TRUE);
+
+
+ /* for each physical block, flush it from buffer to the PM space */
+ for (pbn = pbn_s; pbn <= pbn_e; pbn ++){
+ BBN_T bbn = 0;
+ PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
+ void* to = PMBD_BLOCK_VADDR(pmbd, pbn);
+ size_t size = pmbd->pb_size;
+ void* from = NULL; /* wait to get it in locked region */
+ PMBD_BBI_T* bbi = NULL; /* wait to get it in locked region */
+
+ /*
+ * NOTE: This would not cause a deadlock, because the blocks
+ * here are already buffered, and buffered blocks do not call
+ * pmbd_buffer_alloc_block()
+ */
+ spin_lock(&pbi->lock); /* lock the block */
+
+ /* get related buffer block info */
+ if (PMBD_BLOCK_IS_BUFFERED(pmbd, pbn)) {
+ bbn = pbi->bbn;
+ bbi = PMBD_BUFFER_BBI(buffer, pbi->bbn);
+ from = PMBD_BUFFER_BLOCK(buffer, pbi->bbn);
+ } else {
+ panic("pmbd: %s(%d) block is not buffered\n", __FUNCTION__, __LINE__);
+ }
+
+ /* sync data from buffer into PM first */
+ if (PMBD_BUFFER_BBI_IS_DIRTY(buffer, bbn)) {
+ /* flush to PM */
+ memcpy_to_pmbd(pmbd, to, from, size, FALSE);
+
+ /* mark it as clean */
+ PMBD_BUFFER_SET_BBI_CLEAN(buffer, bbn);
+ }
+ }
+
+ /* set the pages back to read-only */
+ if (PMBD_DEV_USE_WPMODE_PTE(pmbd))
+ pmbd_set_pages_ro(pmbd, dst, bytes, TRUE);
+
+
+ /* finish the remaining work */
+ for (pbn = pbn_s; pbn <= pbn_e; pbn ++){
+ PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
+ void* to = PMBD_BLOCK_VADDR(pmbd, pbn);
+ size_t size = pmbd->pb_size;
+ BBN_T bbn = pbi->bbn;
+ void* from = PMBD_BUFFER_BLOCK(buffer, pbi->bbn);
+
+ /* verify that the write operation succeeded */
+ if(PMBD_USE_WRITE_VERIFICATION())
+ pmbd_verify_wr_pages(pmbd, to, from, size);
+
+ /* reset the bbi and pbi link info */
+ PMBD_BUFFER_SET_BBI_UNBUFFERED(buffer, bbn);
+ PMBD_SET_BLOCK_UNBUFFERED(pmbd, pbn);
+
+ /* unlock the block */
+ spin_unlock(&pbi->lock);
+
+ num_cleaned ++;
+ }
+
+ /* generate checksum */
+ if (PMBD_USE_CHECKSUM())
+ pmbd_checksum_on_write(pmbd, dst, bytes);
+
+ return num_cleaned;
+}
+
+
+/*
+ * Core function for flushing the pmbd buffer
+ * @buffer: the pmbd buffer to flush
+ * @num_to_clean: how many blocks to flush
+ *
+ * NOTE: this function performs the flushing in the following steps
+ * (1) get the flush_lock (to allow only one thread to do flushing)
+ * (2) get the buffer_lock to protect the buffer control info (num_dirty,
+ * pos_dirty, pos_clean)
+ * (3) check if someone else has already done the flushing work while waiting
+ * for the lock
+ * (4) copy the buffer block info from pos_dirty to pos_clean to a temporary
+ * array
+ * (5) release the buffer_lock (to allow allocation to proceed, as long as
+ * free blocks exist)
+ *
+ * (6) sort the temporary array of buffer blocks in the order of their PBNs.
+ * We do this to form sequences of contiguous physical blocks, so that one
+ * set_memory_* call can cover a whole sequence of memory pages rather than
+ * one call per page; the longer the sequence, the more efficient the flush.
+ * (7) scan the sorted list, form sequences of contiguous physical blocks,
+ * and call _pmbd_buffer_flush_range() to synchronize the sequences one by one
+ *
+ * (8) re-acquire the buffer_lock
+ * (9) update pos_dirty and num_dirty to reflect the recent changes
+ * (10) release the buffer_lock, then the flush_lock
+ *
+ * NOTE: The caller must not hold flush_lock or buffer_lock, but may hold
+ * pbi->lock.
+ *
+ */
+static unsigned long pmbd_buffer_flush(PMBD_BUFFER_T* buffer, unsigned long num_to_clean)
+{
+ BBN_T i = 0;
+ BBN_T bbn_s = 0;
+ BBN_T bbn_e = 0;
+ PBN_T first_pbn = 0;
+ PBN_T last_pbn = 0;
+ unsigned long num_cleaned = 0;
+ unsigned long num_scanned = 0;
+ PMBD_DEVICE_T* pmbd = buffer->pmbd;
+ PMBD_BSORT_ENTRY_T* bbi_sort_buffer = buffer->bbi_sort_buffer;
+
+ /* lock the flush_lock to ensure no-one else can do flush in parallel */
+ spin_lock(&buffer->flush_lock);
+
+ /* now we lock the buffer to protect buffer control info */
+ spin_lock(&buffer->buffer_lock);
+
+ /* check if num_to_clean is too large */
+ if (num_to_clean > buffer->num_dirty)
+ num_to_clean = buffer->num_dirty;
+
+ /* check if the buffer is empty (someone else may have done the flushing job) */
+ if (PMBD_BUFFER_IS_EMPTY(buffer) || num_to_clean == 0) {
+ spin_unlock(&buffer->buffer_lock);
+ goto done;
+ }
+
+ /* set up the range of BBNs we need to check */
+ bbn_s = buffer->pos_dirty; /* the first bbn */
+ bbn_e = PMBD_BUFFER_PRIO_POS(buffer, buffer->pos_clean);/* the last bbn */
+
+ /* scan the buffer range and put it into the sort buffer */
+ /*
+ * NOTE: bbn_s could be equal to PMBD_BUFFER_NEXT_POS(buffer, bbn_e) if
+ * the buffer is filled with dirty blocks, so we need to check num_scanned
+ * here.
+ */
+ for (i = bbn_s;
+ (i != PMBD_BUFFER_NEXT_POS(buffer, bbn_e)) || (num_scanned == 0);
+ i = PMBD_BUFFER_NEXT_POS(buffer, i)) {
+ /*
+ * FIXME: it may be possible that some blocks in the dirty
+ * block range are "clean", because after the block is
+ * allocated, and before it is being written, the block is
+ * marked as CLEAN, but it is allocated already. However, it is
+ * safe to attempt to flush it, because the pbi->lock would
+ * protect us.
+ *
+ * UPDATES: we changed the allocator code to mark it dirty as
+ * soon as the block is allocated. So the aforesaid situation
+ * would not happen anymore.
+ */
+ if(PMBD_BUFFER_BBI_IS_CLEAN(buffer, i)){
+ /* found clean blocks */
+ panic("ERR: %s(%d)%u: found clean block in the range of dirty blocks (bbn_s=%lu bbn_e=%lu, i=%lu, num_scanned=%lu num_to_clean=%lu num_dirty=%lu pos_dirty=%lu pos_clean=%lu)\n",
+ __FUNCTION__, __LINE__, __CURRENT_PID__, bbn_s, bbn_e, i, num_scanned, num_to_clean, buffer->num_dirty, buffer->pos_dirty, buffer->pos_clean);
+ continue;
+ } else {
+ PMBD_BBI_T* bbi = PMBD_BUFFER_BBI(buffer, i);
+ PMBD_BSORT_ENTRY_T* se = bbi_sort_buffer + num_scanned;
+
+ /* add it to the buffer for sorting */
+ se->pbn = bbi->pbn;
+ se->bbn = i;
+ num_scanned ++;
+
+ /* only clean num_to_clean blocks */
+ if (num_scanned >= num_to_clean)
+ break;
+ }
+ }
+ /* unlock the buffer to let allocator continue */
+ spin_unlock(&buffer->buffer_lock);
+
+ /* no valid dirty blocks to clean */
+ if (num_scanned == 0)
+ goto done;
+
+ /*
+ * sort the buffer to get sequences of contiguous blocks
+ */
+ if (PMBD_DEV_USE_WPMODE_PTE(pmbd))
+ sort(bbi_sort_buffer, num_scanned, sizeof(PMBD_BSORT_ENTRY_T), compare_bbi_sort_entries, swap_bbi_sort_entries);
+
+ /* scan the sorted list to organize and flush the sequences of contiguous PBNs */
+ for (i = 0; i < num_scanned; i ++) {
+ PMBD_BSORT_ENTRY_T* se = bbi_sort_buffer + i;
+ PMBD_BBI_T* bbi = PMBD_BUFFER_BBI(buffer, se->bbn);
+ if (i == 0) {
+ /* the first one */
+ first_pbn = bbi->pbn;
+ last_pbn = bbi->pbn;
+ continue;
+ } else {
+ if (bbi->pbn == (last_pbn + 1) ) {
+ /* if blocks are contiguous */
+ last_pbn = bbi->pbn;
+ continue;
+ } else {
+ /* if blocks are not contiguous */
+ num_cleaned += _pmbd_buffer_flush_range(buffer, first_pbn, last_pbn);
+
+ /* start a new sequence */
+ first_pbn = bbi->pbn;
+ last_pbn = bbi->pbn;
+ continue;
+ }
+ }
+ }
+
+ /* finish the last sequence of contiguous PBNs */
+ num_cleaned += _pmbd_buffer_flush_range(buffer, first_pbn, last_pbn);
+
+ /* update the buffer control info */
+ spin_lock(&buffer->buffer_lock);
+ buffer->pos_dirty = PMBD_BUFFER_NEXT_N_POS(buffer, bbn_s, num_cleaned); /* move pos_dirty forward */
+ buffer->num_dirty -= num_cleaned; /* decrement the counter*/
+ spin_unlock(&buffer->buffer_lock);
+
+done:
+ spin_unlock(&buffer->flush_lock);
+ return num_cleaned;
+}
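As a userspace illustration (not part of the driver), the run-grouping pass in steps (6)-(7) above can be sketched as follows: scan PBN-sorted entries and emit one (first, last) range per maximal contiguous run, mirroring how the loop calls _pmbd_buffer_flush_range() once per run. The function name and output layout are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Given buffer-block entries sorted by physical block number (PBN),
 * group maximal runs of contiguous PBNs so that each run can be
 * flushed with a single range operation. Each run is recorded as a
 * (first, last) pair in out[] (which must hold 2*n entries); the
 * function returns the run count.
 */
static size_t group_contiguous_runs(const unsigned long *sorted_pbns,
                                    size_t n, unsigned long *out)
{
    size_t i, runs = 0;
    unsigned long first, last;

    if (n == 0)
        return 0;

    first = last = sorted_pbns[0];
    for (i = 1; i < n; i++) {
        if (sorted_pbns[i] == last + 1) {
            /* still contiguous: extend the current run */
            last = sorted_pbns[i];
        } else {
            /* run broken: record it and start a new one */
            out[2 * runs] = first;
            out[2 * runs + 1] = last;
            runs++;
            first = last = sorted_pbns[i];
        }
    }
    /* record the final run */
    out[2 * runs] = first;
    out[2 * runs + 1] = last;
    return runs + 1;
}
```

With input {3, 4, 5, 9, 10, 20} this yields the ranges (3,5), (9,10), (20,20); the fewer and longer the runs, the fewer set_memory_* calls the real flush needs.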
+
+/*
+ * Entry function for flushing the buffer.
+ * This function is called by the allocator, the syncer, and the destroyer.
+ * @buffer: the pmbd buffer to flush
+ * @num_to_clean: how many blocks to clean
+ * @caller: identifies the caller (CALLER_ALLOCATOR, CALLER_SYNCER, or
+ * CALLER_DESTROYER)
+ */
+static unsigned long pmbd_buffer_check_and_flush(PMBD_BUFFER_T* buffer, unsigned long num_to_clean, unsigned caller)
+{
+ unsigned long num_cleaned = 0;
+
+ /*
+ * Since there may exist more than one thread (e.g. alloc/flush or
+ * alloc/alloc) trying to flush the buffer, we need to first check if
+ * someone else has already done the job while waiting for the lock. If
+ * so, we do not have to flush again, which improves application
+ * responsiveness.
+ */
+ if (caller == CALLER_DESTROYER){
+ /* if destroyer calls this function, just flush everything */
+ goto do_it;
+
+ } else if (caller == CALLER_SYNCER) {
+ /* if syncer calls this function and the buffer is empty, do nothing */
+ spin_lock(&buffer->buffer_lock);
+ if (PMBD_BUFFER_IS_EMPTY(buffer)){
+ spin_unlock(&buffer->buffer_lock);
+ goto done;
+ }
+ spin_unlock(&buffer->buffer_lock);
+
+ } else if (caller == CALLER_ALLOCATOR){
+
+ /* if the allocator calls this function but the buffer is no longer
+ * full (someone freed buffer blocks meanwhile), do nothing */
+ spin_lock(&buffer->buffer_lock);
+ if (!PMBD_BUFFER_IS_FULL(buffer)){
+ spin_unlock(&buffer->buffer_lock);
+ goto done;
+ }
+ spin_unlock(&buffer->buffer_lock);
+
+ } else {
+ panic("ERR: %s(%d) unknown caller id\n", __FUNCTION__, __LINE__);
+ }
+
+ /* otherwise, we do flushing */
+do_it:
+ num_cleaned = pmbd_buffer_flush(buffer, num_to_clean);
+
+done:
+ return num_cleaned;
+}
+
+/*
+ * Core function of allocating a buffer block
+ *
+ * We first grab the buffer_lock, and check to see if the buffer is full. If
+ * not, we allocate a buffer block, move the pos_clean, and update num_dirty,
+ * then release the buffer_lock. Since we already hold the pbi->lock, it is
+ * safe to release the lock and let other threads proceed (before we really
+ * write data into the buffer block), because no one else can read/write or
+ * access the same buffer block concurrently. If the buffer is full, we release
+ * the buffer_lock to allow others to proceed (because we may be blocked at
+ * flush_lock later), and then we call the function to synchronously flush the
+ * buffer. Note that someone else may be there already, so we may be blocked
+ * there, and if we find someone has already flushed the buffer, we need to
+ * grab the buffer_lock and check if there is available buffer block again.
+ *
+ * NOTE: The caller must hold the pbi->lock.
+ *
+ */
+static PMBD_BBI_T* pmbd_buffer_alloc_block(PMBD_BUFFER_T* buffer, PBN_T pbn)
+{
+ BBN_T pos = 0;
+ PMBD_BBI_T* bbi = NULL;
+ PMBD_DEVICE_T* pmbd = buffer->pmbd;
+ PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
+
+ /* lock the buffer control info (we will check and update it) */
+ spin_lock(&buffer->buffer_lock);
+
+check_again:
+ /* check if the buffer is completely full, if yes, flush it to PM */
+ if (PMBD_BUFFER_IS_FULL(buffer)) {
+ /* release the buffer_lock (someone may be doing flushing)*/
+ spin_unlock(&buffer->buffer_lock);
+
+ /* If the buffer is full, we must flush it synchronously.
+ *
+ * NOTE: this on-demand flushing can improve performance a lot, since
+ * the allocator does not have to wait for the syncer to wake up, which
+ * is much faster. Another merit is that it makes the application run
+ * more smoothly (progress is bursty if we rely entirely on the syncer).
+ * Also note that we only flush a batch (e.g. 1024) of blocks, rather
+ * than the whole buffer, because we only need a few free blocks to
+ * satisfy the application's own need, and this reduces the time the
+ * application spends on allocation. */
+ pmbd_buffer_check_and_flush(buffer, buffer->batch_size, CALLER_ALLOCATOR);
+
+ /* grab the lock and check the availability of free buffer blocks
+ * again, because someone may use up all the free buffer blocks, right
+ * after the buffer is flushed but before we can get one */
+ spin_lock(&buffer->buffer_lock);
+ goto check_again;
+ }
+
+ /* if buffer is not full, only reserve one spot first.
+ *
+ * NOTE that we do not have to do link and memcpy in the locked region,
+ * because pbi->lock guarantees that no-one else can use it now. This
+ * moves the high-cost operations out of the critical section */
+ pos = buffer->pos_clean;
+ buffer->pos_clean = PMBD_BUFFER_NEXT_POS(buffer, buffer->pos_clean);
+ buffer->num_dirty ++;
+
+ /* NOTE: we mark it "dirty" here, but actually the data has not been
+ * really written into the PMBD buffer block yet. This is safe, because
+ * we are protected by the pbi->lock */
+ PMBD_BUFFER_SET_BBI_DIRTY(buffer, pos);
+
+ /* now link them up (no-one else can see it) */
+ bbi = PMBD_BUFFER_BBI(buffer, pos);
+
+ bbi->pbn = pbn;
+ pbi->bbn = pos;
+
+ /* unlock the buffer_lock and let others proceed */
+ spin_unlock(&buffer->buffer_lock);
+
+ return bbi;
+}
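The buffer control arithmetic above (a circular buffer where pos_clean is handed out and num_dirty tracks occupancy) can be modeled in userspace as follows. The struct and helper names are invented for this sketch, and it assumes PMBD_BUFFER_NEXT_POS is a simple modulo increment.

```c
#include <assert.h>

/* Minimal model of the buffer control info manipulated under buffer_lock */
struct buf_ctl {
    unsigned long num_blocks;
    unsigned long pos_clean;  /* next block to hand out */
    unsigned long pos_dirty;  /* oldest dirty block */
    unsigned long num_dirty;  /* current occupancy */
};

static unsigned long next_pos(const struct buf_ctl *b, unsigned long pos)
{
    return (pos + 1) % b->num_blocks; /* circular increment */
}

static int buf_is_full(const struct buf_ctl *b)
{
    return b->num_dirty == b->num_blocks;
}

/* reserve one buffer block: returns its position, or -1 if full
 * (the real allocator would flush synchronously instead of failing) */
static long buf_alloc(struct buf_ctl *b)
{
    long pos;

    if (buf_is_full(b))
        return -1;
    pos = (long) b->pos_clean;
    b->pos_clean = next_pos(b, b->pos_clean);
    b->num_dirty++;
    return pos;
}
```

Note how, as in the driver, the reservation itself is cheap; the data copy into the reserved block can happen after the lock is dropped because the per-block lock keeps others away.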
+
+
+/*
+ * syncer daemon worker function
+ */
+
+static inline uint64_t pmbd_device_is_idle(PMBD_DEVICE_T* pmbd)
+{
+ unsigned long last_jiffies, now_jiffies; /* jiffies is unsigned long */
+ uint64_t interval = 0;
+
+ now_jiffies = jiffies;
+ PMBD_DEV_GET_ACCESS_TIME(pmbd, last_jiffies);
+ interval = jiffies_to_usecs(now_jiffies - last_jiffies);
+
+ if (PMBD_DEV_IS_IDLE(pmbd, interval)) {
+ return interval;
+ } else {
+ return 0;
+ }
+}
+
+static int pmbd_syncer_worker(void* data)
+{
+ PMBD_BUFFER_T* buffer = (PMBD_BUFFER_T*) data;
+
+ set_user_nice(current, 0);
+
+ do {
+ unsigned do_flush = 0;
+ uint64_t idle_usec = 0;
+ spin_lock(&buffer->buffer_lock);
+
+ /* we start flushing, if
+ * (1) the num of dirty blocks hits the high watermark, or
+ * (2) the device has been idle for a while */
+ if (PMBD_BUFFER_ABOVE_HW(buffer)) {
+ do_flush = 1;
+ }
+ if ((idle_usec = pmbd_device_is_idle(buffer->pmbd)) && PMBD_BUFFER_ABOVE_LW(buffer)) {
+ do_flush = 1;
+ }
+ if (do_flush){
+ unsigned long num_dirty = 0;
+ unsigned long num_cleaned = 0;
+repeat:
+ num_dirty = buffer->num_dirty;
+ spin_unlock(&buffer->buffer_lock);
+
+ /* start flushing
+ *
+ * NOTE: we only allocate a batch (e.g. 1024) of blocks each time. The
+ * purpose is to let the applications wait for free blocks, so that they can
+ * get a few free blocks and proceed, rather than waiting for the whole
+ * buffer gets flushed. Otherwise, the bandwidth would be lower and the
+ * applications cannot run smoothly.
+ */
+ num_cleaned = pmbd_buffer_check_and_flush(buffer, buffer->batch_size, CALLER_SYNCER);
+
+ /* continue to flush until we hit the low watermark */
+ spin_lock(&buffer->buffer_lock);
+ if (PMBD_BUFFER_ABOVE_LW(buffer)) {
+ goto repeat;
+ }
+ }
+ spin_unlock(&buffer->buffer_lock);
+
+ /* go to sleep */
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(1);
+ set_current_state(TASK_RUNNING);
+
+ } while(!kthread_should_stop());
+ return 0;
+}
+
+static struct task_struct* pmbd_buffer_syncer_init(PMBD_BUFFER_T* buffer)
+{
+ struct task_struct* tsk = NULL;
+ tsk = kthread_run(pmbd_syncer_worker, (void*) buffer, "nsyncer");
+ /* kthread_run() returns an ERR_PTR() on failure, never NULL */
+ if (IS_ERR(tsk)) {
+ printk(KERN_ERR "pmbd: initializing buffer syncer failed\n");
+ return NULL;
+ }
+
+ buffer->syncer = tsk;
+ printk(KERN_INFO "pmbd: buffer syncer launched\n");
+ return tsk;
+}
+
+static int pmbd_buffer_syncer_stop(PMBD_BUFFER_T* buffer)
+{
+ if (buffer->syncer){
+ kthread_stop(buffer->syncer);
+ buffer->syncer = NULL;
+ printk(KERN_INFO "pmbd: buffer syncer stopped\n");
+ }
+ return 0;
+}
+
+/*
+ * read and write to PMBD with buffer
+ */
+static void copy_to_pmbd_buffered(PMBD_DEVICE_T* pmbd, void *src, sector_t sector, size_t bytes)
+{
+ PBN_T pbn = 0;
+ void* from = src;
+
+ /*
+ * get the start and end in-block offset
+ *
+ * NOTE: Since the buffer block (4096 bytes) can be larger than the
+ * sector (512 bytes), if the incoming request is not completely aligned
+ * to buffer blocks, we need to read the full block from PM into the
+ * buffer block and apply the writes to part of the buffer block. Here,
+ * offset_s and offset_e are the start and end in-block offsets (in
+ * units of sectors) for the first and the last sector in the request,
+ * they may or may not appear in the same buffer block, depending on the
+ * request size.
+ */
+ PBN_T pbn_s = SECTOR_TO_PBN(pmbd, sector);
+ PBN_T pbn_e = BYTE_TO_PBN(pmbd, (SECTOR_TO_BYTE(sector) + bytes - 1));
+ sector_t offset_s = pmbd_buffer_aligned_request_start(pmbd, sector, bytes);
+ sector_t offset_e = pmbd_buffer_aligned_request_end(pmbd, sector, bytes);
+
+ /* for each physical block */
+ for (pbn = pbn_s; pbn <= pbn_e; pbn ++){
+ void* to = NULL;
+ PMBD_BBI_T* bbi = NULL;
+ PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
+ sector_t sect_s = (pbn == pbn_s) ? offset_s : 0; /* sub-block access */
+ sector_t sect_e = (pbn == pbn_e) ? offset_e : (PBN_TO_SECTOR(pmbd, 1) - 1);/* sub-block access */
+ size_t size = SECTOR_TO_BYTE(sect_e - sect_s + 1); /* get the real size */
+ PMBD_BUFFER_T* buffer = PBN_TO_PMBD_BUFFER(pmbd, pbn);
+
+ /* lock the physical block first */
+ spin_lock(&pbi->lock);
+
+ /* check if the physical block is buffered */
+ bbi = _pmbd_buffer_lookup(buffer, pbn);
+
+ if (bbi){
+ /* if the block is already buffered */
+ to = PMBD_BUFFER_BLOCK(buffer, pbi->bbn) + SECTOR_TO_BYTE(sect_s);
+ } else {
+ /* if not buffered, allocate one free buffer block */
+ bbi = pmbd_buffer_alloc_block(buffer, pbn);
+
+ /* if not aligned to a full block, we have to copy the whole
+ * block from the PM space to the buffer block first */
+ if (size < pmbd->pb_size){
+ memcpy_from_pmbd(pmbd, PMBD_BUFFER_BLOCK(buffer, pbi->bbn), PMBD_BLOCK_VADDR(pmbd, pbn), pmbd->pb_size);
+ }
+ to = PMBD_BUFFER_BLOCK(buffer, pbi->bbn) + SECTOR_TO_BYTE(sect_s);
+ }
+
+ /* writing it into buffer */
+ memcpy(to, from, size);
+ PMBD_BUFFER_SET_BBI_DIRTY(buffer, pbi->bbn);
+
+ /* unlock the block */
+ spin_unlock(&pbi->lock);
+
+ from += size;
+ }
+
+ return;
+}
+
+static void copy_from_pmbd_buffered(PMBD_DEVICE_T* pmbd, void *dst, sector_t sector, size_t bytes)
+{
+ PBN_T pbn = 0;
+ void* to = dst;
+
+ /* get the start and end in-block offset */
+ PBN_T pbn_s = SECTOR_TO_PBN(pmbd, sector);
+ PBN_T pbn_e = BYTE_TO_PBN(pmbd, SECTOR_TO_BYTE(sector) + bytes - 1);
+ sector_t offset_s = pmbd_buffer_aligned_request_start(pmbd, sector, bytes);
+ sector_t offset_e = pmbd_buffer_aligned_request_end(pmbd, sector, bytes);
+
+ for (pbn = pbn_s; pbn <= pbn_e; pbn ++){
+ /* Scan the incoming request and check each block, for each block, we
+ * check if it is in the buffer. If true, we read it from the buffer,
+ * otherwise, we read from the PM space. */
+
+ void* from = NULL;
+ PMBD_BBI_T* bbi = NULL;
+ PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
+ sector_t sect_s = (pbn == pbn_s) ? offset_s : 0;
+ sector_t sect_e = (pbn == pbn_e) ? offset_e : (PBN_TO_SECTOR(pmbd, 1) - 1);/* sub-block access */
+ size_t size = SECTOR_TO_BYTE(sect_e - sect_s + 1); /* get the real size */
+ PMBD_BUFFER_T* buffer = PBN_TO_PMBD_BUFFER(pmbd, pbn);
+
+ /* lock the physical block first */
+ spin_lock(&pbi->lock);
+
+ /* check if the block is in the buffer */
+ bbi = _pmbd_buffer_lookup(buffer, pbn);
+
+ /* start reading data */
+ if (bbi) {
+ /* if buffered, read it from the buffer */
+ from = PMBD_BUFFER_BLOCK(buffer, pbi->bbn) + SECTOR_TO_BYTE(sect_s);
+
+ /* read it out */
+ memcpy(to, from, size);
+
+ } else {
+ /* if not buffered, read it from PM space */
+ from = PMBD_BLOCK_VADDR(pmbd, pbn) + SECTOR_TO_BYTE(sect_s);
+
+ /* verify the checksum first */
+ if (PMBD_USE_CHECKSUM())
+ pmbd_checksum_on_read(pmbd, from, size);
+
+ /* read it out*/
+ memcpy_from_pmbd(pmbd, to, from, size);
+ }
+
+ /* unlock the block */
+ spin_unlock(&pbi->lock);
+
+ to += size;
+ }
+
+ return;
+}
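Both buffered copy paths share the same per-block arithmetic: a request of `bytes` starting at `sector` spans physical blocks pbn_s..pbn_e, and within the first and last block only a sub-range of sectors is touched. A userspace sketch of that math, assuming the 512-byte sectors and 4096-byte blocks stated in the comments above (the struct and function names are hypothetical):

```c
#include <assert.h>

#define SECTOR_SIZE   512UL
#define BLOCK_SIZE    4096UL
#define SECT_PER_BLK  (BLOCK_SIZE / SECTOR_SIZE)

struct blk_span {
    unsigned long pbn_s, pbn_e;   /* first and last physical block */
    unsigned long off_s, off_e;   /* in-block sector offsets */
};

/* compute which blocks a request touches and the sub-block offsets
 * within the first and last block */
static struct blk_span request_span(unsigned long sector, unsigned long bytes)
{
    struct blk_span s;
    unsigned long last_byte = sector * SECTOR_SIZE + bytes - 1;

    s.pbn_s = sector / SECT_PER_BLK;
    s.pbn_e = last_byte / BLOCK_SIZE;
    s.off_s = sector % SECT_PER_BLK;
    s.off_e = (last_byte / SECTOR_SIZE) % SECT_PER_BLK;
    return s;
}
```

A partial first or last block (off_s > 0, or off_e short of the block end) is exactly the case where the write path must first read the whole block from PM into the buffer block.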
+
+/*
+ * buffer related space alloc/free functions
+ */
+static int pmbd_pbi_space_alloc(PMBD_DEVICE_T* pmbd)
+{
+ int err = 0;
+
+ /* allocate the per-block info (PBI) array */
+ pmbd->pbi_space = vmalloc(PMBD_TOTAL_PB_NUM(pmbd) * sizeof(PMBD_PBI_T));
+ if (pmbd->pbi_space) {
+ PBN_T i;
+ for (i = 0; i < PMBD_TOTAL_PB_NUM(pmbd); i ++) {
+ PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, i);
+ PMBD_SET_BLOCK_UNBUFFERED(pmbd, i);
+ spin_lock_init(&pbi->lock);
+ }
+ printk(KERN_INFO "pmbd(%d): pbi space is initialized\n", pmbd->pmbd_id);
+ } else {
+ err = -ENOMEM;
+ }
+
+ return err;
+}
+
+static int pmbd_pbi_space_free(PMBD_DEVICE_T* pmbd)
+{
+ if (pmbd->pbi_space){
+ vfree(pmbd->pbi_space);
+ pmbd->pbi_space = NULL;
+ printk(KERN_INFO "pmbd(%d): pbi space is freed\n", pmbd->pmbd_id);
+ }
+ return 0;
+}
+
+static PMBD_BUFFER_T* pmbd_buffer_create(PMBD_DEVICE_T* pmbd)
+{
+ int i;
+ PMBD_BUFFER_T* buffer = kzalloc(sizeof(PMBD_BUFFER_T), GFP_KERNEL);
+ if (!buffer){
+ goto fail;
+ }
+
+ /* link to the pmbd device */
+ buffer->pmbd = pmbd;
+
+ /* set size */
+ if (g_pmbd_bufsize[pmbd->pmbd_id] > PMBD_BUFFER_MIN_BUFSIZE) {
+ buffer->num_blocks = MB_TO_BYTES(g_pmbd_bufsize[pmbd->pmbd_id]) / pmbd->pb_size;
+ } else {
+ if (PMBD_DEV_USE_BUFFER(pmbd)) {
+ printk(KERN_INFO "pmbd(%d): WARNING - buffer size too small (%llu MB). Buffer set to %d MB\n",
+ pmbd->pmbd_id, g_pmbd_bufsize[pmbd->pmbd_id], PMBD_BUFFER_MIN_BUFSIZE);
+ }
+ buffer->num_blocks = MB_TO_BYTES(PMBD_BUFFER_MIN_BUFSIZE) / pmbd->pb_size;
+ }
+
+ /* buffer space */
+ buffer->buffer_space = vmalloc(buffer->num_blocks * pmbd->pb_size);
+ if (!buffer->buffer_space)
+ goto fail;
+
+ /* BBI array */
+ buffer->bbi_space = vmalloc(buffer->num_blocks * sizeof(PMBD_BBI_T));
+ if (!buffer->bbi_space)
+ goto fail;
+ memset(buffer->bbi_space, 0, buffer->num_blocks * sizeof(PMBD_BBI_T));
+
+ /* temporary array of bbi for sorting */
+ buffer->bbi_sort_buffer = vmalloc(buffer->num_blocks * sizeof(PMBD_BSORT_ENTRY_T));
+ if (!buffer->bbi_sort_buffer)
+ goto fail;
+
+ /* initialize the locks*/
+ spin_lock_init(&buffer->buffer_lock);
+ spin_lock_init(&buffer->flush_lock);
+
+ /* initialize the BBI array */
+ for (i = 0; i < buffer->num_blocks; i ++){
+ PMBD_BUFFER_SET_BBI_CLEAN(buffer, i);
+ PMBD_BUFFER_SET_BBI_UNBUFFERED(buffer, i);
+ }
+
+ /* initialize the buffer control info */
+ buffer->num_dirty = 0;
+ buffer->pos_dirty = 0;
+ buffer->pos_clean = 0;
+ buffer->batch_size = g_pmbd_buffer_batch_size[pmbd->pmbd_id];
+
+ /* launch the syncer daemon */
+ pmbd_buffer_syncer_init(buffer);
+ if (!buffer->syncer)
+ goto fail;
+
+ printk(KERN_INFO "pmbd: pmbd device buffer (%u) allocated (%lu blocks - block size %u bytes)\n",
+ buffer->buffer_id, buffer->num_blocks, pmbd->pb_size);
+ return buffer;
+
+fail:
+ if (buffer && buffer->bbi_sort_buffer)
+ vfree(buffer->bbi_sort_buffer);
+ if (buffer && buffer->bbi_space)
+ vfree(buffer->bbi_space);
+ if (buffer && buffer->buffer_space)
+ vfree(buffer->buffer_space);
+ if (buffer)
+ kfree(buffer);
+ printk(KERN_ERR "%s(%d) buffer allocation failed\n", __FUNCTION__, __LINE__);
+ return NULL;
+}
+
+static int pmbd_buffer_destroy(PMBD_BUFFER_T* buffer)
+{
+ unsigned id = buffer->buffer_id;
+
+ /* stop syncer first */
+ pmbd_buffer_syncer_stop(buffer);
+
+ /* flush the buffer to the PM space */
+ pmbd_buffer_check_and_flush(buffer, buffer->num_blocks, CALLER_DESTROYER);
+
+ /* FIXME: wait for the on-going operations to finish first? */
+ if (buffer->bbi_sort_buffer)
+ vfree(buffer->bbi_sort_buffer);
+ if (buffer->bbi_space)
+ vfree(buffer->bbi_space);
+ if (buffer->buffer_space)
+ vfree(buffer->buffer_space);
+ kfree(buffer);
+ printk(KERN_INFO "pmbd: pmbd device buffer (%u) space freed\n", id);
+ return 0;
+}
+
+static int pmbd_buffers_create(PMBD_DEVICE_T* pmbd)
+{
+ int i;
+ for (i = 0; i < pmbd->num_buffers; i ++){
+ pmbd->buffers[i] = pmbd_buffer_create(pmbd);
+ if (pmbd->buffers[i] == NULL)
+ return -ENOMEM;
+ (pmbd->buffers[i])->buffer_id = i;
+ }
+ return 0;
+}
+
+static int pmbd_buffers_destroy(PMBD_DEVICE_T* pmbd)
+{
+ int i;
+ for (i = 0; i < pmbd->num_buffers; i ++){
+ if(pmbd->buffers[i]){
+ pmbd_buffer_destroy(pmbd->buffers[i]);
+ pmbd->buffers[i] = NULL;
+ }
+ }
+ return 0;
+}
+
+static int pmbd_buffer_space_alloc(PMBD_DEVICE_T* pmbd)
+{
+ int err = 0;
+
+ if (pmbd->num_buffers <= 0)
+ return 0;
+
+ /* allocate buffers array */
+ pmbd->buffers = kzalloc(sizeof(PMBD_BUFFER_T*) * pmbd->num_buffers, GFP_KERNEL);
+ if (pmbd->buffers == NULL){
+ err = -ENOMEM;
+ goto fail;
+ }
+
+ /* allocate each buffer */
+ err = pmbd_buffers_create(pmbd);
+ printk(KERN_INFO "pmbd: pmbd buffer space allocated.\n");
+fail:
+ return err;
+}
+
+static int pmbd_buffer_space_free(PMBD_DEVICE_T* pmbd)
+{
+ if (pmbd->num_buffers <= 0)
+ return 0;
+
+ pmbd_buffers_destroy(pmbd);
+ kfree(pmbd->buffers);
+ pmbd->buffers = NULL;
+ printk(KERN_INFO "pmbd: pmbd buffer space freed.\n");
+
+ return 0;
+}
+
+
+/*
+ * *************************************************************************
+ * High memory based PMBD functions
+ * *************************************************************************
+ *
+ * NOTE:
+ * (1) memcpy_fromio() and memcpy_toio() are used for reading/writing PM,
+ * although this is unnecessary on x86 architectures.
+ * (2) Currently we allocate the reserved space to the PMBD devices only
+ * once; no dynamic allocation/deallocation of the space is needed so far.
+ */
+
+
+static void* pmbd_highmem_map(void)
+{
+ /*
+ * NOTE: we could also use ioremap_* functions to directly set memory
+ * page attributes when remapping, but to keep it consistent with
+ * using vmalloc(), we do ioremap_cache() and call set_memory_* later.
+ */
+
+ if (PMBD_USE_PMAP()){
+ /* NOTE: If we use pmap(), we don't need to map the reserved
+ * physical memory into the kernel space. Instead we use
+ * pmap_atomic() to map and unmap the to-be-accessed pages on
+ * demand. Since such a mapping is private to the processor,
+ * there is no need to change PTEs or perform TLB shootdowns.
+ *
+ * Also note that we use PMBD_PMAP_DUMMY_BASE_VA to make the rest
+ * of the code happy with a valid virtual address. The real
+ * physical address is calculated as follows:
+ * g_highmem_phys_addr + (vaddr) - PMBD_PMAP_DUMMY_BASE_VA
+ *
+ * (updated 10/25/2011)
+ */
+
+ g_highmem_virt_addr = (void*) PMBD_PMAP_DUMMY_BASE_VA;
+ g_highmem_curr_addr = g_highmem_virt_addr;
+ printk(KERN_INFO "pmbd: PMAP enabled - setting g_highmem_virt_addr to a dummy address (0x%lx)\n", (unsigned long) PMBD_PMAP_DUMMY_BASE_VA);
+ return g_highmem_virt_addr;
+
+ } else if ((g_highmem_virt_addr = ioremap_prot(g_highmem_phys_addr, g_highmem_size, g_pmbd_cpu_cache_flag))) {
+
+ g_highmem_curr_addr = g_highmem_virt_addr;
+ printk(KERN_INFO "pmbd: high memory space remapped (offset: %llu MB, size=%lu MB, cache flag=%s)\n",
+ BYTES_TO_MB(g_highmem_phys_addr), BYTES_TO_MB(g_highmem_size), PMBD_CPU_CACHE_FLAG());
+ return g_highmem_virt_addr;
+
+ } else {
+
+ printk(KERN_ERR "pmbd: %s(%d) - failed remapping high memory space (offset: %llu MB size=%lu MB)\n",
+ __FUNCTION__, __LINE__, BYTES_TO_MB(g_highmem_phys_addr), BYTES_TO_MB(g_highmem_size));
+ return NULL;
+ }
+}
+
+static void pmbd_highmem_unmap(void)
+{
+ /* de-remap the high memory from kernel address space */
+ /* NOTE: if we use pmap(), the g_highmem_virt_addr is fake */
+ if (!PMBD_USE_PMAP()){
+ if(g_highmem_virt_addr){
+ iounmap(g_highmem_virt_addr);
+ g_highmem_virt_addr = NULL;
+ printk(KERN_INFO "pmbd: high memory space (offset: %llu MB, size=%lu MB) unmapped\n",
+ BYTES_TO_MB(g_highmem_phys_addr), BYTES_TO_MB(g_highmem_size));
+ }
+ }
+ return;
+}
+
+static void* hmalloc(uint64_t bytes)
+{
+ void* rtn = NULL;
+
+ /* check if there is still available reserved high-memory space */
+ if (bytes <= PMBD_HIGHMEM_AVAILABLE_SPACE) {
+ rtn = g_highmem_curr_addr;
+ g_highmem_curr_addr += bytes;
+ } else {
+ printk(KERN_ERR "pmbd: %s(%d) - no available space (< %llu bytes) in reserved high memory\n",
+ __FUNCTION__, __LINE__, bytes);
+ }
+ return rtn;
+}
+
+static int hfree(void* addr)
+{
+ /* FIXME: no support for dynamic alloc/dealloc in HIGH_MEM space */
+ return 0;
+}
+
+
+/*
+ * *************************************************************************
+ * Device Emulation
+ * *************************************************************************
+ *
+ * Our emulation is based on a simple model - access time and transfer time.
+ *
+ * emulated time = access time + (request size / bandwidth)
+ * inserted delay = emulated time - observed time
+ *
+ * (1) Access time is applied to each request. We check each request's real
+ * access time and pad it with an extra delay to meet the designated latency.
+ * This is a best-effort solution, which means we just guarantee that no
+ * request can be completed with a response time less than the specified
+ * latency, but the real access latencies could be higher. In addition, if the
+ * total number of threads is larger than the number of available processors,
+ * the simulated latencies could be higher, due to CPU saturation.
+ *
+ * (2) Transfer time is calculated based on batches
+ * - A batch is a sequence of consecutive requests with a short interval in
+ * between; requests in a batch can be overlapped with each other (parallel
+ * jobs); there is a limit for the total amount of data and the duration of
+ * a batch
+ * - For each batch, we calculate its target emulated transfer time as
+ * "emul_trans_time = num_sectors/emul_bandwidth" and calculate a delay as
+ * "delay = emul_trans_time - real_trans_time"
+ * - The calculated delay is applied to each batch at the end
+ * - A lock is used to slow down all threads, because bandwidth is a
+ * system-wide specification. In this way, we serialize the threads
+ * accessing the device, which simulates that the device is busy on a task.
+ *
+ * (3) Two types of delays are implemented
+ * - Sync delay: if the delay is less than 10ms, we keep polling the TSC
+ * counter, essentially busy-waiting like a spin-lock. This reaches a
+ * precision of around a hundred cycles.
+ * - Async delay: if the delay is more than 10ms, we call msleep() to sleep
+ * for a while, which relinquishes CPU control and results in lower
+ * precision. The left-over delay is applied as a sync delay in nanoseconds.
+ * Async delay cannot be used while holding a lock.
+ *
+ */
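The model above reduces to two formulas: emulated time = access time + size/bandwidth, and inserted delay = max(0, emulated time - observed time). A minimal userspace sketch with illustrative units (plain ns and bytes/s, not the driver's fixed-point code; function names are invented):

```c
#include <assert.h>

/* emulated time = access time + transfer time (size / bandwidth) */
static unsigned long long emulated_time_ns(unsigned long long access_ns,
                                           unsigned long long bytes,
                                           unsigned long long bw_bytes_per_sec)
{
    return access_ns + bytes * 1000000000ULL / bw_bytes_per_sec;
}

/* inserted delay = emulated time - observed time, clamped at zero:
 * we only ever slow requests down, never speed them up */
static unsigned long long delay_ns(unsigned long long emulated_ns,
                                   unsigned long long observed_ns)
{
    return emulated_ns > observed_ns ? emulated_ns - observed_ns : 0;
}
```

The clamp is what makes this a best-effort emulation: requests that are already slower than the target are left alone.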
+
+
+static inline uint64_t DIV64_ROUND(uint64_t dividend, uint64_t divisor)
+{
+ if (divisor > 0) {
+ /* round half up: add one iff the remainder is at least half the
+ * divisor; use 64-bit intermediates to avoid the truncation and
+ * mod*2 overflow that 32-bit temporaries would cause */
+ uint64_t quot = dividend / divisor;
+ uint64_t mod = dividend % divisor;
+ return quot + (2 * mod) / divisor;
+ } else { /* FIXME: how to handle this? */
+ printk(KERN_WARNING "pmbd: WARNING - %s(%d) divisor is zero\n", __FUNCTION__, __LINE__);
+ return 0;
+ }
+}
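DIV64_ROUND's rounding trick, incrementing the quotient exactly when the remainder is at least half the divisor via (2*mod)/divisor, can be checked in isolation with this userspace sketch (the helper name is hypothetical):

```c
#include <assert.h>

/* round-half-up integer division: (2*mod)/divisor is 1 iff
 * mod >= divisor/2 (rounding up), and 0 otherwise, since 0 <= mod < divisor */
static unsigned long long div_round(unsigned long long dividend,
                                    unsigned long long divisor)
{
    unsigned long long quot = dividend / divisor;
    unsigned long long mod  = dividend % divisor;

    return quot + (2 * mod) / divisor;
}
```

This avoids floating point entirely, which matters in kernel code where FPU use is restricted.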
+
+static inline unsigned int get_cpu_freq(void)
+{
+ return cpu_khz;
+}
+
+static inline uint64_t _cycle_to_ns(uint64_t cycle, unsigned int khz)
+{
+ return cycle * 1000000 / khz;
+}
+
+static inline uint64_t cycle_to_ns(uint64_t cycle)
+{
+ unsigned int khz = get_cpu_freq();
+ return _cycle_to_ns(cycle, khz);
+}
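The cycle-to-nanosecond conversion follows from the TSC frequency being given in kHz: cycles divided by kHz yields milliseconds, so scaling by 10^6 yields nanoseconds. A standalone sketch of the same arithmetic as _cycle_to_ns() (userspace, invented name):

```c
#include <assert.h>

/* convert a TSC cycle count to nanoseconds given the frequency in kHz:
 * cycles / khz = ms, so cycles * 10^6 / khz = ns */
static unsigned long long cycles_to_ns(unsigned long long cycles,
                                       unsigned int khz)
{
    return cycles * 1000000ULL / khz;
}
```

Multiplying before dividing preserves precision for short intervals, at the cost of overflowing for very large cycle counts (beyond roughly 1.8e13 cycles).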
+
+/*
+ * emulate the latency for a given request size/type on a device
+ * @num_sectors: num of sectors to read/write
+ * @rw: read or write
+ * @pmbd: the pmbd device
+ */
+static uint64_t cal_trans_time(unsigned int num_sectors, unsigned rw, PMBD_DEVICE_T* pmbd)
+{
+ uint64_t ns = 0;
+ uint64_t bw = (rw == READ) ? pmbd->rdbw : pmbd->wrbw; /* bandwidth */
+ if (bw) {
+ uint64_t tmp = num_sectors * PMBD_SECTOR_SIZE; /* bytes */
+ uint64_t tt = 1000000000UL >> MB_SHIFT; /* ns per MB at 1 MB/s */
+ ns += DIV64_ROUND((tmp * tt), bw);
+ }
+ return ns;
+}
+
+static uint64_t cal_access_time(unsigned int num_sectors, unsigned rw, PMBD_DEVICE_T* pmbd)
+{
+ uint64_t ns = (rw == READ) ? pmbd->rdlat : pmbd->wrlat; /* access time */
+ return ns;
+}
+
+static inline void sync_slowdown(uint64_t ns)
+{
+ uint64_t start, now;
+ unsigned int khz = get_cpu_freq();
+ if (ns) {
+ /*
+ * We keep reading TSC counter to check if the delay has
+ * been passed and this prevents CPU from being scaled down,
+ * which provides a stable estimation of the elapsed time.
+ */
+ TIMESTAMP(start);
+ while(1) {
+ TIMESTAMP(now);
+ if (_cycle_to_ns((now-start), khz) > ns)
+ break;
+ }
+ }
+ return;
+}
+
+static inline void sync_slowdown_cycles(uint64_t cycles)
+{
+
+ uint64_t start, now;
+ if (cycles){
+ /*
+ * We keep reading TSC counter to check if the delay has
+ * been passed and this prevents CPU from being scaled down,
+ * which provides a stable estimation of the elapsed time.
+ */
+ TIMESTAMP(start);
+ while(1) {
+ TIMESTAMP(now);
+ if ((now - start) >= cycles)
+ break;
+ }
+ }
+ return;
+}
+
+static inline void async_slowdown(uint64_t ns)
+{
+ uint64_t ms = ns / 1000000;
+ uint64_t left = ns - (ms * 1000000);
+ /* do ms delay with sleep */
+ msleep(ms);
+
+ /* make up the sub-ms delay */
+ sync_slowdown(left);
+}
+
+
+static void pmbd_slowdown(uint64_t ns, unsigned in_lock)
+{
+ /*
+ * NOTE: if the delay is less than 10ms, we use sync_slowdown to keep
+ * polling the CPU cycle counter and busy waiting for the delay elapse;
+ * otherwise, we use msleep() to relinquish the CPU control.
+ */
+ if (ns > MAX_SYNC_SLOWDOWN && !in_lock)
+ async_slowdown(ns);
+ else if (ns > 0)
+ sync_slowdown(ns);
+
+ return;
+}
+
+/*
+ * Emulating the transfer time for a batch of requests for specific bandwidth
+ *
+ * We group a bunch of consecutive requests as a "batch". In one batch, the
+ * interval between two consecutive requests should be small, and the total
+ * amount of accessed data should be a good size (not too small, not too
+ * large), the duration is reasonable (not too long). For each batch, we
+ * estimate the emulated transfer time and compare it with the real transfer
+ * time (the start and end time of the batch), if the real transfer time is
+ * less than the emulated time, we apply an extra delay to the end of batch for
+ * making up the difference. In this way we make the bandwidth emulation
+ * closer to the real situation. Note that, since requests from multiple
+ * threads can be processed in parallel, we must slow down ALL the threads
+ * accessing the PMBD device; thus, we use batch_lock to coordinate all threads.
+ *
+ * @num_sectors: the num of sectors of the request
+ * @rw: read or write
+ * @pmbd: the involved pmbd device
+ *
+ */
+
+static void pmbd_emul_transfer_time(int num_sectors, int rw, PMBD_DEVICE_T* pmbd)
+{
+ uint64_t interval_ns = 0;
+ uint64_t duration_ns = 0;
+ unsigned new_batch = FALSE;
+ unsigned end_batch = FALSE;
+ uint64_t now_cycle = 0;
+
+ spin_lock(&pmbd->batch_lock);
+
+ /* get a timestamp for now */
+ TIMESTAMP(now_cycle);
+
+ /* if this is the first timestamp */
+ if (pmbd->batch_start_cycle[rw] == 0) {
+ pmbd->batch_start_cycle[rw] = now_cycle;
+ pmbd->batch_end_cycle[rw] = now_cycle;
+ goto done;
+ }
+
+ /* calculate the interval from the last request */
+ if (now_cycle >= pmbd->batch_end_cycle[rw]){
+ interval_ns = cycle_to_ns(now_cycle - pmbd->batch_end_cycle[rw]);
+ } else {
+ panic("%s(%d): timestamp went backwards.\n", __FUNCTION__, __LINE__);
+ }
+
+ /* check the interval length (cannot be too distant) */
+ if (interval_ns >= PMBD_BATCH_MAX_INTERVAL) {
+ /* interval is too big, break it to two batches */
+ new_batch = TRUE;
+ end_batch = TRUE;
+ } else {
+ /* still in the same batch, good */
+ pmbd->batch_sectors[rw] += num_sectors;
+ pmbd->batch_end_cycle[rw] = now_cycle;
+ }
+
+ /* check current batch duration (cannot be too long) */
+ duration_ns = cycle_to_ns(pmbd->batch_end_cycle[rw] - pmbd->batch_start_cycle[rw]);
+ if (duration_ns >= PMBD_BATCH_MAX_DURATION)
+ end_batch = TRUE;
+
+ /* check current batch data amount (cannot be too large) */
+ if (pmbd->batch_sectors[rw] >= PMBD_BATCH_MAX_SECTORS)
+ end_batch = TRUE;
+
+ /* if the batch ends, check and apply slow-down */
+ if (end_batch) {
+ /* batch size must be large enough, if not, just skip it */
+ if (pmbd->batch_sectors[rw] > PMBD_BATCH_MIN_SECTORS) {
+ uint64_t real_ns = cycle_to_ns(pmbd->batch_end_cycle[rw] - pmbd->batch_start_cycle[rw]);
+ uint64_t emul_ns = cal_trans_time(pmbd->batch_sectors[rw], rw, pmbd);
+
+ if (emul_ns > real_ns)
+ pmbd_slowdown((emul_ns - real_ns), TRUE);
+ }
+
+ pmbd->batch_sectors[rw] = 0;
+ pmbd->batch_start_cycle[rw] = now_cycle;
+ pmbd->batch_end_cycle[rw] = now_cycle;
+ }
+
+ /* if a new batch begins, add the first request */
+ if (new_batch) {
+ pmbd->batch_sectors[rw] = num_sectors;
+ pmbd->batch_start_cycle[rw] = now_cycle;
+ pmbd->batch_end_cycle[rw] = now_cycle;
+ }
+
+done:
+ spin_unlock(&pmbd->batch_lock);
+ return;
+}
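The batching logic above can be illustrated with a small userspace sketch. The `BATCH_MAX_SECTORS` threshold, the MB/s bandwidth parameter, and the plain nanosecond clock below are hypothetical stand-ins for the driver's cycle counters and its `cal_trans_time()`; the sketch only shows how the make-up delay for one batch is derived:

```c
#include <assert.h>
#include <stdint.h>

#define BATCH_MAX_SECTORS 2048ULL   /* hypothetical batch-size cap */
#define SECTOR_SIZE       512ULL

/* simplified per-direction batching state */
struct batch {
    uint64_t start_ns;  /* timestamp of the first request in the batch */
    uint64_t end_ns;    /* timestamp of the latest request */
    uint64_t sectors;   /* sectors accumulated so far */
};

/* emulated transfer time of `sectors` at `bw_mbs` MB/s, in nanoseconds */
static uint64_t emul_transfer_ns(uint64_t sectors, uint64_t bw_mbs)
{
    return sectors * SECTOR_SIZE * 1000ULL / bw_mbs;
}

/* account one request; returns the make-up delay (ns) when the batch
 * closes, or 0 while the batch is still open */
static uint64_t batch_account(struct batch *b, uint64_t now_ns,
                              uint64_t sectors, uint64_t bw_mbs)
{
    uint64_t delay = 0;

    b->sectors += sectors;
    b->end_ns = now_ns;

    if (b->sectors >= BATCH_MAX_SECTORS) {
        uint64_t real_ns = b->end_ns - b->start_ns;
        uint64_t emul_ns = emul_transfer_ns(b->sectors, bw_mbs);

        if (emul_ns > real_ns)
            delay = emul_ns - real_ns;  /* make up the difference */

        /* reset state for the next batch */
        b->sectors = 0;
        b->start_ns = b->end_ns = now_ns;
    }
    return delay;
}
```

In the driver the same comparison is done under batch_lock, so the delay throttles every thread touching the device, not just the one that happened to close the batch.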
+
+/*
+ * Emulating access time for a request
+ *
+ * Unlike bandwidth emulation, we emulate access time for each individual
+ * request. Right after simulating the transfer time, we examine the real
+ * access time (including the transfer time); if the real time is shorter
+ * than the specified access time, we slow the request down by applying a
+ * delay to make up the difference. Note that we do not use any lock to
+ * coordinate multiple threads for a system-wide "slowdown"; the delay is
+ * applied to each request individually and separately.
+ *
+ * Also note that since we essentially busy-wait, when the total number of
+ * threads exceeds or is close to the number of processors, the simulated
+ * access time observed at the application level could be longer than the
+ * specified access time due to high CPU usage. Measured directly over the
+ * duration of the make_request() function, however, the simulated access
+ * time is still very precise.
+ *
+ */
+static void pmbd_emul_access_time(uint64_t start, uint64_t end, int num_sectors, int rw, PMBD_DEVICE_T* pmbd)
+{
+ /*
+ * Access time can be overlapped with each other, so there is no need
+ * to use a lock to serialize it.
+ * FIXME: should we apply this on each batch or each request?
+ */
+ uint64_t real_ns = cycle_to_ns(end - start);
+ uint64_t emul_ns = cal_access_time(num_sectors, rw, pmbd);
+
+ if (emul_ns > real_ns)
+ pmbd_slowdown((emul_ns - real_ns), FALSE);
+
+ return;
+}
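Assuming a simple linear latency model (an assumption for illustration; the driver's actual target comes from `cal_access_time()`, defined elsewhere), the per-request make-up delay computed above reduces to:

```c
#include <assert.h>
#include <stdint.h>

/* If the real service time fell short of the emulated access time,
 * return the delay needed to make up the gap; otherwise no delay.
 * The linear num_sectors * lat_ns_per_sector model is a hypothetical
 * stand-in for the driver's cal_access_time(). */
static uint64_t access_time_delay(uint64_t real_ns, uint64_t num_sectors,
                                  uint64_t lat_ns_per_sector)
{
    uint64_t emul_ns = num_sectors * lat_ns_per_sector;

    return (emul_ns > real_ns) ? (emul_ns - real_ns) : 0;
}
```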
+
+/*
+ * set the starting hook for PM emulation
+ *
+ * @pmbd: pmbd device
+ * @num_sectors: sectors being accessed
+ * @rw: READ/WRITE
+ * return value: the start cycle
+ */
+static uint64_t emul_start(PMBD_DEVICE_T* pmbd, int num_sectors, int rw)
+{
+ uint64_t start = 0;
+ if (PMBD_DEV_USE_EMULATION(pmbd) && num_sectors > 0) {
+ /* start timer here */
+ TIMESTAMP(start);
+ }
+ return start;
+}
+
+/*
+ * set the stopping hook for PM emulation
+ *
+ * @pmbd: pmbd device
+ * @num_sectors: sectors being accessed
+ * @rw: READ/WRITE
+ * @start: the starting cycle
+ * return value: the end cycle
+ */
+static uint64_t emul_end(PMBD_DEVICE_T* pmbd, int num_sectors, int rw, uint64_t start)
+{
+ uint64_t end = 0;
+ uint64_t end2 = 0;
+ /*
+ * NOTE: emulation can be done in two ways - (1) directly specify the
+ * read/write latencies and bandwidths (2) only specify a relative
+ * slowdown ratio (X), compared to DRAM.
+ *
+ * Also note that if rdsx/wrsx is set, we will ignore
+ * rdlat/wrlat/rdbw/wrbw.
+ */
+ if (PMBD_DEV_USE_EMULATION(pmbd) && num_sectors > 0) {
+ /*
+ * NOTE: we first attempt to meet the target bandwidth and then
+ * the latency. This means the actual bandwidth should be close
+ * to the emulated bandwidth, and we then guarantee that the
+ * latency is not SMALLER than the target latency.
+ */
+
+ /* emulate the bandwidth first */
+ if (pmbd->rdbw > 0 && pmbd->wrbw > 0) {
+ /* emulate transfer time (bandwidth) */
+ pmbd_emul_transfer_time(num_sectors, rw, pmbd);
+ }
+
+ /* emulate the latency now */
+ TIMESTAMP(end);
+ if (pmbd->rdlat > 0 || pmbd->wrlat > 0) {
+ /* emulate access time (latency) */
+ pmbd_emul_access_time(start, end, num_sectors, rw, pmbd);
+ }
+ }
+ /* get the ending timestamp */
+ TIMESTAMP(end2);
+
+ return end2;
+}
+
+/*
+ * *************************************************************************
+ * PM space protection functions
+ * - clflush
+ * - write protection
+ * - write verification
+ * - checksum
+ * *************************************************************************
+ */
+
+/*
+ * flush designated cache lines in CPU cache
+ */
+
+static inline void pmbd_clflush_all(PMBD_DEVICE_T* pmbd)
+{
+ uint64_t time_p1 = 0;
+ uint64_t time_p2 = 0;
+
+ TIMESTAMP(time_p1);
+ if (cpu_has_clflush){
+#ifdef CONFIG_X86
+ wbinvd_on_all_cpus();
+#else
+ printk(KERN_WARNING "pmbd: WARNING - %s(%d) flush_cache_all() not implemented\n", __FUNCTION__, __LINE__);
+#endif
+ }
+ TIMESTAMP(time_p2);
+
+ /* emulating slowdown */
+ if(PMBD_DEV_USE_SLOWDOWN(pmbd))
+ pmbd_rdwr_slowdown((pmbd), WRITE, time_p1, time_p2);
+
+ /* update time statistics */
+ if(PMBD_USE_TIMESTAT()){
+ int cid = CUR_CPU_ID();
+ PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+ pmbd_stat->cycles_clflushall[WRITE][cid] += time_p2 - time_p1;
+ }
+ return;
+}
+
+static inline void pmbd_clflush_range(PMBD_DEVICE_T* pmbd, void* dst, size_t bytes)
+{
+ uint64_t time_p1 = 0;
+ uint64_t time_p2 = 0;
+
+ TIMESTAMP(time_p1);
+ if (cpu_has_clflush){
+ clflush_cache_range(dst, bytes);
+ }
+ TIMESTAMP(time_p2);
+
+ /* emulating slowdown */
+ if(PMBD_DEV_USE_SLOWDOWN(pmbd))
+ pmbd_rdwr_slowdown((pmbd), WRITE, time_p1, time_p2);
+
+ /* update time statistics */
+ if(PMBD_USE_TIMESTAT()){
+ int cid = CUR_CPU_ID();
+ PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+ pmbd_stat->cycles_clflush[WRITE][cid] += time_p2 - time_p1;
+ }
+ return;
+}
+
+
+/*
+ * Write-protection
+ *
+ * Being used as storage, PMBD must provide some protection against
+ * accidental changes caused by wild pointers. So we initialize all the PM
+ * pages as read-only; before performing write operations into the PM
+ * space, we set the affected pages writable, and when done, we set them
+ * back to read-only. This introduces extra overhead, but it is a
+ * practical way to tackle the wild-pointer problem.
+ *
+ */
+
+/*
+ * set PM pages to read-only
+ * @addr - the starting virtual address (PM space)
+ * @bytes - the range in bytes
+ * @on_access - TRUE if the change is triggered by a request, FALSE if it is
+ * done during device creation/destruction
+ */
+
+static inline void pmbd_set_pages_ro(PMBD_DEVICE_T* pmbd, void* addr, uint64_t bytes, unsigned on_access)
+{
+ if (PMBD_USE_WRITE_PROTECTION()) {
+ /* FIXME: type conversion happens here */
+ /* FIXME: add range and bytes check here?? - not so necessary */
+ uint64_t time_p1 = 0;
+ uint64_t time_p2 = 0;
+ unsigned long offset = (unsigned long) addr;
+ unsigned long vaddr = PAGE_TO_VADDR(VADDR_TO_PAGE(offset));
+ int num_pages = VADDR_TO_PAGE(offset + bytes - 1) - VADDR_TO_PAGE(offset) + 1;
+
+ if(!(VADDR_IN_PMBD_SPACE(pmbd, addr) && VADDR_IN_PMBD_SPACE(pmbd, addr + bytes-1)))
+ printk(KERN_WARNING "pmbd: WARNING - %s(%d): PM space range exceeded (%lu : %d pages)\n",
+ __FUNCTION__, __LINE__, vaddr, num_pages);
+
+ TIMESTAMP(time_p1);
+ set_memory_ro(vaddr, num_pages);
+ TIMESTAMP(time_p2);
+
+ /* update time statistics */
+// if(PMBD_USE_TIMESTAT() && on_access){
+ if(PMBD_USE_TIMESTAT()){
+ int cid = CUR_CPU_ID();
+ PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+ pmbd_stat->cycles_setpages_ro[WRITE][cid] += time_p2 - time_p1;
+ }
+ }
+ return;
+}
+
+static inline void pmbd_set_pages_rw(PMBD_DEVICE_T* pmbd, void* addr, uint64_t bytes, unsigned on_access)
+{
+ if (PMBD_USE_WRITE_PROTECTION()) {
+ uint64_t time_p1 = 0;
+ uint64_t time_p2 = 0;
+ unsigned long offset = (unsigned long) addr;
+ unsigned long vaddr = PAGE_TO_VADDR(VADDR_TO_PAGE(offset));
+ int num_pages = VADDR_TO_PAGE(offset + bytes - 1) - VADDR_TO_PAGE(offset) + 1;
+
+ if(!(VADDR_IN_PMBD_SPACE(pmbd, addr) && VADDR_IN_PMBD_SPACE(pmbd, addr + bytes-1)))
+ printk(KERN_WARNING "pmbd: WARNING - %s(%d): PM space range exceeded (%lu : %d pages)\n", __FUNCTION__, __LINE__, vaddr, num_pages);
+
+ TIMESTAMP(time_p1);
+ set_memory_rw(vaddr, num_pages);
+ TIMESTAMP(time_p2);
+
+ /* update time statistics */
+// if(PMBD_USE_TIMESTAT() && on_access){
+ if(PMBD_USE_TIMESTAT()){
+ int cid = CUR_CPU_ID();
+ PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+ pmbd_stat->cycles_setpages_rw[WRITE][cid] += time_p2 - time_p1;
+ }
+ }
+ return;
+}
+
+
+/*
+ * Write verification (EXPERIMENTAL)
+ *
+ * Note: Even though we write-protect the PM space by setting it read-only,
+ * there is still a short vulnerable window while writing pages into the PM
+ * space - between the time the pages are set RW and the time they are set
+ * back to RO. So we verify that no data was corrupted during this window by
+ * reading the written data back and comparing it with the source data.
+ *
+ */
+
+
+static inline int pmbd_verify_wr_pages_pmap(PMBD_DEVICE_T* pmbd, void* pmbd_dummy_va, void* ram_va, size_t bytes)
+{
+
+ unsigned long flags = 0;
+
+ /* NOTE: we assume src starts from 0 */
+ uint64_t pa = (uint64_t) PMBD_PMAP_VA_TO_PA(pmbd_dummy_va);
+
+ /* disable interrupt (FIXME: do we need to do this?)*/
+ DISABLE_SAVE_IRQ(flags);
+
+ /* do the real work */
+ while(bytes){
+ uint64_t pfn = (pa >> PAGE_SHIFT); // page frame number
+ unsigned off = pa & (~PAGE_MASK); // offset in one page
+ unsigned size = MIN_OF((PAGE_SIZE - off), bytes); // the size to copy
+
+ /* map it */
+ void * map = pmap_atomic_pfn(pfn, pmbd, WRITE);
+ void * pmbd_va = map + off;
+
+ /* compare the written data against the source */
+ if (memcmp(pmbd_va, ram_va, size)){
+ punmap_atomic(map, pmbd, WRITE);
+ goto bad;
+ }
+
+ /* unmap it */
+ punmap_atomic(map, pmbd, WRITE);
+
+ /* prepare the next iteration */
+ ram_va += size;
+ bytes -= size;
+ pa += size;
+ }
+
+ /* re-enable interrupt */
+ ENABLE_RESTORE_IRQ(flags);
+ return 0;
+
+bad:
+ ENABLE_RESTORE_IRQ(flags);
+ return -1;
+}
+
+
+static inline int pmbd_verify_wr_pages_nopmap(PMBD_DEVICE_T* pmbd, void* pmbd_va, void* ram_va, size_t bytes)
+{
+ if (memcmp(pmbd_va, ram_va, bytes))
+ return -1;
+ else
+ return 0;
+}
+
+static inline int pmbd_verify_wr_pages(PMBD_DEVICE_T* pmbd, void* pmbd_va, void* ram_va, size_t bytes)
+{
+ int rtn = 0;
+ uint64_t time_p1, time_p2;
+
+ TIMESTAT_POINT(time_p1);
+
+ /* check it */
+ if (PMBD_USE_PMAP())
+ rtn = pmbd_verify_wr_pages_pmap(pmbd, pmbd_va, ram_va, bytes);
+ else
+ rtn = pmbd_verify_wr_pages_nopmap(pmbd, pmbd_va, ram_va, bytes);
+
+ /* found mismatch */
+ if (rtn < 0){
+ panic("pmbd: *** writing into PM failed (error found) ***\n");
+ return -1;
+ }
+
+ TIMESTAT_POINT(time_p2);
+
+ /* timestamp */
+ if(PMBD_USE_TIMESTAT()){
+ int cid = CUR_CPU_ID();
+ PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+ pmbd_stat->cycles_wrverify[WRITE][cid] += time_p2 - time_p1;
+ }
+
+ return 0;
+}
+
+/*
+ * Checksum (EXPERIMENTAL)
+ *
+ * Note: With write protection and write verification, we can largely
+ * reduce the risk of PM data corruption caused by wild in-kernel
+ * pointers; however, it is still possible for some data to get corrupted
+ * (e.g. PM pages maliciously changed to writable). Thus, we provide
+ * another layer of protection by checksumming the PM pages. When writing
+ * a page, we compute a checksum and store it in memory; when reading a
+ * page, we recompute its checksum and compare it with the stored value.
+ * A mismatch indicates that either the PM data or the checksum has been
+ * corrupted.
+ *
+ * FIXME:
+ * (1) the checksum should be stored in PM space; currently we just store it in RAM.
+ * (2) we could probably use the CPU cache to speed this up and avoid reading
+ * the same chunk of data again.
+ * (3) currently we always allocate the checksum space, whether it is enabled
+ * or disabled in the module config options; this may need to be made more
+ * efficient in the future.
+ *
+ */
+
+
+static int pmbd_checksum_space_alloc(PMBD_DEVICE_T* pmbd)
+{
+ int err = 0;
+
+ /* allocate checksum space */
+ pmbd->checksum_space = vmalloc(PMBD_CHECKSUM_TOTAL_NUM(pmbd) * sizeof(PMBD_CHECKSUM_T));
+ if (pmbd->checksum_space){
+ memset(pmbd->checksum_space, 0, (PMBD_CHECKSUM_TOTAL_NUM(pmbd) * sizeof(PMBD_CHECKSUM_T)));
+ printk(KERN_INFO "pmbd(%d): checksum space is allocated\n", pmbd->pmbd_id);
+ } else {
+ err = -ENOMEM;
+ }
+
+ /* allocate checksum buffer space */
+ pmbd->checksum_iomem_buf = vmalloc(pmbd->checksum_unit_size);
+ if (pmbd->checksum_iomem_buf){
+ memset(pmbd->checksum_iomem_buf, 0, pmbd->checksum_unit_size);
+ printk(KERN_INFO "pmbd(%d): checksum iomem buffer space is allocated\n", pmbd->pmbd_id);
+ } else {
+ err = -ENOMEM;
+ }
+
+ return err;
+}
+
+static int pmbd_checksum_space_free(PMBD_DEVICE_T* pmbd)
+{
+ if (pmbd->checksum_space) {
+ vfree(pmbd->checksum_space);
+ pmbd->checksum_space = NULL;
+ printk(KERN_INFO "pmbd(%d): checksum space is freed\n", pmbd->pmbd_id);
+ }
+ if (pmbd->checksum_iomem_buf) {
+ vfree(pmbd->checksum_iomem_buf);
+ pmbd->checksum_iomem_buf = NULL;
+ printk(KERN_INFO "pmbd(%d): checksum iomem buffer space is freed\n", pmbd->pmbd_id);
+ }
+ return 0;
+}
+
+
+/*
+ * Derived from linux/lib/crc32.c GPL v2
+ */
+static unsigned int crc32_my(unsigned char const *p, unsigned int len)
+{
+ int i;
+ unsigned int crc = 0;
+ while (len--) {
+ crc ^= *p++;
+ for (i = 0; i < 8; i++)
+ crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0);
+ }
+ return crc;
+}
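This is the reflected CRC-32 polynomial (0xEDB88320) with a zero initial value and no final inversion, so its output intentionally differs from the zlib-style crc32(). Two properties worth sanity-checking (the function below is a verbatim userspace copy of crc32_my() above):

```c
#include <assert.h>

/* verbatim userspace copy of the driver's crc32_my() */
static unsigned int crc32_my(unsigned char const *p, unsigned int len)
{
    int i;
    unsigned int crc = 0;

    while (len--) {
        crc ^= *p++;
        for (i = 0; i < 8; i++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0);
    }
    return crc;
}
```

With a zero seed the empty message hashes to 0, and because CRC detects any single-bit error, two equal-length buffers differing in one bit are guaranteed to produce different checksums.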
+
+static inline PMBD_CHECKSUM_T pmbd_checksum_func(void* data, size_t size)
+{
+ return crc32_my(data, size);
+}
+
+/*
+ * calculate the checksum for one checksum unit
+ * @pmbd: the pmbd device
+ * @data: the virtual address of the target data (must be aligned to the
+ * checksum unit boundaries)
+ */
+
+
+static inline PMBD_CHECKSUM_T pmbd_cal_checksum(PMBD_DEVICE_T* pmbd, void* data)
+{
+ void* vaddr = data;
+ size_t size = pmbd->checksum_unit_size;
+ PMBD_CHECKSUM_T chk = 0;
+
+#if 0
+#ifndef CONFIG_X86
+ /*
+ * Note: If we are directly using vmalloc(), we don't have to copy the data
+ * to the checksum buffer; however, if we are using high memory, we should
+ * not directly dereference the ioremapped data (on non-x86 platforms), so
+ * we have to copy it to a temporary buffer first. This extra copy
+ * significantly slows down operations; we do this here just to avoid the
+ * extra copy on the x86 platform. (see Documentation/IO-mapping.txt)
+ *
+ */
+ if (PMBD_DEV_USE_HIGHMEM(pmbd) && VADDR_IN_PMBD_SPACE(pmbd, data)) {
+ memcpy_fromio(pmbd->checksum_iomem_buf, data, pmbd->checksum_unit_size);
+ vaddr = pmbd->checksum_iomem_buf;
+ }
+#endif
+#endif
+
+ if (pmbd->checksum_unit_size != PAGE_SIZE){
+ panic("ERR: %s(%d) checksum unit size (%u) must be %lu\n", __FUNCTION__, __LINE__, pmbd->checksum_unit_size, PAGE_SIZE);
+ return 0;
+ }
+
+ /* FIXME: do we really need to copy the data out first (if not pmap)? */
+ memcpy_from_pmbd(pmbd, pmbd->checksum_iomem_buf, data, pmbd->checksum_unit_size);
+
+ /* calculate the checksum */
+ vaddr = pmbd->checksum_iomem_buf;
+ chk = pmbd_checksum_func(vaddr, size);
+
+ return chk;
+}
+
+static int pmbd_checksum_on_write(PMBD_DEVICE_T* pmbd, void* vaddr, size_t bytes)
+{
+ unsigned long i;
+ unsigned long ck_id_s = VADDR_TO_CHECKSUM_IDX(pmbd, vaddr);
+ unsigned long ck_id_e = VADDR_TO_CHECKSUM_IDX(pmbd, (vaddr + bytes - 1));
+
+ uint64_t time_p1, time_p2;
+
+ TIMESTAT_POINT(time_p1);
+
+ for (i = ck_id_s; i <= ck_id_e; i ++){
+ void* data = CHECKSUM_IDX_TO_VADDR(pmbd, i);
+ void* chk = CHECKSUM_IDX_TO_CKADDR(pmbd, i);
+
+ PMBD_CHECKSUM_T checksum = pmbd_cal_checksum(pmbd, data);
+ memcpy(chk, &checksum, sizeof(PMBD_CHECKSUM_T));
+ }
+
+ TIMESTAT_POINT(time_p2);
+
+ /* timestamp */
+ if(PMBD_USE_TIMESTAT()){
+ int cid = CUR_CPU_ID();
+ PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+ pmbd_stat->cycles_checksum[WRITE][cid] += time_p2 - time_p1;
+ }
+ return 0;
+}
+
+static int pmbd_checksum_on_read(PMBD_DEVICE_T* pmbd, void* vaddr, size_t bytes)
+{
+ unsigned long i;
+ unsigned long ck_id_s = VADDR_TO_CHECKSUM_IDX(pmbd, vaddr);
+ unsigned long ck_id_e = VADDR_TO_CHECKSUM_IDX(pmbd, (vaddr + bytes - 1));
+
+ uint64_t time_p1, time_p2;
+ TIMESTAT_POINT(time_p1);
+
+ for (i = ck_id_s; i <= ck_id_e; i ++){
+ void* data = CHECKSUM_IDX_TO_VADDR(pmbd, i);
+ void* chk = CHECKSUM_IDX_TO_CKADDR(pmbd, i);
+
+ PMBD_CHECKSUM_T checksum = pmbd_cal_checksum(pmbd, data);
+ if (memcmp(chk, &checksum, sizeof(PMBD_CHECKSUM_T))){
+ printk(KERN_WARNING "pmbd(%d): checksum mismatch found!\n", pmbd->pmbd_id);
+ }
+ }
+
+ TIMESTAT_POINT(time_p2);
+
+ /* timestamp */
+ if(PMBD_USE_TIMESTAT()){
+ int cid = CUR_CPU_ID();
+ PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+ pmbd_stat->cycles_checksum[READ][cid] += time_p2 - time_p1;
+ }
+
+ return 0;
+}
+
+#if 0
+/* WARN: Calculating the checksum for a big PM space is slow and could lock up the system */
+static int pmbd_checksum_space_init(PMBD_DEVICE_T* pmbd)
+{
+ unsigned long i;
+ PMBD_CHECKSUM_T checksum = pmbd_cal_checksum(pmbd, pmbd->mem_space);
+ unsigned long ck_s = VADDR_TO_CHECKSUM_IDX(pmbd, PMBD_MEM_SPACE_FIRST_BYTE(pmbd));
+ unsigned long ck_e = VADDR_TO_CHECKSUM_IDX(pmbd, PMBD_MEM_SPACE_LAT_BYTE(pmbd));
+
+ for (i = ck_s; i <= ck_e; i ++){
+ void* dst = CHECKSUM_IDX_TO_CKADDR(pmbd, i);
+ memcpy(dst, &checksum, sizeof(PMBD_CHECKSUM_T));
+ }
+ return 0;
+}
+#endif
+
+/*
+ * locks
+ *
+ * Note: We must prevent multiple threads from concurrently accessing the same
+ * chunk of data. For example, if two writes hit the same page, the PM page
+ * could end up corrupted with a merged mix of the two. So we allocate one
+ * spinlock for each 4KB PM page. When reading/writing PM data, we lock the
+ * related pages first and unlock them when done.
+ *
+ */
+
+static int pmbd_lock_on_access(PMBD_DEVICE_T* pmbd, sector_t sector, size_t bytes)
+{
+ if (PMBD_USE_LOCK()) {
+ PBN_T pbn = 0;
+ PBN_T pbn_s = SECTOR_TO_PBN(pmbd, sector);
+ PBN_T pbn_e = BYTE_TO_PBN(pmbd, (SECTOR_TO_BYTE(sector) + bytes - 1));
+
+ for (pbn = pbn_s; pbn <= pbn_e; pbn ++) {
+ PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
+ spin_lock(&pbi->lock);
+ }
+ }
+ return 0;
+}
+
+static int pmbd_unlock_on_access(PMBD_DEVICE_T* pmbd, sector_t sector, size_t bytes)
+{
+ if (PMBD_USE_LOCK()){
+ PBN_T pbn = 0;
+ PBN_T pbn_s = SECTOR_TO_PBN(pmbd, sector);
+ PBN_T pbn_e = BYTE_TO_PBN(pmbd, (SECTOR_TO_BYTE(sector) + bytes - 1));
+
+ for (pbn = pbn_s; pbn <= pbn_e; pbn ++) {
+ PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
+ spin_unlock(&pbi->lock);
+ }
+ }
+ return 0;
+}
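The sector-range to lock-range mapping used by the two helpers above can be sketched as follows; 512-byte sectors and a 4KB lock granularity are assumed here, mirroring what SECTOR_TO_PBN()/BYTE_TO_PBN() compute in the driver:

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512ULL
#define BLOCK_SIZE  4096ULL  /* one spinlock per 4KB PM page */

/* first and last lock-protected blocks touched by [sector, sector + bytes) */
static void range_to_blocks(uint64_t sector, uint64_t bytes,
                            uint64_t *pbn_s, uint64_t *pbn_e)
{
    uint64_t byte_s = sector * SECTOR_SIZE;

    *pbn_s = byte_s / BLOCK_SIZE;
    *pbn_e = (byte_s + bytes - 1) / BLOCK_SIZE;
}
```

Both helpers then walk this range in ascending block order; taking the per-block locks in a fixed global order is what keeps two overlapping requests from deadlocking against each other.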
+
+/*
+ **************************************************************************
+ * Unbuffered Read/write functions
+ **************************************************************************
+ */
+static void copy_to_pmbd_unbuffered(PMBD_DEVICE_T* pmbd, void *src, sector_t sector, size_t bytes, unsigned do_fua)
+{
+ void *dst;
+
+ dst = pmbd->mem_space + sector * pmbd->sector_size;
+
+ /* lock the pages */
+ pmbd_lock_on_access(pmbd, sector, bytes);
+
+ /* set the pages writable */
+ /* if we use CR0/WP to temporarily switch the writable permission,
+ * we don't have to change the PTE attributes directly */
+ if (PMBD_DEV_USE_WPMODE_PTE(pmbd))
+ pmbd_set_pages_rw(pmbd, dst, bytes, TRUE);
+
+ /* do memcpy */
+ memcpy_to_pmbd(pmbd, dst, src, bytes, do_fua);
+
+ /* finish up */
+ /* set the pages read-only */
+ if (PMBD_DEV_USE_WPMODE_PTE(pmbd))
+ pmbd_set_pages_ro(pmbd, dst, bytes, TRUE);
+
+ /* verify that the write operation succeeded */
+ if(PMBD_USE_WRITE_VERIFICATION())
+ pmbd_verify_wr_pages(pmbd, dst, src, bytes);
+
+ /* generate check sum */
+ if (PMBD_USE_CHECKSUM())
+ pmbd_checksum_on_write(pmbd, dst, bytes);
+
+ /* unlock the pages */
+ pmbd_unlock_on_access(pmbd, sector, bytes);
+
+ return;
+}
+
+
+static void copy_from_pmbd_unbuffered(PMBD_DEVICE_T* pmbd, void *dst, sector_t sector, size_t bytes)
+{
+ void *src = pmbd->mem_space + sector * pmbd->sector_size;
+
+ /* lock the pages */
+ pmbd_lock_on_access(pmbd, sector, bytes);
+
+ /* check checksum first */
+ if (PMBD_USE_CHECKSUM())
+ pmbd_checksum_on_read(pmbd, src, bytes);
+
+ /* read it out */
+ memcpy_from_pmbd(pmbd, dst, src, bytes);
+
+ /* unlock the pages */
+ pmbd_unlock_on_access(pmbd, sector, bytes);
+
+ return;
+}
+
+
+/*
+ * *************************************************************************
+ * Read/write functions
+ * *************************************************************************
+ */
+
+static void copy_to_pmbd(PMBD_DEVICE_T* pmbd, void *dst, sector_t sector, size_t bytes, unsigned do_fua)
+{
+ if (PMBD_DEV_USE_BUFFER(pmbd)){
+ copy_to_pmbd_buffered(pmbd, dst, sector, bytes);
+ if (do_fua){
+ /* NOTE:
+ * On a FUA write, if the buffer is enabled, we still write
+ * into the buffer first, but then we also write directly into
+ * the PM space without going through the buffer again. This
+ * is suboptimal (we write the data twice), but it is simpler
+ * than changing the buffering code.
+ */
+ copy_to_pmbd_unbuffered(pmbd, dst, sector, bytes, do_fua);
+ }
+ }else
+ copy_to_pmbd_unbuffered(pmbd, dst, sector, bytes, do_fua);
+ return;
+}
+
+static void copy_from_pmbd(PMBD_DEVICE_T* pmbd, void *dst, sector_t sector, size_t bytes)
+{
+ if (PMBD_DEV_USE_BUFFER(pmbd))
+ copy_from_pmbd_buffered(pmbd, dst, sector, bytes);
+ else
+ copy_from_pmbd_unbuffered(pmbd, dst, sector, bytes);
+ return;
+}
+
+static int pmbd_seg_read_write(PMBD_DEVICE_T* pmbd, struct page *page, unsigned int len,
+ unsigned int off, int rw, sector_t sector, unsigned do_fua)
+{
+ void *mem;
+ int err = 0;
+
+ mem = kmap_atomic(page);
+
+ if (rw == READ) {
+ copy_from_pmbd(pmbd, mem + off, sector, len);
+ flush_dcache_page(page);
+ } else {
+ flush_dcache_page(page);
+ copy_to_pmbd(pmbd, mem + off, sector, len, do_fua);
+ }
+
+ kunmap_atomic(mem);
+
+ return err;
+}
+
+static int pmbd_do_bvec(PMBD_DEVICE_T* pmbd, struct page *page,
+ unsigned int len, unsigned int off, int rw, sector_t sector, unsigned do_fua)
+{
+ return pmbd_seg_read_write(pmbd, page, len, off, rw, sector, do_fua);
+}
+
+/*
+ * Handling write barrier
+ * @pmbd: the pmbd device
+ *
+ * When the application calls fsync(), a bio labeled with WRITE_BARRIER is
+ * received by pmbd_make_request(). We stop accepting new incoming writes (by
+ * locking pmbd->wr_barrier_lock) and wait for the in-flight writes to complete
+ * (by checking pmbd->num_flying_wr). Then, if a buffer is used, we flush the
+ * entire DRAM buffer with clflush enabled; if no buffer is used, we flush the
+ * CPU cache so that all data is safely written into PM.
+ *
+ */
+
+
+static void __x86_mfence_all(void *arg)
+{
+ unsigned long cache = (unsigned long)arg;
+ if (cache && boot_cpu_data.x86 >= 4)
+ mfence();
+}
+
+static void x86_mfence_all(unsigned long cache)
+{
+ BUG_ON(irqs_disabled());
+ on_each_cpu(__x86_mfence_all, (void*) cache, 1);
+}
+
+static inline void pmbd_mfence_all(PMBD_DEVICE_T* pmbd)
+{
+ x86_mfence_all(1);
+}
+
+
+static void __x86_sfence_all(void *arg)
+{
+ unsigned long cache = (unsigned long)arg;
+ if (cache && boot_cpu_data.x86 >= 4)
+ sfence();
+}
+
+static void x86_sfence_all(unsigned long cache)
+{
+ BUG_ON(irqs_disabled());
+ on_each_cpu(__x86_sfence_all, (void*) cache, 1);
+
+}
+
+static inline void pmbd_sfence_all(PMBD_DEVICE_T* pmbd)
+{
+ x86_sfence_all(1);
+}
+
+static int pmbd_write_barrier(PMBD_DEVICE_T* pmbd)
+{
+ unsigned i;
+
+ /* blocking incoming writes */
+ spin_lock(&pmbd->wr_barrier_lock);
+
+ /* wait for all in-flight writes to finish first */
+ while (atomic_read(&pmbd->num_flying_wr) != 0)
+ cpu_relax();
+
+ if (PMBD_DEV_USE_BUFFER(pmbd)){
+ /* if buffer is used, flush the entire buffer */
+ for (i = 0; i < pmbd->num_buffers; i ++){
+ PMBD_BUFFER_T* buffer = pmbd->buffers[i];
+ pmbd_buffer_check_and_flush(buffer, buffer->num_blocks, CALLER_DESTROYER);
+ }
+ }
+
+ /*
+ * Consider the following cases:
+ * UC (uncacheable): strong ordering, so we do nothing
+ * UC-Minus: strong ordering (unless overridden to WC), so we do nothing
+ * WC (write-combining): an sfence is used after each write, so we do nothing
+ * WB (write-back): non-temporal store: an sfence is already used, do nothing
+ *                  clflush/mfence: mfence is used in clflush_cache_range(), do nothing
+ *                  neither: wbinvd is needed to drop the entire cache
+ */
+ if (PMBD_CPU_CACHE_USE_WB()){
+ if (PMBD_USE_NTS()){
+ /* sfence is used after each movntq, so it is safe, we
+ * do nothing, just stop accepting any incoming requests */
+ } else if (PMBD_USE_CLFLUSH()) {
+ /* if use clflush/mfence to sync I/O, we do nothing*/
+// pmbd_mfence_all(pmbd);
+ } else {
+ /* if no sync operations, we have to drop the entire cache */
+ pmbd_clflush_all(pmbd);
+ }
+ } else if (PMBD_CPU_CACHE_USE_WC() || PMBD_CPU_CACHE_USE_UM()) {
+ /* if using WC, an sfence should have been used already, so do nothing */
+
+ } else if (PMBD_CPU_CACHE_USE_UC()) {
+ /* strong ordering is used, no need to do anything else*/
+ } else {
+ panic("%s(%d): unknown CPU cache mode\n", __FUNCTION__, __LINE__);
+ }
+
+ /* unblock incoming writes */
+ spin_unlock(&pmbd->wr_barrier_lock);
+ return 0;
+}
+
+
+// #define BIO_WR_BARRIER(BIO) (((BIO)->bi_rw & REQ_FLUSH) == REQ_FLUSH)
+// #define BIO_WR_BARRIER(BIO) ((BIO)->bi_rw & (REQ_FLUSH | REQ_FLUSH_SEQ))
+ #define BIO_WR_BARRIER(BIO) (((BIO)->bi_rw & WRITE_FLUSH) == WRITE_FLUSH)
+ #define BIO_WR_FUA(BIO) (((BIO)->bi_rw & WRITE_FUA) == WRITE_FUA)
+ #define BIO_WR_SYNC(BIO) (((BIO)->bi_rw & WRITE_SYNC) == WRITE_SYNC)
+
+static void pmbd_make_request(struct request_queue *q, struct bio *bio)
+{
+ int i = 0;
+ int err = -EIO;
+ uint64_t start = 0;
+ uint64_t end = 0;
+ struct bio_vec *bvec;
+ int rw = bio_rw(bio);
+ sector_t sector = bio->bi_sector;
+ int num_sectors = bio_sectors(bio);
+ struct block_device *bdev = bio->bi_bdev;
+ PMBD_DEVICE_T *pmbd = bdev->bd_disk->private_data;
+ PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+ unsigned bio_is_write_fua = FALSE;
+ unsigned bio_is_write_barrier = FALSE;
+ unsigned do_fua = FALSE;
+ uint64_t time_p1, time_p2, time_p3, time_p4, time_p5, time_p6;
+ time_p1 = time_p2 = time_p3 = time_p4 = time_p5 = time_p6 = 0;
+
+
+ TIMESTAT_POINT(time_p1);
+// printk("ACCESS: %u %d %X %d\n", sector, num_sectors, bio->bi_rw, rw);
+
+ /* update rw */
+ if (rw == READA)
+ rw = READ;
+ if (rw != READ && rw != WRITE)
+ panic("pmbd: %s(%d) request is neither read nor write\n", __FUNCTION__, __LINE__);
+
+ /* handle write barrier (we no longer do this for BIO_WR_SYNC(bio)) */
+ if (BIO_WR_BARRIER(bio)){
+ /*
+ * Note: Linux kernel 2.6.37 and later rely on explicit flushes and
+ * FUA to ensure data reliability, rather than write barriers.
+ * See http://monolight.cc/2011/06/barriers-caches-filesystems
+ */
+ bio_is_write_barrier = TRUE;
+// printk(KERN_INFO "pmbd: received barrier request %u %d %lx %d\n", (unsigned int) sector, num_sectors, bio->bi_rw, rw);
+
+ if (PMBD_USE_WB())
+ pmbd_write_barrier(pmbd);
+ }
+
+ if (BIO_WR_FUA(bio)){
+ bio_is_write_fua = TRUE;
+// printk(KERN_INFO "pmbd: received FUA request %u %d %lx %d\n", (unsigned int) sector, num_sectors, bio->bi_rw, rw);
+
+ if (PMBD_USE_FUA())
+ do_fua = TRUE;
+ }
+
+ TIMESTAT_POINT(time_p2);
+
+ /* blocking write until write barrier is done */
+ if (rw == WRITE){
+ spin_lock(&pmbd->wr_barrier_lock);
+ spin_unlock(&pmbd->wr_barrier_lock);
+ }
+
+ /* increment on-the-fly writes counter */
+ atomic_inc(&pmbd->num_flying_wr);
+
+ /* starting emulation */
+ if (PMBD_DEV_SIM_DEV(pmbd))
+ start = emul_start(pmbd, num_sectors, rw);
+
+ /* check if out of range */
+ if (sector + (bio->bi_size >> SECTOR_SHIFT) > get_capacity(bdev->bd_disk)){
+ printk(KERN_WARNING "pmbd: request exceeds the PMBD capacity\n");
+ TIMESTAT_POINT(time_p3);
+ goto out;
+ }
+
+// printk("DEBUG: ACCESS %lu %d %d\n", sector, num_sectors, rw);
+
+ /*
+ * NOTE: some applications (e.g. fdisk) call fsync() to request
+ * flushing dirty data from the buffer cache. By default, fsync() is
+ * linked to blkdev_fsync() in the def_blk_fops structure, and
+ * blkdev_fsync() calls blkdev_issue_flush(), which generates an
+ * empty bio carrying a write barrier down to the block device through
+ * generic_make_request(), which in turn calls pmbd_make_request(). If
+ * we don't set err=0 here, the error would propagate back up to the
+ * application. For example, fdisk would fail and report an error when
+ * trying to write the partition table before it exits. Thus we must
+ * reset the error code here if the bio is empty. Also note that we
+ * directly check the bio size, rather than using BIO_WR_BARRIER(),
+ * to cover other cases as well.
+ *
+ */
+ if (num_sectors == 0) {
+ err = 0;
+ TIMESTAT_POINT(time_p3);
+ goto out;
+ }
+
+ /* update the access time*/
+ PMBD_DEV_UPDATE_ACCESS_TIME(pmbd);
+
+ TIMESTAT_POINT(time_p3);
+
+ /*
+ * Do read/write now. We first perform the operation, then check how
+ * long it actually takes to finish the operation, then we calculate an
+ * emulated time for a given slow-down model, if the actual access time
+ * is less than the emulated time, we just make up the difference to
+ * emulate a slower device.
+ */
+ bio_for_each_segment(bvec, bio, i) {
+ unsigned int len = bvec->bv_len;
+ err = pmbd_do_bvec(pmbd, bvec->bv_page, len,
+ bvec->bv_offset, rw, sector, do_fua);
+ if (err)
+ break;
+ sector += len >> SECTOR_SHIFT;
+ }
+
+out:
+ TIMESTAT_POINT(time_p4);
+
+ bio_endio(bio, err);
+
+ TIMESTAT_POINT(time_p5);
+
+ /* ending emulation (simmode0)*/
+ if (PMBD_DEV_SIM_DEV(pmbd))
+ end = emul_end(pmbd, num_sectors, rw, start);
+
+ /* decrement on-the-fly writes counter */
+ atomic_dec(&pmbd->num_flying_wr);
+
+ TIMESTAT_POINT(time_p6);
+
+ /* update statistics data */
+ spin_lock(&pmbd_stat->stat_lock);
+ if (rw == READ) {
+ pmbd_stat->num_requests_read ++;
+ pmbd_stat->num_sectors_read += num_sectors;
+ } else {
+ pmbd_stat->num_requests_write ++;
+ pmbd_stat->num_sectors_write += num_sectors;
+ }
+ if (bio_is_write_barrier)
+ pmbd_stat->num_write_barrier ++;
+ if (bio_is_write_fua)
+ pmbd_stat->num_write_fua ++;
+ spin_unlock(&pmbd_stat->stat_lock);
+
+ /* cycles */
+ if (PMBD_USE_TIMESTAT()){
+ int cid = CUR_CPU_ID();
+ pmbd_stat->cycles_total[rw][cid] += time_p6 - time_p1;
+ pmbd_stat->cycles_wb[rw][cid] += time_p2 - time_p1; /* write barrier */
+ pmbd_stat->cycles_prepare[rw][cid] += time_p3 - time_p2;
+ pmbd_stat->cycles_work[rw][cid] += time_p4 - time_p3;
+ pmbd_stat->cycles_endio[rw][cid] += time_p5 - time_p4;
+ pmbd_stat->cycles_finish[rw][cid] += time_p6 - time_p5;
+ }
+}
+
+
+/*
+ **************************************************************************
+ * Allocating memory space for PMBD device
+ **************************************************************************
+ */
+
+/*
+ * Set the page attributes for the PMBD backstore memory space
+ * - WB: cache enabled, write back (default)
+ * - WC: cache disabled, write through, speculative writes combined
+ * - UC: cache disabled, write through, no write combined
+ * - UC-Minus: the same as UC
+ *
+ * REF:
+ * - http://www.kernel.org/doc/ols/2008/ols2008v2-pages-135-144.pdf
+ * - http://www.mjmwired.net/kernel/Documentation/x86/pat.txt
+ */
+
+static int pmbd_set_pages_cache_flags(PMBD_DEVICE_T* pmbd)
+{
+ if (pmbd->mem_space && pmbd->num_sectors) {
+ /* NOTE: the cast is safe on 64-bit systems, where a pointer fits in unsigned long */
+ unsigned long vaddr = (unsigned long) pmbd->mem_space;
+ int num_pages = PMBD_MEM_TOTAL_PAGES(pmbd);
+
+ printk(KERN_INFO "pmbd: setting %s PTE flags (%lx:%d)\n", pmbd->pmbd_name, vaddr, num_pages);
+ set_pages_cache_flags(vaddr, num_pages);
+ printk(KERN_INFO "pmbd: setting %s PTE flags done.\n", pmbd->pmbd_name);
+ }
+ return 0;
+}
+
+static int pmbd_reset_pages_cache_flags(PMBD_DEVICE_T* pmbd)
+{
+ if (pmbd->mem_space){
+ unsigned long vaddr = (unsigned long) pmbd->mem_space;
+ int num_pages = PMBD_MEM_TOTAL_PAGES(pmbd);
+ set_memory_wb(vaddr, num_pages);
+ printk(KERN_INFO "pmbd: %s pages cache flags are reset to WB\n", pmbd->pmbd_name);
+ }
+ return 0;
+}
+
+
+/*
+ * Allocate/free memory backstore space for PMBD devices
+ */
+static int pmbd_mem_space_alloc (PMBD_DEVICE_T* pmbd)
+{
+ int err = 0;
+
+ /* allocate PM memory space */
+ if (PMBD_DEV_USE_VMALLOC(pmbd)){
+ pmbd->mem_space = vmalloc (PMBD_MEM_TOTAL_BYTES(pmbd));
+ } else if (PMBD_DEV_USE_HIGHMEM(pmbd)){
+ pmbd->mem_space = hmalloc (PMBD_MEM_TOTAL_BYTES(pmbd));
+ }
+
+ if (pmbd->mem_space) {
+#if 0
+ /* FIXME: No need to do this. It's slow, system could be locked up */
+ memset(pmbd->mem_space, 0, pmbd->sectors * pmbd->sector_size);
+#endif
+ printk(KERN_INFO "pmbd: /dev/%s is created [%lu : %llu MBs]\n",
+ pmbd->pmbd_name, (unsigned long) pmbd->mem_space, SECTORS_TO_MB(pmbd->num_sectors));
+ } else {
+ printk(KERN_ERR "pmbd: %s(%d): PMBD space allocation failed\n", __FUNCTION__, __LINE__);
+ err = -ENOMEM;
+ }
+ return err;
+}
+
+static int pmbd_mem_space_free(PMBD_DEVICE_T* pmbd)
+{
+ /* free it up */
+ if (pmbd->mem_space) {
+ if (PMBD_DEV_USE_VMALLOC(pmbd))
+ vfree(pmbd->mem_space);
+ else if (PMBD_DEV_USE_HIGHMEM(pmbd)) {
+ hfree(pmbd->mem_space);
+ }
+ pmbd->mem_space = NULL;
+ }
+ return 0;
+}
+
+/* pmbd->pmbd_stat */
+static int pmbd_stat_alloc(PMBD_DEVICE_T* pmbd)
+{
+ int err = 0;
+ pmbd->pmbd_stat = (PMBD_STAT_T*)kzalloc(sizeof(PMBD_STAT_T), GFP_KERNEL);
+ if (pmbd->pmbd_stat){
+ spin_lock_init(&pmbd->pmbd_stat->stat_lock);
+ } else {
+ printk(KERN_ERR "pmbd: %s(%d): PMBD statistics allocation failed\n", __FUNCTION__, __LINE__);
+ err = -ENOMEM;
+ }
+ return err;
+}
+
+static int pmbd_stat_free(PMBD_DEVICE_T* pmbd)
+{
+ if(pmbd->pmbd_stat) {
+ kfree(pmbd->pmbd_stat);
+ pmbd->pmbd_stat = NULL;
+ }
+ return 0;
+}
+
+/* /proc/pmbd/<dev> */
+static int pmbd_proc_pmbdstat_read(char* buffer, char** start, off_t offset, int count, int* eof, void* data)
+{
+ int rtn;
+ if (offset > 0) {
+ *eof = 1;
+ rtn = 0;
+ } else {
+ char* local_buffer = kzalloc(8192, GFP_KERNEL);
+ PMBD_DEVICE_T* pmbd, *next;
+ char rdwr_name[2][16] = {"read", "write"};
+
+ if (!local_buffer) {
+ *eof = 1;
+ return 0;
+ }
+ local_buffer[0] = '\0';
+
+ list_for_each_entry_safe(pmbd, next, &pmbd_devices, pmbd_list) {
+ unsigned i, j;
+ BBN_T num_dirty = 0;
+ BBN_T num_blocks = 0;
+ PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+
+ /* NOTE: these per-buffer counters are read without the buffer lock; they are statistics only */
+ for (i = 0; i < pmbd->num_buffers; i ++){
+ num_blocks += pmbd->buffers[i]->num_blocks;
+ num_dirty += pmbd->buffers[i]->num_dirty;
+ }
+
+ /* print stuff now */
+ spin_lock(&pmbd->pmbd_stat->stat_lock);
+
+ sprintf(local_buffer+strlen(local_buffer), "num_dirty_blocks[%s] %u\n", pmbd->pmbd_name, (unsigned int) num_dirty);
+ sprintf(local_buffer+strlen(local_buffer), "num_clean_blocks[%s] %u\n", pmbd->pmbd_name, (unsigned int) (num_blocks - num_dirty));
+ sprintf(local_buffer+strlen(local_buffer), "num_sectors_read[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_sectors_read);
+ sprintf(local_buffer+strlen(local_buffer), "num_sectors_write[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_sectors_write);
+ sprintf(local_buffer+strlen(local_buffer), "num_requests_read[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_requests_read);
+ sprintf(local_buffer+strlen(local_buffer), "num_requests_write[%s] %llu\n",pmbd->pmbd_name, pmbd_stat->num_requests_write);
+ sprintf(local_buffer+strlen(local_buffer), "num_write_barrier[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_write_barrier);
+ sprintf(local_buffer+strlen(local_buffer), "num_write_fua[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_write_fua);
+
+ spin_unlock(&pmbd->pmbd_stat->stat_lock);
+
+
+ for (j = 0; j <= 1; j ++){
+ int k=0;
+
+ unsigned long long cycles_total = 0;
+ unsigned long long cycles_prepare = 0;
+ unsigned long long cycles_wb = 0;
+ unsigned long long cycles_work = 0;
+ unsigned long long cycles_endio = 0;
+ unsigned long long cycles_finish = 0;
+
+ unsigned long long cycles_pmap = 0;
+ unsigned long long cycles_punmap = 0;
+ unsigned long long cycles_memcpy = 0;
+ unsigned long long cycles_clflush = 0;
+ unsigned long long cycles_clflushall = 0;
+ unsigned long long cycles_wrverify = 0;
+ unsigned long long cycles_checksum = 0;
+ unsigned long long cycles_pause = 0;
+ unsigned long long cycles_slowdown = 0;
+ unsigned long long cycles_setpages_ro = 0;
+ unsigned long long cycles_setpages_rw = 0;
+
+ for (k = 0; k < PMBD_MAX_NUM_CPUS; k ++){
+ cycles_total += pmbd_stat->cycles_total[j][k];
+ cycles_prepare += pmbd_stat->cycles_prepare[j][k];
+ cycles_wb += pmbd_stat->cycles_wb[j][k];
+ cycles_work += pmbd_stat->cycles_work[j][k];
+ cycles_endio += pmbd_stat->cycles_endio[j][k];
+ cycles_finish += pmbd_stat->cycles_finish[j][k];
+
+ cycles_pmap += pmbd_stat->cycles_pmap[j][k];
+ cycles_punmap += pmbd_stat->cycles_punmap[j][k];
+ cycles_memcpy += pmbd_stat->cycles_memcpy[j][k];
+ cycles_clflush += pmbd_stat->cycles_clflush[j][k];
+ cycles_clflushall+=pmbd_stat->cycles_clflushall[j][k];
+ cycles_wrverify += pmbd_stat->cycles_wrverify[j][k];
+ cycles_checksum += pmbd_stat->cycles_checksum[j][k];
+ cycles_pause += pmbd_stat->cycles_pause[j][k];
+ cycles_slowdown += pmbd_stat->cycles_slowdown[j][k];
+ cycles_setpages_ro+= pmbd_stat->cycles_setpages_ro[j][k];
+ cycles_setpages_rw+= pmbd_stat->cycles_setpages_rw[j][k];
+ }
+
+ sprintf(local_buffer+strlen(local_buffer), "cycles_total_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_total);
+ sprintf(local_buffer+strlen(local_buffer), "cycles_prepare_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_prepare);
+ sprintf(local_buffer+strlen(local_buffer), "cycles_wb_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_wb);
+ sprintf(local_buffer+strlen(local_buffer), "cycles_work_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_work);
+ sprintf(local_buffer+strlen(local_buffer), "cycles_endio_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_endio);
+ sprintf(local_buffer+strlen(local_buffer), "cycles_finish_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_finish);
+ sprintf(local_buffer+strlen(local_buffer), "cycles_pmap_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_pmap);
+ sprintf(local_buffer+strlen(local_buffer), "cycles_punmap_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_punmap);
+ sprintf(local_buffer+strlen(local_buffer), "cycles_memcpy_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_memcpy);
+ sprintf(local_buffer+strlen(local_buffer), "cycles_clflush_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_clflush);
+ sprintf(local_buffer+strlen(local_buffer), "cycles_clflushall_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_clflushall);
+ sprintf(local_buffer+strlen(local_buffer), "cycles_wrverify_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_wrverify);
+ sprintf(local_buffer+strlen(local_buffer), "cycles_checksum_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_checksum);
+ sprintf(local_buffer+strlen(local_buffer), "cycles_pause_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_pause);
+ sprintf(local_buffer+strlen(local_buffer), "cycles_slowdown_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_slowdown);
+ sprintf(local_buffer+strlen(local_buffer), "cycles_setpages_ro_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_setpages_ro);
+ sprintf(local_buffer+strlen(local_buffer), "cycles_setpages_rw_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_setpages_rw);
+ }
+
+ }
+
+ memcpy(buffer, local_buffer, strlen(local_buffer));
+ rtn = strlen(local_buffer);
+ kfree(local_buffer);
+ }
+ return rtn;
+}
+
+/* /proc/pmbdcfg */
+static int pmbd_proc_pmbdcfg_read(char* buffer, char** start, off_t offset, int count, int* eof, void* data)
+{
+ int rtn;
+ if (offset > 0) {
+ *eof = 1;
+ rtn = 0;
+ } else {
+ char* local_buffer = kzalloc(8192, GFP_KERNEL);
+ PMBD_DEVICE_T* pmbd, *next;
+
+ if (!local_buffer) {
+ *eof = 1;
+ return 0;
+ }
+ local_buffer[0] = '\0';
+
+ /* global configurations */
+ sprintf(local_buffer+strlen(local_buffer), "MODULE OPTIONS: %s\n", mode);
+ sprintf(local_buffer+strlen(local_buffer), "\n");
+
+ sprintf(local_buffer+strlen(local_buffer), "max_part %d\n", max_part);
+ sprintf(local_buffer+strlen(local_buffer), "part_shift %d\n", part_shift);
+
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_type %u\n", g_pmbd_type);
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_mergeable %u\n", g_pmbd_mergeable);
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_cpu_cache_clflush %u\n", g_pmbd_cpu_cache_clflush);
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_cpu_cache_flag %lu\n", g_pmbd_cpu_cache_flag);
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_wr_protect %u\n", g_pmbd_wr_protect);
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_wr_verify %u\n", g_pmbd_wr_verify);
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_checksum %u\n", g_pmbd_checksum);
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_lock %u\n", g_pmbd_lock);
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_subpage_update %u\n", g_pmbd_subpage_update);
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_pmap %u\n", g_pmbd_pmap);
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_nts %u\n", g_pmbd_nts);
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_ntl %u\n", g_pmbd_ntl);
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_wb %u\n", g_pmbd_wb);
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_fua %u\n", g_pmbd_fua);
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_timestat %u\n", g_pmbd_timestat);
+ sprintf(local_buffer+strlen(local_buffer), "g_highmem_size %lu\n", g_highmem_size);
+ sprintf(local_buffer+strlen(local_buffer), "g_highmem_phys_addr %llu\n", (unsigned long long) g_highmem_phys_addr);
+ sprintf(local_buffer+strlen(local_buffer), "g_highmem_virt_addr %llu\n", (unsigned long long) g_highmem_virt_addr);
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_nr %u\n", g_pmbd_nr);
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_adjust_ns %llu\n", g_pmbd_adjust_ns);
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_num_buffers %llu\n", g_pmbd_num_buffers);
+ sprintf(local_buffer+strlen(local_buffer), "g_pmbd_buffer_stride %llu\n", g_pmbd_buffer_stride);
+ sprintf(local_buffer+strlen(local_buffer), "\n");
+
+ /* device specific configurations */
+ list_for_each_entry_safe(pmbd, next, &pmbd_devices, pmbd_list) {
+ int i = 0;
+
+ sprintf(local_buffer+strlen(local_buffer), "pmbd_id[%s] %d\n", pmbd->pmbd_name, pmbd->pmbd_id);
+ sprintf(local_buffer+strlen(local_buffer), "num_sectors[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->num_sectors);
+ sprintf(local_buffer+strlen(local_buffer), "sector_size[%s] %u\n", pmbd->pmbd_name, pmbd->sector_size);
+ sprintf(local_buffer+strlen(local_buffer), "pmbd_type[%s] %u\n", pmbd->pmbd_name, pmbd->pmbd_type);
+ sprintf(local_buffer+strlen(local_buffer), "rammode[%s] %u\n", pmbd->pmbd_name, pmbd->rammode);
+ sprintf(local_buffer+strlen(local_buffer), "bufmode[%s] %u\n", pmbd->pmbd_name, pmbd->bufmode);
+ sprintf(local_buffer+strlen(local_buffer), "wpmode[%s] %u\n", pmbd->pmbd_name, pmbd->wpmode);
+ sprintf(local_buffer+strlen(local_buffer), "num_buffers[%s] %u\n", pmbd->pmbd_name, pmbd->num_buffers);
+ sprintf(local_buffer+strlen(local_buffer), "buffer_stride[%s] %u\n", pmbd->pmbd_name, pmbd->buffer_stride);
+ sprintf(local_buffer+strlen(local_buffer), "pb_size[%s] %u\n", pmbd->pmbd_name, pmbd->pb_size);
+ sprintf(local_buffer+strlen(local_buffer), "checksum_unit_size[%s] %u\n", pmbd->pmbd_name, pmbd->checksum_unit_size);
+ sprintf(local_buffer+strlen(local_buffer), "simmode[%s] %u\n", pmbd->pmbd_name, pmbd->simmode);
+ sprintf(local_buffer+strlen(local_buffer), "rdlat[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->rdlat);
+ sprintf(local_buffer+strlen(local_buffer), "wrlat[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->wrlat);
+ sprintf(local_buffer+strlen(local_buffer), "rdbw[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->rdbw);
+ sprintf(local_buffer+strlen(local_buffer), "wrbw[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->wrbw);
+ sprintf(local_buffer+strlen(local_buffer), "rdsx[%s] %u\n", pmbd->pmbd_name, pmbd->rdsx);
+ sprintf(local_buffer+strlen(local_buffer), "wrsx[%s] %u\n", pmbd->pmbd_name, pmbd->wrsx);
+ sprintf(local_buffer+strlen(local_buffer), "rdpause[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->rdpause);
+ sprintf(local_buffer+strlen(local_buffer), "wrpause[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->wrpause);
+
+ for (i = 0; i < pmbd->num_buffers; i ++){
+ PMBD_BUFFER_T* buffer = pmbd->buffers[i];
+ sprintf(local_buffer+strlen(local_buffer), "buffer%d[%s]buffer_id %u\n", i, pmbd->pmbd_name, buffer->buffer_id);
+ sprintf(local_buffer+strlen(local_buffer), "buffer%d[%s]num_blocks %lu\n", i, pmbd->pmbd_name, (unsigned long) buffer->num_blocks);
+ sprintf(local_buffer+strlen(local_buffer), "buffer%d[%s]batch_size %lu\n", i, pmbd->pmbd_name, (unsigned long) buffer->batch_size);
+ }
+
+ }
+
+ memcpy(buffer, local_buffer, strlen(local_buffer));
+ rtn = strlen(local_buffer);
+ kfree(local_buffer);
+ }
+ return rtn;
+}
+
+
+
+static int pmbd_proc_devstat_read(char* buffer, char** start, off_t offset, int count, int* eof, void* data)
+{
+ int rtn;
+ char local_buffer[1024];
+ if (offset > 0) {
+ *eof = 1;
+ rtn = 0;
+ } else {
+ sprintf(local_buffer, "N/A\n");
+ memcpy(buffer, local_buffer, strlen(local_buffer));
+ rtn = strlen(local_buffer);
+ }
+ return rtn;
+}
+
+static int pmbd_proc_devstat_create(PMBD_DEVICE_T* pmbd)
+{
+ /* create a /proc/pmbd/<dev> entry */
+ pmbd->proc_devstat = create_proc_entry(pmbd->pmbd_name, S_IRUGO, proc_pmbd);
+ if (pmbd->proc_devstat == NULL) {
+ remove_proc_entry(pmbd->pmbd_name, proc_pmbd);
+ printk(KERN_ERR "pmbd: cannot create /proc/pmbd/%s\n", pmbd->pmbd_name);
+ return -ENOMEM;
+ }
+ pmbd->proc_devstat->read_proc = pmbd_proc_devstat_read;
+ printk(KERN_INFO "pmbd: /proc/pmbd/%s created\n", pmbd->pmbd_name);
+
+ return 0;
+}
+
+static int pmbd_proc_devstat_destroy(PMBD_DEVICE_T* pmbd)
+{
+ remove_proc_entry(pmbd->pmbd_name, proc_pmbd);
+ printk(KERN_INFO "pmbd: /proc/pmbd/%s removed\n", pmbd->pmbd_name);
+ return 0;
+}
+
+static int pmbd_create (PMBD_DEVICE_T* pmbd, uint64_t sectors)
+{
+ int err = 0;
+
+ pmbd->num_sectors = sectors;
+ pmbd->sector_size = PMBD_SECTOR_SIZE; /* FIXME: now we use 512, do we need to change it? */
+ pmbd->pmbd_type = g_pmbd_type;
+ pmbd->checksum_unit_size = PAGE_SIZE;
+ pmbd->pb_size = PAGE_SIZE;
+
+ spin_lock_init(&pmbd->batch_lock);
+ spin_lock_init(&pmbd->wr_barrier_lock);
+
+ spin_lock_init(&pmbd->tmp_lock);
+ pmbd->tmp_data = 0;
+ pmbd->tmp_num = 0;
+
+ /* allocate statistics info */
+ if ((err = pmbd_stat_alloc(pmbd)) < 0)
+ goto error;
+
+ /* allocate memory space */
+ if ((err = pmbd_mem_space_alloc(pmbd)) < 0)
+ goto error;
+
+ /* allocate buffer space */
+ if ((err = pmbd_buffer_space_alloc(pmbd)) < 0)
+ goto error;
+
+ /* allocate checksum space */
+ if ((err = pmbd_checksum_space_alloc(pmbd)) < 0)
+ goto error;
+
+ /* allocate block info space */
+ if ((err = pmbd_pbi_space_alloc(pmbd)) < 0)
+ goto error;
+
+ /* create a /proc/pmbd/<dev> entry*/
+ if ((err = pmbd_proc_devstat_create(pmbd)) < 0)
+ goto error;
+
+#if 0
+ /* FIXME: No need to do it. It's slow and could lock up the system*/
+ pmbd_checksum_space_init(pmbd);
+#endif
+
+ /* set up the page attributes related to the CPU cache:
+ * if using vmalloc(), set the page cache flags (WB, WC, UC, UC-Minus);
+ * if using high memory, the flags are set with ioremap_prot().
+ * WARN: in Linux 3.2.1 this function is slow and could cause system hangs.
+ */
+
+ if (PMBD_USE_VMALLOC()){
+ pmbd_set_pages_cache_flags(pmbd);
+ }
+
+ /* initialize PM pages read-only */
+ if (!PMBD_USE_PMAP() && PMBD_USE_WRITE_PROTECTION())
+ pmbd_set_pages_ro(pmbd, pmbd->mem_space, PMBD_MEM_TOTAL_BYTES(pmbd), FALSE);
+
+ printk(KERN_INFO "pmbd: %s created\n", pmbd->pmbd_name);
+error:
+ return err;
+}
+
+static int pmbd_destroy (PMBD_DEVICE_T* pmbd)
+{
+ /* flush everything down */
+ /* FIXME: this implies flushing the CPU cache */
+ pmbd_write_barrier(pmbd);
+
+ /* free /proc entry */
+ pmbd_proc_devstat_destroy(pmbd);
+
+ /* free buffer space */
+ pmbd_buffer_space_free(pmbd);
+
+ /* set PM pages writable */
+ if (!PMBD_USE_PMAP() && PMBD_USE_WRITE_PROTECTION())
+ pmbd_set_pages_rw(pmbd, pmbd->mem_space, PMBD_MEM_TOTAL_BYTES(pmbd), FALSE);
+
+ /* reset memory attributes to WB */
+ if (PMBD_USE_VMALLOC())
+ pmbd_reset_pages_cache_flags(pmbd);
+
+ /* free block info space */
+ pmbd_pbi_space_free(pmbd);
+
+ /* free checksum space */
+ pmbd_checksum_space_free(pmbd);
+
+ /* free memory backstore space */
+ pmbd_mem_space_free(pmbd);
+
+ /* free statistics data */
+ pmbd_stat_free(pmbd);
+
+ printk(KERN_INFO "pmbd: /dev/%s is destroyed (%llu MB)\n", pmbd->pmbd_name, SECTORS_TO_MB(pmbd->num_sectors));
+
+ pmbd->num_sectors = 0;
+ pmbd->sector_size = 0;
+ pmbd->checksum_unit_size = 0;
+ return 0;
+}
+
+static int pmbd_free_pages(PMBD_DEVICE_T* pmbd)
+{
+ return pmbd_destroy(pmbd);
+}
+
+/*
+ **************************************************************************
+ * /proc file system entries
+ **************************************************************************
+ */
+
+static int pmbd_proc_create(void)
+{
+ proc_pmbd = proc_mkdir("pmbd", NULL);
+ if(proc_pmbd == NULL){
+ printk(KERN_ERR "pmbd: %s(%d): cannot create /proc/pmbd\n", __FUNCTION__, __LINE__);
+ return -ENOMEM;
+ }
+
+ proc_pmbdstat = create_proc_entry("pmbdstat", S_IRUGO, proc_pmbd);
+ if (proc_pmbdstat == NULL){
+ remove_proc_entry("pmbdstat", proc_pmbd);
+ printk(KERN_ERR "pmbd: cannot create /proc/pmbd/pmbdstat\n");
+ return -ENOMEM;
+ }
+ proc_pmbdstat->read_proc = pmbd_proc_pmbdstat_read;
+ printk(KERN_INFO "pmbd: /proc/pmbd/pmbdstat created\n");
+
+ proc_pmbdcfg = create_proc_entry("pmbdcfg", S_IRUGO, proc_pmbd);
+ if (proc_pmbdcfg == NULL){
+ remove_proc_entry("pmbdcfg", proc_pmbd);
+ printk(KERN_ERR "pmbd: cannot create /proc/pmbd/pmbdcfg\n");
+ return -ENOMEM;
+ }
+ proc_pmbdcfg->read_proc = pmbd_proc_pmbdcfg_read;
+ printk(KERN_INFO "pmbd: /proc/pmbd/pmbdcfg created\n");
+
+ return 0;
+}
+
+static int pmbd_proc_destroy(void)
+{
+ remove_proc_entry("pmbdcfg", proc_pmbd);
+ printk(KERN_INFO "pmbd: /proc/pmbd/pmbdcfg is removed\n");
+
+ remove_proc_entry("pmbdstat", proc_pmbd);
+ printk(KERN_INFO "pmbd: /proc/pmbd/pmbdstat is removed\n");
+
+ remove_proc_entry("pmbd", NULL);
+ printk(KERN_INFO "pmbd: /proc/pmbd is removed\n");
+ return 0;
+}
+
+/*
+ **************************************************************************
+ * device driver interface hook functions
+ **************************************************************************
+ */
+
+static int pmbd_mergeable_bvec(struct request_queue *q,
+ struct bvec_merge_data *bvm,
+ struct bio_vec *biovec) {
+ static int flag = 0;
+
+ if(PMBD_IS_MERGEABLE()) {
+ /* always merge */
+ if (!flag) {
+ printk(KERN_INFO "pmbd: bio merging enabled\n");
+ flag = 1;
+ }
+ return biovec->bv_len;
+ } else {
+ /* never merge */
+ if (!flag) {
+ printk(KERN_INFO "pmbd: bio merging disabled\n");
+ flag = 1;
+ }
+ if (!bvm->bi_size) {
+ return biovec->bv_len;
+ } else {
+ return 0;
+ }
+ }
+}
+
+int pmbd_fsync(struct file* file, struct dentry* dentry, int datasync)
+{
+ printk(KERN_WARNING "pmbd: pmbd_fsync not implemented\n");
+
+ return 0;
+}
+
+int pmbd_open(struct block_device* bdev, fmode_t mode)
+{
+ printk(KERN_DEBUG "pmbd: pmbd (/dev/%s) opened\n", bdev->bd_disk->disk_name);
+ return 0;
+}
+
+int pmbd_release (struct gendisk* disk, fmode_t mode)
+{
+ printk(KERN_DEBUG "pmbd: pmbd (/dev/%s) released\n", disk->disk_name);
+ return 0;
+}
+
+static const struct block_device_operations pmbd_fops = {
+ .owner = THIS_MODULE,
+// .open = pmbd_open,
+// .release = pmbd_release,
+};
+
+/*
+ * NOTE: parts of the following code are derived from linux/block/brd.c
+ */
+
+
+static PMBD_DEVICE_T *pmbd_alloc(int i)
+{
+ PMBD_DEVICE_T *pmbd;
+ struct gendisk *disk;
+
+ /* no more than 26 devices */
+ if (i >= PMBD_MAX_NUM_DEVICES)
+ return NULL;
+
+ /* alloc and set up pmbd object */
+ pmbd = kzalloc(sizeof(*pmbd), GFP_KERNEL);
+ if (!pmbd)
+ goto out;
+ pmbd->pmbd_id = i;
+ pmbd->pmbd_queue = blk_alloc_queue(GFP_KERNEL);
+ sprintf(pmbd->pmbd_name, "pm%c", ('a' + i));
+ pmbd->rdlat = g_pmbd_rdlat[i];
+ pmbd->wrlat = g_pmbd_wrlat[i];
+ pmbd->rdbw = g_pmbd_rdbw[i];
+ pmbd->wrbw = g_pmbd_wrbw[i];
+ pmbd->rdsx = g_pmbd_rdsx[i];
+ pmbd->wrsx = g_pmbd_wrsx[i];
+ pmbd->rdpause = g_pmbd_rdpause[i];
+ pmbd->wrpause = g_pmbd_wrpause[i];
+ pmbd->simmode = g_pmbd_simmode[i];
+ pmbd->rammode = g_pmbd_rammode[i];
+ pmbd->wpmode = g_pmbd_wpmode[i];
+ pmbd->num_buffers = g_pmbd_num_buffers;
+ pmbd->buffer_stride = g_pmbd_buffer_stride;
+ pmbd->bufmode = (g_pmbd_bufsize[i] > 0 && g_pmbd_num_buffers > 0) ? TRUE : FALSE;
+
+ if (!pmbd->pmbd_queue)
+ goto out_free_dev;
+
+ /* hook functions */
+ blk_queue_make_request(pmbd->pmbd_queue, pmbd_make_request);
+
+ /* set flush capability, otherwise, WRITE_FLUSH and WRITE_FUA will be filtered in
+ generic_make_request() */
+ if (PMBD_USE_FUA() && PMBD_USE_WB())
+ blk_queue_flush(pmbd->pmbd_queue, REQ_FLUSH | REQ_FUA);
+ else if (PMBD_USE_WB())
+ blk_queue_flush(pmbd->pmbd_queue, REQ_FLUSH);
+ else if (PMBD_USE_FUA())
+ blk_queue_flush(pmbd->pmbd_queue, REQ_FUA);
+
+ blk_queue_max_hw_sectors(pmbd->pmbd_queue, 1024);
+ blk_queue_bounce_limit(pmbd->pmbd_queue, BLK_BOUNCE_ANY);
+ blk_queue_merge_bvec(pmbd->pmbd_queue, pmbd_mergeable_bvec);
+
+ disk = pmbd->pmbd_disk = alloc_disk(1 << part_shift);
+ if (!disk)
+ goto out_free_queue;
+
+ disk->major = PMBD_MAJOR;
+ disk->first_minor = i << part_shift;
+ disk->fops = &pmbd_fops;
+ disk->private_data = pmbd;
+ disk->queue = pmbd->pmbd_queue;
+ strcpy(disk->disk_name, pmbd->pmbd_name);
+ set_capacity(disk, GB_TO_SECTORS(g_pmbd_size[i])); /* num of sectors */
+
+ /* allocate PM space */
+ if (pmbd_create(pmbd, GB_TO_SECTORS(g_pmbd_size[i])) < 0)
+ goto out_free_disk;
+
+ /* done */
+ return pmbd;
+
+out_free_disk:
+ put_disk(pmbd->pmbd_disk);
+out_free_queue:
+ blk_cleanup_queue(pmbd->pmbd_queue);
+out_free_dev:
+ kfree(pmbd);
+out:
+ return NULL;
+}
+
+static void pmbd_free(PMBD_DEVICE_T *pmbd)
+{
+ put_disk(pmbd->pmbd_disk);
+ blk_cleanup_queue(pmbd->pmbd_queue);
+ pmbd_free_pages(pmbd);
+ kfree(pmbd);
+}
+
+static void pmbd_del_one(PMBD_DEVICE_T *pmbd)
+{
+ list_del(&pmbd->pmbd_list);
+ del_gendisk(pmbd->pmbd_disk);
+ pmbd_free(pmbd);
+}
+
+static int __init pmbd_init(void)
+{
+ int i, nr;
+ unsigned long range;
+ PMBD_DEVICE_T *pmbd, *next;
+
+ /* parse input options */
+ pmbd_parse_conf();
+
+ /* initialize pmap start*/
+ pmap_create();
+
+ /* ioremap high memory space */
+ if (PMBD_USE_HIGHMEM()) {
+ if (pmbd_highmem_map() == NULL)
+ return -ENOMEM;
+ }
+
+ part_shift = 0;
+ if (max_part > 0)
+ part_shift = fls(max_part);
+
+ if (g_pmbd_nr > 1UL << (MINORBITS - part_shift))
+ return -EINVAL;
+
+ if (g_pmbd_nr) {
+ nr = g_pmbd_nr;
+ range = g_pmbd_nr;
+ } else {
+ printk(KERN_ERR "pmbd: %s(%d) - g_pmbd_nr=%d\n", __FUNCTION__, __LINE__, g_pmbd_nr);
+ return -EINVAL;
+ }
+
+ pmbd_proc_create();
+
+ if (register_blkdev(PMBD_MAJOR, PMBD_NAME))
+ return -EIO;
+ else
+ printk(KERN_INFO "pmbd: registered device at major %d\n", PMBD_MAJOR);
+
+ for (i = 0; i < nr; i++) {
+ pmbd = pmbd_alloc(i);
+ if (!pmbd)
+ goto out_free;
+ list_add_tail(&pmbd->pmbd_list, &pmbd_devices);
+ }
+
+ /* point of no return */
+ list_for_each_entry(pmbd, &pmbd_devices, pmbd_list)
+ add_disk(pmbd->pmbd_disk);
+
+ printk(KERN_INFO "pmbd: module loaded\n");
+ return 0;
+
+out_free:
+ list_for_each_entry_safe(pmbd, next, &pmbd_devices, pmbd_list) {
+ list_del(&pmbd->pmbd_list);
+ pmbd_free(pmbd);
+ }
+ unregister_blkdev(PMBD_MAJOR, PMBD_NAME);
+
+ return -ENOMEM;
+}
+
+
+static void __exit pmbd_exit(void)
+{
+ PMBD_DEVICE_T *pmbd, *next;
+
+
+ /* deactivate each pmbd instance*/
+ list_for_each_entry_safe(pmbd, next, &pmbd_devices, pmbd_list)
+ pmbd_del_one(pmbd);
+
+ /* deioremap high memory space */
+ if (PMBD_USE_HIGHMEM()) {
+ pmbd_highmem_unmap();
+ }
+
+ /* destroy pmap entries */
+ pmap_destroy();
+
+ unregister_blkdev(PMBD_MAJOR, PMBD_NAME);
+
+ pmbd_proc_destroy();
+
+ printk(KERN_INFO "pmbd: module unloaded\n");
+ return;
+}
+
+/* module setup */
+MODULE_AUTHOR("Intel Corporation <linux-pmbd at intel.com>");
+MODULE_ALIAS("pmbd");
+MODULE_LICENSE("GPL v2");
+MODULE_VERSION("0.9");
+MODULE_ALIAS_BLOCKDEV_MAJOR(PMBD_MAJOR);
+module_init(pmbd_init);
+module_exit(pmbd_exit);
+
+/* THE END */
+
+
diff --git a/include/linux/pmbd.h b/include/linux/pmbd.h
new file mode 100644
index 0000000..8e8691f
--- /dev/null
+++ b/include/linux/pmbd.h
@@ -0,0 +1,509 @@
+/*
+ * Intel Persistent Memory Block Driver
+ * Copyright (c) <2011-2013>, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+/*
+ * Intel Persistent Memory Block Driver (v0.9)
+ *
+ * pmbd.h
+ *
+ * Intel Corporation <linux-pmbd at intel.com>
+ * 03/24/2011
+ */
+
+#ifndef PMBD_H
+#define PMBD_H
+
+#define PMBD_MAJOR 261 /* FIXME: temporarily use this */
+#define PMBD_NAME "pmbd" /* pmbd module name */
+#define PMBD_MAX_NUM_DEVICES 26 /* max num of devices */
+#define PMBD_MAX_NUM_CPUS 32 /* max num of cpus*/
+
+/*
+ * type definitions
+ */
+typedef uint32_t PMBD_CHECKSUM_T;/* CRC32 checksum */
+typedef sector_t BBN_T; /* buffer block number */
+typedef sector_t PBN_T; /* physical block number */
+
+
+/*
+ * PMBD device buffer control structure
+ * NOTE:
+ * (1) buffer_space is an array of num_blocks of blocks, the size of which is
+ * defined as pmbd->pb_size
+ * (2) bbi_space is an array of num_blocks of bbi (buffer block info) units,
+ * each of which contains the metadata information of each block in the buffer
+ *
+ * buffer space management variables:
+ * num_dirty - total number of dirty blocks in the buffer
+ * pos_dirty - the first block of the dirty run
+ * pos_clean - the first clean block following the dirty run
+ *
+ * pos_dirty and pos_clean logically segment the buffer into
+ * dirty/clean regions as follows.
+ *
+ * pos_dirty ----v v--- pos_clean
+ * ----------------------------
+ * | clean |*DIRTY*| clean |
+ * ----------------------------
+ * buffer_lock - protects reads/writes to the three fields above
+ */
+typedef struct pmbd_bbi { /* pmbd buffer block info (BBI) */
+ PBN_T pbn; /* physical block number in PM (converted from sector) */
+ unsigned dirty; /* dirty (1) or clean (0)*/
+} PMBD_BBI_T;
+
+typedef struct pmbd_bsort_entry { /* pmbd buffer block info for sorting */
+ BBN_T bbn; /* buffer block number (in buffer)*/
+ PBN_T pbn; /* physical block number (in PMBD)*/
+} PMBD_BSORT_ENTRY_T;
+
+typedef struct pmbd_buffer {
+ unsigned buffer_id;
+ struct pmbd_device* pmbd; /* the linked pmbd device */
+
+ BBN_T num_blocks; /* buffer space size (# of blocks) */
+ void* buffer_space; /* buffer space base vaddr address */
+ PMBD_BBI_T* bbi_space; /* array of buffer block info (BBI)*/
+
+ BBN_T num_dirty; /* num of dirty blocks */
+ BBN_T pos_dirty; /* the first dirty block */
+ BBN_T pos_clean; /* the first clean block */
+ spinlock_t buffer_lock; /* lock to protect metadata updates */
+ unsigned int batch_size; /* the batch size for flushing buffer pages */
+
+ struct task_struct* syncer; /* the syncer daemon */
+
+ spinlock_t flush_lock; /* lock to serialize buffer flushing */
+ PMBD_BSORT_ENTRY_T* bbi_sort_buffer;/* a temp array of the bbi for sorting */
+} PMBD_BUFFER_T;
+
+/*
+ * PM physical block information (each corresponding to a PM block)
+ *
+ * (1) if the physical block is buffered, bbn contains a valid buffer block
+ * number (BBN) between 0 - (buffer->num_blocks-1), otherwise, it contains an
+ * invalid value (buffer->num_blocks + 1)
+ * (2) any access to the block (read/write/sync) must have this lock first to
+ * prevent multiple concurrent accesses to the same PM block
+ */
+typedef struct pmbd_pbi{
+ BBN_T bbn;
+ spinlock_t lock;
+} PMBD_PBI_T;
+
+typedef struct pmbd_stat{
+ /* stat_lock does not protect cycles_*[] counters */
+ spinlock_t stat_lock; /* protection lock */
+
+ unsigned last_access_jiffies; /* the timestamp of the most recent access */
+ uint64_t num_sectors_read; /* total num of sectors being read */
+ uint64_t num_sectors_write; /* total num of sectors being written */
+ uint64_t num_requests_read; /* total num of requests for read */
+ uint64_t num_requests_write; /* total num of request for write */
+ uint64_t num_write_barrier; /* total num of write barriers received */
+ uint64_t num_write_fua; /* total num of FUA writes received */
+
+ /* cycles counters (enabled/disabled by timestat)*/
+ uint64_t cycles_total[2][PMBD_MAX_NUM_CPUS]; /* total cycles for read in make_request*/
+ uint64_t cycles_prepare[2][PMBD_MAX_NUM_CPUS]; /* total cycles for prepare in make_request*/
+ uint64_t cycles_wb[2][PMBD_MAX_NUM_CPUS]; /* total cycles for write barrier in make_request*/
+ uint64_t cycles_work[2][PMBD_MAX_NUM_CPUS]; /* total cycles for work in make_request*/
+ uint64_t cycles_endio[2][PMBD_MAX_NUM_CPUS]; /* total cycles for endio in make_request*/
+ uint64_t cycles_finish[2][PMBD_MAX_NUM_CPUS]; /* total cycles for finish-up in make_request*/
+
+ uint64_t cycles_pmap[2][PMBD_MAX_NUM_CPUS]; /* total cycles for private mapping*/
+ uint64_t cycles_punmap[2][PMBD_MAX_NUM_CPUS]; /* total cycles for private unmapping */
+ uint64_t cycles_memcpy[2][PMBD_MAX_NUM_CPUS]; /* total cycles for memcpy */
+ uint64_t cycles_clflush[2][PMBD_MAX_NUM_CPUS]; /* total cycles for clflush_range */
+ uint64_t cycles_clflushall[2][PMBD_MAX_NUM_CPUS];/* total cycles for clflush_all */
+ uint64_t cycles_wrverify[2][PMBD_MAX_NUM_CPUS]; /* total cycles for doing write verification */
+ uint64_t cycles_checksum[2][PMBD_MAX_NUM_CPUS]; /* total cycles for doing checksum */
+ uint64_t cycles_pause[2][PMBD_MAX_NUM_CPUS]; /* total cycles for pause */
+ uint64_t cycles_slowdown[2][PMBD_MAX_NUM_CPUS]; /* total cycles for slowdown*/
+ uint64_t cycles_setpages_ro[2][PMBD_MAX_NUM_CPUS]; /*total cycles for set pages to ro*/
+ uint64_t cycles_setpages_rw[2][PMBD_MAX_NUM_CPUS]; /*total cycles for set pages to rw*/
+} PMBD_STAT_T;
+
+/*
+ * pmbd_device structure (each corresponding to a pmbd instance)
+ */
+#define PBN_TO_PMBD_BUFFER_ID(PMBD, PBN) (((PBN)/(PMBD)->buffer_stride) % (PMBD)->num_buffers)
+#define PBN_TO_PMBD_BUFFER(PMBD, PBN) ((PMBD)->buffers[PBN_TO_PMBD_BUFFER_ID((PMBD), (PBN))])
+
+typedef struct pmbd_device {
+ int pmbd_id; /* dev id */
+ char pmbd_name[DISK_NAME_LEN];/* device name */
+
+ struct request_queue * pmbd_queue;
+ struct gendisk * pmbd_disk;
+ struct list_head pmbd_list;
+
+ /* PM backstore space */
+ void* mem_space; /* pointer to the kernel mem space */
+ uint64_t num_sectors; /* PMBD device capacity (num of 512-byte sectors)*/
+ unsigned sector_size; /* 512 bytes */
+
+ /* configurations */
+ unsigned pmbd_type; /* vmalloc() or high_mem */
+ unsigned rammode; /* RAM mode (no write protection) or not */
+ unsigned bufmode; /* use buffer or not */
+ unsigned wpmode; /* write protection mode: PTE change (0) or CR0/WP bit switch (1)*/
+
+ /* buffer management */
+ PMBD_BUFFER_T** buffers; /* buffer control structure */
+ unsigned num_buffers; /* number of buffers */
+ unsigned buffer_stride; /* the number of contiguous blocks mapped to the same buffer */
+
+ /* physical block info (metadata) */
+ PMBD_PBI_T* pbi_space; /* physical block info space (each) */
+ unsigned pb_size; /* the unit size of each block (4096 by default) */
+
+ /* checksum */
+ PMBD_CHECKSUM_T* checksum_space; /* checksum array */
+ unsigned checksum_unit_size; /* checksum unit size (bytes) */
+ void* checksum_iomem_buf; /* one unit buffer for ioremapped PM */
+
+ /* emulating PM with injected latency */
+ unsigned simmode; /* simulating whole device (0) or PM only (1)*/
+ uint64_t rdlat; /* read access latency (in nanoseconds)*/
+ uint64_t wrlat; /* write access latency (in nanoseconds)*/
+ uint64_t rdbw; /* read bandwidth (MB/sec) */
+ uint64_t wrbw; /* write bandwidth (MB/sec) */
+ unsigned rdsx; /* read slowdown (X) */
+ unsigned wrsx; /* write slowdown (X) */
+ uint64_t rdpause; /* read pause (cycles per 4KB page) */
+ uint64_t wrpause; /* write pause (cycles per 4KB page) */
+
+ spinlock_t batch_lock; /* lock protecting batch_* fields */
+ uint64_t batch_start_cycle[2]; /* start time of the batch (cycles)*/
+ uint64_t batch_end_cycle[2]; /* end time of the batch (cycles) */
+ uint64_t batch_sectors[2]; /* the total num of sectors in the batch */
+
+ PMBD_STAT_T* pmbd_stat; /* statistics data */
+ struct proc_dir_entry* proc_devstat; /* the proc output */
+
+ spinlock_t wr_barrier_lock;/* for write barrier and other control */
+ atomic_t num_flying_wr; /* the counter of writes on the fly */
+
+ spinlock_t tmp_lock;
+ uint64_t tmp_data;
+ unsigned long tmp_num;
+} PMBD_DEVICE_T;
+
+/*
+ * support definitions
+ */
+#define TRUE 1
+#define FALSE 0
+
+#define __CURRENT_PID__ (current->pid)
+#define CONFIG_PMBD_DEBUG 1
+//#define PRINTK_DEBUG_HDR "DEBUG %s(%d)%u - "
+//#define PRINTK_DEBUG_PAR __FUNCTION__, __LINE__, __CURRENT_PID__
+//#define PRINTK_DEBUG_1 if(CONFIG_PMBD_DEBUG >= 1) printk
+//#define PRINTK_DEBUG_2 if(CONFIG_PMBD_DEBUG >= 2) printk
+//#define PRINTK_DEBUG_3 if(CONFIG_PMBD_DEBUG >= 3) printk
+
+#define MAX_OF(A, B) (((A) > (B))? (A) : (B))
+#define MIN_OF(A, B) (((A) < (B))? (A) : (B))
+
+#define SECTOR_SHIFT 9
+#define PAGE_SHIFT 12
+#define SECTOR_SIZE (1UL << SECTOR_SHIFT)
+//#define PAGE_SIZE (1UL << PAGE_SHIFT)
+#define SECTOR_MASK (~(SECTOR_SIZE-1))
+#define PAGE_MASK (~(PAGE_SIZE-1))
+#define PMBD_SECTOR_SIZE SECTOR_SIZE
+#define PMBD_PAGE_SIZE PAGE_SIZE
+#define KB_SHIFT 10
+#define MB_SHIFT 20
+#define GB_SHIFT 30
+#define MB_TO_BYTES(N) ((N) << MB_SHIFT)
+#define GB_TO_BYTES(N) ((N) << GB_SHIFT)
+#define BYTES_TO_MB(N) ((N) >> MB_SHIFT)
+#define BYTES_TO_GB(N) ((N) >> GB_SHIFT)
+#define MB_TO_SECTORS(N) ((N) << (MB_SHIFT - SECTOR_SHIFT))
+#define GB_TO_SECTORS(N) ((N) << (GB_SHIFT - SECTOR_SHIFT))
+#define SECTORS_TO_MB(N) ((N) >> (MB_SHIFT - SECTOR_SHIFT))
+#define SECTORS_TO_GB(N) ((N) >> (GB_SHIFT - SECTOR_SHIFT))
+#define SECTOR_TO_PAGE(N) ((N) >> (PAGE_SHIFT - SECTOR_SHIFT))
+#define SECTOR_TO_BYTE(N) ((N) << SECTOR_SHIFT)
+#define BYTE_TO_SECTOR(N) ((N) >> SECTOR_SHIFT)
+#define PAGE_TO_SECTOR(N) ((N) << (PAGE_SHIFT - SECTOR_SHIFT))
+#define BYTE_TO_PAGE(N) ((N) >> (PAGE_SHIFT))
+
+#define IS_SPACE(C) (isspace(C) || (C) == '\0')
+#define IS_DIGIT(C) (isdigit(C) && (C) != '\0')
+#define IS_ALPHA(C) (isalpha(C) && (C) != '\0')
+
+#define DISABLE_SAVE_IRQ(FLAGS) local_irq_save((FLAGS)) /* saves flags and disables IRQs */
+#define ENABLE_RESTORE_IRQ(FLAGS) local_irq_restore((FLAGS)) /* restores the saved IRQ state */
+#define CUR_CPU_ID() smp_processor_id()
+
+/*
+ * PMBD related config
+ */
+
+#define PMBD_CONFIG_VMALLOC 0 /* vmalloc() based PMBD (default) */
+#define PMBD_CONFIG_HIGHMEM 1 /* ioremap() based PMBD */
+
+
+/* global config */
+#define PMBD_IS_MERGEABLE() (g_pmbd_mergeable == TRUE)
+#define PMBD_USE_VMALLOC() (g_pmbd_type == PMBD_CONFIG_VMALLOC)
+#define PMBD_USE_HIGHMEM() (g_pmbd_type == PMBD_CONFIG_HIGHMEM)
+#define PMBD_USE_CLFLUSH() (g_pmbd_cpu_cache_clflush == TRUE)
+#define PMBD_CPU_CACHE_FLAG() ((g_pmbd_cpu_cache_flag == _PAGE_CACHE_WB)? "WB" : \
+ ((g_pmbd_cpu_cache_flag == _PAGE_CACHE_WC)? "WC" : \
+ ((g_pmbd_cpu_cache_flag == _PAGE_CACHE_UC)? "UC" : \
+ ((g_pmbd_cpu_cache_flag == _PAGE_CACHE_UC_MINUS)? "UC-Minus" : "UNKNOWN"))))
+
+#define PMBD_CPU_CACHE_USE_WB() (g_pmbd_cpu_cache_flag == _PAGE_CACHE_WB) /* write back */
+#define PMBD_CPU_CACHE_USE_WC() (g_pmbd_cpu_cache_flag == _PAGE_CACHE_WC) /* write combining */
+#define PMBD_CPU_CACHE_USE_UC() (g_pmbd_cpu_cache_flag == _PAGE_CACHE_UC) /* uncachable */
+#define PMBD_CPU_CACHE_USE_UM() (g_pmbd_cpu_cache_flag == _PAGE_CACHE_UC_MINUS) /* uncachable minus */
+
+#define PMBD_USE_WRITE_PROTECTION() (g_pmbd_wr_protect == TRUE)
+#define PMBD_USE_WRITE_VERIFICATION() (g_pmbd_wr_verify == TRUE)
+#define PMBD_USE_CHECKSUM() (g_pmbd_checksum == TRUE)
+#define PMBD_USE_LOCK() (g_pmbd_lock == TRUE)
+#define PMBD_USE_SUBPAGE_UPDATE() (g_pmbd_subpage_update == TRUE)
+
+#define PMBD_USE_PMAP() (g_pmbd_pmap == TRUE && g_pmbd_type == PMBD_CONFIG_HIGHMEM)
+#define PMBD_USE_NTS() (g_pmbd_nts == TRUE)
+#define PMBD_USE_NTL() (g_pmbd_ntl == TRUE)
+#define PMBD_USE_WB() (g_pmbd_wb == TRUE)
+#define PMBD_USE_FUA() (g_pmbd_fua == TRUE)
+#define PMBD_USE_TIMESTAT() (g_pmbd_timestat == TRUE)
+
+#define TIMESTAMP(TS) rdtscll((TS))
+#define TIMESTAT_POINT(TS) {(TS) = 0; if (PMBD_USE_TIMESTAT()) rdtscll((TS));}
+
+/* instanced based config */
+#define PMBD_DEV_USE_VMALLOC(PMBD) ((PMBD)->pmbd_type == PMBD_CONFIG_VMALLOC)
+#define PMBD_DEV_USE_HIGHMEM(PMBD) ((PMBD)->pmbd_type == PMBD_CONFIG_HIGHMEM)
+#define PMBD_DEV_USE_BUFFER(PMBD) ((PMBD)->bufmode)
+#define PMBD_DEV_USE_WPMODE_PTE(PMBD) ((PMBD)->wpmode == 0)
+#define PMBD_DEV_USE_WPMODE_CR0(PMBD) ((PMBD)->wpmode == 1)
+
+#define PMBD_DEV_USE_EMULATION(PMBD) ((PMBD)->rdlat || (PMBD)->wrlat || (PMBD)->rdbw || (PMBD)->wrbw)
+#define PMBD_DEV_SIM_PMBD(PMBD) (PMBD_DEV_USE_EMULATION((PMBD)) && (PMBD)->simmode == 1)
+#define PMBD_DEV_SIM_DEV(PMBD) (PMBD_DEV_USE_EMULATION((PMBD)) && (PMBD)->simmode == 0)
+#define PMBD_DEV_USE_SLOWDOWN(PMBD) ((PMBD)->rdsx > 1 || (PMBD)->wrsx > 1)
+
+/* support functions */
+#define PMBD_MEM_TOTAL_SECTORS(PMBD) ((PMBD)->num_sectors)
+#define PMBD_MEM_TOTAL_BYTES(PMBD) ((PMBD)->num_sectors * (PMBD)->sector_size)
+#define PMBD_MEM_TOTAL_PAGES(PMBD) (((PMBD)->num_sectors) >> (PAGE_SHIFT - SECTOR_SHIFT))
+#define PMBD_MEM_SPACE_FIRST_BYTE(PMBD) ((PMBD)->mem_space)
+#define PMBD_MEM_SPACE_LAST_BYTE(PMBD) ((PMBD)->mem_space + PMBD_MEM_TOTAL_BYTES(PMBD) - 1)
+#define PMBD_CHECKSUM_TOTAL_NUM(PMBD) (PMBD_MEM_TOTAL_BYTES(PMBD) / (PMBD)->checksum_unit_size)
+#define PMBD_LOCK_TOTAL_NUM(PMBD) (PMBD_MEM_TOTAL_BYTES(PMBD) / (PMBD)->lock_unit_size)
+#define VADDR_IN_PMBD_SPACE(PMBD, ADDR) ((ADDR) >= PMBD_MEM_SPACE_FIRST_BYTE(PMBD) \
+ && (ADDR) <= PMBD_MEM_SPACE_LAST_BYTE(PMBD))
+
+#define BYTE_TO_PBN(PMBD, BYTES) ((BYTES) / (PMBD)->pb_size)
+#define PBN_TO_BYTE(PMBD, PBN) ((PBN) * (PMBD)->pb_size)
+#define SECTOR_TO_PBN(PMBD, SECT) (BYTE_TO_PBN((PMBD), SECTOR_TO_BYTE(SECT)))
+#define PBN_TO_SECTOR(PMBD, PBN) (BYTE_TO_SECTOR(PBN_TO_BYTE((PMBD), (PBN))))
+
+
+#define PMBD_CACHELINE_SIZE (64) /* FIXME: configure this machine by machine? (check x86_clflush_size)*/
+
+/* buffer related functions */
+#define CALLER_ALLOCATOR (0)
+#define CALLER_SYNCER (1)
+#define CALLER_DESTROYER (2)
+
+#define PMBD_BLOCK_VADDR(PMBD, PBN) ((PMBD)->mem_space + ((PMBD)->pb_size * (PBN)))
+#define PMBD_BLOCK_PBI(PMBD, PBN) ((PMBD)->pbi_space + (PBN))
+#define PMBD_TOTAL_PB_NUM(PMBD) (PMBD_MEM_TOTAL_BYTES(PMBD) / (PMBD)->pb_size)
+#define PMBD_BLOCK_IS_BUFFERED(PMBD, PBN) (PMBD_BLOCK_PBI((PMBD),(PBN))->bbn < PBN_TO_PMBD_BUFFER((PMBD), (PBN))->num_blocks)
+#define PMBD_SET_BLOCK_BUFFERED(PMBD, PBN, BBN) (PMBD_BLOCK_PBI((PMBD),(PBN))->bbn = (BBN))
+#define PMBD_SET_BLOCK_UNBUFFERED(PMBD, PBN) (PMBD_BLOCK_PBI((PMBD),(PBN))->bbn = PMBD_TOTAL_PB_NUM((PMBD)) + 3)
+//#define PMBD_SET_BLOCK_UNBUFFERED(PMBD, PBN) (PMBD_BLOCK_PBI((PMBD),(PBN))->bbn = PBN_TO_PMBD_BUFFER((PMBD), (PBN))->num_blocks + 1)
+
+#define PMBD_BUFFER_MIN_BUFSIZE (4) /* buffer size (in MBs) */
+#define PMBD_BUFFER_BLOCK(BUF, BBN) ((BUF)->buffer_space + (BUF)->pmbd->pb_size*(BBN))
+#define PMBD_BUFFER_BBI(BUF, BBN) ((BUF)->bbi_space + (BBN))
+#define PMBD_BUFFER_BBI_INDEX(BUF, ADDR) ((ADDR)-(BUF)->bbi_space)
+#define PMBD_BUFFER_SET_BBI_CLEAN(BUF, BBN) ((PMBD_BUFFER_BBI((BUF), (BBN)))->dirty = FALSE)
+#define PMBD_BUFFER_SET_BBI_DIRTY(BUF, BBN) ((PMBD_BUFFER_BBI((BUF), (BBN)))->dirty = TRUE)
+#define PMBD_BUFFER_BBI_IS_CLEAN(BUF, BBN) ((PMBD_BUFFER_BBI((BUF), (BBN)))->dirty == FALSE)
+#define PMBD_BUFFER_BBI_IS_DIRTY(BUF, BBN) ((PMBD_BUFFER_BBI((BUF), (BBN)))->dirty == TRUE)
+#define PMBD_BUFFER_SET_BBI_BUFFERED(BUF,BBN,PBN)((PMBD_BUFFER_BBI((BUF), (BBN)))->pbn = (PBN))
+#define PMBD_BUFFER_SET_BBI_UNBUFFERED(BUF, BBN) ((PMBD_BUFFER_BBI((BUF), (BBN)))->pbn = PMBD_TOTAL_PB_NUM((BUF)->pmbd) + 2)
+
+#define PMBD_BUFFER_FLUSH_HW_PCT (70) /* high watermark (percent of buffer blocks) */
+#define PMBD_BUFFER_FLUSH_LW_PCT (10) /* low watermark (percent of buffer blocks) */
+#define PMBD_BUFFER_IS_FULL(BUF) ((BUF)->num_dirty >= (BUF)->num_blocks)
+#define PMBD_BUFFER_IS_EMPTY(BUF) ((BUF)->num_dirty == 0)
+#define PMBD_BUFFER_ABOVE_HW(BUF) ((BUF)->num_dirty * 100 >= (BUF)->num_blocks * PMBD_BUFFER_FLUSH_HW_PCT)
+#define PMBD_BUFFER_BELOW_HW(BUF) ((BUF)->num_dirty * 100 < (BUF)->num_blocks * PMBD_BUFFER_FLUSH_HW_PCT)
+#define PMBD_BUFFER_ABOVE_LW(BUF) ((BUF)->num_dirty * 100 >= (BUF)->num_blocks * PMBD_BUFFER_FLUSH_LW_PCT)
+#define PMBD_BUFFER_BELOW_LW(BUF) ((BUF)->num_dirty * 100 < (BUF)->num_blocks * PMBD_BUFFER_FLUSH_LW_PCT)
+#define PMBD_BUFFER_BATCH_SIZE_DEFAULT (1024) /* the batch size for each flush */
+
+#define PMBD_BUFFER_NEXT_POS(BUF, POS) (((POS)==((BUF)->num_blocks - 1))? 0 : ((POS)+1))
+#define PMBD_BUFFER_PRIO_POS(BUF, POS) (((POS)== 0)? ((BUF)->num_blocks - 1) : ((POS)-1))
+#define PMBD_BUFFER_NEXT_N_POS(BUF,POS,N) (((POS)+(N))%((BUF)->num_blocks))
+#define PMBD_BUFFER_PRIO_N_POS(BUF,POS,N) (((POS)+(BUF)->num_blocks-(N))%((BUF)->num_blocks))
+
+/* high memory */
+#define PMBD_HIGHMEM_AVAILABLE_SPACE (g_highmem_virt_addr + g_highmem_size - g_highmem_curr_addr)
+
+/* emulation */
+#define MAX_SYNC_SLOWDOWN (10000000) /* use async_slowdown, if larger than 10ms */
+#define OVERHEAD_NANOSEC (100)
+#define PMBD_USLEEP(n) {set_current_state(TASK_INTERRUPTIBLE); \
+ schedule_timeout((n)*HZ/1000000);}
+
+/* statistics */
+#define PMBD_BATCH_MAX_SECTORS (4096) /* maximum data amount requested in a batch */
+#define PMBD_BATCH_MIN_SECTORS (256) /* minimum data amount requested in a batch */
+#define PMBD_BATCH_MAX_INTERVAL (1000000) /* maximum interval between two requests in a batch*/
+#define PMBD_BATCH_MAX_DURATION (10000000) /* maximum duration of a batch (ns)*/
+
+/* write protection*/
+#define VADDR_TO_PAGE(ADDR) ((ADDR) >> PAGE_SHIFT)
+#define PAGE_TO_VADDR(PAGE) ((PAGE) << PAGE_SHIFT)
+
+/* checksum */
+#define VADDR_TO_CHECKSUM_IDX(PMBD, ADDR) (((ADDR) - (PMBD)->mem_space) / (PMBD)->checksum_unit_size)
+#define CHECKSUM_IDX_TO_VADDR(PMBD, IDX) ((PMBD)->mem_space + (IDX) * (PMBD)->checksum_unit_size)
+#define CHECKSUM_IDX_TO_CKADDR(PMBD, IDX) ((PMBD)->checksum_space + (IDX))
+
+/* idle period timer */
+#define PMBD_BUFFER_FLUSH_IDLE_TIMEOUT (2000) /* idle timeout (in jiffies) */
+#define PMBD_DEV_UPDATE_ACCESS_TIME(PMBD) {spin_lock(&(PMBD)->pmbd_stat->stat_lock); \
+ (PMBD)->pmbd_stat->last_access_jiffies = jiffies; \
+ spin_unlock(&(PMBD)->pmbd_stat->stat_lock);}
+#define PMBD_DEV_GET_ACCESS_TIME(PMBD, T) {spin_lock(&(PMBD)->pmbd_stat->stat_lock); \
+ (T) = (PMBD)->pmbd_stat->last_access_jiffies; \
+ spin_unlock(&(PMBD)->pmbd_stat->stat_lock);}
+#define PMBD_DEV_IS_IDLE(PMBD, IDLE) ((IDLE) > PMBD_BUFFER_FLUSH_IDLE_TIMEOUT)
+
+/* Help info */
+#define USAGE_INFO \
+"\n\n\
+============================================\n\
+Intel Persistent Memory Block Driver (v0.9)\n\
+============================================\n\n\
+usage: $ modprobe pmbd mode=\"pmbd<#>;hmo<#>;hms<#>;[Option1];[Option2];[Option3];..\"\n\
+\n\
+GENERAL OPTIONS: \n\
+\t pmbd<#,#..> \t set PM block device size (GBs) \n\
+\t HM|VM \t\t use high memory (HM default) or vmalloc (VM) \n\
+\t hmo<#> \t high memory starting offset (GB) \n\
+\t hms<#> \t high memory size (GBs) \n\
+\t pmap<Y|N> \t use private mapping (Y) or not (N default) - (note: must enable HM and wrprotN) \n\
+\t nts<Y|N> \t use non-temporal store (MOVNTQ) and sfence to do memcpy (Y), or regular memcpy (N default)\n\
+\t wb<Y|N> \t use write barrier (Y) or not (N default)\n\
+\t fua<Y|N> \t use WRITE_FUA (Y default) or not (N) \n\
+\t ntl<Y|N> \t use non-temporal load (MOVNTDQA) to do memcpy (Y), or regular memcpy (N default) - this option enforces memory type of write combining\n\
+\n\
+SIMULATION: \n\
+\t simmode<#,#..> apply the simulated parameters to the whole device (0 default) or the PM space only (1)\n\
+\t rdlat<#,#..> \t set read access latency (ns) \n\
+\t wrlat<#,#..> \t set write access latency (ns)\n\
+\t rdbw<#,#..> \t set read bandwidth (MB/sec) (if set 0, no emulation) \n\
+\t wrbw<#,#..> \t set write bandwidth (MB/sec) (if set 0, no emulation) \n\
+\t rdsx<#,#..> \t set the relative slowdown (x) for read \n\
+\t wrsx<#,#..> \t set the relative slowdown (x) for write \n\
+\t rdpause<#,.> \t set a pause (cycles per 4KB) for each read\n\
+\t wrpause<#,.> \t set a pause (cycles per 4KB) for each write\n\
+\t adj<#> \t set an adjustment to the system overhead (nanoseconds) \n\
+\n\
+WRITE PROTECTION: \n\
+\t wrprot<Y|N> \t use write protection for PM pages? (Y or N)\n\
+\t wpmode<#,#,..> write protection mode: use the PTE change (0 default) or switch CR0/WP bit (1) \n\
+\t clflush<Y|N> \t use clflush to flush CPU cache for each write to PM space? (Y or N) \n\
+\t wrverify<Y|N> \t use write verification for PM pages? (Y or N) \n\
+\t checksum<Y|N> \t use checksum to protect PM pages? (Y or N)\n\
+\t bufsize<#,#,..> the buffer size (MBs) (0 - no buffer, at least 4MB)\n\
+\t bufnum<#> \t the number of buffers per PMBD device (default 16; at least 1 if buffering is used, 0 - no buffer) \n\
+\t bufstride<#> \t the number of contiguous 4KB blocks mapped to one buffer (bucket size for round-robin mapping; default 1024)\n\
+\t batch<#,#> \t the batch size (num of pages) for flushing PMBD device buffer (1 means no batching) \n\
+\n\
+MISC: \n\
+\t mgb<Y|N> \t mergeable? (Y or N) \n\
+\t lock<Y|N> \t lock the on-access page to serialize accesses? (Y or N) \n\
+\t cache<WB|WC|UC> use which CPU cache policy? Write back (WB), Write Combining (WC), or Uncachable (UC)\n\
+\t subupdate<Y|N> only update the changed cachelines of a page? (Y or N) (check PMBD_CACHELINE_SIZE) \n\
+\t timestat<Y|N> enable the detailed timing statistics (/proc/pmbd/pmbdstat)? (Y or N) (This will cause significant performance slowdown) \n\
+\n\
+NOTE: \n\
+\t (1) Option rdlat/wrlat only specifies the minimum access times. Real access times can be higher.\n\
+\t (2) If rdsx/wrsx is specified, the rdlat/wrlat/rdbw/wrbw would be ignored. \n\
+\t (3) Option simmode1 applies the simulated specification to the PM space, rather than the whole device, which may have buffer.\n\
+\n\
+WARNING: \n\
+\t (1) When using simmode1 to simulate slow-speed PM space, soft lockup warning may appear. Use \"nosoftlockup\" boot option to disable it.\n\
+\t (2) Enabling timestat may cause performance degradation.\n\
+\t (3) FUA is supported in PMBD, but if buffer is used (for PT-based protection), enabling FUA lowers performance due to double writes.\n\
+\t (4) No support for changing CPU cache related PTE attributes for VM-based PMBD (RCU stalls).\n\
+\n\
+PROC ENTRIES: \n\
+\t /proc/pmbd/pmbdcfg config info about the PMBD devices\n\
+\t /proc/pmbd/pmbdstat statistics of the PMBD devices (if timestat is enabled)\n\
+\n\
+EXAMPLE: \n\
+\t Assuming a 16GB PM space with physical memory addresses from 8GB to 24GB:\n\
+\t (1) Basic (Ramdisk): \n\
+\t $ sudo modprobe pmbd mode=\"pmbd16;hmo8;hms16;\"\n\n\
+\t (2) Protected (with private mapping): \n\
+\t $ sudo modprobe pmbd mode=\"pmbd16;hmo8;hms16;pmapY;\"\n\n\
+\t (3) Protected and synced (with private mapping, non-temp store): \n\
+\t $ sudo modprobe pmbd mode=\"pmbd16;hmo8;hms16;pmapY;ntsY;\"\n\n\
+\t (4) *** RECOMMENDED CONFIG *** \n\
+\t Protected, synced, and ordered (with private mapping, non-temp store, write barrier): \n\
+\t $ sudo modprobe pmbd mode=\"pmbd16;hmo8;hms16;pmapY;ntsY;wbY;\"\n\
+\n"
+
+/* functions */
+static inline void pmbd_set_pages_ro(PMBD_DEVICE_T* pmbd, void* addr, uint64_t bytes, unsigned on_access);
+static inline void pmbd_set_pages_rw(PMBD_DEVICE_T* pmbd, void* addr, uint64_t bytes, unsigned on_access);
+static inline void pmbd_clflush_range(PMBD_DEVICE_T* pmbd, void* dst, size_t bytes);
+static inline int pmbd_verify_wr_pages(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes);
+static int pmbd_checksum_on_write(PMBD_DEVICE_T* pmbd, void* vaddr, size_t bytes);
+static int pmbd_checksum_on_read(PMBD_DEVICE_T* pmbd, void* vaddr, size_t bytes);
+
+static inline int put_ulong(unsigned long arg, unsigned long val)
+{
+ return put_user(val, (unsigned long __user *)arg);
+}
+static inline int put_u64(unsigned long arg, u64 val)
+{
+ return put_user(val, (u64 __user *)arg);
+}
+
+static inline void mfence(void)
+{
+	asm volatile("mfence" ::: "memory");
+}
+
+static inline void sfence(void)
+{
+	asm volatile("sfence" ::: "memory");
+}
+
+#endif
+/* THE END */