[PATCH 1/1] PMFS: Add experimental Persistent Memory Block Driver

Nicholas Moulin nicholas.w.moulin at linux.intel.com
Thu May 9 15:08:09 EDT 2013


From: Nicholas Moulin <nicholas.w.moulin at linux.intel.com>

Initial version of PMBD, the persistent memory block driver.
This commit rebases the driver to Linux 3.9.

Signed-off-by: Nicholas Moulin <nicholas.w.moulin at intel.com>
---
 Documentation/blockdev/00-INDEX |    2 +
 Documentation/blockdev/pmbd.txt |  185 ++
 drivers/block/Kconfig           |   10 +
 drivers/block/Makefile          |    2 +
 drivers/block/pmbd.c            | 4541 +++++++++++++++++++++++++++++
 include/linux/pmbd.h            |  509 ++++

diff --git a/Documentation/blockdev/00-INDEX b/Documentation/blockdev/00-INDEX
index c08df56..2e8f5b2 100644
--- a/Documentation/blockdev/00-INDEX
+++ b/Documentation/blockdev/00-INDEX
@@ -16,3 +16,5 @@ paride.txt
 	- information about the parallel port IDE subsystem.
+pmbd.txt
+	- information about the Persistent Memory Block Driver.
 ramdisk.txt
 	- short guide on how to set up and use the RAM disk.
diff --git a/Documentation/blockdev/pmbd.txt b/Documentation/blockdev/pmbd.txt
new file mode 100644
index 0000000..244820f
--- /dev/null
+++ b/Documentation/blockdev/pmbd.txt
@@ -0,0 +1,185 @@
+===============================================================================
+              INTEL PERSISTENT MEMORY BLOCK DRIVER (PMBD) v0.9
+===============================================================================
+
+This software implements a block device driver for persistent memory (PM).
+This module provides a block-based logical interface to manage PM that is
+physically attached to the system memory bus. 
+
+The assumed architecture is as follows. Both DRAM and PM DIMMs are directly
+attached to the host memory bus. The PM space is presented to the operating
+system as a contiguous range of physical memory addresses at the high end.
+
+There are three major design considerations: (1) Data protection - private
+mapping is used to prevent stray pointers (from kernel/driver bugs) from
+accidentally wiping out persistent PM data. (2) Data persistence - non-temporal
+store and fence instructions are used to leverage the processor store buffer
+and avoid polluting the CPU cache. (3) Write ordering - write barriers are
+supported to ensure a correct ordering of writes.
+
+This module also includes other (experimental) features, such as PM speed
+emulation, checksums for page integrity, partial page updates, write
+verification, etc. Please refer to the module's help page.
+
+
+===============================================================================
+                 COMPILING AND INSTALLING THE PMBD DRIVER
+===============================================================================
+
+1. Compile the PMBD driver:
+
+   $ make
+
+2. Install the PMBD driver:
+
+   $ sudo make install
+
+3. Check available driver information:
+
+   $ modinfo pmbd
+
+===============================================================================
+                  QUICK USER'S GUIDE OF THE PMBD DRIVER
+===============================================================================
+
+1. Modify /etc/grub.conf to set the physical memory address range that
+   is to be simulated as PM.
+
+   Add the following to the boot option line:
+
+        memmap=<PM_SIZE_GB>G$<DRAM_SIZE_GB>G numa=off 
+
+   NOTE: 
+
+   PM_SIZE_GB - the PM space size (in GBs)
+   DRAM_SIZE_GB - the DRAM space size (in GBs)
+
+   Example: 
+
+   Assuming a total memory capacity of 24GB, if we want to use 16GB as PM
+   and 8GB as DRAM, the option should be "memmap=16G$8G".
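+
+   For instance, the resulting kernel line in /etc/grub.conf might look like
+   this (the kernel image and root device below are illustrative):
+
+        kernel /vmlinuz-3.9 ro root=/dev/sda1 memmap=16G$8G numa=off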
+   
+2. Reboot and check if the memory size is set as expected. 
+   
+   $ sudo reboot; exit
+   $ free
+
+3. Load the device driver module
+
+   Load the driver module into the kernel with private mapping, non-temporal
+   store, and write barrier enabled (*** RECOMMENDED CONFIG ***):
+
+   $ modprobe pmbd mode="pmbd<PM_SIZE_GB>;hmo<DRAM_SIZE_GB>;hms<PM_SIZE_GB>; \
+                        pmapY;ntsY;wbY;"
+
+   Check the kernel message output:
+
+   $ dmesg 
+   
+   After loading the module, a block device (/dev/pma) should appear. From
+   this point on, it can be used like any other block device, with tools
+   such as fdisk and mkfs.
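+
+   For example (an illustrative session; the partition and mount point
+   below are hypothetical):
+
+   $ fdisk /dev/pma
+   $ mkfs.ext4 /dev/pma1
+   $ mount /dev/pma1 /mnt/pmem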
+
+4. Unload the device driver
+
+   $ rmmod pmbd
+
+===============================================================================
+  OTHER CONFIGURATION OPTIONS OF THE PERSISTENT MEMORY DEVICE DRIVER MODULE
+===============================================================================
+
+usage: $ modprobe pmbd mode="pmbd<#>;hmo<#>;hms<#>;[Option1];[Option2];..."
+
+GENERAL OPTIONS:
+ pmbd<#,#..>     set pmbd size (GBs)
+ HM|VM           use high memory (HM default) or vmalloc (VM)
+ hmo<#>          high memory starting offset (GB)
+ hms<#>          high memory size (GBs)
+ pmap<Y|N>       use private mapping (Y) or not (N default) - (note: must
+                 enable HM and wrprotN)
+ nts<Y|N>        use non-temporal store (MOVNTDQ) and sfence to do memcpy (Y),
+                 or regular memcpy (N default)
+ wb<Y|N>         use write barrier (Y) or not (N default)
+ fua<Y|N>        use WRITE_FUA (Y default) or not (N) 
+ ntl<Y|N>        use non-temporal load (MOVNTDQA) to do memcpy (Y), or
+                 regular memcpy (N default) - this option enforces memory type 
+                 of write combining
+
+
+SIMULATION:
+ simmode<#,#..>  apply the simulated speed to the whole device (0 default) or
+                 to the PM space only (1)
+ rdlat<#,#..>    set read access latency (ns)
+ wrlat<#,#..>    set write access latency (ns)
+ rdbw<#,#..>     set read bandwidth (MB/sec)  (0 disables emulation)
+ wrbw<#,#..>     set write bandwidth (MB/sec) (0 disables emulation)
+ rdsx<#,#..>     set the relative slowdown (x) for read
+ wrsx<#,#..>     set the relative slowdown (x) for write
+ rdpause<#,#..>  set a pause (cycles per 4KB) for each read
+ wrpause<#,#..>  set a pause (cycles per 4KB) for each write
+ adj<#>          set an adjustment to the system overhead (nanoseconds)
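+
+ Example (illustrative): emulate a 16GB PM device (at physical addresses
+ 8-24GB) whose PM space is 2x slower than DRAM for reads and 5x for writes:
+
+   $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;simmode1;rdsx2;wrsx5;"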
+
+WRITE PROTECTION:
+ wrprot<Y|N>     use write protection for PM pages? (Y or N)
+ wpmode<#,#,..>  write protection mode: use the PTE change (0 default) or flip
+                 CR0/WP bit (1)
+ clflush<Y|N>    use clflush to flush CPU cache for each write to PM space?
+                 (Y or N)
+ wrverify<Y|N>   use write verification for PM pages? (Y or N)
+ checksum<Y|N>   use checksum to protect PM pages? (Y or N)
+ bufsize<#,#,..> the buffer size (MBs) (0 - no buffer, at least 4MB)
+ bufnum<#>       the number of buffers for a PMBD device (default: 16; at
+                 least 1 if buffering is used, 0 - no buffer)
+ bufstride<#>    the number of contiguous blocks (4KB) mapped into one buffer
+                 (bucket size for round-robin mapping) (default: 1024)
+ batch<#,#>      the batch size (num of pages) for flushing PMBD buffer (1 means
+                 no batching)
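+
+ Example (illustrative): a write-protected 16GB device with a 64MB DRAM
+ buffer split into 16 sub-buffers:
+
+   $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;wrprotY;bufsize64;bufnum16;"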
+
+MISC:
+ mgb<Y|N>        mergeable? (Y or N)
+ lock<Y|N>       lock the on-access page to serialize accesses? (Y or N)
+ cache<WB|WC|UC> use which CPU cache policy? Write Back (WB), Write Combined
+                 (WC), or Uncacheable (UC)
+ subupdate<Y|N>  only update the changed cachelines of a page? (Y or N) (check
+                 PMBD_CACHELINE_SIZE)
+ timestat<Y|N>   enable the detailed timing statistics (/proc/pmbd/pmbdstat)?
+                 This will cause significant performance slowdown (Y or N)
+
+NOTE:
+ (1) Options rdlat/wrlat only specify the minimum access times. Real access
+     times can be higher.
+ (2) If rdsx/wrsx is specified, rdlat/wrlat/rdbw/wrbw are ignored.
+ (3) Option simmode1 applies the simulated speed to the PM space only,
+     rather than to the whole device, which may include a buffer.
+
+WARNING:
+ (1) When using simmode1 to simulate slow-speed PM space, soft lockup warnings
+     may appear. Use the "nosoftlockup" boot option to disable them.
+ (2) Enabling timestat may cause performance degradation.
+ (3) FUA is supported, but if a buffer is used (for PTE-based
+     protection), enabling FUA lowers performance due to double writes.
+ (4) Changing CPU-cache-related PTE attributes is not supported for VM-based
+     PMBD devices (it causes RCU stalls).
+
+PROC ENTRIES:
+ /proc/pmbd/pmbdcfg:   config info about the PMBD devices
+ /proc/pmbd/pmbdstat:  statistics of the PMBD devices (if timestat is enabled)
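+
+ For example, after loading the module, the active configuration can be
+ inspected with:
+
+   $ cat /proc/pmbd/pmbdcfg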
+
+EXAMPLE:
+ Assuming a 16GB PM space with physical memory addresses from 8GB to 24GB:
+ (1) Basic (Ramdisk): 
+     $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;"
+
+ (2) Protected (with private mapping): 
+     $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;pmapY;"
+
+ (3) Protected and synced (with private mapping, non-temp store): 
+     $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;pmapY;ntsY;"
+
+ (4) *** RECOMMENDED CONFIGURATION ***
+     Protected, synced, and ordered (with private mapping, nt-store, write
+     barrier): 
+     $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;pmapY;ntsY;wbY;"
+
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index b81ddfe..47dbb6d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -540,5 +540,15 @@ config BLK_DEV_RSXX
 
 	  To compile this driver as a module, choose M here: the
 	  module will be called rsxx.
+
+config BLK_DEV_PMBD
+	tristate "Persistent Memory Block Driver"
+	depends on m
+	default n
+	help
+	  Say M here if you want to include the Persistent Memory Block Driver.
+
+	  If unsure, say N.
 
 endif # BLK_DEV
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index a3b4023..6ac1cbe 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -42,4 +42,6 @@ obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX)	+= mtip32xx/
 
 obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
 
+obj-$(CONFIG_BLK_DEV_PMBD) += pmbd.o
+
 swim_mod-y	:= swim.o swim_asm.o
diff --git a/drivers/block/pmbd.c b/drivers/block/pmbd.c
new file mode 100644
index 0000000..62d61f7
--- /dev/null
+++ b/drivers/block/pmbd.c
@@ -0,0 +1,4541 @@
+/*
+ * Intel Persistent Memory Block Driver
+ * Copyright (c) <2011-2013>, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+/* 
+ * Intel Persistent Memory Block Driver (v0.9)
+ *
+ * Parts derived with changes from drivers/block/brd.c, lib/crc32.c, and
+ * arch/x86/lib/mmx_32.c
+ *
+ * Intel Corporation <linux-pmbd at intel.com>
+ * 03/24/2011
+ *
+ * Authors
+ * 2013 - Released the open-source version 0.9 (fchen)
+ * 2012 - Ported to Linux 3.2.1 (fchen)
+ * 2011 - Feng Chen (Intel) implemented version 1 of PMBD for Linux 2.6.34.
+ */
+
+
+/*
+ *******************************************************************************
+ * Persistent Memory Block Device Driver
+ *
+ * USAGE:
+ *  % sudo modprobe pmbd mode="pmbd<#>;hmo<#>;hms<#>;[OPTION1];[OPTION2];..."
+ *
+ * GENERAL OPTIONS:
+ *  - pmbd<#,..>:    a sequence of integer numbers setting PMBD device sizes (in
+ *                   units of GBs). For example, mode="pmbd4,1" means creating a
+ *                   4GB and a 1GB PMBD device (/dev/pma and /dev/pmb).
+ *
+ *  - HM|VM:         choose between two types of PMBD devices
+ *                   - VM:  vmalloc() based 
+ *                   - HM:  HIGH_MEM based (default)
+ *                   - In /boot/grub/grub.conf, add "mem=<n>G memmap=<m>G$<n>G" 
+ *                    to reserve the high m GBs for PM, starting from offset n 
+ *                    GBs in physical memory
+ *
+ *  - hmo<#>:        if HM is set, set the starting physical memory address
+ *                   (in units of GBs)
+ *
+ *  - hms<#>:        if HM is set, set the remapped memory size (in GBs)
+ *
+ *  - pmap<Y|N>      use private mapping (Y) or not (N default), using
+ *                   pmap_atomic_pfn() to dynamically map/unmap the
+ *                   to-be-accessed PM page for protection purposes.
+ *                   This option requires HM. In the Linux boot options,
+ *                   the "mem" option must be removed.
+ *
+ *  - nts<Y|N>       set non-temporal store/sfence (Y) or not (N default). 
+ *
+ *  - wb<Y|N>:       use write barrier (Y) or not (N default)
+ *
+ *  - fua<Y|N>       use WRITE_FUA (Y default) or not (N)
+ *  			 FUA with PT-based protection (with buffer) incurs
+ *  			 double-write overhead
+ *
+ * SIMULATION OPTIONS:
+ *
+ *  - simmode<#,#..> set the simulation mode for each PMBD device
+ *                   - 0 for simulating the whole device 
+ *                   - 1 for simulating the PM space only
+ *                   Note that simulating the PM space may cause system
+ *                   warnings of soft lockup. To disable them, add
+ *                   "nosoftlockup" to the boot options.
+ *
+ *  - rdlat<#,#..>:  a sequence of integer numbers setting emulated read 
+ *                   latencies (in units of nanoseconds) for reading each 
+ *                   sector. Each number corresponds to a device. Default
+ *                   value is 0. 
+ *
+ *  - wrlat<#,#..>:  set emulated write access latencies (see rdlat)
+ *
+ *  - rdbw<#,#..>:   a sequence of integer numbers setting emulated read 
+ *                   bandwidth (in units of MB/sec) for reading each sector. 
+ *                   Each number corresponds to a device. Default value is 0.
+ *
+ *  - wrbw<#,#..>:   set emulated write bandwidth (see rdbw)
+ *
+ *  - rdsx<#,#..>:   set the slowdown ratio (x) for reads as compared to DRAM
+ *
+ *  - wrsx<#,#..>:   set the slowdown ratio (x) for writes as compared to DRAM
+ *
+ *  - rdpause<#,#..>: set the injected delay (cycles per page) for read (not
+ *                   for emulation, just inject latencies 
+ *                   for each read per page)
+ *
+ *  - wrpause<#,#..>: set the injected delay (cycles per page) for write
+ *  		     (not for emulation, just inject latencies for
+ *  		      each write per page).
+ *
+ *  - adj<#>:        offset the measured latency by the estimated system
+ *  		     overhead. Default is 4us; however, this can vary from
+ *  		     system to system.
+ *
+ * WRITE PROTECTION:
+ *
+ *  - wrprot<Y|N>:   provide write protection on PM space by setting page
+ *                   read-only (default: N).
+ *                   This option is incompatible with pmap.
+ *
+ *  - wpmode<#,#,..> write protection mode: use the PTE change (0 default) or
+ *                   switch CR0/WP bit (1)
+ *
+ *  - wrverify<Y|N>: read out the data for verification after writing into PM
+ *                   space
+ *
+ *  - clflush<Y|N>:  flush CPU cache or not (default: N) 
+ *
+ *  - checksum<Y|N>: use checksum to provide further protection from data
+ *                   corruption (default: N)
+ *
+ *  - lock<Y|N>:     lock the on-access PM page to serialize accesses
+ *  			 (default: Y)
+ *
+ *  - bufsize<#,#,#.#...>  -- the buffer size in MBs (for speeding up write
+ *                   protection) 0 means no buffer, minimum size is 16 MBs
+ *
+ *  - bufnum<#>      the number of buffers for a pmbd device (16 buffers, at
+ *                   least 1 if using buffering, 0 will disable buffer mode)
+ *
+ *  - bufstride<#>   the number of contiguous blocks(4KB) mapped into one
+ *                   buffer (the bucket size for round-robin mapping)
+ *                   (1024 in default)
+ *
+ *  - batch<#,#>     the batch size (num of pages) for flushing PMBD buffer (1
+ *                   means no batching)
+ *
+ * MISC OPTIONS:
+ *
+ *  - subupdate<Y|N> only update changed cachelines of a page (check
+ *                   PMBD_CACHELINE_SIZE, default: N)
+ *
+ *  - mgb<Y|N>:      setting mergeable or not (default: Y)
+ *
+ *  - cache<WB|WC|UM|UC>:
+ *  		     WB -- write back (fully cacheable)
+ *  		     WC -- write combining (writes buffered, reads uncached)
+ *  		     UM -- uncached minus (UC-)
+ *  		     UC -- strongly uncacheable
+ *			 Changing CPU cache flags is not supported
+ *			 with vmalloc() based PMBD
+ *
+ *  - timestat<Y|N> enable the detailed timing statistics (/proc/pmbd/pmbdstat) or
+ *                  not (default: N). This will cause significant performance loss. 
+ *
+ * EXAMPLE:
+ *  mode="pmbd2,1;rdlat100,2000;wrlat500,4000;rdbw100,100;wrbw100,100;HM;hmo4;hms3;
+ *  mgbY;flushY;cacheWB;wrprotY;wrverifyY;checksumY;lockY;rammode0,1;bufsize16,0;
+ *  subupdateY;"
+ *
+ *  Explanation: Create two PMBD devices, /dev/pma (2GB) and /dev/pmb (1GB).
+ *  Insert 100ns and 500ns for reading and writing a sector to /dev/pma,
+ *  respectively.  Insert 2000ns and 4000ns for reading and writing a sector
+ *  to /dev/pmb.  Make the read/write bandwidth for both devices 100MB/sec.
+ *  No system overhead adjustment is applied.  We use 3GB high memory for the
+ *  PMBD devices, starting from 4GB physical memory address. Make it
+ *  mergeable, use writeback and flush CPU cache for the PM space, use write
+ *  protection for PM space by setting PM space read-only, verify each
+ *  write by reading out written data, use checksum to protect PM space, use
+ *  spinlock to protect from corruption caused by concurrent accesses, the
+ *  first device is applied without write protection, the second device is
+ *  applied with write protection, and use sub-page updates.
+ *
+ * NOTE:
+ *  - We can create no more than 26 devices, 4 partitions each. 
+ *
+ * FIXME: 
+ *  (1) We use an unoccupied major device num (261) temporarily
+ *******************************************************************************
+ */
+
+#include <linux/init.h>
+#include <linux/version.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/major.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <asm/uaccess.h>
+#include <linux/time.h>
+#include <asm/timer.h>
+#include <linux/cpufreq.h>
+#include <linux/crc32.h>
+#include <linux/string.h>
+#include <linux/ctype.h>
+#include <linux/kthread.h>
+#include <linux/sort.h>
+#include <linux/timex.h>
+#include <linux/proc_fs.h>
+#include <asm/tlbflush.h>
+#include <asm/i387.h>
+#include <asm/asm.h>
+#include <linux/pmbd.h>
+#include <linux/delay.h>
+
+/* device configs  */
+static int max_part = 4;	/* maximum num of partitions */
+static int part_shift = 0;	/* partition shift */
+static LIST_HEAD(pmbd_devices);	/* device list */
+static DEFINE_MUTEX(pmbd_devices_mutex); /* device mutex */
+
+/* /proc file system entry */
+static struct proc_dir_entry* proc_pmbd = NULL;
+static struct proc_dir_entry* proc_pmbdstat = NULL;
+static struct proc_dir_entry* proc_pmbdcfg = NULL;
+
+/* pmbd device default configuration */
+static unsigned g_pmbd_type 		= PMBD_CONFIG_HIGHMEM;	/* vmalloc(PMBD_CONFIG_VMALLOC) or reserve highmem (PMBD_CONFIG_HIGHMEM default) */
+static unsigned g_pmbd_pmap		= FALSE;		/* use pmap_atomic() to map/unmap space on demand  */
+static unsigned g_pmbd_nts		= FALSE;		/* use non-temporal store (movntq) */
+static unsigned g_pmbd_wb		= FALSE;		/* use write barrier */
+static unsigned g_pmbd_fua		= TRUE;			/* use fua support (Linux 3.2.1) */
+static unsigned g_pmbd_mergeable 	= TRUE;			/* mergeable or not  */
+static unsigned g_pmbd_cpu_cache_clflush= FALSE;	 	/* flush CPU cache or not*/
+static unsigned g_pmbd_wr_protect	= FALSE;		/* flip PTE R/W bits for write protection */
+static unsigned g_pmbd_wr_verify	= FALSE;		/* read out written data for verification */
+static unsigned g_pmbd_checksum		= FALSE;		/* do checksum on PM data */
+static unsigned g_pmbd_lock		= TRUE;			/* do spinlock on accessing a PM page */
+static unsigned g_pmbd_subpage_update	= FALSE;		/* do subpage update (only write changed content) */
+static unsigned g_pmbd_timestat		= FALSE;		/* do a detailed timestamp breakdown statistics */
+static unsigned g_pmbd_ntl		= FALSE;		/* use non-temporal load (movntdqa)*/
+static unsigned long g_pmbd_cpu_cache_flag = _PAGE_CACHE_WB;	/* CPU cache flag (default - write back) */
+
+/* high memory configs */
+static unsigned long 	g_highmem_size = 0; 			/* size of the reserved physical mem space (bytes) */
+static phys_addr_t 	g_highmem_phys_addr = 0;		/* beginning of the reserved phy mem space (bytes)*/
+static void* 		g_highmem_virt_addr = NULL;		/* beginning of the reserve HIGH_MEM space */
+static void* 		g_highmem_curr_addr = NULL;		/* beginning of the available HIGH_MEM space for alloc*/ 
+
+/* module parameters */
+static unsigned g_pmbd_nr = 0;					/* num of PMBD devices */
+static unsigned long long g_pmbd_size[PMBD_MAX_NUM_DEVICES];	/* PMBD device sizes in units of GBs */
+static unsigned long long g_pmbd_rdlat[PMBD_MAX_NUM_DEVICES]; 	/* access latency for read (nanosecs) */
+static unsigned long long g_pmbd_wrlat[PMBD_MAX_NUM_DEVICES]; 	/* access latency for write (nanosecs) */
+static unsigned long long g_pmbd_rdbw[PMBD_MAX_NUM_DEVICES]; 	/* bandwidth for read (MB/sec) */
+static unsigned long long g_pmbd_wrbw[PMBD_MAX_NUM_DEVICES]; 	/* bandwidth for write (MB/sec)*/
+static unsigned long long g_pmbd_rdsx[PMBD_MAX_NUM_DEVICES]; 	/* read slowdown (x) */
+static unsigned long long g_pmbd_wrsx[PMBD_MAX_NUM_DEVICES]; 	/* write slowdown (x)*/
+static unsigned long long g_pmbd_rdpause[PMBD_MAX_NUM_DEVICES];	/* read pause (cycles per page) */
+static unsigned long long g_pmbd_wrpause[PMBD_MAX_NUM_DEVICES];	/* write pause (cycles per page)*/
+static unsigned long long g_pmbd_simmode[PMBD_MAX_NUM_DEVICES];	/* simulating PM space (1) or the whole device (0 default) */
+static unsigned long long g_pmbd_adjust_ns = 0;			/* nanosec of adjustment to offset system overhead */
+static unsigned long long g_pmbd_rammode[PMBD_MAX_NUM_DEVICES];	/* do write optimization or not */
+static unsigned long long g_pmbd_bufsize[PMBD_MAX_NUM_DEVICES];	/* the buffer size (in MBs) */
+static unsigned long long g_pmbd_buffer_batch_size[PMBD_MAX_NUM_DEVICES]; /* the batch size (num of pages) for flushing PMBD buffer */
+static unsigned long long g_pmbd_wpmode[PMBD_MAX_NUM_DEVICES];	/* write protection mode: PTE change (0 default) and CR0 Switch (1)*/
+
+static unsigned long long g_pmbd_num_buffers = 0;		/* number of individual buffers */
+static unsigned long long g_pmbd_buffer_stride = 1024;		/* number of contiguous PBNs belonging to the same buffer */
+
+/* forward declarations */
+static inline uint64_t cycle_to_ns(uint64_t cycle);
+static inline void sync_slowdown_cycles(uint64_t cycles);
+static uint64_t emul_start(PMBD_DEVICE_T* pmbd, int num_sectors, int rw);
+static uint64_t emul_end(PMBD_DEVICE_T* pmbd, int num_sectors, int rw, uint64_t start);
+
+/*
+ * *************************************************************************
+ * parse module parameters functions
+ * *************************************************************************
+ */
+static char *mode = "";
+module_param(mode, charp, 0444);
+MODULE_PARM_DESC(mode, USAGE_INFO);
+
+/* print pmbd configuration info */
+static void pmbd_print_conf(void)
+{
+	int i;
+#ifndef CONFIG_X86
+	printk(KERN_INFO "pmbd: running on a non-x86 platform, check ioremap()...\n");
+#endif
+	printk(KERN_INFO "pmbd: cacheline_size=%d\n", PMBD_CACHELINE_SIZE);
+	printk(KERN_INFO "pmbd: PMBD_SECTOR_SIZE=%lu, PMBD_PAGE_SIZE=%lu\n", PMBD_SECTOR_SIZE, PMBD_PAGE_SIZE);
+	printk(KERN_INFO "pmbd: g_pmbd_type = %s\n", PMBD_USE_VMALLOC()? "VMALLOC" : "HIGH_MEM");
+	printk(KERN_INFO "pmbd: g_pmbd_mergeable = %s\n", PMBD_IS_MERGEABLE()? "YES" : "NO");
+	printk(KERN_INFO "pmbd: g_pmbd_cpu_cache_clflush = %s\n", PMBD_USE_CLFLUSH()? "YES" : "NO");
+	printk(KERN_INFO "pmbd: g_pmbd_cpu_cache_flag = %s\n", PMBD_CPU_CACHE_FLAG());
+	printk(KERN_INFO "pmbd: g_pmbd_wr_protect = %s\n", PMBD_USE_WRITE_PROTECTION()? "YES" : "NO");
+	printk(KERN_INFO "pmbd: g_pmbd_wr_verify = %s\n", PMBD_USE_WRITE_VERIFICATION()? "YES" : "NO");
+	printk(KERN_INFO "pmbd: g_pmbd_checksum = %s\n", PMBD_USE_CHECKSUM()? "YES" : "NO");
+	printk(KERN_INFO "pmbd: g_pmbd_lock = %s\n", PMBD_USE_LOCK()? "YES" : "NO");
+	printk(KERN_INFO "pmbd: g_pmbd_subpage_update = %s\n", PMBD_USE_SUBPAGE_UPDATE()? "YES" : "NO");
+	printk(KERN_INFO "pmbd: g_pmbd_adjust_ns = %llu ns\n", g_pmbd_adjust_ns);
+	printk(KERN_INFO "pmbd: g_pmbd_num_buffers = %llu\n", g_pmbd_num_buffers);
+	printk(KERN_INFO "pmbd: g_pmbd_buffer_stride = %llu blocks\n", g_pmbd_buffer_stride);
+	printk(KERN_INFO "pmbd: g_pmbd_timestat = %u \n", g_pmbd_timestat);
+	printk(KERN_INFO "pmbd: HIGHMEM offset [%llu] size [%lu] Private Mapping (%s) (%s) (%s) Write Barrier(%s) FUA(%s)\n", 
+			g_highmem_phys_addr, g_highmem_size, (PMBD_USE_PMAP()? "Enabled" : "Disabled"), 
+			(PMBD_USE_NTS()? "Non-Temporal Store":"Temporal Store"),	
+			(PMBD_USE_NTL()? "Non-Temporal Load":"Temporal Load"),	
+			(PMBD_USE_WB()? "Enabled": "Disabled"),
+			(PMBD_USE_FUA()? "Enabled":"Disabled"));
+
+	/* for each pmbd device */
+	for (i = 0; i < g_pmbd_nr; i ++) {
+		printk(KERN_INFO "pmbd: /dev/pm%c (%d)[%llu GB] read[%llu ns %llu MB/sec (%llux) (pause %llu cyc/pg)] write[%llu ns %llu MB/sec (%llux) (pause %llu cyc/pg)] [%s] [Buf: %llu MBs, batch %llu pages] [%s] [%s]\n", 
+			'a'+i, i, g_pmbd_size[i], g_pmbd_rdlat[i], g_pmbd_rdbw[i], g_pmbd_rdsx[i], g_pmbd_rdpause[i], g_pmbd_wrlat[i], g_pmbd_wrbw[i], g_pmbd_wrsx[i], g_pmbd_wrpause[i],\
+			(g_pmbd_rammode[i] ? "RAM" : "PMBD"), g_pmbd_bufsize[i], g_pmbd_buffer_batch_size[i], \
+			(g_pmbd_simmode[i] ? "Simulating PM only" : "Simulating the whole device"), \
+			(PMBD_USE_PMAP() ? "PMAP" : (g_pmbd_wpmode[i] ? "WP-CR0/WP" : "WP-PTE")));
+
+		if (g_pmbd_simmode[i] > 0){
+			printk(KERN_INFO "pmbd: ********************************* WARNING **************************************\n");
+			printk(KERN_INFO "pmbd: Using simmode%llu to simulate a slowed-down PM space may cause system soft lockup.\n", g_pmbd_simmode[i]);
+			printk(KERN_INFO "pmbd: To disable the warning message, please add \"nosoftlockup\" in the boot option. \n");
+			printk(KERN_INFO "pmbd: ********************************************************************************\n");
+		}
+	}
+
+	printk(KERN_INFO "pmbd: ****************************** WARNING ***********************************\n");
+	printk(KERN_INFO "pmbd: 1. Checksum mismatch can be detected but not handled \n");
+	printk(KERN_INFO "pmbd: 2. PMAP is incompatible with \"wrprotY\"\n");
+	printk(KERN_INFO "pmbd: **************************************************************************\n");
+
+	return;
+}
+
+/*
+ * Parse a string with config for multiple devices (e.g. mode="pmbd4,1,3;")
+ * @mode: input option string
+ * @tag:  the tag being looked for (e.g. pmbd)
+ * @data: output in an array
+ */
+static int _pmbd_parse_multi(char* mode, char* tag, unsigned long long data[])
+{
+	int nr = 0;
+	if (strlen(mode)) {
+	       	char* head = mode;
+        	char* tail = mode;
+		char* end  = mode + strlen(mode);
+       		char tmp[128];
+	
+        	if ((head = strstr(mode, tag))) {
+	        	head = head + strlen(tag);
+		        tail = head;
+			while(head < end){
+	                	int len = 0;
+
+				/* locate the position of the first non-number char */
+				for(tail = head; IS_DIGIT(*tail) && tail < end; tail++) {};
+
+				/* pick up the numbers */
+	                	len = tail - head;
+				if(len > 0) {
+					nr ++;
+					if (nr > PMBD_MAX_NUM_DEVICES) {
+						printk(KERN_ERR "pmbd: %s(%d) - too many (%d) device config for %s\n", 
+							__FUNCTION__, __LINE__, nr, tag);
+						return -1;
+					}
+					/* bound the copy so we never overflow tmp[] */
+					len = MIN_OF(len, (int) sizeof(tmp) - 1);
+					strncpy(tmp, head, len); tmp[len] = '\0';
+					data[nr - 1] = simple_strtoull(tmp, NULL, 0);
+				} 
+
+				/* check the next sequence of numbers */
+				for(; !IS_DIGIT(*tail) && tail < end; tail++) {
+					/* if we meet the first alpha char or space, clause ends */
+					if(IS_ALPHA(*tail) || IS_SPACE(*tail))
+						goto done;
+				};
+
+				/* move head to the next sequence of numbers */
+				head = tail;
+			}
+		}
+	}
+done:
+	return nr;
+}
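+
+/*
+ * Example (illustrative): with mode="pmbd4,1;rdlat100,2000;" and tag "pmbd",
+ * _pmbd_parse_multi() fills data[0]=4 and data[1]=1 and returns 2; with tag
+ * "rdlat", it fills data[0]=100 and data[1]=2000.
+ */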
+
+/*
+ * Parse a string with config for all devices (e.g. mode="adj1000")
+ * @mode: input option string
+ * @tag:  the tag being looked for (e.g. pmbd)
+ * @data: output 
+ */
+static int _pmbd_parse_single(char* mode, char* tag, unsigned long long* data)
+{
+	if (strlen(mode)) {
+	       	char* head = mode;
+        	char* tail = mode;
+       		char tmp[128];
+
+		if (strstr(mode, tag)) {
+			head = strstr(mode, tag) + strlen(tag);
+			for(tail=head; IS_DIGIT(*tail); tail++) {};
+			if(tail == head) {
+				return -1;
+			} else {
+				int len = MIN_OF(tail - head, (long) sizeof(tmp) - 1); /* avoid overflowing tmp[] */
+				strncpy(tmp, head, len); tmp[len] = '\0';
+				*data = simple_strtoull(tmp, NULL, 0);
+			}
+		} 
+	}
+	return 0;
+}
+
+static void load_default_conf(void)
+{
+	int i = 0;
+	for (i = 0; i < PMBD_MAX_NUM_DEVICES; i ++) 
+		g_pmbd_buffer_batch_size[i] = PMBD_BUFFER_BATCH_SIZE_DEFAULT;
+}
+
+/* parse the module parameters (mode) */
+static void pmbd_parse_conf(void)
+{
+	int i = 0;
+	static unsigned enforce_cache_wc = FALSE;
+
+	load_default_conf();
+
+	if (strlen(mode)) {
+		unsigned long long data = 0;
+
+		/* check pmbd size/usable */
+		if (strstr(mode, "pmbd")) {
+			if( (g_pmbd_nr = _pmbd_parse_multi(mode, "pmbd", g_pmbd_size)) <= 0)
+				goto fail;
+		} else {
+			printk(KERN_ERR "pmbd: no pmbd size set\n");
+			goto fail;
+		}
+		
+		/* rdlat/wrlat (emulated read/write latency) in nanosec */
+		if (strstr(mode, "rdlat"))
+			if (_pmbd_parse_multi(mode, "rdlat", g_pmbd_rdlat) < 0)
+				goto fail;
+		if (strstr(mode, "wrlat")) 
+			if (_pmbd_parse_multi(mode, "wrlat", g_pmbd_wrlat) < 0)
+				goto fail;
+
+		/* rdbw/wrbw (emulated read/write bandwidth) in MB/sec*/
+		if (strstr(mode, "rdbw"))
+			if (_pmbd_parse_multi(mode, "rdbw", g_pmbd_rdbw) < 0)
+				goto fail;
+		if (strstr(mode, "wrbw")) 
+			if (_pmbd_parse_multi(mode, "wrbw", g_pmbd_wrbw) < 0)
+				goto fail;
+
+		/* rdsx/wrsx (emulated read/write slowdown X) */
+		if (strstr(mode, "rdsx"))
+			if (_pmbd_parse_multi(mode, "rdsx", g_pmbd_rdsx) < 0)
+				goto fail;
+		if (strstr(mode, "wrsx")) 
+			if (_pmbd_parse_multi(mode, "wrsx", g_pmbd_wrsx) < 0)
+				goto fail;
+
+		/* rdpause/wrpause (injected read/write delays) */
+		if (strstr(mode, "rdpause"))
+			if (_pmbd_parse_multi(mode, "rdpause", g_pmbd_rdpause) < 0)
+				goto fail;
+		if (strstr(mode, "wrpause")) 
+			if (_pmbd_parse_multi(mode, "wrpause", g_pmbd_wrpause) < 0)
+				goto fail;
+
+		/* rammode (option removed) */
+		if (strstr(mode, "rammode")){
+			printk(KERN_ERR "pmbd: rammode removed\n");
+			goto fail;
+		}
+
+		if (strstr(mode, "bufsize")){
+			if (_pmbd_parse_multi(mode, "bufsize", g_pmbd_bufsize) < 0)
+				goto fail;
+			for (i = 0; i < PMBD_MAX_NUM_DEVICES; i ++) {
+				if (g_pmbd_bufsize[i] > 0 && g_pmbd_bufsize[i] < PMBD_BUFFER_MIN_BUFSIZE){
+					printk(KERN_ERR "pmbd: bufsize cannot be smaller than %d MBs. Set 0 to disable the PMBD buffer.\n", PMBD_BUFFER_MIN_BUFSIZE);
+					goto fail;
+				}
+			}
+		}
+
+		/* numbuf and bufstride*/
+		if (strstr(mode, "bufnum")) { 
+			if(_pmbd_parse_single(mode, "bufnum", &data) < 0) {
+				printk(KERN_ERR "pmbd: incorrect bufnum (must be at least 1)\n");
+				goto fail;
+			} else {
+				g_pmbd_num_buffers = data;
+			}
+		}
+		if (strstr(mode, "bufstride")) { 
+			if(_pmbd_parse_single(mode, "bufstride", &data) < 0) {
+				printk(KERN_ERR "pmbd: incorrect bufstride (must be at least 1)\n");
+				goto fail;
+			} else {
+				g_pmbd_buffer_stride = data;
+			}
+		}
+
+		/* check the nanoseconds of overhead to compensate */
+		if (strstr(mode, "adj")) { 
+			if(_pmbd_parse_single(mode, "adj", &data) < 0) {
+				printk(KERN_ERR "pmbd: incorrect adj\n");
+				goto fail;
+			} else {
+				g_pmbd_adjust_ns = data;
+			}
+		}
+
+		/* check PMBD device type */
+		if ((strstr(mode, "VM"))) {
+			g_pmbd_type = PMBD_CONFIG_VMALLOC;
+		} else if ((strstr(mode, "HM"))) {
+			g_pmbd_type = PMBD_CONFIG_HIGHMEM;
+		}
+
+		/* use pmap*/
+		if ((strstr(mode, "pmapY"))) {
+			g_pmbd_pmap = TRUE;
+		} else if ((strstr(mode, "pmapN"))) {
+			g_pmbd_pmap = FALSE;
+		} 
+		if ((strstr(mode, "PMAP"))){
+			printk("WARNING: !!! pmbd: PMAP is not supported any more (use pmapY) !!!\n");
+			goto fail;
+		}
+
+		/* use nts*/
+		if ((strstr(mode, "ntsY"))) {
+			g_pmbd_nts = TRUE;
+		} else if ((strstr(mode, "ntsN"))) {
+			g_pmbd_nts = FALSE;
+		}
+		if ((strstr(mode, "NTS"))){
+			printk("WARNING: !!! pmbd: NTS is not supported any more (use ntsY) !!!\n");
+			goto fail;
+		}
+
+		/* use ntl*/
+		if ((strstr(mode, "ntlY"))) {
+			g_pmbd_ntl = TRUE;
+			enforce_cache_wc = TRUE;
+		} else if ((strstr(mode, "ntlN"))) {
+			g_pmbd_ntl = FALSE;
+		}
+
+		/* timestat */
+		if ((strstr(mode, "timestatY"))) {
+			g_pmbd_timestat = TRUE;
+		} else if ((strstr(mode, "timestatN"))) {
+			g_pmbd_timestat = FALSE;
+		}
+
+
+		/* write barrier */
+		if ((strstr(mode, "wbY"))) {
+			g_pmbd_wb = TRUE;
+		} else if ((strstr(mode, "wbN"))) {
+			g_pmbd_wb = FALSE;
+		}
+
+		/* FUA */
+		if ((strstr(mode, "fuaY"))) {
+			g_pmbd_fua = TRUE;
+		} else if ((strstr(mode, "fuaN"))) {
+			g_pmbd_fua = FALSE;
+		}
+
+
+		/* check if HIGH_MEM PMBD is configured */
+		if (PMBD_USE_HIGHMEM()) { 
+			if (strstr(mode, "hmo") && strstr(mode, "hms")) {
+				/* parse reserved HIGH_MEM offset */
+				if(_pmbd_parse_single(mode, "hmo", &data) < 0){
+					printk(KERN_ERR "pmbd: incorrect hmo\n");
+					g_highmem_phys_addr = 0;
+					goto fail;
+				} else {
+					g_highmem_phys_addr = data * 1024 * 1024 * 1024;
+				}
+
+				/* parse reserved HIGH_MEM size */
+				if(_pmbd_parse_single(mode, "hms", &data) < 0 || data == 0){
+					printk(KERN_ERR "pmbd: incorrect hms\n");
+					g_highmem_size = 0;
+					goto fail;
+				} else {
+					g_highmem_size = data * 1024 * 1024 * 1024;
+				} 
+			} else {
+				printk(KERN_ERR "pmbd: hmo or hms not set\n");
+				goto fail;
+			}
+
+
+		} 
+
+
+		/* check if mergeable */
+		if((strstr(mode,"mgbY")))
+			g_pmbd_mergeable = TRUE;
+		else if((strstr(mode,"mgbN")))
+			g_pmbd_mergeable = FALSE;
+
+		/* CPU cache flushing  */
+		if((strstr(mode,"clflushY")))
+			g_pmbd_cpu_cache_clflush = TRUE;
+		else if((strstr(mode,"clflushN")))
+			g_pmbd_cpu_cache_clflush = FALSE;
+
+		/* CPU cache setting */
+		if((strstr(mode,"cacheWB")))		/* cache write back */
+			g_pmbd_cpu_cache_flag = _PAGE_CACHE_WB;
+		else if((strstr(mode,"cacheWC")))	/* cache write combined (through) */
+			g_pmbd_cpu_cache_flag = _PAGE_CACHE_WC;
+		else if((strstr(mode,"cacheUM")))	/* uncached minus (UC-) */
+			g_pmbd_cpu_cache_flag = _PAGE_CACHE_UC_MINUS;
+		else if((strstr(mode,"cacheUC")))	/* strongly uncacheable */
+			g_pmbd_cpu_cache_flag = _PAGE_CACHE_UC;
+
+
+		/* write protection */
+		if((strstr(mode,"wrprotY")))
+			g_pmbd_wr_protect = TRUE;
+		else if((strstr(mode,"wrprotN")))
+			g_pmbd_wr_protect = FALSE;
+
+		/* write verification */
+		if((strstr(mode,"wrverifyY")))
+			g_pmbd_wr_verify = TRUE;
+		else if((strstr(mode,"wrverifyN")))
+			g_pmbd_wr_verify = FALSE;
+
+		/* checksum  */
+		if((strstr(mode,"checksumY")))
+			g_pmbd_checksum = TRUE;
+		else if((strstr(mode,"checksumN")))
+			g_pmbd_checksum = FALSE;
+
+		/* lock */
+		if((strstr(mode,"lockY")))
+			g_pmbd_lock = TRUE;
+		else if((strstr(mode,"lockN")))
+			g_pmbd_lock = FALSE;
+
+		/* subpage update */
+		if((strstr(mode,"subupdateY")))
+			g_pmbd_subpage_update = TRUE;
+		else if((strstr(mode,"subupdateN")))
+			g_pmbd_subpage_update = FALSE;
+
+
+		/* batch */
+		if (strstr(mode, "batch")){
+			if (_pmbd_parse_multi(mode, "batch", g_pmbd_buffer_batch_size) < 0)
+				goto fail;
+			/* check if any batch size is set too small */
+			for (i = 0; i < PMBD_MAX_NUM_DEVICES; i ++) {
+				if (g_pmbd_buffer_batch_size[i] < 1){
+					printk(KERN_ERR "pmbd: buffer batch size cannot be smaller than 1 page (default: 1024 pages)\n");
+					goto fail;
+				}
+			}
+		}
+
+		/* simmode */
+		if (strstr(mode, "simmode")){
+			if (_pmbd_parse_multi(mode, "simmode", g_pmbd_simmode) < 0)
+				goto fail;
+		}
+
+		/* wpmode */
+		if (strstr(mode, "wpmode")){
+			if (_pmbd_parse_multi(mode, "wpmode", g_pmbd_wpmode) < 0)
+				goto fail;
+		}
+
+	} else {
+		goto fail;
+	}
+
+	/* apply some enforced configuration */
+	if (enforce_cache_wc)	/* if ntl is used, we must use WC */
+		g_pmbd_cpu_cache_flag = _PAGE_CACHE_WC;
+
+	/* Done, print input options */
+	pmbd_print_conf();
+	return;
+
+fail:
+	printk(KERN_ERR "pmbd: wrong mode config! Check modinfo\n\n");
+	g_pmbd_nr = 0;
+	return;
+}
+
+/*
+ * *****************************************************************
+ * simple emulation API functions
+ * pmbd_rdwr_pause - pause a read/write for a specified number of cycles/page
+ * pmbd_rdwr_slowdown - slow down reads/writes proportionally to DRAM speed
+ * *****************************************************************/
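+
+/*
+ * Example (illustrative): with wrsx=3, a write whose memcpy took
+ * (end - start) cycles is followed by a synchronous busy-wait of
+ * (end - start) * (3 - 1) cycles, making the write appear 3x slower
+ * than the underlying DRAM copy (see pmbd_rdwr_slowdown() below).
+ */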
+
+/* handle rdpause and wrpause options*/
+static void pmbd_rdwr_pause(PMBD_DEVICE_T* pmbd, size_t bytes, unsigned rw)
+{
+	uint64_t cycles = 0;
+	uint64_t time_p1, time_p2;
+
+	/* sanity check */
+	if (pmbd->rdpause == 0 && pmbd->wrpause == 0)
+		return;
+
+	/* start */
+	TIMESTAT_POINT(time_p1);
+
+	/* calculate the cycles to pause */
+	if (rw == READ && pmbd->rdpause){
+		cycles = MAX_OF((BYTE_TO_PAGE(bytes) * pmbd->rdpause), pmbd->rdpause);
+	} else if (rw == WRITE && pmbd->wrpause){
+		cycles = MAX_OF((BYTE_TO_PAGE(bytes) * pmbd->wrpause), pmbd->wrpause);
+	}
+
+	/* slow down now */
+	if (cycles)
+		sync_slowdown_cycles(cycles);
+	
+	TIMESTAT_POINT(time_p2);
+
+	if(PMBD_USE_TIMESTAT()){
+		int cid = CUR_CPU_ID();
+		PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+		pmbd_stat->cycles_pause[rw][cid] += time_p2 - time_p1;
+	}
+
+	return;
+}
+
+
+/* handle rdsx and wrsx options */
+static void pmbd_rdwr_slowdown(PMBD_DEVICE_T* pmbd, int rw, uint64_t start, uint64_t end)
+{
+	uint64_t cycles = 0;
+	uint64_t time_p1, time_p2;
+
+	/* sanity check */
+	if ( !((rw == READ && pmbd->rdsx > 1) || (rw == WRITE && pmbd->wrsx > 1)))
+		return;
+
+	if (end < start){
+		printk(KERN_WARNING "pmbd: %s(%d) end (%llu) is earlier than start (%llu)\n", \
+			__FUNCTION__, __LINE__, (unsigned long long) start, (unsigned long long)end);
+		return;
+	}
+
+	/* start */
+	TIMESTAT_POINT(time_p1);
+
+	/*FIXME: should we allow async slowdown? */
+	cycles = (end-start)*((rw == READ) ? (pmbd->rdsx - 1) : (pmbd->wrsx -1));
+
+	/*FIXME: should we subtract a slack here (80-100 cycles)? */
+	if (cycles)
+		sync_slowdown_cycles(cycles);
+
+	TIMESTAT_POINT(time_p2);
+
+	/* updating statistics */
+	if(PMBD_USE_TIMESTAT()){
+		int cid = CUR_CPU_ID();
+		PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+		pmbd_stat->cycles_slowdown[rw][cid] += time_p2 - time_p1;
+	}
+
+	return;
+}
+
+
+/* 
+ * set page's cache flags
+ * @vaddr: start virtual address
+ * @num_pages: the range size
+ */
+static void set_pages_cache_flags(unsigned long vaddr, int num_pages)
+{
+	switch (g_pmbd_cpu_cache_flag) {
+		case _PAGE_CACHE_WB:
+			printk(KERN_INFO "pmbd: set PM pages cache flags (WB)\n");
+			set_memory_wb(vaddr, num_pages);
+			break;
+		case _PAGE_CACHE_WC:
+			printk(KERN_INFO "pmbd: set PM pages cache flags (WC)\n");
+			set_memory_wc(vaddr, num_pages);
+			break;
+		case _PAGE_CACHE_UC:
+			printk(KERN_INFO "pmbd: set PM pages cache flags (UC)\n");
+			set_memory_uc(vaddr, num_pages);
+			break;
+		case _PAGE_CACHE_UC_MINUS:
+			printk(KERN_INFO "pmbd: set PM pages cache flags (UM)\n");
+			set_memory_uc(vaddr, num_pages);
+			break;
+		default:
+			set_memory_wb(vaddr, num_pages);
+			printk(KERN_WARNING "pmbd: PM page attribute is not set - use WB\n");
+			break;
+	}
+	return;
+}
+
+
+/* 
+ * *************************************************************************
+ * PMAP - Private mapping interface APIs
+ * *************************************************************************
+ *
+ * The private mapping provides write protection -- a PM page is mapped into
+ * the kernel virtual memory space only when we need to access it, and
+ * unmapped as soon as we are done, so the spatial and temporal window left
+ * for stray writes is very small.
+ *
+ * Notes: pmap works similarly to kmap_atomic*. It does the following:
+ * (1) pmap_create(): allocate 128 pages with vmalloc(); the 128 PTE mappings
+ * are saved to a backup place and then cleared to prevent accidental
+ * accesses. Each page is assigned to the ID of the CPU on which the calling
+ * thread is running, so we support at most 128 CPU IDs.
+ * (2) pmap_atomic_pfn(): map the specified pfn into the entry whose index is
+ * the ID of the CPU on which the current thread is running. The pfn is
+ * loaded into the corresponding PTE entry and the corresponding TLB entry is
+ * flushed.
+ * (3) punmap_atomic(): the specified PTE entry is cleared, and the TLB entry
+ * is flushed.
+ * (4) pmap_destroy(): the saved PTE mappings of the 128 pages are restored,
+ * and vfree() is called to release the 128 pages allocated through vmalloc().
+ *
+ */
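+
+/*
+ * Illustrative pmap call sequence (a sketch of what _memcpy_pmbd_pmap()
+ * below does, not an additional API):
+ *
+ *	pmap_create();				// at module load
+ *	va = pmap_atomic_pfn(pfn, pmbd, rw);	// map one PM page
+ *	... access the page through va ...	// copy to/from the page
+ *	punmap_atomic(va, pmbd, rw);		// unmap it right away
+ *	pmap_destroy();				// at module unload
+ */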
+
+#define PMAP_NR_PAGES 	(128)
+static unsigned int 	pmap_nr_pages = 0;			/* the total number of available pages for private mapping */
+static void* 		pmap_va_start = NULL;			/* the first PMAP virtual address */
+static pte_t*  		pmap_ptep[PMAP_NR_PAGES];		/* the array of PTE entries */
+static unsigned long	pmap_pfn[PMAP_NR_PAGES];		/* the array of page frame numbers for restoring */
+static pgprot_t 	pmap_prot[PMAP_NR_PAGES];		/* the array of page protection fields */
+#define PMAP_VA(IDX)	(pmap_va_start + (IDX) * PAGE_SIZE)
+#define PMAP_IDX(VA)	(((unsigned long)(VA) - (unsigned long)pmap_va_start) >> PAGE_SHIFT)
+
+static inline void pmap_flush_tlb_single(unsigned long addr)
+{
+	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
+}
+
+static inline void* update_pmap_pfn(unsigned long pfn, unsigned int idx)
+{
+	void* va 	= PMAP_VA(idx);
+	pte_t* ptep 	= pmap_ptep[idx];
+	pte_t old_pte 	= *ptep;
+	pte_t new_pte 	= pfn_pte(pfn, pmap_prot[idx]);
+
+	if (pte_val(old_pte) == pte_val(new_pte))
+		return va;
+
+	/* update the pte entry */
+	set_pte_atomic(ptep, new_pte);
+//	set_pte(ptep, new_pte);
+
+	/* flush one single tlb */
+	__flush_tlb_one((unsigned long) va);
+//	pmap_flush_tlb_single((unsigned long) va);
+
+	/* return the va of the newly mapped page */
+	return va;
+}
+
+static inline void clear_pmap_pfn(unsigned idx)
+{
+	if (idx < pmap_nr_pages){
+
+		void* va = PMAP_VA(idx);
+		pte_t* ptep = pmap_ptep[idx];
+
+		/* clear the mapping */
+		pte_clear(NULL, (unsigned long) va, ptep);
+		__flush_tlb_one((unsigned long) va);
+
+	} else {
+		panic("%s(%d) illegal pmap idx\n", __FUNCTION__, __LINE__);
+	}
+}
+
+static int pmap_atomic_init(void)
+{
+	unsigned int i;
+
+	/* checking */
+	if (pmap_va_start)
+		panic("%s(%d) something is wrong\n", __FUNCTION__, __LINE__);
+
+	/* allocate an array of dummy pages as pmap virtual addresses */
+	pmap_va_start = vmalloc(PAGE_SIZE * PMAP_NR_PAGES);
+	if (!pmap_va_start){
+		printk(KERN_ERR "pmbd:%s(%d) pmap_va_start cannot be initialized\n", __FUNCTION__, __LINE__);
+		return -ENOMEM;
+	}
+	pmap_nr_pages = PMAP_NR_PAGES;
+
+	/* set the pages' cache flags; these flags are saved into pmap_prot
+	 * and will also be applied to the dynamically mapped pages (01/12/2012)*/
+	set_pages_cache_flags((unsigned long)pmap_va_start, pmap_nr_pages);
+
+	/* save the dummy pages' ptep, pfn, and prot info */	
+	printk(KERN_INFO "pmbd: saving dummy pmap entries\n");
+	for (i = 0; i < pmap_nr_pages; i ++){
+		pte_t old_pte;
+		unsigned int level;
+		void* va = PMAP_VA(i);
+
+		/* get the ptep */
+		pte_t* ptep = lookup_address((unsigned long)(va), &level);
+
+		/* sanity check */
+		if (!ptep)
+			panic("%s(%d) mapping not found\n", __FUNCTION__, __LINE__);
+
+		old_pte = *ptep;
+		if (!pte_val(old_pte))
+			panic("%s(%d) invalid pte value\n", __FUNCTION__, __LINE__);
+
+		if (level != PG_LEVEL_4K)
+			panic("%s(%d) not PG_LEVEL_4K \n", __FUNCTION__, __LINE__);
+
+		/* save dummy entries */
+		pmap_ptep[i] = ptep;
+		pmap_pfn[i] = pte_pfn(old_pte);
+		pmap_prot[i] = pte_pgprot(old_pte);
+
+/*		printk(KERN_INFO "%s(%d): saving dummy pmap entries: %u va=%p pfn=%lx\n", \
+					__FUNCTION__, __LINE__, i, va, pmap_pfn[i]);
+*/
+	}
+
+	/* clear the pte to make it illegal to access */
+	for (i = 0; i < pmap_nr_pages; i ++)
+		clear_pmap_pfn(i);
+
+	return 0;
+}
+
+static void pmap_atomic_done(void)
+{
+	int i;
+	
+	/* restore the dummy pages' pte */
+	printk(KERN_INFO "pmbd: restoring dummy pmap entries\n");
+	for (i = 0; i < pmap_nr_pages; i ++){
+/*		void* va = PMAP_VA(i);
+		printk(KERN_INFO "%s(%d): restoring dummy pmap entries: %d va=%p pfn=%lx\n", \
+					__FUNCTION__, __LINE__, i, va, pmap_pfn[i]);
+*/
+		/* restore the old pfn */
+		update_pmap_pfn(pmap_pfn[i], i);
+		pmap_ptep[i]= NULL;
+		pmap_pfn[i] = 0;
+	}
+
+	/* free the dummy pages*/
+	if (pmap_va_start)
+		vfree(pmap_va_start);
+	else
+		panic("%s(%d): freeing dummy pages failed\n", __FUNCTION__, __LINE__);
+
+	pmap_va_start = NULL;
+	pmap_nr_pages = 0;
+	return;
+}
+
+static void* pmap_atomic_pfn(unsigned long pfn, PMBD_DEVICE_T* pmbd, unsigned rw)
+{
+	void* va = NULL;
+	unsigned int idx = CUR_CPU_ID();
+	uint64_t time_p1 = 0;
+	uint64_t time_p2 = 0;
+
+	TIMESTAMP(time_p1);
+
+	/* disable page fault temporarily */
+	pagefault_disable();
+
+	/* change the mapping to the specified pfn*/
+	va = update_pmap_pfn(pfn, idx);
+
+	TIMESTAMP(time_p2);
+
+	/* update time statistics */
+	if(PMBD_USE_TIMESTAT()){
+		int cid = CUR_CPU_ID();
+		PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+		pmbd_stat->cycles_pmap[rw][cid] += time_p2 - time_p1;
+	}
+
+	return va;
+}
+
+static void punmap_atomic(void* va, PMBD_DEVICE_T* pmbd, unsigned rw)
+{
+	unsigned int idx = PMAP_IDX(va);
+	uint64_t time_p1 = 0;
+	uint64_t time_p2 = 0;
+
+	TIMESTAMP(time_p1);
+
+	/* clear the mapping */
+	clear_pmap_pfn(idx);
+
+	/* re-enable the page fault */
+	pagefault_enable();
+
+	TIMESTAMP(time_p2);
+
+	/* update time statistics */
+	if(PMBD_USE_TIMESTAT()){
+		int cid = CUR_CPU_ID();
+		PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+		pmbd_stat->cycles_punmap[rw][cid] += time_p2 - time_p1;
+	}
+
+	return;
+}
+
+/* create the dummy pmap space */
+static int pmap_create(void)
+{
+	pmap_atomic_init();
+	return 0;
+}
+
+/* destroy the dummy pmap space */
+static void pmap_destroy(void)
+{
+	pmap_atomic_done();
+	return;
+}
+
+/*
+ * *************************************************************************
+ * Non-temporal memcpy 
+ * *************************************************************************
+ * Non-temporal memcpy does the following:
+ * (1) use movntdq to copy into the PM space
+ * (2) use sfence to flush the data to the memory controller
+ * 
+ * Compared to regular temporal memcpy, it provides several benefits:
+ * (1) writes to PM bypass the CPU cache, which avoids polluting the CPU cache
+ * (2) reads from PM still benefit from the CPU cache
+ * (3) the sfence issued for each write guarantees data is flushed out of the
+ *     store buffer
+ */
+
+static void nts_memcpy_64bytes_v2(void* to, void* from, size_t size)
+{
+	int i;
+	unsigned bs = 64;	/* write unit size: 64 bytes */
+
+	if (size < bs)
+		panic("%s(%d) size (%zu) is smaller than %u\n", __FUNCTION__, __LINE__, size, bs);
+
+	if (((unsigned long) from & 63UL) || ((unsigned long) to & 63UL))	/* require 64-byte alignment */
+		panic("%s(%d) not aligned\n", __FUNCTION__, __LINE__);
+
+	/* start */
+	kernel_fpu_begin();
+	
+	/* do the non-temporal mov */	
+	for (i = 0; i < size; i += bs){
+		__asm__ __volatile__ (
+		"movdqa (%0), %%xmm0\n"
+		"movdqa 16(%0), %%xmm1\n"
+		"movdqa 32(%0), %%xmm2\n"
+		"movdqa 48(%0), %%xmm3\n"
+		"movntdq %%xmm0, (%1)\n"
+		"movntdq %%xmm1, 16(%1)\n"
+		"movntdq %%xmm2, 32(%1)\n"
+		"movntdq %%xmm3, 48(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	/* do sfence to push data out */
+	__asm__ __volatile__ (
+		" sfence\n" : :
+	);
+
+	/* end */
+	kernel_fpu_end();
+
+	/*NOTE: we assume the size is a multiple of 64 bytes*/
+	if (i != size)
+		panic("%s:%s:%d size (%zu) is not a multiple of 64 bytes\n", __FILE__, __FUNCTION__, __LINE__, size);
+
+	return;
+}
+
+/* non-temporal store */
+static void nts_memcpy(void* to, void* from, size_t size)
+{
+	if (size < 64){
+		panic("no support for nt store smaller than 64 bytes yet\n");
+	} else {
+		nts_memcpy_64bytes_v2(to, from, size);
+	}
+}
+
+
+static void ntl_memcpy_64bytes(void* to, void* from, size_t size)
+{
+	int i;
+	unsigned bs = 64;	/* read unit size: 64 bytes */
+
+	if (size < bs)
+		panic("%s(%d) size (%zu) is smaller than %u\n", __FUNCTION__, __LINE__, size, bs);
+
+	if (((unsigned long) from & 63UL) || ((unsigned long) to & 63UL))	/* require 64-byte alignment */
+		panic("%s(%d) not aligned\n", __FUNCTION__, __LINE__);
+
+	/* start */
+	kernel_fpu_begin();
+	
+	/* do the non-temporal mov */	
+	for (i = 0; i < size; i += bs){
+		__asm__ __volatile__ (
+		"movntdqa (%0), %%xmm0\n"
+		"movntdqa 16(%0), %%xmm1\n"
+		"movntdqa 32(%0), %%xmm2\n"
+		"movntdqa 48(%0), %%xmm3\n"
+		"movdqa %%xmm0, (%1)\n"
+		"movdqa %%xmm1, 16(%1)\n"
+		"movdqa %%xmm2, 32(%1)\n"
+		"movdqa %%xmm3, 48(%1)\n"
+		:
+		: "r" (from), "r" (to)
+		: "memory");
+
+		to += bs;
+		from += bs;
+	}
+
+	/* end */
+	kernel_fpu_end();
+
+	/*NOTE: we assume the size is a multiple of 64 bytes (at least 512 bytes)*/
+	if (i != size)
+		panic("%s:%s:%d size (%zu) is not a multiple of 64 bytes\n", __FILE__, __FUNCTION__, __LINE__, size);
+
+	return;
+}
+
+/* non-temporal load */
+static void ntl_memcpy(void* to, void* from, size_t size)
+{
+	if (size < 64){
+		panic("no support for nt load smaller than 64 bytes yet\n");
+	} else {
+		ntl_memcpy_64bytes(to, from, size);
+	}
+}
+
+
+/*
+ * *************************************************************************
+ * COPY TO/FROM PM
+ * *************************************************************************
+ * 
+ * NOTE: copying into PM needs particular care; we use two different
+ * solutions here:
+ * (1) pmap: we only map/unmap PM pages when we need to access them, which
+ *     provides the most protection, for both reads and writes
+ * (2) non-pmap: we keep every page mapped into the kernel space at all
+ *     times, but add protection against writes only. In both cases, PM
+ *     pages are initialized as read-only
+ *     - PTE manipulation: before each write, the page's writable bit is
+ *       enabled, and disabled right after the write operation is done.
+ *     - CR0/WP switch: before each write, the WP bit in the CR0 register is
+ *       turned off, and turned back on right after the write operation is
+ *       done. While CR0/WP is off, the CPU does not check the writable bit
+ *       in the TLB on the local CPU, so this is a tricky way to work around
+ *       the write-protection overhead.
+ *
+ */
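+
+/*
+ * Sketch of the CR0/WP write window implemented by the DISABLE_CR0_WP/
+ * ENABLE_CR0_WP macros below (interrupts are disabled so the window stays
+ * confined to the local CPU):
+ *
+ *	DISABLE_CR0_WP(cr0, flags);	// clear CR0.WP, IRQs off
+ *	memcpy(pm_dst, src, len);	// write the read-only PM page
+ *	ENABLE_CR0_WP(cr0, flags);	// restore CR0.WP, IRQs on
+ */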
+
+#define PMBD_PMAP_DUMMY_BASE_VA	(4096)
+#define PMBD_PMAP_VA_TO_PA(VA)	(g_highmem_phys_addr + (VA) - PMBD_PMAP_DUMMY_BASE_VA)
+/*
+ * copying from/to a contiguous PM space using pmap
+ * @ram_va: the RAM virtual address
+ * @pmbd_dummy_va: the dummy PM virtual address (for converting to phys addr)
+ * @rw: 0 - read, 1 - write
+ */
+
+#define MEMCPY_TO_PMBD(dst, src, bytes) { if (PMBD_USE_NTS()) \
+						nts_memcpy((dst), (src), (bytes)); \
+					else \
+						memcpy((dst), (src), (bytes));}
+
+#define MEMCPY_FROM_PMBD(dst, src, bytes) { if (PMBD_USE_NTL()) \
+						ntl_memcpy((dst), (src), (bytes)); \
+					else \
+						memcpy((dst), (src), (bytes));}
+
+static inline int _memcpy_pmbd_pmap(PMBD_DEVICE_T* pmbd, void* ram_va, void* pmbd_dummy_va, size_t bytes, unsigned rw, unsigned do_fua)
+{
+	unsigned long flags = 0;
+	uint64_t pa = (uint64_t) PMBD_PMAP_VA_TO_PA(pmbd_dummy_va);
+
+	/* disable interrupt (PMAP entry is shared) */	
+	DISABLE_SAVE_IRQ(flags);
+	
+	/* do the real work */
+	while(bytes){
+		uint64_t time_p1 = 0;
+		uint64_t time_p2 = 0;
+
+		unsigned long pfn = (pa >> PAGE_SHIFT);		/* page frame number */
+		unsigned off = pa & (~PAGE_MASK);		/* offset in one page */
+		unsigned size = MIN_OF((PAGE_SIZE - off), bytes);/* the size to copy */
+
+		/* map it */
+		void * map = pmap_atomic_pfn(pfn, pmbd, rw);
+		void * pmbd_va = map + off;
+
+		/* do memcopy */
+		TIMESTAMP(time_p1);
+		if (rw == READ) { 
+			MEMCPY_FROM_PMBD(ram_va, pmbd_va, size);
+		} else { 
+			if (PMBD_USE_SUBPAGE_UPDATE()) {
+				/* if we do subpage write, write a cacheline each time */
+				/* FIXME: we probably need to check the alignment here */
+				size = MIN_OF(size, PMBD_CACHELINE_SIZE);
+				if (memcmp(pmbd_va, ram_va, size)){
+					MEMCPY_TO_PMBD(pmbd_va, ram_va, size);
+				}
+			} else {
+				MEMCPY_TO_PMBD(pmbd_va, ram_va, size);
+			}
+		}
+		TIMESTAMP(time_p2);
+
+		/* emulating slowdown*/
+		if(PMBD_DEV_USE_SLOWDOWN(pmbd))
+			pmbd_rdwr_slowdown((pmbd), rw, time_p1, time_p2);
+
+		/* for write check if we need to do clflush or do FUA*/
+		if (rw == WRITE){ 
+			if (PMBD_USE_CLFLUSH() || (do_fua && PMBD_CPU_CACHE_USE_WB() && !PMBD_USE_NTS()))
+				pmbd_clflush_range(pmbd, pmbd_va, (size));
+		}
+
+		/* if write combine is used, we need to do sfence (like in ntstore) */
+		if (PMBD_CPU_CACHE_USE_WC() || PMBD_CPU_CACHE_USE_UM()) 
+			sfence();
+
+		/* update time statistics */
+		if(PMBD_USE_TIMESTAT()){
+			int cid = CUR_CPU_ID();
+			PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+			pmbd_stat->cycles_memcpy[rw][cid] += time_p2 - time_p1;
+		}
+
+		/* unmap it */
+		punmap_atomic(map, pmbd, rw);
+
+		/* prepare the next iteration */
+		ram_va  += size;
+		bytes 	-= size;
+		pa 	+= size;
+	}
+	
+	/* re-enable interrupt */	
+	ENABLE_RESTORE_IRQ(flags);
+
+	return 0;
+}
+
+static inline int memcpy_from_pmbd_pmap(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes)
+{
+	return _memcpy_pmbd_pmap(pmbd, dst, src, bytes, READ, FALSE);
+}
+
+static inline int memcpy_to_pmbd_pmap(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes, unsigned do_fua)
+{
+	return _memcpy_pmbd_pmap(pmbd, src, dst, bytes, WRITE, do_fua);
+}
+
+
+/*
+ * memcpy from/to PM without using pmap
+ */
+
+#define DISABLE_CR0_WP(CR0,FLAGS)	{\
+						if (PMBD_USE_WRITE_PROTECTION()){\
+							DISABLE_SAVE_IRQ((FLAGS));\
+							(CR0) = read_cr0();\
+							write_cr0((CR0) & ~X86_CR0_WP);\
+						}\
+					}
+#define ENABLE_CR0_WP(CR0,FLAGS)	{\
+						if (PMBD_USE_WRITE_PROTECTION()){\
+							write_cr0((CR0));\
+							ENABLE_RESTORE_IRQ((FLAGS));\
+						}\
+					}
+
+static inline int memcpy_from_pmbd_nopmap(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes)
+{
+	uint64_t time_p1 = 0;
+	uint64_t time_p2 = 0;
+
+	/* start memcpy */
+	TIMESTAMP(time_p1);
+#if 0
+	if (PMBD_DEV_USE_VMALLOC((pmbd))) 
+		memcpy((dst), (src), (bytes)); 
+	else if (PMBD_DEV_USE_HIGHMEM((pmbd))) 
+		memcpy_fromio((dst), (src), (bytes));
+#endif
+	MEMCPY_FROM_PMBD(dst, src, bytes);
+
+	TIMESTAMP(time_p2);
+
+	/* emulating slowdown*/
+	if(PMBD_DEV_USE_SLOWDOWN(pmbd))
+		pmbd_rdwr_slowdown((pmbd), READ, time_p1, time_p2);
+
+	/* update time statistics */
+	if(PMBD_USE_TIMESTAT()){
+		int cid = CUR_CPU_ID();
+		PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+		pmbd_stat->cycles_memcpy[READ][cid] += time_p2 - time_p1;
+	}
+
+	return 0;
+}
+
+static int memcpy_to_pmbd_nopmap(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes, unsigned do_fua)
+{
+
+	unsigned long cr0 = 0;
+	unsigned long flags = 0;
+	size_t left = bytes;
+
+
+	/* get a bkup copy of the CR0 (to allow writable)*/
+	if (PMBD_DEV_USE_WPMODE_CR0(pmbd))
+		DISABLE_CR0_WP(cr0, flags);
+
+	/* do the real work */
+	while(left){
+		size_t size = left;	/* the size to copy */
+		uint64_t time_p1 = 0;
+		uint64_t time_p2 = 0;
+
+		TIMESTAMP(time_p1);
+		/* do memcopy */
+		if (PMBD_USE_SUBPAGE_UPDATE()) {
+			/* if we do subpage write, write a cacheline each time */
+			size = MIN_OF(size, PMBD_CACHELINE_SIZE);
+
+			if (memcmp(dst, src, size)){
+				MEMCPY_TO_PMBD(dst, src, size);
+			}
+		} else {
+			MEMCPY_TO_PMBD(dst, src, size);
+		}
+		TIMESTAMP(time_p2);
+
+		/* emulating slowdown*/
+		if(PMBD_DEV_USE_SLOWDOWN(pmbd))
+			pmbd_rdwr_slowdown((pmbd), WRITE, time_p1, time_p2);
+
+		/* if write, check if we need to do clflush or we do FUA */
+		if (PMBD_USE_CLFLUSH() || (do_fua && PMBD_CPU_CACHE_USE_WB() && !PMBD_USE_NTS()))
+			pmbd_clflush_range(pmbd, dst, (size));
+
+		/* if write combine is used, we need to do sfence (like in ntstore) */
+		if (PMBD_CPU_CACHE_USE_WC() || PMBD_CPU_CACHE_USE_UM())
+			sfence();
+
+		/* update time statistics */
+		if(PMBD_USE_TIMESTAT()){
+			int cid = CUR_CPU_ID();
+			PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+			pmbd_stat->cycles_memcpy[WRITE][cid] += time_p2 - time_p1;
+		}
+
+		/* prepare the next iteration */
+		dst  	+= size;
+		src 	+= size;
+		left 	-= size;
+	}
+	
+	/* restore the CR0 */
+	if (PMBD_DEV_USE_WPMODE_CR0(pmbd))
+		ENABLE_CR0_WP(cr0, flags);
+
+	return 0;
+}
+
+static int memcpy_to_pmbd(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes, unsigned do_fua)
+{
+	uint64_t start = 0; 
+	uint64_t end = 0; 
+
+	/* start simulation timing */
+	if (PMBD_DEV_SIM_PMBD((pmbd)))
+		start = emul_start((pmbd), BYTE_TO_SECTOR((bytes)), WRITE);
+
+	/* do memcpy now */
+	if (PMBD_USE_PMAP()){
+		memcpy_to_pmbd_pmap(pmbd, dst, src, bytes, do_fua);
+	} else {
+		memcpy_to_pmbd_nopmap(pmbd, dst, src, bytes, do_fua);
+	}
+
+	/* stop simulation timing */
+	if (PMBD_DEV_SIM_PMBD((pmbd))) 
+		end = emul_end((pmbd), BYTE_TO_SECTOR((bytes)), WRITE, start); 
+
+	/* pause write for a while*/
+	pmbd_rdwr_pause(pmbd, bytes, WRITE);
+
+	return 0;
+}
+
+
+
+static int memcpy_from_pmbd(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes)
+{
+	uint64_t start = 0; 
+	uint64_t end = 0; 
+
+	/* start simulation timing */
+	if (PMBD_DEV_SIM_PMBD((pmbd)))
+		start = emul_start((pmbd), BYTE_TO_SECTOR((bytes)), READ);
+
+	/* do memcpy here */
+	if (PMBD_USE_PMAP()){
+		memcpy_from_pmbd_pmap(pmbd, dst, src, bytes);
+	}else{
+		memcpy_from_pmbd_nopmap(pmbd, dst, src, bytes);
+	}
+
+	/* stop simulation timing */
+	if (PMBD_DEV_SIM_PMBD((pmbd))) 
+		end = emul_end((pmbd), BYTE_TO_SECTOR((bytes)), READ, start); 
+
+	/* pause read for a while */
+	pmbd_rdwr_pause(pmbd, bytes, READ);
+
+	return 0;
+}
+
+
+
+/*
+ * *************************************************************************
+ * PMBD device buffer management
+ * *************************************************************************
+ *
+ * Since write protection incurs high performance overhead (TLB shootdowns,
+ * system-wide locking, and the linked-list scans inside the set_memory_*
+ * functions), we cannot afford to change page table attributes for each
+ * incoming write to the PM space. To mitigate this, we add a DRAM buffer
+ * that temporarily holds incoming writes, and launch a syncer daemon to
+ * periodically flush dirty pages from the buffer to the PM storage. This
+ * brings two benefits: first, contiguous pages can be clustered together,
+ * so only one page attribute change is needed per cluster; second, the high
+ * overhead is hidden in the background, since writes become asynchronous. 
+ * 
+ */
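+
+/*
+ * For illustration, the buffered write path for a hypothetical 4KB write
+ * (function names as defined below):
+ *
+ *   copy_to_pmbd_buffered()
+ *     -> _pmbd_buffer_lookup()            buffer miss
+ *     -> pmbd_buffer_alloc_block()        reserve a buffer block, mark dirty
+ *     -> memcpy() into the DRAM buffer    fast path, no PTE change
+ *   ... later, in the background ...
+ *   pmbd_syncer_worker()
+ *     -> pmbd_buffer_check_and_flush()
+ *        -> pmbd_buffer_flush()           sort by PBN, form contiguous runs
+ *           -> _pmbd_buffer_flush_range() one set_memory_rw/ro per run
+ */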
+
+
+/* support functions to sort the bbi entries */
+static int compare_bbi_sort_entries(const void* m, const void* n)
+{
+	PMBD_BSORT_ENTRY_T* a = (PMBD_BSORT_ENTRY_T*) m;
+	PMBD_BSORT_ENTRY_T* b = (PMBD_BSORT_ENTRY_T*) n;
+	if (a->pbn < b->pbn)
+		return -1;
+	else if (a->pbn == b->pbn)
+		return 0;
+	else
+		return 1;
+
+}
+
+static void swap_bbi_sort_entries(void* m, void* n, int size)
+{
+	PMBD_BSORT_ENTRY_T* a = (PMBD_BSORT_ENTRY_T*) m;
+	PMBD_BSORT_ENTRY_T* b = (PMBD_BSORT_ENTRY_T*) n;
+	PMBD_BSORT_ENTRY_T tmp;
+	tmp = *a;
+	*a = *b;
+	*b = tmp;
+	return;
+}
+
+
+/*
+ * get the aligned in-block offsets for a given request
+ * @pmbd: the pmbd device
+ * @sector: the starting offset (in sectors) of the incoming request
+ * @bytes: the size of the incoming request
+ * 
+ * return: the in-block offset of the starting sector in the request
+ * 
+ * Since the block size (4096 bytes) is larger than the sector size (512 bytes),
+ * if the incoming request is not completely aligned in units of blocks, then
+ * we need to pull the whole block from PM space into the buffer, and apply
+ * changes to partial blocks. This function is needed to calculate the offset
+ * for the beginning and ending sectors. 
+ *
+ * For example: assuming the sector size is 1024 bytes and the buffer block
+ * size is 4096 bytes (4 sectors per block), a request starting at sector 5
+ * with size 2048 returns a start offset of 1 (the second sector in the
+ * buffer block) and an end offset of 2 (the third sector in the buffer
+ * block):
+ *
+ * offset_s -----v              v--- offset_e
+ *      ----------------------------------
+ *      |        |*****|********|        |
+ *      ----------------------------------
+ *
+ */
+
+static sector_t pmbd_buffer_aligned_request_start(PMBD_DEVICE_T* pmbd, sector_t sector, size_t bytes)
+{
+	sector_t sector_s  	= sector;
+	PBN_T pbn_s 		= SECTOR_TO_PBN(pmbd, sector_s);
+	sector_t block_s 	= PBN_TO_SECTOR(pmbd, pbn_s);	/* the block's starting offset (in sector) */
+	sector_t offset_s 	= 0;
+	if (sector_s >= block_s) /* compute the in-block offset */
+		offset_s = sector_s - block_s;
+	return offset_s;
+}
+
+static sector_t pmbd_buffer_aligned_request_end(PMBD_DEVICE_T* pmbd, sector_t sector, size_t bytes)
+{
+	sector_t sector_e  	= sector + BYTE_TO_SECTOR(bytes) - 1;
+	PBN_T pbn_e 		= SECTOR_TO_PBN(pmbd, sector_e);
+	sector_t block_e 	= PBN_TO_SECTOR(pmbd, pbn_e);	/* the block's starting offset (in sector) */
+	sector_t offset_e 	= PBN_TO_SECTOR(pmbd, 1) - 1;
+	
+	if (sector_e >= block_e) /* compute the in-block offset */
+		offset_e = (sector_e - block_e);
+	return offset_e;
+}
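+
+/*
+ * Worked example for the two helpers above (assuming 512-byte sectors and
+ * 4096-byte blocks, i.e. 8 sectors per block): for a request with sector=11
+ * and bytes=2048 (4 sectors), pbn_s = 11/8 = 1 and block_s = 8, so
+ * offset_s = 11 - 8 = 3; the last sector is 11 + 4 - 1 = 14 in the same
+ * block, so offset_e = 14 - 8 = 6. The request thus covers in-block
+ * sectors 3..6.
+ */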
+
+
+/*
+ * check and see if a physical block (pbn) is buffered
+ * @buffer: 	the pmbd buffer
+ * @pbn: 	physical block number
+ * 
+ * NOTE: The caller must hold the pbi->lock
+ */ 
+static PMBD_BBI_T* _pmbd_buffer_lookup(PMBD_BUFFER_T* buffer, PBN_T pbn)
+{
+	PMBD_BBI_T* bbi = NULL;
+	PMBD_DEVICE_T* pmbd = buffer->pmbd;
+	PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
+
+	if (PMBD_BLOCK_IS_BUFFERED(pmbd, pbn)) {
+		bbi = PMBD_BUFFER_BBI(buffer, pbi->bbn);
+	}
+	return bbi;
+}
+
+/*
+ * Alloc/flush buffer functions
+ */
+
+/* 
+ * flushing a range of contiguous physical blocks from buffer to PM space
+ * @pmbd: pmbd device
+ * @pbn_s: the first physical block number to flush (start)
+ * @pbn_e: the last physical block number to flush (end)
+ *
+ * This function only flushes blocks from the buffer to PM and unlinks
+ * (frees) the corresponding buffer blocks and physical PM blocks; it does
+ * not update the buffer control info (num_dirty, pos_dirty). This is
+ * because after sorting, the processing order of buffer blocks (BBNs) may
+ * differ from their spatial order, which makes it impossible to move
+ * pos_dirty forward exactly one block at a time. In other words, pos_dirty
+ * only points to the end of the dirty range, and we may flush a dirty block
+ * in the middle of the range rather than from the end first. 
+ *
+ * NOTE: The caller must hold the flush_lock; only one thread is allowed to do
+ * this sync; we also assume all the physical blocks in the specified range are
+ * buffered.
+ *
+ */
+
+static unsigned long _pmbd_buffer_flush_range(PMBD_BUFFER_T* buffer, PBN_T pbn_s, PBN_T pbn_e)
+{
+	PBN_T pbn = 0;
+	unsigned long num_cleaned = 0;
+	PMBD_DEVICE_T* pmbd = buffer->pmbd;
+	void* dst = PMBD_BLOCK_VADDR(pmbd, pbn_s);
+	size_t bytes = PBN_TO_BYTE(pmbd, (pbn_e - pbn_s + 1));
+	
+	/* NOTE: we are protected by the flush_lock here, no-one else here */
+
+	/* set the pages read-write */
+	/* if we use CR0.WP to temporarily make the pages writable, 
+ 	 * we don't have to change the PTE attributes directly */
+	if (PMBD_DEV_USE_WPMODE_PTE(pmbd))
+		pmbd_set_pages_rw(pmbd, dst, bytes, TRUE);
+	
+
+	/* for each physical block, flush it from buffer to the PM space */
+	for (pbn = pbn_s; pbn <= pbn_e; pbn ++){
+		BBN_T bbn 	= 0;
+		PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
+		void* to 	= PMBD_BLOCK_VADDR(pmbd, pbn);
+		size_t size 	= pmbd->pb_size;
+		void* from	= NULL;		/* wait to get it in locked region */
+		PMBD_BBI_T* bbi	= NULL;		/* wait to get it in locked region */
+
+		/* 
+		 * NOTE: This would not cause a deadlock, because the blocks
+		 * here are already buffered, so they will not call
+		 * pmbd_buffer_alloc_block() 
+		 */
+		spin_lock(&pbi->lock);		/* lock the block */
+
+		/* get related buffer block info */
+		if (PMBD_BLOCK_IS_BUFFERED(pmbd, pbn)) {
+			bbn	= pbi->bbn;
+			bbi	= PMBD_BUFFER_BBI(buffer, pbi->bbn);
+			from  	= PMBD_BUFFER_BLOCK(buffer, pbi->bbn);
+		} else {
+			panic("pmbd: %s(%d) block is expected to be buffered\n", __FUNCTION__, __LINE__);
+		}
+		
+		/* sync data from buffer into PM first */
+		if (PMBD_BUFFER_BBI_IS_DIRTY(buffer, bbn)) {
+			/* flush to PM */	
+			memcpy_to_pmbd(pmbd, to, from, size, FALSE);
+
+			/* mark it as clean */
+			PMBD_BUFFER_SET_BBI_CLEAN(buffer, bbn);
+		}
+	}
+
+	/* set the pages back to read-only */
+	if (PMBD_DEV_USE_WPMODE_PTE(pmbd))
+		pmbd_set_pages_ro(pmbd, dst, bytes, TRUE);
+	
+
+	/* finish the remaining work */	
+	for (pbn = pbn_s; pbn <= pbn_e; pbn ++){
+		PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
+		void* to 	= PMBD_BLOCK_VADDR(pmbd, pbn);
+		size_t size 	= pmbd->pb_size;
+		BBN_T bbn	= pbi->bbn;
+		void* from  	= PMBD_BUFFER_BLOCK(buffer, pbi->bbn);
+
+		/* verify that the write operation succeeded */
+		if(PMBD_USE_WRITE_VERIFICATION())
+			pmbd_verify_wr_pages(pmbd, to, from, size);
+
+		/* reset the bbi and pbi link info */
+		PMBD_BUFFER_SET_BBI_UNBUFFERED(buffer, bbn);
+		PMBD_SET_BLOCK_UNBUFFERED(pmbd, pbn);
+
+		/* unlock the block */
+		spin_unlock(&pbi->lock);
+
+		num_cleaned ++;
+	}
+
+	/* generate checksum */
+	if (PMBD_USE_CHECKSUM())
+		pmbd_checksum_on_write(pmbd, dst, bytes);
+	
+	return num_cleaned;
+}
+
+
+/*
+ * core function of flushing the pmbd buffer
+ * @pmbd: pmbd device
+ *
+ * NOTE: this function performs the flushing in the following steps
+ * (1) get the flush lock (to allow only one to do flushing)
+ * (2) get the buffer_lock to protect the buffer control info (num_dirty,
+ * pos_dirty, pos_clean)
+ * (3) check if someone else has already done the flushing work while waiting
+ * for the lock 
+ * (4) copy the buffer block info from pos_dirty to pos_clean to a temporary
+ * array
+ * (5) release the buffer_lock (to allow alloc to proceed, as long as free
+ * blocks exist)
+ *
+ * (6) sort the temporary array of buffer blocks in the order of their PBNs.
+ * This lets us organize sequences of contiguous physical blocks, so that one
+ * set_memory_* call can cover a whole sequence of memory pages rather than
+ * one call per page; the longer the sequence, the more efficient it is.
+ * (7) scan the sorted list, form sequences of contiguous physical blocks,
+ * and call _pmbd_buffer_flush_range() to synchronize the sequences one by one
+ *
+ * (8) get the flush_lock again
+ * (9) update the pos_dirty and num_dirty to reflect the recent changes
+ * (10) release the flush_lock
+ *
+ * NOTE: The caller must not hold flush_lock and buffer_lock, but can hold
+ * pbi->lock. 
+ *
+ */
+static unsigned long pmbd_buffer_flush(PMBD_BUFFER_T* buffer, unsigned long num_to_clean)
+{
+	BBN_T i = 0;
+	BBN_T bbn_s = 0;
+	BBN_T bbn_e = 0; 
+	PBN_T first_pbn = 0;
+	PBN_T last_pbn = 0;
+	unsigned long num_cleaned = 0;
+	unsigned long num_scanned = 0; 
+	PMBD_DEVICE_T* pmbd = buffer->pmbd;
+	PMBD_BSORT_ENTRY_T* bbi_sort_buffer = buffer->bbi_sort_buffer;
+
+	/* lock the flush_lock to ensure no-one else can do flush in parallel */
+	spin_lock(&buffer->flush_lock);
+
+	/* now we lock the buffer to protect buffer control info */
+	spin_lock(&buffer->buffer_lock);
+
+	/* check if num_to_clean is too large */
+	if (num_to_clean > buffer->num_dirty)
+		num_to_clean = buffer->num_dirty;
+
+	/* check if the buffer is empty (someone else may have done the flushing job) */
+	if (PMBD_BUFFER_IS_EMPTY(buffer) || num_to_clean == 0) {
+		spin_unlock(&buffer->buffer_lock);
+		goto done;
+	}
+
+	/* set up the range of BBNs we need to check */
+	bbn_s = buffer->pos_dirty; 				/* the first bbn */
+	bbn_e = PMBD_BUFFER_PRIO_POS(buffer, buffer->pos_clean);/* the last bbn */
+
+	/* scan the buffer range and put it into the sort buffer */ 
+	/* 
+         * NOTE: bbn_s could be equal to PMBD_BUFFER_NEXT_POS(buffer, bbn_e), if
+         * the buffer is filled with dirty blocks, so we need to check num_scanned
+         * here. 
+         * */
+	for (i = bbn_s; 
+	    (i != PMBD_BUFFER_NEXT_POS(buffer, bbn_e)) || (num_scanned == 0); 
+	     i = PMBD_BUFFER_NEXT_POS(buffer, i)) {
+		/* 
+		 * FIXME: it used to be possible that some blocks in the dirty
+		 * block range were "clean": after a block was allocated, and
+		 * before it was written, the block was marked CLEAN although
+		 * it was already allocated. It was still safe to attempt to
+		 * flush such a block, because the pbi->lock would protect us.
+		 *
+		 * UPDATE: the allocator code now marks a block dirty as soon
+		 * as it is allocated, so the situation described above can no
+		 * longer happen. 
+		 */
+		if(PMBD_BUFFER_BBI_IS_CLEAN(buffer, i)){ 
+			/* found clean blocks */
+			panic("ERR: %s(%d)%u: found clean block in the range of dirty blocks (bbn_s=%lu bbn_e=%lu, i=%lu, num_scanned=%lu num_to_clean=%lu num_dirty=%lu pos_dirty=%lu pos_clean=%lu)\n", 
+					__FUNCTION__, __LINE__, __CURRENT_PID__,bbn_s, bbn_e, i, num_scanned, num_to_clean, buffer->num_dirty, buffer->pos_dirty, buffer->pos_clean);
+			continue;
+		} else {
+			PMBD_BBI_T* bbi = PMBD_BUFFER_BBI(buffer, i);
+			PMBD_BSORT_ENTRY_T* se = bbi_sort_buffer + num_scanned;
+
+			/* add it to the buffer for sorting */
+			se->pbn = bbi->pbn;
+			se->bbn = i;
+			num_scanned ++;
+
+			/* only clean num_to_clean blocks */
+			if (num_scanned >= num_to_clean)
+				break;
+		}
+	}
+	/* unlock the buffer to let allocator continue */
+	spin_unlock(&buffer->buffer_lock);
+
+	/* if no valid dirty block to be cleaned*/
+	if (num_scanned == 0)
+		goto done;
+
+	/* 
+	 * sort the buffer to get sequences of contiguous blocks 
+	 */
+	if (PMBD_DEV_USE_WPMODE_PTE(pmbd))
+		sort(bbi_sort_buffer, num_scanned, sizeof(PMBD_BSORT_ENTRY_T), compare_bbi_sort_entries, swap_bbi_sort_entries);
+
+	/* scan the sorted list to organize and flush the sequences of contiguous PBNs */
+	for (i = 0; i < num_scanned; i ++) {
+		PMBD_BSORT_ENTRY_T* se = bbi_sort_buffer + i;
+		PMBD_BBI_T* bbi = PMBD_BUFFER_BBI(buffer, se->bbn);
+		if (i == 0) {
+			/* the first one */ 
+			first_pbn = bbi->pbn;
+			last_pbn = bbi->pbn;
+			continue;
+		} else {
+			if (bbi->pbn == (last_pbn + 1) ) {
+				/* if blocks are contiguous */
+				last_pbn = bbi->pbn;
+				continue;
+			} else {
+				/* if blocks are not contiguous */
+				num_cleaned += _pmbd_buffer_flush_range(buffer, first_pbn, last_pbn);
+
+				/* start a new sequence */
+				first_pbn = bbi->pbn;
+				last_pbn = bbi->pbn;
+				continue;
+			}
+		}
+	}
+
+	/* finish the last sequence of contiguous PBNs */
+	num_cleaned += _pmbd_buffer_flush_range(buffer, first_pbn, last_pbn);
+
+	/* update the buffer control info */
+	spin_lock(&buffer->buffer_lock);
+	buffer->pos_dirty = PMBD_BUFFER_NEXT_N_POS(buffer, bbn_s, num_cleaned);	/* move pos_dirty forward */
+	buffer->num_dirty -= num_cleaned;	/* decrement the counter*/
+	spin_unlock(&buffer->buffer_lock);
+
+done:
+	spin_unlock(&buffer->flush_lock);
+	return num_cleaned;
+}
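+
+/*
+ * Example of the run-forming loop above: if the sorted PBNs are
+ * {7, 8, 9, 13, 14, 20}, the loop calls _pmbd_buffer_flush_range() for
+ * [7..9] and [13..14], and the final call after the loop flushes [20..20].
+ * Three set_memory_* range changes are needed instead of six single-block
+ * changes.
+ */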
+
+/*
+ * entry function of flushing buffer
+ * This function is called by both allocator and syncer
+ * @pmbd: pmbd device
+ * @num_to_clean: how many blocks to clean 
+ * @caller: identifies the caller (CALLER_SYNCER, CALLER_ALLOCATOR, or CALLER_DESTROYER)
+ */
+static unsigned long pmbd_buffer_check_and_flush(PMBD_BUFFER_T* buffer, unsigned long num_to_clean, unsigned caller)
+{
+	unsigned long num_cleaned = 0;
+
+	/* 
+	 * Since more than one thread (e.g. allocator/syncer or
+	 * allocator/allocator) may try to flush the buffer, we first check
+	 * whether someone else has already done the job while we were waiting
+	 * for the lock. If so, we do not have to flush again. This improves
+	 * application responsiveness. 
+	 */
+	if (caller == CALLER_DESTROYER){
+		/* if destroyer calls this function, just flush everything */
+		goto do_it;
+
+	} else if (caller == CALLER_SYNCER) {
+		/* if syncer calls this function and the buffer is empty, do nothing */
+		spin_lock(&buffer->buffer_lock);
+		if (PMBD_BUFFER_IS_EMPTY(buffer)){
+			spin_unlock(&buffer->buffer_lock);
+			goto done;
+		}
+		spin_unlock(&buffer->buffer_lock);
+
+	} else if (caller == CALLER_ALLOCATOR){
+	
+		/* if the allocator calls this function but the buffer is no
+		 * longer full (some blocks have been freed), do nothing */
+		spin_lock(&buffer->buffer_lock);
+		if (!PMBD_BUFFER_IS_FULL(buffer)){
+			spin_unlock(&buffer->buffer_lock);
+			goto done;
+		}
+		spin_unlock(&buffer->buffer_lock);
+
+	} else {
+		panic("ERR: %s(%d) unknown caller id\n", __FUNCTION__, __LINE__);
+	}
+
+	/* otherwise, we do flushing */
+do_it:
+	num_cleaned = pmbd_buffer_flush(buffer, num_to_clean);
+
+done:
+	return num_cleaned;
+}
+
+/* 
+ * Core function of allocating a buffer block
+ * 
+ * We first grab the buffer_lock, and check to see if the buffer is full. If
+ * not, we allocate a buffer block, move the pos_clean, and update num_dirty,
+ * then release the buffer_lock. Since we already hold the pbi->lock, it is
+ * safe to release the lock and let other threads proceed (before we really
+ * write data into the buffer block), because no one else can read/write or
+ * access the same buffer block concurrently. If the buffer is full, we release
+ * the buffer_lock to allow others to proceed (because we may be blocked at
+ * flush_lock later), and then we call the function to synchronously flush the
+ * buffer. Note that someone else may be there already, so we may be blocked
+ * there; and if someone has already flushed the buffer, we need to grab the
+ * buffer_lock and check again whether a free buffer block is available.  
+ *
+ * NOTE: The caller must hold the pbi->lock.
+ *
+ */
+static PMBD_BBI_T* pmbd_buffer_alloc_block(PMBD_BUFFER_T* buffer, PBN_T pbn)
+{
+	BBN_T pos		= 0;
+	PMBD_BBI_T* bbi		= NULL;
+	PMBD_DEVICE_T* pmbd 	= buffer->pmbd;
+	PMBD_PBI_T* pbi 	= PMBD_BLOCK_PBI(pmbd, pbn);
+
+	/* lock the buffer control info (we will check and update it) */
+	spin_lock(&buffer->buffer_lock);
+
+check_again:
+	/* check if the buffer is completely full, if yes, flush it to PM */
+	if (PMBD_BUFFER_IS_FULL(buffer)) {
+		/* release the buffer_lock (someone may be doing flushing)*/
+		spin_unlock(&buffer->buffer_lock);
+
+		/* If the buffer is full, we must flush it synchronously.
+		 * 
+		 * NOTE: this on-demand flushing can improve performance a lot,
+		 * since the allocator does not have to wake up the syncer and
+		 * wait for it, which is much faster. Another merit is that it
+		 * makes the application run more smoothly (progress is abrupt
+		 * if we rely completely on the syncer). Also note that we only
+		 * flush a batch (e.g. 1024 blocks) rather than all the buffer
+		 * blocks, because only a few blocks are needed to satisfy the
+		 * application's own request; this reduces the time the
+		 * application spends on allocation. */
+		pmbd_buffer_check_and_flush(buffer, buffer->batch_size, CALLER_ALLOCATOR);
+
+		/* grab the lock and check the availability of free buffer blocks 
+		 * again, because someone may use up all the free buffer blocks, right
+		 * after the buffer is flushed but before we can get one */
+		spin_lock(&buffer->buffer_lock);
+		goto check_again;
+	} 
+
+	/* if buffer is not full, only reserve one spot first.
+	 * 
+	 * NOTE that we do not have to do link and memcpy in the locked region,
+	 * because pbi->lock guarantees that no-one else can use it now. This
+	 * moves the high-cost operations out of the critical section */
+	pos = buffer->pos_clean;
+	buffer->pos_clean = PMBD_BUFFER_NEXT_POS(buffer, buffer->pos_clean); 
+	buffer->num_dirty ++;
+
+	/* NOTE: we mark it "dirty" here, although the data has not actually
+	 * been written into the PMBD buffer block yet. This is safe, because
+	 * we are protected by the pbi->lock  */
+	PMBD_BUFFER_SET_BBI_DIRTY(buffer, pos); 
+
+	/* now link them up (no-one else can see it) */
+	bbi = PMBD_BUFFER_BBI(buffer, pos);
+
+	bbi->pbn = pbn;
+	pbi->bbn = pos;
+
+	/* unlock the buffer_lock and let others proceed */
+	spin_unlock(&buffer->buffer_lock);
+
+	return bbi;
+}
+
+
+/*
+ * syncer daemon worker function
+ */
+
+static inline uint64_t pmbd_device_is_idle(PMBD_DEVICE_T* pmbd)
+{
+	unsigned long last_jiffies, now_jiffies;
+	uint64_t interval = 0;
+
+	now_jiffies = jiffies;
+	PMBD_DEV_GET_ACCESS_TIME(pmbd, last_jiffies);
+	interval = jiffies_to_usecs(now_jiffies - last_jiffies);
+	
+	if (PMBD_DEV_IS_IDLE(pmbd, interval)) {
+		return interval;
+	} else {
+		return 0;
+	}
+}
+
+static int pmbd_syncer_worker(void* data)
+{
+	PMBD_BUFFER_T* buffer = (PMBD_BUFFER_T*) data;
+
+	set_user_nice(current, 0);
+
+	do {
+		unsigned do_flush  = 0;
+//		unsigned long loop = 0;
+		uint64_t idle_usec = 0;
+		spin_lock(&buffer->buffer_lock);
+
+		/* we start flushing, if 
+		 * (1) the num of dirty blocks hits the high watermark, or
+		 * (2) the device has been idle for a while */
+		if (PMBD_BUFFER_ABOVE_HW(buffer)) {
+			//printk("High watermark is hit\n");
+			do_flush = 1;
+		}
+//		if (pmbd_device_is_idle(buffer->pmbd) && !PMBD_BUFFER_IS_EMPTY(buffer)) {
+		if ((idle_usec = pmbd_device_is_idle(buffer->pmbd)) && PMBD_BUFFER_ABOVE_LW(buffer)) {
+			//printk("Device is idle for %llu uSeconds\n", idle_usec);
+			do_flush = 1;
+		}
+		if (do_flush){
+			unsigned long num_dirty = 0;
+			unsigned long num_cleaned = 0;
+repeat:
+			num_dirty = buffer->num_dirty;
+			spin_unlock(&buffer->buffer_lock);
+
+			/* start flushing 
+			 * 
+			 * NOTE: we only flush a batch (e.g. 1024 blocks) each time. The
+			 * purpose is to shorten the time applications wait for free
+			 * blocks: they can grab a few freed blocks and proceed, rather
+			 * than waiting for the whole buffer to be flushed. Otherwise,
+			 * bandwidth would be lower and applications could not run smoothly. 
+			 */
+			num_cleaned = pmbd_buffer_check_and_flush(buffer, buffer->batch_size, CALLER_SYNCER);
+			//printk("Syncer(%u) activated (%lu) - Before (%lu) Cleaned (%lu) After (%lu)\n", 
+			//		buffer->buffer_id, loop++, num_dirty, num_cleaned, buffer->num_dirty);
+			
+			/* continue to flush until we hit the low watermark */
+			spin_lock(&buffer->buffer_lock);
+			if (PMBD_BUFFER_ABOVE_LW(buffer)) {
+//			if (buffer->num_dirty > 0) {
+				goto repeat;
+			}
+		}
+		spin_unlock(&buffer->buffer_lock);
+
+		/* go to sleep */
+		set_current_state(TASK_INTERRUPTIBLE);
+		schedule_timeout(1);
+		set_current_state(TASK_RUNNING);
+
+	} while(!kthread_should_stop());
+	return 0;
+}
+
+static struct task_struct* pmbd_buffer_syncer_init(PMBD_BUFFER_T* buffer)
+{
+	struct task_struct* tsk = NULL;
+	tsk = kthread_run(pmbd_syncer_worker, (void*) buffer, "nsyncer");
+	if (IS_ERR(tsk)) {	/* kthread_run() returns ERR_PTR() on failure, not NULL */
+		printk(KERN_ERR "pmbd: initializing buffer syncer failed\n");
+		return NULL;
+	}
+
+	buffer->syncer = tsk;
+	printk("pmbd: buffer syncer launched\n");
+	return tsk;
+}
+
+static int pmbd_buffer_syncer_stop(PMBD_BUFFER_T* buffer)
+{
+	if (buffer->syncer){
+		kthread_stop(buffer->syncer);
+		buffer->syncer = NULL;
+		printk(KERN_INFO "pmbd: buffer syncer stopped\n");
+	}
+	return 0;
+}
+
+/*
+ * read and write to PMBD with buffer 
+ */ 
+static void copy_to_pmbd_buffered(PMBD_DEVICE_T* pmbd, void *src, sector_t sector, size_t bytes)
+{
+	PBN_T pbn   = 0;
+	void*  from = src;
+
+	/*
+	 * get the start and end in-block offset
+	 * 
+	 * NOTE: Since the buffer block (4096 bytes) can be larger than a
+	 * sector (512 bytes), if the incoming request is not completely
+	 * aligned to buffer blocks, we need to read the full block from PM
+	 * into the buffer block and apply writes to part of the buffer block.
+	 * Here, offset_s and offset_e are the start and end in-block offsets
+	 * (in units of sectors) for the first and the last sector in the
+	 * request; they may or may not fall in the same buffer block,
+	 * depending on the request size.
+	 */
+	PBN_T pbn_s = SECTOR_TO_PBN(pmbd, sector);
+	PBN_T pbn_e = BYTE_TO_PBN(pmbd, (SECTOR_TO_BYTE(sector) + bytes - 1));
+	sector_t offset_s = pmbd_buffer_aligned_request_start(pmbd, sector, bytes);
+	sector_t offset_e = pmbd_buffer_aligned_request_end(pmbd, sector, bytes);
+
+	/* for each physical block */
+	for (pbn = pbn_s; pbn <= pbn_e; pbn ++){
+		void* to 	= NULL;
+		PMBD_BBI_T* bbi = NULL;
+		PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
+		sector_t sect_s	= (pbn == pbn_s) ? offset_s : 0; /* sub-block access */
+		sector_t sect_e	= (pbn == pbn_e) ? offset_e : (PBN_TO_SECTOR(pmbd, 1) - 1);/* sub-block access */
+		size_t size 	= SECTOR_TO_BYTE(sect_e - sect_s + 1);	/* get the real size */
+		PMBD_BUFFER_T* buffer = PBN_TO_PMBD_BUFFER(pmbd, pbn);
+
+		/* lock the physical block first */
+		spin_lock(&pbi->lock);
+
+		/* check if the physical block is buffered */
+		bbi = _pmbd_buffer_lookup(buffer, pbn);
+
+		if (bbi){
+			/* if the block is already buffered */
+			to = PMBD_BUFFER_BLOCK(buffer, pbi->bbn) + SECTOR_TO_BYTE(sect_s);
+		} else {
+			/* if not buffered, allocate one free buffer block */
+			bbi = pmbd_buffer_alloc_block(buffer, pbn);
+
+			/* if not aligned to a full block, we have to copy the whole 
+			 * block from the PM space to the buffer block first */
+			if (size < pmbd->pb_size){
+				memcpy_from_pmbd(pmbd, PMBD_BUFFER_BLOCK(buffer, pbi->bbn), PMBD_BLOCK_VADDR(pmbd, pbn), pmbd->pb_size);
+			}
+			to = PMBD_BUFFER_BLOCK(buffer, pbi->bbn) + SECTOR_TO_BYTE(sect_s);
+		}
+		
+		/* writing it into buffer */
+		memcpy(to, from, size);
+		PMBD_BUFFER_SET_BBI_DIRTY(buffer, pbi->bbn);
+
+		/* unlock the block */
+		spin_unlock(&pbi->lock);
+
+		from += size;
+	}
+
+	return;
+}
+
+static void copy_from_pmbd_buffered(PMBD_DEVICE_T* pmbd, void *dst, sector_t sector, size_t bytes)
+{
+	PBN_T pbn = 0;
+	void*  to = dst;
+
+	/* get the start and end in-block offset */
+	PBN_T pbn_s 	= SECTOR_TO_PBN(pmbd, sector);
+	PBN_T pbn_e 	= BYTE_TO_PBN(pmbd, SECTOR_TO_BYTE(sector) + bytes - 1);
+	sector_t offset_s = pmbd_buffer_aligned_request_start(pmbd, sector, bytes);
+	sector_t offset_e = pmbd_buffer_aligned_request_end(pmbd, sector, bytes);
+
+	for (pbn = pbn_s; pbn <= pbn_e; pbn ++){
+		/* Scan the incoming request and check each block, for each block, we
+		 * check if it is in the buffer. If true, we read it from the buffer,
+		 * otherwise, we read from the PM space. */
+
+		void* from 	= NULL;
+		PMBD_BBI_T* bbi = NULL;
+		PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
+		sector_t sect_s	= (pbn == pbn_s) ? offset_s : 0;				
+		sector_t sect_e	= (pbn == pbn_e) ? offset_e : (PBN_TO_SECTOR(pmbd, 1) - 1);/* sub-block access */
+		size_t size 	= SECTOR_TO_BYTE(sect_e - sect_s + 1);	/* get the real size */
+		PMBD_BUFFER_T* buffer = PBN_TO_PMBD_BUFFER(pmbd, pbn);
+
+		/* lock the physical block first */
+		spin_lock(&pbi->lock);
+
+		/* check if the block is in the buffer */
+		bbi = _pmbd_buffer_lookup(buffer, pbn);
+
+		/* start reading data */
+		if (bbi) { 
+			/* if buffered, read it from the buffer */
+			from = PMBD_BUFFER_BLOCK(buffer, pbi->bbn) + SECTOR_TO_BYTE(sect_s);
+
+			/* read it out */
+			memcpy(to, from, size);
+
+		} else {
+			/* if not buffered, read it from PM space */
+			from = PMBD_BLOCK_VADDR(pmbd, pbn) + SECTOR_TO_BYTE(sect_s);
+
+			/* verify the checksum first */
+			if (PMBD_USE_CHECKSUM())
+				pmbd_checksum_on_read(pmbd, from, size);
+
+			/* read it out*/
+			memcpy_from_pmbd(pmbd, to, from, size);
+		}
+
+		/* unlock the block */
+		spin_unlock(&pbi->lock);
+
+		to += size;
+	}
+
+	return;
+}
+
+/*
+ * buffer related space alloc/free functions
+ */
+static int pmbd_pbi_space_alloc(PMBD_DEVICE_T* pmbd)
+{
+	int err = 0;
+
+	/* allocate the pbi space */
+	pmbd->pbi_space = vmalloc(PMBD_TOTAL_PB_NUM(pmbd) * sizeof(PMBD_PBI_T));
+	if (pmbd->pbi_space) {
+		PBN_T i;
+		for (i = 0; i < PMBD_TOTAL_PB_NUM(pmbd); i ++) {
+			PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, i);
+			PMBD_SET_BLOCK_UNBUFFERED(pmbd, i);
+			spin_lock_init(&pbi->lock);
+		}
+		printk(KERN_INFO "pmbd(%d): pbi space is initialized\n", pmbd->pmbd_id);
+	} else {
+		err = -ENOMEM;
+	}
+
+	return err;
+}
+
+static int pmbd_pbi_space_free(PMBD_DEVICE_T* pmbd)
+{
+	if (pmbd->pbi_space){
+		vfree(pmbd->pbi_space);
+		pmbd->pbi_space = NULL;
+		printk(KERN_INFO "pmbd(%d): pbi space is freed\n", pmbd->pmbd_id);
+	}
+	return 0;
+}
+
+static PMBD_BUFFER_T* pmbd_buffer_create(PMBD_DEVICE_T* pmbd)
+{
+	int i;
+	PMBD_BUFFER_T* buffer = kzalloc (sizeof(PMBD_BUFFER_T), GFP_KERNEL);
+	if (!buffer){
+		goto fail;
+	}
+
+	/* link to the pmbd device */	
+	buffer->pmbd = pmbd;
+
+	/* set size */
+	if (g_pmbd_bufsize[pmbd->pmbd_id] > PMBD_BUFFER_MIN_BUFSIZE) {
+		buffer->num_blocks = MB_TO_BYTES(g_pmbd_bufsize[pmbd->pmbd_id]) / pmbd->pb_size;
+	} else {
+		if (PMBD_DEV_USE_BUFFER(pmbd)) {
+			printk(KERN_INFO "pmbd(%d): WARNING - buffer size too small (%llu MB). Buffer set to %d MB\n", 
+				pmbd->pmbd_id, g_pmbd_bufsize[pmbd->pmbd_id], PMBD_BUFFER_MIN_BUFSIZE);
+		}
+		buffer->num_blocks = MB_TO_BYTES(PMBD_BUFFER_MIN_BUFSIZE) / pmbd->pb_size;
+	}
+	
+	/* buffer space */
+	buffer->buffer_space = vmalloc(buffer->num_blocks * pmbd->pb_size);
+	if (!buffer->buffer_space)
+		goto fail;
+
+	/* BBI array */
+	buffer->bbi_space = vmalloc(buffer->num_blocks * sizeof(PMBD_BBI_T));
+	if (!buffer->bbi_space)
+		goto fail;
+	memset(buffer->bbi_space, 0, buffer->num_blocks * sizeof(PMBD_BBI_T));
+
+	/* temporary array of bbi for sorting */
+	buffer->bbi_sort_buffer = vmalloc(buffer->num_blocks * sizeof(PMBD_BSORT_ENTRY_T));
+	if (!buffer->bbi_sort_buffer)
+		goto fail;
+
+	/* initialize the locks*/
+	spin_lock_init(&buffer->buffer_lock);
+	spin_lock_init(&buffer->flush_lock);
+
+	/* initialize the BBI array */
+	for (i = 0; i < buffer->num_blocks; i ++){
+		PMBD_BUFFER_SET_BBI_CLEAN(buffer, i);
+		PMBD_BUFFER_SET_BBI_UNBUFFERED(buffer, i);
+	}
+	
+	/* initialize the buffer control info */
+	buffer->num_dirty = 0;
+	buffer->pos_dirty = 0;
+	buffer->pos_clean = 0;
+	buffer->batch_size = g_pmbd_buffer_batch_size[pmbd->pmbd_id];
+
+	/* launch the syncer daemon */
+	pmbd_buffer_syncer_init(buffer);
+	if (!buffer->syncer) 
+		goto fail;
+
+	printk(KERN_INFO "pmbd: pmbd device buffer (%u) allocated (%lu blocks - block size %u bytes)\n", 
+			buffer->buffer_id, buffer->num_blocks, pmbd->pb_size);
+	return buffer;
+
+fail:
+	if (buffer && buffer->bbi_sort_buffer)
+		vfree(buffer->bbi_sort_buffer);
+	if (buffer && buffer->bbi_space)
+		vfree(buffer->bbi_space);
+	if (buffer && buffer->buffer_space)
+		vfree(buffer->buffer_space);
+	if (buffer)
+		kfree(buffer);
+	printk(KERN_ERR "%s(%d) buffer allocation failed\n", __FUNCTION__, __LINE__);
+	return NULL;
+}
+
+static int pmbd_buffer_destroy(PMBD_BUFFER_T* buffer)
+{
+	unsigned id = buffer->buffer_id;
+
+	/* stop syncer first */
+	pmbd_buffer_syncer_stop(buffer);
+	
+	/* flush the buffer to the PM space */
+	pmbd_buffer_check_and_flush(buffer, buffer->num_blocks, CALLER_DESTROYER);
+	
+	/* FIXME: wait for the on-going operations to finish first? */
+	if (buffer && buffer->bbi_sort_buffer)
+		vfree(buffer->bbi_sort_buffer);
+	if (buffer && buffer->bbi_space)
+		vfree(buffer->bbi_space);
+	if (buffer && buffer->buffer_space)
+		vfree(buffer->buffer_space);
+	if (buffer)
+		kfree(buffer);
+	printk(KERN_INFO "pmbd: pmbd device buffer (%u) space freed\n", id);
+	return 0;
+}
+
+static int pmbd_buffers_create(PMBD_DEVICE_T* pmbd)
+{
+	int i;
+	for (i = 0; i < pmbd->num_buffers; i ++){
+		pmbd->buffers[i] = pmbd_buffer_create(pmbd);
+		if (pmbd->buffers[i] == NULL)
+			return -ENOMEM;
+		(pmbd->buffers[i])->buffer_id = i;
+	}
+	return 0;
+}
+
+static int pmbd_buffers_destroy(PMBD_DEVICE_T* pmbd)
+{
+	int i;
+	for (i = 0; i < pmbd->num_buffers; i ++){
+		if(pmbd->buffers[i]){
+			pmbd_buffer_destroy(pmbd->buffers[i]);
+			pmbd->buffers[i] = NULL;
+		}
+	}
+	return 0;
+}
+
+static int pmbd_buffer_space_alloc(PMBD_DEVICE_T* pmbd)
+{
+	int err = 0;
+	
+	if (pmbd->num_buffers <= 0)
+		return 0;
+
+	/* allocate buffers array */
+	pmbd->buffers = kzalloc (sizeof(PMBD_BUFFER_T*) * pmbd->num_buffers, GFP_KERNEL);
+	if (pmbd->buffers == NULL){
+		err = -ENOMEM;
+		goto fail;
+	}
+
+	/* allocate each buffer */
+	err = pmbd_buffers_create(pmbd);
+	printk(KERN_INFO "pmbd: pmbd buffer space allocated.\n");
+fail:
+	return err;
+}
+
+static int pmbd_buffer_space_free(PMBD_DEVICE_T* pmbd)
+{
+	if (pmbd->num_buffers <=0)
+		return 0;
+
+	pmbd_buffers_destroy(pmbd);
+	kfree(pmbd->buffers);
+	pmbd->buffers = NULL;
+	printk(KERN_INFO "pmbd: pmbd buffer space freed.\n");
+
+	return 0;
+}
+
+
+/*
+ * *************************************************************************
+ * High memory based PMBD functions
+ * *************************************************************************
+ *
+ * NOTE:
+ * (1) memcpy_fromio() and memcpy_toio() are used for reading/writing PM,
+ *     though this is unnecessary on x86 architectures.
+ * (2) Currently we allocate the reserved space to the PMBD devices only once.  
+ *     No dynamic allocation/deallocation of the space has been needed so far. 
+ */
+
+
+static void* pmbd_highmem_map(void)
+{
+	/* 
+	 * NOTE: we can also use ioremap_* functions to directly set memory
+	 * page attributes when do remapping, but to make it consistent with
+	 * using vmalloc(), we do ioremap_cache() and call set_memory_* later. 
+	 */
+
+	if (PMBD_USE_PMAP()){
+		/* NOTE: If we use pmap(), we don't need to map the reserved
+		 * physical memory into the kernel space. Instead we use
+		 * pmap_atomic() to map and unmap the to-be-accessed pages on
+		 * demand. Since such a mapping is private to the processor,
+		 * there is no need for PTE changes or TLB shootdowns. 
+		 *
+		 * Also note that we use PMBD_PMAP_DUMMY_BASE_VA to make the rest
+		 * of the code happy with a valid virtual address. The real
+		 * physical address is calculated as follows:
+		 * g_highmem_phys_addr + (vaddr) - PMBD_PMAP_DUMMY_BASE_VA 
+		 *
+		 * (updated 10/25/2011) 
+		 */
+
+		g_highmem_virt_addr = (void*) PMBD_PMAP_DUMMY_BASE_VA;
+		g_highmem_curr_addr = g_highmem_virt_addr;
+		printk(KERN_INFO "pmbd: PMAP enabled - setting g_highmem_virt_addr to a dummy address (%d)\n", PMBD_PMAP_DUMMY_BASE_VA);
+		return g_highmem_virt_addr;
+
+	} else if ((g_highmem_virt_addr = ioremap_prot(g_highmem_phys_addr, g_highmem_size, g_pmbd_cpu_cache_flag))) {
+
+		g_highmem_curr_addr = g_highmem_virt_addr;
+		printk(KERN_INFO "pmbd: high memory space remapped (offset: %llu MB, size=%lu MB, cache flag=%s)\n",
+			BYTES_TO_MB(g_highmem_phys_addr), BYTES_TO_MB(g_highmem_size), PMBD_CPU_CACHE_FLAG());
+		return g_highmem_virt_addr;
+
+	} else {
+
+		printk(KERN_ERR "pmbd: %s(%d) - failed remapping high memory space (offset: %llu MB size=%lu MB)\n",
+			__FUNCTION__, __LINE__, BYTES_TO_MB(g_highmem_phys_addr), BYTES_TO_MB(g_highmem_size));
+		return NULL;
+	}
+}
+
+static void pmbd_highmem_unmap(void)
+{
+	/* de-remap the high memory from kernel address space */
+	/* NOTE: if we use pmap(), the g_highmem_virt_addr is fake */
+	if (!PMBD_USE_PMAP()){ 
+		if(g_highmem_virt_addr){
+			iounmap(g_highmem_virt_addr);
+			g_highmem_virt_addr = NULL;
+			printk(KERN_INFO "pmbd: high memory space (offset: %llu MB, size=%lu MB) unmapped\n",
+				BYTES_TO_MB(g_highmem_phys_addr), BYTES_TO_MB(g_highmem_size));
+		}
+	}
+	return;
+}
+
+static void* hmalloc(uint64_t bytes)
+{
+	void* rtn = NULL;
+	
+	/* check if there is still available reserved high memory space */
+	if (bytes <= PMBD_HIGHMEM_AVAILABLE_SPACE) {
+		rtn = g_highmem_curr_addr;
+		g_highmem_curr_addr += bytes;
+	} else {
+		printk(KERN_ERR "pmbd: %s(%d) - no available space (< %llu bytes) in reserved high memory\n", 
+			__FUNCTION__, __LINE__, bytes);
+	}
+	return rtn;
+}
+
+static int hfree(void* addr)
+{	
+	/* FIXME: no support for dynamic alloc/dealloc in HIGH_MEM space */
+	return 0;
+}
+
+
+/*
+ * *************************************************************************
+ * Device Emulation
+ * *************************************************************************
+ *
+ * Our emulation is based on a simple model - access time and transfer time.
+ *
+ *     emulated time = access time + (request size / bandwidth)
+ *     inserted delay = emulated time - observed time
+ *
+ * (1) Access time is applied to each request. We check each request's real
+ * access time and pad it with an extra delay to meet the designated latency.
+ * This is a best-effort solution, which means we just guarantee that no
+ * request can be completed with a response time less than the specified
+ * latency, but the real access latencies could be higher. In addition, if the
+ * total number of threads is larger than the number of available processors,
+ * the simulated latencies could be higher, due to CPU saturation. 
+ *
+ * (2) Transfer time is calculated based on batches
+ *     - A batch is a sequence of consecutive requests with a short interval in
+ *     between; requests in a batch can be overlapped with each other (parallel
+ *     jobs); there is a limit for the total amount of data and the duration of
+ *     a batch 
+ *     - For each batch, we calculate its target emulated transfer time as
+ *     "emul_trans_time = num_sectors/emul_bandwidth" and calculate a delay as
+ *     "delay = emul_trans_time - real_trans_time"
+ *     - The calculated delay is applied to each batch at the end
+ *     - A lock is used to slow down all threads, because bandwidth is a
+ *     system-wide specification. In this way, we serialize the threads
+ *     accessing the device, which simulates that the device is busy on a task.
+ *
+ * (3) Two types of delays are implemented 
+ *     - Sync delay:  if the delay is less than 10ms, we keep polling the TSC
+ *     counter, which is basically "busy waiting", like a spin-lock. This
+ *     achieves a precision of around a hundred cycles.
+ *     - Async delay: if the delay is more than 10ms, we call msleep() to sleep
+ *     for a while, which relinquishes the CPU and results in lower precision.
+ *     The left-over sub-millisecond delay is made up with a sync delay.  Async
+ *     delays cannot be used while holding a lock.
+ *
+ */ 
+
+
+static inline uint64_t DIV64_ROUND(uint64_t dividend, uint64_t divisor)
+{
+	if (divisor > 0) {
+		uint64_t quot1 = dividend / divisor;
+		uint64_t mod   = dividend % divisor;
+		uint64_t quot2 = (mod * 2) / divisor;	/* 1 if the remainder rounds up */
+		return quot1 + quot2;
+	} else { // FIXME: how to handle this?
+		printk(KERN_WARNING "pmbd: WARNING - %s(%d) divisor is zero\n", __FUNCTION__, __LINE__);
+		return 0;
+	}
+}
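+
+/*
+ * Example: DIV64_ROUND(7, 2) = 3 + (1*2)/2 = 4 and DIV64_ROUND(7, 3) =
+ * 2 + (1*2)/3 = 2, i.e. the quotient is rounded to the nearest integer
+ * (halves round up).
+ */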
+
+static inline unsigned int get_cpu_freq(void)
+{
+#if 0
+	unsigned int khz = cpufreq_quick_get(0);  /* FIXME: use cpufreq_get() ??? */
+	if (!khz) 
+		khz = cpu_khz;
+	printk("khz=%u, cpu_khz=%u\n", khz, cpu_khz);
+#endif
+	return cpu_khz;
+}
+
+static inline uint64_t _cycle_to_ns(uint64_t cycle, unsigned int khz)
+{
+	return cycle * 1000000 / khz;
+}
+
+static inline uint64_t cycle_to_ns(uint64_t cycle)
+{
+	unsigned int khz = get_cpu_freq();
+	return _cycle_to_ns(cycle, khz);
+}
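+
+/*
+ * Example: on a hypothetical 2.4 GHz CPU (khz = 2400000), 4800 cycles
+ * convert to 4800 * 1000000 / 2400000 = 2000 ns.
+ */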
+
+/* 
+ * emulate the latency for a given request size/type on a device  
+ * @num_sectors: num of sectors to read/write
+ * @rw: read or write
+ * @pmbd: the pmbd device
+ */
+static uint64_t cal_trans_time(unsigned int num_sectors, unsigned rw, PMBD_DEVICE_T* pmbd)
+{
+	uint64_t ns = 0;
+	uint64_t bw = (rw == READ) ? pmbd->rdbw : pmbd->wrbw;   /* bandwidth */
+	if (bw) {
+		uint64_t tmp = num_sectors * PMBD_SECTOR_SIZE;
+		uint64_t tt = 1000000000UL >> MB_SHIFT;
+		ns += DIV64_ROUND((tmp * tt), bw);
+	}
+	return ns;
+}
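+
+/*
+ * The constant tt above is 10^9 >> MB_SHIFT (= 953 with MB_SHIFT = 20), so
+ * ns = bytes * (10^9 / 2^20) / bw converts a bandwidth given in MB/s into a
+ * transfer time in nanoseconds. Example (hypothetical values): 8 sectors =
+ * 4096 bytes at bw = 1000 MB/s gives DIV64_ROUND(4096 * 953, 1000) = 3903 ns,
+ * i.e. roughly 3.9 us for 4KB at 1 GB/s.
+ */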
+
+static uint64_t cal_access_time(unsigned int num_sectors, unsigned rw, PMBD_DEVICE_T* pmbd)
+{
+	uint64_t ns = (rw == READ) ? pmbd->rdlat : pmbd->wrlat; /* access time */
+	return ns;
+}
+
+static inline void sync_slowdown(uint64_t ns)
+{
+	uint64_t start, now;
+	unsigned int khz = get_cpu_freq();
+	if (ns) {
+		/* 
+		 * We keep reading the TSC counter to check whether the delay
+		 * has elapsed; this also prevents the CPU from being scaled
+		 * down, which provides a stable estimate of the elapsed time.
+		 */
+		TIMESTAMP(start);
+		while(1) {
+			TIMESTAMP(now);
+			if (_cycle_to_ns((now-start), khz) > ns)
+				break;
+		}
+	}
+	return;
+}
+
+static inline void sync_slowdown_cycles(uint64_t cycles)
+{
+
+	uint64_t start, now;
+	if (cycles){
+		/* 
+		 * We keep reading the TSC counter to check whether the delay
+		 * has elapsed; this also prevents the CPU from being scaled
+		 * down, which provides a stable estimate of the elapsed time.
+		 */
+		TIMESTAMP(start);
+		while(1) {
+			TIMESTAMP(now);
+			if ((now - start) >= cycles)
+				break;
+		}
+	}
+	return;
+}
+
+static inline void async_slowdown(uint64_t ns)
+{
+	uint64_t ms = ns / 1000000;
+	uint64_t left = ns - (ms * 1000000);
+	/* do ms delay with sleep */
+	msleep(ms);		
+
+	/* make up the sub-ms delay */
+	sync_slowdown(left);	
+}
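+
+/*
+ * Example: async_slowdown(12345678) sleeps for 12 ms via msleep() and then
+ * busy-waits for the remaining 345678 ns with sync_slowdown().
+ */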
+
+#if 0
+static inline void slowdown_us(unsigned long long us)
+{
+	set_current_state(TASK_INTERRUPTIBLE);
+	schedule_timeout(us * HZ / 1000000);
+}
+#endif
+
+static void pmbd_slowdown(uint64_t ns, unsigned in_lock)
+{
+	/* 
+	 * NOTE: if the delay is less than 10ms (or a lock is held), we use
+	 * sync_slowdown() to keep polling the CPU cycle counter, busy-waiting
+	 * for the delay to elapse; otherwise, we use msleep() to relinquish
+	 * the CPU. 
+	 */
+	if (ns > MAX_SYNC_SLOWDOWN && !in_lock)
+		async_slowdown(ns);
+	else if (ns > 0) 
+		sync_slowdown(ns);
+
+	return;
+}
+
+/*
+ * Emulating the transfer time for a batch of requests for specific bandwidth
+ * 
+ * We group a run of consecutive requests into a "batch". Within one batch,
+ * the interval between two consecutive requests should be small, the total
+ * amount of accessed data should be reasonable (not too small, not too
+ * large), and the duration should not be too long. For each batch, we
+ * estimate the emulated transfer time and compare it with the real transfer
+ * time (from the start to the end of the batch); if the real transfer time
+ * is less than the emulated time, we apply an extra delay at the end of the
+ * batch to make up the difference. In this way the bandwidth emulation comes
+ * closer to the real situation. Note that, since requests from multiple
+ * threads can be processed in parallel, we must slow down ALL the threads
+ * accessing the PMBD device; thus, we use batch_lock to coordinate all threads. 
+ *
+ * @num_sectors: the num of sectors of the request
+ * @rw: read or write
+ * @pmbd: the involved pmbd device
+ *
+ */
+
+static void pmbd_emul_transfer_time(int num_sectors, int rw, PMBD_DEVICE_T* pmbd)
+{
+	uint64_t interval_ns 	= 0;
+	uint64_t duration_ns 	= 0; 
+	unsigned new_batch 	= FALSE;
+	unsigned end_batch 	= FALSE;
+	uint64_t now_cycle 	= 0;
+
+	spin_lock(&pmbd->batch_lock);
+
+	/* get a timestamp for now */
+	TIMESTAMP(now_cycle);
+
+	/* if this is the first timestamp */
+	if (pmbd->batch_start_cycle[rw] == 0) {
+		pmbd->batch_start_cycle[rw] = now_cycle;
+		pmbd->batch_end_cycle[rw] = now_cycle;
+		goto done;
+	}
+
+	/* calculate the interval from the last request */
+	if (now_cycle >= pmbd->batch_end_cycle[rw]){
+		interval_ns = cycle_to_ns(now_cycle - pmbd->batch_end_cycle[rw]); 
+	} else {
+		panic("%s(%d): timestamp in the past found\n", __FUNCTION__, __LINE__);
+	}
+
+	/* check the interval length (cannot be too distant) */
+	if (interval_ns >= PMBD_BATCH_MAX_INTERVAL) {
+		/* interval is too big, break it to two batches */
+		new_batch = TRUE;
+		end_batch = TRUE;
+	} else {
+		/* still in the same batch, good */
+		pmbd->batch_sectors[rw] += num_sectors;
+		pmbd->batch_end_cycle[rw] = now_cycle;
+	}
+
+	/* check current batch duration (cannot be too long) */
+	duration_ns = cycle_to_ns(pmbd->batch_end_cycle[rw] - pmbd->batch_start_cycle[rw]);
+	if (duration_ns >= PMBD_BATCH_MAX_DURATION) 
+		end_batch = TRUE;
+
+	/* check current batch data amount (cannot be too large) */
+	if (pmbd->batch_sectors[rw] >= PMBD_BATCH_MAX_SECTORS)
+		end_batch = TRUE;
+
+	/* if the batch ends, check and apply slow-down */
+	if (end_batch) {
+		/* batch size must be large enough, if not, just skip it */
+		if (pmbd->batch_sectors[rw] > PMBD_BATCH_MIN_SECTORS) {
+			uint64_t real_ns = cycle_to_ns(pmbd->batch_end_cycle[rw] - pmbd->batch_start_cycle[rw]);
+			uint64_t emul_ns = cal_trans_time(pmbd->batch_sectors[rw], rw, pmbd);
+
+			if (emul_ns > real_ns)
+				pmbd_slowdown((emul_ns - real_ns), TRUE);
+		}
+
+		pmbd->batch_sectors[rw] = 0;
+		pmbd->batch_start_cycle[rw] = now_cycle;
+		pmbd->batch_end_cycle[rw] = now_cycle;
+	}
+
+	/* if a new batch begins, add the first request */
+	if (new_batch) {
+		pmbd->batch_sectors[rw] = num_sectors;
+		pmbd->batch_start_cycle[rw] = now_cycle;
+		pmbd->batch_end_cycle[rw] = now_cycle;
+	}
+
+done:
+	spin_unlock(&pmbd->batch_lock);
+	return;
+}
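+
+/*
+ * Worked example (hypothetical values): a batch of 2048 sectors (1 MB) with
+ * wrbw = 200 MB/s has an emulated transfer time of about 5 ms; if the batch
+ * really took 1 ms, pmbd_slowdown() inserts the remaining ~4 ms. The delay
+ * is applied synchronously (in_lock = TRUE) because batch_lock is held.
+ */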
+
+/*
+ * Emulating access time for a request
+ *
+ * Unlike bandwidth emulation, we emulate access time for each individual
+ * access. Right after we simulate the transfer time, we examine the real
+ * access time (including transfer time); if the real time is smaller than
+ * the specified access time, we slow down the request by applying a delay
+ * to make up the difference.  Note that we do not use any lock to coordinate
+ * multiple threads for a system-wide "slowdown", but apply this delay to each
+ * request individually and separately. 
+ *
+ * Also note that since we basically use "busy-waiting", when the total number
+ * of threads exceeds or is close to the total number of processors, the
+ * simulated access time observed at the application level could be longer
+ * than the specified access time due to high CPU usage. But measured directly
+ * as the time spent in the make_request() function, the simulated access time
+ * of each request is still very precise. 
+ *
+ */ 
+static void pmbd_emul_access_time(uint64_t start, uint64_t end, int num_sectors, int rw, PMBD_DEVICE_T* pmbd)
+{
+	/* 
+	 * Access time can be overlapped with each other, so there is no need
+	 * to use a lock to serialize it.
+	 * FIXME: should we apply this on each batch or each request?
+	 */
+	uint64_t real_ns = cycle_to_ns(end - start);
+	uint64_t emul_ns = cal_access_time(num_sectors, rw, pmbd);
+
+	if (emul_ns > real_ns)
+		pmbd_slowdown((emul_ns - real_ns), FALSE);
+		
+	return;
+}
+
+/* 
+ * set the starting hook for PM emulation 
+ *
+ * @pmbd: pmbd device
+ * @num_sectors: sectors being accessed
+ * @rw: READ/WRITE
+ * return value: the start cycle
+ */
+static uint64_t emul_start(PMBD_DEVICE_T* pmbd, int num_sectors, int rw)
+{
+	uint64_t start = 0;
+	if (PMBD_DEV_USE_EMULATION(pmbd) && num_sectors > 0) {
+		/* start timer here */
+		TIMESTAMP(start);	
+	}
+	return start;
+}
+
+/* 
+ * set the stopping hook for PM emulation 
+ *
+ * @pmbd: pmbd device
+ * @num_sectors: sectors being accessed
+ * @rw: READ/WRITE
+ * @start: the starting cycle
+ * return value: the end cycle
+ */
+static uint64_t emul_end(PMBD_DEVICE_T* pmbd, int num_sectors, int rw, uint64_t start)
+{
+	uint64_t end = 0;
+	uint64_t end2 = 0;
+	/*
+	 * NOTE: emulation can be done in two ways - (1) directly specify the
+	 * read/write latencies and bandwidths (2) only specify a relative
+	 * slowdown ratio (X), compared to DRAM.
+	 *
+	 * Also note that if rdsx/wrsx is set, we will ignore
+	 * rdlat/wrlat/rdbw/wrbw. 
+	 */
+	if (PMBD_DEV_USE_EMULATION(pmbd) && num_sectors > 0) {
+		/* 
+		 * NOTE: we first attempt to meet the target bandwidth and then
+		 * the target latency. This means the actual bandwidth should be
+		 * close to the emulated bandwidth, and we then guarantee that
+		 * the latency is not SMALLER than the target latency. 
+		 */
+
+		/* emulate the bandwidth first */	
+		if (pmbd->rdbw > 0 && pmbd->wrbw > 0) {
+			/* emulate transfer time (bandwidth) */
+			pmbd_emul_transfer_time(num_sectors, rw, pmbd);
+		}
+
+		/* emulate the latency now */
+		TIMESTAMP(end);
+		if (pmbd->rdlat > 0 || pmbd->wrlat > 0) {
+			/* emulate access time (latency) */
+			pmbd_emul_access_time(start, end, num_sectors, rw, pmbd);
+		}
+	}
+	/* get the ending timestamp */
+	TIMESTAMP(end2);
+
+	return end2;
+}
+
+/*
+ * *************************************************************************
+ * PM space protection functions 
+ * - clflush
+ * - write protection
+ * - write verification
+ * - checksum
+ * *************************************************************************
+ */
+
+/* 
+ * flush designated cache lines in CPU cache 
+ */
+
+static inline void pmbd_clflush_all(PMBD_DEVICE_T* pmbd)
+{
+	uint64_t time_p1 = 0;
+	uint64_t time_p2 = 0;
+
+	TIMESTAMP(time_p1);
+	if (cpu_has_clflush){
+#ifdef CONFIG_X86
+		wbinvd_on_all_cpus();
+#else
+		printk(KERN_WARNING "pmbd: WARNING - %s(%d) flush_cache_all() not implemented\n", __FUNCTION__, __LINE__);
+#endif
+	}
+	TIMESTAMP(time_p2);
+
+	/* emulating slowdown */
+	if(PMBD_DEV_USE_SLOWDOWN(pmbd))
+		pmbd_rdwr_slowdown((pmbd), WRITE, time_p1, time_p2);
+
+	/* update time statistics */
+	if(PMBD_USE_TIMESTAT()){
+		int cid = CUR_CPU_ID();
+		PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+		pmbd_stat->cycles_clflushall[WRITE][cid] += time_p2 - time_p1;
+	}
+	return;
+}
+
+static inline void pmbd_clflush_range(PMBD_DEVICE_T* pmbd, void* dst, size_t bytes)
+{
+	uint64_t time_p1 = 0;
+	uint64_t time_p2 = 0;
+
+	TIMESTAMP(time_p1);
+	if (cpu_has_clflush){
+		clflush_cache_range(dst, bytes);
+	}
+	TIMESTAMP(time_p2);
+
+	/* emulating slowdown */
+	if(PMBD_DEV_USE_SLOWDOWN(pmbd))
+		pmbd_rdwr_slowdown((pmbd), WRITE, time_p1, time_p2);
+
+	/* update time statistics */
+	if(PMBD_USE_TIMESTAT()){
+		int cid = CUR_CPU_ID();
+		PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+		pmbd_stat->cycles_clflush[WRITE][cid] += time_p2 - time_p1;
+	}
+	return;
+}
+
+
+/* 
+ * Write-protection 
+ *
+ * Being used as storage, PMBD needs to provide some protection against
+ * accidental changes caused by wild pointers. So we initialize all the PM
+ * pages as read-only; before we perform write operations into the PM space,
+ * we set the pages writable, and after we are done, we set them back to
+ * read-only. This introduces extra overhead, but it is a realistic solution
+ * to the wild-pointer problem.
+ *
+ */
+
+/*
+ * set PM pages to read-only
+ * @addr -  the starting virtual address (PM space)
+ * @bytes - the range in bytes
+ * @on_access - whether this change is triggered by an incoming request (as opposed to device creation/destruction)
+ */
+
+static inline void pmbd_set_pages_ro(PMBD_DEVICE_T* pmbd, void* addr, uint64_t bytes, unsigned on_access)
+{
+	if (PMBD_USE_WRITE_PROTECTION()) {
+		/* FIXME: type conversion happens here */
+		/* FIXME: add range and bytes check here?? - not so necessary */
+		uint64_t time_p1 = 0;
+		uint64_t time_p2 = 0;
+		unsigned long offset = (unsigned long) addr;
+		unsigned long vaddr = PAGE_TO_VADDR(VADDR_TO_PAGE(offset));
+		int num_pages = VADDR_TO_PAGE(offset + bytes - 1) - VADDR_TO_PAGE(offset) + 1;
+
+		if(!(VADDR_IN_PMBD_SPACE(pmbd, addr) && VADDR_IN_PMBD_SPACE(pmbd, addr + bytes-1)))
+			printk(KERN_WARNING "pmbd: WARNING - %s(%d): PM space range exceeded (%lu : %d pages)\n", 
+					__FUNCTION__, __LINE__, vaddr, num_pages);
+
+		TIMESTAMP(time_p1);
+		set_memory_ro(vaddr, num_pages);
+		TIMESTAMP(time_p2);
+
+		/* update time statistics */
+//		if(PMBD_USE_TIMESTAT() && on_access){
+		if(PMBD_USE_TIMESTAT()){
+			int cid = CUR_CPU_ID();
+			PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+			pmbd_stat->cycles_setpages_ro[WRITE][cid] += time_p2 - time_p1;
+		}
+	}
+	return;
+}
+
+static inline void pmbd_set_pages_rw(PMBD_DEVICE_T* pmbd, void* addr, uint64_t bytes, unsigned on_access)
+{
+	if (PMBD_USE_WRITE_PROTECTION()) {
+		uint64_t time_p1 = 0;
+		uint64_t time_p2 = 0;
+		unsigned long offset = (unsigned long) addr;
+		unsigned long vaddr = PAGE_TO_VADDR(VADDR_TO_PAGE(offset));
+		int num_pages = VADDR_TO_PAGE(offset + bytes - 1) - VADDR_TO_PAGE(offset) + 1;
+
+		if(!(VADDR_IN_PMBD_SPACE(pmbd, addr) && VADDR_IN_PMBD_SPACE(pmbd, addr + bytes-1)))
+			printk(KERN_WARNING "pmbd: WARNING - %s(%d): PM space range exceeded (%lu : %d pages)\n", __FUNCTION__, __LINE__, vaddr, num_pages);
+
+		TIMESTAMP(time_p1);
+		set_memory_rw(vaddr, num_pages);
+		TIMESTAMP(time_p2);
+
+		/* update time statistics */
+//		if(PMBD_USE_TIMESTAT() && on_access){
+		if(PMBD_USE_TIMESTAT()){
+			int cid = CUR_CPU_ID();
+			PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+			pmbd_stat->cycles_setpages_rw[WRITE][cid] += time_p2 - time_p1;
+		}
+	}
+	return;
+}
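+
+/*
+ * Example of the page-range arithmetic above (assuming 4KB pages): a write
+ * of 8448 bytes starting 512 bytes into a page touches 3 pages, since
+ * num_pages = VADDR_TO_PAGE(offset + 8448 - 1) - VADDR_TO_PAGE(offset) + 1
+ * = 3, and a single set_memory_rw()/set_memory_ro() call covers all of them.
+ */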
+
+
+/*
+ * Write verification (EXPERIMENTAL)
+ *
+ * Note: Even if we do write protection by setting the PM space read-only,
+ * there is still a short vulnerable window when we write pages into the PM
+ * space - between the time the pages are set RW and the time they are set
+ * back to RO. So we need to verify that no data has been changed during this
+ * window by reading out the written data and comparing it with the source
+ * data. 
+ *
+ */
+
+
+static inline int pmbd_verify_wr_pages_pmap(PMBD_DEVICE_T* pmbd, void* pmbd_dummy_va, void* ram_va, size_t bytes)
+{
+
+	unsigned long flags = 0;
+
+	/*NOTE: we assume src is starting from 0 */
+	uint64_t pa = (uint64_t) PMBD_PMAP_VA_TO_PA(pmbd_dummy_va);
+
+	/* disable interrupt (FIXME: do we need to do this?)*/	
+	DISABLE_SAVE_IRQ(flags);
+
+	/* do the real work */
+	while(bytes){
+		uint64_t pfn = (pa >> PAGE_SHIFT);	// page frame number
+		unsigned off = pa & (~PAGE_MASK);	// offset in one page
+		unsigned size = MIN_OF((PAGE_SIZE - off), bytes); // the size to copy 
+
+		/* map it */
+		void * map = pmap_atomic_pfn(pfn, pmbd, WRITE);
+		void * pmbd_va = map + off;
+
+		/* do memcopy */
+		if (memcmp(pmbd_va, ram_va, size)){
+			punmap_atomic(map, pmbd, WRITE);
+			goto bad;
+		}
+
+		/* unmap it */
+		punmap_atomic(map, pmbd, WRITE);
+
+		/* prepare the next iteration */
+		ram_va  += size;
+		bytes 	-= size;
+		pa 	+= size;
+	}
+	
+	/* re-enable interrupt */
+	ENABLE_RESTORE_IRQ(flags);
+	return 0;
+
+bad:
+	ENABLE_RESTORE_IRQ(flags);
+	return -1;
+}
+
+
+static inline int pmbd_verify_wr_pages_nopmap(PMBD_DEVICE_T* pmbd, void* pmbd_va, void* ram_va, size_t bytes)
+{
+	if (memcmp(pmbd_va, ram_va, bytes)) 
+		return -1;
+	else
+		return 0;
+}
+
+static inline int pmbd_verify_wr_pages(PMBD_DEVICE_T* pmbd, void* pmbd_va, void* ram_va, size_t bytes)
+{
+	int rtn = 0;
+	uint64_t time_p1, time_p2;
+
+	TIMESTAT_POINT(time_p1);
+
+	/* check it */
+	if (PMBD_USE_PMAP())
+		rtn = pmbd_verify_wr_pages_pmap(pmbd, pmbd_va, ram_va, bytes);
+	else
+		rtn = pmbd_verify_wr_pages_nopmap(pmbd, pmbd_va, ram_va, bytes);
+
+	/* found mismatch */
+	if (rtn < 0){
+		panic("pmbd: *** writing into PM failed (error found) ***\n");
+		return -1;
+	}
+
+	TIMESTAT_POINT(time_p2);
+
+	/* timestamp */
+	if(PMBD_USE_TIMESTAT()){
+		int cid = CUR_CPU_ID();
+		PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+		pmbd_stat->cycles_wrverify[WRITE][cid] += time_p2 - time_p1;
+	}
+
+	return 0;
+}
+
+/*
+ * Checksum (EXPERIMENTAL)
+ *
+ * Note: With write-protection and write verification, we can largely reduce
+ * the risk of PM data corruption caused by wild in-kernel pointers; however,
+ * it is still possible for some data to get corrupted (e.g. PM pages being
+ * maliciously changed to writable). Thus, we provide another layer of
+ * protection by checksumming the PM pages. When writing a page, we compute a
+ * checksum and write it into memory; when reading a page, we compute its
+ * checksum and compare it with the stored checksum. If a mismatch is found,
+ * either the PM data or the checksum has been corrupted. 
+ *
+ * FIXME:
+ * (1) checksums should be stored in PM space; currently we just store them in RAM.
+ * (2) we should probably use the CPU cache to speed this up and avoid reading 
+ *     the same chunk of data again. 
+ * (3) currently we always allocate the checksum space, whether it is enabled or
+ *     disabled in the module config options; this may need to be made more
+ *     efficient in the future. 
+ *
+ */ 
+
+
+static int pmbd_checksum_space_alloc(PMBD_DEVICE_T* pmbd)
+{
+	int err = 0;
+
+	/* allocate checksum space */
+	pmbd->checksum_space= vmalloc(PMBD_CHECKSUM_TOTAL_NUM(pmbd) * sizeof(PMBD_CHECKSUM_T));
+	if (pmbd->checksum_space){
+		memset(pmbd->checksum_space, 0, (PMBD_CHECKSUM_TOTAL_NUM(pmbd) * sizeof(PMBD_CHECKSUM_T)));
+		printk(KERN_INFO "pmbd(%d): checksum space is allocated\n", pmbd->pmbd_id);
+	} else {
+		err = -ENOMEM;
+	}
+
+	/* allocate checksum buffer space */
+	pmbd->checksum_iomem_buf = vmalloc(pmbd->checksum_unit_size);
+	if (pmbd->checksum_iomem_buf){
+		memset(pmbd->checksum_iomem_buf, 0, pmbd->checksum_unit_size);
+		printk(KERN_INFO "pmbd(%d): checksum iomem buffer space is allocated\n", pmbd->pmbd_id);
+	} else {
+		err = -ENOMEM;
+	}
+
+	return err;
+}
+
+static int pmbd_checksum_space_free(PMBD_DEVICE_T* pmbd)
+{
+	if (pmbd->checksum_space) {
+		vfree(pmbd->checksum_space);
+		pmbd->checksum_space = NULL;
+		printk(KERN_INFO "pmbd(%d): checksum space is freed\n", pmbd->pmbd_id);
+	}
+	if (pmbd->checksum_iomem_buf) {
+		vfree(pmbd->checksum_iomem_buf);
+		pmbd->checksum_iomem_buf = NULL;
+		printk(KERN_INFO "pmbd(%d): checksum iomem buffer space is freed\n", pmbd->pmbd_id);
+	}
+	return 0;
+}
+
+
+/*
+ * Derived from linux/lib/crc32.c GPL v2
+ */
+static unsigned int crc32_my(unsigned char const *p, unsigned int len)
+{
+	int i;
+	unsigned int crc = 0;
+	while (len--) {
+		crc ^= *p++;
+		for (i = 0; i < 8; i++)
+			crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0);
+	}
+	return crc;
+}
+
+static inline PMBD_CHECKSUM_T pmbd_checksum_func(void* data, size_t size)
+{
+	return crc32_my(data, size);
+}
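+
+/*
+ * A minimal alternative sketch (not wired in): lib/crc32.c already exports
+ * crc32_le(), and since these checksums are only compared against values
+ * computed by this driver itself, any consistent CRC variant works. This
+ * assumes CONFIG_CRC32 is enabled (the mainline default).
+ */
+#if 0
+#include <linux/crc32.h>
+
+static inline PMBD_CHECKSUM_T pmbd_checksum_func_crc32le(void* data, size_t size)
+{
+	/* seed 0 matches crc32_my() above (no init/final inversion) */
+	return crc32_le(0, data, size);
+}
+#endif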
+
+/*
+ * calculate the checksum for one checksum unit
+ * @pmbd: the pmbd device
+ * @data: the virtual address of the target data (must be aligned to the
+ * checksum unit boundaries)
+ */ 
+
+
+static inline PMBD_CHECKSUM_T pmbd_cal_checksum(PMBD_DEVICE_T* pmbd, void* data)
+{
+	void* vaddr = data;
+	size_t size = pmbd->checksum_unit_size;
+	PMBD_CHECKSUM_T chk = 0;
+
+#if 0
+#ifndef CONFIG_X86
+	/*
+	 * Note: If we allocate the space with vmalloc(), we do not have to
+	 * copy the data into the checksum buffer first; however, if we use
+	 * High Memory, we should not directly dereference ioremapped data (on
+	 * non-x86 platforms), so we must first copy it into a temporary
+	 * buffer. This extra copy significantly slows down operations; the
+	 * code here simply avoids it on the x86 platform.
+	 * (see kernel/Documents/IO-mapping.txt)
+	 *
+	 */
+	if (PMBD_DEV_USE_HIGHMEM(pmbd) && VADDR_IN_PMBD_SPACE(pmbd, data)) {
+		memcpy_fromio(pmbd->checksum_iomem_buf, data, pmbd->checksum_unit_size);
+		vaddr = pmbd->checksum_iomem_buf;
+	} 
+#endif
+#endif
+
+	if (pmbd->checksum_unit_size != PAGE_SIZE){
+		panic("ERR: %s(%d) checksum unit size (%u) must be %lu\n", __FUNCTION__, __LINE__, pmbd->checksum_unit_size, PAGE_SIZE);
+		return 0;
+	}
+
+	/* FIXME: do we really need to copy the data out first (if not using pmap)? */
+	memcpy_from_pmbd(pmbd, pmbd->checksum_iomem_buf, data, pmbd->checksum_unit_size);
+
+	/* calculate the checksum */
+	vaddr = pmbd->checksum_iomem_buf;
+	chk = pmbd_checksum_func(vaddr, size);
+
+	return chk;
+}
+
+static int pmbd_checksum_on_write(PMBD_DEVICE_T* pmbd, void* vaddr, size_t bytes)
+{
+	unsigned long i;
+	unsigned long ck_id_s = VADDR_TO_CHECKSUM_IDX(pmbd, vaddr);
+	unsigned long ck_id_e = VADDR_TO_CHECKSUM_IDX(pmbd, (vaddr + bytes - 1));
+
+	uint64_t time_p1, time_p2;
+
+	TIMESTAT_POINT(time_p1);
+
+	for (i = ck_id_s; i <= ck_id_e; i ++){
+		void* data = CHECKSUM_IDX_TO_VADDR(pmbd, i);
+		void* chk = CHECKSUM_IDX_TO_CKADDR(pmbd, i); 
+
+		PMBD_CHECKSUM_T checksum = pmbd_cal_checksum(pmbd, data);
+		memcpy(chk, &checksum, sizeof(PMBD_CHECKSUM_T));
+	}
+
+	TIMESTAT_POINT(time_p2);
+
+	/* timestamp */
+	if(PMBD_USE_TIMESTAT()){
+		int cid = CUR_CPU_ID();
+		PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+		pmbd_stat->cycles_checksum[WRITE][cid] += time_p2 - time_p1;
+	}
+	return 0;
+}
+
+static int pmbd_checksum_on_read(PMBD_DEVICE_T* pmbd, void* vaddr, size_t bytes)
+{
+	unsigned long i;
+	unsigned long ck_id_s = VADDR_TO_CHECKSUM_IDX(pmbd, vaddr);
+	unsigned long ck_id_e = VADDR_TO_CHECKSUM_IDX(pmbd, (vaddr + bytes - 1));
+
+	uint64_t time_p1, time_p2;
+	TIMESTAT_POINT(time_p1);
+
+	for (i = ck_id_s; i <= ck_id_e; i ++){
+		void* data = CHECKSUM_IDX_TO_VADDR(pmbd, i);
+		void* chk = CHECKSUM_IDX_TO_CKADDR(pmbd, i); 
+
+		PMBD_CHECKSUM_T checksum = pmbd_cal_checksum(pmbd, data);
+		if (memcmp(chk, &checksum, sizeof(PMBD_CHECKSUM_T))){
+			printk(KERN_WARNING "pmbd(%d): checksum mismatch found!\n", pmbd->pmbd_id);
+		}
+	}
+
+	TIMESTAT_POINT(time_p2);
+
+	/* timestamp */
+	if(PMBD_USE_TIMESTAT()){
+		int cid = CUR_CPU_ID();
+		PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+		pmbd_stat->cycles_checksum[READ][cid] += time_p2 - time_p1;
+	}
+
+	return 0;
+}
+
+#if 0
+/* WARN: Calculating the checksum for a big PM space is slow and could lock up the system */
+static int pmbd_checksum_space_init(PMBD_DEVICE_T* pmbd)
+{
+	unsigned long i;
+	PMBD_CHECKSUM_T checksum = pmbd_cal_checksum(pmbd, pmbd->mem_space);
+	unsigned long ck_s = VADDR_TO_CHECKSUM_IDX(pmbd, PMBD_MEM_SPACE_FIRST_BYTE(pmbd));
+	unsigned long ck_e = VADDR_TO_CHECKSUM_IDX(pmbd, PMBD_MEM_SPACE_LAST_BYTE(pmbd));
+
+	for (i = ck_s; i <= ck_e; i ++){
+		void* dst = CHECKSUM_IDX_TO_CKADDR(pmbd, i); 
+		memcpy(dst, &checksum, sizeof(PMBD_CHECKSUM_T));
+	}
+	return 0;
+}
+#endif
+
+/*
+ * locks
+ *
+ * Note: We must prevent multiple threads from concurrently accessing the
+ * same chunk of data. For example, if two writes hit the same page, the PM
+ * page could be corrupted with content merged from both. So we allocate one
+ * spinlock for each 4KB PM page; when reading/writing PM data, we lock the
+ * related pages and unlock them when done.
+ *
+ */
+
+static int pmbd_lock_on_access(PMBD_DEVICE_T* pmbd, sector_t sector, size_t bytes)
+{
+	if (PMBD_USE_LOCK()) {
+		PBN_T pbn = 0;
+		PBN_T pbn_s = SECTOR_TO_PBN(pmbd, sector);
+		PBN_T pbn_e = BYTE_TO_PBN(pmbd, (SECTOR_TO_BYTE(sector) + bytes - 1));
+
+		for (pbn = pbn_s; pbn <= pbn_e; pbn ++) {
+			PMBD_PBI_T* pbi 	= PMBD_BLOCK_PBI(pmbd, pbn);
+			spin_lock(&pbi->lock);
+		}
+	}
+	return 0;
+}
+
+static int pmbd_unlock_on_access(PMBD_DEVICE_T* pmbd, sector_t sector, size_t bytes)
+{
+	if (PMBD_USE_LOCK()){
+		PBN_T pbn = 0;
+		PBN_T pbn_s = SECTOR_TO_PBN(pmbd, sector);
+		PBN_T pbn_e = BYTE_TO_PBN(pmbd, (SECTOR_TO_BYTE(sector) + bytes - 1));
+
+		for (pbn = pbn_s; pbn <= pbn_e; pbn ++) {
+			PMBD_PBI_T* pbi 	= PMBD_BLOCK_PBI(pmbd, pbn);
+			spin_unlock(&pbi->lock);
+		}
+	}
+	return 0;
+}
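+
+/*
+ * Example (illustrative): with pb_size = 4096, a 1 KB write at sector 7
+ * covers bytes 3584..4607, i.e. PBN 0 and PBN 1, so both per-block
+ * spinlocks are taken; locks are always acquired in ascending PBN order,
+ * which keeps concurrent requests deadlock-free.
+ */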
+
+/*
+ **************************************************************************
+ * Unbuffered Read/write functions
+ **************************************************************************
+ */
+static void copy_to_pmbd_unbuffered(PMBD_DEVICE_T* pmbd, void *src, sector_t sector, size_t bytes, unsigned do_fua)
+{
+	void *dst;
+
+	dst = pmbd->mem_space + sector * pmbd->sector_size;
+
+	/* lock the pages */
+	pmbd_lock_on_access(pmbd, sector, bytes);
+
+	/* set the pages writable */
+	/* if we use CR0/WP to temporarily switch the writable permission, 
+ 	 * we don't have to change the PTE attributes directly */
+	if (PMBD_DEV_USE_WPMODE_PTE(pmbd))
+		pmbd_set_pages_rw(pmbd, dst, bytes, TRUE);
+
+	/* do memcpy */
+	memcpy_to_pmbd(pmbd, dst, src, bytes, do_fua);
+
+	/* finish up */
+	/* set the pages read-only */
+	if (PMBD_DEV_USE_WPMODE_PTE(pmbd)) 
+		pmbd_set_pages_ro(pmbd, dst, bytes, TRUE);
+
+	/* verify that the write operation succeeded */
+	if(PMBD_USE_WRITE_VERIFICATION())
+		pmbd_verify_wr_pages(pmbd, dst, src, bytes);
+
+	/* generate check sum */
+	if (PMBD_USE_CHECKSUM())
+		pmbd_checksum_on_write(pmbd, dst, bytes);
+
+	/* unlock the pages */
+	pmbd_unlock_on_access(pmbd, sector, bytes);
+
+	return;
+}
+
+
+static void copy_from_pmbd_unbuffered(PMBD_DEVICE_T* pmbd, void *dst, sector_t sector, size_t bytes)
+{
+	void *src = pmbd->mem_space + sector * pmbd->sector_size;
+
+	/* lock the pages */
+	pmbd_lock_on_access(pmbd, sector, bytes);
+
+	/* check checksum first */
+	if (PMBD_USE_CHECKSUM())
+		pmbd_checksum_on_read(pmbd, src, bytes);
+
+	/* read it out*/
+	memcpy_from_pmbd(pmbd, dst, src, bytes);
+
+	/* unlock the pages */
+	pmbd_unlock_on_access(pmbd, sector, bytes);
+
+	return;
+}
+
+
+/*
+ * *************************************************************************
+ * Read/write functions 
+ * *************************************************************************
+ */ 
+
+static void copy_to_pmbd(PMBD_DEVICE_T* pmbd, void *dst, sector_t sector, size_t bytes, unsigned do_fua)
+{
+	if (PMBD_DEV_USE_BUFFER(pmbd)){
+		copy_to_pmbd_buffered(pmbd, dst, sector, bytes);
+		if (do_fua){
+			/* NOTE:
+			 * For a FUA request with the buffer enabled, we still
+			 * write into the buffer first, but we then also write
+			 * directly into the PM space, bypassing the buffer.
+			 * This is suboptimal (the data is written twice), but
+			 * it is simpler than changing the buffering code.
+			 */
+			copy_to_pmbd_unbuffered(pmbd, dst, sector, bytes, do_fua);
+		}
+	}else
+		copy_to_pmbd_unbuffered(pmbd, dst, sector, bytes, do_fua);
+	return;
+}
+
+static void copy_from_pmbd(PMBD_DEVICE_T* pmbd, void *dst, sector_t sector, size_t bytes)
+{
+	if (PMBD_DEV_USE_BUFFER(pmbd))
+		copy_from_pmbd_buffered(pmbd, dst, sector, bytes);
+	else
+		copy_from_pmbd_unbuffered(pmbd, dst, sector, bytes);
+	return;
+}
+
+static int pmbd_seg_read_write(PMBD_DEVICE_T* pmbd, struct page *page, unsigned int len, 
+					unsigned int off, int rw, sector_t sector, unsigned do_fua)
+{
+	void *mem;
+	int err = 0;
+
+	mem = kmap_atomic(page);
+
+	if (rw == READ) {
+		copy_from_pmbd(pmbd, mem + off, sector, len);
+		flush_dcache_page(page);
+	} else {
+		flush_dcache_page(page);
+		copy_to_pmbd(pmbd, mem + off, sector, len, do_fua);
+	}
+
+	kunmap_atomic(mem);
+
+	return err;
+}
+
+static int pmbd_do_bvec(PMBD_DEVICE_T* pmbd, struct page *page,
+			unsigned int len, unsigned int off, int rw, sector_t sector, unsigned do_fua)
+{
+	return pmbd_seg_read_write(pmbd, page, len, off, rw, sector, do_fua);
+}
+
+/*
+ * Handling write barriers
+ * @pmbd: the pmbd device
+ *
+ * When an application calls fsync(), a bio labeled WRITE_BARRIER arrives at
+ * pmbd_make_request(). We stop accepting new incoming writes (by holding
+ * pmbd->wr_barrier_lock), wait for the in-flight writes to complete (by
+ * checking pmbd->num_flying_wr), and then, if the buffer is in use, flush
+ * the entire DRAM buffer with clflush enabled. If the buffer is not in use,
+ * we flush the CPU cache so that all data is safely written into PM.
+ *
+ */
+
+
+static void __x86_mfence_all(void *arg)
+{
+	unsigned long cache = (unsigned long)arg;
+	if (cache && boot_cpu_data.x86 >= 4)
+		mfence();
+}
+
+static void x86_mfence_all(unsigned long cache)
+{
+	BUG_ON(irqs_disabled());
+	on_each_cpu(__x86_mfence_all, (void*) cache, 1);
+}
+
+static inline void pmbd_mfence_all(PMBD_DEVICE_T* pmbd)
+{
+	x86_mfence_all(1);
+}
+
+
+static void __x86_sfence_all(void *arg)
+{
+	unsigned long cache = (unsigned long)arg;
+	if (cache && boot_cpu_data.x86 >= 4)
+		sfence();
+}
+
+static void x86_sfence_all(unsigned long cache)
+{
+	BUG_ON(irqs_disabled());
+	on_each_cpu(__x86_sfence_all, (void*) cache, 1);
+	
+}
+
+static inline void pmbd_sfence_all(PMBD_DEVICE_T* pmbd)
+{
+	x86_sfence_all(1);
+}
+
+static int pmbd_write_barrier(PMBD_DEVICE_T* pmbd)
+{
+	unsigned i;
+
+	/* blocking incoming writes */
+	spin_lock(&pmbd->wr_barrier_lock);
+
+	/* wait for all on-the-fly writes to finish first */
+	while (atomic_read(&pmbd->num_flying_wr) != 0)
+		cpu_relax();
+
+	if (PMBD_DEV_USE_BUFFER(pmbd)){
+		/* if buffer is used, flush the entire buffer */
+		for (i = 0; i < pmbd->num_buffers; i ++){
+			PMBD_BUFFER_T* buffer = pmbd->buffers[i];
+			pmbd_buffer_check_and_flush(buffer, buffer->num_blocks, CALLER_DESTROYER);
+		}
+	} 
+
+	/*
+	 * considering the following:
+	 * UC (write-through):		strong ordering, we do nothing
+	 * UC-Minus:			strong ordering (unless overridden by WC), we do nothing
+	 * WC (write-combining):	sfence is used after each write, so we do nothing
+	 * WB (write-back):		non-temporal store: sfence is used, do nothing
+	 *				clflush/mfence: mfence is used in clflush_cache_range(), do nothing
+	 *				nothing: wbinvd is needed to drop the entire cache
+	 */
+	if (PMBD_CPU_CACHE_USE_WB()){
+		if (PMBD_USE_NTS()){
+			/* sfence is used after each movntq, so it is safe, we
+ 			* do nothing, just stop accepting any incoming requests */
+		} else if (PMBD_USE_CLFLUSH()) {
+			/* if use clflush/mfence to sync I/O, we do nothing*/
+//			pmbd_mfence_all(pmbd);
+		} else {
+			/* if no sync operations, we have to drop the entire cache */
+			pmbd_clflush_all(pmbd);
+		}
+	} else if (PMBD_CPU_CACHE_USE_WC() || PMBD_CPU_CACHE_USE_UM()) {
+		/* if using WC, sfence has been used already, so do nothing */
+	
+	} else if (PMBD_CPU_CACHE_USE_UC()) {
+		/* strong ordering is used, no need to do anything else*/
+	} else {
+		panic("%s(%d): something is wrong\n", __FUNCTION__, __LINE__);
+	}
+
+	/* unblock incoming writes */
+	spin_unlock(&pmbd->wr_barrier_lock);
+	return 0;
+}
+
+
+//	#define BIO_WR_BARRIER(BIO)	(((BIO)->bi_rw & REQ_FLUSH) == REQ_FLUSH)
+//	#define BIO_WR_BARRIER(BIO)	((BIO)->bi_rw & (REQ_FLUSH | REQ_FLUSH_SEQ))
+	#define BIO_WR_BARRIER(BIO)	(((BIO)->bi_rw & WRITE_FLUSH) == WRITE_FLUSH)
+	#define BIO_WR_FUA(BIO)		(((BIO)->bi_rw & WRITE_FUA) == WRITE_FUA)
+	#define BIO_WR_SYNC(BIO)	(((BIO)->bi_rw & WRITE_SYNC) == WRITE_SYNC)
+
+static void pmbd_make_request(struct request_queue *q, struct bio *bio)
+{
+	int i 	= 0;
+	int err = -EIO;
+	uint64_t start = 0;
+	uint64_t end   = 0;
+	struct bio_vec *bvec;
+	int rw 	= bio_rw(bio);
+	sector_t sector = bio->bi_sector;
+	int num_sectors = bio_sectors(bio);
+	struct block_device *bdev = bio->bi_bdev;
+	PMBD_DEVICE_T *pmbd = bdev->bd_disk->private_data;
+	PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+	unsigned bio_is_write_fua = FALSE;
+	unsigned bio_is_write_barrier = FALSE;
+	unsigned do_fua = FALSE;
+	uint64_t time_p1, time_p2, time_p3, time_p4, time_p5, time_p6;
+	time_p1 = time_p2 = time_p3 = time_p4 = time_p5 = time_p6 = 0;
+
+
+	TIMESTAT_POINT(time_p1);
+//	printk("ACCESS: %u %d %X %d\n", sector, num_sectors, bio->bi_rw, rw);
+	
+	/* update rw */
+	if (rw == READA)
+		rw = READ;
+	if (rw != READ && rw != WRITE)
+		panic("pmbd: %s(%d) found a request that is neither read nor write\n", __FUNCTION__, __LINE__);
+
+	/* handle write barrier (we no longer handle BIO_WR_SYNC(bio)) */
+	if (BIO_WR_BARRIER(bio)){
+		/* 
+		 * Note: Linux kernel 2.6.37 and later rely on file systems
+		 * and FUA to ensure data reliability, rather than write
+		 * barriers.
+		 * See http://monolight.cc/2011/06/barriers-caches-filesystems
+		 */
+		bio_is_write_barrier = TRUE;
+//		printk(KERN_INFO "pmbd: received barrier request %u %d %lx %d\n", (unsigned int) sector, num_sectors, bio->bi_rw, rw);
+
+		if (PMBD_USE_WB())
+			pmbd_write_barrier(pmbd);
+	}
+
+	if (BIO_WR_FUA(bio)){
+		bio_is_write_fua = TRUE;
+//		printk(KERN_INFO "pmbd: received FUA request %u %d %lx %d\n", (unsigned int) sector, num_sectors, bio->bi_rw, rw);
+
+		if (PMBD_USE_FUA())
+			do_fua = TRUE;
+	}
+
+	TIMESTAT_POINT(time_p2);
+
+	/* blocking write until write barrier is done */
+	if (rw == WRITE){
+		spin_lock(&pmbd->wr_barrier_lock);
+		spin_unlock(&pmbd->wr_barrier_lock);
+	}
+
+	/* increment on-the-fly writes counter */
+	atomic_inc(&pmbd->num_flying_wr);
+
+	/* starting emulation */
+	if (PMBD_DEV_SIM_DEV(pmbd))
+		start = emul_start(pmbd, num_sectors, rw);
+
+	/* check if out of range */
+	if (sector + (bio->bi_size >> SECTOR_SHIFT) > get_capacity(bdev->bd_disk)){
+		printk(KERN_WARNING "pmbd: request exceeds the PMBD capacity\n");
+		TIMESTAT_POINT(time_p3);
+		goto out;
+	}
+
+//	printk("DEBUG: ACCESS %lu %d %d\n", sector, num_sectors, rw);
+
+	/*
+	 * NOTE: some applications (e.g. fdisk) call fsync() to request
+	 * flushing dirty data from the buffer cache. By default, fsync() is
+	 * linked to blkdev_fsync() in the def_blk_fops structure, and
+	 * blkdev_fsync() calls blkdev_issue_flush(), which generates an
+	 * empty bio carrying a write barrier down to the block device through
+	 * generic_make_request(), which in turn calls pmbd_make_request(). If
+	 * we did not set err = 0 here, the error would propagate back up to
+	 * the application; for example, fdisk would fail and report an error
+	 * when trying to write the partition table before it exits. Thus we
+	 * must reset the error code here if the bio is empty. Also note that
+	 * we directly check the bio size, rather than using BIO_WR_BARRIER(),
+	 * so as to handle other cases as well.
+	 *
+	 */
+	if (num_sectors == 0) { 
+		err = 0;
+		TIMESTAT_POINT(time_p3);
+		goto out;
+	} 
+
+	/* update the access time*/
+	PMBD_DEV_UPDATE_ACCESS_TIME(pmbd);
+
+	TIMESTAT_POINT(time_p3);
+
+	/* 
+	 * Do read/write now. We first perform the operation, then check how
+	 * long it actually takes to finish the operation, then we calculate an
+	 * emulated time for a given slow-down model, if the actual access time
+	 * is less than the emulated time, we just make up the difference to
+	 * emulate a slower device. 
+	 */
+	bio_for_each_segment(bvec, bio, i) {
+		unsigned int len = bvec->bv_len;
+		err = pmbd_do_bvec(pmbd, bvec->bv_page, len, 
+					bvec->bv_offset, rw, sector, do_fua);
+		if (err)
+			break;
+		sector += len >> SECTOR_SHIFT;
+	}
+
+out:
+	TIMESTAT_POINT(time_p4);
+
+	bio_endio(bio, err);
+
+	TIMESTAT_POINT(time_p5);
+
+	/* ending emulation (simmode0)*/
+	if (PMBD_DEV_SIM_DEV(pmbd))
+		end = emul_end(pmbd, num_sectors, rw, start);
+
+	/* decrement on-the-fly writes counter */
+	atomic_dec(&pmbd->num_flying_wr);
+
+	TIMESTAT_POINT(time_p6);
+
+	/* update statistics data */
+	spin_lock(&pmbd_stat->stat_lock);
+	if (rw == READ) {
+		pmbd_stat->num_requests_read ++;
+		pmbd_stat->num_sectors_read += num_sectors;
+	} else {
+		pmbd_stat->num_requests_write ++;
+		pmbd_stat->num_sectors_write += num_sectors;
+	}
+	if (bio_is_write_barrier)
+		pmbd_stat->num_write_barrier ++;
+	if (bio_is_write_fua)
+		pmbd_stat->num_write_fua ++;
+	spin_unlock(&pmbd_stat->stat_lock);
+
+	/* cycles */
+	if (PMBD_USE_TIMESTAT()){
+		int cid = CUR_CPU_ID();
+		pmbd_stat->cycles_total[rw][cid] 	+= time_p6 - time_p1;
+		pmbd_stat->cycles_wb[rw][cid] 		+= time_p2 - time_p1;		/* write barrier */
+		pmbd_stat->cycles_prepare[rw][cid] 	+= time_p3 - time_p2;	
+		pmbd_stat->cycles_work[rw][cid] 		+= time_p4 - time_p3;
+		pmbd_stat->cycles_endio[rw][cid] 	+= time_p5 - time_p4;
+		pmbd_stat->cycles_finish[rw][cid] 	+= time_p6 - time_p5;
+	}
+}
+
+
+/*
+ **************************************************************************
+ * Allocating memory space for PMBD device 
+ **************************************************************************
+ */ 
+
+/*
+ * Set the page attributes for the PMBD backstore memory space
+ *  - WB: cache enabled, write back (default)
+ *  - WC: cache disabled, write through, speculative writes combined
+ *  - UC: cache disabled, write through, no write combined
+ *  - UC-Minus: the same as UC 
+ *
+ * REF: 
+ * - http://www.kernel.org/doc/ols/2008/ols2008v2-pages-135-144.pdf
+ * - http://www.mjmwired.net/kernel/Documentation/x86/pat.txt
+ */ 
+
+static int pmbd_set_pages_cache_flags(PMBD_DEVICE_T* pmbd)
+{
+	if (pmbd->mem_space && pmbd->num_sectors) {
+		/* NOTE: this pointer-to-integer conversion is safe on a 64-bit system */
+		unsigned long vaddr = (unsigned long) pmbd->mem_space;
+		int num_pages = PMBD_MEM_TOTAL_PAGES(pmbd);
+
+		printk(KERN_INFO "pmbd: setting %s PTE flags (%lx:%d)\n", pmbd->pmbd_name, vaddr, num_pages);
+		set_pages_cache_flags(vaddr, num_pages);
+		printk(KERN_INFO "pmbd: setting %s PTE flags done.\n", pmbd->pmbd_name);
+	}
+	return 0;
+}
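+
+/*
+ * A minimal sketch (for illustration only) of what set_pages_cache_flags()
+ * above is assumed to do for the vmalloc() case: dispatch on
+ * g_pmbd_cpu_cache_flag to the x86 set_memory_*() PAT interfaces. The real
+ * helper is defined elsewhere in this file.
+ */
+#if 0
+static void set_pages_cache_flags_sketch(unsigned long vaddr, int num_pages)
+{
+	if (g_pmbd_cpu_cache_flag == _PAGE_CACHE_WC)
+		set_memory_wc(vaddr, num_pages);	/* write-combining */
+	else if (g_pmbd_cpu_cache_flag == _PAGE_CACHE_UC ||
+		 g_pmbd_cpu_cache_flag == _PAGE_CACHE_UC_MINUS)
+		set_memory_uc(vaddr, num_pages);	/* uncacheable (UC-) */
+	else
+		set_memory_wb(vaddr, num_pages);	/* write-back (default) */
+}
+#endif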
+
+static int pmbd_reset_pages_cache_flags(PMBD_DEVICE_T* pmbd)
+{
+	if (pmbd->mem_space){
+		unsigned long vaddr = (unsigned long) pmbd->mem_space;
+		int num_pages = PMBD_MEM_TOTAL_PAGES(pmbd);
+		set_memory_wb(vaddr, num_pages);
+		printk(KERN_INFO "pmbd: %s pages cache flags are reset to WB\n", pmbd->pmbd_name);
+	}
+	return 0;
+}
+
+
+/*
+ * Allocate/free memory backstore space for PMBD devices
+ */
+static int pmbd_mem_space_alloc (PMBD_DEVICE_T* pmbd)
+{
+	int err = 0;
+
+	/* allocate PM memory space */
+	if (PMBD_DEV_USE_VMALLOC(pmbd)){
+		pmbd->mem_space = vmalloc (PMBD_MEM_TOTAL_BYTES(pmbd));
+	} else if (PMBD_DEV_USE_HIGHMEM(pmbd)){
+		pmbd->mem_space = hmalloc (PMBD_MEM_TOTAL_BYTES(pmbd));
+	}
+
+	if (pmbd->mem_space) {
+#if 0
+		/* FIXME: No need to do this. It's slow, system could be locked up */
+		memset(pmbd->mem_space, 0, pmbd->num_sectors * pmbd->sector_size);
+#endif
+		printk(KERN_INFO "pmbd: /dev/%s is created [%lu : %llu MBs]\n", 
+				pmbd->pmbd_name, (unsigned long) pmbd->mem_space, SECTORS_TO_MB(pmbd->num_sectors));
+	} else {
+		printk(KERN_ERR "pmbd: %s(%d): PMBD space allocation failed\n", __FUNCTION__, __LINE__);
+		err = -ENOMEM;
+	} 
+	return err;
+}
+
+static int pmbd_mem_space_free(PMBD_DEVICE_T* pmbd)
+{
+	/* free it up */
+	if (pmbd->mem_space) {
+		if (PMBD_DEV_USE_VMALLOC(pmbd))
+			vfree(pmbd->mem_space);
+		else if (PMBD_DEV_USE_HIGHMEM(pmbd)) {
+			hfree(pmbd->mem_space);
+		}
+		pmbd->mem_space = NULL;
+	}
+	return 0;
+}
+
+/* pmbd->pmbd_stat */
+static int pmbd_stat_alloc(PMBD_DEVICE_T* pmbd)
+{
+	int err = 0;
+	pmbd->pmbd_stat = kzalloc(sizeof(PMBD_STAT_T), GFP_KERNEL);
+	if (pmbd->pmbd_stat){
+		spin_lock_init(&pmbd->pmbd_stat->stat_lock);
+	} else {
+		printk(KERN_ERR "pmbd: %s(%d): PMBD statistics space allocation failed\n", __FUNCTION__, __LINE__);
+		err = -ENOMEM;
+	}
+	return err;
+}
+
+static int pmbd_stat_free(PMBD_DEVICE_T* pmbd)
+{
+	if(pmbd->pmbd_stat) {
+		kfree(pmbd->pmbd_stat);
+		pmbd->pmbd_stat = NULL;
+	}
+	return 0;
+}
+
+/* /proc/pmbd/<dev> */
+static int pmbd_proc_pmbdstat_read(char* buffer, char** start, off_t offset, int count, int* eof, void* data)
+{
+	int rtn;
+	if (offset > 0) {
+		*eof = 1;
+		rtn  = 0;
+	} else {
+		char* local_buffer = kzalloc(8192, GFP_KERNEL);
+		PMBD_DEVICE_T* pmbd, *next;
+		char rdwr_name[2][16] = {"read\0", "write\0"};
+
+		if (!local_buffer)
+			return 0;	/* treat allocation failure as EOF */
+		local_buffer[0] = '\0';
+
+		list_for_each_entry_safe(pmbd, next, &pmbd_devices, pmbd_list) {
+			unsigned i, j;
+			BBN_T num_dirty = 0;
+			BBN_T num_blocks = 0; 
+			PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
+
+			/* NOTE: reading the buffer counters without locking is fine here (approximate stats) */
+			for (i = 0; i < pmbd->num_buffers; i ++){
+				num_blocks += pmbd->buffers[i]->num_blocks;
+				num_dirty += pmbd->buffers[i]->num_dirty;
+			}
+
+			/* print stuff now */
+			spin_lock(&pmbd->pmbd_stat->stat_lock);
+
+			sprintf(local_buffer+strlen(local_buffer), "num_dirty_blocks[%s] %u\n", pmbd->pmbd_name, (unsigned int) num_dirty);
+			sprintf(local_buffer+strlen(local_buffer), "num_clean_blocks[%s] %u\n", pmbd->pmbd_name, (unsigned int) (num_blocks - num_dirty));
+			sprintf(local_buffer+strlen(local_buffer), "num_sectors_read[%s] %llu\n",  pmbd->pmbd_name, pmbd_stat->num_sectors_read);
+			sprintf(local_buffer+strlen(local_buffer), "num_sectors_write[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_sectors_write);
+			sprintf(local_buffer+strlen(local_buffer), "num_requests_read[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_requests_read);
+			sprintf(local_buffer+strlen(local_buffer), "num_requests_write[%s] %llu\n",pmbd->pmbd_name, pmbd_stat->num_requests_write);
+			sprintf(local_buffer+strlen(local_buffer), "num_write_barrier[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_write_barrier);
+			sprintf(local_buffer+strlen(local_buffer), "num_write_fua[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_write_fua);
+
+			spin_unlock(&pmbd->pmbd_stat->stat_lock);
+
+//			sprintf(local_buffer+strlen(local_buffer), "\n");
+				
+			for (j = 0; j <= 1; j ++){
+				int k=0;
+
+				unsigned long long cycles_total = 0;
+				unsigned long long cycles_prepare = 0;
+				unsigned long long cycles_wb = 0;
+				unsigned long long cycles_work = 0;
+				unsigned long long cycles_endio = 0;
+				unsigned long long cycles_finish = 0;
+
+				unsigned long long cycles_pmap = 0;
+				unsigned long long cycles_punmap = 0;
+				unsigned long long cycles_memcpy = 0;
+				unsigned long long cycles_clflush = 0;
+				unsigned long long cycles_clflushall = 0;
+				unsigned long long cycles_wrverify = 0;
+				unsigned long long cycles_checksum = 0;
+				unsigned long long cycles_pause = 0;
+				unsigned long long cycles_slowdown = 0;
+				unsigned long long cycles_setpages_ro = 0;
+				unsigned long long cycles_setpages_rw = 0;
+
+				for (k = 0; k < PMBD_MAX_NUM_CPUS; k ++){
+					cycles_total 	+= pmbd_stat->cycles_total[j][k];
+					cycles_prepare 	+= pmbd_stat->cycles_prepare[j][k];
+					cycles_wb	+= pmbd_stat->cycles_wb[j][k];
+					cycles_work	+= pmbd_stat->cycles_work[j][k];
+					cycles_endio	+= pmbd_stat->cycles_endio[j][k];
+					cycles_finish	+= pmbd_stat->cycles_finish[j][k];
+
+					cycles_pmap	+= pmbd_stat->cycles_pmap[j][k];
+					cycles_punmap	+= pmbd_stat->cycles_punmap[j][k];
+					cycles_memcpy	+= pmbd_stat->cycles_memcpy[j][k];
+					cycles_clflush	+= pmbd_stat->cycles_clflush[j][k];
+					cycles_clflushall+=pmbd_stat->cycles_clflushall[j][k];
+					cycles_wrverify	+= pmbd_stat->cycles_wrverify[j][k];
+					cycles_checksum += pmbd_stat->cycles_checksum[j][k];
+					cycles_pause	+= pmbd_stat->cycles_pause[j][k];
+					cycles_slowdown	+= pmbd_stat->cycles_slowdown[j][k];
+					cycles_setpages_ro+= pmbd_stat->cycles_setpages_ro[j][k];
+					cycles_setpages_rw+= pmbd_stat->cycles_setpages_rw[j][k];
+				}
+
+				sprintf(local_buffer+strlen(local_buffer), "cycles_total_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_total);
+				sprintf(local_buffer+strlen(local_buffer), "cycles_prepare_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_prepare);
+				sprintf(local_buffer+strlen(local_buffer), "cycles_wb_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_wb);
+				sprintf(local_buffer+strlen(local_buffer), "cycles_work_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_work);
+				sprintf(local_buffer+strlen(local_buffer), "cycles_endio_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_endio);
+				sprintf(local_buffer+strlen(local_buffer), "cycles_finish_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_finish);
+				sprintf(local_buffer+strlen(local_buffer), "cycles_pmap_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_pmap);
+				sprintf(local_buffer+strlen(local_buffer), "cycles_punmap_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_punmap);
+				sprintf(local_buffer+strlen(local_buffer), "cycles_memcpy_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_memcpy);
+				sprintf(local_buffer+strlen(local_buffer), "cycles_clflush_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_clflush);
+				sprintf(local_buffer+strlen(local_buffer), "cycles_clflushall_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_clflushall);
+				sprintf(local_buffer+strlen(local_buffer), "cycles_wrverify_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_wrverify);
+				sprintf(local_buffer+strlen(local_buffer), "cycles_checksum_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_checksum);
+				sprintf(local_buffer+strlen(local_buffer), "cycles_pause_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_pause);
+				sprintf(local_buffer+strlen(local_buffer), "cycles_slowdown_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_slowdown);
+				sprintf(local_buffer+strlen(local_buffer), "cycles_setpages_ro_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_setpages_ro);
+				sprintf(local_buffer+strlen(local_buffer), "cycles_setpages_rw_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_setpages_rw);
+			}
+
+#if 0
+			/* print something temporary for debugging purpose */
+			if (0) {
+				spin_lock(&pmbd->tmp_lock);
+				printk("%llu %lu\n", pmbd->tmp_data, pmbd->tmp_num);
+				spin_unlock(&pmbd->tmp_lock);
+			}
+#endif
+		}
+
+		memcpy(buffer, local_buffer, strlen(local_buffer));
+		rtn = strlen(local_buffer);
+		kfree(local_buffer);
+	}
+	return rtn;
+}
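+
+/*
+ * Illustrative output for a single device /dev/pma (values are hypothetical;
+ * the line format follows the sprintf() calls above):
+ *
+ *   $ cat /proc/pmbd/pmbdstat
+ *   num_dirty_blocks[pma] 0
+ *   num_clean_blocks[pma] 16384
+ *   num_sectors_read[pma] 524288
+ *   num_requests_write[pma] 2048
+ *   cycles_total_read[pma] 73829112
+ *   ...
+ */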
+
+/* /proc/pmbdcfg */
+static int pmbd_proc_pmbdcfg_read(char* buffer, char** start, off_t offset, int count, int* eof, void* data)
+{
+	int rtn;
+	if (offset > 0) {
+		*eof = 1;
+		rtn  = 0;
+	} else {
+		char* local_buffer = kzalloc(8192, GFP_KERNEL);
+		PMBD_DEVICE_T* pmbd, *next;
+
+		if (!local_buffer)
+			return 0;	/* treat allocation failure as EOF */
+		local_buffer[0] = '\0';
+
+		/* global configurations */
+		sprintf(local_buffer+strlen(local_buffer), "MODULE OPTIONS: %s\n", mode);
+		sprintf(local_buffer+strlen(local_buffer), "\n");
+
+		sprintf(local_buffer+strlen(local_buffer), "max_part %d\n", max_part);
+		sprintf(local_buffer+strlen(local_buffer), "part_shift %d\n", part_shift);
+
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_type %u\n", g_pmbd_type);
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_mergeable %u\n", g_pmbd_mergeable);
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_cpu_cache_clflush %u\n", g_pmbd_cpu_cache_clflush);
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_cpu_cache_flag %lu\n", g_pmbd_cpu_cache_flag);
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_wr_protect %u\n", g_pmbd_wr_protect);
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_wr_verify %u\n", g_pmbd_wr_verify);
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_checksum %u\n", g_pmbd_checksum);
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_lock %u\n", g_pmbd_lock);
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_subpage_update %u\n", g_pmbd_subpage_update);
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_pmap %u\n", g_pmbd_pmap);
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_nts %u\n", g_pmbd_nts);
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_ntl %u\n", g_pmbd_ntl);
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_wb %u\n", g_pmbd_wb);
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_fua %u\n", g_pmbd_fua);
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_timestat %u\n", g_pmbd_timestat);
+		sprintf(local_buffer+strlen(local_buffer), "g_highmem_size %lu\n", g_highmem_size);
+		sprintf(local_buffer+strlen(local_buffer), "g_highmem_phys_addr %llu\n", (unsigned long long) g_highmem_phys_addr);
+		sprintf(local_buffer+strlen(local_buffer), "g_highmem_virt_addr %llu\n", (unsigned long long) g_highmem_virt_addr);
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_nr %u\n", g_pmbd_nr);
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_adjust_ns %llu\n", g_pmbd_adjust_ns);
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_num_buffers %llu\n", g_pmbd_num_buffers);
+		sprintf(local_buffer+strlen(local_buffer), "g_pmbd_buffer_stride %llu\n", g_pmbd_buffer_stride);
+		sprintf(local_buffer+strlen(local_buffer), "\n");
+
+		/* device specific configurations */
+		list_for_each_entry_safe(pmbd, next, &pmbd_devices, pmbd_list) {
+			int i = 0;
+
+			sprintf(local_buffer+strlen(local_buffer), "pmbd_id[%s] %d\n", pmbd->pmbd_name, pmbd->pmbd_id);
+			sprintf(local_buffer+strlen(local_buffer), "num_sectors[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->num_sectors);
+			sprintf(local_buffer+strlen(local_buffer), "sector_size[%s] %u\n", pmbd->pmbd_name, pmbd->sector_size);
+			sprintf(local_buffer+strlen(local_buffer), "pmbd_type[%s] %u\n", pmbd->pmbd_name, pmbd->pmbd_type);
+			sprintf(local_buffer+strlen(local_buffer), "rammode[%s] %u\n", pmbd->pmbd_name, pmbd->rammode);
+			sprintf(local_buffer+strlen(local_buffer), "bufmode[%s] %u\n", pmbd->pmbd_name, pmbd->bufmode);
+			sprintf(local_buffer+strlen(local_buffer), "wpmode[%s] %u\n", pmbd->pmbd_name, pmbd->wpmode);
+			sprintf(local_buffer+strlen(local_buffer), "num_buffers[%s] %u\n", pmbd->pmbd_name, pmbd->num_buffers);
+			sprintf(local_buffer+strlen(local_buffer), "buffer_stride[%s] %u\n", pmbd->pmbd_name, pmbd->buffer_stride);
+			sprintf(local_buffer+strlen(local_buffer), "pb_size[%s] %u\n", pmbd->pmbd_name, pmbd->pb_size);
+			sprintf(local_buffer+strlen(local_buffer), "checksum_unit_size[%s] %u\n", pmbd->pmbd_name, pmbd->checksum_unit_size);
+			sprintf(local_buffer+strlen(local_buffer), "simmode[%s] %u\n", pmbd->pmbd_name, pmbd->simmode);
+			sprintf(local_buffer+strlen(local_buffer), "rdlat[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->rdlat);
+			sprintf(local_buffer+strlen(local_buffer), "wrlat[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->wrlat);
+			sprintf(local_buffer+strlen(local_buffer), "rdbw[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->rdbw);
+			sprintf(local_buffer+strlen(local_buffer), "wrbw[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->wrbw);
+			sprintf(local_buffer+strlen(local_buffer), "rdsx[%s] %u\n", pmbd->pmbd_name, pmbd->rdsx);
+			sprintf(local_buffer+strlen(local_buffer), "wrsx[%s] %u\n", pmbd->pmbd_name, pmbd->wrsx);
+			sprintf(local_buffer+strlen(local_buffer), "rdpause[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->rdpause);
+			sprintf(local_buffer+strlen(local_buffer), "wrpause[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->wrpause);
+
+			for (i = 0; i < pmbd->num_buffers; i ++){
+				PMBD_BUFFER_T* buffer = pmbd->buffers[i];
+				sprintf(local_buffer+strlen(local_buffer), "buffer%d[%s]buffer_id %u\n", i, pmbd->pmbd_name, buffer->buffer_id);
+				sprintf(local_buffer+strlen(local_buffer), "buffer%d[%s]num_blocks %lu\n", i, pmbd->pmbd_name, (unsigned long) buffer->num_blocks);
+				sprintf(local_buffer+strlen(local_buffer), "buffer%d[%s]batch_size %lu\n", i, pmbd->pmbd_name, (unsigned long) buffer->batch_size);
+			}
+
+		}
+
+		memcpy(buffer, local_buffer, strlen(local_buffer));
+		rtn = strlen(local_buffer);
+		kfree(local_buffer);
+	}
+	return rtn;
+}
+
+
+
+static int pmbd_proc_devstat_read(char* buffer, char** start, off_t offset, int count, int* eof, void* data)
+{
+	int rtn;
+	char local_buffer[1024];
+	if (offset > 0) {
+		*eof = 1;
+		rtn  = 0;
+	} else {
+		sprintf(local_buffer, "N/A\n");
+		memcpy(buffer, local_buffer, strlen(local_buffer));
+		rtn = strlen(local_buffer);
+	}
+	return rtn;
+}
+
+static int pmbd_proc_devstat_create(PMBD_DEVICE_T* pmbd)
+{
+	/* create a /proc/pmbd/<dev> entry */
+	pmbd->proc_devstat = create_proc_entry(pmbd->pmbd_name, S_IRUGO, proc_pmbd);
+	if (pmbd->proc_devstat == NULL) {
+		printk(KERN_ERR "pmbd: cannot create /proc/pmbd/%s\n", pmbd->pmbd_name);
+		return -ENOMEM;
+	}
+	pmbd->proc_devstat->read_proc = pmbd_proc_devstat_read;
+	printk(KERN_INFO "pmbd: /proc/pmbd/%s created\n", pmbd->pmbd_name);
+
+	return 0;
+}
+
+static int pmbd_proc_devstat_destroy(PMBD_DEVICE_T* pmbd)
+{
+	remove_proc_entry(pmbd->pmbd_name, proc_pmbd);
+	printk(KERN_INFO "pmbd: /proc/pmbd/%s removed\n", pmbd->pmbd_name);
+	return 0;
+}
+
+static int pmbd_create (PMBD_DEVICE_T* pmbd, uint64_t sectors)
+{
+	int err = 0;
+
+	pmbd->num_sectors = sectors; 
+	pmbd->sector_size = PMBD_SECTOR_SIZE;	 	/* FIXME: now we use 512, do we need to change it? */
+	pmbd->pmbd_type = g_pmbd_type;
+	pmbd->checksum_unit_size = PAGE_SIZE;
+	pmbd->pb_size = PAGE_SIZE;
+
+	spin_lock_init(&pmbd->batch_lock);
+	spin_lock_init(&pmbd->wr_barrier_lock);
+
+	spin_lock_init(&pmbd->tmp_lock);
+	pmbd->tmp_data = 0;
+	pmbd->tmp_num = 0;
+
+	/* allocate statistics info */
+	if ((err = pmbd_stat_alloc(pmbd)) < 0)
+		goto error;
+
+	/* allocate memory space */
+	if ((err = pmbd_mem_space_alloc(pmbd)) < 0)
+		goto error;
+
+	/* allocate buffer space */
+	if ((err = pmbd_buffer_space_alloc(pmbd)) < 0)
+		goto error;
+
+	/* allocate checksum space */
+	if ((err = pmbd_checksum_space_alloc(pmbd)) < 0)
+		goto error;
+	
+	/* allocate block info space */
+	if ((err = pmbd_pbi_space_alloc(pmbd)) < 0)
+		goto error;
+
+	/* create a /proc/pmbd/<dev> entry*/
+	if ((err = pmbd_proc_devstat_create(pmbd)) < 0)
+		goto error;
+
+#if 0
+	/* FIXME: No need to do it. It's slow and could lock up the system*/
+	pmbd_checksum_space_init(pmbd);
+#endif
+
+	/* set up the page attributes related to the CPU cache:
+	 * if using vmalloc(), we need to set up the page cache flags (WB,WC,UC,UM);
+	 * if using high memory, we set up the page cache flag with ioremap_prot().
+	 * WARN: In Linux 3.2.1, this function is slow and could cause system hangs.
+	 */
+	
+	if (PMBD_USE_VMALLOC()){
+		pmbd_set_pages_cache_flags(pmbd);
+	}
+
+	/* initialize PM pages read-only */
+	if (!PMBD_USE_PMAP() && PMBD_USE_WRITE_PROTECTION())	
+		pmbd_set_pages_ro(pmbd, pmbd->mem_space, PMBD_MEM_TOTAL_BYTES(pmbd), FALSE);
+
+	printk(KERN_INFO "pmbd: %s created\n", pmbd->pmbd_name);
+error:
+	return err;
+}
+
+static int pmbd_destroy (PMBD_DEVICE_T* pmbd)
+{
+	/* flush everything down */
+	// FIXME: this implies flushing CPU cache
+	pmbd_write_barrier(pmbd);
+	
+	/* free /proc entry */
+	pmbd_proc_devstat_destroy(pmbd);
+
+	/* free buffer space */
+	pmbd_buffer_space_free(pmbd);
+
+	/* set PM pages writable */
+	if (!PMBD_USE_PMAP() && PMBD_USE_WRITE_PROTECTION())
+		pmbd_set_pages_rw(pmbd, pmbd->mem_space, PMBD_MEM_TOTAL_BYTES(pmbd), FALSE);
+
+	/* reset memory attributes to WB */
+	if (PMBD_USE_VMALLOC())
+		pmbd_reset_pages_cache_flags(pmbd);
+		
+	/* free block info space */
+	pmbd_pbi_space_free(pmbd);
+
+	/* free checksum space */
+	pmbd_checksum_space_free(pmbd);
+
+	/* free memory backstore space */
+	pmbd_mem_space_free(pmbd);
+
+	/* free statistics data */
+	pmbd_stat_free(pmbd);
+	
+	printk(KERN_INFO "pmbd: /dev/%s is destroyed (%llu MB)\n", pmbd->pmbd_name, SECTORS_TO_MB(pmbd->num_sectors));
+
+	pmbd->num_sectors = 0;
+	pmbd->sector_size = 0;
+	pmbd->checksum_unit_size = 0;
+	return 0;
+}
+
+static int pmbd_free_pages(PMBD_DEVICE_T* pmbd)
+{
+	return pmbd_destroy(pmbd);
+}
+
+/*
+ **************************************************************************
+ * /proc file system entries
+ **************************************************************************
+ */
+
+static int pmbd_proc_create(void)
+{
+	proc_pmbd = proc_mkdir("pmbd", NULL);
+	if(proc_pmbd == NULL){
+		printk(KERN_ERR "pmbd: %s(%d): cannot create /proc/pmbd\n", __FUNCTION__, __LINE__);
+		return -ENOMEM;
+	}
+
+	proc_pmbdstat = create_proc_entry("pmbdstat", S_IRUGO, proc_pmbd);
+	if (proc_pmbdstat == NULL){
+		printk(KERN_ERR "pmbd: cannot create /proc/pmbd/pmbdstat\n");
+		return -ENOMEM;
+	}
+	proc_pmbdstat->read_proc = pmbd_proc_pmbdstat_read;
+	printk(KERN_INFO "pmbd: /proc/pmbd/pmbdstat created\n");
+
+	proc_pmbdcfg = create_proc_entry("pmbdcfg", S_IRUGO, proc_pmbd);
+	if (proc_pmbdcfg == NULL){
+		printk(KERN_ERR "pmbd: cannot create /proc/pmbd/pmbdcfg\n");
+		return -ENOMEM;
+	}
+	proc_pmbdcfg->read_proc = pmbd_proc_pmbdcfg_read;
+	printk(KERN_INFO "pmbd: /proc/pmbd/pmbdcfg created\n");
+
+	return 0;
+}
+
+static int pmbd_proc_destroy(void)
+{
+	remove_proc_entry("pmbdcfg", proc_pmbd);
+	printk(KERN_INFO "pmbd: /proc/pmbd/pmbdcfg is removed\n");
+
+	remove_proc_entry("pmbdstat", proc_pmbd);
+	printk(KERN_INFO "pmbd: /proc/pmbd/pmbdstat is removed\n");
+
+	remove_proc_entry("pmbd", 0);
+	printk(KERN_INFO "pmbd: /proc/pmbd is removed\n");
+	return 0;
+}
+
+/*
+ **************************************************************************
+ * device driver interface hook functions
+ **************************************************************************
+ */
+
+static int pmbd_mergeable_bvec(struct request_queue *q, 
+                              struct bvec_merge_data *bvm,
+                              struct bio_vec *biovec) {
+	static int flag = 0;
+    
+	if(PMBD_IS_MERGEABLE()) {
+		/* always merge */
+		if (!flag) {
+		    printk(KERN_INFO "pmbd: bio merging enabled\n");
+		    flag = 1;
+		}
+		return biovec->bv_len;
+	} else {
+		/* never merge */
+		if (!flag) {
+			printk(KERN_INFO "pmbd: bio merging disabled\n");
+			flag = 1;
+		}
+		if (!bvm->bi_size) {
+        		return biovec->bv_len;
+		} else {
+			return 0;
+		}
+	}
+}
+
+int pmbd_fsync(struct file* file, struct dentry* dentry, int datasync)
+{
+	printk(KERN_WARNING "pmbd: pmbd_fsync not implemented\n");
+
+	return 0;
+}
+
+int pmbd_open(struct block_device* bdev, fmode_t mode)
+{
+	printk(KERN_DEBUG "pmbd: pmbd (/dev/%s) opened\n", bdev->bd_disk->disk_name);
+	return 0;
+}
+
+int pmbd_release (struct gendisk* disk, fmode_t mode)
+{
+	printk(KERN_DEBUG "pmbd: pmbd (/dev/%s) released\n", disk->disk_name);
+	return 0;
+}
+
+static const struct block_device_operations pmbd_fops = {
+	.owner =		THIS_MODULE,
+//	.open =			pmbd_open,
+//	.release = 		pmbd_release,
+};
+
+/*
+ * NOTE: parts of the following code are derived from linux/block/brd.c
+ */
+
+
+static PMBD_DEVICE_T *pmbd_alloc(int i)
+{
+	PMBD_DEVICE_T *pmbd;
+	struct gendisk *disk;
+
+	/* no more than 26 devices */
+	if (i >= PMBD_MAX_NUM_DEVICES)
+		return NULL;
+
+	/* alloc and set up pmbd object */
+	pmbd = kzalloc(sizeof(*pmbd), GFP_KERNEL);
+	if (!pmbd) 
+		goto out;
+	pmbd->pmbd_id = i;
+	pmbd->pmbd_queue = blk_alloc_queue(GFP_KERNEL);
+	if (!pmbd->pmbd_queue)
+		goto out_free_dev;
+	sprintf(pmbd->pmbd_name, "pm%c", ('a' + i));
+	pmbd->rdlat = g_pmbd_rdlat[i];
+	pmbd->wrlat = g_pmbd_wrlat[i];
+	pmbd->rdbw  = g_pmbd_rdbw[i];
+	pmbd->wrbw  = g_pmbd_wrbw[i];
+	pmbd->rdsx  = g_pmbd_rdsx[i];
+	pmbd->wrsx  = g_pmbd_wrsx[i];
+	pmbd->rdpause  = g_pmbd_rdpause[i];
+	pmbd->wrpause  = g_pmbd_wrpause[i];
+	pmbd->simmode  = g_pmbd_simmode[i];
+	pmbd->rammode  = g_pmbd_rammode[i];
+	pmbd->wpmode   = g_pmbd_wpmode[i];
+	pmbd->num_buffers  = g_pmbd_num_buffers;
+	pmbd->buffer_stride  = g_pmbd_buffer_stride;
+	pmbd->bufmode  = (g_pmbd_bufsize[i] > 0 && g_pmbd_num_buffers > 0) ? TRUE : FALSE;
+
+	/* hook functions */
+	blk_queue_make_request(pmbd->pmbd_queue, pmbd_make_request);
+
+	/* set flush capability, otherwise, WRITE_FLUSH and WRITE_FUA will be filtered in
+ 	   generic_make_request() */
+	if (PMBD_USE_FUA() && PMBD_USE_WB())
+		blk_queue_flush(pmbd->pmbd_queue, REQ_FLUSH | REQ_FUA);
+	else if (PMBD_USE_WB())
+		blk_queue_flush(pmbd->pmbd_queue, REQ_FLUSH);
+	else if (PMBD_USE_FUA())
+		blk_queue_flush(pmbd->pmbd_queue, REQ_FUA);
+
+	blk_queue_max_hw_sectors(pmbd->pmbd_queue, 1024);
+	blk_queue_bounce_limit(pmbd->pmbd_queue, BLK_BOUNCE_ANY);
+	blk_queue_merge_bvec(pmbd->pmbd_queue, pmbd_mergeable_bvec);
+
+	disk = pmbd->pmbd_disk = alloc_disk(1 << part_shift);
+	if (!disk)
+		goto out_free_queue;
+
+	disk->major		= PMBD_MAJOR;
+	disk->first_minor	= i << part_shift;
+	disk->fops		= &pmbd_fops;
+	disk->private_data	= pmbd;
+	disk->queue		= pmbd->pmbd_queue;
+	strcpy(disk->disk_name, pmbd->pmbd_name);
+	set_capacity(disk, GB_TO_SECTORS(g_pmbd_size[i])); /* num of sectors */
+
+	/* allocate PM space */
+	if (pmbd_create(pmbd, GB_TO_SECTORS(g_pmbd_size[i])) < 0)
+		goto out_free_queue;
+
+	/* done */
+	return pmbd;
+
+out_free_queue:
+	blk_cleanup_queue(pmbd->pmbd_queue);
+out_free_dev:
+	kfree(pmbd);
+out:
+	return NULL;
+}
+
+static void pmbd_free(PMBD_DEVICE_T *pmbd)
+{
+	put_disk(pmbd->pmbd_disk);
+	blk_cleanup_queue(pmbd->pmbd_queue);
+	pmbd_free_pages(pmbd);
+	kfree(pmbd);
+}
+
+static void pmbd_del_one(PMBD_DEVICE_T *pmbd)
+{
+	list_del(&pmbd->pmbd_list);
+	del_gendisk(pmbd->pmbd_disk);
+	pmbd_free(pmbd);
+}
+
+static int __init pmbd_init(void)
+{
+	int i, nr;
+	PMBD_DEVICE_T *pmbd, *next;
+
+	/* parse input options */
+	pmbd_parse_conf();
+
+	/* initialize pmap start*/
+	pmap_create();
+
+	/* ioremap high memory space */
+	if (PMBD_USE_HIGHMEM()) {
+		if (pmbd_highmem_map() == NULL) 
+			return -ENOMEM;
+	}
+
+	part_shift = 0;
+	if (max_part > 0)
+		part_shift = fls(max_part);
+
+	if (g_pmbd_nr > 1UL << (MINORBITS - part_shift))
+		return -EINVAL;
+
+	if (g_pmbd_nr) {
+		nr = g_pmbd_nr;
+	} else {
+		printk(KERN_ERR "pmbd: %s(%d) - g_pmbd_nr=%d\n", __FUNCTION__, __LINE__, g_pmbd_nr);
+		return -EINVAL;
+	} 
+
+	pmbd_proc_create();
+
+	if (register_blkdev(PMBD_MAJOR, PMBD_NAME)) {
+		pmbd_proc_destroy();
+		return -EIO;
+	} else
+		printk(KERN_INFO "pmbd: registered device at major %d\n", PMBD_MAJOR);
+
+	for (i = 0; i < nr; i++) {
+		pmbd = pmbd_alloc(i);
+		if (!pmbd)
+			goto out_free;
+		list_add_tail(&pmbd->pmbd_list, &pmbd_devices);
+	}
+
+	/* point of no return */
+	list_for_each_entry(pmbd, &pmbd_devices, pmbd_list)
+		add_disk(pmbd->pmbd_disk);
+
+	printk(KERN_INFO "pmbd: module loaded\n");
+	return 0;
+
+out_free:
+	list_for_each_entry_safe(pmbd, next, &pmbd_devices, pmbd_list) {
+		list_del(&pmbd->pmbd_list);
+		pmbd_free(pmbd);
+	}
+	unregister_blkdev(PMBD_MAJOR, PMBD_NAME);
+
+	return -ENOMEM;
+}
+
+
+static void __exit pmbd_exit(void)
+{
+	PMBD_DEVICE_T *pmbd, *next;
+
+	/* deactivate each pmbd instance*/
+	list_for_each_entry_safe(pmbd, next, &pmbd_devices, pmbd_list)
+		pmbd_del_one(pmbd);
+
+	/* deioremap high memory space */
+	if (PMBD_USE_HIGHMEM()) {
+		pmbd_highmem_unmap(); 
+	}
+
+	/* destroy pmap entries */
+	pmap_destroy();
+
+	unregister_blkdev(PMBD_MAJOR, PMBD_NAME);
+
+	pmbd_proc_destroy();
+
+	printk(KERN_INFO "pmbd: module unloaded\n");
+	return;
+}
+
+/* module setup */
+MODULE_AUTHOR("Intel Corporation <linux-pmbd at intel.com>");
+MODULE_ALIAS("pmbd");
+MODULE_LICENSE("GPL v2");
+MODULE_VERSION("0.9");
+MODULE_ALIAS_BLOCKDEV_MAJOR(PMBD_MAJOR);
+module_init(pmbd_init);
+module_exit(pmbd_exit);
+
+/* THE END */
+
+
diff --git a/include/linux/pmbd.h b/include/linux/pmbd.h
new file mode 100644
index 0000000..8e8691f
--- /dev/null
+++ b/include/linux/pmbd.h
@@ -0,0 +1,509 @@
+/*
+ * Intel Persistent Memory Block Driver
+ * Copyright (c) <2011-2013>, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+/* 
+ * Intel Persistent Memory Block Driver (v0.9)
+ *
+ * pmbd.h
+ *
+ * Intel Corporation <linux-pmbd at intel.com>
+ * 03/24/2011
+ */ 
+
+#ifndef PMBD_H
+#define PMBD_H
+
+#define PMBD_MAJOR 			261		/* FIXME: temporarily use this */
+#define PMBD_NAME			"pmbd"		/* pmbd module name */
+#define PMBD_MAX_NUM_DEVICES 		26		/* max num of devices */
+#define PMBD_MAX_NUM_CPUS		32		/* max num of cpus*/
+
+/*
+ * type definitions  
+ */ 
+typedef uint32_t			PMBD_CHECKSUM_T;/* we use CRC32 to calculate checksum */
+typedef sector_t			BBN_T;		/* buffer block number */
+typedef sector_t			PBN_T;		/* physical block number */
+
+
+/*
+ * PMBD device buffer control structure
+ * NOTE:
+ * (1) buffer_space is an array of num_blocks blocks, each of the size
+ * defined by pmbd->pb_size
+ * (2) bbi_space is an array of num_blocks bbi (buffer block info) units,
+ * each of which holds the metadata of one block in the buffer
+ *
+ * buffer space management variables:
+ * num_dirty - total number of dirty blocks in the buffer
+ * pos_dirty - points to the first block of the dirty region
+ * pos_clean - points to the first block of the clean region
+ *
+ * pos_dirty and pos_clean logically segment the buffer into
+ * dirty/clean regions as follows.
+ *
+ *   pos_dirty ----v       v--- pos_clean
+ *       ----------------------------
+ *       |  clean  |*DIRTY*| clean  |
+ *       ----------------------------
+ * buffer_lock - protects reads/writes to the three fields above
+ */
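+/*
+ * Example (illustrative, assuming the dirty region is contiguous and
+ * circular as drawn above): with num_blocks = 8, pos_dirty = 2 and
+ * num_dirty = 3, blocks 2..4 are dirty and pos_clean = 5.
+ */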
+typedef struct pmbd_bbi {				/* pmbd buffer block info (BBI) */
+	PBN_T				pbn;		/* physical block number in PM (converted from sector) */
+	unsigned			dirty;		/* dirty (1) or clean (0)*/
+} PMBD_BBI_T;
+
+typedef struct pmbd_bsort_entry {			/* pmbd buffer block info for sorting */
+	BBN_T				bbn;		/* buffer block number (in buffer)*/
+	PBN_T				pbn;		/* physical block number (in PMBD)*/
+} PMBD_BSORT_ENTRY_T;
+
+typedef struct pmbd_buffer {
+	unsigned			buffer_id;
+	struct pmbd_device* 		pmbd;		/* the linked pmbd device */
+
+	BBN_T				num_blocks;	/* buffer space size (# of blocks) */
+	void* 				buffer_space;	/* buffer space base vaddr address */
+	PMBD_BBI_T*			bbi_space;	/* array of buffer block info (BBI)*/
+
+	BBN_T				num_dirty;	/* num of dirty blocks */
+	BBN_T				pos_dirty;	/* the first dirty block */
+	BBN_T				pos_clean;	/* the first clean block */
+	spinlock_t			buffer_lock;	/* lock to protect metadata updates */
+	unsigned int			batch_size;	/* the batch size for flushing buffer pages */
+
+	struct task_struct*		syncer;		/* the syncer daemon */
+
+	spinlock_t			flush_lock;	/* lock to protect metadata updates */
+	PMBD_BSORT_ENTRY_T*		bbi_sort_buffer;/* a temp array of the bbi for sorting */
+} PMBD_BUFFER_T;
+
+/*
+ * PM physical block information (each corresponding to a PM block)
+ *
+ * (1) if the physical block is buffered, bbn contains a valid buffer block
+ * number (BBN) between 0 - (buffer->num_blocks-1), otherwise, it contains an
+ * invalid value (buffer->num_blocks + 1)
+ * (2) any access to the block (read/write/sync) must have this lock first to
+ * prevent multiple concurrent accesses to the same PM block
+ */
+typedef struct pmbd_pbi{
+	BBN_T				bbn;
+	spinlock_t			lock;	
+} PMBD_PBI_T;
+
+typedef struct pmbd_stat{
+	/* stat_lock does not protect cycles_*[] counters */
+	spinlock_t			stat_lock;		/* protection lock */
+
+	unsigned			last_access_jiffies;	/* the timestamp of the most recent access */
+	uint64_t			num_sectors_read;	/* total num of sectors being read */
+	uint64_t			num_sectors_write;	/* total num of sectors being written */
+	uint64_t			num_requests_read;	/* total num of requests for read */
+	uint64_t			num_requests_write;	/* total num of request for write */
+	uint64_t			num_write_barrier;	/* total num of write barriers received */
+	uint64_t			num_write_fua;		/* total num of FUA writes received */
+	
+	/* cycles counters (enabled/disabled by timestat)*/
+	uint64_t			cycles_total[2][PMBD_MAX_NUM_CPUS];	/* total cycles in make_request*/
+	uint64_t			cycles_prepare[2][PMBD_MAX_NUM_CPUS];	/* total cycles for prepare in make_request*/
+	uint64_t			cycles_wb[2][PMBD_MAX_NUM_CPUS];	/* total cycles for write barrier in make_request*/
+	uint64_t			cycles_work[2][PMBD_MAX_NUM_CPUS];	/* total cycles for work in make_request*/
+	uint64_t			cycles_endio[2][PMBD_MAX_NUM_CPUS];	/* total cycles for endio in make_request*/
+	uint64_t			cycles_finish[2][PMBD_MAX_NUM_CPUS];	/* total cycles for finish-up in make_request*/
+
+	uint64_t			cycles_pmap[2][PMBD_MAX_NUM_CPUS];	/* total cycles for private mapping*/
+	uint64_t			cycles_punmap[2][PMBD_MAX_NUM_CPUS];	/* total cycles for private unmapping */
+	uint64_t			cycles_memcpy[2][PMBD_MAX_NUM_CPUS];	/* total cycles for memcpy */
+	uint64_t			cycles_clflush[2][PMBD_MAX_NUM_CPUS];	/* total cycles for clflush_range */
+	uint64_t			cycles_clflushall[2][PMBD_MAX_NUM_CPUS];/* total cycles for clflush_all */
+	uint64_t			cycles_wrverify[2][PMBD_MAX_NUM_CPUS];	/* total cycles for doing write verification */
+	uint64_t			cycles_checksum[2][PMBD_MAX_NUM_CPUS];	/* total cycles for doing checksum */
+	uint64_t			cycles_pause[2][PMBD_MAX_NUM_CPUS];	/* total cycles for pause */
+	uint64_t			cycles_slowdown[2][PMBD_MAX_NUM_CPUS];	/* total cycles for slowdown*/
+	uint64_t			cycles_setpages_ro[2][PMBD_MAX_NUM_CPUS]; /*total cycles for set pages to ro*/
+	uint64_t			cycles_setpages_rw[2][PMBD_MAX_NUM_CPUS]; /*total cycles for set pages to rw*/
+} PMBD_STAT_T;
+
+/*
+ * pmbd_device structure (each corresponding to a pmbd instance)
+ */
+#define PBN_TO_PMBD_BUFFER_ID(PMBD, PBN)	(((PBN)/(PMBD)->buffer_stride) % (PMBD)->num_buffers)
+#define PBN_TO_PMBD_BUFFER(PMBD, PBN)	((PMBD)->buffers[PBN_TO_PMBD_BUFFER_ID((PMBD), (PBN))])
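+/*
+ * Example (illustrative): with buffer_stride = 1024 and num_buffers = 4,
+ * PBNs 0..1023 map to buffer 0, 1024..2047 to buffer 1, 2048..3071 to
+ * buffer 2, 3072..4095 to buffer 3, and 4096..5119 wrap back to buffer 0.
+ */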
+
+typedef struct pmbd_device {
+	int				pmbd_id;		/* dev id */
+	char				pmbd_name[DISK_NAME_LEN];/* device name */
+
+	struct request_queue *		pmbd_queue;
+	struct gendisk *		pmbd_disk;
+	struct list_head		pmbd_list;
+
+	/* PM backstore space */
+	void*				mem_space;	/* pointer to the kernel mem space */
+	uint64_t 			num_sectors;	/* PMBD device capacity (num of 512-byte sectors)*/
+	unsigned 			sector_size;	/* 512 bytes */
+
+	/* configurations */
+	unsigned 			pmbd_type;	/* vmalloc() or high_mem */
+	unsigned 			rammode;	/* RAM mode (no write protection) or not */
+	unsigned			bufmode;	/* use buffer or not */
+	unsigned			wpmode;		/* write protection mode: PTE change (0) or CR0/WP bit switch (1)*/
+
+	/* buffer management */
+	PMBD_BUFFER_T**			buffers;	/* buffer control structure */
+	unsigned 			num_buffers;	/* number of buffers */
+	unsigned			buffer_stride;	/* the number of contiguous blocks mapped to the same buffer */
+
+
+
+	/* physical block info (metadata) */	
+	PMBD_PBI_T*			pbi_space;	/* physical block info space (each) */
+	unsigned			pb_size;	/* the unit size of each block (4096 in default) */
+
+	/* checksum */
+	PMBD_CHECKSUM_T*		checksum_space;		/* checksum array */
+	unsigned 			checksum_unit_size;	/* checksum unit size (bytes) */
+	void*				checksum_iomem_buf;	/* one unit buffer for ioremapped PM */
+
+	/* emulating PM with injected latency */
+	unsigned			simmode;	/* simulating whole device (0) or PM only (1)*/
+	uint64_t			rdlat;		/* read access latency (in nanoseconds)*/
+	uint64_t			wrlat;		/* write access latency (in nanoseconds)*/
+	uint64_t			rdbw;		/* read bandwidth (MB/sec) */
+	uint64_t			wrbw;		/* write bandwidth (MB/sec) */
+	unsigned			rdsx;		/* read slowdown (X) */
+	unsigned			wrsx;		/* write slowdown (X) */
+	uint64_t			rdpause;	/* read pause (cycles per 4KB page) */
+	uint64_t			wrpause;	/* write pause (cycles per 4KB page) */
+
+	spinlock_t			batch_lock;		/* lock protecting batch_* fields */
+	uint64_t			batch_start_cycle[2]; 	/* start time of the batch (cycles)*/
+	uint64_t			batch_end_cycle[2];	/* end time of the batch (cycles) */
+	uint64_t			batch_sectors[2];	/* the total num of sectors in the batch */ 
+
+	PMBD_STAT_T*			pmbd_stat;	/* statistics data */
+	struct proc_dir_entry* 		proc_devstat;	/* the proc output */
+
+	spinlock_t			wr_barrier_lock;/* for write barrier and other control */
+	atomic_t			num_flying_wr;	/* the counter of writes on the fly */
+
+	spinlock_t			tmp_lock;
+	uint64_t			tmp_data;
+	unsigned long			tmp_num;
+} PMBD_DEVICE_T;
+
+/*
+ * support definitions
+ */
+#define TRUE				1
+#define FALSE				0
+
+#define __CURRENT_PID__			(current->pid)
+#define CONFIG_PMBD_DEBUG		1
+//#define PRINTK_DEBUG_HDR		"DEBUG %s(%d)%u - "
+//#define PRINTK_DEBUG_PAR		__FUNCTION__, __LINE__, __CURRENT_PID__
+//#define PRINTK_DEBUG_1		if(CONFIG_PMBD_DEBUG >= 1) printk
+//#define PRINTK_DEBUG_2		if(CONFIG_PMBD_DEBUG >= 2) printk
+//#define PRINTK_DEBUG_3		if(CONFIG_PMBD_DEBUG >= 3) printk
+
+#define MAX_OF(A, B)			(((A) > (B))? (A) : (B))
+#define MIN_OF(A, B)			(((A) < (B))? (A) : (B))
+
+#define SECTOR_SHIFT			9
+#define PAGE_SHIFT			12
+#define SECTOR_SIZE			(1UL << SECTOR_SHIFT)
+//#define PAGE_SIZE			(1UL << PAGE_SHIFT)
+#define SECTOR_MASK			(~(SECTOR_SIZE-1))
+#define PAGE_MASK			(~(PAGE_SIZE-1))
+#define PMBD_SECTOR_SIZE			SECTOR_SIZE
+#define PMBD_PAGE_SIZE			PAGE_SIZE
+#define KB_SHIFT			10
+#define MB_SHIFT			20
+#define GB_SHIFT			30
+#define MB_TO_BYTES(N)			((N) << MB_SHIFT)
+#define GB_TO_BYTES(N)			((N) << GB_SHIFT)
+#define BYTES_TO_MB(N)			((N) >> MB_SHIFT)
+#define BYTES_TO_GB(N)			((N) >> GB_SHIFT)
+#define MB_TO_SECTORS(N)		((N) << (MB_SHIFT - SECTOR_SHIFT))
+#define GB_TO_SECTORS(N)		((N) << (GB_SHIFT - SECTOR_SHIFT))
+#define SECTORS_TO_MB(N)		((N) >> (MB_SHIFT - SECTOR_SHIFT))
+#define SECTORS_TO_GB(N)		((N) >> (GB_SHIFT - SECTOR_SHIFT))
+#define SECTOR_TO_PAGE(N)		((N) >> (PAGE_SHIFT - SECTOR_SHIFT))
+#define SECTOR_TO_BYTE(N)		((N) << SECTOR_SHIFT)
+#define BYTE_TO_SECTOR(N)		((N) >> SECTOR_SHIFT)
+#define PAGE_TO_SECTOR(N)		((N) << (PAGE_SHIFT - SECTOR_SHIFT))
+#define BYTE_TO_PAGE(N)			((N) >> (PAGE_SHIFT))
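+/* Example: a 16GB device has GB_TO_SECTORS(16) = 16 << 21 = 33554432 sectors of 512 bytes. */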
+
+#define IS_SPACE(C) 			(isspace(C) || (C) == '\0')
+#define IS_DIGIT(C) 			(isdigit(C) && (C) != '\0')
+#define IS_ALPHA(C)			(isalpha(C) && (C) != '\0')
+
+#define DISABLE_SAVE_IRQ(FLAGS)		{local_irq_save((FLAGS)); local_irq_disable();}
+#define ENABLE_RESTORE_IRQ(FLAGS)	{local_irq_restore((FLAGS)); local_irq_enable();}
+#define CUR_CPU_ID()			smp_processor_id()
+
+/*
+ * PMBD related config
+ */ 
+
+#define PMBD_CONFIG_VMALLOC  		0 /* vmalloc() based PMBD (default) */
+#define PMBD_CONFIG_HIGHMEM  		1 /* ioremap() based PMBD */
+
+
+/* global config */
+#define PMBD_IS_MERGEABLE()		(g_pmbd_mergeable == TRUE)
+#define PMBD_USE_VMALLOC()		(g_pmbd_type == PMBD_CONFIG_VMALLOC)
+#define PMBD_USE_HIGHMEM()		(g_pmbd_type == PMBD_CONFIG_HIGHMEM)
+#define PMBD_USE_CLFLUSH()		(g_pmbd_cpu_cache_clflush == TRUE)
+#define PMBD_CPU_CACHE_FLAG()		((g_pmbd_cpu_cache_flag == _PAGE_CACHE_WB)? "WB" : \
+					((g_pmbd_cpu_cache_flag == _PAGE_CACHE_WC)? "WC" : \
+					((g_pmbd_cpu_cache_flag == _PAGE_CACHE_UC)? "UC" : \
+					((g_pmbd_cpu_cache_flag == _PAGE_CACHE_UC_MINUS)? "UC-Minus" : "UNKNOWN"))))
+
+#define PMBD_CPU_CACHE_USE_WB()		(g_pmbd_cpu_cache_flag == _PAGE_CACHE_WB)	/* write back */
+#define PMBD_CPU_CACHE_USE_WC()		(g_pmbd_cpu_cache_flag == _PAGE_CACHE_WC)	/* write combining */
+#define PMBD_CPU_CACHE_USE_UC()		(g_pmbd_cpu_cache_flag == _PAGE_CACHE_UC)	/* uncacheable */
+#define PMBD_CPU_CACHE_USE_UM()		(g_pmbd_cpu_cache_flag == _PAGE_CACHE_UC_MINUS)	/* uncacheable minus */
+
+#define PMBD_USE_WRITE_PROTECTION()	(g_pmbd_wr_protect == TRUE)
+#define PMBD_USE_WRITE_VERIFICATION()	(g_pmbd_wr_verify == TRUE)
+#define PMBD_USE_CHECKSUM()		(g_pmbd_checksum == TRUE)
+#define PMBD_USE_LOCK()			(g_pmbd_lock == TRUE)
+#define PMBD_USE_SUBPAGE_UPDATE()	(g_pmbd_subpage_update == TRUE)
+
+#define PMBD_USE_PMAP()			(g_pmbd_pmap == TRUE && g_pmbd_type == PMBD_CONFIG_HIGHMEM)
+#define PMBD_USE_NTS()			(g_pmbd_nts == TRUE)
+#define PMBD_USE_NTL()			(g_pmbd_ntl == TRUE)
+#define PMBD_USE_WB()			(g_pmbd_wb == TRUE)
+#define PMBD_USE_FUA()			(g_pmbd_fua == TRUE)
+#define PMBD_USE_TIMESTAT()		(g_pmbd_timestat == TRUE)
+
+#define TIMESTAMP(TS)			rdtscll((TS))
+#define TIMESTAT_POINT(TS)		do { (TS) = 0; if (PMBD_USE_TIMESTAT()) rdtscll((TS)); } while (0)
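+
+/*
+ * Timing sketch (illustrative): TIMESTAT_POINT() leaves both stamps at zero
+ * when timestat is off, so the rdtscll() cost is only paid when enabled:
+ *
+ *	uint64_t t1, t2;
+ *	TIMESTAT_POINT(t1);
+ *	... do the work being measured ...
+ *	TIMESTAT_POINT(t2);
+ *	if (t2 > t1)
+ *		cycles += t2 - t1;
+ */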
+
+/* instance-based config */
+#define PMBD_DEV_USE_VMALLOC(PMBD)	((PMBD)->pmbd_type == PMBD_CONFIG_VMALLOC)
+#define PMBD_DEV_USE_HIGHMEM(PMBD)	((PMBD)->pmbd_type == PMBD_CONFIG_HIGHMEM)
+#define PMBD_DEV_USE_BUFFER(PMBD)		((PMBD)->bufmode)
+#define PMBD_DEV_USE_WPMODE_PTE(PMBD)	((PMBD)->wpmode == 0)
+#define PMBD_DEV_USE_WPMODE_CR0(PMBD)	((PMBD)->wpmode == 1)
+
+#define PMBD_DEV_USE_EMULATION(PMBD)	((PMBD)->rdlat || (PMBD)->wrlat || (PMBD)->rdbw || (PMBD)->wrbw)
+#define PMBD_DEV_SIM_PMBD(PMBD)		(PMBD_DEV_USE_EMULATION((PMBD)) && (PMBD)->simmode == 1)
+#define PMBD_DEV_SIM_DEV(PMBD)		(PMBD_DEV_USE_EMULATION((PMBD)) && (PMBD)->simmode == 0)
+#define PMBD_DEV_USE_SLOWDOWN(PMBD)	((PMBD)->rdsx > 1 || (PMBD)->wrsx > 1)
+
+/* support functions */
+#define PMBD_MEM_TOTAL_SECTORS(PMBD)	((PMBD)->num_sectors)
+#define PMBD_MEM_TOTAL_BYTES(PMBD)	((PMBD)->num_sectors * (PMBD)->sector_size)
+#define PMBD_MEM_TOTAL_PAGES(PMBD)	(((PMBD)->num_sectors) >> (PAGE_SHIFT - SECTOR_SHIFT))
+#define PMBD_MEM_SPACE_FIRST_BYTE(PMBD)	((PMBD)->mem_space)
+#define PMBD_MEM_SPACE_LAST_BYTE(PMBD)	((PMBD)->mem_space + PMBD_MEM_TOTAL_BYTES(PMBD) - 1)
+#define PMBD_CHECKSUM_TOTAL_NUM(PMBD) 	(PMBD_MEM_TOTAL_BYTES(PMBD) / (PMBD)->checksum_unit_size)
+#define PMBD_LOCK_TOTAL_NUM(PMBD) 	(PMBD_MEM_TOTAL_BYTES(PMBD) / (PMBD)->lock_unit_size)
+#define VADDR_IN_PMBD_SPACE(PMBD, ADDR)	((ADDR) >= PMBD_MEM_SPACE_FIRST_BYTE(PMBD) \
+						&& (ADDR) <= PMBD_MEM_SPACE_LAST_BYTE(PMBD))
+
+#define BYTE_TO_PBN(PMBD, BYTES)		((BYTES) / (PMBD)->pb_size)
+#define PBN_TO_BYTE(PMBD, PBN)		((PBN) * (PMBD)->pb_size)
+#define SECTOR_TO_PBN(PMBD, SECT)	(BYTE_TO_PBN((PMBD), SECTOR_TO_BYTE(SECT)))
+#define PBN_TO_SECTOR(PMBD, PBN)		(BYTE_TO_SECTOR(PBN_TO_BYTE((PMBD), (PBN))))
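+
+/*
+ * Example (illustrative): with pb_size == 4096, one PM block covers eight
+ * 512-byte sectors, so SECTOR_TO_PBN(pmbd, 17) == (17 << 9) / 4096 == 2,
+ * and PBN_TO_SECTOR(pmbd, 2) == 8192 >> 9 == 16 (the block's first sector).
+ */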
+
+
+#define PMBD_CACHELINE_SIZE			(64)	/* FIXME: should be configured per machine (check x86_clflush_size) */
+
+/* buffer related functions */
+#define CALLER_ALLOCATOR			(0)
+#define CALLER_SYNCER				(1)
+#define CALLER_DESTROYER			(2)
+
+#define PMBD_BLOCK_VADDR(PMBD, PBN)		((PMBD)->mem_space + ((PMBD)->pb_size * (PBN)))
+#define PMBD_BLOCK_PBI(PMBD, PBN)			((PMBD)->pbi_space + (PBN))
+#define PMBD_TOTAL_PB_NUM(PMBD) 			(PMBD_MEM_TOTAL_BYTES(PMBD) / (PMBD)->pb_size)
+#define PMBD_BLOCK_IS_BUFFERED(PMBD, PBN)		(PMBD_BLOCK_PBI((PMBD),(PBN))->bbn < PBN_TO_PMBD_BUFFER((PMBD), (PBN))->num_blocks)
+#define PMBD_SET_BLOCK_BUFFERED(PMBD, PBN, BBN)	(PMBD_BLOCK_PBI((PMBD),(PBN))->bbn = (BBN))
+#define PMBD_SET_BLOCK_UNBUFFERED(PMBD, PBN)	(PMBD_BLOCK_PBI((PMBD),(PBN))->bbn = PMBD_TOTAL_PB_NUM((PMBD)) + 3)
+//#define PMBD_SET_BLOCK_UNBUFFERED(PMBD, PBN)	(PMBD_BLOCK_PBI((PMBD),(PBN))->bbn = PBN_TO_PMBD_BUFFER((PMBD), (PBN))->num_blocks + 1)
+
+#define PMBD_BUFFER_MIN_BUFSIZE			(4) 	/* minimum buffer size (in MBs) */
+#define PMBD_BUFFER_BLOCK(BUF, BBN)		((BUF)->buffer_space + (BUF)->pmbd->pb_size*(BBN))
+#define PMBD_BUFFER_BBI(BUF, BBN)		((BUF)->bbi_space + (BBN))
+#define PMBD_BUFFER_BBI_INDEX(BUF, ADDR)		((ADDR)-(BUF)->bbi_space)
+#define PMBD_BUFFER_SET_BBI_CLEAN(BUF, BBN)	((PMBD_BUFFER_BBI((BUF), (BBN)))->dirty = FALSE)
+#define PMBD_BUFFER_SET_BBI_DIRTY(BUF, BBN)	((PMBD_BUFFER_BBI((BUF), (BBN)))->dirty = TRUE)
+#define PMBD_BUFFER_BBI_IS_CLEAN(BUF, BBN)	((PMBD_BUFFER_BBI((BUF), (BBN)))->dirty == FALSE)
+#define PMBD_BUFFER_BBI_IS_DIRTY(BUF, BBN)	((PMBD_BUFFER_BBI((BUF), (BBN)))->dirty == TRUE)
+#define PMBD_BUFFER_SET_BBI_BUFFERED(BUF,BBN,PBN)((PMBD_BUFFER_BBI((BUF), (BBN)))->pbn = (PBN))
+#define PMBD_BUFFER_SET_BBI_UNBUFFERED(BUF, BBN)	((PMBD_BUFFER_BBI((BUF), (BBN)))->pbn = PMBD_TOTAL_PB_NUM((BUF)->pmbd) + 2)
+
+/* watermarks as integer percentages (floating point must not be used in kernel code) */
+#define PMBD_BUFFER_FLUSH_HW			(70)	/* high watermark (percent of buffer blocks) */
+#define PMBD_BUFFER_FLUSH_LW			(10)	/* low watermark (percent of buffer blocks) */
+#define PMBD_BUFFER_IS_FULL(BUF)			((BUF)->num_dirty >= (BUF)->num_blocks)
+#define PMBD_BUFFER_IS_EMPTY(BUF)		((BUF)->num_dirty == 0)
+#define PMBD_BUFFER_ABOVE_HW(BUF)		((BUF)->num_dirty * 100 >= (BUF)->num_blocks * PMBD_BUFFER_FLUSH_HW)
+#define PMBD_BUFFER_BELOW_HW(BUF)		((BUF)->num_dirty * 100 < (BUF)->num_blocks * PMBD_BUFFER_FLUSH_HW)
+#define PMBD_BUFFER_ABOVE_LW(BUF)		((BUF)->num_dirty * 100 >= (BUF)->num_blocks * PMBD_BUFFER_FLUSH_LW)
+#define PMBD_BUFFER_BELOW_LW(BUF)		((BUF)->num_dirty * 100 < (BUF)->num_blocks * PMBD_BUFFER_FLUSH_LW)
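+
+/*
+ * Example (illustrative): with num_blocks == 1024, flushing starts once
+ * num_dirty reaches 717 (70% of 1024, rounded up) and keeps draining until
+ * fewer than 103 blocks are dirty (10%), so dirty blocks are written back
+ * in large, efficient bursts.
+ */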
+#define PMBD_BUFFER_BATCH_SIZE_DEFAULT		(1024)	/* the batch size for each flush */
+
+#define PMBD_BUFFER_NEXT_POS(BUF, POS)		(((POS)==((BUF)->num_blocks - 1))? 0 : ((POS)+1))
+#define PMBD_BUFFER_PRIO_POS(BUF, POS)		(((POS)== 0)? ((BUF)->num_blocks - 1) : ((POS)-1))
+#define PMBD_BUFFER_NEXT_N_POS(BUF,POS,N)	(((POS)+(N))%((BUF)->num_blocks))
+#define PMBD_BUFFER_PRIO_N_POS(BUF,POS,N)	(((POS) + (BUF)->num_blocks - ((N) % (BUF)->num_blocks)) % (BUF)->num_blocks)	/* N positions back, wrapped into [0, num_blocks) */
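+
+/*
+ * Example (illustrative): in a ring of num_blocks == 8,
+ * PMBD_BUFFER_NEXT_N_POS(buf, 6, 5) == (6 + 5) % 8 == 3 and
+ * PMBD_BUFFER_PRIO_N_POS(buf, 2, 5) == (2 + 8 - 5) % 8 == 5;
+ * both wrap around the circular buffer index space.
+ */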
+
+/* high memory */
+#define PMBD_HIGHMEM_AVAILABLE_SPACE 		(g_highmem_virt_addr + g_highmem_size - g_highmem_curr_addr)
+
+/* emulation */
+#define MAX_SYNC_SLOWDOWN			(10000000)	/* use async_slowdown, if larger than 10ms */
+#define OVERHEAD_NANOSEC			(100)
+#define PMBD_USLEEP(n) 				do { set_current_state(TASK_INTERRUPTIBLE); \
+		        				schedule_timeout((n)*HZ/1000000); } while (0)
+
+/* statistics */
+#define PMBD_BATCH_MAX_SECTORS   		(4096)		/* maximum data amount requested in a batch */
+#define PMBD_BATCH_MIN_SECTORS   		(256)		/* minimum data amount requested in a batch */
+#define PMBD_BATCH_MAX_INTERVAL 		(1000000)	/* maximum interval between two requests in a batch*/
+#define PMBD_BATCH_MAX_DURATION  		(10000000)	/* maximum duration of a batch (ns)*/
+
+/* write protection*/
+#define VADDR_TO_PAGE(ADDR)			((ADDR) >> PAGE_SHIFT)
+#define PAGE_TO_VADDR(PAGE)			((PAGE) << PAGE_SHIFT)
+
+/* checksum */
+#define VADDR_TO_CHECKSUM_IDX(PMBD, ADDR)	(((ADDR) - (PMBD)->mem_space) / (PMBD)->checksum_unit_size)
+#define CHECKSUM_IDX_TO_VADDR(PMBD, IDX) 	((PMBD)->mem_space + (IDX) * (PMBD)->checksum_unit_size)
+#define CHECKSUM_IDX_TO_CKADDR(PMBD, IDX)	((PMBD)->checksum_space + (IDX))
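+
+/*
+ * Example (illustrative): with checksum_unit_size == 4096, the page at
+ * mem_space + 5*4096 maps to checksum index 5, whose checksum is stored
+ * at checksum_space + 5.
+ */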
+
+/* idle period timer */
+#define PMBD_BUFFER_FLUSH_IDLE_TIMEOUT		(2000)		/* idle timeout threshold (see PMBD_DEV_IS_IDLE()) */
+#define PMBD_DEV_UPDATE_ACCESS_TIME(PMBD)		do { spin_lock(&(PMBD)->pmbd_stat->stat_lock); \
+						(PMBD)->pmbd_stat->last_access_jiffies = jiffies; \
+						spin_unlock(&(PMBD)->pmbd_stat->stat_lock); } while (0)
+#define PMBD_DEV_GET_ACCESS_TIME(PMBD, T)		do { spin_lock(&(PMBD)->pmbd_stat->stat_lock); \
+						(T) = (PMBD)->pmbd_stat->last_access_jiffies; \
+						spin_unlock(&(PMBD)->pmbd_stat->stat_lock); } while (0)
+#define PMBD_DEV_IS_IDLE(PMBD, IDLE)		((IDLE) > PMBD_BUFFER_FLUSH_IDLE_TIMEOUT)
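+
+/*
+ * Intended-use sketch (illustrative; assumes the idle value is an elapsed
+ * jiffies count): a flush daemon samples the last access time and flushes
+ * opportunistically once the device has gone idle:
+ *
+ *	unsigned long last, idle;
+ *	PMBD_DEV_GET_ACCESS_TIME(pmbd, last);
+ *	idle = jiffies - last;
+ *	if (PMBD_DEV_IS_IDLE(pmbd, idle))
+ *		... write back dirty buffer blocks ...
+ */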
+
+/* Help info */
+#define USAGE_INFO \
+"\n\n\
+============================================\n\
+Intel Persistent Memory Block Driver (v0.9)\n\
+============================================\n\n\
+usage: $ modprobe pmbd mode=\"pmbd<#>;hmo<#>;hms<#>;[Option1];[Option2];[Option3];..\"\n\
+\n\
+GENERAL OPTIONS: \n\
+\t pmbd<#,#..> \t set PM block device size (GBs) \n\
+\t HM|VM \t\t use high memory (HM default) or vmalloc (VM) \n\
+\t hmo<#> \t high memory starting offset (GB) \n\
+\t hms<#> \t high memory size (GBs) \n\
+\t pmap<Y|N> \t use private mapping (Y) or not (N default) - (note: must enable HM and wrprotN) \n\
+\t nts<Y|N> \t use non-temporal store (MOVNTQ) and sfence to do memcpy (Y), or regular memcpy (N default)\n\
+\t wb<Y|N> \t use write barrier (Y) or not (N default)\n\
+\t fua<Y|N> \t use WRITE_FUA (Y default) or not (N) \n\
+\t ntl<Y|N> \t use non-temporal load (MOVNTDQA) to do memcpy (Y), or regular memcpy (N default) - this option enforces memory type of write combining\n\
+\n\
+SIMULATION: \n\
+\t simmode<#,#..>  apply the simulated speed to the whole device (0 default) or to the PM space only (1)\n\
+\t rdlat<#,#..> \t set read access latency (ns) \n\
+\t wrlat<#,#..> \t set write access latency (ns)\n\
+\t rdbw<#,#..> \t set read bandwidth (MB/sec)  (if set 0, no emulation) \n\
+\t wrbw<#,#..> \t set write bandwidth (MB/sec) (if set 0, no emulation) \n\
+\t rdsx<#,#..> \t set the relative slowdown (x) for read \n\
+\t wrsx<#,#..> \t set the relative slowdown (x) for write \n\
+\t rdpause<#,.> \t set a pause (cycles per 4KB) for each read\n\
+\t wrpause<#,.> \t set a pause (cycles per 4KB) for each write\n\
+\t adj<#> \t set an adjustment to the system overhead (nanoseconds) \n\
+\n\
+WRITE PROTECTION: \n\
+\t wrprot<Y|N> \t use write protection for PM pages? (Y or N)\n\
+\t wpmode<#,#,..>  write protection mode: use the PTE change (0 default) or switch CR0/WP bit (1) \n\
+\t clflush<Y|N> \t use clflush to flush CPU cache for each write to PM space? (Y or N) \n\
+\t wrverify<Y|N> \t use write verification for PM pages? (Y or N) \n\
+\t checksum<Y|N> \t use checksum to protect PM pages? (Y or N)\n\
+\t bufsize<#,#,..> the buffer size (MBs) (0 - no buffer; at least 4MB)\n\
+\t bufnum<#> \t the number of buffers for a PMBD device (default 16; at least 1 if buffering is used; 0 - no buffer) \n\
+\t bufstride<#> \t the number of contiguous blocks (4KB) mapped into one buffer (bucket size for round-robin mapping) (default 1024)\n\
+\t batch<#,#> \t the batch size (num of pages) for flushing PMBD device buffer (1 means no batching) \n\
+\n\
+MISC: \n\
+\t mgb<Y|N> \t mergeable? (Y or N) \n\
+\t lock<Y|N> \t lock the on-access page to serialize accesses? (Y or N) \n\
+\t cache<WB|WC|UC> use which CPU cache policy? Write Back (WB), Write Combining (WC), or Uncacheable (UC)\n\
+\t subupdate<Y|N>  only update the changed cachelines of a page? (Y or N) (check PMBD_CACHELINE_SIZE) \n\
+\t timestat<Y|N>   enable the detailed timing statistics (/proc/pmbd/pmbdstat)? (Y or N) (This will cause significant performance slowdown) \n\
+\n\
+NOTE: \n\
+\t (1) Option rdlat/wrlat only specifies the minimum access times. Real access times can be higher.\n\
+\t (2) If rdsx/wrsx is specified, rdlat/wrlat/rdbw/wrbw are ignored. \n\
+\t (3) Option simmode1 applies the simulated specification to the PM space, rather than the whole device, which may have buffer.\n\
+\n\
+WARNING: \n\
+\t (1) When using simmode1 to simulate slow PM space, soft-lockup warnings may appear. Use the \"nosoftlockup\" boot option to disable them.\n\
+\t (2) Enabling timestat may cause performance degradation.\n\
+\t (3) FUA is supported in PMBD, but if a buffer is used (for PTE-based protection), enabling FUA lowers performance due to double writes.\n\
+\t (4) No support for changing CPU cache related PTE attributes for VM-based PMBD (RCU stalls).\n\
+\n\
+PROC ENTRIES: \n\
+\t /proc/pmbd/pmbdcfg     config info about the PMBD devices\n\
+\t /proc/pmbd/pmbdstat    statistics of the PMBD devices (if timestat is enabled)\n\
+\n\
+EXAMPLE: \n\
+\t Assuming a 16GB PM space with physical memory addresses from 8GB to 24GB:\n\
+\t (1) Basic (Ramdisk): \n\
+\t     $ sudo modprobe pmbd mode=\"pmbd16;hmo8;hms16;\"\n\n\
+\t (2) Protected (with private mapping): \n\
+\t     $ sudo modprobe pmbd mode=\"pmbd16;hmo8;hms16;pmapY;\"\n\n\
+\t (3) Protected and synced (with private mapping, non-temp store): \n\
+\t     $ sudo modprobe pmbd mode=\"pmbd16;hmo8;hms16;pmapY;ntsY;\"\n\n\
+\t (4) *** RECOMMENDED CONFIG *** \n\
+\t     Protected, synced, and ordered (with private mapping, non-temp store, write barrier): \n\
+\t     $ sudo modprobe pmbd mode=\"pmbd16;hmo8;hms16;pmapY;ntsY;wbY;\"\n\
+\n"
+
+/* functions */
+static inline void pmbd_set_pages_ro(PMBD_DEVICE_T* pmbd, void* addr, uint64_t bytes, unsigned on_access);
+static inline void pmbd_set_pages_rw(PMBD_DEVICE_T* pmbd, void* addr, uint64_t bytes, unsigned on_access);
+static inline void pmbd_clflush_range(PMBD_DEVICE_T* pmbd, void* dst, size_t bytes);
+static inline int pmbd_verify_wr_pages(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes);
+static int pmbd_checksum_on_write(PMBD_DEVICE_T* pmbd, void* vaddr, size_t bytes);
+static int pmbd_checksum_on_read(PMBD_DEVICE_T* pmbd, void* vaddr, size_t bytes);
+
+static inline int put_ulong(unsigned long arg, unsigned long val)
+{
+	return put_user(val, (unsigned long __user *)arg);
+}
+static inline int put_u64(unsigned long arg, u64 val)
+{
+	return put_user(val, (u64 __user *)arg);
+}
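+
+/*
+ * Usage sketch (illustrative): these wrappers suit ioctl handlers that
+ * return 64-bit values to user space, e.g.:
+ *
+ *	case BLKGETSIZE64:
+ *		return put_u64(arg, PMBD_MEM_TOTAL_BYTES(pmbd));
+ */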
+
+static inline void mfence(void)
+{
+	/* full fence; the "memory" clobber keeps the compiler from reordering */
+	asm volatile("mfence" ::: "memory");
+}
+
+static inline void sfence(void)
+{
+	/* store fence; orders (non-temporal) stores ahead of later stores */
+	asm volatile("sfence" ::: "memory");
+}
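+
+/*
+ * Pairing sketch (illustrative; pmbd_memcpy_nt is a hypothetical name):
+ * non-temporal stores bypass the CPU cache, so a store fence is issued
+ * afterwards to make the data durable before the write is acknowledged:
+ *
+ *	pmbd_memcpy_nt(dst, src, bytes);	// MOVNTQ-based copy
+ *	sfence();				// drain the NT store buffers
+ */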
+
+#endif
+/* THE END */





