[PATCH] PMFS: Remove experimental Persistent Memory Block Driver
Ross Zwisler
ross.zwisler at linux.intel.com
Fri Aug 23 18:13:42 EDT 2013
This reverts commits
3c292bdfbc8902aac20aca4d91c0826103f48ee8 and
0ef73601fc4ad26610003f8a8f2c9fa345bcf246.
Signed-off-by: Ross Zwisler <ross.zwisler at linux.intel.com>
---
Documentation/blockdev/00-INDEX | 2 -
Documentation/blockdev/pmbd.txt | 185 --
drivers/block/Kconfig | 10 -
drivers/block/Makefile | 2 -
drivers/block/pmbd.c | 4539 ---------------------------------------
include/linux/pmbd.h | 509 -----
6 files changed, 5247 deletions(-)
delete mode 100644 Documentation/blockdev/pmbd.txt
delete mode 100644 drivers/block/pmbd.c
delete mode 100644 include/linux/pmbd.h
diff --git a/Documentation/blockdev/00-INDEX b/Documentation/blockdev/00-INDEX
index 2e8f5b2..c08df56 100644
--- a/Documentation/blockdev/00-INDEX
+++ b/Documentation/blockdev/00-INDEX
@@ -16,5 +16,3 @@ paride.txt
- information about the parallel port IDE subsystem.
ramdisk.txt
- short guide on how to set up and use the RAM disk.
-pmbd.txt
- - information about Persistent Memory Block Driver.
diff --git a/Documentation/blockdev/pmbd.txt b/Documentation/blockdev/pmbd.txt
deleted file mode 100644
index 244820f..0000000
--- a/Documentation/blockdev/pmbd.txt
+++ /dev/null
@@ -1,185 +0,0 @@
-===============================================================================
- INTEL PERSISTENT MEMORY BLOCK DRIVER (PMBD) v0.9
-===============================================================================
-
-This software implements a block device driver for persistent memory (PM).
-This module provides a block-based logical interface to manage PM that is
-physically attached to the system memory bus.
-
-The assumed architecture is as follows: both DRAM and PM DIMMs are directly
-attached to the host memory bus. The PM space is presented to the operating
-system as a contiguous range of physical memory address space at the high end.
-
-There are three major design considerations: (1) Data protection - Private
-mapping is used to prevent stray pointers (from kernel/driver bugs) from
-accidentally wiping out persistent PM data. (2) Data persistence -
-Non-temporal store and fence instructions are used to bypass the CPU cache
-and drain the processor store buffers, so writes reach the PM DIMMs without
-polluting the cache. (3) Write ordering - Write barriers are supported to
-ensure the correct ordering of writes.
-
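-For illustration, the persistence path in design point (2) can be sketched
-with user-space SSE2 intrinsics (a minimal analogy only; the driver itself
-uses a kernel-mode MOVNTDQ/SFENCE sequence, see drivers/block/pmbd.c):
-
-    #include <emmintrin.h>  /* SSE2: _mm_stream_si128(), _mm_sfence() */
-
-    /* Copy one 64-byte chunk with non-temporal stores, then fence.
-       Both pointers are assumed to be 16-byte aligned. */
-    static void nt_copy64(void *dst, const void *src)
-    {
-        __m128i a = _mm_load_si128((const __m128i *)src);
-        __m128i b = _mm_load_si128((const __m128i *)src + 1);
-        __m128i c = _mm_load_si128((const __m128i *)src + 2);
-        __m128i d = _mm_load_si128((const __m128i *)src + 3);
-        _mm_stream_si128((__m128i *)dst,     a);  /* bypasses the CPU cache */
-        _mm_stream_si128((__m128i *)dst + 1, b);
-        _mm_stream_si128((__m128i *)dst + 2, c);
-        _mm_stream_si128((__m128i *)dst + 3, d);
-        _mm_sfence();  /* drain the write-combining buffers */
-    }
-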
-This module also includes other (experimental) features, such as PM speed
-emulation, checksum for page integrity, partial page updates, write
-verification, etc. Please refer to the help page of the module.
-
-
-===============================================================================
- COMPILING AND INSTALLING THE PMBD DRIVER
-===============================================================================
-
-1. Compile the PMBD driver:
-
- $ make
-
-2. Install the PMBD driver:
-
- $ sudo make install
-
-3. Check available driver information:
-
- $ modinfo pmbd
-
-===============================================================================
- QUICK USER'S GUIDE OF THE PMBD DRIVER
-===============================================================================
-
-1. Modify /etc/grub.conf to set the physical memory address range that
-   is to be simulated as PM.
-
- Add the following to the boot option line:
-
- memmap=<PM_SIZE_GB>G$<DRAM_SIZE_GB>G numa=off
-
- NOTE:
-
- PM_SIZE_GB - the PM space size (in GBs)
- DRAM_SIZE_GB - the DRAM space size (in GBs)
-
- Example:
-
-   Assuming a total memory capacity of 24GB, if we want 16GB of PM and
-   8GB of DRAM, the option should be "memmap=16G$8G".
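-
-   With "memmap=16G$8G", the resulting physical address map looks like
-   this (illustrative):
-
-     0GB                 8GB                                  24GB
-      |------- DRAM ------|---------- reserved as PM ----------|
-        (OS-managed)         (hidden from the OS by memmap)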
-
-2. Reboot and check if the memory size is set as expected.
-
- $ sudo reboot; exit
- $ free
-
-3. Load the device driver module
-
- Load the driver module into the kernel with private mapping, non-temp store,
- and write barrier enabled (*** RECOMMENDED CONFIG ***):
-
-   $ modprobe pmbd mode="pmbd<PM_SIZE_GB>;hmo<DRAM_SIZE_GB>;hms<PM_SIZE_GB>; \
-                         pmapY;ntsY;wbY;"
-
- Check the kernel message output:
-
- $ dmesg
-
-   After loading the module, a block device (/dev/pma) should appear. From
-   this point on, it can be used like any other block device, e.g., with
-   fdisk, mkfs, and so on (see the example below).
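-
-   Example (device and mount-point names are illustrative):
-
-   $ sudo fdisk -l /dev/pma
-   $ sudo mkfs.ext4 /dev/pma
-   $ sudo mount /dev/pma /mnt/pm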
-
-4. Unload the device driver
-
- $ rmmod pmbd
-
-===============================================================================
- OTHER CONFIGURATION OPTIONS OF THE PERSISTENT MEMORY DEVICE DRIVER MODULE
-===============================================================================
-
-usage: $ modprobe pmbd mode="pmbd<#>;hmo<#>;hms<#>;[Option1];[Option2];.."
-
-GENERAL OPTIONS:
- pmbd<#,#..> set pmbd size (GBs)
- HM|VM use high memory (HM default) or vmalloc (VM)
- hmo<#> high memory starting offset (GB)
- hms<#> high memory size (GBs)
- pmap<Y|N> use private mapping (Y) or not (N default) - (note: must
- enable HM and wrprotN)
- nts<Y|N>       use non-temporal store (MOVNTDQ) and sfence to do memcpy (Y),
-                or regular memcpy (N default)
- wb<Y|N>        use write barrier (Y) or not (N default)
- fua<Y|N>       use WRITE_FUA (Y default) or not (N)
- ntl<Y|N>       use non-temporal load (MOVNTDQA) to do memcpy (Y), or
-                regular memcpy (N default) - this option forces the memory
-                type to write combining
-
-
-SIMULATION:
- simmode<#,#..> apply the simulated speeds to the whole device (0 default) or
-                to the PM space only (1)
- rdlat<#,#..> set read access latency (ns)
- wrlat<#,#..> set write access latency (ns)
- rdbw<#,#..> set read bandwidth (MB/sec) (if set 0, no emulation)
- wrbw<#,#..> set write bandwidth (MB/sec) (if set 0, no emulation)
- rdsx<#,#..> set the relative slowdown (x) for read
- wrsx<#,#..> set the relative slowdown (x) for write
- rdpause<#,.> set a pause (cycles per 4KB) for each read
- wrpause<#,.> set a pause (cycles per 4KB) for each write
- adj<#> set an adjustment to the system overhead (nanoseconds)
-
-WRITE PROTECTION:
- wrprot<Y|N> use write protection for PM pages? (Y or N)
- wpmode<#,#,..> write protection mode: change the PTE (0 default) or flip the
-                CR0/WP bit (1)
- clflush<Y|N> use clflush to flush CPU cache for each write to PM space?
- (Y or N)
- wrverify<Y|N> use write verification for PM pages? (Y or N)
- checksum<Y|N> use checksum to protect PM pages? (Y or N)
- bufsize<#,#,..> the buffer size (MBs) (0 - no buffer; otherwise at least 4MB)
- bufnum<#>      the number of buffers for a PMBD device (default 16; at least
-                1 if buffering is used, 0 - no buffer)
- bufstride<#>   the number of contiguous blocks (4KB) mapped into one buffer
-                (the bucket size for round-robin mapping; default 1024), e.g.
-                with two buffers, blocks 0-1023 map to buffer 0, blocks
-                1024-2047 to buffer 1, blocks 2048-3071 to buffer 0, and so on
- batch<#,#> the batch size (num of pages) for flushing PMBD buffer (1 means
- no batching)
-
-MISC:
- mgb<Y|N> mergeable? (Y or N)
- lock<Y|N> lock the on-access page to serialize accesses? (Y or N)
- cache<WB|WC|UC> use which CPU cache policy? Write Back (WB), Write Combined
-                (WC), or Uncacheable (UC)
- subupdate<Y|N> only update the changed cachelines of a page? (Y or N) (check
- PMBD_CACHELINE_SIZE)
- timestat<Y|N> enable the detailed timing statistics (/proc/pmbd/pmbdstat)?
- This will cause significant performance slowdown (Y or N)
-
-NOTE:
- (1) Option rdlat/wrlat only specifies the minimum access times. Real access
- times can be higher.
- (2) If rdsx/wrsx is specified, rdlat/wrlat/rdbw/wrbw are ignored.
- (3) Option simmode1 applies the simulated specification to the PM space,
-     rather than to the whole device (which may include a buffer).
-
-WARNING:
- (1) When using simmode1 to simulate slow-speed PM space, soft-lockup warnings
-     may appear. Use the "nosoftlockup" boot option to disable them.
- (2) Enabling timestat may cause performance degradation.
- (3) FUA is supported, but if the buffer is used (for page-table-based
-     protection), enabling FUA lowers performance due to double writes.
- (4) No support for changing CPU-cache-related PTE attributes for VM-based
-     PMBD (RCU stalls).
-
-PROC ENTRIES:
- /proc/pmbd/pmbdcfg: config info about the PMBD devices
- /proc/pmbd/pmbdstat: statistics of the PMBD devices (if timestat is enabled)
-
-EXAMPLE:
- Assuming a 16GB PM space with physical memory addresses from 8GB to 24GB:
- (1) Basic (Ramdisk):
- $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;"
-
- (2) Protected (with private mapping):
- $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;pmapY;"
-
- (3) Protected and synced (with private mapping, non-temp store):
- $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;pmapY;ntsY;"
-
- (4) *** RECOMMENDED CONFIGURATION ***
- Protected, synced, and ordered (with private mapping, nt-store, write
- barrier):
- $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;pmapY;ntsY;wbY;"
-
-
-
-
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 47dbb6d..b81ddfe 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -540,15 +540,5 @@ config BLK_DEV_RSXX
To compile this driver as a module, choose M here: the
module will be called rsxx.
-
-config BLK_DEV_PMBD
- tristate "Persistent Memory Block Driver"
- depends on m
-
- default n
- help
-	  Say M here if you want to include the Persistent Memory Block Driver.
-
- If unsure, say N.
endif # BLK_DEV
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 6ac1cbe..a3b4023 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -42,6 +42,4 @@ obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX) += mtip32xx/
obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
-obj-$(CONFIG_BLK_DEV_PMBD) += pmbd.o
-
swim_mod-y := swim.o swim_asm.o
diff --git a/drivers/block/pmbd.c b/drivers/block/pmbd.c
deleted file mode 100644
index 8cc9b5d..0000000
--- a/drivers/block/pmbd.c
+++ /dev/null
@@ -1,4539 +0,0 @@
-/*
- * Intel Persistent Memory Block Driver
- * Copyright (c) <2011-2013>, Intel Corporation.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
- * more details.
- *
- * You should have received a copy of the GNU General Public License along with
- * this program; if not, write to the Free Software Foundation, Inc.,
- * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
- */
-
-/*
- * Intel Persistent Memory Block Driver (v0.9)
- *
- * Parts derived with changes from drivers/block/brd.c, lib/crc32.c, and
- * arch/x86/lib/mmx_32.c
- *
- * Intel Corporation <linux-pmbd at intel.com>
- * 03/24/2011
- *
- * Authors
- * 2013 - Released the open-source version 0.9 (fchen)
- * 2012 - Ported to Linux 3.2.1 (fchen)
- * 2011 - Feng Chen (Intel) implemented version 1 of PMBD for Linux 2.6.34.
- */
-
-
-/*
- *******************************************************************************
- * Persistent Memory Block Device Driver
- *
- * USAGE:
- * % sudo modprobe pmbd mode="pmbd<#>;hmo<#>;hms<#>;[OPTION1];[OPTION2];.."
- *
- * GENERAL OPTIONS:
- * - pmbd<#,..>: a sequence of integer numbers setting PMBD device sizes (in
- * units of GBs). For example, mode="pmbd4,1" means creating a
- * 4GB and a 1GB PMBD device (/dev/pma and /dev/pmb).
- *
- * - HM|VM: choose between two types of PMBD devices
- * - VM: vmalloc() based
- * - HM: HIGH_MEM based (default)
- * - In /boot/grub/grub.conf, add "mem=<n>G memmap=<m>G$<n>G"
- * to reserve the high m GBs for PM, starting from offset n
- * GBs in physical memory
- *
- * - hmo<#>: if HM is set, setting the starting physical mem address
- * (in units of GBs).
- *
- * - hms<#>: if HM is set, setting the remapping memory size (in GBs)
- *
- * - pmap<Y|N>      set private mapping (Y) or not (N default): use
- *                  pmap_atomic_pfn() to dynamically map/unmap the
- *                  to-be-accessed PM page for protection purposes.
- *                  This option requires HM. In the Linux boot
- *                  options, the "mem" option must be removed.
- *
- * - nts<Y|N> set non-temporal store/sfence (Y) or not (N default).
- *
- * - wb<Y|N>: use write barrier (Y) or not (N default)
- *
- * - fua<Y|N> use WRITE_FUA (Y default) or not (N)
- * FUA with PT-based protection (with buffer) incurs
- * double-write overhead
- *
- * SIMULATION OPTIONS:
- *
- * - simmode<#,#..> set the simulation mode for each PMBD device
- * - 0 for simulating the whole device
- * - 1 for simulating the PM space only
- *                  Note that simulating the PM space may trigger soft-lockup
- *                  warnings. To disable them, add "nosoftlockup" to the
- *                  boot options.
- *
- * - rdlat<#,#..>:  a sequence of integer numbers setting emulated read
- *                  latencies (in units of nanoseconds) for reading each
- *                  sector. Each number corresponds to a device. The default
- *                  value is 0.
- *
- * - wrlat<#,#..>: set emulated write access latencies (see rdlat)
- *
- * - rdbw<#,#..>: a sequence of integer numbers setting emulated read
- * bandwidth (in units of MB/sec) for reading each sector.
- * Each number corresponds to a device. Default value is 0;
- *
- * - wrbw<#,#..>: set emulated write bandwidth (see rdbw)
- *
- * - rdsx<#,#..>: set the slowdown ratio (x) for reads as compared to DRAM
- *
- * - wrsx<#,#..>: set the slowdown ratio (x) for writes as compared to DRAM
- *
- * - rdpause<#,#..>: set the injected delay (cycles per page) for reads (not
- *                  emulation; latency is simply injected for each page
- *                  read)
- *
- * - wrpause<#,#..>: set the injected delay (cycles per page) for writes
- *                  (not emulation; latency is simply injected for each
- *                  page written)
- *
- * - adj<#>:        compensate for the estimated system overhead. The default
- *                  is 4us; however, this can vary from system to system.
- *
- * WRITE PROTECTION:
- *
- * - wrprot<Y|N>: provide write protection on PM space by setting page
- * read-only (default: N).
- * This option is incompatible with pmap.
- *
- * - wpmode<#,#,..> write protection mode: use the PTE change (0 default) or
- * switch CR0/WP bit (1)
- *
- * - wrverify<Y|N>: read out the data for verification after writing into PM
- * space
- *
- * - clflush<Y|N>: flush CPU cache or not (default: N)
- *
- * - checksum<Y|N>: use checksum to provide further protection from data
- * corruption (default: N)
- *
- * - lock<Y|N>: lock the on-access PM page to serialize accesses
- * (default: Y)
- *
- * - bufsize<#,#,#.#...> -- the buffer size in MBs (for speeding up write
- * protection) 0 means no buffer, minimum size is 16 MBs
- *
- * - bufnum<#> the number of buffers for a pmbd device (16 buffers, at
- * least 1 if using buffering, 0 will disable buffer mode)
- *
- * - bufstride<#> the number of contiguous blocks(4KB) mapped into one
- * buffer (the bucket size for round-robin mapping)
- * (1024 in default)
- *
- * - batch<#,#> the batch size (num of pages) for flushing PMBD buffer (1
- * means no batching)
- *
- * MISC OPTIONS:
- *
- * - subupdate<Y|N> only update changed cachelines of a page (check
- * PMBD_CACHELINE_SIZE, default: N)
- *
- * - mgb<Y|N>: setting mergeable or not (default: Y)
- *
- * - cache<WB|WC|UM|UC>:
- *                  WB -- write back (reads and writes are cached)
- *                  WC -- write combining (writes are buffered and combined,
- *                        not cached)
- *                  UM -- uncacheable minus (UC-)
- *                  UC -- strongly uncacheable
- *                  No support for changing CPU cache flags
- *                  with vmalloc() based PMBD
- *
- * - timestat<Y|N> enable the detailed timing statistics (/proc/pmbd/pmbdstat) or
- * not (default: N). This will cause significant performance loss.
- *
- * EXAMPLE:
- * mode="pmbd2,1;rdlat100,2000;wrlat500,4000;rdbw100,100;wrbw100,100;HM;hmo4;hms3;
- * mgbY;flushY;cacheWB;wrprotY;wrverifyY;checksumY;lockY;rammode0,1;bufsize16,0;
- * subupdateY;"
- *
- * Explanation: Create two PMBD devices, /dev/pma (2GB) and /dev/pmb (1GB).
- * Insert 100ns and 500ns for reading and writing a sector to /dev/pma,
- * respectively. Insert 2000ns and 4000ns for reading and writing a sector
- * to /dev/pmb. Make the read/write bandwidth for both devices 100MB/sec.
- * No system overhead adjustment is applied. We use 3GB high memory for the
- * PMBD devices, starting from 4GB physical memory address. Make it
- * mergeable, use writeback and flush CPU cache for the PM space, use write
- * protection for PM space by setting PM space read-only, verify each
- * write by reading out written data, use checksum to protect PM space, use
- * spinlock to protect from corruption caused by concurrent accesses, the
- * first device is applied without write protection, the second device is
- * applied with write protection, and use sub-page updates.
- *
- * NOTE:
- * - We can create no more than 26 devices, 4 partitions each.
- *
- * FIXME:
- * (1) We use an unoccupied major device num (261) temporarily
- *******************************************************************************
- */
-
-#include <linux/init.h>
-#include <linux/version.h>
-#include <linux/module.h>
-#include <linux/moduleparam.h>
-#include <linux/major.h>
-#include <linux/blkdev.h>
-#include <linux/bio.h>
-#include <linux/fs.h>
-#include <linux/slab.h>
-#include <asm/uaccess.h>
-#include <linux/time.h>
-#include <asm/timer.h>
-#include <linux/cpufreq.h>
-#include <linux/crc32.h>
-#include <linux/string.h>
-#include <linux/ctype.h>
-#include <linux/kthread.h>
-#include <linux/sort.h>
-#include <linux/timex.h>
-#include <linux/proc_fs.h>
-#include <asm/tlbflush.h>
-#include <asm/i387.h>
-#include <asm/asm.h>
-#include <linux/pmbd.h>
-#include <linux/delay.h>
-
-/* device configs */
-static int max_part = 4; /* maximum num of partitions */
-static int part_shift = 0; /* partition shift */
-static LIST_HEAD(pmbd_devices); /* device list */
-static DEFINE_MUTEX(pmbd_devices_mutex); /* device mutex */
-
-/* /proc file system entry */
-static struct proc_dir_entry* proc_pmbd = NULL;
-static struct proc_dir_entry* proc_pmbdstat = NULL;
-static struct proc_dir_entry* proc_pmbdcfg = NULL;
-
-/* pmbd device default configuration */
-static unsigned g_pmbd_type = PMBD_CONFIG_HIGHMEM; /* vmalloc(PMBD_CONFIG_VMALLOC) or reserve highmem (PMBD_CONFIG_HIGHMEM default) */
-static unsigned g_pmbd_pmap = FALSE; /* use pmap_atomic() to map/unmap space on demand */
-static unsigned g_pmbd_nts = FALSE; /* use non-temporal store (movntq) */
-static unsigned g_pmbd_wb = FALSE; /* use write barrier */
-static unsigned g_pmbd_fua = TRUE; /* use fua support */
-static unsigned g_pmbd_mergeable = TRUE; /* mergeable or not */
-static unsigned g_pmbd_cpu_cache_clflush= FALSE; /* flush CPU cache or not*/
-static unsigned g_pmbd_wr_protect = FALSE; /* flip PTE R/W bits for write protection */
-static unsigned g_pmbd_wr_verify = FALSE; /* read out written data for verification */
-static unsigned g_pmbd_checksum = FALSE; /* do checksum on PM data */
-static unsigned g_pmbd_lock = TRUE; /* do spinlock on accessing a PM page */
-static unsigned g_pmbd_subpage_update = FALSE; /* do subpage update (only write changed content) */
-static unsigned g_pmbd_timestat = FALSE; /* do a detailed timestamp breakdown statistics */
-static unsigned g_pmbd_ntl = FALSE; /* use non-temporal load (movntdqa)*/
-static unsigned long g_pmbd_cpu_cache_flag = _PAGE_CACHE_WB; /* CPU cache flag (default - write back) */
-
-/* high memory configs */
-static unsigned long g_highmem_size = 0; /* size of the reserved physical mem space (bytes) */
-static phys_addr_t g_highmem_phys_addr = 0; /* beginning of the reserved phy mem space (bytes)*/
-static void* g_highmem_virt_addr = NULL; /* beginning of the reserve HIGH_MEM space */
-static void* g_highmem_curr_addr = NULL; /* beginning of the available HIGH_MEM space for alloc*/
-
-/* module parameters */
-static unsigned g_pmbd_nr = 0; /* num of PMBD devices */
-static unsigned long long g_pmbd_size[PMBD_MAX_NUM_DEVICES]; /* PMBD device sizes in units of GBs */
-static unsigned long long g_pmbd_rdlat[PMBD_MAX_NUM_DEVICES]; /* access latency for read (nanosecs) */
-static unsigned long long g_pmbd_wrlat[PMBD_MAX_NUM_DEVICES];	/* access latency for write (nanosecs) */
-static unsigned long long g_pmbd_rdbw[PMBD_MAX_NUM_DEVICES]; /* bandwidth for read (MB/sec) */
-static unsigned long long g_pmbd_wrbw[PMBD_MAX_NUM_DEVICES]; /* bandwidth for write (MB/sec)*/
-static unsigned long long g_pmbd_rdsx[PMBD_MAX_NUM_DEVICES]; /* read slowdown (x) */
-static unsigned long long g_pmbd_wrsx[PMBD_MAX_NUM_DEVICES]; /* write slowdown (x)*/
-static unsigned long long g_pmbd_rdpause[PMBD_MAX_NUM_DEVICES]; /* read pause (cycles per page) */
-static unsigned long long g_pmbd_wrpause[PMBD_MAX_NUM_DEVICES]; /* write pause (cycles per page)*/
-static unsigned long long g_pmbd_simmode[PMBD_MAX_NUM_DEVICES]; /* simulating PM space (1) or the whole device (0 default) */
-static unsigned long long g_pmbd_adjust_ns = 0; /* nanosec of adjustment to offset system overhead */
-static unsigned long long g_pmbd_rammode[PMBD_MAX_NUM_DEVICES]; /* do write optimization or not */
-static unsigned long long g_pmbd_bufsize[PMBD_MAX_NUM_DEVICES]; /* the buffer size (in MBs) */
-static unsigned long long g_pmbd_buffer_batch_size[PMBD_MAX_NUM_DEVICES]; /* the batch size (num of pages) for flushing PMBD buffer */
-static unsigned long long g_pmbd_wpmode[PMBD_MAX_NUM_DEVICES]; /* write protection mode: PTE change (0 default) and CR0 Switch (1)*/
-
-static unsigned long long g_pmbd_num_buffers = 0; /* number of individual buffers */
-static unsigned long long g_pmbd_buffer_stride = 1024; /* number of contiguous PBNs belonging to the same buffer */
-
-/* definition of functions */
-static inline uint64_t cycle_to_ns(uint64_t cycle);
-static inline void sync_slowdown_cycles(uint64_t cycles);
-static uint64_t emul_start(PMBD_DEVICE_T* pmbd, int num_sectors, int rw);
-static uint64_t emul_end(PMBD_DEVICE_T* pmbd, int num_sectors, int rw, uint64_t start);
-
-/*
- * *************************************************************************
- * parse module parameters functions
- * *************************************************************************
- */
-static char *mode = "";
-module_param(mode, charp, 0444);
-MODULE_PARM_DESC(mode, USAGE_INFO);
-
-/* print pmbd configuration info */
-static void pmbd_print_conf(void)
-{
- int i;
-#ifndef CONFIG_X86
- printk(KERN_INFO "pmbd: running on a non-x86 platform, check ioremap()...\n");
-#endif
- printk(KERN_INFO "pmbd: cacheline_size=%d\n", PMBD_CACHELINE_SIZE);
- printk(KERN_INFO "pmbd: PMBD_SECTOR_SIZE=%lu, PMBD_PAGE_SIZE=%lu\n", PMBD_SECTOR_SIZE, PMBD_PAGE_SIZE);
- printk(KERN_INFO "pmbd: g_pmbd_type = %s\n", PMBD_USE_VMALLOC()? "VMALLOC" : "HIGH_MEM");
- printk(KERN_INFO "pmbd: g_pmbd_mergeable = %s\n", PMBD_IS_MERGEABLE()? "YES" : "NO");
- printk(KERN_INFO "pmbd: g_pmbd_cpu_cache_clflush = %s\n", PMBD_USE_CLFLUSH()? "YES" : "NO");
- printk(KERN_INFO "pmbd: g_pmbd_cpu_cache_flag = %s\n", PMBD_CPU_CACHE_FLAG());
- printk(KERN_INFO "pmbd: g_pmbd_wr_protect = %s\n", PMBD_USE_WRITE_PROTECTION()? "YES" : "NO");
- printk(KERN_INFO "pmbd: g_pmbd_wr_verify = %s\n", PMBD_USE_WRITE_VERIFICATION()? "YES" : "NO");
- printk(KERN_INFO "pmbd: g_pmbd_checksum = %s\n", PMBD_USE_CHECKSUM()? "YES" : "NO");
- printk(KERN_INFO "pmbd: g_pmbd_lock = %s\n", PMBD_USE_LOCK()? "YES" : "NO");
- printk(KERN_INFO "pmbd: g_pmbd_subpage_update = %s\n", PMBD_USE_SUBPAGE_UPDATE()? "YES" : "NO");
- printk(KERN_INFO "pmbd: g_pmbd_adjust_ns = %llu ns\n", g_pmbd_adjust_ns);
- printk(KERN_INFO "pmbd: g_pmbd_num_buffers = %llu\n", g_pmbd_num_buffers);
- printk(KERN_INFO "pmbd: g_pmbd_buffer_stride = %llu blocks\n", g_pmbd_buffer_stride);
- printk(KERN_INFO "pmbd: g_pmbd_timestat = %u \n", g_pmbd_timestat);
- printk(KERN_INFO "pmbd: HIGHMEM offset [%llu] size [%lu] Private Mapping (%s) (%s) (%s) Write Barrier(%s) FUA(%s)\n",
- g_highmem_phys_addr, g_highmem_size, (PMBD_USE_PMAP()? "Enabled" : "Disabled"),
- (PMBD_USE_NTS()? "Non-Temporal Store":"Temporal Store"),
- (PMBD_USE_NTL()? "Non-Temporal Load":"Temporal Load"),
- (PMBD_USE_WB()? "Enabled": "Disabled"),
- (PMBD_USE_FUA()? "Enabled":"Disabled"));
-
- /* for each pmbd device */
- for (i = 0; i < g_pmbd_nr; i ++) {
- printk(KERN_INFO "pmbd: /dev/pm%c (%d)[%llu GB] read[%llu ns %llu MB/sec (%llux) (pause %llu cyc/pg)] write[%llu ns %llu MB/sec (%llux) (pause %llu cyc/pg)] [%s] [Buf: %llu MBs, batch %llu pages] [%s] [%s]\n",
- 'a'+i, i, g_pmbd_size[i], g_pmbd_rdlat[i], g_pmbd_rdbw[i], g_pmbd_rdsx[i], g_pmbd_rdpause[i], g_pmbd_wrlat[i], g_pmbd_wrbw[i], g_pmbd_wrsx[i], g_pmbd_wrpause[i],\
- (g_pmbd_rammode[i] ? "RAM" : "PMBD"), g_pmbd_bufsize[i], g_pmbd_buffer_batch_size[i], \
- (g_pmbd_simmode[i] ? "Simulating PM only" : "Simulating the whole device"), \
- (PMBD_USE_PMAP() ? "PMAP" : (g_pmbd_wpmode[i] ? "WP-CR0/WP" : "WP-PTE")));
-
- if (g_pmbd_simmode[i] > 0){
- printk(KERN_INFO "pmbd: ********************************* WARNING **************************************\n");
- printk(KERN_INFO "pmbd: Using simmode%llu to simulate a slowed-down PM space may cause system soft lockup.\n", g_pmbd_simmode[i]);
- printk(KERN_INFO "pmbd: To disable the warning message, please add \"nosoftlockup\" in the boot option. \n");
- printk(KERN_INFO "pmbd: ********************************************************************************\n");
- }
- }
-
- printk(KERN_INFO "pmbd: ****************************** WARNING ***********************************\n");
- printk(KERN_INFO "pmbd: 1. Checksum mismatch can be detected but not handled \n");
- printk(KERN_INFO "pmbd: 2. PMAP is incompatible with \"wrprotY\"\n");
- printk(KERN_INFO "pmbd: **************************************************************************\n");
-
- return;
-}
-
-/*
- * Parse a string with config for multiple devices (e.g. mode="pmbd4,1,3;")
- * @mode: input option string
- * @tag: the tag being looked for (e.g. pmbd)
- * @data: output in an array
- */
-static int _pmbd_parse_multi(char* mode, char* tag, unsigned long long data[])
-{
- int nr = 0;
- if (strlen(mode)) {
- char* head = mode;
- char* tail = mode;
- char* end = mode + strlen(mode);
- char tmp[128];
-
- if ((head = strstr(mode, tag))) {
- head = head + strlen(tag);
- tail = head;
- while(head < end){
- int len = 0;
-
- /* locate the position of the first non-number char */
- for(tail = head; IS_DIGIT(*tail) && tail < end; tail++) {};
-
- /* pick up the numbers */
- len = tail - head;
- if(len > 0) {
- nr ++;
- if (nr > PMBD_MAX_NUM_DEVICES) {
- printk(KERN_ERR "pmbd: %s(%d) - too many (%d) device config for %s\n",
- __FUNCTION__, __LINE__, nr, tag);
- return -1;
- }
- strncpy(tmp, head, len); tmp[len] = '\0';
- data[nr - 1] = simple_strtoull(tmp, NULL, 0);
- }
-
- /* check the next sequence of numbers */
- for(; !IS_DIGIT(*tail) && tail < end; tail++) {
- /* if we meet the first alpha char or space, clause ends */
- if(IS_ALPHA(*tail) || IS_SPACE(*tail))
- goto done;
- };
-
- /* move head to the next sequence of numbers */
- head = tail;
- }
- }
- }
-done:
- return nr;
-}
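-
-/*
- * Example (illustrative): for mode="pmbd4,1;" and tag "pmbd", the loop above
- * extracts "4" and then "1", so data[0] = 4, data[1] = 1, and 2 is returned.
- */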
-
-/*
- * Parse a string with config for all devices (e.g. mode="adj1000")
- * @mode: input option string
- * @tag: the tag being looked for (e.g. pmbd)
- * @data: output
- */
-static int _pmbd_parse_single(char* mode, char* tag, unsigned long long* data)
-{
- if (strlen(mode)) {
- char* head = mode;
- char* tail = mode;
- char tmp[128];
-
- if (strstr(mode, tag)) {
- head = strstr(mode, tag) + strlen(tag);
- for(tail=head; IS_DIGIT(*tail); tail++) {};
- if(tail == head) {
- return -1;
- } else {
- int len = tail - head;
- strncpy(tmp, head, len); tmp[len] = '\0';
- *data = simple_strtoull(tmp, NULL, 0);
- }
- }
- }
- return 0;
-}
-
-static void load_default_conf(void)
-{
- int i = 0;
- for (i = 0; i < PMBD_MAX_NUM_DEVICES; i ++)
- g_pmbd_buffer_batch_size[i] = PMBD_BUFFER_BATCH_SIZE_DEFAULT;
-}
-
-/* parse the module parameters (mode) */
-static void pmbd_parse_conf(void)
-{
- int i = 0;
- static unsigned enforce_cache_wc = FALSE;
-
- load_default_conf();
-
- if (strlen(mode)) {
- unsigned long long data = 0;
-
- /* check pmbd size/usable */
- if (strstr(mode, "pmbd")) {
- if( (g_pmbd_nr = _pmbd_parse_multi(mode, "pmbd", g_pmbd_size)) <= 0)
- goto fail;
- } else {
- printk(KERN_ERR "pmbd: no pmbd size set\n");
- goto fail;
- }
-
- /* rdlat/wrlat (emulated read/write latency) in nanosec */
- if (strstr(mode, "rdlat"))
- if (_pmbd_parse_multi(mode, "rdlat", g_pmbd_rdlat) < 0)
- goto fail;
- if (strstr(mode, "wrlat"))
- if (_pmbd_parse_multi(mode, "wrlat", g_pmbd_wrlat) < 0)
- goto fail;
-
- /* rdbw/wrbw (emulated read/write bandwidth) in MB/sec*/
- if (strstr(mode, "rdbw"))
- if (_pmbd_parse_multi(mode, "rdbw", g_pmbd_rdbw) < 0)
- goto fail;
- if (strstr(mode, "wrbw"))
- if (_pmbd_parse_multi(mode, "wrbw", g_pmbd_wrbw) < 0)
- goto fail;
-
- /* rdsx/wrsx (emulated read/write slowdown X) */
- if (strstr(mode, "rdsx"))
- if (_pmbd_parse_multi(mode, "rdsx", g_pmbd_rdsx) < 0)
- goto fail;
- if (strstr(mode, "wrsx"))
- if (_pmbd_parse_multi(mode, "wrsx", g_pmbd_wrsx) < 0)
- goto fail;
-
-		/* rdpause/wrpause (injected read/write delays) */
- if (strstr(mode, "rdpause"))
- if (_pmbd_parse_multi(mode, "rdpause", g_pmbd_rdpause) < 0)
- goto fail;
- if (strstr(mode, "wrpause"))
- if (_pmbd_parse_multi(mode, "wrpause", g_pmbd_wrpause) < 0)
- goto fail;
-
-		/* rammode has been removed */
-		if (strstr(mode, "rammode")){
-			printk(KERN_ERR "pmbd: rammode removed\n");
-			goto fail;
-		}
-
- if (strstr(mode, "bufsize")){
- if (_pmbd_parse_multi(mode, "bufsize", g_pmbd_bufsize) < 0)
- goto fail;
- for (i = 0; i < PMBD_MAX_NUM_DEVICES; i ++) {
- if (g_pmbd_bufsize[i] > 0 && g_pmbd_bufsize[i] < PMBD_BUFFER_MIN_BUFSIZE){
- printk(KERN_ERR "pmbd: bufsize cannot be smaller than %d MBs. Setting 0 to disable PMBD buffer.\n", PMBD_BUFFER_MIN_BUFSIZE);
- goto fail;
- }
- }
- }
-
- /* numbuf and bufstride*/
- if (strstr(mode, "bufnum")) {
- if(_pmbd_parse_single(mode, "bufnum", &data) < 0) {
- printk(KERN_ERR "pmbd: incorrect bufnum (must be at least 1)\n");
- goto fail;
- } else {
- g_pmbd_num_buffers = data;
- }
- }
- if (strstr(mode, "bufstride")) {
- if(_pmbd_parse_single(mode, "bufstride", &data) < 0) {
- printk(KERN_ERR "pmbd: incorrect bufstride (must be at least 1)\n");
- goto fail;
- } else {
- g_pmbd_buffer_stride = data;
- }
- }
-
- /* check the nanoseconds of overhead to compensate */
- if (strstr(mode, "adj")) {
- if(_pmbd_parse_single(mode, "adj", &data) < 0) {
- printk(KERN_ERR "pmbd: incorrect adj\n");
- goto fail;
- } else {
- g_pmbd_adjust_ns = data;
- }
- }
-
- /* check PMBD device type */
- if ((strstr(mode, "VM"))) {
- g_pmbd_type = PMBD_CONFIG_VMALLOC;
- } else if ((strstr(mode, "HM"))) {
- g_pmbd_type = PMBD_CONFIG_HIGHMEM;
- }
-
- /* use pmap*/
- if ((strstr(mode, "pmapY"))) {
- g_pmbd_pmap = TRUE;
- } else if ((strstr(mode, "pmapN"))) {
- g_pmbd_pmap = FALSE;
- }
- if ((strstr(mode, "PMAP"))){
- printk("WARNING: !!! pmbd: PMAP is not supported any more (use pmapY) !!!\n");
- goto fail;
- }
-
- /* use nts*/
- if ((strstr(mode, "ntsY"))) {
- g_pmbd_nts = TRUE;
- } else if ((strstr(mode, "ntsN"))) {
- g_pmbd_nts = FALSE;
- }
- if ((strstr(mode, "NTS"))){
- printk("WARNING: !!! pmbd: NTS is not supported any more (use ntsY) !!!\n");
- goto fail;
- }
-
- /* use ntl*/
- if ((strstr(mode, "ntlY"))) {
- g_pmbd_ntl = TRUE;
- enforce_cache_wc = TRUE;
- } else if ((strstr(mode, "ntlN"))) {
- g_pmbd_ntl = FALSE;
- }
-
- /* timestat */
- if ((strstr(mode, "timestatY"))) {
- g_pmbd_timestat = TRUE;
- } else if ((strstr(mode, "timestatN"))) {
- g_pmbd_timestat = FALSE;
- }
-
-
- /* write barrier */
- if ((strstr(mode, "wbY"))) {
- g_pmbd_wb = TRUE;
- } else if ((strstr(mode, "wbN"))) {
- g_pmbd_wb = FALSE;
- }
-
-		/* FUA */
- if ((strstr(mode, "fuaY"))) {
- g_pmbd_fua = TRUE;
- } else if ((strstr(mode, "fuaN"))) {
- g_pmbd_fua = FALSE;
- }
-
-
- /* check if HIGH_MEM PMBD is configured */
- if (PMBD_USE_HIGHMEM()) {
- if (strstr(mode, "hmo") && strstr(mode, "hms")) {
- /* parse reserved HIGH_MEM offset */
- if(_pmbd_parse_single(mode, "hmo", &data) < 0){
- printk(KERN_ERR "pmbd: incorrect hmo\n");
- g_highmem_phys_addr = 0;
- goto fail;
- } else {
- g_highmem_phys_addr = data * 1024 * 1024 * 1024;
- }
-
- /* parse reserved HIGH_MEM size */
- if(_pmbd_parse_single(mode, "hms", &data) < 0 || data == 0){
- printk(KERN_ERR "pmbd: incorrect hms\n");
- g_highmem_size = 0;
- goto fail;
- } else {
- g_highmem_size = data * 1024 * 1024 * 1024;
- }
- } else {
-			printk(KERN_ERR "pmbd: hmo or hms not set\n");
- goto fail;
- }
-
-
- }
-
-
- /* check if mergeable */
- if((strstr(mode,"mgbY")))
- g_pmbd_mergeable = TRUE;
- else if((strstr(mode,"mgbN")))
- g_pmbd_mergeable = FALSE;
-
- /* CPU cache flushing */
- if((strstr(mode,"clflushY")))
- g_pmbd_cpu_cache_clflush = TRUE;
- else if((strstr(mode,"clflushN")))
- g_pmbd_cpu_cache_clflush = FALSE;
-
- /* CPU cache setting */
- if((strstr(mode,"cacheWB"))) /* cache write back */
- g_pmbd_cpu_cache_flag = _PAGE_CACHE_WB;
-		else if((strstr(mode,"cacheWC")))	/* write combining */
-			g_pmbd_cpu_cache_flag = _PAGE_CACHE_WC;
-		else if((strstr(mode,"cacheUM")))	/* uncacheable minus (UC-) */
-			g_pmbd_cpu_cache_flag = _PAGE_CACHE_UC_MINUS;
-		else if((strstr(mode,"cacheUC")))	/* strongly uncacheable */
-			g_pmbd_cpu_cache_flag = _PAGE_CACHE_UC;
-
-
-		/* write protection */
- if((strstr(mode,"wrprotY")))
- g_pmbd_wr_protect = TRUE;
- else if((strstr(mode,"wrprotN")))
- g_pmbd_wr_protect = FALSE;
-
-		/* write verification */
- if((strstr(mode,"wrverifyY")))
- g_pmbd_wr_verify = TRUE;
- else if((strstr(mode,"wrverifyN")))
- g_pmbd_wr_verify = FALSE;
-
- /* checksum */
- if((strstr(mode,"checksumY")))
- g_pmbd_checksum = TRUE;
- else if((strstr(mode,"checksumN")))
- g_pmbd_checksum = FALSE;
-
-		/* lock */
- if((strstr(mode,"lockY")))
- g_pmbd_lock = TRUE;
- else if((strstr(mode,"lockN")))
- g_pmbd_lock = FALSE;
-
-		/* subpage update */
- if((strstr(mode,"subupdateY")))
- g_pmbd_subpage_update = TRUE;
- else if((strstr(mode,"subupdateN")))
- g_pmbd_subpage_update = FALSE;
-
-
- /* batch */
- if (strstr(mode, "batch")){
- if (_pmbd_parse_multi(mode, "batch", g_pmbd_buffer_batch_size) < 0)
- goto fail;
- /* check if any batch size is set too small */
- for (i = 0; i < PMBD_MAX_NUM_DEVICES; i ++) {
- if (g_pmbd_buffer_batch_size[i] < 1){
- printk(KERN_ERR "pmbd: buffer batch size cannot be smaller than 1 page (default: 1024 pages)\n");
- goto fail;
- }
- }
- }
-
- /* simmode */
- if (strstr(mode, "simmode")){
- if (_pmbd_parse_multi(mode, "simmode", g_pmbd_simmode) < 0)
- goto fail;
- }
-
- /* wpmode */
- if (strstr(mode, "wpmode")){
- if (_pmbd_parse_multi(mode, "wpmode", g_pmbd_wpmode) < 0)
- goto fail;
- }
-
- } else {
- goto fail;
- }
-
- /* apply some enforced configuration */
- if (enforce_cache_wc) /* if ntl is used, we must use WC */
- g_pmbd_cpu_cache_flag = _PAGE_CACHE_WC;
-
- /* Done, print input options */
- pmbd_print_conf();
- return;
-
-fail:
- printk(KERN_ERR "pmbd: wrong mode config! Check modinfo\n\n");
- g_pmbd_nr = 0;
- return;
-}
-
-/*
- * *****************************************************************
- * simple emulation API functions
- * pmbd_rdwr_pause - pause read/write for a specified number of cycles/page
- * pmbd_rdwr_slowdown - slow down read/write proportionally to DRAM speed
- * *****************************************************************/
-
-/* handle rdpause and wrpause options*/
-static void pmbd_rdwr_pause(PMBD_DEVICE_T* pmbd, size_t bytes, unsigned rw)
-{
- uint64_t cycles = 0;
- uint64_t time_p1, time_p2;
-
- /* sanity check */
- if (pmbd->rdpause == 0 && pmbd->wrpause == 0)
- return;
-
- /* start */
- TIMESTAT_POINT(time_p1);
-
- /* calculate the cycles to pause */
- if (rw == READ && pmbd->rdpause){
- cycles = MAX_OF((BYTE_TO_PAGE(bytes) * pmbd->rdpause), pmbd->rdpause);
- } else if (rw == WRITE && pmbd->wrpause){
- cycles = MAX_OF((BYTE_TO_PAGE(bytes) * pmbd->wrpause), pmbd->wrpause);
- }
-
- /* slow down now */
- if (cycles)
- sync_slowdown_cycles(cycles);
-
- TIMESTAT_POINT(time_p2);
-
- if(PMBD_USE_TIMESTAT()){
- int cid = CUR_CPU_ID();
- PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
- pmbd_stat->cycles_pause[rw][cid] += time_p2 - time_p1;
- }
-
- return;
-}
-
-
-/* handle rdsx and wrsx options */
-static void pmbd_rdwr_slowdown(PMBD_DEVICE_T* pmbd, int rw, uint64_t start, uint64_t end)
-{
- uint64_t cycles = 0;
- uint64_t time_p1, time_p2;
-
- /* sanity check */
- if ( !((rw == READ && pmbd->rdsx > 1) || (rw == WRITE && pmbd->wrsx > 1)))
- return;
-
- if (end < start){
- printk(KERN_WARNING "pmbd: %s(%d) end (%llu) is earlier than start (%llu)\n", \
- __FUNCTION__, __LINE__, (unsigned long long) start, (unsigned long long)end);
- return;
- }
-
- /* start */
- TIMESTAT_POINT(time_p1);
-
- /*FIXME: should we allow to do async slowdown? */
- cycles = (end-start)*((rw == READ) ? (pmbd->rdsx - 1) : (pmbd->wrsx -1));
-
- /*FIXME: should we minus a slack here (80-100cycles)? */
- if (cycles)
- sync_slowdown_cycles(cycles);
-
- TIMESTAT_POINT(time_p2);
-
- /* updating statistics */
- if(PMBD_USE_TIMESTAT()){
- int cid = CUR_CPU_ID();
- PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
- pmbd_stat->cycles_slowdown[rw][cid] += time_p2 - time_p1;
- }
-
- return;
-}
-
-
-/*
- * set page's cache flags
- * @vaddr: start virtual address
- * @num_pages: the range size
- */
-static void set_pages_cache_flags(unsigned long vaddr, int num_pages)
-{
- switch (g_pmbd_cpu_cache_flag) {
- case _PAGE_CACHE_WB:
- printk(KERN_INFO "pmbd: set PM pages cache flags (WB)\n");
- set_memory_wb(vaddr, num_pages);
- break;
- case _PAGE_CACHE_WC:
- printk(KERN_INFO "pmbd: set PM pages cache flags (WC)\n");
- set_memory_wc(vaddr, num_pages);
- break;
- case _PAGE_CACHE_UC:
- printk(KERN_INFO "pmbd: set PM pages cache flags (UC)\n");
- set_memory_uc(vaddr, num_pages);
- break;
- case _PAGE_CACHE_UC_MINUS:
- printk(KERN_INFO "pmbd: set PM pages cache flags (UM)\n");
- set_memory_uc(vaddr, num_pages);
- break;
- default:
- set_memory_wb(vaddr, num_pages);
- printk(KERN_WARNING "pmbd: PM page attribute is not set - use WB\n");
- break;
- }
- return;
-}
-
-
-/*
- * *************************************************************************
- * PMAP - Private mapping interface APIs
- * *************************************************************************
- *
- * The private mapping is for providing write protection -- a PM page is
- * mapped into the kernel virtual address space only when it needs to be
- * accessed, and unmapped as soon as the access finishes, so the spatial and
- * temporal window left open to stray-pointer bugs is very small.
- * Notes: pmap works similarly to kmap_atomic*. It does the following:
- * (1) pmap_create(): allocates 128 pages with vmalloc; their 128 PTE mappings
- * are saved to a backup area and then cleared to prevent accidental accesses.
- * Each page corresponds to the ID of the CPU the calling thread runs on, so
- * at most 128 CPU IDs are supported.
- * (2) pmap_atomic_pfn(): maps the specified pfn into the entry whose index is
- * the ID of the CPU on which the current thread is running. The pfn is loaded
- * into the corresponding PTE entry and the corresponding TLB entry is flushed.
- * (3) punmap_atomic(): the specified PTE entry is cleared, and the TLB entry
- * is flushed.
- * (4) pmap_destroy(): the saved PTE mappings of the 128 pages are restored,
- * and vfree() is called to release the 128 pages allocated through vmalloc().
- *
- */
-
-#define PMAP_NR_PAGES (128)
-static unsigned int pmap_nr_pages = 0; /* the total number of available pages for private mapping */
-static void* pmap_va_start = NULL; /* the first PMAP virtual address */
-static pte_t* pmap_ptep[PMAP_NR_PAGES]; /* the array of PTE entries */
-static unsigned long pmap_pfn[PMAP_NR_PAGES]; /* the array of page frame numbers for restoring */
-static pgprot_t pmap_prot[PMAP_NR_PAGES]; /* the array of page protection fields */
-#define PMAP_VA(IDX) (pmap_va_start + (IDX) * PAGE_SIZE)
-#define PMAP_IDX(VA) (((unsigned long)(VA) - (unsigned long)pmap_va_start) >> PAGE_SHIFT)
-
-static inline void pmap_flush_tlb_single(unsigned long addr)
-{
- asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
-}
-
-static inline void* update_pmap_pfn(unsigned long pfn, unsigned int idx)
-{
- void* va = PMAP_VA(idx);
- pte_t* ptep = pmap_ptep[idx];
- pte_t old_pte = *ptep;
- pte_t new_pte = pfn_pte(pfn, pmap_prot[idx]);
-
- if (pte_val(old_pte) == pte_val(new_pte))
- return va;
-
- /* update the pte entry */
- set_pte_atomic(ptep, new_pte);
-// set_pte(ptep, new_pte);
-
- /* flush one single tlb */
- __flush_tlb_one((unsigned long) va);
-// pmap_flush_tlb_single((unsigned long) va);
-
- /* return the old one for bkup */
- return va;
-}
-
-static inline void clear_pmap_pfn(unsigned idx)
-{
- if (idx < pmap_nr_pages){
-
- void* va = PMAP_VA(idx);
- pte_t* ptep = pmap_ptep[idx];
-
- /* clear the mapping */
- pte_clear(NULL, (unsigned long) va, ptep);
- __flush_tlb_one((unsigned long) va);
-
- } else {
- panic("%s(%d) illegal pmap idx\n", __FUNCTION__, __LINE__);
- }
-}
-
-static int pmap_atomic_init(void)
-{
- unsigned int i;
-
- /* checking */
- if (pmap_va_start)
- panic("%s(%d) something is wrong\n", __FUNCTION__, __LINE__);
-
- /* allocate an array of dummy pages as pmap virtual addresses */
- pmap_va_start = vmalloc(PAGE_SIZE * PMAP_NR_PAGES);
- if (!pmap_va_start){
- printk(KERN_ERR "pmbd:%s(%d) pmap_va_start cannot be initialized\n", __FUNCTION__, __LINE__);
- return -ENOMEM;
- }
- pmap_nr_pages = PMAP_NR_PAGES;
-
- /* set pages' cache flags, this flag would be saved into pmap_prot
- * and will be applied together with the dynamically mapped page too (01/12/2012)*/
- set_pages_cache_flags((unsigned long)pmap_va_start, pmap_nr_pages);
-
- /* save the dummy pages' ptep, pfn, and prot info */
- printk(KERN_INFO "pmbd: saving dummy pmap entries\n");
- for (i = 0; i < pmap_nr_pages; i ++){
- pte_t old_pte;
- unsigned int level;
- void* va = PMAP_VA(i);
-
- /* get the ptep */
- pte_t* ptep = lookup_address((unsigned long)(va), &level);
-
- /* sanity check */
- if (!ptep)
- panic("%s(%d) mapping not found\n", __FUNCTION__, __LINE__);
-
- old_pte = *ptep;
- if (!pte_val(old_pte))
- panic("%s(%d) invalid pte value\n", __FUNCTION__, __LINE__);
-
- if (level != PG_LEVEL_4K)
- panic("%s(%d) not PG_LEVEL_4K \n", __FUNCTION__, __LINE__);
-
- /* save dummy entries */
- pmap_ptep[i] = ptep;
- pmap_pfn[i] = pte_pfn(old_pte);
- pmap_prot[i] = pte_pgprot(old_pte);
-
-/* printk(KERN_INFO "%s(%d): saving dummy pmap entries: %u va=%p pfn=%lx\n", \
- __FUNCTION__, __LINE__, i, va, pmap_pfn[i]);
-*/
- }
-
- /* clear the pte to make it illegal to access */
- for (i = 0; i < pmap_nr_pages; i ++)
- clear_pmap_pfn(i);
-
- return 0;
-}
-
-static void pmap_atomic_done(void)
-{
- int i;
-
- /* restore the dummy pages' pte */
- printk(KERN_INFO "pmbd: restoring dummy pmap entries\n");
- for (i = 0; i < pmap_nr_pages; i ++){
-/* void* va = PMAP_VA(i);
- printk(KERN_INFO "%s(%d): restoring dummy pmap entries: %d va=%p pfn=%lx\n", \
- __FUNCTION__, __LINE__, i, va, pmap_pfn[i]);
-*/
- /* restore the old pfn */
- update_pmap_pfn(pmap_pfn[i], i);
- pmap_ptep[i]= NULL;
- pmap_pfn[i] = 0;
- }
-
- /* free the dummy pages*/
- if (pmap_va_start)
- vfree(pmap_va_start);
- else
- panic("%s(%d): freeing dummy pages failed\n", __FUNCTION__, __LINE__);
-
- pmap_va_start = NULL;
- pmap_nr_pages = 0;
- return;
-}
-
-static void* pmap_atomic_pfn(unsigned long pfn, PMBD_DEVICE_T* pmbd, unsigned rw)
-{
- void* va = NULL;
- unsigned int idx = CUR_CPU_ID();
- uint64_t time_p1 = 0;
- uint64_t time_p2 = 0;
-
- TIMESTAMP(time_p1);
-
- /* disable page fault temporarily */
- pagefault_disable();
-
- /* change the mapping to the specified pfn*/
- va = update_pmap_pfn(pfn, idx);
-
- TIMESTAMP(time_p2);
-
- /* update time statistics */
- if(PMBD_USE_TIMESTAT()){
- int cid = CUR_CPU_ID();
- PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
- pmbd_stat->cycles_pmap[rw][cid] += time_p2 - time_p1;
- }
-
- return va;
-}
-
-static void punmap_atomic(void* va, PMBD_DEVICE_T* pmbd, unsigned rw)
-{
- unsigned int idx = PMAP_IDX(va);
- uint64_t time_p1 = 0;
- uint64_t time_p2 = 0;
-
- TIMESTAMP(time_p1);
-
- /* clear the mapping */
- clear_pmap_pfn(idx);
-
- /* re-enable the page fault */
- pagefault_enable();
-
- TIMESTAMP(time_p2);
-
- /* update time statistics */
- if(PMBD_USE_TIMESTAT()){
- int cid = CUR_CPU_ID();
- PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
- pmbd_stat->cycles_punmap[rw][cid] += time_p2 - time_p1;
- }
-
- return;
-}
-
-/* create the dummy pmap space */
-static int pmap_create(void)
-{
- pmap_atomic_init();
- return 0;
-}
-
-/* destroy the dummy pmap space */
-static void pmap_destroy(void)
-{
- pmap_atomic_done();
- return;
-}
-
-/*
- * *************************************************************************
- * Non-temporal memcpy
- * *************************************************************************
- * Non-temporal memcpy does the following:
- * (1) use movntdq to copy into PM space
- * (2) use sfence to flush the data to the memory controller
- *
- * Compared to regular temporal memcpy, it provides several benefits:
- * (1) writes to PM bypass the CPU cache, which avoids polluting the cache
- * (2) reads from PM still benefit from the CPU cache
- * (3) the sfence after each write guarantees data is flushed out of the
- *     store buffers
- */
-
-static void nts_memcpy_64bytes_v2(void* to, void* from, size_t size)
-{
- int i;
-	unsigned bs = 64;	/* copy unit size: 64 bytes */
-
- if (size < bs)
- panic("%s(%d) size (%zu) is smaller than %u\n", __FUNCTION__, __LINE__, size, bs);
-
-	if (((unsigned long) from & 63UL) || ((unsigned long)to & 63UL))
- panic("%s(%d) not aligned\n", __FUNCTION__, __LINE__);
-
- /* start */
- kernel_fpu_begin();
-
- /* do the non-temporal mov */
- for (i = 0; i < size; i += bs){
- __asm__ __volatile__ (
- "movdqa (%0), %%xmm0\n"
- "movdqa 16(%0), %%xmm1\n"
- "movdqa 32(%0), %%xmm2\n"
- "movdqa 48(%0), %%xmm3\n"
- "movntdq %%xmm0, (%1)\n"
- "movntdq %%xmm1, 16(%1)\n"
- "movntdq %%xmm2, 32(%1)\n"
- "movntdq %%xmm3, 48(%1)\n"
- :
- : "r" (from), "r" (to)
- : "memory");
-
- to += bs;
- from += bs;
- }
-
- /* do sfence to push data out */
- __asm__ __volatile__ (
- " sfence\n" : :
- );
-
- /* end */
- kernel_fpu_end();
-
-	/* NOTE: we assume the size is a multiple of 64 bytes */
-	if (i != size)
-		panic("%s:%s:%d size (%zu) is not a multiple of 64 bytes\n", __FILE__, __FUNCTION__, __LINE__, size);
-
- return;
-}
-
-/* non-temporal store */
-static void nts_memcpy(void* to, void* from, size_t size)
-{
- if (size < 64){
-		panic("no support for nt store smaller than 64 bytes yet\n");
- } else {
- nts_memcpy_64bytes_v2(to, from, size);
- }
-}
-
-
-static void ntl_memcpy_64bytes(void* to, void* from, size_t size)
-{
- int i;
-	unsigned bs = 64;	/* copy unit size: 64 bytes */
-
- if (size < bs)
- panic("%s(%d) size (%zu) is smaller than %u\n", __FUNCTION__, __LINE__, size, bs);
-
-	if (((unsigned long) from & 63UL) || ((unsigned long)to & 63UL))
- panic("%s(%d) not aligned\n", __FUNCTION__, __LINE__);
-
- /* start */
- kernel_fpu_begin();
-
- /* do the non-temporal mov */
- for (i = 0; i < size; i += bs){
- __asm__ __volatile__ (
- "movntdqa (%0), %%xmm0\n"
- "movntdqa 16(%0), %%xmm1\n"
- "movntdqa 32(%0), %%xmm2\n"
- "movntdqa 48(%0), %%xmm3\n"
- "movdqa %%xmm0, (%1)\n"
- "movdqa %%xmm1, 16(%1)\n"
- "movdqa %%xmm2, 32(%1)\n"
- "movdqa %%xmm3, 48(%1)\n"
- :
- : "r" (from), "r" (to)
- : "memory");
-
- to += bs;
- from += bs;
- }
-
- /* end */
- kernel_fpu_end();
-
-	/* NOTE: we assume the size is a multiple of 64 bytes (at least 512 bytes) */
-	if (i != size)
-		panic("%s:%s:%d size (%zu) is not a multiple of 64 bytes\n", __FILE__, __FUNCTION__, __LINE__, size);
-
- return;
-}
-
-/* non-temporal load */
-static void ntl_memcpy(void* to, void* from, size_t size)
-{
- if (size < 64){
-		panic("no support for nt load smaller than 64 bytes yet\n");
- } else {
- ntl_memcpy_64bytes(to, from, size);
- }
-}
-
-
-/*
- * *************************************************************************
- * COPY TO/FROM PM
- * *************************************************************************
- *
- * NOTE: copying into PM needs particular care, we use different solution here:
- * (1) pmap: we only map/unmap PM pages when we need to access, which provides
- * us the most protection, for both reads and writes
- * (2) non-pmap: every page is always mapped into the kernel space, but extra
- *     protection is applied to writes only. In both cases, PM pages are
- *     initialized as read-only.
- * - PTE manipulation: before each write, the page's writable bit is enabled,
- *   and it is disabled right after the write operation is done.
- * - CR0/WP switch: before each write, the WP bit in the CR0 register is
- *   turned off, and it is turned back on right after the write operation is
- *   done. Once the CR0/WP bit is turned off, the local CPU does not check
- *   the writable bit in the TLB, which is a trick to work around this
- *   problem.
- *
- */
-
-#define PMBD_PMAP_DUMMY_BASE_VA (4096)
-#define PMBD_PMAP_VA_TO_PA(VA) (g_highmem_phys_addr + (VA) - PMBD_PMAP_DUMMY_BASE_VA)
-/*
- * copying from/to a contiguous PM space using pmap
- * @ram_va: the RAM virtual address
- * @pmbd_dummy_va: the dummy PM virtual address (for converting to phys addr)
- * @rw: 0 - read, 1 - write
- */
-
-#define MEMCPY_TO_PMBD(dst, src, bytes) { if (PMBD_USE_NTS()) \
- nts_memcpy((dst), (src), (bytes)); \
- else \
- memcpy((dst), (src), (bytes));}
-
-#define MEMCPY_FROM_PMBD(dst, src, bytes) { if (PMBD_USE_NTL()) \
- ntl_memcpy((dst), (src), (bytes)); \
- else \
- memcpy((dst), (src), (bytes));}
-
-static inline int _memcpy_pmbd_pmap(PMBD_DEVICE_T* pmbd, void* ram_va, void* pmbd_dummy_va, size_t bytes, unsigned rw, unsigned do_fua)
-{
- unsigned long flags = 0;
- uint64_t pa = (uint64_t) PMBD_PMAP_VA_TO_PA(pmbd_dummy_va);
-
- /* disable interrupt (PMAP entry is shared) */
- DISABLE_SAVE_IRQ(flags);
-
- /* do the real work */
- while(bytes){
- uint64_t time_p1 = 0;
- uint64_t time_p2 = 0;
-
- unsigned long pfn = (pa >> PAGE_SHIFT); /* page frame number */
- unsigned off = pa & (~PAGE_MASK); /* offset in one page */
- unsigned size = MIN_OF((PAGE_SIZE - off), bytes);/* the size to copy */
-
- /* map it */
- void * map = pmap_atomic_pfn(pfn, pmbd, rw);
- void * pmbd_va = map + off;
-
- /* do memcopy */
- TIMESTAMP(time_p1);
- if (rw == READ) {
- MEMCPY_FROM_PMBD(ram_va, pmbd_va, size);
- } else {
- if (PMBD_USE_SUBPAGE_UPDATE()) {
- /* if we do subpage write, write a cacheline each time */
- /* FIXME: we probably need to check the alignment here */
- size = MIN_OF(size, PMBD_CACHELINE_SIZE);
- if (memcmp(pmbd_va, ram_va, size)){
- MEMCPY_TO_PMBD(pmbd_va, ram_va, size);
- }
- } else {
- MEMCPY_TO_PMBD(pmbd_va, ram_va, size);
- }
- }
- TIMESTAMP(time_p2);
-
- /* emulating slowdown*/
- if(PMBD_DEV_USE_SLOWDOWN(pmbd))
- pmbd_rdwr_slowdown((pmbd), rw, time_p1, time_p2);
-
- /* for write check if we need to do clflush or do FUA*/
- if (rw == WRITE){
- if (PMBD_USE_CLFLUSH() || (do_fua && PMBD_CPU_CACHE_USE_WB() && !PMBD_USE_NTS()))
- pmbd_clflush_range(pmbd, pmbd_va, (size));
- }
-
- /* if write combine is used, we need to do sfence (like in ntstore) */
- if (PMBD_CPU_CACHE_USE_WC() || PMBD_CPU_CACHE_USE_UM())
- sfence();
-
- /* update time statistics */
- if(PMBD_USE_TIMESTAT()){
- int cid = CUR_CPU_ID();
- PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
- pmbd_stat->cycles_memcpy[rw][cid] += time_p2 - time_p1;
- }
-
- /* unmap it */
- punmap_atomic(map, pmbd, rw);
-
- /* prepare the next iteration */
- ram_va += size;
- bytes -= size;
- pa += size;
- }
-
- /* re-enable interrupt */
- ENABLE_RESTORE_IRQ(flags);
-
- return 0;
-}
-
-static inline int memcpy_from_pmbd_pmap(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes)
-{
- return _memcpy_pmbd_pmap(pmbd, dst, src, bytes, READ, FALSE);
-}
-
-static inline int memcpy_to_pmbd_pmap(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes, unsigned do_fua)
-{
- return _memcpy_pmbd_pmap(pmbd, src, dst, bytes, WRITE, do_fua);
-}
-
-
-/*
- * memcpy from/to PM without using pmap
- */
-
-#define DISABLE_CR0_WP(CR0,FLAGS) {\
- if (PMBD_USE_WRITE_PROTECTION()){\
- DISABLE_SAVE_IRQ((FLAGS));\
- (CR0) = read_cr0();\
- write_cr0((CR0) & ~X86_CR0_WP);\
- }\
- }
-#define ENABLE_CR0_WP(CR0,FLAGS) {\
- if (PMBD_USE_WRITE_PROTECTION()){\
- write_cr0((CR0));\
- ENABLE_RESTORE_IRQ((FLAGS));\
- }\
- }
-
-static inline int memcpy_from_pmbd_nopmap(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes)
-{
- uint64_t time_p1 = 0;
- uint64_t time_p2 = 0;
-
- /* start memcpy */
- TIMESTAMP(time_p1);
-#if 0
- if (PMBD_DEV_USE_VMALLOC((pmbd)))
- memcpy((dst), (src), (bytes));
- else if (PMBD_DEV_USE_HIGHMEM((pmbd)))
- memcpy_fromio((dst), (src), (bytes));
-#endif
- MEMCPY_FROM_PMBD(dst, src, bytes);
-
- TIMESTAMP(time_p2);
-
- /* emulating slowdown*/
- if(PMBD_DEV_USE_SLOWDOWN(pmbd))
- pmbd_rdwr_slowdown((pmbd), READ, time_p1, time_p2);
-
- /* update time statistics */
- if(PMBD_USE_TIMESTAT()){
- int cid = CUR_CPU_ID();
- PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
- pmbd_stat->cycles_memcpy[READ][cid] += time_p2 - time_p1;
- }
-
- return 0;
-}
-
-static int memcpy_to_pmbd_nopmap(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes, unsigned do_fua)
-{
-
- unsigned long cr0 = 0;
- unsigned long flags = 0;
- size_t left = bytes;
-
-
- /* get a bkup copy of the CR0 (to allow writable)*/
- if (PMBD_DEV_USE_WPMODE_CR0(pmbd))
- DISABLE_CR0_WP(cr0, flags);
-
- /* do the real work */
- while(left){
- size_t size = left; // the size to copy
- uint64_t time_p1 = 0;
- uint64_t time_p2 = 0;
-
- TIMESTAMP(time_p1);
- /* do memcopy */
- if (PMBD_USE_SUBPAGE_UPDATE()) {
- /* if we do subpage write, write a cacheline each time */
- size = MIN_OF(size, PMBD_CACHELINE_SIZE);
-
- if (memcmp(dst, src, size)){
- MEMCPY_TO_PMBD(dst, src, size);
- }
- } else {
- MEMCPY_TO_PMBD(dst, src, size);
- }
- TIMESTAMP(time_p2);
-
- /* emulating slowdown*/
- if(PMBD_DEV_USE_SLOWDOWN(pmbd))
- pmbd_rdwr_slowdown((pmbd), WRITE, time_p1, time_p2);
-
- /* if write, check if we need to do clflush or we do FUA */
- if (PMBD_USE_CLFLUSH() || (do_fua && PMBD_CPU_CACHE_USE_WB() && !PMBD_USE_NTS()))
- pmbd_clflush_range(pmbd, dst, (size));
-
- /* if write combine is used, we need to do sfence (like in ntstore) */
- if (PMBD_CPU_CACHE_USE_WC() || PMBD_CPU_CACHE_USE_UM())
- sfence();
-
- /* update time statistics */
- if(PMBD_USE_TIMESTAT()){
- int cid = CUR_CPU_ID();
- PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
- pmbd_stat->cycles_memcpy[WRITE][cid] += time_p2 - time_p1;
- }
-
- /* prepare the next iteration */
- dst += size;
- src += size;
- left -= size;
- }
-
- /* restore the CR0 */
- if (PMBD_DEV_USE_WPMODE_CR0(pmbd))
- ENABLE_CR0_WP(cr0, flags);
-
- return 0;
-}
-
-static int memcpy_to_pmbd(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes, unsigned do_fua)
-{
- uint64_t start = 0;
- uint64_t end = 0;
-
- /* start simulation timing */
- if (PMBD_DEV_SIM_PMBD((pmbd)))
- start = emul_start((pmbd), BYTE_TO_SECTOR((bytes)), WRITE);
-
- /* do memcpy now */
- if (PMBD_USE_PMAP()){
- memcpy_to_pmbd_pmap(pmbd, dst, src, bytes, do_fua);
- } else {
- memcpy_to_pmbd_nopmap(pmbd, dst, src, bytes, do_fua);
- }
-
- /* stop simulation timing */
- if (PMBD_DEV_SIM_PMBD((pmbd)))
- end = emul_end((pmbd), BYTE_TO_SECTOR((bytes)), WRITE, start);
-
- /* pause write for a while*/
- pmbd_rdwr_pause(pmbd, bytes, WRITE);
-
- return 0;
-}
-
-
-
-static int memcpy_from_pmbd(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes)
-{
- uint64_t start = 0;
- uint64_t end = 0;
-
- /* start simulation timing */
- if (PMBD_DEV_SIM_PMBD((pmbd)))
- start = emul_start((pmbd), BYTE_TO_SECTOR((bytes)), READ);
-
- /* do memcpy here */
- if (PMBD_USE_PMAP()){
- memcpy_from_pmbd_pmap(pmbd, dst, src, bytes);
- }else{
- memcpy_from_pmbd_nopmap(pmbd, dst, src, bytes);
- }
-
- /* stop simulation timing */
- if (PMBD_DEV_SIM_PMBD((pmbd)))
- end = emul_end((pmbd), BYTE_TO_SECTOR((bytes)), READ, start);
-
- /* pause read for a while */
- pmbd_rdwr_pause(pmbd, bytes, READ);
-
- return 0;
-}
-
-
-
-/*
- * *************************************************************************
- * PMBD device buffer management
- * *************************************************************************
- *
- * Since write protection involves high performance overhead (due to TLB
- * shootdowns, system-wide locking, and the linked-list scans inside the
- * set_memory_* functions), we cannot change page table attributes for each
- * incoming write to PM space. To combat this, we add a DRAM buffer that
- * temporarily holds incoming writes, and we launch a syncer daemon to
- * periodically flush dirty pages from the buffer to the PM storage. This
- * brings two benefits: first, contiguous pages can be clustered together,
- * so only one page attribute change is needed per cluster; second, the high
- * overhead is hidden in the background, since the writes become
- * asynchronous. A schematic sketch of this idea follows this comment.
- *
- */
-
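-#if 0
-/*
- * Illustrative sketch only (editor's addition, not part of the original
- * driver): the syncer's clustering idea described above, reduced to its
- * core loop. The names sketch_syncer_flush() and vaddr_of() are
- * hypothetical; the real syncer flushes sorted buffer block entries (see
- * the bbi sort helpers below) and batches page-attribute changes.
- */
-static void sketch_syncer_flush(unsigned char *dirty, unsigned long nr_pages)
-{
-	unsigned long i = 0;
-
-	while (i < nr_pages) {
-		unsigned long start, len;
-
-		/* skip clean pages */
-		while (i < nr_pages && !dirty[i])
-			i++;
-		if (i == nr_pages)
-			break;
-
-		/* grow a cluster of contiguous dirty pages */
-		start = i;
-		while (i < nr_pages && dirty[i])
-			i++;
-		len = i - start;
-
-		/*
-		 * One page-attribute change covers the whole cluster:
-		 *   set_memory_rw(vaddr_of(start), len);
-		 *   ... copy the cluster from the DRAM buffer to PM ...
-		 *   set_memory_ro(vaddr_of(start), len);
-		 */
-		memset(&dirty[start], 0, len);	/* mark the cluster clean */
-	}
-}
-#endif
-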
-
-/* support functions to sort the bbi entries */
-static int compare_bbi_sort_entries(const void* m, const void* n)
-{
- PMBD_BSORT_ENTRY_T* a = (PMBD_BSORT_ENTRY_T*) m;
- PMBD_BSORT_ENTRY_T* b = (PMBD_BSORT_ENTRY_T*) n;
- if (a->pbn < b->pbn)
- return -1;
- else if (a->pbn == b->pbn)
- return 0;
- else
- return 1;
-
-}
-
-static void swap_bbi_sort_entries(void* m, void* n, int size)
-{
- PMBD_BSORT_ENTRY_T* a = (PMBD_BSORT_ENTRY_T*) m;
- PMBD_BSORT_ENTRY_T* b = (PMBD_BSORT_ENTRY_T*) n;
- PMBD_BSORT_ENTRY_T tmp;
- tmp = *a;
- *a = *b;
- *b = tmp;
- return;
-}
-
-
-/*
- * get the aligned in-block offsets for a given request
- * @pmbd: the pmbd device
- * @sector: the starting offset (in sectors) of the incoming request
- * @bytes: the size of the incoming request
- *
- * return: the in-block offset of the starting sector in the request
- *
- * Since the block size (4096 bytes) is larger than the sector size (512 bytes),
- * if the incoming request is not completely aligned in units of blocks, then
- * we need to pull the whole block from PM space into the buffer, and apply
- * changes to partial blocks. This function is needed to calculate the offset
- * for the beginning and ending sectors.
- *
- * For example: assuming a sector size of 1024 bytes and a buffer block size
- * of 4096 bytes (four sectors per block), a request starting at sector 5
- * with a size of 2048 bytes (two sectors) yields a start offset of 1 (the
- * second sector in the buffer block) and an end offset of 2 (the third
- * sector in the buffer block):
- *
- *    offset_s -----v     v----- offset_e
- *    -------------------------
- *    |     |*****|*****|     |
- *    -------------------------
- *
- */
-
-static sector_t pmbd_buffer_aligned_request_start(PMBD_DEVICE_T* pmbd, sector_t sector, size_t bytes)
-{
- sector_t sector_s = sector;
- PBN_T pbn_s = SECTOR_TO_PBN(pmbd, sector_s);
- sector_t block_s = PBN_TO_SECTOR(pmbd, pbn_s); /* the block's starting offset (in sector) */
- sector_t offset_s = 0;
- if (sector_s >= block_s) /* if not aligned */
- offset_s = sector_s - block_s;
- return offset_s;
-}
-
-static sector_t pmbd_buffer_aligned_request_end(PMBD_DEVICE_T* pmbd, sector_t sector, size_t bytes)
-{
- sector_t sector_e = sector + BYTE_TO_SECTOR(bytes) - 1;
- PBN_T pbn_e = SECTOR_TO_PBN(pmbd, sector_e);
- sector_t block_e = PBN_TO_SECTOR(pmbd, pbn_e); /* the block's starting offset (in sector) */
- sector_t offset_e = PBN_TO_SECTOR(pmbd, 1) - 1;
-
- if (sector_e >= block_e) /* if not aligned */
- offset_e = (sector_e - block_e);
- return offset_e;
-}
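
To make the offset math above concrete, here is a standalone sketch using the
driver's defaults of 512-byte sectors and 4096-byte blocks (eight sectors per
block); the names and values are illustrative only:

    #include <stdio.h>

    #define SECTOR_SIZE        512
    #define SECTORS_PER_BLOCK  8

    int main(void)
    {
        unsigned long sector = 13, bytes = 2048;
        unsigned long sector_e = sector + bytes / SECTOR_SIZE - 1;  /* 16 */
        unsigned long offset_s = sector   % SECTORS_PER_BLOCK;      /* 5  */
        unsigned long offset_e = sector_e % SECTORS_PER_BLOCK;      /* 0  */

        /* The request covers sectors 13..16: it starts at the sixth
         * sector of block 1 (sectors 8-15) and ends at the first
         * sector of block 2 (sectors 16-23), so both boundary blocks
         * need a read-modify-write from PM. */
        printf("offset_s=%lu offset_e=%lu\n", offset_s, offset_e);
        return 0;
    }
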
-
-
-/*
- * check and see if a physical block (pbn) is buffered
- * @buffer: the pmbd buffer
- * @pbn: physical block number
- *
- * NOTE: The caller must hold the pbi->lock
- */
-static PMBD_BBI_T* _pmbd_buffer_lookup(PMBD_BUFFER_T* buffer, PBN_T pbn)
-{
- PMBD_BBI_T* bbi = NULL;
- PMBD_DEVICE_T* pmbd = buffer->pmbd;
- PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
-
- if (PMBD_BLOCK_IS_BUFFERED(pmbd, pbn)) {
- bbi = PMBD_BUFFER_BBI(buffer, pbi->bbn);
- }
- return bbi;
-}
-
-/*
- * Alloc/flush buffer functions
- */
-
-/*
- * flushing a range of contiguous physical blocks from buffer to PM space
- * @pmbd: pmbd device
- * @pbn_s: the first physical block number to flush (start)
- * @pbn_e: the last physical block number to flush (end)
- *
- * This function only flushes blocks from the buffer to PM and unlinks (frees)
- * the corresponding buffer blocks and physical PM blocks; it does not update
- * the buffer control info (num_dirty, pos_dirty). This is because, after
- * sorting, the processing order of buffer blocks (BBNs) may differ from their
- * spatial order in the buffer, which makes it impossible to advance pos_dirty
- * exactly one block at a time. In other words, pos_dirty only points to the
- * end of the dirty range, and we may flush a dirty block in the middle of the
- * range rather than from the end first.
- *
- * NOTE: The caller must hold the flush_lock; only one thread is allowed to do
- * this sync; we also assume all the physical blocks in the specified range are
- * buffered.
- *
- */
-
-static unsigned long _pmbd_buffer_flush_range(PMBD_BUFFER_T* buffer, PBN_T pbn_s, PBN_T pbn_e)
-{
- PBN_T pbn = 0;
- unsigned long num_cleaned = 0;
- PMBD_DEVICE_T* pmbd = buffer->pmbd;
- void* dst = PMBD_BLOCK_VADDR(pmbd, pbn_s);
- size_t bytes = PBN_TO_BYTE(pmbd, (pbn_e - pbn_s + 1));
-
- /* NOTE: we are protected by the flush_lock here, no-one else here */
-
- /* set the pages readwriteable */
- /* if we use CR0/WP to temporarily switch the writable permission,
- * we don't have to change the PTE attributes directly */
- if (PMBD_DEV_USE_WPMODE_PTE(pmbd))
- pmbd_set_pages_rw(pmbd, dst, bytes, TRUE);
-
-
- /* for each physical block, flush it from buffer to the PM space */
- for (pbn = pbn_s; pbn <= pbn_e; pbn ++){
- BBN_T bbn = 0;
- PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
- void* to = PMBD_BLOCK_VADDR(pmbd, pbn);
- size_t size = pmbd->pb_size;
- void* from = NULL; /* wait to get it in locked region */
- PMBD_BBI_T* bbi = NULL; /* wait to get it in locked region */
-
-		/*
-		 * NOTE: This cannot deadlock, because the blocks here are
-		 * already buffered, so handling them never calls
-		 * pmbd_buffer_alloc_block().
-		 */
- spin_lock(&pbi->lock); /* lock the block */
-
- /* get related buffer block info */
- if (PMBD_BLOCK_IS_BUFFERED(pmbd, pbn)) {
- bbn = pbi->bbn;
- bbi = PMBD_BUFFER_BBI(buffer, pbi->bbn);
- from = PMBD_BUFFER_BLOCK(buffer, pbi->bbn);
- } else {
- panic("pmbd: %s(%d) something wrong here \n", __FUNCTION__, __LINE__);
- }
-
- /* sync data from buffer into PM first */
- if (PMBD_BUFFER_BBI_IS_DIRTY(buffer, bbn)) {
- /* flush to PM */
- memcpy_to_pmbd(pmbd, to, from, size, FALSE);
-
- /* mark it as clean */
- PMBD_BUFFER_SET_BBI_CLEAN(buffer, bbn);
- }
- }
-
- /* set the pages back to read-only */
- if (PMBD_DEV_USE_WPMODE_PTE(pmbd))
- pmbd_set_pages_ro(pmbd, dst, bytes, TRUE);
-
-
- /* finish the remaining work */
- for (pbn = pbn_s; pbn <= pbn_e; pbn ++){
- PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
- void* to = PMBD_BLOCK_VADDR(pmbd, pbn);
- size_t size = pmbd->pb_size;
- BBN_T bbn = pbi->bbn;
- void* from = PMBD_BUFFER_BLOCK(buffer, pbi->bbn);
-
- /* verify that the write operation succeeded */
- if(PMBD_USE_WRITE_VERIFICATION())
- pmbd_verify_wr_pages(pmbd, to, from, size);
-
- /* reset the bbi and pbi link info */
- PMBD_BUFFER_SET_BBI_UNBUFFERED(buffer, bbn);
- PMBD_SET_BLOCK_UNBUFFERED(pmbd, pbn);
-
- /* unlock the block */
- spin_unlock(&pbi->lock);
-
- num_cleaned ++;
- }
-
- /* generate checksum */
- if (PMBD_USE_CHECKSUM())
- pmbd_checksum_on_write(pmbd, dst, bytes);
-
- return num_cleaned;
-}
-
-
-/*
- * core function of flushing the pmbd buffer
- * @pmbd: pmbd device
- *
- * NOTE: this function performs the flushing in the following steps
- * (1) get the flush lock (to allow only one to do flushing)
- * (2) get the buffer_lock to protect the buffer control info (num_dirty,
- * pos_dirty, pos_clean)
- * (3) check if someone else has already done the flushing work while waiting
- * for the lock
- * (4) copy the buffer block info from pos_dirty to pos_clean to a temporary
- * array
- * (5) release the buffer_lock (to allow alloc to proceed, as long as free
- * blocks exist)
- *
- * (6) sort the temporary array of buffer blocks in the order of their PBNs.
- * This is because we need to organize sequences of contiguous physical blocks,
- * so that we can use only one set_memory_* function for a sequence of memory
- * pages, rather than once for each page. So the larger the sequence is, the
- * more efficient it would be.
- * (7) scan the sorted list, and form sequences of contiguous physical blocks,
- * and call _pmbd_buffer_flush_range() to synchronize the sequences one by one
- *
- * (8) get the buffer_lock again
- * (9) update pos_dirty and num_dirty to reflect the recent changes
- * (10) release the buffer_lock, then the flush_lock
- *
- * NOTE: The caller must not hold flush_lock and buffer_lock, but can hold
- * pbi->lock.
- *
- */
-static unsigned long pmbd_buffer_flush(PMBD_BUFFER_T* buffer, unsigned long num_to_clean)
-{
- BBN_T i = 0;
- BBN_T bbn_s = 0;
- BBN_T bbn_e = 0;
- PBN_T first_pbn = 0;
- PBN_T last_pbn = 0;
- unsigned long num_cleaned = 0;
- unsigned long num_scanned = 0;
- PMBD_DEVICE_T* pmbd = buffer->pmbd;
- PMBD_BSORT_ENTRY_T* bbi_sort_buffer = buffer->bbi_sort_buffer;
-
- /* lock the flush_lock to ensure no-one else can do flush in parallel */
- spin_lock(&buffer->flush_lock);
-
- /* now we lock the buffer to protect buffer control info */
- spin_lock(&buffer->buffer_lock);
-
- /* check if num_to_clean is too large */
- if (num_to_clean > buffer->num_dirty)
- num_to_clean = buffer->num_dirty;
-
- /* check if the buffer is empty (someone else may have done the flushing job) */
- if (PMBD_BUFFER_IS_EMPTY(buffer) || num_to_clean == 0) {
- spin_unlock(&buffer->buffer_lock);
- goto done;
- }
-
- /* set up the range of BBNs we need to check */
- bbn_s = buffer->pos_dirty; /* the first bbn */
- bbn_e = PMBD_BUFFER_PRIO_POS(buffer, buffer->pos_clean);/* the last bbn */
-
- /* scan the buffer range and put it into the sort buffer */
- /*
- * NOTE: bbn_s could be equal to PMBD_BUFFER_NEXT_POS(buffer, bbn_e), if
- * the buffer is filled with dirty blocks, so we need to check num_scanned
- * here.
-	 */
- for (i = bbn_s;
- (i != PMBD_BUFFER_NEXT_POS(buffer, bbn_e)) || (num_scanned == 0);
- i = PMBD_BUFFER_NEXT_POS(buffer, i)) {
-		/*
-		 * FIXME: some blocks in the dirty block range might be
-		 * "clean", because a block used to be marked CLEAN between
-		 * being allocated and being written, even though it was
-		 * already allocated. It is still safe to attempt to flush
-		 * such a block, because the pbi->lock protects us.
-		 *
-		 * UPDATE: the allocator now marks a block dirty as soon as
-		 * it is allocated, so the situation above can no longer
-		 * happen.
-		 */
- if(PMBD_BUFFER_BBI_IS_CLEAN(buffer, i)){
- /* found clean blocks */
- panic("ERR: %s(%d)%u: found clean block in the range of dirty blocks (bbn_s=%lu bbn_e=%lu, i=%lu, num_scanned=%lu num_to_clean=%lu num_dirty=%lu pos_dirty=%lu pos_clean=%lu)\n",
- __FUNCTION__, __LINE__, __CURRENT_PID__,bbn_s, bbn_e, i, num_scanned, num_to_clean, buffer->num_dirty, buffer->pos_dirty, buffer->pos_clean);
- continue;
- } else {
- PMBD_BBI_T* bbi = PMBD_BUFFER_BBI(buffer, i);
- PMBD_BSORT_ENTRY_T* se = bbi_sort_buffer + num_scanned;
-
- /* add it to the buffer for sorting */
- se->pbn = bbi->pbn;
- se->bbn = i;
- num_scanned ++;
-
- /* only clean num_to_clean blocks */
- if (num_scanned >= num_to_clean)
- break;
- }
- }
- /* unlock the buffer to let allocator continue */
- spin_unlock(&buffer->buffer_lock);
-
-	/* if no valid dirty block to be cleaned */
- if (num_scanned == 0)
- goto done;
-
- /*
- * sort the buffer to get sequences of contiguous blocks
- */
- if (PMBD_DEV_USE_WPMODE_PTE(pmbd))
- sort(bbi_sort_buffer, num_scanned, sizeof(PMBD_BSORT_ENTRY_T), compare_bbi_sort_entries, swap_bbi_sort_entries);
-
- /* scan the sorted list to organize and flush the sequences of contiguous PBNs */
- for (i = 0; i < num_scanned; i ++) {
- PMBD_BSORT_ENTRY_T* se = bbi_sort_buffer + i;
- PMBD_BBI_T* bbi = PMBD_BUFFER_BBI(buffer, se->bbn);
- if (i == 0) {
- /* the first one */
- first_pbn = bbi->pbn;
- last_pbn = bbi->pbn;
- continue;
- } else {
- if (bbi->pbn == (last_pbn + 1) ) {
- /* if blocks are contiguous */
- last_pbn = bbi->pbn;
- continue;
- } else {
- /* if blocks are not contiguous */
- num_cleaned += _pmbd_buffer_flush_range(buffer, first_pbn, last_pbn);
-
- /* start a new sequence */
- first_pbn = bbi->pbn;
- last_pbn = bbi->pbn;
- continue;
- }
- }
- }
-
- /* finish the last sequence of contiguous PBNs */
- num_cleaned += _pmbd_buffer_flush_range(buffer, first_pbn, last_pbn);
-
- /* update the buffer control info */
- spin_lock(&buffer->buffer_lock);
- buffer->pos_dirty = PMBD_BUFFER_NEXT_N_POS(buffer, bbn_s, num_cleaned); /* move pos_dirty forward */
- buffer->num_dirty -= num_cleaned; /* decrement the counter*/
- spin_unlock(&buffer->buffer_lock);
-
-done:
- spin_unlock(&buffer->flush_lock);
- return num_cleaned;
-}
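
Steps (6) and (7) of the flush - sort the snapshot of dirty entries by PBN,
then flush one run of consecutive PBNs at a time - can be sketched in
isolation as follows (a userspace model with hypothetical names; flush_range()
stands in for _pmbd_buffer_flush_range()):

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_pbn(const void *a, const void *b)
    {
        unsigned long x = *(const unsigned long *)a;
        unsigned long y = *(const unsigned long *)b;
        return (x > y) - (x < y);
    }

    static void flush_range(unsigned long s, unsigned long e)
    {
        printf("flush PBNs %lu..%lu (one set_memory_* pair)\n", s, e);
    }

    int main(void)
    {
        unsigned long pbn[] = { 9, 4, 5, 17, 6, 10 };
        size_t i, n = sizeof(pbn) / sizeof(pbn[0]);
        unsigned long first, last;

        qsort(pbn, n, sizeof(pbn[0]), cmp_pbn);

        first = last = pbn[0];
        for (i = 1; i < n; i++) {
            if (pbn[i] == last + 1) {    /* extend the current run */
                last = pbn[i];
                continue;
            }
            flush_range(first, last);    /* run broken: flush it   */
            first = last = pbn[i];
        }
        flush_range(first, last);
        /* prints the runs 4..6, 9..10, and 17..17 */
        return 0;
    }
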
-
-/*
- * entry function of flushing buffer
- * This function is called by the allocator, the syncer, and the destroyer
- * @buffer: the pmbd buffer
- * @num_to_clean: how many blocks to clean
- * @caller: identifies the caller (CALLER_ALLOCATOR, CALLER_SYNCER, or CALLER_DESTROYER)
- */
-static unsigned long pmbd_buffer_check_and_flush(PMBD_BUFFER_T* buffer, unsigned long num_to_clean, unsigned caller)
-{
- unsigned long num_cleaned = 0;
-
-	/*
-	 * Since more than one thread (e.g. alloc/flush or alloc/alloc) may try
-	 * to flush the buffer, we first check whether someone else has already
-	 * done the job while we were waiting for the lock. If so, we do not
-	 * have to flush again. This improves application responsiveness.
-	 */
- if (caller == CALLER_DESTROYER){
- /* if destroyer calls this function, just flush everything */
- goto do_it;
-
- } else if (caller == CALLER_SYNCER) {
- /* if syncer calls this function and the buffer is empty, do nothing */
- spin_lock(&buffer->buffer_lock);
- if (PMBD_BUFFER_IS_EMPTY(buffer)){
- spin_unlock(&buffer->buffer_lock);
- goto done;
- }
- spin_unlock(&buffer->buffer_lock);
-
- } else if (caller == CALLER_ALLOCATOR){
-
-		/* if the allocator calls this function but free buffer blocks
-		 * are already available, we do nothing */
- spin_lock(&buffer->buffer_lock);
- if (!PMBD_BUFFER_IS_FULL(buffer)){
- spin_unlock(&buffer->buffer_lock);
- goto done;
- }
- spin_unlock(&buffer->buffer_lock);
-
- } else {
- panic("ERR: %s(%d) unknown caller id\n", __FUNCTION__, __LINE__);
- }
-
- /* otherwise, we do flushing */
-do_it:
- num_cleaned = pmbd_buffer_flush(buffer, num_to_clean);
-
-done:
- return num_cleaned;
-}
-
-/*
- * Core function of allocating a buffer block
- *
- * We first grab the buffer_lock, and check to see if the buffer is full. If
- * not, we allocate a buffer block, move the pos_clean, and update num_dirty,
- * then release the buffer_lock. Since we already hold the pbi->lock, it is
- * safe to release the lock and let other threads proceed (before we really
- * write data into the buffer block), because no one else can read/write or
- * access the same buffer block concurrently. If the buffer is full, we release
- * the buffer_lock to allow others to proceed (because we may be blocked at
- * flush_lock later), and then we call the function to synchronously flush the
- * buffer. Note that someone else may be there already, so we may be blocked
- * there, and if we find someone has already flushed the buffer, we need to
- * grab the buffer_lock and check if there is available buffer block again.
- *
- * NOTE: The caller must hold the pbi->lock.
- *
- */
-static PMBD_BBI_T* pmbd_buffer_alloc_block(PMBD_BUFFER_T* buffer, PBN_T pbn)
-{
- BBN_T pos = 0;
- PMBD_BBI_T* bbi = NULL;
- PMBD_DEVICE_T* pmbd = buffer->pmbd;
- PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
-
- /* lock the buffer control info (we will check and update it) */
- spin_lock(&buffer->buffer_lock);
-
-check_again:
- /* check if the buffer is completely full, if yes, flush it to PM */
- if (PMBD_BUFFER_IS_FULL(buffer)) {
- /* release the buffer_lock (someone may be doing flushing)*/
- spin_unlock(&buffer->buffer_lock);
-
-		/* If the buffer is full, we must flush it synchronously.
-		 *
-		 * NOTE: this on-demand flushing improves performance considerably,
-		 * since the allocator does not have to wait for the syncer to wake
-		 * up and do the work. It also makes applications run more smoothly
-		 * (relying entirely on the syncer causes abrupt stalls). Also note
-		 * that we only flush a batch (e.g. 1024) of blocks, rather than the
-		 * whole buffer, because we only need a few free blocks to satisfy
-		 * the application's own needs, and a smaller batch reduces the time
-		 * the application spends on allocation. */
- pmbd_buffer_check_and_flush(buffer, buffer->batch_size, CALLER_ALLOCATOR);
-
- /* grab the lock and check the availability of free buffer blocks
- * again, because someone may use up all the free buffer blocks, right
- * after the buffer is flushed but before we can get one */
- spin_lock(&buffer->buffer_lock);
- goto check_again;
- }
-
- /* if buffer is not full, only reserve one spot first.
- *
- * NOTE that we do not have to do link and memcpy in the locked region,
- * because pbi->lock guarantees that no-one else can use it now. This
- * moves the high-cost operations out of the critical section */
- pos = buffer->pos_clean;
- buffer->pos_clean = PMBD_BUFFER_NEXT_POS(buffer, buffer->pos_clean);
- buffer->num_dirty ++;
-
- /* NOTE: we mark it "dirty" here, but actually the data has not been
- * really written into the PMBD buffer block yet. This is safe, because
- * we are protected by the pbi->lock */
- PMBD_BUFFER_SET_BBI_DIRTY(buffer, pos);
-
- /* now link them up (no-one else can see it) */
- bbi = PMBD_BUFFER_BBI(buffer, pos);
-
- bbi->pbn = pbn;
- pbi->bbn = pos;
-
- /* unlock the buffer_lock and let others proceed */
- spin_unlock(&buffer->buffer_lock);
-
- return bbi;
-}
-
-
-/*
- * syncer daemon worker function
- */
-
-static inline uint64_t pmbd_device_is_idle(PMBD_DEVICE_T* pmbd)
-{
-	unsigned long last_jiffies, now_jiffies;
- uint64_t interval = 0;
-
- now_jiffies = jiffies;
- PMBD_DEV_GET_ACCESS_TIME(pmbd, last_jiffies);
- interval = jiffies_to_usecs(now_jiffies - last_jiffies);
-
- if (PMBD_DEV_IS_IDLE(pmbd, interval)) {
- return interval;
- } else {
- return 0;
- }
-}
-
-static int pmbd_syncer_worker(void* data)
-{
- PMBD_BUFFER_T* buffer = (PMBD_BUFFER_T*) data;
-
- set_user_nice(current, 0);
-
- do {
- unsigned do_flush = 0;
-// unsigned long loop = 0;
- uint64_t idle_usec = 0;
- spin_lock(&buffer->buffer_lock);
-
- /* we start flushing, if
- * (1) the num of dirty blocks hits the high watermark, or
- * (2) the device has been idle for a while */
- if (PMBD_BUFFER_ABOVE_HW(buffer)) {
- //printk("High watermark is hit\n";
- do_flush = 1;
- }
-// if (pmbd_device_is_idle(buffer->pmbd) && !PMBD_BUFFER_IS_EMPTY(buffer)) {
- if ((idle_usec = pmbd_device_is_idle(buffer->pmbd)) && PMBD_BUFFER_ABOVE_LW(buffer)) {
- //printk("Device is idle for %llu uSeconds\n", idle_usec);
- do_flush = 1;
- }
- if (do_flush){
- unsigned long num_dirty = 0;
- unsigned long num_cleaned = 0;
-repeat:
- num_dirty = buffer->num_dirty;
- spin_unlock(&buffer->buffer_lock);
-
- /* start flushing
- *
-			 * NOTE: we only flush a batch (e.g. 1024) of blocks each time. The
-			 * purpose is to let applications get a few free blocks and proceed,
-			 * rather than waiting until the whole buffer is flushed. Otherwise,
-			 * the bandwidth would be lower and applications could not run
-			 * smoothly.
- */
- num_cleaned = pmbd_buffer_check_and_flush(buffer, buffer->batch_size, CALLER_SYNCER);
- //printk("Syncer(%u) activated (%lu) - Before (%lu) Cleaned (%lu) After (%lu)\n",
- // buffer->buffer_id, loop++, num_dirty, num_cleaned, buffer->num_dirty);
-
- /* continue to flush until we hit the low watermark */
- spin_lock(&buffer->buffer_lock);
- if (PMBD_BUFFER_ABOVE_LW(buffer)) {
-// if (buffer->num_dirty > 0) {
- goto repeat;
- }
- }
- spin_unlock(&buffer->buffer_lock);
-
- /* go to sleep */
- set_current_state(TASK_INTERRUPTIBLE);
- schedule_timeout(1);
- set_current_state(TASK_RUNNING);
-
- } while(!kthread_should_stop());
- return 0;
-}
-
-static struct task_struct* pmbd_buffer_syncer_init(PMBD_BUFFER_T* buffer)
-{
- struct task_struct* tsk = NULL;
- tsk = kthread_run(pmbd_syncer_worker, (void*) buffer, "nsyncer");
-	if (IS_ERR(tsk)) {
- printk(KERN_ERR "pmbd: initializing buffer syncer failed\n");
- return NULL;
- }
-
- buffer->syncer = tsk;
- printk("pmbd: buffer syncer launched\n");
- return tsk;
-}
-
-static int pmbd_buffer_syncer_stop(PMBD_BUFFER_T* buffer)
-{
- if (buffer->syncer){
- kthread_stop(buffer->syncer);
- buffer->syncer = NULL;
- printk(KERN_INFO "pmbd: buffer syncer stopped\n");
- }
- return 0;
-}
-
-/*
- * read and write to PMBD with buffer
- */
-static void copy_to_pmbd_buffered(PMBD_DEVICE_T* pmbd, void *src, sector_t sector, size_t bytes)
-{
- PBN_T pbn = 0;
- void* from = src;
-
- /*
- * get the start and end in-block offset
- *
-	 * NOTE: Since the buffer block (4096 bytes) is larger than the
-	 * sector (512 bytes), if the incoming request is not completely
-	 * aligned to buffer blocks, we need to read the full block from PM
-	 * into the buffer block and apply writes to part of it. Here,
-	 * offset_s and offset_e are the start and end in-block offsets (in
-	 * units of sectors) for the first and last sectors of the request;
-	 * they may or may not fall in the same buffer block, depending on
-	 * the request size.
- */
- PBN_T pbn_s = SECTOR_TO_PBN(pmbd, sector);
- PBN_T pbn_e = BYTE_TO_PBN(pmbd, (SECTOR_TO_BYTE(sector) + bytes - 1));
- sector_t offset_s = pmbd_buffer_aligned_request_start(pmbd, sector, bytes);
- sector_t offset_e = pmbd_buffer_aligned_request_end(pmbd, sector, bytes);
-
- /* for each physical block */
- for (pbn = pbn_s; pbn <= pbn_e; pbn ++){
- void* to = NULL;
- PMBD_BBI_T* bbi = NULL;
- PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
- sector_t sect_s = (pbn == pbn_s) ? offset_s : 0; /* sub-block access */
- sector_t sect_e = (pbn == pbn_e) ? offset_e : (PBN_TO_SECTOR(pmbd, 1) - 1);/* sub-block access */
- size_t size = SECTOR_TO_BYTE(sect_e - sect_s + 1); /* get the real size */
- PMBD_BUFFER_T* buffer = PBN_TO_PMBD_BUFFER(pmbd, pbn);
-
- /* lock the physical block first */
- spin_lock(&pbi->lock);
-
- /* check if the physical block is buffered */
- bbi = _pmbd_buffer_lookup(buffer, pbn);
-
- if (bbi){
- /* if the block is already buffered */
- to = PMBD_BUFFER_BLOCK(buffer, pbi->bbn) + SECTOR_TO_BYTE(sect_s);
- } else {
- /* if not buffered, allocate one free buffer block */
- bbi = pmbd_buffer_alloc_block(buffer, pbn);
-
- /* if not aligned to a full block, we have to copy the whole
- * block from the PM space to the buffer block first */
- if (size < pmbd->pb_size){
- memcpy_from_pmbd(pmbd, PMBD_BUFFER_BLOCK(buffer, pbi->bbn), PMBD_BLOCK_VADDR(pmbd, pbn), pmbd->pb_size);
- }
- to = PMBD_BUFFER_BLOCK(buffer, pbi->bbn) + SECTOR_TO_BYTE(sect_s);
- }
-
- /* writing it into buffer */
- memcpy(to, from, size);
- PMBD_BUFFER_SET_BBI_DIRTY(buffer, pbi->bbn);
-
- /* unlock the block */
- spin_unlock(&pbi->lock);
-
- from += size;
- }
-
- return;
-}
-
-static void copy_from_pmbd_buffered(PMBD_DEVICE_T* pmbd, void *dst, sector_t sector, size_t bytes)
-{
- PBN_T pbn = 0;
- void* to = dst;
-
- /* get the start and end in-block offset */
- PBN_T pbn_s = SECTOR_TO_PBN(pmbd, sector);
- PBN_T pbn_e = BYTE_TO_PBN(pmbd, SECTOR_TO_BYTE(sector) + bytes - 1);
- sector_t offset_s = pmbd_buffer_aligned_request_start(pmbd, sector, bytes);
- sector_t offset_e = pmbd_buffer_aligned_request_end(pmbd, sector, bytes);
-
- for (pbn = pbn_s; pbn <= pbn_e; pbn ++){
- /* Scan the incoming request and check each block, for each block, we
- * check if it is in the buffer. If true, we read it from the buffer,
- * otherwise, we read from the PM space. */
-
- void* from = NULL;
- PMBD_BBI_T* bbi = NULL;
- PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
- sector_t sect_s = (pbn == pbn_s) ? offset_s : 0;
- sector_t sect_e = (pbn == pbn_e) ? offset_e : (PBN_TO_SECTOR(pmbd, 1) - 1);/* sub-block access */
- size_t size = SECTOR_TO_BYTE(sect_e - sect_s + 1); /* get the real size */
- PMBD_BUFFER_T* buffer = PBN_TO_PMBD_BUFFER(pmbd, pbn);
-
- /* lock the physical block first */
- spin_lock(&pbi->lock);
-
- /* check if the block is in the buffer */
- bbi = _pmbd_buffer_lookup(buffer, pbn);
-
- /* start reading data */
- if (bbi) {
- /* if buffered, read it from the buffer */
- from = PMBD_BUFFER_BLOCK(buffer, pbi->bbn) + SECTOR_TO_BYTE(sect_s);
-
- /* read it out */
- memcpy(to, from, size);
-
- } else {
- /* if not buffered, read it from PM space */
- from = PMBD_BLOCK_VADDR(pmbd, pbn) + SECTOR_TO_BYTE(sect_s);
-
- /* verify the checksum first */
- if (PMBD_USE_CHECKSUM())
- pmbd_checksum_on_read(pmbd, from, size);
-
- /* read it out*/
- memcpy_from_pmbd(pmbd, to, from, size);
- }
-
- /* unlock the block */
- spin_unlock(&pbi->lock);
-
- to += size;
- }
-
- return;
-}
-
-/*
- * buffer related space alloc/free functions
- */
-static int pmbd_pbi_space_alloc(PMBD_DEVICE_T* pmbd)
-{
- int err = 0;
-
-	/* allocate the physical block info (pbi) space */
- pmbd->pbi_space = vmalloc(PMBD_TOTAL_PB_NUM(pmbd) * sizeof(PMBD_PBI_T));
- if (pmbd->pbi_space) {
- PBN_T i;
- for (i = 0; i < PMBD_TOTAL_PB_NUM(pmbd); i ++) {
- PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, i);
- PMBD_SET_BLOCK_UNBUFFERED(pmbd, i);
- spin_lock_init(&pbi->lock);
- }
- printk(KERN_INFO "pmbd(%d): pbi space is initialized\n", pmbd->pmbd_id);
- } else {
- err = -ENOMEM;
- }
-
- return err;
-}
-
-static int pmbd_pbi_space_free(PMBD_DEVICE_T* pmbd)
-{
- if (pmbd->pbi_space){
- vfree(pmbd->pbi_space);
- pmbd->pbi_space = NULL;
- printk(KERN_INFO "pmbd(%d): pbi space is freed\n", pmbd->pmbd_id);
- }
- return 0;
-}
-
-static PMBD_BUFFER_T* pmbd_buffer_create(PMBD_DEVICE_T* pmbd)
-{
- int i;
- PMBD_BUFFER_T* buffer = kzalloc (sizeof(PMBD_BUFFER_T), GFP_KERNEL);
- if (!buffer){
- goto fail;
- }
-
- /* link to the pmbd device */
- buffer->pmbd = pmbd;
-
- /* set size */
- if (g_pmbd_bufsize[pmbd->pmbd_id] > PMBD_BUFFER_MIN_BUFSIZE) {
- buffer->num_blocks = MB_TO_BYTES(g_pmbd_bufsize[pmbd->pmbd_id]) / pmbd->pb_size;
- } else {
- if (PMBD_DEV_USE_BUFFER(pmbd)) {
- printk(KERN_INFO "pmbd(%d): WARNING - too small buffer size (%llu MBs). Buffer set to %d MBs\n",
- pmbd->pmbd_id, g_pmbd_bufsize[pmbd->pmbd_id], PMBD_BUFFER_MIN_BUFSIZE);
- }
- buffer->num_blocks = MB_TO_BYTES(PMBD_BUFFER_MIN_BUFSIZE) / pmbd->pb_size;
- }
-
- /* buffer space */
- buffer->buffer_space = vmalloc(buffer->num_blocks * pmbd->pb_size);
- if (!buffer->buffer_space)
- goto fail;
-
- /* BBI array */
- buffer->bbi_space = vmalloc(buffer->num_blocks * sizeof(PMBD_BBI_T));
- if (!buffer->bbi_space)
- goto fail;
- memset(buffer->bbi_space, 0, buffer->num_blocks * sizeof(PMBD_BBI_T));
-
- /* temporary array of bbi for sorting */
- buffer->bbi_sort_buffer = vmalloc(buffer->num_blocks * sizeof(PMBD_BSORT_ENTRY_T));
- if (!buffer->bbi_sort_buffer)
- goto fail;
-
- /* initialize the locks*/
- spin_lock_init(&buffer->buffer_lock);
- spin_lock_init(&buffer->flush_lock);
-
- /* initialize the BBI array */
- for (i = 0; i < buffer->num_blocks; i ++){
- PMBD_BUFFER_SET_BBI_CLEAN(buffer, i);
- PMBD_BUFFER_SET_BBI_UNBUFFERED(buffer, i);
- }
-
- /* initialize the buffer control info */
- buffer->num_dirty = 0;
- buffer->pos_dirty = 0;
- buffer->pos_clean = 0;
- buffer->batch_size = g_pmbd_buffer_batch_size[pmbd->pmbd_id];
-
- /* launch the syncer daemon */
- pmbd_buffer_syncer_init(buffer);
- if (!buffer->syncer)
- goto fail;
-
- printk(KERN_INFO "pmbd: pmbd device buffer (%u) allocated (%lu blocks - block size %u bytes)\n",
- buffer->buffer_id, buffer->num_blocks, pmbd->pb_size);
- return buffer;
-
-fail:
- if (buffer && buffer->bbi_sort_buffer)
- vfree(buffer->bbi_sort_buffer);
- if (buffer && buffer->bbi_space)
- vfree(buffer->bbi_space);
- if (buffer && buffer->buffer_space)
- vfree(buffer->buffer_space);
- if (buffer)
- kfree(buffer);
- printk(KERN_ERR "%s(%d) vzalloc failed\n", __FUNCTION__, __LINE__);
- return NULL;
-}
-
-static int pmbd_buffer_destroy(PMBD_BUFFER_T* buffer)
-{
- unsigned id = buffer->buffer_id;
-
- /* stop syncer first */
- pmbd_buffer_syncer_stop(buffer);
-
- /* flush the buffer to the PM space */
- pmbd_buffer_check_and_flush(buffer, buffer->num_blocks, CALLER_DESTROYER);
-
- /* FIXME: wait for the on-going operations to finish first? */
- if (buffer && buffer->bbi_sort_buffer)
- vfree(buffer->bbi_sort_buffer);
- if (buffer && buffer->bbi_space)
- vfree(buffer->bbi_space);
- if (buffer && buffer->buffer_space)
- vfree(buffer->buffer_space);
- if (buffer)
- kfree(buffer);
- printk(KERN_INFO "pmbd: pmbd device buffer (%u) space freed\n", id);
- return 0;
-}
-
-static int pmbd_buffers_create(PMBD_DEVICE_T* pmbd)
-{
- int i;
- for (i = 0; i < pmbd->num_buffers; i ++){
- pmbd->buffers[i] = pmbd_buffer_create(pmbd);
- if (pmbd->buffers[i] == NULL)
- return -ENOMEM;
- (pmbd->buffers[i])->buffer_id = i;
- }
- return 0;
-}
-
-static int pmbd_buffers_destroy(PMBD_DEVICE_T* pmbd)
-{
- int i;
- for (i = 0; i < pmbd->num_buffers; i ++){
- if(pmbd->buffers[i]){
- pmbd_buffer_destroy(pmbd->buffers[i]);
- pmbd->buffers[i] = NULL;
- }
- }
- return 0;
-}
-
-static int pmbd_buffer_space_alloc(PMBD_DEVICE_T* pmbd)
-{
- int err = 0;
-
- if (pmbd->num_buffers <= 0)
- return 0;
-
- /* allocate buffers array */
- pmbd->buffers = kzalloc (sizeof(PMBD_BUFFER_T*) * pmbd->num_buffers, GFP_KERNEL);
- if (pmbd->buffers == NULL){
- err = -ENOMEM;
- goto fail;
- }
-
- /* allocate each buffer */
- err = pmbd_buffers_create(pmbd);
- printk(KERN_INFO "pmbd: pmbd buffer space allocated.\n");
-fail:
- return err;
-}
-
-static int pmbd_buffer_space_free(PMBD_DEVICE_T* pmbd)
-{
- if (pmbd->num_buffers <=0)
- return 0;
-
- pmbd_buffers_destroy(pmbd);
- kfree(pmbd->buffers);
- pmbd->buffers = NULL;
- printk(KERN_INFO "pmbd: pmbd buffer space freed.\n");
-
- return 0;
-}
-
-
-/*
- * *************************************************************************
- * High memory based PMBD functions
- * *************************************************************************
- *
- * NOTE:
- * (1) memcpy_fromio() and memcpy_toio() are used for reading/writing PM,
- * but they are unnecessary on x86 architectures.
- * (2) Currently we only allocate the reserved space to the PMBDs once.
- * No dynamic allocation/deallocation of the space is needed so far.
- */
-
-
-static void* pmbd_highmem_map(void)
-{
- /*
- * NOTE: we can also use ioremap_* functions to directly set memory
- * page attributes when do remapping, but to make it consistent with
- * using vmalloc(), we do ioremap_cache() and call set_memory_* later.
- */
-
- if (PMBD_USE_PMAP()){
-		/* NOTE: If we use pmap(), we don't need to map the reserved
-		 * physical memory into the kernel space. Instead we use
-		 * pmap_atomic() to map and unmap the to-be-accessed pages on
-		 * demand. Since such a mapping is private to the processor,
-		 * there is no need to change PTEs or perform TLB shootdowns.
-		 *
-		 * Also note that we use PMBD_PMAP_DUMMY_BASE_VA to keep the rest
-		 * of the code happy with a valid virtual address. The real
-		 * physical address is calculated as follows:
-		 * g_highmem_phys_addr + (vaddr) - PMBD_PMAP_DUMMY_BASE_VA
-		 *
-		 * (updated 10/25/2011)
-		 */
-
- g_highmem_virt_addr = (void*) PMBD_PMAP_DUMMY_BASE_VA;
- g_highmem_curr_addr = g_highmem_virt_addr;
- printk(KERN_INFO "pmbd: PMAP enabled - setting g_highmem_virt_addr to a dummy address (%d)\n", PMBD_PMAP_DUMMY_BASE_VA);
- return g_highmem_virt_addr;
-
- } else if ((g_highmem_virt_addr = ioremap_prot(g_highmem_phys_addr, g_highmem_size, g_pmbd_cpu_cache_flag))) {
-
- g_highmem_curr_addr = g_highmem_virt_addr;
- printk(KERN_INFO "pmbd: high memory space remapped (offset: %llu MB, size=%lu MB, cache flag=%s)\n",
- BYTES_TO_MB(g_highmem_phys_addr), BYTES_TO_MB(g_highmem_size), PMBD_CPU_CACHE_FLAG());
- return g_highmem_virt_addr;
-
- } else {
-
- printk(KERN_ERR "pmbd: %s(%d) - failed remapping high memory space (offset: %llu MB size=%lu MB)\n",
- __FUNCTION__, __LINE__, BYTES_TO_MB(g_highmem_phys_addr), BYTES_TO_MB(g_highmem_size));
- return NULL;
- }
-}
-
-static void pmbd_highmem_unmap(void)
-{
-	/* unmap the high memory from the kernel address space */
- /* NOTE: if we use pmap(), the g_highmem_virt_addr is fake */
- if (!PMBD_USE_PMAP()){
- if(g_highmem_virt_addr){
- iounmap(g_highmem_virt_addr);
- g_highmem_virt_addr = NULL;
- printk(KERN_INFO "pmbd: unmapping high mem space (offset: %llu MB, size=%lu MB)is unmapped\n",
- BYTES_TO_MB(g_highmem_phys_addr), BYTES_TO_MB(g_highmem_size));
- }
- }
- return;
-}
-
-static void* hmalloc(uint64_t bytes)
-{
- void* rtn = NULL;
-
-	/* check if there is still available reserved high memory space */
- if (bytes <= PMBD_HIGHMEM_AVAILABLE_SPACE) {
- rtn = g_highmem_curr_addr;
- g_highmem_curr_addr += bytes;
- } else {
- printk(KERN_ERR "pmbd: %s(%d) - no available space (< %llu bytes) in reserved high memory\n",
- __FUNCTION__, __LINE__, bytes);
- }
- return rtn;
-}
-
-static int hfree(void* addr)
-{
- /* FIXME: no support for dynamic alloc/dealloc in HIGH_MEM space */
- return 0;
-}
-
-
-/*
- * *************************************************************************
- * Device Emulation
- * *************************************************************************
- *
- * Our emulation is based on a simple model - access time and transfer time.
- *
- * emulated time = access time + (request size / bandwidth)
- * inserted delay = emulated time - observed time
- *
- * (1) Access time is applied to each request. We check each request's real
- * access time and pad it with an extra delay to meet the designated latency.
- * This is a best-effort solution, which means we just guarantee that no
- * request can be completed with a response time less than the specified
- * latency, but the real access latencies could be higher. In addition, if the
- * total number of threads is larger than the number of available processors,
- * the simulated latencies could be higher, due to CPU saturation.
- *
- * (2) Transfer time is calculated based on batches
- * - A batch is a sequence of consecutive requests with a short interval in
- * between; requests in a batch can be overlapped with each other (parallel
- * jobs); there is a limit for the total amount of data and the duration of
- * a batch
- * - For each batch, we calculate its target emulated transfer time as
- * "emul_trans_time = num_sectors/emul_bandwidth" and calculate a delay as
- * "delay = emul_trans_time - real_trans_time"
- * - The calculated delay is applied to each batch at the end
- * - A lock is used to slow down all threads, because bandwidth is a
- * system-wide specification. In this way, we serialize the threads
- * accessing the device, which simulates that the device is busy on a task.
- *
- * (3) Two types of delays implemented
- *     - Sync delay: if the delay is less than 10ms, we keep polling the TSC
- *     counter, which is basically "busy waiting", like a spin-lock. This
- *     reaches a precision of around a hundred cycles.
- *     - Async delay: if the delay is more than 10ms, we call msleep() to
- *     sleep for a while, which relinquishes CPU control and results in lower
- *     precision. The left-over sub-millisecond delay is made up with a sync
- *     delay. An async delay cannot be used while holding a lock.
- *
- */
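
For a feel of the numbers, the model above evaluates as follows for one
assumed case - a 4KB request (8 sectors) against an emulated device with a
200 ns access time and 1000 MB/s bandwidth, where the real copy took 1500 ns
(all values hypothetical):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long bytes   = 8ULL * 512;   /* 4 KB request  */
        unsigned long long lat_ns  = 200;          /* access time   */
        unsigned long long bw_mbs  = 1000;         /* MB/s          */
        unsigned long long real_ns = 1500;         /* observed time */

        /* same shift trick as cal_trans_time() below: ns per byte at
         * 1 MB/s is approximated by 1e9 >> 20 */
        unsigned long long trans_ns = bytes * (1000000000ULL >> 20) / bw_mbs;
        unsigned long long emul_ns  = lat_ns + trans_ns;
        unsigned long long delay_ns = emul_ns > real_ns ? emul_ns - real_ns : 0;

        /* prints: trans=3903 emul=4103 delay=2603 (ns) */
        printf("trans=%llu emul=%llu delay=%llu (ns)\n",
               trans_ns, emul_ns, delay_ns);
        return 0;
    }
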
-
-
-static inline uint64_t DIV64_ROUND(uint64_t dividend, uint64_t divisor)
-{
- if (divisor > 0) {
-		uint64_t quot1 = dividend / divisor;
-		uint64_t mod = dividend % divisor;
-		uint64_t mult = mod * 2;
-		uint64_t quot2 = mult / divisor; /* 1 if mod >= divisor/2 (round to nearest) */
-		uint64_t result = quot1 + quot2;
- return result;
- } else { // FIXME: how to handle this?
- printk(KERN_WARNING "pmbd: WARNING - %s(%d) divisor is zero\n", __FUNCTION__, __LINE__);
- return 0;
- }
-}
-
-static inline unsigned int get_cpu_freq(void)
-{
-#if 0
- unsigned int khz = cpufreq_quick_get(0); /* FIXME: use cpufreq_get() ??? */
- if (!khz)
- khz = cpu_khz;
- printk("khz=%u, cpu_khz=%u\n", khz, cpu_khz);
-#endif
- return cpu_khz;
-}
-
-static inline uint64_t _cycle_to_ns(uint64_t cycle, unsigned int khz)
-{
- return cycle * 1000000 / khz;
-}
-
-static inline uint64_t cycle_to_ns(uint64_t cycle)
-{
- unsigned int khz = get_cpu_freq();
- return _cycle_to_ns(cycle, khz);
-}
-
-/*
- * calculate the emulated transfer time for a given request size/type on a device
- * @num_sectors: num of sectors to read/write
- * @rw: read or write
- * @pmbd: the pmbd device
- */
-static uint64_t cal_trans_time(unsigned int num_sectors, unsigned rw, PMBD_DEVICE_T* pmbd)
-{
- uint64_t ns = 0;
- uint64_t bw = (rw == READ) ? pmbd->rdbw : pmbd->wrbw; /* bandwidth */
- if (bw) {
- uint64_t tmp = num_sectors * PMBD_SECTOR_SIZE;
- uint64_t tt = 1000000000UL >> MB_SHIFT;
- ns += DIV64_ROUND((tmp * tt), bw);
- }
- return ns;
-}
-
-static uint64_t cal_access_time(unsigned int num_sectors, unsigned rw, PMBD_DEVICE_T* pmbd)
-{
- uint64_t ns = (rw == READ) ? pmbd->rdlat : pmbd->wrlat; /* access time */
- return ns;
-}
-
-static inline void sync_slowdown(uint64_t ns)
-{
- uint64_t start, now;
- unsigned int khz = get_cpu_freq();
- if (ns) {
- /*
-		 * We keep reading the TSC counter to check if the delay has
-		 * elapsed; this also prevents the CPU from being scaled down,
-		 * which provides a stable estimation of the elapsed time.
- */
- TIMESTAMP(start);
- while(1) {
- TIMESTAMP(now);
- if (_cycle_to_ns((now-start), khz) > ns)
- break;
- }
- }
- return;
-}
-
-static inline void sync_slowdown_cycles(uint64_t cycles)
-{
-
- uint64_t start, now;
- if (cycles){
- /*
-		 * We keep reading the TSC counter to check if the delay has
-		 * elapsed; this also prevents the CPU from being scaled down,
-		 * which provides a stable estimation of the elapsed time.
- */
- TIMESTAMP(start);
- while(1) {
- TIMESTAMP(now);
- if ((now - start) >= cycles)
- break;
- }
- }
- return;
-}
-
-static inline void async_slowdown(uint64_t ns)
-{
- uint64_t ms = ns / 1000000;
- uint64_t left = ns - (ms * 1000000);
- /* do ms delay with sleep */
- msleep(ms);
-
- /* make up the sub-ms delay */
- sync_slowdown(left);
-}
-
-#if 0
-static inline void slowdown_us(unsigned long long us)
-{
- set_current_state(TASK_INTERRUPTIBLE);
- schedule_timeout(us * HZ / 1000000);
-}
-#endif
-
-static void pmbd_slowdown(uint64_t ns, unsigned in_lock)
-{
- /*
-	 * NOTE: if the delay is less than 10ms, we use sync_slowdown() to keep
-	 * polling the CPU cycle counter, busy-waiting until the delay elapses;
-	 * otherwise, we use msleep() to relinquish CPU control.
- */
- if (ns > MAX_SYNC_SLOWDOWN && !in_lock)
- async_slowdown(ns);
- else if (ns > 0)
- sync_slowdown(ns);
-
- return;
-}
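
The sync/async split can be mirrored in userspace; a rough sketch, assuming
the same 10 ms cutoff and using clock_gettime() where the driver polls the
TSC (names here are illustrative only):

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define MAX_SYNC_NS 10000000ULL        /* 10 ms cutoff            */

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
    }

    static void slowdown(uint64_t ns)
    {
        uint64_t deadline = now_ns() + ns;

        if (ns > MAX_SYNC_NS) {            /* async part: sleep first */
            uint64_t sleep_ns = ns - MAX_SYNC_NS;
            struct timespec ts = {
                .tv_sec  = (time_t)(sleep_ns / 1000000000ULL),
                .tv_nsec = (long)(sleep_ns % 1000000000ULL),
            };
            nanosleep(&ts, NULL);
        }
        while (now_ns() < deadline)        /* sync part: busy-wait    */
            ;
    }

    int main(void)
    {
        uint64_t t0 = now_ns();
        slowdown(15000000);                /* 15 ms: sleep, then poll */
        printf("elapsed: %llu ns\n",
               (unsigned long long)(now_ns() - t0));
        return 0;
    }
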
-
-/*
- * Emulating the transfer time for a batch of requests for specific bandwidth
- *
- * We group a bunch of consecutive requests into a "batch". Within one batch,
- * the interval between two consecutive requests should be small, the total
- * amount of accessed data should be a good size (not too small, not too
- * large), and the duration should be reasonable (not too long). For each
- * batch, we estimate the emulated transfer time and compare it with the real
- * transfer time (from the start to the end of the batch); if the real
- * transfer time is less than the emulated time, we apply an extra delay at
- * the end of the batch to make up the difference. In this way the bandwidth
- * emulation stays closer to the real situation. Note that, since requests
- * from multiple threads can be processed in parallel, we must slow down ALL
- * the threads accessing the PMBD device; thus, we use batch_lock to
- * coordinate all threads.
- *
- * @num_sectors: the num of sectors of the request
- * @rw: read or write
- * @pmbd: the involved pmbd device
- *
- */
-
-static void pmbd_emul_transfer_time(int num_sectors, int rw, PMBD_DEVICE_T* pmbd)
-{
- uint64_t interval_ns = 0;
- uint64_t duration_ns = 0;
- unsigned new_batch = FALSE;
- unsigned end_batch = FALSE;
- uint64_t now_cycle = 0;
-
- spin_lock(&pmbd->batch_lock);
-
- /* get a timestamp for now */
- TIMESTAMP(now_cycle);
-
- /* if this is the first timestamp */
- if (pmbd->batch_start_cycle[rw] == 0) {
- pmbd->batch_start_cycle[rw] = now_cycle;
- pmbd->batch_end_cycle[rw] = now_cycle;
- goto done;
- }
-
- /* calculate the interval from the last request */
- if (now_cycle >= pmbd->batch_end_cycle[rw]){
- interval_ns = cycle_to_ns(now_cycle - pmbd->batch_end_cycle[rw]);
- } else {
-		panic("%s(%d): timestamp in the past found.\n", __FUNCTION__, __LINE__);
- }
-
- /* check the interval length (cannot be too distant) */
- if (interval_ns >= PMBD_BATCH_MAX_INTERVAL) {
- /* interval is too big, break it to two batches */
- new_batch = TRUE;
- end_batch = TRUE;
- } else {
- /* still in the same batch, good */
- pmbd->batch_sectors[rw] += num_sectors;
- pmbd->batch_end_cycle[rw] = now_cycle;
- }
-
- /* check current batch duration (cannot be too long) */
- duration_ns = cycle_to_ns(pmbd->batch_end_cycle[rw] - pmbd->batch_start_cycle[rw]);
- if (duration_ns >= PMBD_BATCH_MAX_DURATION)
- end_batch = TRUE;
-
- /* check current batch data amount (cannot be too large) */
- if (pmbd->batch_sectors[rw] >= PMBD_BATCH_MAX_SECTORS)
- end_batch = TRUE;
-
- /* if the batch ends, check and apply slow-down */
- if (end_batch) {
- /* batch size must be large enough, if not, just skip it */
- if (pmbd->batch_sectors[rw] > PMBD_BATCH_MIN_SECTORS) {
- uint64_t real_ns = cycle_to_ns(pmbd->batch_end_cycle[rw] - pmbd->batch_start_cycle[rw]);
- uint64_t emul_ns = cal_trans_time(pmbd->batch_sectors[rw], rw, pmbd);
-
- if (emul_ns > real_ns)
- pmbd_slowdown((emul_ns - real_ns), TRUE);
- }
-
- pmbd->batch_sectors[rw] = 0;
- pmbd->batch_start_cycle[rw] = now_cycle;
- pmbd->batch_end_cycle[rw] = now_cycle;
- }
-
- /* if a new batch begins, add the first request */
- if (new_batch) {
- pmbd->batch_sectors[rw] = num_sectors;
- pmbd->batch_start_cycle[rw] = now_cycle;
- pmbd->batch_end_cycle[rw] = now_cycle;
- }
-
-done:
- spin_unlock(&pmbd->batch_lock);
- return;
-}
-
-/*
- * Emulating access time for a request
- *
- * Different from emulating bandwidths, we emulate access time for each
- * individual access. Right after we simulate the transfer time, we examine
- * the real access time (including transfer time), if the real time is smaller
- * than the specified access time, we slow down the request by applying a delay
- * to make up the difference. Note that we do not use any lock to coordinate
- * multiple threads for a system-wide "slowdown", but apply this delay on each
- * request individually and separately.
- *
- * Also note that since we basically use "busy-waiting", when the total number
- * of threads exceeds or is close to the total number of processors, the
- * simulated access time observed at the application level could be longer
- * than the specified access time due to high CPU usage. But for each request,
- * measured directly as the duration spent in the make_request() function, the
- * simulated access time is still very precise.
- *
- */
-static void pmbd_emul_access_time(uint64_t start, uint64_t end, int num_sectors, int rw, PMBD_DEVICE_T* pmbd)
-{
- /*
-	 * Access times of different requests can overlap, so there is no need
-	 * to use a lock to serialize them.
- * FIXME: should we apply this on each batch or each request?
- */
- uint64_t real_ns = cycle_to_ns(end - start);
- uint64_t emul_ns = cal_access_time(num_sectors, rw, pmbd);
-
- if (emul_ns > real_ns)
- pmbd_slowdown((emul_ns - real_ns), FALSE);
-
- return;
-}
-
-/*
- * set the starting hook for PM emulation
- *
- * @pmbd: pmbd device
- * @num_sectors: sectors being accessed
- * @rw: READ/WRITE
- * return value: the start cycle
- */
-static uint64_t emul_start(PMBD_DEVICE_T* pmbd, int num_sectors, int rw)
-{
- uint64_t start = 0;
- if (PMBD_DEV_USE_EMULATION(pmbd) && num_sectors > 0) {
- /* start timer here */
- TIMESTAMP(start);
- }
- return start;
-}
-
-/*
- * set the stopping hook for PM emulation
- *
- * @pmbd: pmbd device
- * @num_sectors: sectors being accessed
- * @rw: READ/WRITE
- * @start: the starting cycle
- * return value: the end cycle
- */
-static uint64_t emul_end(PMBD_DEVICE_T* pmbd, int num_sectors, int rw, uint64_t start)
-{
- uint64_t end = 0;
- uint64_t end2 = 0;
- /*
- * NOTE: emulation can be done in two ways - (1) directly specify the
- * read/write latencies and bandwidths (2) only specify a relative
- * slowdown ratio (X), compared to DRAM.
- *
- * Also note that if rdsx/wrsx is set, we will ignore
- * rdlat/wrlat/rdbw/wrbw.
- */
- if (PMBD_DEV_USE_EMULATION(pmbd) && num_sectors > 0) {
- /*
- * NOTE: we first attempt to meet the target bandwidth and then
- * latency. This means the actual bandwidth should be close
- * to the emulated bandwidth, and then we guarantee that the
- * latency would not be SMALLER than the target latency.
- */
-
- /* emulate the bandwidth first */
- if (pmbd->rdbw > 0 && pmbd->wrbw > 0) {
- /* emulate transfer time (bandwidth) */
- pmbd_emul_transfer_time(num_sectors, rw, pmbd);
- }
-
- /* emulate the latency now */
- TIMESTAMP(end);
- if (pmbd->rdlat > 0 || pmbd->wrlat > 0) {
- /* emulate access time (latency) */
- pmbd_emul_access_time(start, end, num_sectors, rw, pmbd);
- }
- }
- /* get the ending timestamp */
- TIMESTAMP(end2);
-
- return end2;
-}
-
-/*
- * *************************************************************************
- * PM space protection functions
- * - clflush
- * - write protection
- * - write verification
- * - checksum
- * *************************************************************************
- */
-
-/*
- * flush designated cache lines in CPU cache
- */
-
-static inline void pmbd_clflush_all(PMBD_DEVICE_T* pmbd)
-{
- uint64_t time_p1 = 0;
- uint64_t time_p2 = 0;
-
- TIMESTAMP(time_p1);
- if (cpu_has_clflush){
-#ifdef CONFIG_X86
- wbinvd_on_all_cpus();
-#else
- printk(KERN_WARNING "pmbd: WARNING - %s(%d) flush_cache_all() not implemented\n", __FUNCTION__, __LINE__);
-#endif
- }
- TIMESTAMP(time_p2);
-
- /* emulating slowdown */
- if(PMBD_DEV_USE_SLOWDOWN(pmbd))
- pmbd_rdwr_slowdown((pmbd), WRITE, time_p1, time_p2);
-
- /* update time statistics */
- if(PMBD_USE_TIMESTAT()){
- int cid = CUR_CPU_ID();
- PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
- pmbd_stat->cycles_clflushall[WRITE][cid] += time_p2 - time_p1;
- }
- return;
-}
-
-static inline void pmbd_clflush_range(PMBD_DEVICE_T* pmbd, void* dst, size_t bytes)
-{
- uint64_t time_p1 = 0;
- uint64_t time_p2 = 0;
-
- TIMESTAMP(time_p1);
- if (cpu_has_clflush){
- clflush_cache_range(dst, bytes);
- }
- TIMESTAMP(time_p2);
-
- /* emulating slowdown */
- if(PMBD_DEV_USE_SLOWDOWN(pmbd))
- pmbd_rdwr_slowdown((pmbd), WRITE, time_p1, time_p2);
-
- /* update time statistics */
- if(PMBD_USE_TIMESTAT()){
- int cid = CUR_CPU_ID();
- PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
- pmbd_stat->cycles_clflush[WRITE][cid] += time_p2 - time_p1;
- }
- return;
-}
-
-
-/*
- * Write-protection
- *
- * Being used as storage, PMBD needs to provide some protection against
- * accidental changes caused by wild pointers. So we initialize all the PM
- * pages as read-only; before performing write operations into PM space, we
- * set the pages writable, and once done, we set them back to read-only. This
- * introduces extra overhead; however, it is a realistic solution to the
- * wild-pointer problem.
- *
- */
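
A userspace analogue of this read-only-at-rest policy can be sketched with
mprotect(), which plays the role that set_memory_ro()/set_memory_rw() play in
the driver (a sketch only; the kernel-side calls operate on kernel page-table
attributes, not on an mmap()ed region):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 4096;
        char *pm = mmap(NULL, len, PROT_READ,      /* read-only at rest */
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (pm == MAP_FAILED)
            return 1;

        /* a stray write here would fault (SIGSEGV) instead of silently
         * corrupting "persistent" data */

        mprotect(pm, len, PROT_READ | PROT_WRITE); /* open the window   */
        memcpy(pm, "data", 4);                     /* the legal write   */
        mprotect(pm, len, PROT_READ);              /* close the window  */

        printf("%.4s\n", pm);
        return 0;
    }
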
-
-/*
- * set PM pages to read-only
- * @addr - the starting virtual address (PM space)
- * @bytes - the range in bytes
- * @on_access - whether this change is triggered by an incoming request
- *              (rather than by device creation/destruction)
- */
-
-static inline void pmbd_set_pages_ro(PMBD_DEVICE_T* pmbd, void* addr, uint64_t bytes, unsigned on_access)
-{
- if (PMBD_USE_WRITE_PROTECTION()) {
- /* FIXME: type conversion happens here */
- /* FIXME: add range and bytes check here?? - not so necessary */
- uint64_t time_p1 = 0;
- uint64_t time_p2 = 0;
- unsigned long offset = (unsigned long) addr;
- unsigned long vaddr = PAGE_TO_VADDR(VADDR_TO_PAGE(offset));
- int num_pages = VADDR_TO_PAGE(offset + bytes - 1) - VADDR_TO_PAGE(offset) + 1;
-
- if(!(VADDR_IN_PMBD_SPACE(pmbd, addr) && VADDR_IN_PMBD_SPACE(pmbd, addr + bytes-1)))
- printk(KERN_WARNING "pmbd: WARNING - %s(%d): PM space range exceeded (%lu : %d pages)\n",
- __FUNCTION__, __LINE__, vaddr, num_pages);
-
- TIMESTAMP(time_p1);
- set_memory_ro(vaddr, num_pages);
- TIMESTAMP(time_p2);
-
- /* update time statistics */
-// if(PMBD_USE_TIMESTAT() && on_access){
- if(PMBD_USE_TIMESTAT()){
- int cid = CUR_CPU_ID();
- PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
- pmbd_stat->cycles_setpages_ro[WRITE][cid] += time_p2 - time_p1;
- }
- }
- return;
-}
-
-static inline void pmbd_set_pages_rw(PMBD_DEVICE_T* pmbd, void* addr, uint64_t bytes, unsigned on_access)
-{
- if (PMBD_USE_WRITE_PROTECTION()) {
- uint64_t time_p1 = 0;
- uint64_t time_p2 = 0;
- unsigned long offset = (unsigned long) addr;
- unsigned long vaddr = PAGE_TO_VADDR(VADDR_TO_PAGE(offset));
- int num_pages = VADDR_TO_PAGE(offset + bytes - 1) - VADDR_TO_PAGE(offset) + 1;
-
- if(!(VADDR_IN_PMBD_SPACE(pmbd, addr) && VADDR_IN_PMBD_SPACE(pmbd, addr + bytes-1)))
- printk(KERN_WARNING "pmbd: WARNING - %s(%d): PM space range exceeded (%lu : %d pages)\n", __FUNCTION__, __LINE__, vaddr, num_pages);
-
- TIMESTAMP(time_p1);
- set_memory_rw(vaddr, num_pages);
- TIMESTAMP(time_p2);
-
- /* update time statistics */
-// if(PMBD_USE_TIMESTAT() && on_access){
- if(PMBD_USE_TIMESTAT()){
- int cid = CUR_CPU_ID();
- PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
- pmbd_stat->cycles_setpages_rw[WRITE][cid] += time_p2 - time_p1;
- }
- }
- return;
-}
-
-
-/*
- * Write verification (EXPERIMENTAL)
- *
- * Note: Even if we do write protection by setting the PM space read-only,
- * there is still a short vulnerable window when we write pages into PM space
- * - between the time the pages are set RW and the time they are set back to
- * RO. So we need to verify that no data has been changed during this window,
- * by reading the written data back and comparing it with the source data.
- *
- */
-
-
-static inline int pmbd_verify_wr_pages_pmap(PMBD_DEVICE_T* pmbd, void* pmbd_dummy_va, void* ram_va, size_t bytes)
-{
-
- unsigned long flags = 0;
-
- /*NOTE: we assume src is starting from 0 */
- uint64_t pa = (uint64_t) PMBD_PMAP_VA_TO_PA(pmbd_dummy_va);
-
- /* disable interrupt (FIXME: do we need to do this?)*/
- DISABLE_SAVE_IRQ(flags);
-
- /* do the real work */
- while(bytes){
- uint64_t pfn = (pa >> PAGE_SHIFT); // page frame number
- unsigned off = pa & (~PAGE_MASK); // offset in one page
- unsigned size = MIN_OF((PAGE_SIZE - off), bytes); // the size to copy
-
- /* map it */
- void * map = pmap_atomic_pfn(pfn, pmbd, WRITE);
- void * pmbd_va = map + off;
-
- /* do memcopy */
- if (memcmp(pmbd_va, ram_va, size)){
- punmap_atomic(map, pmbd, WRITE);
- goto bad;
- }
-
- /* unmap it */
- punmap_atomic(map, pmbd, WRITE);
-
- /* prepare the next iteration */
- ram_va += size;
- bytes -= size;
- pa += size;
- }
-
- /* re-enable interrupt */
- ENABLE_RESTORE_IRQ(flags);
- return 0;
-
-bad:
- ENABLE_RESTORE_IRQ(flags);
- return -1;
-}
-
-
-static inline int pmbd_verify_wr_pages_nopmap(PMBD_DEVICE_T* pmbd, void* pmbd_va, void* ram_va, size_t bytes)
-{
- if (memcmp(pmbd_va, ram_va, bytes))
- return -1;
- else
- return 0;
-}
-
-static inline int pmbd_verify_wr_pages(PMBD_DEVICE_T* pmbd, void* pmbd_va, void* ram_va, size_t bytes)
-{
- int rtn = 0;
- uint64_t time_p1, time_p2;
-
- TIMESTAT_POINT(time_p1);
-
- /* check it */
- if (PMBD_USE_PMAP())
- rtn = pmbd_verify_wr_pages_pmap(pmbd, pmbd_va, ram_va, bytes);
- else
- rtn = pmbd_verify_wr_pages_nopmap(pmbd, pmbd_va, ram_va, bytes);
-
- /* found mismatch */
- if (rtn < 0){
- panic("pmbd: *** writing into PM failed (error found) ***\n");
- return -1;
- }
-
- TIMESTAT_POINT(time_p2);
-
- /* timestamp */
- if(PMBD_USE_TIMESTAT()){
- int cid = CUR_CPU_ID();
- PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
- pmbd_stat->cycles_wrverify[WRITE][cid] += time_p2 - time_p1;
- }
-
- return 0;
-}
-
-/*
- * Checksum (EXPERIMENTAL)
- *
- * Note: With write protection and write verification, we can largely reduce
- * the risk of PM data corruption caused by wild in-kernel pointers; however,
- * it is still possible that some data gets corrupted (e.g. PM pages are
- * maliciously changed to writable). Thus, we provide another layer of
- * protection by checksumming the PM pages. When writing a page, we compute a
- * checksum and write it into memory; when reading a page, we compute its
- * checksum and compare it with the stored checksum. If a mismatch is found,
- * it indicates that either the PM data or the checksum has been corrupted.
- *
- * FIXME:
- * (1) checksum should be stored in PM space, currently we just store it in RAM.
- * (2) probably we should use the CPU cache to speed up and avoid reading the same
- * chunk of data again.
- * (3) currently we always allocate checksum space, whether we enable or disable it
- * in the module config options; may need to make it more efficient in the future.
- *
- */
-
-
-static int pmbd_checksum_space_alloc(PMBD_DEVICE_T* pmbd)
-{
- int err = 0;
-
- /* allocate checksum space */
-	pmbd->checksum_space = vmalloc(PMBD_CHECKSUM_TOTAL_NUM(pmbd) * sizeof(PMBD_CHECKSUM_T));
- if (pmbd->checksum_space){
- memset(pmbd->checksum_space, 0, (PMBD_CHECKSUM_TOTAL_NUM(pmbd) * sizeof(PMBD_CHECKSUM_T)));
- printk(KERN_INFO "pmbd(%d): checksum space is allocated\n", pmbd->pmbd_id);
- } else {
- err = -ENOMEM;
- }
-
- /* allocate checksum buffer space */
- pmbd->checksum_iomem_buf = vmalloc(pmbd->checksum_unit_size);
- if (pmbd->checksum_iomem_buf){
- memset(pmbd->checksum_iomem_buf, 0, pmbd->checksum_unit_size);
- printk(KERN_INFO "pmbd(%d): checksum iomem buffer space is allocated\n", pmbd->pmbd_id);
- } else {
- err = -ENOMEM;
- }
-
- return err;
-}
-
-static int pmbd_checksum_space_free(PMBD_DEVICE_T* pmbd)
-{
- if (pmbd->checksum_space) {
- vfree(pmbd->checksum_space);
- pmbd->checksum_space = NULL;
- printk(KERN_INFO "pmbd(%d): checksum space is freed\n", pmbd->pmbd_id);
- }
- if (pmbd->checksum_iomem_buf) {
- vfree(pmbd->checksum_iomem_buf);
- pmbd->checksum_iomem_buf = NULL;
- printk(KERN_INFO "pmbd(%d): checksum iomem buffer space is freed\n", pmbd->pmbd_id);
- }
- return 0;
-}
-
-
-/*
- * Derived from linux/lib/crc32.c GPL v2
- */
-static unsigned int crc32_my(unsigned char const *p, unsigned int len)
-{
- int i;
- unsigned int crc = 0;
- while (len--) {
- crc ^= *p++;
- for (i = 0; i < 8; i++)
- crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0);
- }
- return crc;
-}
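
The write/read protocol described above, restated as a standalone program
around the same CRC routine (the buffer contents are hypothetical; the driver
stores one checksum per checksum unit rather than per local buffer):

    #include <stdio.h>

    static unsigned int crc32_my(unsigned char const *p, unsigned int len)
    {
        int i;
        unsigned int crc = 0;
        while (len--) {
            crc ^= *p++;
            for (i = 0; i < 8; i++)
                crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0);
        }
        return crc;
    }

    int main(void)
    {
        unsigned char page[4096] = "payload";

        /* on write: compute and store the checksum */
        unsigned int stored = crc32_my(page, sizeof(page));

        page[0] ^= 0x01;                   /* simulate corruption */

        /* on read: recompute and compare with the stored value */
        if (crc32_my(page, sizeof(page)) != stored)
            printf("checksum mismatch found!\n");
        return 0;
    }
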
-
-static inline PMBD_CHECKSUM_T pmbd_checksum_func(void* data, size_t size)
-{
- return crc32_my(data, size);
-}
-
-/*
- * calculate the checksum for a checksum unit
- * @pmbd: the pmbd device
- * @data: the virtual address of the target data (must be aligned to the
- * checksum unit boundaries)
- */
-
-
-static inline PMBD_CHECKSUM_T pmbd_cal_checksum(PMBD_DEVICE_T* pmbd, void* data)
-{
- void* vaddr = data;
- size_t size = pmbd->checksum_unit_size;
- PMBD_CHECKSUM_T chk = 0;
-
-#if 0
-#ifndef CONFIG_X86
- /*
- * Note: If we are directly using vmalloc(), we don't have to copy it
- * to the checksum buffer; however, if we are using High Memory, we should not
- * directly dereference the ioremapped data (on non-x86 platform), so we have to
- * first copy it to a temporary buffer, this extra copy would significantly
- * slows down operations. We do this here is just to remove this extra copy on
- * x86 platform. (see kernel/Documents/IO-mapping.txt)
- *
- */
- if (PMBD_DEV_USE_HIGHMEM(pmbd) && VADDR_IN_PMBD_SPACE(pmbd, data)) {
- memcpy_fromio(pmbd->checksum_iomem_buf, data, pmbd->checksum_unit_size);
- vaddr = pmbd->checksum_iomem_buf;
- }
-#endif
-#endif
-
- if (pmbd->checksum_unit_size != PAGE_SIZE){
- panic("ERR: %s(%d) checksum unit size (%u) must be %lu\n", __FUNCTION__, __LINE__, pmbd->checksum_unit_size, PAGE_SIZE);
- return 0;
- }
-
- /* FIXME: do we really need to copy the data out first (if not pmap)*/
- memcpy_from_pmbd(pmbd, pmbd->checksum_iomem_buf, data, pmbd->checksum_unit_size);
-
- /* calculate the checksum */
- vaddr = pmbd->checksum_iomem_buf;
- chk = pmbd_checksum_func(vaddr, size);
-
- return chk;
-}
-
-static int pmbd_checksum_on_write(PMBD_DEVICE_T* pmbd, void* vaddr, size_t bytes)
-{
- unsigned long i;
- unsigned long ck_id_s = VADDR_TO_CHECKSUM_IDX(pmbd, vaddr);
- unsigned long ck_id_e = VADDR_TO_CHECKSUM_IDX(pmbd, (vaddr + bytes - 1));
-
- uint64_t time_p1, time_p2;
-
- TIMESTAT_POINT(time_p1);
-
- for (i = ck_id_s; i <= ck_id_e; i ++){
- void* data = CHECKSUM_IDX_TO_VADDR(pmbd, i);
- void* chk = CHECKSUM_IDX_TO_CKADDR(pmbd, i);
-
- PMBD_CHECKSUM_T checksum = pmbd_cal_checksum(pmbd, data);
- memcpy(chk, &checksum, sizeof(PMBD_CHECKSUM_T));
- }
-
- TIMESTAT_POINT(time_p2);
-
- /* timestamp */
- if(PMBD_USE_TIMESTAT()){
- int cid = CUR_CPU_ID();
- PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
- pmbd_stat->cycles_checksum[WRITE][cid] += time_p2 - time_p1;
- }
- return 0;
-}
-
-static int pmbd_checksum_on_read(PMBD_DEVICE_T* pmbd, void* vaddr, size_t bytes)
-{
- unsigned long i;
- unsigned long ck_id_s = VADDR_TO_CHECKSUM_IDX(pmbd, vaddr);
- unsigned long ck_id_e = VADDR_TO_CHECKSUM_IDX(pmbd, (vaddr + bytes - 1));
-
- uint64_t time_p1, time_p2;
- TIMESTAT_POINT(time_p1);
-
- for (i = ck_id_s; i <= ck_id_e; i ++){
- void* data = CHECKSUM_IDX_TO_VADDR(pmbd, i);
- void* chk = CHECKSUM_IDX_TO_CKADDR(pmbd, i);
-
- PMBD_CHECKSUM_T checksum = pmbd_cal_checksum(pmbd, data);
- if (memcmp(chk, &checksum, sizeof(PMBD_CHECKSUM_T))){
- printk(KERN_WARNING "pmbd(%d): checksum mismatch found!", pmbd->pmbd_id);
- }
- }
-
- TIMESTAT_POINT(time_p2);
-
- /* timestamp */
- if(PMBD_USE_TIMESTAT()){
- int cid = CUR_CPU_ID();
- PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
- pmbd_stat->cycles_checksum[READ][cid] += time_p2 - time_p1;
- }
-
- return 0;
-}
-
-#if 0
-/* WARN: Calculating checksum for a big PM space is slow and could lockup system*/
-static int pmbd_checksum_space_init(PMBD_DEVICE_T* pmbd)
-{
- unsigned long i;
- PMBD_CHECKSUM_T checksum = pmbd_cal_checksum(pmbd, pmbd->mem_space);
- unsigned long ck_s = VADDR_TO_CHECKSUM_IDX(pmbd, PMBD_MEM_SPACE_FIRST_BYTE(pmbd));
-	unsigned long ck_e = VADDR_TO_CHECKSUM_IDX(pmbd, PMBD_MEM_SPACE_LAST_BYTE(pmbd));
-
- for (i = ck_s; i <= ck_e; i ++){
- void* dst = CHECKSUM_IDX_TO_CKADDR(pmbd, i);
- memcpy(dst, &checksum, sizeof(PMBD_CHECKSUM_T));
- }
- return 0;
-}
-#endif
-
-/*
- * locks
- *
- * Note: We must prevent multiple threads from concurrently accessing the same
- * chunk of data. For example, if two writes access the same page, the PM page
- * could be corrupted with merged content from both. So we allocate one spinlock
- * for each 4KB PM page. When reading/writing PM data, we lock the related pages
- * and unlock them when done. (A worked example follows this comment.)
- *
- */
-
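-/*
- * Worked example (illustrative, not in the original source): with
- * pb_size = 4096, a 1024-byte write starting at sector 7 covers bytes
- * 3584..4607, i.e. PBN 0 through PBN 1, so both page locks are taken
- * before the copy and released afterwards.
- */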
-static int pmbd_lock_on_access(PMBD_DEVICE_T* pmbd, sector_t sector, size_t bytes)
-{
- if (PMBD_USE_LOCK()) {
- PBN_T pbn = 0;
- PBN_T pbn_s = SECTOR_TO_PBN(pmbd, sector);
- PBN_T pbn_e = BYTE_TO_PBN(pmbd, (SECTOR_TO_BYTE(sector) + bytes - 1));
-
- for (pbn = pbn_s; pbn <= pbn_e; pbn ++) {
- PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
- spin_lock(&pbi->lock);
- }
- }
- return 0;
-}
-
-static int pmbd_unlock_on_access(PMBD_DEVICE_T* pmbd, sector_t sector, size_t bytes)
-{
- if (PMBD_USE_LOCK()){
- PBN_T pbn = 0;
- PBN_T pbn_s = SECTOR_TO_PBN(pmbd, sector);
- PBN_T pbn_e = BYTE_TO_PBN(pmbd, (SECTOR_TO_BYTE(sector) + bytes - 1));
-
- for (pbn = pbn_s; pbn <= pbn_e; pbn ++) {
- PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn);
- spin_unlock(&pbi->lock);
- }
- }
- return 0;
-}
-
-/*
- **************************************************************************
- * Unbuffered Read/write functions
- **************************************************************************
- */
-static void copy_to_pmbd_unbuffered(PMBD_DEVICE_T* pmbd, void *src, sector_t sector, size_t bytes, unsigned do_fua)
-{
- void *dst;
-
- dst = pmbd->mem_space + sector * pmbd->sector_size;
-
- /* lock the pages */
- pmbd_lock_on_access(pmbd, sector, bytes);
-
- /* set the pages writable */
- /* if we use CR0/WP to temporarily switch the writable permission,
- * we don't have to change the PTE attributes directly */
- if (PMBD_DEV_USE_WPMODE_PTE(pmbd))
- pmbd_set_pages_rw(pmbd, dst, bytes, TRUE);
-
- /* do memcpy */
- memcpy_to_pmbd(pmbd, dst, src, bytes, do_fua);
-
- /* finish up */
- /* set the pages read-only */
- if (PMBD_DEV_USE_WPMODE_PTE(pmbd))
- pmbd_set_pages_ro(pmbd, dst, bytes, TRUE);
-
- /* verify that the write operation succeeded */
- if(PMBD_USE_WRITE_VERIFICATION())
- pmbd_verify_wr_pages(pmbd, dst, src, bytes);
-
- /* generate check sum */
- if (PMBD_USE_CHECKSUM())
- pmbd_checksum_on_write(pmbd, dst, bytes);
-
- /* unlock the pages */
- pmbd_unlock_on_access(pmbd, sector, bytes);
-
- return;
-}
-
-
-static void copy_from_pmbd_unbuffered(PMBD_DEVICE_T* pmbd, void *dst, sector_t sector, size_t bytes)
-{
- void *src = pmbd->mem_space + sector * pmbd->sector_size;
-
- /* lock the pages */
- pmbd_lock_on_access(pmbd, sector, bytes);
-
- /* check checksum first */
- if (PMBD_USE_CHECKSUM())
- pmbd_checksum_on_read(pmbd, src, bytes);
-
- /* read it out*/
- memcpy_from_pmbd(pmbd, dst, src, bytes);
-
- /* unlock the pages */
- pmbd_unlock_on_access(pmbd, sector, bytes);
-
- return;
-}
-
-
-/*
- * *************************************************************************
- * Read/write functions
- * *************************************************************************
- */
-
-static void copy_to_pmbd(PMBD_DEVICE_T* pmbd, void *dst, sector_t sector, size_t bytes, unsigned do_fua)
-{
- if (PMBD_DEV_USE_BUFFER(pmbd)){
- copy_to_pmbd_buffered(pmbd, dst, sector, bytes);
- if (do_fua){
-			/* NOTE:
-			 * When we use FUA and the buffer is enabled, we
-			 * still write into the buffer first, but then we
-			 * also write directly into the PM space without using
-			 * the buffer again. This is suboptimal (we write the
-			 * data twice); however, it is better than changing
-			 * the buffering code.
-			 */
- copy_to_pmbd_unbuffered(pmbd, dst, sector, bytes, do_fua);
- }
- }else
- copy_to_pmbd_unbuffered(pmbd, dst, sector, bytes, do_fua);
- return;
-}
-
-static void copy_from_pmbd(PMBD_DEVICE_T* pmbd, void *dst, sector_t sector, size_t bytes)
-{
- if (PMBD_DEV_USE_BUFFER(pmbd))
- copy_from_pmbd_buffered(pmbd, dst, sector, bytes);
- else
- copy_from_pmbd_unbuffered(pmbd, dst, sector, bytes);
- return;
-}
-
-static int pmbd_seg_read_write(PMBD_DEVICE_T* pmbd, struct page *page, unsigned int len,
- unsigned int off, int rw, sector_t sector, unsigned do_fua)
-{
- void *mem;
- int err = 0;
-
- mem = kmap_atomic(page);
-
- if (rw == READ) {
- copy_from_pmbd(pmbd, mem + off, sector, len);
- flush_dcache_page(page);
- } else {
- flush_dcache_page(page);
- copy_to_pmbd(pmbd, mem + off, sector, len, do_fua);
- }
-
- kunmap_atomic(mem);
-
- return err;
-}
-
-static int pmbd_do_bvec(PMBD_DEVICE_T* pmbd, struct page *page,
- unsigned int len, unsigned int off, int rw, sector_t sector, unsigned do_fua)
-{
- return pmbd_seg_read_write(pmbd, page, len, off, rw, sector, do_fua);
-}
-
-/*
- * Handling write barrier
- * @pmbd: the pmbd device
- *
- * When the application calls fsync(), a bio labeled with WRITE_BARRIER is
- * received by pmbd_make_request(). We stop accepting new incoming
- * writes (by locking pmbd->wr_barrier_lock) and wait for the on-the-fly writes
- * to complete (by checking pmbd->num_flying_wr); then, if the buffer is used,
- * we flush the entire DRAM buffer with clflush enabled. If we do not use the
- * buffer, we flush the CPU cache so that all data is securely written into PM.
- *
- */
-
-
-static void __x86_mfence_all(void *arg)
-{
- unsigned long cache = (unsigned long)arg;
- if (cache && boot_cpu_data.x86 >= 4)
- mfence();
-}
-
-static void x86_mfence_all(unsigned long cache)
-{
- BUG_ON(irqs_disabled());
- on_each_cpu(__x86_mfence_all, (void*) cache, 1);
-}
-
-static inline void pmbd_mfence_all(PMBD_DEVICE_T* pmbd)
-{
- x86_mfence_all(1);
-}
-
-
-static void __x86_sfence_all(void *arg)
-{
- unsigned long cache = (unsigned long)arg;
- if (cache && boot_cpu_data.x86 >= 4)
- sfence();
-}
-
-static void x86_sfence_all(unsigned long cache)
-{
- BUG_ON(irqs_disabled());
- on_each_cpu(__x86_sfence_all, (void*) cache, 1);
-
-}
-
-static inline void pmbd_sfence_all(PMBD_DEVICE_T* pmbd)
-{
- x86_sfence_all(1);
-}
-
-static int pmbd_write_barrier(PMBD_DEVICE_T* pmbd)
-{
- unsigned i;
-
- /* blocking incoming writes */
- spin_lock(&pmbd->wr_barrier_lock);
-
- /* wait for all on-the-fly writes to finish first */
-	while (atomic_read(&pmbd->num_flying_wr) != 0)
-		cpu_relax();
-
- if (PMBD_DEV_USE_BUFFER(pmbd)){
- /* if buffer is used, flush the entire buffer */
- for (i = 0; i < pmbd->num_buffers; i ++){
- PMBD_BUFFER_T* buffer = pmbd->buffers[i];
- pmbd_buffer_check_and_flush(buffer, buffer->num_blocks, CALLER_DESTROYER);
- }
- }
-
-	/*
-	 * Considering the following cases:
-	 * UC (uncachable): strong ordering, we do nothing
-	 * UC-Minus: strong ordering (may be overridden by WC); sfence is already used, so we do nothing
-	 * WC (write-combining): sfence is used after each write, so we do nothing
-	 * WB (write-back): non-temporal store: sfence is used, do nothing
-	 *                  clflush/mfence: mfence is used in clflush_cache_range(), do nothing
-	 *                  neither: wbinvd is needed to drop the entire cache
-	 */
- if (PMBD_CPU_CACHE_USE_WB()){
- if (PMBD_USE_NTS()){
- /* sfence is used after each movntq, so it is safe, we
- * do nothing, just stop accepting any incoming requests */
- } else if (PMBD_USE_CLFLUSH()) {
-			/* if using clflush/mfence to sync I/O, we do nothing */
-// pmbd_mfence_all(pmbd);
- } else {
- /* if no sync operations, we have to drop the entire cache */
- pmbd_clflush_all(pmbd);
- }
- } else if (PMBD_CPU_CACHE_USE_WC() || PMBD_CPU_CACHE_USE_UM()) {
-		/* if using WC, sfence should have been used already, so do nothing */
-
- } else if (PMBD_CPU_CACHE_USE_UC()) {
- /* strong ordering is used, no need to do anything else*/
- } else {
- panic("%s(%d): something is wrong\n", __FUNCTION__, __LINE__);
- }
-
- /* unblock incoming writes */
- spin_unlock(&pmbd->wr_barrier_lock);
- return 0;
-}
-
-
-// #define BIO_WR_BARRIER(BIO) (((BIO)->bi_rw & REQ_FLUSH) == REQ_FLUSH)
-// #define BIO_WR_BARRIER(BIO) ((BIO)->bi_rw & (REQ_FLUSH | REQ_FLUSH_SEQ))
- #define BIO_WR_BARRIER(BIO) (((BIO)->bi_rw & WRITE_FLUSH) == WRITE_FLUSH)
- #define BIO_WR_FUA(BIO) (((BIO)->bi_rw & WRITE_FUA) == WRITE_FUA)
- #define BIO_WR_SYNC(BIO) (((BIO)->bi_rw & WRITE_SYNC) == WRITE_SYNC)
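-
-/*
- * Example (illustrative): an fsync()-driven flush typically arrives as an
- * empty bio with the flush bits set in bi_rw, so BIO_WR_BARRIER(bio) is true
- * while bio_sectors(bio) == 0 (see the NOTE in pmbd_make_request() below).
- */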
-
-static void pmbd_make_request(struct request_queue *q, struct bio *bio)
-{
- int i = 0;
- int err = -EIO;
- uint64_t start = 0;
- uint64_t end = 0;
- struct bio_vec *bvec;
- int rw = bio_rw(bio);
- sector_t sector = bio->bi_sector;
- int num_sectors = bio_sectors(bio);
- struct block_device *bdev = bio->bi_bdev;
- PMBD_DEVICE_T *pmbd = bdev->bd_disk->private_data;
- PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
- unsigned bio_is_write_fua = FALSE;
- unsigned bio_is_write_barrier = FALSE;
- unsigned do_fua = FALSE;
- uint64_t time_p1, time_p2, time_p3, time_p4, time_p5, time_p6;
- time_p1 = time_p2 = time_p3 = time_p4 = time_p5 = time_p6 = 0;
-
-
- TIMESTAT_POINT(time_p1);
-// printk("ACCESS: %u %d %X %d\n", sector, num_sectors, bio->bi_rw, rw);
-
- /* update rw */
- if (rw == READA)
- rw = READ;
- if (rw != READ && rw != WRITE)
-		panic("pmbd: %s(%d) found a request that is neither read nor write\n", __FUNCTION__, __LINE__);
-
-	/* handle write barrier (we no longer do this for BIO_WR_SYNC(bio)) */
- if (BIO_WR_BARRIER(bio)){
- /*
- * Note: Linux kernel 2.6.37 and later use file systems and FUA
- * to ensure data reliability, rather than write barriers.
- * See http://monolight.cc/2011/06/barriers-caches-filesystems
- */
- bio_is_write_barrier = TRUE;
-// printk(KERN_INFO "pmbd: received barrier request %u %d %lx %d\n", (unsigned int) sector, num_sectors, bio->bi_rw, rw);
-
- if (PMBD_USE_WB())
- pmbd_write_barrier(pmbd);
- }
-
- if (BIO_WR_FUA(bio)){
- bio_is_write_fua = TRUE;
-// printk(KERN_INFO "pmbd: received FUA request %u %d %lx %d\n", (unsigned int) sector, num_sectors, bio->bi_rw, rw);
-
- if (PMBD_USE_FUA())
- do_fua = TRUE;
- }
-
- TIMESTAT_POINT(time_p2);
-
- /* blocking write until write barrier is done */
- if (rw == WRITE){
- spin_lock(&pmbd->wr_barrier_lock);
- spin_unlock(&pmbd->wr_barrier_lock);
- }
-
- /* increment on-the-fly writes counter */
- atomic_inc(&pmbd->num_flying_wr);
-
- /* starting emulation */
- if (PMBD_DEV_SIM_DEV(pmbd))
- start = emul_start(pmbd, num_sectors, rw);
-
- /* check if out of range */
- if (sector + (bio->bi_size >> SECTOR_SHIFT) > get_capacity(bdev->bd_disk)){
- printk(KERN_WARNING "pmbd: request exceeds the PMBD capacity\n");
- TIMESTAT_POINT(time_p3);
- goto out;
- }
-
-// printk("DEBUG: ACCESS %lu %d %d\n", sector, num_sectors, rw);
-
- /*
-	 * NOTE: some applications (e.g. fdisk) call fsync() to request
-	 * flushing dirty data from the buffer cache. By default, fsync() is
-	 * linked to blkdev_fsync() in the def_blk_fops structure, and
-	 * blkdev_fsync() calls blkdev_issue_flush(), which generates an
-	 * empty bio carrying a write barrier down to the block device through
-	 * generic_make_request(), which in turn calls pmbd_make_request(). If
-	 * we didn't set err=0 here, the error code would propagate back up
-	 * to the application. For example, fdisk would fail and report an
-	 * error when trying to write the partition table before it exits. Thus
-	 * we must reset the error code here if the bio is empty. Also note that
-	 * we directly check the bio size, rather than using BIO_WR_BARRIER(),
-	 * to handle other cases.
- *
- */
- if (num_sectors == 0) {
- err = 0;
- TIMESTAT_POINT(time_p3);
- goto out;
- }
-
- /* update the access time*/
- PMBD_DEV_UPDATE_ACCESS_TIME(pmbd);
-
- TIMESTAT_POINT(time_p3);
-
-	/*
-	 * Do the read/write now. We first perform the operation and measure how
-	 * long it actually takes, then calculate an emulated time for the given
-	 * slow-down model; if the actual access time is less than the emulated
-	 * time, we make up the difference to emulate a slower device.
-	 */
- bio_for_each_segment(bvec, bio, i) {
- unsigned int len = bvec->bv_len;
- err = pmbd_do_bvec(pmbd, bvec->bv_page, len,
- bvec->bv_offset, rw, sector, do_fua);
- if (err)
- break;
- sector += len >> SECTOR_SHIFT;
- }
-
-out:
- TIMESTAT_POINT(time_p4);
-
- bio_endio(bio, err);
-
- TIMESTAT_POINT(time_p5);
-
- /* ending emulation (simmode0)*/
- if (PMBD_DEV_SIM_DEV(pmbd))
- end = emul_end(pmbd, num_sectors, rw, start);
-
- /* decrement on-the-fly writes counter */
- atomic_dec(&pmbd->num_flying_wr);
-
- TIMESTAT_POINT(time_p6);
-
- /* update statistics data */
- spin_lock(&pmbd_stat->stat_lock);
- if (rw == READ) {
- pmbd_stat->num_requests_read ++;
- pmbd_stat->num_sectors_read += num_sectors;
- } else {
- pmbd_stat->num_requests_write ++;
- pmbd_stat->num_sectors_write += num_sectors;
- }
- if (bio_is_write_barrier)
- pmbd_stat->num_write_barrier ++;
- if (bio_is_write_fua)
- pmbd_stat->num_write_fua ++;
- spin_unlock(&pmbd_stat->stat_lock);
-
- /* cycles */
- if (PMBD_USE_TIMESTAT()){
- int cid = CUR_CPU_ID();
- pmbd_stat->cycles_total[rw][cid] += time_p6 - time_p1;
- pmbd_stat->cycles_wb[rw][cid] += time_p2 - time_p1; /* write barrier */
- pmbd_stat->cycles_prepare[rw][cid] += time_p3 - time_p2;
- pmbd_stat->cycles_work[rw][cid] += time_p4 - time_p3;
- pmbd_stat->cycles_endio[rw][cid] += time_p5 - time_p4;
- pmbd_stat->cycles_finish[rw][cid] += time_p6 - time_p5;
- }
-}
-
-
-/*
- **************************************************************************
- * Allocating memory space for PMBD device
- **************************************************************************
- */
-
-/*
- * Set the page attributes for the PMBD backstore memory space
- * - WB: cache enabled, write back (default)
- * - WC: cache disabled, write through, speculative writes combined
- * - UC: cache disabled, write through, no write combined
- * - UC-Minus: the same as UC
- *
- * REF:
- * - http://www.kernel.org/doc/ols/2008/ols2008v2-pages-135-144.pdf
- * - http://www.mjmwired.net/kernel/Documentation/x86/pat.txt
- */
-
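-/*
- * A minimal sketch (an assumption; set_pages_cache_flags() is defined
- * elsewhere in this file and is assumed to dispatch on g_pmbd_cpu_cache_flag):
- * on x86, the set_memory_*() API rewrites the PTE cache attributes of a
- * mapped range, e.g. for num_pages pages starting at virtual address vaddr:
- *
- *	#include <asm/cacheflush.h>
- *	set_memory_wc(vaddr, num_pages);	// WC: write-combining
- *	set_memory_uc(vaddr, num_pages);	// UC: uncachable
- *	set_memory_wb(vaddr, num_pages);	// WB: write-back (the default)
- */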
-static int pmbd_set_pages_cache_flags(PMBD_DEVICE_T* pmbd)
-{
- if (pmbd->mem_space && pmbd->num_sectors) {
-		/* NOTE: we convert it here with no problem on a 64-bit system */
- unsigned long vaddr = (unsigned long) pmbd->mem_space;
- int num_pages = PMBD_MEM_TOTAL_PAGES(pmbd);
-
- printk(KERN_INFO "pmbd: setting %s PTE flags (%lx:%d)\n", pmbd->pmbd_name, vaddr, num_pages);
- set_pages_cache_flags(vaddr, num_pages);
- printk(KERN_INFO "pmbd: setting %s PTE flags done.\n", pmbd->pmbd_name);
- }
- return 0;
-}
-
-static int pmbd_reset_pages_cache_flags(PMBD_DEVICE_T* pmbd)
-{
- if (pmbd->mem_space){
- unsigned long vaddr = (unsigned long) pmbd->mem_space;
- int num_pages = PMBD_MEM_TOTAL_PAGES(pmbd);
- set_memory_wb(vaddr, num_pages);
- printk(KERN_INFO "pmbd: %s pages cache flags are reset to WB\n", pmbd->pmbd_name);
- }
- return 0;
-}
-
-
-/*
- * Allocate/free memory backstore space for PMBD devices
- */
-static int pmbd_mem_space_alloc (PMBD_DEVICE_T* pmbd)
-{
- int err = 0;
-
- /* allocate PM memory space */
- if (PMBD_DEV_USE_VMALLOC(pmbd)){
- pmbd->mem_space = vmalloc (PMBD_MEM_TOTAL_BYTES(pmbd));
- } else if (PMBD_DEV_USE_HIGHMEM(pmbd)){
- pmbd->mem_space = hmalloc (PMBD_MEM_TOTAL_BYTES(pmbd));
- }
-
- if (pmbd->mem_space) {
-#if 0
-		/* FIXME: No need to do this. It's slow and the system could lock up */
- memset(pmbd->mem_space, 0, pmbd->sectors * pmbd->sector_size);
-#endif
- printk(KERN_INFO "pmbd: /dev/%s is created [%lu : %llu MBs]\n",
- pmbd->pmbd_name, (unsigned long) pmbd->mem_space, SECTORS_TO_MB(pmbd->num_sectors));
- } else {
- printk(KERN_ERR "pmbd: %s(%d): PMBD space allocation failed\n", __FUNCTION__, __LINE__);
- err = -ENOMEM;
- }
- return err;
-}
-
-static int pmbd_mem_space_free(PMBD_DEVICE_T* pmbd)
-{
- /* free it up */
- if (pmbd->mem_space) {
- if (PMBD_DEV_USE_VMALLOC(pmbd))
- vfree(pmbd->mem_space);
- else if (PMBD_DEV_USE_HIGHMEM(pmbd)) {
- hfree(pmbd->mem_space);
- }
- pmbd->mem_space = NULL;
- }
- return 0;
-}
-
-/* pmbd->pmbd_stat */
-static int pmbd_stat_alloc(PMBD_DEVICE_T* pmbd)
-{
- int err = 0;
- pmbd->pmbd_stat = (PMBD_STAT_T*)kzalloc(sizeof(PMBD_STAT_T), GFP_KERNEL);
- if (pmbd->pmbd_stat){
- spin_lock_init(&pmbd->pmbd_stat->stat_lock);
- } else {
-		printk(KERN_ERR "pmbd: %s(%d): PMBD statistics allocation failed\n", __FUNCTION__, __LINE__);
-		err = -ENOMEM;
-	}
-	return err;
-}
-
-static int pmbd_stat_free(PMBD_DEVICE_T* pmbd)
-{
- if(pmbd->pmbd_stat) {
- kfree(pmbd->pmbd_stat);
- pmbd->pmbd_stat = NULL;
- }
- return 0;
-}
-
-/* /proc/pmbd/<dev> */
-static int pmbd_proc_pmbdstat_read(char* buffer, char** start, off_t offset, int count, int* eof, void* data)
-{
- int rtn;
- if (offset > 0) {
- *eof = 1;
- rtn = 0;
- } else {
- //char local_buffer[1024];
-		char* local_buffer = kzalloc(8192, GFP_KERNEL);
-		PMBD_DEVICE_T* pmbd, *next;
-		char rdwr_name[2][16] = {"read", "write"};
-
-		if (!local_buffer)
-			return 0;
-		local_buffer[0] = '\0';
-
- list_for_each_entry_safe(pmbd, next, &pmbd_devices, pmbd_list) {
- unsigned i, j;
- BBN_T num_dirty = 0;
- BBN_T num_blocks = 0;
- PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat;
-
- /* FIXME: should we lock the buffer? (NOT NECESSARY)*/
- for (i = 0; i < pmbd->num_buffers; i ++){
- num_blocks += pmbd->buffers[i]->num_blocks;
- num_dirty += pmbd->buffers[i]->num_dirty;
- }
-
- /* print stuff now */
- spin_lock(&pmbd->pmbd_stat->stat_lock);
-
- sprintf(local_buffer+strlen(local_buffer), "num_dirty_blocks[%s] %u\n", pmbd->pmbd_name, (unsigned int) num_dirty);
- sprintf(local_buffer+strlen(local_buffer), "num_clean_blocks[%s] %u\n", pmbd->pmbd_name, (unsigned int) (num_blocks - num_dirty));
- sprintf(local_buffer+strlen(local_buffer), "num_sectors_read[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_sectors_read);
- sprintf(local_buffer+strlen(local_buffer), "num_sectors_write[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_sectors_write);
- sprintf(local_buffer+strlen(local_buffer), "num_requests_read[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_requests_read);
- sprintf(local_buffer+strlen(local_buffer), "num_requests_write[%s] %llu\n",pmbd->pmbd_name, pmbd_stat->num_requests_write);
- sprintf(local_buffer+strlen(local_buffer), "num_write_barrier[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_write_barrier);
- sprintf(local_buffer+strlen(local_buffer), "num_write_fua[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_write_fua);
-
- spin_unlock(&pmbd->pmbd_stat->stat_lock);
-
-// sprintf(local_buffer+strlen(local_buffer), "\n");
-
- for (j = 0; j <= 1; j ++){
- int k=0;
-
- unsigned long long cycles_total = 0;
- unsigned long long cycles_prepare = 0;
- unsigned long long cycles_wb = 0;
- unsigned long long cycles_work = 0;
- unsigned long long cycles_endio = 0;
- unsigned long long cycles_finish = 0;
-
- unsigned long long cycles_pmap = 0;
- unsigned long long cycles_punmap = 0;
- unsigned long long cycles_memcpy = 0;
- unsigned long long cycles_clflush = 0;
- unsigned long long cycles_clflushall = 0;
- unsigned long long cycles_wrverify = 0;
- unsigned long long cycles_checksum = 0;
- unsigned long long cycles_pause = 0;
- unsigned long long cycles_slowdown = 0;
- unsigned long long cycles_setpages_ro = 0;
- unsigned long long cycles_setpages_rw = 0;
-
- for (k = 0; k < PMBD_MAX_NUM_CPUS; k ++){
- cycles_total += pmbd_stat->cycles_total[j][k];
- cycles_prepare += pmbd_stat->cycles_prepare[j][k];
- cycles_wb += pmbd_stat->cycles_wb[j][k];
- cycles_work += pmbd_stat->cycles_work[j][k];
- cycles_endio += pmbd_stat->cycles_endio[j][k];
- cycles_finish += pmbd_stat->cycles_finish[j][k];
-
- cycles_pmap += pmbd_stat->cycles_pmap[j][k];
- cycles_punmap += pmbd_stat->cycles_punmap[j][k];
- cycles_memcpy += pmbd_stat->cycles_memcpy[j][k];
- cycles_clflush += pmbd_stat->cycles_clflush[j][k];
- cycles_clflushall+=pmbd_stat->cycles_clflushall[j][k];
- cycles_wrverify += pmbd_stat->cycles_wrverify[j][k];
- cycles_checksum += pmbd_stat->cycles_checksum[j][k];
- cycles_pause += pmbd_stat->cycles_pause[j][k];
- cycles_slowdown += pmbd_stat->cycles_slowdown[j][k];
- cycles_setpages_ro+= pmbd_stat->cycles_setpages_ro[j][k];
- cycles_setpages_rw+= pmbd_stat->cycles_setpages_rw[j][k];
- }
-
- sprintf(local_buffer+strlen(local_buffer), "cycles_total_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_total);
- sprintf(local_buffer+strlen(local_buffer), "cycles_prepare_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_prepare);
- sprintf(local_buffer+strlen(local_buffer), "cycles_wb_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_wb);
- sprintf(local_buffer+strlen(local_buffer), "cycles_work_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_work);
- sprintf(local_buffer+strlen(local_buffer), "cycles_endio_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_endio);
- sprintf(local_buffer+strlen(local_buffer), "cycles_finish_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_finish);
- sprintf(local_buffer+strlen(local_buffer), "cycles_pmap_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_pmap);
- sprintf(local_buffer+strlen(local_buffer), "cycles_punmap_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_punmap);
- sprintf(local_buffer+strlen(local_buffer), "cycles_memcpy_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_memcpy);
- sprintf(local_buffer+strlen(local_buffer), "cycles_clflush_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_clflush);
- sprintf(local_buffer+strlen(local_buffer), "cycles_clflushall_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_clflushall);
- sprintf(local_buffer+strlen(local_buffer), "cycles_wrverify_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_wrverify);
- sprintf(local_buffer+strlen(local_buffer), "cycles_checksum_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_checksum);
- sprintf(local_buffer+strlen(local_buffer), "cycles_pause_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_pause);
- sprintf(local_buffer+strlen(local_buffer), "cycles_slowdown_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_slowdown);
- sprintf(local_buffer+strlen(local_buffer), "cycles_setpages_ro_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_setpages_ro);
- sprintf(local_buffer+strlen(local_buffer), "cycles_setpages_rw_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_setpages_rw);
- }
-
-#if 0
- /* print something temporary for debugging purpose */
- if (0) {
- spin_lock(&pmbd->tmp_lock);
- printk("%llu %lu\n", pmbd->tmp_data, pmbd->tmp_num);
- spin_unlock(&pmbd->tmp_lock);
- }
-#endif
- }
-
- memcpy(buffer, local_buffer, strlen(local_buffer));
- rtn = strlen(local_buffer);
- kfree(local_buffer);
- }
- return rtn;
-}
-
-/* /proc/pmbdcfg */
-static int pmbd_proc_pmbdcfg_read(char* buffer, char** start, off_t offset, int count, int* eof, void* data)
-{
- int rtn;
- if (offset > 0) {
- *eof = 1;
- rtn = 0;
- } else {
-		char* local_buffer = kzalloc(8192, GFP_KERNEL);
-		PMBD_DEVICE_T* pmbd, *next;
-
-		if (!local_buffer)
-			return 0;
-		local_buffer[0] = '\0';
-
- /* global configurations */
- sprintf(local_buffer+strlen(local_buffer), "MODULE OPTIONS: %s\n", mode);
- sprintf(local_buffer+strlen(local_buffer), "\n");
-
- sprintf(local_buffer+strlen(local_buffer), "max_part %d\n", max_part);
- sprintf(local_buffer+strlen(local_buffer), "part_shift %d\n", part_shift);
-
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_type %u\n", g_pmbd_type);
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_mergeable %u\n", g_pmbd_mergeable);
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_cpu_cache_clflush %u\n", g_pmbd_cpu_cache_clflush);
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_cpu_cache_flag %lu\n", g_pmbd_cpu_cache_flag);
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_wr_protect %u\n", g_pmbd_wr_protect);
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_wr_verify %u\n", g_pmbd_wr_verify);
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_checksum %u\n", g_pmbd_checksum);
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_lock %u\n", g_pmbd_lock);
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_subpage_update %u\n", g_pmbd_subpage_update);
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_pmap %u\n", g_pmbd_pmap);
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_nts %u\n", g_pmbd_nts);
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_ntl %u\n", g_pmbd_ntl);
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_wb %u\n", g_pmbd_wb);
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_fua %u\n", g_pmbd_fua);
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_timestat %u\n", g_pmbd_timestat);
- sprintf(local_buffer+strlen(local_buffer), "g_highmem_size %lu\n", g_highmem_size);
- sprintf(local_buffer+strlen(local_buffer), "g_highmem_phys_addr %llu\n", (unsigned long long) g_highmem_phys_addr);
- sprintf(local_buffer+strlen(local_buffer), "g_highmem_virt_addr %llu\n", (unsigned long long) g_highmem_virt_addr);
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_nr %u\n", g_pmbd_nr);
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_adjust_ns %llu\n", g_pmbd_adjust_ns);
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_num_buffers %llu\n", g_pmbd_num_buffers);
- sprintf(local_buffer+strlen(local_buffer), "g_pmbd_buffer_stride %llu\n", g_pmbd_buffer_stride);
- sprintf(local_buffer+strlen(local_buffer), "\n");
-
- /* device specific configurations */
- list_for_each_entry_safe(pmbd, next, &pmbd_devices, pmbd_list) {
- int i = 0;
-
- sprintf(local_buffer+strlen(local_buffer), "pmbd_id[%s] %d\n", pmbd->pmbd_name, pmbd->pmbd_id);
- sprintf(local_buffer+strlen(local_buffer), "num_sectors[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->num_sectors);
- sprintf(local_buffer+strlen(local_buffer), "sector_size[%s] %u\n", pmbd->pmbd_name, pmbd->sector_size);
- sprintf(local_buffer+strlen(local_buffer), "pmbd_type[%s] %u\n", pmbd->pmbd_name, pmbd->pmbd_type);
- sprintf(local_buffer+strlen(local_buffer), "rammode[%s] %u\n", pmbd->pmbd_name, pmbd->rammode);
- sprintf(local_buffer+strlen(local_buffer), "bufmode[%s] %u\n", pmbd->pmbd_name, pmbd->bufmode);
- sprintf(local_buffer+strlen(local_buffer), "wpmode[%s] %u\n", pmbd->pmbd_name, pmbd->wpmode);
- sprintf(local_buffer+strlen(local_buffer), "num_buffers[%s] %u\n", pmbd->pmbd_name, pmbd->num_buffers);
- sprintf(local_buffer+strlen(local_buffer), "buffer_stride[%s] %u\n", pmbd->pmbd_name, pmbd->buffer_stride);
- sprintf(local_buffer+strlen(local_buffer), "pb_size[%s] %u\n", pmbd->pmbd_name, pmbd->pb_size);
- sprintf(local_buffer+strlen(local_buffer), "checksum_unit_size[%s] %u\n", pmbd->pmbd_name, pmbd->checksum_unit_size);
- sprintf(local_buffer+strlen(local_buffer), "simmode[%s] %u\n", pmbd->pmbd_name, pmbd->simmode);
- sprintf(local_buffer+strlen(local_buffer), "rdlat[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->rdlat);
- sprintf(local_buffer+strlen(local_buffer), "wrlat[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->wrlat);
- sprintf(local_buffer+strlen(local_buffer), "rdbw[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->rdbw);
- sprintf(local_buffer+strlen(local_buffer), "wrbw[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->wrbw);
- sprintf(local_buffer+strlen(local_buffer), "rdsx[%s] %u\n", pmbd->pmbd_name, pmbd->rdsx);
- sprintf(local_buffer+strlen(local_buffer), "wrsx[%s] %u\n", pmbd->pmbd_name, pmbd->wrsx);
- sprintf(local_buffer+strlen(local_buffer), "rdpause[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->rdpause);
- sprintf(local_buffer+strlen(local_buffer), "wrpause[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->wrpause);
-
- for (i = 0; i < pmbd->num_buffers; i ++){
- PMBD_BUFFER_T* buffer = pmbd->buffers[i];
- sprintf(local_buffer+strlen(local_buffer), "buffer%d[%s]buffer_id %u\n", i, pmbd->pmbd_name, buffer->buffer_id);
- sprintf(local_buffer+strlen(local_buffer), "buffer%d[%s]num_blocks %lu\n", i, pmbd->pmbd_name, (unsigned long) buffer->num_blocks);
- sprintf(local_buffer+strlen(local_buffer), "buffer%d[%s]batch_size %lu\n", i, pmbd->pmbd_name, (unsigned long) buffer->batch_size);
- }
-
- }
-
- memcpy(buffer, local_buffer, strlen(local_buffer));
- rtn = strlen(local_buffer);
- kfree(local_buffer);
- }
- return rtn;
-}
-
-
-
-static int pmbd_proc_devstat_read(char* buffer, char** start, off_t offset, int count, int* eof, void* data)
-{
- int rtn;
- char local_buffer[1024];
- if (offset > 0) {
- *eof = 1;
- rtn = 0;
- } else {
- sprintf(local_buffer, "N/A\n");
- memcpy(buffer, local_buffer, strlen(local_buffer));
- rtn = strlen(local_buffer);
- }
- return rtn;
-}
-
-static int pmbd_proc_devstat_create(PMBD_DEVICE_T* pmbd)
-{
- /* create a /proc/pmbd/<dev> entry */
- pmbd->proc_devstat = create_proc_entry(pmbd->pmbd_name, S_IRUGO, proc_pmbd);
- if (pmbd->proc_devstat == NULL) {
- remove_proc_entry(pmbd->pmbd_name, proc_pmbd);
- printk(KERN_ERR "pmbd: cannot create /proc/pmbd/%s\n", pmbd->pmbd_name);
- return -ENOMEM;
- }
- pmbd->proc_devstat->read_proc = pmbd_proc_devstat_read;
- printk(KERN_INFO "pmbd: /proc/pmbd/%s created\n", pmbd->pmbd_name);
-
- return 0;
-}
-
-static int pmbd_proc_devstat_destroy(PMBD_DEVICE_T* pmbd)
-{
- remove_proc_entry(pmbd->pmbd_name, proc_pmbd);
- printk(KERN_INFO "pmbd: /proc/pmbd/%s removed\n", pmbd->pmbd_name);
- return 0;
-}
-
-static int pmbd_create (PMBD_DEVICE_T* pmbd, uint64_t sectors)
-{
- int err = 0;
-
- pmbd->num_sectors = sectors;
- pmbd->sector_size = PMBD_SECTOR_SIZE; /* FIXME: now we use 512, do we need to change it? */
- pmbd->pmbd_type = g_pmbd_type;
- pmbd->checksum_unit_size = PAGE_SIZE;
- pmbd->pb_size = PAGE_SIZE;
-
- spin_lock_init(&pmbd->batch_lock);
- spin_lock_init(&pmbd->wr_barrier_lock);
-
- spin_lock_init(&pmbd->tmp_lock);
- pmbd->tmp_data = 0;
- pmbd->tmp_num = 0;
-
- /* allocate statistics info */
- if ((err = pmbd_stat_alloc(pmbd)) < 0)
- goto error;
-
- /* allocate memory space */
- if ((err = pmbd_mem_space_alloc(pmbd)) < 0)
- goto error;
-
- /* allocate buffer space */
- if ((err = pmbd_buffer_space_alloc(pmbd)) < 0)
- goto error;
-
- /* allocate checksum space */
- if ((err = pmbd_checksum_space_alloc(pmbd)) < 0)
- goto error;
-
- /* allocate block info space */
- if ((err = pmbd_pbi_space_alloc(pmbd)) < 0)
- goto error;
-
- /* create a /proc/pmbd/<dev> entry*/
- if ((err = pmbd_proc_devstat_create(pmbd)) < 0)
- goto error;
-
-#if 0
- /* FIXME: No need to do it. It's slow and could lock up the system*/
- pmbd_checksum_space_init(pmbd);
-#endif
-
-	/* set up the page attributes related to the CPU cache:
-	 * if using vmalloc(), we need to set the page cache flags (WB,WC,UC,UM);
-	 * if using high memory, we set the page cache flag with ioremap_prot().
-	 * WARN: In Linux 3.2.1, this function is slow and could cause system hangs.
-	 */
-
- if (PMBD_USE_VMALLOC()){
- pmbd_set_pages_cache_flags(pmbd);
- }
-
- /* initialize PM pages read-only */
- if (!PMBD_USE_PMAP() && PMBD_USE_WRITE_PROTECTION())
- pmbd_set_pages_ro(pmbd, pmbd->mem_space, PMBD_MEM_TOTAL_BYTES(pmbd), FALSE);
-
- printk(KERN_INFO "pmbd: %s created\n", pmbd->pmbd_name);
-error:
- return err;
-}
-
-static int pmbd_destroy (PMBD_DEVICE_T* pmbd)
-{
- /* flush everything down */
- // FIXME: this implies flushing CPU cache
- pmbd_write_barrier(pmbd);
-
- /* free /proc entry */
- pmbd_proc_devstat_destroy(pmbd);
-
- /* free buffer space */
- pmbd_buffer_space_free(pmbd);
-
- /* set PM pages writable */
- if (!PMBD_USE_PMAP() && PMBD_USE_WRITE_PROTECTION())
- pmbd_set_pages_rw(pmbd, pmbd->mem_space, PMBD_MEM_TOTAL_BYTES(pmbd), FALSE);
-
- /* reset memory attributes to WB */
- if (PMBD_USE_VMALLOC())
- pmbd_reset_pages_cache_flags(pmbd);
-
- /* free block info space */
- pmbd_pbi_space_free(pmbd);
-
- /* free checksum space */
- pmbd_checksum_space_free(pmbd);
-
- /* free memory backstore space */
- pmbd_mem_space_free(pmbd);
-
- /* free statistics data */
- pmbd_stat_free(pmbd);
-
- printk(KERN_INFO "pmbd: /dev/%s is destroyed (%llu MB)\n", pmbd->pmbd_name, SECTORS_TO_MB(pmbd->num_sectors));
-
- pmbd->num_sectors = 0;
- pmbd->sector_size = 0;
- pmbd->checksum_unit_size = 0;
- return 0;
-}
-
-static int pmbd_free_pages(PMBD_DEVICE_T* pmbd)
-{
- return pmbd_destroy(pmbd);
-}
-
-/*
- **************************************************************************
- * /proc file system entries
- **************************************************************************
- */
-
-static int pmbd_proc_create(void)
-{
-	proc_pmbd = proc_mkdir("pmbd", NULL);
- if(proc_pmbd == NULL){
- printk(KERN_ERR "pmbd: %s(%d): cannot create /proc/pmbd\n", __FUNCTION__, __LINE__);
- return -ENOMEM;
- }
-
- proc_pmbdstat = create_proc_entry("pmbdstat", S_IRUGO, proc_pmbd);
- if (proc_pmbdstat == NULL){
- remove_proc_entry("pmbdstat", proc_pmbd);
- printk(KERN_ERR "pmbd: cannot create /proc/pmbd/pmbdstat\n");
- return -ENOMEM;
- }
- proc_pmbdstat->read_proc = pmbd_proc_pmbdstat_read;
- printk(KERN_INFO "pmbd: /proc/pmbd/pmbdstat created\n");
-
- proc_pmbdcfg = create_proc_entry("pmbdcfg", S_IRUGO, proc_pmbd);
- if (proc_pmbdcfg == NULL){
- remove_proc_entry("pmbdcfg", proc_pmbd);
- printk(KERN_ERR "pmbd: cannot create /proc/pmbd/pmbdcfg\n");
- return -ENOMEM;
- }
- proc_pmbdcfg->read_proc = pmbd_proc_pmbdcfg_read;
- printk(KERN_INFO "pmbd: /proc/pmbd/pmbdcfg created\n");
-
- return 0;
-}
-
-static int pmbd_proc_destroy(void)
-{
- remove_proc_entry("pmbdcfg", proc_pmbd);
- printk(KERN_INFO "pmbd: /proc/pmbd/pmbdcfg is removed\n");
-
- remove_proc_entry("pmbdstat", proc_pmbd);
- printk(KERN_INFO "pmbd: /proc/pmbd/pmbdstat is removed\n");
-
-	remove_proc_entry("pmbd", NULL);
- printk(KERN_INFO "pmbd: /proc/pmbd is removed\n");
- return 0;
-}
-
-/*
- **************************************************************************
- * device driver interface hook functions
- **************************************************************************
- */
-
-static int pmbd_mergeable_bvec(struct request_queue *q,
- struct bvec_merge_data *bvm,
-			struct bio_vec *biovec)
-{
- static int flag = 0;
-
- if(PMBD_IS_MERGEABLE()) {
- /* always merge */
- if (!flag) {
- printk(KERN_INFO "pmbd: bio merging enabled\n");
- flag = 1;
- }
- return biovec->bv_len;
- } else {
- /* never merge */
- if (!flag) {
- printk(KERN_INFO "pmbd: bio merging disabled\n");
- flag = 1;
- }
- if (!bvm->bi_size) {
- return biovec->bv_len;
- } else {
- return 0;
- }
- }
-}
-
-int pmbd_fsync(struct file* file, struct dentry* dentry, int datasync)
-{
- printk(KERN_WARNING "pmbd: pmbd_fsync not implemented\n");
-
- return 0;
-}
-
-int pmbd_open(struct block_device* bdev, fmode_t mode)
-{
- printk(KERN_DEBUG "pmbd: pmbd (/dev/%s) opened\n", bdev->bd_disk->disk_name);
- return 0;
-}
-
-int pmbd_release (struct gendisk* disk, fmode_t mode)
-{
- printk(KERN_DEBUG "pmbd: pmbd (/dev/%s) released\n", disk->disk_name);
- return 0;
-}
-
-static const struct block_device_operations pmbd_fops = {
- .owner = THIS_MODULE,
-// .open = pmbd_open,
-// .release = pmbd_release,
-};
-
-/*
- * NOTE: parts of the following code are derived from linux/block/brd.c
- */
-
-
-static PMBD_DEVICE_T *pmbd_alloc(int i)
-{
- PMBD_DEVICE_T *pmbd;
- struct gendisk *disk;
-
- /* no more than 26 devices */
- if (i >= PMBD_MAX_NUM_DEVICES)
- return NULL;
-
- /* alloc and set up pmbd object */
- pmbd = kzalloc(sizeof(*pmbd), GFP_KERNEL);
- if (!pmbd)
- goto out;
- pmbd->pmbd_id = i;
- pmbd->pmbd_queue = blk_alloc_queue(GFP_KERNEL);
- sprintf(pmbd->pmbd_name, "pm%c", ('a' + i));
- pmbd->rdlat = g_pmbd_rdlat[i];
- pmbd->wrlat = g_pmbd_wrlat[i];
- pmbd->rdbw = g_pmbd_rdbw[i];
- pmbd->wrbw = g_pmbd_wrbw[i];
- pmbd->rdsx = g_pmbd_rdsx[i];
- pmbd->wrsx = g_pmbd_wrsx[i];
- pmbd->rdpause = g_pmbd_rdpause[i];
- pmbd->wrpause = g_pmbd_wrpause[i];
- pmbd->simmode = g_pmbd_simmode[i];
- pmbd->rammode = g_pmbd_rammode[i];
- pmbd->wpmode = g_pmbd_wpmode[i];
- pmbd->num_buffers = g_pmbd_num_buffers;
- pmbd->buffer_stride = g_pmbd_buffer_stride;
- pmbd->bufmode = (g_pmbd_bufsize[i] > 0 && g_pmbd_num_buffers > 0) ? TRUE : FALSE;
-
- if (!pmbd->pmbd_queue)
- goto out_free_dev;
-
- /* hook functions */
- blk_queue_make_request(pmbd->pmbd_queue, pmbd_make_request);
-
-	/* set the flush capability; otherwise, WRITE_FLUSH and WRITE_FUA requests
-	   will be filtered out in generic_make_request() */
- if (PMBD_USE_FUA())
- blk_queue_flush(pmbd->pmbd_queue, REQ_FLUSH | REQ_FUA);
- else if (PMBD_USE_WB())
- blk_queue_flush(pmbd->pmbd_queue, REQ_FLUSH);
-
- blk_queue_max_hw_sectors(pmbd->pmbd_queue, 1024);
- blk_queue_bounce_limit(pmbd->pmbd_queue, BLK_BOUNCE_ANY);
- blk_queue_merge_bvec(pmbd->pmbd_queue, pmbd_mergeable_bvec);
-
- disk = pmbd->pmbd_disk = alloc_disk(1 << part_shift);
- if (!disk)
- goto out_free_queue;
-
- disk->major = PMBD_MAJOR;
- disk->first_minor = i << part_shift;
- disk->fops = &pmbd_fops;
- disk->private_data = pmbd;
- disk->queue = pmbd->pmbd_queue;
- strcpy(disk->disk_name, pmbd->pmbd_name);
- set_capacity(disk, GB_TO_SECTORS(g_pmbd_size[i])); /* num of sectors */
-
- /* allocate PM space */
- if (pmbd_create(pmbd, GB_TO_SECTORS(g_pmbd_size[i])) < 0)
- goto out_free_queue;
-
- /* done */
- return pmbd;
-
-out_free_queue:
- blk_cleanup_queue(pmbd->pmbd_queue);
-out_free_dev:
- kfree(pmbd);
-out:
- return NULL;
-}
-
-static void pmbd_free(PMBD_DEVICE_T *pmbd)
-{
- put_disk(pmbd->pmbd_disk);
- blk_cleanup_queue(pmbd->pmbd_queue);
- pmbd_free_pages(pmbd);
- kfree(pmbd);
-}
-
-static void pmbd_del_one(PMBD_DEVICE_T *pmbd)
-{
- list_del(&pmbd->pmbd_list);
- del_gendisk(pmbd->pmbd_disk);
- pmbd_free(pmbd);
-}
-
-static int __init pmbd_init(void)
-{
- int i, nr;
- PMBD_DEVICE_T *pmbd, *next;
-
- /* parse input options */
- pmbd_parse_conf();
-
- /* initialize pmap start*/
- pmap_create();
-
- /* ioremap high memory space */
- if (PMBD_USE_HIGHMEM()) {
- if (pmbd_highmem_map() == NULL)
- return -ENOMEM;
- }
-
- part_shift = 0;
- if (max_part > 0)
- part_shift = fls(max_part);
-
- if (g_pmbd_nr > 1UL << (MINORBITS - part_shift))
- return -EINVAL;
-
-	if (g_pmbd_nr) {
-		nr = g_pmbd_nr;
-	} else {
- printk(KERN_ERR "pmbd: %s(%d) - g_pmbd_nr=%d\n", __FUNCTION__, __LINE__, g_pmbd_nr);
- return -EINVAL;
- }
-
- pmbd_proc_create();
-
- if (register_blkdev(PMBD_MAJOR, PMBD_NAME))
- return -EIO;
- else
- printk(KERN_INFO "pmbd: registered device at major %d\n", PMBD_MAJOR);
-
- for (i = 0; i < nr; i++) {
- pmbd = pmbd_alloc(i);
- if (!pmbd)
- goto out_free;
- list_add_tail(&pmbd->pmbd_list, &pmbd_devices);
- }
-
- /* point of no return */
- list_for_each_entry(pmbd, &pmbd_devices, pmbd_list)
- add_disk(pmbd->pmbd_disk);
-
- printk(KERN_INFO "pmbd: module loaded\n");
- return 0;
-
-out_free:
- list_for_each_entry_safe(pmbd, next, &pmbd_devices, pmbd_list) {
- list_del(&pmbd->pmbd_list);
- pmbd_free(pmbd);
- }
- unregister_blkdev(PMBD_MAJOR, PMBD_NAME);
-
- return -ENOMEM;
-}
-
-
-static void __exit pmbd_exit(void)
-{
-	PMBD_DEVICE_T *pmbd, *next;
-
- /* deactivate each pmbd instance*/
- list_for_each_entry_safe(pmbd, next, &pmbd_devices, pmbd_list)
- pmbd_del_one(pmbd);
-
-	/* unmap the ioremapped high memory space */
- if (PMBD_USE_HIGHMEM()) {
- pmbd_highmem_unmap();
- }
-
- /* destroy pmap entries */
- pmap_destroy();
-
- unregister_blkdev(PMBD_MAJOR, PMBD_NAME);
-
- pmbd_proc_destroy();
-
- printk(KERN_INFO "pmbd: module unloaded\n");
- return;
-}
-
-/* module setup */
-MODULE_AUTHOR("Intel Corporation <linux-pmbd at intel.com>");
-MODULE_ALIAS("pmbd");
-MODULE_LICENSE("GPL v2");
-MODULE_VERSION("0.9");
-MODULE_ALIAS_BLOCKDEV_MAJOR(PMBD_MAJOR);
-module_init(pmbd_init);
-module_exit(pmbd_exit);
-
-/* THE END */
-
-
diff --git a/include/linux/pmbd.h b/include/linux/pmbd.h
deleted file mode 100644
index 8e8691f..0000000
--- a/include/linux/pmbd.h
+++ /dev/null
@@ -1,509 +0,0 @@
-/*
- * Intel Persistent Memory Block Driver
- * Copyright (c) <2011-2013>, Intel Corporation.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
- * more details.
- *
- * You should have received a copy of the GNU General Public License along with
- * this program; if not, write to the Free Software Foundation, Inc.,
- * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
- */
-
-/*
- * Intel Persistent Memory Block Driver (v0.9)
- *
- * pmbd.h
- *
- * Intel Corporation <linux-pmbd at intel.com>
- * 03/24/2011
- */
-
-#ifndef PMBD_H
-#define PMBD_H
-
-#define PMBD_MAJOR 261 /* FIXME: temporarily use this */
-#define PMBD_NAME "pmbd" /* pmbd module name */
-#define PMBD_MAX_NUM_DEVICES 26 /* max num of devices */
-#define PMBD_MAX_NUM_CPUS 32 /* max num of cpus*/
-
-/*
- * type definitions
- */
-typedef uint32_t PMBD_CHECKSUM_T; /* we use CRC32 to calculate checksums */
-typedef sector_t BBN_T; /* buffer block number */
-typedef sector_t PBN_T; /* physical block number */
-
-
-/*
- * PMBD device buffer control structure
- * NOTE:
- * (1) buffer_space is an array of num_blocks of blocks, the size of which is
- * defined as pmbd->pb_size
- * (2) bbi_space is an array of num_blocks of bbi (buffer block info) units,
- * each of which contains the metadata information of each block in the buffer
- *
- * buffer space management variables:
- * num_dirty - total number of dirty blocks in the buffer
- * pos_dirty - points to the first dirty block
- * pos_clean - points to the first clean block
- *
- * pos_dirty and pos_clean logically segment the buffer into
- * dirty/clean regions as follows (see the example after this comment):
- *
- *      pos_dirty ----v       v--- pos_clean
- *      ----------------------------
- *      | clean  |*DIRTY*|  clean  |
- *      ----------------------------
- *
- * buffer_lock - protects reads/writes to the three fields above
- */
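-/*
- * Illustrative example (an assumption about the exact wraparound semantics):
- * with num_blocks = 8, pos_dirty = 2 and pos_clean = 5, blocks 2..4 form the
- * dirty region, so num_dirty = (5 - 2) mod 8 = 3.
- */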
-typedef struct pmbd_bbi { /* pmbd buffer block info (BBI) */
- PBN_T pbn; /* physical block number in PM (converted from sector) */
- unsigned dirty; /* dirty (1) or clean (0)*/
-} PMBD_BBI_T;
-
-typedef struct pmbd_bsort_entry { /* pmbd buffer block info for sorting */
- BBN_T bbn; /* buffer block number (in buffer)*/
- PBN_T pbn; /* physical block number (in PMBD)*/
-} PMBD_BSORT_ENTRY_T;
-
-typedef struct pmbd_buffer {
- unsigned buffer_id;
- struct pmbd_device* pmbd; /* the linked pmbd device */
-
- BBN_T num_blocks; /* buffer space size (# of blocks) */
- void* buffer_space; /* buffer space base vaddr address */
- PMBD_BBI_T* bbi_space; /* array of buffer block info (BBI)*/
-
- BBN_T num_dirty; /* num of dirty blocks */
- BBN_T pos_dirty; /* the first dirty block */
- BBN_T pos_clean; /* the first clean block */
- spinlock_t buffer_lock; /* lock to protect metadata updates */
- unsigned int batch_size; /* the batch size for flushing buffer pages */
-
- struct task_struct* syncer; /* the syncer daemon */
-
-	spinlock_t flush_lock;		/* lock to serialize buffer flushing */
- PMBD_BSORT_ENTRY_T* bbi_sort_buffer;/* a temp array of the bbi for sorting */
-} PMBD_BUFFER_T;
-
-/*
- * PM physical block information (each corresponding to a PM block)
- *
- * (1) if the physical block is buffered, bbn contains a valid buffer block
- * number (BBN) between 0 - (buffer->num_blocks-1), otherwise, it contains an
- * invalid value (buffer->num_blocks + 1)
- * (2) any access to the block (read/write/sync) must take this lock first to
- * prevent multiple concurrent accesses to the same PM block
- */
-typedef struct pmbd_pbi{
- BBN_T bbn;
- spinlock_t lock;
-} PMBD_PBI_T;
-
-typedef struct pmbd_stat{
- /* stat_lock does not protect cycles_*[] counters */
- spinlock_t stat_lock; /* protection lock */
-
- unsigned last_access_jiffies; /* the timestamp of the most recent access */
- uint64_t num_sectors_read; /* total num of sectors being read */
- uint64_t num_sectors_write; /* total num of sectors being written */
-	uint64_t num_requests_read;	/* total num of requests for read */
-	uint64_t num_requests_write;	/* total num of requests for write */
-	uint64_t num_write_barrier;	/* total num of write barriers received */
-	uint64_t num_write_fua;		/* total num of FUA writes received */
-
- /* cycles counters (enabled/disabled by timestat)*/
-	uint64_t cycles_total[2][PMBD_MAX_NUM_CPUS];	/* total cycles in make_request (read/write) */
- uint64_t cycles_prepare[2][PMBD_MAX_NUM_CPUS]; /* total cycles for prepare in make_request*/
- uint64_t cycles_wb[2][PMBD_MAX_NUM_CPUS]; /* total cycles for write barrier in make_request*/
- uint64_t cycles_work[2][PMBD_MAX_NUM_CPUS]; /* total cycles for work in make_request*/
- uint64_t cycles_endio[2][PMBD_MAX_NUM_CPUS]; /* total cycles for endio in make_request*/
- uint64_t cycles_finish[2][PMBD_MAX_NUM_CPUS]; /* total cycles for finish-up in make_request*/
-
- uint64_t cycles_pmap[2][PMBD_MAX_NUM_CPUS]; /* total cycles for private mapping*/
- uint64_t cycles_punmap[2][PMBD_MAX_NUM_CPUS]; /* total cycles for private unmapping */
- uint64_t cycles_memcpy[2][PMBD_MAX_NUM_CPUS]; /* total cycles for memcpy */
- uint64_t cycles_clflush[2][PMBD_MAX_NUM_CPUS]; /* total cycles for clflush_range */
- uint64_t cycles_clflushall[2][PMBD_MAX_NUM_CPUS];/* total cycles for clflush_all */
- uint64_t cycles_wrverify[2][PMBD_MAX_NUM_CPUS]; /* total cycles for doing write verification */
- uint64_t cycles_checksum[2][PMBD_MAX_NUM_CPUS]; /* total cycles for doing checksum */
- uint64_t cycles_pause[2][PMBD_MAX_NUM_CPUS]; /* total cycles for pause */
- uint64_t cycles_slowdown[2][PMBD_MAX_NUM_CPUS]; /* total cycles for slowdown*/
- uint64_t cycles_setpages_ro[2][PMBD_MAX_NUM_CPUS]; /*total cycles for set pages to ro*/
- uint64_t cycles_setpages_rw[2][PMBD_MAX_NUM_CPUS]; /*total cycles for set pages to rw*/
-} PMBD_STAT_T;
-
-/*
- * pmbd_device structure (each corresponding to a pmbd instance)
- */
-#define PBN_TO_PMBD_BUFFER_ID(PMBD, PBN) (((PBN)/(PMBD)->buffer_stride) % (PMBD)->num_buffers)
-#define PBN_TO_PMBD_BUFFER(PMBD, PBN) ((PMBD)->buffers[PBN_TO_PMBD_BUFFER_ID((PMBD), (PBN))])
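-/*
- * Worked example (illustrative): with buffer_stride = 4 and num_buffers = 2,
- * PBNs 0..3 map to buffer 0, PBNs 4..7 to buffer 1, PBNs 8..11 back to
- * buffer 0, and so on, striping contiguous runs of blocks across the buffers.
- */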
-
-typedef struct pmbd_device {
- int pmbd_id; /* dev id */
- char pmbd_name[DISK_NAME_LEN];/* device name */
-
- struct request_queue * pmbd_queue;
- struct gendisk * pmbd_disk;
- struct list_head pmbd_list;
-
- /* PM backstore space */
- void* mem_space; /* pointer to the kernel mem space */
- uint64_t num_sectors; /* PMBD device capacity (num of 512-byte sectors)*/
- unsigned sector_size; /* 512 bytes */
-
- /* configurations */
- unsigned pmbd_type; /* vmalloc() or high_mem */
- unsigned rammode; /* RAM mode (no write protection) or not */
- unsigned bufmode; /* use buffer or not */
- unsigned wpmode; /* write protection mode: PTE change (0) or CR0/WP bit switch (1)*/
-
- /* buffer management */
- PMBD_BUFFER_T** buffers; /* buffer control structure */
- unsigned num_buffers; /* number of buffers */
- unsigned buffer_stride; /* the number of contiguous blocks mapped to the same buffer */
-
-
-
- /* physical block info (metadata) */
- PMBD_PBI_T* pbi_space; /* physical block info space (each) */
- unsigned pb_size; /* the unit size of each block (4096 in default) */
-
- /* checksum */
- PMBD_CHECKSUM_T* checksum_space; /* checksum array */
- unsigned checksum_unit_size; /* checksum unit size (bytes) */
- void* checksum_iomem_buf; /* one unit buffer for ioremapped PM */
-
- /* emulating PM with injected latency */
- unsigned simmode; /* simulating whole device (0) or PM only (1)*/
- uint64_t rdlat; /* read access latency (in nanoseconds)*/
- uint64_t wrlat; /* write access latency (in nanoseconds)*/
- uint64_t rdbw; /* read bandwidth (MB/sec) */
- uint64_t wrbw; /* write bandwidth (MB/sec) */
- unsigned rdsx; /* read slowdown (X) */
- unsigned wrsx; /* write slowdown (X) */
- uint64_t rdpause; /* read pause (cycles per 4KB page) */
- uint64_t wrpause; /* write pause (cycles per 4KB page) */
-
- spinlock_t batch_lock; /* lock protecting batch_* fields */
- uint64_t batch_start_cycle[2]; /* start time of the batch (cycles)*/
- uint64_t batch_end_cycle[2]; /* end time of the batch (cycles) */
- uint64_t batch_sectors[2]; /* the total num of sectors in the batch */
-
- PMBD_STAT_T* pmbd_stat; /* statistics data */
- struct proc_dir_entry* proc_devstat; /* the proc output */
-
- spinlock_t wr_barrier_lock;/* for write barrier and other control */
- atomic_t num_flying_wr; /* the counter of writes on the fly */
-
- spinlock_t tmp_lock;
- uint64_t tmp_data;
- unsigned long tmp_num;
-} PMBD_DEVICE_T;
-
-/*
- * support definitions
- */
-#define TRUE 1
-#define FALSE 0
-
-#define __CURRENT_PID__ (current->pid)
-#define CONFIG_PMBD_DEBUG 1
-//#define PRINTK_DEBUG_HDR "DEBUG %s(%d)%u - "
-//#define PRINTK_DEBUG_PAR __FUNCTION__, __LINE__, __CURRENT_PID__
-//#define PRINTK_DEBUG_1 if(CONFIG_PMBD_DEBUG >= 1) printk
-//#define PRINTK_DEBUG_2 if(CONFIG_PMBD_DEBUG >= 2) printk
-//#define PRINTK_DEBUG_3 if(CONFIG_PMBD_DEBUG >= 3) printk
-
-#define MAX_OF(A, B) (((A) > (B))? (A) : (B))
-#define MIN_OF(A, B) (((A) < (B))? (A) : (B))
-
-#define SECTOR_SHIFT 9
-#define PAGE_SHIFT 12
-#define SECTOR_SIZE (1UL << SECTOR_SHIFT)
-//#define PAGE_SIZE (1UL << PAGE_SHIFT)
-#define SECTOR_MASK (~(SECTOR_SIZE-1))
-#define PAGE_MASK (~(PAGE_SIZE-1))
-#define PMBD_SECTOR_SIZE SECTOR_SIZE
-#define PMBD_PAGE_SIZE PAGE_SIZE
-#define KB_SHIFT 10
-#define MB_SHIFT 20
-#define GB_SHIFT 30
-#define MB_TO_BYTES(N) ((N) << MB_SHIFT)
-#define GB_TO_BYTES(N) ((N) << GB_SHIFT)
-#define BYTES_TO_MB(N) ((N) >> MB_SHIFT)
-#define BYTES_TO_GB(N) ((N) >> GB_SHIFT)
-#define MB_TO_SECTORS(N) ((N) << (MB_SHIFT - SECTOR_SHIFT))
-#define GB_TO_SECTORS(N) ((N) << (GB_SHIFT - SECTOR_SHIFT))
-#define SECTORS_TO_MB(N) ((N) >> (MB_SHIFT - SECTOR_SHIFT))
-#define SECTORS_TO_GB(N) ((N) >> (GB_SHIFT - SECTOR_SHIFT))
-#define SECTOR_TO_PAGE(N) ((N) >> (PAGE_SHIFT - SECTOR_SHIFT))
-#define SECTOR_TO_BYTE(N) ((N) << SECTOR_SHIFT)
-#define BYTE_TO_SECTOR(N) ((N) >> SECTOR_SHIFT)
-#define PAGE_TO_SECTOR(N) ((N) << (PAGE_SHIFT - SECTOR_SHIFT))
-#define BYTE_TO_PAGE(N) ((N) >> (PAGE_SHIFT))
-
-#define IS_SPACE(C) (isspace(C) || (C) == '\0')
-#define IS_DIGIT(C) (isdigit(C) && (C) != '\0')
-#define IS_ALPHA(C) (isalpha(C) && (C) != '\0')
-
-#define DISABLE_SAVE_IRQ(FLAGS) {local_irq_save((FLAGS)); local_irq_disable();}
-#define ENABLE_RESTORE_IRQ(FLAGS) {local_irq_restore((FLAGS)); local_irq_enable();}
-#define CUR_CPU_ID() smp_processor_id()
-
-/*
- * PMBD related config
- */
-
-#define PMBD_CONFIG_VMALLOC 0 /* vmalloc() based PMBD (default) */
-#define PMBD_CONFIG_HIGHMEM 1 /* ioremap() based PMBD */
-
-
-/* global config */
-#define PMBD_IS_MERGEABLE() (g_pmbd_mergeable == TRUE)
-#define PMBD_USE_VMALLOC() (g_pmbd_type == PMBD_CONFIG_VMALLOC)
-#define PMBD_USE_HIGHMEM() (g_pmbd_type == PMBD_CONFIG_HIGHMEM)
-#define PMBD_USE_CLFLUSH() (g_pmbd_cpu_cache_clflush == TRUE)
-#define PMBD_CPU_CACHE_FLAG() ((g_pmbd_cpu_cache_flag == _PAGE_CACHE_WB)? "WB" : \
- ((g_pmbd_cpu_cache_flag == _PAGE_CACHE_WC)? "WC" : \
- ((g_pmbd_cpu_cache_flag == _PAGE_CACHE_UC)? "UC" : \
- ((g_pmbd_cpu_cache_flag == _PAGE_CACHE_UC_MINUS)? "UC-Minus" : "UNKNOWN"))))
-
-#define PMBD_CPU_CACHE_USE_WB() (g_pmbd_cpu_cache_flag == _PAGE_CACHE_WB) /* write back */
-#define PMBD_CPU_CACHE_USE_WC() (g_pmbd_cpu_cache_flag == _PAGE_CACHE_WC) /* write combining */
-#define PMBD_CPU_CACHE_USE_UC() (g_pmbd_cpu_cache_flag == _PAGE_CACHE_UC) /* uncachable */
-#define PMBD_CPU_CACHE_USE_UM() (g_pmbd_cpu_cache_flag == _PAGE_CACHE_UC_MINUS) /* uncachable minus */
-
-#define PMBD_USE_WRITE_PROTECTION() (g_pmbd_wr_protect == TRUE)
-#define PMBD_USE_WRITE_VERIFICATION() (g_pmbd_wr_verify == TRUE)
-#define PMBD_USE_CHECKSUM() (g_pmbd_checksum == TRUE)
-#define PMBD_USE_LOCK() (g_pmbd_lock == TRUE)
-#define PMBD_USE_SUBPAGE_UPDATE() (g_pmbd_subpage_update == TRUE)
-
-#define PMBD_USE_PMAP() (g_pmbd_pmap == TRUE && g_pmbd_type == PMBD_CONFIG_HIGHMEM)
-#define PMBD_USE_NTS() (g_pmbd_nts == TRUE)
-#define PMBD_USE_NTL() (g_pmbd_ntl == TRUE)
-#define PMBD_USE_WB() (g_pmbd_wb == TRUE)
-#define PMBD_USE_FUA() (g_pmbd_fua == TRUE)
-#define PMBD_USE_TIMESTAT() (g_pmbd_timestat == TRUE)
-
-#define TIMESTAMP(TS) rdtscll((TS))
-#define TIMESTAT_POINT(TS) {(TS) = 0; if (PMBD_USE_TIMESTAT()) rdtscll((TS));}
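-/*
- * Usage sketch (illustrative): bracket a code region with two TIMESTAT_POINTs
- * and accumulate the delta into the matching per-CPU counter, as
- * pmbd_make_request() does:
- *
- *	uint64_t t1, t2;
- *	TIMESTAT_POINT(t1);
- *	... timed work ...
- *	TIMESTAT_POINT(t2);
- *	pmbd_stat->cycles_work[rw][CUR_CPU_ID()] += t2 - t1;
- */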
-
-/* instance-based config */
-#define PMBD_DEV_USE_VMALLOC(PMBD) ((PMBD)->pmbd_type == PMBD_CONFIG_VMALLOC)
-#define PMBD_DEV_USE_HIGHMEM(PMBD) ((PMBD)->pmbd_type == PMBD_CONFIG_HIGHMEM)
-#define PMBD_DEV_USE_BUFFER(PMBD) ((PMBD)->bufmode)
-#define PMBD_DEV_USE_WPMODE_PTE(PMBD) ((PMBD)->wpmode == 0)
-#define PMBD_DEV_USE_WPMODE_CR0(PMBD) ((PMBD)->wpmode == 1)
-
-#define PMBD_DEV_USE_EMULATION(PMBD) ((PMBD)->rdlat || (PMBD)->wrlat || (PMBD)->rdbw || (PMBD)->wrbw)
-#define PMBD_DEV_SIM_PMBD(PMBD) (PMBD_DEV_USE_EMULATION((PMBD)) && (PMBD)->simmode == 1)
-#define PMBD_DEV_SIM_DEV(PMBD) (PMBD_DEV_USE_EMULATION((PMBD)) && (PMBD)->simmode == 0)
-#define PMBD_DEV_USE_SLOWDOWN(PMBD) ((PMBD)->rdsx > 1 || (PMBD)->wrsx > 1)
-
-/* support functions */
-#define PMBD_MEM_TOTAL_SECTORS(PMBD) ((PMBD)->num_sectors)
-#define PMBD_MEM_TOTAL_BYTES(PMBD) ((PMBD)->num_sectors * (PMBD)->sector_size)
-#define PMBD_MEM_TOTAL_PAGES(PMBD) (((PMBD)->num_sectors) >> (PAGE_SHIFT - SECTOR_SHIFT))
-#define PMBD_MEM_SPACE_FIRST_BYTE(PMBD) ((PMBD)->mem_space)
-#define PMBD_MEM_SPACE_LAST_BYTE(PMBD) ((PMBD)->mem_space + PMBD_MEM_TOTAL_BYTES(PMBD) - 1)
-#define PMBD_CHECKSUM_TOTAL_NUM(PMBD) (PMBD_MEM_TOTAL_BYTES(PMBD) / (PMBD)->checksum_unit_size)
-#define PMBD_LOCK_TOTAL_NUM(PMBD) (PMBD_MEM_TOTAL_BYTES(PMBD) / (PMBD)->lock_unit_size)
-#define VADDR_IN_PMBD_SPACE(PMBD, ADDR) ((ADDR) >= PMBD_MEM_SPACE_FIRST_BYTE(PMBD) \
- && (ADDR) <= PMBD_MEM_SPACE_LAST_BYTE(PMBD))
-
-#define BYTE_TO_PBN(PMBD, BYTES) ((BYTES) / (PMBD)->pb_size)
-#define PBN_TO_BYTE(PMBD, PBN) ((PBN) * (PMBD)->pb_size)
-#define SECTOR_TO_PBN(PMBD, SECT) (BYTE_TO_PBN((PMBD), SECTOR_TO_BYTE(SECT)))
-#define PBN_TO_SECTOR(PMBD, PBN) (BYTE_TO_SECTOR(PBN_TO_BYTE((PMBD), (PBN))))
-
-
-#define PMBD_CACHELINE_SIZE (64) /* FIXME: configure this machine by machine? (check x86_clflush_size)*/
-
-/* buffer related functions */
-#define CALLER_ALLOCATOR (0)
-#define CALLER_SYNCER (1)
-#define CALLER_DESTROYER (2)
-
-#define PMBD_BLOCK_VADDR(PMBD, PBN) ((PMBD)->mem_space + ((PMBD)->pb_size * (PBN)))
-#define PMBD_BLOCK_PBI(PMBD, PBN) ((PMBD)->pbi_space + (PBN))
-#define PMBD_TOTAL_PB_NUM(PMBD) (PMBD_MEM_TOTAL_BYTES(PMBD) / (PMBD)->pb_size)
-#define PMBD_BLOCK_IS_BUFFERED(PMBD, PBN) (PMBD_BLOCK_PBI((PMBD),(PBN))->bbn < PBN_TO_PMBD_BUFFER((PMBD), (PBN))->num_blocks)
-#define PMBD_SET_BLOCK_BUFFERED(PMBD, PBN, BBN) (PMBD_BLOCK_PBI((PMBD),(PBN))->bbn = (BBN))
-#define PMBD_SET_BLOCK_UNBUFFERED(PMBD, PBN) (PMBD_BLOCK_PBI((PMBD),(PBN))->bbn = PMBD_TOTAL_PB_NUM((PMBD)) + 3)
-//#define PMBD_SET_BLOCK_UNBUFFERED(PMBD, PBN) (PMBD_BLOCK_PBI((PMBD),(PBN))->bbn = PBN_TO_PMBD_BUFFER((PMBD), (PBN))->num_blocks + 1)
-
-#define PMBD_BUFFER_MIN_BUFSIZE		(4)	/* minimum buffer size (in MBs) */
-#define PMBD_BUFFER_BLOCK(BUF, BBN) ((BUF)->buffer_space + (BUF)->pmbd->pb_size*(BBN))
-#define PMBD_BUFFER_BBI(BUF, BBN) ((BUF)->bbi_space + (BBN))
-#define PMBD_BUFFER_BBI_INDEX(BUF, ADDR) ((ADDR)-(BUF)->bbi_space)
-#define PMBD_BUFFER_SET_BBI_CLEAN(BUF, BBN) ((PMBD_BUFFER_BBI((BUF), (BBN)))->dirty = FALSE)
-#define PMBD_BUFFER_SET_BBI_DIRTY(BUF, BBN) ((PMBD_BUFFER_BBI((BUF), (BBN)))->dirty = TRUE)
-#define PMBD_BUFFER_BBI_IS_CLEAN(BUF, BBN) ((PMBD_BUFFER_BBI((BUF), (BBN)))->dirty == FALSE)
-#define PMBD_BUFFER_BBI_IS_DIRTY(BUF, BBN) ((PMBD_BUFFER_BBI((BUF), (BBN)))->dirty == TRUE)
-#define PMBD_BUFFER_SET_BBI_BUFFERED(BUF,BBN,PBN)((PMBD_BUFFER_BBI((BUF), (BBN)))->pbn = (PBN))
-#define PMBD_BUFFER_SET_BBI_UNBUFFERED(BUF, BBN) ((PMBD_BUFFER_BBI((BUF), (BBN)))->pbn = PMBD_TOTAL_PB_NUM((BUF)->pmbd) + 2)
-
-#define PMBD_BUFFER_FLUSH_HW (0.7) /* high watermark */
-#define PMBD_BUFFER_FLUSH_LW (0.1) /* low watermark */
-#define PMBD_BUFFER_IS_FULL(BUF) ((BUF)->num_dirty >= (BUF)->num_blocks)
-#define PMBD_BUFFER_IS_EMPTY(BUF) ((BUF)->num_dirty == 0)
-#define PMBD_BUFFER_ABOVE_HW(BUF) ((BUF)->num_dirty >= (((BUF)->num_blocks * PMBD_BUFFER_FLUSH_HW)))
-#define PMBD_BUFFER_BELOW_HW(BUF) ((BUF)->num_dirty < (((BUF)->num_blocks * PMBD_BUFFER_FLUSH_HW)))
-#define PMBD_BUFFER_ABOVE_LW(BUF) ((BUF)->num_dirty >= (((BUF)->num_blocks * PMBD_BUFFER_FLUSH_LW)))
-#define PMBD_BUFFER_BELOW_LW(BUF) ((BUF)->num_dirty < (((BUF)->num_blocks * PMBD_BUFFER_FLUSH_LW)))
-#define PMBD_BUFFER_BATCH_SIZE_DEFAULT (1024) /* the batch size for each flush */
-
-#define PMBD_BUFFER_NEXT_POS(BUF, POS) (((POS)==((BUF)->num_blocks - 1))? 0 : ((POS)+1))
-#define PMBD_BUFFER_PRIO_POS(BUF, POS) (((POS)== 0)? ((BUF)->num_blocks - 1) : ((POS)-1))
-#define PMBD_BUFFER_NEXT_N_POS(BUF,POS,N) (((POS)+(N))%((BUF)->num_blocks))
-#define PMBD_BUFFER_PRIO_N_POS(BUF,POS,N) (((POS) + (BUF)->num_blocks - ((N) % (BUF)->num_blocks)) % (BUF)->num_blocks) /* final modulo keeps the result in [0, num_blocks), even when POS == N */
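
These position macros walk the buffer's slot array as a ring. A standalone illustration of the wrap-around arithmetic, mirroring the two N-step macros above (num_blocks, pos, and n stand in for the macro arguments):

  #include <stdio.h>

  static unsigned next_n(unsigned num_blocks, unsigned pos, unsigned n)
  {
  	return (pos + n) % num_blocks;              /* PMBD_BUFFER_NEXT_N_POS */
  }

  static unsigned prev_n(unsigned num_blocks, unsigned pos, unsigned n)
  {
  	/* PMBD_BUFFER_PRIO_N_POS */
  	return (pos + num_blocks - (n % num_blocks)) % num_blocks;
  }

  int main(void)
  {
  	printf("%u\n", next_n(8, 6, 3));    /* wraps forward to 1 */
  	printf("%u\n", prev_n(8, 1, 3));    /* wraps backward to 6 */
  	return 0;
  }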
-
-/* high memory */
-#define PMBD_HIGHMEM_AVAILABLE_SPACE (g_highmem_virt_addr + g_highmem_size - g_highmem_curr_addr)
-
-/* emulation */
-#define MAX_SYNC_SLOWDOWN (10000000) /* use async_slowdown if the slowdown exceeds 10ms */
-#define OVERHEAD_NANOSEC (100)
-#define PMBD_USLEEP(n) {set_current_state(TASK_INTERRUPTIBLE); \
- schedule_timeout((n)*HZ/1000000);}
-
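Note that PMBD_USLEEP converts microseconds to jiffies with integer division, so its resolution is one timer tick: with HZ = 1000, any n below 1000 truncates to a zero-jiffy timeout and the macro degenerates to a bare reschedule. The conversion, isolated:

  /* Same arithmetic as PMBD_USLEEP's timeout argument. */
  static unsigned long usleep_jiffies(unsigned long n, unsigned long hz)
  {
  	return n * hz / 1000000;    /* 999us at HZ=1000 -> 0 jiffies */
  }
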
-/* statistics */
-#define PMBD_BATCH_MAX_SECTORS (4096) /* maximum data amount requested in a batch */
-#define PMBD_BATCH_MIN_SECTORS (256) /* minimum data amount requested in a batch */
-#define PMBD_BATCH_MAX_INTERVAL (1000000) /* maximum interval between two requests in a batch*/
-#define PMBD_BATCH_MAX_DURATION (10000000) /* maximum duration of a batch (ns)*/
-
-/* write protection*/
-#define VADDR_TO_PAGE(ADDR) ((ADDR) >> PAGE_SHIFT)
-#define PAGE_TO_VADDR(PAGE) ((PAGE) << PAGE_SHIFT)
-
-/* checksum */
-#define VADDR_TO_CHECKSUM_IDX(PMBD, ADDR) (((ADDR) - (PMBD)->mem_space) / (PMBD)->checksum_unit_size)
-#define CHECKSUM_IDX_TO_VADDR(PMBD, IDX) ((PMBD)->mem_space + (IDX) * (PMBD)->checksum_unit_size)
-#define CHECKSUM_IDX_TO_CKADDR(PMBD, IDX) ((PMBD)->checksum_space + (IDX))
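
Each checksum covers one checksum_unit_size-sized chunk of the PM space, so the mapping is plain offset arithmetic from mem_space. A userspace sketch of the round trip, assuming a hypothetical 4KB checksum unit:

  #include <stdio.h>

  int main(void)
  {
  	static unsigned char pm[16 * 4096]; /* stand-in for mem_space */
  	unsigned long unit = 4096;          /* hypothetical checksum_unit_size */
  	unsigned char *addr = pm + 13000;

  	unsigned long idx = (addr - pm) / unit;   /* VADDR_TO_CHECKSUM_IDX */
  	unsigned char *start = pm + idx * unit;   /* CHECKSUM_IDX_TO_VADDR */

  	/* byte 13000 falls in checksum unit 3, which starts at 12288 */
  	printf("%lu %lu\n", idx, (unsigned long)(start - pm));
  	return 0;
  }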
-
-/* idle period timer */
-#define PMBD_BUFFER_FLUSH_IDLE_TIMEOUT (2000) /* idle timeout threshold, compared against a jiffies-based idle delta */
-#define PMBD_DEV_UPDATE_ACCESS_TIME(PMBD) {spin_lock(&(PMBD)->pmbd_stat->stat_lock); \
- (PMBD)->pmbd_stat->last_access_jiffies = jiffies; \
- spin_unlock(&(PMBD)->pmbd_stat->stat_lock);}
-#define PMBD_DEV_GET_ACCESS_TIME(PMBD, T) {spin_lock(&(PMBD)->pmbd_stat->stat_lock); \
- (T) = (PMBD)->pmbd_stat->last_access_jiffies; \
- spin_unlock(&(PMBD)->pmbd_stat->stat_lock);}
-#define PMBD_DEV_IS_IDLE(PMBD, IDLE) ((IDLE) > PMBD_BUFFER_FLUSH_IDLE_TIMEOUT)
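
Together these macros let the flush daemon find idle windows: every I/O stamps last_access_jiffies under the stat lock, and the flusher checks how long the device has been untouched. A sketch of that check (device_is_idle is an illustrative name, not the driver's):

  /* Nonzero if the device has been quiet long enough that a
   * background buffer flush will not compete with foreground I/O. */
  static int device_is_idle(PMBD_DEVICE_T *pmbd)
  {
  	unsigned long last, idle;

  	PMBD_DEV_GET_ACCESS_TIME(pmbd, last);
  	idle = jiffies - last;
  	return PMBD_DEV_IS_IDLE(pmbd, idle);
  }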
-
-/* Help info */
-#define USAGE_INFO \
-"\n\n\
-============================================\n\
-Intel Persistent Memory Block Driver (v0.9)\n\
-============================================\n\n\
-usage: $ modprobe pmbd mode=\"pmbd<#>;hmo<#>;hms<#>;[Option1];[Option2];[Option3];..\"\n\
-\n\
-GENERAL OPTIONS: \n\
-\t pmbd<#,#..> \t set PM block device size (GBs) \n\
-\t HM|VM \t\t use high memory (HM default) or vmalloc (VM) \n\
-\t hmo<#> \t high memory starting offset (GB) \n\
-\t hms<#> \t high memory size (GBs) \n\
-\t pmap<Y|N> \t use private mapping (Y) or not (N default) - (note: must enable HM and wrprotN) \n\
-\t nts<Y|N> \t use non-temporal store (MOVNTQ) and sfence to do memcpy (Y), or regular memcpy (N default)\n\
-\t wb<Y|N> \t use write barrier (Y) or not (N default)\n\
-\t fua<Y|N> \t use WRITE_FUA (Y default) or not (N) \n\
-\t ntl<Y|N> \t use non-temporal load (MOVNTDQA) to do memcpy (Y), or regular memcpy (N default) - this option forces the write-combining memory type\n\
-\n\
-SIMULATION: \n\
-\t simmode<#,#..> apply the simulated numbers to the whole device (0 default) or to the PM space only (1)\n\
-\t rdlat<#,#..> \t set read access latency (ns) \n\
-\t wrlat<#,#..> \t set write access latency (ns)\n\
-\t rdbw<#,#..> \t set read bandwidth (MB/sec) (0 disables emulation) \n\
-\t wrbw<#,#..> \t set write bandwidth (MB/sec) (0 disables emulation) \n\
-\t rdsx<#,#..> \t set the relative slowdown (x) for read \n\
-\t wrsx<#,#..> \t set the relative slowdown (x) for write \n\
-\t rdpause<#,.> \t set a pause (cycles per 4KB) for each read\n\
-\t wrpause<#,.> \t set a pause (cycles per 4KB) for each write\n\
-\t adj<#> \t set an adjustment to the system overhead (nanoseconds) \n\
-\n\
-WRITE PROTECTION: \n\
-\t wrprot<Y|N> \t use write protection for PM pages? (Y or N)\n\
-\t wpmode<#,#,..> write protection mode: change PTEs (0 default) or toggle the CR0/WP bit (1) \n\
-\t clflush<Y|N> \t use clflush to flush CPU cache for each write to PM space? (Y or N) \n\
-\t wrverify<Y|N> \t use write verification for PM pages? (Y or N) \n\
-\t checksum<Y|N> \t use checksum to protect PM pages? (Y or N)\n\
-\t bufsize<#,#,..> the buffer size (MBs) (0 - no buffer; otherwise at least 4MB)\n\
-\t bufnum<#> \t the number of buffers per PMBD device (16 by default; at least 1 if a buffer is used; 0 - no buffer) \n\
-\t bufstride<#> \t the number of contiguous blocks (4KB) mapped into one buffer (bucket size for round-robin mapping) (1024 by default)\n\
-\t batch<#,#> \t the batch size (num of pages) for flushing PMBD device buffer (1 means no batching) \n\
-\n\
-MISC: \n\
-\t mgb<Y|N> \t mergeable? (Y or N) \n\
-\t lock<Y|N> \t lock the on-access page to serialize accesses? (Y or N) \n\
-\t cache<WB|WC|UC> which CPU cache policy to use? Write Back (WB), Write Combined (WC), or Uncacheable (UC)\n\
-\t subupdate<Y|N> only update the changed cachelines of a page? (Y or N) (check PMBD_CACHELINE_SIZE) \n\
-\t timestat<Y|N> enable the detailed timing statistics (/proc/pmbd/pmbdstat)? (Y or N) (This will cause significant performance slowdown) \n\
-\n\
-NOTE: \n\
-\t (1) Options rdlat/wrlat specify only the minimum access times; real access times can be higher.\n\
-\t (2) If rdsx/wrsx is specified, rdlat/wrlat/rdbw/wrbw are ignored. \n\
-\t (3) Option simmode1 applies the simulated speeds to the PM space only, rather than to the whole device, which may include a buffer.\n\
-\n\
-WARNING: \n\
-\t (1) When using simmode1 to simulate slow PM space, soft-lockup warnings may appear. Use the \"nosoftlockup\" boot option to disable them.\n\
-\t (2) Enabling timestat may cause performance degradation.\n\
-\t (3) FUA is supported in PMBD, but if a buffer is used (for PT-based protection), enabling FUA lowers performance due to double writes.\n\
-\t (4) Changing CPU-cache-related PTE attributes is not supported for VM-based PMBD (it can cause RCU stalls).\n\
-\n\
-PROC ENTRIES: \n\
-\t /proc/pmbd/pmbdcfg config info about the PMBD devices\n\
-\t /proc/pmbd/pmbdstat statistics of the PMBD devices (if timestat is enabled)\n\
-\n\
-EXAMPLE: \n\
-\t Assuming a 16GB PM space with physical memory addresses from 8GB to 24GB:\n\
-\t (1) Basic (Ramdisk): \n\
-\t $ sudo modprobe pmbd mode=\"pmbd16;hmo8;hms16;\"\n\n\
-\t (2) Protected (with private mapping): \n\
-\t $ sudo modprobe pmbd mode=\"pmbd16;hmo8;hms16;pmapY;\"\n\n\
-\t (3) Protected and synced (with private mapping, non-temp store): \n\
-\t $ sudo modprobe pmbd mode=\"pmbd16;hmo8;hms16;pmapY;ntsY;\"\n\n\
-\t (4) *** RECOMMENDED CONFIG *** \n\
-\t Protected, synced, and ordered (with private mapping, non-temp store, write barrier): \n\
-\t $ sudo modprobe pmbd mode=\"pmbd16;hmo8;hms16;pmapY;ntsY;wbY;\"\n\
-\n"
-
-/* functions */
-static inline void pmbd_set_pages_ro(PMBD_DEVICE_T* pmbd, void* addr, uint64_t bytes, unsigned on_access);
-static inline void pmbd_set_pages_rw(PMBD_DEVICE_T* pmbd, void* addr, uint64_t bytes, unsigned on_access);
-static inline void pmbd_clflush_range(PMBD_DEVICE_T* pmbd, void* dst, size_t bytes);
-static inline int pmbd_verify_wr_pages(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes);
-static int pmbd_checksum_on_write(PMBD_DEVICE_T* pmbd, void* vaddr, size_t bytes);
-static int pmbd_checksum_on_read(PMBD_DEVICE_T* pmbd, void* vaddr, size_t bytes);
-
-static inline int put_ulong(unsigned long arg, unsigned long val)
-{
- return put_user(val, (unsigned long __user *)arg);
-}
-static inline int put_u64(unsigned long arg, u64 val)
-{
- return put_user(val, (u64 __user *)arg);
-}
-
-static inline void mfence(void)
-{
-	/* the "memory" clobber is a compiler barrier: without it the
-	 * compiler may reorder memory accesses across the fence */
-	asm volatile("mfence" ::: "memory");
-}
-
-static inline void sfence(void)
-{
-	asm volatile("sfence" ::: "memory");
-}
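
sfence() is the ordering half of the nts option described above: non-temporal stores bypass the CPU cache, and a single trailing sfence makes them globally visible before a write is reported durable. A minimal sketch of such a copy (nt_copy is an illustration, not the driver's actual memcpy path; it assumes 8-byte-aligned buffers and a size that is a multiple of 8):

  static void nt_copy(void *dst, const void *src, size_t bytes)
  {
  	const u64 *s = src;
  	u64 *d = dst;
  	size_t i;

  	/* 64-bit non-temporal stores: go (write-combined) to memory
  	 * without filling cache lines that will not be re-read */
  	for (i = 0; i < bytes / 8; i++)
  		asm volatile("movnti %1, %0" : "=m" (d[i]) : "r" (s[i]));

  	sfence();    /* order the stores before returning */
  }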
-
-#endif
-/* THE END */
--
1.8.3.4