[RFC] LFTL: a FTL for large parallel IO flash cards

Sat Nov 17 06:37:39 EST 2012

On Sat, Nov 17, 2012 at 2:04 AM, Ezequiel Garcia <elezegarcia at gmail.com> wrote:
> Hi,
>
> Thanks for the patch!
>
> Though I haven't read the code in detail, I have a few minor comments to make.

Thanks for the comments. This is my first big patch. I have only sent
trivial patches before.
Sorry for the mistakes
I will resend with formatted patch later.

I realize the code is not perfect. There are also some very long functions.
For this thread, I request people to kindly ignore code format.
Some general comments with respect to design, any reference
papers/code i should be aware of,
or ideas to improve performance(particularly random IO performance)
 will be very helpful for me.

thanks,
mugunthan

>
> See below.
>
> On Fri, Nov 16, 2012 at 4:34 PM, srimugunthan dhandapani
> <srimugunthan.dhandapani at gmail.com> wrote:
>> Hi all,
>>
>>  Due to fundamental limits like size-per-chip and interface speed
>>  limits all  large capacity Flash  are made of multiple chips or banks.
>>  The presence of multiple chips readily offers parallel read or write support.
>>  Unlike an SSD, for a raw flash card , this parallelism is visible to
>>  the software layer and there are many opportunities
>>  for exploiting this parallelism.
>>
>>  The presented LFTL is meant for flash cards with multiple banks and
>>  larger minimum write sizes.
>>  LFTL mostly reuses code from mtd_blkdevs.c and mtdblock.c.
>>  The LFTL was tested on  a 512GB raw flash card which has no firmware
>>  for wearlevelling or garbage collection.
>>
>>  The following are the important points regarding the LFTL:
>>
>>  1. multiqueued/multithreaded design:(Thanks  to Joern engel for a
>>  mail-discussion)
>>  The mtd_blkdevs.c dequeues block I/O requests from the block layer
>>  provided request queue from a single kthread.
>>  This design of IO requests dequeued from a single queue by a single
>>  thread is a bottleneck for flash cards that supports hundreds of MB/sec.
>>  We use a multiqueued and multithreaded design.
>>  We bypass the block layer by registering a new make_request and
>>  the LFTL maintains several queues of its own and the block IO requests are
>>  put in one of these queues. For every queue there is an associated kthread
>>  that processes requests from that queue. The number of "FTL IO kthreads"
>>  is #defined as 64 currently.
>>
>>  2. Block allocation and Garbage collection:
>>  The block allocation algorithm allocates blocks across the banks and
>>  so the multiple FTL IO kthreads can write to different banks
>>  simultaneously. With this design we were able to get upto 300MB/sec(
>>  although some more bandwidth is possible from the device)
>>  Further the block allocation tries to  exclude any bank that is being
>>  garbage collected and tries to allocate the block on bank that is
>>  idle.  Similar to IO, the garbage collection is performed from several
>>  co-running threads. The garbage collection is started on a bank where
>>  there is no I/O happening.  By scheduling the writes and garbage collection
>>  on the different banks,  we can considerably avoid garbage collection latency
>>  in the write path.
>>
>>
>>  3. Buffering within FTL :
>>  The LFTL exposes a block size of 4K. The flash card that we used has
>>  a minimum writesize of 32K. So we use several caching buffers inside
>>  the FTL to accumulate the 4Ks to a single 32K and then flush to the flash.
>>  The number of caching buffers is #defined as 256  but they can be
>>  tunable. The per-buffer size is set to page size of 32K.
>>  First the write requests are absorbed in these multiple FTL
>>  cache-buffers before they are flushed to flash device.
>>  We also flush the buffers  that are not modified for a considerable
>>  amount of time from a separate background thread.
>>
>>  4. Adaptive garbage collection:
>>   The amount of garbage collection that is happening at a particular
>>  instant is controlled by the number of active garbage collecting threads.
>>  The number of active garbage collecting threads is varied according to the
>>  current I/O load on the device. The current IO load level is inferred from
>>  the number of active FTL IO kthreads.  During idle times(number of active FTL
>>  IO kthreads = 0), the number of active garbage collecting threads is
>>  increased to maximum. When all FTL IO threads are active, we reduce
>>  the number of active garbage collecting threads to atmost one
>>
>>
>>  5. Checkpointing:
>>  Similar to Yaffs2, LFTL during module unload checkpoints the important
>>  data structures and during reload they are used to reconstruct the FTL
>>  datastructures.  The initial device scan is also parallelised and searching for
>>  checkpoint block during module load also happens from multiple
>>  kthreads.  The checkpoint information is stored in a chained fashion with one
>>  checkpoint block pointing to the next. The first checkpoint block is
>>  guaranteed to be in the first or last few blocks of any bank.
>>  So the parallel running threads only need to find the first checkpoint block
>>  and we  subsequently follow the linked chain.
>>
>>
>>  With LFTL we were able to get upto 300MB/sec for sequential workload.
>>  The device actually can support more than 300MB/sec and the LFTL still
>>  needs improvement (and more testing).
>>
>>  Despite the code being far from perfect, i am  sending the patch to
>>  get some initial feedback comments.
>>  Thanks in advance for your inputs for improving the FTL.
>>  -mugunthan
>>
>>  Signed-off-by: srimugunthan <srimugunthan.dhandapani at gmail.com>
>> ---
>>
>> diff --git a/drivers/mtd/Kconfig b/drivers/mtd/Kconfig
>> index 4be8373..c68b5d2 100644
>> --- a/drivers/mtd/Kconfig
>> +++ b/drivers/mtd/Kconfig
>> @@ -237,6 +237,15 @@ config NFTL
>>           hardware, although under the terms of the GPL you're obviously
>>           permitted to copy, modify and distribute the code as you wish. Just
>>           not use it.
>> +
>> +config LFTL
>> +       tristate "LFTL (FTL for parallel IO flash card) support"
>> +
>> +       ---help---
>> +         This provides support for the NAND Flash Translation Layer which is
>> +         meant for large capacity Raw flash cards with parallel I/O
>> +         capability
>> +
>>
>>  config NFTL_RW
>>         bool "Write support for NFTL"
>> diff --git a/drivers/mtd/Makefile b/drivers/mtd/Makefile
>> index 39664c4..8d36339 100644
>> --- a/drivers/mtd/Makefile
>> +++ b/drivers/mtd/Makefile
>> @@ -19,6 +19,7 @@ obj-$(CONFIG_MTD_BLOCK)               += mtdblock.o
>>  obj-$(CONFIG_MTD_BLOCK_RO)     += mtdblock_ro.o
>>  obj-$(CONFIG_FTL)              += ftl.o
>>  obj-$(CONFIG_NFTL)             += nftl.o
>> +obj-$(CONFIG_LFTL)             += lftl.o
>>  obj-$(CONFIG_INFTL)            += inftl.o
>>  obj-$(CONFIG_RFD_FTL)          += rfd_ftl.o
>>  obj-$(CONFIG_SSFDC)            += ssfdc.o
>> diff --git a/drivers/mtd/lftl.c b/drivers/mtd/lftl.c
>> new file mode 100644
>> index 0000000..7f446e0
>> --- /dev/null
>> +++ b/drivers/mtd/lftl.c
>> @@ -0,0 +1,6417 @@
>> +/*
>> + * lftl: A FTL for Multibanked flash cards with parallel I/O capability
>> + *
>> + *
>> + * this file heavily reuses code from  linux-mtd layer
>> + * Modified over  the files
>> + *     1. mtdblock.c (authors:  David Woodhouse <dwmw2 at infradead.org>
>> and Nicolas Pitre <nico at fluxnic.net>)
>> + *     2. mtd_blkdevs.c(authors: David Woodhouse <dwmw2 at infradead.org>)
>> + * code reuse from from urcu library for  lock-free queue (author:
>> Mathieu Desnoyers <mathieu.desnoyers at efficios.com>)
>> + * author of this file: Srimugunthan Dhandapani
>> <srimugunthan.dhandapani at gmail.com>
>> + *
>> + * follows the same licensing of mtdblock.c and mtd_blkdevs.c
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, write to the Free Software
>> + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
>> + *
>> + */
>> +
>> +#include <linux/kernel.h>
>> +#include <linux/slab.h>
>> +#include <linux/module.h>
>> +#include <linux/list.h>
>> +#include <linux/fs.h>
>> +
>> +#include <linux/mtd/mtd.h>
>> +#include <linux/blkdev.h>
>> +#include <linux/blkpg.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/hdreg.h>
>> +#include <linux/init.h>
>> +#include <linux/mutex.h>
>> +#include <linux/kthread.h>
>> +#include <asm/uaccess.h>
>> +#include <linux/random.h>
>> +#include <linux/kthread.h>
>> +#include "lftl.h"
>> +
>> +
>> +
>> +
>> +
>> +
>> +
>> +
>> +
>
> Why these blank lines?
>
>
>> +#define INVALID_VALUE (-1)
>> +
>> +#define MAX_FTL_CACHEBUFS 256
>> +
>> +#define STATE_EMPTY 0
>> +#define STATE_DIRTY 1
>> +#define STATE_FULL 2
>> +#define STATE_CLEAN 3
>> +
>> +#define GC_THRESH 100000
>> +#define INVALID -1
>> +#define RAND_SEL -2
>> +
>> +#define ACCEPTABLE_THRESH 256
>> +
>> +#define INVALID_PAGE_NUMBER (0xFFFFFFFF)
>> +
>> +
>> +#define INVALID_CACHE_NUM 0x7F
>> +#define INVALID_SECT_NUM 0xFFFFFF
>> +
>> +#define NO_BUF_FOR_USE -1
>> +#define NUM_FREE_BUFS_THRESH 5
>> +
>> +
>> +
>
> More blank lines here... and there are many more.
> They make the code ugly, at least to me.
>
>
>> +#define BLK_BITMAP_SIZE 4096
>> +
>> +
>> +#define GC_NUM_TOBE_SCHEDULED 2
>> +
>> +#define DATA_BLK 1
>> +#define MAP_BLK 2
>> +#define FREE_BLK 0xFFFFFFFF
>> +#define NUM_GC_LEVELS 3
>> +#define GC_LEVEL0 0
>> +#define GC_LEVEL1 1
>> +#define GC_LEVEL2 2
>> +
>> +#define GC_OP 0
>> +#define WR_OP 1
>> +#define PFTCH_OP 2
>> +#define RD_OP 3
>> +
>> +
>> +#define INVALID_PAGE_NUMBER_32 0xFFFFFFFF
>> +
>> +#define NUM_GC_THREAD 8
>> +
>> +#define MAX_NUM_PLL_BANKS 64
>> +
>> +#define MAX_PAGES_PER_BLK 64
>> +
>> +#define CKPT_RANGE 10
>> +
>> +uint32_t numpllbanks = MAX_NUM_PLL_BANKS;
>> +
>> +module_param(numpllbanks, int, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
>> +MODULE_PARM_DESC(numpllbanks, "Number of parallel bank units in the flash");
>> +
>> +uint32_t first_time = 0;
>> +module_param(first_time, int, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
>> +MODULE_PARM_DESC(numpllbanks, "boolean value, if the module is loaded
>> firsttime");
>> +
>> +
>> +static atomic_t num_gcollected;
>> +static atomic_t gc_on_writes_collisions;
>> +static atomic_t num_gc_wakeups;
>> +
>> +static atomic_t num_l0_gcollected;
>> +static atomic_t num_l1_gcollected;
>> +static atomic_t num_l2_gcollected;
>> +static atomic_t num_erase_gcollected;
>> +static atomic_t num_cperase_gcollected;
>> +
>> +static atomic_t num_gc_threads;
>> +
>> +
>> +
>> +
>> +
>> +
>> +static unsigned long *page_bitmap;
>> +static unsigned long *page_incache_bitmap;
>> +static unsigned long *maptab_bitmap;
>> +static unsigned long *gc_map;
>> +static unsigned long *gc_bankbitmap;
>> +
>> +
>> +
>> +struct extra_info_struct
>> +{
>> +       uint64_t sequence_number;
>> +};
>> +
>> +struct extra_info_struct *extra_info;
>> +
>> +struct ftlcache
>> +{
>> +
>> +       uint8_t cache_state;
>> +       unsigned long cache_offset;
>> +       unsigned long sect_idx;
>> +       unsigned long page_idx;
>> +       uint32_t logic_page;
>> +       long unsigned int written_mask;
>> +
>> +
>> +       atomic_t writes_in_progress ;
>> +       atomic_t flush_in_progress;
>> +       atomic_t wait_to_flush;
>> +       unsigned long last_touch;
>> +
>> +}__attribute__((packed));
>> +
>> +
>> +
>> +struct cur_wr_info{
>> +       uint32_t first_blk;
>> +       uint32_t last_blk;
>> +       uint32_t last_gc_blk;
>> +       uint32_t blk;
>> +       uint8_t state;
>> +       uint8_t last_wrpage;
>> +       int centroid;
>> +};
>> +
>> +struct scan_thrd_info
>> +{
>> +       struct lftlblk_dev *mtdblk;
>> +       int bank;
>> +};
>> +
>> +
>> +struct rw_semaphore map_tabl_lock;
>> +
>> +static uint64_t *map_table;
>> +
>> +static uint64_t *reverse_map_tab;
>> +static uint64_t *scanseqnumber;
>> +
>> +uint64_t buf_lookup_tab[MAX_FTL_CACHEBUFS];
>> +
>> +
>> +
>> +static struct kmem_cache *qnode_cache;
>> +
>> +static struct lfq_queue_rcu empty_bufsq;
>> +static struct lfq_queue_rcu full_bufsq;
>> +
>> +static struct lfq_queue_rcu spare_bufQ;
>> +
>> +static struct lfq_queue_rcu spare_oobbufQ;
>> +
>> +
>> +void *spare_cache_list_ptr[2*MAX_FTL_CACHEBUFS];
>> +void *spare_oobbuf_list_ptr[2*MAX_FTL_CACHEBUFS];
>> +
>> +
>> +int scheduled_for_gc[GC_NUM_TOBE_SCHEDULED];
>> +
>> +
>> +
>> +
>> +
>> +struct per_bank_info
>> +{
>> +       atomic_t perbank_nfree_blks;
>> +       atomic_t perbank_ndirty_pages;
>> +};
>> +
>> +struct per_blk_info
>> +{
>> +
>> +       atomic_t num_valid_pages;
>> +       DECLARE_BITMAP(valid_pages_map, MAX_PAGES_PER_BLK );
>> +};
>> +
>> +
>> +struct oob_data
>> +{
>> +       char blk_type;  /*  Status of the block: data pages/map pages/unused */
>> +       uint32_t logic_page_num;
>> +       int32_t seq_number;     /* The sequence number of this block */
>> +
>> +}__attribute__((packed));
>> +
>> +
>> +
>> +
>> +struct per_blk_info *blk_info;
>> +
>> +struct per_bank_info bank_info[MAX_NUM_PLL_BANKS];
>> +
>> +
>> +
>> +struct bank_activity_matrix
>> +{
>> +       atomic_t  num_reads[MAX_NUM_PLL_BANKS];
>> +       atomic_t  num_writes[MAX_NUM_PLL_BANKS];
>> +       atomic_t  gc_goingon[MAX_NUM_PLL_BANKS];
>> +       atomic_t  num_reads_pref[MAX_NUM_PLL_BANKS];
>> +};
>> +
>> +
>> +
>> +
>> +
>> +
>> +
>> +static atomic_t activenumgcthread;
>> +
>> +
>> +
>> + struct lftlblk_dev;
>> +
>> + struct gcthread_arg_data
>> + {
>> +        int thrdnum;
>> +        struct lftlblk_dev *mtdblk_ptr;
>> + };
>> +
>> + struct lftlblk_dev {
>> +        struct lftl_blktrans_dev mbd;
>> +        int count;
>> +        unsigned int cache_size;
>> +        atomic_t freeblk_count;
>> +        uint32_t num_blks;
>> +        uint32_t num_cur_wr_blks;
>> +
>> +        unsigned long *free_blk_map;
>> +
>> +        uint64_t ckptrd_mask;
>> +
>> +        uint32_t blksize;
>> +        uint8_t blkshift;
>> +        uint8_t pageshift;
>> +        uint32_t num_parallel_banks;
>> +        uint32_t blks_per_bank;
>> +        uint32_t pages_per_blk;
>> +        uint32_t num_total_pages;
>> +
>> +        struct cur_wr_info cur_writing[MAX_NUM_PLL_BANKS];
>> +        struct rw_semaphore cur_wr_state[MAX_NUM_PLL_BANKS];
>> +
>> +
>> +        struct rw_semaphore **free_map_lock;
>> +
>> +
>> +
>> +        struct mutex select_buf_lock;
>> +
>> +        uint8_t *exper_buf;
>> +        uint8_t *FFbuf;
>> +        int exper_buf_sect_idx;
>> +        struct mutex exper_buf_lock;
>> +        struct mutex flush_buf_lock;
>> +        uint8_t *buf[MAX_FTL_CACHEBUFS];
>> +        struct mutex buf_lock[MAX_FTL_CACHEBUFS];
>> +        struct ftlcache cached_buf[MAX_FTL_CACHEBUFS];
>> +
>> +
>> +        struct mutex  buf_lookup_tab_mutex;
>> +        uint64_t cache_fullmask;
>> +
>> +        atomic_t cache_assign_count;
>> +        atomic_t seq_num;
>> +
>> +
>> +        struct bank_activity_matrix activity_matrix;
>> +        struct task_struct *bufflushd;
>> +        int gc_thresh[NUM_GC_LEVELS];
>> +        struct task_struct *ftlgc_thrd[NUM_GC_THREAD];
>> +        int reserved_blks_per_bank;
>> +
>> +        int first_ckpt_blk;
>> +
>> +
>> +        int hwblks_per_bank;
>> +        unsigned long last_wr_time;
>> +
>> +        unsigned long *gc_active_map;
>> +
>> +        int init_not_done;
>> +        struct gcthread_arg_data gcthrd_arg[NUM_GC_THREAD];
>> +
>> + };
>> +
>> + static struct mutex mtdblks_lock;
>> +
>> +#define lftl_assert(expr) do {                                                \
>> +         if (unlikely(!(expr))) {
>>           \
>> +                 printk(KERN_CRIT "lftl: assert failed in %s at %u
>> (pid %d)\n", \
>> +                        __func__, __LINE__, current->pid);
>>           \
>> +                 dump_stack();
>>           \
>> +         }
>>           \
>> + } while (0)
>> +
>> +extern struct mutex mtd_table_mutex;
>> +extern struct mtd_info *__mtd_next_device(int i);
>> +
>> +#define mtd_for_each_device(mtd)                       \
>> +       for ((mtd) = __mtd_next_device(0);              \
>> +            (mtd) != NULL;                             \
>> +            (mtd) = __mtd_next_device(mtd->index + 1))
>> +
>> +
>> +static LIST_HEAD(blktrans_majors);
>> +static DEFINE_MUTEX(blktrans_ref_mutex);
>> +
>> +
>> +static struct kmem_cache *lftlbiolist_cachep;
>> +static mempool_t *biolistpool;
>> +static struct lfq_queue_rcu rcuqu[VIRGO_NUM_MAX_REQ_Q];
>> +static uint32_t last_lpn[VIRGO_NUM_MAX_REQ_Q];
>> +
>> +struct bio_node {
>> +       struct lfq_node_rcu list;
>> +       struct rcu_head rcu;
>> +       struct bio *bio;
>> +};
>> +
>> +#ifdef MAP_TABLE_ONE_LOCK
>> +       #define map_table_lock(pagenum)   {     down_read(&(map_tabl_lock)); }
>> +       #define map_table_unlock(pagenum) {  up_read(&(map_tabl_lock)); }
>> +#else
>> +       #define map_table_lock(pagenum)   do{ while
>> (test_and_set_bit((pagenum), maptab_bitmap) != 0){      \
>> +                                                                               schedule();             \
>> +                                                                               } \
>> +                                       }while(0)
>> +
>> +       #define map_table_unlock(pagenum) do{   if
>> (test_and_clear_bit((pagenum), maptab_bitmap) == 0) \
>> +                                                                               { \
>> +                                                                               printk(KERN_ERR "lftl: mapbitmap cleared wrong"); \
>> +                                                                               BUG(); \
>> +                                                                               }\
>> +                                                                        }while(0)
>> +#endif
>> +
>> +void lftl_blktrans_dev_release(struct kref *kref)
>> +{
>> +       struct lftl_blktrans_dev *dev =
>> +                       container_of(kref, struct lftl_blktrans_dev, ref);
>> +
>> +       dev->disk->private_data = NULL;
>> +       blk_cleanup_queue(dev->rq);
>> +       put_disk(dev->disk);
>> +       list_del(&dev->list);
>> +       kfree(dev);
>> +}
>> +
>> +static struct lftl_blktrans_dev *lftl_blktrans_dev_get(struct gendisk *disk)
>> +{
>> +       struct lftl_blktrans_dev *dev;
>> +
>> +       mutex_lock(&blktrans_ref_mutex);
>> +       dev = disk->private_data;
>> +
>> +       if (!dev)
>> +               goto unlock;
>> +       kref_get(&dev->ref);
>> +unlock:
>> +       mutex_unlock(&blktrans_ref_mutex);
>> +       return dev;
>> +}
>> +
>> +void lftl_blktrans_dev_put(struct lftl_blktrans_dev *dev)
>> +{
>> +       mutex_lock(&blktrans_ref_mutex);
>> +       kref_put(&dev->ref, lftl_blktrans_dev_release);
>> +       mutex_unlock(&blktrans_ref_mutex);
>> +}
>> +
>> +
>> +
>> +
>> +
>> +
>> +
>> +void init_device_queues(struct lftl_blktrans_dev *dev)
>> +{
>> +
>> +       int i;
>> +
>> +
>> +       lftlbiolist_cachep = kmem_cache_create("mybioQ",
>> +                                            sizeof(struct lftl_bio_list), 0, SLAB_PANIC, NULL);
>> +
>> +       biolistpool = mempool_create(BLKDEV_MIN_RQ, mempool_alloc_slab,
>> +                                    mempool_free_slab, lftlbiolist_cachep);
>> +       for(i = 0;i < VIRGO_NUM_MAX_REQ_Q;i++)
>> +               INIT_LIST_HEAD(&dev->qu[i].qelem_ptr);
>> +
>> +
>> +
>> +
>> +
>> +       for(i = 0;i < VIRGO_NUM_MAX_REQ_Q;i++)
>> +               lfq_init_rcu(&rcuqu[i], call_rcu);
>> +
>> +       for(i = 0;i < VIRGO_NUM_MAX_REQ_Q;i++)
>> +               spin_lock_init(&dev->mybioq_lock[i]);
>> +}
>> +
>> +void deinit_device_queues(struct lftl_blktrans_dev *dev)
>> +{
>> +       mempool_destroy(biolistpool);
>> +
>> +       kmem_cache_destroy(lftlbiolist_cachep);
>> +
>> +}
>> +
>> +
>> +static int lftl_make_request(struct request_queue *rq, struct bio *bio)
>> +{
>> +
>> +       struct lftl_blktrans_dev *dev;
>> +       int qnum;
>> +       gfp_t gfp_mask;
>> +       struct lftl_bio_list *tmp;
>> +       unsigned long temp_rand;
>> +
>> +       int i;
>> +       int found;
>> +
>> +
>> +       uint32_t lpn;
>> +
>> +
>> +       dev = rq->queuedata;
>> +
>> +       if (dev == NULL)
>> +               goto fail;
>> +       if(bio_data_dir(bio) == WRITE)
>> +       {
>
> This is not the correct coding style.
> Please read Documentation/CodingStyle (and perhaps
> Documentation/SubmittingPatches)
>
> Also, it seems to me you haven't passed this code through
> ./scripts/checkpatch.pl.
> Regularly, every patch sent should have no errors or warnings reported
> by ./scripts/checkpatch.pl.
>
> Hope this helps,
>
>     Ezequiel