Hi everyone,

I conducted a few experiments with a workload to compare the following
parameters with and without this patchset:
1. The performance (throughput) of the workload.
2. The sum of the wait time to run of the processes queued on each
   cpu, i.e. the cumulative latency.
3. The number of migrations of tasks between cpus.

The observations and inferences are given below:
Experimental setup:

1. The workload is at the end of the mail. Every run of the workload
   was for 10s.
2. A different number of long running and short running threads was
   run each time.
3. The setup was a two socket pre-Nehalem machine, but one socket had
   all its cpus offlined. Thus only one socket, consisting of 4 cores,
   was active throughout the experiment.
4. The statistics below have been collected from /proc/schedstat,
   except throughput, which is output by the workload.
   - Latency has been observed from the eighth field in the cpu
     statistics in /proc/schedstat:
         cpu<N> 1 2 3 4 5 6 7 "8" 9
     (A minimal sketch of how I read this out is given right after
     the table below.)
   - The number of migrations has been calculated by summing up the
     #pulls during the idle, busy and newly_idle states of all the
     cpus. This is also given by /proc/schedstat.
5. FieldA -> #short-running-tasks [for every 10ms that pass, sleep
   for 9ms and work for 1ms], i.e. a 10% task.
   FieldB -> #long-running-tasks
   Field1 -> Throughput with patch (records/s read)
   Field2 -> Throughput without patch (records/s read)
   Field3 -> #Migrations with patch
   Field4 -> #Migrations without patch
   Field5 -> Latency with patch
   Field6 -> Latency without patch

  A    B       1          2        3      4       5       6
 -------------------------------------------------------------
  5    5   49,93,368  48,68,351    108    28     22s     18.3s
  4    2   34,37,669  34,37,547     58    50     0.6s    0.17s
 16    0   38,66,597  38,74,580   1151  1014     1.88s   1.65s

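As promised above, here is a minimal sketch of the helper I use to
read the cumulative latency out of /proc/schedstat: it sums field 8
(time tasks spent waiting to run) over all cpu lines. A snapshot is
taken before and after each run and the difference is reported. The
file name schedstat_wait.c is just illustrative, and the unit of the
raw sum depends on the schedstat version.

/* schedstat_wait.c - sum of field 8 (time spent waiting to run)
 * over all cpu lines of /proc/schedstat. Take a snapshot before
 * and after a run and subtract to get the cumulative latency.
 */
#include <stdio.h>

int main(void)
{
        char line[512];
        unsigned long long f[9], total = 0;
        FILE *fp = fopen("/proc/schedstat", "r");

        if (!fp) {
                perror("/proc/schedstat");
                return 1;
        }
        while (fgets(line, sizeof(line), fp)) {
                /* cpu lines look like: cpu<N> 1 2 3 4 5 6 7 8 9 */
                if (sscanf(line,
                    "cpu%*u %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                    &f[0], &f[1], &f[2], &f[3], &f[4],
                    &f[5], &f[6], &f[7], &f[8]) == 9)
                        total += f[7];  /* the eighth field */
        }
        fclose(fp);
        printf("cumulative wait-to-run: %llu\n", total);
        return 0;
}
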
Inferences:

1. Clearly an increase in the number of pulls can be seen with this
   patch, and this has resulted in an increase in the latency. This
   *should have* resulted in a decrease in throughput, but in the
   first two cases this is not reflected. This could be due to some
   error in the benchmark itself or in the way I am calculating the
   throughput. Keeping this issue aside, I focus on the #pulls and
   the latency effect.

2. On integrating PJT's metric with the load balancer, the number of
   migrations increases for the following reason, which I figured out
   by going through the traces:

        Task1                     Task3
        Task2                     Task4
        ------                    ------
        Group1                    Group2

        Case1: Load_as_per_pjt     1028    1121
        Case2: Load_without_pjt    2048    2048

                        Fig1.

   During load balancing:
   Case1: Group2 is overloaded, so one of its tasks is moved to Group1.
   Case2: Group1 and Group2 are equally loaded, hence no migrations.

   This is observed so many times that it is no wonder the #migrations
   have increased with this patch. Here "Group" refers to a
   sched_group. (A small simulation of how PJT's metric arrives at
   such numbers is given after the inferences below.)

3. The next obvious step was to see whether so many migrations with
   my patch are prudent or not. The latency numbers reflect that they
   are not.

4. As I said earlier, I keep throughput out of these inferences
   because it distracts us from something that is stark clear:
   *The migrations incurred due to PJT's metric are not affecting the
   tasks positively.*

5. The above is my first observation. This does not, however, mean
   that using PJT's metric with the load balancer is a bad idea. It
   could mean many things, out of which the correct one has to be
   figured out. Among them I list out a few:

   a) Simply replacing the existing metric used by the load balancer
      with PJT's metric might not really derive the benefit that
      PJT's metric has to offer.
   b) I have not been able to figure out what kind of workloads
      actually benefit from the way we have applied PJT's metric.
      Maybe we are using a workload which is getting adversely
      affected.

6. My next step, in my opinion, will be to resolve the following
   issues, in decreasing order of priority:

   a) Run some other benchmark, like kernbench, and find out if the
      throughput reflects the increase in latency correctly. If it
      does, then I will need to find out why the current benchmark
      was behaving weird; else I will need to go through the traces
      to figure out this issue.
   b) If I find that the throughput is consistent with the latency,
      then we need to modify the strictness (the granularity of time
      at which the load is getting updated) with which PJT's metric
      calculates load, or use it in some other way in load balancing.

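As referenced under inference 2, below is a rough floating-point
simulation of my understanding of how PJT's metric discounts a task's
weight by its recent runnability. The constants (roughly 1ms
accounting periods, history halving every 32 periods, NICE_0_LOAD of
1024) are my reading of the patchset; the kernel itself uses
fixed-point arithmetic and a precomputed decay table, so treat this
only as a sketch of the arithmetic, not as the implementation.

/* pelt_sketch.c - a rough, floating point sketch of PJT's metric:
 * the entity's runnable history is accumulated in ~1ms periods and
 * decayed geometrically so that it halves every 32 periods; the load
 * contribution is then weight * runnable_sum / total_period.
 *
 * Compile with: gcc pelt_sketch.c -lm
 */
#include <stdio.h>
#include <math.h>

#define NICE_0_LOAD 1024        /* weight of a nice-0 task */

int main(void)
{
        double y = pow(0.5, 1.0 / 32.0);        /* y^32 = 0.5 */
        double runnable_sum = 0.0, total_period = 0.0;
        int ms;

        /* A 10% task, as in the workload: runnable 1ms per 10ms. */
        for (ms = 0; ms < 1000; ms++) {
                runnable_sum = runnable_sum * y +
                                ((ms % 10) == 0 ? 1024 : 0);
                total_period = total_period * y + 1024;
        }
        printf("load contribution: %.0f (vs %d under the existing "
               "metric)\n",
               NICE_0_LOAD * runnable_sum / total_period, NICE_0_LOAD);
        return 0;
}

For the 10% task of the workload this prints a contribution of
roughly a tenth of NICE_0_LOAD instead of the full 1024, which is how
two groups holding the same number of tasks can report quite
different loads, as in Fig1, once the tasks' recent behaviour differs.
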
Looking forward to your feedback on this :)

--------------------------BEGIN WORKLOAD---------------------------------
/*
 * test.c - Two instances of this program are run: one instance where
 * the sleep time is 0 and another instance which sleeps at regular
 * intervals. This is done to create both long running and short
 * running tasks on the cpu.
 *
 * Multiple threads are created by each instance. The threads request
 * a memory chunk, write into it and then free it. This is done
 * throughout the period of the run.
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License as
 * published by the Free Software Foundation; version 2 of the License.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
 * USA
 */

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <pthread.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <malloc.h>

/* Variable entities */
static unsigned int seconds;
static unsigned int threads;
static unsigned int mem_chunk_size;
static unsigned int sleep_at;
static unsigned int sleep_interval;

/* Fixed entities */
typedef size_t mem_slot_t;      /* 8 bytes on a 64-bit machine */
static unsigned int slot_size = sizeof(mem_slot_t);

/* Other parameters */
static volatile int start;
static time_t start_time;
static unsigned int records_read;
pthread_mutex_t records_count_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned int write_to_mem(void)
{
        int i, j;
        mem_slot_t *scratch_pad, *temp;
        mem_slot_t *end;

        mem_chunk_size = slot_size * 256;
        /* Sleep after every 2800 records for the short-running
         * instance; set sleep_at = 0 for the long-running instance. */
        sleep_at = 2800;
        sleep_interval = 9000;  /* sleep for 9ms */

        for (i = 0; start == 1; i++) {
                /* Ask for a memory chunk */
                scratch_pad = (mem_slot_t *)malloc(mem_chunk_size);
                if (scratch_pad == NULL) {
                        fprintf(stderr, "Could not allocate memory\n");
                        exit(1);
                }
                end = scratch_pad + (mem_chunk_size / slot_size);

                /* Write into this chunk */
                for (temp = scratch_pad, j = 0; temp < end; temp++, j++)
                        *temp = (mem_slot_t)j;

                /* Free this chunk */
                free(scratch_pad);

                /* Decide the duty cycle; currently 10ms */
                if (sleep_at && !(i % sleep_at))
                        usleep(sleep_interval);
        }
        return i;
}

static void *thread_run(void *arg)
{
        unsigned int records_local;

        /* Wait for the start signal */
        while (start == 0)
                ;

        records_local = write_to_mem();

        pthread_mutex_lock(&records_count_lock);
        records_read += records_local;
        pthread_mutex_unlock(&records_count_lock);

        return NULL;
}

static void start_threads(void)
{
        double diff_time;
        unsigned int i;
        int err;

        threads = 8;
        seconds = 10;

        pthread_t thread_array[threads];

        for (i = 0; i < threads; i++) {
                err = pthread_create(&thread_array[i], NULL,
                                     thread_run, NULL);
                if (err) {
                        fprintf(stderr, "Error creating thread %u\n", i);
                        exit(1);
                }
        }
        start_time = time(NULL);
        start = 1;
        sleep(seconds);
        start = 0;
        diff_time = difftime(time(NULL), start_time);

        for (i = 0; i < threads; i++) {
                err = pthread_join(thread_array[i], NULL);
                if (err) {
                        fprintf(stderr, "Error joining thread %u\n", i);
                        exit(1);
                }
        }
        printf("%u records/s\n",
               (unsigned int)(((double)records_read) / diff_time));
}

int main(void)
{
        start_threads();
        return 0;
}

------------------------END WORKLOAD------------------------------------

Regards
Preeti U Murthy