Hi everyone,

I conducted a few experiments with a workload to compare the following
parameters with and without this patchset:
1. The performance (throughput) of the workload.
2. The sum of the wait time to run of the processes queued on each
   cpu, i.e. the cumulative latency.
3. The number of migrations of tasks between cpus.

The observations and inferences are given below:
Experimental setup:

1. The workload is at the end of the mail. Every run of the workload
   was for 10s.
2. A different number of long running and short running threads was
   run each time.
3. The setup was a two socket pre-Nehalem machine, but one socket had
   all its cpus offlined. Thus only one socket, consisting of 4 cores,
   was active throughout the experiment.
4. The statistics below have been collected from /proc/schedstat,
   except throughput, which is output by the workload.
   - Latency has been observed from the eighth field in the cpu
     statistics in /proc/schedstat:
         cpu<N> 1 2 3 4 5 6 7 "8" 9
     (A minimal sketch of how I read this out is given right after
     the table below.)
   - The number of migrations has been calculated by summing up the
     #pulls during the idle, busy and newly_idle states of all the
     cpus. This is also given by /proc/schedstat.
5. FieldA -> #short-running-tasks [for every 10ms that pass, sleep
   for 9ms and work for 1ms], i.e. a 10% task.
   FieldB -> #long-running-tasks
   Field1 -> Throughput with patch (records/s read)
   Field2 -> Throughput without patch (records/s read)
   Field3 -> #Migrations with patch
   Field4 -> #Migrations without patch
   Field5 -> Latency with patch
   Field6 -> Latency without patch

  A    B       1          2        3      4       5       6
 -------------------------------------------------------------
  5    5   49,93,368  48,68,351    108    28     22s     18.3s
  4    2   34,37,669  34,37,547     58    50     0.6s    0.17s
 16    0   38,66,597  38,74,580   1151  1014     1.88s   1.65s

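As promised above, here is a minimal sketch of the helper I use to
read the cumulative latency out of /proc/schedstat: it sums field 8
(time tasks spent waiting to run) over all cpu lines. A snapshot is
taken before and after each run and the difference is reported. The
file name schedstat_wait.c is just illustrative, and the unit of the
raw sum depends on the schedstat version.

/* schedstat_wait.c - sum of field 8 (time spent waiting to run)
 * over all cpu lines of /proc/schedstat. Take a snapshot before
 * and after a run and subtract to get the cumulative latency.
 */
#include <stdio.h>

int main(void)
{
        char line[512];
        unsigned long long f[9], total = 0;
        FILE *fp = fopen("/proc/schedstat", "r");

        if (!fp) {
                perror("/proc/schedstat");
                return 1;
        }
        while (fgets(line, sizeof(line), fp)) {
                /* cpu lines look like: cpu<N> 1 2 3 4 5 6 7 8 9 */
                if (sscanf(line,
                    "cpu%*u %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                    &f[0], &f[1], &f[2], &f[3], &f[4],
                    &f[5], &f[6], &f[7], &f[8]) == 9)
                        total += f[7];  /* the eighth field */
        }
        fclose(fp);
        printf("cumulative wait-to-run: %llu\n", total);
        return 0;
}
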
Inferences:

1. Clearly an increase in the number of pulls can be seen with this
   patch, and this has resulted in an increase in the latency. This
   *should have* resulted in a decrease in throughput, but in the
   first two cases this is not reflected. This could be due to some
   error in the benchmark itself or in the way I am calculating the
   throughput. Keeping this issue aside, I focus on the #pulls and
   the latency effect.

2. On integrating PJT's metric with the load balancer, the number of
   migrations increases for the following reason, which I figured out
   by going through the traces:

        Task1                     Task3
        Task2                     Task4
        ------                    ------
        Group1                    Group2

        Case1: Load_as_per_pjt     1028    1121
        Case2: Load_without_pjt    2048    2048

                        Fig1.

   During load balancing:
   Case1: Group2 is overloaded, so one of its tasks is moved to Group1.
   Case2: Group1 and Group2 are equally loaded, hence no migrations.

   This is observed so many times that it is no wonder the #migrations
   have increased with this patch. Here "Group" refers to a
   sched_group. (A small simulation of how PJT's metric arrives at
   such numbers is given after the inferences below.)

3. The next obvious step was to see whether so many migrations with
   my patch are prudent or not. The latency numbers reflect that they
   are not.

4. As I said earlier, I keep throughput out of these inferences
   because it distracts us from something that is stark clear:
   *The migrations incurred due to PJT's metric are not affecting the
   tasks positively.*

5. The above is my first observation. This does not, however, mean
   that using PJT's metric with the load balancer is a bad idea. It
   could mean many things, out of which the correct one has to be
   figured out. Among them I list out a few:

   a) Simply replacing the existing metric used by the load balancer
      with PJT's metric might not really derive the benefit that
      PJT's metric has to offer.
   b) I have not been able to figure out what kind of workloads
      actually benefit from the way we have applied PJT's metric.
      Maybe we are using a workload which is getting adversely
      affected.

6. My next step, in my opinion, will be to resolve the following
   issues, in decreasing order of priority:

   a) Run some other benchmark, like kernbench, and find out if the
      throughput reflects the increase in latency correctly. If it
      does, then I will need to find out why the current benchmark
      was behaving weird; else I will need to go through the traces
      to figure out this issue.
   b) If I find that the throughput is consistent with the latency,
      then we need to modify the strictness (the granularity of time
      at which the load is getting updated) with which PJT's metric
      calculates load, or use it in some other way in load balancing.

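As referenced under inference 2, below is a rough floating-point
simulation of my understanding of how PJT's metric discounts a task's
weight by its recent runnability. The constants (roughly 1ms
accounting periods, history halving every 32 periods, NICE_0_LOAD of
1024) are my reading of the patchset; the kernel itself uses
fixed-point arithmetic and a precomputed decay table, so treat this
only as a sketch of the arithmetic, not as the implementation.

/* pelt_sketch.c - a rough, floating point sketch of PJT's metric:
 * the entity's runnable history is accumulated in ~1ms periods and
 * decayed geometrically so that it halves every 32 periods; the load
 * contribution is then weight * runnable_sum / total_period.
 *
 * Compile with: gcc pelt_sketch.c -lm
 */
#include <stdio.h>
#include <math.h>

#define NICE_0_LOAD 1024        /* weight of a nice-0 task */

int main(void)
{
        double y = pow(0.5, 1.0 / 32.0);        /* y^32 = 0.5 */
        double runnable_sum = 0.0, total_period = 0.0;
        int ms;

        /* A 10% task, as in the workload: runnable 1ms per 10ms. */
        for (ms = 0; ms < 1000; ms++) {
                runnable_sum = runnable_sum * y +
                                ((ms % 10) == 0 ? 1024 : 0);
                total_period = total_period * y + 1024;
        }
        printf("load contribution: %.0f (vs %d under the existing "
               "metric)\n",
               NICE_0_LOAD * runnable_sum / total_period, NICE_0_LOAD);
        return 0;
}

For the 10% task of the workload this prints a contribution of
roughly a tenth of NICE_0_LOAD instead of the full 1024, which is how
two groups holding the same number of tasks can report quite
different loads, as in Fig1, once the tasks' recent behaviour differs.
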
Looking forward to your feedback on this :)

--------------------------BEGIN WORKLOAD---------------------------------
/*
 * test.c - Two instances of this program are run: one instance where
 * the sleep time is 0 and another instance which sleeps at regular
 * intervals. This is done to create both long running and short
 * running tasks on the cpu.
 *
 * Multiple threads are created by each instance. The threads request
 * a memory chunk, write into it and then free it. This is done
 * throughout the period of the run.
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License as
 * published by the Free Software Foundation; version 2 of the License.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
 * USA
 */

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <pthread.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <malloc.h>

/* Variable entities */
static unsigned int seconds;
static unsigned int threads;
static unsigned int mem_chunk_size;
static unsigned int sleep_at;
static unsigned int sleep_interval;

/* Fixed entities */
typedef size_t mem_slot_t;      /* 8 bytes on a 64-bit machine */
static unsigned int slot_size = sizeof(mem_slot_t);

/* Other parameters */
static volatile int start;
static time_t start_time;
static unsigned int records_read;
pthread_mutex_t records_count_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned int write_to_mem(void)
{
        int i, j;
        mem_slot_t *scratch_pad, *temp;
        mem_slot_t *end;

        mem_chunk_size = slot_size * 256;
        /* Sleep after every 2800 records for the short-running
         * instance; set sleep_at = 0 for the long-running instance. */
        sleep_at = 2800;
        sleep_interval = 9000;  /* sleep for 9ms */

        for (i = 0; start == 1; i++) {
                /* Ask for a memory chunk */
                scratch_pad = (mem_slot_t *)malloc(mem_chunk_size);
                if (scratch_pad == NULL) {
                        fprintf(stderr, "Could not allocate memory\n");
                        exit(1);
                }
                end = scratch_pad + (mem_chunk_size / slot_size);

                /* Write into this chunk */
                for (temp = scratch_pad, j = 0; temp < end; temp++, j++)
                        *temp = (mem_slot_t)j;

                /* Free this chunk */
                free(scratch_pad);

                /* Decide the duty cycle; currently 10ms */
                if (sleep_at && !(i % sleep_at))
                        usleep(sleep_interval);
        }
        return i;
}

static void *thread_run(void *arg)
{
        unsigned int records_local;

        /* Wait for the start signal */
        while (start == 0)
                ;

        records_local = write_to_mem();

        pthread_mutex_lock(&records_count_lock);
        records_read += records_local;
        pthread_mutex_unlock(&records_count_lock);

        return NULL;
}

static void start_threads(void)
{
        double diff_time;
        unsigned int i;
        int err;

        threads = 8;
        seconds = 10;

        pthread_t thread_array[threads];

        for (i = 0; i < threads; i++) {
                err = pthread_create(&thread_array[i], NULL,
                                     thread_run, NULL);
                if (err) {
                        fprintf(stderr, "Error creating thread %u\n", i);
                        exit(1);
                }
        }
        start_time = time(NULL);
        start = 1;
        sleep(seconds);
        start = 0;
        diff_time = difftime(time(NULL), start_time);

        for (i = 0; i < threads; i++) {
                err = pthread_join(thread_array[i], NULL);
                if (err) {
                        fprintf(stderr, "Error joining thread %u\n", i);
                        exit(1);
                }
        }
        printf("%u records/s\n",
               (unsigned int)(((double)records_read) / diff_time));
}

int main(void)
{
        start_threads();
        return 0;
}

------------------------END WORKLOAD------------------------------------

Regards
Preeti U Murthy