[RFC v2 PATCH 0/2] sched: Integrating Per-entity-load-tracking with the core scheduler
Preeti U Murthy
preeti at linux.vnet.ibm.com
Tue Dec 4 03:16:52 EST 2012
Hi everyone,
I conducted a few experiments with a workload to compare the following
parameters with and without this patchset:
1. The performance of the workload.
2. The sum of the wait time to run of the processes queued on each cpu, i.e.
the cumulative latency.
3. The number of migrations of tasks between cpus.
Experimental setup:
1. The workload is at the end of the mail. Every run of the workload lasted
10s.
2. Different numbers of long running and short running threads were run.
3. The setup was a two socket pre-Nehalem machine, but one socket had all its
cpus offlined. Thus only one socket, consisting of 4 cores, was active
throughout the experiment.
4. The statistics below have been collected from /proc/schedstat, except
throughput, which is output by the workload.
- Latency has been observed from the eighth field in the cpu statistics
in /proc/schedstat:
cpu<N> 1 2 3 4 5 6 7 "8" 9
- The number of migrations has been calculated by summing up the #pulls during
the idle, busy and newly_idle states of all the cpus. This is also given by
/proc/schedstat.
5. The columns of the results table are:
FieldA -> #short-running tasks [for every 10ms that passes, sleep for 9ms and
work for 1ms], i.e. 10% tasks
FieldB -> #long-running tasks
Field1 -> Throughput with patch (records/s read)
Field2 -> Throughput without patch (records/s read)
Field3 -> #Pull tasks with patch
Field4 -> #Pull tasks without patch
Field5 -> Latency with patch
Field6 -> Latency without patch
A     B     1            2            3       4       5        6
------------------------------------------------------------------------------
5     5     49,93,368    48,68,351    108     28      22s      18.3s
4     2     34,37,669    34,37,547    58      50      0.6s     0.17s
16    0     38,66,597    38,74,580    1151    1014    1.88s    1.65s
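The latency columns come from adding up the eighth field of every cpu<N> line in /proc/schedstat, as described in the setup. A minimal sketch of that summation is given here. The function names sum_eighth_field() and cumulative_latency() are made up for this illustration, and the unit of the field depends on the schedstat version, so no conversion to seconds is attempted.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the eighth numeric field of every "cpu<N> ..." line in a buffer
 * holding the text of /proc/schedstat. Lines that do not start with
 * "cpu" (version, timestamp, domain<N>) are skipped.
 * Hypothetical helper, not part of the patchset or the workload.
 */
unsigned long long sum_eighth_field(const char *text)
{
	unsigned long long total = 0;
	const char *line = text;

	assert(text != NULL);
	while (*line) {
		const char *eol = strchr(line, '\n');
		size_t len = eol ? (size_t)(eol - line) : strlen(line);

		if (len > 3 && strncmp(line, "cpu", 3) == 0) {
			const char *p = line;
			const char *end = line + len;
			int field = 0;

			/* skip the "cpu<N>" label itself */
			while (p < end && *p != ' ')
				p++;
			while (p < end) {
				char *next;
				unsigned long long v = strtoull(p, &next, 10);

				/* stop when no number is left on this line */
				if (next == p || next > end)
					break;
				if (++field == 8) {
					total += v;
					break;
				}
				p = next;
			}
		}
		line += len + (eol ? 1 : 0);
	}
	return total;
}

/* Read a schedstat-style file (up to 64KB) and return the summed field. */
unsigned long long cumulative_latency(const char *path)
{
	static char buf[1 << 16];
	FILE *f = fopen(path, "r");
	size_t n;

	if (f == NULL)
		return 0;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	buf[n] = '\0';
	fclose(f);
	return sum_eighth_field(buf);
}
```

Calling cumulative_latency("/proc/schedstat") before and after a run and subtracting the two values gives the wait time accumulated during that run.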
*Inferences*:
1. An increase in the number of pulls can clearly be seen with this patch, and
this has resulted in an increase in latency. This *should have* resulted in a
decrease in throughput, but in the first two cases this is not reflected. This
could be due to some error in the benchmark itself or in the way I am
calculating the throughput. Keeping this issue aside, I focus on the effect on
#pulls and latency.
2. On integrating PJT's metric with the load balancer, #pulls/#migrations
increase for the following reason, which I figured out by going through the
traces.
                            Task1     Task3
                            Task2     Task4
                            ------    ------
                            Group1    Group2
Case1: Load_as_per_pjt      1028      1121
Case2: Load_without_pjt     2048      2048
Fig1.
During load balancing:
Case1: Group2 is overloaded, so one of its tasks is moved to Group1.
Case2: Group1 and Group2 are equally loaded, hence no migrations.
This is observed so many times that it is no wonder the #migrations have
increased with this patch. Here Group refers to sched_group.
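The two cases of Fig1 can be reproduced with a toy model, shown here only as a rough sketch. It assumes all four tasks are nice-0 tasks of weight 1024, which is consistent with the 2048 per group in Case2. struct toy_task, the runnable ratios and the "pull on any excess" rule are hypothetical; the ratios were picked so that the totals match Fig1, and the real load balancer applies more checks than a plain comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task record, used only for this illustration. */
struct toy_task {
	unsigned int weight;		/* static load weight (1024 assumed for nice 0) */
	unsigned int runnable_ratio;	/* tracked runnable fraction, scaled to 0..1024 */
};

/* Made-up ratios chosen so that the group totals match Fig1. */
const struct toy_task fig1_group1[2] = { { 1024, 514 }, { 1024, 514 } };
const struct toy_task fig1_group2[2] = { { 1024, 560 }, { 1024, 561 } };

/*
 * Load of a group: with tracking, every task contributes its weight
 * scaled by its runnable fraction; without tracking, its full weight.
 */
unsigned long group_load(const struct toy_task *tasks, size_t n, int use_tracking)
{
	unsigned long load = 0;
	size_t i;

	assert(tasks != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (use_tracking)
			load += (unsigned long)tasks[i].weight *
				tasks[i].runnable_ratio / 1024;
		else
			load += tasks[i].weight;
	}
	return load;
}

/* Simplified rule: pull whenever the other group carries strictly more load. */
int should_pull(unsigned long local_load, unsigned long busiest_load)
{
	return busiest_load > local_load;
}
```

With tracking the groups come out at 1028 and 1121, so Group1 pulls from Group2. Without tracking both are 2048 and nothing moves. Small differences in tracked load are enough to trigger a migration in this model, which matches what the traces showed.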
3. The next obvious step was to see whether so many migrations with my patch
are prudent or not. The latency numbers reflect that they are not.
4. As I said earlier, I keep throughput out of these inferences because it
distracts us from something that is starkly clear:
*Migrations incurred due to PJT's metric are not affecting the tasks
positively.*
5. The above is my first observation. It does not, however, say that using
PJT's metric with the load balancer is a bad idea. It could mean many things,
out of which the correct one has to be figured out. Among them I list a few:
a) Simply replacing the existing metric used by the load balancer with PJT's
metric might not really derive the benefit that PJT's metric has to offer.
b) I have not been able to figure out what kind of workloads actually
benefit from the way we have applied PJT's metric. Maybe we are using
a workload which is adversely affected.
6. My next step, in my opinion, will be to resolve the following issues in
decreasing order of priority:
a) Run some other benchmark like kernbench and find out whether the
throughput reflects the increase in latency correctly. If it does, then I will
need to find out why the current benchmark was behaving weirdly; else I will
need to go through the traces to figure out this issue.
b) If I find out that the throughput is consistent with the latency, then we
need to modify the strictness (the granularity of time at which the load is
getting updated) with which PJT's metric calculates load, or use it
in some other way in load balancing.
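To make the "strictness" in 6b concrete, here is a rough sketch of how a geometrically decayed runnable average reacts to a 10% task. It rests on my understanding that per-entity load tracking decays history so that a contribution from about 32 periods ago counts half as much. The period length (about 1ms), the floating point arithmetic and the name tracked_ratio() are simplifications and assumptions of this sketch, not the kernel code.

```c
#include <assert.h>

/* Decay factor per period, chosen so that DECAY_Y^32 is about 0.5. */
#define DECAY_Y 0.97857206

/*
 * Simulate `periods` tracking periods for a task that is runnable during
 * the first `run` periods of every `cycle` periods, and return its
 * decayed runnable fraction in [0, 1].
 * Hypothetical helper; the kernel uses fixed point arithmetic instead.
 */
double tracked_ratio(unsigned int periods, unsigned int run, unsigned int cycle)
{
	double runnable_sum = 0.0, period_sum = 0.0;
	unsigned int i;

	assert(cycle > 0 && run <= cycle);
	for (i = 0; i < periods; i++) {
		int runnable = (i % cycle) < run;

		/* age the history by one period, then add the newest period */
		runnable_sum = runnable_sum * DECAY_Y + (runnable ? 1.0 : 0.0);
		period_sum = period_sum * DECAY_Y + 1.0;
	}
	return period_sum > 0.0 ? runnable_sum / period_sum : 0.0;
}
```

For example, tracked_ratio(1000, 1, 10) models the short running task that works 1ms out of every 10ms, and tracked_ratio(1000, 10, 10) models a long running task. The tracked value of the short running task keeps moving around 10% depending on where in its cycle it is sampled, so two groups with the same mix of tasks can still report slightly different loads.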
Looking forward to your feedback on this :)
--------------------------BEGIN WORKLOAD---------------------------------
/*
* test.c - Two instances of this program are run: one instance where the
* sleep time is 0 and another instance which sleeps at regular intervals.
* This is done to create both long running and short running tasks
* on the cpu.
*
* Multiple threads are created for each instance. The threads request a
* memory chunk, write into it and then free it. This is done throughout the
* period of the run.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License as
* published by the Free Software Foundation; version 2 of the License.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
* USA
*/
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <pthread.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>
#include "malloc.h"
/* Variable entities */
static unsigned int seconds;
static unsigned int threads;
static unsigned int mem_chunk_size;
static unsigned int sleep_at;
static unsigned int sleep_interval;
/* Fixed entities */
typedef size_t mem_slot_t;/* 8 bytes */
static unsigned int slot_size = sizeof(mem_slot_t);
/* Other parameters */
static volatile int start;
static time_t start_time;
static unsigned int records_read;
pthread_mutex_t records_count_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int write_to_mem(void)
{
	unsigned int i, j;
	mem_slot_t *scratch_pad, *temp, *end;

	mem_chunk_size = slot_size * 256;
	/*
	 * 0: never sleep (long running instance). For the short running
	 * instance set this to e.g. 2800 to sleep after every 2800 records.
	 */
	sleep_at = 0;
	sleep_interval = 9000; /* sleep for 9 ms */

	for (i = 0; start == 1; i++) {
		/* ask for a memory chunk */
		scratch_pad = (mem_slot_t *)malloc(mem_chunk_size);
		if (scratch_pad == NULL) {
			fprintf(stderr, "Could not allocate memory\n");
			exit(1);
		}
		end = scratch_pad + (mem_chunk_size / slot_size);
		/* write into this chunk */
		for (temp = scratch_pad, j = 0; temp < end; temp++, j++)
			*temp = (mem_slot_t)j;
		/* Free this chunk */
		free(scratch_pad);
		/* Decide the duty cycle; the target period is 10 ms */
		if (sleep_at && !(i % sleep_at))
			usleep(sleep_interval);
	}
	return i;
}
static void *
thread_run(void *arg)
{
	unsigned int records_local;

	(void)arg; /* unused */
	/* Busy-wait for the start signal */
	while (start == 0)
		;
	records_local = write_to_mem();
	pthread_mutex_lock(&records_count_lock);
	records_read += records_local;
	pthread_mutex_unlock(&records_count_lock);
	return NULL;
}
static void start_threads(void)
{
	double diff_time;
	unsigned int i;
	int err;

	threads = 8;
	seconds = 10;
	pthread_t thread_array[threads];

	for (i = 0; i < threads; i++) {
		err = pthread_create(&thread_array[i], NULL, thread_run, NULL);
		if (err) {
			fprintf(stderr, "Error creating thread %u\n", i);
			exit(1);
		}
	}
	/* Release all threads, let them run, then signal them to stop */
	start_time = time(NULL);
	start = 1;
	sleep(seconds);
	start = 0;
	diff_time = difftime(time(NULL), start_time);
	for (i = 0; i < threads; i++) {
		err = pthread_join(thread_array[i], NULL);
		if (err) {
			fprintf(stderr, "Error joining thread %u\n", i);
			exit(1);
		}
	}
	printf("%u records/s\n",
		(unsigned int) (((double) records_read)/diff_time));
}
int main(void)
{
	start_threads();
	return 0;
}
------------------------END WORKLOAD------------------------------------
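The workload above approximates the 10% duty cycle by sleeping after a fixed number of records. A possible alternative, sketched here under my own assumptions, is to drive the cycle from the clock so that the 1ms work / 9ms sleep split does not depend on how fast a record is processed. now_us() and run_duty_cycle() are hypothetical names, and the counter loop only stands in for the malloc/write/free work.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Monotonic clock in microseconds. */
long long now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

/*
 * Busy-work for work_us, then sleep for sleep_us, repeating until
 * total_us has elapsed. Returns the number of work/sleep cycles started.
 * Hypothetical helper, not part of the original test.c.
 */
unsigned int run_duty_cycle(long long work_us, long long sleep_us,
			    long long total_us)
{
	volatile unsigned long sink = 0;
	long long begin = now_us();
	unsigned int cycles = 0;

	assert(work_us >= 0 && sleep_us >= 0 && total_us >= 0);
	while (now_us() - begin < total_us) {
		long long work_end = now_us() + work_us;
		struct timespec req;

		cycles++;
		/* stand-in for the malloc/write/free loop of the workload */
		while (now_us() < work_end)
			sink++;

		req.tv_sec = sleep_us / 1000000LL;
		req.tv_nsec = (sleep_us % 1000000LL) * 1000L;
		nanosleep(&req, NULL);
	}
	return cycles;
}
```

run_duty_cycle(1000, 9000, 10000000) would then correspond to one short running thread over a 10s run, and a sleep_us of 0 to a long running thread.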
Regards
Preeti U Murthy