Before measuring anything, your first step is to configure your hardware so that you can measure operations reliably. Modern CPUs have a variety of performance and power optimizations, including hyper-threading, turbo boost, and dynamic frequency scaling, which make measurement using cycle counters more challenging.
Fortunately, recent x86 CPUs have a feature called constant TSC. With these CPUs, the cycle counter updates at a fixed frequency independent of the operating frequency of the CPU. In short, constant TSC makes cycle counting straightforward again. You can check whether your CPU has this feature by looking for the "constant_tsc" flag for the CPU; e.g., with cat /proc/cpuinfo on Linux:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
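If you would rather check programmatically, a minimal sketch in C (assuming a Linux-style /proc filesystem) just searches /proc/cpuinfo for the flag:

/* Sketch: look for "constant_tsc" in /proc/cpuinfo (Linux).
 * Equivalent to: grep constant_tsc /proc/cpuinfo */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[4096];
    FILE *f = fopen("/proc/cpuinfo", "r");

    if (f == NULL) {
        perror("fopen");
        return 2;
    }
    while (fgets(line, sizeof(line), f) != NULL) {
        if (strstr(line, "constant_tsc") != NULL) {
            printf("constant_tsc: yes\n");
            fclose(f);
            return 0;
        }
    }
    printf("constant_tsc: no\n");
    fclose(f);
    return 1;
}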
If your CPU does not have this feature (or equivalent), or if you are unsure, you can also typically disable the power optimization features. The most straightforward way to disable them is via the BIOS during machine boot, which has the added virtue of being independent of the operating system you are using.
Converting from cycles to wall-clock time starts with determining the cycle time (or, inversely, the frequency of the CPU cycle counter). The OS reports the CPU frequency in various ways, but the most reliable method is simply to measure the cycle time yourself by counting the number of cycles that elapse in a known time interval. For instance, measuring multiple executions of the following code:
[record start cycle time]
sleep(10);   /* unit of time is seconds; see "man 3 sleep" on unix */
[record end cycle time]
on this CPU:
model name : Intel(R) Xeon(R) CPU E5-2403 0 @ 1.80GHz
cpu MHz    : 1200.000
gives the following results:
cycles: 17999865186
cycles: 17999903346
cycles: 17999857728
cycles: 17999892942
cycles: 17999857590
cycles: 17999886126
cycles: 17999857302
cycles: 17999899602
cycles: 17999859600
cycles: 17999898618

samples: 10  mean: 17999877804.000000  sd: 18925.556816
Dividing the mean of roughly 18 billion cycles by the 10-second interval gives 1.8 billion cycles per second, indicating that the cycle counter runs at 1.8 GHz (and not at the 1200 MHz reported for the current operating frequency).
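As a sketch of this calibration, assuming gcc or clang on x86-64 (the __rdtsc() intrinsic from <x86intrin.h> stands in for the inline assembly discussed below):

/* Sketch: estimate the cycle counter frequency by counting cycles
 * across a known interval. Assumes gcc/clang on x86-64; __rdtsc()
 * is a compiler intrinsic from <x86intrin.h>. */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <x86intrin.h>

int main(void)
{
    uint64_t start = __rdtsc();   /* record start cycle count */
    sleep(10);                    /* known interval: 10 seconds */
    uint64_t end = __rdtsc();     /* record end cycle count */

    uint64_t cycles = end - start;
    printf("cycles: %llu\n", (unsigned long long)cycles);
    printf("estimated counter frequency: %.3f GHz\n",
           (double)cycles / 10.0 / 1e9);
    return 0;
}

Run it several times and average, as in the output above; sleep() can oversleep slightly, so more samples (or a longer interval) tighten the estimate.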
System calls for measuring time are typically coarse-grained. For instance, "gettimeofday" on Unix has microsecond granularity, which is fine for measuring operations like reading a block of data from disk. However, the CPU and memory benchmarks ask you to measure operations of very short duration, so counting cycles is your only recourse.
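For the coarse-grained case, a gettimeofday-based measurement looks like the following sketch; an operation lasting tens of nanoseconds would round to 0 or 1 microsecond here, which is exactly why the fine-grained benchmarks need cycle counting:

/* Sketch: microsecond-granularity timing with gettimeofday(2).
 * Adequate for millisecond-scale operations like disk reads. */
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    /* ... operation to measure (e.g., read a block from disk) ... */
    gettimeofday(&t1, NULL);

    long usec = (long)(t1.tv_sec - t0.tv_sec) * 1000000L
              + (t1.tv_usec - t0.tv_usec);
    printf("elapsed: %ld us\n", usec);
    return 0;
}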
Fortunately, modern CPUs make it easy to read the CPU cycle counter. On the x86, "rdtsc" and "rdtscp" are the instructions you want to use. You will find many examples and suggestions of varying quality for using these instructions when searching the Web. A good tutorial is an Intel whitepaper by Gabriele Paoloni: How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures (September 2010).
Read this whitepaper and use its methods for reading the x86 cycle counter (using "rdtscp" if your CPU supports it, or "rdtsc" otherwise). For very fine-grained measurements, CPU optimizations like superscalar issue and out-of-order execution can introduce unexpected variability, and the whitepaper discusses how to account for it.
Note: You do not need to implement a kernel module to disable preemption and turn off interrupts. You will be running your experiments at user level and living with any variability that multiprogramming introduces.
Note: These issues only matter when measuring fine-grained operations. When measuring I/O devices, for instance, the same precautions do not hurt, but they are unnecessary: any variability from out-of-order execution of "rdtsc(p)" and adjacent instructions is vastly dwarfed by the duration of the operation being measured.
When measuring fine-grained operations, it is good practice to make sure that you are measuring what you think you are measuring. For instance, you typically implement your benchmarking code in a high-level language, and it is the compiler that ultimately generates the instructions that get executed.
The best way to be sure is to disassemble the code generated by the compiler and sanity check the instruction sequence you are measuring. For instance, you can look at the instructions generated when invoking a function with different numbers of arguments to see when registers are used to pass arguments versus when the stack is used.
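As a hypothetical example, on the System V AMD64 ABI the first six integer arguments travel in registers (rdi, rsi, rdx, rcx, r8, r9) and a seventh spills to the stack; compiling a pair of functions like the following and disassembling the object file makes the transition easy to spot:

/* Hypothetical functions to disassemble. Under the System V AMD64
 * ABI, sum6's arguments all arrive in registers, while sum7's
 * seventh argument is passed on the stack. */
int sum6(int a, int b, int c, int d, int e, int f)
{
    return a + b + c + d + e + f;
}

int sum7(int a, int b, int c, int d, int e, int f, int g)
{
    return a + b + c + d + e + f + g;   /* g comes from the stack */
}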
Let's take reading the cycle counter as an example. From Paoloni's paper, here is code for taking the start timestamp with "rdtsc" and the end timestamp with "rdtscp":
asm volatile ("cpuid\n\t" "rdtsc\n\t" "mov %%edx, %0\n\t" "mov %%eax, %1\n\t" : "=r" (cycles_high0), "=r" (cycles_low0) :: "%rax", "%rbx", "%rcx", "%rdx"); /* code to measure */ asm volatile ("rdtscp\n\t" "mov %%edx, %0\n\t" "mov %%eax, %1\n\t" "cpuid\n\t" : "=r" (cycles_high1), "=r" (cycles_low1) :: "%rax", "%rbx", "%rcx", "%rdx");
Let's say that we put nothing in the "code to measure" section because we want to measure the overhead of reading the cycle counter. Then what exactly will this code measure if we take the difference between the end and start times? It looks like it will measure two "mov" instructions and the "rdtscp" instruction. Using gcc with default settings, here is what the compiler generated:
4005f0: 0f a2                cpuid
4005f2: 0f 31                rdtsc
4005f4: 41 89 d5             mov    %edx,%r13d
4005f7: 41 89 c4             mov    %eax,%r12d
4005fa: 44 89 6d d8          mov    %r13d,-0x28(%rbp)
4005fe: 44 89 65 dc          mov    %r12d,-0x24(%rbp)
400602: 0f 01 f9             rdtscp
400605: 41 89 d5             mov    %edx,%r13d
400608: 41 89 c4             mov    %eax,%r12d
40060b: 0f a2                cpuid
40060d: 44 89 6d e0          mov    %r13d,-0x20(%rbp)
400611: 44 89 65 e4          mov    %r12d,-0x1c(%rbp)
Note that gcc generated two additional "mov" instructions, and that those two are writing the values to the stack (although the writes will go into the write-back cache). The measured interval therefore includes four "mov" instructions plus "rdtscp", not just the two "mov"s visible in the source.
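One detail the snippet leaves implicit is reassembling the 32-bit halves into 64-bit timestamps before taking the difference. A minimal sketch, reusing the variable names from Paoloni's code:

#include <stdint.h>

/* Combine the 32-bit halves captured by rdtsc/rdtscp into a 64-bit
 * timestamp (names follow the snippet above). */
static inline uint64_t to_tsc(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | (uint64_t)low;
}

/* usage:
 *   uint64_t elapsed = to_tsc(cycles_high1, cycles_low1)
 *                    - to_tsc(cycles_high0, cycles_low0);
 */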
On Linux, you can use objdump to disassemble your code, and on Windows you can use dumpbin. Or, you can load your executable into a debugger and use it to show the instructions in the code you are measuring.
When measuring the time to perform specific operations, you will be collecting many samples and computing summary statistics on them (mean, standard deviation, etc.). As a sanity check, it is often useful to look at the raw samples that you are computing statistics over. You can print them out to a file, or drop them into a quick-n-dirty graph.
Looking at the raw data can help you detect unexpected results. For instance, you might be calculating the means of 10 samples, and then calculating the mean/sd of those means. Since outliers can dramatically skew the mean, looking at the raw samples lets you sanity check that the sample values are what you would reasonably expect them to be.
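As a sketch of this workflow, a hypothetical helper that dumps the raw samples to a file for plotting and computes the mean and sample standard deviation might look like:

/* Sketch: write raw samples to a file (for eyeballing or graphing)
 * and print summary statistics. Assumes n >= 2. Link with -lm. */
#include <math.h>
#include <stdio.h>
#include <stdint.h>

void summarize(const uint64_t *samples, int n, const char *path)
{
    FILE *f = fopen(path, "w");
    double sum = 0.0, sumsq = 0.0;

    for (int i = 0; i < n; i++) {
        double x = (double)samples[i];
        if (f != NULL)
            fprintf(f, "%llu\n", (unsigned long long)samples[i]);
        sum += x;
        sumsq += x * x;
    }
    if (f != NULL)
        fclose(f);

    double mean = sum / n;
    double sd = sqrt((sumsq - n * mean * mean) / (n - 1));
    printf("samples: %d  mean: %f  sd: %f\n", n, mean, sd);
}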