CSE 221: System Measurement Project
Winter 2024
Deadlines
Draft of Intro, Machine Description, and CPU Operations: |
Tuesday, February 6 at 11:59pm |
Draft of Memory Operations: |
Thursday, February 22 at 11:59pm |
Final report with all measurements plus code: |
Thursday, March 14 at 11:59pm |
Overview
In building an operating system, it is important to be able to
determine the performance characteristics of underlying hardware
components (CPU, RAM, disk, network, etc.), and to understand how
their performance influences or constrains operating system services.
Likewise, in building an application, one should understand the
performance of the underlying hardware and operating system, and how
they relate to the user's subjective sense of that application's
"responsiveness". While some of the relevant quantities can be found
in specs and documentation, many must be determined experimentally.
While some values may be used to predict others, the relations between
lower-level and higher-level performance are often subtle and
non-obvious.
In this project, you will create, justify, and apply a set of
experiments to a system to characterize and understand its
performance. In addition, you may explore the relations between some
of these quantities. In doing so, you will study how to use
benchmarks to usefully characterize a complex system. You should also
gain an intuitive feel for the relative speeds of different basic
operations, which is invaluable in identifying performance bottlenecks.
You have complete choice over the operation system and hardware
platform for your measurements. You can use your laptop that you are
comfortable with, an operating system running in a virtual machine
monitor, a smartphone, a game system, or even a supercomputer. You
can use a mainstream OS like Linux, MacOS, or Windows;
a wide range of other Unix-based OSes (e.g., FreeBSD,
NetBSD, OpenBSD, DragonFly BSD);
or alternative OSes such as
Fuschia,
Haiku,
or
Plan 9/Inferno.
The project tasks originally had mainstream hardware platforms and
operating systems in mind. But we also want to encourage you to do
the project in more exotic platforms and environments. As a result,
we can adapt the project tasks to the particular platform
configuration you choose (e.g., if an unusual OS does not support a
particular feature, such as a driver for your network card). Talk to
us and we can adapt the project as needed.
You may work either alone or in 2–3 person groups. All
groups do the same project, and all members receive the same grade.
Note that working in groups may or may not make the project easier,
depending on how the group interactions work out. If collaboration
issues arise, contact your instructor as soon as possible: flexibility
in dealing with such issues decreases as the deadline approaches.
This project has two aspects: you will implement and perform
a series of experiments, and you will write a report documenting
the methodology and results of your experiments. When you finish, you
will submit your report as well as the code used to perform your
experiments.
Report
Your report will have a number of sections including an
introduction, a machine description, and descriptions and discussions
of your experiments.
1) Introduction
Describe the goals of the project and, if you are in a group, who
performed which experiments. State the language you used to implement
your measurements, and the compiler version and optimization settings
you used to compile your code. If you are measuring in an unusual
environment (e.g., virtual machine, Web browser, compute cloud, etc.),
discuss the implications of the environment on the measurement task
(e.g., additional variance that is difficult for you to control for).
Estimate the amount of time you spent on this project.
2) Machine Description
Your report should contain a reasonably detailed description of the
test machine(s). For mainstream operating systems, the relevant
information should be available either from the system
(e.g., /proc
and /sys/devices/system/cpu
on Linux, System Profiler on
Mac OS X, Device Manager on Windows,
sysctl
on BSD, the cpuid
x86 instruction),
or online. Gathering this
information should not require much work, but in explaining and
analyzing your results you will find these numbers useful. You should
report at least the following quantities:
- Processor model, cycle time, cache sizes (L1 instruction and data, L2, L3)
- DRAM type, clock, and capacity
- Memory bus bandwidth
- I/O bus type (e.g., PCIe 3.0), bandwidth
- Disk
- SSD: model, capacity, transfer rates, IOPs, latencies (if reported on spec)
- Hard drive: model, capacity, RPM, transfer rates, latencies
- Network card bandwidth
- Operating system (including version/release)
If you are on an unusual platform, you may not be able to track
down all of this information. In that case, just report "Unknown".
3) Experiments
You will conduct experiments for a set of operations (described below).
For each operation, perform your experiments by following these steps:
- Estimate the base hardware performance of the operation and cite
the source you used to determine this quantity (system info, a
particular document). For example, when measuring disk read
performance for a particular size, you can refer to the disk
specification (found online) to determine performance characteristics.
Based on these values, you can estimate the average time to read a
given amount of data from the disk assuming no software overheads.
For operations where the hardware performance does not apply or is
difficult to measure (e.g., procedure call), you can leave hardware
performance blank.
- Make a guess as to how much overhead software will add to the
base hardware performance. For a disk read, this overhead will
include the system call, arranging the read I/O operation, handling
the completed read, and copying the data read into the user buffer.
We will not grade you on your guess, this is for you to test your
intuition. (Obviously you can do this after performing the experiment
to derive an accurate "guess", but where is the fun in that? Also, if
your guesses are consistently accurate and essentially the same as
your measured results, then we're going to be highly suspicious.) For
a procedure call, this overhead will consist of the instructions used
to manage arguments and make the jump. (Note that you do not need to
track down the number of cycles required for each different kind of
instruction. Instead, just track down a reasonable estimate of CPI
for your CPU and multiply the CPI with your estimate of number of
instructions.)
If you are measuring a system in an unusual environment (e.g.,
virtual machine, compute cloud, Web browser, etc.), estimate the
degree of variability and error that might be introduced when
performing your measurements.
- Combine the base hardware performance (when applicable) and your
estimate of software overhead into an overall prediction of
performance.
- Implement and perform the measurement. In all cases, you should
run your experiment multiple times, and for long enough to obtain
repeatable measurements. You should examine the raw values of your
samples to make sure that nothing unexpected is happening (e.g., a
process context switch), and then compute a summary statistic across
the samples. By default, compute the mean (average). To reduce
the impact of unexplained outliers, you can also use the trimmed mean
(for details see the hbench paper listed at the bottom of the page).
Also compute the standard deviation across the measurements. Note
that, when measuring an operation using many iterations (e.g., system
call overhead), consider each run of iterations as a single trial and
compute the standard deviation across multiple trials (not each
individual iteration).
- Use a low-overhead mechanism for reading timestamps. All modern
processors have a cycle counter that applications can read using a
special instruction
(e.g., rdtsc).
Searching for "rdtsc" in Google, for instance, will provide you with a
plethora of additional examples. Note, though, that in the modern age
of power-efficient multicore processors, you will need to take
additional steps to reliably use the cycle counter to measure the
passage of time. You will want to disable dynamically adjusted CPU
frequency (the mechanism will depend on your platform) so that the
frequency at which the processor computes is determinstic and does not
vary. Use "nice" to boost your process priority. Restrict
your measurement programs to using a single core
→ More tips on measuring time
In your report:
- Clearly explain the methodology of your experiment.
- Present your results:
- For measurements of single quantities (e.g., system call
overhead), use a table to summarize your results. In the table report
the base hardware performance (when applicable), your estimate of
software overhead, your prediction of operation time, and your
measured operation time. For units, use wall-clock time (e.g.,
microseconds) instead of cycles.
- For measurements of operations as a function of some other
quantity, report your results as a graph with operation time on the
y-axis and the varied quantity on the x-axis. Where possible, include
your estimates of base hardware performance and overall prediction of
operation time as curves on the graph as well.
- Discuss your results:
- Cite the source for the base hardware performance.
- Compare the measured performance with the predicted performance.
If they are wildly different, speculate on reasons why. What
may be contributing to the overhead?
- Evaluate the success of your methodology. How accurate
do you think your results are?
- For graphs, explain any interesting features of the curves.
- Answer any questions specifically mentioned with the operation.
- At the end of your report, summarize your results in a table for
a complete overview. The columns in your table should include
"Operation", "Base Hardware Performance", "Estimated Software
Overhead", "Predicted Time", and "Measured Time". (Not required for
the draft.)
- State the units of all reported values.
Do not underestimate the time it takes to describe your methodology
and results.
4) Operations
- CPU, Scheduling, and OS Services
- Measurement overhead:
Report the overhead of reading time, and report
the overhead of using a loop to measure many iterations
of an operation.
- Procedure call overhead:
Report as a function of number of integer arguments from 0-7.
What is the incremental overhead of an argument?
- System call overhead: Report the cost of a minimal system
call. How does it compare to the cost of a procedure
call? Note that some operating systems will cache the
results of some system calls (e.g., idempotent system
calls like getpid), so only the first call by a process
will actually trap into the OS.
- Task creation time: Report the time to create and run both
a process and a kernel thread (kernel threads run at
user-level, but they are created and managed by the OS;
e.g., pthread_create on modern Linux will create a
kernel-managed thread). How do they compare?
- Context switch time: Report the time to context switch
from one process to another, and from one kernel thread to
another. How do they compare? In the past students have
found using blocking pipes to be useful for forcing
context switches. (For insight into why a context switch
can be much more expensive than a procedure call, consider
the evolution of the Linux kernel trap on x86.)
For methodology examples see the "lmbench" and "hbench"
papers listed below at the bottom of the project page. It is
also important to keep in mind that measuring short execution
sequences can be challenging since things like cache effects
can skew the results; see the discussions in the lmbench (Sec
3.4) and hbench (Sec 2.1) papers for ways to reduce these
effects, such as warming caches and manual loop unrolling.
- Memory
- RAM access time: Report latency for individual integer
accesses to main memory and the L1 and L2 caches. Present
results as a graph with the x-axis as the log of the size
of the memory region accessed, and the y-axis as the
average latency. Note that the lmbench paper is a good
reference for this experiment. In terms of the lmbench
paper, measure the "back-to-back-load" latency and report
your results in a graph similar to Fig. 1 in the paper.
The ideal case is that you do not need to use
information about the machine or the size of the L1, L2,
etc., caches when implementing the experiment, and that
the experiment will reveal these sizes. However, modern
CPU architectures are incredibly complex and have many
optimizations in them that can lead to seemingly funky
behavior (e.g.,
a fun
exploration of why the value of the write (writing
zeros) can lead to different performance).
With this in mind, your experiments may not reveal nice,
sharp boundaries like the original paper. Do the best you
can and, while the boundaries may not be crystal clear in
your experiments, state in the report where your results
do indicate different hardware regimes (L1 to L2
transition, etc.).
- RAM bandwidth: Report bandwidth for both reading and
writing. Use loop unrolling to get more accurate results,
and keep in mind the effects of cache line prefetching
(e.g., see the lmbench paper). (Also keep in mind that
RAM bandwidth is likely limited by the DRAM and not the
memory bus bandwidth.)
- Page fault service time:
Report the time for faulting an entire page from disk (mmap
is one useful mechanism).
Dividing by the size of a page, how does it compare to the
latency of accessing a byte from main memory?
- Network
- Round trip time. Compare application-level round trip
time with the time to perform a ping using the ping command
(ICMP requests are handled at kernel level).
- Peak bandwidth.
- Connection overhead: Report setup and tear-down.
Evaluate for the TCP protocol. For each quantity, compare both
remote and loopback (localhost) interfaces. Comparing the
remote and loopback results, what can you deduce about baseline
network performance and the overhead of OS software? For both
round trip time and bandwidth, how close to ideal hardware
performance do you achieve? What are reasons why the TCP
performance does not match ideal hardware performance (e.g.,
what are the pertinent overheads)? In describing your
methodology for the remote case, either provide a machine
description for the second machine (as above), or use two
identical machines.
- File System
- Size of file cache: Note that the file cache size is
determined by the OS and will be sensitive to other load on
the machine; for an application accessing lots of file
system data, an OS will use a notable fraction of main
memory (GBs) for the file system cache. Report results as a
graph whose x-axis is the size of the file being accessed
and the y-axis is the average read I/O time. Do not use a
system call or utility program to determine this metric
except to sanity check.
- File read time: Report for both sequential and random
access as a function of file size. Ensure that you are not
measuring cached data when reading. The OS will cache file
data by default when processes both read and write file
data. Fortunately, OSes will generally provide mechanisms
for managing the cache. On Linux, for instance, from the
command line you can flush the entire cache (see
/proc/sys/vm/drop_caches) or specific files (e.g., see
oflag=nocache with the dd utility); programmatically, see
the posix_fadvise system call. Report as a graph with a
log/log plot with the x-axis the size of the file and y-axis
the average per-block time.
- Remote file read time: Repeat the previous experiment for
a remote file system. What is the "network penalty" of
accessing files over the network? You can either
configure your second machine to provide remote file
access, or you can perform the experiment on a department
machine (e.g., APE lab). On these machines your home
directory is mounted over NFS, so accessing a file under
your home directory will be a remote file access
(although, again, keep in mind file caching effects).
- Contention: Report the average time to read one file
system block of data as a function of the number of
processes simultaneously performing the same operation on
different files on the same disk (and not in the file
buffer cache).
References
During the quarter you will have read a number of papers
describing various system measurements, particularly in the second
half of the quarter. You may find those papers on the reading list
useful as references.
In addition, other papers you may find useful for help with system
measurement are:
- John K. Ousterhout, Why
Aren't Operating Systems Getting Faster as Fast as Hardware?,
Proc. of USENIX Summer Conference, pp. 247-256, June 1990.
- J. Bradley Chen, Yasuhiro Endo, Kee Chan, David Mazières,
Antonio Dias, Margo Seltzer, and Michael D. Smith, The
Measured Performance of Personal Computer Operating Systems,
Proc. of ACM SOSP, pp. 299-313, December 1995.
- Larry McVoy and Carl Staelin, lmbench:
Portable Tools for Performance Analysis, Proc. of USENIX Annual
Technical Conference, January 1996.
- Aaron B. Brown and Margo I. Seltzer, Operating
system benchmarking in the wake of lmbench: a case study of the
performance of NetBSD on the Intel x86 architecture, Proc. of ACM
SIGMETRICS, pp. 214-224, June 1997. (Colloquially known as the
"hbench" paper.)
- John Ousterhout, Always Measure One Level Deeper, Communications of the ACM, Vol. 61, No. 7, July 2018, pp. 74–83.
- Xiang (Jenny) Ren, Kirk Rodrigues, Luyuan Chen, Camilo Vega, Michael Stumm, and Ding Yuan, An Analysis of Performance Evolution of Linux's Core Operations, Proc.
of the 27th ACM Symposium on Operating Systems Principles (SOSP 2019),
pp. 554–569, October 2019.
- Gernot Heiser, Systems Benchmarking Crimes.
You may read these papers, or other references, for strategies on
performing measurements, but you may not examine code to copy or
replicate the implementation of a measurement. For example, reading
the lmbench paper is fine, but downloading and looking at the lmbench
code violates the intent of the project.
Finally, it goes almost without saying that you must implement all
of your measurements. You may not download a tool to perform the
measurements for you.
Grading
We will grade your project on the relative accuracy of your
measurement results (disk reads performing faster than the buffer
cache are a bad sign) as well as the quality of your report in terms
of methodology description (can we understand what you did and why?),
discussion of results (answering specific questions, discussing
unexpected behavior), and the writing.
In the past, a frequent issue we see with project reports is that
they do not clearly explain the reasoning behind the estimates,
methodology, results, etc. As a result, we do not fully understand
what you did and why you did it that way. Be sure to explain your
reasoning as well.
You will submit two drafts of your project throughout the quarter
as well as a final complete report at the end. For the first draft,
you should cover the
first two parts of the report (Introduction and Machine Description), and the first set of
operations (CPU, Scheduling, and
OS Services). For the second draft, extend your first draft with
results for the second set of operations (Memory).
The first and second drafts will each be worth 10% of your overall
project grade. Why so little? The idea with the drafts is that they
are primarily for your own benefit: they will get you
started on the project early, and will give you a sense for how
long it will take you to complete the project by the end of the
quarter (in the past, students have reported that it has taken them
40-120 hours on the project). As a result, you should be able to
better budget your time as the end of the quarter arrives. You can
also update and add to the earlier sections of your report for the
final report if we have feedback.
For the checkpoint drafts and the final report, submit them as pdf
files in your github repo. Create a top-level directory "reports"
in your repo, and place your documents in that directory with the
names "first", "second", and "final".