hbench-REBUTTAL
In June of 1997, Margo Seltzer and Aaron Brown published a paper in
Sigmetrics called "Operating System Benchmarking in the Wake of Lmbench:
A Case Study of the Performance of NetBSD on the Intel x86 Architecture".
This paper claims to have found flaws in the original lmbench work.
With the exception of one bug, which we have of course fixed, we find
the claims inaccurate, misleading, and petty. We don't understand
what appears to be a pointless attack on something that has obviously
helped many researchers and industry people alike. lmbench was warmly
received and is widely used and referenced. We stand firmly behind the
work and results of the original benchmark. We continue to improve and
extend the benchmark. Our focus continues to be on providing a useful,
accurate, portable benchmark suite that is widely used. As always, we
welcome constructive feedback.
To ease the concerns of gentle benchmarkers around the world, we have
spent at least 4 weeks reverifying the results. We modified lmbench to
eliminate any effects of
. clock resolution
. loop overhead
. timing interface overhead
Our prediction was that this would not make any difference and our
prediction was correct. All of the results reported in lmbench 1.x are
valid except the file reread benchmark which may be 20% optimistic on
some platforms.
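As an illustration of the third item, the cost of the timing interface
itself can be estimated by calling it in a tight loop. The sketch below
is ours, not lmbench code; the function name and iteration count are our
own choices:

```c
#include <stdio.h>
#include <sys/time.h>

/*
 * Hypothetical sketch: estimate the per-call overhead of the timing
 * interface (gettimeofday) by calling it many times inside one
 * start/stop pair and dividing the elapsed time by the count.
 */
static double
timing_overhead_usecs(void)
{
	enum { N = 100000 };
	struct timeval start, end, t;
	int i;

	gettimeofday(&start, NULL);
	for (i = 0; i < N; i++)
		gettimeofday(&t, NULL);
	gettimeofday(&end, NULL);
	return ((end.tv_sec - start.tv_sec) * 1e6 +
	    (end.tv_usec - start.tv_usec)) / N;
}
```

A benchmark that subtracts this per-call figure from its measurements
removes the timing interface from the results entirely.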
We've spent a great deal of time and energy, for free, at the expense
of our full time jobs, to address the issues raised by hbench. We feel
that we were needlessly forced into a lose/lose situation of arguing
with a fellow researcher. We intend no disrespect towards their work,
but did not feel that it was appropriate for what we see as incorrect
and misleading claims to go unanswered.
We wish to move on to the more interesting and fruitful work of extending
lmbench in substantial ways.
Larry McVoy & Carl Staelin, June 1997
--------------------------------------------------------------------------
Detailed responses to their claims:
Claim 1:
"it did not have the statistical rigor and self-consistency
needed for detailed architectural studies"
Reply:
This is an unsubstantiated claim. There are no numbers which back
up this claim.
Claim 2:
"with a reasonable compiler, the test designed to read and touch
data from the file system buffer cache never actually touched
the data"
Reply:
Yes, this was a bug in lmbench 1.0. It has been fixed.
On platforms such as a 120 MHz Pentium, we see a change of about 20%
in the results, i.e., without the bug fix it is about 20% faster.
Claim 3:
This is a multi part claim:
a) gettimeofday() is too coarse.
Reply:
The implication is that there are a number of benchmarks in
lmbench that finish in less time than the clock resolution
with correspondingly incorrect results. There is exactly one
benchmark, TCP connection latency, where this is true and that
is by design, not by mistake. All other tests run long enough
to overcome 10ms clocks (most modern clocks are microsecond
resolution).
Seltzer/Brown point out that lmbench 1.x couldn't accurately
measure the L1/L2 cache bandwidths. lmbench 1.x didn't attempt
to report L1/L2 cache bandwidths so it would seem a little
unreasonable to imply inaccuracy in something the benchmark
didn't measure. It's not hard to get this right by the way, we
do so handily in lmbench 2.0.
b) TCP connection latency is reported as 0 on the DEC Alpha.
Reply:
We could have easily run the TCP connection latency benchmark in
a loop long enough to overcome the clock resolution. We were,
and are, well aware of the problem on DEC Alpha boxes. We run
only a few iterations of this benchmark because the benchmark
causes a large number of sockets to get stuck in TIME_WAIT,
part of the TCP shutdown protocol. Almost all protocol stacks
degrade somewhat in performance when there are large numbers of
old sockets in their queues. We felt that showing the degraded
performance was not representative of what users would see.
So we run only a small number (about 1000) of iterations and
report the result. We would not consider changing the benchmark
the correct answer - DEC needs to fix their clocks if they wish
to see accurate results for this test.
We would welcome a portable solution to this problem. Reading
hardware specific cycle counters is not portable.
Claim 4:
"lmbench [..] was inconsistent in its statistical treatment of
the data"
...
"The most-used statistical policy in lmbench is to take the
minimum of a few repetitions of the measurement"
Reply:
Both of these claims are false, as can be seen by a quick inspection
of the code. The most commonly used timing method (16/19 tests
use this) is
start_timing
do the test N times
stop_timing
report results in terms of duration / N
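The pattern above can be sketched in a few lines of C. This is an
illustration of the method, not lmbench's actual code; getpid() stands
in for the operation being measured. Amortizing over N iterations makes
both loop overhead and clock granularity negligible.

```c
#include <sys/time.h>
#include <unistd.h>

/* Time N repetitions inside one start/stop pair and report
 * duration / N, per the timing method described in the text. */
static double
time_getpid_usecs(int n)
{
	struct timeval start, end;
	int i;

	gettimeofday(&start, NULL);
	for (i = 0; i < n; i++)
		(void)getpid();		/* the operation being measured */
	gettimeofday(&end, NULL);
	return ((end.tv_sec - start.tv_sec) * 1e6 +
	    (end.tv_usec - start.tv_usec)) / n;
}
```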
In fact, the /only/ case where a minimum is used is in the
context switch test.
The claim goes on to say that taking the minimum causes
incorrect results in the case of the context switch test.
Another unsupportable claim, one that shows a clear lack of
understanding of the context switch test. The real issue is cache
conflicts due to page placement in the cache. Page placement is
something not under our control, it is under the control of the
operating system. We did not, and do not, subscribe to the theory
that one should use better ``statistical methods'' to eliminate
the variance in the context switch benchmark. The variance is
what actually happened and happens to real applications.
The authors also claim "if the virtually-contiguous pages of
the buffer are randomly assigned to physical addresses, as they
are in many systems, ... then there is a good probability that
pages of the buffer will conflict in the cache".
We agree with the second part but heartily disagree with
the first. It's true that NetBSD doesn't solve this problem.
It doesn't follow that others don't. Any vendor supplied
operating system that didn't do this on a direct mapped L2
cache would suffer dramatically compared to its competition.
We know for a fact that Solaris, IRIX, and HPUX do this.
A final claim is that they produced a modified version of the
context switch benchmark that does not have the variance of
the lmbench version. We could not reproduce this. We ran that
benchmark on an SGI MP and saw the same variance as the original
benchmark.
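For readers unfamiliar with the test under discussion, here is a
minimal sketch of the pipe-based structure (this is our illustration,
not lmbench's lat_ctx code): two processes pass a one-byte token back
and forth, forcing a context switch per transfer. Timing is omitted;
the function returns the number of completed round trips.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Two processes ping-pong a token through a pair of pipes;
 * each transfer forces a context switch. */
static int
pingpong(int rounds)
{
	int p1[2], p2[2];
	char tok = 'x';
	pid_t pid;
	int i, done = 0;

	if (pipe(p1) < 0 || pipe(p2) < 0)
		return -1;
	if ((pid = fork()) == 0) {	/* child: echo the token back */
		for (i = 0; i < rounds; i++) {
			if (read(p1[0], &tok, 1) != 1) break;
			if (write(p2[1], &tok, 1) != 1) break;
		}
		_exit(0);
	}
	for (i = 0; i < rounds; i++) {	/* parent: send, await echo */
		if (write(p1[1], &tok, 1) != 1) break;
		if (read(p2[0], &tok, 1) != 1) break;
		done++;
	}
	waitpid(pid, NULL, 0);
	return done;
}
```

Where in physical memory the operating system places the processes'
pages determines the cache conflicts, which is exactly why the variance
is real and not a statistical artifact.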
Claim 5:
"The lmbench bandwidth tests use inconsistent methods of accessing
memory, making it hard to directly compare the results of, say
memory read bandwidth with memory write bandwidth, or file reread
bandwidth with memory copy bandwidth"
...
"On the Alpha processor, memory read bandwidth via array indexing
is 26% faster than via pointer indirection; the Pentium Pro is
67% faster when reading with array indexing, and an unpipelined
i386 is about 10% slower when writing with pointer indirection"
Reply:
In reading that, it would appear that they are suggesting that
their numbers are up to 67% different than the lmbench numbers.
We can only assume that this was deliberately misleading.
Our results are identical to theirs. How can this be?
. We used array indexing for reads, so did they.
They /implied/ that we did it differently, when in fact
we use exactly the same technique. They get about
87MB/sec on reads on a P6, so do we. We challenge
the authors to demonstrate the implied 67% difference
between their numbers and ours. In fact, we challenge
them to demonstrate a 1% difference.
. We use pointers for writes exactly because we wanted
comparable numbers. The read case is a load and
an integer add per word. If we used array indexing
for the stores, it would be only a store per word.
On older systems, the stores can appear to go faster
because the load/add is slower than a single store.
While the authors did their best to confuse the issue, the
results speak for themselves. We coded up the write benchmark
our way and their way. Results for an Intel P6 (MB/sec):
                pointer   array   difference
        L1 $      587      710       18%
        L2 $      414      398        4%
        memory     53       53        0%
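The two write-loop styles being compared look like this, as we
understand them (illustrative sketches, not lmbench's exact inner
loops). Both store one word per iteration; the pointer form advances a
pointer, the array form indexes into the buffer.

```c
#include <stddef.h>

/* Pointer-increment store loop. */
static void
write_ptr(long *p, size_t nwords)
{
	long *end = p + nwords;

	while (p < end)
		*p++ = 1;
}

/* Array-indexing store loop. */
static void
write_idx(long *p, size_t nwords)
{
	size_t i;

	for (i = 0; i < nwords; i++)
		p[i] = 1;
}
```

The two produce identical memory contents; only the address arithmetic
the compiler emits differs, which is all the table above is measuring.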
Claim 5a:
The harmonic mean stuff.
Reply:
They just don't understand modern architectures. The harmonic mean
theory is fine if and only if the processor can't do two things at
once. Many modern processors can indeed do more than one thing at
once - the concept is known as superscalar execution - and that
includes the load/store units. If the processor supports both
outstanding loads and outstanding stores, the harmonic mean theory
fails.
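To make the disputed model explicit: the harmonic mean theory says that
if reads run at bandwidth r and writes at bandwidth w, and the two
cannot overlap, a copy (one read plus one write per word) runs at
2/(1/r + 1/w). Our point is that this formula only holds when the
loads and stores are fully serialized.

```c
/* Predicted copy bandwidth under the harmonic mean theory,
 * i.e., assuming loads and stores never overlap. */
static double
harmonic_copy_bw(double read_bw, double write_bw)
{
	return 2.0 / (1.0 / read_bw + 1.0 / write_bw);
}
```

On a superscalar processor with outstanding loads and stores, measured
copy bandwidth can exceed this prediction, which is why we reject the
model.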
Claim 6:
"we modified the memory copy bandwidth to use the same size
data types as the memory read and write benchmark (which use the
machine's native word size); originally, on 32-bit machines, the
copy benchmark used 64-bit types whereas the memory read/write
bandwidth tests used 32- bit types"
Reply:
The change was to use 32 bit types for bcopy. On even relatively
modern systems, such as a 586, this change has no impact - the
benchmark is bound by the memory subsystem. On older systems, the
use of multiple load/store instructions, as required for the smaller
types, resulted in lower results than the memory system could produce.
The processor cycles required actually slow down the results. This
is still true today for in cache numbers. For example, an R10K
shows L1 cache bandwidths of 750MB/sec and 377MB/sec with 64 bit
vs 32 bit loads. It was our intention to show the larger number and
that requires the larger types.
Perhaps because the authors have not ported their benchmark to
non-Intel platforms, they have not noticed this. The Intel
platform does not have native 64 bit types so it does two
load/stores for what C says is a 64 bit type. Just because it
makes no difference on Intel does not mean it makes no difference
on other platforms.
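The type-size issue can be seen in a pair of copy loops (our
illustration, not lmbench's bcopy code). The 64-bit loop issues half
as many load/store pairs per byte; on machines with real 64-bit
registers, that is the difference behind the R10K's 750 vs 377 MB/sec
L1 numbers quoted above.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy using 64 bit loads/stores: one pair per 8 bytes. */
static void
copy64(uint64_t *dst, const uint64_t *src, size_t nbytes)
{
	size_t i, n = nbytes / sizeof(uint64_t);

	for (i = 0; i < n; i++)
		dst[i] = src[i];
}

/* Copy using 32 bit loads/stores: two pairs per 8 bytes. */
static void
copy32(uint32_t *dst, const uint32_t *src, size_t nbytes)
{
	size_t i, n = nbytes / sizeof(uint32_t);

	for (i = 0; i < n; i++)
		dst[i] = src[i];
}
```

A 32-bit-only machine compiles both to the same instruction count, so
the difference only shows up on machines with wider registers.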