*BSD News Article 4043


Path: sserve!manuel!munnari.oz.au!mips!mips!swrinde!cs.utexas.edu!usc!elroy.jpl.nasa.gov!ames!sgi!rigden.wpd.sgi.com!rpw3
From: rpw3@rigden.wpd.sgi.com (Rob Warnock)
Newsgroups: comp.unix.bsd
Subject: Re: ne1000 slow performance
Message-ID: <oudcs3g@sgi.sgi.com>
Date: 23 Aug 92 14:46:16 GMT
Sender: rpw3@rigden.wpd.sgi.com
Organization: Silicon Graphics, Inc.  Mountain View, CA
Lines: 177

spedpr@thor.cf.ac.uk (Paul Richards) writes:
+---------------
| Is there benchmarking software for ethernet cards/drivers. 
| I'd be interested what sort of performance I'm getting and ftp stats
| vary from file to file so are meaningless for this purpose. What are
| people using for this?
+---------------

I use a combination of things, depending on what I'm trying to measure or
tune, but the basic tools are "ping" and "ttcp".

BE CAREFUL! MANY OF THESE TESTS CAN IMPOSE SIGNIFICANT LOADS ON YOUR NETWORK!
Do these tests at off hours or on private nets, to keep your co-workers
from getting upset with you. (And to keep their normal work from skewing
your results. ;-} )

For measuring latency, good old "ping" (a.k.a. ICMP ECHO) works well,
especially "ping -f". For example, from my 386bsd system (486/33) to
my SGI workstation, one might see:

	bsdj 64# time ping -fq -c10000 rigden
	PING rigden.wpd.sgi.com (192.26.75.58): 56 data bytes

	--- rigden.wpd.sgi.com ping statistics ---
	10012 packets transmitted, 10000 packets received, 0% packet loss
	round-trip min/avg/max = 0/35/60 ms
	0.8u 36.8s 0:37.90 99.3% 0+0k 0+0io 0pf+0w
	bsdj 65#

I used "time", since the 386bsd version of "ping" does not report pkt/sec,
and given the time I can manually calculate that we sent 10012/37.9 = 264
pkt/sec. Since this is more than "ping"'s default 100 pkt/sec, most of these
were probably sent due to the previous response coming back, so the average
latency was about 3.8ms. [Read the source for "ping" is this is not clear.
Don't put too much trust in the latency numbers you see from "ping" on the
PC machines, the clock's not fine-grained enough.]

In the other direction, from SGI to "bsdj", we get:

	rpw3@rigden <157> ping -fq -c10000 bsdj
	PING bsdj.wpd.sgi.com (192.26.75.188): 56 data bytes

	----bsdj.wpd.sgi.com PING Statistics----
	10000 packets transmitted, 10000 packets received, 0% packet loss
	round-trip (ms)  min/avg/max = 1/1/28    548.97 packets/sec
	0.6u 5.0s 0:18 30%
	rpw3@rigden <158> 

The increase in pkts/sec and decrease in latency are not surprising: The SGI
box is about twice as fast (36 MHz R3000) and has much more highly tuned
networking code. Also, the target of a ping does *much* less work, and all of
it at interrupt level. So what we're largely seeing here is the user-to-kernel
system call overhead for a sendmsg(), a recvmsg(), and several context switches
(sleep/wakeups): about 3.78ms for the 386bsd system, and about 1.8ms for the
SGI. [The actual Ethernet transmission time totals only 0.15ms round-trip.]
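
If you want a feel for how much of that is raw syscall-plus-context-switch
cost, a user-level round-trip timer is easy to hack up. Here's a rough sketch
(mine, for illustration only, not part of "ping" or "ttcp"); it bounces small
UDP packets off the remote host's "echo" service (UDP port 7, normally run
out of inetd), so it will read a bit *higher* than "ping", since the far end
is a user process rather than kernel-level ICMP code:

/*
 * rtt.c - a rough user-level round-trip timer.  Hypothetical example,
 * not part of "ping" or "ttcp"; no timeout or loss handling, so if the
 * remote UDP "echo" service isn't running, it just hangs.
 *
 *	cc -o rtt rtt.c
 *	./rtt <remote-ip-address> [count]
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int
main(int argc, char **argv)
{
	struct sockaddr_in sin;
	struct timeval t0, t1;
	char buf[56];		/* same payload size as ping's default */
	int s, i, n;
	double secs;

	if (argc < 2) {
		fprintf(stderr, "usage: rtt remote-ip [count]\n");
		exit(1);
	}
	n = (argc > 2) ? atoi(argv[2]) : 1000;
	memset(buf, 0, sizeof(buf));
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(7);	/* the standard UDP "echo" port */
	sin.sin_addr.s_addr = inet_addr(argv[1]);

	if ((s = socket(AF_INET, SOCK_DGRAM, 0)) < 0) {
		perror("socket");
		exit(1);
	}
	if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
		perror("connect");
		exit(1);
	}
	gettimeofday(&t0, NULL);
	for (i = 0; i < n; i++) {
		/* one send + one receive = one round trip */
		if (write(s, buf, sizeof(buf)) < 0 ||
		    read(s, buf, sizeof(buf)) < 0) {
			perror("write/read");
			exit(1);
		}
	}
	gettimeofday(&t1, NULL);
	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("%d round trips in %.2f sec = %.2f ms each\n",
	    n, secs, 1000.0 * secs / n);
	return 0;
}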

For raw throughput, I use (as many people do) the program "ttcp". It's readily
available (from sgi.com:~ftp/sgi/src/ttcp/* if you don't have it locally),
and is well understood. You start up a "ttcp -r -s" on one system and then
fire off a "ttcp -t -s {target}" on the other. E.g.:

PC:	bsdj 66# ttcp -r -s
	ttcp-r: buflen=8192, nbuf=2048, align=16384/0, port=5001  tcp
	ttcp-r: socket
	ttcp-r: accept from 192.26.75.58
	ttcp-r: 16777216 bytes in 38.38 real seconds = 426.89 KB/sec +++
	ttcp-r: 4100 I/O calls, msec/call = 9.59, calls/sec = 106.83
	ttcp-r: 0.0user 6.9sys 0:38real 18% 0i+0d 0maxrss 0+0pf 4085+94csw
	0.0u 7.0s 0:44.52 15.9% 0+0k 0+0io 0pf+0w
	bsdj 67#                  

SGI:	rpw3@rigden <159> ttcp -t -s bsdj
	ttcp-t: buflen=8192, nbuf=2048, align=16384/0, port=5001 tcp  -> bsdj
	ttcp-t: socket
	ttcp-t: connect
	ttcp-t: 16777216 bytes in 38.22 real seconds = 428.69 KB/sec +++
	ttcp-t: 2048 I/O calls, msec/call = 19.11, calls/sec = 53.59
	ttcp-t: 0.0user 0.5sys 0:38real 1%
	rpw3@rigden <160> 

Because of the buffering inherent in the net code, it is important when
reporting "ttcp" performance to use *only* the numbers reported by the
*receiving* side, and even then, make sure your "ttcp" run is long enough
(by using the "-n" option, if necessary) to swamp start-up transients.
The above example (over 30 secs) doesn't show it, but on very short runs
the transmitting side can finish noticeably before the receiver, and report
artificially high rates.
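
If you're curious just how much data can already be sitting in those kernel
buffers when the sender's last write() returns, you can ask the socket
itself. Here's a trivial sketch (mine, not part of "ttcp"); the defaults it
prints vary from system to system, and some versions of "ttcp" have an
option to change them:

/*
 * bufsize.c - print a TCP socket's default send/receive buffer sizes.
 * Just an illustration of where the "buffering inherent in the net code"
 * lives.  (Older systems may want plain "int" instead of socklen_t.)
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>

int
main(void)
{
	int s, sndbuf, rcvbuf;
	socklen_t len = sizeof(int);

	if ((s = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
		perror("socket");
		exit(1);
	}
	if (getsockopt(s, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) < 0 ||
	    getsockopt(s, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len) < 0) {
		perror("getsockopt");
		exit(1);
	}
	printf("SO_SNDBUF = %d bytes, SO_RCVBUF = %d bytes\n", sndbuf, rcvbuf);
	return 0;
}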

It's also important to run "ttcp" in both directions, as the work done by
transmitting and receiving is asymmetrical. For example, above we saw that a
486/33 (BSDJ) can receive from an SGI 4D/35 (Irix) at some 425 KB/s, but as
we see below the reverse is not true. From 486 to SGI we get only 360 KB/s:

SGI:	rpw3@rigden <160> ttcp -r -s
	ttcp-r: buflen=8192, nbuf=2048, align=16384/0, port=5001  tcp
	ttcp-r: socket
	ttcp-r: accept from 192.26.75.188
	ttcp-r: 16777216 bytes in 45.60 real seconds = 359.31 KB/sec +++
	ttcp-r: 16339 I/O calls, msec/call = 2.86, calls/sec = 358.32
	ttcp-r: 0.0user 3.4sys 0:45real 7%
	0.0u 3.4s 0:52 6%
	rpw3@rigden <161> 

PC:	bsdj 67# ttcp -t -s rigden
	ttcp-t: buflen=8192, nbuf=2048, align=16384/0, port=5001 tcp  -> rigden
	ttcp-t: socket
	ttcp-t: connect
	ttcp-t: 16777216 bytes in 45.62 real seconds = 359.14 KB/sec +++
	ttcp-t: 2048 I/O calls, msec/call = 22.81, calls/sec = 44.89
	ttcp-t: 0.0user 42.0sys 0:45real 92% 0i+0d 0maxrss 0+0pf 1105+432csw
	0.0u 42.1s 0:45.78 92.2% 0+0k 0+0io 0pf+0w
	bsdj 68# 

Such asymmetries are not at all unusual. Usually there is more CPU time spent
on the sending side, so sending from the faster machine to the slower will
get better performance. This is borne out by the numbers above: when receiving
at 425 KB/s the BSDJ machine consumed 18% of its CPU, but while sending at
360 KB/s it used 92%.

The SGI machine used almost no CPU in either case, but it's capable of nearly
10 MB/s of "ttcp" over FDDI, and can easily saturate an Ethernet. By the way,
I hope to speed up BSDJ's networking some, too. I suspect much of the problem
is in the WD board, and want to try an AMD LANCE "DMA bus master" card. Any
modern workstation of 10 MIPS or more *should* be able to saturate an Ethernet
(~1.1 MB/s of "ttcp").
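
By the way, that ~1.1 MB/s figure isn't magic; it falls straight out of the
Ethernet framing overhead. Here's the back-of-the-envelope arithmetic as a
little program (my own sketch, assuming full-size 1460-byte TCP segments and
ignoring the ACK traffic coming back the other way):

/*
 * enet_max.c - rough ceiling for TCP payload throughput over 10 Mbit/s
 * Ethernet.  Sketch only; real numbers depend on segment size, ACKs, etc.
 */
#include <stdio.h>

int
main(void)
{
	double wire = 10.0e6 / 8.0;	/* 10 Mbit/s, in bytes/sec            */
	double payload = 1460.0;	/* TCP data per segment (typical MSS) */
	/*
	 * Per-segment overhead on the wire: 8 preamble + 14 Ethernet header
	 * + 20 IP + 20 TCP + 4 CRC, plus the 9.6us interframe gap, which is
	 * another 12 byte-times at 10 Mbit/s.
	 */
	double overhead = 8 + 14 + 20 + 20 + 4 + 12;
	double max = wire * payload / (payload + overhead);

	printf("max TCP payload rate: %.0f bytes/sec (about %.2f MB/sec)\n",
	    max, max / (1024.0 * 1024.0));
	return 0;
}

On an otherwise idle Ethernet, landing much below that ceiling points at the
hosts (CPU, driver, protocol code), not at the wire.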

Finally, if you don't have "ttcp", a crude kind of "throughput" measurement
can also be done with "ping", by using the "-s" option. For example:

	bsdj 68# time ping -fq -c1000 -s2048 rigden
	PING rigden.wpd.sgi.com (192.26.75.58): 2048 data bytes

	--- rigden.wpd.sgi.com ping statistics ---
	1001 packets transmitted, 1000 packets received, 0% packet loss
	round-trip min/avg/max = 20/35/50 ms
	0.5u 17.2s 0:17.96 98.9% 0+0k 0+0io 0pf+0w
	bsdj 69# 

This shows that we transmitted *and* received 1000 2048-byte datagrams (which,
because of Ethernet's 1500-byte maximum, were fragmented into two parts and
reassembled at the other end, then frag'd/reassembled on the way back) in
18 seconds, for a total of about 144 KB/s of "request/response"-type traffic.
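
If the fragmentation arithmetic isn't obvious: each fragment carries at most
the MTU minus the 20-byte IP header, rounded down to a multiple of 8 (the
unit of the fragment-offset field). A toy sketch of the bookkeeping (just the
arithmetic, not the real kernel code):

/*
 * frag.c - how a "ping -s2048" datagram splits across Ethernet's
 * 1500-byte MTU.  Illustration only.
 */
#include <stdio.h>

int
main(void)
{
	int mtu = 1500;			/* Ethernet maximum payload      */
	int iphdr = 20, icmphdr = 8;
	int data = 2048;		/* "ping -s2048"                 */
	int payload = icmphdr + data;	/* IP payload to be carried      */
	int per_frag = ((mtu - iphdr) / 8) * 8;	/* multiple of 8 bytes   */
	int frag = 1;

	while (payload > 0) {
		int n = payload > per_frag ? per_frag : payload;
		printf("fragment %d: %d payload bytes (+ %d-byte IP header)\n",
		    frag++, n, iphdr);
		payload -= n;
	}
	return 0;
}

For 2048 data bytes plus the 8-byte ICMP header, that works out to one
1480-byte fragment and one 576-byte fragment in each direction.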

It should not be surprising that this is lower than the "ttcp" numbers above,
since this test is closer to a "stop & wait" protocol than TCP's windowed
streaming.  On the other hand, since the average time per packet (18ms) was
more than "ping -f"'s maximum 10ms interval, at least two requests were "in
flight" at once, so it wasn't pure "stop & wait". It is such considerations
that make it hard to get meaningful results with the "ping -f -s<BIG>" test.
But sometimes you have to use what's available, and sometimes "ping" is there
when "ttcp" isn't.


Again, CAUTION! Network stress tests do just that: cause stress. Both on the
nominal systems-under-test and (sometimes) on others. Don't be surprised if
you uncover strange bugs. For example, I have seen some hosts that would crash
if sent very large ICMP ECHO packets (i.e., "ping -s<BIG>"). And in running
the above tests to write this message, I crashed 386bsd once with the "kmem_map"
panic, even though my kernel has both of the patches in. (I was running a test
from an rlogin'd job, and I forgot the "-q" on a "ping -f". Without the
"-q", "ping" outputs characters to stderr for each packet sent/received, and
I suspect the added network traffic led to the "kmem_map" panic somehow.)


-Rob

p.s. Despite temptation from various bad influences, do *not* believe anyone
who tries to sell you results from "ttcp -u" (UDP instead of TCP). Quite often,
the UDP numbers only show how fast the network driver can throw away data
when the buffers fill up. I have heard of vendors who tried to quote "ttcp -u"
numbers which exceeded the bandwidth of the network medium!

-----
Rob Warnock, MS-9U/510		rpw3@sgi.com
Silicon Graphics, Inc.		(415)390-1673
2011 N. Shoreline Blvd.
Mountain View, CA  94043