*BSD News Article 82012



Path: euryale.cc.adfa.oz.au!newshost.carno.net.au!harbinger.cc.monash.edu.au!munnari.OZ.AU!spool.mu.edu!newspump.sol.net!posts.execpc.com!earth.execpc.com!not-for-mail
From: jgreco@ns.sol.net (Joe Greco)
Newsgroups: comp.unix.bsd.freebsd.misc
Subject: Re: Performance measurement via vmstat puzzling
Date: 30 Oct 1996 22:02:58 -0600
Organization: sol.net Network Services - Milwaukee, WI
Lines: 274
Sender: joegreco@earth.execpc.com
Message-ID: <5598di$4d7@earth.alpha.net>
References: <ytl20ehs4fl.fsf@data.mr.net>
NNTP-Posting-Host: earth.execpc.com

In article <ytl20ehs4fl.fsf@data.mr.net>,
Scott Lystig Fritchie <fritchie@MR.Net> wrote:
>Greetings --
>
>I've been puzzled for a while now about using "vmstat" to try to
>figure out how much oomph! my P5-120 PCI, dual Adaptec 2940, SMC
>EtherPower10/100 FreeBSD 2.1.5-RELEASE box has got left.  I've been
>looking at "vmstat"'s output, and the more I look, the more puzzled I
>get ... so I stopped looking.  :-) Perhaps a gentle reader could
>assist.
>
>[Perhaps I should be using a different measuring tool?]
>
>This box is busy acting as an NNTP feeder machine.  On its busiest
>24-hour period so far it has successfully sent just over 7 million
>articles (not counting the IHAVE/TAKETHIS offers and sent-but-rejected
>articles).

Okay, I am impressed; newspump.sol.net does about two million per day.
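
(For perspective: 7 million articles in 86,400 seconds works out to
roughly 81 articles per second, sustained, around the clock.)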

>At one minute before the hour, all of the "innxmit" processes are
>killed, just so that we don't have a big problem with 'hung "innxmit"
>processes'.  The machine's performance goes utterly to Hell for 90
>seconds or so, since all the innxmit processes are busy trying to
>re-write their batch files.  On the hour exactly (and every 10 minutes
>thereafter), "nntpsend" is run via cron, which shrinks the batch files
>(if necessary) and then runs "innxmit" for each feed.
>
>Scattered throughout the vmstat output is a message, displayed every
>minute, of how many "innxmit" processes are running at the time, plus
>"uptime" output.
>
>% vmstat -w 10 sd0 sd1 sd2
> procs   memory     page                    disks         faults      cpu
> r b w   avm   fre  flt  re  pi  po  fr  sr s0 s1 s2 f0   in   sy  cs us sy id
>[first line deleted, since it always seems to contain bogus data]
> 6 0 0318504 45860  365 152   0   1 386   0 50 16 16  0 3102 1807 1128 22 52 26
> 2 0 0301372 45820  398 272   0   1 476   0 54 16 15  0 3517 1922 1317 19 59 22
> 5 0 0310536 45552  291 118   0   0 224   0 13 20 14  0 3522 1628 1195 14 51 35
> 8 0 0321820 44936  146 111   0   0  79   0 14 18 14  0 3411 1778 1019 14 68 18
>     46 ... 12:56AM  up 4 days, 14:32, 2 users, load averages: 4.43, 3.96, 3.04
>
>For those who don't want to wade through the rest of my message, I'll
>get right to the point.  For those 40 seconds' worth of samples, the
>load average was about 4.4, but the CPU was allegedly idle for at
>least 1/4 of the time.  So, is "vmstat" lying about the amount of idle
>CPU time?  Or are there too many context switches and/or interrupts
>going on to service those processes in the runnable queue?  Is the
>number of processes in the runnable queue inflated?

The way that I read this is that you have LOTS of runnable or running
processes.  Processes do not seem to get "blocked" unless they are
really waiting on I/O, and even then, in situations where I would
expect to see more "blocked" processes, I usually see runnable ones
instead.
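
For reference, my reading of the vmstat(8) "procs" columns: "r" is
processes in the run queue, "b" is processes blocked waiting on
resources (disk I/O, paging, and so on), and "w" is processes that are
runnable or in a short sleep but swapped out.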

The amount of idle time and the number of runnable processes actually
do not have much to do with each other in a system dominated by I/O,
at least not the way you would expect.  I suspect this has to do with
how the load average is actually computed.
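
If memory serves, the 4.4BSD-derived kernels compute the load average
in loadav() in kern_synch.c by sampling, every five seconds, the
number of processes that are runnable PLUS the ones in a short-term,
high-priority sleep (i.e. disk wait), and folding that count (call it
nrun) into an exponentially decaying average, roughly:

    load = load * exp(-5/60) + nrun * (1 - exp(-5/60))

(that is the 1-minute average; the 5 and 15 minute averages just decay
more slowly).  So a process that spends its life bouncing between the
run queue and short disk waits pushes the load average up even while
the CPU sits idle.  That is from memory - check kern_synch.c before
quoting me.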

>The reason I ask all of this is that I want to be able to guess how
>many more feeds I can put onto this box.  If I go by idle CPU time
>(and amount of bandwidth available on the Fast Ethernet), I can add
>perhaps 25% more feeds.  If I go by load average, I should stop adding
>feeds now (and perhaps move some off).  I'm quite puzzled....

I would trust the latter over the former, particularly given the HIGH
percentages of time spent in system mode.
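
It would also be worth a "vmstat -i" to see which devices are
generating all those interrupts, and "systat -vmstat 5" gives a nicer
live picture of the same numbers.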

>The rest of the "vmstat" output below contains better examples of high
>CPU "idle" time yet relatively high runable queue sizes.  If you're
>curious, continue reading, otherwise you've probably read enough.
>
>We continue...
>
> 5 2 0315368 45024  350 186   0   2 344   0 54 16 16  0 3548 1951 1223 19 67 14
> 3 0 0337528 43740  195 219   0   0 182   0 24 14 16  0 3524 1868 1114 21 79  1
> 1 1 0329212 44276  315 134   0   0 255   0 23 21 21  0 4022 1854 1362 24 74  2
> 5 0 0320900 43620  267 128   1   3 200   0 30 19 18  0 3619 1692 1335 24 55 21
> 7 0 0307228 45708  366 284   0   0 445   0 56 17 12  0 3580 2069 1422 25 62 13
> 7 0 0298552 43784  330 163   0   0 259   0 13 17 19  0 3933 1907 1513 24 60 16
>12 2 0315892 43048   81 146   1   2  40   0 30 17 16  0 3855 1821 1242 22 68 11
> 7 0 0311112 43276  128 172   0   0  93   0 23 17 17  0 4290 1465 1436 24 76  1
>15 0 0320016 42212  162 169   0   0 102   0 32 21 17  0 4172 1596 1510 28 72  0
>     46 ... 12:57AM  up 4 days, 14:33, 2 users, load averages: 6.01, 4.58, 3.38
>11 1 0327772 43084   47 213   0   2 118   0 31 19 15  0 4039 1358 1363 23 77  0
> 8 0 0293420 44020  200 222   0   0 323   0 25 15 19  0 3927 1597 1329 25 74  1
> 7 0 0301440 44668  191 175   0   0 163   0 16 17 15  0 3887 1482 1386 26 61 13
> 3 0 0301876 44052  358 131   0   0 267   0 19 19 20  0 3934 1857 1348 22 61 17
> 2 7 0289916 44048  306 125   0   1 239   0 24 18 22  0 3751 1815 1269 20 55 25
> 6 2 0298328 43800  303 139   0   0 234   0 15 22 21  0 3392 2220 1164 21 54 25
>11 0 0320688 42644  270 121   0   0 186   0  8 23 29  0 3162 2155 1102 13 59 28
> 6 0 0337076 41444  233  96   0   1 117   0  9 15 18  0 2658 1945 798 13 86  1
>
>... right here we're about to kill off all 46 "innxmit" processes.
>The activity on sd0, where the batch files are stored, thrashes
>something fierce (it's a 2GB Hawk drive, which is capable of more
>transactions/second than it shows below).
>
>The size of the blocked queue is pretty funny, too.
>
>     46 ... 12:59AM  up 4 days, 14:35, 2 users, load averages: 6.15, 4.90, 3.60
> procs   memory     page                    disks         faults      cpu
> r b w   avm   fre  flt  re  pi  po  fr  sr s0 s1 s2 f0   in   sy  cs us sy id
> 323 0322628 42880  321 107   0   0 244   0 15 14 13  0 2674 1825 857 13 84  3
> 037 0309792 42776   71 100   0   0  65   0 69  4  5  0 1034  858 349 32 33 35
> 142 0279824 42892   49  45   0   0  90   0 52  2  1  0  503  459 149 30 23 47
> 244 0331156 43000   40  28   0   0  72   0 58  2  1  0  414  345 121 29 18 53
> 145 0321028 44976   60  21   0   0 192   0 59  1  2  0  358  349 115 26 20 54
> 143 0348280 46520  108  31   0   0 330   0 58  2  2  0  423  358 180 24 18 58
> 045 0348292 50072  203  15   0   0 709   0 53  1  1  0  342  353 166 19 21 61
> 225 0282164 52928  198  25   0   0 439   0 69  2  1  0  489  468 175 24 22 54
>15 7 0230668 55164  157  20   0   0 442   0 64  1  1  0  380  407 112 23 22 55
> 116 0149232 57544  165   7   0   0 431   0 55  0  0  0  324  423  88 30 24 46

Why is THAT funny?  If EVERYONE is trying to write to the same disk, it is
logical that most of them are going to have to wait a while!  That is
actually VERY expected, I think.  See if you can add a "sleep 5" after
killing each individual innxmit.
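
Something along these lines (a hypothetical sketch - adjust the match
pattern and delay to taste), run from cron at :59 in place of a mass
kill, would stagger the batch file rewrites:

    #!/bin/sh
    # Kill the innxmits one at a time, giving each a few seconds
    # to rewrite its batch file before the next one is killed.
    for pid in `ps -ax | grep '[i]nnxmit' | awk '{print $1}'`; do
        kill $pid
        sleep 5
    done

That way you have one or two processes rewriting batch files at any
given moment instead of 46 of them fighting over the same spindle.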

>     18 ...  1:00AM  up 4 days, 14:36, 2 users, load averages: 1.98, 3.84, 3.33
> 0 4 0149272 63244  397   7   0   0 667   0 52  2  2  0  313  524  99 25 31 45
> 3 1 0162324 63196  397   8   0   0 487   0 46  1  1  0  338  436  89 21 20 59
> 1 2 0149536 66444  347  10   0   0 750   0 49  4  3  0  430  481  96 14 17 70
> 1 2 0133720 65248  477  31   0   0 496   0 65  4  3  0  535  715 173 12 20 68
> 1 1 0120512 65432  464  44   0   0 550   0 64  3  3  0  667  953 271 21 23 56
> 1 1 0138128 62048  421  27   1   0 321   0 66  5  3  0  853 1221 460 44 22 34
> 5 0 0172656 61044  385  21   1   2 257   0 56  3  6  0  845 1275 389 40 47 14
>      8 ...  1:02AM  up 4 days, 14:37, 2 users, load averages: 1.78, 3.39, 3.20
>
>... sd0's activity remains high while "nntpsend" shrinks individual
>batch files, as required, before starting "innxmit".
>
> 1 1 0168960 60748  378  98   1   0 304   0 60  6  7  0  875 1352 379 51 33 16
> 4 0 0160352 62372  234 153   0   0 266   0 58  6  6  0 1173 1298 502 48 31 21
> 6 0 0160668 59876  374 170   0   0 369   0 47  6  7  0 1380 1534 697 54 34 12
> 1 0 0152824 59920  292  79   0   4 229   0 59  5  8  0 1418 1500 648 50 31 19
> 2 1 0161312 59500  382  86   0   0 298   0 48  4  6  0 1427 1397 681 54 29 17
> procs   memory     page                    disks         faults      cpu
> r b w   avm   fre  flt  re  pi  po  fr  sr s0 s1 s2 f0   in   sy  cs us sy id
> 0 5 0179268 58256  389 136   0   0 363   0 49  6  8  0 1182 1351 511 48 32 20
> 6 2 0183644 58748  242  82   0   4 249   0 46  6  6  0 1184 1034 405 27 66  7
>     15 ...  1:03AM  up 4 days, 14:38, 2 users, load averages: 2.64, 3.27, 3.16
> 4 0 0200788 58544  397  94   0   0 456   0 50  5  6  0 1428  900 516 39 32 29
> 3 0 0183896 58300  458  39   0   0 377   0 46  6  7  0 1582  876 657 52 29 19
> 0 3 0183840 58136  384 189   0   0 467   0 56  7  6  0 1510  913 583 37 35 27
> 1 4 0193288 58964  395 112   0   0 473   0 60  5  6  0 1545  836 533 26 34 40
> 1 1 0202900 56124  356  69   2   0 280   0 53  8  6  0 1629 1221 651 46 32 22
> 5 0 0224448 55692  465  74   0   3 364   0 58  6  8  0 1647 1366 601 39 35 26
>     22 ...  1:04AM  up 4 days, 14:40, 2 users, load averages: 2.71, 3.10, 3.10
> 0 2 0224260 56084  292  57   0   0 202   0 20  4  6  0 1338 1405 454 28 70  2
>
>... so here's a case where the idle time is pretty high, but so is the
>size of the runnable queue.  The blocked queue is also relatively large,
>and sd0's activity is still very high.  So are the processes in the
>runnable queue ones which have just been moved off the blocked queue
>and onto the runnable queue?

I would hazard a guess that that is what is happening.
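
One way to check would be to run "ps -axl" in a loop while this is
going on; it shows the wait channel (WCHAN) each sleeping process is
parked on, so you can see exactly what the "blocked" ones are waiting
for, and watch them pop back onto the run queue.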

> 2 4 0224392 55476  278  67   0   0 217   0 61  8  7  0 1798 1162 649 27 33 40
> 2 2 0194820 55728  453 293   0   0 676   0 67  8  9  0 1951 1386 698 22 41 37
> 6 3 0216020 55440  350 132   0   4 352   0 64  5  7  0 1766 1074 592 23 36 40
> 3 2 0202100 56616  521 224   0   0 664   0 67  5  7  0 1738 1183 635 26 38 36
> 3 0 0212272 55836  336 180   0   0 421   0 57 12  7  0 2421 1281 1046 38 46 16
> 6 0 0285124 53764  490 162   0   3 358   0 53 11  7  0 1959 1411 748 39 50 11
> 8 0 0259120 53616  100 189   1   0 196   0 50  9 10  0 2061 1724 680 25 75  0
>     26 ...  1:05AM  up 4 days, 14:41, 2 users, load averages: 3.34, 3.10, 3.09
> 3 3 0241744 57260  425 228   0   0 529   0 54 12 11  0 2144 1555 777 37 51 12
> 2 2 0246284 54972  249 165   0   4 205   0 44 11  8  0 2186 1466 702 11 36 53
> 2 1 0229864 54288  421 176   0   0 500   0 53 10  9  0 2640 1383 860 13 46 41
> 2 3 0247308 53744  367 295   0   0 515   0 52  9  9  0 2555 1411 896 16 48 36
> 2 0 0247932 54920  465 164   0   0 611   0 71  7 10  0 2512 1271 826 14 43 42
> 5 1 0243604 52420  399 174   1   2 375   0 60  9  9  0 2594 1599 953 24 45 31
> procs   memory     page                    disks         faults      cpu
> r b w   avm   fre  flt  re  pi  po  fr  sr s0 s1 s2 f0   in   sy  cs us sy id
> 7 0 0288040 50296  250 155   0   0 210   0 37 12 10  0 3377 1389 1021 18 77  6
>     33 ...  1:06AM  up 4 days, 14:42, 2 users, load averages: 3.47, 3.08, 3.07
> 3 0 0253692 51800  224 222   1   0 329   0 32  8  9  0 2579 1611 821 20 78  2
> 1 1 0281712 50296  920 113   1   6 469   0 64 11  9  0 2426 1977 915 33 47 20
> 2 0 0272796 48824  585  80   0   0 256   0 12  9  8  0 2973 1169 1241 57 43  0
> 3 0 0276856 48248  314  71   0   0 236   0  3  7  8  0 2590 1126 1104 67 33  0
> 7 5 0306968 47636  583 117   0   0 435   0 25  8  9  0 2389 1339 985 58 39  3
> 322 0304700 47388  426 206   0   0 372   0 76 15 10  0 2873 1803 1041 24 52 24
>
>... Whoa!  22 processes in the blocked queue.  We haven't seen that
>many processes reported blocked in quite a while.  Is it a quirk of
>fate (or design) that "vmstat" hasn't reported more processes blocked?

Probably fate.

> 519 0347620 56352  349 240   0   0 709   0 76 15 10  0 2852 1586 1012 19 52 28
>
>... Whoa!  There's another big jump in the blocked queue size.
>
>16 7 0338896 56244  261 400   0   0 506   0 76 11 10  0 2832 1639 991 20 77  3
>10 1 0306212 56392  173 291   0   4 357   0 63 12 13  0 3197 1582 1055 21 79  0
>     36 ...  1:08AM  up 4 days, 14:44, 2 users, load averages: 7.15, 3.99, 3.40
> 3 7 0330700 55532  387 143   0   0 355   0 33 17 18  0 3136 2029 1100 25 71  4
> 5 2 0317968 54840  287 254   0   0 305   0 63 13 22  0 3552 1791 1293 26 57 17
> 5 0 0293896 54008  429 192   0   3 423   0 42 15 18  0 3418 1859 1277 27 57 16
> 6 0 0298084 54288   12 146   0   0  18   0  7 15 21  0 3905 1692 1343 25 54 21
> 6 0 0269752 52384   48 200   0   0   0   0  8 17 16  0 3383 1792 1225 33 51 16
> 5 0 0261352 52408    2 132   0   1   6   0 17 18 16  0 3506 1942 1228 23 48 29
> 6 0 0317228 50716  147 119   0   0  76   0 11 10 13  0 2870 1543 905 17 51 31
> 7 0 0285404 50436   60 126   0   0   2   0  3 11 11  0 2975 1677 857 20 80  0
>
>... OK, now we're more-or-less at the state we'll run at for the rest
>of the hour.  CPU idle time fluctuates from single digits (rarely)
>to a more typical 30-45%.  Yet the load average is pretty high for a
>single-CPU system.
>
>Any insight would be appreciated.  More "vmstat" output to follow my
>.sig.

Well, I am personally unhappy if my load average goes above 1.0...  it
either means the CPU is too wimpy or the I/O subsystem bites.  This has
been a truism on every news server I have seen.

I suspect that the numbers you are seeing are skewed in favor of runnable
processes instead of blocked processes.  You definitely do not want to see
a lot of blocked processes...  

Your out-of-band description of your local hardware and software situation,
and the fact that you say you are feeding 7M articles/day, simply serves
to confirm my suspicion that you are probably seeing latency due to I/O
contention.
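
If you want to confirm that, watch "iostat 5" (or the disk columns in
systat) across one of those busy periods: if sd0 is pinned at its
transaction ceiling while vmstat still shows idle CPU, the disk is
your bottleneck, not the processor.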

Anyway, I have another comment or two, but they are only relevant to your
situation...  so I will take it off-line.

... JG