*BSD News Article 80756

Path: euryale.cc.adfa.oz.au!newshost.carno.net.au!harbinger.cc.monash.edu.au!news.rmit.EDU.AU!news.unimelb.EDU.AU!munnari.OZ.AU!news.mel.connect.com.au!news.mel.aone.net.au!imci4!newsfeed.internetmci.com!in1.uu.net!twwells!twwells!not-for-mail
From: bill@twwells.com (T. William Wells)
Newsgroups: comp.unix.bsd.freebsd.misc
Subject: Re: FreeBSD as news-server??
Date: 14 Oct 1996 17:56:35 -0400
Organization: None, Mt. Laurel, NJ
Lines: 229
Message-ID: <53ucuj$8qh@twwells.com>
References: <537ddl$3cc@amd40.wecs.org> <53ott7$579@adv.IAEhv.nl> <53pm5c$5ks@twwells.com> <53u1ic$61i@flash.noc.best.net>
NNTP-Posting-Host: twwells.com

Before I get into this, one thing I don't recall being mentioned,
so I'll mention it just in case: be sure to compile the kernel on
your news machine to allow lots of open files and lots of child
processes. Otherwise, you _will_ run out of resources.....

In article <53u1ic$61i@flash.noc.best.net>,
Matthew Dillon <dillon@best.com> wrote:
: :In article <53pm5c$5ks@twwells.com>, T. William Wells <bill@twwells.com> wrote:
: :>No, it wouldn't. Almost certainly, INN is slower for a single
: :>incoming newsfeed than C news. In this day of huge news spool
: :>directories, it is absolutely necessary that the process
: :>accepting incoming NNTP *not* write the articles to the spool.
: :>The latency this introduces into the protocol slows it down way
: :>too much. (No, streaming doesn't help -- many providers have
: :>found quite the opposite and have stopped using it....)
: :>
: :>With bare INN, you cannot even get 2 articles/second on typical
:
:     woa woa!  Not true any more!  Just make sure all of your feeds
:     understand INN's streaming mode.

Streaming mode is bad unless you have *just one* feed. Otherwise,
it steps on itself with latency. Even worse than nonstreaming
feeds do. And not all providers will send you a streaming feed....

Also, experience (and my theoretical analysis) shows that multiple
parallel feeds generally work better than streaming.

:     I get about 5 articles/sec
:     transfer rate from my main news machine to my nntp machine
:     under medium load conditions (around 200 nnrpd users).

I'm going to bet that you aren't using "typical PC hardware". :-)
When I was doing INN with streaming mode, I wasn't even getting 2
articles/second.

:     It's harder to tell on the newsfeeds machine, since it has a dozen
:     incoming feeds, but I would say the aggregate burst transfer is on the
:     order of 10 articles/sec.

I can get rates like that, even with the disks I have. But not
often. :-)

:     This is true to a degree, but you hit the big problem with CNews
:     on the nose below:
:
: :>If you have more than one incoming feed, things get complex. I'll
: :>save my fingers explaining why, as I have no reason to believe
: :>that this person has more than one feed.
:
:     ... which is why most people run INN now rather than CNews.

At one time, INN was a clear winner for NNTP feeds. Thus it got a
following. When the volume went up, streaming was instituted --
this helped with the NNTP protocol's problems. But eventually it
started to hurt more than it helped.  People are now doing parallel feeds
once they figure out just how much streaming is hurting them.

(I know a lot of people will say "But *my* streaming feeds work
fine!" That's because you have nice fast hardware that keeps up
even if streaming *is* screwing you over. Or you're not trying to
run a full feed. But when you're working with marginal hardware
and a complete feed [actually, 4 of them], like I am, these
effects make the difference between a server that will keep up and
one that won't.)

So the bottom line is that INN is preferred because in the past it
*was* better. Now, there's very little research that compares the
two to say which is better but a lot of opinion based on earlier,
and no longer relevant, experience. I *have* done some research
and in most cases a C news-like system with a msgid daemon will
beat the pants off an INN system, on the same piece of hardware.

(My current system has a msgidd with a four hour cache, an
optimized feed reader that creates spool files, and an innd that
reads spool files from disk instead of accepting them via nntp.)
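
A minimal sketch of what the core of a message-ID cache with a
four-hour lifetime could look like, in Python for brevity. This is
not the msgidd described above: the class name MsgidCache, the
method offer(), and the explicit `now` argument are all hypothetical.
It only shows the idea that an article offered by several parallel
feeds should be accepted once. It assumes `now` never goes backwards.

```python
from collections import OrderedDict

FOUR_HOURS = 4 * 60 * 60  # cache lifetime in seconds, per the post


class MsgidCache:
    """Hypothetical in-memory cache of recently offered message-IDs."""

    def __init__(self, ttl=FOUR_HOURS):
        self.ttl = ttl
        self._seen = OrderedDict()  # msgid -> time first offered, oldest first

    def _purge(self, now):
        # Entries were inserted in time order, so stop at the first fresh one.
        while self._seen:
            msgid, stamp = next(iter(self._seen.items()))
            if now - stamp < self.ttl:
                break
            del self._seen[msgid]

    def offer(self, msgid, now):
        """Return True if the caller should take the article, False if
        another feed already offered it within the cache lifetime."""
        self._purge(now)
        if msgid in self._seen:
            return False
        self._seen[msgid] = now
        return True
```

A feed reader would call offer() for each message-ID it is offered
and refuse the article whenever the result is False.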

:     * You need lots of ram.  The machine cannot afford to swap *at all*,
:       plus you need enough to keep most of the history file and all of the
:       history file page table in core, plus you need enough to be able to
:       keep your feeds coming in *while* an expire run is going on.  Expire
:       has about the same memory utilization as innd, effectively doubling
:       your in-core memory requirements.  Finally, you need lots of left
:       over memory for filesystem caching.

This is one of the things that INN simply has done *wrong*. I
won't bore you with the details -- I've written lots of posts
elsewhere on the subject -- but what happened is that the INN
designers minimized the *immediate* use of resources, without
taking into account secondary effects. (Or, to be fair, possibly
they did but things have changed radically since INN was
designed.)

There is something you want to keep in mind regarding newsfeeds. A
typical article requires 64K of disk activity to write just the
article. (Or did, last year. This per-article cost is O(n) in the
size of the newsfeed -- which means that total article disk
activity is O(n**2) in the newsfeed size.)

What this means is that optimizations regarding the history file
are generally pointless. Keeping the history file in memory cuts
out at most 8K per article of disk activity -- while INN spends
time waiting on that 64K (it's mostly directory stuff, so INN
doesn't get buffer cache benefits for it). Since these two
operations can be done somewhat asynchronously, you don't get
much "win" by minimizing history accesses.
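
To put rough numbers on this, here is a back-of-the-envelope sketch.
It assumes the 8K of history I/O comes on top of the 64K of article
I/O (the figures above don't say whether the two overlap), and that
both the number of articles and the per-article cost scale linearly
with feed size. The function names are made up for illustration.

```python
ARTICLE_IO_KB = 64  # disk activity to write one article (figure from the post)
HISTORY_IO_KB = 8   # most that an in-memory history file can save per article


def max_history_saving():
    """Upper bound on the fraction of per-article disk I/O that caching
    history can remove, assuming the two costs simply add up."""
    return HISTORY_IO_KB / (ARTICLE_IO_KB + HISTORY_IO_KB)


def relative_article_io(feed_scale):
    """Total article I/O relative to today's feed, assuming both the number
    of articles and the per-article cost grow linearly with feed size."""
    articles = feed_scale
    cost_per_article = feed_scale
    return articles * cost_per_article


print(f"best-case saving from history caching: {max_history_saving():.0%}")
print(f"article I/O if the feed doubles: {relative_article_io(2)}x")
```

Under those assumptions the best case for history caching is about
an 11% cut in disk activity, while a doubled feed means four times
the article I/O.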

And, in fact, I haven't seen much effect either way between
mapping history in and reading it -- except that whenever INN
forks you get multiple copies of that frigging history file and if
you have a lot of forks happening simultaneously, you run out of
swap.

Anyhow, regarding C news -- you need less RAM than you do with
INN precisely because its components use less memory.  Sure, that
means more CPU time spent on, say, kernel calls in expire. But it
doesn't increase disk activity (and may reduce it, actually.)

Related to that, if you can at all do it, don't have innd accept
nnrpd connections. Use a separate daemon like connectd to do it
on a different IP address than innd's. This will not only make
the initial connection *much* faster, it'll cut down on the space
used as innd forks, and thus on swapping.

:     * Lots of spindles.  Separate by functionality.
:
:       (a) /usr/local/lib/news or /news or whatever you want to call it
:           should own its own disk.  The logs *can* be put on the same disk.

Yeah.

:           The partition containing the history file MUST have at *least* 1GB
:           free.  The reason is that it must not only support the potentially
:           100-200MB history file, it must also support the expire run's
:           history file rebuild *AND* support active references to unlinked
:           history files by nnrpd and other programs that will prevent the
:           'old' history file's space from being reclaimed.

This is unnecessary. You need space for just two copies of the
history file, plus lots of log space. 300M free works. For now.
:-)

The trick is that you do two different types of expires: once a
day, you expire and rebuild the history file.  (And do
expireover.) If you need to expire more often, don't rebuild the
history file on the additional expires, just get rid of the
articles. (Done right, what you do is keep a list of articles
from the previous expire, use comm to eliminate those you've
already expired, and then run that through fastrm.)
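
The comm step can be sketched as follows, in Python rather than
shell. This assumes each expire pass produces a list of article
paths to remove and that the previous pass's list was saved; the
function name and the sample paths are hypothetical. The effect is
that of `comm -23` on two sorted lists: keep only the lines unique
to the current list.

```python
def newly_expired(current, previous):
    """Return paths in `current` that are not in `previous`, preserving
    order. Equivalent in effect to `comm -23` on two sorted lists."""
    already_done = set(previous)
    return [path for path in current if path not in already_done]


# Hypothetical article paths, relative to the spool directory.
previous_run = ["comp/unix/1001", "comp/unix/1002"]
current_run = ["comp/unix/1001", "comp/unix/1002", "comp/unix/1003", "misc/test/17"]

for path in newly_expired(current_run, previous_run):
    print(path)  # this list is what would be handed to fastrm
```

Only the two paths missing from the previous run are printed, so
the removal step never spends time on files that are already gone.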

Then you make sure that no innxmits or nnrpds run longer than 24
hours and you're set.

:       (b) /var/spool/news or whatever you call it.. the news spool,
:           should generally own several disks.  I suggest a minimum of
:           three disks.

I'm doing it on two, though I'd love to have three.

:       (c) Overview... it is not strictly necessary to put the overview
:           files on a separate physical disk if you (1) have three or more
:           disks for your main spool and (2) buffer the overview records
:           in the newsfeeds file correctly.

But be sure to put the overview files in a separate directory tree
-- otherwise overchan spends a lot of time directory searching.

:     * If you normally have more than a dozen or so active NNTP users, have
:       a *second* machine.  That is, use one machine for your newsfeeds machine
:       with a minimal spool, and a second machine for your reader machine with
:       a huge spool.  nnrpd processes *kill* INN.

This is a relatively low time for me and I have 17 nnrpds. I've
seen like 40 and it doesn't bother my innd at all.

Then again, one of the major hacks in my server is that articles
are stored in subdirectories of the newsgroup tree. Instead of
"%ld", artnum, they're stored as "%07ld.a/%03ld", artnum / 1000,
artnum % 1000. This made a *huge* difference in efficiency, both
increasing the speed of innd and decreasing the effects of a
large number of nnrpds.
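
A small sketch of that mapping, in Python for brevity. The helper
names are hypothetical; the format strings are the ones quoted
above (Python accepts and ignores the `l` length modifier, so they
can be reused verbatim). With this layout no directory holds more
than 1000 articles.

```python
def article_path(artnum):
    """Map an article number to its bucketed path within a newsgroup
    directory, using the format strings quoted in the post."""
    return "%07ld.a/%03ld" % (artnum // 1000, artnum % 1000)


def flat_path(artnum):
    """The stock layout: one file per article, all in one directory."""
    return "%ld" % artnum


print(flat_path(123456))     # 123456
print(article_path(123456))  # 0000123.a/456
print(article_path(42))      # 0000000.a/042
```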

:     * Make sure INN is compiled properly
:
:       (a) use the history file page table in-core option or the history
:           file mmap() option.  I actually suggest the page table in-core
:           option because most UNIX systems' buffer caching algorithms
:           seem to work better with lseek()/read() than with mmap()/access,
:           even though the overhead is greater with lseek()/read().

As I said, I don't think this makes much difference anymore. For
sure, on the system I have, it makes things *much* worse to have
a large data segment for innd.

:       (b) Buffer writes to the log file.  It's another configuration option.
:           Be generous :-)  This allows you to put the logs on the same
:           physical disk as the history file.

It's not a configuration option. It's an argument to innd. If you
don't specify it, you get buffering.

:       (c) Use the absolute latest INN release, with the streaming mode
:           extensions.

That's inn1.4unoff4, I believe. There is a 1.5beta.....

:       (d) If you run nnrpd, for god's sake use the shared-active patched
:           version!

I might give that a try, but for now, I haven't seen a whole lot
of evidence that it'll make much difference in my system. Then
again, I don't swap much. If I did, I'd probably want back the extra
space consumed in each nnrpd.

:     I also hear people complain about all the fork/exec's... I point out
:     to such people that (a) channels have to fork/exec too, and with
:     much greater overhead doing so from innd rather than cron, and
:     (b) unless you have > 20 feeds, doing 20 fork/exec's from cron once
:     every 5 minutes has almost no overhead, and you can even stagger them
:     from cron to create less disk contention.  This is versus the real
:     time channel feeds which, even when buffered, give you NO ability
:     to stagger their operational starts to reduce disk contention.

And no ability to ensure that 20 * 60M won't really, really,
screw you up memory-wise. Basically, it's a bad idea to run
channel feeds. For that matter, I think I'm going to remove the
last of mine (for overview). Then innd will *never* fork -- and
that's one less thing to get in the way of shovelling articles as
fast as possible. :-)