*BSD News Article 32624

Newsgroups: comp.os.386bsd.misc
Path: sserve!newshost.anu.edu.au!harbinger.cc.monash.edu.au!msuinfo!agate!library.ucla.edu!europa.eng.gtefsd.com!MathWorks.Com!panix!zip.eecs.umich.edu!umn.edu!csus.edu!netcom.com!hasty
From: hasty@netcom.com (Amancio Hasty Jr)
Subject: Pentium secrets
Message-ID: <hastyCsqJ3L.L1o@netcom.com>
Organization: Netcom Online Communications Services (408-241-9760 login: guest)
Distribution: comp.os.386bsd.misc
Date: Sun, 10 Jul 1994 17:26:09 GMT
Lines: 249

Hi,

I am re-posting this article with the hope that it can further improve
FreeBSD's performance :)  


	Happy Reading,
	Amancio
------------------------ start hacking --------------------------------
Article 17295 of comp.sys.intel:
Xref: netcom.com comp.sys.intel:17295
Path: netcom.com!csus.edu!csulb.edu!nic-nac.CSU.net!usc!cs.utexas.edu!asuvax!chnews!ornews.intel.com!news.jf.intel.com!news.jf.intel.com!glew
From: glew@ichips.intel.com (Andy Glew)
Newsgroups: comp.sys.intel
Subject: Re: "Pentium Secrets"
Date: 10 Jul 1994 06:52:35 GMT
Organization: Intel Corp., Hillsboro, Oregon
Lines: 216
Message-ID: <GLEW.94Jul9235235@pdx007.intel.com>
References: <2ucnjf$hrk@hearst.cac.psu.edu> <Cs6Fws.9s@murdoch.acc.Virginia.EDU>
	<2utmvh$nl4@vkhdib01.hda.hydro.com>
NNTP-Posting-Host: pdx007.intel.com
In-reply-to: terjem@hda.hydro.com's message of 30 Jun 1994 05:59:13 GMT

Now that Terje Mathisen has published in Byte most of the details
about the Pentium(tm) processor performance counters - a facility that
has come to be called EMON, standing for Event Monitoring - I'd like
to add a few notes to protect Intel's interests.

Note that I am *NOT* doing this as an official representative of
Intel.  I write the following to try and prevent people from writing
non-portable code that will cause both of us headaches.

(1) One of the biggest reasons for EMON being kept "secret" was that
Intel does not want to get forced into a compatibility corner by EMON.
I.e. we want to have the freedom to change the EMON counters in
arbitrary ways in the future, e.g. by changing event codes,
e.g. taking statistics that are meaningless on one processor and
replacing them by things more useful.
    Therefore, portable software should not depend on the existence of
the EMON facility, or on particular event codes or register formats.
    The EMON facility should be considered model specific, useful for
tuning code on a particular model. I can almost 100% guarantee that
Pentium(tm) processor EMON code will *not* run on P6.
    We do *not* want anybody except a university researcher to do
things like using EMON data to do processor cache affinity process
scheduling (to take one possible application from an earlier,
pre-Intel, life). {On the other hand, I'd like university researchers
to consider doing things like that. It's a good area for research. We
just don't want to freeze this feature now.}
    
(2) Furthermore, *anything* in MSR space is model specific, and not
portable unless Intel makes great big bold letter statements to the
contrary. "MSR" stands for "Model Specific Register" after all.

(3) RDMSR(MSR=10h) versus RDTSC: yes, indeed, MSR=10h is the TimeStamp
Counter (TSC).  However, accessing this via RDMSR and WRMSR is *not*
portable.
    RDTSC is the *portable*, architectural, way of accessing the
timestamp counter. It's faster, and it has certain other conveniences.
Please avoid using RDMSR(MSR=10h).
    There is no portable way of writing the TSC. WRMSR(MSR=10h) works
to a degree, but is non-portable. Moreover, arbitrary writeability is
*not* guaranteed - it may not be possible to write any arbitrary bit
pattern to the counter.
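
A minimal C sketch of point (3): read the timestamp counter with the
RDTSC instruction rather than through RDMSR, and do not assume the
counter exists. It assumes a GCC- or Clang-style compiler (for the
inline asm syntax) and a CPU that implements the CPUID instruction
itself. The helper names cpu_has_tsc and read_tsc are made up for
illustration. CPUID leaf 1 reports TSC support in EDX bit 4; on
non-x86 builds the sketch simply reports that no TSC is available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#define ON_X86 1
#else
#define ON_X86 0
#endif

/* Returns 1 if CPUID leaf 1 reports the TSC feature (EDX bit 4), else 0.
   Assumption: the CPUID instruction itself is available on this CPU. */
static int cpu_has_tsc(void)
{
#if ON_X86
    uint32_t eax = 1, ebx, ecx = 0, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));
    (void)ebx;
    return (int)((edx >> 4) & 1u);
#else
    return 0;
#endif
}

/* Stores the raw TSC in *out and returns 1, or returns 0 and leaves
   *out untouched when no TSC is reported. Read-only: never writes TSC. */
static int read_tsc(uint64_t *out)
{
    assert(out != NULL);
#if ON_X86
    if (!cpu_has_tsc())
        return 0;
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    *out = ((uint64_t)hi << 32) | lo;
    return 1;
#else
    return 0;
#endif
}
```

Callers would fall back to some other clock when read_tsc returns 0,
which keeps the model-specific part in one place.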

(4) TSC semantics:
    I'd also like to emphasize a few points about the TSC. The TSC's
architectural purpose is as a *timestamp* counter - a value that is
guaranteed to be monotonically increasing (modulo wrap), every time it
is read.
    On the Pentium(tm) processor, RDTSC just happens to also be a
clock count, which is useful for performance monitoring. However, that
performance monitoring usage is model specific. Portable software
should not depend on it being a measure of absolute time, although it
will nearly always be a measure of the amount of work a processor can
complete.
    Hell - it's not even clear how to measure absolute time in terms
of clock cycles on future processors.  There are processors from other
companies that are capable of continuously varying the clock,
dynamically changing frequency to save power. So "CPU clocks" would be
useless as a measure of absolute time.
    In a particular system and implementation, where the software is
written with knowledge of the system clocking strategy and the model
of CPU in use, it may be acceptable to use RDTSC as a measure of
absolute time. E.g. I might be willing to do that myself in a
benchmarketing war. But generic software that will run on many
different platforms should not do this. Usage in a DLL or shared
library may be advised.
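
A small C sketch of the two usages discussed in (4), with hypothetical
helper names. ticks_between treats the counter purely as a timestamp
and relies only on "monotonically increasing (modulo wrap)".
ticks_to_ns is the model-specific usage: it is only meaningful if the
caller really knows the tick rate of the system at hand, which is an
assumption the code cannot check.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks elapsed between two raw counter readings. Unsigned subtraction
   is modulo 2^64, so the result stays correct across a single wrap. */
static uint64_t ticks_between(uint64_t earlier, uint64_t later)
{
    return later - earlier;
}

/* Convert ticks to nanoseconds for a caller-supplied, fixed tick rate.
   Assumption: ticks_per_second is below about 1.8e10, so the
   intermediate product cannot overflow 64 bits. */
static uint64_t ticks_to_ns(uint64_t ticks, uint64_t ticks_per_second)
{
    assert(ticks_per_second != 0);
    uint64_t whole = ticks / ticks_per_second;
    uint64_t rest = ticks % ticks_per_second;
    return whole * UINT64_C(1000000000)
         + rest * UINT64_C(1000000000) / ticks_per_second;
}
```

On a system with a varying clock, only the first function remains
meaningful.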

(5) Don't write TSC!
    Furthermore, one of the first things an OS developer is going to
do on seeing TSC is to wonder "Should TSC be a global, or should TSC
be context switched so that it can be a process (or thread) virtual
time?" 
    The answer is, emphatically, NO! TSC should not be context
switched (forgetting secure OS issues for the moment).
    Recall that WRMSR(MSR=10h) is not a portable way of writing the
TSC. Furthermore, the ability to write arbitrary values is *not*
guaranteed.
    Therefore, do *not* write the TSC.
    Instead, if you must play games with providing global, per
process, or per-thread times, do the smart thing and provide an offset
that your library code can add to the raw TSC value to get the
appropriate correction. Use the classic HI;LO;HI algorithm to read the
two values "atomically":
    E.g.
    	volatile global int64 AbsoluteTimeOffset;
    	volatile global int64 ProcessTimeOffset;
    	volatile global int64 ProcessUserTimeOffset;
    	volatile global int64 ThreadTimeOffset;
    	....

    	int64 ReadUserTime() {
    	    int64 off1, off2, tsc;

    	    off1 = ProcessUserTimeOffset;
    	    tsc = RDTSC();   	    	    /* an appropriately fenced asm function */
    	    off2 = ProcessUserTimeOffset;
    	    if( off1 == off2 ) 
    	    	return off1 + tsc;
    	    else {
    	    	/* Do something special to handle this case,
    	    	   e.g. retry, or return off2+tsc (which can only be
    	    	   done if there are conventions on the permitted
    	    	   range of values), or do an OS call to make the
    	    	   read atomic, or... */
    	    }
    	}   
This is better long-run, because you can then implement arbitrary varieties of 
timers.
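
One way to fill in the "retry" option from the pseudocode above, as a
hedged C sketch. The counter source is passed in as a function pointer
so that the TSC itself is never written and so the sketch does not
depend on x86 hardware; fake_tsc is a stand-in invented for this
example, and the snake_case names are likewise hypothetical. It also
assumes that a single read of the 64-bit offset is not torn, which may
not hold on a 32-bit machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*tsc_reader)(void);

/* Offset maintained by the OS or library; the raw TSC is left alone. */
static volatile int64_t process_user_time_offset;

/* Stand-in counter source for illustration; real code would use RDTSC. */
static uint64_t fake_tsc(void)
{
    return 1000;
}

/* Read offset, counter, offset again; retry if the offset changed
   underneath us, so the returned sum uses a consistent pair. */
static int64_t read_user_time(tsc_reader read_tsc)
{
    assert(read_tsc != NULL);
    for (;;) {
        int64_t off1 = process_user_time_offset;
        uint64_t tsc = read_tsc();
        int64_t off2 = process_user_time_offset;
        if (off1 == off2)
            return off1 + (int64_t)tsc;
        /* Offset was updated mid-read (e.g. by a context switch). */
    }
}
```

For example, with the offset set to -400 and the stand-in counter at
1000, read_user_time(fake_tsc) yields 600.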

(6) Fencing of RDTSC:
    Remember that RDTSC is a timestamp counter? That guarantees that
successive invocations always return different, monotonically
increasing, values. I.e.  it makes a statement about the ordering of
RDTSC instructions.
    But it doesn't make any statement at all about the ordering of
RDTSC with *other* instructions. So, e.g. if you are trying to use
RDTSC to time a single instruction, as in
    
    a = RDTSC()
    MOV mem, eax    	/* Store eax to memory */
    b = RDTSC()

It is entirely possible that the second RDTSC may execute *before* the
instruction under test, e.g. MOV mem, eax.
    E.g. on the Pentium(tm) processor, writes may be buffered - so the
second RDTSC may be executed before the buffered write gets done.
    This is a simple case. On future processors, there may be many
more examples of such overlap.

    If you really want to measure a particular instruction, you must
insert the appropriate fencing directives. The easiest "serializing"
instruction is CPUID. So, to really time an individual store, you must
do:
    CPUID
    a = RDTSC()
    CPUID
    MOV mem, eax    	/* Store eax to memory */
    CPUID
    b = RDTSC()
    CPUID
and then account for the time of the CPUID serializations.
(Warning: it is also possible for a system board to be built that 
prevents CPUID from properly serializing. I'd discount the possibility,
except that such a system will be a little bit faster, and will run nearly all,
but not all, software. I.e. it's tempting. So check first, if you can.)

    For Joe Programmer trying to time his code, this is overkill - you
probably don't care about a few cycles of noise due to overlap, if you
are RDTSC'ing, e.g., at the beginning of every function call and
return.  So leave out the CPUIDs in this usage model. 
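
The fenced sequence in (6) might look roughly like this in C. This is
a sketch only: it assumes GCC- or Clang-style inline asm on an x86 CPU
that has both CPUID and RDTSC, and every function name here is
invented for the example. On other architectures the helpers compile
to stubs so the file still builds. fence_overhead measures two fenced
reads back to back, which gives a rough figure to subtract when
"accounting for the time of the CPUID serializations".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
/* CPUID used purely as a serializing instruction; results discarded. */
static inline void serialize(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                     :
                     : "memory");
    (void)ebx;
    (void)edx;
}

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
#else
/* Not x86: stubs so the sketch compiles; the timings are meaningless. */
static inline void serialize(void) {}
static inline uint64_t rdtsc(void) { return 0; }
#endif

/* CPUID; RDTSC; CPUID */
static uint64_t fenced_rdtsc(void)
{
    serialize();
    uint64_t t = rdtsc();
    serialize();
    return t;
}

/* Cost of the fences alone: two fenced reads with nothing in between. */
static uint64_t fence_overhead(void)
{
    uint64_t a = fenced_rdtsc();
    uint64_t b = fenced_rdtsc();
    return b - a;
}

/* Time a single store; the result still includes the fence overhead. */
static uint64_t time_store(volatile uint32_t *mem, uint32_t value)
{
    assert(mem != NULL);
    uint64_t a = fenced_rdtsc();
    *mem = value;
    uint64_t b = fenced_rdtsc();
    return b - a;
}
```

A single sample is noisy; in practice one would probably repeat both
measurements many times and compare the minimum values.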

(7) Extensibility of EMON
    Mathisen's Byte article says "An obvious extension for Intel's
next CPU... would be to use all 64 bits of MSR 11h and add two more
stat counters as MSR 14h and 15h".
    Remember point (1) above?  Well, I can't tell you anything about
P6, how many counters it has, or whether P6 has EMON at all, but I
feel obliged to tell you: MSR 11h does not work in anywhere near the
same way on P6 as it does on the Pentium(tm) processor.
    So please do not provide OS level facilities to program MSR 11h
and expect exactly the same thing to work on P6.

(8) Finally, I'd like to share a very useful feature that *is*
documented, but which very few people have picked up on about the
Pentium(tm) EMON counters.  
    There are pins called something like "PM0/BP0" and "PM1/BP1"
documented in the Pentium(tm) processor data sheet (the names change a
bit in different revs of the document).
    "PM" stands for "Performance Monitoring".
    E.g. these external pins can be configured to toggle when an
internal event like a BTB hit occurs.  They can also be configured to
toggle on overflow of the counter - so you can load a value like -1000
into it, and cause the pins to wiggle after 1000 cache misses.
    A summer student working for me has soldered a wire from the PM0
pin to the NMI pin of the processor (we cut the existing trace to NMI,
didn't need a failsafe timer), and now we can get an interrupt every
1000 cache misses. The NMI handler reprograms the EMON counter, et
voila, we have an interesting form of statistical profiling that is
*not* based on time.  Very useful for tuning programs - you can see
exactly where the performance problems occur.
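
The "load a value like -1000" trick can be sketched in C as follows.
The counter width is left as a parameter because the text above does
not state it; the value 40 used in the usage note is a placeholder
assumption, and all function names are hypothetical. No real MSR is
touched: events_until_overflow only simulates an up-counter to show
that the preload produces an overflow after the requested number of
events.

```c
#include <assert.h>
#include <stdint.h>

/* All-ones mask for a counter of the given width (1..64 bits). */
static uint64_t counter_mask(unsigned width_bits)
{
    assert(width_bits >= 1 && width_bits <= 64);
    return width_bits == 64 ? UINT64_MAX
                            : (UINT64_C(1) << width_bits) - 1;
}

/* Value to load so that the counter overflows after `events`
   increments: the two's-complement of events, truncated to the width. */
static uint64_t preload_for(uint64_t events, unsigned width_bits)
{
    return (0 - events) & counter_mask(width_bits);
}

/* Simulation only: count increments until the counter wraps to zero. */
static uint64_t events_until_overflow(uint64_t start, unsigned width_bits)
{
    uint64_t mask = counter_mask(width_bits);
    uint64_t value = start & mask;
    uint64_t n = 0;
    do {
        value = (value + 1) & mask;
        n++;
    } while (value != 0);
    return n;
}
```

With a 40-bit width assumed, preload_for(1000, 40) is 2^40 - 1000, and
the simulated counter started there wraps after exactly 1000 events.
The NMI handler in the setup described above would reload that same
value each time it fires.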


CONCLUSION:
    Please bear the above in mind if using the facilities Terje
Mathisen reverse engineered and documented in Byte. I'm afraid that I
can't tell you about anything in Appendix H [*], but at least I can try to
prevent y'all getting tied up into compatibility problems on the
reverse engineered stuff.

[*] Amusingly, I don't even have a copy of Appendix H at the moment.
    I left it out on my desk one night, and a security guard on a
    sweep swiped it, and hasn't given it back in over a week. :-(

----------

Oh, and by the way: at the end of his article Terje says: "I hope that
    Intel makes official information available to all programmers and
    that such useful features are incorporated into other
    architectures such as Alpha, PowerPC, and SPARC."

Hearsay and postings on the net make me 100% confident that various
DEC Alpha and HP PA implementations have performance counters and high
res timers. I'm not so sure about PowerPC, but I do note that IBM's
Power2 architecture has a great big wishlist of performance counters,
discussed (but not to the extent you could use them) in an article in
the second volume of RS6000 papers.

I.e. I'm pretty sure that a whole slew of other chips have them, but
they are undocumented in exactly the same way Intel's EMON is. For
probably the very same reasons. I'm not defending this, just pointing
it out.

--

Andy "Krazy" Glew, glew@ichips.intel.com, Intel, 
M/S JF1-19, 5200 NE Elam Young Pkwy, Hillsboro, Oregon 97124-6497.  
Place URGENT in email subject line for mail filter prioritization.
DISCLAIMER: private posting, not representative of employer.





-- 
FREE unix, gcc, tcp/ip, X, open-look, netaudio,  tcl/tk, MIME, midi,sound
at  freebsd.cdrom.com:/pub/FreeBSD
Amancio Hasty,  Consultant |
Home: (415) 495-3046       |  
e-mail hasty@netcom.com	   |  ftp-site depository of all my work:    
                           |  sunvis.rtpnc.epa.gov:/pub/386bsd/X