[SERVER-9393] Very, very fast counters Created: 18/Apr/13 Updated: 10/May/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Performance |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Paul Pedersen | Assignee: | DO NOT USE - Backlog - Platform Team |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | move-sa, platforms-re-triaged | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Participants: | |||||||||||||
| Description |
|
Fast Counters

Scope
To provide very fast counters and timers, making possible an increase in the volume and granularity of stats accumulated within the server.

Not in scope
DTrace integration is not part of this project. Non-x86 processors will not be supported.

Design
Each counter consists of a vector of counters, one per core (or virtual core, in the case of hyper-threading). Counter increments occur per-core, independently and without locking. The counter value is obtained by summing the per-core counters. Experiments confirm that core migration errors (a thread moving to another core between reading its core label and completing the unlocked increment, which can lose a count) occur in about 1 in 10M counts, an acceptable error rate.

Implementation
The key to the fast implementation is the x86 rdtscp instruction. It loads the 64-bit timestamp counter into EDX:EAX, and the core / hyper-core (i.e. “node”) label into ECX. (ECX receives the value of the IA32_TSC_AUX register, which the OS is expected to initialize with that label.) We need to implement a module that determines the node count in order to correctly allocate fast counter / timer arrays.

Portability issues
Within the class of x86 processors the instruction set varies. For older Intel processors, OS support for finding the core label is needed (mainly access to model-specific registers, as opposed to the rdtscp instruction). Linux, Windows, FreeBSD, Solaris, and OS X each have different system calls for accessing MSRs. The initial implementation should use an OS call to obtain the processor label, which greatly simplifies the code.

Testing Results
Testing shows that per-core counters provide large speed improvements vs. a single atomic integer with fetch-and-add (15 vs. 230 ns per increment). Using non-locking vs. locking increment instructions (with per-node counters) provides about a 2X speedup (15 vs. 30 ns per increment). Using Linux sched_getcpu() vs. inlined rdtscp instructions provides a small additional speedup. It seems very likely that Linux in fact uses the rdtscp instruction via vsyscall (see: http://lxr.linux.no/#linux+v3.8.7/arch/x86/include/asm/vsyscall.h in the Linux cross-reference). Other OSes may not be as smart.
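The per-core design described above can be sketched roughly as follows. This is only an illustration, not the code measured in this ticket: the class name FastCounter, the 64-byte padding, and the use of relaxed atomic load/store (to get an unlocked increment without a C++ data race) are all assumptions. It obtains the core label from sched_getcpu() on Linux and falls back to a hash of the thread id elsewhere.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>
#if defined(__linux__)
#include <sched.h>
#endif

// Hypothetical sketch of a per-core counter; names are not from the ticket.
class FastCounter {
public:
    // One slot per core; hardware_concurrency() may return 0, so clamp to 1.
    explicit FastCounter(std::size_t slots = std::thread::hardware_concurrency())
        : _slots(slots ? slots : 1) {}

    // Unlocked increment: a relaxed load followed by a relaxed store, not an
    // atomic read-modify-write. A count can be lost if two threads hit the
    // same slot at once (e.g. after a core migration), which is the small
    // error the design accepts.
    void increment(std::uint64_t delta = 1) {
        auto& v = _slots[currentSlot()].value;
        v.store(v.load(std::memory_order_relaxed) + delta,
                std::memory_order_relaxed);
    }

    // Aggregate the per-core values into the counter value.
    std::uint64_t read() const {
        std::uint64_t total = 0;
        for (const auto& s : _slots)
            total += s.value.load(std::memory_order_relaxed);
        return total;
    }

private:
    // Padding to an assumed 64-byte cache line to avoid false sharing.
    struct alignas(64) Slot {
        std::atomic<std::uint64_t> value{0};
    };

    std::size_t currentSlot() const {
#if defined(__linux__)
        int cpu = sched_getcpu();
        if (cpu >= 0)
            return static_cast<std::size_t>(cpu) % _slots.size();
#endif
        // Fallback when no core id is available: hash of the thread id.
        return std::hash<std::thread::id>{}(std::this_thread::get_id()) %
               _slots.size();
    }

    std::vector<Slot> _slots;
};
```

A caller would construct one FastCounter per statistic, call increment() on the hot path, and call read() only when the stat is reported, since read() touches every slot.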
System Calls

Windows
(1) Find current core id
    DWORD cpuid;
    cpuid = GetCurrentProcessorNumber();
Requirements
    Minimum supported client: Windows Vista
    Minimum supported server: Windows Server 2003
    Header: WinBase.h on Windows Server 2003, Windows Vista, Windows 7, Windows Server 2008, Windows Server 2008 R2 (with Windows.h); Processthreadsapi.h on Windows 8, Windows Server 2012
    Library: Kernel32.lib
    DLL: Kernel32.dll
(2) Find core count
    SYSTEM_INFO sysinfo;
    GetSystemInfo(&sysinfo);
    numCPU = sysinfo.dwNumberOfProcessors;
Requirements
    Minimum supported client: Windows 2000 Professional
    Minimum supported server: Windows 2000 Server
    Header: WinBase.h (include Windows.h)

Linux
(1) Find current core id
    #include <sched.h>
    cpuid = sched_getcpu();
Requirements
    glibc 2.6
(2) Find core count
    #include <unistd.h>
    numCPU = sysconf(_SC_NPROCESSORS_ONLN);
Requirements
    This is a POSIX standard function, although _SC_NPROCESSORS_ONLN is non-standard.

FreeBSD, Mac OS X
(1) Find current core id: tbd
(2) Find core count:
    #include <sys/types.h>
    #include <sys/sysctl.h>
    int mib[2];
    uint32_t numCPU = 0;
    size_t len = sizeof(numCPU);
    mib[0] = CTL_HW;
    mib[1] = HW_AVAILCPU;  /* assumed first selector (absent from the original snippet); OS X specific */
    sysctl(mib, 2, &numCPU, &len, NULL, 0);
    if (numCPU < 1) {
        mib[1] = HW_NCPU;
        sysctl(mib, 2, &numCPU, &len, NULL, 0);
        if (numCPU < 1) numCPU = 1;
    }
Requirements
    OS X versions >= 10.2.

Solaris
(1) Find current core id:
    #include <sys/processor.h>
(2) Find core count:
    #include <unistd.h>
    numCPU = sysconf(_SC_NPROCESSORS_ONLN); |
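The per-OS calls listed above could be wrapped behind two small functions, roughly as in the sketch below. The function names onlineCoreCount and currentCoreId are hypothetical, as is the convention of returning -1 where no core id call is wired in (FreeBSD and Mac OS X are marked tbd above, and the Solaris call is not spelled out in this ticket). On FreeBSD and Mac OS X the sketch uses only the HW_NCPU selector for simplicity.

```cpp
#if defined(_WIN32)
#include <windows.h>
#elif defined(__APPLE__) || defined(__FreeBSD__)
#include <sys/types.h>
#include <sys/sysctl.h>
#else
#include <unistd.h>
#endif
#if defined(__linux__)
#include <sched.h>
#endif

// Hypothetical helper: number of cores to size the counter vector with.
// Always returns at least 1.
int onlineCoreCount() {
#if defined(_WIN32)
    SYSTEM_INFO sysinfo;
    GetSystemInfo(&sysinfo);
    return sysinfo.dwNumberOfProcessors > 0
               ? static_cast<int>(sysinfo.dwNumberOfProcessors)
               : 1;
#elif defined(__APPLE__) || defined(__FreeBSD__)
    int mib[2] = {CTL_HW, HW_NCPU};
    int n = 0;
    size_t len = sizeof(n);
    if (sysctl(mib, 2, &n, &len, nullptr, 0) != 0 || n < 1)
        n = 1;
    return n;
#else
    // Linux, Solaris: non-standard but widely available sysconf name.
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? static_cast<int>(n) : 1;
#endif
}

// Hypothetical helper: label of the core the caller is running on,
// or -1 when this sketch has no call for the platform.
int currentCoreId() {
#if defined(_WIN32)
    return static_cast<int>(GetCurrentProcessorNumber());
#elif defined(__linux__)
    return sched_getcpu();  // returns -1 on error
#else
    return -1;
#endif
}
```

A caller should treat a negative currentCoreId() as "unknown" and pick a slot some other way, and should not assume the returned id is smaller than onlineCoreCount(), since core ids need not be contiguous.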
| Comments |
| Comment by Steven Vannelli [ 10/May/22 ] |
|
Moving this ticket to the Backlog and removing the "Backlog" fixVersion as per our latest policy for using fixVersions. |
| Comment by Martin Bligh [ 06/Jul/15 ] |
|
paul@10gen.com do you still have the code for this somewhere? Would like to replace global atomics by this in various places. |
| Comment by Andy Schwerin [ 18/Apr/13 ] |
|
On systems for which the current core id is unavailable, you could fall back to threadid % COUNTER_VECTOR_SIZE. You can steer COUNTER_VECTOR_SIZE up and down based on how averse you are to multiple threads simultaneously using the same counter. |
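A minimal sketch of this fallback, under stated assumptions: the value 64 for COUNTER_VECTOR_SIZE is arbitrary, the function name fallbackSlot is hypothetical, and the thread id is reduced to an integer with std::hash because std::thread::id is not itself numeric.

```cpp
#include <cstddef>
#include <functional>
#include <thread>

// Assumed tunable: larger values make slot sharing between threads less likely.
constexpr std::size_t COUNTER_VECTOR_SIZE = 64;

// Hypothetical helper: slot index derived from the calling thread's id.
// Computed once per thread and cached, so repeated calls are cheap.
std::size_t fallbackSlot() {
    static thread_local const std::size_t slot =
        std::hash<std::thread::id>{}(std::this_thread::get_id()) %
        COUNTER_VECTOR_SIZE;
    return slot;
}
```

Unlike per-core slots, two threads that map to the same slot can run at the same time, so with this fallback an atomic fetch-and-add on the slot is the safer choice if lost counts matter.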