[SERVER-9393] Very, very fast counters Created: 18/Apr/13  Updated: 10/May/22

Status: Backlog
Project: Core Server
Component/s: Performance
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Paul Pedersen Assignee: DO NOT USE - Backlog - Platform Team
Resolution: Unresolved Votes: 0
Labels: move-sa, platforms-re-triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-9297 Btree index counters are not thread-safe Closed
Related
Participants:

 Description   

Fast Counters

Scope

To provide very fast counters and timers - making possible an increase in the volume and granularity of stats accumulated within the server.

Not in scope

DTrace integration is not part of this project. Non-x86 processors will not be supported.

Design

Each counter consists of a vector of counters, one per core (or virtual core, in the case of hyper-threading). Counter increments occur per core, independently and without locking. The counter value is obtained by aggregating across the per-core counters. Experiments confirm that core migration errors occur about once in 10M counts, an acceptable error rate.
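The design above can be sketched roughly as follows. This is only an illustration, not the implementation the ticket refers to: the names `CounterSlot` and `PerCoreCounter`, the 64-byte padding, and the caller-supplied `coreId` are all assumptions made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One slot per core, padded to an assumed 64-byte cache line so that
// neighbouring cores do not share a line.
struct alignas(64) CounterSlot {
    std::atomic<std::uint64_t> value{0};
};

// Hypothetical per-core counter: increments touch one slot, reads sum all slots.
class PerCoreCounter {
public:
    explicit PerCoreCounter(std::size_t slotCount)
        : _slots(slotCount ? slotCount : 1) {}

    // coreId is whatever the platform reports as the current core.
    void increment(std::size_t coreId, std::uint64_t delta = 1) {
        CounterSlot& slot = _slots[coreId % _slots.size()];
        // Relaxed load followed by relaxed store: deliberately NOT an atomic
        // read-modify-write, so an update can be lost if two threads hit the
        // same slot at once (for example after a core migration).
        const std::uint64_t old = slot.value.load(std::memory_order_relaxed);
        slot.value.store(old + delta, std::memory_order_relaxed);
    }

    // Aggregate across all per-core slots.
    std::uint64_t read() const {
        std::uint64_t total = 0;
        for (const CounterSlot& slot : _slots)
            total += slot.value.load(std::memory_order_relaxed);
        return total;
    }

private:
    std::vector<CounterSlot> _slots;
};
```

A caller would size the counter from the core count and pass the current core id on each increment; the value returned by `read()` is approximate while increments are in flight.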

Implementation

The key to the fast implementation is the x86 rdtscp instruction. It loads the 64-bit timestamp counter into EDX:EAX, and the core / hyper-core (i.e., “node”) label into ECX. We need to implement a module that determines the node count in order to correctly allocate the fast counter / timer arrays.
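A minimal sketch of reading both values with the `__rdtscp` compiler intrinsic is shown below. The value delivered in ECX is the IA32_TSC_AUX register, whose contents are chosen by the operating system; the decoding helpers assume the Linux convention (CPU number in the low 12 bits, NUMA node above them) and would need adjusting elsewhere. `TscSample`, `readTscp`, `cpuFromAux`, and `nodeFromAux` are names invented for this example.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  // GCC/Clang; MSVC exposes __rdtscp via <intrin.h>
#endif

// Timestamp plus the raw IA32_TSC_AUX value returned in ECX.
struct TscSample {
    std::uint64_t tsc;
    std::uint32_t aux;
};

#if defined(__x86_64__) || defined(__i386__)
// Reads the timestamp counter and the OS-defined label in one instruction.
inline TscSample readTscp() {
    unsigned int aux = 0;
    const unsigned long long tsc = __rdtscp(&aux);
    return TscSample{static_cast<std::uint64_t>(tsc), aux};
}
#endif

// Assumption: Linux packs the CPU number into the low 12 bits of the label...
inline std::uint32_t cpuFromAux(std::uint32_t aux) { return aux & 0xfffu; }

// ...and the NUMA node number into the bits above them.
inline std::uint32_t nodeFromAux(std::uint32_t aux) { return aux >> 12; }
```

On processors without rdtscp the instruction faults, so a real implementation would check CPUID support first or use the OS calls listed under System Calls.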

Portability issues

Within the class of x86 processors the instruction set varies. For older Intel processors, OS support for finding the core label is needed (mainly access to model-specific registers, as opposed to the rdtscp instruction). Linux, Windows, FreeBSD, Solaris, and OS X each have different system calls for accessing MSRs. The initial implementation should use an OS call to obtain the processor label, which greatly simplifies the code.

Testing Results

Testing shows that per-core counters provide large speed improvements v. a single atomic integer with fetch-and-add (15 v. 230 nanos per incr). Using non-locking v. locking increment instructions (with per-node counters) provides about a 2X speedup (15 v. 30 nanos per incr). Using Linux sched_getcpu() v. inlined rdtscp instructions provides a small additional speedup. It seems very likely that Linux in fact uses the rdtscp instruction via vsyscall (see: http://lxr.linux.no/#linux+v3.8.7/arch/x86/include/asm/vsyscall.h in the Linux cross-reference). Other OSes may not be as smart.

System Calls

Windows

(1) Find current core id

DWORD cpuid;

cpuid = GetCurrentProcessorNumber();

Requirements

Minimum supported client: Windows Vista

Minimum supported server: Windows Server 2003

Header: WinBase.h on Windows Server 2003, Windows Vista, Windows 7, Windows Server 2008, and Windows Server 2008 R2 (include Windows.h); Processthreadsapi.h on Windows 8 and Windows Server 2012

Library: Kernel32.lib

DLL: Kernel32.dll

(2) Find core count

SYSTEM_INFO sysinfo;
GetSystemInfo( &sysinfo );

DWORD numCPU = sysinfo.dwNumberOfProcessors;

Requirements

Minimum supported client: Windows 2000 Professional

Minimum supported server: Windows 2000 Server

Header: WinBase.h (include Windows.h)

Linux

(1) Find current core id

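#define _GNU_SOURCE   /* needed for sched_getcpu(); must precede the include */
#include <sched.h>
int cpu = sched_getcpu();   /* returns -1 on error */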

Requirements

glibc 2.6

(2) Find core count

#include <unistd.h>

long numCPU = sysconf( _SC_NPROCESSORS_ONLN );

Requirements

sysconf() is a POSIX standard function, although the _SC_NPROCESSORS_ONLN name is non-standard.
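The two Linux calls can be wrapped with basic error handling, as in the sketch below. It assumes Linux with glibc 2.6 or later; the helper names `onlineCoreCount` and `currentCoreOrZero`, and the choice to fall back to 1 core and core 0 on error, are assumptions made for the example.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // sched_getcpu() is a GNU extension
#endif
#include <cassert>
#include <sched.h>
#include <unistd.h>

// Number of cores currently online; falls back to 1 if sysconf() fails.
inline long onlineCoreCount() {
    const long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}

// Core the calling thread is running on right now; falls back to 0 on error.
// The answer can be stale by the time it is used if the thread migrates.
inline int currentCoreOrZero() {
    const int cpu = sched_getcpu();
    return cpu >= 0 ? cpu : 0;
}
```

Core ids are not guaranteed to be smaller than the online count when some CPUs are offline, so a counter array sized from `onlineCoreCount()` should reduce the id modulo the array size, or be sized from `_SC_NPROCESSORS_CONF` instead.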

FreeBSD, Mac OS X

(1) Find current core id:

tbd

(2) Find core count:

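#include <stdint.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int mib[2];
uint32_t numCPU = 0;
size_t len = sizeof(numCPU);

mib[0] = CTL_HW;
mib[1] = HW_AVAILCPU;   /* defined on OS X; on FreeBSD query HW_NCPU directly */

/* Fall back to HW_NCPU if the first query fails or reports no CPUs. */
if (sysctl(mib, 2, &numCPU, &len, NULL, 0) != 0 || numCPU < 1) {
    mib[1] = HW_NCPU;
    len = sizeof(numCPU);
    if (sysctl(mib, 2, &numCPU, &len, NULL, 0) != 0 || numCPU < 1)
        numCPU = 1;
}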

Requirements

OS X versions >= 10.2.

Solaris

(1) Find current core id:

#include <sys/processor.h>
processorid_t cpu = getcpuid();

(2) Find core count:

#include <unistd.h>

long numCPU = sysconf( _SC_NPROCESSORS_ONLN );



 Comments   
Comment by Steven Vannelli [ 10/May/22 ]

Moving this ticket to the Backlog and removing the "Backlog" fixVersion as per our latest policy for using fixVersions.

Comment by Martin Bligh [ 06/Jul/15 ]

paul@10gen.com do you still have the code for this somewhere? Would like to replace global atomics by this in various places.

Comment by Andy Schwerin [ 18/Apr/13 ]

On systems for which the current core id is unavailable, you could fall back to threadid % COUNTER_VECTOR_SIZE. You can steer COUNTER_VECTOR_SIZE up and down based on how averse you are to multiple threads simultaneously using the same counter.
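A sketch of that fallback, assuming a C++11 toolchain: the thread id is hashed once per thread and reduced modulo the vector size. `kCounterVectorSize` (set to 64 here purely for illustration) and `slotForCurrentThread` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>

// Hypothetical tuning knob: larger values make it less likely that two
// threads share a slot, at the cost of more memory and slower reads.
constexpr std::size_t kCounterVectorSize = 64;

// Picks a slot from the thread id instead of the core id. The slot is
// computed once per thread and cached in a thread-local variable.
inline std::size_t slotForCurrentThread() {
    static thread_local const std::size_t slot =
        std::hash<std::thread::id>{}(std::this_thread::get_id()) %
        kCounterVectorSize;
    return slot;
}
```

Because two threads can hash to the same slot and run at the same time, increments on this path would need to be atomic (or tolerate a higher loss rate) compared with the per-core path.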
