Workload:
- Collection size: ~40GB.
- approxDocumentSize: 2000 # 2kB.
- documentCount: 20000000 # for an approximate total of ~40GB.
- Indexes: 10 secondary indexes.
- Running dbcheck with the default limits:
  - maxDocsPerBatch: 5000
  - maxBytesPerBatch: 20kB
  - maxBatchTimeMillis: 1000ms
- dbcheck mode workloads:
  - DbCheck_dataConsistency
  - DbCheck_dataConsistencyAndMissingIndexKeysCheck
  - Secondary index (extra index keys check):
    - DbCheck_extraIndexKeysCheck_3Int: compound key on 3 integer fields.
    - DbCheck_extraIndexKeysCheck_1Int: key on 1 integer field.
    - DbCheck_extraIndexKeysCheck_stringShort: key on a string field whose length is 5-15 characters.
    - DbCheck_extraIndexKeysCheck_stringLong: key on a string field whose length is 1-100 characters.
    - DbCheck_extraIndexKeysCheck_SameKey: key on 1 integer field where every document has the same value.
So with these limits our runs should end up with around 10 documents per batch (20kB / 2kB).
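For reference, a minimal sketch of how one of these runs could be kicked off from mongosh is below. The batching parameters are the ones listed above; the validateMode / secondaryIndex options and the index name are assumptions based on the mode names in this workload and may differ by server branch.

// Sketch only: assumes dbCheck accepts these batching limits and, for the
// secondary-index workloads, a validateMode/secondaryIndex pair.
db.getSiblingDB("testDB").runCommand({
    dbCheck: "Collection0",
    maxDocsPerBatch: 5000,
    maxBytesPerBatch: 20 * 1024,      // 20kB
    maxBatchTimeMillis: 1000,
    // e.g. for DbCheck_extraIndexKeysCheck_1Int (hypothetical index name):
    // validateMode: "extraIndexKeysCheck",
    // secondaryIndex: "int1_1"
});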
Results:
1- dbcheck latency
The full dbcheck run takes ~15 min, and a single batch can take up to ~900 ms. A way to cross-check per-batch timings from the health log is sketched after the slow-query excerpt below.
{
    t: { $date: "2023-09-11T04:21:43.809+00:00" },
    s: "I",
    c: "COMMAND",
    id: 51803,
    ctx: "conn763",
    msg: "Slow query",
    attr: {
        type: "command",
        ns: "testDB.$cmd",
        appName: "Genny",
        command: {
            update: "Collection0",
            ordered: true,
            $db: "testDB",
            lsid: { id: { $uuid: "57dbbea3-af90-4e9e-a840-4fd85e9a1720" } },
            txnNumber: 1341649,
            $clusterTime: {
                clusterTime: { $timestamp: { t: 1694406102, i: 794 } },
                signature: {
                    hash: { $binary: { base64: "AAAAAAAAAAAAAAAAAAAAAAAAAAA=", subType: "0" } },
                    keyId: 0
                }
            }
        },
        numYields: 0,
        reslen: 245,
        locks: {
            FeatureCompatibilityVersion: { acquireCount: { r: 1, w: 2 } },
            ReplicationStateTransition: { acquireCount: { w: 3 } },
            Global: { acquireCount: { r: 1, w: 2 } },
            Database: { acquireCount: { w: 2 } },
            Collection: { acquireCount: { w: 2 } }
        },
        flowControl: {
            acquireCount: 1,
            acquireWaitCount: 1,
            timeAcquiringMicros: 915000
        },
        readConcern: { level: "local", provenance: "implicitDefault" },
        writeConcern: { w: 1, wtimeout: 0, provenance: "customDefault" },
        storage: {},
        cpuNanos: 10420661,
        remote: "10.2.0.10:43586",
        protocol: "op_msg",
        durationMillis: 925
    }
}
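To cross-check per-batch timings, the dbcheck health log can be queried on each node. This is a sketch assuming batches are recorded in local.system.healthlog as "dbCheckBatch" entries; the collection and field names should be confirmed on this branch.

// Sketch: list the most recent dbcheck batch entries for this collection.
db.getSiblingDB("local").getCollection("system.healthlog")
    .find({ operation: "dbCheckBatch", namespace: "testDB.Collection0" })
    .sort({ timestamp: -1 })
    .limit(5)
    .forEach(entry => printjson({
        severity: entry.severity,
        msg: entry.msg,
        data: entry.data            // per-batch counts/bounds, if recorded
    }));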
2- CRUD operations:
+----------------------------------+----------------------------------+
|           Local writes           |          Majority writes         |
+----------------+-----------------+----------------+-----------------+
| AverageLatency | 99th percentile | AverageLatency | 90th percentile |
+----------------+-----------------+----------------+-----------------+
|       2x       |     ~ Same      |     1000x      |     ~ Same      |
+----------------+-----------------+----------------+-----------------+
1- Local writes:
I can see from the logs that most of the slow queries come from waiting on the flow control lock; for example, the log below from a patch run shows timeAcquiringMicros: 913000 (expected). A quick way to watch flow control while dbcheck runs is sketched after the excerpt.
{
    t: { $date: "2023-09-11T04:19:46.799+00:00" },
    s: "I",
    c: "COMMAND",
    id: 51803,
    ctx: "conn763",
    msg: "Slow query",
    attr: {
        type: "command",
        ns: "testDB.$cmd",
        appName: "Genny",
        command: {
            update: "Collection0",
            ordered: true,
            $db: "testDB",
            lsid: { id: { $uuid: "57dbbea3-af90-4e9e-a840-4fd85e9a1720" } },
            txnNumber: 1248453,
            $clusterTime: {
                clusterTime: { $timestamp: { t: 1694405985, i: 570 } },
                signature: {
                    hash: { $binary: { base64: "AAAAAAAAAAAAAAAAAAAAAAAAAAA=", subType: "0" } },
                    keyId: 0
                }
            }
        },
        numYields: 0,
        reslen: 245,
        locks: {
            FeatureCompatibilityVersion: { acquireCount: { r: 1, w: 2 } },
            ReplicationStateTransition: { acquireCount: { w: 3 } },
            Global: { acquireCount: { r: 1, w: 2 } },
            Database: { acquireCount: { w: 2 } },
            Collection: { acquireCount: { w: 2 } }
        },
        flowControl: {
            acquireCount: 1,
            acquireWaitCount: 1,
            timeAcquiringMicros: 913000
        },
        readConcern: { level: "local", provenance: "implicitDefault" },
        writeConcern: { w: 1, wtimeout: 0, provenance: "customDefault" },
        storage: {},
        cpuNanos: 837301,
        remote: "10.2.0.10:43586",
        protocol: "op_msg",
        durationMillis: 913
    }
}
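The flow control throttling behind these timeAcquiringMicros waits can be watched directly on the primary while dbcheck is running, using the flowControl section of serverStatus. The exact field selection below is illustrative:

// Check whether flow control is engaged and how much time writers are
// spending acquiring flow control tickets on the primary.
const fc = db.adminCommand({ serverStatus: 1 }).flowControl;
printjson({
    enabled: fc.enabled,
    isLagged: fc.isLagged,
    targetRateLimit: fc.targetRateLimit,
    timeAcquiringMicros: fc.timeAcquiringMicros
});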
Different modes of dbcheck (link)

2- Majority writes:
The big hit comes when we perform majority writes, since applying the dbcheck oplog entries makes the secondary lag increase exponentially.

As you can see in the slow query log below (durationMillis: 224975 and waitForWriteConcernDurationMillis: 224974, i.e. ~3.7 min), almost the entire latency is spent waiting for the write concern. A replication-lag check that can be run during these stalls is sketched after the excerpt.
{
    t: { $date: "2023-09-09T02:18:01.718+00:00" },
    s: "I",
    c: "COMMAND",
    id: 51803,
    ctx: "conn751",
    msg: "Slow query",
    attr: {
        type: "command",
        ns: "testDB.$cmd",
        appName: "Genny",
        command: {
            update: "Collection0",
            ordered: true,
            $db: "testDB",
            lsid: { id: { $uuid: "3081572c-9354-4706-b01f-bb17b90af607" } },
            txnNumber: 487,
            $clusterTime: {
                clusterTime: { $timestamp: { t: 1694225656, i: 8 } },
                signature: {
                    hash: { $binary: { base64: "AAAAAAAAAAAAAAAAAAAAAAAAAAA=", subType: "0" } },
                    keyId: 0
                }
            }
        },
        numYields: 0,
        reslen: 245,
        locks: {
            FeatureCompatibilityVersion: { acquireCount: { w: 2 } },
            ReplicationStateTransition: { acquireCount: { w: 3 } },
            Global: { acquireCount: { w: 2 } },
            Database: { acquireCount: { w: 2 } },
            Collection: { acquireCount: { w: 2 } }
        },
        flowControl: { acquireCount: 1 },
        readConcern: { level: "local", provenance: "implicitDefault" },
        writeConcern: { w: "majority", wtimeout: 0, provenance: "implicitDefault" },
        waitForWriteConcernDurationMillis: 224974,
        storage: {},
        cpuNanos: 472994,
        remote: "10.2.0.10:50908",
        protocol: "op_msg",
        durationMillis: 224975
    }
}
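Since the w: "majority" stall comes from the secondaries falling behind on dbcheck oplog application, the lag can be tracked during the run with rs.status(); a minimal sketch:

// Print how far each secondary's applied optime is behind the primary's.
const status = rs.status();
const primary = status.members.find(m => m.stateStr === "PRIMARY");
status.members
    .filter(m => m.stateStr === "SECONDARY")
    .forEach(m => print(m.name + " lag: " +
        (primary.optimeDate - m.optimeDate) / 1000 + "s"));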
AverageLatency: (chart)

90th percentile: (chart)

Throughput (Ops/Sec): (chart)