[SERVER-78927] Add dbCheck canary performance workload Created: 13/Jul/23  Updated: 29/Oct/23  Resolved: 23/Aug/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0

Type: Task Priority: Major - P3
Reporter: Xuerui Fa Assignee: Moustafa Maher
Resolution: Fixed Votes: 0
Labels: pm-855-milestone-3
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File image-2023-09-11-16-44-01-367.png     PNG File image-2023-09-11-16-55-00-591.png     PNG File image-2023-09-11-16-55-15-351.png     PNG File image-2023-09-11-17-03-52-514.png     PNG File image-2023-09-11-17-04-34-574.png     PNG File image-2023-09-11-17-06-05-598.png    
Issue Links:
Depends
is depended on by SERVER-80338 Add index consistency checks to perf ... Closed
Related
Assigned Teams:
Replication
Backwards Compatibility: Fully Compatible
Sprint: Repl 2023-08-21, Repl 2023-09-04
Participants:

 Description   

We should run a performance workloads with dbCheck running in the background. This workload can consist of a straightforward CRUD workload on a large cluster with various indexes.



 Comments   
Comment by Moustafa Maher [ 12/Sep/23 ]

Workload

  • Collection size: ~ 40GB.
    • approxDocumentSize: 2000   # 2kB. 
    • documentCount: 20000000.  # for an approximate total of ~40GB.
  • Indexes10 secondary indexes.
  • Running dbcheck with default limits.
    • maxDocsPerBatch: 5000
    • maxBytesPerBatch: 20kB
    • maxBatchTimeMillis: 1000ms
  • Dbcheck modes workloads:
    • DbCheck_dataConsistency
    • DbCheck_dataConsistencyAndMissingIndexKeysCheck
    • Secondary Index:
      • DbCheck_extraIndexKeysCheck_3Int: Key is compound of 3 integer.
      • DbCheck_extraIndexKeysCheck_1Int: Key is on 1 integer field.
      • DbCheck_extraIndexKeysCheck_stringShort: key is on a string field and string length is between (5-15) letters.
      • DbCheck_extraIndexKeysCheck_stringLong: key is on a string field and string length is between (1-100) letters.
      • DbCheck_extraIndexKeysCheck_SameKey: Key is on 1 integer field but all the values are the same.

So I think in our runs we will have around 10 doc per batch (20KB/2KB).

Results:

1- DBcheck Latency

latency: takes around ~15 min.
One batch can take up to ~900 ms.

{
    t: {
        $date: "2023-09-11T04:21:43.809+00:00"
    },
    s: "I",
    c: "COMMAND",
    id: 51803,
    ctx: "conn763",
    msg: "Slow query",
    attr: {
        type: "command",
        ns: "testDB.$cmd",
        appName: "Genny",
        command: {
            update: "Collection0",
            ordered: true,
            $db: "testDB",
            lsid: {
                id: {
                    $uuid: "57dbbea3-af90-4e9e-a840-4fd85e9a1720"
                }
            },
            txnNumber: 1341649,
            $clusterTime: {
                clusterTime: {
                    $timestamp: {
                        t: 1694406102,
                        i: 794
                    }
                },
                signature: {
                    hash: {
                        $binary: {
                            base64: "AAAAAAAAAAAAAAAAAAAAAAAAAAA=",
                            subType: "0"
                        }
                    },
                    keyId: 0
                }
            }
        },
        numYields: 0,
        reslen: 245,
        locks: {
            FeatureCompatibilityVersion: {
                acquireCount: {
                    r: 1,
                    w: 2
                }
            },
            ReplicationStateTransition: {
                acquireCount: {
                    w: 3
                }
            },
            Global: {
                acquireCount: {
                    r: 1,
                    w: 2
                }
            },
            Database: {
                acquireCount: {
                    w: 2
                }
            },
            Collection: {
                acquireCount: {
                    w: 2
                }
            }
        },
        flowControl: {
            acquireCount: 1,
            acquireWaitCount: 1,
            timeAcquiringMicros: 915000
        },
        readConcern: {
            level: "local",
            provenance: "implicitDefault"
        },
        writeConcern: {
            w: 1,
            wtimeout: 0,
            provenance: "customDefault"
        },
        storage: {
 
        },
        cpuNanos: 10420661,
        remote: "10.2.0.10:43586",
        protocol: "op_msg",
        durationMillis: 925
    }
}

2- CRUD operations:

+----------------------------------+----------------------------------+
|           Local writes           |          Majority writes         |
|                                  |                                  |
+----------------+-----------------+----------------+-----------------+
| AverageLatency | 99th percentile | AverageLatency | 90th percentile |
+----------------+-----------------+----------------+-----------------+
| 2x             | ~ Same          | 1000x          | ~ Same          |
+----------------+-----------------+----------------+-----------------+

 

1- Local writes:

I can see from the logs that the most slow queries are coming from waiting for the lock, ex logs from a patch: (timeAcquiringMicros: 913000) (Expected).

{
    t: {
        $date: "2023-09-11T04:19:46.799+00:00"
    },
    s: "I",
    c: "COMMAND",
    id: 51803,
    ctx: "conn763",
    msg: "Slow query",
    attr: {
        type: "command",
        ns: "testDB.$cmd",
        appName: "Genny",
        command: {
            update: "Collection0",
            ordered: true,
            $db: "testDB",
            lsid: {
                id: {
                    $uuid: "57dbbea3-af90-4e9e-a840-4fd85e9a1720"
                }
            },
            txnNumber: 1248453,
            $clusterTime: {
                clusterTime: {
                    $timestamp: {
                        t: 1694405985,
                        i: 570
                    }
                },
                signature: {
                    hash: {
                        $binary: {
                            base64: "AAAAAAAAAAAAAAAAAAAAAAAAAAA=",
                            subType: "0"
                        }
                    },
                    keyId: 0
                }
            }
        },
        numYields: 0,
        reslen: 245,
        locks: {
            FeatureCompatibilityVersion: {
                acquireCount: {
                    r: 1,
                    w: 2
                }
            },
            ReplicationStateTransition: {
                acquireCount: {
                    w: 3
                }
            },
            Global: {
                acquireCount: {
                    r: 1,
                    w: 2
                }
            },
            Database: {
                acquireCount: {
                    w: 2
                }
            },
            Collection: {
                acquireCount: {
                    w: 2
                }
            }
        },
        flowControl: {
            acquireCount: 1,
            acquireWaitCount: 1,
            timeAcquiringMicros: 913000
        },
        readConcern: {
            level: "local",
            provenance: "implicitDefault"
        },
        writeConcern: {
            w: 1,
            wtimeout: 0,
            provenance: "customDefault"
        },
        storage: {
 
        },
        cpuNanos: 837301,
        remote: "10.2.0.10:43586",
        protocol: "op_msg",
        durationMillis: 913
    }
}

Different modes of dbcehck (link)

2- Majority writes:

The big hit is when we performs a majority writes as dbcheck oplog application makes the secondary lag increases exponentially.

As you can see the slow query log (durationMillis: 224,975 and waitForWriteConcernDurationMillis: 224974) ~ 3.7 min.

{
    t: {
        $date: "2023-09-09T02:18:01.718+00:00"
    },
    s: "I",
    c: "COMMAND",
    id: 51803,
    ctx: "conn751",
    msg: "Slow query",
    attr: {
        type: "command",
        ns: "testDB.$cmd",
        appName: "Genny",
        command: {
            update: "Collection0",
            ordered: true,
            $db: "testDB",
            lsid: {
                id: {
                    $uuid: "3081572c-9354-4706-b01f-bb17b90af607"
                }
            },
            txnNumber: 487,
            $clusterTime: {
                clusterTime: {
                    $timestamp: {
                        t: 1694225656,
                        i: 8
                    }
                },
                signature: {
                    hash: {
                        $binary: {
                            base64: "AAAAAAAAAAAAAAAAAAAAAAAAAAA=",
                            subType: "0"
                        }
                    },
                    keyId: 0
                }
            }
        },
        numYields: 0,
        reslen: 245,
        locks: {
            FeatureCompatibilityVersion: {
                acquireCount: {
                    w: 2
                }
            },
            ReplicationStateTransition: {
                acquireCount: {
                    w: 3
                }
            },
            Global: {
                acquireCount: {
                    w: 2
                }
            },
            Database: {
                acquireCount: {
                    w: 2
                }
            },
            Collection: {
                acquireCount: {
                    w: 2
                }
            }
        },
        flowControl: {
            acquireCount: 1
        },
        readConcern: {
            level: "local",
            provenance: "implicitDefault"
        },
        writeConcern: {
            w: "majority",
            wtimeout: 0,
            provenance: "implicitDefault"
        },
        waitForWriteConcernDurationMillis: 224974,
        storage: {
 
        },
        cpuNanos: 472994,
        remote: "10.2.0.10:50908",
        protocol: "op_msg",
        durationMillis: 224975
    }
}

Graph example

AverageLatency:

90th percentile:

Throughput (Ops/Sec)

 

 

Comment by Githook User [ 23/Aug/23 ]

Author:

{'name': 'Moustafa Maher Khalil', 'email': 'm.maher@mongodb.com', 'username': 'moustafamaher'}

Message: SERVER-78927 Add dbCheck canary performance workload
Branch: master
https://github.com/mongodb/mongo/commit/771494b07c54b4c98234a444fcaa95d5e2449fab

Generated at Thu Feb 08 06:39:39 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.