ISSUE DESCRIPTION AND IMPACT
This issue in MongoDB 4.4.10 to 4.4.13 and 5.0.4 to 5.0.7 may cause replication to stall on secondary replica set members in a sharded cluster handling cross-shard transactions.
The bug is triggered when WiredTiger erroneously returns a write conflict when deciding if an update to a record is allowed. If MongoDB decides to retry the operation that caused the conflict in WiredTiger, it will enter an indefinite retry loop, and oplog application will stall on secondary nodes.
A MongoDB cluster may be affected by this bug if:
- the cluster is sharded
- the application uses cross-shard transactions
- the cluster is using versions 4.4.10 to 4.4.13 or 5.0.4 to 5.0.7 on secondary nodes
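The version condition above can be checked mechanically. The following is a minimal sketch, not an official tool: the helper names `parse_version` and `is_affected` are hypothetical, and the ranges are copied from this ticket. It only covers the version criterion; the sharding and cross-shard transaction criteria still have to be confirmed separately.

```python
import re

# Affected ranges as stated in this ticket (both ends inclusive).
AFFECTED_RANGES = [
    ((4, 4, 10), (4, 4, 13)),
    ((5, 0, 4), (5, 0, 7)),
]


def parse_version(version):
    """Parse the leading 'major.minor.patch' of a version string into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(version):
    """Return True if the version falls inside one of the affected ranges."""
    parsed = parse_version(version)
    return any(low <= parsed <= high for low, high in AFFECTED_RANGES)
```

The version string of each secondary would need to be collected first, for example from the server's build information.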
If the bug is triggered, the cluster's secondary nodes will experience indefinite growth in replication lag.
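One way to spot this symptom is to compare each secondary's last applied optime with the primary's. The sketch below assumes a status document shaped like the output of the `replSetGetStatus` command (a `members` array whose entries carry `name`, `stateStr`, and `optimeDate`); the function name `secondary_lag_seconds` and the sample host names are hypothetical. In a sharded cluster it would be run against each shard's replica set.

```python
from datetime import datetime


def secondary_lag_seconds(status):
    """Return {member name: lag in seconds} for every SECONDARY, relative to the PRIMARY."""
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        raise ValueError("no PRIMARY member found in status document")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Illustrative status document; host names and timestamps are made up.
sample_status = {
    "members": [
        {"name": "shard0-a:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2022, 4, 1, 12, 0, 0)},
        {"name": "shard0-b:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2022, 4, 1, 11, 58, 30)},
    ]
}
lag = secondary_lag_seconds(sample_status)  # {"shard0-b:27017": 90.0}
```

A single sample is not conclusive. Lag that keeps increasing across successive samples while the primary continues to accept writes is what would be consistent with the stall described here.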
REMEDIATION AND WORKAROUNDS
Secondary nodes on which replication has stalled can be restarted to resume replication.
This issue is fixed in MongoDB 4.4.14 and 5.0.8.
While implementing FLCS-related changes in WT-8019, a change was made to stop checking whether the insert list on the cbt was null before checking against the on-disk time window. This change may be correct for FLCS but isn't correct for row-store.
This is only a problem if the cbt->slot isn't unset or UINT32_MAX. An alternative solution might be to clear the cbt slot on an insert list row search; however, that is still open for discussion.
- is caused by: WT-8019 VLCS snapshot-isolation search mismatch
- is duplicated by: WT-8440 Investigate out of order timestamp assertion fires
- is related to: SERVER-73972 mongodb 4.4 secondary replication hang