[SERVER-73972] mongodb 4.4 secondary replication hang Created: 14/Feb/23  Updated: 27/Feb/23  Resolved: 27/Feb/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.12
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: jing xu Assignee: Yuan Fang
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File SERVER-73972.png     PNG File single active writer.png    
Issue Links:
Related
related to WT-8924 Don't check against on disk time wind... Closed
Assigned Teams:
Storage Engines
Operating System: ALL
Participants:

 Description   

1、i check replication info that i find the one secondary lag more than 9000s
shard4:SECONDARY> rs.printSlaveReplicationInfo()
source: mongodb68:31004
syncedTo: Tue Feb 14 2023 12:36:01 GMT+0800 (CST)
0 secs (0 hrs) behind the primary
source: mongodb69:31004
syncedTo: Tue Feb 14 2023 10:03:40 GMT+0800 (CST)
9141 secs (2.54 hrs) behind the primary
2、i check lock. serverStatus:1 is locked by "ParallelBatchWriterMode" : "r"

{
"type" : "op",
"desc" : "conn323699",
"connectionId" : 323699,
"clientMetadata" : {
"driver" :

{ "name" : "mongo-go-driver", "version" : "v1.0.1" }

,
"os" :

{ "type" : "linux", "architecture" : "amd64" }

,
"platform" : "go1.13.10"
},
"active" : true,
"currentOpTime" : "2023-02-14T10:43:50.888+08:00",

"opid" : -1523928877,
"secs_running" : NumberLong(2831),
"microsecs_running" : NumberLong("2831816411"),
"op" : "command",
"ns" : "admin.$cmd",
"command" : {
"serverStatus" : 1,
"advisoryHostFQDNs" : 1,
"locks" : 0,
"recordStats" : 0,
"oplog" : 0,
"$clusterTime" : {
"clusterTime" : Timestamp(1676339797, 2140),
"signature" :

{ "hash" : BinData(0,"/2uRF//p/2dkSYG0iauiyFybodI="), "keyId" : NumberLong("7143870411116118530") }

},
"$db" : "admin",
"$readPreference" :

{ "mode" : "primaryPreferred" }

},
"numYields" : 0,
"waitingForLatch" :

{ "timestamp" : ISODate("2023-02-14T02:43:50.876Z"), "captureName" : "CondVarLockGrantNotification::_mutex" }

,
"locks" :

{ "ParallelBatchWriterMode" : "r" }

,
"waitingForLock" : true,
"lockStats" : {
"ParallelBatchWriterMode" : {
"acquireCount" :

{ "r" : NumberLong(2) }

,
"acquireWaitCount" :

{ "r" : NumberLong(2) }

,
"timeAcquiringMicros" :

{ "r" : NumberLong("2831703688") }

}
},
"waitingForFlowControl" : false,
"flowControlStats" : {

}
},
{
"type" : "op",
"desc" : "OplogApplier-0",
"active" : true,
"currentOpTime" : "2023-02-14T11:20:14.842+08:00",
"effectiveUsers" : [

{ "user" : "__system", "db" : "local" }

],
"opid" : -1524195878,
"op" : "none",
"ns" : "",
"command" : {

},
"numYields" : 0,
"locks" :

{ "ParallelBatchWriterMode" : "W" }

,
"waitingForLock" : false,
"lockStats" : {
"ParallelBatchWriterMode" : {
"acquireCount" :

{ "W" : NumberLong(1) }

},
"ReplicationStateTransition" : {
"acquireCount" :

{ "w" : NumberLong(4) }

},
"Global" : {
"acquireCount" :

{ "r" : NumberLong(1), "w" : NumberLong(3) }

},
"Database" : {
"acquireCount" :

{ "r" : NumberLong(1), "w" : NumberLong(3) }

},
"Collection" : {
"acquireCount" :

{ "r" : NumberLong(1), "w" : NumberLong(3) }

},
"Mutex" : {
"acquireCount" :

{ "r" : NumberLong(4) }

}
},
"waitingForFlowControl" : false,
"flowControlStats" : {

}
}



 Comments   
Comment by Yuan Fang [ 27/Feb/23 ]

Hi 601290552@qq.com,

After reviewing the FTDC you provided, we noticed an increase in update conflicts following the point in time when the secondary began to lag. Based on the comment from haribabu.kommi@mongodb.com, we recommend that you upgrade to the latest version of MongoDB 4.4 (currently 4.4.18). We will be closing this ticket for now, but if the issue persists even after upgrading to the latest version, please let us know by following up here, and we will reopen the ticket.

Regards,
Yuan

Comment by Haribabu Kommi [ 27/Feb/23 ]

Thanks to yuan.fang@mongodb.com for the details and 601290552@qq.com for the bug report.

Usually, the update conflicts are possible when there are any active transactions running in the system. Still, from the FTDC data, I couldn't find any active transactions in the system.

While going through the couple of bugs that are fixed in WiredTiger, I found a bug WT-8924 that exhibits the same symptoms reported in this bug of looping with update conflicts and replication to stall is fixed in MongoDB 4.4.14.

The Server version that reproduced this bug is running MongoDB 4.4.12 version. Updating the MongoDB server to the latest supported version could fix this problem.

Comment by jing xu [ 23/Feb/23 ]

hi:
it's ok.thank you,i will upload it .

Comment by Yuan Fang [ 22/Feb/23 ]

Hi 601290552@qq.com,

Thank you for providing the FTDC. While we are investigating this, could you also please provide the mongod.log that covers the time when the incident occurred, as requested in our previous comment?  upload portal is still available for you to use.

Regards,
Yuan

Comment by jing xu [ 21/Feb/23 ]

hi:
i upload it for you.thank you.

Comment by Yuan Fang [ 14/Feb/23 ]

Hi 601290552@qq.com,

Thank you for your report. Could you provide additional diagnostic data to help us further investigate this problem? I created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.

For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)

Regards,
Yuan

Generated at Thu Feb 08 06:26:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.