[SERVER-16396] Replication stall, then one secondary would not shut down (mmapv1) Created: 02/Dec/14  Updated: 21/Jan/15  Resolved: 21/Jan/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.8.0-rc1
Fix Version/s: None

Type: Task
Priority: Major - P3
Reporter: Cailin Nelson
Assignee: Andy Schwerin
Resolution: Duplicate
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: mms-on-prem-3.backtrace (File), onprem-2.png (PNG File), onprem.png (PNG File)
Issue Links:
Duplicate
duplicates SERVER-16834 Secondary nodes can hang during shutd... Closed

 Description   

Please see the attached graphs showing the behavior of our 2.8.0-rc1 (mmapv1) replica set.

We experienced the following series of events:

  • Replication lag climbed rapidly on both secondaries. Observed IOPS on the secondaries were very high.
  • The getmore counter on the primary dropped to zero.
  • We restarted one secondary (onprem-2). On restart, its replication lag immediately fell back to zero.
  • The getmore counter on the primary started to look more normal.
  • We attempted to shut down the other secondary (onprem-3). It would not shut down. A gdb dump is attached.
  • After we hard-killed the other secondary and restarted it, its replication lag also fell to zero.

Will link to logs for all nodes.



 Comments   
Comment by Andy Schwerin [ 21/Jan/15 ]

Duplicate of SERVER-16834.

Comment by Andy Schwerin [ 03/Dec/14 ]

The attached stack trace indicates that the secondary that got stuck shutting down froze while waiting for the bgsync thread to terminate. That thread is waiting on a condition variable, presumably this one, but I'm not 100% certain because the stack trace appears to have been taken on a host that didn't have access to the debugging symbols.

Notice that thread 11 is waiting in bgsync, and thread 2 is waiting for replication to shut down a thread (presumably bgsync). Thread 1 is just sleeping forever because it was not the first thread to call exitCleanly.
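
The general shape of the hang described above (one thread blocked on a condition variable while another waits for it to terminate) can be sketched as follows. This is only an illustrative sketch in Python, not the server's code: the class BackgroundSync, the wake_on_shutdown switch, and the join timeout are all hypothetical names introduced here, and the actual cause of the stall is tracked in SERVER-16834.

```python
import threading


class BackgroundSync:
    """Hypothetical stand-in for a background sync worker (not MongoDB code)."""

    def __init__(self, wake_on_shutdown):
        self._cond = threading.Condition()
        self._shutdown = False
        self._wake_on_shutdown = wake_on_shutdown
        self._idle = threading.Event()
        # Daemon thread so a stuck worker cannot keep the interpreter alive.
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()
        # Return only once the worker is about to block on the condition.
        self._idle.wait()

    def _run(self):
        with self._cond:
            while not self._shutdown:
                self._idle.set()
                # Blocks with no timeout until another thread notifies.
                self._cond.wait()

    def shutdown(self, join_timeout):
        """Request shutdown; return True if the worker thread terminated."""
        with self._cond:
            self._shutdown = True
            if self._wake_on_shutdown:
                # Without this wake-up the worker never re-checks the flag.
                self._cond.notify_all()
        self._thread.join(join_timeout)
        return not self._thread.is_alive()


def demo():
    """Return (clean_exit, hung) for a notified and an un-notified worker."""
    good = BackgroundSync(wake_on_shutdown=True)
    good.start()
    clean_exit = good.shutdown(join_timeout=1.0)

    bad = BackgroundSync(wake_on_shutdown=False)
    bad.start()
    hung = not bad.shutdown(join_timeout=0.2)
    return clean_exit, hung
```

In this sketch, the worker that is notified on shutdown exits promptly, while the one that is never notified stays blocked in wait(), so the thread joining it would wait forever if it did not use a timeout. That mirrors the relationship between thread 11 and thread 2 in the trace, under the assumption that the wait is on a condition variable nobody signals during shutdown.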
