[SERVER-24773] Secondary block on initialSync when the next oplog needed was removed on primary Created: 24/Jun/16  Updated: 15/Nov/21  Resolved: 05/Jul/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.0.10, 3.2.3
Fix Version/s: 3.2.8, 3.3.10

Type: Bug Priority: Major - P3
Reporter: Zhang Youdong Assignee: Judah Schvimer
Resolution: Done Votes: 0
Labels: code-only
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Steps To Reproduce:

1. prepare replicaset with big dataset or small oplog.
2. send resync command to Secondary.

Sprint: Repl 16 (06/24/16), Repl 17 (07/15/16)
Participants:

 Description   

During initialSync, secondary need to apply oplog serveral times which will call _applyOplogUntil, if the next oplog was removed on primary, the secondary will block on this call, resync command will have no effect in this sutiation, and the process cannot be killed normally.

Thread 54 (Thread 0x2b559c19f700 (LWP 60398)):
#0  0x0000003349a0b7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000e44f58 in mongo::repl::BackgroundSync::waitForMore() ()
#2  0x0000000000f30c5a in mongo::repl::SyncTail::tryPopAndWaitForMore(mongo::OperationContext*, mongo::repl::SyncTail::OpQueue*) ()
#3  0x0000000000e70c25 in mongo::repl::InitialSync::_applyOplogUntil(mongo::OperationContext*, mongo::repl::OpTime const&) ()
#4  0x0000000000e71353 in mongo::repl::InitialSync::oplogApplication(mongo::OperationContext*, mongo::repl::OpTime const&) ()
#5  0x0000000000f1cc0f in ?? ()
#6  0x0000000000f1e530 in ?? ()
#7  0x0000000000f1f23f in mongo::repl::syncDoInitialSync() ()
#8  0x0000000000f29770 in mongo::repl::runSyncThread() ()
#9  0x0000000001b0f500 in execute_native_thread_routine ()
#10 0x0000003349a07851 in start_thread () from /lib64/libpthread.so.0
#11 0x00000033496e767d in clone () from /lib64/libc.so.6



 Comments   
Comment by Ramon Fernandez Marina [ 08/Jul/16 ]

All,

MongoDB release candidate 3.2.8-rc1 is now available for download, and includes a fix for this issue. The MongoDB 3.2.8 release is scheduled for next week.

Regards,
Ramón.

Comment by Zhang Youdong [ 06/Jul/16 ]

Judah Schvimer I got your point,great thanks.

Comment by Judah Schvimer [ 05/Jul/16 ]

The resync command will be fixed in 3.4 with other initial sync work we are doing. For now you should be able to resync simply by shutting down and restarting your server. We will try to fix this in 3.2 in a future commit.

Comment by Zhang Youdong [ 05/Jul/16 ]

Judah Schvimer I think _initialSyncRequestedFlag also should be checked to make the 『resync』 command work in this situation.

Comment by Githook User [ 05/Jul/16 ]

Author:

{u'username': u'judahschvimer', u'name': u'Judah Schvimer', u'email': u'judah@mongodb.com'}

Message: SERVER-24773 _applyOplogUntil checks for shutdown while trying to pop from network queue

(cherry picked from commit 08ee223880898fc2153cf4eedc124fc2e4dfc133)
Branch: v3.2
https://github.com/mongodb/mongo/commit/44c63e6802bdbd2ceaebe1b190cd8575d073961a

Comment by Githook User [ 05/Jul/16 ]

Author:

{u'username': u'judahschvimer', u'name': u'Judah Schvimer', u'email': u'judah@mongodb.com'}

Message: SERVER-24773 _applyOplogUntil checks for shutdown while trying to pop from network queue
Branch: master
https://github.com/mongodb/mongo/commit/08ee223880898fc2153cf4eedc124fc2e4dfc133

Comment by Zhang Youdong [ 02/Jul/16 ]

Judah Schvimer , that's great, resync and shutdown is really important for users in this case, they cannot do anything else.

Comment by Judah Schvimer [ 01/Jul/16 ]

Hi,

Thank you very much for the detailed clarification and for finding this bug. What you're saying is correct and I've been able to reproduce it on 3.2. I'm also glad to hear that you enjoyed MongoDB World! We will fix _applyOplogUntil to be shutdown aware and terminate on shutdown.

Thanks,
Judah

Comment by Zhang Youdong [ 01/Jul/16 ]

hi,Judah Schvimer, sorry for reply so late, I am atending the MongoDB World 2016 in New York these days, it's really a great conference.

This problem is hard to reproduce by constructing data and oplog config, but it think it's abviously by reading the source code, and I will keep working on how to reproduce it . Here is the first three event sequences

1. initial sync begin at t1
2. _initialSyncClone finish at t2
3. _initialSyncApplyOplog from t1 to t2

the oplog from t1 to t2 is fetched by produceThread, if the some oplog in this range is removed between t1 and t2, the _initialSyncApplyOplog will never return.

Next I will explain why resync and shutdown will not work in this case.

In SyncTail::oplogApplication(), if the resync command is received, it will go to the very begining to do initial sync.

720     while (!inShutdown()) {
 721         OpQueue ops;
 722
 723         do {
 724             if (BackgroundSync::get()->getInitialSyncRequestedFlag()) {
 725                 // got a resync command
 726                 return;
 727             }

but in InitialSync::oplogApplication

When it's waiting the next oplog, a resync will not restart the initial sync.

Comment by Judah Schvimer [ 27/Jun/16 ]

Hi Zhang,

I have tried to reproduce your failure, but have been unable to get to a state where I cannot kill the process normally. I started up a node with a 1MB oplog. While inside of initial sync, I did 40 inserts of 100KB documents so that the first ones were forced off the end of the oplog. The new node then said it was too stale and went into RECOVERY mode, at which point the _applyOplogUntil thread continued to run as you specified. I connected to the new node from a second shell and ran

db.runCommand({shutdown:1})

and the node shutdown properly. Is this reproducing the same situation? Do you have additional reproduction steps that I can follow to recreate what you saw?

Thanks,
Judah

Generated at Thu Feb 08 04:07:23 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.