[SERVER-40707] Secondary couldn't signal OplogWaiters to advance the lastCommittedOpTime in Chained replication mode Created: 18/Apr/19  Updated: 06/Dec/22  Resolved: 16/May/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.4.4, 3.4.19, 3.4.20
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: FirstName lipengchong Assignee: Backlog - Replication Team
Resolution: Won't Fix Votes: 1
Labels: SWNA
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-30827 Provide new interface for getting new... Closed
is duplicated by SERVER-40065 spike latency issue on shard cluster Closed
Assigned Teams:
Replication
Operating System: ALL
Steps To Reproduce:

A sharded cluster where the three config servers replicate in a chain: a->b->c.

We can easily observe majorityOpTime < opTime in rs.isMaster().lastWrite on c.
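
For illustration, the symptom can also be checked programmatically. A minimal sketch using the mongocxx driver (the address of c is a placeholder), comparing lastWrite.opTime against lastWrite.majorityOpTime in the isMaster reply from c:

#include <iostream>

#include <bsoncxx/builder/basic/document.hpp>
#include <bsoncxx/json.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/uri.hpp>

int main() {
    using bsoncxx::builder::basic::kvp;
    using bsoncxx::builder::basic::make_document;

    mongocxx::instance inst{};
    // Connect directly to node c (placeholder address).
    mongocxx::client conn{mongocxx::uri{"mongodb://c.example.net:27019"}};

    auto reply = conn["admin"].run_command(make_document(kvp("isMaster", 1)));
    bsoncxx::document::view lastWrite = reply.view()["lastWrite"].get_document().value;

    // On the last node of the chain, majorityOpTime regularly trails opTime.
    std::cout << "opTime:         " << bsoncxx::to_json(lastWrite["opTime"].get_document().value) << std::endl;
    std::cout << "majorityOpTime: " << bsoncxx::to_json(lastWrite["majorityOpTime"].get_document().value) << std::endl;
}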

Participants:

 Description   

A replica set has three members: aa, bb, and cc, chained as aa->bb->cc, where aa is the primary and the sync source of bb, and bb is the sync source of cc. aa advances the lastCommittedOpTime and signals the OplogWaiters, so bb can advance its lastCommittedOpTime, but sometimes cc cannot advance its lastCommittedOpTime.
 
The key problem here is that _localOplogCollection is always NULL on a secondary.

@oplog.cpp
void signalOplogWaiters() {
    if (_localOplogCollection) {
        _localOplogCollection->notifyCappedWaitersIfNeeded();
    }
}

The only place _localOplogCollection is set is in getLocalOplogCollection(), and that function is called only from here:

void logOp(OperationContext* txn,
           const char* opstr,
           const char* ns,
           const BSONObj& obj,
           const BSONObj* o2,
           bool fromMigrate)

void logOps(OperationContext* txn,
            const char* opstr,
            const NamespaceString& nss,
            std::vector<BSONObj>::const_iterator begin,
            std::vector<BSONObj>::const_iterator end,
            bool fromMigrate)

But on a secondary these functions return early because the oplog write is disabled for that operation, so getLocalOplogCollection() is never reached.

So, because _localOplogCollection is NULL, the secondary cannot signal the OplogWaiters when it updates _lastCommittedOpTime.
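
To make the failure mode concrete, here is a small self-contained illustration (hypothetical code, not the server source: OplogCache, waitForNewData, and the hard-coded 5-second timeout are stand-ins for the cached collection pointer, the capped-collection waiter behind an awaitData getMore, and the getMore timeout):

#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

// Stand-in for the cached oplog collection pointer and its waiters.
struct OplogCache {
    const char* collection = nullptr;  // stays nullptr on the secondary path
    std::mutex mtx;
    std::condition_variable cv;
    bool newDataOrCommitPoint = false;

    // Mirrors signalOplogWaiters(): it only notifies if the cached pointer was set.
    void signalOplogWaiters() {
        if (collection) {
            std::lock_guard<std::mutex> lk(mtx);
            newDataOrCommitPoint = true;
            cv.notify_all();
        }
    }

    // Stand-in for the blocked awaitData getMore from the downstream node.
    bool waitForNewData(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(mtx);
        return cv.wait_for(lk, timeout, [this] { return newDataOrCommitPoint; });
    }
};

int main() {
    OplogCache cache;

    // On a primary, logOp() would have populated the cached pointer first:
    // cache.collection = "local.oplog.rs";

    // Node B advances its lastCommittedOpTime shortly after node C's getMore
    // has started waiting, and tries to wake the waiters.
    std::thread signaller([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        cache.signalOplogWaiters();
    });

    // Node C's getMore: with the cached pointer NULL the notify never happens,
    // so the wait only returns when the timeout expires.
    bool signalled = cache.waitForNewData(std::chrono::seconds(5));
    std::cout << (signalled ? "woken by signal" : "timed out after 5s") << std::endl;

    signaller.join();
}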



 Comments   
Comment by Gregory McKeon (Inactive) [ 16/May/19 ]

Given that this is fixed in 3.6, the recommended resolution is to upgrade to 3.6.

As a workaround in 3.4, you can disable chaining on the config server replica set.
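
For reference, a minimal sketch of that workaround through the mongocxx driver (the host names, replica set name, and version number are placeholders; the equivalent shell steps are rs.conf(), setting settings.chainingAllowed = false, then rs.reconfig()):

#include <bsoncxx/builder/basic/array.hpp>
#include <bsoncxx/builder/basic/document.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/uri.hpp>

int main() {
    using bsoncxx::builder::basic::kvp;
    using bsoncxx::builder::basic::make_array;
    using bsoncxx::builder::basic::make_document;

    mongocxx::instance inst{};
    // Connect to the current config server primary (placeholder address).
    mongocxx::client conn{mongocxx::uri{"mongodb://cfg-a.example.net:27019"}};

    // In practice: fetch the current config with replSetGetConfig, set
    // settings.chainingAllowed = false, and bump "version" by one; the
    // hard-coded document below stands in for that result.
    auto cmd = make_document(kvp(
        "replSetReconfig",
        make_document(
            kvp("_id", "configReplSet"),
            kvp("version", 2),  // must be the current config version + 1
            kvp("configsvr", true),
            kvp("members",
                make_array(
                    make_document(kvp("_id", 0), kvp("host", "cfg-a.example.net:27019")),
                    make_document(kvp("_id", 1), kvp("host", "cfg-b.example.net:27019")),
                    make_document(kvp("_id", 2), kvp("host", "cfg-c.example.net:27019")))),
            kvp("settings", make_document(kvp("chainingAllowed", false))))));

    conn["admin"].run_command(cmd.view());
}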

Comment by Siyuan Zhou [ 22/Apr/19 ]

lpc, I looked at the problem and agree with your analysis.

Comment by FirstName lipengchong [ 20/Apr/19 ]

Hi Siyuan,  thanks for your response!

First, I understand the process for the last majority committed OpTime. But consider this problem: in the chain A->B->C, A determines the last majority committed OpTime and node B learns it from A, while node C can only learn it from the response to a getMore command it sends to node B. So node C cannot learn the last majority committed OpTime until a new oplog entry arrives or the getMore times out (5 seconds). In that window we get delayed queries with read concern "majority". The key problem is that node C sends a getMore command to node B and blocks; node B then updates the last majority committed OpTime but does not signal the blocked getMore right away. So isn't that a bug?
 
And the problem I hit is exactly SERVER-40065. The config servers form a chain. During some update_one operations, mongos reads the balancer configuration (config.settings) from node C and the read takes 5 seconds, which produces the strange latency we observed.
 
As you say, it's fixed in 3.6. In 3.6, a secondary first initializes the cached pointer to the oplog collection in startSteadyStateReplication:

void ReplicationCoordinatorExternalStateImpl::startSteadyStateReplication(
    OperationContext* opCtx, ReplicationCoordinator* replCoord) {
    // Initialize the cached pointer to the oplog collection, for writing to the oplog.
    acquireOplogCollectionForLogging(opCtx);
    ...
}
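
In terms of the illustrative stand-in sketched in the description (still hypothetical code, reusing the OplogCache type and includes from that sketch, not the server source), the 3.6 behavior corresponds to populating the cached pointer once at startup, after which the notify path works on a secondary as well:

int main() {
    OplogCache cache;

    // 3.6: acquireOplogCollectionForLogging() runs during startSteadyStateReplication,
    // so the cached pointer is populated before steady-state replication begins.
    cache.collection = "local.oplog.rs";

    std::thread signaller([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        cache.signalOplogWaiters();  // now actually notifies the blocked waiter
    });

    // Node C's getMore is woken promptly instead of waiting out the full 5 seconds.
    bool signalled = cache.waitForNewData(std::chrono::seconds(5));
    std::cout << (signalled ? "woken by signal" : "timed out after 5s") << std::endl;

    signaller.join();
}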

So in the foregoing scenario node B can signal the getMore query from node C, and node C then learns the last majority committed OpTime. Maybe the purpose of this change in 3.6 was not to fix this problem, but in fact it does fix it.

Comment by Siyuan Zhou [ 19/Apr/19 ]

lpc, thanks for filing the ticket!

 We can easily observe majorityOpTime < opTime in rs.isMaster().lastWrite on c

This is expected and doesn't necessarily mean a bug, since Node C doesn't know it's the last to receive a write and it's always the primary's job to determine the last majority committed OpTime. Node C has to apply the operation (updating the "lastWrite"), report its progress to the primary via its sync source, then learn the last majority committed OpTime from the primary via its sync source. Thus the delay is expected.

The code you pointed out might be a reason to delay queries with read concern "majority" on a secondary at the end of a chain, but before diving into the code, I'm still not sure of the problem or unexpected behavior you observed. Could you please describe the unexpected behavior and your expectation? A reproduction would be very helpful.

eric.sedor, re-assigning to you to help get a reproduction or a clarification of the unexpected behavior. This might be an issue we have fixed in 3.6, but we need to confirm the problem before evaluating the fix.

Thanks,
Siyuan

Comment by Tess Avitabile (Inactive) [ 19/Apr/19 ]

Apologies, I'm out sick today and on vacation next week. Judah is also on vacation today. Assigning to Andy to triage.
