[SERVER-8073] Possible bug in primary's tracking of how up-to-date its secondaries are. Created: 03/Jan/13  Updated: 10/Dec/14  Resolved: 12/Apr/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.0-rc0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Spencer Brody (Inactive) Assignee: Eric Milkie
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-8420 GhostSync::percolate() should refresh... Closed
Operating System: ALL
Participants:

 Description   

It seems possible in some cases for the optimes reported in a primary's local.slaves collection to differ from the optimes reported by replSetGetStatus. This can cause problems if the primary tries to wait for all secondaries to be caught up in replication.



 Comments   
Comment by Eric Milkie [ 12/Apr/13 ]

I believe this was fixed by SERVER-8420.

Comment by Eric Milkie [ 07/Mar/13 ]

SERVER-8420 could be part of the cause here.

Comment by Eric Milkie [ 07/Mar/13 ]

We currently notify a primary of a secondary's oplog position through the getMore calls the secondary issues on its sync cursor. A secondary cannot issue the next getMore until it has reached the last op of the prior batch; when the primary receives a getMore, it updates that secondary's tracked position to the last op of the previous batch.
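The rule above — a getMore implicitly acknowledges only the *previous* batch — can be sketched with a toy model (this is my illustration, not the server's code; class and method names are hypothetical):

```python
class PrimarySlaveTracker:
    """Toy model of the primary's per-secondary optime tracking."""

    def __init__(self):
        self.acked = {}           # secondary -> last acknowledged optime
        self.last_batch_end = {}  # secondary -> last op of the most recent batch sent

    def serve_getmore(self, secondary, oplog, cursor_pos, batch_size):
        # A getMore implies the secondary finished its prior batch, so the
        # primary advances that secondary's position to that batch's last op.
        if secondary in self.last_batch_end:
            self.acked[secondary] = self.last_batch_end[secondary]
        batch = oplog[cursor_pos:cursor_pos + batch_size]
        if batch:
            self.last_batch_end[secondary] = batch[-1]
        return batch

oplog = list(range(1, 11))                  # optimes 1..10
tracker = PrimarySlaveTracker()
tracker.serve_getmore("secA", oplog, 0, 4)  # initial batch: ops 1..4
tracker.serve_getmore("secA", oplog, 4, 4)  # getMore acks op 4, returns 5..8
print(tracker.acked["secA"])                # 4: tracked position lags one batch
```

Even in this simplified model, the primary's view of the secondary is always at least one batch behind what the secondary has actually applied.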

With chained secondaries, every hop communicates via this same mechanism, and the getMore batches will often not be aligned along the chain. So for a given op applied at the end of a chain, the notification might bubble up through half the chain and then stall, because a node further up is not yet ready to issue its next getMore. In other words, the longer the chain, the further off the primary's slave tracking can be. At steady state (no writes arriving on the primary), however, all slaves eventually catch up and advance their cursors to the end of the oplog.
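A back-of-the-envelope sketch of why the error grows with chain length (this bound is my illustration, not stated in the server code): if each hop acknowledges only completed batches of size B, each hop can withhold up to B-1 applied-but-unacknowledged ops, so an n-hop chain can under-report by up to n*(B-1) ops.

```python
def worst_case_tracking_lag(hops, batch_size):
    # Each hop in the chain can hide up to one unfinished batch
    # (batch_size - 1 ops), and the effect compounds per hop.
    return hops * (batch_size - 1)

print(worst_case_tracking_lag(1, 100))  # 99 ops behind at worst
print(worst_case_tracking_lag(3, 100))  # 297: lag grows with chain length
```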

Note that the optimes reported by the replSetGetStatus command are populated by the heartbeats sent from each slave; these are accurate to within the heartbeat interval (two seconds).

I've gone through the logic for percolating chained secondaries' positions and I don't yet see how it could break such that tracking stops. However, if local.me identifiers conflicted, there might be ways for things to get mixed up. Quite a few code paths do not log unexpected errors at the default log level; some errors are reported only at log level 2.

local.me identifiers should never conflict. However, if database files are copied to a new host, a conflicting id can result: we check whether the current hostname matches the one recorded in local.me to decide whether a new id needs to be generated, and this check is not infallible.
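A hypothetical sketch of the check described above (function and field names are mine, not the server's): a new id is generated only when the recorded hostname differs, so database files copied to a host that reports the same hostname silently keep the old id — the potential collision.

```python
import uuid

def ensure_local_me(stored_doc, current_hostname):
    # Regenerate the id only if there is no document or the hostname changed.
    if stored_doc is None or stored_doc.get("host") != current_hostname:
        return {"_id": uuid.uuid4().hex, "host": current_hostname}
    return stored_doc  # reused as-is, even if the files were copied

original = ensure_local_me(None, "db1.example.com")
copied = ensure_local_me(original, "db1.example.com")  # copied files, same hostname
print(copied["_id"] == original["_id"])  # True: the conflicting id survives
```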

Generated at Thu Feb 08 03:16:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.