[SERVER-12293] initial sync of a capped collection can often fail if highly transient Created: 08/Jan/14  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.8, 2.5.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Asya Kamsky Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 6
Labels: PM248, former-robust-initial-sync
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-16049 Replicate capped collection deletes e... Closed
is depended on by TOOLS-1636 mongodump fails when capped collectio... Waiting (Blocked)
Related
is related to SERVER-32827 Initial sync can fail when syncing a ... Backlog
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

If a capped collection is hot, an initial sync of a new replica set member can often fail because the cursor gets overrun while syncing.
One possible solution: when the syncing secondary detects this, stop cloning the collection and let the oplog sync take care of it.
Note: if this happens, the oplog sync may never converge either.

– OLD BELOW –

Any write to a full capped collection deletes old record(s).
The delete seems to invalidate all cursors on the collection:

https://github.com/mongodb/mongo/blob/master/src/mongo/db/clientcursor.cpp#L251

Attempting an initial sync, on master/latest, of a capped collection that is being inserted into:

2014-01-08T09:00:07.481-0800 [rsSync] 		cloning collection test.cap to test.cap on asyasmacbook.local:40001 with filter {}
2014-01-08T09:00:07.949-0800 [rsSync] replSet initial sync exception: 13127 getMore: cursor didn't exist on server, possible restart or timeout? 0 attempts remaining

Note: the failure is almost instant, unlike in 2.4, where such a failure would happen "eventually" if the writes were "fast enough".

It appears that if the failure does not happen immediately, the clone succeeds. Possibly a timing interaction issue?



 Comments   
Comment by Eric Milkie [ 10/Nov/21 ]

Note that File Copy Based Initial Sync (or any snapshot-based initial sync) does not suffer from this issue.

Comment by Louis Williams [ 11/Oct/21 ]

Moving back to "Open" because the dependent ticket, SERVER-16049, was fixed in 5.0.

Comment by John Feibusch [ 26/Jan/14 ]

By specifying that option, the user is asserting that the capped collection wraps quickly. If that assertion is incorrect, then the secondary would be inconsistent. In that sense, it would be the same as the --fastsync option.

Comment by Asya Kamsky [ 25/Jan/14 ]

John, that's not possible: it would create a secondary that could have an empty capped collection if there were no more inserts into it during the initial sync and after. The secondary can't enter SECONDARY state until it has a consistent copy of the primary's data.

Comment by John Feibusch [ 21/Jan/14 ]

I think a possible solution would be to have some option to not copy the data in a capped collection during initial sync. That is, on the new node, the capped collection would be created with the same size as on the source node, but no data would be copied. The capped collection would still end up with the same data as the primary after one wrap.
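
To make the "same data after one wrap" reasoning concrete, here is a minimal sketch, not server code. It assumes a capped collection bounded by document count (real capped collections are bounded by size in bytes), and `replay` is a hypothetical helper name. The new node starts with an empty collection and only receives the inserts that happen after it was created.

```python
from collections import deque


def replay(max_docs, existing, new_inserts):
    """Compare a primary's capped collection with an empty copy fed only new inserts.

    max_docs:    capacity of the toy capped collection (count-based assumption)
    existing:    documents already on the primary when the empty copy is created
    new_inserts: documents inserted on the primary afterwards and replicated
    """
    # deque(maxlen=...) drops the oldest element when a full deque is appended to,
    # which mimics a capped collection wrapping.
    primary = deque(range(existing), maxlen=max_docs)
    secondary = deque(maxlen=max_docs)  # created empty, no data copied
    for doc in range(existing, existing + new_inserts):
        primary.append(doc)
        secondary.append(doc)
    return list(primary), list(secondary)
```

In this model, `replay(5, 5, 5)` returns identical lists, because the primary has wrapped completely. `replay(5, 5, 3)` leaves the copy two documents short, and `replay(5, 5, 0)` leaves it empty, so the two nodes only agree once a full wrap has occurred.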

Comment by Asya Kamsky [ 09/Jan/14 ]

Yes. It seems a little easier to reproduce in 2.5.5-pre but I can consistently make it happen in both (just by starting a loop inserting into the capped collection on the primary right before starting initial sync of the secondary).

Comment by Eric Milkie [ 08/Jan/14 ]

Does this affect both 2.4 and master branch? (my guess is yes)

Comment by Asya Kamsky [ 08/Jan/14 ]

I think we've confirmed what happens: inserts arrive faster than getMore batches (this is more likely with very large documents, which force fewer docs into each batch) and delete the record that the getMore cursor is pointing to.
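
The race can be illustrated with a toy model. This is a sketch under stated assumptions, not the server's implementation: the collection is capped by document count rather than bytes, and `CappedCollection`, `get_more`, `clone`, and `full_collection` are hypothetical names. The cloner reads one batch, then the writer performs a fixed number of inserts before the next batch is requested.

```python
from collections import deque


class CursorOverrun(Exception):
    """The record the cursor was positioned on has been deleted."""


class CappedCollection:
    """Toy capped collection: an insert into a full collection drops the oldest record."""

    def __init__(self, max_docs):
        self.max_docs = max_docs
        self.docs = deque()   # (sequence number, payload), oldest first
        self.next_seq = 0     # sequence number the next insert will receive

    def insert(self, payload):
        if len(self.docs) == self.max_docs:
            self.docs.popleft()
        self.docs.append((self.next_seq, payload))
        self.next_seq += 1

    def oldest_seq(self):
        return self.docs[0][0] if self.docs else self.next_seq

    def get_more(self, position, batch_size):
        """Return up to batch_size records starting at sequence number `position`."""
        oldest = self.oldest_seq()
        if position < oldest:
            raise CursorOverrun(f"record {position} was deleted; oldest is now {oldest}")
        start = position - oldest
        stop = min(start + batch_size, len(self.docs))
        return [self.docs[i] for i in range(start, stop)]


def full_collection(max_docs):
    """Build a collection that is already at capacity, so every insert deletes a record."""
    coll = CappedCollection(max_docs)
    for i in range(max_docs):
        coll.insert(i)
    return coll


def clone(coll, batch_size, inserts_between_batches):
    """Copy every record present at the start; return how many records were read."""
    position = coll.oldest_seq()
    end = coll.next_seq       # stop once everything present at the start has been read
    copied = 0
    while position < end:
        batch = coll.get_more(position, batch_size)
        copied += len(batch)
        position = batch[-1][0] + 1
        for _ in range(inserts_between_batches):
            coll.insert("new")  # concurrent writer on the primary
    return copied
```

In this model, `clone(full_collection(100), 10, 5)` completes because the reader stays ahead of the deletions, while `clone(full_collection(100), 10, 20)` raises `CursorOverrun` on the second `get_more`. A smaller `batch_size` stands in for larger documents, which fit fewer docs per batch.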

Comment by Eric Milkie [ 08/Jan/14 ]

From the description, it sounds like it affects more than syncing – it would be very difficult / impossible to do a read scan of a capped collection if someone else is writing to it and it's already wrapped around.
