[SERVER-12293] initial sync of a capped collection can often fail if highly transient Created: 08/Jan/14 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.8, 2.5.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Asya Kamsky | Assignee: | Backlog - Replication Team |
| Resolution: | Unresolved | Votes: | 6 |
| Labels: | PM248, former-robust-initial-sync |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Assigned Teams: | Replication |
| Operating System: | ALL |
| Participants: |
| Description |
|
If a capped collection is hot (written to at a high rate), an initial sync of a new replica set member can often fail because the cursor gets overrun while syncing.

– OLD BELOW –

Any write to a full capped collection deletes old record(s). https://github.com/mongodb/mongo/blob/master/src/mongo/db/clientcursor.cpp#L251

Attempt to initial sync a capped collection with master/latest that's being inserted into:

Note that the failure is almost instant, unlike in 2.4, where such a failure would happen "eventually" if the writes were "fast enough". It appears that if the failure does not happen immediately, the clone succeeds. Possibly a timing interaction issue? |
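The overrun described above can be illustrated with a toy model. This is a minimal sketch and not MongoDB's implementation: `ToyCappedCollection`, `CursorOverrun`, and `read_batch` are hypothetical names, and capacity is counted in documents rather than bytes to keep the example short. It shows a reader that fetches one batch, falls behind while writers wrap the collection, and then finds that the record it was pointing to has been deleted.

```python
from collections import deque


class CursorOverrun(Exception):
    """Raised when the record a cursor points at has been overwritten."""


class ToyCappedCollection:
    """Fixed-capacity collection that drops the oldest doc on each insert once full."""

    def __init__(self, max_docs):
        self._docs = deque(maxlen=max_docs)
        self._first_seq = 0  # sequence number of the oldest retained doc

    def insert(self, doc):
        if len(self._docs) == self._docs.maxlen:
            # The deque is about to evict its oldest entry.
            self._first_seq += 1
        self._docs.append(doc)

    def read_batch(self, cursor_seq, batch_docs):
        """Return (docs, next_cursor_seq), or raise if the cursor was overrun."""
        if cursor_seq < self._first_seq:
            raise CursorOverrun(
                f"cursor at {cursor_seq}, oldest retained is {self._first_seq}"
            )
        start = cursor_seq - self._first_seq
        docs = list(self._docs)[start:start + batch_docs]
        return docs, cursor_seq + len(docs)


coll = ToyCappedCollection(max_docs=5)
for i in range(5):
    coll.insert(i)                     # collection is now full: 0..4

docs, cursor = coll.read_batch(0, 2)   # reader gets [0, 1], cursor moves to 2

for i in range(5, 8):
    coll.insert(i)                     # three inserts evict docs 0, 1 and 2

try:
    coll.read_batch(cursor, 2)         # doc 2 is gone
except CursorOverrun as exc:
    print("scan failed:", exc)
```

In this model the reader only fails if the writers evict past its position between two batches, which is the race the description refers to.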
| Comments |
| Comment by Eric Milkie [ 10/Nov/21 ] |
|
Note that File Copy Based Initial Sync (or any snapshot-based initial sync) does not suffer from this issue. |
| Comment by Louis Williams [ 11/Oct/21 ] |
|
Moving back to "Open" because the dependent ticket, |
| Comment by John Feibusch [ 26/Jan/14 ] |
|
By specifying that option, the user is asserting that the capped collection wraps quickly. If that assertion is incorrect, then the secondary would be inconsistent. In that sense, it would be the same as the --fastsync option. |
| Comment by Asya Kamsky [ 25/Jan/14 ] |
|
John, that's not possible: it would create a secondary that could have an empty capped collection if there were no more inserts into it during the initial sync and after. The secondary can't enter SECONDARY status until it has a consistent copy of the primary's data. |
| Comment by John Feibusch [ 21/Jan/14 ] |
|
I think a possible solution would be to have some option to not copy the data in a capped collection during initial sync. That is, on the new node, the capped collection would be created with the same size as on the source node, but no data would be copied. The capped collection would still end up with the same data as the primary after one wrap. |
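The trade-off in this proposal can be sketched with a toy replay. This is an illustration only, under simplifying assumptions: `replay` is a hypothetical helper, capacity is counted in documents rather than bytes, and both nodes are assumed to apply the same inserts in the same order. It shows that an empty copy matches the primary once a full wrap of new inserts has been applied, and differs from it before that point.

```python
from collections import deque


def replay(capacity, initial_docs, new_inserts):
    """Apply the same inserts to a full primary and to an empty secondary."""
    primary = deque(initial_docs, maxlen=capacity)
    secondary = deque(maxlen=capacity)  # created with the same size, no data copied
    for doc in new_inserts:
        primary.append(doc)
        secondary.append(doc)
    return list(primary), list(secondary)


# One full wrap of new inserts: both nodes hold the same documents.
p, s = replay(4, [0, 1, 2, 3], [4, 5, 6, 7])
print(p == s)

# Only half a wrap: the primary still holds older docs the secondary never saw.
p, s = replay(4, [0, 1, 2, 3], [4, 5])
print(p, s)

# No inserts at all: the secondary stays empty while the primary is full.
p, s = replay(4, [0, 1, 2, 3], [])
print(p, s)
```

The last two cases correspond to the objection raised in the reply to this comment: until the collection wraps, the new member does not hold a consistent copy.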
| Comment by Asya Kamsky [ 09/Jan/14 ] |
|
Yes. It seems a little easier to reproduce in 2.5.5-pre but I can consistently make it happen in both (just by starting a loop inserting into the capped collection on the primary right before starting initial sync of the secondary). |
| Comment by Eric Milkie [ 08/Jan/14 ] |
|
Does this affect both 2.4 and master branch? (my guess is yes) |
| Comment by Asya Kamsky [ 08/Jan/14 ] |
|
I think we've confirmed what happens: inserts arrive faster than getMore batches are fetched (this is more likely with very large documents, which force fewer docs into each batch) and delete the record that the getMore cursor is pointing to. |
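The race between batch size and insert rate can be worked through with a small model. This is a sketch under stated assumptions, not a description of the server: `docs_per_batch` and `scan_outcome` are hypothetical helpers, the 4 MiB batch limit is an arbitrary illustrative value, and the writer is assumed to insert a fixed number of documents between consecutive batches into a collection that is already full.

```python
def docs_per_batch(batch_bytes, doc_bytes):
    """Number of documents that fit in one batch (at least one)."""
    return max(1, batch_bytes // doc_bytes)


def scan_outcome(capacity, batch_docs, inserts_per_round, max_rounds=100_000):
    """Simulate a scan of a full capped collection that is being written to."""
    oldest = 0          # sequence number of the oldest retained doc
    newest = capacity   # one past the newest doc; the collection starts full
    cursor = 0          # next sequence number the cursor will return
    for _ in range(max_rounds):
        cursor = min(cursor + batch_docs, newest)   # fetch one batch
        if cursor == newest:
            return "complete"
        newest += inserts_per_round                 # writers run between batches
        oldest += inserts_per_round                 # each insert evicts one old doc
        if cursor < oldest:
            return "overrun"
    return "still scanning"


MIB = 1024 * 1024
big = docs_per_batch(4 * MIB, 1 * MIB)   # few large docs per batch
small = docs_per_batch(4 * MIB, 1024)    # many small docs per batch

print(big, scan_outcome(1000, big, inserts_per_round=10))
print(small, scan_outcome(1000, small, inserts_per_round=10))
print(scan_outcome(1000, 100, inserts_per_round=10))
```

In this model a scan that starts at the oldest record has no headroom, so it is overrun after the first batch whenever more documents are inserted per round than are returned per batch, and otherwise it keeps ahead of the writers. That is at least consistent with the observation in the description that the clone either fails almost immediately or succeeds.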
| Comment by Eric Milkie [ 08/Jan/14 ] |
|
From the description, it sounds like it affects more than syncing – it would be very difficult / impossible to do a read scan of a capped collection if someone else is writing to it and it's already wrapped around. |