[SERVER-20973] Initial sync during index drop can cause loss on new member Created: 16/Oct/15  Updated: 29/Oct/15  Resolved: 16/Oct/15

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 2.6.11, 3.0.7
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: James Wahlin Assignee: Eric Milkie
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongod3-0-7.patch     File repro.js    
Issue Links:
Depends
Duplicate
duplicates SERVER-2454 Queries that are killed during a yiel... Closed
Operating System: ALL
Steps To Reproduce:

1) Build MongoDB 3.0.7 with the attached "mongod3-0-7.patch" applied. This patch forces mongod queries to yield frequently and adds a 100ms sleep during each yield, making this issue easier to reproduce.
2) Run the attached reproduction script via the mongo shell:

mongo --nodb repro.js

Participants:

 Description   

During the initial sync process, mongod clones data from its sync source via a getMore() operation. It is possible for this getMore to end early for a given collection on cursor invalidation, having returned only a partial data set. This results in a new replica set member with an incomplete data set.

To trigger this issue the following must occur:
1) The cloner's getMore() is currently in progress on the sync source, in a yielded state
2) An index on the collection being cloned is dropped (either directly or due to a failed index build, one example being a unique index build that hits a duplicate key)

This does not appear to be an issue under 3.2.0-rc0.
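
The failure mode described above can be modeled with a small, self-contained sketch. This is a toy simulation only, not MongoDB code: ToyCursor, cloneCollection, and the reportDeath flag are hypothetical names introduced for illustration. It assumes that a cursor killed while yielded either ends the stream silently (the behavior described in this ticket) or raises an error the cloner can see.

```javascript
// Toy model only: none of these names are MongoDB internals.
class ToyCursor {
  constructor(docs, batchSize) {
    this.docs = docs;
    this.batchSize = batchSize;
    this.pos = 0;
    this.killed = false; // set when a simulated index drop invalidates the cursor
  }

  // Returns the next batch. `reportDeath` models whether a killed cursor
  // surfaces an error or silently ends the stream.
  getMore(reportDeath) {
    if (this.killed) {
      if (reportDeath) {
        throw new Error("cursor killed during yield");
      }
      return []; // indistinguishable from normal exhaustion
    }
    const batch = this.docs.slice(this.pos, this.pos + this.batchSize);
    this.pos += batch.length;
    return batch;
  }
}

// Clones documents from `source`; an empty batch is treated as "done".
// `dropIndexAfterBatch` is the batch count after which the simulated
// index drop kills the yielded cursor.
function cloneCollection(source, batchSize, dropIndexAfterBatch, reportDeath) {
  const cursor = new ToyCursor(source, batchSize);
  const cloned = [];
  let batches = 0;
  for (;;) {
    const batch = cursor.getMore(reportDeath);
    if (batch.length === 0) {
      return cloned;
    }
    cloned.push(...batch);
    batches += 1;
    if (batches === dropIndexAfterBatch) {
      cursor.killed = true; // index dropped while the cursor is yielded
    }
  }
}

const source = Array.from({ length: 10 }, (_, i) => ({ _id: i }));

// Silent termination: the clone "succeeds" with only part of the data.
const partial = cloneCollection(source, 3, 2, false);
console.log(partial.length + " of " + source.length + " documents cloned");

// Reported termination: the cloner gets a chance to fail or retry.
let failure = null;
try {
  cloneCollection(source, 3, 2, true);
} catch (err) {
  failure = err.message;
}
console.log("error seen by cloner: " + failure);
```

In the silent case the cloner has no signal that anything went wrong, which is why the new member ends up with an incomplete data set. Surfacing the error only helps if the cloner also checks for it.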



 Comments   
Comment by David Storch [ 29/Oct/15 ]

james.wahlin, on the 2.6 branch, DEAD Runners will not return an error to the client:

https://github.com/mongodb/mongo/blob/ec704c707f617981e7c38f9ea557fe5f505779bc/src/mongo/db/query/new_find.cpp#L667-L682

It should be the same as 3.0 in this respect.

Comment by James Wahlin [ 29/Oct/15 ]

schwerin - sorry, missed your comment earlier. This does impact 2.6 but, as Eric mentioned, may have a different root cause. From a first pass it looks like 2.6 will return an error on a dead PlanExecutor, so it may indeed be that the error is reported but ignored by the cloner. I need to do some work on the repro script for 2.6 but will confirm once I have validated this theory. (CC: milkie)

Comment by Eric Milkie [ 16/Oct/15 ]

I was incorrect about 2.6. The issue is a bit different there, though: I believe the cloner wouldn't detect query errors even if we started returning them when the cursor is closed prematurely.

Comment by Andy Schwerin [ 16/Oct/15 ]

The report claims that the 2.6 series is affected. james.wahlin, can you confirm?

Comment by Eric Milkie [ 16/Oct/15 ]

This is confirmed to be a duplicate of SERVER-2454. I have marked that ticket for backporting to 3.0. I believe the 2.6 series is not affected by this.

Generated at Thu Feb 08 03:55:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.