[SERVER-10014] critical section recovery check is inaccurate if SCC::findOne unexpectedly throws after preparing
Created: 24/Jun/13  Updated: 06/Dec/22  Resolved: 12/Jul/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.5.0
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Greg Studer
Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Done
Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams: Sharding
Operating System: ALL
Participants:

 Description   

Note - the correct behavior here may need more discussion.

If a write fails on the first (or second) config server while mongod is writing the new version in the critical section of a migration, mongod can throw an exception that is caught by the critical section recovery code (see DBClientInterface::findN; it looks like SCC::findOne does not expect this to happen). Writes to the subsequent servers are then not performed, potentially resulting in inconsistency across the servers.

The check that runs afterwards (it reads back the version that was written and shuts down the server if it differs) does not catch the problem if the write to the server actually went through. This can happen when a config server times out, for example.
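To make the failure mode concrete, here is a minimal Python sketch of the scenario described above. It is a toy model, not mongod code: FakeConfigServer, ConfigTimeout, write_version_everywhere and recovery_check_passes are all hypothetical names, and the sketch assumes that the check reads the version back from a single server (here the first one) and that a timed-out write may still have been applied.

```python
class ConfigTimeout(Exception):
    """Hypothetical error: the server did not acknowledge the write in time."""


class FakeConfigServer:
    """Toy in-memory stand-in for one config server (not a MongoDB API)."""

    def __init__(self, name, version, timeout_after_apply=False):
        self.name = name
        self.version = version
        self.timeout_after_apply = timeout_after_apply

    def write_version(self, new_version):
        self.version = new_version  # the write is applied...
        if self.timeout_after_apply:
            # ...but the acknowledgement never reaches the caller.
            raise ConfigTimeout(self.name)


def write_version_everywhere(servers, new_version):
    """Write to each server in order; the first exception aborts the loop."""
    for server in servers:
        server.write_version(new_version)


def recovery_check_passes(servers, expected_version):
    """Assumed check: read the version back from a single server only."""
    return servers[0].version == expected_version


servers = [
    FakeConfigServer("cfg1", version=1, timeout_after_apply=True),
    FakeConfigServer("cfg2", version=1),
    FakeConfigServer("cfg3", version=1),
]

try:
    write_version_everywhere(servers, new_version=2)
except ConfigTimeout:
    pass  # the recovery path would take over here

check_ok = recovery_check_passes(servers, expected_version=2)
versions = [s.version for s in servers]
print(check_ok, versions)  # True [2, 1, 1]
```

In this model the check passes because the first server did apply the write, even though the other two servers were never written to.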



 Comments   
Comment by Gregory McKeon (Inactive) [ 12/Jul/18 ]

This went away with the migration changes in 3.4.

Comment by Greg Studer [ 22/Nov/13 ]

Reporting what happens on all config servers after errors occur and not relying on findOne() is probably a good first step.
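A rough Python sketch of what such per-server reporting could look like follows. It is only an illustration under assumptions: collect_versions, is_consistent and the reader callables are hypothetical names, not part of any MongoDB API, and the three readers simply simulate one up-to-date server, one stale server and one unreachable server.

```python
def collect_versions(readers):
    """Read the version from every server, recording errors instead of raising."""
    report = {}
    for name, read_version in readers.items():
        try:
            report[name] = ("ok", read_version())
        except Exception as exc:
            report[name] = ("error", str(exc))
    return report


def is_consistent(report, expected_version):
    """True only if every server answered and holds the expected version."""
    return all(outcome == ("ok", expected_version) for outcome in report.values())


def unreachable():
    # Simulates a config server that cannot be read at all.
    raise TimeoutError("cfg3 timed out")


# Hypothetical readers, one per config server.
readers = {
    "cfg1": lambda: 2,    # write went through
    "cfg2": lambda: 1,    # write was never attempted
    "cfg3": unreachable,  # state unknown
}

report = collect_versions(readers)
for name, outcome in report.items():
    print(name, outcome)
print("consistent:", is_consistent(report, expected_version=2))
```

Because every server is queried and each outcome is recorded separately, a stale or unreachable server shows up in the report instead of being hidden behind a single successful read.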
