[SERVER-32851] setFeatureCompatibilityVersion can race with createCollection such that FCV 3.6 is set and some collections do not have UUIDs Created: 23/Jan/18 Updated: 30/Oct/23 Resolved: 30/Jan/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | 3.6.0, 3.7.1 |
| Fix Version/s: | 3.6.3, 3.7.2 |
| Type: | Improvement | Priority: | Critical - P2 |
| Reporter: | Dianna Hohensee (Inactive) | Assignee: | Dianna Hohensee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | SWNA | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Backport Requested: |
v3.6
|
||||||||
| Sprint: | Storage 2018-01-29 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 0 | ||||||||
| Description |
|
Create collection takes a database exclusive lock and then checks the FCV to determine whether to create a collection with a UUID or not. The setFeatureCompatibilityVersion command, however, sets the FCV to upgrading without taking any locks. Therefore, createCollection could get as far as checking the FCV, not seeing upgradingTo36 or fullyUpgradedTo36, then the setFeatureCompatibilityVersion can assign all the existing collections to have UUIDs, and finally the createCollection finishes without assigning a UUID to the new collection. |
| Comments |
| Comment by Adrien Jarthon [ 21/Mar/20 ] |
|
Hi, We're having some trouble which this upgrade on our RS already reported to Enterprise support under ticket 00642150 So I'm wondering what has gone wrong in this change, you seem to say that the operation can take a long time:
But on the other end you end up tacking a global S lock for the duration of the UUIDs update (and not just the flag set) which (as far as I know and we noticed in production) will block every write, because they need IX on all ancestors, right? I don't see any update to this code until 3.6.17 so did I misunderstood something or is this migration supposed to block everybody's database for any duration between a couple of seconds to several hours? Thanks for your help. |
| Comment by Githook User [ 31/Jan/18 ] |
|
Author: {'email': 'dianna.hohensee@10gen.com', 'name': 'Dianna Hohensee', 'username': 'DiannaHohensee'}Message: (cherry picked from commit 9081e227228f11744200fbe66a9683b0bd82e5b1) |
| Comment by Githook User [ 30/Jan/18 ] |
|
Author: {'email': 'dianna.hohensee@10gen.com', 'name': 'Dianna Hohensee', 'username': 'DiannaHohensee'}Message: |
| Comment by Andy Schwerin [ 29/Jan/18 ] |
|
I think the current approach is acceptable. Normally, I'd prefer to delay the DDL ops in favor of unblocking the CRUD ops, but this is a really rare event, and the global S lock proposal is really easy to understand. |
| Comment by Spencer Brody (Inactive) [ 26/Jan/18 ] |
|
The fix in the CR works by taking and immediately releasing a global S lock after setting the FCV targetVersion. This creates a barrier for createCollection such that the collection will either be created after setting the targetVersion, and thus createCollection will make it with a UUID, or the createCollection happens before, but is guaranteed to be completed by the time the FCV command starts assigning UUIDs to existing collections. An alternative approach that dianna.hohensee and I just discussed is that we could instead take and hold global and database IX locks for the duration of where we list databases and collections and set UUIDs on them. The latter approach has the advantage that crud write ops are never blocked, but the downside that DDL ops are blocked for a longer window of time. I think both approaches are valid, and I don't have a strong preference for which we do, I just thought it'd be worth considering both. If no one else has a strong opinion then I'm included to default to the one that's already written, using the global S lock barrier. |
| Comment by Dianna Hohensee (Inactive) [ 26/Jan/18 ] |
|
Per offline discussion with Andy, going back to using a global lock, because adding too many miscellaneous locks outside of the global/database/collection hierarchy confuses lock taking order expectations and can easily lead to deadlocks. |
| Comment by Dianna Hohensee (Inactive) [ 24/Jan/18 ] |
|
The proposed solution for this ticket is to create a new ResourceMutex lock. createCollection will take it in shared mode for the duration of its operations. setFCV will take it in exclusive mode only while doing the FCV document write. This will prevent the FCV document from being updated in the middle of createCollection operations, between determining whether an UUID is needed and actually creating the collection. So createCollection will be able to run in parallel with other createCollection calls (as long as different dbs because createCollection already takes a db X lock), and no global exclusive lock is needed. note: the existing fcvLock cannot be used because it would prevent createCollection from running for the duration of setFCV, which could take a long time and does a lot of network calls. |
| Comment by Dianna Hohensee (Inactive) [ 24/Jan/18 ] |
|
schwerin, I'm going to take advantage of the fcvLock instead of taking the global exclusive lock in setFCV. Updating the title now. Do you still want repl on the CR? update: I'll actually make a new ResourceMutex lock, because the existing fcvLock is held too long. |
| Comment by Ian Whalen (Inactive) [ 23/Jan/18 ] |
|
dianna.hohensee request from Andy to have someone from Repl also review this please. |