[SERVER-80236] Race in migration source registration and capturing writes for xferMods for deletes Created: 18/Aug/23 Updated: 29/Oct/23 Resolved: 31/Aug/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 4.2.0, 4.4.0, 5.0.0, 6.0.0, 7.0.0 |
| Fix Version/s: | 7.2.0-rc0, 7.0.2, 7.1.0-rc1, 5.0.22, 6.0.11, 4.4.26 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Randolph Tan | Assignee: | Randolph Tan |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-nyc-subteam1 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Assigned Teams: | Sharding NYC |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v7.1, v7.0, v6.0, v5.0, v4.4 |
| Participants: |
| Linked BF Score: | 108 |
| Description |
|
The migration cloner installs itself on the CollectionShardingRuntime (CSR) while holding the CSR lock in exclusive mode. The op observers take the CSR lock in shared mode to look up the cloner and capture the write. However, an op observer skips the capture entirely if it cannot find a cloner. Therefore the following scenario is possible:
|
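The lookup-then-skip pattern described above can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the server's C++ code: `CollectionShardingRuntimeModel`, `ClonerModel`, and `on_write` are hypothetical names, and a plain mutex stands in for the shared/exclusive CSR lock because the Python standard library has no reader-writer lock.

```python
import threading


class ClonerModel:
    """Hypothetical stand-in for the migration cloner."""

    def __init__(self):
        self.captured = []  # stands in for the xferMods buffer


class CollectionShardingRuntimeModel:
    """Hypothetical stand-in for the CSR: a lock guarding an optional cloner."""

    def __init__(self):
        # Assumption: a plain mutex replaces the shared/exclusive CSR lock.
        self._lock = threading.Lock()
        self._cloner = None

    def register_cloner(self, cloner):
        with self._lock:  # "exclusive" side: install the cloner
            self._cloner = cloner

    def get_cloner(self):
        with self._lock:  # "shared" side: op observers look up the cloner
            return self._cloner


def on_write(csr, doc_id):
    """Op observer hook: capture the write only if a cloner is registered."""
    cloner = csr.get_cloner()
    if cloner is None:
        return False  # no cloner found: the capture is skipped entirely
    cloner.captured.append(doc_id)
    return True


csr = CollectionShardingRuntimeModel()
cloner = ClonerModel()
first = on_write(csr, "a")   # runs before registration, so it is not captured
csr.register_cloner(cloner)
second = on_write(csr, "b")  # runs after registration, so it is captured
```

In this model the lock protects the lookup itself, but it does nothing for a write whose lookup happened just before the cloner registered. Such a write is visible to the migration only if the cloner learns about it some other way.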
| Comments |
| Comment by Githook User [ 28/Sep/23 ] |
|
Author: {'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'} Message: (cherry picked from commit f343f8dd0efbd885aa1db8a26de7018a84345689) |
| Comment by Githook User [ 14/Sep/23 ] |
|
Author: {'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'} Message: (cherry picked from commit c6961408c5fba8783acf6c5eb507ccc769c69c05) |
| Comment by Githook User [ 14/Sep/23 ] |
|
Author: {'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'} Message: (cherry picked from commit dcac81ff8729972a057f42d2b074889524d62467) |
| Comment by Githook User [ 06/Sep/23 ] |
|
Author: {'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'} Message: (cherry picked from commit 1c690ead56668593cb741aba0a78ba212df74fd1) |
| Comment by Githook User [ 05/Sep/23 ] |
|
Author: {'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'} Message: (cherry picked from commit 1c690ead56668593cb741aba0a78ba212df74fd1) |
| Comment by Githook User [ 30/Aug/23 ] |
|
Author: {'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'} Message: |
| Comment by Randolph Tan [ 18/Aug/23 ] |
|
It looks like this only affects deletes. For updates and inserts, the migration op observer is called after the oplog time slots have already been reserved. So even if the write was not captured, the cloner will still see it: the cloner waits for replication and therefore ends up waiting for the resulting oplog hole to close. The issue with deletes is that aboutToDelete saves the decision to skip capturing the write, and it can run before any opTime is reserved. The cloner therefore will not wait for the delete, even though the delete has already decided to skip capturing the op. |
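The ordering difference between inserts and deletes can be walked through with a deterministic, single-threaded Python model. This is a sketch under stated assumptions rather than the server's behavior: `DonorModel` and both scenario functions are hypothetical, an "open slot" stands in for a reserved but uncommitted opTime, and the cloner's wait for replication is modeled by draining every slot that is open at registration time before the initial snapshot is taken.

```python
class DonorModel:
    """Hypothetical donor shard: documents, open oplog slots, a cloner flag."""

    def __init__(self, docs):
        self.docs = set(docs)
        self.open_slots = {}   # slot id -> commit callback (an oplog "hole")
        self.next_slot = 1
        self.cloner_registered = False
        self.xfer_mods = []    # writes captured for transfer to the recipient

    def reserve_slot(self, commit):
        slot = self.next_slot
        self.next_slot += 1
        self.open_slots[slot] = commit
        return slot

    def commit_slot(self, slot):
        self.open_slots.pop(slot)()  # apply the write and close the hole

    def start_cloner(self):
        self.cloner_registered = True
        # Assumption: "waits for replication" means every hole open at this
        # point must close before the initial snapshot is taken.
        for slot in sorted(self.open_slots):
            self.commit_slot(slot)
        return set(self.docs)


def insert_racing_with_registration(doc):
    donor = DonorModel({"x"})
    donor.reserve_slot(lambda: donor.docs.add(doc))  # opTime reserved first
    capture = donor.cloner_registered                # observer runs: no cloner yet
    snapshot = donor.start_cloner()                  # cloner waits on the hole
    if capture:
        donor.xfer_mods.append(("insert", doc))
    return snapshot, donor.xfer_mods, donor.docs


def delete_racing_with_registration(doc):
    donor = DonorModel({"x", doc})
    capture = donor.cloner_registered                # aboutToDelete: decision saved early
    snapshot = donor.start_cloner()                  # no hole exists, nothing to wait for
    slot = donor.reserve_slot(lambda: donor.docs.discard(doc))
    donor.commit_slot(slot)
    if capture:
        donor.xfer_mods.append(("delete", doc))
    return snapshot, donor.xfer_mods, donor.docs
```

In the insert scenario the snapshot already contains the new document even though nothing was captured. In the delete scenario the snapshot still contains the deleted document and no captured operation exists to remove it, so in this model the recipient would keep a document that the donor no longer has.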