[SERVER-62716] Handle spurious finishWaitingForOneOpTime in WaitForMajorityServiceTest Created: 18/Jan/22 Updated: 29/Oct/23 Resolved: 23/Feb/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 6.0.0-rc0, 5.0.10 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Vojislav Stojkovic | Assignee: | Matt Diener (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | neweng | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Backport Requested: | v5.3, v5.0 | ||||||||
| Sprint: | Service Arch 2022-2-21, Service Arch 2022-03-07 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 18 | ||||||||
| Description |
|
As determined in BF-22420, there is a possibility that WaitForMajorityService will call waitForWriteConcern on a request that has been cancelled but has not been removed from its request collection yet. The service uses two background threads: one for processing requests (by waiting for replication) and the other for removing requests whose futures have been cancelled. This behavior can manifest if the cancelled request is at the front of the queue and the processing thread wakes up and acquires the mutex after the cancellation thread has marked the request as processed but before it has removed it from the collection. While the behavior itself is not a bug, it does cause problems in WaitForMajorityServiceTest where test logic assumes that the request has been handled only once. For example, CancelingEarlierOpTimeRequestDoesNotAffectLaterOpTimeRequests calls finishWaitingOneOpTime twice and assumes that the first request will have been processed after the first call and the second after the second call. One way to fix this would be to call getLastOpTimeWaited after finishWaitingOneOpTime and retry when necessary. |
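The retry idea suggested in the description could look roughly like the sketch below. It is not the real WaitForMajorityServiceTest fixture: `FakeWaitFixture`, the integer `OpTime` alias, the `bool` return value of `finishWaitingOneOpTime`, and the helper `finishWaitingUntil` are all hypothetical stand-ins chosen for illustration. Only the names `finishWaitingOneOpTime` and `getLastOpTimeWaited` come from the ticket. The fake models the spurious wait on a cancelled request as an extra entry at the front of the queue.

```cpp
#include <cassert>
#include <deque>
#include <utility>

// Hypothetical stand-in for an opTime; the real type is richer than an int.
using OpTime = int;

// Hypothetical fake of the test fixture, NOT the real MongoDB class.
class FakeWaitFixture {
public:
    // `waits` lists, in order, the opTimes the processing thread will wait on.
    // A cancelled request that has not been removed yet appears as an extra entry.
    explicit FakeWaitFixture(std::deque<OpTime> waits) : _pending(std::move(waits)) {}

    // Models letting one in-progress wait complete.
    // Returning false when nothing is pending is an assumption of this fake.
    bool finishWaitingOneOpTime() {
        if (_pending.empty()) {
            return false;
        }
        _lastWaited = _pending.front();
        _pending.pop_front();
        return true;
    }

    // Reports the opTime of the most recently finished wait.
    OpTime getLastOpTimeWaited() const {
        return _lastWaited;
    }

private:
    std::deque<OpTime> _pending;
    OpTime _lastWaited = -1;
};

// Finishes waits until the last opTime waited on equals `expected`.
// Returns the number of waits finished, or -1 if `expected` was not reached
// within `maxAttempts` (one extra attempt covers a single spurious wait).
int finishWaitingUntil(FakeWaitFixture& fixture, OpTime expected, int maxAttempts = 2) {
    assert(maxAttempts > 0);
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (!fixture.finishWaitingOneOpTime()) {
            return -1;
        }
        if (fixture.getLastOpTimeWaited() == expected) {
            return attempt;
        }
    }
    return -1;
}
```

With a cancelled request for opTime 1 still at the front of the queue and a live request for opTime 2 behind it, `finishWaitingUntil(fixture, 2)` finishes two waits in this fake. Without the spurious wait it finishes one. Bounding the number of attempts keeps a genuinely missing request visible as a failure instead of looping forever.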
| Comments |
| Comment by Githook User [ 21/Jun/22 ] |
|
Author: {'name': 'Matt Diener', 'email': 'matt.diener@mongodb.com', 'username': 'mattdiener'} Message: |
| Comment by Githook User [ 22/Feb/22 ] |
|
Author: {'name': 'Matt Diener', 'email': 'matt.diener@mongodb.com', 'username': 'mattdiener'} Message: |
| Comment by Matthew Saltz (Inactive) [ 31/Jan/22 ] |
|
After discussion in triage, it seems like this is a problem with a test rather than the service. The race described can happen in production, but it's not actually a problem since we just end up waiting for an opTime that we'd eventually surpass anyway. So we should just fix the unit test. |