[SERVER-37886] Remove config server as coordinator crutch from coordinator stepdown targeted tests Created: 01/Nov/18 Updated: 29/Oct/23 Resolved: 17/Apr/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.11 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Matthew Saltz (Inactive) | Assignee: | Jack Mulrow |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ShardedTxn:DistributedCommit |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Sharding 2019-03-11, Sharding 2019-04-22 |
| Participants: | |
| Linked BF Score: | 11 |
| Description |
|
Currently, in order to test coordinator failover, we have an override that causes the router to always use the config server as the coordinator shard for a transaction. Once prepare failover is ready, we should move the coordinator back onto the first shard to be touched by a transaction. |
| Comments |
| Comment by Githook User [ 17/Apr/19 ] |
|
Author: {'email': 'jack.mulrow@mongodb.com', 'name': 'Jack Mulrow', 'username': 'jsmulrow'} Message: This reverts commit 7f620154e595d2c1e6c7af79fc62070ced3bb941. |
| Comment by Githook User [ 16/Apr/19 ] |
|
Author: {'email': 'jack.mulrow@mongodb.com', 'name': 'Jack Mulrow', 'username': 'jsmulrow'} Message: This reverts commit bc276668e8d992ff833fdbfd0f4280419d11eda1. |
| Comment by Esha Maharishi (Inactive) [ 18/Mar/19 ] |
|
I also marked this as depends on |
| Comment by Esha Maharishi (Inactive) [ 18/Mar/19 ] |
|
Marking as depends on

The deadlock is a result of a lock order inversion between taking the ReplicationCoordinator mutex and checking out a session:

The deadlock that ensues is either:

1) Stepdown acquires the ReplicationCoordinator mutex before the command calls ReplicationCoordinator::awaitReplication. The command blocks trying to acquire the ReplicationCoordinator mutex in ReplicationCoordinator::awaitReplication, and the stepdown blocks in checkOutSessionForKill.

OR

2) Stepdown acquires the ReplicationCoordinator mutex while the command is in waitForConditionOrInterrupt as part of ReplicationCoordinator::_awaitReplication_inlock. The stepdown successfully marks the command's OperationContext as killed, but the command blocks trying to acquire the ReplicationCoordinator mutex to wake up from its sleep, and the stepdown blocks in checkOutSessionForKill.

I originally thought that |
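The first scenario above can be illustrated with a small, self-contained sketch. This is not the server's code: the two Python locks merely stand in for the ReplicationCoordinator mutex and for a checked-out session, and the function and variable names are hypothetical. Timeouts replace the indefinite blocking so the example terminates instead of hanging.

```python
import threading


def simulate_lock_order_inversion(timeout=0.2):
    """Model the inversion with two plain locks (hypothetical stand-ins).

    repl_coord_mutex ~ the ReplicationCoordinator mutex
    session          ~ a session that a command has checked out
    Returns a dict saying whether each side obtained the second resource.
    """
    repl_coord_mutex = threading.Lock()
    session = threading.Lock()
    both_holding = threading.Barrier(2)    # each side holds its first resource
    both_attempted = threading.Barrier(2)  # each side has tried the second one
    acquired = {}

    def command():
        # The command has the session checked out, then needs the mutex
        # (as when it goes on to wait for replication).
        with session:
            both_holding.wait()
            got = repl_coord_mutex.acquire(timeout=timeout)
            if got:
                repl_coord_mutex.release()
            acquired["command"] = got
            # Keep holding the session until the other side has also tried.
            both_attempted.wait()

    def stepdown():
        # Stepdown holds the mutex, then needs the session
        # (as when it checks the session out to kill it).
        with repl_coord_mutex:
            both_holding.wait()
            got = session.acquire(timeout=timeout)
            if got:
                session.release()
            acquired["stepdown"] = got
            both_attempted.wait()

    threads = [threading.Thread(target=command), threading.Thread(target=stepdown)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return acquired


if __name__ == "__main__":
    result = simulate_lock_order_inversion()
    print("command got mutex:", result["command"])
    print("stepdown got session:", result["stepdown"])
```

Both values come back `False`: each side holds the resource the other needs. Without the timeouts, neither thread would ever proceed, which is the hang described in this comment.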
| Comment by Githook User [ 08/Mar/19 ] |
|
Author: {'name': 'Esha Maharishi', 'username': 'EshaMaharishi', 'email': 'esha.maharishi@mongodb.com'} Message: Revert " This reverts commit 8f3ad3eab4631a393cb4c2bfff69015baf63ebc9. |
| Comment by Githook User [ 08/Mar/19 ] |
|
Author: {'name': 'Esha Maharishi', 'email': 'esha.maharishi@mongodb.com', 'username': 'EshaMaharishi'} Message: Revert " This reverts commit a0555e9be29e21c1c822d4bfc860c66b6838a00f. |
| Comment by Githook User [ 08/Mar/19 ] |
|
Author: {'name': 'Esha Maharishi', 'username': 'EshaMaharishi', 'email': 'esha.maharishi@mongodb.com'} Message: |
| Comment by Githook User [ 08/Mar/19 ] |
|
Author: {'name': 'Esha Maharishi', 'username': 'EshaMaharishi', 'email': 'esha.maharishi@mongodb.com'} Message: |
| Comment by Esha Maharishi (Inactive) [ 07/Dec/18 ] |
|
Actually, using the override that retries the transaction entirely would allow the tests to pass while still testing that the coordinator resumed coordinating commit on stepup. I was worried the override would mean the tests would pass even if the coordinator never resumed coordinating any commits, but that's not true. If the coordinator never resumed coordinating the commit and there were prepared participants, the tests would hang. |
| Comment by Esha Maharishi (Inactive) [ 07/Dec/18 ] |
|
Since participants abort unprepared transactions on stepdown, we will never be able to do passthrough testing of coordinator failover unless the coordinator is not a participant (the tests running in the passthrough may expect the transaction to commit). |