[SERVER-33165] Don't return from ReplSetTest.initiate until there is a stable checkpoint Created: 07/Feb/18 Updated: 29/Oct/23 Resolved: 20/Apr/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 3.7.6 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Judah Schvimer | Assignee: | Judah Schvimer |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | rollback-functional |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Repl 2018-02-12, Repl 2018-03-12, Repl 2018-03-26, Repl 2018-04-09, Repl 2018-04-23 |
| Participants: |
| Linked BF Score: | 0 |
| Description |
|
Some tests kill nodes. For those nodes to recover, they need a stable checkpoint. Whatever solution we choose should apply to both sets of test fixtures. The proposed solution is to write a "done: true" field to the checkpointTimestamp document as soon as the stable checkpoint has been taken. Tests can then poll that field to see when it is set. |
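A minimal sketch of the polling idea in the description, assuming the proposed "done: true" field exists on the checkpointTimestamp document. The names `pollUntilCheckpointDone`, `readCheckpointDoc`, and `pause` are hypothetical and are not part of ReplSetTest or the server; how the document is actually read from a node is left to the caller.

```javascript
// Sketch only: `readCheckpointDoc` is a hypothetical callback that returns the
// node's checkpointTimestamp document, or null if it does not exist yet.
// `pause` is an injected wait (for example a sleep) between polls.
function pollUntilCheckpointDone(readCheckpointDoc, maxAttempts, pause) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const doc = readCheckpointDoc();
        // Only an explicit `done: true` counts. A missing document or a
        // missing field means the stable checkpoint has not been taken yet.
        if (doc && doc.done === true) {
            return attempt;  // number of polls it took
        }
        pause();
    }
    throw new Error("no stable checkpoint after " + maxAttempts + " attempts");
}
```

A test fixture could call something like this once per node after initiating the set, and fail the setup if the poll limit is reached.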
| Comments |
| Comment by Githook User [ 20/Apr/18 ] |
|
Author: {'email': 'judah@mongodb.com', 'username': 'judahschvimer', 'name': 'Judah Schvimer'} Message: |
| Comment by William Schultz (Inactive) [ 27/Mar/18 ] |
|
Ah, thanks, I don't know why I missed that! |
| Comment by William Schultz (Inactive) [ 27/Mar/18 ] |
|
max.hirschhorn daniel.gottlieb We may also need to do something similar for Jepsen tests. |
| Comment by Judah Schvimer [ 15/Feb/18 ] |
|
This may require |
| Comment by Judah Schvimer [ 15/Feb/18 ] |
|
A server change would not be sufficient. ReplSetTest.initiate must wait for a stable checkpoint on every single node, not just the primary, which is not something that the server can do. We have already accepted that right after an initial sync or initiate, we may not be able to recover properly in the case of a rollback or shutdown. |
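To illustrate the "every single node" requirement, here is a hedged sketch that reports which nodes still lack a stable checkpoint. It assumes the "done: true" field proposed in the description; `nodesWithoutStableCheckpoint` and the node names are hypothetical, and the input is a plain object mapping each node name to its checkpoint document.

```javascript
// Sketch only: `checkpointDocsByNode` maps a node name to that node's
// checkpointTimestamp document (or null if the node has none yet).
// Returns the names of nodes whose document is not marked `done: true`.
function nodesWithoutStableCheckpoint(checkpointDocsByNode) {
    return Object.keys(checkpointDocsByNode).filter(function(name) {
        const doc = checkpointDocsByNode[name];
        return !(doc && doc.done === true);
    });
}
```

An initiate wrapper would keep polling until this list is empty, so a primary that is ready does not hide a secondary that is not.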
| Comment by William Schultz (Inactive) [ 14/Feb/18 ] |
|
judah.schvimer Is this a change that might be worth making in the server, to the replSetInitiate command itself? It could internally wait until a node has a stable checkpoint before returning. I'm not clear on whether the existence of a stable checkpoint is considered a requirement in order for a replica set node to begin operation in all cases, but if it is, then it seems the server change might be the right thing to do. |
| Comment by Judah Schvimer [ 07/Feb/18 ] |
|
Calling "awaitLastOpCommitted" may be useful, though it's probably not sufficient. |
| Comment by Max Hirschhorn [ 07/Feb/18 ] |
|
judah.schvimer, a similar change would need to be made in how resmoke.py initiates a replica set in ReplicaSetFixture.setup() in addition to ReplSetTest#initiate(). |