[SERVER-33165] Don't return from ReplSetTest.initiate until there is a stable checkpoint
Created: 07/Feb/18  Updated: 29/Oct/23  Resolved: 20/Apr/18

Status: Closed
Project: Core Server
Component/s: Replication, Testing Infrastructure
Affects Version/s: None
Fix Version/s: 3.7.6

Type: Task
Priority: Major - P3
Reporter: Judah Schvimer
Assignee: Judah Schvimer
Resolution: Fixed
Votes: 0
Labels: rollback-functional
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-33349 Add command to get stable checkpoint ... Closed
is depended on by SERVER-33525 Fix replication and sharding tests to... Closed
Related
related to SERVER-34635 import pymongo.write_concern in repli... Closed
is related to SERVER-36101 Replication should not depend on the ... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2018-02-12, Repl 2018-03-12, Repl 2018-03-26, Repl 2018-04-09, Repl 2018-04-23
Participants:
Linked BF Score: 0

 Description   

Some tests kill nodes. For those nodes to recover, they need to have a stable checkpoint. Whatever solution we choose should be applied to both sets of test fixtures.

The proposed solution is to write a "done: true" field to the checkpointTimestamp document as soon as the stable checkpoint has been taken. Tests can then poll that field to see when it is set.
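
The polling step in this proposal could look roughly like the following Python sketch. It assumes a hypothetical `fetch_doc` callable that returns the current checkpointTimestamp document as a dict (or None if it does not exist yet); how that document is actually read from a node is not specified in this ticket. Only the `done` field name comes from the proposal.

```python
import time


def wait_for_done(fetch_doc, timeout_secs=60.0, interval_secs=0.1):
    """Poll until the document returned by fetch_doc has done == True.

    fetch_doc is a hypothetical callable (an assumption, not a server API)
    that returns the checkpointTimestamp document as a dict, or None.
    Returns True once the flag is set, False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout_secs
    while True:
        doc = fetch_doc()
        if doc is not None and doc.get("done") is True:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_secs)
```

Returning False on timeout, rather than raising, leaves it to the calling fixture to decide how to fail the test.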



 Comments   
Comment by Githook User [ 20/Apr/18 ]

Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>

Message: SERVER-33165 Don't return from ReplSetTest.initiate until there is a stable checkpoint
Branch: master
https://github.com/mongodb/mongo/commit/5aec800d301a6806d82eac3a6bc5753b8c16dc5d

Comment by William Schultz (Inactive) [ 27/Mar/18 ]

Ah, thanks, I don't know why I missed that!

Comment by William Schultz (Inactive) [ 27/Mar/18 ]

max.hirschhorn, daniel.gottlieb: we may also need to do something similar for the Jepsen tests.

Comment by Judah Schvimer [ 15/Feb/18 ]

This may require SERVER-33349 to be completed first.

Comment by Judah Schvimer [ 15/Feb/18 ]

A server change would not be sufficient. ReplSetTest.initiate must wait for a stable checkpoint on every single node, not just the primary, which is not something the server can do. We have already accepted that, right after an initial sync or initiate, we may not be able to recover properly in the case of a rollback or shutdown.
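
The per-node requirement can be sketched in Python as follows. This is only an illustration: `has_stable_checkpoint` is a hypothetical predicate supplied by the test fixture, and the names and defaults here are assumptions. The real check might be built on the command tracked in SERVER-33349, whose name and output are not given in this ticket.

```python
import time


def wait_for_all_nodes(nodes, has_stable_checkpoint,
                       timeout_secs=60.0, interval_secs=0.1):
    """Block until every node reports a stable checkpoint.

    has_stable_checkpoint is a hypothetical predicate (an assumption)
    that takes a node and returns True once that node has a stable
    checkpoint. Returns the list of nodes still pending when the
    timeout expires; the list is empty on success.
    """
    deadline = time.monotonic() + timeout_secs
    pending = list(nodes)
    while True:
        # Re-check only nodes that have not yet reported a stable checkpoint.
        pending = [n for n in pending if not has_stable_checkpoint(n)]
        if not pending or time.monotonic() >= deadline:
            return pending
        time.sleep(interval_secs)
```

A single shared deadline covers the whole set, so a slow secondary cannot extend the total wait beyond the timeout.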

Comment by William Schultz (Inactive) [ 14/Feb/18 ]

judah.schvimer, is this a change that might be worth making in the server, to the replSetInitiate command itself? It could internally wait until a node has a stable checkpoint before returning. I'm not clear on whether a stable checkpoint is required in all cases before a replica set node can begin operation, but if it is, then the server change might be the right thing to do.

Comment by Judah Schvimer [ 07/Feb/18 ]

Calling "awaitLastOpCommitted" may be useful, though it's probably not sufficient.

Comment by Max Hirschhorn [ 07/Feb/18 ]

judah.schvimer, a similar change would need to be made in how resmoke.py initiates a replica set in ReplicaSetFixture.setup() in addition to ReplSetTest#initiate().

Generated at Thu Feb 08 04:32:31 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.