Consider the following sequence of events during a batch insert of 1000 documents with ordered:true and w:majority writeConcern.
- Insert 500 documents and unlock
- Pause the inserting thread
- Another node steps up and the original primary rolls back the 500 writes already done
- The original primary steps back up
- The inserting thread then does the remaining writes which get new optimes
- That thread then waits for majority confirmation of the last writes, and successfully returns to the user
In this case we've lost 500 writes that are w:majority confirmed, and we've written later ops without the earlier ops even with ordered:true. This is caused by a combination of not killing all ops (at least all writing ops) on all replSet stepdown paths, not closing all connections, and always asking "can I currently write to this namespace" rather than "have I always been able to write to this namespace since starting this op".
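The sequence above can be reproduced in a toy model. This is only a sketch under stated assumptions: ToyNode, batch_insert and failover_and_back are hypothetical names, not MongoDB internals, and replication is reduced to "everything unreplicated is rolled back on stepdown". It shows how a per-write "can I currently write" check lets the batch succeed even though the first 500 documents are gone.

```python
class ToyNode:
    """Toy replica set member (hypothetical; not MongoDB's real internals)."""

    def __init__(self):
        self.is_primary = True
        self.term = 1
        self.oplog = []  # list of (term, doc_id)

    def step_down_and_roll_back(self):
        # Another node steps up. Assumed: none of our writes were replicated,
        # so all of them are rolled back.
        self.is_primary = False
        self.term += 1
        self.oplog.clear()

    def step_up(self):
        self.is_primary = True
        self.term += 1

    def write(self, doc_id):
        # Flawed check: only asks "can I currently write?"
        if not self.is_primary:
            raise RuntimeError("not primary")
        self.oplog.append((self.term, doc_id))


def batch_insert(node, doc_ids, pause_at, while_paused):
    for i, doc_id in enumerate(doc_ids):
        if i == pause_at:
            while_paused()  # global lock released, inserting thread paused
        node.write(doc_id)
    # Assumed: the remaining writes reach a majority without trouble.
    return "ok"


node = ToyNode()


def failover_and_back():
    node.step_down_and_roll_back()
    node.step_up()


result = batch_insert(node, range(1000), pause_at=500,
                      while_paused=failover_and_back)
surviving = [doc_id for _, doc_id in node.oplog]
```

In this model result is "ok" while surviving holds only documents 500 through 999, all written in the new term.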
This issue also affects any operations that write multiple oplog entries with a release of the global lock in between, and "no-op" ops that get the last optime after releasing the global lock. A non-exhaustive list:
- All batch write operations (insert, update, delete)
- Multi-update and Multi-delete
- Agg with $out
Possible fixes:
- Fail all write ops and waitForWriteConcern if the electionId (or rbid) changed since the op began
- Interrupt all write ops (or all ops) on all stepdown paths. This also requires one of the following:
a) Ensure all write ops check for interrupt after every acquisition of the global lock (currently they check before acquiring it)
b) Make all lock acquisitions checkForInterrupt (this is already planned to support interruptible locking)
- Record the term at the beginning of every operation. In the logOp (and awaitReplication) code, check that the term of the write matches the recorded term, and abort the write if it does not.
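A minimal sketch of the last option, assuming a toy model rather than the real server code: ToyNode, Operation, TermChanged and check_term are hypothetical names, and log_op / await_replication are simplified stand-ins for the logOp and awaitReplication code paths mentioned above. The operation records the term once at start and refuses to write or confirm if the term has moved.

```python
class TermChanged(Exception):
    """Raised when the node's term differs from the term the op started in."""


class ToyNode:
    """Toy replica set member (hypothetical; not MongoDB's real internals)."""

    def __init__(self):
        self.is_primary = True
        self.term = 1
        self.oplog = []  # list of (term, doc_id)

    def fail_over_and_back(self):
        # Another node steps up, our unreplicated writes roll back,
        # then we step back up: two elections, so the term advances by 2.
        self.oplog.clear()
        self.term += 2


class Operation:
    def __init__(self, node):
        self.node = node
        self.start_term = node.term  # recorded once, when the op begins

    def check_term(self):
        # "Have I always been able to write since starting this op?"
        if not self.node.is_primary or self.node.term != self.start_term:
            raise TermChanged(
                f"started in term {self.start_term}, now {self.node.term}")

    def log_op(self, doc_id):
        self.check_term()
        self.node.oplog.append((self.node.term, doc_id))

    def await_replication(self):
        self.check_term()
        return "ok"


node = ToyNode()
op = Operation(node)
written = 0
aborted = False
try:
    for doc_id in range(1000):
        if doc_id == 500:
            node.fail_over_and_back()  # happens while the thread is paused
        op.log_op(doc_id)
        written += 1
    op.await_replication()
except TermChanged:
    aborted = True
```

Here the batch aborts at document 500 instead of writing the second half, so no later ops are written without the earlier ones and nothing is reported as majority confirmed.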