[SERVER-27534] All writing operations must fail if the term changes Created: 28/Dec/16  Updated: 16/Aug/22  Resolved: 19/Apr/18

Status: Closed
Project: Core Server
Component/s: Replication, Write Ops
Affects Version/s: None
Fix Version/s: 3.6.5, 3.7.6

Type: Bug Priority: Critical - P2
Reporter: Mathias Stearn Assignee: Justin Seyster
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
causes SERVER-34682 Old primary should vote yes and store... Closed
Related
related to SERVER-34672 Unable to add shard on 3.7.5 sharded ... Closed
related to SERVER-37574 Force reconfig should kill user opera... Closed
related to SERVER-68874 Consider making waitAfterPinningCurso... Closed
related to SERVER-37381 Allow prepared transactions to surviv... Closed
is related to SERVER-38354 Allow shutdown error when reading las... Closed
is related to SERVER-31277 Cancel all user operations on heartbe... Closed
is related to SERVER-27545 Include RBID in replSet metadata of c... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.6
Sprint: Repl 2017-10-02, Repl 2017-10-23, Repl 2017-11-13, Repl 2017-12-04, Repl 2017-12-18, Query 2018-03-12, Query 2018-03-26, Query 2018-04-09, Query 2018-04-23
Participants:
Linked BF Score: 68

 Description   

Consider the following sequence of events during a batch insert of 1000 documents with ordered:true and w:majority writeConcern.

  1. Insert 500 documents and unlock
  2. Pause the inserting thread
  3. Another node steps up and the original primary rolls back the 500 writes already done
  4. The original primary steps back up
  5. The inserting thread then does the remaining writes, which get new optimes
  6. That thread then waits for majority confirmation of the last writes, and successfully returns to the user

In this case we've lost 500 writes that are w:majority confirmed, and we've written later ops without the earlier ops even with ordered:true. This is caused by a combination of not killing all ops (at least all writing ops) on all replSet stepdown paths, not closing all connections, and always asking "can I currently write to this namespace" rather than "have I always been able to write to this namespace since starting this op".
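
To make the race window concrete, here is a minimal sketch of the per-batch pattern described above; the names are hypothetical stand-ins, not the server's real API. The per-batch "am I primary right now?" check passes both before and after the rollback, so nothing notices the intervening term change:

{code:cpp}
// Hypothetical sketch of the vulnerable pattern: each batch only asks
// "can I write right now?", so a stepdown + rollback + step-up that
// happens between batches goes unnoticed.
#include <cstddef>
#include <vector>

struct Document {};

bool canAcceptWritesForNow() {
    // Stand-in for the real "am I primary for this namespace?" check.
    return true;
}

void insertAll(const std::vector<Document>& docs, std::size_t batchSize) {
    for (std::size_t i = 0; i < docs.size(); i += batchSize) {
        // Lock, check, write one batch, then unlock.
        if (!canAcceptWritesForNow())
            return;  // only detects "not primary *right now*"
        // ... insert docs[i .. i+batchSize) and log the oplog entries ...

        // <-- window: another node can step up, this node rolls back the
        //     batches above and then steps back up; the next iteration's
        //     check still passes and later batches get new optimes.
    }
    // ... waitForWriteConcern on the last optime; reports success ...
}
{code}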

This issue also affects any operation that writes multiple oplog entries with a release of the global lock in between, as well as "no-op" ops that get the last optime after releasing the global lock. A non-exhaustive list:

  • All batch write operations (insert, update, delete)
  • Multi-update and Multi-delete
  • Agg with $out
  • MapReduce

Potential solutions:

  1. Fail all write ops and waitForWriteConcern if the electionId (or rbid) changed since the op began
  2. Interrupt all write ops (or all ops) on all stepdown paths. Also need to either:
    a) Ensure all write ops check for interrupt after acquiring the global lock, every time they acquire it (currently they check before acquiring)
    b) Make all lock acquisitions call checkForInterrupt (this is already planned to support interruptible locking)
  3. Record the term at the beginning of every operation; in the logOp (and awaitReplication) code, check that the term of the write matches what was recorded and abort the write if not (sketched below)
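
A minimal, self-contained sketch of what option 3 could look like, assuming simplified stand-in types (g_currentTerm, OperationContext, and logOp here are not the server's actual definitions):

{code:cpp}
#include <atomic>
#include <cstdint>
#include <stdexcept>

std::atomic<std::int64_t> g_currentTerm{1};  // bumped on every election

struct OperationContext {
    std::int64_t termAtStart = g_currentTerm.load();  // recorded once, at op start
};

void logOp(const OperationContext& opCtx /*, oplog entry ... */) {
    // Abort the write if an election happened since the operation began.
    if (g_currentTerm.load() != opCtx.termAtStart)
        throw std::runtime_error("term changed since the operation began; failing the write");
    // ... append the oplog entry ...
}

int main() {
    OperationContext opCtx;      // batch write begins; term is recorded
    logOp(opCtx);                // ok: still the same term
    g_currentTerm.fetch_add(1);  // simulated stepdown + step-up (new term)
    try {
        logOp(opCtx);            // now fails instead of silently writing with a new optime
    } catch (const std::runtime_error&) {
        // the write command reports an error to the client
    }
}
{code}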


 Comments   
Comment by David Storch [ 20/Mar/19 ]

tess.avitabile, I don't recall if there was a specific reason that we did not backport this to 3.4. It looks like it would theoretically be possible to backport, though I don't recall all of the details. justin.seyster might remember more than me.

Comment by Tess Avitabile (Inactive) [ 18/Mar/19 ]

david.storch, justin.seyster, do you know why we did not backport this to 3.4? Did the issue not exist on 3.4? I am wondering because I am considering how far to backport SERVER-37574.

Comment by Githook User [ 08/May/18 ]

Author:

{'email': 'justin.seyster@mongodb.com', 'name': 'Justin Seyster', 'username': 'jseyster'}

Message: SERVER-27534 All writing operations must fail if the term changes

Note that in our master branch, we committed a more ambitious fix for
this, which modifies lock acquisitions to always check for interrupt
(obviating the need for the interrupt checks in these batch modify
functions). That will hopefully prevent future problems like this.

On the 3.6 branch, though, we just fix the case described in the Jira
ticket by moving interrupt checks to happen after we take a collection
lock (which helps because a stepdown can't occur while the lock is
held).
Branch: v3.6
https://github.com/mongodb/mongo/commit/97f6f0e4612fa04f7268c6f354c7065281075267

Comment by Githook User [ 18/Apr/18 ]

Author:

{'email': 'justin.seyster@mongodb.com', 'name': 'Justin Seyster', 'username': 'jseyster'}

Message: SERVER-27534 All writing operations must fail if the term changes.

This reapplies bc19d43f, which was reverted by ae50776b. It also adds
more test fixes.
Branch: master
https://github.com/mongodb/mongo/commit/d9a5a306690d7cdb8831e64441a66cdd503d8064

Comment by Justin Seyster [ 14/Apr/18 ]

Reverted, because of a failing test. This should be back in the build early next week.

Comment by Githook User [ 14/Apr/18 ]

Author:

{'email': 'justin.seyster@mongodb.com', 'name': 'Justin Seyster', 'username': 'jseyster'}

Message: Revert "SERVER-27534 All writing operations must fail if the term changes."

This reverts commit bc19d43fdc4aab85264def96f638128c0ddb8483.
Branch: master
https://github.com/mongodb/mongo/commit/ae50776bce051930d8d68493aa13c71cdcef4970

Comment by Githook User [ 12/Apr/18 ]

Author:

{'email': 'justin.seyster@mongodb.com', 'name': 'Justin Seyster', 'username': 'jseyster'}

Message: SERVER-27534 All writing operations must fail if the term changes.

The description of this SERVER ticket describes a nasty race that can
occur if elections happen in between two batches during a large
update. (The included test confirms that the race is possible.)

To fix this, we want to check the operation context for interrupts
with each batch, and we need to make sure the check happens after
the collection lock gets taken and before the batch
inserts/updates/deletes execute.

A recent change to locking gives us almost exactly this for free: if a
collection lock has to wait, it throws an exception when the operation
context is interrupted, ending the operation. If the lock doesn't
wait, though, there is no check.

This patch adds that check in. Acquiring a lock now always throws if
the operation context is interrupted, which closes the race window in
this bug.
Branch: master
https://github.com/mongodb/mongo/commit/bc19d43fdc4aab85264def96f638128c0ddb8483
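
As a rough illustration of the pattern this commit describes, using hypothetical simplified types rather than the real mongo::Lock API: every acquisition checks for interrupt, contended or not, so a stepdown between batches ends the operation at the next lock.

{code:cpp}
#include <stdexcept>

struct OperationContext {
    bool killed = false;  // set by the stepdown path when it kills user operations
    void checkForInterrupt() const {
        if (killed) throw std::runtime_error("operation was interrupted by stepdown");
    }
};

class CollectionLock {  // stand-in RAII lock, not the real mongo::Lock::CollectionLock
public:
    explicit CollectionLock(OperationContext& opCtx) {
        // The check runs on every acquisition, even an uncontended one, so an
        // election that killed this operation between batches ends it right here.
        opCtx.checkForInterrupt();
        // ... actually acquire the lock ...
    }
};

void writeNextBatch(OperationContext& opCtx) {
    CollectionLock lk(opCtx);  // throws if the operation context was interrupted
    // ... execute this batch's inserts/updates/deletes and their oplog writes ...
}
{code}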

Comment by Justin Seyster [ 12/Mar/18 ]

geert.bosch I was thinking of the new behavior in SERVER-32638: if the current operation context is marked as killed (interrupted), DB lock acquisition calls do not even try to wait for the lock. Instead, they immediately fail (by uasserting).

It seems reasonable (though certainly not necessary) to expect parallel behavior when acquiring a lock without waiting. I want to check that we've all agreed on the semantics of this operation before I start adding new checkForInterrupt calls (solution 2a in the description).

Comment by Geert Bosch [ 11/Mar/18 ]

Normally the expectation is that any operation does only a bounded amount of work before unlocking/relocking. You want to do some work after obtaining locks. We don't asynchronously stop operations, so there is nothing special about the instant we have gotten a lock, or 5 statements later. In particular, if you'd get a lock with a timeout of 0, the desired behavior is that of a try-lock where you don't have to wait. If you'd go check if no time has passed after obtaining the lock, it would always fail, even if the lock was not contended.

Comment by Justin Seyster [ 09/Mar/18 ]

It looks like SERVER-32638 is not enough to prevent this scenario, because acquiring a DBLock in an interrupted op context only uasserts if the DBLock has to wait. An uncontested acquisition will always succeed, so we'll still need to add checkForInterrupt calls after acquiring the DBLock (solution 2a in the description).

louis.williams Is that the originally intended behavior for interruptible DBLocks?

Comment by Geert Bosch [ 23/Feb/18 ]

I'm arguing that SERVER-32638 looks unsafe for back-porting. For just mainline I agree approach 2B should be fine.

Comment by Spencer Brody (Inactive) [ 21/Feb/18 ]

geert.bosch, are you suggesting that we go with the 2A approach from the description, rather than the 2B approach as we had planned? Or are you suggesting that after SERVER-32638 we specifically audit the handful of commands that can do writes and ensure that all of them are opting in to this lock interruptability behavior?

Comment by Geert Bosch [ 13/Feb/18 ]

SERVER-32638 currently exposes a lot of places in the code that acquire locks but are not prepared for the acquisition attempt to throw an exception. While we're working toward removing all such cases for MongoDB 4.0, it will take a while before we can be confident that we have handled all locations. This clearly is not safe for backporting.

It probably would be better to add explicit checks for interrupts at a few specific places where we can make sure exceptions are handled correctly.

Comment by Andy Schwerin [ 12/Jan/18 ]

On master, this ticket can be resolved after SERVER-32638 commits.

Comment by Eric Milkie [ 12/Jan/18 ]

SERVER-32638

Comment by Andy Schwerin [ 12/Jan/18 ]

milkie, is 2b in the proposed solution on the docket for 3.8?

Comment by Spencer Brody (Inactive) [ 21/Dec/17 ]

We'll need to check batch command locking for the places where we drop the lock outside of the query yield system. Any write operation that drops locks will need to call checkForInterrupt after re-acquiring its locks.
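
A sketch of that pattern with hypothetical stand-in types: the explicit check runs after the lock is re-acquired, which works because a stepdown cannot occur while the collection lock is held.

{code:cpp}
#include <stdexcept>

struct OperationContext {
    bool killed = false;  // set when a stepdown kills this operation
    void checkForInterrupt() const {
        if (killed) throw std::runtime_error("interrupted by stepdown");
    }
};

struct CollectionLockGuard {  // hypothetical RAII stand-in for the collection lock
    explicit CollectionLockGuard(OperationContext&) { /* acquire */ }
};

void writeNextBatch(OperationContext& opCtx) {
    CollectionLockGuard lk(opCtx);  // re-acquired after yielding between batches
    // The check must run *after* re-acquisition: no stepdown can begin while the
    // collection lock is held, so a clean check here stays valid for the batch.
    opCtx.checkForInterrupt();
    // ... perform this batch's writes ...
}
{code}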

Comment by David Storch [ 30/Oct/17 ]

If I understand correctly, 2a should be fairly straightforward. This work is not planned to be accomplished outside of a fix for this ticket.

Comment by Eric Milkie [ 27/Oct/17 ]

I don't believe 2b is planned soon, but I can't really tell because I was actually unable to find a ticket reflecting this work on the storage backlog. Is anyone aware of one?

Comment by Spencer Brody (Inactive) [ 27/Oct/17 ]

2a seems like it belongs on the query team, while 2b seems like it belongs to storage. david.storch milkie, can you comment on the relative complexity of 2a and 2b and whether either is currently planned anytime soon?

Comment by Spencer Brody (Inactive) [ 27/Oct/17 ]

The first part of solution #2 has been implemented. Now we just need to do either 2a or 2b to be safe from this issue.

Comment by Spencer Brody (Inactive) [ 26/Sep/17 ]

This problem may disappear when SERVER-31277 is implemented.

Comment by Mathias Stearn [ 07/Sep/17 ]

spencer matthew.russotto renctan We just discussed another interesting case that a fix for this should handle. If a single operation does a read followed by a write, we must ensure that there was no rollback between them. It is probably best in general to check for rollback every time the global lock is acquired rather than trying to find a smaller subset of cases.
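
One way to picture that check, using hypothetical names (the server tracks rollbacks with a rollback id, or rbid, as mentioned in the description's option 1): the global-lock acquisition path compares the rollback counter captured at the start of the operation with the current value.

{code:cpp}
#include <atomic>
#include <stdexcept>

std::atomic<int> g_rollbackId{0};  // incremented each time this node completes a rollback

struct OperationContext {
    int rbidAtStart = g_rollbackId.load();  // captured when the operation starts
};

void onGlobalLockAcquired(const OperationContext& opCtx) {
    // A read followed by a write inside one operation must not straddle a rollback.
    if (g_rollbackId.load() != opCtx.rbidAtStart)
        throw std::runtime_error("node rolled back while this operation was running");
}
{code}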

Comment by Spencer Brody (Inactive) [ 17/Apr/17 ]

Note that the naive implementation will add an extra mutex acquisition to every command execution. If we want to avoid that, we may be able to add a cached copy of the current term to ReplicationCoordinator that has the same concurrency rules as _canAcceptNonLocalWrites, and decorate the opctx with the term at the same place where the command execution code calls canAcceptWritesFor to check that the command is safe to run.

Comment by Spencer Brody (Inactive) [ 17/Apr/17 ]

Probably the best solution is to start decorating the operation context at the beginning of command execution with the current term, then checking in logOp and awaitReplication that the term of the optime being written or waited for hasn't changed from the term that was recorded on the opctx.
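
Putting the two comments above together, a minimal sketch with hypothetical names (the real decoration and replication-coordinator plumbing are more involved): the term is read from a cached atomic rather than under the coordinator's mutex, recorded on the operation context at command start, and re-checked when waiting for write concern.

{code:cpp}
#include <atomic>
#include <cstdint>
#include <stdexcept>

// Stand-in for a cached copy of the current term, published by the replication
// coordinator so that reading it does not take the coordinator's mutex.
std::atomic<std::int64_t> g_cachedTerm{1};

struct OpTime {
    std::int64_t term;
    std::int64_t timestamp;
};

struct OperationContext {
    // "Decoration" recorded where command execution already calls canAcceptWritesFor.
    std::int64_t termAtStart = g_cachedTerm.load();
};

void awaitReplication(const OperationContext& opCtx, const OpTime& lastWritten) {
    // The optime we wait on must carry the term the operation started in.
    if (lastWritten.term != opCtx.termAtStart)
        throw std::runtime_error("term changed; cannot satisfy the write concern");
    // ... otherwise wait for lastWritten to become majority-committed ...
}
{code}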
