[SERVER-33727] Do not wait for write concern if opTime didn't change during write Created: 07/Mar/18  Updated: 29/Oct/23  Resolved: 23/May/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.1.12

Type: Bug Priority: Major - P3
Reporter: Judah Schvimer Assignee: James Wahlin
Resolution: Fixed Votes: 1
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Documented
    is documented by DOCS-12750 Docs for SERVER-33727: Do not wait fo... Closed
Related
    related to SERVER-33232 _configCreateCollection should proper... Closed
    related to SERVER-33475 Retried findAndModify doesn't properl... Closed
    is related to SERVER-27067 Some Commands do not wait for write c... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Query 2019-06-03
Participants:
Linked BF Score: 50

 Description   

If a command accepts a write concern, it will always call waitForWriteConcern, regardless of whether or how it fails. The only question is which opTime it waits for. If no write was attempted (and consequently the GlobalLockAcquisitionTracker did not move the client opTime forward), we simply wait for write concern on the client opTime from before the command began. The client opTime may be null, in which case we do no waiting, or it may already be committed, in which case we also do no waiting: https://github.com/mongodb/mongo/blob/0bc90017e2563ed7432d56051483763c91d28025/src/mongo/db/service_entry_point_mongod.cpp#L77-L88
If there was a previous uncommitted write on the client, however, we end up waiting for write concern on that previous write, even if the current command received a parse error and should do no waiting. Alternatively, if that previous write was in a previous term, waitForWriteConcern will fail here instead of waiting on it or skipping the wait entirely: https://github.com/mongodb/mongo/blob/0bc90017e2563ed7432d56051483763c91d28025/src/mongo/db/repl/replication_coordinator_impl.cpp#L1489-L1496. In the case where no write was even attempted, rather than waiting on the previous write, we should not call waitForWriteConcern at all and should just return early. This will avoid unnecessary waiting on previous writes and avoid failing commands that should not need to wait for write concern at all.
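
Roughly, the decision described above has this shape (a standalone, simplified sketch; the OpTime type and helper names below are illustrative stand-ins, not the actual server code linked above):

    #include <iostream>

    // Illustrative stand-in for a replication optime; not the real OpTime class.
    struct OpTime {
        long long term;
        long long ts;
        bool isNull() const { return term == -1 && ts == 0; }
    };

    bool isAlreadyCommitted(const OpTime& op, const OpTime& majorityCommitted) {
        return op.term < majorityCommitted.term ||
               (op.term == majorityCommitted.term && op.ts <= majorityCommitted.ts);
    }

    // Today every command that accepts a write concern ends up in a wait like this,
    // no matter how it failed; the only question is which opTime it waits on.
    void waitForWriteConcern(const OpTime& clientLastOp, const OpTime& majorityCommitted) {
        if (clientLastOp.isNull() || isAlreadyCommitted(clientLastOp, majorityCommitted)) {
            std::cout << "nothing to wait for\n";
            return;
        }
        // Otherwise we block on whatever the client last wrote -- possibly a write
        // from an earlier command, which is the surprising case described above.
        std::cout << "waiting on {" << clientLastOp.term << ", " << clientLastOp.ts << "}\n";
    }

    int main() {
        OpTime previousUncommittedWrite{5, 100};
        OpTime majorityCommitted{5, 90};
        // A command that failed to parse and wrote nothing still waits on the old write:
        waitForWriteConcern(previousUncommittedWrite, majorityCommitted);
        return 0;
    }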

There is an assumption here that if we do not attempt to take the GlobalLock at any point, then we should not wait for write concern at all. This is not always true, for example with retryable writes retrying successful operations (SERVER-33475); in that case those commands should push the client opTime forward themselves to ensure we wait on the proper opTime.
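
A minimal sketch of the proposed check, assuming a hypothetical boolean on the lock tracker and the same kind of illustrative OpTime as above:

    #include <iostream>

    struct OpTime {
        long long term;
        long long ts;
        bool operator==(const OpTime& other) const {
            return term == other.term && ts == other.ts;
        }
    };

    // Stand-in for the GlobalLockAcquisitionTracker: did the command ever attempt
    // to take the global lock (i.e. attempt a write), even if it ended up a no-op?
    struct GlobalLockTracker {
        bool attemptedGlobalLock;
    };

    // Proposed: skip waitForWriteConcern entirely when the command never attempted
    // a write and the client opTime did not move, instead of waiting on (or failing
    // because of) a previous write from an earlier term.
    bool shouldWaitForWriteConcern(const OpTime& lastOpBeforeRun,
                                   const OpTime& lastOpAfterRun,
                                   const GlobalLockTracker& tracker) {
        return !(lastOpAfterRun == lastOpBeforeRun && !tracker.attemptedGlobalLock);
    }

    int main() {
        OpTime before{5, 100};
        OpTime after{5, 100};
        GlobalLockTracker tracker{false};  // e.g. a parse error: the lock was never tried
        std::cout << std::boolalpha
                  << shouldWaitForWriteConcern(before, after, tracker) << "\n";  // false
        return 0;
    }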



 Comments   
Comment by Githook User [ 23/May/19 ]

Author:

{'email': 'james@mongodb.com', 'name': 'James Wahlin', 'username': 'jameswahlin'}

Message: SERVER-33727 Do not wait for write concern if opTime didn't change during write
Branch: master
https://github.com/mongodb/mongo/commit/358c0af2fe875d6a768cf87d7ddfaeb3181f804a

Comment by Charlie Swanson [ 17/May/19 ]

Amazing! Thank you so much, everyone above! james.wahlin, see above for some great context here. Also happy to catch up in person. It looks like you have time to slot this into the upcoming sprint, but let me know if you don't.

Comment by A. Jesse Jiryu Davis [ 16/May/19 ]

Acknowledged, Andy, I retract my objection. =)

Comment by Tess Avitabile (Inactive) [ 16/May/19 ]

Thanks, jesse and shane.harvey.

charlie.swanson, I think this work is now sufficiently well-defined that the Query team can take it on. My only recommendation is to avoid changing the behavior of getLastError, if possible.

Comment by Andy Schwerin [ 16/May/19 ]

I believe that step 5 of Judah's proposal (if a global write lock was acquired, still wait for replication) makes the scenario described by Jesse and Shane above safe. Client A does need the guarantee that the acknowledgement in example step 3 above means the database state won't roll back.

Comment by A. Jesse Jiryu Davis [ 16/May/19 ]

I talked this over with shane.harvey. On the one hand, we have no specific knowledge as Driver Team people that makes us say this is a bad idea. But we did think of a scenario that makes us concerned about this change.

Years ago we knew of customers who would stream unacknowledged writes followed by a GLE, like Andy describes. Mongos broke that knowingly when it changed some connection pool logic, and drivers broke it similarly soon after. (Java and Python removed "thread affinity" from their pools.) So we think drivers don't support this technique anymore and applications have adapted.

Still, it seems like Judah's proposed change might not be a good idea. Consider:

  1. Client A updates a document with {$set: {x: 1}} and w: majority.
  2. Client B races to also update the document with {$set: {x: 1}}.
  3. Client A receives w: majority acknowledgement.
  4. The primary crashes.

Today, Client A knows that x was set to 1 and won't be rolled back, no matter whether Client A or Client B won the race. But it seems to us that after the proposed change, Client A loses this guarantee. Users would lose this simple technique to do an update and know the durable outcome.

Comment by Andy Schwerin [ 13/May/19 ]

Well, once upon a time, it mattered. I don't know if it still should. The use case was that clients would send a sequence of fire-and-forget writes and then do gLE at the end. Since we hung up on connections on step-down, a successful gLE would guarantee that all prior writes had committed. I don't know that we promised that the combined write+gLE semantics of a write command counted in the same way. I also don't know if we should preserve that behavior even if we used to leverage it. It hasn't been a guarantee preserved by sharding for at least 4 years, and I suspect that it wasn't correctly preserved before then.

jesse, could you think about this for 10 minutes and weigh in?

Comment by Tess Avitabile (Inactive) [ 13/May/19 ]

That's a good point, we will need to make sure this change does not affect the behavior of getLastError.

However, I was particularly asking whether we have any implicit guarantee that a successful majority writeConcern implies that all previous operations performed by the client are majority committed. That is, supposing aggregate now always accepts a writeConcern, should a successful aggregate with writeConcern w:majority (that did not attempt to write) imply that all previous operations performed by the client are majority committed?

Comment by Andy Schwerin [ 09/May/19 ]

Ugh, this is really hard to answer. The getLastError command is still supported and may not take a write lock? Even if it doesn't, I'm not sure if all commands that choose to not write do so having already taken a write lock.

Comment by Tess Avitabile (Inactive) [ 09/May/19 ]

Got it, thanks. I think this will be okay. My only concern is whether we have any implicit guarantee that a successful majority writeConcern implies that all previous operations performed by the client are majority committed. We might make that implicit guarantee in order to provide feature parity with getLastError. schwerin, do you believe we are making this implicit guarantee today?

Comment by Judah Schvimer [ 09/May/19 ]

My proposal is slightly different than what we do today. Today we wait for write concern no matter what. In my proposal, if both (1) lastOpAfterRun == lastOpBeforeRun and (2) we did not take a global write lock, then we would not wait for write concern.

"If lastOpBeforeRun was not committed and the writeConcern is majority, then we will wait for this OpTime to be committed."

I agree that is true today. If we did not wait for write concern when both (1) lastOpAfterRun == lastOpBeforeRun and (2) we did not take a global write lock, then this would no longer be the case.

Comment by Tess Avitabile (Inactive) [ 09/May/19 ]

judah.schvimer, I think your proposal is how the code works today.

"This means that if lastOpAfterRun == lastOpBeforeRun and we did not take a global write lock at all we would not wait for write concern."

I don't think this is true. If lastOpBeforeRun was not committed and the writeConcern is majority, then we will wait for this OpTime to be committed. I think we would need to take additional action to skip the wait for writeConcern if lastOpAfterRun == lastOpBeforeRun.

Comment by Judah Schvimer [ 09/May/19 ]

The proposal is that we would do the following:

  1. get the client's lastOp before running the command: lastOpBeforeRun
  2. run the command
  3. get the client's lastOp after running the command: lastOpAfterRun
  4. if lastOpAfterRun != lastOpBeforeRun, wait for write concern on lastOpAfterRun
  5. if lastOpAfterRun == lastOpBeforeRun and a global write lock was taken, bump forward the client's lastOp to the system's last optime and wait for write concern on the client's new lastOp

This means that if lastOpAfterRun == lastOpBeforeRun and we did not take a global write lock at all we would not wait for write concern.

One concern here is the case where the system's last optime is equal to lastOpBeforeRun and we want to wait for write concern on a successful retryable write. An addendum to the above would be to track when we've explicitly set the client's lastOp in the current operation and wait for write concern no matter what in that case.

Alternatively, we could just not wait for write concern if lastOpAfterRun == lastOpBeforeRun, the global write lock was not taken, and lastOpAfterRun's term doesn't match the current term.
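
A rough standalone sketch of that flow (names are hypothetical, simplified to the decision logic only):

    #include <functional>
    #include <iostream>

    struct OpTime {
        long long term;
        long long ts;
        bool operator==(const OpTime& other) const {
            return term == other.term && ts == other.ts;
        }
    };

    struct Client { OpTime lastOp; };
    struct ReplState {
        OpTime systemLastOpTime;
        bool tookGlobalWriteLock;
    };

    void waitForWriteConcernOn(const OpTime& op) {
        std::cout << "waiting for write concern on {" << op.term << ", " << op.ts << "}\n";
    }

    void runCommandAndWait(Client& client, ReplState& repl,
                           const std::function<void(Client&, ReplState&)>& runCommand) {
        OpTime lastOpBeforeRun = client.lastOp;           // step 1
        runCommand(client, repl);                         // step 2
        OpTime lastOpAfterRun = client.lastOp;            // step 3

        if (!(lastOpAfterRun == lastOpBeforeRun)) {       // step 4: a write happened
            waitForWriteConcernOn(lastOpAfterRun);
        } else if (repl.tookGlobalWriteLock) {            // step 5: a no-op write
            client.lastOp = repl.systemLastOpTime;        // bump to the system's last optime
            waitForWriteConcernOn(client.lastOp);
        }
        // Otherwise no write was attempted at all, so we skip the wait entirely.
    }

    int main() {
        Client client{OpTime{4, 50}};
        ReplState repl{OpTime{5, 200}, false};

        // A command that fails to parse: no write, no global write lock, no waiting.
        runCommandAndWait(client, repl, [](Client&, ReplState&) {});

        // A no-op update: it took the global write lock but left the client opTime alone.
        runCommandAndWait(client, repl,
                          [](Client&, ReplState& r) { r.tookGlobalWriteLock = true; });
        return 0;
    }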

Comment by Tess Avitabile (Inactive) [ 09/May/19 ]

I'm confused about the proposed solution. My understanding is that we would only wait for writeConcern if the command took a write lock, and that we would have retryable writes push the client OpTime forward when retrying a successful operation. However, I don't see how this would ensure we wait for writeConcern on retryable writes when retrying a successful operation, since they do not take write locks. judah.schvimer, can you clarify this?

Comment by Charlie Swanson [ 08/May/19 ]

This ticket may have just become a priority for the query team. As part of adding $merge, the drivers team is interested to see if we can update the server's contract so that all aggregates accept both readConcern and writeConcern. This will allow both server and drivers to remove some special-casing for the $out stage, and avoid adding some new special-casing for $merge. If we allow every aggregate to accept a writeConcern, we don't want the read-only aggregates to wait for the previous operation's write concern.

Based on this ticket it sounds like there are no known problems with implementing this, and the code changes look relatively straightforward. Would anyone mind if the query team took this on and sent it to replication for review? Please let me know if you foresee any problems.

cc tess.avitabile, jeff.yemin and behackett.

Comment by Andy Schwerin [ 08/Mar/18 ]

judah.schvimer, could you clarify the description and summary a little? It fooled milkie and me.

Comment by Eric Milkie [ 07/Mar/18 ]

SGTM

Comment by Judah Schvimer [ 07/Mar/18 ]

The GlobalLockAcquisitionTracker added in SERVER-27067 still tracks when we attempt to take a global lock, which shows when we attempted to do a write. SERVER-27067 assumes that no-op writes always attempt to take a global lock. Things like retryable writes that break that contract need to set the client optime themselves.
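
For illustration, the tracking idea looks roughly like this (toy stand-ins, not the real GlobalLockAcquisitionTracker API):

    #include <iostream>

    // Hypothetical stand-in: records that a global lock acquisition was attempted,
    // even if the resulting write was a no-op, so the write-concern wait still runs.
    struct GlobalLockAcquisitionTracker {
        bool attempted = false;
        void noteAttempt() { attempted = true; }
    };

    // RAII-style guard modelling "we tried to take the global lock".
    struct GlobalWriteLockGuard {
        explicit GlobalWriteLockGuard(GlobalLockAcquisitionTracker& t) { t.noteAttempt(); }
    };

    int main() {
        GlobalLockAcquisitionTracker tracker;
        {
            GlobalWriteLockGuard lock(tracker);  // even a no-op write attempts the lock
        }
        // The write-concern decision can later consult tracker.attempted. A retryable
        // write that never tries the lock has to set the client opTime itself instead.
        std::cout << std::boolalpha << tracker.attempted << "\n";  // prints: true
        return 0;
    }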

Comment by Eric Milkie [ 07/Mar/18 ]

How are you going to tell the difference between "a parse error that should do no waiting" and "a no-op write that should wait"? For example, a non-parse error for which we still need to wait would be a duplicate key error; we need to wait in order to ensure the client can do a subsequent read and see the data that caused the duplicate key error.
