[SERVER-33727] Do not wait for write concern if opTime didn't change during write Created: 07/Mar/18 Updated: 29/Oct/23 Resolved: 23/May/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.12 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Judah Schvimer | Assignee: | James Wahlin |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | neweng |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Query 2019-06-03 |
| Participants: | |
| Linked BF Score: | 50 |
| Description |
|
If a command accepts a write concern, it will always call waitForWriteConcern, regardless of whether or how it fails; the only question is which opTime it waits on. If no write was attempted (and consequently the GlobalLockAcquisitionTracker did not move the client opTime forward), we simply wait for write concern on the client opTime from before the command began. That opTime may be null, in which case we do no waiting, or may already be committed, in which case we also do no waiting: https://github.com/mongodb/mongo/blob/0bc90017e2563ed7432d56051483763c91d28025/src/mongo/db/service_entry_point_mongod.cpp#L77-L88 There is an assumption here that if we never attempt to take the GlobalLock, we should not wait for write concern at all. This is not always true, for example with retryable writes retrying successful operations (see the linked issue). |
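For illustration, here is a minimal, self-contained sketch of the decision described above. The OpTime struct and opTimeToWaitOn function are simplified stand-ins invented for this example (the server's real types are repl::OpTime and the GlobalLockAcquisitionTracker); the actual logic lives in the service_entry_point_mongod.cpp lines linked above.

```cpp
#include <cstdint>
#include <optional>

// Simplified stand-in for repl::OpTime (real comparisons also use the term).
struct OpTime {
    int64_t term = -1;
    uint64_t timestamp = 0;  // 0 models a null opTime
    bool isNull() const { return timestamp == 0; }
    bool operator==(const OpTime& other) const {
        return term == other.term && timestamp == other.timestamp;
    }
};

// Returns the opTime that waitForWriteConcern would block on, or nullopt
// when the wait is a no-op (null opTime, or one already majority committed).
std::optional<OpTime> opTimeToWaitOn(const OpTime& lastOpBeforeRun,
                                     const OpTime& lastOpAfterRun,
                                     bool tookGlobalWriteLock,
                                     const OpTime& majorityCommitted) {
    // If the command never took the global write lock, assume it attempted
    // no write and fall back to the client opTime from before the command
    // ran -- the assumption this ticket calls out as not always true.
    const OpTime& waitOpTime =
        tookGlobalWriteLock ? lastOpAfterRun : lastOpBeforeRun;

    if (waitOpTime.isNull())
        return std::nullopt;  // null opTime: nothing to wait for
    if (waitOpTime.timestamp <= majorityCommitted.timestamp)
        return std::nullopt;  // already committed: returns immediately
    return waitOpTime;
}
```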
| Comments |
| Comment by Githook User [ 23/May/19 ] |
|
Author: {'email': 'james@mongodb.com', 'name': 'James Wahlin', 'username': 'jameswahlin'}
Message: |
| Comment by Charlie Swanson [ 17/May/19 ] |
|
Amazing! Thank you so much, everyone above! james.wahlin, see above for some great context here. Also happy to catch up in person. It looks like you have time to slot this into the upcoming sprint, but let me know if you don't. |
| Comment by A. Jesse Jiryu Davis [ 16/May/19 ] |
|
Acknowledged, Andy, I retract my objection. =) |
| Comment by Tess Avitabile (Inactive) [ 16/May/19 ] |
|
Thanks, jesse and shane.harvey. charlie.swanson, I think this work is now sufficiently well-defined that the Query team can take it on. My only recommendation is to avoid changing the behavior of getLastError, if possible. |
| Comment by Andy Schwerin [ 16/May/19 ] |
|
I believe that step 5 of Judah's proposal (if a global write lock was acquired, still wait for replication) makes the scenario described by Jesse and Shane above safe. Client A does need the guarantee that the acknowledgement in example step 3 above means the database state won't roll back. |
| Comment by A. Jesse Jiryu Davis [ 16/May/19 ] |
|
I talked this over with shane.harvey. On the one hand, we have no specific knowledge as Driver Team people that makes us say this is a bad idea. But we did think of a scenario that makes us concerned about this change. Years ago we knew of customers who would stream unacknowledged writes followed by a GLE, like Andy describes. Mongos knowingly broke that when it changed some connection pool logic, and drivers broke it similarly soon after. (Java and Python removed "thread affinity" from their pools.) So we think drivers don't support this technique anymore and applications have adapted. Still, it seems like Judah's proposed change might not be a good idea. Consider:

1. Client A sends an update setting x to 1, with writeConcern w:majority.
2. Client B concurrently sends an update setting x to 1.
3. The server acknowledges Client A's update, which may have been a no-op if Client B's update won the race.

Today, Client A knows that x was set to 1 and won't be rolled back, no matter whether Client A or Client B won the race. But it seems to us that after the proposed change, Client A loses this guarantee. Users would lose this simple technique to do an update and know the durable outcome. |
| Comment by Andy Schwerin [ 13/May/19 ] |
|
Well, once upon a time, it mattered. I don't know if it still should. The use case was that clients would send a sequence of fire-and-forget writes and then do gLE at the end. Since we hung up on connections on step-down, a successful gLE would guarantee that all prior writes had committed. I don't know that we promised that the combined write+gle of using a write command counted in the same way. I also don't know if we should preserve that behavior even if we used to leverage it. It hasn't been a guarantee preserved by sharding for at least 4 years, and I suspect that it wasn't correctly preserved before then. jesse, could you think about this for 10 minutes and weigh in? |
| Comment by Tess Avitabile (Inactive) [ 13/May/19 ] |
|
That's a good point, we will need to make sure this change does not affect the behavior of getLastError. However, I was particularly asking whether we have any implicit guarantee that a successful majority writeConcern implies that all previous operations performed by the client are majority committed. That is, supposing aggregate now always accepts a writeConcern, should a successful aggregate with writeConcern w:majority (that did not attempt to write) imply that all previous operations performed by the client are majority committed? |
| Comment by Andy Schwerin [ 09/May/19 ] |
|
Ugh, this is really hard to answer. The getLastError command is still supported and may not take a write lock? Even if it doesn't, I'm not sure if all commands that choose to not write do so having already taken a write lock. |
| Comment by Tess Avitabile (Inactive) [ 09/May/19 ] |
|
Got it, thanks. I think this will be okay. My only concern is whether we have any implicit guarantee that a successful majority writeConcern implies that all previous operations performed by the client are majority committed. We might make that implicit guarantee in order to provide feature parity with getLastError. schwerin, do you believe we are making this implicit guarantee today? |
| Comment by Judah Schvimer [ 09/May/19 ] |
|
"I think your proposal is how the code works today."

My proposal is slightly different from what we do today. Today we wait for write concern no matter what. In my proposal, if both (1) lastOpAfterRun == lastOpBeforeRun and (2) we did not take a global write lock, then we would not wait for write concern.

"If lastOpBeforeRun was not committed and the writeConcern is majority, then we will wait for this OpTime to be committed."

Today I agree. If we did not wait for write concern when both (1) lastOpAfterRun == lastOpBeforeRun and (2) we did not take a global write lock, then this would no longer be the case. |
| Comment by Tess Avitabile (Inactive) [ 09/May/19 ] |
|
judah.schvimer, I think your proposal is how the code works today.

"This means that if lastOpAfterRun == lastOpBeforeRun and we did not take a global write lock at all we would not wait for write concern."

I don't think this is true. If lastOpBeforeRun was not committed and the writeConcern is majority, then we will wait for this OpTime to be committed. I think we would need to take additional action to skip the wait for writeConcern if lastOpAfterRun == lastOpBeforeRun. |
| Comment by Judah Schvimer [ 09/May/19 ] |
|
The proposal is that we would do the following:

1. Record the client's last opTime before running the command (lastOpBeforeRun).
2. Run the command.
3. Record the client's last opTime after running the command (lastOpAfterRun).
4. If lastOpAfterRun != lastOpBeforeRun, wait for write concern on lastOpAfterRun.
5. If a global write lock was acquired at any point during the command, wait for write concern even if the opTime did not change.

This means that if lastOpAfterRun == lastOpBeforeRun and we did not take a global write lock at all, we would not wait for write concern. One concern here is if the system's last optime is equal to lastOpBeforeRun and we want to wait for write concern on a successful retryable write. An addendum to the above would be to track when we've set the client's lastOp explicitly in the current operation, and wait for write concern no matter what in that case. Alternatively, we could just not wait for write concern if lastOpAfterRun == lastOpBeforeRun, the global write lock was not taken, and lastOpAfterRun's term doesn't match the current term. |
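A minimal sketch of the proposed rule, including the addendum, reusing the illustrative OpTime model from the description above. shouldWaitForWriteConcern and clientLastOpSetExplicitly are names invented for this example, not server APIs.

```cpp
// Hedged sketch of the proposal above; all names are illustrative.
bool shouldWaitForWriteConcern(const OpTime& lastOpBeforeRun,
                               const OpTime& lastOpAfterRun,
                               bool tookGlobalWriteLock,
                               bool clientLastOpSetExplicitly) {
    // Step 5: if a global write lock was acquired at any point, still wait,
    // even when the opTime did not move (e.g. a no-op update).
    if (tookGlobalWriteLock)
        return true;

    // Addendum: a retryable write retrying an already-successful operation
    // sets the client's lastOp explicitly, so it must wait even though the
    // resulting opTime may equal lastOpBeforeRun.
    if (clientLastOpSetExplicitly)
        return true;

    // Otherwise skip the wait exactly when the opTime did not change.
    return !(lastOpAfterRun == lastOpBeforeRun);
}
```

Note that a no-op write which nonetheless acquired the global write lock still waits under this sketch, which is the property Andy cites above as keeping the racing-clients scenario safe.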
| Comment by Tess Avitabile (Inactive) [ 09/May/19 ] |
|
I'm confused about the proposed solution. My understanding is that we would only wait for writeConcern if the command took a write lock, and that we would have retryable writes push the client OpTime forward when retrying a successful operation. However, I don't see how this would ensure we wait for writeConcern on retryable writes when retrying a successful operation, since they do not take write locks. judah.schvimer, can you clarify this? |
| Comment by Charlie Swanson [ 08/May/19 ] |
|
This ticket may have just become a priority for the query team. As part of adding $merge, the drivers team is interested to see if we can update the server's contract so that all aggregates accept both readConcern and writeConcern. This will allow both server and drivers to remove some special-casing for the $out stage, and avoid adding new special-casing for $merge. If we allow every aggregate to accept a writeConcern, we don't want read-only aggregates to wait for the previous operation's write concern. Based on this ticket it sounds like there are no known problems with implementing this, and the code changes look relatively straightforward. Would anyone mind if the query team took this on and sent it to replication for review? Please let me know if you foresee any problems. cc tess.avitabile, jeff.yemin and behackett. |
| Comment by Andy Schwerin [ 08/Mar/18 ] |
|
judah.schvimer, could you clarify the description and summary a little? It fooled milkie and me. |
| Comment by Eric Milkie [ 07/Mar/18 ] |
|
SGTM |
| Comment by Judah Schvimer [ 07/Mar/18 ] |
|
The GlobalLockAcquisitionTracker added in |
| Comment by Eric Milkie [ 07/Mar/18 ] |
|
How are you going to tell the difference between "a parse error that should do no waiting" and "a no-op write that should wait"? For example, a non-parse error for which we still need to wait would be a duplicate key error; we need to wait in order to ensure the client can do a subsequent read and see the data that caused the duplicate key error. |
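One way to frame this distinction, reusing the hypothetical OpTime model and shouldWaitForWriteConcern sketch from the comments above: a parse error fails before any lock acquisition, while a duplicate key error occurs during execution with the global write lock already taken, so the lock-acquisition flag separates the two cases. The values below are invented for the example.

```cpp
#include <cassert>

// Hypothetical walk-through; assumes the OpTime model and
// shouldWaitForWriteConcern sketch defined in the earlier comments.
int main() {
    OpTime before{/*term=*/1, /*timestamp=*/100};
    OpTime after = before;  // the opTime is unchanged in both failure cases

    // Parse error: the command failed before taking any locks -> no wait.
    bool parseErrorWaits = shouldWaitForWriteConcern(
        before, after,
        /*tookGlobalWriteLock=*/false, /*clientLastOpSetExplicitly=*/false);

    // Duplicate key error: the insert took the global write lock before
    // failing, so the client still waits, and a subsequent read can see
    // the document that caused the conflict.
    bool dupKeyWaits = shouldWaitForWriteConcern(
        before, after,
        /*tookGlobalWriteLock=*/true, /*clientLastOpSetExplicitly=*/false);

    assert(!parseErrorWaits);
    assert(dupKeyWaits);
    return 0;
}
```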