[SERVER-30842] Don't try to set last optime for client backwards after rollback
| Created: | 25/Aug/17 | Updated: | 30/Oct/23 | Resolved: | 28/Aug/17 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 3.4.10, 3.5.13 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Spencer Brody (Inactive) | Assignee: | Spencer Brody (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v3.4 |
| Sprint: | Repl 2017-09-11 |
| Participants: | |
| Linked BF Score: | 0 |
| Description |

A write that fails may result in a call to setLastOpToSystemLastOpTime. We have an invariant that the last optime on a client never goes backwards, but the system optime can indeed go backwards after a rollback. If that happens we should simply refuse to set the last optime on the client backwards, but we shouldn't crash.
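For illustration, here is a minimal, self-contained C++ sketch of the behavior described above. The `OpTime` struct, the `ClientReplInfo` class, and the log text are simplified stand-ins chosen for this example, not the actual server types or the real setLastOpToSystemLastOpTime implementation:

```cpp
#include <cstdint>
#include <iostream>

// Simplified stand-in for a replication optime (term + timestamp).
struct OpTime {
    int64_t term;
    int64_t timestamp;
};

bool operator<(const OpTime& a, const OpTime& b) {
    return a.term < b.term || (a.term == b.term && a.timestamp < b.timestamp);
}

// Hypothetical per-client replication state tracking the last optime
// reported to that client.
class ClientReplInfo {
public:
    // After a rollback the system's last optime can be earlier than what this
    // client has already seen. Rather than asserting (and crashing the server),
    // log the situation and keep the client's existing, newer value.
    void setLastOpToSystemLastOpTime(const OpTime& systemOpTime) {
        if (systemOpTime < _lastOp) {
            std::cerr << "Not setting the client's last optime backwards: client at ("
                      << _lastOp.term << ", " << _lastOp.timestamp << "), system at ("
                      << systemOpTime.term << ", " << systemOpTime.timestamp << ")\n";
            return;  // refuse to move backwards, but do not crash
        }
        _lastOp = systemOpTime;
    }

    OpTime lastOp() const { return _lastOp; }

private:
    OpTime _lastOp{0, 0};
};

int main() {
    ClientReplInfo client;
    client.setLastOpToSystemLastOpTime({2, 100});  // normal forward progress
    client.setLastOpToSystemLastOpTime({1, 500});  // system rolled back: ignored
    std::cout << "client last optime: term=" << client.lastOp().term
              << " ts=" << client.lastOp().timestamp << "\n";
    return 0;
}
```

The design choice is simply to downgrade the failed invariant into a logged no-op: a system optime that is behind a client's optime after rollback is an expected condition, not a programming error.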
| Comments |
| Comment by Scott Glajch [ 21/Jun/18 ] |

I am referencing that log line. Given your explanation, it's really odd that we are seeing these messages at all, much less this often. We actually run two Mongo clusters: one for our main storage (where we store data in a number of ways), and one for our raw events, where we have only one type of collection, essentially one document per raw event. The raw event cluster is the one we have migrated to 3.6 and the one having this problem (we migrated it first because it is less mission critical, so we can sniff out upgrade problems there first). Our writes are therefore only written once and never updated (so no $sets there), and then occasionally read, either individually or in aggregations for some simple reporting. The collection uses ObjectIDs for its _id, so I doubt DuplicateKeyErrors are happening. However, I have found a pattern in the logs. These log lines are coming from the primary shard for a replica set, and right before these log lines happen, I see something like this:

Maybe config.system.sessions not being sharded is a problem? Or maybe the index builds on these config.cache.chunks collections do writes on "lastmod" that aren't unique?
| Comment by Spencer Brody (Inactive) [ 20/Jun/18 ] |

I assume by "the log line" you are referring to this message?
| Comment by Scott Glajch [ 20/Jun/18 ] |

Here is a screenshot showing how often these messages appeared during the 3.4 rolling upgrade (mostly after the stepdowns on the shard nodes). They still happen from time to time after the upgrade, though they usually peak at a few hundred per minute rather than 100K.
| Comment by Scott Glajch [ 20/Jun/18 ] |

This fix has caused the log line to be logged very frequently on our systems after upgrading from 3.4.9 to 3.6.5, to the point where I'd consider doing a custom build to remove the log line or raise the level at which it is logged. At one point it made it hard for our log parser to keep up with the volume of lines being logged. Is there any plan to lessen the extent of this? If I'm seeing this message a lot, is it worrying?
| Comment by Githook User [ 05/Oct/17 ] |

Author: {'email': 'spencer@mongodb.com', 'name': 'Spencer T Brody', 'username': 'stbrody'}
Message: (cherry picked from commit cee116558d7771b520c0fd4dd239c0a987dcc2ec)
| Comment by Githook User [ 28/Aug/17 ] |

Author: {'username': 'stbrody', 'name': 'Spencer T Brody', 'email': 'spencer@mongodb.com'}
Message: