[SERVER-34326] Global snapshot reads fail with SnapshotTooOld error Created: 04/Apr/18  Updated: 29/Oct/23  Resolved: 13/Apr/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 3.7.4

Type: Bug Priority: Major - P3
Reporter: Misha Tyulenev Assignee: Misha Tyulenev
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2018-04-23
Participants:
Linked BF Score: 62

 Description   

Global snapshot find and aggregate commands intermittently fail with a SnapshotTooOld error even with retries. This indicates that the snapshot window is too short.

        assert.soon(() => {
            const res = sessionDb.runCommand(cmdObj);
            if (!res.ok) {
                assert(res.code === ErrorCodes.SnapshotTooOld ||
                           res.code === ErrorCodes.TransactionAborted,
                       "expected command to fail with SnapshotTooOld or TransactionAborted, cmd: " +
                           tojson(cmdObj) + ", result: " + tojson(res));
                print("Retrying because of SnapshotTooOld or TransactionAborted error.");
                txnNumber++;
                cmdObj.txnNumber = NumberLong(txnNumber);
                return false;
            }
 
            assert.commandWorked(res, "expected command to succeed, cmd: " + tojson(cmdObj));
            return true;
        });

This is a test case where cmd is any mongos find or aggregate command with readConcern:

{snapshot: true}

 Comments   
Comment by Githook User [ 13/Apr/18 ]

Author:

{'email': 'misha@mongodb.com', 'name': 'Misha Tyulenev', 'username': 'mikety'}

Message: SERVER-34326 follow up: fix lint errors
Branch: master
https://github.com/mongodb/mongo/commit/8c59201055adc886541c42b53e72a8b70963ec4a

Comment by Githook User [ 12/Apr/18 ]

Author:

{'email': 'misha@mongodb.com', 'name': 'Misha Tyulenev', 'username': 'mikety'}

Message: SERVER-34326 use highest cluserTime for global snapshot reads
Branch: master
https://github.com/mongodb/mongo/commit/8246ce54572d8af086162a88e7bf54449801a2d9

Comment by Misha Tyulenev [ 10/Apr/18 ]

After offline discussion, I will change the logic in SnapshotUnavailable retries to use the latest known cluster time.
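
The retry change described in this comment might be sketched roughly as below. This is only an illustration, not the actual mongos code: timestamps are modeled as plain `{t, i}` objects (seconds and increment), and `compareTimestamps`, `highestClusterTime`, and `nextAtClusterTime` are hypothetical helper names introduced here. The idea is that a retry reads at the highest cluster time known to the router rather than at the lastCommittedOpTime returned with the reject, which the thread notes is always behind.

```javascript
// Hypothetical sketch: timestamps are modeled as {t: seconds, i: increment}.
// None of these helpers are real server or shell APIs.

// Orders two timestamps: returns -1, 0, or 1.
function compareTimestamps(a, b) {
    if (a.t !== b.t) {
        return a.t < b.t ? -1 : 1;
    }
    if (a.i !== b.i) {
        return a.i < b.i ? -1 : 1;
    }
    return 0;
}

// Returns the largest timestamp in a non-empty array.
function highestClusterTime(known) {
    return known.reduce((max, ts) => (compareTimestamps(ts, max) > 0 ? ts : max));
}

// Picks the read timestamp for a retry: the highest of everything the router
// has seen, so the retry is not pinned to a value that is already behind.
function nextAtClusterTime(rejectedAt, lastCommittedOpTime, latestClusterTime) {
    return highestClusterTime([rejectedAt, lastCommittedOpTime, latestClusterTime]);
}

const retryAt = nextAtClusterTime({t: 100, i: 1}, {t: 100, i: 5}, {t: 101, i: 2});
console.log(JSON.stringify(retryAt));
```

In this sketch the retry timestamp never moves backwards relative to any input, which is the property the discussion asks for.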

Comment by Eric Milkie [ 05/Apr/18 ]

You can't use the actual lastCommittedOpTime returned with a reject for the next atClusterTime attempt. You have to adjust the timestamp further into the future.

Comment by Misha Tyulenev [ 05/Apr/18 ]

I think so. The atClusterTime in a request generated by mongos is set from the lastCommittedOpTime returned with a reject, but it's always behind. It looks like a noop write happens that moves the snapshot forward.

Comment by Eric Milkie [ 05/Apr/18 ]

How are you selecting a new timestamp to read at when you retry? Are you pushing it far enough into the future?

Generated at Thu Feb 08 04:36:18 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.