[SERVER-49691] Change streams may be subject to spurious "CappedPositionLost" when resuming Created: 17/Jul/20  Updated: 27/Oct/23  Resolved: 27/Aug/20

Status: Closed
Project: Core Server
Component/s: Aggregation Framework, Querying
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Charlie Swanson Assignee: Bernard Gorman
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File wip_plumb_yield_policy.patch    
Issue Links:
Related
related to SERVER-49690 Retry "CappedPositionLost" when searc... Closed
is related to SERVER-48523 Unconditionally check the first entry... Closed
Operating System: ALL
Sprint: Query 2020-08-24, Query 2020-09-07
Participants:

 Description   

Our testing infrastructure uncovered a rare case where this might happen, detailed in SERVER-49690. When a change stream is resuming, it may encounter this error. As far as I know this has never been observed, but I see no reason it couldn't happen. I would recommend looking into whether we can reproduce this. If so, I think we should do one of the following:
1) Disabling yielding when doing the oplog check upon resume during the change stream
2) Adding a similar retry loop within the change stream
3) Ensuring drivers will retry this error

During SERVER-49690 I looked into option #1 but the patch quickly exploded. I'll attach my WIP but it certainly won't compile and doesn't plumb the yield policy far enough to fix the issue.



 Comments   
Comment by Charlie Swanson [ 27/Aug/20 ]

Sorry this dropped off my radar! I suspect this may still be a problem. If I remember correctly, the problem wasn't so much that we were doing a separate sub-pipeline. It was just that we could establish a cursor, yield, and then the collection would be truncated. All of this before we examine the first record. I don't see anything about SERVER-48523 that changes this dynamic? As long as yielding is around, I think this may always be a problem since the collection may be truncated during a yield? Thinking more, maybe the most practical solution is to have our drivers retry this error if it is encountered during a resume (since I don't think it's fatal). It might be nice for us to do this internally so our tests don't have to worry about it though.

Comment by Bernard Gorman [ 27/Aug/20 ]

charlie.swanson: now that we've pushed SERVER-48523, which removes the separate oplog check that you highlighted above, I'm going to close this as "Gone Away."

Comment by Bernard Gorman [ 17/Jul/20 ]

charlie.swanson: ah, I see - thanks for the clarification! I think the scenario you're concerned about will be indirectly addressed by SERVER-48523, by making the oplog check part of the actual COLLSCAN rather than a separate operation. Does that make sense?

Comment by Charlie Swanson [ 17/Jul/20 ]

Sorry I think I should be more specific bernard.gorman. I was thinking of this code. Here we are simply trying to determine if there's enough oplog history, but we might get this spurious error. If I'm not mistaken, that call to getNext() may just throw the CappedPositionLost error.

Comment by Bernard Gorman [ 17/Jul/20 ]

charlie.swanson: I don't think attempting to resume after this error is ever appropriate. When resuming, a change stream will use the minTs mechanism to skip directly to the timestamp of the resume token in the oplog (or the point immediately before it, if startAtOperationTime is used and no event with the specified timestamp exists). If we yield and the oplog rolls over this point to produce a CappedPositionLost exception, it implies that the stream has genuinely become unresumable.

I think option (1) is the only improvement we could make here, to avoid the situation where we successfully establish the cursor but then yield and roll over instead of scanning away from the start of the oplog. But we would need to account for the possibility that the oplog contains no events (or very infrequent events) after the resume point, since this could result in a very lengthy scan that never yields.

Generated at Thu Feb 08 05:20:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.