[SERVER-49691] Change streams may be subject to spurious "CappedPositionLost" when resuming Created: 17/Jul/20 Updated: 27/Oct/23 Resolved: 27/Aug/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework, Querying |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Charlie Swanson | Assignee: | Bernard Gorman |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Operating System: | ALL |
| Sprint: | Query 2020-08-24, Query 2020-09-07 |
| Description |
|
Our testing infrastructure uncovered a rare case where this might happen, detailed in During |
| Comments |
| Comment by Charlie Swanson [ 27/Aug/20 ] |
|
Sorry this dropped off my radar! I suspect this may still be a problem. If I remember correctly, the problem wasn't so much that we were doing a separate sub-pipeline. It was just that we could establish a cursor, yield, and then the collection would be truncated. All of this before we examine the first record. I don't see anything about |
| Comment by Bernard Gorman [ 27/Aug/20 ] |
|
charlie.swanson: now that we've pushed |
| Comment by Bernard Gorman [ 17/Jul/20 ] |
|
charlie.swanson: ah, I see - thanks for the clarification! I think the scenario you're concerned about will be indirectly addressed by |
| Comment by Charlie Swanson [ 17/Jul/20 ] |
|
Sorry, I think I should be more specific, bernard.gorman. I was thinking of this code. Here we are simply trying to determine whether there's enough oplog history, but we might get this spurious error. If I'm not mistaken, that call to getNext() may just throw the CappedPositionLost error. |
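To make the concern above concrete, here is a minimal toy model in Python. It is not the server code: `ToyOplog`, `ToyCursor`, `has_enough_history`, and the `between_open_and_read` hook are all hypothetical names, and "enough history" is assumed to mean that the oldest oplog entry is at or before the resume timestamp. The sketch only shows how a history check that opens a cursor, yields, and then reads the first record could fail even though the oplog still covers the resume point.

```python
from dataclasses import dataclass, field


class CappedPositionLost(Exception):
    """Toy stand-in for the server error of the same name."""


@dataclass
class ToyOplog:
    """A capped, timestamp-ordered log; the oldest entries are dropped first."""
    timestamps: list = field(default_factory=list)

    def truncate_oldest(self, count):
        # Models the capped collection rolling over.
        del self.timestamps[:count]


class ToyCursor:
    """Forward cursor that remembers the timestamp it is positioned on."""

    def __init__(self, oplog):
        self.oplog = oplog
        # Establishing the cursor positions it on the oldest entry,
        # but does not return that entry yet.
        self.saved_ts = oplog.timestamps[0]

    def get_next(self):
        # After a yield, the cursor must find its saved position again.
        if self.saved_ts not in self.oplog.timestamps:
            raise CappedPositionLost(f"entry {self.saved_ts} was truncated")
        return self.saved_ts


def has_enough_history(oplog, resume_ts, between_open_and_read=lambda: None):
    """Assumed check: the oldest oplog entry is at or before resume_ts."""
    cursor = ToyCursor(oplog)
    between_open_and_read()  # models a yield before the first get_next()
    return cursor.get_next() <= resume_ts


oplog = ToyOplog([10, 20, 30, 40, 50])
resume_ts = 40

# Without a yield, the check simply succeeds.
no_yield_result = has_enough_history(oplog, resume_ts)

# With a truncation during the yield, the same check raises.
try:
    has_enough_history(oplog, resume_ts, lambda: oplog.truncate_oldest(2))
    outcome = "ok"
except CappedPositionLost:
    outcome = "CappedPositionLost"

# The oplog now starts at 30, which still covers resume_ts = 40.
still_covered = oplog.timestamps[0] <= resume_ts
```

In this model the error is spurious in the sense described above: the entries at 10 and 20 were dropped while the cursor was parked on them, yet the remaining oplog still reaches back before the resume timestamp.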
| Comment by Bernard Gorman [ 17/Jul/20 ] |
|
charlie.swanson: I don't think attempting to resume after this error is ever appropriate. When resuming, a change stream will use the minTs mechanism to skip directly to the timestamp of the resume token in the oplog (or the point immediately before it, if startAtOperationTime is used and no event with the specified timestamp exists). If we yield and the oplog rolls over this point to produce a CappedPositionLost exception, it implies that the stream has genuinely become unresumable. I think option (1) is the only improvement we could make here, to avoid the situation where we successfully establish the cursor but then yield and roll over instead of scanning away from the start of the oplog. But we would need to account for the possibility that the oplog contains no events (or very infrequent events) after the resume point, since this could result in a very lengthy scan that never yields. |
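The argument above can be sketched with a small Python model. This is an illustration under stated assumptions, not the server's implementation: `seek_resume_index` and `restore_position` are hypothetical helpers, the oplog is a plain sorted list of integer timestamps, and a `LookupError` stands in for whatever the server reports when the resume point is no longer present. The seek follows the behaviour described in the comment: land on the entry with the resume timestamp, or on the entry immediately before it when no entry has that exact timestamp.

```python
import bisect


class CappedPositionLost(Exception):
    """Toy stand-in for the server error of the same name."""


def seek_resume_index(timestamps, min_ts):
    """Index at which a resumed scan would start in a sorted timestamp list."""
    i = bisect.bisect_left(timestamps, min_ts)
    if i < len(timestamps) and timestamps[i] == min_ts:
        return i  # exact match on the resume timestamp
    if i == 0:
        # Nothing at or before min_ts remains in the log.
        raise LookupError("resume point precedes the oldest entry")
    return i - 1  # the point immediately before min_ts


def restore_position(timestamps, saved_ts):
    """Re-find a saved cursor position after a yield."""
    if not timestamps or saved_ts < timestamps[0]:
        raise CappedPositionLost(f"entry {saved_ts} was truncated")
    return timestamps.index(saved_ts)


oplog = [10, 20, 30, 40, 50]

exact = seek_resume_index(oplog, 30)   # an entry with ts 30 exists
before = seek_resume_index(oplog, 35)  # no entry at 35, so start just before it

# The cursor is parked on the resume point, then the oplog rolls over past it.
saved_ts = oplog[exact]
del oplog[:3]

try:
    restore_position(oplog, saved_ts)
    lost = False
except CappedPositionLost:
    lost = True

# A fresh seek for the same resume point fails as well.
try:
    seek_resume_index(oplog, 30)
    unresumable = False
except LookupError:
    unresumable = True
```

In this model, losing the position of a cursor that was seeked directly to the resume point coincides with the resume point itself being gone, which matches the reasoning that the stream has genuinely become unresumable in that case.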