[SERVER-48974] Index build can crash with CappedPositionLost error Created: 18/Jun/20  Updated: 06/Dec/22  Resolved: 03/May/21

Status: Closed
Project: Core Server
Component/s: Index Maintenance
Affects Version/s: 4.5.1, 4.4.0-rc10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Louis Williams Assignee: Backlog - Storage Execution Team
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
duplicates SERVER-56062 Restart index builds after CappedPosi... Closed
Assigned Teams:
Storage Execution
Operating System: ALL
Sprint: Execution Team 2021-07-26
Participants:
Linked BF Score: 18

 Description   

This is a very unlikely bug that has only been reproduced by building an index on a tiny capped collection (maxSize=1) with a high number of concurrent inserts.

Update: This was also observed in SERVER-56062 on collections that were not trivially small.

If an index build collection scan recovers from yielding and can't restore its cursor because the saved position was deleted, then an index build will crash at this invariant with a CappedPositionLost error.

Example:

Invariant failure","attr":{"expr":"status.isA<ErrorCategory::Interruption>() || status.isA<ErrorCategory::ShutdownError>()","msg":"Unnexpected error code during index build cleanup: CappedPositionLost: CollectionScan died due to position in capped collection being deleted. Last seen record id: RecordId(1)
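To illustrate the failure mode described above, here is a minimal toy model in Python. It is not the server's C++ code: the `CappedCollection` and `ScanCursor` classes and their methods are hypothetical names invented for this sketch, and the only assumption is that a capped collection evicts its oldest record when it is full.

```python
from collections import OrderedDict


class CappedPositionLost(Exception):
    """Raised when a saved cursor position no longer exists."""


class CappedCollection:
    """Toy capped collection (hypothetical): keeps only the newest max_docs records."""

    def __init__(self, max_docs):
        self.max_docs = max_docs
        self.records = OrderedDict()  # record id -> document, oldest first
        self.next_id = 1

    def insert(self, doc):
        self.records[self.next_id] = doc
        self.next_id += 1
        while len(self.records) > self.max_docs:
            self.records.popitem(last=False)  # evict the oldest record


class ScanCursor:
    """Toy forward cursor (hypothetical) that restores its position after a yield."""

    def __init__(self, coll):
        self.coll = coll
        self.last_seen = None

    def next(self):
        # Return the first record after the last one we saw, if any.
        for rid, doc in self.coll.records.items():
            if self.last_seen is None or rid > self.last_seen:
                self.last_seen = rid
                return rid, doc
        return None

    def restore(self):
        # After a yield, the record we stopped on must still exist.
        if self.last_seen is not None and self.last_seen not in self.coll.records:
            raise CappedPositionLost(f"Last seen record id: {self.last_seen}")


coll = CappedCollection(max_docs=1)
coll.insert({"x": 0})
cursor = ScanCursor(coll)
cursor.next()           # the scan reads record id 1, then yields
coll.insert({"x": 1})   # a concurrent insert evicts record id 1
try:
    cursor.restore()
    outcome = "restored"
except CappedPositionLost as exc:
    outcome = str(exc)
```

With a capacity of one document, a single insert during the yield is enough to delete the saved position, which matches the tiny-collection reproduction described here.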

It wouldn't be a complete solution to just abort the index build, because a secondary could hit this error independently of a primary and still crash.

I think we can safely restart the collection scan if we hit a CappedPositionLost error. While this poses a liveness issue, I think the circumstances of hitting this bug are extreme enough to warrant this solution.



 Comments   
Comment by Gregory Wlodarek [ 03/May/21 ]

Marking this as a duplicate of SERVER-56062. SERVER-56062 restarts the collection scan phase when it encounters CappedPositionLost.

Comment by Louis Williams [ 18/Jun/20 ]

Assuming we agree on the solution, I think this would involve moving this call to initiateBulk inside insertAllDocumentsInCollection and then wrapping that in a retry if we hit a CappedPositionLost exception.
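
A rough sketch of that shape, in Python rather than the server's C++. The function below only borrows the name insertAllDocumentsInCollection for readability; `scan_factory`, `make_flaky_scan`, and the `max_restarts` bound are assumptions made for this example and do not appear in the ticket, which accepts the liveness risk without specifying a bound. The point it shows is that the bulk builder state is created inside the retry, so keys collected by a failed partial scan are discarded before the rescan.

```python
class CappedPositionLost(Exception):
    """Stand-in for the server error of the same name."""


def insert_all_documents_in_collection(scan_factory, max_restarts=3):
    """Scan every document, restarting from scratch on CappedPositionLost.

    max_restarts is a hypothetical safety bound added for this sketch.
    """
    for attempt in range(max_restarts + 1):
        bulk = []  # stands in for initiateBulk: fresh builder state per attempt
        try:
            for doc in scan_factory():
                bulk.append(doc["key"])
            return sorted(bulk), attempt
        except CappedPositionLost:
            continue  # drop the partial keys and rescan the collection
    raise RuntimeError("collection scan kept losing its position")


def make_flaky_scan(docs, failures):
    """Hypothetical helper: a scan that fails midway the first `failures` times."""
    remaining = [failures]

    def scan():
        fail_now = remaining[0] > 0
        if fail_now:
            remaining[0] -= 1
        for i, doc in enumerate(docs):
            if fail_now and i == 1:
                raise CappedPositionLost()
            yield doc

    return scan


docs = [{"key": 3}, {"key": 1}, {"key": 2}]
keys, restarts = insert_all_documents_in_collection(make_flaky_scan(docs, failures=1))
```

If the builder state were created once outside the loop, the key read before the failure would be inserted a second time on the rescan, which is why the initialization has to move inside the retried section.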
