[SERVER-85390] Index build can be missing on secondary after stepdown Created: 18/Jan/24  Updated: 29/Jan/24

Status: Open
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Shin Yee Tan Assignee: Backlog - Storage Execution Team
Resolution: Unresolved Votes: 0
Labels: storex-shortlist
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Assigned Teams:
Storage Execution
Operating System: ALL
Participants:
Linked BF Score: 10

 Description   

When oplogApplicationEnforcesSteadyStateConstraints is set to false, if an index build conflict occurs after a node steps down and the existing index build then fails, the conflicting index build is not retried, resulting in an index build missing on a secondary.

This is the order of events that can cause this:

  • Node 1 starts index build A as a primary and the index build is scheduled (making it past this check)
  • Node 1 steps down
  • Node 2 steps up as primary and starts index build B with the same name as index build A
  • Node 1 tries to replicate index build B and runs into an index build conflict with index build A, but continues despite this error
  • Index build A on node 1 fails
  • Index build B is now missing on node 1 and is not retried
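
The sequence above can be replayed with a small toy model. This is only an illustrative sketch, not the server's implementation: `ToyNode`, its methods, the `enforce_steady_state` flag, and the index name `x_1` are all hypothetical names invented for this example. It also assumes that a tolerated conflict simply drops the replicated build, and that enforcing the constraints would surface the conflict as an error instead.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a replica set member; not MongoDB code."""
    name: str
    enforce_steady_state: bool = False
    in_progress: dict = field(default_factory=dict)  # index name -> build id
    ready: set = field(default_factory=set)          # committed index names

    def start_build(self, build_id, index_name):
        """Start a build locally, as a primary would."""
        self.in_progress[index_name] = build_id

    def apply_start_build(self, build_id, index_name):
        """Apply a replicated start-of-build entry as a secondary.

        Returns True if the build was started, False if it was skipped.
        """
        if index_name in self.in_progress:
            if self.enforce_steady_state:
                # Assumption: enforcement turns the conflict into a hard error.
                raise RuntimeError(f"{self.name}: index build conflict on {index_name}")
            # Conflict tolerated: the replicated build is dropped and never retried.
            return False
        self.in_progress[index_name] = build_id
        return True

    def fail_build(self, build_id):
        """Abort a build; nothing is re-queued afterwards."""
        self.in_progress = {n: b for n, b in self.in_progress.items() if b != build_id}

    def commit_build(self, build_id):
        """Finish a build and mark its index as ready."""
        for index_name, current in list(self.in_progress.items()):
            if current == build_id:
                del self.in_progress[index_name]
                self.ready.add(index_name)


def replay_scenario():
    """Replay the six steps from the description on two toy nodes."""
    node1, node2 = ToyNode("node1"), ToyNode("node2")
    node1.start_build("A", "x_1")                  # node 1 starts build A as primary
    # node 1 steps down, node 2 steps up
    node2.start_build("B", "x_1")                  # same index name as build A
    started = node1.apply_start_build("B", "x_1")  # conflicts with A, skipped
    node1.fail_build("A")                          # build A fails on node 1
    node2.commit_build("B")                        # build B completes on node 2
    return node1, node2, started
```

In this model, node 2 ends up with `x_1` in `ready` while node 1 has neither a ready index nor an in-progress build, which mirrors the divergence described in the last step.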

In production, this could only be detected by running dbCheck when the db hashes for Node 1 and Node 2 mismatch.


Generated at Thu Feb 08 06:57:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.