[SERVER-67867] Recover and proceed with TTL pass if document removal fails Created: 07/Jul/22  Updated: 05/Dec/22  Resolved: 12/Sep/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 5.0.9, 6.0.0-rc13
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Eric Sedor Assignee: Backlog - Storage Execution Team
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Storage Execution
Sprint: Execution Team 2022-10-17
Participants:

 Description   

Failure to delete a document interrupts a TTL pass. Subsequent TTL passes hit the same failure. As such, if a single document removal fails, TTL on a collection is halted. This can occur in the wake of bugs like WT-7995.

{"t":{"$date":"2022-01-01T00:00:00.000Z"},"s":"E","c":"QUERY","id":4615603,"ctx":"TTLMonitor","msg":"Erroneous index key found with reference to non-existent record id. Consider dropping and then re-creating the index and then running the validate command on the collection.","attr":{"namespace":"db.coll","recordId":"22922469419","indexKeyData":[{"key":{"started":{"$date":"2021-01-01T00:00:00.000Z"}},"pattern":{"started":1}}]}}
{"t":{"$date":"2022-01-01T00:00:01.000Z"},"s":"E","c":"INDEX","id":5400703,"ctx":"TTLMonitor","msg":"Error running TTL job on collection","attr":{"namespace":"db.coll","error":{"code":301,"codeName":"DataCorruptionDetected","errmsg":"Erroneous index key found with reference to non-existent record id. Consider dropping and then re-creating the index and then running the validate command on the collection."}}}

It's good that this information is logged, but the TTL pass should follow up a failure like this by doing additional work:

  • identify a subsequent range of documents to delete
  • delete that range of documents

I'd suggest not preserving any state about what's been skipped, and I'd recommend against trying to fix the inconsistency by removing the erroneous index key. That is: the TTLMonitor should continue to try to behave "normally" every time it runs. This ensures that an error like this continues to be logged instead of being accounted for and forgotten.
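
The behavior requested above could be sketched roughly as follows. This is an illustration of the proposal only, not the server's TTLMonitor code: `ttl_pass`, `delete_one`, and the `DataCorruptionDetected` exception class are hypothetical stand-ins (the exception mirrors the `codeName` in the log above), and expired entries are modeled as a plain list of record ids.

```python
import logging
from typing import Callable, Iterable

log = logging.getLogger("ttl_sketch")

class DataCorruptionDetected(Exception):
    """Hypothetical stand-in for the server error with code 301."""

def ttl_pass(expired_ids: Iterable[int], delete_one: Callable[[int], None]) -> int:
    """Try to delete every expired record; log and skip any that fail.

    No record of skipped ids is kept between passes, so the same
    failure is hit and logged again on the next pass.
    """
    deleted = 0
    for record_id in expired_ids:
        try:
            delete_one(record_id)
            deleted += 1
        except DataCorruptionDetected as exc:
            # Keep going with the remaining documents instead of aborting the pass.
            log.error("Error running TTL job: recordId=%s: %s", record_id, exc)
    return deleted
```

Because nothing is remembered about the failing key, every pass rediscovers it and logs it again, which is the "continues to be logged" property described above.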



 Comments   
Comment by Cris Insignares Cuello [ 12/Sep/22 ]

If this error occurs, the TTL index is corrupted; we do not want to delete data based on the contents of a corrupted index.

Generated at Thu Feb 08 06:09:16 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.