[SERVER-45037] CollectionBulkLoader::insertDocuments() should be called with the collection cloner mutex held to guarantee memory and thread safety properties. Created: 10/Dec/19  Updated: 29/Oct/23  Resolved: 02/Jan/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.3.3

Type: Bug Priority: Major - P3
Reporter: Suganthi Mani Assignee: Matthew Russotto
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Problem/Incident
is caused by SERVER-43274 Implement cloners using DBClient with... Closed
Related
related to SERVER-71683 unbounded memory growth during tenant... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2019-12-16, Repl 2019-12-30, Repl 2020-01-13
Participants:
Linked BF Score: 58

 Description   

After this commit, CollectionBulkLoader::insertDocuments() (storage document/ external sorter insertion) is not called with collection cloner mutex held. As a result, we can hit data inconsistency and buffer overflow issues.

Data inconsistency can occur because this CollectionBulkLoader::insertDocuments() is not thread-safe function.
Buffer overflow can occur when the reads (returned from sync source) are much faster than the writes (document insertion). As a result, CollectionCloner::_documentsToInsert buffer can overflow and lead to OOM issue.

Note: I was able to see this buffer overflow, lead to server crash, when I tried to disable readOnce cursor on initial sync logkeeper workload.



 Comments   
Comment by Githook User [ 02/Jan/20 ]

Author:

{'name': 'Matthew Russotto', 'email': 'matthew.russotto@mongodb.com', 'username': 'mtrussotto'}

Message: SERVER-45037 CollectionBulkLoader::insertDocuments() should be called with the collection cloner mutex held to guarantee memory and thread safety properties.
Branch: master
https://github.com/mongodb/mongo/commit/2c26afb53b1eefec26a09268c5c9e4fbe3b389d3

Comment by Matthew Russotto [ 16/Dec/19 ]

insertDocumentsCallback() is thread safe. The crash is a result of there being no throttling, so the receive thread (handleNextBatch) is allowed to get far ahead of the insert thread, but this is not a race. By inserting the documents while holding the cloner lock, we effectively force the receive thread to run in lock-step with insertion. There may be a performance gain by using explicit throttling and not running in lock-step, but I think we just need to fix the bug for now.

If the dbwork thread is being completely starved, we can still crash, but that case probably indicates some sort of system problem.

Comment by Suganthi Mani [ 10/Dec/19 ]

Thinking more about it, I feel, even after we revert to the old behavior (i.e) collectionBulkLoader::insertDocuments() should be called with the collection cloner mutex held, there is a slim chance that we can hit with the buffer overflow issue when your task runner/thread pool is slow in scheduling the task. I feel we should have some limit to that CollectionCloner::_documentsToInsert buffer so that we cover the case of fast producer and slow consumer.

Comment by Suganthi Mani [ 10/Dec/19 ]

As a part of this ticket, I would like to add few other minor items

1) correct the warning message- s/_insertDocumentsCallback/insertDocumentsCallback
2) Add a note that CollectionBulkLoader::insertDocuments() is not thread safe to make it helpful for future readers.

Generated at Thu Feb 08 05:07:42 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.