[SERVER-71683] unbounded memory growth during tenant migrations Created: 29/Nov/22  Updated: 29/Oct/23  Resolved: 07/Dec/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.1.1, 6.2.0-rc3, 6.3.0-rc0

Type: Bug Priority: Critical - P2
Reporter: Eric Milkie Assignee: Suganthi Mani
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File 0001-memory-leak.patch     File t.js    
Issue Links:
Backports
Related
is related to SERVER-45037 CollectionBulkLoader::insertDocuments... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.2
Steps To Reproduce:

git am the attached patch.
Then run the attached script via:
nohup buildscripts/resmoke.py run t.js
Observe the memory growth (or even the process getting OOM-killed)

Sprint: Server Serverless 2022-12-12
Participants:

 Description   

It appears that there is no backpressure between reading from a donor and writing on a recipient; there is an in-memory buffer that lives on the recipient that tenant migration writer threads pull from to perform writes. This buffer can grow without bound if writing on the recipient is significantly slower than reading from the donor.
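
The failure mode described above reduces to a producer/consumer loop with no flow control. The simulation below is a hypothetical simplification (none of these names come from the server code): a donor-side reader enqueues one batch per tick while a slower recipient writer drains one batch every `writeCost` ticks, and the buffer's peak size grows with the speed mismatch.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>

// Hypothetical simulation, not the server's actual code: returns the peak
// size of the recipient's in-memory buffer when the reader enqueues one
// batch per tick and the writer drains one batch every `writeCost` ticks.
std::size_t peakBufferSize(std::size_t batches, std::size_t writeCost) {
    std::deque<int> buffer;  // stands in for the recipient's insert queue
    std::size_t peak = 0;
    for (std::size_t tick = 0; tick < batches * writeCost; ++tick) {
        if (tick < batches) {
            buffer.push_back(1);  // reader enqueues a batch every tick
            peak = std::max(peak, buffer.size());
        }
        if (tick % writeCost == writeCost - 1 && !buffer.empty())
            buffer.pop_front();  // slower writer drains less often
    }
    return peak;
}
```

With `writeCost == 1` (writer keeps pace) the peak stays at one batch; with any larger `writeCost` the peak scales with the total number of batches, which is the unbounded growth this ticket reports.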



 Comments   
Comment by Githook User [ 07/Dec/22 ]

Author:

{'name': 'Suganthi Mani', 'email': 'suganthi.mani@mongodb.com', 'username': 'smani87'}

Message: SERVER-71683 Tenant collection cloner reads the next batch from socket buffer only after writing all the documents in the current batch to storage

(cherry picked from commit 5fee6fff13b8a0b9f96f6bbe228afcd9514ac952)
Branch: v6.2
https://github.com/mongodb/mongo/commit/6b55bbebb1b199a9e0dcfdb4611a7a2cb58ba3a5

Comment by Githook User [ 07/Dec/22 ]

Author:

{'name': 'Suganthi Mani', 'email': 'suganthi.mani@mongodb.com', 'username': 'smani87'}

Message: SERVER-71683 Tenant collection cloner reads the next batch from socket buffer only after writing all the documents in the current batch to storage

(cherry picked from commit 5fee6fff13b8a0b9f96f6bbe228afcd9514ac952)
Branch: v6.1
https://github.com/mongodb/mongo/commit/3b45a1f1e5f0f7ff5c7180b05f4c3ae050566789

Comment by Suganthi Mani [ 07/Dec/22 ]

Just for those who are watching this ticket: we considered the following 3 options.

(We chose option#1 for the following reasons: 1) simplicity; 2) this is a problem only for the MTM protocol, which will soon be retired and replaced with split and merge.)

Option#1: The tenant collection cloner reads the next batch from the socket buffer into the in-memory buffer only after inserting all the documents in the in-memory buffer into the collection, by running the insert-docs task in-line with handleNextBatch()

  • Notes:
    1. With our current code base, irrespective of writer pool size, only one writer thread is responsible for inserting documents into a given tenant collection. This is mainly due to a limitation of the task runner used to schedule the insert-docs task. It isn't a conscious choice for the tenant collection cloner; it's a carry-over from the initial sync collection cloner code, where insert-docs tasks for a given collection can't run in parallel due to a WiredTiger bulk insertion limitation.
    2. Given that we have a single writer thread, option#1 isn't a major design change to the tenant collection cloner.
    3. Since each batch can only be <= 16MB, the in-memory buffer can't grow unbounded with option#1. Even if document insertion is really slow, we expect the socket buffer (default size should be a few KB or MB) to fill up and in turn throttle the donor-side exhaust cursor from generating batches.
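
As a sketch of how option#1 bounds memory, the loop below requests the next batch only once the current one is fully inserted, so at most one batch is ever held in memory. All names (`cloneCollection`, `getNextBatch`, `insertDocuments`) are illustrative, not the server's actual API:

```cpp
#include <functional>
#include <string>
#include <vector>

using Batch = std::vector<std::string>;  // stand-in for a batch of BSON docs

// Hypothetical sketch of option#1: insertion runs in-line with batch
// handling, so the next read from the socket buffer cannot start until
// the previous batch has been written to storage.
void cloneCollection(std::function<bool(Batch&)> getNextBatch,
                     std::function<void(const Batch&)> insertDocuments) {
    Batch batch;
    while (getNextBatch(batch)) {  // pulls at most one batch off the socket
        insertDocuments(batch);    // blocks until the batch is in storage
        batch.clear();             // nothing is retained across iterations
    }
}
```

Because the recipient stops reading, the kernel socket buffer fills and TCP flow control throttles the donor's exhaust cursor, which is the backpressure mechanism note 3 relies on.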

Option#2: Put an explicit max size limit on the in-memory buffer - TenantCollectionCloner::_documentsToInsert

  • Notes:
    1. This fix aligns with any future effort to improve tenant collection cloner performance by parallelizing insert-docs tasks for a given tenant collection.
    2. To avoid busy looping and unnecessary scheduling of tasks on the task runner, we might also need the receiver thread to block/wait when the in-memory buffer is full and unblock when space is available (plus wait interruption, especially on merge abort).
    3. There may be some performance gain with option#2 from running the insertion and receiving steps (i.e., reading (and decompressing) the next batch from the socket buffer into the in-memory buffer) in parallel.
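
A minimal sketch of the capped buffer described in option#2, assuming a standard condition-variable design (this is illustrative, not the actual TenantCollectionCloner::_documentsToInsert implementation, and it omits the abort/interruption path note 2 calls for):

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical option#2 sketch: the receiver thread blocks in push() while
// the buffer is at its size limit; a writer thread frees space in pop().
template <typename T>
class BoundedBuffer {
public:
    explicit BoundedBuffer(std::size_t maxSize) : _maxSize(maxSize) {}

    void push(T doc) {  // called by the receiver thread
        std::unique_lock<std::mutex> lk(_mutex);
        _notFull.wait(lk, [&] { return _queue.size() < _maxSize; });
        _queue.push_back(std::move(doc));
        _notEmpty.notify_one();
    }

    T pop() {  // called by a writer thread
        std::unique_lock<std::mutex> lk(_mutex);
        _notEmpty.wait(lk, [&] { return !_queue.empty(); });
        T doc = std::move(_queue.front());
        _queue.pop_front();
        _notFull.notify_one();  // wake a blocked receiver
        return doc;
    }

private:
    std::mutex _mutex;
    std::condition_variable _notFull;
    std::condition_variable _notEmpty;
    std::deque<T> _queue;
    const std::size_t _maxSize;
};
```

Unlike option#1, this keeps receiving and inserting concurrent (the potential performance gain from note 3) while still capping memory at `maxSize` entries.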

Option#3: Like SERVER-45037, do storage writes while holding the tenant collection cloner mutex, to effectively force the receive thread to run in lock-step with insertion

  • Rejected: Doing storage writes with a mutex held is an anti-pattern (a mutex should guard only a short critical section) and is prone to deadlocks (PM-3075)
Comment by Githook User [ 07/Dec/22 ]

Author:

{'name': 'Suganthi Mani', 'email': 'suganthi.mani@mongodb.com', 'username': 'smani87'}

Message: SERVER-71683 Tenant collection cloner reads the next batch from socket buffer only after writing all the documents in the current batch to storage
Branch: master
https://github.com/mongodb/mongo/commit/5fee6fff13b8a0b9f96f6bbe228afcd9514ac952

Comment by Suganthi Mani [ 01/Dec/22 ]

Reposting milkie@mongodb.com's Slack response on why this ticket is marked P2-Critical:

It's been hit in production twice now (one rather recently).
The recent one triggered some very long manual cleanup that took us 4 days to finally fix completely,
and during that period, those tenants were all impaired.
I marked it critical in the hopes we can get a fix into 6.2.0, if not 6.1.1.

Generated at Thu Feb 08 06:19:42 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.