[SERVER-59650] Insert after refine can lead to unsafe access to the chunk manager Created: 27/Aug/21  Updated: 21/Oct/22  Resolved: 21/Oct/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Pierlauro Sciarelli Assignee: Antonio Fuschetto
Resolution: Won't Fix Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-70758 Stop tracking chunk writes in version... Closed
Operating System: ALL
Sprint: Sharding EMEA 2022-08-08, Sharding EMEA 2022-08-22, Sharding EMEA 2022-09-05, Sharding EMEA 2022-09-19, Sharding EMEA 2022-10-03, Sharding EMEA 2022-10-17, Sharding EMEA 2022-10-31
Participants:
Linked BF Score: 0

 Description   

CollectionMetadata offers a method to get a weak reference to the chunk manager it points to.

In one observed case, this led to the chunk manager being cleaned up while it was still in use by an op observer.

The flow that led to the error, right after a refineCollectionShardKey, was the following:



 Comments   
Comment by Antonio Fuschetto [ 21/Oct/22 ]

The initial idea of referring to a chunk manager object with a shared pointer (instead of a raw one) does not resolve the problem, since the lifetime of the collection metadata is the actual root cause of the issue. Indeed, the collection metadata is deleted while an insert/update operation is in progress on the secondaries, causing a memory access violation (dangling pointer error) on the chunk manager.
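
To illustrate the lifetime argument above, here is a minimal standalone sketch. It is not the server code: the ChunkManager and CollectionMetadata structs, their members, and the helper functions are hypothetical stand-ins. It assumes the metadata is the only owner of the chunk manager and that callers receive a non-owning pointer. A std::weak_ptr is used purely as a probe so that the destruction can be observed without dereferencing a dangling pointer.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for the real chunk manager.
struct ChunkManager {
    std::string shardKeyPattern;
};

// Hypothetical stand-in for the collection metadata. It owns the chunk
// manager through a shared_ptr but hands out only a non-owning pointer.
struct CollectionMetadata {
    std::shared_ptr<ChunkManager> chunkManager;
    const ChunkManager* getChunkManager() const { return chunkManager.get(); }
};

std::shared_ptr<CollectionMetadata> makeMetadata() {
    auto metadata = std::make_shared<CollectionMetadata>();
    metadata->chunkManager = std::make_shared<ChunkManager>();
    metadata->chunkManager->shardKeyPattern = "{a: 1}";
    return metadata;
}

// The caller keeps only the non-owning pointer. Once the metadata is
// dropped, the chunk manager goes with it, even though the metadata held it
// in a shared_ptr. Returns whether the chunk manager is still alive.
bool chunkManagerAliveWithoutPin() {
    auto current = makeMetadata();
    std::weak_ptr<ChunkManager> probe = current->chunkManager;
    const ChunkManager* raw = current->getChunkManager();
    (void)raw;        // would dangle after the reset below; never dereferenced
    current.reset();  // metadata deleted while the "operation" is in flight
    return !probe.expired();
}

// The caller pins the metadata itself for the duration of the operation, so
// the chunk manager it owns stays valid.
bool chunkManagerAliveWithPin() {
    auto current = makeMetadata();
    std::weak_ptr<ChunkManager> probe = current->chunkManager;
    std::shared_ptr<CollectionMetadata> pinned = current;  // extends lifetime
    current.reset();
    assert(pinned->getChunkManager() != nullptr);
    return !probe.expired() &&
           pinned->getChunkManager()->shardKeyPattern == "{a: 1}";
}
```

In this sketch, chunkManagerAliveWithoutPin() returns false and chunkManagerAliveWithPin() returns true. Pinning here is only one possible way to keep the metadata alive; it says nothing about the cost of doing so in the server.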

The findIntersectingChunkWithSimpleCollation function, which triggers the described problem, will no longer be used on the 6.0 branch and later (see SERVER-70758). As for the other branches, a synchronization mechanism would be needed to ensure that the collection metadata is not deleted while an insert/update operation is in progress on the secondaries. After a discussion with pierlauro.sciarelli@mongodb.com, we concluded that a proper solution would be costly in terms of both performance and development effort.

Considering that a problem of this nature has never occurred in real environments (external customers), we believe it is appropriate to ignore the problem on old branches, at least for now. Closing as won’t fix.

Generated at Thu Feb 08 05:47:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.