[SERVER-68154] Race in destruction of Resharding metrics can lead to calling pure virtual function Created: 19/Jul/22  Updated: 29/Oct/23  Resolved: 26/Jul/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Randolph Tan Assignee: Randolph Tan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File repro.diff    
Issue Links:
Depends
Duplicate
is duplicated by SERVER-67048 Replace ShardingDataTransformCumulati... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Apply repro.diff and run the cpp test

Sprint: Sharding 2022-07-11, Sharding 2022-07-25, Sharding 2022-08-08
Participants:
Linked BF Score: 135

 Description   

Whenever ShardingDataTransformInstanceMetrics destructor is called, it will remove the observer to itself (here) from the ShardingDataTransformCumulativeMetrics. The issue is that ShardingDataTransformInstanceMetrics::getRecipientHighEstimateRemainingTimeMillis is a pure virtual function and by the time the destructor for ShardingDataTransformInstanceMetrics is being run, the child class is already destroyed (In production server, this is the ReshardingMetrics class). So there is a small window where the observer instance is still accessible and trying to call getRecipientHighEstimateRemaining on it will result in terminating the server.



 Comments   
Comment by Githook User [ 26/Jul/22 ]

Author:

{'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'}

Message: SERVER-68154 Race in destruction of Resharding metrics can lead to calling pure virtual function
Branch: master
https://github.com/mongodb/mongo/commit/921bba175902f9b9f29751a466383c3d7e80df7b

Comment by Brett Nawrocki [ 21/Jul/22 ]

It seems to me that fundamentally, either 1. the metrics need to be deregistered before the child class starts getting destroyed or 2. the metrics relevant to calculating the time estimate need to survive past the child class being destroyed.

For approach 1, I agree that option A is the less bad choice.

For approach 2, I think the resulting code would be more convoluted than its worth.

 
I think we can close SERVER-67048 afterwards if we do go with option A here.

Comment by Randolph Tan [ 21/Jul/22 ]

Making the function into non pure virtual appears to make the test pass. But I suspect that this only narrows the window for the race. From my understanding, the vtable is switched to pointing to ShardingDataTransformCumulativeMetrics's when it's own destructor is called, so there is still a tiny window where calls to getRecipientHighEstimateRemainingTimeMillis will attempt to call ReshardingMetrics's after it has started destroying it's own member variables.

So I have 2 proposals in mind:
Option A: Child class is responsible for calling _deregister in it's destructor (also add invariant in base class destructor that _deregister is null).
pros:

  • less typing

cons:

  • not immediately obvious why child class needs to be responsible for deregistration from readability stand point.

Option B: Remove _deregister from base class and child class is responsible for it's lifetime.
pros:

  • opposite of option A cons. No weirdness on calling _deregister since it owns it.

cons:

  • opposite of option A pros. Observer functionality and lifetime maintenance logic will need to duplicated for each child class.

I'm currently leaning towards option A, what do you guys think? max.hirschhorn@mongodb.com, brett.nawrocki@mongodb.com

Generated at Thu Feb 08 06:10:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.