[SERVER-51440] Cleanup refcounting and lifetime of ReplicaSetMonitor Created: 08/Oct/20  Updated: 22/Oct/20  Resolved: 22/Oct/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Andrew Shuvalov (Inactive) Assignee: Andrew Shuvalov (Inactive)
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-50189 Replace ReplicaSetMonitorManager::rem... Backlog
depends on SERVER-50467 Ensure that tenant migration donor on... Closed
depends on SERVER-51413 RemoteCommandTargeterRS custom delete... Closed
depends on SERVER-51441 Get rid of ReplicaSetMonitor::get() Closed
depends on SERVER-51442 ShardRegistry should handle gracefull... Closed
depends on SERVER-51443 Server should cleanup the ReplicaSetM... Closed
depends on SERVER-51480 Clear circular dependency of shared_p... Closed
Related
related to SERVER-49789 Make tenant migration donor use a Str... Closed
is related to SERVER-50189 Replace ReplicaSetMonitorManager::rem... Backlog
Participants:

 Description   

This is an umbrella task to keep track of all smaller tasks to accomplish the refactoring.

UPDATE: hopefully we don't need TTL. I will rename or delete this ticket when it's more clear. For now it holds the list of other tickets needed for refactoring, below.

 

The problem: the lifetime if the ReplicaSetMonitor is managed by the shared_ptr. The ownership graph has circular dependencies (like _eventsPublisher and TopologyListenerPtr) that prevent the instance to be deleted. At the same time, breaking those circular dependencies by replacing a listener with weak_ptr is making the RSM to be deleted too fast.

The solution: convert the ReplicaSetMonitor factory and ReplicaSetMonitorManager to cache with TTL:

  1. Do not have a strict refcounting delete policy, have a lazy delete with TTL (about 1 hour or so)
  2. Make the callers to not keep the reference for a while, even use a weak pointer when possible. TTL will prevent the immediate deletion
  3. Remove the get() method, so an entry in the cache is always recreated. With 1 hour TTL it is guaranteed that each record is recreated not more often than once per hour, so the cost is very low
  4. Thus we still have the remove() method that clears the entry from cache immediately, and the dangling references will delete the instance soon, however we cannot force it to terminate immediately. I think the remove() should set a flag in the instance that it should terminate asap

In the tests, set TTL to be short, like 10 seconds. 



 Comments   
Comment by Andrew Shuvalov (Inactive) [ 22/Oct/20 ]

I'm moving list of tickets to SERVER-50189 

Generated at Thu Feb 08 05:25:29 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.