[SERVER-60595] Resmoke hooks such as ContinuousTenantMigration may not pause even after being paused Created: 11/Oct/21  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Vishnu Kaushik
Assignee: Backlog - Replication Team
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File server50959repro.log     File tmrepro.py    
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

The implementation of ContinuousTenantMigration suggests that when the hook is paused after a test, it is expected to be in a state in which no migrations are in progress. This expectation can be violated.

Suppose this sequence of events takes place:

  • The tenant migrations thread is started, and it stalls just before executing self._is_idle_evt.clear(). It has already checked that a tenant migration is permitted.
  • The main resmoke thread of execution is done with the test and attempts to pause the thread. Marking the test as finished in pause() has no effect at this point, since the tenant migrations thread has already run past wait_for_tenant_migration_permitted().
  • Since the tenant migrations thread has not yet performed self._is_idle_evt.clear(), the idle check in pause() succeeds, and we believe we have finished pausing the tenant migrations thread.
  • However, the tenant migrations thread is free to proceed with a migration and does not know that it should pause.
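
The interleaving above can be sketched as a small, self-contained Python model. This is not the resmoke code: only _is_idle_evt, pause(), and the idea of a permission check come from the description, while the class name, resume(), and the extra events used to force the ordering (passed_permission_check, resume_after_pause) are hypothetical and exist only to make the race deterministic.

```python
import threading


class MigrationThreadSketch(threading.Thread):
    """Simplified stand-in for the tenant migrations thread (hypothetical)."""

    def __init__(self):
        super().__init__(daemon=True)
        self._is_idle_evt = threading.Event()
        self._is_idle_evt.set()  # the thread starts out idle
        # Stands in for the state behind wait_for_tenant_migration_permitted().
        self._permitted_evt = threading.Event()
        # Hypothetical hooks used only to force the problematic interleaving.
        self.passed_permission_check = threading.Event()
        self.resume_after_pause = threading.Event()
        self.pause_returned = False
        self.migration_ran_after_pause = False

    def run(self):
        # Step 1: the permission check succeeds while the test is running.
        self._permitted_evt.wait()
        self.passed_permission_check.set()
        # Race window: the thread stalls before clearing the idle event.
        self.resume_after_pause.wait()
        # Step 4: the thread proceeds, unaware that pause() already returned.
        self._is_idle_evt.clear()
        self.migration_ran_after_pause = self.pause_returned
        self._is_idle_evt.set()

    def resume(self):
        # Hypothetical: marks a test as in progress, permitting migrations.
        self._permitted_evt.set()

    def pause(self):
        # Step 2: mark the test as finished (too late for this iteration).
        self._permitted_evt.clear()
        # Step 3: the idle check succeeds because clear() has not run yet.
        self._is_idle_evt.wait()
        self.pause_returned = True


def demonstrate_race():
    thread = MigrationThreadSketch()
    thread.start()
    thread.resume()
    thread.passed_permission_check.wait()  # thread is now inside the window
    thread.pause()                         # returns immediately
    thread.resume_after_pause.set()        # let the thread continue
    thread.join()
    return thread.migration_ran_after_pause
```

In this model, demonstrate_race() reports whether a migration began after pause() had already returned, which is the violated expectation described above.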

There is also a sequence of steps, involving stop() once all tests have completed, that prevents the tenant migrations thread from ever terminating.



 Comments   
Comment by Vishnu Kaushik [ 12/Oct/21 ]

I attached the reproducer (the current tenant_migrations.py with some sleeps / waits added where necessary), as well as a log file from a severely failed run, in which the tenant migrations thread is never able to complete.

It may take 5 to 6 runs to reproduce the issue in its most severe form. However, the reproducer should reliably cause the call to pause() to return and a tenant migration to run immediately afterwards.

Comment by Judah Schvimer [ 11/Oct/21 ]

Vishnu found this by code inspection.

Generated at Thu Feb 08 05:50:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.