[SERVER-39172] Shut down mongod nodes in parallel in ReplSetTest Created: 24/Jan/19  Updated: 29/Oct/23  Resolved: 05/Nov/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.3.1

Type: Improvement Priority: Major - P3
Reporter: William Schultz (Inactive) Assignee: William Schultz (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 5db08392e3c331774a397b0d,enterprise-rhel-62-64-bit,563dc7451690efa475db5feda913098e777471da.png     PNG File 5dc08364e3c3317a73ff6fa6,enterprise-rhel-62-64-bit,be045feaf3ba4af8037c5baceda2b15cd2498a24.png     File replset_shutdown.js    
Issue Links:
Duplicate
is duplicated by SERVER-22080 add a fire and forget shutdown method... Closed
Problem/Incident
causes SERVER-48206 ReplSetTest#stopSet() no longer detec... Closed
Related
related to SERVER-27342 Do not block unnecessarily on connect... Closed
is related to SERVER-44460 reconfig.js should add last node back... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2019-11-04, Repl 2019-11-18
Participants:

 Description   

In the ReplSetTest.stopSet function, we call ReplSetTest.stop sequentially for each node in the replica set. This in turn calls MongoRunner.stopMongod for each replica set node. By default, when MongoRunner shuts down a mongod node, it will call waitpid on the underlying process before continuing. This means that in ReplSetTest.stopSet we need to wait for a mongod to shut down cleanly before moving on and shutting down the next node. We could speed up this process by having a way in MongoRunner to shut down a process without calling waitpid on it. In ReplSetTest, after initiating shutdown on each node (and not blocking), we could go through each process and call waitProgram on its process id, which will call waitpid. This should speed up the shutdown process in ReplSetTest and reduce test times for both local testing and, ideally, overall Evergreen test suite durations.
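The difference between the two shutdown orders can be sketched outside the mongo shell. The following is a minimal Python illustration, not the ReplSetTest or MongoRunner implementation: the child processes are stand-ins for mongod, and the function names (start_children, stop_sequential, stop_parallel) are hypothetical.

```python
import subprocess
import sys


def start_children(count):
    """Start `count` stand-in child processes (hypothetical substitutes for mongod)."""
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    return [subprocess.Popen(cmd) for _ in range(count)]


def stop_sequential(procs):
    """Current pattern: signal one process and wait for it before touching the next."""
    for proc in procs:
        proc.terminate()  # SIGTERM on POSIX
        proc.wait()       # blocks (waitpid on POSIX) until this process has exited
    return [proc.returncode for proc in procs]


def stop_parallel(procs):
    """Proposed pattern: signal every process first, then reap them all."""
    for proc in procs:
        proc.terminate()  # phase 1: initiate shutdown without blocking
    for proc in procs:
        proc.wait()       # phase 2: reap each process so none is left unwaited
    return [proc.returncode for proc in procs]


if __name__ == "__main__":
    print(stop_parallel(start_children(3)))
```

With the second ordering, the total wait is bounded by roughly the slowest process rather than the sum of all shutdown times, assuming the processes shut down independently of one another.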



 Comments   
Comment by Githook User [ 04/Nov/19 ]

Author:

{'name': 'William Schultz', 'username': 'will62794', 'email': 'william.schultz@mongodb.com'}

Message: SERVER-39172 Shut down and validate nodes in parallel in ReplSetTest.stopSet
Branch: master
https://github.com/mongodb/mongo/commit/a417e979908af2124b990d68a22c437005877790

Comment by William Schultz (Inactive) [ 04/Nov/19 ]

We can also look at the stopSet shutdown times across the replica_sets suite after the changes, taken from this patch build:

Median: 731 ms
SD: 627 ms

We can view this in comparison to the stopSet shutdown profile before the changes, taken from this patch build:

Median: 1497 ms
SD: 1049 ms

I expect that the profile is still somewhat spread out even after the changes because we have to validate collections during shutdown, which may be a non-trivial amount of work that varies by test.

Comment by William Schultz (Inactive) [ 04/Nov/19 ]

From an initial patch build that includes these changes, we can see the improvements for the ReplSetTest control tests.

replsettest_control_1_node.js: stopSet stopped all replica set nodes in 838ms
replsettest_control_12_nodes.js: stopSet stopped all replica set nodes in 1141ms

This is a scale factor of (1141/838) = 1.36x, within the 1.5x goal bound.
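The scale-factor check above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative, and the 1.5x bound is the goal stated in this comment.

```python
# Timings reported by the two ReplSetTest control tests, in milliseconds.
one_node_ms = 838
twelve_node_ms = 1141

# Goal bound from the comment: 12 nodes should take at most 1.5x as long as 1 node.
goal_bound = 1.5

scale = twelve_node_ms / one_node_ms
within_goal = scale <= goal_bound

print(f"scale factor: {scale:.2f}x, within goal: {within_goal}")
```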

Comment by William Schultz (Inactive) [ 24/Jan/19 ]

> That sounds like what you're doing, no?

Yes, that is the intention here. I also wanted to make it clear in this ticket that ReplSetTest.stopSet should take advantage of such a feature. SERVER-22080 doesn't seem to mention this explicitly.

Comment by Max Hirschhorn [ 24/Jan/19 ]

I had interpreted SERVER-22080 as making it so each of the replica set members are signaled for termination and then waited on afterward. That sounds like what you're doing, no?

  1. SIGTERM mongod 1
  2. SIGTERM mongod 2
  3. SIGTERM mongod 3
  4. waitpid() mongod 1
  5. waitpid() mongod 2
  6. waitpid() mongod 3

> Based on the title of SERVER-22080, it seems to be requesting a way to shut down nodes without waiting for them to shut down completely before returning.

The title is misguided because never calling WaitForSingleObject() on Windows would lead to errors such as "The process cannot access the file because it is being used by another process." Even after the process exits, the OS needs time to release all of its handle objects, so attempting to use the same dbpath again too soon (for example) would fail.

Comment by William Schultz (Inactive) [ 24/Jan/19 ]

Based on the title of SERVER-22080, it seems to be requesting a way to shut down nodes without waiting for them to shut down completely before returning. SERVER-39172 requests that ReplSetTest change the way it shuts down nodes, by utilizing such a method. I would prefer closing the old ticket in favor of this one, since I don't feel that the description and intent there was well fleshed out.

Comment by Max Hirschhorn [ 24/Jan/19 ]

william.schultz, I think one of SERVER-22080 or this ticket should be closed as a duplicate.

Comment by William Schultz (Inactive) [ 24/Jan/19 ]

By running a basic ReplSetTest shutdown test (replset_shutdown.js) with the existing code, calling stopSet on a 7-node replica set takes 5792ms. If we remove the blocking waitpid call and instead call waitProgram on each process after we have already initiated their shutdown, stopSet takes 1616ms.

Generated at Thu Feb 08 04:51:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.