[SERVER-61117] Startup error results in a hang on shutdown
Created: 29/Oct/21  Updated: 29/Oct/23  Resolved: 29/Mar/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.0.0-rc0

Type: Bug
Priority: Major - P3
Reporter: Eric Milkie
Assignee: Vesselina Ratcheva (Inactive)
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Start a 4.9 FCV replica set member with a 5.1 binary.

Sprint: Replication 2021-11-29, Replication 2021-12-13, Replication 2021-12-27, Replication 2022-01-10, Replication 2022-01-24, Replication 2022-02-07, Repl 2022-02-21, Repl 2022-03-07, Repl 2022-03-21, Repl 2022-04-04
Participants:

 Description   

For the following startup error, the shutdown process hangs indefinitely, waiting for replication to finish starting up:

{"t":{"$date":"2021-10-29T13:43:41.902+00:00"},"s":"E",  "c":"CONTROL",  "id":20557,   "ctx":"initandlisten","msg":"DBException in initAndListen, terminating","attr":{"error":"Location4926900: Invalid value for featureCompatibilityVersiondocument in admin.system.version, found 4.9, expected '5.0' or '5.0' or '5.1. See https://docs.mongodb.com/master/release-notes/5.0-compatibility/#feature-compatibility."}}

The hang seems to happen when the main thread subsequently calls _waitForStartUpComplete() on the replication coordinator during shutdown.

Thread 1 (Thread 0x7f1f24a1abc0 (LWP 322078)):
#0  0x00007f1f213b6a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000055b040236b7c in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
#2  0x000055b03d7f482c in void std::_V2::condition_variable_any::wait<std::unique_lock<mongo::latch_detail::Latch> >(std::unique_lock<mongo::latch_detail::Latch>&) ()
#3  0x000055b03d7d00f3 in mongo::repl::ReplicationCoordinatorImpl::_waitForStartUpComplete() ()
#4  0x000055b03d7ec2de in mongo::repl::ReplicationCoordinatorImpl::shutdown(mongo::OperationContext*) ()
#5  0x000055b03d666fbe in mongo::(anonymous namespace)::shutdownTask(mongo::ShutdownTaskArgs const&) ()
#6  0x000055b04008da55 in mongo::(anonymous namespace)::runTasks(std::stack<mongo::unique_function<void (mongo::ShutdownTaskArgs const&)>, std::deque<mongo::unique_function<void (mongo::ShutdownTaskArgs const&)>, std::allocator<mongo::unique_function<void (mongo::ShutdownTaskArgs const&)> > > >, mongo::ShutdownTaskArgs const&) ()
#7  0x000055b03d4d944d in mongo::shutdown(mongo::ExitCode, mongo::ShutdownTaskArgs const&) ()
#8  0x000055b03cf96910 in mongo::exitCleanly(mongo::ExitCode) ()
#9  0x000055b03d665961 in mongo::mongod_main(int, char**) ()
#10 0x000055b03d4e93ee in main ()

In general, this type of hang is a potential issue for all exceptions that can occur in initAndListen.
The preceding example was taken from a particular node in Serverless QA that was mistakenly started using a newer binary without first updating the FCV.
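
The hang pattern described above can be illustrated with a small, self-contained C++ sketch. This is not the server's code: `ToyCoordinator`, `startUp`, `markStartUpComplete`, and `shutdownWouldReturn` are hypothetical names invented for illustration, and the timed wait exists only so the demo terminates (the trace shows an untimed condition-variable wait). The sketch assumes the root cause is a "startup complete" flag that is never set when startup throws, and shows one possible remedy, a scope guard that sets the flag on every exit path. The actual fix is in the commits linked in the comments.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <stdexcept>

// Hypothetical stand-in for a coordinator whose shutdown waits for startup.
class ToyCoordinator {
public:
    // Simulates startup work. If `fail` is true, the work throws before the
    // "startup complete" flag is set. If `useGuard` is true, a scope guard
    // sets the flag on every exit path, including exceptions.
    void startUp(bool fail, bool useGuard) {
        struct Guard {
            ToyCoordinator* self;
            ~Guard() {
                if (self) {
                    self->markStartUpComplete();
                }
            }
        } guard{useGuard ? this : nullptr};

        if (fail) {
            throw std::runtime_error("simulated startup failure");
        }
        if (!useGuard) {
            markStartUpComplete();
        }
    }

    // Returns true if startup completed within `timeout`. An untimed wait
    // would block forever when the flag is never set.
    bool waitForStartUpComplete(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(_mutex);
        return _cv.wait_for(lk, timeout, [this] { return _startUpComplete; });
    }

private:
    void markStartUpComplete() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _startUpComplete = true;
        }
        _cv.notify_all();
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    bool _startUpComplete = false;
};

// Runs startup, swallows the simulated error the way a top-level handler
// would, then reports whether a shutdown-time wait would return.
bool shutdownWouldReturn(bool fail, bool useGuard) {
    ToyCoordinator coord;
    try {
        coord.startUp(fail, useGuard);
    } catch (const std::exception&) {
        // The top-level handler proceeds to shutdown after the error.
    }
    return coord.waitForStartUpComplete(std::chrono::milliseconds(50));
}
```

In this model, `shutdownWouldReturn(true, false)` times out and returns false, which corresponds to the hang, while `shutdownWouldReturn(true, true)` returns true because the guard marks startup as complete during stack unwinding.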



 Comments   
Comment by Githook User [ 29/Mar/22 ]

Author:

{'name': 'Vesselina Ratcheva', 'email': 'vesselina.ratcheva@10gen.com', 'username': 'vessy-mongodb'}

Message: SERVER-61117 Prevent uncaught errors in ReplicationCoordinatorImpl::startLoadLocalConfig from causing server hangs

This reverts commit c616ce771a282833d3f515ea02a87d89f5c42089.
Branch: master
https://github.com/mongodb/mongo/commit/2c8efb52fe4d688302a3edf06c00f23df48497e9

Comment by Githook User [ 29/Mar/22 ]

Author:

{'name': 'Vesselina Ratcheva', 'email': 'vesselina.ratcheva@10gen.com', 'username': 'vessy-mongodb'}

Message: Revert "SERVER-61117 Prevent uncaught errors in ReplicationCoordinatorImpl::startLoadLocalConfig from causing server hangs"

This reverts commit 4f57c205480557f133535c65f743b88414d32280.
Branch: master
https://github.com/mongodb/mongo/commit/c616ce771a282833d3f515ea02a87d89f5c42089

Comment by Githook User [ 28/Mar/22 ]

Author:

{'name': 'Vesselina Ratcheva', 'email': 'vesselina.ratcheva@10gen.com', 'username': 'vessy-mongodb'}

Message: SERVER-61117 Prevent uncaught errors in ReplicationCoordinatorImpl::startLoadLocalConfig from causing server hangs
Branch: master
https://github.com/mongodb/mongo/commit/4f57c205480557f133535c65f743b88414d32280

Comment by Eric Milkie [ 04/Nov/21 ]

This particular exception is being generated by a call to tenant_migration_access_blocker::recoverTenantMigrationAccessBlockers(opCtx) as part of ReplicationCoordinatorImpl::_startLoadLocalConfig() in version 5.1.0.

Comment by Eric Milkie [ 29/Oct/21 ]

I wonder if the actual fix here is to make DBExceptions in initAndListen use quick exit rather than exitCleanly in general, since there is too much potential for getting stuck.
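
The trade-off raised in this comment can be sketched with a toy model. This does not reproduce the server's exit functions: `ToyShutdownTasks`, `registerTask`, and the two exit methods below are hypothetical, and the model returns instead of terminating the process. It only illustrates the assumption that a clean exit runs every registered shutdown task (so one blocking task stalls the exit), whereas a quick exit skips them.

```cpp
#include <functional>
#include <stack>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a LIFO registry of shutdown tasks.
class ToyShutdownTasks {
public:
    void registerTask(std::string name, std::function<void()> task) {
        _tasks.emplace(std::move(name), std::move(task));
    }

    // "Clean" exit path: runs every registered task, most recent first, and
    // returns the names of the tasks that ran. A task that blocks would
    // stall the whole exit.
    std::vector<std::string> exitCleanly() {
        std::vector<std::string> ran;
        while (!_tasks.empty()) {
            auto entry = std::move(_tasks.top());
            _tasks.pop();
            entry.second();
            ran.push_back(entry.first);
        }
        return ran;
    }

    // "Quick" exit path: skips all tasks. A real process would terminate at
    // this point (for example via std::quick_exit); the model just discards
    // the tasks and reports that none ran.
    std::vector<std::string> quickExit() {
        while (!_tasks.empty()) {
            _tasks.pop();
        }
        return {};
    }

private:
    std::stack<std::pair<std::string, std::function<void()>>> _tasks;
};
```

The cost of the quick path is that no cleanup runs at all, so it trades the risk of getting stuck for the risk of skipping work that a clean shutdown would have done.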
