[SERVER-2184] Graceful shutdown fails once in a while Created: 07/Dec/10  Updated: 30/Mar/12  Resolved: 02/Sep/11

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 1.7.3
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Alexandru Ioan Turc
Assignee: Aaron Staple
Resolution: Duplicate
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-2652 pthread_mutex_lock assertion on shutdown Closed
Operating System: Linux
Participants:

 Description   

These all occurred when a SIGTERM signal was sent to the mongod process (to gracefully terminate the instance):

Example 1:

Fri Dec 3 15:28:33 [initandlisten] shutdown: going to flush oplog...
Fri Dec 3 15:28:33 [initandlisten] shutdown: going to close sockets...
Fri Dec 3 15:28:33 [initandlisten] shutdown: waiting for fs preallocator...
Fri Dec 3 15:28:33 [initandlisten] shutdown: closing all files...
Fri Dec 3 15:28:33 [conn1] end connection 127.0.0.1:58246
Fri Dec 3 15:28:33 [interruptThread] now exiting
Fri Dec 3 15:28:33 dbexit: ; exiting immediately
mongod: /opt/boost/include/boost/thread/pthread/mutex.hpp:50: void boost::mutex::lock(): Assertion `!pthread_mutex_lock(&m)' failed.
Fri Dec 3 15:28:33 Got signal: 6 (Aborted).

Example 2:

Mon Dec 6 09:04:55 [initandlisten] shutdown: going to flush oplog...
Mon Dec 6 09:04:55 [initandlisten] shutdown: going to close sockets...
Mon Dec 6 09:04:55 [initandlisten] shutdown: waiting for fs preallocator...
Mon Dec 6 09:04:55 [initandlisten] shutdown: closing all files...
Mon Dec 6 09:04:55 [interruptThread] now exiting
Mon Dec 6 09:04:55 dbexit: ; exiting immediately
Mon Dec 6 09:04:55 Got signal: 11 (Segmentation fault).

Example 3:

*** glibc detected *** /home/at/ats/ats-tools/mongodb/current/bin/mongod: double free or corruption (fasttop): 0x0a16c9b0 ***


 Comments   
Comment by Eliot Horowitz (Inactive) [ 02/Sep/11 ]

See linked case

Comment by Aaron Staple [ 02/Mar/11 ]

I ran a few tests, and it looks like we only respond to the first SIGTERM or SIGINT. I think this behavior is consistent with the intent of the code. If any other signal is being sent, that would cause a problem.
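
For illustration only, the "respond to the first SIGTERM or SIGINT, ignore the rest" behavior can be sketched in a few lines of Python. This is not the mongod code (which is C++); `ShutdownOnce`, its `cleanup` callback and the `ignored` counter are hypothetical names, and the sketch is not wired to a real signal handler.

```python
import signal
import threading


class ShutdownOnce:
    """Run a cleanup callback for the first shutdown request only (hypothetical helper)."""

    def __init__(self, cleanup):
        self._cleanup = cleanup
        # The lock is used as a one-way gate: acquired once, never released.
        self._gate = threading.Lock()
        self.ignored = 0  # number of later requests that were dropped

    def request(self, signum):
        # A non-blocking acquire succeeds for exactly one caller and never waits,
        # so a repeated or re-entrant request cannot deadlock.
        if not self._gate.acquire(blocking=False):
            self.ignored += 1
            return False
        self._cleanup(signum)
        return True


calls = []
guard = ShutdownOnce(calls.append)
guard.request(signal.SIGTERM)  # first request: cleanup runs
guard.request(signal.SIGTERM)  # repeated TERM a second later: ignored
guard.request(signal.SIGINT)   # any later SIGINT: ignored as well
```

With a gate like this, resources are released once even if the shutdown script repeats the signal while the first shutdown is still in progress.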

Comment by Alexandru Ioan Turc [ 21/Feb/11 ]

I just realized that it is possible that my shutdown script was sending multiple TERM signals at 1-second intervals. I'm thinking that if the application was already in shutdown mode but had not finished by the time the second TERM signal was sent, it tried to execute some shutdown procedures which had already been executed as a consequence of the first TERM signal, such as releasing a resource that had already been released. This is really just an idea; I did not check the mongo server code to see whether it accounts for something like this.
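
One way to rule this hypothesis out on the script side is to send TERM exactly once and then only poll for exit. The following is a minimal sketch, not the reporter's actual script: `terminate_once` and its parameters are hypothetical, and the `kill` and `sleep` arguments exist only so the function can be exercised without a real process. It assumes a POSIX system, where signal 0 performs an existence check without delivering a signal; a process owned by another user or an unreaped child would need extra handling.

```python
import os
import signal
import time


def terminate_once(pid, timeout=30.0, poll_interval=0.5, kill=os.kill, sleep=time.sleep):
    """Send SIGTERM a single time, then wait for the process to disappear.

    Returns True if the process exited within `timeout` seconds, False otherwise.
    """
    kill(pid, signal.SIGTERM)  # the only TERM that is ever sent
    waited = 0.0
    while waited < timeout:
        try:
            kill(pid, 0)  # signal 0: existence check only, nothing is delivered
        except ProcessLookupError:
            return True  # process is gone, shutdown finished
        sleep(poll_interval)
        waited += poll_interval
    return False  # still running; the caller decides what to do next
```

A caller would invoke it as `terminate_once(pid)` with the mongod process id and treat a `False` result as "shutdown still in progress" rather than sending TERM again.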

Comment by auto [ 31/Jan/11 ]

Author: Aaron (login: astaple, email: aaron@10gen.com)

Message: for now, allow exit code 14 in killall test SERVER-2184
https://github.com/mongodb/mongo/commit/59e153c5259e2ac20cf5ed449c31b379c572e0ae

Comment by Aaron Staple [ 14/Dec/10 ]

Just wanted to check in again to ask if any more of the logs are available for examples 2 and 3 - thanks

Comment by auto [ 08/Dec/10 ]

Author: Aaron (login: astaple, email: aaron@10gen.com)

Message: SERVER-2184 clarify usage of mongo mutex
/mongodb/mongo/commit/0b97deb52745f7da9464ddaf11ea2f85f211fa26

Comment by Aaron Staple [ 08/Dec/10 ]

Would it be possible to send more of the log files for examples 2 and 3?

Comment by Aaron Staple [ 08/Dec/10 ]

I did an audit for uses of boost mutexes which could potentially trigger the first message reported:

RWLock

  • MongoFile / mmmutex
  • SpinLock
  • NetworkCounter
  • ServiceStats ?
  • CachedBSONObj ?
  • CmdReplSetReconfig
    etc.

task / Ret ?
task / Server ?
MVar ?
ClientCursor::ccmutex

Generated at Thu Feb 08 02:59:12 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.