[SERVER-2184] Graceful shutdown fails once in a while Created: 07/Dec/10 Updated: 30/Mar/12 Resolved: 02/Sep/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability |
| Affects Version/s: | 1.7.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Alexandru Ioan Turc | Assignee: | Aaron Staple |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Operating System: | Linux | ||||||||
| Participants: | |||||||||
| Description |
|
These all occurred when a SIGTERM signal was sent to the mongo process (to gracefully terminate the instance):
Example 1: Fri Dec 3 15:28:33 [initandlisten] shutdown: going to flush oplog...
Example 2: Mon Dec 6 09:04:55 [initandlisten] shutdown: going to flush oplog...
Example 3:
|
| Comments |
| Comment by Eliot Horowitz (Inactive) [ 02/Sep/11 ] |
|
See linked case |
| Comment by Aaron Staple [ 02/Mar/11 ] |
|
I ran a few tests and it looks like we only respond to the first SIGTERM or SIGINT, and I think this behavior is consistent with the intent of the code. If any other signal is being sent, that would cause a problem. |
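
For illustration only, the "respond to the first SIGTERM or SIGINT" behavior described above can be sketched roughly as follows. This is not the mongod implementation (which is C++); the class name `ShutdownOnce` and the `cleanup` callback are hypothetical names chosen for the example.

```python
import signal


class ShutdownOnce:
    """Hypothetical guard: runs cleanup for the first signal only."""

    def __init__(self, cleanup):
        self._cleanup = cleanup
        # A single token; whichever call pops it gets to run cleanup.
        # list.pop() is one C-level call, so a Python signal handler
        # cannot interrupt it halfway through.
        self._token = [True]

    def handle(self, signum, frame=None):
        """Signal-handler-compatible entry point.

        Returns True if this call started the shutdown, False if a
        shutdown was already in progress and the signal was ignored.
        """
        try:
            self._token.pop()
        except IndexError:
            return False  # shutdown already started; ignore repeat signal
        self._cleanup(signum)
        return True
```

Under these assumptions, a process would register `guard.handle` with `signal.signal` for both `signal.SIGTERM` and `signal.SIGINT`, so that a repeated signal arriving mid-shutdown does not release resources a second time.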
| Comment by Alexandru Ioan Turc [ 21/Feb/11 ] |
|
I just realized that it is possible that my shutdown script was sending multiple TERM signals at 1-second intervals. I'm thinking that if the application was already in shutdown mode but had not finished by the time the second TERM signal was sent, it tried to execute some shutdown procedures which had already been executed as a consequence of the first TERM signal, like releasing some resource that was already released. This is really just an idea; I did not check the mongo server code to see if it accounts for something like this or not. |
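
As a rough sketch of the alternative to re-sending TERM every second, a shutdown script could send the signal once and then only poll for the process to exit. The function name `terminate_once` and its parameters are hypothetical; the reporter's actual script is not shown in this ticket. The `kill` and `sleep` parameters exist only so the logic can be exercised without a real process.

```python
import os
import signal
import time


def terminate_once(pid, timeout=30.0, poll_interval=0.5,
                   kill=os.kill, sleep=time.sleep):
    """Send a single SIGTERM, then wait for the process to disappear.

    Returns True if the process exited within `timeout` seconds,
    False otherwise. No second SIGTERM is ever sent.
    """
    kill(pid, signal.SIGTERM)
    waited = 0.0
    while waited < timeout:
        try:
            # Signal 0 delivers nothing; it only checks that the pid exists.
            kill(pid, 0)
        except ProcessLookupError:
            return True  # process is gone
        sleep(poll_interval)
        waited += poll_interval
    return False
```

Note that on POSIX systems an exited child that has not yet been reaped still passes the signal-0 check, so a script that launched the server itself would more likely wait on the child directly instead of polling.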
| Comment by auto [ 31/Jan/11 ] |
|
Author: Aaron (astaple) <aaron@10gen.com>
Message: for now, allow exit code 14 in killall test |
| Comment by Aaron Staple [ 14/Dec/10 ] |
|
Just wanted to check in again to ask if any more of the logs are available for examples 2 and 3 - thanks |
| Comment by auto [ 08/Dec/10 ] |
|
Author: Aaron (astaple) <aaron@10gen.com>
Message: |
| Comment by Aaron Staple [ 08/Dec/10 ] |
|
Would it be possible to send more of the log files for examples 2 and 3? |
| Comment by Aaron Staple [ 08/Dec/10 ] |
|
I did an audit for uses of boost mutexes which could potentially trigger the first message reported: RWLock
task / Ret ? |