[SERVER-16821] Do not abort server when receiving multiple SIGUSR1 in the same second Created: 13/Jan/15  Updated: 09/Apr/17  Resolved: 25/Jan/17

Status: Closed
Project: Core Server
Component/s: Logging
Affects Version/s: 2.6.4
Fix Version/s: 3.4.4, 3.5.2

Type: Improvement Priority: Minor - P4
Reporter: Volans Assignee: Gabriel Russell (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Duplicate
is duplicated by SERVER-19729 logRotate should not fassert when ren... Closed
Related
related to SERVER-9358 Log rotation can overwrite previous l... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v3.4
Sprint: Platforms 2017-01-23, Platforms 2017-02-13
Participants:

 Description   

Server crashes if logRotate destination file already exists after receiving a SIGUSR1.
For example if SIGUSR1 is sent twice during the same second the destination file will be the same and the server will crash.

Affected version:

2015-01-13T13:54:10.021+0000 [signalProcessingThread] db version v2.6.4
2015-01-13T13:54:10.021+0000 [signalProcessingThread] git version: 3a830be0eb92d772aa855ebb711ac91d658ee910
2015-01-13T13:54:10.021+0000 [signalProcessingThread] build info: Linux build7.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49

The reported error is:

015-01-13T13:54:10.766+0000 [signalProcessingThread] warning: Rotating log file /data/log/mongodb/mongodb.log failed: FileRenameFailed Renaming file /data/log/mongodb/mongodb.log to /data/log/mongodb/mongodb.log.2015-01-13T13-54-10 failed; destination already exists
2015-01-13T13:54:10.766+0000 [signalProcessingThread] Fatal Assertion 16782
2015-01-13T13:54:10.769+0000 [signalProcessingThread] 0x11e6111 0x1187e49 0x116a97d 0x11e57cf 0x122aaa9 0x7fe544cbd182 0x7fe543fc238d
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11e6111]
/usr/bin/mongod(_ZN5mongo10logContextEPKc+0x159) [0x1187e49]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xcd) [0x116a97d]
/usr/bin/mongod() [0x11e57cf]
/usr/bin/mongod() [0x122aaa9]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182) [0x7fe544cbd182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fe543fc238d]
2015-01-13T13:54:10.769+0000 [signalProcessingThread]

***aborting after fassert() failure

2015-01-13T13:54:10.772+0000 [signalProcessingThread] SEVERE: Got signal: 6 (Aborted).
Backtrace:0x11e6111 0x11e54ee 0x7fe543efe000 0x7fe543efdf89 0x7fe543f01398 0x116a9ea 0x11e57cf 0x122aaa9 0x7fe544cbd182 0x7fe543fc238d
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11e6111]
/usr/bin/mongod() [0x11e54ee]
/lib/x86_64-linux-gnu/libc.so.6(+0x37000) [0x7fe543efe000]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7fe543efdf89]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7fe543f01398]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0x13a) [0x116a9ea]
/usr/bin/mongod() [0x11e57cf]
/usr/bin/mongod() [0x122aaa9]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182) [0x7fe544cbd182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fe543fc238d]



 Comments   
Comment by Githook User [ 09/Apr/17 ]

Author:

{u'username': u'gabrielrussell', u'name': u'Gabriel Russell', u'email': u'gabriel.russell@mongodb.com'}

Message: SERVER-16821 don't rotate logs more often then once per second

(cherry picked from commit 3cef6afea83b252613be458a0e0bf94ecea28f96)
Branch: v3.4
https://github.com/mongodb/mongo/commit/c7d6b5a70cef52b2b3a2a6830130376236cb4b9b

Comment by Githook User [ 25/Jan/17 ]

Author:

{u'username': u'gabrielrussell', u'name': u'Gabriel Russell', u'email': u'gabriel.russell@mongodb.com'}

Message: SERVER-16821 don't rotate logs more often then once per second
Branch: master
https://github.com/mongodb/mongo/commit/3cef6afea83b252613be458a0e0bf94ecea28f96

Comment by Volans [ 24/Feb/15 ]

nicolas_ this explains why it happen to you. You're matching multiple files and logrotate apply the post-rotate block for each file if the sharedscripts option is not set (see http://linuxcommand.org/man_pages/logrotate8.html ).
So in your case in a couple of servers was able to rotate the files in less than 1 second and send SIGUSR1 to MongoDB twice in a second, hence the assert error.

If you have multiple MongoDB processes how do you send the SIGUSR1 to the right process?
My suggestion is to have a different logrotate per MongoDB process to be able to send the SIGUSR1 to the right process and in any case if you may match multiple files use the sharedscripts option.

My 2 cents

Comment by Nicolas Fouché [ 24/Feb/15 ]

volans I don't have the "sharedscripts" option. I did not notice it in https://jira.mongodb.org/browse/SERVER-14053?focusedCommentId=596867&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-596867. Should I add it ?

About rotating multiple files, the config actually matches multiple files. Only because we have several mongod running. I admit that we should create one logrotate config per mongod.

/var/log/mongo/*.log {

Comment by Volans [ 16/Feb/15 ]

nicolas_ did you check if your logrotate config could match multiple files? If so, is the "sharedscripts" option set to run the postrotate only once?

Comment by Nicolas Fouché [ 16/Feb/15 ]

volans I checked that I don't send SIGUSR1 twice; for example I don't have a mongo log rotation configured daily and hourly. I had the exact same error message.
All my mongo servers are provisioned by Puppet. So they have the same logrotate configuration. Thus, I don't understand why the problem happened on 2 servers out of 6.

Comment by Volans [ 11/Feb/15 ]

nicolas_ In addition, are you getting exactly the same stack trace? "failed; destination already exists"

Comment by Volans [ 11/Feb/15 ]

nicolas_ have you checked if your logrotate configuration by any chance could execute the SIGUSR1 twice in the same second?
Because normally it should not happen for a normal daily logrotation.
I'll suggest you to check the matching files, the operation that you are doing, options like sharedscripts, etc.

Comment by Nicolas Fouché [ 11/Feb/15 ]

I updated from 2.4.9 to 2.6.7. And out of 6 mongod processes, 2 of them had this Fatal Assert on log rotation at midnight. I used SIGUSR1, and have been using it for years. I was extremely lucky that it did not crash my whole cluster.
As volans said, crashing the whole DB because of a few missing logs is way too overkill. Especially because log rotation usually happens at night. When I ran logrotate manually, the Fata Assert did not happen.

To fix the problem, I also tried all suggestions, and only the one suggested by Volans worked:

/usr/bin/mongo admin --eval "db.runCommand( { logRotate : 1 } )

Anyway, I understand it will be fixed in 3.0 by SERVER-4905. But this fix is so old that I don't understand why it wasn't applied to 2.6.

Comment by Volans [ 13/Jan/15 ]

ramon.fernandez thanks for your multiple replies.

To answer to your comments:

  1. sure overwriting an existing file seems quite bad and I get the origin of this behaviour
  2. I think the name could get a more granular time (i.e. microseconds) to avoid this issue and/or a random suffix to be appended to avoid collisions, but most of it it should allow a standard logrotation (i.e. just close the file descriptor and open the new one)
  3. I'm looking forward to see a more standard log rotation for MongoDB give that is quite not possible to properly integrate it with logrotate right now (yes I've implemented my logrotate in a very similar way to the one you linked above)
  4. I arrived to this scenario testing the logrotation configuration so yes, surely is not the case in a normal environment but I open the issue for two specific reasons:
    • crashing the whole DB seems a bit too much IMHO because will cause downtime, is not even a restart, the DB will remain down until restarted manually
    • I was not able to reproduce the same behaviour calling db.runCommand( { logRotate : 1 } ) directly from within the DB, hence I think is a bug in the way the SIGUSR1 is processed
Comment by Ramon Fernandez Marina [ 13/Jan/15 ]

volans, I should add that if you're using logrotate you may want to check this comment by one of our techops engineers.

If you search for how to use logrotate and MongoDB you may also run into this stack overflow thread, which was addressed by the same MongoDB engineer.

Comment by Ramon Fernandez Marina [ 13/Jan/15 ]

volans, as I mentioned before, this behavior introduced SERVER-9358 to avoid overwriting the log files. One could argue that aborting may not be the right thing to do, but not rotating the file doesn't sound correct either. Given that this may have implications to the audit subsystem, aborting was deemed preferable.

One could also argue that the rotated file name could include a more granular timestamp to prevent this issue, but also that rotating a log file twice in a second is probably an error in log rotation procedures. Can you elaborate on how you arrived to this scenario to see if there's any further recommendations we can incorporate into our documentation?

In SERVER-4905 a new flag was introduced, logRenameOnRotate, to aid in the integration with standard log rotation mechanisms. You may want to investigate the use of this flag to see if it helps in your environment. Note that the documentation is in the pipeline, but you can take a look at the code. This flag will be part of the upcoming 2.8.0 release.

Comment by Ramon Fernandez Marina [ 13/Jan/15 ]

Thanks for your report volans. This behavior was introduced in SERVER-9358 and is intended behavior. Please stay tuned for further details.

Generated at Thu Feb 08 03:42:23 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.