[SERVER-4739] Race condition in log rotation (was: SIGUSR1 should set a flag rather than doing rotation) Created: 21/Jan/12  Updated: 11/Jul/16  Resolved: 17/Apr/13

Status: Closed
Project: Core Server
Component/s: Logging, Stability
Affects Version/s: 2.2.0, 2.2.2
Fix Version/s: 2.2.5, 2.4.4, 2.5.0

Type: Improvement Priority: Major - P3
Reporter: Mathias Stearn Assignee: Eric Milkie
Resolution: Done Votes: 24
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-7707 MongoDB 2.2.1/Linux, 32-bit crashing ... Closed
is duplicated by SERVER-3339 mongod process hangs after USR1 signal Closed
Related
related to SERVER-4905 logRotate should conform to standard ... Closed
is related to SERVER-9578 mongos dies on SIGUSR1 instead of rot... Closed
Participants:

 Description   

The locking within logRotate only covers the swap of file handles, but renaming an open log file while it is being written to by another thread is not safe. The locking needs to be modified so that no race conditions exist.

In addition, doing log rotation from a signal handler may be unsafe. It would be better to set a flag in the signal handler, and respond to it from a normal (non-signal) code path. For example, we could rotate the log next time something is logged.

Currently there is a deadlock if the current thread is in the middle of logging.



 Comments   
Comment by Stennie Steneker (Inactive) [ 02/Sep/13 ]

Shay: the fixVersions are noted in the issue Details section at the top of the page:

 Fix Version/s: 2.2.5, 2.4.4, 2.5.0

So for the 2.4 release series, 2.4.4 or newer will have the fix.

Regards,
Stephen

Comment by Shay Asher [ 02/Sep/13 ]

is the issue fixed on 2.4.5?

Comment by Tad Marshall [ 12/May/13 ]

Hi John,

I think that you have hit SERVER-9578, which was inadvertently introduced in version 2.4.3.

This is fixed in the nightly 2.4 build; the fix will be in version 2.4.4.

If you want to verify that this is fixed in the latest 2.4 code, you could download http://downloads.mongodb.org/linux/mongodb-linux-x86_64-v2.4-latest.tgz, which is marked as version "2.4.4-pre-".

Tad

Comment by John Dever [ 12/May/13 ]

Let me know if this needs to be a new issue, but I believe this is still occurring on the latest 10gen mongo RPM. Doing a SIGUSR1 against mongos causes an immediate crash under absolutely no load. This has been happening against 8ish servers in my test env, all running identical configs.

This does not occur against mongod.

MongoS version 2.4.3 starting: pid=31081 port=27017 64-bit host=<redacted> (--help for usage)
git version: fe1743177a5ea03e91e0052fb5e2cb2945f6d95f
build sys info: Linux ip-10-2-29-40 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_49

mongod --version
db version v2.4.3
Sun May 12 11:03:12.255 git version: fe1743177a5ea03e91e0052fb5e2cb2945f6d95f

MongoS Log:
Sun May 12 14:46:23.783 [Balancer] Refreshing MaxChunkSize: 64
Sun May 12 14:46:23.784 [Balancer] trying to acquire new distributed lock for balancer on <redacted>.yb.int:27017,stgkopmfmgd02.yb.int:27017,stgkopmfmgd03.yb.int:27017 ( lock timeout : 900000, ping interval : 30000, process : <redacted>:27017:1368369971:1804289383 )
Sun May 12 14:46:23.784 [Balancer] about to acquire distributed lock 'balancer/<redacted>:27017:1368369971:1804289383:
{ "state" : 1,
"who" : "<redacted>:27017:1368369971:1804289383:Balancer:846930886",
"process" : "<redacted>:27017:1368369971:1804289383",
"when" :

{ "$date" : "Sun May 12 14:46:23 2013" }

,
"why" : "doing balance round",
"ts" :

{ "$oid" : "518fab3fe98c422fd4ec59af" }

}
{ "_id" : "balancer",
"state" : 0,
"ts" :

{ "$oid" : "518fab39e98c422fd4ec59ae" }

}
Sun May 12 14:46:23.914 [Balancer] distributed lock 'balancer/<redacted>:27017:1368369971:1804289383' acquired, ts : 518fab3fe98c422fd4ec59af
Sun May 12 14:46:23.914 [Balancer] *** start balancing round
Sun May 12 14:46:23.914 [Balancer] waitForDelete: 0
Sun May 12 14:46:23.914 [Balancer] secondaryThrottle: 1
Sun May 12 14:46:23.915 [Balancer] no collections to balance
Sun May 12 14:46:23.915 [Balancer] no need to move any chunk
Sun May 12 14:46:23.915 [Balancer] *** end of balancing round
Sun May 12 14:46:23.986 [Balancer] distributed lock 'balancer/<redacted>:27017:1368369971:1804289383' unlocked.

Comment by auto [ 18/Apr/13 ]

Author:

{u'date': u'2013-04-18T15:33:22Z', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}

Message: SERVER-4739 use a thread for logRotate signal instead of a signal handler
Branch: v2.2
https://github.com/mongodb/mongo/commit/e8ea40668b1f2dcc0d61760ad3bccd98e0d77b8c

Comment by auto [ 18/Apr/13 ]

Author:

{u'date': u'2013-04-17T15:02:23Z', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}

Message: SERVER-4739 compile mongos on Windows
Branch: v2.4
https://github.com/mongodb/mongo/commit/757fe510df0ee15901000195520f6c1d67c66052

Comment by auto [ 18/Apr/13 ]

Author:

{u'date': u'2013-04-16T15:37:19Z', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}

Message: SERVER-4739 use a thread for logRotate signal instead of a signal handler

Conflicts:
src/mongo/db/db.cpp
src/mongo/s/server.cpp
Branch: v2.4
https://github.com/mongodb/mongo/commit/9d5de58e4fb9129ed630e99e732899411249ac6a

Comment by Johan Hedin [ 17/Apr/13 ]

Tad,

Thanks for the clarification! So I will just wait then. Or try to add the master branch changes to 2.2 or 2.4.

Anyway, at least I learned a bit of Jira usage

Comment by Tad Marshall [ 17/Apr/13 ]

Hi Johan,

Sorry for the confusion; the backports have not happened yet.

This is a result of the way we use the "Backport" field in Jira. When Backport is set to Yes, the plan is to backport the change to one or more earlier versions, and when those versions are chosen they are added to the "Fix Version/s" list. But it isn't actually in those versions until Backport is changed to Done.

Note also the specific versions: 2.2.5 and 2.4.3, neither of which has been created yet.

Tad

Comment by Johan Hedin [ 17/Apr/13 ]

I see this marked as fixed in both 2.2 and 2.4 but I only see a commit in the master branch?!

I'm asking because I would like to bring this in as a patch when I compile mongod 2.2, and soon 2.4, myself.

Comment by auto [ 17/Apr/13 ]

Author:

{u'date': u'2013-04-17T15:02:23Z', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}

Message: SERVER-4739 compile mongos on Windows
Branch: master
https://github.com/mongodb/mongo/commit/04451da971f8de4201e2c8d58b678d1a033106f8

Comment by auto [ 17/Apr/13 ]

Author:

{u'date': u'2013-04-16T15:37:19Z', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}

Message: SERVER-4739 use a thread for logRotate signal instead of a signal handler
Branch: master
https://github.com/mongodb/mongo/commit/1de69d1de81d98c2ecbb5e6cedd31109de15da88

Comment by Johan Hedin [ 07/Apr/13 ]

Thanks again for the update. I'm just about to increase the load on my db quite drastically so this information was really in the last minute!

Comment by Leonid Evdokimov [ 07/Apr/13 ]

db.runCommand does not trigger the bug to the best of my knowledge.

The issue with SIGUSR1 is simple: some functions that are not async-signal-safe are called during log rotation, that crashed mongod process.
I don't know if there is something in log rotation itself. I have set up runCommand instead of kill a year ago in hourly rotation job and have not seen deadlocks anymore since then. I had several deadlocks a month using kill in hourly rotation. Here is log rotation script I use now: https://gist.github.com/darkk/5331671

Comment by Johan Hedin [ 07/Apr/13 ]

Thanks for the explanation Leonid!

Changing from kill -USR1 to what you suggest was actually my thought but I got the impression from the description of this issue that even db.runCommand("logRotate") could crash/dead lock the server!?

Anything you could comment on?

Comment by Leonid Evdokimov [ 07/Apr/13 ]

It becomes extremely dangerous only under high load.
You should rather run mongo --eval 'db.runCommand("logRotate")' for rotation instead of kill -USR1 if you're afraid of triggering deadlock/crash.

Comment by Johan Hedin [ 07/Apr/13 ]

Does this mean that log rotating is basically extremely dangerous? I'm running 2.2.3 and rotate logs by SIGUSR1. Have I just been lucky not having any crashes yet?

Comment by Paul Hamby [ 05/Apr/13 ]

We are experiencing this with 2.2.2

Comment by nosqldb [ 17/Feb/13 ]

We experienced it on 2.2.0.

Comment by Klébert Hodin [ 21/Jan/13 ]

We experienced it on 2.2.2.

Comment by Oded Maimon [ 21/Jan/13 ]

anyone know what versions are affected by this bug?

Generated at Thu Feb 08 03:06:50 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.