[SERVER-47198] SIGUSR1 can cause MongoDB process crash Created: 31/Mar/20  Updated: 29/Oct/23  Resolved: 29/Jun/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.6.17
Fix Version/s: 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Tommy Lee Assignee: Billy Donahue
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File killed.log     File server-47198.diff    
Issue Links:
Backports
Duplicate
is duplicated by SERVER-57690 Allow log rotation via SIGUSR1 to fai... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0, v4.4, v3.6
Steps To Reproduce:
  1. Delete the MongoDB log file, e.g., /var/log/mongodb/00-shard03-replica.log
  2. kill -SIGUSR1 <mongod process id>
  3. The process will exit
Sprint: Service Arch 2021-06-28, Service Arch 2021-07-12
Participants:
Case:

 Description   

After deleted the MongoDB log file then send the signal USR1 with kill command, 

then mongo process will crash

Mongo Version: 3.6.17

Platform: CentOS 7 64bit

kill -SIGUSR1 <mongod process id>



 Comments   
Comment by Githook User [ 28/Jun/21 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-47198 rotateLogs: tolerate unexpectedly missing or unexpectedly existing files
Branch: master
https://github.com/mongodb/mongo/commit/75fdaf5d32a8ece4de9ed673b10d1d02256f4172

Comment by Githook User [ 28/Jun/21 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@gmail.com', 'username': 'BillyDonahue'}

Message: SERVER-47198 logRotate tolerate nonfatal file existence and nonexistence
Branch: master
https://github.com/10gen/mongo-enterprise-modules/commit/d9a66326b9874a7e0e2aa03dbdb9275f710b672a

Comment by Billy Donahue [ 14/Jun/21 ]

Sara, I can send you a CR off of my SERVER-47198 github branch if you'd like, since I've got a head start on it.

Comment by Billy Donahue [ 22/Sep/20 ]

My recommendation for what it's worth would be to ignore a "not found" error returned by the rename part of the log rotation.

If the boost::filesystem::rename step comes back with errc::no_such_file_or_directory, we can calmly assume the old log file was moved aside by an administrative procedure. It happens. We can maybe log a warning about it. We can continue, opening a new log file as we would have done anyway.

I think server-47198.diff is the only code change we'd need
[ https://github.com/mongodb/mongo/compare/master...BillyDonahue:SERVER-47198 ]

Comment by Billy Donahue [ 22/Sep/20 ]

Reproduced with master branch code right now.
I did an 'mv' to move the log out of the way, with the expectation that mongo would continue writing to the file descriptor it's holding.
Then hit mongod with SIGUSR1 to rotate the logs and open a new file. Mongod crashes and dumps core.

The failure is logged to the moved-away old log file. (killed.log)

 

Core dump shows the problem is that we fassert that logv2::rotateLogs succeeds.

(gdb) up
#4 mongo::(anonymous namespace)::handleOneSignal (rotation=<synthetic pointer>, rotation=<synthetic pointer>, waited=...) at src/mongo/util/signal_handlers.cpp:240
240 fassert(16782, logv2::rotateLogs(serverGlobalParams.logRenameOnRotate));

Log shows:

{"t":{"$date":"2020-09-22T17:40:52.581+00:00"},"s":"W", "c":"CONTROL", "id":23168, "ctx":"SignalHandler","msg":"Log rotation failed","attr":{"reason: {"code":37,"codeName":"FileRenameFailed","errmsg":"Failed to rename /home/ubuntu/mongo-dev/work/mongo-vscode/killme.log to /home/ubuntu/mongo-dev/work/mongo-vscode/killme.log.2020-09-22T17-40-52: No such file or directory"}

So while that's weird and disappointing, I don't think the server needs to CRASH from it.
It would be reasonable to absorb the rename failure, and proceed to open a new log file and continue operating. It's a user operation or filesystem operational glitch, and there was no mysterious condition to debug.

Comment by Tommy Lee [ 03/Apr/20 ]

We used to use the following command for log rotating

kill -SIGUSR1 <mongod process id>

Now I changed to MongoDB native log rotate command

mongo --port 10001 -u admin -p "`awk -F'"' '{print $6}' /root/.mongorc.js`" --eval "db.adminCommand( { logRotate : 1 } );"

 

Using MongoDB native log rotate command is more safe, if log rotate fail, the process still there. For SIGUSR1  will cause the process crash.

 

Comment by Tommy Lee [ 03/Apr/20 ]

Hi Carl,

very appreciate to hear your response, if need me provide more information please let me know.

We are  a huge online game  company, using MongoDB a lot, lucky caught the bug in staging environment.

And we using the same version in the production environment. That's a potential risk.

 

Thanks  

 

Regards,

Tommy

Comment by Carl Champain (Inactive) [ 02/Apr/20 ]

Hi ftcd2016@gmail.com,

Thank you for the report.
We were able to recreate your issue. We are passing this ticket along to the appropriate team for additional investigation.

Kind regards,
Carl

Generated at Thu Feb 08 05:13:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.