[SERVER-11229] Unclean mongod shutdown when slow flush is in progress Created: 16/Oct/13  Updated: 18/Jul/16  Resolved: 18/Jul/16

Status: Closed
Project: Core Server
Component/s: Admin
Affects Version/s: 2.4.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Alexander Komyagin Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Operating System: ALL
Participants:

 Description   

When a slow flush operation is in progress, terminating mongod with SIGTERM can result in an unclean shutdown:

Mon Oct 14 23:35:31.453 [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
Mon Oct 14 23:35:31.454 [signalProcessingThread] now exiting
Mon Oct 14 23:35:31.454 dbexit: 
Mon Oct 14 23:35:31.454 [signalProcessingThread] shutdown: going to close listening sockets...
Mon Oct 14 23:35:31.454 [signalProcessingThread] closing listening socket: 34
Mon Oct 14 23:35:31.454 [signalProcessingThread] closing listening socket: 35
Mon Oct 14 23:35:31.454 [signalProcessingThread] shutdown: going to flush diaglog...
Mon Oct 14 23:35:31.454 [signalProcessingThread] shutdown: going to close sockets...
Mon Oct 14 23:35:31.454 [signalProcessingThread] shutdown: waiting for fs preallocator...
Mon Oct 14 23:35:31.454 [signalProcessingThread] shutdown: lock for final commit...
Mon Oct 14 23:35:31.454 [signalProcessingThread] shutdown: final commit...
Mon Oct 14 23:35:31.464 [rsBackgroundSync] Socket recv() errno:9 Bad file descriptor IP:27017
Mon Oct 14 23:35:31.464 [rsBackgroundSync] SocketException: remote: IP:27017 error: 9001 socket exception [1] server [IP:27017] 
Mon Oct 14 23:35:31.464 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: IP:27017
Mon Oct 14 23:35:31.850 [conn91747] end connection XXX:53591 (60 connections now open)
Mon Oct 14 23:35:32.491 [conn91742] end connection XXX:59239 (59 connections now open)
Mon Oct 14 23:35:32.530 [conn91743] end connection XXX:39650 (58 connections now open)
Mon Oct 14 23:35:32.716 [conn91744] end connection XXX:36543 (57 connections now open)
Mon Oct 14 23:35:46.320 [conn91552] end connection XXX:41064 (56 connections now open)
Mon Oct 14 23:35:46.716 [conn91563] end connection XXX:41163 (55 connections now open)
Mon Oct 14 23:37:01.239 [signalProcessingThread] shutdown: closing all files...
Mon Oct 14 23:37:01.239 [DataFileSync] flushing mmaps took 627745ms  for 317 files
Mon Oct 14 23:37:01.239 [DataFileSync] ERROR: Client::shutdown not called: DataFileSync
 
 
***** SERVER RESTARTED *****
 
 
Mon Oct 14 23:43:32.090 [initandlisten] MongoDB starting : pid=32018 port=27017 dbpath=/db/data 64-bit host=XXX
Mon Oct 14 23:43:32.091 [initandlisten] db version v2.4.3
Mon Oct 14 23:43:32.091 [initandlisten] git version: 37efc4cff49790b5080682772a71c747d449a54d
Mon Oct 14 23:43:32.091 [initandlisten] build info: Linux ip-a-b-c-d 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_49
Mon Oct 14 23:43:32.091 [initandlisten] allocator: tcmalloc
Mon Oct 14 23:43:32.091 [initandlisten] options: { bind_ip: "XXX", config: "/conf/mongodb.cnf", dbpath: "/db/data", directoryperdb: "true", fork: "true", logappend: "true", logpath: "/log/mongodb.log", noauth: "true", nounixsocket: "true", pidfilepath: "/var/mongodb.pid", port: 27017, replSet: "XXX", rest: "true" }
Mon Oct 14 23:43:32.113 [initandlisten] journal dir=/db/data/journal
Mon Oct 14 23:43:32.113 [initandlisten] recover begin

While flush times of 600+ seconds are certainly abnormal (the log above shows `flushing mmaps took 627745ms for 317 files`), mongod should either be smarter about waiting for the in-progress flush to complete before shutting down, or at least print a meaningful error message instead of exiting uncleanly.



 Comments   
Comment by Ian Whalen (Inactive) [ 18/Jul/16 ]

Although we don't have a specific commit to point to, the Integration team is certain that this was fixed during a 3.0 rewrite of the flushing logic.
