[SERVER-21592] Crash with "checkpoint server error" if early shutdown is invoked due to socket error at startup
Created: 20/Nov/15  Updated: 06/Dec/22  Resolved: 08/Feb/17

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Kaloian Manassiev
Assignee: Backlog - Storage Execution Team
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File crash.log    
Assigned Teams:
Storage Execution
Operating System: ALL
Steps To Reproduce:

Not a reliable repro, but starting mongod on a port that is already in use should trigger it.

Participants:

 Description   

This is specific to mongod with the WiredTiger storage engine and only happens during early shutdown.

If mongod invokes shutdown very early in the startup sequence (say, because it cannot bind to the listening socket since the port is already in use), the shutdown may catch the WiredTiger engine still initializing and cause it to crash. The call stack below shows the location of the crash; the complete logs are attached.

mongo::printStackTrace(std::ostream&) at /tmp/TestRunDirectory/mongo/src/mongo/util/stacktrace_posix.cpp:171
mongo::(anonymous namespace)::printSignalAndBacktrace(int) at /tmp/TestRunDirectory/mongo/src/mongo/util/signal_handlers_synchronous.cpp:179
 (inlined by) mongo::(anonymous namespace)::abruptQuit(int) at /tmp/TestRunDirectory/mongo/src/mongo/util/signal_handlers_synchronous.cpp:235
?? ??:0
?? ??:0
?? ??:0
mongo::fassertFailed(int) at /tmp/TestRunDirectory/mongo/src/mongo/util/assert_util.cpp:172
mongo::(anonymous namespace)::mdb_handle_error(__wt_event_handler*, __wt_session*, int, char const*) at /tmp/TestRunDirectory/mongo/src/mongo/util/assert_util.h:214
__wt_eventv at /tmp/TestRunDirectory/mongo/src/third_party/wiredtiger/src/support/err.c:286
__wt_err at /tmp/TestRunDirectory/mongo/src/third_party/wiredtiger/src/support/err.c:311
__wt_panic at /tmp/TestRunDirectory/mongo/src/third_party/wiredtiger/src/support/err.c:494
__ckpt_server at /tmp/TestRunDirectory/mongo/src/third_party/wiredtiger/src/conn/conn_ckpt.c:124
?? ??:0
?? ??:0



 Comments   
Comment by Alexander Gorrod [ 08/Feb/17 ]

I created a script that fails starting MongoDB because it specifies an in-use port and replicates most of the command from the attached log file. I ran several thousand iterations against a build of the 3.2.0-rc0 version of MongoDB and didn't see the reported failure. The script I used was:

$ cat s21592_run.sh
#!/bin/bash
 
DBPATH="$(pwd)/data"
LOGPATH="$(pwd)/data/mdb.log"
PORT=22
 
for i in $(seq 1000); do
	# Start every iteration from an empty data directory.
	rm -rf "$DBPATH" && mkdir "$DBPATH"
	./mongod --oplogSize 40 --port "$PORT" --smallfiles --replSet test-configRS --dbpath "$DBPATH" --logpath "$LOGPATH" --journal --configsvr --storageEngine wiredTiger --nopreallocj --setParameter enableTestCommands=1
	# Stop at the first run whose log mentions a panic.
	if grep -q panic "$LOGPATH"; then
		exit 1
	fi
done

I'm going to close this ticket - if you come across a way to reproduce the symptom I'm happy to chase harder.

Comment by Max Hirschhorn [ 23/Nov/15 ]

resmoke.py cleans up the data directory before starting the next test, not after the previous one completes. jstests/sharding/split_with_force.js uses ShardingTest to start a sharded cluster. ShardingTest.prototype.stop() sends a SIGTERM to the mongos processes, the mongod shard processes, and the mongod config server processes, in that order. After sending each signal, it calls wait_for_pid(). Only once all the processes have terminated does it call resetDbpath() to delete the data directory.
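
The ordering described above (signal, wait for exit, then delete) can be sketched in bash as follows. This is only an illustration, not the ShardingTest or resmoke.py code: stop_cluster is a hypothetical helper, the sleep processes stand in for mongos and mongod, and waiting after each individual signal is an assumption about how wait_for_pid() is used.

```shell
#!/bin/bash
# Sketch only: stop_cluster is a hypothetical name, and the 'sleep'
# processes below are stand-ins for mongos / mongod.

stop_cluster() {
	local dbpath=$1
	shift
	local pid
	for pid in "$@"; do
		# Ask the process to shut down ...
		kill -TERM "$pid" 2>/dev/null
		# ... and block until it has actually exited.
		wait "$pid" 2>/dev/null
	done
	# Only now, with no process left running, delete the data directory.
	rm -rf "$dbpath"
}

dbpath=$(mktemp -d)
sleep 300 & mongos_pid=$!
sleep 300 & shard_pid=$!
sleep 300 & config_pid=$!

# Same order as in the comment: mongos, shard mongod, config server mongod.
stop_cluster "$dbpath" "$mongos_pid" "$shard_pid" "$config_pid"
```

Because the directory is removed only after every wait has returned, a sequence like this cannot pull the data directory out from under a process that is still shutting down.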

Comment by Kaloian Manassiev [ 23/Nov/15 ]

michael.cahill, this happened during a run of a js test from the sharding suite, so there is some possibility that the test deleted the data directories before the shutdown completed (although I don't see any message in the logs). max.hirschhorn, do you know if there is some synchronization in resmoke.py to wait for mongod to fully stop before deleting the data directories (or whether they are deleted at all)?

Comment by Michael Cahill (Inactive) [ 23/Nov/15 ]

That error would also happen if the database directory was removed from underneath WiredTiger before the shutdown was complete. Is something like that possible?

Comment by Ramon Fernandez Marina [ 21/Nov/15 ]

Couldn't repro with 3.2.0-rc3 after many attempts as follows:

mongod --dbpath /tmp/db --port 20013 --logpath /tmp/mdb.log --fork
buildscripts/resmoke.py jstests/sharding/split_with_force.js

But probably warrants a closer look at the code.
