[SERVER-21592] Crash with "checkpoint server error" if early shutdown is invoked due to socket error at startup Created: 20/Nov/15 Updated: 06/Dec/22 Resolved: 08/Feb/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kaloian Manassiev | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Assigned Teams: |
Storage Execution
|
| Operating System: | ALL |
| Steps To Reproduce: | Not a reliable repro, but trying to start mongod on a port that is already in use should do the trick. |
| Participants: |
| Description |
|
This is specific to mongod with the WiredTiger storage engine and only happens during early shutdown. If mongod invokes shutdown very early in the startup sequence (say, because it cannot bind to the listening socket since the port is already in use), the shutdown may catch the WiredTiger engine while it is still initializing and cause it to crash. The call stack below shows the location of the crash, and I am also attaching the complete logs.
|
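The trigger condition described above (a bind that fails because the port is taken) can be shown at the OS level with a minimal Python sketch. This is not mongod code; the function name `second_bind_fails` and the use of the loopback address are assumptions made for illustration only.

```python
import errno
import socket


def second_bind_fails(host="127.0.0.1"):
    """Return the errno raised when binding to a port that is already listening.

    Hypothetical helper for illustration; returns None if the second bind succeeds.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # Same address and port, no SO_REUSEADDR: expected to be refused.
            second.bind((host, port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()
```

On typical systems the returned value should equal `errno.EADDRINUSE`, which is the kind of startup failure that would send a server down its early-shutdown path.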
| Comments |
| Comment by Alexander Gorrod [ 08/Feb/17 ] |
|
I created a script that fails to start MongoDB because it specifies an in-use port, and that replicates most of the command from the attached log file. I ran several thousand iterations against a build of the 3.2.0-rc0 version of MongoDB and didn't see the reported failure. The script I used was:
I'm going to close this ticket - if you come across a way to reproduce the symptom I'm happy to chase harder. |
| Comment by Max Hirschhorn [ 23/Nov/15 ] |
|
resmoke.py performs any cleanup of the data directory before starting the next test, not after the previous one completes. jstests/sharding/split_with_force.js uses ShardingTest to start a sharded cluster. ShardingTest.prototype.stop() will send a SIGTERM to the mongos processes, mongod shard processes, and mongod config server processes, in that order. After sending the signal, it calls wait_for_pid(). Once all the processes are terminated, it calls resetDbpath() to delete the data directory. |
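The ordering described in this comment (signal, wait for exit, and only then delete the data directory) can be sketched in Python as follows. This is not the ShardingTest or resmoke.py implementation: `stop_then_clean` and `demo` are hypothetical names, the child processes are sleeping Python interpreters standing in for mongos/mongod, and signalling and waiting one process at a time is just one reading of the description.

```python
import shutil
import signal
import subprocess
import sys
import tempfile
from pathlib import Path


def stop_then_clean(procs, dbpath, timeout=30.0):
    """Terminate each process in order, then delete dbpath once all have exited."""
    exit_codes = []
    for proc in procs:
        proc.send_signal(signal.SIGTERM)
        # Block until this process is really gone before moving on.
        exit_codes.append(proc.wait(timeout=timeout))
    # Only reached after every wait() has returned, so no process
    # can still be using files under dbpath.
    shutil.rmtree(dbpath)
    return exit_codes


def demo():
    """Run the sequence against stand-in processes and a temporary directory."""
    dbpath = Path(tempfile.mkdtemp(prefix="dbpath_sketch_"))
    (dbpath / "placeholder.wt").write_text("stand-in data file")
    procs = [
        subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
        for _ in range(3)
    ]
    exit_codes = stop_then_clean(procs, dbpath)
    return exit_codes, dbpath.exists()
```

Because the directory is removed only after every `wait()` returns, this ordering would not pull the database directory out from under a storage engine that is still shutting down.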
| Comment by Kaloian Manassiev [ 23/Nov/15 ] |
|
michael.cahill, this happened during a run of a js test from the sharding suite, so there is some possibility that the test deleted the data directories before the shutdown completed (although I don't see any message in the logs). max.hirschhorn, do you know if there is some synchronization in resmoke.py to wait for mongod to fully stop before deleting the data directories (or whether they are deleted at all)? |
| Comment by Michael Cahill (Inactive) [ 23/Nov/15 ] |
|
That error would also happen if the database directory was removed from underneath WiredTiger before the shutdown was complete. Is something like that possible? |
| Comment by Ramon Fernandez Marina [ 21/Nov/15 ] |
|
I couldn't reproduce this with 3.2.0-rc3 after many attempts, as follows:
It probably still warrants a closer look at the code, though. |