[SERVER-17570] MongoDB 3.0 NT Service shutdown race condition with db.serverShutdown() Created: 12/Mar/15  Updated: 19/Sep/15  Resolved: 18/Mar/15

Status: Closed
Project: Core Server
Component/s: Admin
Affects Version/s: 3.0.0
Fix Version/s: 3.0.2, 3.1.1

Type: Bug Priority: Critical - P2
Reporter: David Golub Assignee: Mark Benvenuto
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongod.exe-(PID-10816).dmp    
Issue Links:
Related
Backwards Compatibility: Minor Change
Operating System: ALL
Backport Completed:
Sprint: Platform 1 04/03/15
Participants:

 Description   

There is a crash in mongod while it is being terminated during some of the Automation Agent tests on Windows. The crash only happens with MongoDB 3.0. Running the same tests with MongoDB 2.6 or 2.4 does not produce the crash. Crash dump file is attached.



 Comments   
Comment by Githook User [ 19/Mar/15 ]

Author: Mark Benvenuto (markbenvenuto) <mark.benvenuto@mongodb.com>

Message: SERVER-17570: Fix NT Service shutdown race condition

(cherry picked from commit 0f6f2e9a199bff05c69ca5aace94a5ac8fe2cff3)
Branch: v3.0
https://github.com/mongodb/mongo/commit/4e1f873727914b341f9638b1a36fef66cfebfa16

Comment by Mark Benvenuto [ 19/Mar/15 ]

Here are the bug details:

There are two different execution models for mongod on Windows:
1. Console program

  • wmain() is called on thread 0, and we listen for connections on thread 0

2. NT Service

  • wmain() is called on thread 0, then StartServiceCtrlDispatcherW is called, and
    we listen for connections on a thread spawned by sechost.dll (for example,
    thread 3), called ServiceMain. Thread 0 is used for processing NT Service
    control manager events.

(Note: thread numbers are example thread numbers from a debugger)

The bug:

There is a race condition between exitCleanly and StartServiceCtrlDispatcherW
returning. When exitCleanly is called via CmdShutdown::run,
StartServiceCtrlDispatcherW returns before exitCleanly finishes, and therefore
startService calls quickExit. This results in an unclean shutdown.
StartServiceCtrlDispatcherW returns early because initAndListen finishes as
soon as inShutdown is set to true.

This does not happen if the process receives a SERVICE_CONTROL_STOP (i.e., sc.exe
stop mongodb), because this event is processed on thread 0, and therefore
StartServiceCtrlDispatcherW will not return until serviceShutdown has
completed.
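
The problematic ordering can be modelled with plain C++ threads. This is only a rough sketch, not the mongod source: RaceOutcome and simulateShutdownCommandRace are hypothetical names, the gate exists purely to hold the cleanup open so the ordering is reproducible, and setting the flag before the cleanup work is an assumption about ordering.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Hypothetical model of the interleaving described above; not mongod code.
struct RaceOutcome {
    bool listenerStopped;          // the stand-in for initAndListen returned
    bool cleanupFinishedAtReturn;  // had the cleanup finished at that moment?
};

RaceOutcome simulateShutdownCommandRace() {
    std::atomic<bool> inShutdown{false};
    std::atomic<bool> cleanupFinished{false};
    std::promise<void> letCleanupFinish;
    std::future<void> gate = letCleanupFinish.get_future();

    // Stand-in for the ServiceMain thread: listens until inShutdown is set.
    std::thread serviceMain([&] {
        while (!inShutdown.load()) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    });

    // Stand-in for the thread running the shutdown command. The flag is set
    // first (an assumption), then slow cleanup runs; the gate keeps that
    // cleanup in progress until the main thread releases it.
    std::thread shutdownCommand([&] {
        inShutdown.store(true);
        gate.wait();
        cleanupFinished.store(true);
    });

    // Stand-in for thread 0: the dispatcher call returns once ServiceMain ends.
    serviceMain.join();
    RaceOutcome outcome{true, cleanupFinished.load()};
    // In the scenario from the comment, quickExit would run at this point.

    letCleanupFinish.set_value();
    shutdownCommand.join();
    assert(cleanupFinished.load());  // cleanup does complete, just too late
    return outcome;
}
```

In this model cleanupFinishedAtReturn is always false, which corresponds to quickExit running while the cleanup is still in progress.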

The fix:

1. Create a method called signalShutdown that sets inShutdown to true so the
listener loops stop on the ServiceMain thread.
2. Change exitCleanly in mongod to use a lock to provide mutual exclusion
instead of atomic operations.
3. Clean up ntservice.cpp now that signalShutdown exists and we have made
initService the only caller of exitCleanly.
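
The shape of steps 1 and 2 can be sketched in portable C++. This is an illustration under stated assumptions, not the actual commit: the function names are borrowed from the comment above, but their bodies, the globals, listenerLoop, and runShutdownScenario are hypothetical. The real exitCleanly terminates the process, whereas this one returns so the behaviour can be checked.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-ins; not the mongod source.
std::atomic<bool> inShutdown{false};
std::mutex shutdownMutex;
bool cleanupDone = false;  // guarded by shutdownMutex
int cleanupRuns = 0;       // guarded by shutdownMutex

// Step 1: only flips the flag so listener loops can stop.
void signalShutdown() { inShutdown.store(true); }

// Step 2: a lock gives mutual exclusion. A concurrent caller blocks until
// the first caller's cleanup is complete, then sees cleanupDone and returns.
void exitCleanly() {
    std::lock_guard<std::mutex> lock(shutdownMutex);
    if (cleanupDone) return;
    std::this_thread::sleep_for(std::chrono::milliseconds(20));  // slow cleanup
    ++cleanupRuns;
    cleanupDone = true;
}

// Models the listener loop that runs on the ServiceMain thread.
void listenerLoop() {
    while (!inShutdown.load()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}

// Returns true if cleanup ran exactly once and was complete by the time a
// concurrent caller of exitCleanly got control back.
bool runShutdownScenario() {
    std::thread serviceMain([] { listenerLoop(); exitCleanly(); });
    std::thread shutdownCommand([] { signalShutdown(); });
    shutdownCommand.join();
    assert(inShutdown.load());

    exitCleanly();  // concurrent caller: runs the cleanup or waits for it
    bool doneOnReturn;
    {
        std::lock_guard<std::mutex> lock(shutdownMutex);
        doneOnReturn = cleanupDone;
    }
    serviceMain.join();
    return doneOnReturn && cleanupRuns == 1;
}
```

With the lock, whichever thread arrives second cannot get past exitCleanly until the cleanup has finished, so nothing that runs after it can observe a half-finished shutdown.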

Comment by Githook User [ 18/Mar/15 ]

Author: Mark Benvenuto (markbenvenuto) <mark.benvenuto@mongodb.com>

Message: SERVER-17570: Fix NT Service shutdown race condition
Branch: master
https://github.com/mongodb/mongo/commit/0f6f2e9a199bff05c69ca5aace94a5ac8fe2cff3
