[SERVER-23072] Missing stack trace when mongod crashes with heap corruption Created: 09/Mar/16  Updated: 14/Apr/16  Resolved: 11/Apr/16

Status: Closed
Project: Core Server
Component/s: Diagnostics
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Charlie Swanson Assignee: Mark Benvenuto
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-23036 ShardRegistry accesses _configServerC... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Platforms 12 (04/01/16), Platforms 13 (04/22/16)
Participants:

 Description   

In Evergreen we had a failure of csrs_upgrade_during_count.js in which a mongod crashed with the following error:

[js_test:csrs_upgrade_during_count] 2016-03-09T03:16:01.447+0000 d20010| 2016-03-09T03:16:01.447+0000 I CONTROL  [conn20] *** C runtime error: HEAP CORRUPTION DETECTED: after Normal block (#4002541) at 0x0000000001E2BC60., terminating
[js_test:csrs_upgrade_during_count] 2016-03-09T03:16:01.447+0000 d20010| 2016-03-09T03:16:01.447+0000 I -        [conn20] Fatal Assertion 17006
[js_test:csrs_upgrade_during_count] 2016-03-09T03:16:01.457+0000 d20010| 2016-03-09T03:16:01.457+0000 I CONTROL  [conn20] *** unhandled exception 0x80000003 at 0x000007FEFD7131F2, terminating
[js_test:csrs_upgrade_during_count] 2016-03-09T03:16:01.457+0000 d20010| 2016-03-09T03:16:01.457+0000 I CONTROL  [conn20] *** stack trace for unhandled exception:

The lack of a stack trace makes it difficult to debug what actually went wrong.

sharding_WT failed on windows-64-2k8-debug

csrs_upgrade_during_count.js



 Comments   
Comment by Mark Benvenuto [ 11/Apr/16 ]

I have not seen any more reports or a repro.

Comment by Mark Benvenuto [ 18/Mar/16 ]

Actually, here is the sequence of what happened in this code since this is 3.2:

1. The C RTL detects a problem and calls crtDebugCallback().
2. crtDebugCallback() logs a message and calls fassertFailed().
3. fassertFailed() calls breakpoint().
4. Windows calls exceptionFilter() to process the breakpoint as an unhandled exception.
5. exceptionFilter() calls printWindowsStackTrace().

I do not know why the stack trace was not printed. It definitely reached this part of the code. It may have stalled doing the I/O to read the symbols from the PDB. The entire log ends < 4 seconds later.
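
For reference, here is a minimal, hypothetical sketch of how such a chain can be wired up on Windows. This is illustrative only, not the server's code: crtDebugCallback, exceptionFilter and printWindowsStackTrace are stand-in names borrowed from the steps above, and the CRT report hook only fires in debug-CRT builds.

#include <windows.h>
#include <crtdbg.h>
#include <cstdio>

// Stand-in for printWindowsStackTrace(): the real function walks the faulting
// thread's context and resolves symbols from the PDB via dbghelp; here we only
// dump raw return addresses of the current thread.
void printWindowsStackTrace() {
    void* frames[62];
    USHORT n = CaptureStackBackTrace(0, 62, frames, nullptr);
    for (int i = 0; i < n; ++i)
        std::fprintf(stderr, "  frame %d: %p\n", i, frames[i]);
}

// Steps 4/5: top-level filter invoked for unhandled exceptions (including the
// 0x80000003 breakpoint exception when no debugger is attached).
LONG WINAPI exceptionFilter(EXCEPTION_POINTERS* info) {
    std::fprintf(stderr, "*** unhandled exception 0x%lx, terminating\n",
                 info->ExceptionRecord->ExceptionCode);
    std::fprintf(stderr, "*** stack trace for unhandled exception:\n");
    printWindowsStackTrace();
    return EXCEPTION_EXECUTE_HANDLER;  // let the process terminate
}

// Steps 1-3: the debug CRT calls this hook on heap corruption; it logs and
// then traps, which becomes the unhandled exception handled above.
int crtDebugCallback(int reportType, char* message, int* returnValue) {
    std::fprintf(stderr, "*** C runtime error: %s, terminating\n", message);
    DebugBreak();  // stand-in for fassertFailed() -> breakpoint()
    return TRUE;   // not reached; exceptionFilter terminates the process
}

int main() {
    _CrtSetReportHook(crtDebugCallback);           // step 1
    SetUnhandledExceptionFilter(exceptionFilter);  // step 4
    // A heap-corruption bug elsewhere (e.g. the fault injected in the diff
    // below) would now flow through steps 1-5 and print a stack trace.
}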

In a quick test where I injected a similar fault by hand, it worked fine.

diff --git a/src/mongo/db/commands/test_commands.cpp b/src/mongo/db/commands/test_commands.cpp
index 485998c..a7c8cd1 100644
--- a/src/mongo/db/commands/test_commands.cpp
+++ b/src/mongo/db/commands/test_commands.cpp
@@ -163,6 +163,13 @@ public:
         log() << "test only command sleep invoked" << endl;
         long long millis = 0;
 
+        char* b1 = new char[5];
+        b1[0] = '\0';
+        b1[5] = 'A';
+        b1[6] = 'B';
+        printf("b %s\n", b1);
+        delete b1;
+
         if (cmdObj["secs"] || cmdObj["millis"]) {
             if (cmdObj["secs"]) {
                 uassert(34344, "'secs' must be a number.", cmdObj["secs"].isNumber());

Comment by Eric Milkie [ 10/Mar/16 ]

0x80000003 is the breakpoint exception, used to break into the debugger when one is attached. The debugger installs a top-level exception handler to capture this exception.
In our breakpoint() code, we check to see if the debugger is present before raising this exception. Also, I don't believe that breakpoint() was actually called in this situation; perhaps it was something internal to Windows that tried to signal the debugger?
One option here to make things better is to filter out and ignore 0x80000003 exceptions in exceptionFilter(), rather than trying to print a stack trace.
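
To illustrate both points, here is a hypothetical sketch (not the actual breakpoint()/exceptionFilter() implementations; the names are reused from the comment above):

#include <windows.h>
#include <cstdio>

// Only raise the breakpoint exception if a debugger is attached to catch it.
void breakpoint() {
    if (IsDebuggerPresent()) {
        DebugBreak();  // raises 0x80000003 (EXCEPTION_BREAKPOINT)
    }
}

// Suggested hardening: let breakpoint exceptions pass through untouched
// instead of trying to print a stack trace for them.
LONG WINAPI exceptionFilter(EXCEPTION_POINTERS* info) {
    if (info->ExceptionRecord->ExceptionCode == EXCEPTION_BREAKPOINT) {
        return EXCEPTION_CONTINUE_SEARCH;  // ignore 0x80000003
    }
    std::fprintf(stderr, "*** unhandled exception 0x%lx, terminating\n",
                 info->ExceptionRecord->ExceptionCode);
    // ... print the stack trace for real faults here ...
    return EXCEPTION_EXECUTE_HANDLER;
}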

In the above issue, what I believe should have happened is this:
1. The C RTL detects a problem and calls crtDebugCallback().
2. crtDebugCallback() logs a message and calls fassertFailed().
3. fassertFailed() logs another message and calls std::abort().
4. std::abort() raises SIGABRT which calls into our registered signal handler, abruptQuit().
5. abruptQuit() calls printSignalAndBacktrace() and then calls breakpoint().
6. abruptQuit() calls endProcessWithSignal(), which calls _exit().

The log above shows that steps 1-4 seem to have worked as expected, but step 5 somehow called breakpoint() first – and IsDebuggerPresent() erroneously returned "true"? Or else something entirely different happened. I'm not sure.
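
For illustration, a minimal sketch of that expected abort path (hypothetical; abruptQuit is a stand-in named after the handler above, and the real handler also prints a backtrace and calls breakpoint() before exiting):

#include <csignal>
#include <cstdlib>
#include <cstdio>

// Stand-in for abruptQuit(): registered for SIGABRT, so std::abort() from
// fassertFailed() lands here (steps 3-4 above).
void abruptQuit(int sig) {
    std::fprintf(stderr, "Got signal %d (abort)\n", sig);
    // Step 5: printSignalAndBacktrace() would dump the stack here, then
    // breakpoint() would trap only if a debugger is attached.
    // Step 6: end the process without unwinding or running atexit handlers.
    std::_Exit(EXIT_FAILURE);  // endProcessWithSignal() uses _exit(); std::_Exit is equivalent here
}

int main() {
    std::signal(SIGABRT, abruptQuit);  // the registration step 4 depends on
    std::abort();                      // simulates fassertFailed() (step 3)
}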

Comment by Spencer Brody (Inactive) [ 09/Mar/16 ]

Could be SERVER-23036? Hard to tell for certain though without the stack trace

Comment by Spencer Brody (Inactive) [ 09/Mar/16 ]

Hmm, one of the shard mongods crashed:

[js_test:csrs_upgrade_during_count] 2016-03-09T03:16:01.447+0000 d20010| 2016-03-09T03:16:01.447+0000 I CONTROL  [conn20] *** C runtime error: HEAP CORRUPTION DETECTED: after Normal block (#4002541) at 0x0000000001E2BC60., terminating
[js_test:csrs_upgrade_during_count] 2016-03-09T03:16:01.447+0000 d20010| 2016-03-09T03:16:01.447+0000 I -        [conn20] Fatal Assertion 17006
[js_test:csrs_upgrade_during_count] 2016-03-09T03:16:01.457+0000 d20010| 2016-03-09T03:16:01.457+0000 I CONTROL  [conn20] *** unhandled exception 0x80000003 at 0x000007FEFD7131F2, terminating
[js_test:csrs_upgrade_during_count] 2016-03-09T03:16:01.457+0000 d20010| 2016-03-09T03:16:01.457+0000 I CONTROL  [conn20] *** stack trace for unhandled exception:

Unfortunately there doesn't seem to actually be a stack trace anywhere...
