[SERVER-19008] Fatal Assertion 28559 - During cap collection stress w/ WT Created: 17/Jun/15  Updated: 04/Aug/15  Resolved: 08/Jul/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Eitan Klein Assignee: Keith Bostic (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: 32qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File rs3primary.txt    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

I'm going to check whether this is reproducible, and whether I can recover from it.

The workload:
./capped_insert.sh
16 threads against a single collection

Participants:

Description

Environment:

  • Single node replica set

Workload:
Capped collection stress

DB version: v3.0.4-rc0

32704 2015-06-16T21:51:28.911-0400 I COMMAND  [conn14] command sbtest.$cmd command: insert { insert: "sbtest", documents: [ { name: 76501385, group: 4429318211268616576, payload3: BinData(0, 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...), payload: [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], payload1: BinData(0, 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...), payload2: BinData(0, 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...), payload4: BinData(0, 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...), payload5: BinData(0, 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...), payload6: BinData(0, 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...) } ] } keyUpdates:0 writeConflicts:0 numYields:0 reslen:80 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } 141ms
32705 2015-06-16T21:51:28.911-0400 I COMMAND  [conn15] command sbtest.$cmd command: insert { insert: "sbtest", documents: [ { payload: [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], payload2: BinData(0, 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...), payload3: BinData(0, 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...), payload4: BinData(0, 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...), payload5: BinData(0, 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...), name: 76501371, group: 4571080436024877535, payload1: BinData(0, 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...), payload6: BinData(0, 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...) } ] } keyUpdates:0 writeConflicts:0 numYields:0 reslen:80 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } 142ms
32706 2015-06-16T21:51:28.911-0400 I COMMAND  [conn7] command sbtest.$cmd command: insert { insert: "sbtest", documents: [ { group: 9039435050900736449, payload: [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], payload2: BinData(0, 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...), payload3: BinData(0, 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...), payload4: BinData(0, 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...), payload5: BinData(0, 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...), name: 76501381, payload1: BinData(0, 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...), payload6: BinData(0, 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...) } ] } keyUpdates:0 writeConflicts:0 numYields:0 reslen:80 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } 141ms
32707 2015-06-16T21:51:31.283-0400 E STORAGE  [WiredTigerRecordStoreThread for local.oplog.rs] WiredTiger (0) [1434505891:283308][4028:140704081122080], file:collection-4--816086440416248550.wt, cursor.next: read checksum error [4096B @ 3318939648, 4114503719 != 1863859940]
32708 2015-06-16T21:51:31.284-0400 E STORAGE  [WiredTigerRecordStoreThread for local.oplog.rs] WiredTiger (0) [1434505891:283308][4028:140704081122080], file:collection-4--816086440416248550.wt, cursor.next: collection-4--816086440416248550.wt: encountered an illegal file format or internal value
32709 2015-06-16T21:51:31.284-0400 I -        [conn5] Fatal Assertion 28559
32710 2015-06-16T21:51:31.284-0400 E STORAGE  [WiredTigerRecordStoreThread for local.oplog.rs] WiredTiger (-31804) [1434505891:284308][4028:140704081122080], file:collection-4--816086440416248550.wt, cursor.next: the process must exit and restart: WT_PANIC: WiredTiger library panic
32711 2015-06-16T21:51:31.284-0400 I -        [conn14] Fatal Assertion 28559
32712 2015-06-16T21:51:31.284-0400 I -        [conn11] Fatal Assertion 28559
32713 2015-06-16T21:51:31.284-0400 I -        [conn6] Fatal Assertion 28559
32714 2015-06-16T21:51:31.284-0400 I -        [conn18] Fatal Assertion 28559
32715 2015-06-16T21:51:31.284-0400 I -        [conn16] Fatal Assertion 28559
32716 2015-06-16T21:51:31.284-0400 I -        [conn7] Fatal Assertion 28559
32717 2015-06-16T21:51:31.284-0400 I -        [conn15] Fatal Assertion 28559
32718 2015-06-16T21:51:31.284-0400 I -        [conn9] Fatal Assertion 28559
32719 2015-06-16T21:51:31.284-0400 I -        [conn13] Fatal Assertion 28559
32720 2015-06-16T21:51:31.285-0400 I -        [conn17] Fatal Assertion 28559
32721 2015-06-16T21:51:31.284-0400 I -        [WiredTigerRecordStoreThread for local.oplog.rs] Fatal Assertion 28558
32722 2015-06-16T21:51:31.285-0400 I -        [conn4] Fatal Assertion 28559
32723 2015-06-16T21:51:31.285-0400 I -        [conn3] Fatal Assertion 28559
32724 2015-06-16T21:51:31.285-0400 I -        [conn10] Fatal Assertion 28559
32725 2015-06-16T21:51:31.284-0400 I -        [conn8] Fatal Assertion 28559
32726 2015-06-16T21:51:31.285-0400 I -        [conn12] Fatal Assertion 28559
32727 2015-06-16T21:51:31.885-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\util\stacktrace_win.cpp(175)                            mongo::printStackTrace+0x43
32728 2015-06-16T21:51:31.885-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\util\log.cpp(135)                                       mongo::logContext+0x97
32729 2015-06-16T21:51:31.886-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\util\assert_util.cpp(165)                               mongo::fassertFailed+0x80
32730 2015-06-16T21:51:31.886-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\storage\wiredtiger\wiredtiger_util.cpp(62)           mongo::wtRCToStatus_slow+0x83
32731 2015-06-16T21:51:31.886-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\storage\wiredtiger\wiredtiger_util.h(61)             mongo::wtRCToStatus+0x28
32732 2015-06-16T21:51:31.886-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\storage\wiredtiger\wiredtiger_record_store.cpp(589)  mongo::WiredTigerRecordStore::insertRecord+0x3c9
32733 2015-06-16T21:51:31.886-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\storage\wiredtiger\wiredtiger_record_store.cpp(625)  mongo::WiredTigerRecordStore::insertRecord+0x98
32734 2015-06-16T21:51:31.886-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\catalog\collection.cpp(181)                          mongo::Collection::insertDocument+0x6f
32735 2015-06-16T21:51:31.886-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\repl\oplog.cpp(266)                                  mongo::repl::`anonymous namespace'::_logOpRS+0x604
32736 2015-06-16T21:51:31.886-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\repl\oplog.cpp(370)                                  mongo::repl::logOp+0x84
32737 2015-06-16T21:51:31.887-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\commands\write_commands\batch_executor.cpp(1209)     mongo::singleInsert+0x142
32738 2015-06-16T21:51:31.887-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\commands\write_commands\batch_executor.cpp(1128)     mongo::insertOne+0x15b
32739 2015-06-16T21:51:31.887-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\commands\write_commands\batch_executor.cpp(1179)     mongo::WriteBatchExecutor::execOneInsert+0xf5
32740 2015-06-16T21:51:31.887-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\commands\write_commands\batch_executor.cpp(912)      mongo::WriteBatchExecutor::execInserts+0x242
32741 2015-06-16T21:51:31.887-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\commands\write_commands\batch_executor.cpp(772)      mongo::WriteBatchExecutor::bulkExecute+0x58
32742 2015-06-16T21:51:31.887-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\commands\write_commands\batch_executor.cpp(275)      mongo::WriteBatchExecutor::executeBatch+0x54f
32743 2015-06-16T21:51:31.887-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\commands\write_commands\write_commands.cpp(149)      mongo::WriteCmd::run+0x151
32744 2015-06-16T21:51:31.888-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\dbcommands.cpp(1260)                                 mongo::_execCommand+0x6a
32745 2015-06-16T21:51:31.888-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\dbcommands.cpp(1478)                                 mongo::Command::execCommand+0xd53
32746 2015-06-16T21:51:31.888-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\dbcommands.cpp(1550)                                 mongo::_runCommands+0x440
32747 2015-06-16T21:51:31.888-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\query\find.cpp(137)                                  mongo::runCommands+0x43
32748 2015-06-16T21:51:31.888-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\query\find.cpp(609)                                  mongo::runQuery+0x3c3
32749 2015-06-16T21:51:31.888-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\instance.cpp(218)                                    mongo::receivedQuery+0x36b
32750 2015-06-16T21:51:31.888-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\instance.cpp(400)                                    mongo::assembleResponse+0x352
32751 2015-06-16T21:51:31.888-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\db\db.cpp(207)                                          mongo::MyMessageHandler::process+0xb8
32752 2015-06-16T21:51:31.889-0400 I CONTROL  [conn14] mongod.exe    ...\src\mongo\util\net\message_server_port.cpp(231)                   mongo::PortMessageServer::handleIncomingMsg+0x573
32753 2015-06-16T21:51:31.889-0400 I CONTROL  [conn14] mongod.exe    ...\src\third_party\boost\libs\thread\src\win32\thread.cpp(185)       boost::`anonymous namespace'::thread_start_function+0x21
32754 2015-06-16T21:51:31.889-0400 I CONTROL  [conn14] MSVCR120.dll                                                                        beginthreadex+0x107
32755 2015-06-16T21:51:31.889-0400 I CONTROL  [conn14] MSVCR120.dll                                                                        endthreadex+0x192
32756 2015-06-16T21:51:31.889-0400 I CONTROL  [conn14] KERNEL32.DLL                                                                        BaseThreadInitThunk+0xd
32757 2015-06-16T21:51:31.889-0400 I CONTROL  [conn14] 
32758 2015-06-16T21:51:31.890-0400 I -        [conn14] 



Comments
Comment by Keith Bostic (Inactive) [ 08/Jul/15 ]

We can't reproduce the failure.

I'm still interested in adding some kind of light-weight tracing that would allow us to debug these failures when we only have a single occurrence to work with, but we don't have any strong design candidates so far.

Comment by Keith Bostic (Inactive) [ 25/Jun/15 ]

In that case I'd add it to verbose: verbose=[writepath], where writepath is some subset of reconciliation, block, checkpoint, and write (write being a history of every page write, with human-readable offsets and sizes).
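
For illustration only, a minimal sketch (not the change under discussion) of how an application would turn such tracing on at wiredtiger_open() time. The "block", "checkpoint", "reconcile" and "write" verbose categories already exist in builds with verbose support enabled; a combined "writepath" group would be a new, hypothetical addition, and the "WT_HOME" directory name is just a placeholder.

/*
 * Sketch: open a WiredTiger connection with write-path verbose tracing.
 */
#include <stdio.h>
#include <stdlib.h>
#include <wiredtiger.h>

int
main(void)
{
    WT_CONNECTION *conn;
    int ret;

    /* Enable write-path verbose categories when opening the connection. */
    if ((ret = wiredtiger_open("WT_HOME", NULL,
        "create,verbose=[block,checkpoint,reconcile,write]", &conn)) != 0) {
        fprintf(stderr, "wiredtiger_open: %s\n", wiredtiger_strerror(ret));
        return (EXIT_FAILURE);
    }

    /*
     * Verbose messages are delivered through the event handler's message
     * callback (stderr by default), so running a workload here would trace
     * every page write with its offset and size.
     */
    return (conn->close(conn, NULL) == 0 ? EXIT_SUCCESS : EXIT_FAILURE);
}

In a mongod context, the equivalent configuration string could presumably be passed via the --wiredTigerEngineConfigString startup option rather than calling wiredtiger_open() directly.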

I'm still conflicted: if I can reproduce the problem and turn on this special diagnostic, then I don't need this special diagnostic because I can reproduce the problem. If I can't reproduce the problem, the diagnostic does me no good.

I think we need something light-weight enough to have turned on all the time.

Comment by Michael Cahill (Inactive) [ 25/Jun/15 ]

I was thinking of it like verbose: something that can be turned on for diagnostics. The API could even be "diagnostics=[block,btree,checkpoint]", etc.

Michael.

Comment by Keith Bostic (Inactive) [ 24/Jun/15 ]

Are you imagining something we'd turn on all the time, or only when we're chasing one of these problems?

Comment by Michael Cahill (Inactive) [ 24/Jun/15 ]

Maybe we should explore adding WT_SESSION::log_printf diagnostics, with a scheme similar to verbose messages? That way, we could do relatively lightweight logging, hopefully without serializing around whatever bug we're chasing.

If we had a history of every page write containing a human-readable offset + size, that would be a start in working backwards to the root cause.
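
A minimal sketch of that idea, assuming the connection was opened with log=(enabled): the existing WT_SESSION::log_printf API appends a printf-style record to the write-ahead log. The trace_page_write() hook and its uri/offset/size/checksum parameters are illustrative, not WiredTiger internals.

#include <inttypes.h>
#include <wiredtiger.h>

/* Hypothetical hook invoked after a page write completes. */
static int
trace_page_write(WT_SESSION *session,
    const char *uri, uint64_t offset, uint32_t size, uint32_t checksum)
{
    /* Append a human-readable record of the write to the WiredTiger log. */
    return (session->log_printf(session,
        "page write: %s offset=%" PRIu64 " size=%" PRIu32 " checksum=%#" PRIx32,
        uri, offset, size, checksum));
}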

Comment by Keith Bostic (Inactive) [ 22/Jun/15 ]

michael.cahill, we could pretty easily add additional information into the page address, or store it on the page itself as a new cell type.

But I'm with you: I don't have any ideas I like about what the useful information to track would be.

Historically, I don't think I've ever managed to debug one of these with only a single occurrence, and the amount of information I think we'd need in order to solve one from a single reported occurrence is daunting. When I've chased these problems, I've usually been looking for the path the page takes through the system to disk, that is, trying to figure out why the page was written, so I've been dumping a code path into the page, which obviously takes up quite a bit of room.

If we can't possibly log enough information to solve the problem, could we log enough information to make it more likely we could reproduce the bug?

Comment by Michael Cahill (Inactive) [ 22/Jun/15 ]

keith.bostic, the only idea I've had (which I don't like much) is that we could keep write_gen in the page address for diagnostics. That would obviously be a format change, but maybe if we were chasing something like this with a reproducer, it might help figure out what's going on?

Regardless of that, when a checksum mismatch is detected, we could log more information about the page header we find: its type and size (even if they're bogus), plus we could search block->live.alloc to see whether the address we're searching for was reallocated recently?
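
A rough sketch of that extra logging. The page_header layout and the report_checksum_mismatch() helper below are illustrative stand-ins, not the real WiredTiger structures or error path; the idea is simply to dump whatever type/size/write_gen values the corrupted block claims to have, even if they are garbage, alongside the existing checksum message.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct page_header {        /* Illustrative on-disk page header. */
    uint64_t recno;         /* Starting record number. */
    uint64_t write_gen;     /* Write generation. */
    uint32_t mem_size;      /* In-memory size. */
    uint32_t entries;       /* Number of cells on the page. */
    uint8_t type;           /* Page type. */
};

static void
report_checksum_mismatch(const char *name, const void *block,
    uint64_t offset, uint32_t size, uint32_t expected, uint32_t actual)
{
    const struct page_header *hdr = block;

    /* Report the mismatch plus the (possibly bogus) header fields found. */
    fprintf(stderr,
        "%s: read checksum error [%" PRIu32 "B @ %" PRIu64 ", %" PRIu32
        " != %" PRIu32 "]: header type=%u, mem_size=%" PRIu32
        ", entries=%" PRIu32 ", write_gen=%" PRIu64 "\n",
        name, size, offset, actual, expected,
        (unsigned)hdr->type, hdr->mem_size, hdr->entries, hdr->write_gen);
}

Searching the live allocation list for a recent reallocation of the same offset, as suggested above, would be a separate change inside the block manager itself.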

Comment by Eitan Klein [ 18/Jun/15 ]

We had a discussion this morning with AlexG about power failure. It's hard to believe we can reproduce the exact issue (it's all about timing), but I'm trying, and I'll let you know if I get a repro.

Since a repro isn't likely, and we expect to see more of this during the coming cycle, can you think about adding traces that would help you isolate such an issue, or some other debuggability improvement?

CC: jonathan.abrahams
