[SERVER-17881] Primary crashing with 'Didn't find RecordId in WiredTigerRecordStore' while tailing capped collection Created: 02/Apr/15  Updated: 02/Sep/15  Resolved: 16/May/15

Status: Closed
Project: Core Server
Component/s: Replication, WiredTiger
Affects Version/s: 3.0.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Evan Lawrence-Hurt Assignee: Ramon Fernandez Marina
Resolution: Incomplete Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-18006 Shard replicas crash after 3.0.2 upgr... Closed
Operating System: ALL
Steps To Reproduce:

I haven't yet been able to reproduce this in isolation, this usually appears under write load of ~500 updates/sec, ~10 inserts/sec. Simply using the above mubsub library in isolation hasn't worked. Will add more info as I have it, do let me know if you have any suggestions for how to isolate this.

Participants:

 Description   

I've upgraded my UAT environment to Mongo 3.0.1 with the WiredTiger storage engine & am repeatedly seeing my primary replicaset fallback to RECOVERY mode with the below stacktrace. The "prism-pubsub.intergate" collection mentioned in the stack trace is a capped collection which we're opening a tailable cursor on using the (just forked) mubsub library here: https://github.com/evanlh/mubsub.

2015-04-02T15:52:21.286-0400 I CONTROL  [rsBackgroundSync] mongod.exe    ...\src\mongo\util\stacktrace_win.cpp(175)                            mongo::printStackTrace+0x43
2015-04-02T15:52:21.287-0400 I CONTROL  [rsBackgroundSync] mongod.exe    ...\src\mongo\util\log.cpp(135)                                       mongo::logContext+0x97
2015-04-02T15:52:21.287-0400 I CONTROL  [rsBackgroundSync] mongod.exe    ...\src\mongo\util\assert_util.cpp(214)                               mongo::msgasserted+0xd7
2015-04-02T15:52:21.287-0400 I CONTROL  [rsBackgroundSync] mongod.exe    ...\src\mongo\db\storage\wiredtiger\wiredtiger_record_store.cpp(330)  mongo::WiredTigerRecordStore::dataFor+0xc7
2015-04-02T15:52:21.287-0400 I CONTROL  [rsBackgroundSync] mongod.exe    ...\src\mongo\db\catalog\collection.cpp(157)                          mongo::Collection::docFor+0x52
2015-04-02T15:52:21.287-0400 I CONTROL  [rsBackgroundSync] mongod.exe    ...\src\mongo\db\exec\idhack.cpp(153)                                 mongo::IDHackStage::work+0x176
2015-04-02T15:52:21.287-0400 I CONTROL  [rsBackgroundSync] mongod.exe    ...\src\mongo\db\query\plan_executor.cpp(328)                         mongo::PlanExecutor::getNextSnapshotted+0xdd
2015-04-02T15:52:21.287-0400 I CONTROL  [rsBackgroundSync] mongod.exe    ...\src\mongo\db\query\plan_executor.cpp(286)                         mongo::PlanExecutor::getNext+0x41
2015-04-02T15:52:21.287-0400 I CONTROL  [rsBackgroundSync] mongod.exe    ...\src\mongo\db\dbhelpers.cpp(151)                                   mongo::Helpers::findOne+0x227
2015-04-02T15:52:21.287-0400 I CONTROL  [rsBackgroundSync] mongod.exe    ...\src\mongo\db\dbhelpers.cpp(116)                                   mongo::Helpers::findOne+0x37
2015-04-02T15:52:21.287-0400 I CONTROL  [rsBackgroundSync] mongod.exe    ...\src\mongo\db\repl\rs_rollback.cpp(583)                            mongo::repl::`anonymous namespace'::syncFixUp+0x18e7
2015-04-02T15:52:21.288-0400 I CONTROL  [rsBackgroundSync] mongod.exe    ...\src\mongo\db\repl\rs_rollback.cpp(774)                            mongo::repl::`anonymous namespace'::_syncRollback+0x395
2015-04-02T15:52:21.288-0400 I CONTROL  [rsBackgroundSync] mongod.exe    ...\src\mongo\db\repl\rs_rollback.cpp(827)                            mongo::repl::syncRollback+0x1f5
2015-04-02T15:52:21.288-0400 I CONTROL  [rsBackgroundSync] mongod.exe    ...\src\mongo\db\repl\bgsync.cpp(428)                                 mongo::repl::BackgroundSync::_rollbackIfNeeded+0x590
2015-04-02T15:52:21.288-0400 I CONTROL  [rsBackgroundSync] mongod.exe    ...\src\mongo\db\repl\bgsync.cpp(262)                                 mongo::repl::BackgroundSync::produce+0x406
2015-04-02T15:52:21.288-0400 I CONTROL  [rsBackgroundSync] mongod.exe    ...\src\mongo\db\repl\bgsync.cpp(200)                                 mongo::repl::BackgroundSync::_producerThread+0x9a
2015-04-02T15:52:21.288-0400 I CONTROL  [rsBackgroundSync] mongod.exe    ...\src\mongo\db\repl\bgsync.cpp(153)                                 mongo::repl::BackgroundSync::producerThread+0x86
2015-04-02T15:52:21.288-0400 I CONTROL  [rsBackgroundSync] mongod.exe    ...\src\third_party\boost\libs\thread\src\win32\thread.cpp(185)       boost::`anonymous namespace'::thread_start_function+0x21
2015-04-02T15:52:21.288-0400 I CONTROL  [rsBackgroundSync] MSVCR120.dll                                                                        beginthreadex+0x107
2015-04-02T15:52:21.288-0400 I CONTROL  [rsBackgroundSync] MSVCR120.dll                                                                        endthreadex+0x192
2015-04-02T15:52:21.289-0400 I CONTROL  [rsBackgroundSync] kernel32.dll                                                                        BaseThreadInitThunk+0xd
2015-04-02T15:52:21.289-0400 I CONTROL  [rsBackgroundSync]
2015-04-02T15:52:21.486-0400 I REPL     [rsBackgroundSync] replSet exception in rollback ns:prism-pubsub.intergate { _id: ObjectId('551d9db163449994281a07b4') } 28556 Didn't find RecordId in WiredTigerRecordStore ndeletes:9
2015-04-02T15:52:21.486-0400 I REPL     [rsBackgroundSync] replSet 9 delete and 20 update operations processed out of 40 total operations
2015-04-02T15:52:21.765-0400 I REPL     [ReplicationExecutor] replSetElect voting yea for algo-rh01-sdny.itgssi.com:27017 (1)
2015-04-02T15:52:22.379-0400 I REPL     [ReplicationExecutor] Member algo-rh01-sdny.itgssi.com:27017 is now in state PRIMARY
2015-04-02T15:52:22.934-0400 I REPL     [rsBackgroundSync] rollback 5 d:9 u:30
2015-04-02T15:52:22.934-0400 I REPL     [rsBackgroundSync] rollback 6
2015-04-02T15:52:23.075-0400 W REPL     [rsBackgroundSync] issues during syncRollback, see log
2015-04-02T15:52:23.075-0400 I REPL     [ReplicationExecutor] transition to RECOVERING
2015-04-02T15:52:23.076-0400 I REPL     [rsBackgroundSync] rollback finished
2015-04-02T15:52:23.091-0400 I REPL     [ReplicationExecutor] syncing from: algo-rh01-sdny.itgssi.com:27017



 Comments   
Comment by Daniel Pasette (Inactive) [ 01/May/15 ]

Hi Evan, a couple of other questions/comments.

  • In addition to Ramon's request for the data files/logs, can you tell me if you're recreating the dataset from scratch each time?
  • Is there anything else you can tell me about how I can use mubsub to attempt and replicate this issue?
  • The next release of MongoDB in the v3.0 series (3.0.3-rc1) was released yesterday. GA should be in around a week or so.

Thanks,
Dan

Comment by Ramon Fernandez Marina [ 22/Apr/15 ]

Hi evan.lawrence-hurt@itg.com, we haven't heard back from you for a while. Have you had a chance to try switching primary and secondary host os? Do you have an update on whether you can share database files privately with us?

Thanks,
Ramón.

Comment by Ramon Fernandez Marina [ 06/Apr/15 ]

Thanks evan.lawrence-hurt@itg.com. You can upload files securely and privately via scp:

scp -P 722 -r <file_or_directory> SERVER-17881@www.mongodb.com:

Comment by Evan Lawrence-Hurt [ 06/Apr/15 ]

@Ramon, I'll try switching primary/secondary to see if it's Windows-specific. W/r/t the data files & even full logs, I definitely can't upload them to a public forum-- they contain sensitive customer information-- but I'll check with my manager and see if we can send them to you privately. Thanks!

Comment by Ramon Fernandez Marina [ 03/Apr/15 ]

evan.lawrence-hurt@itg.com, in addition to full logs from both nodes, would it be possible for you to share the database files with us for analysis? If the answer is yes please let us know, and we can provide instructions to upload the files privately.

Thanks,
Ramón.

Comment by Ramon Fernandez Marina [ 02/Apr/15 ]

evan.lawrence-hurt@itg.com, does the issue appear if you run your primary on RHEL and your secondary on Windows? Or only when the primary is on Windows?

We'll try to reproduce on our end; if you could upload full logs for both primary and secondary from startup until the exception appear that may help us investigate.

Thanks,
Ramón.

Comment by Evan Lawrence-Hurt [ 02/Apr/15 ]

Primary replica is deployed on Windows Server 2008 R2 SP1 virtual machine with 8g RAM & 2 cores. Secondary is RHEL 6.6 same hardware specs.

Generated at Thu Feb 08 03:45:52 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.