[SERVER-17881] Primary crashing with 'Didn't find RecordId in WiredTigerRecordStore' while tailing capped collection
Created: 02/Apr/15 | Updated: 02/Sep/15 | Resolved: 16/May/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, WiredTiger |
| Affects Version/s: | 3.0.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Evan Lawrence-Hurt | Assignee: | Ramon Fernandez Marina |
| Resolution: | Incomplete | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Operating System: | ALL | ||||||||
| Steps To Reproduce: | I haven't yet been able to reproduce this in isolation; it usually appears under a write load of ~500 updates/sec and ~10 inserts/sec. Simply using the mubsub library (linked in the description) in isolation hasn't worked. I will add more info as I have it; do let me know if you have any suggestions for how to isolate this. |
||||||||
| Participants: | |||||||||
| Description |
|
I've upgraded my UAT environment to MongoDB 3.0.1 with the WiredTiger storage engine and am repeatedly seeing the primary of my replica set fall back to RECOVERY mode with the stack trace below. The "prism-pubsub.intergate" collection mentioned in the stack trace is a capped collection on which we open a tailable cursor using the (just forked) mubsub library here: https://github.com/evanlh/mubsub.
|
| Comments |
| Comment by Daniel Pasette (Inactive) [ 01/May/15 ] | |
|
Hi Evan, a couple of other questions/comments.
Thanks, | |
| Comment by Ramon Fernandez Marina [ 22/Apr/15 ] | |
|
Hi evan.lawrence-hurt@itg.com, we haven't heard back from you for a while. Have you had a chance to try switching the host OS of the primary and secondary? Do you have an update on whether you can share database files privately with us? Thanks, | |
| Comment by Ramon Fernandez Marina [ 06/Apr/15 ] | |
|
Thanks evan.lawrence-hurt@itg.com. You can upload files securely and privately via scp:
| |
| Comment by Evan Lawrence-Hurt [ 06/Apr/15 ] | |
|
@Ramon, I'll try switching primary/secondary to see if it's Windows-specific. W/r/t the data files & even full logs, I definitely can't upload them to a public forum-- they contain sensitive customer information-- but I'll check with my manager and see if we can send them to you privately. Thanks! | |
| Comment by Ramon Fernandez Marina [ 03/Apr/15 ] | |
|
evan.lawrence-hurt@itg.com, in addition to full logs from both nodes, would it be possible for you to share the database files with us for analysis? If the answer is yes please let us know, and we can provide instructions to upload the files privately. Thanks, | |
| Comment by Ramon Fernandez Marina [ 02/Apr/15 ] | |
|
evan.lawrence-hurt@itg.com, does the issue appear if you run your primary on RHEL and your secondary on Windows? Or only when the primary is on Windows? We'll try to reproduce on our end; if you could upload full logs for both primary and secondary, from startup until the exception appears, that may help us investigate. Thanks, | |
| Comment by Evan Lawrence-Hurt [ 02/Apr/15 ] | |
|
The primary replica is deployed on a Windows Server 2008 R2 SP1 virtual machine with 8 GB of RAM and 2 cores. The secondary is on RHEL 6.6 with the same hardware specs. |