[SERVER-18006] Shard replicas crash after 3.0.2 upgrade with "Didn't find RecordId in WiredTigerRecordStore" Created: 13/Apr/15  Updated: 01/Sep/15  Resolved: 19/May/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.2
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Oleg Rekutin Assignee: Ramon Fernandez Marina
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-17506 Race between inserts and checkpoints ... Closed
Related
related to SERVER-17881 Primary crashing with 'Didn't find Re... Closed
Operating System: ALL
Participants:

 Description   

Upgraded a 4-shard replicated 3.0.1 cluster to 3.0.2. One hour later, two of the shard replicas crashed with the following stack trace:

2015-04-10T20:42:23.166+0000 I -        [repl writer worker 14] Assertion: 28556:Didn't find RecordId in WiredTigerRecordStore
2015-04-10T20:42:23.174+0000 I CONTROL  [repl writer worker 14] 
 0xf6b4a9 0xf0bca1 0xef0eff 0xd835bc 0x92e4d0 0xa2ac15 0xa1caa0 0xbed174 0xbed524 0xbedb5d 0xb1a536 0xc577e1 0xcaea6e 0xcb1255 0xf02abb 0xfb8a04 0x7f68e4785df3 0x7f68e32351ad
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B6B4A9"},{"b":"400000","o":"B0BCA1"},{"b":"400000","o":"AF0EFF"},{"b":"400000","o":"9835BC"},{"b":"400000","o":"52E4D0"},{"b":"400000","o":"62AC15"},{"b":"400000","o":"61CAA0"},{"b":"400000","o":"7ED174"},{"b":"400000","o":"7ED524"},{"b":"400000","o":"7EDB5D"},{"b":"400000","o":"71A536"},{"b":"400000","o":"8577E1"},{"b":"400000","o":"8AEA6E"},{"b":"400000","o":"8B1255"},{"b":"400000","o":"B02ABB"},{"b":"400000","o":"BB8A04"},{"b":"7F68E477E000","o":"7DF3"},{"b":"7F68E313F000","o":"F61AD"}],"processInfo":{ "mongodbVersion" : "3.0.2", "gitVersion" : "6201872043ecbbc0a4cc169b5482dcf385fc464f", "uname" : { "sysname" : "Linux", "release" : "3.4.73-64.112.amzn1.x86_64", "version" : "#1 SMP Tue Dec 10 01:50:05 UTC 2013", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "810BAA4B0BDE1EBB076BF6A544BF2724A13FAE84" }, { "b" : "7FFFF29DA000", "elfType" : 3, "buildId" : "6A7E58423C94E707E8B4A45F098E78D557DD5A7D" }, { "b" : "7F68E477E000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "D48D3E6672A77B603B402F661BABF75E90AD570B" }, { "b" : "7F68E4511000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "A5C6FCE56D21542B9535D5FBCFE74EBA71D24613" }, { "b" : "7F68E412D000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "75565934551BECB21A027BE0A3AE28A62AEEB7EE" }, { "b" : "7F68E3F25000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "E81013CBFA409053D58A65A0653271AB665A4619" }, { "b" : "7F68E3D21000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "62A8842157C62F95C3069CBF779AFCC26577A99A" }, { "b" : "7F68E3A18000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "66F1CF311C61879639BD3DC0034DEE0D6D042261" }, { "b" : "7F68E3716000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "5F97F8F8E5024E29717CF35998681F84D4A22D45" }, { "b" : "7F68E3500000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "C52958E393BDF8E8D090F36DE0F4E620D8736FBF" }, { "b" : "7F68E313F000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "77E48C0DBAE0843560558F1E51A5FC61602ABC5F" }, { "b" : "7F68E499A000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "6F90843B9087FE91955FEB0355EB0858EF9E97B2" }, { "b" : "7F68E2EFC000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "9DF61878D8918F25CC74AD01F417FDB051DFE3DA" }, { "b" : "7F68E2C17000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "6F1DB0F811D1B210520443442D4437BC43BF9A80" }, { "b" : "7F68E2A14000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "1A6E97644CC9149C2E1871C6AE1DB51975E78A41" }, { "b" : "7F68E27E9000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "F7DF34078FD7BFD684FE46D5F677EEDA1D9B9DC9" }, { "b" : "7F68E25D3000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "87B4EBF2183C8EA4AB657212203EFFE6340E2F4F" }, { "b" : "7F68E23C8000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "381960ACAB9C39461D58BDE7B272C4F61BB3582F" }, { "b" : "7F68E21C5000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "BF48CD5658DE95CE058C4B828E81C97E2AE19643" }, { "b" : "7F68E1FAB000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "17295EFA7A45ADE184C2A7A3294460BDC8B43277" }, { "b" : "7F68E1D8A000", "path" : "/usr/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "803D7EF21A989677D056E52BAEB9AB5B154FB9D9" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf6b4a9]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf0bca1]
 mongod(_ZN5mongo11msgassertedEiPKc+0xAF) [0xef0eff]
 mongod(_ZNK5mongo21WiredTigerRecordStore7dataForEPNS_16OperationContextERKNS_8RecordIdE+0xDC) [0xd835bc]
 mongod(_ZNK5mongo10Collection6docForEPNS_16OperationContextERKNS_8RecordIdE+0x20) [0x92e4d0]
 mongod(_ZN5mongo11IDHackStage4workEPm+0x185) [0xa2ac15]
 mongod(_ZN5mongo11DeleteStage4workEPm+0x70) [0xa1caa0]
 mongod(_ZN5mongo12PlanExecutor18getNextSnapshottedEPNS_11SnapshottedINS_7BSONObjEEEPNS_8RecordIdE+0xA4) [0xbed174]
 mongod(_ZN5mongo12PlanExecutor7getNextEPNS_7BSONObjEPNS_8RecordIdE+0x34) [0xbed524]
 mongod(_ZN5mongo12PlanExecutor11executePlanEv+0x3D) [0xbedb5d]
 mongod(_ZN5mongo13deleteObjectsEPNS_16OperationContextEPNS_8DatabaseERKNS_10StringDataENS_7BSONObjENS_12PlanExecutor11YieldPolicyEbbbb+0x226) [0xb1a536]
 mongod(_ZN5mongo4repl21applyOperation_inlockEPNS_16OperationContextEPNS_8DatabaseERKNS_7BSONObjEbb+0x1041) [0xc577e1]
 mongod(_ZN5mongo4repl8SyncTail9syncApplyEPNS_16OperationContextERKNS_7BSONObjEb+0x2EE) [0xcaea6e]
 mongod(_ZN5mongo4repl14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x65) [0xcb1255]
 mongod(_ZN5mongo10threadpool6Worker4loopERKSs+0x2FB) [0xf02abb]
 mongod(+0xBB8A04) [0xfb8a04]
 libpthread.so.0(+0x7DF3) [0x7f68e4785df3]
 libc.so.6(clone+0x6D) [0x7f68e32351ad]
-----  END BACKTRACE  -----
2015-04-10T20:42:23.174+0000 E REPL     [repl writer worker 14] writer worker caught exception:  :: caused by :: 28556 Didn't find RecordId in WiredTigerRecordStore on: { ts: Timestamp 1428698543000|107, h: 3163678979668444329, v: 2, op: "d", ns: "abc.xyz", fromMigrate: true, o: { _id: ObjectId('5509b725e4b0bd119005d2b6') } }
2015-04-10T20:42:23.174+0000 I -        [repl writer worker 14] Fatal Assertion 16360
2015-04-10T20:42:23.174+0000 I -        [repl writer worker 14] 
 
***aborting after fassert() failure

The second replica crashed with the same stack trace, but on a different collection and about 5 minutes later.

Both of these failures were preceded, a few minutes earlier, by a chunk move (visible in the mongos logs). Here is the only pertinent line in the mongos log:

2015-04-10T20:37:14.516+0000 I SHARDING [Balancer] moving chunk ns: abc.xyz moving ( ns: abc.xyz, shard: rs_prod1_event_shard2:rs_prod1_event_shard2/p1e2r1.prod.evergage.com:27017,p1e2r2.prod.evergage.com:27017, lastmod: 7|1||000000000000000000000000, min: { user.id: "8812601eb0ca0000" }, max: { user.id: "b0adb576c86b0000" }) rs_prod1_event_shard2:rs_prod1_event_shard2/p1e2r1.prod.evergage.com:27017,p1e2r2.prod.evergage.com:27017 -> rs_prod1_event_shard3:rs_prod1_event_shard3/p1e3r1.prod.evergage.com:27017,p1e3r2.prod.evergage.com:27017

A little later in the mongos log, two minutes before the failure, is the only other log line related to this collection:

2015-04-10T20:40:09.779+0000 I SHARDING [Balancer] ChunkManager: time to load chunks for abc.xyz: 0ms sequenceNumber: 2667 version: 10|1||54ee6b6d4c0a4ab1d04d2986 based on: 9|1||54ee6b6d4c0a4ab1d04d2986

So it seems this was a result of a chunk move. We turned off the balancer and have run without crashes for the last two days. Of course, it's not sustainable to leave the balancer off.
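
For reference, a minimal sketch of pausing and later resuming the balancer from a mongos shell; the host name is a placeholder, not our actual topology:

$ mongo --host mongos.example.com:27017
mongos> sh.getBalancerState()     // check whether balancing is currently enabled
mongos> sh.stopBalancer()         // disable the balancer; waits for an in-flight round to stop
mongos> // ...later, once it is safe to resume migrations:
mongos> sh.startBalancer()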

The replicas would NOT restart after this crash; they kept crashing with the same stack trace on the same collection. They would start without the replSet argument, but with replSet they would fail. We ended up removing the collection locally (with the node started without replSet), which allowed the node to restart as a replica set member. (We then dropped that collection cluster-wide.)
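
For reference, a rough sketch of that workaround; the dbpath, port, and set name are placeholders, and the collection name is the one from the log excerpt above:

# start the member WITHOUT --replSet
$ mongod --dbpath /data/db --port 27017

# in another terminal, drop the affected collection locally
$ mongo --port 27017 --eval 'db.getSiblingDB("abc").getCollection("xyz").drop()'

# shut down cleanly, then restart as a replica set member
$ mongo --port 27017 admin --eval 'db.shutdownServer()'
$ mongod --dbpath /data/db --port 27017 --replSet <setName>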

About an hour before the failure, the entire cluster had been upgraded from 3.0.1 to 3.0.2, specifically the mongodb-linux-x86_64-amazon-3.0.2.tgz release. Here's the uname -a output:

$ uname -a
Linux p1e1r2 3.4.73-64.112.amzn1.x86_64 #1 SMP Tue Dec 10 01:50:05 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux



 Comments   
Comment by Ramon Fernandez Marina [ 01/Sep/15 ]

Thanks for the update oleg@evergage.com, that's great news! And your message should help other users that may run into this issue, so thanks again.

Cheers,
Ramón.

Comment by Oleg Rekutin [ 01/Sep/15 ]

Ramon, thank you for your response! Indeed, we were able to re-sync all the nodes over time, and it looks like we have been able to get past this issue, as it has not come up since June 17.

Comment by Ramon Fernandez Marina [ 01/Sep/15 ]

Hi oleg@evergage.com, as per Michael's message on SERVER-17506, once the bug has occurred it's not possible to detect it or recover from the lost update.

If you have a replica set, you can try to re-sync from a different node; if the source node was not affected, the resync process should succeed and give you a baseline from which to recover the rest of the nodes.
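
For reference, a minimal sketch of that resync procedure (dbpath, port, and set name are placeholders):

# stop the affected member, then move its data files aside
$ mv /data/db /data/db.old && mkdir /data/db
# restart with the same options as before; the empty member performs an
# initial sync from another member of the replica set
$ mongod --dbpath /data/db --port 27017 --replSet <setName>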

Regards,
Ramón.

Comment by Oleg Rekutin [ 17/Jun/15 ]

ramon.fernandez, we hit this problem again. Is there a way to proactively find out every database that was affected by (or has leftover corruption from) SERVER-17506?

Comment by Ramon Fernandez Marina [ 19/May/15 ]

Thanks for getting back to us oleg@evergage.com. We'll close the ticket then – feel free to re-open (or open a new one) if you see this again.

Regards,
Ramón.

Comment by Oleg Rekutin [ 18/May/15 ]

Dan, apologies for missing your question earlier.

Yes, I did run this cluster with 3.0.0 at one point. So indeed it's quite possible that these are leftovers from SERVER-17506.

We were forced to turn balancing back on, as the cluster wasn't too happy unbalanced, but I am happy to report that we have not run into this problem again, even while running more load and with balancing enabled. The cluster is still running 3.0.2.

Given all the data points, it's quite possible that this is a SERVER-17506 leftover, and I think we can close this as either a duplicate or not reproducible.

Comment by Ramon Fernandez Marina [ 15/May/15 ]

Hi oleg@evergage.com, we haven't heard back from you for some time. If this is still an issue for you, can you please provide the information requested by Dan in the previous comment?

Thanks,
Ramón.

Comment by Daniel Pasette (Inactive) [ 01/May/15 ]

Hi Oleg, can you confirm whether or not you were running this cluster with 3.0.0 at one point? It is possible that the collection was impacted by SERVER-17506, but did not manifest until the chunk migration forced the data to be read.

Thanks
Dan

Comment by Ramon Fernandez Marina [ 22/Apr/15 ]

Thanks oleg@evergage.com; I think your reasoning about storage problems is sound, so it's safe to rule those out.

As for SERVER-17506, my first impression is that this is a different problem with the same symptom: you were running 3.0.1 before and didn't see the behavior reported in this ticket; it only appeared after the upgrade to 3.0.2. If, for example, we had missed a corner case in SERVER-17506, I'd say it's unlikely that it would have waited until 3.0.2 to show up.

That being said, the assertion is the same as in SERVER-17506 so we'll keep that in mind during our investigation.

Regards,
Ramón.

Comment by Oleg Rekutin [ 22/Apr/15 ]

Hi Ramón, these replicas run on top of AWS EBS, and I don't see any errors in the system logs related to disk or I/O issues. I also don't see read or write latency abnormalities for the involved GP2 EBS volume during the failure time period. So I don't think this is related to on-disk corruption.
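
For what it's worth, here is a rough sketch of pulling those EBS latency metrics from CloudWatch for the failure window (volume ID and time range are placeholders):

$ aws cloudwatch get-metric-statistics \
    --namespace AWS/EBS --metric-name VolumeTotalReadTime \
    --dimensions Name=VolumeId,Value=vol-xxxxxxxx \
    --start-time 2015-04-10T20:00:00Z --end-time 2015-04-10T21:00:00Z \
    --period 300 --statistics Average Sum
# repeat for VolumeTotalWriteTime and VolumeQueueLength; approximate per-op
# latency is VolumeTotalReadTime divided by VolumeReadOps over the same period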

It's also strange that two of our nodes experienced this shortly after the 3.0.2 upgrade. Another notch against on-disk corruption.

Could this be related to SERVER-17506? That's listed as fixed in 3.0.1, but I don't know the nature of the fix: if the problem's "seed" is planted long before the symptom shows up, then is it possible that running 3.0.0 wrote some bad metadata?

Comment by Ramon Fernandez Marina [ 22/Apr/15 ]

Thanks for uploading the logs, oleg@evergage.com. One thing you may be able to do is check the health of the disks on the affected nodes to rule out on-disk corruption caused by flaky hardware.
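
For reference, a minimal sketch of the sort of check meant here on a Linux host (device names are placeholders; on EBS-backed instances SMART data is generally not exposed, so kernel and system logs are the main signal):

$ dmesg | grep -iE 'i/o error|blk_update_request|ext4|xfs'    # kernel block/filesystem errors
$ sudo grep -iE 'i/o error|medium error' /var/log/messages
$ sudo smartctl -H /dev/sda    # physical disks only; typically unsupported on EBS volumes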

There's a similar error message in SERVER-17881 related to tailing a capped collection, and since that's the mechanism used for replication that ticket may be related to this one.

I understand your concerns about turning the balancer back on; we can try to reproduce on our end, but until we do there's unfortunately not enough information for us to proceed. I'm going to link this ticket to SERVER-17881 and close it, but if you find any additional information please post it on this ticket and we'll reopen it for further investigation.

Regards,
Ramón.

Comment by Oleg Rekutin [ 16/Apr/15 ]

Hi Jason, apologies for the delay in responding.

Yes, both members that suffered the issue were in the secondary state.

I've uploaded to your private server the logs from both servers, starting from the time they were first started as 3.0.2 secondaries (i.e., the first start after the 3.0.2 upgrade). You can see that both members were in SECONDARY state.

Unfortunately, we did not capture the data files for these members at the time of the assertion or shortly thereafter. If this happens again, I will try to capture the data files. But our balancer is off for the time being, and I am not eager to reproduce this problem.

  • Oleg

Comment by J Rassi [ 13/Apr/15 ]

Hi,

Sorry to hear that you encountered this problem. We'll need additional information to further diagnose the issue:

  • Can you confirm that both of the members that suffered this issue were in state SECONDARY at the time?
  • Could you upload the full mongod log for the two members that encountered this problem, as an attachment to this ticket?
  • Do you have a full backup of the data files for either of these members, at the time of the first assertion failure or shortly thereafter? If so, would you be willing to upload an archive of the data files as an attachment to the ticket?

If you are not able to share the files publicly (or any file exceeds ~100MB in size), note also that you may upload the files to a private server of ours that is accessible only to MongoDB staff members (hit "enter" at the password prompt):

scp -P 722 <file(s)> SERVER-18006@www.10gen.com:

~ Jason Rassi
