[SERVER-23250] One node in replicated sharded clusters keeps crashing Created: 21/Mar/16  Updated: 26/Apr/16  Resolved: 26/Apr/16

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.2.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Aise Bouma Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongod.log    
Issue Links:
Related
related to SERVER-17081 Resync command under write load can f... Closed
Operating System: ALL
Steps To Reproduce:

Happens randomly

Participants:

 Description   

One node crashes about once a day with error:

2016-03-18T23:59:01.254+0100 I CONTROL  [conn101] pid=10361 port=27017 64-bit host=gelderland
2016-03-18T23:59:01.255+0100 I CONTROL  [conn101] db version v3.2.4
2016-03-18T23:59:01.255+0100 I CONTROL  [conn101] git version: e2ee9ffcf9f5a94fad76802e28cc978718bb7a30
2016-03-18T23:59:01.255+0100 I CONTROL  [conn101] OpenSSL version: OpenSSL 1.0.1f 6 Jan 2014
2016-03-18T23:59:01.255+0100 I CONTROL  [conn101] allocator: tcmalloc
2016-03-18T23:59:01.255+0100 I CONTROL  [conn101] modules: none
2016-03-18T23:59:01.255+0100 I CONTROL  [conn101] build environment:
2016-03-18T23:59:01.255+0100 I CONTROL  [conn101]     distmod: ubuntu1404
2016-03-18T23:59:01.255+0100 I CONTROL  [conn101]     distarch: x86_64
2016-03-18T23:59:01.255+0100 I CONTROL  [conn101]     target_arch: x86_64
2016-03-18T23:59:01.256+0100 I CONTROL  [conn101] options: { config: "/etc/mongod.conf", net: { port: 27017 }, replication: { replSetName: "midden" }, storage: { dbPath: "/var/lib/mongodb", engine: "wiredTiger", journal: { enabled: true } }, systemLog: { destination: "file", logAppend: true, path: "/var/log/mongodb/mongod.log" } }
...
2016-03-19T12:37:39.884+0100 F -        [rsSync] Invalid access at address: 0x803be64ac
2016-03-19T12:37:39.905+0100 F -        [rsSync] Got signal: 11 (Segmentation fault).
 
 0x12f3502 0x12f2659 0x12f29d8 0x7f80511be340 0x19b3b38 0x198229f 0x19cf28e 0x1067ae4 0xac923e 0xac9761 0xe94bfb 0xf28acb 0xf2aa6a 0xf208d0 0x7f8051999a60 0x7f80511b6182 0x7f8050ee347d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"EF3502","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"EF2659"},{"b":"400000","o":"EF29D8"},{"b":"7F80511AE000","o":"10340"},{"b":"400000","o":"15B3B38","s":"__wt_row_search"},{"b":"400000","o":"158229F","s":"__wt_btcur_insert"},{"b":"400000","o":"15CF28E"},{"b":"400000","o":"C67AE4","s":"_ZN5mongo21WiredTigerRecordStore13insertRecordsEPNS_16OperationContextEPSt6vectorINS_6RecordESaIS4_EEb"},{"b":"400000","o":"6C923E","s":"_ZN5mongo10Collection16_insertDocumentsEPNS_16OperationContextEN9__gnu_cxx17__normal_iteratorIPKNS_7BSONObjESt6vectorIS5_SaIS5_EEEESB_b"},{"b":"400000","o":"6C9761","s":"_ZN5mongo10Collection15insertDocumentsEPNS_16OperationContextEN9__gnu_cxx17__normal_iteratorIPKNS_7BSONObjESt6vectorIS5_SaIS5_EEEESB_bb"},{"b":"400000","o":"A94BFB","s":"_ZN5mongo4repl15writeOpsToOplogEPNS_16OperationContextERKSt6vectorINS_7BSONObjESaIS4_EE"},{"b":"400000","o":"B28ACB","s":"_ZN5mongo4repl8SyncTail10multiApplyEPNS_16OperationContextERKNS1_7OpQueueE"},{"b":"400000","o":"B2AA6A","s":"_ZN5mongo4repl8SyncTail16oplogApplicationEv"},{"b":"400000","o":"B208D0","s":"_ZN5mongo4repl13runSyncThreadEv"},{"b":"7F80518E8000","o":"B1A60"},{"b":"7F80511AE000","o":"8182"},{"b":"7F8050DE9000","o":"FA47D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.4", "gitVersion" : "e2ee9ffcf9f5a94fad76802e28cc978718bb7a30", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.19.0-25-generic", "version" : "#26~14.04.1-Ubuntu SMP Fri Jul 24 21:16:20 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "EF46210F8976780D45B811C3540FECB9E734EABE" }, { "b" : "7FFFCF9C7000", "elfType" : 3, "buildId" : "01005E92EA3C0C1527396B65C7412468298D0813" }, { "b" : "7F80523D4000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "E21720F2804EF30440F2B39CD409252C26F58F73" }, { "b" : "7F8051FF8000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "9BC22F9457E3D7E9CF8DDC135C0DAC8F7742135D" }, { "b" : "7F8051DF0000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "B376100CAB1EAC4E5DE066EACFC282BF7C0B54F3" }, { "b" : "7F8051BEC000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "67699FFDA9FD2A552032E0652A242E82D65AA10D" }, { "b" : "7F80518E8000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "D0E735DBECD63462DA114BD3F76E6EC7BB1FACCC" }, { "b" : "7F80515E2000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "EF3F6DFFA1FBE48436EC6F45CD3AABA157064BB4" }, { "b" : "7F80513CC000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "36311B4457710AE5578C4BF00791DED7359DBB92" }, { "b" : "7F80511AE000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "AF06068681750736E0524DF17D5A86CB2C3F765C" }, { "b" : "7F8050DE9000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "5382058B69031CAA9B9996C11061CD164C9398FF" }, { "b" : "7F8052633000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "2A816C3EBBA4E12813FBD34B06FBD25BC892A67F" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x12f3502]
 mongod(+0xEF2659) [0x12f2659]
 mongod(+0xEF29D8) [0x12f29d8]
 libpthread.so.0(+0x10340) [0x7f80511be340]
 mongod(__wt_row_search+0x5E8) [0x19b3b38]
 mongod(__wt_btcur_insert+0x45F) [0x198229f]
 mongod(+0x15CF28E) [0x19cf28e]
 mongod(_ZN5mongo21WiredTigerRecordStore13insertRecordsEPNS_16OperationContextEPSt6vectorINS_6RecordESaIS4_EEb+0x254) [0x1067ae4]
 mongod(_ZN5mongo10Collection16_insertDocumentsEPNS_16OperationContextEN9__gnu_cxx17__normal_iteratorIPKNS_7BSONObjESt6vectorIS5_SaIS5_EEEESB_b+0x16E) [0xac923e]
 mongod(_ZN5mongo10Collection15insertDocumentsEPNS_16OperationContextEN9__gnu_cxx17__normal_iteratorIPKNS_7BSONObjESt6vectorIS5_SaIS5_EEEESB_bb+0x1B1) [0xac9761]
 mongod(_ZN5mongo4repl15writeOpsToOplogEPNS_16OperationContextERKSt6vectorINS_7BSONObjESaIS4_EE+0x14B) [0xe94bfb]
 mongod(_ZN5mongo4repl8SyncTail10multiApplyEPNS_16OperationContextERKNS1_7OpQueueE+0x86B) [0xf28acb]
 mongod(_ZN5mongo4repl8SyncTail16oplogApplicationEv+0x134A) [0xf2aa6a]
 mongod(_ZN5mongo4repl13runSyncThreadEv+0x2B0) [0xf208d0]
 libstdc++.so.6(+0xB1A60) [0x7f8051999a60]
 libpthread.so.0(+0x8182) [0x7f80511b6182]
 libc.so.6(clone+0x6D) [0x7f8050ee347d]
-----  END BACKTRACE  -----



 Comments   
Comment by Kelsey Schubert [ 26/Apr/16 ]

Hi aisebouma,

Thank you for letting us know that the issue you originally reported has gone away. I am going to close this ticket to reflect that.

Regarding the new issue you are experiencing, from the information you have provided I believe this is a duplicate of SERVER-17081. Please feel free to vote for SERVER-17081 and watch it for updates.

Kind regards,
Thomas

Comment by Aise Bouma [ 22/Apr/16 ]

After changing the hardware: out of the 7 servers in the cluster, this one still keeps crashing now and then (about once a week). The error is different now:

2016-04-22T14:42:42.709+0200 F REPL [rsSync] replication oplog stream went back in time. previous timestamp: 571a1c42:1bd newest timestamp: 571a1a60:45. Op being applied:

{ <data removed for security reasons>, module: "DEBUG", message: "Set ID = 114, position(s) = POS_INPUT_CONVEYOR_3" }

}
2016-04-22T14:42:42.709+0200 I - [rsSync] Fatal Assertion 18905
2016-04-22T14:42:42.709+0200 I - [rsSync]

Is this a known issue? Do you need some extra info?
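For what it's worth, the leading field of each oplog timestamp appears to be seconds since the Unix epoch printed in hex; decoding the two values (a quick check, assuming GNU date and a bash-like shell) suggests the "newest" op is roughly eight minutes older than the previously applied one, which matches the assertion:

    date -u -d @$((16#571a1c42))    # previous timestamp -> Fri Apr 22 12:42:42 UTC 2016
    date -u -d @$((16#571a1a60))    # "newest" timestamp  -> Fri Apr 22 12:34:40 UTC 2016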

Comment by Aise Bouma [ 01/Apr/16 ]

Ok I will do that.

After changing the hardware and resyncing, the server has only experienced one crash in the past week, and that crash seems to have a different cause.

Comment by Kelsey Schubert [ 31/Mar/16 ]

Hi aisebouma,

We would like to compare the oplogs of the primary and the secondary at the time of the segfault to get a better idea of what is triggering the crash.

When you next experience this issue, can you please perform the steps below after the crash:

  1. execute

    mongodump -h <primary_hostname><:primary_port> -d local -c oplog.rs -o primary_dump

  2. restart the secondary as a standalone node to allow the recovery process to bring the oplog up to date
  3. After recovery process completes, execute

    mongodump -h <secondary_hostname><:secondary_port> -d local -c oplog.rs -o secondary_dump

Once you have dumped the oplog of the secondary, you may restart it with its standard configuration to have it rejoin the replica set.
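A minimal end-to-end sketch of the above (hostnames and ports are placeholders; the dbPath comes from the logged options, so adjust everything to your deployment):

    # 1. dump the primary's oplog
    mongodump -h primary.example.net:27017 -d local -c oplog.rs -o primary_dump

    # 2. restart the crashed secondary as a standalone node (no replica set
    #    options), so recovery can bring its oplog up to date
    mongod --dbpath /var/lib/mongodb --port 27017

    # 3. once recovery has finished, dump the secondary's oplog too
    mongodump -h localhost:27017 -d local -c oplog.rs -o secondary_dump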

When you have these two dumps, please upload them to the same portal as before.

Thank you again for your help,
Thomas

Comment by Aise Bouma [ 31/Mar/16 ]

My config servers contain a lot less data, so it must be the right node. Please remember that the server only crashed about once a day, so it was probably triggered by some action on the database.

Comment by Ramon Fernandez Marina [ 30/Mar/16 ]

aisebouma, I've downloaded your files and started up a node in maintenance mode. I'm running validate(true) on your data, but so far everything is working well.

One question though: the data above shows the crashing node belongs to a replica set named midden, but when I start this node the logs say configReplSet, which is a config server. Are you sure you sent me the data from the right node?

EDIT: I've been able to successfully validate the complete dataset you uploaded, so I'm wondering if we're looking at the right node. I'll retrace my steps in case I made an error, but can you please check on your end that you uploaded the dataset for the right node?

Thanks,
Ramón.
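One way to double-check which replica set a given data directory belongs to (a sketch, assuming the node is running standalone on port 27017) is to read the stored replica set config out of the local database:

    mongo --port 27017 --quiet --eval 'printjson(db.getSiblingDB("local").system.replset.findOne()._id)'

For the crashing shard member this should print "midden", whereas a config server's data would report "configReplSet".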

Comment by Ramon Fernandez Marina [ 23/Mar/16 ]

Thanks for taking the time to upload the data, aisebouma; I'm downloading it to my local machine now. We'll post any updates to this ticket.

Cheers,
Ramón.

Comment by Aise Bouma [ 23/Mar/16 ]

The upload is complete now.

I tried resyncing the node, but that also failed.

I will now replace the node's hardware and try to resync it.

Comment by Aise Bouma [ 22/Mar/16 ]

OK I will give the upload a try.

Comment by Ramon Fernandez Marina [ 22/Mar/16 ]

aisebouma, unfortunately I don't think it's possible to pinpoint the interesting data, but I'll ask others. Note that one option is to resync this node from the primary. While that may prevent us from investigating further, it would allow you to bring this node back into operation.
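If you do go the resync route, one rough sketch (assuming the stock Ubuntu package layout visible in the logged options) would be:

    sudo service mongod stop
    sudo mv /var/lib/mongodb /var/lib/mongodb.bak
    sudo mkdir /var/lib/mongodb && sudo chown mongodb:mongodb /var/lib/mongodb
    sudo service mongod start    # rejoins "midden" and performs an initial sync from another member

Keeping the old data around in /var/lib/mongodb.bak leaves the door open for further investigation later.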

If you choose to upload the data you'll need to split it into 5GB chunks; you can use split as follows:

split -d -b 5300000000 dbpath.tgz part.

and upload all the part.* files.
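For completeness, a sketch of the whole preparation with mongod stopped (the archive name matches the split example above; the final command is only needed on the receiving side to reassemble the pieces):

    tar -czf dbpath.tgz -C /var/lib/mongodb .
    split -d -b 5300000000 dbpath.tgz part.
    # receiving side: recombine the numbered pieces in order
    cat part.* > dbpath.tgz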

Comment by Aise Bouma [ 22/Mar/16 ]

The data is 57GB in size. Uploading will take forever. Is any (small) part of particular interest?

Comment by Ramon Fernandez Marina [ 21/Mar/16 ]

aisebouma, looks like there could be a race condition somewhere, as it's very strange that only one node is affected. The alternative is that your data got into a state that causes this segfault to trigger, so could you upload the dbpath contents for this node here?

This is a private, secure upload portal where your files will only be accessible to MongoDB staff for the purpose of tracking down this bug.

Thanks,
Ramón.

Comment by Ramon Fernandez Marina [ 21/Mar/16 ]

Thanks for your report and for uploading the log, aisebouma; we're investigating.
