[SERVER-33071] Mongodb Crashes with signal: 6 using IXSCAN on MMAPV1 Created: 02/Feb/18  Updated: 21/Mar/18  Resolved: 16/Feb/18

Status: Closed
Project: Core Server
Component/s: Admin
Affects Version/s: 3.4.10
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Andrey Melnikov Assignee: Bruce Lucas (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

Hello

I get the following backtrace in the log when my 3.4.10 instance (master, no replica set) running on Ubuntu 16.04.3 LTS crashes.

Possibly related to https://jira.mongodb.org/browse/SERVER-28001

However, ulimit -n 64000 is set permanently in the settings, and I also ran mongod --repair on the database recently when I moved it from another server, but still no luck: the db crashes from time to time. The file system of the volume containing the database files is also OK. I also can't reIndex the collection, because that leads to a crash as well.

Could you please advise what I can do to resolve this issue?

2018-02-02T11:12:31.778+0000 I COMMAND  [conn1047] command books.searches_archive command: count { count: "searches_archive", query: { date: { $gte: 1517000401, $lte: 1517086799 }, user.companyid: { $eq: "10" } } } planSummary: IXSCAN { date: 1 } keysExamined:3442 docsExamined:3442 numYields:310 reslen:44 locks:{ Global: { acquireCount: { r: 622 } }, MMAPV1Journal: { acquireCount: { r: 311 } }, Database: { acquireCount: { r: 311 } }, Collection: { acquireCount: { R: 311 } } } protocol:op_query 633ms
2018-02-02T11:12:32.491+0000 I -        [conn1047] Fatal Assertion 17441 at src/mongo/db/storage/mmap_v1/record_store_v1_base.cpp 282
2018-02-02T11:12:32.492+0000 I -        [conn1047]
 
***aborting after fassert() failure
 
 
2018-02-02T11:12:32.562+0000 F -        [conn1047] Got signal: 6 (Aborted).
 
 0x5635774e8651 0x5635774e7869 0x5635774e7d4d 0x7f131c87e390 0x7f131c4d8428 0x7f131c4da02a 0x5635767927b3 0x5635771a52f6 0x5635771a5325 0x5635771b9c2d 0x5635771b9e37 0x563576b5911b 0x563576affd86 0x563576b217d3 0x563576af3548 0x563576e26952 0x563576e28cf8 0x563576e299ac 0x563576ddff41 0x5635769daba7 0x5635769e4aaf 0x5635769e617a 0x563576ffba30 0x563576c00a52 0x563576c02a56 0x56357680251d 0x563576802e4d 0x563577450b91 0x7f131c8746ba 0x7f131c5aa41d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"563575F77000","o":"1571651","s":"_ZN5mongo15printStackTraceERSo"},{"b":"563575F77000","o":"1570869"},{"b":"563575F77000","o":"1570D4D"},{"b":"7F131C86D000","o":"11390"},{"b":"7F131C4A3000","o":"35428","s":"gsignal"},{"b":"7F131C4A3000","o":"3702A","s":"abort"},{"b":"563575F77000","o":"81B7B3","s":"_ZN5mongo32fassertFailedNoTraceWithLocationEiPKcj"},{"b":"563575F77000","o":"122E2F6"},{"b":"563575F77000","o":"122E325","s":"_ZNK5mongo17RecordStoreV1Base13getNextRecordEPNS_16OperationContextERKNS_7DiskLocE"},{"b":"563575F77000","o":"1242C2D","s":"_ZN5mongo27SimpleRecordStoreV1Iterator7advanceEv"},{"b":"563575F77000","o":"1242E37","s":"_ZN5mongo27SimpleRecordStoreV1Iterator9seekExactERKNS_8RecordIdE"},{"b":"563575F77000","o":"BE211B","s":"_ZN5mongo16WorkingSetCommon5fetchEPNS_16OperationContextEPNS_10WorkingSetEmNS_11unowned_ptrINS_20SeekableRecordCursorEEE"},{"b":"563575F77000","o":"B88D86","s":"_ZN5mongo10FetchStage6doWorkEPm"},{"b":"563575F77000","o":"BAA7D3","s":"_ZN5mongo9PlanStage4workEPm"},{"b":"563575F77000","o":"B7C548","s":"_ZN5mongo15CachedPlanStage12pickBestPlanEPNS_15PlanYieldPolicyE"},{"b":"563575F77000","o":"EAF952","s":"_ZN5mongo12PlanExecutor12pickBestPlanENS0_11YieldPolicyEPKNS_10CollectionE"},{"b":"563575F77000","o":"EB1CF8","s":"_ZN5mongo12PlanExecutor4makeEPNS_16OperationContextESt10unique_ptrINS_10WorkingSetESt14default_deleteIS4_EES3_INS_9PlanStageES5_IS8_EES3_INS_13QuerySolutionES5_ISB_EES3_INS_14CanonicalQueryES5_ISE_EEPKNS_10CollectionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS0_11YieldPolicyE"},{"b":"563575F77000","o":"EB29AC","s":"_ZN5mongo12PlanExecutor4makeEPNS_16OperationContextESt10unique_ptrINS_10WorkingSetESt14default_deleteIS4_EES3_INS_9PlanStageES5_IS8_EES3_INS_13QuerySolutionES5_ISB_EES3_INS_14CanonicalQueryES5_ISE_EEPKNS_10CollectionENS0_11YieldPolicyE"},{"b":"563575F77000","o":"E68F41","s":"_ZN5mongo16getExecutorCountEPNS_16OperationContextEPNS_10CollectionERKNS_12CountRequestEbNS_12PlanExecutor11YieldPolicyE"},{"b":"563575F77000","o":"A63BA7"},{"b":"563575F77000","o":"A6DAAF","s":"_ZN5mongo7Command3runEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS3_21ReplyBuilderInterfaceE"},{"b":"563575F77000","o":"A6F17A","s":"_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_RKNS_3rpc16RequestInterfaceEPNS4_21ReplyBuilderInterfaceE"},{"b":"563575F77000","o":"1084A30","s":"_ZN5mongo11runCommandsEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS2_21ReplyBuilderInterfaceE"},{"b":"563575F77000","o":"C89A52"},{"b":"563575F77000","o":"C8BA56","s":"_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE"},{"b":"563575F77000","o":"88B51D","s":"_ZN5mongo23ServiceEntryPointMongod12_sessionLoopERKSt10shared_ptrINS_9transport7SessionEE"},{"b":"563575F77000","o":"88BE4D"},{"b":"563575F77000","o":"14D9B91"},{"b":"7F131C86D000","o":"76BA"},{"b":"7F131C4A3000","o":"10741D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.4.10", "gitVersion" : "078f28920cb24de0dd479b5ea6c66c644f6326e9", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.4.0-109-generic", "version" : "#132-Ubuntu SMP Tue Jan 9 19:52:39 UTC 2018", "machine" : "x86_64" }, "somap" : [ { "b" : "563575F77000", "elfType" : 3, "buildId" : "953D98B21F3F66FCA98875FCB02884203F3CEDC6" }, { "b" : "7FFD3A5BF000", "elfType" : 3, "buildId" : "935280D9447D52B22E652A6F878EC406871F51FA" }, { "b" : "7F131D7F9000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : 
"DCF10134B91ED2139E3E8C72564668F5CDBA8522" }, { "b" : "7F131D3B5000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "1649272BE0CA9FA22F082DC86372B6C9959779B0" }, { "b" : "7F131D1AD000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "89C34D7A182387D76D5CDA1F7718F5D58824DFB3" }, { "b" : "7F131CFA9000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "8CC8D0D119B142D839800BFF71FB71E73AEA7BD4" }, { "b" : "7F131CCA0000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "DFB85DE42DAFFD09640C8FE377D572DE3E168920" }, { "b" : "7F131CA8A000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "68220AE2C65D65C1B6AAA12FA6765A6EC2F5F434" }, { "b" : "7F131C86D000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "CE17E023542265FC11D9BC8F534BB4F070493D30" }, { "b" : "7F131C4A3000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "B5381A457906D279073822A5CEB24C4BFEF94DDB" }, { "b" : "7F131DA62000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "5D7B6259552275A3C17BD4C3FD05F5A6BF40CAA5" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x5635774e8651]
 mongod(+0x1570869) [0x5635774e7869]
 mongod(+0x1570D4D) [0x5635774e7d4d]
 libpthread.so.0(+0x11390) [0x7f131c87e390]
 libc.so.6(gsignal+0x38) [0x7f131c4d8428]
 libc.so.6(abort+0x16A) [0x7f131c4da02a]
 mongod(_ZN5mongo32fassertFailedNoTraceWithLocationEiPKcj+0x0) [0x5635767927b3]
 mongod(+0x122E2F6) [0x5635771a52f6]
 mongod(_ZNK5mongo17RecordStoreV1Base13getNextRecordEPNS_16OperationContextERKNS_7DiskLocE+0x25) [0x5635771a5325]
 mongod(_ZN5mongo27SimpleRecordStoreV1Iterator7advanceEv+0x3D) [0x5635771b9c2d]
 mongod(_ZN5mongo27SimpleRecordStoreV1Iterator9seekExactERKNS_8RecordIdE+0x87) [0x5635771b9e37]
 mongod(_ZN5mongo16WorkingSetCommon5fetchEPNS_16OperationContextEPNS_10WorkingSetEmNS_11unowned_ptrINS_20SeekableRecordCursorEEE+0xAB) [0x563576b5911b]
 mongod(_ZN5mongo10FetchStage6doWorkEPm+0x106) [0x563576affd86]
 mongod(_ZN5mongo9PlanStage4workEPm+0x63) [0x563576b217d3]
 mongod(_ZN5mongo15CachedPlanStage12pickBestPlanEPNS_15PlanYieldPolicyE+0x198) [0x563576af3548]
 mongod(_ZN5mongo12PlanExecutor12pickBestPlanENS0_11YieldPolicyEPKNS_10CollectionE+0xF2) [0x563576e26952]
 mongod(_ZN5mongo12PlanExecutor4makeEPNS_16OperationContextESt10unique_ptrINS_10WorkingSetESt14default_deleteIS4_EES3_INS_9PlanStageES5_IS8_EES3_INS_13QuerySolutionES5_ISB_EES3_INS_14CanonicalQueryES5_ISE_EEPKNS_10CollectionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS0_11YieldPolicyE+0x2D8) [0x563576e28cf8]
 mongod(_ZN5mongo12PlanExecutor4makeEPNS_16OperationContextESt10unique_ptrINS_10WorkingSetESt14default_deleteIS4_EES3_INS_9PlanStageES5_IS8_EES3_INS_13QuerySolutionES5_ISB_EES3_INS_14CanonicalQueryES5_ISE_EEPKNS_10CollectionENS0_11YieldPolicyE+0xEC) [0x563576e299ac]
 mongod(_ZN5mongo16getExecutorCountEPNS_16OperationContextEPNS_10CollectionERKNS_12CountRequestEbNS_12PlanExecutor11YieldPolicyE+0x4C1) [0x563576ddff41]
 mongod(+0xA63BA7) [0x5635769daba7]
 mongod(_ZN5mongo7Command3runEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS3_21ReplyBuilderInterfaceE+0x4FF) [0x5635769e4aaf]
 mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_RKNS_3rpc16RequestInterfaceEPNS4_21ReplyBuilderInterfaceE+0xF6A) [0x5635769e617a]
 mongod(_ZN5mongo11runCommandsEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS2_21ReplyBuilderInterfaceE+0x240) [0x563576ffba30]
 mongod(+0xC89A52) [0x563576c00a52]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x746) [0x563576c02a56]
 mongod(_ZN5mongo23ServiceEntryPointMongod12_sessionLoopERKSt10shared_ptrINS_9transport7SessionEE+0x1FD) [0x56357680251d]
 mongod(+0x88BE4D) [0x563576802e4d]
 mongod(+0x14D9B91) [0x563577450b91]
 libpthread.so.0(+0x76BA) [0x7f131c8746ba]
 libc.so.6(clone+0x6D) [0x7f131c5aa41d]
-----  END BACKTRACE  -----



 Comments   
Comment by Bruce Lucas (Inactive) [ 02/Feb/18 ]

Hi Andrey,

Yes, the remove command would likely encounter the same issue, so there's no way to repair the affected collection in place, and you would need to insert the recovered documents into a separate collection. I would of course recommend that you run validate(true) on all of your data to check its integrity.

Bruce
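
A minimal sketch of running validate(true) across all collections from the mongo shell; the system-database filter below is an assumption and may need adjusting for your deployment:

// Run a full validation on every collection in every non-system database
// and print the result for any collection that fails.
db.adminCommand({ listDatabases: 1 }).databases.forEach(function (d) {
    if (["admin", "local", "config"].indexOf(d.name) !== -1) return;
    var cur = db.getSiblingDB(d.name);
    cur.getCollectionNames().forEach(function (c) {
        var res = cur.getCollection(c).validate(true);  // full validation
        if (!res.valid) {
            print("INVALID: " + d.name + "." + c);
            printjson(res);
        }
    });
});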

Comment by Andrey Melnikov [ 02/Feb/18 ]

Hi Bruce

And if I determine which area is corrupted, can I delete those records by _id or by using skip? Will a remove() command targeting this bad area lead to a crash? Or is it better to insert the unaffected documents into a separate collection during this forward/backward scan?

Comment by Bruce Lucas (Inactive) [ 02/Feb/18 ]

Hi Andrey,

Unfortunately there isn't.

If a resync or restore is not possible, you might be able to recover some of the data by doing a forward collection scan with a small batch size (e.g. 2), and then repeating with a backward collection scan. Each will error out when it encounters the first bad document, but by using a small batch size you allow the scan to get as close as possible to the point in the list where the error occurs. Neither scan will be able to reach documents that lie between two separate corrupted regions of the linked list.

Hope this helps,
Bruce
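
A rough sketch of that salvage approach in the mongo shell, assuming the corrupted collection is books.searches_archive and using a hypothetical searches_archive_recovered collection for the salvaged documents; the $natural hint forces a collection scan, and the small batch size lets each scan get as close as possible to the corrupted region before it errors out:

// Forward natural-order scan: copy documents into a fresh collection
// until the cursor errors out at the first corrupted record.
var src = db.getSiblingDB("books").searches_archive;
var dst = db.getSiblingDB("books").searches_archive_recovered;
try {
    src.find().hint({ $natural: 1 }).batchSize(2).forEach(function (doc) {
        dst.insert(doc);
    });
} catch (e) {
    print("forward scan stopped: " + e);
}
// Repeat backwards to pick up documents that sit after the corrupted region,
// skipping anything the forward scan already copied.
try {
    src.find().hint({ $natural: -1 }).batchSize(2).forEach(function (doc) {
        if (dst.count({ _id: doc._id }) === 0) dst.insert(doc);
    });
} catch (e) {
    print("backward scan stopped: " + e);
}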

Comment by Andrey Melnikov [ 02/Feb/18 ]

Thanks, Bruce! Is there a way to determine the failing document (or a bunch of them) and then delete them?

Comment by Bruce Lucas (Inactive) [ 02/Feb/18 ]

Hi Andrey,

That error indicates that, while traversing the linked list of records in the MMAPv1 data files, mongod encountered a pointer to the next record that was less than 8, which is the minimum possible value. We can't provide a definitive diagnosis from that information alone, but the most likely explanation in our experience is that a storage-level error caused a write to fail at some time in the past, leaving a hole in the file that is read back as zeros. You might check syslog for write failures and perform disk diagnostics.

Since mongod repair doesn't work in this case, the recovery options are to resync the node or to restore from backup.

Bruce
