[SERVER-22946] MongoDB crashing Created: 03/Mar/16  Updated: 15/Mar/16  Resolved: 15/Mar/16

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 3.0.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Aaron Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

Try to add a new record to a large (20 GB) database; mongod crashes every time.
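
For reference, a minimal mongo shell sketch of the kind of insert that hits the crash. The owdb.owparser collection and the UID/sysID/datecode field names are inferred from the validate output below; the values are hypothetical.

use owdb
// Any document that must be added to the UID_1_sysID_1_datecode_1 index
// appears to abort mongod with the key.cpp assertion shown in the backtrace.
db.owparser.insert({
    UID: "example-uid",      // hypothetical value
    sysID: "example-sys",    // hypothetical value
    datecode: 20160303       // hypothetical value
})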

Participants:

 Description   

It keeps crashing when adding a new record. Here is the backtrace:

mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf51af9]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xef1831]
 mongod(_ZN5mongo12verifyFailedEPKcS1_j+0xCE) [0xed6d6e]
 mongod(_ZNK5mongo5KeyV18dataSizeEv+0xEC) [0xd0143c]
 mongod(_ZN5mongo10BtreeLogicINS_13BtreeLayoutV1EE8splitPosEPNS_13BtreeBucketV1Ei+0x6D) [0xcf812d]
 mongod(_ZN5mongo10BtreeLogicINS_13BtreeLayoutV1EE5splitEPNS_16OperationContextEPNS_13BtreeBucketV1ENS_7DiskLocEiS7_RKNS_5KeyV1ES7_S7_+0x41) [0xcfb3c1]
 mongod(_ZN5mongo10BtreeLogicINS_13BtreeLayoutV1EE10insertHereEPNS_16OperationContextENS_7DiskLocEiRKNS_5KeyV1ES5_S5_S5_+0x167) [0xcfba57]
 mongod(_ZN5mongo10BtreeLogicINS_13BtreeLayoutV1EE7_insertEPNS_16OperationContextEPNS_13BtreeBucketV1ENS_7DiskLocERKNS_5KeyV1ES7_bS7_S7_+0x182) [0xcfbd72]
 mongod(_ZN5mongo10BtreeLogicINS_13BtreeLayoutV1EE7_insertEPNS_16OperationContextEPNS_13BtreeBucketV1ENS_7DiskLocERKNS_5KeyV1ES7_bS7_S7_+0x217) [0xcfbe07]
 mongod(_ZN5mongo10BtreeLogicINS_13BtreeLayoutV1EE7_insertEPNS_16OperationContextEPNS_13BtreeBucketV1ENS_7DiskLocERKNS_5KeyV1ES7_bS7_S7_+0x217) [0xcfbe07]
 mongod(_ZN5mongo10BtreeLogicINS_13BtreeLayoutV1EE7_insertEPNS_16OperationContextEPNS_13BtreeBucketV1ENS_7DiskLocERKNS_5KeyV1ES7_bS7_S7_+0x217) [0xcfbe07]
 mongod(_ZN5mongo10BtreeLogicINS_13BtreeLayoutV1EE6insertEPNS_16OperationContextERKNS_7BSONObjERKNS_7DiskLocEb+0x3A9) [0xcfc309]
 mongod(_ZN5mongo18BtreeInterfaceImplINS_13BtreeLayoutV1EE6insertEPNS_16OperationContextERKNS_7BSONObjERKNS_8RecordIdEb+0x6F) [0xcee3af]
 mongod(_ZN5mongo22BtreeBasedAccessMethod6insertEPNS_16OperationContextERKNS_7BSONObjERKNS_8RecordIdERKNS_19InsertDeleteOptionsEPl+0x19B) [0xa8756b]
 mongod(_ZN5mongo12IndexCatalog12_indexRecordEPNS_16OperationContextEPNS_17IndexCatalogEntryERKNS_7BSONObjERKNS_8RecordIdE+0x6E) [0x92568e]
 mongod(_ZN5mongo12IndexCatalog11indexRecordEPNS_16OperationContextERKNS_7BSONObjERKNS_8RecordIdE+0x85) [0x925a05]
 mongod(_ZN5mongo10Collection15_insertDocumentEPNS_16OperationContextERKNS_7BSONObjEb+0xB0) [0x9139e0]
 mongod(_ZN5mongo10Collection14insertDocumentEPNS_16OperationContextERKNS_7BSONObjEb+0x8D) [0x913ddd]
 mongod(_ZN5mongo18WriteBatchExecutor13execOneInsertEPNS0_16ExecInsertsStateEPPNS_16WriteErrorDetailE+0xA83) [0x9b1b23]
 mongod(_ZN5mongo18WriteBatchExecutor11execInsertsERKNS_21BatchedCommandRequestERKNS_19WriteConcernOptionsEPSt6vectorIPNS_16WriteErrorDetailESaIS9_EE+0x29D) [0x9b287d]
 mongod(_ZN5mongo18WriteBatchExecutor11bulkExecuteERKNS_21BatchedCommandRequestERKNS_19WriteConcernOptionsEPSt6vectorIPNS_19BatchedUpsertDetailESaIS9_EEPS7_IPNS_16WriteErrorDetailESaISE_EE+0x3E) [0x9b475e]
 mongod(_ZN5mongo18WriteBatchExecutor12executeBatchERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE+0x37B) [0x9b4e9b]
 mongod(_ZN5mongo8WriteCmd3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x15D) [0x9b783d]
 mongod(_ZN5mongo12_execCommandEPNS_16OperationContextEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x34) [0x9dafa4]
 mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_iPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xC1D) [0x9dbf2d]
 mongod(_ZN5mongo12_runCommandsEPNS_16OperationContextEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x28B) [0x9dcc3b]
 mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERKNS_15NamespaceStringERNS_5CurOpES3_+0x746) [0xba0ee6]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xB10) [0xab73d0]
 mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xDD) [0x80e96d]
 mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x34B) [0xf04c2b]
 libpthread.so.0(+0x7DF3) [0x7fd122334df3]
 libc.so.6(clone+0x6D) [0x7fd120de31ed]
-----  END BACKTRACE  -----



 Comments   
Comment by Kelsey Schubert [ 08/Mar/16 ]

Hi liuaaronsz,

Thank you for confirming that you have a corrupt index. However, it is challenging to determine whether the corruption is isolated to the index. Therefore, I would still recommend executing repairDatabase.

Kind regards,
Thomas

Comment by Aaron [ 08/Mar/16 ]

{
    "ns" : "owdb.owparser",
    "datasize" : 11581930976,
    "nrecords" : 17157362,
    "lastExtentSize" : 2146426864,
    "firstExtent" : "0:5000 ns:owdb.owparser",
    "lastExtent" : "13:2000 ns:owdb.owparser",
    "extentCount" : 27,
    "firstExtentDetails" : {
        "loc" : "0:5000",
        "xnext" : "0:29000",
        "xprev" : "null",
        "nsdiag" : "owdb.owparser",
        "size" : 8192,
        "firstRecord" : "0:6eb0",
        "lastRecord" : "0:6cb0"
    },
    "lastExtentDetails" : {
        "loc" : "13:2000",
        "xnext" : "null",
        "xprev" : "11:2000",
        "nsdiag" : "owdb.owparser",
        "size" : 2146426864,
        "firstRecord" : "13:20b0",
        "lastRecord" : "13:181615b0"
    },
    "deletedCount" : 27,
    "deletedSize" : 1742336064,
    "nIndexes" : 3,
    "valid" : false,
    "errors" : [
        "index UID_1_sysID_1_datecode_1 is not multi-key, but has more entries (17283876) than documents (17157362)",
        "exception during index validate idxn 2: 0 assertion src/mongo/db/storage/mmap_v1/btree/key.cpp:447"
    ],
    "warning" : "Some checks omitted for speed. use {full:true} option to do more thorough scan.",
    "advice" : "ns corrupt. See http://dochub.mongodb.org/core/data-recovery",
    "ok" : 1
}


Comment by Kelsey Schubert [ 08/Mar/16 ]

Hi liuaaronsz,

Unfortunately, the behavior you are describing could indicate that your backups have the same corruption issue.

  1. Does this crash occur with every insert or only on particular inserts?
  2. Can you please upload the logs preceding the stack trace?

To check the integrity of your database files, please consider executing db.collection.validate(true) on the affected collection and attaching the output to this ticket.
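
For example, from the mongo shell, the full validation could be run and its output captured as follows (assuming the owdb.owparser collection reported in the validate output above):

use owdb
// validate(true) scans every document and index entry, so it is slow on a
// 20 GB collection; run it during a maintenance window.
var result = db.owparser.validate(true)
printjson(result)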

Thank you,
Thomas

Comment by Aaron [ 08/Mar/16 ]

Hi Thomas

Here are the answers. We are not sure that file corruption caused this issue: we have daily backup files and tried to recover from a backup, but the issue still exists.

  1. If this node is part of a replicaset, are other members affected?
     No, it is running as a single node.
  2. Around this operation, were there any other server errors logged?
     There are no server errors logged.
  3. Are you using journaling?
     Yes, we are using journaling.
  4. What kind of underlying storage mechanism are you using? Are the storage
     devices attached locally or over the network? Are the disks SSDs or HDDs?
     What kind of RAID and/or volume management system are you using?
     It is running on RAID 5; the disks are HDDs and the file system is ext3.
  5. Have you manipulated (copied or moved) the underlying database files?
     For the daily backup we shut down the database, copy all the database
     files, then resume the database service.
  6. Have you ever restored this instance from backups?
     Yes, we did, and we are still seeing this issue.
  7. What method do you use to create backups?
     We use a file-system-level copy as the backup.

Is there a tool you provide to check the integrity of the database files?


Comment by Kelsey Schubert [ 07/Mar/16 ]

Hi liuaaronsz,

The stack trace you have provided indicates that this node has some data files that have become corrupt in some way. It is not clear if the corruption resides in the index or the data itself. In cases like this, it is very challenging to determine whether the corruption is isolated beyond the file level.

I have compiled a list of routine questions about data storage and the configuration of your environment to help us get a better understanding of what is going on here. Corruption is often the result of faulty disk drives or power failures, but please note that in these situations it can be difficult to determine the cause of the corruption without a straightforward reproduction.

  1. If this node is part of a replicaset, are other members affected?
  2. Around this operation, were there any other server errors logged?
  3. Are you using journaling?
  4. What kind of underlying storage mechanism are you using? Are the storage devices attached locally or over the network? Are the disks SSDs or HDDs? What kind of RAID and/or volume management system are you using?
  5. Have you manipulated (copied or moved) the underlying database files?
  6. Have you ever restored this instance from backups?
  7. What method do you use to create backups?

I would recommend a clean resync from a node that is not affected. If that is not possible, I would recommend executing repairDatabase. Before attempting repairDatabase, please consider backing up your files.
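
As a rough sketch of that repair step (the dbpath below is illustrative, not taken from this deployment):

// From the mongo shell, against the affected database:
use owdb
db.repairDatabase()

// Or repair offline with mongod stopped. For MMAPv1, repair needs free disk
// space roughly equal to the current data set size plus 2 GB:
//   mongod --dbpath /data/db --repair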

I would also suggest that you check the integrity of the affected node's disk drives. If this issue persists, you may need to replace them.

Regards,
Thomas
