[SERVER-13291] Invalid memory access causes crash and leaves server in unrecoverable state Created: 20/Mar/14  Updated: 10/Dec/14  Resolved: 07/Jul/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Farooq Mela Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 4
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

I have a sharded cluster with two replica sets; each replica set has a primary, secondary, and arbiter. During some chunk migrations, the secondary crashed with an invalid memory access:

Thu Mar 20 00:11:15.400 Invalid access at address: 0x7e82cb2c4024 from thread: repl writer worker 1
Thu Mar 20 00:11:15.400 Got signal: 7 (Bus error).
Thu Mar 20 00:11:15.470 Backtrace:
0xde46e1 0x6d06c9 0x6d0c52 0x7f18e429dcb0 0x7feb9b 0x804610 0x80474f 0x80474f 0x80484c 0x8051fd 0x9db2e3 0x9e2944 0x9e45cc 0xacaa1f 0xacb4b2 0xa80bdb 0xa71707 0xc2d253 0xc2c998 0xdb2021
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xde46e1]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6d06c9]
 /usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x262) [0x6d0c52]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f18e429dcb0]
 /usr/bin/mongod(_ZNK5mongo11BtreeBucketINS_12BtreeData_V1EE4findERKNS_12IndexDetailsERKNS_5KeyV1ERKNS_7DiskLocERKNS_8OrderingERib+0x8b) [0x7feb9b]
 /usr/bin/mongod(_ZNK5mongo11BtreeBucketINS_12BtreeData_V1EE6locateERKNS_12IndexDetailsERKNS_7DiskLocERKNS_5KeyV1ERKNS_8OrderingERiRbS8_i+0x60) [0x804610]
 /usr/bin/mongod(_ZNK5mongo11BtreeBucketINS_12BtreeData_V1EE6locateERKNS_12IndexDetailsERKNS_7DiskLocERKNS_5KeyV1ERKNS_8OrderingERiRbS8_i+0x19f) [0x80474f]
 /usr/bin/mongod(_ZNK5mongo11BtreeBucketINS_12BtreeData_V1EE6locateERKNS_12IndexDetailsERKNS_7DiskLocERKNS_5KeyV1ERKNS_8OrderingERiRbS8_i+0x19f) [0x80474f]
 /usr/bin/mongod(_ZNK5mongo11BtreeBucketINS_12BtreeData_V1EE6locateERKNS_12IndexDetailsERKNS_7DiskLocERKNS_7BSONObjERKNS_8OrderingERiRbS8_i+0x8c) [0x80484c]
 /usr/bin/mongod(_ZNK5mongo11BtreeBucketINS_12BtreeData_V1EE7unindexENS_7DiskLocERNS_12IndexDetailsERKNS_7BSONObjES3_+0x51d) [0x8051fd]
 /usr/bin/mongod(_ZNK5mongo18IndexInterfaceImplINS_12BtreeData_V1EE7unindexENS_7DiskLocERNS_12IndexDetailsERKNS_7BSONObjES3_+0x63) [0x9db2e3]
 /usr/bin/mongod() [0x9e2944]
 /usr/bin/mongod(_ZN5mongo13unindexRecordEPNS_16NamespaceDetailsEPNS_6RecordERKNS_7DiskLocEb+0x7c) [0x9e45cc]
 /usr/bin/mongod(_ZN5mongo11DataFileMgr12deleteRecordEPNS_16NamespaceDetailsEPKcPNS_6RecordERKNS_7DiskLocEbbb+0x1bf) [0xacaa1f]
 /usr/bin/mongod(_ZN5mongo11DataFileMgr12deleteRecordEPKcPNS_6RecordERKNS_7DiskLocEbbb+0x82) [0xacb4b2]
 /usr/bin/mongod(_ZN5mongo13deleteObjectsEPKcNS_7BSONObjEbbbPNS_11RemoveSaverE+0xd6b) [0xa80bdb]
 /usr/bin/mongod(_ZN5mongo21applyOperation_inlockERKNS_7BSONObjEbb+0xb37) [0xa71707]
 /usr/bin/mongod(_ZN5mongo7replset8SyncTail9syncApplyERKNS_7BSONObjEb+0x713) [0xc2d253]
 /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x48) [0xc2c998]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdb2021]

I then try to restart the server process, but each time I do, it crashes again with the same error:

***** SERVER RESTARTED *****
 
 
Thu Mar 20 00:16:40.792 [initandlisten] MongoDB starting : pid=12850 port=27017 dbpath=/mnt/ssd0/mongodb 64-bit host=ip-172-31-15-97
Thu Mar 20 00:16:40.792 [initandlisten] db version v2.4.9
Thu Mar 20 00:16:40.792 [initandlisten] git version: 52fe0d21959e32a5bdbecdc62057db386e4e029c
Thu Mar 20 00:16:40.792 [initandlisten] build info: Linux ip-10-2-29-40 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_49
Thu Mar 20 00:16:40.792 [initandlisten] allocator: tcmalloc
Thu Mar 20 00:16:40.792 [initandlisten] options: { config: "/etc/mongodb.conf", dbpath: "/mnt/ssd0/mongodb", keyFile: "/var/lib/mongodb/key", logappend: "true", logpath: "/var/log/mongodb/mongodb.log", replSet: "c10r_rs0" }
Thu Mar 20 00:16:40.794 [initandlisten] journal dir=/mnt/ssd0/mongodb/journal
Thu Mar 20 00:16:40.804 [initandlisten] recover begin
Thu Mar 20 00:16:40.804 [initandlisten] recover lsn: 67446000
Thu Mar 20 00:16:40.804 [initandlisten] recover /mnt/ssd0/mongodb/journal/j._223
Thu Mar 20 00:16:40.805 [initandlisten] recover skipping application of section seq:67267936 < lsn:67446000
Thu Mar 20 00:16:40.971 [initandlisten] recover skipping application of section seq:67327460 < lsn:67446000
Thu Mar 20 00:16:41.378 [initandlisten] recover skipping application of section seq:67386642 < lsn:67446000
Thu Mar 20 00:16:41.614 [initandlisten] recover /mnt/ssd0/mongodb/journal/j._224
Thu Mar 20 00:16:42.117 [initandlisten] recover cleaning up
Thu Mar 20 00:16:42.117 [initandlisten] removeJournalFiles
Thu Mar 20 00:16:42.175 [initandlisten] recover done
Thu Mar 20 00:16:42.242 [websvr] admin web console waiting for connections on port 28017
Thu Mar 20 00:16:42.242 [initandlisten] waiting for connections on port 27017
Thu Mar 20 00:16:42.262 [rsStart] replSet I am ip-172-31-15-97:27017
Thu Mar 20 00:16:42.263 [rsStart] replSet STARTUP2
Thu Mar 20 00:16:42.264 [rsHealthPoll] replset info ip-172-31-7-105:27017 thinks that we are down
Thu Mar 20 00:16:42.264 [rsHealthPoll] replSet member ip-172-31-7-105:27017 is up
Thu Mar 20 00:16:42.264 [rsHealthPoll] replSet member ip-172-31-7-105:27017 is now in state PRIMARY
Thu Mar 20 00:16:42.267 [initandlisten] connection accepted from 172.31.5.87:58346 #1 (1 connection now open)
Thu Mar 20 00:16:42.268 [conn1]  authenticate db: local { authenticate: 1, nonce: "abb7f6424d81c71f", user: "__system", key: "dfb68b42dcbb35c1da05e479817db475" }
Thu Mar 20 00:16:43.264 [rsSync] replSet SECONDARY
Thu Mar 20 00:16:43.672 [initandlisten] connection accepted from 172.31.5.87:58347 #2 (2 connections now open)
Thu Mar 20 00:16:43.842 [initandlisten] connection accepted from 172.31.7.105:49017 #3 (3 connections now open)
Thu Mar 20 00:16:43.842 [conn3] end connection 172.31.7.105:49017 (2 connections now open)
Thu Mar 20 00:16:43.842 [initandlisten] connection accepted from 172.31.7.105:49018 #4 (4 connections now open)
Thu Mar 20 00:16:43.843 [conn4]  authenticate db: local { authenticate: 1, nonce: "370747b8d626ae6b", user: "__system", key: "1621b74845db8066a8f1146b7d3c4e20" }
Thu Mar 20 00:16:43.843 [conn4]  authenticate db: local { authenticate: 1, nonce: "a5af772b216170bf", user: "__system", key: "785eadb9503bd9bf9cea035bfad12f6a" }
Thu Mar 20 00:16:44.085 [initandlisten] connection accepted from 172.31.0.228:41168 #5 (4 connections now open)
Thu Mar 20 00:16:44.086 [conn5] end connection 172.31.0.228:41168 (3 connections now open)
Thu Mar 20 00:16:44.086 [initandlisten] connection accepted from 172.31.0.228:41169 #6 (4 connections now open)
Thu Mar 20 00:16:44.087 [conn6]  authenticate db: local { authenticate: 1, nonce: "3f8ac5c819eeb53f", user: "__system", key: "3101d50913a97466cde4cbed821f5ca8" }
Thu Mar 20 00:16:44.087 [conn6]  authenticate db: local { authenticate: 1, nonce: "57f591267cfa5d23", user: "__system", key: "a673e04f54b4c46c5c40aeaa94e74452" }
Thu Mar 20 00:16:44.265 [rsHealthPoll] replSet member ip-172-31-0-228:27017 is up
Thu Mar 20 00:16:44.265 [rsHealthPoll] replSet member ip-172-31-0-228:27017 is now in state ARBITER
Thu Mar 20 00:16:45.547 [initandlisten] connection accepted from 172.31.8.61:42866 #7 (5 connections now open)
Thu Mar 20 00:16:49.264 [rsBackgroundSync] replSet syncing to: ip-172-31-7-105:27017
Thu Mar 20 00:16:49.395 Invalid access at address: 0x7eacf0936024 from thread: repl writer worker 1
Thu Mar 20 00:16:49.395 Got signal: 7 (Bus error).
Thu Mar 20 00:16:49.402 Backtrace:
0xde46e1 0x6d06c9 0x6d0c52 0x7f432b6f6cb0 0x7feb9b 0x804610 0x80474f 0x80474f 0x80484c 0x8051fd 0x9db2e3 0x9e2944 0x9e45cc 0xacaa1f 0xacb4b2 0xa80bdb 0xa71707 0xc2d253 0xc2c998 0xdb2021
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xde46e1]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6d06c9]
 /usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x262) [0x6d0c52]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f432b6f6cb0]
 /usr/bin/mongod(_ZNK5mongo11BtreeBucketINS_12BtreeData_V1EE4findERKNS_12IndexDetailsERKNS_5KeyV1ERKNS_7DiskLocERKNS_8OrderingERib+0x8b) [0x7feb9b]
 /usr/bin/mongod(_ZNK5mongo11BtreeBucketINS_12BtreeData_V1EE6locateERKNS_12IndexDetailsERKNS_7DiskLocERKNS_5KeyV1ERKNS_8OrderingERiRbS8_i+0x60) [0x804610]
 /usr/bin/mongod(_ZNK5mongo11BtreeBucketINS_12BtreeData_V1EE6locateERKNS_12IndexDetailsERKNS_7DiskLocERKNS_5KeyV1ERKNS_8OrderingERiRbS8_i+0x19f) [0x80474f]
 /usr/bin/mongod(_ZNK5mongo11BtreeBucketINS_12BtreeData_V1EE6locateERKNS_12IndexDetailsERKNS_7DiskLocERKNS_5KeyV1ERKNS_8OrderingERiRbS8_i+0x19f) [0x80474f]
 /usr/bin/mongod(_ZNK5mongo11BtreeBucketINS_12BtreeData_V1EE6locateERKNS_12IndexDetailsERKNS_7DiskLocERKNS_7BSONObjERKNS_8OrderingERiRbS8_i+0x8c) [0x80484c]
 /usr/bin/mongod(_ZNK5mongo11BtreeBucketINS_12BtreeData_V1EE7unindexENS_7DiskLocERNS_12IndexDetailsERKNS_7BSONObjES3_+0x51d) [0x8051fd]
 /usr/bin/mongod(_ZNK5mongo18IndexInterfaceImplINS_12BtreeData_V1EE7unindexENS_7DiskLocERNS_12IndexDetailsERKNS_7BSONObjES3_+0x63) [0x9db2e3]
 /usr/bin/mongod() [0x9e2944]
 /usr/bin/mongod(_ZN5mongo13unindexRecordEPNS_16NamespaceDetailsEPNS_6RecordERKNS_7DiskLocEb+0x7c) [0x9e45cc]
 /usr/bin/mongod(_ZN5mongo11DataFileMgr12deleteRecordEPNS_16NamespaceDetailsEPKcPNS_6RecordERKNS_7DiskLocEbbb+0x1bf) [0xacaa1f]
 /usr/bin/mongod(_ZN5mongo11DataFileMgr12deleteRecordEPKcPNS_6RecordERKNS_7DiskLocEbbb+0x82) [0xacb4b2]
 /usr/bin/mongod(_ZN5mongo13deleteObjectsEPKcNS_7BSONObjEbbbPNS_11RemoveSaverE+0xd6b) [0xa80bdb]
 /usr/bin/mongod(_ZN5mongo21applyOperation_inlockERKNS_7BSONObjEbb+0xb37) [0xa71707]
 /usr/bin/mongod(_ZN5mongo7replset8SyncTail9syncApplyERKNS_7BSONObjEb+0x713) [0xc2d253]
 /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x48) [0xc2c998]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdb2021]
 

I then attempted to run the server with --repair, but I got this message:

***** SERVER RESTARTED *****
 
 
Thu Mar 20 00:19:45.045 [initandlisten] MongoDB starting : pid=13064 port=27017 dbpath=/mnt/ssd0/m
ongodb 64-bit host=ip-172-31-15-97
Thu Mar 20 00:19:45.046 [initandlisten] db version v2.4.9
Thu Mar 20 00:19:45.046 [initandlisten] git version: 52fe0d21959e32a5bdbecdc62057db386e4e029c
Thu Mar 20 00:19:45.046 [initandlisten] build info: Linux ip-10-2-29-40 2.6.21.7-2.ec2.v1.2.fc8xen
 #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_49
Thu Mar 20 00:19:45.046 [initandlisten] allocator: tcmalloc
Thu Mar 20 00:19:45.046 [initandlisten] options: { config: "/etc/mongodb.conf", dbpath: "/mnt/ssd0
/mongodb", keyFile: "/var/lib/mongodb/key", logappend: "true", logpath: "/var/log/mongodb/mongodb.
log", repair: true, replSet: "c10r_rs0" }
**************
You specified --repair but there are dirty journal files. Please
restart without --repair to allow the journal files to be replayed.
If you wish to repair all databases, please shutdown cleanly and
run with --repair again.
**************
Thu Mar 20 00:19:45.046 [initandlisten] exception in initAndListen: 12596 old lock file, terminating
Thu Mar 20 00:19:45.046 dbexit:
Thu Mar 20 00:19:45.046 [initandlisten] shutdown: going to close listening sockets...
Thu Mar 20 00:19:45.046 [initandlisten] shutdown: going to flush diaglog...
Thu Mar 20 00:19:45.046 [initandlisten] shutdown: going to close sockets...
Thu Mar 20 00:19:45.046 [initandlisten] shutdown: waiting for fs preallocator...
Thu Mar 20 00:19:45.046 [initandlisten] shutdown: closing all files...
Thu Mar 20 00:19:45.046 [initandlisten] closeAllFiles() finished
Thu Mar 20 00:19:45.046 dbexit: really exiting now

I was forced to erase all the data from the data directory and have the secondary do a complete re-sync.



 Comments   
Comment by Ramon Fernandez Marina [ 07/Jul/14 ]

farooq@connectifier.com, looks like at some point during replication some data in your secondary got corrupted. When the secondary is restarted and data corruption is detected, mongod shuts itself down to prevent further damage. Data corruption is most commonly caused by hardware problems such as bad hard drives or flaky memory. As you already discovered, re-syncing from the primary is a possible solution for this issue.

Generated at Thu Feb 08 03:31:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.