[SERVER-5835] failed to create new data node: Got signal: 7 (Bus error) Created: 14/May/12  Updated: 08/Mar/13  Resolved: 18/Oct/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.0.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: mengqy Assignee: Scott Hernandez (Inactive)
Resolution: Incomplete Votes: 0
Labels: crash, replicaset, replication
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

A replica set for testing with 1 master and 2 data nodes on a Debian server:

OS: Debian Lenny (5.0.4)
Kernel: 2.6.26-2-amd64

CPU: Pentium(R) Dual-Core CPU E5300 @ 2.60GHz
Memory: 4 GB
Disk: sda 500 GB, sdb 320 GB

MongoDB 2.0.5-rc0


Operating System: Linux
Participants:

 Description   

We had replication problems in our production environment with MongoDB 1.8.2 after a server crash.
Over 12 TB of data had synced and was close to finishing when the slave crashed.

Then we tested replication with MongoDB 2.0.5-rc0. The test data set is about a hundred GB.

The testing nodes were started with the following commands:

/usr/local/mongodb/bin/mongod --port 27020 --replSet jingoal --dbpath=/mongo_data/igoal1/ --logpath=/usr/local/mongodb/log/mongodb1.log --logappend --oplogSize 20000 --journal --fork
/usr/local/mongodb/bin/mongod --port 27021 --replSet jingoal --dbpath=/mongo_data/igoal4/ --logpath=/usr/local/mongodb/log/mongodb4.log --logappend --oplogSize 20000 --journal --fork
/usr/local/mongodb/bin/mongod --port 27022 --replSet jingoal --dbpath=/data2/igoal5/ --logpath=/usr/local/mongodb/log/mongodb5.log --logappend --oplogSize 20000 --journal --fork
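
For reference, the set is then initiated from the mongo shell on one of the members. A minimal sketch, with the set name taken from --replSet above and the host/ports assumed from the commands and the log addresses (192.168.0.64):

// connect to one member first, e.g.: mongo --port 27020
rs.initiate({
    _id: "jingoal",
    members: [
        { _id: 0, host: "192.168.0.64:27020" },
        { _id: 1, host: "192.168.0.64:27021" },
        { _id: 2, host: "192.168.0.64:27022" }
    ]
})
rs.status()   // members should reach PRIMARY/SECONDARY once the initial sync completes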


Here is the error log from testing (2.0.5-rc0):

Fri May 11 20:21:52 [conn514] end connection 192.168.0.64:43413
Fri May 11 20:21:52 [initandlisten] connection accepted from 192.168.0.64:43417 #516
Fri May 11 20:22:19 Invalid access at address: 0x7fd8119ac000

Fri May 11 20:22:19 Got signal: 7 (Bus error).

Fri May 11 20:22:20 Backtrace:
0xa95df9 0xa9651c 0x7fecd0c21a80 0x7fecd0196feb 0x8b4239 0x8b576b 0x85aa85 0x5c183a 0x850885 0x852995 0x8542e9 0x82e76a 0x830323 0x828381 0x828438 0x8288d0 0xaabd90 0x7fecd0c19fc7 0x7fecd01e959d
/usr/local/mongodb/bin/mongod(_ZN5mongo10abruptQuitEi+0x3a9) [0xa95df9]
/usr/local/mongodb/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x22c) [0xa9651c]
/lib/libpthread.so.0 [0x7fecd0c21a80]
/lib/libc.so.6(memcpy+0x15b) [0x7fecd0196feb]
/usr/local/mongodb/bin/mongod(_ZN5mongo11DataFileMgr6insertEPKcPKvibbPb+0x679) [0x8b4239]
/usr/local/mongodb/bin/mongod(_ZN5mongo11DataFileMgr16insertWithObjModEPKcRNS_7BSONObjEb+0x4b) [0x8b576b]
/usr/local/mongodb/bin/mongod(_ZN5mongo6Cloner3FunclERNS_27DBClientCursorBatchIteratorE+0x545) [0x85aa85]
/usr/local/mongodb/bin/mongod(_ZN5mongo18DBClientConnection5queryEN5boost8functionIFvRNS_27DBClientCursorBatchIteratorEEEERKSsNS_5QueryEPKNS_7BSONObjEi+0x1aa) [0x5c183a]
/usr/local/mongodb/bin/mongod(_ZN5mongo6Cloner4copyEPKcS2_bbbbbbNS_5QueryE+0x3c5) [0x850885]
/usr/local/mongodb/bin/mongod(_ZN5mongo6Cloner2goEPKcRSsRKSsbbbbbbPi+0x1665) [0x852995]
/usr/local/mongodb/bin/mongod(_ZN5mongo9cloneFromEPKcRSsRKSsbbbbbbPi+0x59) [0x8542e9]
/usr/local/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl18_syncDoInitialSyncEv+0xe6a) [0x82e76a]
/usr/local/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x23) [0x830323]
/usr/local/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x61) [0x828381]
/usr/local/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x48) [0x828438]
/usr/local/mongodb/bin/mongod(_ZN5mongo15startSyncThreadEv+0xa0) [0x8288d0]
/usr/local/mongodb/bin/mongod(thread_proxy+0x80) [0xaabd90]
/lib/libpthread.so.0 [0x7fecd0c19fc7]
/lib/libc.so.6(clone+0x6d) [0x7fecd01e959d]

Fri May 11 20:22:20 [initandlisten] connection accepted from 192.168.0.64:43421 #517
Fri May 11 20:22:20 [conn515] end connection 192.168.0.64:43415
Logstream::get called in uninitialized state
Fri May 11 20:22:20 ERROR: Client::~Client _context should be null but is not; client:rsSync
Logstream::get called in uninitialized state
Fri May 11 20:22:20 ERROR: Client::shutdown not called: rsSync


We have searched Google, the MongoDB Jira, and the forums and found some similar reports, but we have no idea how to fix this problem.
So we need your help, thanks.



 Comments   
Comment by Scott Hernandez (Inactive) [ 10/Oct/12 ]

mengqy, was this resolved with more memory? Have you upgraded since then?

Comment by mengqy [ 15/May/12 ]

Production cluster:
The ongoing replication shows that the slave eventually eats up all RAM and swap and becomes nearly unresponsive (a way to watch this from the shell is sketched after the list):
1. It drops out of and reappears in Cacti monitoring from time to time;
2. It still looks alive in rs.status();
3. Ping works, but ssh fails.
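
A minimal way to watch that memory growth from the mongo shell while the sync runs (the serverStatus mem fields are reported in MB):

// run against the syncing secondary
db.serverStatus().mem          // { resident: ..., virtual: ..., mapped: ..., ... }
db.serverStatus().extra_info   // page_faults climbing is a hint the working set no longer fits in RAM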

db.stats() from the primary (production cluster), for the jingoalmail database:

{
"db" : "jingoalmail",
"collections" : 7,
"objects" : 202592396,
"avgObjSize" : 65541.24188915758,
"dataSize" : NumberLong("13278157231140"),
"storageSize" : NumberLong("13283877803280"),
"numExtents" : 6407,
"indexes" : 7,
"indexSize" : 21428296624,
"fileSize" : NumberLong("13561110724608"),
"ok" : 1
}
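
These are the standard db.stats() fields, reported in bytes; converting the large counters gives roughly:

dataSize    13278157231140 bytes ≈ 12.1 TiB (~13.3 TB)
storageSize 13283877803280 bytes ≈ 12.1 TiB
indexSize      21428296624 bytes ≈ 20.0 GiB (~21.4 GB)
fileSize    13561110724608 bytes ≈ 12.3 TiB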

==>
About 13 TB of data, 200 million objects, and a 21 GB index size.
It seems we need more RAM in addition to the current 32 GB, right?
Would 64 GB be enough, or do we need more?

Comment by mengqy [ 15/May/12 ]

Hi Eliot, thanks for your quick reply and for the reminder about primary data health.

Our testing cluster is used as hybrid storage for JSON data and files. The problematic production cluster is used as file storage.
The primaries work well in both the testing and production clusters, with good data that is serving online users or our testing team.
We have 32 GB of RAM in each production cluster data node. Do we need more RAM to replicate a new data node for over 12 TB of data?


We just finished another test with 2.0.5 on about a hundred GB of data. The machine load grew steadily as before, and then the node crashed again with the following errors:

Tue May 15 10:14:44 [rsHealthPoll] DBClientCursor::init call() failed
Tue May 15 10:14:44 [rsHealthPoll] replSet info 192.168.0.64:27020 is down (or slow to respond): DBClientBase::findN: transport error: 192.168.0.64:27020 query:

{ replSetHeartbeat: "jingoal", v: 5, pv: 1, checkEmpty: false, from: "192.168.0.64:27022" }

Tue May 15 10:14:44 [rsHealthPoll] replSet member 192.168.0.64:27020 is now in state DOWN
Tue May 15 10:14:44 Invalid access at address: 0x7f047d91fde8

Tue May 15 10:14:45 Got signal: 7 (Bus error).

Tue May 15 10:14:46 Backtrace:
0x98b609 0x98bbe0 0x9a7600 0x760a42 0x7a6a98 0x7b2229 0x7b392b 0x758685 0x4c69b4 0x751287 0x753247 0x754b99 0x72fae5 0x731458 0x72c5f1 0x72c6a8 0x72cb30 0xa0dce9 0x9a379d 0xa92fb9
[0x98b609]
[0x98bbe0]
[0x9a7600]
[0x760a42]
[0x7a6a98]
[0x7b2229]
[0x7b392b]
[0x758685]
[0x4c69b4]
[0x751287]
[0x753247]
[0x754b99]
[0x72fae5]
[0x731458]
[0x72c5f1]
[0x72c6a8]
[0x72cb30]
[0xa0dce9]
[0x9a379d]
[0xa92fb9]

Logstream::get called in uninitialized state
pure virtual method called
pure virtual method called
pure virtual method called
pure virtual method called
pure virtual method called
pure virtual method called
Tue May 15 10:14:47 Tue May 15 10:14:47 pure virtual method called
terminate() called, printing stack:
Tue May 15 10:14:47 terminate() called, printing stack:
terminate() called, printing stack:
Tue May 15 10:14:47 terminate() called, printing stack:
Tue May 15 10:14:47 terminate() called, printing stack:
Tue May 15 10:14:47 terminate() called, printing stack:
Tue May 15 10:14:47 terminate() called, printing stack:
Tue May 15 10:14:46 Invalid access at address: 0x19f0000

Tue May 15 10:14:47 Got signal: 11 (Segmentation fault).

Tue May 15 10:14:47 [rsHealthPoll] replSet member 192.168.0.64:27020 is up
0x98b07d 0xa09d86 0xa09db3 0xa0a2cf 0x879aff 0x87a98a 0x83de25 0x841779 0x78ab37 0x78eeb6 0x999988 0x53cc37 0x9a379d 0xa92fb9
[0x98b07d]
[0xa09d86]
[0xa09db3]
[0xa0a2cf]
[0x879aff]
[0x87a98a]
[0x83de25]
[0x841779]
[0x78ab37]
[0x78eeb6]
[0x999988]
[0x53cc37]
[0x9a379d]
[0xa92fb9]
Tue May 15 10:14:47 Got signal: 6 (Aborted).

Comment by Eliot Horowitz (Inactive) [ 14/May/12 ]

What data is in the primary?
Is it known good?
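
One way to spot-check whether the primary's data is known good is a full validation from the mongo shell. A minimal sketch (the collection name "mails" is only an illustrative placeholder; validate takes a lock and is slow on collections this large):

// on the production primary
use jingoalmail
db.mails.validate(true)   // full validation; expect "valid" : true in the output
db.mails.stats()          // sanity-check object counts and sizes against expectations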

Generated at Thu Feb 08 03:10:00 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.