[SERVER-7021] A secondary node in one shard can't be restarted Created: 12/Sep/12  Updated: 08/Mar/13  Resolved: 02/Jan/13

Status: Closed
Project: Core Server
Component/s: Stability, Storage
Affects Version/s: 2.0.6, 2.2.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Xuguang zhan Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: replicaset
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:


Attachments: PNG File Pri_data.PNG     PNG File restart-failure-2.0.6.PNG    
Issue Links:
Related
Operating System: Linux
Participants:

 Description   

In this shard, one secondary node can't be restarted; check the exception log below.

I have tested it on 2.2.0 and 2.0.6, and it is not workable on either.
[log]

Wed Sep 12 07:34:09 [initandlisten] MongoDB starting : pid=31578 port=27018 dbpath=/root/mongodata/db 64-bit host=wdsq1wco0010
Wed Sep 12 07:34:09 [initandlisten] db version v2.2.0, pdfile version 4.5
Wed Sep 12 07:34:09 [initandlisten] git version: f5e83eae9cfbec7fb7a071321928f00d1b0c5207
Wed Sep 12 07:34:09 [initandlisten] build info: Linux ip-10-2-29-40 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_49
Wed Sep 12 07:34:09 [initandlisten] options: { dbpath: "/root/mongodata/db", journalCommitInterval: 50, logpath: "/root/mongo_log", maxConns: 1000, oplogSize: 8000, replSet: "webex11_shard1", rest: true, shardsvr: true, syncdelay: 60.0 }
Wed Sep 12 07:34:09 [initandlisten] journal dir=/root/mongodata/db/journal
Wed Sep 12 07:34:09 [initandlisten] recover begin
Wed Sep 12 07:34:09 [initandlisten] recover lsn: 1127815404
Wed Sep 12 07:34:09 [initandlisten] recover /root/mongodata/db/journal/j._0
Wed Sep 12 07:34:09 [initandlisten] recover skipping application of section seq:0 < lsn:1127815404
Wed Sep 12 07:34:09 [initandlisten] recover skipping application of section seq:506986899 < lsn:1127815404
Wed Sep 12 07:34:09 [initandlisten] recover skipping application of section seq:518847260 < lsn:1127815404
Wed Sep 12 07:34:09 [initandlisten] recover skipping application of section seq:525615615 < lsn:1127815404
Wed Sep 12 07:34:09 [initandlisten] recover skipping application of section seq:527173035 < lsn:1127815404
Wed Sep 12 07:34:09 [initandlisten] recover skipping application of section seq:587074688 < lsn:1127815404
Wed Sep 12 07:34:09 [initandlisten] recover skipping application of section seq:588212486 < lsn:1127815404
Wed Sep 12 07:34:09 [initandlisten] recover skipping application of section seq:607192888 < lsn:1127815404
Wed Sep 12 07:34:09 [initandlisten] recover skipping application of section seq:607252714 < lsn:1127815404
Wed Sep 12 07:34:09 [initandlisten] recover skipping application of section more...
Wed Sep 12 07:34:09 [initandlisten] recover /root/mongodata/db/journal/j._1
Wed Sep 12 07:34:09 [initandlisten] dbexception during recovery: 15923 couldn't get file length when opening mapping /root/mongodata/db/performance.ns boost::filesystem::file_size: No such file or directory: "/root/mongodata/db/performance.ns"
Wed Sep 12 07:34:09 [initandlisten] exception in initAndListen: 15923 couldn't get file length when opening mapping /root/mongodata/db/performance.ns boost::filesystem::file_size: No such file or directory: "/root/mongodata/db/performance.ns", terminating
Wed Sep 12 07:34:09 dbexit:
Wed Sep 12 07:34:09 [initandlisten] shutdown: going to close listening sockets...
Wed Sep 12 07:34:09 [initandlisten] shutdown: going to flush diaglog...
Wed Sep 12 07:34:09 [initandlisten] shutdown: going to close sockets...
Wed Sep 12 07:34:09 [initandlisten] shutdown: waiting for fs preallocator...
Wed Sep 12 07:34:09 [initandlisten] shutdown: lock for final commit...
Wed Sep 12 07:34:09 [initandlisten] shutdown: final commit...
Wed Sep 12 07:34:09 [initandlisten] shutdown: closing all files...
Wed Sep 12 07:34:09 [initandlisten] closeAllFiles() finished
Wed Sep 12 07:34:09 [initandlisten] shutdown: removing fs lock...
Wed Sep 12 07:34:09 dbexit: really exiting now

[analysis] We found that the secondary has lost performance.ns. Checking the primary node, the collection's files do not exist there either, yet the primary can still be restarted.

Check attachment Pri_data.PNG.
Check attachment restart-failure-2.0.6.PNG (test on v2.0.6).
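
A quick way to reproduce the check described in the analysis (dbpath taken from the log above; this is only a sketch, not part of the original report):

# List the "performance" datafiles on each member and compare.
# Per the log, the affected secondary is missing performance.ns.
ls -ls /root/mongodata/db/performance.*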

Info about the set
PRIMARY> rs.status()
{
"set" : "webex11_shard1",
"date" : ISODate("2012-09-12T09:06:12Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "10.224.88.109:27018",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"optime" :

{ "t" : 1347417601000, "i" : 2184 }

,
"optimeDate" : ISODate("2012-09-12T02:40:01Z"),
"self" : true
},
{
"_id" : 1,
"name" : "10.224.88.110:27018",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 23238,
"optime" :

{ "t" : 1347417601000, "i" : 2184 }

,
"optimeDate" : ISODate("2012-09-12T02:40:01Z"),
"lastHeartbeat" : ISODate("2012-09-12T09:06:11Z"),
"pingMs" : 0
},
{
"_id" : 2,
"name" : "10.224.88.160:27018",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 23228,
"optime" :

{ "t" : 1347343168000, "i" : 1 }

,
"optimeDate" : ISODate("2012-09-11T05:59:28Z"),
"lastHeartbeat" : ISODate("2012-09-12T09:06:11Z"),
"pingMs" : 0,
"errmsg" : "syncThread: 14031 Can't take a write lock while out of disk space"
},
{
"_id" : 3,
"name" : "10.224.88.161:27018",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 23218,
"optime" :

{ "t" : 1347343168000, "i" : 1 }

,
"optimeDate" : ISODate("2012-09-11T05:59:28Z"),
"lastHeartbeat" : ISODate("2012-09-12T09:06:11Z"),
"pingMs" : 0,
"errmsg" : "syncThread: 14031 Can't take a write lock while out of disk space"
},
{
"_id" : 4,
"name" : "10.224.88.163:27018",
"health" : 0,
"state" : 8,
"stateStr" : "(not reachable/healthy)",
"uptime" : 0,
"optime" :

{ "t" : 0, "i" : 0 }

,
"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
"lastHeartbeat" : ISODate("1970-01-01T00:00:00Z"),
"pingMs" : 0,
"errmsg" : "socket exception"
}
],
"ok" : 1
}



 Comments   
Comment by Xuguang zhan [ 14/Sep/12 ]

When I used repair, it caused the server to fail to restart:

Fri Sep 14 02:28:18 [initandlisten] MongoDB starting : pid=2499 port=27018 dbpath=/root/mongodata/db 64-bit host=wdsq1wco0010
Fri Sep 14 02:28:18 [initandlisten] db version v2.0.6, pdfile version 4.5
Fri Sep 14 02:28:18 [initandlisten] git version: e1c0cbc25863f6356aa4e31375add7bb49fb05bc
Fri Sep 14 02:28:18 [initandlisten] build info: Linux ip-10-110-9-236 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_41
Fri Sep 14 02:28:18 [initandlisten] options: { dbpath: "/root/mongodata/db", journalCommitInterval: 50, logpath: "/root/mongo_log", maxConns: 1000, oplogSize: 8000, repair: true, replSet: "webex11_shard1", rest: true, shardsvr: true, syncdelay: 10.0 }
Fri Sep 14 02:28:18 [initandlisten] journal dir=/root/mongodata/db/journal
Fri Sep 14 02:28:18 [initandlisten] recover : no journal files present, no recovery needed
Fri Sep 14 02:28:18 [initandlisten] ****
Fri Sep 14 02:28:18 [initandlisten] ****
Fri Sep 14 02:28:18 [initandlisten] need to upgrade database performance with pdfile version 4.5, new version: 4.5
Fri Sep 14 02:28:18 [initandlisten] starting upgrade
Fri Sep 14 02:28:18 [initandlisten] performance repairDatabase performance
Fri Sep 14 02:28:18 [initandlisten] removeJournalFiles
Fri Sep 14 02:28:18 [FileAllocator] allocating new datafile /root/mongodata/db/$tmp_repairDatabase_0/performance.ns, filling with zeroes...
Fri Sep 14 02:28:18 [FileAllocator] creating directory /root/mongodata/db/$tmp_repairDatabase_0/_tmp
Fri Sep 14 02:28:18 [FileAllocator] done allocating datafile /root/mongodata/db/$tmp_repairDatabase_0/performance.ns, size: 16MB, took 0.072 secs
Fri Sep 14 02:28:18 [FileAllocator] allocating new datafile /root/mongodata/db/$tmp_repairDatabase_0/performance.0, filling with zeroes...
Fri Sep 14 02:28:19 [FileAllocator] done allocating datafile /root/mongodata/db/$tmp_repairDatabase_0/performance.0, size: 64MB, took 0.188 secs
Fri Sep 14 02:28:19 [initandlisten] build index performance.small { _id: 1 }
Fri Sep 14 02:28:19 [FileAllocator] allocating new datafile /root/mongodata/db/$tmp_repairDatabase_0/performance.1, filling with zeroes...
Fri Sep 14 02:28:19 [initandlisten] build index done 0 records 0 secs
Fri Sep 14 02:28:19 [initandlisten] performance.system.indexes Assertion failure isOk() db/pdfile.h 300
0x57a8a6 0x5853eb 0x8a728c 0x8a7636 0x8cda4e 0x94c610 0x8cc55a 0x8cc973 0x8cce7c 0x8d9d1e 0x8dd66e 0x8df083 0x8e0357 0x943295 0x889107 0x88dcd9 0x88f5e1 0x5e396d 0x5b731c 0x85081e
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZN5mongo12sayDbContextEPKc+0x96) [0x57a8a6]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0xfb) [0x5853eb]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZN5mongo11DataFileMgr7findAllEPKcRKNS_7DiskLocE+0x5fc) [0x8a728c]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZN5mongo13findTableScanEPKcRKNS_7BSONObjERKNS_7DiskLocE+0xe6) [0x8a7636]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZNK5mongo9QueryPlan9newCursorERKNS_7DiskLocEi+0x29e) [0x8cda4e]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZN5mongo11UserQueryOp5_initEv+0x780) [0x94c610]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZN5mongo7QueryOp4initEv+0x11a) [0x8cc55a]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZN5mongo12QueryPlanSet6Runner6initOpERNS_7QueryOpE+0x23) [0x8cc973]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZN5mongo12QueryPlanSet6Runner4initEv+0x2ec) [0x8cce7c]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZN5mongo12QueryPlanSet6Runner22runUntilFirstCompletesEv+0x1e) [0x8d9d1e]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZN5mongo12QueryPlanSet5runOpERNS_7QueryOpE+0x11e) [0x8dd66e]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZN5mongo16MultiPlanScanner9runOpOnceERNS_7QueryOpE+0x523) [0x8df083]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZN5mongo16MultiPlanScanner5runOpERNS_7QueryOpE+0x17) [0x8e0357]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0xfa5) [0x943295]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod [0x889107]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x559) [0x88dcd9]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZN5mongo14DBDirectClient4callERNS_7MessageES2_bPSs+0x81) [0x88f5e1]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZN5mongo14DBClientCursor4initEv+0xad) [0x5e396d]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZN5mongo12DBClientBase5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0x3ac) [0x5b731c]
/root/mongodb-linux-x86_64-2.0.6/bin/mongod(_ZN5mongo6Cloner4copyEPKcS2_bbbbbbNS_5QueryE+0x6de) [0x85081e]
Fri Sep 14 02:28:19 [initandlisten] assertion 0 assertion db/pdfile.h:300 ns:performance.system.indexes query:{ name: { $ne: "_id_" } }
Fri Sep 14 02:28:19 [initandlisten] exception in initAndListen: 13106 nextSafe(): { $err: "assertion db/pdfile.h:300" }, terminating
Fri Sep 14 02:28:19 dbexit:
Fri Sep 14 02:28:19 [initandlisten] shutdown: going to close listening sockets...
Fri Sep 14 02:28:19 [initandlisten] shutdown: going to flush diaglog...
Fri Sep 14 02:28:19 [initandlisten] shutdown: going to close sockets...
Fri Sep 14 02:28:19 [initandlisten] shutdown: waiting for fs preallocator...
Fri Sep 14 02:28:19 [FileAllocator] done allocating datafile /root/mongodata/db/$tmp_repairDatabase_0/performance.1, size: 128MB, took 0.388 secs
Fri Sep 14 02:28:19 [initandlisten] shutdown: lock for final commit...
Fri Sep 14 02:28:19 [initandlisten] shutdown: final commit...
Fri Sep 14 02:28:19 [initandlisten] shutdown: closing all files...
Fri Sep 14 02:28:19 [initandlisten] closeAllFiles() finished
Fri Sep 14 02:28:19 [initandlisten] journalCleanup...
Fri Sep 14 02:28:19 [initandlisten] removeJournalFiles
Fri Sep 14 02:28:19 [initandlisten] shutdown: removing fs lock...
Fri Sep 14 02:28:19 dbexit: really exiting now

Comment by Kevin Matulef [ 13/Sep/12 ]

The node didn't start before, but now it does? Can you try restarting it with the --repair option?
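
For reference, a minimal sketch of that repair invocation, assuming the dbpath from the logs in this ticket (not a command taken from the original thread). Note that --repair rewrites the datafiles and needs enough free disk to hold a copy of the database:

# stop the mongod first, then run the repair against the same dbpath
mongod --dbpath /root/mongodata/db --repair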

Comment by Xuguang zhan [ 13/Sep/12 ]

Yes, the disk had filled up.

Check the new error on the restarted node:
{
"_id" : 4,
"name" : "10.224.88.163:27018",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 129,
"optime" :

{ "t" : 1346658635000, "i" : 1403 }

,
"optimeDate" : ISODate("2012-09-03T07:50:35Z"),
"lastHeartbeat" : ISODate("2012-09-13T05:38:16Z"),
"pingMs" : 0,
"errmsg" : "syncTail: 13440 bad offset:0 accessing file: /root/mongodata/db/performance.0 - consider repairing database, syncing: { ts: Timestamp 1346658635000|1404, h: -6209838952630485647, op: \"i\", ns: \"performance.small\", o:

{ _id: ObjectId('504460ecc9636070c52b11" }

],
"ok" : 1
}

Any ideas?

Comment by Kevin Matulef [ 13/Sep/12 ]

Sorry, my screen cut off the development environment stuff (usually people just post the OS version + MongoDB version there). I cut and pasted the rs.status() info into the description.

A couple of your other secondaries have error messages "Can't take a write lock while out of disk space." Are these machines simply out of space?
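
A quick way to verify, assuming the dbpath from the logs (a suggested check, not part of the original exchange):

# free space on the filesystem holding the dbpath
df -h /root/mongodata/db
# space used by the datafiles and journal themselves
du -sh /root/mongodata/db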

Comment by Xuguang zhan [ 13/Sep/12 ]

Please check my environment; it shows that we have other secondary nodes as well. I also checked another bug that you mentioned was fixed in 2.1.2, so I tried that version, but it is still not working. Please help to check.

Same question on this: why did the .ns file disappear on the PRI node and on the bad secondary node? I checked another secondary node, and the files still exist there:
16404 -rw------- 1 root root 16777216 Aug 28 08:07 performance.ns
65604 -rw------- 1 root root 67108864 Aug 28 08:07 performance.0
131204 -rw------- 1 root root 134217728 Aug 28 08:07 performance.1
262404 -rw------- 1 root root 268435456 Sep 7 01:52 performance.2

Copy:
65604 -rw------- 1 root root 67108864 Sep 13 05:19 performance.0
131204 -rw------- 1 root root 134217728 Sep 13 05:19 performance.1
262404 -rw------- 1 root root 268435456 Sep 13 05:19 performance.2
16404 -rw------- 1 root root 16777216 Sep 13 05:19 performance.ns

Then I copied the data to the secondary node, and it could be restarted. But we can't get the performance data on the PRI node:
16404 -rw------- 1 root root 16777216 Aug 20 08:51 admin.ns
65604 -rw------- 1 root root 67108864 Aug 20 08:51 admin.0
131204 -rw------- 1 root root 134217728 Aug 20 08:51 admin.1
262404 -rw------- 1 root root 268435456 Sep 3 07:54 performance.2
131204 -rw------- 1 root root 134217728 Sep 3 07:54 performance.1

How can this be explained? Maybe data was lost? I think we should check.
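
One way to confirm whether the "performance" database is still visible on the primary (hostname and port taken from rs.status() above; this is only a suggested check, not from the original thread):

# connect to the primary and print stats for the "performance" database
mongo 10.224.88.109:27018/performance --eval "printjson(db.stats())"
# or list all databases the primary knows about
mongo 10.224.88.109:27018/admin --eval "printjson(db.adminCommand({ listDatabases: 1 }))"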

Comment by Kevin Matulef [ 13/Sep/12 ]

Did this machine experience a hard shutdown? It looks like it's failing when it tries to replay the journal files.

I presume the "performance" database is no longer visible on the primary? Odd that the .ns file should just disappear. Did any events lead up to this?

Do you have any other secondary nodes in this set? The safest way to recover here would be to copy the datafiles from a working secondary.
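
A rough outline of that recovery path, assuming the hostnames, port, and replica set options seen elsewhere in this ticket (a sketch only; adjust the source host and options to the real deployment):

# 1. make sure no mongod is running against the broken secondary's dbpath
# 2. copy the datafiles over from a healthy secondary, ideally while that
#    secondary is stopped or fsync-locked
rsync -av 10.224.88.110:/root/mongodata/db/ /root/mongodata/db/
# 3. restart the node with its original options
mongod --dbpath /root/mongodata/db --port 27018 --replSet webex11_shard1 --shardsvr --rest --oplogSize 8000 --logpath /root/mongo_log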
