[SERVER-6926] Secondary should take into account error status of node it chooses to sync from - not just ping time. Created: 04/Sep/12  Updated: 15/Feb/13  Resolved: 10/Sep/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.0.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Xuguang zhan Assignee: Eric Milkie
Resolution: Done Votes: 0
Labels: replication
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Three MongoDB config servers
One mongos
One shard created as a replica set with the following configuration:

> cfg = {
    _id : "webex11_shard1",
    members : [
        { _id : 0, host : "10.224.88.109:27018", tags : { "DC" : "SJ" },     priority : 2 },
        { _id : 1, host : "10.224.88.110:27018", tags : { "DC" : "SJ" },     priority : 1 },
        { _id : 2, host : "10.224.88.160:27018", tags : { "DC" : "TX" },     priority : 1 },
        { _id : 3, host : "10.224.88.161:27018", tags : { "DC" : "TX" },     priority : 1 },
        { _id : 4, host : "10.224.88.163:27018", tags : { "DC" : "LONDON" }, priority : 0 }
    ]
}
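
A minimal sketch of how a configuration like this is typically applied (assuming each member was started with --replSet webex11_shard1; the mongos address is not given in the ticket, so the sh.addShard step below is only illustrative):

> rs.initiate(cfg)        // run once, on the intended primary (10.224.88.109:27018)
> rs.status()             // verify all five members reach PRIMARY/SECONDARY state
// then, from a shell connected to the mongos:
mongos> sh.addShard("webex11_shard1/10.224.88.109:27018")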


Attachments: PNG File sycOptlog.PNG     PNG File sycdiskFailure.PNG    
Operating System: Linux
Participants:

 Description   

When I use multiple threads to insert, the disks on two secondary nodes fill up. Checking rs.status(), I am confused why 10.224.88.160 is not syncing the correct oplog from 10.224.88.109 (PRIMARY) or 10.224.88.110 (a healthy secondary); it seems to only keep in touch with the other two nodes whose disks are full.

For details, please see the two attachments.

EDIT: in this replica set, 10.224.88.160 is syncing from either 10.224.88.161 or 10.224.88.163; both have the same error (out of disk space) and have stopped syncing from the PRIMARY. This means that 10.224.88.160 is now also falling behind the primary even though it is not in error itself. The issue to be fixed here is that a node should check the error status of the node it is syncing from and switch to syncing from another node if that node is in an error state.

https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/k0XaKb0vH3s
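
As an illustration of the manual workaround available once the replSetSyncFrom command exists in 2.2 (a hedged sketch, not the server-side fix this ticket asks for; hosts are taken from the config above and field names follow 2.2-era output):

// on the lagging secondary (10.224.88.160), check which member it is pulling from
> rs.status()                                                    // look at the "syncingTo" field
// temporarily force it to sync from a healthy member instead
> db.adminCommand({ replSetSyncFrom: "10.224.88.109:27018" })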



 Comments   
Comment by Xuguang zhan [ 11/Sep/12 ]

Have you tested this case in version 2.2? Is there a wiki page with more detailed information?

Comment by Eric Milkie [ 10/Sep/12 ]

In the latest version of MongoDB (2.2), secondaries are more liberal about shutting down when replication problems occur. This behavior should avoid the issues reported here.

Comment by Gregor Macadam [ 07/Sep/12 ]

In 2.0.6 I can reproduce this, but in 2.2 mongod asserts when it runs out of disk space:

Fri Sep  7 13:20:24 [slaveTracking] update local.slaves query: { _id: ObjectId('5048b8392151e92ceb0f9cd2'), host: "10.224.70.215", ns: "local.oplog.rs" } update: { $set: { syncedTo: Timestamp 1347024023000|3579 } } nscanned:1 nupdated:1 keyUpdates:0 locks(micros) w:21801 126ms
Fri Sep  7 13:20:25 [FileAllocator] allocating new datafile ./data/test.5, filling with zeroes...
Fri Sep  7 13:20:25 [FileAllocator] FileAllocator: posix_fallocate failed: errno:28 No space left on device falling back
Fri Sep  7 13:20:25 [FileAllocator] error: failed to allocate new file: ./data/test.5 size: 2146435072 failure creating new datafile; lseek failed for fd 31 with errno: errno:2 No such file or directory.  will try again in 10 seconds
Fri Sep  7 13:20:25 [repl writer worker 1] ERROR: writer worker caught exception: Can't take a write lock while out of disk space on: { ts: Timestamp 1347024024000|6710, h: -2729287484557914270, op: "i", ns: "test.foo22", o: { _id: ObjectId('5049f4983702551a5e021bca'), d: "ejffjiewjfewifjioewfeifjewoivjewoivoiewnvioewnvoiewnvoiewnfoiwefewionewiovnewoivnweoivnweiovnewiovnewiovnewoivnweiovnewiovnweiovnewiovnewiovnweiovnweiovnwe", e: "ngewngeuwbguwenfwenckewmclekwvnewklvnewinvewk vewklcniwc wcepwocmwepovnwepiovniwenvkewncipewmcpoewmbpoewhgopjepfmwepnvpioewnvpowemgpowenpoewnpoevn", f: "ugwebjfewkjbewjnweiognweoivnweoivnewiovnweiovnewiovnweiocnieowncionvioewnvioewvnewioncioencienceiowncioewnciewncioewncion", g: "fhewfnewinewcnewnbewjcnewjkcbenwjkcnewjcnewjbewjkbgfneewnicenwiocenwoicnewiocneiocnewoicnewioncewicnewocineocinweioc", h: "fjewnfkewnckewncklewnkcnewklvnewklvnewklcnkleweklcnewklcnewivnweivneiwcewikcnewpiocewncnewicnewicnecoiewio", i: "uwebjencjkewncklewncklewncklewncklenwcklewnclkewncklewncklewklcnewklcnweicnwfiajgoejvopenvpewnvpiewnbienbirebnoirebneiro" } }
Fri Sep  7 13:20:25 [repl writer worker 1]   Fatal Assertion 16360
0xade6e1 0x802e03 0x64f77d 0x77d3dd 0x7c3659 0x7f6553ee67f1 0x7f655328cccd 
 ./mongodb-linux-x86_64-2.2.0/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 ./mongodb-linux-x86_64-2.2.0/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0x802e03]
 ./mongodb-linux-x86_64-2.2.0/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12d) [0x64f77d]
 ./mongodb-linux-x86_64-2.2.0/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x77d3dd]
 ./mongodb-linux-x86_64-2.2.0/bin/mongod() [0x7c3659]
 /lib64/libpthread.so.0(+0x77f1) [0x7f6553ee67f1]
 /lib64/libc.so.6(clone+0x6d) [0x7f655328cccd]
Fri Sep  7 13:20:25 [repl writer worker 1] 
 
***aborting after fassert() failure
 
 
Fri Sep  7 13:20:25 Got signal: 6 (Aborted).
 
Fri Sep  7 13:20:25 Backtrace:
0xade6e1 0x5582d9 0x7f65531d9900 0x7f65531d9885 0x7f65531db065 0x802e3e 0x64f77d 0x77d3dd 0x7c3659 0x7f6553ee67f1 0x7f655328cccd 
 ./mongodb-linux-x86_64-2.2.0/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 ./mongodb-linux-x86_64-2.2.0/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x5582d9]
 /lib64/libc.so.6(+0x32900) [0x7f65531d9900]
 /lib64/libc.so.6(gsignal+0x35) [0x7f65531d9885]
 /lib64/libc.so.6(abort+0x175) [0x7f65531db065]
 ./mongodb-linux-x86_64-2.2.0/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0x802e3e]
 ./mongodb-linux-x86_64-2.2.0/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12d) [0x64f77d]
 ./mongodb-linux-x86_64-2.2.0/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x77d3dd]
 ./mongodb-linux-x86_64-2.2.0/bin/mongod() [0x7c3659]
 /lib64/libpthread.so.0(+0x77f1) [0x7f6553ee67f1]
 /lib64/libc.so.6(clone+0x6d) [0x7f655328cccd]
 

and so the secondary will choose another node to sync from.
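
This can be confirmed from any surviving member (a hedged sketch; the member index is just an example and the field names follow 2.2-era rs.status() output):

> rs.status().members[3].health     // drops to 0 once the fasserted node becomes unreachable
> rs.status().members[3].stateStr   // reported as "(not reachable/healthy)"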
