[SERVER-21422] one shard of mongo-cluster crashed after large backwards clock jump Created: 12/Nov/15  Updated: 17/Nov/15  Resolved: 17/Nov/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: ufo ofu Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 6.5


Attachments: Text File crash1.log    
Operating System: Linux
Participants:

 Description   

mongod version: 3.0.4
The shard has five members, two of which are arbiters. The primary suddenly crashed; each newly elected primary then crashed immediately in turn, until all three data-bearing members were down. After crashing, the three data members would not start again.



 Comments   
Comment by Eric Milkie [ 17/Nov/15 ]

Current behavior isn't ideal; SERVER-18557 will fix this issue.

Comment by Eric Milkie [ 17/Nov/15 ]

Yes, unfortunately the server needs a stable clock source for replication. It can tolerate small jumps in time (such as for daylight saving time adjustments), but cannot handle months-long jumps backward in time.
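
For illustration, here is a minimal standalone sketch (not MongoDB source; the jump size and write rate are hypothetical) of why small jumps are tolerable: while the wall clock lags the last optime's seconds field, every new optime reuses the stale seconds and consumes one slot of the 32-bit increment counter, so a node survives a backward jump only if the clock catches up before roughly 2^31 optimes are issued.

#include <cstdint>
#include <cstdio>

// Mimics the optime logic quoted in the next comment: a (secs, inc) pair
// where inc is bumped whenever the wall clock is at or behind the last
// optime's seconds value.
struct SimOpTime { uint32_t secs; uint32_t inc; };

SimOpTime next(SimOpTime last, uint32_t now) {
    if (now <= last.secs)
        return { last.secs, last.inc + 1 };  // clock stuck/behind: burn a counter slot
    return { now, 1 };                       // clock advanced: counter resets
}

int main() {
    // Hypothetical numbers: a one-hour backward jump at 1000 writes/sec burns
    // 3600 * 1000 = 3.6M counter slots before the clock catches up -- well
    // under the 2^31 limit. The ~62-day jump in this ticket would have to
    // absorb billions of optimes, so the counter overflows long before real
    // time recovers.
    SimOpTime t = { 1447285045 + 3600, 0 };  // last optime one hour in the future
    const uint32_t now = 1447285045;         // corrected wall clock
    for (int i = 0; i < 5; ++i) {
        t = next(t, now);
        std::printf("optime: (%u, %u)\n", t.secs, t.inc);
    }
    return 0;
}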

Comment by ufo ofu [ 17/Nov/15 ]

The cause has been found. Some time earlier, the operating system's clock had been set to '2016-01-02' and was later corrected. According to 'global_optime.cpp', the timestamps in oplog.rs were set to '1452628808' (a point in January 2016), and the increment counter then kept growing until mongod crashed:

namespace {
    mongo::mutex globalOptimeMutex("globalOptime");
    mongo::OpTime globalOpTime(0, 0);

    // Fires once the 32-bit increment counter's high bit is set, i.e. after
    // ~2^31 optimes have been issued without the seconds field advancing.
    bool skewed(const mongo::OpTime& val) {
        if (val.getInc() & 0x80000000) {
            mongo::warning() << "clock skew detected  prev: " << val.getSecs()
                             << " now: " << (unsigned) time(0) << std::endl;
            return true;
        }
        return false;
    }
}

namespace mongo {
    void setGlobalOptime(const OpTime& newTime) {
        mutex::scoped_lock lk(globalOptimeMutex);
        globalOpTime = newTime;
    }

    OpTime getLastSetOptime() {
        mutex::scoped_lock lk(globalOptimeMutex);
        return globalOpTime;
    }

    OpTime getNextGlobalOptime() {
        mutex::scoped_lock lk(globalOptimeMutex);

        const unsigned now = (unsigned) time(0);
        const unsigned globalSecs = globalOpTime.getSecs();
        if ( globalSecs == now ) {
            globalOpTime = OpTime(globalSecs, globalOpTime.getInc() + 1);
        }
        else if ( now < globalSecs ) {
            // Wall clock is behind the last optime: keep the stale seconds
            // and bump the counter instead, aborting if the counter overflows.
            globalOpTime = OpTime(globalSecs, globalOpTime.getInc() + 1);
            // separate function to keep out of the hot code path
            fassert(17449, !skewed(globalOpTime));
        }
        else {
            globalOpTime = OpTime(now, 1);
        }

        return globalOpTime;
    }
}
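
As a worked illustration of the skew check above (the 'prev' and 'now' values are taken from the log below; the starting counter value is contrived so the overflow is visible in one step): once the counter's high bit (0x80000000) is set, skewed() returns true and fassert(17449) aborts the process, matching the 'Fatal Assertion 17449' in the attached log.

#include <cstdint>
#include <cstdio>

int main() {
    uint32_t prevSecs = 1452628808;  // stale "future" seconds written by the old primary
    uint32_t now      = 1447285045;  // corrected wall clock (2015-11-12)
    uint32_t inc      = 0x7FFFFFFF;  // contrived: one increment below the skew bit

    // now < prevSecs, so the next optime keeps prevSecs and bumps inc,
    // exactly as in the 'now < globalSecs' branch of getNextGlobalOptime().
    ++inc;
    if (inc & 0x80000000) {
        // Same message as in the attached log; mongod follows it with
        // fassert(17449, ...) and aborts.
        std::printf("clock skew detected  prev: %u now: %u\n", prevSecs, now);
    }
    return 0;
}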

Comment by Ramon Fernandez Marina [ 12/Nov/15 ]

Can you please upload full logs for all affected nodes? Also, do you have time synchronization in place? What's the system time on each of the affected nodes?

Thanks,
Ramón.

Comment by ufo ofu [ 12/Nov/15 ]

log: W - [ReplicationExecutor] clock skew detected prev: 1452628808 now: 1447285045

Comment by ufo ofu [ 12/Nov/15 ]

2015-11-12T07:37:25.225+0800 I REPL     [ReplicationExecutor] Standing for election
2015-11-12T07:37:25.226+0800 I REPL     [ReplicationExecutor] replSet info electSelf
2015-11-12T07:37:25.227+0800 I REPL     [ReplicationExecutor] received vote: 1 votes from 118:
2015-11-12T07:37:25.227+0800 I REPL     [ReplicationExecutor] received vote: 1 votes from :
2015-11-12T07:37:25.227+0800 I REPL     [ReplicationExecutor] replSet election succeeded, assuming primary role
2015-11-12T07:37:25.227+0800 W -        [ReplicationExecutor] clock skew detected  prev: 1452628808 now: 1447285045
2015-11-12T07:37:25.227+0800 I -        [ReplicationExecutor] Fatal Assertion 17449
2015-11-12T07:37:25.227+0800 W NETWORK  [ReplExecNetThread-0] Failed to connect to :, reason: errno:111 Connection refused
2015-11-12T07:37:25.227+0800 W NETWORK  [ReplExecNetThread-3] Failed to connect to :, reason: errno:111 Connection refused
2015-11-12T07:37:25.240+0800 I CONTROL  [ReplicationExecutor] 
 0xf77369 0xf140f1 0xef7e41 0xa814b0 0xc6f593 0xc7d444 0xca48b4 0xc8353e 0xc82c81 0xc8494a 0xfc58e4 0x3b9a4079d1 0x3b99ce88fd
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B77369"},{"b":"400000","o":"B140F1"},{"b":"400000","o":"AF7E41"},{"b":"400000","o":"6814B0"},{"b":"400000","o":"86F593"},{"b":"400000","o":"87D444"},{"b":"400000","o":"8A48B4"},{"b":"400000","o":"88353E"},{"b":"400000","o":"882C81"},{"b":"400000","o":"88494A"},{"b":"400000","o":"BC58E4"},{"b":"3B9A400000","o":"79D1"},{"b":"3B99C00000","o":"E88FD"}],"processInfo":{ "mongodbVersion" : "3.0.4", "gitVersion" : "0481c958daeb2969800511e7475dc66986fa9ed5", "uname" : { "sysname" : "Linux", "release" : "2.6.32-220.el6.x86_64", "version" : "#1 SMP Tue Dec 6 19:48:22 GMT 2011", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "7702B4740E91E1BA6F701DB2A42553968AC09562" }, { "b" : "7FFF477DC000", "elfType" : 3, "buildId" : "71E48146FA2DCAC5D4D82A1EF860D535627AA1B9" }, { "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "B8DFF8E53D9F2B80C3C382E83EC17C828B536A39" }, { "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "40BEA6554E64FC0C3D5C7D0CD91362730515102F" }, { "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "FC4EFD7502ACB3B9D213D28272D15A165857AD5A" }, { "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "583411D8786F86A1D6B8741C502831E6122445A7" }, { "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "454F8FC6CC6502C6401E5F9E221564D80665D277" }, { "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "F07F2E7CF4BFB393CC9BBE8CDC6463652E14DB07" }, { "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "C9A87F6A29ED1D3CB18F539845A45FE3A9877FF1" }, { "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "246C3BAB0AB093AFD59D34C8CBF29E786DE4BE97" }, { "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "1425CB3B4C2F49C8101ED9B8F1D289053B4DFA77" }, { "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "6F8E59B70E469F3A924A268911FF8FD0C37E7460" }, { "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "54BA6B78A9220344E77463947215E42F0EABCC62" }, { "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "6797403AA5F8FAD8ADFF683478B45F528CE4FB0E" }, { "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "8CE28F280150E62296240E70ECAC64E4A57AB826" }, { "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "05733977F4E41652B86070B27A0CFC2C1EA7719D" }, { "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "5FA8E5038EC04A774AF72A9BB62DC86E1049C4D6" }, { "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "E3FA235F3BA3F776A01A18ECA737C9890F445923" }, { "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "AF374BAFB7F5B139A0B431D3F06D82014AFF3251" }, { "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "58B696478044E028A5970D48A4ED50E164B43B36" }, { "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "E6798A06BEE17CF102BBA44FD512FF8B805CEAF1" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf77369]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf140f1]
 mongod(_ZN5mongo13fassertFailedEi+0x61) [0xef7e41]
 mongod(_ZN5mongo19getNextGlobalOptimeEv+0x1A0) [0xa814b0]
 mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl35_performPostMemberStateUpdateActionENS1_27PostMemberStateUpdateActionE+0xA3) [0xc6f593]
 mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl25_onElectCmdRunnerCompleteEv+0x474) [0xc7d444]
 mongod(_ZN5mongo4repl19ScatterGatherRunner34_signalSufficientResponsesReceivedEPNS0_19ReplicationExecutorE+0x84) [0xca48b4]
 mongod(+0x88353E) [0xc8353e]
 mongod(+0x882C81) [0xc82c81]
 mongod(_ZN5mongo4repl19ReplicationExecutor3runEv+0x48A) [0xc8494a]
 mongod(+0xBC58E4) [0xfc58e4]
 libpthread.so.0(+0x79D1) [0x3b9a4079d1]
 libc.so.6(clone+0x6D) [0x3b99ce88fd]
-----  END BACKTRACE  -----
2015-11-12T07:37:25.240+0800 I -        [ReplicationExecutor] 
 
***aborting after fassert() failure
