[SERVER-21244] Fatal Assertion 18750 Created: 02/Nov/15  Updated: 23/Nov/15  Resolved: 23/Nov/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: rujun1 Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

shard with 3 member replica set, 40 cpu, 128g

Participants:

 Description   

***** SERVER RESTARTED *****
2015-11-02T02:09:00.191+0800 W STORAGE  [initandlisten] Detected configuration for non-active storage engine mmapv1 when current storage engine is wiredTiger
2015-11-02T02:09:00.207+0800 W -        [initandlisten] Detected unclean shutdown - /mongo/data/mongod.lock is not empty.
2015-11-02T02:09:00.207+0800 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2015-11-02T02:09:00.207+0800 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=48G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=5,log_size=2GB),statistics_log=(wait=0),
2015-11-02T02:09:02.448+0800 I STORAGE  [initandlisten] Starting WiredTigerRecordStoreThread local.oplog.rs
2015-11-02T02:09:03.523+0800 I CONTROL  [initandlisten] MongoDB starting : pid=190762 port=21111 dbpath=/mongo/data 64-bit host=102
2015-11-02T02:09:03.523+0800 I CONTROL  [initandlisten] ** WARNING: You are running this process as the root user, which is not recommended.
2015-11-02T02:09:03.523+0800 I CONTROL  [initandlisten] 
2015-11-02T02:09:03.525+0800 I CONTROL  [initandlisten] db version v3.0.4
2015-11-02T02:09:03.525+0800 I CONTROL  [initandlisten] git version: 0481c958daeb2969800511e7475dc66986fa9ed5
2015-11-02T02:09:03.525+0800 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 0.9.8j-fips 07 Jan 2009
2015-11-02T02:09:03.525+0800 I CONTROL  [initandlisten] build info: Linux ip-10-156-16-176 3.0.13-0.27-ec2 #1 SMP Wed Feb 15 13:33:49 UTC 2012 (d73692b) x86_64 BOOST_LIB_VERSION=1_49
2015-11-02T02:09:03.525+0800 I CONTROL  [initandlisten] allocator: tcmalloc
2015-11-02T02:09:03.525+0800 I CONTROL  [initandlisten] options: { config: "/mongo/conf/mongodb_rep_1.conf", cpu: false, fastsync: false, net: { bindIp: "0.0.0.0", http: { RESTInterfaceEnabled: false }, maxIncomingConnections: 800, port: 21111, wireObjectCheck: false }, notablescan: false, operationProfiling: { mode: "off", slowOpThresholdMs: 100 }, processManagement: { fork: true, pidFilePath: "/mongo/mongo.pid" }, replication: { oplogSizeMB: 8192, replSet: "rep_1/102.site:21111,117.site:21111,118.site:21111" }, security: { authorization: "enabled", javascriptEnabled: true, keyFile: "/mongo/bin/key" }, sharding: { clusterRole: "shardsvr" }, storage: { dbPath: "/mongo/data", directoryPerDB: true, engine: "wiredTiger", journal: { enabled: true }, mmapv1: { journal: { commitIntervalMs: 300 }, nsSize: 64, preallocDataFiles: true, quota: { enforced: false }, smallFiles: false }, syncPeriodSecs: 5.0, wiredTiger: { collectionConfig: { blockCompressor: "snappy" }, engineConfig: { cacheSizeGB: 48, journalCompressor: "snappy" }, indexConfig: { prefixCompression: true } } }, systemLog: { destination: "file", logAppend: true, path: "/mongo/logs/mongostatus.log", verbosity: 0 } }
2015-11-02T02:09:03.932+0800 I NETWORK  [initandlisten] waiting for connections on port 21111
2015-11-02T02:09:03.974+0800 I REPL     [ReplicationExecutor] New replica set config in use: { _id: "rep_1", version: 1, members: [ { _id: 0, host: "102:21111", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "117.site:21111", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "118.site:21111", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatTimeoutSecs: 10, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 } } }
2015-11-02T02:09:03.974+0800 I REPL     [ReplicationExecutor] This node is 102:21111 in the config
2015-11-02T02:09:03.974+0800 I REPL     [ReplicationExecutor] transition to STARTUP2
2015-11-02T02:09:03.974+0800 I REPL     [ReplicationExecutor] Starting replication applier threads
2015-11-02T02:09:03.975+0800 I REPL     [ReplicationExecutor] transition to RECOVERING
2015-11-02T02:09:04.005+0800 I REPL     [ReplicationExecutor] Member 117.site:21111 is now in state PRIMARY
2015-11-02T02:09:04.006+0800 I REPL     [ReplicationExecutor] Member 118.site:21111 is now in state SECONDARY
2015-11-02T02:09:04.974+0800 I NETWORK  [initandlisten] connection accepted from1.1.1.118:49271 #1 (1 connection now open)
2015-11-02T02:09:04.977+0800 I NETWORK  [initandlisten] connection accepted from1.1.1.117:57571 #2 (2 connections now open)
2015-11-02T02:09:04.999+0800 I ACCESS   [conn2] Successfully authenticated as principal __system on local
2015-11-02T02:09:04.999+0800 I ACCESS   [conn1] Successfully authenticated as principal __system on local
2015-11-02T02:09:06.976+0800 I REPL     [ReplicationExecutor] syncing from: 118.site:21111
2015-11-02T02:09:06.998+0800 I REPL     [SyncSourceFeedback] replset setting syncSourceFeedback to 118.site:21111
2015-11-02T02:09:06.999+0800 I REPL     [rsBackgroundSync] replSet our last op time fetched: Nov  2 02:05:20:d
2015-11-02T02:09:06.999+0800 I REPL     [rsBackgroundSync] replset source's GTE: Nov  2 02:05:27:1
2015-11-02T02:09:06.999+0800 F REPL     [rsBackgroundSync] replSet need to rollback, but in inconsistent state
2015-11-02T02:09:06.999+0800 I REPL     [rsBackgroundSync] minvalid: 56365467:3e7 our last optime: 56365460:d
2015-11-02T02:09:06.999+0800 I -        [rsBackgroundSync] Fatal Assertion 18750
2015-11-02T02:09:06.999+0800 I -        [rsBackgroundSync] 
 
***aborting after fassert() failure



 Comments   
Comment by Ramon Fernandez Marina [ 23/Nov/15 ]

rujun1, unfortunately this log only shows the same symptoms: this node is no longer able to replicate so it needs to be re-synced from a healthy node. Also I don't see any evidence of a bug in the server so I'm going to close this ticket.

Please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag, where your question will reach a larger audience. A question like this involving more discussion would be best posted on the mongodb-user group. See also our Technical Support page for additional support resources.

Regards,
Ramón.

Comment by rujun1 [ 03/Nov/15 ]

and i don't know how it happen:
write: 100/s
read: 1-10/s

when i upload the attachment, server report "Cannot attach file mongostatus_1.log: Unknown server error (403)."
so, I can only take a small amount of log, in the following:

2015-11-02T02:05:21.131+0800 I NETWORK  [LockPinger] DBClientCursor::init call() failed
2015-11-02T02:05:21.131+0800 I NETWORK  [conn2618] end connection 1.1.1.102:57913 (16 connections now open)
2015-11-02T02:05:21.131+0800 I NETWORK  [conn18850] end connection 1.1.1.117:57356 (5 connections now open)
2015-11-02T02:05:21.131+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.118:59906 #18851 (5 connections now open)
2015-11-02T02:05:21.614+0800 I ACCESS   [conn18851] Successfully authenticated as principal __system on local
2015-11-02T02:05:21.950+0800 I NETWORK  [LockPinger] Socket say send() errno:9 Bad file descriptor 1.1.1.117:21112
2015-11-02T02:05:22.131+0800 I QUERY    [conn18798] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1446400659000|1 } } cursorid:22604405916 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:447 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 1458660 } }, Database: { acquireCount: { r: 2 } }, oplog: { acquireCount: { r: 2 } } } 1960ms
2015-11-02T02:05:22.142+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.118:59908 #18852 (6 connections now open)
2015-11-02T02:05:22.168+0800 I ACCESS   [conn18852] Successfully authenticated as principal __system on local
2015-11-02T02:05:22.881+0800 I WRITE    [conn38] insert aa_aa.bb_bb query: { _id: ObjectId('5636542d94cb5b93517a7fcc'), ddddd: "20151102020429", eeeee: 0, fffff: "cca0f8ce80c211e58000845b12436558061281", ccccc: 20257, dflag: false, lpr: { bright: 50, color: 4, coords: { left: 0.052, top: 0.111, right: 0.078, bottom: 0.093 }, direc: 0, ggggg: 90, hhhhh: 9, iiiii: "鐨?VDXYI", jjjjj: 7 }, src: 2, kkkkk: 20151102020304, lllll: "", mmmmm: 364925504993800 } ninserted:0 keyUpdates:0 writeConflicts:0 exception: Not primary while writing to aa_aa.bb_bb code:10107 numYields:0 locks:{ Global: { acquireCount: { r: 22, w: 22 }, acquireWaitCount: { w: 1 }, timeAcquiringMicros: { w: 2884526 } }, Database: { acquireCount: { w: 22 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 20 } } } 2134ms
2015-11-02T02:05:22.885+0800 I NETWORK  [conn18798] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [1.1.1.118:59527] 
2015-11-02T02:05:22.885+0800 I NETWORK  [LockPinger] Socket say send() errno:9 Bad file descriptor 1.1.1.118:21112
2015-11-02T02:05:22.885+0800 I NETWORK  [LockPinger] scoped connection to 102.site:21112,117.site:21112,118.site:21112 not being returned to the pool
2015-11-02T02:05:22.885+0800 W SHARDING [LockPinger] distributed lock pinger '102.site:21112,117.site:21112,118.site:21112/102:21111:1446175720:2140191113' detected an exception while pinging. :: caused by :: SyncClusterConnection::update prepare failed:  102.site:21112 (1.1.1.102) failed:10276 DBClientBase::findN: transport error: 102.site:21112 ns: admin.$cmd query: { getlasterror: 1, fsync: 1 } 117.site:21112 (1.1.1.117) failed:9001 socket exception [SEND_ERROR] server [1.1.1.117:21112]  118.site:21112 (1.1.1.118) failed:9001 socket exception [SEND_ERROR] server [1.1.1.118:21112] 
2015-11-02T02:05:22.888+0800 I COMMAND  [conn38] command aa_aa.$cmd command: insert { insert: "bb_bb", documents: 1000, writeConcern: {}, ordered: false, metadata: { shardName: "rep_1", shardVersion: [ Timestamp 1000|1323, ObjectId('5632e3fc070f04e17014e1af') ], session: 0 } } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:91128 locks:{ Global: { acquireCount: { r: 1001, w: 1001 }, acquireWaitCount: { w: 1 }, timeAcquiringMicros: { w: 2884526 } }, Database: { acquireCount: { w: 1001 } }, Collection: { acquireCount: { w: 981 } }, oplog: { acquireCount: { w: 20 } } } 61451ms
2015-11-02T02:05:22.888+0800 I NETWORK  [conn38] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [1.1.1.102:55336] 
2015-11-02T02:05:23.623+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.118:59912 #18853 (5 connections now open)
2015-11-02T02:05:23.641+0800 I ACCESS   [conn18853] Successfully authenticated as principal __system on local
2015-11-02T02:05:24.143+0800 I REPL     [ReplicationExecutor] syncing from: 117.site:21111
2015-11-02T02:05:24.216+0800 I REPL     [SyncSourceFeedback] replset setting syncSourceFeedback to 117.site:21111
2015-11-02T02:05:24.216+0800 I REPL     [rsBackgroundSync] replSet our last op time fetched: Nov  2 02:05:20:d
2015-11-02T02:05:24.216+0800 I REPL     [rsBackgroundSync] replset source's GTE: Nov  2 02:05:27:1
2015-11-02T02:05:24.216+0800 I REPL     [rsBackgroundSync] beginning rollback
2015-11-02T02:05:24.216+0800 I REPL     [rsBackgroundSync] rollback 0
2015-11-02T02:05:24.216+0800 I REPL     [ReplicationExecutor] transition to ROLLBACK
2015-11-02T02:05:24.216+0800 I REPL     [rsBackgroundSync] rollback 1
2015-11-02T02:05:24.217+0800 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
2015-11-02T02:05:24.217+0800 I REPL     [rsBackgroundSync] replSet info rollback our last optime:   Nov  2 02:05:20:d
2015-11-02T02:05:24.217+0800 I REPL     [rsBackgroundSync] replSet info rollback their last optime: Nov  2 02:05:27:3e7
2015-11-02T02:05:24.217+0800 I REPL     [rsBackgroundSync] replSet info rollback diff in end of log times: -7 seconds
2015-11-02T02:05:25.594+0800 I REPL     [rsBackgroundSync] replSet rollback found ggggging events at Nov  2 02:04:21:1
2015-11-02T02:05:25.594+0800 I REPL     [rsBackgroundSync] replSet rollback findcommonpoint scanned : 1019
2015-11-02T02:05:25.594+0800 I REPL     [rsBackgroundSync] replSet rollback 3 fixup
2015-11-02T02:05:25.598+0800 I REPL     [rsBackgroundSync] rollback 3.5
2015-11-02T02:05:25.598+0800 I REPL     [rsBackgroundSync] rollback 4 n:19
2015-11-02T02:05:25.598+0800 I REPL     [rsBackgroundSync] replSet minvalid=Nov  2 02:05:27 56365467:3e7
2015-11-02T02:05:30.569+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.102:52780 #18854 (6 connections now open)
2015-11-02T02:05:30.587+0800 I ACCESS   [conn18854] Successfully authenticated as principal __system on local
2015-11-02T02:05:31.133+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.117:57369 #18855 (7 connections now open)
2015-11-02T02:05:31.156+0800 I ACCESS   [conn18855] Successfully authenticated as principal __system on local
2015-11-02T02:05:43.139+0800 I NETWORK  [conn18848] end connection 1.1.1.117:57351 (6 connections now open)
2015-11-02T02:05:43.139+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.117:57375 #18856 (8 connections now open)
2015-11-02T02:05:43.954+0800 I ACCESS   [conn18856] Successfully authenticated as principal __system on local
2015-11-02T02:05:45.136+0800 I NETWORK  [conn18849] end connection 1.1.1.118:59893 (6 connections now open)
2015-11-02T02:05:45.136+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.118:59924 #18857 (7 connections now open)
2015-11-02T02:05:45.161+0800 I ACCESS   [conn18857] Successfully authenticated as principal __system on local
2015-11-02T02:05:52.249+0800 I NETWORK  [conn18852] end connection 1.1.1.118:59908 (6 connections now open)
2015-11-02T02:05:52.885+0800 I NETWORK  [LockPinger] SyncClusterConnection connecting to [102.site:21112]
2015-11-02T02:05:52.886+0800 I NETWORK  [LockPinger] SyncClusterConnection connecting to [117.site:21112]
2015-11-02T02:05:52.886+0800 I NETWORK  [LockPinger] SyncClusterConnection connecting to [118.site:21112]
2015-11-02T02:05:58.751+0800 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:50732 #18858 (7 connections now open)
2015-11-02T02:05:58.751+0800 I NETWORK  [conn18858] end connection 127.0.0.1:50732 (6 connections now open)
2015-11-02T02:06:13.967+0800 I NETWORK  [conn18856] end connection 1.1.1.117:57375 (5 connections now open)
2015-11-02T02:06:13.968+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.117:57401 #18859 (6 connections now open)
2015-11-02T02:06:13.991+0800 I ACCESS   [conn18859] Successfully authenticated as principal __system on local
2015-11-02T02:06:15.173+0800 I NETWORK  [conn18857] end connection 1.1.1.118:59924 (5 connections now open)
2015-11-02T02:06:15.173+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.118:59955 #18860 (6 connections now open)
2015-11-02T02:06:15.198+0800 I ACCESS   [conn18860] Successfully authenticated as principal __system on local
2015-11-02T02:06:23.230+0800 I NETWORK  [LockPinger] Socket recv() timeout  1.1.1.102:21112
2015-11-02T02:06:23.231+0800 I NETWORK  [LockPinger] SocketException: remote: 1.1.1.102:21112 error: 9001 socket exception [RECV_TIMEOUT] server [1.1.1.102:21112] 
2015-11-02T02:06:23.231+0800 I NETWORK  [LockPinger] DBClientCursor::init call() failed
2015-11-02T02:06:23.398+0800 I NETWORK  [LockPinger] scoped connection to 102.site:21112,117.site:21112,118.site:21112 not being returned to the pool
2015-11-02T02:06:23.398+0800 W SHARDING [LockPinger] distributed lock pinger '102.site:21112,117.site:21112,118.site:21112/102:21111:1446175720:2140191113' detected an exception while pinging. :: caused by :: SyncClusterConnection::update prepare failed:  102.site:21112 (1.1.1.102) failed:10276 DBClientBase::findN: transport error: 102.site:21112 ns: admin.$cmd query: { getlasterror: 1, fsync: 1 }
2015-11-02T02:06:44.003+0800 I NETWORK  [conn18859] end connection 1.1.1.117:57401 (5 connections now open)
2015-11-02T02:06:44.003+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.117:57417 #18861 (7 connections now open)
2015-11-02T02:06:45.209+0800 I NETWORK  [conn18860] end connection 1.1.1.118:59955 (5 connections now open)
2015-11-02T02:07:07.215+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.118:59971 #18862 (7 connections now open)
2015-11-02T02:07:07.215+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.117:57423 #18863 (7 connections now open)
2015-11-02T02:07:07.215+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.118:59977 #18864 (8 connections now open)
2015-11-02T02:07:07.215+0800 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:50790 #18865 (9 connections now open)
2015-11-02T02:07:07.215+0800 I NETWORK  [conn18865] end connection 127.0.0.1:50790 (8 connections now open)
2015-11-02T02:07:07.217+0800 I NETWORK  [conn18862] end connection 1.1.1.118:59971 (7 connections now open)
2015-11-02T02:07:07.217+0800 I NETWORK  [conn18861] end connection 1.1.1.117:57417 (7 connections now open)
2015-11-02T02:07:07.217+0800 I NETWORK  [conn18863] end connection 1.1.1.117:57423 (5 connections now open)
2015-11-02T02:07:07.218+0800 I NETWORK  [conn18864] end connection 1.1.1.118:59977 (4 connections now open)
2015-11-02T02:07:08.008+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.117:57438 #18866 (5 connections now open)
2015-11-02T02:07:08.032+0800 I ACCESS   [conn18866] Successfully authenticated as principal __system on local
2015-11-02T02:07:09.215+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.118:59983 #18867 (6 connections now open)
2015-11-02T02:07:09.294+0800 I ACCESS   [conn18867] Successfully authenticated as principal __system on local
2015-11-02T02:07:37.215+0800 I NETWORK  [LockPinger] SyncClusterConnection connecting to [102.site:21112]
2015-11-02T02:07:37.215+0800 I NETWORK  [LockPinger] SyncClusterConnection connecting to [117.site:21112]
2015-11-02T02:07:37.216+0800 I NETWORK  [LockPinger] SyncClusterConnection connecting to [118.site:21112]
2015-11-02T02:07:38.047+0800 I NETWORK  [conn18866] end connection 1.1.1.117:57438 (5 connections now open)
2015-11-02T02:07:38.047+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.117:57454 #18868 (6 connections now open)
2015-11-02T02:07:38.072+0800 I ACCESS   [conn18868] Successfully authenticated as principal __system on local
2015-11-02T02:07:39.301+0800 I NETWORK  [conn18867] end connection 1.1.1.118:59983 (5 connections now open)
2015-11-02T02:07:39.302+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.118:49161 #18869 (6 connections now open)
2015-11-02T02:07:39.329+0800 I ACCESS   [conn18869] Successfully authenticated as principal __system on local
2015-11-02T02:07:59.054+0800 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:50876 #18870 (7 connections now open)
2015-11-02T02:07:59.055+0800 I NETWORK  [conn18870] end connection 127.0.0.1:50876 (6 connections now open)
2015-11-02T02:08:07.534+0800 I NETWORK  [LockPinger] Socket recv() timeout  1.1.1.102:21112
2015-11-02T02:08:07.535+0800 I NETWORK  [LockPinger] SocketException: remote: 1.1.1.102:21112 error: 9001 socket exception [RECV_TIMEOUT] server [1.1.1.102:21112] 
2015-11-02T02:08:07.535+0800 I NETWORK  [LockPinger] DBClientCursor::init call() failed
2015-11-02T02:08:07.806+0800 I NETWORK  [LockPinger] scoped connection to 102.site:21112,117.site:21112,118.site:21112 not being returned to the pool
2015-11-02T02:08:07.806+0800 W SHARDING [LockPinger] distributed lock pinger '102.site:21112,117.site:21112,118.site:21112/102:21111:1446175720:2140191113' det2015-11-02T02:09:00.094+0800 I CONTROL  ***** SERVER RESTARTED *****
2015-11-02T02:09:00.191+0800 W STORAGE  [initandlisten] Detected configuration for non-active storage engine mmapv1 when current storage engine is wiredTiger
2015-11-02T02:09:00.207+0800 W -        [initandlisten] Detected unclean shutdown - /mongo/data/mongod.lock is not empty.
2015-11-02T02:09:00.207+0800 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2015-11-02T02:09:00.207+0800 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=48G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=5,log_size=2GB),statistics_log=(wait=0),
2015-11-02T02:09:02.448+0800 I STORAGE  [initandlisten] Starting WiredTigerRecordStoreThread local.oplog.rs
2015-11-02T02:09:03.523+0800 I CONTROL  [initandlisten] MongoDB starting : pid=190762 port=21111 dbpath=/mongo/data 64-bit host=102
2015-11-02T02:09:03.523+0800 I CONTROL  [initandlisten] ** WARNING: You are running this process as the root user, which is not recommended.
2015-11-02T02:09:03.523+0800 I CONTROL  [initandlisten] 
2015-11-02T02:09:03.525+0800 I CONTROL  [initandlisten] db version v3.0.4
2015-11-02T02:09:03.525+0800 I CONTROL  [initandlisten] git version: 0481c958daeb2969800511e7475dc66986fa9ed5
2015-11-02T02:09:03.525+0800 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 0.9.8j-fips 07 Jan 2009
2015-11-02T02:09:03.525+0800 I CONTROL  [initandlisten] build info: Linux ip-10-156-16-176 3.0.13-0.27-ec2 #1 SMP Wed Feb 15 13:33:49 UTC 2012 (d73692b) x86_64 BOOST_LIB_VERSION=1_49
2015-11-02T02:09:03.525+0800 I CONTROL  [initandlisten] allocator: tcmalloc
2015-11-02T02:09:03.525+0800 I CONTROL  [initandlisten] options: { config: "/mongo/conf/mongodb_rep_1.conf", cpu: false, fastsync: false, net: { bindIp: "0.0.0.0", http: { RESTInterfaceEnabled: false }, maxIncomingConnections: 800, port: 21111, wireObjectCheck: false }, notablescan: false, operationProfiling: { mode: "off", slowOpThresholdMs: 100 }, processManagement: { fork: true, pidFilePath: "/mongo/mongo.pid" }, replication: { oplogSizeMB: 8192, replSet: "rep_1/102.site:21111,117.site:21111,118.site:21111" }, security: { authorization: "enabled", javascriptEnabled: true, keyFile: "/mongo/bin/key" }, sharding: { clusterRole: "shardsvr" }, storage: { dbPath: "/mongo/data", directoryPerDB: true, engine: "wiredTiger", journal: { enabled: true }, mmapv1: { journal: { commitIntervalMs: 300 }, nsSize: 64, preallocDataFiles: true, quota: { enforced: false }, smallFiles: false }, syncPeriodSecs: 5.0, wiredTiger: { collectionConfig: { blockCompressor: "snappy" }, engineConfig: { cacheSizeGB: 48, journalCompressor: "snappy" }, indexConfig: { prefixCompression: true } } }, systemLog: { destination: "file", logAppend: true, path: "/mongo/logs/mongostatus.log", verbosity: 0 } }
2015-11-02T02:09:03.932+0800 I NETWORK  [initandlisten] waiting for connections on port 21111
2015-11-02T02:09:03.974+0800 I REPL     [ReplicationExecutor] New replica set config in use: { _id: "rep_1", version: 1, members: [ { _id: 0, host: "102:21111", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "117.site:21111", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "118.site:21111", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatTimeoutSecs: 10, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 } } }
2015-11-02T02:09:03.974+0800 I REPL     [ReplicationExecutor] This node is 102:21111 in the config
2015-11-02T02:09:03.974+0800 I REPL     [ReplicationExecutor] transition to STARTUP2
2015-11-02T02:09:03.974+0800 I REPL     [ReplicationExecutor] Starting replication applier threads
2015-11-02T02:09:03.975+0800 I REPL     [ReplicationExecutor] transition to RECOVERING
2015-11-02T02:09:04.005+0800 I REPL     [ReplicationExecutor] Member 117.site:21111 is now in state PRIMARY
2015-11-02T02:09:04.006+0800 I REPL     [ReplicationExecutor] Member 118.site:21111 is now in state SECONDARY
2015-11-02T02:09:04.974+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.118:49271 #1 (1 connection now open)
2015-11-02T02:09:04.977+0800 I NETWORK  [initandlisten] connection accepted from 1.1.1.117:57571 #2 (2 connections now open)
2015-11-02T02:09:04.999+0800 I ACCESS   [conn2] Successfully authenticated as principal __system on local
2015-11-02T02:09:04.999+0800 I ACCESS   [conn1] Successfully authenticated as principal __system on local
2015-11-02T02:09:06.976+0800 I REPL     [ReplicationExecutor] syncing from: 118.site:21111
2015-11-02T02:09:06.998+0800 I REPL     [SyncSourceFeedback] replset setting syncSourceFeedback to 118.site:21111
2015-11-02T02:09:06.999+0800 I REPL     [rsBackgroundSync] replSet our last op time fetched: Nov  2 02:05:20:d
2015-11-02T02:09:06.999+0800 I REPL     [rsBackgroundSync] replset source's GTE: Nov  2 02:05:27:1
2015-11-02T02:09:06.999+0800 F REPL     [rsBackgroundSync] replSet need to rollback, but in inconsistent state
2015-11-02T02:09:06.999+0800 I REPL     [rsBackgroundSync] minvalid: 56365467:3e7 our last optime: 56365460:d
2015-11-02T02:09:06.999+0800 I -        [rsBackgroundSync] Fatal Assertion 18750
2015-11-02T02:09:06.999+0800 I -        [rsBackgroundSync] 
 
***aborting after fassert() failure

Comment by Ramon Fernandez Marina [ 02/Nov/15 ]

rujun1, this assertion indicates that this secondary can't rejoin the replica set and needs to be re-sync'd.

Can you elaborate on how you arrived at this situation? Can you also provide logs for this node from the previous restart until you hit the assertion above?

Thanks,
Ramón.

Generated at Thu Feb 08 03:56:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.