Thanks for this information. The sharded cluster seems to be properly configured. I also looked through the log and found a couple of interesting points from earlier in the history:
2015-05-15T18:16:12.505+0530 I CONTROL [initandlisten] options: { config: "/home/Cluster/Database/secondary.conf", net: { bindIp: "10.23.24.62", port: 10045 }, operationProfiling: { mode: "slowOp", slowOpThresholdMs: 75 }, processManagement: { fork: true }, replication: { oplogSizeMB: 5, replSet: "rs1" }, security: { authorization: "disabled" }, storage: { dbPath: "/nginx/Database/mongo_secondary/", engine: "wiredTiger" }, systemLog: { destination: "file", logAppend: true, logRotate: "rename", path: "/nginx/Database/secondary.log" } }
2015-05-15T18:16:25.103+0530 I REPL [ReplicationExecutor] This node is 10.23.24.62:10045 in the config
2015-05-15T18:16:25.103+0530 I REPL [ReplicationExecutor] transition to STARTUP2
2015-05-15T18:16:25.104+0530 I REPL [ReplicationExecutor] Member 10.23.24.62:10044 is now in state PRIMARY
2015-05-15T18:16:30.104+0530 I REPL [rsSync] ******
2015-05-15T18:16:30.104+0530 I REPL [rsSync] creating replication oplog of size: 5MB...
2015-05-15T18:16:30.118+0530 I STORAGE [rsSync] Starting WiredTigerRecordStoreThread local.oplog.rs
2015-05-15T18:16:30.146+0530 I REPL [rsSync] ******
2015-05-15T18:16:30.146+0530 I REPL [rsSync] initial sync pending
2015-05-15T18:16:30.147+0530 I REPL [ReplicationExecutor] syncing from: 10.23.24.62:10044
# ...
2015-05-15T18:30:17.031+0530 I REPL [rsSync] initial sync done
2015-05-15T18:30:17.040+0530 I REPL [ReplicationExecutor] transition to RECOVERING
2015-05-15T18:30:17.041+0530 I REPL [ReplicationExecutor] transition to SECONDARY
It looks like the initial sync took about 14 minutes, and the logs are not consistent with a node stuck in the RECOVERING state. The sh.status() output also doesn't indicate any recent migrations, failed or otherwise, and there are long stretches during which this member was servicing requests normally.
I am somewhat concerned by the size of the oplog: the configuration file sets it to 5MB. This is very small. If a secondary is offline, or lags, while the primary writes more than 5MB to the oplog, that secondary falls off the end of the oplog and must perform a full resync; it can also cause the initial sync itself to fail if the primary writes more than 5MB while the sync is in progress.
See the MongoDB documentation on the replica set oplog and on changing the oplog size.
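As a quick check (not a fix in itself), you can see the configured oplog size and the replication window it currently covers from a mongo shell connected directly to this member (10.23.24.62:10045, per the config in the log):

// Connected to the secondary on port 10045
rs.printReplicationInfo()
// The output includes "configured oplog size" and
// "log length start to end", i.e. how much time the oplog spans.

With a 5MB oplog and any reasonable write load, I would expect that window to be very short.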
Later I see a couple of instance restarts caused by replication errors:
2015-05-20T13:13:00.913+0530 I INDEX [repl writer worker 0] build index on: SalesHEHLDB.LoanApplication properties: { v: 1, unique: true, key: { applicationID: 1.0 }, name: "applicationID_1", ns: "SalesHEHLDB.LoanApplication" }
2015-05-20T13:13:00.913+0530 I INDEX [repl writer worker 0] building index using bulk method
2015-05-20T13:13:00.915+0530 I INDEX [repl writer worker 0] build index done. scanned 188 total records. 0 secs
2015-05-20T13:13:00.974+0530 I INDEX [repl writer worker 0] build index on: SalesHEHLDB.DedupeReport properties: { v: 1, key: { dedupeID: 1.0 }, name: "dedupeID_1", ns: "SalesHEHLDB.DedupeReport" }
2015-05-20T13:13:00.974+0530 I INDEX [repl writer worker 0] building index using bulk method
2015-05-20T13:13:00.981+0530 I INDEX [repl writer worker 0] build index done. scanned 8 total records. 0 secs
2015-05-20T13:13:01.046+0530 E REPL [repl writer worker 15] writer worker caught exception: :: caused by :: 10156 cannot update system collection: SalesHEHLDB.system.profile q: { _id: ObjectId('555c3b0429576754e1fbf79b') } u: { _id: ObjectId('555c3b0429576754e1fbf79b'), op: "query", ns: "SalesHEHLDB.system.namespaces", query: {}, ntoreturn: 0, ntoskip: 0, nscanned: 52, nscannedObjects: 52, keyUpdates: 0, numYield: 0, lockStats: { timeLockedMicros: { r: 29837, w: 0 }, timeAcquiringMicros: { r: 1, w: 4 } }, nreturned: 52, responseLength: 2615, millis: 29, execStats: { type: "COLLSCAN", works: 54, yields: 1, unyields: 1, invalidates: 0, advanced: 52, needTime: 1, needFetch: 0, isEOF: 1, docsTested: 52, children: [] }, ts: new Date(1429677212324), client: "10.232.204.67", allUsers: [ { user: "admin", db: "admin" } ], user: "admin@admin" } on: { ts: Timestamp 1432107780000|517, h: -8914998153402753834, v: 2, op: "i", ns: "SalesHEHLDB.system.profile", o: { _id: ObjectId('555c3b0429576754e1fbf79b'), op: "query", ns: "SalesHEHLDB.system.namespaces", query: {}, ntoreturn: 0, ntoskip: 0, nscanned: 52, nscannedObjects: 52, keyUpdates: 0, numYield: 0, lockStats: { timeLockedMicros: { r: 29837, w: 0 }, timeAcquiringMicros: { r: 1, w: 4 } }, nreturned: 52, responseLength: 2615, millis: 29, execStats: { type: "COLLSCAN", works: 54, yields: 1, unyields: 1, invalidates: 0, advanced: 52, needTime: 1, needFetch: 0, isEOF: 1, docsTested: 52, children: [] }, ts: new Date(1429677212324), client: "10.232.204.67", allUsers: [ { user: "admin", db: "admin" } ], user: "admin@admin" } }
2015-05-20T13:13:01.046+0530 E REPL [repl writer worker 8] writer worker caught exception: :: caused by :: 10156 cannot update system collection: SalesHEHLDB.system.profile q: { _id: ObjectId('555c3b0429576754e1fbf799') } u: { _id: ObjectId('555c3b0429576754e1fbf799'), op: "command", ns: "SalesHEHLDB.$cmd", command: { dbstats: 1 }, keyUpdates: 0, numYield: 0, lockStats: { timeLockedMicros: { r: 81439, w: 0 }, timeAcquiringMicros: { r: 7, w: 5 } }, responseLength: 299, millis: 81, execStats: {}, ts: new Date(1429612546243), client: "10.251.168.68", allUsers: [ { user: "LeadUsersDEV", db: "admin" } ], user: "LeadUsersDEV@admin" } on: { ts: Timestamp 1432107780000|515, h: -6736318779683313939, v: 2, op: "i", ns: "SalesHEHLDB.system.profile", o: { _id: ObjectId('555c3b0429576754e1fbf799'), op: "command", ns: "SalesHEHLDB.$cmd", command: { dbstats: 1 }, keyUpdates: 0, numYield: 0, lockStats: { timeLockedMicros: { r: 81439, w: 0 }, timeAcquiringMicros: { r: 7, w: 5 } }, responseLength: 299, millis: 81, execStats: {}, ts: new Date(1429612546243), client: "10.251.168.68", allUsers: [ { user: "LeadUsersDEV", db: "admin" } ], user: "LeadUsersDEV@admin" } }
2015-05-20T13:13:01.046+0530 I - [repl writer worker 8] Fatal Assertion 16360
2015-05-20T13:13:01.046+0530 I - [repl writer worker 8]
***aborting after fassert() failure
2015-05-20T13:13:01.046+0530 E REPL [repl writer worker 1] writer worker caught exception: :: caused by :: 10156 cannot update system collection: SalesHEHLDB.system.profile q: { _id: ObjectId('555c3b0429576754e1fbf79c') } u: { _id: ObjectId('555c3b0429576754e1fbf79c'), op: "query", ns: "SalesHEHLDB.Lead", query: { query: {}, $snapshot: true }, cursorid: 1301660389175, ntoreturn: 0, ntoskip: 0, exhaust: true, nscanned: 101, nscannedObjects: 101, keyUpdates: 0, numYield: 4, lockStats: { timeLockedMicros: { r: 19209, w: 0 }, timeAcquiringMicros: { r: 12, w: 4 } }, nreturned: 101, responseLength: 171646, millis: 23, execStats: { type: "FETCH", works: 105, yields: 4, unyields: 4, invalidates: 0, advanced: 101, needTime: 0, needFetch: 4, isEOF: 0, alreadyHasObj: 0, forcedFetches: 0, matchTested: 0, children: [ { type: "IXSCAN", works: 101, yields: 4, unyields: 4, invalidates: 0, advanced: 101, needTime: 0, needFetch: 0, isEOF: 0, keyPattern: "{ _id: 1 }", isMultiKey: 0, boundsVerbose: "field #0['_id']: [MinKey, MaxKey]", yieldMovedCursor: 0, dupsTested: 0, dupsDropped: 0, seenInvalidated: 0, matchTested: 0, keysExamined: 101, children: [] } ] }, ts: new Date(1429677212373), client: "10.232.204.67", allUsers: [ { user: "admin", db: "admin" } ], user: "admin@admin" } on: { ts: Timestamp 1432107780000|518, h: 7326568665598770184, v: 2, op: "i", ns: "SalesHEHLDB.system.profile", o: { _id: ObjectId('555c3b0429576754e1fbf79c'), op: "query", ns: "SalesHEHLDB.Lead", query: { query: {}, $snapshot: true }, cursorid: 1301660389175, ntoreturn: 0, ntoskip: 0, exhaust: true, nscanned: 101, nscannedObjects: 101, keyUpdates: 0, numYield: 4, lockStats: { timeLockedMicros: { r: 19209, w: 0 }, timeAcquiringMicros: { r: 12, w: 4 } }, nreturned: 101, responseLength: 171646, millis: 23, execStats: { type: "FETCH", works: 105, yields: 4, unyields: 4, invalidates: 0, advanced: 101, needTime: 0, needFetch: 4, isEOF: 0, alreadyHasObj: 0, forcedFetches: 0, matchTested: 0, children: [ { type: "IXSCAN", works: 101, yields: 4, unyields: 4, invalidates: 0, advanced: 101, needTime: 0, needFetch: 0, isEOF: 0, keyPattern: "{ _id: 1 }", isMultiKey: 0, boundsVerbose: "field #0['_id']: [MinKey, MaxKey]", yieldMovedCursor: 0, dupsTested: 0, dupsDropped: 0, seenInvalidated: 0, matchTested: 0, keysExamined: 101, children: [] } ] }, ts: new Date(1429677212373), client: "10.232.204.67", allUsers: [ { user: "admin", db: "admin" } ], user: "admin@admin" } }
2015-05-20T13:13:01.046+0530 I - [repl writer worker 1] Fatal Assertion 16360
# ...
2015-07-02T18:17:36.826+0530 I CONTROL [initandlisten] options: { config: "/home/Cluster/Database/secondary.conf", net: { bindIp: "10.23.24.62", port: 10045 }, operationProfiling: { mode: "slowOp", slowOpThresholdMs: 75 }, processManagement: { fork: true }, replication: { oplogSizeMB: 5, replSet: "rs1" }, security: { keyFile: "/home/Cluster/Database/mongodb-key" }, storage: { dbPath: "/nginx/Database/mongo_secondary/", engine: "wiredTiger" }, systemLog: { destination: "file", logAppend: true, logRotate: "rename", path: "/nginx/Database/secondary.log" } }
2015-07-02T18:17:36.947+0530 I - [initandlisten] Fatal Assertion 18823
2015-07-02T18:17:37.033+0530 I CONTROL [initandlisten]
0xf77369 0xf140f1 0xef7e41 0xc6e8e9 0xaaf233 0x7fa6f0 0x38ecc1ed1d 0x7f34c9
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B77369"},{"b":"400000","o":"B140F1"},{"b":"400000","o":"AF7E41"},{"b":"400000","o":"86E8E9"},{"b":"400000","o":"6AF233"},{"b":"400000","o":"3FA6F0"},{"b":"38ECC00000","o":"1ED1D"},{"b":"400000","o":"3F34C9"}],"processInfo":{ "mongodbVersion" : "3.0.4", "gitVersion" : "0481c958daeb2969800511e7475dc66986fa9ed5", "uname" : { "sysname" : "Linux", "release" : "2.6.32-431.el6.x86_64", "version" : "#1 SMP Sun Nov 10 22:19:54 EST 2013", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "7702B4740E91E1BA6F701DB2A42553968AC09562" }, { "b" : "7FFFE44FF000", "elfType" : 3, "buildId" : "81A81BE2E44C93640ADEDB62ADC93A47F4A09DD1" }, { "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "1BB4E10307D6B94223749CFDF2AD14C365972C60" }, { "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "B933728EE9CE426EBD94C3DF251C78AC838E0E8D" }, { "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "3E71CF35EF7C460CC31BE02541DC2F4DB9CF83A0" }, { "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "B26528BF6C0636AC1CAE5AC50BDBC07E60851DF4" }, { "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "AFC7448F2F2F6ED4E5BC82B1BD8A7320B84A9D48" }, { "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "ED99110E629209C5CA6C0ED704F2C5CE3171513A" }, { "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "98B028A725D6E93253F25DF00B794DFAA66A3145" }, { "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "A44499D29B114A5366CD72DD4883958495AC1C1D" }, { "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "871059BD552A77064DCFF4CB5DB5222BD39A31C4" }, { "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "57BF668F99B7F5917B8D55FBB645173C9A644575" }, { "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "EFF68B7DE77D081BC4A0CB38FE9DCBC60541BF92" }, { "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "95EBB74C2C0A1E1714344036145A0239FFA4892D" }, { "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "6ADE12F76961F73B33D160AC4D342222E7FC7A65" }, { "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "D02E7D3149950118009A81997434E28B7D9EC9B2" }, { "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "D053BB4FF0C2FC983842F81598813B9B931AD0D1" }, { "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "5AFCBEA0D62EE0335714CCBAB7BA808E2A16028C" }, { "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "8A8734DC37305D8CC2EF8F8C3E5EA03171DB07EC" }, { "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "56843351EFB2CE304A7E4BD0754991613E9EC8BD" }, { "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "BAD5C71361DADF259B6E306A49E6F47F24AEA3DC" } ] }}
mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf77369]
mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf140f1]
mongod(_ZN5mongo13fassertFailedEi+0x61) [0xef7e41]
mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl8shutdownEv+0x1F9) [0xc6e8e9]
mongod(_ZN5mongo11exitCleanlyENS_8ExitCodeE+0x73) [0xaaf233]
mongod(main+0x140) [0x7fa6f0]
libc.so.6(__libc_start_main+0xFD) [0x38ecc1ed1d]
mongod(+0x3F34C9) [0x7f34c9]
----- END BACKTRACE -----
2015-07-02T18:17:37.033+0530 I - [initandlisten]
***aborting after fassert() failure
2015-07-03T10:48:20.212+0530 E STORAGE WiredTiger (2) [1435900700:147915][25588:0x7f8da115f700], log-server: /nginx/Database/mongo_secondary//journal: opendir: No such file or directory
2015-07-03T10:48:20.250+0530 E STORAGE WiredTiger (2) [1435900700:250744][25588:0x7f8da115f700], log-server: dirlist journal prefix WiredTigerPreplog: No such file or directory
2015-07-03T10:48:20.266+0530 E STORAGE WiredTiger (2) [1435900700:266097][25588:0x7f8da115f700], log-server: log pre-alloc server error: No such file or directory
2015-07-03T10:48:20.280+0530 E STORAGE WiredTiger (2) [1435900700:277104][25588:0x7f8da115f700], log-server: log server error: No such file or directory
The first issue may be caused by an oplog entry that was created before SERVER-18111 was fixed. The second issue looks related to space allocation and drive access, since the termination happens right after startup, although we can look into that further if it persists.
Can you increase the oplog size, allow the member to resync normally, and then let us know what impact this has on your replica set?
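Once the resync has finished, a quick way to confirm the member is healthy and the new oplog size took effect is from a mongo shell connected to the member (the helper calls below are standard shell commands; the exact output fields are only indicative):

// Confirm every member, including this one, reports the expected state
rs.status().members.forEach(function (m) { print(m.name + " : " + m.stateStr); })
// Confirm the resized oplog: maxSize is reported in bytes for the capped collection
db.getSiblingDB("local").oplog.rs.stats().maxSize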
Regards,
sam