[SERVER-19302] One of the replica set members is always in the "RECOVERING" state. Created: 07/Jul/15  Updated: 14/Jul/15  Resolved: 14/Jul/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.0.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kingsly J Assignee: Sam Kleinman (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

LSB_VERSION=base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Red Hat Enterprise Linux Server release 6.5 (Santiago)


Attachments: Text File secondary_1.log    
Operating System: Linux
Participants:

 Description   

Hi Team,
We are using MongoDB version 3.0.4. We have set up a sharded environment on two servers. Each shard is a replica set, and each replica set has one primary, one secondary and one arbiter.
The issue we are facing is that a member in one of the replica sets moved to the "RECOVERING" state. We tried all the options mentioned in the documentation (including an initial sync after taking a backup), but after a restart the member moves to "STARTUP2" and then to "RECOVERING", and it stays in that state for days without ever transitioning to "SECONDARY". Based on a few Jira tickets, we thought this was a known bug in v3.0.2 that was fixed in v3.0.4 (based on the release notes).
Please let us know whether the bug that leaves a member stuck in "RECOVERING" because of a moveChunk problem is still present. If not, please help us get rid of this issue; we are facing it often.
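For reference, a minimal sketch of how the member's state and oplog window can be checked from the mongo shell (connect to the affected member; host and port below are placeholders):

  mongo --host <member-host> --port <member-port>

  // Print each member's name and current state (the stuck member shows "RECOVERING").
  rs.status().members.forEach(function (m) { print(m.name + " : " + m.stateStr); })

  // Print the oplog size and the time window it currently covers on this member.
  rs.printReplicationInfo()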



 Comments   
Comment by Sam Kleinman (Inactive) [ 14/Jul/15 ]

I'm glad to hear that this worked. I'm going to go ahead and close this ticket for now, but let us know if you see this problem again.

Comment by Kingsly J [ 14/Jul/15 ]

Hi Sam,
We increased the oplog size and allowed the replica set member that was in the "RECOVERING" state to resync (we also removed and re-added that member to the replica set). It has now moved to "SECONDARY" and has stayed in that state for more than a couple of days.
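The remove/re-add and resync sequence was roughly as sketched below (the member address and data path are the ones from our configuration; the exact steps may have differed slightly):

  // On the primary of rs1: remove the stuck member from the replica set.
  rs.remove("10.23.24.62:10045")

  // On the affected host: stop that mongod and clear its data directory so it
  // performs a fresh initial sync when it rejoins, e.g.
  //   rm -rf /nginx/Database/mongo_secondary/*

  // Restart the mongod with its usual configuration (now with the larger
  // oplogSizeMB), then re-add it from the primary.
  rs.add("10.23.24.62:10045")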

Thanks for your valuable suggestion.

Regards,
Kingsly J

Comment by Sam Kleinman (Inactive) [ 10/Jul/15 ]

Thanks for the update. I will leave this ticket open for a few more days just to make sure that both resyncing the member and increasing the oplog size (on all members of the set) will fix the issue you're seeing. Please update the ticket when you've completed the maintenance, and let us know if you see any more of that assertion.

Cheers,
sam

Comment by Kingsly J [ 10/Jul/15 ]

Hi Sam,
Thank you so much for your suggestion. We will increase the oplog size and will resync again.

Regards,
Kingsly J

Comment by Sam Kleinman (Inactive) [ 09/Jul/15 ]

Thanks for this information. The sharded cluster seems to be properly configured. I also looked through the log and found a couple of interesting points from earlier in the history:

2015-05-15T18:16:12.505+0530 I CONTROL  [initandlisten] options: { config: "/home/Cluster/Database/secondary.conf", net: { bindIp: "10.23.24.62", port: 10045 }, operationProfiling: { mode: "slowOp", slowOpThresholdMs: 75 }, processManagement: { fork: true }, replication: { oplogSizeMB: 5, replSet: "rs1" }, security: { authorization: "disabled" }, storage: { dbPath: "/nginx/Database/mongo_secondary/", engine: "wiredTiger" }, systemLog: { destination: "file", logAppend: true, logRotate: "rename", path: "/nginx/Database/secondary.log" } }
2015-05-15T18:16:25.103+0530 I REPL     [ReplicationExecutor] This node is 10.23.24.62:10045 in the config
2015-05-15T18:16:25.103+0530 I REPL     [ReplicationExecutor] transition to STARTUP2
2015-05-15T18:16:25.104+0530 I REPL     [ReplicationExecutor] Member 10.23.24.62:10044 is now in state PRIMARY
2015-05-15T18:16:30.104+0530 I REPL     [rsSync] ******
2015-05-15T18:16:30.104+0530 I REPL     [rsSync] creating replication oplog of size: 5MB...
2015-05-15T18:16:30.118+0530 I STORAGE  [rsSync] Starting WiredTigerRecordStoreThread local.oplog.rs
2015-05-15T18:16:30.146+0530 I REPL     [rsSync] ******
2015-05-15T18:16:30.146+0530 I REPL     [rsSync] initial sync pending
2015-05-15T18:16:30.147+0530 I REPL     [ReplicationExecutor] syncing from: 10.23.24.62:10044
# ... 
2015-05-15T18:30:17.031+0530 I REPL     [rsSync] initial sync done
2015-05-15T18:30:17.040+0530 I REPL     [ReplicationExecutor] transition to RECOVERING
2015-05-15T18:30:17.041+0530 I REPL     [ReplicationExecutor] transition to SECONDARY

It looks like it took about 15 minutes to do the initial sync, but the logs are not consistent with a node stuck in the RECOVERING state. The sh.status() output also doesn't indicate any recent migrations, failed or otherwise. There is also a long stretch of time during which this member was able to service requests normally.

I am somewhat concerned by the size of the oplog. It looks like you're setting it to 5MB in the configuration file. This is quite small: if a secondary is offline or lags while the primary writes more than 5MB of data to the oplog, that secondary must resync, and such a small oplog may also cause problems if the primary writes more than 5MB of data during an initial sync.

See the MongoDB manual's documentation on the oplog and on resizing the oplog.

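For a member that is going to be resynced anyway, it should be enough to raise replication.oplogSizeMB in its configuration file before the resync, since that setting is only applied when the oplog is first created (for example during an initial sync). A sketch of the relevant section of secondary.conf, with an illustrative size:

  replication:
    oplogSizeMB: 10240    # illustrative only; size it for your write volume and maintenance windows
    replSet: rs1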
Later I see a couple of instance restarts caused by errors in replication:

2015-05-20T13:13:00.913+0530 I INDEX    [repl writer worker 0] build index on: SalesHEHLDB.LoanApplication properties: { v: 1, unique: true, key: { applicationID: 1.0 }, name: "applicationID_1", ns: "SalesHEHLDB.LoanApplication" }
2015-05-20T13:13:00.913+0530 I INDEX    [repl writer worker 0] 	 building index using bulk method
2015-05-20T13:13:00.915+0530 I INDEX    [repl writer worker 0] build index done.  scanned 188 total records. 0 secs
2015-05-20T13:13:00.974+0530 I INDEX    [repl writer worker 0] build index on: SalesHEHLDB.DedupeReport properties: { v: 1, key: { dedupeID: 1.0 }, name: "dedupeID_1", ns: "SalesHEHLDB.DedupeReport" }
2015-05-20T13:13:00.974+0530 I INDEX    [repl writer worker 0] 	 building index using bulk method
2015-05-20T13:13:00.981+0530 I INDEX    [repl writer worker 0] build index done.  scanned 8 total records. 0 secs
2015-05-20T13:13:01.046+0530 E REPL     [repl writer worker 15] writer worker caught exception:  :: caused by :: 10156 cannot update system collection: SalesHEHLDB.system.profile q: { _id: ObjectId('555c3b0429576754e1fbf79b') } u: { _id: ObjectId('555c3b0429576754e1fbf79b'), op: "query", ns: "SalesHEHLDB.system.namespaces", query: {}, ntoreturn: 0, ntoskip: 0, nscanned: 52, nscannedObjects: 52, keyUpdates: 0, numYield: 0, lockStats: { timeLockedMicros: { r: 29837, w: 0 }, timeAcquiringMicros: { r: 1, w: 4 } }, nreturned: 52, responseLength: 2615, millis: 29, execStats: { type: "COLLSCAN", works: 54, yields: 1, unyields: 1, invalidates: 0, advanced: 52, needTime: 1, needFetch: 0, isEOF: 1, docsTested: 52, children: [] }, ts: new Date(1429677212324), client: "10.232.204.67", allUsers: [ { user: "admin", db: "admin" } ], user: "admin@admin" } on: { ts: Timestamp 1432107780000|517, h: -8914998153402753834, v: 2, op: "i", ns: "SalesHEHLDB.system.profile", o: { _id: ObjectId('555c3b0429576754e1fbf79b'), op: "query", ns: "SalesHEHLDB.system.namespaces", query: {}, ntoreturn: 0, ntoskip: 0, nscanned: 52, nscannedObjects: 52, keyUpdates: 0, numYield: 0, lockStats: { timeLockedMicros: { r: 29837, w: 0 }, timeAcquiringMicros: { r: 1, w: 4 } }, nreturned: 52, responseLength: 2615, millis: 29, execStats: { type: "COLLSCAN", works: 54, yields: 1, unyields: 1, invalidates: 0, advanced: 52, needTime: 1, needFetch: 0, isEOF: 1, docsTested: 52, children: [] }, ts: new Date(1429677212324), client: "10.232.204.67", allUsers: [ { user: "admin", db: "admin" } ], user: "admin@admin" } }
2015-05-20T13:13:01.046+0530 E REPL     [repl writer worker 8] writer worker caught exception:  :: caused by :: 10156 cannot update system collection: SalesHEHLDB.system.profile q: { _id: ObjectId('555c3b0429576754e1fbf799') } u: { _id: ObjectId('555c3b0429576754e1fbf799'), op: "command", ns: "SalesHEHLDB.$cmd", command: { dbstats: 1 }, keyUpdates: 0, numYield: 0, lockStats: { timeLockedMicros: { r: 81439, w: 0 }, timeAcquiringMicros: { r: 7, w: 5 } }, responseLength: 299, millis: 81, execStats: {}, ts: new Date(1429612546243), client: "10.251.168.68", allUsers: [ { user: "LeadUsersDEV", db: "admin" } ], user: "LeadUsersDEV@admin" } on: { ts: Timestamp 1432107780000|515, h: -6736318779683313939, v: 2, op: "i", ns: "SalesHEHLDB.system.profile", o: { _id: ObjectId('555c3b0429576754e1fbf799'), op: "command", ns: "SalesHEHLDB.$cmd", command: { dbstats: 1 }, keyUpdates: 0, numYield: 0, lockStats: { timeLockedMicros: { r: 81439, w: 0 }, timeAcquiringMicros: { r: 7, w: 5 } }, responseLength: 299, millis: 81, execStats: {}, ts: new Date(1429612546243), client: "10.251.168.68", allUsers: [ { user: "LeadUsersDEV", db: "admin" } ], user: "LeadUsersDEV@admin" } }
2015-05-20T13:13:01.046+0530 I -        [repl writer worker 8] Fatal Assertion 16360
2015-05-20T13:13:01.046+0530 I -        [repl writer worker 8] 
 
***aborting after fassert() failure
 
 
2015-05-20T13:13:01.046+0530 E REPL     [repl writer worker 1] writer worker caught exception:  :: caused by :: 10156 cannot update system collection: SalesHEHLDB.system.profile q: { _id: ObjectId('555c3b0429576754e1fbf79c') } u: { _id: ObjectId('555c3b0429576754e1fbf79c'), op: "query", ns: "SalesHEHLDB.Lead", query: { query: {}, $snapshot: true }, cursorid: 1301660389175, ntoreturn: 0, ntoskip: 0, exhaust: true, nscanned: 101, nscannedObjects: 101, keyUpdates: 0, numYield: 4, lockStats: { timeLockedMicros: { r: 19209, w: 0 }, timeAcquiringMicros: { r: 12, w: 4 } }, nreturned: 101, responseLength: 171646, millis: 23, execStats: { type: "FETCH", works: 105, yields: 4, unyields: 4, invalidates: 0, advanced: 101, needTime: 0, needFetch: 4, isEOF: 0, alreadyHasObj: 0, forcedFetches: 0, matchTested: 0, children: [ { type: "IXSCAN", works: 101, yields: 4, unyields: 4, invalidates: 0, advanced: 101, needTime: 0, needFetch: 0, isEOF: 0, keyPattern: "{ _id: 1 }", isMultiKey: 0, boundsVerbose: "field #0['_id']: [MinKey, MaxKey]", yieldMovedCursor: 0, dupsTested: 0, dupsDropped: 0, seenInvalidated: 0, matchTested: 0, keysExamined: 101, children: [] } ] }, ts: new Date(1429677212373), client: "10.232.204.67", allUsers: [ { user: "admin", db: "admin" } ], user: "admin@admin" } on: { ts: Timestamp 1432107780000|518, h: 7326568665598770184, v: 2, op: "i", ns: "SalesHEHLDB.system.profile", o: { _id: ObjectId('555c3b0429576754e1fbf79c'), op: "query", ns: "SalesHEHLDB.Lead", query: { query: {}, $snapshot: true }, cursorid: 1301660389175, ntoreturn: 0, ntoskip: 0, exhaust: true, nscanned: 101, nscannedObjects: 101, keyUpdates: 0, numYield: 4, lockStats: { timeLockedMicros: { r: 19209, w: 0 }, timeAcquiringMicros: { r: 12, w: 4 } }, nreturned: 101, responseLength: 171646, millis: 23, execStats: { type: "FETCH", works: 105, yields: 4, unyields: 4, invalidates: 0, advanced: 101, needTime: 0, needFetch: 4, isEOF: 0, alreadyHasObj: 0, forcedFetches: 0, matchTested: 0, children: [ { type: "IXSCAN", works: 101, yields: 4, unyields: 4, invalidates: 0, advanced: 101, needTime: 0, needFetch: 0, isEOF: 0, keyPattern: "{ _id: 1 }", isMultiKey: 0, boundsVerbose: "field #0['_id']: [MinKey, MaxKey]", yieldMovedCursor: 0, dupsTested: 0, dupsDropped: 0, seenInvalidated: 0, matchTested: 0, keysExamined: 101, children: [] } ] }, ts: new Date(1429677212373), client: "10.232.204.67", allUsers: [ { user: "admin", db: "admin" } ], user: "admin@admin" } }
2015-05-20T13:13:01.046+0530 I -        [repl writer worker 1] Fatal Assertion 16360
 
# ... 
 
2015-07-02T18:17:36.826+0530 I CONTROL  [initandlisten] options: { config: "/home/Cluster/Database/secondary.conf", net: { bindIp: "10.23.24.62", port: 10045 }, operationProfiling: { mode: "slowOp", slowOpThresholdMs: 75 }, processManagement: { fork: true }, replication: { oplogSizeMB: 5, replSet: "rs1" }, security: { keyFile: "/home/Cluster/Database/mongodb-key" }, storage: { dbPath: "/nginx/Database/mongo_secondary/", engine: "wiredTiger" }, systemLog: { destination: "file", logAppend: true, logRotate: "rename", path: "/nginx/Database/secondary.log" } }
2015-07-02T18:17:36.947+0530 I -        [initandlisten] Fatal Assertion 18823
2015-07-02T18:17:37.033+0530 I CONTROL  [initandlisten] 
 0xf77369 0xf140f1 0xef7e41 0xc6e8e9 0xaaf233 0x7fa6f0 0x38ecc1ed1d 0x7f34c9
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B77369"},{"b":"400000","o":"B140F1"},{"b":"400000","o":"AF7E41"},{"b":"400000","o":"86E8E9"},{"b":"400000","o":"6AF233"},{"b":"400000","o":"3FA6F0"},{"b":"38ECC00000","o":"1ED1D"},{"b":"400000","o":"3F34C9"}],"processInfo":{ "mongodbVersion" : "3.0.4", "gitVersion" : "0481c958daeb2969800511e7475dc66986fa9ed5", "uname" : { "sysname" : "Linux", "release" : "2.6.32-431.el6.x86_64", "version" : "#1 SMP Sun Nov 10 22:19:54 EST 2013", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "7702B4740E91E1BA6F701DB2A42553968AC09562" }, { "b" : "7FFFE44FF000", "elfType" : 3, "buildId" : "81A81BE2E44C93640ADEDB62ADC93A47F4A09DD1" }, { "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "1BB4E10307D6B94223749CFDF2AD14C365972C60" }, { "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "B933728EE9CE426EBD94C3DF251C78AC838E0E8D" }, { "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "3E71CF35EF7C460CC31BE02541DC2F4DB9CF83A0" }, { "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "B26528BF6C0636AC1CAE5AC50BDBC07E60851DF4" }, { "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "AFC7448F2F2F6ED4E5BC82B1BD8A7320B84A9D48" }, { "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "ED99110E629209C5CA6C0ED704F2C5CE3171513A" }, { "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "98B028A725D6E93253F25DF00B794DFAA66A3145" }, { "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "A44499D29B114A5366CD72DD4883958495AC1C1D" }, { "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "871059BD552A77064DCFF4CB5DB5222BD39A31C4" }, { "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "57BF668F99B7F5917B8D55FBB645173C9A644575" }, { "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "EFF68B7DE77D081BC4A0CB38FE9DCBC60541BF92" }, { "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "95EBB74C2C0A1E1714344036145A0239FFA4892D" }, { "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "6ADE12F76961F73B33D160AC4D342222E7FC7A65" }, { "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "D02E7D3149950118009A81997434E28B7D9EC9B2" }, { "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "D053BB4FF0C2FC983842F81598813B9B931AD0D1" }, { "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "5AFCBEA0D62EE0335714CCBAB7BA808E2A16028C" }, { "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "8A8734DC37305D8CC2EF8F8C3E5EA03171DB07EC" }, { "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "56843351EFB2CE304A7E4BD0754991613E9EC8BD" }, { "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "BAD5C71361DADF259B6E306A49E6F47F24AEA3DC" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf77369]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf140f1]
 mongod(_ZN5mongo13fassertFailedEi+0x61) [0xef7e41]
 mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl8shutdownEv+0x1F9) [0xc6e8e9]
 mongod(_ZN5mongo11exitCleanlyENS_8ExitCodeE+0x73) [0xaaf233]
 mongod(main+0x140) [0x7fa6f0]
 libc.so.6(__libc_start_main+0xFD) [0x38ecc1ed1d]
 mongod(+0x3F34C9) [0x7f34c9]
-----  END BACKTRACE  -----
2015-07-02T18:17:37.033+0530 I -        [initandlisten] 
 
***aborting after fassert() failure
 
2015-07-03T10:48:20.212+0530 E STORAGE  WiredTiger (2) [1435900700:147915][25588:0x7f8da115f700], log-server: /nginx/Database/mongo_secondary//journal: opendir: No such file or directory
2015-07-03T10:48:20.250+0530 E STORAGE  WiredTiger (2) [1435900700:250744][25588:0x7f8da115f700], log-server: dirlist journal prefix WiredTigerPreplog: No such file or directory
2015-07-03T10:48:20.266+0530 E STORAGE  WiredTiger (2) [1435900700:266097][25588:0x7f8da115f700], log-server: log pre-alloc server error: No such file or directory
2015-07-03T10:48:20.280+0530 E STORAGE  WiredTiger (2) [1435900700:277104][25588:0x7f8da115f700], log-server: log server error: No such file or directory

The first issue may be caused by an oplog entry created before SERVER-18111 was fixed. The second issue looks related to allocating space and drive access, as the termination occurs right after startup, although we can look into that further if it persists.

Can you increase the oplog size and allow the member to resync normally, and see what impact this has on your replica set?
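For members that will not be resynced (the current primary, for example), the oplog has to be recreated by hand. A rough sketch of that procedure, using an illustrative 1GB target size:

  // 1. Restart the member as a standalone mongod (same dbPath, no replSet option,
  //    a different port), then connect to it with the mongo shell.
  use local

  // 2. Save the newest oplog entry so the member knows where replication left off.
  db.temp.drop()
  db.temp.save(db.oplog.rs.find({}, {ts: 1, h: 1}).sort({$natural: -1}).limit(1).next())

  // 3. Drop the old oplog and recreate it as a capped collection at the new size (in bytes).
  db.oplog.rs.drop()
  db.runCommand({create: "oplog.rs", capped: true, size: 1024 * 1024 * 1024})

  // 4. Restore the saved entry, then restart the member with its normal replica set options.
  db.oplog.rs.save(db.temp.findOne())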

Regards,
sam

Comment by Kingsly J [ 09/Jul/15 ]

Hi Sam,
Thanks for your reply.
I have attached the required log. Please find below the output of sh.status():

 sharding version: {
        "_id" : 1,
        "minCompatibleVersion" : 5,
        "currentVersion" : 6,
        "clusterId" : ObjectId("556bf422d035834ca553c990")
}
  shards:
        {  "_id" : "rs1",  "host" : "rs1/10.23.24.62:10044,10.23.24.62:10045" }
        {  "_id" : "rs2",  "host" : "rs2/10.23.24.65:10044,10.23.24.65:10045" }
  balancer:
        Currently enabled:  yes
        Currently running:  no
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours:
                No recent migrations

Regards,
Kingsly J

Comment by Sam Kleinman (Inactive) [ 07/Jul/15 ]

Sorry that you've encountered this error. A few questions:

  1. Could you please attach recent logs from the member in the RECOVERING state?
  2. Could you please include the output of sh.status() while connected to the mongos with the mongo shell? (A minimal example of collecting this is sketched below.)
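
For example (host and port are placeholders for your mongos):

  mongo --host <mongos-host> --port <mongos-port>

  // From the connected shell:
  sh.status()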

Regards,
sam
