[SERVER-26360] After node goes into RECOVERING due to being too stale to sync from its source, it will never recover even if there is a valid sync source it could use Created: 27/Sep/16  Updated: 13/May/20  Resolved: 03/Apr/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.4.5, 3.5.6

Type: Bug Priority: Major - P3
Reporter: Spencer Brody (Inactive) Assignee: William Schultz (Inactive)
Resolution: Done Votes: 1
Labels: bkp
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File toostale.js    
Issue Links:
Backports
Depends
Duplicate
is duplicated by SERVER-23899 Reset maintenance mode when transitio... Closed
Related
related to SERVER-27815 always write seed doc to oplog during... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.4
Sprint: Repl 2017-03-06, Repl 2017-03-27, Repl 2017-04-17
Participants:
Linked BF Score: 35

 Description   

If a node is syncing from another node and has no common point in its oplog with that source, it will go into the RECOVERING state because it is 'too stale'. It will then never leave RECOVERING, even if another node it could sync from does have a common point with its oplog.

I have attached a cleaned-up and extended version of the toostale.js test that demonstrates this issue.
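The "too stale" condition can be sketched as follows. This is illustrative JavaScript only, not the server's actual C++ implementation; the optime shapes and sample values are adapted from the log excerpt in the comments below:

```javascript
// Illustrative sketch, NOT MongoDB server code: a node is "too stale"
// relative to a sync source when the newest optime in its own oplog is
// older than the earliest optime still present in the source's oplog,
// i.e. there can be no common point to resume from. Timestamps are
// modeled here as [time, increment] pairs compared lexicographically,
// with ties broken by the election term.
function tsLessThan(a, b) {
  return a[0] !== b[0] ? a[0] < b[0] : a[1] < b[1];
}

function isTooStale(myNewest, sourceEarliest) {
  if (myNewest.ts[0] !== sourceEarliest.ts[0] ||
      myNewest.ts[1] !== sourceEarliest.ts[1]) {
    return tsLessThan(myNewest.ts, sourceEarliest.ts);
  }
  return myNewest.t < sourceEarliest.t;
}

// Sample values as printed in the log excerpt (Timestamp time|increment):
const myNewest = { ts: [1474934672000, 4], t: 1 };          // our newest optime
const sourceEarliest = { ts: [1474934677000, 2257], t: 1 }; // source's earliest optime
console.log(isTooStale(myNewest, sourceEarliest)); // true -> enter RECOVERING
```

The bug is that once this condition triggers maintenance mode, the check is never revisited against a different candidate source.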



 Comments   
Comment by Githook User [ 20/Apr/17 ]

Author: Ramon Fernandez <ramon@mongodb.com>

Message: SERVER-26360 Node should be able to leave RECOVERING after going too stale

This reverts commit 94a762ba470c21db4db14afb692c1b7dc5f4c69b.
Branch: v3.4
https://github.com/mongodb/mongo/commit/6fbaadc720e3237ba1f811bbe175d2f67a4bb65e

Comment by Githook User [ 20/Apr/17 ]

Author: Ramon Fernandez <ramon@mongodb.com>

Message: Revert "SERVER-26360 Node should be able to leave RECOVERING after going too stale"

This reverts commit 31a2a5bcec525c9dc62cfdd06d126673f507c8df.
Branch: v3.4
https://github.com/mongodb/mongo/commit/94a762ba470c21db4db14afb692c1b7dc5f4c69b

Comment by Githook User [ 17/Apr/17 ]

Author: William Schultz (will62794) <william.schultz@mongodb.com>

Message: SERVER-26360 Node should be able to leave RECOVERING after going too stale

(cherry picked from commit 136b728f6b51ce6155335652ea64640a4ae194c6)
Branch: v3.4
https://github.com/mongodb/mongo/commit/31a2a5bcec525c9dc62cfdd06d126673f507c8df

Comment by Githook User [ 03/Apr/17 ]

Author: William Schultz (will62794) <william.schultz@mongodb.com>

Message: SERVER-26360 Node should be able to leave RECOVERING after going too stale
Branch: master
https://github.com/mongodb/mongo/commit/136b728f6b51ce6155335652ea64640a4ae194c6

Comment by Spencer Brody (Inactive) [ 27/Sep/16 ]

Log output from the attached toostale.js.

Node s2 goes into RECOVERING due to being too stale to sync from the primary:

[js_test:toostale] 2016-09-26T20:06:40.544-0400 d25012| 2016-09-26T20:06:40.543-0400 I REPL     [rsBackgroundSync] syncing from: ubuntu:25010
[js_test:toostale] 2016-09-26T20:06:40.545-0400 d25012| 2016-09-26T20:06:40.544-0400 I REPL     [rsBackgroundSync] Blacklisting ubuntu:25010 because our last fetched optime: { ts: Timestamp 1474934672000|4, t: 1 } is before their earliest optime: { ts: Timestamp 1474934677000|2257, t: 1 } for 1min until: 2016-09-26T20:07:40.544-0400
[js_test:toostale] 2016-09-26T20:06:40.545-0400 d25012| 2016-09-26T20:06:40.544-0400 I REPL     [rsBackgroundSync] could not find member to sync from
[js_test:toostale] 2016-09-26T20:06:40.545-0400 d25012| 2016-09-26T20:06:40.544-0400 E REPL     [rsBackgroundSync] too stale to catch up -- entering maintenance mode
[js_test:toostale] 2016-09-26T20:06:40.546-0400 d25012| 2016-09-26T20:06:40.544-0400 I REPL     [rsBackgroundSync] Our newest OpTime : { ts: Timestamp 1474934672000|4, t: 1 }
[js_test:toostale] 2016-09-26T20:06:40.546-0400 d25012| 2016-09-26T20:06:40.545-0400 I REPL     [rsBackgroundSync] Earliest OpTime available is { ts: Timestamp 1474934677000|2257, t: 1 }
[js_test:toostale] 2016-09-26T20:06:40.546-0400 d25012| 2016-09-26T20:06:40.545-0400 I REPL     [rsBackgroundSync] See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
[js_test:toostale] 2016-09-26T20:06:40.546-0400 d25012| 2016-09-26T20:06:40.545-0400 I REPL     [rsBackgroundSync] going into maintenance mode with 0 other maintenance mode tasks in progress
[js_test:toostale] 2016-09-26T20:06:40.546-0400 d25012| 2016-09-26T20:06:40.545-0400 I REPL     [rsBackgroundSync] transition to RECOVERING

Later, the secondary s1, which has a larger oplog and should be a valid sync source for s2, is restarted. s2 selects s1 as its sync source:

[js_test:toostale] 2016-09-26T20:07:03.128-0400 d25012| 2016-09-26T20:07:03.127-0400 I REPL     [ReplicationExecutor] Member ubuntu:25011 is now in state SECONDARY
[js_test:toostale] 2016-09-26T20:07:04.126-0400 d25012| 2016-09-26T20:07:04.126-0400 I REPL     [rsBackgroundSync] syncing from: ubuntu:25011

but then it errors and never actually syncs from s1:

[js_test:toostale] 2016-09-26T20:07:09.183-0400 d25012| 2016-09-26T20:07:09.182-0400 W REPL     [rsBackgroundSync] Fetcher error querying oplog: InvalidSyncSource: sync source ubuntu:25011 (last optime: { ts: Timestamp 0|0, t: -1 }; sync source index: -1; primary index: 0) is no longer valid
[js_test:toostale] 2016-09-26T20:07:09.183-0400 d25012| 2016-09-26T20:07:09.182-0400 I REPL     [rsBackgroundSync] could not find member to sync from
[js_test:toostale] 2016-09-26T20:07:09.282-0400 d25012| 2016-09-26T20:07:09.282-0400 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to ubuntu:25011: InvalidSyncSource: Sync target is no longer valid
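Per the commit message above ("Node should be able to leave RECOVERING after going too stale"), the fix amounts to re-evaluating staleness when a sync source is chosen and clearing the staleness-induced maintenance mode once a source with a common point exists. A rough sketch of that behavior in plain JavaScript, not the actual server logic (host names and oplog bounds taken or extrapolated from the logs above):

```javascript
// Rough sketch, not server code: during sync-source selection, skip
// blacklisted candidates; blacklist any candidate whose oplog no longer
// reaches back to our last fetched optime; and if a usable candidate is
// found while we are in maintenance mode for staleness, clear that
// maintenance mode so the node can leave RECOVERING.
function tsLessThan(a, b) {
  return a[0] !== b[0] ? a[0] < b[0] : a[1] < b[1];
}

function chooseSyncSource(node, candidates) {
  for (const c of candidates) {
    if (node.blacklist.has(c.host)) continue;
    if (tsLessThan(node.lastFetched, c.earliestOptime)) {
      node.blacklist.add(c.host); // too stale relative to this candidate
      continue;
    }
    if (node.staleMaintenanceCount > 0) {
      node.staleMaintenanceCount -= 1; // leave RECOVERING for SECONDARY
    }
    return c.host;
  }
  return null; // no usable source; remain in RECOVERING
}

const node = {
  lastFetched: [1474934672000, 4],
  blacklist: new Set(["ubuntu:25010"]), // primary blacklisted earlier
  staleMaintenanceCount: 1,             // entered maintenance when too stale
};
const candidates = [
  { host: "ubuntu:25010", earliestOptime: [1474934677000, 2257] },
  { host: "ubuntu:25011", earliestOptime: [1474934600000, 1] }, // s1: larger oplog
];
console.log(chooseSyncSource(node, candidates)); // "ubuntu:25011"
console.log(node.staleMaintenanceCount);         // 0
```

In the buggy behavior shown in the logs, the equivalent of `staleMaintenanceCount` was never decremented, so even after s1 became available the node stayed in RECOVERING.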

Generated at Thu Feb 08 04:11:52 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.