[SERVER-31283] Unable to complete rollback after falling off the oplog Created: 27/Sep/17  Updated: 27/Oct/23  Resolved: 27/Sep/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bogdan Guban Assignee: Kelsey Schubert
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Operating System: ALL
Participants:

 Description   

I have mongo replication set. One of the servers crashed with the next info in log file:

2017-09-26T11:44:51.714-0700 I INDEX    [repl index builder 132373] build index done.  scanned 0 total records. 0 secs
2017-09-26T11:44:51.718-0700 I INDEX    [repl index builder 132374] build index on: site_7053.visitor_offers properties: { v: 2, key: { origin_id: 1 }, name: "origin_id_1", ns: "site_7053.visitor_offers", background: true }
2017-09-26T11:44:51.718-0700 I INDEX    [repl index builder 132374] build index done.  scanned 0 total records. 0 secs
2017-09-26T11:44:51.724-0700 I INDEX    [repl index builder 132375] build index on: site_7053.visitor_offers properties: { v: 2, key: { offer_id: 1 }, name: "offer_id_1", ns: "site_7053.visitor_offers", background: true }
2017-09-26T11:44:51.724-0700 I INDEX    [repl index builder 132375] build index done.  scanned 0 total records. 0 secs
2017-09-26T11:44:52.027-0700 I REPL     [rsBackgroundSync] rollback 0
2017-09-26T11:44:52.029-0700 I REPL     [rsBackgroundSync] transition to ROLLBACK
2017-09-26T11:44:52.029-0700 I NETWORK  [rsBackgroundSync] legacy transport layer closing all connections
2017-09-26T11:44:52.034-0700 I NETWORK  [rsBackgroundSync] Skip closing connection for connection # 55139
2017-09-26T11:44:52.036-0700 I NETWORK  [rsBackgroundSync] Skip closing connection for connection # 54921
2017-09-26T11:44:52.039-0700 I -        [conn52753] end connection 10.0.6.27:39961 (967 connections now open)
2017-09-26T11:44:52.039-0700 I -        [conn52775] end connection 10.0.6.141:55674 (967 connections now open)
2017-09-26T11:44:52.039-0700 I -        [conn55717] end connection 10.0.2.59:60926 (967 connections now open)
...
2017-09-26T11:44:52.042-0700 I -        [conn54912] end connection 10.0.2.59:51880 (724 connections now open)
2017-09-26T11:44:52.042-0700 I -        [conn54923] end connection 10.0.2.59:51613 (716 connections now open)
2017-09-26T11:44:52.042-0700 I -        [conn54891] end connection 10.0.6.141:44850 (713 connections now open)
2017-09-26T11:44:52.094-0700 I REPL     [rsBackgroundSync] beginning rollback
2017-09-26T11:44:52.094-0700 I REPL     [rsBackgroundSync] rollback 1
2017-09-26T11:44:52.102-0700 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
2017-09-26T11:44:52.106-0700 I -        [conn55037] AssertionException handling request, closing client connection: 172 Operation attempted on a closed transport Session.
2017-09-26T11:44:52.106-0700 I -        [conn55028] AssertionException handling request, closing client connection: 172 Operation attempted on a closed transport Session.
2017-09-26T11:44:52.106-0700 I -        [conn55037] end connection 10.0.6.141:44957 (4 connections now open)
2017-09-26T11:44:52.106-0700 I -        [conn55028] end connection 10.0.6.141:44944 (4 connections now open)
2017-09-26T11:44:52.108-0700 I REPL     [rsBackgroundSync] rollback our last optime:   Sep 26 09:00:18:7cd
2017-09-26T11:44:52.108-0700 I REPL     [rsBackgroundSync] rollback their last optime: Sep 26 11:44:51:3e8
2017-09-26T11:44:52.108-0700 I REPL     [rsBackgroundSync] rollback diff in end of log times: -9873 seconds
2017-09-26T11:44:58.672-0700 F REPL     [rsBackgroundSync] rollback error RS100 reached beginning of remote oplog
2017-09-26T11:44:58.673-0700 I REPL     [rsBackgroundSync]   them:      mng01.production:27017 (10.0.6.103)
2017-09-26T11:44:58.673-0700 I REPL     [rsBackgroundSync]   theirTime: Sep 26 10:51:58 59ca93be:341
2017-09-26T11:44:58.676-0700 I REPL     [rsBackgroundSync] rollback finished
2017-09-26T11:44:58.676-0700 F REPL     [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: need to rollback, but unable to determine common point between local and remote oplog: NoMatchingDocument: RS100 reached beginning of remote oplog [1] @ 18752
2017-09-26T11:44:58.676-0700 I -        [rsBackgroundSync] Fatal Assertion 28723 at src/mongo/db/repl/rs_rollback.cpp 915
2017-09-26T11:44:58.676-0700 I -        [rsBackgroundSync]
 
***aborting after fassert() failure



 Comments   
Comment by Bogdan Guban [ 28/Sep/17 ]

Hi @thomas.schubert, thanks a lot for your answer. You saved a lot of my time. But it will be great to have better description of problem in log file and maybe a link to solution.

Thanks

Comment by Kelsey Schubert [ 27/Sep/17 ]

Hi bguban

This error indicates that the secondary was unable to stay within the oplog replication window. I would suggest increasing the size of your oplog collection. To clarify, it's expected that the mongod fasserts when it falls off the oplog of the primary. This "loudly" indicates that manual intervention is required to resync the node.

Please note that SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag. A question like this involving more discussion would be best posted on the mongodb-users group.

Kind regards,
Kelsey

Generated at Thu Feb 08 04:26:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.