[SERVER-9481] Server in replication set fluctuates between states SECONDARY and ROLLBACK if it can't rollback due to too much rollback data Created: 26/Apr/13  Updated: 11/Jul/16  Resolved: 20/Jul/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.1
Fix Version/s: 2.4.6, 2.5.1

Type: Bug Priority: Major - P3
Reporter: Andreas Heck Assignee: Matt Dannenberg
Resolution: Done Votes: 0
Labels: pull-request
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongodb-rollback-loop.patch    
Issue Links:
Duplicate
is duplicated by SERVER-9494 Rollback Throws Continuous Warnings A... Closed
Related
Operating System: ALL
Steps To Reproduce:

Follow this guide http://comerford.cc/2012/05/28/simulating-rollback-on-mongodb/ but use the following MongoDB command to create the data:

for (var i = 0; i <= 100000000000000000; i++) { for (var j = 0; j <= 1000; j++) { db.rollback_test.insert({"a": i * j}); } sleep(100); }

Run this command on the primary until you are confident that you have at least 300 MB of data (e.g. by checking the output of df -h as a rough estimate, or by querying the collection stats as sketched below).
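
A more direct way to estimate how much data has accumulated is to ask the server for the collection's size from the mongo shell. This is only a minimal sketch; the collection name rollback_test matches the reproduction script above, and 300 MB is simply the rough rollback limit discussed in this ticket:

// Sketch: report the uncompressed data size of the test collection
// and compare it against the ~300 MB rollback limit mentioned above.
var stats = db.rollback_test.stats();
var sizeMB = stats.size / (1024 * 1024);
print("rollback_test data size: " + sizeMB.toFixed(1) + " MB");
if (sizeMB >= 300) {
    print("enough data to exceed the 300 MB rollback limit");
}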

Participants:

 Description   

If you encounter a rollback situation in a replica set and the rollback spans more than 300 megabytes or 30 minutes of data, the rollback will fail. Unfortunately it is not clear from rs.status() that the node is in a state it cannot recover from. The node with the rollback problem will try to roll back again and again, which can be seen in the mongodb.log:

Fri Apr 26 13:28:55.957 [rsBackgroundSync] replSet syncing to: 10.128.128.102:27017
Fri Apr 26 13:29:01.030 [rsBackgroundSync] replSet we are ahead of the sync source, will try to roll back
Fri Apr 26 13:29:01.030 [rsBackgroundSync] replSet rollback 0
Fri Apr 26 13:29:01.030 [rsBackgroundSync] replSet ROLLBACK
Fri Apr 26 13:29:01.030 [rsBackgroundSync] replSet rollback 1
Fri Apr 26 13:29:01.030 [rsBackgroundSync] replSet rollback 2 FindCommonPoint
Fri Apr 26 13:29:01.031 [rsBackgroundSync] replSet rollback 2 error findcommonpoint waiting a while before trying again
Fri Apr 26 13:29:01.873 [rsSync] replSet SECONDARY
Fri Apr 26 13:29:10.640 [conn613] end connection 10.128.128.102:53295 (2 connections now open)
Fri Apr 26 13:29:10.641 [initandlisten] connection accepted from 10.128.128.102:53297 #615 (3 connections now open)
Fri Apr 26 13:29:11.031 [rsBackgroundSync] replSet syncing to: 10.128.128.102:27017
Fri Apr 26 13:29:11.388 [conn614] end connection 10.128.128.98:41406 (2 connections now open)
Fri Apr 26 13:29:11.389 [initandlisten] connection accepted from 10.128.128.98:41408 #616 (3 connections now open)
Fri Apr 26 13:29:16.104 [rsBackgroundSync] replSet we are ahead of the sync source, will try to roll back
Fri Apr 26 13:29:16.104 [rsBackgroundSync] replSet rollback 0
Fri Apr 26 13:29:16.105 [rsBackgroundSync] replSet ROLLBACK
Fri Apr 26 13:29:16.105 [rsBackgroundSync] replSet rollback 1
Fri Apr 26 13:29:16.105 [rsBackgroundSync] replSet rollback 2 FindCommonPoint
Fri Apr 26 13:29:16.105 [rsBackgroundSync] replSet rollback 2 error findcommonpoint waiting a while before trying again
Fri Apr 26 13:29:16.876 [rsSync] replSet SECONDARY
Fri Apr 26 13:29:26.105 [rsBackgroundSync] replSet syncing to: 10.128.128.102:27017
Fri Apr 26 13:29:31.172 [rsBackgroundSync] replSet we are ahead of the sync source, will try to roll back
Fri Apr 26 13:29:31.172 [rsBackgroundSync] replSet rollback 0
Fri Apr 26 13:29:31.172 [rsBackgroundSync] replSet ROLLBACK
Fri Apr 26 13:29:31.172 [rsBackgroundSync] replSet rollback 1
Fri Apr 26 13:29:31.172 [rsBackgroundSync] replSet rollback 2 FindCommonPoint
Fri Apr 26 13:29:31.172 [rsBackgroundSync] replSet rollback 2 error findcommonpoint waiting a while before trying again
Fri Apr 26 13:29:31.879 [rsSync] replSet SECONDARY
Fri Apr 26 13:29:40.774 [conn615] end connection 10.128.128.102:53297 (2 connections now open)
Fri Apr 26 13:29:40.774 [initandlisten] connection accepted from 10.128.128.102:53299 #617 (3 connections now open)
Fri Apr 26 13:29:41.173 [rsBackgroundSync] replSet syncing to: 10.128.128.102:27017
Fri Apr 26 13:29:41.412 [conn616] end connection 10.128.128.98:41408 (2 connections now open)
Fri Apr 26 13:29:41.413 [initandlisten] connection accepted from 10.128.128.98:41410 #618 (3 connections now open)
Fri Apr 26 13:29:46.245 [rsBackgroundSync] replSet we are ahead of the sync source, will try to roll back
Fri Apr 26 13:29:46.245 [rsBackgroundSync] replSet rollback 0
Fri Apr 26 13:29:46.245 [rsBackgroundSync] replSet ROLLBACK
Fri Apr 26 13:29:46.245 [rsBackgroundSync] replSet rollback 1
Fri Apr 26 13:29:46.245 [rsBackgroundSync] replSet rollback 2 FindCommonPoint
Fri Apr 26 13:29:46.247 [rsBackgroundSync] replSet info rollback our last optime: Apr 26 10:52:56:f23
Fri Apr 26 13:29:46.248 [rsBackgroundSync] replSet info rollback their last optime: Apr 26 09:47:21:1000
Fri Apr 26 13:29:46.248 [rsBackgroundSync] replSet info rollback diff in end of log times: 3935 seconds
Fri Apr 26 13:29:46.248 [rsBackgroundSync] replSet rollback too long a time period for a rollback.
Fri Apr 26 13:29:46.248 [rsBackgroundSync] replSet rollback 2 error error not willing to roll back more than 30 minutes of data
Fri Apr 26 13:29:46.881 [rsSync] replSet SECONDARY

When you start the mongo shell on a replica set node and repeatedly execute the rs.status() command (for example with the small polling loop sketched below), you will see that the affected secondary goes into the ROLLBACK state for a few seconds at most, only to flip back into SECONDARY until it enters the next iteration of the rollback loop.
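
A minimal sketch of such a polling loop, run from a mongo shell connected to any member of the set (the one-second interval is arbitrary):

// Print each member's state once per second so the SECONDARY/ROLLBACK
// flapping described above becomes visible on the console.
while (true) {
    rs.status().members.forEach(function (m) {
        print(m.name + " : " + m.stateStr);
    });
    print("---");
    sleep(1000);
}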

Actually, I would expect the affected node to stay in ROLLBACK until the rollback succeeds. In a situation like this, where manual intervention is necessary, maybe it would be a good idea for the affected instance to go into the FATAL state.



 Comments   
Comment by auto [ 02/Aug/13 ]

Author:

{u'username': u'dannenberg', u'name': u'Matt Dannenberg', u'email': u'matt.dannenberg@10gen.com'}

Message: SERVER-9481 further rollback exception cleanup
Branch: v2.4
https://github.com/mongodb/mongo/commit/2c332a62c92cc668d407164a1374932885dc0dd3

Comment by auto [ 02/Aug/13 ]

Author:

{u'username': u'aheck', u'name': u'Andreas Heck', u'email': u'aheck@gmx.de'}

Message: SERVER-9481 Fixes fluctuating between ROLLBACK and SECONDARY when ROLLBACK is impossible

Only leave ROLLBACK when successful and go to FATAL when ROLLBACK is impossible
because we are not willing to rollback so much data

Signed-off-by: Matt Dannenberg <matt.dannenberg@10gen.com>
Branch: v2.4
https://github.com/mongodb/mongo/commit/e1656eda9f044dab3207703994d13c5de91e9c71

Comment by auto [ 12/Jun/13 ]

Author:

{u'username': u'dannenberg', u'name': u'Matt Dannenberg', u'email': u'matt.dannenberg@10gen.com'}

Message: SERVER-9481 further rollback exception cleanup
Branch: master
https://github.com/mongodb/mongo/commit/6c3225162bcb5aea62fc476b647e14742f2bd65b

Comment by auto [ 12/Jun/13 ]

Author:

{u'username': u'aheck', u'name': u'Andreas Heck', u'email': u'aheck@gmx.de'}

Message: SERVER-9481 Fixes fluctuating between ROLLBACK and SECONDARY when ROLLBACK is impossible

Only leave ROLLBACK when successful and go to FATAL when ROLLBACK is impossible
because we are not willing to rollback so much data

Signed-off-by: Matt Dannenberg <matt.dannenberg@10gen.com>
Branch: master
https://github.com/mongodb/mongo/commit/e3097872b5837e31179702f1009a976656f8af76

Comment by Matt Kangas [ 10/Jun/13 ]

Verified that contributor agreement has been signed.

Comment by Randolph Tan [ 17/May/13 ]

Hi aheck,

Have you signed the contributor agreement?

Thanks!

Comment by Andreas Heck [ 08/May/13 ]

I patched the MongoDB 2.4.1 server so that it only leaves ROLLBACK after the rollback was successful, and changes to FATAL in case a rollback fails because there is too much data. This prevents the endless loop seen in the log.

Are there any logical problems with that approach or problems with the patch itself?
