[SERVER-18816] rollback messages in logs Created: 03/Jun/15  Updated: 08/Jun/15  Resolved: 08/Jun/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Van Pham Assignee: Sam Kleinman (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-18817 rollback problem Closed
Participants:

 Description   

We start seeing the following messages in the logs and was just wondering why it's happening:

2015-05-21T10:07:24.053-0700 I REPL     [ReplicationExecutor] Error in heartbeat request to ir-mngs3-l002w.sfoprod.local:27017; Location18915 Failed attempt to connect to ir-mngs3-l002w.sfoprod.local:27017; couldn't connect to server ir-mngs3-l002w.sfoprod.local:27017 (172.16.128.38), connection attempt failed
2015-05-21T10:07:25.440-0700 W NETWORK  [LockPinger] Failed to connect to 172.16.128.61:27017 after 5000 milliseconds, giving up.
2015-05-21T10:07:25.440-0700 I NETWORK  [LockPinger] SyncClusterConnection connect fail to: ir-mngc-l002w.sfoprod.local:27017 errmsg: couldn't connect to server ir-mngc-l002w.sfoprod.local:27017 (172.16.128.61), connection attempt failed
2015-05-21T10:07:25.440-0700 I NETWORK  [LockPinger] SyncClusterConnection connecting to [ir-mngc-l002c.sfoprod.local:27017]
2015-05-21T10:07:25.443-0700 I NETWORK  [LockPinger] SyncClusterConnection connecting to [ir-mngc-l002e.sfoprod.local:27017]
2015-05-21T10:07:25.480-0700 I NETWORK  [LockPinger] trying reconnect to ir-mngc-l002w.sfoprod.local:27017 (172.16.128.61) failed
2015-05-21T10:07:27.139-0700 I REPL     [ReplicationExecutor] syncing from: ir-mngs3-l002e.sfoprod.local:27017
2015-05-21T10:07:27.208-0700 I REPL     [SyncSourceFeedback] replset setting syncSourceFeedback to ir-mngs3-l002e.sfoprod.local:27017
2015-05-21T10:07:27.309-0700 I REPL     [rsBackgroundSync] replSet our last op time fetched: May 21 10:06:50:9f
2015-05-21T10:07:27.309-0700 I REPL     [rsBackgroundSync] replset source's GTE: May 21 10:07:25:1
2015-05-21T10:07:27.309-0700 I REPL     [rsBackgroundSync] beginning rollback
2015-05-21T10:07:27.309-0700 I REPL     [rsBackgroundSync] rollback 0
2015-05-21T10:07:27.310-0700 I REPL     [ReplicationExecutor] transition to ROLLBACK
2015-05-21T10:07:27.310-0700 I REPL     [rsBackgroundSync] rollback 1
2015-05-21T10:07:27.310-0700 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
2015-05-21T10:07:27.379-0700 I REPL     [rsBackgroundSync] replSet info rollback our last optime:   May 21 10:06:50:9f
2015-05-21T10:07:27.379-0700 I REPL     [rsBackgroundSync] replSet info rollback their last optime: May 21 10:07:25:14
2015-05-21T10:07:27.379-0700 I REPL     [rsBackgroundSync] replSet info rollback diff in end of log times: -35 seconds
2015-05-21T10:07:27.434-0700 I REPL     [rsBackgroundSync] replSet rollback found matching events at May 21 10:06:29:255
2015-05-21T10:07:27.434-0700 I REPL     [rsBackgroundSync] replSet rollback findcommonpoint scanned : 8873
2015-05-21T10:07:27.434-0700 I REPL     [rsBackgroundSync] replSet rollback 3 fixup
2015-05-21T10:07:28.628-0700 W NETWORK  [ReplicaSetMonitorWatcher] Failed to connect to 172.16.128.36:27017, reason: errno:115 Operation now in progress
2015-05-21T10:07:28.629-0700 W NETWORK  [ReplicaSetMonitorWatcher] No primary detected for set s2
2015-05-21T10:07:29.049-0700 W NETWORK  [ReplExecNetThread-130] Failed to connect to 172.16.128.38:27017, reason: errno:115 Operation now in progress
2015-05-21T10:07:29.049-0700 I REPL     [ReplicationExecutor] Error in heartbeat request to ir-mngs3-l002w.sfoprod.local:27017; Location18915 Failed attempt to connect to ir-mngs3-l002w.sfoprod.local:27017; couldn't connect to server ir-mngs3-l002w.sfoprod.local:27017 (172.16.128.38), connection attempt failed
2015-05-21T10:07:29.672-0700 I NETWORK  [initandlisten] connection accepted from 172.17.128.94:44923 #9766 (38 connections now open)
2015-05-21T10:07:30.484-0700 W NETWORK  [LockPinger] Failed to connect to 172.16.128.61:27017 after 5000 milliseconds, giving up.
2015-05-21T10:07:30.484-0700 I NETWORK  [LockPinger] reconnect ir-mngc-l002w.sfoprod.local:27017 (172.16.128.61) failed failed couldn't connect to server ir-mngc-l002w.sfoprod.local:27017 (172.16.128.61), connection attempt failed
2015-05-21T10:07:30.594-0700 I NETWORK  [LockPinger] scoped connection to ir-mngc-l002w.sfoprod.local:27017,ir-mngc-l002c.sfoprod.local:27017,ir-mngc-l002e.sfoprod.local:27017 not being returned to the pool



 Comments   
Comment by Sam Kleinman (Inactive) [ 08/Jun/15 ]

From these logs, it looks like this member, which had been primary, encountered an error connecting to the other members of the replica set. When the member was able to connect again, it found that the new set had elected a new primary and that it had accepted writes for 35 seconds that had not replicated to the new primary. In order to ensure that all members of the sets have the same data, the member enters the ROLLBACK state and reverts any write operations received during that 35 second window.

While MongoDB attempts to avoid rollbacks, they are possible, and are are required to ensure that Replica Set members all have the same data set. You can learn more about roll backs in the MongoDB Manual documentation of Rollbacks.

I'm going to go ahead and close this ticket: Please note that SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-users group or Stack Overflow with the mongodb tag.

Regards,
sam

Generated at Thu Feb 08 03:48:50 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.