[SERVER-22122] Rollback stuck after index creation just before failover Created: 11/Jan/16  Updated: 11/Mar/16  Resolved: 11/Mar/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.0.5
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Linar Savion Assignee: Kelsey Schubert
Resolution: Done Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File primary.log, Text File secondary.log
Issue Links:
Related
related to SERVER-22179 Log progress during rollback 3 fixup Closed
Operating System: ALL
Participants:

 Description   

MongoDB v3.0.5 replica set (2 servers, 1 arbiter)

After several indexes were created, and with very little replication lag (about 5 minutes), a failover occurred. The secondary was then stuck in the ROLLBACK state with the last log line showing 'rollback 3 fixup'. The only fix seems to be restoring from a previous backup and then re-syncing.
Note the warnings in the log:

2016-01-11T14:23:29.914+0000 I REPL     [rsBackgroundSync] replSet our last op time fetched: Jan 11 13:45:14:1
2016-01-11T14:23:29.914+0000 I REPL     [rsBackgroundSync] replset source's GTE: Jan 11 13:49:34:1
2016-01-11T14:23:29.914+0000 I REPL     [rsBackgroundSync] beginning rollback
2016-01-11T14:23:29.914+0000 I REPL     [rsBackgroundSync] rollback 0
2016-01-11T14:23:29.914+0000 I REPL     [ReplicationExecutor] transition to ROLLBACK
2016-01-11T14:23:29.914+0000 I REPL     [rsBackgroundSync] rollback 1
2016-01-11T14:23:29.914+0000 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
2016-01-11T14:23:30.273+0000 I REPL     [rsBackgroundSync] replSet info rollback our last optime:   Jan 11 13:45:14:1
2016-01-11T14:23:30.273+0000 I REPL     [rsBackgroundSync] replSet info rollback their last optime: Jan 11 14:23:15:1
2016-01-11T14:23:30.273+0000 I REPL     [rsBackgroundSync] replSet info rollback diff in end of log times: -2281 seconds
2016-01-11T14:23:38.430+0000 W REPL     [rsBackgroundSync] ignoring op with no _id during rollback. ns: JER_DB.system.indexes, document: { ts: Timestamp 1452519153000|2, h: -6306293468986696903, v: 2, op: "i", ns: "JER_DB.system.indexes", o: { name: "uuid_state_1", ns: "JER_DB.Celery_log", unique: true, key: { uuid: 1, state: 1 } } }
2016-01-11T14:23:38.436+0000 W REPL     [rsBackgroundSync] replSet info rollback of dropIndexes is slow in this version of mongod
2016-01-11T14:23:38.445+0000 W REPL     [rsBackgroundSync] ignoring op with no _id during rollback. ns: JER_DB.system.indexes, document: { ts: Timestamp 1452518828000|1, h: -4918113476754073791, v: 2, op: "i", ns: "JER_DB.system.indexes", o: { name: "uuid_1", ns: "JER_DB.Celery_log", key: { uuid: 1 } } }
2016-01-11T14:23:38.986+0000 W REPL     [rsBackgroundSync] ignoring op with no _id during rollback. ns: JER_SIGNALS.system.indexes, document: { ts: Timestamp 1452518673000|4, h: -8456178278876933663, v: 2, op: "i", ns: "JER_SIGNALS.system.indexes", o: { ns: "JER_SIGNALS.test_ndfd_weather_forecast_", name: "ts_1", key: { ts: 1 } } }
2016-01-11T14:23:38.986+0000 W REPL     [rsBackgroundSync] ignoring op with no _id during rollback. ns: JER_SIGNALS.system.indexes, document: { ts: Timestamp 1452518673000|3, h: 1746040611747149689, v: 2, op: "i", ns: "JER_SIGNALS.system.indexes", o: { ns: "JER_SIGNALS.test_ndfd_weather_forecast_", unique: true, name: "pid_ts_1_1", key: { pid: 1, ts: 1 } } }
2016-01-11T14:23:41.113+0000 I REPL     [rsBackgroundSync] replSet rollback found matching events at Jan 11 13:12:04:b0
2016-01-11T14:23:41.113+0000 I REPL     [rsBackgroundSync] replSet rollback findcommonpoint scanned : 105683
2016-01-11T14:23:41.144+0000 I REPL     [rsBackgroundSync] replSet rollback 3 fixup
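
For reference, a minimal mongo shell sketch (assuming a direct connection to the affected node) to confirm the node is still sitting in ROLLBACK rather than having crashed:

// Sketch only: check this node's replication state and last applied optime.
var status = db.adminCommand({ replSetGetStatus: 1 });
var me = status.members.filter(function (m) { return m.self; })[0];
print("state: " + me.stateStr + ", optimeDate: " + me.optimeDate + ", syncingTo: " + (status.syncingTo || "n/a"));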



 Comments   
Comment by Kelsey Schubert [ 10/Feb/16 ]

Hi jpfourny,

Thank you for posting your observations. Rollbacks may take over an hour to complete if nodes are under heavy load. Please be aware that if a mongod receives a clean shutdown command, the rollback process will complete before the node shuts down.

Since nodes may remain in rollback 3 fixup for some time, I have opened SERVER-22179 to periodically log progress during this stage. This improvement should clarify when a mongod is progressing through the rollback process as expected.
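
As a rough illustration (not a prescription), a clean shutdown can be requested from a mongo shell connected to the node; with a rollback in progress, the node finishes the rollback before exiting:

// Sketch: ask the node for a clean shutdown via the admin database.
db.getSiblingDB("admin").shutdownServer();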

Kind regards,
Thomas

Comment by Joseph Fourny [ 27/Jan/16 ]

Hi,

I hit the exact same problem that Linar did with MongoDB 3.0.8. I have a 5-node, distributed replica set with 2 nodes in Dallas (Texas), 2 nodes in Toronto (Canada), and an arbiter (also in Toronto, but we pretend it is offsite). The workload was a stress test involving 1 database and 1 collection. Our test driver has many writers that insert and remove documents (about 1 MB each) and a number of readers that count those documents. There are no indexes (aside from the implicit one on the _id field).

We wanted to see how failover behaves when all nodes are stressed (load averages > 1). Once the workload saturated the primary and secondaries, we pulled the plug on both nodes in Toronto. Within 10 seconds, a node in Dallas was elected. We kept the Toronto nodes offline for a few minutes and then brought them back. They both transitioned to the ROLLBACK state and remained there indefinitely (over an hour). There was basically zero activity on those boxes. After inspecting the /rollback directory, we found only about 3 BSON files, and according to the replica set status, those nodes were around 800 seconds behind. We could not find any messages like "replSet too much data to roll back" in the mongod logs on either of the Toronto nodes.
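
As a side note, the "around 800 seconds behind" figure can be read straight out of rs.status(); a small sketch (the host name below is the node from this cluster that was stuck in ROLLBACK):

// Sketch: compute the lag, in seconds, between the ROLLBACK member and the current primary.
var s = rs.status();
var primary = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
var stuck = s.members.filter(function (m) { return m.name === "tor02.mongohadr.ba.ibm.local:27017"; })[0];
print("lag (s): " + (primary.optimeDate - stuck.optimeDate) / 1000);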

While in this ROLLBACK state, the mongod service would not shut down cleanly. After forcing it to die and bringing it back, the following entries appear in the log:

2016-01-27T15:26:50.279-0600 I CONTROL  ***** SERVER RESTARTED *****
2016-01-27T15:26:50.285-0600 I CONTROL  [initandlisten] MongoDB starting : pid=32235 port=27017 dbpath=/var/lib/mongodb 64-bit host=tor02
2016-01-27T15:26:50.285-0600 I CONTROL  [initandlisten] db version v3.0.8
2016-01-27T15:26:50.285-0600 I CONTROL  [initandlisten] git version: 83d8cc25e00e42856924d84e220fbe4a839e605d
2016-01-27T15:26:50.285-0600 I CONTROL  [initandlisten] build info: Linux ip-10-187-89-126 3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
2016-01-27T15:26:50.285-0600 I CONTROL  [initandlisten] allocator: tcmalloc
2016-01-27T15:26:50.285-0600 I CONTROL  [initandlisten] options: { config: "/etc/mongod.conf", net: { port: 27017 }, replication: { replSetName: "wa-hadr" }, storage: { dbPath: "/var/lib/mongodb", journal: { enabled: true }, mmapv1: { smallFiles: true } }, systemLog: { destination: "file", logAppend: true, path: "/var/log/mongodb/mongod.log" } }
2016-01-27T15:26:50.309-0600 W -        [initandlisten] Detected unclean shutdown - /var/lib/mongodb/mongod.lock is not empty.
2016-01-27T15:26:50.333-0600 I JOURNAL  [initandlisten] journal dir=/var/lib/mongodb/journal
2016-01-27T15:26:50.333-0600 I JOURNAL  [initandlisten] recover begin
2016-01-27T15:26:50.333-0600 I JOURNAL  [initandlisten] info no lsn file in journal/ directory
2016-01-27T15:26:50.333-0600 I JOURNAL  [initandlisten] recover lsn: 0
2016-01-27T15:26:50.333-0600 I JOURNAL  [initandlisten] recover /var/lib/mongodb/journal/j._0
2016-01-27T15:26:50.343-0600 I JOURNAL  [initandlisten] recover cleaning up
2016-01-27T15:26:50.343-0600 I JOURNAL  [initandlisten] removeJournalFiles
2016-01-27T15:26:50.344-0600 I JOURNAL  [initandlisten] recover done
2016-01-27T15:26:50.467-0600 I JOURNAL  [durability] Durability thread started
2016-01-27T15:26:50.467-0600 I JOURNAL  [journal writer] Journal writer thread started
2016-01-27T15:26:50.699-0600 I NETWORK  [initandlisten] waiting for connections on port 27017
2016-01-27T15:26:50.838-0600 I REPL     [ReplicationExecutor] New replica set config in use: { _id: "wa-hadr", version: 10, members: [ { _id: 1, host: "tor02.mongohadr.ba.ibm.local:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 2.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "tor01.mongohadr.ba.ibm.local:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 3, host: "dal01.mongohadr.ba.ibm.local:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 5, host: "dal02.mongohadr.ba.ibm.local:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 6, host: "tor03.mongohadr.ba.ibm.local:27017", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatTimeoutSecs: 10, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 } } }
2016-01-27T15:26:50.838-0600 I REPL     [ReplicationExecutor] This node is tor02.mongohadr.ba.ibm.local:27017 in the config
2016-01-27T15:26:50.838-0600 I REPL     [ReplicationExecutor] transition to STARTUP2
2016-01-27T15:26:50.838-0600 I REPL     [ReplicationExecutor] Starting replication applier threads
2016-01-27T15:26:50.840-0600 I REPL     [ReplicationExecutor] transition to RECOVERING
2016-01-27T15:26:50.840-0600 I REPL     [ReplicationExecutor] Member tor01.mongohadr.ba.ibm.local:27017 is now in state PRIMARY
2016-01-27T15:26:50.841-0600 I REPL     [ReplicationExecutor] Member tor03.mongohadr.ba.ibm.local:27017 is now in state ARBITER
2016-01-27T15:26:50.842-0600 I REPL     [ReplicationExecutor] transition to SECONDARY
2016-01-27T15:26:50.887-0600 I NETWORK  [initandlisten] connection accepted from 158.85.123.12:35141 #1 (1 connection now open)
2016-01-27T15:26:50.903-0600 I REPL     [ReplicationExecutor] Member dal01.mongohadr.ba.ibm.local:27017 is now in state SECONDARY
2016-01-27T15:26:50.904-0600 I REPL     [ReplicationExecutor] Member dal02.mongohadr.ba.ibm.local:27017 is now in state SECONDARY
2016-01-27T15:26:51.289-0600 I NETWORK  [initandlisten] connection accepted from 10.41.72.12:38122 #2 (2 connections now open)
2016-01-27T15:26:52.279-0600 I NETWORK  [initandlisten] connection accepted from 158.85.123.6:40460 #3 (3 connections now open)
2016-01-27T15:26:52.666-0600 I NETWORK  [initandlisten] connection accepted from 10.41.72.32:40157 #4 (4 connections now open)
2016-01-27T15:26:53.332-0600 I NETWORK  [initandlisten] connection accepted from 10.1.232.114:55462 #5 (5 connections now open)
2016-01-27T15:26:53.840-0600 I REPL     [ReplicationExecutor] syncing from: tor01.mongohadr.ba.ibm.local:27017
2016-01-27T15:26:53.944-0600 W REPL     [rsBackgroundSync] we are too stale to use tor01.mongohadr.ba.ibm.local:27017 as a sync source
2016-01-27T15:26:53.944-0600 I REPL     [ReplicationExecutor] syncing from: dal01.mongohadr.ba.ibm.local:27017
2016-01-27T15:26:54.014-0600 I REPL     [SyncSourceFeedback] replset setting syncSourceFeedback to dal01.mongohadr.ba.ibm.local:27017
2016-01-27T15:27:11.472-0600 I REPL     [rsBackgroundSync] replSet our last op time fetched: Jan 27 14:26:25:1b2
2016-01-27T15:27:11.472-0600 I REPL     [rsBackgroundSync] replset source's GTE: Jan 27 14:27:09:1
2016-01-27T15:27:11.472-0600 I REPL     [rsBackgroundSync] beginning rollback
2016-01-27T15:27:11.472-0600 I REPL     [rsBackgroundSync] rollback 0
2016-01-27T15:27:11.472-0600 I REPL     [ReplicationExecutor] transition to ROLLBACK
2016-01-27T15:27:11.472-0600 I REPL     [rsBackgroundSync] rollback 1
2016-01-27T15:27:11.473-0600 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
2016-01-27T15:27:11.542-0600 I REPL     [rsBackgroundSync] replSet info rollback our last optime:   Jan 27 14:26:25:1b2
2016-01-27T15:27:11.542-0600 I REPL     [rsBackgroundSync] replSet info rollback their last optime: Jan 27 14:39:57:ba
2016-01-27T15:27:11.542-0600 I REPL     [rsBackgroundSync] replSet info rollback diff in end of log times: -812 seconds
2016-01-27T15:27:20.896-0600 I NETWORK  [conn1] end connection 158.85.123.12:35141 (4 connections now open)
2016-01-27T15:27:20.896-0600 I NETWORK  [initandlisten] connection accepted from 158.85.123.12:35145 #6 (5 connections now open)
2016-01-27T15:27:21.774-0600 I NETWORK  [conn2] end connection 10.41.72.12:38122 (4 connections now open)
2016-01-27T15:27:21.810-0600 I NETWORK  [initandlisten] connection accepted from 10.41.72.12:38126 #7 (5 connections now open)
2016-01-27T15:27:22.289-0600 I NETWORK  [conn3] end connection 158.85.123.6:40460 (4 connections now open)
2016-01-27T15:27:22.290-0600 I NETWORK  [initandlisten] connection accepted from 158.85.123.6:40464 #8 (5 connections now open)
2016-01-27T15:27:23.163-0600 I NETWORK  [conn4] end connection 10.41.72.32:40157 (4 connections now open)
2016-01-27T15:27:23.195-0600 I NETWORK  [initandlisten] connection accepted from 10.41.72.32:40161 #9 (5 connections now open)
2016-01-27T15:27:50.908-0600 I NETWORK  [conn6] end connection 158.85.123.12:35145 (4 connections now open)
2016-01-27T15:27:50.909-0600 I NETWORK  [initandlisten] connection accepted from 158.85.123.12:35149 #10 (5 connections now open)
2016-01-27T15:27:52.302-0600 I NETWORK  [conn8] end connection 158.85.123.6:40464 (4 connections now open)
2016-01-27T15:27:52.328-0600 I NETWORK  [initandlisten] connection accepted from 158.85.123.6:40468 #11 (5 connections now open)
2016-01-27T15:27:52.354-0600 I NETWORK  [conn7] end connection 10.41.72.12:38126 (4 connections now open)
2016-01-27T15:27:52.387-0600 I NETWORK  [initandlisten] connection accepted from 10.41.72.12:38131 #12 (5 connections now open)
2016-01-27T15:27:53.699-0600 I NETWORK  [conn9] end connection 10.41.72.32:40161 (4 connections now open)
2016-01-27T15:27:53.731-0600 I NETWORK  [initandlisten] connection accepted from 10.41.72.32:40165 #13 (5 connections now open)
2016-01-27T15:28:20.917-0600 I NETWORK  [conn10] end connection 158.85.123.12:35149 (4 connections now open)
2016-01-27T15:28:20.918-0600 I NETWORK  [initandlisten] connection accepted from 158.85.123.12:35153 #14 (5 connections now open)
2016-01-27T15:28:22.335-0600 I NETWORK  [conn11] end connection 158.85.123.6:40468 (4 connections now open)
2016-01-27T15:28:22.336-0600 I NETWORK  [initandlisten] connection accepted from 158.85.123.6:40472 #15 (5 connections now open)
2016-01-27T15:28:22.875-0600 I NETWORK  [conn12] end connection 10.41.72.12:38131 (4 connections now open)
2016-01-27T15:28:22.907-0600 I NETWORK  [initandlisten] connection accepted from 10.41.72.12:38135 #16 (5 connections now open)
2016-01-27T15:28:24.236-0600 I NETWORK  [conn13] end connection 10.41.72.32:40165 (4 connections now open)
2016-01-27T15:28:24.268-0600 I NETWORK  [initandlisten] connection accepted from 10.41.72.32:40169 #17 (5 connections now open)
2016-01-27T15:28:50.928-0600 I NETWORK  [conn14] end connection 158.85.123.12:35153 (4 connections now open)
2016-01-27T15:28:50.928-0600 I NETWORK  [initandlisten] connection accepted from 158.85.123.12:35157 #18 (5 connections now open)
2016-01-27T15:28:52.352-0600 I NETWORK  [conn15] end connection 158.85.123.6:40472 (4 connections now open)
2016-01-27T15:28:52.355-0600 I NETWORK  [initandlisten] connection accepted from 158.85.123.6:40476 #19 (5 connections now open)
2016-01-27T15:28:53.387-0600 I NETWORK  [conn16] end connection 10.41.72.12:38135 (4 connections now open)
2016-01-27T15:28:53.419-0600 I NETWORK  [initandlisten] connection accepted from 10.41.72.12:38139 #20 (5 connections now open)
2016-01-27T15:28:54.768-0600 I NETWORK  [conn17] end connection 10.41.72.32:40169 (4 connections now open)
2016-01-27T15:28:54.801-0600 I NETWORK  [initandlisten] connection accepted from 10.41.72.32:40173 #21 (5 connections now open)
2016-01-27T15:28:57.829-0600 I REPL     [rsBackgroundSync] replSet rollback found matching events at Jan 27 14:23:51:8
2016-01-27T15:28:57.829-0600 I REPL     [rsBackgroundSync] replSet rollback findcommonpoint scanned : 790428
2016-01-27T15:28:57.932-0600 I REPL     [rsBackgroundSync] replSet rollback 3 fixup

As you can see, it is once again stuck after "replSet rollback 3 fixup" is emitted.

Below is the rs.status() from the node stuck in ROLLBACK.

{
        "set" : "wa-hadr",
        "date" : ISODate("2016-01-27T21:37:41.119Z"),
        "myState" : 9,
        "syncingTo" : "dal01.mongohadr.ba.ibm.local:27017",
        "members" : [
                {
                        "_id" : 1,
                        "name" : "tor02.mongohadr.ba.ibm.local:27017",
                        "health" : 1,
                        "state" : 9,
                        "stateStr" : "ROLLBACK",
                        "uptime" : 651,
                        "optime" : Timestamp(1453926385, 434),
                        "optimeDate" : ISODate("2016-01-27T20:26:25Z"),
                        "syncingTo" : "dal01.mongohadr.ba.ibm.local:27017",
                        "configVersion" : 10,
                        "self" : true
                },
                {
                        "_id" : 2,
                        "name" : "tor01.mongohadr.ba.ibm.local:27017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 650,
                        "optime" : Timestamp(1453927197, 186),
                        "optimeDate" : ISODate("2016-01-27T20:39:57Z"),
                        "lastHeartbeat" : ISODate("2016-01-27T21:37:39.297Z"),
                        "lastHeartbeatRecv" : ISODate("2016-01-27T21:37:39.297Z"),
                        "pingMs" : 0,
                        "electionTime" : Timestamp(1453928225, 1),
                        "electionDate" : ISODate("2016-01-27T20:57:05Z"),
                        "configVersion" : 10
                },
                {
                        "_id" : 3,
                        "name" : "dal01.mongohadr.ba.ibm.local:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 650,
                        "optime" : Timestamp(1453927197, 186),
                        "optimeDate" : ISODate("2016-01-27T20:39:57Z"),
                        "lastHeartbeat" : ISODate("2016-01-27T21:37:40.541Z"),
                        "lastHeartbeatRecv" : ISODate("2016-01-27T21:37:40.997Z"),
                        "pingMs" : 33,
                        "lastHeartbeatMessage" : "could not find member to sync from",
                        "configVersion" : 10
                },
                {
                        "_id" : 5,
                        "name" : "dal02.mongohadr.ba.ibm.local:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 650,
                        "optime" : Timestamp(1453927197, 186),
                        "optimeDate" : ISODate("2016-01-27T20:39:57Z"),
                        "lastHeartbeat" : ISODate("2016-01-27T21:37:40.570Z"),
                        "lastHeartbeatRecv" : ISODate("2016-01-27T21:37:40.096Z"),
                        "pingMs" : 38,
                        "lastHeartbeatMessage" : "could not find member to sync from",
                        "configVersion" : 10
                },
                {
                        "_id" : 6,
                        "name" : "tor03.mongohadr.ba.ibm.local:27017",
                        "health" : 1,
                        "state" : 7,
                        "stateStr" : "ARBITER",
                        "uptime" : 650,
                        "lastHeartbeat" : ISODate("2016-01-27T21:37:39.297Z"),
                        "lastHeartbeatRecv" : ISODate("2016-01-27T21:37:40.786Z"                                                                                                                                                             ),
                        "pingMs" : 0,
                        "configVersion" : 10
                }
        ],
        "ok" : 1
}

Now here is the rs.status() from the arbiter node (which never went down during our tests). It sees everything as healthy.

{
        "set" : "wa-hadr",
        "date" : ISODate("2016-01-27T21:39:44.310Z"),
        "myState" : 7,
        "members" : [
                {
                        "_id" : 1,
                        "name" : "tor02.mongohadr.ba.ibm.local:27017",
                        "health" : 1,
                        "state" : 9,
                        "stateStr" : "ROLLBACK",
                        "uptime" : 775,
                        "optime" : Timestamp(1453926385, 434),
                        "optimeDate" : ISODate("2016-01-27T20:26:25Z"),
                        "lastHeartbeat" : ISODate("2016-01-27T21:39:43.386Z"),
                        "lastHeartbeatRecv" : ISODate("2016-01-27T21:39:43.872Z"),
                        "pingMs" : 0,
                        "syncingTo" : "dal01.mongohadr.ba.ibm.local:27017",
                        "configVersion" : 10
                },
                {
                        "_id" : 2,
                        "name" : "tor01.mongohadr.ba.ibm.local:27017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 2633,
                        "optime" : Timestamp(1453927197, 186),
                        "optimeDate" : ISODate("2016-01-27T20:39:57Z"),
                        "lastHeartbeat" : ISODate("2016-01-27T21:39:43.406Z"),
                        "lastHeartbeatRecv" : ISODate("2016-01-27T21:39:43.400Z"),
                        "pingMs" : 0,
                        "electionTime" : Timestamp(1453928225, 1),
                        "electionDate" : ISODate("2016-01-27T20:57:05Z"),
                        "configVersion" : 10
                },
                {
                        "_id" : 3,
                        "name" : "dal01.mongohadr.ba.ibm.local:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 178034,
                        "optime" : Timestamp(1453927197, 186),
                        "optimeDate" : ISODate("2016-01-27T20:39:57Z"),
                        "lastHeartbeat" : ISODate("2016-01-27T21:39:42.705Z"),
                        "lastHeartbeatRecv" : ISODate("2016-01-27T21:39:43.733Z"),
                        "pingMs" : 36,
                        "configVersion" : 10
                },
                {
                        "_id" : 5,
                        "name" : "dal02.mongohadr.ba.ibm.local:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 178034,
                        "optime" : Timestamp(1453927197, 186),
                        "optimeDate" : ISODate("2016-01-27T20:39:57Z"),
                        "lastHeartbeat" : ISODate("2016-01-27T21:39:42.762Z"),
                        "lastHeartbeatRecv" : ISODate("2016-01-27T21:39:43.420Z"),
                        "pingMs" : 35,
                        "configVersion" : 10
                },
                {
                        "_id" : 6,
                        "name" : "tor03.mongohadr.ba.ibm.local:27017",
                        "health" : 1,
                        "state" : 7,
                        "stateStr" : "ARBITER",
                        "uptime" : 178075,
                        "configVersion" : 10,
                        "self" : true
                }
        ],
        "ok" : 1
}

Comment by Kelsey Schubert [ 22/Jan/16 ]

Hi linar-jether,

The replication lag between nodes does not directly impact the amount of time that is spent in replication rollback. A better indicator of the time required would be the number of documents that need to be rolled back. As part of 'rollback 3 fixup', up-to-date versions of these documents are fetched from the rollback source node.

When mongod rolls back a dropped index, the entire collection is resynced from the rollback source. This process lengthens the time spent in the 'rollback 3 fixup' stage.
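
As a rough way to gauge the amount of work, the number of oplog entries past the common point can be counted from the mongo shell. This is only a sketch; the Timestamp below approximates the common point reported in the log above ("found matching events at Jan 11 13:12:04:b0") and should be replaced with the actual value:

// Sketch: count local oplog entries newer than the rollback common point.
rs.slaveOk();  // in case the node is not primary
var local = db.getSiblingDB("local");
var commonPoint = Timestamp(1452517924, 176);  // approx. Jan 11 13:12:04 UTC, increment 0xb0
var opsPastCommonPoint = local.oplog.rs.find({ ts: { $gt: commonPoint } }).count();
print("oplog entries past common point: " + opsPastCommonPoint);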

To ascertain whether having to roll back a dropped index could explain the behavior you observed, please provide the following information (example shell invocations appear after the list):

  • If you know which collection had the dropped index that needed to be rolled back, could you please provide the output of db.collectionName.stats()?
  • If not, could you please provide the output of db.stats()?
  • Additionally, could you please provide the output of rs.status()?
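
Example shell invocations for the above (the database and collection names are taken from the log in the description purely as placeholders):

// Placeholders: substitute the database/collection that had the dropped index.
db.getSiblingDB("JER_DB").Celery_log.stats();   // stats for the affected collection, if known
db.getSiblingDB("JER_DB").stats();              // otherwise, whole-database stats
rs.status();                                    // replica set status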

Thanks,
Thomas

Comment by Linar Savion [ 14/Jan/16 ]

Yes, this was the last entry before the node was re-synced.
It was stuck in the 'rollback 3 fixup' state for about an hour (even though it had only several minutes of operations to roll back), until we manually resynced the node.
Usually the load is about 450 MB per hour, but during that time I believe we lowered it to about 100 MB/hour.

Comment by Kelsey Schubert [ 13/Jan/16 ]

Hi linar-jether,

To make sure we are on the same page, can you confirm that the following entry was the last entry before the node was restarted? If not, could you please attach the rest of primary.log to this ticket?

2016-01-11T14:23:41.144+0000 I REPL     [rsBackgroundSync] replSet rollback 3 fixup

Depending on how much data has to be rolled back, a node may remain in 'rollback 3 fixup' for a number of minutes.

Can you please answer the following questions:

  1. How long was the node in 'rollback 3 fixup' before you manually resynced the node?
  2. How much data do you typically write to your oplog per hour?

If you are able to reproduce this behavior, can you please post the output of mongostat and db.currentOp()?
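
For question 2, the oplog write rate can be approximated from the shell (a sketch using the standard db.getReplicationInfo() helper, which reports how many MB of oplog cover how many seconds of operations):

// Sketch: approximate oplog write volume in MB per hour.
var info = db.getReplicationInfo();
print("approx MB/hour: " + (info.usedMB / (info.timeDiff / 3600)).toFixed(1));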

Thank you,
Thomas

Comment by Linar Savion [ 12/Jan/16 ]

Added both log files up until the event. We had several crashes on the primary earlier due to an unrelated memory issue.

Comment by Ramon Fernandez Marina [ 11/Jan/16 ]

linar-jether, could you please upload the logs for all nodes in this replica set from the last startup until the problem occurred? If you're able to reproduce the issue, the output of db.currentOp() from all nodes will be useful as well. I'm wondering whether the node was busy building the indexes you requested.

Note also that there were a number of issues with MongoDB 3.0.5, so I'd recommend you upgrade to 3.0.8 at your earliest convenience.
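
If it helps, a small sketch for spotting an in-progress index build in the db.currentOp() output (the "msg" check is only a heuristic and may vary between server versions):

// Sketch: list current operations that look like index builds.
db.currentOp(true).inprog.forEach(function (op) {
  if (op.msg && /index/i.test(op.msg)) {
    printjson({ opid: op.opid, ns: op.ns, msg: op.msg, secs_running: op.secs_running });
  }
});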
