[SERVER-60802] Primary node turns to ROLLBACK state permanently Created: 19/Oct/21  Updated: 10/Jun/22  Resolved: 17/Nov/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.15
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Zijun Tian Assignee: Dmitry Agranat
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-60803 Primary node turns to ROLLBACK state ... Closed
Operating System: ALL
Steps To Reproduce:
  1. The primary node goes down while some of its data has not yet replicated to the secondary nodes.
  2. New data is written to the newly elected primary node and replicates to the rest of the replica set.
  3. Restart the former primary node. (A rough shell sketch of this sequence follows below.)
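
A rough shell sketch of the sequence above (the host names, port, and service name are illustrative placeholders, not taken from this report; in practice the write in step 1 only stays unreplicated if the primary is killed or isolated before the secondaries copy it):

    # 1. Write to the current primary (node2) and hard-kill it before the
    #    write can reach a majority of the replica set
    mongo --host node2:27017 --eval 'db.test.insertOne({marker: "unreplicated-write"})'
    sudo kill -9 "$(pgrep mongod)"                      # run on node2

    # 2. Write new data to the newly elected primary (node3) so the two
    #    oplog histories diverge
    mongo --host node3:27017 --eval 'db.test.insertOne({marker: "new-primary-write"})'

    # 3. Restart the former primary; on rejoining it should enter ROLLBACK
    #    to undo the write that never reached a majority
    sudo systemctl start mongod                         # run on node2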
Participants:

 Description   

We have a self-managed MongoDB cluster hosted on AWS, containing 1 primary node and 2 secondary nodes, running on 3 r5 EC2 instances. Due to some heavy workloads, the primary node's memory utilization reached 100% and the instance crashed.

After rebooting the instance, we restarted MongoDB, and one of the secondary nodes became the primary as expected. The former primary node then entered the ROLLBACK state. We noticed in the docs at https://docs.mongodb.com/manual/core/replica-set-rollbacks/ that this happens when the secondaries cannot keep up with the throughput of operations on the former primary. However, the node got stuck in that state after several rollback files were created under the rollback folder, and after that we did not see any new rollback activity in the log.
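
For reference, a member's state (including a member stuck in ROLLBACK) can be checked from the mongo shell with rs.status(); a minimal sketch, assuming the shell can reach the affected node (the node2 host name is a placeholder):

    # Print each member's name, state, and last applied optime;
    # ROLLBACK is normally a transient state
    mongo --host node2:27017 --eval '
      rs.status().members.forEach(function (m) {
        print(m.name + "  " + m.stateStr + "  optime=" + tojson(m.optimeDate));
      })'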

In the end, we stopped MongoDB, cleared all data on the node, and started it again to perform an initial sync from the replica set.
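
This roughly matches MongoDB's documented procedure for resyncing a member by deleting its data files; a sketch, assuming a systemd-managed mongod and the default packaged dbPath (/var/lib/mongo is an assumption, check storage.dbPath in /etc/mongod.conf before deleting anything):

    # Stop the stuck member, wipe its data files, and restart it so it
    # performs an initial sync from the rest of the replica set
    sudo systemctl stop mongod
    sudo rm -rf /var/lib/mongo/*        # dbPath is an assumption, verify first
    sudo systemctl start mongod

    # The member should pass through STARTUP2 during the initial sync and
    # return to SECONDARY once it has caught up
    mongo --host node2:27017 --eval 'rs.status().members.forEach(function (m) { print(m.name + "  " + m.stateStr); })'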



 Comments   
Comment by Dmitry Agranat [ 03/Nov/21 ]

Hi zijun.tian@tusimple.ai, after looking at the logs, it is not clear why the rollback took so long, but there are some hints. For example, there are indications of networking and storage related issues. If a node is struggling to get resources from the OS to complete its tasks (in this case, a ROLLBACK), the task might take a long time or remain stuck until the requested resource becomes available.

Unfortunately, we do not have the diagnostic.data to either prove or refute this assumption, but based on the limited information we see in the logs, this might indeed be the case.

Comment by Zijun Tian [ 29/Oct/21 ]

Hi, I uploaded 3 log files (node1.log, node2.log, node3.log)

Logs are from 2021-10-18T19:30:00 - 2021-10-19T11:30:00 (UTC)

Node1 was the secondary node.

Node2 was the previous primary and went down around 2021-10-18T19:55:00. We tried to recover it from 2021-10-18T19:55:00 to 2021-10-19T01:30:00; after that failed, we deleted all its data and synced it from scratch.

Node3 was the secondary node, and became primary after Node2 went down.

Comment by Dmitry Agranat [ 28/Oct/21 ]

zijun.tian@tusimple.ai It might be more challenging to diagnose this issue without the diagnostic.data, but we can try. You can upload logs (make sure they cover the time of the event) from all members into this secure portal. Please make sure to mention the time and the time zone of the event.

Comment by Zijun Tian [ 28/Oct/21 ]

Hi, I still have mongod logs but no diagnostic data.

Comment by Dmitry Agranat [ 28/Oct/21 ]

zijun.tian@tusimple.ai, so just to confirm, you no longer have mongod logs from all members covering the time of the reported event?

Comment by Zijun Tian [ 27/Oct/21 ]

Hi Dmitry, we only have the mongod logs at this moment, since we deleted the whole data directory and synced the db from scratch.

Comment by Dmitry Agranat [ 27/Oct/21 ]

Hi zijun.tian@tusimple.ai, in order to understand what happened during the reported event, we'll need to review mongod logs and diagnostic.data from all members of this replica set covering the time of the incident. Please let us know if you still have this data and we'll provide you with a link to a secure uploader.
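
For anyone hitting a similar situation, the two artifacts requested here are the mongod log and the diagnostic.data (FTDC) directory, which lives inside the node's dbPath; a minimal collection sketch, assuming the default packaged paths (/var/log/mongodb/mongod.log and /var/lib/mongo are assumptions, check /etc/mongod.conf):

    # Bundle the mongod log and the FTDC metrics for upload; the
    # diagnostic.data directory can be copied while mongod is running
    tar czf node-diagnostics.tar.gz \
        /var/log/mongodb/mongod.log \
        /var/lib/mongo/diagnostic.data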
