[SERVER-16323] Server does not recover after a rollback (mms-onprem) Created: 25/Nov/14  Updated: 08/Jan/15  Resolved: 08/Jan/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.8.0-rc1
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: John Morales
Assignee: Eric Milkie
Resolution: Incomplete
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS
Attachments: gdb.threads.log, mongod.log
Issue Links:
Related
Operating System: ALL
Participants:

 Description   

After a rollback starts, the operation appears to get stuck: it has made no forward progress after more than two hours.

Restarting the server makes it retry the rollback, with the same result (mongod stays up and accepts connections, but never finishes the rollback).

GDB thread dump and (trimmed) mongod log attached.



 Comments   
Comment by Eric Milkie [ 08/Jan/15 ]

The rollback hung waiting for a response from a remote node. More information will be collected if it happens again.

Comment by Eric Milkie [ 25/Nov/14 ]

According to the log, it attempted to roll back twice. The first attempt failed because it detected that there was too much data to roll back. It can only discover this after fetching ops from the sync source.
After this failure occurs, the code tries to roll back again. This time, while fetching ops, the query hung waiting for a response from the sync source. Curiously, the recv call was also interrupted with EINTR, which is a pretty rare occurrence. However, I don't think it's related to the problem.

It would be helpful if we could look at the sync source onprem-1.vpc1.10gen.cc:27017 and see if we can find the rollback operation still in progress. It would look like a simple findOne of a recently written op, with a source of onprem-2.vpc1.10gen.cc.
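As a rough illustration of that check, here is a minimal Python sketch that filters the `inprog` array returned by `db.currentOp()` on the sync source for a long-running query against the oplog coming from the rolling-back node. The helper name `find_rollback_fetches`, the `min_secs` threshold, and the sample addresses are hypothetical. It also assumes the `client` field holds an `ip:port` string, so the address of onprem-2.vpc1.10gen.cc would need to be resolved first.

```python
def find_rollback_fetches(inprog, client_prefix, min_secs=0):
    """Return ops that look like a rollback fetch from the given client.

    `inprog` is a list of operation documents shaped like the entries of
    db.currentOp().inprog. `client_prefix` is the (assumed) "ip:" prefix
    of the node that is rolling back.
    """
    matches = []
    for op in inprog:
        # The rollback reads ops from the sync source's oplog.
        if op.get("ns") != "local.oplog.rs":
            continue
        # A findOne shows up as a plain query, not a getmore.
        if op.get("op") != "query":
            continue
        if not str(op.get("client", "")).startswith(client_prefix):
            continue
        # Only keep ops that have been running at least min_secs.
        if op.get("secs_running", 0) < min_secs:
            continue
        matches.append(op)
    return matches


# Hypothetical sample data; real output has more fields.
sample = [
    {"opid": 101, "op": "query", "ns": "local.oplog.rs",
     "client": "10.0.0.2:51234", "secs_running": 7300},
    {"opid": 102, "op": "getmore", "ns": "local.oplog.rs",
     "client": "10.0.0.3:40112", "secs_running": 2},
    {"opid": 103, "op": "insert", "ns": "mmsdb.data",
     "client": "10.0.0.2:51240", "secs_running": 0},
]

stuck = find_rollback_fetches(sample, "10.0.0.2:", min_secs=60)
print([op["opid"] for op in stuck])
```

If nothing matches on the sync source, that would suggest the query was never received or has already completed there, which points back at the network or at the rolling-back node rather than at the sync source.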
