[SERVER-16323] Server does not recover after a rollback (mms-onprem) Created: 25/Nov/14 Updated: 08/Jan/15 Resolved: 08/Jan/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.8.0-rc1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | John Morales | Assignee: | Eric Milkie |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
CentOS |
||
| Attachments: |
|
||||
| Issue Links: |
|
||||
| Operating System: | ALL | ||||
| Participants: | |||||
| Description |
|
Upon initiating a rollback, the rollback operation appears to get stuck: it has made no forward progress after more than 2 hours. Restarting the server restarts the rollback with the same result (mongod stays up and accepts connections, but never finishes the rollback). A GDB thread dump and a (trimmed) mongod log are attached. |
| Comments |
| Comment by Eric Milkie [ 08/Jan/15 ] |
|
The node hung waiting for a response from a remote node. More information will be collected if it happens again. |
| Comment by Eric Milkie [ 25/Nov/14 ] |
|
According to the log, it attempted to roll back twice. The first attempt failed because it detected that there was too much data to roll back. It can only discover this after fetching ops from the sync source. It would be helpful if we could look at the sync source onprem-1.vpc1.10gen.cc:27017 and see if we can find the rollback operation still in progress. It would look like a simple findOne of a recently written op, with a source of onprem-2.vpc1.10gen.cc. |
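One way to do the check described above is to capture the sync source's in-progress operations (for example, the output of db.currentOp() in the mongo shell) and filter them offline. The sketch below is only an illustration, not the procedure used on this ticket. It assumes the captured document has an "inprog" list whose entries carry "op", "ns", "client" and "secs_running" fields, that the rollback fetch reads from local.oplog.rs, and that the "client" field holds an IP:port pair, so the hostname onprem-2.vpc1.10gen.cc would first need to be resolved to an address. The function name find_ops_from_host and the 10.0.0.x addresses in the sample are hypothetical.

```python
from typing import Any, Dict, List


def find_ops_from_host(current_op: Dict[str, Any], client_prefix: str,
                       namespace: str = "local.oplog.rs") -> List[Dict[str, Any]]:
    """Return in-progress query ops on `namespace` whose client address
    starts with `client_prefix`, longest-running first."""
    matches = []
    for op in current_op.get("inprog", []):
        if op.get("op") != "query":
            continue  # a findOne shows up as a query, not a write
        if op.get("ns") != namespace:
            continue  # assumption: the rollback reads the oplog collection
        if not str(op.get("client", "")).startswith(client_prefix):
            continue  # keep only ops issued by the rolling-back node
        matches.append(op)
    return sorted(matches, key=lambda op: op.get("secs_running", 0), reverse=True)


# Hypothetical captured output; the addresses and opids are made up.
sample = {
    "inprog": [
        {"opid": 101, "op": "query", "ns": "local.oplog.rs",
         "client": "10.0.0.2:51234", "secs_running": 7200},
        {"opid": 102, "op": "insert", "ns": "test.foo",
         "client": "10.0.0.9:40000", "secs_running": 1},
        {"opid": 103, "op": "query", "ns": "local.oplog.rs",
         "client": "10.0.0.3:51000", "secs_running": 5},
    ]
}

# Trailing colon avoids matching addresses such as 10.0.0.20.
stuck = find_ops_from_host(sample, "10.0.0.2:")
for op in stuck:
    print(op["opid"], op["client"], op["secs_running"])
```

An entry with a large secs_running value from the rolling-back node's address would suggest the fetch is still outstanding on the sync source; an empty result would point back at the node that is rolling back.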