[SERVER-31478] Certain unrecoverable errors don't throw UnrecoverableRollbackError, causing rollback to retry indefinitely Created: 09/Oct/17  Updated: 27/Oct/23  Resolved: 01/Jun/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.5.13
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Robert Guo (Inactive) Assignee: Backlog - Replication Team
Resolution: Gone away Votes: 0
Labels: neweng, rbfz
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File invalid_dbname.js    
Issue Links:
Backports
Depends
Related
related to SERVER-31543 rollback uses non-system temporary na... Closed
is related to SERVER-31488 dropping a drop-pending collection vi... Closed
is related to SERVER-31489 Rollback of applyOps fails if embedde... Closed
Assigned Teams:
Replication
Operating System: ALL
Backport Requested:
v3.6
Sprint: Repl 2017-12-04
Participants:
Linked BF Score: 15

 Description   

Certain errors are not transient but don't bubble up as UnrecoverableRollbackErrors, so rollback's continuously retry logic will hit the same error every time. The error here seems to be related to new testing of applyOps, since all other commands have valid DBNames.

Please see the attached file for a repro.

d20003| 2017-10-09T18:08:53.972-0400 I REPL     [rsBackgroundSync] scheduling fetcher to read remote oplog on robert-macbook-pro.local:20001 starting at filter: { ts: { $gte: Timestamp 1507586707000|12 } }
d20003| 2017-10-09T18:08:53.973-0400 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last op time fetched: { ts: Timestamp 1507586707000|12, t: 1 }. source's GTE: { ts: Timestamp 1507586718000|1, t: 2 } hashes: (-3080319662268009517/5545839650998093505)
d20003| 2017-10-09T18:08:53.973-0400 I REPL     [rsBackgroundSync] Replication commit point: { ts: Timestamp 1507586706000|4, t: 1 }
d20003| 2017-10-09T18:08:53.973-0400 I REPL     [rsBackgroundSync] Rollback using the 'rollbackViaRefetch' method due to startup server parameter.
d20003| 2017-10-09T18:08:53.973-0400 I REPL     [rsBackgroundSync] transition to ROLLBACK
d20003| 2017-10-09T18:08:53.973-0400 I NETWORK  [rsBackgroundSync] Skip closing connection for connection # 35
d20003| 2017-10-09T18:08:53.973-0400 I NETWORK  [rsBackgroundSync] Skip closing connection for connection # 34
d20003| 2017-10-09T18:08:53.974-0400 I NETWORK  [conn77] end connection 10.4.127.25:59759 (2 connections now open)
d20003| 2017-10-09T18:08:53.974-0400 I ROLLBACK [rsBackgroundSync] Starting rollback. Sync source: robert-macbook-pro.local:20001
b20001| 2017-10-09T18:08:53.975-0400 I NETWORK  [main] connection accepted from 10.4.127.25:59760 #70 (1 connection now open)
d20004| 2017-10-09T18:08:53.975-0400 I NETWORK  [listener] connection accepted from 10.4.127.25:59761 #71 (8 connections now open)
d20004| 2017-10-09T18:08:53.975-0400 I NETWORK  [conn71] received client metadata from 10.4.127.25:59761 conn: { driver: { name: "MongoDB Internal Client", version: "3.5.13-325-g155db50-patch-59d98345e3c331107a001198" }, os: { type: "Darwin", name: "Mac OS X", architecture: "x86_64", version: "17.0.0" } }
d20003| 2017-10-09T18:08:53.976-0400 I ROLLBACK [rsBackgroundSync] Finding the Common Point
d20003| 2017-10-09T18:08:53.976-0400 I ROLLBACK [rsBackgroundSync] our last optime:   Oct  9 18:05:07:c
d20003| 2017-10-09T18:08:53.976-0400 I ROLLBACK [rsBackgroundSync] their last optime: Oct  9 18:05:19:34
d20003| 2017-10-09T18:08:53.976-0400 I ROLLBACK [rsBackgroundSync] diff in end of log times: -12 seconds
d20003| 2017-10-09T18:08:53.977-0400 I ROLLBACK [rsBackgroundSync] Rollback common point is { ts: Timestamp 1507586706000|4, t: 1 }
d20003| 2017-10-09T18:08:53.977-0400 I ROLLBACK [rsBackgroundSync] Starting refetching documents
d20003| 2017-10-09T18:08:53.977-0400 D ROLLBACK [rsBackgroundSync] Refetching document, namespace: test.one, _id: _id: "ISKRjIN6"
d20003| 2017-10-09T18:08:53.977-0400 I ROLLBACK [rsBackgroundSync] Finished refetching documents. Total size of documents refetched: 1
d20003| 2017-10-09T18:08:53.977-0400 I ROLLBACK [rsBackgroundSync] Checking the RollbackID and updating the MinValid if necessary
d20003| 2017-10-09T18:08:53.977-0400 I ROLLBACK [rsBackgroundSync] Setting minvalid to { ts: Timestamp 1507586719000|52, t: 2 }
d20003| 2017-10-09T18:08:53.978-0400 I ROLLBACK [rsBackgroundSync] Dropping collections to roll back create operations
d20003| 2017-10-09T18:08:53.978-0400 E -        [rsBackgroundSync] Assertion: 28539:need a valid database name src/mongo/db/concurrency/d_concurrency.cpp 198
d20003| 2017-10-09T18:08:53.982-0400 W ROLLBACK [rsBackgroundSync] Rollback cannot complete at this time (retrying later): Location28539: need a valid database name appliedThrough= { ts: Timestamp 1507586707000|12, t: 1 } minvalid= { ts: Timestamp 1507586719000|52, t: 2 }



 Comments   
Comment by Judah Schvimer [ 01/Jun/18 ]

This ticket really only applied to rollback via refetch. There may be some isolated function calls in RTT that should convert their statuses into UnrecoverableRollbackErrors, though it looks like we already do a good job of differentiating them and upconverting when needed. I would close as Gone Away and upconvert individual statuses if we find they're unrecoverable and not being treated as such in their own tickets.

Comment by Spencer Brody (Inactive) [ 31/May/18 ]

judah.schvimer, do you know if this ticket applies to the new rollback algo at all, or has this gone away?

Comment by Judah Schvimer [ 10/Oct/17 ]

Any errors from local data access are unrecoverable. The only recoverable errors are from the sync source. To address this, we should catch exceptions in blocks of code that only access local data, and upconvert statuses to UnrecoverableRollbackErrors if they come from local data only.

Generated at Thu Feb 08 04:27:11 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.