[SERVER-23104] Improve misleading log message when rollback limit hit Created: 14/Mar/16  Updated: 06/Dec/22  Resolved: 12/Jun/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor - P4
Reporter: Kevin Pulo Assignee: Backlog - Replication Team
Resolution: Won't Fix Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Replication
Participants:

 Description   

When the 300MB rollback limit is hit, the following messages appear in the logfile (at default level):

 1  2016-03-10T05:25:41.395-0600 I REPL     [ReplicationExecutor] syncing from: localhost:27018
 2  2016-03-10T05:25:41.414-0600 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to localhost:27018
 3  2016-03-10T05:25:41.426-0600 I REPL     [rsBackgroundSync] starting rollback: OplogStartMissing our last op time fetched: (term: 32, timestamp: Mar  9 16:30:58:c36). source's GTE: (term: 33, timestamp: Mar  9 16:31:00:1) hashes: (4549289231492422674/5887819301075287771)
 4  2016-03-10T05:25:41.427-0600 I REPL     [rsBackgroundSync] beginning rollback
 5  2016-03-10T05:25:41.427-0600 I REPL     [rsBackgroundSync] rollback 0
 6  2016-03-10T05:25:41.427-0600 I REPL     [ReplicationExecutor] transition to ROLLBACK
 7  2016-03-10T05:25:41.434-0600 I REPL     [rsBackgroundSync] rollback 1
 8  2016-03-10T05:25:41.442-0600 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
 9  2016-03-10T05:25:41.444-0600 I REPL     [rsBackgroundSync] rollback our last optime:   Mar  9 16:30:58:c36
10  2016-03-10T05:25:41.444-0600 I REPL     [rsBackgroundSync] rollback their last optime: Mar 10 05:25:41:ab3
11  2016-03-10T05:25:41.444-0600 I REPL     [rsBackgroundSync] rollback diff in end of log times: -46483 seconds
12  2016-03-10T05:25:46.421-0600 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for 127.0.0.1:27018
13  2016-03-10T05:25:46.421-0600 I REPL     [SyncSourceFeedback] updateUpstream failed: Location9001 socket exception [SEND_ERROR] for 127.0.0.1:27018, will retry
14  2016-03-10T05:41:05.948-0600 I REPL     [rsBackgroundSync] rollback 3 fixup
15  2016-03-10T05:45:17.601-0600 E REPL     [rsBackgroundSync] rollback couldn't re-get ns:test.test _id:_id: ObjectId('56e0573f4bace9fd3d00de6e') 126180/1896532
16  2016-03-10T05:45:17.745-0600 I REPL     [ReplicationExecutor] transition to RECOVERING
17  2016-03-10T05:45:17.771-0600 I REPL     [ReplicationExecutor] transition to SECONDARY
18  2016-03-10T05:45:18.729-0600 E REPL     [rsBackgroundSync] sync producer problem: 13410 replSet too much data to roll back
19  ...

There are two errors reported:

  • "couldn't re-get", an apparent failure to retrieve a document with a certain _id, on line 15
  • "too much data to roll back", on line 18

The first suggests some problem upstream, perhaps with the document bearing the indicated _id value. The second, appearing after the node has transitioned out of ROLLBACK and back into SECONDARY, could easily be presumed to have been caused by the first error, because the normal flow of causality in logfiles follows the flow of time, i.e. problems earlier in the logs tend to cause other problems later in the logs.

However, code inspection reveals that this is not the case. There is only one problem here: the uassert() on db/repl/rs_rollback.cpp:359 has been tripped. This triggers the "couldn't re-get" message on rs_rollback.cpp:372, which is misleading, because there has not been any failure to re-get any document. Unfortunately, since the assertion exception propagates up the stack, its own message doesn't make it into the logfile until much later.

The only way to work all of this out is to read the code, which shouldn't be necessary just to understand what the logs are saying.
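The misleading sequence can be reproduced with a minimal standalone sketch. Note this is purely illustrative: the DBException/uassert stand-ins below are hypothetical simplifications of MongoDB's internals, the function name simulateRollbackRefetch is invented, and the sizes are toy values standing in for the 300MB limit.

```cpp
#include <set>
#include <stdexcept>
#include <string>

// Minimal stand-ins for MongoDB's DBException and uassert() -- hypothetical
// simplifications for illustration only.
struct DBException : std::runtime_error {
    int code;
    DBException(int c, const std::string& msg) : std::runtime_error(msg), code(c) {}
};

static void uassert(int code, const std::string& msg, bool ok) {
    if (!ok) throw DBException(code, msg);
}

// Simulate the refetch loop: the size-limit uassert fires mid-loop, but the
// generic catch handler blames a failed document fetch instead.
std::string simulateRollbackRefetch() {
    std::set<int> toRefetch = {1, 2, 3};
    long long totalSize = 0;
    const long long kLimit = 2;  // toy stand-in for the 300MB limit
    int numFetched = 0;
    try {
        for (int id : toRefetch) {
            (void)id;  // the document itself doesn't matter in this sketch
            numFetched++;
            totalSize += 1;  // pretend each refetched document is 1 byte
            // This throws once the limit is reached (cf. rs_rollback.cpp:359)...
            uassert(13410, "replSet too much data to roll back", totalSize < kLimit);
        }
        return "ok";
    } catch (const DBException& e) {
        // ...but the handler reports a message blaming the current document,
        // even though no fetch actually failed (cf. rs_rollback.cpp:372).
        return "rollback couldn't re-get _id:" + std::to_string(numFetched) + " " +
               std::to_string(numFetched) + "/" + std::to_string(toRefetch.size()) +
               " (actual cause: " + std::to_string(e.code) + " " + e.what() + ")";
    }
}
```

Running the sketch yields a "couldn't re-get" report whose real cause is the 13410 size-limit assertion, mirroring the log confusion described above.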

The fix is to output a more appropriate error() message in this case, e.g.:

db/repl/rs_rollback.cpp

344      // fetch all the goodVersions of each document from current primary
345      DocID doc;
346      unsigned long long numFetched = 0;
347      try {
348          for (set<DocID>::iterator it = fixUpInfo.toRefetch.begin(); it != fixUpInfo.toRefetch.end();
349               it++) {
350              doc = *it;
351
352              verify(!doc._id.eoo());
353
354              {
355                  // TODO : slow.  lots of round trips.
356                  numFetched++;
357                  BSONObj good = rollbackSource.findOne(NamespaceString(doc.ns), doc._id.wrap());
358                  totalSize += good.objsize();
359 -                uassert(13410, "replSet too much data to roll back", totalSize < 300 * 1024 * 1024);
360 +                uassert(13410, "rollback too much data to roll back", totalSize < 300 * 1024 * 1024);
361
362                  // note good might be eoo, indicating we should delete it
363                  goodVersions[doc.ns][doc] = good;
364              }
365          }
366          newMinValid = rollbackSource.getLastOperation();
367          if (newMinValid.isEmpty()) {
368              error() << "rollback error newMinValid empty?";
369              return;
370          }
371      } catch (const DBException& e) {
372          LOG(1) << "rollback re-get objects: " << e.toString();
373 +        if (e.getCode() == 13410) {
374 +            error() << "rollback too much data to roll back";
375 +        } else {
376              error() << "rollback couldn't re-get ns:" << doc.ns << " _id:" << doc._id << ' '
377                      << numFetched << '/' << fixUpInfo.toRefetch.size();
378 +        }
379          throw e;
380      }

which would give much clearer logs that look like:

...
2016-03-10T05:41:05.948-0600 I REPL     [rsBackgroundSync] rollback 3 fixup
2016-03-10T05:45:17.601-0600 E REPL     [rsBackgroundSync] rollback too much data to roll back
2016-03-10T05:45:17.745-0600 I REPL     [ReplicationExecutor] transition to RECOVERING
2016-03-10T05:45:17.771-0600 I REPL     [ReplicationExecutor] transition to SECONDARY
2016-03-10T05:45:18.729-0600 E REPL     [rsBackgroundSync] sync producer problem: 13410 rollback too much data to roll back
...



 Comments   
Comment by Spencer Brody (Inactive) [ 30/May/18 ]

Rollback has been completely rewritten in 4.0 and the log messages are different.

Comment by Daniel Pasette (Inactive) [ 29/Mar/16 ]

I'd also like to see what the rollback data limit is in the log message.
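One way that could look is to build the assertion message from the measured size and the configured limit. This is a hedged sketch only: the helper name rollbackLimitMessage and the exact wording are hypothetical, not anything present in the server source.

```cpp
#include <string>

// Hypothetical helper: build the rollback-limit message so that it includes
// both the amount of data fetched so far and the configured limit.
std::string rollbackLimitMessage(long long totalSizeBytes, long long limitBytes) {
    return "rollback too much data to roll back: fetched " +
           std::to_string(totalSizeBytes) + " bytes, limit is " +
           std::to_string(limitBytes) + " bytes";
}
```

It could then be wired into the existing check as, for example, uassert(13410, rollbackLimitMessage(totalSize, 300 * 1024 * 1024), totalSize < 300 * 1024 * 1024), so the log line itself states the limit that was exceeded.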

Generated at Thu Feb 08 04:02:23 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.