Loading...

XML

Word

Printable

JSON

Type: Improvement
Resolution: Won't Fix
Priority: Minor - P4
Fix Version/s: None
Affects Version/s: None
Component/s: Replication
Labels:
None

Assigned Teams:

Replication
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

When the 300MB rollback limit is hit, the following messages appear in the logfile (at default level):

2016-03-10T05:25:41.395-0600 I REPL     [ReplicationExecutor] syncing from: localhost:27018
2016-03-10T05:25:41.414-0600 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to localhost:27018
2016-03-10T05:25:41.426-0600 I REPL     [rsBackgroundSync] starting rollback: OplogStartMissing our last op time fetched: (term: 32, timestamp: Mar  9 16:30:58:c36). source's GTE: (term: 33, timestamp: Mar  9 16:31:00:1) hashes: (4549289231492422674/5887819301075287771)
2016-03-10T05:25:41.427-0600 I REPL     [rsBackgroundSync] beginning rollback
2016-03-10T05:25:41.427-0600 I REPL     [rsBackgroundSync] rollback 0
2016-03-10T05:25:41.427-0600 I REPL     [ReplicationExecutor] transition to ROLLBACK
2016-03-10T05:25:41.434-0600 I REPL     [rsBackgroundSync] rollback 1
2016-03-10T05:25:41.442-0600 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
2016-03-10T05:25:41.444-0600 I REPL     [rsBackgroundSync] rollback our last optime:   Mar  9 16:30:58:c36
2016-03-10T05:25:41.444-0600 I REPL     [rsBackgroundSync] rollback their last optime: Mar 10 05:25:41:ab3
2016-03-10T05:25:41.444-0600 I REPL     [rsBackgroundSync] rollback diff in end of log times: -46483 seconds
2016-03-10T05:25:46.421-0600 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for 127.0.0.1:27018
2016-03-10T05:25:46.421-0600 I REPL     [SyncSourceFeedback] updateUpstream failed: Location9001 socket exception [SEND_ERROR] for 127.0.0.1:27018, will retry
2016-03-10T05:41:05.948-0600 I REPL     [rsBackgroundSync] rollback 3 fixup
2016-03-10T05:45:17.601-0600 E REPL     [rsBackgroundSync] rollback couldn't re-get ns:test.test _id:_id: ObjectId('56e0573f4bace9fd3d00de6e') 126180/1896532
2016-03-10T05:45:17.745-0600 I REPL     [ReplicationExecutor] transition to RECOVERING
2016-03-10T05:45:17.771-0600 I REPL     [ReplicationExecutor] transition to SECONDARY
2016-03-10T05:45:18.729-0600 E REPL     [rsBackgroundSync] sync producer problem: 13410 replSet too much data to roll back
...

There are two errors reported:

"couldn't re-get", an apparent failure to retrieve a document with a certain _id, on line 15
"too much data to roll back", on line 18

The first suggests that there is some problem upstream, perhaps with the document with the indicated _id value. The second, appearing after the node has transitioned out of ROLLBACK and back into SECONDARY, could easily be presumed to have been caused by the first error. This is because the normal flow of causality in logfiles follows the flow of time, ie. problems earlier in the logs tend to cause other problems later in the logs.

However, code inspection reveals that this is not the case. There is only one problem here, which is that the uassert() on db/repl/rs_rollback.cpp:359 has been tripped. This causes the "couldn't re-get" message on rs_rollback.cpp:372, which is misleadingly incorrect — there has not been any failure to re-get any document. Unfortunately, the propagation of the assertion exception upwards in the stack means it doesn't make it into the logfile until much later.

The only way to figure all of this out is to read the code, which shouldn't be necessary to figure out what the logs are saying.

The fix is to output a more appropriate error() message in this case, eg:

db/repl/rs_rollback.cpp

Unable to find source-code formatter for language: diff. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml

     // fetch all the goodVersions of each document from current primary
     DocID doc;
     unsigned long long numFetched = 0;
     try {
         for (set<DocID>::iterator it = fixUpInfo.toRefetch.begin(); it != fixUpInfo.toRefetch.end();
              it++) {
             doc = *it;
 
             verify(!doc._id.eoo());
 
             {
                 // TODO : slow.  lots of round trips.
                 numFetched++;
                 BSONObj good = rollbackSource.findOne(NamespaceString(doc.ns), doc._id.wrap());
                 totalSize += good.objsize();
-                uassert(13410, "replSet too much data to roll back", totalSize < 300 * 1024 * 1024);
+                uassert(13410, "rollback too much data to roll back", totalSize < 300 * 1024 * 1024);
 
                 // note good might be eoo, indicating we should delete it
                 goodVersions[doc.ns][doc] = good;
             }
         }
         newMinValid = rollbackSource.getLastOperation();
         if (newMinValid.isEmpty()) {
             error() << "rollback error newMinValid empty?";
             return;
         }
     } catch (const DBException& e) {
         LOG(1) << "rollback re-get objects: " << e.toString();
+        if (e.getCode() == 13410) {
+            error() << "rollback too much data to roll back";
+        } else {
             error() << "rollback couldn't re-get ns:" << doc.ns << " _id:" << doc._id << ' '
                     << numFetched << '/' << fixUpInfo.toRefetch.size();
+        }
         throw e;
     }

which would give much clearer logs that look like:

...
2016-03-10T05:41:05.948-0600 I REPL     [rsBackgroundSync] rollback 3 fixup
2016-03-10T05:45:17.601-0600 E REPL     [rsBackgroundSync] rollback too much data to roll back
2016-03-10T05:45:17.745-0600 I REPL     [ReplicationExecutor] transition to RECOVERING
2016-03-10T05:45:17.771-0600 I REPL     [ReplicationExecutor] transition to SECONDARY
2016-03-10T05:45:18.729-0600 E REPL     [rsBackgroundSync] sync producer problem: 13410 rollback too much data to roll back
...

Assignee:: [DO NOT USE] Backlog - Replication Team
Reporter:: Kevin Pulo
Participants:: [DO NOT USE] Backlog - Replication Team, Daniel Pasette, Kevin Pulo, Spencer Brody
Votes:: 3 Vote for this issue
Watchers:: 11 Start watching this issue

Created:: Mar 14 2016 12:21:05 AM UTC
Updated:: Dec 06 2022 04:31:11 AM UTC
Resolved:: Jun 12 2018 06:07:09 PM UTC

Details

Description

Attachments

Activity

People

Dates