Improve misleading log message when rollback limit hit

XMLWordPrintableJSON

    • Type: Improvement
    • Resolution: Won't Fix
    • Priority: Minor - P4
    • None
    • Affects Version/s: None
    • Component/s: Replication
    • None
    • Replication
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      When the 300MB rollback limit is hit, the following messages appear in the logfile (at default level):

      2016-03-10T05:25:41.395-0600 I REPL     [ReplicationExecutor] syncing from: localhost:27018
      2016-03-10T05:25:41.414-0600 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to localhost:27018
      2016-03-10T05:25:41.426-0600 I REPL     [rsBackgroundSync] starting rollback: OplogStartMissing our last op time fetched: (term: 32, timestamp: Mar  9 16:30:58:c36). source's GTE: (term: 33, timestamp: Mar  9 16:31:00:1) hashes: (4549289231492422674/5887819301075287771)
      2016-03-10T05:25:41.427-0600 I REPL     [rsBackgroundSync] beginning rollback
      2016-03-10T05:25:41.427-0600 I REPL     [rsBackgroundSync] rollback 0
      2016-03-10T05:25:41.427-0600 I REPL     [ReplicationExecutor] transition to ROLLBACK
      2016-03-10T05:25:41.434-0600 I REPL     [rsBackgroundSync] rollback 1
      2016-03-10T05:25:41.442-0600 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
      2016-03-10T05:25:41.444-0600 I REPL     [rsBackgroundSync] rollback our last optime:   Mar  9 16:30:58:c36
      2016-03-10T05:25:41.444-0600 I REPL     [rsBackgroundSync] rollback their last optime: Mar 10 05:25:41:ab3
      2016-03-10T05:25:41.444-0600 I REPL     [rsBackgroundSync] rollback diff in end of log times: -46483 seconds
      2016-03-10T05:25:46.421-0600 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for 127.0.0.1:27018
      2016-03-10T05:25:46.421-0600 I REPL     [SyncSourceFeedback] updateUpstream failed: Location9001 socket exception [SEND_ERROR] for 127.0.0.1:27018, will retry
      2016-03-10T05:41:05.948-0600 I REPL     [rsBackgroundSync] rollback 3 fixup
      2016-03-10T05:45:17.601-0600 E REPL     [rsBackgroundSync] rollback couldn't re-get ns:test.test _id:_id: ObjectId('56e0573f4bace9fd3d00de6e') 126180/1896532
      2016-03-10T05:45:17.745-0600 I REPL     [ReplicationExecutor] transition to RECOVERING
      2016-03-10T05:45:17.771-0600 I REPL     [ReplicationExecutor] transition to SECONDARY
      2016-03-10T05:45:18.729-0600 E REPL     [rsBackgroundSync] sync producer problem: 13410 replSet too much data to roll back
      ...
      

      There are two errors reported:

      • "couldn't re-get", an apparent failure to retrieve a document with a certain _id, on line 15
      • "too much data to roll back", on line 18

      The first suggests that there is some problem upstream, perhaps with the document with the indicated _id value. The second, appearing after the node has transitioned out of ROLLBACK and back into SECONDARY, could easily be presumed to have been caused by the first error. This is because the normal flow of causality in logfiles follows the flow of time, ie. problems earlier in the logs tend to cause other problems later in the logs.

      However, code inspection reveals that this is not the case. There is only one problem here, which is that the uassert() on db/repl/rs_rollback.cpp:359 has been tripped. This causes the "couldn't re-get" message on rs_rollback.cpp:372, which is misleadingly incorrect — there has not been any failure to re-get any document. Unfortunately, the propagation of the assertion exception upwards in the stack means it doesn't make it into the logfile until much later.

      The only way to figure all of this out is to read the code, which shouldn't be necessary to figure out what the logs are saying.

      The fix is to output a more appropriate error() message in this case, eg:

      db/repl/rs_rollback.cpp
      Unable to find source-code formatter for language: diff. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml
           // fetch all the goodVersions of each document from current primary
           DocID doc;
           unsigned long long numFetched = 0;
           try {
               for (set<DocID>::iterator it = fixUpInfo.toRefetch.begin(); it != fixUpInfo.toRefetch.end();
                    it++) {
                   doc = *it;
       
                   verify(!doc._id.eoo());
       
                   {
                       // TODO : slow.  lots of round trips.
                       numFetched++;
                       BSONObj good = rollbackSource.findOne(NamespaceString(doc.ns), doc._id.wrap());
                       totalSize += good.objsize();
      -                uassert(13410, "replSet too much data to roll back", totalSize < 300 * 1024 * 1024);
      +                uassert(13410, "rollback too much data to roll back", totalSize < 300 * 1024 * 1024);
       
                       // note good might be eoo, indicating we should delete it
                       goodVersions[doc.ns][doc] = good;
                   }
               }
               newMinValid = rollbackSource.getLastOperation();
               if (newMinValid.isEmpty()) {
                   error() << "rollback error newMinValid empty?";
                   return;
               }
           } catch (const DBException& e) {
               LOG(1) << "rollback re-get objects: " << e.toString();
      +        if (e.getCode() == 13410) {
      +            error() << "rollback too much data to roll back";
      +        } else {
                   error() << "rollback couldn't re-get ns:" << doc.ns << " _id:" << doc._id << ' '
                           << numFetched << '/' << fixUpInfo.toRefetch.size();
      +        }
               throw e;
           }
      

      which would give much clearer logs that look like:

      ...
      2016-03-10T05:41:05.948-0600 I REPL     [rsBackgroundSync] rollback 3 fixup
      2016-03-10T05:45:17.601-0600 E REPL     [rsBackgroundSync] rollback too much data to roll back
      2016-03-10T05:45:17.745-0600 I REPL     [ReplicationExecutor] transition to RECOVERING
      2016-03-10T05:45:17.771-0600 I REPL     [ReplicationExecutor] transition to SECONDARY
      2016-03-10T05:45:18.729-0600 E REPL     [rsBackgroundSync] sync producer problem: 13410 rollback too much data to roll back
      ...
      

            Assignee:
            [DO NOT USE] Backlog - Replication Team
            Reporter:
            Kevin Pulo
            Votes:
            3 Vote for this issue
            Watchers:
            11 Start watching this issue

              Created:
              Updated:
              Resolved: