Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-23104

Improve misleading log message when rollback limit hit

    XMLWordPrintableJSON

Details

    • Icon: Improvement Improvement
    • Resolution: Won't Fix
    • Icon: Minor - P4 Minor - P4
    • None
    • None
    • Replication
    • None
    • Replication

    Description

      When the 300MB rollback limit is hit, the following messages appear in the logfile (at default level):

      1
      2016-03-10T05:25:41.395-0600 I REPL     [ReplicationExecutor] syncing from: localhost:27018
      2
      2016-03-10T05:25:41.414-0600 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to localhost:27018
      3
      2016-03-10T05:25:41.426-0600 I REPL     [rsBackgroundSync] starting rollback: OplogStartMissing our last op time fetched: (term: 32, timestamp: Mar  9 16:30:58:c36). source's GTE: (term: 33, timestamp: Mar  9 16:31:00:1) hashes: (4549289231492422674/5887819301075287771)
      4
      2016-03-10T05:25:41.427-0600 I REPL     [rsBackgroundSync] beginning rollback
      5
      2016-03-10T05:25:41.427-0600 I REPL     [rsBackgroundSync] rollback 0
      6
      2016-03-10T05:25:41.427-0600 I REPL     [ReplicationExecutor] transition to ROLLBACK
      7
      2016-03-10T05:25:41.434-0600 I REPL     [rsBackgroundSync] rollback 1
      8
      2016-03-10T05:25:41.442-0600 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
      9
      2016-03-10T05:25:41.444-0600 I REPL     [rsBackgroundSync] rollback our last optime:   Mar  9 16:30:58:c36
      10
      2016-03-10T05:25:41.444-0600 I REPL     [rsBackgroundSync] rollback their last optime: Mar 10 05:25:41:ab3
      11
      2016-03-10T05:25:41.444-0600 I REPL     [rsBackgroundSync] rollback diff in end of log times: -46483 seconds
      12
      2016-03-10T05:25:46.421-0600 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for 127.0.0.1:27018
      13
      2016-03-10T05:25:46.421-0600 I REPL     [SyncSourceFeedback] updateUpstream failed: Location9001 socket exception [SEND_ERROR] for 127.0.0.1:27018, will retry
      14
      2016-03-10T05:41:05.948-0600 I REPL     [rsBackgroundSync] rollback 3 fixup
      15
      2016-03-10T05:45:17.601-0600 E REPL     [rsBackgroundSync] rollback couldn't re-get ns:test.test _id:_id: ObjectId('56e0573f4bace9fd3d00de6e') 126180/1896532
      16
      2016-03-10T05:45:17.745-0600 I REPL     [ReplicationExecutor] transition to RECOVERING
      17
      2016-03-10T05:45:17.771-0600 I REPL     [ReplicationExecutor] transition to SECONDARY
      18
      2016-03-10T05:45:18.729-0600 E REPL     [rsBackgroundSync] sync producer problem: 13410 replSet too much data to roll back
      19
      ...
      

      There are two errors reported:

      • "couldn't re-get", an apparent failure to retrieve a document with a certain _id, on line 15
      • "too much data to roll back", on line 18

      The first suggests that there is some problem upstream, perhaps with the document with the indicated _id value. The second, appearing after the node has transitioned out of ROLLBACK and back into SECONDARY, could easily be presumed to have been caused by the first error. This is because the normal flow of causality in logfiles follows the flow of time, ie. problems earlier in the logs tend to cause other problems later in the logs.

      However, code inspection reveals that this is not the case. There is only one problem here, which is that the uassert() on db/repl/rs_rollback.cpp:359 has been tripped. This causes the "couldn't re-get" message on rs_rollback.cpp:372, which is misleadingly incorrect — there has not been any failure to re-get any document. Unfortunately, the propagation of the assertion exception upwards in the stack means it doesn't make it into the logfile until much later.

      The only way to figure all of this out is to read the code, which shouldn't be necessary to figure out what the logs are saying.

      The fix is to output a more appropriate error() message in this case, eg:

      db/repl/rs_rollback.cpp

      344
           // fetch all the goodVersions of each document from current primary
      345
           DocID doc;
      346
           unsigned long long numFetched = 0;
      347
           try {
      348
               for (set<DocID>::iterator it = fixUpInfo.toRefetch.begin(); it != fixUpInfo.toRefetch.end();
      349
                    it++) {
      350
                   doc = *it;
      351
       
      352
                   verify(!doc._id.eoo());
      353
       
      354
                   {
      355
                       // TODO : slow.  lots of round trips.
      356
                       numFetched++;
      357
                       BSONObj good = rollbackSource.findOne(NamespaceString(doc.ns), doc._id.wrap());
      358
                       totalSize += good.objsize();
      359
      -                uassert(13410, "replSet too much data to roll back", totalSize < 300 * 1024 * 1024);
      360
      +                uassert(13410, "rollback too much data to roll back", totalSize < 300 * 1024 * 1024);
      361
       
      362
                       // note good might be eoo, indicating we should delete it
      363
                       goodVersions[doc.ns][doc] = good;
      364
                   }
      365
               }
      366
               newMinValid = rollbackSource.getLastOperation();
      367
               if (newMinValid.isEmpty()) {
      368
                   error() << "rollback error newMinValid empty?";
      369
                   return;
      370
               }
      371
           } catch (const DBException& e) {
      372
               LOG(1) << "rollback re-get objects: " << e.toString();
      373
      +        if (e.getCode() == 13410) {
      374
      +            error() << "rollback too much data to roll back";
      375
      +        } else {
      376
                   error() << "rollback couldn't re-get ns:" << doc.ns << " _id:" << doc._id << ' '
      377
                           << numFetched << '/' << fixUpInfo.toRefetch.size();
      378
      +        }
      379
               throw e;
      380
           }
      

      which would give much clearer logs that look like:

      ...
      2016-03-10T05:41:05.948-0600 I REPL     [rsBackgroundSync] rollback 3 fixup
      2016-03-10T05:45:17.601-0600 E REPL     [rsBackgroundSync] rollback too much data to roll back
      2016-03-10T05:45:17.745-0600 I REPL     [ReplicationExecutor] transition to RECOVERING
      2016-03-10T05:45:17.771-0600 I REPL     [ReplicationExecutor] transition to SECONDARY
      2016-03-10T05:45:18.729-0600 E REPL     [rsBackgroundSync] sync producer problem: 13410 rollback too much data to roll back
      ...
      

      Attachments

        Activity

          People

            backlog-server-repl Backlog - Replication Team
            kevin.pulo@mongodb.com Kevin Pulo
            Votes:
            3 Vote for this issue
            Watchers:
            11 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: