Details
- Improvement
- Resolution: Won't Fix
- Minor - P4
- Replication
Description
When the 300MB rollback limit is hit, the following messages appear in the logfile (at default level):
 1  2016-03-10T05:25:41.395-0600 I REPL [ReplicationExecutor] syncing from: localhost:27018
 2  2016-03-10T05:25:41.414-0600 I REPL [SyncSourceFeedback] setting syncSourceFeedback to localhost:27018
 3  2016-03-10T05:25:41.426-0600 I REPL [rsBackgroundSync] starting rollback: OplogStartMissing our last op time fetched: (term: 32, timestamp: Mar 9 16:30:58:c36). source's GTE: (term: 33, timestamp: Mar 9 16:31:00:1) hashes: (4549289231492422674/5887819301075287771)
 4  2016-03-10T05:25:41.427-0600 I REPL [rsBackgroundSync] beginning rollback
 5  2016-03-10T05:25:41.427-0600 I REPL [rsBackgroundSync] rollback 0
 6  2016-03-10T05:25:41.427-0600 I REPL [ReplicationExecutor] transition to ROLLBACK
 7  2016-03-10T05:25:41.434-0600 I REPL [rsBackgroundSync] rollback 1
 8  2016-03-10T05:25:41.442-0600 I REPL [rsBackgroundSync] rollback 2 FindCommonPoint
 9  2016-03-10T05:25:41.444-0600 I REPL [rsBackgroundSync] rollback our last optime: Mar 9 16:30:58:c36
10  2016-03-10T05:25:41.444-0600 I REPL [rsBackgroundSync] rollback their last optime: Mar 10 05:25:41:ab3
11  2016-03-10T05:25:41.444-0600 I REPL [rsBackgroundSync] rollback diff in end of log times: -46483 seconds
12  2016-03-10T05:25:46.421-0600 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for 127.0.0.1:27018
13  2016-03-10T05:25:46.421-0600 I REPL [SyncSourceFeedback] updateUpstream failed: Location9001 socket exception [SEND_ERROR] for 127.0.0.1:27018, will retry
14  2016-03-10T05:41:05.948-0600 I REPL [rsBackgroundSync] rollback 3 fixup
15  2016-03-10T05:45:17.601-0600 E REPL [rsBackgroundSync] rollback couldn't re-get ns:test.test _id:_id: ObjectId('56e0573f4bace9fd3d00de6e') 126180/1896532
16  2016-03-10T05:45:17.745-0600 I REPL [ReplicationExecutor] transition to RECOVERING
17  2016-03-10T05:45:17.771-0600 I REPL [ReplicationExecutor] transition to SECONDARY
18  2016-03-10T05:45:18.729-0600 E REPL [rsBackgroundSync] sync producer problem: 13410 replSet too much data to roll back
19  ...
There are two errors reported:
- "couldn't re-get", an apparent failure to retrieve a document with a certain _id, on line 15
- "too much data to roll back", on line 18
The first suggests that there is some problem upstream, perhaps with the document with the indicated _id value. The second, appearing after the node has transitioned out of ROLLBACK and back into SECONDARY, could easily be presumed to have been caused by the first error, because the normal flow of causality in logfiles follows the flow of time, i.e. problems earlier in the logs tend to cause other problems later in the logs.
However, code inspection reveals that this is not the case. There is only one problem here: the uassert() at db/repl/rs_rollback.cpp:359 has been tripped. This triggers the "couldn't re-get" message at rs_rollback.cpp:372, which is misleading, since no re-get of any document has actually failed. Unfortunately, because the assertion exception propagates up the stack, its own message does not make it into the logfile until much later.
The only way to work all of this out is to read the code, which shouldn't be necessary just to understand what the logs are saying.
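To make the control flow concrete, here is a minimal standalone sketch (plain C++, not MongoDB code; RollbackTooLarge, refetchDocuments() and the hard-coded document sizes are stand-ins for DBException, the refetch loop in syncFixUp() and real documents). It shows how an exception thrown by the size check inside the loop gets reported as a failed re-get by the catch block, and only surfaces under its own message much later:

#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <vector>

// Stand-in for the uassert(13410, ...) failure: thrown by the size check, not by a fetch.
struct RollbackTooLarge : std::runtime_error {
    RollbackTooLarge() : std::runtime_error("13410 too much data to roll back") {}
};

// Mirrors the shape of the refetch loop: every "fetch" succeeds here,
// but the running size check can throw mid-loop.
void refetchDocuments(const std::vector<std::size_t>& docSizes, std::size_t limitBytes) {
    std::size_t totalSize = 0;
    std::size_t numFetched = 0;
    try {
        for (std::size_t size : docSizes) {
            ++numFetched;          // the fetch itself never fails in this sketch
            totalSize += size;
            if (totalSize >= limitBytes)
                throw RollbackTooLarge();
        }
    } catch (const std::exception& e) {
        // The misleading message: reported for *any* exception, including the size check.
        std::cerr << "E rollback couldn't re-get doc " << numFetched << '/' << docSizes.size()
                  << " (" << e.what() << ")\n";
        throw;  // propagated upward, only surfacing in the log again much later
    }
}

int main() {
    const std::size_t limitBytes = 300ull * 1024 * 1024;
    // Four ~90 MB documents: the running total crosses the 300 MB limit on the fourth.
    std::vector<std::size_t> docSizes(4, 90ull * 1024 * 1024);
    try {
        refetchDocuments(docSizes, limitBytes);
    } catch (const std::exception& e) {
        // Corresponds to the much later "sync producer problem: 13410 ..." log line.
        std::cerr << "E sync producer problem: " << e.what() << '\n';
    }
}

Even in this toy version, the first line printed blames a re-get that never failed, and the real cause only appears in the second, later line.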
The fix is to output a more appropriate error() message in this case, e.g.:
db/repl/rs_rollback.cpp

344          // fetch all the goodVersions of each document from current primary
345          DocID doc;
346          unsigned long long numFetched = 0;
347          try {
348              for (set<DocID>::iterator it = fixUpInfo.toRefetch.begin(); it != fixUpInfo.toRefetch.end();
349                   it++) {
350                  doc = *it;
351
352                  verify(!doc._id.eoo());
353
354                  {
355                      // TODO : slow. lots of round trips.
356                      numFetched++;
357                      BSONObj good = rollbackSource.findOne(NamespaceString(doc.ns), doc._id.wrap());
358                      totalSize += good.objsize();
359  -                   uassert(13410, "replSet too much data to roll back", totalSize < 300 * 1024 * 1024);
360  +                   uassert(13410, "rollback too much data to roll back", totalSize < 300 * 1024 * 1024);
361
362                      // note good might be eoo, indicating we should delete it
363                      goodVersions[doc.ns][doc] = good;
364                  }
365              }
366              newMinValid = rollbackSource.getLastOperation();
367              if (newMinValid.isEmpty()) {
368                  error() << "rollback error newMinValid empty?";
369                  return;
370              }
371          } catch (const DBException& e) {
372              LOG(1) << "rollback re-get objects: " << e.toString();
373  +          if (e.getCode() == 13410) {
374  +              error() << "rollback too much data to roll back";
375  +          } else {
376                  error() << "rollback couldn't re-get ns:" << doc.ns << " _id:" << doc._id << ' '
377                          << numFetched << '/' << fixUpInfo.toRefetch.size();
378  +          }
379              throw e;
380          }
which would give much clearer logs that look like:
...
2016-03-10T05:41:05.948-0600 I REPL [rsBackgroundSync] rollback 3 fixup
2016-03-10T05:45:17.601-0600 E REPL [rsBackgroundSync] rollback too much data to roll back
2016-03-10T05:45:17.745-0600 I REPL [ReplicationExecutor] transition to RECOVERING
2016-03-10T05:45:17.771-0600 I REPL [ReplicationExecutor] transition to SECONDARY
2016-03-10T05:45:18.729-0600 E REPL [rsBackgroundSync] sync producer problem: 13410 rollback too much data to roll back
...
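For illustration, the same dispatch-on-error-code idea in isolation (again plain C++, not MongoDB code; CodedError and reportRollbackError() are hypothetical stand-ins for DBException and the proposed catch block), exercising both paths: the 13410 size-limit case gets its own message, while genuine re-get failures keep the existing one:

#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for DBException: carries a numeric code like getCode().
class CodedError : public std::runtime_error {
public:
    CodedError(int code, const std::string& what) : std::runtime_error(what), _code(code) {}
    int getCode() const { return _code; }
private:
    int _code;
};

// Shape of the proposed catch block: special-case the size limit (13410),
// keep the "couldn't re-get" wording only for real fetch failures.
void reportRollbackError(const CodedError& e, int numFetched, int toRefetch) {
    if (e.getCode() == 13410) {
        std::cerr << "E rollback too much data to roll back\n";
    } else {
        std::cerr << "E rollback couldn't re-get doc " << numFetched << '/' << toRefetch
                  << ": " << e.what() << '\n';
    }
}

int main() {
    // Size-limit case: reported as "too much data", not as a failed re-get.
    reportRollbackError(CodedError(13410, "too much data to roll back"), 126180, 1896532);
    // A real fetch failure keeps the original message.
    reportRollbackError(CodedError(9001, "socket exception [SEND_ERROR]"), 42, 1896532);
}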