[SERVER-16679] Cannot resync a 2.8.0-rc3 (wt) node from a 2.8.0-rc2 (wt) primary (mms-test) Created: 29/Dec/14  Updated: 30/Dec/14  Resolved: 30/Dec/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.8.0-rc2
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Cailin Nelson Assignee: Andy Schwerin
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-14707 listCollections and listIndexes comma... Closed
Related
related to SERVER-14707 listCollections and listIndexes comma... Closed
Backwards Compatibility: Fully Compatible
Participants:

 Description   

I am upgrading a (test, throwaway) replica set from 2.8.0-rc2 to 2.8.0-rc3.

I started by upgrading the binary for one secondary. As expected, I got the "detected index from earlier version, restart with --repair".

Instead of doing a --repair, I figured I would just resync the node.

When I attempted this, I got:

2014-12-29T03:19:32.949+0000 I CONTROL  [conn2] pid=23987 port=27100 64-bit host=thereq-1.thereqbasic.8682
2014-12-29T03:19:32.949+0000 I CONTROL  [conn2] db version v2.8.0-rc3
2014-12-29T03:19:32.949+0000 I CONTROL  [conn2] git version: 2d679247f17dab05a492c8b6d2c250dab18e54f2
2014-12-29T03:19:32.949+0000 I CONTROL  [conn2] build info: Linux build4.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
2014-12-29T03:19:32.949+0000 I CONTROL  [conn2] allocator: tcmalloc
2014-12-29T03:19:32.949+0000 I CONTROL  [conn2] options: { config: "/data/tiger_5/automation-mongod.conf", net: { port: 27100 }, processManagement: { fork: true }, replication: { replSetName: "tiger" }, storage: { dbPath: "/data/tiger_5", engine: "wiredTiger" }, systemLog: { destination: "file", path: "/data/tiger_5/mongodb.log" } }
2014-12-29T03:19:33.794+0000 I NETWORK  [initandlisten] connection accepted from 172.31.33.237:42156 #5 (4 connections now open)
2014-12-29T03:19:33.872+0000 I REPL     [rsSync] initial sync pending
2014-12-29T03:19:33.873+0000 I REPL     [ReplicationExecutor] syncing from: thereq-0.thereqbasic.8682.mongodbdns.com:27100
2014-12-29T03:19:33.881+0000 I REPL     [rsSync] initial sync drop all databases
2014-12-29T03:19:33.881+0000 I STORAGE  [rsSync] dropAllDatabasesExceptLocal 1
2014-12-29T03:19:33.881+0000 I REPL     [rsSync] initial sync clone all databases
2014-12-29T03:19:33.884+0000 I REPL     [rsSync] initial sync cloning db: tiger
2014-12-29T03:19:33.888+0000 E REPL     [rsSync] 10065 invalid parameter: expected an object ()
2014-12-29T03:19:33.888+0000 E REPL     [rsSync] initial sync attempt failed, 9 attempts remaining
2014-12-29T03:19:36.590+0000 I NETWORK  [initandlisten] connection accepted from 172.31.5.196:35417 #6 (5 connections now open)
2014-12-29T03:19:37.909+0000 I NETWORK  [initandlisten] connection accepted from 172.31.18.80:38226 #7 (6 connections now open)
2014-12-29T03:19:38.795+0000 I NETWORK  [initandlisten] connection accepted from 172.31.33.237:42158 #8 (7 connections now open)
2014-12-29T03:19:38.888+0000 I REPL     [rsSync] initial sync pending
2014-12-29T03:19:38.888+0000 I REPL     [ReplicationExecutor] syncing from: thereq-0.thereqbasic.8682.mongodbdns.com:27100
2014-12-29T03:19:38.892+0000 I REPL     [rsSync] initial sync drop all databases
2014-12-29T03:19:38.892+0000 I STORAGE  [rsSync] dropAllDatabasesExceptLocal 1
2014-12-29T03:19:38.892+0000 I REPL     [rsSync] initial sync clone all databases
2014-12-29T03:19:38.893+0000 I REPL     [rsSync] initial sync cloning db: tiger
2014-12-29T03:19:38.897+0000 E REPL     [rsSync] 10065 invalid parameter: expected an object ()
2014-12-29T03:19:38.897+0000 E REPL     [rsSync] initial sync attempt failed, 8 attempts remaining
2014-12-29T03:19:42.918+0000 I NETWORK  [initandlisten] connection accepted from 172.31.18.80:38228 #9 (8 connections now open)
2014-12-29T03:19:43.897+0000 I REPL     [rsSync] initial sync pending
2014-12-29T03:19:43.897+0000 I REPL     [ReplicationExecutor] syncing from: thereq-0.thereqbasic.8682.mongodbdns.com:27100
2014-12-29T03:19:43.907+0000 I REPL     [rsSync] initial sync drop all databases
2014-12-29T03:19:43.907+0000 I STORAGE  [rsSync] dropAllDatabasesExceptLocal 1
2014-12-29T03:19:43.907+0000 I REPL     [rsSync] initial sync clone all databases
2014-12-29T03:19:43.921+0000 I REPL     [rsSync] initial sync cloning db: tiger
2014-12-29T03:19:43.939+0000 E REPL     [rsSync] 10065 invalid parameter: expected an object ()
2014-12-29T03:19:43.939+0000 E REPL     [rsSync] initial sync attempt failed, 7 attempts remaining
2014-12-29T03:19:48.939+0000 I REPL     [rsSync] initial sync pending

If the answer to this is simply "Sorry, can't do this, it's an RC" that's fine by me. It's all garbage data.



 Comments   
Comment by Eric Milkie [ 30/Dec/14 ]

rassi@10gen.com take note of Scott's comment above.

Comment by Eric Milkie [ 30/Dec/14 ]

Resolving as a dup of the fixes/improvements needed for listCollection. Those will allow faster diagnosis of this. We won't need to fix the underlying issue of compatibility between rc1/2 and rc3.

Comment by Scott Hernandez (Inactive) [ 30/Dec/14 ]

The issue is that the listCollections command does not return a "cursor" field as the new client is expecting. This breaks the cloner. The problem exists on master but should only be an issue for impls of listCollections in old RCs, namely rc2. Might be worth addressing in SERVER-14707 with better checking/diagnostic messages.

2014-12-30T10:03:13.089-0500 E REPL     [rsSync] initial sync error: 28595 missing 'cursor' field
2014-12-30T10:03:13.089-0500 E REPL     [rsSync] initial sync attempt failed, 9 attempts remaining
2014-12-30T10:03:18.090-0500 I REPL     [rsSync] initial sync pending

diff --git a/src/mongo/client/dbclient.cpp b/src/mongo/client/dbclient.cpp
index f2bf4cb..1f28a3b 100644
--- a/src/mongo/client/dbclient.cpp
+++ b/src/mongo/client/dbclient.cpp
@@ -910,13 +910,16 @@ namespace mongo {
                                                        << "cursor" << BSONObj()),
                            res,
                            QueryOption_SlaveOk)) {
+                massert(28595, "missing 'cursor' field", res.hasField("cursor"));
                 BSONObj cursorObj = res["cursor"].Obj();
                 massert(28586, "reading multiple batches from listCollections not implemented",
                         cursorObj["id"].numberInt() == 0);
+                massert(28596, "missing 'firstBatch' field", res.hasField("firstBatch"));
                 BSONObj collections = cursorObj["firstBatch"].Obj();
                 BSONObjIterator it( collections );
                 while ( it.more() ) {
                     BSONElement e = it.next();
+                    massert(28597, "array element is not an embedded document", e.isABSONObj());
                     infos.push_back( e.Obj().getOwned() );
                 }
                 return infos;

Comment by Cailin Nelson [ 29/Dec/14 ]

This covers the entire time range of the failed attempt to sync. Would you like me to increase the log verbosity on the source node and try again?

2014-12-29T03:23:38.046+0000 W NETWORK  [ReplExecNetThread-5] Failed to connect to 172.31.5.196:27100, reason: errno:111 Connection refused
2014-12-29T03:23:38.047+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to thereq-1.thereqbasic.8682.mongodbdns.com:27100; Location18915 Failed atte
mpt to connect to thereq-1.thereqbasic.8682.mongodbdns.com:27100; couldn't connect to server thereq-1.thereqbasic.8682.mongodbdns.com:27100 (172.31.5.196), connectio
n attempt failed
2014-12-29T03:23:38.212+0000 I NETWORK  [initandlisten] connection accepted from 172.31.5.196:57284 #40 (12 connections now open)
2014-12-29T03:23:38.213+0000 I NETWORK  [conn40] end connection 172.31.5.196:57284 (11 connections now open)
2014-12-29T03:23:38.254+0000 I NETWORK  [initandlisten] connection accepted from 172.31.5.196:57286 #41 (12 connections now open)
2014-12-29T03:23:40.062+0000 I REPL     [ReplicationExecutor] Member thereq-1.thereqbasic.8682.mongodbdns.com:27100 is now in state STARTUP2
2014-12-29T03:23:41.271+0000 I NETWORK  [initandlisten] connection accepted from 172.31.5.196:57289 #42 (13 connections now open)
2014-12-29T03:23:41.297+0000 I NETWORK  [initandlisten] connection accepted from 172.31.5.196:57290 #43 (14 connections now open)
2014-12-29T03:23:41.299+0000 I NETWORK  [conn43] end connection 172.31.5.196:57290 (13 connections now open)
2014-12-29T03:23:41.303+0000 I NETWORK  [initandlisten] connection accepted from 172.31.5.196:57291 #44 (14 connections now open)
2014-12-29T03:23:41.318+0000 I NETWORK  [conn44] end connection 172.31.5.196:57291 (13 connections now open)
2014-12-29T03:23:41.333+0000 I NETWORK  [conn42] end connection 172.31.5.196:57289 (12 connections now open)
2014-12-29T03:24:00.630+0000 I NETWORK  [conn26] end connection 172.31.5.196:56824 (11 connections now open)
2014-12-29T03:24:00.630+0000 I NETWORK  [conn29] end connection 172.31.5.196:56865 (10 connections now open)
2014-12-29T03:24:08.285+0000 I NETWORK  [conn41] end connection 172.31.5.196:57286 (9 connections now open)
2014-12-29T03:24:08.312+0000 I NETWORK  [initandlisten] connection accepted from 172.31.5.196:57313 #45 (10 connections now open)
2014-12-29T03:24:31.586+0000 I NETWORK  [conn45] end connection 172.31.5.196:57313 (9 connections now open)
2014-12-29T03:24:32.126+0000 I NETWORK  [ReplExecNetThread-4] DBClientCursor::init call() failed
2014-12-29T03:24:32.126+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to thereq-1.thereqbasic.8682.mongodbdns.com:27100; Location10276 DBClientBas
e::findN: transport error: thereq-1.thereqbasic.8682.mongodbdns.com:27100 ns: admin.$cmd query: { replSetHeartbeat: "tiger", pv: 1, v: 1, from: "thereq-2.thereqbasic
.8682.mongodbdns.com:27100", fromId: 0, checkEmpty: false }
2014-12-29T03:24:32.127+0000 W NETWORK  [ReplExecNetThread-6] 

Comment by Daniel Pasette (Inactive) [ 29/Dec/14 ]

I don't think this should not be an issue, so I'd like to investigate. Can you attach the source node's log file?

Generated at Thu Feb 08 03:41:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.