[SERVER-8017] jstests/replsets/sync_passive2.js failing Created: 22/Dec/12  Updated: 11/Jul/16  Resolved: 28/Dec/12

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 2.3.2

Type: Bug Priority: Major - P3
Reporter: Tad Marshall Assignee: Kristina Chodorow (Inactive)
Resolution: Done Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Failures seen on Solaris build slave and Windows 64-bit.


Operating System: ALL
Participants:

 Description   

http://buildlogs.mongodb.org/Solaris-SmartOS%2064-bit/builds/759/test/recent%20failures/sync_passive2.js

ReplSetTest await TS for connection to bs-smartos-x86-64-1.10gen.cc:31004 is 1356198556000:1 and latest is 1356198556000:1
ReplSetTest await oplog size for connection to bs-smartos-x86-64-1.10gen.cc:31004 is 20002
ReplSetTest await synced=true
force 2 to sync from 3
 m31002| Sat Dec 22 17:49:56.892 [conn2] attempting to sync from bs-smartos-x86-64-1.10gen.cc:31003, but its latest opTime is 1356198504 and ours is 1356198556 so this may not work
 m31000| Sat Dec 22 17:49:57.073 [conn23] end connection 165.225.128.186:64842 (10 connections now open)
 m31002| Sat Dec 22 17:49:57.073 [rsBackgroundSync] replSet syncing to: bs-smartos-x86-64-1.10gen.cc:31003 by request
 m31003| Sat Dec 22 17:49:57.073 [initandlisten] connection accepted from 165.225.128.186:54430 #19 (6 connections now open)
 m31002| Sat Dec 22 17:49:57.074 [rsBackgroundSync] replSet syncing to: bs-smartos-x86-64-1.10gen.cc:31000
 m31003| Sat Dec 22 17:49:57.074 [conn19] end connection 165.225.128.186:54430 (5 connections now open)
 m31000| Sat Dec 22 17:49:57.074 [initandlisten] connection accepted from 165.225.128.186:47032 #30 (11 connections now open)
{
	"set" : "sync_passive2",
	"date" : ISODate("2012-12-22T17:49:57Z"),
	"myState" : 2,
	"syncingTo" : "bs-smartos-x86-64-1.10gen.cc:31000",
	"members" : [
		{
			"_id" : 0,
			"name" : "bs-smartos-x86-64-1.10gen.cc:31000",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 40,
			"optime" : {
				"t" : 1356198556000,
				"i" : 1
			},
			"optimeDate" : ISODate("2012-12-22T17:49:16Z"),
			"lastHeartbeat" : ISODate("2012-12-22T17:49:55Z"),
			"lastHeartbeatRecv" : ISODate("2012-12-22T17:49:56Z"),
			"pingMs" : 0
		},
		{
			"_id" : 1,
			"name" : "bs-smartos-x86-64-1.10gen.cc:31001",
			"health" : 1,
			"state" : 7,
			"stateStr" : "ARBITER",
			"uptime" : 40,
			"lastHeartbeat" : ISODate("2012-12-22T17:49:55Z"),
			"lastHeartbeatRecv" : ISODate("2012-12-22T17:49:56Z"),
			"pingMs" : 0
		},
		{
			"_id" : 2,
			"name" : "bs-smartos-x86-64-1.10gen.cc:31002",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 74,
			"optime" : {
				"t" : 1356198556000,
				"i" : 1
			},
			"optimeDate" : ISODate("2012-12-22T17:49:16Z"),
			"errmsg" : "syncing to: bs-smartos-x86-64-1.10gen.cc:31000",
			"self" : true
		},
		{
			"_id" : 3,
			"name" : "bs-smartos-x86-64-1.10gen.cc:31003",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 40,
			"optime" : {
				"t" : 1356198504000,
				"i" : 4242
			},
			"optimeDate" : ISODate("2012-12-22T17:48:24Z"),
			"lastHeartbeat" : ISODate("2012-12-22T17:49:55Z"),
			"lastHeartbeatRecv" : ISODate("2012-12-22T17:49:57Z"),
			"pingMs" : 0,
			"lastHeartbeatMessage" : "syncing to: bs-smartos-x86-64-1.10gen.cc:31000"
		},
		{
			"_id" : 4,
			"name" : "bs-smartos-x86-64-1.10gen.cc:31004",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 40,
			"optime" : {
				"t" : 1356198556000,
				"i" : 1
			},
			"optimeDate" : ISODate("2012-12-22T17:49:16Z"),
			"lastHeartbeat" : ISODate("2012-12-22T17:49:55Z"),
			"lastHeartbeatRecv" : ISODate("2012-12-22T17:49:56Z"),
			"pingMs" : 0,
			"lastHeartbeatMessage" : "syncing to: bs-smartos-x86-64-1.10gen.cc:31000"
		}
	],
	"ok" : 1
}

This test has problems at this stage. It expects to be able to see that it has successfully changed the "sync to" setting for a node and expects to see it be automatically changed to a non-slave-delayed node "within ~30 seconds". In this failure, the setting was successfully changed and was automatically reset 1 millisecond later:

 m31002| Sat Dec 22 17:49:56.892 [conn2] attempting to sync from bs-smartos-x86-64-1.10gen.cc:31003, but its latest opTime is 1356198504 and ours is 1356198556 so this may not work
 m31000| Sat Dec 22 17:49:57.073 [conn23] end connection 165.225.128.186:64842 (10 connections now open)
 m31002| Sat Dec 22 17:49:57.073 [rsBackgroundSync] replSet syncing to: bs-smartos-x86-64-1.10gen.cc:31003 by request
 m31003| Sat Dec 22 17:49:57.073 [initandlisten] connection accepted from 165.225.128.186:54430 #19 (6 connections now open)
 m31002| Sat Dec 22 17:49:57.074 [rsBackgroundSync] replSet syncing to: bs-smartos-x86-64-1.10gen.cc:31000

This very fast reset doesn't give the script any time to see that its setting did, in fact, work.



 Comments   
Comment by Tad Marshall [ 24/Dec/12 ]

Looking at http://buildlogs.mongodb.org/Solaris-SmartOS%2064-bit/builds/774/test/recent%20failures/sync_passive2.js:

ReplSetTest awaitReplication: checking secondaries against timestamp { "t" : 1356355104000, "i" : 1 }
ReplSetTest awaitReplication: checking secondary #1: bs-smartos-x86-64-1.10gen.cc:31002
ReplSetTest awaitReplication: secondary #1, bs-smartos-x86-64-1.10gen.cc:31002, is synced
ReplSetTest awaitReplication: checking secondary #2: bs-smartos-x86-64-1.10gen.cc:31003
ReplSetTest awaitReplication: timestamp for secondary #2, bs-smartos-x86-64-1.10gen.cc:31003, is { "t" : 1356355054000, "i" : 4525 } but latest is { "t" : 1356355104000, "i" : 1 }
ReplSetTest awaitReplication: last oplog entry (of 20001) for secondary #2, bs-smartos-x86-64-1.10gen.cc:31003, is {  "ts" : {  "t" : 1356355054000,  "i" : 4525 },  "h" : NumberLong("2924272236560738974"),  "v" : 2,  "op" : "i",  "ns" : "test.foo",  "o" : {  "_id" : ObjectId("50d855ee70be68d67b74d5ab"),  "x" : 9999,  "foo" : "bar",  "msg" : "all the talk on the market",  "date" : [ 	ISODate("2012-12-24T13:17:34.745Z"), 	ISODate("2012-12-24T13:17:34.745Z"), 	ISODate("2012-12-24T13:17:34.745Z") ] } }
ReplSetTest awaitReplication: secondary #2, bs-smartos-x86-64-1.10gen.cc:31003, is NOT synced
 m31003| Mon Dec 24 13:19:04.539 [rsSyncNotifier] replset setting oplog notifier to bs-smartos-x86-64-1.10gen.cc:31000
 m31002| Mon Dec 24 13:19:04.539 [conn11] end connection 165.225.128.186:47553 (6 connections now open)
 m31000| Mon Dec 24 13:19:04.539 [initandlisten] connection accepted from 165.225.128.186:48071 #27 (11 connections now open)
ReplSetTest awaitReplication: checking secondaries against timestamp { "t" : 1356355104000, "i" : 1 }
ReplSetTest awaitReplication: checking secondary #1: bs-smartos-x86-64-1.10gen.cc:31002
ReplSetTest awaitReplication: secondary #1, bs-smartos-x86-64-1.10gen.cc:31002, is synced
ReplSetTest awaitReplication: checking secondary #2: bs-smartos-x86-64-1.10gen.cc:31003
ReplSetTest awaitReplication: secondary #2, bs-smartos-x86-64-1.10gen.cc:31003, is synced
ReplSetTest awaitReplication: checking secondary #3: bs-smartos-x86-64-1.10gen.cc:31004
ReplSetTest awaitReplication: secondary #3, bs-smartos-x86-64-1.10gen.cc:31004, is synced
ReplSetTest awaitReplication: finished: all 3 secondaries synced at timestamp { "t" : 1356355104000, "i" : 1 }
{
	"set" : "sync_passive2",
	"date" : ISODate("2012-12-24T13:19:04Z"),
	"myState" : 1,
	"members" : [
		{
			"_id" : 0,
			"name" : "bs-smartos-x86-64-1.10gen.cc:31000",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 67,
			"optime" : {
				"t" : 1356355104000,
				"i" : 1
			},
			"optimeDate" : ISODate("2012-12-24T13:18:24Z"),
			"self" : true
		},
		{
			"_id" : 1,
			"name" : "bs-smartos-x86-64-1.10gen.cc:31001",
			"health" : 1,
			"state" : 7,
			"stateStr" : "ARBITER",
			"uptime" : 40,
			"lastHeartbeat" : ISODate("2012-12-24T13:19:04Z"),
			"lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
			"pingMs" : 0
		},
		{
			"_id" : 2,
			"name" : "bs-smartos-x86-64-1.10gen.cc:31002",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 40,
			"optime" : {
				"t" : 1356355104000,
				"i" : 1
			},
			"optimeDate" : ISODate("2012-12-24T13:18:24Z"),
			"lastHeartbeat" : ISODate("2012-12-24T13:19:04Z"),
			"lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
			"pingMs" : 0,
			"lastHeartbeatMessage" : "syncing to: bs-smartos-x86-64-1.10gen.cc:31000"
		},
		{
			"_id" : 3,
			"name" : "bs-smartos-x86-64-1.10gen.cc:31003",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 40,
			"optime" : {
				"t" : 1356355104000,
				"i" : 1
			},
			"optimeDate" : ISODate("2012-12-24T13:18:24Z"),
			"lastHeartbeat" : ISODate("2012-12-24T13:19:04Z"),
			"lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
			"pingMs" : 0,
			"lastHeartbeatMessage" : "syncing to: bs-smartos-x86-64-1.10gen.cc:31000"
		},
		{
			"_id" : 4,
			"name" : "bs-smartos-x86-64-1.10gen.cc:31004",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 40,
			"optime" : {
				"t" : 1356355104000,
				"i" : 1
			},
			"optimeDate" : ISODate("2012-12-24T13:18:24Z"),
			"lastHeartbeat" : ISODate("2012-12-24T13:19:04Z"),
			"lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
			"pingMs" : 0,
			"lastHeartbeatMessage" : "syncing to: bs-smartos-x86-64-1.10gen.cc:31000"
		}
	],
	"ok" : 1
}
force 2 to sync from 3
 m31002| Mon Dec 24 13:19:04.673 [conn2] attempting to sync from bs-smartos-x86-64-1.10gen.cc:31003, but its latest opTime is 1356355054 and ours is 1356355104 so this may not work
{
	"set" : "sync_passive2",
	"date" : ISODate("2012-12-24T13:19:04Z"),
	"myState" : 2,
	"syncingTo" : "bs-smartos-x86-64-1.10gen.cc:31000",
	"members" : [
		{
			"_id" : 0,
			"name" : "bs-smartos-x86-64-1.10gen.cc:31000",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 39,
			"optime" : {
				"t" : 1356355104000,
				"i" : 1
			},
			"optimeDate" : ISODate("2012-12-24T13:18:24Z"),
			"lastHeartbeat" : ISODate("2012-12-24T13:19:03Z"),
			"lastHeartbeatRecv" : ISODate("2012-12-24T13:19:04Z"),
			"pingMs" : 0
		},
		{
			"_id" : 1,
			"name" : "bs-smartos-x86-64-1.10gen.cc:31001",
			"health" : 1,
			"state" : 7,
			"stateStr" : "ARBITER",
			"uptime" : 39,
			"lastHeartbeat" : ISODate("2012-12-24T13:19:03Z"),
			"lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
			"pingMs" : 0
		},
		{
			"_id" : 2,
			"name" : "bs-smartos-x86-64-1.10gen.cc:31002",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 73,
			"optime" : {
				"t" : 1356355104000,
				"i" : 1
			},
			"optimeDate" : ISODate("2012-12-24T13:18:24Z"),
			"errmsg" : "syncing to: bs-smartos-x86-64-1.10gen.cc:31000",
			"self" : true
		},
		{
			"_id" : 3,
			"name" : "bs-smartos-x86-64-1.10gen.cc:31003",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 39,
			"optime" : {
				"t" : 1356355054000,
				"i" : 4525
			},
			"optimeDate" : ISODate("2012-12-24T13:17:34Z"),
			"lastHeartbeat" : ISODate("2012-12-24T13:19:03Z"),
			"lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
			"pingMs" : 0,
			"lastHeartbeatMessage" : "syncing to: bs-smartos-x86-64-1.10gen.cc:31000"
		},
		{
			"_id" : 4,
			"name" : "bs-smartos-x86-64-1.10gen.cc:31004",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 39,
			"optime" : {
				"t" : 1356355104000,
				"i" : 1
			},
			"optimeDate" : ISODate("2012-12-24T13:18:24Z"),
			"lastHeartbeat" : ISODate("2012-12-24T13:19:03Z"),
			"lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
			"pingMs" : 0,
			"lastHeartbeatMessage" : "syncing to: bs-smartos-x86-64-1.10gen.cc:31000"
		}
	],
	"ok" : 1
}
 m31002| Mon Dec 24 13:19:04.884 [rsBackgroundSync] replSet syncing to: bs-smartos-x86-64-1.10gen.cc:31003 by request
 m31000| Mon Dec 24 13:19:04.884 [conn22] end connection 165.225.128.186:43498 (10 connections now open)
 m31003| Mon Dec 24 13:19:04.885 [initandlisten] connection accepted from 165.225.128.186:58331 #19 (6 connections now open)
 m31002| Mon Dec 24 13:19:04.885 [rsBackgroundSync] replSet syncing to: bs-smartos-x86-64-1.10gen.cc:31000

The problem is not with replication, but with the nodes tracking each others' opTimes. The logged information from awaitReplication() and replSetGetStatus shows that the primary and all three secondaries were synced at opTime:

{ "t" : 1356355104000, "i" : 1 }

, but when we call replSetGetStatus on bs-smartos-x86-64-1.10gen.cc:31002, it reports that bs-smartos-x86-64-1.10gen.cc:31003 has opTime:

{ "t" : 1356355054000, "i" : 4525 }

. It is wrong.

This doesn't look like a bug in the test. When node[2] gets the replSetSyncFrom command, it is using stale information about node[3] to decide how to respond to the command.

Comment by auto [ 24/Dec/12 ]

Author:

{u'date': u'2012-12-24T01:18:16Z', u'email': u'tad@10gen.com', u'name': u'Tad Marshall'}

Message: SERVER-8017 Improve logic and logging in awaitReplication()

Do not print "ReplSetTest [object Object]" at the start of a run. Do not continue
to check other secondaries in the callback once one has been found to be stale. Do
not silently reset the timestamp we are checking against and continue checking with
a different timestanp. Do not print unlabeled oplog entries. Do not print diagnostic
information when a secondary is found to be in sync. Display success or failure
indication for each checked secondary, and include diagnostic information on failure.
Display updated timestamp and primary name when recovering after an exception. Display
success message on success. Label logged lines consistently so they can be searched
for in logs.
Branch: master
https://github.com/mongodb/mongo/commit/a321943143ef35ae26893b98847207c2045703e6

Comment by Tad Marshall [ 22/Dec/12 ]

Another failure, this one on Windows 64-bit:
http://buildlogs.mongodb.org/Windows%2064-bit/builds/5372/test/replica%20sets/sync_passive2.js

ReplSetTest await TS for connection to AMAZONA-JD1A6QA:31004 is 1356199229000:1 and latest is 1356199229000:1
ReplSetTest await oplog size for connection to AMAZONA-JD1A6QA:31004 is 30002
ReplSetTest await synced=true
force 2 to sync from 3
 m31002| Sat Dec 22 13:01:10.208 [conn2] attempting to sync from AMAZONA-JD1A6QA:31003, but its latest opTime is 1356199179 and ours is 1356199229 so this may not work
 m31002| Sat Dec 22 13:01:10.474 [rsBackgroundSync] replSet syncing to: AMAZONA-JD1A6QA:31003 by request
 m31000| Sat Dec 22 13:01:10.474 [conn20] end connection 10.28.48.89:49416 (10 connections now open)
 m31003| Sat Dec 22 13:01:10.474 [initandlisten] connection accepted from 10.28.48.89:49457 #19 (6 connections now open)
 m31002| Sat Dec 22 13:01:10.474 [rsBackgroundSync] replSet syncing to: AMAZONA-JD1A6QA:31000

In this failure, it didn't even take a full millisecond for the (correctly set) state to be automatically reset. Set (line 6) and reset (line 9) are both reported as 13:01:10.474.

Generated at Thu Feb 08 03:16:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.