[SERVER-18845] w_majority_change.js on v3.0 branch Created: 03/Jun/15  Updated: 29/Sep/15  Resolved: 29/Sep/15

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Randolph Tan Assignee: Matt Dannenberg
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File w_majority_change.log    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

    at Error (<anonymous>)
    at doassert (src/mongo/shell/assert.js:11:14)
    at Function.assert.writeOK (src/mongo/shell/assert.js:388:9)
    at /data/mci/src/jstests/multiVersion/w_majority_change.js:158:12
    at /data/mci/src/jstests/multiVersion/w_majority_change.js:174:2
2015-05-29T07:26:07.126+0000 E QUERY    Error: write concern failed with errors: {
	"nInserted" : 1,
	"writeConcernError" : {
		"code" : 64,
		"errInfo" : {
			"wtimeout" : true
		},
		"errmsg" : "waiting for replication timed out"
	}
}

https://evergreen.mongodb.com/task/mongodb_mongo_v3.0_linux_64_multiversion_2b0ff7c06a46301fd87c794a1eb2df90d9767ad9_15_05_28_17_58_56

First appearance after:
https://github.com/mongodb/mongo/commit/6927132b7bcfa7fae83cbc8ad99fb18c320af738



 Comments   
Comment by Matt Dannenberg [ 10/Jun/15 ]

This problem went away when we reverted the original solution to SERVER-18511. The new solution to SERVER-18511 does not exhibit this incorrect behavior.

Comment by Githook User [ 08/Jun/15 ]

Author: matt dannenberg (dannenberg) <matt.dannenberg@10gen.com>

Message: Revert "SERVER-18845 send our own update first in updatePosition to fix 2.6 compatibilty"

This reverts commit c69158b7fccfc9eb5648a68fcf194fc0cf30ba4d.
Branch: v3.0
https://github.com/mongodb/mongo/commit/b5b20daad0aed3d0fe11f566547ac91305e5ccdc

Comment by Githook User [ 08/Jun/15 ]

Author: matt dannenberg (dannenberg) <matt.dannenberg@10gen.com>

Message: Revert "SERVER-18845 unittest fix"

This reverts commit 27f8803a31119c091e998fe29749dd5f75695ec6.
Branch: v3.0
https://github.com/mongodb/mongo/commit/7b98a8ed93e557d615eb8ab192eeca365e6b6776

Comment by Githook User [ 05/Jun/15 ]

Author: matt dannenberg (dannenberg) <matt.dannenberg@10gen.com>

Message: SERVER-18845 unittest fix
Branch: v3.0
https://github.com/mongodb/mongo/commit/27f8803a31119c091e998fe29749dd5f75695ec6

Comment by Githook User [ 05/Jun/15 ]

Author: matt dannenberg (dannenberg) <matt.dannenberg@10gen.com>

Message: SERVER-18845 send our own update first in updatePosition to fix 2.6 compatibilty
Branch: v3.0
https://github.com/mongodb/mongo/commit/c69158b7fccfc9eb5648a68fcf194fc0cf30ba4d

Comment by Matt Dannenberg [ 04/Jun/15 ]

In 2.6 we required a handshake prior to updatePosition in order to accept replication progress. In 3.0 this is no longer necessary, but we kept that functionality for the sake of 2.6 compatibility. Once we started the 3.2 branch, we noticed that requiring this made 3.0 and 3.2 incompatible. As a result, we dropped this requirement from the updatePosition code path in 3.0. In a later commit, in order to fix the reporting of replication progress after initial sync, we also removed this requirement from the heartbeat code path that accepts replication progress. That is the commit that caused this test failure. Though both commits allow for the same problem (reporting progress on behalf of a node for which we have not performed a handshake), the later commit made the problem a much more common occurrence.

When processing an updatePosition command, 2.6 goes through the array of updates until it finds a problematic one and then returns an error. My solution was to put ourselves first in that array, so that our update gets processed before the 2.6 node stops processing the updates. The trouble is that nodes could still be chaining through us and be listed after a node that has not been handshaken, such that their progress would not be reported. I don't think that there is a proper solution for this. We can fix the test by disallowing chaining.
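
To make the ordering issue concrete, here is a minimal Python sketch of the behavior described above. It is a toy model, not the server code: the names `Update`, `process_update_position_26`, and `build_updates_self_first` are hypothetical, optimes are plain integers, and "handshaken" is modeled as a set of member ids the 2.6 node already knows about.

```python
from dataclasses import dataclass


@dataclass
class Update:
    member_id: int  # hypothetical member identifier
    optime: int     # simplified: replication progress as a plain integer


def process_update_position_26(updates, handshaken, progress):
    """Toy model of the 2.6 behavior: apply updates in array order and
    stop with an error at the first member that has not been handshaken."""
    for update in updates:
        if update.member_id not in handshaken:
            return False  # error returned; later entries are never applied
        progress[update.member_id] = update.optime
    return True


def build_updates_self_first(self_id, known_progress):
    """Toy model of the attempted fix: list our own progress first."""
    own = Update(self_id, known_progress[self_id])
    others = [Update(m, t) for m, t in known_progress.items() if m != self_id]
    return [own] + others


# Member 1 is "us", member 2 has not been handshaken upstream,
# member 3 chains through us and has been handshaken.
known_progress = {2: 5, 1: 7, 3: 6}
handshaken = {1, 3}

# Original ordering: the non-handshaken member comes first, nothing is applied.
progress_naive = {}
ok_naive = process_update_position_26(
    [Update(m, t) for m, t in known_progress.items()], handshaken, progress_naive
)

# Self-first ordering: our own progress lands, but member 3 is still dropped.
progress_first = {}
ok_first = process_update_position_26(
    build_updates_self_first(1, known_progress), handshaken, progress_first
)
```

In this model the self-first ordering gets our own update recorded before the error, but member 3, which chains through us and is listed after the non-handshaken member, is still never reported. That is the limitation that leads to disallowing chaining in the test.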

Comment by Eric Milkie [ 04/Jun/15 ]

The issue appears to be that a 3.0 secondary (31001) is failing to sync from a 2.6 primary (31002), so the primary never gets notification that the secondary has the write:

 m31001| 2015-06-03T20:30:22.676+0000 I REPL     [ReplicationExecutor] syncing from: ip-10-150-51-49:31002
 m31002| 2015-06-03T20:30:22.676+0000 [initandlisten] connection accepted from 10.150.51.49:38435 #17 (6 connections now open)
 m31001| 2015-06-03T20:30:22.676+0000 I REPL     [SyncSourceFeedback] replset setting syncSourceFeedback to ip-10-150-51-49:31002
 m31001| 2015-06-03T20:30:22.677+0000 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update, response: { ok: 0.0, errmsg: "could not update position upstream; will retry" }
 m31002| 2015-06-03T20:30:22.677+0000 [initandlisten] connection accepted from 10.150.51.49:38436 #18 (7 connections now open)
 m31002| 2015-06-03T20:30:22.678+0000 [conn18] end connection 10.150.51.49:38436 (6 connections now open)
 m31001| 2015-06-03T20:30:22.680+0000 I REPL     [ReplicationExecutor] could not find member to sync from

(above taken from https://logkeeper.mongodb.org/build/556f5c68ead33c19c53803ac/test/556f633cfa59d047f638539c )
I don't understand what would cause the "could not update position upstream" error, though.
