[SERVER-16649] mongos_rs_auth_shard_failure_tolerance.js should wait for writes to be replicated before shutting down the primary Created: 25/Nov/14  Updated: 15/Jan/15  Resolved: 24/Dec/14

Status: Closed
Project: Core Server
Component/s: Replication, Sharding, Testing Infrastructure
Affects Version/s: None
Fix Version/s: 2.8.0-rc5

Type: Bug Priority: Minor - P4
Reporter: Ian Whalen (Inactive) Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File log     HTML File oklog    
Issue Links:
Related
related to SERVER-16336 Document not returned: "Didn't find ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

TEST HISTORY

LOGS

assert: [null] != [null] are equal : undefined
Error: [null] != [null] are equal : undefined
    at Error (<anonymous>)
    at doassert (src/mongo/shell/assert.js:11:14)
    at Function.assert.neq (src/mongo/shell/assert.js:114:5)
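
For reference, the assertion above is an assert.neq(null, ...) check on a findOne() result that fires when the queried node cannot see the document. A minimal sketch of the shape of that check (the collection handle and query are assumptions, not copied from the test):

// Hypothetical sketch: assert.neq reports "[null] != [null] are equal" when
// findOne() returns null, i.e. the document is not visible on the node queried.
var doc = collSharded.findOne({_id: 1});
assert.neq(null, doc);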



 Comments   
Comment by Githook User [ 24/Dec/14 ]

Author: {u'username': u'stbrody', u'name': u'Spencer T Brody', u'email': u'spencer@mongodb.com'}

Message: SERVER-16649 Wait for replication in mongos_rs_{auth_}shard_failure_tolerance.js tests
Branch: master
https://github.com/mongodb/mongo/commit/9d5ec8f203268427c7e8c86dfc21b50455212c53
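
The commit itself isn't reproduced here; as an illustration of "wait for writes to be replicated before shutting down the primary" from the ticket title, one option in a jstest is an explicit awaitReplication() on the queried shard's replica set (a hedged sketch; the variable names are assumptions, and the committed fix may instead use a write concern, per the comment below):

// Hypothetical sketch: make sure the secondary already has the writes before
// the test shuts down the primary, so reads routed to the secondary find them.
st.rs1.awaitReplication();
// ...the test can now stop the primary of test-rs1 without losing visibility.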

Comment by Spencer Brody (Inactive) [ 23/Dec/14 ]

In all the logs I see, it looks like the secondary of the shard being queried hasn't figured out a sync source for most of the test, and only starts syncing from the primary moments before the primary is shut down. This may be as simple as the data not having replicated to the secondary yet. I'm putting together a patch that adds a {w:2} write concern to the relevant inserts; I think that may fix it.
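
A minimal sketch of that kind of change (the collection handle and document below are assumptions; the actual patch is in the commit linked above):

// Hypothetical sketch: a {w: 2} write concern makes the insert wait until the
// write is acknowledged by the secondary as well as the primary.
assert.writeOK(collSharded.insert({_id: 1}, {writeConcern: {w: 2}}));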

Comment by Ian Whalen (Inactive) [ 19/Dec/14 ]

Also, just noting that this does still seem to be happening, but it doesn't seem frequent enough to be an RC blocker, so I'm bumping the priority down. It's still happening often enough that we should probably fix it.

Comment by Ian Whalen (Inactive) [ 19/Dec/14 ]

mark.benvenuto what's the next step here?

Comment by David Storch [ 12/Dec/14 ]

Saw this again in a recent patch build:

https://mci.10gen.com/task/mongodb_mongo_master_linux_64_de3b0d465b342e3a5cef629b46f3cf00b742bfd9_14_12_12_19_54_29_sharding_WT_linux_64

Comment by Randolph Tan [ 03/Dec/14 ]

http://buildlogs.mongodb.org/MCI_rhel70/builds/388199/test/sharding_0/mongos_rs_auth_shard_failure_tolerance.js

Comment by David Storch [ 03/Dec/14 ]

After discussing with Mark, this could be the same issue as (or at least related to) the WT corruption that he is looking into. Reassigning so that Mark can take a look. Happy to keep digging if you have any suggestions for how we might track this one down.

Comment by David Storch [ 01/Dec/14 ]

A few observations:

  • In the failure case, the explain that Randolph posted shows an EOF plan. This plan means that the collection was not found in the catalog (i.e. the collection does not exist).
  • The MCI history shows that this may have first started on commit 792f66beb03f0c, which was an import of a new version of wiredTiger. This, along with the fact that the issue manifests only when running with the wiredTiger storage engine, leads me to suspect that the root cause is somewhere in wiredTiger.

This is going to require some more digging.
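
As a side note, the shard-level plan stage is easy to pull out of the explain output Randolph posted; a hedged sketch of that check in the shell (collSharded is an assumption):

// Hypothetical sketch: in a sharded explain, each shard reports its own winning
// plan. "EOF" means the shard has no such collection in its catalog; a healthy
// _id lookup reports "IDHACK" instead.
var explain = collSharded.find({_id: 1}).explain("executionStats");
var shardPlan = explain.queryPlanner.winningPlan.shards[0].winningPlan;
print("shard winning plan stage: " + shardPlan.stage);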

Update: I can reproduce consistently on a build of HEAD simply by running the following commands on a linux-64-amzn-test EC2 instance:

rm -rf /data/db/
python buildscripts/smoke.py --storageEngine=wiredTiger jstests/sharding/mongos_rs_auth_shard_failure_tolerance.js

One hypothesis is that this is a symptom of the node running out of disk space.

Comment by Ian Whalen (Inactive) [ 01/Dec/14 ]

Just noting that this is still happening as of 12/01.

Comment by Randolph Tan [ 26/Nov/14 ]

Was able to reproduce this intermittently on linux-64-amzn-test. Just attached the log with custom explain logging, with verbosity 1 on mongod. It looks like the document was somehow lost, or cannot be observed. This appears to happen only on wiredTiger variants running the auth version of this test (mongos_rs_shard_failure_tolerance.js seems to be passing all the time).

Explain for fail case:

{
	"queryPlanner" : {
		"mongosPlannerVersion" : 1,
		"winningPlan" : {
			"stage" : "SINGLE_SHARD",
			"shards" : [
				{
					"shardName" : "test-rs1",
					"connectionString" : "test-rs1/ip-10-33-141-202:31200,ip-10-33-141-202:31201",
					"serverInfo" : {
						"host" : "ip-10-33-141-202",
						"port" : 31201,
						"version" : "2.8.0-rc2-pre-",
						"gitVersion" : "45790039049d7375beafe122622363d35ce990c2"
					},
					"plannerVersion" : 1,
					"parsedQuery" : {
						"_id" : {
							"$eq" : 1
						}
					},
					"winningPlan" : {
						"stage" : "EOF"
					},
					"rejectedPlans" : [ ]
				}
			]
		}
	},
	"executionStats" : {
		"nReturned" : 0,
		"executionTimeMillis" : 54,
		"totalKeysExamined" : 0,
		"totalDocsExamined" : 0,
		"executionStages" : {
			"stage" : "SINGLE_SHARD",
			"nReturned" : 0,
			"executionTimeMillis" : 54,
			"totalKeysExamined" : 0,
			"totalDocsExamined" : 0,
			"totalChildMillis" : NumberLong(0),
			"shards" : [
				{
					"shardName" : "test-rs1",
					"executionSuccess" : true,
					"executionStages" : {
						"stage" : "EOF",
						"nReturned" : 0,
						"executionTimeMillisEstimate" : 0,
						"works" : 1,
						"advanced" : 0,
						"needTime" : 0,
						"needFetch" : 0,
						"saveState" : 0,
						"restoreState" : 0,
						"isEOF" : 1,
						"invalidates" : 0
					}
				}
			]
		},
		"allPlansExecution" : [
			{
				"shardName" : "test-rs1",
				"allPlans" : [ ]
			}
		]
	},
	"ok" : 1
}

Explain for passing case:

{
	"queryPlanner" : {
		"mongosPlannerVersion" : 1,
		"winningPlan" : {
			"stage" : "SINGLE_SHARD",
			"shards" : [
				{
					"shardName" : "test-rs1",
					"connectionString" : "test-rs1/ip-10-33-141-202:31200,ip-10-33-141-202:31201",
					"serverInfo" : {
						"host" : "ip-10-33-141-202",
						"port" : 31201,
						"version" : "2.8.0-rc2-pre-",
						"gitVersion" : "45790039049d7375beafe122622363d35ce990c2"
					},
					"plannerVersion" : 1,
					"parsedQuery" : {
						"_id" : {
							"$eq" : 1
						}
					},
					"winningPlan" : {
						"stage" : "IDHACK"
					},
					"rejectedPlans" : [ ]
				}
			]
		}
	},
	"executionStats" : {
		"nReturned" : 1,
		"executionTimeMillis" : 54,
		"totalKeysExamined" : 1,
		"totalDocsExamined" : 1,
		"executionStages" : {
			"stage" : "SINGLE_SHARD",
			"nReturned" : 1,
			"executionTimeMillis" : 54,
			"totalKeysExamined" : 1,
			"totalDocsExamined" : 1,
			"totalChildMillis" : NumberLong(0),
			"shards" : [
				{
					"shardName" : "test-rs1",
					"executionSuccess" : true,
					"executionStages" : {
						"stage" : "IDHACK",
						"nReturned" : 1,
						"executionTimeMillisEstimate" : 0,
						"works" : 2,
						"advanced" : 1,
						"needTime" : 0,
						"needFetch" : 0,
						"saveState" : 0,
						"restoreState" : 0,
						"isEOF" : 1,
						"invalidates" : 0,
						"keysExamined" : 1,
						"docsExamined" : 1
					}
				}
			]
		},
		"allPlansExecution" : [
			{
				"shardName" : "test-rs1",
				"allPlans" : [ ]
			}
		]
	},
	"ok" : 1
}
