[SERVER-20764] create_index_gle.js intermittently fails, getLastError returns error "could not target full range of test.user; metadata not found" Created: 02/Oct/15  Updated: 25/Jan/17  Resolved: 06/Oct/15

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 3.1.9

Type: Bug Priority: Critical - P2
Reporter: J Rassi Assignee: Andy Schwerin
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-20743 Sharding code needs to set $secondary... Closed
depends on SERVER-20746 --configsvr shouldn't imply --master ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding A (10/09/15)
Participants:

 Description   

create_index_gle.js started intermittently failing since yesterday afternoon. The failure symptom is that a getLastError command returns a "could not target full range of test.user; metadata not found". I am unable to reproduce this failure on my desktop. A list follows of recent failures:

Linux:

SSL Ubuntu 14.04:

SSL Amazon Linux:

SSL RHEL 7.0:

Excerpt from first failure in above list:

[js_test:create_index_gle] 2015-10-01T18:54:39.800-0400 d20011| 2015-10-01T18:54:39.800-0400 D -        [conn13] User Assertion: 13435:not master and slaveOk=false
[js_test:create_index_gle] 2015-10-01T18:54:39.801-0400 d20011| 2015-10-01T18:54:39.800-0400 D COMMAND  [conn13] assertion while executing command 'listDatabases' on database 'admin' with arguments '{ listDatabases: 1 }' and metadata '{ configsvr: { opTime: { ts: Timestamp 1443740079000|6, t: -1 } } }': 13435 not master and slaveOk=false
[js_test:create_index_gle] 2015-10-01T18:54:39.801-0400 d20011| 2015-10-01T18:54:39.800-0400 I COMMAND  [conn13] command admin.$cmd command: listDatabases { listDatabases: 1 } ntoreturn:1 ntoskip:0 keyUpdates:0 writeConflicts:0 exception: not master and slaveOk=false code:13435 numYields:0 reslen:125 locks:{} protocol:op_command 0ms
[js_test:create_index_gle] 2015-10-01T18:54:39.803-0400 s20015| 2015-10-01T18:54:39.802-0400 I SHARDING [conn1] distributed lock with ts: 560db9afa3a189e276ebc8da' unlocked.
[js_test:create_index_gle] 2015-10-01T18:54:39.803-0400 s20015| 2015-10-01T18:54:39.802-0400 W SHARDING [conn1] could not initialize targeter for index write op in collection test.user
[js_test:create_index_gle] 2015-10-01T18:54:39.805-0400 s20015| 2015-10-01T18:54:39.805-0400 I SHARDING [conn1] distributed lock 'test' acquired, ts : 560db9afa3a189e276ebc8db
[js_test:create_index_gle] 2015-10-01T18:54:39.901-0400 d20011| 2015-10-01T18:54:39.901-0400 D -        [conn13] User Assertion: 13435:not master and slaveOk=false
[js_test:create_index_gle] 2015-10-01T18:54:39.902-0400 d20011| 2015-10-01T18:54:39.901-0400 D COMMAND  [conn13] assertion while executing command 'listDatabases' on database 'admin' with arguments '{ listDatabases: 1 }' and metadata '{ configsvr: { opTime: { ts: Timestamp 1443740079000|8, t: -1 } } }': 13435 not master and slaveOk=false
[js_test:create_index_gle] 2015-10-01T18:54:39.902-0400 d20011| 2015-10-01T18:54:39.901-0400 I COMMAND  [conn13] command admin.$cmd command: listDatabases { listDatabases: 1 } ntoreturn:1 ntoskip:0 keyUpdates:0 writeConflicts:0 exception: not master and slaveOk=false code:13435 numYields:0 reslen:125 locks:{} protocol:op_command 0ms
[js_test:create_index_gle] 2015-10-01T18:54:39.904-0400 s20015| 2015-10-01T18:54:39.903-0400 I SHARDING [conn1] distributed lock with ts: 560db9afa3a189e276ebc8db' unlocked.
[js_test:create_index_gle] 2015-10-01T18:54:39.904-0400 s20015| 2015-10-01T18:54:39.903-0400 W SHARDING [conn1] could not refresh targeter :: caused by :: totalSize field not found in listDatabases
[js_test:create_index_gle] 2015-10-01T18:54:39.905-0400 s20015| 2015-10-01T18:54:39.904-0400 W SHARDING [conn1] failed to get database config for test.system.indexes while checking for auto-split: NamespaceNotFound database test not found
[js_test:create_index_gle] 2015-10-01T18:54:39.906-0400 assert: write or write concern failed: {
[js_test:create_index_gle] 2015-10-01T18:54:39.906-0400 	"err" : "could not target full range of test.user; metadata not found",
[js_test:create_index_gle] 2015-10-01T18:54:39.906-0400 	"code" : 26,
[js_test:create_index_gle] 2015-10-01T18:54:39.906-0400 	"n" : 0,
[js_test:create_index_gle] 2015-10-01T18:54:39.907-0400 	"shards" : [ ],
[js_test:create_index_gle] 2015-10-01T18:54:39.907-0400 	"shardRawGLE" : {
[js_test:create_index_gle] 2015-10-01T18:54:39.907-0400 
[js_test:create_index_gle] 2015-10-01T18:54:39.907-0400 	},
[js_test:create_index_gle] 2015-10-01T18:54:39.907-0400 	"ok" : 1
[js_test:create_index_gle] 2015-10-01T18:54:39.907-0400 }
[js_test:create_index_gle] 2015-10-01T18:54:39.907-0400 _getErrorWithCode@src/mongo/shell/utils.js:23:13
[js_test:create_index_gle] 2015-10-01T18:54:39.908-0400 doassert@src/mongo/shell/assert.js:13:14
[js_test:create_index_gle] 2015-10-01T18:54:39.908-0400 assert.gleOK@src/mongo/shell/assert.js:462:9
[js_test:create_index_gle] 2015-10-01T18:54:39.908-0400 @jstests/gle/create_index_gle.js:32:1
[js_test:create_index_gle] 2015-10-01T18:54:39.908-0400 
[js_test:create_index_gle] 2015-10-01T18:54:39.908-0400 2015-10-01T18:54:39.906-0400 E QUERY    [thread1] Error: write or write concern failed: {
[js_test:create_index_gle] 2015-10-01T18:54:39.909-0400 	"err" : "could not target full range of test.user; metadata not found",
[js_test:create_index_gle] 2015-10-01T18:54:39.909-0400 	"code" : 26,
[js_test:create_index_gle] 2015-10-01T18:54:39.909-0400 	"n" : 0,
[js_test:create_index_gle] 2015-10-01T18:54:39.909-0400 	"shards" : [ ],
[js_test:create_index_gle] 2015-10-01T18:54:39.909-0400 	"shardRawGLE" : {
[js_test:create_index_gle] 2015-10-01T18:54:39.909-0400 
[js_test:create_index_gle] 2015-10-01T18:54:39.909-0400 	},
[js_test:create_index_gle] 2015-10-01T18:54:39.909-0400 	"ok" : 1
[js_test:create_index_gle] 2015-10-01T18:54:39.909-0400 } :
[js_test:create_index_gle] 2015-10-01T18:54:39.910-0400 _getErrorWithCode@src/mongo/shell/utils.js:23:13
[js_test:create_index_gle] 2015-10-01T18:54:39.910-0400 doassert@src/mongo/shell/assert.js:13:14
[js_test:create_index_gle] 2015-10-01T18:54:39.910-0400 assert.gleOK@src/mongo/shell/assert.js:462:9
[js_test:create_index_gle] 2015-10-01T18:54:39.910-0400 @jstests/gle/create_index_gle.js:32:1
[js_test:create_index_gle] 2015-10-01T18:54:39.910-0400 
[js_test:create_index_gle] 2015-10-01T18:54:39.910-0400 failed to load: jstests/gle/create_index_gle.js



 Comments   
Comment by Githook User [ 06/Oct/15 ]

Author:

{u'username': u'andy10gen', u'name': u'Andy Schwerin', u'email': u'schwerin@mongodb.com'}

Message: SERVER-20764 Adjust create_index_gle.js to eliminate race between shell and mongos.

The problem with create_index_gle.js that causes SERVER-20764 was that
when the node ultimately given a slaveDelay is initially elected
primary, the test does not wait for mongos to recognize the change of
primary node in the shard replica set. The fix in this patch is to
initialize the shard replica set with only one electable node.
Branch: master
https://github.com/mongodb/mongo/commit/d9f2d8fda800fdc2cbd2be30413412c19a24f97a

Comment by Andy Schwerin [ 05/Oct/15 ]

Beyond the dependency on SERVER-20743 and SERVER-20746, this test doesn't give mongos a chance to discover changes to the identity of the primary of the shard replica set rs0 before attempting to write to it. This is a bug in the test, given the current behavior in which the shell, mongos and drivers do not retry on NotMaster write errors.

Comment by Andy Schwerin [ 02/Oct/15 ]

The root of this problem is that sharding code is sending some commands to shard member nodes that might be secondaries without setting $secondaryOk.

It happens to manifest now because the switch to PV1 has apparently introduced some extra (spurious?) failover or step down events.

In this particular manifestation, while trying to choose a shard on which to place a new database, mongos attempts to find out the storage size of every shard via the listDatabases command. It uses a primary-preferred read preference for target selection, but does not set $secondaryOk on the command metadata. This is the result of an API deficiency, that makes this too easy to do. I'll file and link a server ticket.

Comment by J Rassi [ 02/Oct/15 ]

I am also unable to reproduce this issue on a Linux spawnhost (rhel55-test host type, binaries from 466f0596, command line: python buildscripts/resmoke.py --repeat=10 --storageEngine=mmapv1 --executor=gle_auth jstests/gle/create_index_gle.js).

Comment by J Rassi [ 02/Oct/15 ]

Assigning to Spencer for initial triage. Spencer, could you provide an interpretation of this error message? I suspect that one of the replication commits from yesterday is the actual culprit, but thought you might be able to identify next steps and next assignee.

Generated at Thu Feb 08 03:55:12 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.