[SERVER-10383] read_pref_cmd.js failed on Windows 64-bit Created: 31/Jul/13  Updated: 11/Jul/16  Resolved: 25/Oct/13

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 2.5.4

Type: Bug Priority: Major - P3
Reporter: Matt Kangas Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

buildbot Windows 64-bit Build #5626


Issue Links:
Depends
depends on SERVER-10823 isMaster does not contain tags in som... Closed
depends on SERVER-10945 isMaster can return duplicate hosts Closed
depends on SERVER-7533 Expose method to the mongo shell to g... Closed
Operating System: ALL
Participants:

 Description   

Ren, what happened here?

Windows 64-bit Build #5626 July 30 rev f6a77ea

http://buildbot.mongodb.org/builders/Windows%2064-bit/builds/5626/steps/test_11/logs/stdio
http://buildlogs.mongodb.org/Windows%2064-bit/builds/5626/test/sharding/read_pref_cmd.js

----
Testing mode: primaryPreferred, tag sets: undefined
----
----
about to do: { "distinct" : "user", "key" : { "x" : 1 }, "query" : { "x" : 1 } }
----
 m31101| Tue Jul 30 09:05:37.511 [initandlisten] connection accepted from 10.28.48.89:50027 #8 (6 connections now open)
 m31101| Tue Jul 30 09:05:37.511 [conn8] run command test.$cmd { query: { distinct: "user", key: { x: 1.0 }, query: { x: 1.0 } }, $readPreference: { mode: "primaryPreferred" } }
 m31101| Tue Jul 30 09:05:37.511 [conn8] command test.$cmd command: { query: { distinct: "user", key: { x: 1.0 }, query: { x: 1.0 } }, $readPreference: { mode: "primaryPreferred" } } ntoreturn:1 keyUpdates:0 locks(micros) r:296 reslen:144 0ms
----
cmd result: {
	"values" : [ ],
	"stats" : {
		"n" : 0,
		"nscanned" : 0,
		"nscannedObjects" : 0,
		"timems" : 0,
		"cursor" : "BtreeCursor x_1"
	},
	"ok" : 1
}
----
 m31100| Tue Jul 30 09:05:37.511 [conn16] query test.system.profile query: { op: "command", ns: "test.$cmd", command.query.distinct: "user" } ntoreturn:1 ntoskip:0 nscanned:0 keyUpdates:0 locks(micros) r:194 nreturned:0 reslen:20 0ms
 m31101| Tue Jul 30 09:05:37.511 [conn1] query test.system.profile query: { op: "command", ns: "test.$cmd", command.query.distinct: "user" } ntoreturn:1 ntoskip:0 nscanned:1 keyUpdates:0 locks(micros) r:147 nreturned:1 reslen:445 0ms
 m31101| Tue Jul 30 09:05:37.511 [conn1] run command admin.$cmd { isMaster: 1.0 }
 m31101| Tue Jul 30 09:05:37.511 [conn1] command admin.$cmd command: { isMaster: 1.0 } ntoreturn:1 keyUpdates:0  reslen:335 0ms
assert failed
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at assert (src/mongo/shell/assert.js:14:5)
    at D:\slave\Windows_64bit\mongo\jstests\sharding\read_pref_cmd.js:77:21
    at Array.forEach (native)
    at cmdTest (D:\slave\Windows_64bit\mongo\jstests\sharding\read_pref_cmd.js:64:18)
    at testReadPreference (D:\slave\Windows_64bit\mongo\jstests\sharding\read_pref_cmd.js:102:5)
    at D:\slave\Windows_64bit\mongo\jstests\sharding\read_pref_cmd.js:242:9
    at Array.forEach (native)
    at testAllModes (D:\slave\Windows_64bit\mongo\jstests\sharding\read_pref_cmd.js:238:7)
Tue Jul 30 09:05:37.527 assert failed at src/mongo/shell/assert.js:7
failed to load: D:\slave\Windows_64bit\mongo\jstests\sharding\read_pref_cmd.js

It looks like exactly the same failure occurred on July 22 Windows 64-bit Build #5622
http://buildlogs.mongodb.org/Windows%2064-bit/builds/5622/test/sharding/read_pref_cmd.js



 Comments   
Comment by auto [ 25/Oct/13 ]

Author:

{u'username': u'renctan', u'name': u'Randolph Tan', u'email': u'randolph@10gen.com'}

Message: SERVER-10383 read_pref_cmd.js failed on Windows 64-bit
Branch: master
https://github.com/mongodb/mongo/commit/b7876925662e986e8ca5def4cf7aecfcbcc5429e

Comment by Matt Kangas [ 22/Aug/13 ]

test disabled in master as part of SERVER-10429; pushing ticket to 2.5.3

Comment by Matt Kangas [ 21/Aug/13 ]

Ren, what's the status on this? Resolved? Unresolved but diagnosed as a test issue? Other?

Comment by Randolph Tan [ 15/Aug/13 ]

Just realized one thing. If the RSM got socket exceptions because of the socket being closed after the reconfig, then that means that the RSM have not yet refresh itself to the config after the replSetReconfig command. Therefore, if we use SERVER-7533 to wait for the shell RSM to get to the desired state, then we shouldn't be seeing this error again.

P.S. This test is currently disabled temporarily.

Comment by Randolph Tan [ 07/Aug/13 ]

Was able to reproduce this with more reliably with debug build on Linux and shortening the ReplicaSetMonitorWatcher interval to 2 sec. This is what happened in the test:

1. ShardingTest setups the sharded cluster with a replicaSet member, then creates a connection to the replica set for internal use. This activates the ReplicaSetMonitorWatcher.
2. ReplicaSetMonitor watcher talks to the set. This causes it to 'cache' connections to the members of the set.
3. The test does a reconfig. This causes the primary to close all it's outgoing sockets.
4. Right before the test for "Testing mode: primaryPreferred, tag sets: undefined" begins, RSMWatcher wakes up again and tries to refresh the view. It tries to use the connections it used previously and gets a SocketException (closed - note: this is the only type of sockExp that needs a higher verbosity to show up) since the primary has already closed the socket due to reconfig. This in turn makes the RSM label the primary as not ok.
5. Test tries to do a primaryPreferred command op, but will choose the secondary since the primary was already labeled as not ok.

Conclusion:
Another timing issue in test. I think another issue is RSM is so sensitive to errors and takes time (or extra effort) to make it realize that the error was already 'outdated'/'old'.

Comment by Randolph Tan [ 01/Aug/13 ]

All the errors follow the same pattern:

The connection from the monitor (in the shell) to the primary was closed for some unknown reason:

[ReplicaSetMonitorWatcher] Socket recv() errno:10053 An established connection was aborted by the software in your host machine. 10.28.48.89:31100

This makes the monitor to label the primary as not ok to use. And the next test expects the command to be routed to the primary and fails because the monitor thinks the primary is down.

There is no part in the test that kills any server or do any replset reconfig (which closes all connections) during the test. So I can't think of why it would get this error. Not sure if it is just pure coincidence that this happens only on one machine and always at the same part of the test.

Comment by Matt Kangas [ 01/Aug/13 ]

Again:

Windows 64-bit Build #5628 July 31

http://buildbot.mongodb.org/builders/Windows%2064-bit/builds/5628/steps/test_11/logs/stdio
http://buildlogs.mongodb.org/Windows%2064-bit/builds/5628/test/sharding/read_pref_cmd.js

Comment by Matt Kangas [ 31/Jul/13 ]

Other recent examples on this builder:

Windows 64-bit Build #5618 July 23 rev a36c34b
http://buildlogs.mongodb.org/Windows%2064-bit/builds/5618/test/sharding/read_pref_cmd.js

Windows 64-bit Build #5613 July 19
http://buildlogs.mongodb.org/Windows%2064-bit/builds/5613/test/sharding/read_pref_cmd.js

Windows 64-bit Build #5612 July 18
http://buildlogs.mongodb.org/Windows%2064-bit/builds/5612/test/sharding/read_pref_cmd.js

Windows 64-bit Build #5604 July 10 rev 8512fdd
http://buildlogs.mongodb.org/Windows%2064-bit/builds/5604/test/sharding/read_pref_cmd.js

I don't see any earlier failures on the Windows 64-bit builder within the last 100 builds (since May 17).

Generated at Thu Feb 08 03:23:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.