[SERVER-11314] sharding:empty_cluster_init.js failed on V2.4 Win 32-bit Created: 22/Oct/13  Updated: 10/Dec/14  Resolved: 23/Oct/13

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: 2.4.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matt Kangas Assignee: Randolph Tan
Resolution: Duplicate Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-11220 empty_cluster_init.js failing Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

V2.4 Windows 32-bit Build #261 rev 8997835c05 (Mon Oct 21)

http://buildlogs.mongodb.org/V2.4%20Windows%2032-bit/builds/261/test/sharding/empty_cluster_init.js

Note that Greg increased a timeout for this test in 2.5.3 in SERVER-11220. Those examples took 10+ min to fail. This one failed in 3.5 minutes.

 m27003| Mon Oct 21 16:05:47.086 [mongosMain] waited 36s for distributed lock configUpgrade for upgrading config database to new format v4
 m27004| Mon Oct 21 16:05:59.582 [mongosMain] waited 48s for distributed lock configUpgrade for upgrading config database to new format v4
...
 m27003| Mon Oct 21 16:05:59.582 [mongosMain] waited 48s for distributed lock configUpgrade for upgrading config database to new format v4
 m27005| Mon Oct 21 16:05:47.117 [mongosMain] distributed lock 'configUpgrade/EC2AMAZ-EXWX6MN:27005:1382371510:41' acquired, ts : 526550c84fa8416bdd73144a
 m27005| Mon Oct 21 16:05:47.117 [mongosMain] starting upgrade of config server from v0 to v4
 m27005| Mon Oct 21 16:05:47.117 [mongosMain] starting next upgrade step from v0 to v4
 m27005| Mon Oct 21 16:05:47.117 [mongosMain] about to log new metadata event: { _id: "EC2AMAZ-EXWX6MN-2013-10-21T16:05:47-526550db4fa8416bdd73144b", server: "EC2AMAZ-EXWX6MN", clientAddr: "N/A", time: new Date(1382371547117), what: "starting upgrade of config database", ns: "config.version", details: { from: 0, to: 4 } }
...
 m27000| Mon Oct 21 16:05:59.597 [conn16] runQuery called admin.$cmd { shutdown: 1, force: 1 }
 m27000| Mon Oct 21 16:05:59.597 [conn16] run command admin.$cmd { shutdown: 1, force: 1 }
 m27000| Mon Oct 21 16:05:59.597 [conn16] command: { shutdown: 1, force: 1 }
 m27000| Mon Oct 21 16:05:59.597 [conn16] terminating, shutdown command received
 m27000| Mon Oct 21 16:05:59.597 dbexit: shutdown called
...
 m27000| Mon Oct 21 16:05:33.857 [conn10] update config.locks query: { _id: "configUpgrade" } update: { $set: { state: 2, who: "EC2AMAZ-EXWX6MN:27005:1382371510:41:mongosMain:18467", process: "EC2AMAZ-EXWX6MN:27005:1382371510:41", when: new Date(1382371528974), why: "upgrading config database to new format v4", ts: ObjectId('526550c84fa8416bdd73144a') } } idhack:1 nupdated:1 fastmod:1 keyUpdates:0 locks(micros) w:111 0ms
 m27004| Mon Oct 21 16:05:33.826 [mongosMain] could not acquire lock 'configUpgrade/EC2AMAZ-EXWX6MN:27004:1382371510:41' (another update won)
 m27004| Mon Oct 21 16:05:33.826 [mongosMain] distributed lock 'configUpgrade/EC2AMAZ-EXWX6MN:27004:1382371510:41' was not acquired.
 m27004| Mon Oct 21 16:05:33.826 [mongosMain] waited 23s for distributed lock configUpgrade for upgrading config database to new format v4
 m27000| Mon Oct 21 16:05:34.840 [conn9] runQuery called config.locks { _id: "configUpgrade" }
...
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at Function.assert.soon (src/mongo/shell/assert.js:110:60)
    at D:\slave\Windows_32bit_V2.4\mongo\jstests\sharding\empty_cluster_init.js:37:8
Mon Oct 21 16:05:47.086 assert.soon failed: function () {
    try {
        mongosConn = new Mongo(mongoses[0].host);
        return true;
    }
    catch (e) {
        print("Waiting for connect...");
        printjson(e);
        return false;
    }
}, msg:undefined at src/mongo/shell/assert.js:7
failed to load: D:\slave\Windows_32bit_V2.4\mongo\jstests\sharding\empty_cluster_init.js

Note, this occurred on Win 32 which has only a single CPU. If any of the spawned processes are spinning it can starve other processes of CPU time.



 Comments   
Comment by Randolph Tan [ 23/Oct/13 ]

The assertion already failed at 16:05, and for some reason, test still continues to run till 16:08... Basically, the test spawns 3 mongos in succession (with ports 27003, 27004, 27005) and tries to connect to them one by one. The assert failed when it tried to connect to the 27003 which still has not finished the upgrade/initialization phase so it won't accept connections.

Summary of events (basically similar to SERVER-11220)

27003 started

m27003| Mon Oct 21 16:05:07.649 [mongosMain] MongoS version 2.4.8-pre- starting: pid=6096 port=27003 32-bit host=EC2AMAZ-EXWX6MN (--help for usage)

27003 tries to connect to config servers

 m27003| Mon Oct 21 16:05:07.727 [mongosMain] SyncClusterConnection connecting to [localhost:27000]
 m27003| Mon Oct 21 16:05:07.727 [mongosMain] SyncClusterConnection connecting to [localhost:27001]
 m27003| Mon Oct 21 16:05:07.743 [mongosMain] SyncClusterConnection connecting to [localhost:27002]

And for some reasons unclear, fails:

 m27003| Mon Oct 21 16:05:10.785 [mongosMain] scoped connection to localhost:27000,localhost:27001,localhost:27002 not being returned to the pool

And for some reason there was about a 10 seconds of no activity (note that time jumped from 18 sec to 28 sec) - apparently, this machine has only 1 core, OS decided to do something else?:

 m27005| Mon Oct 21 16:05:16.494 [LockPinger] creating distributed lock ping thread for localhost:27000,localhost:27001,localhost:27002 and process EC2AMAZ-EXWX6MN:27005:1382371510:41 (sleeping for 30000ms)
 m27005| Mon Oct 21 16:05:16.494 [LockPinger] SyncClusterConnection connecting to [localhost:27000]
 m27005| Mon Oct 21 16:05:16.494 [LockPinger] SyncClusterConnection connecting to [localhost:27001]
 m27005| Mon Oct 21 16:05:16.494 [LockPinger] SyncClusterConnection connecting to [localhost:27002]
 m27002| Mon Oct 21 16:05:17.165 [conn11] runQuery called admin.$cmd { resetError: 1 }
 m27002| Mon Oct 21 16:05:28.974 [conn11] run command admin.$cmd { resetError: 1 }
 m27002| Mon Oct 21 16:05:28.974 [conn11] command admin.$cmd command: { resetError: 1 } ntoreturn:1 keyUpdates:0  reslen:37 0ms
 m27000| Mon Oct 21 16:05:16.494 [conn11] runQuery called admin.$cmd { resetError: 1 }
 m27000| Mon Oct 21 16:05:16.494 [conn11] run command admin.$cmd { resetError: 1 }
 m27000| Mon Oct 21 16:05:16.494 [conn11] command admin.$cmd command: { resetError: 1 } ntoreturn:1 keyUpdates:0  reslen:37 0ms
 m27000| Mon Oct 21 16:05:16.494 [conn11] runQuery called admin.$cmd { getlasterror: 1, fsync: 1 }
 m27000| Mon Oct 21 16:05:16.494 [conn11] run command admin.$cmd { getlasterror: 1, fsync: 1 }
 m27000| Mon Oct 21 16:05:16.494 [conn11] fsync from getlasterror
 m27000| Mon Oct 21 16:05:16.494 [conn12] runQuery called admin.$cmd { resetError: 1 }
 m27000| Mon Oct 21 16:05:16.494 [conn12] run command admin.$cmd { resetError: 1 }
 m27000| Mon Oct 21 16:05:16.494 [conn12] command admin.$cmd command: { resetError: 1 } ntoreturn:1 keyUpdates:0  reslen:37 0ms
 m27000| Mon Oct 21 16:05:16.494 [conn12] runQuery called admin.$cmd { getlasterror: 1, fsync: 1 }
 m27000| Mon Oct 21 16:05:16.494 [conn12] run command admin.$cmd { getlasterror: 1, fsync: 1 }
 m27000| Mon Oct 21 16:05:16.494 [conn12] fsync from getlasterror
 m27000| Mon Oct 21 16:05:16.494 [initandlisten] connection accepted from 127.0.0.1:65214 #13 (10 connections now open)
 m27000| Mon Oct 21 16:05:16.494 [conn13] runQuery called admin.$cmd { resetError: 1 }
 m27000| Mon Oct 21 16:05:16.494 [conn13] run command admin.$cmd { resetError: 1 }
 m27000| Mon Oct 21 16:05:16.494 [conn13] command admin.$cmd command: { resetError: 1 } ntoreturn:1 keyUpdates:0  reslen:37 0ms
 m27000| Mon Oct 21 16:05:16.494 [conn13] runQuery called admin.$cmd { getlasterror: 1, fsync: 1 }
 m27000| Mon Oct 21 16:05:16.494 [conn13] run command admin.$cmd { getlasterror: 1, fsync: 1 }
 m27000| Mon Oct 21 16:05:16.494 [conn13] fsync from getlasterror
 m27000| Mon Oct 21 16:05:17.150 [conn13] command admin.$cmd command: { getlasterror: 1, fsync: 1 } ntoreturn:1 keyUpdates:0  reslen:83 653ms
 m27000| Mon Oct 21 16:05:17.150 [conn12] command admin.$cmd command: { getlasterror: 1, fsync: 1 } ntoreturn:1 keyUpdates:0  reslen:83 663ms
 m27000| Mon Oct 21 16:05:17.165 [conn11] command admin.$cmd command: { getlasterror: 1, fsync: 1 } ntoreturn:1 keyUpdates:0  reslen:83 665ms
 m27002| Mon Oct 21 16:05:18.039 [conn8] command admin.$cmd command: { getlasterror: 1, fsync: 1 } ntoreturn:1 keyUpdates:0  reslen:83 892ms
 m27002| Mon Oct 21 16:05:28.974 [conn12] runQuery called admin.$cmd { getlasterror: 1, fsync: 1 }
 m27002| Mon Oct 21 16:05:28.974 [conn12] run command admin.$cmd { getlasterror: 1, fsync: 1 }
 m27002| Mon Oct 21 16:05:28.974 [conn12] fsync from getlasterror
 m27002| Mon Oct 21 16:05:28.974 [conn12] command admin.$cmd command: { getlasterror: 1, fsync: 1 } ntoreturn:1 keyUpdates:0  reslen:83 1ms
 m27002| Mon Oct 21 16:05:18.054 [conn10] command admin.$cmd command: { getlasterror: 1, fsync: 1 } ntoreturn:1 keyUpdates:0  reslen:182 891ms
 m27002| Mon Oct 21 16:05:28.974 [conn11] runQuery called admin.$cmd { getlasterror: 1, fsync: 1 }
 m27002| Mon Oct 21 16:05:28.974 [conn11] run command admin.$cmd { getlasterror: 1, fsync: 1 }
 m27002| Mon Oct 21 16:05:28.974 [conn11] fsync from getlasterror
 m27002| Mon Oct 21 16:05:28.974 [conn11] command admin.$cmd command: { getlasterror: 1, fsync: 1 } ntoreturn:1 keyUpdates:0  reslen:83 0ms
 m27002| Mon Oct 21 16:05:18.054 [conn13] command admin.$cmd command: { getlasterror: 1, fsync: 1 } ntoreturn:1 keyUpdates:0  reslen:83 892ms
 m27002| Mon Oct 21 16:05:28.990 [conn12] allocExtent config.lockpings size 4096 0
 m27002| Mon Oct 21 16:05:28.990 [conn12] adding _id index for collection config.lockpings

27005 got the dist lock

 m27001| Mon Oct 21 16:05:29.084 [conn10] update config.locks query: { _id: "configUpgrade", state: 0 } update: { $set: { state: 1, who: "EC2AMAZ-EXWX6MN:27005:1382371510:41:mongosMain:18467", process: "EC2AMAZ-EXWX6MN:27005:1382371510:41", when: new Date(1382371528974), why: "upgrading config database to new format v4", ts: ObjectId('526550c84fa8416bdd73144a') } } nscanned:1 nmoved:1 nupdated:1 keyUpdates:0 locks(micros) w:293 0ms

shell attempted connection to 27003 timed out

 m27002| Mon Oct 21 16:05:47.055 [conn12] run command admin.$cmd { resetError: 1 }
 m27002| Mon Oct 21 16:05:47.055 [conn12] command admin.$cmd command: { resetError: 1 } ntoreturn:1 keyUpdates:0  reslen:37 0ms
Waiting for connect...
Error: couldn't connect to server localhost:27003
assert.soon failed: function () {
    try {
        mongosConn = new Mongo(mongoses[0].host);
        return true;
    }
    catch (e) {
        print("Waiting for connect...");
        printjson(e);
        return false;
    }
}, msg:undefined

27005 just finished upgrading:

 m27005| Mon Oct 21 16:05:47.117 [mongosMain] about to log new metadata event: { _id: "EC2AMAZ-EXWX6MN-2013-10-21T16:05:47-526550db4fa8416bdd73144b", server: "EC2AMAZ-EXWX6MN", clientAddr: "N/A", time: new Date(1382371547117), what: "starting upgrade of config database", ns: "config.version", details: { from: 0, to: 4 } }

Generated at Thu Feb 08 03:25:27 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.