[SERVER-23829] Logs from mongos started by the mongo shell are prefixed with "sh" instead of "s" Created: 20/Apr/16  Updated: 02/Sep/16  Resolved: 18/Aug/16

Status: Closed
Project: Core Server
Component/s: Shell
Affects Version/s: None
Fix Version/s: 3.3.12

Type: Bug Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: Matt Cotter
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File multi_coll_drop-on-MCI_enterprise-rhel-71-ppc64.txt    
Issue Links:
Related
related to SERVER-22992 wait_for_pid() function in shell_util... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Platforms 2016-08-26
Participants:
Linked BF Score: 0

 Description   

I think the "sh20014" here is supposed to be "s20014". The log might also be missing some logging: seeing a lock write being replicated on the config server secondaries, but not the original write to the config server primary.

Example of "sh20014" doing the same checks "s20014" gets stuck on previously in the logs.

[js_test:multi_coll_drop] 2016-04-06T02:54:18.252-0500 sh20014| 2016-04-06T02:53:59.679-0500 D ASIO     [conn1] startCommand: RemoteCommand 1014 -- target:mongovm16:20013 db:config expDate:2016-04-06T02:54:29.679-0500 cmd:{ find: "locks", filter: { _id: "multidrop.coll" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1459929237000|1, t: 8 } }, limit: 1, maxTimeMS: 30000 }
[js_test:multi_coll_drop] 2016-04-06T02:54:18.255-0500 sh20014| 2016-04-06T02:53:59.679-0500 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Starting asynchronous command 1014 on host mongovm16:20013
[js_test:multi_coll_drop] 2016-04-06T02:54:18.263-0500 sh20014| 2016-04-06T02:53:59.680-0500 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Request 1014 finished with response: { waitedMS: 0, cursor: { firstBatch: [ { _id: "multidrop.coll", state: 2, ts: ObjectId('5704c06465c17830b843f1cb'), who: "mongovm16:20010:1459929128:185613966:conn5", process: "mongovm16:20010:1459929128:185613966", when: new Date(1459929188727), why: "splitting chunk [{ _id: -62.0 }, { _id: MaxKey }) in multidrop.coll" } ], id: 0, ns: "config.locks" }, ok: 1.0 }
[js_test:multi_coll_drop] 2016-04-06T02:54:18.269-0500 sh20014| 2016-04-06T02:53:59.680-0500 D ASIO     [conn1] startCommand: RemoteCommand 1016 -- target:mongovm16:20013 db:config expDate:2016-04-06T02:54:29.680-0500 cmd:{ find: "lockpings", filter: { _id: "mongovm16:20010:1459929128:185613966" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1459929237000|1, t: 8 } }, limit: 1, maxTimeMS: 30000 }
[js_test:multi_coll_drop] 2016-04-06T02:54:18.270-0500 sh20014| 2016-04-06T02:53:59.681-0500 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Starting asynchronous command 1016 on host mongovm16:20013
[js_test:multi_coll_drop] 2016-04-06T02:54:18.275-0500 sh20014| 2016-04-06T02:53:59.681-0500 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Request 1016 finished with response: { waitedMS: 0, cursor: { firstBatch: [ { _id: "mongovm16:20010:1459929128:185613966", ping: new Date(1459929228990) } ], id: 0, ns: "config.lockpings" }, ok: 1.0 }
[js_test:multi_coll_drop] 2016-04-06T02:54:18.279-0500 sh20014| 2016-04-06T02:53:59.681-0500 D ASIO     [conn1] startCommand: RemoteCommand 1018 -- target:mongovm16:20013 db:admin expDate:2016-04-06T02:54:29.681-0500 cmd:{ serverStatus: 1, maxTimeMS: 30000 }
[js_test:multi_coll_drop] 2016-04-06T02:54:18.292-0500 sh20014| 2016-04-06T02:53:59.682-0500 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Starting asynchronous command 1018 on host mongovm16:20013

Logs are attached. They come from the new PPC machine, if that makes any difference.

Evergreen links:
sharding_csrs_continuous_config_stepdown_WT failed on enterprise-rhel-71-ppc64le

multi_coll_drop.js



 Comments   
Comment by Githook User [ 18/Aug/16 ]

Author: Matt Cotter (Machyne) <matt.cotter@mongodb.com>

Message: SERVER-23829 shell launcher logs name stored on ProgramRunner

These names used to be stored in the registry in a map, but the process could have been removed from the map before its output was done being read.
Branch: master
https://github.com/mongodb/mongo/commit/8da9403a0f7b89047845a32d454d951302342808

Comment by Kaloian Manassiev [ 20/Apr/16 ]

I am going to create a new Evergreen ticket for the missing log lines, then.

Comment by Max Hirschhorn [ 20/Apr/16 ]

I think it is possible to address this issue separately from SERVER-22992. In particular, rather than calling registry.programName(pid) in ProgramOutputMultiplexer::appendLine(), we could pass the prefix to use for the log message to ProgramOutputMultiplexer::appendLine() as an argument. ProgramRunner::operator()() would then need to change its calls to ProgramOutputMultiplexer::appendLine() to include its _name member (which we should make default to "sh").

We could then delete the remaining usages of the _programNames map from ProgramRegistry.
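
For illustration, here is a minimal sketch of that change, using hypothetical, simplified types and signatures (the real code in src/mongo/shell/shell_utils_launcher.cpp carries more state, locking, and error handling; emit() below stands in for the output-reading loop in ProgramRunner::operator()()):

#include <iostream>
#include <sstream>
#include <string>
#include <utility>

class ProgramOutputMultiplexer {
public:
    // Before the fix, appendLine() looked the prefix up via
    // registry.programName(pid), which races with deleteProgram(). Here
    // the caller supplies the prefix it already owns.
    void appendLine(int port, int pid, const std::string& name,
                    const std::string& line) {
        std::ostringstream out;
        if (port > 0)
            out << name << port << "| " << line;  // e.g. "s20014| ..."
        else
            out << name << pid << "| " << line;   // e.g. "sh12345| ..."
        std::cout << out.str() << '\n';
    }
};

class ProgramRunner {
public:
    explicit ProgramRunner(std::string name = "sh") : _name(std::move(name)) {}

    // Stand-in for the output-reading loop in ProgramRunner::operator()():
    // every emitted line uses _name, which lives as long as the runner
    // itself, so no registry lookup can race with program deletion.
    void emit(ProgramOutputMultiplexer& mux, int port, int pid,
              const std::string& line) const {
        mux.appendLine(port, pid, _name, line);
    }

private:
    std::string _name;  // "s" for a mongos, "d" for a mongod, "sh" otherwise
};

int main() {
    ProgramOutputMultiplexer mux;
    ProgramRunner mongosRunner("s");
    mongosRunner.emit(mux, 20014, 1234, "some mongos log line");  // "s20014| ..."
    return 0;
}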

Comment by Max Hirschhorn [ 20/Apr/16 ]

This is "just" a race condition in the way the ProgramRegistry and ProgramOutputMultiplexer work. If ProgramRegistry::deleteProgram() is called before all of the output from the program is emitted, then it'll end up using the "sh" prefix instead of the one that corresponds to the actual process. Until SERVER-22992 is completed, this race will continue to exist.

Note: the opposite race, where ProgramRegistry::registerProgram() is called after the child process has already produced output, is documented here. I think that comment is inaccurate because ProgramRunner::start() must return in order for ProgramRunner::operator()() to execute.
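
To make the race concrete, here is a hypothetical, stripped-down version of the pre-fix lookup (the real ProgramRegistry tracks more than names and presumably synchronizes access; this sketch only shows why the fallback prefix appears):

#include <map>
#include <string>
#include <utility>

// Hypothetical simplification of the pre-fix ProgramRegistry. The output
// reader thread asks the registry for the prefix on every line, so a
// concurrent deleteProgram() makes every later line fall back to "sh".
class ProgramRegistry {
public:
    void registerProgram(int pid, std::string name) {
        _programNames[pid] = std::move(name);
    }

    void deleteProgram(int pid) {
        _programNames.erase(pid);
    }

    // Once the pid has been erased, the lookup misses and the generic
    // "sh" prefix comes back, yielding "sh20014|" instead of "s20014|".
    std::string programName(int pid) const {
        auto it = _programNames.find(pid);
        return it == _programNames.end() ? std::string("sh") : it->second;
    }

private:
    std::map<int, std::string> _programNames;
};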

Comment by Kaloian Manassiev [ 20/Apr/16 ]

To add a bit more context to this ticket: while investigating sharding test failures, we noticed that log lines are also missing from the output. Another example is this run.

For example, the lines below indicate that the enableSharding command failed for some reason, but the log line that would explain the failure is missing:

[js_test:read_pref_multi_mongos_stale_config] 2016-04-19T20:41:00.029+0000 s20770| 2016-04-19T20:41:00.015+0000 I SHARDING [conn1] distributed lock 'test' acquired for 'enableSharding', ts : 571697db6b1d12643474add2
[js_test:read_pref_multi_mongos_stale_config] 2016-04-19T20:41:00.029+0000 s20770| 2016-04-19T20:41:00.015+0000 D ASIO     [conn1] startCommand: RemoteCommand 36 -- target:ip-10-5-180-162:20767 db:config expDate:2016-04-19T20:41:30.015+0000 cmd:{ find: "databases", filter: { _id: /^test$/i }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1461098459000|28, t: 1 } }, limit: 1, maxTimeMS: 30000 }
[js_test:read_pref_multi_mongos_stale_config] 2016-04-19T20:41:00.030+0000 s20770| 2016-04-19T20:41:00.015+0000 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Starting asynchronous command 36 on host ip-10-5-180-162:20767
[js_test:read_pref_multi_mongos_stale_config] 2016-04-19T20:41:00.030+0000 s20770| 2016-04-19T20:41:00.015+0000 D ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Request 36 finished with response: { waitedMS: 0, cursor: { id: 0, ns: "config.databases", firstBatch: [] }, ok: 1.0 }
...
// In between these log sections, there should be additional logging from [conn1] with requests made to shards, but they seem to be missing.
...
[js_test:read_pref_multi_mongos_stale_config] 2016-04-19T20:41:00.556+0000 s20770| 2016-04-19T20:41:00.553+0000 D ASIO     [conn1] startCommand: RemoteCommand 39 -- target:ip-10-5-180-162:20766 db:config expDate:2016-04-19T20:41:30.553+0000 cmd:{ findAndModify: "locks", query: { ts: ObjectId('571697db6b1d12643474add2') }, update: { $set: { state: 0 } }, writeConcern: { w: "majority", wtimeout: 15000 }, maxTimeMS: 30000 }
[js_test:read_pref_multi_mongos_stale_config] 2016-04-19T20:41:00.560+0000 s20770| 2016-04-19T20:41:00.559+0000 I SHARDING [conn1] distributed lock with ts: 571697db6b1d12643474add2' unlocked.
...
[js_test:read_pref_multi_mongos_stale_config] 2016-04-19T20:41:00.563+0000 s20770| 2016-04-19T20:41:00.560+0000 D -        [conn1] User Assertion: 26:database test not found

Comment by Dianna Hohensee (Inactive) [ 20/Apr/16 ]

Just added it to the end of the description.

Comment by Michael O'Brien [ 20/Apr/16 ]

Can you link to the task that this came from?
