[SERVER-54723] Shorten resmoke log prefixes Created: 30/Jul/20  Updated: 29/Oct/23  Resolved: 05/Mar/21

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Bug Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive) Assignee: Mikhail Shchatko
Resolution: Fixed Votes: 0
Labels: tig-qwin-eligible, tig-resmoke
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-48177 Prefix-less resmoke console logger Closed
Problem/Incident
causes SERVER-81595 resmoke incorrectly reports datetimes... Backlog
Related
related to SERVER-62935 Add timestamp with Failed to Load mes... Closed
is related to SERVER-50085 Make it easier to correlate mongo pro... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: STM 2021-03-08, STM 2021-03-22
Participants:
Story Points: 2

 Description   

Log lines (particularly when run through resmoke) have been and are getting longer. The new information we want to add is useful (e.g: TIG-1791), but it comes at a cost. This ticket is to start a conversation to explore what can be done to reduce the need to scroll to the right.

The current log lines outputs the following information (as an example, not all log lines are the same):
a fixture, a job number, node identification, a timestamp, log level, loggger name, thread name, log message

[ReplicaSetFixture:job9:primary] 2019-07-02T02:09:03.027+0000 I  INDEX    [conn98] validated collection config.transactions (UUID: f6c6dde1-fc12-4608-b439-c09508cfee9e)

One workflow I have when debugging a failure in the server (as opposed to an unexpected response):

  1. First find the failure (requires looking at the contents of the log message).
  2. Look at which job/node the failure came from
  3. Add a filter based on job/node identification
  4. Filter out unrelated spammy loggers (for execution related failures that's often all of replication, and service arch logs)
  5. Scan the log statements

Information like the job and loggers are necessary for filtering, but not looking at. The fixture name and log level are always irrelevant to me. The thread name, log message and at times, the timestamp (more for temporal locality or cross referencing) are most important. Node identification is sometimes only needed for filtering, and sometimes needed because I need to look at messages from multiple nodes simultaneously. Lobster coloring is often a sufficient for distinguishing between nodes in this case.

Common refrains in execution standups when going over new BFs that have come in (where the BB has already done an excellent job picking out the relevant log lines) are: "Scroll to the right [please]" and "keep going..."



 Comments   
Comment by Githook User [ 04/Mar/21 ]

Author:

{'name': 'Mikhail Shchatko', 'email': 'mikhail.shchatko@mongodb.com', 'username': 'MikhailShchatko'}

Message: SERVER-54723 Shorten resmoke log prefixes
Branch: master
https://github.com/mongodb/mongo/commit/8022d8830d9bdec5f0bfe3fd82aee7ebc50c5a47

Comment by Robert Guo (Inactive) [ 03/Mar/21 ]

Got it david.bradford, I'll set something up.

Comment by David Bradford (Inactive) [ 03/Mar/21 ]

Would be possible for someone to sync up with me before this gets merged? This will likely break the log parsing that we do for BFGs and BF suggestions.

Comment by Mikhail Shchatko [ 26/Feb/21 ]

After having conversation in slack with daniel.gottlieb and robert.guo we decided to do the following changes:

  • remove resmoke timestamps
  • shortening resmoke prefixes using specific abbreviations:
    job -> j
    shard -> s
    primary -> prim
    secondary -> sec
    mongos -> s (this collides with "shard", but probably this is not ambiguous)
    node -> n
    configsvr -> c
  • remove MongoDFixture, ReplicaSetFixture, ShardedClusterFixture
  • update fixture tables:

old

+--------------------------------------------+-----+-----+
|name                                        |port |pid  |
+--------------------------------------------+-----+-----+
|ShardedClusterFixture:job1:shard0:primary   |20251|19044|
|ShardedClusterFixture:job1:mongos           |20252|19718|
|ShardedClusterFixture:job1:configsvr:primary|20250|18478|
+--------------------------------------------+-----+-----+

new

+--------------------------------------------+----------+-----+-----+
|full_name                                   |name      |port |pid  |
+--------------------------------------------+----------+-----+-----+
|ShardedClusterFixture:job1:shard0:primary   |j1:s0:prim|20251|10943|
|ShardedClusterFixture:job1:shard1:primary   |j1:s1:prim|20252|11532|
|ShardedClusterFixture:job1:mongos           |j1:s      |20253|12287|
|ShardedClusterFixture:job1:configsvr:primary|j1:c:prim |20250|10533|
+--------------------------------------------+----------+-----+-----+

Comment by Robert Guo (Inactive) [ 06/Oct/20 ]

Nope. We're going to try to get to it in our next quick wins bucket.

Comment by Brooke Miller [ 22/Sep/20 ]

Is this still waiting for user input, robert.guo?

Comment by Robert Guo (Inactive) [ 04/Aug/20 ]

Try to do as much as we can in resmoke and not touch the Server for this ticket.

1. update timestamp format, 2. shortening static prefixes, reduce spacings.

Comment by Brooke Miller [ 04/Aug/20 ]

robert.guo to do the initial investigation with stakeholders about their preference.

Comment by Brooke Miller [ 04/Aug/20 ]

In Triaging, we agreed that the last suggestion is the best route for implementation. However, we should modify the timestamp format to include colons. Before we go ahead and implement this, we should get this in front of stakeholders to make sure the majority would be OK with this change.

Comment by Ryan Timmons [ 30/Jul/20 ]

Strategies I've used very successfully in the past:

  1. use jargon and obvious shortenings for common things (e.g. Replica -> Repl, Fixture -> Fxr)
  2. Remove vowels after the first one and repeated consonants; remove "double-hard" consonants (Fixture -> Fixr; conn -> cn)
  3. Pipes instead of balanced brackets
  4. Remove extraneous date separators and always encode as Zulu (Z) time rather than offset
  5. Reduce padding so it it's only needed in < 50% of cases

[ReplicaSetFixture:job9:primary] 2019-07-02T02:09:03.027+0000 I  INDEX    [conn98] validated collection config.transactions (UUID: f6c6dde1-fc12-4608-b439-c09508cfee9e)

Becomes:

ReplStFixr:j9:primry|20190702T02:09:03.027Z|I|INDX|cn98| validated collection config.transactions (UUID: f6c6dde1-fc12-4608-b439-c09508cfee9e)

This cuts 84 chars of prefix down to 58 while still being, imho, just as mentally obvious when reading it. And we've not made it any less machine-parsable either.

Some would go a step further and say the date info isn't useful, separators aren't needed, and Zulu can be implied:

ReplStFixr:j9:primry|020903027|I|IDX|cn98| validated collection config.transactions (UUID: f6c6dde1-fc12-4608-b439-c09508cfee9e)

This veers away from any sort of ISO-8601 format but it saves another 13 characters of what is noise in nearly all cases.

At any rate it seems reasonable to establish a stable regex so that lobster and other tooling can reliably expand or hide the data the user doesn't care about. In projects I've worked on in the past we've had a published PCRE that describes log-lines and we "required" all projects that parsed logs make reference to the file so we could easily find all the places that used the regex whenever we needed to change it. This saved us a production outage when we switched from offsets (+000) to Zulu, so it paid off.

Generated at Thu Feb 08 05:34:18 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.