[SERVER-79338] Expand metrics found in reports to provide more coverage signal to MongoDB Teams
Created: 25/Jul/23  Updated: 23/Oct/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task
Priority: Minor - P4
Reporter: javi Arguello
Assignee: [DO NOT ASSIGN] Backlog - DevProd Correctness
Resolution: Unresolved
Votes: 0
Labels: antithesis
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Correctness
Participants:

 Description   

1. The number of times a node enters member state ROLLBACK after a leader election. The server also logs other replication rollback metrics, such as how many operations are being rolled back: https://github.com/mongodb/mongo/blob/fb679bde06827e98f7c55272a83c754959a3ffd6/src/mongo/db/repl/rollback_impl.cpp#L1529
2. The number of times a chunk successfully migrates. https://github.com/mongodb/mongo/blob/fb679bde06827e98f7c55272a83c754959a3ffd6/src/mongo/db/s/migration_source_manager.cpp#L634-L635
3. The number of times a node has 0 read tickets or 0 write tickets available for operations. This kind of metric probably requires post-processing the contents of the diagnostic.data/ directory. We can defer it until we explore deadlock scenarios further. https://jira.mongodb.org/browse/SERVER-75205 is the type of bug I have in mind when asking "would it be possible for Antithesis to hit this?"
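
For item 3, a minimal sketch of the post-processing step, assuming the diagnostic.data/ contents have already been decoded into one dict per sample (the decoding itself is not shown). The field path `wiredTiger.concurrentTransactions.{read,write}.available` and the function name `count_zero_ticket_samples` are assumptions for illustration; the path may differ by server version.

```python
from typing import Dict, Iterable, Mapping


def count_zero_ticket_samples(samples: Iterable[Mapping]) -> Dict[str, int]:
    """Count samples in which no read (or write) tickets were available.

    Each sample is assumed to be a decoded serverStatus-like dict. Samples
    that lack the ticket fields are skipped rather than counted as zero.
    """
    zero = {"read": 0, "write": 0}
    for sample in samples:
        # Assumed field path; adjust it to match the server version in use.
        tickets = sample.get("wiredTiger", {}).get("concurrentTransactions", {})
        for kind in ("read", "write"):
            if tickets.get(kind, {}).get("available") == 0:
                zero[kind] += 1
    return zero
```

This counts samples rather than distinct exhaustion episodes; grouping consecutive zero samples into a single event would be a possible refinement.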
 



 Comments   
Comment by javi Arguello [ 25/Aug/23 ]

(posting Max H. response for tracking)

 
1. These top level metrics look great.
2. Similar to graph B, I would want to see separate graphs for each type of operation being rolled back (insert, create, update, collMod, createIndexes, etc.). A histogram of that form would answer a question such as "are most branches seeing a small number of inserts being rolled back?" So rather than summing the total number of inserts rolled back over the course of the entire branch, we'd track each rollback as its own event (effectively ignoring which branch in the experiment the rollback came from). We could additionally compute max(inserts within the branch) if we want to track the distribution across branches within an experiment.
3. Thanks, I think for graphs A and B my brain would want to see them as a bar chart because the line suggests there's some continuity which doesn't really exist. Graph E is also interesting like you were thinking it would be! One adjustment I'd recommend would be to narrow it to the number of transitions made by the specific mongod process which had the fatal error.
4. The metrics being in the report can't hurt in an "at a glance" sense, though I feel like ultimately the fatal assertion / invariant / etc. text is going to be more informative about where to look next than the metrics would be. Integrating our team's log message extraction on the server logs collected by Antithesis will probably be more of an area to focus on. I'll leave that to Alex to coordinate the priority.
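
A rough sketch of the per-rollback histogram and per-branch max suggested in point 2 above. The input shape (one `(branch_id, {operation: count})` pair per rollback event) and the function names are hypothetical, not the actual report format.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Mapping, Tuple

# Hypothetical shape: one entry per rollback event.
Rollback = Tuple[Hashable, Mapping[str, int]]


def per_rollback_histograms(rollbacks: Iterable[Rollback]) -> Dict[str, Counter]:
    """For each operation type, map "N ops rolled back" to the number of rollbacks.

    Every rollback is its own event; the branch id is ignored here.
    Rollbacks that do not mention an operation type add nothing to its histogram.
    """
    hist: Dict[str, Counter] = defaultdict(Counter)
    for _branch, counts in rollbacks:
        for op, n in counts.items():
            hist[op][n] += 1
    return dict(hist)


def max_per_branch(rollbacks: Iterable[Rollback], op: str) -> Dict[Hashable, int]:
    """Largest count of `op` seen in any single rollback of each branch."""
    best: Dict[Hashable, int] = {}
    for branch, counts in rollbacks:
        best[branch] = max(best.get(branch, 0), counts.get(op, 0))
    return best
```

Each entry of `per_rollback_histograms(...)["insert"]` could then be drawn as one bar, which also fits the bar chart preference in point 3.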

Comment by javi Arguello [ 14/Aug/23 ]

(pasting my email to max.hirschhorn@mongodb.com here for easier visibility and context tracking)
 
1) the top level metrics are good to go and highlighted at the top of each experiment report

2) In terms of the rollbackCommandCounts, is there some metric in particular that's interesting to track and/or a way to display the data? For example, here are a few ideas I could do: 
 

  • an aggregate count of "insert", "create", "update", "collMod" and "createIndexes" across the whole experiment?
  • the sum of each within each branch?
  • the number of rollbacks that have an "insert" but not a "create"? (or any combo of the various command options?)
  • how many branches have a "collMod" vs. how many branches have a "createIndexes"? (or any combo of the various command options)
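
To make the options above concrete, here is a small sketch of each aggregation. It assumes the rollbackCommandCounts data can be loaded as one `(branch_id, {command: count})` pair per rollback; that shape and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Mapping, Sequence, Tuple

# Hypothetical shape: one entry per rollback event.
Rollback = Tuple[Hashable, Mapping[str, int]]


def experiment_totals(rollbacks: Sequence[Rollback]) -> Counter:
    """Sum of each command type across the whole experiment."""
    total: Counter = Counter()
    for _branch, counts in rollbacks:
        total.update(counts)  # adds the counts key by key
    return total


def branch_totals(rollbacks: Sequence[Rollback]) -> Dict[Hashable, Counter]:
    """Sum of each command type within each branch."""
    totals: Dict[Hashable, Counter] = defaultdict(Counter)
    for branch, counts in rollbacks:
        totals[branch].update(counts)
    return dict(totals)


def rollbacks_with_but_not(rollbacks: Sequence[Rollback], present: str, absent: str) -> int:
    """Number of rollbacks that include `present` but not `absent`."""
    return sum(
        1
        for _branch, counts in rollbacks
        if counts.get(present, 0) > 0 and counts.get(absent, 0) == 0
    )


def branches_with(rollbacks: Sequence[Rollback], command: str) -> int:
    """Number of distinct branches with at least one rollback of `command`."""
    return len({branch for branch, counts in rollbacks if counts.get(command, 0) > 0})
```

For example, `rollbacks_with_but_not(rollbacks, "insert", "create")` covers the third idea, and comparing `branches_with(rollbacks, "collMod")` against `branches_with(rollbacks, "createIndexes")` covers the fourth.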
     
3) for the histograms representing the stats for each branch, here's what those graphs would look like:

A. [graph not included]

B. [graph not included]

however, in addition to these graphs I generated a few more charts in case any of these would also be interesting:

C. [graph not included]

(this one above proves the statement "The count would increase if the branch was played for a longer number of instructions")

D. [graph not included]

E. [graph not included]

I was thinking that some kind of connection to bugs could be interesting? This one above is the number of state transitions seen in branches that had a fatal failure (aka a bug). Note this experiment found a lot of fatals (many of them duplicates of the same failure). 
 
F.
 

This one above is a short snippet of the count of all 3 metrics for each branch based on its exit time. Similar to graph C above. 
 
Let me know if any of these would be helpful to include, or if there's something else that would be helpful. 
 
Finally, taking a different angle to this question, I wonder if the specific stats mean something in the presence of a fatal failure. So here's a mockup of what I could add to the specific debug report for a failure found:

Above is the mockup for a debug report including the counts of the new metrics, and below is a real debug report to see how it would fit.
 
If it would be easier to hop on a quick call, let me know and I can schedule something for 15 min to get feedback.
 

 

Generated at Thu Feb 08 06:40:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.