[SERVER-48932] ReplicaSetMonitor hitting failpoint set during drivers test Created: 17/Jun/20 Updated: 20/Oct/20 Resolved: 20/Oct/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Divjot Arora (Inactive) | Assignee: | Benjamin Caimano (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | sa-groomed | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Operating System: | ALL |
| Participants: | |
| Description |
|
One of the Go driver streamable isMaster tests sets a failpoint to make isMaster requests fail with error code 91. The logs for the test execution show, among other things, a line that says "RSM received failed isMaster".
Does this mean that the server's internal ReplicaSetMonitor is triggering the failpoint? |
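For context, a minimal sketch of how such a failpoint might be configured from the Go driver (assuming driver 1.x, current at the time of this ticket). The connection string, mode, and "times" count are assumptions for illustration; the ticket only states that the test makes isMaster fail with error code 91.

    package main

    import (
        "context"
        "log"
        "time"

        "go.mongodb.org/mongo-driver/bson"
        "go.mongodb.org/mongo-driver/mongo"
        "go.mongodb.org/mongo-driver/mongo/options"
    )

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        // Connect directly to the node that should host the failpoint
        // (the primary, per the discussion below). The URI is a placeholder.
        client, err := mongo.Connect(ctx, options.Client().
            ApplyURI("mongodb://localhost:27017/?directConnection=true"))
        if err != nil {
            log.Fatal(err)
        }
        defer client.Disconnect(ctx)

        // Configure failCommand so the next isMaster commands fail with
        // error code 91 (ShutdownInProgress). "times: 2" is an assumed value.
        cmd := bson.D{
            {Key: "configureFailPoint", Value: "failCommand"},
            {Key: "mode", Value: bson.D{{Key: "times", Value: 2}}},
            {Key: "data", Value: bson.D{
                {Key: "failCommands", Value: bson.A{"isMaster"}},
                {Key: "errorCode", Value: 91},
            }},
        }
        if err := client.Database("admin").RunCommand(ctx, cmd).Err(); err != nil {
            log.Fatal(err)
        }
    }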
| Comments |
| Comment by Divjot Arora (Inactive) [ 18/Jun/20 ] |
|
This test is for the streamable isMaster feature, so I've seen failures against the 4.4-latest and latest variants, which track the v4.4 and master server branches. I've only run these tests on Ubuntu 16.04 so far and am seeing this issue non-deterministically across runs. It only happens on replica sets. |
| Comment by Benjamin Caimano (Inactive) [ 18/Jun/20 ] |
|
Also, importantly, can I have some info about what versions and OSes are in play? |
| Comment by Benjamin Caimano (Inactive) [ 18/Jun/20 ] |
|
Oh, it definitely should trigger the failpoint; we've got a whole set of our own tests that depend upon it. We could probably allow failCommand to filter internal clients à la "only/never/allow" if you're interested in that. Edit: I've had it pointed out to me that the failInternalCommands field should already do this. I'm also wondering what is doing targeting on that node and thus starting the RSM. The node should already have full topology knowledge from the replication coordinator, so the RSM having issues wouldn't normally have much effect in a replica set topology without routers. |
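For reference, a rough sketch of the data document being discussed, assuming the failCommand failpoint's failInternalCommands boolean behaves as documented (it defaults to false): with it false, commands from internal clients such as the server's own ReplicaSetMonitor should not trip the failpoint, only external clients like the driver under test. The field values below mirror the test described in this ticket and are otherwise illustrative.

    package main

    import (
        "fmt"

        "go.mongodb.org/mongo-driver/bson"
    )

    // failCommandData sketches the "data" document for the failCommand
    // failpoint discussed above. With failInternalCommands left false (its
    // documented default), commands issued by internal clients, such as the
    // server's own ReplicaSetMonitor, should be skipped; only external
    // clients, e.g. the driver under test, should see the injected error 91.
    func failCommandData() bson.D {
        return bson.D{
            {Key: "failCommands", Value: bson.A{"isMaster"}},
            {Key: "errorCode", Value: 91},
            {Key: "failInternalCommands", Value: false},
        }
    }

    func main() {
        fmt.Println(failCommandData())
    }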
| Comment by Divjot Arora (Inactive) [ 18/Jun/20 ] |
|
The topology is a three-node replica set. Both the failpoint and logs are from the primary node. Your comment makes sense, but it's not clear to me whether or not this is a bug. It makes sense that the RSM behaves like a driver, but should it be considered "internal" and therefore not trigger the failpoint? |
| Comment by Benjamin Caimano (Inactive) [ 18/Jun/20 ] |
|
divjot.arora, functionally, each RSM is a client application like any driver. If isMaster requests are failing, the RSM will also have issues and will likely treat that node as unreachable. We do our best to log that, though arguably we should emit "I"-level logs only when we remove or add nodes to our topology cache. I'm curious what your topology is. Are these log messages on a mongod or on a mongos? I assume the failpoint is being set on a mongod. |
| Comment by Divjot Arora (Inactive) [ 18/Jun/20 ] |
|
Small clarification: I'm not 100% sure that these log lines correspond to the execution of that test, because the same cluster is used to run all of our tests. There aren't many tests that set failpoints for isMaster, though, and the log lines about the RSM receiving a failed isMaster are confusing regardless of which test they're associated with. |