[SERVER-48932] ReplicaSetMonitor hitting failpoint set during drivers test Created: 17/Jun/20  Updated: 20/Oct/20  Resolved: 20/Oct/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Divjot Arora (Inactive) Assignee: Benjamin Caimano (Inactive)
Resolution: Duplicate Votes: 0
Labels: sa-groomed
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-48985 Add logging for failCommand Closed
Duplicate
duplicates SERVER-49157 Initialize WireSpec during MONGO_INIT... Closed
duplicates SERVER-49220 failCommand with appName does not fai... Closed
duplicates SERVER-49336 Set client metadata if it is missing ... Closed
Related
is related to SERVER-49157 Initialize WireSpec during MONGO_INIT... Closed
Operating System: ALL
Participants:

 Description   

One of the Go driver streamable isMaster tests sets a failpoint to make isMaster requests fail with error code 91 (ShutdownInProgress). The server logs from the test execution show:


{
  "t":{
    "$date":"2020-06-17T17:14:47.215+00:00"
  },
  "s":"D1",
  "c":"-",
  "id":23074,
  "ctx":"ReplicaSetMonitor-TaskExecutor",
  "msg":"User assertion ",
  "attr":{
    "error":"ShutdownInProgress: Failing command due to 'failCommand' failpoint",
    "file":"src/mongo/client/async_client.cpp",
    "line":108
  }
},
{
  "t":{
    "$date":"2020-06-17T17:14:47.216+00:00"
  },
  "s":"I",
  "c":"-",
  "id":4333222,
  "ctx":"ReplicaSetMonitor-TaskExecutor",
  "msg":"RSM received failed isMaster",
  "attr":{
    "host":"localhost:27018",
    "error":"ShutdownInProgress: Failing command due to 'failCommand' failpoint",
    "replicaSet":"repl0",
    "isMasterReply":"{}"
  }
},
{
  "t":{
    "$date":"2020-06-17T17:14:47.216+00:00"
  },
  "s":"I",
  "c":"NETWORK",
  "id":4712102,
  "ctx":"ReplicaSetMonitor-TaskExecutor",
  "msg":"Host failed in replica set",
  "attr":{
    "replicaSet":"repl0",
    "host":"localhost:27018",
    "error":{
      "code":91,
      "codeName":"ShutdownInProgress",
      "errmsg":"Failing command due to 'failCommand' failpoint"
    },
    "action":{
      "dropConnections":false,
      "requestImmediateCheck":false
    }
  }
}

One of the log lines says "RSM received failed isMaster". Does this mean that the server's internal ReplicaSetMonitor is triggering the failpoint?
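For context, a minimal sketch (not the actual test code) of how such a failpoint might be set through the Go driver. The URI, mode, and appName values are assumptions for illustration only:

package main

import (
	"context"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Connect directly to the node the failpoint should be set on (the primary here).
	opts := options.Client().ApplyURI("mongodb://localhost:27018").SetDirect(true)
	client, err := mongo.Connect(ctx, opts)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	// Configure failCommand so that isMaster fails with ShutdownInProgress (code 91).
	failpoint := bson.D{
		{Key: "configureFailPoint", Value: "failCommand"},
		{Key: "mode", Value: bson.D{{Key: "times", Value: 4}}}, // illustrative mode
		{Key: "data", Value: bson.D{
			{Key: "failCommands", Value: bson.A{"isMaster"}},
			{Key: "errorCode", Value: 91},
			{Key: "appName", Value: "streamableIsMasterTest"}, // hypothetical appName filter
		}},
	}
	if err := client.Database("admin").RunCommand(ctx, failpoint).Err(); err != nil {
		log.Fatal(err)
	}
}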



 Comments   
Comment by Divjot Arora (Inactive) [ 18/Jun/20 ]

This test is for the streamable isMaster feature, so I've seen failures against the 4.4-latest and latest variants, which track the v4.4 and master server branches. I've only run these tests on Ubuntu 16.04 so far and am seeing this issue non-deterministically across runs. It only happens on replica sets.

Comment by Benjamin Caimano (Inactive) [ 18/Jun/20 ]

Also, importantly, can I have some info about what versions and OSes are in play?

Comment by Benjamin Caimano (Inactive) [ 18/Jun/20 ]

Oh, it definitely should trigger the failpoint; we've got a whole set of our own tests that depend upon it. We could probably allow failCommand to filter internal clients à la "only/never/allow" if you're interested in that.

Edit: I've had it pointed out to me that the failInternalCommands field should already do this.
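A hedged sketch of what that scoping could look like, assuming the documented failCommand fields (failInternalCommands defaults to false, so setting it here only documents intent; the appName value is hypothetical):

package failpointsketch

import (
	"go.mongodb.org/mongo-driver/bson"
)

// failCommandSpec builds a failCommand document that fails isMaster with
// code 91 for a given appName while leaving internal clients (such as the
// server's own ReplicaSetMonitor) untouched.
func failCommandSpec(appName string) bson.D {
	return bson.D{
		{Key: "configureFailPoint", Value: "failCommand"},
		{Key: "mode", Value: "alwaysOn"},
		{Key: "data", Value: bson.D{
			{Key: "failCommands", Value: bson.A{"isMaster"}},
			{Key: "errorCode", Value: 91}, // ShutdownInProgress
			{Key: "appName", Value: appName},
			{Key: "failInternalCommands", Value: false}, // explicit: do not fail internal clients
		}},
	}
}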

I'm also wondering what is doing remote targeting on that node and thus starting the RSM. The node should already have full topology knowledge from the replication coordinator, so the RSM having issues wouldn't normally have much effect in a replica set topology without routers.

Comment by Divjot Arora (Inactive) [ 18/Jun/20 ]

The topology is a three-node replica set. Both the failpoint and logs are from the primary node. Your comment makes sense, but it's not clear to me whether or not this is a bug. It makes sense that the RSM behaves like a driver, but should it be considered "internal" and therefore not trigger the failpoint?

Comment by Benjamin Caimano (Inactive) [ 18/Jun/20 ]

divjot.arora, functionally, each RSM is a client application like any driver. If isMaster requests are failing, the RSM will also have issues and likely treat that node as unreachable. We do our best to log that, but arguably we should probably have "I" level logs only when we remove or add nodes to our topology cache.

I'm curious what your topology is. Are these log messages on a mongod or on a mongos? I assume the failpoint is being set on a mongod.

Comment by Divjot Arora (Inactive) [ 18/Jun/20 ]

Small clarification: I'm not 100% sure that these log lines correspond to the execution of that test, because the same cluster is used to run all of our tests. There aren't many tests that set failpoints for isMaster, though, and the log lines about the RSM receiving a failed isMaster are confusing regardless of which test they're associated with.
