[CDRIVER-3708] /Client/descriptions failure Created: 05/Jun/20  Updated: 06/Feb/24  Resolved: 24/Jun/20

Status: Closed
Project: C Driver
Component/s: None
Affects Version/s: None
Fix Version/s: 1.17.0-rc0, 1.17.0

Type: Bug Priority: Major - P3
Reporter: Kevin Albertson Assignee: Kevin Albertson
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongod-arbiter-27019.log     Text File mongod-secondary-27018.log    
Issue Links:
Depends
depends on SERVER-48715 Sending cluster time to an arbiter wi... Closed
Related
related to CDRIVER-3716 Single threaded monitoring does not u... Backlog
is related to CDRIVER-3709 /initial_dns_auth/dns-auth failure Closed
is related to PHPC-1634 Arbiters write a lot of logs 'unkown ... Closed
Server Compat: 4.7

 Description   

Observed on RHEL 6.2 here:
https://evergreen.mongodb.com/task/mongo_c_driver_rhel62_test_latest_replica_set_auth_sasl_openssl_733322e98fdad8d4cf5fa2ce2f256d8e014ce51e_20_06_05_17_49_51

Logs:

 [2020/06/05 18:19:01.503] test error in: /data/mci/0d6eb7bf3b62003b93710c5daf3bd836/mongoc/src/libmongoc/tests/test-mongoc-client.c 2464:test_mongoc_client_descriptions()
 [2020/06/05 18:19:01.503] still have 2 descriptions, not expected 3, after 1 sec
 [2020/06/05 18:19:01.531] .evergreen/run-tests.sh: line 90:  8773 Aborted                 (core dumped) ./src/libmongoc/test-libmongoc --no-fork $TEST_ARGS -d
 [2020/06/05 18:19:01.532] Command failed: command encountered problem: error waiting on process '88f6ecfb-b04b-4c65-9649-accc104aae3d': exit status 134

This may have been triggered from behavioral changes in multi-threaded monitoring of CDRIVER-3625.



 Comments   
Comment by Githook User [ 11/Jul/20 ]

Author:

{'name': 'Kevin Albertson', 'email': 'kevin.albertson@mongodb.com', 'username': 'kevinAlbs'}

Message: CDRIVER-3708 unskip tests
Branch: r1.17
https://github.com/mongodb/mongo-c-driver/commit/344e1ef2aec09339930910f2a1470aadfe84f2e1

Comment by Githook User [ 11/Jul/20 ]

Author:

{'name': 'Kevin Albertson', 'email': 'kevin.albertson@mongodb.com', 'username': 'kevinAlbs'}

Message: CDRIVER-3708 skip two tests on latest auth replset

Until SERVER-48715 is resolved, skip tests with auth against a >4.4
replica set that rely on arbiters being discovered.

Also skips the DNS auth test failures described in CDRIVER-3709.
Branch: r1.17
https://github.com/mongodb/mongo-c-driver/commit/91d0016a0d0ac342a34ff49ba6661f60f3de6f26

Comment by Githook User [ 24/Jun/20 ]

Author:

{'name': 'Kevin Albertson', 'email': 'kevin.albertson@mongodb.com', 'username': 'kevinAlbs'}

Message: CDRIVER-3708 unskip tests
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/2ea57593422f023d04c4717cd90db9c251897f26

Comment by Kevin Albertson [ 23/Jun/20 ]

PR to unskip: https://github.com/mongodb/mongo-c-driver/pull/644

Comment by Githook User [ 15/Jun/20 ]

Author:

{'name': 'Kevin Albertson', 'email': 'kevin.albertson@mongodb.com', 'username': 'kevinAlbs'}

Message: CDRIVER-3708 skip two tests on latest auth replset

Until SERVER-48715 is resolved, skip tests with auth against a >4.4
replica set that rely on arbiters being discovered.

Also skips the DNS auth test failures described in CDRIVER-3709.
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/43c5aaf21185156cc9e56ac25ac105c51f0a44be

Comment by Kevin Albertson [ 12/Jun/20 ]

PR to temporarily skip this test and the DNS with auth tests of CDRIVER-3709https://github.com/mongodb/mongo-c-driver/pull/637

Comment by Kevin Albertson [ 10/Jun/20 ]

Upon further investigation, I think speculativeAuthenticate is unrelated. I can reproduce this locally with speculativeAuthenticate explicitly disabled.

Hypothesis: This is due gossiping the cluster time to an arbiter with auth enabled.

Observations:
Tests on replica set auth tasks on evergreen only started failing recently, and are not failing on the latest 4.4 RC. The earliest such I found was on server version 4.5.0-1606-gbd74e90.

I could reproduce locally. But I could not reproduce with 4.5.0-1602-g5c6e1e5. That indicates some server commit changed behavior within those four commits. The most relevant looking server commit is 7074dee1fbf6763c0d463c377c2e47d8ef2c4f6f.

I tried to reproduce locally by creating the same setup with a three node (PSA) replica set with auth enabled:

mlaunch init --replicaset --arbiter --nodes=2 --auth --username user --password password --name replicaSet --setParameter enableTestCommands=1 --verbose --dir CDRIVER-3708

The error can be reproduced by sending a command to the arbiter with a $clusterTime obtained from another node. Example with the shell:

var cmd = {
  "isMaster": 1,
  "$clusterTime" : {
    "clusterTime" : Timestamp(1591812742, 1),
    "signature" : {
        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
        "keyId" : NumberLong(0)
    }
  }
}
 
db.runCommand(cmd)

Replies with:

{
	"ok" : 0,
	"errmsg" : "aborting keys cache refresh because node is shutting down",
	"code" : 91,
	"codeName" : "ShutdownInProgress"
}

Interestingly, even though this reproduces without authenticating the connection, I cannot reproduce this if the server is started without auth enabled.

Note, this the test failure is triggered by changes to multi-threaded monitoring in CDRIVER-3625. But I think that libmongoc should be sending a $clusterTime per the drivers session spec. The single threaded monitor does not update the cluster time on ismaster responses (which seems like a bug, CDRIVER-3716) so the test failure is only observed in the pooled monitor.

Comment by Kevin Albertson [ 08/Jun/20 ]

Hypothesis:
This is related to performing speculative authentication against an arbiter.

Observations:
Using a patch with verbose test logging, the test logs the an ismaster reply with "ok":0 and the message "aborting keys cache refresh because node is shutting down":

2020/06/07 21:33:01.0594: [22473]:    DEBUG:      monitor: [localhost:27019] error in server reply: aborting keys cache refresh because node is shutting down

localhost:27019 is an arbiter. The full server logs are attached as mongod-arbiter-27019.log with some annotations starting with "#NOTE".

This has only been observed on 4.4+ replica set tests with authentication.

Similar to PHPC-1634, the arbiter's logs show many "User not Found" errors and other errors related to auth.

Here are the logs for one connection.

{"t":{"$date":"2020-06-07T21:32:15.731+00:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn18","msg":"client metadata","attr":{"remote":"127.0.0.1:38911","client":"conn18","doc":{"driver":{"name":"mongoc","version":"1.17.0-pre"},"os":{"type":"Linux","name":"Ubuntu","version":"16.04","architecture":"x86_64"},"platform":"cfg=0x02a15620c9 posix=200809 stdc=201112 CC=clang 3.8.0 (tags/RELEASE_380/final)"}}}
{"t":{"$date":"2020-06-07T21:32:15.731+00:00"},"s":"D1", "c":"EXECUTOR", "id":23104,   "ctx":"AuthorizationManager-1","msg":"Starting thread","attr":{"threadName":"AuthorizationManager-1","poolName":"AuthorizationManager"}}
{"t":{"$date":"2020-06-07T21:32:15.731+00:00"},"s":"D1", "c":"ACCESS",   "id":20238,   "ctx":"AuthorizationManager-1","msg":"Getting user record","attr":{"user":"bob@admin"}}
{"t":{"$date":"2020-06-07T21:32:15.731+00:00"},"s":"D1", "c":"-",        "id":23074,   "ctx":"AuthorizationManager-1","msg":"User assertion","attr":{"error":"UserNotFound: Could not find user \"bob\" for db \"admin\"","file":"src/mongo/db/auth/authorization_manager_impl.cpp","line":713}}
{"t":{"$date":"2020-06-07T21:32:15.731+00:00"},"s":"D1", "c":"-",        "id":23074,   "ctx":"conn18","msg":"User assertion","attr":{"error":"UserNotFound: Could not find user \"bob\" for db \"admin\"","file":"src/mongo/util/future_impl.h","line":674}}
{"t":{"$date":"2020-06-07T21:32:15.731+00:00"},"s":"I",  "c":"ACCESS",   "id":20251,   "ctx":"conn18","msg":"Supported SASL mechanisms requested for unknown user","attr":{"user":"bob@admin"}}
{"t":{"$date":"2020-06-07T21:32:15.731+00:00"},"s":"D1", "c":"ACCESS",   "id":20238,   "ctx":"AuthorizationManager-1","msg":"Getting user record","attr":{"user":"bob@admin"}}
{"t":{"$date":"2020-06-07T21:32:15.731+00:00"},"s":"D1", "c":"-",        "id":23074,   "ctx":"AuthorizationManager-1","msg":"User assertion","attr":{"error":"UserNotFound: Could not find user \"bob\" for db \"admin\"","file":"src/mongo/db/auth/authorization_manager_impl.cpp","line":713}}
{"t":{"$date":"2020-06-07T21:32:15.731+00:00"},"s":"D1", "c":"-",        "id":23074,   "ctx":"conn18","msg":"User assertion","attr":{"error":"UserNotFound: Could not find user \"bob\" for db \"admin\"","file":"src/mongo/util/future_impl.h","line":674}}
{"t":{"$date":"2020-06-07T21:32:15.731+00:00"},"s":"I",  "c":"ACCESS",   "id":20249,   "ctx":"conn18","msg":"Authentication failed","attr":{"mechanism":"SCRAM-SHA-256","principalName":"bob","authenticationDatabase":"admin","client":"127.0.0.1:38911","result":"UserNotFound: Could not find user \"bob\" for db \"admin\""}}
{"t":{"$date":"2020-06-07T21:32:15.731+00:00"},"s":"D1", "c":"-",        "id":23074,   "ctx":"conn18","msg":"User assertion","attr":{"error":"AuthenticationFailed: Authentication failed.","file":"src/mongo/db/auth/sasl_commands.cpp","line":336}}
{"t":{"$date":"2020-06-07T21:32:15.731+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn18","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"isMaster":1,"client":{"driver":{"name":"mongoc","version":"1.17.0-pre"},"os":{"type":"Linux","name":"Ubuntu","version":"16.04","architecture":"x86_64"},"platform":"cfg=0x02a15620c9 posix=200809 stdc=201112 CC=clang 3.8.0 (tags/RELEASE_380/final)"},"compression":[],"saslSupportedMechs":"admin.bob","speculativeAuthenticate":{"saslStart":1,"mechanism":"SCRAM-SHA-256","payload":{"$binary":{"base64":"biwsbj1ib2Iscj1Cd0RTdEQ1bE1kUVQxc3JrTXQzeUUvbTRCT0xBWUhFeg==","subType":"0"}},"autoAuthorize":1,"options":{"skipEmptyExchange":true},"db":"admin"},"$readPreference":{"mode":"secondaryPreferred"},"$db":"admin"},"numYields":0,"reslen":675,"locks":{},"protocol":"op_query","durationMillis":0}}
{"t":{"$date":"2020-06-07T21:32:15.731+00:00"},"s":"D1", "c":"QUERY",    "id":22783,   "ctx":"conn18","msg":"received interrupt request for unknown op: {opId}{printKnownOps_inlock}","attr":{"opId":558,"printKnownOps_inlock":""}}
{"t":{"$date":"2020-06-07T21:32:15.766+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn18","msg":"connection ended","attr":{"remote":"127.0.0.1:38911","connectionCount":2}}

The ismaster command includes saslSupportedMechs and speculativeAuthentication.

These are the logs for the connection that receives the "ok": 0 reply:

{"t":{"$date":"2020-06-07T21:33:01.594+00:00"},"s":"D1", "c":"-",        "id":23074,   "ctx":"conn288","msg":"User assertion","attr":{"error":"ShutdownInProgress: aborting keys cache refresh because node is shutting down","file":"src/mongo/db/keys_collection_manager.cpp","line":203}}
{"t":{"$date":"2020-06-07T21:33:01.594+00:00"},"s":"D1", "c":"COMMAND",  "id":21962,   "ctx":"conn288","msg":"Assertion while executing command","attr":{"command":"isMaster","db":"admin","commandArgs":{"isMaster":1,"client":{"driver":{"name":"mongoc","version":"1.17.0-pre"},"os":{"type":"Linux","name":"Ubuntu","version":"16.04","architecture":"x86_64"},"platform":"cfg=0x02a15620c9 posix=200809 stdc=201112 CC=clang 3.8.0 (tags/RELEASE_380/final)"},"compression":[],"$clusterTime":{"clusterTime":{"$timestamp":{"t":1591565581,"i":1}},"signature":{"hash":{"$binary":{"base64":"CVas/c3Osh+SGBmIqSELxHsztFc=","subType":"0"}},"keyId":6835721630207967235}},"$readPreference":{"mode":"secondaryPreferred"},"$db":"admin"},"error":"ShutdownInProgress: aborting keys cache refresh because node is shutting down"}}
{"t":{"$date":"2020-06-07T21:33:01.594+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn288","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"isMaster":1,"client":{"driver":{"name":"mongoc","version":"1.17.0-pre"},"os":{"type":"Linux","name":"Ubuntu","version":"16.04","architecture":"x86_64"},"platform":"cfg=0x02a15620c9 posix=200809 stdc=201112 CC=clang 3.8.0 (tags/RELEASE_380/final)"},"compression":[],"$clusterTime":{"clusterTime":{"$timestamp":{"t":1591565581,"i":1}},"signature":{"hash":{"$binary":{"base64":"CVas/c3Osh+SGBmIqSELxHsztFc=","subType":"0"}},"keyId":6835721630207967235}},"$readPreference":{"mode":"secondaryPreferred"},"$db":"admin"},"numYields":0,"ok":0,"errMsg":"aborting keys cache refresh because node is shutting down","errName":"ShutdownInProgress","errCode":91,"reslen":166,"locks":{},"protocol":"op_query","durationMillis":0}}
{"t":{"$date":"2020-06-07T21:33:01.594+00:00"},"s":"D1", "c":"QUERY",    "id":22783,   "ctx":"conn288","msg":"received interrupt request for unknown op: {opId}{printKnownOps_inlock}","attr":{"opId":2067,"printKnownOps_inlock":""}}
{"t":{"$date":"2020-06-07T21:33:01.595+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn288","msg":"connection ended","attr":{"remote":"127.0.0.1:41374","connectionCount":2}}

That ismaster command does not include saslSupportedMechs or speculativeAuthenticate. In the context of the test, this is an ismaster sent in multi-threaded monitoring, so it does not do auth negotiation or speculative auth (whereas single-threaded monitoring would).

CC sara.golemon do you think this could be related to speculative auth against the arbiter? It is not yet clear to me what is triggering the server to enter shutdown, so I might be barking up the wrong tree.

Comment by Kevin Albertson [ 08/Jun/20 ]

Another:
https://evergreen.mongodb.com/task/mongo_c_driver_clang38ubuntu_test_coverage_latest_replica_set_auth_nosasl_openssl_10b05158e938e9a5248cf845126185d1366a2d80_20_06_08_17_57_31

Comment by Kevin Albertson [ 08/Jun/20 ]

Another: https://evergreen.mongodb.com/task/mongo_c_driver_clang38ubuntu_test_coverage_latest_replica_set_auth_nosasl_openssl_b5cf71ab24ce18d80e3fa2b70e9e714937ad403f_20_06_08_17_18_25/0

Generated at Wed Feb 07 21:18:48 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.