[SERVER-40560] Ensure system-collection protections for local and config system collections Created: 10/Apr/19  Updated: 29/Oct/23  Resolved: 28/Jan/20

Status: Closed
Project: Core Server
Component/s: Security
Affects Version/s: 4.0.5
Fix Version/s: 4.3.4

Type: Bug Priority: Major - P3
Reporter: John Arbash Meinel Assignee: Spencer Jackson
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-29826 Prevent user writes to internal repli... Closed
Documented
is documented by DOCS-13532 Docs for SERVER-40560: Ensure system-... Closed
Related
is related to SERVER-29826 Prevent user writes to internal repli... Closed
is related to DOCS-12637 Add collection(s) to local db overview Closed
Backwards Compatibility: Minor Change
Operating System: ALL
Steps To Reproduce:
original reproduction

I'll give the exact steps I used to reproduce, including the binary that I am testing against. I'm assuming the issue is not about the binary, though.

snap install juju-db
PATH="/snap/juju-db/current/bin:$PATH"
go get -v github.com/juju/testing/...
cd github.com/juju/testing
git remote add jameinel github.com/jameinel/juju-testing
git checkout mgo-with-replSet
export TEST_LOGGING_CONFIG="juju.testing=TRACE"
go test -check.v -check.f 'mgoSuite.*(Reset|Proxied)'

It does seem to require that both TestNewProxiedSession and TestResetWhenUnauthorizied are run.

I'm guessing it is actually an issue with the test suite tear down followed by another setup.

I realize this isn't a minimal reproduction using only existing mongo tools, but given that the failing bit is an invariant deep in the code, it does seem like a missed expectation.

Note also that the test does not fail 100% of the time, but probably close to 90%. So there is a bit of a race involved, just one that typically fails.

Note the setup of Mongodb during the test suite is being done here:

https://github.com/jameinel/juju-testing/blob/d2cd4adac5bf0330ef9cea5dcc18d485c2cb2b3c/mgo.go#L221

And if there is no --replSet (and also no replSetInitiate call), then the tests pass 100% of the time.

I can run other tests/try to find a smaller reproducer if that is relevant.

Sprint: Security 2019-05-06, Security 2019-05-20, Security 2019-06-03, Security 2019-06-17, Security 2019-07-01, Security 2019-08-12, Security 2019-08-26, Security 2019-09-09, Security 2019-09-23, Security 2019-10-07, Security 2019-10-21, Security 2019-11-04, Security 2019-12-16, Security 2019-12-30, Security 2020-01-13, Security 2020-01-27, Security 2020-02-10
Participants:

 Description   

Normal write permissions are the only requirement for writing to the following collections in `local`:

replset.election
replset.minvalid
replset.oplogTruncateAfterPoint
startup_log

Could we require more explicit write privileges to prevent changing these collections?

original description

I'm testing interaction with mongo and replica sets. During the test suite, we start up a MongoDB and clean it up between tests. One of our tests verifies that the infrastructure can recover if a test resets a password and the server needs to be torn down and set up again.

Without --replSet having been set, the test passes just fine. With --replSet, I get a failure with this backtrace:

 mongod(_ZN5mongo15printStackTraceERSo+0x3A) [0x5650b0d4597a]
 mongod(+0x222EC3E) [0x5650b0d44c3e]
 mongod(+0x222F112) [0x5650b0d45112]
 libpthread.so.0(+0x12890) [0x7ff109b80890]
 libc.so.6(gsignal+0xC7) [0x7ff1097bbe97]
 libc.so.6(abort+0x141) [0x7ff1097bd801]
 mongod(_ZN5mongo22invariantFailedWithMsgEPKcRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES1_j+0x0) [0x5650af3d90e4]
 mongod(_ZNK5mongo4repl33ReplicationConsistencyMarkersImpl11getMinValidEPNS_16OperationContextE+0x258) [0x5650af6c7d78]
 mongod(_ZN5mongo4repl8SyncTail17_oplogApplicationEPNS0_11OplogBufferEPNS0_22ReplicationCoordinatorEPNS1_14OpQueueBatcherE+0x170) [0x5650af79c2c0]
mongod(_ZN5mongo4repl8SyncTail16oplogApplicationEPNS0_11OplogBufferEPNS0_22ReplicationCoordinatorE+0x11E) [0x5650af79d8ee]
 mongod(+0xC76EB6) [0x5650af78ceb6]
 mongod(_ZN5mongo8executor22ThreadPoolTaskExecutor11runCallbackESt10shared_ptrINS1_13CallbackStateEE+0x19A) [0x5650b05dc13a]
 mongod(+0x1AC682A) [0x5650b05dc82a]
 mongod(_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockISt5mutexE+0x1AF) [0x5650b056181f]
 mongod(_ZN5mongo10ThreadPool13_consumeTasksEv+0xAA) [0x5650b0561b5a]
 mongod(_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x95) [0x5650b05624c5]
 libstdc++.so.6(+0xBD57F) [0x7ff10a40057f]
 libpthread.so.0(+0x76DB) [0x7ff109b756db]
 libc.so.6(clone+0x3F) [0x7ff10989e88f]
-----  END BACKTRACE  -----
2019-04-10T17:36:10.691+0400 I NETWORK  [conn6] end connection 127.0.0.1:44610 (0 connections now open)

Digging through the code it seems to be this invariant:

https://github.com/mongodb/mongo/blob/master/src/mongo/db/repl/replication_consistency_markers_impl.cpp#L117

bool ReplicationConsistencyMarkersImpl::getInitialSyncFlag(OperationContext* opCtx) const {
    auto doc = _getMinValidDocument(opCtx);
    invariant(doc);  // Initialized at startup so it should never be missing.

I don't know why the doc is missing in this case.

 Comments   
Comment by Githook User [ 27/Jan/20 ]

Author:

{'email': 'spencer.jackson@mongodb.com', 'username': 'spencerjackson', 'name': 'Spencer Jackson'}

Message: SERVER-40560 Make local.replset.* collections non-normal resources
Branch: master
https://github.com/mongodb/mongo/commit/e846a6d700a9e4a5b67b7fe4360de9ee97ffe846

Comment by Eric Sedor [ 16/Apr/19 ]

jameinel, we are assigning this to an appropriate team to evaluate against our planned work. Updates will be posted as they happen.

Comment by Eric Sedor [ 11/Apr/19 ]

Hi jameinel, thank you for your report. We especially thank you very much for the additional proactive effort you've taken to find the triggering action for the failure, and related documentation. We'll look into this.

Comment by John Arbash Meinel [ 11/Apr/19 ]

So if there is a bug, it is that local.replset.minvalid isn't local.system.replset.minvalid, and/or that mongod doesn't prevent you from deleting a document whose existence it will later invariant() on.

I'm happy to accept that it is a bug on our side that we can fix, but it should be something that users can't actually do.

Comment by John Arbash Meinel [ 11/Apr/19 ]

Oddly, I found it with:

--setParameter "logComponentVerbosity={verbosity:1}"

With that, I was able to see that we were indeed removing the collection:

I COMMAND  [conn7] command config.$cmd command: delete { delete: "transactions", ordered: true, writeConcern: { getLastError: 1 }, $db: "config" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount:{ w: 1 } } } protocol:op_query 0ms
I COMMAND  [conn7] command local.$cmd command: listCollections { listCollections: 1, cursor: { batchSize: 0 }, $db: "local" } numYields:0 reslen:239 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { R: 1 } } } protocol:op_query 0ms
I COMMAND  [conn7] getmore local.$cmd.listCollections command: { getMore: 6923256052014468026, collection: "$cmd.listCollections", batchSize: 0 } originatingCommand: { listCollections: 1, cursor: { batchSize: 0 }, $db: "local" } planSummary: QUEUED_DATA cursorid:6923256052014468026 ntoreturn:0 keysExamined:0 docsExamined:0 cursorExhausted:1 numYields:0 nreturned:7 reslen:1354 locks:{} 0ms
I COMMAND  [conn7] command local.me command: collStats { collstats: "me", $db: "local" } numYields:0 reslen:13597 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_query 1ms
I WRITE    [conn7] remove local.me command: { q: {}, limit: 0 } planSummary: COLLSCAN keysExamined:0 docsExamined:1 ndeleted:1 keysDeleted:1 numYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } } } 0ms
I COMMAND  [conn7] command local.$cmd command: delete { delete: "me", ordered: true, writeConcern: { getLastError: 1 }, $db: "local" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } } } protocol:op_query 0ms
I COMMAND  [conn7] command local.replset.election command: collStats { collstats: "replset.election", $db: "local" } numYields:0 reslen:13641 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_query 0ms
I WRITE    [conn7] remove local.replset.election command: { q: {}, limit: 0 } planSummary: COLLSCAN keysExamined:0 docsExamined:1 ndeleted:1 keysDeleted:1 numYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } } } 0ms
I COMMAND  [conn7] command local.$cmd command: delete { delete: "replset.election", ordered: true, writeConcern: { getLastError: 1 }, $db: "local" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } } } protocol:op_query 0ms
I COMMAND  [conn7] command local.replset.minvalid command: collStats { collstats: "replset.minvalid", $db: "local" } numYields:0 reslen:13639 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_query 1ms
I WRITE    [conn7] remove local.replset.minvalid command: { q: {}, limit: 0 } planSummary: COLLSCAN keysExamined:0 docsExamined:1 ndeleted:1 keysDeleted:1 numYields:0 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } } } 0ms
I COMMAND  [conn7] command local.$cmd command: delete { delete: "replset.minvalid", ordered: true, writeConcern: { getLastError: 1 }, $db: "local" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } } } protocol:op_query 0ms

Looking here:

https://docs.mongodb.com/manual/reference/local-database/

It doesn't define quite a few of those collections, e.g. "local.replset.election" and "config.transactions".

We already have code that explicitly skips all collections that start with "system.". Is there a reason that there are now "local.replset.*" collections that aren't under "local.system.replset"?

Comment by John Arbash Meinel [ 11/Apr/19 ]

More debugging today.

If I launch the mongodb server with:

--profile=2

Then it stops crashing. (I was hoping that by setting profiling, I would see what the bad query was that caused it to fail.)

Comment by John Arbash Meinel [ 11/Apr/19 ]

I'm able to replicate the invariant failure even easier with this change:

--- a/mgo.go
+++ b/mgo.go
@@ -332,9 +332,14 @@ func (inst *MgoInstance) run() error {
 	defer session.Close()
 	session.SetMode(mgo.Monotonic, true)
 	if err := session.Run(bson.D{{"replSetInitiate", nil}}, nil); err != nil {
-		panic(err)
+		return err
 	}
-
+	// now wait until we can get a valid connection before continuing.
+	session2, err := inst.Dial()
+	if err != nil {
+		return err
+	}
+	session2.Close()
 	return nil
 }


Essentially, it fails if I just connect immediately after calling replSetInitiate. Note that I am waiting for replSetInitiate to return, but maybe there is something else I should be waiting for?

I haven't been able to replicate this with the mongo shell:

#!/bin/bash
rm -rf 0
mkdir 0
# Start mongo in the background, and wait for its message
mongod --dbpath 0 --bind_ip 127.0.0.1 --nssize 1 --noprealloc --smallfiles --replSet=sstxn-test --nounixsocket --setParameter enableTestCommands=1 --oplogSize 10 >0/out.txt &
# Wait until mongod is waiting for connections
(tail -f -n0 0/out.txt) | grep -q --line-buffered "waiting for"
mongo local --eval "
db.replset.minvalid.find().pretty();
rs.initiate();
db.replset.minvalid.find().pretty();
"
mongo local --eval "db.replset.minvalid.find().pretty()"
killall -SIGINT mongod
wait

But I also notice that mongo shell always prints out a

Implicit session: session { "id" : UUID("41b5734c-74b0-4837-a6cb-8a73cd1033c5") }

so it is doing more than just connecting to mongod.

Generated at Thu Feb 08 04:55:21 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.