[SERVER-67350] stalling during concurrent insertMany operations when unique index exists Created: 17/Jun/22  Updated: 03/Oct/22  Resolved: 03/Oct/22

Status: Closed
Project: Core Server
Component/s: Concurrency
Affects Version/s: 5.0.9, 4.4.15
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Richard Martin Assignee: Louis Williams
Resolution: Duplicate Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File image-2022-06-27-03-40-25-783.png     PNG File image-2022-06-27-03-44-24-766.png     PNG File image-2022-06-27-03-45-22-122.png     PNG File image-2022-06-27-03-52-55-599.png     PNG File image-2022-06-27-04-09-41-371.png     PNG File image-2022-07-14-08-36-37-230.png     PNG File image-2022-07-14-08-36-55-924.png     PNG File image-2022-07-14-08-38-02-566.png     PNG File image-2022-07-14-12-41-19-603.png     PNG File image-2022-07-14-12-41-35-251.png     File requested_files.tar.gz    
Issue Links:
Related
related to WT-7264 Creating a new configuration for sear... Closed
related to SERVER-61185 Use prefix_search for unique index lo... Closed
Operating System: ALL
Steps To Reproduce:

I have created a simple Java program to try to demonstrate the issue:

https://github.com/bobthekingofegypt/mongodb-insert-test

If you have a MongoDB server running separately, the program can be run from the command line with:

```
./gradlew bootRun --args="mongodb://<connection string> <number of connections>"
```

Running it with 5 connections usually demonstrates the problem on my machine and on GCP servers.  The program simulates load by simply throwing 1 million entries into a test database, using threads to provide the concurrency.  We host MongoDB ourselves and are not currently using clusters for this part of the processing, so it is a standalone server.
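
In outline, the workload the program drives looks roughly like the following. This is only a simplified sketch using the MongoDB Java sync driver, not the actual program; the connection string, batch size, and generated document values here are placeholders.

```
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import org.bson.Document;
import org.bson.types.ObjectId;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentInsertSketch {
    public static void main(String[] args) throws InterruptedException {
        int threads = 5;              // number of concurrent connections/workers
        int docsPerThread = 200_000;  // ~1 million documents in total
        int batchSize = 1_000;        // documents per insertMany call

        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll = client
                    .getDatabase("test_data_mongo_issue")
                    .getCollection("mainthings");

            // The stall only shows up when a unique index exists on the collection.
            coll.createIndex(Indexes.ascending("someId"), new IndexOptions().unique(true));

            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int t = 0; t < threads; t++) {
                pool.submit(() -> {
                    for (int i = 0; i < docsPerThread; i += batchSize) {
                        List<Document> batch = new ArrayList<>(batchSize);
                        for (int j = 0; j < batchSize; j++) {
                            batch.add(new Document("someId", new ObjectId().toHexString()));
                        }
                        coll.insertMany(batch);  // concurrent insertMany across threads
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }
}
```

(The actual program lives in the linked GitHub repository; the sketch above is only meant to convey the access pattern.)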

Sprint: Execution Team 2022-10-17
Participants:

 Description   

I was doing some testing on the data-loading part of our pipeline and have noticed a lot of problems when multiple concurrent insertMany operations are triggered on a collection that has a unique index.  For some reason MongoDB will just stop in the middle of the operations and the sockets will stall; mongod will be using one or two cores at 100% CPU, but there will be nothing in the logs, mongotop, currentOp or mongostat to show that anything is happening.  Eventually the connections resume and finish the inserts without any errors, but this can take anything from 2 to 10 minutes on my computer (my colleagues have tested it on their slower MacBooks and have seen stalling for up to 40 minutes).  In case it helps, I don't believe the collection is locked: I can start another insert in another terminal with 1 connection and it will complete an insert of 1 million rows without issue, even while the original sockets are still stuck.

I have tested a few different versions installed via Docker, and this problem does not happen on v4.2 but does on v4.4 and v5.0.  On v4.2 the test program completes in 1.7 seconds, whereas on the last v5.0 run I did it took 9m19s.

I don't know whether I'm doing something wrong, whether I should tweak some server configuration, or whether this is expected, but it seems extreme.



 Comments   
Comment by Louis Williams [ 03/Oct/22 ]

bobthekingofegypt@gmail.com thanks for filing this bug report. We believe the poor performance you're observing can be explained by WT-7264. Essentially, we're not being very efficient when checking for the existence of duplicate keys when inserting documents, something that changed in 4.4.

The fix in MongoDB is tracked by SERVER-61185, which was released in 5.3.0 and will be released in 5.0.14. We are planning a 4.4 backport, but the performance improvement will not be enabled by default due to risks associated with backporting such a change to an older, more stable branch. Please follow SERVER-61185 for updates.

Comment by Sulabh Mahajan [ 02/Aug/22 ]

Given the sudden jump in the "history store table reads missed" statistic, and that this issue started in 4.4 and goes away in 5.3, this could be WT-7264, resolved in the server by SERVER-61185.

Comment by Chris Kelly [ 25/Jul/22 ]

The issue begins at this commit between 4.4.1 and 4.4.2 (4.4.1 does not have any issues). That commit contains a batch of WiredTiger updates. I'll forward this to Storage Execution to investigate how those changes may affect concurrent insertMany operations. The issue goes away from 5.3.1 onward.

 

Comment by Chris Kelly [ 27/Jun/22 ]

Richard,

Thank you for taking the time to upload the extra information! What you submitted is very clear for us to look at, and it's appreciated.

Just to keep you in the loop, I am currently seeing the problem you're describing. I can clearly see that the queries insert the same amount of data but, as you mentioned, take significantly longer in that scenario on 4.4 and 5.0.

[Screenshots of the 4.2, 4.4, and 5.0 runs are attached.]

I see the slow queries lining up with your inserts. However, I tried to replicate your problem on 4.2 and 4.4 independently without success. I attempted it with varying numbers of threads (up to 200), and my performance is the same across versions. In my test, I create 200 connections and attempt concurrent insertMany operations on a collection with an index {"a": 1}, {unique: true}, with a million documents each. For some reason I don't get the same behavior you see with your driver; however, I was using the jstests template in our mongodb repository, based on fsm_example.js, to create the unique index and then run workers that did insertMany().

That being said, I was able to run the driver program you supplied (thank you for providing it) and noticed that the issue affects the following versions:
 

Version   Stalls during concurrent (10 connections) insertMany() ops?
4.2.21    no
4.4.15    yes
5.0.9     yes
5.3.1     no
6.0.0     no

This is strange; however, it looks like whatever was causing this was fixed from 5.3 onward. Specifically, this is what seemed to be happening during the stall periods:

{"t":{"$date":"2022-07-14T07:09:39.373-04:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn30","msg":"Slow query","attr":{"type":"command","ns":"test_data_mongo_issue.mainthings","command":{"insert":"mainthings","ordered":true,"$db":"test_data_mongo_issue","lsid":{"id":{"$uuid":"448d8ba3-7d9d-4784-8837-0337809574f6"}}},"ninserted":1,"keysInserted":2,"numYields":0,"reslen":45,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":1}},"FeatureCompatibilityVersion":{"acquireCount":{"w":1}},"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"w":1}},"Database":{"acquireCount":{"w":1}},"Collection":{"acquireCount":{"w":1}},"Mutex":{"acquireCount":{"r":1}}},"flowControl":{"acquireCount":1,"timeAcquiringMicros":1},"storage":{},"protocol":"op_msg","durationMillis":0}}
{"t":{"$date":"2022-07-14T07:09:39.373-04:00"},"s":"D1", "c":"QUERY",    "id":22790,   "ctx":"conn30","msg":"Received interrupt request for unknown op","attr":{"opId":8001}}
{"t":{"$date":"2022-07-14T07:09:39.373-04:00"},"s":"D2", "c":"QUERY",    "id":22783,   "ctx":"conn30","msg":"Ops known during interrupt","attr":{"ops":[]}}
{"t":{"$date":"2022-07-14T07:09:39.413-04:00"},"s":"D2", "c":"REPL",     "id":22549,   "ctx":"conn29","msg":"Waiting for write concern. OpTime: {replOpTime}, write concern: {writeConcern}","attr":{"replOpTime":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1},"writeConcern":{"w":1,"wtimeout":0}}}

This was followed by spamming of helloOk and "Received interrupt request for unknown op" messages in the logs (at logLevel 2):

"t":{"$date":"2022-07-14T07:10:15.946-04:00"},"s":"D1", "c":"QUERY",    "id":22790,   "ctx":"conn23","msg":"Received interrupt request for unknown op","attr":{"opId":8340}}
{"t":{"$date":"2022-07-14T07:10:15.946-04:00"},"s":"D2", "c":"QUERY",    "id":22783,   "ctx":"conn23","msg":"Ops known during interrupt","attr":{"ops":[]}}
{"t":{"$date":"2022-07-14T07:10:15.946-04:00"},"s":"D2", "c":"COMMAND",  "id":21965,   "ctx":"conn23","msg":"About to run the command","attr":{"db":"admin","commandArgs":{"hello":1,"helloOk":true,"topologyVersion":{"processId":{"$oid":"62cff70d87633714f2a1d583"},"counter":0},"maxAwaitTimeMS":10000,"$db":"admin"}}}
{"t":{"$date":"2022-07-14T07:10:25.885-04:00"},"s":"D2", "c":"COMMAND",  "id":21965,   "ctx":"conn22","msg":"About to run the command","attr":{"db":"admin","commandArgs":{"hello":1,"$db":"admin"}}}
{"t":{"$date":"2022-07-14T07:10:25.885-04:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn22","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"hello":1,"$db":"admin"},"numYields":0,"reslen":313,"locks":{},"protocol":"op_msg","durationMillis":0}}
{"t":{"$date":"2022-07-14T07:10:25.885-04:00"},"s":"D1", "c":"QUERY",    "id":22790,   "ctx":"conn22","msg":"Received interrupt request for unknown op","attr":{"opId":8591}}
{"t":{"$date":"2022-07-14T07:10:25.885-04:00"},"s":"D2", "c":"QUERY",    "id":22783,   "ctx":"conn22","msg":"Ops known during interrupt","attr":{"ops":[]}}
{"t":{"$date":"2022-07-14T07:10:25.954-04:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn23","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"hello":1,"helloOk":true,"topologyVersion":{"processId":{"$oid":"62cff70d87633714f2a1d583"},"counter":0},"maxAwaitTimeMS":10000,"$db":"admin"},"numYields":0,"reslen":323,"locks":{},"protocol":"op_msg","durationMillis":0}}
{"t":{"$date":"2022-07-14T07:10:25.954-04:00"},"s":"D1", "c":"QUERY",    "id":22790,   "ctx":"conn23","msg":"Received interrupt request for unknown op","attr":{"opId":8467}}
{"t":{"$date":"2022-07-14T07:10:25.954-04:00"},"s":"D2", "c":"QUERY",    "id":22783,   "ctx":"conn23","msg":"Ops known during interrupt","attr":{"ops":[]}}
{"t":{"$date":"2022-07-14T07:10:25.955-04:00"},"s":"D2", "c":"COMMAND",  "id":21965,   "ctx":"conn23","msg":"About to run the command","attr":{"db":"admin","commandArgs":{"hello":1,"helloOk":true,"topologyVersion":{"processId":{"$oid":"62cff70d87633714f2a1d583"},"counter":0},"maxAwaitTimeMS":10000,"$db":"admin"}}}

I'll forward this to QE for some more information on what is causing that particular message, as it's happening during the stalls. Thank you for providing such specific information and a working reproduction of the issue!

 

Regards,

Christopher

Comment by Richard Martin [ 24/Jun/22 ]

Files attached; each version contains two runs.  I think I got the correct files off the Docker container.  I didn't note the times, but checking the logs I think these are the starting points; the logs are from freshly created Docker images, so there isn't any noise in them anyway.

5.0:
  • 2022-06-24T12:43:21.254+00:00
  • 2022-06-24T12:47:32.902+00:00

4.4:
  • 2022-06-24T12:59:10.328+00:00
  • 2022-06-24T13:11:36.767+00:00

4.2:
  • 2022-06-24T13:17:04.363+0000
  • 2022-06-24T13:17:09.093+0000

 

 

As far as your description goes, that is basically what is happening, but the important part is that the 1 million documents are being inserted from multiple concurrent insertMany operations, not just one.  A single socket connection doing insertMany(1 million documents) will finish without issue; the problems start when concurrency is added.

 

Comment by Chris Kelly [ 22/Jun/22 ]

Hi Richard,

Thanks for your report. A regression from 1.7 seconds to 9 minutes definitely seems quite extreme in this case. In order to look into this further, it'd be helpful to get your logs and FTDC to better verify this behavior and see what's going on.

Would you please archive (tar or zip) and upload the following to the ticket, for v4.2, v4.4, and v5.0:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)
  • the timestamps at which you run your test

It would be much appreciated, so we can verify this behavior. It seems like the following is taking place (roughly sketched below):

Create index {"someId": 1},{unique: true}

Run db.collection.insertMany( 1000000 documents )

Document structure: {"someId": objectId("alphanumeric characters here")}
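
In Java driver terms, I read those steps as roughly the following on a single connection. This is only a sketch of my understanding, not your program; the database and collection names and the generated values are assumptions.

```
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import org.bson.Document;
import org.bson.types.ObjectId;

import java.util.ArrayList;
import java.util.List;

public class SingleConnectionInsertSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll = client
                    .getDatabase("test")          // database name assumed
                    .getCollection("things");     // collection name assumed

            // Unique index as described above.
            coll.createIndex(Indexes.ascending("someId"), new IndexOptions().unique(true));

            // One insertMany of 1,000,000 documents of the form {"someId": "<alphanumeric>"}.
            List<Document> docs = new ArrayList<>(1_000_000);
            for (int i = 0; i < 1_000_000; i++) {
                docs.add(new Document("someId", new ObjectId().toHexString()));
            }
            coll.insertMany(docs);
        }
    }
}
```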

I will also try to reproduce this on my end.

Regards,
Christopher
 
