[SERVER-67350] stalling during concurrent insertMany operations when unique index exists Created: 17/Jun/22 Updated: 03/Oct/22 Resolved: 03/Oct/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Concurrency |
| Affects Version/s: | 5.0.9, 4.4.15 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Richard Martin | Assignee: | Louis Williams |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Operating System: | ALL | ||||||||||||
| Steps To Reproduce: | I have created a simple java program to try demonstrate the issue https://github.com/bobthekingofegypt/mongodb-insert-test If you have a separate running mongodb server this can be run from the commandline with ``` Running it with 5 connections usually demonstrates the problem on my machine and gcp servers. This program tries to simulate load by just throwing 1 million entries into test database using threads to simulate the concurrency. We host mongodb ourselves, not currently using clusters for this part of processing so it is a standalone server. |
||||||||||||
| Sprint: | Execution Team 2022-10-17 | ||||||||||||
| Participants: | |||||||||||||
| Description |
|
I was doing some testing on the data loading part of our pipeline and have been noticing a lot of problems when multiple concurrent insertMany operations are being triggered on a collection that contains a unique index. For some reason MongoDB will just stop in the middle of the operations and the socket will stall, Mongodb will be using one or two cores at 100% CPU but there will be nothing in the logs, mongotop, currentOps or mongostat to show that anything is happening. Eventually the connections will resume and finish the inserts without any errors but this may take anything from 2-10 minutes on my computer (my colleagues have tested it on their slower macbooks and have had stalling for up to 40 minutes). In case it helps I don't believe the collection is locked, I can start another insert on another terminal with 1 connection and it will complete an insert of 1million rows without issue even when the original sockets are still stuck. I have tested using docker to install a few different versions and this problem does not happen on v4.2, but does on v4.4 and v5.0. On version v4.2 the test program will complete in 1.7 seconds where on the last v5 run I did it took 9m19sec. I don't know if I'm doing something wrong, I should tweak some server configuration or if this is expected but it seems extreme. |
| Comments |
| Comment by Louis Williams [ 03/Oct/22 ] | |||||||||||||||||||||||||||
|
bobthekingofegypt@gmail.com thanks for filing this bug report. We believe the poor performance you're observing can be explained by The fix in MongoDB is tracked by | |||||||||||||||||||||||||||
| Comment by Sulabh Mahajan [ 02/Aug/22 ] | |||||||||||||||||||||||||||
|
Given the sudden jump in the statistic "history store table reads missed" and that this issue started in 4.4, and goes away in 5.3, this could be | |||||||||||||||||||||||||||
| Comment by Chris Kelly [ 25/Jul/22 ] | |||||||||||||||||||||||||||
|
The issue begins at this commit between 4.4.1 and 4.4.2 (4.4.1 does not have any issues). It contains a bunch of WT updates. I'll forward this to Storage Execution to investigate how these changes may affect concurrent insertMany operations. The issue goes away in 5.3.1 onward.
| |||||||||||||||||||||||||||
| Comment by Chris Kelly [ 27/Jun/22 ] | |||||||||||||||||||||||||||
|
Richard, Thank you for taking the time to upload the extra information! What you submitted is very clear for us to look at, and it's appreciated. Just to keep you in the loop, I am currently seeing the problem you're describing. I can clearly see that the queries insert the same amount of data, but take significantly longer in that scenario on 4.4 and 5.0 as you mentioned. 4.2 4.4: 5.0: I see the slow queries lining up with your inserts. I tried to replicate your problem on 4.2 and 4.4 independently without success, however. I attempted with varying amounts of threads (up to 200) and my performance is the same across versions. In my test, I create 200 connections and attempt concurrent insertMany operations on a collection with an index {"a": 1}, {unique: true}) with a million documents each. For some reason, I don't get the same behavior you see with your driver, however I was using the jstests template in our mongodb repository, based on fsm_example.js, to create the unique index and then run workers that did insertMany(). That being said, I was able to use your supplied driver (thank you for providing it) and noticed it affects the following versions:
This is strange, however it looks like whatever was causing this was fixed in 5.3 onward. Specifically, what seemed to be happening during the stall periods was:
Followed by spamming of helloOk and "Received interrupt request for unknown op" in the logs (at logLevel 2):
I'll forward this to QE for some more information on what is causing that particular message, as it's happening during the stalls. Thank you for providing such specific information and a working reproduction of the issue!
Regards, Christopher | |||||||||||||||||||||||||||
| Comment by Richard Martin [ 24/Jun/22 ] | |||||||||||||||||||||||||||
|
Files attached, each version contains two runs. I think I got the correct files off the docker container. I didn't mark the time but checking the log I think these are the starting points, the logs are from freshly created docker images so there isn't any noise in them anyway. 5.0: 2022-06-24T12:43:21.254+00:00 2022-06-24T12:47:32.902+00:00 4.4: 2022-06-24T12:59:10.328+00:00 2022-06-24T13:11:36.767+00:00 4.2: 2022-06-24T13:17:04.363+0000 2022-06-24T13:17:09.093+0000
As far as your description that is basically what is happening but the import part is that the 1 million documents are being inserted from multiple concurrent insertMany operations not just one. A single socket connection doing insertMany( 1 million documents) will finish without issue, the problems start when concurrency is added.
| |||||||||||||||||||||||||||
| Comment by Chris Kelly [ 22/Jun/22 ] | |||||||||||||||||||||||||||
|
Hi Richard, Thanks for your report. A regression from 1.7 seconds to 9 minutes definitely seems quite extreme in this case. In order to look into this further, it'd be helpful to get your logs and FTDC to better verify this behavior and see what's going on. Would you please archive (tar or zip) and upload to the ticket for v4.2, v4.4, and v5.0:
It would be super appreciated so we can verify this behavior. It seems like the following is taking place: Create index {"someId": 1},{unique: true} Run db.collection.insertMany( 1000000 documents ) Document structure: {"someId": objectId("alphanumeric characters here")} I will also try to reproduce this on my end as well. Regards, |