[SERVER-31546] Write lock on one database causes lock on ALL databases Created: 13/Oct/17  Updated: 16/Oct/17  Resolved: 13/Oct/17

Status: Closed
Project: Core Server
Component/s: Index Maintenance, Replication, WiredTiger
Affects Version/s: 3.2.17
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Mario Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File diagnostic.data00     File diagnostic.data01     File diagnostic.data02     PNG File lag.png     HTML File mongo-000     HTML File mongo-001     HTML File mongo-002     File mongo_logs_2017-10-11.tar.gz    
Issue Links:
Duplicate
duplicates SERVER-21307 Replicated DDL (catalog) operation du... Closed
Backwards Compatibility: Fully Compatible
Participants:

 Description   

Hi,

We are using a MongoDB replica set (3 servers: 1 primary, 1 secondary, 1 arbiter).
Mongo version 3.2.17.

We had a problem on 10.10.2017 between 11:44:10 and 11:51:00.

We know that we ran dropIndexes on one collection, and the command executed OK:
2017-10-10T09:44:06.738+0000 I COMMAND [conn92280] CMD: dropIndexes igraci.listici

After that, all operations on all databases were very slow (for about 5-6 minutes), the number of connections started to get very high, and the logs were full of:
2017-10-10T09:50:58.824+0000 I NETWORK [conn93396] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server x.z.y.a

All our applications were slowed in query execution, and in inserts too.

Based on the documentation (https://docs.mongodb.com/manual/reference/command/dropIndexes/index.html), only the affected database should be locked, not all the others, but we think that is what happened.

The logs are in the attachments.

We can't explain to ourselves what happened, so we are asking for your help.
Thanks



 Comments   
Comment by Ramon Fernandez Marina [ 16/Oct/17 ]

mnuic, SERVER-21307 is currently labeled as "Backlog", meaning it is not part of the current development cycle, so unfortunately we don't have an estimate for resolution. Feel free to vote for that ticket.

Regards,
Ramón.

Comment by Mario [ 13/Oct/17 ]

Do you have an estimated time for resolving SERVER-21307?

Thanks for everything Bruce, I will watch the ticket SERVER-21307 for updates.

Best,
Mario

Comment by Bruce Lucas (Inactive) [ 13/Oct/17 ]

Hi Mario,

Thanks for uploading the additional information. This confirms that you've encountered SERVER-21307.

On the primary we see an index build from 08:45 to 09:17, and then the problematic dropIndexes command on the same collection at 09:44:

2017-10-10T08:45:02.225+0000 I INDEX    [conn92339] build index on: igraci.listici properties: { v: 1, key: { igrac_id: 1, vrsta_uplate: 1, poredak: -1, broj: -1 }, name: "igrac_id_1_vrsta_uplate_1_poredak_-1_broj_-1", ns: "igraci.listici", background: true }
2017-10-10T09:17:30.937+0000 I INDEX    [conn92339] build index done.  scanned 8550219 total records. 1948 secs
2017-10-10T09:44:26.619+0000 I COMMAND  [conn92280] CMD: dropIndexes igraci.listici

On the secondary, that index build begins at 09:17, when the build has finished on the primary, and it is still running when the dropIndexes command occurs at 09:44, setting up the conditions for SERVER-21307:

2017-10-10T09:17:30.962+0000 I INDEX    [repl index builder 5] build index on: igraci.listici properties: { v: 1, key: { igrac_id: 1, vrsta_uplate: 1, poredak: -1, broj: -1 }, name: "igrac_id_1_vrsta_uplate_1_poredak_-1_broj_-1", ns: "igraci.listici", background: true }
...
2017-10-10T09:44:06.025+0000 I -        [repl index builder 5]   Index Build (background): 6434100/8551893 75%
2017-10-10T09:44:06.867+0000 I COMMAND  [repl writer worker 15] CMD: dropIndexes igraci.listici
2017-10-10T09:44:09.008+0000 I -        [repl index builder 5]   Index Build (background): 6446200/8551795 75%

Sorry you encountered this issue, and thanks for reporting it. I'll close this ticket as a duplicate of SERVER-21307; please watch that ticket for updates.

To avoid this problem until it is fixed, hold off on dropIndexes commands until the corresponding createIndexes commands on the same collection have finished on all nodes; a sketch of such a check follows.
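
For example, a minimal check in the mongo shell, run against each data-bearing member before issuing the drop (the ns value matches this ticket; the msg-prefix filter is an approximation of how 3.2 reports background index build progress):

// List in-progress index builds touching the collection; only issue
// dropIndexes once this returns no operations on every member.
db.currentOp({ ns: "igraci.listici", msg: /^Index Build/ }).inprog.forEach(function (op) {
    printjson({ opid: op.opid, ns: op.ns, msg: op.msg });
});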

Bruce

Comment by Mario [ 13/Oct/17 ]

Hi Bruce,

Logs from mongo-0 are in attachment.

Can you suggest what we should do to avoid this in the future?

Thanks,
Mario

mongo-000 mongo-001 mongo-002

Comment by Bruce Lucas (Inactive) [ 13/Oct/17 ]

Hi Mario,

Thanks for the additional information. It looks like, as you suspected, the slowness is related to the dropIndexes commands, which we can see at A below:

[inline metrics chart not preserved in this export; presumably the attached lag.png, with annotations A and B]

However, the mechanism for the slowdown was not locking, but rather replica lag combined with w:majority writes. We see member _id 0 (the mongo-0 node) stop replicating and its lag build over the course of almost 7 minutes until B, at which point it catches up very quickly. At that time a number of very long-running requests finish, reporting running times of up to almost 7 minutes (the "logged slowest query durations"), with an average wtime (replication wait time) of about the same. The log confirms that those writes specified w:majority, so they could not complete until the secondary had caught up.
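
For illustration (this bounds the wait rather than fixing the lag), a w:majority write with a wtimeout fails fast instead of blocking for minutes behind a lagged secondary; the document fields and the 5000 ms value below are made-up examples:

// Hypothetical sketch: w:"majority" waits for a majority of
// data-bearing members to acknowledge the write; wtimeout (ms)
// caps that wait, returning a write-concern error instead of
// blocking indefinitely.
db.listici.insert(
    { igrac_id: 12345, vrsta_uplate: "uplata" },
    { writeConcern: { w: "majority", wtimeout: 5000 } }
);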

I suspect that the lag on the secondary relates to the replicated dropIndexes command, but I'm not sure why that occurred. If you can attach the mongod log file for that time period and the entire diagnostic.data directory for the mongo-0 node, we may be able to see why this occurred.

Thanks,
Bruce

Comment by Mario [ 13/Oct/17 ]

Hi Bruce,

Sorry for not mentioning the timezone; it's -2 hours.
That means you should look at the logs from 9:44:10 to 9:51:00.

I've attached the entire diagnostic.data directory.

Thanks,
Mario

diagnostic.data00 diagnostic.data01 diagnostic.data02

Comment by Bruce Lucas (Inactive) [ 13/Oct/17 ]

Hi Mario,

One additional request - can you clarify what the timezone is for the following?

We had a problem on 10.10.2017 between 11:44:10 and 11:51:00.

Thanks,
Bruce

Comment by Bruce Lucas (Inactive) [ 13/Oct/17 ]

Hi Mario,

Thanks for attaching the logs and the metrics files. However, the attached metrics files don't cover the time period you mention. Can you please archive and attach to this ticket the entire content of the diagnostic.data directory so that we can investigate?

Note that this request is somewhat time critical as the data retention for diagnostic.data is typically a few days.

Thanks,
Bruce
