[SERVER-67773] Index build hangs after "Index build: waiting for next action before completing final phase" Created: 05/Jul/22 Updated: 07/Sep/22 Resolved: 07/Sep/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.4.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Asel Magzhanova | Assignee: | Chris Kelly |
| Resolution: | Done | Votes: | 0 |
| Labels: | index, indexing | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
I have a sharded cluster with one shard (a replica set of three nodes) and am restoring data using percona-backup-mongodb. The restore process was stuck on index creation for two days. I tried to restore a GridFS collection of 400 GB in size, and index creation ran for 3 days and didn't finish. I watched adminCommand and thought that it was because of the collection size. There was no information about locks:
Later I found in the mongod logs that index creation hung after the phase "Index build: waiting for next action before completing final phase". All nodes of the replica set were alive. Could you explain what the problem was? I use version 4.4 of MongoDB. |
| Comments |
| Comment by Chris Kelly [ 07/Sep/22 ] | ||||
|
Asik, Glad to hear you were able to resolve the issue! I'll go ahead and close this ticket for now.
| ||||
| Comment by Asel Magzhanova [ 01/Sep/22 ] | ||||
|
Hi! Sorry for the late reply. | ||||
| Comment by Chris Kelly [ 22/Aug/22 ] | ||||
|
Hi Asel! Just wanted to check in - did this end up resolving your issue? If so, we can go ahead and close this ticket. Christopher | ||||
| Comment by Chris Kelly [ 01/Aug/22 ] | ||||
|
Hi Asel, About 10 minutes before another index build starts at ~2022-07-01T21:00:00+03:00, node 0 goes into rollback and then gets stuck in a recovery state because it has been moved to maintenance mode. It stays this way for the remainder of the time FTDC is logged (multiple hours). The logs show that this node remains too stale for days after the event. This requires an initial sync to fix.
From what I can tell, multiple indexes are also being created at the same time. There is a lot of load during this process, so your secondary fell out of sync and requires a resync to catch back up. Starting in 4.4, the primary waits for all data-bearing voting members to report back before committing the index build. This is described in the commit quorum defaults for the createIndexes command. Possible mitigations for this could be to:
Let me know if any of these options resolve your issue. Regards, | ||||
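To illustrate the commit quorum behavior described in the comment above, here is a minimal Python sketch. It is a simplified model, not MongoDB's implementation: the `Member` class, the `ready` flag, and both helper functions are hypothetical names introduced for illustration, and the `fs.chunks` index in the usage section is only an assumed example for a GridFS collection. The command field names (`createIndexes`, `indexes`, `commitQuorum`) and the quorum values `"votingMembers"` (the default) and `"majority"` come from the `createIndexes` command.

```python
from dataclasses import dataclass


@dataclass
class Member:
    """Hypothetical, simplified view of one replica set member."""
    name: str
    votes: int = 1             # 0 means a non-voting member
    data_bearing: bool = True  # False for arbiters
    ready: bool = True         # assumed: member finished its build and can vote


def build_create_indexes_command(collection, key, index_name,
                                 commit_quorum="votingMembers"):
    """Build (but do not send) a createIndexes command document."""
    return {
        "createIndexes": collection,
        "indexes": [{"key": key, "name": index_name}],
        "commitQuorum": commit_quorum,
    }


def can_commit(members, commit_quorum="votingMembers"):
    """Return True if enough members are ready for the primary to commit.

    Simplification: only "votingMembers", "majority", and integer quorums
    are modeled; replica set tag names are not.
    """
    voters = [m for m in members if m.votes > 0 and m.data_bearing]
    ready = sum(1 for m in voters if m.ready)
    if commit_quorum == "votingMembers":
        return ready == len(voters)
    if commit_quorum == "majority":
        return ready >= len(voters) // 2 + 1
    if isinstance(commit_quorum, int) and not isinstance(commit_quorum, bool):
        return ready >= commit_quorum
    raise ValueError(f"commit quorum not modeled: {commit_quorum!r}")


if __name__ == "__main__":
    # Three-node replica set in which node0 is stuck in recovery.
    members = [Member("node0", ready=False), Member("node1"), Member("node2")]
    print(can_commit(members))              # default quorum
    print(can_commit(members, "majority"))  # majority quorum
    print(build_create_indexes_command(
        "fs.chunks", {"files_id": 1, "n": 1}, "files_id_1_n_1", "majority"))
```

In this model, a single member that never becomes ready is enough to block the commit under the default quorum, which matches the symptom in this ticket of an index build waiting indefinitely before its final phase.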
| Comment by Asel Magzhanova [ 15/Jul/22 ] | ||||
|
Unfortunately, we saved diagnostic.data on only one node of the replica set. The replica set consists of three nodes, and I have attached 3 archives, one per node. The problem arose on the evening of 2022-07-01. The log fragment after which index creation hung:
| ||||
| Comment by Chris Kelly [ 05/Jul/22 ] | ||||
|
Asel, Thanks for your report. For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to the ticket:
Christopher |