[SERVER-33057] Indexing fatal assertion on secondary during sync Created: 01/Feb/18  Updated: 21/Mar/18  Resolved: 16/Feb/18

Status: Closed
Project: Core Server
Component/s: Index Maintenance, Replication
Affects Version/s: 3.4.10
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Chad Kreimendahl Assignee: William Schultz (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux


Issue Links:
Duplicate
duplicates SERVER-27122 Restart initial sync for known index ... Backlog
Participants:

 Description   

We recently added a new hidden secondary to one of our many clusters. The cluster is on version 3.4.10 but running with featureCompatibilityVersion: 3.2. After initial sync, while rebuilding numerous indexes, the server crashed while attempting to create what appears to have been its 65th index.

I suspect a fatal assertion here is not ideal. I'm also curious how our other systems, running the same version, are still alive.
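A quick way to confirm the compatibility version and the current index count on the affected collection from the shell (a sketch using standard helpers; namespace anonymized as above):

  // Feature compatibility version of the node
  db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })
  // Number of indexes currently on the collection (the _id index counts toward the 64-index limit)
  db.getSiblingDB("XDemo2").D_22.getIndexes().length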

2018-02-01T11:19:12.226-0600 I INDEX    [repl index builder 1134] build index on: XDemo2.D_22 properties: { v: 1, key: { Fx.A.SV: 1 }, name: "Fx.A.SV_1", ns: "XDemo2.D_22", background: true }
2018-02-01T11:19:12.253-0600 I INDEX    [repl index builder 1134] build index done.  scanned 75 total records. 0 secs
2018-02-01T11:19:12.254-0600 I INDEX    [repl index builder 1135] add index fails, too many indexes for XDemo2.D_22 key:{ Fx.A.RRIds: 1 }
2018-02-01T11:19:12.254-0600 E INDEX    [repl index builder 1135] bad status from index build: CannotCreateIndex: add index fails, too many indexes for XDemo2.D_22 key:{ Fx.A.RRIds: 1 }
2018-02-01T11:19:12.255-0600 E INDEX    [repl index builder 1135] IndexBuilder could not build index: CannotCreateIndex: add index fails, too many indexes for XDemo2.D_22 key:{ Fx.A.RRIds: 1 }
2018-02-01T11:19:12.255-0600 I -        [repl index builder 1135] Fatal Assertion 28555 at src/mongo/db/index_builder.cpp 106
2018-02-01T11:19:12.255-0600 I -        [repl index builder 1135]
 
***aborting after fassert() failure
 
 
2018-02-01T11:19:12.327-0600 F -        [repl index builder 1135] Got signal: 6 (Aborted).
 
 0x558738e68641 0x558738e67859 0x558738e67d3d 0x7f26d476a890 0x7f26d43e5067 0x7f26d43e6448 0x5587381127a3 0x558738576be6 0x558738ddac41 0x5587398dda40 0x7f26d4763064 0x7f26d449862d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"5587378F7000","o":"1571641","s":"_ZN5mongo15printStackTraceERSo"},{"b":"5587378F7000","o":"1570859"},{"b":"5587378F7000","o":"1570D3D"},{"b":"7F26D475B000","o":"F890"},{"b":"7F26D43B0000","o":"35067","s":"gsignal"},{"b":"7F26D43B0000","o":"36448","s":"abort"},{"b":"5587378F7000","o":"81B7A3","s":"_ZN5mongo32fassertFailedNoTraceWithLocationEiPKcj"},{"b":"5587378F7000","o":"C7FBE6","s":"_ZN5mongo12IndexBuilder3runEv"},{"b":"5587378F7000","o":"14E3C41","s":"_ZN5mongo13BackgroundJob7jobBodyEv"},{"b":"5587378F7000","o":"1FE6A40"},{"b":"7F26D475B000","o":"8064"},{"b":"7F26D43B0000","o":"E862D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.4.10", "gitVersion" : "078f28920cb24de0dd479b5ea6c66c644f6326e9", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.16.0-4-amd64", "version" : "#1 SMP Debian 3.16.43-2+deb8u5 (2017-09-19)", "machine" : "x86_64" }, "somap" : [ { "b" : "5587378F7000", "elfType" : 3, "buildId" : "004E9FAB8E542F8BABDB0274A211B1E9D15DEDF1" }, { "b" : "7FFE359F5000", "path" : "linux-vdso.so.1", "elfType" : 3, "buildId" : "D3D8F25E67DC1A88F1545D51C29476B4CC91F2E5" }, { "b" : "7F26D5697000", "path" : "/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "21115992A1F885E1ACE88AADA60F126AD9759D03" }, { "b" : "7F26D529B000", "path" : "/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "FD6376149047833953B0269E84DE181CA45DBE90" }, { "b" : "7F26D5093000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "A63C95FB33CCA970E141D2E13774B997C1CF0565" }, { "b" : "7F26D4E8F000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "D70B531D672A34D71DB42EB32B68E63F2DCC5B6A" }, { "b" : "7F26D4B8E000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "152C93BA3E8590F7ED0BCDDF868600D55EC4DD6F" }, { "b" : "7F26D4978000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "D5FB04F64B3DAEA6D6B68B5E8B9D4D2BC1A6E1FC" }, { "b" : "7F26D475B000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "9DA9387A60FFC196AEDB9526275552AFEF499C44" }, { "b" : "7F26D43B0000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "48C48BC6ABB794461B8A558DD76B29876A0551F0" }, { "b" : "7F26D58F8000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "1D98D41FBB1EABA7EC05D0FD7624B85D6F51C03C" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x558738e68641]
 mongod(+0x1570859) [0x558738e67859]
 mongod(+0x1570D3D) [0x558738e67d3d]
 libpthread.so.0(+0xF890) [0x7f26d476a890]
 libc.so.6(gsignal+0x37) [0x7f26d43e5067]
 libc.so.6(abort+0x148) [0x7f26d43e6448]
 mongod(_ZN5mongo32fassertFailedNoTraceWithLocationEiPKcj+0x0) [0x5587381127a3]
 mongod(_ZN5mongo12IndexBuilder3runEv+0xD86) [0x558738576be6]
 mongod(_ZN5mongo13BackgroundJob7jobBodyEv+0x131) [0x558738ddac41]
 mongod(+0x1FE6A40) [0x5587398dda40]
 libpthread.so.0(+0x8064) [0x7f26d4763064]
 libc.so.6(clone+0x6D) [0x7f26d449862d]
-----  END BACKTRACE  -----



 Comments   
Comment by Chad Kreimendahl [ 16/Feb/18 ]

It's certainly possible. There shouldn't have been any index changes, but they do happen occasionally. That specific collection didn't have any administrative updates in our console, which suggests the change came from a predictive analysis tool we have that picks which 64 fields get indexed. If you raised the limit on the number of indexes to something like 256, that tool wouldn't matter much to us, since we'd index almost everything. While I'm on the subject, I'm reminded that SERVER-1140 could be recreated as "raise the limit to 128, 256, 512, 1024, etc.", which it appears I already did in SERVER-21591. I'm also reminded that reducing the number of files per index would be necessary if this were to happen, as startup times are already insane for us. That would be SERVER-17675 and SERVER-25025 (which were supposed to be in 3.6, then 3.8, and have now been tossed into the backlog).

Comment by William Schultz (Inactive) [ 15/Feb/18 ]

Thanks for submitting the log files, sallgeud. This appears to be a known issue with initial sync, related to the fact that MongoDB enforces a maximum of 64 indexes per collection. This creates potential problems for initial sync because of the way we clone data and apply oplog operations. Initial sync consists of two main data transfer operations: the Clone phase, where we copy all documents from each collection on the sync source, and the Oplog Fetching phase, where we fetch and buffer all oplog operations that occur on the sync source for the duration of the initial sync. These processes happen concurrently. We can imagine the following initial sync event sequence:

  1. Oplog fetching starts
  2. An index I is created on the sync source, on collection C.
  3. Index I is dropped on the sync source.
  4. 64 indexes are created on collection C on the sync source (which is allowed).
  5. Collection C is cloned, along with all 64 indexes.
  6. Initial sync collection cloning finishes.
  7. Oplog application starts.
  8. The initial syncing node tries to apply the operation that originally created index I on C. This fails, since it would create a 65th index. The constraint violation is only temporary, because the subsequent dropIndex operation that removes I would resolve it once applied, but there is no explicit way to ignore the violation and continue (a rough shell sketch of the sequence follows this list).
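
A minimal shell sketch of that sequence on the sync source (hypothetical collection and field names; the new secondary is assumed to be fetching oplog entries the whole time):

  // On the sync source, while the new secondary is running its initial sync:
  var coll = db.getSiblingDB("test").C;
  coll.createIndex({ i: 1 }, { name: "I" });   // step 2: create index I
  coll.dropIndex("I");                         // step 3: drop index I again
  // step 4: fill C up to the 64-index cap (the default _id index counts toward the limit)
  for (var n = 0; n < 63; n++) {
      var spec = {};
      spec["f" + n] = 1;
      coll.createIndex(spec);
  }
  // steps 5-8: the secondary clones C with all 64 indexes, then replays the buffered
  // oplog; re-applying the createIndex for I would mean a 65th index, which triggers
  // the CannotCreateIndex error and the fassert seen in the description.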

This is likely the issue you ran into. Looking at the logs, OnspringDemo2.D_22 is the collection that violated the 64-index limit. When it was cloned it had 63 indexes, and then, during oplog application, a 64th index was created on it:

2018-02-01T11:19:12.226-0600 I INDEX    [repl index builder 1134] build index on: OnspringDemo2.D_22 properties: { v: 1, key: { Fx.646.SV: 1 }, name: "Fx.646.SV_1", ns: "OnspringDemo2.D_22", background: true }

and then another index creation op was applied, which failed:

2018-02-01T11:19:12.254-0600 I INDEX    [repl index builder 1135] add index fails, too many indexes for OnspringDemo2.D_22 key:{ Fx.646.RRIds: 1 }

Without the logs from the sync source it is tough to tell exactly, but what I presume happened is that the Fx.646.RRIds index was created and then later dropped on the sync source before the collection was cloned (that index doesn't show up in the list of indexes built on OnspringDemo2.D_22 during the initial sync), and then during oplog application the error appeared because the node tried to apply that index creation op, violating the 64-index constraint.
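
If the sync source logs do turn up, a rough way to look for the create/drop ops for that index in its oplog (a sketch only; on 3.4 with featureCompatibilityVersion 3.2, index builds replicate as inserts into system.indexes and index drops as dropIndexes commands, though exact op shapes can vary):

  var oplog = db.getSiblingDB("local").oplog.rs;
  oplog.find({
      $or: [
          { op: "i", ns: "OnspringDemo2.system.indexes", "o.ns": "OnspringDemo2.D_22" },
          { op: "c", ns: "OnspringDemo2.$cmd", "o.dropIndexes": "D_22" }
      ]
  }).sort({ $natural: 1 })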

Comment by Chad Kreimendahl [ 07/Feb/18 ]

Submitted. It takes a while to anonymize the data, schedule the firewall rules for egress of this type of info, and get it over to you. You should have it now.

Comment by Kelsey Schubert [ 07/Feb/18 ]

sallgeud, would you please let us know when you've uploaded the files?

Thanks,
Kelsey

Comment by William Schultz (Inactive) [ 06/Feb/18 ]

As Spencer mentioned, the fact that the collection has 64 indexes on it and that this occurred during initial sync makes it highly likely this is a known issue (SERVER-27122), which is not currently slated to be fixed. To be sure, it would be worth seeing the full logs. I wasn't able to get them from the download portal, kelsey.schubert?

Comment by Kelsey Schubert [ 05/Feb/18 ]

Hi sallgeud,

Sorry for the delay; here's a secure portal for you to use.

Kind regards,
Kelsey

Comment by Chad Kreimendahl [ 02/Feb/18 ]

Can I get a secure upload site?

I've already deleted half the indexes in that collection (for the time being) and will have our indexing tool recreate them afterwards. Before I nuked them, I verified that the collection had exactly 64 indexes on the primary and both existing secondaries.
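
For reference, a sketch of what that thinning looks like from the shell (the actual choice of which indexes to drop comes from our indexing tool, so the selection below is only illustrative):

  var coll = db.getSiblingDB("XDemo2").D_22;
  coll.getIndexes()
      .filter(function (idx) { return idx.name !== "_id_"; })  // never drop the _id index
      .slice(0, 32)                                             // roughly half of the 64
      .forEach(function (idx) { coll.dropIndex(idx.name); });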

Comment by Spencer Brody (Inactive) [ 01/Feb/18 ]

I suspect this is a dupe of SERVER-27122

Comment by Kelsey Schubert [ 01/Feb/18 ]

Hi sallgeud,

Would you please provide the complete logs, starting from when the initial sync began, from the node that encountered this issue, as well as the output of db.getSiblingDB("XDemo2").D_22.getIndexes()?

Thank you,
Kelsey

Comment by Chad Kreimendahl [ 01/Feb/18 ]

Apologies. This should be a "bug", not an "improvement". It won't let me change it now.
