[SERVER-83833] shard crash causes WiredTiger to fail replaying the oplog in an infinite loop Created: 03/Dec/23  Updated: 01/Jan/24

Status: Investigating
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matthew Zimmerman Assignee: Backlog - Triage Team
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Server Triage
Operating System: ALL
Participants:

 Description   

Inserting a record with thousands of values in an indexed field caused the system to hang.

The database has approximately 60 billion records.  I inserted a record with ~100,000 values for one key and the mongod process essentially hung.  It spent 3 days processing the request, and during that time no shard in the replica set would accept new connections.  Reads still worked fine over connections that were already open.
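For context on why one document can be expensive: an index on an array field in MongoDB is a multikey index, which stores one index key per distinct array element. The sketch below only counts the index keys such an insert would generate. The function name and the assumption of exactly one array index plus the default `_id` index are illustrative, and the count says nothing about how much cache WiredTiger actually needs per key.

```python
def index_keys_for_insert(array_values, array_indexes=1):
    """Count index keys written when inserting one document.

    Assumes (hypothetically) one _id index plus `array_indexes` multikey
    indexes on the array field. A multikey index holds one key per
    distinct array element.
    """
    distinct = len(set(array_values))
    return 1 + array_indexes * distinct


# Illustrative data: 100,000 distinct 24-character IDs, similar in shape
# to the "sessionids" values in the log excerpt (not the real values).
values = [f"{i:024d}" for i in range(100_000)]
total_keys = index_keys_for_insert(values)
print(total_keys)  # 100001
```

All of these keys belong to the same insert, so they would be written within a single storage transaction rather than spread across many small ones.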

The application I wrote handles read/write failures gracefully, so I restarted the primary shard. Neither secondary shard was promoted (PSS architecture, 26 shards total), and the newly restarted PRIMARY couldn't connect to either secondary.

I then restarted all the shards in the replica set, and none of them will come back up.  Each one keeps replaying the oplog, trying to write this same transaction, erroring, and trying again.  They also try to connect outbound, but none of them is listening because they are all busy processing the oplog.
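The behavior described here looks like a retry loop with no effective limit: the log excerpt shows the "attempts" counter at 13463 for the same operation. The following is a minimal simulation of that pattern, not MongoDB's actual code. `WriteConflict`, `apply_with_retry`, and the two sample operations are hypothetical names used only for illustration.

```python
class WriteConflict(Exception):
    """Stand-in for the server's WriteConflictException."""


def apply_with_retry(apply_op, max_attempts=None):
    """Call apply_op until it succeeds; return (result, attempts).

    With max_attempts=None the loop is unbounded, which mirrors the
    behavior suggested by the log. The cap exists only for this demo.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return apply_op(), attempts
        except WriteConflict:
            # A real storage engine would roll back the transaction here.
            if max_attempts is not None and attempts >= max_attempts:
                raise


def make_transient_op(failures):
    """Build an op that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def op():
        if state["left"] > 0:
            state["left"] -= 1
            raise WriteConflict("transient conflict")
        return "applied"

    return op


def always_conflicts():
    """Op whose failure is deterministic, e.g. it never fits in cache."""
    raise WriteConflict("cache full")


print(apply_with_retry(make_transient_op(2)))  # ('applied', 3)
try:
    apply_with_retry(always_conflicts, max_attempts=5)
except WriteConflict as exc:
    print("gave up:", exc)  # gave up: cache full
```

Retrying works when the conflict is transient. When the cause does not change between attempts, an uncapped loop of this shape never exits.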

 

Ubuntu 22.04 - 5.15.0-86-generic #96-Ubuntu SMP

Mongo 6.0.12

 

..."attempts":13463}}...

 

{"t":{"$date":"2023-12-03T04:27:39.130+00:00"},"s":"D3", "c":"STORAGE",  "id":22413,   "ctx":"ReplWriterWorker-0","msg":"WT rollback_transaction","attr":{"snapshotId":1518010}}

{"t":{"$date":"2023-12-03T04:27:39.130+00:00"},"s":"D1", "c":"WRITE",    "id":4640401, "ctx":"ReplWriterWorker-0","msg":"Caught WriteConflictException","attr":{"operation":"applyOplogEntryOrGroupedInserts_CRUD","namespace":"database.collection","attempts":13463}}

{"t":{"$date":"2023-12-03T04:27:39.185+00:00"},"s":"D1", "c":"WTWRTLOG", "id":22430,   "ctx":"JournalFlusher","msg":"WiredTiger message","attr":{"message":{"ts_sec":1701577659,"ts_usec":185089,"thread":"4152853:0x7fdc39a99640","session_name":"WT_SESSION.log_flush","category":"WT_VERB_LOG","category_id":20,"verbose_level":"DEBUG","verbose_level_id":1,"msg":"log_flush: flags 0x4 LSN 28293/7000192"}}}

{"t":{"$date":"2023-12-03T04:27:39.185+00:00"},"s":"D4", "c":"STORAGE",  "id":22419,   "ctx":"JournalFlusher","msg":"flushed journal"}

{"t":{"$date":"2023-12-03T04:27:39.236+00:00"},"s":"D3", "c":"REPL",     "id":21254,   "ctx":"ReplWriterWorker-0","msg":"Applying op (or grouped inserts)","attr":{"op":{"lsid":{"id":{"$uuid":"64649147-795b-4d60-b933-75f9342cf688"},"uid":{"$binary":{"base64":"u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A=","subType":"0"}}},"txnNumber":1151,"op":"i","ns":"database.collection","ui":{"$uuid":"4e33280f-7578-49a7-8cf0-4eb149c9d6a2"},"o":{"_id":{"$oid":"6566a09432acf3ad46c309d5"}
,"linenum":35,"sessionids":["+++KIRT3UUNDXVYN7YCSCNJZ","+++UMMUA1LJAMWCME2GDTYD0","++-FPRX7E1RKPDKBEUIAK9AR","++-IKOOJT1XDCBXJALSRTCDX","++1RWUL5LRKT17YZGOPYYYOO","++1VB7TRZDLHIQDLEF1KAOO7","++37XZVPOOX60IY4LFQQDNZE","++3MQNL4LYTBNLBEPSTRQ8SK","++3ZDK4AUGOVTGKOKW2EKLS0","++4WQCQ3TP4ISAD930LVSO-8","++8NKSSBTUZCQWKQPWXMD-GN","++9HZQFHPOCKEKNGKEC--ASM","++9YB9GIEKTTHPYTBCVI+6SY","++AOULKEEIN37QO98A276S+I","++AQJA3IN3THJBJJ6IRMEV-P","++BEJTRTWPOR-H-6361MTSYG","++BEK0DPY0KJ4LXV1A9SBMNG","++BXGBXMFTIEGTVSX8M9SDZM","++BY-PBDWMJAJ16Y7S75VHEU","++C5DWFXPQPR1CTMFVRWPUY9","++CNGNVNNKEB+1OF0Z5LMRWM","++CS8PDWEGROS4B75E-IBGWG","++D0SZNJJ18BI213TPAG5VC3","++DN8PF8JENFEORYBVAZQXEV","++DPO8TMSJRAXQGWBINV1RGY","++E0ZMXRGP3CXMNUOCSJ8ZBV","++E2PJRT08CLLINTZL38KTH4","++EAOELV8LADXNNMIAAU6ZL2","++EBP4XO2ZGCISAPRKG1QKUO","++EGPO0VVQJHTCNIZEL8C8G7","++EI88+SHXUTJK4MSC18QKTE","++EWKA+HFIRLBSHBTXOETUC1","++EXZDPX08TGDJA+BTK9OKXP","++FXT-QXSYHICYUC+BULLXIV","++G64XD4BSY7KHJZIMPHU444","++GDS5VBIZCKQNDWALSDXVCT","++GTXVO5KSG-W+SXH5IHZTLF","++HF91OASMTDLNH2DCELLZBR","++HGSLNWUNA+SDSJH9HUX7EN","++I6NLNZHZNQL0OTSPIXDCP8","++ITOMUQIVWLYMM22ERISKNR","++IXC7LZ5E6JM+TGPRZXXYC1","++IY3LN4VHEWTLWSMJCT0YRA","++IYUNJUVVUXRFUNEJMTY5CZ","++J9JYB5ZORHBGYOQUNWWBIR","++JGOJ3DXZTXQSH7+TYRA6LE","++JH1737JT5GTYPZULVRWQVZ","++JIZ6TXRCKMW8GOD6GCIQM+","++JRX2QSUYQMT-SWMRGZYJRS","++KNX5O5VYPDFVJQB49VUW95","++KSYFOKXELJRJL2VMHO9G27","++L25IY4BR9FH9W0I0G-4ABV","++LI4KECOOTV5JM9HHRVMUTI","++LSSYTOXNR-BHBXVL90AXPC","++LX9K5V0L5ZWKLXL7J8BK6J","++M5DGSW2+AOEXD-H6XZGAZ3","++MJ63-8HJCMNDQUKIWOE3IO","++O+GA3R6MMIOQDGEL08JERI","++ODDSV3IC20TKQIR3WSJASE","++OU1Z3OVRRPGE5POKZMJN+H","++OZLPHL2MV45ZERWSBXT76Q","++PDP3-1DGSCGAPETITMLSEY","++POF8VYYFIDCGM9NULFDIIY","++Q1MO3L9MBHDCNN40CZDZGE","++Q6VYHQNWW3CIQ+B5ELYBGF","++QDAOR4RZLMI2GUGIQ8TQDL","++QQ8JYRAJP25DALW+HPTYAE","++QVTCINSTMZY1KQXJDM1GPJ","++R6YUDKAK5WHXD3D7XZKT6J","++RI9AMLJHVLKFEVIFYIROMY","++RUYJ8OCYVTFGIX8V2ZT4JA","++SYHLP2DH8NPCAOOQ10WRCB","++TGOTE-UGEOAIGYJ8HQBCSC","
++THIK9VJYWDDTPQDRSI302B","++TSAF6T8YAI2JR-DIDHR7VD","++U6HWDH3E5WV7J3LAK0MWLB","++UK52G4VQR9IAYA5ILMJQGP","++URNZ3TITOEMAGWANHT+RHN","++UVNYHMV+HS+9FPQZ4LIQ06","++VGOM71BRJTPB1TM0GGQGOE","++VJTQBV4NNHDB8CJKUALX1V","++W-864G0W0KRSV7GTUZWIQ3","++WACSHB3NRHBVYPSVVI3XTX","++WAQ852AAH-DHOB5BBR48QX","++WIDTRBGRV9ZNHB6G0PFQ5L","++WM05POAA0GLAI41IGZXNRJ","++WNZERWKUIHR9JQADR0ANIN","++WOCC8VGLKAWYQW9WXM2TTN","++X-SUWR0M5FOJDRC--JNSUL","++XPRKKQW4SVU4CC+EVVPXHP","++XSDL6AS1YN+FO+URXGN88G","++XURA-PITSNPFWYWK7KAAPI","++YG0WP6KAVHEFLFXV34L7-S","++YNRYYWV+U8ZA2NBBUP5F1Q","++YP+W-OCVM4POQNCO03O3PV","++YTLQBI0KJ-PPBEVMXV7BPV","++ZDDIUJFLDZCUCSDXG-GRL9","++ZNMSBNQ+FAR-NA4ABN7SAL","++ZO0MVVQZULPJWT70TOW4N5","++ZTWHOCG+SXSGSHG5FWBJ6G","++ZZC8CXSDMMGCLUAZI+ACSI","+-+AWAFVNG51VRT0LVUTHCDW","+-+WODNRZJJSWYJSXQ2WVILZ","+---C-OIVPQCDZTZ2HEAQVOP","+--ZPFHVPEAMMDAPN7T3YUT4","+-2+D2EZRG2ZN5A+2+ERSXAY","+-2-O-ZDB0XM3RY-NTDGQQIL","+-2TMEMNXQ8UHZEJRFAQRZTY","+-4PTRHL9ZOC-ZL2Y3+VMOL7","+-50STQP+0DECDNBEKSKHB-E","+-5CUYOTSH3AIG-IMJRL-L8G","+-5J3DBY3Y-N-63PTFVIGODW","+-5VDAN766MQIVHV4E2FBJHR","+-7IXR3HTXJNDAQW-E3MVHKQ","+-7LMR5TZ8VFYOCFI3FIF5IG","+-8KVICJWARD2M5HJFSGY3ML","+-9KUIB-JGMXYTVGM38COIHG","+-9T1TWRRTNW0W-K14R7Y0J7","+-9U0BJZHVLCW7CHH9O-K-QC","+-ACUUGGLTEITH3P+A+R19I2","+-B-XPRGAVX9TMPWTDTVRZ5V","+-BBEXSUOL9JEQUTRH9GZI6P","+-BGF3JLNOQS3EZOMA2OV0CX","+-BKHNJYXDVOXRJ98I2ER8KN","+-BW-OES0BUOIDRHJCAPG0ER","+-CMVZJGNA5DOJYGCGH-JZ1+","+-CQHTSKRXNL-OSI81J6Y2I8","+-CWQNEW7JLQNY-A+6WU-P+I","+-DJ-I1-GLTZYOOSEZL-LXGP","+-DPNUWWTBV+EMBMXZWYL03S","+-DQWANHPA04GO0CYOMHLPGY","+-DWMWYZ6UUHYC8ILGYYYFJT","+-E5F-7U9QH8-JEWONQDDYQP","+-EJC-A73HK4TRK+WFBLP4WS","+-EPGZCXFDYERPI6HBSWPMP0","+-F2+AAX9LF6GFTTDLDALKMD","+-FBZAW0MRRJXOUXGNTCZAPV","+-FCP7F6MH2RADZAYVM3DHVG","+-FSXRO44UVIKCZ6R50Z7OCM","+-HMVAFCLAALREWNBWC5-OKP","+-HODPM7AZ3R6OE2ZS8HJSFL","+-HQKVU7VTSIVSYWMXJBB6LX","+-HRVYLQVJOZELZ4KWV6LR8B","+-IIAZ+MYAU+RQSWX5HK22JM","+-JDJE37IE72BUD3NSSMTF2G","+-JGF3DJR1+AQKAGD04D+MYM","+-JL7GHDIML-ZNYX4CISEVSZ","+-
JPR0KJBB-2QTIVI-TGWH-R","+-JXRJ3PJMKQCG8GOW-NQVFL","+-K28AS7CF6SX3PQ6QIOQ7ZI","+-KJYZVXEVW1SBAI4V1RELGW","+-L2P98FMONRPWVEVUF23BZV","+-MJM1I0ZJFKT6C-N-F2Z6QU","+-MRWJJ4E+DKETFYF9-1NN+2","+-MT4DDIWQRGLD2T-VVRLQE+","+-MTA+PWK3EF0XK-AQB5A29Z","+-N+SO5ST4F5SOWXPEQBAJX6","+-NSZMW5V71TOUGP+74P5JKJ","+-OH0MSJEJXKWYOMX9RG7UAV","+-OUMIL5D6IYEL1S6KHIR+9H","+-P11OJZJA3RZ+P2FIRYDWC2","+-P7FD-C9PHXEIJCEASDDOHZ","+-PH-YJGUFD8PAGY2RBXZWOQ","+-PI4LAME4HKV15LI+CZY2DY","+-PNRASMXI3QPBAF+FSZUQFH","+-Q7GUECU7JUVRIZO67L1VRD","+-QCUYF+6M8TTMJOARGGCPBE","+-QHOYKH27EQUA+3OJSQSYPV","+-RADDP6NXGB3YSOVO80ZKYY","+-RAGF5KSEMF9IQE1ZMFYDZ9","+-RGTKBZQV1AXEKQSOK9KT3G","+-RKTJCHEA1DZF3S81MM6BH+","+-RUJ+LC5CE8+YPPXIOR17JH","+-RY0L8S3YL8AROA+JKC0COW","+-SBMTKIBEYZUPIOX6D89UCC","+-SECVDFOSRTE2S5REITZ1WZ","+-SLLFJIGFYN52MPS-OASVRI","+-SNDMVYUWH2-GAGF86520TG","+-SQRHJONZHBZUDQYBCV8BQA","+-T2Y4OA1O7Y2ZECGKZTT7OL","+-U2TO6VLPTELEE1RTXVN5CR","+-UE+WEHVBVHQGGP9WBYWTHL","+-UEG6ASOJDR94ZI8RNIWZP7","+-UGHTK-8EBKCYJ5C95U7MSD","+-UIBM24NX8PZCDMJILFNHGN","+-UXUG2F04YJYVGLMJFOE3-A","+-UYYYMYWFL6RBWFEFLXVPQ2","+-VL2VMTO0ZCNEE-XQMKY+RT","+-WA5ZM0QKQGSZKVTWDK6E7A","+-WD3NEW5IJAMHRY0VAHY5UW","+-WNTAM5XSWWGTJSYO0QGC49","+-XC7V9ZCCUIFSVVCQRUX3V9","+-YGVJIIQZ+XTVQPGLYBUSMD","+-YK-DHG5KAV-Q89DVTNYY78","+-YME3PNZ3QTBXQP0JK2ZUPE","+-YXIJKZBNJ6GOY4SSBBNDSK","+-Z4KWO3OBBVHUCYHEN9XP8Q","+-ZPCCUKPFHFDGAYGNIHHBM-","+-ZTELTHVROW4LZSKF-KW8NQ","+-ZTUC3HS6KBKRMOMY0AN8XQ","+-ZZBPMCXWHQBT2LA6V39K-F","+0+CK-EEUUXY1DM3XMJUC9Q-","+0-DKMCV1LE4XU7KPQZPCJCF","+0-LLV1ULPUD0PXNYPJ6T5YH","+00JE2MRU3QRWOMAU1VHVLXH","+01EEBAETPCIBRKGUPIAM25H","+01N8GCPCTZIKTKR649BW1O7","+01QV9E2AIJFJRC5TPYMO0-D","+02HPZQPP07BN5SKSZONIFQP","+02OBZB9SC3DBD1SCJTCVTFG","+036-8FHZIVLZ+INC6X6TYNI","+03YRMSZYZ2GWYNOD8SUQUEM","+05FARD7M3FMZRFUEH6QBY6L","+05T8MEPQMLSYRANULM2RERP","+06L52FSMIQTEPV9OI5UHQMC","+06YM-ALNECUCMG7BM0C4PLR","+07HLLKFRUTFD1AAZ6ID8X5+","+07TWY0YQZRU1GV2OWZ0BAWU","+08MHDMCBE6G+2R3FATH4A+J","+08V9YAFQNVDHES6FXCHBO+M","+08Y4WNJQ2UNLMXJX4C1GFJB","+09-
2UGNXYT5CHEUECIZOVBS","+0A2RXHECHEC1QVAAQFU2GO9","+0ALA4C-OODZXIXG-QKWEKSR","+0AR3D4NMAX6A4WX+2SB3O65","+0AU+N9U44V6I4EGKJE4PYBL","+0BDLICC8DZ0DJKDU6YZSTC8","+0BTQNHHUFKSV7POVQ0OGQAR","+0C++KWLLTG4R2AYXEBPHZM7","+0CADRGL7BJNAHJQN1WA2HWG","+0CLDLZQ6DZPEPUDCVNB7QXB","+0CLZ6NUO0FMDIHM7BVQUITF","+0CNK2OAVMNVD+-E9ATXGHMQ","+0D6MNSXYQ2O2KTBDXYZQX4A","+0DD06NEKUX1NCAVFMSBLB9I","+0DJXT3PQCQCGXHCHWRWEHUO","+0DQA822YTUJP5LM8D5LXEK8","+0ETFSGVKFFLOHSCDYKQBPVN","+0FB1KO8J4S05HA2LLBPBOR9","+0FD8U1PCV8RHDQMN8VDPQZK","+0FUDWDTX9TDS7WMDDSBARN7","+0G1B24WYOJLKJZ-S6AYPVSM","+0G4T8G4RUCHQATQPBZX7V0R","+0GBXEESZESGRTR5+SPZMINC","+0GLPTOTGYECFNQYTZUS9GTM","+0GQM-MNT+V6MUBVOCTALTKT","+0GYWW6ZWN0JZYEVVJ2N7F2A","+0HYDCXBFVPFXBWGIKYCBBAI","+0HZRQY2MK8W3CNRCGHC1SNG","+0IBNKJ92A4N-1ZUPWBQYSZU","+0ICKIPBZUG6NL2FKQWMRRV0","+0IXHCKNPWDPAH2GNAPU3V0V","+0IXUE8NLIYYCYXOQCXL3ZBC","+0J6JC7THCLMO6V49R5LI646","+0J96AIEWUJFEBHZWVJHALHI","+0JMOAIIYBVGRPUJEMCQTAJL","+0JMXYDBSUOALA49GEDC5AIG","+0JVNNMB2MGXC0FDSZ5NUEU8","+0K-V408F78JT07-D2ZN3HMZ","+0KB-2RX6G6R20KWBJRCQPN-","+0KFXODTZKXMLBI-NKBMBPEN","+0LKKBZF+VX6HQ4SSGEEU26J","+0LN0H6NCMUXM9I3AREAOOEJ","+0LNGP0W4DZXXJUF2YDL3WTY","+0MDHHMGNIFTP-9OB3W3O140","+0MMR8X52M5A09+B22QNHHUF","+0MWJCASDO8UTDT-QUJN9IA2","+0MYKL7LXJWBZLANKGN1DJQV","+0MZKK2O714HLW8LVWG9FBVD","+0NNP5IPSBABWSHBC342KNBP","+0NYNIMSBMCIRWDXS7+O3SOE","+0O3M-TCSNX972G9OMPVUIM1","+0OJ+RJXW2IYLFB3T0XFFCD5","+0OJCL-B5MDUG+FH2ZB3IETF","+0OLAQSTKA7ACVKYFW94PDUA","+0PTYS-CWAVQXNOWDWXRM+CR","+0QGJ8U3297WBHTWESP2FWTM","+0R+0DROSOXD0QE5LZMUHUZT","+0S6L0E1QBKU4UTRVALAQBNI","+0S7NRLHLCQSWQXSCHZMGNA7","+0SCP8MJH8SISI237E2VA3EK","+0SDFH1EGEAEYEQMQHFDBGAL","+0SZOP5DS7XMGUREJEIUHR-V","+0T2ZWUOWAMIL0JIUOOM807S","+0TFOY9ZUXJNYBVBS2P1IMAT","+0TKUTMOAZD8AGCQSVAWF30A","+0UIY95RXIKOGSSFV2GBB0KU","+0ULQPDUQFFXXJQE9U4PW+KV","+0UOUTSX+UGLTM9FVHX2LRTF","+0UTW2-WFMN1MPF2717I3V92","+0UYTEH-N-DX-HIKMY78QJJ8","+0VYT-QGUNWUT854EYXNRM2D","+0W81L55OSXXBNJFNLIVDGYH","+0WNEZPR4BZWRDG5NNSIRAQX","+0XO25UA4P+CLLWYKBBNKAH6","+0XPAV
-CD7DA-KEGJZYR0EIF","+0YD7WQVCXSWQPWF39FZLSNE","+0YZDHFUEEP6RAF+MPVSNXBY","+0Z-EPTUEFPQB8NWAKRP478B","+0ZER+ZQLAQJYIWZ5BDUAHY4","+0ZJLQGFI+SOQW3ZYW7RJYGN","+0ZPNUYSJFGEXZD9NWG4JSZW","+0ZSMRKHGTXLO4SFL8CPA6JG","+1++TDWVTNNQXN4DLS9QII+P","+1+F5VMU2JOARIUNTR-+FMYW","+1+KB+I8IZHEVJHJOXVSVVUW","+1-KPPJU7IQALSWUGF7QZZ4+","+1-L27J0B1P-YYOKQMFEIRCX","+10Q0WMELFLDSJ93EY-4JYUL","+12CKX-LJ3EWL1TOAX4EP41F","+13S+SD8NX2QTIAYFRKOLRQH","+13SPNQGWD0YHKBDCDW8UJCB","+14PRSMKDHHO-T3DEZG5BTIF","+15QW2PYAX25EJJYKFMEBDHU","+179MDCTAVRFHW84NMDBQYWS","+17GUGBXIBMDX7HGTWTI7UUA","+17ICNPG+OKZBNPXGKF7UOW1","+17IEWUYUK7FGENDF+D8ZEZ8","+192E6DHEXD7PRUKKLFQXXJA","+19WOMK5U8L0WZ-GTKZAOC8Y","+1A54IYNNBPZ-R7QOBXAJSFM","+1AB2ROEM2T2I+DDTTSJ08HR","+1ANP-8+BOEIMCC-S+LYHNTC","+1BFVXQ42BSLNK5UTD8JS6WR","+1CEWTNV1MTD-LKP7CKQ2UN8","+1CF7P3AKIUE3BZN3WMLA8NV","+1CGIHBTPKUBPR45PCOR7SC2","+1CSBZ0OITD2VE9QF2HDHTK2","+1CTBVJ8UIN6G41DJ71OLDLT","+1CX+C6ENG4IK9UWMTK1T2X+","+1DGKCR7E0QZFIOUQYQ9QZBS","+1DZQLTAN7VMNQ1P5LZVRR6N","+1E2EADOXCJJS3USXWMWOFJW","+1F0MFRNQMOOM5GOWQ6S9BZF","+1FKMRKYAHWWJEQWZWXEIPUS","+1FTZBMEQIH0TMVAICRHVHUR","+1G9SAIZGA4TTU--HXZVPXI-","+1GHLQPAFNW8G7BHZ5NAOGTJ","+1GLHM-VIUW0G4LPFMLD0C3G","+1GN5GMFBRK0ELI9BS83F3-9","+1GW8VPHEU9ZAAWHRPYZKKIC","+1GYE+DLD3NGJKW-JAIJCLHY","+1IP9F5EV1Y4S10ZDSI0OKAR","+1IQE+MIFOBNQ6FOCG1P2AX4","+1IVUQQMBI7ECWAMNYJMFVL1","+1IW8XVTJILT7GONOJRAUPHR","+1JSGFN7ATAB49KZG6W-D7BT","+1JVEB-MMKHXHBDFIIB528SN","+1KYSDIX1AKPYW1NCUQYXP3O","+1LE701ZU1KKNV36WPQ6EACF","+1LGPSLO+NDDW1KG3JKBUUUD","+1LKVG2OQL9ZK1QHCEKB9SZX","+1LM-5BJVE3AHH3BURVQEXKB","+1LYOWX+NNVKOY1ETOUMY5S-","+1MHATVU-SIPQ9FCYVPCCVDP","+1MX+YXDQDWZZG07Q32LUOPB","+1NFVHVFXTW3BBJACQOEODK1","+1NJL99FL+XAZDXUOVDROA37"]}},"oplogApplicationMode":"StableRecovering"},"truncated":{"op":{"o":{"sessionids":{"358":

{"type":"string","size":29}}}}},"size":{"op":14493733}}

{"t":{"$date":"2023-12-03T04:27:39.236+00:00"},"s":"D3", "c":"STORAGE",  "id":22414,   "ctx":"ReplWriterWorker-0","msg":"WT begin_transaction","attr":{"snapshotId":1518124,"readSource":"kNoTimestamp"}}

 



 Comments   
Comment by Eric Sedor [ 13/Dec/23 ]

Hi mzimmerman@gmail.com, I'm glad to hear you found a path forward that didn't involve forking. Based on your report and the solution you described, my strong suspicion is that this is related to current transaction size and cache size requirements.

It may or may not be a bug that we couldn't get to a better error during a state of high cache thrash.

I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.

If you still have this information available, would you please archive (tar or zip) the following for each node in the replica set, covering a time period that includes the incident, and upload it to that link:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)
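As one possible way to build the requested archive, the sketch below bundles a mongod log file and the `diagnostic.data` directory into a single gzip-compressed tar file using Python's standard `tarfile` module. The function name and every path in the usage comment are hypothetical placeholders; substitute the real log location and `$dbpath` for each node.

```python
import tarfile
from pathlib import Path


def archive_diagnostics(log_path, dbpath, out_path):
    """Write the mongod log and $dbpath/diagnostic.data into a .tar.gz.

    All three arguments are caller-supplied paths; nothing here is
    discovered from a running mongod.
    """
    log_path = Path(log_path)
    diag_dir = Path(dbpath) / "diagnostic.data"
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(str(log_path), arcname=log_path.name)
        # Directories are added recursively by default.
        tar.add(str(diag_dir), arcname="diagnostic.data")
    return out_path


# Example call with placeholder paths (adjust per node):
# archive_diagnostics("/var/log/mongodb/mongod.log",
#                     "/var/lib/mongodb",
#                     "node1-diagnostics.tar.gz")
```

Running it once per node, with a distinct output name each time, keeps the uploads easy to tell apart.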

Gratefully,
Eric

Comment by Matthew Zimmerman [ 04/Dec/23 ]

Cache full was a hint that I then followed – the size of our cache was configured to 16GB (our workload is very write-heavy, so having all indexes in RAM wasn't a thing I was trying to do, though I know that's best practice).

 

With unmodified code, increasing the cache size allowed the transaction to be written fully from the oplog, and the system didn't hang.  Once this particular transaction was inserted and the shard/replica set was clean in a PSS architecture, I was able to stop and restart each shard with the configured 16GB of memory, and they've been running fine for ~8 hours without my modified code.
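To watch how close a node is to its configured cache limit, the `serverStatus` command reports WiredTiger cache statistics under `wiredTiger.cache`. The helper below is a small sketch that turns three of those fields into ratios. The function name is hypothetical, and the sample numbers are made up for illustration, not taken from this incident.

```python
def cache_pressure(server_status):
    """Return cache fill and dirty ratios from a serverStatus document."""
    cache = server_status["wiredTiger"]["cache"]
    max_bytes = cache["maximum bytes configured"]
    used = cache["bytes currently in the cache"]
    dirty = cache["tracked dirty bytes in the cache"]
    return {
        "fill_ratio": used / max_bytes,
        "dirty_ratio": dirty / max_bytes,
    }


GIB = 1024 ** 3
# Illustrative values only: a 16 GiB cache, 12 GiB in use, 4 GiB dirty.
sample_status = {
    "wiredTiger": {
        "cache": {
            "maximum bytes configured": 16 * GIB,
            "bytes currently in the cache": 12 * GIB,
            "tracked dirty bytes in the cache": 4 * GIB,
        }
    }
}
print(cache_pressure(sample_status))
# {'fill_ratio': 0.75, 'dirty_ratio': 0.25}
```

With pymongo, the input document could come from `MongoClient(...).admin.command("serverStatus")`, assuming the node is accepting connections at the time.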

 

Will I hit this limit again somehow in the future?  What is it about inserting a large transaction that would cause this issue?  Any one record can only be 16MB in size... so how could writing just one larger record hang a system with 16GB of memory for the cache configured?

Comment by Matthew Zimmerman [ 04/Dec/23 ]

After commenting out some of the error handling, in an attempt to tell mongo that the write succeeded and to just move on with recovering everything else in the oplog (I don't care about the one transaction it's stuck on), I got another error message further along:

 

{"t":{"$date":"2023-12-04T03:38:49.357+00:00"},"s":"D1", "c":"WTTXN",    "id":22430,   "ctx":"ReplWriterWorker-1","msg":"WiredTiger message","attr":{"message":{"ts_sec":1701661129,"ts_usec":357612,"thread":"390727:0x7fb4944ca640","session_dhandle_name":"file:index-76--2567016947708944443.wt","session_name":"WT_CURSOR.insert","category":"WT_VERB_TRANSACTION","category_id":39,"verbose_level":"DEBUG","verbose_level_id":1,"msg":"Rollback reason: Cache full"}}}
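Because mongod writes structured (JSON) log lines, messages like the one above can be pulled out of a large log programmatically. The sketch below parses one line and extracts the WiredTiger rollback reason. The function name is hypothetical; the sample line is the entry from this ticket with the line breaks and stray backslash removed, and the string-valued `message` branch is a defensive assumption rather than something shown in the ticket.

```python
import json

LINE = (
    '{"t":{"$date":"2023-12-04T03:38:49.357+00:00"},"s":"D1", "c":"WTTXN",'
    ' "id":22430, "ctx":"ReplWriterWorker-1","msg":"WiredTiger message",'
    '"attr":{"message":{"ts_sec":1701661129,"ts_usec":357612,'
    '"thread":"390727:0x7fb4944ca640",'
    '"session_dhandle_name":"file:index-76--2567016947708944443.wt",'
    '"session_name":"WT_CURSOR.insert","category":"WT_VERB_TRANSACTION",'
    '"category_id":39,"verbose_level":"DEBUG","verbose_level_id":1,'
    '"msg":"Rollback reason: Cache full"}}}'
)

PREFIX = "Rollback reason: "


def rollback_reason(line):
    """Return the WiredTiger rollback reason in a log line, or None."""
    entry = json.loads(line)
    message = entry.get("attr", {}).get("message")
    if isinstance(message, str):
        # Assumption: some builds may log the message as a JSON string.
        try:
            message = json.loads(message)
        except json.JSONDecodeError:
            return None
    if not isinstance(message, dict):
        return None
    text = message.get("msg", "")
    return text[len(PREFIX):] if text.startswith(PREFIX) else None


print(rollback_reason(LINE))  # Cache full
```

Applying this to each line of a log file and counting the results would show how often the rollback was caused by a full cache rather than by a genuine conflict.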

 

I can see where this is called/triggered in the code, but I'm not sure how to "skip" this to continue along with this troubleshooting idea.

Generated at Thu Feb 08 06:53:17 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.