[SERVER-44055] All secondary crashed in SessionUpdateTracker and cannot recovery Created: 17/Oct/19  Updated: 04/Mar/20  Resolved: 04/Mar/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.0.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Zhang Youdong Assignee: Dmitry Agranat
Resolution: Duplicate Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-46466 Race with findAndModify retryable wri... Closed
Related
is related to SERVER-35323 sessionId matching ignores userId par... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Still don't how to reproduce

Participants:
Case:

 Description   

One secondary crashed

2019-10-16T18:51:49.855+0800 F -        [rsSync-0] Fatal Assertion 50843 at src/mongo/db/repl/session_update_tracker.cpp 69
    2019-10-16T18:51:49.855+0800 F -        [rsSync-0]
 
    ***aborting after fassert() failure

Then another secondary crashed after change sync source to primary 

2019-10-16T18:52:21.381+0800 I REPL     [rsBackgroundSync] Changed sync source from empty to XX.XX.XX.XX:3097(PrimaryNode IP)
2019-10-16T18:52:21.381+0800 I ASIO     [RS] Connecting to 11.203.27.243:3097
2019-10-16T18:52:21.780+0800 F -        [rsSync-0] Fatal Assertion 50843 at src/mongo/db/repl/session_update_tracker.cpp 69
2019-10-16T18:52:21.780+0800 F -        [rsSync-0]

The secondary cannot be restarted,and generate very wiered errors,but the _tmp database is created long before, and no one delete the directory.  Then no primary can be elected in the replica-set with three members,cause the service unavailable for a long time.

2019-10-16T18:54:51.635+0800 I STORAGE  [initandlisten] WiredTiger message [1571223291:635501][41642:0x2b90d4829d00], txn-recover: Recovering log 752 through 753
2019-10-16T18:54:51.730+0800 I STORAGE  [initandlisten] WiredTiger message [1571223291:730545][41642:0x2b90d4829d00], file:local/collection-2--8933994092597418142.wt, txn-recover: Recovering log 753 through 753
2019-10-16T18:54:51.796+0800 I STORAGE  [initandlisten] WiredTiger message [1571223291:796255][41642:0x2b90d4829d00], file:local/collection-2--8933994092597418142.wt, txn-recover: Set global recovery timestamp: 5da6f62b00000008
2019-10-16T18:54:51.814+0800 I RECOVERY [initandlisten] WiredTiger recoveryTimestamp. Ts: Timestamp(1571223083, 8)
2019-10-16T18:54:51.814+0800 I STORAGE  [initandlisten] Triggering the first stable checkpoint. Initial Data: Timestamp(1571223083, 8) PrevStable: Timestamp(0, 0) CurrStable: Timestamp(1571223083, 8)
2019-10-16T18:54:52.044+0800 E STORAGE  [initandlisten] WiredTiger error (2) [1571223292:44873][41642:0x2b90d4829d00], file:_tmp/collection-6--6434994499321235876.wt, WT_SESSION.open_cursor: __posix_open_file, 715: /home/mongo/mongo3103/data/_tmp/collection-6--6434994499321235876.wt: handle-open: open: No such file or directory Raw: [1571223292:44873][41642:0x2b90d4829d00], file:_tmp/collection-6--6434994499321235876.wt, WT_SESSION.open_cursor: __posix_open_file, 715: /home/mongo/mongo3103/data/_tmp/collection-6--6434994499321235876.wt: handle-open: open: No such file or directory
2019-10-16T18:54:52.044+0800 E STORAGE  [initandlisten] Failed to get the cursor for uri: table:_tmp/collection-6--6434994499321235876
2019-10-16T18:54:52.044+0800 E STORAGE  [initandlisten] This may be due to missing data files. Please read the documentation for starting MongoDB with --repair here: http://dochub.mongodb.org/core/repair
2019-10-16T18:54:52.044+0800 F -        [initandlisten] Fatal Assertion 50883 at src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp 538
2019-10-16T18:54:52.044+0800 F -        [initandlisten]

The oplog cause the error in primary maybe as follows

Timestamp(1571223107, 5603), "t" : NumberLong(2), "h" : NumberLong("4179556560629599034"), "v" : 2, "op" : "n", "ns" : "db.coll", "ui" : BinData(4,"uKDv6Dw1RBKxq+ewcCqntg=="), "fromMigrate" : true, "o2" : { "lsid" : { "id" : BinData(4,"b9QQ1SZVT2mh4vBoD4u/LA=="), "uid" : BinData(0,"Y5mrDaxi8gv8RmdTsQ+1j7fmkr7JUsabhNmXAheU0fg=") }, "txnNumber" : NumberLong(2), "op" : "i", "ns" : "db.coll", "ui" : BinData(4,"uKDv6Dw1RBKxq+ewcCqntg=="), "o" : { "id" : "xx", "platformUId" : "yy", "players" : { "xx" : { "id" : NumberLong(123456), "sId" : NumberLong(1), "loginAt" : ISODate("2019-xx-xxT10:51:47.522Z") } }, "regTime" : ISODate("2019-10-16T10:51:47.522Z"), "id" : NumberLong(3096698), "platform" : "hortor", "platformId" : NumberLong(778902834), "openId" : "xx", "shareCode" : "yy", "channel" : "zz" }, "ts" : Timestamp(1571223107, 4431), "t" : NumberLong(2), "h" : NumberLong("2385553688009329962"), "v" : NumberLong(2), "wall" : ISODate("2019-10-16T10:51:47.520Z"), "stmtId" : 0, "prevOpTime" : { "ts" : Timestamp(0, 0), "t" : NumberLong(-1) } }, "wall" : ISODate("2019-10-16T10:51:47.520Z"), "lsid" : { "id" : BinData(4,"b9QQ1SZVT2mh4vBoD4u/LA=="), "uid" : BinData(0,"Y5mrDaxi8gv8RmdTsQ+1j7fmkr7JUsabhNmXAheU0fg=") }, "txnNumber" : NumberLong(2), "stmtId" : 0, "prevOpTime" : { "ts" : Timestamp(0, 0), "t" : NumberLong(-1) }, "o" : { "$sessionMigrateInfo" : 1 } }

 



 Comments   
Comment by Dmitry Agranat [ 04/Mar/20 ]

After further investigation this issue was identified as SERVER-46466 and is fixed in 4.0.17

Comment by Dmitry Agranat [ 24/Nov/19 ]

Hi haidilip83@gmail.com, I suggest that you open a new ticket for this issue in order to keep track of the different deployments. In this new ticket, please provide a detailed explanation about your workload, configuration and deployment. Once the ticket is opened, we'll provide you with a secure location to upload all the needed information for us to investigate.

Comment by Dilip Kolasani [ 21/Nov/19 ]

we did upgrade to 4.0.13 and did initial sync as well. But still all secondaries are crashing with same error.

Comment by Dilip Kolasani [ 18/Nov/19 ]

we also hit  this bug in our sharded cluster running with 4.0.5. For now we have done initial Sync and started all effected shards. let us know if upgrading to 4.0.13 resolves this issue.So we can also upgrade.

Comment by Dmitry Agranat [ 10/Nov/19 ]

Hi zyd_com@126.com, did you also perform initial sync on this member as suggested in this comment?

Comment by Zhang Youdong [ 08/Nov/19 ]

2019-11-08T12:16:24.084+0800 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/home/mongo/mongo3040/data/diagnostic.data'
2019-11-08T12:16:24.087+0800 I REPL     [initandlisten] Rollback ID is 1
2019-11-08T12:16:24.087+0800 I REPL     [initandlisten] Recovering from stable timestamp: Timestamp(1573113489, 783) (top of oplog: { ts: Timestamp(1573120487, 2586), t: 8 }, appliedThrough: { ts: Timestamp(0, 0), t: -1 }, TruncateAfter: Timestamp(0, 0))
2019-11-08T12:16:24.087+0800 I REPL     [initandlisten] Starting recovery oplog application at the stable timestamp: Timestamp(1573113489, 783)
2019-11-08T12:16:24.087+0800 I REPL     [initandlisten] Replaying stored operations from { : Timestamp(1573113489, 783) } (exclusive) to { : Timestamp(1573120487, 2586) } (inclusive).
2019-11-08T12:16:24.166+0800 F REPL     [initandlisten] Entry for session da467c20-76bd-4aba-92af-fe232d7adfa7 has txnNumber 46 < 54
2019-11-08T12:16:24.166+0800 F REPL     [initandlisten] New oplog entry: { ts: Timestamp(1573113489, 796), t: 1, h: 2786601466404211624, v: 2, op: "n", ns: "product_jd_pdd.pdd_ddk_search_unique", ui: UUID("8aa3b169-0244-4789-952b-3584dc347a79"), fromMigrate: true, o2: { lsid: { id: UUID("da467c20-76bd-4aba-92af-fe232d7adfa7"), uid: BinData(0, 6399AB0DAC62F20BFC466753B10FB58FB7E692BEC952C69B84D997021794D1F8) }, txnNumber: 46, op: "i", ns: "product_jd_pdd.pdd_ddk_search_unique", ui: UUID("8aa3b169-0244-4789-952b-3584dc347a79"), o: { _id: ObjectId('5dc3ccf867e70a6ba27efe78'), category_name: "å<86><85>è¡£", clt_cpn_end_time: null, clt_cpn_min_amt: null, coupon_remain_quantity: 0, clt_cpn_remain_quantity: null, promotion_rate: 50, service_tags: [ 13 ], mall_id: 655715812, mall_name: "æ<89><80>为ä¼<8a>ä»<81>æ<9c><8d>饰åº<97>", mall_coupon_end_time: 0, clt_cpn_batch_sn: null, lgst_txt: "ä½<8e>", goods_name: "è¿<90>å<8a>¨å<86><85>衣女纯æ£<89>æ<96><87>è<83>¸æ<97> é<92>¢å<9c><88>é<98>²ä¸<8b>å<9e><82>è<81><9a>æ<8b>¢è£¹è<83>¸æ<8a>¹è<83>¸å<86><85>è¡£å<86><85>裤å¥<97>è£<85>女学ç<94><9f>", clt_cpn_discount: null, goods_id: 43106307010, goods_gallery_urls: null, goods_desc: null, opt_name: "å<86><85>è¡£", opt_ids: [ 19073, 1282, 1292, 12, 1293, 6413, 20493, 6414, 19086, 6415, 6416, 6417, 6418, 6419, 12691, 6420, 8599, 11927, 8600, 8601, 12698, 20516, 11941, 12581, 11942, 12588, 9527, 9528, 15288, 15290, 15291, 9532, 9533, 9534, 9535, 9536, 9537, 9538, 9539, 9540, 15429, 15430, 1351, 15431, 10696, 15432, 15433, 15434, 15435, 15436, 10703, 12753, 15319, 347, 225, 20577, 19042, 20578, 12643, 15332, 19044, 1765, 19045, 10733, 19183, 19184, 19185, 19186, 19187, 19188, 19189, 19190, 8569, 8570, 8571 ], goods_image_url: "https://t00img.yangkeduo.com/openapi/images/2019-07-25/61c9cd92f91ca1ee98673fa5a0763f7d.jpg", has_mall_coupon: false, min_group_price: 1515, coupon_start_time: 0, coupon_discount: 0, coupon_end_time: 0, zs_duo_id: 0, mall_coupon_remain_quantity: 0, plan_type: 1, clt_cpn_quantity: null, crt_rf_ordr_rto1m: 1.0, cps_sign: null, cat_ids: [ 8583, 8608, 8633 ], coupon_min_order_amount: 0, category_id: 19042, mall_coupon_discount_pct: 0, activity_type: null, cat_id: null, coupon_total_quantity: 0, mall_coupon_min_order_amount: 0, merchant_type: 6, clt_cpn_start_time: null, sales_tip: "0", only_scene_auth: false, mall_coupon_id: 0, desc_txt: "ä½<8e>", goods_thumbnail_url: "https://t00img.yangkeduo.com/goods/images/2019-09-09/3b519e5d4ff598a8b13dbf74f0ad45c3.jpeg", opt_id: 19042, min_normal_price: 1615, has_coupon: false, mall_coupon_start_time: 0, serv_txt: "ä½<8e>", mall_rate: 50, mall_coupon_total_quantity: 0, create_at: null, mall_coupon_max_discount_amount: 0, mall_cps: 1, update_time: 1573113077, fingerprint: "93a52dc723d44bb5e9deaee6c7867897", status: 1, last_modify_time: 1573113077, create_time: 1573113077, cat_id_ori: 8633, cat_id_sp: 8633 }, ts: Timestamp(1573113080, 249), t: 1, h: -6233198818409246151, v: 2, wall: new Date(1573113080179), stmtId: 0, prevOpTime: { ts: Timestamp(0, 0), t: -1 } }, wall: new Date(1573113080179), lsid: { id: UUID("da467c20-76bd-4aba-92af-fe232d7adfa7"), uid: BinData(0, 6399AB0DAC62F20BFC466753B10FB58FB7E692BEC952C69B84D997021794D1F8) }, txnNumber: 46, stmtId: 0, prevOpTime: { ts: Timestamp(0, 0), t: -1 }, o: { $sessionMigrateInfo: 1 } }
2019-11-08T12:16:24.166+0800 F REPL     [initandlisten] Existing oplog entry: { ts: Timestamp(1573113489, 795), t: 1, h: 7511343871006108749, v: 2, op: "n", ns: "product_jd_pdd.pdd_ddk_search_unique", ui: UUID("8aa3b169-0244-4789-952b-3584dc347a79"), fromMigrate: true, o2: { lsid: { id: UUID("da467c20-76bd-4aba-92af-fe232d7adfa7"), uid: BinData(0, BB89D3175FB0981C86826C27759082A37160471F60504B4612415186C630AB70) }, txnNumber: 54, op: "u", ns: "product_jd_pdd.pdd_ddk_search_unique", ui: UUID("8aa3b169-0244-4789-952b-3584dc347a79"), o: { _id: ObjectId('5dc3ccf867e70a6ba27efe88'), category_name: "å<86><85>è¡£", clt_cpn_end_time: 1575129599, clt_cpn_min_amt: 1000, coupon_remain_quantity: 0, clt_cpn_remain_quantity: 2222, promotion_rate: 100, service_tags: [ 13 ], mall_id: 491033336, mall_name: "è<88><92>æ<82>¦å·´å<9b>¾", mall_coupon_end_time: 1575129599, clt_cpn_batch_sn: "A0201VC-499618100816085752", lgst_txt: "ä½<8e>", goods_name: "å¸<83>迪设计 æ<80>§æ<84><9f>è<96><84>款å<8d><8a>æ<9d>¯ç»<91>带é<95><82>空è<95>¾ä¸<9d>女士å<86><85>è¡£å¥<97>è£<85> æ<96><87>è<83>¸+å<86><85>裤", clt_cpn_discount: 1000, goods_id: 54817683491, goods_gallery_urls: null, goods_desc: "æ<96><87>è<83>¸å<8a><9f>è<83>½ï¼<9a>è<81><9a>æ<8b>¢ï¼<9b>å<86><85>裤尺ç <81>ï¼<9a>S M L XLï¼<9b>å<9b>¾æ¡<88>ï¼<9a>纯è<89>²ï¼<9b>æ<96><87>è<83>¸é£<8e>æ ¼ï¼<9a>æ<80>§æ<84><9f>ï¼<9b>è<85>°å<9e><8b>ï¼<9a>ä½<8e>è<85>°ï¼<9b>é¢<9c>è<89>²å<88><86>ç±»ï¼<9a>é»<91>è<89>² 红è<89>² 绿è<89>² è<82>¤è...", opt_name: "å<86><85>è¡£", opt_ids: [ 19073, 1282, 1292, 12, 1293, 6413, 20493, 6414, 19086, 6415, 6416, 6417, 6418, 6419, 12691, 6420, 8599, 11927, 8600, 8601, 12698, 20516, 11941, 12581, 11942, 12588, 9527, 9528, 15288, 15290, 15291, 9532, 9533, 9534, 9535, 9536, 9537, 9538, 9539, 9540, 15429, 15430, 1351, 15431, 10696, 15432, 15433, 15434, 15435, 15436, 10703, 12753, 15319, 347, 225, 20577, 19042, 20578, 12643, 15332, 19044, 1765, 19045, 10733, 19183, 19184, 19185, 19186, 19187, 19188, 19189, 19190, 8569, 8570, 8571 ], goods_image_url: "https://t00img.yangkeduo.com/goods/images/2019-10-17/48de0aad3b61efa8a25ee892bf0fe405.jpeg", has_mall_coupon: true, min_group_price: 33420, coupon_start_time: 0, coupon_discount: 0, coupon_end_time: 0, zs_duo_id: 0, mall_coupon_remain_quantity: 2222, plan_type: 1, clt_cpn_quantity: 2222, crt_rf_ordr_rto1m: 0.0, cps_sign: null, cat_ids: [ 8583, 8608, 8633 ], coupon_min_order_amount: 0, category_id: 19042, mall_coupon_discount_pct: 90, activity_type: null, cat_id: null, coupon_total_quantity: 0, mall_coupon_min_order_amount: 0, merchant_type: 6, clt_cpn_start_time: 1569859200, sales_tip: "0", only_scene_auth: true, mall_coupon_id: 2499033952, desc_txt: "ä½<8e>", goods_thumbnail_url: "https://t00img.yangkeduo.com/goods/images/2019-10-17/cfc98a1266e2fff27b19b5b7471cda6e.jpeg", opt_id: 19042, min_normal_price: 33520, has_coupon: false, mall_coupon_start_time: 1569513600, serv_txt: "ä½<8e>", mall_rate: 100, mall_coupon_total_quantity: 2222, create_at: null, mall_coupon_max_discount_amount: 2100, mall_cps: 1, update_time: 1573113078, fingerprint: "c57e6fefd6b3234845e9eb00252860ec", status: 1, last_modify_time: 1573113078 }, o2: { goods_id: 54817683491, _id: ObjectId('5dc3ccf867e70a6ba27efe88') }, ts: Timestamp(1573113080, 645), t: 1, h: -3331256655401805348, v: 2, wall: new Date(1573113080589), stmtId: 0, prevOpTime: { ts: Timestamp(0, 0), t: -1 }, preImageOpTime: { ts: Timestamp(1573113080, 644), t: 1 } }, wall: new Date(1573113080589), lsid: { id: UUID("da467c20-76bd-4aba-92af-fe232d7adfa7"), uid: BinData(0, BB89D3175FB0981C86826C27759082A37160471F60504B4612415186C630AB70) }, txnNumber: 54, stmtId: 0, prevOpTime: { ts: Timestamp(0, 0), t: -1 }, preImageOpTime: { ts: Timestamp(1573113489, 794), t: 1 }, o: { $sessionMigrateInfo: 1 } }
2019-11-08T12:16:24.166+0800 F -        [initandlisten] Fatal Assertion 50843 at src/mongo/db/repl/session_update_tracker.cpp 95
2019-11-08T12:16:24.166+0800 F -        [initandlisten]***aborting after fassert() failure

New version have the same problem 

Comment by Dmitry Agranat [ 03/Nov/19 ]

Hi zyd_com@126.com,

SERVER-35323 does not only effects read operations. I suggest upgrading to the latest (as of today MongoDB 4.0.13) and if you still encounter this issue, we'd be happy to take a look.

Thanks,
Dima

Comment by Zhang Youdong [ 18/Oct/19 ]

Hi, @Dmitry Agranat

Thank you for your reply.

I already disable the balancer, but I think SERVER-35323 may be not related to this issue, because it only affect read operations. 

 

Comment by Dmitry Agranat [ 17/Oct/19 ]

Hi zyd_com@126.com,

Thank you for the report.

This issue appears to be related to SERVER-35323 which was fixed in 4.0.4. The workaround to prevent this issue from reoccurring before 4.0.4 is to disable the balancer. In addition, initial sync needs to be performed on effected members which will allow to bypass the encountered error. Please make sure to re-enable the balancer after upgrading to 4.0.4 (or later MongoDB version).

Thanks,
Dima

Generated at Thu Feb 08 05:04:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.