[SERVER-50775] resumable index build rollback tests hang in RollbackTest.kSyncSourceOpsBeforeRollback Created: 06/Sep/20  Updated: 29/Oct/23  Resolved: 10/Sep/20

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 4.8.0

Type: Bug Priority: Major - P3
Reporter: Benety Goh Assignee: Gregory Noma
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-48419 Extend rollback to recover resumable ... Closed
is related to SERVER-49847 Turn on resumable index build feature... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Execution Team 2020-09-21
Participants:

 Description   

The rollback tests for resumable index builds sometimes hang during the RollbackTest.transitionToSyncSourceOperationsBeforeRollback phase. The logs indicate that the replSetStepDown is failing to acquire the RSTL to complete the stepdown:

[js_test:rollback_resumable_index_build_drain_writes_phase] 2020-09-06T11:16:14.276+0000 d21280| 2020-09-06T11:16:14.276+00:00 I  COMMAND  51803   [conn103] "Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"MongoDB Shell","command":{"replSetStepDown":1,"force":true,"lsid":{"id":{"$uuid":"b9fb9ad3-9c2d-4a97-90ae-7861250da89c"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1599390943,"i":4}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$readPreference":{"mode":"secondaryPreferred"},"$db":"admin"},"numYields":0,"ok":0,"errMsg":"Unable to acquire X lock on '{4611686018427387905: ReplicationStateTransition, 1}' within 1000ms. opId: 2497, op: conn103, connId: 103.","errName":"LockTimeout","errCode":24,"reslen":347,"locks":{"ReplicationStateTransition":{"acquireCount":{"W":1},"acquireWaitCount":{"W":1},"timeAcquiringMicros":{"W":1001442}}},"protocol":"op_msg","durationMillis":1002}

The resumable index build rollback test fixture is supposed to detect an interrupted index build and disable the fail point that is blocking the stepdown in this parallel shell script.

[js_test:rollback_resumable_index_build_drain_writes_phase] 2020-09-06T11:15:44.360+0000 d21280| 2020-09-06T11:15:44.359+00:00 W  CONTROL  23829   [conn152] "Set failpoint","attr":{"failPointName":"hangIndexBuildDuringBulkLoadPhase","failPoint":{"mode":0,"data":{},"timesEntered":2}}

However, the lock manager diagnostic output shows the index build still waiting in the (disabled) fail point:

[js_test:rollback_resumable_index_build_drain_writes_phase] 2020-09-06T11:28:15.455+0000 d21280| 2020-09-06T11:28:15.454+00:00 I  -        20521   [initandlisten] "lock manager dump","attr":{"addr":"0x7f157b3805d0","locks":[{"lockAddr":"0x7f1587ac7f00","resourceId":"{6917529027641081857: Global, 1}","granted":[{"lockRequest":"0x877","lockRequestAddr":"0x7f1588764380","thread":"139729275131648","mode":"IX","convertMode":"NONE","enqueueAtFront":false,"compatibleFirst":false,"debugInfo":"index build: e7e6b33d-5d1e-475b-93ff-556d37705231","clientInfo":{"desc":"IndexBuildsCoordinatorMongod-0","opid":2167}}],"pending":[]},{"lockAddr":"0x7f1587ac7e80","resourceId":"{4611686018427387905: ReplicationStateTransition, 1}","granted":[{"lockRequest":"0x877","lockRequestAddr":"0x7f1588764380","thread":"139729275131648","mode":"IX","convertMode":"NONE","enqueueAtFront":false,"compatibleFirst":false,"debugInfo":"index build: e7e6b33d-5d1e-475b-93ff-556d37705231","clientInfo":{"desc":"IndexBuildsCoordinatorMongod-0","opid":2167}}],"pending":[]},{"lockAddr":"0x7f15889d0100","resourceId":"{10123292395995783581: Database, 899920359141007773, test}","granted":[{"lockRequest":"0x877","lockRequestAddr":"0x7f1588764380","thread":"139729275131648","mode":"IX","convertMode":"NONE","enqueueAtFront":false,"compatibleFirst":false,"debugInfo":"index build: e7e6b33d-5d1e-475b-93ff-556d37705231","clientInfo":{"desc":"IndexBuildsCoordinatorMongod-0","opid":2167}}],"pending":[]},{"lockAddr":"0x7f15889d0400","resourceId":"{11723802635811401853: Collection, 194587589742932093, test.rollback_resumable_index_build_drain_writes_phase}","granted":[{"lockRequest":"0x877","lockRequestAddr":"0x7f1588764380","thread":"139729275131648","mode":"IX","convertMode":"NONE","enqueueAtFront":false,"compatibleFirst":false,"debugInfo":"index build: e7e6b33d-5d1e-475b-93ff-556d37705231","clientInfo":{"desc":"IndexBuildsCoordinatorMongod-0","opid":2167}}],"pending":[]}]}



 Comments   
Comment by Githook User [ 11/Sep/20 ]

Author:

{'name': 'Gregory Noma', 'email': 'gregory.noma@gmail.com', 'username': 'gregorynoma'}

Message: SERVER-50775 Use failpoint for yielding index build locks in RollbackResumableIndexBuildTest
Branch: svilen-optimizer-poc
https://github.com/mongodb/mongo/commit/4ff797fa438d9d1eb49ae9335174fb1140fc86eb

Comment by Githook User [ 10/Sep/20 ]

Author:

{'name': 'Gregory Noma', 'email': 'gregory.noma@gmail.com', 'username': 'gregorynoma'}

Message: SERVER-50775 Use failpoint for yielding index build locks in RollbackResumableIndexBuildTest
Branch: master
https://github.com/mongodb/mongo/commit/4ff797fa438d9d1eb49ae9335174fb1140fc86eb

Generated at Thu Feb 08 05:23:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.