[SERVER-48610] ReplicaSetAwareService callbacks don't occur in ReplicationCoordinatorImpl tests Created: 05/Jun/20  Updated: 29/Oct/23  Resolved: 15/Jun/20

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: None
Fix Version/s: 4.7.0

Type: Improvement Priority: Major - P3
Reporter: Kevin Pulo Assignee: Kevin Pulo
Resolution: Fixed Votes: 0
Labels: PM-1645-Milestone-3
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-47914 Move clusterTime from LogicalClock to... Closed
related to SERVER-48406 Add onBecomeArbiter to ReplicaSetAwar... Closed
Backwards Compatibility: Fully Compatible
Sprint: Sharding 2020-06-15
Participants:

 Description   

The ReplicaSetAwareService allows interested classes to receive callbacks on certain events related to the replica set state of the node. These callbacks originate in ReplicationCoordinatorImpl, which call into ReplicationCoordatorExternalStateImpl, which then calls into the relevant function on ReplicaSetAwareServiceRegistry (which then calls the registered ReplicaSetAwareService (RSAS) classes).

SERVER-48406 added onBecomeArbiter() to ReplicaSetAwareService, and SERVER-47914 finished plumbing through the disabling of the LogicalClock/VectorClock on arbiters. In doing so, failures were caused in the arbiter-related db_repl_coordinator_test unittests:

[main] "Running","attr":{"test":"TermAndLastCommittedOpTimeUpdatedFromHeartbeatWhenArbiter","rep":1,"reps":1}
...
[main] "Responding with {doc} (elapsed: {millis})","attr":{"doc":{"$replData":{"term":3,"lastOpCommitted":{"ts":{"$timestamp":{"t":10,"i":1}},"t":3},"lastCommittedWall":{"$date":"1970-01-01T00:01:40.000Z"},"lastOpVisible":{"ts":{"$timestamp":{"t":10,"i":1}},"t":3},"configVersion":2,"configTerm":-1,"replicaSetId":{"$oid":"000000000000000000000000"},"syncSourceIndex":1,"isPrimary":true},"ok":1,"state":2,"v":2,"configTerm":-1,"set":"mySet"},"millisMillis":0}
[replexec] "Invariant failure","attr":{"expr":"!VectorClock::get(getServiceContext())->isEnabled() || _externalState->getGlobalTimestamp(getServiceContext()) >= opTime.getTimestamp()","file":"src/mongo/db/repl/replication_coordinator_impl.cpp","line":1378}
[replexec] "\n\n***aborting after invariant() failure\n\n"
[replexec] "{message}","attr":{"message":"Got signal: 6 (Aborted).\n"}
[replexec] "BACKTRACE","attr":{"bt":{"backtrace":[{"a":"7FE20830E7BA","b":"7FE205382000","o":"2F8C7BA","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.592","s+":"1EA"},{"a":"7FE20830FC69","b":"7FE205382000","o":"2F8DC69","s":"_ZN5mongo15printStackTraceEv","s+":"29"},{"a":"7FE20830C3E6","b":"7FE205382000","o":"2F8A3E6","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP7siginfoPv","s+":"66"},{"a":"7FE203C1F7E0","b":"7FE203C10000","o":"F7E0","s":"_L_unlock_16","s+":"2D"},{"a":"7FE2038AE4F5","b":"7FE20387C000","o":"324F5","s":"gsignal","s+":"35"},{"a":"7FE2038AFCD5","b":"7FE20387C000","o":"33CD5","s":"abort","s+":"175"},{"a":"7FE20629EDC9","b":"7FE205382000","o":"F1CDC9","s":"_ZN5mongo15invariantFailedEPKcS1_j","s+":"12C"},{"a":"7FE20691275D","b":"7FE205382000","o":"159075D","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl34_setMyLastAppliedOpTimeAndWallTimeENS_8WithLockERKNS0_17OpTimeAndWallTimeEbNS0_22ReplicationCoordinator15DataConsistencyE","s+":"ED"},{"a":"7FE2069133F3","b":"7FE205382000","o":"15913F3","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl19_advanceCommitPointENS_8WithLockERKNS0_17OpTimeAndWallTimeEb","s+":"A3"},{"a":"7FE20693F8C0","b":"7FE205382000","o":"15BD8C0","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl24_handleHeartbeatResponseERKNS_8executor12TaskExecutor25RemoteCommandCallbackArgsEi","s+":"1CE0"},{"a":"7FE207E4E46C","b":"7FE205382000","o":"2ACC46C","s":"_ZZN5mongo8executor12TaskExecutor21scheduleRemoteCommandERKNS0_24RemoteCommandRequestImplINS_11HostAndPortEEERKSt8functionIFvRKNS1_25RemoteCommandCallbackArgsEEERKSt10shared_ptrINS_5BatonEEENKUlRKNS1_30RemoteCommandOnAnyCallbackArgsEE_clESM_","s+":"4C"},{"a":"7FE207CF3376","b":"7FE205382000","o":"2971376","s":"_ZN5mongo8executor12_GLOBAL__N_121remoteCommandFinishedERKNS0_12TaskExecutor12CallbackArgsERKSt8functionIFvRKNS2_30RemoteCommandOnAnyCallbackArgsEEERKNS0_24RemoteCommandRequestImplISt6vectorINS_11HostAndPortESaISG_EEEERKNS0_26RemoteCommandOnAnyResponseE","s+":"56"},{"a":"7FE207CF7583","b":"7FE205382000","o":"2975583","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor11runCallbackESt10shared_ptrINS1_13CallbackStateEE","s+":"113"},{"a":"7FE207CF7962","b":"7FE205382000","o":"2975962","s":"_ZZN5mongo15unique_functionIFvNS_6StatusEEE8makeImplIZNS_8executor22ThreadPoolTaskExecutor23scheduleIntoPool_inlockEPNSt7__cxx114listISt10shared_ptrINS6_13CallbackStateEESaISB_EEERKSt14_List_iteratorISB_ESI_St11unique_lockINS_12latch_detail5LatchEEEUlT_E1_EEDaOSN_EN12SpecificImpl4callEOS1_","s+":"A2"},{"a":"7FE2065ED2AB","b":"7FE205382000","o":"126B2AB","s":"_ZN5mongo8executor14ThreadPoolMock15_consumeOneTaskERSt11unique_lockINS_12latch_detail5LatchEE","s+":"12B"},{"a":"7FE2065ED526","b":"7FE205382000","o":"126B526","s":"_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IZNS3_8executor14ThreadPoolMock7startupEvEUlvE0_JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv","s+":"146"},{"a":"7FE2084D003F","b":"7FE205382000","o":"314E03F","s":"execute_native_thread_routine","s+":"F"},{"a":"7FE203C17AA1","b":"7FE203C10000","o":"7AA1","s":"start_thread","s+":"D1"},{"a":"7FE203964C4D","b":"7FE20387C000","o":"E8C4D","s":"clone","s+":"6D"}],"processInfo":{"mongodbVersion":"unknown","gitVersion":"none","compiledModules":["unknown"],"uname":{"sysname":"Linux","release":"2.6.32-220.el6.x86_64","version":"#1 SMP Wed Nov 9 08:03:13 EST 2011","machine":"x86_64"},"somap":[{"b":"7FE205382000","elfType":3,"buildId":"9F0FECD3257C33AAF626952C3078E347E0AEEDEA"}]}}}
[replexec] "Frame","attr":{"frame":{"a":"7FE20830E7BA","b":"7FE205382000","o":"2F8C7BA","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.592","s+":"1EA"}}
[replexec] "Frame","attr":{"frame":{"a":"7FE20830FC69","b":"7FE205382000","o":"2F8DC69","s":"_ZN5mongo15printStackTraceEv","s+":"29"}}
[replexec] "Frame","attr":{"frame":{"a":"7FE20830C3E6","b":"7FE205382000","o":"2F8A3E6","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP7siginfoPv","s+":"66"}}
[replexec] "Frame","attr":{"frame":{"a":"7FE203C1F7E0","b":"7FE203C10000","o":"F7E0","s":"_L_unlock_16","s+":"2D"}}
[replexec] "Frame","attr":{"frame":{"a":"7FE2038AE4F5","b":"7FE20387C000","o":"324F5","s":"gsignal","s+":"35"}}
[replexec] "Frame","attr":{"frame":{"a":"7FE2038AFCD5","b":"7FE20387C000","o":"33CD5","s":"abort","s+":"175"}}
[replexec] "Frame","attr":{"frame":{"a":"7FE20629EDC9","b":"7FE205382000","o":"F1CDC9","s":"_ZN5mongo15invariantFailedEPKcS1_j","s+":"12C"}}
[replexec] "Frame","attr":{"frame":{"a":"7FE20691275D","b":"7FE205382000","o":"159075D","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl34_setMyLastAppliedOpTimeAndWallTimeENS_8WithLockERKNS0_17OpTimeAndWallTimeEbNS0_22ReplicationCoordinator15DataConsistencyE","s+":"ED"}}
[replexec] "Frame","attr":{"frame":{"a":"7FE2069133F3","b":"7FE205382000","o":"15913F3","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl19_advanceCommitPointENS_8WithLockERKNS0_17OpTimeAndWallTimeEb","s+":"A3"}}
[replexec] "Frame","attr":{"frame":{"a":"7FE20693F8C0","b":"7FE205382000","o":"15BD8C0","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl24_handleHeartbeatResponseERKNS_8executor12TaskExecutor25RemoteCommandCallbackArgsEi","s+":"1CE0"}}
[replexec] "Frame","attr":{"frame":{"a":"7FE207E4E46C","b":"7FE205382000","o":"2ACC46C","s":"_ZZN5mongo8executor12TaskExecutor21scheduleRemoteCommandERKNS0_24RemoteCommandRequestImplINS_11HostAndPortEEERKSt8functionIFvRKNS1_25RemoteCommandCallbackArgsEEERKSt10shared_ptrINS_5BatonEEENKUlRKNS1_30RemoteCommandOnAnyCallbackArgsEE_clESM_","s+":"4C"}}
[replexec] "Frame","attr":{"frame":{"a":"7FE207CF3376","b":"7FE205382000","o":"2971376","s":"_ZN5mongo8executor12_GLOBAL__N_121remoteCommandFinishedERKNS0_12TaskExecutor12CallbackArgsERKSt8functionIFvRKNS2_30RemoteCommandOnAnyCallbackArgsEEERKNS0_24RemoteCommandRequestImplISt6vectorINS_11HostAndPortESaISG_EEEERKNS0_26RemoteCommandOnAnyResponseE","s+":"56"}}
[replexec] "Frame","attr":{"frame":{"a":"7FE207CF7583","b":"7FE205382000","o":"2975583","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor11runCallbackESt10shared_ptrINS1_13CallbackStateEE","s+":"113"}}
[replexec] "Frame","attr":{"frame":{"a":"7FE207CF7962","b":"7FE205382000","o":"2975962","s":"_ZZN5mongo15unique_functionIFvNS_6StatusEEE8makeImplIZNS_8executor22ThreadPoolTaskExecutor23scheduleIntoPool_inlockEPNSt7__cxx114listISt10shared_ptrINS6_13CallbackStateEESaISB_EEERKSt14_List_iteratorISB_ESI_St11unique_lockINS_12latch_detail5LatchEEEUlT_E1_EEDaOSN_EN12SpecificImpl4callEOS1_","s+":"A2"}}
[replexec] "Frame","attr":{"frame":{"a":"7FE2065ED2AB","b":"7FE205382000","o":"126B2AB","s":"_ZN5mongo8executor14ThreadPoolMock15_consumeOneTaskERSt11unique_lockINS_12latch_detail5LatchEE","s+":"12B"}}
[replexec] "Frame","attr":{"frame":{"a":"7FE2065ED526","b":"7FE205382000","o":"126B526","s":"_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IZNS3_8executor14ThreadPoolMock7startupEvEUlvE0_JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv","s+":"146"}}
[replexec] "Frame","attr":{"frame":{"a":"7FE2084D003F","b":"7FE205382000","o":"314E03F","s":"execute_native_thread_routine","s+":"F"}}
[replexec] "Frame","attr":{"frame":{"a":"7FE203C17AA1","b":"7FE203C10000","o":"7AA1","s":"start_thread","s+":"D1"}}
[replexec] "Frame","attr":{"frame":{"a":"7FE203964C4D","b":"7FE20387C000","o":"E8C4D","s":"clone","s+":"6D"}}

This was determined to be because those unittests use ReplicationCoordinatorExternalStateMock, which lacks the necessary calls to ReplicaSetAwareServiceRegistry. This meant that the LogicalClock/VectorClock was not being disabled as it should have been, and the test then later failed on this dassert.

As a temporary measure, SERVER-47914 added the call to ReplicaSetAwareServiceRegistry::onBecomeArbiter() inside ReplicationCoordinatorExternalStateMock::onBecomeArbiterHook(), which fixed this problem. However, there is a more general problem — all ReplicaSetAwareService callbacks should still occur even when the ReplicationCoordinator's external state has been mocked out.

It turns out that, for the externalState functions that call the RSAS callbacks, each is called from only one place in ReplicationCoordinatorImpl. So there are two potential ways to properly fix this:

  1. Make the necessary hook functions in ReplicationCoordinatorExternalStateMock call into ReplicaSetAwareServiceRegistry, the same as ReplicationCoordatorExternalStateImpl does.
  2. Remove the calls to ReplicaSetAwareServiceRegistry from ReplicationCoordatorExternalStateImpl/Mock, and put them directly in ReplicationCoordinatorImpl.

My preference is #2, because there doesn't seem to be any point in having this abstraction go via externalState, if correctness requires it to be the same in every externalState implementation. william.schultz, do you agree? Or is there some reason to keep it in externalState, or other possibility I've missed?



 Comments   
Comment by Githook User [ 15/Jun/20 ]

Author:

{'name': 'Kevin Pulo', 'email': 'kevin.pulo@mongodb.com', 'username': 'devkev'}

Message: SERVER-48610 move ReplicaSetAwareService callbacks into the ReplicationCoordinator (instead of its external state)
Branch: master
https://github.com/mongodb/mongo/commit/b2f840c9efca29820fb48a89741fa18ace8a33b1

Comment by William Schultz (Inactive) [ 08/Jun/20 ]

My preference is #2, because there doesn't seem to be any point in having this abstraction go via externalState, if correctness requires it to be the same in every externalState implementation.

kevin.pulo I think this makes sense. I think the concept behind putting things in ReplicationCoordinatorExternalState is to keep an interface barrier between replication code and large parts of the rest of the codebase i.e. storage, sharding, etc. It seems, however, that ReplicaSetAwareInterface is already providing this kind of barrier, so I'd agree that it doesn't seem necessary have an extra layer of indirection via external state. As long as ReplicaSetAwareInterface remains generic, without knowledge of sharding system details, I think making calls to it directly from within ReplicationCoordinator, without going through the external state, is a reasonable design.

Generated at Thu Feb 08 05:17:36 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.