[SERVER-33287] Create passthrough that kills the primary node Created: 13/Feb/18  Updated: 29/Oct/23  Resolved: 12/Apr/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.6.5, 3.7.4

Type: Task Priority: Major - P3
Reporter: Judah Schvimer Assignee: Samyukta Lanka
Resolution: Fixed Votes: 0
Labels: rollback-non-functional
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
depends on SERVER-32767 Add retries to ReplSetTest._callIsMas... Closed
depends on SERVER-33879 config.transactions is not updated du... Closed
Related
related to SERVER-34155 Add clean shutdowns to kill_secondari... Closed
related to SERVER-34241 Remove the skipValidationNamespaces f... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v3.6
Sprint: TIG 2018-04-09, TIG 2018-04-23
Participants:

 Description   

We also may want a passthrough that kills both primary and secondary nodes.



 Comments   
Comment by Githook User [ 06/May/18 ]

Author:

{'email': 'samy.lanka@gmail.com', 'name': 'Samy Lanka', 'username': 'lankas'}

Message: SERVER-33287 Create passthrough that kills the primary node

(cherry picked from commit c05611bac298e6b904030e5e2e6efb79c4192a00)
Branch: v3.6
https://github.com/mongodb/mongo/commit/9593a3b7e69201d0d53a30c36d6439e7ba3a2f97

Comment by Githook User [ 06/May/18 ]

Author:

{'email': 'samy.lanka@gmail.com', 'name': 'Samy Lanka', 'username': 'lankas'}

Message: SERVER-33287 tag jstests that use commands which return inaccurate results after unclean shutdown

(cherry picked from commit ea5b5a97ed247e26d9de87089fe8dd81cda14a9e)
Branch: v3.6
https://github.com/mongodb/mongo/commit/772c3cfd76ac3720484823fcf3bc461dae928c06

Comment by Githook User [ 12/Apr/18 ]

Author:

{'email': 'samy.lanka@gmail.com', 'name': 'Samy Lanka', 'username': 'lankas'}

Message: SERVER-33287 Create passthrough that kills the primary node
Branch: master
https://github.com/mongodb/mongo/commit/c05611bac298e6b904030e5e2e6efb79c4192a00

Comment by Samyukta Lanka [ 10/Apr/18 ]

There is a known failure that can happen when running this passthrough which will be fixed after SERVER-32767.

The failure:

[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:06.663+0000 assert failed : checkOplogs, non-matching oplog entries for the following nodes:
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:06.670+0000 localhost:20250: {  "ts" : Timestamp(1523310782, 1),  "t" : NumberLong(19),  "h" : NumberLong("-1287453526515208947"),  "v" : 2,  "op" : "c",  "ns" : "test.$cmd",  "ui" : UUID("1090b5b8-17e1-486e-92d4-da3c23278d29"),  "wall" : ISODate("2018-04-09T21:53:02.291Z"),  "o" : {  "create" : "basic5",  "idIndex" : {  "v" : 2,  "key" : {  "_id" : 1 },  "name" : "_id_",  "ns" : "test.basic5" } } }
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:06.679+0000 localhost:20251: {  "ts" : Timestamp(1523310785, 11),  "t" : NumberLong(20),  "h" : NumberLong("3072770154973711623"),  "v" : 2,  "op" : "c",  "ns" : "test.$cmd",  "ui" : UUID("ac796c39-3fd2-480c-8a51-be5dc075038b"),  "o2" : {  "collectionOptions_old" : {  "uuid" : UUID("ac796c39-3fd2-480c-8a51-be5dc075038b"),  "flags" : 1 } },  "wall" : ISODate("2018-04-09T21:53:05.520Z"),  "o" : {  "collMod" : "ne1",  "usePowerOf2Sizes" : true } }
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:06.696+0000 doassert@src/mongo/shell/assert.js:18:14
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:06.704+0000 assert@src/mongo/shell/assert.js:146:9
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:06.705+0000 checkOplogs@src/mongo/shell/replsettest.js:1794:25
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:06.705+0000 ReplSetTest/this.checkReplicaSet@src/mongo/shell/replsettest.js:1378:13
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:06.705+0000 ReplSetTest/this.checkOplogs@src/mongo/shell/replsettest.js:1700:9
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:06.706+0000 @jstests/hooks/run_check_repl_oplogs.js:17:5
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:06.706+0000 @jstests/hooks/run_check_repl_oplogs.js:5:2

It's caused by:

[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:05.613+0000 ReplSetTest Could not call ismaster on node connection to localhost:20250: Error: error doing query: failed: network error while attempting to run command 'ismaster' on host 'localhost:20250'
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:05.618+0000 ReplSetTest awaitReplication: starting: optime for primary, localhost:20252, is { "ts" : Timestamp(1523310785, 11), "t" : NumberLong(20) }
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:05.618+0000 ReplSetTest awaitReplication: checking secondaries against latest primary optime { "ts" : Timestamp(1523310785, 11), "t" : NumberLong(20) }
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:05.623+0000 ReplSetTest awaitReplication: checking secondary #1: localhost:20251
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:05.625+0000 ReplSetTest awaitReplication: optime for secondary #1, localhost:20251, is { "ts" : Timestamp(1523310785, 1), "t" : NumberLong(20) } but latest is { "ts" : Timestamp(1523310785, 11), "t" : NumberLong(20) }
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:05.625+0000 ReplSetTest awaitReplication: secondary #1, localhost:20251, is NOT synced
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:05.826+0000 ReplSetTest awaitReplication: checking secondaries against latest primary optime { "ts" : Timestamp(1523310785, 11), "t" : NumberLong(20) }
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:05.831+0000 ReplSetTest awaitReplication: checking secondary #1: localhost:20251
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:05.833+0000 ReplSetTest awaitReplication: optime for secondary #1, localhost:20251, is { "ts" : Timestamp(1523310785, 7), "t" : NumberLong(20) } but latest is { "ts" : Timestamp(1523310785, 11), "t" : NumberLong(20) }
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:05.833+0000 ReplSetTest awaitReplication: secondary #1, localhost:20251, is NOT synced
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:06.041+0000 ReplSetTest awaitReplication: checking secondaries against latest primary optime { "ts" : Timestamp(1523310785, 11), "t" : NumberLong(20) }
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:06.049+0000 ReplSetTest awaitReplication: checking secondary #1: localhost:20251
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:06.050+0000 ReplSetTest awaitReplication: secondary #1, localhost:20251, is synced
[CheckReplOplogs:job1:basic5:CheckReplOplogs] 2018-04-09T21:53:06.055+0000 ReplSetTest awaitReplication: finished: all 1 secondaries synced at optime { "ts" : Timestamp(1523310785, 11), "t" : NumberLong(20) }

The primary (node 0) was killed and restarted, and then went into rollback. At the same time, the test finished and called CheckReplOplogs. During awaitReplication, ReplSetTest called ismaster on node 0 but got a network error. Because of that error, node 0 was not part of the liveNodes list, so awaitReplication did not wait for it to replicate and returned before node 0 was synced. CheckReplOplogs then failed because node 0 and the new primary had non-matching oplogs.
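
The race described above can be illustrated with a small, self-contained sketch. This is not the ReplSetTest implementation: findLiveNodes, makeFlakyIsMaster, and the maxAttempts parameter are hypothetical names made up for illustration, and the ismaster call is simulated rather than sent to a server. The sketch only shows why a single failed ismaster call drops a restarting node from the live list, and how retrying the call could keep it there.

```javascript
// Hypothetical sketch; none of these names come from replsettest.js.
// callIsMaster(node) stands in for running ismaster and throws on a network error.
function findLiveNodes(nodes, callIsMaster, maxAttempts) {
    const liveNodes = [];
    const unreachable = [];
    for (const node of nodes) {
        let reached = false;
        for (let attempt = 1; attempt <= maxAttempts && !reached; attempt++) {
            try {
                callIsMaster(node);
                reached = true;
            } catch (e) {
                // A node that was just killed and restarted may briefly refuse
                // connections, so fall through and try again if attempts remain.
            }
        }
        (reached ? liveNodes : unreachable).push(node);
    }
    return {liveNodes, unreachable};
}

// Builds a simulated ismaster that fails a fixed number of times per node.
// failuresByNode maps a host string to how many calls should throw first.
function makeFlakyIsMaster(failuresByNode) {
    const remaining = Object.assign({}, failuresByNode);
    return function(node) {
        if (remaining[node] > 0) {
            remaining[node]--;
            throw new Error("network error while attempting to run command 'ismaster'");
        }
        return {ok: 1};
    };
}

const nodes = ["localhost:20250", "localhost:20251", "localhost:20252"];

// One attempt: the restarting node is skipped, so nothing later waits for it.
const noRetry = findLiveNodes(nodes, makeFlakyIsMaster({"localhost:20250": 1}), 1);
console.log(noRetry.unreachable);

// Three attempts: the same node recovers in time and stays in the live list.
const withRetry = findLiveNodes(nodes, makeFlakyIsMaster({"localhost:20250": 1}), 3);
console.log(withRetry.liveNodes.length);
```

In the no-retry case the restarted node ends up in unreachable, which corresponds to awaitReplication returning before node 0 has caught up. A real fix would also need a time bound on the retries, which this sketch leaves out.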

Comment by Githook User [ 06/Apr/18 ]

Author:

{'email': 'samy.lanka@gmail.com', 'name': 'Samy Lanka', 'username': 'lankas'}

Message: SERVER-33287 tag jstests that use commands which return inaccurate results after unclean shutdown
Branch: master
https://github.com/mongodb/mongo/commit/ea5b5a97ed247e26d9de87089fe8dd81cda14a9e

Comment by Max Hirschhorn [ 28/Mar/18 ]

samy.lanka, after discussing with judah.schvimer and renctan, it doesn't sound like running the dbhash check with this test suite is expected to pass until SERVER-33879 is fixed. In the meantime, let's aim to get in a version of the replica_sets_kill_primary_jscore_passthrough.yml test suite that skips dbhash checking on the "config" database. After the changes from SERVER-34178, it should be possible to specify TestData.excludedDBsFromDBHash=["config"] and have that be meaningful again.

- class: CheckReplDBHash
  shell_options:
    global_vars:
      TestData:
        excludedDBsFromDBHash:
        # TODO SERVER-34178: Check the dbhash of the config database after replication recovery
        # runs for the config.transactions collection on startup.
        - config
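
As a rough illustration of what the excludedDBsFromDBHash setting above is meant to achieve, the following sketch filters a list of database names before any dbhash comparison would run. The dbsToCheck function and the sample database list are hypothetical and are not taken from the actual hook code. The TestData object here is a plain stand-in for the global of the same name that the YAML populates.

```javascript
// Hypothetical helper; the real dbhash hook may apply the exclusion differently.
// Returns the database names that should still have their dbhash compared.
function dbsToCheck(allDbNames, testData) {
    const excluded = new Set((testData && testData.excludedDBsFromDBHash) || []);
    return allDbNames.filter((name) => !excluded.has(name));
}

// Stand-in for the TestData global set through shell_options.global_vars.
const TestData = {excludedDBsFromDBHash: ["config"]};

// The "config" database is dropped from the list; the others are kept in order.
console.log(dbsToCheck(["admin", "config", "test"], TestData));

// With no exclusion list configured, every database is checked.
console.log(dbsToCheck(["admin", "config", "test"], {}));
```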

Comment by Spencer Brody (Inactive) [ 13/Feb/18 ]

FYI max.hirschhorn. I imagine this would be similar to the continuous stepdown passthroughs, but instead of stepping down we kill the primary.
