[SERVER-34075] powercycle_replication* must run replication recovery to observe canary documents
Created: 22/Mar/18  Updated: 29/Oct/23  Resolved: 01/May/18

Status: Closed
Project: Core Server
Component/s: Replication, Testing Infrastructure
Affects Version/s: None
Fix Version/s: 3.6.6, 4.0.0-rc0

Type: Task
Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive)
Assignee: Jonathan Abrahams
Resolution: Fixed
Votes: 0
Labels: rollback-non-functional
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
depends on SERVER-34108 Rebuilding unfinished unindexes one a... Closed
Related
related to WT-3998 Fix a bug where stable timestamp was ... Closed
related to SERVER-34070 Add flag to perform replication recov... Closed
related to SERVER-29213 Have KVWiredTigerEngine implement Sto... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v3.6
Sprint: TIG 2018-04-23, TIG 2018-05-07
Participants:
Linked BF Score: 80
Story Points: 3

 Description   

SERVER-29213 will break the powercycle_replication tests' ability to query for the canary document after a crash. That patch therefore temporarily disables them.

Specifically, after SERVER-29213, bringing a node up in standalone mode may expose data that is stale relative to what the node has accepted. The node has not lost the data; replication recovery simply needs to run before the data becomes queryable.

To check for the canary document, the powercycle tests bring the node back up in standalone mode, on a different port from the one it uses when running as a replica set member.

We suspect SERVER-34070 will make it easier to make the changes required to re-enable the powercycle_replication* tests. The difficulty is that running replication recovery requires starting the node with the --replSet option. However, a node running with --replSet on a different port from the one in the replset config will come up as neither PRIMARY nor SECONDARY, and thus will not service reads.



 Comments   
Comment by Githook User [ 22/May/18 ]

Author: Jonathan Abrahams (hptabster) <jonathan@mongodb.com>

Message: SERVER-34075 Startup mongod as a replset during recovery mode in powertest.py

(cherry picked from commit 0c242bc59fd1db69a891c73dc82a29c69f13a400)
Branch: v3.6
https://github.com/mongodb/mongo/commit/a4d8d7f1e905481a7a396a11f987b21328ceeb56

Comment by Githook User [ 01/May/18 ]

Author: Jonathan Abrahams (hptabster) <jonathan@mongodb.com>

Message: SERVER-34075 Reenable powercycle_replication* tests and startup mongod as a replset during recovery mode in powertest.py
Branch: master
https://github.com/mongodb/mongo/commit/0c242bc59fd1db69a891c73dc82a29c69f13a400

Comment by Jonathan Abrahams [ 11/Apr/18 ]

Starting mongod up in recovery (standalone) mode with noIndexBuildRetry fails with the following:

2018-04-11T18:25:06.586+0000 I STORAGE  [initandlisten]   not rebuilding interrupted indexes
2018-04-11T18:25:06.859+0000 F -        [initandlisten] Fatal assertion 28579 UnsupportedFormat: Unable to find metadata for table:index-935-3043352425042229629 Index: {name: foo1_1, ns: fsm-4db2.coll2} - version too new for this mongod. See http://dochub.mongodb.org/core/3.4-index-downgrade for detailed instructions on how to handle this error. at src/mongo/db/storage/wiredtiger/wiredtiger_index.cpp 269
2018-04-11T18:25:06.859+0000 F -        [initandlisten] 
 
***aborting after fassert() failure

Comment by Eric Milkie [ 11/Apr/18 ]

SERVER-34108 is at the very top of our "3.7 Desired" list, so it ought to be the first thing that gets scheduled once we've exhausted all our 4.0 required tickets (a dwindling list at this point).

Comment by Jonathan Abrahams [ 11/Apr/18 ]

max.hirschhorn, we cannot use the --noIndexBuildRetry option in a replica set; we would have to start the node up as a standalone:

Failed global initialization: BadValue: replication.replSet is not allowed when noIndexBuildRetry is specified
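
The option conflict above can be illustrated with a minimal sketch. This is not the code in powertest.py; `recovery_mongod_args` is a hypothetical helper that assembles a mongod command line for the recovery start and rejects the combination that produces the error quoted above, before a process is ever launched.

```python
def recovery_mongod_args(dbpath, port, repl_set=None, no_index_build_retry=False):
    """Build a mongod argv for the post-crash canary check.

    Hypothetical helper for illustration only; not taken from powertest.py.
    """
    if repl_set and no_index_build_retry:
        # mongod rejects this combination at startup (see the BadValue error
        # quoted above), so fail early instead of spawning the process.
        raise ValueError(
            "replication.replSet is not allowed when noIndexBuildRetry is specified"
        )
    args = ["mongod", "--dbpath", dbpath, "--port", str(port)]
    if repl_set:
        # Needed for replication recovery to run on startup.
        args += ["--replSet", repl_set]
    if no_index_build_retry:
        # Only usable when the node starts as a standalone.
        args.append("--noIndexBuildRetry")
    return args
```

With this shape, a test harness has to choose one path per start: either skip index rebuilds as a standalone, or run replication recovery as a replica set member.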

Comment by Max Hirschhorn [ 10/Apr/18 ]

jonathan.abrahams, it doesn't appear that SERVER-34108 is scheduled in the Storage team's current sprint. Is it possible to use the --noIndexBuildRetry option to skip trying to rebuild the indexes on start-up? CC milkie, daniel.gottlieb

Comment by Jonathan Abrahams [ 10/Apr/18 ]

The replication tests are failing; we need to wait for the fix in SERVER-34108.

Comment by Jonathan Abrahams [ 09/Apr/18 ]

The powercycle test does not start the recovery node, on the secret port, as a single-node replica set. I do not recall why it was done this way. I'll make a patch build to try always starting replication tests with the node as a replica set member.

Comment by Max Hirschhorn [ 27/Mar/18 ]

We should check whether the mongod is being restarted as a replica set member when started on the secret port, as there is already logic to do a reconfig so that the node can identify itself in the configuration.
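
As a rough sketch of what such a reconfig has to change (not the existing logic in powertest.py), the function below rewrites a single-member replica set config so that the member's host matches the secret port. The function name, the `host` default, and the single-member restriction are assumptions made for illustration.

```python
import copy


def config_for_secret_port(config, secret_port, host="localhost"):
    """Return a copy of a replica set config whose only member uses secret_port.

    Hypothetical helper for illustration only; it handles just the
    single-node replica set case.
    """
    if len(config["members"]) != 1:
        raise ValueError("this sketch only handles a single-member config")
    new_config = copy.deepcopy(config)
    # Point the member at the port the recovery mongod actually listens on,
    # so the node can find itself in the configuration.
    new_config["members"][0]["host"] = "{}:{}".format(host, secret_port)
    # A reconfig must carry a higher version than the config it replaces.
    new_config["version"] = config["version"] + 1
    return new_config
```

The resulting document would then be submitted with the `replSetReconfig` command. Because the node cannot be PRIMARY while it is missing from its own config, this presumably needs the command's `force: true` option.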
