[SERVER-34922] mongofiles sharded cluster write concern test regression in MongoDB 3.7 Created: 09/May/18  Updated: 06/Dec/22  Resolved: 23/Jul/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.6, 4.1.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: David Golden Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 14.04


Issue Links:
Depends
is depended on by TOOLS-2035 mongofiles_write_concern_mongos.js fa... Closed
Duplicate
duplicates SERVER-35092 ShardServerCatalogCacheLoader should ... Closed
Related
related to SERVER-34776 dropDatabase should respect user prov... Closed
related to SERVER-34370 Change commands to use AutoGetDb to g... Closed
Assigned Teams:
Sharding
Operating System: ALL
Sprint: Sharding 2018-07-16
Participants:

 Description   

Recent tools Evergreen tests have started showing a failure in the mongofiles_write_concern_mongos.js test when run against a 3.7 nightly. Nothing relating to mongofiles changed in the tools around the time the failures started, and the tests pass when run against MongoDB 3.6.

The test case has 2 nodes of a 3-node shard down, but mongofiles is run with w:1,wtimeout:10000. In the failing case, mongofiles hangs waiting for a database response. Eventually, after several hours, the process is terminated.

Note: The mongofiles log output is deceptive: it claims the file was added, but the hang occurs during a deferred close() call, when mgo is trying to flush all data to the database and ensure indexes exist.



 Comments   
Comment by Kaloian Manassiev [ 23/Jul/18 ]

Thanks for the analysis, blake.oler. I am going to close this ticket as a duplicate of SERVER-35092, since it is essentially the same problem.

Comment by Blake Oler [ 23/Jul/18 ]

Two things here:

  1. To verify the database version of an unsharded database in a sharded cluster, we need to wait for linearizable read concern. This ensures that the catalog cache will have the latest data. So the general behavior is "works as designed" – if a node in a cluster is partitioned from the other nodes, it will not be able to verify the database version.
  2. However, the fact that this hangs indefinitely is not ideal. The call to waitForLinearizableReadConcern doesn't have any timeout associated with it. A reasonable timeout is behavior that needs to be added. It's tracked in SERVER-35092.
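
As a rough illustration of the second point, the sketch below shows the general shape of a wait that gives up after a deadline instead of blocking forever. It is not the server's code: the names wait_with_deadline, WaitTimedOut, and majority_reachable are hypothetical, and a Python threading.Event merely stands in for the condition that waitForLinearizableReadConcern waits on.

```python
import threading


class WaitTimedOut(Exception):
    """Raised when the awaited condition is not met before the deadline."""


def wait_with_deadline(event: threading.Event, timeout_secs: float) -> None:
    # Event.wait returns False when the timeout elapses before the event is set,
    # so the caller gets an error back instead of blocking indefinitely.
    if not event.wait(timeout=timeout_secs):
        raise WaitTimedOut(f"condition not satisfied within {timeout_secs}s")


# Hypothetical stand-in for "a majority of the replica set is reachable".
# It is never set here, modelling a shard with 2 of its 3 nodes down.
majority_reachable = threading.Event()

try:
    wait_with_deadline(majority_reachable, timeout_secs=0.05)
    outcome = "ok"
except WaitTimedOut as exc:
    outcome = f"error: {exc}"

print(outcome)
```

With a bound like this, the caller would receive an error after the timeout rather than hanging until shutdown; what timeout value is appropriate is left to SERVER-35092.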
Comment by Louis Williams [ 04/Jun/18 ]

The problematic code is checkDbVersion in create_indexes.cpp, introduced in this commit as part of SERVER-34370

Comment by Louis Williams [ 04/Jun/18 ]

Assigning to sharding

Comment by Louis Williams [ 04/Jun/18 ]

The hang is happening on createIndexes for dbOne.fs.chunks.

It looks like the primary is waiting while refreshing the database entry in the catalog cache, which ignores the user's writeConcern.

2018-05-07T17:40:31.456+0000 s20515| 2018-05-07T17:40:31.455+0000 I COMMAND  [conn22] query dbOne.fs.files command: { insert: "fs.files", writeConcern: { getLastError: 1, w: 1, wtimeout: 10000 }, ordered: true, $readPreference: { mode: "nearest" }, $db: "dbOne" } nShards:1 ninserted:1 numYields:0 reslen:185 120029ms
2018-05-07T17:40:31.457+0000 s20515| 2018-05-07T17:40:31.456+0000 D COMMAND  [conn22] createIndexes: dbOne.fs.chunks cmd:{ createIndexes: "fs.chunks", indexes: [ { name: "files_id_1_n_1", ns: "dbOne.fs.chunks", key: { files_id: 1, n: 1 }, unique: true } ], $db: "dbOne" }
2018-05-07T17:40:31.457+0000 d20511| 2018-05-07T17:40:31.456+0000 I SHARDING [conn34] Refreshing cached database entry for dbOne; current cached database info is { _id: "dbOne", primary: "mongofiles_write_concern_mongos-rs0", partitioned: false, version: { uuid: UUID("33658558-44e2-462e-944f-93c133a8b31a"), lastMod: 1 } }
2018-05-07T19:50:52.656+0000 s20515| 2018-05-07T19:50:52.641+0000 D -        [conn22] User Assertion: InterruptedAtShutdown: interrupted at shutdown src/mongo/db/operation_context.cpp 165
2018-05-07T19:50:52.657+0000 s20515| 2018-05-07T19:50:52.642+0000 I COMMAND  [conn22] query dbOne.fs.chunks command: { createIndexes: "fs.chunks", indexes: [ { name: "files_id_1_n_1", ns: "dbOne.fs.chunks", key: { files_id: 1, n: 1 }, unique: true } ], $db: "dbOne" } numYields:0 reslen:587 7821186ms
2018-05-07T19:50:52.661+0000 d20511| 2018-05-07T19:50:52.645+0000 I SHARDING [conn34] Failed to refresh databaseVersion for database dbOne :: caused by :: ShutdownInProgress: Unable to schedule routing table update because this is not the majority primary and may not have the latest data. :: caused by :: Replication is being shut down
2018-05-07T19:50:52.661+0000 d20511| 2018-05-07T19:50:52.645+0000 I COMMAND  [conn34] command dbOne.$cmd command: createIndexes { createIndexes: "fs.chunks", indexes: [ { name: "files_id_1_n_1", ns: "dbOne.fs.chunks", key: { files_id: 1, n: 1 }, unique: true } ], shardVersion: [ Timestamp(0, 0), ObjectId('000000000000000000000000') ], databaseVersion: { uuid: UUID("c918f944-1a83-40c0-b265-bd74d9d6dc2e"), lastMod: 1 }, allowImplicitCollectionCreation: false, $clusterTime: { clusterTime: Timestamp(1525714831, 4), signature: { hash: BinData(0, 54CBBC2B100FA619FCDACAED9070E8857A02342C), keyId: 6552892982884827165 } }, $configServerState: { opTime: { ts: Timestamp(1525714831, 3), t: 1 } }, $db: "dbOne" } numYields:0 ok:0 errMsg:"don't know dbVersion" errName:StaleDbVersion errCode:249 reslen:475 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_msg 7821188ms
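
To put the duration in the log above in context, the following sketch converts the trailing millisecond figure on the mongos createIndexes line into hours and cross-checks it against the two mongos timestamps. The helper name trailing_duration_ms is hypothetical, and the log line is abbreviated here for readability; only the timestamps and the 7821186ms figure are taken from the log.

```python
import re
from datetime import datetime, timedelta


def trailing_duration_ms(line: str) -> int:
    """Return the duration, in milliseconds, that ends a slow-operation log line."""
    match = re.search(r"(\d+)ms\s*$", line)
    if match is None:
        raise ValueError("log line does not end with a duration in ms")
    return int(match.group(1))


# Abbreviated copy of the mongos (s20515) createIndexes completion line.
line = 'command: { createIndexes: "fs.chunks" } numYields:0 reslen:587 7821186ms'
duration_ms = trailing_duration_ms(line)
hours = duration_ms / 3_600_000

# Cross-check against the mongos timestamps for the start and end of createIndexes.
fmt = "%Y-%m-%dT%H:%M:%S.%f"
start = datetime.strptime("2018-05-07T17:40:31.456", fmt)
end = datetime.strptime("2018-05-07T19:50:52.642", fmt)
elapsed_ms = (end - start) // timedelta(milliseconds=1)

print(duration_ms, elapsed_ms, round(hours, 2))
```

Both figures come to 7821186 ms, a little over two hours, and the operation only ended because the server was shut down.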

Comment by David Golden [ 31/May/18 ]

This is still a problem as of mongodb-linux-x86_64-ubuntu1404-4.1.0-114-ga3fb68c

Comment by David Golden [ 10/May/18 ]

Might be related to SERVER-34776

Generated at Thu Feb 08 04:38:17 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.