|
Normal case:
1. Primary starts updating config.cache.chunks and sets the refreshing flag to true on the config.cache.collections document.
2. Primary finishes updating config.cache.chunks and sets the refreshing flag to false on the config.cache.collections document.
3. Secondary CatalogCache tries to refresh, sees that the refreshing flag is true, and waits for a signal that it has become false.
4. Secondary replication intercepts the oplog entry that sets the refreshing flag to false and notifies the CatalogCache to check again.
5. Secondary CatalogCache checks again, sees that the flag is false, and proceeds to read config.cache.chunks.
Stuck case:
1. Secondary batch-applies up to the oplog entry at T10; the batch includes the entry that sets the refreshing flag to true (but not the entry that sets it to false).
2. Secondary CatalogCache tries to refresh, sees that the refreshing flag is true, and waits for a signal that it has become false.
3. Secondary replication intercepts the oplog entry at timestamp T15 that sets the refreshing flag to false and notifies the CatalogCache to check again.
4. Secondary CatalogCache tries to read the refreshing flag, but because the secondary is in the middle of batch application it reads from the last stable snapshot, which is at T10, and still sees the flag set to true. So it ends up waiting for the notification again.
Once it ends up in this state, it has to wait for the next refresh to happen, and the snapshot it then reads from must also have the flag set to false in order to break out of the loop.
Note: to hit this, there must be multiple refreshes in a short span, such that the secondary waits for replication on the first refresh that happened on the primary and then hits the stuck case above on the subsequent ones.
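For illustration, a rough mongo shell sketch (not part of the fix; the namespace is the one from this report and the field names assume the usual config.cache.collections schema) of how the stuck state can be observed directly on the affected secondary:
// Run in the mongo shell connected directly to the stuck secondary (not via mongos).
// In the stuck case above, this document keeps reporting refreshing: true even after
// the primary has set it back to false, because the reader only sees the last stable
// snapshot (T10 in the example).
db.getSiblingDB("config").getCollection("cache.collections").find(
    { _id: "product.productVariants" },                       // namespace from this report
    { refreshing: 1, lastRefreshedCollectionVersion: 1 }
).forEach(printjson);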
Original description:
We got a very strange issue: all queries for one collection (product.productVariants) began to hang and end in a timeout, but only on one secondary node in the cluster.
Every query writes log messages like:
2019-08-09T11:45:01.100+0000 I SHARDING [conn1687013] Failed to refresh metadata for collectionproduct.productVariants :: caused by :: MaxTimeMSExpired: operation exceeded time limit
|
2019-08-09T11:45:01.100+0000 I COMMAND [conn1687013] Timed out obtaining lock while trying to gather storage statistics for a slow operation
|
2019-08-09T11:45:01.100+0000 I COMMAND [conn1687013] command product.productVariants command: find { find: "productVariants", filter: { productId: { $in: [ "1464081735096156158-24-1-553-3541579132", "1465973890490715674-11-1-582-2096574603", "1465973890500950185-13-1-582-976357012", "1461932094024184784-166-1-553-3714864457", "1461935063694876366-54-1-553-2708221516", "1461935988919956077-34-1-553-1636835731", "1461937588876550474-147-1-553-178727871", "1461929262265524984-74-1-553-2941111310", "1463810588441994707-236-1-553-151836953", "1463810333622018613-11-1-553-3495339665", "1462962266606256953-204-1-553-3651680234", "1461930397452040780-28-1-553-3620775575" ] } }, readConcern: { level: "local" }, maxTimeMS: 60000, shardVersion: [ Timestamp(4547, 1), ObjectId('5c9d25a355eaa4f13e23434e') ], $readPreference: { mode: "secondaryPreferred", tags: [ { role: "main" } ] }, $clusterTime: { clusterTime: Timestamp(1565351041, 36), signature: { hash: BinData(0, 1ACB7B7FA0F5A4A9E7930DDAC1AFB93C717B5717), keyId: 6695428731097317566 } }, $configServerState: { opTime: { ts: Timestamp(1565351040, 874), t: 3 } }, $db: "product" } numYields:0 ok:0 errMsg:"epoch mismatch detected for product.productVariants, the collection may have been dropped and recreated" errName:StaleConfig errCode:13388 reslen:587 locks:{ Global: { acquireCount: { r: 3 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 729 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_msg 60010ms
|
Metadata was refreshed immediately after executing the query without readConcern: { level: "local" } directly on the problem mongod instance.
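A minimal sketch of such a direct read (the productId value here is only illustrative):
// Plain find in the mongo shell, run directly on the problem mongod with no
// readConcern specified; this was enough to make the metadata refresh complete.
db.getSiblingDB("product").productVariants.find(
    { productId: "1464081735096156158-24-1-553-3541579132" }   // illustrative value
).limit(1);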
We found related metadata update in configuration replica set logs:
|
2019-08-09T11:41:09.495+0000 I SH_REFR [ConfigServerCatalogCacheLoader-7172] Refresh for collection product.productVariants from version 4546|8||5c9d25a355eaa4f13e23434e to version 4547|1||5c9d25a355eaa4f13e23434e took 3 ms
|
2019-08-09T11:41:09.567+0000 I SHARDING [Balancer] distributed lock 'product.productVariants' acquired for 'Migrating chunk(s) in collection product.productVariants', ts : 5d10d3d9b6fda9ab6affa646
|
After that we grepped for "Marking collection" in the problem mongod instance's logs:
|
2019-08-09T11:36:51.393+0000 I SHARDING [repl writer worker 6] Marking collection product.productVariants with collection version: 4546|1||5c9d25a355eaa4f13e23434e, shard version: 4546|1||5c9d25a355eaa4f13e23434e as unsharded
|
2019-08-09T11:36:51.633+0000 I SHARDING [conn1685706] Marking collection product.productVariants as sharded with collection version: 4546|8||5c9d25a355eaa4f13e23434e, shard version: 4546|8||5c9d25a355eaa4f13e23434e
|
2019-08-09T11:41:08.422+0000 I SHARDING [repl writer worker 6] Marking collection product.productVariants with collection version: 4546|8||5c9d25a355eaa4f13e23434e, shard version: 4546|8||5c9d25a355eaa4f13e23434e as unsharded
|
2019-08-09T12:58:17.988+0000 I SHARDING [conn1720807] Marking collection product.productVariants as sharded with collection version: 4550|5||5c9d25a355eaa4f13e23434e, shard version: 4550|5||5c9d25a355eaa4f13e23434e
|
2019-08-09T13:01:36.409+0000 I SHARDING [repl writer worker 10] Marking collection product.productVariants with collection version: 4551|1||5c9d25a355eaa4f13e23434e, shard version: 4551|1||5c9d25a355eaa4f13e23434e as unsharded
|
2019-08-09T13:01:36.449+0000 I SHARDING [conn1721230] Marking collection product.productVariants as sharded with collection version: 4552|1||5c9d25a355eaa4f13e23434e, shard version: 4552|1||5c9d25a355eaa4f13e23434e
|
|
In our case there was over an hour between "Marking ... as unsharded" and "Marking ... as sharded".
|
|
Hi bozaro,
MongoDB 4.0.14-rc1 was recently released. This is a release candidate for MongoDB 4.0.14, which includes this fix. Assuming no issues are identified during our testing, I would expect MongoDB 4.0.14 GA will be available in about a week.
Kind regards,
Kelsey
|
|
Is there any forecast for the 4.0.14 version release date?
|
|
Author:
{'name': 'Randolph Tan', 'email': 'randolph@mongodb.com'}
Message: SERVER-42737 Make secondary reads in ShardServerCatalogCacheLoader block behind the PBW lock
(cherry picked from commit e0f6fdab23da872d9cf8a93b88c2332ba45041c0)
Branch: v4.2
https://github.com/mongodb/mongo/commit/75e89cb9cb05c3717ef9929694ec2cbe0873db90
|
|
Author:
{'name': 'Randolph Tan', 'email': 'randolph@mongodb.com'}
Message: SERVER-42737 Make secondary reads in ShardServerCatalogCacheLoader block behind the PBW lock
(cherry picked from commit e0f6fdab23da872d9cf8a93b88c2332ba45041c0)
Branch: v4.0
https://github.com/mongodb/mongo/commit/5790c40b7a87e16d6b67cd77bed55833661b7dd1
|
|
Author:
{'name': 'Randolph Tan', 'email': 'randolph@mongodb.com'}
Message: SERVER-42737 Make secondary reads in ShardServerCatalogCacheLoader block behind the PBW lock
Branch: master
https://github.com/mongodb/mongo/commit/e0f6fdab23da872d9cf8a93b88c2332ba45041c0
|
|
We really need this to be backported to 4.0.x. Migrating to 4.2.x or later is a non-trivial process, especially for sharded clusters.
|
Does this look like the same issue?
I think they are: secondary reads getting stuck. I'm not sure what read concern level you used, but "available" is the only read concern that will get around this issue, because it doesn't use shard versioning and therefore neither causes nor waits for sharding metadata refreshes. It is the default read concern when none is specified and the read is not part of a causally consistent session. Direct access to the secondary doesn't exhibit the issue because such reads are not versioned (which carries the risk of talking to the wrong shard, or of seeing data that no longer belongs there). As for it getting resolved automatically: it is not exactly dependent on time. The condition for breaking out of the stuck state is very hard to control, because it depends on the contents of the oplog batch that gets sent to the secondary and on the timing of the metadata refresh on the primary.
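For reference, a sketch of such an "available" read through mongos (illustrative only; the filter value is taken from the logged query above):
// Read concern "available" skips shard versioning, so it neither triggers nor waits for
// a sharding metadata refresh; the trade-off is that it can return orphaned documents.
db.getSiblingDB("product").productVariants.find(
    { productId: "1464081735096156158-24-1-553-3541579132" }   // example value from the logs
).readConcern("available").readPref("secondaryPreferred");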
Is there a way to produce a direct query to a mongod instance to make it hit exactly the same issue a query routed via mongos seems to hit? It would help both to find and to inspect the problem.
Once the secondary refresh is stuck, you just need to send the exact same query that timed out in the logs. If you use the mongo shell you should be able to copy and paste the exact find command in the logs and use runCommand to execute on the node directly.
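For example, based on the logged find command from the original description (shardVersion and filter copied from that log line, with the $in list abridged), something like:
// Re-run the timed-out, versioned find directly on the node via db.runCommand.
// The shardVersion and filter must match the command from your own log line.
db.getSiblingDB("product").runCommand({
    find: "productVariants",
    filter: { productId: { $in: [ "1464081735096156158-24-1-553-3541579132" ] } },
    readConcern: { level: "local" },
    shardVersion: [ Timestamp(4547, 1), ObjectId("5c9d25a355eaa4f13e23434e") ],
    maxTimeMS: 60000
});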
Is there a way to manually quicken the resolution of this issue once it affected the cluster? We're willing to try some recipe you could advise - provided it is not too drastic.
I think the most sure-fire way to get the secondary un-stuck is to restart it. There are some workarounds I can think of, but they are not safe and not guaranteed to work all the time.
|
|
Hi,
We haven't yet attempted to apply this workaround (partly because, on the whole, we may end up worse off, and partly because changing such settings on a production cluster requires some amount of prior testing), so we keep hitting this issue quickly whenever the balancer is enabled after adding a new shard to the cluster. We keep making attempts to investigate it a little bit more, both in order to find some less intrusive workaround and to make sure that it is the very same issue every time, not several different ones.
So, I would like to add our recent observations. Please confirm if this looks consistent with your current idea of an underlying problem.
- Initial symptoms are the same: some reads via mongos are stalled, connections to the affected secondary mongod spike, messages from the original report are present in the log
- During prior such incidents, we saw the issue resolve itself when a read with readConcern other than 'local' was made (I seem to remember this as part of the original issue description - could it have been lost when it was last updated?). This time we tried to do so both via mongos and via a direct connection to the affected mongod instance, but to no avail.
- Direct reads to the affected collection on the affected mongod instance did not show this issue: no stalls, reads were quick and successful. We only tried some simple queries to the most recent documents, though.
- After about 30 minutes the issue on the secondary resolved itself; we're not sure whether it was something we did or not. For most of this time the balancer was still enabled.
And the questions are:
- Does this look like the same issue?
- Is there a way to produce a direct query to a mongod instance to make it hit exactly the same issue a query routed via mongos seems to hit? It would help both to find and to inspect the problem.
- Is there a way to manually quicken the resolution of this issue once it affected the cluster? We're willing to try some recipe you could advise - provided it is not too drastic.
|
|
I think we found the issue that you are experiencing, and I have updated the description with my findings. If this is the issue you are hitting, you can try setting the allowSecondaryReadsDuringBatchApplication parameter to false (it defaults to true) on the mongods to get around it. This docs page shows the ways you can change this parameter. Please note that setting this parameter to false has the side effect of blocking secondary reads while the node is in the middle of applying oplog entries, so it worsens the latency of secondary reads.
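For example (a sketch, assuming your version allows changing this server parameter at runtime; otherwise it can be set at startup via --setParameter or the configuration file):
// Run against each affected mongod with the required privileges.
db.adminCommand({ setParameter: 1, allowSecondaryReadsDuringBatchApplication: false });
// Verify the current value:
db.adminCommand({ getParameter: 1, allowSecondaryReadsDuringBatchApplication: 1 });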
|
|
Hi, it's Artem's colleague here. The issue reproduces quite steadily, so we've managed to collect logs with `sharding.shardingCatalogRefresh` enabled and even a core dump from the troubled process.
Log (filtered out "I COMMAND", "I NETWORK" and "I ACCESS" lines):
2019-10-04T08:35:36.968+0000 I SHARDING [repl writer worker 12] Marking collection user.users with collection version: 1932|1||5a6a38d2999e40c3345af3f8, shard version: 1932|1||5a6a38d2999e40c3345af3f8 as unsharded
|
2019-10-04T08:35:36.970+0000 D SH_REFR [conn1213121] Refreshing chunks for collection user.users based on version 1932|1||5a6a38d2999e40c3345af3f8
|
2019-10-04T08:35:36.980+0000 I SH_REFR [ShardServerCatalogCacheLoader-14149] Refresh for collection user.users from version 1932|1||5a6a38d2999e40c3345af3f8 to version 1937|1||5a6a38d2999e40c3345af3f8 took 10 ms
|
2019-10-04T08:35:36.981+0000 D SH_REFR [conn1212854] Refreshing chunks for collection user.users based on version 1937|1||5a6a38d2999e40c3345af3f8
|
2019-10-04T08:35:36.981+0000 I SHARDING [conn1213176] Marking collection user.users as sharded with collection version: 1937|1||5a6a38d2999e40c3345af3f8, shard version: 1932|1||5a6a38d2999e40c3345af3f8
|
2019-10-04T08:35:36.987+0000 D SH_REFR [ShardServerCatalogCacheLoader-14149] Refresh for collection user.users from version 1937|1||5a6a38d2999e40c3345af3f8 to version 1937|1||5a6a38d2999e40c3345af3f8 took 5 ms
|
2019-10-04T08:35:51.431+0000 D SH_REFR [LogicalSessionCacheReap] Refreshing chunks for collection config.system.sessions based on version 2|0||5a79c43b91a6b0b1139a81b6
|
2019-10-04T08:35:51.438+0000 D SH_REFR [ShardServerCatalogCacheLoader-14149] Refresh for collection config.system.sessions from version 2|0||5a79c43b91a6b0b1139a81b6 to version 2|0||5a79c43b91a6b0b1139a81b6 took 6 ms
|
2019-10-04T08:36:28.435+0000 I SHARDING [repl writer worker 12] Marking collection user.users with collection version: 1937|1||5a6a38d2999e40c3345af3f8, shard version: 1932|1||5a6a38d2999e40c3345af3f8 as unsharded
|
2019-10-04T08:36:28.436+0000 D SH_REFR [conn1213095] Refreshing chunks for collection user.users based on version 1937|1||5a6a38d2999e40c3345af3f8
|
2019-10-04T08:37:47.089+0000 I SHARDING [conn1206894] Failed to refresh metadata for collectionuser.users :: caused by :: MaxTimeMSExpired: operation exceeded time limit
|
2019-10-04T08:38:03.289+0000 I SHARDING [conn1214419] Failed to refresh metadata for collectionuser.users :: caused by :: MaxTimeMSExpired: operation exceeded time limit
|
2019-10-04T08:38:05.499+0000 I SHARDING [conn1207570] Failed to refresh metadata for collectionuser.users :: caused by :: MaxTimeMSExpired: operation exceeded time limit
|
2019-10-04T08:38:07.374+0000 I SHARDING [conn1207247] Failed to refresh metadata for collectionuser.users :: caused by :: MaxTimeMSExpired: operation exceeded time limit
|
2019-10-04T08:38:14.538+0000 I SHARDING [conn1214522] Failed to refresh metadata for collectionuser.users :: caused by :: MaxTimeMSExpired: operation exceeded time limit
|
2019-10-04T08:38:15.030+0000 I SHARDING [conn1214511] Failed to refresh metadata for collectionuser.users :: caused by :: MaxTimeMSExpired: operation exceeded time limit
|
2019-10-04T08:38:21.833+0000 I SHARDING [conn1205746] Failed to refresh metadata for collectionuser.users :: caused by :: MaxTimeMSExpired: operation exceeded time limit
|
2019-10-04T08:38:22.682+0000 I SHARDING [conn1214495] Failed to refresh metadata for collectionuser.users :: caused by :: MaxTimeMSExpired: operation exceeded time limit
|
2019-10-04T08:38:24.857+0000 I SHARDING [conn1214622] Failed to refresh metadata for collectionuser.users :: caused by :: MaxTimeMSExpired: operation exceeded time limit
|
2019-10-04T08:38:41.542+0000 I SHARDING [conn1214705] Failed to refresh metadata for collectionuser.users :: caused by :: MaxTimeMSExpired: operation exceeded time limit
|
2019-10-04T08:36:28 is the exact time when this replica began to malfunction.
In the core dump I see the ShardServerCatalogCacheLoader thread waiting for the new catalog to arrive:
Thread 265 (Thread 0x7f444b0b0700 (LWP 24550)):
|
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
|
#1 0x0000555b4bfc972c in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
|
#2 0x0000555b4be0ad13 in mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil(std::condition_variable&, std::unique_lock<std::mutex>&, mongo::Date_t) ()
|
#3 0x0000555b4be0aef2 in mongo::OperationContext::waitForConditionOrInterruptNoAssert(std::condition_variable&, std::unique_lock<std::mutex>&) ()
|
#4 0x0000555b4be0b03d in mongo::OperationContext::waitForConditionOrInterrupt(std::condition_variable&, std::unique_lock<std::mutex>&) ()
|
#5 0x0000555b4a905b5e in mongo::ShardServerCatalogCacheLoader::_getCompletePersistedMetadataForSecondarySinceVersion(mongo::OperationContext*, mongo::NamespaceString const&, mongo::ChunkVersion const&) ()
|
#6 0x0000555b4a906643 in mongo::ShardServerCatalogCacheLoader::_runSecondaryGetChunksSince(mongo::OperationContext*, mongo::NamespaceString const&, mongo::ChunkVersion const&, std::function<void (mongo::OperationContext*, mongo::StatusWith<mongo::CatalogCacheLoader::CollectionAndChangedChunks>)>) ()
|
#7 0x0000555b4a906c2a in mongo::ShardServerCatalogCacheLoader::getChunksSince(mongo::NamespaceString const&, mongo::ChunkVersion, std::function<void (mongo::OperationContext*, mongo::StatusWith<mongo::CatalogCacheLoader::CollectionAndChangedChunks>)>)::{lambda()#1}::operator()() const ()
|
#8 0x0000555b4b66d64c in mongo::ThreadPool::_doOneTask(std::unique_lock<std::mutex>*) ()
|
#9 0x0000555b4b66ec1c in mongo::ThreadPool::_consumeTasks() ()
|
#10 0x0000555b4b66f656 in mongo::ThreadPool::_workerThreadBody(mongo::ThreadPool*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
|
#11 0x0000555b4bfcc7a0 in execute_native_thread_routine ()
|
#12 0x00007f4493ec66ba in start_thread (arg=0x7f444b0b0700) at pthread_create.c:333
|
#13 0x00007f4493bfc41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
|
And a couple of threads with request handlers that stumbled upon an invalid catalog version and are waiting for it to be refreshed:
Thread 262 (Thread 0x7f4469d40700 (LWP 29630)):
|
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
|
#1 0x0000555b4be146b8 in mongo::ClockSource::waitForConditionUntil(std::condition_variable&, std::unique_lock<std::mutex>&, mongo::Date_t) ()
|
#2 0x0000555b4be0ab73 in mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil(std::condition_variable&, std::unique_lock<std::mutex>&, mongo::Date_t) ()
|
#3 0x0000555b4be0b0c0 in mongo::OperationContext::waitForConditionOrInterruptUntil(std::condition_variable&, std::unique_lock<std::mutex>&, mongo::Date_t) ()
|
#4 0x0000555b4b6865d9 in mongo::CatalogCache::_getCollectionRoutingInfoAt(mongo::OperationContext*, mongo::NamespaceString const&, boost::optional<mongo::Timestamp>) ()
|
#5 0x0000555b4b687425 in mongo::CatalogCache::getCollectionRoutingInfoWithRefresh(mongo::OperationContext*, mongo::NamespaceString const&, bool) ()
|
#6 0x0000555b4a8f4131 in mongo::forceShardFilteringMetadataRefresh(mongo::OperationContext*, mongo::NamespaceString const&, bool) ()
|
#7 0x0000555b4a8f63d1 in mongo::onShardVersionMismatchNoExcept(mongo::OperationContext*, mongo::NamespaceString const&, mongo::ChunkVersion, bool) ()
|
#8 0x0000555b4a55994f in mongo::ServiceEntryPointMongod::Hooks::handleException(mongo::DBException const&, mongo::OperationContext*) const ()
|
#9 0x0000555b4a56a0a3 in mongo::(anonymous namespace)::execCommandDatabase(mongo::OperationContext*, mongo::Command*, mongo::OpMsgRequest const&, mongo::rpc::ReplyBuilderInterface*, mongo::ServiceEntryPointCommon::Hooks const&) [clone .constprop.314] ()
|
#10 0x0000555b4a56d019 in mongo::(anonymous namespace)::receivedCommands(mongo::OperationContext*, mongo::Message const&, mongo::ServiceEntryPointCommon::Hooks const&)::{lambda()#1}::operator()() const ()
|
#11 0x0000555b4a56df51 in mongo::ServiceEntryPointCommon::handleRequest(mongo::OperationContext*, mongo::Message const&, mongo::ServiceEntryPointCommon::Hooks const&) ()
|
#12 0x0000555b4a55977a in mongo::ServiceEntryPointMongod::handleRequest(mongo::OperationContext*, mongo::Message const&) ()
|
#13 0x0000555b4a56635a in mongo::ServiceStateMachine::_processMessage(mongo::ServiceStateMachine::ThreadGuard) ()
|
#14 0x0000555b4a561017 in mongo::ServiceStateMachine::_runNextInGuard(mongo::ServiceStateMachine::ThreadGuard) ()
|
#15 0x0000555b4a564831 in std::_Function_handler<void (), mongo::ServiceStateMachine::_scheduleNextWithGuard(mongo::ServiceStateMachine::ThreadGuard, mongo::transport::ServiceExecutor::ScheduleFlags, mongo::transport::ServiceExecutorTaskName, mongo::ServiceStateMachine::Ownership)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
|
#16 0x0000555b4b7132e2 in mongo::transport::ServiceExecutorSynchronous::schedule(std::function<void ()>, mongo::transport::ServiceExecutor::ScheduleFlags, mongo::transport::ServiceExecutorTaskName) ()
|
#17 0x0000555b4a55f200 in mongo::ServiceStateMachine::_scheduleNextWithGuard(mongo::ServiceStateMachine::ThreadGuard, mongo::transport::ServiceExecutor::ScheduleFlags, mongo::transport::ServiceExecutorTaskName, mongo::ServiceStateMachine::Ownership) ()
|
#18 0x0000555b4a562345 in mongo::ServiceStateMachine::_sourceCallback(mongo::Status) ()
|
#19 0x0000555b4a560757 in mongo::ServiceStateMachine::_sourceMessage(mongo::ServiceStateMachine::ThreadGuard) ()
|
#20 0x0000555b4a56109d in mongo::ServiceStateMachine::_runNextInGuard(mongo::ServiceStateMachine::ThreadGuard) ()
|
#21 0x0000555b4a564831 in std::_Function_handler<void (), mongo::ServiceStateMachine::_scheduleNextWithGuard(mongo::ServiceStateMachine::ThreadGuard, mongo::transport::ServiceExecutor::ScheduleFlags, mongo::transport::ServiceExecutorTaskName, mongo::ServiceStateMachine::Ownership)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
|
#22 0x0000555b4b713845 in std::_Function_handler<void (), mongo::transport::ServiceExecutorSynchronous::schedule(std::function<void ()>, mongo::transport::ServiceExecutor::ScheduleFlags, mongo::transport::ServiceExecutorTaskName)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
|
#23 0x0000555b4be14a24 in mongo::(anonymous namespace)::runFunc(void*) ()
|
#24 0x00007f4493ec66ba in start_thread (arg=0x7f4469d40700) at pthread_create.c:333
|
#25 0x00007f4493bfc41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
|
|
|
I've attached `mongo-user3-3c.joom.x.oplog.rs.bson`.
|
|
bozaro, thanks for uploading the requested files from the most recent problem. Would it be possible for you to mongodump your oplog from the problem shard and provide the result to the same Uploader? If you are hesitant to provide your full oplog, just the "config.cache.collections" namespace may be sufficient. The following command should provide that information:
mongodump --db=local --collection=oplog.rs --query='{ns:"config.cache.collections"}'
|
Additionally, would it be possible to increase the log level of a sharding component across your cluster so that we get more info from the logs if the issue happens again?
db.setLogLevel(1, "sharding.shardingCatalogRefresh")
|
|
|
Logs were attached to this issue:
- mongod logs from all nodes in the problem shard (mongo-user3-*.joom.x.mongod.log.20191003.gz)
- mongod logs from the Primary of the config server at the time these problems were experienced (mongo-user-conf-1a.joom.x.mongod.log.20191003.gz)
- mongos logs where some of these queries were being run (worker-1a.joom.x.mongos.log.20191003.gz)
|
|
Collection `user.devices` was marked as `unsharded` at `2019-10-02T21:26:30.827+0000` and got stuck in this state.
Unfortunately, we did not try to update the metadata by executing a query without readConcern: { level: "local" }: mongod was restarted.
zgrep "user.devices" mongo-user3-3c.joom.x.mongod.log.20191003.gz | grep Marking | grep -v user.devicesByUserId
|
2019-10-02T14:18:09.408+0000 I SHARDING [repl writer worker 15] Marking collection user.devices with collection version: 7410|0||5a8aa1cc91a6b0b11319f508, shard version: 7410|0||5a8aa1cc91a6b0b11319f508 as unsharded
|
2019-10-02T14:18:09.442+0000 I SHARDING [conn1188336] Marking collection user.devices as sharded with collection version: 7411|1||5a8aa1cc91a6b0b11319f508, shard version: 7411|1||5a8aa1cc91a6b0b11319f508
|
2019-10-02T21:26:15.773+0000 I SHARDING [repl writer worker 11] Marking collection user.devices with collection version: 7411|1||5a8aa1cc91a6b0b11319f508, shard version: 7411|1||5a8aa1cc91a6b0b11319f508 as unsharded
|
2019-10-02T21:26:15.787+0000 I SHARDING [conn1197534] Marking collection user.devices as sharded with collection version: 7414|1||5a8aa1cc91a6b0b11319f508, shard version: 7411|1||5a8aa1cc91a6b0b11319f508
|
2019-10-02T21:26:30.827+0000 I SHARDING [repl writer worker 1] Marking collection user.devices with collection version: 7414|1||5a8aa1cc91a6b0b11319f508, shard version: 7411|1||5a8aa1cc91a6b0b11319f508 as unsharded
|
2019-10-03T00:25:46.163+0000 I SHARDING [conn68] Marking collection user.devices as sharded with collection version: 7416|1||5a8aa1cc91a6b0b11319f508, shard version: 7416|1||5a8aa1cc91a6b0b11319f508
|
|
|
We got the same issue yesterday on another cluster.
zgrep "Failed to refresh metadata" mongo-user3-3c.joom.x.mongod.log.20191003.gz | head -5
|
2019-10-02T21:28:01.724+0000 I SHARDING [conn1190615] Failed to refresh metadata for collectionuser.devices :: caused by :: MaxTimeMSExpired: operation exceeded time limit
|
2019-10-02T21:28:43.633+0000 I SHARDING [conn1193697] Failed to refresh metadata for collectionuser.devices :: caused by :: MaxTimeMSExpired: operation exceeded time limit
|
2019-10-02T21:28:58.988+0000 I SHARDING [conn1197590] Failed to refresh metadata for collectionuser.devices :: caused by :: MaxTimeMSExpired: operation exceeded time limit
|
2019-10-02T21:29:09.772+0000 I SHARDING [conn1198578] Failed to refresh metadata for collectionuser.devices :: caused by :: MaxTimeMSExpired: operation exceeded time limit
|
2019-10-02T21:29:12.376+0000 I SHARDING [conn1198579] Failed to refresh metadata for collectionuser.devices :: caused by :: MaxTimeMSExpired: operation exceeded time limit
|
|
|
In order for us to investigate, can you please upload the following to our Support Uploader? Please note that only MongoDB engineers will be able to read the files therein and they will be automatically deleted after 180 days.
- mongod logs from all nodes in the problem shard
- mongod logs from the Primary of the config server at the time these problems were experienced
- mongos logs where some of these queries were being run
|