[SERVER-36801] "mongos collstats doesn't know about" clusterTime fields warning in 3.6 Created: 22/Aug/18  Updated: 12/Dec/23

Status: Backlog
Project: Core Server
Component/s: Logging
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor - P4
Reporter: 章 黒鉄 Assignee: Backlog - Cluster Scalability
Resolution: Unresolved Votes: 7
Labels: ShardingRoughEdges, neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-38129 mongos CollectionStats command is too... Closed
Assigned Teams:
Cluster Scalability
Sprint: Sharding 2018-09-24
Participants:
Case:

 Description   

In our logs we regularly find the following lines with "W" severity coming from our 3.6 cluster's mongos routers:

 

2018-08-22T09:01:59.913+0900 W COMMAND  [conn13334] mongos collstats doesn't know about: operationTime
2018-08-22T09:01:59.913+0900 W COMMAND  [conn13334] mongos collstats doesn't know about: gleStats
2018-08-22T09:01:59.913+0900 W COMMAND  [conn13334] mongos collstats doesn't know about: clusterTime
2018-08-22T09:01:59.913+0900 W COMMAND  [conn13334] mongos collstats doesn't know about: configServerState

It seems this warning was added back in MongoDB v1.8 as a reasonable guard against fields that shouldn't be included when executing the command on a mongos router (SERVER-2752), but it was never updated to accept the 'logical time' clock fields (SERVER-27745, since renamed "clusterTime") that were added in 3.6.

I see the v4.0 code no longer has this logic, but we won't be moving to 4.0 anytime soon. So, just for the next minor version on the v3.6 branch, could that block of code in the CollectionStats::run() function in mongo/s/commands/commands_public.cpp be fixed to also ignore the "operationTime", "gleStats", "clusterTime" and "configServerState" fields?
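For reference, a minimal reproduction sketch in the mongo shell, assuming a 3.6 sharded cluster; "test.coll" is just a hypothetical sharded collection, and the command is run through a mongos:

// hypothetical sharded namespace; any collStats on a sharded collection
// through a 3.6 mongos should trigger the warnings
db.getSiblingDB("test").runCommand({ collStats: "coll" })
// the mongos log then shows lines like:
//   W COMMAND  [connNN] mongos collstats doesn't know about: operationTime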



 Comments   
Comment by Spencer Brown [ 01/Jul/21 ]

This message is also occurring in mongos 4.0.24.

Comment by Kevin Pulo [ 20/Nov/18 ]

Ok, got it, thanks. We can keep this ticket for the log noise, and if you can open a support ticket to investigate the other issues (if you haven't already), then that would be best.

Comment by 章 黒鉄 [ 20/Nov/18 ]

Sorry, I wasn't clear. My intention in sharing this was to say 'here are some coinciding symptoms I've noticed since then'.

  • The collStats failures to return info from all shards correlate 1-to-1 as far as I can see now: every time it fails to return all shards' info, the log message appears, and presumably vice versa.
  • The dataSize command error is just something occurring on the same mongos nodes, but only on two of the three nodes where I've seen the collStats log message error.

So yes, whatever is causing them is not the thing that produces the log message; I just wanted to share more of the picture I encountered later in case it helps.
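A rough way to spot the truncation from the shell, just a sketch that assumes the rollup's "shards" sub-document should list every shard in config.shards (using the search2.options collection from the earlier comment below as the example):

var allShards = db.getSiblingDB("config").shards.find().map(function(s) { return s._id; });
var st = db.getSiblingDB("search2").getCollection("options").stats();
var seen = Object.keys(st.shards || {});
print("missing from rollup: " + allShards.filter(function(s) { return seen.indexOf(s) < 0; }));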

Comment by Kevin Pulo [ 20/Nov/18 ]

I can't see how this message could cause the mongos to sometimes target the wrong set of shards, or to otherwise return partial results, or affect dataSize in this way. Surely that must be some other problem(s), with some other cause (especially if FCV fixes it), which would be best investigated in a support context? Or have I misunderstood?

Comment by 章 黒鉄 [ 19/Nov/18 ]

@kevin.pulo I iterated through the 500+ mongos nodes we have attached to this cluster at the moment and found two that still reproduce the error when running collStats.

Furthermore, one of them also fails to execute dataSize:

mongos> db.collections.findOne({_id: "search2.providers"})
{
        "_id" : "search2.providers",
        "lastmodEpoch" : ObjectId("5bed019d5d3726903ef0b553"),
        "lastmod" : ISODate("1970-02-19T17:02:47.296Z"),
        "dropped" : false,
        "key" : {
                "_id" : 1
        },
        "unique" : false,
        "uuid" : UUID("7b2420d4-1cf0-456a-a4c7-7fd9a5c5dc91")
}
mongos> db.chunks.find({ns: "search2.providers"}).limit(3).pretty()
{
        "_id" : "search2.providers-_id_\"vZ9x_0InTBCD4zIdQvg41g\"",
        "lastmod" : Timestamp(1, 3),
        "lastmodEpoch" : ObjectId("5bed019d5d3726903ef0b553"),
        "ns" : "search2.providers",
        "min" : {
                "_id" : "vZ9x_0InTBCD4zIdQvg41g"
        },
        "max" : {
                "_id" : { "$maxKey" : 1 }
        },
        "shard" : "Trv01"
}
{
        "_id" : "search2.providers-_id_MinKey",
        "lastmod" : Timestamp(2, 0),
        "lastmodEpoch" : ObjectId("5bed019d5d3726903ef0b553"),
        "ns" : "search2.providers",
        "min" : {
                "_id" : { "$minKey" : 1 }
        },
        "max" : {
                "_id" : "-0hz0J38ToKPBWEKglGaig"
        },
        "shard" : "Trv05"
}
{
        "_id" : "search2.providers-_id_\"-0hz0J38ToKPBWEKglGaig\"",
        "lastmod" : Timestamp(3, 0),
        "lastmodEpoch" : ObjectId("5bed019d5d3726903ef0b553"),
        "ns" : "search2.providers",
        "min" : {
                "_id" : "-0hz0J38ToKPBWEKglGaig"
        },
        "max" : {
                "_id" : "9zTqdtYrSImaxX1DUIuAOw"
        },
        "shard" : "Trv04"
}
mongos> db.runCommand({dataSize: "search2.providers", keyPattern: {_id" : 1}, min: {  "_id" : "-0hz0J38ToKPBWEKglGaig" }, max: {  "_id" : "9zTqdtYrSImaxX1DUIuAOw" }})
2018-11-19T17:05:13.271+0900 E QUERY    [thread1] SyntaxError: missing : after property id @(shell):1:62
mongos> db.runCommand({dataSize: "search2.providers", keyPattern: {"_id" : 1}, min: {  "_id" : "-0hz0J38ToKPBWEKglGaig" }, max: {  "_id" : "9zTqdtYrSImaxX1DUIuAOw" }})
{
        "ok" : 0,
        "errmsg" : "keyPattern must equal shard key",
        "code" : 13408,
        "codeName" : "Location13408",
        "operationTime" : Timestamp(1542614729, 2107),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1542614729, 2107),
                "signature" : {
                        "hash" : BinData(0,"EjW4z3sRdDuHtNie1VlulspbRbE="),
                        "keyId" : NumberLong("6591042978351939586")
                }
        }
}

Want me to poke around with this whilst the in-memory state is still screwy? The database namespace in question will be dropped soon.
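A sketch of the retry I have in mind (assuming stale routing metadata is the culprit; flushRouterConfig is the standard command to drop the router's cached routing table):

db.adminCommand({ flushRouterConfig: 1 })
db.runCommand({ dataSize: "search2.providers", keyPattern: { "_id" : 1 },
                min: { "_id" : "-0hz0J38ToKPBWEKglGaig" },
                max: { "_id" : "9zTqdtYrSImaxX1DUIuAOw" } })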

Comment by 章 黒鉄 [ 16/Nov/18 ]

Ah wait, I found this is not just log noise; it also causes truncation of the data in the sporadic cases where it occurs.

E.g. we have an "options" collection in the "search2" db whose data size across the shards is 4924626332 + 4698330239 + 4855230485 + 4234303379 + 5047820027 ≈ 22 GB. Here is collStats run on each shard directly:

[kuroganeakir01@atrrelease201z ~]$ for shpr in 101 102 103 104 105; do ~/bin/mongo "mongodb://mongosvr${shpr}-n1:27018/config" --authenticationDatabase admin --quiet --eval 'var clst = db.getSiblingDB("search2").getCollection("options").stats(); delete clst.wiredTiger; print(rs.status().set); printjson(clst);'; done
Trv01
{
        "ns" : "search2.options",
        "size" : 4924626332,
        "count" : 133713,
        "avgObjSize" : 36829,
        "storageSize" : 8575307776,
        "capped" : false,
        "nindexes" : 4,
        "totalIndexSize" : 82112512,
        "indexSizes" : {
                "_id_" : 37560320,
                "loc_2dsphere" : 7057408,
                "pid_1_ocd_1" : 23924736,
                "al_1" : 13570048
        },
        "ok" : 1,
        "operationTime" : Timestamp(1542345914, 488),
        "$gleStats" : {
                "lastOpTime" : Timestamp(0, 0),
                "electionId" : ObjectId("7fffffff0000000000000006")
        },
        "$configServerState" : {
                "opTime" : {
                        "ts" : Timestamp(1542345914, 103),
                        "t" : NumberLong(4)
                }
        },
        "$clusterTime" : {
                "clusterTime" : Timestamp(1542345914, 489),
                "signature" : {
                        "hash" : BinData(0,"ZldOtYj2QcTx/gJMBfMIiyZ3xUc="),
                        "keyId" : NumberLong("6591042978351939586")
                }
        }
}
Trv02
{
        "ns" : "search2.options",
        "size" : 4698330239,
        "count" : 128115,
        "avgObjSize" : 36672,
        "storageSize" : 1151541248,
        "capped" : false,
        "nindexes" : 4,
        "totalIndexSize" : 17940480,
        "indexSizes" : {
                "_id_" : 4055040,
                "loc_2dsphere" : 2265088,
                "pid_1_ocd_1" : 5627904,
                "al_1" : 5992448
        },
        "ok" : 1,
        "operationTime" : Timestamp(1542345914, 925),
        "$gleStats" : {
                "lastOpTime" : Timestamp(0, 0),
                "electionId" : ObjectId("000000000000000000000000")
        },
        "$configServerState" : {
                "opTime" : {
                        "ts" : Timestamp(1542345914, 1),
                        "t" : NumberLong(4)
                }
        },
        "$clusterTime" : {
                "clusterTime" : Timestamp(1542345914, 1062),
                "signature" : {
                        "hash" : BinData(0,"ZldOtYj2QcTx/gJMBfMIiyZ3xUc="),
                        "keyId" : NumberLong("6591042978351939586")
                }
        }
}
Trv03
{
        "ns" : "search2.options",
        "size" : 4855230485,
        "count" : 132129,
        "avgObjSize" : 36746,
        "storageSize" : 1187561472,
        "capped" : false,
        "nindexes" : 4,
        "totalIndexSize" : 18354176,
        "indexSizes" : {
                "_id_" : 4206592,
                "loc_2dsphere" : 2301952,
                "pid_1_ocd_1" : 5943296,
                "al_1" : 5902336
        },
        "ok" : 1,
        "operationTime" : Timestamp(1542345914, 1756),
        "$gleStats" : {
                "lastOpTime" : Timestamp(0, 0),
                "electionId" : ObjectId("7fffffff0000000000000007")
        },
        "$configServerState" : {
                "opTime" : {
                        "ts" : Timestamp(1542345914, 1380),
                        "t" : NumberLong(4)
                }
        },
        "$clusterTime" : {
                "clusterTime" : Timestamp(1542345914, 1757),
                "signature" : {
                        "hash" : BinData(0,"ZldOtYj2QcTx/gJMBfMIiyZ3xUc="),
                        "keyId" : NumberLong("6591042978351939586")
                }
        }
}
Trv04
{
        "ns" : "search2.options",
        "size" : 4234303379,
        "count" : 115248,
        "avgObjSize" : 36740,
        "storageSize" : 1035575296,
        "capped" : false,
        "nindexes" : 4,
        "totalIndexSize" : 16306176,
        "indexSizes" : {
                "_id_" : 3649536,
                "loc_2dsphere" : 1871872,
                "pid_1_ocd_1" : 5070848,
                "al_1" : 5713920
        },
        "ok" : 1,
        "operationTime" : Timestamp(1542345914, 2345),
        "$gleStats" : {
                "lastOpTime" : Timestamp(0, 0),
                "electionId" : ObjectId("7fffffff0000000000000006")
        },
        "$configServerState" : {
                "opTime" : {
                        "ts" : Timestamp(1542345914, 1773),
                        "t" : NumberLong(4)
                }
        },
        "$clusterTime" : {
                "clusterTime" : Timestamp(1542345914, 2345),
                "signature" : {
                        "hash" : BinData(0,"ZldOtYj2QcTx/gJMBfMIiyZ3xUc="),
                        "keyId" : NumberLong("6591042978351939586")
                }
        }
}
Trv05
{
        "ns" : "search2.options",
        "size" : 5047820027,
        "count" : 136722,
        "avgObjSize" : 36920,
        "storageSize" : 1234747392,
        "capped" : false,
        "nindexes" : 4,
        "totalIndexSize" : 19038208,
        "indexSizes" : {
                "_id_" : 4341760,
                "loc_2dsphere" : 2342912,
                "pid_1_ocd_1" : 6123520,
                "al_1" : 6230016
        },
        "ok" : 1,
        "operationTime" : Timestamp(1542345914, 2995),
        "$gleStats" : {
                "lastOpTime" : Timestamp(0, 0),
                "electionId" : ObjectId("7fffffff0000000000000007")
        },
        "$configServerState" : {
                "opTime" : {
                        "ts" : Timestamp(1542345914, 2822),
                        "t" : NumberLong(4)
                }
        },
        "$clusterTime" : {
                "clusterTime" : Timestamp(1542345914, 2996),
                "signature" : {
                        "hash" : BinData(0,"ZldOtYj2QcTx/gJMBfMIiyZ3xUc="),
                        "keyId" : NumberLong("6591042978351939586")
                }
        }
}

This is what I got when I ran collStats on one mongos node, before running flushRouterConfig. This particular mongos is a DBA-use node that had high uptime and rarely any user db writes coming through it.

[kuroganeakir01@atrrelease201z ~]$ ~/bin/mongo "mongodb://mongossvr:27017/config" --authenticationDatabase admin --quiet --eval 'var clst = db.getSiblingDB("search2").getCollection("options").stats(); for (shnm in clst.shards) {delete clst.shards[shnm].wiredTiger}; printjson(clst);'
{
        "sharded" : true,
        "capped" : false,
        "ns" : "search2.options",
        "count" : 133713,
        "size" : 4924626332,
        "storageSize" : 8575307776,
        "totalIndexSize" : 82112512,
        "indexSizes" : {
                "_id_" : 37560320,
                "al_1" : 13570048,
                "loc_2dsphere" : 7057408,
                "pid_1_ocd_1" : 23924736
        },
        "avgObjSize" : 36829,
        "nindexes" : 4,
        "nchunks" : 9,
        "shards" : {
                "Trv01" : {
                        "ns" : "search2.options",
                        "size" : 4924626332,
                        "count" : 133713,
                        "avgObjSize" : 36829,
                        "storageSize" : 8575307776,
                        "capped" : false,
                        "nindexes" : 4,
                        "totalIndexSize" : 82112512,
                        "indexSizes" : {
                                "_id_" : 37560320,
                                "loc_2dsphere" : 7057408,
                                "pid_1_ocd_1" : 23924736,
                                "al_1" : 13570048
                        },
                        "ok" : 1,
                        "operationTime" : Timestamp(1542345715, 3164),
                        "$gleStats" : {
                                "lastOpTime" : Timestamp(0, 0),
                                "electionId" : ObjectId("7fffffff0000000000000006")
                        },
                        "$configServerState" : {
                                "opTime" : {
                                        "ts" : Timestamp(1542345715, 3088),
                                        "t" : NumberLong(4)
                                }
                        },
                        "$clusterTime" : {
                                "clusterTime" : Timestamp(1542345715, 3164),
                                "signature" : {
                                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                                        "keyId" : NumberLong(0)
                                }
                        }
                }
        },
        "ok" : 1,
        "operationTime" : Timestamp(1542345715, 3107),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1542345715, 3107),
                "signature" : {
                        "hash" : BinData(0,"+2CQJ3h0cWoz+M7qHWYu/jRkwGw="),
                        "keyId" : NumberLong("6591042978351939586")
                }
        }
}

Comment by 章 黒鉄 [ 15/Nov/18 ]

My suggested patch was just working with the paradigm in place, so pretty dull I agree. I just wanted some log noise to go away sooner rather than later.

As the purpose of this log message seems to be to flag when future development breaks this logic, deleting it and creating a unit test instead is probably the better option, I guess.
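Something along these lines, as a rough jstest-style sketch using the standard ShardingTest helper (names are illustrative only, this isn't an existing test):

var st = new ShardingTest({shards: 2});
assert.commandWorked(st.s.adminCommand({enableSharding: "test"}));
assert.commandWorked(st.s.adminCommand({shardCollection: "test.coll", key: {_id: 1}}));
st.s.getDB("test").coll.insert({_id: 1});
// the mongos rollup should succeed and not choke on the 3.6 metadata fields
var res = st.s.getDB("test").runCommand({collStats: "coll"});
assert.commandWorked(res);
assert(res.hasOwnProperty("count"), tojson(res));
st.stop();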

Comment by Kevin Pulo [ 15/Nov/18 ]

As a general principle, commands that are intolerant of unexpected (but harmless) fields tend to be backward-brittle and just make upgrades/multi-version issues needlessly complicated and annoying.

Rather than manufacturing busywork for ourselves whenever a new meta field is added, can we just get rid of this message? Is there actually any value in it anymore? Another option is to just demote it to LOG(1) or LOG(2).

Comment by 章 黒鉄 [ 23/Aug/18 ]

That's true; I've been a signed-up contributor since before I even worked at MongoDB. I'm blocked from pushing to any external git repo at work, though.

On the basis that appendRequiredFieldsToResponse() in strategy.cpp is already doing the right thing with the "operationTime" and "$clusterTime" fields that make it to the top level of the collStats output (or at least is doing the best that the 3.6 branch ever will), the patch I suggest is:

diff --git a/src/mongo/s/commands/commands_public.cpp b/src/mongo/s/commands/commands_public.cpp
index 6f2249f37c..217b96e514 100644
--- a/src/mongo/s/commands/commands_public.cpp
+++ b/src/mongo/s/commands/commands_public.cpp
@@ -804,6 +804,15 @@ public:
                             warnedAboutIndexes = true;
                         }
                     }
+                // logical clock fields added in 3.6
+                } else if (str::equals(e.fieldName(), "operationTime")) {
+                    // skip this field in the rollup
+                } else if (str::equals(e.fieldName(), "$clusterTime")) {
+                    // skip this field in the rollup
+                } else if (str::equals(e.fieldName(), "$gleStats")) {
+                    // skip this field in the rollup
+                } else if (str::equals(e.fieldName(), "$configServerState")) {
+                    // skip this field in the rollup
                 } else {
                     warning() << "mongos collstats doesn't know about: " << e.fieldName();
                 }

I.e. skip them so they never reach the final else clause, which complains about them because they're unknown to the shard-iterating block that sums count, sums size, recalculates avgObjSize, etc.

Comment by Kelsey Schubert [ 22/Aug/18 ]

Hi akira,

Would you like to open a pull request with the changes you've suggested? It looks like you already signed the contributor agreement a while back.

Best wishes,
Kelsey
