Hello.
I have 20 nodes in a cluster. Sometimes a random node freezes because it starts using the wrong index for a query.
Example:
2015-03-20T15:23:01.646+0100 I QUERY [conn2572] query goods.events query: { $query: { gt: ObjectId('53216161688fcfc9fed24755'), act: "attr", lang: "RU", ta: "name", st: 1 }, $readPreference: { mode: "nearest" } } planSummary: IXSCAN { act: 1, c: -1 } ntoskip:0 nscanned:36885811 nscannedObjects:36885811 keyUpdates:0 writeConflicts:0 numYields:295880 nreturned:1 reslen:382 locks:{} 29456ms
Usually this query uses the "gt_act_lang_ta" index.
events0003:PRIMARY> db.getSiblingDB('goods').getCollection('events').getIndexes()
[
    { "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "goods.events" },
    { "v" : 1, "key" : { "u" : -1 }, "name" : "updated", "ns" : "goods.events", "background" : true },
    { "v" : 1, "key" : { "tn" : 1 }, "name" : "tn", "ns" : "goods.events", "background" : true },
    { "v" : 1, "key" : { "c" : -1 }, "name" : "created", "ns" : "goods.events", "background" : false },
    { "v" : 1, "key" : { "t" : 1, "c" : -1 }, "name" : "type_create", "ns" : "goods.events", "background" : true },
    { "v" : 1, "key" : { "act" : 1, "c" : -1 }, "name" : "act_create", "ns" : "goods.events", "background" : true },
    { "v" : 1, "key" : { "ut" : 1, "act" : 1, "u" : 1 }, "name" : "ut_act_u", "ns" : "goods.events", "background" : true },
    { "v" : 1, "key" : { "ntfy" : 1 }, "name" : "ntfy_1", "ns" : "goods.events", "sparse" : true },
    { "v" : 1, "key" : { "ut2" : 1, "act" : 1, "u" : -1 }, "name" : "ut2_1_act_1_u_-1", "ns" : "goods.events", "sparse" : true },
    { "v" : 1, "key" : { "gt" : 1, "act" : 1, "lang" : 1, "ta" : 1 }, "name" : "gt_act_lang_ta", "ns" : "goods.events", "background" : true },
    { "v" : 1, "key" : { "act" : 1, "ta" : 1, "c" : 1 }, "name" : "act_ta_c", "ns" : "goods.events", "background" : true },
    { "v" : 1, "key" : { "exist" : 1, "act" : 1, "ut" : 1 }, "name" : "exist_act_ut", "ns" : "goods.events", "background" : true },
    { "v" : 1, "key" : { "ut" : 1, "u" : 1 }, "name" : "ut_u", "ns" : "goods.events", "background" : true },
    { "v" : 1, "unique" : true, "key" : { "gt" : 1, "ut" : 1, "act" : 1, "ta" : 1, "lang" : 1 }, "name" : "gt_1_ut_1_act_1_ta_1_lang_1", "ns" : "goods.events", "background" : true },
    { "v" : 1, "key" : { "gt" : 1, "act" : 1, "geo" : "2dsphere" }, "name" : "gt_1_act_1_geo_2dsphere", "ns" : "goods.events", "sparse" : true, "background" : true, "2dsphereIndexVersion" : 2 }
]
Thanks for testing SERVER-15225, esp1974. I'm closing this ticket as a duplicate of SERVER-15225; feel free to watch that ticket for further updates.
Regards,
Ramón.
Confirmed: SERVER-15225 fixes this issue. Waiting for a backport to the 2.6 branch.
David, if 3.1.3 fixes this issue, then for me 3.1.3 is much more stable than 2.6.10 or 3.0.3.
Sounds good, Dmitry. Let me know how it goes! I wanted to reiterate that 3.1.3 is an unstable release, so please keep that in mind as you test.
Best,
Dave
Thank you, David.
I'll install 3.1.3 on some production nodes where the issue occurs (the test environment doesn't have real load) and continue monitoring. The fix for SERVER-15225 looks like the same problem.
Hi Dmitry,
Thanks for the additional info. I looked through the plan cache output you provided, but unfortunately in all cases the output indicates that a plan using the "gt_act_lang_ta" index is in the cache. This means we have not captured information from when the system was in a bad state. I see two options at this point:
- Continue trying to capture plan cache output during a time when the bad plan selection is occurring.
- Test the fix for SERVER-15225. This fix is available in development release 3.1.3. Although this version is not suitable for production, if you have a test environment on which this problem manifests, you could confirm that SERVER-15225 fixes the problem. (I expect it to resolve this issue.) Confirming the fix for SERVER-15225 would give us more confidence when we evaluate whether this change can be backported to the 2.6 or 3.0 stable branches.
Best,
Dave
David, I caught some plans when the server was in a bad state (by the way, a restart helps resolve this).
I sent them by scp.
I hope this will help you.
Sorry, but I cannot upload the full data due to our user privacy policy ;(
Hi Dmitry,
1. Is there any difference between queries from a local mongo client and from mongos?
The plan selection issue itself should not be affected by whether or not queries are routed through mongos or delivered directly from a client to the mongod. However, I recommend that when running getPlansByQuery() to capture plan cache information you connect directly to the affected mongod rather than routing this command through mongos.
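For example, a minimal sketch (the hostname and port are placeholders for the affected replica set member, and the filter is the query shape from your log line). First connect straight to the mongod, not through mongos:
mongo shard-node.example.net:27018/goods
Then, in that shell:
db.setSlaveOk()
db.events.getPlanCache().getPlansByQuery(
    { gt: ObjectId('4f4387953d72220a7901cde3'), act: "attr", lang: "ANY", ta: "binding", st: 1 }
)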
2. Can this issue be due to background indexes?
Are you asking whether this issue is related to background index builds? I doubt it. It sounds like the problem lies within the query plan ranking and query plan caching mechanisms.
I can grant access to the server via Skype the next time this occurs.
If you would like to provide us with access to your data files, you can upload them as follows:
scp -P 722 -r <filename> SERVER-17674@www.mongodb.com:
Just press enter when prompted for a password.
I am putting this ticket back into "Waiting for User" state pending more information. Without plan cache output, a repro scenario, or more detailed logs, we can only guess at the root cause of the issue.
Best,
Dave
In the log file I see:
2015-05-20T13:24:11.279+0200 [conn63] query goods.events query: { gt: ObjectId('4f4387953d72220a7901cde3'), act: "attr", lang: "ANY", ta: "binding", st: 1 } planSummary: IXSCAN { act: 1, c: -1 } ntoreturn:1 ntoskip:0 nscanned:51987572 nscannedObjects:51987572 keyUpdates:0 numYields:320 locks(micros) r:115047487 nreturned:0 reslen:20 58054ms
But when I run PlanCache.getPlansByQuery() for this query 10-15 seconds later, the result is OK and shows the correct index.
My script for catching it:
use goods
db.setSlaveOk()
db.events.find(
{ gt: ObjectId('4f4387953d72220a7901cde3'), act: "attr", lang: "ANY", ta: "binding", st: 1 }
).explain(true);
db.events.getPlanCache().getPlansByQuery(
{ gt: ObjectId('4f4387953d72220a7901cde3'), act: "attr", lang: "ANY", ta: "binding", st: 1 }
);
db.events.getPlanCache().listQueryShapes()
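A non-interactive variant of the same capture, as a sketch: assuming it is saved as, say, capturePlans.js and run with "mongo <host>:<port>/goods capturePlans.js" directly against the affected mongod (the interactive "use goods" helper is not valid inside a script file, so the database goes in the connection string instead):
var q = { gt: ObjectId('4f4387953d72220a7901cde3'), act: "attr", lang: "ANY", ta: "binding", st: 1 };
db.setSlaveOk();
// capture the chosen plan, the cached plans, and all cached shapes in one pass
printjson(db.events.find(q).explain(true));
printjson(db.events.getPlanCache().getPlansByQuery(q));
printjson(db.events.getPlanCache().listQueryShapes());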
Some of my thoughts:
1. Is there any difference between queries from a local mongo client and from mongos?
2. Can this issue be due to background indexes?
I can grant access to the server via Skype the next time this occurs.
Hi Dmitry,
Sorry to hear that you are still experiencing this problem. At this point it looks like you have either
- encountered a novel plan selection bug, or
- hit a case that we expect to be fixed by SERVER-15225.
In order to determine whether or not this is a new issue, the most useful thing would be plan cache output captured while you are experiencing the bad behavior. If it is possible to capture this information, please paste the output of PlanCache.getPlansByQuery() run against the problematic query shape while you are experiencing the slow queries.
Alternatively, if you are willing to provide your data set, we can try to reproduce the issue in a test environment. In this case, I can provide instructions for a way to upload your data to a private server that is only accessible to the MongoDB technical team. Uploading the logs from the affected server covering the time window leading up to and including the slow queries would also be quite helpful in developing a repro.
Please let me know how you would like to proceed.
Best,
Dave
2.6.10 did not fix this issue.
After long testing, unfortunately neither 2.6.9 nor 3.0.3 fixed this issue, but the frequency of the symptoms has decreased.
Hi Dmitry,
I am closing this ticket as Incomplete. Feel free to re-open if you are still experiencing the problem.
Best,
Dave
Hi Dmitry,
I haven't heard from you in a few days. Are you still experiencing either of the issues you reported? Please see my previous comment for further information that may help lead to a diagnosis.
Best,
Dave
Hi Dmitry,
Thanks for the additional info. If you see the plan selection problem again now that you are on 2.6.9, please let us know. Glad to hear that so far 2.6.9 looks good.
As far as the "freezing" problem on 3.0.1, I would like to gather a little bit more info in order to understand the symptoms and the potential root cause.
- Are your 3.0.1 servers configured with WiredTiger or with the default MMAPv1 storage engine? (The snippet after this list shows one way to check.)
- Can you paste example log lines for the queries that "freeze" the cluster?
- You say that the query is a "non-index query". Does this mean that you have not built an index for this query, i.e. you are expecting the query to scan the full collection?
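A quick way to check which storage engine a 3.0.1 node is running, as a sketch (run against each mongod; on 2.6 the storageEngine section is simply absent):
db.serverStatus().storageEngine
// example output (abridged): { "name" : "wiredTiger", ... } or { "name" : "mmapv1" }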
Best,
Dave
Hello Dave.
I upgraded from 2.6.8 to 2.6.9 two days ago. It looks like there is no issue at the moment, but I will continue monitoring.
I have seen this issue on 3.0.1 too ;( (at the moment all nodes are rolled back to the 2.6 branch because the cluster was "freezing" on the 3.0 branch), so I cannot confirm SERVER-14071 as my problem.
I tried to capture a bad plan. It is very hard: the issue hits a random node and a random query on a heavily loaded server, and it happens very rarely. It would be good if MMS aggregated profiling data from all nodes in one place so the problem could be located quickly.
Freezing is big trouble for me, and I'm sure most DBAs observe this. Any non-indexed query on a big collection (>100M docs) freezes a node for a long time on 3.0.1 with either storage engine (WiredTiger/MMAPv1). On 2.6 the server load gets high, yes, but it does not freeze. The combination of a bad plan and freezing kills the cluster for a long time in my case; mongos waits for this node indefinitely.
WBW,
Dmitry.
Hi Dmitry,
It appears that a bad plan is making its way into the plan cache. I took a look at the attached plan cache output in out.txt and out2.txt in order to see if there is enough to identify the root cause. Nothing here jumps out to me as abnormal, but I suspect that this output was captured at a time when the correct plan was in the cache. Capturing the plan cache output during a time window where the bad plan is being used would be ideal, although I know this may be difficult if the problem reproduces intermittently.
A few questions:
- The plan selection problems observed on 2.6.8 could be a symptom of SERVER-14071, fixed in 3.0.0 and 2.6.9. Have you seen the problem on a 3.0.1 server, or only on 2.6.8?
- As a follow-up, would it be possible to test whether an upgrade to 2.6.9 would resolve the issue?
In general, the more information you can provide that might help us reproduce the problem, the better.
BTW, another bad thing on 3.0.1 (not 2.6.8) with MMAPv1:
If one query puts a node under high load, all other queued queries "freeze", so the whole sharded cluster "freezes" ;(
This sounds to me like a separate problem. If you can gather evidence of a bug affecting version 3.0.1, I suggest opening a separate ticket in the SERVER project. I would like to keep this ticket focused on the bad query plan selection behavior, as originally reported.
Thanks for your help on this!
Best,
Dave
The info is in the attachment, added a bit later.
mongo 2.6.8
Again on the same collection, another query that has a matching index on the ut2 field:
{ "v" : 1, "key" : { "ut2" : 1, "act" : 1, "u" : -1 }, "name" : "ut2_1_act_1_u_-1", "ns" : "goods.events", "sparse" : true }
uses the wrong index:
{ "v" : 1, "key" : { "u" : -1 }, "name" : "updated", "ns" : "goods.events", "background" : true }
Within just 7 minutes, the log shows:
2015-03-27T16:31:03.494+0100 [conn35374] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('550bdc6d1202e46e7f1e4b02'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:9303 locks(micros) r:490412498 reslen:48 257578ms
2015-03-27T16:31:03.515+0100 [conn35379] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('550bdc6d1202e46e7f1e4b02'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:9309 locks(micros) r:490369656 reslen:48 257424ms
2015-03-27T16:31:03.831+0100 [conn35344] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('550bdc6d1202e46e7f1e4b02'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:6558 locks(micros) r:490399347 reslen:48 255497ms
2015-03-27T16:31:08.322+0100 [conn35367] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('550bdc6d1202e46e7f1e4b02'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:3651 locks(micros) r:490113029 reslen:48 252279ms
2015-03-27T16:31:08.451+0100 [conn35740] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('5513fb0deb25338db5c75509'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:2748 locks(micros) r:488140410 reslen:48 249992ms
2015-03-27T16:31:08.675+0100 [conn35387] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('550bdc6d1202e46e7f1e4b02'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:3661 locks(micros) r:490754888 reslen:48 252496ms
2015-03-27T16:31:16.007+0100 [conn35371] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('550bdc6d1202e46e7f1e4b02'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:2831 locks(micros) r:489071997 reslen:48 249729ms
2015-03-27T16:31:16.038+0100 [conn35415] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('550bdc6d1202e46e7f1e4b02'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:2860 locks(micros) r:487114000 reslen:48 249758ms
2015-03-27T16:31:16.292+0100 [conn36605] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('550bdc6d1202e46e7f1e4b02'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:3106 locks(micros) r:493173971 reslen:48 252764ms
2015-03-27T16:31:17.813+0100 [conn35362] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('5513fb0deb25338db5c75509'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:2879 locks(micros) r:487106709 reslen:48 249284ms
2015-03-27T16:31:22.835+0100 [conn35337] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('5513fb0deb25338db5c75509'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:2873 locks(micros) r:477315899 reslen:48 244214ms
2015-03-27T16:31:23.275+0100 [conn35338] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('54cbbf3d23612623b6d5bce2'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:2810 locks(micros) r:484461876 reslen:48 248030ms
2015-03-27T16:38:43.396+0100 [conn35354] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('55157892a80f76f41189a323'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:18821 locks(micros) r:453814956 reslen:48 239953ms
2015-03-27T16:38:43.414+0100 [conn35347] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('55157892a80f76f41189a323'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:11789 locks(micros) r:412537455 reslen:48 216602ms
2015-03-27T16:38:43.417+0100 [conn35985] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('55157892a80f76f41189a323'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:14736 locks(micros) r:440894380 reslen:48 231896ms
2015-03-27T16:38:43.431+0100 [conn36605] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('55157892a80f76f41189a323'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:12841 locks(micros) r:411715453 reslen:48 216538ms
2015-03-27T16:38:43.460+0100 [conn35357] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('55157892a80f76f41189a323'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:15359 locks(micros) r:439861219 reslen:48 231871ms
2015-03-27T16:38:43.478+0100 [conn45879] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('55157892a80f76f41189a323'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:14753 locks(micros) r:437537452 reslen:48 230249ms
2015-03-27T16:38:43.488+0100 [conn35356] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('55157892a80f76f41189a323'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:4744 locks(micros) r:376841670 reslen:48 195377ms
2015-03-27T16:38:43.496+0100 [conn35337] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('55157892a80f76f41189a323'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:15630 locks(micros) r:435973668 reslen:48 229998ms
2015-03-27T16:38:43.501+0100 [conn35382] command goods.$cmd command: count { count: "events", query: { ut2: ObjectId('55157892a80f76f41189a323'), u: { $gt: new Date(0) } } } planSummary: IXSCAN { u: -1 } ntoreturn:1 keyUpdates:0 numYields:21201 locks(micros) r:452904440 reslen:48 240344ms
BTW, another bad thing on 3.0.1 (not 2.6.8) with MMAPv1:
If one query puts a node under high load, all other queued queries "freeze", so the whole sharded cluster "freezes" ;(
Here is an attachment with a normal query and normal behaviour.
The bad behaviour is hard to detect: it happens randomly, once or twice a day, on a random node. I can only see it in the logs.
Thanks for this report. To help us debug this issue, can you provide the output of the following commands:
db.collection.getPlanCache().listQueryShapes();
db.collection.getPlanCache().getPlansByQuery(<query>);
db.collection.find(<query>).explain(true);
Replace <query> with your actual query. This will let us better understand the behavior of the query system in this context.
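For instance, using the query shape from the log line above, the invocation might look like this (a sketch; adjust the database, collection, and filter to match your deployment):
use goods
db.events.getPlanCache().listQueryShapes();
db.events.getPlanCache().getPlansByQuery(
    { gt: ObjectId('53216161688fcfc9fed24755'), act: "attr", lang: "RU", ta: "name", st: 1 }
);
db.events.find(
    { gt: ObjectId('53216161688fcfc9fed24755'), act: "attr", lang: "RU", ta: "name", st: 1 }
).explain(true);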
Regards,
sam