[SERVER-23650] Mongodb distinct not using indexes with query ? Created: 12/Apr/16  Updated: 15/Apr/16  Resolved: 15/Apr/16

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 3.0.11, 3.2.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Feng Yu Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

Here is my docs look like:

shifudaoRs:SECONDARY> db.equipment.rtdata.find()
{ "_id" : ObjectId("5610093899695e7062add4ac"), "timestamp" : NumberLong("1443891603962"), "1" : 0, "2" : 0, "3" : 26, "4" : 2530971136, "5" : 655369984, "38" : 0, "37" : 0, "36" : 1, "35" : 0, "34" : 0, "33" : 0, "32" : 1, "31" : 1, "54" : 0, "53" : 0, "52" : 0, "51" : 0, "50" : 0, "49" : 0, "48" : 0, "47" : 1, "70" : 1, "69" : 0, "68" : 0, "67" : 1, "66" : 0, "65" : 1, "64" : 0, "63" : 1, "isLoad" : 0, "isRun" : 0, "power" : 41.25, "6" : 10000, "7" : 10000, "8" : 20545, "9" : 0, "10" : 1280, "86" : 0, "85" : 0, "84" : 0, "83" : 0, "82" : 0, "81" : 0, "80" : 0, "79" : 0, "src" : NumberLong(33) }
{ "_id" : ObjectId("5610094299695e7062add4c4"), "timestamp" : NumberLong("1443891613010"), "1" : 0, "2" : 0, "3" : 26, "4" : 2530971136, "5" : 655369984, "38" : 0, "37" : 0, "36" : 1, "35" : 0, "34" : 0, "33" : 0, "32" : 1, "31" : 1, "54" : 0, "53" : 0, "52" : 0, "51" : 0, "50" : 0, "49" : 0, "48" : 0, "47" : 1, "70" : 1, "69" : 0, "68" : 0, "67" : 1, "66" : 0, "65" : 1, "64" : 0, "63" : 1, "isLoad" : 0, "isRun" : 0, "power" : 41.25, "6" : 10000, "7" : 10000, "8" : 20545, "9" : 0, "10" : 1280, "86" : 0, "85" : 0, "84" : 0, "83" : 0, "82" : 0, "81" : 0, "80" : 0, "79" : 0, "src" : NumberLong(33) }
{ "_id" : ObjectId("5610094c99695e7062add4da"), "timestamp" : NumberLong("1443891624859"), "1" : 0, "2" : 0, "3" : 26, "4" : 2530971136, "5" : 655369984, "38" : 0, "37" : 0, "36" : 1, "35" : 0, "34" : 0, "33" : 0, "32" : 1, "31" : 1, "54" : 0, "53" : 0, "52" : 0, "51" : 0, "50" : 0, "49" : 0, "48" : 0, "47" : 1, "70" : 1, "69" : 0, "68" : 0, "67" : 1, "66" : 0, "65" : 1, "64" : 0, "63" : 1, "isLoad" : 0, "isRun" : 0, "power" : 41.25, "6" : 10000, "7" : 10000, "8" : 20545, "9" : 0, "10" : 1280, "86" : 0, "85" : 0, "84" : 0, "83" : 0, "82" : 0, "81" : 0, "80" : 0, "79" : 0, "src" : NumberLong(33) }
{ "_id" : ObjectId("5610095699695e7062add4f1"), "timestamp" : NumberLong("1443891634234"), "1" : 0, "2" : 0, "3" : 26, "4" : 2530971136, "5" : 655369984, "38" : 0, "37" : 0, "36" : 1, "35" : 0, "34" : 0, "33" : 0, "32" : 1, "31" : 1, "54" : 0, "53" : 0, "52" : 0, "51" : 0, "50" : 0, "49" : 0, "48" : 0, "47" : 1, "70" : 1, "69" : 0, "68" : 0, "67" : 1, "66" : 0, "65" : 1, "64" : 0, "63" : 1, "isLoad" : 0, "isRun" : 0, "power" : 41.25, "6" : 10000, "7" : 10000, "8" : 20545, "9" : 0, "10" : 1280, "86" : 0, "85" : 0, "84" : 0, "83" : 0, "82" : 0, "81" : 0, "80" : 0, "79" : 0, "src" : NumberLong(33) }

And here is my indexes:

shifudaoRs:SECONDARY> db.equipment.rtdata.getIndexes()
[
	{
		"v" : 1,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_",
		"ns" : "aircareRealtimeDB.equipment.rtdata"
	},
	{
		"v" : 1,
		"key" : {
			"src" : 1,
			"timestamp" : -1
		},
		"name" : "src_1_timestamp_-1",
		"ns" : "aircareRealtimeDB.equipment.rtdata",
		"background" : true
	},
	{
		"v" : 1,
		"key" : {
			"src" : 1
		},
		"name" : "src_1",
		"ns" : "aircareRealtimeDB.equipment.rtdata",
		"background" : true
	},
	{
		"v" : 1,
		"key" : {
			"timestamp" : -1
		},
		"name" : "timestamp_-1",
		"ns" : "aircareRealtimeDB.equipment.rtdata",
		"background" : true
	}
]

I want distinct("src") by timestamp duration.

I find only distinct the whole src executed very fast:

shifudaoRs:SECONDARY> db.runCommand({ distinct: 'equipment.rtdata',key:'src'})
{
	"values" : [
      ... SNIP ...
	],
	"stats" : {
		"n" : 467,
		"nscanned" : 467,
		"nscannedObjects" : 0,
		"timems" : 17,
		"planSummary" : "DISTINCT { src: 1 }"
	},
	"ok" : 1
}

But when I add a query about the timestamp duration, will be very slow.

db.runCommand({ distinct: 'equipment.rtdata',key:'src', query: {timestamp: {$gte:0, $lt:999999999999999999999}}})

Of course this will be the same result, but execute very slow. And have not finished for now.

shifudaoRs:SECONDARY> db.currentOp()
{
	"inprog" : [
		{
			"desc" : "conn51410",
			"threadId" : "0xf1dfd40",
			"connectionId" : 51410,
			"opid" : 20780367,
			"active" : true,
			"secs_running" : 2304,
			"microsecs_running" : NumberLong("2304321577"),
			"op" : "query",
			"ns" : "aircareRealtimeDB.equipment.rtdata",
			"query" : {
				"distinct" : "equipment.rtdata",
				"key" : "src",
				"query" : {
					"timestamp" : {
						"$gte" : 0,
						"$lt" : 1e+21
					}
				}
			},
			"client" : "127.0.0.1:50387",
			"numYields" : 452037,
			"locks" : {
				"Global" : "r",
				"Database" : "r",
				"Collection" : "r"
			},
			"waitingForLock" : false,
			"lockStats" : {
				"Global" : {
					"acquireCount" : {
						"r" : NumberLong(904076)
					},
					"acquireWaitCount" : {
						"r" : NumberLong(4683)
					},
					"timeAcquiringMicros" : {
						"r" : NumberLong(10905279)
					}
				},
				"Database" : {
					"acquireCount" : {
						"r" : NumberLong(452038)
					}
				},
				"Collection" : {
					"acquireCount" : {
						"r" : NumberLong(452038)
					}
				}
			}
		}
	]
}

Maybe mongodb not use the "src" indexes? How could I optimize this distinct ?



 Comments   
Comment by Feng Yu [ 15/Apr/16 ]

Thanks. I've got it. Waiting for the improvement.

Comment by Kelsey Schubert [ 15/Apr/16 ]

Hi abcfy2,

Thanks for reporting this behavior. MongoDB is able to make certain optimizations when executing your first query

db.runCommand({ distinct: 'equipment.rtdata',key:'src'})

that it cannot make on your second query

db.runCommand({ distinct: 'equipment.rtdata',key:'src', query: {timestamp: {$gte:0, $lt:999999999999999999999}}})

The first query uses a plan called DISTINCT_SCAN, whereas the second uses an IXSCAN. The second query cannot include these optimizations because it has to do a range match on the timestamp field. Consequently, it is expected that this query may take significantly longer to complete. Despite the difference in execution time, both of these generated query plans utilize the appropriate indexes.

If you would like to perform queries of the second shape more quickly, I would recommend that you consider the following steps and determine whether this approach would fit your use case.

  1. Ensure you are using MongoDB 3.2 to take advantage of SERVER-12015
  2. Create an index on {"timestamp" : -1, "src" : 1}
  3. Rewrite your distinct command as an aggregation pipeline:

    db.equipment.rtdata.aggregate([{$match: {timestamp: {$gte:0, $lt:999999999999999999999}}},{$group: {_id:"1",ids:{$addToSet: "$src"}}},{$project: {"_id":0,"ids":1}}])
    

You may also want to watch SERVER-19507, which would remove the need to rewrite this query as an aggregation pipeline to use a covered query plan (as described in the third step).

Kind regards,
Thomas

Generated at Thu Feb 08 04:04:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.