-
Type:
Improvement
-
Resolution: Unresolved
-
Priority:
Unknown
-
None
-
Affects Version/s: django-mongodb-backend-5.2.0b1
-
Component/s: django
-
None
-
Python Drivers
-
None
-
None
-
None
-
None
-
None
-
None
Summary
Joins in Many to Many fields are slow
Motivation
Who is the affected end user?
Our users
How does this affect the end user?
The queries are slow (a few seconds)
How likely is it that this problem or use case will occur?
Every time the "pivot table" collection is large and has many occurences of the outer foreign key
If the problem does occur, what are the consequences and how severe are they?
Long running query
Is this issue urgent?
yes, we cannot use Many To Many fields like this
Is this ticket required by a downstream team?
NA
Is this ticket only for tests?
No
This issue was also reported on the github of django mongo backend, with a tentative patch:
https://github.com/mongodb/django-mongodb-backend/issues/309
Details:
I have an Observation model that has a ManyToMany to a Subscope model.
The observations collection is large (100k), the subscopes collection isn't, but the same subscope appears a lot in observations, so it is often in the pivot collection created by django. Namely a collection with (_id, observation_id and subscope_id), the composite unique index on observation_id and subscope_id is automatically generated)
When I access the many to many of an observation instance:
observation.subscopes.all()
A slow query is generated (this is with count() at the end but it's the same:
[{
'$lookup': {
'from': 'orm_observation_subscopes',
'let':
,
'pipeline': [{
'$match': {
'$expr': {
'$and': [
]
}
}
}],
'as': 'orm_observation_subscopes'
}
},
, {
'$match': {
'$expr':
}
}, {
'$facet': {
'group': [{
'$group': {
'__count': {
'$sum': {
'$cond': {
'if': {
'$in': [{
'$type':
},
['missing', 'null']
]
},
'then': None,
'else': 1
}
}
},
'_id': None
}
}]
}
}, {
'$addFields': {
'__count': {
'$getField': {
'input':
,
'field': '__count'
}
},
'_id': {
'$getField': {
'input':
,
'field': '_id'
}
}
}
}, {
'$project': {
'__count': {
'$ifNull': ['$__count',
]
}
}
}]
The match contains the join field, but that's after the lookup. This query takes 3 seconds.
If I change this query manually to add the match in the lookup like this:
[{
'$lookup': {
'from': 'orm_observation_subscopes',
'let':
,
'pipeline': [{
'$match': {
'$expr': {
'$and': [
,
]
}
}
}],
'as': 'orm_observation_subscopes'
}
},
,
]
It takes 0.020 seconds.
The first version of course is only slow if the pivot collection is large, it is expected. Moving the match in the lookup prevents the whole collection from being scanned.
I added a fairly naive but working patch to the github issue, that could certainly be improved.