[SERVER-15069] Better validation of stored geometries without failing index builds Created: 28/Aug/14  Updated: 28/Dec/23

Status: Backlog
Project: Core Server
Component/s: Geo, Index Maintenance
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: David Hows Assignee: Backlog - Query Integration
Resolution: Unresolved Votes: 0
Labels: qi-geo
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Query Integration
Participants:

 Description   

Currently, when building a Geo index on a collection with existing data, the build fails if any document being indexed contains an invalid geometry.

This means that the only way to build the index is to brute-force it: run the build, resolve the error it reports, and repeat after each failed build. This can be a costly and time-consuming operation, especially when a large amount of data is involved.

We should provide some mechanism to allow users to validate geometries without waiting for index builds to fail.



 Comments   
Comment by Robert Jobson [ 25/Sep/14 ]

The usual workaround, which was included in the original commercial support ticket that spawned this feature request, is to create a second collection with the index and copy records over, marking the ones that succeed with a flag. Once you have run through them all, you move the bad ones out. Something like:

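for (i = 0; i < 139; i++) {
    // Scratch collection that carries the index under test.
    db.collection_clean.ensureIndex({"geoJson": "2dsphere"});
    // Copy the next batch of unflagged documents; inserts with invalid geometries fail.
    // addOption(16) sets the noTimeout cursor flag.
    db.collection.find({c: {$exists: 0}}, {"geoJson": 1}).addOption(16).limit(500000).forEach(function(doc) {
        db.collection_clean.insert(doc);
    });
    // Flag every document that made it into the scratch collection.
    db.collection_clean.find({}, {}).addOption(16).forEach(function(doc) {
        db.collection.update({_id: doc._id}, {$set: {c: 1}});
    });
    db.collection_clean.drop();
}
// Documents still unflagged never made it into the indexed collection: move their geometry aside.
db.collection.find({c: {$exists: 0}}).forEach(function(doc) {
    db.collection.update({_id: doc._id}, {$set: {malgeo: doc.geoJson}, $unset: {geoJson: 1}});
});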

The request there was actually that a parameter be added to the index build to provide a field name for where to move the bad geometries. So rather than dying on finding a malformed geometry, the build would move it aside and continue. This would greatly reduce the amount of work required.

Bear in mind that we are dealing with data sets of a cumbersome size. The set that prompted this request took weeks to load and more weeks to clean.

Comment by Greg Studer [ 28/Aug/14 ]

Another option is to run one or more $geoWithin/$geoIntersects queries on the data in the unindexed collection - documents which are not valid GeoJSON will not be returned and can then be fixed.
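A minimal sketch of how that query-based check might be scripted, assuming a collection handle such as db.collection and a caller-supplied list of filters. The helper name findUnmatchedIds and the regionFilters parameter are hypothetical and not part of MongoDB. Each filter would be something like a $geoWithin or $geoIntersects query on the geometry field, and the regions have to be chosen by the caller so that together they cover the data. An _id that no filter returns is only a candidate for inspection: it may be invalid GeoJSON, or it may simply lie outside every region supplied.

```javascript
// Sketch only: `coll` is a collection handle such as db.collection, and
// `regionFilters` is a caller-supplied array of query documents, e.g.
// { geoJson: { $geoWithin: { $geometry: somePolygon } } } (hypothetical).
function findUnmatchedIds(coll, regionFilters) {
    var matched = {};
    // Record every _id returned by at least one geo query.
    regionFilters.forEach(function (filter) {
        coll.find(filter, { _id: 1 }).forEach(function (doc) {
            matched[String(doc._id)] = true;
        });
    });
    // Anything never returned is a candidate for manual inspection.
    var unmatched = [];
    coll.find({}, { _id: 1 }).forEach(function (doc) {
        if (!matched[String(doc._id)]) {
            unmatched.push(doc._id);
        }
    });
    return unmatched;
}
```

The matched set is held in memory, so on very large collections it would probably need to be processed in _id ranges or written to a scratch collection instead.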

Comment by Eric Milkie [ 28/Aug/14 ]

Can you create the new geo index on an empty collection and then write a script that attempts to copy each document? In this way, you could build up a list of which documents fail the index validation.
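A rough sketch of such a copy script, under stated assumptions: source and target are collection handles (for example db.collection and an empty db.collection_clean that already has the 2dsphere index), and collectIndexFailures is a hypothetical helper name. It handles both ways a rejected insert may surface: a WriteResult carrying a write error, as returned by the shell's insert(), or a thrown exception.

```javascript
// Sketch only: `source` and `target` are collection handles, e.g.
// db.collection and db.collection_clean (hypothetical names).
// `target` is assumed to be empty and to already carry the geo index.
function collectIndexFailures(source, target, fieldName) {
    var failures = [];
    var projection = {};
    projection[fieldName] = 1; // copy only _id and the geometry field
    source.find({}, projection).forEach(function (doc) {
        var message = null;
        try {
            var res = target.insert(doc);
            // The shell's insert() returns a WriteResult rather than throwing.
            if (res && typeof res.hasWriteError === "function" && res.hasWriteError()) {
                message = res.getWriteError().errmsg;
            }
        } catch (e) {
            // Other shells and drivers report a rejected write as an exception.
            message = e.message || String(e);
        }
        if (message !== null) {
            failures.push({ _id: doc._id, error: message });
        }
    });
    return failures;
}
```

Called as collectIndexFailures(db.collection, db.collection_clean, "geoJson"), it would return one entry per rejected document, with the _id and the error text, which can then be used to fix or move the offending geometries before building the real index.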

Generated at Thu Feb 08 03:36:50 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.