Options documentation would need to be updated
As we convert the tools to the new Go driver, the original PR will not apply. Instead, we'll implement higher-performance bulk insert/update built on the new Go driver bulk API, including the higher batch size limit.
The request to add a "Remove" mode has been pulled out to
TOOLS-2268 for separate triage
The below changes were implemented after consulting with our Mongo rep Anant Srivastava to meet internal implementation needs. I will be opening a pull request shortly with our changes for review in case some/all of these changes want to be rolled into the product.
Enable bulk upsert operations. In the live version of mongoimport, running in upsert mode limits to 1 insertion worker process and an effective batch size of 1. This results in performance that unfortunately rendered mongoimport not viable for our volumes. With the addition of bulk, multi-worker upserts, we are seeing a 400-700X performance boost. With this performance tweak, mongoimport became a viable tool for our update process.
--bulkUpdate command line option added. When toggled on, upserts can be executed in bulk and in multiple worker processes. This option was added to limit the impact to existing processes using mongoimport. There is some debate on whether this flag is necessary or if 'bulkUpdate' mode should be 'on' by default and toggled 'off' via the --maintainInsertionOrder option
The change for 'bulkUpdate' upsert mode was implemented through disabling maintainInsertionOrder, removing the restriction for 1 insertion worker and adding new method to BufferedBulkInserter to support bulk Upsert operations.
--mode remove option added. Will construct bson selectors using records from input file and --upsertFields to remove matching documents. Each selector will remove only a single matching document. Implemented through adding new method to BufferedBulkInserter to support bulk Remove operations.
--upsertFields are required when specifying this option.
With the MongoDB 3.6 batch size limit changes, the --batchSize option's maximum was raised to 100k documents. Mongoimport and mongo driver code (gopkg.in/mgo.v2) were patched to support this. Specifying a batch size larger than 1000 and targeting MongoDB <3.6 results in operations being batched driver side in chunks of 1000. The driver was also patched to split write operations >16MB into separate writeOpCommand calls for *insertOp, bulkUpdateOp, and bulkDeleteOp operation types.