[SERVER-32903] Ambiguous field name error should be ignored during initial sync Created: 25/Jan/18  Updated: 30/Oct/23  Resolved: 29/Jan/20

Status: Closed
Project: Core Server
Component/s: Index Maintenance, Replication
Affects Version/s: None
Fix Version/s: 4.2.4, 3.6.18, 4.3.4, 4.0.18

Type: Bug Priority: Major - P3
Reporter: Charlie Swanson Assignee: Ryan Timmons
Resolution: Fixed Votes: 0
Labels: former-quick-wins, former-robust-initial-sync, initialSync
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File initsync_fuzzer-31f9-1580446991037-9.js    
Issue Links:
Backports
Depends
is depended on by SERVER-33946 Decrease number of initial sync attem... Blocked
Related
is related to SERVER-45827 Expand initial sync fuzzer grammar to... Backlog
is related to SERVER-45830 Add failpoint to allow InitialSyncTes... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.2, v4.0, v3.6
Sprint: Repl 2020-01-27, Repl 2020-02-10
Participants:
Linked BF Score: 6

 Description   

The following error can happen when inserting a document like

{_id: 0, a: [{0: 1}]}

into a collection with an index on {"a.0": 1}:

> db.foo.ensureIndex({"a.0": 1})
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 2,
	"numIndexesAfter" : 3,
	"ok" : 1
}
> db.foo.insert({a: [{0: 1}]})
WriteResult({
	"nInserted" : 0,
	"writeError" : {
		"code" : 16746,
		"errmsg" : "Ambiguous field name found in array (do not use numeric field names in embedded elements in an array), field: '0' for array: { 0: { 0: 1.0 } }"
	}
})

When finishing initial sync and applying oplog entries that happened during the sync, this error should be ignored. Like a unique index constraint violation, this should eventually be resolved during oplog application, either by a another operation deleting the document, or by another oplog entry dropping the index.



 Comments   
Comment by Githook User [ 02/Apr/20 ]

Author:

{'name': 'Ryan Timmons', 'email': 'ryan.timmons@10gen.com', 'username': 'rtimmons'}

Message: SERVER-32903 Ambiguous field name error should be ignored during initial sync

create mode 100644 jstests/replsets/initial_sync_ambiguous_index.js

(cherry picked from commit 3423ca586b88566857f3fcdfeca1c6fdee7a0911)
Branch: v3.6
https://github.com/mongodb/mongo/commit/1fe9bc6787ca4f53410f86d01b960b256bf65561

Comment by Githook User [ 02/Apr/20 ]

Author:

{'name': 'Ryan Timmons', 'email': 'ryan.timmons@10gen.com', 'username': 'rtimmons'}

Message: SERVER-32903 Ambiguous field name error should be ignored during initial sync

create mode 100644 jstests/replsets/initial_sync_ambiguous_index.js

(cherry picked from commit 3423ca586b88566857f3fcdfeca1c6fdee7a0911)
Branch: v4.0
https://github.com/mongodb/mongo/commit/b804b718aa7479f46ce7aa2536f558ab9526281b

Comment by Githook User [ 21/Feb/20 ]

Author:

{'name': 'Ryan Timmons', 'username': 'rtimmons', 'email': 'ryan.timmons@10gen.com'}

Message: SERVER-32903 add initial_sync_ambiguous_index.js to multiversion blacklist
Branch: master
https://github.com/mongodb/mongo/commit/63253ac8554b2a867de988fed5077ecfbc4522e0

Comment by Githook User [ 06/Feb/20 ]

Author:

{'username': 'rtimmons', 'name': 'Ryan Timmons', 'email': 'ryan.timmons@10gen.com'}

Message: SERVER-32903 Ambiguous field name error should be ignored during initial sync

create mode 100644 jstests/replsets/initial_sync_ambiguous_index.js

create mode 100644 jstests/replsets/initial_sync_ambiguous_index.js
Branch: v4.2
https://github.com/mongodb/mongo/commit/b1162d4db278728d54c4e2061b211e11ac923976

Comment by William Schultz (Inactive) [ 06/Feb/20 ]

If they apply cleanly they shouldn't require CRs. If they become more involved and require some manual intervention then a CR might be necessary. I expect these backports will be clean.

Comment by William Schultz (Inactive) [ 31/Jan/20 ]

ryan.timmons I think we should backport this fix to older versions. Also, since the original bug is not yet fixed on 4.2, the new test is failing in the multiversion suites i.e. BFG-432604. We can disable the test on master until it is backported by adding an exclusion here. Feel free to post that change as a follow up commit on this ticket.

Comment by William Schultz (Inactive) [ 31/Jan/20 ]

With some manual knob tuning, I was eventually able to get the initial sync fuzzer to reproduce bug (1) described above i.e. an error during oplog application. This is the error message:

[js_test:initsync_fuzzer-31f9-1580446991037-9] 2020-01-31T05:10:24.423+0000 d20021| 2020-01-31T05:10:24.423+0000 D3 REPL     [ReplWriterWorker-11] applying op (or grouped inserts): { op: "i", ns: "db2.coll2", ui: UUID("886dc21e-b287-483a-a578-cd7b4d6ca082"), o: { _id: 5.0, counter: 0.0, a: [ { 0: 1.0 } ] }, ts: Timestamp(1580447394, 11), t: 1, wall: new Date(1580447394902), v: 2 }, oplog application mode: InitialSync
[js_test:initsync_fuzzer-31f9-1580446991037-9] 2020-01-31T05:10:24.423+0000 d20021| 2020-01-31T05:10:24.423+0000 F  REPL     [ReplWriterWorker-11] writer worker caught exception: Location16746: Ambiguous field name found in array (do not use numeric field names in embedded elements in an array), field: '0' for array: { 0: { 0: 1.0 } } on: { op: "i", ns: "db2.coll2", ui: UUID("886dc21e-b287-483a-a578-cd7b4d6ca082"), o: { _id: 5.0, counter: 0.0, a: [ { 0: 1.0 } ] }, ts: Timestamp(1580447394, 11), t: 1, wall: new Date(1580447394902), v: 2 }
[js_test:initsync_fuzzer-31f9-1580446991037-9] 2020-01-31T05:10:24.423+0000 d20021| 2020-01-31T05:10:24.423+0000 F  REPL     [ReplCoordExtern-0] Failed to apply batch of operations. Number of operations in batch: 1. First operation: { op: "i", ns: "db2.coll2", ui: UUID("886dc21e-b287-483a-a578-cd7b4d6ca082"), o: { _id: 5.0, counter: 0.0, a: [ { 0: 1.0 } ] }, ts: Timestamp(1580447394, 11), t: 1, wall: new Date(1580447394902), v: 2 }. Last operation: { op: "i", ns: "db2.coll2", ui: UUID("886dc21e-b287-483a-a578-cd7b4d6ca082"), o: { _id: 5.0, counter: 0.0, a: [ { 0: 1.0 } ] }, ts: Timestamp(1580447394, 11), t: 1, wall: new Date(1580447394902), v: 2 }. Oplog application failed in writer thread 4: Location16746: Ambiguous field name found in array (do not use numeric field names in embedded elements in an array), field: '0' for array: { 0: { 0: 1.0 } }
[js_test:initsync_fuzzer-31f9-1580446991037-9] 2020-01-31T05:10:24.423+0000 d20021| 2020-01-31T05:10:24.423+0000 I  INITSYNC [ReplCoordExtern-0] initialSyncFassertIfApplyingBatchFails fail point enabled.
[js_test:initsync_fuzzer-31f9-1580446991037-9] 2020-01-31T05:10:24.423+0000 d20021| 2020-01-31T05:10:24.423+0000 F  -        [ReplCoordExtern-0] Fatal assertion 31210 Location16746: error applying batch :: caused by :: Ambiguous field name found in array (do not use numeric field names in embedded elements in an array), field: '0' for array: { 0: { 0: 1.0 } } at src/mongo/db/repl/initial_syncer.cpp 1396
[js_test:initsync_fuzzer-31f9-1580446991037-9] 2020-01-31T05:10:24.423+0000 d20021| 2020-01-31T05:10:24.423+0000 F  -        [ReplCoordExtern-0]
[js_test:initsync_fuzzer-31f9-1580446991037-9] 2020-01-31T05:10:24.423+0000 d20021|
[js_test:initsync_fuzzer-31f9-1580446991037-9] 2020-01-31T05:10:24.423+0000 d20021| ***aborting after fassert() failure

Here is the repro (initsync_fuzzer-31f9-1580446991037-9.js), which reproduces the bug consistently run against commit bcae9e6. I had to fix an outstanding bug with applying 'abortIndexBuild' (see this comment) in the test patch, so that I wouldn't hit that bug every time instead. I added the specific document shape to the fuzzer grammar for inserts:

InsertCommand ->
    "insert: COLLNAME"
    ", documents: [{_id: FRESH_ID, counter: 0, a: [{0:1}]}]"

and the createIndex command:

CreateIndexCommand ->
    "createIndexes: COLLNAME"
    ", indexes: [{key: {'a.0': 1}, name: 'a_1'}]"

and I updated the following knob constants:

kNumMaxStepsCountForInitSync= 20
kNumMaxStepsCountForSeedData=100
kMaxNumDatabases=1
kMaxNumCollections=1

I also manually disabled generation of transactions commands since this was not a transactions related bug. The failure reproduced on the third execution of a patch build that generated 5 tasks each with 10 fuzzer tests each. So, it appears it took around 150 generated fuzzer tests before we hit the bug.

I think this demonstrates to a certain extent that the fuzzer can act as a tool to help search for deterministic repros of a hypothesized bug with less manual effort. We can augment the grammar with some of the interesting operations that we think are needed to find the bug and let the fuzzer run many generated test cases.

Comment by Githook User [ 29/Jan/20 ]

Author:

{'username': 'rtimmons', 'name': 'Ryan Timmons', 'email': 'ryan.timmons@10gen.com'}

Message: SERVER-32903 Ambiguous field name error should be ignored during initial sync

create mode 100644 jstests/replsets/initial_sync_ambiguous_index.js
Branch: master
https://github.com/mongodb/mongo/commit/3423ca586b88566857f3fcdfeca1c6fdee7a0911

Comment by William Schultz (Inactive) [ 23/Jan/20 ]

Discussed two hypothetical scenarios where this bug could manifest. It's difficult to verify them against existing failures since log files have been reaped. It does appear that in certain BFGs the error is in oplog application and in others it appears in index building.

 (1) We fail insertion during initial sync oplog application. If the sync source does an insert, delete, and then createIndex, and then we start the collection clone, building the problematic index, during oplog application we would to re-apply the insert which would throw the "ambiguous field name" error.

 (2) We fail index building during initial sync collection cloning. If we clone a collection with the problematic index on it, and then the index is dropped and an insert occurs in the sync source during collection cloning, we might try to clone the problematic document and insert it while we are building the problematic index.

If this is a real bug I am also curious if this is something the initial sync fuzzer would be able to catch and/or why it is not finding it.

Comment by Ryan Timmons [ 22/Jan/20 ]

Summary:

  • I couldn't repro this using the fuzzer-generated test-cases. This seems to be the only failure whose S3 assets hasn't been pruned. I ran this 5 times against master to no avail - it doesn't repro.
  • We can simply whitelist the "Ambiguous fieldname" exception when doing inserts. This change does that.
  • But this is hard to test.
    • The repro js given fails because ReplSetTest checks index hashes/counts and the repro explicitly creates the index on the secondary and not the primary. We can bypass these checks but overall this doesn't feel like a realistic test-case. I added a cleaned-up version of the repro js here. It passes with the whitelist change and fails on master, but it is icky.
    • I added an "IdempotencyTest" that ends up similarly. The "Ambiguous field name" exception isn't thrown but the "validateResults" checks are failing.

I'm going to reach out to other members on repl to figure out if simply adding this error status-code to the whitelist per above is really the best solution. And if so, how do we avoid failing index-validations.

Comment by Charlie Swanson [ 25/Jan/18 ]

Instead of adding 16746 to the whitelist, we could also change that uassert to use ErrorCodes::CannotBuildIndexKeys, which is already ignored.

Generated at Thu Feb 08 04:31:39 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.