Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Replication
Labels:
- RF
- SWNA
Environment:
Ubuntu 16.04

Operating System:
ALL
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

I have a 3-node replica set running version 3.4.10 on Ubuntu 16.04.

I ran a schema update that touched all 7 million documents of a collection with a $set and a $rename. Because one of the secondaries is about 30 ms away in Azure, I used majority write concern to slow down the update and make sure at least one of the secondaries would stay in sync.
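For illustration only, the general shape of such an update can be sketched as the body of a MongoDB `update` command, built here as a plain Python dict. The real query was not posted, so the collection name and field names below are placeholders, and the helper `build_update_command` is hypothetical.

```python
import json


def build_update_command(collection, set_fields, rename_fields):
    """Build the body of a MongoDB `update` command as a plain dict.

    The command name must be the first key; Python dicts keep insertion order.
    """
    return {
        "update": collection,
        "updates": [
            {
                "q": {},  # empty filter: match every document
                "u": {"$set": set_fields, "$rename": rename_fields},
                "multi": True,  # apply to all matching documents
            }
        ],
        # Wait for acknowledgement from a majority of replica set members.
        "writeConcern": {"w": "majority"},
    }


if __name__ == "__main__":
    # Placeholder names, not taken from the actual migration.
    cmd = build_update_command(
        "items", {"schemaVersion": 2}, {"oldName": "newName"}
    )
    print(json.dumps(cmd, indent=2))
```

A driver or shell would send an equivalent document to the primary; the sketch only shows where the `$set`, `$rename`, and `writeConcern` parts sit relative to each other.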

The update started at 14:19:29. At that point the Azure secondary was probably 3-5 minutes behind because of earlier schema migrations. But by 14:27:00, the main secondary was unable to get results for its oplog queries:

Jan 13 14:27:00 secondary mongod.27017[28273]: [replication-163] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 18482564 -- target:primary:27017 db:local expDate:2018-01-13T14:27:00.216+0000 cmd:{ getMore: 16483339842, collection: "oplog.rs", maxTimeMS: 5000, term: 25, lastKnownCommittedOpTime: { ts: Timestamp 1515853539000|8343, t: 25 } }. Last fetched optime (with hash): { ts: Timestamp 1515853555000|1852, t: 25 }[1175973526525408650]. Restarts remaining: 3
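To make the optimes in that log line easier to read, here is a small sketch that extracts them and converts them to wall-clock time. It assumes the `Timestamp A|B` notation in this log means milliseconds since the epoch followed by an increment counter, which matches the values shown; the helper names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Excerpt of the log line above (only the part containing the optimes).
LOG_LINE = (
    "lastKnownCommittedOpTime: { ts: Timestamp 1515853539000|8343, t: 25 } }. "
    "Last fetched optime (with hash): { ts: Timestamp 1515853555000|1852, t: 25 }"
    "[1175973526525408650]"
)

# Assumption: "Timestamp <milliseconds>|<increment>".
TS_PATTERN = re.compile(r"Timestamp (\d+)\|(\d+)")


def parse_optimes(line):
    """Return (epoch_seconds, increment) pairs in the order they appear."""
    return [(int(ms) // 1000, int(inc)) for ms, inc in TS_PATTERN.findall(line)]


def to_utc(seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    committed, fetched = parse_optimes(LOG_LINE)
    print("last known committed:", to_utc(committed[0]).isoformat())
    print("last fetched:        ", to_utc(fetched[0]).isoformat())
    print("gap in seconds:      ", fetched[0] - committed[0])
```

Under that assumption, the last fetched optime is 16 seconds ahead of the last known committed optime, and both fall a little over a minute before the 14:27:00 timestamp on the log line.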

That's also the time the replica set stopped accepting connections from clients.

To get things running again I had to kill all three mongod processes, and then use kill -9 because the shutdown tends to hang while in this state.

After letting the nodes sync up, I was able to reproduce this again with the same query.

I can provide logs and the query privately if that would be useful.

Just guessing based on what I learned in SERVER-32398, maybe the primary froze up because it ran out of cache while waiting for the secondary to apply changes. But the update was running with majority write concern, so I would have thought the secondary couldn't have gotten far enough behind for that to occur.

Assignee: Bruce Lucas (Inactive)
Reporter: Michael Smith
Participants: Bruce Lucas, Michael Smith
Votes: 0
Watchers: 10

Created: Jan 14 2018 12:50:34 AM UTC
Updated: Oct 30 2023 11:09:18 PM UTC
Resolved: Jan 19 2018 04:05:22 PM UTC
