[SERVER-36512] replset members: /data/WiredTigerLAS.wt grows unlimited Created: 07/Aug/18 Updated: 27/Oct/23 Resolved: 09/Aug/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.6.6 |
| Fix Version/s: | None |
| Type: | Question | Priority: | Critical - P2 |
| Reporter: | Bruce Zu | Assignee: | Nick Brewer |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | replica set |
| Attachments: | |
| Participants: | |
| Description |
|
rs.conf() ... /data/WiredTigerLAS.wt grows without bound, up to 92% of the disk space on the /data volume, and leaves the primary member idle, unable to respond to read and write operations. rs.config() still works, but rs.reconfig() just hangs.
Any other information you need, please let me know. |
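(A minimal sketch, assuming a mongo shell connected to the affected member, of one way to watch the WiredTiger cache pressure behind this symptom; the serverStatus() statistic names below are assumed from the standard WiredTiger cache section and may vary by version.)

```js
// Sketch only: report WiredTiger cache usage from the mongo shell; the on-disk
// size of /data/WiredTigerLAS.wt itself would be watched on the host (e.g. ls -lh).
var cache = db.serverStatus().wiredTiger.cache;   // WiredTiger cache statistics
print("bytes currently in the cache : " + cache["bytes currently in the cache"]);
print("maximum bytes configured     : " + cache["maximum bytes configured"]);
print("tracked dirty bytes in cache : " + cache["tracked dirty bytes in the cache"]);
```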
| Comments |
| Comment by Bruce Zu [ 10/Aug/18 ] |
|
Hi Nick, I am curious how you can say "your replica set did not have enough members to satisfy the read concern".
|
| Comment by Bruce Zu [ 09/Aug/18 ] |
|
Hi Nick, old member 3.4.7 (deleted): ip-172-31-12-59. These 3 members were deleted on July 22, as shown in the log. |
| Comment by Bruce Zu [ 09/Aug/18 ] |
|
Hi Nick, I do not think that is the root cause.
At that time we had 3 data-bearing members and one arbiter (Aug 3, primary in idle: 172.31.54.204; Aug 6). Secondly, I did some testing and found: if the majority does include the arbiter, then when I stop 2 data-bearing members and leave only a primary, a secondary and an arbiter alive, and keep that state for 3 hours, nothing happens to the file; it stays at 4.0K. So I do not know how you determined the root cause, and you closed the issue without confirming with the reporter.
Note that we use the default read preference 'primary only', and during the whole process the size of WiredTigerLAS does not change at all. Since the primary does not stay primary, it will not handle the read operations, as I mentioned in the first message. |
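(A minimal sketch of the kind of check behind this argument, using the rs.conf()/rs.status() helpers already referenced in this ticket; the votes, health and stateStr field names are assumed from the 3.6 documentation.)

```js
// Sketch: compare voting members against healthy data-bearing members, since an
// arbiter counts toward the voting majority but holds no data to acknowledge.
var cfg = rs.conf(), st = rs.status();
var voters = cfg.members.filter(function (m) { return m.votes > 0; }).length;
var majority = Math.floor(voters / 2) + 1;
var healthyDataBearing = st.members.filter(function (m) {
    return m.health === 1 && m.stateStr !== "ARBITER";
}).length;
print("voting members               : " + voters);
print("majority needed              : " + majority);
print("healthy data-bearing members : " + healthyDataBearing);
```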
| Comment by Bruce Zu [ 09/Aug/18 ] |
|
Let me provide more information that can be used to filter out unrelated messages during the log investigation.
I also find that the /data disk space size of the new member 172-31-67-188 is not the same as the others. |
| Comment by Nick Brewer [ 09/Aug/18 ] |
|
brucezu MongoDB 3.6 enables read concern "majority" - with the three 3.4 nodes down, your replica set did not have enough members to satisfy the read concern, as the arbiter does not contain data. The majority read concern ensures that the data returned will not subsequently be rolled back, by confirming that it is acknowledged by the majority of data-bearing replica set members. Because of this, your primary node was required to store an increasing amount of data in the cache, which ultimately overflowed into the lookaside table (represented by WiredTigerLAS.wt). If you've removed the 3.4 nodes, you should have a majority of data-bearing nodes to satisfy the read concern. -Nick |
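(A minimal sketch of how the mechanism described above can be observed; the optimes field names are assumed from the documented rs.status() output for 3.6, not taken from this deployment.)

```js
// Sketch: if the majority commit point stops advancing while the primary keeps
// taking writes, the growing gap is what forces old versions to stay pinned in
// cache and eventually spill into the lookaside table (WiredTigerLAS.wt).
var ot = rs.status().optimes;
print("applied optime        : " + tojson(ot.appliedOpTime));
print("majority commit point : " + tojson(ot.lastCommittedOpTime));
```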
| Comment by Bruce Zu [ 08/Aug/18 ] |
|
Hi Nick,
1> The zip file of the $dbpath/diagnostic.data directory was uploaded yesterday; can you find it?
2> mongod.log: mongod_log.zip
3> "Can you link to the issue you're referring to?"
https://jira.mongodb.org/browse/SERVER-35795
4> "Is the full 'Bad value ...' message available in the mongod logs?" Yes: http://paste.openstack.org/show/727682/ where you will see "...caused by :: BadValue: cannot write to 'config.system.sessions". I think that issue has no relation to the current issue.
Thank you! Bruce |
| Comment by Nick Brewer [ 08/Aug/18 ] |
|
brucezu Could you upload an archive (tar or zip) of the $dbpath/diagnostic.data directory, and the mongod.log from a mongod that experienced the unbounded WiredTigerLAS.wt growth?
Can you link to the issue you're referring to? Is the full 'Bad value ...' message available in the mongod logs? Thanks, Nick |
| Comment by Bruce Zu [ 07/Aug/18 ] |
|
`restart mongod service` fixes this issue and WiredTigerLAS.wt shrinks back to 4K. All data-bearing members are 3.6.6; the arbiter is 3.4.7. It is on AWS EC2. OS: 4.14.47-56.37.amzn1.x86_64 #1 SMP Wed Jun 6 18:49:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux. RAM 4G (Mem: 4041808k total, 2399252k used). db.serverCmdLineOpts().parsed.storage: { "dbPath" : "/data", "journal" : { "enabled" : true } }
The replset was created with 3 data-bearing nodes and one arbiter, all of them 3.4.7. One week ago I added 3 new secondaries running 3.6.6 and selected a new primary from the new members by reconfiguring the priorities. An issue happened at that time: after the old primary became a secondary and a new member became primary, all old members except the arbiter went down. The log shows it was caused by a 'Bad value ...' error during the sync process. I fixed it by cleaning the data of the old members and restarting them; they then returned to normal status. That issue was reported and fixed in 3.6.7 just one day before I ran into it. Anyway, I terminated all of those old members.
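(A priority-based primary switch of the kind described above would look roughly like this; the member indexes and priority values are placeholders, not the ones used in this replica set.)

```js
// Sketch: prefer one of the new 3.6.6 members in the next election by raising
// its priority. The member indexes (3 and 0) are placeholders for illustration.
cfg = rs.conf();
cfg.members[3].priority = 2;   // new member to promote to primary
cfg.members[0].priority = 1;   // old primary keeps the default priority
rs.reconfig(cfg);
```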
Another thing worth mentioning: I had removed the old members from the replset config, but from the mongod.log I find the primary still sends heartbeats to those old members. Fixed it by restarting the mongod service.
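(For reference, removing a retired member is normally a plain rs.remove() or an equivalent rs.reconfig(); in this sketch the port 27017 is an assumption.)

```js
// Sketch: drop a terminated old member from the replica set config.
// The host name is taken from this ticket; port 27017 is an assumption.
rs.remove("ip-172-31-12-59:27017");
```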
$ df -h on the 3 data-bearing nodes and the 1 arbiter:
result: http://paste.openstack.org/show/727581/
|