[SERVER-36512] replset members: /data/WiredTigerLAS.wt grows unlimited Created: 07/Aug/18  Updated: 27/Oct/23  Resolved: 09/Aug/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.6.6
Fix Version/s: None

Type: Question Priority: Critical - P2
Reporter: Bruce Zu Assignee: Nick Brewer
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

replica set
AWS EC2


Attachments: PDF File 172-31-54-204 is primary before I restart it.pdf     PNG File Screen Shot 2018-08-09 at 12.59.09 PM.png     PNG File collections is not same.png     PNG File date diskspace is not same.png     Zip Archive diagnostic.zip     PNG File let data and test show true.png     PNG File local collections size.png     Zip Archive mongod_log.zip     PNG File old3members-deleted-July20.png     PNG File still connect removed member before restart mongod service.png     PNG File wired collection 6.png
Participants:

 Description   

rs.conf()
{
...
"version" : 560512,
"protocolVersion" : NumberLong(1),

...

/data/WiredTigerLAS.wt grows without limit, up to 92% of the /data volume, and

leaves the primary member idle, unable to respond to read or write operations.

rs.conf() and rs.status() still work.

rs.reconfig() just hangs.

 

If any other information is needed, please let me know.
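
In case it helps, this is roughly what can be run to gather more data while the file is growing (mongo shell sketch, assuming dbPath is /data as above; the exact WiredTiger counter names may differ by version):

// on the OS side, watch the file itself:  ls -lh /data/WiredTigerLAS.wt
var status = db.serverStatus();
printjson(status.wiredTiger.cache);   // cache / lookaside-related counters live here
printjson(status.metrics.cursor);     // long-lived open cursors can pin old data in the cache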



 Comments   
Comment by Bruce Zu [ 10/Aug/18 ]

Hi Nick,

I am curious how you can say "your replica set did not have enough members to satisfy the read concern".

Comment by Bruce Zu [ 09/Aug/18 ]

Hi Nick
I do not understand why you closed this issue before finding the root cause,
or how that explains this issue. I know it is not easy to reproduce, but that does not mean there is no bug at all.

old member 3.4.7(deleted) : ip-172-31-12-59
old member 3.4.7(deleted) : ip-172-31-20-52
old member 3.4.7(deleted) : ip-172-31-46-24

These 3 members were deleted on July 22, as the log shows,
but the primary still connected to them until I restarted the mongod service on the primary node:

Comment by Bruce Zu [ 09/Aug/18 ]

Hi Nick,

I do not think that is the root cause.
First, let me list the events in time order:

July 20, afternoon:
  • The three 3.4 nodes went down.
  • I removed them from the replica set configuration at once.
July 20, night:
  • I brought them back after cleaning their data and added them back to the replica set configuration.
July 23, Monday:
  • I removed them from the replica set configuration and terminated them.

At that time we had 3 data-bearing members and one arbiter:
primary : 172.31.54.204
arbiter : ip-172-31-5-208
old member 3.4.7(deleted) : ip-172-31-12-59
old member 3.4.7(deleted) : ip-172-31-20-52
old member 3.4.7(deleted) : ip-172-31-46-24
secondary : ip-172-31-66-130
secondary : ip-172-31-82-157

Aug 3
One secondary became unreachable :

primary in idle : 172.31.54.204
arbiter : ip-172-31-5-208
sec with unreachable issue: ip-172-31-66-130
secondary : ip-172-31-82-157

Aug 6
The primary's WiredTigerLAS.wt was found to have grown to 92% of the disk;
the other healthy secondary's WiredTigerLAS.wt had grown to 64%.
But the primary was still primary; it did not step down to secondary before I restarted the mongod service on the primary node.

Second, I did some tests.
Current replica set:
old primary, now a secondary : 172.31.54.204
arbiter : ip-172-31-5-208
secondary that previously had the unreachable issue: ip-172-31-66-130
secondary (currently primary) : ip-172-31-82-157
new secondary added on Aug 6 : ip-172-31-67-188

I ran some tests and found:
stopping any 3 secondary members, or any 2 secondaries plus the arbiter, makes the primary step down to secondary.

If the majority does include the arbiter: I tested stopping 2 data-bearing members, leaving only a primary, a secondary and the arbiter alive. I kept this state for 3 hours and nothing happened to the file:

4.0K -rw------- 1 mongod mongod 4.0K Aug 9 20:27 /data/WiredTigerLAS.wt
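
To make this test easier to interpret, something like the following could be run on the primary while members are stopped (mongo shell sketch; this assumes rs.status() on this version exposes the optimes sub-document):

var o = rs.status().optimes;
printjson(o.appliedOpTime);        // the latest write applied on this primary
printjson(o.lastCommittedOpTime);  // the latest write acknowledged by a majority of data-bearing members
// if lastCommittedOpTime falls further and further behind appliedOpTime, the primary has to
// keep old versions of documents around, which is what can spill into WiredTigerLAS.wt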

So I do not understand how you determined the root cause and closed the issue without confirming it with the reporter.

 

Note that we use the default read preference, 'primary'.
So once the primary steps down to secondary, any read operation connections are ended:
http://paste.openstack.org/show/727769/

During the whole process the size of WiredTigerLAS.wt did not change at all:
4.0K -rw------- 1 mongod mongod 4.0K Aug 6 20:05 WiredTigerLAS.wt

Since the stepped-down primary is no longer primary, it does not handle read operations.
So how could it be that "..your primary node was required to store an increasing amount of data in the cache"?
By the way, the arbiter does not hold data, but it can vote, so it can be part of the majority.
At least the test I did shows that this is not the root cause.

As I mentioned in the first message:
when I checked the mongod.log after the WiredTigerLAS.wt growth happened on Aug 6,
I found the primary was still trying to connect to the 3 old v3.4.7 members. This should be a bug,
as they had been removed from the replica set configuration a week earlier. I double-checked
rs.config(): they are not there. But after I restarted the mongod service, the configuration took effect. This can be verified in the mongod.log: there are no more messages about connecting to those 3 old members. This never happened on 3.4.7.
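
A quick way to cross-check this on the node (mongo shell; nothing assumed beyond a working connection):

// the configuration the node claims to have
rs.conf().members.forEach(function (m) { print(m._id + "  " + m.host); });
// the members the node is actually tracking and heartbeating
rs.status().members.forEach(function (m) { print(m.name + "  " + m.stateStr); });
// after the reconfig, the removed 3.4.7 hosts should appear in neither list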

Comment by Bruce Zu [ 09/Aug/18 ]

Let me provide more information, useful for filtering out unrelated messages during log investigation:

  1. old primary in idle : 172.31.54.204
  2. arbiter : ip-172-31-5-208
  3. old member 3.4.7(deleted) : ip-172-31-12-59
  4. old member 3.4.7(deleted) : ip-172-31-20-52
  5. old member 3.4.7(deleted) : ip-172-31-46-24
  6. sec with unreachable issue: ip-172-31-66-130
  7. secondary (new primary) : ip-172-31-82-157
  8. new secondary : ip-172-31-67-188

I also found that the /data disk usage of the new member 172-31-67-188 is not the same as the others.
This is caused by a file under /data whose name starts with collection-6; in fact there are 2 files whose names start with collection-6.
This weird file is more than 390 MB, and it is worst on the secondary that had the unreachable issue,
while on the new member 172-31-67-188 there is only one file whose name starts with collection-6.

Checking the database, I found it belongs to the local database.
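
If it helps, the collection-6 files can probably be mapped back to a namespace like this (mongo shell sketch; this assumes stats() exposes the wiredTiger.uri field on this version):

var local = db.getSiblingDB("local");
local.getCollectionNames().forEach(function (name) {
    var s = local.getCollection(name).stats();
    if (s.wiredTiger && s.wiredTiger.uri) {
        print(name + "  ->  " + s.wiredTiger.uri);   // e.g. "statistics:table:collection-6-...."
    }
});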


I hope these are helpful in finding the bug.

Comment by Nick Brewer [ 09/Aug/18 ]

brucezu MongoDB 3.6 enables read concern "majority" - with the three 3.4 nodes down, your replica set did not have enough members to satisfy the read concern, as the arbiter does not contain data. The majority read concern ensures that the data returned will not subsequently be rolled back, by confirming that it is acknowledged by the majority of data-bearing replica set members.

Because of this, your primary node was required to store an increasing amount of data in the cache, which ultimately overflowed into the lookaside table (represented by WiredTigerLAS.wt). If you've removed the 3.4 nodes, you should have a majority of data-bearing nodes to satisfy the read concern.
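
A rough way to sanity-check that (mongo shell sketch, counting members generically rather than anything specific to this deployment):

var conf = rs.conf();
var dataBearing = conf.members.filter(function (m) { return !m.arbiterOnly; }).length;
var voting = conf.members.filter(function (m) { return m.votes > 0; }).length;
var majority = Math.floor(voting / 2) + 1;
print("data-bearing members: " + dataBearing + ", voting majority needed: " + majority);
// if the number of healthy data-bearing, voting members stays below that majority,
// the majority commit point cannot advance and cache/lookaside growth can recur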

-Nick

Comment by Bruce Zu [ 08/Aug/18 ]

Hi Nick,

1> The zip file of the $dbpath/diagnostic.data directory was uploaded yesterday.

      Can you find it?

 2>  mongod.log  mongod_log.zip

       

 3> "Can you link to the issue you're referring to?"

All of the old members except the arbiter went down. The log shows it was caused by a 'Bad value ...' error during the sync process. I fixed it by cleaning the old members' data and restarting them, and they returned to normal. That issue was reported and fixed in 3.6.7, just one day before I ran into it. Anyway, I terminated all of those old members.

  https://jira.mongodb.org/browse/SERVER-35795

3> "Is the full "'Bad value ..." message available in the mongod logs?"

Yeah  http://paste.openstack.org/show/727682/

you will see "...caused by :: BadValue: cannot write to 'config.system.sessions"

I think that issue is unrelated to the current one.


 

Thank you!

Bruce

Comment by Nick Brewer [ 08/Aug/18 ]

brucezu Could you upload an archive (tar or zip) of the $dbpath/diagnostic.data directory, and the mongod.log from a mongod that experienced the unbounded WiredTigerLAS.wt growth?

All of the old members except the arbiter went down. The log shows it was caused by a 'Bad value ...' error during the sync process. I fixed it by cleaning the old members' data and restarting them, and they returned to normal. That issue was reported and fixed in 3.6.7, just one day before I ran into it. Anyway, I terminated all of those old members.

Can you link to the issue you're referring to? Is the full "'Bad value ..." message available in the mongod logs?

Thanks,
Nick


Comment by Bruce Zu [ 07/Aug/18 ]

`restart mongod service` fixed this issue, and WiredTigerLAS.wt shrank to 4K.

All data-bearing members are 3.6.6.

The arbiter is 3.4.7.

It is in AWS EC2. OS:  4.14.47-56.37.amzn1.x86_64 #1 SMP Wed Jun 6 18:49:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

RAM 4G

Mem: 4041808k total, 2399252k used 

db.serverCmdLineOpts().parsed.storage

{ "dbPath" : "/data", "journal" : \{ "enabled" : true }

}$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 2.0G 68K 2.0G 1% /dev
tmpfs 2.0G 0 2.0G 0% /dev/shm
/dev/xvda1 9.8G 1.3G 8.4G 14% /
/dev/xvdb 69G 760M 65G 2% /data
/dev/xvdc 9.8G 323M 8.9G 4% /journal
/dev/xvdd 4.8G 108M 4.5G 3% /log
[ec2-user@ip-172-31-82-157 ~]$ ls -l /data/journal/
total 307216
drwxr-xr-x 2 mongod mongod 16384 Jul 20 00:47 lost+found
-rw------- 1 mongod mongod 104857600 Aug 7 21:35 WiredTigerLog.0000000042
-rw------- 1 mongod mongod 104857600 Aug 6 20:05 WiredTigerPreplog.0000000001
-rw------- 1 mongod mongod 104857600 Aug 6 20:05 WiredTigerPreplog.0000000002

 

The replica set was created with 3 data-bearing nodes and one arbiter, all of them 3.4.7.

One week ago I added 3 new 3.6.6 secondaries and selected a new primary from the new members by reconfiguring the priorities. An issue happened at that time, after the old primary became a secondary and a new member became primary: all of the old members except the arbiter went down. The log shows it was caused by a 'Bad value ...' error during the sync process. I fixed it by cleaning the old members' data and restarting them, and they returned to normal. That issue was reported and fixed in 3.6.7, just one day before I ran into it. Anyway, I terminated all of those old members.
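
For reference, the priority change was done roughly like this (mongo shell sketch; the member index and priority value here are only illustrative, not the exact ones used):

var cfg = rs.conf();
cfg.members[3].priority = 2;   // hypothetical index of the new 3.6.6 member to favour as primary
rs.reconfig(cfg);              // apply the new configuration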

 

Another thing worth mentioning: I had removed the old members from the replica set config, but in the mongod.log I found the primary still sending heartbeats to those old members.

I fixed that by restarting the mongod service.
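
The removals would look roughly like this with the standard helper (a sketch; rs.remove is one way to do it, and port 27017 is assumed since it is not stated in this ticket):

rs.remove("ip-172-31-12-59:27017");
rs.remove("ip-172-31-20-52:27017");
rs.remove("ip-172-31-46-24:27017");
// rs.conf() showed them gone, yet the heartbeats in mongod.log only stopped after the restart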

 

On the 3 data-bearing nodes and the 1 arbiter, I ran:

  •  db.version()  
  •   rs.conf() 
  •   rs.status() 
  •   db.serverStatus().metrics.cursor  

result: http://paste.openstack.org/show/727581/

 

 

Generated at Thu Feb 08 04:43:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.