[SERVER-40507] Track I/O stats in diagnostic data when mongod deployed in a POD using PersistentVolumeClaim Created: 05/Apr/19  Updated: 27/Oct/23  Resolved: 28/Feb/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Andrey Brindeyev Assignee: Billy Donahue
Resolution: Works as Designed Votes: 0
Labels: kubernetes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File diagnostic.json    
Issue Links:
Related
related to SERVER-46501 Add /proc/self/mountinfo to hostInfo ... Closed
Sprint: Dev Tools 2020-02-10, Dev Tools 2020-02-24, Dev Tools 2020-03-09
Participants:
Case:

 Description   

At the moment, local SSD stats are recorded in FTDC, while stats from volumes backed by a PersistentVolumeClaim are missing.



 Comments   
Comment by Billy Donahue [ 28/Feb/20 ]

Seems like we're reporting already, but need SERVER-46501 to connect the reported block devices to the filesystem hierarchy seen by mongod.

Comment by Billy Donahue [ 28/Feb/20 ]

That FTDC contains entries for xvdbb:

			"disks" : {
				"xvda" : {
					"reads" : 11613,
					"reads_merged" : 6,
					"read_sectors" : 912992,
					"read_time_ms" : 8260,
					"writes" : 2982210,
					"writes_merged" : 2792663,
					"write_sectors" : 120957114,
					"write_time_ms" : 5157860,
					"io_in_progress" : 0,
					"io_time_ms" : 1025668,
					"io_queued_ms" : 5166144
				},
				"xvdbb" : {
					"reads" : 289,
					"reads_merged" : 0,
					"read_sectors" : 13386,
					"read_time_ms" : 100,
					"writes" : 204410,
					"writes_merged" : 236622,
					"write_sectors" : 4600168,
					"write_time_ms" : 124788,
					"io_in_progress" : 0,
					"io_time_ms" : 107676,
					"io_queued_ms" : 124832
				}
			},

The difficulty is that nobody would know xvdbb is the /data mount point.

Maybe we just need to include the mount table in the FTDC or the hostInfo so the FTDC disk stats can be interpreted properly.
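
To make that concrete, here is a minimal sketch (not the SERVER-46501 implementation) of reading /proc/self/mountinfo and building a device-name-to-mount-point map, so that an FTDC entry like "xvdbb" could be tied back to /data. The field positions follow the mountinfo format documented in proc(5); the function name is made up for illustration.

// Sketch: map block-device names to mount points via /proc/self/mountinfo.
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

std::map<std::string, std::string> mountPointsByDevice() {
    std::map<std::string, std::string> result;
    std::ifstream mountinfo("/proc/self/mountinfo");
    std::string line;
    while (std::getline(mountinfo, line)) {
        // proc(5): id parent major:minor root mountPoint options [optional...] - fstype source superOptions
        std::istringstream in(line);
        std::vector<std::string> fields;
        for (std::string f; in >> f;)
            fields.push_back(f);
        std::size_t sep = 6;  // optional fields begin at index 6; scan for the "-" separator
        while (sep < fields.size() && fields[sep] != "-")
            ++sep;
        if (fields.size() < 5 || sep + 2 >= fields.size())
            continue;
        const std::string& mountPoint = fields[4];
        const std::string& source = fields[sep + 2];  // e.g. "/dev/xvdbb"
        if (source.rfind("/dev/", 0) == 0)
            result[source.substr(5)] = mountPoint;    // "xvdbb" -> "/data"
    }
    return result;
}

int main() {
    for (const auto& [dev, mnt] : mountPointsByDevice())
        std::cout << dev << " -> " << mnt << "\n";
}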

Comment by Louis Plissonneau (Inactive) [ 28/Feb/20 ]

billy.donahue

 

Here is the output of db.adminCommand({getDiagnosticData : 1})

 

diagnostic.json

Comment by Billy Donahue [ 28/Feb/20 ]

Louis gave me a way into his kube pod.
This crazy command is revealing something interesting.

I have no name!@replica-set-for-volume-stats-0:/$ stat -L /sys/block/$(mount | grep $(grep dbPath $(tr '\0' ' ' </proc/$(pidof mongod)/cmdline | awk '{print $3}') | awk '{print $2}') | awk '{print $1}' | sed 's#/dev/##')/device
 File: '/sys/block/xvdbb/device'
 Size: 0     	Blocks: 0     IO Block: 4096  directory
Device: 3ah/58d	Inode: 15542    Links: 4
Access: (0755/drwxr-xr-x) Uid: (  0/  root)  Gid: (  0/  root)
Access: 2020-02-27 20:47:54.390917130 +0000
Modify: 2020-02-27 20:47:54.390917130 +0000
Change: 2020-02-27 20:47:54.390917130 +0000
 Birth: -

Explanation: I dig into mongod's cmdline; its arg3 is a mongod.conf file. We find the dbPath line in there, then find that dbPath in mount. Mount shows it under /dev/xvdbb, but that device node doesn't exist. What does exist is /sys/block/xvdbb, so we look there and stat its device symlink, finding that it is a directory. That is the criterion that makes xvdbb show up in FTDC.
It looks like some kind of virtual block device, not NFS:
File: '/sys/block/xvdbb/device' -> '../../../vbd-268449024'
My suspicion is that this instance WOULD, in fact, report stats on its /data PVC. Unfortunately I don't know how to connect to this host's mongod to issue the getDiagnosticData command to check that out.
In short, it may be that this instance doesn’t suffer from the bug described in SERVER-40507.
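
As an aside, the same answer can be reached without the pipeline above. Here's a sketch in C++ that assumes the dbPath is /data (an assumption, taken from the mount output above): stat() the dbPath, take the major:minor of the device it lives on, and resolve the /sys/dev/block/<major>:<minor> symlink to get the kernel's name for the block device. This is an illustration, not the FTDC code, and it can mislead for overlay or device-mapper setups.

// Sketch: find the block device backing an assumed dbPath of /data.
#include <sys/stat.h>
#include <sys/sysmacros.h>  // major(), minor()
#include <limits.h>         // PATH_MAX
#include <stdio.h>          // perror()
#include <stdlib.h>         // realpath()
#include <iostream>
#include <string>

int main() {
    const std::string dbPath = "/data";  // assumption: the mongod dbPath
    struct stat st;
    if (::stat(dbPath.c_str(), &st) != 0) {
        perror("stat");
        return 1;
    }
    // /sys/dev/block/<major>:<minor> is a symlink whose target ends in the
    // kernel's name for the device, e.g. ".../block/xvdbb".
    std::string sysPath = "/sys/dev/block/" + std::to_string(major(st.st_dev)) +
        ":" + std::to_string(minor(st.st_dev));
    char resolved[PATH_MAX];
    if (::realpath(sysPath.c_str(), resolved) == nullptr) {
        perror("realpath");
        return 1;
    }
    std::cout << dbPath << " is backed by " << resolved << "\n";
    return 0;
}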

The "persistent volume" spec from which this device comes:

 
Billys-MacBook-Pro-3:~ billy$ kubectl get pv pvc-8099e4d0-81e1-4f32-9ef1-34bc693f4da2 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    kubernetes.io/createdby: aws-ebs-dynamic-provisioner
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
  creationTimestamp: "2020-02-27T18:41:42Z"
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    failure-domain.beta.kubernetes.io/region: us-west-2
    failure-domain.beta.kubernetes.io/zone: us-west-2a
  name: pvc-8099e4d0-81e1-4f32-9ef1-34bc693f4da2
  resourceVersion: "2227071"
  selfLink: /api/v1/persistentvolumes/pvc-8099e4d0-81e1-4f32-9ef1-34bc693f4da2
  uid: 8b7232ee-eebb-4b1b-859c-a74799df9f81
spec:
  accessModes:
  - ReadWriteOnce
  awsElasticBlockStore:
    fsType: ext4
    volumeID: aws://us-west-2a/vol-0be74d1366f7511e8
  capacity:
    storage: 15Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: data-replica-set-for-volume-stats-1
    namespace: mongodb
    resourceVersion: "2227057"
    uid: 8099e4d0-81e1-4f32-9ef1-34bc693f4da2
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - us-west-2a
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - us-west-2
  persistentVolumeReclaimPolicy: Delete
  storageClassName: gp2
  volumeMode: Filesystem
status:
  phase: Bound

It's an EBS (i.e. "Elastic Block Store") volume, which is attached to the container directly as a block device; nothing to do with NFS or anything like that.
If it shows up in that system's FTDC reports, and I think it should, then there's perhaps nothing to do here?

Comment by Billy Donahue [ 13/Feb/20 ]

Transitioning to "blocked" while we gather specific requirements.

Comment by Billy Donahue [ 11/Feb/20 ]

Is FTDC supposed to be reporting system-wide limits and stats, or just container-imposed limits and stats? I feel like we aren't consistent about this, and it's hard to write a comprehensive solution. I think that's because I'm not working from the context of an overarching epic, but just a few decoupled work items. It would be good to get more clarity from the product side on the expectations for these stats.

One solution here, leaning more toward the whole-system direction, would be to simply report on ALL the block devices in the system. We currently filter /proc/diskstats to show only "interesting" disks.

We define a Linux block device to be "interesting" if it represents a physically connected device. We could conceivably solve this ticket by removing that filter and reporting on all block devices. If the Kubernetes PersistentVolumeClaim bindings appear as all or part of some kind of block device, they'd turn up in the unfiltered /proc/diskstats. If they are implemented as some other kind of mount point (via FUSE or NFS tricks) that ISN'T backed by a block device, they'd remain invisible. I'm not sure those would be visible even with cgroup blkio stats reads; they might show up as network traffic instead, for example.
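
For illustration, an unfiltered pass over /proc/diskstats could look something like the sketch below. This is not the existing ftdc_system_stats_linux.cpp code; the field order follows the kernel's iostats documentation and matches the counter names FTDC already reports.

// Sketch: report every device in /proc/diskstats, with no "interesting" filter.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::ifstream diskstats("/proc/diskstats");
    std::string line;
    while (std::getline(diskstats, line)) {
        std::istringstream in(line);
        unsigned devMajor, devMinor;
        std::string name;
        unsigned long long reads, readsMerged, readSectors, readTimeMs;
        unsigned long long writes, writesMerged, writeSectors, writeTimeMs;
        unsigned long long ioInProgress, ioTimeMs, ioQueuedMs;
        if (!(in >> devMajor >> devMinor >> name
                 >> reads >> readsMerged >> readSectors >> readTimeMs
                 >> writes >> writesMerged >> writeSectors >> writeTimeMs
                 >> ioInProgress >> ioTimeMs >> ioQueuedMs))
            continue;
        // No physical-device filter: a PVC-backed device like xvdbb is reported too.
        std::cout << name << " reads=" << reads << " writes=" << writes
                  << " io_time_ms=" << ioTimeMs << "\n";
    }
}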

I just wrote a similar statement in the closely related SERVER-40506.

Comment by Billy Donahue [ 02/Feb/20 ]

Okay, I'm looking at how lxcfs gathers its stats. It's all user-space.
It's aggregating numbers read from various files under /sys/fs/cgroup/blkio/

I don't understand it all yet but we could do what lxcfs does.

$ cat /sys/fs/cgroup/blkio/blkio.io_merged
8:16 Read 0
8:16 Write 0
8:16 Sync 0
8:16 Async 0
8:16 Total 0
8:0 Read 246
8:0 Write 42360
8:0 Sync 300
8:0 Async 42306
8:0 Total 42606
Total 42606

$ cat /sys/fs/cgroup/blkio/blkio.io_service_bytes
8:16 Read 1242112
8:16 Write 53248
8:16 Sync 1283072
8:16 Async 12288
8:16 Total 1295360
8:0 Read 90748928
8:0 Write 490371871744
8:0 Sync 12467730432
8:0 Async 477994890240
8:0 Total 490462620672
Total 490463916032

ETC....
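
A rough sketch of that lxcfs-style aggregation follows, assuming the cgroup v1 blkio controller is mounted at /sys/fs/cgroup/blkio (cgroup v2 exposes io.stat instead). It illustrates the idea and is not code taken from lxcfs.

// Sketch: aggregate per-device Read/Write byte counters from a cgroup v1 blkio file.
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

int main() {
    std::ifstream f("/sys/fs/cgroup/blkio/blkio.io_service_bytes");  // assumed cgroup path
    std::map<std::string, unsigned long long> readBytes, writeBytes;
    std::string line;
    while (std::getline(f, line)) {
        std::istringstream in(line);
        std::string dev, op;  // dev is "major:minor", op is Read/Write/Sync/Async/Total
        unsigned long long value;
        if (!(in >> dev >> op >> value))
            continue;         // skips the trailing "Total <n>" summary line
        if (op == "Read")
            readBytes[dev] += value;
        else if (op == "Write")
            writeBytes[dev] += value;
    }
    for (const auto& [dev, bytes] : readBytes)
        std::cout << dev << " read_bytes=" << bytes
                  << " write_bytes=" << writeBytes[dev] << "\n";
}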

Comment by Billy Donahue [ 01/Feb/20 ]

This is pretty challenging.

What we do now in ftdc_system_stats_linux.cpp is iterate through all the block devices listed in /proc/diskstats, so we get the raw devices only. Container PersistentVolume mounts don't talk to the kernel the same way, and I don't know how we can access their raw stats. If anyone knows, please chime in!
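
For context, that "raw devices only" filter amounts to something like the sketch below: keep a device only if /sys/block/<name>/device exists, which is the same thing the stat command in the 28/Feb comment above checks for xvdbb. This is an illustration, not the literal ftdc_system_stats_linux.cpp code, and the device names are just examples.

// Sketch: the "interesting device" test -- does /sys/block/<name>/device exist?
#include <sys/stat.h>
#include <iostream>
#include <string>

bool isPhysicalBlockDevice(const std::string& name) {
    struct stat st;
    return ::stat(("/sys/block/" + name + "/device").c_str(), &st) == 0;
}

int main() {
    for (const char* dev : {"xvda", "xvdbb", "loop0"})  // example device names
        std::cout << dev << ": "
                  << (isPhysicalBlockDevice(dev) ? "interesting" : "filtered out") << "\n";
}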

I did find this FUSE filesystem, lxcfs, that replaces /proc/diskstats (among other things) with container-aware stats.
I think a container that had lxcfs installed would have this problem solved automatically.

 

https://linuxcontainers.org/lxcfs/introduction/

https://dzone.com/articles/kubernetes-demystified-using-lxcfs-to-improve-cont

It's possible the work required for this ticket is "tell people to install LXCFS"?

Comment by Brian Lane [ 03/Jun/19 ]

Can platforms take a look at this one?
