- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Checkpoints
- None
- Storage Engines - Persistence
- 204.154
- SE Persistence backlog
- None
This ticket is derived from WT-17563. While monitoring the size drop, we found a pattern: once drop_size appears, it repeats across many consecutive checkpoints and eventually causes the disagg database size to overflow. The suspicion is that the schema drop operation is kept, or continuously re-submitted, across multiple checkpoints, so the drop_size of the dropped schema table is subtracted from the database size again in each of them.
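To illustrate the suspected mechanism, here is a minimal Python sketch of a size counter that subtracts each drop only once, even if the same drop is submitted in several checkpoints. The class, its fields, and the table id are hypothetical and are not taken from the WiredTiger code; the numbers come from checkpoints #571 and #572 in the trace.

```python
class SizeTracker:
    """Hypothetical database-size counter that applies each drop once."""

    def __init__(self, db_size):
        self.db_size = db_size
        self.accounted_drops = set()  # table ids whose size was already subtracted
        self.prevented = 0            # bytes that were not subtracted a second time

    def checkpoint(self, ckpt_size_delta, drops):
        """Apply one checkpoint; drops maps a table id to its dropped bytes."""
        self.db_size += ckpt_size_delta
        for table_id, size in drops.items():
            if table_id in self.accounted_drops:
                # Same drop seen again: skip it instead of subtracting twice.
                self.prevented += size
                continue
            self.accounted_drops.add(table_id)
            self.db_size -= size
        return self.db_size


tracker = SizeTracker(2523959500)
# The same drop (hypothetical table id 7) is submitted in two checkpoints.
first = tracker.checkpoint(580381, {7: 88508460})
second = tracker.checkpoint(608143, {7: 88508460})
```

With the guard, the first checkpoint reproduces the size logged for #571, while the second only adds its ckpt_size_delta. Without the guard, the second checkpoint would subtract the same 88508460 bytes again, which is what the trace shows.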
The log trace (see WT-17563 for more details):
' ckpt #571: db_size 2523959500 -> 2436031421 (ckpt_size_delta=580381, drop_size=88508460); this ckpt: subtracted=174759 prevented_double_subtract=0; cumulative prevented=0 (count=0)\n' +
' ckpt #572: db_size 2436031421 -> 2348131104 (ckpt_size_delta=608143, drop_size=88508460); this ckpt: subtracted=165731 prevented_double_subtract=0; cumulative prevented=0 (count=0)\n' +
' ckpt #573: db_size 2348131104 -> 2260180405 (ckpt_size_delta=557761, drop_size=88508460); this ckpt: subtracted=124805 prevented_double_subtract=0; cumulative prevented=0 (count=0)\n' +
' ckpt #574: db_size 2260180405 -> 2172248635 (ckpt_size_delta=576690, drop_size=88508460); this ckpt: subtracted=138822 prevented_double_subtract=0; cumulative prevented=0 (count=0)\n' +
' ckpt #575: db_size 2172248635 -> 2084153877 (ckpt_size_delta=413702, drop_size=88508460); this ckpt: subtracted=230042 prevented_double_subtract=0; cumulative prevented=0 (count=0)\n' +
' ckpt #576: db_size 2084153877 -> 1996166920 (ckpt_size_delta=521503, drop_size=88508460); this ckpt: subtracted=131622 prevented_double_subtract=0; cumulative prevented=0 (count=0)\n' +
' ckpt #577: db_size 1996166920 -> 1908239504 (ckpt_size_delta=581044, drop_size=88508460); this ckpt: subtracted=131661 prevented_double_subtract=0; cumulative prevented=0 (count=0)\n' +
' ckpt #578: db_size 1908239504 -> 1820264332 (ckpt_size_delta=533288, drop_size=88508460); this ckpt: subtracted=124528 prevented_double_subtract=0; cumulative prevented=0 (count=0)\n' +
' ckpt #579: db_size 1820264332 -> 1732315660 (ckpt_size_delta=559788, drop_size=88508460); this ckpt: subtracted=106222 prevented_double_subtract=0; cumulative prevented=0 (count=0)\n' +
' ckpt #580: db_size 1732315660 -> 1628617408 (ckpt_size_delta=443554, drop_size=104141806); this ckpt: subtracted=188912 prevented_double_subtract=0; cumulative prevented=0 (count=0)\n' +
' ckpt #581: db_size 1628617408 -> 1462403312 (ckpt_size_delta=461094, drop_size=166675190); this ckpt: subtracted=127669 prevented_double_subtract=0; cumulative prevented=0 (count=0)\n' +
' ckpt #582: db_size 1462403312 -> 1295745844 (ckpt_size_delta=17722, drop_size=166675190); this ckpt: subtracted=86479 prevented_double_subtract=0; cumulative prevented=0 (count=0)\n' +
' ckpt #583: db_size 1295745844 -> 1129076959 (ckpt_size_delta=6305, drop_size=166675190); this ckpt: subtracted=42898 prevented_double_subtract=0; cumulative prevented=0 (count=0)\n' +
' ckpt #584: db_size 1129076959 -> 962407839 (ckpt_size_delta=6070, drop_size=166675190); this ckpt: subtracted=42900 prevented_double_subtract=0; cumulative prevented=0 (count=0)\n' +
' ckpt #585: db_size 962407839 -> 795738731 (ckpt_size_delta=6082, drop_size=166675190); this ckpt: subtracted=42887 prevented_double_subtract=0; cumulative prevented=0 (count=0)\n' +
' ckpt #586: db_size 795738731 -> 629069624 (ckpt_size_delta=6083, drop_size=166675190); this ckpt: subtracted=42886 prevented_double_subtract=0; cumulative prevented=0 (count=0)\n' +
' ckpt #587: db_size 629069624 -> 462400517 (ckpt_size_delta=6083, drop_size=166675190); this ckpt: subtracted=42886 prevented_double_subtract=0; cumulative prevented=0 (count=0)\n' +
' ckpt #588: db_size 462400517
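Each complete line of the trace is consistent with the new size being computed as the old size plus ckpt_size_delta minus drop_size. A small Python sketch that parses lines in this format, checks that relation, and flags checkpoints that repeat the previous drop_size follows; the regular expression and function names are illustrative assumptions, not part of any WiredTiger tooling.

```python
import re

# Matches the fields used below; written against the trace format above.
LINE_RE = re.compile(
    r"ckpt #(?P<ckpt>\d+): db_size (?P<before>\d+) -> (?P<after>\d+) "
    r"\(ckpt_size_delta=(?P<delta>-?\d+), drop_size=(?P<drop>\d+)\)"
)


def parse_trace(text):
    """Return one dict of integer fields per complete checkpoint line."""
    return [{k: int(v) for k, v in m.groupdict().items()}
            for m in LINE_RE.finditer(text)]


def is_consistent(row):
    """Check that after == before + ckpt_size_delta - drop_size."""
    return row["after"] == row["before"] + row["delta"] - row["drop"]


def repeated_drops(rows):
    """Return checkpoint numbers whose non-zero drop_size equals the previous one's."""
    return [cur["ckpt"] for prev, cur in zip(rows, rows[1:])
            if cur["drop"] != 0 and cur["drop"] == prev["drop"]]


sample = (
    "ckpt #571: db_size 2523959500 -> 2436031421 "
    "(ckpt_size_delta=580381, drop_size=88508460)\n"
    "ckpt #572: db_size 2436031421 -> 2348131104 "
    "(ckpt_size_delta=608143, drop_size=88508460)\n"
)
rows = parse_trace(sample)
print(all(is_consistent(r) for r in rows), repeated_drops(rows))
```

Truncated lines, such as the last one in the trace, do not match the pattern and are skipped.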
Context
- Cluster: sls-smoke-dev-aws-usw2
- Metric details: Victoria Metrics dashboard
- Related Jira: WT-16664
- The discussion suggests an arithmetic bug introduced during development, possibly an underflow in which a subtraction occurs without a corresponding addition, causing the metric to wrap around to the maximum int64 value.
- The metric is not initialized to -1, so the issue is likely due to incorrect arithmetic logic.
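The underflow hypothesis can be sketched in Python by masking to 64 bits. This assumes the size is held in an unsigned 64-bit counter, which the ticket does not state, and it assumes later checkpoints keep roughly the ckpt_size_delta (6083) and drop_size (166675190) seen in checkpoints #586 and #587. The function names are hypothetical.

```python
MASK64 = (1 << 64) - 1  # models an unsigned 64-bit counter (assumption)


def apply_checkpoint(db_size, ckpt_size_delta, drop_size):
    """Return the next size with C-style unsigned 64-bit wraparound."""
    return (db_size + ckpt_size_delta - drop_size) & MASK64


def checkpoints_until_wrap(db_size, ckpt_size_delta, drop_size, limit=1000):
    """Count checkpoints until the subtraction underflows; None if not within limit."""
    for n in range(1, limit + 1):
        if db_size + ckpt_size_delta < drop_size:
            return n
        db_size = apply_checkpoint(db_size, ckpt_size_delta, drop_size)
    return None


# Start from the db_size logged at the beginning of checkpoint #588.
size = 462400517
for _ in range(3):
    size = apply_checkpoint(size, 6083, 166675190)
print(size > (1 << 63))
```

Under these assumptions the counter underflows on the third further checkpoint and lands near the top of the 64-bit range, which matches the wrap-around described in the discussion.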
- is related to: WT-16664 Merge the Checkpoint size for disagg into develop (Closed)