[SERVER-30283] PingStats::hit() should not set _numFailuresSinceLastStart to integer max Created: 24/Jul/17 Updated: 30/Oct/23 Resolved: 09/Jan/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.4, 3.7.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | William Schultz (Inactive) | Assignee: | William Schultz (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | neweng, pull-request | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Operating System: | ALL | ||||||||||||
| Backport Requested: |
v3.6, v3.4
|
||||||||||||
| Participants: | |||||||||||||
| Linked BF Score: | 46 | ||||||||||||
| Description |
|
PingStats, defined in ToplogyCoordinatorImpl, is a data structure used to track statistics about replication heartbeat messages. It has a variable called _numFailuresSinceLastStart, which keeps track of failed heartbeat attempts. When a good heartbeat response comes back, this variable is set to std::numeric_limits<int>::max(), presumably as an easy way for this value to always be larger than others in certain comparisons. In the case that we actually get back a good heartbeat response (PingStats::hit) and then a failed heartbeat from the same node (PingStats::miss), we will actually overflow this _numFailuresSinceLastStart variable. We should not use the integer max as a way to determine that we received a good heartbeat response. Creating a separate field in PingStats to explicitly track this would probably be a simple approach. |
| Comments |
| Comment by Githook User [ 26/Feb/18 ] |
|
Author: {'email': 'william.schultz@mongodb.com', 'name': 'William Schultz', 'username': 'will62794'}Message: (cherry picked from commit a76bbc3d1b426954312acb4d03775350cda83733) |
| Comment by Githook User [ 26/Feb/18 ] |
|
Author: {'email': 'cramaechi@me.com', 'name': 'Chibuikem Amaechi', 'username': 'cramaechi'}Message: Closes #1196 Signed-off-by: William Schultz <william.schultz@mongodb.com> |
| Comment by Githook User [ 25/Jan/18 ] |
|
Author: {'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}Message: |
| Comment by Githook User [ 25/Jan/18 ] |
|
Author: {'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}Message: |
| Comment by Githook User [ 09/Jan/18 ] |
|
Author: {'email': 'cramaechi@me.com', 'name': 'Chibuikem Amaechi', 'username': 'cramaechi'}Message: Closes #1196 Signed-off-by: William Schultz <william.schultz@mongodb.com> |
| Comment by Chibuikem Amaechi [ 23/Dec/17 ] |
|
Hi Kelsey, No prob. Thank you! |
| Comment by Kelsey Schubert [ 23/Dec/17 ] |
|
Thanks for the pull request, cramaechi! Many developers are on vacation this coming week for the holidays, but we'll take a look soon after. |
| Comment by Chibuikem Amaechi [ 22/Dec/17 ] |
|
Hi Spencer, I have opened a pull request and looking forward to your feedback! Best Regards, |
| Comment by Chibuikem Amaechi [ 21/Dec/17 ] |
|
Hi Spencer, I ran into a some problems trying to make certain comparisons work the same with the hasFailed() method. Here are the comparisons: mongo/src/mongo/db/repl/topology_coordinator.cpp To fix this problem, I first changed the method name hasFailed() to noMoreRetries(), and then made it return true if _numFailuresSinceLastStart was greater than kMaxHeartBeatRetries or there was at least one good heartbeat received since the last call to start(). I adjusted all the comparisons to use noMoreRetries() except for the comparison in lines 1028 - 1029, to which I added the following: hbStats.getGoodHeartbeatReceived() == false. If my explanation is not clear enough, I can post screenshots of my work to give you a better idea of what I am proposing. Looking forward to hearing your thoughts. Best Regards, |
| Comment by Chibuikem Amaechi [ 19/Dec/17 ] |
|
Great. I will open a pull request by the end of today! |
| Comment by Spencer Brody (Inactive) [ 19/Dec/17 ] |
|
cramaechi, please feel free to put together a pull request and we'd be happy to take a look! |
| Comment by Spencer Brody (Inactive) [ 18/Dec/17 ] |
|
Seems like a reasonable approach, although instead of adding an accessor for _goodHeartbeatReceived, I would actually just add a method "hasFailed()" that returns true if the _numFailuresSinceLastStart is greater than kMaxHeartbeatRetries and there's been no good heartbeat received since the last start. Then you could also remove the accessor for _numFailuresSinceLastStart and tighten up the interface for PingStats. EDIT: Actually this log message might make it hard to get rid of the getNumFailuresSinceLastStart() accessor method, but you could probably still remove all the other callers in favor of using hasFailed() |
| Comment by Kelsey Schubert [ 18/Dec/17 ] |
|
spencer, would you please review cramaechi's proposed change? Thanks, |
| Comment by Chibuikem Amaechi [ 17/Dec/17 ] |
|
Hi William, I am a new server contributor and would like to be assigned this ticket. My proposed change would involve creating a separate field called _goodHeartbeatReceived in PingStats to flag when a good heartbeat was received, and then adding an accessor function to go along with this new field. The aforementioned changes would allow us to remove the assignment statement (_numFailuresSinceLastStart = std::numeric_limits<int>::max()) causing the potential overflow. Modifications would then be made to certain comparisons that referenced the _numFailuresSinceLastStart field to accommodate my new changes. Please share your thoughts. Best Regards, |