[SERVER-20206] cleanupOrphaned should log when it starts Created: 31/Aug/15  Updated: 19/Jul/16  Resolved: 08/Jul/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 3.3.10

Type: Improvement Priority: Minor - P4
Reporter: Kevin Pulo Assignee: Sam Dunietz
Resolution: Done Votes: 0
Labels: neweng, orphaned
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-15408 cleanupOrphaned should output progress Closed
Backwards Compatibility: Fully Compatible
Sprint: Sharding 10 (02/19/16), Sharding 17 (07/15/16)
Participants:

 Description   

Currently when running cleanupOrphaned, the log output looks like this:

2015-08-31T16:50:54.395+1000 I SHARDING [conn127] remotely refreshing metadata for test.test based on current shard version 0|0||55e3f24ec97f895066a73485, current metadata version is 1|2||55e3f24ec97f895066a73485
2015-08-31T16:50:54.396+1000 I SHARDING [conn127] updating metadata for test.test from shard version 0|0||55e3f24ec97f895066a73485 to shard version 2|0||55e3f24ec97f895066a73485
2015-08-31T16:50:54.396+1000 I SHARDING [conn127] collection version was loaded at version 2|1||55e3f24ec97f895066a73485, took 0ms
2015-08-31T16:50:54.396+1000 I SHARDING [conn127] Deleter starting delete for: test.test from { _id: 0.0 } -> { _id: MaxKey }, with opId: 708054
2015-08-31T16:54:53.335+1000 I SHARDING [conn127] Helpers::removeRangeUnlocked time spent waiting for replication: 197408ms
2015-08-31T16:54:53.335+1000 I SHARDING [conn127] rangeDeleter deleted 100000 documents for test.test from { _id: 0.0 } -> { _id: MaxKey }
2015-08-31T16:54:53.335+1000 I COMMAND  [conn127] command admin.$cmd command: cleanupOrphaned { cleanupOrphaned: "test.test", startingFromKey: { _id: MinKey } } keyUpdates:0 writeConflicts:0 numYields:0 reslen:121 locks:{ Global: { acquireCount: { r: 200006, w: 200002 } }, MMAPV1Journal: { acquireCount: { r: 2, w: 200002 }, acquireWaitCount: { w: 342 }, timeAcquiringMicros: { w: 669571 } }, Database: { acquireCount: { r: 2, w: 200002 } }, Collection: { acquireCount: { R: 2, W: 100002 } }, oplog: { acquireCount: { w: 100000 }, acquireWaitCount: { w: 10 }, timeAcquiringMicros: { w: 2056 } } } 238940ms

The problem is that none of the initial messages mention "cleanupOrphaned". This means that while the command is in progress (e.g. when trying to diagnose a slow-running cleanupOrphaned command), the only indication in the log is the initial "Deleter starting delete for" message. (Yes, the operation will show up in currentOp, but this ticket is about the logfile.) Unfortunately, this message is also output by the RangeDeleter in other contexts (i.e. chunk migrations), so its presence does not necessarily indicate that the cleanupOrphaned command was used. In addition, inexperienced users/operators/engineers might not be aware that they need to search the logs for "Deleter", which is less obvious than "cleanupOrphaned".

It would be much better if cleanupOrphaned logged a message when it starts, to make it clear that it is about to run a RangeDeleter.
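
For illustration, a minimal sketch of the kind of message this is asking for, assuming MongoDB's stream-style log() helper from mongo/util/log.h and a spot near the start of the cleanupOrphaned command's run() path (the variable names are assumptions for the sketch, not taken from the actual implementation):

    // Hypothetical sketch only, not the actual patch: log unconditionally at the
    // default verbosity so that the start of the command is visible in the logfile
    // and greppable by the command name.
    log() << "cleanupOrphaned requested for " << ns << " starting from "
          << startingFromKey << "; orphaned documents will be removed by the RangeDeleter";

A line like this would appear once per cleanupOrphaned invocation, immediately before the existing "Deleter starting delete for" message.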

This could also be considered a special case of SERVER-15408 and closed as a duplicate of that ticket, as long as that ticket is updated to make it clear that "progress" also means that cleanupOrphaned should log when it starts, as well as while it's working (which happens to be via the RangeDeleter).



 Comments   
Comment by Kevin Pulo [ 11/Jul/16 ]

Thanks!

Comment by Githook User [ 08/Jul/16 ]

Author: Sam Dunietz <sam.dunietz@10gen.com>

Message: SERVER-20206 cleanupOrphaned should log when it starts
Branch: master
https://github.com/mongodb/mongo/commit/b862765688496ea7f57ba02de1d449aaefac73d7

Comment by Esha Maharishi (Inactive) [ 08/Jun/16 ]

Okay, sounds good.

Comment by Kevin Pulo [ 08/Jun/16 ]

I would only add that the phrasing of the log line on L105 should be adjusted to match, since it also currently starts with "orphaned data cleanup requested for".

Comment by Esha Maharishi (Inactive) [ 07/Jun/16 ]

kevin.pulo, thanks so much!

In summary, I'd suggest changing the log line on L116 to logLevel=0 and rephrasing the initial part from "orphaned data cleanup requested for" to "cleanupOrphaned requested for".
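
For illustration, a rough before/after sketch of that suggestion; the stream arguments are approximated rather than copied from cleanup_orphaned_cmd.cpp:

    // Before (approximate): only emitted at verbosity 1, so hidden at the default logLevel.
    LOG(1) << "Orphaned data cleanup requested for " << ns << " starting from " << startingFromKey;

    // After (suggested): emitted at the default verbosity and searchable by the command name.
    log() << "cleanupOrphaned requested for " << ns << " starting from " << startingFromKey;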

If this doesn't get picked up by our interns, I'll do it at the end of our current sprint.

Comment by Kevin Pulo [ 06/Jun/16 ]

esha.maharishi, thanks for looking into this!

if it actually performs any deletes

You mean if it scans for deletes — but also if it knows that it doesn't need to (L105-106).

the logLevel was increased to 0?

Absolutely.

the command logged that it was called, even if no deletes are performed (and an ok:0 status is returned to the caller)?

I don't see any need to log at level 0 if the command returns quickly (either success or failure). So the early returns above L116 (i.e. L108, L97, L87) don't need any additional logging. And the returns after L116 (on L133/136), which might happen much later, will be adequately covered by the L116 logging once it's converted to logLevel 0.

the log message contained the term "cleanupOrphaned"?

Absolutely.

Comment by Esha Maharishi (Inactive) [ 02/Jun/16 ]

kevin.pulo, it seems like cleanupOrphaned logs a line at logLevel=1 if it actually performs any deletes:
https://github.com/mongodb/mongo/blob/6dcdd23/src/mongo/db/s/cleanup_orphaned_cmd.cpp#L116-L118

Would it be helpful if:

  • the logLevel was increased to 0?
  • the command logged that it was called, even if no deletes are performed (and an ok:0 status is returned to the caller)?
  • the log message contained the term "cleanupOrphaned"?


Additional info: an example of similar log lines when the range deleter is invoked during moveChunk:

[js_test:cleanup_orphaned_basic] 2016-06-02T13:01:44.767-0400 d15513| 2016-06-02T13:01:44.767-0400 I SHARDING [conn13] moveChunk moved last chunk out for collection 'foo.bar'
[js_test:cleanup_orphaned_basic] 2016-06-02T13:01:44.819-0400 d15513| 2016-06-02T13:01:44.818-0400 I SHARDING [conn13] about to log metadata event into changelog: { _id: "eshamaharishi-X10DAi-2016-06-02T13:01:44.818-0400-575066785e64e5524faf4f9b", server: "eshamaharishi-X10DAi", clientAddr: "127.0.0.1:58824", time: new Date(1464886904818), what: "moveChunk.commit", ns: "foo.bar", details: { min: { _id: MinKey }, max: { _id: MaxKey }, from: "test-rs1", to: "test-rs0" } }
[js_test:cleanup_orphaned_basic] 2016-06-02T13:01:44.850-0400 d15513| 2016-06-02T13:01:44.850-0400 I SHARDING [conn13] forking for cleanup of chunk data
[js_test:cleanup_orphaned_basic] 2016-06-02T13:01:44.851-0400 d15513| 2016-06-02T13:01:44.850-0400 I SHARDING [RangeDeleter] Deleter starting delete for: foo.bar from { _id: MinKey } -> { _id: MaxKey }, with opId: 360
[js_test:cleanup_orphaned_basic] 2016-06-02T13:01:44.853-0400 d15513| 2016-06-02T13:01:44.852-0400 I SHARDING [RangeDeleter] rangeDeleter deleted 0 documents for foo.bar from { _id: MinKey } -> { _id: MaxKey }
