[SERVER-32493] PlanExecutor killed: CappedPositionLost after upgrading from 3.4.1 => 3.6.1 Created: 31/Dec/17  Updated: 22/Feb/18  Resolved: 19/Jan/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.6.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Davis Ford Assignee: Kelsey Schubert
Resolution: Duplicate Votes: 1
Labels: oplog
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

ubuntu@ip-10-0-0-192:~$ cat /etc/issue
Ubuntu 14.04.4 LTS \n \l

ubuntu@ip-10-0-0-192:~$ java -version
java version "1.7.0_75"
OpenJDK Runtime Environment (IcedTea 2.5.4) (7u75-2.5.4-1~trusty1)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)


Attachments: File mongo-java-driver-3.6.1.jar    
Issue Links:
Duplicate
duplicates SERVER-32606 Tailing oplog on secondary fails with... Closed
Related
Participants:

 Description   

We tail the oplog for some of our operations. After an upgrade today from 3.4.1 => 3.6.1 our oplog tailer is failing with the following stacktrace:

com.mongodb.MongoQueryException: Query failed with error code 175 and error message 'PlanExecutor killed: CappedPositionLost: CollectionScan died due to failure to restore tailable cursor position. Last seen record id: RecordId(6505472699941257226)' on server ip-10-0-0-131:27017
at com.mongodb.operation.QueryHelper.translateCommandException(QueryHelper.java:31)
at com.mongodb.operation.QueryBatchCursor.getMore(QueryBatchCursor.java:215)
at com.mongodb.operation.QueryBatchCursor.hasNext(QueryBatchCursor.java:103)
at com.mongodb.MongoBatchCursorAdapter.hasNext(MongoBatchCursorAdapter.java:46)
at com.mongodb.DBCursor.hasNext(DBCursor.java:155)
at com.mongodb.casbah.MongoCursorBase$class.hasNext(MongoCursor.scala:73)
at com.mongodb.casbah.MongoCursor.hasNext(MongoCursor.scala:487)

Not finding a lot of info about this out there after searching. Please advise? We have also tried 3.6.1 driver but same result. What changed and how can we work around this?



 Comments   
Comment by Davis Ford [ 21/Jan/18 ]

OK Kelsey,

Just curious, do you have any ETA on when 3.6.3 will be released?

Thanks,
Davis

Comment by Kelsey Schubert [ 19/Jan/18 ]

Hi davisford@gmail.com,

I'm going to close this ticket as a duplicate of SERVER-32606. We've completed the backport to 3.6, and the fix will be included in MongoDB 3.6.3, the next release of MongoDB 3.6.

Kind regards,
Kelsey

Comment by Bruce Lucas (Inactive) [ 09/Jan/18 ]

Thanks for reporting this Davis. I've reproduced the problem and opened SERVER-32606 to track a fix. Please follow that ticket for updates.

Since it appears in my repo that this issue only affects the secondary, you might be able to work around the problem by tailing the oplog on the primary.

Comment by Bruce Lucas (Inactive) [ 09/Jan/18 ]

Thanks Davis, that is helpful. We're investigating and will let you know what we find.

Comment by Davis Ford [ 09/Jan/18 ]

Mongo server 131 (which the client was connecting to above) environ:

[ec2-user@ip-10-0-0-131 ~]$ date
Tue Jan  9 04:12:58 UTC 2018

[ec2-user@ip-10-0-0-131 ~]$ cat /etc/issue
Amazon Linux AMI release 2017.09
Kernel \r on an \m

Dropbox link with slice of mongod.log tail and full diagnostic.data/ dir: https://www.dropbox.com/sh/9t9tk03ld3ftyz0/AAB2GC7s9rTvWMkxfvjfLovza?dl=0

Bruce for the mongod log, I ran as follows:

tail --bytes=G /log/mongod.log  | gzip > ip-10-0-0-131.mongod.log.gz

That's a full gig of log data for you...should be enough?

I compressed the full diagnostic.data directory as well. Let me know what else you need. Appreciate you looking into this. It is a real problem for us, and hoping you can resolve it.

Best regards,
Davis

Comment by Davis Ford [ 09/Jan/18 ]

Here is most recent client log snippet:

2018-01-09 03:35:37,791 INFO  org.mongodb.driver.cluster  - Cluster created with settings {hosts=[10.0.0.131:27017, 10.0.0.87:27017, 10.0.0.94:27017], mode=MULTIPLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
2018-01-09 03:35:37,793 INFO  org.mongodb.driver.cluster  - Adding discovered server 10.0.0.131:27017 to client view of cluster
2018-01-09 03:35:37,831 INFO  org.mongodb.driver.cluster  - Adding discovered server 10.0.0.87:27017 to client view of cluster
2018-01-09 03:35:37,834 INFO  org.mongodb.driver.cluster  - Adding discovered server 10.0.0.94:27017 to client view of cluster
[INFO] [01/09/2018 03:35:37.858] [warehouse-connector-akka.actor.default-dispatcher-4] [akka://warehouse-connector/user/oplog-consumer] starting tail from: TS time:Tue Jan 09 03:25:37 UTC 2018 inc:0
2018-01-09 03:35:37,896 INFO  org.mongodb.driver.connection  - Opened connection [connectionId{localValue:3, serverValue:29125}] to 10.0.0.94:27017
2018-01-09 03:35:37,899 INFO  org.mongodb.driver.connection  - Opened connection [connectionId{localValue:1, serverValue:17292}] to 10.0.0.131:27017
2018-01-09 03:35:37,903 INFO  org.mongodb.driver.connection  - Opened connection [connectionId{localValue:2, serverValue:16546}] to 10.0.0.87:27017
2018-01-09 03:35:37,911 INFO  org.mongodb.driver.cluster  - Monitor thread successfully connected to server with description ServerDescription{address=10.0.0.131:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 6, 1]}, minWireVersion=0, maxWireVersion=6, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=null, roundTripTimeNanos=8313019, setName='repl-emagine', canonicalAddress=ip-10-0-0-131:27017, hosts=[ip-10-0-0-131:27017, ip-10-0-0-87:27017, ip-10-0-0-94:27017], passives=[], arbiters=[], primary='ip-10-0-0-94:27017', tagSet=TagSet{[Tag{name='prefer', value='2'}]}, electionId=null, setVersion=56, lastWriteDate=Tue Jan 09 03:35:33 UTC 2018, lastUpdateTimeNanos=518102448359437}
2018-01-09 03:35:37,912 INFO  org.mongodb.driver.cluster  - Discovered cluster type of REPLICA_SET
2018-01-09 03:35:37,912 INFO  org.mongodb.driver.cluster  - Adding discovered server ip-10-0-0-131:27017 to client view of cluster
2018-01-09 03:35:37,915 INFO  org.mongodb.driver.cluster  - Monitor thread successfully connected to server with description ServerDescription{address=10.0.0.87:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 6, 1]}, minWireVersion=0, maxWireVersion=6, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=null, roundTripTimeNanos=11698765, setName='repl-emagine', canonicalAddress=ip-10-0-0-87:27017, hosts=[ip-10-0-0-131:27017, ip-10-0-0-87:27017, ip-10-0-0-94:27017], passives=[], arbiters=[], primary='ip-10-0-0-94:27017', tagSet=TagSet{[Tag{name='prefer', value='3'}]}, electionId=null, setVersion=56, lastWriteDate=Tue Jan 09 03:35:33 UTC 2018, lastUpdateTimeNanos=518102452352058}
2018-01-09 03:35:37,917 INFO  org.mongodb.driver.cluster  - Monitor thread successfully connected to server with description ServerDescription{address=10.0.0.94:27017, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 6, 1]}, minWireVersion=0, maxWireVersion=6, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=null, roundTripTimeNanos=16390944, setName='repl-emagine', canonicalAddress=ip-10-0-0-94:27017, hosts=[ip-10-0-0-131:27017, ip-10-0-0-87:27017, ip-10-0-0-94:27017], passives=[], arbiters=[], primary='ip-10-0-0-94:27017', tagSet=TagSet{[Tag{name='prefer', value='1'}]}, electionId=7fffffff0000000000000001, setVersion=56, lastWriteDate=Tue Jan 09 03:35:33 UTC 2018, lastUpdateTimeNanos=518102454391234}
2018-01-09 03:35:37,917 INFO  org.mongodb.driver.cluster  - Adding discovered server ip-10-0-0-87:27017 to client view of cluster
2018-01-09 03:35:37,919 INFO  org.mongodb.driver.cluster  - Adding discovered server ip-10-0-0-94:27017 to client view of cluster
2018-01-09 03:35:37,920 INFO  org.mongodb.driver.cluster  - Canonical address ip-10-0-0-131:27017 does not match server address.  Removing 10.0.0.131:27017 from client view of cluster
2018-01-09 03:35:37,923 INFO  org.mongodb.driver.cluster  - Server 10.0.0.94:27017 is no longer a member of the replica set.  Removing from client view of cluster.
2018-01-09 03:35:37,923 INFO  org.mongodb.driver.cluster  - Server 10.0.0.87:27017 is no longer a member of the replica set.  Removing from client view of cluster.
2018-01-09 03:35:37,923 INFO  org.mongodb.driver.cluster  - Canonical address ip-10-0-0-94:27017 does not match server address.  Removing 10.0.0.94:27017 from client view of cluster
2018-01-09 03:35:37,945 INFO  org.mongodb.driver.connection  - Opened connection [connectionId{localValue:5, serverValue:16547}] to ip-10-0-0-87:27017
2018-01-09 03:35:37,946 INFO  org.mongodb.driver.connection  - Opened connection [connectionId{localValue:6, serverValue:29126}] to ip-10-0-0-94:27017
2018-01-09 03:35:37,947 INFO  org.mongodb.driver.connection  - Opened connection [connectionId{localValue:4, serverValue:17293}] to ip-10-0-0-131:27017
2018-01-09 03:35:37,949 INFO  org.mongodb.driver.cluster  - Monitor thread successfully connected to server with description ServerDescription{address=ip-10-0-0-87:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 6, 1]}, minWireVersion=0, maxWireVersion=6, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=null, roundTripTimeNanos=3068935, setName='repl-emagine', canonicalAddress=ip-10-0-0-87:27017, hosts=[ip-10-0-0-131:27017, ip-10-0-0-87:27017, ip-10-0-0-94:27017], passives=[], arbiters=[], primary='ip-10-0-0-94:27017', tagSet=TagSet{[Tag{name='prefer', value='3'}]}, electionId=null, setVersion=56, lastWriteDate=Tue Jan 09 03:35:33 UTC 2018, lastUpdateTimeNanos=518102486236106}
2018-01-09 03:35:37,950 INFO  org.mongodb.driver.cluster  - Monitor thread successfully connected to server with description ServerDescription{address=ip-10-0-0-94:27017, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 6, 1]}, minWireVersion=0, maxWireVersion=6, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=null, roundTripTimeNanos=2963627, setName='repl-emagine', canonicalAddress=ip-10-0-0-94:27017, hosts=[ip-10-0-0-131:27017, ip-10-0-0-87:27017, ip-10-0-0-94:27017], passives=[], arbiters=[], primary='ip-10-0-0-94:27017', tagSet=TagSet{[Tag{name='prefer', value='1'}]}, electionId=7fffffff0000000000000001, setVersion=56, lastWriteDate=Tue Jan 09 03:35:33 UTC 2018, lastUpdateTimeNanos=518102487544878}
2018-01-09 03:35:37,950 INFO  org.mongodb.driver.cluster  - Setting max election id to 7fffffff0000000000000001 from replica set primary ip-10-0-0-94:27017
2018-01-09 03:35:37,951 INFO  org.mongodb.driver.cluster  - Monitor thread successfully connected to server with description ServerDescription{address=ip-10-0-0-131:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 6, 1]}, minWireVersion=0, maxWireVersion=6, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=null, roundTripTimeNanos=4039335, setName='repl-emagine', canonicalAddress=ip-10-0-0-131:27017, hosts=[ip-10-0-0-131:27017, ip-10-0-0-87:27017, ip-10-0-0-94:27017], passives=[], arbiters=[], primary='ip-10-0-0-94:27017', tagSet=TagSet{[Tag{name='prefer', value='2'}]}, electionId=null, setVersion=56, lastWriteDate=Tue Jan 09 03:35:33 UTC 2018, lastUpdateTimeNanos=518102489009718}
2018-01-09 03:35:37,951 INFO  org.mongodb.driver.cluster  - Setting max set version to 56 from replica set primary ip-10-0-0-94:27017
2018-01-09 03:35:37,952 INFO  org.mongodb.driver.cluster  - Discovered replica set primary ip-10-0-0-94:27017
2018-01-09 03:35:38,054 INFO  org.mongodb.driver.connection  - Opened connection [connectionId{localValue:7, serverValue:17294}] to ip-10-0-0-131:27017
2018-01-09 03:40:35,860 INFO  metrics  - type=HISTOGRAM, name=emaginepos.connector.transform.Transformer.oplog-lag, count=0, min=0, max=0, mean=0.0, stddev=0.0, median=0.0, p75=0.0, p95=0.0, p98=0.0, p99=0.0, p999=0.0
2018-01-09 03:40:35,861 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.BatchReportTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,862 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.BatchReportTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,862 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.CustomerTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,862 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.CustomerTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,862 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.DiscountTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,862 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.DiscountTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,862 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.EmployeeRoleTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,863 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.EmployeeRoleTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,863 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.EmployeeTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,863 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.EmployeeTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,863 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.LocationTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,863 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.LocationTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,863 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.MenuItemTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,863 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.MenuItemTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,863 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ModifierGroupItemLocationTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,864 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ModifierGroupItemLocationTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,864 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ModifierGroupItemTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,864 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ModifierGroupItemTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,864 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ModifierGroupTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,864 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ModifierGroupTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,864 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ModifierOptionTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,864 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ModifierOptionTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,865 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.OrderTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,865 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.OrderTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,865 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.OrderTypeTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,865 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.OrderTypeTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,865 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.OrganizationTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,865 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.OrganizationTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,865 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.PaymentMethodTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,866 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.PaymentMethodTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,866 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.PriceLevelTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,866 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.PriceLevelTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,866 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ReasonTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,866 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ReasonTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,866 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.StationTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,866 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.StationTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,866 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.SurchargeTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,867 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.SurchargeTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,867 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.TaxTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,867 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.TaxTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,867 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.TeamTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,867 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.TeamTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,867 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.TimesheetTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,867 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.TimesheetTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,868 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.UniqueModifierTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:40:35,868 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.UniqueModifierTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
[INFO] [01/09/2018 03:43:08.068] [warehouse-connector-akka.actor.default-dispatcher-9] [akka://warehouse-connector/user/oplog-consumer] started reading from cursor
2018-01-09 03:43:08,228 INFO  org.mongodb.driver.connection  - Opened connection [connectionId{localValue:8, serverValue:17299}] to ip-10-0-0-131:27017
2018-01-09 03:43:16,064 INFO  org.mongodb.driver.connection  - Opened connection [connectionId{localValue:9, serverValue:17300}] to ip-10-0-0-131:27017
2018-01-09 03:43:21,452 INFO  org.mongodb.driver.connection  - Opened connection [connectionId{localValue:10, serverValue:17301}] to ip-10-0-0-131:27017
[WARN] [01/09/2018 03:43:41.232] [warehouse-connector-akka.actor.default-dispatcher-7] [akka://warehouse-connector/user/oplog-consumer] MongoException: Query failed with error code 175 and error message 'PlanExecutor killed: CappedPositionLost: CollectionScan died due to failure to restore tailable cursor position. Last seen record id: RecordId(6508891584103186443)' on server ip-10-0-0-131:27017, restarting after 1311572 reads.
[INFO] [01/09/2018 03:43:56.246] [warehouse-connector-akka.actor.default-dispatcher-4] [akka://warehouse-connector/user/oplog-consumer] starting tail from: TS time:Tue Jan 09 03:43:37 UTC 2018 inc:11
2018-01-09 03:45:35,859 INFO  metrics  - type=HISTOGRAM, name=emaginepos.connector.transform.Transformer.oplog-lag, count=9649, min=4, max=1051, mean=490.0414499369277, stddev=304.0537832436827, median=481.0, p75=739.0, p95=986.0, p98=1016.0, p99=1034.0, p999=1048.0
2018-01-09 03:45:35,860 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.BatchReportTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,861 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.BatchReportTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,861 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.CustomerTransformer.insert, count=4, mean_rate=0.40142760419240786, m1=0.4595960410195186, m5=0.5187358076928581, m15=0.23080377163097426, rate_unit=events/minute
2018-01-09 03:45:35,861 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.CustomerTransformer.update, count=2, mean_rate=0.20071379869874875, m1=0.2198308663397613, m5=0.2571885451136742, m15=0.11508043541086072, rate_unit=events/minute
2018-01-09 03:45:35,861 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.DiscountTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,861 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.DiscountTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,861 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.EmployeeRoleTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,862 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.EmployeeRoleTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,862 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.EmployeeTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,862 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.EmployeeTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,862 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.LocationTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,863 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.LocationTransformer.update, count=1, mean_rate=0.10035870815585483, m1=0.10991543316988064, m5=0.1285942725568371, m15=0.05754021770543036, rate_unit=events/minute
2018-01-09 03:45:35,863 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.MenuItemTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,863 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.MenuItemTransformer.update, count=1, mean_rate=0.1003588329745165, m1=0.10112708036951136, m5=0.12646879618460594, m15=0.057221436153510934, rate_unit=events/minute
2018-01-09 03:45:35,863 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ModifierGroupItemLocationTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,863 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ModifierGroupItemLocationTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,863 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ModifierGroupItemTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,863 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ModifierGroupItemTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,863 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ModifierGroupTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,863 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ModifierGroupTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,864 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ModifierOptionTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,864 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ModifierOptionTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,864 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.OrderTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,864 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.OrderTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,864 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.OrderTypeTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,864 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.OrderTypeTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,864 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.OrganizationTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,864 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.OrganizationTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,864 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.PaymentMethodTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,864 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.PaymentMethodTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,865 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.PriceLevelTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,865 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.PriceLevelTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,865 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ReasonTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,865 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.ReasonTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,865 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.StationTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,865 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.StationTransformer.update, count=25, mean_rate=2.509044211526626, m1=2.7901120486662663, m5=3.2236820135729944, m15=1.439796581959679, rate_unit=events/minute
2018-01-09 03:45:35,865 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.SurchargeTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,865 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.SurchargeTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,865 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.TaxTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,865 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.TaxTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,866 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.TeamTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,866 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.TeamTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,866 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.TimesheetTransformer.insert, count=7, mean_rate=0.7025498575011336, m1=0.7187859490647415, m5=0.8875124427675815, m15=0.4008741329212487, rate_unit=events/minute
2018-01-09 03:45:35,866 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.TimesheetTransformer.update, count=195, mean_rate=19.571030242593437, m1=22.088706153427285, m5=25.209293562799097, m15=11.239760985782377, rate_unit=events/minute
2018-01-09 03:45:35,866 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.UniqueModifierTransformer.insert, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
2018-01-09 03:45:35,866 INFO  metrics  - type=METER, name=emaginepos.connector.transform.collection.UniqueModifierTransformer.update, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/minute
[INFO] [01/09/2018 03:46:06.283] [warehouse-connector-akka.actor.default-dispatcher-7] [akka://warehouse-connector/user/oplog-consumer] started reading from cursor
[WARN] [01/09/2018 03:46:29.146] [warehouse-connector-akka.actor.default-dispatcher-7] [akka://warehouse-connector/user/oplog-consumer] MongoException: Query failed with error code 175 and error message 'PlanExecutor killed: CappedPositionLost: CollectionScan died due to failure to restore tailable cursor position. Last seen record id: RecordId(6508892305657692169)' on server ip-10-0-0-131:27017, restarting after 4011 reads.
[INFO] [01/09/2018 03:46:44.167] [warehouse-connector-akka.actor.default-dispatcher-6] [akka://warehouse-connector/user/oplog-consumer] starting tail from: TS time:Tue Jan 09 03:46:25 UTC 2018 inc:9
[INFO] [01/09/2018 03:47:33.420] [warehouse-connector-akka.actor.default-dispatcher-8] [akka://warehouse-connector/user/oplog-consumer] started reading from cursor
[WARN] [01/09/2018 03:47:43.741] [warehouse-connector-akka.actor.default-dispatcher-5] [akka://warehouse-connector/user/oplog-consumer] MongoException: Query failed with error code 175 and error message 'PlanExecutor killed: CappedPositionLost: CollectionScan died due to failure to restore tailable cursor position. Last seen record id: RecordId(6508892623485272343)' on server ip-10-0-0-131:27017, restarting after 1817 reads.
[INFO] [01/09/2018 03:47:58.756] [warehouse-connector-akka.actor.default-dispatcher-3] [akka://warehouse-connector/user/oplog-consumer] starting tail from: TS time:Tue Jan 09 03:47:39 UTC 2018 inc:279

The important lines are the ones with this message/error from the client:

[WARN] [01/09/2018 03:46:29.146] [warehouse-connector-akka.actor.default-dispatcher-7] [akka://warehouse-connector/user/oplog-consumer] MongoException: Query failed with error code 175 and error message 'PlanExecutor killed: CappedPositionLost: CollectionScan died due to failure to restore tailable cursor position. Last seen record id: RecordId(6508892305657692169)' on server ip-10-0-0-131:27017, restarting after 4011 reads.

Comment by Davis Ford [ 09/Jan/18 ]

OK, here's some more info on the running client/driver environment:

ubuntu@ip-10-0-0-192:~$ date
Tue Jan  9 03:49:42 UTC 2018

ubuntu@ip-10-0-0-192:~$ cat /etc/issue
Ubuntu 14.04.5 LTS \n \l

ubuntu@ip-10-0-0-192:~$ java -version
java version "1.7.0_151"
OpenJDK Runtime Environment (IcedTea 2.6.11) (7u151-2.6.11-2ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.151-b01, mixed mode)

Mongo Java Driver: 3.6.1 (attached) mongo-java-driver-3.6.1.jar

Servers (both client and Mongo) are all running in Amazon EC2 in same VPC – have been running for about 4+ years with various upgrades.

I will post some logs shortly....

Comment by Bruce Lucas (Inactive) [ 08/Jan/18 ]

Hi Davis,

Yes, that would be good.

For mongod log please include a good chunk of context, for example going back to the most recent restart of mongod to the end of the log file.

For the diagnostic data, also please upload the entire directory; this will give us context, and also may allow us to correlate with other problems.

Thanks,
Bruce

Comment by Davis Ford [ 08/Jan/18 ]

Hi Bruce,

We can reproduce it – we reproduce it every day several times a day. Hoping my colleague, Matt Lehman can jump in on this thread – he wrote the tailer, and I believe he is subscribed to the issue. But I believe he was able to tweak a couple things and now it stays connected for 20-30 sec and then it gets terminated with this same error. He just tries to re-establish when it is disconnected.

We have two modes of operation:

a) Tail oplog which is meant to provide real-time data updates to another system
b) Batch process which takes some timestamps among other params to sync the data from Mongo to the other system.

The oplog tailer is currently limping along – staying connected for a bit then getting this exception thrown now. We run the batch processes periodically to help alleviate the issue.

So, in summary, this problem is happening all the time for us. I can grab logs from both sides and correlate them for you if you tell me what you want. I'm assuming:

a) mongod.log section that corresponds to problem
b) diagnostic.data/ log section that corresponds to problem
c) client side driver log that corresponds to problem

Is that what you are looking for?

Comment by Bruce Lucas (Inactive) [ 08/Jan/18 ]

Hi Davis,

In order to correlate events in the diagnostic data that you provided with the CappedPositionLost issue we'll need a specific timeline, i.e. the start of the tail operation and the exact timestamp of the failure.

Do you have that information for any of the failures that correspond to the data you already uploaded, i.e. a tail operation on the 131 node some time prior to 01-02 22:44:56 UTC? If not, would you be in a position to recreate the problem and provide timeline and upload diagnostic data from the relevant node?

Thanks,
Bruce

Comment by Davis Ford [ 04/Jan/18 ]

Hi Dan,

Thanks for reminding me that the oplog is capped. Completely forgot about that. If there's any more info we can provide, let us know.

Comment by Daniel Pasette (Inactive) [ 04/Jan/18 ]

Hi Davis, that extra info about the driver is quite helpful. I'm copying some folks who work on the java driver so that they can have a look.

BTW, Kelsey was saying that the error you got indicates normally that the oplog has rolled over (deleted the entry which it was positioned at). The oplog is itself a capped collection.

cc: jeff.yemin/ross.lawley

Comment by Davis Ford [ 04/Jan/18 ]

Hi Kelsey,

In our other account I have a JIRA issue where they also requested mongod logs and diagnostic.data logs.

That issue is here: https://jira.mongodb.org/projects/MMSSUPPORT/issues/MMSSUPPORT-19150?filter=allopenissues

In that issue, I generated those logs – similar to what you're looking for and dumped them in a dropbox link:

https://www.dropbox.com/sh/69dblq4e0wilr7x/AABh-o4haxHkhM8tonUd6uHba?dl=0

The issue we're having is not with a capped collection. I actually don't think we have any capped collections on this database at all. This issue was with a fairly large collection, but we traditionally had no problem tailing the oplog on it. Once we upgraded from 3.4.1 to 3.6.1 it broke..in Java/Scala driver. Note that I'm also using the Golang mgo driver and also tailing the oplog and it is working fine after the upgrade. Seems to be related to the Java driver?

If you need more info let me know, but the logs are up in that dropbox link.

Regards,
Davis

Comment by Kelsey Schubert [ 03/Jan/18 ]

Hi davisford@gmail.com,

Generally, this error indicates that the capped collection has rolled over – new insertions to the collection caused documents to be deleted that the query's cursor had yet to access. So we can continue to investigate this issue, would you please upload the complete logs and an archive of the diagnostic.data directory from the affected mongod node?

Thank you,
Kelsey

Comment by Davis Ford [ 03/Jan/18 ]

Hi, is there any chance you can move this under my other account (as the reporter)? I guess I seem to have 2 JIRA accounts on here. The one that is tied to our business and where I've submitted issues before is "davisford@gmail.com". This might be easier for us to track it.

If you have any questions about the bug or need any further info, please let us know. We are looking at possibly migrating to using the new Change Streams feature as a replacement https://docs.mongodb.com/v3.6/changeStreams/ – but so far since I just recently upgraded to 3.6.x I have not turned on the feature compatibility – wanted to let things settle and ensure our backups were working again, so it will indeed be a while before we can try to migrate to that.

This is causing us production headache. Appreciate any insight.

Generated at Thu Feb 08 04:30:24 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.