[SERVER-21444] WiredTiger replica set member is getting behind while syncing the oplog during the initialSync Created: 13/Nov/15  Updated: 11/Jan/16  Resolved: 11/Jan/16

Status: Closed
Project: Core Server
Component/s: Replication, WiredTiger
Affects Version/s: 3.0.7
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Ricardo Lorenzo Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File iostat_timeseries.png     PNG File ss_timeseries.png    
Issue Links:
Duplicate
duplicates SERVER-18908 Secondaries unable to keep up with pr... Closed
Related
Operating System: ALL
Participants:

 Description   

During the oplog sync phase of the initial sync, the WiredTiger replica set is getting behind (almost 8 hours per day). It could be related to SERVER-21230 or SERVER-16575.

namespace               operation    pattern    count    min (ms)    max (ms)    mean (ms)    95%-ile (ms)    sum (ms)
 
local.system.replset    query        None         179           0        3607          144           882.1    25944



 Comments   
Comment by Bruce Lucas (Inactive) [ 15/Dec/15 ]

I suspect this could be a duplicate of SERVER-18908 (and related tickets).

Comment by Ricardo Lorenzo [ 15/Dec/15 ]

After adding the new WiredTiger secondary member to the replica set, it continues getting behind and the lag is growing. The problem seems to be more general and not exclusive for the initial sync.

Comment by Ricardo Lorenzo [ 11/Dec/15 ]

The following is the only valid workaround at the moment for migrating from MMAPv1 to WiredTiger if you have a large database and a large oplog.

  • Remove one of the members from the current replica set (with MMAPv1)

    rs.remove("ip-192-168-44-40:17023");
    

  • Save the primary replica set configuration in a file

    printjson(rs.config())
    

  • Save the last operation from the removed node oplog in a file

    printjson(db.getSiblingDB("local").oplog.rs.find({}, { ts: 1, h: 1, _id: -1 }).sort({ \$natural: -1 }).limit(1).next())
    

  • Restart the server changing the --replSet value
  • Reconfigure the replica set name. Please note that rstest is the original replica set name and rstest2 is the new one

    var conf = rs.config();
    conf._id="rstest2";
    conf.members=[ { _id: 0, host: "ip-192-168-44-40:17023" } ];
    db.getSiblingDB("local").system.replset.insert(conf);
    db.getSiblingDB("local").system.replset.remove({ _id: "rstest" });
    

  • Restart mongod
  • Check the new configuration using rs.config()
  • Add the WiredTiger member and wait for the initial sync to complete
  • Remove the WiredTiger node from the replica set
  • Stop both servers and start them without the --replSet option
  • On each node, reconfigure back the old replicaset configuration.

    var conf = ORIGINAL_SAVED_CONF;
    db.getSiblingDB("local").system.replset.remove({ _id: "rstest2" });
    db.getSiblingDB("local").system.replset.insert(conf);
    

  • Recreate the oplog and apply the saved oplog document on both servers

    db.getSiblingDB("local").oplog.rs.drop();
    db.getSiblingDB("local").replset.minvalid.drop();
    db.getSiblingDB("local").runCommand({ create: "oplog.rs", capped: true, size: (1 * 1024 * 1024 * 1024) }));
    db.getSiblingDB("local").replset.minvalid.insert(SAVED_OPLOG_DOC);
    db.getSiblingDB("local").oplog.rs.save(SAVED_OPLOG_DOC);
    

  • Restart both servers changing the --replSet value
  • Wait for 5 seconds
  • Add both members to the original replica set

    rs.add("ip-192-168-44-40:17023")
    rs.add("ip-192-168-44-40:17024")
    

This procedure was tested successfully in the 4TB replica set. I tested it before using the following script. I'm still not sure why I can see "transition to STARTUP2" in the log of the new WiredTiger member, but this is not triggering an initial sync process. In this script, if I remove the sleep commands at the end of the process, the initial sync is triggered, so you would need to make sure you have the proper timing between each action.

#!/bin/bash
MONGO_DIR=/var/lib/mongodb-mms-automation/mongodb-linux-x86_64-3.0.7-ent
MONGO_DATA=/data
MONGO_RS="rstest"
MONGO_TEMP_RS="rstest2"
OUTPUT=""
 
function startMember {
  local RS_NAME=$1
  local MEMBER=$2
  local MONGOD_OPTS="--dbpath ${MONGO_DATA}/s${MEMBER} --port 1702${MEMBER}"
  if [ -z "${MEMBER}" ]; then
    echo "F:startMember: member not defined"
    return
  fi
  if [ -n "${RS_NAME}" ]; then
    MONGOD_OPTS="${MONGOD_OPTS} --replSet ${RS_NAME}"
  fi
  MONGOD_OPTS="${MONGOD_OPTS} --smallfiles --fork --logappend --logpath ${MONGO_DATA}/s${MEMBER}/mongod.log"
  ${MONGO_DIR}/bin/mongod ${MONGOD_OPTS} 2>&1 >/dev/null &
  while true; do
    MPID=$(getMemberPid ${MEMBER})
    if [ -z "${MPID}" ]; then
      break;
    fi
    sleep 1
    runCommandOnMember ${MEMBER} "db.isMaster()"
    if [ $? -eq 0 ]; then
      break;
    fi
  done
}
 
function stopMember {
  local MAX_COUNT=10
  local COUNT=1
  local MEMBER=$1
  if [ -z "${MEMBER}" ]; then
    echo "F:stopMember: member not defined"
    return
  fi
  local TEMP_PID=$(ps -ax | grep "\-\-port 1702${MEMBER}" | awk '{ print $1 }')
  if [ -z "${TEMP_PID}" ]; then
    #echo "F:stopMember: pid not found for ${MEMBER}"
    return
  fi
  kill ${TEMP_PID} 2>&1 >/dev/null
  while true; do
    MPID=$(getMemberPid ${MEMBER})
    if [ -z "${MPID}" ]; then
      break;
    fi
    sleep 1
    if [ "$COUNT" -gt "$MAX_COUNT" ]; then
      kill -9 ${TEMP_PID} 2>&1 >/dev/null
    fi
    COUNT=$(( COUNT + 1 )) 
  done
}
 
function getMemberPid {
  local MEMBER=$1
  if [ -z ${MEMBER} ]; then
    return
  fi
  local TEMP_PID=$(ps -ax | grep "\-\-port 1702${MEMBER}" | awk '{ print $1 }')
  echo ${TEMP_PID}
}
 
function dropAllMembers {
  echo -n "F:dropAllMembers: Dropping members .."
  for i in $(seq 1 1 5); do
    stopMember ${i}
    if [ -d ${MONGO_DATA}/s${i} ]; then
      rm -fr ${MONGO_DATA}/s${i}/*
    fi
  done
  echo " done"
}
 
function startAllMembers {
  echo -n "F:startAllMembers: Initializing members .."
  for i in $(seq 1 1 5); do
    if [ -d ${MONGO_DATA}/s${i} ]; then
      mkdir -p ${MONGO_DATA}/s${i}
    fi
    startMember "${MONGO_RS}" ${i}
  done
  echo " done"
}
 
function runCommandOnMember {
  local INDEX=0
  local OP=""
  declare -a PARAMS
 
  for o in $(seq 1 1 $#); do
    if [[ "${!o}" =~ ^\" ]]; then
       OP="${!o:1}"
       if [[ "${!o}" =~ \"$ ]]; then
         OP="${OP::-1}"
         PARAMS[$INDEX]=${OP}
         INDEX=$(( INDEX + 1 ))
         OP=""
       fi
       continue
    fi
    if [ -z "$OP" ]; then
       PARAMS[$INDEX]=${!o}
       INDEX=$(( INDEX + 1 ))
    else
       if [[ "${!o}" =~ \"$ ]]; then
         OP="${OP} ${!o::-1}"
         PARAMS[$INDEX]=${OP}
         INDEX=$(( INDEX + 1 ))
         OP=""
       else
         OP="${OP} ${!o}"
       fi
    fi
  done
  if [ -z "${PARAMS[0]}" ]; then
    echo "F:runCommandOnMember: member not defined"
    return
  fi
  if [ -z "${PARAMS[1]}" ]; then
    echo "F:runCommandOnMember: no commands provided"
    return
  fi
  if [ "${PARAMS[2]}" == "debug" ]; then
    ${MONGO_DIR}/bin/mongo ${HOSTNAME}:1702${PARAMS[0]} --quiet --eval "${PARAMS[1]}" 2>/dev/null
  else  
    ${MONGO_DIR}/bin/mongo ${HOSTNAME}:1702${PARAMS[0]} --quiet --eval "${PARAMS[1]}" 2>/dev/null >/dev/null
  fi
}
 
function runCommandOnMemberOutput {
  local COMMAND_OPTS=""
  local OP=""
  for o in $(seq 1 1 $#); do
    if [ -n "${COMMAND_OPTS}" ]; then
      COMMAND_OPTS="${COMMAND_OPTS} "
    fi
    if ! [[ ${!o} =~ ^-?[0-9]+$ ]]; then
      COMMAND_OPTS="${COMMAND_OPTS}\""
    fi
    OP="${!o}"
    #OP=${OP//\"/\\\"}
    #OP=${OP//\$/\\\$}
    COMMAND_OPTS="${COMMAND_OPTS}${OP}"
    if ! [[ ${!o} =~ ^-?[0-9]+$ ]]; then
      COMMAND_OPTS="${COMMAND_OPTS}\""
    fi
  done
  OUTPUT=$(runCommandOnMember ${COMMAND_OPTS})
}
 
function initializeRS {
  echo -n "F:initializeRS: Configuring the replica set .."
  local MEMBER=1
  runCommandOnMember ${MEMBER} "printjson(rs.initiate());"
  echo " done"
  for i in $(seq 2 1 3); do
    echo -n "F:initializeRS: Adding member ${i} to the replica set .."
    runCommandOnMember ${MEMBER} "printjson(rs.add(\"${HOSTNAME}:1702${i}\"));"
    sleep 1
  echo " done"
  runCommandOnMember ${MEMBER} "printjson(rs.addArb(\"${HOSTNAME}:17025\"));"
  sleep 1
  done
}
 
function runProcedure {
  local MEMBER=3
  local NEW_MEMBER=4
  local ORIG_CONF=""
  local OP_DOC=""
  echo -n "F:runProcedure: Removing member ${MEMBER} from the replica set .."
  runCommandOnMemberOutput 1 "printjson(rs.config())" "debug"
  ORIG_CONF="$OUTPUT"
  runCommandOnMember 1 "printjson(rs.remove(\"${HOSTNAME}:1702${MEMBER}\"));"
  runCommandOnMemberOutput ${MEMBER} "printjson(db.getSiblingDB(\"local\").oplog.rs.find({}, { ts: 1, h: 1, _id: -1 }).sort({ \$natural: -1 }).limit(1).next())" "debug"
  OP_DOC="$OUTPUT"
  sleep 1
  echo " done"
 
  echo -n "F:runProcedure: Reconfiguring the replica set in the excluded members .."
  runCommandOnMember ${MEMBER} "var conf = rs.config();
conf._id=\"${MONGO_TEMP_RS}\";
conf.members=[ { _id: 0, host: \"${HOSTNAME}:1702${MEMBER}\" } ];
printjson(db.getSiblingDB(\"local\").system.replset.insert(conf));
printjson(db.getSiblingDB(\"local\").system.replset.remove({ _id: \"${MONGO_RS}\" }));"
  stopMember ${MEMBER}
  startMember ${MONGO_TEMP_RS} ${MEMBER}
  echo " done"
  
  echo -n "F:runProcedure: Syncing the new member ${NEW_MEMBER} with the new replica set .."
  stopMember ${NEW_MEMBER}
  startMember ${MONGO_TEMP_RS} ${NEW_MEMBER}
  runCommandOnMember ${MEMBER} "printjson(rs.add(\"${HOSTNAME}:1702${NEW_MEMBER}\"))"
  sleep 4
  echo " done"
 
  echo -n "F:runProcedure: Reconfiguring back the replica set on members ${MEMBER} and ${NEW_MEMBER}"
  runCommandOnMember ${MEMBER} "printjson(rs.remove(\"${HOSTNAME}:1702${NEW_MEMBER}\"));"
  sleep 1
  stopMember ${MEMBER}
  stopMember ${NEW_MEMBER}
  startMember "" ${MEMBER}
  startMember "" ${NEW_MEMBER}
  runCommandOnMember ${MEMBER} "var conf = ${ORIG_CONF};
printjson(db.getSiblingDB(\"local\").system.replset.remove({ _id: \"${MONGO_TEMP_RS}\" }));
printjson(db.getSiblingDB(\"local\").system.replset.insert(conf));"
  runCommandOnMember ${NEW_MEMBER} "var conf = ${ORIG_CONF};
printjson(db.getSiblingDB(\"local\").system.replset.remove({ _id: \"${MONGO_TEMP_RS}\" }));
printjson(db.getSiblingDB(\"local\").system.replset.insert(conf));"
  echo " done"
 
  echo -n "F:runProcedure: Recreating the oplog and setting minValid point .."
  runCommandOnMember ${MEMBER} "printjson(db.getSiblingDB(\"local\").oplog.rs.drop());
printjson(db.getSiblingDB(\"local\").replset.minvalid.drop());
printjson(db.getSiblingDB(\"local\").runCommand({ create: \"oplog.rs\", capped: true, size: (1 * 1024 * 1024 * 1024) }));
printjson(db.getSiblingDB(\"local\").replset.minvalid.insert(${OP_DOC}));
printjson(db.getSiblingDB(\"local\").oplog.rs.save(${OP_DOC}));"
  runCommandOnMember ${NEW_MEMBER} "printjson(db.getSiblingDB(\"local\").oplog.rs.drop());
printjson(db.getSiblingDB(\"local\").replset.minvalid.drop());
printjson(db.getSiblingDB(\"local\").runCommand({ create: \"oplog.rs\", capped: true, size: (1 * 1024 * 1024 * 1024) }));
printjson(db.getSiblingDB(\"local\").replset.minvalid.insert(${OP_DOC}));
printjson(db.getSiblingDB(\"local\").oplog.rs.save(${OP_DOC}));"
  stopMember ${MEMBER}
  stopMember ${NEW_MEMBER}
  startMember ${MONGO_RS} ${MEMBER}
  startMember ${MONGO_RS} ${NEW_MEMBER}
  echo " done"
}
 
function loadData {
  local MEMBER=1
  local NUM_DOCS=$1
  if [ -z "${NUM_DOCS}" ]; then
    NUM_DOCS=10000
  fi
  echo -n "F:loadData: Loading ${NUM_DOCS} documents .."
  runCommandOnMember ${MEMBER} "var i = 0; db = db.getSiblingDB(\"testdb\");
  for(i = ${NUM_DOCS}; i > 0 ; --i) {
    db.testcol.insert({ timestamp: new Date(), value: i, name: \"Ricardo\" });
  }"
  echo " done"
}
 
dropAllMembers
startAllMembers
initializeRS
loadData 10000
runProcedure
loadData 10000
 
sleep 2
echo -n "Adding back the members .."
runCommandOnMember 1 "rs.add(\"ip-192-168-44-40:17023\")"
sleep 2
runCommandOnMember 1 "rs.add(\"ip-192-168-44-40:17024\")"
sleep 4
echo " done"
 
echo "************************** START: Old member log *****************************"
tail -50 ${MONGO_DATA}/s3/mongod.log
echo "*************************** END: Old member log ******************************"
sleep 2
echo "************************** START: New member log *****************************"
tail -50 ${MONGO_DATA}/s4/mongod.log
echo "*************************** END: New member log ******************************"

Comment by Ricardo Lorenzo [ 07/Dec/15 ]

ramon.fernandez, I'm sorry for the delay.

  • 4 node replica set
  • 4TB data size
  • Almost 1TB index size
  • All nodes are using 3.0.7 and MMAPv1
  • New node with WiredTiger is also 3.0.7

The first initial sync took 7 days and the applying oplog phase took other 7 and still can't catch the primary. The initial sync was never test it with MMAPv1 as the secondaries are always seed with the data files.

Comment by Ramon Fernandez Marina [ 20/Nov/15 ]

ricardo.lorenzo, can you please summarize the details of this replica set? Storage engine used by the primary, data size and oplog size may be sufficient to attempt to reproduce.

Thanks,
Ramón.

Generated at Thu Feb 08 03:57:22 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.