Measuring your Domino server's reliability
Posted by aku1 on 2006-4-15 9:19:28

Level: Intermediate

John Paganetti, Software Developer, Iris Associates
Susan Florio, Content Developer, Iris Associates

01 May 1998

This article describes a new tool called Mean Time Between Failure (MTBF), which you can use to monitor the uptime of your Domino servers. MTBF is not an official product of Lotus or Iris, and is therefore unsupported.

Introduction

How many Domino servers do you have in your organization? Whether it's one, ten, one hundred, one thousand... most administrators would agree that their number one priority is keeping their Domino servers up and running. Reliable Domino servers easily translate into greater productivity for the entire company. Getting your organization's mission-critical work accomplished in a timely and leading-edge manner translates into dollars. This may be very important to you as an administrator, especially if your success is based on the "perceived" reliability numbers of your company's Domino servers. Wouldn't it be great to take the server's "perceived" number and replace it with a "truly measured" number? Wouldn't it be great to take that "truly measured" number and proactively troubleshoot the problem servers within your domain? A new tool that helps deliver this capability is Mean Time Between Failure (MTBF).

The MTBF tool is a Domino server add-in that we use here at Iris to help measure server reliability. You can use MTBF with Domino Release 4.0 and later (for English language releases only). Although the tool is unsupported, we recognize its value to our customers, so we've made it available on Notes.net for free.

MTBF allows you to view, in one database, specific information about each Domino server in your domain. You can also view information about multiple servers across multiple domains. Some of the specific information gathered by this tool includes the:

  • Domino release running on the server
  • Operating system and version on the server
  • Total elapsed time the Domino server has been up and running
  • Total number of Domino and Notes transactions processed by the Domino server since it was started
  • Average number of Domino and Notes transactions per minute since the Domino server was started
  • Time and date of server startups and shutdowns
  • Time and date of possible server crashes

You can discern a lot of information from this data, such as which servers need an upgrade, or which servers are currently your busiest. MTBF uses the startup, shutdown, and crash entries to generate a variety of server statistics, such as the total server uptime for the week, or the server's mean time between failures. This article introduces you further to MTBF, and shows you how to start using it to view these statistics.





The history of MTBF

The past, present, and future success of Domino and Notes is based on the quality and reliability of our Domino servers. Over the years, as Domino and Notes have become worldwide leaders in groupware and messaging, we've challenged ourselves to make sure that each successive release attains certain quality and reliability criteria before allowing it to hit the streets. With each new major release come many new features, functional enhancements, performance and scalability improvements, and bug fixes. We have to make sure these changes don't affect the overall reliability of the server. The big questions for us are, "How can we measure this?" and "When do we know we're done?"

This is where MTBF came on the scene. We needed a tool that allowed us to automatically measure the reliability of our servers as we briskly moved through our development cycle. If we could demonstrate with each successive build that the server reliability numbers were on the rise, then we knew the server was getting closer and closer to being ready for release to the public. However, constantly improving reliability numbers was not enough. We needed to demonstrate that the numbers were well within our acceptable range for shipping, and that we had achieved a server reliability goal greater than any previous major release of Domino. MTBF became an integral part in determining the final release dates of Domino and Notes.

Here at Iris, we currently use MTBF to monitor servers in three different domains, running on multiple platforms, with different releases of the server on each machine. We place many new builds of the server on a machine during the development process. MTBF helps us keep track of what builds are on what machines. It also helps us monitor how stable each build is. The following screen shows just some of the MTBF databases we use to monitor our servers:


Figure 1. Our MTBF databases

If MTBF reports that a server running a new build crashes once a week, then we know that the build isn't ready for release. We can look in our MTBF database, under the Server Entries view, for detailed information about the crash, which an administrator or developer might have entered. Then, we can look for the development of possible patterns. If the same event occurs before each crash, then we debug the server based on that data. Once we determine what caused the crash, we go back into the MTBF database and update the crash information. In the Server Crash document, we can state the reason for the crash and the solution, and then mark the crash as resolved.

If we know that the last 10 crashes were caused by the bug we just fixed, we update all 10 of the MTBF documents to mark the crashes as resolved. MTBF still takes the resolved crashes into consideration when calculating the regular statistics; however, it doesn't include resolved crashes when calculating the information in the adjusted server statistics chart. This way, we can see in the adjusted statistics what the server's uptime would be if the bug fix had been included earlier -- it's almost as though the 10 crashes didn't happen.
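To make the difference between the regular and the adjusted statistics concrete, here is a minimal Python sketch. The article does not state the formula MTBF uses, so this assumes the common definition of mean time between failures (total uptime divided by the number of failures). The `Crash` record, the function name, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Crash:
    description: str
    resolved: bool  # mirrors the Resolved field of a Server Crash document


def mean_time_between_failures(uptime_hours, crashes, adjusted=False):
    """Uptime divided by the number of counted crashes (assumed formula).

    With adjusted=True, crashes marked resolved are left out, in the
    spirit of the adjusted server statistics chart.
    Returns None when no crashes are counted.
    """
    counted = [c for c in crashes if not (adjusted and c.resolved)]
    if not counted:
        return None
    return uptime_hours / len(counted)


# Hypothetical data: 12 crashes in 720 hours of uptime,
# 10 of them caused by one bug that has since been fixed.
crashes = [Crash("PANIC: LookupHandle: null handle", True) for _ in range(10)]
crashes += [Crash("Still needs to be addressed!!!", False) for _ in range(2)]

raw = mean_time_between_failures(720, crashes)                      # 60.0
adjusted = mean_time_between_failures(720, crashes, adjusted=True)  # 360.0
```

The sketch keeps the uptime figure fixed for simplicity; a real adjusted calculation would probably also credit back the downtime that the resolved crashes caused.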





How MTBF monitors servers

MTBF measures server reliability by monitoring your servers for shutdowns. It searches the Log files of the servers you install it on for the following entries:

  • "Lotus Domino server started"
  • "Server shutdown complete"

When an administrator shuts down a server intentionally, a "Server shutdown complete" message is written to the server's Log file (LOG.NSF) before the shutdown. When the server starts up again, a "Lotus Domino server started" message is written to the Log file. Therefore, a shutdown message should immediately precede each server startup message in the Log file. If MTBF finds a server startup message that is not preceded by a shutdown message, it creates a Server Crash document in the MTBF database. You should edit this document and add any information related to the crash.

Here's an example of entries from a LOG.NSF file on a Domino server running Release 4.6.1a:

03/31/98 05:10:04 PM  Lotus Domino ® Server started, running Release 4.6.1a
04/01/98 12:38:11 PM  Server shutdown complete
04/01/98 12:43:47 PM  Lotus Domino ® Server started, running Release 4.6.1a

Since these were the only server entries found, we can infer that the server was up and running from March 31st at 05:10:04 PM until April 1st at 12:38:11 PM, when an administrator shut it down. The server stayed down until 12:43:47 PM, a little over 5 minutes, and it has been up since it was restarted.
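The arithmetic behind that reading of the log can be checked with a few lines of Python. This is only a worked example using the three timestamps above; it is not part of the MTBF add-in.

```python
from datetime import datetime

# Timestamps taken from the three log entries above.
started = datetime(1998, 3, 31, 17, 10, 4)    # 03/31/98 05:10:04 PM
shut_down = datetime(1998, 4, 1, 12, 38, 11)  # 04/01/98 12:38:11 PM
restarted = datetime(1998, 4, 1, 12, 43, 47)  # 04/01/98 12:43:47 PM

uptime = shut_down - started      # time the server ran before the shutdown
downtime = restarted - shut_down  # time the server was down

print(uptime)    # 19:28:07
print(downtime)  # 0:05:36
```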

Now let's take the example further as MTBF searches again and finds the following entries:

03/31/98 05:10:04 PM  Lotus Domino ® Server started, running Release 4.6.1a
04/01/98 12:38:11 PM  Server shutdown complete
04/01/98 12:43:47 PM  Lotus Domino ® Server started, running Release 4.6.1a
04/03/98 05:58:37 PM  Lotus Domino ® Server started, running Release 4.6.1a

Notice that this time, there is no "Server shutdown complete" message before the "Lotus Domino server started" entry on April 3rd at 5:58:37 PM. Therefore, MTBF makes another pass through the log to find the last message that appears before this last server startup message. It finds the following entries:

03/31/98 05:10:04 PM  Lotus Domino ® Server started, running Release 4.6.1a
04/01/98 12:38:11 PM  Server shutdown complete
04/01/98 12:43:47 PM  Lotus Domino ® Server started, running Release 4.6.1a
04/03/98 05:00:41 PM  Database fixup process shutdown
04/03/98 05:58:37 PM  Lotus Domino ® Server started, running Release 4.6.1a

Now we know the last task that ran before the server went down, and we have a place to start debugging the problem.
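The detection rule described in this section can be sketched in Python. This is an illustration of the logic only, not the add-in's actual code: it assumes the log entries are already available as plain-text lines in the "date time AM/PM message" layout shown above, and the names `parse_line` and `find_crashes` are hypothetical.

```python
from datetime import datetime

STARTED = "Server started"             # substring of the startup message
SHUTDOWN = "Server shutdown complete"  # clean shutdown message


def parse_line(line):
    """Split 'MM/DD/YY HH:MM:SS AM  message' into (datetime, message)."""
    date, time, ampm, message = line.split(None, 3)
    stamp = datetime.strptime(f"{date} {time} {ampm}", "%m/%d/%y %I:%M:%S %p")
    return stamp, message.strip()


def find_crashes(lines):
    """Return (startup time, preceding message) for every startup entry
    that is not immediately preceded by a clean shutdown message."""
    crashes = []
    previous = None  # message of the entry just before the current one
    for line in lines:
        stamp, message = parse_line(line)
        # The very first entry has nothing before it, so it is never flagged.
        if STARTED in message and previous is not None and previous != SHUTDOWN:
            crashes.append((stamp, previous))
        previous = message
    return crashes


log = [
    "03/31/98 05:10:04 PM  Lotus Domino ® Server started, running Release 4.6.1a",
    "04/01/98 12:38:11 PM  Server shutdown complete",
    "04/01/98 12:43:47 PM  Lotus Domino ® Server started, running Release 4.6.1a",
    "04/03/98 05:00:41 PM  Database fixup process shutdown",
    "04/03/98 05:58:37 PM  Lotus Domino ® Server started, running Release 4.6.1a",
]

for when, last_message in find_crashes(log):
    print(when, "| last message:", last_message)
```

Run against the five entries above, the sketch flags only the April 3rd startup and reports "Database fixup process shutdown" as the last message before it, which corresponds to the Message field of a Server Crash document.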





Getting started with MTBF

To set up MTBF on your servers, you need to download the files, and create the MTBF database for recording the statistics. Then you can choose one of several ways of running MTBF, and you can specify when you want MTBF to generate more extensive server statistics.

You need to install MTBF on every server that you want to monitor. If you want to view information about multiple servers from one database, we recommend that you use a hub-and-spoke model for setting up the MTBF database. You create the database on the "hub" server and set up selective replicas on the "spoke" servers. The "hub" server should have a Connection document for replicating with the "spoke" servers every few hours or so. The "hub" server gathers information from all the "spokes," while the selective replication formula ensures that each "spoke" replica holds information only for its own server.

Downloading MTBF

To download the MTBF template and add-in:

  1. Go to the Sandbox for "Mean Time Between Failure for Lotus Notes/Domino."
  2. Click the mtbf.nsf icon next to the MTBF template, and save the template to your Notes/data directory.
  3. Click the mtbf.exe that corresponds to the platform of your server, and save the file to your Notes program directory.

Creating the MTBF database

To create the MTBF database on the "hub" server:

  1. Open Notes and choose File - Database - New to create a new database.
  2. Select the name of your "hub" server from the dropdown list in the Server field.
  3. Enter "Mean Time Between Failure" in the Title field.
  4. Enter MTBF.NSF in the File Name field.
  5. Select the "Show advanced templates" option.
  6. Select the "Mean Time Database Template" from the list of templates.
  7. You can optionally select the "Size Limit" button to limit the size of the new database to 4GB if you plan to monitor many servers.
  8. Click OK.

Creating a replica of the MTBF database

To create a replica of the MTBF database on the "spoke" servers:

  1. Select the MTBF database that you want to replicate.
  2. Choose File - Replication - New Replica.
  3. In the Server field, enter the name of the "spoke" server where you want to create the replica.
  4. Click Replication Settings, and choose the Advanced icon.
  5. Select the "Replicate a subset of documents" option.
  6. Select the "Select by formula" option.
  7. Paste in the following selective replication formula:

    SELECT ((@IsMember(Form;"Server":"LogEvent":
    "Server Crashes":"Server Shutdowns":"Server Mean Time":
    "Server Mean Time One Week":"Server Mean Time Two Week":
    "Server Mean Time Thirty Day":"Server Mean Time Sixty Day":
    "Server Mean Time Ninety Day":"Server Mean Time Six Month":
    "Server Mean Time One Year") & ServerName=@UserName))



  8. Click OK.

Note: If your server is running Release 4.1x or earlier, you may need to replace the @UserName with the canonical server name, such as "CN=Arista/O=Iris."
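For readers less familiar with the formula language, the selection logic of the replication formula can be restated in Python. This only illustrates which documents the formula selects; it is not how Domino evaluates replication formulas, and modeling documents as dictionaries is an assumption made for the example.

```python
# Form names copied from the selective replication formula.
MTBF_FORMS = {
    "Server", "LogEvent", "Server Crashes", "Server Shutdowns",
    "Server Mean Time", "Server Mean Time One Week",
    "Server Mean Time Two Week", "Server Mean Time Thirty Day",
    "Server Mean Time Sixty Day", "Server Mean Time Ninety Day",
    "Server Mean Time Six Month", "Server Mean Time One Year",
}


def belongs_on_spoke(doc, spoke_name):
    """True if doc uses one of the MTBF forms and its ServerName
    matches the spoke (the role @UserName plays in the formula)."""
    return doc.get("Form") in MTBF_FORMS and doc.get("ServerName") == spoke_name


# Hypothetical documents for two servers.
own = {"Form": "Server Crashes", "ServerName": "CN=Arista/O=Iris"}
other = {"Form": "Server Crashes", "ServerName": "CN=Frog/O=Iris"}

print(belongs_on_spoke(own, "CN=Arista/O=Iris"))    # True
print(belongs_on_spoke(other, "CN=Arista/O=Iris"))  # False
```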

Running and configuring MTBF

You can run MTBF in several different ways. The first time you run MTBF on your server, you should run it manually. At the Domino server console window, enter "load MTBF -A", which adds this server to the MTBF.NSF database and runs MTBF for the first time. If you use MTBF Release 1.3 or earlier, you do not need to use the -A option. You can continue to run MTBF manually from the server console window without the -A option, each time you want to update the statistics within MTBF.

You can also set up MTBF to run every time you start your server. To start MTBF when you start your server:

  1. Open your NOTES.INI file.
  2. Add "MTBF" to the "ServerTasks =" line. For example:

    ServerTasks=Replica,Update,Router,MTBF
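If you maintain many servers, you might script this edit instead of making it by hand. The following Python sketch adds a task to the ServerTasks line of the file's text. The function name is hypothetical, and the sketch assumes a single "ServerTasks=" line holding comma-separated task names, matched without regard to case.

```python
def add_server_task(ini_text, task="MTBF"):
    """Return ini_text with task appended to the ServerTasks line
    unless it is already listed there."""
    out = []
    for line in ini_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "servertasks":
            tasks = [t.strip() for t in value.split(",") if t.strip()]
            if task.lower() not in (t.lower() for t in tasks):
                tasks.append(task)
            line = "ServerTasks=" + ",".join(tasks)
        out.append(line)
    return "\n".join(out)


print(add_server_task("ServerTasks=Replica,Update,Router"))
# ServerTasks=Replica,Update,Router,MTBF
```

Because the function checks for an existing entry first, running it twice leaves the line unchanged the second time.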

Your third option is to schedule MTBF to run automatically at a specified time. To schedule MTBF, create a Program document in the Public Address Book. Enter the time that you want to run MTBF and how often you want MTBF to update (every hour or two keeps your server information up to date and doesn't impact server performance). The following Program document shows MTBF scheduled to run every 60 minutes:


Figure 2. A Program document

In any of these three cases, you can specify that you want to run MTBF with the parameter "-F". The "-F" parameter gives you "full blown" server statistics (shown in the Server Statistics view). You can specify that "MTBF -F" run at 5:00AM every day or every other day, depending on how often you'd like all the server statistics generated. We recommend you choose a time when the server load is low, such as early morning, since this operation is fairly intensive. Keep in mind that each iteration of "MTBF -F" generates many new documents, and the size of the database can grow very quickly if you're monitoring many servers.





Using MTBF

After you install MTBF and run it once, open the database and begin looking at the available views. Each view within the database gives you different kinds of information about your servers. The following sections show you how to use MTBF, by providing you with information about each view. These sections also explain what information the fields within each document contain.

Viewing server information

The first view is the Server Information view, which allows you to view information about the current server. If you are on the "hub" server, you can view information for all the servers replicating to that hub. The following screen shows the Server Information view:


Figure 3. The Server Information view

If you double-click on a Server document from within this view, it appears as follows:


Figure 4. A Server document

This document provides you with general information about the server, such as its name, the Domino release it runs, and its domain. The document includes the following fields:

  • Server Name: The canonical name of the Domino server.
  • Server Build: The release of Domino and Notes running when the sample was taken.
  • Server Platform: The operating system version the server was running when the sample was taken.
  • Server Domain: The Notes domain the server was located in when the sample was taken.
  • Sample Taken: The last time MTBF ran on this server and updated the server information.
  • Server Started: The time the Domino server was last started.
  • Elapsed Time: The elapsed time since the Domino server was last started.
  • Peak Users: The peak number of users on the Domino server since it was last started.
  • Total Transactions: The total number of transactions since the Domino server was last started.
  • Average Per Minute: The average number of transactions per minute since the Domino server was last started.
  • MTBF Version: The version of MTBF that is running on this server.
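The Average Per Minute field follows from two of the other fields. A minimal Python sketch of that relationship, assuming the average is simply total transactions divided by elapsed minutes (the function name and sample figures are hypothetical):

```python
from datetime import datetime


def average_per_minute(total_transactions, server_started, sample_taken):
    """Transactions per minute between server start and the sample time."""
    elapsed_minutes = (sample_taken - server_started).total_seconds() / 60
    if elapsed_minutes <= 0:
        return 0.0  # no elapsed time yet, so no meaningful average
    return total_transactions / elapsed_minutes


# Hypothetical sample: 36,000 transactions in the two hours since startup.
started = datetime(1998, 4, 1, 12, 43, 47)
sampled = datetime(1998, 4, 1, 14, 43, 47)
print(average_per_minute(36000, started, sampled))  # 300.0
```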

Viewing server crashes

You can view all the crash information in the MTBF database, either by date or by server, in the Server Crashes view. Viewing server crashes by date allows you to see the most recent crashes. Viewing server crashes by server allows you to search for crashes on a particular server. Both views also provide you with basic information about the server, such as its name, build, domain, and platform. However, the important information in this view is the time and date of each server crash and the Descriptions column, which allows you to quickly look for patterns of crashes across several servers.

Here at Iris, if we see that servers running a new build have multiple crashes in a short period of time, we start debugging that build. If you notice a similar pattern with your servers, you can call Lotus Support and give them the history of your server crashes, or you can start looking into what the problem might be yourself. The Server Crashes view appears as follows, if you view the crashes by date:


Figure 5. The Server Crashes view, By Date

If you double-click on an entry in this view, you can open the Server Crash document. The following is an example of a Server Crash document:


Figure 6. A Server Crash document

Each Server Crash document contains a variety of information about the server at the time of the crash. The first six fields contain basic information about the server. MTBF automatically fills in this information, and the fields are not editable:

  • Date: The date and time of this crash.
  • Server Name: The name of the server this crash occurred on.
  • Server Build: The build or release number of Notes/Domino the server was running when this crash occurred.
  • Server Platform: The native operating system version the server was running when this crash occurred.
  • Server Domain: The Notes domain the server was located in when this crash occurred.
  • Message: The last message that appeared in the LOG.NSF file immediately preceding this crash.

The document also includes the following editable fields, and it's up to you to update these fields appropriately:

  • Resolved: The default is "No." It should be set to "Yes" when you feel the crash has been justifiably explained (for example, by a power outage) or resolved (for example, by an upgrade).
  • Basic Reason: The default is "Unknown." You should select one of the keywords specified or add a new description of the problem. You can use this field in report generation to summarize why the majority of the crashes occurred (for example, because of panic, semaphore timeouts, hang, and so on).
  • Description: The default is "Still needs to be addressed!!!" You should add a brief description of the crash here. The description appears in the Server Crashes view, so it's important to keep it short and to the point. (At Iris, we often put the full Panic messages in here, such as, PANIC: LookupHandle: null handle.)
  • Rich Text: The default is the same as the Message field, and is the last message that appeared in the LOG.NSF file immediately preceding this crash. Here is where you add as much detail about the crash as possible. At Iris, we ask developers and administrators to attach things like NOTES.RIP files, and screen shots of the last few lines of the server console preceding the crash, as this information does not always make it to the LOG.NSF file.
  • Debugged By: There is no default for this field. We recommend the person who entered the crash information into MTBF put their name here. This way, if other developers have more questions about the crash, they know who to call.
  • Assigned To: There is no default for this field. We recommend you enter the name of the key person assigned to resolving this crash. In this way, MTBF acts as a monitoring tool. It allows you to assign a task, such as debugging a crash, and track the status of the task by looking at the modifications to the Server Crash document. You can also write an agent to send the "Assigned To" person a weekly reminder that the crash is still not resolved.

Viewing server shutdowns

The Server Shutdowns view shows you all the server shutdowns. The following screen shows the shutdowns by server, with the server name (A1Mail/CAM/H/Lotus) at the top of the screen:


Figure 7. The Server Shutdowns view, By Server

The following is a Server Shutdown document:


Figure 8. A Server Shutdown document

The first six fields of this document are identical to the first six fields in the Server Crash document. These fields provide you with basic information about the server. They list the server name, the build it is running, the time and date of the shutdown, whether or not the shutdown was explainable, and the reason for the shutdown, if an administrator entered a reason. You can't edit this basic information about the server. MTBF automatically fills in this information. The Message field contains the last message that appeared in the LOG.NSF file, which, in the case of a server shutdown, is always "Server shutdown complete."

The document also includes the following editable fields, and it's up to you to update these fields appropriately:

  • Shutdown By: There is no default for this field. We recommend the person who shut down the server enter their name here in case any questions arise as to why the server was brought down.
  • Resolved: The default for this field is "No." You should set it to "Yes" when you feel the shutdown has been justifiably explained (for example, by an upgrade of the OS, an upgrade of Domino and Notes, and so on).
  • Reason: The default is "Still needs to be addressed!!!". This field contains information that can help you understand why a shutdown took place. The person who shuts down the server should add a brief reason for the shutdown. You can see this field in the Server Shutdowns views, so it's important to keep it short and to the point (for example, Frog/Iris being shut down to upgrade to R4.6.1a).
  • Rich Text: The default is the same as the Message field, and is the last message that appeared in the LOG.NSF file. This message will always be "Server shutdown complete." You should add as much detail about the shutdown as possible, because you may be shutting down the server for a non-justifiable reason, such as users being unable to connect because the server was running low on virtual memory. Here at Iris, we ask developers and administrators to attach screen shots of problems they were seeing on the server console prior to the shutdown.
  • Debugged By: There is no default for this field. We recommend the person who entered the shutdown information into MTBF put their name here. This way, if other developers have more questions about the shutdown, they know who to call.
  • Assigned To: There is no default for this field. We recommend you enter the name of the key person assigned to resolving this shutdown. As previously stated, you can write an agent to send the "Assigned To" person a weekly reminder that the shutdown is still not resolved.

As an added feature, if an administrator broadcasts a shutdown message from the server console before actually shutting down the server, MTBF finds the broadcast message when searching the log and uses the broadcast message as the default Reason. It also sets the Resolved field to "Yes" when it creates the Server Shutdown document.

Viewing server entries

If you want to see a list of all the server shutdowns and crashes together, go to the Server Entries view. This view is a combination of all the information found in the Server Crashes view and the Server Shutdowns view. Here at Iris, we go to this view to find detailed information about the crash, which an administrator or developer might have entered. From within this view, we can sometimes see patterns in the reasons for the crashes. If the same event occurs before each crash, then we can debug the server based on that data. The following screen shows the Server Entries view:


Figure 9. The Server Entries view

Viewing server log entries

The Server Log Entries view contains messages that MTBF finds in LOG.NSF, including:

  • "Lotus Domino server started" messages
  • "Server shutdown complete" messages
  • Any log entry that is deemed a crash

A log message is deemed a crash if it directly precedes a "Lotus Domino server started" message. This screen shows the Server Log Entries view:


Figure 10. The Server Log Entries view

Viewing server statistics

The Server Statistics view allows you to see statistics on the uptime and downtime of your servers during a certain period of time. As with the other views, you can view server statistics by date or by server. In addition, each of these views breaks down further into a view for statistics over a period of:

  • One week
  • Two weeks
  • 30 days
  • 60 days
  • 90 days
  • Six months
  • One year

The following screen shows the Server Statistics view, by date, with statistics for one year selected:


Figure 11. The Server Statistics view

Each Server Statistics document contains similar statistic charts -- the only difference is the period of time displayed in the chart. To see a sample Server Statistic document, see the sidebar "MTBF Server Statistics."





The future

In the future, we aim to include and support MTBF as part of the Domino server. It will not be available as part of Domino in Release 5.0. However, MTBF can help you monitor all the servers in your organization today. You can keep track of how long each server runs, whether there is a crash or shutdown, and the events before a crash that may have contributed to it. Having all this information can help you identify problems with servers in your organization. By keeping your Server Crash documents updated, and using some of your troubleshooting skills, you can report pervasive server crashes to Lotus Support or your authorized third-party support provider.

Watch the Downloads page on Notes.net for any updates, and feel free to use the Iris Cafe to send us feedback about MTBF. With MTBF, you no longer have to guess about how reliable your servers are. MTBF automatically calculates your server uptime for you, and gives you accurate numbers so you can measure your server reliability today.







About the authors

John Paganetti has been with Iris for five years as a software developer for the server team.


Susan contributed articles for the past year to the award-winning Notes.net webzine, Iris Today. She also wrote and designed the award-winning "History of Notes/Domino." Susan left Iris in July 1999 to pursue a writing opportunity at another Boston-based start-up company.


