IT System Maintenance Checklist
by Andrey Chervonets (24.10.2012)
In the article “The Focus Points of IT management” we discussed some aspects of IT system management.
Here I will focus on the items that an IT manager and/or DBA must be sure about in daily maintenance.
Depending on the size of your company and of the managed IT infrastructure, the checklist items may be implemented quite simply, or a more complex solution may be required.
The main idea of this checklist is to have a more or less complete list of important things that a) admins must have and b) can significantly reduce the number of problems or simplify IT infrastructure management. The list should be reviewed periodically, and the items that look important should be implemented and/or improved.
Each item in the detailed list (see the table below) is followed by an explanation of why admins need it. To decide whether to implement an item, compare the effort and expense of implementation with the results and risks from a business perspective.
For example, when implementing a monitoring or backup strategy and solution, compare its cost with the business losses if the system is unavailable for several hours or the last day's data is completely lost. This is the final argument (in money terms) for the company's top management; they seldom have a deep understanding of the technical benefits, but they easily understand the language of numbers and money.
My experience shows that to keep any system in a good state we need at least the following things:
1. Meta information about all components of the managed IT infrastructure: a technical architecture (TA) schema and an inventory of all technical components. These should not be just formal papers and a registry; they should work as good, usable tools. In other words: “what we have” or “what is running”.
2. Information about running processes mapped to business information. While the 1st item focuses on the technical infrastructure, the 2nd focuses on the needs of business projects and how they are implemented at the technical layer. In other words, a description of “who is using” our technical infrastructure and “why”.
3. Backup and recovery solutions for every component of the system(s). We do this for 2 reasons:
a) to recover business data after corruption or data loss;
b) to save time: restoring from backup is in most cases much faster than re-installation and re-configuration.
Apart from other requirements, at least 2 things should be checked with backup:
a) it should work (I hope you know what that means);
b) restore/recovery from backup should be tested.
4. Monitoring is a mandatory element of IT management. It should help in:
a) controlling resource usage and informing in advance, where possible, about potential problems and important system (or business) events, so that administrators can take action in time or corrective actions can be executed automatically;
b) gathering and presenting the online state and extra diagnostic information for many metrics;
c) gathering and keeping metric values over a long period, so you can analyze trends and atypical behaviour.
A monitoring system can significantly reduce the “temperature” of the support phone by notifying about critical events in advance, so admins can fix a problem before it becomes too critical.
5. Library of admin tools:
a) “How to” documents for the most common activities, such as installation, user creation, etc.
b) Scripts for automated (or semi-automated) maintenance tasks, such as auto-start of required services after reboot, installation of deliveries, backup, recovery and so on.
In most cases these are home-made solutions, but they should be kept in order like a code or document library with a defined structure and versions. This saves admins the time of re-inventing existing solutions, and saves time during installations and upgrades.
Using a version control system helps all admins to work on the code and share new tools instead of everyone sitting in their own "sandbox".
c) Password storage system: it is bad practice to keep passwords for tens or hundreds of servers, databases and applications on slips of paper attached to the monitor; it is better to keep them in an encrypted database and memorize only the master password.
6. List of users, their business roles and the privileges required. Depending on the size of your company, a spreadsheet with the required information may be enough, or an Identity Management system should be used.
7. Up-to-date contact information of IT experts (internal or external) and hardware vendors who may help with:
a) resolving non-typical problems;
b) consulting on development and/or TA design/optimisation;
c) ordering new software/hardware.
It is recommended to have at least 2 contacts for each critical specialisation (hardware, OS support, DBA, each business application). If an application has only one vendor, keep track of at least 2 persons who can assist or find the right people. We never know what can happen in a critical situation.
Keep up-to-date contact information for all other people who may be involved in resolving IT problems or making decisions.
In the next article I will explain the above in more detail.
The table below presents the above in more detail. It is not complete and will be updated later.
Item | Why this is important
Technical Architecture (TA) description:
a) A visual schema of TA components (networks, servers, storage, DB, AS etc.) and their relationships.
Backup and/or Disaster Recovery equipment should also be displayed (possibly as a separate document).
b) List of all hardware
c) List of hosts (physical and virtual) with operating system versions and installed/used resources such as CPU, RAM, HDD, network interfaces and other details
d) Application servers with the applications running under them
f) Software in use, with current versions
g) Storage arrays with the HDDs on board
h) Mapping between components
i) Network channels
j) List of external systems connected to/from our system(s) in the role of client or server
We need a clear picture of the IT infrastructure elements we use and support: a general view and a detailed one for each sub-system.
It should be done for every environment: development, testing and production.
The best practice is to have a centralized repository or inventory containing records of all hardware and software in use. In the simplest case it can be just a spreadsheet with minimal information.
The information in such a repository should be periodically compared to the actual state and/or updated after any change. It is also a good source of information for planning upgrades and migrations, as well as licensing and technical support expenses.
If possible, automate the gathering of this information; manually checking components and their versions every time it is required takes a lot of time.
Having clear, up-to-date information about the TA also helps to solve problems faster.
In other words, we cannot effectively manage all that stuff if we do not count it.
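The periodic comparison of the inventory with the actual state can be automated along the following lines. This is only a minimal sketch: the function name, the component names and the version strings are hypothetical, and in practice the "actual" data would come from whatever discovery scripts or tools you already use.

```python
def diff_inventory(recorded, actual):
    """Compare the recorded inventory with the actual state.

    Both arguments map a component name to its version string.
    """
    missing = sorted(set(recorded) - set(actual))   # recorded but not found
    unknown = sorted(set(actual) - set(recorded))   # found but not recorded
    changed = {
        name: (recorded[name], actual[name])
        for name in sorted(set(recorded) & set(actual))
        if recorded[name] != actual[name]
    }
    return {"missing": missing, "unknown": unknown, "changed": changed}


# Hypothetical sample data, for illustration only
recorded = {"db01/oracle": "11.2.0.3", "app01/java": "1.6", "app02/java": "1.6"}
actual = {"db01/oracle": "11.2.0.4", "app01/java": "1.6", "web01/apache": "2.2"}

report = diff_inventory(recorded, actual)
print(report)
```

Anything reported as "missing", "unknown" or "changed" is a reason either to update the inventory or to investigate the system.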
System availability requirements
Define them (where it makes sense) for each sub-component and every environment individually:
a) Business hours: when the system can be used by any user or process. All other time may be used for planning maintenance tasks
b) Total downtime allowed per time period (week/month), in % or hours, during business hours
c) Total downtime allowed per time period (week/month), in % or hours, outside business hours
d) Maximum downtime allowed per incident (hours) during business hours
e) Maximum downtime allowed per incident (hours) outside business hours
f) For the SLA: number of downtime incidents allowed per period.
Business people would be happy if the system were available all the time: 24*7*365. But in real life providing such availability may cost too much or even be impossible for technical reasons.
We may need time for planned changes:
• regular maintenance tasks that require stopping the database, disabling access for regular users or guaranteed exclusive access to some objects inside;
• patching, upgrades, migrations;
• adding new hardware;
• delivery of changes into the system.
In case of severe problems admins must know how much time they have to fix the problem or to decide to fail over to the backup site.
This also impacts TA design. For critical systems, equipment should be reserved, and the Disaster Recovery plan should be documented and its procedures tested.
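To make such requirements tangible, it helps to convert a percentage into hours and to check recorded incidents against the limits. The sketch below is only an illustration; the function names and all figures (99.5 %, 12 business hours a day, 22 working days) are assumptions chosen for the example.

```python
def allowed_downtime_hours(availability_pct, business_hours_per_day, days_in_period):
    """Convert a required availability (%) into a downtime budget in hours."""
    total_hours = business_hours_per_day * days_in_period
    return total_hours * (1 - availability_pct / 100)


def within_limits(incidents, max_total_hours, max_hours_per_incident, max_incidents):
    """Check downtime incidents (durations in hours) against the agreed limits."""
    return (
        sum(incidents) <= max_total_hours
        and all(hours <= max_hours_per_incident for hours in incidents)
        and len(incidents) <= max_incidents
    )


# Hypothetical requirement: 99.5 % during business hours, 12 h/day, 22 days/month
budget = allowed_downtime_hours(99.5, 12, 22)
print(f"Downtime budget: {budget:.2f} hours per month")

# Two hypothetical incidents of 0.5 h and 0.7 h, at most 1 h each, at most 3 per month
print(within_limits([0.5, 0.7], budget, 1.0, 3))
```

Even a simple calculation like this shows business people what "99.5 %" means in hours, and shows admins how much time they really have.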
Service Level Agreement (SLA)
- an agreement between the IT experts (internal or external) who provide system maintenance and the system's owner or business users; it defines the key service targets and the responsibilities of both parties.
Besides the system availability requirement, there are many other requirements that the IT team should fulfil, such as user account registration, archiving of old data and so on.
In practice the agreement is a list of the agreed requirements and their evaluation criteria.
It is very useful because it helps to focus on the items important to end users, rather than just building a “right”, “nice” or “modern” infrastructure, as sometimes happens.
If the company does not have a formal internal SLA, it is recommended to write one yourself anyway, just to list the expectations of business users.
Recoverability requirements, for the case of an unrecoverable system crash or file corruption (where the system cannot be recovered up to the last changes):
a) how much change it is acceptable to lose (meaning it must be re-entered into the system), in minutes;
b) the acceptable recovery time.
These requirements directly impact the technical architecture design. The stricter the requirements, the more careful the TA design must be and the more components must be duplicated.
As a result, the total cost of ownership rises.
Backup and Recovery strategy is a document that describes:
• backup needs for each class of critical data
• principles of backup and recovery and the methods used in the company/organization
• the software and hardware used/required
A clear understanding of backup needs and methods is the starting point for developing backup and recovery solutions and for ordering equipment or other resources.
Disaster Recovery Plan describes how to get the infrastructure (or part of it) running in another Data Centre (or just another building) if the primary DC is out of operation for longer than is acceptable for the business
Business should be running!
Even if a disaster has not happened before or has a low probability, it is still possible.
Even if a disaster may be counted as force majeure, we need a plan for how to continue working.
Backup Policy defines the following for each class of critical data (OS images, software, databases, application files, configuration files, etc.):
• backup class (current, historical, other...);
• how often backups will take place;
• schedule of backup types (full and incremental);
• where backup copies will be stored;
• recovery window (how far back in the past we should be able to recover), which impacts:
    • how long to keep backup copies;
    • expired backups removal schedule.
This serves two purposes: a) to be able to recover in case of problems; b) to make a copy of the DB (or the whole system) when the business needs it.
The Backup Policy is defined taking into account the recoverability requirements.
The class of a backup may be:
• Current backup: used to get the system up and running in case of problems.
• Historical backups: may be required to audit the system state in the past.
• “Business” and other types of backup: may be required to emulate some activities of the last business cycle or to reproduce a problem found in the application or business logic.
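The link between the recovery window and the removal of expired backups can be sketched as follows. This is a simplified illustration, not a replacement for the retention logic of your backup tool: the function name and the sample dates are hypothetical, and it assumes that every incremental backup depends on the most recent full backup before it.

```python
from datetime import date, timedelta


def expired_backups(backups, today, recovery_window_days):
    """Return the backups that are no longer needed for the recovery window.

    backups: list of (date, kind) tuples, where kind is "full" or "incr".
    """
    window_start = today - timedelta(days=recovery_window_days)
    # To restore to the start of the window we need the newest full backup
    # taken on or before that day, plus everything taken after it.
    fulls_before = [d for d, kind in backups if kind == "full" and d <= window_start]
    if not fulls_before:
        return []  # nothing can be removed safely yet
    anchor = max(fulls_before)
    return sorted(b for b in backups if b[0] < anchor)


# Hypothetical backup history
backups = [
    (date(2012, 10, 7), "full"), (date(2012, 10, 10), "incr"),
    (date(2012, 10, 14), "full"), (date(2012, 10, 16), "incr"),
    (date(2012, 10, 18), "incr"), (date(2012, 10, 21), "full"),
    (date(2012, 10, 23), "incr"),
]
removable = expired_backups(backups, today=date(2012, 10, 24), recovery_window_days=7)
print(removable)
```

Note that the full backup of 14 October is kept although it is older than the window: without it, the incremental backups inside the window could not be restored.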
Capacity Plan: estimates of how much CPU, HDD, RAM and network bandwidth will be required for the next period(s)
Backup space capacity plan: estimates of how much space will be required, based on the System Capacity Plan estimates and the Backup Policy.
Hardware equipment as well as disks and tapes should be ordered in advance.
Ordering at the last moment will be more expensive.
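A rough backup space estimate can be derived from the capacity plan and the backup policy. The sketch below makes several simplifying assumptions that you should replace with your own: data grows at a constant monthly rate, a full backup is as large as the data itself (no compression), and each incremental backup is a fixed percentage of the data size. All names and figures are hypothetical.

```python
def backup_space_gb(data_gb, monthly_growth_pct, months_ahead,
                    fulls_kept, incrementals_kept, daily_change_pct):
    """Estimate the projected data size and the backup space needed (both in GB)."""
    # Compound growth of the data volume over the planning period
    projected = data_gb * (1 + monthly_growth_pct / 100) ** months_ahead
    full_space = fulls_kept * projected
    incr_space = incrementals_kept * projected * daily_change_pct / 100
    return projected, full_space + incr_space


# Hypothetical input: 500 GB today, 2 % growth per month, planning 12 months ahead,
# 4 full and 24 incremental backups kept, 5 % of the data changing per day
projected, needed = backup_space_gb(500, 2, 12, 4, 24, 5)
print(f"Projected data size: {projected:.0f} GB, backup space needed: {needed:.0f} GB")
```

The point is not the precision of the numbers but having them early enough to order disks and tapes in advance.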
Monitoring strategy defines:
• monitoring needs (why and how deep) for each class of critical data
• methods of monitoring
• the software and other resources used/required
As the number of IT infrastructure components grows, it becomes almost impossible to control the state of each target manually and track all important events. It is bad practice to learn bad news (like a server failure or space shortage) from end users.
But for a big IT infrastructure, building a good monitoring system can itself be a challenge.
The Strategy should define the general approach to doing it, without deep implementation details.
Monitoring Policy (or Plan) describes the exact targets to monitor, the thresholds and events to react to, and the reaction to those events: what should be done manually or automatically.
There is no one-size-fits-all policy for all monitoring targets and events.
Some events are more critical and some environments are more important.
For example, many errors in a development or test environment can be ignored, or at least resolved during the working day.
In many cases the information about what is important, and when and how to react, resides in the minds of the administrators.
But when the number of controlled targets of many types is growing, IT staff change within the organization, and admins' roles and responsibilities change, it becomes useful to define all the rules in one document.
The support team should make an effort to keep this document up to date and to keep the monitoring system settings synchronized with it.
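One way to keep the document and the monitoring settings close to each other is to write the policy down as data. The following is a minimal sketch of that idea; the metric name, the environments, the thresholds and the reactions are all hypothetical examples, not recommended values.

```python
# (metric, environment) -> (warning threshold, critical threshold); hypothetical values
POLICY = {
    ("disk_used_pct", "production"): (80, 90),
    ("disk_used_pct", "development"): (90, 98),
}

# Severity -> reaction; hypothetical reactions
REACTIONS = {
    "ok": "none",
    "warning": "e-mail to admins",
    "critical": "SMS to the on-call admin",
}


def evaluate(metric, environment, value, policy=POLICY):
    """Return the severity of a metric value according to the policy."""
    warning, critical = policy[(metric, environment)]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


# The same value is treated differently in different environments
for env in ("production", "development"):
    severity = evaluate("disk_used_pct", env, 85)
    print(env, severity, "->", REACTIONS[severity])
```

A policy kept in this form can be reviewed like a document and loaded by scripts at the same time.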
Scope: Operating System, Database, Application
(Auto) start scripts or application (like Oracle Restart).
a) Simply saves admin time: write once, use for many targets.
b) Reduces downtime: nobody needs to wait for a sysadmin to start the system after an unplanned power-off or a planned reboot.
(Auto) stop scripts or application.
a) Saves admin time.
b) During an unplanned shutdown (which may be caused, for example, by a UPS event) it will stop the application and database, saving in-memory data to disk.
Backup solution (scripts or application) with notification of results.
Regular and automated!
It implements everything defined in the Strategy and Policy.
Backup logs should be reported to sysadmins by e-mail or be easily accessible.
Admins should be alerted to backup problems (by e-mail or SMS), or the problems should be tracked by the monitoring system (which will notify about them).
Even high-cost hardware may fail.
Even very experienced professionals can make mistakes.
We must have it.
We should be sure the backup completed without problems. The solution should always check the result code and/or log.
The reliability of the notification transport matters too.
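Checking the result code and always reporting the outcome can be sketched as a small wrapper around the backup command. This is an illustration only: `run_backup` and `collect` are hypothetical names, the "backup command" is a stand-in that simply fails, and a real `notify` function would send an e-mail or SMS.

```python
import subprocess
import sys


def run_backup(command, notify):
    """Run a backup command and report the outcome through `notify`."""
    result = subprocess.run(command, capture_output=True, text=True)
    ok = result.returncode == 0
    status = "OK" if ok else f"FAILED (exit code {result.returncode})"
    # Notify on success as well: silence must never be read as success.
    notify(f"Backup {status}", result.stdout + result.stderr)
    return ok


# Stand-in notification function that just collects the messages
messages = []

def collect(subject, body):
    messages.append((subject, body))


# Stand-in for a real backup command: prints an error and exits with code 3
failing = [sys.executable, "-c", "import sys; print('disk full'); sys.exit(3)"]
succeeded = run_backup(failing, collect)
print(succeeded, messages)
```

Sending a message after successful runs too means that a missing message is itself a signal that something is wrong with the backup or with the notification transport.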
Recovery solution (automated if possible). Must be tested regularly!
The administrator should have a scenario for restoring any system component from scratch (using a backup copy).
You should test it from time to time.
An untested solution means “no solution”!
Run the tests after every significant change in component architecture or configuration, or at least twice per year.
For standard situations automation is possible and will save a huge amount of time.
Online monitoring of critical events, processes and metric values (technical or business).
This will deliver real information about system "health" and behaviour, faster than the users do.
Historical monitoring (or statistics gathering) of important metric values (technical or business).
For trend and anomaly detection. This is way from just mining
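As a simple example of what stored history makes possible, the sketch below fits a straight line (ordinary least squares) to past disk usage and estimates when the disk will be full. The function name and the sample values are hypothetical, and a linear trend is an assumption that will not fit every metric.

```python
def days_until_full(samples, capacity):
    """Estimate the days left until `capacity` is reached.

    samples: list of (day_number, used) pairs covering at least two distinct days.
    Returns None if usage is not growing.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx                      # growth per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    last_day = max(x for x, _ in samples)
    return (capacity - intercept) / slope - last_day


# Hypothetical history: used space in GB on four consecutive days, 200 GB disk
history = [(0, 100), (1, 102), (2, 104), (3, 106)]
days_left = days_until_full(history, capacity=200)
print(days_left)
```

With a forecast like this the space shortage becomes a planned purchase instead of an emergency.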
List of important configuration files and/or DB tables.
For setup, backup and change tracking, which can be automated.
List of important log files and/or DB tables.
For checking and for cleaning up old log information (archive or drop), which can be automated
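The clean-up of old log files can be automated with a few lines of code. The sketch below is a hedged example: the function names, the `*.log` pattern and the 30-day limit are assumptions, and the demonstration runs in a temporary directory so that no real files are touched. It defaults to a dry run, which only lists the candidates.

```python
import os
import tempfile
import time
from pathlib import Path


def old_logs(directory, max_age_days, now=None):
    """List *.log files in `directory` last modified more than `max_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in Path(directory).glob("*.log")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def clean_up(directory, max_age_days, dry_run=True):
    """Delete old log files; with dry_run=True only report what would be deleted."""
    candidates = old_logs(directory, max_age_days)
    if not dry_run:
        for path in candidates:
            path.unlink()
    return candidates


# Demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    old, new = Path(tmp, "old.log"), Path(tmp, "new.log")
    old.write_text("old entries")
    new.write_text("fresh entries")
    forty_days_ago = time.time() - 40 * 86400
    os.utime(old, (forty_days_ago, forty_days_ago))  # make the file look 40 days old

    removed = [p.name for p in clean_up(tmp, max_age_days=30, dry_run=False)]
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(removed, remaining)
```

If the policy says "archive" rather than "drop", the `unlink` call would be replaced by moving or compressing the file.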
List of users and their roles.
User accounts should be mapped to the list of real persons obtained from the Human Resources department, as well as to the list of external applications that use special accounts to connect to target objects.
You have to know who is using the system and what they are allowed to do.
To track the creation of new users and changes of privileges.
Any audit will ask for the definitions and for how the privileges are assigned and controlled.
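The mapping of accounts to real persons and to registered applications can be checked automatically. A minimal sketch, with a hypothetical function name and sample data; in practice the account list would be extracted from the system and the employee list from the HR department.

```python
def orphaned_accounts(accounts, employees, service_accounts):
    """Find accounts that belong neither to a current employee nor to a known application.

    accounts: dict mapping account name to owner name
    employees: set of names obtained from HR
    service_accounts: set of account names used by external applications
    """
    return sorted(
        name for name, owner in accounts.items()
        if name not in service_accounts and owner not in employees
    )


# Hypothetical sample data
accounts = {
    "jsmith": "John Smith",
    "mjones": "Mary Jones",
    "batch_loader": "billing application",
}
employees = {"John Smith"}
service_accounts = {"batch_loader"}

orphans = orphaned_accounts(accounts, employees, service_accounts)
print(orphans)
```

Every account in the result should be either locked or explained.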