version 4.0 – Winter 2023
Introduction and guide to this document.
Organisations use many types of telecommunication services: fixed and mobile telephony, videoconferencing, internet, encrypted links between offices, etc. In the last decade, organisations have become much more dependent on these services. Whereas in the past a telephone outage was an inconvenience, today the failure of telecom services often makes it impossible to do business at all. And as organisations move online and into the cloud, reliability of telecom services becomes even more essential.
At the same time, technological and market changes have made it more difficult to assess the reliability of telecommunication services. Networks grow continuously, new technologies replace old ones, and telecom operators outsource and merge their operations. Any end-to-end telecom service involves several telecom operators, and none of them can fully know how important that service is to each individual customer.
This increased dependency applies even more to organisations that fulfil a vital role in society, such as fire services, medical care, water boards, utilities, banks, etc.
It is therefore important that organisations in general, and organisations that supply critical infrastructures in particular, understand the vulnerabilities and dependencies of the telecom services they use. This document describes a method, called Raster, to assist in this understanding.
The goal of Raster is that the organisation becomes less vulnerable to telecom failures. To reduce the vulnerability, the organisation must first understand what can go wrong with each telecom service they use. Also, these risks must be ranked, so that the most pressing risks can be addressed first. Raster helps a team of analysts to map and investigate one or more telecom services for an organisation. The result is a report, showing which risks should be addressed first, and why. Selection and execution of countermeasures is the next logical step, but is not part of the Raster method.
Incidents affecting the availability of telecom services often happen because of component failures: an underground cable is damaged by a contractor, or a power failure causes equipment to shut down. To prepare for these incidents, the organisation must first realise that the cable and equipment exist. An important part of the Raster method is therefore to draw a diagram showing all components involved in delivering the service.
Incidents can also happen when a single event leads to the simultaneous failure of two or more components. For example, two cables in the same duct can be cut in the same incident, or a software update can cause several servers to misbehave. These failures are called common cause failures, and they are dangerous because their impact can be quite large.
Major steps in the Raster method are to draw service diagrams, and to assess the likelihood and potential impact of single and common cause failures. However, unlike other methods, Raster does not take a narrow numerical approach to assessing risks.
Risks with low probability and high effects are especially important. These rare but catastrophic events have been called “black swans”. Raster helps to uncover black swans in telecom services.
Risk assessments are always in part subjective, and information is hardly ever as complete as analysts would like it to be. This does not mean that biases and prejudices are acceptable. Raster tries to nudge analysts into a critical mode of thinking. Uncertainty is normal, and assessments can be explicitly marked as “Unknown” or “Ambiguous” if a more specific assessment cannot be made. Raster can be applied even when much of the desired information on the composition of telecom networks is unavailable or unknown. Missing information can be gradually added.
To avoid a narrow risk assessment, the Raster method is applied by a team of experts, each with their own area of expertise. Raster facilitates cooperation between experts from different backgrounds.
Raster facilitates the construction of a recommendation using a tested methodical analysis. This recommendation is not just based on the technical aspects of failure of telecoms services, but also takes account of the societal impact of failures, and of risk perceptions of external stakeholders.
One final remark: Raster can be deployed on its own, or as part of a company-wide risk management framework. This manual assumes a stand-alone application. When Raster is used as an element within a wider framework, the initiation stage (in which the scope of the study is determined) will likely need to be adapted.
The following parties are involved in applying the Raster method.
The team needs to encompass knowledge on essential business activities and technical aspects of telecommunication networks and services. Additionally, it will be useful if team members have some experience with risk assessment, and with the Raster method in particular. Because of this range of knowledge it will be necessary to include employees of the case organisation in the team of analysts.
Software tools are available to support the application of the method. Their use is strongly recommended, and this manual assumes that one of these applications is used. There are multiple versions: a standalone application for Windows and MacOS, and a web-based tool that requires an intranet server. All versions function almost identically. This manual uses “the application” without further qualification to indicate that any version may be used.
This manual is for the professionals who will execute the Raster method. It explains the method and provides guidance. These professionals can either be telecom experts or experts in any other field whose expertise is needed. Examples, notes and tips are typeset in text boxes.
This would be an example, note, tip or shortcut.
The first chapters of this manual describe the Raster method; the second part describes the Raster tools that aid the creation of diagrams and the analysis of Single Failures and Common Cause Failures. When executing an analysis using Raster, you will proceed as in the figure below.
The left-hand column shows the chapters in the first part of this manual. The right-hand column shows the second part, which covers the Raster tools.
Author: Eelco Vriezekolk.
Contact and downloads: https://risicotools.nl/
Source: https://github.com/EelcoV/RasterTool
This work was originally sponsored by the Dutch Authority for Digital Infrastructure and by the University of Twente.
General outline of the Raster method and telecom service diagrams.
When using the Raster method, you and the rest of your team will perform a number of tasks. The method will guide you through these tasks in a methodical way, and the Raster tool will assist you in recording your progress. Based on your collective knowledge and expert judgement you will make estimates about the likelihood and impact of various vulnerabilities affecting the telecom services. Based on this analysis, you and your team will draft suitable risk treatment recommendations. The result of your efforts is a report that can be used by a decision maker to take informed business decisions about accepting, reducing, or avoiding the risks.
Raster consists of four stages, shown in the figure below.
The Initiation and Preparation stage describes the scope and purpose of the assessment. Which telecom services are involved, which users can be identified, who are external stakeholders, and what are the characteristics of the environment in which these services are used?
The Single Failures Analysis stage creates a telecom service diagram for each telecom service in use. These diagrams describe the most relevant telecommunication components, including cables, wireless links, and equipment items. These components are potentially vulnerable. The diagram does not have to be complete in all details. Parts of networks that are less relevant can be captured using a single “cloud” (unknown link). For each component, all applicable vulnerabilities are assessed. Only independent, single failures are taken into account during this stage.
The Common Cause Failures Analysis stage takes a closer look at failure causes that lead to the failure of multiple components at once. One example is that of independent telecom services that both have a cable in the same underground duct. A single trenching incident may cut both cables at the same time, causing both services to fail. Another example is a large-scale power outage, causing equipment over a large area to fail simultaneously.
The Risk Evaluation stage contains the risk evaluation and creation of the final report. The overall risk level is assessed, and recommendations are made for risk treatment. These recommendations take into account the possible reactions of external stakeholders. The recommendations and their supporting argumentation form the final output of the Raster method. Stage 1, Stage 2, Stage 3 and Stage 4 describe each stage in detail.
Diagrams are central to the Raster method. A telecom service diagram describes the physical connectivity between components of a telecom service. Diagrams consist of nodes that are connected by lines. Each line represents a direct physical relation. It indicates that the nodes are attached to each other. There cannot be more than one line between two nodes; nodes are either connected or they are not.
Lines are not the same as cables. When two equipment items are connected via a cable, three nodes are used as in the picture above. The line between equipment and cable shows a physical connection: the cable is plugged into the equipment. There are five types of nodes, each identified by its unique shape.
Different pictures can be used to represent nodes, depending on the icon set used by the Raster tool. The examples below use the Default icon set.
Actors represent the (direct) users of telecom services. An actor can represent a single individual, or a group of individuals having the same role, e.g. 'journalists' or 'citizens'. Maintenance personnel are not modelled as actors, as they do not participate in communication.
An actor can only be connected to components of type 'equipment' or 'unknown link'. Actors cannot be connected directly to wired or wireless links, and the Raster tool will not allow such connections.
There must be at least two actors in the diagram: at least one person communicating, and at least one other person to communicate with.
Wired links represent passive, physical cables, including their connectors, fittings and joints but excluding any active components such as amplifiers or switches. Fiber optic cables, coaxial cables, and traditional telephony copper pairs are typical examples of wired links. The two equipment items connected by the link are not part of the wired link itself, and need to be included in the model separately, either as equipment items or unknown links.
Each wired link has exactly two connections, each to a component of type 'equipment' or 'unknown link'. To connect a wired link to an actor, a wireless link, or another wired link, place an equipment node in between.
Each wired link has a fixed capacity and a physical location (including a height above or below ground level). These properties need to be known in sufficient detail.
Wireless links represent direct radio connections, excluding any intermediate components. The transmission and reception installations are not part of the wireless link, and have to be modelled separately as equipment items. A wireless link can connect two or more nodes.
Each wireless link has a fixed capacity, but unlike wired links a wireless link does not always have a fixed location. Transmitters and receivers can be mobile or nomadic. The coverage area depends on factors such as transmission power and antenna properties. Wireless links have a fixed frequency or band. All of these properties need to be described in sufficient detail.
Each wireless link has at least two connections, each to a component of type 'equipment' or 'unknown link'. It can have more than two, as in the example above. To connect a wireless link to an actor, a wired link, or another wireless link, place an equipment node in between.
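As an aid to capturing these properties, the sketch below defines one record per link type in Python. The field names and types are assumptions made for illustration; Raster does not prescribe a particular format.

```python
from dataclasses import dataclass

# Illustrative records for the link properties named above; field names
# and types are assumptions, not a format prescribed by Raster.
@dataclass
class WiredLink:
    capacity: str    # fixed capacity, e.g. "10 Gbit/s"
    route: str       # physical location of the cable run
    height_m: float  # height above (positive) or below (negative) ground level

@dataclass
class WirelessLink:
    capacity: str    # fixed capacity
    band: str        # fixed frequency or band, e.g. "3.5 GHz"
    coverage: str    # coverage area; transmitters and receivers may be mobile
```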
Unknown links (cloud shapes) represent parts of networks for which insufficient information is available, or that do not need to be described in detail. Unlike wired and wireless links, which represent a single communication channel, unknown links are composed of equipment, wired links, and wireless links.
Because unknown links are collections of equipment and wired and wireless links, they can be used in any place where these nodes can be used. In short, unknown links can connect to any other node type. Also, unknown links can be connected to any number of nodes.
Equipment nodes represent all other physical components of telecom networks, such as switches, exchanges, routers, amplifiers, radio transmitters, radio receivers etc. An equipment node may model a single piece of equipment or an entire installation.
Each equipment node must be connected with at least one other component. An equipment node cannot be connected directly to another node of type 'equipment'.
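To make the connection rules above concrete, here is a small validation sketch in Python. The node-type names, data layout, and function are illustrative only; they are not the Raster tool's internal API.

```python
# Sketch of the diagram connection rules; illustrative, not the Raster
# tool's internal logic. Using a set of unordered pairs automatically
# enforces "at most one line between two nodes".
ALLOWED = {
    "actor":     {"equipment", "unknown"},
    "wired":     {"equipment", "unknown"},
    "wireless":  {"equipment", "unknown"},
    "equipment": {"actor", "wired", "wireless", "unknown"},
    "unknown":   {"actor", "wired", "wireless", "equipment", "unknown"},
}

def check_diagram(nodes, lines):
    """nodes maps a node name to its type; lines are unordered name pairs."""
    errors = []
    degree = {name: 0 for name in nodes}
    for line in lines:
        a, b = tuple(line)
        degree[a] += 1
        degree[b] += 1
        if nodes[b] not in ALLOWED[nodes[a]]:  # the ALLOWED table is symmetric
            errors.append(f"{a} ({nodes[a]}) may not connect to {b} ({nodes[b]})")
    for name, kind in nodes.items():
        if kind == "wired" and degree[name] != 2:
            errors.append(f"wired link {name} needs exactly two connections")
        if kind == "wireless" and degree[name] < 2:
            errors.append(f"wireless link {name} needs at least two connections")
        if kind == "equipment" and degree[name] < 1:
            errors.append(f"equipment {name} must be connected to another component")
    if sum(1 for k in nodes.values() if k == "actor") < 2:
        errors.append("a diagram needs at least two actors")
    return errors

# Minimal valid diagram: two actors communicating via an unknown link.
nodes = {"Alice": "actor", "Bob": "actor", "network": "unknown"}
lines = {frozenset({"Alice", "network"}), frozenset({"Bob", "network"})}
assert check_diagram(nodes, lines) == []
```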
The figure below shows an example of a valid telecom service diagram. The diagram shows three actors, communicating via telephony. Two actors are connected to the same private exchange (PABX); the third actor is abroad. One actor uses a wireless DECT handset and base station, the others use fixed handsets. We have no knowledge (yet) of the other portions of the network, other than that some PABX must exist, and some kind of international telephony network to facilitate the calls.
Define the shared purpose and bounds of the study.
Before the study is started, its scope must be made clear to the analysts and to the sponsor. The responsibilities and tasks of the case organisation must be described in some detail. Also, the position of the organisation within the wider system of suppliers, customers and stakeholders must be laid out.
In stage 1 you will collect the information that you need to complete the other stages. The result is a report and agreement from the sponsor to proceed.
The Initiation and Preparation stage consists of the following steps:
Create a list of all telecommunication services that are used by the case organisation. This list must be exhaustive. If a service is accidentally omitted, no risk assessment will be performed on it, and dependencies between the service and other services will not be discovered. As a result, decision makers may take unnecessary or ineffective countermeasures, or overlook necessary countermeasures.
To create the list of telecom services, the following information sources may be useful:
Briefly describe each telecom service. At this stage it is not yet necessary to describe the technical implementation, but if information is available on such items as handsets, terminals, or links, then this should be included in the descriptions.
If a telecom service acts as a backup to some other telecom service, or if the service itself has fallback options, then these must be described as well.
The descriptions must also include the relevance of the telecom service to the operations of the organisation. That is, is the service essential, or merely a 'nice to have'?
It will also be useful at this stage to start a glossary of abbreviations and definitions of special terms that may not be clear to all analysts, or to the sponsor.
List, for each telecom service, the actors who may make use of that service. Main actors are members of the case organisation. All other actors are secondary actors. Actors can be the initiating party of a communication session (calling party), the receiving party (called party), or both.
List all external stakeholders to the case organisation.
Actors and external stakeholders may be identified using the same information sources as listed above for telecom services.
Before the analysis can start, it must be clear to which threats this organisation may be exposed. For example, the in-company fire service in charge of chemical plant safety will be confronted with different potential disasters than a crisis team controlling the spread of agricultural diseases. The latter is unlikely to be affected by violent destruction of hardware. Consequently, the threats to their telecom services will be very different in nature.
The threats to telecom services and their mechanisms must be described in as much detail as possible. Disaster scenarios describe the threats, their effects and mechanisms, their likelihood, and the required response from the case organisation.
In the Netherlands tornados seldom lead to damage to infrastructures. Typically, the threat of tornados will therefore be excluded from disaster scenarios. Flooding from the sea or rivers, however, is quite common, and will likely be included.
For some studies intentional human-made events (crime, terrorism) are highly relevant. For other studies it may suffice to focus on accidental events only. The scope of the study need not be limited to technical aspects. When describing a disaster, the effects that it will have on telecom components are the most important part. To better understand the reactions of the general public it may be useful to also include some graphic descriptions of events that could be experienced by citizens, or that could be published in the media. This may facilitate the assessment of social risk factors in the Risk Evaluation stage.
It may be possible to reuse disaster scenarios from previous risk assessments, thus reducing the amount of work needed.
The results from Stage 1 must be recorded, because the analysts will need to refer to this information during subsequent stages.
The following is a common outline of the output document of the Initiation and Preparation stage. This report forms the introduction to the final report.
All analysts must participate in a review of the Stage 1 report. All analysts must agree on its contents by consensus.
The Stage 1 report must then be presented to and discussed with the sponsor. The list of telecom services may contain unexpected services. The unexpected appearance of a service is informative, since it indicates that the risk assessment and preparation of the case organisation are insufficient, and that disaster response plans are incomplete.
The results of the Initiation and Preparation stage determine to a large extent the course of the risk assessment in the later stages. It is therefore important that the sponsor also agrees to the outcome of this stage, and gives formal agreement to the resulting documentation. As a consequence, the documents must be understandable to non-experts. A glossary may be helpful to that effect. Also, an executive summary should be written.
Describe telecom service networks and analyse vulnerabilities of components.
In this stage you will create a telecom service diagram for each telecom service, and assess the vulnerabilities on each of its components. This will give you a good understanding of the inner workings of each telecom service, and a first impression of its risks.
The results will be recorded in the Raster tool: telecom service diagrams, and assessments of Frequency and Impact for the vulnerabilities of diagram components.
The Single Failures Analysis stage consists of the following steps:
Based on the disaster scenarios that were described in Stage 1, you must describe the most common vulnerabilities of network components. Checklists are used for this. A checklist contains the name and description of the most common vulnerabilities. Good checklists make the analysis process faster and easier.
Create a fresh Raster project (see The Projects toolbar), and inspect the predefined checklist for each type (see Checklist windows). Add new vulnerabilities as deemed necessary. Include vulnerabilities that apply to most components of that type; omit vulnerabilities that only apply to a few components. The checklists do not have to be complete; any particular network component may have specific vulnerabilities that do not occur in the checklist. However, when the most common vulnerabilities are included in checklists, few special cases need to be considered.
Vulnerabilities can be natural or malicious. Natural vulnerabilities are unpredictable random events, sometimes caused by inattentiveness or other non-intentional human actions. Examples include fires, power failures, or equipment defects. Malicious vulnerabilities are bad-faith actions by people with the express purpose of causing harm, often exploiting weaknesses in the organisation’s defenses. Examples include theft and cybercrime. Natural and malicious vulnerabilities differ in their frequency and consequences.
There are three checklists, one each for equipment, wired and wireless links. For actor components no checklist exists. Vulnerabilities of actors are outside the scope of the Raster method. Also, unknown links do not have a separate checklist. They may contain any of the other component types, and therefore all vulnerabilities of the three checklists may apply to unknown links.
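As an illustration of this structure, the three checklists might be represented as below. The vulnerability names are examples only; the actual predefined checklists live in the Raster tool.

```python
# Illustrative checklists; the vulnerability names are examples, not the
# Raster tool's predefined lists.
CHECKLISTS = {
    "equipment": ["Physical damage", "Power failure", "Equipment defect",
                  "Configuration error", "Theft", "Cybercrime"],
    "wired":     ["Cable break", "Physical damage", "Ageing of connectors"],
    "wireless":  ["Interference", "Jamming", "Congestion"],
}

# Unknown links have no checklist of their own: any vulnerability from
# the three lists above may apply to them.
UNKNOWN_LINK_VULNS = sorted(set().union(*CHECKLISTS.values()))
```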
Vulnerabilities of actors are not taken into account. For example, Raster does not handle an actor misinterpreting a received message. However, configuration errors, incorrect handling of handsets, or cybercrime can be taken into account. These vulnerabilities are modelled in Raster as part of equipment components, not as part of the actor responsible for them. Maintenance personnel are not included in the diagrams as actors.
In the Raster tool, create a diagram tab for each telecom service (see Service tabs). When two services have many components in common, it may be more convenient to combine those services into a single diagram. This prevents components from appearing in more than one diagram, but does tend to make the diagram more complex.
For example, if the office LAN is used for VoIP telephony too, it is more convenient to combine telephony and office automation into one diagram.
Then, for each telecom service, draw an initial diagram based on the information that is currently available. The diagrams will likely not be very detailed yet. At the very least all actors involved with the service must be drawn. Note that it is always possible to create a diagram; if absolutely no information is available beyond the actors involved then the actors can simply be connected using an unknown link (“cloud” symbol). Drawing and editing diagrams using the Raster tool is explained in Workspace.
When creating diagrams, the following guidelines may be helpful:
This activity must be performed for each component in turn: in each step, one component is selected for analysis.
Inspect the listed vulnerabilities of the component. Other vulnerabilities may exist that were not in the general checklist. These vulnerabilities must be added. The disaster scenarios that were prepared in Stage 1 must be used as guidance in decisions to add vulnerabilities.
Example: Telecommunication satellites are vulnerable to space debris. This vulnerability does not apply to any other kind of equipment, and will therefore not be in the equipment checklist. On the other hand, satellites are not vulnerable to flooding. Therefore “Collision with space debris” must be added, and “Flooding” must be removed from the list of satellite vulnerabilities.
A vulnerability must not be removed unless it is clearly nonsensical, e.g. configuration errors on devices that do not allow for any kind of configuration, or flood damage to a space satellite. To be removed, a vulnerability must be physically impossible, not just very unlikely in practice. In all other cases the frequency and impact of the vulnerability should be assessed (although they can both be set to Extremely low), and the vulnerability must be part of the review at the end of Stage 2.
When a vulnerability is removed from a node, that node will not be shown under that vulnerability in the list for common cause failures. That is another reason not to remove vulnerabilities.
It is important that vulnerabilities that are merely unlikely but not physically impossible are retained in the analysis, because such vulnerabilities could have an extremely high impact. Low-probability/high-impact events must not be excluded from the risk analysis.
When the list of vulnerabilities for the component is complete, each vulnerability must be assessed. The analysts, based on their collective knowledge, estimate two factors: the Frequency with which the vulnerability leads to an incident, and the Impact of such an incident on the telecom service.
Both factors, Frequency and Impact, are split into eight classes. The classes do not correspond to ranges (with a highest and lowest permissible value); instead, each class names a typical, characteristic value. The selection of the proper class may require a discussion between analysts. Analysts must provide convincing arguments for their choice of class.
Sometimes a factor (a likelihood or impact) is extremely large, or extremely small. Extremely large values are not simply very big: they are too big to fit the normal scale, unacceptably and intolerably high. Likewise, extremely small values fall outside the scale of normal values, and may sometimes safely be ignored. Extreme values fall outside the normal experience of analysts or other stakeholders, and normal paths of reasoning cannot be applied.
If no consensus can be reached between the analysts, the class Ambiguous must be assigned. In the remarks the analysts should briefly explain the cause for disagreement, and the classes that different analysts would prefer to see.
A limited amount of uncertainty is unavoidable, and is normal for risk assessments. However, when uncertainty becomes so large that multiple classes could be assigned to a factor, the class Unknown must be assigned.
The Raster tool assists in recording the analysis results. The tool will also automatically compute the combined vulnerability score for each vulnerability, and the overall vulnerability level for each node (see sections Vulnerability assessment window and Single failures view; for technical details see Computation of vulnerability levels).
Do not blindly trust your initial estimate of frequency and impact. You must not rely only on information that confirms your estimate, but also actively search for contradicting evidence.
For natural vulnerabilities the factor Frequency indicates the likelihood that the vulnerability will lead to an incident with an effect on the telecom service. All eight classes can be used for Frequency (see Frequency table).
A frequency of “once in 50 years” is an average, and does not mean that an incident is guaranteed to occur every 50 years. It may be interpreted as a 2% probability of an incident in any given year or, equivalently, two incidents per year across 100 identical components.
When the lifetime of a component is 5 years (or when the component is replaced every 5 years), the frequency of a vulnerability can still be “once in 500 years”.
Example: a component is always replaced after one year, even if it is still functioning. On average, 10% of components fail before their full year is up. The general frequency for this failure is therefore estimated as “once in 10 years” even though no component will be in use that long.
Note that this value is between the characteristic values for High and Medium. The analysts must together decide which of these two classes is assigned.
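The arithmetic behind this example, as a minimal sketch:

```python
# Worked version of the example above: components are replaced yearly,
# and on average 10% fail before replacement.
fraction_failing_per_year = 0.10
failures_per_component_year = fraction_failing_per_year  # per one-year lifetime
mean_years_between_incidents = 1 / failures_per_component_year
print(mean_years_between_incidents)  # 10.0 -> "once in 10 years"
# Characteristic values: High = once in 5 years, Medium = once in 50 years,
# so "once in 10 years" lies between the High and Medium classes.
```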
Class | Value | Symbol |
---|---|---|
High | Once in 5 years. For 100 identical components, each month 1 or 2 will experience an incident. | H |
Medium | Once in 50 years. For 100 identical components, each year 2 will experience an incident. | M |
Low | Once in 500 years. For 100 identical components, one incident will occur every five years. | L |
Extremely high | Routine event. Very often. | V |
Extremely low | Very rare, but not physically impossible. | U |
Ambiguous | Indicates lack of consensus between analysts. | A |
Unknown | Indicates lack of knowledge or data. | X |
Not yet analysed | Default. Indicates that no assessment has been done yet. | – |
The likelihood of malicious vulnerabilities is not based on chance (as is the case for natural vulnerabilities), but on the difficulty of the action and on the determination and capabilities of the attacker. An attack that requires modest capabilities could already prove too demanding for a casual customer or employee. On the other hand, even a difficult attack may well be within the reach of skilled state-sponsored hackers. The Raster method bases its assessments on the most capable attacker the organisation plausibly faces: the worst plausible attacker.
Attacker | Profile |
---|---|
Customers, employees | Unskilled, lightly motivated by opportunity or mild protest (e.g. perceived unfair treatment). |
Activists | Moderately skilled, aiming for media exposure to further their cause or protest. Visible impact. |
Criminals | Highly skilled, motivated by financial gain (e.g. ransomware). |
Competitors | Highly skilled, aiming to obtain trade secrets for competitive advantage. Avoid visible impact. |
State-sponsored hackers | Very highly skilled, motivated by geo-political advantage. Avoid visible impact. |
In the Raster tool you set the worst plausible attacker as part of the project properties. Since this is a property of the entire project, for each malicious vulnerability you only need to select the appropriate difficulty level of the exploit, as per the table below.
Class | Value |
---|---|
Very difficult | Exploit requires skill, custom attack tools, long preparation time and multiple weaknesses and zero-days. |
Difficult | Exploit requires skill, some customized attack tools and long preparation time. |
Easy | Tools exist to execute the exploit. Basic skills required. |
Trivial | Requires no skill or tools at all. |
Nearly impossible | Exploit may be possible in theory, but consensus is that exploit is infeasible. |
Ambiguous | Indicates lack of consensus between analysts. |
Unknown | Indicates lack of knowledge or data. |
Not yet analysed | Default. Indicates that no assessment has been done yet. |
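To picture how the worst plausible attacker and the exploit difficulty might interact, consider the sketch below. The capability scores and the resulting classes are invented for illustration only; the tool's actual rules are described under Computation of vulnerability levels.

```python
# Purely illustrative mapping from (worst plausible attacker, exploit
# difficulty) to a likelihood class; all numbers are invented.
CAPABILITY = {"Customers, employees": 1, "Activists": 2, "Criminals": 3,
              "Competitors": 3, "State-sponsored hackers": 4}
DIFFICULTY = {"Trivial": 0, "Easy": 1, "Difficult": 3, "Very difficult": 4,
              "Nearly impossible": 99}

def malicious_likelihood(attacker, difficulty):
    margin = CAPABILITY[attacker] - DIFFICULTY[difficulty]
    if margin < 0:
        return "Extremely low"  # exploit beyond the worst plausible attacker
    return {0: "Low", 1: "Medium"}.get(margin, "High")

print(malicious_likelihood("Activists", "Easy"))  # "Medium" in this sketch
```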
Use the following three-step procedure to determine the factor Frequency:
1. Find the frequency class that applies to this type of node in general.
This can be based on, for example, past experience or expert opinion. If available, MTBF (mean time between failures) figures or failure rates should be used.
2. Think of reasons why this particular node should have a lower or higher frequency than usual.
Existing countermeasures may make the frequency lower than usual. For example, if an organisation already has a stand-by generator that kicks in when power fails, then the frequency of power failure incidents is thereby reduced. Remember that the frequency does not reflect the likelihood that the vulnerability is triggered, but the likelihood that the vulnerability will lead to an incident.
For some components, monitoring can detect imminent failures before they occur. This will also reduce the frequency of incidents. Other examples are the use of premium quality components, or secure and controlled equipment rooms. All of these measures make incidents less likely.
The disaster scenarios may be an indication that the frequency should be higher than usual. In crisis situations it is often more likely that an incident will occur. For example, power outages are not very common, but are far more likely during flooding disasters; these disasters themselves are very uncommon. The overall frequency is therefore determined by the frequency of the disaster itself, combined with the likelihood that the incident occurs during such a disaster.
3. Decide on the frequency class for this particular node.
Typically either Low, Medium, or High will be used. If none of these accurately reflects the frequency, one of the extreme classes should be used. If no class can be assigned by consensus, Ambiguous or Unknown should be used.
The factor Impact indicates the severity of the effect when a vulnerability does lead to an incident. This severity is the effect on the service as a whole, not the effect on the component that experienced the vulnerability. For example, a power failure will cause equipment to stop functioning temporarily. This is normal, and in itself of little relevance, unless it has an effect on the availability of the telecom service. The power failure could cause the service to fail (if the equipment is essential), but could also have no effect at all (if the equipment has a backup). Or any effect in between.
Only the effects on the telecom service must be taken into account in this stage. Loss of business, penalties, and other damage are not considered, but may be relevant during risk evaluation (see Assessing social risk factors).
The damage may be caused by an incident that also affects other components of the same telecom service. For example, a cable may be damaged by an earthquake; the same earthquake will likely cause damage to other components as well. However, this additional damage must not be taken into account: only the damage resulting from the failure of this component must be considered. The next stage, common cause failures analysis, takes care of multiple failures due to a single incident.
The impact of some vulnerability on a component covers:
All eight classes can be used for Impact. Characteristic values for the classes High, Medium, and Low are given in the table below.
Use the following three-step procedure to determine the factor Impact:
1. Choose the impact class that seems to describe the impact of the incident most accurately.
2. Think of reasons why the impact would be higher or lower than this initial assessment.
Existing redundancy can reduce or even annul the impact. For example, a telecom service may have been designed such that when a wireless link fails, a backup wired link is used automatically. The impact of the wireless link failing is thereby reduced.
Monitoring and automatic alarms may reduce the impact of incidents. When incidents are detected quickly, repairs can be initiated faster. Keeping stock of spare parts, well trained repair teams, and conducting regular drills and exercises all help in reducing the impact of failures and must be considered in the assessment. On the other hand, absence of these measures may increase the impact of the incident.
3. Decide on the impact class.
Typically either Low, Medium, or High will be used. If none of these accurately reflects the impact, one of the extreme classes should be used. If no class can be assigned by consensus, Ambiguous or Unknown should be used.
Class | Value | Symbol |
---|---|---|
High | Partial unavailability, if unrepairable. Total unavailability, if long-term. | H |
Medium | Partial unavailability, if repairable (short-term or long-term). Total unavailability, if short-term. | M |
Low | Noticeable degradation, repairable (short-term or long-term) or unrepairable. | L |
Extremely high | Very long-term or unrepairable unavailability. | V |
Extremely low | Unnoticeable effects, or no actors affected. | U |
Ambiguous | Indicates lack of consensus between analysts. | A |
Unknown | Indicates lack of knowledge or data. | X |
Not yet analysed | Default. Indicates that no assessment has been done yet. | – |
It typically does not matter for the selection of impact class whether some or all actors are affected. All actors are important; they would not appear in the diagram otherwise. However, if the analysts agree that only very few actors are affected they can select the next lower class (e.g. Low instead of Medium).
The meaning of “short-term” and “long-term” depends on the tasks and use-cases of the actors. A two-minute outage is short-term for fixed telephony but long-term for real-time remote control of drones and robots.
“Degradation” means that actors notice reduced performance (e.g. noise during telephone calls, unusual delay in delivery of email messages), but not so much that their tasks or responsibilities are affected.
“Partial unavailability” means severe degradation or unavailability of some aspects of the service, such that actors cannot effectively perform some of their tasks or responsibilities. For example: email can only be sent within the organisation; noise makes telephone calls almost unintelligible; mobile data is unavailable but mobile calls and SMS are not affected. Actors can still perform some of their tasks, but other tasks are impossible or require additional effort.
“Total unavailability” means that actors effectively cannot perform any of their tasks and responsibilities using the telecom service (e.g. phone calls can be made but are completely unintelligible because of extremely poor quality).
“Extremely high” means that if the incident happens the damage will be so large that major redesign of the telecom service is necessary, or the service has to be terminated and replaced with an alternative because repairs are unrealistic.
The overall vulnerability level of a component is defined as the worst vulnerability for that component. If some of the vulnerabilities are not assessed (no frequency or impact has been set for them), they will not contribute to the overall vulnerability level. It can thus be a useful time-saver to skip assessment of unimportant vulnerabilities.
It is very important that all vulnerabilities with High and Extremely high impact are assessed fully. This is true even when their Frequency is low.
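As a sketch, the rule can be written as follows. The ordering of the classes, and the treatment of Ambiguous and Unknown, are assumptions here; the tool's exact rules are given in Computation of vulnerability levels.

```python
# Sketch: the overall vulnerability level is the worst assessed
# vulnerability; unassessed ones (score None) are skipped. The ordering
# below, including the place of Unknown and Ambiguous, is an assumption.
ORDER = ["Extremely low", "Low", "Medium", "High", "Extremely high",
         "Unknown", "Ambiguous"]

def overall_level(scores):
    """scores: one combined vulnerability score per vulnerability,
    or None when its Frequency and Impact have not been set."""
    assessed = [s for s in scores if s is not None]
    return max(assessed, key=ORDER.index) if assessed else None

print(overall_level(["Low", None, "High", "Medium"]))  # "High"
```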
When an unknown link receives an overall vulnerability level of Ambiguous or Unknown, the analysts must decide whether or not to expand the node. Expansion means that the internal make-up of the node is examined; the unknown link is removed from the diagram, and its constituent parts are added to the diagram as individual equipment items, wired and wireless links, and possibly further unknown links. Expansion adds more detail to the model, and results in additional diagram components. The vulnerabilities to these new components must also be analysed, as for any other diagram component.
It is not always necessary to expand unknown links. If the analysts think that the effort involved in expansion is too large, or that it will not lead to more accurate or insightful results, then expansion should be omitted.
When all components have been analysed, a review must take place. All analysts must participate in this review. The purpose of the review is to detect mistakes and inconsistencies, and to decide whether the Single Failures Analysis stage can be concluded.
If any of the components has an overall vulnerability level of Ambiguous or Unknown, the analysts must decide whether or not to conduct further investigation, in order to assess the vulnerabilities of that node with greater certainty. If the analysts think that the effort involved is too large, or that it will not lead to more accurate or insightful results, then the component should be left as is.
If the analysts decide to redo some part of the Single Failures Analysis stage, then they should again perform a review afterwards. This review may be omitted when the analysts agree that all changes are minor.
Determine and analyse common cause failures.
A common cause failure is an event that leads to the simultaneous failure of two or more components. For example: two cables in the same duct can both be cut in a single incident; multiple equipment items may be destroyed in a single fire.
For a common cause failure to happen, the affected components must be within range of each other, according to a critical property. For physical failure events such as fire and flooding, this property is geographical proximity: the components must be sufficiently close to be affected simultaneously. For configuration mistakes it is the similarity in maintenance procedures. For software bugs it is whether related firmware versions are used, regardless of geographical distance. Other events may have different critical properties.
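A sketch of this idea: group the components of a diagram by their value for one critical property, and treat every group of two or more as a candidate common cause failure. The property values below are illustrative.

```python
from collections import defaultdict

def common_cause_clusters(components):
    """components maps a component name to its value for one critical
    property (e.g. the duct a cable runs through, or a firmware version)."""
    by_value = defaultdict(set)
    for name, value in components.items():
        by_value[value].add(name)
    # Only groups of two or more components can fail from a common cause.
    return [group for group in by_value.values() if len(group) > 1]

# Two cables sharing duct 7 form a common cause failure candidate.
print(common_cause_clusters({"cable A": "duct 7", "cable B": "duct 7",
                             "cable C": "duct 12"}))  # [{'cable A', 'cable B'}]
```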
For each failure scenario, the critical property has a maximum effect distance. Two equipment items can only be affected by a minor fire when they are in the same room; for a major fire the