4 Stage 2 — Single failures analysis
Describe telecom service networks and analyse vulnerabilities of components.
In this stage you will create a telecom service diagram for each telecom service, and assess the vulnerabilities on each of its components. This will give you a good understanding of the inner workings of each telecom service, and a first impression of its risks.
The result will be recorded in the Raster tool: telecom service diagrams and assessment of Frequency and Impact on vulnerabilities of diagram components.
The Single Failures Analysis stage consists of the following steps:
- Update the checklists of vulnerabilities
- Draw initial diagrams
- Analyse the vulnerabilities of components (assess frequency and impact)
- Expand unknown links
- Review
4.1 Update the checklists of vulnerabilities
Based on the disaster scenarios that were described in Stage 1, you must describe the most common vulnerabilities of network components. Checklists are used for this. A checklist contains the name and description of the most common vulnerabilities. Good checklists make the analysis process faster and easier.
Create a fresh Raster project (see The Projects toolbar), and inspect the predefined checklist for each type (see Checklist windows). Add new vulnerabilities as deemed necessary. Include vulnerabilities that apply to most components of that type; omit vulnerabilities that only apply to a few components. The checklists do not have to be complete; any particular network component may have specific vulnerabilities that do not occur in the checklist. However, when the most common vulnerabilities are included in checklists, few special cases need to be considered.
Vulnerabilities can be natural or malicious. Natural vulnerabilities are unpredictable random events, sometimes caused by inattentiveness or other non-intentional human actions. Examples include fires, power failures, or equipment defects. Malicious vulnerabilities are bad-faith actions by people with the express purpose of causing harm, often exploiting weaknesses in the organisation’s defenses. Examples include theft and cybercrime. Natural and malicious vulnerabilities differ in their frequency and consequences.
There are three checklists, one each for equipment, wired and wireless links. For actor components no checklist exists. Vulnerabilities of actors are outside the scope of the Raster method. Also, unknown links do not have a separate checklist. They may contain any of the other component types, and therefore all vulnerabilities of the three checklists may apply to unknown links.
Vulnerabilities of actors are not taken into account. For example, Raster does not handle an actor misinterpreting a received message. However, configuration errors, incorrect handling of handsets or cyber crimes can be taken into account. These vulnerabilities are modelled in Raster as part of equipment components, not as part of the actor responsible for them. Maintenance personnel are not included in the diagrams as actors.
4.2 Draw initial diagrams
In the Raster tool, create a diagram tab for each telecom service (see Service tabs). When two services have a lot of components in common, it may be more convenient to combine those services into a single diagram. This prevents components from appearing in more than one diagram, but tends to make the diagram more complex.
For example, if the office LAN is used for VoIP telephony too, it is more convenient to combine telephony and office automation into one diagram.
Then, for each telecom service, draw an initial diagram based on the information that is currently available. The diagrams will likely not be very detailed yet. At the very least all actors involved with the service must be drawn. Note that it is always possible to create a diagram; if absolutely no information is available beyond the actors involved then the actors can simply be connected using an unknown link (“cloud” symbol). Drawing and editing diagrams using the Raster tool is explained in Workspace.
When creating diagrams, the following guidelines may be helpful:
- A cable containing multiple fibers or strands should be modelled as a single wired link. Two cables in the same duct should be modelled by two wired links in the diagram.
- Point-to-multipoint connections should be modelled using a single wireless link, but may sometimes be more conveniently modelled using separate wireless links to each receiving node. If you know in advance that the link to each individual node is subject to identical risks, then for simplicity a single wireless link should be used.
- Equipment components can be a single device, or an entire installation. For example, a small telephone exchange may be modelled as a single equipment node. However, installations such as these contain multiple cables and sub-components. Often it is not necessary to model these cables and equipment items separately. When an installation is separated over multiple rooms or when wireless links are used then the sub-components should be modelled separately. Alternatively, an unknown link may be used instead of an equipment item.
4.3 Analyse the vulnerabilities of components
This activity must be performed for each component in turn. In each step, one component is selected for analysis.
4.3.1 Add and remove vulnerabilities
Inspect the listed vulnerabilities of the component. Other vulnerabilities may exist that were not in the general checklist. These vulnerabilities must be added. The disaster scenarios that were prepared in Stage 1 must be used as guidance in decisions to add vulnerabilities.
Example: Telecommunication satellites are vulnerable to space debris. This vulnerability does not apply to any other kind of equipment, and will therefore not be in the equipment checklist. On the other hand, satellites are not vulnerable to flooding. Therefore “Collision with space debris” must be added, and “Flooding” must be removed from the list of satellite vulnerabilities.
A vulnerability must not be removed unless it is clearly nonsensical, e.g. configuration errors on devices that do not allow for any kind of configuration, or flood damage to a space satellite. To be removed, a vulnerability must be physically impossible, not just very unlikely in practice. In all other cases the frequency and impact of the vulnerability should be assessed (although they can both be set to Extremely low), and the vulnerability must be part of the review at the end of Stage 2.
When a vulnerability is removed, that node will also not be shown in the list for common cause failures. That is another reason not to remove vulnerabilities.
It is important that vulnerabilities that are merely unlikely but not physically impossible are retained in the analysis, because such vulnerabilities could have an extremely high impact. Low-probability/high-impact events must not be excluded from the risk analysis.
4.3.2 Assess vulnerabilities
When the list of vulnerabilities for the component is complete, each vulnerability must be assessed. The analysts, based on their collective knowledge, estimate two factors:
- the likelihood (frequency) that the vulnerability will lead to an incident, and
- the impact of that incident.
Both factors Frequency and Impact are split into eight classes. The classes do not correspond to ranges (a highest and lowest permissible value); instead they mention a typical, characteristic value for the class. The selection of the proper class may require a discussion between analysts. Analysts must provide convincing arguments for their choice of class.
Sometimes a factor (a likelihood or impact) is extremely large, or extremely small. Extremely large values are not simply very big, but too big to fit in the normal scale: they are unacceptably or intolerably high. Likewise, extremely small values are outside the scale of normal values, and may sometimes safely be ignored. Extreme values fall outside the normal experience of analysts or other stakeholders, and normal paths of reasoning cannot be applied.
If no consensus can be reached between the analysts, the class Ambiguous must be assigned. In the remarks the analysts should briefly explain the cause for disagreement, and the classes that different analysts would prefer to see.
A limited amount of uncertainty is unavoidable, and is normal for risk assessments. However, when uncertainty becomes so large that multiple classes could be assigned to a factor, the class Unknown must be assigned.
The Raster tool assists in recording the analysis results. The tool will also automatically compute the combined vulnerability score for each vulnerability, and the overall vulnerability level for each node (see sections Vulnerability assessment window and Single failures view; for technical details see Computation of vulnerability levels).
Do not blindly trust your initial estimate of frequency and impact. You must not rely only on information that confirms your estimate, but also actively search for contradicting evidence.
4.3.3 Assess frequency
For natural vulnerabilities the factor Frequency indicates the likelihood that the vulnerability will lead to an incident with an effect on the telecom service. All eight classes can be used for Frequency (see Frequency table).
A frequency of “once in 50 years” is an average, and does not mean that each 50 years an incident is guaranteed to occur. It may be interpreted as:
- The average timespan between incidents on a single component is 50 years.
- For a set of 50 identical components, each year on average one of them will experience an incident.
- Each year, the component has a 1 in 50 chance of experiencing an incident.
When the life time of a component is 5 years (or when the component is replaced every 5 years) the frequency of a vulnerability can still be “once in 500 years”.
Example: a component is always replaced after one year, even if it is still functioning. On average, 10% of components fail before their full year is up. The general frequency for this failure is therefore estimated as “once in 10 years” even though no component will be in use that long.
Note that this value is between the characteristic values for High and Medium. The analysts must together decide which of these two classes is assigned.
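The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration, not part of the Raster tool: the function names are hypothetical, and the characteristic values (once in 5, 50 and 500 years) are those of the Frequency table. The sketch only lists the candidate classes; the choice between them remains with the analysts.

```python
# Characteristic values of the Frequency table, as mean years between incidents.
CHARACTERISTIC_YEARS = {"High": 5, "Medium": 50, "Low": 500}


def years_between_incidents(annual_probability: float) -> float:
    """Convert a yearly incident probability into a 'once in N years' figure."""
    if not 0 < annual_probability <= 1:
        raise ValueError("annual_probability must be in the range (0, 1]")
    return 1 / annual_probability


def candidate_classes(years: float) -> list[str]:
    """Return the class or classes whose characteristic values bracket `years`.

    Values outside the 5..500 year range return the nearest normal class only;
    whether an extreme class applies is a judgement call (hypothetical helper).
    """
    ordered = sorted(CHARACTERISTIC_YEARS.items(), key=lambda item: item[1])
    if years <= ordered[0][1]:
        return [ordered[0][0]]
    if years >= ordered[-1][1]:
        return [ordered[-1][0]]
    for (low_name, low_value), (high_name, high_value) in zip(ordered, ordered[1:]):
        if years == high_value:
            return [high_name]
        if low_value < years < high_value:
            return [low_name, high_name]
    return []


# 10% of components fail within a year, as in the example above.
years = years_between_incidents(0.10)
print(years, candidate_classes(years))
```

For the 10% example the sketch reports a figure of once in 10 years and the two candidate classes High and Medium, matching the note above.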
Class | Value | Symbol |
---|---|---|
High | Once in 5 years. For 100 identical components, each month 1 or 2 will experience an incident. | H |
Medium | Once in 50 years. For 100 identical components, each year 2 will experience an incident. | M |
Low | Once in 500 years. For 100 identical components, one incident will occur every five years. | L |
Extremely high | Routine event. Very often. | V |
Extremely low | Very rare, but not physically impossible. | U |
Ambiguous | Indicates lack of consensus between analysts. | A |
Unknown | Indicates lack of knowledge or data. | X |
Not yet analysed | Default. Indicates that no assessment has been done yet. | – |
The likelihood of malicious vulnerabilities is not based on chance (as is the case for natural vulnerabilities), but on the difficulty of the action and on the determination and capabilities of the attacker. An attack that requires modest capabilities could already prove too demanding for a casual customer or employee. On the other hand, even a difficult attack may well be within the reach of skilled state-sponsored hackers. The Raster method bases the assessment on the most skilled attacker that the organisation can plausibly expect to face: the worst plausible attacker.
Attacker | Skill and motivation |
---|---|
Customers, employees | Unskilled, lightly motivated by opportunity or mild protest (e.g. perceived unfair treatment). |
Activists | Moderately skilled, aiming for media exposure to further their cause or protest. Visible impact. |
Criminals | Highly skilled, motivated by financial gains (e.g. ransomware). |
Competitors | Highly skilled, aiming to obtain trade secrets for competitive advantage. Avoid visible impact. |
State-sponsored hackers | Very highly skilled, motivated by geo-political advantages. Avoid visible impact. |
In the Raster tool you set the worst plausible attacker as part of the project properties. Since this is a property of the entire project, you only need to select the appropriate difficulty level of the exploit, as per the table below.
Class | Value |
---|---|
Very difficult | Exploit requires skill, custom attack tools, long preparation time and multiple weaknesses and zero-days. |
Difficult | Exploit requires skill, some customized attack tools and long preparation time. |
Easy | Tools exist to execute the exploit. Basic skills required. |
Trivial | Requires no skill or tools at all. |
Nearly impossible | Exploit may be possible in theory, but consensus is that exploit is infeasible. |
Ambiguous | Indicates lack of consensus between analysts. |
Unknown | Indicates lack of knowledge or data. |
Not yet analysed | Default. Indicates that no assessment has been done yet. |
Use the following three-step procedure to determine the factor Frequency:
1. Find the frequency class that applies to this type of node in general.
This can be based on, for example, past experience or expert opinion. If available, MTBF (mean time between failures) figures or failure rates should be used.
2. Think of reasons why this particular node should have a lower or higher frequency than usual.
Existing countermeasures may make the frequency lower than usual. For example, if an organisation already has a stand-by generator that kicks in when power fails, then the frequency of power failure incidents is thereby reduced. Remember that the frequency does not reflect the likelihood that the vulnerability is triggered, but the likelihood that the vulnerability will lead to an incident.
For some components monitoring can detect imminent failures before they occur. This will also reduce the frequency of incidents. Other examples are the use of premium quality components, and secure and controlled equipment rooms. All of these measures make incidents less likely.
The disaster scenarios may be an indication that the frequency should be higher than usual. In crisis situations it is often more likely that an incident will occur. For example, power outages are not very common, but are far more likely during flooding disasters. These disasters themselves are very uncommon. The overall frequency is therefore determined by:
- the likelihood of power outages during normal circumstances, and
- the likelihood of power outages during a flood, combined with the likelihood of flooding.
3. Decide on the frequency class for this particular node.
Typically either Low, Medium, or High will be used. If none of these accurately reflects the frequency, one of the extreme classes should be used. If no class can be assigned by consensus, one of Ambiguous or Unknown should be used.
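The power-outage example above combines two contributions into one overall frequency. A minimal sketch of that combination, assuming both contributions are rare and can simply be added as yearly rates, is given below. The function name and all numbers are hypothetical and serve only as illustration.

```python
def combined_annual_rate(normal_rate: float,
                         disaster_rate: float,
                         p_incident_given_disaster: float) -> float:
    """Approximate the yearly incident rate of a vulnerability.

    normal_rate: incidents per year during normal circumstances.
    disaster_rate: disasters (e.g. floods) per year.
    p_incident_given_disaster: chance that a disaster leads to the incident.

    Assumption: both contributions are small, so adding them is a
    reasonable approximation.
    """
    if not 0 <= p_incident_given_disaster <= 1:
        raise ValueError("p_incident_given_disaster must be between 0 and 1")
    return normal_rate + disaster_rate * p_incident_given_disaster


# Illustrative numbers only: outages once in 50 years normally, floods once
# in 100 years, and a 90% chance that a flood causes a power outage.
rate = combined_annual_rate(1 / 50, 1 / 100, 0.9)
print(f"about once in {1 / rate:.0f} years")
```

With these made-up figures the disaster scenario raises the frequency noticeably, although the result still lies between the characteristic values for High and Medium. The class itself must still be chosen by the analysts.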
4.3.4 Assess impact
The factor Impact indicates the severity of the effect when a vulnerability does lead to an incident. This severity is the effect on the service as a whole, not the effect on the component that experienced the vulnerability. For example, a power failure will cause equipment to stop functioning temporarily. This is normal, and in itself of little relevance, unless it has an effect on the availability of the telecom service. The power failure could cause the service to fail (if the equipment is essential), could have no effect at all (if the equipment has a backup), or could have any effect in between.
Only the effects on the telecom service must be taken into account in this stage. Loss of business, penalties, and other damage are not considered, but may be relevant during risk evaluation (see Assessing social risk factors).
The damage may be caused by an incident that also affects other components of the same telecom service. For example, a cable may be damaged by an earthquake; the same earthquake will likely cause damage to other components as well. However, this additional damage must not be taken into account. Only the damage resulting from the damage to this component must be considered. The next stage, common cause failures analysis, takes care of multiple failures due to a single incident.
The impact of some vulnerability on a component covers:
- only effects to the service, not the effects to the component itself,
- only effects to the service, not subsequent damage to the organisation,
- only effects due to damage to this single component, not effects due to the failure scenario.
All eight classes can be used for Impact. Characteristic values for the classes High, Medium, and Low are given in the Impact table below.
Use the following three-step procedure to determine the factor Impact:
1. Choose the impact class that most accurately seems to describe the impact of the incident.
2. Think of reasons why the impact would be higher or lower than this initial assessment.
Existing redundancy can reduce or even annul the impact. For example, a telecom service may have been designed such that when a wireless link fails, a backup wired link is used automatically. The impact of the wireless link failing is thereby reduced.
Monitoring and automatic alarms may reduce the impact of incidents. When incidents are detected quickly, repairs can be initiated faster. Keeping stock of spare parts, well trained repair teams, and conducting regular drills and exercises all help in reducing the impact of failures and must be considered in the assessment. On the other hand, absence of these measures may increase the impact of the incident.
3. Decide on the impact class.
Typically either Low, Medium, or High will be used. If none of these accurately reflects the impact, one of the extreme classes should be used. If no class can be assigned by consensus, one of Ambiguous or Unknown should be used.
Class | Value | Symbol |
---|---|---|
High | Partial unavailability, if unrepairable. Total unavailability, if long-term. | H |
Medium | Partial unavailability, if repairable (short-term or long-term). Total unavailability, if short-term. | M |
Low | Noticeable degradation, repairable (short-term or long-term) or unrepairable. | L |
Extremely high | Very long-term or unrepairable unavailability. | V |
Extremely low | Unnoticeable effects, or no actors affected. | U |
Ambiguous | Indicates lack of consensus between analysts. | A |
Unknown | Indicates lack of knowledge or data. | X |
Not yet analysed | Default. Indicates that no assessment has been done yet. | – |
It typically does not matter for the selection of impact class whether some or all actors are affected. All actors are important; they would not appear in the diagram otherwise. However, if the analysts agree that only very few actors are affected they can select the next lower class (e.g. Low instead of Medium).
The meaning of “short-term” and “long-term” depends on the tasks and use-cases of the actors. A two minute outage is short-term for fixed telephony but long-term for real-time remote control of drones and robots.
“Degradation” means that actors notice reduced performance (e.g. noise during telephone calls, unusual delay in delivery of email messages), but not so much that their tasks or responsibilities are affected.
“Partial unavailability” means severe degradation or unavailability of some aspects of the service, such that actors cannot effectively perform some of their tasks or responsibilities. For example: email can only be sent within the organisation; noise makes telephone calls almost unintelligible; mobile data is unavailable but mobile calls and SMS are not affected. Actors can still perform some of their tasks, but other tasks are impossible or require additional effort.
“Total unavailability” means that actors effectively cannot perform any of their tasks and responsibilities using the telecom service (e.g. phone calls can be made but are completely unintelligible because of extremely poor quality).
“Extremely high” means that if the incident happens the damage will be so large that major redesign of the telecom service is necessary, or the service has to be terminated and replaced with an alternative because repairs are unrealistic.
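The Impact table and the definitions above can be read as a lookup from the kind of effect and its duration to a suggested class. The sketch below is one possible reading, not the behaviour of the Raster tool. All names are hypothetical. It assumes that total unavailability that is unrepairable maps to Extremely high, and that the "very few actors" adjustment only lowers High and Medium by one class. The outcome is a starting point for discussion; the analysts still decide on the class.

```python
from enum import Enum


class Extent(Enum):
    UNNOTICEABLE = "unnoticeable effects"
    DEGRADATION = "noticeable degradation"
    PARTIAL = "partial unavailability"
    TOTAL = "total unavailability"


class Duration(Enum):
    SHORT_TERM = "short-term"
    LONG_TERM = "long-term"
    UNREPAIRABLE = "unrepairable"


def suggest_impact(extent: Extent, duration: Duration,
                   very_few_actors: bool = False) -> str:
    """Suggest an impact class from the characteristic values in the table."""
    if extent is Extent.UNNOTICEABLE:
        suggestion = "Extremely low"
    elif extent is Extent.DEGRADATION:
        # Degradation is Low whether repairable or not.
        suggestion = "Low"
    elif extent is Extent.PARTIAL:
        suggestion = "High" if duration is Duration.UNREPAIRABLE else "Medium"
    else:  # Extent.TOTAL
        suggestion = {
            Duration.SHORT_TERM: "Medium",
            Duration.LONG_TERM: "High",
            # Assumption: unrepairable total unavailability is Extremely high.
            Duration.UNREPAIRABLE: "Extremely high",
        }[duration]

    # Assumption: when only very few actors are affected, High and Medium
    # may be lowered by one class; other classes are left unchanged.
    if very_few_actors:
        suggestion = {"High": "Medium", "Medium": "Low"}.get(suggestion, suggestion)
    return suggestion


print(suggest_impact(Extent.TOTAL, Duration.SHORT_TERM))
print(suggest_impact(Extent.PARTIAL, Duration.LONG_TERM, very_few_actors=True))
```

Whether an outage counts as short-term or long-term is an input to this sketch, because it depends on the tasks and use-cases of the actors.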
4.3.5 Assessing all vulnerabilities on a component
The overall vulnerability level of a component is defined as the worst vulnerability for that component. If some of the vulnerabilities are not assessed (no frequency or impact have been set on them), they will not contribute to the overall vulnerability level. It can thus be a useful time-saver to skip assessment of unimportant vulnerabilities.
It is very important that all vulnerabilities with High and Extremely high impact are assessed fully. This is true even when their Frequency is low.
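The "worst vulnerability" rule can be illustrated with a short sketch. It assumes that each vulnerability already has a combined level expressed with the symbols U, L, M, H and V, ranked in that order, and that a vulnerability that has not been assessed is represented by `None`. The handling of Ambiguous and Unknown is deliberately left out; the actual rules of the tool are described in Computation of vulnerability levels. The function name and the example data are hypothetical.

```python
from typing import Optional

# Assumed ranking of levels, from least to most severe.
RANK = {"U": 0, "L": 1, "M": 2, "H": 3, "V": 4}


def overall_level(levels: dict[str, Optional[str]]) -> Optional[str]:
    """Return the worst level among the assessed vulnerabilities.

    Vulnerabilities mapped to None (not assessed) do not contribute.
    Returns None when nothing has been assessed yet. Symbols outside
    RANK (such as A or X) raise a KeyError in this simplified sketch.
    """
    assessed = [level for level in levels.values() if level is not None]
    if not assessed:
        return None
    return max(assessed, key=lambda level: RANK[level])


component = {"Power failure": "M", "Flooding": "H", "Theft": None}
print(overall_level(component))
```

The sketch also shows why skipping a vulnerability is only safe when it is unimportant: an unassessed vulnerability can never raise the overall level of the component.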
4.4 Expand unknown links
When an unknown link receives an overall vulnerability level of Ambiguous or Unknown, the analysts must decide whether or not to expand the node. Expansion means that the internal make-up of the node is examined; the unknown link is removed from the diagram, and its constituent parts are added to the diagram as individual equipment items, wired and wireless links, and possibly further unknown links. Expansion adds more detail to the model, and results in additional diagram components. The vulnerabilities to these new components must also be analysed, as for any other diagram component.
It is not always necessary to expand unknown links. If the analysts think that the effort involved in expansion is too large, or that it will not lead to more accurate or insightful results, then expansion should be omitted.
4.5 Review
When all components have been analysed, a review must take place. All analysts must participate in this review. The purpose of the review is to detect mistakes and inconsistencies, and to decide whether the Single Failures Analysis stage can be concluded.
If any of the components has an overall vulnerability level of Ambiguous or Unknown, the analysts must decide whether or not to conduct further investigation, in order to assess the vulnerabilities of that node with greater certainty. If the analysts think that the effort involved is too large, or that it will not lead to more accurate or insightful results, then the component should be left as is.
If the analysts decide to redo some part of the Single Failures Analysis stage, then they should again perform a review afterwards. This review may be omitted when the analysts agree that all changes are minor.