rfc9940v1.txt   rfc9940.txt 
Internet Engineering Task Force (IETF) N. Davis, Ed. Internet Engineering Task Force (IETF) N. Davis, Ed.
Request for Comments: 9940 Ciena Request for Comments: 9940 Ciena
Category: Informational A. Farrel, Ed. Category: Informational A. Farrel, Ed.
ISSN: 2070-1721 Old Dog Consulting ISSN: 2070-1721 Old Dog Consulting
T. Graf T. Graf
Swisscom Swisscom
Q. Wu Q. Wu
Huawei
C. Yu C. Yu
Huawei Technologies Huawei
February 2026 February 2026
Some Key Terms for Network Fault and Problem Management Some Key Terms for Network Fault and Problem Management
Abstract Abstract
This document sets out some terms that are fundamental to a common This document sets out some terms that are fundamental to a common
understanding of network fault and problem management within the understanding of network fault and problem management within the
IETF. IETF.
skipping to change at line 85 skipping to change at line 84
Successful operation of large networks depends on effective network Successful operation of large networks depends on effective network
management. This requires a virtuous circle of network control, management. This requires a virtuous circle of network control,
network observability, network analytics, network assurance, and back network observability, network analytics, network assurance, and back
to network control. Network fault and problem management [RFC6632] to network control. Network fault and problem management [RFC6632]
is an important aspect of network management and control solutions. is an important aspect of network management and control solutions.
It deals with the detection, reporting, inspection, isolation, It deals with the detection, reporting, inspection, isolation,
correlation, and management of events within the network. The correlation, and management of events within the network. The
intention of this document is to focus on those events that have a intention of this document is to focus on those events that have a
negative effect on the network's ability to forward traffic according negative effect on the network's ability to forward traffic according
to expected behavior and so deliver services, the ability to control to expected behaviors that may reduce the network's ability to
and operate the network, and other faults that reduce the quality or deliver services. Such events may also impact the ability to control
reliability of the delivered service. The concept of fault and and operate the network. The document also considers other faults
problem management extends to include actions taken to determine the that reduce the quality or reliability of the delivered service. The
causes of problems and to work toward recovery of expected network concept of fault and problem management extends to include actions
behavior. taken to determine the causes of problems and to work toward recovery
of expected network behavior.
A number of work efforts within the IETF seek to provide components A number of work efforts within the IETF seek to provide components
of a fault management system, such as YANG data models or management of a fault management system, such as YANG data models or management
protocols. It is important that a common terminology be used so that protocols. It is important that a common terminology be used so that
there is a clear understanding of how the elements of the management there is a clear understanding of how the elements of the management
and control solutions fit together and how faults and problems will and control solutions fit together and how faults and problems will
be handled. be handled.
This document sets out some terms that are fundamental to a common This document sets out some terms that are fundamental to a common
understanding of network fault and problem management. While understanding of network fault and problem management. While
skipping to change at line 178 skipping to change at line 178
process of collecting operational network data categorized process of collecting operational network data categorized
according to the network plane (e.g., Layer 3, Layer 2, and Layer according to the network plane (e.g., Layer 3, Layer 2, and Layer
1) from which it was derived. Data collected through the Network 1) from which it was derived. Data collected through the Network
Telemetry process does not contain any data related to service Telemetry process does not contain any data related to service
definitions (i.e., "intent" per Section 3.1 of [RFC9315]). definitions (i.e., "intent" per Section 3.1 of [RFC9315]).
Network Monitoring: This is the process of keeping a continuous Network Monitoring: This is the process of keeping a continuous
record of functions related to a network topology. It involves record of functions related to a network topology. It involves
tracking various aspects such as traffic patterns, device health, tracking various aspects such as traffic patterns, device health,
performance metrics, and overall network behavior. This approach performance metrics, and overall network behavior. This approach
differentiates network monitoring from resource or device differentiates Network Monitoring from resource or device
monitoring, which focuses on individual resources or components monitoring, which focuses on individual resources or components
(Section 3.2). (Section 3.2).
Network Analytics: This is the process of deriving analytical Network Analytics: This is the process of deriving analytical
insights from operational network data. A process could be insights from operational network data. A process could be
executed by a piece of software, a system, or a human that executed by a piece of software, a system, or a human that
analyzes operational data and outputs new analytical data related analyzes operational data and outputs new analytical data related
to the operational data -- for example, a symptom. to the operational data -- for example, a symptom.
Network Observability: This is the process of enabling network Network Observability: This is the process of enabling network
behavioral assessment through analysis of observed operational behavioral assessment through analysis of observed operational
network data (logs, alarms, traces, etc.) with the aim of network data (logs, alarms, traces, etc.) with the aim of
detecting symptoms of network behavior, and to identify anomalies detecting symptoms of network behavior, and identifying anomalies
and their causes. Network Observability begins with information and their causes. Network Observability begins with information
gathered using Network Monitoring tools and that may be further gathered using Network Monitoring tools. That information may be
enriched with other operational data. The expected outcome of the further enriched with other operational data. The expected
observability processes is identification and analysis of outcome of the observability processes is identification and
deviations in observed state versus the expected state of a analysis of deviations in observed state versus the expected state
network. of a network.
Thus, there is a cascaded sequence where the following relationships Thus, there is a cascaded sequence where the following relationships
apply: apply:
* Network Telemetry is the process of collecting operational data * Network Telemetry is the process of collecting operational data
from a network. from a network.
* Network Monitoring is the process of creating/keeping a record of * Network Monitoring is the process of creating/keeping a record of
data gathered in Network Telemetry. data gathered in Network Telemetry.
skipping to change at line 221 skipping to change at line 221
* Network Observability is the process of enabling behavioral * Network Observability is the process of enabling behavioral
assessment of a network through Network Analytics. assessment of a network through Network Analytics.
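As an informal illustration of the cascade above, the following sketch chains the four processes as plain functions. The function names, the sample tuples, and the 90 percent utilization limit are assumptions made for this example only; none of them are defined by the document.

```python
def collect_telemetry():
    """Network Telemetry: collect operational data (hypothetical samples)."""
    # Each sample is (timestamp, interface name, utilization percent).
    return [(0, "eth0", 41.0), (1, "eth0", 43.0), (2, "eth0", 97.0)]


def monitor(samples, record):
    """Network Monitoring: keep a record of the data gathered by telemetry."""
    record.extend(samples)
    return record


def analyze(record, limit=90.0):
    """Network Analytics: derive analytical data (here, symptoms) from the record."""
    return [
        f"high utilization on {name} at t={t}"
        for t, name, util in record
        if util > limit  # the limit is an assumed policy value
    ]


def observe(record):
    """Network Observability: enable behavioral assessment through analytics."""
    symptoms = analyze(record)
    return {"symptoms": symptoms, "anomalous": bool(symptoms)}


assessment = observe(monitor(collect_telemetry(), []))
```

Each stage consumes only the output of the previous one, which is the property the cascaded sequence describes.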
3.2. Core Terms 3.2. Core Terms
The terms in this section are presented in an order that is intended The terms in this section are presented in an order that is intended
to flow such that it is possible to gain understanding reading top to to flow such that it is possible to gain understanding reading top to
bottom. The figures and explanations in Section 4 may aid bottom. The figures and explanations in Section 4 may aid
understanding the terms set out here. understanding the terms set out here.
Resource: An element of a network system. Resource: A Resource is an element of a network system.
* Resource is a recursive concept so that a Resource may be a * Resource is a recursive concept so that a Resource may be a
collection of other Resources (for example, a network node collection of other Resources (for example, a network node
comprises a collection of network interfaces). comprises a collection of network interfaces).
Characteristic: Observable or measurable aspect or behavior Characteristic: A Characteristic is an observable or measurable
associated with a Resource. aspect or behavior associated with a Resource.
* A Characteristic may be considered to be built on facts (see * A Characteristic may be considered to be built on facts (see
'Value', below) and the contexts and descriptors that identify "Value", below) and the contexts and descriptors that identify
and give meaning to the facts. and give meaning to the facts.
* The term "Metric" [RFC9417] is another word for a measurable * The term "Metric" (see "metric" in [RFC9417]) is another word
Characteristic which may also be thought of as analogous to a for a measurable Characteristic, which may also be thought of
'variable'. as analogous to a "variable".
Value: A measure of a Characteristic associated with a Resource. It Value: A Value is a measure of a Characteristic associated with a
may be in the form of a categorization (e.g., high or low), an Resource. It may be in the form of a categorization (e.g., high
integer (e.g., a count or gauge), or a reading of a continuous or low), an integer (e.g., a count or gauge), or a reading of a
variable (e.g., an analog measurement), etc. continuous variable (e.g., an analog measurement), etc.
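One possible way to picture Resource, Characteristic, and Value is the small data structure below. It is only a sketch: the class name, the field names, and the example router are hypothetical, and mapping each Characteristic name to its current Value in a dictionary is a simplification chosen for brevity.

```python
from dataclasses import dataclass, field
from typing import Union

# A Value may be a categorization, an integer, or a continuous reading.
Value = Union[str, int, float]


@dataclass
class Resource:
    name: str
    # Characteristic name -> current Value (assumed representation).
    characteristics: dict[str, Value] = field(default_factory=dict)
    # Resource is recursive: a Resource may be a collection of Resources.
    children: list["Resource"] = field(default_factory=list)

    def walk(self):
        """Yield this Resource and then every nested Resource, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


# Hypothetical network node comprising two network interfaces.
node = Resource(
    "router1",
    {"free_memory_mb": 512},
    [
        Resource("eth0", {"oper_status": "up"}),
        Resource("eth1", {"oper_status": "down"}),
    ],
)
names = [r.name for r in node.walk()]
```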
Change: In the context of Network Monitoring, the variation in the Change: In the context of Network Monitoring, a Change is the
Value of a Characteristic associated with a Resource. A Change variation in the Value of a Characteristic associated with a
may arise over a period of time. Resource. A Change may arise over a period of time.
* Not all Changes are noteworthy (i.e., they do not have * Not all Changes are noteworthy (i.e., they do not have
Relevance). Relevance).
* Perception of Change depends upon Detection, the sampling * Perception of Change depends upon Detection, the sampling
rate/accuracy/detail, and perspective. rate/accuracy/detail, and perspective.
* It may be helpful to qualify this as "Value Change" because the * It may be helpful to qualify this as "Value Change" because the
English word "change" is often heavily used. English word "change" is often heavily used.
Event: The variation in Value of a Characteristic of a Resource at a Event: An Event is the variation in Value of a Characteristic of a
distinct moment in time (i.e., the period is negligible). Resource at a distinct moment in time (i.e., the period is
negligible).
* Compared with a Change, which may be over a period of time, an * Compared with a Change, which may be over a period of time, an
Event happens at a distinct moment in time. Thus, an Event may Event happens at a distinct moment in time. Thus, an Event may
be the observation of a Change. be the observation of a Change.
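The distinction between a Change (over a period) and an Event (at a moment) can be sketched with a list of timestamped samples of one Characteristic. The helper names and the sample data are assumptions for illustration; in particular, treating any difference between consecutive samples as an Event is a simplification that depends on the sampling rate, as noted above.

```python
def events(samples):
    """Return (time, old_value, new_value) for each sample that differs
    from the previous one, i.e. a variation at a distinct moment."""
    found = []
    for (_, v0), (t1, v1) in zip(samples, samples[1:]):
        if v1 != v0:
            found.append((t1, v0, v1))
    return found


def change(samples, start, end):
    """Return the variation in Value between the first and last sample
    inside the period [start, end]; 0 if fewer than two samples fall in it."""
    window = [v for t, v in samples if start <= t <= end]
    return window[-1] - window[0] if len(window) >= 2 else 0


# Hypothetical (time, Value) samples of a single numeric Characteristic.
samples = [(0, 10), (1, 10), (2, 12), (3, 15)]
```

With these samples, `events(samples)` reports two Events, while `change(samples, 0, 3)` reports a single Change of 5 over the whole period.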
Condition: An interpretation of the Values of a set of one or more Condition: A Condition is an interpretation of the Values of a set
Characteristics of a Resource (with respect to working order or of one or more Characteristics of a Resource (with respect to
some other aspect relevant to the Resource purpose/application) -- working order or some other aspect relevant to the Resource
for example, "low available memory". Thus, it is the output of a purpose/application) -- for example, "low available memory".
function applied to a set of one or more variables. Thus, it is the output of a function applied to a set of one or
more variables.
State: A particular Condition that a Resource has (i.e., it is in a State: A State is a particular Condition that a Resource has (i.e.,
State) at a specific time. For example, a router may report the it is in a State) at a specific time. For example, a router may
total amount of memory it has and how much is free. These are the report the total amount of memory it has and how much is free.
Values of two Characteristics of a Resource. These Values can be These are the Values of two Characteristics of a Resource. These
interpreted to determine the Condition of the Resource, and that Values can be interpreted to determine the Condition of the
may determine the State of the router, such as shortage of memory. Resource, and that may determine the State of the router, such as
shortage of memory.
* While a State may be observed at a specific moment in time, it * While a State may be observed at a specific moment in time, it
is actually determined by summarizing measurement over time in is actually determined by summarizing measurement over time in
a process sometimes called State compression. a process sometimes called State compression.
* It may be helpful to qualify this as "Resource State" to make * It may be helpful to qualify this as "Resource State" to make
clear the distinction between this and other uses of "state" clear the distinction between this and other uses of "state"
such as "protocol state". such as "protocol state".
* This term may be contrasted with "Operational State" as used in * This term may be contrasted with "operational state" as used in
[RFC8342]. For example, the state of a link might be up/down/ [RFC8342]. For example, the state of a link might be up/down/
degraded, but the operational state of the link would include a degraded, but the operational state of the link would include a
collection of Values of Characteristics of the link. collection of Values of Characteristics of the link.
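The router memory example can be sketched as a Condition function applied to two Values, followed by a crude form of State compression over several readings. The 10 percent limit, the majority rule, and the function names are assumptions for this sketch and are not taken from the document.

```python
def memory_condition(total_mb, free_mb, low_fraction=0.10):
    """Condition: the output of a function applied to two Values."""
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    if free_mb / total_mb < low_fraction:
        return "low available memory"
    return "memory ok"


def resource_state(readings, low_fraction=0.10):
    """State: summarize Conditions over time (assumed majority rule)."""
    conditions = [memory_condition(t, f, low_fraction) for t, f in readings]
    low = conditions.count("low available memory")
    return "shortage of memory" if low > len(conditions) / 2 else "normal"


# Hypothetical (total, free) readings in megabytes taken over time.
readings = [(8192, 512), (8192, 600), (8192, 4096)]
state = resource_state(readings)
```

Summarizing several readings rather than one is what keeps a single unusual reading from changing the reported State.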
Detect (hence Detected, Detection): To notice the presence of Detect (hence Detected, Detection): To Detect is to notice the
something (State, Change, Event, activity, etc.) and hence also to presence of something (State, Change, Event, activity, etc.)
notice a Change (from the perspective of an observer such as a
monitoring system).
Relevance: Consideration of an Event, State, or Value (through the * Also to notice a Change (from the perspective of an observer
application of policy, relative to a specific perspective, intent, such as a monitoring system).
and in relation to other Events, States, and Values) to determine
whether it is of note to the system that controls or manages the Relevance: Relevance is the consideration of an Event, State, or
network. Note, for example, that not all Changes are Relevant. Value (through the application of policy, relative to a specific
perspective or intent, and in relation to other Events, States,
and Values) to determine whether it is of note to the system that
controls or manages the network. Note, for example, that not all
Changes are Relevant.
* This term may also be used as "Relevant Event", "Relevant * This term may also be used as "Relevant Event", "Relevant
State", or "Relevant Value". State", or "Relevant Value".
Occurrence: A Relevant Event or a particular Relevant Change. Occurrence: An Occurrence is a Relevant Event or a particular
Relevant Change.
* An Occurrence may be an aggregation or abstraction of multiple * An Occurrence may be an aggregation or abstraction of multiple
fine-grained Events or Changes. fine-grained Events or Changes.
* An Occurrence may occur at any macro or micro scale because * An Occurrence may occur at any macro or micro scale because
Resources are a recursive concept, and may be perceived, Resources are a recursive concept. An Occurrence may be
depending on the scope of observation (i.e., according to the perceived, depending on the scope of observation (i.e.,
level of Resource recursion that is examined). That is, according to the level of Resource recursion that is examined).
Occurrences, themselves, are a recursive concept. That is, Occurrences, themselves, are a recursive concept.
Fault: An Occurrence (i.e., an Event or a Change) that is not Fault: A Fault is an Occurrence (i.e., an Event or a Change) that is
desired/required (as it may be indicative of a current or future not desired/required (as it may be indicative of a current or
undesired State). Thus, a Fault happens at a moment in time. A future undesired State). Thus, a Fault happens at a moment in
Fault can potentially be associated with a Cause. See [RFC8632] time. A Fault can potentially be associated with a Cause. See
for a more detailed discussion of network faults. [RFC8632] for a more detailed discussion of network faults.
* Note that there is a distinction between a Fault and a Problem * Note that there is a distinction between a Fault and a Problem
that depends on context. For example, in a connectivity that depends on context. For example, in a connectivity
service where redundancy is present, a link down is a Problem, service where redundancy is present, a link down is a Problem,
but from the perspective of managing the network resources, a but from the perspective of managing the network resources, a
link down is a Fault. Likewise, for example, in a router with link down is a Fault. Likewise, for example, in a router with
two power supplies, if the backup power supply fails leaving two power supplies, if the backup power supply fails leaving
the primary unprotected, this is a Problem. the primary unprotected, this is a Problem.
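The progression from Event through Relevance to Occurrence and Fault might be sketched as follows. The policy is reduced here to two sets (the Characteristics being watched and the Values considered undesired); that reduction, along with every name in the block, is an assumption for illustration. Real policy would also consider perspective, intent, and other Events, States, and Values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    resource: str
    characteristic: str
    old: object
    new: object


def is_relevant(event, watched):
    """Relevance: assumed policy is simply a set of watched Characteristics."""
    return event.characteristic in watched


def classify(event, watched, undesired):
    """Return "event", "occurrence", or "fault" for the given Event."""
    if not is_relevant(event, watched):
        return "event"  # not of note to the management system
    if (event.characteristic, event.new) in undesired:
        return "fault"  # an Occurrence that is not desired
    return "occurrence"


watched = {"oper_status"}
undesired = {("oper_status", "down")}
link_down = Event("link1", "oper_status", "up", "down")
link_up = Event("link1", "oper_status", "down", "up")
tick = Event("router1", "uptime_s", 1, 2)
```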
Problem: A State that is undesirable and that may require remedial Problem: A Problem is a State that is undesirable and that may
action. A Problem cannot necessarily be associated with a Cause. require remedial action. A Problem cannot necessarily be
The resolution of a Problem does not necessarily act on the thing associated with a Cause. The resolution of a Problem does not
that has the Problem. necessarily act on the thing that has the Problem.
* Note that there is a historic aspect to the concept of a * Note that there is a historic aspect to the concept of a
Problem. The current State may be operational, but there could Problem. The current State may be operational, but there could
have been a Fault that is unexplained, and the fact of that have been a Fault that is unexplained, and the fact of that
unexplained recent Fault is a Problem. unexplained recent Fault is a Problem.
* Note that while a Problem is unresolved it may continue to * Note that while a Problem is unresolved it may continue to
require attention. A record of resolved Problems may be require attention. A record of resolved Problems may be
maintained in a log. maintained in a log.
skipping to change at line 357 skipping to change at line 363
operational once more) but may leave the Problem as unresolved operational once more) but may leave the Problem as unresolved
(because the loss of light has not been explained). Further, (because the loss of light has not been explained). Further,
in this example, there could be another development (the reason in this example, there could be another development (the reason
for the temporary loss of light is traced to a microbend in the for the temporary loss of light is traced to a microbend in the
fiber that is repaired) resulting in that unresolved Problem fiber that is repaired) resulting in that unresolved Problem
now being resolved. But, in this example, this still leaves a now being resolved. But, in this example, this still leaves a
further Problem unresolved (a microbend occurred, and that further Problem unresolved (a microbend occurred, and that
Problem is not resolved until it is understood how it occurred Problem is not resolved until it is understood how it occurred
and a remedy is put in place to prevent recurrence). and a remedy is put in place to prevent recurrence).
Cause: The Events (Detected or otherwise) that gave rise to a Fault/ Cause: A Cause is the Events (Detected or otherwise) that gave rise
Problem. to a Fault/Problem.
Incident: Also referred to as "Network Incident". An Incident is an Incident: Also referred to as "Network Incident". An Incident is an
undesired Occurrence such as an unexpected interruption of a undesired Occurrence such as an unexpected interruption of a
network service, degradation of the quality of a network service, network service, degradation of the quality of a network service,
or the below-target performance of a network service. An Incident or the below-target performance of a network service. An Incident
results from one or more Problems, and a Problem may give rise to results from one or more Problems, and a Problem may give rise to
or contribute to one or more Incidents. Greater discussion of or contribute to one or more Incidents. Greater discussion of
Network Incident relationships, including Customer Incidents and Network Incident relationships, including Customer Incidents and
Incident management, can be found in [Net-Incident-Mgmt-YANG]. Incident management, can be found in [Net-Incident-Mgmt-YANG].
Symptom: An observable Value, Change, State, Event, or Condition Symptom: A Symptom is an observable Value, Change, State, Event, or
considered as an indication of a Problem or potential Problem. Condition considered as an indication of a Problem or potential
Problem.
Anomaly: Also referred to as "Network Anomaly". An Anomaly is an Anomaly: Also referred to as "Network Anomaly". An Anomaly is an
unusual or unexpected Event or pattern in network data in the unusual or unexpected Event or pattern in network data in the
forwarding plane, control plane, or management plane that deviates forwarding plane, control plane, or management plane that deviates
from the normal, expected behavior. See [Net-Anomaly-Arch] for from the normal, expected behavior. See [Net-Anomaly-Arch] for
more details. more details.
Alert: An indication of a Fault. Alert: An Alert is an indication of a Fault.
Alarm: As specified in [RFC8632], signifies an undesirable State in Alarm: As specified in [RFC8632], an Alarm signifies an undesirable
a Resource that requires corrective action. From a management State in a Resource that requires corrective action. From a
point of view, an Alarm can be seen as a State in its own right management point of view, an Alarm can be seen as a State in its
and the transition to this State may result in an Alert being own right and the transition to this State may result in an Alert
issued. The receipt of this Alert may give rise to a continuous being issued. The receipt of this Alert may give rise to a
indication (to a human operator) highlighting the potential or continuous indication (to a human operator) highlighting the
actual presence of a Problem. potential or actual presence of a Problem.
3.3. Other Terms 3.3. Other Terms
Three other terms may be helpful: Three other terms may be helpful:
Intermittent: A State that is not continuous but that keeps Intermittent: A State that is not continuous but that keeps
recurring in some time frame. recurring in some time frame.
Transient: A State that is not continuous and that occurs once in Transient: A State that is not continuous and that occurs once in
some time frame. some time frame.
skipping to change at line 461 skipping to change at line 468
Change at a time Change over time Change over time Change at a time Change over time Change over time
Figure 2: Characteristics and Changes Figure 2: Characteristics and Changes
Figure 3 shows the workflow progress for Events. As noted above, an Figure 3 shows the workflow progress for Events. As noted above, an
Event is a Change in the Value of a Characteristic at a time. The Event is a Change in the Value of a Characteristic at a time. The
Event may be evaluated (considering policy, relative to a specific Event may be evaluated (considering policy, relative to a specific
perspective, with a view to intent, and in relation to other Events, perspective, with a view to intent, and in relation to other Events,
States, and Values) to determine if it is an Occurrence and possibly States, and Values) to determine if it is an Occurrence and possibly
to indicate a Change of State. An Occurrence may be undesirable (a to indicate a Change of State. An Occurrence may be undesirable (a
Fault) and that can cause an Alert to be generated, may be evidence Fault), which might cause an Alert to be generated. Or, an
of a Problem and could directly indicate a Cause. In some cases, an Occurrence may be evidence of a Problem and could directly indicate a
Alert may give rise to an Alarm highlighting the potential or actual Cause. In some cases, an Alert may give rise to an Alarm
presence of a Problem. highlighting the potential or actual presence of a Problem.
Alert - - - > Alarm Alert - - - > Alarm
^ ^
| |
| -----> Cause | -----> Cause
| | | |
|----------> Problem |----------> Problem
| |
| |
Fault Fault
skipping to change at line 500 skipping to change at line 507
progress for States. As shown in Figure 2, Change noted at a progress for States. As shown in Figure 2, Change noted at a
particular time gives rise to State. The State may be deemed to have particular time gives rise to State. The State may be deemed to have
Relevance considering policy, relative to a specific perspective, Relevance considering policy, relative to a specific perspective,
with a view to intent, and in relation to other Events, States, and with a view to intent, and in relation to other Events, States, and
Values. A Relevant State may be deemed a Problem, or it may indicate Values. A Relevant State may be deemed a Problem, or it may indicate
a Problem or potential Problem. a Problem or potential Problem.
Problems may be considered based on Symptoms and may map directly or Problems may be considered based on Symptoms and may map directly or
indirectly to Causes. An Incident results from one or more Problems. indirectly to Causes. An Incident results from one or more Problems.
An Alarm may be raised as the result of a Problem, and the transition An Alarm may be raised as the result of a Problem, and the transition
to an Alarmed state may give rise to an Alert. to an alarmed State may give rise to an Alert.
Alarm - - -> Alert Alarm - - -> Alert
^ ^
| ------> Incident | ------> Incident
| | | |
| | ---> Cause | | ---> Cause
| | | | | |
Problem---------> Symptom Problem---------> Symptom
^ ^
| |
skipping to change at line 560 skipping to change at line 567
Events and States (and the Alerts that they might give rise to) must Events and States (and the Alerts that they might give rise to) must
be treated with caution to dampen any "flapping" (so that consistent be treated with caution to dampen any "flapping" (so that consistent
States may be observed) and to avoid overwhelming management States may be observed) and to avoid overwhelming management
processes or systems. Analog Values may be read or notified from the processes or systems. Analog Values may be read or notified from the
Resource and could transition a threshold, be deemed Relevant Values, Resource and could transition a threshold, be deemed Relevant Values,
or be evaluated over time. Events may be counted, and the Count may or be evaluated over time. Events may be counted, and the Count may
cross a threshold or reach a Relevant Value. cross a threshold or reach a Relevant Value.
The Threshold Process may be implementation specific and subject to The Threshold Process may be implementation specific and subject to
policies. When a threshold is crossed and any other conditions are policies. When a threshold is crossed and any other conditions are
matched, an Event may be determined and may be treated like any other matched, an Event may be determined and treated like any other Event.
Event.
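A threshold process that dampens flapping could, for example, use separate raise and clear levels (hysteresis). The sketch below is one such approach under stated assumptions: the two levels, the function name, and the sample readings are hypothetical, and the document does not prescribe this or any other mechanism.

```python
def threshold_events(values, raise_at, clear_at):
    """Return (index, "raised" | "cleared") Events for a series of Values.

    An Event is determined only when the Value reaches raise_at while not
    raised, or falls to clear_at while raised, so small oscillations
    around a single level do not produce a stream of Events.
    """
    if clear_at > raise_at:
        raise ValueError("clear_at must not exceed raise_at")
    raised = False
    found = []
    for index, value in enumerate(values):
        if not raised and value >= raise_at:
            raised = True
            found.append((index, "raised"))
        elif raised and value <= clear_at:
            raised = False
            found.append((index, "cleared"))
    return found


# Hypothetical utilization readings that hover around 90.
readings = [70, 91, 89, 92, 88, 79, 95]
result = threshold_events(readings, raise_at=90, clear_at=80)
```

With a single level at 90 these readings would cross the threshold five times; with the assumed clear level of 80 only three Events are determined.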
Occurrence Occurrence
^ ^
| |
|---------------------> State |---------------------> State
| |
| ------- Relevance | ------- Relevance
|------>| Count |-----------------------------> Value |------>| Count |-----------------------------> Value
| ------- | ^ | ------- | ^
| | | | | | | |
skipping to change at line 689 skipping to change at line 695
[RFC9417] Claise, B., Quilbeuf, J., Lopez, D., Voyer, D., and T. [RFC9417] Claise, B., Quilbeuf, J., Lopez, D., Voyer, D., and T.
Arumugam, "Service Assurance for Intent-Based Networking Arumugam, "Service Assurance for Intent-Based Networking
Architecture", RFC 9417, DOI 10.17487/RFC9417, July 2023, Architecture", RFC 9417, DOI 10.17487/RFC9417, July 2023,
<https://www.rfc-editor.org/info/rfc9417>. <https://www.rfc-editor.org/info/rfc9417>.
Acknowledgments Acknowledgments
The authors would like to thank Med Boucadair, Wanting Du, Joe The authors would like to thank Med Boucadair, Wanting Du, Joe
Clarke, Javier Antich, Benoit Claise, Christopher Janz, Sherif Clarke, Javier Antich, Benoit Claise, Christopher Janz, Sherif
Mostafa, Kristian Larsson, Dirk Hugo, Carsten Bormann, Hilarie Orman, Mostafa, Kristian Larsson, Dirk Von Hugo, Carsten Bormann, Hilarie
Stewart Bryant, Bo Wu, Paul Kyzivat, Jouni Korhonen, Reshad Rahman, Orman, Stewart Bryant, Bo Wu, Paul Kyzivat, Jouni Korhonen, Reshad
Rob Wilton, Mahesh Jethanandani, Tim Bray, Paul Aitken, and Deb Rahman, Rob Wilton, Mahesh Jethanandani, Tim Bray, Paul Aitken, and
Cooley for their helpful comments. Deb Cooley for their helpful comments.
Special thanks to the team that met at a side meeting at IETF 120 to Special thanks to the team that met at a side meeting at IETF 120 to
discuss some of the thorny issues: discuss some of the thorny issues:
* Benoit Claise * Benoit Claise
* Watson Ladd * Watson Ladd
* Brad Peters * Brad Peters
* Bo Wu * Bo Wu
* Georgios Karagiannis * Georgios Karagiannis
* Olga Havel * Olga Havel
skipping to change at line 740 skipping to change at line 746
Qin Wu Qin Wu
Huawei Huawei
101 Software Avenue, Yuhua District 101 Software Avenue, Yuhua District
Nanjing Nanjing
Jiangsu, 210012 Jiangsu, 210012
China China
Email: bill.wu@huawei.com Email: bill.wu@huawei.com
Chaode Yu Chaode Yu
Huawei Technologies Huawei
Email: yuchaode@huawei.com Email: yuchaode@huawei.com
 End of changes. 31 change blocks. 
90 lines changed or deleted 96 lines changed or added
