Security Operations Metrics: Introduction
An overview of my thoughts on Security Operations and Incident Response Processes and Metrics that are easily achievable in most organizations.
I’ve been thinking for a long time about how to integrate processes and measurements and how to interpret data from those measurements related to Security Operations and Incident Response. This is just the introduction of what I’m sure will be many discussions on the topic. I specifically don’t address many subjects for various reasons, mainly because I want to focus on the Operational Processes and Measurements that can help increase efficiency and effectiveness of your Monitoring and Response Teams.
Disclaimer: What this post is not: Overall Risk Values, Executive/Management Dashboards, Security-to-Business requirement alignment, Vulnerability Management, a way to reach FISMA or PCI compliance, or 100% complete. It is intended to provoke thought around a workable set of internal metrics that help organizations grow and refine their security operations and incident response capabilities, especially the Detection and Response aspects of those teams.
Why are metrics important: I see compromised systems on a daily basis, so I’d be willing to bet that most people would agree that a certain percentage of their organization’s systems are currently compromised. I’ll go further and say that many don’t have the time to address those machines, and since they aren’t affecting the business in a meaningful manner, the current risk level is treated as acceptable. If we changed that thought process and found a way to measure the % of data manipulated on those systems or across the enterprise, I’d wager that any number above 0 would raise enough concern to make the risk unacceptable. So which should we measure? Both if possible, but how likely is it that you can measure both? You must figure out the best way to reliably gather and process the information in order to tell the story.
For example, if I have no way to tell how much data has been manipulated, perhaps I need to start looking at that in a more holistic fashion across the enterprise; it seems to be a natural progression. I need better Collection and perhaps better analytical techniques to address this problem set’s requirements. So with Security Operations and Incident Response in mind, let’s take a quick peek at the integration of processes, the categories of metrics, what we might be able to measure, and what value we can gain from some of those measurements. In future blog posts and workshops we’ll facilitate discussion of these areas much more specifically.
Collection: There are some obvious measurements here that you can and should take from day 1 in your Security Operations Center.
1. Good Metric: Expected versus Actual Enterprise Visibility. This should be broken down to its organizational and technological components.
a. You should be able to quantify how many of “Event Source 1” exist and what “% of Event Source 1” you have log data for.
b. The next step is logical: what percentage of “Event Source 1” meets the standard logging levels defined by your organization, and what is the deviation from that standard?
c. What does this do? It helps us understand what we can’t see in our organization and, based on future incidents, may help us justify increasing our visibility via log management projects, etc.
2. Bad Metric Made Better: How effective is my SIEM in reducing the events I have to look at? This is the first metric I get asked to look at in nearly all cases. And while there is some value to it, I’m not sure it is the first piece of information I’d want to know in judging how stressed my SOC or IRT is.
a. The idea is this:
i. I have a total number of Event Sources feeding my Log Management Tool(s),
ii. From that I have a number of events being forwarded to my SIEM (Post Aggregation and Filtering),
iii. My SIEM has some number of base events and from there it generates “correlated events”,
iv. From the correlated events I might have a subset of “Events of Interest” that require analysis.
b. Generally speaking, the lower the number at the bottom, the happier the customer is with the “tuning” of the system. I agree to a certain extent that this is functional, but only when you add in other elements like the context of the problem set. In this metric it is critical to understand that the number of correlated events is irrelevant. Many systems generate correlation events that feed a number of more advanced correlation scenarios and in fact add to the overall efficiency of the system. The number of “Events of Interest” (EOI) is the key measurement of what your analysts have to look at on a daily basis. More to the point, the number of EOI per problem set is the measurement that matters. Only once we understand the EOI in terms of problem set can we begin tuning the system (reducing false positives, fixing event source log levels, tuning correlation timing and parameters, etc.) to make identification more efficient.
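To make the two Collection measurements concrete, here is a minimal Python sketch. It assumes you can pull simple counts from an asset inventory and a list of problem-set labels from your events of interest; the field names, counts, and problem-set labels are hypothetical placeholders, not output from any particular SIEM or log management product.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EventSource:
    """Inventory counts for one class of event source (hypothetical fields)."""
    name: str
    expected: int   # how many of this source exist in the enterprise
    logging: int    # how many we actually receive log data from
    compliant: int  # how many of those meet the org's logging standard


def visibility(src: EventSource) -> dict:
    """Expected-versus-actual visibility for one event source type."""
    coverage = 100.0 * src.logging / src.expected if src.expected else 0.0
    standard = 100.0 * src.compliant / src.logging if src.logging else 0.0
    return {
        "source": src.name,
        "coverage_pct": round(coverage, 1),
        "meets_standard_pct": round(standard, 1),
        "not_visible": src.expected - src.logging,
    }


def eoi_per_problem_set(eoi_labels: list[str]) -> Counter:
    """Count events of interest by the problem set each was raised for."""
    return Counter(eoi_labels)


# Hypothetical example values, for illustration only.
source_1 = EventSource("Event Source 1", expected=200, logging=150, compliant=120)
report = visibility(source_1)
eoi = eoi_per_problem_set(["malware", "malware", "data exfiltration", "malware"])
```

Here `report` says how much of “Event Source 1” is visible and how much of the visible portion meets the logging standard, while `eoi` gives the per-problem-set count that tuning should aim to reduce.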
Analysis:
So the next logical question that comes up is “now that I know what I’m collecting, how do I know how effective my analytical process is?” Again, I’m not sure it is the best possible question, but let’s see if I can present at least one functional example of ideas for analytical metrics.
This set of measurements is the one most commonly used by teams (in conjunction with # events and # incidents) to justify additional staffing. For example, you’d want to be able to show your executive team when you can’t possibly keep up with the data (even after tuning) for the problem sets you are trying to detect. They’ve agreed that those problem sets are critical, and they agree that the program adds value (not necessarily as ROI, but in terms of avoiding fines and, hopefully, meeting overall security expectations), so it is reasonable to suggest that additional staffing may be required at a certain point to deal with more problem sets and/or broader coverage.
I think that there are a couple of measurements here that can help.
1. Number of “events of interest” evaluated and number of Incidents escalated per problem set is an easy one.
2. It is a good idea to look at # EOI handled and # valid escalations per analyst, both as a positive competitive metric and as a training indicator.
3. Quality Assurance analysis done against a data set to find the # and % of missed escalations is important for training and for understanding whether increasing data volumes or problem sets are overwhelming analyst capacity.
4. The number of new problem sets created per month is another great indicator. If your team isn’t generating new problem sets to go after, they are probably too tied up in current volumes to proactively look forward.
There are plenty of other processes and metrics that can help. I am just using these as a starting point for discussion.
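As a rough sketch of how the per-analyst and QA measurements above might be tallied, assume a triage log that records, for each event of interest, who handled it, whether it was escalated, and whether the escalation turned out to be valid. The record layout, analyst names, and the choice to express missed escalations as a share of the events a QA reviewer says should have been escalated are all my assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical triage log: (analyst, problem_set, escalated, escalation_valid)
triage_log = [
    ("alice", "malware", True, True),
    ("alice", "malware", False, False),
    ("bob", "phishing", True, False),
    ("bob", "malware", True, True),
]


def per_analyst(log):
    """Tally EOI handled, escalations, and valid escalations per analyst."""
    stats = defaultdict(lambda: {"eoi_handled": 0, "escalated": 0, "valid": 0})
    for analyst, _problem_set, escalated, valid in log:
        row = stats[analyst]
        row["eoi_handled"] += 1
        row["escalated"] += int(escalated)
        row["valid"] += int(escalated and valid)
    return dict(stats)


def missed_escalations(qa_sample):
    """Return (# missed, % missed) from QA review pairs.

    qa_sample holds (analyst_escalated, reviewer_says_should_escalate) pairs.
    The percentage is taken over events the reviewer says should have been
    escalated (an assumed definition).
    """
    should = [escalated for escalated, required in qa_sample if required]
    if not should:
        return 0, 0.0
    missed = sum(1 for escalated in should if not escalated)
    return missed, round(100.0 * missed / len(should), 1)


analyst_stats = per_analyst(triage_log)
qa_result = missed_escalations(
    [(True, True), (False, True), (False, False), (False, True)]
)
```

Tracking `qa_result` over time alongside EOI volume is one way to see whether growing data volumes are starting to exceed analyst capacity.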
Escalation:
This is one of the more basic but overlooked aspects of operational security metrics. The core questions to ask are simple.
o Are we escalating the right things?
o Are we escalating with enough context?
o Are we escalating quickly enough?
o Are we escalating to the right people/teams?
This escalation can apply between tiers of security analysts, between analysts and Incident Response, and between IR and customers. The thoughts and responses need to be captured at each level. The basic idea is to gain perspective on whether we as the security team are providing our customers (internal and external) with the right level of information, to the right people, at the right time. If I’m failing in any of these areas, my entire capability suffers.
Response/Remediation:
Measuring timeliness seems to be the key to success in the Incident Response world. Some examples:
o Time to Incident Identification
o Time to Incident Analysis (Live Capture)
o Time to Initial Mitigation
o Time to Create Monitoring Techniques
o Time to Final Resolution
This at least provides us goals to measure against and quickly highlights where problems exist in the remainder of the model. For example, if I didn’t have adequate information to conduct the analysis, I know I need better visibility (in terms of log levels, event sources, etc.). If I can’t conduct live system forensics in a reasonable time period, I can justify the need for more robust forensic support in the organization. Measuring these attributes may also point out failures in the overall Incident Response Plan (incorrect POC or asset information), political hurdles, etc. This is the easiest way to show increasing competency in your organization as the overall timeframe shrinks from months or years down to hours or days.
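A minimal sketch of the timeliness measurements, assuming each incident record carries a timestamp for the first related event plus one timestamp per milestone reached. The key names and the choice to measure every interval from the first event are assumptions for illustration; your incident tracking system may define the starting point differently.

```python
from datetime import datetime, timedelta

# Milestones from the list above, as hypothetical record keys.
MILESTONES = [
    "identified",          # Time to Incident Identification
    "analyzed",            # Time to Incident Analysis (Live Capture)
    "mitigated",           # Time to Initial Mitigation
    "monitoring_created",  # Time to Create Monitoring Techniques
    "resolved",            # Time to Final Resolution
]


def time_to_milestones(incident: dict) -> dict:
    """Elapsed time from the first event to each milestone that was recorded."""
    start = incident["first_event"]
    return {m: incident[m] - start for m in MILESTONES if m in incident}


# Hypothetical incident; "monitoring_created" has not been reached yet.
incident = {
    "first_event": datetime(2024, 1, 1, 8, 0),
    "identified": datetime(2024, 1, 1, 10, 30),
    "analyzed": datetime(2024, 1, 1, 14, 0),
    "mitigated": datetime(2024, 1, 2, 8, 0),
    "resolved": datetime(2024, 1, 5, 8, 0),
}

durations = time_to_milestones(incident)
```

Comparing these durations across incidents, quarter over quarter, is one simple way to show the overall timeframe shrinking.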
Quick Tips:
Tip #1 Organize ahead of time: My favorite statements here are from Richard Bejtlich on his blog (TaoSecurity). One is about how to “Figure out how to play and score the game before you pretend to think you can improve the score”. The other is Richard’s analogy about measuring the score of the game instead of the height, weight and passing yards of the players. I’ve begun organizing my SOC and IR teams by focusing the processes and measurements into distinct areas for Collection, Analysis, Escalation and Remediation.
Tip #2 Figure out what matters in your organization/industry: The concept of measuring the % of systems compromised across your organization seems to be gaining some level of support (I agree it is useful). It is easy, reproducible and has obvious value. You can quantify the known compromises/infections versus the overall number fairly easily and then ask the question: what percentage is acceptable to our company? You’ll find that the number varies based on who you ask (0%, 1%, 10%+), the type of system, etc.
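The arithmetic behind the % compromised measurement is simple; the sketch below assumes you track known compromises and total counts per system type, plus an acceptable percentage agreed with the business. All system types, counts, and thresholds here are hypothetical.

```python
def compromised_pct(known_compromised: int, total_systems: int) -> float:
    """Percentage of systems with a known compromise or infection."""
    if total_systems <= 0:
        raise ValueError("total_systems must be positive")
    return 100.0 * known_compromised / total_systems


# Hypothetical inventory: type -> (known compromised, total, acceptable %)
inventory = {
    "workstations": (12, 1000, 1.0),
    "servers": (1, 200, 0.0),
    "kiosks": (0, 50, 5.0),
}

# System types whose compromise rate exceeds what was agreed as acceptable.
over_threshold = {
    kind: round(compromised_pct(bad, total), 2)
    for kind, (bad, total, acceptable) in inventory.items()
    if compromised_pct(bad, total) > acceptable
}
```

Keeping the acceptable percentage per system type reflects the point that the answer changes depending on who you ask and what kind of system is involved.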
Summary:
The processes and measurements presented in this post organize security operations and incident response into Collection, Analysis, Escalation and Remediation as key functions that build upon one another. The measurements help enable better functionality across the board, and this is only a very high-level overview. The more conversations I have about this, the more I learn how others are attacking this problem and have actually implemented these concepts in their operations. The goal was to provide a framework you can actually implement and use today and that will help you consistently over the next few years. I hope this triggers thought and further discussion. Thanks for your time! - Rocky