Planning for Failure

Published: 2013-11-23
Last Updated: 2013-11-23 02:23:30 UTC
by Kevin Liston (Version: 1)
4 comment(s)

I have been witness to network and system security failure for nearly two decades.  While the players change and the tools and methods continue to evolve, it's usually the same story over and over: eventually errors add up and combine to create a situation that someone finds and exploits.  The root-cause analysis, or post mortem, or whatever you call it in your environment, consists of a constellation-of-errors, or kill-chain, or what-have-you.  It's up to you to develop an environment that both provides the services that your business requires and has enough complementary layers of defenses to make incidents either a rare occurrence or a non-event.

Working against you are not only a seemingly endless army of humans and automata, but also the following truths, or as I call them, "The Three Axioms of Computer Security":

  1. There will always be a new vulnerability.
  2. AV will always be out of date.
  3. Users will always click on a link.

How do you create a network that can survive under these conditions?  Plan on it happening.

In your design, account for the inevitable failure of other tools and layers.  Work from the outside in: Firewall, WAF, Webserver, Database server.  Work from the bottom up: Hardware, OS, Security Tools, Application.

I'm imagining the typical DMZ layout for this hypothetical design: External Firewall, External services (DNS, Email, Web), Internal Firewall, Internal services (back office, file/print share, etc.).  This is your typical "defense in depth" layout that incorporates the assume-failure philosophy a bit.  It attempts to isolate the external servers, which the model assumes are more likely to be compromised, from the internal servers.  This isn't the correct assumption for modern exploit scenarios, so we'll fix that as we go through this exercise.

Nowadays, the External Firewall basically fails right off the bat in most exploit scenarios.  It has to let in DNS, SMTP, and HTTP/HTTPS.  So if it's not actively blocking source IPs in response to other triggers in your environment, its only value-add in protecting these services comes from acting as a separate, corroborating log source.  It is still completely necessary, because without it you open your network up to direct attack on services that you might not be aware that you're exposing.  So, firewalls are still a requirement, but keep in mind that they don't cover attacks coming in on your exposed applications.  The firewall creates a choke point on your network that you can exploit for monitoring and enforcement (it's a good spot for your IDS, which can inform the firewall to block known malicious traffic).  This is all good strategy, but it focuses only on the threat from the outside.

However, if you plan your firewall strategy with the assumption that other layers will eventually fail, you'll configure the firewall so that outbound traffic is also strongly limited to just the protocols that it needs.  Does your webserver really need to send out email to the internet?  Does it need to surf the internet?  While you may not pay much attention to the various incoming requests that are dropped by the firewall, you need to pay critical attention to any outbound traffic that is being dropped.
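As a rough illustration of paying attention to dropped outbound traffic, here is a minimal sketch that counts outbound drops per source host and destination port.  The key=value log format, the field names (`action`, `dir`, `src`, `dst`, `dport`) and the sample addresses are all assumptions made up for this example; a real firewall's log format will differ and the pattern would need adjusting.

```python
import re
from collections import Counter

# Hypothetical key=value log format; adjust the pattern to your firewall.
LINE_RE = re.compile(
    r"action=(?P<action>\w+)\s+dir=(?P<dir>\w+)\s+"
    r"src=(?P<src>\S+)\s+dst=(?P<dst>\S+)\s+dport=(?P<dport>\d+)"
)


def outbound_drops(lines):
    """Count dropped outbound connections per (source host, destination port)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m and m["action"] == "DROP" and m["dir"] == "out":
            counts[(m["src"], int(m["dport"]))] += 1
    return counts


# Made-up sample lines using documentation address ranges.
SAMPLE_LOG = [
    "2013-11-23T02:00:01Z action=DROP dir=in src=198.51.100.7 dst=192.0.2.10 dport=23",
    "2013-11-23T02:00:05Z action=DROP dir=out src=192.0.2.10 dst=203.0.113.9 dport=25",
    "2013-11-23T02:00:09Z action=DROP dir=out src=192.0.2.10 dst=203.0.113.9 dport=25",
    "2013-11-23T02:00:12Z action=ACCEPT dir=out src=192.0.2.10 dst=203.0.113.53 dport=53",
]

if __name__ == "__main__":
    for (src, dport), n in outbound_drops(SAMPLE_LOG).most_common():
        print(f"{src} tried to reach port {dport} outbound {n} time(s)")
```

In the sample, the web server at 192.0.2.10 trying to send mail on port 25 is exactly the kind of dropped outbound traffic worth an alert, while the inbound telnet probe is background noise.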

The limitation of standard firewalls is that they don't perform deep packet inspection and are limited to the realm of IP addresses and ports for most of their decision making.  Enter the Application Firewall, or Web Application Firewall.  I'll admit that I have very little experience with this technology; however, I do know how not to use it in your environment.  PCI requirement 6.6 states that you should secure your web applications either through Application Code Reviews or Application Firewalls.  It should be both, not either/or, because you have to assume that the WAF will fail or that code review will fail.

The next device is the application (DNS, email, web, etc.) server.  If you're assuming that the firewall and WAF will fail, where should you place the server?  It should be placed in the external DMZ, not inside your internal network "because it's got a WAF in front of it."  We'll also take an orthogonal turn in our tour, starting from the bottom of the stack and working our way up.  IT managers generally understand hardware failure better than security failures: multiple power supplies and routes, storage redundancy, backups, DR plans, etc.  This is the "Availability" in the CIA triad.

Running on your metal is the OS.  Failure here is also mostly of the "Availability" variety, but security/vulnerability patching starts to come into play at this layer.  Everybody patches, but does everybody confirm that the patches deployed?  This is where a good internal vulnerability/configuration scanning regimen becomes necessary.
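One way to picture "confirm that the patches deployed" is to reconcile what the patch tool claims against what an independent scan finds.  The sketch below assumes you can export both as simple Python structures; the function name, the host names and the patch IDs are hypothetical, and it is an illustration rather than a description of any particular product.

```python
def patch_gaps(reported_patched, scan_findings):
    """Compare the patch tool's claims with an independent scan.

    reported_patched: set of hostnames the patch tool says are up to date.
    scan_findings: dict of hostname -> list of missing patch IDs seen by the
                   scanner (an empty list means the scan came back clean).
    Returns three sets: hosts reported patched but still missing patches,
    hosts reported patched that the scanner never saw, and hosts the scanner
    saw that the patch tool did not report as patched.
    """
    still_vulnerable = {h for h in reported_patched if scan_findings.get(h)}
    never_scanned = {h for h in reported_patched if h not in scan_findings}
    not_reported = set(scan_findings) - set(reported_patched)
    return still_vulnerable, never_scanned, not_reported


if __name__ == "__main__":
    # Made-up data for illustration only.
    reported = {"web01", "web02", "db01"}
    scanned = {"web01": [], "web02": ["PATCH-A"], "legacy07": ["PATCH-A", "PATCH-B"]}
    vulnerable, unscanned, unreported = patch_gaps(reported, scanned)
    print("reported patched but still vulnerable:", sorted(vulnerable))
    print("reported patched but never scanned:", sorted(unscanned))
    print("seen on the network but not reported as patched:", sorted(unreported))
```

The third set tends to be the interesting one: systems that exist on the network but have dropped out of the tool's view.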

As part of your standard build, you'll have a number of security and management applications that run on top of the OS: your HIDS, AV and inventory management agents.  Plan on these agents failing to check in: what is your plan to detect when an AV agent fails to check in or update properly?  Is your inventory management agent adequately patched and secured?  While you're considering security failure, consider pre-deploying incident-response tools on your servers and workstations, which will speed response time when things eventually go wrong.
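Detecting a silent agent can be as simple as comparing last check-in times against a threshold, and comparing the agent console's host list against an inventory that comes from somewhere else.  This is a minimal sketch under those assumptions: the 24-hour threshold, the function names and the host names are invented for the example, and the independent inventory (DHCP leases, switch tables, a network sweep) is left for you to supply.

```python
from datetime import datetime, timedelta, timezone


def stale_agents(last_checkin, now, max_age=timedelta(hours=24)):
    """Return hosts whose agent has not checked in within max_age."""
    return sorted(h for h, seen in last_checkin.items() if now - seen > max_age)


def missing_agents(inventory, last_checkin):
    """Return hosts in an independent inventory that the console has no record of."""
    return sorted(set(inventory) - set(last_checkin))


if __name__ == "__main__":
    now = datetime(2013, 11, 23, 2, 0, tzinfo=timezone.utc)
    # Made-up check-in data as it might be exported from an agent console.
    checkins = {
        "web01": now - timedelta(hours=1),
        "db01": now - timedelta(days=3),
    }
    # Inventory from a source other than the agent console.
    inventory = ["web01", "db01", "hr-laptop7"]
    print("stale:", stale_agents(checkins, now))
    print("no agent record at all:", missing_agents(inventory, checkins))
```

The second check matters because a console that quietly expires unreachable hosts will never list them as stale.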

Next is the actual application.  How will you mitigate the amount of damage the application process can do when it is usurped by an attacker?  A chroot jail is often suggested, but is there any value to jailing the account running BIND if the server's sole purpose is DNS?  Consider instead questions like: does httpd really need write access to webroot?  Controls like selinux or tripwire come into play here.  They're painful, but can mean the difference between a near-miss and the need to disclose a data compromise.
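To answer the "does httpd really need write access to webroot?" question for an existing server, you can walk the tree and list what the service account could modify.  The sketch below only looks at the classic owner/group/other mode bits; it deliberately ignores ACLs, SELinux policy and root's override, so treat its output as a starting point.  The function names are made up for this example.

```python
import os
import stat


def is_writable_by(st, uid, gids):
    """Apply owner/group/other write bits from one stat result."""
    if st.st_uid == uid:
        return bool(st.st_mode & stat.S_IWUSR)
    if st.st_gid in gids:
        return bool(st.st_mode & stat.S_IWGRP)
    return bool(st.st_mode & stat.S_IWOTH)


def writable_paths(root, uid, gids=frozenset()):
    """Return sorted paths under root that the given uid/gids could modify."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Check each directory as it is visited, plus the files inside it.
        candidates = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in candidates:
            st = os.lstat(path)
            if stat.S_ISLNK(st.st_mode):
                continue  # mode bits on symlinks are not meaningful here
            if is_writable_by(st, uid, gids):
                hits.append(path)
    return sorted(hits)
```

A call might look like `writable_paths("/var/www/html", pwd.getpwnam("apache").pw_uid)`, where both the path and the account name are assumptions about your system.  Ideally the result is an empty list, or only the upload directories you knowingly allow.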

Assuming that the application server will be compromised also raises the need to send logs off of the server.  This can help reduce the load on the server since network traffic is cheaper than disk writes.  Having logs that are collected and timestamped centrally is a boon to any future investigation.  It also allows better monitoring, and you can more easily leverage any indicators found from an investigation throughout your entire environment.
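For an application you control, shipping log records off the box can be a few lines.  The sketch below uses Python's standard `logging.handlers.SysLogHandler`, which sends over UDP by default.  The logger name and message are invented, and the "collector" here is just a throwaway UDP socket on localhost so the example is self-contained; in practice it would be your central syslog server.

```python
import logging
import logging.handlers
import socket


def make_syslog_handler(host, port):
    """Build a handler that forwards records to a remote syslog collector (UDP)."""
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    return handler


def demo_remote_log():
    """Send one record to a stand-in collector on localhost and return what arrived."""
    collector = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    collector.bind(("127.0.0.1", 0))  # let the OS pick a free port
    collector.settimeout(2.0)
    host, port = collector.getsockname()

    logger = logging.getLogger("webapp")  # hypothetical application name
    logger.setLevel(logging.INFO)
    handler = make_syslog_handler(host, port)
    logger.addHandler(handler)
    try:
        logger.warning("login failed for user %s", "alice")
        data, _addr = collector.recvfrom(4096)
    finally:
        logger.removeHandler(handler)
        handler.close()
        collector.close()
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(repr(demo_remote_log()))
```

UDP syslog is fire-and-forget, so an attacker on the box cannot retract what has already left, but lost datagrams are also silently lost.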

Now, switching directions again and heading to the next layer, the internal firewall.  These days the internal firewall separates two hostile networks from one another.  The rules have to enforce a policy that protects the servers from the internal systems, as well as the internal systems from the servers, since either is just as likely to be compromised these days (some may argue that internal workstations, etc. are more likely to fall).

Somewhere you're going to have a database.  It will likely contain stuff that someone will want to steal or modify.  It is critical to limit the privileges on the accounts that interact with it.  Does PHP really need read access to the password or credit card column?  Or can it get by with just being able to write?  It can be painful to work through all of the cases to get the requirements correct, but you'll be glad you did when you watch the "SELECT *" requests in your Apache access logs and you know they failed.
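On a real database server you would express this with the account privileges the server provides.  As a stand-in that runs anywhere, the sketch below uses SQLite's authorizer hook (exposed in Python as `Connection.set_authorizer`) to show the idea of an application connection that can write a sensitive column but cannot read it back.  The table, the column names and the dummy value are made up for illustration, and this is not a substitute for server-side privileges.

```python
import sqlite3

# Hypothetical sensitive (table, column) pairs for this example.
SENSITIVE = {("customers", "card_number")}


def write_only_authorizer(action, arg1, arg2, db_name, source):
    """Deny reads of sensitive columns; allow everything else."""
    if action == sqlite3.SQLITE_READ and (arg1, arg2) in SENSITIVE:
        return sqlite3.SQLITE_DENY
    return sqlite3.SQLITE_OK


def demo_write_only():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT, card_number TEXT)")
    conn.set_authorizer(write_only_authorizer)

    # Writing the sensitive column is still allowed.
    conn.execute(
        "INSERT INTO customers (name, card_number) VALUES (?, ?)",
        ("alice", "0000-TEST-ONLY"),
    )
    # Reading a non-sensitive column is allowed.
    names = [row[0] for row in conn.execute("SELECT name FROM customers")]
    # SELECT * touches card_number, so the statement is refused.
    try:
        conn.execute("SELECT * FROM customers").fetchall()
        select_star_blocked = False
    except sqlite3.DatabaseError:
        select_star_blocked = True
    conn.close()
    return names, select_star_blocked


if __name__ == "__main__":
    print(demo_write_only())
```

The point of the exercise is the last step: the injected "SELECT *" fails because the connection never had the right to read that column in the first place.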

Workstations and other devices require a similar treatment. How much access do you need to do your job?  Do you need data on the system, or can you interact with remote servers?  Can you get away with application white-listing in your environment?  I'm purposefully vague here because there are so many variables; however, you have to ask how failures will impact your solution, and what you can do to limit the impact of those failures.

Despite all of these measures, eventually everything will fail.  Don't feel glum; that's job security.

Learn from failure: instrument everything, and log everything.  Disk is cheap.  Netflow from your routers, logs from your firewalls, alerts from your IDS and AV, and syslogs from your servers will give you a lot of clues when you have to figure out what went wrong.  I strongly suggest full packet capture in front of your key servers.  Use your WAF or load balancer to terminate the SSL session so you can inspect what's going into your web servers.  Put monitors in front of your database servers too.  Store as much as you can: use a circular queue and keep 2 weeks, or even 2 days, and have an easy way to freeze the files when you think you have an incident.  It's a lot easier to create a system that can store a lot of files than it is to create a system that can recover packets from the past.
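The circular-queue-with-freeze idea can be reduced to one decision: which capture files are old enough to delete, excluding anything that has been frozen for an investigation.  This is a minimal sketch of that decision only; the file names, the 14-day default and the way "frozen" files are tracked are assumptions for the example, and actually deleting files is left out on purpose.

```python
from datetime import datetime, timedelta


def files_to_expire(captures, now, retention=timedelta(days=14), frozen=frozenset()):
    """Return capture file names past retention and not frozen, oldest first.

    captures: dict of file name -> time the capture file was closed.
    frozen: names that must be kept because they belong to an open incident.
    """
    expired = [
        (closed_at, name)
        for name, closed_at in captures.items()
        if now - closed_at > retention and name not in frozen
    ]
    return [name for _closed_at, name in sorted(expired)]


if __name__ == "__main__":
    now = datetime(2013, 11, 23, 2, 0)
    # Made-up capture files in front of a web server.
    captures = {
        "web-20131101.pcap": datetime(2013, 11, 2, 0, 0),
        "web-20131105.pcap": datetime(2013, 11, 6, 0, 0),
        "web-20131120.pcap": datetime(2013, 11, 21, 0, 0),
    }
    # Freeze one file because it may relate to an incident.
    print(files_to_expire(captures, now, frozen={"web-20131105.pcap"}))
```

Keeping the freeze list separate from the rotation logic means the "easy way to freeze" can be as simple as adding a name to a set or dropping a marker file.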

I know we sound negative all the time, pointing out why things won't work.  That shouldn't be taken as an argument to not do something (e.g. AV won't protect you from X, so don't bother); instead, it should be a reminder that you need to plan for what you do when it fails.  Consider the external firewall: it doesn't block most of the attacks that occur these days, but you wouldn't want to not have one.  Planning for failure is a valuable habit.

4 comment(s)


It's stressed repeatedly to verify that patches apply, but I don't hear nearly enough about verifying your agents are working. If you have tripwire, device control, application control, etc., you need to also verify that these agents aren't just installed but are actually working as planned. Spot testing with unauthorized USB devices, attempting to run an unauthorized executable, etc., can yield surprising results. Much like "trust, but verify," we need to remember to "install, but verify" with our security agents.

Great post!!! Giving me lots to think about. Thanks

I've found the following problems over and over again with companies:

Their design assumes nothing ever fails. The AV will always catch malware, the firewall will never be misconfigured, no one ever clicks a bad link or brings in an infected system. I try to educate them about layered defense, about being able to handle at least one major failure without compromise. Their defense is always "your theoretical threat against my real expense loses", or "the site needs to be up on time and you're going to delay us", or "We are too small for anyone to attack".

Install a magic box and never look at it again. No configuration tuning, no looking at logs for warning signs. No hiring someone with expertise, or training an in-house person to do anything other than verify that it's turned on and plugged in to meet the compliance checklist (yes we are running XXX).

Trust but verify. A lot of people rely on built-in functions of applications for reporting. One place I know had a patch management system report that 98% of the systems were patched and up to date. What they did not say was that a system that is unreachable for 30 days gets automatically removed from the inventory. Unreachable means the agent installed did not communicate with the server. I did an independent survey of the network: almost THIRTY PERCENT OF THE SYSTEMS WERE NOT GETTING PATCHES. So if the agent stops working, patches stop getting applied, and the system disappears from the report. Worse, a person goes off network for an extended job. Server expires agent because of lack of contact. Person comes back, agent tries to update, but can't because it's no longer registered. Does anyone get notified? Of course not. But it is in the logs, that nobody reads. Their AV had a similar problem, but I don't recall the stats.

That is a pet peeve of mine: so many installations are done "fire and forget". I would add another clause: where you cannot visit every instance (i.e., desktop deployments), use something other than the tool to verify. One tool I know gave a 98% compliance with updates. What it didn't tell me were the hundreds of systems that fell off the monitor because the agent was no longer phoning home.
