sg-tightener: evidence-based AWS security group tightening

Tools in the suite

Five scripts, one purpose

Each tool covers a single responsibility. They share an approved-IPs file as the source of truth, so the same observed evidence flows through analyse, plan, apply, diagnose, and revert. Break-glass extend and rule compaction sit alongside that loop for live incidents.

sg_tightener.py

analyse · plan · apply · revert

Reads VPC flow logs (CloudWatch Logs or S3), builds a plan, applies it atomically per group, and halts loudly with a revert command if anything fails partway through.

sg_diagnose.py

post-deploy · rejects · merge

Surfaces private source IPs being REJECTED and not covered by any current rule. Lets you merge them into the approved list and re-apply in one loop.

sg_ou_report.py

org-wide · SG + NACL · risk score

Walks the entire Organisation, assumes a cross-account role per account, scans every region in parallel, and ranks accounts by severity-weighted risk score. Exits non-zero on any CRITICAL finding, so it can serve as a pipeline gate.

sg_extend.py

break-glass · auto-discover · AWS-service aware

For DR failovers and live incidents. Reads the last 24h of flow logs, finds the source IPs being REJECTED, and — when --groups is omitted — looks up each REJECTed destination ENI to derive which SGs need the rule. Lambda / Route53-healthcheck traffic collapses into the AWS service prefix instead of /32s. Strictly additive; manifest is folded back into the evidence base on the next cycle.

sg_compact.py

plan · apply · revert · 60-rule limit

Reclaims rule budget when a group nears the 60-rule cap by widening existing RFC 1918 CIDRs into fewer blocks. You pick a compaction ratio; plan mode ranks the busiest groups and sweeps ratios so you can see the trade-off before applying. Coverage is always preserved.

How it runs

Four commands, with safety nets

The flow is intentionally boring: read evidence, propose changes, apply with halt-on-error, diagnose anything that breaks. Every step writes a JSON artefact that the next step reads, with no hidden state.

STEP 01

Analyse

Read 90+ days of flow logs from CloudWatch Logs or S3. Write a sorted, deduplicated list of accepted source IPs to approved.json.

STEP 02

Plan

Collapse observed IPs into the smallest CIDR set within each security group's rule budget. Emit a signed plan.json showing every revoke and authorise.

STEP 03

Apply

Revoke broad rules and authorise tight replacements per group. If any single group fails, halt immediately and print the revert command. No partial silent state.

STEP 04

Diagnose

If anything legitimate gets caught, sg_diagnose.py scans REJECT entries, surfaces uncovered private sources, and merges them back into approved.json.
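As a rough illustration of the Analyse step, the sketch below collects source addresses from ACCEPT records in the default (version 2) VPC flow log format, then deduplicates and sorts them. The function name `accepted_sources`, the sample records, and the choice to print a bare JSON list are assumptions made for this example; the actual layout of approved.json is not described here.

```python
import ipaddress
import json

SRCADDR_FIELD = 3   # position of "srcaddr" in the default version 2 record
ACTION_FIELD = 12   # position of "action" in the default version 2 record

def accepted_sources(lines):
    """Return the sorted, deduplicated source IPs seen on ACCEPT records."""
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 14 or fields[ACTION_FIELD] != "ACCEPT":
            continue  # REJECT records and NODATA/SKIPDATA rows are ignored
        try:
            seen.add(ipaddress.ip_address(fields[SRCADDR_FIELD]))
        except ValueError:
            continue  # malformed or placeholder address
    # Sort numerically (IPv4 before IPv6) rather than as strings.
    ordered = sorted(seen, key=lambda ip: (ip.version, int(ip)))
    return [str(ip) for ip in ordered]

# Made-up sample records, for illustration only.
records = [
    "2 123456789012 eni-0a1b2c3d 10.0.1.20 10.0.9.5 49152 443 6 10 840 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.3 10.0.9.5 49153 443 6 10 840 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.20 10.0.9.5 49154 443 6 10 840 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.7.7 10.0.9.5 49155 22 6 1 40 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d - - - - - - - 1700000000 1700000060 - NODATA",
]

approved = accepted_sources(records)
print(json.dumps(approved, indent=2))
```

Sorting on the integer value keeps 10.0.1.3 ahead of 10.0.1.20, which a plain string sort would not.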

Standard workflow

# 1. Read 90 days of flow logs
python sg_tightener.py analyse \
  --region us-east-1 \
  --log-group /aws/vpc/flowlogs \
  --days 90 \
  --out approved.json

# 2. Build a plan (no AWS writes)
python sg_tightener.py plan \
  --region us-east-1 \
  --approved approved.json \
  --max-rules 60 \
  --out plan.json

# 3. Review plan.json, then apply
python sg_tightener.py apply --plan plan.json

# 4. If something legitimate is now being blocked
python sg_diagnose.py --region us-east-1 \
  --log-group /aws/vpc/flowlogs --hours 24

# Worst case: full revert from the manifest written by apply
python sg_tightener.py revert --manifest manifest-20260528T120000Z.json

Incident-time: break-glass extend, then compact

# Outage: at 2am you know things are broken — but not which SGs, which CIDRs,
# or which ports. Omit --groups and sg_extend discovers everything itself
# from the last 24h of REJECT flow logs (destination ENI -> attached SGs).
python sg_extend.py \
  --region us-east-1 \
  --log-group /aws/vpc/flowlogs

# When you do know which SGs, scope it explicitly. --include-public also
# turns on AWS-service summarisation: Lambda Hyperplane / R53 health-check
# sources collapse into the published service prefix, not /32 host routes.
python sg_extend.py \
  --region us-east-1 \
  --groups sg-aaaa,sg-bbbb \
  --log-group /aws/vpc/flowlogs \
  --hours 24 \
  --tolerance 0.5 \
  --ports 443,5432 \
  --include-public

# Afterwards a group may be near the 60-rule cap. See where the rules are and
# what each compaction ratio would reclaim (no AWS writes):
python sg_compact.py plan --region us-east-1

# Pick a ratio, write a plan, review it, then apply:
python sg_compact.py plan --region us-east-1 --ratio 0.5 --out plan.json
python sg_compact.py apply --plan plan.json

# Reverts via the same manifest machinery as sg_tightener.
python sg_compact.py revert --manifest sg_compact-manifest-20260528T120000Z.json

The CIDR-collapsing algorithm

Three layers: never more than 60 rules per group

AWS caps a security group at 60 inbound rules by default, and raising that cap means requesting a quota increase. If 200 IPs have connected, you can't write 200 /32s, and a /16 reintroduces the permissiveness you're trying to remove. The algorithm finds the middle ground.

Layer 1 · widest block per IP

For each observed IP, walk outward to the widest containing prefix where the gap fraction (addresses in the block that were never observed) stays within the configured tolerance (default 30%).

Densely-populated subnets collapse aggressively. Sparse outliers stay as /32 host routes. Nothing widens past the IP's RFC 1918 home block.
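A minimal sketch of layer 1 follows, using only the standard `ipaddress` module. The names `gap_fraction` and `widest_block` are hypothetical, and one reading of "walk outward" is assumed: every prefix up to the home block is tried and the widest one that passes is kept, even if a narrower prefix failed on the way.

```python
import ipaddress

RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def gap_fraction(block, observed_ints):
    """Fraction of addresses in `block` that were never observed."""
    first = int(block.network_address)
    last = int(block.broadcast_address)
    hits = sum(1 for ip in observed_ints if first <= ip <= last)
    return 1 - hits / block.num_addresses

def widest_block(ip, observed_ints, tolerance=0.30):
    """Widest prefix containing `ip` whose gap fraction is within tolerance."""
    addr = ipaddress.ip_address(ip)
    best = ipaddress.ip_network(f"{ip}/32")
    home = next((n for n in RFC1918 if addr in n), None)
    if home is None:
        return best  # not a private address: leave it as a host route
    # Walk outward from /31, never past the RFC 1918 home block.
    for prefix in range(31, home.prefixlen - 1, -1):
        candidate = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        if gap_fraction(candidate, observed_ints) <= tolerance:
            best = candidate
    return best

# 200 hosts in one /24 plus a single sparse outlier (made-up data).
observed = [f"10.0.1.{i}" for i in range(200)] + ["10.0.9.77"]
observed_ints = {int(ipaddress.ip_address(ip)) for ip in observed}

blocks = sorted({widest_block(ip, observed_ints) for ip in observed})
print([str(b) for b in blocks])
```

With this data the dense subnet collapses to 10.0.1.0/24 (56 of 256 addresses unobserved, about 22%), while the outlier stays a /32 because every wider block around it is almost entirely unobserved.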

Layer 2 · tolerance widening

If layer 1 produces more rules than the group's budget, widen the tolerance in 5% steps up to 95%, recomputing each time. Every step is logged so the operator can see exactly what trade-off was made.
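The widening loop might look like the sketch below. It is illustrative only: `collapse` is a compact stand-in for layer 1 that handles a single home block (10.0.0.0/8), and `fit_budget` and its parameter names are assumptions. Tolerances are stepped as integer percentages to avoid floating-point drift.

```python
import ipaddress

HOME = ipaddress.ip_network("10.0.0.0/8")  # this sketch handles one home block

def collapse(observed, tolerance):
    """Stand-in for layer 1: widest passing block per IP, then deduplicate."""
    ints = [int(ipaddress.ip_address(ip)) for ip in observed]
    blocks = set()
    for ip in observed:
        best = ipaddress.ip_network(f"{ip}/32")
        for prefix in range(31, HOME.prefixlen - 1, -1):
            cand = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
            lo, hi = int(cand.network_address), int(cand.broadcast_address)
            hits = sum(1 for i in ints if lo <= i <= hi)
            if 1 - hits / cand.num_addresses <= tolerance:
                best = cand
        blocks.add(best)
    # collapse_addresses drops blocks already covered by a wider one.
    return sorted(ipaddress.collapse_addresses(blocks))

def fit_budget(observed, budget, start_pct=30, step_pct=5, max_pct=95):
    """Widen the tolerance in fixed steps until the rule count fits."""
    log = []
    for pct in range(start_pct, max_pct + 1, step_pct):
        blocks = collapse(observed, pct / 100)
        log.append((pct, len(blocks)))  # record every trade-off for the operator
        if len(blocks) <= budget:
            return blocks, log
    return None, log  # still over budget at max_pct: layer 3 takes over

observed = ["10.0.0.0", "10.0.0.1", "10.0.0.2", "10.0.0.4", "10.0.0.5"]
blocks, log = fit_budget(observed, budget=1)
print([str(b) for b in blocks], log)
```

Here five of the eight addresses in 10.0.0.0/29 were observed, a gap of 37.5%, so the group needs two rules at 30% and 35% and fits in one rule once the tolerance reaches 40%.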

Layer 3 · force-fit merge

If 95% tolerance is still over budget, merge the closest pair of blocks whose union introduces the smallest amount of new untrusted space. Merges never cross an RFC 1918 boundary, so a 10/8 block is never fused with a 172.16/12 block.

Force-fit prints a loud warning recommending an AWS Support quota increase.
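One way to picture the force-fit merge is the greedy sketch below. It assumes the input blocks are disjoint, measures "new untrusted space" as the addresses the merged CIDR adds beyond the two blocks it replaces, and uses hypothetical helper names (`home_block`, `span`, `force_fit`); the real tie-breaking and cost function may differ.

```python
import ipaddress
from itertools import combinations

RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def home_block(net):
    """RFC 1918 block that contains `net`, or None."""
    return next((h for h in RFC1918 if net.subnet_of(h)), None)

def span(a, b):
    """Smallest single CIDR that contains both a and b."""
    merged = a if a.prefixlen <= b.prefixlen else b
    while not (a.subnet_of(merged) and b.subnet_of(merged)):
        merged = merged.supernet()
    return merged

def force_fit(blocks, budget):
    """Greedily merge the cheapest same-home pair until the budget is met."""
    blocks = list(blocks)
    while len(blocks) > budget:
        best = None
        for a, b in combinations(blocks, 2):
            home = home_block(a)
            if home is None or home != home_block(b):
                continue  # never merge across an RFC 1918 boundary
            merged = span(a, b)
            added = merged.num_addresses - a.num_addresses - b.num_addresses
            if best is None or added < best[0]:
                best = (added, a, b, merged)
        if best is None:
            break  # nothing mergeable is left; caller must handle the overflow
        merged = best[3]
        blocks = [x for x in blocks if not x.subnet_of(merged)] + [merged]
    return sorted(blocks)

blocks = [ipaddress.ip_network(n) for n in
          ("10.0.0.0/24", "10.0.1.0/24", "10.0.8.0/24", "172.16.0.0/24")]
print([str(b) for b in force_fit(blocks, budget=3)])
print([str(b) for b in force_fit(blocks, budget=1)])
```

With a budget of 3 the two adjacent /24s merge into 10.0.0.0/23 at zero cost. With a budget of 1 the 10/8 blocks end up as 10.0.0.0/20, but the 172.16/12 block is never fused with them, so two rules remain.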

Per-group rule budget

The budget for replacement rules is computed from the current state of each group, not the global limit. If a group has 25 rules being left alone (SG references, public 0.0.0.0/0, already-tight CIDRs), the budget for replacements is 60 − 25 − (broad rules removed).

Prevents the failure mode where apply succeeds in revoking but fails in authorising because the destination can't hold the new rules.
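Taking the formula above literally, the arithmetic can be written as a small helper. The function and parameter names are hypothetical, and clamping at zero is an assumption added for the example.

```python
def replacement_budget(kept_rules, broad_rules_removed, limit=60):
    """Budget for replacement rules: limit - kept - broad rules removed."""
    budget = limit - kept_rules - broad_rules_removed
    return max(budget, 0)  # assumption: a negative budget is treated as zero

# 25 rules left alone, 5 broad rules being replaced.
print(replacement_budget(kept_rules=25, broad_rules_removed=5))
```

Read this way, the result is the number of slots that are free in the group before anything is revoked, assuming kept and broad rules together make up everything the group currently holds.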

Scope

What the tool will and will not touch

Eligibility uses strict subset semantics (not overlap) so overlapping non-private ranges like 192.0.0.0/4 are correctly excluded. Rules at /24 or tighter are not modified.

Tightened

10.0.0.0/8 subsets · 172.16.0.0/12 subsets · 192.168.0.0/16 subsets · prefix < /24

Left untouched

0.0.0.0/0 · SG references · IPv6 · /24 or tighter · non-private overlaps
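The eligibility rules above can be approximated with `ipaddress.IPv4Network.subnet_of`, which tests containment rather than overlap. The function name `is_eligible` is hypothetical, and treating a rule equal to a whole RFC 1918 block as eligible is an assumption of this sketch.

```python
import ipaddress

RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def is_eligible(source):
    """True if a rule source is a private IPv4 CIDR broader than /24."""
    try:
        net = ipaddress.ip_network(source, strict=False)
    except ValueError:
        return False  # SG references such as "sg-aaaa" are not CIDRs
    if net.version != 4:
        return False  # IPv6 is left untouched
    if net.prefixlen >= 24:
        return False  # /24 or tighter is already tight enough
    # Subset, not overlap: the rule must sit entirely inside a private block.
    return any(net.subnet_of(home) for home in RFC1918)

for source in ("10.0.0.0/8", "172.20.0.0/16", "192.0.0.0/4",
               "10.1.2.0/24", "0.0.0.0/0", "2001:db8::/32", "sg-aaaa"):
    print(source, is_eligible(source))
```

192.0.0.0/4 overlaps 192.168.0.0/16 but also contains public space, so an overlap test would wrongly select it, while the subset test rejects it.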

Reported, not changed

Network ACLs

NACLs are scanned and labelled in the OU report. Automated NACL tightening is planned for phase two: NACLs are stateless, subnet-scoped, and limited to 20 rules, so they need separate care.

Out of scope

Egress rules Public exposure audit

sg-tightener does not evaluate whether services should be reachable at all, only whether the source CIDR on existing private rules is broader than the evidence supports.

Operational care

Absence of evidence isn't evidence of absence

The default 90-day window is long enough to catch most regular traffic and not long enough to catch everything. Categories of traffic most likely to be missed are the ones that matter most in a crisis, and the tool is built around that risk, not in spite of it.

Window is configurable

Extend with --days 180 or longer for accounts where you know seasonal or infrequent traffic patterns exist: quarterly DR tests, month-end batches, blue-green failovers where the dormant environment was inactive during analysis.

Halt-on-failure apply

If any single security group fails to update cleanly, apply halts immediately and prints the revert command. Partial silent state is impossible. Every apply writes a timestamped manifest of every change so revert can be one command.
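The halt-on-failure behaviour can be sketched as a loop that stops at the first exception and always writes a manifest. Everything here is hypothetical: the `apply_plan` function, the manifest layout, and the `update_group` callable that stands in for the real AWS calls. Only the shape of the printed revert command is taken from the workflow shown earlier.

```python
import json
import os
import tempfile

def apply_plan(changes, update_group, manifest_path):
    """Apply changes group by group; stop at the first failure."""
    manifest = {"applied": [], "failed": None}
    for change in changes:
        try:
            update_group(change)
        except Exception as exc:
            manifest["failed"] = {"group_id": change["group_id"],
                                  "error": str(exc)}
            break  # halt immediately: later groups are not touched
        manifest["applied"].append(change)
    # The manifest is written whether or not the run succeeded.
    with open(manifest_path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2)
    if manifest["failed"] is not None:
        failed = manifest["failed"]
        print(f"HALTED on {failed['group_id']}: {failed['error']}")
        print(f"Revert with: python sg_tightener.py revert --manifest {manifest_path}")
    return manifest

# Stand-in updater that fails on the second group.
attempted = []
def fake_update(change):
    attempted.append(change["group_id"])
    if change["group_id"] == "sg-bbbb":
        raise RuntimeError("rule limit exceeded")

plan = [{"group_id": "sg-aaaa"}, {"group_id": "sg-bbbb"}, {"group_id": "sg-cccc"}]
path = os.path.join(tempfile.mkdtemp(), "manifest.json")
result = apply_plan(plan, fake_update, path)
```

In this run sg-cccc is never attempted, and the manifest records one applied change and one failure.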

Stale-plan detection

Plans are signed with a SHA-256 hash of the security group snapshot they were built from. If anyone touches a relevant rule between plan and apply, apply refuses to run.
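A minimal sketch of the stale-plan check, assuming the snapshot is a list of rule dictionaries and that the plan stores the digest under a `snapshot_sha256` key. Both the key name and the rule shape are assumptions made for the example.

```python
import hashlib
import json

def snapshot_hash(rules):
    """SHA-256 over a canonical, order-independent encoding of the rules."""
    canonical = sorted(json.dumps(rule, sort_keys=True) for rule in rules)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()

def check_plan_is_fresh(plan, current_rules):
    """Refuse to proceed if the group changed after the plan was built."""
    if plan["snapshot_sha256"] != snapshot_hash(current_rules):
        raise RuntimeError("security group changed since the plan was built; re-run plan")

rules_at_plan = [
    {"group_id": "sg-aaaa", "cidr": "10.0.0.0/8", "port": 443},
    {"group_id": "sg-aaaa", "cidr": "192.168.0.0/16", "port": 5432},
]
plan = {"snapshot_sha256": snapshot_hash(rules_at_plan)}

# Same rules in a different order: still fresh.
check_plan_is_fresh(plan, list(reversed(rules_at_plan)))

# Someone added a rule between plan and apply: stale.
changed = rules_at_plan + [{"group_id": "sg-aaaa", "cidr": "10.9.0.0/16", "port": 22}]
try:
    check_plan_is_fresh(plan, changed)
    stale_detected = False
except RuntimeError:
    stale_detected = True
print("stale plan detected:", stale_detected)
```

Canonicalising before hashing matters because the API may return the same rules in a different order between calls.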

Break-glass extension

sg_extend.py exists for the cases the standard loop can't cover: DR failovers, supplier IP cutovers, on-call moments where connectivity must come back in minutes. It reads the last 24h of flow logs, adds the REJECTED private sources — collapsed into CIDRs by a configurable tolerance — strictly additively, and logs a manifest.

At 2am the operator often knows something is broken but not which security groups need patching. Omit --groups and sg_extend looks up each REJECTed destination ENI and derives the attached SGs itself; flows are attributed only to the SGs whose ENIs actually saw them, so a typo can't fan rules out across the estate. A --max-groups cap (default 20) is the hard ceiling.
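The attribution step can be illustrated without any AWS calls. The sketch below assumes the ENI-to-SG mapping has already been fetched (for instance from the EC2 `describe_network_interfaces` API) and that exceeding the group cap aborts the run. The function name, data shapes, and abort behaviour are all assumptions.

```python
from collections import defaultdict

def attribute_rejects(rejects, eni_to_groups, max_groups=20):
    """Map each rejected source only to the SGs on the ENI that saw it."""
    per_group = defaultdict(set)
    for eni_id, src_ip in rejects:
        # ENIs that are not in the lookup contribute nothing.
        for group_id in eni_to_groups.get(eni_id, ()):
            per_group[group_id].add(src_ip)
    if len(per_group) > max_groups:
        raise RuntimeError(
            f"{len(per_group)} groups affected, above the cap of {max_groups}")
    return {group: sorted(ips) for group, ips in per_group.items()}

# Made-up REJECT flows as (destination ENI, source IP) pairs.
rejects = [
    ("eni-1111", "10.4.0.10"),
    ("eni-1111", "10.4.0.11"),
    ("eni-2222", "10.8.0.5"),
    ("eni-9999", "10.9.9.9"),   # unknown ENI: ignored
]
eni_to_groups = {
    "eni-1111": ["sg-aaaa"],
    "eni-2222": ["sg-aaaa", "sg-bbbb"],
}

print(attribute_rejects(rejects, eni_to_groups))
```

sg-bbbb receives only the source that its own ENI rejected, which is the property that keeps rules from fanning out to unrelated groups.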

AWS service summarisation

VPC Lambda traffic, Route53 health-checkers, and other managed-ENI sources arrive from AWS-published IP ranges that rotate over time. When --include-public is on, sg_extend classifies each public source against the AWS ip-ranges.json and collapses every flow that falls inside a service prefix into one rule per service — instead of a fistful of /32s that go stale on the next AWS rotation. The rule description tags the service and region for audit visibility.

The AMAZON catch-all is deliberately blocklisted — it covers essentially all of AWS and is too broad to be a trust source. Pass --no-aws-summarise to revert to per-IP host routes.
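The classification can be sketched against the documented shape of ip-ranges.json, where each entry under `prefixes` has `ip_prefix`, `region`, and `service` fields. The prefixes below are documentation ranges invented for the example, not real AWS ranges, and `summarise` with its description format is hypothetical. This sketch emits one rule per matched prefix.

```python
import ipaddress

BLOCKLIST = {"AMAZON"}  # catch-all service, too broad to trust

def summarise(sources, ip_ranges):
    """Collapse public sources into service prefixes where one matches."""
    prefixes = [
        (ipaddress.ip_network(p["ip_prefix"]), p["service"], p["region"])
        for p in ip_ranges["prefixes"]
        if p["service"] not in BLOCKLIST
    ]
    # Prefer the most specific prefix when several contain the address.
    prefixes.sort(key=lambda entry: entry[0].prefixlen, reverse=True)
    rules = {}
    for src in sources:
        addr = ipaddress.ip_address(src)
        match = next((e for e in prefixes if addr in e[0]), None)
        if match is not None:
            net, service, region = match
            rules[str(net)] = f"{service} {region}"  # description tag for audit
        else:
            rules[f"{src}/32"] = "host"  # fall back to a per-IP host route
    return rules

# Invented sample in the ip-ranges.json layout (TEST-NET addresses).
sample_ranges = {"prefixes": [
    {"ip_prefix": "198.51.100.0/24", "region": "us-east-1", "service": "AMAZON"},
    {"ip_prefix": "198.51.100.0/26", "region": "us-east-1",
     "service": "ROUTE53_HEALTHCHECKS"},
]}
sources = ["198.51.100.5", "198.51.100.9", "198.51.100.200", "203.0.113.7"]

print(summarise(sources, sample_ranges))
```

The two health-checker addresses share one rule, while 198.51.100.200 matches only the blocklisted AMAZON entry and therefore stays a /32.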

Rule-budget compaction

A noisy incident can push a group toward the 60-rule cap. sg_compact.py reclaims budget by widening existing RFC 1918 CIDRs into fewer blocks, gated by a compaction ratio — the fraction of unused space you'll tolerate. Plan mode ranks the busiest groups and sweeps ratios first; coverage is never reduced.

The broader point

Trust should be earned, not inherited

Most organisations spend considerable effort building security controls at the perimeter: WAFs, DDoS protection, identity federation. What receives far less attention is the internal trust model once traffic is past the perimeter. The implicit assumption in most hybrid cloud estates is that the corporate network is trusted, and that assumption is encoded directly into security group rules as broad RFC 1918 CIDR blocks that nobody has revisited since they were written.

Modern threat models assume the corporate network is already compromised, or will be. Ransomware operators routinely move laterally across flat trusted networks before triggering payloads. Compromised build agents are a standard initial-access vector precisely because they sit in trusted ranges with broad permissions into production. The cloud did not eliminate flat networks; it gave many organisations the tools to build more sophisticated ones while quietly replicating the same trust assumptions they always made.

sg-tightener exists because trust should be earned through observed behaviour, not inherited from a datacenter subnet designed fifteen years ago.

sg-tightener: evidence-based CIDR reduction

From 196,608 trusted addresses to 74

Before → after

66 regression tests, no AWS credentials needed

Run the suite
