Attention-Weighted Retention: Logging Everything Is Lazy - Start Scoring What Matters

TL;DR: Logging everything is lazy. Most teams dump logs into storage and pray they'll need them later. 90% of logs never get touched again. Attention-weighted retention scores logs based on actual usage - if nothing looks at it, it decays. Hot logs stay, cold logs go. Simple concept, massive impact.
the problem: dump-everything-and-pray
most teams still run on this passive collection mindset. throw all logs into storage, cross your fingers, and hope some future query will need them. it's less of a strategy, more of an excuse.
here's what actually happens:
- 90% of logs never get touched - no queries, no alerts, no dashboards
- they just sit there eating space - racking up infra costs, dragging down agility
- storing 100PB sounds cool until you realize it's mostly dead bytes
- false sense of safety - you think you're covered because you kept everything
when an incident hits, you're digging through terabytes of noise with zero prioritization. that's not coverage, that's entropy.
enter attention-weighted retention
this isn't just about reducing storage bills. it's about turning retention from guesswork into a feedback system.
Core idea: Logs get to live longer if people or alerts are actually interacting with them. If they aren't? They decay over time. Every log line gets an attention score based on real usage.
that one line you emitted six months ago that never triggered an alert, never got queried, and never surfaced in an investigation? why are we still storing it at full fidelity?
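to make "attention score" concrete, here's one way to define it. the notation is mine, not a standard - a stream's score at time t is a decayed sum over its attention events:

$$
\mathrm{score}(t) = \sum_i w_{e_i} \, e^{-\lambda (t - t_i)}
$$

where t_i is when the i-th event (query, alert, dashboard view) touched the stream, w assigns each event type a weight, and λ controls how fast untouched logs fade. the scorer below implements exactly this.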
the architecture: 5 pieces
1. Ingestion Layer (keep it simple)
fluentbit, vector, otel agents, whatever. no need to touch that. keep it open and generic.
2. Attention Signal Collector
wire into every place where humans or systems consume logs:
- alerting systems (Alertmanager, PagerDuty)
- dashboards (Grafana, Datadog)
- CLI tools (kubectl logs, grep)
- saved searches and bookmarks
track every interaction. whether it's a page firing or a query in Grafana, all of that is a signal.
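here's a rough sketch of what a collected signal could look like. the event kinds, field names, and `collect` hook are placeholders I made up, not any real agent's API:

```python
from dataclasses import dataclass, field
from enum import Enum
from time import time

class AttentionKind(Enum):
    ALERT_FIRED = "alert_fired"     # e.g. an Alertmanager/PagerDuty webhook
    DASHBOARD_QUERY = "dashboard"   # e.g. a Grafana query hitting the datasource
    CLI_QUERY = "cli"               # e.g. a wrapped kubectl logs / grep
    SAVED_SEARCH = "saved_search"   # bookmarked or saved queries

@dataclass
class AttentionEvent:
    stream: str                     # log stream / service tag the event touched
    kind: AttentionKind
    ts: float = field(default_factory=time)

def collect(event: AttentionEvent, sink: list) -> None:
    # in production the sink would be a queue or topic, not a list
    sink.append(event)
```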
3. Signal Indexer
build a lightweight index that maps attention events to log sources (toy sketch below):
- timestamps, stream tags, log ids
- store in ClickHouse or VictoriaMetrics
- keep it TTL'd - we don't need forever history
- just enough to score retention
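an in-memory stand-in for the index, just to show its shape. in production this would be a TTL'd table in ClickHouse or VictoriaMetrics keyed the same way:

```python
from collections import defaultdict, deque
from time import time

TTL_SECONDS = 30 * 24 * 3600   # 30 days of attention history is enough to score

# stream tag -> (event kind, timestamp) pairs, oldest first
index: defaultdict[str, deque] = defaultdict(deque)

def record(stream: str, kind: str, ts: float | None = None) -> None:
    index[stream].append((kind, ts if ts is not None else time()))

def prune(now: float | None = None) -> None:
    """Drop events older than the TTL; we never need forever history."""
    now = now if now is not None else time()
    for events in index.values():
        while events and now - events[0][1] > TTL_SECONDS:
            events.popleft()
```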
4. Retention Scorer
run a scoring pass per log stream or service (sketch after the list):
- queries in the last week = weight
- triggered alerts = more weight
- never touched = decays over time
- use exponential decay functions
- stratify into buckets: hot, warm, cold
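a minimal sketch of the scoring pass, implementing the decayed-sum formula from earlier and assuming the index above. the weights, half-life, and bucket thresholds are illustrative guesses, not tested defaults:

```python
import math
from time import time

WEIGHTS = {"alert_fired": 5.0, "saved_search": 2.0, "dashboard": 1.0, "cli": 1.0}
HALF_LIFE_DAYS = 7.0                              # a week-old query counts half
LAMBDA = math.log(2) / (HALF_LIFE_DAYS * 86400)   # decay rate per second

def score(events, now: float | None = None) -> float:
    """events: (kind, timestamp) pairs for one stream. No events -> score 0."""
    now = now if now is not None else time()
    return sum(
        WEIGHTS.get(kind, 1.0) * math.exp(-LAMBDA * (now - ts))
        for kind, ts in events
    )

def bucket(s: float) -> str:
    if s >= 5.0:     # roughly: one fresh alert, or a handful of recent queries
        return "hot"
    if s >= 0.5:
        return "warm"
    return "cold"
```

with the index above, scoring a stream is just `bucket(score(index["checkout-service"]))` - "checkout-service" being a made-up stream tag.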
5. Retention Enforcer
apply the policy based on the signal-weighted retention score (sketch after the list):
- cold logs = expired
- warm logs = compacted with lossless compression
- hot logs = stick around in fast search storage
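a sketch of the enforcement step, assuming the bucket labels from the scorer. the action strings are placeholders for whatever your backend actually exposes (S3 lifecycle rules, Loki retention config, a compaction job):

```python
ALLOWLIST = {"audit", "payments"}   # hypothetical rare-but-critical streams

def enforce(stream: str, bucket: str) -> str:
    if stream in ALLOWLIST:
        return "keep"      # manual override beats any score (see tradeoffs below)
    if bucket == "cold":
        return "expire"    # delete, or demote to a glacier-tier archive
    if bucket == "warm":
        return "compact"   # lossless compression, slower to search
    return "keep"          # hot: stays in fast search storage
```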
engineering overhead: worth it?
people are gonna ask about cost. yeah, there's some cpu overhead in scoring and indexing attention signals. but compared to storing petabytes of untouched data and scaling that infra, it's a worthwhile tradeoff.
key insight: indexing access patterns is inherently cheaper than indexing log content. the attention index can be sparse and still useful.
Important: Attention scoring should be async and backoff-friendly. Don't slow down ingest paths. Don't create another bottleneck. Build it to degrade gracefully.
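one cheap way to get that behavior is a bounded queue that sheds load instead of blocking. a sketch, not a prescription - at scale this would be a Kafka topic or similar:

```python
import queue

signals: queue.Queue = queue.Queue(maxsize=10_000)

def emit(event) -> None:
    """Called from consumption hooks. Never blocks the caller."""
    try:
        signals.put_nowait(event)
    except queue.Full:
        # degrade gracefully: a dropped signal just means slightly stale scores
        pass

def drain(batch_size: int = 500) -> list:
    """The scoring worker pulls batches on its own schedule."""
    batch = []
    while len(batch) < batch_size:
        try:
            batch.append(signals.get_nowait())
        except queue.Empty:
            break
    return batch
```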
why ingest-everything-query-later breaks
this model fails under real-world conditions:
- cold storage latency - if your cold logs take minutes to hydrate, good luck using them during a live incident
- alert drift - if logs aren't looked at, you don't notice when alerts become stale or broken
- cognitive load - engineers don't know what logs are safe to ignore, so they stop trusting logs at all
- false sense of safety - you think you're covered because you kept everything, but when an incident hits, you're digging through terabytes of noise
unseen value: organizational signal
there's a layer no one talks about: attention scoring exposes which parts of the system are actually getting human time.
if no one's ever looked at a particular service's logs, it either means:
- it's working perfectly (good to know)
- no one's paying attention (also good to know)
you can track operational blind spots this way. audit which services have real SRE touchpoints and which ones are totally dark.
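the audit itself is trivial if you already have the attention index. a sketch, assuming the in-memory index from earlier plus a service catalog listing every known stream:

```python
def dark_streams(all_streams: set, index: dict) -> set:
    """Streams nothing has touched within the index TTL: your blind spots."""
    return {s for s in all_streams if not index.get(s)}
```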
tradeoffs and risks
this isn't a drop-in solution. you need:
- structured logs
- tagging discipline
- allowlists for rare-but-critical logs
- manual overrides when needed
also, attention itself is noisy. someone fat-fingering a query shouldn't preserve an entire log stream. so the scoring logic needs smoothing (sketch below):
- moving averages
- weighted decay
- anomaly detection to avoid flukes
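one possible shape for that smoothing: an exponentially weighted moving average with a crude spike guard. alpha and the 10x cap are arbitrary starting points:

```python
class SmoothedScore:
    """EWMA over raw attention scores, with a cap on one-off spikes."""

    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha   # lower = smoother, slower to react
        self.value = 0.0

    def update(self, raw: float) -> float:
        # a single anomalous burst (someone fat-fingering a query) nudges
        # the average but can't multiply it past a sanity bound in one step
        capped = min(raw, 10.0 * max(self.value, 1.0))
        self.value = self.alpha * capped + (1.0 - self.alpha) * self.value
        return self.value
```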
closing thoughts
attention-weighted retention isn't just about saving money. it's about raising signal quality. if you know what's being used, you can stop guessing.
in systems this complex, every byte you drop with confidence is one less byte stealing focus. this isn't optimization, it's operational clarity. and it scales better than blind trust in cheap storage.
Bottom line: Stop logging everything. Start scoring what matters. Your future self (and your storage bill) will thank you.
References
- LogicMonitor highlights that logs are often archived for compliance and seldom accessed, showing how log retention is widely practiced but rarely questioned. "Log retention refers to the archiving of event logs…allowing companies to hold information on security-related activities." (logicmonitor.com)
- Last9's guide on log retention outlines best practices and acknowledges that standard retention durations (14–90 days) are arbitrary and not usage-based (last9.io)
- SigNoz's default retention settings for logs and traces (15 days, 30 days) demonstrate fixed TTL is common—but not oriented around real usage (signoz.io)
- Acceldata outlines the data lifecycle, including archival and retention, but doesn't tie retention length to actual usage or attention metrics (acceldata.io)
- Observe Inc. digs into retention versus cost - "keeping long term logs online … can cost an arm and a leg" - confirming the classic trade-off around observability scale (observeinc.com)
- Grafana Loki documentation shows streams live forever by default, but supports stream-level retention policies—highlighting a move toward more granular control (grafana.com)
- Lilian Weng's "Attention? Attention!" is a foundational explainer on attention mechanisms in ML, illustrating how weighted relevance improves system focus (lilianweng.github.io)
- The 2019 paper "Log-based software monitoring: a systematic mapping study" identifies storage and retention efficiency as under-explored problems in observability research (arxiv.org)
- The 2020 Site Reliability Engineering survey, along with Last9's content on retention, shows logs are essential but long-term retention and smart pruning remain unsolved challenges (last9.io)
- An AWS engineering blog post provides a practical pattern: compute signals (via EventBridge + Lambda) to automate retention - proof that signal-based retention is implementable (aws.amazon.com)