QUESTION: Are we training AI too late?
Nishawn Smagh, Director of Intelligence at GreyNoise: Artificial intelligence anchors modern security operations. Detection models are typically trained on labeled breach logs, malware samples, threat feeds, and post-incident investigations, sources that provide validated ground truth and enable reliable classification.
But these sources share a critical structural limitation: They reflect attacker behavior only after malicious activity has already been confirmed.
The central question becomes whether we are training AI to recognize impact or intent. For the answer, let’s look at IP patterns associated with malicious scanning activity.
The Fresh Infrastructure Problem
Internet-scale telemetry shows that high-impact exploitation frequently originates from infrastructure with little or no prior malicious history. According to GreyNoise’s 2026 State of the Edge report:
52% of remote code execution (RCE) exploitation traffic originated from IPs that had not appeared in common threat feeds.
38% of authentication bypass attempts involved previously unseen IPs.
For basic reconnaissance (e.g., information disclosure), the share of IPs with no scanning history drops to 29%.
A striking pattern emerges: the more severe the activity, the more likely it is to involve new infrastructure. Adversaries appear to understand the constraints of reputation systems, increasingly deploying new cloud instances, short-lived VPS environments, and residential proxy networks to avoid leaving reusable IP history.
Reputation-based approaches remain valuable, but inherently retrospective. If AI models heavily weight historical indicators and post-compromise artifacts, they risk inheriting the same lag. Infrastructure novelty, especially when paired with high-impact behavior, is becoming a meaningful risk signal in its own right.
Attacker Behavior Often Comes First
The timing gap may begin even earlier than most defensive workflows assume. GreyNoise analyzed edge-related activity starting in September 2024 and identified 216 statistically significant spike events after applying strict anomaly thresholds. When compared against subsequent Common Vulnerabilities and Exposures (CVE) disclosures affecting the same technologies:
50% of spikes were followed by a new CVE disclosure within three weeks.
80% were followed by a new disclosure within six weeks.
This pattern spanned eight enterprise-focused edge-facing systems (such as VPNs, routers, firewalls, and internet-facing management systems). Correlation does not prove causation, but the recurring temporal relationship suggests that attacker intent can surface before formal vulnerability disclosure.
Most spike activity involved exploit attempts against previously known vulnerabilities, consistent with adversaries inventorying exposed systems or testing exploit paths ahead of a coordinated campaign.
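To make the idea concrete, here is a minimal sketch of how spike events and their follow-on disclosure rates could be measured. It is not GreyNoise's method, which the report excerpt does not describe. The robust z-score (median and MAD), the 14-day trailing window, the threshold of 6.0, and the function names find_spikes and share_followed_by_disclosure are all assumptions chosen for illustration.

```python
from datetime import date, timedelta
from statistics import median


def find_spikes(daily_counts, window=14, threshold=6.0):
    """Return indices of days that are robust outliers versus a trailing window.

    Assumption: a median/MAD z-score stands in for the unspecified
    "strict anomaly thresholds" mentioned in the text.
    """
    spikes = []
    for i in range(window, len(daily_counts)):
        baseline = daily_counts[i - window:i]
        med = median(baseline)
        mad = median([abs(x - med) for x in baseline])
        # 1.4826 makes MAD comparable to a standard deviation for normal data.
        scale = 1.4826 * mad
        if scale == 0:
            scale = 1.0  # illustrative floor for a perfectly flat baseline
        if (daily_counts[i] - med) / scale >= threshold:
            spikes.append(i)
    return spikes


def share_followed_by_disclosure(spike_dates, disclosure_dates, within_days):
    """Fraction of spikes followed by at least one disclosure within N days."""
    if not spike_dates:
        return 0.0
    horizon = timedelta(days=within_days)
    hits = sum(
        1
        for spike in spike_dates
        if any(timedelta(0) < d - spike <= horizon for d in disclosure_dates)
    )
    return hits / len(spike_dates)


# Made-up sample data, for illustration only.
counts = [10, 12, 11, 9, 10, 11, 12, 10, 9, 11, 10, 12, 11, 10, 95, 11]
spike_days = find_spikes(counts)

spikes = [date(2025, 1, 1), date(2025, 2, 1)]
disclosures = [date(2025, 1, 15), date(2025, 3, 10)]
three_week_share = share_followed_by_disclosure(spikes, disclosures, 21)
six_week_share = share_followed_by_disclosure(spikes, disclosures, 42)
```

A median-based baseline is used here because a single large burst would otherwise inflate a mean and standard deviation and mask the next spike. A real pipeline would also need per-technology baselines and some handling of seasonality.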
Why the Edge Matters
Edge-facing systems are increasingly becoming strategic access points, and large language model (LLM) inference servers represent a particularly acute version of this problem. A compromised inference endpoint isn’t just a foothold; it’s a position from which adversaries can manipulate model outputs, exfiltrate training data, or pivot to internal systems querying it.
Reconnaissance targeting inference ports is already underway. If defenders are training AI to protect AI infrastructure using only post-compromise artifacts, then the most novel attack surface in the enterprise is being defended with the oldest detection logic.
Edge systems capture exactly this kind of pre-compromise telemetry: reconnaissance, authentication probing, and infrastructure rotation patterns that reflect attacker coordination before a breach is confirmed.
CrowdStrike’s 2026 Global Threat Report reinforces the emphasis adversaries place on edge devices, noting that nation-state and ransomware operators targeted network perimeter devices as strategic entry points. China-nexus actors favor edge exploitation because it provides immediate access while limiting defender visibility.
This creates a structural asymmetry. Adversaries exploit the edge precisely because visibility is constrained. Yet defenders often train AI on artifacts that appear only after edge access has succeeded. The signals visible at the perimeter are probing, exploit attempts, and infrastructure rotation, which may not map to a confirmed compromise but frequently precede it.
Detecting the 216 spike events required internet-scale baselining. A single enterprise might observe exploit attempts against its own systems, but it cannot easily determine whether they represent background noise or a coordinated global deviation. The visibility gap becomes a training gap.
Implications for AI Strategy
Post-incident artifacts remain essential; they provide reliable labels and serve as anchors for supervised detection systems. But if training datasets emphasize confirmed compromise and post-disclosure exploitation while excluding pre-exploitation behavioral telemetry, models will skew toward reactive signals.
The findings point toward two measurable opportunities:
A meaningful association between infrastructure novelty and higher-impact exploitation.
A recurring relationship between behavioral spikes and subsequent CVE disclosures in edge technologies.
Earlier signals exist, and they are measurable. Incorporating features such as first-seen IP timing, anomaly-detection outputs, infrastructure churn rates, and pre-disclosure spike behavior into AI pipelines could shift detection closer to attacker reconnaissance rather than to attacker success.
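As a rough illustration of two of those features, the sketch below derives a fresh-IP share and a churn rate from source IPs observed in consecutive time windows. The function name novelty_features, the one-day freshness cutoff, and both feature definitions are assumptions made for this example, not definitions taken from the GreyNoise report.

```python
from datetime import datetime, timedelta


def novelty_features(current_ips, previous_ips, first_seen, now,
                     fresh_after=timedelta(days=1)):
    """Compute illustrative infrastructure-novelty features for one window.

    current_ips / previous_ips: source IPs seen in this window and the prior one.
    first_seen: dict mapping IP -> datetime of earliest known observation.
    Assumption: an IP is "fresh" if it has no history or was first seen
    within `fresh_after` of `now`.
    """
    current = set(current_ips)
    previous = set(previous_ips)
    if not current:
        return {"fresh_ip_share": 0.0, "churn_rate": 0.0}

    fresh = {
        ip for ip in current
        if ip not in first_seen or now - first_seen[ip] <= fresh_after
    }
    # Churn here means: IPs active now that were absent from the prior window.
    churned = current - previous

    return {
        "fresh_ip_share": len(fresh) / len(current),
        "churn_rate": len(churned) / len(current),
    }


# Made-up observations using documentation-reserved address ranges.
now = datetime(2025, 6, 1, 12, 0)
history = {
    "198.51.100.7": datetime(2025, 1, 1),
    "203.0.113.9": datetime(2025, 6, 1, 6, 0),
}
features = novelty_features(
    current_ips=["198.51.100.7", "203.0.113.9", "192.0.2.44", "192.0.2.44"],
    previous_ips=["198.51.100.7"],
    first_seen=history,
    now=now,
)
```

Values like these could be fed to a model alongside anomaly-detection outputs and labeled incident data. Computing them per exploit category would let novelty be weighed against the severity of the behavior it accompanies.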
Shift the Training Window
Training earlier in the attack lifecycle doesn’t mean abandoning validated impact data. It means expanding the signal set.
As infrastructure rotation accelerates and edge systems remain high-value targets, defensive advantage will increasingly depend on how effectively AI integrates both confirmed compromise artifacts and internet-scale pre-exploitation telemetry. Organizations that close that timing gap move from reacting to breaches toward recognizing coordinated behavior before a breach occurs.
