AFIP’s Digital Content Forensics division applies statistical and linguistic analysis to determine the provenance of written content. As large language models become increasingly sophisticated, distinguishing human-authored text from machine-generated output requires forensic-grade methodology grounded in reproducible science — not black-box classifiers with opaque confidence scores.
The division’s work addresses a problem that commercial detection tools have struggled with: maintaining accuracy as models evolve. A 2023 study by Sadasivan et al. (Can AI-Generated Text be Reliably Detected?, arXiv:2303.11156) demonstrated that paraphrasing attacks can sharply reduce detector accuracy, and argued that detection performance approaches random chance as model output grows closer to human text. AFIP’s multi-layer forensic approach, which combines statistical, linguistic, and provenance analysis, is designed to remain robust against these adversarial techniques.
| Model Family | Detection Rate | False Positive Rate | Paraphrase Resilience |
|---|---|---|---|
| GPT-4o / GPT-4 Turbo | 96.1% | 1.4% | 89.3% |
| GPT-5 | 93.2% | 2.1% | 85.7% |
| Claude 3.5 Sonnet | 94.8% | 1.6% | 87.2% |
| Gemini 1.5 / Ultra | 91.7% | 2.3% | 83.4% |
| Llama 3 (70B) | 95.4% | 1.2% | 90.1% |
| Mistral Large | 94.1% | 1.7% | 86.8% |
Benchmark conducted on a 25,000-sample corpus. Paraphrase resilience measured against DIPPER and PEGASUS-based rewriting. Methodology: AFIP Detection Protocol v3.2.
The challenge of AI text detection is not merely technical; it is epistemological. When a language model generates text by sampling from a probability distribution over tokens, the output is drawn from a distribution that was trained to approximate human language. This is precisely what makes large language models useful, and precisely what makes detection difficult.
Early detection approaches relied on perplexity thresholds: AI-generated text, being optimized for likelihood, would exhibit lower perplexity than human writing. This approach failed as models improved. GPT-4 and its successors produce text with perplexity distributions that overlap substantially with human writing, particularly in formal and technical domains where human prose is itself highly optimized (Kirchenbauer et al., 2023, A Watermark for Large Language Models, ICML).
The detection problem is further complicated by the asymmetry of evidence. Proving that text was AI-generated requires identifying positive forensic signatures — artifacts of the generation process that human writing would not contain. Proving that text is human-authored is substantially harder, because the absence of AI signatures is not itself conclusive evidence. AFIP’s methodology addresses this asymmetry by providing calibrated confidence intervals rather than binary classifications.
Our detection framework operates across three independent analytical layers, each producing a calibrated confidence score. The layers are designed to capture different aspects of the generation process, so that improvements in one dimension of model capability do not simultaneously defeat all three detection channels.
Statistical Layer: Token-level entropy distributions, burstiness metrics, perplexity variance across sliding windows, and n-gram frequency divergence from reference corpora.
Linguistic Layer: Syntactic complexity distributions, discourse coherence patterns, register consistency, hedging frequency, and stylistic entropy across paragraph boundaries.
Provenance Layer: Locality-sensitive hashing of n-gram distributions, document-level fingerprints for partial matching, and cross-reference against known AI output registries.
The statistical layer examines token-level entropy, perplexity variance, and n-gram frequency distributions. AI-generated text exhibits measurably different statistical signatures than human writing — patterns that persist even as model capabilities advance. Specifically, we measure:
Burstiness. Human writing exhibits variable information density: high-entropy passages (novel ideas, complex arguments) are interspersed with low-entropy passages (transitions, summaries, hedging). Autoregressive language models, while capable of local variation, produce text with a measurably narrower burstiness distribution. Burstiness complements zero-shot methods such as DetectGPT (Mitchell et al., 2023, ICML), which tests probability curvature rather than burstiness, and it remains one of the most robust single-feature detectors across model families.
Perplexity Curvature. Rather than measuring absolute perplexity (which overlaps between human and AI text), we analyze the second derivative of perplexity across a sliding window. Human text produces characteristic spikes at paragraph and section boundaries that AI-generated text smooths over, even when the absolute perplexity levels are comparable.
N-gram Divergence. We compute the KL divergence of the document’s n-gram distribution against both a human reference corpus and model-specific output corpora. The ratio of these divergences provides a signal that is more informative than either divergence alone.
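As an illustration of the three measurements above, the following Python sketch computes each one from precomputed inputs. It is not AFIP’s pipeline. It assumes per-token surprisals (negative log-probabilities) have already been obtained from some scoring model, uses the coefficient of variation of windowed perplexity as a stand-in for burstiness, and applies add-one smoothing to the n-gram distributions. All function names, window sizes, and the smoothing constant are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def window_perplexity(surprisals, window=4, stride=2):
    """Perplexity, exp(mean surprisal), for each sliding window of tokens."""
    return [
        math.exp(mean(surprisals[i:i + window]))
        for i in range(0, len(surprisals) - window + 1, stride)
    ]


def burstiness(series):
    """Coefficient of variation of a series (an assumed burstiness proxy)."""
    mu = mean(series)
    return pstdev(series) / mu if mu else 0.0


def curvature(series):
    """Discrete second difference: s[i-1] - 2*s[i] + s[i+1]."""
    return [
        series[i - 1] - 2 * series[i] + series[i + 1]
        for i in range(1, len(series) - 1)
    ]


def ngram_counts(tokens, n=2):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def kl_divergence(p_counts, q_counts, alpha=1.0):
    """KL(P || Q) in nats, with add-alpha smoothing over the joint vocabulary."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for gram in vocab:
        p = (p_counts[gram] + alpha) / p_total
        q = (q_counts[gram] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


def divergence_ratio(doc, human_ref, model_ref, n=2):
    """KL(doc || human) divided by KL(doc || model)."""
    doc_counts = ngram_counts(doc, n)
    kl_human = kl_divergence(doc_counts, ngram_counts(human_ref, n))
    kl_model = kl_divergence(doc_counts, ngram_counts(model_ref, n))
    return kl_human / kl_model if kl_model else float("inf")
```

Under these assumptions, a ratio above 1 means the document’s n-gram distribution sits closer to the model reference than to the human reference, and large absolute values in the curvature series mark abrupt perplexity changes of the kind described at paragraph boundaries.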
Human writing contains micro-variations in syntax, register, and semantic coherence that reflect cognitive processes absent in generative models. Our linguistic forensics layer identifies these signatures through multiple complementary analyses:
Syntactic Complexity Distribution. We measure the distribution of parse tree depths, clause embedding depths, and dependency arc lengths across sentences. Human writers produce text with a wider, more irregular distribution of syntactic complexity than LLMs. Our research has documented that even the most capable models maintain a narrower band of syntactic complexity than human-authored content in the same domain — a regularity that, while often invisible to casual readers, produces measurable forensic signatures.
Discourse Structure. Human documents exhibit characteristic patterns in how ideas are introduced, developed, and connected. We analyze discourse markers, anaphoric reference chains, and topic continuity using Rhetorical Structure Theory (RST) parsing. AI-generated text tends to produce more uniform discourse structures with fewer embedded digressions and parenthetical elaborations.
Register Consistency. Human writers naturally shift register within a document — becoming more formal in technical passages, more colloquial in asides. LLMs maintain a more uniform register throughout. We measure register variation using feature sets adapted from Biber’s (1988) multidimensional analysis framework, updated with features specific to contemporary digital text.
Hedging and Epistemic Markers. One of the most reliable linguistic signals we have identified is the pattern of epistemic hedging. Human experts hedge differently from non-experts, and both hedge differently from LLMs. Our analysis of hedging frequency, position, and type provides an independent signal that correlates with authorship at statistically significant levels (p < 0.001 in our validation corpus).
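To make the hedging signal concrete, the sketch below extracts the three properties mentioned above (frequency, position, and type) from raw text. The lexicon is a deliberately tiny, hypothetical stand-in, and the function name and output keys are illustrative assumptions; the feature set actually used is not specified here.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon; a realistic one would be much larger.
HEDGES = {
    "modal": {"may", "might", "could"},
    "adverb": {"perhaps", "possibly", "arguably", "probably"},
    "verb": {"suggest", "suggests", "appear", "appears", "seem", "seems"},
}
HEDGE_TYPE = {word: kind for kind, words in HEDGES.items() for word in words}


def hedging_profile(text):
    """Return hedge frequency, counts by type, and mean in-sentence position."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    type_counts = Counter()
    positions = []
    total_tokens = 0
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        total_tokens += len(tokens)
        for idx, token in enumerate(tokens):
            kind = HEDGE_TYPE.get(token)
            if kind:
                type_counts[kind] += 1
                # 0.0 = first token of the sentence, 1.0 = last token.
                positions.append(idx / max(len(tokens) - 1, 1))
    hits = sum(type_counts.values())
    return {
        "per_100_tokens": 100 * hits / total_tokens if total_tokens else 0.0,
        "by_type": dict(type_counts),
        "mean_relative_position": (
            sum(positions) / len(positions) if positions else None
        ),
    }
```

A profile like this would be one feature vector among several; comparing it across expert, non-expert, and model-generated samples is where the discriminative signal described above would have to come from.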
Beyond detection, AFIP researches content provenance systems — methods for establishing the origin, modification history, and authenticity chain of digital documents. This work addresses the fundamental limitation of point-in-time detection: a document classified as human-authored today may have been generated by a model that did not exist when the classifier was trained.
Our provenance fingerprinting system uses locality-sensitive hashing (LSH) on n-gram distributions to produce document fingerprints that tolerate minor edits (spelling corrections, formatting changes, synonym substitution) while detecting wholesale AI regeneration. These fingerprints enable:
Partial Matching: Identifying passages within a larger document that match known AI-generated content, even when the document is predominantly human-authored.
Temporal Provenance: Establishing when a document or passage first appeared in a corpus, enabling attribution disputes to be resolved based on publication order.
Cross-Platform Tracking: Following content as it is republished, paraphrased, and adapted across platforms — the textual equivalent of reverse image search.
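The fingerprinting idea can be sketched with MinHash, one common locality-sensitive hashing family for set similarity. This is an assumption for illustration: the text above does not say which LSH scheme is used, and MinHash operates on sets of word n-grams rather than on full frequency distributions. The function names, the 64-hash signature length, and the 16-band split are all hypothetical.

```python
import hashlib


def shingles(text, n=3):
    """Set of word n-grams (shingles) for a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def _hash(seed, item):
    """64-bit hash of item, salted with seed to simulate independent hashes."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_hashes=64, n=3):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text, n)
    return [min(_hash(seed, g) for g in grams) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(signature, bands=16):
    """Split a signature into bands; documents sharing any bucket are candidates."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}
```

Two documents become match candidates when their bucket sets intersect, so a lightly edited copy still collides with its source with high probability while unrelated text almost never does. Applying the same signature to individual passages rather than whole documents is one way to approach the partial matching case.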
This provenance work contributes directly to the AFIP Forensic Integrity Protocol (FIP), which defines the open standard for content fingerprinting and attestation.
A forensic detection methodology is only as reliable as its resistance to adversarial evasion. AFIP conducts ongoing adversarial testing against our own detection pipeline, systematically evaluating robustness against known attack vectors:
| Attack Vector | Method | Detection Retention |
|---|---|---|
| Paraphrase Attack | DIPPER (L60, O60) | 89.3% |
| Back-Translation | EN → DE → EN → FR → EN | 91.7% |
| Homoglyph Substitution | Unicode confusables | 99.2% |
| Watermark Removal | Token perturbation | 86.4% |
| Mixed Authorship | Human-edited AI draft | 78.1% * |
| Prompt Engineering | “Write like a human” variants | 93.8% |
* Mixed authorship detection returns per-passage confidence scores rather than whole-document classification. 78.1% represents correct identification of AI-originated passages within human-edited documents.
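For the homoglyph row in the table above, one plausible preprocessing defense is to fold look-alike characters back to a canonical form before any statistical features are computed. The sketch below is an assumption about how such a step could look, not a description of AFIP’s pipeline: it combines Unicode NFKC normalization with a small, hypothetical hand-written map of Cyrillic look-alikes, where a fuller implementation would likely draw on the Unicode confusables data.

```python
import unicodedata

# Hypothetical, deliberately small map of Cyrillic look-alikes to Latin letters.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0445": "x",  # Cyrillic small ha
    "\u0456": "i",  # Cyrillic small Byelorussian-Ukrainian i
}


def fold_homoglyphs(text):
    """Apply NFKC normalization, then map known look-alikes to ASCII."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(CONFUSABLES.get(ch, ch) for ch in normalized)


def substitution_rate(text):
    """Fraction of input characters that folding would change."""
    if not text:
        return 0.0
    changed = sum(1 for ch in text if fold_homoglyphs(ch) != ch)
    return changed / len(text)
```

A high substitution rate in otherwise Latin-script text is itself a useful signal, since ordinary writing rarely mixes scripts inside a single word.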
AFIP maintains a policy of transparent reporting on the limitations of our detection methodology. No detection system is infallible, and overclaiming accuracy causes real harm — particularly when detection results are used in academic integrity proceedings, journalism verification, or legal contexts.
Known limitations of our current methodology include reduced accuracy on short texts (under 250 tokens), domain-specific technical writing where human prose already resembles model output, and documents produced through extensive human-AI collaboration where authorship is genuinely shared. We report calibrated confidence intervals for all assessments and recommend that our results be used as one input to human decision-making, not as an automated binary verdict.
Our research also addresses the ethical dimensions of AI text detection, including the disproportionate false positive rates documented for non-native English speakers (Liang et al., 2023, GPT detectors are biased against non-native English writers, Patterns). AFIP’s multi-layer approach reduces — but does not eliminate — this bias, and we publish disaggregated accuracy metrics across writer demographics in our quarterly benchmark reports.
AFIP publishes quarterly benchmark results testing our methodology against the latest generation of language models. Our current aggregate detection accuracy stands at 94.7% across tested models, with a false positive rate of 1.8% and a paraphrase resilience rate of 87.6%.
All benchmark methodology is documented in sufficient detail for independent reproduction. Our test corpus composition, evaluation metrics, and statistical procedures follow the reporting standards recommended by the ACL 2023 guidelines for reproducible NLP research. Benchmark datasets and evaluation scripts are available to verified research institutions upon request.
Mitchell, E., Lee, Y., Khazatsky, A., Manning, C.D., & Finn, C. (2023). DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. ICML 2023.
Sadasivan, V.S., Kumar, A., Balasubramanian, S., Wang, W., & Feizi, S. (2023). Can AI-Generated Text be Reliably Detected? arXiv:2303.11156.
Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023). A Watermark for Large Language Models. ICML 2023.
Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., & Zou, J. (2023). GPT detectors are biased against non-native English writers. Patterns, 4(7).
Biber, D. (1988). Variation across Speech and Writing. Cambridge University Press.
Hans, A., Schwarzschild, A., Cherepanova, V., Kazemi, H., Saha, A., Goldblum, M., Geiping, J., & Goldstein, T. (2024). Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text. ICML 2024.