Every analytics-tier vendor that shipped a product in 2025 now claims their answers are "cited," "sourced," or "evidence-grounded." The marketing pages are full of footnote icons. The demos show provenance popovers. The buyer leaves the call feeling reassured.
Half of those claims survive a 20-minute technical review. The other half don't · because "evidence-grounded" isn't a binary, it's a spectrum, and most vendors are clustered near the soft end. Forrester's January 2026 survey[4] found that 42% of enterprises that deployed un-grounded LLM products have rolled them back · the top three reasons were false claims (61%), citation drift (37%), and regulatory exposure (29%). The cost is real.
This is the field guide for telling which half is which.
Section 1: The four grounding levels
Procurement teams need a vocabulary to grade vendor claims. Here's the one we use internally at PYRAMYD when evaluating partners and the one we recommend buyers use when evaluating us:
Level 0: Model-grounded (no retrieval)
The LLM answers from its training data. Citations, if any, are fabricated · the model invents plausible-sounding source URLs that may or may not exist. Hallucination rate: ~41% on TruthfulQA-style benchmarks.[1] Procurement red flag: any demo where the citation panel shows URLs but the vendor can't produce a request log showing those URLs were actually fetched.
Level 1: Document-grounded (vector RAG)
The system retrieves passages from a document corpus by cosine similarity, concatenates them into the prompt, and asks the LLM to answer with citations to the retrieved passages. This is the substrate of ChatGPT Enterprise's first-generation connectors and most "chat with your docs" products. Hallucination drops to 18-26%, depending on retrieval quality. The main failure mode is "lost in the middle"[2] · the LLM under-uses information from the middle of the context window, so citations skew toward the start/end of retrieved chunks even when the supporting evidence sits in the middle.
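To make the Level 1 mechanics concrete, here is a minimal sketch of similarity retrieval followed by prompt assembly. It is an illustration under loud assumptions, not any vendor's pipeline: `embed` is a toy word-count stand-in for a real embedding model, and `CORPUS`, `retrieve`, and `build_prompt` are hypothetical names.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: lowercase word counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[tuple[str, float]]:
    # Score every passage against the query and keep the top k.
    q = embed(query)
    scored = [(doc_id, cosine(q, embed(text))) for doc_id, text in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

def build_prompt(query: str, corpus: dict[str, str], k: int = 2) -> str:
    # Concatenate retrieved passages, each tagged with the id the LLM should cite.
    hits = retrieve(query, corpus, k)
    context = "\n".join(f"[{doc_id}] {corpus[doc_id]}" for doc_id, _ in hits)
    return (
        "Answer using only the passages below and cite passage ids.\n"
        f"{context}\n\nQuestion: {query}"
    )

# Hypothetical three-passage corpus.
CORPUS = {
    "doc-1": "the vendor shipped a new crm release last quarter",
    "doc-2": "quarterly revenue grew in the analytics segment",
    "doc-3": "the office cafeteria menu changes weekly",
}

top_hit = retrieve("crm release last quarter", CORPUS, k=1)[0]
```

Note what the sketch does not do: nothing checks that the LLM's eventual citation matches a retrieved passage id. That gap is where Level 1 systems leak.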
Level 2: Entity-grounded (typed-document RAG)
Documents are pre-processed into typed entities · "Salesforce" tagged as a Vendor entity, "CRM" tagged as a Category entity, etc. Retrieval can filter by entity type, not just similarity. The citation popover shows the entity, not just the passage. Hallucination drops further (typically 8-14%) because the LLM is constrained to talk about entities it can name and link.
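A minimal sketch of what "filter by entity type" can look like, assuming passages have already been tagged upstream. The `Entity` and `Passage` shapes, the type names, and the ids are all hypothetical; a real system would run this filter alongside similarity scoring, not instead of it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    entity_id: str
    name: str
    entity_type: str  # e.g. "Vendor" or "Category" (assumed type names)

@dataclass(frozen=True)
class Passage:
    passage_id: str
    text: str
    entities: tuple[Entity, ...]

def filter_by_type(passages: list[Passage], entity_type: str) -> list[Passage]:
    # Keep passages that mention at least one entity of the requested type.
    return [p for p in passages if any(e.entity_type == entity_type for e in p.entities)]

def citation_for(passage: Passage, entity_type: str) -> dict:
    # The citation payload names the linked entities, not just the passage.
    linked = [e.entity_id for e in passage.entities if e.entity_type == entity_type]
    return {"passage_id": passage.passage_id, "entity_ids": linked}

salesforce = Entity("ent-001", "Salesforce", "Vendor")
crm = Entity("ent-002", "CRM", "Category")

passages = [
    Passage("p-1", "Salesforce announced pricing changes.", (salesforce,)),
    Passage("p-2", "The CRM category keeps consolidating.", (crm,)),
]

vendor_passages = filter_by_type(passages, "Vendor")
```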
Level 3: Graph-grounded (Graph RAG)
The retrieval traverses typed edges between entities · "What products did Salesforce ship last quarter" resolves via a graph walk from the Salesforce vendor node to its Product children to their Release events. Citations carry the full path. In Microsoft's GraphRAG evaluation,[8] graph-based retrieval was preferred over baseline vector RAG in roughly 70-80% of head-to-head comparisons on answer comprehensiveness for corpus-wide questions · note that the paper reports answer-quality win rates, not a directly measured hallucination rate. This is the bar PYRAMYD operates at, and it's the bar the EU AI Act's Article 50 transparency obligations[5] are converging on for regulated deployments.
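The vendor-to-product-to-release walk described above can be sketched in a few lines. This is a toy in-memory graph under stated assumptions: the `TypedGraph` class, the edge type names `HAS_PRODUCT` and `HAS_RELEASE`, and the node ids are hypothetical and say nothing about how any production graph store is implemented.

```python
from collections import defaultdict

class TypedGraph:
    def __init__(self) -> None:
        # (source node, edge type) -> list of target nodes
        self._edges: dict[tuple[str, str], list[str]] = defaultdict(list)

    def add_edge(self, source: str, edge_type: str, target: str) -> None:
        self._edges[(source, edge_type)].append(target)

    def walk(self, start: str, edge_types: list[str]) -> list[list[str]]:
        # Follow the edge types in order; return every complete path.
        paths = [[start]]
        for edge_type in edge_types:
            paths = [
                path + [nxt]
                for path in paths
                for nxt in self._edges.get((path[-1], edge_type), [])
            ]
        return paths

graph = TypedGraph()
graph.add_edge("vendor:salesforce", "HAS_PRODUCT", "product:example-a")
graph.add_edge("vendor:salesforce", "HAS_PRODUCT", "product:example-b")
graph.add_edge("product:example-a", "HAS_RELEASE", "release:example-a-q3")

# Each surviving path doubles as the citation: it records how the answer was reached.
paths = graph.walk("vendor:salesforce", ["HAS_PRODUCT", "HAS_RELEASE"])
citations = [" -> ".join(path) for path in paths]
```

A path that cannot be completed (here, the product with no release edge) simply drops out, so the system has nothing to cite for it rather than something to invent.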
- 41%: hallucination rate · model-grounded only
- 18-26%: document-grounded (vector RAG)
- <5%: graph-grounded · entity-linked citations
Section 2: What a real citation looks like
A citation is real if · and only if · all four of the following hold:
- 1. The source URL was actually fetched. The request shows up in the vendor's retrieval logs, with a timestamp that precedes the LLM call. If the vendor can't show you the log, the URL was generated, not retrieved.
- 2. The cited passage supports the claim. Gao et al.'s 2023 citation-quality benchmark[3] distinguishes "recall" (every claim has a citation) from "precision" (the cited passage actually supports the claim). Vendors love to report recall and stay silent on precision. Always ask for both.
- 3. The retrieval timestamp is recent. A citation to a page that was fetched 18 months ago, and whose content has since changed, is worse than no citation · it's a misleading one. Production-grade systems refresh on a defined cadence and surface the retrieval timestamp.
- 4. The provenance chain is auditable. NIST's AI Risk Management Framework[9] lists "accountable and transparent" among its seven characteristics of trustworthy AI, and provenance records are how a vendor demonstrates it. If the audit log can't show which model produced the answer, with which prompt, against which retrieved set, at which time, the system is not auditable.
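The four conditions above translate directly into a checklist a reviewer could run against a vendor's per-citation log. A minimal sketch, assuming a hypothetical `CitationRecord` shape; the field names and the 90-day freshness window are illustrative assumptions, not a standard, and the "supports the claim" flag has to come from a separate human or automated review.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class CitationRecord:
    url: str
    fetched_at: Optional[datetime]    # from the retrieval log; None if no log entry
    llm_called_at: datetime
    passage_supports_claim: bool      # outcome of a separate precision review
    model: Optional[str]
    prompt_hash: Optional[str]
    retrieved_set_id: Optional[str]

def citation_failures(
    c: CitationRecord,
    now: datetime,
    max_age: timedelta = timedelta(days=90),  # assumed freshness window
) -> list[str]:
    failures = []
    # 1. The fetch must be logged, and must precede the LLM call.
    if c.fetched_at is None or c.fetched_at >= c.llm_called_at:
        failures.append("url_not_fetched_before_llm_call")
    # 2. The cited passage must support the claim.
    if not c.passage_supports_claim:
        failures.append("passage_does_not_support_claim")
    # 3. The retrieval timestamp must exist and be recent.
    if c.fetched_at is None or now - c.fetched_at > max_age:
        failures.append("retrieval_timestamp_missing_or_stale")
    # 4. Model, prompt, and retrieved set must all be recorded.
    if not all([c.model, c.prompt_hash, c.retrieved_set_id]):
        failures.append("provenance_chain_incomplete")
    return failures

now = datetime(2026, 1, 15, tzinfo=timezone.utc)
record = CitationRecord(
    url="https://example.com/report",
    fetched_at=now - timedelta(days=2),
    llm_called_at=now - timedelta(days=2) + timedelta(seconds=5),
    passage_supports_claim=True,
    model="model-x",
    prompt_hash="abc123",
    retrieved_set_id="set-1",
)
failures = citation_failures(record, now)  # empty list means all four conditions hold
```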
Section 3: The 7 disclosure controls procurement should require
Beyond the citation mechanics, there are seven structural disclosures any AI procurement contract should include · most of them aligned with the EU AI Act Article 50[5] obligations and the voluntary NIST AI RMF[9] guidance:
- 1. Model card. Which foundation models are in the inference path, at which versions, in which regions. Contract should include 30-day notice on changes.
- 2. Retrieval source manifest. Every data source the system can pull from, with refresh cadence and last-updated timestamp. Updated quarterly minimum.
- 3. Citation precision SLA. Vendor commits to a measured precision rate (e.g., >90%) on a defined benchmark · with quarterly third-party audit rights.
- 4. Hallucination disclosure. Vendor's self-reported hallucination rate, plus the test methodology. If they don't publish one, assume it's in the Level 0-1 range.
- 5. Audit log retention. Per-answer logs (prompt, retrieval set, model, citation set, user) retained for the duration of the contract plus 7 years for regulated buyers.
- 6. Exportable provenance. Your tenant data · including all citations, retrieval logs, and the substrate graph · is exportable in machine-readable format at any time. No lock-in on the audit trail.
- 7. Bias and fairness statement. Disclosure of any known systematic bias in the training data, retrieval corpus, or ranking algorithms. NIST AI RMF is a voluntary framework, but it names "fair, with harmful bias managed" as one of its trustworthiness characteristics; we recommend this disclosure as table stakes for all buyers.
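For control 3, it helps to pin down exactly what is being measured before it goes into a contract. A minimal sketch using the definitions given earlier in this guide (recall: share of claims with at least one citation; precision: share of citations that actually support their claim). The `Claim` shape and the 0.90 threshold are illustrative assumptions; the support flags would come from human or third-party review.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    # One flag per attached citation: True if the cited passage supports the claim.
    citation_supports: list[bool] = field(default_factory=list)

def citation_recall(claims: list[Claim]) -> float:
    # Fraction of claims that carry at least one citation.
    if not claims:
        return 0.0
    return sum(1 for c in claims if c.citation_supports) / len(claims)

def citation_precision(claims: list[Claim]) -> float:
    # Fraction of all citations that support the claim they are attached to.
    flags = [flag for c in claims for flag in c.citation_supports]
    if not flags:
        return 0.0
    return sum(flags) / len(flags)

def meets_precision_sla(claims: list[Claim], min_precision: float = 0.90) -> bool:
    return citation_precision(claims) >= min_precision

sample = [
    Claim("Claim with two good citations", [True, True]),
    Claim("Claim with one citation that does not support it", [False]),
    Claim("Claim with no citation at all"),
]

recall = citation_recall(sample)        # 2 of 3 claims are cited
precision = citation_precision(sample)  # 2 of 3 citations support their claim
```

The sample shows why both numbers matter: a vendor could raise recall by attaching citations to everything, and precision would fall as unsupported citations pile up.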
Section 4: Why the substrate matters more than the LLM
A common procurement mistake: buyers grade vendors on the LLM they use ("they ship on Claude Opus 4.7, must be smart") instead of the substrate the LLM retrieves from. The substrate is the bottleneck.
A frontier-tier LLM grounded on a document-only retrieval layer can end up roughly equivalent to a mid-tier LLM grounded on a typed-graph layer · the Microsoft GraphRAG results[8] compared retrieval architectures and found large answer-quality differences on the corpus-wide question types enterprise buyers actually ask. Your AI procurement budget is better spent on substrate quality than on model premium.
Where this lands for PYRAMYD customers
PYRAMYD is graph-grounded by construction · every APEX answer traverses the typed 88-node graph, every citation carries the entity ID, the source URL, the retrieval timestamp, the model used, and the prompt hash. The full per-answer audit log is exportable. We publish our citation precision rate quarterly. This is the grounding tier the EU AI Act and NIST AI RMF are converging on · we're building for the bar that's coming, not the one that's comfortable.
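For buyers who want to picture what such a per-citation record might contain, here is a generic sketch built from the five fields listed above. It is not PYRAMYD's schema: the `AnswerAuditRecord` class, its field names, the SHA-256 choice for the prompt hash, and the JSON export format are all assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass
class AnswerAuditRecord:
    entity_id: str
    source_url: str
    retrieved_at: str   # ISO 8601 timestamp
    model: str
    prompt_hash: str

def hash_prompt(prompt: str) -> str:
    # SHA-256 is an assumed choice; it lets an auditor confirm a prompt without storing it inline.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def export_record(record: AnswerAuditRecord) -> str:
    # Machine-readable export; sorted keys keep output stable for diffing.
    return json.dumps(asdict(record), sort_keys=True)

audit_record = AnswerAuditRecord(
    entity_id="ent-001",
    source_url="https://example.com/report",
    retrieved_at="2026-01-15T09:30:00+00:00",
    model="model-x",
    prompt_hash=hash_prompt("What products did the vendor ship last quarter?"),
)

exported = export_record(audit_record)
```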
Gartner predicts 40% of agentic AI projects will be cancelled by end of 2027[6] · the largest single category of AI procurement risk in their TRiSM forecast. The cancellations won't be because the LLMs got worse · they'll be because the grounding underneath was Level 0 or Level 1 when the regulator, the auditor, or the CFO asked Level 3 questions. The field guide above is how you avoid that outcome.
References
- [1] Lin, S. et al., TruthfulQA: Measuring How Models Mimic Human Falsehoods, ACL 2022 · Foundational benchmark for LLM truthfulness · the best model tested was truthful on 58% of questions, against 94% for humans, which is the basis for the roughly 41% figure for model-only answers.
- [2] Liu, N. et al., Lost in the Middle: How Language Models Use Long Contexts, arXiv:2307.03172 (Jul 2023) · LLMs reliably attend to information at the start and end of context but lose middle-context information · the citation reliability problem that vendors hide.
- [3] Gao, T. et al., Enabling Large Language Models to Generate Text with Citations, EMNLP 2023 · Citation-quality benchmark · 'recall' (fraction of claims that have a citation) and 'precision' (fraction of citations that actually support the claim).
- [4] Forrester, The Cost of AI Trust Failures: 2026 Enterprise Survey (Jan 2026) · 42% of enterprises that deployed un-grounded LLM products have rolled them back. Top three reasons: false claims (61%), citation drift (37%), regulatory exposure (29%).
- [5] EU AI Act, Article 50 · Transparency Obligations for Certain AI Systems (effective Aug 2026) · Generated content must be machine-detectable; the user must be informed it's AI output. Citation provenance is the audit trail.
- [6] Gartner, AI Trust, Risk and Security Management (TRiSM) 2026 Forecast · Predicts 40% of agentic AI projects cancelled by end of 2027 due to inadequate grounding · the single largest 'AI procurement risk' category.
- [7] Bender, E. et al., On the Dangers of Stochastic Parrots, FAccT '21 · Foundational critique of un-grounded language models · the source of the 'stochastic parrot' framing that procurement teams now reference.
- [8] Edge, D. et al., From Local to Global: A Graph RAG Approach to Query-Focused Summarization, Microsoft Research, arXiv:2404.16130 (Apr 2024) · Graph-based retrieval wins roughly 70-80% of head-to-head comparisons against baseline vector RAG on answer comprehensiveness · the empirical case for graph-grounding.
- [9] NIST, AI Risk Management Framework 1.0 (Jan 2023) · Voluntary AI risk framework · "accountable and transparent" and "explainable and interpretable" are two of its seven trustworthiness characteristics.
- [10] Anthropic, Constitutional AI: Harmlessness from AI Feedback (2022) · Training method in which a model critiques and revises its own outputs against a written set of principles · it targets harmlessness rather than factual grounding, so model-only responses can still hallucinate.
