Klue's own published data[1] is widely cited because Klue is the document-centric CI incumbent and the figures come from its own customers. It shows battle cards refreshing every 60-90 days while competitors ship material changes every 11 days. That 6-to-8× refresh gap is the structural failure of document-driven competitive intelligence. The fix is not better battle cards. It's a live signal feed that the battle cards are derived from.
Battle cards lag reality by 6-8× because they're documents, not queries. The fix is a signal feed that's closer to the source · 1,000+ feeds, per-source cadences, entity resolution, and typed event nodes written back to the graph in near real time.
PYRAMYD's signal feed ingests 1,000+ source feeds across press releases, product changelogs, hiring posts, funding announcements, integration directories, review platforms, regulatory filings, and a long tail of vertical-specific sources. Per-source refresh cadences run from real-time (RSS-driven) to weekly (scraped). The output is typed event nodes in the graph · queryable, citable, dated, and FK-linked to the entities they describe.
Section 1: What "signal" means in the typed-graph world
In a document-grounded CI system, a "signal" is a unit of text · "Salesforce announced a new pricing tier today." The string is hard to query, hard to deduplicate, and hard to count over time. Hundreds of signals can describe the same underlying event; the analyst has to do the de-duplication mentally.
In a typed-graph system, a signal is a structured event with referential integrity:
- Event type · drawn from a closed vocabulary · Release, FundingRound, HiringSignal, PricingChange, PressArticle, ReviewBurst, IntegrationAnnouncement, ComplianceCertification, etc. (17 types across the schema).
- Subject entity · the Vendor, Product, Category, or Industry the event is about · resolved to the canonical node ID, not a string.
- Source citation · URL, retrieval timestamp, source confidence, source authority score.
- Structured payload · the fields specific to the event type (a Release has version + release notes URL; a FundingRound has round size + valuation + lead investor; etc.).
- Discovery timestamp · when the signal pipeline first saw it (which is what powers the "new this week" feeds and the diff-since-last-load comparisons).
Once signals are typed events with referential integrity, you can count them ("Salesforce shipped 14 releases this quarter"), aggregate them ("the CRM category had a 38% YoY increase in pricing changes"), filter them ("funding rounds led by Sequoia in the Cloud Infrastructure category, last 90 days"), and chart them. None of which the document layer supports.
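The event shape listed above can be sketched as a small Python dataclass. This is a minimal illustration, not PYRAMYD's schema: the names `SignalEvent`, `EventType`, and `count_by_type`, the field names, and the `vendor:salesforce` node ID format are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class EventType(Enum):
    # Three of the event types named in the text; the real vocabulary is larger.
    RELEASE = "Release"
    FUNDING_ROUND = "FundingRound"
    PRICING_CHANGE = "PricingChange"


@dataclass
class SignalEvent:
    event_type: EventType
    subject_id: str          # canonical node ID, not a display string (hypothetical format)
    source_url: str
    retrieved_at: datetime
    discovered_at: datetime
    payload: dict = field(default_factory=dict)  # fields specific to the event type


def count_by_type(events, subject_id):
    """Count events per type for one subject entity."""
    return Counter(e.event_type for e in events if e.subject_id == subject_id)


now = datetime(2026, 1, 15, tzinfo=timezone.utc)
events = [
    SignalEvent(EventType.RELEASE, "vendor:salesforce", "https://example.com/r1", now, now, {"version": "1.0"}),
    SignalEvent(EventType.RELEASE, "vendor:salesforce", "https://example.com/r2", now, now, {"version": "1.1"}),
    SignalEvent(EventType.PRICING_CHANGE, "vendor:hubspot", "https://example.com/p1", now, now),
]
release_count = count_by_type(events, "vendor:salesforce")[EventType.RELEASE]
```

Because the subject is an ID rather than a string, the count needs no fuzzy matching: two press articles about the same vendor land on the same `subject_id`.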
Section 2: The pipeline · five stages, exactly-once delivery
Stage 1: Source discovery
Each source is registered with a manifest · base URL, fetch method (RSS / API / scrape), polite-crawl cadence, robots.txt observance, expected payload shape, source authority score, and a per-source retry policy. The manifest follows Wojcik's industry-scale scraping reference[9] · robots.txt is observed, rate limits are respected, identification headers identify the crawler.
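A manifest entry and its robots.txt check might look roughly like the sketch below. The `SourceManifest` fields mirror the list above, but the field names, the 0-to-1 authority scale, and the `example-signal-bot` user agent are assumptions. The robots.txt check uses the standard library's `urllib.robotparser` against an already-fetched file body, so the example does no network I/O.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser


@dataclass
class SourceManifest:
    base_url: str
    fetch_method: str        # "rss", "api", or "scrape"
    cadence_seconds: int     # polite-crawl interval
    authority_score: float   # 0.0 to 1.0; the scale is an assumption
    max_retries: int = 3
    user_agent: str = "example-signal-bot"  # hypothetical crawler name


def allowed_by_robots(manifest: SourceManifest, robots_txt: str, path: str) -> bool:
    """Check a path against an already-fetched robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    url = manifest.base_url.rstrip("/") + path
    return parser.can_fetch(manifest.user_agent, url)


manifest = SourceManifest(
    base_url="https://example.com",
    fetch_method="scrape",
    cadence_seconds=7 * 24 * 3600,   # weekly, as for scraped sources
    authority_score=0.7,
)
robots = "User-agent: *\nDisallow: /private/\n"
```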
Stage 2: Fetch + content-hash dedup
Each fetch produces a raw payload that's content-hashed, an adaptation of the idempotency-key pattern described by Stripe.[4] If the hash matches a prior fetch's hash, the downstream pipeline short-circuits · no re-extraction, no re-entity-resolution, no graph write. This is what lets us register thousands of sources without proportional infrastructure cost · 90%+ of fetches hit the cache.
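The short-circuit can be sketched in a few lines. This is an in-memory illustration under stated assumptions: SHA-256 as the hash, one remembered hash per source URL, and the `FetchCache` class name are all choices made for the example, not details from the pipeline itself.

```python
import hashlib


class FetchCache:
    """Remembers the last content hash seen per source URL."""

    def __init__(self):
        self._last_hash = {}

    def is_new(self, source_url: str, payload: bytes) -> bool:
        digest = hashlib.sha256(payload).hexdigest()
        if self._last_hash.get(source_url) == digest:
            return False  # unchanged: skip extraction, resolution, and graph write
        self._last_hash[source_url] = digest
        return True


cache = FetchCache()
first = cache.is_new("https://example.com/changelog", b"v1.0 released")
repeat = cache.is_new("https://example.com/changelog", b"v1.0 released")
changed = cache.is_new("https://example.com/changelog", b"v1.1 released")
```

Only `first` and `changed` would go on to the extraction stage; `repeat` is a cache hit and costs one hash computation.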
Stage 3: LLM extraction
New payloads (cache misses) get extracted by a frontier-tier LLM into the typed-event schema. Extraction prompts are per-event-type · a release-extraction prompt is different from a funding-extraction prompt. The model returns a structured JSON object that gets validated against the schema; failures get logged and routed to a human queue for schema-extension review.
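The validation step after extraction could be sketched as follows. The LLM call itself is omitted; the function only checks the JSON the model returned. The JSON layout (`event_type` plus `payload`), the required-field names, and the function name `validate_extraction` are assumptions for illustration.

```python
import json

# Required payload fields per event type; the field names are illustrative.
REQUIRED_FIELDS = {
    "Release": {"version", "release_notes_url"},
    "FundingRound": {"round_size", "valuation", "lead_investor"},
}


def validate_extraction(raw_json: str):
    """Return (event, None) on success, or (None, reason) for the human queue."""
    try:
        event = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(event, dict):
        return None, "top-level value is not an object"
    event_type = event.get("event_type")
    if event_type not in REQUIRED_FIELDS:
        return None, f"unknown event type: {event_type!r}"
    payload = event.get("payload")
    if not isinstance(payload, dict):
        return None, "payload is not an object"
    missing = REQUIRED_FIELDS[event_type] - set(payload)
    if missing:
        return None, f"missing fields: {sorted(missing)}"
    return event, None


good, good_err = validate_extraction(
    '{"event_type": "Release", "payload": {"version": "2.1", '
    '"release_notes_url": "https://example.com/notes"}}'
)
bad, bad_err = validate_extraction(
    '{"event_type": "FundingRound", "payload": {"round_size": 50000000}}'
)
```

An unknown event type is reported separately from a missing field, which matters if the human queue is also where new event types get proposed.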
Stage 4: Entity resolution
The hardest stage · noted as such in the Amazon/Google/eBay/Facebook/IBM/Microsoft retrospective on industry-scale KGs.[3] A press release that says "Salesforce announced..." has to resolve to the canonical Salesforce vendor node, not a duplicate "Salesforce, Inc." node. We use a three-step resolver:
- Exact-match alias table. ~14,000 known surface forms → canonical node IDs (built up over time, hand-curated for the top-1K vendors).
- Embedding similarity search. If the surface form isn't in the alias table, we look up the nearest vendor by embedding similarity (gemini-embedding-2, 1536-dim).
- LLM tiebreaker. If embedding similarity returns multiple plausible candidates, an LLM judge picks one with explicit reasoning, and the choice is logged so future runs can short-circuit.
Q1 2026 telemetry[8] reports 96% entity-resolution accuracy on a manually graded sample of 500 random signals. Misses get routed to the human queue.
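The three-step resolver can be sketched as a single function. Everything here is a simplified stand-in: the embedding function and the LLM judge are passed in as plain callables (`embed`, `tiebreak`), the vectors are toy 2-dimensional lists, and the `threshold` and `margin` values are assumptions rather than tuned settings.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def resolve(surface, aliases, node_vectors, embed, tiebreak, threshold=0.8, margin=0.05):
    """Return (node_id, path); path is 'alias', 'embedding', 'tiebreaker', or 'unresolved'."""
    key = surface.strip().lower()
    if key in aliases:                                   # step 1: exact-match alias table
        return aliases[key], "alias"
    query = embed(surface)                               # step 2: embedding similarity
    scored = sorted(
        ((cosine(query, vec), node_id) for node_id, vec in node_vectors.items()),
        reverse=True,
    )
    candidates = [
        node_id for score, node_id in scored
        if score >= threshold and scored[0][0] - score <= margin
    ]
    if not candidates:
        return None, "unresolved"                        # would go to the human queue
    if len(candidates) == 1:
        choice, path = candidates[0], "embedding"
    else:                                                # step 3: judge picks among near-ties
        choice, path = tiebreak(surface, candidates), "tiebreaker"
    aliases[key] = choice                                # log so future runs short-circuit
    return choice, path


aliases = {"salesforce": "vendor:salesforce"}
node_vectors = {"vendor:salesforce": [1.0, 0.0], "vendor:hubspot": [0.0, 1.0]}
toy_embed = lambda text: [0.9, 0.1]                      # stand-in for a real embedding model
pick_first = lambda text, options: options[0]            # stand-in for the LLM judge
first = resolve("Salesforce, Inc.", aliases, node_vectors, toy_embed, pick_first)
second = resolve("Salesforce, Inc.", aliases, node_vectors, toy_embed, pick_first)
```

The first call resolves through the embedding step and writes the new surface form into the alias table, so the second call for the same string returns from step 1.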
Stage 5: Typed write into the graph
Resolved events are written as typed event nodes with FK edges to the subject entity, the source, the discovery timestamp, and any aggregation buckets (week / month / quarter / category / industry / country). The write uses Apache Kafka exactly-once semantics[10] to avoid double-writes when the pipeline retries.
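The property that matters at this stage is that a retried write must not create a second node. The sketch below does not use Kafka; it illustrates that property with a swapped-in technique, an idempotent upsert keyed on a deterministic event ID. The `GraphStore` class is an in-memory stand-in, and the key derivation (event type, subject, source URL) is an assumption chosen for the example.

```python
import hashlib


def event_key(event_type: str, subject_id: str, source_url: str) -> str:
    """Deterministic ID: the same event always maps to the same key."""
    raw = "|".join((event_type, subject_id, source_url))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class GraphStore:
    """In-memory stand-in for the graph; not a real graph database client."""

    def __init__(self):
        self.nodes = {}

    def upsert_event(self, event_type, subject_id, source_url, payload) -> bool:
        key = event_key(event_type, subject_id, source_url)
        if key in self.nodes:
            return False  # retry of an earlier write: no second node is created
        self.nodes[key] = {
            "event_type": event_type,
            "subject_id": subject_id,   # FK edge to the subject entity
            "source_url": source_url,   # FK edge to the source
            "payload": payload,
        }
        return True


store = GraphStore()
args = ("Release", "vendor:salesforce", "https://example.com/r1", {"version": "1.0"})
written = store.upsert_event(*args)
retried = store.upsert_event(*args)   # simulates a pipeline retry
```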
1,000+ registered source feeds · ~14,800 signal events ingested per day · 38s median source-to-graph latency
Section 3: The source manifest · what's in the 1,000
The 1,000+ sources are not uniform. They're weighted by signal-to-noise ratio and by authority. A summary of the cohorts:
- Press & comms · 220 sources · vendor press pages, PRWeb / Business Wire / GlobeNewswire feeds, top-tier tech press (TechCrunch, The Information, Ars Technica). Refresh cadence: real-time RSS.
- Product changelogs · 280 sources · public changelog pages of the top enterprise software vendors. Refresh cadence: daily.
- Hiring signals · 60 sources · LinkedIn job posts (rate-limited), vendor careers pages, Greenhouse/Lever-hosted job boards. Refresh cadence: weekly.
- Funding & M&A · 30 sources · Crunchbase, PitchBook public feeds, SEC filings, regional regulatory disclosures. Refresh cadence: real-time on filings, daily on aggregators.
- Reviews · 14 platforms · G2, TrustRadius, Capterra, GetApp, Software Advice, ProductHunt, plus regional review platforms. Refresh cadence: weekly.
- Integration & marketplace · 180 sources · Salesforce AppExchange, HubSpot Ecosystem, Slack App Directory, Atlassian Marketplace, etc. Refresh cadence: weekly.
- Regulatory & compliance · 40 sources · FedRAMP marketplace, ISO register lookups, SOC reports as publicly disclosed, GDPR compliance attestations. Refresh cadence: monthly.
- Vertical sources · ~180 sources · industry-specific news / regulatory / analyst feeds (HIMSS for healthcare, FedScoop for federal, etc.).
Section 4: How operators actually consume the feed
Forrester's 2025 CI report[2] notes that best-in-class CI teams now treat the signal feed as the primary surface and battle cards as derived artifacts. That's the architectural commitment PYRAMYD ships · signals are first-class, documents are materialized views.
Section 5: The audit log · every signal traceable back to source
Every signal in the graph carries the full audit chain: source URL, retrieval timestamp, extraction model + version, prompt hash, entity-resolution path (alias / embedding / tiebreaker), discovery timestamp, and any subsequent enrichment. Click any signal in the UI and you can trace back to the raw payload that produced it. This is what makes the feed survive a compliance audit · and what makes it actually trustworthy when a CFO asks "where does this number come from?"
It also matters for the model layer. Every signal that feeds into an APEX answer gets cited with the same audit chain. If an APEX answer says "Salesforce shipped a new pricing tier this week," the user can click the citation and land on the press release the extractor read. The audit chain is what makes the LLM's claim verifiable in seconds rather than "trust me."
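The audit chain described in this section maps naturally onto an immutable record attached to each signal. The sketch below is one possible shape, with assumed names throughout: `AuditChain`, `hash_prompt`, `citation`, the placeholder model name, and the use of SHA-256 for the prompt hash are illustrative, not taken from the product.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditChain:
    source_url: str
    retrieved_at: str        # ISO 8601 timestamp
    extraction_model: str
    model_version: str
    prompt_hash: str
    resolution_path: str     # "alias", "embedding", or "tiebreaker"
    discovered_at: str       # ISO 8601 timestamp


def hash_prompt(prompt: str) -> str:
    """Store a hash rather than the prompt text, so prompt changes stay detectable."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def citation(chain: AuditChain) -> str:
    """Render the human-readable citation attached to an answer."""
    return (
        f"{chain.source_url} (retrieved {chain.retrieved_at}, "
        f"resolved via {chain.resolution_path})"
    )


chain = AuditChain(
    source_url="https://example.com/press/pricing",
    retrieved_at="2026-01-15T09:00:00Z",
    extraction_model="example-model",     # placeholder, not a real model name
    model_version="1",
    prompt_hash=hash_prompt("Extract a PricingChange event from the text below."),
    resolution_path="alias",
    discovered_at="2026-01-15T09:00:38Z",
)
rendered = citation(chain)
```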
Where this lands for PYRAMYD customers
The signal feed is the live-data layer PYRAMYD ships underneath the typed graph · 1,000+ source feeds, ~14,800 events ingested per day at steady state, 96% entity-resolution accuracy, 38-second median source-to-graph latency, full audit chain back to source. It's what makes the 360 pages current, the battle cards refresh by default, and the APEX answers grounded in this week's reality instead of last quarter's document.
The deeper claim: signals were always the right primitive. Document-centric CI made them second-class because the only available substrate was text. With a typed graph in the middle, signals can be first-class · counted, aggregated, filtered, charted, audited. The battle cards in the workspace become derived views, refreshed weekly because the signal feed underneath refreshes continuously. That's the 6-to-8× refresh gap closed by construction.
References
- [1] Klue, Battlecard Refresh Cadence Report 2024 · Klue's own data · battle cards refresh every 60-90 days, but competitors ship material changes every 11 days. The 6-to-8× gap is the structural failure of document-centric CI.
- [2] Forrester, The State of Competitive Intelligence 2025 (Aug 2025) · Best-in-class CI teams now treat the signal feed as the primary surface · battle cards are derived artifacts. Document-driven CI is being replaced by signal-driven CI.
- [3] Noy, N. et al., Industry-scale Knowledge Graphs: Lessons and Challenges, Communications of the ACM, 62(8), 36-43 (Aug 2019) · Entity resolution is the hardest part of any signal-ingestion pipeline · the Amazon/Google/eBay/Facebook/IBM/Microsoft retrospective on industry-scale KG construction.
- [4] Stripe Engineering, Idempotency at Scale (2020) · Idempotency-key pattern · adapted as content-hash dedup at the source level so the same URL isn't reprocessed every refresh cycle.
- [5] Google SRE Book, Chapter 22: Addressing Cascading Failures · Reference for circuit breakers and rate limits · how the signal feed avoids hammering downstream sources when one cohort spikes.
- [6] Lewis, P. et al., Retrieval-Augmented Generation, NeurIPS 2020 · The pattern signal-feed analysts use after extraction · RAG-style retrieval over the structured signal events.
- [7] Edge, D. et al., GraphRAG, Microsoft Research arXiv:2404.16130 (Apr 2024) · Graph-grounded retrieval · the architectural pattern signals are written into · typed event nodes with FK edges back to the entities they describe.
- [8] PYRAMYD internal, Signal Feed Telemetry · Q1 2026 · ~14,800 signal events ingested per day at steady state, ~96% entity-resolution accuracy, median 38 seconds source-to-graph latency.
- [9] Wojcik, S., Web Scraping at Industry Scale: Ethics and Architecture (Pew Research engineering note, 2022) · Reference for robots.txt observance, polite crawling, and the legal frame for source ingestion · what serious signal pipelines have to operate under.
- [10] Apache Kafka Documentation · Exactly-Once Semantics · Foundational reference for how the signal-event bus avoids double-write into the graph · exactly-once delivery is the contract.
