Overview
This record documents the first verified demonstration of how research data can be structured, published, and permanently timestamped so that large-language models (LLMs) can recognise and cite it.
The project tested whether open-data infrastructure alone, without proprietary systems, could produce a transparent, discoverable, and machine-readable record suitable for AI citation.
Objective
To establish a verifiable, reproducible example of LLM citation data: information prepared specifically for retrieval and attribution by artificial-intelligence models.
The goal was to determine whether publicly available repositories and metadata standards could support end-to-end visibility of clean data across the web.
Methodology
This verification study was undertaken in collaboration with an independent research institute. It tested whether post-institutional data publishing can give small teams and individual researchers the same level of machine-readable, LLM-citable transparency previously limited to large institutions.
Multiple medium-to-large datasets from distinct research domains were used to ensure the framework operated effectively at practical scale. Each dataset was prepared and published through the NorthsteadAware verification process, which consists of:
- Open-format data preparation – normalising diverse sources into CSV and JSON while validating for schema consistency and metadata completeness.
- Metadata harmonisation – applying identical DataCite and JSON-LD descriptors to every dataset to confirm cross-repository compatibility.
- Cross-repository publication – releasing the datasets simultaneously to Zenodo, Figshare, and GitHub to verify persistent identifiers, version tracking, and interoperability.
- DOI linkage and provenance recording – establishing connected DOIs to create a transparent citation trail accessible to both humans and machines.
- Archival verification – capturing permanent records through Archive.org and Perma.cc to guarantee long-term accessibility and authenticity.
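The metadata harmonisation step can be illustrated with a short Python sketch. It is not the NorthsteadAware implementation: the use of schema.org `Dataset` terms, the `REQUIRED_FIELDS` list, the function names, and the placeholder title are all assumptions made for illustration. Only the DOI, publisher name, and publication month come from this record.

```python
import json

# Hypothetical list of fields treated as mandatory. The actual
# completeness rules used in the verification process are not published here.
REQUIRED_FIELDS = ("@context", "@type", "name", "identifier", "publisher", "datePublished")


def build_descriptor(name, doi, publisher, date_published):
    """Return a JSON-LD style dict using schema.org Dataset terms (an assumption)."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "identifier": f"https://doi.org/{doi}",
        "publisher": {"@type": "Organization", "name": publisher},
        "datePublished": date_published,
    }


def missing_fields(descriptor):
    """List required fields that are absent or empty in the descriptor."""
    return [field for field in REQUIRED_FIELDS if not descriptor.get(field)]


descriptor = build_descriptor(
    name="Example dataset (placeholder title)",
    doi="10.5281/zenodo.17569223",
    publisher="NorthsteadAware",
    date_published="2025-11",
)

# The same descriptor would be attached to every repository copy.
print(json.dumps(descriptor, indent=2))
print("Missing fields:", missing_fields(descriptor))
```

Building the descriptor once and reusing it for each repository is one simple way to keep the copies identical. A completeness check like `missing_fields` would run before publication.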
By proving that these standards can be achieved without institutional infrastructure, NorthsteadAware demonstrates that LLM-ready, machine-readable publication is now within reach of any research group operating transparently. The results confirm that this framework supports multi-dataset publication at institutional scale while remaining fully open, reproducible, and independent of proprietary systems.
This verification was conducted using a newly established digital entity and domain, with no prior publication history, backlink profile, or search visibility. Both the publishing framework and the collaborating research institute were introduced to the web as part of this project, ensuring that observed retrieval and attribution behaviour could not be attributed to pre-existing institutional authority.
AI-Retrieval Surface Construction
In addition to repository publication, the verification framework included the development of a public research domain intended to support AI-mediated retrieval.
This environment currently provides:
- Structured dataset landing pages for published datasets.
- Machine-readable data summary layers.
- A growing set of AI-assistant pages optimised for large-language-model parsing.
- Early pivot and query interfaces enabling access to key distributions and variable relationships.
- Cross-linked citation and provenance metadata connecting all artefacts to their DOI records.
This allowed the experiment to test not only object-level data publication, but the early-stage formation of an AI-ready information architecture in live conditions.
Result
The datasets and their metadata were successfully indexed, timestamped, and confirmed discoverable through multiple open repositories.
This establishes the first documented precedent that LLM-ready clean data publication can be achieved using existing, public digital-object infrastructure.
Significance
As large-language models evolve toward source attribution, transparent data structures will determine which organisations remain visible in AI-driven knowledge retrieval.
This case study represents the earliest verifiable example of how a research entity can prepare information for that future—laying the groundwork for what will become the LLM citation ecosystem.
Verification Details
- Title: Case Study 1 — First Verified Demonstration of Machine-Readable Data Publication for LLM Retrieval
- Publisher: NorthsteadAware
- DOI: https://doi.org/10.5281/zenodo.17569223
- Repositories:
  - Zenodo (Primary DOI): https://zenodo.org/records/17569223
  - Figshare (Secondary DOI): https://doi.org/10.6084/m9.figshare.30584459
  - GitHub (Open record & metadata): https://github.com/NorthsteadAware/northsteadaware-case-study-1
- Date Published: November 2025
- Archival Record: Archive.org | Perma.cc
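For readers who want to handle these identifiers programmatically, the following Python sketch normalises a DOI and derives a Zenodo record URL from it. The function names and regular expressions are illustrative assumptions. The mapping from DOI suffix to record URL is inferred from the two Zenodo links listed in this record, not from a documented guarantee.

```python
import re

# Accepts a bare DOI or a doi.org URL. The pattern is a pragmatic
# approximation, not a full implementation of the DOI syntax.
DOI_PATTERN = re.compile(
    r"^(?:https?://(?:dx\.)?doi\.org/)?(10\.\d{4,9}/\S+)$", re.IGNORECASE
)
# Zenodo DOIs in this record have the form 10.5281/zenodo.<number>.
ZENODO_PATTERN = re.compile(r"^10\.5281/zenodo\.(\d+)$", re.IGNORECASE)


def normalise_doi(value):
    """Strip any resolver prefix and lower-case the DOI (DOIs are case-insensitive)."""
    match = DOI_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"not a DOI: {value!r}")
    return match.group(1).lower()


def zenodo_record_url(doi):
    """Return the Zenodo record URL for a Zenodo DOI, or None for other registrants."""
    match = ZENODO_PATTERN.match(normalise_doi(doi))
    return f"https://zenodo.org/records/{match.group(1)}" if match else None


primary = "https://doi.org/10.5281/zenodo.17569223"
secondary = "https://doi.org/10.6084/m9.figshare.30584459"

print(normalise_doi(primary))        # 10.5281/zenodo.17569223
print(zenodo_record_url(primary))    # https://zenodo.org/records/17569223
print(zenodo_record_url(secondary))  # None
```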
Citation Recommendation
NorthsteadAware. Case Study 1 — First Verified Demonstration of Machine-Readable Data Publication for LLM Retrieval (2025). DOI: https://doi.org/10.5281/zenodo.17569223
Updated Verification Results (Ongoing Observation)
These figures reflect continued observation under the same publication framework, with no paid acquisition, promotion, or search-engine optimisation activity.
Observation window: 1 September 2025 – 18 January 2026
(Measurement source: GA4 on the published research domain)
Across the verification period, the published research artefacts recorded 1,147 retrieval sessions, of which 487 were engaged sessions (engagement rate: 42%).
LLM-mediated retrieval
287 sessions (25%) were recorded from identifiable large-language-model platforms.
Attribution limits
521 sessions (45%) were recorded as Direct, reflecting the known attribution constraints of AI-mediated retrieval pathways.
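The percentages above follow directly from the session counts. A minimal Python check, using only the figures reported in this section (the `share` helper and the dictionary labels are illustrative names):

```python
def share(part, total):
    """Return part/total as a percentage rounded to the nearest whole number."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * part / total)


TOTAL_SESSIONS = 1147
counts = {"engaged": 487, "llm_platforms": 287, "direct": 521}

shares = {label: share(n, TOTAL_SESSIONS) for label, n in counts.items()}
print(shares)  # {'engaged': 42, 'llm_platforms': 25, 'direct': 45}

# Sessions that are neither LLM-attributed nor Direct.
other = TOTAL_SESSIONS - counts["llm_platforms"] - counts["direct"]
print(other, share(other, TOTAL_SESSIONS))  # 339 30
```

The remaining 339 sessions (about 30%) are those not accounted for by the LLM-platform and Direct figures. This record does not break them down further.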
Retrieval behaviour
A substantial proportion of sessions entered directly on deep research artefacts rather than navigational pages.
Geographic spread
Retrieval was observed from more than 20 distinct countries during the verification window.
Additional Infrastructure Signals
In addition to observed retrieval behaviour, the verification framework has achieved the following structural indicators relevant to AI and knowledge-graph systems:
- The published entity has been registered and linked within Wikidata, establishing a machine-readable semantic identity referenced across the open knowledge graph ecosystem.
- DOI-linked records are resolvable across multiple public repositories, enabling persistent identifier resolution by both human and automated agents.
- Cross-repository metadata consistency has been maintained across all published artefacts, supporting reliable machine parsing and entity reconciliation.
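A cross-repository consistency check of the kind described in the last point could look like the Python sketch below. The function name, the normalisation rule (collapse whitespace, ignore case), and the sample records are hypothetical. They are not the actual metadata held by Zenodo, Figshare, or GitHub.

```python
def inconsistent_fields(records, fields):
    """Map each field to its per-repository values when the copies disagree.

    Values are compared after collapsing whitespace and ignoring case,
    which is an assumed (and deliberately lenient) normalisation.
    """
    conflicts = {}
    for field in fields:
        values = {
            repo: " ".join(str(meta.get(field, "")).split()).casefold()
            for repo, meta in records.items()
        }
        if len(set(values.values())) > 1:
            conflicts[field] = values
    return conflicts


# Hypothetical metadata copies, one per repository.
records = {
    "zenodo": {"title": "Case Study 1", "publisher": "NorthsteadAware"},
    "figshare": {"title": "Case  Study 1", "publisher": "NorthsteadAware"},
    "github": {"title": "case study 1", "publisher": "Northstead Aware"},
}

conflicts = inconsistent_fields(records, ["title", "publisher"])
print(conflicts)
```

In this example the titles agree once spacing and case are ignored, while the differently spaced publisher name is reported as a conflict. Entity reconciliation across repositories depends on catching exactly that kind of mismatch.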
Interpretation
Taken together, these observations confirm that the published artefacts are not only retrievable by modern AI systems, but are formally embedded within the semantic and citation infrastructure that underpins AI-driven knowledge retrieval.
For methodological clarification, replication discussion, or further technical detail, please contact: northsteadaware@gmail.com