Web 3.0 Requires Data Integrity

If you’ve ever taken a computer security class, you’ve probably learned about the three legs of computer security—confidentiality, integrity, and availability—known as the CIA triad. When we talk about a system being secure, that’s what we’re referring to. All are important, but to different degrees in different contexts. In a world populated by artificial intelligence (AI) systems and artificial intelligent agents, integrity will be paramount.

What is data integrity? It’s ensuring that no one can modify data—that’s the security angle—but it’s much more than that. It encompasses accuracy, completeness, and quality of data—all over both time and space. It’s preventing accidental data loss; the “undo” button is a primitive integrity measure. It’s also making sure that data is accurate when it’s collected—that it comes from a trustworthy source, that nothing important is missing, and that it doesn’t change as it moves from format to format. The ability to restart your computer is another integrity measure.

The CIA triad has evolved with the Internet. The first iteration of the Web—Web 1.0 of the 1990s and early 2000s—prioritized availability. This era saw organizations and individuals rush to digitize their content, creating what has become an unprecedented repository of human knowledge. Organizations worldwide established their digital presence, leading to massive digitization projects where quantity took precedence over quality. The emphasis on making information available overshadowed other concerns.

As Web technologies matured, the focus shifted to protecting the vast amounts of data flowing through online systems. This is Web 2.0: the Internet of today. Interactive features and user-generated content transformed the Web from a read-only medium to a participatory platform. The increase in personal data, and the emergence of interactive platforms for e-commerce, social media, and online everything demanded both data protection and user privacy. Confidentiality became paramount.

We stand at the threshold of a new Web paradigm: Web 3.0. This is a distributed, decentralized, intelligent Web. Peer-to-peer social-networking systems promise to break the tech monopolies’ control on how we interact with each other. Tim Berners-Lee’s open W3C protocol, Solid, represents a fundamental shift in how we think about data ownership and control. A future filled with AI agents requires verifiable, trustworthy personal data and computation. In this world, data integrity takes center stage.

For example, the 5G communications revolution isn’t just about faster access to videos; it’s about Internet-connected things talking to other Internet-connected things without our intervention. Without data integrity, there’s no real-time car-to-car communication about road movements and conditions. There’s no drone swarm coordination, smart power grid, or reliable mesh networking. And there’s no way to securely empower AI agents.

In particular, AI systems require robust integrity controls because of how they process data. This means technical controls to ensure data is accurate, that its meaning is preserved as it is processed, that it produces reliable results, and that humans can reliably alter it when it’s wrong. Just as a scientific instrument must be calibrated to measure reality accurately, AI systems need integrity controls that preserve the connection between their data and ground truth.

This goes beyond preventing data tampering. It means building systems that maintain verifiable chains of trust between their inputs, processing, and outputs, so humans can understand and validate what the AI is doing. AI systems need clean, consistent, and verifiable control processes to learn and make decisions effectively. Without this foundation of verifiable truth, AI systems risk becoming a series of opaque boxes.
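
As a minimal sketch of what such a chain of trust could look like (illustrative only; the record fields and the SHA-256 linking are our assumptions, not a prescribed standard), each processing step can commit to a hash of its inputs, its outputs, and the previous record, so tampering anywhere along the pipeline becomes detectable:

```python
import hashlib
import json
from datetime import datetime, timezone

def record_step(prev_hash: str, step: str, inputs: bytes, outputs: bytes) -> dict:
    """Create a provenance record that commits to this step's inputs and outputs
    and chains back to the previous record; altering any earlier record or
    artifact breaks every hash that follows."""
    record = {
        "step": step,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input_sha256": hashlib.sha256(inputs).hexdigest(),
        "output_sha256": hashlib.sha256(outputs).hexdigest(),
        "prev_record_sha256": prev_hash,
    }
    record["record_sha256"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record

# Chain data collection -> training (toy artifacts, for illustration only).
raw = b"sensor readings ..."
weights = b"model weights ..."
collected = record_step("0" * 64, "collect", raw, raw)
trained = record_step(collected["record_sha256"], "train", raw, weights)
```

In a real deployment each record would also be signed, for example with a key tied to the issuer’s identity, rather than relying on hashes alone.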

Recent history provides many sobering examples of integrity failures that undermine public trust in AI systems. Machine-learning (ML) models trained carelessly on expansive datasets have produced predictably biased results in hiring systems. Autonomous vehicles acting on incorrect data have made incorrect—and fatal—decisions. Medical diagnosis systems have given flawed recommendations without being able to explain themselves. A lack of integrity controls undermines AI systems and harms the people who depend on them.

They also highlight how AI integrity failures can manifest at multiple levels of system operation. At the training level, data may be subtly corrupted or biased even before model development begins. At the model level, mathematical foundations and training processes can introduce new integrity issues even with clean data. During execution, environmental changes and runtime modifications can corrupt previously valid models. And at the output level, the challenge of verifying AI-generated content and tracking it through system chains creates new integrity concerns. Each level compounds the challenges of the ones before it, ultimately manifesting in human costs, such as reinforced biases and diminished agency.

Think of it like protecting a house. You don’t just lock a door; you also use safe concrete foundations, sturdy framing, a durable roof, secure double-pane windows, and maybe motion-sensor cameras. Similarly, we need digital security at every layer to ensure the whole system can be trusted.

This layered approach to understanding security becomes increasingly critical as AI systems grow in complexity and autonomy, particularly with large language models (LLMs) and deep-learning systems making high-stakes decisions. We need to verify the integrity of each layer when building and deploying digital systems that impact human lives and societal outcomes.

At the foundation level, bits are stored in computer hardware. This represents the most basic encoding of our data, model weights, and computational instructions. The next layer up is the file system architecture: the way those binary sequences are organized into structured files and directories that a computer can efficiently access and process. In AI systems, this includes how we store and organize training data, model checkpoints, and hyperparameter configurations.

On top of that are the application layers—the programs and frameworks, such as PyTorch and TensorFlow, that allow us to train models, process data, and generate outputs. This layer handles the complex mathematics of neural networks, gradient descent, and other ML operations.

Finally, at the user-interface level, we have visualization and interaction systems—what humans actually see and engage with. For AI systems, this could be everything from confidence scores and prediction probabilities to generated text and images or autonomous robot movements.

Why does this layered perspective matter? Vulnerabilities and integrity issues can manifest at any level, so understanding these layers helps security experts and AI researchers perform comprehensive threat modeling. This enables the implementation of defense-in-depth strategies—from cryptographic verification of training data to robust model architectures to interpretable outputs. This multi-layered security approach becomes especially crucial as AI systems take on more autonomous decision-making roles in critical domains such as healthcare, finance, and public safety. We must ensure integrity and reliability at every level of the stack.
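
As one concrete example at the storage layer (a sketch under assumed file names and layout, not a built-in feature of any ML framework), a pipeline can keep a manifest of SHA-256 digests for its training data and model checkpoints and refuse to load any artifact whose digest no longer matches:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the artifacts whose current digest differs from the recorded one."""
    return [
        name for name, expected in manifest.items()
        if sha256_file(root / name) != expected
    ]

# The manifest would be written when the artifacts are produced and stored
# (or signed) out of band, so an attacker can't silently rewrite both.
# Example usage (file names and digests are placeholders):
#   manifest = {"train.csv": "9f2b...", "checkpoint-0001.pt": "c41a..."}
#   tampered = verify_manifest(manifest, Path("artifacts"))
#   if tampered:
#       raise RuntimeError(f"integrity check failed: {tampered}")
```

The same pattern extends upward: the application layer can check the manifest before training, and the interface layer can report which model version, identified by digest, produced a given output.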

The risks of deploying AI without proper integrity control measures are severe and often underappreciated. When AI systems operate without sufficient security measures to handle corrupted or manipulated data, they can produce subtly flawed outputs that appear valid on the surface. The failures can cascade through interconnected systems, amplifying errors and biases. Without proper integrity controls, an AI system might train on polluted data, make decisions based on misleading assumptions, or have outputs altered without detection. The results of this can range from degraded performance to catastrophic failures.

We see four areas where integrity is paramount in this Web 3.0 world. The first is granular access, which allows users and organizations to maintain precise control over who can access and modify what information and for what purposes. The second is authentication—much more nuanced than the simple “Who are you?” authentication mechanisms of today—which ensures that data access is properly verified and authorized at every step. The third is transparent data ownership, which allows data owners to know when and how their data is used and creates an auditable trail of data provenance. Finally, the fourth is access standardization: common interfaces and protocols that enable consistent data access while maintaining security.
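
To make “granular access” and the “auditable trail” concrete, here is a toy sketch (the policy shape and field names are our invention, purely illustrative): access is granted per requester, per data field, and per declared purpose, and every decision is logged:

```python
from dataclasses import dataclass, field

@dataclass
class Grant:
    requester: str        # who may access the data
    fields: set[str]      # which parts of the record they may see
    purposes: set[str]    # for which declared purposes

@dataclass
class Policy:
    grants: list[Grant] = field(default_factory=list)
    audit_log: list[tuple[str, str, str, bool]] = field(default_factory=list)

    def allow(self, requester: str, data_field: str, purpose: str) -> bool:
        """Grant access only if some grant covers this requester, field, and
        purpose, and record the decision either way for later audit."""
        decision = any(
            requester == g.requester
            and data_field in g.fields
            and purpose in g.purposes
            for g in self.grants
        )
        self.audit_log.append((requester, data_field, purpose, decision))
        return decision

policy = Policy(grants=[Grant("medical-ai", {"blood_pressure"}, {"diagnosis"})])
assert policy.allow("medical-ai", "blood_pressure", "diagnosis")
assert not policy.allow("medical-ai", "blood_pressure", "advertising")
```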

Luckily, we’re not starting from scratch. There are open W3C protocols that address some of this: decentralized identifiers for verifiable digital identity, the verifiable credentials data model for expressing digital credentials, ActivityPub for decentralized social networking (that’s what Mastodon uses), Solid for distributed data storage and retrieval, and WebAuthn for strong authentication standards. By providing standardized ways to verify data provenance and maintain data integrity throughout its lifecycle, Web 3.0 creates the trusted environment that AI systems require to operate reliably. This architectural shift, which puts integrity control in the hands of users, helps ensure that data remains trustworthy from generation and collection through processing and storage.
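
As a small illustration of the verifiable credentials data model mentioned above (simplified, with made-up identifiers and a placeholder signature), a credential bundles a claim, its issuer, and a cryptographic proof that anyone can check against the issuer’s public key, resolvable from the decentralized identifier:

```python
# A simplified W3C Verifiable Credential, shown as a Python dict for illustration.
# The DIDs, digest, and signature below are placeholders, not real values.
credential = {
    "@context": ["https://www.w3.org/2018/credentials/v1"],
    "type": ["VerifiableCredential"],
    "issuer": "did:example:data-collector",
    "issuanceDate": "2025-04-03T00:00:00Z",
    "credentialSubject": {
        "id": "did:example:training-dataset",
        # a provenance claim: the dataset's digest at collection time
        "datasetSha256": "placeholder-digest",
    },
    "proof": {
        "type": "Ed25519Signature2020",
        "verificationMethod": "did:example:data-collector#key-1",
        "proofValue": "placeholder-signature",
    },
}
```

A downstream AI pipeline can verify the proof before training on the dataset, turning “where did this data come from?” into a checkable question rather than an act of faith.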

Integrity is essential to trust, on both technical and human levels. Looking forward, integrity controls will fundamentally shape AI development by moving from optional features to core architectural requirements, much as SSL certificates evolved from a banking luxury to a baseline expectation for any Web service.

Web 3.0 protocols can build integrity controls into their foundation, creating a more reliable infrastructure for AI systems. Today, we take availability for granted; anything less than 100% uptime for critical websites is intolerable. In the future, we will need the same assurances for integrity. Success will require following practical guidelines for maintaining data integrity throughout the AI lifecycle—from data collection through model training and finally to deployment, use, and evolution. These guidelines will address not just technical controls but also governance structures and human oversight, similar to how privacy policies evolved from legal boilerplate into comprehensive frameworks for data stewardship. Common standards and protocols, developed through industry collaboration and regulatory frameworks, will ensure consistent integrity controls across different AI systems and applications.

Just as the HTTPS protocol created a foundation for trusted e-commerce, it’s time for new integrity-focused standards to enable the trusted AI services of tomorrow.

This essay was written with Davi Ottenheimer, and originally appeared in Communications of the ACM.

Posted on April 3, 2025 at 7:05 AM • 20 Comments

Comments

Staffan April 3, 2025 7:29 AM

Is this the third attempt to create a “Web 3”? The semantic web-based Web 3.0 (first named 1999, popularized around 2005?), then the blockchain-based “web3” (first named 2014, popularized around 2020?). What is the basis for this iteration of “Web 3” — Solid? (first established 2016, popularized never). Just like “web3” tried to attach itself to actually-popular phenomena of the times (gaming and VR, for some reason), this essay seems to attach itself/Solid to AI and agents. I just don’t see it.

müzso April 3, 2025 8:11 AM

I’d have picked “Web 3.0 Requires Data Integrity and Authenticity” for the title. Integrity alone is insufficient to invoke “trust” between any parties where the method of communication doesn’t imply authenticity.

Bilateralrope April 3, 2025 8:45 AM

“For example, the 5G communications revolution”

What 5G communications revolution?

I remember all the hype talking about one, with questionable claims about what 5G would do. Some were claims about things that 5G didn’t achieve, some about things that 4G already has enough bandwidth for.

Now that 5G is here, what has it actually changed?

Niko Bonnieure April 3, 2025 9:23 AM

Hello everyone, I read Bruce’s blog regularly, and I didn’t see this coming. If you are interested in data integrity, E2EE, Solid, ActivityPub and how web users can regain control over their privacy and their data, then you should have a look at what I have been preparing for many years, and especially in the last 3 years: NextGraph, which combines all of the above.
Please let me know of your comments here or in our forum.

https://nextgraph.org

PS: The security audit is planned but not conducted yet. Please bear with us in the meanwhile.

I Can't Remember The Handle I Used to Comment on This Blog Anymore April 3, 2025 10:22 AM

All right, I’m going to be as polite and serious with my question here, but…

I’m assuming that, since you’re talking about “training” AI and AI “generating” data, we’re talking about transformer and diffusion models, or basically that approach. The kind of thing you call “generative AI”.

I’m assuming classifier machine-learning models aren’t part of this discussion, because I think those are suited to having attribution and data integrity incorporated into their training data, and aside from adversarial attacks, we can expect them to work fairly predictably, because they have all this while.

But a major problem with the models we understand as “generative AI” is that they hallucinate. It doesn’t matter if you establish chains of provenance or ensure that all the data that goes in is “clean”, what comes out is only coincidentally true. Like, even saying “generative AI occasionally hallucinates” is misleading, when it’s more accurate to say that “generative AI” occasionally and coincidentally speaks something resembling the truth.

It’s not that garbage in, garbage out: it’s literally, doesn’t matter what goes in, a certain percentage of it is garbage out, and the model cannot tell you which bit is garbage. All of its output must be scrutinized.

Including generative AI into any part of this chain pollutes the chain of attestation, because the function of these models is token prediction of some kind, and it’s not even predictable. That’s not any kind of foundation for any kind of model of knowledge, much less the foundation of any kind of successor the web we’ve got right now.

Unless I’ve missed the memo and there are alternative approaches to AI that I’ve missed. Are there?

Clive Robinson April 3, 2025 10:43 AM

@ Bruce, ALL,

The acid drop point is,

“In a world populated by artificial intelligence (AI) systems and artificial intelligent agents, integrity will be paramount.”

Because it begs the question,

“To Whom?”

And it’s going to have to be answered very soon.

Because the simple fact is, it turns out most people at the moment “care not a fig” about current AI LLM and ML system output accuracy. And see AI at best as a “slow news day filler”…

On the apparent assumption of,

“Garbage out is the norm, anything that’s not, is a modern day miracle.”

Thus cause for, in equal measure, “surprise and concern”.

Lets be honest by a show of hands,

Who on asking an AI how to make a cup of tea would be surprised to put “white wood glue” in the cup first?

(Believe it or not it actually makes some kind of sense… White wood glue is based around “casein”, as is “cheese”… Both are “milk protein” products. See and compare,

‘https://www.healthline.com/nutrition/what-is-casein

‘https://gluesavior.com/what-is-casein-glue/ )

The fact is humans have “agency” and from that they are “environmentally aware” and this allows “context” to happen without being specifically announced, AI is none, and has none, of these things.

There is a joke going around that,

“AI is the new Pentium bug, only it gets fewer answers to add up.”

As for other AI jokes about their output issues… I’d get a life time ban if I mentioned any of them…

AL April 3, 2025 12:21 PM

“and there are alternative approaches to AI that I’ve missed.”

I don’t know about “alternative”, but, in AI, there is a “temperature” setting that deliberately adds a “randomness” aspect, because they want to make AI (like Copilot) your digital “friend” that you pour your heart out to.

I see the future in carefully tailored AI. Want a legal AI, it does legal. It doesn’t do Twitter or Reddit. And this “temperature” setting is shut off, so the output is always repeatable.
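
(Roughly speaking, and not tied to any particular vendor’s API, temperature rescales the model’s token probabilities before sampling; at zero you just take the most likely token every time, so the output is repeatable. A toy sketch:)

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng=None) -> int:
    """Temperature 0 -> deterministic argmax; higher values flatten the
    distribution, adding randomness to which token gets picked."""
    if temperature == 0:
        return int(np.argmax(logits))
    rng = rng or np.random.default_rng()
    scaled = (logits - logits.max()) / temperature   # stabilized softmax
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.0, 0.1])
print(sample_token(logits, 0.0))  # always the same token
print(sample_token(logits, 1.0))  # can differ from run to run
```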

I see a future in AI, but it isn’t going to be these free AIs that want to slurp up your data so that a profile of the user can be built.

ResearcherZero April 3, 2025 10:12 PM

Hopefully standard weights and measures will mean something, text and calculations remain legible. Lately, I have been concerned that in certain places, that is no longer true.

A better solution than AI “friends” helping relieve retirement savings by brute force passwords in a coordinated manner, or integrity ain’t gunna mean shoot! Possibly with some new kind of new interpreter to decode what eva anyone is babbling about.

‘https://www.cyberdaily.au/security/11940-hackers-target-aussie-pensioners-in-major-super-fund-cyber-attack

ResearcherZero April 4, 2025 12:45 AM

It should be the responsibility of the large tech companies to demonstrate how they are going to authenticate and create trust in information generated by artificial intelligence.

How can we trust it if we cannot authenticate it and replicate its proof?

Verifying information must become simpler and more accessible for the average Joe.

Biometric authentication, documentation and accreditation will become simpler to forge.
Artificial information and electronic communication must provide a verifiable means of authentication. Secure and private communication should be built into the infrastructure, and the methods of authenticating and securely communicating taught as basic skills.

Secure techniques should be provided to the public, but more importantly traditional services must remain supported by the government when private digital services suffer service failure, permanently close or are disabled by attack. If public services and information services like local news publications and public broadcasting are not widely available, when private companies shut branches or suffer bankruptcy, a void is left.

Voids will be filled, often by criminal markets.

Nation state campaigns employ AI to generate fabricated information, copy criminal methods and use criminal services. AI generated personas and disinformation are becoming more sophisticated and by the time forgeries are identified, often the damage is already done.

One issue in establishing integrity is the large number of criminal web services operating in criminal states. Not only do adversarial governments allow malicious ASNs, domains and cloaking services to operate, they utilize them to conduct state-led operations.

‘https://vsquare.org/how-russian-disinformation-campaign-influenced-german-elections-afd-cdu-greens-cyberoperations/

The Scam Empire – criminal infrastructure and services.

‘https://www.qurium.org/

Some of these ASNs are registered in Western countries, often using stolen identities.
https://www.hyas.com/blog/hyas-threat-intel-report-may-6-2024

Many companies ignore the misuse of identity and consumers’ privacy settings.

‘http://innovation.consumerreports.org/Mixed-Signals-Many-Companies-May-Be-Ignoring-Opt-Out-Requests-Under-State-Privacy-Laws.pdf

ResearcherZero April 4, 2025 12:50 AM

@Clive Robinson, Bruce, ALL

Everything has become so complicated that not even pigeons and postmen are safe.

Clive Robinson April 4, 2025 5:25 AM

@ ResearcherZero, Bruce, ALL,

With regards,

“Everything has become so complicated that not even pigeons and postmen are safe.”

Complicated perhaps, but consider,

“Even the dogs of war need work during the peace.”

So the pack must still hunt, “To keep their claws in trim”.

And as it’s said, “You can’t teach an old dog new tricks”…

So the war dogs still have to chase after others…

So the prey now, are any and all peacefully doing their own jobs.

David Wittenberg April 4, 2025 8:49 AM

One nit. You say, “The third is transparent data ownership, which allows data owners to know when and how their data is used and creates an auditable trail of data provenance.”

I disagree with the first half. While I owe my readers an auditable trail, I do not owe the author that. I must credit him/her, but I do not have the responsibility to inform him/her that I am using his/her data (unless it came in a private communication).

Rontea April 4, 2025 9:31 AM

Integrity is the cornerstone of Web 3.0, also known as the first version of the Matrix, as it ensures the accuracy, completeness, and quality of data.

ResearcherZero April 7, 2025 11:46 PM

I can’t help but keep thinking this potential move seems a little silly at the moment.

[Hail Warning]

Fantastic time to make sensitive financial data – completely accessible through an API. 😐

‘https://www.wired.com/story/doge-hackathon-irs-data-palantir/

Insiders confirm the White House is being run by “a bunch of fools”.
https://www.politico.com/news/2025/04/02/waltzs-team-set-up-at-least-20-signal-group-chats-for-crises-across-the-world-00266845

U.S. networks and security are no longer a federal government priority.
https://www.justsecurity.org/109853/regulatory-landscape-big-tech/

substitute with your favourite APT

Establishing Legends – covert activity in target countries including the U.S.

“Today, some evidence suggests that the 161 Centre is organized into a headquarters unit, three training units, an operational planning unit, three operational units, a financial and logistical unit, and a supply unit. It deploys personnel to Europe and other locations for active intelligence under partial legalization.”

What are the objectives in conducting these attacks?

‘https://www.csis.org/analysis/russias-shadow-war-against-west

To recap something that happened in the United States.
https://www.volexity.com/blog/2024/11/22/the-nearest-neighbor-attack-how-a-russian-apt-weaponized-nearby-wi-fi-networks-for-covert-access/

…but don’t worry, the Russian hacker made it back home safe after the job was over.
https://www.wired.com/story/russian-prisoner-swap-vladislav-klyushin-evan-gershkovich/

Gert-Jan April 10, 2025 12:52 PM

What a load of nonsense.

The article pretty accurately describes all the problems with AI. Sure, integrity is an aspect, but in the global scheme of things, it is just a small technical aspect that can be easily managed.

As others have mentioned, what does it matter if the information you received from an unauthenticated source wasn’t altered? Information from an unauthenticated source isn’t trustworthy anyway. And on top of authenticity, you probably need more. For example reputation, or some other means of weeding out undesirable sources.

“Without this foundation of verifiable truth, AI systems risk becoming a series of opaque boxes.”

“becoming”? They are a series of opaque boxes. The article actually lists examples that illustrate that. They lack intelligence, lack accountability, and people use them at their peril.

Also, tell me where I can find this “verifiable truth”? Science and journalism probably come closest to anything resembling truth in our world.

Clive Robinson April 11, 2025 4:59 AM

@ AL, ALL,

With regards,

“I see the future in carefully tailored AI. Want a legal AI, it does legal. It doesn’t do Twitter or Reddit. And this “temperature” setting is shut off, so the output is always repeatable.”

Such “carefully tailored AI” systems have existed for over forty years and they are very definitely not the current crop of LLMs built by Transformer style Gradient ML.

We call them “Expert Systems” and they work much the same as they have done since the 1980s. Importantly they are designed to give “reliable answers”, not to be “Chatty Kathy Bots” as faux BFFs, which the current AI LLM and ML systems are designed to be.

The purpose of an AI “Expert System” is,

:- To GIVE dependable verified information TO the user.

The purpose of an AI “Chatty Kathy” is,

:- To OBTAIN personal information with resale value FROM the user.

At the foundation they are both “simple databases” the difference is the record structure and query methods. In the case of Expert systems they are logical and ordered. In the case of Chatty Kathy’s they range from chaotic to disorganised and at best random.

The fact that the current crop of shysters and shills behind OpenAI, Microsoft, Meta, Google and others are mostly “pumping the nonsense bubble” about some mythical G Spot reaching AI being “just around the corner”… Somehow able to magically go beyond reasoning and “reaching the singularity” of human equivalent intelligence is just a nonsense to drag in investor dollars by the multiple trillions.

All that bubble hype fails to answer two basic questions,

1, What is human intelligence?
2, How does human intelligence work?

The first is a generic question on which the specific answers to the second can be built.

We still have no idea beyond hand waving and talking fast how to even approach answering the first.

Heck, we are not even sure about other types of intelligence we see and thus say exist. We see at least two types in vertebrates, and then there are the likes of squid…

https://www.quantamagazine.org/intelligence-evolved-at-least-twice-in-vertebrate-animals-20250407/

We could fairly say we’ve not even reached the first rung of the ladder leading up to the answer to the first question.

ResearcherZero April 12, 2025 1:45 AM

@Clive, ALL

3, and the third option – distraction.

I need to be entertained. You need to be entertained. We all want to feel appreciated, needed and important, even if we have to pay someone to create the artificial impression that it is true.

Now embrace me with your warm electron glow! 😀

I need to escape. Dope me up internet. [tosses away integrity]

Artificial Intelligence alters our perceptions of the challenges facing the world, while hiding the responsibility and lack of real leadership of those who are in positions of influence and power. Worst of all, AI chat bots are delivering a steady diet of corporate propaganda when queried about these issues.

‘https://theconversation.com/ai-is-bad-for-the-environment-and-the-problem-is-bigger-than-energy-consumption-247842

PetroDragonic Apocalypse, or Dawn of Eternal Night: An Annihilation of Planet Earth and the Beginning of Merciless Damnation.

‘https://www.youtube.com/watch?v=0Fj8kA93ddc

These AI technologies may never deliver any environmental outcomes as presently designed.
https://techcrunch.com/2025/02/06/orgs-demand-action-to-mitigate-ais-environmental-harm/

Whiz-bangs to distract the human consciousness, dull our responses and pacify our concerns.
https://www.cnet.com/tech/at-ces-2025-silence-on-sustainability-as-climate-change-disasters-rage/

The Century of the Self
https://www.youtube.com/watch?v=DnPmg0R1M04

ResearcherZero April 12, 2025 2:01 AM

@Gert-Jan

The datasets that AI models are trained on can easily be poisoned and corrupted.

Input systems of artificial systems can easily be fooled by a number of methods.
Problems within systems and networks due to poor design can corrupt data entries.

Corrupted data, poisoned data, or ‘data voids’ within crucial reference sections undermine integrity.

Undesirable content might be your response. It could be any bad news the company doesn’t like. Perhaps entire locations, cultures or languages because the expense of compiling that information (or the difficulty) does not meet economic or ideological expectations.

Sometimes important data does not even exist when a model is trained, or is overlooked.

ResearcherZero April 12, 2025 2:26 AM

Ask the postmasters affected by the British Post Office scandal about the importance of data integrity. Those issues remained in place for two decades.

Many systems in use today have similar issues with data integrity in their databases.
