Your AI product strategy is only as good as your data strategy
Everyone's talking about GenAI, models, agents, and virtual assistants, but many companies are facing a more basic issue: data.
Over the last few years I’ve worked hands-on with AI, building products and integrating GenAI into real systems using tools like OpenAI, Claude Code, and Copilot. The first thing I learned was that an AI app is easy to demo but hard to make work in real life. Once you move past the demo, you’re dealing with real data, third-party integrations, edge cases, and users who have high expectations and will use your product in ways you never imagined.
When a GenAI project fails, it’s rarely because of the model. It’s usually because there isn’t enough data, teams work in silos, systems don’t communicate with each other, and there’s no clear ownership.
A recent report from MIT Technology Review found that only 13% of large organisations get real value from their data strategy.
During a demo, any model you use will sound very intelligent because the questions are simple and don’t require much context. In production, things change. The performance of the model is only as good as the context and instructions we give it, which means any system built on top of the model needs to know how to fetch, filter, and inject data, knowledge, and instructions into every request.
This is where most teams struggle. Once they start planning, they realise the data is hard to access, outdated, or inconsistent. On top of that, if the instructions and rules are not clear, or the model doesn’t have the right skills, the system ends up without the context it needs to make reliable decisions.
If you’re integrating AI into your product, start with the data and a clear use case. Don’t try to centralise everything; that usually creates more problems. Keep data where it belongs, but connect your tools so everything can be accessed through a single, consistent interface.
Also pick a real use case, for example a support chatbot that only handles refunds. Then ask yourself: what data do we actually need for this to work? Once you’re clear on the use case and know where your data lives, the next challenge is making all of that work at scale.
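To make the "single, consistent interface" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: DataSource, CrmSource, TicketSource, and DataHub are hypothetical names, and the in-memory lists stand in for real systems. The point is only that each system keeps its own data while callers talk to one entry point.

```python
from typing import Protocol


class DataSource(Protocol):
    """Hypothetical contract that every connector implements."""

    name: str

    def search(self, query: str) -> list[dict]: ...


class CrmSource:
    name = "crm"

    def __init__(self, customers: list[dict]) -> None:
        self._customers = customers

    def search(self, query: str) -> list[dict]:
        q = query.lower()
        return [c for c in self._customers if q in c["name"].lower()]


class TicketSource:
    name = "tickets"

    def __init__(self, tickets: list[dict]) -> None:
        self._tickets = tickets

    def search(self, query: str) -> list[dict]:
        q = query.lower()
        return [t for t in self._tickets if q in t["subject"].lower()]


class DataHub:
    """Single entry point; the data itself stays in each source system."""

    def __init__(self, sources: list[DataSource]) -> None:
        self._sources = {s.name: s for s in sources}

    def search(self, query: str) -> list[dict]:
        results = []
        for name, source in self._sources.items():
            for record in source.search(query):
                # Tag each record with where it came from.
                results.append({"source": name, **record})
        return results


hub = DataHub([
    CrmSource([{"name": "Acme Ltd", "plan": "pro"}]),
    TicketSource([{"subject": "Acme refund request", "status": "open"}]),
])
matches = hub.search("acme")
```

In a real system each connector would call the underlying tool's API, but the calling code would not need to change.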
1. Infrastructure
Your app needs to handle real traffic without slowing down, run models, and connect to internal systems like shared drives, CRMs, and databases.
From a technical point of view, this usually involves using platforms like AWS, Azure or Google Cloud. This is where software architects start thinking about latency, cost and reliability, especially when you move from a demo to a production-ready app. If this layer isn’t well designed, everything built on top becomes unreliable, slow or too expensive to run. Remember, most AI problems are data problems, but most of the cost sits in infrastructure.
2. Data + AI
This is the part most teams underestimate: cleaning and standardising the data, removing duplicates, fixing inconsistencies, defining formats, and adding structure. If your data isn’t up to date or well organised, the results you get from the model will be unreliable.
You might think you’re in a good place because you have plenty of data, but when you look closer, the data is spread across Slack, call logs, support tickets, and docs. It isn’t tagged, and it sits across multiple databases that don’t really talk to each other. An AI strategy doesn’t mean collecting more data; it means making the data you already have usable.
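As a rough illustration of what cleaning, standardising, and deduplicating can look like, here is a small sketch. The field names, the day/month/year input format, and the "keep the most recent record per email" rule are all assumptions made for the example, not a recommendation for every dataset.

```python
from datetime import datetime


def normalise(record: dict) -> dict:
    """Trim whitespace, lowercase the email, and convert the date to ISO format."""
    parsed = datetime.strptime(record["updated"].strip(), "%d/%m/%Y")
    return {
        "email": record["email"].strip().lower(),
        "name": " ".join(record["name"].split()),
        "updated": parsed.date().isoformat(),
    }


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the most recently updated record for each email address."""
    latest: dict[str, dict] = {}
    for rec in map(normalise, records):
        current = latest.get(rec["email"])
        # ISO dates compare correctly as plain strings.
        if current is None or rec["updated"] > current["updated"]:
            latest[rec["email"]] = rec
    return list(latest.values())


raw = [
    {"email": " Ana@Example.com ", "name": "Ana  Silva", "updated": "03/01/2024"},
    {"email": "ana@example.com", "name": "Ana Silva", "updated": "15/06/2024"},
    {"email": "bo@example.com", "name": "Bo Chen", "updated": "01/02/2024"},
]
clean = deduplicate(raw)
```

Here the two "Ana" rows collapse into one, and only the newer one survives.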
“Knowledge base quality is the single strongest predictor of AI agent performance. Not the model. Not the prompt engineering. Not the integration architecture. The knowledge.”
— Lorikeet (Is Your Knowledge Base Ready for AI?)
Depending on the use case, you’ll need to handle both structured and unstructured data. Structured data: users, orders, inventory. Unstructured data: docs, support tickets, Slack messages, call logs.
Today, keeping documentation up to date matters more than ever. The way to do it is not to ask people to update it constantly, but to have AI agents continuously review and improve it.
The first step is to connect your tools, pull data from Zoom, Slack, and Gmail, and let AI categorise and structure it. Then, an AI agent reviews what you already have, spots gaps, inconsistencies, and outdated information, and suggests improvements. If you're using Google Docs, you already have what you need: "Suggesting mode" to propose changes, "Version history" to compare them, and "Approvals" to review and approve them before they're applied.
An AI agent flow would look something like this:
Read the doc and external data
Propose changes directly in the doc
Trigger an approval request via the Drive API
Human reviews, compares via version history, and approves
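The flow above can be sketched in plain Python. This is a simplified stand-in, not a Google Docs or Drive integration: Suggestion, propose_changes, and apply_approved are hypothetical names, the "doc" is a list of "key: value" lines, and the approval step is simulated in code rather than requested through any real API.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    line_no: int
    old: str
    new: str
    approved: bool = False


def propose_changes(doc_lines: list[str], facts: dict[str, str]) -> list[Suggestion]:
    """Read the doc and external data, then propose changes where they disagree."""
    suggestions = []
    for i, line in enumerate(doc_lines):
        key, sep, value = line.partition(":")
        expected = facts.get(key.strip())
        if sep and expected is not None and value.strip() != expected:
            suggestions.append(Suggestion(i, line, f"{key.strip()}: {expected}"))
    return suggestions


def apply_approved(doc_lines: list[str], suggestions: list[Suggestion]) -> list[str]:
    """Write back only the suggestions a human has approved."""
    updated = list(doc_lines)
    for s in suggestions:
        if s.approved:
            updated[s.line_no] = s.new
    return updated


doc = ["Refund window: 14 days", "Support hours: 9-5"]
facts = {"Refund window": "30 days", "Support hours": "9-5"}  # made-up external data

pending = propose_changes(doc, facts)
# An approval request would go to reviewers here; the approval is simulated.
for s in pending:
    s.approved = True
new_doc = apply_approved(doc, pending)
```

The design choice worth noting is that nothing is written back without the approved flag, so the human stays in the loop.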
3. Context
Context is not something the model has; it’s something your system builds. Once you move into production it becomes a first-class problem: instead of just sending a prompt, you are pulling data from different sources, and if that data is missing or not relevant, the model fills the gaps with guesses.
To make this work you need two things: context and specifications. Context gives the model the knowledge it needs to answer correctly, and specifications define how it should behave.
A specification file defines things like:
Role
Tone
Constraints
Workflow
Context retrieval (RAG) gives the model the right information at runtime, and determines:
Where the model gets its information from
How it stays up to date
How it answers domain-specific questions
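Here is a minimal sketch of how a specification and retrieved context might be combined into a single request. The SPEC contents and KNOWLEDGE sentences are made up for the refund chatbot example, and the word-overlap ranking is a deliberately naive stand-in for the vector search a real RAG system would normally use.

```python
import re

SPEC = {
    "role": "Refund support assistant",
    "tone": "Friendly and concise",
    "constraints": ["Only discuss refunds", "Never promise a refund outside policy"],
    "workflow": ["Identify the order", "Check eligibility", "Explain next steps"],
}

# Made-up knowledge snippets for illustration.
KNOWLEDGE = [
    "Refunds are available within 30 days of delivery.",
    "Shipping times vary between 3 and 7 business days.",
    "Refunds are paid back to the original payment method.",
]


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by words shared with the question; drop non-matches."""
    q = tokens(question)
    ranked = sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)
    return [d for d in ranked[:k] if q & tokens(d)]


def build_prompt(question: str) -> str:
    context = retrieve(question, KNOWLEDGE)
    return "\n".join([
        f"Role: {SPEC['role']}",
        f"Tone: {SPEC['tone']}",
        "Constraints: " + "; ".join(SPEC["constraints"]),
        "Workflow: " + " > ".join(SPEC["workflow"]),
        "Context:",
        *[f"- {c}" for c in context],
        f"Question: {question}",
    ])


prompt = build_prompt("How long do refunds take to reach my payment method?")
```

The shipping snippet never reaches the prompt because it shares no words with the question, which is the filtering behaviour you want from the retrieval step.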
The question many product managers have is: how do we control the behaviour of the model? How do we actually make sure it does what we expect?
Just like business owners ask their lawyer to review an important email before sending it, AI systems follow a similar pattern, especially in production. They use the same model, configured with a different role, to review the generated response. Something many people don’t realise is that generating an answer is not the final step. Before anything reaches the user, the system needs to verify that the response is acceptable.
During the validation process, the model takes the role of a lawyer. It checks whether the answer follows the spec (tone, constraints, format) and whether it makes claims that are not backed by the available context (known as hallucinations).
There are also tools and services to implement this validation layer. For example, NeMo Guardrails by NVIDIA to keep conversations within defined boundaries, or Guardrails AI to validate the structure and quality of responses.
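The reviewer pattern can be sketched without any specific vendor. In the example below, call_model is a hypothetical function you would replace with your own model client, REVIEWER_ROLE is an illustrative prompt, and fake_model exists only so the sketch runs on its own. It does not use NeMo Guardrails or Guardrails AI.

```python
from typing import Callable

REVIEWER_ROLE = (
    "You are a reviewer. Reply PASS if the answer follows the spec and every "
    "claim is supported by the context. Otherwise reply FAIL: <reason>."
)

FALLBACK = "I'm not sure about that. Let me connect you with a colleague."


def review(answer: str, context: str, spec: str,
           call_model: Callable[[str, str], str]) -> tuple[bool, str]:
    """Ask the model, configured with the reviewer role, to judge a draft."""
    user_message = f"Spec:\n{spec}\n\nContext:\n{context}\n\nAnswer:\n{answer}"
    verdict = call_model(REVIEWER_ROLE, user_message).strip()
    return verdict.upper().startswith("PASS"), verdict


def respond(draft: str, context: str, spec: str,
            call_model: Callable[[str, str], str]) -> str:
    """Only release the draft to the user if the reviewer passes it."""
    ok, _verdict = review(draft, context, spec, call_model)
    return draft if ok else FALLBACK


def fake_model(system: str, user: str) -> str:
    """Stand-in for a real model call, for demonstration only."""
    answer = user.split("Answer:\n", 1)[1]
    return "FAIL: unsupported claim" if "90 days" in answer else "PASS"


context = "Refunds are available within 30 days of delivery."
spec = "Only discuss refunds."
good = respond("You can request a refund within 30 days of delivery.", context, spec, fake_model)
bad = respond("You can request a refund within 90 days.", context, spec, fake_model)
```

The unsupported "90 days" claim is replaced with a safe fallback, while the grounded answer goes through unchanged.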
4. APIs
Your APIs are the bridge between your data, your systems, and your AI. They can be used for almost anything, from fetching order details and updating a delivery address to creating a support ticket or accessing information about your products and services.
The engineering team is responsible for designing the APIs, making sure they have a clear interface, are versioned, that errors provide context, and that responses are consistent and well structured so the model can use them without guessing.
My advice: log everything. Logs are now the notes your AI reads to understand your business problems. The more context, the better. Instead of “Error: Order 1234 not found”, APIs should log errors in a structured way, explaining what failed, why, where, what was expected, and what to do next.
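As an illustration, the "Order 1234 not found" error could be logged as structured JSON along these lines. The field names, the endpoint path, and the customer ID are assumptions for the example; use whatever schema fits your own APIs.

```python
import json
import logging

logger = logging.getLogger("orders_api")


def order_not_found_event(order_id: str, customer_id: str) -> dict:
    """Build a structured error event; all field names are illustrative."""
    return {
        "event": "order_lookup_failed",
        "what": f"Order {order_id} not found",
        "why": "No order with this ID exists for the authenticated customer",
        "where": "GET /orders/{order_id}",
        "expected": "An order ID that belongs to the customer",
        "next_step": "Ask the customer to confirm the order number",
        "order_id": order_id,
        "customer_id": customer_id,
    }


event = order_not_found_event("1234", "cust_42")
line = json.dumps(event)
logger.error(line)
```

A log line like this gives both a human and a model enough context to understand the failure without guessing.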
5. Access Control
Access control means authentication, authorisation, and role-based access. It also needs to be enforced before data reaches the model, not after, otherwise you risk leaking sensitive information.
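To show what "before data reaches the model" can mean in practice, here is a minimal sketch. The roles, documents, and function names are hypothetical; the idea is simply that filtering happens when the context is assembled, so restricted text is never part of the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    text: str
    allowed_roles: frozenset[str]


# Made-up documents with the roles allowed to see them.
DOCS = [
    Document("Refund policy: 30 days from delivery.",
             frozenset({"customer", "agent", "admin"})),
    Document("Internal note: customer flagged for chargeback risk.",
             frozenset({"agent", "admin"})),
    Document("Payment processor API key rotation schedule.",
             frozenset({"admin"})),
]


def authorised_context(role: str, docs: list[Document]) -> list[str]:
    """Filter by role BEFORE building the prompt, so the model never sees the rest."""
    return [d.text for d in docs if role in d.allowed_roles]


def build_prompt_for(role: str, question: str) -> str:
    context = "\n".join(authorised_context(role, DOCS))
    return f"Context:\n{context}\n\nQuestion: {question}"


customer_prompt = build_prompt_for("customer", "Can I get a refund?")
agent_prompt = build_prompt_for("agent", "Can I get a refund?")
```

Filtering the model's output afterwards would be too late, because the sensitive text would already have been sent to the model.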
A 2025 analysis of the Forbes AI 50 found that approximately 65% of high-profile AI startups have experienced verified leaks of sensitive information. (Source: OWASP)
Defining who can access what is a shared responsibility. Usually, product defines the rules, engineering enforces them. Security or legal teams usually step in here as well, especially in regulated environments. They help define policies, review risks, and make sure you’re not exposing something you shouldn’t.
Product and legal teams should always start from the data. Is it sensitive? Is it personal data, payments, internal notes? What happens if this leaks?
Conclusion
As you can see, the mindset needs to shift from “we need an AI strategy” to “we need reliable, accessible data for a specific problem”. My advice is simple: don’t try to build the perfect AI platform upfront. Start small, focus on one use case, then expand.
Recommended open-source tools:
RAGFlow: A high-quality RAG engine with strong document parsing that turns messy data into structured, queryable knowledge. Great for real production systems, not just demos.
LlamaIndex: A data framework for LLM apps that connects your data sources to models cleanly. It’s basically the data layer for AI.
Flowise: A visual builder for LLM apps and agents. It’s the Figma for AI workflows, widely used for building internal tools fast.
OpenClaw: An AI agent that can browse the web, use tools, and execute real tasks autonomously, not just generate text. It can be run directly from source code by cloning its GitHub repository.