AI PoCs - how to scale, not fail

24 Sept 2025|Technology

Gen AI pilots are everywhere. A March 2025 McKinsey survey found that 71% of organisations reported regularly using GenAI in at least one business function, an increase from 65% in early 2024. This figure is undoubtedly higher now, just a few months after that survey. Many firms struggle, however, to turn promising experiments into enterprise‑wide results. What sets the successful implementations apart, and how should CIOs and CTOs drive this forward?

Person pointing at code on dual monitor setup while working at desk with brick wall background.

Why most pilots stall

Two major hurdles stall progress:

The jump from a “that’s cool” demo to a production‑grade service that is reliable, secure and supported by a solid eval approach.
The bigger leap from a single production use case to a connected portfolio of adjacent solutions that share assets and accelerate one another.

It’s that second step where value can start to seriously curve upward, as teams reuse connectors, components, evaluation suites, guardrails and even the change management comms. Designing a pilot for that type of extensibility from day one is a key predictor of success.

We see integration as the dividing line. AI generates value when it is embedded in the workflow, bound to existing identity and enterprise data, measured for business outcomes, and operated as software rather than a perpetual pilot.

What successful implementations do differently

1) Put assistants in the workflow, not beside it

Goldman Sachs took its internal GS AI Assistant firm‑wide this summer. The tool is firewalled and integrated with the bank’s data and controls so employees can summarise complex documents, draft material and analyse data inside familiar workflows. Adoption has already reached tens of thousands of staff. The important part is not the model but the secure connection into existing processes.

2) Reallocate time with clear productivity evidence

JPMorgan reports its in‑house coding assistant lifted engineers’ efficiency by 10–20% across a very large developer base. That headroom is being reinvested in higher‑value AI and data work.

3) Work with existing systems

KONE co-developed TenderLens with Futurice to speed up tender preparation using gen AI. The tool is embedded in KONE’s existing sales systems — integrated with its CPQ and Salesforce CRM — so extracted data works with processes already in use. We’ve achieved well over 90 percent accuracy and dramatically cut cycle time. Read more

4) Build for the domain, not just the demo

Siemens’ Industrial Copilot won the 2025 Hermes Award at Hannover Messe for bringing generative AI into real industrial environments. The recognition reflects production‑grade integration with engineering tools and shopfloor workflows, now rolling out across customers.

A practical sequence for CIOs and CTOs

1) Anchor AI to one measurable workflow

Pick a single, frequent, high‑value flow such as claims adjudication, policy search, field service triage, incident routing, product onboarding, or developer support. Define two numbers upfront: the unit of work and the target change (for example, minutes per case, cases per agent, conversion rate, or defect rate). McKinsey’s 2025 data shows more teams are now linking gen‑AI use to revenue and cost changes within business units, which is the right level to aim for initially.

2) Wire it to identity, data and controls

Bind the assistant to your identity provider and authorisation model so it knows who is asking. Connect to authoritative systems of record and log every action. This is what distinguishes Goldman’s and JPMorgan’s approaches and is why their usage is real work, not a demo.

3) Choose the right customisation approach

Leverage system prompts, retrieval augmented generation and tool use so your workflow reasons over your domain while keeping knowledge fresh. Use fine‑tuning only where you need domain style, structured outputs or consistent task performance at low latency and cost on smaller models.

4) Instrument quality with explicit evaluators

Do not rely on opinion. Evaluate groundedness (is the answer supported by sources), relevance, response completeness and safety. Azure AI Foundry, for example, documents built‑in evaluators for these, and content‑safety groundedness checks can be automated as a guardrail. Track these alongside business KPIs.

5) Engineer for cost and latency from the start

Operations matter. Techniques like prompt caching can hold down costs and speed up responses where large shared prefixes are reused. Combine with semantic caching at the API gateway for further savings. Using a mix of models is probably a good call – smaller ones for simpler tasks and larger ones for the more complex ones. Use benchmarks and your own evaluations to guide on what model to use where.

6) Put security and compliance on rails

Adopt the 2025 OWASP Top 10 for LLM applications to mitigate prompt‑injection, data leakage and training‑data risks, and line up with NIST’s Generative AI Profile to evidence risk management. In Europe, note that the AI Act obligations for providers of general‑purpose AI models entered into application on 2 August 2025, supported by the Commission’s new GPAI Code of Practice. Your due diligence and supplier management should reference these.

7) Scale as a product portfolio, not a project list

When the first assistant is delivering reliable value, create a small catalogue of adjacent use cases that reuse the same integration pattern.

What to show budget‑holders

Budget‑holders want to see a bridge from experiment to capability. The evidence to bring to the table should include:

Before‑and‑after workflow metrics for a real population of users, not a lab test. The UK government’s published results are a good benchmark: 20,000 officials saved an average 26 minutes per day with M365 Copilot embedded in daily tools, and a cross‑government coding‑assistant trial reported roughly 56 minutes saved per developer per day.
Quality and safety telemetry (for example, groundedness and response completeness over time) alongside your operational and financial KPIs. Azure’s evaluators and observability guidance explain how to make this a routine practice.
A clear run‑rate view of cost that shows you have designed for scale, making use of caching and other controls.

The takeaway

The pattern is clear. Pilots that remain showpieces rarely change how work is done. The ones that break through start with a single workflow and wire the assistant into identity, systems and measures. They treat quality, safety and cost as engineering disciplines. They scale by reusing the same integration pattern across a portfolio.

If you'd like to know more about the topic, or need help scaling your AI POCs, please feel free to contact us.