← Back to portfolio
AI Reliability • LLMOps • Governance

Claims AI Reliability Platform

A governed release and monitoring system for deciding when a claims knowledge assistant is safe enough to scale — and when it should be restricted, reviewed, or rolled back.

I built this project around a harder question than “can the assistant answer claims questions?” The real question was whether a Swiss insurer could trust that assistant enough to release it into controlled production use.

That meant the project had to go beyond a useful interface or a decent retrieval flow. It needed governed release behavior, benchmark discipline, escalation-sensitive evaluation, live telemetry, incident logic, review workflows, and containment recommendations if live behavior later became unsafe.

The result is a full trust-loop case: a governed claims assistant, a release gate that caught a real boundary-handling bug, and a packaged live monitoring layer that can detect unsafe post-release behavior, open review workflows, and recommend rollback.
Trust loop visual for the Claims AI Reliability Platform showing governed assistant, release gate, monitoring, incidents, review, and rollback
The trust loop this project proves

Release discipline before deployment, and containment logic after deployment.

Proof of project weight
Release candidates evaluated
3
Real bug caught before release
1
Blockers in approved RC
0
Live rollback recommendation
Yes

Why a normal prototype was not enough

A promising internal assistant is still a risk if the business cannot tell when it is safe to release, when it should escalate, and how unsafe live behavior should be contained once the system is already in use.

The real business tension

In a claims setting, helpful is not enough. The assistant also has to be governable.

Management did not need a prettier prototype. It needed evidence. The company had to know whether a release candidate handled high-risk boundary cases correctly, whether escalation behavior held up under evaluation, and whether live degradation could still be detected after approval.

That changed the shape of the project. Instead of stopping at answer generation, I built around operational trust.

What made the problem operationally serious

  • A fluent answer is not the same as a safe answer
  • Claims workflows have boundary cases where escalation matters more than helpfulness
  • Management needed release evidence before broader adoption, not confidence based on a good demo
  • Even an approved release still needed monitoring in case live behavior later degraded

What I actually built

The project is best understood as a controlled release and monitoring system around a claims knowledge assistant, not as a standalone chatbot.

Governed assistant

A FastAPI claims knowledge assistant built with release-aware behavior, explicit retrieval control, and boundary-sensitive answer logic rather than open-ended assistant behavior.

Benchmark release gate

A structured evaluation layer across smoke, release-gate, boundary, and change-sensitive benchmark suites, with blocker logic, scoring, and release recommendation.

Live monitoring and incident logic

A telemetry and monitoring layer that converts unsafe live interactions into incidents, review queue items, and containment recommendations instead of leaving risk hidden inside logs.

Packaged runtime with persistence

A Docker and PostgreSQL-backed runtime that persists telemetry and monitoring state so the system feels operational, not just locally demoable.

This is what gives the project weight: it does not only show that an assistant can be built. It shows how a release can be evaluated, approved, monitored, and later contained if the live behavior stops being safe enough.

The release-gate story

The release process was not ceremonial. It materially changed the outcome. Early release candidates failed on scoring, blocker logic, and high-risk handling. The final one only passed once the benchmark results were good enough to justify scaled production.

Release progression visual showing claims assistant movement from blocked release candidates to scaled production approval
From blocked release to scaled production approval

The benchmark gate forced iteration. It did not just describe quality. It changed the release path.

The strongest takeaway here: the project did not just evaluate a claims assistant. It built a governed release gate, used it to catch a real issue, forced iteration, and only cleared the release once the evidence supported it.

The bug that changed the release outcome

The most credible part of the release story is that the benchmark exposed a real boundary-handling defect in the assistant logic rather than just producing abstract scores.

What the benchmark caught

  • A legal-liability / out-of-scope boundary case entered a product-line boundary branch first
  • The answer was blocked, but it did not behave like a true mandatory escalation path
  • That meant the system was technically restrictive, but operationally weaker than it should have been
  • The benchmark exposed that behavior, the service logic was fixed, and the release decision improved because of it
Simplified bug and fix flow for the Claims AI Reliability Platform showing wrong boundary branch behavior and corrected escalation logic
A subtle bug with real operational consequences

The difference between “blocked” and “properly escalated” mattered enough to affect the release decision.

This mattered because it proved the release gate was doing real work. It was not rubber-stamping a polished demo. It was identifying behavior the business would genuinely care about before release.

Live monitoring and rollback story

The project’s strongest Phase 3 proof is that approval was not treated as the end of the story. A release that had already been cleared could still later be detected as unsafe in live use.

What happened in live monitoring

  • A deliberately unsafe post-release interaction was injected into the live workflow
  • The assistant answered directly where escalation should have happened
  • That event triggered multiple incidents across boundary, escalation, and grounding-related logic
  • The platform also created a P1 review queue item and raised release degradation incidents
  • The top containment action became rollback_release

Why this changes the project’s weight

A lot of AI case studies stop once the release is approved. This one keeps going. It proves that an approved release can still degrade in practice, generate operational signals, open review workflows, and trigger a stronger containment recommendation.

That is what makes the platform feel mature: it treats production trust as something that must be maintained, not assumed.

Claims AI control room in a critical state showing incidents, review queue, and rollback recommendation

Control room in degraded live state

The approved release later entered a critical state after unsafe live behavior triggered incidents and changed the top containment action to rollback.

Claims AI control room tables showing incidents, review queue, and recent interactions

Unsafe behavior translated into operational action

The platform did not stop at red flags. It generated incidents, opened a P1 review queue item, and made the release operationally reviewable.

This is the most important proof on the page: the platform can detect unsafe post-release behavior inside the operational loop, not only inside offline evaluation.

Packaged runtime and operational realism

The runtime extension strengthened the project, but the important part is how it was done. Infrastructure was added where it created real operational value, not just to make the stack sound bigger.

What the packaging added

  • Docker and Docker Compose packaged the assistant, control room, and PostgreSQL-backed persistence into one runnable environment
  • PostgreSQL was used where it added real value: telemetry storage, monitoring storage, and operational state persistence
  • The governed retrieval path was deliberately kept intact rather than rewritten just to force infrastructure into the answer path
  • That was a disciplined choice: the retrieval behavior was already explicit, benchmarked, and central to release-gate behavior
Docker Compose runtime showing packaged services for the Claims AI Reliability Platform
Multi-service runtime validation

The assistant, monitoring layer, and persisted operational state could run together inside one packaged environment.

The disciplined choice here matters: I did not rewrite the governed retrieval path just to force database technology into the live answer flow. I kept the release-critical behavior intact and used PostgreSQL where it created operational realism.

Why this project matters in the portfolio

This is one of the strongest projects in the portfolio because it shows a more mature AI question than “can I build it?” It asks whether the business can govern it, test it, trust it, and contain it once it is live.

This project is stronger than a normal GenAI case because it proves release discipline, operational monitoring, and rollback readiness rather than just answer generation.

It shows that I can work on the harder problem behind enterprise AI adoption: not only whether a system can answer, but whether the business can trust, govern, and contain it.

Next step

Want to explore more of the portfolio or get in touch?

You can go back to the selected projects section or contact me directly by email.