Update (2026/03): I've received some auditor feedback on this topic and updated the article below - I added the auditor opinions at the end.

A startup recently reached out to use which was planning to build software as a medical device which would use an LLM somewhere in its backend. The device would likely be MDR class IIa.

The big question was: How do auditors currently "interpret" the MDR regarding using LLMs in a medical device? And specifically, would it be okay to use an external LLM API, e.g. from OpenAI or Anthropic?

I had the chance to (very informally) talk to someone at a notified body and get their (again, very informal) opinion on this.

The short answer is that, unfortunately, you can't use external LLM APIs for now.

Read on for the details and our thoughts.

Machine Learning in Software as a Medical Device

Putting machine learning software into medical devices is not really new. I was fortunate to be part of the team which got one of the first machine learning - based software medical devices certified as a medical device (MDD class IIb at the time). It was a machine learning - based software to screen mammographies (x-ray images) for breast cancer. That was a cool experience.

The learnings at the time were:

In general, adding machine learning components to your software as a medical device is okay.
Your weights must be frozen - in other words, no "continuous learning" in production, no updating of weights, etc. - any update of weights would be a new release of your product.
It's up to your auditor to determine how much your notified body would get involved for each of these new "weight update" releases. Most data points I have here from manufacturers and auditors all seemed pretty pragmatic, in the sense that you can often release new updates without notified body involvement if they're low risk, e.g. 1) only updating the weights, 2) model performance has improved (you have a test set) and 3) you don't ship other higher-risk changes, e.g. switching to a new model, other features, etc.
This was surprisingly pragmatic, in my opinion, and that was cool.
You had to create some documentation around your "machine learning life cycle": Which training data did you use, which testing methods did you use, what's your test set comprised of, what are your results, etc., etc. - check out our free template for the Machine Learning Algorithm Validation Report which covers this.

And these learnings are still true today.

You might wonder why there's no official guidance, given that machine learning has become "mainstream" for ~10 years now already (remember AlexNet, VGG, etc.?). The reason for that is that the MDCG, a rather informal, opaque and mostly undemocratic group of people who determine how the MDR should be interpreted (??) hasn't found time in the past ~10 years to publish any sort of guidance on this.

I can only speculate as to why. I don't know. I do know that there are close to zero capable software developers in the committee. A good data point for this is that their PDF guidance documents tend to have terrible formatting, some are even titled "DRAFT". Maybe they need some addition employee training as part of their "quality management". Anyway..

Back to topic. How do we now handle external LLM APIs?

Adding The OpenAI / Anthropic API to a Medical Device

In the end, building an LLM-based medical device is nothing else than building a machine learning - based medical device. LLMs are a subset of machine learning, so, umm.. that should be obvious, but many people in the regulatory bubble are not aware of these technical fundamentals and puff up their chest and say "wow, these LLMs are real AI, we need to re-evaluate all our current regulations!".

No you don't. It's machine learning, and machine learning has been around for a while. Do the legendary Andrew Ng Course Course on Machine Learning and become one of the <1% of regulatory people who actually have hands-on technical knowledge.

So the only differences here are:

The model is being hosted by an external provider.

That's it. Nothing else.

So - if you medical device uses the OpenAI or Anthropic API, is that a problem, given that you can mitigate most risks from this?

Let's look at the potential risks first:

"Update of weights": The provider updates the model weights without informing you, and your medical device starts receiving different API responses for the same requests (with better or worse model performance).
"Disruption of service": The provider goes down and your medical device is broken.

Intuitively, I think that the risks can be mitigated very well and/or are low enough to be acceptable:

Update of weights: The OpenAI and Anthropics APIs allow you to specify a specific model "version", usually the model name plus a date, in your API requests. So you can opt to not send your request to the always-newest model, but instead you send it to a model which was released on a certain date, the weights of which are no longer being modified. Examples for this are "gpt-5-2025-08-07" for OpenAI, and "claude-sonnet-4-5-20250929" for Anthropic.
So this risk can be controlled easily. Sure.. the remaining risk is that the providers will still update those models even though they shouldn't, but that sounds very unlikely (and would be a rather stupid move on their side). A more relevant risk might be that these models get retired rather fast, and the medical device lifecycle tends to be slow, so it might be a problem if a model becomes unavailable in, say, 2 years or so. But then you could just switch to another newer model, run your tests again, and push a new release of your medical device (the notified body procedure will depend on your auditor, as outlined earlier).
Disruption of service: OpenAI and Anthropic can go down anytime, or they might only offer limited service to you while they experience capacity constraints. While I think this risk is very real, the question I'd pose here is: OpenAI and Anthropic have very smart DevOps teams working on keeping their services online. They run on the best cloud platforms (AWS etc.). And now you, as a small medical device manufacturer, think it's more reliable to host your own model on your own servers, with one dude or dudess doing DevOps part-time? That sounds like a terrible proposition.
So, personally, I think this risk is simply already as low as possible.

What Does an Auditor Think?

I confidently reached out to a high-up person at a notified body and outlined my thinking, assuming that using the Anthropic or OpenAI APIs would be not much of a problem.

You see, our startup customer had already reached out to a few other consultants which all said that it would be a huge problem. I assumed the other consultants were over-engineering (as they usually do), and disregarded their answers given their number of competent software developers in their teams (usually none).

Man, was I wrong.

Turns out, notified bodies indeed see it as a huge problem right now. This leads to the outcome that, as a company, realistically, you can't use external LLM APIs in your product.

The reasoning and the mentioned risks were mostly about control:

The provider might update model weights
The provider might truncate model answer length due to capacity constraints
The provider might decide to spontaneously cancel the business relationship

I mean, all of these risks could be controlled, as I outlined above, but apparently the NBs interpretation is that they are larger than I thought. What a bummer.

I tend to rant about notified bodies and auditors, but I do understand their point of view here.

Then again, I think there's a rational (and interesting) discussion to be had about this.

Discussion points would be:

Does self-hosting an LLM really lead to an overall improvement of the "product risk profile"?
Availability / uptime will likely be worse (DevOps etc.), performance will likely be worse (self-hosting is hard to scale).
Given that self-hostable models tend to have worse performance than frontier models, is the trade-off worth it?
The open-source models which you can self-host (e.g. Qwen, Llama, etc.) tend to have worse performance than the closed-source, commercial models of OpenAI and Anthropic. So you're essentially making a bet and wagering that it's safer to have full control (self-hosting) over a model which has worse performance than instead having less control (external API) over a model which has better performance.
I'm not sure if this bet is the right bet in all use cases. It will surely depend on your LLM use case. If you're doing diagnostic stuff, using the absolute frontier models might be life-changing (literally).
Are we even optimizing for the right outcome?
Every startup building LLM-based medical device will now spend additional "development cycles" on setting up their own LLM infrastructure. This will lead to less resources being spent on actual product development. Is this worth it?
Second-order effects: Will we see crappy "certified" providers? Won't they be worse?
When we gotted audited by the LaGeSo in Berlin, they offered us to do the audit remotely. But they warned us that they can't use Google Meet, they can only use their own "certified video call provider" in which "you can't share your screen". So you literally have to say things like "okay I'm opening file123.pdf now, please scroll to page 8 and read the first sentence" etc. - crazy!
Likewise, in many regulated industries, we observe the emergence of these crappy "certified providers" which usually offer a much-worse flavor of existing software (here: Google Meet), while providing additional "quality documentation" (supplier contract, service level agreement, etc.) which gives their customers the false impression that this is a "safer choice".
But now you've optimized for a safer paper trail, and you've made huge compromises in product quality along the way.
I can imagine seeing similar crappy "certified providers" emerging for LLM API deployments. Like, for example, instead of choosing the OpenAI API, you can build your medical device on the "Black Forest Certified LLM API" (powered by the Telekom Cloud, another terrible "certified provider", running on Huawei hardware, by the way) which costs 100x as much, has much worse API docs, worse performance, worse availability, but better contracts, and it's a German GmbH. Great for compliance, terrible for product quality.
Is this worth it?
(By the way, we're currently disrupting the whole "certified eQMS provider" industry by offering our eQMS software for free, with higher tiers starting at 99€ / month - check it out!)

Again, I see the auditor's point, but I don't know whether these trade-offs are worth it and whether the "net product safety" benefits from this.

For now, you won't be able to use the OpenAI and Anthropic APIs in your medical device.

What do you think?

Update: Actual Auditor Feedback

So I've had some auditors reach out to me in the meantime who kindly told me how they see this right now. As always, heavily anonymized and I've grouped similar opinions together, so you get an idea of how different auditors see things.. differently:

Group 1: "Just like any other library" and "in-house models suck"
I would call these auditors the "technically proficient realists": Auditors who actually have a technical understanding of software, and now, LLMs, who make reasonable assessments on how to use these technologies in medical devices in a sensible way (unfortunately, not all auditors are like this). They essentially say using an external LLM API is like using any sort of external software library - it might go offline and/or become unmaintained in the future, just like any other library or API provider might. So using them in general should be possible.

Further, they tend to agree that in-house trained and/or hosted models simply tend to have worse performance, for well-known reasons - the model architecture is not state of the art, the training data and/or process was worse, etc. So the best choice would indeed be to choose the best model available which nowadays tends to be an external model (OpenAI etc.).

These auditors also tend to note that heavy logging and monitoring might make a lot of sense to ensure that nothing crazy happens.

My opinion: I agree with this - it sounds like a sensible approach by rational people. I wish we had more of those.

Group 2: "Nope"
These are the hardliners - irrational people who have long departed the sphere of reality and live in their own little bubble of "controlled software development". Usually, these people have never written code in their life, and the last code they wrote was COBOL code 50+ years ago, before they became an auditor and believed they were smarter than everyone by telling them to do "compliant software development".

Where was I.. these people tend to think that using external LLM APIs is unacceptable, because the manufacturer needs full control over the development lifecycle (what does that even mean?); they problem is that they are ignoring the reality of Group 1 above, i.e. that in-house self-trained models simply tend to perform worse than state-of-the-art external models. So you're essentially using something with worse performance, only to be able to point to the documentation and be able to say "but we control the development lifecycle!".

My opinion: This is detached from reality, irrational and frankly a technically very uninformed standpoint.

So that's what auditors think right now.

As is often the case, different software auditors have vastly different ideas on how things should be done.

It'll be interesting to see whether trends emerge between notified bodies, e.g. whether some notified bodies will have internal guidelines to favor the rational approach (Group 1), or whether auditors just come up with their own ideas and it'll be up to a coin toss which auditor you'll be assigned to within a notified body and whether an external LLM will be acceptable for your product or not.

Can You Use the OpenAI API In Your Medical Device?

Machine Learning in Software as a Medical Device

Adding The OpenAI / Anthropic API to a Medical Device

What Does an Auditor Think?

Update: Actual Auditor Feedback

Dr. Oliver Eidel

Keep reading

Machine Learning Software as a Medical Device: Can We Update Our Model?

Medical Device Classification

Software as a Medical Device: What's a Significant Change?

Congratulations! You read this far.