
Artificial intelligence is rapidly becoming part of everyday clinical care, especially in diagnostics. In eye care, AI systems can autonomously detect diabetic retinopathy, estimate disease risk and support clinical decision-making with unprecedented speed and consistency. But alongside this promise comes a critical reality: diagnostic AI is only as safe and fair as both its design and its real-world use allow it to be.
Regulators emphasize that algorithmic fairness and clinical safety do not depend solely on how an AI system is trained. Bias can be introduced or amplified at multiple stages, including deployment, workflow integration and day-to-day clinical use. For adopters, misunderstandings about how the AI was validated, which populations it was tested on, or how it should be used at the point of care can unintentionally generate biased outcomes. This can lead to reduced accuracy or inequitable care.
Trustworthy diagnostic AI is not just a property of the technology; it is a shared responsibility between creators and adopters. End users directly influence safety, fairness and performance through day-to-day use. But for busy clinicians, it is not always clear which technical details matter, what questions to ask vendors, or how to spot subtle mismatches between the evidence provided and the realities of their own patient population or workflow. Without this clarity, unintentional bias, performance drift and misapplication can occur even with high-quality AI.
The following six questions can help ensure a diagnostic AI system is not only accurate in principle, but reliable, equitable and appropriate for your practice in the real world.
1. Is the AI validated in the environment where it will be used?
An AI system tested only in controlled laboratory settings or large academic medical centers may not perform the same way in private practice, retail clinics, or telehealth-supported environments. Differences in lighting, patient mix, operator experience and/or workflow constraints can significantly affect real-world performance. End-users must confirm that the system’s validation reflects the intended use environment, as mismatches here are a frequent cause of AI failure after deployment.
2. Is the clinical trial published and peer-reviewed?
Peer-reviewed publications provide independent scrutiny and scientific assessment. Prospective trials with pre-specified endpoints offer stronger evidence than retrospective analyses and show how the system performs under realistic clinical conditions. Peer-reviewed, prospective evidence should be the baseline expectation for any AI that informs clinical decision-making, so that clinicians can integrate the system safely into practice.
3. Is a high-quality reference standard used?
AI is only as reliable as the “ground truth” against which it is measured. Ask:
- Is the reference standard independent and unbiased?
- Are outcome-based endpoints, such as clinically meaningful events, included?
- Does it use the most accurate imaging or measurement methods available?
Suboptimal reference standards can mask bias and inflate performance.
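As a rough illustration of why the reference standard matters, the Python sketch below compares an AI system’s binary calls against a reference grading and reports sensitivity and specificity. The function name, variables and example data are hypothetical placeholders; a real evaluation would use your own validation dataset and a pre-specified statistical plan.

```python
# Minimal sketch: comparing AI calls to a reference ("ground truth") grading.
# All names and data below are illustrative placeholders, not real study results.

def sensitivity_specificity(ai_positive, reference_positive):
    """Sensitivity and specificity of AI calls measured against a reference standard."""
    pairs = list(zip(ai_positive, reference_positive))
    tp = sum(1 for a, r in pairs if a and r)          # AI positive, reference positive
    fn = sum(1 for a, r in pairs if not a and r)      # AI missed a reference positive
    tn = sum(1 for a, r in pairs if not a and not r)  # both negative
    fp = sum(1 for a, r in pairs if a and not r)      # AI positive, reference negative
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec

# Hypothetical example: AI referral decisions vs. an adjudicated reference grading.
ai_calls = [True, True, False, False, True, False, True, False]
reference = [True, False, False, False, True, False, True, True]
sens, spec = sensitivity_specificity(ai_calls, reference)
print(f"Sensitivity {sens:.2f}, specificity {spec:.2f}")
```

If the reference grading itself is inaccurate, or systematically biased for certain groups, numbers like these can look reassuring while hiding real errors, which is why the quality of the reference standard matters as much as the AI’s agreement with it.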
4. Is there real-world evidence from settings similar to yours?
Performance observed in controlled trials may not translate directly to everyday clinical practice. Many trials draw participants from specialized centers or urban academic hospitals, which often do not reflect the diversity of smaller clinics, retail practices, or rural health settings. As a result, certain groups can be unintentionally excluded from validation. These “invisible populations” may include patients who live far from trial sites, individuals with limited mobility, or people from underrepresented communities. When these populations are not part of the evidence base, the AI may underperform or produce biased results once deployed in real-world settings.
Although it is impossible to validate an AI system in all clinical settings, look for peer-reviewed, published real-world evidence from clinical environments comparable to your own to help guide adoption. Examining real-world evidence helps ensure that the AI’s reported performance aligns with the outcomes your patients and staff can realistically expect. Even widely adopted diagnostic systems have faced real-world challenges when evidence from controlled trials did not fully capture population diversity or workflow variability, highlighting the importance of independent, practice-based validation.
5. Does the vendor provide transparency, lifecycle monitoring and accountability?
AI performance is not static – ongoing monitoring is essential. Adopters should expect:
- Clear documentation of known limitations
- Active post-market surveillance or performance monitoring in real-world settings
- Transparency regarding algorithm updates and changes
- Clarity on liability and intended use, including guidance for operator judgment in uncertain cases
Vendors have an obligation to provide the tools, metrics and transparency needed for ongoing evaluation, and adopters must commit to using them. Fair, safe and equitable diagnostic AI depends on an active partnership between creators and users.
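As one illustration of what ongoing monitoring can look like in practice, the hypothetical Python sketch below tracks monthly sensitivity among locally confirmed cases and flags any month that falls below the figure reported in the vendor’s validation study. The metric, threshold, field names and example data are all assumptions made for illustration, not a vendor-supplied tool.

```python
# Illustrative sketch of post-market performance monitoring (hypothetical data).
# Each monthly batch pairs AI referral calls with later confirmed diagnoses.

PUBLISHED_SENSITIVITY = 0.87   # example value from a validation study
ALERT_MARGIN = 0.05            # locally agreed tolerance before escalation

def monthly_sensitivity(cases):
    """Sensitivity among cases with a confirmed positive diagnosis."""
    positives = [c for c in cases if c["confirmed_positive"]]
    if not positives:
        return None
    detected = sum(c["ai_flagged"] for c in positives)
    return detected / len(positives)

# Hypothetical monitoring log: month -> list of case records
monitoring_log = {
    "2025-01": [{"ai_flagged": True, "confirmed_positive": True},
                {"ai_flagged": False, "confirmed_positive": True},
                {"ai_flagged": True, "confirmed_positive": False}],
    "2025-02": [{"ai_flagged": True, "confirmed_positive": True},
                {"ai_flagged": True, "confirmed_positive": True}],
}

for month, cases in monitoring_log.items():
    sens = monthly_sensitivity(cases)
    if sens is None:
        continue
    status = "OK" if sens >= PUBLISHED_SENSITIVITY - ALERT_MARGIN else "REVIEW"
    print(f"{month}: sensitivity {sens:.2f} ({status})")
```

Even a simple log like this gives a practice something concrete to review with the vendor when real-world performance drifts from the published evidence.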
6. Who assumes liability for a misdiagnosis?
For a traditional diagnosis made solely by a medical professional, the liability for the diagnostic outcome rests with the diagnosing physician. For autonomous AI diagnostic systems, the diagnosis is made by the computer and end-users are not required to have specialty training. In these cases, the liability should rest with the developer of the AI system. For all other diagnostic AI models, the boundaries of responsibility are ambiguous, making it essential to ask each vendor both how they view liability and where accountability lies.
Conclusion
Diagnostic AI has the potential to expand access, increase consistency and support better outcomes in eye care, but its benefits are not automatic. Safe and equitable performance depends not only on the quality of the algorithm but also on how the system is evaluated, deployed, monitored and used in real-world care. Developers must provide clear guidance and outline the mechanisms needed for adopters to integrate and oversee AI responsibly. End users also shape outcomes through the decisions they make at the point of care.
Importantly, the absence of some of the evidence described above does not necessarily preclude adoption. It simply means that additional guardrails may be needed to ensure safe use. These safeguards might include enhanced monitoring after implementation, validating the system on a small subset of your own patients before full rollout, additional staff training, or workflow protocols that identify low-confidence results or scenarios where population differences may affect performance. Such measures help ensure that AI supports, rather than inadvertently undermines, equitable and accurate care.
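For example, a workflow protocol that identifies low-confidence results can be as simple as a threshold rule applied before anyone acts on the AI’s output. The sketch below is purely illustrative; the field names, confidence score and cutoff are assumptions, and any real threshold should come from the vendor’s documentation and your own local validation.

```python
# Hypothetical triage rule: route low-confidence AI results to clinician review.
REVIEW_THRESHOLD = 0.70  # illustrative cutoff; set from local validation, not this sketch

def triage(result):
    """Return a workflow action for a single AI result (fields are assumed)."""
    if result["confidence"] < REVIEW_THRESHOLD:
        return "clinician review"          # low confidence: do not act on AI output alone
    return "follow AI-guided pathway"      # confident result: proceed per protocol

example = {"patient_id": "A-001", "finding": "referable DR", "confidence": 0.62}
print(triage(example))   # -> "clinician review"
```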
While this article focuses on diagnostic AI in eye care, the broader principle extends across all modalities and specialties: trustworthy AI requires continual attention, and responsibility is shared. Developers must build and validate systems rigorously, but clinicians, practices, and health systems must remain engaged stewards of those systems over time. By applying thoughtful evaluations, adopting reasonable safeguards and maintaining ongoing oversight, end users can mitigate risk, reduce bias and maximize the real-world benefits of AI for the patients and communities they serve.

