Open-source Python library for evaluating ML model reliability beyond accuracy — with calibration, failure, and fairness diagnostics for informed deployment decisions.
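A minimal sketch of the kind of calibration diagnostic such a library computes, assuming nothing about its actual API: expected calibration error (ECE), which bins predictions by confidence and compares average confidence to empirical accuracy in each bin. All names here are illustrative, not the library's.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Hypothetical ECE sketch: bin predictions by confidence and
    measure the gap between confidence and accuracy per bin."""
    confidences = probs.max(axis=1)        # model's confidence per sample
    predictions = probs.argmax(axis=1)     # predicted class per sample
    accuracies = (predictions == labels).astype(float)

    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(accuracies[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap       # weight gap by bin population
    return ece

probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]])
labels = np.array([0, 1, 1])
print(expected_calibration_error(probs, labels))
```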
The course equips developers with techniques to enhance the reliability of LLMs, focusing on evaluation, prompt engineering, and fine-tuning. Learn to systematically improve model accuracy through hands-on projects, including building a text-to-SQL agent and applying advanced fine-tuning methods.
PromptGuard is a pragmatic, opinionated framework for establishing continuous integration for LLM behavior. It operates on a simple, verifiable principle: run the same prompts across multiple model configurations, compare outputs against defined expectations, and flag semantic regressions.
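A hedged sketch of the pattern that description names, not PromptGuard's real API: run the same prompt across several model configurations, score each output against an expected answer, and flag those below a threshold. The model call and canned outputs are stand-ins so the sketch runs offline; a real comparator would be semantic (e.g., embedding similarity) rather than the crude lexical ratio used here.

```python
from difflib import SequenceMatcher

# Hypothetical model call -- replace with a real provider client.
_FAKE_OUTPUTS = {
    ("model-a", "capital of France?"): "Paris is the capital of France.",
    ("model-b", "capital of France?"): "The capital of France is Paris.",
}

def call_model(model: str, prompt: str) -> str:
    return _FAKE_OUTPUTS.get((model, prompt), "")

def flag_regressions(prompt, expected, models, threshold=0.6):
    """Run one prompt across model configs and flag any output whose
    similarity to the expectation falls below the threshold."""
    flagged = []
    for model in models:
        output = call_model(model, prompt)
        score = SequenceMatcher(None, expected.lower(), output.lower()).ratio()
        if score < threshold:
            flagged.append((model, round(score, 2), output))
    return flagged

print(flag_regressions("capital of France?",
                       "Paris is the capital of France.",
                       ["model-a", "model-b"]))
```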
A hard reasoning benchmark whose items are filtered by inter-model disagreement scores.
A reproducible, data-centric benchmarking framework evaluating the robustness of tabular machine learning models under systematic feature shift using OpenML-CC18 datasets and automated feature engineering.
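The framework's actual protocol (OpenML-CC18 datasets, automated feature engineering) is richer than what fits here, but the core idea of a systematic feature shift can be sketched: train once, perturb one feature at a time, and record the accuracy drop. This uses a synthetic dataset purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative per-feature shift robustness check on synthetic data.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
baseline = accuracy_score(y_te, model.predict(X_te))

for j in range(X.shape[1]):
    X_shift = X_te.copy()
    X_shift[:, j] += 2.0 * X_te[:, j].std()   # systematic mean shift on feature j
    shifted = accuracy_score(y_te, model.predict(X_shift))
    print(f"feature {j}: accuracy {baseline:.3f} -> {shifted:.3f}")
```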
Text-only playground for evaluating reasoning model outputs with mock accuracy, hallucination, and trust metrics — runs 100% locally.
Multi-LLM consensus engine for automated code review, diff analysis, and risk scoring.
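One plausible shape for the consensus step, sketched under assumptions (the engine's real aggregation logic is not shown here): each model returns a risk score for a diff, the median is taken as the consensus, and high inter-model disagreement escalates to a human.

```python
from statistics import median, pstdev

# Hypothetical consensus sketch: each "reviewer" is an LLM returning a
# risk score in [0, 1] for a diff; canned scores stand in for real calls.
def consensus_risk(scores: list[float], disagreement_cap: float = 0.2):
    """Aggregate per-model risk scores; escalate when models disagree
    or the consensus risk is high."""
    agreed = pstdev(scores) <= disagreement_cap
    risk = median(scores)
    return {"risk": risk, "needs_human_review": not agreed or risk > 0.7}

print(consensus_risk([0.2, 0.25, 0.3]))   # models agree: low risk
print(consensus_risk([0.1, 0.8, 0.9]))    # models disagree: escalate
```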