Open-source Python library for evaluating ML model reliability beyond accuracy — with calibration, failure, and fairness diagnostics for informed deployment decisions.
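A minimal sketch of the kind of calibration diagnostic such a library computes, assuming nothing about its actual API: expected calibration error (ECE), which bins predictions by confidence and compares average confidence to empirical accuracy in each bin. All names here are illustrative, not the library's.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Hypothetical ECE sketch: bin predictions by confidence and
    measure the gap between confidence and accuracy per bin."""
    confidences = probs.max(axis=1)        # model's confidence per sample
    predictions = probs.argmax(axis=1)     # predicted class per sample
    accuracies = (predictions == labels).astype(float)

    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(accuracies[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap       # weight gap by bin population
    return ece

probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]])
labels = np.array([0, 1, 1])
print(expected_calibration_error(probs, labels))
```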
The course equips developers with techniques to enhance the reliability of LLMs, focusing on evaluation, prompt engineering, and fine-tuning. Learn to systematically improve model accuracy through hands-on projects, including building a text-to-SQL agent and applying advanced fine-tuning methods.
PromptGuard is a pragmatic, opinionated framework for establishing continuous integration for LLM behavior. It operates on a simple, verifiable principle: run the same prompts across multiple model configurations, compare outputs against defined expectations, and flag semantic regressions.
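A hedged sketch of the pattern that description names, not PromptGuard's real API: run the same prompt across several model configurations, score each output against an expected answer, and flag those below a threshold. The model call and canned outputs are stand-ins so the sketch runs offline; a real comparator would be semantic (e.g., embedding similarity) rather than the crude lexical ratio used here.

```python
from difflib import SequenceMatcher

# Hypothetical model call -- replace with a real provider client.
_FAKE_OUTPUTS = {
    ("model-a", "capital of France?"): "Paris is the capital of France.",
    ("model-b", "capital of France?"): "The capital of France is Paris.",
}

def call_model(model: str, prompt: str) -> str:
    return _FAKE_OUTPUTS.get((model, prompt), "")

def flag_regressions(prompt, expected, models, threshold=0.6):
    """Run one prompt across model configs and flag any output whose
    similarity to the expectation falls below the threshold."""
    flagged = []
    for model in models:
        output = call_model(model, prompt)
        score = SequenceMatcher(None, expected.lower(), output.lower()).ratio()
        if score < threshold:
            flagged.append((model, round(score, 2), output))
    return flagged

print(flag_regressions("capital of France?",
                       "Paris is the capital of France.",
                       ["model-a", "model-b"]))
```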
A hard reasoning benchmark whose items are filtered by inter-model disagreement scores.
A reproducible, data-centric benchmarking framework evaluating the robustness of tabular machine learning models under systematic feature shift using OpenML-CC18 datasets and automated feature engineering.
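The framework's actual protocol (OpenML-CC18 datasets, automated feature engineering) is richer than what fits here, but the core idea of a systematic feature shift can be sketched: train once, perturb one feature at a time, and record the accuracy drop. This uses a synthetic dataset purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative per-feature shift robustness check on synthetic data.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
baseline = accuracy_score(y_te, model.predict(X_te))

for j in range(X.shape[1]):
    X_shift = X_te.copy()
    X_shift[:, j] += 2.0 * X_te[:, j].std()   # systematic mean shift on feature j
    shifted = accuracy_score(y_te, model.predict(X_shift))
    print(f"feature {j}: accuracy {baseline:.3f} -> {shifted:.3f}")
```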
Text-only playground for evaluating reasoning model outputs with mock accuracy, hallucination, and trust metrics — runs 100% locally.
Multi-LLM consensus engine for automated code review, diff analysis, and risk scoring.
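One plausible shape for the consensus step, sketched under assumptions (the engine's real aggregation logic is not shown here): each model returns a risk score for a diff, the median is taken as the consensus, and high inter-model disagreement escalates to a human.

```python
from statistics import median, pstdev

# Hypothetical consensus sketch: each "reviewer" is an LLM returning a
# risk score in [0, 1] for a diff; canned scores stand in for real calls.
def consensus_risk(scores: list[float], disagreement_cap: float = 0.2):
    """Aggregate per-model risk scores; escalate when models disagree
    or the consensus risk is high."""
    agreed = pstdev(scores) <= disagreement_cap
    risk = median(scores)
    return {"risk": risk, "needs_human_review": not agreed or risk > 0.7}

print(consensus_risk([0.2, 0.25, 0.3]))   # models agree: low risk
print(consensus_risk([0.1, 0.8, 0.9]))    # models disagree: escalate
```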