UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection
Code scanner to check for issues in prompts and LLM calls
Cost-of-Pass: An Economic Framework for Evaluating Language Models
LLM-as-a-judge for Extractive QA datasets
JudgeGPT - (Fake) News Evaluation, a research project
A Python library providing evaluation metrics to compare generated texts from LLMs, often against reference texts. Features streamlined workflows for model comparison and visualization.
CLI tool to evaluate LLM factuality on MMLU benchmark.
Automatic multi-metric evaluation of human-bot dialogues using LLMs (Claude, GPT-4o) across different datasets and settings. Built for the Artificial Intelligence course at the University of Salerno.
An open-source evaluation suite for testing LLMs on refusal handling, tone control, and reasoning. Built to explore model behavior across nuanced user cases.
Interactive Python toolkit for exploring, testing, and benchmarking LLM tokenization, prompt behaviors, and sequence efficiency in a safe, modular sandbox environment.
Training and evaluation of the OpenNMT neural machine translation engine on a corpus of inflected forms and then on lemmas
This repository contains a study comparing the web search capabilities of four AI assistants: Gemini 2.0 Flash, ChatGPT-4 Turbo, DeepSeek R1, and Grok 3.
A diagnostic Gradio tool to simulate feedback loops in Retrieval-Augmented Generation (RAG) pipelines and detect Model Autophagy Disorder (MAD) risks.
A modular system for automated, multi-metric AI prompt evaluation—featuring expert models, an orchestrator, and a modern web UI.