evaluation
Here are 621 public repositories matching this topic...
Easily fine-tune, evaluate and deploy Qwen3, DeepSeek-R1, Llama 4 or any open source LLM / VLM!
Updated Jun 13, 2025 - Python
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Updated Jun 13, 2025 - Python
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
Updated May 4, 2025 - Python
Python package for the evaluation of odometry and SLAM
Updated Jun 8, 2025 - Python
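For context, a minimal sketch of how a SLAM trajectory might be scored with evo's Python API (the file names groundtruth.txt and estimate.txt are placeholders, and module paths follow recent evo releases):

    from evo.core import metrics, sync
    from evo.tools import file_interface

    # Load ground-truth and estimated trajectories in TUM format (placeholder files)
    traj_ref = file_interface.read_tum_trajectory_file("groundtruth.txt")
    traj_est = file_interface.read_tum_trajectory_file("estimate.txt")

    # Associate poses by timestamp, then compute Absolute Pose Error (translation part)
    traj_ref, traj_est = sync.associate_trajectories(traj_ref, traj_est)
    ape = metrics.APE(metrics.PoseRelation.translation_part)
    ape.process_data((traj_ref, traj_est))
    print(ape.get_statistic(metrics.StatisticsType.rmse))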
The easiest tool for fine-tuning LLMs, generating synthetic data, and collaborating on datasets.
Updated Jun 14, 2025 - Python
End-to-end Automatic Speech Recognition for Mandarin and English in TensorFlow
Updated Mar 24, 2023 - Python
A unified evaluation framework for large language models
Updated May 30, 2025 - Python
Accelerating the development of large multimodal models (LMMs) with a one-click evaluation module - lmms-eval.
Updated Jun 13, 2025 - Python
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
Updated Jun 13, 2025 - Python
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.
Updated Aug 18, 2024 - Python
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
Updated Jan 10, 2025 - Python
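A minimal sketch of how a metric can be loaded and scored with 🤗 Evaluate (the toy label vectors are made up for illustration):

    import evaluate

    # Load a ready-made metric and score toy predictions against references
    accuracy = evaluate.load("accuracy")
    results = accuracy.compute(references=[0, 1, 1, 0],
                               predictions=[0, 1, 0, 0])
    print(results)  # e.g. {'accuracy': 0.75}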
Avalanche: an End-to-End Library for Continual Learning based on PyTorch.
Updated Mar 11, 2025 - Python
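A minimal sketch of how a class-incremental benchmark might be built with Avalanche (assuming a recent release; the 5-experience split is just an example):

    from avalanche.benchmarks.classic import SplitMNIST

    # Split MNIST into 5 incremental experiences (2 classes each)
    benchmark = SplitMNIST(n_experiences=5)

    for experience in benchmark.train_stream:
        print("Experience", experience.current_experience,
              "classes:", experience.classes_in_this_experience)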
☁️ 🚀 📊 📈 Evaluating the state of the art in AI
Updated Jun 11, 2025 - Python
(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"
Updated Apr 3, 2024 - Python
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
Updated Jun 13, 2025 - Python
Multi-class confusion matrix library in Python
Updated Jun 11, 2025 - Python
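A minimal sketch of how pycm builds a multi-class confusion matrix from two label vectors (the toy vectors are illustrative):

    from pycm import ConfusionMatrix

    # Build a multi-class confusion matrix from actual and predicted labels
    cm = ConfusionMatrix(actual_vector=[2, 0, 2, 2, 0, 1],
                         predict_vector=[0, 0, 2, 2, 0, 2])

    print(cm.classes)      # discovered class labels
    print(cm.Overall_ACC)  # overall accuracy
    print(cm)              # full per-class statistics report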
Mental Health LLM (LLM x Mental Health): pre- and post-training, datasets, evaluation, deployment, and RAG, with InternLM / Qwen / Baichuan / DeepSeek / Mixtral / Llama / GLM series models
Updated May 18, 2025 - Python
Evaluation code for various unsupervised automated metrics for Natural Language Generation.
Updated Aug 20, 2024 - Python
Building blocks for rapid development of GenAI applications
Updated Jun 13, 2025 - Python