diff --git a/docs/getstarted/evals.md b/docs/getstarted/evals.md
index 249f03928..b2f9a94d2 100644
--- a/docs/getstarted/evals.md
+++ b/docs/getstarted/evals.md
@@ -11,7 +11,7 @@ In this guide, you will evaluate a **text summarization pipeline**. The goal is
 
 ### Evaluating using a Non-LLM Metric
 
-Here is a simple example that uses `BleuScore` score to score summary
+Here is a simple example that uses `BleuScore` to score a summary:
 
 ```python
 from ragas import SingleTurnSample
@@ -40,9 +40,9 @@ Here we used:
 
 As you may observe, this approach has two key limitations:
 
-- **Time-Consuming Preparation:** Evaluating the application requires preparing the expected output (`reference`) for each input, which can be both time-consuming and challenging.
+- **Time-consuming preparation:** Evaluating the application requires preparing the expected output (`reference`) for each input, which can be both time-consuming and challenging.
 
-- **Inaccurate Scoring:** Even though the `response` and `reference` are similar, the output score was low. This is a known limitation of non-LLM metrics like `BleuScore`.
+- **Inaccurate scoring:** Even though the `response` and `reference` are similar, the output score was low. This is a known limitation of non-LLM metrics like `BleuScore`.
 
 
 !!! info
@@ -51,7 +51,7 @@ As you may observe, this approach has two key limitations:
 
 To address these issues, let's try an LLM-based metric.
 
-### Evaluating using a LLM based Metric
+### Evaluating using an LLM-based Metric
 
 **Choose your LLM**
 
@@ -62,7 +62,7 @@ choose_evaluator_llm.md
 
 **Evaluation**
 
-Here we will use [AspectCritic](../concepts/metrics/available_metrics/aspect_critic.md), which an LLM based metric that outputs pass/fail given the evaluation criteria.
+Here we will use [AspectCritic](../concepts/metrics/available_metrics/aspect_critic.md), which is an LLM-based metric that outputs pass/fail given the evaluation criteria.
 
 
 ```python
@@ -88,7 +88,7 @@ Output
 Success! Here 1 means pass and 0 means fail
 
 !!! info
-    There are many other types of metrics that are available in ragas (with and without `reference`), and you may also create your own metrics if none of those fits your case. To explore this more checkout [more on metrics](../concepts/metrics/index.md).
+    There are many other metrics available in `ragas` (with and without `reference`), and you can also create your own if none of them fits your use case. To explore this further, check out [more on metrics](../concepts/metrics/index.md).
 
 
 ### Evaluating on a Dataset
@@ -148,7 +148,7 @@ Output
 {'summary_accuracy': 0.84}
 ```
 
-This score shows that out of all the samples in our test data, only 84% of summaries passes the given evaluation criteria. Now, **It
-s important to see why is this the case**. Export the sample level scores to pandas dataframe
+This score shows that only 84% of the summaries in our test data pass the given evaluation criteria. Now, **it
+is important to see why this is the case**. Export the sample-level scores to a pandas DataFrame:
 
 
@@ -187,4 +187,4 @@ If you want help with improving and scaling up your AI application using evals.
 
 ## Up Next
 
-- [Evaluate a simple RAG application](rag_eval.md)
\ No newline at end of file
+- [Evaluate a simple RAG application](rag_eval.md)
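
For orientation, here is a minimal, self-contained sketch of the single-sample flow the patched page describes: score one `SingleTurnSample` with the non-LLM `BleuScore` metric, then with the LLM-based `AspectCritic` metric. The sample text, the criterion wording, and the `gpt-4o`-backed `evaluator_llm` are illustrative assumptions, and the `single_turn_score` / `single_turn_ascore` calls assume the current `ragas` metric interface rather than anything shown in the diff hunks.

```python
# Sketch of the single-sample evaluation flow described in the patched page.
# The sample text, criterion definition, and model choice are illustrative only.
import asyncio

from langchain_openai import ChatOpenAI

from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AspectCritic, BleuScore

sample = SingleTurnSample(
    user_input="Summarise the Q2 report",                         # illustrative input
    response="The company grew revenue by 8% in Q2.",             # pipeline output
    reference="Revenue increased 8% in the second quarter.",      # expected summary
)

# Non-LLM metric: needs a `reference` and scores on token overlap.
bleu = BleuScore()
print(bleu.single_turn_score(sample))

# LLM-based metric: pass/fail against a natural-language criterion.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))   # one possible evaluator
summary_accuracy = AspectCritic(
    name="summary_accuracy",
    definition="Verify if the summary is accurate.",
    llm=evaluator_llm,
)
print(asyncio.run(summary_accuracy.single_turn_ascore(sample)))   # 1 = pass, 0 = fail
```

Run on similar strings like these, `BleuScore` would typically return a low overlap score while `AspectCritic` returns 1, which is the contrast behind the two limitations the page lists.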
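
The dataset section of the page then aggregates the same criterion over many samples and exports per-sample results to pandas. The sketch below assumes `EvaluationDataset.from_list`, `evaluate`, and the result object's `to_pandas()` from the `ragas` API; the two-sample dataset, the model choice, and the criterion wording are again made up for illustration.

```python
# Sketch of the dataset-level step: aggregate the metric over several samples,
# then inspect per-sample scores. Data, model, and criterion are illustrative.
from langchain_openai import ChatOpenAI

from ragas import EvaluationDataset, evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AspectCritic

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
summary_accuracy = AspectCritic(
    name="summary_accuracy",
    definition="Verify if the summary is accurate.",
    llm=evaluator_llm,
)

# EvaluationDataset.from_list maps plain dicts onto single-turn samples.
eval_dataset = EvaluationDataset.from_list(
    [
        {
            "user_input": "Summarise: the company reported an 8% revenue rise in Q2.",
            "response": "The company grew revenue by 8% in Q2.",
        },
        {
            "user_input": "Summarise: the team launched two products in 2023.",
            "response": "Two products were launched by the team in 2023.",
        },
    ]
)

results = evaluate(dataset=eval_dataset, metrics=[summary_accuracy])
print(results)  # aggregate score, e.g. {'summary_accuracy': 0.84} in the page's run

# Per-sample scores show which summaries failed and why the aggregate is below 1.0.
df = results.to_pandas()
print(df[df["summary_accuracy"] == 0])
```

Filtering the DataFrame to the failing rows is usually the quickest way to see why an aggregate such as `0.84` falls short of 1.0, which is the point the patched text makes right after reporting the score.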