# Fix typos and clean style #2042

Open · wants to merge 1 commit into base: main

16 changes: 8 additions & 8 deletions docs/getstarted/evals.md
@@ -11,7 +11,7 @@ In this guide, you will evaluate a **text summarization pipeline**. The goal is

### Evaluating using a Non-LLM Metric

-Here is a simple example that uses `BleuScore` score to score summary
+Here is a simple example that uses `BleuScore` to score a summary:

```python
from ragas import SingleTurnSample
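from ragas.metrics import BleuScore

# NOTE: the remainder of this example is collapsed in the diff view above.
# The lines below are an editor's sketch of a typical completion, not the
# original file's exact contents; the sample texts are illustrative.
test_data = {
    "user_input": "Summarise the quarterly report.",
    "response": "The company grew revenue by 8% in the second quarter.",
    "reference": "Revenue increased 8% in Q2.",
}

metric = BleuScore()
sample = SingleTurnSample(**test_data)
print(metric.single_turn_score(sample))
```
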
@@ -40,9 +40,9 @@ Here we used:

As you may observe, this approach has two key limitations:

-- **Time-Consuming Preparation:** Evaluating the application requires preparing the expected output (`reference`) for each input, which can be both time-consuming and challenging.
+- **Time-consuming preparation:** Evaluating the application requires preparing the expected output (`reference`) for each input, which can be both time-consuming and challenging.

-- **Inaccurate Scoring:** Even though the `response` and `reference` are similar, the output score was low. This is a known limitation of non-LLM metrics like `BleuScore`.
+- **Inaccurate scoring:** Even though the `response` and `reference` are similar, the output score was low. This is a known limitation of non-LLM metrics like `BleuScore`.


!!! info
@@ -51,7 +51,7 @@ As you may observe, this approach has two key limitations:
To address these issues, let's try an LLM-based metric.


-### Evaluating using a LLM based Metric
+### Evaluating using a LLM-based Metric


**Choose your LLM**
@@ -62,7 +62,7 @@ choose_evaluator_llm.md
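
The `choose_evaluator_llm.md` snippet is pulled in from another file and is not part of this diff. As a rough sketch only (the LangChain/OpenAI wiring and model name below are assumptions, not content from this PR), setting up an evaluator LLM for `ragas` might look like:

```python
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

# Wrap a LangChain chat model so ragas metrics can call it as the judge.
# Assumes OPENAI_API_KEY is set; the model choice is illustrative.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
```
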
**Evaluation**


-Here we will use [AspectCritic](../concepts/metrics/available_metrics/aspect_critic.md), which an LLM based metric that outputs pass/fail given the evaluation criteria.
+Here we will use [AspectCritic](../concepts/metrics/available_metrics/aspect_critic.md), which is an LLM-based metric that outputs pass/fail given the evaluation criteria.


```python
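# NOTE: the body of this example is collapsed in the diff view above. The lines
# below are an editor's sketch assuming the `evaluator_llm` wrapper from the
# previous step; the sample texts are illustrative, not from the original file.
from ragas import SingleTurnSample
from ragas.metrics import AspectCritic

sample = SingleTurnSample(
    user_input="Summarise the quarterly report.",
    response="The company grew revenue by 8% in the second quarter.",
)

metric = AspectCritic(
    name="summary_accuracy",
    llm=evaluator_llm,
    definition="Verify if the summary accurately reflects the input text.",
)
print(metric.single_turn_score(sample))
```
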
@@ -88,7 +88,7 @@ Output
Success! Here 1 means pass and 0 means fail

!!! info
-There are many other types of metrics that are available in ragas (with and without `reference`), and you may also create your own metrics if none of those fits your case. To explore this more checkout [more on metrics](../concepts/metrics/index.md).
+There are many other types of metrics that are available in `ragas` (with and without `reference`), and you may also create your own metrics if none of those fits your case. To explore this more checkout [more on metrics](../concepts/metrics/index.md).
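
As an aside (an illustrative criterion, not part of the original guide), the same `AspectCritic` class can express other pass/fail checks simply by changing the `definition`:

```python
from ragas.metrics import AspectCritic

# Hypothetical second criterion, reusing the same evaluator LLM as above.
conciseness = AspectCritic(
    name="summary_conciseness",
    llm=evaluator_llm,
    definition="Verify if the summary is concise and free of repetition.",
)
```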

### Evaluating on a Dataset
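
The dataset-construction and evaluation code for this section is collapsed in the diff. A minimal sketch of what this step usually looks like (the `test_samples` list and the reuse of `metric` are assumptions for illustration):

```python
from ragas import EvaluationDataset, evaluate

# Build an evaluation dataset from a list of dicts with "user_input" and
# "response" fields; `test_samples` is a hypothetical placeholder here.
eval_dataset = EvaluationDataset.from_list(test_samples)

# Score every sample with the AspectCritic metric defined earlier.
results = evaluate(eval_dataset, metrics=[metric])
print(results)
```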

@@ -148,7 +148,7 @@ Output
```
{'summary_accuracy': 0.84}
```

-This score shows that out of all the samples in our test data, only 84% of summaries passes the given evaluation criteria. Now, **It
+This score shows that out of all the samples in our test data, only 84% of summaries passes the given evaluation criteria. Now, **it
s important to see why is this the case**.

Export the sample level scores to pandas dataframe
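
The export snippet itself is collapsed in the diff; with the `results` object from the evaluation above this is typically a one-liner (a sketch, assuming that object and the metric name `summary_accuracy` are in scope):

```python
# Per-sample scores as a pandas DataFrame, useful for spotting failures.
df = results.to_pandas()
failed = df[df["summary_accuracy"] == 0]
print(failed.head())
```
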
@@ -187,4 +187,4 @@ If you want help with improving and scaling up your AI application using evals.

## Up Next

-- [Evaluate a simple RAG application](rag_eval.md)
+- [Evaluate a simple RAG application](rag_eval.md)