Topic Modeling-ResearchTool

Overview

Topic Modeling-ResearchTool is a Streamlit-based application for extracting and refining topics from text data. Users upload a CSV file, select the text column to analyze, and extract candidate topics as frequent n-grams. The tool then calls OpenAI's API to refine those raw topics into clearer, more interpretable themes.

Features

Upload CSV Files: Users can upload a CSV file containing textual data.
Text Cleaning: Removes punctuation, converts text to lowercase, and eliminates non-alphanumeric characters.
N-gram Selection: Users can choose between bi-grams, tri-grams, four-grams, and five-grams.
Frequency Filtering: Users can set a threshold to exclude highly frequent n-grams.
Parallel Processing: Uses multiprocessing for efficient computation.
Topic Extraction: Generates topics using CountVectorizer from scikit-learn.
Topic Refinement: Uses OpenAI API to convert raw topic terms into meaningful descriptions.
Search and View Results: Users can search for specific topics and view topic frequencies.

Installation

Prerequisites

Ensure you have Python 3.10.3 or later installed.

1. Clone the Repository

git clone https://github.com/munas-git/TopicModeling-ResearchTool.git
cd TopicModeling-ResearchTool

2. Install Dependencies

pip install -r requirements.txt

3. Set Up API Key

Create a .env file in the root directory and add:

OPEN_AI_KEY=your_openai_api_key
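How app.py reads this key is not shown in this README. As a rough sketch, assuming the file holds plain KEY=value lines, the key could be read with the standard library alone. The helper names load_env_file and get_openai_key are hypothetical and are not part of the repository.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and '#' comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_openai_key(path: str = ".env") -> str:
    """Prefer a real environment variable, then fall back to the .env file."""
    key = os.environ.get("OPEN_AI_KEY") or load_env_file(path).get("OPEN_AI_KEY", "")
    if not key:
        raise RuntimeError("OPEN_AI_KEY is not set; add it to .env or the environment")
    return key
```

Failing early with a clear message when the key is missing avoids a harder-to-read error later, when the refinement step first contacts the API.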

4. Run the Application

streamlit run app.py

Preprocessing Steps

The following steps are applied to clean and prepare the text data before topic extraction:

Text Normalization
- Converts text to lowercase.
- Removes punctuation using string.punctuation.
- Eliminates non-alphanumeric characters (except spaces) using regex.
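The three normalization steps can be sketched as a single function. This is a minimal illustration, not the repository's code: the name clean_text and the exact regex are assumptions.

```python
import re
import string

# Translation table that deletes every character in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)
# Anything that is not a lowercase ASCII letter, a digit, or whitespace.
# Assumption: an ASCII-only pattern, so accented letters are dropped too.
_NON_ALNUM = re.compile(r"[^a-z0-9\s]")


def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, then drop remaining non-alphanumerics."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    return _NON_ALNUM.sub("", no_punct)


print(clean_text("Hello, World! It's 2024."))  # hello world its 2024
```

The regex pass catches characters that string.punctuation does not cover, such as typographic dashes and curly quotes.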
N-gram Vectorization
- Users select an n-gram range (e.g., bi-grams, tri-grams, etc.).
- CountVectorizer extracts term frequencies while removing common English stopwords.
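To make the counting step concrete, here is a plain-Python stand-in for what the vectorizer does conceptually: drop stopwords, slide a window of n tokens, and tally the resulting phrases. It is not scikit-learn's CountVectorizer and will not match it exactly (for example, it tokenizes on whitespace only). The stopword set and function names are illustrative assumptions.

```python
from collections import Counter

# Tiny illustrative stopword list (assumption); scikit-learn's built-in
# English list is much longer.
STOP_WORDS = frozenset({"a", "an", "and", "in", "of", "on", "the", "to"})


def extract_ngrams(text: str, n: int) -> list[str]:
    """Return the n-grams of one cleaned document, stopwords removed first."""
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(documents: list[str], n: int) -> Counter:
    """Tally n-gram occurrences across all documents."""
    counts: Counter = Counter()
    for doc in documents:
        counts.update(extract_ngrams(doc, n))
    return counts


docs = ["the cat sat on the mat", "the cat sat on the sofa"]
print(count_ngrams(docs, 2).most_common(1))  # [('cat sat', 2)]
```

Because stopwords are removed before the window slides, an n-gram can join words that were not adjacent in the original sentence ("sat mat" above).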
Frequency Filtering
- Users define a threshold (max_df_threshold) to exclude extremely frequent n-grams.
- This prevents common phrases from dominating the results.
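The sketch below illustrates one way such a filter can work, assuming max_df_threshold is a fraction of documents, which is how a float max_df behaves in scikit-learn's CountVectorizer (terms whose document frequency exceeds the threshold are ignored). The function name and input shape are hypothetical.

```python
from collections import Counter


def filter_frequent_ngrams(
    doc_ngrams: list[list[str]], max_df_threshold: float
) -> Counter:
    """Keep n-grams that appear in at most max_df_threshold of the documents."""
    n_docs = len(doc_ngrams)
    if n_docs == 0:
        return Counter()
    totals: Counter = Counter()    # total occurrences of each n-gram
    doc_freq: Counter = Counter()  # number of documents containing it
    for ngrams in doc_ngrams:
        totals.update(ngrams)
        doc_freq.update(set(ngrams))
    return Counter(
        {ng: c for ng, c in totals.items() if doc_freq[ng] / n_docs <= max_df_threshold}
    )


docs = [["a b", "b c"], ["a b", "c d"], ["a b", "b c"], ["a b"]]
kept = filter_frequent_ngrams(docs, 0.6)
print(sorted(kept.items()))  # [('b c', 2), ('c d', 1)]
```

Here "a b" occurs in all four documents (100%), so a 0.6 threshold removes it. If the interface collects the limit as a percentage, it would be divided by 100 before being used this way.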
Topic Extraction
- The parallel_extract_topics function processes text in batches using multiprocessing.Pool.
- The extracted terms are stored as Base Topics with their respective frequencies.
Topic Refinement (LLM-based)
- Topics are processed in batches via OpenAI’s GPT-3.5/4.
- The model converts raw n-gram-based topics into AI Refined Topics.
- The goal is refined topics that are specific, meaningful, and easily interpretable.
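The batching logic around the model call can be sketched independently of any particular client library. In this illustration the model call is passed in as a plain callable (call_model), so the prompt wording, the batch size, and every function name are assumptions and not the tool's actual prompt or code.

```python
from typing import Callable


def build_refinement_prompt(base_topics: list[str]) -> str:
    """Build one prompt asking for a readable label per raw n-gram."""
    numbered = "\n".join(f"{i}. {t}" for i, t in enumerate(base_topics, start=1))
    return (
        "Rewrite each raw n-gram below as a short, specific, easily "
        "interpretable topic label. Return one label per line, in the same order."
        "\n\n" + numbered
    )


def refine_topics(
    base_topics: list[str],
    call_model: Callable[[str], str],  # hypothetical: prompt in, reply text out
    batch_size: int = 20,
) -> dict[str, str]:
    """Map each base topic to a refined label, one model call per batch."""
    refined: dict[str, str] = {}
    for start in range(0, len(base_topics), batch_size):
        batch = base_topics[start:start + batch_size]
        reply = call_model(build_refinement_prompt(batch))
        labels = [line.strip() for line in reply.splitlines() if line.strip()]
        if len(labels) != len(batch):
            # Reply does not line up with the input; keep the raw terms
            # rather than pairing topics with the wrong labels.
            labels = batch
        refined.update(zip(batch, labels))
    return refined
```

In practice call_model would wrap a chat-completion request to the chosen OpenAI model. Keeping it injectable also lets the batching be tested offline with a stub.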

User Controls & Inputs

The tool provides the following controls:

File Upload
- Users upload a CSV file with textual content.
Column Selection
- If multiple columns exist, users select the column for topic modeling.
N-gram Selection
- Options: Bi-grams (2-grams), Tri-grams (3-grams), Four-grams (4-grams), Five-grams (5-grams).
Frequency Limit (%)
- Users specify a threshold (0-100%) to exclude common phrases.
Start Analysis
- Clicking the "Begin Analysis" button initiates the pipeline.
Search Topics
- Users can search extracted topics by entering keywords.
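A keyword search over the results can be as simple as a case-insensitive substring match. The sketch below assumes each result row is a dict with hypothetical keys base_topic, refined_topic, and frequency; the tool's real column names may differ.

```python
def search_topics(rows: list[dict], query: str) -> list[dict]:
    """Return rows whose base or refined topic contains the query text."""
    needle = query.strip().lower()
    if not needle:
        return list(rows)  # an empty search shows everything
    return [
        row
        for row in rows
        if needle in row["base_topic"].lower() or needle in row["refined_topic"].lower()
    ]


results = [
    {"base_topic": "machine learning model", "refined_topic": "ML Model Design", "frequency": 12},
    {"base_topic": "climate change policy", "refined_topic": "Climate Policy", "frequency": 7},
]
print([r["frequency"] for r in search_topics(results, "climate")])  # [7]
```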

Final Output & Results

After analysis, the tool generates:

DataFrame Display: Shows extracted topics, their frequencies, and refined versions.
Searchable Topic List: Users can filter topics using keywords.
Download Option (Future Enhancement): Export results as a CSV file.

Example Workflow

1. Upload a CSV file containing documents or abstracts.
2. Select the text column to analyze.
3. Choose n-gram settings and frequency threshold.
4. Click "Begin Analysis" to extract topics.
5. View and search refined topics in the results table.

Future Enhancements

Downloadable Reports: Export topic analysis as CSV/JSON.
Interactive Visualizations: Word clouds and bar charts for topic distribution.
Model Customization: Allow users to choose different NLP models.
Cloud Storage Integration: Store analysis results for later retrieval.
