Beginner's Guide - Creating LLM Datasets with Python

1. Introduction to LLM Datasets

Large Language Models (LLMs) like GPT require high-quality datasets for training and fine-tuning. These datasets typically consist of text data organized in formats suitable for machine learning tasks.

Key Features of LLM Datasets:

  • Diversity: A mix of topics, languages, and styles.
  • Quality: Clean and well-structured data.
  • Size: Ranges from a few MBs for fine-tuning to TBs for large-scale training.

2. Prerequisites

Before you start, ensure you have the following tools installed:

  1. Python: Install Python 3.8 or newer.

    • Verify installation:
      python --version
  2. Libraries: Install essential libraries for data collection and processing:

    pip install pandas numpy requests beautifulsoup4 nltk datasets
  3. Hardware: Use a system with sufficient RAM and storage for larger datasets.


3. Step 1: Define Your Dataset Objective

Before creating your dataset, define its purpose:

  • Text Generation: Collect conversational data, creative writing, or news articles.
  • Classification: Gather labeled data for tasks like sentiment analysis.
  • Translation: Source parallel corpora for multiple languages.
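
Each objective implies a different record shape. A minimal sketch of what a single example could look like for each task; the field names here are illustrative, not a required schema:

# Illustrative single examples per objective (field names are not a required schema)
generation_example = {"prompt": "What is AI?", "response": "Artificial Intelligence is..."}
classification_example = {"text": "I loved this movie!", "label": "positive"}
translation_example = {"source": "Hello, world!", "target": "Bonjour, le monde !"}

print(generation_example)
print(classification_example)
print(translation_example)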

4. Step 2: Collect Text Data

Option 1: Scraping Websites

Use libraries like requests and BeautifulSoup to scrape data:

import requests
from bs4 import BeautifulSoup

url = "https://example.com/articles"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract article text
articles = []
for tag in soup.find_all('p'):
    articles.append(tag.get_text())

print("Collected Articles:", articles)

Option 2: Using Public APIs

Fetch data from public APIs, such as those offered by Twitter or Reddit:

import requests

api_url = "https://api.example.com/data"
headers = {"Authorization": "Bearer YOUR_API_KEY"}
response = requests.get(api_url, headers=headers)
data = response.json()

print("Fetched Data:", data)

Option 3: Using Open Datasets

Leverage open datasets from platforms like Hugging Face:

from datasets import load_dataset

dataset = load_dataset("ag_news")
print(dataset["train"][0])
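
If you plan to clean the data with pandas in the next step, a split can be converted to a DataFrame with the Dataset.to_pandas() method. A short sketch, assuming the dataset object loaded above:

# Convert the training split of the loaded Hugging Face dataset to a pandas DataFrame
df = dataset["train"].to_pandas()
print(df.head())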

5. Step 3: Clean and Preprocess Data

Common Preprocessing Steps:

  1. Remove Special Characters:

    import re

    text = "Hello! This is a sample text with special chars: #, $, %!"
    clean_text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    print(clean_text)
  2. Tokenization:

    import nltk
    nltk.download('punkt')
    from nltk.tokenize import word_tokenize

    tokens = word_tokenize("This is a sample sentence.")
    print(tokens)
  3. Lowercasing:

    text = "This Is Mixed Case."
    print(text.lower())
  4. Remove Stop Words:

    from nltk.corpus import stopwords
    nltk.download('stopwords')

    words = [word for word in tokens if word not in stopwords.words('english')]
    print(words)
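
The steps above can be chained into a single helper. A minimal sketch that combines cleaning, lowercasing, tokenization, and stop-word removal, assuming the nltk resources (punkt, stopwords) downloaded above:

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def preprocess(text):
    # Remove special characters, lowercase, tokenize, then drop stop words
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text).lower()
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    return [token for token in tokens if token not in stop_words]

print(preprocess("Hello! This is a sample text with special chars: #, $, %!"))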

6. Step 4: Format Data for LLMs

JSONL Format

Store data in JSONL (JSON Lines) format for ease of loading:

import json

data = [{"prompt": "What is AI?", "response": "Artificial Intelligence is..."}]
with open("dataset.jsonl", "w") as f:
    for entry in data:
        f.write(json.dumps(entry) + "\n")

CSV Format

Use CSV for structured data:

import pandas as pd

data = {"prompt": ["What is AI?"], "response": ["Artificial Intelligence is..."]}
df = pd.DataFrame(data)
df.to_csv("dataset.csv", index=False)
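
Either format can be loaded back with the datasets library to confirm the files are well-formed; a quick sketch:

from datasets import load_dataset

jsonl_ds = load_dataset("json", data_files="dataset.jsonl")
csv_ds = load_dataset("csv", data_files="dataset.csv")

print(jsonl_ds["train"][0])
print(csv_ds["train"][0])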

7. Step 5: Validate Your Dataset

  1. Check for Missing Values:

    print(df.isnull().sum())
  2. Validate Data Distribution:

    print(df["response"].str.len().describe())
  3. Manual Review:

    • Spot-check a few entries to ensure quality.
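
One quick way to spot-check is to print a random sample of rows, assuming the pandas DataFrame df from the CSV step above:

# Print a few randomly selected rows for manual review
print(df.sample(5))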

8. Step 6: Augment Your Dataset (Optional)

Data Augmentation Techniques:

  1. Synonym Replacement:

    import nltk
    from nltk.corpus import wordnet

    nltk.download('wordnet')

    def replace_synonyms(text):
        # Replace each word with the first lemma of its first WordNet synset, if any
        words = text.split()
        new_words = []
        for word in words:
            synonyms = wordnet.synsets(word)
            if synonyms:
                new_words.append(synonyms[0].lemmas()[0].name())
            else:
                new_words.append(word)
        return " ".join(new_words)

    print(replace_synonyms("This is a test sentence."))
  2. Back Translation: Translate text to another language and back to add paraphrase-like diversity (see the sketch after this list).

  3. Sentence Shuffling: Randomize sentence order in paragraphs to create variations (also sketched below).
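
A minimal sketch of the last two techniques. Back translation is shown against a generic translate_fn callable, because the actual translation step depends on whichever API or model you choose; translate_fn is a placeholder, not a real library function. Sentence shuffling uses nltk's sentence tokenizer, which relies on the punkt resource downloaded earlier:

import random
from nltk.tokenize import sent_tokenize

def back_translate(text, translate_fn):
    # translate_fn(text, src, tgt) is a placeholder for your translation API or model
    intermediate = translate_fn(text, "en", "de")
    return translate_fn(intermediate, "de", "en")

def shuffle_sentences(paragraph):
    # Randomize sentence order within a paragraph to create a new variation
    sentences = sent_tokenize(paragraph)
    random.shuffle(sentences)
    return " ".join(sentences)

print(shuffle_sentences("First sentence. Second sentence. Third sentence."))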


9. Saving and Sharing Your Dataset

  1. Save Locally:

    mv dataset.jsonl /path/to/your/project
  2. Host on Hugging Face: Use the Hugging Face Hub to share datasets (a sketch follows this list).

  3. Open-Source Platforms: Publish on GitHub or Kaggle for broader reach.
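
A minimal sketch of pushing the JSONL file to the Hugging Face Hub with the datasets library. It assumes you have authenticated (for example with huggingface-cli login) and that "your-username/your-dataset" is replaced with your own repository name:

from datasets import load_dataset

dataset = load_dataset("json", data_files="dataset.jsonl")
dataset.push_to_hub("your-username/your-dataset")  # creates or updates the dataset repo on the Hub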


10. Conclusion

Creating LLM datasets involves thoughtful data collection, cleaning, and formatting. With Python’s extensive libraries, you can efficiently build datasets tailored to your specific use case. Experiment with different sources and techniques to create diverse and high-quality datasets for your LLM projects.