Data Preparation for LLM: The Key To Better Model Performance
Using high-quality data, ethical scraping, and data pre-processing to build reliable LLMs
Big thanks to Louis-François Bouchard and the Towards AI team for this guest tutorial!
Since 2019, Louis-François and Towards AI have shared thousands of articles through their publication and hundreds of videos via the What’s AI channel. Building on that foundation, they launched an education platform built by practitioners for practitioners - now trusted by over 100,000 learners. With a mission to make AI accessible and a focus on applied learning, their hands-on LLM development courses bridge the gap between academic research and real-world industry needs.
The saying “garbage in, garbage out” holds for Large Language Models (LLMs). Their performance is directly tied to the quality of the data they’re trained on—noisy, inaccurate, or biased information can drastically reduce their effectiveness. On the other hand, models trained with well-curated, high-quality datasets perform better.
In this article, we’ll explore practical techniques for defining data standards, ethically scraping data, removing noise, and refining datasets to ensure optimal LLM performance—whether you’re starting from scratch with pre-training or fine-tuning an existing model. Before we dive into the specifics, let’s first discuss why data preparation is crucial for the success of LLMs.
Why Data Preparation Matters
Careful data preparation directly impacts an LLM's reliability. In many real-world scenarios, an unreliable model can result in misguided decisions or harmful outcomes, particularly in high-stakes fields like healthcare, finance, or law. Ensuring reliability means that the model's outputs can be trusted to reflect accurate, up-to-date information, reducing the risk of misinformation and unintended consequences. For example, if an LLM trained to provide medical information uses data containing outdated treatments, inaccurate dosages, or incorrect medical terminology, the model would generate misleading or potentially harmful recommendations.
In 2022, researcher (and YouTuber) Yannic Kilcher trained GPT-4chan using posts from 4chan’s /pol/ board, known for highly offensive content. Predictably, the model replicated racist and sexist biases from its training data. Meanwhile, the TinyGSM project used a synthetic dataset of 12.3 million elementary math problems with Python solutions. After training a smaller 1.3-billion-parameter model on this curated data, it achieved 81.5% accuracy on the GSM8K benchmark—matching GPT-3.5 and outperforming many larger models.
These examples highlight the vital role that data preparation plays in ensuring LLM quality. Not only can data curation enhance performance, but it also helps mitigate risks associated with biases and inaccuracies. This brings us to the importance of setting clear standards for data quality when working with LLMs.
Defining Data Quality Standards
Defining standards for data collection serves as a good foundation for the later process. Yu et al. (2024) identified 13 characteristics of high-quality training datasets for large language models in their paper “What Makes a High-Quality Training Dataset for Large Language Models: A Practitioners’ Perspective”:
“Reliability, Relevance, Accuracy, Compliance, Accessibility, Privacy Protection, Documentation, Large-Scale Data, Diversity, Knowledge Content, Wide Range of Sources, Absence of Low-Quality Documents, and Absence of Toxic Data.”
These characteristics can be broadly classified into four categories:
Data Reliability and Trustworthiness: Focus on credible sources and accurate representation of real-world facts in your dataset. Gathering text from well-curated corpora, peer-reviewed journals, or highly reputable websites decreases the risk of introducing misinformation. At the same time, cross-verifying critical facts mitigates the chance that your LLM will learn and reproduce inaccuracies. By prioritizing these aspects, you reduce the risk of your LLM learning harmful, biased, or inaccurate patterns, ultimately producing more trustworthy and factually correct outputs.
Data Scope and Coverage: Achieving the right balance of breadth and depth in your dataset is vital for boosting the adaptability and relevance of your LLM. Incorporating diverse topics and textual styles—ranging from academic articles and technical manuals to newspaper reports and creative writing—ensures that your model encounters a wide variety of linguistic structures, terminologies, and cultural viewpoints. While scaling up the quantity of data can be beneficial, it is equally important to filter and label content for domain-specific relevance. By doing so, you enable your model to capture both generalized language patterns and deeper domain insights that align with specific tasks or user queries.
Data Cleanliness: Screen out harmful or meaningless text that can degrade model performance or produce undesirable outputs. Removing toxic language, whether it involves hate speech or malicious bias, is critical to preventing your LLM from generating offensive or harmful responses. Eliminating low-quality documents, such as spam, cryptic fragments, or trivially short texts, further ensures that the model focuses on coherent, context-rich information. Keyword filtering, automated screening tools, and human review processes are all valuable strategies for detecting problematic text early on, allowing you to proactively preserve a dataset that fosters more accurate and respectful outputs.
Data Governance and Compliance: Ensure that the dataset is collected, stored, and used in ways that respect privacy, copyright, and legal regulations. Safeguarding privacy requires that personally identifiable information be removed or anonymized. Compliance with copyright and licensing obligations prevents legal entanglements and helps maintain professional integrity. Additionally, verifying that each data source allows the kind of usage you intend—be it research, commercial development, or open-source release—protects you from potential rights violations. When these governance and regulatory aspects are addressed thoroughly, you minimize liabilities and reinforce trust among stakeholders, collaborators, and end users.
In addition to these standards, documenting guidelines, metadata, and instructions helps ensure that the data can be easily reused and refined by collaborators or future users, streamlining verification and training pipelines.
However, well-organized data is only valuable if it is collected using sound methods. Ethical data sourcing is just as important as data organization in maintaining the integrity of the model. Recently, several high-profile AI providers have faced lawsuits over unauthorized data usage, showing how critical it is to follow proper compliance and ethical guidelines.
Ethical Web Scraping for Data Collection
Once you've identified the data you need, the next step is gathering it responsibly. Web scraping—extracting text from websites, forums, PDFs, and other online resources—is an efficient way to build datasets but comes with significant ethical and legal considerations. Generally, web scraping is legal if conducted responsibly, though it exists within a grey area shaped by evolving privacy laws, copyright regulations, and website policies.
Legal Landscape and Recent Cases
Web scraping legality primarily hinges on compliance with website Terms of Service (ToS), adherence to robots.txt files, and respect for copyright and privacy laws. Ignoring these can lead to technical blocks or legal repercussions.
What Is robots.txt? A robots.txt file tells web crawlers which parts of a website they can access. Site owners can designate disallowed directories or paths, and responsible crawlers adhere to these rules. However, not all websites have (or maintain) a robots.txt file. If one is missing or unclear, follow the site’s Terms of Service and restrict your scraping to publicly available, non-sensitive content.
Recent cases illustrating these complexities include:
The New York Times sues OpenAI and Microsoft: Allegations of copyright infringement by AI models using copyrighted content.
Condé Nast sends cease-and-desist to Perplexity AI: Legal action against unauthorized use of publisher content.
Figma pulls AI tool over copyright issues: AI tool retracted after design copyright accusations.
Major music labels sue Suno and Udio: Lawsuits over unauthorized use of copyrighted songs.
Given these complexities, responsible scraping requires careful adherence to ethical practices:
Good Practices for Web Scraping
1. Respect Website Guidelines
Always adhere to a site's robots.txt file and Terms of Service. Here's a typical example of a robots.txt file:
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/
Allow: /public/
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /search
When uncertain, or when no robots.txt is provided, seek explicit permission or use provided APIs, as these are safer and clearer methods for structured data retrieval.
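If you want to check programmatically whether a path is allowed before crawling it, Python's standard library includes a robots.txt parser. Here is a minimal sketch; the site URL and user-agent string are placeholders you would replace with your own:
from urllib.robotparser import RobotFileParser

# Placeholder site and user agent - substitute your own
robots_url = "https://example.com/robots.txt"
user_agent = "MyResearchBot/1.0"

rp = RobotFileParser()
rp.set_url(robots_url)
rp.read()  # fetch and parse the robots.txt file

# can_fetch() returns True only if this user agent may crawl the given URL
if rp.can_fetch(user_agent, "https://example.com/public/page.html"):
    print("Allowed to crawl this path")
else:
    print("Disallowed - skip it or request permission")

# crawl_delay() returns the site's Crawl-delay directive for this agent, if any
print("Requested crawl delay:", rp.crawl_delay(user_agent))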
2. Protect Privacy and Avoid Sensitive Data
Limit scraping to publicly accessible, non-sensitive information. Avoid personally identifiable information (PII), like names, addresses, or private posts. Regulations like GDPR and CCPA impose strict penalties for privacy violations. Employ privacy filters or redaction methods to mitigate the risks of capturing sensitive data inadvertently.
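As a simple illustration of redaction, the sketch below masks email addresses and phone-number-like strings with regular expressions. Real pipelines typically rely on dedicated PII-detection tools (for example, Microsoft Presidio or NER-based approaches), so treat this as a starting point rather than a complete solution:
import re

# Basic patterns for illustration only; they will not catch every PII format
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace obvious emails and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(redact_pii("Contact Jane at jane.doe@example.com or +1 (555) 123-4567."))
# -> "Contact Jane at [EMAIL] or [PHONE]."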
3. Scrape Politely and Minimize Server Impact
Respect a website's resources. Excessive scraping can trigger legal action under doctrines like "trespass to chattels." Employ rate limiting (e.g., one request per second) and auto-throttling, clearly identify your scraper via its User-Agent string, and follow crawl-delay directives. A crawl-delay tells crawlers how many seconds to wait between consecutive requests to the same server; for instance, a crawl-delay value of 10 instructs crawlers to wait at least 10 seconds between requests.
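To make these politeness rules concrete, here is a minimal sketch of a rate-limited fetch loop using the requests library. The URLs, delay, and User-Agent string are illustrative placeholders:
import time
import requests

# Identify your bot clearly so site owners can contact you
HEADERS = {"User-Agent": "MyResearchBot/1.0 (contact: research@example.com)"}
CRAWL_DELAY_SECONDS = 1.0  # increase if the site's robots.txt asks for more

urls = [
    "https://example.com/articles/1",
    "https://example.com/articles/2",
]

for url in urls:
    response = requests.get(url, headers=HEADERS, timeout=10)
    if response.status_code == 200:
        print(f"Fetched {url} ({len(response.text)} characters)")
    else:
        print(f"Skipping {url}: status {response.status_code}")
    # Wait between requests so we never hammer the server
    time.sleep(CRAWL_DELAY_SECONDS)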
4. Handle Copyright with Caution
Public visibility does not imply free usage rights. Government websites or Wikipedia content are generally safe, while news articles, books, and creative works typically hold copyrights. Use such content internally or in research contexts without public redistribution, and stay updated on copyright regulations.
5. Use Scalable and Ethical Tools
Select well-known scraping tools like Python's Scrapy, which supports politeness settings, automatic throttling, and robots.txt compliance. For dynamic sites, tools like Selenium or Puppeteer can be effective, though more resource-intensive. Browser-based APIs (ScrapingAnt, BrightData) simplify handling CAPTCHAs and proxies.
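For reference, these politeness features map to a handful of Scrapy settings. A sketch of a project's settings.py might look like this (the values shown are illustrative, not official recommendations):
# settings.py (excerpt): politeness-related Scrapy settings

# Identify the crawler to site owners
USER_AGENT = "MyResearchBot/1.0 (contact: research@example.com)"

# Honor robots.txt rules automatically
ROBOTSTXT_OBEY = True

# Base delay between requests to the same site, in seconds
DOWNLOAD_DELAY = 1.0

# Let Scrapy adapt the delay to the server's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0

# Keep concurrency per domain low to reduce server load
CONCURRENT_REQUESTS_PER_DOMAIN = 2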
When available, utilize existing compliant datasets or open repositories like Common Crawl or Reddit data dumps to minimize legal risks. Web scraping should always prioritize respect for content creators, user privacy, and website policies, which ultimately helps create robust datasets and trustworthy AI models.
Practical Approaches to Transform Real-World Text into Clean, Structured Data
Once you’ve gathered raw text data from various sources, such as webpages, PDF documents, or user-generated content, the next step is cleaning it. Raw data often contains unnecessary clutter—HTML tags, scripts, special characters, inconsistent formatting, and other irrelevant elements—that need to be removed to ensure it can be used effectively in LLM training.
The first step in this process is cleaning and normalization. Cleaning involves removing non-essential elements from your data. For example, if your text comes from HTML sources, you might need to strip out elements like <script> and <style> tags, as well as navigation menus, headers, and footers. Python libraries such as BeautifulSoup simplify this step considerably. A practical example of HTML cleaning is as follows:
from bs4 import BeautifulSoup

def clean_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove unwanted tags like script and style
    for script in soup(["script", "style"]):
        script.decompose()

    # Extract text and normalize spacing
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    # Split on double spaces so multi-word phrases stay intact
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    clean_text = '\n'.join(chunk for chunk in chunks if chunk)
    return clean_text
Frameworks like LangChain further streamline this task, offering dedicated tools such as AsyncHtmlLoader to fetch pages and Html2TextTransformer to extract plain text efficiently, saving developers from manual string manipulation and regex work.
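As a minimal sketch of that LangChain workflow (assuming a recent langchain-community release with the html2text package installed), fetching a page and converting it to plain text looks roughly like this; the URL is a placeholder:
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer

# Placeholder URL - replace with pages you are allowed to crawl
urls = ["https://example.com/blog/some-article"]

# Fetch raw HTML into Document objects
loader = AsyncHtmlLoader(urls)
docs = loader.load()

# Strip tags and keep readable plain text
html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)

print(docs_transformed[0].page_content[:500])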
Normalization complements cleaning by standardizing your text. It includes steps such as the ones below (a short code sketch follows the list):
Converting HTML entities (e.g., &amp;) into their equivalent characters.
Turning text to lowercase if case sensitivity isn't important.
Converting fancy quotation marks or special punctuation into their standard equivalents.
Removing or replacing problematic non-UTF-8 characters that might cause issues later on.
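Here is a minimal normalization sketch covering those steps; which steps you enable (lowercasing in particular) depends on your downstream task:
import html
import unicodedata

def normalize_text(text: str, lowercase: bool = True) -> str:
    # Convert HTML entities (e.g., "&amp;") into their literal characters
    text = html.unescape(text)

    # Replace fancy quotes and dashes with standard equivalents
    replacements = {
        "\u2018": "'", "\u2019": "'",   # curly single quotes
        "\u201c": '"', "\u201d": '"',   # curly double quotes
        "\u2013": "-", "\u2014": "-",   # en dash and em dash
    }
    for fancy, plain in replacements.items():
        text = text.replace(fancy, plain)

    # Normalize the Unicode representation and drop characters that cannot be
    # encoded as UTF-8 (e.g., lone surrogates left over from mojibake)
    text = unicodedata.normalize("NFKC", text)
    text = text.encode("utf-8", errors="ignore").decode("utf-8")

    return text.lower() if lowercase else text

print(normalize_text("She said &ldquo;Don&rsquo;t worry&rdquo; \u2014 twice."))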
Next, consider segmentation and structuring your cleaned and normalized data. Raw text may often appear as large, unwieldy blocks of continuous text. To improve usability, break it into manageable and meaningful segments, often referred to as "chunks." A practical strategy involves splitting large documents into paragraphs or limiting each chunk to a fixed number of tokens. This ensures each chunk contains sufficient context while still being manageable for training purposes:
# Assumes llama_index-style Document objects with .text and .metadata attributes;
# adjust the import to match your setup.
from llama_index.core import Document

def truncate_document(doc, max_tokens=1000):
    # Whitespace-delimited words serve as a rough proxy for model tokens
    tokens = doc.text.split()
    if len(tokens) > max_tokens:
        truncated_text = ' '.join(tokens[:max_tokens])
        return Document(text=truncated_text, metadata=doc.metadata)
    return doc
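Truncation discards everything past the limit. If you would rather keep the full document, a variant of the same sketch (same assumptions: whitespace tokens as a rough proxy for model tokens, and llama_index-style Document objects) splits it into several fixed-size chunks instead:
def chunk_document(doc, max_tokens=1000):
    # Split one long document into consecutive chunks of at most max_tokens words,
    # copying the original metadata onto every chunk
    tokens = doc.text.split()
    chunks = []
    for start in range(0, len(tokens), max_tokens):
        chunk_text = ' '.join(tokens[start:start + max_tokens])
        chunks.append(Document(text=chunk_text, metadata=doc.metadata))
    return chunks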
Sometimes, further filtering based on specific keywords relevant to your training objectives can be beneficial. For instance, if your LLM use case revolves around artificial intelligence, filtering your data for keywords such as "machine learning," "neural networks," or "deep learning" will significantly enhance the dataset’s relevance:
def keyword_filter(documents, keywords):
    filtered_docs = []
    for doc in documents:
        if any(keyword.lower() in doc.text.lower() for keyword in keywords):
            filtered_docs.append(doc)
    return filtered_docs

ai_keywords = ["artificial intelligence", "machine learning", "neural networks", "deep learning"]
filtered_documents = keyword_filter(documents, ai_keywords)
Lastly, structuring data into clear formats like JSON or CSV can further enhance clarity, especially if your dataset consists of various content types. For example, if scraping forum content, you might structure each entry clearly as a JSON object:
{
  "post": "main post text here",
  "comment": "user comments here",
  "user": "user details or identifiers"
}
Practical Example: Parsing PDFs with LlamaParse
To illustrate a practical approach, let's use the example of PDF documents. PDFs are complex: they mix text, images, tables, and unusual layouts that make extracting text tricky. LlamaParse is a tool designed for exactly this scenario. It's an LLM-powered document parser by the LlamaIndex team that we love, and we used it in our AI tutor, which we teach you how to replicate in our LLM Developer course. The goal of LlamaParse is to take complex, semi-structured documents (PDFs, Word, etc.) and output nicely structured text that can be used for LLM training.
LlamaParse simplifies complex document parsing by providing precision, flexibility, and customization: it maintains intricate document layouts—including images and tables—accurately; handles a wide variety of formats such as PDF, Word, HTML, and ePub; and accepts natural‑language instructions to make parsing both tailored and intuitive.
Here's a practical implementation:
First, install the necessary libraries:
# shell
!pip install llama-index llama-parse
Set up your environment with the Llama Cloud API key. Free users receive 10,000 free credits per month, while Pro users receive 20,000 free credits per month. Additional usage incurs charges based on the selected parsing mode and region. For detailed pricing information, refer to the LlamaCloud documentation.
import os
os.environ["LLAMA_CLOUD_API_KEY"] = "your-api-key-here"
Now, parse your PDFs using LlamaParse into a structured markdown. Markdown preserves document structure in a readable format with headers, lists, tables, etc.
# Import required libraries
import glob
from llama_parse import LlamaParse

# Find all PDF files in a directory
pdf_files = glob.glob("./pdf_docs/*.pdf")

# Parser setup: result_type="markdown" asks LlamaParse for markdown output
parser = LlamaParse(result_type="markdown", verbose=True)

# Option 1: Parse PDFs into structured markdown
for pdf_file in pdf_files:
    # load_data returns a list of Document objects whose .text holds the markdown
    documents = parser.load_data(pdf_file)
    md_result = "\n\n".join(doc.text for doc in documents)

    # Save markdown to a file
    output_file = pdf_file.replace('.pdf', '.md')
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(md_result)

    print(f"Processed {pdf_file} to {output_file}")
Alternatively, if you prefer structured JSON output, you can easily do so as follows. JSON output provides a more structured, machine-readable format ideal for programmatic analysis:
json_objs = []
for pdf_file in pdf_files:
    json_objs.extend(parser.get_json_result(pdf_file))

# Accessing structured data, for instance, text from page 6:
print(json_objs[0]['pages'][5]['text'])

# Extracting structured table data (CSV formatted) from page 7:
print(json_objs[0]['pages'][6]['items'][1]['csv'])
Though it may have a higher initial cost than simpler parsers, LlamaParse’s structured clarity and ease of use can significantly boost productivity in your data preparation pipeline. Transforming raw text data into structured, clean datasets involves step-by-step cleaning of irrelevant information, normalizing text formatting, segmenting into usable chunks, and structuring or filtering data according to your modeling needs. Using specialized tools like LlamaParse significantly simplifies handling complex documents, ensuring high-quality, structured outputs ready for model training or application development. For those looking to compare solutions, we have a section later in this article that lists alternatives to LlamaParse.
Once you've established basic cleaning processes, structured output tools can further refine your dataset's consistency and usability.
Using OpenAI Structured Outputs for Consistent Data Generation
Once your data is structured via tools like LlamaParse, OpenAI Structured Outputs can refine and enrich it further by extracting precise, predefined information. Because responses are guaranteed to match your schema, they are consistent and predictable, which simplifies their integration into automated processes, RAG systems, and downstream tasks, and significantly reduces the need to validate or parse free-form responses.
Step 1: Define a JSON schema for the desired data:
schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "TopBooks",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "books": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "title": {"type": "string"},
                            "author": {"type": "string"},
                            "yearPublished": {"type": "integer"},
                            "summary": {"type": "string"}
                        },
                        "required": ["title", "author", "yearPublished", "summary"],
                        "additionalProperties": False
                    }
                }
            },
            "required": ["books"],
            "additionalProperties": False
        }
    }
}
Step 2: API call using schema:
from openai import OpenAI
import json

client = OpenAI()

prompt = "List the top 5 best-selling novels with authors, publication years, and summaries."

response = client.chat.completions.create(
    model="gpt-4o",
    response_format=schema,
    temperature=0,
    messages=[
        {"role": "system", "content": "Provide structured JSON responses."},
        {"role": "user", "content": prompt}
    ]
)

# Parse the schema-guaranteed JSON and access individual fields
result = json.loads(response.choices[0].message.content)
print(result['books'][0])
print("-------------------------------")
print(result['books'][0]['title'])
Example output:
{'title': 'Don Quixote',
'author': 'Miguel de Cervantes',
'yearPublished': 1605,
'summary': 'A comedic tale of a noble who becomes a self-styled knight and embarks on adventures to revive chivalry.'}
-------------------------------
Don Quixote
Structured data preparation significantly influences the performance of LLMs. Employing specialized tools such as LlamaParse and OpenAI Structured Outputs directly addresses the challenges of extracting consistent, precise, and high-quality structured data. LlamaParse excels in turning complex, semi-structured documents into model-ready data, while OpenAI Structured Outputs provide reliable schema-aligned extractions, minimizing manual post-processing.
By combining these powerful tools and following rigorous data preparation standards, practitioners ensure their LLMs achieve higher accuracy, more reliable outputs, and robust downstream integration. This comprehensive approach is fundamental for successful AI and machine learning applications.
Data Preparation Tools
While the manual steps we've discussed—cleaning, normalizing, and structuring—are foundational, they can become overwhelming when dealing with large datasets. Thankfully, several specialized tools and libraries significantly simplify and automate these processes, making large-scale data preparation more manageable and efficient.
As previously detailed, OpenAI Structured Outputs ensures reliable and consistent extraction aligned exactly with predefined schemas. By reducing unpredictability, this tool minimizes post-processing efforts, streamlining integration into automated workflows.
Handling complex documents such as PDFs, Word files, or HTML documents often poses challenges due to intricate layouts. LlamaParse uses advanced LLM technology to convert these complex documents into structured markdown or JSON, maintaining accuracy in document elements like headings, tables, images, and lists. Its accuracy justifies the cost, providing an efficient alternative to manual parsing.
LangChain is versatile and powerful, ideal for constructing automated extraction pipelines rapidly. It features intuitive document loaders such as PDFLoader and WebBaseLoader to effortlessly fetch and convert data into plain text. Coupled with Html2Text transformer tools, LangChain facilitates fast, streamlined workflows, including targeted extraction or summarization tasks using LLM-driven data agents.
Scrapy is a high-performance web scraping framework widely used for extracting data from websites. Its powerful, asynchronous architecture allows for rapid crawling, while a robust pipeline system helps clean and structure the scraped data. When integrated with tools like LlamaParse or OpenAI Structured Outputs, Scrapy can serve as the first step in a seamless, end-to-end data preparation workflow.
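As a rough sketch of what a Scrapy spider looks like (the start URL and CSS selectors below are hypothetical and would need to match the actual site's markup):
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = ["https://example.com/articles"]  # hypothetical listing page

    # Per-spider politeness overrides
    custom_settings = {"ROBOTSTXT_OBEY": True, "DOWNLOAD_DELAY": 1.0}

    def parse(self, response):
        # Selectors are placeholders; adapt them to the real page structure
        for article in response.css("article"):
            yield {
                "title": article.css("h2::text").get(),
                "body": " ".join(article.css("p::text").getall()),
            }
With a recent Scrapy version, running scrapy crawl articles -O articles.json would write the yielded items to a JSON file, ready for the cleaning steps described earlier.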
Unstructured.io excels at ingesting diverse document types, breaking them down into semantic elements—paragraphs, titles, lists, and tables—alongside detailed metadata. Its structured output significantly simplifies downstream processing, making it particularly effective for batch-processing tasks across varied document formats.
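A minimal sketch of that workflow, assuming the open-source unstructured package is installed and given a hypothetical file path:
from unstructured.partition.auto import partition

# partition() auto-detects the file type and returns a list of semantic elements
elements = partition(filename="reports/quarterly_report.pdf")  # hypothetical path

for element in elements:
    # Each element carries a category (Title, NarrativeText, Table, ...) and its text
    print(f"[{element.category}] {str(element)[:80]}")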
Initially developed for question-answering systems, Haystack offers advanced text preprocessing features through its PreProcessor component. It efficiently removes unnecessary whitespace, headers, footers, and boilerplate text while segmenting documents into meaningful chunks. Haystack integrates smoothly with document converters and crawlers, simplifying complex data cleaning and ingestion workflows.
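A brief sketch using the Haystack 1.x API (the 2.x release reorganizes these components, so check the version you have installed):
from haystack import Document
from haystack.nodes import PreProcessor

preprocessor = PreProcessor(
    clean_whitespace=True,
    clean_header_footer=True,
    clean_empty_lines=True,
    split_by="word",       # split documents into word-based chunks
    split_length=200,      # roughly 200 words per chunk
    split_overlap=20,      # overlap chunks to preserve context at boundaries
)

docs = [Document(content="A long raw document pulled from your crawler goes here...")]
processed_docs = preprocessor.process(docs)
print(f"{len(processed_docs)} cleaned chunks produced")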
The Hugging Face Datasets library provides robust support for large-scale dataset management, enabling efficient loading, shuffling, and streaming of substantial datasets. With user-friendly transformations and integrated evaluation metrics, it's akin to a tailored version of Pandas optimized for machine-learning dataset handling—fast, memory-efficient, and seamless within ML workflows.
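For example, streaming a large public corpus with the Datasets library avoids downloading it in full; the dataset name below is just an illustration:
from datasets import load_dataset

# Stream examples lazily instead of downloading the whole corpus up front
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Peek at the first few records
for i, example in enumerate(dataset):
    print(example["text"][:200].replace("\n", " "))
    if i >= 2:
        break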
In practice, real-world pipelines combine these tools: you might scrape data with Scrapy, parse and clean it with Unstructured or LlamaParse, apply OpenAI Structured Outputs for standardized data, and use Hugging Face Datasets for large-scale streaming and analysis.
Conclusion
In the rush to build applications with LLMs, data preparation can seem mundane. Yet it is a vital foundation. By prioritizing data quality, ethical sourcing, and meticulous cleaning, you equip your model to deliver reliable, context-aware responses.
A well-prepared dataset is one of the best investments you can make in the success of your LLM. When you feed your model the right kind of data, you give it the strongest possible foundation to learn from—and that pays dividends in every aspect of your AI application. For readers who found these data preparation techniques challenging, the Beginner-to-Advanced LLM Developer Course provides hands-on practice with these methods in real-world scenarios.
Thank you for trusting us with this article! We're super happy with the result!