After extraction, you must normalize the text to match the reference format. Write a script to:
| Phase | Tool |
|-------|------|
| PDF text extraction | pdfplumber, PyMuPDF, pdftotext (Poppler) |
| OCR for scanned PDFs | Tesseract + pytesseract, ocrmypdf |
| Text cleaning | Custom Python regex, textacy, nltk |
| Sentence splitting | spaCy, nltk.tokenize.punkt |
| BLEU calculation | sacrebleu (recommended), nltk.translate.bleu_score |
| Workflow automation | Apache Airflow, Snakemake, or simple bash + Python |
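
As an illustration of the "Custom Python regex" entry in the text-cleaning row, here is a minimal sketch that uses only the standard library. The function name `normalize_extracted_text` and the specific rules (ligature folding, de-hyphenation, whitespace collapsing) are assumptions chosen for illustration, not a fixed recipe; the rules you need depend on your PDFs and on the reference format.

```python
import re
import unicodedata


def normalize_extracted_text(raw: str) -> str:
    """Apply a few common cleanup steps to text pulled out of a PDF (illustrative)."""
    # NFKC folds compatibility characters, such as the single-glyph "fi" ligature,
    # into their plain-letter equivalents.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break: "transla-\ntion" -> "translation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat remaining single line breaks as spaces; blank lines (paragraph breaks) stay.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


if __name__ == "__main__":
    sample = "The \ufb01nal transla-\ntion  is\ngood."
    print(normalize_extracted_text(sample))
```

One caveat with this sketch: the de-hyphenation rule also joins genuine compounds that happen to break at a line end (for example "state-of-" followed by "the-art"), so check a sample of the output before applying it to a whole corpus.
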

PDFs are designed for visual fidelity, not text extractability. Common issues include:
If you run BLEU directly on raw PDF extraction without preprocessing, your scores will be artificially low, not because the translation is poor but because the reference text is corrupted.
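
To make the effect concrete, the sketch below computes clipped bigram precision, the building block that BLEU combines across n-gram orders, against a clean reference and against a corrupted one. It is not a full BLEU implementation (no brevity penalty, a single n-gram order, whitespace tokenization), and the helper name `ngram_precision` and the sample strings are hypothetical. For real scores, use a maintained tool such as sacrebleu.

```python
from collections import Counter


def ngram_precision(hypothesis: str, reference: str, n: int) -> float:
    """Clipped n-gram precision for one hypothesis against one reference."""
    hyp = hypothesis.split()
    ref = reference.split()
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    # Clip each hypothesis n-gram count by how often it occurs in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return matched / total


hypothesis = "the final translation is good"
clean_reference = "the final translation is good"
# Hypothetical raw extraction: an unexpanded "fi" ligature and a word split by hyphenation.
raw_reference = "the \ufb01nal transla- tion is good"

clean_score = ngram_precision(hypothesis, clean_reference, 2)
raw_score = ngram_precision(hypothesis, raw_reference, 2)

if __name__ == "__main__":
    print(f"clean reference: {clean_score:.2f}")
    print(f"raw reference:   {raw_score:.2f}")
```

The hypothesis is identical in both cases. Only the reference changed, yet bigram precision falls from 1.0 to 0.25, because two damaged tokens break three of the four bigrams.
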
If Option A produces jumbled text, use pdfplumber.
```python
import pdfplumber


def extract_with_layout(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Pages with no extractable text (e.g. scanned images) yield an
            # empty result, so skip them.
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"
    return text
```