Annotation Guidelines
Use this page when you are preparing or correcting GROBID training data.
Core principle
Annotation quality matters more than annotation volume.
In practice, GROBID training works best when examples are:
- consistent
- manually reviewed
- chosen to cover real failure cases
- kept aligned with the actual model task
Do not annotate only the labels you personally care about
For a given model, all labels in that task need coherent annotation.
Selective annotation usually hurts the labels you thought you were optimizing, because the model loses the surrounding context that helps it separate neighboring structures.
Use pre-annotation as a starting point, not gold data
The practical workflow is usually:
- generate training TEI from PDFs
- open the generated files in an annotation or correction tool
- manually correct the result
- move trusted files into the model dataset directories
The older docs specifically point to pdf-tei-editor as a useful web-based correction tool for this workflow.
Typical generated files include model-specific TEI files such as:
- *.training.segmentation.tei.xml
- *.training.header.tei.xml
- *.training.fulltext.tei.xml
- *.training.references.tei.xml
Layout-aware tasks also produce companion feature files without the .tei.xml suffix. Those feature files are necessary for training, but they are not the files you manually edit as annotation targets.
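As a rough illustration of that pairing convention (a sketch, assuming the companion feature file has exactly the same name minus the .tei.xml suffix), a short script can list which generated TEI files have their feature counterpart present:

```python
from pathlib import Path


def pair_training_files(directory):
    """Pair each *.training.*.tei.xml file with its companion
    feature file (same name without the .tei.xml suffix), if present."""
    pairs = {}
    for tei in Path(directory).glob("*.training.*.tei.xml"):
        # Strip the trailing ".tei.xml" to get the expected feature file name.
        feature = tei.with_name(tei.name[: -len(".tei.xml")])
        pairs[tei.name] = feature.name if feature.exists() else None
    return pairs
```

A missing feature file for a TEI file you intend to keep is worth investigating before training, since both halves of the pair are needed.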
Keep the text stream untouched
When correcting pre-annotated files, the most important rule is:
- move tags if needed
- do not rewrite the extracted text itself
That includes keeping noisy PDF text, OCR artifacts, and unexpected Unicode as-is when they came from the extraction stream.
Practical exception:
- XML whitespace can be reformatted for readability
- <lb/> elements should still remain aligned with the original text flow
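One way to sanity-check this rule (a sketch for illustration, not part of GROBID itself) is to compare the concatenated text content of the generated and the corrected TEI, ignoring tags and whitespace, so that tag moves and reformatting pass but any rewrite of the extracted text is caught:

```python
import re
import xml.etree.ElementTree as ET


def text_stream(tei_string):
    """Concatenate all text content of a TEI fragment,
    collapsing whitespace so pure reformatting is ignored."""
    root = ET.fromstring(tei_string)
    text = "".join(root.itertext())
    return re.sub(r"\s+", " ", text).strip()


def same_text_stream(generated, corrected):
    """True if tag moves and whitespace edits left the text intact."""
    return text_stream(generated) == text_stream(corrected)
```

For example, wrapping a span in a new element or re-indenting the XML keeps the text stream identical, while "fixing" an OCR artifact in the text does not.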
Modify or delete generated files, but do not invent missing ones casually
If a pre-generated training fragment is clearly wrong, fix or remove it.
If a specialized generated file is missing because the upstream detection failed, do not casually invent a new parallel training file by hand. Correct the material through the intended workflow and keep the dataset organization coherent.
Keep training and evaluation discipline
When possible:
- keep held-out evaluation files separate
- add new examples because they expose real failures, not just because they are easy to annotate
- prefer iterative learning curves and error-driven corpus growth over blind accumulation
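Keeping the held-out set stable while the corpus grows can be done deterministically. The sketch below (an illustration, not a GROBID convention) assigns a file to evaluation based on a hash of its name, so the same files stay held out as new examples are added:

```python
import hashlib


def is_heldout(filename, eval_fraction=0.1):
    """Deterministically assign a file to the held-out evaluation set
    based on a hash of its name, so the split survives corpus growth."""
    digest = hashlib.sha256(filename.encode("utf-8")).digest()
    # First byte of the digest, scaled to [0, 1), compared to the fraction.
    return digest[0] / 256 < eval_fraction
```

Unlike a random shuffle, re-running this split after adding files never moves an old evaluation file into training.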
Naming and organization
Training data is organized by model under grobid-trainer/resources/dataset/<MODEL>/.
Be careful to place files under the correct model directory. A common mistake is to prepare examples for one task and then train a different model name.
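A small consistency check can flag such mix-ups. The suffix-to-directory mapping below is an assumption based on the file names earlier on this page (in particular, references training files conventionally feed the citation model), not an official table:

```python
from pathlib import Path

# Assumed mapping from training-file suffix to model directory name.
SUFFIX_TO_MODEL = {
    ".training.segmentation.tei.xml": "segmentation",
    ".training.header.tei.xml": "header",
    ".training.fulltext.tei.xml": "fulltext",
    ".training.references.tei.xml": "citation",
}


def misplaced_files(dataset_root):
    """Yield (path, expected_model) for training files placed under
    a model directory that does not match their suffix."""
    root = Path(dataset_root)
    for path in root.glob("**/*.tei.xml"):
        model_dir = path.relative_to(root).parts[0]
        for suffix, model in SUFFIX_TO_MODEL.items():
            if path.name.endswith(suffix) and model != model_dir:
                yield path, model
```

Running this over the dataset root before training is a cheap way to catch files prepared for one task but filed under another model.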
Language and domain adaptation
GROBID can be adapted beyond mainstream English scholarly articles, but do it iteratively.
Typical order:
- segmentation
- header
- fulltext
- references and related subparsers
Detailed per-model annotation rules
Each model has specific label definitions and annotation rules with XML examples. Before annotating training data for a specific model, study the corresponding reference:
- Header model -- title, authors, affiliations, abstract, keywords, identifiers
- Fulltext model -- paragraphs, sections, figures, tables, equations, references
- Bibliographical references -- citation parsing, <bibl> structure
- Segmentation model -- document zones (<front>, <body>, <listBibl>)
- Affiliation-address model -- organization, address, markers
- Date model -- day/month/year normalization
These detailed guidelines are essential for producing consistent, high-quality training data.