Training Workflow

This page covers the practical training workflow. It assumes you already know which model you want to retrain.

Dataset layout

Training data for a model lives under:

grobid-trainer/resources/dataset/<MODEL>/

Typical directories:

corpus/ for training material
evaluation/ for held-out evaluation material when you manage the split manually
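Before training, it can help to confirm that the directories above actually exist for the model you intend to train. The following is a minimal sketch, not part of GROBID: `check_dataset_layout` is a hypothetical helper name, and the dataset root is passed in as an argument rather than assumed.

```shell
# Hypothetical helper (not part of GROBID): report which of the expected
# subdirectories exist for a model under a given dataset root.
check_dataset_layout() {
  dataset_root="$1"   # e.g. grobid-trainer/resources/dataset
  model="$2"          # e.g. header
  rc=0
  for sub in corpus evaluation; do
    if [ -d "$dataset_root/$model/$sub" ]; then
      echo "ok: $model/$sub"
    else
      echo "missing: $model/$sub"
      rc=1
    fi
  done
  # Non-zero return means at least one directory was not found.
  return "$rc"
}
```

For example, `check_dataset_layout grobid-trainer/resources/dataset header` prints one line per directory. A missing evaluation/ directory is only a problem when you manage the split manually.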

Simple mode: train and evaluate with one Gradle task

For the common case, use the Gradle wrapper from the repo root:

./gradlew train_<model>

Examples:

./gradlew train_header
./gradlew train_name_header

This uses the data already arranged in the model's corpus/ and evaluation/ directories.

Full mode: trainer jar commands

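When you need more control, use the trainer jar directly. In the commands below, replace <current version> with the version of the jar you built and <model> with the model name. The first numeric argument selects the mode.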

Train only:

java -Xmx1024m -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 0 <model> -gH grobid-home

Evaluate only:

java -Xmx1024m -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 1 <model> -gH grobid-home

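Automatically split the data, then train and evaluate (-s sets the split ratio; 0.8 uses 80% of the data for training and the rest for evaluation):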

java -Xmx1024m -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 2 <model> -gH grobid-home -s 0.8
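The three commands above differ only in the mode number and any trailing options, so they can be assembled by a small wrapper. This is a sketch under stated assumptions: `trainer_cmd` is a hypothetical function, the jar path is supplied by the caller (the version is deliberately not filled in), and the function only prints the command so you can inspect it before running it.

```shell
# Hypothetical wrapper (not part of GROBID): print the trainer command
# for a given jar, grobid-home directory, mode, and model.
# Modes as listed above: 0 = train, 1 = evaluate, 2 = split, train, evaluate.
trainer_cmd() {
  if [ "$#" -lt 4 ]; then
    echo "usage: trainer_cmd <jar> <grobid-home> <mode> <model> [options...]" >&2
    return 2
  fi
  jar="$1"; gh="$2"; mode="$3"; model="$4"
  shift 4
  # The native library path mirrors the lin-64 layout used in the commands above.
  echo java -Xmx1024m "-Djava.library.path=$gh/lib/lin-64:$gh/lib/lin-64/jep" \
    -jar "$jar" "$mode" "$model" -gH "$gh" "$@"
}
```

For example, `trainer_cmd "$JAR" grobid-home 2 header -s 0.8` prints the split-train-evaluate command for the header model, where `JAR` is a variable you set to the path of your built trainer jar.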

Incremental training

If you already have a trained model and want to continue from it instead of starting from scratch, use -i.

java -Xmx1024m -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 0 <model> -gH grobid-home -i

Use this carefully. A full retraining with all available data usually produces the better final model, but incremental training is useful for faster iteration.

N-fold evaluation

For a more robust model-level evaluation, run n-fold cross-validation (-n sets the number of folds, 10 in this example):

java -Xmx1024m -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 3 <model> -gH grobid-home -n 10

Generating training data from PDFs

You can pre-generate model-specific TEI training files from PDFs with the batch training-generation command (createTraining).

This creates model-specific TEI files and, for layout-aware models, companion feature files.

These outputs are not gold data yet. They still need manual review and correction before becoming trusted training material.

Practical advice

keep backups of existing trained models before overwriting them
verify that you are pointing at the correct model directory before training
prefer targeted iterative improvement over large blind corpus growth
keep evaluation separate from the examples that inspired your latest fix when possible
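The first point above, backing up a trained model before overwriting it, can be sketched as a small shell function. This is an illustration only: `backup_model` is a hypothetical helper, and because this page does not say where trained model files are stored, the path to the model file is passed in explicitly.

```shell
# Hypothetical helper (not part of GROBID): copy a model file to a
# timestamped backup next to it and print the backup path.
backup_model() {
  model_file="$1"   # path to an existing trained model file (caller-supplied)
  if [ ! -f "$model_file" ]; then
    echo "no such file: $model_file" >&2
    return 1
  fi
  backup="$model_file.$(date +%Y%m%d-%H%M%S).bak"
  # -p preserves the modification time and permissions of the original.
  cp -p "$model_file" "$backup" && echo "$backup"
}
```

Run it once before a training run that would replace the file. To roll back, copy the backup over the new model.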
