GROBID Tutorial
This site is an experimental, community-maintained tutorial — a hands-on, beginner-friendly on-ramp that gets you from zero to your first successful PDF extraction as quickly as possible. It is not the official GROBID documentation and may be incomplete, out of date, or differ from the canonical reference in places.
For authoritative information — every endpoint, every configuration parameter, every supported flag, every training option — always refer to the official GROBID documentation at grobid.readthedocs.io. When this tutorial and the official docs disagree, the official docs are correct.
Not sure which to read? If you've never run GROBID before, this tutorial is a good place to start — it's optimized for getting you unstuck quickly. If you're already running GROBID and looking up a specific flag, schema element, or behavior, go straight to grobid.readthedocs.io.
GROBID extracts structured data from scholarly PDFs (titles, authors, affiliations, references, citations, section structure, and full text) and returns it as TEI XML.
Common capabilities include:
- header extraction for titles, abstracts, authors, affiliations, and keywords
- reference extraction and parsing from PDFs or raw citation strings
- fulltext structuring into sections, paragraphs, figures, tables, notes, and citations
- PDF coordinates for mapping extracted structures back onto the source document
- metadata enrichment through CrossRef or biblio-glutton when consolidation is enabled
- specialized processing flavors for non-standard document types
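As a rough illustration of the raw-citation capability in the list above, the sketch below builds (but does not send) a request for GROBID's citation-parsing endpoint. It assumes the service is on the default localhost:8070 address used in this tutorial; the helper name build_citation_request is hypothetical, and the endpoint path and form field names should be checked against the official service reference for your version.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: GROBID is reachable on the default port used in this tutorial.
GROBID_URL = "http://localhost:8070"


def build_citation_request(citation: str, consolidate: bool = False,
                           base_url: str = GROBID_URL) -> Request:
    """Build a POST request that asks GROBID to parse one raw citation string.

    Hypothetical helper: it only prepares the request and never touches the network.
    """
    form = {
        "citations": citation,
        # "0" disables consolidation; "1" asks GROBID to enrich the result
        # through the configured external service (CrossRef or biblio-glutton).
        "consolidateCitations": "1" if consolidate else "0",
    }
    return Request(
        url=f"{base_url}/api/processCitation",
        data=urlencode(form).encode("utf-8"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

To send it, pass the result to urllib.request.urlopen once the service is running; the response body should be a TEI XML fragment describing the parsed reference.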
If you want to get productive quickly, start with Docker. The Docker Builder generates a safer docker run command, explains the important flags, and helps you avoid the most common setup mistakes reported in GitHub issues.
For most users, the shortest successful path is:
- open the Docker Builder
- start the service with the CRF image
- verify localhost:8070
- make your first REST API request
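The last two steps of that path can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the tutorial's own client: it assumes the service exposes a liveness endpoint at /api/isalive and a header-extraction endpoint at /api/processHeaderDocument that accepts the PDF in a multipart field named input. The helper names is_alive and build_header_request are hypothetical.

```python
import uuid
from urllib.request import Request, urlopen

# Assumption: the Docker container publishes GROBID on this address.
GROBID_URL = "http://localhost:8070"


def is_alive(base_url: str = GROBID_URL, opener=urlopen) -> bool:
    """Return True if the liveness endpoint answers with HTTP 200 and body 'true'."""
    try:
        with opener(f"{base_url}/api/isalive", timeout=5) as resp:
            return resp.status == 200 and resp.read().strip() == b"true"
    except OSError:
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
        return False


def build_header_request(pdf_bytes: bytes, filename: str = "paper.pdf",
                         base_url: str = GROBID_URL) -> Request:
    """Build a multipart POST that uploads one PDF for header extraction."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode("ascii"),
        # 'input' is the form field assumed to carry the PDF.
        f'Content-Disposition: form-data; name="input"; filename="{filename}"\r\n'.encode("utf-8"),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode("ascii"),
    ])
    return Request(
        url=f"{base_url}/api/processHeaderDocument",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

A typical first run would call is_alive() until it returns True, then read a PDF from disk, pass its bytes to build_header_request, and send the result with urlopen.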
Start here
Recommended: Docker Builder
- Best for most users on Windows, macOS, and Linux
- Guides you through image choice, paths, ports, consolidation, and shell-specific command syntax
- Prevents common mistakes like invalid config mounting or unsafe grobid-home overrides
Quick path
If you already know you want Docker, go directly here:
Choose your path
I want GROBID running as fast as possible
- Use Quick Start (Docker)
- Then continue with REST API Usage
- If startup fails, go directly to Docker Troubleshooting
I need help choosing Docker options
- Use the Docker Builder
- If startup or mounts fail, check Docker Troubleshooting
- If requests fail after startup, check Troubleshooting
I need to understand the API or outputs
- Start with REST API Usage
- For the full endpoint catalog, see the GROBID Service reference
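To give a feel for the outputs before you open the reference, here is a small sketch that reads a title and author names out of a TEI XML header using only the Python standard library. The sample document is a hand-written, heavily trimmed stand-in rather than real GROBID output, and the element paths are assumptions based on the TEI namespace; real responses carry far more structure.

```python
import xml.etree.ElementTree as ET

# TEI documents use this XML namespace, so every lookup must be namespace-aware.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample for illustration only; not actual GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">An Example Paper</title></titleStmt>
      <sourceDesc><biblStruct><analytic>
        <author><persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName></author>
        <author><persName><forename type="first">Alan</forename><surname>Turing</surname></persName></author>
      </analytic></biblStruct></sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def read_header(tei_xml: str) -> dict:
    """Extract the main title and author names from a TEI header (hypothetical helper)."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    authors = []
    for pers in root.findall(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        parts = [
            pers.findtext("tei:forename", default="", namespaces=TEI_NS),
            pers.findtext("tei:surname", default="", namespaces=TEI_NS),
        ]
        # Skip missing name parts instead of emitting stray spaces.
        authors.append(" ".join(p for p in parts if p))
    return {"title": title.strip(), "authors": authors}


print(read_header(SAMPLE_TEI))
# prints {'title': 'An Example Paper', 'authors': ['Ada Lovelace', 'Alan Turing']}
```

Forgetting the namespace mapping is the usual reason such lookups silently return nothing, which is why every path above carries the tei: prefix.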
What users most often get stuck on
Triage of the project's GitHub issues shows a clear pattern:
- Docker setup and shell-specific command syntax
- Configuration files and consolidation setup
- Error diagnosis and recovery
- API request details and failure modes
The docs are therefore optimized around three early moves:
- get the service running safely
- recover quickly when startup or requests fail
- make the first correct API request without reading a giant reference page
This documentation is organized to get you past those blockers early.
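For the "recover quickly when requests fail" move, one common pattern is to retry only when the service reports that it is temporarily busy. The sketch below assumes the status-code meanings described in the official service documentation (200 for success, 204 when nothing could be extracted, 503 when all worker threads are in use); confirm them for your version. The send callable and the call_with_retry name are hypothetical stand-ins for whatever HTTP client you use.

```python
import time


def call_with_retry(send, attempts: int = 5, wait_seconds: float = 2.0,
                    sleep=time.sleep):
    """Call send() until it succeeds, retrying only on HTTP 503.

    send is any zero-argument callable returning (status_code, body).
    Returns the body on 200, None on 204, and raises RuntimeError otherwise.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        status, body = send()
        if status == 200:
            return body
        if status == 204:
            # Assumed meaning: processing finished but nothing was extracted.
            return None
        if status == 503 and attempt < attempts:
            # Assumed meaning: the server is busy; wait and try again.
            sleep(wait_seconds)
            continue
        raise RuntimeError(f"GROBID request failed with HTTP {status}")
```

Retrying on other error codes is deliberately left out here, since a malformed request or a server-side processing error is unlikely to succeed on a second attempt.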
Documentation map
Tutorials
- Learn by doing with short, guided outcomes
- Start with Quick Start (Docker)
How-to guides
- Solve a task you already know you need
- Start with Docker Setup, Troubleshooting, or Configuration
Reference
For detailed reference documentation (API endpoints, configuration parameters, TEI encoding, training guidelines), see grobid.readthedocs.io.
This tutorial site provides practical, task-oriented guides designed to get newcomers productive fast.