Python Client

Use the Python client when your GROBID service is already running and you want a practical way to process many PDFs or integrate GROBID into a Python workflow.

This is usually better than writing your own naïve parallel request loop.

What the Python client is good at

The client is especially useful for:

processing folders of PDFs
managing concurrency more safely than ad hoc scripts
writing TEI outputs automatically
optionally generating JSON or Markdown derived from TEI
integrating GROBID into Python data pipelines

Before you start

Assume:

GROBID is already running
the default service URL is http://localhost:8070

If the service is not running yet, go back to Quick Start (Docker).
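A quick way to confirm the first assumption from Python is to query the service's `/api/isalive` endpoint. The sketch below uses only the standard library; the function name `is_grobid_alive` is illustrative and not part of the client.

```python
import http.client
import urllib.error
import urllib.request


def is_grobid_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/api/isalive answers HTTP 200 with body 'true'."""
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace").strip().lower()
            return resp.status == 200 and body == "true"
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        # Connection refused, DNS failure, timeout, HTTP error, or malformed URL.
        return False


if __name__ == "__main__":
    print("GROBID reachable:", is_grobid_alive())
```

If this prints `False`, fix the service or the URL before trying the client.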

Choose your path

You want to...	Best path
process a folder of PDFs from the shell	CLI
integrate GROBID into Python code	Python API
turn existing TEI files into JSON or Markdown	formatting helpers

1. Install the client

Install from PyPI:

pip install grobid-client-python

2. Fastest useful CLI example

Process a folder of PDFs into TEI XML:

grobid_client --input ./pdfs --output ./out processFulltextDocument

This is the best starting point for most users.

3. Emit JSON or Markdown too

If you also want derived outputs from the TEI result:

grobid_client --input ./pdfs --output ./out --json --markdown processFulltextDocument

Important:

JSON and Markdown are derived from TEI, not separate server-side formats
the client first writes TEI, then performs the conversion locally
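To make "derived from TEI" concrete, here is a minimal sketch of a local TEI-to-JSON step. It is not the client's own converter: the function `tei_to_summary`, the two fields it extracts, and the sample document are assumptions chosen for illustration.

```python
import json
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _clean(element):
    """Collapse the text inside an element into one normalized string, or None."""
    if element is None:
        return None
    return " ".join("".join(element.itertext()).split())


def tei_to_summary(tei_xml: str) -> dict:
    """Pull the title and the abstract text out of a TEI document string."""
    root = ET.fromstring(tei_xml)
    return {
        "title": _clean(root.find(".//tei:titleStmt/tei:title", TEI_NS)),
        "abstract": _clean(root.find(".//tei:profileDesc/tei:abstract", TEI_NS)),
    }


# Hand-written sample, much smaller than real GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title level="a" type="main">A Sample Paper</title></titleStmt></fileDesc>
    <profileDesc><abstract><div><p>We study  things.</p></div></abstract></profileDesc>
  </teiHeader>
</TEI>"""

print(json.dumps(tei_to_summary(SAMPLE_TEI), indent=2))
```

The point of the sketch is the data flow: the server only ever returns TEI, and everything else is a local transformation of that XML.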

4. Minimal Python example

from grobid_client.grobid_client import GrobidClient

client = GrobidClient()
client.process(
    "processFulltextDocument",
    input_path="./pdfs",
    output="./out",
)

This is the best place to start if you want Python-level integration.

5. Most important options first

`--server`

Point the client at a non-default server:

grobid_client --server http://my-host:8070 --input ./pdfs --output ./out processFulltextDocument

`--input` and `--output`

--input: directory of files to process
--output: where TEI and optional derived outputs are written

`--n`

Controls concurrency on the client side.

This is one of the most important knobs for throughput and stability.

If you start seeing many 503 responses, reduce --n first.
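The relationship between client concurrency and 503 responses can be sketched with a toy model. Everything here is hypothetical: `FakeServer` stands in for a service with a fixed capacity, and `send_with_retry` and `run_all` are illustrative helpers, not the client's internals.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeServer:
    """Hypothetical stand-in: answers 503 when too many requests are in flight."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0
        self.rejected = 0
        self._lock = threading.Lock()

    def send(self, item):
        with self._lock:
            if self.in_flight >= self.capacity:
                self.rejected += 1
                return 503
            self.in_flight += 1
        time.sleep(0.01)  # simulate processing time
        with self._lock:
            self.in_flight -= 1
        return 200


def send_with_retry(send, item, max_retries=5, sleep_time=0.05):
    """Call send(item) and retry after a pause while the answer is 503."""
    for attempt in range(max_retries + 1):
        status = send(item)
        if status != 503:
            return status
        if attempt < max_retries:
            time.sleep(sleep_time)
    return 503


def run_all(send, items, n):
    """Process items with at most n calls to send() in flight at once."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(lambda item: send_with_retry(send, item), items))


if __name__ == "__main__":
    for n in (8, 2):
        server = FakeServer(capacity=2)
        run_all(server.send, range(20), n=n)
        print(f"n={n}: {server.rejected} requests answered with 503")
```

When `n` does not exceed the fake capacity, the rejected count stays at zero; with a larger `n` the toy server typically rejects some requests and the retries cost extra time. That is the reason lowering `--n` is the first thing to try.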

`--force`

Overwrite existing outputs if needed.

Useful when rerunning a workflow on the same output directory.

`--json` and `--markdown`

Generate additional local output formats from TEI:

--json
--markdown

Use these when TEI is not the only downstream format you need.

6. Common services you will actually use

Fulltext

grobid_client --input ./pdfs --output ./out processFulltextDocument

Header only

grobid_client --input ./pdfs --output ./out processHeaderDocument

References only

grobid_client --input ./pdfs --output ./out processReferences

Citation-list parsing from text files

grobid_client --input ./txt_refs --output ./out processCitationList

This is useful if you already have raw citation strings rather than PDFs.
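If the citation strings live in Python rather than on disk, a small helper can produce the input folder. This sketch assumes the service expects plain `.txt` files with one raw citation per line; check that against the client documentation for your version. The helper name `write_citation_file` is illustrative.

```python
from pathlib import Path


def write_citation_file(citations, path):
    """Write one whitespace-normalized citation per line; return how many were written."""
    # Assumption: one raw citation string per line in a UTF-8 text file.
    lines = [" ".join(c.split()) for c in citations if c.strip()]
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

Calling `write_citation_file(refs, "./txt_refs/refs.txt")` with a list of strings `refs` would create an input folder matching the command above. Collapsing internal newlines matters because a citation that spans two lines would otherwise be read as two citations.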

7. Coordinates and sentence segmentation

If you need downstream annotation or span-aware processing, use coordinate and sentence options.

These are advanced but practical when needed.

The client exposes options for:

teiCoordinates
sentence segmentation

This is especially useful when you later want:

JSON with structured spans
browser/PDF overlays
passage-aware post-processing
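For overlays, the coordinates usually need to be turned into bounding boxes first. The sketch below assumes the TEI `coords` attribute is a semicolon-separated list of boxes, each written as `page,x,y,width,height`; verify that order against the GROBID coordinates documentation before relying on it. `Box` and `parse_coords` are illustrative names.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    page: int
    x: float
    y: float
    width: float
    height: float


def parse_coords(coords: str) -> List[Box]:
    """Split a coords attribute value into bounding boxes."""
    boxes = []
    for chunk in coords.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Assumed field order: page, x, y, width, height.
        page, x, y, width, height = chunk.split(",")
        boxes.append(Box(int(page), float(x), float(y), float(width), float(height)))
    return boxes


for box in parse_coords("1,53.80,194.57,450.54,10.90;2,53.80,70.00,120.50,10.90"):
    print(box)
```

An element that wraps across lines or pages carries several boxes, which is why the function returns a list rather than a single box.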

8. Config file mode

The client can load settings from config.json.

This is useful when you want stable defaults for:

server URL
batch size
timeout
sleep time
output-related settings

A minimal example:

{
  "grobid_server": "http://localhost:8070",
  "batch_size": 1000,
  "timeout": 60,
  "sleep_time": 5
}

Practical advice:

treat this as a convenience config, not a full declarative workflow system
still keep your first examples simple before relying on config files heavily
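If your own scripts need the same settings, one option is to read the file yourself and fall back to defaults for missing keys. This is a sketch of that idea, not the client's loading logic; `load_config` is a hypothetical helper and the defaults simply mirror the example values above.

```python
import json
from pathlib import Path

# Defaults copied from the example config in this section.
DEFAULTS = {
    "grobid_server": "http://localhost:8070",
    "batch_size": 1000,
    "timeout": 60,
    "sleep_time": 5,
}


def load_config(path):
    """Return DEFAULTS overlaid with whatever keys the JSON file at path defines."""
    config = dict(DEFAULTS)
    path = Path(path)
    if path.is_file():
        config.update(json.loads(path.read_text(encoding="utf-8")))
    return config


if __name__ == "__main__":
    print(load_config("config.json"))
```

Keeping the merge in one place makes it obvious which value a run actually used, which helps when comparing results across machines.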

9. Important pitfall: `n` is not `batch_size`

These are easy to confuse.

n controls client concurrency: how many requests are in flight at once
batch_size controls how many discovered files are grouped into one batch before they are dispatched

If throughput is poor or 503 responses persist, make sure you are changing the right one.
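The difference is easier to see in code. The following sketch is an assumption about the general pattern, not a copy of the client's implementation: files are cut into batches of `batch_size`, and each batch is then processed by at most `n` workers. The names `batches`, `process_all`, and `handle` are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_all(files, handle, batch_size, n):
    """Group files into batches, then run handle() on each batch with n workers."""
    results = []
    for batch in batches(files, batch_size):
        # n bounds the requests in flight; batch_size only bounds the group size.
        with ThreadPoolExecutor(max_workers=n) as pool:
            results.extend(pool.map(handle, batch))
    return results


print(list(batches(range(5), 2)))
```

In this model, raising `batch_size` never increases the load on the server, while raising `n` does.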

10. Important pitfall: the client checks the server early

The client checks server availability during initialization by default.

That means a run can fail before any files are processed if:

the server is not up
the URL is wrong
the service is not ready yet

So if the client appears to fail immediately, verify the server first.
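In scripted setups where the server starts just before the client, one way to avoid this early failure is to wait until a health check passes. The helper below is a generic polling sketch; `wait_until_ready` and its `probe` argument are hypothetical, and `probe` can be any zero-argument function that returns `True` once the service answers.

```python
import time


def wait_until_ready(probe, timeout=60.0, interval=2.0):
    """Call probe() repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Create the client only after this returns `True`; if it returns `False`, the problem is on the server side and no client option will fix it.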

11. CLI and library defaults are not identical

Do not assume the CLI and Python API behave exactly the same by default.

In particular, some defaults in the library path are more aggressive or more convenient for programmatic use than the CLI flags suggest.

If something feels inconsistent, be explicit rather than relying on implicit defaults.
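One way to be explicit is to build the command line in code so that no setting is left to a default. This sketch uses only flags mentioned on this page; the function `build_grobid_command` and the value `n=4` are illustrative.

```python
import shlex


def build_grobid_command(service, input_dir, output_dir, *, server, n,
                         force=False, json_output=False, markdown_output=False):
    """Return an argv list for the grobid_client CLI with every setting spelled out."""
    cmd = [
        "grobid_client",
        "--server", server,
        "--input", str(input_dir),
        "--output", str(output_dir),
        "--n", str(n),
    ]
    if force:
        cmd.append("--force")
    if json_output:
        cmd.append("--json")
    if markdown_output:
        cmd.append("--markdown")
    cmd.append(service)  # the service name comes last, as in the examples above
    return cmd


cmd = build_grobid_command(
    "processFulltextDocument", "./pdfs", "./out",
    server="http://localhost:8070", n=4, force=True, json_output=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Because `server` and `n` are required keyword arguments, a caller cannot silently inherit a default for either.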

12. Formatting helpers

The repo also provides TEI conversion helpers:

TEI -> JSON
TEI -> Markdown

These are useful when:

you already have TEI files
you want a lighter downstream representation
you want a readable Markdown form for inspection or LLM/RAG pipelines
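To show what a readable Markdown form can mean in practice, here is a deliberately small sketch that turns TEI section heads and paragraphs into Markdown. It does not use the repo's helpers, whose names and output layout are not shown here; `tei_body_to_markdown` and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def _clean(element):
    """Collapse all text inside an element into one whitespace-normalized string."""
    return " ".join("".join(element.itertext()).split())


def tei_body_to_markdown(tei_xml):
    """Render <head> elements as '## ' headings and <p> elements as paragraphs."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{TEI}body")
    if body is None:
        return ""
    blocks = []
    for element in body.iter():
        if element.tag not in (f"{TEI}head", f"{TEI}p"):
            continue
        text = _clean(element)
        if text:
            blocks.append("## " + text if element.tag == f"{TEI}head" else text)
    return "\n\n".join(blocks)


# Hand-written sample, much smaller than real GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <div><head>Introduction</head><p>First   paragraph.</p><p>Second one.</p></div>
  <div><head>Methods</head><p>We did <ref>[1]</ref> things.</p></div>
</body></text></TEI>"""

print(tei_body_to_markdown(SAMPLE_TEI))
```

A real converter also has to deal with figures, tables, formulas, and references, so treat this only as a picture of the overall shape of the transformation.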

13. Recommended first learning sequence

CLI: process a folder to TEI
CLI: add --json or --markdown
Python: use GrobidClient() in a short script
Then tune --n, timeout, and config file defaults

That sequence is the fastest path from a first success to useful automation.

14. If something goes wrong

If the service is healthy but the client still struggles, simplify first:

use processFulltextDocument
use one small PDF
lower --n
point to a clearly working http://localhost:8070
