
SegmOnto: A Controlled Vocabulary to Describe the Layout of Pages

SegmOnto offers a controlled vocabulary to describe the content of book and manuscript pages, in order to homogenise the data required by layout analysers. The project has two objectives:

  • Pool data to train stronger models on various layouts.
  • Design a standardised pipeline for text extraction, from page scans to structured documents.
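
As a rough illustration of the first objective, the sketch below shows how annotations from two datasets with different labelling habits might be mapped onto one shared vocabulary before being merged for training. Everything in it is an assumption made for the example: the labels "ZoneA", "ZoneB" and "ZoneC" are placeholders and not actual SegmOnto terms, and the dataset names, mapping tables and the homogenise function are hypothetical.

```python
from collections import Counter

# Placeholder controlled vocabulary (hypothetical names, NOT the SegmOnto terms).
CONTROLLED_VOCABULARY = {"ZoneA", "ZoneB", "ZoneC"}

# Hypothetical mappings from each dataset's local labels to the shared vocabulary.
LABEL_MAPPINGS = {
    "dataset_1": {"text": "ZoneA", "margin": "ZoneB"},
    "dataset_2": {"body": "ZoneA", "gloss": "ZoneB", "page_number": "ZoneC"},
}


def homogenise(dataset, labels):
    """Translate a dataset's local labels into the shared vocabulary.

    Raises ValueError for a label with no mapping, or for a mapping that
    points outside the controlled vocabulary.
    """
    mapping = LABEL_MAPPINGS[dataset]
    result = []
    for label in labels:
        if label not in mapping:
            raise ValueError(f"no mapping for {label!r} in {dataset}")
        target = mapping[label]
        if target not in CONTROLLED_VOCABULARY:
            raise ValueError(f"{target!r} is not in the controlled vocabulary")
        result.append(target)
    return result


# Merge two differently labelled datasets into one homogeneous training set.
merged = homogenise("dataset_1", ["text", "margin", "text"]) + homogenise(
    "dataset_2", ["body", "page_number"]
)
counts = Counter(merged)
print(counts)
```

Rejecting unmapped labels, rather than silently passing them through, is one possible design choice: it keeps the merged training data restricted to the controlled vocabulary.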

SegmOnto is conceived as a general-purpose description scheme covering written documents produced since the appearance of the codex, although it has so far been designed mainly from Western and Middle Eastern documents.

How to cite

Simon Gabay, Jean-Baptiste Camps, Ariane Pinche, Nicola Carboni, SegmOnto, A Controlled Vocabulary to Describe the Layout of Pages, version 0.9, Paris/Genève, 2021, https://github.com/SegmOnto.
