OLA: A Library of Annotated Latin Texts

Tokenization

Tokenization is the first layer of annotation. It identifies the "atoms" to which annotation units are attached. Different tokenization schemes are possible, depending on how "token" is defined (e.g., as a morphosyntactic or a prosodic word). The corpus currently contains a layer of morphosyntactic tokenization.

Fig. 1: The tokenization algorithm.
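
To make the idea of tokens as standoff "atoms" more concrete, the following is a minimal Python sketch. It is not the algorithm shown in Fig. 1, and it is not taken from the OLA code base. It assumes that a token is recorded as a pair of character offsets into the unchanged source text, that punctuation marks are tokens of their own, and that the enclitic -que is split off as a separate morphosyntactic token. The names `Token`, `tokenize`, and `QUE_EXCEPTIONS` are hypothetical, and the exception list is deliberately incomplete.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    """A standoff token: it points into the text instead of copying it."""
    id: int      # running token number, starting at 1
    start: int   # inclusive character offset into the source text
    end: int     # exclusive character offset into the source text
    form: str    # surface form, kept only for readability


# Assumption: a small, incomplete list of words that end in "que"
# but do not contain the enclitic -que.
QUE_EXCEPTIONS = {"atque", "neque", "quoque", "usque", "quisque",
                  "itaque", "denique"}

# A run of word characters, or a single non-space punctuation mark.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> List[Token]:
    """Return tokens with character offsets (illustrative sketch only)."""
    tokens: List[Token] = []
    for match in WORD_OR_PUNCT.finditer(text):
        start, end = match.start(), match.end()
        lower = match.group().lower()
        # Split off enclitic -que unless the word is a listed exception.
        if lower.endswith("que") and len(lower) > 3 and lower not in QUE_EXCEPTIONS:
            spans = [(start, end - 3), (end - 3, end)]
        else:
            spans = [(start, end)]
        for s, e in spans:
            tokens.append(Token(len(tokens) + 1, s, e, text[s:e]))
    return tokens


if __name__ == "__main__":
    for token in tokenize("Arma virumque cano, Troiae"):
        print(token)
```

With this input the sketch yields six tokens: "Arma", "virum", "que", "cano", ",", and "Troiae". Because each token carries only offsets into the source, later annotation layers can refer to token ids without altering the text itself. A prosodic tokenization scheme would produce different spans over the same text.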

Contacts

  • GitHub project
  • email prefix: celano
    email suffix: informatik.uni-leipzig.de
  • Leipzig University
    Department of Computer Science
    Augustplatz 10
    04109, Leipzig

The content is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.