What is CoNLL data format?

nlp text-parsing text-mining information-extraction

42,743

Solution 1

There are many different CoNLL formats because CoNLL hosts a different shared task each year. The format for CoNLL 2009 is described here. Each line represents a single word as a series of tab-separated fields. An underscore (_) indicates an empty value. Mate-Parser's manual says that it uses the first 12 columns of CoNLL 2009:

ID FORM LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL

The definitions of some of these columns come from earlier shared tasks (the CoNLL-X format used in 2006 and 2007):

ID (index in sentence, starting at 1)
FORM (word form itself)
LEMMA (word's lemma or stem)
POS (part of speech)
FEAT (list of morphological features separated by |)
HEAD (index of syntactic parent, 0 for ROOT)
DEPREL (syntactic relationship between HEAD and this word)

Variants of those columns that start with P (e.g., PPOS rather than POS) indicate that the value was automatically predicted rather than taken from the gold standard.
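
As a rough illustration of the column layout above, here is a minimal Python sketch that splits one token line into the 12 named fields. The function name `parse_conll09_line` and the example line are invented for illustration; they are not part of Mate-Parser or any library, and the line is not real parser output.

```python
# Names of the first 12 CoNLL 2009 columns, in the order listed above.
CONLL09_COLUMNS = [
    "ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS",
    "FEAT", "PFEAT", "HEAD", "PHEAD", "DEPREL", "PDEPREL",
]


def parse_conll09_line(line):
    """Split one tab-separated token line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    token = dict(zip(CONLL09_COLUMNS, fields))
    # An underscore marks an empty value; map it to None for convenience.
    # Caveat: a token whose FORM is literally "_" would also become None.
    return {name: (None if value == "_" else value)
            for name, value in token.items()}


# Invented example line, not real parser output.
example = "\t".join(
    ["1", "Dogs", "dog", "dog", "NNS", "NNS", "_", "_", "2", "2", "SBJ", "SBJ"]
)
token = parse_conll09_line(example)
print(token["FORM"], token["PPOS"], token["PHEAD"], token["PDEPREL"])
```

When working with parser output rather than gold data, the predicted columns (PLEMMA, PPOS, PFEAT, PHEAD, PDEPREL) are probably the ones to read.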

Update: There is now a CoNLL-U data format as well which extends the CoNLL-X format.

Solution 2

As an update to @dmcc's answer:

CoNLL is the conventional name for TSV formats in NLP (TSV = tab-separated values, i.e., CSV with TAB as the separator)
It originates from a series of shared tasks organized at the Conference on Computational Natural Language Learning (hence the name)
Not all of these tasks use "CoNLL" formats; some tasks had JSON or XML formats
There are "CoNLL" formats that developed independently from CoNLL, most notably CoNLL-U
CoNLL formats differ in the choice and order of columns

In CoNLL formats,

every word (token) is represented in one line.
every sentence is separated from the next by an empty line
every column represents one annotation
every word in a sentence has the same number of columns (in some formats: every word in the corpus has the same number of columns)
an annotation is a string value about a particular word
annotations that span multiple words sometimes use special notations, e.g., round brackets (indicating the beginning and end of a phrase) or the IOBES annotation (e.g., B-NP: beginning of an NP, I-NP: in the middle of an NP, E-NP: end of an NP, S-NP: an NP begins and ends at the current word, O: no NP annotation)
some CoNLL formats have one or more columns of numerical identifiers at the start; the next column after these (or the first, if there are no IDs) usually contains the WORD
the ID of the first word in the sentence is 1. If no ID column is provided, the ID is the number of preceding words within the sentence plus 1.
in dependency syntax, grammatical relations hold between words; each dependent is annotated with its HEAD (= ID of the parent word) and its EDGE/DEP[endency] (= grammatical relation), in separate columns
if a word in dependency syntax does not have a parent (i.e., it is the syntactic root), set its HEAD to 0
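
The conventions in the list above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the four-column layout (ID, FORM, HEAD, DEPREL), the function names, and the sample data are all invented here and do not correspond to any particular CoNLL format or library. The IOBES decoder is deliberately lenient and does not check that the labels inside a span agree.

```python
def read_sentences(text):
    """Yield sentences as lists of column lists; empty lines separate sentences."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
        else:
            sentence.append(line.split("\t"))
    if sentence:
        yield sentence


def dependency_edges(sentence, form_col=1, head_col=2, deprel_col=3):
    """Return (head_form, relation, dependent_form) triples; HEAD 0 means root."""
    forms = {int(row[0]): row[form_col] for row in sentence}
    edges = []
    for row in sentence:
        head = int(row[head_col])
        head_form = "ROOT" if head == 0 else forms[head]
        edges.append((head_form, row[deprel_col], row[form_col]))
    return edges


def iobes_spans(tags):
    """Decode IOBES tags into (start, end, label) spans with inclusive indices."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.append((i, i, label))
        elif prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            spans.append((start, i, label))
            start = None
        elif prefix == "O":
            start = None
    return spans


# Invented sample in the hypothetical layout ID, FORM, HEAD, DEPREL.
sample = "1\tDogs\t2\tSBJ\n2\tbark\t0\tROOT\n\n1\tHello\t0\tROOT\n"
sentences = list(read_sentences(sample))
edges = dependency_edges(sentences[0])
spans = iobes_spans(["B-NP", "I-NP", "E-NP", "O", "S-NP"])
print(edges)
print(spans)
```

For a real file, the column indices passed to `dependency_edges` would have to match the specific CoNLL format in use.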

Be careful when working with tools or libraries that claim to support (some) "CoNLL format". Different CoNLL formats order their columns differently, and the developer might not be aware of that. Such tools are therefore likely to misbehave if they are given data in another (or an unspecified) CoNLL format.
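
To make the warning above concrete, the sketch below compares where the HEAD column sits in three layouts: the 12 CoNLL 2009 columns quoted in Solution 1, plus the standard ten-column CoNLL-X and CoNLL-U layouts. The `LAYOUTS` dictionary and the `get_field` helper are hypothetical names used only for this illustration, and the example row is invented.

```python
LAYOUTS = {
    "conll09": ["ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS",
                "FEAT", "PFEAT", "HEAD", "PHEAD", "DEPREL", "PDEPREL"],
    "conllx": ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG", "FEATS",
               "HEAD", "DEPREL", "PHEAD", "PDEPREL"],
    "conllu": ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS",
               "HEAD", "DEPREL", "DEPS", "MISC"],
}


def get_field(fields, layout, name):
    """Look up a column by name under an explicitly chosen layout."""
    columns = LAYOUTS[layout]
    if len(fields) < len(columns):
        raise ValueError(
            f"expected at least {len(columns)} columns for {layout}, "
            f"got {len(fields)}"
        )
    return fields[columns.index(name)]


# Zero-based position of HEAD in each layout.
head_positions = {layout: cols.index("HEAD") for layout, cols in LAYOUTS.items()}
print(head_positions)

# Invented CoNLL 2009 style row (first 12 columns only).
row = ["1", "Dogs", "dog", "dog", "NNS", "NNS", "_", "_", "2", "2", "SBJ", "SBJ"]
right = get_field(row, "conll09", "HEAD")  # read with the matching layout
wrong = get_field(row, "conllx", "HEAD")   # read with the wrong layout
print(right, wrong)
```

Reading the row with the wrong layout raises no error here; it silently returns the FEAT value instead of the head index, which is the kind of failure the paragraph above warns about.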

For converting between different CoNLL formats, you can consider using CoNLL-RDF (https://github.com/acoli-repo/conll-rdf) or CoNLL-Transform (https://github.com/acoli-repo/conll-transform). (Disclaimer: Developed by my lab.)

Author by

swapna sourav rout

Updated on September 30, 2021

Comments

swapna sourav rout over 2 years

I am new to text mining. I am using an open-source jar (Mate Parser) which gives me output in CoNLL 2009 format after dependency parsing. I want to use the dependency parsing results for information extraction. I am able to understand some of the output, but I cannot comprehend the CoNLL data format. Can anyone help me understand the CoNLL data format? Any pointers would be appreciated.