Distributional models of lexical knowledge

This page contains resources created in project FK-125217.

Lexical resources

Hungarian verbal argument frame lexicon and other lexical resources described in: Attila Novák; László Laki; Borbála Novák; Andrea Dömötör; Noémi Ligeti-Nagy; Ágnes Kalivoda: Creation of a corpus with semantic role labels for Hungarian.

SZDT-UD2

A manually corrected version of the Szeged Dependency Treebank, automatically converted to Universal Dependencies format. See Novák Attila; Novák Borbála: Egy nagyobb magyar UD korpusz felé (Towards a larger Hungarian UD corpus).
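Universal Dependencies treebanks are normally distributed in the CoNLL-U layout, with ten tab-separated columns per token. The Python sketch below shows how such a file could be read, assuming that SZDT-UD2 follows this layout. The sample sentence, its annotation, and the names `FIELDS` and `read_sentences` are invented for illustration and are not taken from the treebank.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

# Hypothetical sample sentence (not from SZDT-UD2), one token per line.
ROWS = [
    "# text = A kutya ugat.",
    "1\tA\ta\tDET\t_\tDefinite=Def|PronType=Art\t2\tdet\t_\t_",
    "2\tkutya\tkutya\tNOUN\t_\tCase=Nom|Number=Sing\t3\tnsubj\t_\t_",
    "3\tugat\tugat\tVERB\t_\tMood=Ind|Number=Sing|Person=3\t0\troot\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
]
SAMPLE = "\n".join(ROWS) + "\n\n"


def read_sentences(text):
    """Yield each sentence as a list of token dicts keyed by FIELDS."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        token = dict(zip(FIELDS, columns))
        if not token["id"].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])
        sentence.append(token)
    if sentence:
        yield sentence


sentences = list(read_sentences(SAMPLE))
# The root token is the one whose head is 0.
root = [tok["form"] for tok in sentences[0] if tok["head"] == 0]
print(root)
```

For real work, an established CoNLL-U reader is likely a better choice than this hand-written loop, which ignores multiword tokens and empty nodes.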

MILQA Hungarian question-answer benchmark database

MILQA is a Hungarian benchmark database for machine reading comprehension, specifically question answering (QA). It was built following the principles of SQuAD 2.0, with a number of innovations added, such as Boolean and arithmetic questions and long and list answers. See Attila Novák; Borbála Novák; Tamás Zombori; Gergő Szabó; Zsolt Szántó; Richárd Farkas: A Question Answering Benchmark Database for Hungarian.
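Because MILQA follows the principles of SQuAD 2.0, its records can be pictured in a SQuAD 2.0-style layout. The Python sketch below assumes that layout: the field names come from SQuAD 2.0, not from the MILQA release itself, and the sample record and the function `iter_questions` are invented for illustration. The actual MILQA files may use different or additional fields for Boolean, arithmetic, long and list answers.

```python
# Hypothetical record in SQuAD 2.0-style layout; not taken from MILQA.
SAMPLE = {
    "version": "v2.0",
    "data": [
        {
            "title": "Budapest",
            "paragraphs": [
                {
                    "context": "Budapest Magyarország fővárosa.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Mi Magyarország fővárosa?",
                            "is_impossible": False,
                            "answers": [
                                {"text": "Budapest", "answer_start": 0}
                            ],
                        },
                        {
                            # Unanswerable from the context, as in SQuAD 2.0.
                            "id": "q2",
                            "question": "Hány lakosa van Budapestnek?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ],
}


def iter_questions(dataset):
    """Yield (id, question, answer_texts, is_impossible) for each question."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                texts = []
                for answer in qa.get("answers", []):
                    start = answer["answer_start"]
                    span = context[start:start + len(answer["text"])]
                    # The character offset must point at the answer text.
                    if span != answer["text"]:
                        raise ValueError(f"bad offset in question {qa['id']}")
                    texts.append(span)
                yield (qa["id"], qa["question"], texts,
                       qa.get("is_impossible", False))


rows = list(iter_questions(SAMPLE))
for row in rows:
    print(row)
```

Checking each `answer_start` offset against the context is a cheap way to catch encoding or tokenization mismatches before training or evaluating a QA model.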

Modified CBOW Algorithm for Generating Embedding Models from Annotated Corpora

See the GitHub repo https://github.com/ppke-nlpg/fastText_factored-cbow and Attila Novák, László Laki, Borbála Novák: CBOW-tag: a Modified CBOW Algorithm for Generating Embedding Models from Annotated Corpora.

NerKor+Cars-OntoNotes++

NerKor+Cars-OntoNotes++ is a richly annotated Hungarian named entity dataset. It was derived from NYTK-NerKor, a Hungarian gold standard named entity annotated corpus containing about 1 million tokens. The dataset also adds 12k tokens of text (individual sentences) about motor vehicles (cars, buses, motorcycles) from the news archive of hvg.hu.
A Hungarian named entity recognition model trained on the corpus is available on the Hugging Face Hub. See Novák, Attila; Novák, Borbála. (2022) NerKor+Cars-OntoNotes++. In: Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022). pp. 1907-1916.