The 1.7 billion word LEXMCI corpus of English was created by the Lexicography MasterClass in 2008 as a source of lexicographic information for the lexicographers compiling the Dante database.
Its components include:
- the 100 million word British National Corpus
- the 25 million word Hiberno-English corpus, created for Foras na Gaeilge by the Lexicography MasterClass as part of the New Corpus for Ireland (NCI)
- a 100 million word corpus of American English licensed from the Linguistic Data Consortium
- the 1.5 billion word ukWAC corpus, created at the University of Bologna, Italy.
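As a quick check on how the components add up to the headline figure, the sizes listed above can be summed in a few lines of Python. This is only a sketch: the sizes are the rounded figures quoted in the list rather than exact token counts, and the dictionary keys are informal labels chosen for this example.

```python
# Nominal component sizes in words, taken from the list above.
# These are rounded figures, not exact token counts.
COMPONENT_SIZES = {
    "British National Corpus": 100_000_000,
    "Hiberno-English corpus (NCI)": 25_000_000,
    "American English (LDC)": 100_000_000,
    "ukWAC": 1_500_000_000,
}

# Total size and each component's share of the whole.
total = sum(COMPONENT_SIZES.values())
shares = {name: size / total for name, size in COMPONENT_SIZES.items()}

print(f"total: {total:,} words")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

On these nominal figures the total comes to 1.725 billion words, which the text rounds to 1.7 billion, and the web-derived ukWAC component makes up roughly 87% of the whole.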
Texts in the LEXMCI corpus are annotated with information about their genre, mode (written or spoken), medium (book, website, etc.), and language variety (to distinguish American, British and Hiberno-English). This extensive annotation enabled the lexicographers creating the database to cover language variation in full. All the full-sentence examples in the database are drawn from this corpus.
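To illustrate how this kind of annotation can be used, here is a minimal Python sketch of a per-text metadata record and a filter over it. Everything in it is hypothetical: the class `TextMetadata`, the helper `select`, the field values and the sample records are invented for illustration and do not reflect the actual storage format or tooling of the LEXMCI corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextMetadata:
    """Hypothetical record mirroring the four annotation types described above."""
    text_id: str
    genre: str
    mode: str      # "written" or "spoken"
    medium: str    # e.g. "book", "website"
    variety: str   # "American", "British" or "Hiberno-English"


def select(texts, **criteria):
    """Return the texts whose metadata matches every given field=value pair."""
    return [
        t for t in texts
        if all(getattr(t, field) == value for field, value in criteria.items())
    ]


# Invented sample records, for illustration only.
sample = [
    TextMetadata("t1", "fiction", "written", "book", "British"),
    TextMetadata("t2", "news", "written", "website", "American"),
    TextMetadata("t3", "conversation", "spoken", "recording", "Hiberno-English"),
    TextMetadata("t4", "news", "written", "website", "Hiberno-English"),
]

hiberno_written = select(sample, variety="Hiberno-English", mode="written")
print([t.text_id for t in hiberno_written])
```

A filter of this kind is one way a lexicographer could restrict a search to a single variety or mode before looking for examples.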