DANTE for Language Technology
DANTE is a rich resource well suited to automatic data mining. Recent trends in the acquisition of lexical information have moved away from ‘processed’ resources and towards the use of raw corpus data. However, there are arguments that this approach has run its course and that it is time to pay more attention to linguistic and grammatical methods; see, for example, the invited talks by Ken Church at Senseval-3 / EMNLP 2004 and Kevin Knight at EACL 2006.
For many years the Holy Grail of language technology was a machine-readable resource in which every word sense is linked to specific, identifying contextual features, both lexical and syntactic. DANTE goes a long way towards this goal. Its entries record subcategorization, collocation, grammatical category (for example, count vs mass nouns), and multiword expressions and phraseology. They also carry domain, register, style and evaluative information, all of which is needed for disambiguation in text understanding and for appropriate word choice in text generation. Because all of this is tied to individual word senses, the information provides a depth of analysis not available from existing resources. This will be useful for enriching statistical systems with linguistic knowledge, and it will also allow thorough evaluation on an unprecedented scale.
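To make the idea of senses linked to contextual features concrete, here is a minimal sketch in Python. It is an illustration only: the `Sense` record, its field names, the two sample senses of "bank" and the overlap-based `disambiguate` function are all hypothetical and do not reflect DANTE's actual schema, data or any distributed tooling.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Sequence


@dataclass(frozen=True)
class Sense:
    """Hypothetical sense record; field names are illustrative, not DANTE's schema."""
    lemma: str
    sense_id: str
    gloss: str
    countability: Optional[str] = None          # e.g. "count" or "mass" for nouns
    subcat_frames: FrozenSet[str] = frozenset()  # e.g. {"NP", "NP_PP"}
    collocates: FrozenSet[str] = frozenset()     # typical co-occurring words
    domain: Optional[str] = None                 # e.g. "finance"


# Invented sample data, for illustration only.
SENSES = [
    Sense("bank", "bank.n.1", "financial institution", "count",
          collocates=frozenset({"account", "loan", "money"}), domain="finance"),
    Sense("bank", "bank.n.2", "sloping land beside a river", "count",
          collocates=frozenset({"river", "grassy", "steep"}), domain="geography"),
]


def disambiguate(senses: Sequence[Sense],
                 context_words: Iterable[str],
                 frame: Optional[str] = None) -> Sense:
    """Return the sense whose recorded features best match the context.

    Score = number of recorded collocates found in the context, plus one
    if the observed syntactic frame is listed for the sense. Ties go to
    the first sense in the list.
    """
    context = {w.lower() for w in context_words}

    def score(sense: Sense) -> int:
        points = len(sense.collocates & context)
        if frame is not None and frame in sense.subcat_frames:
            points += 1
        return points

    return max(senses, key=score)


if __name__ == "__main__":
    sentence = "She sat on the grassy river bank".split()
    print(disambiguate(SENSES, sentence).gloss)
```

A real system would weight features statistically instead of counting overlaps, but the sketch shows why features attached to senses, and not just to headwords, matter: the same word can be resolved differently depending on which sense's features appear in the context.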
Potential commercial applications include:
- word sense disambiguation
- information extraction
- question answering
- grammar checking
- machine translation
- translation memory
DANTE will also be valuable for university teaching and research in:
- computational linguistics and language technology