This guide is an in-depth listing of resources on Linguistics available to students and faculty at the University of Minnesota.
Select List of Corpora
- English-Corpora.org: Previously known as the "BYU Corpora," this database includes 19 well-known corpora of American and British English, such as the Corpus of Contemporary American English (COCA), Corpus of Historical American English (COHA), British National Corpus (BNC), and News on the Web (NOW). An overview of how to search the corpora can be found under the Overview tab. Some corpora can be downloaded by users. NOTE: Users need to register for an individual account while either on campus or connected from off campus through EZProxy or the full-tunnel VPN. To register, select "Register / profile" under "my account."
- Linguistic Data Consortium: LDC is a repository for linguistic datasets, corpora, and other resources for research on human language. Follow the link to view descriptions of, and sign up for access to, the corpora and datasets available to current UMN faculty, students, and staff. See also: Quick overview of corpora available.
- BAS CLARIN Repository: Corpora of spoken language archived in the Bavarian Archive for Speech Signals (BAS). Most resources marked 'free for science' (ACA) or 'public' (PUB) can be downloaded for free by academic users.
- English-Corpora.org: Formerly the "BYU Corpora," a portal to a number of English-language corpora, including iWeb: The Intelligent Web-based Corpus, News on the Web (NOW), Global Web-Based English, Wikipedia Corpus, Corpus of Contemporary American English (COCA), Corpus of Historical American English (COHA), The TV Corpus, The Movie Corpus, and the Corpus of American Soap Operas.
- International Corpus of English: Each ICE corpus consists of one million words of spoken and written English produced after 1989.
- OLAC: Open Language Archives Community: "OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources."
- UCLA Phonetics Lab Archive: The materials on this site comprise audio recordings illustrating phonetic structures from over 200 languages, with phonetic transcriptions, plus scans of original field notes where relevant.
- PHOIBLE: A repository of cross-linguistic phonological inventory data, which have been extracted from source documents and tertiary databases and compiled into a single searchable convenience sample.
For assistance in finding corpora, contact Brian Vetruba (bvetruba@umn.edu; book an appointment).
Last Updated: Mar 10, 2025 10:40 AM
URL: https://libguides.umn.edu/linguisticsadvanced