Research Guides: Linguistics (Advanced Researchers): Corpora

Ask Brian!

Contact:

Office: 1a Wilson Library
309 19th Ave. S. Minneapolis, MN 55455

Email: bvetruba@umn.edu

612-625-8161

This guide is an in-depth listing of resources on Linguistics available to students and faculty at the University of Minnesota.

English-Corpora.org
Previously known as the "BYU Corpora," this database includes 19 well-known corpora of American and British English, such as the Corpus of Contemporary American English (COCA), Corpus of Historical American English (COHA), British National Corpus (BNC), and News on the Web (NOW). An overview of how to search the corpora can be found under the Overview tab. Some corpora can be downloaded by users. NOTE: Users must register for an individual account, either while on campus or while off campus and connected through EZProxy or the full-tunnel VPN. Select "Register / profile" under "my account."
Information about VPN access is available on the OIT website: https://it.umn.edu/virtual-private-network-vpn
Linguistic Data Consortium
LDC is a repository for linguistic datasets, corpora, and other resources for research on human language. Use the link above to view descriptions and sign up for access to the corpora and datasets available to current UMN faculty, students, and staff. A quick overview of the available corpora is also provided.

BAS CLARIN Repository
Corpora of spoken language archived in the Bavarian Archive for Speech Signals (BAS). Most resources marked with 'free for science' (ACA) or 'public' (PUB) can be downloaded for free by academic users.
English-Corpora.org
Formerly the "BYU Corpora," a portal to a number of English-language corpora, including iWeb: The Intelligent Web-based Corpus, News on the Web (NOW), Global Web-Based English, Wikipedia Corpus, Corpus of Contemporary American English (COCA), Corpus of Historical American English (COHA), The TV Corpus, The Movie Corpus, and the Corpus of American Soap Operas.
International Corpus of English
Each ICE corpus consists of one million words of spoken and written English produced after 1989.
OLAC: Open Language Archives Community
"OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources."
Text + Corpora List from Linguist List
UCLA Phonetics Lab Archive
The materials on this site comprise audio recordings illustrating phonetic structures from over 200 languages with phonetic transcriptions, plus scans of original field notes where relevant.

PHOIBLE
A repository of cross-linguistic phonological inventory data, which have been extracted from source documents and tertiary databases and compiled into a single searchable convenience sample.

For assistance in finding corpora, contact Brian Vetruba (bvetruba@umn.edu; book an appointment).

Last Updated: Jun 4, 2025 4:04 PM