Research Guides: Text Mining: Acquiring text

Text mining help

Workshops

LATIS workshops
A series of workshops created by LATIS (Liberal Arts Technology and Innovation Services) that are free and open to all faculty and graduate students.
Research tools & computing workshops
Find Python, R, and other workshops from a variety of UMN campus units.

Related Guides

This is a guide to UMN library and open access resources that are available in some form for text mining purposes.

New and notable:

Constellate (JSTOR)
UMN has trial access to a full slate of text mining workshops in Python via Constellate. Login on campus or use the VPN from off-campus.The trial ends on July 1, 2025, when JSTOR plans to sunset the Constellate platform.

Access and perform text analysis of scholarly books and articles from JSTOR and Portico. Also includes the large historical newspaper digital archive Chronicling America.

Digital Scholar Lab
Online tool for text mining collections from Gale databases that are available at UMN. Data sets can be built from British newspapers and other primary source collections in Gale and then analyzed and visualized in your browser.

The General Index (Internet Archive / Public Resource)
The General Index is an open-access collection of n-grams (words and phrases) and metadata from over 100 million journal articles to support text mining. Articles include both open access and paywalled content, available here as derived data (not human-readable text).
Linguistic Data Consortium (LDC)
The LDC collects language data from both written texts and transcriptions of speech, in various languages, to support corpus linguistics. The Library subscription, and access to LDC datasets, begins in 2022 though some older datasets are also available.
PubMed Central Article Datasets
Over four million articles from full-text biomedical and life sciences journal articles in PubMed Central are available in XML and plain text formats via Amazon Web Services.
Readex Text Explorer (RTE)
Many (but not all) of the digital archives that we subscribe to via Readex are available to explore using Voyant Tools. After searching a database, select the results of interest and choose the "Explore" button to launch an interface to allow you to analyze and visualize the texts. More info from Readex.

TDM Studio
ProQuest's text and data mining tool provides text-mining access to news and scholarly content including historical and contemporary newspapers, scholarly journals, and archival primary sources. There are two levels of access:
1. Full workbench access Requires basic Python or R skills.
2. Visualization access Web-based tool that doesn't require any coding skills.

Books

HathiTrust Research Center (HTRC)
Nearly 14 million books from the HathiTrust Library are currently available for analysis, offering various levels of immediate access. Check the HTRC tab on this guide for more information to help you get started.
Project Gutenberg (Mirror sites)
Project Gutenberg hosts over 50k ebooks, most of which are older books in the public domain. If you want to download more than about 100 books/day, use one of the mirror sites listed from the link above.
Scholarly ebooks dataset (JSTOR)
Thousands of titles from academic publishers, primarily from 2000 and 2017. The open access ebooks dataset includes full-text OCR and title-level metadata.
Encyclopaedia Britannica (1768-1860)
The full digital edition of the Encyclopaedia Britannica from 1768-1860, available for bulk download in XML, image files, and/or plain-text. See other digital text collections from the National Library of Scotland.

Newspapers and magazines

Constellate (JSTOR)
UMN has trial access to a full slate of text mining workshops in Python via Constellate. Login on campus or use the VPN from off-campus.The trial ends on July 1, 2025, when JSTOR plans to sunset the Constellate platform.

Access and perform text analysis of scholarly books and articles from JSTOR and Portico. Also includes the large historical newspaper digital archive Chronicling America.

Digital Scholar Lab
Online tool for text mining collections from Gale databases that are available at UMN. Data sets can be built from British newspapers and other primary source collections in Gale and then analyzed and visualized in your browser.
TDM Studio
ProQuest's text and data mining tool provides text-mining access to news and scholarly content including historical and contemporary newspapers, scholarly journals, and archival primary sources. There are two levels of access:
1. Full workbench access Requires basic Python or R skills.
2. Visualization access Web-based tool that doesn't require any coding skills.

Adam Matthew Archival Collections
Archival collections via Adam Matthew are available for text mining by request. OCR full text and metadata from collections such as American Indian Newspapers, Apartheid South Africa, and many more can be downloaded in XML or JSON formats to support digital scholarship projects.
Coalition Publica Corpora (Canada)
Textual datasets and bibliometric data from a variety of Canadian sources including hundreds of Érudit journals, archives from a variety of national and provincial collections from the 1800s to present, parliamentary debates, and more.
Chronicling America (Library of Congress)
The Chronicling America API provides access to information about historic U.S. newspapers and millions of digitized newspaper pages and their OCR data is available for bulk download. See the full list of digitized newspaper titles (1836-1922) for more information.
CNN and Daily Mail articles
The datasets used for a deep learning project includes 90,000 CNN articles and over 190,000 Daily Mail articles downloaded from the Wayback Machine and available for bulk download.
COVID-19 News Dataset
The White House recently issued a call to action for text and data mining researchers to work with the COVID-19 Open Research Dataset (CORD-19) of scholarly literature about COVID-19, SARS-CoV-2, and the Coronavirus group.
GDELT: Open Data Index
The GDELT Project is a realtime network diagram and database of global human society for open research. Coded data is available as raw downloads of via Google BigQuery, including collections on news, television, images, books, academic literature and the open web.
MediaCloud
Three online tools—Explorer, Topic Mapper, and Source Manager—to analyze stories and topics across print, broadcast and digital news collections. Developed by the MIT Center for Civic Media and the Berkman Klein Center for Internet & Society at Harvard. Data is also available via API and as open source software.
News API
API service that allows you to query online news sources from the past month including major publications such as the New York Times, ABC News, and Al Jazeera. Register for a free API key to get started.
NY Times APIs
The Article Search API provides access to headlines, abstracts, lead paragraphs and more (but NOT full-text articles) from the New York Times, 1851 to present.
Readex Text Explorer (RTE)
Many (but not all) of the digital archives that we subscribe to via Readex are available to explore using Voyant Tools. After searching a database, select the results of interest and choose the "Explore" button to launch an interface to allow you to analyze and visualize the texts. More info from Readex.
Trove API (National Library of Australia)
Free non-commercial API key to access full-text and metadata from dozens of Australian newspapers and other sources, primarily historical.

Scholarly journals and publishers

arXiv Bulk Data
Full-text and metadata bulk downloads for open access scholarship in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, Statistics, Electrical Engineering and Systems Science, and Economics.
BMJ Journals TDM
BMJ allow text and data mining following their terms of service (see link) using the CrossRef APIs.
Cambridge University Press - TDM
Text mining is allowed, but no API is provided. To access a large amount of content for a text and data mining project, please do not scrape the site, but contact openresearch@cambridge.org for help.
Canadian Science Publishing
CSP grants subscribers the right to text and data mine (TDM) online content for non-commercial purposes. Follow terms and access content via CrossRef API.
Constellate (JSTOR)
UMN has trial access to a full slate of text mining workshops in Python via Constellate. Login on campus or use the VPN from off-campus.The trial ends on July 1, 2025, when JSTOR plans to sunset the Constellate platform.

Access and perform text analysis of scholarly books and articles from JSTOR and Portico. Also includes the large historical newspaper digital archive Chronicling America.
CORE: Open Access Research Papers
CORE provides a central API to access full content from tens of thousands of openly available scientific publications from thousands of OA repositories. They also provide full datasets by request.
Elsevier (ScienceDirect, Scopus) APIs
Researchers can text mine UMN-subscribed journals and books published by Elsevier on the ScienceDirect full-text platform. Sign up for a developer account to use the Elsevier APIs for non-commercial purposes. Query the API from University of Minnesota, Twin Cities IP ranges to ensure full access. You can also use the APIs to access citation data and abstracts from scholarly journals indexed by Scopus. For more information, see their Text Mining documentation.
The General Index (Internet Archive / Public Resource)
The General Index is an open-access collection of n-grams (words and phrases) and metadata from over 100 million journal articles to support text mining. Articles include both open access and paywalled content, available here as derived data (not human-readable text).
IEEE Developer APIs
Register for a free account to get an API key to query content from IEEE Xplore. Data responses include URLs for full-text HTML and PDFs and IEEE allows text mining for non-commercial research purposes.
JSTOR Data for Research
Data for Research allows you to download word frequencies, citations, key terms, and ngrams for up to 25,000 JSTOR articles at a time, or to easily submit requests for larger sets of articles. See also: JSTOR DfR in GitHub - A number of Python and R packages to work with JSTOR DfR data.JSTOR's Text Analyzer, a reverse search engine that analyzes documents that you upload (your own, or other articles) to find related materials in JSTOR.Public domain and OA datasets include full OCR text from early journals and current academic press open access titles.
Oxford University Press TDM
OUP accommodates TDM for non-commercial use. E-mail Data.Mining@oup.com to arrange for access.
PLOS (Public Library of Science) download tool
Python tool for downloading/updating/maintaining a repository of all PLOS XML article files. Use this program to download all PLOS XML article files instead of doing web scraping. See also: PLOS APIs to query content from the seven open-access peer-reviewed journals from the Public Library of Science using any of the twenty-three terms in the PLOS Search.

TDM Studio
ProQuest's text and data mining tool provides text-mining access to news and scholarly content including historical and contemporary newspapers, scholarly journals, and archival primary sources. There are two levels of access:
1. Full workbench access Requires basic Python or R skills.
2. Visualization access Web-based tool that doesn't require any coding skills.

PubMed Central Article Datasets
Over four million articles from full-text biomedical and life sciences journal articles in PubMed Central are available in XML and plain text formats via Amazon Web Services.
SAGE Journals TDM
Text and data mining for non-commercial purposes is allowed. You can systematically download articles available via UMN subscriptions as long as you follow their posted request limits and agree to their terms (from link above).
Scholarly API Cookbook (U Alabama)
Open online Jupyter book containing short scholarly API code examples for how to work with various scholarly web service APIs such as ScienceDirect, Crossref, Scopus, Chronicling America, and more.
Springer journals and ebooks
"Individual researchers are encouraged to download subscription (and open access) journal articles and books for TDM purposes directly from Springer Nature’s content platforms. The selection of desired articles can be conducted by using existing search methods and tools, such as PubMed, Web of Science, or Springer Nature’s Metadata API, among others. An API key is necessary only if researchers want to use Springer Nature’s TDM APIs." via Springer's Text and Data Mining Policy.
Unpaywall API
The REST API gives anyone free, programmatic access to the Unpaywall database, which includes Open Access content from over 50,000 publishers and repositories. The database harvests content from open indexes including DOAJ and Crossref, as well as specific OA publishers.
Wiley TDM
Academic subscribers can access subscribed content for text and data mining once they have accepted the Wiley click-through TDM license and received an API token.

Citations and metadata

Directory of Open Access Journals API
Search and retrieve metadata from over 13,000 interdisciplinary open access journals in DOAJ.
Elsevier (ScienceDirect, Scopus) APIs
Researchers can text mine UMN-subscribed journals and books published by Elsevier on the ScienceDirect full-text platform. Sign up for a developer account to use the Elsevier APIs for non-commercial purposes. Query the API from University of Minnesota, Twin Cities IP ranges to ensure full access. You can also use the APIs to access citation data and abstracts from scholarly journals indexed by Scopus. For more information, see their Text Mining documentation.
JAMA
The American Medical Association network of journals provides individuals who have access via an institutional site-license (such as those at UMN) the ability to download aggregated metadata of subscribed content. Sign up for a free TDM account to access bulk downloads of JSON files.
Library of Congress: 25 million bibliographic metadata records
The LOC release of 25 million MARC records for free bulk download. MARC (Machine Readable Cataloging Records) is a international metadata standard for the representation and communication of bibliographic and related information.
NY Times APIs
The Article Search API provides access to headlines, abstracts, lead paragraphs and more (but NOT full-text articles) from the New York Times, 1851 to present.
Open Academic Graph
Downloadable datasets for citations drawn from two large academic graphs: Microsoft Academic Graph (MAG) and AMiner.
OpenAlex
An open and free API and data snapshot of scholarly citations linking works, authors, venues, institutions, and concepts. Includes data from the Microsoft Academic Graph, Crossref, and more.
PubMed and NLM: Data Guide
A guide to using this API, called E-Utilities, to access citation data for medical journal literature in PubMed and other NCBI databases, including the National Library of Medicine Catalog, MeSH, Gene, and PMC (PubMed Central).

Government documents

CaseLaw Access Project
360 years of United States caselaw available via API and bulk downloads. (Downloads for some jurisdictions are only available to scholars who sign a research agreement.)
Coalition Publica Corpora (Canada)
Textual datasets and bibliometric data from a variety of Canadian sources including hundreds of Érudit journals, archives from a variety of national and provincial collections from the 1800s to present, parliamentary debates, and more.
CourtListener APIs and Bulk Legal Data
Opinions, docket files, and more from 420 courts.
FDSys: Bulk Data
Bulk data downloads of major US Government publications including Congressional Bills, Commerce Business Daily, Federal Register, Public Papers of the Presidents of the United States, Supreme Court Decisions 1937-1975 (FLITE) and more.
FRASER API (U.S. economy, banking...)
Use this REST API to access full-text and metadata from FRASER, a digital library of U.S. economic, financial, and banking history—particularly the history of the Federal Reserve System.
Library of Congress Datasets
A variety of data repositories from the Library of Congress related to business, demographics, news, images, science, government and more.
ProPublica Data Store
The ProPublica Congress API has up-to-date legislative data from the House of Representatives, the Senate and the Library of Congress, including details about members, votes, bills, nominations and more. ProPublica also offers free APIs and data on Campaign Finance, Trump Appointees, Political Ad Buys, and more.

Linguistic Corpora

BYU Corpora
Widely used free linguistic corpora created by Mark Davies, Professor of Linguistics at Brigham Young University. Now hosted at english-corpora.org. Sign up for a free academic account to work with many of these corpora (listed below).
Corpus of Contemporary American English (COCA)
The corpus contains more than 560 million words of text (20 million words each year 1990-2017) and it is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts.
Corpus of Historical American English (COHA)
COHA contains more than 400 million words of text from the 1810s-2000s.
Corpus Resource Database (CoRD)
CoRD provides links to and descriptions of a large number of corpora, subcorpora and databases. (University of Helsinki)
Linguistic Data Consortium (LDC)
The LDC collects language data from both written texts and transcriptions of speech, in various languages, to support corpus linguistics. The Library subscription, and access to LDC datasets, begins in 2022 though some older datasets are also available.
Movie Corpus
200 million words of data in more than 25,000 movies from the 1930s to present. All movies are tied in to their IMDB entry.
News of the Web (NOW) Corpus
7.1 billion words of data (and growing) from web-based newspapers and magazines from 2010 to the present.
Open American National Corpus
15 million words of American English automatically annotated for logical structure, word and sentence boundaries, part of speech (multiple tag sets), shallow parse (noun and verb chunks), and named entities.
Scottish Corpus of Text & Speech (1945-present)
The Scottish Corpora project has created large electronic corpora (over 4.5 million words) of written and spoken texts for the languages of Scotland. See also the Helsinki Corpus of Older Scots (1450-1700) and the Corpus of Modern Scottish Writing (1700-1945).
TV Corpus
325 million words of data from 75,000 TV episodes from the 1950s to the present. All episodes are tied in to their IMDB entry.
Web Corpus (iWeb)
14 billion words from 22 million web pages and 95k websites.

Social media and web datasets

Awesome Public Datasets
A list of topic-centric high quality public data sources. Collected from blogs, answers, and user responses.
Blogger Corpus (2004)
The collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts.
Common Crawl
The Common Crawl corpus contains petabytes of data collected from the web since 2008. It contains raw web page data, extracted metadata and text extractions.
Facebook Ad Categories (ProPublica)
This dataset includes two tables: data on the interest categories Facebook shows to users and the ad groups its shows to advertisers (2016).
Internet Archive (data)
How to download files from archive.org in an automated way using wget.
Obama Administration Social Media Archives
A directory of sites archiving various social media posts from members of the Obama Administration.
Political Ads from Facebook (ProPublica)
This database, updated daily, contains ads that ran on Facebook and were submitted by thousands of ProPublica users from around the world (via browser extensions).
reddit APIs
Access data from posts, threads, comments, users and more from reddit and subreddits. Historical Reddit data has been collected at http://files.pushshift.io/reddit/ as monthly CSV downloads.
Social Computing Data Repository (Arizona)
As a service to the Machine Learning, Data Mining, and Social Sciences communities, the Social Computing data repository currently hosts datasets from a collection of many different social media sites.
Social Media Archive (SOMAR at ICPSR)
A centralized repository of researcher social media data via ICPSR. Open to submissions of social media research datasets.
Stanford Large Network Dataset Collection (SNAP)
The SNAP library collects data on large social and information networks since 2004.
Tweet Datasets (DocNow)
Directory of open-access tweet datasets on DocNow, available for research use. To convert Tweet IDs to JSON files of full-text tweets and metadata, use DocNow's Hydrator app. See also: UNLV's Twitter Data Tutorial Series.
Twitter: Moral Foundations Corpus
35,108 Tweets curated from seven different domains of Twitter corpus that have been hand-annotated for 10 categories of moral sentiment.
Web Corpus (iWeb)
14 billion words from 22 million web pages and 95k websites.
Wikipedia Data Dumps
Monthly database backups of all Wikimedia wikis in various formats.

Text Mining

Text mining help

Workshops

Related Guides

Contents

Books

Newspapers and magazines

Scholarly journals and publishers

Citations and metadata

Government documents

Linguistic Corpora

Social media and web datasets

See also: