5 Data I: Collecting, Organizing, Creating
5.1 Goals
Getting to know the basics of working with data: collecting, creating, organizing.
5.2 Software
- R
- OCR Engines (https://www.onlineocr.net/)
- OCR can also be done directly in R (requires Tesseract installed)
- Excel, Google Spreadsheets, or any other alternative
5.3 In Class I: Theoretical and Conceptual
5.3.1 Ways of obtaining data
- Reusing already produced data
- One may require to mold data into a more fitting structure.
- Creating one’s own dataset
- Digitizing data from printed and/or hand-written sources
5.3.2 Main format
- Relational databases or Tables/Spreadsheets (tabular data)?
- Tabular format: tables; spreadsheets; CSV/TSV files.
- Unique identifiers:
- tables with different data can be connected via unique identifiers
- Note: A relational database (rDB) is a collection of interconnected tables. Tables in an rDB are connected with each other via unique identifiers which are usually automatically created by the database itself when new data is added.
- One can maintain interconnected tables without creating a rDB: Open Linked Data
- Example: Table of the growth of cities. One table includes information on population over time; Another table includes coordinates of the cities from the dataset. It is more efficient and practical (reducing error rate from typos) to work on these tables separately, and connect them via unique identifiers of cities which are used in both tables.
5.3.2.1 Note on the CSV
/TSV
format
CSV
stands for comma-separated values; TSV
— for tab-separated values.
Below is an examples of a CSV format. Here, the first line is the header, which provides the names of columns; each line is a row, while columns are separated with ,
commas.
,growth_from_2000_to_2013,latitude,longitude,population,rank,state
city,4.8%,40.7127837,-74.0059413,8405837,1,New York
New York,4.8%,34.0522342,-118.2436849,3884307,2,California
Los Angeles,-6.1%,41.8781136,-87.6297982,2718782,3,Illinois
Chicago,11.0%,29.7604267,-95.3698028,2195914,4,Texas
Houston,2.6%,39.9525839,-75.1652215,1553165,5,Pennsylvania Philadelphia
TSV
is a better option than a CSV
, since TAB characters (\t
) are very unlikely to appear in values.
Neither TSV
not CSV
are good for preserving new line characters (\n
)—or, in other words, text split into multiple lines/paragraphs. As a workaround, one can convert \n
into some unlikely-to-occur character combination (for example, ;;;
), which would be easy to restore into \n
later, if necessary.
5.3.3 Basic principles of organizing data: Tidy Data
5.3.3.1 Tidy Data
- Each variable is in its own column
- Each observation is in its own row
- Each value is in its own cell
5.3.3.2 Clean Data
- Column names and row names are easy to use and informative. In general, it is a good practice to avoid
spaces
and special characters.- Good example:
western_cities
- Alternative good example:
WesternCities
- Bad example:
Western Cities (only the largest)
- Good example:
- Obvious mistakes in the data have been removed
- Date format:
YYYY-MM-DD
is the most reliable format. Any thoughts why? - There should be no empty
cells
:- If you have them, it might be that your data is not organized properly.
- If your data is organized properly,
NA
must be used as an explicit indication that data point is not available.
- Each cell must contain only one piece of data.
- Date format:
- Variable values must be internally consistent
- Be consistent in coding your values:
M
andman
are different values computationally, but may be the same in the dataset; - Keep track of your categories: a document where all codes used in the data set are explained.
- Be consistent in coding your values:
- Preserve original values
- If you are working with a historical dataset, it might be inconsistent.
- For example, distances between cities are given in different formats: days of travel, miles, farsaḫs/parasangs, etc.).
- Instead of replacing original values, it is better to create an additional column, where this information will be homogenized according to some principle.
- Keeping original data will allow to honogenize data in multiple ways (example: day of travel).
- Clearly differentiate between the original and modified/modelled values.
- The use of suffixes can be convenient:
Distance_Orig
vsDistance_Modified
.
- The use of suffixes can be convenient:
- If you are working with a historical dataset, it might be inconsistent.
- Most of editing operations should be performed in software other than R; any spreadsheet program will work, unless it cannot export into CSV/TSV format.
- Keep in mind that if you prepare your data in an Excel-like program, rich formatting (like manual highlights, bolds, and italics) is not data and it will be lost, when you export your data into CSV/TSV format.
- Note: It might be useful, however, to use rule-based highlighting in order, for example, to identify bad values that need to be fixed.
- Back up your data!
- http://github.com is a great place for this, plus it allows to work collaboratively.
- Google spreadsheets is a decent alternative, but it lacks version control and detailed tracking of changes.
5.4 In Class II: Practical
5.4.1 Morris Dataset: the East Vs. the West
![]() |
---|
War-making capacity since 4000 BCE (in social development points). Data source: Morris, Ian. 2013. The Measure of Civilization: How Social Development Decides the Fate of Nations. Princeton: Princeton University Press. |
5.4.1.1 Difficulty: Easy
- Digitize “War-making capacity since 4000 BCE” from this file Morris_2013_Combined.pdf.
- Fix the dataset so that it conforms to the principles of tidy data. What should be corrected?
- Load the data set into R
- Graph chronological changes in war-making capacities for the East and the West.
- When the East was in the lead?
- When the West was in the lead?
- How can you determine that? (Hint: review logical operators and vector comparison).
5.4.1.2 Difficulty: More complicated
- Digitize “Maximum Settlement Sizes” from this file Morris_2013_Combined.pdf.
- Fix the dataset so that it conforms to the principles of tidy data. What should be corrected?
- The datasets for the East and the West are separate. What would be your strategies to combine them?
- Graph chronological changes in war-making capacities for the East and the West.
- When the East was in the lead? What were the most prominent settlements?
- When the West was in the lead? What were the most prominent settlements?
- How can you determine that? (Hint: review logical operators and vector comparison).
5.5 OCR in R
As was noted above, we can use R to OCR text in PDFs and images. The following libraries will be necessary.
library(pdftools)
library(tidyverse)
library(tesseract)
library(readr)
This code we can use to OCR individual PNG files.
<- tesseract::ocr(pathToPNGfile, engine = tesseract("eng"))
text ::write_lines(text, str_replace(pathToPNGfile, ".png", ".txt")) readr
This code can be used to process entire PDFs:
<- pdftools::pdf_convert(pathToPDFfile, dpi = 600)
imagesToProcess <- tesseract::ocr(imagesToProcess, engine = tesseract("eng"))
text ::write_lines(text, str_replace(pathToPDFfile, ".pdf", ".txt")) readr
NB: I had issues running pdftools
on Mac. Make sure that you install additional required tools for it. For more details, see: https://github.com/ropensci/pdftools.
More details on how to use Tesseract with R you can find here: https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html
5.6 Reference Materials:
- Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). https://doi.org/10.18637/jss.v059.i10. (The article in open access)
- Check these slides: A. Ginolhac, E. Koncina, R. Krause. Principles of Tidy Data: tidyr https://lsru.github.io/tv_course/lecture05_tidyr.html (Also check their other lectures/slides: )
- Broman, Karl W., and Kara H. Woo. 2018. “Data Organization in Spreadsheets.” The American Statistician 72 (1): 2–10. https://doi.org/10.1080/00031305.2017.1375989.
5.6.1 Additional
- Morris, Ian. 2013. The Measure of Civilization: How Social Development Decides the Fate of Nations. Princeton: Princeton University Press.
- Note: This book is a methodological companion to: Morris, Ian. 2010. Why the West Rules—for Now: The Patterns of History, and What They Reveal about the Future. New York: Farrar, Straus and Giroux.
5.6.2 Additional Readings
- Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). https://doi.org/10.18637/jss.v059.i10. (The article in open access)
5.7 Homework
- Finish your worksheet and submit your HW as described below.
- Additional: if you’d like more practice, you can use
swirl
library:- To install:
install.packages("swirl")
- To run:
library(swirl)
- Then:
swirl()
- it will offer you a set of interactive exercises similar to DataCamp.
- Then:
- To install:
5.8 Submitting homework
- Homework assignment must be submitted by the beginning of the next class;
- Email your homework to the instructor as attachments.
- In the subject of your email, please, add the following:
57528-LXX-HW-YourLastName-YourMatriculationNumber
, whereLXX
is the number of the lesson for which you submit homework;YourLastName
is your last name; andYourMatriculationNumber
is your matriculation number.
- In the subject of your email, please, add the following: