5 Data I: Collecting, Organizing, Creating

5.1 Goals

Getting to know the basics of working with data: collecting, creating, organizing.

5.2 Software

  • R
  • OCR Engines (https://www.onlineocr.net/)
    • OCR can also be done directly in R (requires Tesseract installed)
  • Excel, Google Spreadsheets, or any other alternative

5.3 In Class I: Theoretical and Conceptual

5.3.1 Ways of obtaining data

  1. Reusing already produced data
    • One may require to mold data into a more fitting structure.
  2. Creating one’s own dataset
  3. Digitizing data from printed and/or hand-written sources

5.3.2 Main format

  • Relational databases or Tables/Spreadsheets (tabular data)?
    • Tabular format: tables; spreadsheets; CSV/TSV files.
  • Unique identifiers:
    • tables with different data can be connected via unique identifiers
    • Note: A relational database (rDB) is a collection of interconnected tables. Tables in an rDB are connected with each other via unique identifiers which are usually automatically created by the database itself when new data is added.
    • One can maintain interconnected tables without creating a rDB: Open Linked Data
      • Example: Table of the growth of cities. One table includes information on population over time; Another table includes coordinates of the cities from the dataset. It is more efficient and practical (reducing error rate from typos) to work on these tables separately, and connect them via unique identifiers of cities which are used in both tables.

5.3.2.1 Note on the CSV/TSV format

CSV stands for comma-separated values; TSV — for tab-separated values.

Below is an examples of a CSV format. Here, the first line is the header, which provides the names of columns; each line is a row, while columns are separated with , commas.

city,growth_from_2000_to_2013,latitude,longitude,population,rank,state
New York,4.8%,40.7127837,-74.0059413,8405837,1,New York
Los Angeles,4.8%,34.0522342,-118.2436849,3884307,2,California
Chicago,-6.1%,41.8781136,-87.6297982,2718782,3,Illinois
Houston,11.0%,29.7604267,-95.3698028,2195914,4,Texas
Philadelphia,2.6%,39.9525839,-75.1652215,1553165,5,Pennsylvania

TSV is a better option than a CSV, since TAB characters (\t) are very unlikely to appear in values.

Neither TSV not CSV are good for preserving new line characters (\n)—or, in other words, text split into multiple lines/paragraphs. As a workaround, one can convert \n into some unlikely-to-occur character combination (for example, ;;;), which would be easy to restore into \n later, if necessary.

5.3.3 Basic principles of organizing data: Tidy Data

5.3.3.1 Tidy Data

  • Each variable is in its own column
  • Each observation is in its own row
  • Each value is in its own cell

5.3.3.2 Clean Data

  • Column names and row names are easy to use and informative. In general, it is a good practice to avoid spaces and special characters.
    • Good example: western_cities
    • Alternative good example: WesternCities
    • Bad example: Western Cities (only the largest)
  • Obvious mistakes in the data have been removed
    • Date format: YYYY-MM-DD is the most reliable format. Any thoughts why?
    • There should be no empty cells:
      • If you have them, it might be that your data is not organized properly.
      • If your data is organized properly, NA must be used as an explicit indication that data point is not available.
    • Each cell must contain only one piece of data.
  • Variable values must be internally consistent
    • Be consistent in coding your values: M and man are different values computationally, but may be the same in the dataset;
    • Keep track of your categories: a document where all codes used in the data set are explained.
  • Preserve original values
    • If you are working with a historical dataset, it might be inconsistent.
      • For example, distances between cities are given in different formats: days of travel, miles, farsaḫs/parasangs, etc.).
      • Instead of replacing original values, it is better to create an additional column, where this information will be homogenized according to some principle.
      • Keeping original data will allow to honogenize data in multiple ways (example: day of travel).
    • Clearly differentiate between the original and modified/modelled values.
      • The use of suffixes can be convenient: Distance_Orig vs Distance_Modified.
  • Most of editing operations should be performed in software other than R; any spreadsheet program will work, unless it cannot export into CSV/TSV format.
    • Keep in mind that if you prepare your data in an Excel-like program, rich formatting (like manual highlights, bolds, and italics) is not data and it will be lost, when you export your data into CSV/TSV format.
    • Note: It might be useful, however, to use rule-based highlighting in order, for example, to identify bad values that need to be fixed.
  • Back up your data!
    • http://github.com is a great place for this, plus it allows to work collaboratively.
    • Google spreadsheets is a decent alternative, but it lacks version control and detailed tracking of changes.

5.4 In Class II: Practical

5.4.1 Morris Dataset: the East Vs. the West

Sample_Page_With_Tabular_Data_Morris.png
War-making capacity since 4000 BCE (in social development points). Data source: Morris, Ian. 2013. The Measure of Civilization: How Social Development Decides the Fate of Nations. Princeton: Princeton University Press.

5.4.1.1 Difficulty: Easy

  • Digitize “War-making capacity since 4000 BCE” from this file Morris_2013_Combined.pdf.
  • Fix the dataset so that it conforms to the principles of tidy data. What should be corrected?
  • Load the data set into R
  • Graph chronological changes in war-making capacities for the East and the West.
  • When the East was in the lead?
  • When the West was in the lead?
  • How can you determine that? (Hint: review logical operators and vector comparison).

5.4.1.2 Difficulty: More complicated

  • Digitize “Maximum Settlement Sizes” from this file Morris_2013_Combined.pdf.
  • Fix the dataset so that it conforms to the principles of tidy data. What should be corrected?
  • The datasets for the East and the West are separate. What would be your strategies to combine them?
  • Graph chronological changes in war-making capacities for the East and the West.
  • When the East was in the lead? What were the most prominent settlements?
  • When the West was in the lead? What were the most prominent settlements?
  • How can you determine that? (Hint: review logical operators and vector comparison).

5.5 OCR in R

As was noted above, we can use R to OCR text in PDFs and images. The following libraries will be necessary.

library(pdftools)
library(tidyverse)
library(tesseract)
library(readr)

This code we can use to OCR individual PNG files.

text <- tesseract::ocr(pathToPNGfile, engine = tesseract("eng"))
readr::write_lines(text, str_replace(pathToPNGfile, ".png", ".txt"))

This code can be used to process entire PDFs:

imagesToProcess <- pdftools::pdf_convert(pathToPDFfile, dpi = 600)
text <- tesseract::ocr(imagesToProcess, engine = tesseract("eng"))
readr::write_lines(text, str_replace(pathToPDFfile, ".pdf", ".txt"))

NB: I had issues running pdftools on Mac. Make sure that you install additional required tools for it. For more details, see: https://github.com/ropensci/pdftools.

More details on how to use Tesseract with R you can find here: https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html

5.6 Reference Materials:

5.6.1 Additional

  • Morris, Ian. 2013. The Measure of Civilization: How Social Development Decides the Fate of Nations. Princeton: Princeton University Press.
    • Note: This book is a methodological companion to: Morris, Ian. 2010. Why the West Rules—for Now: The Patterns of History, and What They Reveal about the Future. New York: Farrar, Straus and Giroux.

5.6.2 Additional Readings

5.7 Homework

  • Finish your worksheet and submit your HW as described below.
  • Additional: if you’d like more practice, you can use swirl library:
    • To install: install.packages("swirl")
    • To run: library(swirl)
      • Then: swirl()
      • it will offer you a set of interactive exercises similar to DataCamp.

5.8 Submitting homework

  • Homework assignment must be submitted by the beginning of the next class;
  • Email your homework to the instructor as attachments.
    • In the subject of your email, please, add the following: 57528-LXX-HW-YourLastName-YourMatriculationNumber, where LXX is the number of the lesson for which you submit homework; YourLastName is your last name; and YourMatriculationNumber is your matriculation number.