Dec 25, 2016 - Data Not Computer Readable. Missing Values Not Explicit. Undocumented, Vague Data. Inconsistent Data.
Common Threats to Tidy Data
There are lots of ways to make data untidy. These examples are illustrative, not exhaustive. Inspired by the Tidyverse: vita.had.co.nz/papers/tidy-data.pdf
Merged Cells
No Unique ID
MESSY
year
cholera_cases 2,500 7,000 9,000 100
region
project    region
project1   Afar, Oromia
project2   Oromia, Somali, SNNP
project3   SNNP, Afar, Amhara, Gambela

project    region
project1   Afar
project1   Oromia
project2   Oromia
...        ...
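The project/region example above, where one cell lists several regions, can be tidied by emitting one row per region. A minimal sketch using only the Python standard library; the tuple layout and the function name `split_regions` are assumptions made for illustration, not part of the original sheet.

```python
# Messy rows from the example: a single cell holds several regions.
messy = [
    ("project1", "Afar, Oromia"),
    ("project2", "Oromia, Somali, SNNP"),
    ("project3", "SNNP, Afar, Amhara, Gambela"),
]


def split_regions(rows):
    """Return one (project, region) pair for every region named in a cell."""
    tidy_rows = []
    for project, regions in rows:
        for region in regions.split(","):
            tidy_rows.append((project, region.strip()))
    return tidy_rows


tidy = split_regions(messy)
for project, region in tidy:
    print(project, region)
```

The first three rows printed are project1 Afar, project1 Oromia, project2 Oromia, matching the tidy layout shown in the example.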
project
region
activity_id project
project1 project1 project2 ...
Afar Oromia Oromia ...
region1 region2
2,500 7,000
project
combined cells
9,000 100
multiple values per cell
region
project1 project1 project2 ...
region Afar Oromia Oromia ...
variable2
region    cholera   TB
region1   2,500     4,000
region2   7,000     4,100
2,500 7,000
4,000 4,100
region    year   cholera_cases
region1   2012   2,500
region2   2012   7,000
region1   2014   9,000
region2   2014   100
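A table in this region/year/cholera_cases shape can be derived from a wide layout whose column names contain the year, such as the sheet's 2012cholera and 2014cholera columns. The sketch below assumes the data is held as a list of dictionaries and that every measurement column ends in the suffix "cholera"; both are illustrative assumptions, and `to_long` is a hypothetical helper name.

```python
# Wide layout: the year is hidden inside the column names.
wide = [
    {"region": "region1", "2012cholera": 2500, "2014cholera": 9000},
    {"region": "region2", "2012cholera": 7000, "2014cholera": 100},
]


def to_long(rows, suffix="cholera"):
    """Turn each <year><suffix> column into its own (region, year, value) row."""
    long_rows = []
    for row in rows:
        for name, value in row.items():
            if name.endswith(suffix):
                year = int(name[: -len(suffix)])  # "2012cholera" -> 2012
                long_rows.append(
                    {"region": row["region"], "year": year, "cholera_cases": value}
                )
    # Sort by year, then region, to match the order of the tidy example.
    return sorted(long_rows, key=lambda r: (r["year"], r["region"]))


tidy = to_long(wide)
for row in tidy:
    print(row["region"], row["year"], row["cholera_cases"])
```

Moving the year out of the column name and into its own column means each variable occupies exactly one column.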
region    2012cholera   2014cholera
region1   2,500         9,000
region2   7,000         100
100 101 102 ...
variable1
different formats, names misspelled
project    start_date   region              funding
project1   14-04-13     Afar                25,000,000
project2   2016-12-25   Affar               30000000
project3   01/01        Benishangul-Gumuz   $75M
project3   01012014     Benishangul Gumuz   €68M
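Inconsistent dates and misspelled names like those in the example above can often be normalized with an explicit list of accepted formats and a lookup of known misspellings. This is a sketch under stated assumptions: the three date formats are guesses inferred from the example values, and `REGION_FIXES` is a hypothetical correction table. A value such as 01/01, which has no year, is returned as None rather than guessed.

```python
from datetime import datetime

# Assumed formats, inferred from the example values (not from the source).
DATE_FORMATS = ("%Y-%m-%d", "%y-%m-%d", "%d%m%Y")

# Hypothetical lookup from known misspellings to canonical region names.
REGION_FIXES = {"Affar": "Afar", "Benishangul Gumuz": "Benishangul-Gumuz"}


def normalize_date(text):
    """Return an ISO 8601 date string, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_region(name):
    """Map a known misspelling to its canonical name; leave other names alone."""
    return REGION_FIXES.get(name, name)


for raw in ["14-04-13", "2016-12-25", "01/01", "01012014"]:
    print(raw, "->", normalize_date(raw))
for raw in ["Afar", "Affar", "Benishangul Gumuz"]:
    print(raw, "->", normalize_region(raw))
```

Returning None for an unparseable date keeps the gap explicit, so it can be resolved with the data owner instead of silently filled in.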
special characters, values in different units
region    TB_cases   cholera_cases
region1   4,200      2,500
region2   1,250      7,000
region3              9,000
region4
project     status
project 1
project 2
project 3
project 4
funding_USD
project1   2014-04-13   Afar
project2   2016-12-25   Afar
project3   2014-01-01   Benishangul-Gumuz
duplicate values (?)
Undocumented, Vague Data
TIDY
2012 2012 2014 2014
project start_date region
Data Not Computer Readable
each set of observations forms a table
region
cholera cases 2012 2014
Inconsistent Data
Missing Values Not Explicit
each observation forms a row
region1 region2 region1 region2
region
Variable Names Not Meaningful
region region1 region2
Variable Names Contain Measurements
each variable forms a unique column
region
is this missing or zero?
legend: good, kinda okay, really not okay
project    region   funding
project1   5        25,000,000
project2   1        10,000,000
project3   7        15,000,000
code not defined, units not specified
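One way to document coded values is to keep a codebook next to the data and to put the unit in the column name. The sketch below uses the code-to-name pairs and the funding_USD column name from the sheet's tidy version of this table; the `REGION_CODEBOOK` dictionary and the `document` function are hypothetical names chosen for illustration.

```python
# Hypothetical codebook; the pairs follow the tidy version of this table.
REGION_CODEBOOK = {5: "Afar", 1: "Oromia", 7: "Somali"}

# Messy rows: an undefined region code and funding with no stated unit.
messy = [
    ("project1", 5, 25_000_000),
    ("project2", 1, 10_000_000),
    ("project3", 7, 15_000_000),
]


def document(rows, codebook):
    """Attach the region name to each code and name the funding unit."""
    return [
        {
            "project": project,
            "region": codebook[code],
            "region_id": code,
            "funding_USD": funding,
        }
        for project, code, funding in rows
    ]


tidy = document(messy, REGION_CODEBOOK)
for row in tidy:
    print(row)
```

A code missing from the codebook raises a KeyError here, which surfaces undocumented codes early instead of passing them through.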
Tim Essam ([email protected] @StataRGIS) & Laura Hughes ([email protected] @flaneuseks)
25,000,000 30,000,000 75,000,000
region    TB_cases   cholera_cases
region1   4,200      2,500
region2   1,250      7,000
region3   NA         9,000
region4   0          NA
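When reading such a table, it helps to keep "not recorded" distinct from a true zero. A minimal sketch, assuming NA marks a missing value and that a blank cell is ambiguous and should be rejected rather than guessed; `parse_count` is a hypothetical helper, not something from the original sheet.

```python
def parse_count(cell):
    """Convert a text cell to an int, keeping missing distinct from zero."""
    text = cell.strip()
    if text == "NA":
        return None  # explicitly missing
    if text == "":
        # A blank cell could mean missing or zero, so refuse to guess.
        raise ValueError("blank cell: is this missing or zero?")
    return int(text.replace(",", ""))


tb_cases = [parse_count(c) for c in ["4,200", "1,250", "NA", "0"]]
print(tb_cases)

# Totals skip missing values but still count the true zero.
known = [v for v in tb_cases if v is not None]
print(sum(known), "cases across", len(known), "reporting regions")
```

Rejecting blanks forces whoever maintains the data to state which of the two meanings applies.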
project    status
project1   kinda okay
project2   good
project3   kinda okay
project4   really not okay
project    region   region_id   funding_USD
project1   Afar     5           25,000,000
project2   Oromia   1           10,000,000
project3   Somali   7           15,000,000
CC BY 4.0
May 2017