Common Threats to Tidy Data

There are lots of ways to make data untidy. These examples are illustrative, not exhaustive. Inspired by the Tidyverse: vita.had.co.nz/papers/tidy-data.pdf

Tidy data has three properties:

Each variable forms a unique column.
Each observation forms a row.
Each set of observations forms a table.

Each threat below pairs a MESSY example with its TIDY fix.

Merged Cells

MESSY (combined cells: "cholera cases" spans the 2012 and 2014 columns)

            cholera cases
region      2012      2014
region1     2,500     9,000
region2     7,000     100

TIDY

region      year      cholera_cases
region1     2012      2,500
region2     2012      7,000
region1     2014      9,000
region2     2014      100

MESSY (multiple values per cell)

project     region
project1    Afar, Oromia
project2    Oromia, Somali, SNNP
project3    SNNP, Afar, Amhara, Gambela

TIDY

project     region
project1    Afar
project1    Oromia
project2    Oromia
...         ...

No Unique ID

MESSY

project     region
project1    Afar
project1    Oromia
project2    Oromia
...         ...

TIDY

activity_id   project     region
100           project1    Afar
101           project1    Oromia
102           project2    Oromia
...           ...         ...

Variable Names Not Meaningful

MESSY

region      variable1   variable2
region1     2,500       4,000
region2     7,000       4,100

TIDY

region      cholera     TB
region1     2,500       4,000
region2     7,000       4,100

Variable Names Contain Measurements

MESSY

region      2012cholera   2014cholera
region1     2,500         9,000
region2     7,000         100

TIDY

region      year      cholera_cases
region1     2012      2,500
region2     2012      7,000
region1     2014      9,000
region2     2014      100

Inconsistent Data

MESSY (different formats, names misspelled, special characters, values in different units, duplicate values (?))

project     start_date    region               funding
project1    14-04-13      Afar                 25,000,000
project2    2016-12-25    Affar                30000000
project3    01/01         Benishangul-Gumuz    $75M
project3    01012014      Benishangul Gumuz    €68M

TIDY

project     start_date    region               funding_USD
project1    2014-04-13    Afar                 25,000,000
project2    2016-12-25    Afar                 30,000,000
project3    2014-01-01    Benishangul-Gumuz    75,000,000

Missing Values Not Explicit

MESSY (is this missing or zero?)

region      TB_cases    cholera_cases
region1     4,200       2,500
region2     1,250       7,000
region3     (blank)     9,000
region4     (blank)     (blank)

TIDY

region      TB_cases    cholera_cases
region1     4,200       2,500
region2     1,250       7,000
region3     NA          9,000
region4     0           NA

Data Not Computer Readable

MESSY (status cells hold no text; the value is conveyed only through a legend: good, kinda okay, really not okay)

project     status
project 1
project 2
project 3
project 4

TIDY

project     status
project1    kinda okay
project2    good
project3    kinda okay
project4    really not okay

Undocumented, Vague Data

MESSY (code not defined, units not specified)

project     region    funding
project1    5         25,000,000
project2    1         10,000,000
project3    7         15,000,000

TIDY

project     region    region_id   funding_USD
project1    Afar      5           25,000,000
project2    Oromia    1           10,000,000
project3    Somali    7           15,000,000

Tim Essam (@StataRGIS) & Laura Hughes (@flaneuseks). CC BY 4.0. May 2017.
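The fix for "Variable Names Contain Measurements" can be sketched in a few lines of plain Python. This is only an illustration, not the authors' method: the function name `to_tidy` and its parameters are hypothetical, and the rows are typed in by hand from the 2012cholera / 2014cholera example.

```python
# Wide table from the example: the year is buried in the column names.
messy_cholera = [
    {"region": "region1", "2012cholera": 2500, "2014cholera": 9000},
    {"region": "region2", "2012cholera": 7000, "2014cholera": 100},
]


def to_tidy(rows, id_col="region", suffix="cholera"):
    """Turn columns such as '2012cholera' into (region, year, cholera_cases) rows."""
    tidy = []
    for row in rows:
        for name, value in row.items():
            if name == id_col or not name.endswith(suffix):
                continue  # skip the identifier and any unrelated column
            year = int(name[: -len(suffix)])  # '2012cholera' -> 2012
            tidy.append({id_col: row[id_col], "year": year, "cholera_cases": value})
    # Order by year, then region, to match the layout of the TIDY table.
    return sorted(tidy, key=lambda r: (r["year"], r[id_col]))


tidy_cholera = to_tidy(messy_cholera)
for record in tidy_cholera:
    print(record)
```

Each output row holds one observation (one region in one year), so the year becomes a variable that can be filtered or plotted directly.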
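The "multiple values per cell" and "No Unique ID" fixes can be combined in one pass. The sketch below is an assumption-laden illustration: `split_regions` is a hypothetical helper, and starting the identifiers at 100 simply mirrors the activity_id values in the example.

```python
from itertools import count

# Messy table from the example: several regions packed into one cell.
messy_projects = [
    {"project": "project1", "region": "Afar, Oromia"},
    {"project": "project2", "region": "Oromia, Somali, SNNP"},
    {"project": "project3", "region": "SNNP, Afar, Amhara, Gambela"},
]


def split_regions(rows, start_id=100):
    """Emit one row per project-region pair, each with its own activity_id."""
    ids = count(start_id)  # 100, 101, 102, ...
    tidy = []
    for row in rows:
        for region in row["region"].split(","):
            tidy.append(
                {
                    "activity_id": next(ids),
                    "project": row["project"],
                    "region": region.strip(),  # drop the space after each comma
                }
            )
    return tidy


activities = split_regions(messy_projects)
for record in activities:
    print(record)
```

With one region per row and a unique activity_id, each project-region pair can be referenced unambiguously even though project names repeat.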
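Once missing values are written out explicitly, a loader can keep "missing" and "zero" apart. The following is a minimal sketch under stated assumptions: the CSV text is typed in from the region1 to region4 values with NA and 0, `parse_count` is a hypothetical helper, and mapping NA (or a blank cell) to `None` is one possible convention rather than a rule from the source.

```python
import csv
import io

# Values from the example with explicit NA and 0, written as CSV text.
raw = '''region,TB_cases,cholera_cases
region1,"4,200","2,500"
region2,"1,250","7,000"
region3,NA,"9,000"
region4,0,NA
'''


def parse_count(cell):
    """Return an int, or None when the cell is blank or NA (value unknown)."""
    text = cell.strip().replace(",", "")  # remove thousands separators
    if text == "" or text.upper() == "NA":
        return None
    return int(text)


cases = [
    {
        "region": row["region"],
        "TB_cases": parse_count(row["TB_cases"]),
        "cholera_cases": parse_count(row["cholera_cases"]),
    }
    for row in csv.DictReader(io.StringIO(raw))
]

for record in cases:
    print(record)
```

Here region3 has an unknown TB count (`None`) while region4 has a recorded count of zero, a distinction that a blank cell cannot express.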