Friday 20 February 2015

Part 2: Data Preparation


In Part 1 I introduced the weather data set we will be using in this series of tutorials. We are now going to prepare the data for the subsequent EDA: we will recode and transform variables, change their types, and perform some basic data checks. Feel free to follow along with the analysis (click here to download the weather data), bearing in mind you can type ?function_name to get help on a specific R function, for instance, ?head

Importing the data


To start off, let's read in the data to an R data frame and run some basic commands:

# Make sure the file is in your current working directory
 
> weather <- read.csv("weather_2014.csv",sep=";",stringsAsFactors=FALSE)
 
> dim(weather)
[1] 365  14

> names(weather)
 [1] "day.count"      "day"            "month"          "season"        
 [5] "l.temp"         "h.temp"         "ave.temp"       "l.temp.time"   
 [9] "h.temp.time"    "rain"           "ave.wind"       "gust.wind"     
[13] "gust.wind.time" "dir.wind"
 
> head(weather)
  day.count day month season l.temp h.temp ave.temp l.temp.time h.temp.time rain
1         1   1     1 Winter   12.7   14.0     13.4       01:25       23:50 32.0
2         2   2     1 Winter   11.3   14.7     13.5       07:30       11:15 64.8
3         3   3     1 Winter   12.6   14.7     13.6       21:00       14:00 12.7
4         4   4     1 Winter    7.7   13.9     11.3       10:35       01:50 20.1
5         5   5     1 Winter    8.8   14.6     13.0       01:40       12:55  9.4
6         6   6     1 Winter   11.8   14.4     13.1       19:35       00:05 38.9
  ave.wind gust.wind gust.wind.time dir.wind
1     11.4      53.1          15:45        S
2      5.6      41.8          22:25        S
3      4.3      38.6          00:00      SSW
4     10.3      66.0          09:05       SW
5     11.6      51.5          13:50      SSE
6      9.9      57.9          08:10      SSE 
 
 
It seems we have correctly loaded the data into R. Notice that the variables in the data file are separated not by a comma but by a semicolon, hence the need to set the sep = ";" argument. We also told R not to import strings as factors (i.e., categorical variables). A blanket conversion rarely works: some character variables are genuinely strings while others are actually factors, and the same goes for some integer variables. This means there is almost always manual work to be done after importing, and therefore I prefer to read in the variables without any initial processing.

Let's now have a look at the structure of the weather data frame, using one of the most useful R functions, str().

> str(weather)
'data.frame': 365 obs. of  14 variables:
 $ day.count     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ day           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
 $ season        : chr  "Winter" "Winter" "Winter" "Winter" ...
 $ l.temp        : num  12.7 11.3 12.6 7.7 8.8 11.8 11.4 12.4 9.2 8.3 ...
 $ h.temp        : num  14 14.7 14.7 13.9 14.6 14.4 14.8 15.6 18.4 14.8 ...
 $ ave.temp      : num  13.4 13.5 13.6 11.3 13 13.1 13.5 14.1 12.9 11 ...
 $ l.temp.time   : chr  "01:25" "07:30" "21:00" "10:35" ...
 $ h.temp.time   : chr  "23:50" "11:15" "14:00" "01:50" ...
 $ rain          : num  32 64.8 12.7 20.1 9.4 38.9 2 1.5 0 0 ...
 $ ave.wind      : num  11.4 5.6 4.3 10.3 11.6 9.9 6.6 5.9 0.2 1.4 ...
 $ gust.wind     : num  53.1 41.8 38.6 66 51.5 57.9 38.6 33.8 16.1 24.1 ...
 $ gust.wind.time: chr  "15:45" "22:25" "00:00" "09:05" ...
 $ dir.wind      : chr  "S" "S" "SSW" "SW" ...

Based on the output of this function, and keeping in mind the goal of producing visualisations and potentially building models, how would you change the variables in the data set? Here are my thoughts:
  • Day and month are coded as integers, but they should be factors (categorical variables); the same applies to the character variables representing the season and wind direction;
  • Looking at the first values for the wind direction - "S" "S" "SSW" "SW" - it seems that a 16-wind compass rose has been used. Since there are only 365 days in the year, do we have sufficient observations for each of the 16 groups, or should we group them into the 8 principal winds, or even just the 4 cardinal directions?
  • The day count (number of days since the beginning of the year) is useful, but when plotting we would like to have dates on the x axis rather than an index. This variable should therefore be transformed;
  • The three time variables show the exact minute at which the corresponding event occurred. We would most likely benefit from some aggregation, and rounding to the nearest hour seems a good option. After that, we would convert the hour variable to a factor.

It is also quite common to check for missing values (coded as NA by default). As with almost everything, there are many ways to do it in R. Here are just two of them:

# One way (sum of NA over the entire data set)
> sum(is.na(weather))
[1] 0

# Another way (number of complete observations)
> nrow(weather)
[1] 365 
 
> sum(complete.cases(weather))
[1] 365 
 
> nrow(weather) == sum(complete.cases(weather))
[1] TRUE
 

 

Create factors

 

Variables can be easily coerced to factors using the as.factor() function, which is a shorthand for the main function, factor(). The former, however, orders the levels alphabetically when the original variable is a string (class character in R), which may not be what we want for some variables, for example, the season of the year. The code below shows how factors are created and why they differ from the other types of variables.

> # Before (365 independent strings)
> class(weather$season)
[1] "character"

> summary(weather$season)
   Length     Class      Mode 
      365 character character 

> weather$season <- factor(weather$season,
                    levels = c("Spring","Summer","Autumn","Winter"))

> # After (4 categories, ordered by "levels")
> class(weather$season)
[1] "factor"

> summary(weather$season)
Spring Summer Autumn Winter 
    92     92     91     90
 
> # Using as.factor() when the order doesn't matter or original var. is integer

> weather$day <- as.factor(weather$day)
> weather$month <- as.factor(weather$month)
> weather$dir.wind <- as.factor(weather$dir.wind)
 

 

Dealing with the wind


Let's start by checking whether there are actually 16 directions in the dir.wind variable and, if so, whether we have a sufficient number of observations in each group.

> # Number of unique values
> length(unique(weather$dir.wind))
[1] 16 
 
> # Absolute frequency (table function)
> table(weather$dir.wind)

  E ENE ESE   N  NE NNE NNW  NW   S  SE SSE SSW  SW   W WNW WSW 
 11  15   2  18  25   8  37 108  26  24  31  17  11   5  24   3 
 
> # Making it relative (prop.table function)
> rel <- round(prop.table(table(weather$dir.wind))*100,1)
> rel

   E  ENE  ESE    N   NE  NNE  NNW   NW    S   SE  SSE  SSW   SW    W  WNW  WSW 
 3.0  4.1  0.5  4.9  6.8  2.2 10.1 29.6  7.1  6.6  8.5  4.7  3.0  1.4  6.6  0.8 
 
> # Bringing some order to the table
> sort(rel,decreasing = TRUE)

  NW  NNW  SSE    S   NE   SE  WNW    N  SSW  ENE    E   SW  NNE    W  WSW  ESE 
29.6 10.1  8.5  7.1  6.8  6.6  6.6  4.9  4.7  4.1  3.0  3.0  2.2  1.4  0.8  0.5
 
It can be seen that the relative frequency is less than 5% for more than half of the groups. Unfortunately, we don't have the actual wind direction in degrees. But, making use of some domain knowledge and the information in the table above, let's try to give a reasonable answer to the following question: is a value in the NNW group more likely to be closer to NW or to N (assuming the direction isn't exactly the midpoint between NW and N, in which case it would be a pure NNW)? The numbers show that NW, at 29.6% against 4.9% for N, would be far more likely. Even though this logic doesn't necessarily apply to all of the directions we are trying to eliminate, let's adopt this criterion and recode each intermediate direction as the closest ordinal direction (by definition, "NW", "NE", "SE", and "SW" are the ordinal directions) in a new variable.

As long as the analyst can explain why and how a new variable was created, and bears in mind that it may lack accuracy, it is perfectly fine to add it to the data set. It may turn out to be useful or useless, and that is something to figure out during the visualisation and modelling stages.

> # Transforming wind direction variable: from 16 to 8 principal winds 
 
> # Create a copy from the original variable...
> weather$dir.wind.8 <- weather$dir.wind 
 
> # ...and then simply recode some of its values
> weather$dir.wind.8 <- ifelse(weather$dir.wind %in%  c("NNE","ENE"),
                               "NE",as.character(weather$dir.wind.8)) 
 
> weather$dir.wind.8 <- ifelse(weather$dir.wind %in% c("NNW","WNW"),
                               "NW",as.character(weather$dir.wind.8)) 
 
> weather$dir.wind.8 <- ifelse(weather$dir.wind %in% c("WSW","SSW"),
                               "SW",as.character(weather$dir.wind.8)) 
 
> weather$dir.wind.8 <- ifelse(weather$dir.wind %in% c("ESE","SSE"),
                               "SE",as.character(weather$dir.wind.8)) 
 
> # create factors, ordered by "levels" 
> weather$dir.wind.8 <- factor(weather$dir.wind.8,
                        levels = c("N","NE","E","SE","S","SW","W","NW"))
 
 
> # Checking the length of the new variable 
> length(unique(weather$dir.wind.8))
[1] 8 
 
> # A 2-way table (direction vs season), with relative frequencies
> # calculated over margin = 2 (i.e., the columns)
 
> round(prop.table(table(weather$dir.wind.8,weather$season),margin = 2)*100,1)
    
     Spring Summer Autumn Winter
  N     1.1    3.3   12.1    3.3
  NE   14.1    5.4   20.9   12.2
  E     0.0    0.0    5.5    6.7
  SE   13.0   14.1   20.9   14.4
  S     5.4   12.0    4.4    6.7
  SW    6.5    8.7    2.2   16.7
  W     2.2    0.0    1.1    2.2
  NW   57.6   56.5   33.0   37.8 


The ifelse() function is one of the classical ways to populate a new variable based on the values of, or a calculation over, one or more other variables. When the original variable is numeric and the goal is to bin it into intervals, cut() is often simpler and used instead.
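
To make the contrast concrete, here is a hypothetical sketch of cut() binning our numeric rain variable into three categories (the break points and labels are arbitrary choices for illustration, not part of the original analysis):

> # Bin the daily rain into three categories; (-Inf,0] catches the dry days
> rain.cat <- cut(weather$rain, breaks = c(-Inf, 0, 10, Inf),
                  labels = c("none", "light", "heavy"))

The result is already a factor, with one level per interval, ready to be used for grouping.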

Just as a side note, it is uncommon and considered bad practice to use for loops and if statements in R when the goal is to traverse the rows of a column and apply some function when a certain condition is met. R supports vectorisation, which is a much more efficient way to accomplish the same thing. In fact, the ifelse() function is the vectorised construct that makes the for-if-else pattern unnecessary.
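
As an illustration (a sketch only, creating a hypothetical rainy indicator that we don't actually need), compare the loop-based construct with its vectorised equivalent:

> # The discouraged for-if-else construct (pre-allocating the column first)
> weather$rainy <- character(nrow(weather))
> for (i in 1:nrow(weather)) {
      weather$rainy[i] <- if (weather$rain[i] > 0) "yes" else "no"
  }

> # The vectorised equivalent: a single ifelse() call
> weather$rainy <- ifelse(weather$rain > 0, "yes", "no")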

 

 

We need a date


To create a date in R, we just need to pass the as.Date() function a string in an appropriate format. We can then add and subtract days using the usual arithmetic operators. Here is the code to calculate the date from the day.count variable.

> first.day <- "2014-01-01"
> class(first.day)
[1] "character" 
 
> first.day <- as.Date(first.day)
> class(first.day)
[1] "Date" 
 
> # Here is where we actually calculate the date
> weather$date  <- first.day + weather$day.count - 1 
 
> head(weather$day.count)
[1] 1 2 3 4 5 6 
 
> head(weather$date)
[1] "2014-01-01" "2014-01-02" "2014-01-03" "2014-01-04" "2014-01-05" "2014-01-06"

 

 

Not so hard times


The last thing we need to do is round (to the nearest hour) the times at which certain events occurred (lowest temperature, highest temperature, and wind gust). Working with times in R is a bit more complicated than working with dates: there are two alternative classes to represent them, POSIXct and POSIXlt. The first stores the date and time as a single number, the seconds elapsed since the UNIX epoch (1 January 1970); the second stores the date and time in a list, with elements for seconds, minutes, hours, years, among others. Since we are interested in extracting the hour information after rounding, we will use the richer POSIXlt class.
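
To see the difference between the two classes (a quick sketch with an arbitrary timestamp, not in the original post), we can unclass() the same moment stored in each of them:

> # POSIXct holds a single number: the seconds elapsed since the epoch
> unclass(as.POSIXct("2014-01-01 01:25:00", tz = "GMT"))

> # POSIXlt holds a list with named elements (sec, min, hour, mday, mon, year, ...)
> str(unclass(as.POSIXlt("2014-01-01 01:25:00", tz = "GMT")))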

> # Store date and time as POSIXlt class
> l.temp.time.date <- as.POSIXlt(paste(weather$date,weather$l.temp.time))
> head(l.temp.time.date)
[1] "2014-01-01 01:25:00 GMT" "2014-01-02 07:30:00 GMT" "2014-01-03 21:00:00 GMT"
[4] "2014-01-04 10:35:00 GMT" "2014-01-05 01:40:00 GMT" "2014-01-06 19:35:00 GMT" 
 
> # Round to the nearest hour
> l.temp.time.date <- round(l.temp.time.date,"hours")
> head(l.temp.time.date)
[1] "2014-01-01 01:00:00 GMT" "2014-01-02 08:00:00 GMT" "2014-01-03 21:00:00 GMT"
[4] "2014-01-04 11:00:00 GMT" "2014-01-05 02:00:00 GMT" "2014-01-06 20:00:00 GMT" 
 
> # Which attributes are stored in the POSIXlt time variable?
> attributes(l.temp.time.date)
$names
 [1] "sec"    "min"    "hour"   "mday"   "mon"    "year"   "wday"   "yday"   "isdst" 
[10] "zone"   "gmtoff"

$class
[1] "POSIXlt" "POSIXt" 

$tzone
[1] ""    "GMT" "BST"

> # Extract the value of the hour attribute as a number and add it to the data set
> weather$l.temp.hour <- l.temp.time.date[["hour"]]
> head(weather$l.temp.hour)
[1]  1  8 21 11  2 20 
 
> # Lastly, the integer is converted to factor
> weather$l.temp.hour <- as.factor(weather$l.temp.hour)
> head(weather$l.temp.hour)
[1] 1  8  21 11 2  20
Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 17 18 19 20 21 22 23
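
The exact same procedure is then applied to the other two time variables (the final data set below confirms it). A condensed sketch of those steps, mirroring the code above:

> # Same steps for the high temperature and the wind gust times
> h.temp.time.date <- round(as.POSIXlt(paste(weather$date,weather$h.temp.time)),"hours")
> weather$h.temp.hour <- as.factor(h.temp.time.date[["hour"]])

> gust.wind.time.date <- round(as.POSIXlt(paste(weather$date,weather$gust.wind.time)),"hours")
> weather$gust.wind.hour <- as.factor(gust.wind.time.date[["hour"]])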

 

 

The prepared data set


After all the processing we have done, let's call str() again to see what our final data set looks like. We now have a date variable (to plot a time series) and several factor variables (commonly used to identify different groups of a numerical variable when creating a data visualisation).

> str(weather)
'data.frame': 365 obs. of  19 variables:
 $ day.count     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ day           : Factor w/ 31 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ month         : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ season        : Factor w/ 4 levels "Spring","Summer",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ l.temp        : num  12.7 11.3 12.6 7.7 8.8 11.8 11.4 12.4 9.2 8.3 ...
 $ h.temp        : num  14 14.7 14.7 13.9 14.6 14.4 14.8 15.6 18.4 14.8 ...
 $ ave.temp      : num  13.4 13.5 13.6 11.3 13 13.1 13.5 14.1 12.9 11 ...
 $ l.temp.time   : chr  "01:25" "07:30" "21:00" "10:35" ...
 $ h.temp.time   : chr  "23:50" "11:15" "14:00" "01:50" ...
 $ rain          : num  32 64.8 12.7 20.1 9.4 38.9 2 1.5 0 0 ...
 $ ave.wind      : num  11.4 5.6 4.3 10.3 11.6 9.9 6.6 5.9 0.2 1.4 ...
 $ gust.wind     : num  53.1 41.8 38.6 66 51.5 57.9 38.6 33.8 16.1 24.1 ...
 $ gust.wind.time: chr  "15:45" "22:25" "00:00" "09:05" ...
 $ dir.wind      : Factor w/ 16 levels "E","ENE","ESE",..: 9 9 12 13 11 11 10 10 4 7 ...
 $ dir.wind.8    : Factor w/ 8 levels "N","NE","E","SE",..: 5 5 6 6 4 4 4 4 1 8 ...
 $ date          : Date, format: "2014-01-01" "2014-01-02" ...
 $ l.temp.hour   : Factor w/ 20 levels "0","1","2","3",..: 2 9 18 12 3 17 8 1 8 9 ...
 $ h.temp.hour   : Factor w/ 19 levels "0","1","2","3",..: 1 8 11 3 10 1 12 10 11 9 ...
 $ gust.wind.hour: Factor w/ 24 levels "0","1","2","3",..: 17 23 1 10 15 9 13 15 15 15 ...

  

 

Final notes on data "wrangling" and R


As we saw in Part 1, the process of cleaning and transforming the raw data is almost always required before starting the actual analysis. In fact, in many real-life cases, the visualisation and modelling stages are easier and less time-consuming than getting the data ready to be explored.

The R language is extremely powerful for creating visualisations and building models. It is often considered, however, that the language is not the most user-friendly when it comes to preparing the data (still true, but less so than a few years ago). Here are some of the alternatives we, the analysts, have at our disposal:
  • For reasonably simple and tidy data sets, like the one we have been using in this tutorial, Excel is usually sufficient; a combination of pivot tables, Vlookup() and/or Index()/Match(), and a few basic formatting functions would have accomplished the same as we have done here;
  • When more advanced processing is required (for instance, the use of regular expressions), and even though R supports regex and provides the relevant functions in its base package, some prefer to use other languages (Perl or Python would be good options here);
  • An increasingly popular alternative is to use other R packages that provide wrapper functions around the base ones. For instance, we could have dealt with our three time variables more easily using the lubridate package than with the base functions (see the sketch after this list); and had we needed to manipulate strings, the stringr package would have provided several consistent wrappers, making the text processing simpler.
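
As an illustration of that last point, here is a hypothetical lubridate version of the hour extraction we did earlier with the base functions (assuming the lubridate package is installed; this was not part of the original analysis):

> # Parse, round, and extract the hour in one readable chain
> library(lubridate)
> l.temp.posix <- ymd_hm(paste(weather$date, weather$l.temp.time))
> weather$l.temp.hour <- as.factor(hour(round_date(l.temp.posix, "hour")))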

Onward and upward to Part 3 of this series of tutorials, where we will be creating a few visualisations to gain insight from our weather data.

Saturday 14 February 2015

Part 1: Introduction


The ultimate goal of every data scientist is to extract as much valuable information as possible from a given data set. We want to be able to predict the future based on the past, to discover deep and hidden patterns in the data, and to expand the current base of knowledge in some specific domain. With this in mind, several machine learning algorithms, from Neural Networks to Support Vector Machines, from Naïve Bayes to Random Forests, have been developed over the years. In many situations, when correctly applied, these can extract greater insight from the data than any human could, however clever they might be, based on their analytical skills alone.
 
Daily experience shows, however, that the process of analysing data seldom goes past the Exploratory Data Analysis (EDA). In fact, more often than not, the classical EDA approach of summarising and plotting is sufficient for the analyst to build a solid intuition of what the data is trying to convey, and it sometimes raises new questions that might be addressed afterwards, if a model is to be developed.

Unfortunately, in many cases we cannot perform EDA right away. Raw data is often messy, unstructured, badly coded, inconsistent, or just plain wrong. It is often said that the analyst spends 80% of the time preparing the data and only 20% actually doing analysis and modelling. This initial data wrangling consists of handling missing values, removing duplicates, transforming variables, formatting variable types, recoding values, detecting outliers to ascertain data integrity, and so on. Our goal is to have what is called tidy data by the end of the process.

Data analysis using R


In this series of intermediate-level tutorials, I will guide you through the process of analysing an actual data set. We will start by preparing the data and then use a few EDA techniques to get a grasp of what’s in it. To accomplish this, we will be using the lingua franca of data science, the R programming language. There are many advantages and disadvantages to using R, and I summarise below the ones I personally deem the most relevant.

Pros
  • There is a big community of R users – as of 2014, there were more than 5 000 user-contributed packages in the main repository (CRAN) alone, and around 150 000 R functions (software popularity). If you need a function for some specific purpose, it is highly likely that someone else has already created it. If you have any doubt about R syntax, you will probably find the answer online easily (stackoverflow);
  • The superb graphics capabilities offered by the ggplot2 package – this plotting system implements the grammar of graphics, a new way of thinking about the visual representation of data. If you have to pick a single topic to learn in R, ggplot2 is arguably the best option. It is a language in itself that will allow you to produce high-quality plots in a short amount of time;
  •  It’s free! (This is not unique to R, though).

Cons
  • R has a steep learning curve – the paradigm is different from the mainstream languages and even from other statistical packages. It is highly interactive: to complete an analysis you often call one function, take its result, and feed it to the next function, in a cycle that can be quite extensive;
  • Some of the syntax is far from intuitive and even cumbersome – for example, using the intuitive sort() function to sort a data frame yields nasty results; you need to use the order() function instead, not directly but in a rather convoluted way (see the short sketch after this list). It must be said, however, that several packages have been developed to make analysis much easier, namely plyr/dplyr to manipulate (filter, transform, summarise) data, lubridate and stringr to lessen the burden of dealing with dates and strings, respectively, and sqldf to run SQL statements directly over R data frames, among a few others;
  • R can be a bit slow, but that is more of an issue when developing algorithms than when performing interactive EDA. In fact, the bottleneck of the process is the analyst. The time we spend thinking about what information to extract and how to visualise it is several orders of magnitude greater than the time it takes R to actually draw the graphics.
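
For instance, a minimal sketch of the order() idiom mentioned above (df and some.column are placeholder names, not from our data set):

> # sort() works on plain vectors; to sort a data frame by a column,
> # index its rows with order()
> df.sorted <- df[order(df$some.column, decreasing = TRUE), ]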

Without further ado, let’s have a look at the data set we will be using in this series of tutorials.

The weather data set


The data set consists of daily records of several meteorological parameters, measured in the city of Porto over the year of 2014. We have, then, 365 observations for each of the following 14 variables:

day.count – number of days passed since the beginning of the year
day – day of the month
month – month of the year
season – season of the year
l.temp, h.temp, ave.temp – lowest, highest and average temperature for the day (in ºC)
l.temp.time, h.temp.time – hour of the day when l.temp and h.temp occurred
rain – amount of precipitation (in mm)
ave.wind – average wind speed for the day (in km/h)
gust.wind – maximum wind speed for the day (in km/h)
gust.wind.time – hour of the day when gust.wind occurred
dir.wind – dominant wind direction for the day

Now, let’s move on to Part 2 of this tutorial, where we will start by inspecting the data and prepare it, so we can then proceed to perform EDA.