Tuesday, 17 March 2015

Part 3b: EDA with ggplot2

In Part 3a I have introduced the plotting system ggplot2. I talked about its concept and syntax with some detail, and then created a few general plots, using the weather data set we've been working with in this series of tutorials. My goal was to show that, in ggplot2, a nice looking and interesting graph can usually be created with just a few lines of code. Given the positive feedback that post has received, I believe I had some success with what I had in mind. Who knows if some of the readers will decide to give ggplot2 a try using their own data? It is often said that a picture is worth a thousand words, but I often see some highly experienced R users, after doing really complex data analyses, plotting the results in a way that falls short of the expectations, given the quality of their analytical work.

In Part 3b, we will continue to rely on visualisations to explore the data, but now with the goal of tackling the following question: Are there any good predictors, in our data, for the occurrence of rain on a given day? The numeric variable representing the amount of rain will be our dependent variable, also known as response or outcome variable, and all the remaining ones will be potential independent variables, also known as predictors or explanatory variables.

After framing the question, and before fitting any model, EDA techniques (such as visualisation) are used to gain insight from the data. Here are the steps I usually take at this stage:
  • Analyse the (continuous) dependent variable - at least, make an histogram of the distribution; when applicable, plot the variable against time to see the trend. Can the continuous variable be transformed to a binary one (i.e., did it rain or not on a particular day?), or to a multicategorical one (i.e., was the rain none, light, moderate, or heavy on a particular day?).
  • Search for correlations between the dependent variable and the continuous independent variables - are there any strong correlations? Are they linear or non-linear?
  • Do the same as above but try to control for other variables (faceting in ggplot2 is very useful  to do this), in order to assess for confounding and effect modification. Does the association between two continuous variables hold for different levels of a third variable, or is modified by them? (e.g., if there is a strong positive correlation between the rain amount and the wind gust maximum speed, does that hold regardless of the season of the year, or does it happen only in the winter?)
  • Search for associations between the dependent variable and the categorical independent variables (factors) - does the mean or median of the dependent variable change depending on the level of the factor? What about the outliers, are they evenly distributed across all levels, or seem to be present in only a few of them? 
So, let's now do some analyses, having the framework described above in mind.


Exploring the dependent variable - daily rain amount



# Time series of the daily rain amount, with smoother curve

ggplot(weather, aes(date,rain)) +
  geom_point(aes(colour = rain)) +
  geom_smooth(colour = "blue", size = 1) +
  scale_colour_gradient2(low = "green", mid = "orange",high = "red", midpoint = 20) +
  scale_y_continuous(breaks = seq(0,80,20)) +
  xlab("Date") +
  ylab("Rain (mm)") +
  ggtitle("Daily rain amount")


# Histogram of the daily rain amount

ggplot(weather,aes(rain)) + 
  geom_histogram(binwidth = 1,colour = "blue", fill = "darkgrey") +
  scale_x_continuous(breaks = seq(0,80,5)) +
  scale_y_continuous(breaks = seq(0,225,25)) +
  xlab("Rain (mm)") +
  ylab ("Frequency (days)") +
  ggtitle("Daily rain amount distribution")
The time series plot shows that the daily rain amount varies wildly throughout the year. There are many dry days interspersed with stretches of consecutive wet ones, often severe, especially in the autumn and winter seasons. The histogram not only confirms what was said above, but also shows that the distribution is extremely right-skewed. As shown below, both informally (comparing the mean to the median), and formally (calculating the actual value), the positive skewness remains even after removing all the days where it did not rain.

> # Heavily right-skewed distribution
> summary(weather$rain)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.300   5.843   5.300  74.900 
> # Right-skewness is still there after removing all the dry days
> summary(subset(weather, rain > 0)$rain)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.30    1.15    5.10   11.17   16.10   74.90 
> # Formal calculation of skewness (e1071 package)
> library(e1071) 
> skewness(weather$rain)
[1] 2.99
> skewness(subset(weather, rain >0)$rain)
[1] 2.04
It should be clear at this point that one possible approach would be to dichotomise the dependent variable (rain vs. no rain).  Note that it is common to consider days with rain only those where the total amount was at least 1mm (to allow for measurement errors), and that's the criterion we will adopt here. Here is the code to do it and a few interesting summaries.

> # Create binary outcome (rained = {Yes, No})
> # Number of dry days
> nrow(subset(weather, rain == 0))
[1] 174 
> # Number of days that will also be considered dry days
> nrow(subset(weather, rain <1 & rain >0))
[1] 45
> # The new binary variable is called "rained"
> weather$rained <- ifelse(weather$rain >= 1, "Yes", "No")
> # Dry and wet days (absolute)
> table(rained = weather$rained) 
 No Yes 
219 146 
> # Dry and wet days (relative)
> prop.table(table(rained = weather$rained)) 
 No Yes 
0.6 0.4
Porto is one of the wettest cities in Europe for a reason. There was rain in exactly 40% of the days of the year, and this is considering those days with rain < 1mm as dry. Should we set the cutoff at 0 mm, and more than 50% of the days in 2014 would have been considered wet.


Looking at the association between rain and season of the year

The time series plot seems to indicate that season of the year might be a good predictor for the occurrence of rain. Let's start by investigating this relation, with both the continuous rain variable and the binary one.

Rain amount (continuous) by season

# Jitter plot - Rain amount by season 
 ggplot(weather, aes(season,rain)) +
  geom_jitter(aes(colour=rain), position = position_jitter(width = 0.2)) +
  scale_colour_gradient2(low = "blue", mid = "red",high = "black", midpoint = 30) +
  scale_y_continuous(breaks = seq(0,80,20)) +
  xlab("Season") +
  ylab ("Rain (mm)") +
  ggtitle("Daily rain amount by season")

> # Rain amount by season
> # {tapply(x,y,z) applies function z to x, for each level of y} 
> tapply(weather$rain,weather$season,summary) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   2.934   2.550  28.200 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   2.936   1.350  68.300 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.300   7.164   6.450  74.900 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    3.95   10.45   14.65   64.80

We can see that most of the extreme values (outliers) are in the winter and autumn. There are still, however, many dry days in both these seasons, and hence the means are too close from each other. Fitting a model to predict the actual amount of rain (not the same as probability of occurrence) based on the season alone might not give the greatest results.

Rain occurrence (binary) by season

# Bar plot - dry and wet days by season (relative)
ggplot(weather,aes(season)) +
  geom_bar(aes(fill = rained), position = "fill") +
  geom_hline(aes(yintercept = prop.table(table(weather$rained))["No"]),
             colour = "blue",linetype = "dashed", size = 1) +
  annotate("text", x = 1, y = 0.65, label = "yr. w/o = 0.60", colour = "blue") +
  xlab("Season") +
  ylab ("Proportion") +
  ggtitle("Proportion of days without and with rain, by season")

> round(prop.table(table(season = weather$season, rained= weather$rained),1),2
season     No  Yes
  Spring 0.74 0.26
  Summer 0.72 0.28
  Autumn 0.57 0.43
  Winter 0.37 0.63

It appears that, when it comes to calculate the likelihood of raining on a particular day, the season of the year may have some predictive power. This is especially true for the winter season, where the rainy days (63%) are well above the yearly average (40%).


Looking at the correlations between rain and all numeric variables

We are now going to calculate the linear (Pearson) correlations between the continuous outcome variable (daily rain amount) and all the numeric variables - both the actual numeric ones (such as the temperature and wind speed), and those that are actually factor variables, but still make sense if we want to see them as numeric (for instance, the hour of the day in which some event occurred).

Note that we are not modelling at this stage, just trying to gain some insight from the data, and therefore we are neither concerned about whether the correlations are significant (p-values and/or confidence intervals) nor if the the relation between any two variables is in fact linear (we will be able to check this later, anyway, when we create the scatter plots with the lowess curves on top).

Here is the code (the comments should make it easier to understand what we are trying to do).

> # Create a new data frame with only the variables than can be numeric
> weather.num <- weather[c("rain","l.temp","h.temp","ave.temp","ave.wind","gust.wind",
+ "l.temp.hour","h.temp.hour","gust.wind.hour")] 
> # Convert the following factor variables to numeric
> weather.num$l.temp.hour <- as.numeric(weather.num$l.temp.hour)
> weather.num$h.temp.hour <- as.numeric(weather.num$h.temp.hour)
> weather.num$gust.wind.hour <- as.numeric(weather.num$gust.wind.hour) 
> # Pass the entire data frame to the cor() function to create a big correlation matrix
> # Pick only the first row of the matrix [1,] ==> correlation between rain and all the other variables
> round(cor(weather.num),2)[1,]
          rain         l.temp         h.temp       ave.temp       ave.wind 
          1.00          -0.14          -0.29          -0.21           0.48 
     gust.wind    l.temp.hour    h.temp.hour gust.wind.hour 
          0.61           0.19          -0.25          -0.16

There seems to be a promising positive correlation between the rain amount and the wind variables, especially the wind gust maximum speed (moderate correlation value of 0.61) , i.e., higher wind speeds tend to be associated with higher amounts of precipitation. Always bear in mind that correlation does not imply causation, therefore while it is true that the wind correlates with rain, this does not necessarily mean that the wind itself is causing the rain. It could actually be the other way around or, what is very common, both of the variables are caused by a third one that we are not even considering.

There are also some negative correlations with the temperatures (especially the daily high) that, even though not as strong as the wind ones, are still worth looking at. It appears higher amounts of rain are correlated with lower high temperatures. But let's think about it for a minute: we saw that it is in the winter when it rains the most, and in Part 3a we also saw that the temperatures were lower in the winter. Here is an example of a potential (yet to be confirmed) interaction with a third variable: it may not be the lower temperature by itself that causes more rain, but the fact the both the precipitation and lower temperatures tend to occur during a specific period of the year.

Since the season seems to have an impact on these variables, I would like to explore it a bit further, calculating all these correlations by season and check whether the values hold. If the correlation between rain and some other variable is very dissimilar across all seasons, then there is the proof for an interaction.

> # Let's split the data frame in four parts, one for each season
> weather.num.season <- split(weather.num,weather$season) 
> # The result is a list...
> class(weather.num.season)
[1] "list" 
> # ...with 4 elements, where...
> length(weather.num.season)
[1] 4 
> # ...each element of the list is a data frame (the seasons), with nine variables
> summary(weather.num.season)
       Length Class      Mode
Spring 9      data.frame list
Summer 9      data.frame list
Autumn 9      data.frame list
Winter 9      data.frame list 
> # Here are the names of each of the data frames of the list
> attributes(weather.num.season)
[1] "Spring" "Summer" "Autumn" "Winter"

> # *apply family of functions are arguably the most powerful in base R, but also the most difficult to master
> # {sapply(x,z) applies function z to each element of x}
> # First go over the elements of the list and calculate the correlation matrix (all against all)
> # For each season, return only the correlation between "rain" and everything else
> sapply(weather.num.season, function (x) round(cor(x)["rain",],2))
               Spring Summer Autumn Winter
rain             1.00   1.00   1.00   1.00
l.temp          -0.33   0.06   0.13   0.05
h.temp          -0.45  -0.26  -0.07  -0.26
ave.temp        -0.43  -0.14   0.06  -0.09
ave.wind         0.34   0.50   0.38   0.66
gust.wind        0.37   0.45   0.66   0.71
l.temp.hour      0.24   0.16   0.01   0.33
h.temp.hour     -0.13  -0.22  -0.16  -0.30
gust.wind.hour  -0.26  -0.34  -0.18   0.06 
> # Not going to run it, but here is an alternative that might be easier (or not) to understand
> # It actually shows the correlations for each element of the list
> # lapply(weather.num.season, function (x) round(cor(x)["rain",],2)) 

What do you conclude from the table above? 

The correlation between the rain and wind varies, but keeps moderately strong, regardless of the season of the year, making this the most promising variable in out data set; the correlation between the rain and daily high temperature does not confirm what I had hypothesised above. In fact, the correlation is even stronger in the spring than in the winter, and we would have to go even deeper if we really needed to understand what is going on (keep in mind we are just analysing one of the possible interactions - the season - but in practice there can be multiple ones). For the purpose of what we are doing now, it is enough to be aware that this correlation is not stable throughout the year, and actually goes to zero during the autumn. Lastly, the correlations between rain and the hour of the events (low and high temperatures, and wind gust) are rather weak but show some stability (see the l.temp.hour and h.temp.hour). They might have some predictive power, at least in a linear model.
Now that we know that the wind has the strongest correlation with the rain, and that it holds true across all seasons, it's time to plot these variables, because we want to learn something we still don't know: what is the shape of this relation? Linear, piecewise linear, curvilinear? Let's find out using ggplot2 and the concept of faceting.


Looking deeper at the wind and rain correlation - faceting technique

The idea of faceting (also called Trellis plots) is a simple but important one. A graphic, of any type and mapping any number of variables, can be divided into several panels, depending on the value of a conditioning variable. This value is either each of the levels of a categorical variable, or a number of ranges of a numeric variable. This helps us check whether there are consistent patterns across all panels. In ggplot2 there are two functions to create facets - facet_wrap() and facet_grid() - that are used when we have one or two conditioning variables, respectively. As everything in ggplot2, this is just another layer in the plot, which means all geometries, mappings, and settings will be replicated across all panels.

Let's then assess the linearity of the correlation of the amount of rain and the maximum wind gust speed, conditioning on the season of the year.

# Amount of rain vs. wind, by season
ggplot(weather,aes(gust.wind,rain)) +
  geom_point(colour = "firebrick") +
  geom_smooth(size = 0.75, se = F) +
  facet_wrap(~season) +
  xlab("Maximum wind speed (km/h)") +
  ylab ("Rain (mm)") +
  ggtitle("Amount of rain vs. maximum wind speed, by season")

This plot confirms what we had already discovered: there is a positive correlation between rain and wind, and the association holds regardless of the season. But now we know more: this correlation is non-linear. In fact, if we were to generalise, we could say there is no correlation at all when the maximum wind speed is below 25 km/h. For values higher than that, there seems to be a linear association in the autumn and winter, not so linear in the spring, and definitely non-linear during the summer. If we wanted to model this relation, we would either fit a non-linear model (such as a regression tree) or we could try to force a piecewise linear model (linear spline), where the equation relating the outcome to the predictors would, itself, be different depending on the value of the wind.


Occurrence of rain - more complex faceting to visualise variable association


To finish this part of the series, let's check whether the variable that seemed to have some predictive power for the amount of rain (maximum wind speed), is also good in the case of a binary outcome (occurrence of rain), but now we will not only control for the season, but also for the daily high temperature (because, as we have seen before, this variable was interacting with both the rain and the season). We will do this simultaneously, using the faceting technique on two variables. But, since the daily high temperature variable is continuous, we need to transform it to categorical first. A common strategy is to split the continuous variables in four groups of (roughly) the same size, i.e., the quartiles. This is very simple to do in R, combining the cut() and quantile() functions.

> # Using the defaults of the quantiles function returns 4 intervals (quartiles)
> quantile(weather$h.temp)
  0%  25%  50%  75% 100% 
 9.8 14.4 19.1 23.3 31.5 
> # All we need to do is to define the quartiles as the breaks of the cut function
> # and label the intervals accordingly
> weather$h.temp.quant <- cut(weather$h.temp, breaks = quantile(weather$h.temp),
                            labels = c("Cool","Mild","Warm","Hot"),include.lowest = T)
> # The result 
> table(weather$h.temp.quant)

Cool Mild Warm  Hot 
  92   91   94   88  


# Occurrence of rain, by season and daily high temperature 

ggplot(weather,aes(rained,gust.wind)) +
  geom_boxplot(aes(colour=rained)) +
  facet_grid(h.temp.quant~season) +
  xlab("Occurrence of rain") +
  ylab ("Maximum wind speed (km/h)") +
  ggtitle("Occurrence of rain, by season and daily high temperature")

The graph reveals a clear pattern: the median of the maximum wind speed is always higher when it rains, and this is not affected by the range the of daily high temperature, even after controlling for the temperature variation within each season of the year.

I think we now have a much better understanding of the data. We know which variables matter the most, and which ones seem to be useless, when it comes to predict the rain, either the actual amount or the probability of its occurrence. Please note that it would be impossible to write about the analysis of every single variable and show every plot; behind the scenes, much more work than what I've shown here has been done.

In Part 4, the last of this series, we will be using a few machine learning algorithms to find out how well the rain can be predicted, and the contribution of each variable to the accuracy and precision of each model.

In the meanwhile, if you have done any kind of analysis with this data set, feel free to share your own findings, either by commenting on the post or by sending me a private message.

No comments:

Post a Comment