setup-data.Rmd

---
title: "Setup of data for an onlineforecast model"
author: "Peder Bacher"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_debth: 3
vignette: >
  %\VignetteIndexEntry{Setup of data for an onlineforecast model}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: literature.bib
---


```{r external-code, cache=FALSE, include=FALSE, purl = FALSE}
# Have to load the knitr to use hooks
library(knitr)
## This vignettes name
vignettename <- "setup-data"
## Read external code
knitr::read_chunk("shared-init.R")
```
```{r init, cache=FALSE, include=FALSE, purl=FALSE}
```
```{r links, cache=FALSE, echo=FALSE, purl=FALSE, results="asis"}
```


## Intro
This vignette explains how to setup data consisting of observations and
forecasts, such that it can be used for onlineforecast models. A generic
introduction and description is in available in [onlineforecasting]. The code is
available [here](setup-data.R). More information on [onlineforecasting.org].


## Data example

First load the package:
```{r}
## Load the package
library(onlineforecast)
```

In the package different data sets are included. The
`Dbuilding` holds the data used for the example of
heat load forecasting in the building-heat-load-forecasting vignette.

When the package is loaded the data is also loaded, so we can access it
directly. Let's start out by:
```{r}
## Keep it in D to simplify notation
D <- Dbuilding
```

The class is 'data.ĺist':
```{r}
## The class of D
class(D)
```

Actually, a 'data.list' is simply a 'list', but we made the class 'data.list' in
order to have functions for the particular format of data - the format is explained in this document.

It consists of vectors of time, vectors of observations (model output) and
data.frames of forecasts (model input):
```{r}
## Print the names to see the variables in the data
names(D)
```

An overview of the content can be generated by:
```{r}
summary(D)
```
where it can be seen that `t` is a time vector, `heatload` is a vector, and `Ta` and `I` are data.frames.

A function for carrying out a check of the format of the 'data.list' is:
```{r}
check(D)
```
Basically, if there is a `V` in the `ok` column, then the format of this
variable in `D` is correct. See the help with `?check.data.list` to learn what the printed output means.


### Time

First, lets have a look at `D$t`, which is the vector of time points:
```{r}
## The time
class(D$t)
head(D$t)
tail(D$t)
```
Hence, the vector is of the class `POSIXct`. It is not a necessity, `t`
can also simply be a numeric, but for plotting and many operations, its
very useful to use the 'POSIXct' class (see `?POSIXt`).

Rules for the time vector:

- It must be named `t`.

- There must be no gaps or NA values in `t`, since only equidistant time series
  can be used in the models (the other variables can have NAs).

- Its best to keep the time zone in `UTC` or `GMT` (not providing any time
  zone `tz` can give rise to problems).
  

Use the basic R functions for handling the time class. Most needed
operations can be done with:
```{r}
?as.POSIXct
?strftime
```


A helper function is provided with the `asp` function which can be called using `?`, or `?asp`. See example below:

```{r}
## Convert from a time stamp (tz="GMT" per default)
asct("2019-01-01 11:00")
## Convert from unix time
asct(3840928387)
```
Note that for all functions where a time value as a character is given, the time
zone is always "GMT" (or "UTC", but this can result in warnings, but they can be
ignored). For some operations the package `lubridate` can
be very helpful.


### Observations

Note the rules for observations:

- In a `data.list` observations must be numeric vectors.

- The vectors must have the same length as the time `t` vector.

- Observation as vectors can be used directly as model output (if observations
  are to used as model inputs, they must be setup in a data.frame as explained
  below in Section [Forecasts]).


In the current data, a time series of hourly heat load observations is included:

```{r}
str(D$heatload)
```

It must have the same length as the time vector:
```{r}
## Same length as time
length(D$t)
length(D$heatload)
```

A simple plot can be generated by:
```{r}
plot(D$t, D$heatload, type="l", xlab="Time", ylab="Headload (kW)")
```

The convention used in all examples is that the time points are always
set to the time interval end point, e.g.:
```{r}
## The observation
D$heatload[2]
## Represents the average load between
D$t[1]
## and
D$t[2]
```
The main idea behind setting the time point at the end of the interval is:
Working with values averaged over the time interval, such values are available
at the end of the time interval, not before. Especially, in real-time
applications this is a useful convention.


### Forecasts

As described in [onlineforecast](onlineforecast.pdf) the setup of forecasts for
model inputs always follows the same format - as presented in the
following. This is also the format of the forecasts generated by functions in the package. Hence all forecasts must follow this format.

The rules are:

- All values at row `i` are available at the `i`'th value in time `t`.

- All columns must be named with `k` followed by an integer indicating the
  horizon in steps (e.g. the column named `k8` hold the 8-step forecasts).


Have a look at the forecasts of the global radiation:
```{r}
## Global radiation forecasts
head(D$I)
```

At the first time point:
```{r}
## First time point
D$t[1]
```
the available forecast ahead in time is at the first row:
```{r}
## The forecast available ahead in time is in the first row
D$I[1, ]
```

We can plot that by:
```{r}
i <- 1:ncol(D$I)
plot(D$t[i], D$I[1, ], type="l", xlab="Time", ylab="Global radiation forecast (I in W/m²)")
```
So this is the forecast available ahead in time at `r D$t[1]`.

The column in `I` named `k8` holds the 8-step horizon forecasts, which, since
the steps are hourly, is an equi-distant time series. Picking out the
entire series can be done by `D$I$k8` - hence a plot (together with the
observations) can be generated by:
```{r}
## Just pick some points by
i <- 200:296
plot(D$t[i], D$I$k8[i], type="l", col=2, xlab="Time", ylab="Global radiation (W/m²)")
## Add the observations
lines(D$t[i], D$I.obs[i])
legend("topright", c("8-step forecasts","Observations"), bg="white", lty=1, col=2:1)
```

Notice how the are not aligned, since the forecasts are 8 hours ahead. To align
them the forecasts must be lagged 8 steps by:
```{r}
plot(D$t[i], lag(D$I$k8[i], 8), type="l", col=2, xlab="Time", ylab="Global radiation (W/m²)")
lines(D$t[i], D$I.obs[i])
legend("topright", c("8-step forecasts lagged","Observations"), bg="white", lty=1, col=2:1)
```


## Plotting

A few simple plotting functions are included in the package.


### Time series plots

The plot function provided with the package actually does this lagging with plotting forecasts:
```{r}
plot_ts(D, patterns=c("^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36))
```

The argument `patterns` is vector of a regular expressions (see `?regex`),
which is used to match the variables to include in the plot. See the help with `?plot_ts` for more details.

An interactive plot can be generated using (first install the package `plotly`):
```{r, eval=FALSE}
plotly_ts(D, patterns=c("heatload$","^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36))
```
```{r, warning=FALSE, message=FALSE, echo=FALSE, purl=FALSE, output="hide"}
L <- plotly_ts(D, patterns=c("heatload$","^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36), plotit=FALSE)
subplot(L, shareX=TRUE, nrows=length(L), titleY = TRUE)
```
Note that the `patterns` argument is a vector of regular expressions, which
determines which variables from `D` to plot.


### Scatter plots

When modelling with the objective of forecasting, it's always a good start to
have a look at scatter plots between the model inputs and the model output. For
example the heatload vs. ambient temperature 8-step forecast:
```{r, fig.width=2*fhs, fig.height=fhs, out.width=ows2}
par(mfrow=c(1,2))
plot(D$Ta$k8, D$heatload)
plot(lag(D$Ta$k8, 8), D$heatload)
```
So lagging (thus aligning in time) makes less slightly less scatter.

A wrapper for the `pairs` function is provided for a `data.list`, which can
generate very useful explorative plots:
```{r, fig.height=figwidth}
pairs(D, nms=c("heatload","Ta.obs","Ta","t"), kseq=c(1,8,24))
```
Note how the sequence of included horizons are specified in the `kseq` argument,
and note that the forecasts are lagged to be aligned in time. See `?pairs.data.list` for more details.

Just as a quick side note: This is the principle used for fitting onlineforecast
models, simply shift forecasts to align with the observations:
```{r, fig.width=fhs, fig.height=fhs, out.width=ows}
## Lag the 8-step forecasts to be aligned with the observations
x <- lag(D$I$k8, 8)
## Take a smaller range
x <- x[i]
## Take the observations
y <- D$I.obs[i]
## Fit a linear regression model
fit <- lm(y ~ x)
## Plot the result
plot(x, y, xlab="8-step forecasts (W/m²)", ylab="Obsservations (W/m²)", main="Global radiation")
abline(fit)
```

Seen over time the 8-step forecasts are:
```{r}
plot(D$t[i], predict.lm(fit, newdata=data.frame(x)), type="l", ylim=c(0,max(y)), xlab="Time", ylab="Global radiation (W/m^2)", col=2)
lines(D$t[i], y)
legend("topright", c("8-step forecasts lagged","Observations"), lty=1, col=2:1)
```

Of course that model was very simple, see how to make a better model in
[building-heat-load-forecasting] and more information on the [website].


## Subset

Taking a subset of a `data.list` is very useful and it can easily be done in
different ways using the `subset` function (i.e. it's really the
`subset.data.list` function called when:
```{r}
## Take the 1 to 4 values of each variable in D
Dsub <- subset(D, 1:4)
summary(Dsub)
```

Another useful function for taking data in a time range is:
```{r}
which(in_range("2010-12-20",D$t,"2010-12-21"))
```
always check the help of function for more details (i.e. `?in_range`)

Actually, it's easy to take subset from a period by:
```{r}
Dsub <- subset(D, c("2010-12-20","2010-12-21"))
summary(Dsub)
Dsub$t
```


## Data.list to data.frame (or data.table)

It can be really useful to bring the data.list on a format of a `data.frame` or
equivalently `data.table` for processing. 

Bringing to `data.frame` can easily be done by:
```{r}
Df <- as.data.frame(Dsub)
names(Df)
```
So the forecasts are just bind with the time and observations, and `.kxx` is
added to the column names.

It can be converted to a `data.table` by:
```{r}
library(data.table)
setDT(Df)
class(Df)
```

After processing it is easily converted back to the `data.list` again by:
```{r}
## Set back to data.frame
setDF(Df)
## Convert to a data.list
Dsub2 <- as.data.list(Df)
## Compare it with the original Dsub
summary(Dsub2)
summary(Dsub)
```


## Literature