setup-data.Rmd

---
title: "Setup of data for an onlineforecast model"
author: "Peder Bacher"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_debth: 3
vignette: >
  %\VignetteIndexEntry{Setup of data for an onlineforecast model}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r external-code, cache=FALSE, include=FALSE, purl = FALSE}
# Have to load the knitr to use hooks
library(knitr)
## This vignettes name
vignettename <- "setup-data"
# REMEMBER: IF CHANGING IN THE shared-init (next block), then copy to the others!
```

<!--shared-init-start-->
```{r init, cache=FALSE, include=FALSE, purl=FALSE}
# Width will scale all
figwidth <- 12
# Scale the wide figures (100% out.width)
figheight <- 4
# Heights for stacked time series plots
figheight1 <- 5
figheight2 <- 6.5
figheight3 <- 8
figheight4 <- 9.5
figheight5 <- 11
# Set the size of squared figures (same height as full: figheight/figwidth)
owsval <- 0.35
ows <- paste0(owsval*100,"%")
ows2 <- paste0(2*owsval*100,"%")
# 
fhs <- figwidth * owsval

# Set for square fig: fig.width=fhs, fig.height=fhs, out.width=ows}
# If two squared the:  fig.width=2*fhs, fig.height=fhs, out.width=ows2

# Check this: https://bookdown.org/yihui/rmarkdown-cookbook/chunk-styling.html
# Set the knitr options
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##!!    ",
  prompt = FALSE,
  cache = TRUE,
  cache.path = paste0("../tmp/vignettes/tmp-",vignettename,"/"),
  fig.align="center",
  fig.path = paste0("../tmp/vignettes/tmp-",vignettename,"/"),
  fig.height = figheight,
  fig.width = figwidth,
  out.width = "100%"
)
options(digits=3)


# For cropping output and messages
cropfun <- function(x, options, func){
  lines <- options$output.lines
  ## if (is.null(lines)) {
  ##   return(func(x, options))  # pass to default hook
  ## }
  if(!is.null(lines)){
      x <- unlist(strsplit(x, "\n"))
      i <- grep("##!!",x)
      if(length(i) > lines){
          # truncate the output, but add ....
          #x <- c(x[-i[(lines+1):length(i)]], "```", "## ...output cropped", "```")
          x <- x[-i[(lines+1):length(i)]]
          x[i[lines]] <- pst(x[i[lines]], "\n\n## ...output cropped")
      }
      # paste these lines together
      x <- paste(c(x, ""), collapse = "\n")
  }
  x <- gsub("!!","",x)
  func(x, options)
}

hook_chunk <- knit_hooks$get("chunk")

knit_hooks$set(chunk = function(x, options) {
    cropfun(x, options, hook_chunk)
})

```

[onlineforecasting]: https://arxiv.org/abs/2109.12915
[building heat load forecasting]: https://onlineforecasting.org/examples/building-heat-load-forecasting.html
[onlineforecasting.org]: https://onlineforecasting.org
<!--shared-init-end-->


## Intro
This vignette explains how to setup data consisting of observations and
forecasts, such that it can be used for onlineforecast models. A generic
introduction and description is in available in [onlineforecasting]. The code is
available [here](setup-data.R). More information on [onlineforecasting.org].


## Data example

First load the package:
```{r}
# Load the package
library(onlineforecast)
```

In the package different data sets are included. The
`Dbuilding` holds the data used for the example of
heat load forecasting in the building-heat-load-forecasting vignette.

When the package is loaded the data is also loaded, so we can access it
directly. Let's start out by:
```{r}
# Keep it in D to simplify notation
D <- Dbuilding
```

The class is 'data.ĺist':
```{r}
# The class of D
class(D)
```

Actually, a 'data.list' is simply a 'list', but we made the class 'data.list' in
order to have functions for the particular format of data - the format is explained in this document.

It consists of vectors of time, vectors of observations (model output) and
data.frames of forecasts (model input):
```{r}
# Print the names to see the variables in the data
names(D)
```

An overview of the content can be generated by:
```{r}
summary.default(D)
```
where it can be seen that `t` is a time vector, `heatload` is a vector, and `Ta` and `I` are data.frames.

A function giving a summary, including checks of the format of the 'data.list' is:
```{r}
summary(D)
```
The 'NA' columns indicate the proportion of NAs. If there is a `ok` in a column,
then the check of the variables format is passed. See the help with
`?summary.data.list` to learn which checks are performed.


### Time

First, lets have a look at `D$t`, which is the vector of time points:
```{r}
# The time
class(D$t)
head(D$t)
tail(D$t)
```
Hence, the vector is of the class `POSIXct`. It is not a necessity, `t`
can also simply be a numeric, but for plotting and many operations, its
very useful to use the 'POSIXct' class (see `?POSIXt`).

Rules for the time vector:

- It must be named `t`.

- There must be no gaps or NA values in `t`, since only equidistant time series
  can be used in the models (the other variables can have NAs).

- Its best to keep the time zone in `UTC` or `GMT` (not providing any time
  zone `tz` can give rise to problems).
  

Use the basic R functions for handling the time class. Most needed
operations can be done with:
```{r, eval=FALSE}
?as.POSIXct
?strftime
```


A helper function is provided with the `ct` function which can be called using `?`, or `?ct`. See example below:

```{r}
# Convert from a time stamp (tz="GMT" per default)
ct("2019-01-01 11:00")
# Convert from unix time
ct(3840928387)
```
Note that for all functions where a time value as a character is given, the time
zone is always "GMT" (or "UTC", but this can result in warnings, but they can be
ignored). For some operations the package `lubridate` can
be very helpful.


### Observations

Note the rules for observations:

- In a `data.list` observations must be vectors.

- The vectors must have the same length as the time `t` vector.

- Observation as numerical vectors can be used directly as model output (if
  observations are to used as model inputs, they must be setup in a data.frame
  as explained below in Section [Forecasts]).


In the current data, a time series of hourly heat load observations is included:

```{r}
str(D$heatload)
```

It must have the same length as the time vector:
```{r}
# Same length as time
length(D$t)
length(D$heatload)
```

A simple plot can be generated by:
```{r}
plot(D$t, D$heatload, type="l", xlab="Time", ylab="Headload (kW)")
```

The convention used in all examples is that the time points are always
set to the time interval end point, e.g.:
```{r}
# The observation
D$heatload[2]
# Represents the average load between
D$t[1]
# and
D$t[2]
```
The main idea behind setting the time point at the end of the interval is:
Working with values averaged over the time interval, such values are available
at the end of the time interval, not before. Especially, in real-time
applications this is a useful convention.


### Forecasts

As described in [onlineforecasting] the setup of forecasts for
model inputs always follows the same format - as presented in the
following. This is also the format of the forecasts generated by functions in the package. Hence all forecasts must follow this format.

The rules are:

- All values at row `i` are available at the `i`'th value in time `t`.

- All columns must be named with `k` followed by an integer indicating the
  horizon in steps (e.g. the column named `k8` hold the 8-step forecasts).


Have a look at the forecasts of the global radiation:
```{r}
# Global radiation forecasts
head(D$I)
```

At the first time point:
```{r}
# First time point
D$t[1]
```
the available forecast ahead in time is at the first row:
```{r}
# The forecast available ahead in time is in the first row
D$I[1, ]
```

We can plot that by:
```{r}
i <- 1:ncol(D$I)
plot(D$t[i], D$I[1, ], type="l", xlab="Time", ylab="Global radiation forecast (I in W/m²)")
```
So this is the forecast available ahead in time at `r D$t[1]`.

The column in `I` named `k8` holds the 8-step horizon forecasts, which, since
the steps are hourly, is an equi-distant time series. Picking out the
entire series can be done by `D$I$k8` - hence a plot (together with the
observations) can be generated by:
```{r}
# Just pick some points by
i <- 200:296
plot(D$t[i], D$I$k8[i], type="l", col=2, xlab="Time", ylab="Global radiation (W/m²)")
# Add the observations
lines(D$t[i], D$Iobs[i])
legend("topright", c("8-step forecasts","Observations"), bg="white", lty=1, col=2:1)
```

Notice how the are not aligned, since the forecasts are 8 hours ahead. To align
them the forecasts must be lagged 8 steps by:
```{r}
plot(D$t[i], lagvec(D$I$k8[i], 8), type="l", col=2, xlab="Time", ylab="Global radiation (W/m²)")
lines(D$t[i], D$Iobs[i])
legend("topright", c("8-step forecasts lagged","Observations"), bg="white", lty=1, col=2:1)
```


## Plotting

A few simple plotting functions are included in the package.


### Time series plots

The plot function provided with the package actually does this lagging with plotting forecasts:
```{r}
plot_ts(D, patterns=c("^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36))
```

The argument `patterns` is vector of a regular expressions (see `?regex`),
which is used to match the variables to include in the plot. See the help with `?plot_ts` for more details.

An interactive plot can be generated using (first install the package `plotly`):
```{r, eval=FALSE}
plotly_ts(D, patterns=c("heatload$","^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36))
```
```{r, warning=FALSE, message=FALSE, echo=FALSE, purl=FALSE, output="hide", eval=FALSE}
L <- plotly_ts(D, patterns=c("heatload$","^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36), plotit=FALSE)
plotly::subplot(L, shareX=TRUE, nrows=length(L), titleY = TRUE)
```
Note that the `patterns` argument is a vector of regular expressions, which
determines which variables from `D` to plot.


### Scatter plots

When modelling with the objective of forecasting, it's always a good start to
have a look at scatter plots between the model inputs and the model output. For
example the heatload vs. ambient temperature 8-step forecast:
```{r, fig.width=2*fhs, fig.height=fhs, out.width=ows2}
par(mfrow=c(1,2))
plot(D$Ta$k8, D$heatload)
plot(lagvec(D$Ta$k8, 8), D$heatload)
```
So lagging (thus aligning in time) makes less slightly less scatter.

A wrapper for the `pairs` function is provided for a `data.list`, which can
generate very useful explorative plots:
```{r, fig.height=figwidth}
pairs(D, nms=c("heatload","Taobs","Ta","t"), kseq=c(1,8,24))
```
Note how the sequence of included horizons are specified in the `kseq` argument,
and note that the forecasts are lagged to be aligned in time. See `?pairs.data.list` for more details.

Just as a quick side note: This is the principle used for fitting onlineforecast
models, simply shift forecasts to align with the observations:
```{r, fig.width=fhs, fig.height=fhs, out.width=ows}
# Lag the 8-step forecasts to be aligned with the observations
x <- lagvec(D$I$k8, 8)
# Take a smaller range
x <- x[i]
# Take the observations
y <- D$Iobs[i]
# Fit a linear regression model
fit <- lm(y ~ x)
# Plot the result
plot(x, y, xlab="8-step forecasts (W/m²)", ylab="Obsservations (W/m²)", main="Global radiation")
abline(fit)
```

Seen over time the 8-step forecasts are:
```{r}
plot(D$t[i], predict.lm(fit, newdata=data.frame(x)), type="l", ylim=c(0,max(y)), xlab="Time", ylab="Global radiation (W/m^2)", col=2)
lines(D$t[i], y)
legend("topright", c("8-step forecasts lagged","Observations"), lty=1, col=2:1)
```

Of course that model was very simple, see how to make a better model in
[building-heat-load-forecasting] and more information on the [website].


## Subset

Taking a subset of a `data.list` is very useful and it can easily be done in
different ways using the `subset` function (i.e. it's really the
`subset.data.list` function called when:
```{r}
# Take the 1 to 4 values of each variable in D
Dsub <- subset(D, 1:4)
summary(Dsub)
```

Another useful function for taking data in a time range is:
```{r}
which(in_range("2010-12-20",D$t,"2010-12-21"))
```
always check the help of function for more details (i.e. `?in_range`)

Actually, it's easy to take subset from a period by:
```{r}
Dsub <- subset(D, c("2010-12-20","2010-12-21"))
summary(Dsub)
Dsub$t
```


## Data.list to data.frame (or data.table)

It can be really useful to bring the data.list on a format of a `data.frame` or
equivalently `data.table` for processing. 

Bringing to `data.frame` can easily be done by:
```{r}
Df <- as.data.frame(Dsub)
names(Df)
```
So the forecasts are just bind with the time and observations, and `.kxx` is
added to the column names.

It can be converted to a `data.table` by:
```{r}
library(data.table)
setDT(Df)
class(Df)
```

After processing it is easily converted back to the `data.list` again by:
```{r}
# Set back to data.frame
setDF(Df)
# Convert to a data.list
Dsub2 <- as.data.list(Df)
# Compare it with the original Dsub
summary(Dsub2)
summary(Dsub)
```