Newer
Older
---
title: "Setup of data for an onlineforecast model"
author: "Peder Bacher"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
toc: true
toc_debth: 3
vignette: >
%\VignetteIndexEntry{Setup of data for an onlineforecast model}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r external-code, cache=FALSE, include=FALSE, purl = FALSE}
# Have to load the knitr to use hooks
library(knitr)
## This vignettes name
vignettename <- "setup-data"
# REMEMBER: IF CHANGING IN THE shared-init (next block), then copy to the others!
<!--shared-init-start-->
# Width will scale all
figwidth <- 12
# Scale the wide figures (100% out.width)
figheight <- 4
# Heights for stacked time series plots
figheight1 <- 5
figheight2 <- 6.5
figheight3 <- 8
figheight4 <- 9.5
figheight5 <- 11
# Set the size of squared figures (same height as full: figheight/figwidth)
owsval <- 0.35
ows <- paste0(owsval*100,"%")
ows2 <- paste0(2*owsval*100,"%")
#
fhs <- figwidth * owsval
# Set for square fig: fig.width=fhs, fig.height=fhs, out.width=ows}
# If two squared the: fig.width=2*fhs, fig.height=fhs, out.width=ows2
# Check this: https://bookdown.org/yihui/rmarkdown-cookbook/chunk-styling.html
# Set the knitr options
knitr::opts_chunk$set(
collapse = TRUE,
prompt = FALSE,
cache = TRUE,
cache.path = paste0("../tmp/vignettes/tmp-",vignettename,"/"),
fig.align="center",
fig.path = paste0("../tmp/vignettes/tmp-",vignettename,"/"),
fig.height = figheight,
fig.width = figwidth,
out.width = "100%"
)
options(digits=3)
# For cropping output and messages
cropfun <- function(x, options, func){
lines <- options$output.lines
## if (is.null(lines)) {
## return(func(x, options)) # pass to default hook
## }
if(!is.null(lines)){
x <- unlist(strsplit(x, "\n"))
i <- grep("##!!",x)
if(length(i) > lines){
# truncate the output, but add ....
#x <- c(x[-i[(lines+1):length(i)]], "```", "## ...output cropped", "```")
x <- x[-i[(lines+1):length(i)]]
x[i[lines]] <- pst(x[i[lines]], "\n\n## ...output cropped")
}
# paste these lines together
x <- paste(c(x, ""), collapse = "\n")
x <- gsub("!!","",x)
func(x, options)
}
hook_chunk <- knit_hooks$get("chunk")
knit_hooks$set(chunk = function(x, options) {
cropfun(x, options, hook_chunk)
})
[building heat load forecasting]: https://onlineforecasting.org/examples/building-heat-load-forecasting.html
[onlineforecasting.org]: https://onlineforecasting.org
<!--shared-init-end-->
This vignette explains how to setup data consisting of observations and
forecasts, such that it can be used for onlineforecast models. A generic
introduction and description is in available in [onlineforecasting]. The code is
available [here](setup-data.R). More information on [onlineforecasting.org].
## Data example
First load the package:
```{r}
library(onlineforecast)
```
In the package different data sets are included. The
`Dbuilding` holds the data used for the example of
heat load forecasting in the building-heat-load-forecasting vignette.
When the package is loaded the data is also loaded, so we can access it
directly. Let's start out by:
```{r}
class(D)
```
Actually, a 'data.list' is simply a 'list', but we made the class 'data.list' in
order to have functions for the particular format of data - the format is explained in this document.
It consists of vectors of time, vectors of observations (model output) and
data.frames of forecasts (model input):
```{r}
names(D)
```
An overview of the content can be generated by:
```{r}
```
where it can be seen that `t` is a time vector, `heatload` is a vector, and `Ta` and `I` are data.frames.
A function giving a summary, including checks of the format of the 'data.list' is:
The 'NA' columns indicate the proportion of NAs. If there is a `ok` in a column,
then the check of the variables format is passed. See the help with
`?summary.data.list` to learn which checks are performed.
### Time
First, lets have a look at `D$t`, which is the vector of time points:
```{r}
class(D$t)
head(D$t)
tail(D$t)
```
Hence, the vector is of the class `POSIXct`. It is not a necessity, `t`
can also simply be a numeric, but for plotting and many operations, its
very useful to use the 'POSIXct' class (see `?POSIXt`).
Rules for the time vector:
- It must be named `t`.
- There must be no gaps or NA values in `t`, since only equidistant time series
can be used in the models (the other variables can have NAs).
- Its best to keep the time zone in `UTC` or `GMT` (not providing any time
zone `tz` can give rise to problems).
Use the basic R functions for handling the time class. Most needed
operations can be done with:
```{r, eval=FALSE}
A helper function is provided with the `ct` function which can be called using `?`, or `?ct`. See example below:
```
Note that for all functions where a time value as a character is given, the time
zone is always "GMT" (or "UTC", but this can result in warnings, but they can be
ignored). For some operations the package `lubridate` can
be very helpful.
### Observations
Note the rules for observations:
- The vectors must have the same length as the time `t` vector.
- Observation as numerical vectors can be used directly as model output (if
observations are to used as model inputs, they must be setup in a data.frame
as explained below in Section [Forecasts]).
In the current data, a time series of hourly heat load observations is included:
```{r}
str(D$heatload)
```
It must have the same length as the time vector:
```{r}
length(D$t)
length(D$heatload)
```
A simple plot can be generated by:
```{r}
plot(D$t, D$heatload, type="l", xlab="Time", ylab="Headload (kW)")
```
The convention used in all examples is that the time points are always
set to the time interval end point, e.g.:
```{r}
D$t[2]
```
The main idea behind setting the time point at the end of the interval is:
Working with values averaged over the time interval, such values are available
at the end of the time interval, not before. Especially, in real-time
applications this is a useful convention.
### Forecasts
As described in [onlineforecasting] the setup of forecasts for
model inputs always follows the same format - as presented in the
following. This is also the format of the forecasts generated by functions in the package. Hence all forecasts must follow this format.
The rules are:
- All values at row `i` are available at the `i`'th value in time `t`.
- All columns must be named with `k` followed by an integer indicating the
horizon in steps (e.g. the column named `k8` hold the 8-step forecasts).
Have a look at the forecasts of the global radiation:
```{r}
head(D$I)
```
At the first time point:
```{r}
D$t[1]
```
the available forecast ahead in time is at the first row:
```{r}
# The forecast available ahead in time is in the first row
D$I[1, ]
```
We can plot that by:
```{r}
i <- 1:ncol(D$I)
plot(D$t[i], D$I[1, ], type="l", xlab="Time", ylab="Global radiation forecast (I in W/m²)")
```
So this is the forecast available ahead in time at `r D$t[1]`.
The column in `I` named `k8` holds the 8-step horizon forecasts, which, since
the steps are hourly, is an equi-distant time series. Picking out the
entire series can be done by `D$I$k8` - hence a plot (together with the
observations) can be generated by:
```{r}
i <- 200:296
plot(D$t[i], D$I$k8[i], type="l", col=2, xlab="Time", ylab="Global radiation (W/m²)")
legend("topright", c("8-step forecasts","Observations"), bg="white", lty=1, col=2:1)
```
Notice how the are not aligned, since the forecasts are 8 hours ahead. To align
them the forecasts must be lagged 8 steps by:
```{r}
plot(D$t[i], lagvec(D$I$k8[i], 8), type="l", col=2, xlab="Time", ylab="Global radiation (W/m²)")
legend("topright", c("8-step forecasts lagged","Observations"), bg="white", lty=1, col=2:1)
```
## Plotting
A few simple plotting functions are included in the package.
### Time series plots
The plot function provided with the package actually does this lagging with plotting forecasts:
```{r}
plot_ts(D, patterns=c("^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36))
```
The argument `patterns` is vector of a regular expressions (see `?regex`),
which is used to match the variables to include in the plot. See the help with `?plot_ts` for more details.
An interactive plot can be generated using (first install the package `plotly`):
```{r, eval=FALSE}
plotly_ts(D, patterns=c("heatload$","^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36))
```
```{r, warning=FALSE, message=FALSE, echo=FALSE, purl=FALSE, output="hide", eval=FALSE}
L <- plotly_ts(D, patterns=c("heatload$","^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36), plotit=FALSE)
plotly::subplot(L, shareX=TRUE, nrows=length(L), titleY = TRUE)
```
Note that the `patterns` argument is a vector of regular expressions, which
determines which variables from `D` to plot.
### Scatter plots
When modelling with the objective of forecasting, it's always a good start to
have a look at scatter plots between the model inputs and the model output. For
example the heatload vs. ambient temperature 8-step forecast:
```{r, fig.width=2*fhs, fig.height=fhs, out.width=ows2}
par(mfrow=c(1,2))
plot(D$Ta$k8, D$heatload)
plot(lagvec(D$Ta$k8, 8), D$heatload)
```
So lagging (thus aligning in time) makes less slightly less scatter.
A wrapper for the `pairs` function is provided for a `data.list`, which can
generate very useful explorative plots:
```{r, fig.height=figwidth}
pairs(D, nms=c("heatload","Taobs","Ta","t"), kseq=c(1,8,24))
```
Note how the sequence of included horizons are specified in the `kseq` argument,
and note that the forecasts are lagged to be aligned in time. See `?pairs.data.list` for more details.
Just as a quick side note: This is the principle used for fitting onlineforecast
models, simply shift forecasts to align with the observations:
```{r, fig.width=fhs, fig.height=fhs, out.width=ows}
# Lag the 8-step forecasts to be aligned with the observations
x <- lagvec(D$I$k8, 8)
plot(x, y, xlab="8-step forecasts (W/m²)", ylab="Obsservations (W/m²)", main="Global radiation")
abline(fit)
```
Seen over time the 8-step forecasts are:
```{r}
plot(D$t[i], predict.lm(fit, newdata=data.frame(x)), type="l", ylim=c(0,max(y)), xlab="Time", ylab="Global radiation (W/m^2)", col=2)
lines(D$t[i], y)
legend("topright", c("8-step forecasts lagged","Observations"), lty=1, col=2:1)
```
Of course that model was very simple, see how to make a better model in
[building-heat-load-forecasting] and more information on the [website].
## Subset
Taking a subset of a `data.list` is very useful and it can easily be done in
different ways using the `subset` function (i.e. it's really the
`subset.data.list` function called when:
```{r}
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
Dsub <- subset(D, 1:4)
summary(Dsub)
```
Another useful function for taking data in a time range is:
```{r}
which(in_range("2010-12-20",D$t,"2010-12-21"))
```
always check the help of function for more details (i.e. `?in_range`)
Actually, it's easy to take subset from a period by:
```{r}
Dsub <- subset(D, c("2010-12-20","2010-12-21"))
summary(Dsub)
Dsub$t
```
## Data.list to data.frame (or data.table)
It can be really useful to bring the data.list on a format of a `data.frame` or
equivalently `data.table` for processing.
Bringing to `data.frame` can easily be done by:
```{r}
Df <- as.data.frame(Dsub)
names(Df)
```
So the forecasts are just bind with the time and observations, and `.kxx` is
added to the column names.
It can be converted to a `data.table` by:
```{r}
library(data.table)
setDT(Df)
class(Df)
```
After processing it is easily converted back to the `data.list` again by:
```{r}