Newer
Older
---
title: "Setup of data for an onlineforecast model"
author: "Peder Bacher"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
toc: true
toc_debth: 3
vignette: >
%\VignetteIndexEntry{Setup of data for an onlineforecast model}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
bibliography: literature.bib
---
```{r external-code, cache=FALSE, include=FALSE, purl = FALSE}
# Have to load the knitr to use hooks
library(knitr)
## This vignettes name
vignettename <- "setup-data"
## Read external code
```
```{r init, cache=FALSE, include=FALSE, purl=FALSE}
```
```{r links, cache=FALSE, echo=FALSE, purl=FALSE, results="asis"}
```
This vignette explains how to setup data consisting of observations and
forecasts, such that it can be used for onlineforecast models. A generic
introduction and description is in available in [onlineforecasting]. The code is
available [here](setup-data.R). More information on [onlineforecasting.org].
## Data example
First load the package:
```{r}
## Load the package
library(onlineforecast)
```
In the package different data sets are included. The
`Dbuilding` holds the data used for the example of
heat load forecasting in the building-heat-load-forecasting vignette.
When the package is loaded the data is also loaded, so we can access it
directly. Let's start out by:
```{r}
## Keep it in D to simplify notation
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
```
The class is 'data.ĺist':
```{r}
## The class of D
class(D)
```
Actually, a 'data.list' is simply a 'list', but we made the class 'data.list' in
order to have functions for the particular format of data - the format is explained in this document.
It consists of vectors of time, vectors of observations (model output) and
data.frames of forecasts (model input):
```{r}
## Print the names to see the variables in the data
names(D)
```
An overview of the content can be generated by:
```{r}
summary(D)
```
where it can be seen that `t` is a time vector, `heatload` is a vector, and `Ta` and `I` are data.frames.
A function for carrying out a check of the format of the 'data.list' is:
```{r}
check(D)
```
Basically, if there is a `V` in the `ok` column, then the format of this
variable in `D` is correct. See the help with `?check.data.list` to learn what the printed output means.
### Time
First, lets have a look at `D$t`, which is the vector of time points:
```{r}
## The time
class(D$t)
head(D$t)
tail(D$t)
```
Hence, the vector is of the class `POSIXct`. It is not a necessity, `t`
can also simply be a numeric, but for plotting and many operations, its
very useful to use the 'POSIXct' class (see `?POSIXt`).
Rules for the time vector:
- It must be named `t`.
- There must be no gaps or NA values in `t`, since only equidistant time series
can be used in the models (the other variables can have NAs).
- Its best to keep the time zone in `UTC` or `GMT` (not providing any time
zone `tz` can give rise to problems).
Use the basic R functions for handling the time class. Most needed
operations can be done with:
```{r}
?as.POSIXct
?strftime
```
A helper function is provided with the `asp` function which can be called using `?`, or `?asp`. See example below:
```{r}
## Convert from a time stamp (tz="GMT" per default)
asct("2019-01-01 11:00")
## Convert from unix time
asct(3840928387)
```
Note that for all functions where a time value as a character is given, the time
zone is always "GMT" (or "UTC", but this can result in warnings, but they can be
ignored). For some operations the package `lubridate` can
be very helpful.
### Observations
Note the rules for observations:
- In a `data.list` observations must be numeric vectors.
- The vectors must have the same length as the time `t` vector.
- Observation as vectors can be used directly as model output (if observations
are to used as model inputs, they must be setup in a data.frame as explained
below in Section [Forecasts]).
In the current data, a time series of hourly heat load observations is included:
```{r}
str(D$heatload)
```
It must have the same length as the time vector:
```{r}
## Same length as time
length(D$t)
length(D$heatload)
```
A simple plot can be generated by:
```{r}
plot(D$t, D$heatload, type="l", xlab="Time", ylab="Headload (kW)")
```
The convention used in all examples is that the time points are always
set to the time interval end point, e.g.:
```{r}
## The observation
D$heatload[2]
## Represents the average load between
D$t[1]
## and
D$t[2]
```
The main idea behind setting the time point at the end of the interval is:
Working with values averaged over the time interval, such values are available
at the end of the time interval, not before. Especially, in real-time
applications this is a useful convention.
### Forecasts
As described in [onlineforecast](onlineforecast.pdf) the setup of forecasts for
model inputs always follows the same format - as presented in the
following. This is also the format of the forecasts generated by functions in the package. Hence all forecasts must follow this format.
The rules are:
- All values at row `i` are available at the `i`'th value in time `t`.
- All columns must be named with `k` followed by an integer indicating the
horizon in steps (e.g. the column named `k8` hold the 8-step forecasts).
Have a look at the forecasts of the global radiation:
```{r}
## Global radiation forecasts
head(D$I)
```
At the first time point:
```{r}
## First time point
D$t[1]
```
the available forecast ahead in time is at the first row:
```{r}
## The forecast available ahead in time is in the first row
D$I[1, ]
```
We can plot that by:
```{r}
i <- 1:ncol(D$I)
plot(D$t[i], D$I[1, ], type="l", xlab="Time", ylab="Global radiation forecast (I in W/m²)")
```
So this is the forecast available ahead in time at `r D$t[1]`.
The column in `I` named `k8` holds the 8-step horizon forecasts, which, since
the steps are hourly, is an equi-distant time series. Picking out the
entire series can be done by `D$I$k8` - hence a plot (together with the
observations) can be generated by:
```{r}
## Just pick some points by
i <- 200:296
plot(D$t[i], D$I$k8[i], type="l", col=2, xlab="Time", ylab="Global radiation (W/m²)")
## Add the observations
lines(D$t[i], D$I.obs[i])
legend("topright", c("8-step forecasts","Observations"), bg="white", lty=1, col=2:1)
```
Notice how the are not aligned, since the forecasts are 8 hours ahead. To align
them the forecasts must be lagged 8 steps by:
```{r}
plot(D$t[i], lag(D$I$k8[i], 8), type="l", col=2, xlab="Time", ylab="Global radiation (W/m²)")
lines(D$t[i], D$I.obs[i])
legend("topright", c("8-step forecasts lagged","Observations"), bg="white", lty=1, col=2:1)
```
## Plotting
A few simple plotting functions are included in the package.
### Time series plots
The plot function provided with the package actually does this lagging with plotting forecasts:
```{r}
plot_ts(D, patterns=c("^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36))
```
The argument `patterns` is vector of a regular expressions (see `?regex`),
which is used to match the variables to include in the plot. See the help with `?plot_ts` for more details.
An interactive plot can be generated using (first install the package `plotly`):
```{r, eval=FALSE}
plotly_ts(D, patterns=c("heatload$","^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36))
```
```{r, warning=FALSE, message=FALSE, echo=FALSE, purl=FALSE, output="hide"}
L <- plotly_ts(D, patterns=c("heatload$","^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36), plotit=FALSE)
subplot(L, shareX=TRUE, nrows=length(L), titleY = TRUE)
```
Note that the `patterns` argument is a vector of regular expressions, which
determines which variables from `D` to plot.
### Scatter plots
When modelling with the objective of forecasting, it's always a good start to
have a look at scatter plots between the model inputs and the model output. For
example the heatload vs. ambient temperature 8-step forecast:
```{r, fig.width=2*fhs, fig.height=fhs, out.width=ows2}
par(mfrow=c(1,2))
plot(D$Ta$k8, D$heatload)
plot(lag(D$Ta$k8, 8), D$heatload)
```
So lagging (thus aligning in time) makes less slightly less scatter.
A wrapper for the `pairs` function is provided for a `data.list`, which can
generate very useful explorative plots:
```{r, fig.height=figwidth}
pairs(D, nms=c("heatload","Ta.obs","Ta","t"), kseq=c(1,8,24))
```
Note how the sequence of included horizons are specified in the `kseq` argument,
and note that the forecasts are lagged to be aligned in time. See `?pairs.data.list` for more details.
Just as a quick side note: This is the principle used for fitting onlineforecast
models, simply shift forecasts to align with the observations:
```{r, fig.width=fhs, fig.height=fhs, out.width=ows}
## Lag the 8-step forecasts to be aligned with the observations
x <- lag(D$I$k8, 8)
## Take a smaller range
x <- x[i]
## Take the observations
y <- D$I.obs[i]
## Fit a linear regression model
fit <- lm(y ~ x)
## Plot the result
plot(x, y, xlab="8-step forecasts (W/m²)", ylab="Obsservations (W/m²)", main="Global radiation")
abline(fit)
```
Seen over time the 8-step forecasts are:
```{r}
plot(D$t[i], predict.lm(fit, newdata=data.frame(x)), type="l", ylim=c(0,max(y)), xlab="Time", ylab="Global radiation (W/m^2)", col=2)
lines(D$t[i], y)
legend("topright", c("8-step forecasts lagged","Observations"), lty=1, col=2:1)
```
Of course that model was very simple, see how to make a better model in
[building-heat-load-forecasting] and more information on the [website].
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
## Subset
Taking a subset of a `data.list` is very useful and it can easily be done in
different ways using the `subset` function (i.e. it's really the
`subset.data.list` function called when:
```{r}
## Take the 1 to 4 values of each variable in D
Dsub <- subset(D, 1:4)
summary(Dsub)
```
Another useful function for taking data in a time range is:
```{r}
which(in_range("2010-12-20",D$t,"2010-12-21"))
```
always check the help of function for more details (i.e. `?in_range`)
Actually, it's easy to take subset from a period by:
```{r}
Dsub <- subset(D, c("2010-12-20","2010-12-21"))
summary(Dsub)
Dsub$t
```
## Data.list to data.frame (or data.table)
It can be really useful to bring the data.list on a format of a `data.frame` or
equivalently `data.table` for processing.
Bringing to `data.frame` can easily be done by:
```{r}
Df <- as.data.frame(Dsub)
names(Df)
```
So the forecasts are just bind with the time and observations, and `.kxx` is
added to the column names.
It can be converted to a `data.table` by:
```{r}
library(data.table)
setDT(Df)
class(Df)
```
After processing it is easily converted back to the `data.list` again by:
```{r}
## Set back to data.frame
setDF(Df)
## Convert to a data.list
Dsub2 <- as.data.list(Df)
## Compare it with the original Dsub
summary(Dsub2)
summary(Dsub)
```
## Literature