Skip to content
Snippets Groups Projects
setup-data.Rmd 10.7 KiB
Newer Older
  • Learn to ignore specific revisions
  • pbac's avatar
    pbac committed
    ---
    title: "Setup of data for an onlineforecast model"
    author: "Peder Bacher"
    date: "`r Sys.Date()`"
    output:
      rmarkdown::html_vignette:
        toc: true
        toc_debth: 3
    vignette: >
      %\VignetteIndexEntry{Setup of data for an onlineforecast model}
      %\VignetteEngine{knitr::rmarkdown}
      %\VignetteEncoding{UTF-8}
    bibliography: literature.bib
    ---
    
    
    ```{r external-code, cache=FALSE, include=FALSE, purl = FALSE}
    # Have to load the knitr to use hooks
    library(knitr)
    ## This vignettes name
    vignettename <- "setup-data"
    ## Read external code
    
    pbac's avatar
    pbac committed
    knitr::read_chunk("shared-init.R")
    
    pbac's avatar
    pbac committed
    ```
    ```{r init, cache=FALSE, include=FALSE, purl=FALSE}
    ```
    
    pbac's avatar
    pbac committed
    ```{r links, cache=FALSE, echo=FALSE, purl=FALSE, results="asis"}
    ```
    
    
    pbac's avatar
    pbac committed
    
    
    ## Intro
    
    pbac's avatar
    pbac committed
    This vignette explains how to setup data consisting of observations and
    forecasts, such that it can be used for onlineforecast models. A generic
    introduction and description is in available in [onlineforecasting]. The code is
    available [here](setup-data.R). More information on [onlineforecasting.org].
    
    pbac's avatar
    pbac committed
    
    
    ## Data example
    
    First load the package:
    ```{r}
    ## Load the package
    library(onlineforecast)
    ```
    
    In the package different data sets are included. The
    
    pbac's avatar
    pbac committed
    `Dbuilding` holds the data used for the example of
    
    pbac's avatar
    pbac committed
    heat load forecasting in the building-heat-load-forecasting vignette.
    
    When the package is loaded the data is also loaded, so we can access it
    directly. Let's start out by:
    ```{r}
    ## Keep it in D to simplify notation
    
    pbac's avatar
    pbac committed
    D <- Dbuilding
    
    pbac's avatar
    pbac committed
    ```
    
    The class is 'data.ĺist':
    ```{r}
    ## The class of D
    class(D)
    ```
    
    Actually, a 'data.list' is simply a 'list', but we made the class 'data.list' in
    order to have functions for the particular format of data - the format is explained in this document.
    
    It consists of vectors of time, vectors of observations (model output) and
    data.frames of forecasts (model input):
    ```{r}
    ## Print the names to see the variables in the data
    names(D)
    ```
    
    An overview of the content can be generated by:
    ```{r}
    summary(D)
    ```
    where it can be seen that `t` is a time vector, `heatload` is a vector, and `Ta` and `I` are data.frames.
    
    A function for carrying out a check of the format of the 'data.list' is:
    ```{r}
    check(D)
    ```
    Basically, if there is a `V` in the `ok` column, then the format of this
    variable in `D` is correct. See the help with `?check.data.list` to learn what the printed output means.
    
    
    ### Time
    
    First, lets have a look at `D$t`, which is the vector of time points:
    ```{r}
    ## The time
    class(D$t)
    head(D$t)
    tail(D$t)
    ```
    Hence, the vector is of the class `POSIXct`. It is not a necessity, `t`
    can also simply be a numeric, but for plotting and many operations, its
    very useful to use the 'POSIXct' class (see `?POSIXt`).
    
    Rules for the time vector:
    
    - It must be named `t`.
    
    - There must be no gaps or NA values in `t`, since only equidistant time series
      can be used in the models (the other variables can have NAs).
    
    - Its best to keep the time zone in `UTC` or `GMT` (not providing any time
      zone `tz` can give rise to problems).
      
    
    Use the basic R functions for handling the time class. Most needed
    operations can be done with:
    ```{r}
    ?as.POSIXct
    ?strftime
    ```
    
    
    A helper function is provided with the `asp` function which can be called using `?`, or `?asp`. See example below:
    
    ```{r}
    ## Convert from a time stamp (tz="GMT" per default)
    asct("2019-01-01 11:00")
    ## Convert from unix time
    asct(3840928387)
    ```
    Note that for all functions where a time value as a character is given, the time
    zone is always "GMT" (or "UTC", but this can result in warnings, but they can be
    ignored). For some operations the package `lubridate` can
    be very helpful.
    
    
    ### Observations
    
    Note the rules for observations:
    
    - In a `data.list` observations must be numeric vectors.
    
    - The vectors must have the same length as the time `t` vector.
    
    - Observation as vectors can be used directly as model output (if observations
      are to used as model inputs, they must be setup in a data.frame as explained
      below in Section [Forecasts]).
    
    
    In the current data, a time series of hourly heat load observations is included:
    
    ```{r}
    str(D$heatload)
    ```
    
    It must have the same length as the time vector:
    ```{r}
    ## Same length as time
    length(D$t)
    length(D$heatload)
    ```
    
    A simple plot can be generated by:
    ```{r}
    plot(D$t, D$heatload, type="l", xlab="Time", ylab="Headload (kW)")
    ```
    
    The convention used in all examples is that the time points are always
    set to the time interval end point, e.g.:
    ```{r}
    ## The observation
    D$heatload[2]
    ## Represents the average load between
    D$t[1]
    ## and
    D$t[2]
    ```
    The main idea behind setting the time point at the end of the interval is:
    Working with values averaged over the time interval, such values are available
    at the end of the time interval, not before. Especially, in real-time
    applications this is a useful convention.
    
    
    ### Forecasts
    
    As described in [onlineforecast](onlineforecast.pdf) the setup of forecasts for
    model inputs always follows the same format - as presented in the
    following. This is also the format of the forecasts generated by functions in the package. Hence all forecasts must follow this format.
    
    The rules are:
    
    - All values at row `i` are available at the `i`'th value in time `t`.
    
    - All columns must be named with `k` followed by an integer indicating the
      horizon in steps (e.g. the column named `k8` hold the 8-step forecasts).
    
    
    Have a look at the forecasts of the global radiation:
    ```{r}
    ## Global radiation forecasts
    head(D$I)
    ```
    
    At the first time point:
    ```{r}
    ## First time point
    D$t[1]
    ```
    the available forecast ahead in time is at the first row:
    ```{r}
    ## The forecast available ahead in time is in the first row
    D$I[1, ]
    ```
    
    We can plot that by:
    ```{r}
    i <- 1:ncol(D$I)
    plot(D$t[i], D$I[1, ], type="l", xlab="Time", ylab="Global radiation forecast (I in W/m²)")
    ```
    So this is the forecast available ahead in time at `r D$t[1]`.
    
    The column in `I` named `k8` holds the 8-step horizon forecasts, which, since
    the steps are hourly, is an equi-distant time series. Picking out the
    entire series can be done by `D$I$k8` - hence a plot (together with the
    observations) can be generated by:
    ```{r}
    ## Just pick some points by
    i <- 200:296
    plot(D$t[i], D$I$k8[i], type="l", col=2, xlab="Time", ylab="Global radiation (W/m²)")
    ## Add the observations
    lines(D$t[i], D$I.obs[i])
    legend("topright", c("8-step forecasts","Observations"), bg="white", lty=1, col=2:1)
    ```
    
    Notice how the are not aligned, since the forecasts are 8 hours ahead. To align
    them the forecasts must be lagged 8 steps by:
    ```{r}
    plot(D$t[i], lag(D$I$k8[i], 8), type="l", col=2, xlab="Time", ylab="Global radiation (W/m²)")
    lines(D$t[i], D$I.obs[i])
    legend("topright", c("8-step forecasts lagged","Observations"), bg="white", lty=1, col=2:1)
    ```
    
    
    
    ## Plotting
    
    A few simple plotting functions are included in the package.
    
    
    ### Time series plots
    
    The plot function provided with the package actually does this lagging with plotting forecasts:
    ```{r}
    plot_ts(D, patterns=c("^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36))
    ```
    
    The argument `patterns` is vector of a regular expressions (see `?regex`),
    which is used to match the variables to include in the plot. See the help with `?plot_ts` for more details.
    
    An interactive plot can be generated using (first install the package `plotly`):
    ```{r, eval=FALSE}
    plotly_ts(D, patterns=c("heatload$","^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36))
    ```
    ```{r, warning=FALSE, message=FALSE, echo=FALSE, purl=FALSE, output="hide"}
    L <- plotly_ts(D, patterns=c("heatload$","^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36), plotit=FALSE)
    subplot(L, shareX=TRUE, nrows=length(L), titleY = TRUE)
    ```
    Note that the `patterns` argument is a vector of regular expressions, which
    determines which variables from `D` to plot.
    
    
    
    ### Scatter plots
    
    When modelling with the objective of forecasting, it's always a good start to
    have a look at scatter plots between the model inputs and the model output. For
    example the heatload vs. ambient temperature 8-step forecast:
    ```{r, fig.width=2*fhs, fig.height=fhs, out.width=ows2}
    par(mfrow=c(1,2))
    plot(D$Ta$k8, D$heatload)
    plot(lag(D$Ta$k8, 8), D$heatload)
    ```
    So lagging (thus aligning in time) makes less slightly less scatter.
    
    A wrapper for the `pairs` function is provided for a `data.list`, which can
    generate very useful explorative plots:
    ```{r, fig.height=figwidth}
    pairs(D, nms=c("heatload","Ta.obs","Ta","t"), kseq=c(1,8,24))
    ```
    Note how the sequence of included horizons are specified in the `kseq` argument,
    and note that the forecasts are lagged to be aligned in time. See `?pairs.data.list` for more details.
    
    Just as a quick side note: This is the principle used for fitting onlineforecast
    models, simply shift forecasts to align with the observations:
    ```{r, fig.width=fhs, fig.height=fhs, out.width=ows}
    ## Lag the 8-step forecasts to be aligned with the observations
    x <- lag(D$I$k8, 8)
    ## Take a smaller range
    x <- x[i]
    ## Take the observations
    y <- D$I.obs[i]
    ## Fit a linear regression model
    fit <- lm(y ~ x)
    ## Plot the result
    plot(x, y, xlab="8-step forecasts (W/m²)", ylab="Obsservations (W/m²)", main="Global radiation")
    abline(fit)
    ```
    
    Seen over time the 8-step forecasts are:
    ```{r}
    plot(D$t[i], predict.lm(fit, newdata=data.frame(x)), type="l", ylim=c(0,max(y)), xlab="Time", ylab="Global radiation (W/m^2)", col=2)
    lines(D$t[i], y)
    legend("topright", c("8-step forecasts lagged","Observations"), lty=1, col=2:1)
    ```
    
    
    pbac's avatar
    pbac committed
    Of course that model was very simple, see how to make a better model in
    [building-heat-load-forecasting] and more information on the [website].
    
    pbac's avatar
    pbac committed
    
    
    
    ## Subset
    
    Taking a subset of a `data.list` is very useful and it can easily be done in
    different ways using the `subset` function (i.e. it's really the
    `subset.data.list` function called when:
    ```{r}
    ## Take the 1 to 4 values of each variable in D
    Dsub <- subset(D, 1:4)
    summary(Dsub)
    ```
    
    Another useful function for taking data in a time range is:
    ```{r}
    which(in_range("2010-12-20",D$t,"2010-12-21"))
    ```
    always check the help of function for more details (i.e. `?in_range`)
    
    Actually, it's easy to take subset from a period by:
    ```{r}
    Dsub <- subset(D, c("2010-12-20","2010-12-21"))
    summary(Dsub)
    Dsub$t
    ```
    
    
    ## Data.list to data.frame (or data.table)
    
    It can be really useful to bring the data.list on a format of a `data.frame` or
    equivalently `data.table` for processing. 
    
    Bringing to `data.frame` can easily be done by:
    ```{r}
    Df <- as.data.frame(Dsub)
    names(Df)
    ```
    So the forecasts are just bind with the time and observations, and `.kxx` is
    added to the column names.
    
    It can be converted to a `data.table` by:
    ```{r}
    library(data.table)
    setDT(Df)
    class(Df)
    ```
    
    After processing it is easily converted back to the `data.list` again by:
    ```{r}
    ## Set back to data.frame
    setDF(Df)
    ## Convert to a data.list
    Dsub2 <- as.data.list(Df)
    ## Compare it with the original Dsub
    summary(Dsub2)
    summary(Dsub)
    ```
    
    
    ## Literature