Skip to content
Snippets Groups Projects
setup-data.Rmd 12.6 KiB
Newer Older
  • Learn to ignore specific revisions
  • pbac's avatar
    pbac committed
    ---
    title: "Setup of data for an onlineforecast model"
    author: "Peder Bacher"
    date: "`r Sys.Date()`"
    output:
      rmarkdown::html_vignette:
        toc: true
        toc_debth: 3
    vignette: >
      %\VignetteIndexEntry{Setup of data for an onlineforecast model}
      %\VignetteEngine{knitr::rmarkdown}
      %\VignetteEncoding{UTF-8}
    ---
    
    
    ```{r external-code, cache=FALSE, include=FALSE, purl = FALSE}
    # Have to load the knitr to use hooks
    library(knitr)
    ## This vignettes name
    vignettename <- "setup-data"
    
    pbac's avatar
    pbac committed
    # REMEMBER: IF CHANGING IN THE shared-init (next block), then copy to the others!
    
    pbac's avatar
    pbac committed
    ```
    
    pbac's avatar
    pbac committed
    ```{r init, cache=FALSE, include=FALSE, purl=FALSE}
    
    # Width will scale all
    figwidth <- 12
    # Scale the wide figures (100% out.width)
    figheight <- 4
    # Heights for stacked time series plots
    figheight1 <- 5
    figheight2 <- 6.5
    figheight3 <- 8
    figheight4 <- 9.5
    figheight5 <- 11
    # Set the size of squared figures (same height as full: figheight/figwidth)
    owsval <- 0.35
    ows <- paste0(owsval*100,"%")
    ows2 <- paste0(2*owsval*100,"%")
    # 
    fhs <- figwidth * owsval
    
    # Set for square fig: fig.width=fhs, fig.height=fhs, out.width=ows}
    # If two squared the:  fig.width=2*fhs, fig.height=fhs, out.width=ows2
    
    # Check this: https://bookdown.org/yihui/rmarkdown-cookbook/chunk-styling.html
    # Set the knitr options
    knitr::opts_chunk$set(
      collapse = TRUE,
    
    pbac's avatar
    pbac committed
      comment = "##!!    ",
    
    pbac's avatar
    pbac committed
      cache.path = paste0("../tmp/vignettes/tmp-",vignettename,"/"),
    
    pbac's avatar
    pbac committed
      fig.path = paste0("../tmp/vignettes/tmp-",vignettename,"/"),
    
      fig.height = figheight,
      fig.width = figwidth,
      out.width = "100%"
    )
    options(digits=3)
    
    
    pbac's avatar
    pbac committed
    
    # For cropping output and messages
    cropfun <- function(x, options, func){
    
    pbac's avatar
    pbac committed
      ## if (is.null(lines)) {
      ##   return(func(x, options))  # pass to default hook
      ## }
      if(!is.null(lines)){
          x <- unlist(strsplit(x, "\n"))
          i <- grep("##!!",x)
          if(length(i) > lines){
              # truncate the output, but add ....
              #x <- c(x[-i[(lines+1):length(i)]], "```", "## ...output cropped", "```")
              x <- x[-i[(lines+1):length(i)]]
              x[i[lines]] <- pst(x[i[lines]], "\n\n## ...output cropped")
          }
          # paste these lines together
          x <- paste(c(x, ""), collapse = "\n")
    
    pbac's avatar
    pbac committed
      x <- gsub("!!","",x)
      func(x, options)
    }
    
    hook_chunk <- knit_hooks$get("chunk")
    
    knit_hooks$set(chunk = function(x, options) {
        cropfun(x, options, hook_chunk)
    
    pbac's avatar
    pbac committed
    [onlineforecasting]: https://arxiv.org/abs/2109.12915
    
    [building heat load forecasting]: https://onlineforecasting.org/examples/building-heat-load-forecasting.html
    [onlineforecasting.org]: https://onlineforecasting.org
    <!--shared-init-end-->
    
    pbac's avatar
    pbac committed
    
    
    ## Intro
    
    pbac's avatar
    pbac committed
    This vignette explains how to setup data consisting of observations and
    forecasts, such that it can be used for onlineforecast models. A generic
    introduction and description is in available in [onlineforecasting]. The code is
    available [here](setup-data.R). More information on [onlineforecasting.org].
    
    pbac's avatar
    pbac committed
    
    
    ## Data example
    
    First load the package:
    ```{r}
    
    pbac's avatar
    pbac committed
    # Load the package
    
    pbac's avatar
    pbac committed
    library(onlineforecast)
    ```
    
    In the package different data sets are included. The
    
    pbac's avatar
    pbac committed
    `Dbuilding` holds the data used for the example of
    
    pbac's avatar
    pbac committed
    heat load forecasting in the building-heat-load-forecasting vignette.
    
    When the package is loaded the data is also loaded, so we can access it
    directly. Let's start out by:
    ```{r}
    
    pbac's avatar
    pbac committed
    # Keep it in D to simplify notation
    
    pbac's avatar
    pbac committed
    D <- Dbuilding
    
    pbac's avatar
    pbac committed
    ```
    
    The class is 'data.ĺist':
    ```{r}
    
    pbac's avatar
    pbac committed
    # The class of D
    
    pbac's avatar
    pbac committed
    class(D)
    ```
    
    Actually, a 'data.list' is simply a 'list', but we made the class 'data.list' in
    order to have functions for the particular format of data - the format is explained in this document.
    
    It consists of vectors of time, vectors of observations (model output) and
    data.frames of forecasts (model input):
    ```{r}
    
    pbac's avatar
    pbac committed
    # Print the names to see the variables in the data
    
    pbac's avatar
    pbac committed
    names(D)
    ```
    
    An overview of the content can be generated by:
    ```{r}
    
    pbac's avatar
    pbac committed
    summary.default(D)
    
    pbac's avatar
    pbac committed
    ```
    where it can be seen that `t` is a time vector, `heatload` is a vector, and `Ta` and `I` are data.frames.
    
    
    pbac's avatar
    pbac committed
    A function giving a summary, including checks of the format of the 'data.list' is:
    
    pbac's avatar
    pbac committed
    ```{r}
    
    pbac's avatar
    pbac committed
    summary(D)
    
    pbac's avatar
    pbac committed
    ```
    
    pbac's avatar
    pbac committed
    The 'NA' columns indicate the proportion of NAs. If there is a `ok` in a column,
    then the check of the variables format is passed. See the help with
    `?summary.data.list` to learn which checks are performed.
    
    pbac's avatar
    pbac committed
    
    
    ### Time
    
    First, lets have a look at `D$t`, which is the vector of time points:
    ```{r}
    
    pbac's avatar
    pbac committed
    # The time
    
    pbac's avatar
    pbac committed
    class(D$t)
    head(D$t)
    tail(D$t)
    ```
    Hence, the vector is of the class `POSIXct`. It is not a necessity, `t`
    can also simply be a numeric, but for plotting and many operations, its
    very useful to use the 'POSIXct' class (see `?POSIXt`).
    
    Rules for the time vector:
    
    - It must be named `t`.
    
    - There must be no gaps or NA values in `t`, since only equidistant time series
      can be used in the models (the other variables can have NAs).
    
    - Its best to keep the time zone in `UTC` or `GMT` (not providing any time
      zone `tz` can give rise to problems).
      
    
    Use the basic R functions for handling the time class. Most needed
    operations can be done with:
    
    pbac's avatar
    pbac committed
    ?as.POSIXct
    ?strftime
    ```
    
    
    
    pbac's avatar
    pbac committed
    A helper function is provided with the `ct` function which can be called using `?`, or `?ct`. See example below:
    
    pbac's avatar
    pbac committed
    
    ```{r}
    
    pbac's avatar
    pbac committed
    # Convert from a time stamp (tz="GMT" per default)
    
    ct("2019-01-01 11:00")
    
    pbac's avatar
    pbac committed
    # Convert from unix time
    
    ct(3840928387)
    
    pbac's avatar
    pbac committed
    ```
    Note that for all functions where a time value as a character is given, the time
    zone is always "GMT" (or "UTC", but this can result in warnings, but they can be
    ignored). For some operations the package `lubridate` can
    be very helpful.
    
    
    ### Observations
    
    Note the rules for observations:
    
    
    pbac's avatar
    pbac committed
    - In a `data.list` observations must be vectors.
    
    pbac's avatar
    pbac committed
    
    - The vectors must have the same length as the time `t` vector.
    
    
    pbac's avatar
    pbac committed
    - Observation as numerical vectors can be used directly as model output (if
      observations are to used as model inputs, they must be setup in a data.frame
      as explained below in Section [Forecasts]).
    
    pbac's avatar
    pbac committed
    
    
    In the current data, a time series of hourly heat load observations is included:
    
    ```{r}
    str(D$heatload)
    ```
    
    It must have the same length as the time vector:
    ```{r}
    
    pbac's avatar
    pbac committed
    # Same length as time
    
    pbac's avatar
    pbac committed
    length(D$t)
    length(D$heatload)
    ```
    
    A simple plot can be generated by:
    ```{r}
    plot(D$t, D$heatload, type="l", xlab="Time", ylab="Headload (kW)")
    ```
    
    The convention used in all examples is that the time points are always
    set to the time interval end point, e.g.:
    ```{r}
    
    pbac's avatar
    pbac committed
    # The observation
    
    pbac's avatar
    pbac committed
    D$heatload[2]
    
    pbac's avatar
    pbac committed
    # Represents the average load between
    
    pbac's avatar
    pbac committed
    D$t[1]
    
    pbac's avatar
    pbac committed
    # and
    
    pbac's avatar
    pbac committed
    D$t[2]
    ```
    The main idea behind setting the time point at the end of the interval is:
    Working with values averaged over the time interval, such values are available
    at the end of the time interval, not before. Especially, in real-time
    applications this is a useful convention.
    
    
    ### Forecasts
    
    
    As described in [onlineforecasting] the setup of forecasts for
    
    pbac's avatar
    pbac committed
    model inputs always follows the same format - as presented in the
    following. This is also the format of the forecasts generated by functions in the package. Hence all forecasts must follow this format.
    
    The rules are:
    
    - All values at row `i` are available at the `i`'th value in time `t`.
    
    - All columns must be named with `k` followed by an integer indicating the
      horizon in steps (e.g. the column named `k8` hold the 8-step forecasts).
    
    
    Have a look at the forecasts of the global radiation:
    ```{r}
    
    pbac's avatar
    pbac committed
    # Global radiation forecasts
    
    pbac's avatar
    pbac committed
    head(D$I)
    ```
    
    At the first time point:
    ```{r}
    
    pbac's avatar
    pbac committed
    # First time point
    
    pbac's avatar
    pbac committed
    D$t[1]
    ```
    the available forecast ahead in time is at the first row:
    ```{r}
    
    pbac's avatar
    pbac committed
    # The forecast available ahead in time is in the first row
    
    pbac's avatar
    pbac committed
    D$I[1, ]
    ```
    
    We can plot that by:
    ```{r}
    i <- 1:ncol(D$I)
    plot(D$t[i], D$I[1, ], type="l", xlab="Time", ylab="Global radiation forecast (I in W/m²)")
    ```
    So this is the forecast available ahead in time at `r D$t[1]`.
    
    The column in `I` named `k8` holds the 8-step horizon forecasts, which, since
    the steps are hourly, is an equi-distant time series. Picking out the
    entire series can be done by `D$I$k8` - hence a plot (together with the
    observations) can be generated by:
    ```{r}
    
    pbac's avatar
    pbac committed
    # Just pick some points by
    
    pbac's avatar
    pbac committed
    i <- 200:296
    plot(D$t[i], D$I$k8[i], type="l", col=2, xlab="Time", ylab="Global radiation (W/m²)")
    
    pbac's avatar
    pbac committed
    # Add the observations
    
    lines(D$t[i], D$Iobs[i])
    
    pbac's avatar
    pbac committed
    legend("topright", c("8-step forecasts","Observations"), bg="white", lty=1, col=2:1)
    ```
    
    Notice how the are not aligned, since the forecasts are 8 hours ahead. To align
    them the forecasts must be lagged 8 steps by:
    ```{r}
    
    plot(D$t[i], lagvec(D$I$k8[i], 8), type="l", col=2, xlab="Time", ylab="Global radiation (W/m²)")
    
    lines(D$t[i], D$Iobs[i])
    
    pbac's avatar
    pbac committed
    legend("topright", c("8-step forecasts lagged","Observations"), bg="white", lty=1, col=2:1)
    ```
    
    
    
    ## Plotting
    
    A few simple plotting functions are included in the package.
    
    
    ### Time series plots
    
    The plot function provided with the package actually does this lagging with plotting forecasts:
    ```{r}
    plot_ts(D, patterns=c("^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36))
    ```
    
    The argument `patterns` is vector of a regular expressions (see `?regex`),
    which is used to match the variables to include in the plot. See the help with `?plot_ts` for more details.
    
    An interactive plot can be generated using (first install the package `plotly`):
    ```{r, eval=FALSE}
    plotly_ts(D, patterns=c("heatload$","^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36))
    ```
    
    ```{r, warning=FALSE, message=FALSE, echo=FALSE, purl=FALSE, output="hide", eval=FALSE}
    
    pbac's avatar
    pbac committed
    L <- plotly_ts(D, patterns=c("heatload$","^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36), plotit=FALSE)
    
    pbac's avatar
    pbac committed
    plotly::subplot(L, shareX=TRUE, nrows=length(L), titleY = TRUE)
    
    pbac's avatar
    pbac committed
    ```
    Note that the `patterns` argument is a vector of regular expressions, which
    determines which variables from `D` to plot.
    
    
    
    ### Scatter plots
    
    When modelling with the objective of forecasting, it's always a good start to
    have a look at scatter plots between the model inputs and the model output. For
    example the heatload vs. ambient temperature 8-step forecast:
    ```{r, fig.width=2*fhs, fig.height=fhs, out.width=ows2}
    par(mfrow=c(1,2))
    plot(D$Ta$k8, D$heatload)
    
    plot(lagvec(D$Ta$k8, 8), D$heatload)
    
    pbac's avatar
    pbac committed
    ```
    So lagging (thus aligning in time) makes less slightly less scatter.
    
    A wrapper for the `pairs` function is provided for a `data.list`, which can
    generate very useful explorative plots:
    ```{r, fig.height=figwidth}
    
    pairs(D, nms=c("heatload","Taobs","Ta","t"), kseq=c(1,8,24))
    
    pbac's avatar
    pbac committed
    ```
    Note how the sequence of included horizons are specified in the `kseq` argument,
    and note that the forecasts are lagged to be aligned in time. See `?pairs.data.list` for more details.
    
    Just as a quick side note: This is the principle used for fitting onlineforecast
    models, simply shift forecasts to align with the observations:
    ```{r, fig.width=fhs, fig.height=fhs, out.width=ows}
    
    pbac's avatar
    pbac committed
    # Lag the 8-step forecasts to be aligned with the observations
    
    pbac's avatar
    pbac committed
    # Take a smaller range
    
    pbac's avatar
    pbac committed
    x <- x[i]
    
    pbac's avatar
    pbac committed
    # Take the observations
    
    y <- D$Iobs[i]
    
    pbac's avatar
    pbac committed
    # Fit a linear regression model
    
    pbac's avatar
    pbac committed
    fit <- lm(y ~ x)
    
    pbac's avatar
    pbac committed
    # Plot the result
    
    pbac's avatar
    pbac committed
    plot(x, y, xlab="8-step forecasts (W/m²)", ylab="Obsservations (W/m²)", main="Global radiation")
    abline(fit)
    ```
    
    Seen over time the 8-step forecasts are:
    ```{r}
    plot(D$t[i], predict.lm(fit, newdata=data.frame(x)), type="l", ylim=c(0,max(y)), xlab="Time", ylab="Global radiation (W/m^2)", col=2)
    lines(D$t[i], y)
    legend("topright", c("8-step forecasts lagged","Observations"), lty=1, col=2:1)
    ```
    
    
    pbac's avatar
    pbac committed
    Of course that model was very simple, see how to make a better model in
    [building-heat-load-forecasting] and more information on the [website].
    
    pbac's avatar
    pbac committed
    
    
    
    ## Subset
    
    Taking a subset of a `data.list` is very useful and it can easily be done in
    different ways using the `subset` function (i.e. it's really the
    `subset.data.list` function called when:
    ```{r}
    
    pbac's avatar
    pbac committed
    # Take the 1 to 4 values of each variable in D
    
    pbac's avatar
    pbac committed
    Dsub <- subset(D, 1:4)
    summary(Dsub)
    ```
    
    Another useful function for taking data in a time range is:
    ```{r}
    which(in_range("2010-12-20",D$t,"2010-12-21"))
    ```
    always check the help of function for more details (i.e. `?in_range`)
    
    Actually, it's easy to take subset from a period by:
    ```{r}
    Dsub <- subset(D, c("2010-12-20","2010-12-21"))
    summary(Dsub)
    Dsub$t
    ```
    
    
    ## Data.list to data.frame (or data.table)
    
    It can be really useful to bring the data.list on a format of a `data.frame` or
    equivalently `data.table` for processing. 
    
    Bringing to `data.frame` can easily be done by:
    ```{r}
    Df <- as.data.frame(Dsub)
    names(Df)
    ```
    So the forecasts are just bind with the time and observations, and `.kxx` is
    added to the column names.
    
    It can be converted to a `data.table` by:
    ```{r}
    library(data.table)
    setDT(Df)
    class(Df)
    ```
    
    After processing it is easily converted back to the `data.list` again by:
    ```{r}
    
    pbac's avatar
    pbac committed
    # Set back to data.frame
    
    pbac's avatar
    pbac committed
    setDF(Df)
    
    pbac's avatar
    pbac committed
    # Convert to a data.list
    
    pbac's avatar
    pbac committed
    Dsub2 <- as.data.list(Df)
    
    pbac's avatar
    pbac committed
    # Compare it with the original Dsub
    
    pbac's avatar
    pbac committed
    summary(Dsub2)
    summary(Dsub)
    ```