1 Hourly energy demand generation and weather
2 Project description
The dataset contains 4 years of electrical consumption, generation, pricing, and weather data for Spain. Consumption and generation data were retrieved from ENTSO-E, a public portal for Transmission System Operator (TSO) data. Settlement prices were obtained from the Spanish TSO Red Eléctrica de España. Weather data were purchased, as part of a personal project, from the OpenWeather API for the 5 largest cities in Spain. The original datasets can be downloaded from Kaggle; however, the dataset used for the project has already been pre-processed. The code used for preprocessing is in this github repository, while the dataset to be used can be downloaded here.
- date: datetime index localized to CET.
- date_day: date without hour.
- city: reference city (Valencia, Madrid, Bilbao, Barcelona, Seville).
- Year: number for the year.
- Month: number for the month.
- Day: number for the day in a month.
- Hour: reference hour of the day.
Energy variables
- fossil: energy produced from fossil sources, i.e. coal/lignite, natural gas and oil (MW).
- nuclear: energy produced with nuclear sources (MW).
- energy_other: other generation (MW).
- solar: solar generation (MW).
- biomass: biomass generation (MW).
- river: hydroelectric generation (MW).
- wind: wind generation (MW).
- energy_other_renewable: other renewable generation (MW).
- energy_waste: generation from waste (MW).
- pred_solar: forecasted solar generation (MW).
- pred_demand: forecasted electrical demand by TSO (MW).
- pred_price: forecasted day ahead electrical price by TSO (Eur/MWh).
- price: actual electrical price (Eur/MWh).
- demand: actual electrical demand (MW).
- prod_tot: total electrical production (MW).
- prod_renewable: total renewable electrical production (MW).
Weather variables
- temp: mean temperature (degrees).
- temp_min: minimum temperature (degrees).
- temp_max: maximum temperature (degrees).
- pressure: pressure (hPa).
- humidity: humidity (%).
- wind_speed: wind speed (m/s).
- wind_deg: wind direction (degrees).
- rain_1h: rain in last hour (mm).
- rain_3h: rain in last 3 hours (mm).
- snow_3h: snow in last 3 hours (mm).
- clouds_all: cloud cover (%).
Import data
dir_data <- "../grenfin-summer-school/data"
filepath <- file.path(dir_data, "data.csv")
data <- readr::read_csv(filepath, show_col_types = FALSE, progress = FALSE)
head(data) %>%
knitr::kable(booktabs = TRUE, escape = FALSE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")
date | date_day | Year | Month | Day | Hour | Season | fossil | nuclear | energy_other | solar | biomass | river | energy_other_renewable | energy_waste | wind | pred_solar | pred_demand | pred_price | demand | price | prod | prod_renewable | temp | temp_min | temp_max | pressure | humidity | wind_speed | wind_deg | rain_1h | rain_3h | snow_3h | clouds_all |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2014-12-31 23:00:00 | 2014-12-31 | 2014 | 12 | 31 | 23 | Winter | 10156 | 7096 | 43 | 49 | 447 | 3813 | 73 | 196 | 6378 | 17 | 26118 | 50.10 | 25385 | 65.41 | 27939 | 10687 | -0.6585375 | -5.825 | 8.475 | 1016.4 | 82.4 | 2.0 | 135.2 | 0 | 0 | 0 | 0 |
2015-01-01 00:00:00 | 2015-01-01 | 2015 | 1 | 1 | 0 | Winter | 10437 | 7096 | 43 | 50 | 449 | 3587 | 71 | 195 | 5890 | 16 | 24934 | 48.10 | 24382 | 64.92 | 27509 | 9976 | -0.6373000 | -5.825 | 8.475 | 1016.2 | 82.4 | 2.0 | 135.8 | 0 | 0 | 0 | 0 |
2015-01-01 01:00:00 | 2015-01-01 | 2015 | 1 | 1 | 1 | Winter | 9918 | 7099 | 43 | 50 | 448 | 3508 | 73 | 196 | 5461 | 8 | 23515 | 47.33 | 22734 | 64.48 | 26484 | 9467 | -1.0508625 | -6.964 | 8.136 | 1016.8 | 82.0 | 2.4 | 119.0 | 0 | 0 | 0 | 0 |
2015-01-01 02:00:00 | 2015-01-01 | 2015 | 1 | 1 | 2 | Winter | 8859 | 7098 | 43 | 50 | 438 | 3231 | 75 | 191 | 5238 | 2 | 22642 | 42.27 | 21286 | 59.32 | 24914 | 8957 | -1.0605312 | -6.964 | 8.136 | 1016.6 | 82.0 | 2.4 | 119.2 | 0 | 0 | 0 | 0 |
2015-01-01 03:00:00 | 2015-01-01 | 2015 | 1 | 1 | 3 | Winter | 8313 | 7097 | 43 | 42 | 428 | 3499 | 74 | 189 | 4935 | 9 | 21785 | 38.41 | 20264 | 56.04 | 24314 | 8904 | -1.0041000 | -6.964 | 8.136 | 1016.6 | 82.0 | 2.4 | 118.4 | 0 | 0 | 0 | 0 |
2015-01-01 04:00:00 | 2015-01-01 | 2015 | 1 | 1 | 4 | Winter | 7962 | 7098 | 43 | 34 | 410 | 3804 | 74 | 188 | 4618 | 4 | 21441 | 35.72 | 19905 | 53.63 | 23926 | 8866 | -1.1260000 | -7.708 | 7.317 | 1017.4 | 82.6 | 2.4 | 174.8 | 0 | 0 | 0 | 0 |
3 Part A: Descriptive analysis
Consider the data of the energy market in Spain and let’s examine the composition of the energy mix. At your disposal you have 6 sources that generate energy, i.e. 4 renewable (columns biomass, solar, river, wind) and 2 non-renewable (columns nuclear, fossil). The column demand represents the amount of electricity demanded in MW, while the remaining generation, denoted as others, is the sum of the columns energy_other and energy_other_renewable.
3.1 Task A.1
Let’s decompose the total demand across all the sources by defining the share of demand met by source $i$ as the average ratio between its production and the demand, i.e. $s_i = \frac{1}{T}\sum_{t=1}^{T} \text{production}_{i,t} / \text{demand}_t$ (see the code below).
Figure average ratio between production and demand.
var_levels <- c("other", "biomass", "solar", "hydroelettric", "wind", "nuclear", "fossil")
var_labels <- c("other", "Biomass", "Solar", "Hydroelettric", "Wind", "Nuclear", "Fossil")
# Compute the ratio of the demand satisfied with respect to total demand
df_A1 <- data %>%
dplyr::summarise(other = mean((energy_other+energy_other_renewable)/demand),
biomass = mean(biomass/demand),
solar = mean(solar/demand),
hydroelettric = mean(river/demand),
wind = mean(wind/demand),
nuclear = mean(nuclear/demand),
fossil = mean(fossil/demand))%>%
tidyr::gather(key = "source", value = "perc")
# Categorical variable (plots purposes)
df_A1$id <- factor(df_A1$source, levels = var_levels, labels = var_labels)
# Figure
df_A1 %>%
arrange(desc(perc)) %>%
head(n = 7) %>%
ggplot()+
geom_bar(aes(id, perc, fill = source), color = "black", stat = "identity")+
theme_bw()+
custom_theme+
theme(legend.position = "none")+
scale_fill_manual(values = c(fossil = "#65880f", nuclear = "#0477BF", biomass = "brown",
solar = "darkorange", hydroelettric = "blue", wind = "gray", other = "black"))+
scale_y_continuous(breaks = seq(0, 1, 0.05),
labels = paste0(seq(0, 1, 0.05)*100, " %"))+
labs(fill = NULL, x = NULL, y = NULL)
Solution: ranked sources
# Rank the sources in decreasing order
vec_max <- df_A1$perc
names(vec_max) <- var_labels
vec_max_sort <- vec_max[order(vec_max, decreasing = TRUE)]
# Generate the response based on previous computations
response <- ""
for(i in 1:length(vec_max_sort)){
response <- paste0(response, " The ", i, "°", " source with respect to production/demand is ", names(vec_max_sort[i]), " (", round(vec_max_sort[i]*100, 2), "%).")
}
The 1° source with respect to production/demand is Fossil (36.82%). The 2° source with respect to production/demand is Nuclear (22.37%). The 3° source with respect to production/demand is Wind (19.48%). The 4° source with respect to production/demand is Hydroelettric (14.24%). The 5° source with respect to production/demand is Solar (4.73%). The 6° source with respect to production/demand is Biomass (1.37%). The 7° source with respect to production/demand is other (0.52%).
3.2 Task A.2
Aggregate the sources of energy in two groups:
- Renewable: computed as the sum of solar, river, wind and biomass.
- Non-renewable: computed as the sum of nuclear and fossil.
Then, compute the percentage of energy demand satisfied by each group. What is the percentage of demand satisfied by renewable sources? And by the non-renewable ones?
Repeat the computations, but filtering the data first for the year 2015 only and then for 2018 only. Are the percentages obtained different? Describe the result and explain whether the transition to renewable energy is supported by the data (to be supported, the percentage of renewable energy should increase while the percentage of fossil fuels should decrease).
Across the whole sample, the percentage of demand satisfied by renewable sources is 39.82 %, while the percentage satisfied by fossil and nuclear sources is 59.19 %. The fossil-plus-nuclear share is 46.73 % larger than the renewable one.
Solution: transition 2015-2018
# Percentage of demand by source in 2015
df_A2_2015 <- data %>%
filter(Year == 2015) %>%
summarise(other = mean((energy_other+energy_other_renewable)/demand),
biomass = mean(biomass/demand, na.rm = TRUE),
solar = mean(solar/demand, na.rm = TRUE),
hydroelettric = mean(river/demand, na.rm = TRUE),
wind = mean(wind/demand, na.rm = TRUE),
nuclear = mean(nuclear/demand, na.rm = TRUE),
fossil = mean(fossil/demand, na.rm = TRUE))%>%
gather(key = "source", value = "perc")
df_A2_2015 <- filter(df_A2_2015, source != "other")
# Normalize
df_A2_2015$perc <- df_A2_2015$perc/sum(df_A2_2015$perc)
# Categorical variable (plots purposes)
df_A2_2015$id <- factor(df_A2_2015$source, levels = var_levels[-1], labels = var_labels[-1])
# Percentage of demand by source in 2018
df_A2_2018 <- data %>%
filter(Year == 2018) %>%
summarise(other = mean((energy_other+energy_other_renewable)/demand),
biomass = mean(biomass/demand, na.rm = TRUE),
solar = mean(solar/demand, na.rm = TRUE),
hydroelettric = mean(river/demand, na.rm = TRUE),
wind = mean(wind/demand, na.rm = TRUE),
nuclear = mean(nuclear/demand, na.rm = TRUE),
fossil = mean(fossil/demand, na.rm = TRUE))%>%
gather(key = "source", value = "perc")
df_A2_2018 <- filter(df_A2_2018, source != "other")
# Normalize
df_A2_2018$perc <- df_A2_2018$perc/sum(df_A2_2018$perc)
# Categorical variable (plots purposes)
df_A2_2018$id <- factor(df_A2_2018$source, levels = var_levels[-1], labels = var_labels[-1])
# Response
response <- paste0(
"In 2015 the percentage of demand satisfied by renewable sources was ",
round(sum(df_A2_2015$perc[1:4])*100, 3), " %, while in 2018 was ",
round(sum(df_A2_2018$perc[1:4])*100, 3), " %. ",
"Hence, the dominance of renewable sources in this period increased by ",
round((sum(df_A2_2018$perc[1:4])/sum(df_A2_2015$perc[1:4])-1)*100, 3), " %.",
" Instead, in 2015 the percentage of demand satisfied by fossil and nuclear sources was ",
round(sum(df_A2_2015$perc[5:6])*100, 3), " %, while in 2018 was ",
round(sum(df_A2_2018$perc[5:6])*100, 3), " %. ",
"Hence, the dominance of fossil sources in this period decreased by ",
round((sum(df_A2_2018$perc[5:6])/sum(df_A2_2015$perc[5:6])-1)*100, 3),
" %. and the transition seems to be supported by data."
)
In 2015 the percentage of demand satisfied by renewable sources was 39.335 %, while in 2018 was 42.132 %. Hence, the share of renewable sources in this period increased (in relative terms) by 7.111 %. Instead, in 2015 the percentage of demand satisfied by fossil and nuclear sources was 60.665 %, while in 2018 was 57.868 %. Hence, the share of fossil and nuclear sources in this period decreased (in relative terms) by 4.611 %, and the transition seems to be supported by the data.
3.3 Task A.3
For each month, compute the mean percentage of demand satisfied by the fossil sources. Compute also the maximum, minimum, median and standard deviation of the same quantity for nuclear, solar, river and wind. Which are the sources with the lowest and the highest standard deviation in December? And in June? Comment on the results obtained (max 200 words). To carry out the computations you can use the helper function create_monthly_kable() shown below.
create_monthly_kable()
create_monthly_kable <- function(data, target = "fossil"){
data[["target"]] <- data[[target]]
data <- data %>%
mutate(Month = lubridate::month(date_day, label = TRUE)) %>%
group_by(Month) %>%
mutate(target = target/demand) %>%
summarise(min = min(target, na.rm = TRUE),
mean = mean(target, na.rm = TRUE),
median = median(target, na.rm = TRUE),
max = max(target, na.rm = TRUE),
sd = sd(target, na.rm = TRUE))
kab <- data %>%
mutate_if(is.numeric, ~.x*100) %>%
mutate(sd = paste0(format(sd, digits = 4), " %"))%>%
mutate_if(is.numeric, ~paste0(format(.x, digits = 4), " %")) %>%
knitr::kable(booktabs = TRUE, escape = FALSE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")
structure(
list(
data = data,
kab = kab
)
)
}
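The six tables below report these statistics for each source. The helper is presumably called once per column; the object names are assumptions here, chosen to match those used in the solution code further down (a sketch of the folded chunks, not the original calls):
# Assumed calls (object names mirror the solution code below); $kab holds the formatted table
df_fossil <- create_monthly_kable(data, target = "fossil")
df_nuclear <- create_monthly_kable(data, target = "nuclear")
df_wind <- create_monthly_kable(data, target = "wind")
df_hydro <- create_monthly_kable(data, target = "river")
df_solar <- create_monthly_kable(data, target = "solar")
df_biomass <- create_monthly_kable(data, target = "biomass")
df_fossil$kab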
Monthly share of demand met by fossil sources.
Month | min | mean | median | max | sd |
---|---|---|---|---|---|
Jan | 10.607 % | 36.87 % | 38.61 % | 66.58 % | 10.082 % |
Feb | 9.125 % | 32.48 % | 32.21 % | 67.95 % | 11.214 % |
Mar | 13.351 % | 29.39 % | 26.56 % | 69.18 % | 10.438 % |
Apr | 12.396 % | 31.03 % | 28.80 % | 70.62 % | 11.246 % |
May | 9.976 % | 33.35 % | 33.01 % | 76.60 % | 10.764 % |
Jun | 12.730 % | 36.97 % | 37.35 % | 75.26 % | 11.359 % |
Jul | 14.677 % | 39.85 % | 40.64 % | 63.78 % | 9.407 % |
Aug | 13.075 % | 38.00 % | 38.82 % | 60.91 % | 8.791 % |
Sep | 12.927 % | 39.40 % | 40.53 % | 68.04 % | 9.156 % |
Oct | 12.531 % | 41.37 % | 41.69 % | 72.10 % | 10.998 % |
Nov | 12.924 % | 44.10 % | 45.71 % | 78.15 % | 12.626 % |
Dec | 13.891 % | 38.77 % | 39.69 % | 67.38 % | 11.030 % |
Monthly share of demand met by nuclear sources.
Month | min | mean | median | max | sd |
---|---|---|---|---|---|
Jan | 12.271 % | 22.74 % | 21.86 % | 39.22 % | 4.498 % |
Feb | 11.660 % | 22.56 % | 21.77 % | 38.42 % | 4.655 % |
Mar | 13.040 % | 23.46 % | 22.74 % | 36.57 % | 4.504 % |
Apr | 12.624 % | 23.13 % | 22.65 % | 37.64 % | 4.906 % |
May | 4.999 % | 20.40 % | 19.98 % | 37.59 % | 3.990 % |
Jun | 12.569 % | 21.17 % | 20.67 % | 37.76 % | 4.381 % |
Jul | 12.390 % | 21.65 % | 21.09 % | 34.93 % | 3.956 % |
Aug | 12.780 % | 23.61 % | 23.08 % | 34.00 % | 4.113 % |
Sep | 13.929 % | 23.65 % | 22.98 % | 35.29 % | 4.348 % |
Oct | 14.127 % | 23.09 % | 22.54 % | 35.85 % | 4.503 % |
Nov | 0.000 % | 19.82 % | 19.25 % | 33.38 % | 4.343 % |
Dec | 13.425 % | 23.08 % | 22.31 % | 39.37 % | 5.219 % |
Monthly share of demand met by wind generation.
Month | min | mean | median | max | sd |
---|---|---|---|---|---|
Jan | 1.1883 % | 21.80 % | 20.39 % | 62.26 % | 11.609 % |
Feb | 1.5627 % | 22.85 % | 21.24 % | 61.05 % | 13.511 % |
Mar | 1.5243 % | 24.03 % | 23.67 % | 64.02 % | 12.726 % |
Apr | 1.8202 % | 20.14 % | 18.09 % | 64.49 % | 11.326 % |
May | 1.1022 % | 19.88 % | 17.45 % | 57.96 % | 11.555 % |
Jun | 1.0185 % | 18.12 % | 15.68 % | 65.36 % | 10.978 % |
Jul | 0.6918 % | 15.96 % | 14.27 % | 59.08 % | 9.259 % |
Aug | 0.7066 % | 16.75 % | 15.52 % | 57.37 % | 9.134 % |
Sep | 0.9261 % | 16.05 % | 13.60 % | 56.38 % | 10.383 % |
Oct | 0.7795 % | 19.22 % | 16.83 % | 62.57 % | 12.014 % |
Nov | 0.0000 % | 18.93 % | 16.16 % | 69.16 % | 12.136 % |
Dec | 1.6081 % | 20.17 % | 18.12 % | 67.28 % | 12.108 % |
Monthly share of demand met by hydroelectric (river) generation.
Month | min | mean | median | max | sd |
---|---|---|---|---|---|
Jan | 2.983 % | 15.06 % | 13.798 % | 41.58 % | 6.731 % |
Feb | 3.003 % | 17.33 % | 16.062 % | 42.13 % | 7.674 % |
Mar | 2.540 % | 18.69 % | 18.987 % | 45.22 % | 7.715 % |
Apr | 2.928 % | 18.84 % | 16.437 % | 49.09 % | 9.524 % |
May | 2.858 % | 16.92 % | 15.327 % | 50.66 % | 8.092 % |
Jun | 2.796 % | 13.81 % | 13.050 % | 38.07 % | 5.839 % |
Jul | 2.261 % | 11.50 % | 10.293 % | 40.04 % | 5.887 % |
Aug | 3.040 % | 10.94 % | 9.488 % | 43.30 % | 6.013 % |
Sep | 2.635 % | 11.05 % | 9.611 % | 36.30 % | 5.534 % |
Oct | 2.271 % | 11.61 % | 10.175 % | 44.41 % | 6.266 % |
Nov | 0.000 % | 12.28 % | 11.021 % | 43.45 % | 6.441 % |
Dec | 3.519 % | 13.11 % | 11.911 % | 46.06 % | 6.121 % |
Monthly share of demand met by solar generation.
Month | min | mean | median | max | sd |
---|---|---|---|---|---|
Jan | 0.024068 % | 3.523 % | 0.7840 % | 25.38 % | 4.830 % |
Feb | 0.006935 % | 3.966 % | 1.3321 % | 22.70 % | 5.080 % |
Mar | 0.018776 % | 4.517 % | 2.0257 % | 21.22 % | 5.345 % |
Apr | 0.009580 % | 4.986 % | 2.3468 % | 21.88 % | 5.572 % |
May | 0.008510 % | 5.942 % | 3.0158 % | 22.48 % | 5.939 % |
Jun | 0.011501 % | 6.248 % | 3.2706 % | 22.03 % | 5.761 % |
Jul | 0.008478 % | 6.221 % | 2.9958 % | 23.40 % | 5.726 % |
Aug | 0.009806 % | 5.774 % | 2.6587 % | 21.32 % | 5.825 % |
Sep | 0.017305 % | 5.333 % | 2.2467 % | 23.02 % | 5.875 % |
Oct | 0.007038 % | 3.767 % | 1.0721 % | 21.75 % | 4.974 % |
Nov | 0.000000 % | 3.355 % | 0.6537 % | 18.63 % | 4.619 % |
Dec | 0.008983 % | 3.063 % | 0.4059 % | 19.26 % | 4.386 % |
Monthly share of demand met by biomass generation.
Month | min | mean | median | max | sd |
---|---|---|---|---|---|
Jan | 0.7321 % | 1.404 % | 1.383 % | 2.630 % | 0.3501 % |
Feb | 0.6226 % | 1.394 % | 1.380 % | 2.741 % | 0.3070 % |
Mar | 0.4760 % | 1.334 % | 1.306 % | 2.680 % | 0.4059 % |
Apr | 0.6078 % | 1.191 % | 1.147 % | 2.819 % | 0.3495 % |
May | 0.0000 % | 1.394 % | 1.308 % | 2.718 % | 0.3703 % |
Jun | 0.6752 % | 1.344 % | 1.277 % | 2.614 % | 0.3477 % |
Jul | 0.6938 % | 1.365 % | 1.305 % | 2.719 % | 0.3265 % |
Aug | 0.5274 % | 1.447 % | 1.373 % | 2.859 % | 0.3776 % |
Sep | 0.5987 % | 1.424 % | 1.348 % | 2.809 % | 0.3834 % |
Oct | 0.5433 % | 1.404 % | 1.314 % | 2.890 % | 0.4034 % |
Nov | 0.0000 % | 1.372 % | 1.298 % | 2.830 % | 0.3720 % |
Dec | 0.5796 % | 1.339 % | 1.284 % | 2.801 % | 0.3663 % |
Solution: June vs December
sd_june <- c(fossil = filter(df_fossil$data, Month == "Jun")$sd,
nuclear = filter(df_nuclear$data, Month == "Jun")$sd,
wind = filter(df_wind$data, Month == "Jun")$sd,
hydro = filter(df_hydro$data, Month == "Jun")$sd,
solar = filter(df_solar$data, Month == "Jun")$sd)
sd_dec <- c(fossil = filter(df_fossil$data, Month == "Dec")$sd,
nuclear = filter(df_nuclear$data, Month == "Dec")$sd,
wind = filter(df_wind$data, Month == "Dec")$sd,
hydro = filter(df_hydro$data, Month == "Dec")$sd,
solar = filter(df_solar$data, Month == "Dec")$sd)
min_sd_jun <- sd_june[order(sd_june, decreasing = FALSE)][1]
min_sd_dec <- sd_dec[order(sd_dec, decreasing = FALSE)][1]
max_sd_jun <- sd_june[order(sd_june, decreasing = FALSE)][length(sd_june)]
max_sd_dec <- sd_dec[order(sd_dec, decreasing = FALSE)][length(sd_dec)]
Nuclear energy has the lowest standard deviation of the share of demand satisfied in June, with a standard deviation of about 4.4 %; the highest is fossil, with about 11 %.
In December, instead, solar energy has the lowest standard deviation, about 4.4 %, while the highest is wind, with about 12 %.
Figure share of demand met by each source, by month and hour.
df_A3 <- data %>%
mutate(Month = lubridate::month(date_day, label = TRUE)) %>%
group_by(Month, Hour) %>%
summarise(biomass = mean(biomass/demand, na.rm = TRUE),
solar = mean(solar/demand, na.rm = TRUE),
hydroelettric = mean(river/demand, na.rm = TRUE),
wind = mean(wind/demand, na.rm = TRUE),
nuclear = mean(nuclear/demand, na.rm = TRUE),
fossil = mean(fossil/demand, na.rm = TRUE),
demand = mean(demand, na.rm = TRUE))
ggplot(df_A3)+
geom_line(aes(Hour, biomass, color = "biomass"))+
geom_line(aes(Hour, solar, color = "solar"))+
geom_line(aes(Hour, hydroelettric, color = "hydroelettric"))+
geom_line(aes(Hour, wind, color = "wind"))+
geom_line(aes(Hour, nuclear, color = "nuclear"))+
geom_line(aes(Hour, fossil, color = "fossil"))+
facet_wrap(~Month)+
theme_bw()+
custom_theme+
scale_color_manual(values = c(fossil = "#65880f", nuclear = "#0477BF", biomass = "brown",
solar = "darkorange", hydroelettric = "blue", wind = "gray"))+
scale_y_continuous(breaks = seq(0, 1, 0.1),
labels = paste0(seq(0, 1, 0.1)*100, " %"))+
labs(color = NULL, y = "Production/Demand")
3.4 Task A.4
On average, is the share of demand satisfied by solar energy at 12:00 in March greater or smaller than the share satisfied by hydroelectric energy? Does the situation change at the same hour in July, August and September? Answer, providing a possible explanation of the results obtained (max 200 words).
Code
df_A4 <- bind_rows(
filter(df_A3, Month == "Mar" & Hour == 12),
filter(df_A3, Month == "Jul" & Hour == 12),
filter(df_A3, Month == "Aug" & Hour == 12),
filter(df_A3, Month == "Sep" & Hour == 12)
)
df_A4 %>%
mutate_if(is.numeric, ~.x*100) %>%
mutate_if(is.numeric, ~paste0(format(.x, digits = 3), " %")) %>%
# demand and Hour were also scaled by 100 above: undo the scaling before relabeling
mutate(demand = paste0(format(as.numeric(str_remove_all(demand, "\\%"))/100, digits = 6), " MW"),
Hour = paste0(as.numeric(str_remove_all(Hour, "\\%"))/100, ":00")) %>%
knitr::kable(booktabs = TRUE, escape = FALSE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")
Month | Hour | biomass | solar | hydroelettric | wind | nuclear | fossil | demand |
---|---|---|---|---|---|---|---|---|
Mar | 12:00 | 1.19 % | 11.5 % | 16.7 % | 21.7 % | 20.9 % | 26.9 % | 31792.2 MW |
Jul | 12:00 | 1.19 % | 13.8 % | 11.8 % | 12.4 % | 19.1 % | 37.8 % | 33544.8 MW |
Aug | 12:00 | 1.28 % | 13.8 % | 10.3 % | 13.6 % | 20.9 % | 35.8 % | 31917.6 MW |
Sep | 12:00 | 1.28 % | 13.7 % | 9.6 % | 13.4 % | 21.3 % | 36.9 % | 31075.2 MW |
3.5 Task A.5
Grouping the data by season and hour, compute:
- Average electricity price.
- Average ratio between total production and total demand, i.e. prod/demand.
For both quantities, answer the following questions.
What is the meaning of a ratio between total production and total demand greater than 1? And smaller than 1? Following the law of demand and supply, what is the expected behavior of the price in such situations? Comment on the results (max 200 words).
When does the price reach its minimum value? In such a situation, does the ratio reach its maximum value? What happens when the price reaches its maximum? Comment on the results (max 200 words).
The minimum electricity price by season and hour is reached, on average, at 3:00 in all seasons. Moreover, in all seasons the ratio between production and demand is above 100 % between 00:00 and 3:00, denoting a general excess of production with respect to demand. Interestingly, when this ratio goes above 100 % again between 14:00 and 16:00 (except in summer), the price, on average, reaches a relative minimum for the day.
On the contrary, the maximum electricity price by season and hour is reached, on average, at 19:00, except for autumn (18:00). As expected, in the same period the production ratio reaches a minimum that, for all seasons, is also the global minimum for the day.
The monthly plots highlight that the drop in the electricity price between 14:00 and 16:00, and the subsequent increase during the afternoon, become smoother going from May to September.
Let’s compute the average by Season, Month, Weekday and Hour of the total production divided by the total demand.
Code
data %>%
mutate(Weekday = weekdays(date)) %>%
mutate(prod = prod/demand) %>%
group_by(Weekday, Hour) %>%
summarise(prod = mean(prod, na.rm = TRUE)) %>%
ggplot()+
geom_line(aes(Hour, prod), color = "black")+
geom_line(aes(Hour, 1), color = "red", linetype = "dashed")+
facet_wrap(~Weekday)+
theme_bw()+
custom_theme+
scale_x_continuous(breaks = seq(0, 24, 4))+
scale_y_continuous(breaks = seq(0, 2, 0.05),
labels = paste0(seq(0, 2, 0.05)*100, " %"))+
labs(color = NULL, y = "Production/Demand")
Code
data %>%
mutate(Month = lubridate::month(date_day, label = TRUE)) %>%
group_by(Month, Hour) %>%
summarise(ratio = mean(prod/demand, na.rm = TRUE)) %>%
ggplot()+
geom_line(aes(Hour, ratio), color = "black")+
geom_line(aes(Hour, 1), color = "red", linetype = "dashed")+
facet_wrap(~Month)+
scale_x_continuous(breaks = seq(0, 24, 4))+
scale_y_continuous(breaks = seq(0, 2, 0.04),
labels = paste0(seq(0, 2, 0.04)*100, " %"))+
theme_bw()+
custom_theme+
labs(color = NULL, y = "Production/Demand")
Code
data %>%
mutate(prod = prod/demand) %>%
group_by(Season, Hour) %>%
summarise(prod = mean(prod, na.rm = TRUE)) %>%
ggplot()+
geom_line(aes(Hour, prod), color = "black")+
geom_line(aes(Hour, 1), color = "red", linetype = "dashed")+
facet_wrap(~Season)+
theme_bw()+
custom_theme+
scale_x_continuous(breaks = seq(0, 24, 2))+
scale_y_continuous(breaks = seq(0, 2, 0.05),
labels = paste0(seq(0, 2, 0.05)*100, " %"))+
labs(color = NULL, y = "Production/Demand")
Let’s now compute the average by Season, Month, Weekday and Hour of the electricity price.
Code
data %>%
mutate(Month = lubridate::month(date_day, label = TRUE)) %>%
mutate(Weekday = weekdays(date)) %>%
group_by(Weekday, Hour) %>%
summarise(price = mean(price, na.rm = TRUE)) %>%
ggplot()+
geom_line(aes(Hour, price), color = "black")+
facet_wrap(~Weekday)+
theme_bw()+
custom_theme+
scale_x_continuous(breaks = seq(0, 24, 4))+
scale_y_continuous(breaks = seq(0, 100, 5),
labels = seq(0, 100, 5))+
labs(color = NULL, y = "Price (Eur/MWh)")
Code
data %>%
mutate(Month = lubridate::month(date_day, label = TRUE)) %>%
group_by(Month, Hour) %>%
summarise(price = mean(price, na.rm = TRUE)) %>%
ggplot()+
geom_line(aes(Hour, price), color = "black")+
facet_wrap(~Month)+
theme_bw()+
custom_theme+
scale_x_continuous(breaks = seq(0, 24, 4))+
scale_y_continuous(breaks = seq(0, 100, 5),
labels = seq(0, 100, 5))+
labs(color = NULL, y = "Price (Eur/MWh)")
Code
data %>%
group_by(Season, Hour) %>%
summarise(price = mean(price, na.rm = TRUE)) %>%
ggplot()+
geom_line(aes(Hour, price))+
geom_point(aes(Hour, price))+
facet_wrap(~Season)+
theme_bw()+
custom_theme+
scale_x_continuous(breaks = seq(0, 24, 2))+
scale_y_continuous(breaks = seq(0, 100, 10))+
labs(color = NULL, y = "Price (Eur/MWh)")
4 Part B: Model daily electricity price
4.1 Task B.1
Let’s aggregate the data on a daily basis by computing the daily average electricity prices.
Daily Aggregation
# Daily prices
data_prices <- data %>%
mutate(date = date_day) %>%
group_by(date, Year, Month, Day) %>%
summarise(price = mean(price), Season = unique(Season)) %>%
ungroup() %>%
mutate(log_price = log(price),
Weekday = weekdays(date))
# Prepare the seasonal variables
# Day of the week
levels_weekday <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
data_prices$Weekday_ <- factor(data_prices$Weekday, levels = levels_weekday, labels = levels_weekday)
# Months
levels_month <- c("January", "February", "March", "April",
"May", "June", "July", "August",
"September", "October", "November", "December")
data_prices$Month_ <- factor(data_prices$Month, levels = 1:12, labels = levels_month)
# Seasons
levels_season <- c("Winter", "Spring", "Summer", "Autumn")
data_prices$Season_ <- factor(data_prices$Season, levels = levels_season, labels = levels_season)
4.1.1 Task B.1 (a)
Let’s consider only positive prices, so that the logarithm is well defined, and decompose the log-price into a seasonal component and a stochastic component, $\log P_t = \bar{P}_t + \tilde{P}_t$, where the seasonal mean $\bar{P}_t$ is a linear function of month, weekday and season dummy variables, estimated by OLS.
Fit the seasonal model
# Seasonal model
seasonal_model <- lm(log_price ~ Month_ + Weekday_ + Season_, data = data_prices)
# Predicted seasonal mean
data_prices$log_price_bar <- predict(seasonal_model, newdata = data_prices)
# Fitted residuals
data_prices$log_price_tilde <- data_prices$log_price - data_prices$log_price_bar
Table statistics.
Table estimates.
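The two tables are presumably produced with broom, following the pattern used for the AR and ARMA models below (a sketch of the folded chunks, not the original code):
# Sketch: goodness-of-fit statistics of the seasonal regression
broom::glance(seasonal_model) %>%
dplyr::mutate_if(is.numeric, format, digits = 4, scientific = FALSE) %>%
knitr::kable(booktabs = TRUE, escape = FALSE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")
# Sketch: coefficient estimates of the seasonal regression
broom::tidy(seasonal_model) %>%
dplyr::mutate_if(is.numeric, round, digits = 4) %>%
knitr::kable(booktabs = TRUE, escape = FALSE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")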
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 4.1173 | 0.0227 | 181.2863 | 0.0000 |
Month_February | -0.1517 | 0.0268 | -5.6618 | 0.0000 |
Month_March | -0.2391 | 0.0281 | -8.4973 | 0.0000 |
Month_April | -0.1988 | 0.0443 | -4.4906 | 0.0000 |
Month_May | -0.1290 | 0.0441 | -2.9238 | 0.0035 |
Month_June | -0.0103 | 0.0427 | -0.2404 | 0.8101 |
Month_July | -0.0203 | 0.0491 | -0.4139 | 0.6790 |
Month_August | -0.0463 | 0.0491 | -0.9415 | 0.3466 |
Month_September | -0.0303 | 0.0451 | -0.6710 | 0.5023 |
Month_October | -0.0204 | 0.0439 | -0.4646 | 0.6423 |
Month_November | -0.0165 | 0.0441 | -0.3733 | 0.7089 |
Month_December | 0.0303 | 0.0361 | 0.8390 | 0.4016 |
Weekday_Tuesday | 0.0141 | 0.0202 | 0.6997 | 0.4842 |
Weekday_Wednesday | 0.0098 | 0.0202 | 0.4836 | 0.6288 |
Weekday_Thursday | 0.0135 | 0.0202 | 0.6678 | 0.5044 |
Weekday_Friday | 0.0038 | 0.0202 | 0.1882 | 0.8507 |
Weekday_Saturday | -0.0936 | 0.0202 | -4.6430 | 0.0000 |
Weekday_Sunday | -0.1654 | 0.0202 | -8.2057 | 0.0000 |
Season_Spring | -0.0435 | 0.0355 | -1.2226 | 0.2217 |
Season_Summer | 0.0230 | 0.0416 | 0.5523 | 0.5808 |
Season_Autumn | 0.0805 | 0.0353 | 2.2808 | 0.0227 |
4.1.2 Task B.1 (b)
Plot the log-prices, the fitted seasonal component and the deseasonalized log-prices for a selected year, together with the deseasonalized series over the full sample.
Figure deseasonalized series.
# Plot year
plot_year <- 2017
# Plot realized, seasonal function and deseasonalized prices for `plot_year`
plot_seasonal <- data_prices %>%
filter(Year == plot_year)%>%
ggplot()+
geom_point(aes(date, log_price_tilde, color = "realized"), alpha = 0)+
geom_line(aes(date, log_price, color = "sample"))+
geom_line(aes(date, log_price_tilde, color = "deseasonal"))+
geom_line(aes(date, log_price_bar, color = "seasonal"))+
geom_line(aes(date, 0, color = "realized"), alpha = 0)+
labs(x = NULL, y = latex2exp::TeX("$P_t$"), color = NULL)+
scale_color_manual(values = c(sample = "orange", realized = "black",
seasonal = "blue", deseasonal = "black"),
labels = c(sample = "Sample", realized = "Realized residuals",
seasonal = "Seasonal", deseasonal = "Deseasonalized"))+
theme_bw()+
custom_theme+
theme(legend.position = "top")
# Plot realized deseasonalized prices for all the data
plot_eps <- data_prices %>%
mutate(type = ifelse(Year == plot_year, "col", "nocol")) %>%
ggplot()+
geom_point(aes(date, log_price_tilde, color = type), size = 0.05)+
labs(x = NULL, y = latex2exp::TeX("$\\tilde{P}_t$"), color = NULL)+
scale_color_manual(values = c(col = "orange", nocol = "black"))+
custom_theme+
theme_bw()+
theme(legend.position = "none")
gridExtra::grid.arrange(plot_seasonal, plot_eps, ncol = 1, heights = c(0.6, 0.4))
4.1.3 Task B.1 (c)
Plot the autocorrelation function of the deseasonalized log-prices and of their squares.
Figure autocorrelation.
# Maximum number of lags
lag.max <- 365
# Confidence level
ci <- 0.05
# ACF log price tilde
acf_x <- acf(data_prices$log_price_tilde, type = "correlation", plot = FALSE, lag.max = lag.max)$acf[,,1][-1]
# ACF log price tilde squared
acf_x2 <- acf(data_prices$log_price_tilde^2, type = "correlation", plot = FALSE, lag.max = lag.max)$acf[,,1][-1]
# Limits for y-axis
limits_y <- range(c(acf_x, acf_x2))
# Confidence levels
acf_ci_bound <- qnorm(1 - ci / 2) / sqrt(nrow(data_prices))
plot_acf_x <- ggplot()+
geom_bar(aes(1:lag.max, acf_x), stat = "identity", width = 1)+
geom_line(aes(1:lag.max, acf_ci_bound), color = "blue", linetype = "dashed")+
geom_line(aes(1:lag.max, -acf_ci_bound), color = "blue", linetype = "dashed")+
scale_y_continuous(limits = limits_y)+
labs(x = NULL, y = latex2exp::TeX("ACF($\\tilde{P}_t$)"), color = NULL)+
theme_bw()+
custom_theme+
theme(legend.position = "none")
plot_acf_x2 <-ggplot()+
geom_bar(aes(1:lag.max, acf_x2), stat = "identity", width = 1)+
geom_line(aes(1:lag.max, acf_ci_bound), color = "blue", linetype = "dashed")+
geom_line(aes(1:lag.max, -acf_ci_bound), color = "blue", linetype = "dashed")+
scale_y_continuous(limits = limits_y)+
labs(x = "Lags", y = latex2exp::TeX("ACF($\\tilde{P}_t^2$)"), color = NULL)+
theme_bw()+
custom_theme+
theme(legend.position = "none")
gridExtra::grid.arrange(plot_acf_x, plot_acf_x2, ncol = 1)
4.2 Task B.2
The deseasonalized log-prices present strong autocorrelation in both mean and variance. Let’s proceed by removing the first, and then deal with the second. A popular model for electricity prices is the autoregressive model of order 1 (AR(1)), the discrete-time counterpart of an OU process.
4.2.1 Task B.2 (a)
Test the hypothesis that the deseasonalized log-prices follow a random walk with an appropriate test; for example, you can consider an Augmented Dickey-Fuller test. Then fit with OLS an AR(1) model of the form $\tilde{P}_t = \phi \, \tilde{P}_{t-1} + \epsilon_t$ (without intercept, since the deseasonalized series has mean zero).
Table ADF Test.
tseries::adf.test(data_prices$log_price_tilde, k = 7, alternative = "stationary") %>%
broom::tidy() %>%
dplyr::select(-alternative) %>%
dplyr::mutate(lags = 7,
H0 = ifelse(p.value <= 0.01, "Rejected (stationary)", "Non-rejected (unit root)")) %>%
dplyr::mutate_if(is.numeric, format, digits = 4, scientific = FALSE) %>%
knitr::kable(booktabs = TRUE, escape = FALSE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")
AR model on the deseasonalized log-prices.
# AR model
AR_model <- lm(log_price_tilde ~ I(dplyr::lag(log_price_tilde, 1)) - 1, data = data_prices)
# Fitted values
data_prices$log_price_tilde_hat <- predict(AR_model, newdata = data_prices)
# Fitted residuals
data_prices$eps <- data_prices$log_price_tilde - data_prices$log_price_tilde_hat
# Estimated constant standard deviation
data_prices$sigma_bar <- sd(data_prices$eps, na.rm = TRUE)*sqrt(nrow(data_prices)/(nrow(data_prices)-1))
# Standardized residuals
data_prices$std_eps <- data_prices$eps/data_prices$sigma_bar
data_prices <- data_prices[-c(1),]
Table statistics.
glance_AR_model <- broom::glance(AR_model)
glance_AR_model %>%
dplyr::select(-df, -logLik, -AIC, -BIC) %>%
dplyr::mutate_if(is.numeric, format, digits = 4, scientific = FALSE) %>%
knitr::kable(booktabs = TRUE, escape = FALSE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")
Table estimates.
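A sketch of the folded chunk that presumably produces the coefficient table (the single AR(1) coefficient with its standard error and t-statistic):
broom::tidy(AR_model) %>%
dplyr::mutate_if(is.numeric, round, digits = 4) %>%
knitr::kable(booktabs = TRUE, escape = FALSE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")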
Note that, since the ADF test rejects the unit-root hypothesis and the estimated autoregressive coefficient is smaller than one in absolute value, the deseasonalized log-prices behave as a stationary, mean-reverting process.
4.2.2 Task B.2 (b)
Plot the autocorrelation function of the standardized residuals and of their squares, and verify the following assumptions.
- Assumption 1.: Identically distributed residuals. We can verify this hypothesis with a two-sample Kolmogorov-Smirnov test (see here).
Hint: For the two-sample Kolmogorov-Smirnov test you can consider the following R function.
Two-sample Kolmogorov Smirnov Test
#' Kolmogorov Smirnov test for a distribution
#'
#' Test against a specific distribution
#'
#' @param x a vector.
#' @param ci p.value for rejection.
#' @param idx_split Index used for splitting the time series. If `missing` will be random sampled.
#' @param min_quantile minimum quantile for the grid of values.
#' @param max_quantile maximum quantile for the grid of values.
#' @param seed random seed for two sample test.
#' @param plot when `TRUE` a plot is returned, otherwise a `tibble`.
#' @return when `plot = TRUE` a plot is returned, otherwise a `tibble`.
ks_test_ts <- function(x, ci = 0.05, idx_split, min_quantile = 0.015, max_quantile = 0.985, seed = 1, plot = FALSE){
# Random seed
set.seed(seed)
# number of observations
n <- length(x)
# Random split of the time series
idx_split <- ifelse(missing(idx_split), sample(n, 1), idx_split)
x1 <- x[1:idx_split]
x2 <- x[(idx_split+1):n]
# Number of elements for each sub sample
n1 <- length(x1)
n2 <- length(x2)
# Grid of values for KS-statistic
grid <- seq(quantile(x, min_quantile), quantile(x, max_quantile), 0.01)
# Empiric cdfs
cdf_n1 <- ecdf(x1)
cdf_n2 <- ecdf(x2)
# KS-statistic
ks_stat <- max(abs(cdf_n1(grid) - cdf_n2(grid)))
# Rejection level
# Equivalent: sqrt(-log(ci / 2) * (1 + n2/n1) / (2*n2))
rejection_lev <- sqrt(-0.5 * log(ci / 2)) * sqrt((n1+n2)/(n1*n2))
# P-value (asymptotic two-sample KS approximation)
p.value <- 2 * exp(-2 * (n1 * n2 / (n1 + n2)) * ks_stat^2)
# ========================== Plot ==========================
if (plot) {
y_breaks <- seq(0, 1, 0.2)
y_labels <- paste0(format(y_breaks*100, digits = 2), "%")
grid_max <- grid[which.max(abs(cdf_n1(grid) - cdf_n2(grid)))]
plt <- ggplot()+
geom_ribbon(aes(grid, ymax = cdf_n1(grid), ymin = cdf_n2(grid)),
alpha = 0.5, fill = "green") +
geom_line(aes(grid, cdf_n1(grid)))+
geom_line(aes(grid, cdf_n2(grid)), color = "red")+
geom_segment(aes(x = grid_max, xend = grid_max,
y = cdf_n1(grid_max), yend = cdf_n2(grid_max)),
linetype = "solid", color = "magenta")+
geom_point(aes(grid_max, cdf_n1(grid_max)), color = "magenta")+
geom_point(aes(grid_max, cdf_n2(grid_max)), color = "magenta")+
scale_y_continuous(breaks = y_breaks, labels = y_labels)+
labs(x = "x", y = "cdf")+
theme_bw()
return(plt)
} else {
kab <- dplyr::tibble(
idx_split = idx_split,
ci = paste0(ci*100, "%"),
n1 = n1,
n2 = n2,
KS = ks_stat,
p.value = p.value,
rejection_lev = rejection_lev,
H0 = ifelse(KS > rejection_lev, "Rejected", "Non-Rejected")
)
class(kab) <- c("ks_test_ts", class(kab))
return(kab)
}
}
- Assumption 2.: Uncorrelated residuals. If we do not reject Assumption 1. we can use a Ljung-Box test; however, if we reject Assumption 1. the Ljung-Box test is not valid and it is better to rely on a more general approach. In general, we can fit an AR(p) model on the standardized residuals, i.e.
$$\bar{\epsilon}_t = a_0 + a_1 \bar{\epsilon}_{t-1} + \dots + a_p \bar{\epsilon}_{t-p} + u_t,$$
and test the significance of the entire regression with the F-test. More precisely, the F-statistic and its p-value are evaluated under the null hypothesis $H_0: a_0 = a_1 = \dots = a_p = 0$. In other words, under $H_0$ the standardized residuals are uncorrelated and have expectation equal to zero. Is the AR model able to remove the autocorrelation in mean? Are the standardized residuals stationary in variance?
Figure autocorrelation.
# Maximum number of lags
lag.max <- 365
# Confidence level
ci <- 0.05
# ACF log price tilde
acf_x <- acf(data_prices$std_eps, type = "correlation", plot = FALSE, lag.max = lag.max)$acf[,,1][-1]
# ACF log price tilde squared
acf_x2 <- acf(data_prices$std_eps^2, type = "correlation", plot = FALSE, lag.max = lag.max)$acf[,,1][-1]
# Limits for y-axis
limits_y <- range(c(acf_x, acf_x2))
# Confidence levels
acf_ci_bound <- qnorm(1 - ci / 2) / sqrt(nrow(data_prices))
plot_acf_x <- ggplot()+
geom_bar(aes(1:lag.max, acf_x), stat = "identity", width = 1)+
geom_line(aes(1:lag.max, acf_ci_bound), color = "blue", linetype = "dashed")+
geom_line(aes(1:lag.max, -acf_ci_bound), color = "blue", linetype = "dashed")+
scale_y_continuous(limits = limits_y)+
labs(x = NULL, y = latex2exp::TeX("ACF($\\bar{\\epsilon}_t$)"), color = NULL)+
theme_bw()+
custom_theme+
theme(legend.position = "none")
plot_acf_x2 <-ggplot()+
geom_bar(aes(1:lag.max, acf_x2), stat = "identity", width = 1)+
geom_line(aes(1:lag.max, acf_ci_bound), color = "blue", linetype = "dashed")+
geom_line(aes(1:lag.max, -acf_ci_bound), color = "blue", linetype = "dashed")+
scale_y_continuous(limits = limits_y)+
labs(x = "Lags", y = latex2exp::TeX("ACF($\\tilde{P}_t^2$)"), color = NULL)+
theme_bw()+
custom_theme+
theme(legend.position = "none")
gridExtra::grid.arrange(plot_acf_x, plot_acf_x2, ncol = 1)
As shown in Figure 11, the residuals present significant autocorrelation in both mean and variance, which makes them correlated and not identically distributed.
Figure residuals.
plot_eps <- ggplot(data_prices)+
geom_point(aes(date, eps), color = "black", size = 0.1, alpha = 0.4)+
geom_line(aes(date, 3*sigma_bar, color = "sigma3"), size = 0.4)+
geom_line(aes(date, 2*sigma_bar, color = "sigma2"), size = 0.4)+
geom_line(aes(date, 1*sigma_bar, color = "sigma1"), size = 0.4)+
geom_line(aes(date, -1*sigma_bar, color = "sigma1"), size = 0.4)+
geom_line(aes(date, -2*sigma_bar, color = "sigma2"), size = 0.4)+
geom_line(aes(date, -3*sigma_bar, color = "sigma3"), size = 0.4)+
labs(x = NULL, y = TeX("$\\epsilon_t$"), color = NULL)+
theme_bw()+
theme(legend.position = "top")+
theme_bw()+
custom_theme+
scale_color_manual(values = c(sigma1 = "green", sigma2 = "orange", sigma3 = "red", std_eps = "darkorange"),
labels = c(sigma1 = TeX("$\\pm \\bar{\\sigma}$"),
sigma2 = TeX("$\\pm 2\\bar{\\sigma}$"),
sigma3 = TeX("$\\pm 3\\bar{\\sigma}$"),
std_eps = TeX("$\\bar{\\epsilon}_t = \\frac{\\epsilon_t}{\\bar{\\sigma}}$")))
plot_std_eps <- ggplot(data_prices)+
geom_point(aes(date, std_eps), color = "black", size = 0.1, alpha = 0.4)+
geom_line(aes(date, 3, color = "sigma3"), size = 0.4)+
geom_line(aes(date, 2, color = "sigma2"), size = 0.4)+
geom_line(aes(date, 1, color = "sigma1"), size = 0.4)+
geom_line(aes(date, -1, color = "sigma1"), size = 0.4)+
geom_line(aes(date, -2, color = "sigma2"), size = 0.4)+
geom_line(aes(date, -3, color = "sigma3"), size = 0.4)+
labs(x = NULL, y = TeX("$\\bar{\\epsilon}_t$"), color = NULL)+
scale_color_manual(values = c(sigma1 = "green", sigma2 = "orange", sigma3 = "red"))+
theme_bw()+
custom_theme+
theme(legend.position = "none")
gridExtra::grid.arrange(plot_eps, plot_std_eps, ncol = 1, heights = c(0.6, 0.4))
Figure 12 clearly shows that the standardized residuals present volatility clusters; hence the time series of $\bar{\epsilon}_t$ cannot be identically distributed over time.
Two-sample cdfs and KS statistic for $\bar{\epsilon}_t$.
kab <- ks_test_ts(na.omit(data_prices$std_eps), ci = 0.01, seed = 2) %>%
dplyr::mutate_if(is.numeric, format, digits = 4, scientific = FALSE)
colnames(kab) <- c("Index split", "$$\\alpha$$", "$$n_1$$", "$$n_2$$",
"$$KS_{n_1, n_2}$$", "p.value", "Critical Level", "$$H_0$$")
knitr::kable(kab, booktabs = TRUE, escape = FALSE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")
The null hypothesis of independent and identically distributed residuals is rejected. Hence, we fit an AR(7) model on the standardized residuals.
Autocorrelation test on $\bar{\epsilon}_t$.
data_test <- data_prices
# Number of lags to test
lag.max <- 7
# Add lagged variables
for(i in 1:lag.max){
data_test[[paste0("L", i, "_std_eps")]] <- dplyr::lag(data_test$std_eps, i)
}
# Select only relevant variables
data_test <- na.omit(dplyr::select(data_test, contains("std_eps")))
# Fit an AR(p) model
AR_model_test <- lm(std_eps ~ ., data = data_test)
Table statistics.
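The joint F-statistic and p-value quoted below can presumably be extracted with broom::glance(), mirroring the pattern used later for the ARMA residual test (a sketch; the original chunk is folded):
glance_AR_model_test <- broom::glance(AR_model_test)
glance_AR_model_test %>%
dplyr::select(-AIC, -BIC, -logLik) %>%
dplyr::mutate_if(is.numeric, format, digits = 4, scientific = FALSE) %>%
knitr::kable(booktabs = TRUE, escape = FALSE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")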
Table estimates.
tidy_AR_model_test <- broom::tidy(AR_model_test)
tidy_AR_model_test %>%
dplyr::mutate_if(is.numeric, round, digits = 4) %>%
dplyr::mutate(term = paste0("$$a_", 0:lag.max, "$$")) %>%
knitr::kable(booktabs = TRUE, escape = FALSE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
$$a_0$$ | -0.0020 | 0.0250 | -0.0780 | 0.9378 |
$$a_1$$ | -0.0640 | 0.0258 | -2.4789 | 0.0133 |
$$a_2$$ | -0.2211 | 0.0257 | -8.5932 | 0.0000 |
$$a_3$$ | -0.0117 | 0.0263 | -0.4456 | 0.6560 |
$$a_4$$ | 0.0579 | 0.0262 | 2.2065 | 0.0275 |
$$a_5$$ | 0.0748 | 0.0263 | 2.8465 | 0.0045 |
$$a_6$$ | 0.1074 | 0.0257 | 4.1791 | 0.0000 |
$$a_7$$ | 0.1864 | 0.0258 | 7.2225 | 0.0000 |
As we can see, the intercept coefficient $a_0 = -0.00195$ has a p-value of 0.938, so it is not statistically different from zero. However, the F-statistic, equal to 21.15, has a p-value of essentially 0, so we reject the null hypothesis that all the AR coefficients are jointly equal to zero. Therefore, there is still autocorrelation in mean that the AR(1) is not able to capture.
4.2.3 Task B.2 (c)
Let’s consider a more general model by fitting an ARMA(2,2). Such a process has the form
$$\tilde{P}_t = \phi_1 \tilde{P}_{t-1} + \phi_2 \tilde{P}_{t-2} + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2}.$$
Hint: in R you can consider the function arima(). Compute the fitted residuals as $\epsilon_t = \tilde{P}_t - \hat{\tilde{P}}_t$.
Fit ARMA model.
# 3) ARMA model
ARMA_model <- arima(data_prices$log_price_tilde, order = c(2, 0, 2), include.mean = FALSE)
# Fitted residuals
data_prices$eps <- ARMA_model$residuals
# Fitted time series
data_prices$log_price_tilde_hat <- data_prices$log_price_tilde - data_prices$eps
# Estimated constant standard deviation
data_prices$sigma_bar <- sd(data_prices$eps, na.rm = TRUE) * sqrt(nrow(data_prices) / (nrow(data_prices) - 1))
# Standardized residuals
data_prices$std_eps <- data_prices$eps / data_prices$sigma_bar
data_prices <- data_prices[-c(1:2),]
ARMA statistics.
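A sketch of how the ARMA fit statistics could be tabulated with broom::glance() (the original chunk is folded; glance() on an Arima object reports sigma, log-likelihood, AIC and BIC):
broom::glance(ARMA_model) %>%
dplyr::mutate_if(is.numeric, format, digits = 4, scientific = FALSE) %>%
knitr::kable(booktabs = TRUE, escape = FALSE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")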
ARMA estimates.
tidy_ARMA_model <- broom::tidy(ARMA_model)
tidy_ARMA_model %>%
dplyr::mutate(term = c("$$\\phi_1$$", "$$\\phi_2$$",
"$$\\theta_1$$", "$$\\theta_2$$")) %>%
dplyr::mutate_if(is.numeric, round, digits = 4) %>%
knitr::kable(booktabs = TRUE, escape = TRUE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")
Code
# Maximum number of lags
lag.max <- 365
# Confidence level
ci <- 0.05
# ACF log price tilde
acf_x <- acf(data_prices$std_eps, type = "correlation", plot = FALSE, lag.max = lag.max)$acf[,,1][-1]
# ACF log price tilde squared
acf_x2 <- acf(data_prices$std_eps^2, type = "correlation", plot = FALSE, lag.max = lag.max)$acf[,,1][-1]
# Limits for y-axis
limits_y <- range(c(acf_x, acf_x2))
# Confidence levels
acf_ci_bound <- qnorm(1 - ci / 2) / sqrt(nrow(data_prices))
plot_acf_x <- ggplot()+
geom_bar(aes(1:lag.max, acf_x), stat = "identity", width = 1)+
geom_line(aes(1:lag.max, acf_ci_bound), color = "blue", linetype = "dashed")+
geom_line(aes(1:lag.max, -acf_ci_bound), color = "blue", linetype = "dashed")+
scale_y_continuous(limits = limits_y)+
labs(x = NULL, y = latex2exp::TeX("ACF($\\tilde{\\varepsilon}_t$)"), color = NULL)+
theme_bw()+
custom_theme+
theme(legend.position = "none")
plot_acf_x2 <-ggplot()+
geom_bar(aes(1:lag.max, acf_x2), stat = "identity", width = 1)+
geom_line(aes(1:lag.max, acf_ci_bound), color = "blue", linetype = "dashed")+
geom_line(aes(1:lag.max, -acf_ci_bound), color = "blue", linetype = "dashed")+
scale_y_continuous(limits = limits_y)+
labs(x = "Lags", y = latex2exp::TeX("ACF($\\tilde{\\varepsilon}_t^2$)"), color = NULL)+
theme_bw()+
custom_theme+
theme(legend.position = "none")
gridExtra::grid.arrange(plot_acf_x, plot_acf_x2, ncol = 1)
As shown in Figure 14, the squared residuals still present significant autocorrelation in variance, which means the residuals are not identically distributed.
Code
library(latex2exp)
plot_eps <- ggplot(data_prices)+
geom_point(aes(date, eps), color = "black", size = 0.1, alpha = 0.4)+
geom_line(aes(date, 3*sigma_bar, color = "sigma3"), size = 0.4)+
geom_line(aes(date, 2*sigma_bar, color = "sigma2"), size = 0.4)+
geom_line(aes(date, 1*sigma_bar, color = "sigma1"), size = 0.4)+
geom_line(aes(date, -1*sigma_bar, color = "sigma1"), size = 0.4)+
geom_line(aes(date, -2*sigma_bar, color = "sigma2"), size = 0.4)+
geom_line(aes(date, -3*sigma_bar, color = "sigma3"), size = 0.4)+
labs(x = NULL, y = TeX("$\\epsilon_t$"), color = NULL)+
theme_bw()+
custom_theme+
theme(legend.position = "top")+
scale_color_manual(values = c(sigma1 = "green", sigma2 = "orange", sigma3 = "red", std_eps = "darkorange"),
labels = c(sigma1 = TeX("$\\pm \\bar{\\sigma}$"),
sigma2 = TeX("$\\pm 2\\bar{\\sigma}$"),
sigma3 = TeX("$\\pm 3\\bar{\\sigma}$"),
std_eps = TeX("$\\bar{\\epsilon}_t = \\frac{\\epsilon_t}{\\bar{\\sigma}}$")))
plot_std_eps <- ggplot(data_prices)+
geom_point(aes(date, std_eps), color = "black", size = 0.1, alpha = 0.4)+
geom_line(aes(date, 3, color = "sigma3"), size = 0.4)+
geom_line(aes(date, 2, color = "sigma2"), size = 0.4)+
geom_line(aes(date, 1, color = "sigma1"), size = 0.4)+
geom_line(aes(date, -1, color = "sigma1"), size = 0.4)+
geom_line(aes(date, -2, color = "sigma2"), size = 0.4)+
geom_line(aes(date, -3, color = "sigma3"), size = 0.4)+
labs(x = NULL, y = TeX("$\\bar{\\epsilon}_t$"), color = NULL)+
scale_color_manual(values = c(sigma1 = "green", sigma2 = "orange", sigma3 = "red"))+
theme_bw()+
custom_theme+
theme(legend.position = "none")
gridExtra::grid.arrange(plot_eps, plot_std_eps, ncol = 1, heights = c(0.6, 0.4))
The graph above clearly shows that the standardized residuals present volatility clusters; hence the time series of $\bar{\epsilon}_t$ cannot be identically distributed over time.
KS-test
kab <- ks_test_ts(na.omit(data_prices$std_eps), ci = 0.01, seed = 2) %>%
dplyr::mutate_if(is.numeric, format, digits = 4, scientific = FALSE)
colnames(kab) <- c("Index split", "$$\\alpha$$", "$$n_1$$", "$$n_2$$",
"$$KS_{n_1, n_2}$$", "p.value", "Critical Level", "$$H_0$$")
knitr::kable(kab, booktabs = TRUE, escape = FALSE, align = 'c')%>%
kableExtra::row_spec(0, color = "white", background = "green")
The null hypothesis of independent and identically distributed residuals is rejected; hence, we have to define a model for the conditional variance of the process.
Test Autocorrelation on the ARMA standardized residuals.
data_test <- data_prices
# Number of lags to test
lag.max <- 7
# Add lagged variables
for(i in 1:lag.max){
data_test[[paste0("L", i, "_std_eps")]] <- dplyr::lag(data_test$std_eps, i)
}
# Select only relevant variables
data_test <- na.omit(dplyr::select(data_test, contains("std_eps")))
# Fit an AR(p) model
ARMA_model_test <- lm(std_eps ~ ., data = data_test)
Table statistics.
glance_ARMA_model_test <- broom::glance(ARMA_model_test)
glance_ARMA_model_test %>%
select(-AIC, -BIC, -logLik) %>%
dplyr::mutate_if(is.numeric, format, digits = 4, scientific = FALSE) %>%
knitr::kable(booktabs = TRUE, escape = FALSE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")
Table estimates.
tidy_ARMA_model_test <- broom::tidy(ARMA_model_test)
tidy_ARMA_model_test %>%
dplyr::mutate(term = paste0("$$a_", 0:lag.max, "$$")) %>%
dplyr::mutate_if(is.numeric, format, digits = 4, scientific = FALSE) %>%
knitr::kable(booktabs = TRUE, escape = FALSE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
$$a_0$$ | -0.0021308 | 0.02619 | -0.081354 | 0.9351715 |
$$a_1$$ | -0.0001462 | 0.02621 | -0.005578 | 0.9955504 |
$$a_2$$ | -0.0053908 | 0.02622 | -0.205615 | 0.8371207 |
$$a_3$$ | 0.0007113 | 0.02619 | 0.027153 | 0.9783418 |
$$a_4$$ | 0.0429757 | 0.02617 | 1.642383 | 0.1007284 |
$$a_5$$ | -0.0028748 | 0.02619 | -0.109781 | 0.9125983 |
$$a_6$$ | 0.0142256 | 0.02619 | 0.543246 | 0.5870442 |
$$a_7$$ | 0.0901152 | 0.02619 | 3.441149 | 0.0005958 |
As we can see, the intercept coefficient $a_0 = -0.00213$ has a p-value of 0.935, so it is not statistically different from zero. Moreover, the F-statistic, equal to 2.139, has a p-value of 0.037, so at the 1 % significance level we do not reject the null hypothesis that all the AR coefficients are jointly equal to zero (we would reject it only at levels above 3.7 %). Therefore, with the ARMA model the autocorrelation in mean that the AR(1) was not able to completely remove is no longer present.
4.3 Task B.3
4.3.1 Task B.3 (a)
A popular approach to model the conditional variance of a time series is the GARCH(p,q) model. For simplicity, let's consider a GARCH(1,1) model, i.e.
$$\epsilon_t = \sigma_t u_t, \qquad \sigma_t^2 = \omega + \alpha \epsilon_{t-1}^2 + \beta \sigma_{t-1}^2,$$
where $u_t$ are i.i.d. innovations with zero mean and unit variance.
Hint: For the parameter estimation you can consider the R package rugarch, which implements the QMLE estimator, the possibility to automatically constrain the variance, and many other GARCH variants. For the stationarity condition, check that the estimated parameters satisfy $\alpha + \beta < 1$.
Fit the GARCH model.
# Variance specification
GARCH_spec <- rugarch::ugarchspec(
variance.model = list(model = "sGARCH", garchOrder = c(1,1), variance.targeting = data_prices$sigma_bar[1]^2),
mean.model = list(armaOrder = c(0,0), include.mean = FALSE), distribution.model = "norm")
# GARCH fitted model
GARCH_model <- rugarch::ugarchfit(data = data_prices$eps, spec = GARCH_spec, out.sample = 0)
# Fitted GARCH standard deviation
data_prices$sigma <- GARCH_model@fit$sigma
# Fitted standardized residuals
data_prices$ut <- data_prices$eps / data_prices$sigma
Table GARCH estimates.
# GARCH parameters
GARCH_params <- GARCH_model@fit$coef
# Std. deviation of the parameters
GARCH_errors <- GARCH_model@fit$se.coef
names(GARCH_errors) <- names(GARCH_params)
# Long-term GARCH standard deviation
sigma2_inf <- GARCH_params["omega"] / (1-GARCH_params["alpha1"]-GARCH_params["beta1"])
kab <- dplyr::tibble(
params = c("$$\\omega$$", "$$\\alpha$$", "$$\\beta$$", "$$\\sigma^2_{\\infty}$$"),
estimates = c(GARCH_params["omega"], GARCH_params["alpha1"], GARCH_params["beta1"], sigma2_inf),
se.coef = c(GARCH_errors["omega"], GARCH_errors["alpha1"], GARCH_errors["beta1"], NA_integer_)) %>%
dplyr::mutate_if(is.numeric, format, digits = 4, scientific = FALSE)
knitr::kable(kab, booktabs = TRUE, escape = FALSE, align = 'c')%>%
kableExtra::row_spec(0, color = "white", background = "green")
Since the sum $\hat{\alpha} + \hat{\beta}$ is below 1, the stationarity condition is satisfied and the long-term variance $\sigma^2_{\infty} = \omega / (1 - \alpha - \beta)$ reported above is well defined.
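This condition can be checked directly on the fitted coefficients (a quick sketch):
# Sketch: stationarity requires the persistence alpha + beta to be strictly below 1
GARCH_params["alpha1"] + GARCH_params["beta1"] < 1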
Figure GARCH residuals.
plot_eps <- ggplot(data_prices)+
geom_point(aes(date, eps), color = "black", size = 0.1, alpha = 0.4)+
geom_line(aes(date, 3*sigma, color = "sigma3"), size = 0.4)+
geom_line(aes(date, 2*sigma, color = "sigma2"), size = 0.4)+
geom_line(aes(date, 1*sigma, color = "sigma1"), size = 0.4)+
geom_line(aes(date, -1*sigma, color = "sigma1"), size = 0.4)+
geom_line(aes(date, -2*sigma, color = "sigma2"), size = 0.4)+
geom_line(aes(date, -3*sigma, color = "sigma3"), size = 0.4)+
labs(x = NULL, y = TeX("$\\epsilon_t$"), color = NULL)+
theme_bw()+
custom_theme+
theme(legend.position = "top")+
scale_color_manual(values = c(sigma1 = "green", sigma2 = "orange", sigma3 = "red", std_eps = "darkorange"),
labels = c(sigma1 = TeX("$\\pm \\bar{\\sigma}$"),
sigma2 = TeX("$\\pm 2\\bar{\\sigma}$"),
sigma3 = TeX("$\\pm 3\\bar{\\sigma}$"),
std_eps = TeX("$\\bar{\\epsilon}_t = \\frac{\\epsilon_t}{\\bar{\\sigma}}$")))
plot_std_eps <- ggplot(data_prices)+
geom_point(aes(date, ut), color = "black", size = 0.1, alpha = 0.4)+
geom_line(aes(date, 3, color = "sigma3"), size = 0.4)+
geom_line(aes(date, 2, color = "sigma2"), size = 0.4)+
geom_line(aes(date, 1, color = "sigma1"), size = 0.4)+
geom_line(aes(date, -1, color = "sigma1"), size = 0.4)+
geom_line(aes(date, -2, color = "sigma2"), size = 0.4)+
geom_line(aes(date, -3, color = "sigma3"), size = 0.4)+
labs(x = NULL, y = TeX("$u_t$"), color = NULL)+
scale_color_manual(values = c(sigma1 = "green", sigma2 = "orange", sigma3 = "red"))+
theme_bw()+
custom_theme+
theme(legend.position = "none")
gridExtra::grid.arrange(plot_eps, plot_std_eps, ncol = 1, heights = c(0.6, 0.4))
Figure GARCH standard deviation.
ggplot(data_prices)+
geom_line(aes(date, sigma, color = "sigma"), size = 0.3)+
geom_line(aes(date, sigma_bar, color = "sigma_bar"))+
geom_line(aes(date, sqrt(sigma2_inf), color = "sigma0"), linetype = "dashed")+
geom_line(aes(date, 0, color = "std_eps"), alpha = 0)+
labs(x = NULL, y = TeX("$\\sigma_t$"), color = NULL)+
#scale_y_continuous(limits = c(min(data_prices$sigma), max(data_prices$sigma)))+
scale_color_manual(values = c(sigma = "black", sigma0 = "red", sigma_bar = "blue"),
labels = c(sigma = TeX("$\\sigma_t$"),
sigma0 = TeX("$\\sigma_0$"),
sigma_bar = TeX("$\\bar{\\sigma}$")))+
theme_bw()+
custom_theme+
theme(legend.position = "top")
4.3.2 Task B.3 (b)
Once the parameters have been estimated, compute the GARCH standardized residuals $u_t = \epsilon_t / \sigma_t$ and verify the following assumptions.
- Assumption 1.: Identically distributed residuals. We can verify this hypothesis with a two-sample Kolmogorov-Smirnov test.
Hint: You can consider the same implementation used in Section 4.2.2.
- Assumption 2.: Uncorrelated residuals. If we do not reject Assumption 1. we can use a Ljung-Box test.
- Assumption 3.: The QMLE estimator assumes that the standardized residuals are normally distributed, i.e. $u_t \sim N(0, 1)$. We can verify this hypothesis with a Kolmogorov-Smirnov test by comparing the empirical distribution of $u_t$ with the theoretical distribution of a standard normal.
Hint: You can consider the following implementation of the Kolmogorov-Smirnov test for a distribution (see here for more details).
Kolmogorov Smirnov Test for a Distribution
#' Kolmogorov distribution functions
#'
#' @param x vector of quantiles.
#' @param p vector of probabilities.
#' @param k finite value for approximation of infinite sum.
#' @return A probability, a numeric vector in 0, 1.
#' @examples
#' pks(2, 1000)
#' pks(10, 1000)
pks <- function(x, k = 1000){
# Infinite sum function
# k: approximation for infinity
inf_sum <- function(x, k = 100){
isum <- 0
for(i in 1:k){
isum <- isum + (-1)^(i-1)*exp(-2*(x^2)*(i^2))
}
isum <- 2 * isum
isum
}
# Compute probability
p <- c()
for(i in 1:length(x)){
p[i] <- 1 - ifelse(x[i] <= 0, 1, inf_sum(x[i], k = k))
}
p[p<0] <- 0
return(p)
}
#' Kolmogorov quantile functions
#' @param p vector of probabilities.
qks <- function(p, k = 100, upper = 10){
loss <- function(x, p){
(pks(x, k = k) - p)^2
}
x <- c()
for(i in 1:length(p)){
x[i] <- optim(par = 2, method = "Brent", lower = 0, upper = upper, fn = loss, p = p[i])$par
}
return(x)
}
#' Kolmogorov Smirnov test for a distribution
#'
#' Test against a specific distribution
#'
#' @param x a vector.
#' @param cdf a function. The theoric distribution to use for comparison.
#' @param ci p.value for rejection.
#' @param min_quantile minimum quantile for the grid of values.
#' @param max_quantile maximum quantile for the grid of values.
#' @param k finite value for approximation of infinite sum.
#' @param plot when `TRUE` a plot is returned, otherwise a `tibble`.
#' @return when `plot = TRUE` a plot is returned, otherwise a `tibble`.
ks_test <- function(x, cdf, ci = 0.05, min_quantile = 0.015, max_quantile = 0.985, k = 1000, plot = FALSE){
# number of observations
n <- length(x)
# Set the interval values for upper and lower band
grid <- seq(quantile(x, min_quantile), quantile(x, max_quantile), 0.01)
# Empirical cdf
cdf_x <- ecdf(x)
# Compute the KS-statistic
ks_stat <- sqrt(n)*max(abs(cdf_x(grid) - cdf(grid)))
# Compute the rejection level
rejection_lev <- qks(1-ci, k = k)
# Compute the p-value
p.value <- 1-pks(ks_stat, k = k)
# ========================== Plot ==========================
if (plot) {
y_breaks <- seq(0, 1, 0.2)
y_labels <- paste0(format(y_breaks*100, digits = 2), "%")
grid_max <- grid[which.max(abs(cdf_x(grid) - cdf(grid)))]
plt <- ggplot()+
geom_ribbon(aes(grid, ymax = cdf_x(grid), ymin = cdf(grid)),
alpha = 0.5, fill = "green") +
geom_line(aes(grid, cdf_x(grid)))+
geom_line(aes(grid, cdf(grid)), color = "red")+
geom_segment(aes(x = grid_max, xend = grid_max,
y = cdf_x(grid_max), yend = cdf(grid_max)),
linetype = "solid", color = "magenta")+
geom_point(aes(grid_max, cdf_x(grid_max)), color = "magenta")+
geom_point(aes(grid_max, cdf(grid_max)), color = "magenta")+
scale_y_continuous(breaks = y_breaks, labels = y_labels)+
labs(x = "x", y = "cdf")+
theme_bw()
return(plt)
} else {
kab <- dplyr::tibble(
n = n,
alpha = paste0(format(ci*100, digits = 3), "%"),
KS = ks_stat,
rejection_lev = rejection_lev,
p.value = p.value,
H0 = ifelse(KS > rejection_lev, "Rejected", "Non-Rejected"))
class(kab) <- c("ks_test", class(kab))
return(kab)
}
}
Two-sample KS-test on $u_t$.
kab <- ks_test_ts(na.omit(data_prices$ut), ci = 0.01, seed = 2) %>%
mutate_if(is.numeric, format, digits = 4, scientific = FALSE)
colnames(kab) <- c("Index split", "$$\\alpha$$", "$$n_1$$", "$$n_2$$",
"$$KS_{n_1, n_2}$$", "p.value", "Critical Level", "$$H_0$$")
knitr::kable(kab, booktabs = TRUE, escape = FALSE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")
Since we do not reject the hypothesis of identical distribution, we can apply a Ljung-Box test for autocorrelation.
Ljung-Box test on $u_t$ and $u_t^2$.
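The original chunk for this test is not shown; a minimal sketch with the base-R Box.test function could be the following (the lag choice of 7 is an assumption that mirrors the AR(7) regressions above):
# Assumed sketch: Ljung-Box test for autocorrelation in u_t and in u_t^2
Box.test(data_prices$ut, lag = 7, type = "Ljung-Box")
Box.test(data_prices$ut^2, lag = 7, type = "Ljung-Box")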
Normality KS-test on $u_t$.
kab <- ks_test(na.omit(data_prices$ut), cdf = function(x) pnorm(x), ci = 0.01, plot = FALSE) %>%
mutate_if(is.numeric, round, digits = 4)
colnames(kab) <- c("$n$", "$$\\alpha$$", "$$KS_{n}$$", "Critical Level", "p.value", "$$H_0$$")
knitr::kable(kab, booktabs = TRUE, escape = FALSE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")
4.4 Task B.4
Let's consider a Gaussian mixture distribution for the GARCH residuals, i.e.

$$u_t = B\,Z_1 + (1 - B)\,Z_2, \qquad Z_1 \sim N(\mu_1, \sigma_1^2), \quad Z_2 \sim N(\mu_2, \sigma_2^2),$$

where $B$ is a Bernoulli indicator with $P(B = 1) = p$ and $P(B = 0) = 1 - p$, so that the density of $u_t$ is $p\,\phi(u; \mu_1, \sigma_1) + (1 - p)\,\phi(u; \mu_2, \sigma_2)$.
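As a quick illustration of this construction, one can simulate from the mixture by first drawing the indicator $B$ and then the corresponding normal component (a sketch with made-up parameter values, not the fitted ones):
# Illustrative simulation from a two-component Gaussian mixture
set.seed(123)
n_sim     <- 10000
p_mix     <- 0.7            # illustrative mixing weight
mu_mix    <- c(-0.5, 1.0)   # illustrative component means
sigma_mix <- c(1.0, 1.5)    # illustrative component standard deviations
B     <- rbinom(n_sim, size = 1, prob = p_mix)   # B = 1 selects the first component
u_sim <- B*rnorm(n_sim, mu_mix[1], sigma_mix[1]) + (1 - B)*rnorm(n_sim, mu_mix[2], sigma_mix[2])
c(mean = mean(u_sim), sd = sd(u_sim))            # compare with the mixture moments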
Yearly Gaussian Mixture parameters.
# Fit a mixture model for all the year
sm <- mclust::Mclust(data_prices$ut, G = 2)
# Means
mu <- sm$parameters$mean
sigma <- sqrt(sm$parameters$variance$scale)
probs <- sm$parameters$pro
# Mixture density
dnorm_mix <- function(x, mu, sigma, probs){
dnorm(x, mu[1], sigma[1])*probs[1] + dnorm(x, mu[2], sigma[2])*probs[2]
}
# Mixture distribution
pnorm_mix <- function(x, mu, sigma, probs){
pnorm(x, mu[1], sigma[1])*probs[1] + pnorm(x, mu[2], sigma[2])*probs[2]
}
# Log-likelihood
loglik <- sum(log(dnorm_mix(data_prices$ut, mu, sigma, probs)))
kab <- dplyr::tibble(mu1 = mu[1], mu2 = mu[2], sd1 = sigma[1], sd2 = sigma[2], p1 = probs[1], p2 = probs[2]) %>%
bind_cols(log_lik = loglik, mu_u = mean(data_prices$ut), sd_u = sd(data_prices$ut)) %>%
mutate_if(is.numeric, format, digits = 4, scientific = FALSE)
sum(mu*probs)
## [1] -0.009603254
sqrt(sum((mu^2 + sigma^2)*probs) - sum(mu*probs)^2)
## [1] 1.038544
knitr::kable(kab,
escape = FALSE,
col.names = c("$$\\mu_1$$", "$$\\mu_2$$",
"$$\\sigma_1$$", "$$\\sigma_2$$", "$$p_1$$", "$$1-p_1$$",
"log-lik", "$$\\mathbb{E}\\{u_t\\}$$", "$$\\mathbb{S}d\\{u_t\\}$$"))
Yearly Gaussian Mixture vs Empirical densities
grid <- seq(min(data_prices$ut), max(data_prices$ut), 0.01)
ggplot()+
geom_density(data = data_prices, aes(ut, ..density..), color = "black")+
geom_line(aes(grid, 0, color = "ut"), alpha = 0)+
geom_line(aes(grid, dnorm_mix(grid, mu, sigma, probs), color = "nm"))+
scale_color_manual(values = c(ut = "black", nm = "red"),
labels = c(ut = TeX("$u_t$"), nm = TeX("$GM(\\mu_1, \\mu_2, \\sigma_1, \\sigma_2, p)$")))+
labs(x = TeX("$u_t$"), y = "Density", color = NULL)+
scale_x_continuous(limits = c(-5, 5))+
theme_bw()+
custom_theme+
theme(legend.position = "top")
Gaussian Mixture vs monthly empirical densities.
pdf <- dnorm_mix(grid, mu, sigma, probs)
# Replicate the mixture density once for each month, for faceting
df_grid <- tibble(Month = lubridate::month(1, label = TRUE), grid = grid, pdf = pdf)
for(i in 2:12){
df_grid <- bind_rows(df_grid, tibble(Month = lubridate::month(i, label = TRUE), grid = grid, pdf = pdf))
}
data_prices %>%
mutate(Month = lubridate::month(date, label = TRUE)) %>%
ggplot()+
geom_density(aes(ut, ..density.., color = "ut"))+
geom_line(data = df_grid, aes(grid, pdf, color = "nm"))+
scale_color_manual(
values = c(ut = "black", nm = "red"),
labels = c(ut = TeX("$u_t$"), nm = TeX("$GM(\\mu_1, \\mu_2, \\sigma_1, \\sigma_2, p)$")))+
facet_wrap(~Month)+
labs(x = TeX("$u_t$"), y = "Density", color = NULL)+
scale_x_continuous(limits = c(-5, 5))+
theme_bw()+
custom_theme+
theme(legend.position = "top")
KS-test on $u_t$ against the Gaussian mixture
Gaussian mixture KS-test on $u_t$.
kab <- ks_test(na.omit(data_prices$ut), cdf = function(x) pnorm_mix(x, mu, sigma, probs), ci = 0.01, plot = FALSE) %>%
mutate_if(is.numeric, round, digits = 4)
colnames(kab) <- c("$n$", "$$\\alpha$$", "$$KS_{n}$$", "Critical Level", "p.value", "$$H_0$$")
knitr::kable(kab, booktabs = TRUE, escape = FALSE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")
5 Task C
Let’s say that we want to understand the impact on price when demand increases and is satisfied by a specific energy source (e.g., fossil, hydro, wind, etc.). In practice we ask ourselves the following question:
“What happens to the price if demand increases by 1%, and that increase is matched by a 1% increase in, say, fossil output?”
This amounts to measuring the price sensitivity to the marginal supply source. Electricity prices are set by the marginal unit that supplies the last unit of demand. So:
- If the marginal increase in demand is met by cheap supply (e.g., wind), prices rise little.
- If it’s met by expensive generation (e.g., fossil), prices increase significantly.
To capture this, we need a model that can distinguish which source is meeting the marginal demand.
5.1 Task C.1
Let's decompose the demand across sources by defining a model of the form

$$\log P_t = \alpha + \beta \log D_t + \gamma\, T_t \log D_t + \sum_{j} \beta_j\, s_{j,t} \log D_t + \delta^\top X_t + \varepsilon_t$$

- $\log D_t$: log electricity demand.
- $\beta$: baseline elasticity of price to demand.
- $T_t$: the temperature.
- $\beta_j$: differential impact when demand is met by source $j$.
- $\log P_t$: log electricity price at time $t$ from Equation 5.
- $s_{j,t}$: share of demand met by source $j$ from Equation 2.
- $X_t$: control variables, including the hour of day, weekday and month as in Equation 6.
Note that the use of the shares $s_{j,t}$ interacted with $\log D_t$ makes the elasticity of price to demand depend on the generation mix (and on temperature), rather than being a single constant.
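Spelled out (a direct consequence of the model above, and the quantity evaluated by the prediction function in Task C.2), the mix-dependent elasticity and the implied percentage price change for a demand increase of $r$ are

$$\frac{\partial \log P_t}{\partial \log D_t} = \beta + \gamma\, T_t + \sum_j \beta_j\, s_{j,t}, \qquad \%\Delta P = \Big[(1 + r)^{\beta + \gamma T_t + \sum_j \beta_j s_{j,t}} - 1\Big] \times 100.$$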
Fit the sensitivity model.
# Create the dataset
data_fit <- data %>%
dplyr::mutate(log_demand = log(demand + 1),
other = energy_other+energy_other_renewable,
s_other = (energy_other+energy_other_renewable)/demand*log_demand,
s_biomass = biomass/demand*log_demand,
s_solar = solar/demand*log_demand,
hydroelettric = river,
s_hydroelettric = hydroelettric/demand*log_demand,
s_wind = wind/demand*log_demand,
s_nuclear = nuclear/demand*log_demand,
s_fossil = fossil/demand*log_demand,
log_price = log(price),
temp = temp*log_demand,
Year_ = as.factor(Year),
Weekday_ = as.factor(weekdays(date)),
Month_ = as.factor(Month),
Hour_ = as.factor(Hour)) %>%
group_by(Year, Month, Day) %>%
select(date, Year, Month, Day, Hour, Year_, Hour_, Weekday_, Month_,
price, log_price, demand, log_demand,
other, biomass, solar, hydroelettric, wind, nuclear, fossil,
temp,
s_other, s_biomass, s_solar, s_hydroelettric, s_wind, s_nuclear, s_fossil)
# Fit the impact model
impact_model <- lm(
log_price ~ Hour_ + Weekday_ + Month_ + log_demand + temp +
s_other + s_biomass + s_solar + s_hydroelettric + s_wind + s_nuclear + s_fossil,
data = data_fit
)
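The statistics and estimates tables below are produced from the fitted model; the rendering code is not shown, but a sketch of one way to obtain them (assuming the broom package is available) is:
# Sketch: overall model statistics and term-level estimates
broom::glance(impact_model) %>%
mutate_if(is.numeric, round, digits = 4) %>%
knitr::kable(booktabs = TRUE, align = 'c')
broom::tidy(impact_model) %>%
mutate_if(is.numeric, round, digits = 4) %>%
knitr::kable(booktabs = TRUE, align = 'c')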
Table statistics.
Table estimates.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.1494 | 0.1751 | 0.8534 | 0.3934 |
Hour_1 | -0.0330 | 0.0080 | -4.1088 | 0.0000 |
Hour_2 | -0.0550 | 0.0081 | -6.8345 | 0.0000 |
Hour_3 | -0.0600 | 0.0081 | -7.4432 | 0.0000 |
Hour_4 | -0.0229 | 0.0080 | -2.8512 | 0.0044 |
Hour_5 | 0.0232 | 0.0081 | 2.8660 | 0.0042 |
Hour_6 | 0.0860 | 0.0083 | 10.3798 | 0.0000 |
Hour_7 | 0.1229 | 0.0086 | 14.2660 | 0.0000 |
Hour_8 | 0.1431 | 0.0091 | 15.6580 | 0.0000 |
Hour_9 | 0.1663 | 0.0097 | 17.1834 | 0.0000 |
Hour_10 | 0.1707 | 0.0101 | 16.9787 | 0.0000 |
Hour_11 | 0.1742 | 0.0103 | 16.9527 | 0.0000 |
Hour_12 | 0.1631 | 0.0103 | 15.8048 | 0.0000 |
Hour_13 | 0.1357 | 0.0102 | 13.2631 | 0.0000 |
Hour_14 | 0.1208 | 0.0101 | 12.0104 | 0.0000 |
Hour_15 | 0.1268 | 0.0098 | 12.9928 | 0.0000 |
Hour_16 | 0.1581 | 0.0094 | 16.8235 | 0.0000 |
Hour_17 | 0.2087 | 0.0091 | 22.8409 | 0.0000 |
Hour_18 | 0.2437 | 0.0090 | 26.9597 | 0.0000 |
Hour_19 | 0.2517 | 0.0090 | 27.9594 | 0.0000 |
Hour_20 | 0.2100 | 0.0088 | 23.9533 | 0.0000 |
Hour_21 | 0.1429 | 0.0084 | 16.9516 | 0.0000 |
Hour_22 | 0.1007 | 0.0082 | 12.2896 | 0.0000 |
Hour_23 | 0.0527 | 0.0080 | 6.5518 | 0.0000 |
Weekday_Monday | 0.0045 | 0.0043 | 1.0428 | 0.2970 |
Weekday_Saturday | -0.0629 | 0.0044 | -14.3093 | 0.0000 |
Weekday_Sunday | -0.1194 | 0.0045 | -26.4603 | 0.0000 |
Weekday_Thursday | 0.0064 | 0.0043 | 1.4919 | 0.1357 |
Weekday_Tuesday | 0.0101 | 0.0043 | 2.3315 | 0.0197 |
Weekday_Wednesday | 0.0024 | 0.0043 | 0.5591 | 0.5761 |
Month_2 | -0.1304 | 0.0058 | -22.5472 | 0.0000 |
Month_3 | -0.1705 | 0.0058 | -29.2185 | 0.0000 |
Month_4 | -0.1098 | 0.0064 | -17.1061 | 0.0000 |
Month_5 | -0.0547 | 0.0072 | -7.6049 | 0.0000 |
Month_6 | 0.0894 | 0.0084 | 10.6677 | 0.0000 |
Month_7 | 0.1439 | 0.0093 | 15.4391 | 0.0000 |
Month_8 | 0.1258 | 0.0093 | 13.5458 | 0.0000 |
Month_9 | 0.1115 | 0.0081 | 13.7816 | 0.0000 |
Month_10 | 0.1213 | 0.0069 | 17.5233 | 0.0000 |
Month_11 | 0.0594 | 0.0061 | 9.7015 | 0.0000 |
Month_12 | 0.0971 | 0.0057 | 17.0270 | 0.0000 |
log_demand | 0.3679 | 0.0162 | 22.7345 | 0.0000 |
temp | -0.0009 | 0.0000 | -20.4696 | 0.0000 |
s_other | 5.6336 | 0.1854 | 30.3879 | 0.0000 |
s_biomass | 0.0512 | 0.0403 | 1.2709 | 0.2038 |
s_solar | -0.0119 | 0.0042 | -2.8307 | 0.0046 |
s_hydroelettric | -0.0370 | 0.0021 | -17.4828 | 0.0000 |
s_wind | -0.0331 | 0.0015 | -22.1170 | 0.0000 |
s_nuclear | -0.0546 | 0.0042 | -12.9825 | 0.0000 |
s_fossil | 0.0215 | 0.0019 | 11.3573 | 0.0000 |
5.2 Task C.2
Let's provide an interpretation of the main coefficients of the estimated model. Are demand shocks equally expensive? Which energy source has the largest impact on the price? Compute the impact of a 1% increase in demand, distinguished by source. Which policies can we derive?
Function to predict the price impact of a demand increase.
predict_delta_St <- function(r = 0.05, temp = 25,
fossil = 0.2, nuclear = 0.2, wind = 0.2,
hydroelettric = 0.2, solar = 0.2){
# Model coefficients
coefs <- round(coefficients(impact_model), digits = 4)
# Current mix
weights <- c(fossil, nuclear, wind, hydroelettric, solar)
# Ensure normalization
weights <- weights / sum(weights)
# Compute price sensitivity
pred_Dt <- coefs["log_demand"] + coefs["temp"]*temp+
coefs["s_fossil"]*weights[1] +
coefs["s_nuclear"]*weights[2] +
coefs["s_wind"]*weights[3] +
coefs["s_hydroelettric"]*weights[4] +
coefs["s_solar"]*weights[5]
# Convert the elasticity into a percentage price change for a demand increase of r
(exp(pred_Dt * log(1 + r)) - 1)*100
}
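As an illustrative call (assuming `impact_model` has been fitted as above), the fossil-only case at 20° discussed below can be reproduced directly:
# 1% demand increase at 20°, fully met by fossil generation
predict_delta_St(r = 0.01, temp = 20,
fossil = 1, nuclear = 0, wind = 0,
hydroelettric = 0, solar = 0)
# ~0.37 (% price increase), matching the fossil case listed below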
Interpretation of the main coefficients:

- log_demand (0.3679): a 1% increase in electricity demand raises the price by about 0.37%, holding the generation mix constant. This is the baseline elasticity.
- temp ($-9 \times 10^{-4}$): as temperature rises, the marginal effect of demand on price decreases slightly. Price spikes driven by demand are less steep in warm (but not hot) weather, plausibly because of more solar production and less heating-related demand.
- s_fossil (0.0215): a 1% increase in the fossil share (e.g., from 40% to 41%) raises the price by approximately 0.0215%. Fossil plants are marginal cost setters: if extra demand is met by fossil generation, the price impact is higher because fossil is a high-marginal-cost technology.
- s_wind (-0.0331): a 1% increase in the wind share lowers the price by 0.0331%. Wind has essentially zero marginal cost and pushes more expensive plants out of the merit order.
- s_hydroelettric (-0.0370): the hydroelectric share reduces prices even more, reflecting both low marginal cost and dispatchability.
- s_nuclear (-0.0546): a 1% increase in the nuclear share lowers the price by 0.0546%. Nuclear is stable base-load with little price-setting influence, so demand met by nuclear moves the price much less.
- s_solar (-0.0119): solar reduces prices modestly, especially during midday hours when it is abundant.
- s_biomass (0.0512): not statistically significant, though directionally it appears slightly costly.
- s_other (5.6336): an extremely large effect, which may reflect marginal or rarely used expensive sources, such as imports.
Hence, demand shocks are not equally expensive: the price response depends on which generator ramps up. For example, if demand increases by 1% at a temperature of 20° and that extra demand is fully met by a single source, the price increases by:

- fossil: 0.37%.
- nuclear: 0.294%.
- wind: 0.316%.
- hydroelettric: 0.312%.
- solar: 0.337%.
5.3 Task C.3
Simulate the price impact of a 5% demand increase under three hypothetical supply scenarios, using the current mix as the baseline. The objective is to quantify the price response to a 5% increase in demand under the following supply mixes:
- Current mix: 37% fossil, 23% nuclear, 20% wind, 15% hydroelettric, 5% solar.
- Green mix 1: 20% fossil, 25% nuclear, 25% wind, 15% hydroelettric, 15% solar.
- Green mix 2: 10% fossil, 30% nuclear, 20% wind, 20% hydroelettric, 20% solar.
- Fully decarbonized mix: 0% fossil, 25% nuclear, 25% wind, 25% hydroelettric, 25% solar.
For each mix, compute the total elasticity and the expected log-price increase in percentage at different temperature levels, i.e. 0°, 10°, 15°, 20°, 25°, 30°, 35°.
Sensitivity scenarios.
# Percentage change to predict
r <- 0.05
# Temperature levels
Temp_levels <- c(0, 10, 15, 20, 25, 30, 35)
# Set the energy mix
energy_mix <- list()
# Current mix
energy_mix[[1]] <- c(fossil = 0.37, nuclear = 0.23, wind = 0.20, hydroelettric = 0.15, solar = 0.05)
# Green mix 1
energy_mix[[2]] <- c(fossil = 0.20, nuclear = 0.25, wind = 0.25, hydroelettric = 0.15, solar = 0.15)
# Green mix 2
energy_mix[[3]] <- c(fossil = 0.10, nuclear = 0.30, wind = 0.20, hydroelettric = 0.20, solar = 0.20)
# Fully decarbonized
energy_mix[[4]]<- c(fossil = 0, nuclear = 0.25, wind = 0.25, hydroelettric = 0.25, solar = 0.25)
names(energy_mix) <- c("Current Mix", "Green Mix 1", "Green Mix 2", "Fully Decarbonized")
# Initialize
kable <- tibble()
# Predict for different temperature levels
for(i in 1:4){
pred_list <- list()
for(temp in Temp_levels){
mix <- energy_mix[[i]]
new_pred <- predict_delta_St(r = r, temp = temp, fossil = mix[1],
nuclear = mix[2], wind = mix[3],
hydroelettric = mix[4], solar = mix[5])
pred_list <- c(pred_list, new_pred)
}
pred <- paste0(round(unlist(pred_list), 2), " %")
names(pred) <- paste0(Temp_levels, "°")
kable <- bind_rows(kable, bind_cols(Mix = names(energy_mix[i]), bind_rows(pred)))
}
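The scenario table below is then rendered in the same style as the other tables in this document, for example:
knitr::kable(kable, booktabs = TRUE, escape = FALSE, align = 'c') %>%
kableExtra::row_spec(0, color = "white", background = "green")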
Table sensitivity scenarios.
Mix | 0° | 10° | 15° | 20° | 25° | 30° | 35° |
---|---|---|---|---|---|---|---|
Current Mix | 1.72 % | 1.68 % | 1.66 % | 1.64 % | 1.61 % | 1.59 % | 1.57 % |
Green Mix 1 | 1.69 % | 1.64 % | 1.62 % | 1.6 % | 1.58 % | 1.55 % | 1.53 % |
Green Mix 2 | 1.66 % | 1.61 % | 1.59 % | 1.57 % | 1.55 % | 1.53 % | 1.5 % |
Fully Decarbonized | 1.64 % | 1.6 % | 1.57 % | 1.55 % | 1.53 % | 1.51 % | 1.49 % |
So a more renewable-intensive system reduces the price pressure from demand shocks. Some policy takeaways:

- Fossil supply tightens the price-demand link: if demand is met by fossil, prices become much more sensitive.
- Low-marginal-cost sources reduce price volatility: wind, solar, hydro, and nuclear dampen the price response.
- Decarbonization reduces price spikes: shifting the marginal supply away from fossil sources leads to more stable prices.
Citation
@online{sartini2025,
author = {Sartini, Beniamino},
date = {2025-06-21},
url = {https://greenfin.it/projects/project2025.html},
langid = {en}
}