The dataset contains renewable power generation and weather conditions. The original dataset can be downloaded from Kaggle. The variable Energy delta[Wh] is not a power variable: it is an energy increment measured in Watt-hours between two consecutive timestamps. Since the raw frequency is 15 minutes, each raw observation describes the energy accumulated over the previous 15-minute interval. We denote this raw increment by \(R_q^{15m}\).
The objective of the project is to measure how renewable weather conditions, especially solar irradiation, affect the hourly energy delta. For this reason the analysis is performed on hourly data and only during daylight. Keeping night observations would mostly add structural zeros: when the sun is not in the sky, solar irradiation is zero and the energy delta is not informative about the marginal impact of renewable conditions.
The main unit issue is that GHI is measured in \(W/m^2\), i.e. power per square meter, while Energy delta[Wh] is measured in Wh, i.e. energy. To make them compatible we aggregate the data by hour. The hourly energy delta is obtained by summing Wh over the four 15-minute intervals: \[
R_h = \sum_{q \in h} R_q^{15m}.
\] Instead, solar irradiation is converted into hourly solar exposure by multiplying each GHI observation by the interval length \(\Delta t = 0.25\) hours and summing inside the hour: \[
X_h = \sum_{q \in h} GHI_q \cdot 0.25.
\] Hence, \(X_h\) is measured in \(Wh/m^2\) and can be compared with an hourly energy outcome. The comparison is still not one-to-one because the panel area and conversion efficiency are not observed, but the units are coherent: both variables are energy over an hourly interval.
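The two aggregation rules above can be sketched in base R on a toy 15-minute table. The numbers are made up, the column name `solar_exposure_Wh_m2` follows the name used later in the project, and assigning each timestamp to an hour by truncation is an assumption, since the project's exact rule is not shown here.

```r
# Toy 15-minute data covering two daylight hours (made-up values).
raw <- data.frame(
  Time = as.POSIXct("2022-06-01 10:00:00", tz = "UTC") + 900 * (0:7),
  energy_delta_Wh = c(900, 950, 1000, 1050, 1100, 1150, 1200, 1250),
  GHI = c(400, 420, 440, 460, 500, 520, 540, 560)
)

dt_hours <- 0.25  # length of one raw interval, in hours

# ASSUMPTION: each timestamp is assigned to the hour obtained by truncation.
raw$hour <- format(raw$Time, "%Y-%m-%d %H:00")

# Each GHI reading (W/m^2) times 0.25 h gives Wh/m^2 for that interval.
raw$solar_exposure_Wh_m2 <- raw$GHI * dt_hours

# R_h and X_h: sums of the 15-minute values inside each hour.
hourly <- aggregate(cbind(energy_delta_Wh, solar_exposure_Wh_m2) ~ hour,
                    data = raw, FUN = sum)
print(hourly)
```

With four complete readings per hour, \(X_h\) equals the hourly mean of GHI expressed in \(Wh/m^2\), which is why the two quantities carry nearly the same information.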
Tip 1: Description of the variables
Time: datetime of the observation.
Energy delta[Wh]: energy increment in Watt-hours (Wh) from the previous timestamp to the current timestamp, denoted by \(R_q^{15m}\) in the raw 15-minute data.
GHI: Global Horizontal Irradiance in Watts per square meter (\(W/m^2\)) measured by a pyranometer.
temp: The temperature in degrees Celsius (\(^{\circ}\text{C}\)) measured at the same height as the pyranometer.
pressure: The atmospheric pressure in hectopascals (hPa) measured at the same height as the pyranometer.
humidity: The relative humidity in percentage (%) measured at the same height as the pyranometer.
wind_speed: The wind speed in meters per second (m/s) measured at the same height as the pyranometer.
rain_1h: The amount of precipitation in millimeters (mm) measured over the past hour.
snow_1h: The amount of snowfall in millimeters (mm) measured over the past hour.
clouds_all: cloud cover in percentage.
isSun: Indicator equal to 1 when the observation is during sunlight time.
sunlightTime: Elapsed sunlight time.
dayLength: Length of the day.
SunlightTime/daylength: Ratio between elapsed sunlight time and the full day length.
df_units <- tibble::tribble(
  ~Variable, ~Raw_unit, ~Hourly_transformation, ~Hourly_unit, ~Role,
  "Energy delta", "Wh per 15 minutes", "Sum inside each daylight hour", "Wh per hour", "Dependent variable",
  "GHI", "W/m^2", "Sum GHI x 0.25 hours inside each daylight hour", "Wh/m^2 per hour", "Main renewable regressor",
  "Temperature", "C", "Average inside the hour", "C", "Weather control",
  "Pressure", "hPa", "Average inside the hour", "hPa", "Weather control",
  "Humidity", "%", "Average inside the hour", "%", "Weather control",
  "Wind speed", "m/s", "Average inside the hour", "m/s", "Weather control",
  "Rain and snow", "mm over the previous hour", "Maximum value inside the hour", "mm", "Weather control",
  "Cloud cover", "%", "Average inside the hour", "%", "Weather control"
)
df_units %>%
  knitr::kable(booktabs = TRUE, escape = FALSE, align = 'c') %>%
  kableExtra::row_spec(0, color = "white", background = "green")
| Variable | Raw_unit | Hourly_transformation | Hourly_unit | Role |
|---|---|---|---|---|
| Energy delta | Wh per 15 minutes | Sum inside each daylight hour | Wh per hour | Dependent variable |
| GHI | W/m^2 | Sum GHI x 0.25 hours inside each daylight hour | Wh/m^2 per hour | Main renewable regressor |
| Temperature | C | Average inside the hour | C | Weather control |
| Pressure | hPa | Average inside the hour | hPa | Weather control |
| Humidity | % | Average inside the hour | % | Weather control |
| Wind speed | m/s | Average inside the hour | m/s | Weather control |
| Rain and snow | mm over the previous hour | Maximum value inside the hour | mm | Weather control |
| Cloud cover | % | Average inside the hour | % | Weather control |
Table 2: Unit compatibility used in the project.
3 Part A: Descriptive analysis
Consider the hourly daylight energy delta \(R_h\) and the weather variables. The first objective is to understand when the energy delta is positive, how it changes during daylight hours and how much it depends on solar exposure and cloud conditions.
3.1 Task A.1
Compute the main descriptive statistics of the hourly daylight energy delta \(R_h\), i.e. total energy in kWh, mean, median, maximum, standard deviation, the percentage of daylight hours with zero energy and the percentage of daylight hours with positive energy. Then plot the empirical distribution of the positive hourly energy delta. Is the distribution symmetric? Comment on the result (max 150 words).
# Histogram of the strictly positive hourly energy deltas
data %>%
  dplyr::filter(energy_delta_Wh > 0) %>%
  ggplot() +
  geom_histogram(aes(energy_delta_Wh), bins = 60, fill = "darkgreen", color = "black") +
  theme_bw() +
  custom_theme +
  labs(x = "Hourly energy delta (Wh)", y = "Frequency")
Figure 1: Empirical distribution of positive hourly daylight renewable energy delta.
Solution: distribution
response <- paste0(
  "After removing night observations and aggregating by hour, the data contain ",
  scales::comma(df_A1$observations), " daylight hours from ", df_A1$start, " to ", df_A1$end, ". ",
  "The total energy delta is ", round(df_A1$total_kWh, 2),
  " kWh. The average hourly value is ", round(df_A1$mean_Wh, 2),
  " Wh, while the median is ", round(df_A1$median_Wh, 2), " Wh. ",
  "The average hourly solar exposure is ", round(df_A1$mean_solar_exposure, 2),
  " Wh/m^2. The share of zero daylight hours is ", round(df_A1$zero_share * 100, 2),
  " %, and the share of positive observations is ", round(df_A1$positive_share * 100, 2),
  " %. The distribution is not symmetric: ",
  "it has a mass close to zero and a long right tail. Since night hours were removed, the remaining zeros are mainly sunrise/sunset or poor-radiation hours, not mechanical night zeros."
)
After removing night observations and aggregating by hour, the data contain 27,132 daylight hours from 2017-01-01 to 2022-08-31. The total energy delta is 112754.27 kWh. The average hourly value is 4155.77 Wh, while the median is 1924.5 Wh. The average hourly solar exposure is 59.1 Wh/m^2. The share of zero daylight hours is 5.44 %, and the share of positive observations is 94.56 %. The distribution is not symmetric: it has a mass close to zero and a long right tail. Since night hours were removed, the remaining zeros are mainly sunrise/sunset or poor-radiation hours, not mechanical night zeros.
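As a minimal sketch of how the statistics requested in Task A.1 could be computed, the base R snippet below uses a short made-up vector in place of the real hourly series. The object name `stats_A1` and its column names are illustrative and are not taken from the project's solution.

```r
# Hypothetical hourly daylight values of R_h in Wh (made-up numbers).
energy_delta_Wh <- c(0, 120, 800, 1900, 2500, 4200, 9800, 15000)

stats_A1 <- data.frame(
  observations   = length(energy_delta_Wh),
  total_kWh      = sum(energy_delta_Wh) / 1000,   # Wh -> kWh
  mean_Wh        = mean(energy_delta_Wh),
  median_Wh      = median(energy_delta_Wh),
  max_Wh         = max(energy_delta_Wh),
  sd_Wh          = sd(energy_delta_Wh),
  zero_share     = mean(energy_delta_Wh == 0),    # share of zero hours
  positive_share = mean(energy_delta_Wh > 0)      # share of positive hours
)
print(stats_A1)

# A mean well above the median is a quick numerical sign of a right tail.
right_skewed <- stats_A1$mean_Wh > stats_A1$median_Wh
```

The same mean-versus-median comparison applies to the reported figures: a mean of 4155.77 Wh against a median of 1924.5 Wh is consistent with the long right tail seen in Figure 1.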
3.2 Task A.2
Aggregate \(R_h\) by month and by hour. Compute the total monthly energy delta, the monthly average hourly energy delta, the monthly average solar exposure and the percentage of positive daylight hours. Then compute the average energy delta for each pair month-hour. In which month is the total energy delta maximum? At which hour is the average energy delta maximum? Plot the monthly hourly profiles.
# Mean hourly energy delta by hour of day, one panel per month
df_A2_hour %>%
  ggplot() +
  geom_line(aes(Hour, mean_delta), color = "darkgreen") +
  geom_point(aes(Hour, mean_delta), color = "black", size = 0.6) +
  facet_wrap(~Month_) +
  theme_bw() +
  custom_theme +
  scale_x_continuous(breaks = seq(0, 24, 4)) +
  labs(x = "Hour", y = "Mean hourly energy delta (Wh)")
Figure 2: Average hourly daylight renewable energy delta by month and hour.
The month with the largest total energy delta is Jun, with \(1.59192 \times 10^{4}\) kWh. The hour with the largest average energy delta is 10:00, with 7838.97 Wh. Since the data are daylight-only, the profile should be read as a production profile conditional on the sun being in the sky: the low values at the edges of the day are sunrise and sunset effects, not night observations.
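The month and hour aggregations of Task A.2 can be illustrated with a small base R sketch. The toy table and the object names (`monthly_total_kWh`, `hour_mean`, `profile`) are assumptions made for the example, not the project's own objects.

```r
# Toy hourly daylight data (made-up values).
hourly <- data.frame(
  Month = c("Jan", "Jan", "Jan", "Jun", "Jun", "Jun"),
  Hour  = c(9, 12, 15, 9, 12, 15),
  energy_delta_Wh = c(500, 2000, 300, 4000, 9000, 5000)
)

# Total energy delta per month, converted from Wh to kWh.
monthly_total_kWh <- tapply(hourly$energy_delta_Wh, hourly$Month, sum) / 1000
best_month <- names(which.max(monthly_total_kWh))

# Average energy delta per hour of the day, pooled over months.
hour_mean <- tapply(hourly$energy_delta_Wh, hourly$Hour, mean)
best_hour <- as.integer(names(which.max(hour_mean)))

# Mean energy delta for each month-hour pair (the profiles behind Figure 2).
profile <- aggregate(energy_delta_Wh ~ Month + Hour, data = hourly, FUN = mean)
print(profile)
```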
3.3 Task A.3
Group the data by cloud cover and daylight position. Use four cloud-cover groups: 0-25%, 25-50%, 50-75% and 75-100%. Use three daylight-position groups: morning, central day and afternoon, defined from the ratio between elapsed sunlight time and day length. Compute the mean energy delta, the median energy delta and the percentage of positive observations for each group. Which condition produces the highest average energy delta? Which condition produces the lowest one? Comment on the result (max 150 words).
# Dodged bars: mean energy delta by cloud band, split by daylight position
df_A3 %>%
  ggplot() +
  geom_bar(aes(cloud_band, mean_Wh, fill = daylight_band),
           stat = "identity", position = "dodge", color = "black") +
  theme_bw() +
  custom_theme +
  scale_fill_manual(values = c("Morning" = "darkorange", "Central day" = "darkgreen",
                               "Afternoon" = "gray50")) +
  labs(x = "Cloud cover", y = "Mean hourly energy delta (Wh)", fill = NULL)
Figure 3: Mean hourly renewable energy delta by cloud cover and daylight position.
Solution: cloud effect
best_cloud <- df_A3 %>%
  dplyr::arrange(desc(mean_Wh)) %>%
  dplyr::slice(1)
worst_cloud <- df_A3 %>%
  dplyr::arrange(mean_Wh) %>%
  dplyr::slice(1)
response <- paste0(
  "The largest average energy delta is obtained with cloud cover in the ", best_cloud$cloud_band,
  " group and condition ", best_cloud$daylight_band, ", with ", round(best_cloud$mean_Wh, 2), " Wh. ",
  "The lowest average energy delta is obtained with cloud cover in the ", worst_cloud$cloud_band,
  " group and condition ", worst_cloud$daylight_band, ", with ", round(worst_cloud$mean_Wh, 2), " Wh. ",
  "The result supports the expected interpretation: even after removing the night, the position of the sun within the day and cloud cover are crucial. Central daylight hours have more solar exposure, while high cloud cover reduces the energy increment."
)
The largest average energy delta is obtained with cloud cover in the 0-25% group and condition Central day, with 13496.28 Wh. The lowest average energy delta is obtained with cloud cover in the 75-100% group and condition Morning, with 1577 Wh. The result supports the expected interpretation: even after removing the night, the position of the sun within the day and cloud cover are crucial. Central daylight hours have more solar exposure, while high cloud cover reduces the energy increment.
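The grouping used in Task A.3 can be sketched with `cut()` in base R. The cloud-cover breaks come from the task statement, while the thresholds for the daylight position (thirds of the sunlight ratio) and the handling of boundary values are assumptions, since the project does not state them here. The data are made up.

```r
# Toy hourly daylight observations (made-up values).
toy <- data.frame(
  clouds_all      = c(0, 10, 25, 40, 50, 60, 75, 90, 100),
  sunlight_ratio  = c(0.1, 0.5, 0.9, 0.2, 0.5, 0.8, 0.3, 0.5, 0.7),
  energy_delta_Wh = c(3000, 12000, 2500, 2000, 8000, 1800, 900, 4000, 700)
)

# Four cloud-cover groups; intervals are right-closed, and
# include.lowest = TRUE keeps 0% inside the first group.
toy$cloud_band <- cut(toy$clouds_all, breaks = c(0, 25, 50, 75, 100),
                      labels = c("0-25%", "25-50%", "50-75%", "75-100%"),
                      include.lowest = TRUE)

# ASSUMPTION: morning / central day / afternoon are thirds of the ratio.
toy$daylight_band <- cut(toy$sunlight_ratio, breaks = c(0, 1/3, 2/3, 1),
                         labels = c("Morning", "Central day", "Afternoon"),
                         include.lowest = TRUE)

# Mean energy delta for each observed combination of the two groupings.
summary_A3 <- aggregate(energy_delta_Wh ~ cloud_band + daylight_band,
                        data = toy, FUN = mean)
best  <- summary_A3[which.max(summary_A3$energy_delta_Wh), ]
worst <- summary_A3[which.min(summary_A3$energy_delta_Wh), ]
```

The median and the share of positive observations can be obtained the same way by swapping `FUN`.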
4 Part B: Renewable generation and weather
We now focus on the statistical link between renewable conditions and hourly energy delta. The variable solar_exposure_Wh_m2 is used as the main renewable intensity variable because it converts GHI from power density into hourly solar energy density. The objective is to understand how much of the variation in \(R_h\) is explained by solar exposure, weather and daylight conditions.
4.1 Task B.1
Compute the correlation between \(R_h\) and each weather variable: solar_exposure_Wh_m2, GHI_W_m2, temp, pressure, humidity, wind_speed, rain_1h, snow_1h, clouds_all, sunlight_ratio and n_15min. Rank the variables by absolute correlation. Which variable is most associated with energy delta? Plot \(R_h\) against hourly solar exposure.
# Subsample at most 20,000 hours to keep the scatter plot readable
set.seed(2026)
df_B1_plot <- data %>%
  dplyr::slice_sample(n = min(20000, nrow(data)))
df_B1_plot %>%
  ggplot() +
  geom_point(aes(solar_exposure_Wh_m2, energy_delta_Wh), alpha = 0.08, color = "darkgreen") +
  geom_smooth(aes(solar_exposure_Wh_m2, energy_delta_Wh), color = "black", se = FALSE) +
  theme_bw() +
  custom_theme +
  labs(x = "Solar exposure (Wh/m^2)", y = "Hourly energy delta (Wh)")
Figure 4: Hourly renewable energy delta and hourly solar exposure.
The variable with the largest absolute correlation with the energy delta is GHI_W_m2, with correlation 0.9068. The positive relation with solar exposure is expected because higher hourly irradiation means more available renewable input. Notice that this is a statistical relation, not a physical efficiency estimate: panel area and conversion efficiency are not observed.
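A minimal sketch of the ranking by absolute correlation asked in Task B.1, written in base R on simulated data. The data-generating numbers are arbitrary and only three regressors are included; the real task uses the full list of weather variables.

```r
set.seed(1)
n <- 200

# Simulated regressors and outcome (arbitrary, for illustration only).
toy <- data.frame(
  solar_exposure_Wh_m2 = runif(n, 0, 250),
  clouds_all = runif(n, 0, 100),
  temp = rnorm(n, 15, 8)
)
toy$energy_delta_Wh <- 30 * toy$solar_exposure_Wh_m2 - 5 * toy$clouds_all +
  rnorm(n, sd = 300)

# Pearson correlation of each regressor with the energy delta.
regressors <- c("solar_exposure_Wh_m2", "clouds_all", "temp")
cors <- cor(toy[, regressors], toy$energy_delta_Wh)[, 1]

# Rank by absolute value, keeping the sign available in `cors`.
ranking <- sort(abs(cors), decreasing = TRUE)
print(round(ranking, 3))
```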
4.2 Task B.2
Fit a linear model for \(\log(R_h + 1)\) using the weather variables and seasonal controls for hour and month: \[
\log(R_h + 1) = \beta_0 + \beta_1 x_h + \beta_2 x_h^2 + \boldsymbol{\gamma}'W_h + \text{Hour}_h + \text{Month}_h + \varepsilon_h
\text{,}
\] where \(x_h = \log(X_h + 1)\) and \(X_h\) is hourly solar exposure in \(Wh/m^2\). The vector \(W_h\) contains temperature, pressure, humidity, wind speed, rain, snow, cloud cover, daylight ratio and the number of 15-minute daylight observations inside the hour. Estimate the model on 80% of the data and compute the root mean squared error on the remaining 20%. Which variables have the expected sign?
# Back-transform the log-scale predictions to Wh, then subsample for plotting
df_B2_plot <- test_B2 %>%
  dplyr::mutate(pred_delta = exp(pred_B2) - 1) %>%
  dplyr::slice_sample(n = min(10000, nrow(test_B2)))
df_B2_plot %>%
  ggplot() +
  geom_point(aes(energy_delta_Wh, pred_delta), alpha = 0.08, color = "darkgreen") +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  theme_bw() +
  custom_theme +
  labs(x = "Observed hourly energy delta (Wh)", y = "Predicted hourly energy delta (Wh)")
Figure 6: Observed and predicted renewable energy delta in the test sample.
The coefficient of log_exposure is expected to be positive because larger hourly solar exposure should increase the energy increment. The quadratic term allows the effect to flatten at high exposure levels, which is realistic because the system's conversion capacity is limited. Cloud cover, rain and snow should be interpreted as controls: their coefficients are conditional on measured exposure and seasonal controls, so their sign is not a direct physical efficiency parameter. The model captures a large part of the variation in \(\log(R_h+1)\), with an out-of-sample RMSE of 1.131.
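The estimation and validation steps of this task can be sketched as follows in base R. The data are simulated, only one weather control is included, the hour and month controls are omitted, and the random 80/20 split is an assumption: the project does not say here whether the split is random or chronological.

```r
set.seed(2026)
n <- 500

# Simulated hourly data (arbitrary coefficients, for illustration only).
toy <- data.frame(
  solar_exposure_Wh_m2 = runif(n, 0, 250),
  clouds_all = runif(n, 0, 100)
)
toy$log_exposure <- log(toy$solar_exposure_Wh_m2 + 1)
toy$log_delta <- 1 + 1.2 * toy$log_exposure - 0.002 * toy$clouds_all +
  rnorm(n, sd = 0.3)

# ASSUMPTION: random 80/20 split between training and test samples.
train_id <- sample(n, size = floor(0.8 * n))
train <- toy[train_id, ]
test  <- toy[-train_id, ]

# Linear model in logs with a quadratic term in log exposure.
fit <- lm(log_delta ~ log_exposure + I(log_exposure^2) + clouds_all, data = train)

# Out-of-sample RMSE on the log scale.
pred_log <- predict(fit, newdata = test)
rmse_log <- sqrt(mean((test$log_delta - pred_log)^2))

# Back-transformation to Wh, mirroring exp(pred) - 1 in the plot code.
pred_Wh <- expm1(pred_log)
```

Note that an RMSE computed this way is in log units, so it is not directly comparable with errors measured in Wh.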
5 Task C
Suppose we want to quantify the impact of renewable conditions on the hourly energy delta. In practice, we ask the following question:
“What happens to the hourly energy delta if solar exposure increases by 1%, holding weather and seasonal controls fixed?”
To capture this impact, we estimate a model with interactions between log solar exposure and the main weather variables: \[
\log(R_h + 1) = \beta_0 + \beta_1 x_h + \beta_2 x_h^2 + \gamma_1 x_h L_h + \gamma_2 x_h C_h + \gamma_3 x_h T_h + \boldsymbol{\delta}'Z_h + \varepsilon_h
\text{,}
\] where \(x_h=\log(X_h+1)\), \(X_h\) is hourly solar exposure in \(Wh/m^2\), \(L_h\) is the daylight ratio, \(C_h\) is cloud cover, \(T_h\) is temperature and \(Z_h\) contains the remaining controls. Taking the derivative with respect to \(x_h\) we obtain: \[
\partial_{x_h}\log(R_h + 1) = \beta_1 + 2\beta_2x_h + \gamma_1 L_h + \gamma_2 C_h + \gamma_3 T_h
\text{.}
\tag{1}\]
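Equation 1 can be read as an approximate elasticity. The lines below are a first-order sketch under two assumptions that are not stated in the project: \(X_h\) and \(R_h\) are large enough for the \(+1\) offsets to be negligible, and the marginal effect is treated as constant over the size of the shock (the curvature induced by \(\beta_2\) is ignored). The symbols \(ME_h\), \(p\) and \(R_h^{new}\) are introduced here only for this illustration. Writing \(ME_h\) for the right-hand side of Equation 1 and \(p\) for a proportional increase in \(X_h\): \[
\Delta x_h = \log\big((1+p)X_h + 1\big) - \log(X_h + 1) \approx \log(1+p),
\qquad
\frac{R_h^{new} + 1}{R_h + 1} \approx \exp\big(ME_h \log(1+p)\big) = (1+p)^{ME_h}
\text{.}
\] For \(p = 0.01\) this gives \(100\,(1.01^{ME_h} - 1) \approx ME_h\), i.e. a 1% increase in solar exposure changes the energy delta by roughly \(ME_h\) percent.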
5.1 Task C.1
Estimate the interaction model above. Report the model statistics and the coefficients directly linked with the renewable impact, i.e. log_exposure, log_exposure^2, log_exposure:sunlight_ratio, log_exposure:clouds_all and log_exposure:temp. Are the interactions economically reasonable?
The model explicitly allows the marginal impact of solar exposure to depend on the exposure level itself, on the position within the day (daylight ratio), on cloud cover and on temperature, as shown by Equation 1. A positive log_exposure effect means that higher hourly solar exposure increases the expected energy delta. The negative quadratic term, when present, means decreasing marginal returns: increasing exposure matters, but the incremental percentage gain is smaller when exposure is already high. The interaction terms must be interpreted conditionally on measured exposure and seasonality, not as standalone physical laws.
5.2 Task C.2
Using Equation 1, define a function that computes the marginal effect of log solar exposure on \(\log(R_h+1)\). Then compute the expected percentage change in energy delta after a 10% increase in hourly solar exposure under three situations:
Clear day: high daylight ratio and low cloud cover.
Mixed day: intermediate daylight ratio and intermediate cloud cover.
Cloudy day: low daylight ratio and high cloud cover.
Are renewable shocks equally productive in percentage terms?
Table 11: Impact of a 10% increase in hourly solar exposure under different weather situations.
Solution: renewable shock
best_C2 <- df_C2 %>%
  dplyr::arrange(desc(impact_10pct_exposure)) %>%
  dplyr::slice(1)
worst_C2 <- df_C2 %>%
  dplyr::arrange(impact_10pct_exposure) %>%
  dplyr::slice(1)
response <- paste0(
  "Renewable shocks are not equally productive in percentage terms. The largest relative impact is obtained in the ",
  best_C2$Scenario, " scenario, where a 10% increase in hourly solar exposure implies an expected ",
  round(best_C2$impact_10pct_exposure, 2), "% increase in energy delta. ",
  "The smallest relative impact is obtained in the ", worst_C2$Scenario,
  " scenario, where the same exposure shock implies an expected ",
  round(worst_C2$impact_10pct_exposure, 2),
  "% change. A larger percentage effect does not necessarily mean larger production in levels: low-exposure hours can have high relative sensitivity but still low expected Wh."
)
Renewable shocks are not equally productive in percentage terms. The largest relative impact is obtained in the Cloudy day scenario, where a 10% increase in hourly solar exposure implies an expected 10.08% increase in energy delta. The smallest relative impact is obtained in the Clear day scenario, where the same exposure shock implies an expected 4.48% change. A larger percentage effect does not necessarily mean larger production in levels: low-exposure hours can have high relative sensitivity but still low expected Wh.
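A possible shape for the marginal-effect function requested in Task C.2 is sketched below in base R. The coefficient values are hypothetical placeholders, not the estimates from the project, and the scenario values for exposure, daylight ratio, cloud cover and temperature are made up. The conversion of the marginal effect into a percentage change uses \((1+p)^{ME} - 1\), which is an assumption: the formula used in the project's solution is not shown.

```r
# HYPOTHETICAL coefficients (not the project's estimates):
# b1, b2 multiply x and x^2; g_L, g_C, g_T are the interaction coefficients.
coefs <- c(b1 = 1.5, b2 = -0.08, g_L = 0.10, g_C = 0.002, g_T = -0.001)

# Marginal effect of log exposure on log(R_h + 1), following Equation 1.
marginal_effect <- function(x, L, C, Temp, coefs) {
  coefs[["b1"]] + 2 * coefs[["b2"]] * x +
    coefs[["g_L"]] * L + coefs[["g_C"]] * C + coefs[["g_T"]] * Temp
}

# ASSUMPTION: percentage change implied by a proportional exposure shock.
impact_pct <- function(me, shock = 0.10) 100 * ((1 + shock)^me - 1)

# Made-up scenario values: exposure in Wh/m^2, ratio in [0, 1], clouds in %.
scenarios <- data.frame(
  Scenario = c("Clear day", "Mixed day", "Cloudy day"),
  x = log(c(200, 80, 15) + 1),
  L = c(0.8, 0.5, 0.2),
  C = c(10, 50, 90),
  Temp = 20
)
scenarios$me <- marginal_effect(scenarios$x, scenarios$L, scenarios$C,
                                scenarios$Temp, coefs)
scenarios$impact_10pct_exposure <- impact_pct(scenarios$me)
print(scenarios)
```

With a negative quadratic coefficient, low-exposure scenarios get a larger marginal effect, which is the mechanism behind the larger relative impact reported for the cloudy day.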
5.3 Task C.3
Simulate the expected hourly energy delta under 4 renewable-weather scenarios and 4 temperature levels. Consider a representative observation at 12:00 in June of the last available year. Use the estimated model to compute the expected energy delta in Wh.
Current average: average daylight observations in the data.
Moderate renewable gain: 10% higher hourly solar exposure, slightly higher daylight ratio and lower cloud cover.
Strong renewable gain: 25% higher hourly solar exposure, higher daylight ratio and much lower cloud cover.
Cloud constrained: 10% lower hourly solar exposure and higher cloud cover.
What is the expected effect of better renewable conditions on energy delta?
Table 12: Sensitivity scenarios for renewable energy delta.
Better renewable conditions increase expected hourly energy delta in all temperature scenarios. The strongest gain is obtained when solar exposure increases and cloud cover falls at the same time, because the model transmits renewable conditions through both the direct exposure coefficient and the interaction terms. The cloud constrained case shows the opposite mechanism: weaker hourly exposure and more cloud cover reduce the expected energy delta. Temperature is kept as a scenario variable because it also affects observed generation conditions, but the policy interpretation should focus on solar exposure and cloudiness.
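The scenario grid of Task C.3 can be sketched in base R as a cross join between weather scenarios and temperature levels, followed by a prediction step. Everything numeric below is hypothetical: the model is fitted on simulated data, the baseline exposure, cloud-cover values and temperature levels are placeholders, and the daylight ratio and the hour, month and year controls are left out to keep the example short.

```r
set.seed(42)
n <- 400

# Simulated training data and a small stand-in model (illustration only).
toy <- data.frame(
  log_exposure = log(runif(n, 1, 250) + 1),
  clouds_all = runif(n, 0, 100),
  temp = rnorm(n, 15, 8)
)
toy$log_delta <- 2 + 1.1 * toy$log_exposure -
  0.0005 * toy$log_exposure * toy$clouds_all + rnorm(n, sd = 0.2)
fit <- lm(log_delta ~ log_exposure * clouds_all + temp, data = toy)

# HYPOTHETICAL scenario definitions relative to a baseline exposure.
base_exposure <- 100  # Wh/m^2, placeholder for the data average
scen <- data.frame(
  Scenario = c("Current average", "Moderate renewable gain",
               "Strong renewable gain", "Cloud constrained"),
  exposure_mult = c(1, 1.10, 1.25, 0.90),
  clouds_all = c(50, 40, 20, 70)
)

# merge() without common columns returns the Cartesian product: 4 x 4 rows.
grid <- merge(scen, data.frame(temp = c(5, 15, 25, 35)))
grid$log_exposure <- log(base_exposure * grid$exposure_mult + 1)

# Expected energy delta in Wh, back-transformed as exp(pred) - 1.
grid$expected_Wh <- expm1(predict(fit, newdata = grid))
print(grid[order(grid$temp, grid$Scenario), c("Scenario", "temp", "expected_Wh")])
```

The back-transformation follows the convention used in the plot code of Part B and does not correct for retransformation bias.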