From Isolated Data Mining to a Complete Knowledge Discovery Process
A key starting point is the distinction between data mining and the broader knowledge discovery process. The latter does not coincide with the mere selection of an algorithm: it includes objective definition, data selection, cleaning and preprocessing, dimensionality reduction, interpretation of results, and consolidation into formats suitable for further analysis and reporting. Within this perspective, interactivity is not a secondary feature: human supervision plays a role in validating patterns, discarding spurious correlations, and realigning models and objectives with the operational context. Against this background, the main classes of methods used to extract information from massive datasets are discussed, including classification and regression techniques, clustering, probabilistic and graphical models, and change and anomaly detection strategies. The focus remains anchored to the specific conditions of smart grids, where data sources are intrinsically heterogeneous and informational value often depends on how streams are integrated, compressed, and made queryable.
Managing Complexity: Heterogeneity, Volume, and Cardinality Reduction
A central aspect concerns complexity management prior to learning. Measurements from SCADA, PMU, and AMI systems exhibit different sampling rates, granularities, and semantics. Moreover, increasing volume may generate bottlenecks in transmission and storage, especially where communication infrastructures are unevenly reinforced across the grid.
To make the data tractable, cardinality reduction is addressed in two complementary forms: feature reduction and sample reduction. On the feature side, methods such as Principal Component Analysis and feature selection techniques based on mutual information are examined, with emphasis on criteria balancing relevance to the target and redundancy among variables, such as mRMR. On the sample side, strategies including sampling, clustering, and binning are considered, particularly when selecting or summarising data proves preferable to transformations in the variable space.
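The greedy relevance-minus-redundancy logic behind mRMR can be sketched in a few lines. This is a minimal illustration, not the implementation discussed in the text: absolute Pearson correlation stands in for the mutual-information estimates mRMR normally uses, and all function and variable names are hypothetical.

```python
import numpy as np

def mrmr_select(X, y, k):
    """Greedy mRMR-style selection: at each step pick the feature that
    maximises relevance to the target minus mean redundancy with the
    features already chosen. Absolute correlation stands in for mutual
    information to keep the sketch dependency-light."""
    n_features = X.shape[1]
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean(
                [abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in selected]
            )
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

# Toy data: feature 0 drives the target, feature 1 is a near-duplicate
# of feature 0, feature 2 carries weaker but independent information.
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x2 = rng.normal(size=200)
X = np.column_stack([x0, x0 + 0.01 * rng.normal(size=200), x2])
y = x0 + 0.3 * x2 + 0.1 * rng.normal(size=200)
sel = mrmr_select(X, y, 2)
# The second pick skips the redundant twin in favour of feature 2.
```

The redundancy penalty is what distinguishes this family of criteria from a pure relevance ranking, which would pick the near-duplicate feature second.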
A KDD-Driven Forecasting Pipeline for Multi-Step Load Prediction
The second component makes the discussion concrete through a forecasting methodology designed as a complete workflow, from data acquisition to feedback generation and visualisation. The pipeline includes the transformation of temporal information into numerical predictors, a feature engineering phase to stabilise and enrich signals, the conversion of the time series problem into a supervised learning task through embedding with lags, delays, and forecast horizons, and a dimensionality reduction phase to control the explosive growth of variables induced by temporal shifts.
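The embedding step that converts a time series into a supervised learning task can be sketched as follows. The helper is a hypothetical minimal version, assuming a univariate series, a fixed number of lags, and a single forecast horizon; the thesis pipeline is richer than this.

```python
import numpy as np

def embed(series, n_lags, horizon):
    """Turn a univariate series into a supervised matrix: each row holds
    n_lags consecutive past values, and the target is the observation
    `horizon` steps after the last lag."""
    X, y = [], []
    for t in range(n_lags, len(series) - horizon + 1):
        X.append(series[t - n_lags:t])   # lagged predictors
        y.append(series[t + horizon - 1])  # value to forecast
    return np.array(X), np.array(y)

series = np.arange(10.0)  # 0, 1, ..., 9
X, y = embed(series, n_lags=3, horizon=2)
# First row: lags (0, 1, 2) predict the value 2 steps ahead, i.e. 4.
```

Note how the predictor matrix grows with every extra lag and horizon, which is precisely the variable growth the dimensionality reduction phase is there to control.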
The predictive core is evaluated using two model families: Random Forest regression and Lazy Learning, in the style of k-nearest neighbours, alongside naive baselines that produce forecasts using moving or seasonal averages over temporal windows. The setup adopts rolling window validation to assess robustness across different training and testing segments, and a statistical reading of results through non-parametric and post-hoc tests, supported by visualisations designed to make model comparisons immediately interpretable.
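A seasonal-average baseline and a rolling-origin evaluation of the kind described above can be sketched together. Everything here is illustrative: the season length, the split scheme, and the function names are assumptions, not the study's actual configuration.

```python
import numpy as np

def seasonal_naive(train, horizon, season=7):
    """Naive baseline: repeat the last observed season over the
    forecast horizon (season length is illustrative)."""
    reps = int(np.ceil(horizon / season))
    return np.tile(train[-season:], reps)[:horizon]

def rolling_origin_mae(series, n_splits, horizon, season=7):
    """Rolling-origin evaluation: the forecast origin advances one
    horizon at a time, the baseline is refit on all data before it,
    and a mean absolute error is collected per split."""
    errors = []
    for i in range(n_splits):
        split = len(series) - (n_splits - i) * horizon
        train, test = series[:split], series[split:split + horizon]
        pred = seasonal_naive(train, horizon, season)
        errors.append(np.mean(np.abs(pred - test)))
    return errors

# A perfectly periodic toy series: the seasonal naive is exact,
# so every split yields zero error.
series = np.tile(np.arange(7.0), 10)
maes = rolling_origin_mae(series, n_splits=3, horizon=7)
```

The per-split errors, rather than a single aggregate score, are what make the downstream non-parametric tests possible.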
A Realistic Use Case: High-Resolution Smart Metering and Two Experimental Regimes
The experimentation is conducted on a dataset acquired from a pervasive network of smart meters installed in a large commercial facility in Southern Italy, with 5-minute resolution over one month and multiple electrical variables, including three-phase active power, currents, voltages, and power quality indicators such as harmonic distortion. The target variable is active power over multiple forecasting horizons, aligned with operational and market needs, where accuracy typically degrades at longer horizons unless the model captures calendar regularities and recurring patterns.
Two experimental settings are distinguished. In the first, high temporal resolution emphasises signal volatility and highlights how baseline methods may remain competitive at short horizons, while more complex models improve performance as the horizon increases and temporal structure becomes dominant. In the second, the series is resampled at 30-minute intervals to reduce volatility and computational cost, enabling finer-grained feature selection across forecast steps and allowing a clearer comparison between PCA and mRMR in combination with different learners.
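The 5-minute to 30-minute resampling in the second setting amounts to block aggregation. A minimal sketch, assuming the mean is the aggregator (other statistics could be substituted) and that trailing samples not filling a block are simply dropped:

```python
import numpy as np

def downsample_mean(x, factor):
    """Aggregate a high-resolution series into coarser intervals by
    block averaging: factor=6 turns 5-minute data into 30-minute
    means. Trailing samples that do not fill a block are dropped."""
    n = (len(x) // factor) * factor
    return x[:n].reshape(-1, factor).mean(axis=1)

# One day of 5-minute samples (288 points) -> 48 half-hour means.
five_min = np.arange(288.0)
half_hour = downsample_mean(five_min, 6)
```

Beyond reducing volatility, the six-fold drop in sample count is what makes the per-step feature selection of the second regime computationally affordable.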
Interpreting the Results: Feature Selection and Decision-Oriented Visualisations
The results show that performance does not depend solely on the predictive model, but on the pairing between model and dimensionality reduction strategy. In particular, mRMR-based feature selection tends to sustain forecasting performance more effectively, whereas PCA appears more sensitive to the learner type and, in some combinations, reduces the ability to track the real load trajectory. The analysis also incorporates a statistical interpretation of performance differences through post-hoc testing, with heatmaps transforming model comparison into a decision-oriented artefact for practitioners.
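The omnibus step of such a non-parametric comparison can be sketched with the Friedman statistic; the post-hoc pairwise tests and the heatmap rendering are omitted here. A minimal sketch, assuming an error matrix with one row per evaluation split and one column per model, and no tied errors:

```python
import numpy as np

def friedman_statistic(errors):
    """Friedman chi-square statistic over an (n_splits x k_models)
    error matrix: rank the models within each split (1 = best) and
    test whether mean ranks differ more than chance would allow."""
    n, k = errors.shape
    # Double argsort yields 0-based within-row ranks (no ties assumed).
    ranks = errors.argsort(axis=1).argsort(axis=1) + 1.0
    mean_ranks = ranks.mean(axis=0)
    chi2 = 12.0 * n / (k * (k + 1)) * (
        np.sum(mean_ranks ** 2) - k * (k + 1) ** 2 / 4.0
    )
    return chi2, mean_ranks

# Toy matrix: model 0 always wins, model 2 always loses, so the
# statistic reaches its maximum N*(k-1) for n=6 splits and k=3 models.
errs = np.array([[0.10, 0.20, 0.30]] * 6)
chi2, mean_ranks = friedman_statistic(errs)
```

Only when this omnibus statistic rejects the hypothesis of equal mean ranks do the pairwise post-hoc comparisons, and the heatmaps built from them, become meaningful.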
From Forecasting to Decision Support: A Reusable Workflow Beyond the Case Study
The broadest contribution lies in the generalisable nature of the workflow: load forecasting becomes a testbed to demonstrate how a complete knowledge discovery process can feed a decision support system, not only through numerical outputs, but through structured data transformations, evidence selection procedures, and critical performance interpretation tools. The concluding discussion emphasises that forecast quality remains tied to the presence of repeatable patterns and to coherence between training and validation conditions; when volatility dominates, adaptive ensemble strategies prove more appropriate than an indiscriminate increase in model complexity.