On a deeper understanding of data-driven approaches in the current framework of wastewater treatment: looking inside the black-box
Summary: | Machine learning (ML) is one of the most rapidly growing technical fields, lying at the intersection of computer science and statistics and at the core of artificial intelligence (AI) and data science. Its effect is felt across a range of data-intensive domains, such as consumer services, banking, astronomy and the empirical sciences. In wastewater treatment, large-scale data generation began with the automation of wastewater treatment plants (WWTP). In addition, increased computing and storage capacity has allowed the large amounts of information generated by different sources in the water sector to be stored. The information generated and recorded at a WWTP comes from complex and heterogeneous data sources: on-line measurements from sensors, on/off control data from pumps and equipment, and off-line measurements from laboratories. Sensors can record measurements every few seconds, generating thousands of data points daily. Laboratory data are crucial for evaluating water quality in any biological wastewater treatment process (bWWTP) and are often used to validate sensor information. However, because of the cost and time involved, laboratory samples are taken far less frequently than sensor readings. The resulting database (from sensors and laboratories) therefore combines varying sampling frequencies and is highly heterogeneous.
Current research on data-driven methods in wastewater treatment has focused mainly on predictive tasks: forecasting the effluent composition and the performance of different bWWTP, the latter also widely studied with activated sludge models (ASM). Although the outcomes of the two approaches can be similar, their application and input information are very different. Data-driven approaches depend on having enough data to perform an analysis task. ASM models, by contrast, are phenomenological: they aim to describe the biochemical interactions between the microbial community in the wastewater system and the main pollutants in the wastewater, namely organic matter, nitrogen, phosphorus and other dissolved nutrients. Both approaches provide useful and important information about process performance; however, it is of utmost importance to distinguish and clarify the differences and goals of ASM-type models and ML-based tasks in the current framework of wastewater treatment. Two main reasons have moved the wastewater treatment community to apply these methods to predictive tasks: i) the availability of data gathered from monitoring different bWWTP and ii) the already mentioned complexity of biological processes. The high adaptability of ML methods to dynamic systems has led the research community to apply them widely. However, a key issue emerges from the literature: current studies on data-driven methods in wastewater treatment do not explicitly describe the pre-processing techniques applied, the amount of data used for analysis, the sampling frequency considered for data selection, or the rationale behind the choice of dataset size.
Most studies use input parameters similar to those used in ASM-type models, ignoring other parameters that are monitored in any bWWTP but not necessarily implemented in mechanistic models, such as oxidation-reduction potential (ORP), conductivity and turbidity. Thus, on the one hand, the potential of data-driven methods is not being fully exploited and, on the other, relevant information is omitted from most published studies.
As previously stated, the data sources in wastewater treatment are clearly diverse. However, combining these data sources to extract knowledge has not yet been studied in bWWTP. Hence, the main goal of this doctoral dissertation is to increase the general understanding of state-of-the-art ML methods in wastewater treatment, focusing on: i) the analysis of heterogeneous datasets, ii) the suitability of data-driven methods for these datasets and iii) novel approaches to extract new knowledge from these datasets. This work demonstrates the importance of data selection in heterogeneous datasets for extracting reliable information. The outcome of different data-driven methods changes dramatically with the amount of data considered in the analysis, as was evidenced in the study of a municipal WWTP. To solve this problem, a methodology was developed to extract a significant subset from the total raw heterogeneous dataset, optimizing the size of the dataset. The definition of a score function allowed the optimization of a subset composed of representative parameters or features (and observations), which was then used to build highly accurate models. Although feature engineering is a well-developed field in data science, it has not yet been explored in wastewater treatment. Newly engineered features made it possible to build highly accurate models for the prediction of complex bWWTP where data limitation was an issue. In addition, this work proposes an alternative methodology for combining even more heterogeneous data sources to efficiently extract new knowledge from complex bWWTP, which can also be applied to similar complex bWWTP.
Although the contributions of this doctoral dissertation are important, the main limitation of this work is that the analysis has not been extended to similar processes, i.e. it remains to be evaluated whether the knowledge gained from the processes studied is particular to these systems or whether similar patterns occur in comparable processes; for example, are the patterns in all municipal WWTP similar?
Now that the impact of the amount of data on different data-driven tasks has been shown, existing data quality metrics for specific data sources in wastewater treatment (except for sensor data) need to be addressed, since they are currently disconnected from the specific contextual characteristics. Data quality metrics for the different sources of data in wastewater treatment need to be revised, particularly when dealing with heterogeneous datasets. These issues, however, are outside the scope of this work. |
---|
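
The mismatch in sampling frequency between sensors and laboratories described in the summary can be illustrated with a minimal sketch. It is not the methodology of the dissertation: it assumes one simple alignment strategy (averaging sensor readings per calendar day and pairing them with same-day laboratory samples), and the function names `daily_means` and `align`, as well as all data values, are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def daily_means(readings):
    """Average (timestamp, value) sensor readings per calendar day."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[ts.date()].append(value)
    return {day: mean(values) for day, values in buckets.items()}


def align(sensor_readings, lab_samples):
    """Pair each lab sample with the sensor mean of the same day.

    Assumption: same-day averaging is an acceptable alignment rule;
    days without a lab sample are dropped from the result.
    """
    sensor_by_day = daily_means(sensor_readings)
    rows = []
    for ts, lab_value in lab_samples:
        day = ts.date()
        if day in sensor_by_day:
            rows.append((day, sensor_by_day[day], lab_value))
    return rows


# Hypothetical data: a sensor reading every 6 hours over three days,
# and laboratory measurements on the first and third day only.
start = datetime(2020, 1, 1)
sensor = [(start + timedelta(hours=6 * i), 100.0 + i) for i in range(12)]
lab = [(datetime(2020, 1, 1, 9), 45.0), (datetime(2020, 1, 3, 9), 50.0)]

aligned = align(sensor, lab)
for day, sensor_mean, lab_value in aligned:
    print(day, sensor_mean, lab_value)
```

Even in this toy case, the twelve sensor readings collapse to two usable rows once they are paired with laboratory data, which shows how strongly the effective dataset size depends on the alignment rule chosen.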