The Significance of Data Quality in Making a Successful Machine Learning Model

Good quality data becomes imperative and a basic building block of an ML pipeline. The ML model can only be as good as its training data.

By Vidhi Chugh, Data Scientist on March 10, 2022 in Machine Learning

Introduction

AI has been a buzzword for quite some time and is now ubiquitous. The number of AI-enabled applications on the market has grown considerably. We have also been ‘blessed’ with powerful infrastructure and advanced algorithms. However, that does not make the journey of taking your ML project to production any easier.

The issue of data quality is not new; it has drawn attention since the onset of machine learning (ML) applications. The machine learns statistical associations from historical data and is only as good as the data it is trained on. Hence, good quality data becomes imperative and a basic building block of an ML pipeline.

Data-centric vs algorithm-centric

Let me share two scenarios with you. Assume you have done the initial exploratory data analysis and are very excited to see the model performance. But to your disappointment (which happens every now and then in a data scientist’s life :)), the model’s results are not good enough to be accepted by the business. In this case, considering the iterative nature of data science, what would your next step be?

Would you analyze the wrong predictions and associate them with their input data to investigate possible anomalies and previously ignored data patterns?

Or would you take a forward-looking approach and simply advance to more complex algorithms?
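The first option, tracing wrong predictions back to their input data, can be sketched in a few lines. This is only an illustration on made-up toy data: the function name `error_rate_by_group` and the attributes `channel` and `age` are hypothetical and not part of any real project or library. It uses only the Python standard library.

```python
from collections import defaultdict


def error_rate_by_group(rows, y_true, y_pred, key):
    """Group predictions by one input attribute and count how many were wrong.

    Returns {group_value: (n_wrong, n_total, error_rate)}.
    """
    counts = defaultdict(lambda: [0, 0])  # group value -> [wrong, total]
    for row, truth, pred in zip(rows, y_true, y_pred):
        group = row.get(key)  # None stands for a missing value
        counts[group][1] += 1
        if truth != pred:
            counts[group][0] += 1
    return {g: (wrong, total, wrong / total) for g, (wrong, total) in counts.items()}


# Hypothetical toy data: input rows, true labels, and model predictions.
rows = [
    {"channel": "web", "age": 34},
    {"channel": "web", "age": 51},
    {"channel": "store", "age": None},
    {"channel": "store", "age": 29},
    {"channel": "store", "age": 42},
]
y_true = [1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1]

report = error_rate_by_group(rows, y_true, y_pred, key="channel")

# Print the groups with the highest error rate first.
for group, (wrong, total, rate) in sorted(report.items(), key=lambda kv: -kv[1][2]):
    print(f"{group}: {wrong}/{total} wrong ({rate:.0%})")
```

If the errors concentrate in one slice of the input (here, the hypothetical `store` channel), that slice is a natural place to start looking for data issues before reaching for a more complex algorithm.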
Simply put, the typical practice of resorting to more advanced ML algorithms to gain more accuracy will yield little if the data does not carry a good signal for the machine to learn from. This is very well articulated in Andrew Ng’s lecture on “MLOps: From Model-centric to Data-centric AI”.

Data Quality Assessment

Machine learning algorithms need training data in a single view, i.e., a flat structure. As most organizations maintain multiple sources of data, preparing the data by combining these sources to bring all necessary attributes into a single flat file is expensive in both time and resources (domain expertise).

The data gets exposed to multiple sources of error at this step and requires strict peer review to ensure that the domain-established logic has been communicated, understood, programmed, and implemented well.

Since data warehouses integrate data from multiple sources, quality issues related to data acquisition, cleaning, transformations, linking, and integration become critical.

A very popular notion among data scientists is that data preparation, cleaning, and transformation take up the majority of the model-building time, and it holds true in practice. Hence, it is advisable not to rush the data into the model, and instead to perform extensive data quality checks. Though the number and type of checks one can perform on the data can be very subjective, we will discuss some of the key factors to be checked in the data while preparing a data quality score and assessing the goodness of data:

Techniques to maintain data quality: missing data imputation

Let’s check how we can improve the data quality:

Not all labelers are the same: Data is gathered from multiple sources. Different vendors take different approaches to collecting and labeling data, each with a different understanding of the end use of the data.
Even within a single labeling vendor, there are myriad ways inconsistency can creep in: the supervisor receives the requirements and shares the guidelines with different team members, each of whom may label according to their own understanding.
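One simple way to surface this kind of labeling inconsistency is to have several labelers annotate the same items and measure how often they agree. The sketch below is an assumption-laden illustration, not a method from the article: the function `label_agreement`, the item ids, and the labels are all hypothetical, and majority-vote share is just one of several possible agreement measures.

```python
from collections import Counter


def label_agreement(labels_by_item):
    """For each item, return (most common label, share of labelers who chose it)."""
    result = {}
    for item_id, labels in labels_by_item.items():
        majority, votes = Counter(labels).most_common(1)[0]
        result[item_id] = (majority, votes / len(labels))
    return result


# Hypothetical example: three labelers annotate the same three images.
labels_by_item = {
    "img_001": ["cat", "cat", "cat"],
    "img_002": ["cat", "dog", "cat"],
    "img_003": ["dog", "cat", "bird"],
}

agreement = label_agreement(labels_by_item)

# Items without full agreement are candidates for review or clearer guidelines.
needs_review = [item for item, (_, share) in agreement.items() if share < 1.0]
print(needs_review)
```

Items with low agreement often point to ambiguous guidelines rather than careless labelers, so reviewing them with the labeling team can improve the instructions as well as the labels.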