
Why samples and statistics are important in machine learning and AI

by 미스터탁 2022. 12. 12.

In the past two to three years, the number of education courses related to data science (DS) or artificial intelligence (AI) has been increasing. These curricula advertise their advantages to people who want to study DS and AI: being linked to employment, offering hands-on lectures, or helping students build portfolios. Such courses can certainly help you study DS or AI. Coding skills and machine learning algorithms are important, of course, but here I would like to briefly point out the parts of DS/AI that many people get wrong or do not think deeply about.

 

Let's think about why we build a model when we analyze data. The purpose of building a model with the data we have is to predict new or unknown (unseen) data. In fact, if we had the whole population, we would not need to build a good model: even a model that overfits would be acceptable, because there would be no unseen data left to predict. In practice, however, we only have a sample, and we need to make the most of it to create a highly accurate model.

 

 

So, what are the conditions for a good sample?

First, the sample should fully reflect the characteristics of the population. For this, the larger the sample size, the better, because a larger sample is more likely to reflect the characteristics of the population. As most people who study machine learning know, a model that predicts the collected sample well is not guaranteed to predict unseen data well. Overfitting is the phenomenon in which a model predicts the data used for training well but cannot predict data it did not see during training. The overfitting problem is inseparable from the DS/AI field, and many studies have been conducted to address it.
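To make that gap concrete, here is a minimal sketch, assuming scikit-learn and a synthetic dataset (both my additions for illustration, not the author's setup): an unconstrained decision tree memorizes the training samples almost perfectly but scores noticeably lower on held-out data.

```python
# Minimal overfitting sketch (assumes scikit-learn; synthetic data for illustration).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small synthetic classification problem.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# No depth limit: the tree can memorize the training set.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))  # close to 1.0
print("test accuracy :", model.score(X_test, y_test))    # typically noticeably lower
```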

 

Proper experimental design is important for verifying the performance of the model and establishing its reliability with the data we have. In other words, experimental design is what allows us to argue that the model we built will reach, say, 70% accuracy on unseen or new data.

 

Given the data we have as a whole, we divide it into training data used for learning, validation data for optimizing the model, and finally test data for measuring the model's performance. However, the amount of data we have is usually small; even tens of thousands of samples can be insufficient. Therefore, we sometimes divide the total dataset into only training data and test data.
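A minimal sketch of such a split with scikit-learn is shown below; the 60/20/20 ratio is an illustrative assumption, not a fixed rule.

```python
# Train/validation/test split sketch (assumes scikit-learn; ratios are illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# First carve out 20% as the final test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Then split the remainder into training (60%) and validation (20%) sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)
```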

 

 

When dividing the entire dataset here, it should be split randomly. Suppose the model is trained on a random training/test split and achieves 70% accuracy on the test data. If you split the data randomly again and evaluate it, will the performance be 70% again? Not at all. If you have a lot of data, it may land between 68% and 72%; if you have little data, it may swing between 55% and 80% (with a huge dataset, the spread is likely to be under 1%).
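You can see this spread directly by re-splitting with different random seeds; the sketch below uses synthetic data and logistic regression purely as illustrative assumptions.

```python
# How the measured accuracy moves with the random split (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# flip_y adds label noise so the split-to-split variation is visible.
X, y = make_classification(n_samples=300, n_features=20, flip_y=0.2,
                           random_state=0)

for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=seed)
    acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"seed {seed}: test accuracy = {acc:.3f}")  # varies from seed to seed
```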

That is, even if we design the experiment randomly according to the amount and characteristics of the data, a single split makes it difficult to measure the model's performance properly. Therefore, experimental design is all the more important for securing the reliability of our model. In general, the data is randomly split, the model is evaluated on the test data, and this process is repeated several or dozens of times and averaged. Then we can tell the customer that, on average, a certain level of performance has been achieved through this experimental design.
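One way to sketch this repeated random hold-out evaluation is with scikit-learn's ShuffleSplit and cross_val_score; the dataset and the choice of 30 repetitions are my own illustrative assumptions.

```python
# Repeated random hold-out: split many times, score each split, report mean and spread.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, flip_y=0.2,
                           random_state=0)

# 30 independent random 70/30 splits.
cv = ShuffleSplit(n_splits=30, test_size=0.3, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f} over 30 splits")
```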
