
Data Errors Affect Outcomes

The biggest challenge in AI is data errors, most of which stem from labelling. According to a new study from MIT, the ten most cited datasets in AI are riddled with label errors, which distorts our understanding of the field's progress. Datasets are a vital element of AI research, but some matter more than others: the benchmark sets researchers use to develop and evaluate machine-learning models, which serve as a way of tracking what artificial intelligence can do and how it has advanced over time. The canonical image-recognition dataset ImageNet, for example, is credited with kicking off the modern AI revolution. MNIST, which compiles images of handwritten digits from 0 to 9, is another crucial dataset, alongside those used to test models trained to recognize audio, text and hand drawings.

In recent years, studies have found that many datasets have serious flaws. ImageNet, for instance, has been found to harbor racist and sexist labels, and it contains photos of people taken without their consent. According to the latest study, many labels are simply wrong: a frog labelled as a cat, or a mushroom labelled as a spoon. The ImageNet test set has an estimated label error rate of 5.8 percent, while QuickDraw, a dataset compiled from hand drawings, has an estimated error rate of 10 percent.

The errors in datasets originate in data management and tend to be idiosyncratic rather than systematic; in some cases they arise from incomplete datasets in which entire categories of data were missed. The MIT study measured errors by training a machine-learning model on each dataset's training split and using it to predict the labels in the corresponding test data. Whenever the model disagreed with the original label, that data point was flagged for manual review. Five reviewers on Amazon Mechanical Turk then voted on which label they thought was correct, the model's or the original. If a majority of the reviewers agreed with the model, the original label was marked as an error and corrected.
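The flag-and-vote pipeline described above can be sketched in a few lines of Python. This is an illustrative simplification, not the study's actual code; the function names, the vote encoding ("model" vs. "original"), and the fixed five-reviewer setup are assumptions for the sketch.

```python
def flag_label_errors(original_labels, model_predictions):
    """Flag the indices where the model's prediction disagrees
    with the dataset's original label."""
    return [
        i
        for i, (orig, pred) in enumerate(zip(original_labels, model_predictions))
        if orig != pred
    ]


def resolve_by_majority(flagged, original_labels, model_predictions, reviewer_votes):
    """For each flagged index, accept the model's label only if a
    majority of human reviewers sided with the model; otherwise
    keep the original label.

    reviewer_votes maps a flagged index to a list of votes, each
    either "model" or "original" (one per reviewer)."""
    corrected = list(original_labels)
    for i in flagged:
        votes = reviewer_votes.get(i, [])
        if votes.count("model") > len(votes) / 2:
            corrected[i] = model_predictions[i]
    return corrected


# A toy run: the model disputes one label ("frog" vs. "toad"),
# and four of five reviewers side with the model.
originals = ["cat", "frog", "dog"]
predictions = ["cat", "toad", "dog"]
flagged = flag_label_errors(originals, predictions)
votes = {1: ["model", "model", "original", "model", "model"]}
corrected = resolve_by_majority(flagged, originals, predictions, votes)
```

In this toy run, index 1 is flagged and the majority vote replaces the original label with the model's. The real study works with model confidence scores over large test sets, but the disagree-then-adjudicate structure is the same.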

Data errors lead to bias, such as the latent bias that is currently a major challenge in AI. If a data error is deliberate and makes its way into an AI project, the outcome can be discrimination: for example, a labelling system might label a spatula as a cooking tool and then refer to the person holding it as a woman even when it is a man. According to the researchers, failure to properly quantify and reduce this type of error will magnify stereotypes, so researchers in the field should strive to create clean datasets for evaluating models and tracking progress. When the study re-evaluated more than 34 models whose performance had already been measured against the ImageNet test set, models that had fared poorly against the incorrect labels performed well once the labels were corrected. The problem seems to affect complex models more than simple ones, since the simpler models fared better on the corrected labels.

Without comprehensive cleaning, noisy datasets get deployed into the real world, which can lead to selecting the wrong model. The point to understand here is that good data has a positive impact on outcomes. However, producing it takes time and money.

Scott Koegler

Scott Koegler is Executive Editor for Big Data & Analytics Tech Brief

scottkoegler.me/
