As we consume more and more data, we are also changing the definition of data itself. Not so long ago, data was singular, i.e. based only on language and numbers.
Then we added pixels (and so images) to language and numbers, making the data richer. It was still singular, though.
And then, with video, we added the time dimension to images, language and numbers, making the data plural.
When used in the right way, i.e. with well-intentioned mathematical and statistical modeling, it provides actionable insights.
Thanks to the efforts of researchers and enthusiasts over the years, we have some very large open databases for singular or plural data.
WordNet, as we know, was the first to arrive. With its multi-dimensional linguistic structure, it has helped researchers in many ways to build insights from word-based models. Now spanning over 200 languages, WordNet is really large in its ability to provide multi-level insights. It is all singular in nature, though. This was in the 80s.
Then came ImageNet, a collection of over 14 million annotated images maintained by another set of researchers just as committed as the makers of WordNet. ImageNet was introduced in 2009. Over the years, with the advent of multiple sources of image capture, ImageNet has grown very rapidly. Very rapidly.
Over the years, both WordNet and ImageNet have supplied training data sets for thousands of projects. They are seen as the gold standard at multiple events, and organisations use them to set their bar of competence.
But with growth comes complexity, at times unintended. These unintended deviations in data, coupled with mathematical and statistical modeling, lead to the build-up of bias.
The insight thus built is, appropriately, #Bias #Intelligence.
Biases are not known to be positive in any way. In general, they reflect a negative outcome of a well-intended action.
WordNet is subject to biases because the meanings of words can change over time. Words used in common parlance today can become derogatory tomorrow.
Take the example of words like senile and egregious. Both had neutral or positive meanings in the past: senile simply meant relating to old age, and egregious meant remarkably good. Now both carry a negative bias.
If WordNet is not refreshed periodically, these biases in word meanings will start skewing the outcome.
ImageNet relies on tagging done by members of the database for the images. It is a crowdsourced activity. In general, it runs certain basic checks on these tags, but a holistic assessment of the embedded tags was never adopted.
Consequently, the 80 Million Tiny Images dataset, whose labels were drawn from WordNet, was found to carry offensive tags on its images.
Multiple racist, offensive, abusive and inappropriate tags were found deeply embedded in the database, as per the research here.
Here is a snapshot of the offensive words found by the researchers.
Clearly, the larger a dataset becomes, the more important and the more difficult it is to maintain its quality.
Extreme caution and very strong data quality checks are required to ensure that such bad data does not vitiate your model.
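One such quality check can be sketched in a few lines of Python. This is only an illustration, not the method used by the researchers or by any dataset maintainer: the function name audit_labels, the placeholder blocklist entries and the sample labels are all assumptions made up for this example. It screens every label against a blocklist and reports which samples pass and which terms were flagged.

```python
from collections import Counter

# Hypothetical blocklist with placeholder entries (assumption for this sketch).
# A real audit would use a curated lexicon that people review and refresh.
BLOCKLIST = {"slur_a", "slur_b"}


def audit_labels(labels, blocklist=BLOCKLIST):
    """Split labels into clean sample indices and a count of flagged terms."""
    normalized_blocklist = {term.strip().lower() for term in blocklist}
    clean_indices = []
    flagged = Counter()
    for index, label in enumerate(labels):
        # Normalize case and surrounding whitespace before comparing.
        key = label.strip().lower()
        if key in normalized_blocklist:
            flagged[key] += 1
        else:
            clean_indices.append(index)
    return clean_indices, flagged


# Made-up labels standing in for the tags attached to five images.
sample_labels = ["cat", "Slur_A", "dog", "slur_b ", "cat"]
keep, flagged = audit_labels(sample_labels)
print("samples kept:", keep)
print("flagged terms:", dict(flagged))
```

An exact-match blocklist like this catches only known terms, so it is a first filter at best. It would still need the periodic refresh and human review argued for above, because the list of offensive words changes over time just as word meanings do.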
The tiny-image dataset described above has been used as a training dataset for numerous Convolutional Neural Networks (CNNs). Models trained on such data are bound to have been infected by the biased content attached to these images. This in turn would have created significant biased intelligence in the world today.