
If you have been on the internet, you have probably consumed information from Wikipedia. While citing Wikipedia as a source is often frowned upon in academic circles, it is the place we all go when we want to know about the history of bucket hats (yes, I did look this up once and it was a fun read!), where your favourite actor went to school, or about some war that happened in 18th-century France that you really need to know about for an assignment due in 4 hours. More often than not, what we get when we google something is a Wikipedia summary of it. It is free, improves access to information and helps increase knowledge in the world, one day at a time. But despite all this, Wikipedia does have its shortcomings. Funnily enough, the first place I went to read about the criticisms of Wikipedia was Wikipedia itself.
While Wikipedia faces many criticisms, ranging from institutional to ideological bias, for the purpose of this article we are going to stick to its widespread gender bias. To edit Wikipedia, one must write from a “neutral point of view” so that the content is free from biases. But in the past, Wikipedia has been used by organizations to promote their points of view. In 2008, CAMERA, a pro-Israel group, sent an email calling for volunteers to edit “anti-Israeli” content added by pro-Palestinian groups on Wikipedia. Sticking to a neutral point of view is hard even for individual editors, with unconscious (or otherwise) biases slipping through in how edits are framed, which articles are edited and which edits are accepted. In 2009, Wikipedia introduced a new feature where users could select their preferred gender while editing. A research paper from 2011 found that only 1 in 5 editors happened to be women. But the bias runs deeper. Women, even when they are registered, edit less than men, and their average editing lifespan is also shorter. Now, why is this the case? Research has shown that women feel less confident about their expertise and about editing someone else’s work, leading them to spend less time editing Wikipedia. The fact that women have more care-work responsibilities outside of their working hours could be another reason why women edit less. Whatever the reason for the gender editing gap, it matters a lot. Women and men bring different things to the table. With predominantly male editors, many issues risk getting skewed coverage, and we also miss out on important topics that women have more of an interest in.
Even though 39% of chemistry PhDs in the US alone are awarded to women every year, biographies of women chemists make up only 7-11% of Wikipedia’s biographies of chemists. When women scientists, novelists and other prominent women aren’t equally represented in the most easily accessible source of information, we deprive young girls of role models they can look up to, and it also has a profound impact on societal attitudes in general. Having more women editing Wikipedia and increasing the representation of eminent women will help maintain the “neutral point of view” by reducing bias.
In 2017, scientists fed all of Wikipedia to an AI system to help it understand what is and is not appropriate to do. The impact and reach of artificial intelligence is ever growing, and it is important that we don’t feed our existing biases into these systems and further solidify systemic bias in decision making. That is another very important reason why we need to ensure equal representation and reduce bias on Wikipedia.
The machine learning models used these days need huge amounts of data to train on. Most Natural Language Processing (NLP) models, which need textual data, are trained on text scraped from publicly available sources, and Wikipedia is one of the most widely used of these. Transformer-based models like BERT and GPT-3, which are among the most widely used models in NLP, are trained on Wikipedia as well. It is also one of the few sites where one can find information about a huge variety of things in one place and in a consistent format. Wikipedia, though it was never meant to be a purpose-built corpus for any particular NLP task, is an opportunity too good to pass up in this context. It is large, up-to-date, and covers a wide variety of subjects. This paper shows how Wikipedia is used for various NLP tasks.
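To make this concrete, here is a minimal sketch of how Wikipedia text is commonly pulled in as a training corpus, assuming the Hugging Face datasets and transformers libraries; the dump date and model name are illustrative choices on our part, not what any particular paper used.

```python
# A minimal sketch: load an English Wikipedia dump and tokenize it
# as one might before pretraining or fine-tuning a BERT-style model.
# The dump date and model name are illustrative, not from the paper above.
from datasets import load_dataset
from transformers import AutoTokenizer

# Preprocessed English Wikipedia, one article per record.
wiki = load_dataset("wikipedia", "20220301.en", split="train")

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Turn raw article text into model-ready token IDs.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized_wiki = wiki.map(tokenize, batched=True, remove_columns=wiki.column_names)
print(tokenized_wiki)
```

Whatever is (or is not) written in those articles is exactly what ends up in the token stream a model learns from.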
Whatever data is present on Wikipedia shapes what NLP models learn and the style of text they generate. It is therefore important that we have a balanced and fair Wikipedia, so that the models trained on this data are more robust and fair. Bias present in the data can seep into the models, which in turn can affect the decisions those models make. To help close this gender data gap in Wikipedia, Project Hidden Voices aims to develop information-theoretic approaches, ML-assisted identification and validation of external sources, and textual analysis methods to auto-generate a first draft of a Wikipedia-style biography for notable women in STEMM. This will encourage more people to write articles about women in the future.
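As a toy illustration of how bias in the training text can surface in a trained model, here is a small probe, a sketch assuming the Hugging Face transformers library and the publicly available bert-base-uncased checkpoint (our illustrative choices, unrelated to Project Hidden Voices): we ask the model to fill a masked pronoun and compare the scores it assigns to “he” versus “she”.

```python
# Toy bias probe: fill a masked pronoun and compare the model's scores
# for "he" vs "she". bert-base-uncased is an illustrative choice;
# results will differ across models and training corpora.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

for sentence in [
    "The chemist said [MASK] had published a new paper.",
    "The nurse said [MASK] had finished the shift.",
]:
    predictions = fill(sentence, targets=["he", "she"])
    scores = {p["token_str"]: round(p["score"], 3) for p in predictions}
    print(sentence, scores)
```

If the relative scores track occupational stereotypes rather than reality, that gap was learned from the text the model was trained on, which is why representation in sources like Wikipedia matters.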
Written by: Neeraja Kirtane and Gayathri Arvind