IBM Research is treating gender bias like a data problem.
The company’s India research division announced a dataset cataloguing the genders and relationships of characters in 4000 Bollywood movies, along with details of the movies’ plots and of how characters are represented in posters and trailers. The dataset will let researchers apply machine learning algorithms, which specialize in finding complex patterns within data, to quantify the different ways bias shows up throughout the movies.
Researchers said that this will be the first publicly available, large-scale dataset built specifically to analyze gender bias in film, created so data scientists could study how men and women are represented in Bollywood. Once this bias is understood, software trained on the data could analyze or even generate other pieces of work.
“This dataset can be used to generate unbiased plausible stories from biased stories,” the team writes. “The main area where this can be extended is to train [algorithms] to identify which is a biased statement and which is not.”
Machine learning algorithms could use the data to understand not only how often men and women appear in films, but the extent to which women may be cast in subordinate roles or underrepresented on movie posters.
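As a rough illustration of the simplest such measurement, the sketch below computes the share of characters, and of lead roles, by gender. The record layout, field names, and sample rows are hypothetical placeholders; the dataset's actual schema has not been published, and this is not the researchers' method.

```python
from collections import Counter

# Hypothetical records: field names and values are illustrative only,
# not the real schema or contents of the IBM dataset.
characters = [
    {"movie": "Film A", "name": "Character 1", "gender": "male", "lead": True},
    {"movie": "Film A", "name": "Character 2", "gender": "female", "lead": False},
    {"movie": "Film B", "name": "Character 3", "gender": "male", "lead": True},
    {"movie": "Film B", "name": "Character 4", "gender": "male", "lead": False},
    {"movie": "Film B", "name": "Character 5", "gender": "female", "lead": True},
]


def gender_shares(records):
    """Return the fraction of records belonging to each gender."""
    counts = Counter(r["gender"] for r in records)
    total = sum(counts.values())
    # An empty input yields an empty dict, so no division by zero occurs.
    return {gender: n / total for gender, n in counts.items()}


def lead_shares(records):
    """Return gender fractions among characters flagged as leads."""
    return gender_shares([r for r in records if r["lead"]])


overall = gender_shares(characters)
leads = lead_shares(characters)

print("Share of all characters:", overall)
print("Share of lead roles:", leads)
```

Comparing the two sets of shares gives a first, crude signal of whether one gender is cast in lead roles less often than its overall screen presence would suggest. A real analysis would need far richer features than a single lead flag.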
IBM Research India isn’t the first to study gender representation in movies. In 2016, independent researchers analyzed 2000 movie scripts and found that men have more than 60% of the dialogue in Disney movies. In the movie Mulan, they found that Mulan’s dragon, voiced by Eddie Murphy, had 50% more dialogue than Mulan herself, the star character.
The researchers write that the dataset will be made public after “acceptance” of the work.