  • Kerry Mackereth

David Adelani on Datasets for African Languages


In this episode, we chat to David Adelani, a computer scientist, PhD candidate at Saarland University in Germany, and active member of Masakhane. Masakhane is a grassroots organisation whose mission is to strengthen and support natural language processing research in African languages. There are over 2000 African languages, so David and the Masakhane team have their work cut out for them. We also discuss how to build technology with few resources and the challenges and joys of participatory research.


David Adelani is a doctoral student in computer science at Saarland University, Saarbrücken, Germany. His current research focuses on the security and privacy of users’ information in dialogue systems and online social interactions. Originally from Nigeria, he is also actively involved in the development of natural language processing datasets and tools for low-resource languages, with a special focus on African languages.


Reading List:


1) David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D'souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jesujoba Alabi, Seid Muhie Yimam, Tajuddeen Gwadabe, Ignatius Ezeani, Rubungo Andre Niyongabo, Jonathan Mukiibi, Verrah Otiende, Iroro Orife, Davis David, Samba Ngom, Tosin Adewumi, Paul Rayson, Mofetoluwa Adeyemi, Gerald Muriuki, Emmanuel Anebi, Chiamaka Chukwuneke, Nkiruka Odu, Eric Peter Wairagala, Samuel Oyerinde, Clemencia Siro, Tobius Saul Bateesa, Temilola Oloyede, Yvonne Wambui, Victor Akinode, Deborah Nabagereka, Maurice Katusiime, Ayodele Awokoya, Mouhamadane MBOUP, Dibora Gebreyohannes, Henok Tilaye, Kelechi Nwaike, Degaga Wolde, Abdoulaye Faye, Blessing Sibanda, Orevaoghene Ahia, Bonaventure F. P. Dossou, Kelechi Ogueji, Thierno Ibrahima DIOP, Abdoulaye Diallo, Adewale Akinfaderin, Tendai Marengereke, Salomey Osei. 2021. MasakhaNER: Named entity recognition for African languages. arXiv preprint arXiv:2103.11811.


2) Michael A. Hedderich, David Adelani, Dawei Zhu, Jesujoba Alabi, Udia Markus, and Dietrich Klakow. 2020. “Transfer learning and distant supervision for multilingual transformer models: A study on African languages”. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2580–2591, Online. Association for Computational Linguistics.


3) Jesujoba Alabi, Kwabena Amponsah-Kaakyire, David Adelani, and Cristina España-Bonet. 2020. Massive vs. curated embeddings for low-resourced languages: The case of Yorùbá and Twi. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 2754–2762, Marseille, France. European Language Resources Association.


4) David Ifeoluwa Adelani, Michael A. Hedderich, Dawei Zhu, Esther van den Berg, and Dietrich Klakow. 2020. Distant supervision and noisy label learning for low resource named entity recognition: A study on Hausa and Yorùbá. Workshop on Practical Machine Learning for Developing Countries at ICLR’20.


Other relevant papers:


5) ∀, Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Tajudeen Kolawole, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Hassan Muhammad, Salomon Kabongo, Salomey Osei, Sackey Freshia, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa, Mofe Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Jane Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer, Jason Webster, Jamiil Toure Ali, Jade Abbott, Iroro Orife, Ignatius Ezeani, Idris Abdulkabir Dangana, Herman Kamper, Hady Elsahar, Goodness Duru, Ghollah Kioko, Espoir Murhabazi, Elan van Biljon, Daniel Whitenack, Christopher Onyefuluchi, Chris Emezue, Bonaventure Dossou, Blessing Sibanda, Blessing Itoro Bassey, Ayodele Olabiyi, Arshath Ramkilowan, Alp Öktem, Adewale Akinfaderin, Abdallah Bashir. 2020. Participatory research for low-resourced machine translation: A case study in African languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2144–2160, Online. Association for Computational Linguistics.


6) Michael A. Hedderich, Lukas Lange, Heike Adel, Jannik Strötgen, and Dietrich Klakow. 2021. A survey on recent approaches for natural language processing in low-resource scenarios. In Proceedings of NAACL 2021. URL https://arxiv.org/abs/2010.12309.


7) Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. In International Conference on Machine Learning, pages 4411–4421.



Transcript


KERRY MACKERETH: Hi! We're Eleanor and Kerry. We're the hosts of The Good Robot podcast. Join us as we ask the experts: what is good technology? Is it even possible? And what does feminism have to bring to this conversation? If you wanna learn more about today's topic, head over to our website, where we've got a full transcript of the episode and a specially curated reading list with work by, or picked by, our experts. But until then, sit back, relax, and enjoy the episode.


ELEANOR DRAGE: Today, we’re talking to David Adelani, a computer scientist, PhD candidate at Saarland University in Germany, and active member of Masakhane. Masakhane is a grassroots organisation whose mission is to strengthen and support natural language processing research in African languages. There are over 2000 African languages, so David and the Masakhane team have their work cut out for them. We also discuss how to build technology with few resources and the challenges and joys of participatory research. We hope you enjoy the show.


ELEANOR DRAGE: Thank you so much for joining us today, David. Can you tell us a bit about what you do and what brought you to your work?


DAVID ADELANI: Thank you very much for inviting me. I'm David Adelani, originally from Nigeria. I'm a PhD student of computer science at Saarland University in the Department of Language Science and Technology. I work on topics around privacy and security of voice interaction systems and NLP [natural language processing] models. And I'm also very, very interested in the development of datasets and tools for African languages. I guess that's why you invited me!

KERRY MACKERETH: Fantastic, and so we know that as part of that interest you work with an organisation called Masakhane. Could you tell us a bit more about Masakhane and what brought you to that organisation?


DAVID ADELANI: Masakhane is a grassroots organisation whose mission is to strengthen and support NLP research in African languages, for Africans, and by Africans, so there’s a real emphasis on [it] being done by African researchers. And of course, this is with the support of like-minded people around the world. So yeah, Masakhane is an isiZulu word, which means ‘we build together’. And that's what has been happening in this participatory form of research.

I joined Masakhane during the first African NLP workshop in 2020. I was so impressed by their work on machine translation, and by the approach of the research: they prioritise inclusive community building with open, participatory research. I really love this approach, because it allows people from different parts of the world to work together, and also people from different disciplines, like linguists, technologists, policymakers, computer researchers, and scientists, so it's a very good initiative. And also, everything is open source. The way they work, everything is open source, and everyone can see what's going on when you start a project.


ELEANOR DRAGE: We’re called The Good Robot, so as part of that we ask what ‘good technology’ is and what it looks like, and occasionally we come across technologies that we think are brilliant, or technological processes, so with that in mind, can you talk to us a little bit about either Masakhane or other work that you do, and tell us what you think ‘good technology’ is?


DAVID ADELANI: Yeah, good technology, I would say, is a technology that helps improve the quality of life of people and, at the same time, does not harm any group of people. You know, you can have some technology that works very well on certain groups and doesn't work on another group, like, with a lot of discussion about facial recognition technologies that don't work for, for example, Black people. And so then the question is, is this a good technology or not? If it causes harm to certain groups of people, then we cannot generally accept that this is a good technology. [When you talk about Masakhane, so we want..] I think the main idea is to have different groups of people be part of the process of building this dataset, or building these models. So if you're part of it, then you are very careful about what the impact of this technology would be. So if, okay, I build a technology that really works, for example, for Bantu languages in Africa, and doesn't work for maybe some other West African languages, then you can already point it out because you're part of the process. And there are some other issues: will it work in low-resource scenarios? Can I even run this model? Yeah, and things like that. So maybe, can I run this model on a CPU [central processing unit], for example? Because maybe most labs in Africa cannot afford to buy a GPU [graphics processing unit], for example.


KERRY MACKERETH: Absolutely, and this leads really nicely into the next question we wanted to ask you, which is, thinking very broadly about the AI sector as a whole, what kinds of problems or exclusions is Masakhane trying to address?


DAVID ADELANI: Oh, yeah, so the main problem we’re trying to address is the underrepresentation of AI scientists from Africa. So how can we have more people from Africa being part of AI? It’s a very big sector, and we want them to be part of the process. And also, we want to increase the research on African languages. Africa has over 2000 languages. And these languages are not spoken by just a small portion of people; on average, they are spoken by a large number of people. So it's really important. Until recently, most evaluation of African languages was only based on maybe Swahili. And, you know, this is just one out of many languages, so we want our models to be evaluated on many African languages. And one thing is, Africa is big, and is very diverse, and also the languages are very diverse, even in the same country. For example, Nigeria has over 500 languages, a very diverse country. So if you just focus on one or two Nigerian languages, you won't really understand the diversity of these languages and some of the interesting properties of the languages that have been left out. And the last thing that I'll talk about is, we also have the non-availability of these technologies for Africa. So you want this technology to support African languages. So how many machine translation tools support African languages? For European languages, we have many tools that you can use.


And also because it's a very multilingual society in Europe. And sometimes industry entrepreneurs are very sceptical about investing in African languages because they don't know what the market value will be, whether there is an economic benefit in this. But it's important to the speakers of the language.


KERRY MACKERETH: Fantastic, it sounds like such important and amazing work. But what kinds of barriers do you face in trying to do this good work, or why is it so hard to fix these kinds of problems or these kinds of exclusions?


DAVID ADELANI: Yeah, I think it's difficult for two reasons. I think the first thing is about the process in addressing this problem. So most people have tried to work on African languages in the past. What they do is that sometimes they don't even involve native speakers of the language: you can crawl the web and just get conversations of people, social media, or Wikipedia, and you don't involve the speakers of this language. And sometimes, even if they are involved, you see they're not properly compensated, they're not involved in the technology, and they are not acknowledged in publications, things like that.

So the second point I wanted to talk about is this knowledge gap in AI, because you cannot solve the problem if you don't have the skills to solve it. So there is a need to invest in education, mentorship, and collaborative work, for example, internships with top labs and AI companies, to address this problem. And that's why this approach really works well, because it's open and it's participatory, so we involve everybody. It's not only Africans that are working on this problem; we also have researchers from around the world assisting, people who are interested in this problem are also assisting. So when we include all these, you know, involving native speakers, and the knowledge gap has also been addressed, I think we'll be able to solve some of these problems.

So the problems are surmountable.


ELEANOR DRAGE: Lots of what you’re saying to me resonates with the feminist perspective that Kerry and I take in our work, for example, we’re both trying to respond to the question of how to make technologies more inclusive, we’re both putting emphasis on methods and processes, and we’re both trying to break down false distinctions between the social world on the one hand, and technology on the other to show that social mechanisms and discrimination are embedded in the technology we use. So how do you see Masakhane’s work in relation to feminism?

DAVID ADELANI: Oh, yeah, good question. I think we are trying to make Masakhane very inclusive, right. I mean, there are many aspects that Masakhane is looking at, for example, in terms of how many indigenous languages are we working on? So are we just working on those few that are very popular, or are we really considering languages with fewer numbers of speakers, endangered languages? So that's one aspect that we're looking at. Another thing is that currently, most people that are in Masakhane are English speakers, so we want to see how we can encourage, for example, more Francophone Africans, more Portuguese speakers, more Arabic speakers from North Africa. We can encourage different people from different parts of Africa to join, whether in West Africa, East, Central, South or North. And this is one thing we're really thinking about and working towards. For example, one thing we are currently working on is translating some notebooks, or resources, so that it's easier for people who do not speak English, and are well educated, to also join this community. And also, I'm very glad that the founders of the group are women, which has really encouraged many women to feel welcome in the group. And they are leading amazing projects in Masakhane. So most of these top projects that you see are actually being led by women in Masakhane.


ELEANOR DRAGE: Masakhane is a community that is made up of more than 400 members from 30 African countries, so can you tell us more about the joys and the challenges of working with such a big and dispersed grassroots organisation?


DAVID ADELANI: Well, it's really amazing working with them. I mean, when you think about it, to see the enthusiasm, the energy, the passion that people have for African languages, even the non-speakers of the languages, I think it's really amazing. And one thing that really makes me very happy is that it's very easy to scale projects in Masakhane to many languages. For example, I have an idea to build a named-entity recognition dataset for Yoruba, which is my native language. I share the idea with Masakhane in the weekly meeting, and a lot of people are interested, say, 20 people are interested. And they say, yeah, I want to do this for my language. Then you can easily scale a project of one language to a project that involves 20 languages. How amazing is that, and you can even make more impact by doing this. So scaling projects very fast is one really beautiful thing about how Masakhane works.

But when you talk about a challenge, you know, when you have a lot of people, there are also challenges working with many people. One major challenge is how do you keep motivating people to work? Especially if they’re working on a voluntary basis, they’re volunteers, right? You're not paying them - how do you keep motivating them? For example, when we were building the named-entity recognition dataset for African languages, the MasakhaNER dataset - it involved more than 40 annotators. How do you manage this, how do you keep encouraging them? Sometimes I have to be passing out a message, you know, to the people that speak this language and encourage them and say, let's do it. And sometimes you have to give time so that you have weekly meetings and assist them in the annotation, even though you're not a speaker of the language. So sometimes it's time consuming, but it’s worth it, because why would I not give my time when the people that will do the main work, who spend more than 10 hours, 20 hours doing this work, are interested in doing it? Even if I could spare just 30 minutes, I would gladly do that.

So, one thing I've seen that really challenges people to do more is, well, okay, there's a deadline, there’s an international conference or workshop coming. And they know they are going to be part of the paper. I think this really encourages them that, okay, at last, all their efforts are being acknowledged somehow. So this acknowledging of people's efforts, I think it's really important to the community. And this really challenges them to work hard even when they’re not paid for it.


KERRY MACKERETH: Absolutely, and we’ve talked a little bit about what Masakhane can do for technology, but what do you think AI can do for African languages?


DAVID ADELANI: Yeah, I mean, there are a lot of areas where AI can help African languages. So most AI technologies, for example, work by having a large amount of data. My research is really addressing the direction of how you make this happen with smaller amounts of data.

So it's a very interesting direction which can really help African languages, because we're in this low-resource scenario. And last year, we did some work around this that we submitted to the African NLP workshop and the EMNLP [Empirical Methods in Natural Language Processing] conference, for example, how can you use just a few examples, 10 to 100 examples, to improve NER [named-entity recognition] for Hausa and Yoruba. And also, we include things like distant supervision for automatic labelling from Wikipedia, and also having native speakers write some simple rules. You know, these are small, simple ideas that you can integrate without having this large amount of data, and you can see a very impressive performance. So AI can also help in that, without acquiring lots of data. And I will refer people to our recent work from, I think, last year, it's called “Transfer learning and distant supervision for multilingual transformer models: A study on African languages” [in reading list]. Secondly, I think, also building models that work in realistic low-resource scenarios. So can we build, for example, machine translation models that work on the CPU, that work on the device, without having the capacity of big companies like Google or Microsoft, and can we make them available for many African languages? And the last thing that will be interesting to see is how we can build solutions that can help address some of the Sustainable Development Goals, maybe around health, food security, education in Africa - I think this will be amazing. And it will be more interesting if Africans that understand this problem more can also partake in this.


KERRY MACKERETH: Fantastic, and so with that in mind, what’s in line for Masakhane in the future?


DAVID ADELANI: Oh, I will say lots, I mean it's difficult to predict. But I think it’s going to keep expanding and keep growing. And I think we'll be a force in the continent that can really change things. So I'm really looking forward to how [what] this Masakhane initiative will spur NLP research in African universities. Also, in general, we do less research in Africa than the rest of the world. So if in this way we encourage people to do more research, this will be very nice. Another thing is that we also want to encourage entrepreneurs to build solutions that work on indigenous African languages. And hopefully this will lead to maybe economic prosperity on the continent. I think if Masakhane can achieve this, you know, the full end-to-end system, from the research side to the industry, having real impact in the lives of people, then this will be amazing. Thank you.


ELEANOR DRAGE: Well, it was a real pleasure to speak with you. Thank you so much for joining us. All of the texts that you’ve participated in, and some of the things that you’ve mentioned today, will be in our reading list on our website for everyone to access.


DAVID ADELANI: Thank you very much for inviting me. It was a pleasure talking with you.

