Three years after OpenAI’s ChatGPT went public in November 2022, it’s fair to say that commercial use cases for AI have failed to inspire: see Oral-B’s AI-powered toothbrush or pay-to-grieve interactive avatars of deceased relatives. Sifting through hundreds of new AI tools and applications, it’s difficult to justify everything that goes into training these models, or to quell the concerns they raise over environmental impact, data privacy, and copyright protection. We know that AI has more potential than this, including the ability to transform humanitarian response, but it needs the right data to do so.
Large Language Models (LLMs) are trained on large data sets scraped from publicly available internet sources, stoking fears that data gaps among underrepresented groups will result in biased and unreliable results. These “data deserts” are especially significant when expanding access to emergency services for those already most vulnerable: people with limited or no access to the internet, or who primarily speak one of the thousands of languages not recognized at the national level.
What does this mean in practice? When a rural community experiences an extreme weather event or an outbreak of violence, existing emergency systems that operate only in officially recognized languages or require internet connectivity effectively isolate it, and can delay or completely preclude humanitarian aid.
The Malawi Voice Data Commons (MVDC), developed by Portulans Institute Senior Fellow and NYU Peace Research and Education Program AI Lead Marine Collins Ragnet together with her NYU colleague Katerina Siira, introduces peace-focused digital infrastructure designed to address this challenge and to ultimately save countless lives.
According to the Open Data Policy Lab, data commons, or shared pools of data resources available for public benefit, can serve as critical infrastructure for peace technology aimed at providing emergency humanitarian aid and preventing conflict. Because a data commons provides a governance structure that gives communities agency over their data, national emergency systems can access the localized and contextual information needed to serve the public good.
How it works
MVDC brings theory into practice in Malawi, where over 3 million rural Malawians are excluded by language and literacy barriers from accessing critical emergency services. Based on 2018 census data, 35% of Malawians face adult literacy limitations and 73% lack regular internet access. Built on partnerships with Mozilla Common Voice, UNDP, Ushahidi, and Malawian universities, the MVDC gives communities the ability to report emergencies in their native languages, including Chichewa, Chitumbuka, and Chiyao, using basic mobile phones and toll-free numbers, reducing emergency response times from 72 to 6 hours.
In addition to responding to community concerns about emergency communication, the data commons project aims to preserve the linguistic heritage of these groups. The system fine-tunes OpenAI’s Whisper model using newly collected Chichewa language data. Local servers perform real-time keyword detection and basic transcription to identify urgent reports, which are immediately flagged for response. During off-peak hours, more advanced speech recognition analysis is conducted through NYU Greene’s high-performance computing system, allowing for deeper linguistic processing and improved transcription accuracy. Through this combination of machine learning, local infrastructure, and language-specific model optimization, the MVDC enables real-time emergency alerts, efficient data routing to response teams, and the creation of scalable, AI-ready datasets.
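The two-tier flow described above, with immediate local flagging followed by deeper off-peak processing, can be sketched roughly as follows. This is an illustrative sketch only, not the MVDC's actual implementation: the class and function names are hypothetical, the keyword list uses English placeholders rather than real Chichewa, Chitumbuka, or Chiyao terms, and it assumes every report (urgent or not) is also queued for the later off-peak pass.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Hypothetical placeholder keywords. A real deployment would use terms in
# the local languages, chosen together with the communities.
URGENT_KEYWORDS: Set[str] = {"flood", "fire", "attack", "injured"}


@dataclass
class Report:
    report_id: str
    rough_transcript: str  # output of the basic local transcription step
    urgent: bool = False


@dataclass
class Triage:
    flagged: List[Report] = field(default_factory=list)         # sent to responders now
    offpeak_queue: List[Report] = field(default_factory=list)   # deeper analysis later

    def submit(self, report: Report) -> bool:
        """Flag the report if any urgent keyword appears; always queue it."""
        words = {w.strip(".,!?").lower() for w in report.rough_transcript.split()}
        report.urgent = bool(words & URGENT_KEYWORDS)
        if report.urgent:
            self.flagged.append(report)
        # Assumption: all reports get the off-peak pass, not only the quiet ones.
        self.offpeak_queue.append(report)
        return report.urgent

    def drain_offpeak(self) -> List[Report]:
        """Hand the queued reports to the off-peak batch job and reset the queue."""
        batch, self.offpeak_queue = self.offpeak_queue, []
        return batch
```

The point of the split is that the cheap keyword check can run on modest local hardware without waiting on a remote connection, while the more expensive transcription work is batched for when compute is available.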
Key to this project’s success is its emphasis on community governance and consent. If technology aims to protect underserved communities, ethical and equitable data governance is of paramount importance. Traditional authorities are partners with clearly defined roles, covering cultural content and community participation decisions. They lead monthly community committees including women’s leaders, youth representatives, religious leaders, and educators to review how the system is being used.
The project operates within a clear policy framework grounded in community data sovereignty, privacy protection, open data for public benefit, and the preservation of cultural and linguistic heritage. All personal information is anonymized automatically, and communities retain ownership and control over their data, including the right to request deletion at any time. During emergencies, rapid access protocols allow authorized responders to view essential data without compromising individual privacy or trust.
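As a rough illustration of how these three commitments (automatic removal of personal information, community-controlled deletion, and a restricted responder view) might fit together in code, consider the sketch below. Everything in it is hypothetical: the field names, the `CommunityStore` class, and its methods are assumptions made for illustration and do not describe the MVDC's actual system. Note also that simply dropping identifying fields, as done here, is far weaker than real anonymization of voice data.

```python
from typing import Dict, List

# Hypothetical field names, chosen only for this example.
PERSONAL_FIELDS = {"caller_name", "phone_number"}
ESSENTIAL_FIELDS = {"location", "category"}


class CommunityStore:
    """Toy in-memory store that keeps records grouped by owning community."""

    def __init__(self) -> None:
        self._records: Dict[str, List[dict]] = {}

    def ingest(self, community: str, raw: dict) -> dict:
        # Strip personal fields before anything is stored.
        record = {k: v for k, v in raw.items() if k not in PERSONAL_FIELDS}
        self._records.setdefault(community, []).append(record)
        return record

    def delete_community_data(self, community: str) -> int:
        # Deletion on request; returns how many records were removed.
        return len(self._records.pop(community, []))

    def responder_view(self, community: str) -> List[dict]:
        # Emergency responders see only the fields needed to act.
        return [
            {k: r[k] for k in ESSENTIAL_FIELDS if k in r}
            for r in self._records.get(community, [])
        ]
```

Keying records by community rather than by individual is one simple way to make a community-level deletion request a single operation.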
Recognizing that crisis reporting can involve highly sensitive or personal information, the project employs multi-modal consent procedures that accommodate all literacy levels, with verbal and visual options such as visual storytelling cards explaining data use.
By 2027, the MVDC will operate entirely through Malawian institutions, with NYU and Mozilla providing support rather than direction. Universities own and operate the technical infrastructure and the government provides operational funding, while communities maintain governance control.
Implications for data management
The MVDC project not only presents an AI tool with real community impact; it also raises broader questions about data governance. AI training data is composed of a mixture of personal and publicly available sources, including administrative records, anonymized data, and copyrighted works, which help teach AI models to recognize patterns and categorize information. Outside of some exceptions for research and small-scale data commons initiatives, much of our data is currently inaccessible and privately controlled.
Data harvesting for AI has prompted many in the field to revisit Hess and Ostrom’s 2007 work Understanding Knowledge as a Commons: From Theory to Practice, which highlights how technological advancements can transform non-exclusionary public knowledge into private goods and stresses the need to protect the digital ecosystem as a common-pool resource (CPR). CPR frameworks are designed with safeguards to ensure the sustainability of the digital commons and to guarantee that the value it generates is returned to its creators. Central to this governance approach is pulling control away from commercial entities like online platforms that leverage consumer behavior analytics and impose barriers to entry for small competitors.
Peace technology captures some of our highest hopes for machine learning: that this significant pressure on copyright and privacy could result in innovation in critical infrastructure and improve living conditions around the world. Importantly, the Malawi Voice Data Commons does not position copyright and privacy in opposition to innovation. What makes it both intriguing and successful is its ability to create a feedback loop of accountability and protective safeguards, demonstrating that the societal value of technology and the knowledge it can produce may ultimately rest with its keepers. Building on the MVDC, we can learn how to balance protecting the most sensitive data, empowering data creators, complying with local copyright regulations, and ensuring transparency with the ultimate goal of sharing the most data for the public good.
Marine Ragnet is an international affairs expert specializing in the intersection of emerging technologies, international affairs, and governance. She is a Senior Fellow at Portulans Institute and AI Lead for the NYU Peace Research and Education Program.
The post Introducing the Malawi Voice Data Commons appeared first on Portulans Institute.
