Mod help us all
ENC-AI-CLOPEDIA
AI is mining the sum of human knowledge from Wikipedia. What does that mean for its future?
We spoke with Wikipedia executives who told us AI could jeopardize the encyclopedia’s connection with the volunteers who create it.
The meteoric rise of AI over the last few years has created a huge demand for data to train large language models. Much of that load has fallen on Wikipedia, the world’s largest repository of human knowledge, which is written by people and free for machines to use.
But the proliferation of AI tools like ChatGPT and Google AI Overview also means there is now a disconnect between where people get their information and where it comes from. Wikimedia executives tell us that separation could jeopardize an entire generation of volunteer editors' engagement with Wikipedia. If volunteers aren't editing as much, the site's quality could suffer.
Data from Similarweb shows that traffic to Wikipedia has declined in recent years, though spokespeople for the Wikimedia Foundation, the nonprofit that operates it, contest those numbers. (Similarweb measurements don’t always match up with an organization’s internal metrics, but since the firm is applying the same methodology over time, the data should be directionally correct.)
Of course, Wikipedia isn’t ad-funded, so traffic to the site doesn’t directly affect its revenue. But it does affect awareness of a site that depends on volunteers and donations to survive.
We spoke with two executives at the Wikimedia Foundation — Lane Becker, Senior Director of Earned Revenue for Wikimedia Enterprise, and Chris Albon, Director of Machine Learning — via email to discuss what the age of generative AI means for the free encyclopedia’s future.
The following has been edited for brevity and clarity.
Rani Molla: Data from Similarweb shows that traffic to Wikipedia has been in decline. What are you seeing?
Lane Becker, Senior Director, Earned Revenue for Wikimedia Enterprise at the Wikimedia Foundation:
Our own pageview analytics show that Wikipedia’s page views are approximately 15 billion a month. These have consistently been in the 15B to 18B range since Oct 2020.
Chris Albon, Director of Machine Learning at the Wikimedia Foundation: For reference, the Wikimedia Foundation currently follows the below metrics to track visits to Wikimedia websites:
1. Page views: the number of users visiting all Wikimedia projects and reading articles online.
2. Unique devices: the number of distinct devices visiting Wikipedia in a given time period. This statistic can be another proxy for readership and has been relatively stable at 1.5B-2B unique devices monthly over the last 12 months.
3. Clicks and impressions through Google search, another proxy for visits, have remained at a consistent level over the past 16 months and have shown no major declines.
It is important to note that people continue to visit Wikipedia as a trusted source of information on wide-ranging topics. It is normal for traffic on Wikipedia to spike and fall over time for various reasons, including current events of global significance or changes in our internal analytics systems.
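The pageview figures cited above can be checked against Wikimedia’s public analytics REST API, which exposes aggregate pageview counts to anyone. Below is a minimal sketch in Python (assuming the requests library); the project and date range are illustrative, and English Wikipedia alone is only a subset of the roughly 15 billion monthly views cited for all projects.

```python
# Minimal sketch: pull monthly pageview totals from Wikimedia's public
# analytics REST API. The project and date range below are illustrative.
import requests

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate"
# Path segments: {project}/{access}/{agent}/{granularity}/{start}/{end}
url = f"{BASE}/en.wikipedia/all-access/user/monthly/2024010100/2024120100"

# Wikimedia asks API clients to identify themselves with a descriptive User-Agent.
headers = {"User-Agent": "pageview-check-example/0.1 (contact: example@example.org)"}

resp = requests.get(url, headers=headers, timeout=30)
resp.raise_for_status()

for item in resp.json()["items"]:
    # Timestamps arrive as YYYYMMDDHH strings; views are integers.
    print(item["timestamp"], f"{item['views']:,}")
```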
Rani: Has traffic been affected by generative AI tools, which, although they are typically trained on Wikipedia data, don’t necessarily send people to the site?
Lane: We have not seen a significant drop in traffic on Wikimedia websites that can directly be attributed to the current surge in AI tools.
Rani: How does AI affect Wikipedia?
Lane: We believe the new surge in AI can represent a tremendous opportunity to help scale the work of the global volunteer community that creates and curates consensus-based, reliable content on Wikipedia. It can help make knowledge more accessible to more people in more languages faster than humans alone can do.
Rani: Any negative impacts?
Lane: Our main concern is about the potential impact that these AI tools could have on the human motivation to continue creating and sharing knowledge. When people visit Wikipedia directly, they are more likely to become volunteer contributors themselves. If there is a disconnect between where knowledge is generated (e.g., Wikipedia) and where it is consumed (e.g., ChatGPT or Google AI Overview), we run the risk of losing a generation of volunteers and all the knowledge they help contribute. Also, without clear attribution and source links showing where information comes from, AI applications risk introducing an unprecedented amount of misinformation into the world, leaving users unable to easily and clearly distinguish accurate information from hallucinations.
We believe that AI tools should credit the human contributions they rely on. Attribution that credits human participation is crucial because humans are essential to contributing knowledge to the internet. Wikipedia content is created by volunteers. We must give credit to the work of these volunteers and keep the cycle of participation on Wikipedia going for generations.
Rani: Has AI affected that human motivation yet?
Chris: We have not seen any major signs of decline in human motivation so far owing to the recent AI developments.
Rani: How do you feel about practically every LLM being trained on Wikipedia content?
Lane: As a non-profit with the mission to create a world in which every single human can share freely in the sum of all knowledge, we welcome people and organizations to extend the reach of Wikipedia’s knowledge. Wikipedia is freely licensed and its APIs are available for free to everyone, so that people all over the world can use, share, add to, and remix Wikipedia content. The reliable and verifiable information on Wikipedia is immensely valuable to anyone looking to train machine learning models on knowledge expressed in natural language.
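As a concrete illustration of the free access Becker describes, here is a minimal sketch in Python (assuming the requests library) that pulls a plain-text article introduction through the public MediaWiki Action API; the article title and the User-Agent string are just examples.

```python
# Minimal sketch: fetch the plain-text introduction of a Wikipedia article
# through the free, public MediaWiki Action API. The article title is arbitrary.
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "prop": "extracts",       # provided by the TextExtracts extension
    "exintro": 1,             # introduction section only
    "explaintext": 1,         # plain text instead of HTML
    "titles": "Artificial intelligence",
}
headers = {"User-Agent": "wiki-extract-example/0.1 (contact: example@example.org)"}

resp = requests.get(API, params=params, headers=headers, timeout=30)
resp.raise_for_status()

# The response keys pages by page ID; print each title and the start of its extract.
for page in resp.json()["query"]["pages"].values():
    print(page["title"])
    print(page.get("extract", "")[:500])
```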
We urge AI companies to use Wikimedia’s free APIs responsibly and include recognition and reciprocity for the human contributions that they are built on, through clear and consistent attribution. They should also provide pathways for continued growth and maintenance of the human-created knowledge that is used to train them.
Rani: What do those pathways look like?
Lane: Clearly attributing knowledge back to Wikipedia is essential so that people are informed about where that knowledge comes from and can click through the relevant links to learn more. In doing so, they can also spend time on the site learning how to edit or donate, for example. In this way, they can more easily contribute back to the growth of human-created knowledge on the site and to the wider Wikimedia mission.
We also encourage high-volume commercial reusers of Wikipedia content to use our opt-in, paid-for product, Wikimedia Enterprise, so they can better meet their business needs and also contribute to the financial sustainability of Wikipedia and other Wikimedia projects.
We are also exploring opportunities to reach future audiences of knowledge seekers and contributors (including on generative AI platforms) through time-bound experiments, to help understand what the pathways for continued maintenance and growth of human-created knowledge may look like and to make recommendations for innovating.
Rani: How do the paid Wikimedia Enterprise partnerships, like the one with Google, work?
Lane: Two years ago, the Foundation introduced Wikimedia Enterprise, an opt-in, paid-for product for commercial enterprises that want to more easily reuse content from Wikipedia and other Wikimedia projects at high volume. These companies have product, service, and system requirements that go beyond what we freely provide through publicly available APIs and data dumps. The product allows them to reuse content in a way that aligns more effectively with their business models and gives more people everywhere access to free knowledge. Wikimedia Enterprise serves as a model for paid APIs that charge high-end commercial reusers of Wikipedia content so they directly sustain Wikipedia and other Wikimedia projects, ensuring the continued success of the content they rely on.
Rani: How much does Wikimedia Enterprise contribute to revenue?
Lane: As per the Wikimedia Enterprise Financial report 2022-23 published earlier this year, total revenue for Wikimedia Enterprise for FY 2022-23 was $3.2 million – representing 1.8% of the Wikimedia Foundation’s total revenue for the period.
Rani: Presumably Google represents the majority of that. How much does Google pay to use Wikimedia Enterprise?
Lane: We do not share details of individual contracts.
Rani: What other big paying customers does Wikimedia Enterprise have now?
Lane: Google was the first for-profit customer for Wikimedia Enterprise, which we announced in 2022. Since then, we have had other for-profit customers, such as yep.com.
Rani: How do Wikipedia and its volunteers leverage AI?
Chris: We believe that artificial intelligence (AI) should support the work of humans, not replace them. The approach to AI on Wikipedia has always been that humans edit, improve, and audit the work done by AI. While all content on Wikipedia is created and curated by humans, since 2002, some volunteers have used AI and machine learning (ML) tools to support their work, especially on time-consuming and redundant tasks. For example, many editors use specialized tools for patrolling Wikipedia, including bots that help volunteers quickly identify and revert wrongful edits. Volunteers create and enforce guidelines for the responsible use of these AI/ML tools across Wikipedia, ensuring that they are being used to best support human contributors. As with all content on Wikipedia, edits carried out by or with support of AI/ML tools are publicly and transparently documented. Public logs of all changes to content on Wikipedia are available to everyone to review.
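The public change logs Albon mentions are queryable by anyone. The sketch below (Python, assuming the requests library) lists the most recent edits on English Wikipedia along with their change tags, which is one way tool-assisted or reverted edits surface in the public record; the User-Agent string and result limit are illustrative.

```python
# Minimal sketch: list the most recent edits on English Wikipedia via the
# public MediaWiki Action API, including the change tags attached to each edit.
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "list": "recentchanges",
    "rcprop": "title|user|comment|timestamp|tags",
    "rclimit": 10,
}
headers = {"User-Agent": "recent-changes-example/0.1 (contact: example@example.org)"}

resp = requests.get(API, params=params, headers=headers, timeout=30)
resp.raise_for_status()

for rc in resp.json()["query"]["recentchanges"]:
    # Tags (e.g. "mw-reverted") can indicate edits handled by patrol tools and bots.
    print(rc["timestamp"], rc["title"], rc.get("user", ""), rc.get("tags", []))
```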
The Wikimedia Foundation has a team creating a new generation of AI models to increase the capacity of volunteers. Our technology team regularly works with volunteers on features designed to make the work of editors more efficient, enabling them to focus time on edits that require complex human judgment. This work also includes the development of algorithms to measure the quality of Wikipedia articles, and machine learning models that help catch incidents of vandalism on Wikimedia projects.
The open, consensus-based Wikipedia model continues to work effectively even in the changing information landscape; volunteers continue to review all the content on Wikipedia to ensure that it adheres to the online encyclopedia’s editorial policies and guidelines.
Rani: Overall has the advent of generative AI trained on Wikipedia been good or bad for Wikimedia?
Lane: It is our mission to do the most that we can to ensure the spread of free knowledge for people throughout the world. Wikipedia is available for free to everyone through Wikimedia’s free APIs and data dumps.
While there is no fee for anyone to use this content, we ask AI companies to use Wikimedia’s free APIs responsibly and include recognition and reciprocity for the human contributions that they are built on, through clear and consistent attribution. They should also provide pathways for continued growth and maintenance of the human-created knowledge that is used to train them.