Funnelling locusts – further reflections on the OAPEN Library and DOAB’s response time 

Ronald Snijder

Tue 27 May 2025

Read this article at hypothèses.org

In my previous post on the less-than-optimal performance of the OAPEN Library and DOAB, I wrote about the effect of AI bots on our systems, and noted that this has become a common problem for many sites providing open access (OA) content or open source software. All of them – including the OAPEN Library and DOAB – face a new threat: not censorship or budget cuts, but bots. A new kind of digital swarm has arrived, and it is putting serious pressure on our infrastructure, especially the OAPEN Library. In this blog, I will discuss in some detail our ongoing work to prevent us from being overrun by AI bots.

Interacting with OAPEN and DOAB – a dialogue 

At its core, every interaction with the OAPEN Library is a dialogue between a user and our server. A request is made, we respond. This simple back-and-forth happens countless times a day. A user is either a person or an automated system: a bot.

Most of the time, these interactions are respectful. Many bots—like those from Google or CLOCKSS—identify themselves and behave in a predictable way. We welcome not just human users, but also automated systems – for instance for text and data mining – into the OAPEN Library; we aim to share a quality-controlled collection of open access books as widely as possible.  

But with the rise of AI tools—especially large language models (LLMs)—the nature of these conversations is changing. Many bots no longer introduce themselves. They mask their identities and arrive in droves, flooding the system with requests. 

OAPEN under attack: a swarm of locusts 

Under normal circumstances, we see fewer than 50 dialogues per second between users and the OAPEN Library. However, when a swarm of AI bots starts hammering the OAPEN server, the number of interactions rises to over 10,000. As you can imagine, this is comparable to being hit by a swarm of locusts.

These swarms of AI bots are scraping content to feed LLMs. The impact? Our response time balloons from milliseconds to minutes, effectively making the OAPEN Library unusable. And it doesn’t stop. These surges can last for hours and sometimes occur multiple times a day. 

Figure 1: Line drawing of locusts, taken from https://commons.wikimedia.org/wiki/File:Diagrams_of_Locusts_which_swarmed_over_England_in_1748.jpg  

Our solutions to the problem 

Stopping this digital onslaught is far from simple. The attacks don't come from a single source or follow an obvious pattern. As an open infrastructure serving a global audience, we cannot block entire countries, as some others may have done. Our goal is to ensure that all genuine users of the OAPEN Library keep their access, while funnelling the flood of AI bots. Together with our colleagues at the CERN Data Center, we have been working on several countermeasures.

What measures have we implemented? 

  1. Blocking known offenders 

We keep a list of bots that have repeatedly overwhelmed the system and block them. While many bots disguise themselves, some of them are still recognisable. 
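In essence, this comes down to checking each request's User-Agent header against a list of known offenders. The sketch below illustrates the idea; the marker strings are invented examples, not our actual blocklist:

```python
# Minimal sketch of User-Agent blocklisting. The entries below are
# hypothetical examples, not OAPEN's real blocklist.
KNOWN_OFFENDERS = {"badbot", "greedy-scraper"}  # lowercase substrings

def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent matches a known offender."""
    ua = user_agent.lower()
    return any(marker in ua for marker in KNOWN_OFFENDERS)
```

A check like this only catches bots that are honest (or careless) about their identity; disguised bots fall through to the rate limiting described next.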

  2. Rate limiting 

When a user sends more than a certain number of requests per second, we temporarily block them. Instead of responding with an answer or sending a file, the OAPEN Library server returns the message "429 Too Many Requests". 

We are applying the following limitations to handle these instances: 

  • More than 40 requests in 10 seconds for a URL starting with /discover or /handle will lead to a temporary block. 
  • More than 20 requests in 10 seconds for a URL starting with /bitstream will lead to a temporary block. 

If these limits turn out to block genuine users of the OAPEN Library, such as large institutions, we will reconsider them. 

  3. Separating file downloads from information requests 

The most important function of the OAPEN Library is providing access to OA book and chapter files. To ensure that these remain accessible, requests for files are now redirected to four separate download servers. This helps us handle the large number of file requests, as there are now more servers to carry the load. 

  4. Continuous monitoring 

In addition to all the measures above, and in collaboration with our colleagues from the CERN Data Center, we will continue to watch traffic to the OAPEN Library closely and adjust our responses as needed. 

Figure 2: An overview of the response times of several aspects of the OAPEN Library 

Looking Ahead 

There is no silver bullet. Combating this problem requires significant time, technical resources, and coordination between our team and the CERN Data Center. On the horizon, a newer version of DSpace promises greater flexibility and better tools to manage these issues. Our current version of DSpace uses a relatively old version of the Solr search engine, which makes it impossible to add a second copy of the search engine to double the capacity. The latest versions of DSpace use an updated version of Solr, which does make it possible to add more capacity. 

Until then, we continue to do all we can to keep the OAPEN Library open and accessible – for humans and well-behaved bots alike – and to avoid genuine users being harmed by these measures. However, if you run into issues, please contact us.