Meta’s New Crawler Gathers Data Silently 

In July, Meta launched a web crawler called Meta External Agent to scrape public data for AI training.

According to three companies that specialize in monitoring web scraping, the tool collects information publicly displayed on websites, such as news articles and online discussions, to train Meta’s AI models.

From AI Tools to Scraping Tools 

Meta External Agent crawls websites for data to be used in training AI systems, just like OpenAI’s GPTBot.  

According to Dark Visitors, a firm that helps websites block scraper bots, Meta’s new tool is explicitly designed to gather data for AI training.

This has been confirmed by two other entities that monitor web scrapers. 

In late July, Facebook’s parent company updated its developer website to disclose Meta External Agent, without making any public announcement.

A spokesperson from Meta highlighted that this is not the first time the company has used such a tool, adding that Meta External Hit is another crawler used for different purposes, such as generating link previews.

“Like other companies, we train our generative AI models on content that is publicly available online,” the spokesperson added. 

“We recently updated our guidance on how publishers can exclude their domains from being crawled by Meta’s AI-related bots.” 
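That guidance points publishers to the standard robots-exclusion mechanism. As a minimal sketch only, assuming a user-agent token of “meta-externalagent” (the exact token should be confirmed against Meta’s developer documentation), a publisher could disallow the crawler in robots.txt and verify the rule with Python’s standard-library robotparser:

```python
from urllib import robotparser

# Hypothetical robots.txt directives a publisher might use to opt out.
# The user-agent token "meta-externalagent" is an assumption for illustration;
# the exact token should be confirmed in Meta's developer documentation.
ROBOTS_TXT = """\
User-agent: meta-externalagent
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check whether a sample article URL would be off-limits to the crawler.
url = "https://example.com/articles/some-story"
print(parser.can_fetch("meta-externalagent", url))  # expected: False
```

Crawlers that respect the robots-exclusion standard will skip a site configured this way, though compliance is voluntary on the crawler’s side.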

Always a Price to Pay 

Scraping web data for AI training has long been controversial, pushing artists, writers, and other content creators to file lawsuits against AI companies for using their work without consent.

Recently, in an effort to avoid such actions, some AI companies, such as Microsoft-backed OpenAI and Bezos- and Nvidia-backed Perplexity, have struck noteworthy deals to pay publishers for content.

Back in April, OpenAI made a deal with the Financial Times allowing its models to be trained on the publication’s archives.

As for Perplexity, in July it announced a revenue-sharing agreement with major media companies, including Time and Fortune.

Web scrapers like Meta External Agent help collect the massive volumes of data required to train large language models (LLMs). One of Meta’s most advanced models is Llama, which powers the Meta AI chatbot integrated across the company’s platforms. While the data sources used to train the latest Llama version are unknown, earlier versions drew on data gathered by the web-scraping project Common Crawl.

