Meta, the company responsible for platforms such as Facebook and Instagram, is using two new bots to crawl the internet in search of data for the development and improvement of its artificial intelligence (AI) models.
These new tools were quietly rolled out at the end of July, as reported by Business Insider last Wednesday (the 21st).
The introduction of these bots marks a significant step in Meta’s strategy to optimize its AI-powered products while also circumventing data access blocks imposed by websites that do not wish to share their information.
Tracking Tools for Meta’s Data Collection
The new bots, called “Meta-ExternalAgent” and “Meta-ExternalFetcher,” are designed to collect the wide range of web data needed to train the AI models behind Meta’s various products and services.
“Meta-ExternalAgent” can directly index the content it finds, playing a central role in gathering information to improve the company’s AI capabilities.
“Meta-ExternalFetcher,” in contrast, focuses on fetching specific pieces of information, with the goal of improving Meta’s AI assistant and other features tied to its products.
Bypassing Blocks with Advanced Technology
What makes these bots especially notable is the advanced technology they employ to evade blocks set up by website owners looking to prevent their data from being scraped.
Traditionally, many websites use a file called “robots.txt” to restrict or prohibit access by automated crawlers, such as those used by Meta.
However, the company's new bots are able to bypass these restrictions with great effectiveness, which has raised concerns among website administrators and digital privacy experts.
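For illustration, the snippet below shows the kind of robots.txt directives a site operator might publish to opt out of both crawlers. The user-agent tokens are the bot names cited in the report; the rest is a generic example, and whether Meta’s crawlers honor such rules is precisely what is in dispute.

```
# Illustrative robots.txt entries asking Meta's new crawlers to stay out.
# The tokens match the bot names reported above; honoring them is voluntary.
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Meta-ExternalFetcher
Disallow: /
```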
Effectiveness of Meta's New Bots
According to a report from Originality.ai, a startup that detects AI-generated content, only 1.5% of top websites manage to block the “Meta-ExternalAgent” bot.
“Meta-ExternalFetcher” is blocked even less often, by under 1% of these sites. This performance represents a significant improvement over FacebookBot, an older Meta crawler, which is blocked by approximately 10% of sites.
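Part of that gap may simply reflect how robots.txt works: rules apply per user-agent token, so a site that only lists the older FacebookBot places no restriction on the newly named crawlers until it adds them explicitly. The sketch below illustrates this with Python’s standard urllib.robotparser and a placeholder domain; it is a hypothetical example, not Originality.ai’s measurement code.

```python
# Hypothetical illustration: robots.txt rules written before the new bots
# existed only name FacebookBot, so a compliant crawler identifying itself
# with the newer user-agent tokens would still be allowed everywhere.
from urllib.robotparser import RobotFileParser

legacy_rules = """\
User-agent: FacebookBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(legacy_rules.splitlines())  # parse the rules without fetching anything

for agent in ("FacebookBot", "Meta-ExternalAgent", "Meta-ExternalFetcher"):
    allowed = parser.can_fetch(agent, "https://example.com/articles/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'} by these rules")
```

Run as-is, only FacebookBot comes back blocked, which suggests one reason block rates for the new user agents start out so low regardless of how the crawlers themselves behave.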
The effectiveness of these new bots demonstrates Meta’s ability to adapt its technologies to continue accessing the data needed to train its AI models, even when faced with barriers imposed by website administrators.
The company, led by Mark Zuckerberg, appears to be committed to ensuring that its AI systems can evolve and become increasingly sophisticated, powered by vast amounts of data collected from across the web.
Policy Update and Market Reactions to Meta
In response to concerns raised by publishers and website administrators, Meta recently updated its guidelines on how to exclude a domain from data scraping by the company’s AI data-collection bots.
According to a Meta spokesperson, the company is committed to honoring requests from publishers who do not want their content used to train Meta's AI models.
This update to the company's policies reflects an attempt to balance its data needs with respect for website owners' preferences.
However, this change was not enough to calm everyone's nerves. The new bots' ability to bypass the robots.txt file raises questions about the effectiveness of data protection measures currently in place on the web.
Additionally, Meta’s ability to track and collect data so extensively could intensify the debate over privacy and the control that large technology companies have over information available on the internet.
Implications for the Future of Data Collection
Meta’s introduction of these new bots represents a significant evolution in the way the company collects and uses data to train its AI.
As AI technologies become more integrated into digital products and services, the demand for large volumes of data to power these systems also grows.
As a result, companies like Meta are looking for increasingly sophisticated ways to access the information they need, even in an environment where blocks and restrictions are increasingly common.
On the other hand, this trend could lead to greater resistance from website owners, who may look for new ways to protect their content from unauthorized scraping.
Additionally, regulatory pressure on Big Tech's data collection practices may increase as governments and privacy organizations seek to protect users' rights in the digital age.