Internet Engineering Task Force                                 F. Badii
Internet-Draft                                            Digital Medusa
Intended status: Informational                                   J. Levy
Expires: September 24, 2026                                         ARDC
                                                              S. McKenna
                                                               Sequentum
                                                          March 24, 2026

          Best Practices for Responsible Web Data Collection
              draft-farzdusa-webbot-datacollection-00

Abstract

   The IETF develops standards and protocols to make the internet work
   better, adhering to principles of openness and decentralization.
   Industry best practices and protocols for automated web data
   collection have long existed, but have not been documented at the
   IETF.

   For decades, researchers, universities, journalists, public interest
   groups, and commercial entities have used automated tools to access
   and collect public web data (sometimes referred to as data scraping,
   web crawling, or text and data mining) for a wide range of uses
   [I-D.farzdusa-aipref-enduser].  Examples of these uses include
   extraction of pricing information for market intelligence or to
   create a consumer price index, comparative real estate analysis to
   support underwriting of loans and mortgages, webpage archiving to
   preserve human knowledge, preserving government websites to hold
   political powers accountable, journalist research and reporting, and
   university and scientific research.

   Recently, innovations in artificial intelligence have significantly
   increased the automated collection of public web data, creating
   tensions between the use of AI to equalize and increase access to
   knowledge and the disruption of existing Internet models, including
   non-profit repositories that face increased demands for access and
   businesses that profit from free web access to human viewers.

   This document lists a set of technical best practices that are
   prevalent across industries for the automated collection of public
   web data.  It provides protocols for how automated tools access and
   collect publicly available web data, including volume control,
   transparency, documentation, and access, that can be implemented by
   any automated data collector.
   It applies principles of net neutrality to the collection of data,
   providing uniform guidance regardless of the identity of the data
   collector or website operator, the location of the collection, or
   the applicable legal jurisdiction.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 24, 2026.

Copyright Notice

   Copyright (c) 2026 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Revised BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction and Terminology
   2.  Collection Limited to Public Web Data
   3.  Identification and Transparency
   4.  Rate Limits
   5.  Preference Signals
   6.  Log Retention
   7.  Out of Scope
   8.  Security Considerations
   9.  IANA Considerations
   10. References
       10.1.  Normative References
       10.2.  Informative References
   Authors' Addresses

1.  Introduction and Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   "Data Collectors":  Persons or entities that access websites through
      automated means to collect, scrape, harvest, or crawl
      information.

   "Public Web Data":  Data published on the World Wide Web that is not
      behind a restricted-access log-in.

   "Publicly available information":  Information that a person,
      organization, or data collector has a reasonable basis to believe
      is lawfully made available to the general public.

2.  Collection Limited to Public Web Data

   Data Collectors MUST access only web data that is not behind a
   paywall or restricted-access log-in, unless the Data Collector has
   received explicit approval to collect non-public data.  Data
   Collectors MUST NOT circumvent a restricted-access log-in or
   paywall.

3.  Identification and Transparency

   Data Collectors SHOULD provide sufficient information to allow
   website owners to contact them to report abuses, such as a dedicated
   email address or link on a published webpage, or an identifiable
   User-Agent name or email address.  Data Collectors MUST NOT
   affirmatively misrepresent their identities by using the name, email
   address, or URL of another Data Collector without authorization.

4.  Rate Limits

   Multiple methods of rate limitation are available to protect the
   health of web servers and help prevent conditions resembling
   Distributed Denial of Service (DDoS) and Denial of Service (DoS)
   attacks.  Data Collectors SHOULD select one or more rate limitation
   methodologies, taking into account factors such as whether
   monitoring indicates a degradation of service and the scope,
   breadth, and other parameters of the data collection.  Rate
   limitation methods MAY include one or more of the following:

   a.  Use of a static or dynamic/random download delay:  A script can
       set a static (e.g., once every five seconds) or random (e.g., a
       random period between 2 and 7 seconds) delay between page
       requests.

   b.  Use of "auto throttle" and similar technologies:  Many open-
       source libraries for web data collection offer functionality
       that automatically adjusts the frequency of page requests based
       on the current web server load, for instance by inferring that
       load from the latency between requests and responses.

   c.  Calculating average daily loads:  A number of analytics firms
       offer data about website traffic that allows a Data Collector to
       calculate the average number of page loads per day.  A Data
       Collector MAY apply this data when determining the frequency at
       which the script accesses the website(s), for example, by
       ensuring that the number of page requests it makes remains below
       some threshold percentage of average daily page loads.

   d.  Collecting data during low-traffic timeframes:  Consider
       limiting data collection to less busy times, based on website
       traffic data or inferred from the geographic location of the
       site and the site's content.  Applying randomness to script
       start times can minimize concurrent requests.

   e.  Limiting the number of concurrent requests:  A script can be
       kept from issuing additional requests until the web server
       responds to outstanding requests.
       Consider keeping the number of outstanding concurrent requests
       below a predefined limit.

   f.  Crawling at "human speed":  If a human collecting the data by
       copying and pasting would load a new page once every five
       seconds, consider limiting scripts to a similar rate.

   g.  Incorporating "speed bumps":  Similar to applying "human speed,"
       consider including speed bumps that pause the script at certain
       intervals (e.g., the script pauses for 5 seconds after every 10
       page loads).

   h.  Following the robots.txt "Crawl-delay" directive:  Website
       owners can specify in their robots.txt file the number of
       seconds that a script should wait between successive page loads.

5.  Preference Signals

   a.  Robots.txt:  Data Collectors MAY exercise discretion in deciding
       whether to search for the presence of a robots.txt file and, if
       located, whether to follow the robots.txt instructions.  Data
       Collectors MAY decide to follow certain types of robots.txt
       instructions (e.g., Crawl-delay), or to follow robots.txt for
       certain websites but not others.  To promote transparency, Data
       Collectors SHOULD retain documentation of whether robots.txt was
       read and followed and, if so, under what circumstances.

   b.  Access Requirements:  When Public Web Data is made available to
       the general public only after completion of a specific action,
       and no pre-authorization is required, Data Collectors MAY
       complete the action and access the Public Web Data.

6.  Log Retention

   a.  Query Log Content:  Data Collectors SHOULD maintain query logs
       for each data collection that include, at a minimum:

       i.    A precise date and time of the query.

       ii.   The target URL (the domain/URL to which the query is
             directed) in standard format.

       iii.  The source IP address (the IP address used to send the
             request) in standard format.

       iv.   A unique query identifier.

   b.  Query Log Retention:  Retention periods for query logs SHOULD be
       based upon the type and frequency of the query, and SHOULD
       include at a minimum:

       i.
             Live log retention sufficient to support timely responses
             to ongoing data collections.

       ii.   Archived log retention for a minimum of 3 years.

7.  Out of Scope

   The following topics are explicitly out of scope for this document:

   1.  Restating, embedding, or enforcing legal or quasi-legal
       restraints on the use of collected data.  Laws applicable to any
       particular data can vary based on the specific data collected,
       the identity of the party collecting the data, the timing of
       collection and data use, the location of the data collection,
       and the existence of contractual agreements beyond the
       parameters of machine-readable computer code.

   2.  Operational guidance on how persons or entities that collect
       data manage their businesses, organizations, or data handling
       practices after Public Web Data is accessed and collected from
       the internet.  This document does not cover best practices for
       Acceptable Use Policies for the automated collection of data
       within organizations; governance practices for handling personal
       information, complaints, or reports; internal operations for
       determining whether particular data includes copyrighted or
       copyrightable information and, if so, the legality of copying
       that data; or data and SOC 2 compliance practices for handling
       certain data after collection.  For guidance on these matters,
       see ARDC's Technical Standards and Governance Guidelines,
       Section 2.

8.  Security Considerations

   TODO

9.  IANA Considerations

   This document has no IANA actions.

10.  References

10.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

10.2.  Informative References

   [I-D.farzdusa-aipref-enduser]
              Badii, F., Bailey, J., and J. Levy, "AI Preferences
              Signaling: End User Impact", Work in Progress, Internet-
              Draft, draft-farzdusa-aipref-enduser-00, 2024.
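Appendix A.  Example Implementation Sketch (Non-Normative)

   The rate limitation methods of Section 4 and the minimum query-log
   fields of Section 6 can be sketched together in a short script.
   This appendix is illustrative only: the function names, the
   thresholds (a 2-7 second random delay, a 5-second pause after every
   10 page loads), and the log layout are assumptions of the example,
   not requirements of this document.

```python
# Non-normative sketch of several practices from this document:
# random download delay (Section 4.a), speed bumps (Section 4.g),
# the robots.txt Crawl-delay directive (Section 4.h), and the
# minimum query-log fields (Section 6.a).  All thresholds and
# names are illustrative assumptions, not normative values.
import random
import uuid
from datetime import datetime, timezone
from urllib import robotparser


def crawl_delay_for(robots_txt: str, agent: str):
    """Parse a robots.txt body and return its Crawl-delay (in
    seconds) for `agent`, or None if no delay is specified."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.crawl_delay(agent)


def choose_delay(crawl_delay=None, low=2.0, high=7.0) -> float:
    """Return the inter-request delay in seconds, preferring the
    site's Crawl-delay when present and otherwise falling back to
    a random delay in [low, high]."""
    if crawl_delay is not None:
        return float(crawl_delay)
    return random.uniform(low, high)


def speed_bump(pages_fetched: int, every: int = 10,
               pause: float = 5.0) -> float:
    """Return an extra pause: rest `pause` seconds after every
    `every` page loads, otherwise no additional wait."""
    if pages_fetched > 0 and pages_fetched % every == 0:
        return pause
    return 0.0


def query_log_record(target_url: str, source_ip: str) -> dict:
    """Build one query-log entry carrying the minimum fields of
    Section 6.a: precise UTC time, target URL in standard format,
    source IP address, and a unique query identifier."""
    return {
        "time": datetime.now(timezone.utc).isoformat(),
        "target_url": target_url,
        "source_ip": source_ip,
        "query_id": str(uuid.uuid4()),
    }
```

   A collection loop would call choose_delay() (seeded with the value
   from crawl_delay_for(), if any) before each request, add the result
   of speed_bump() to that wait, and append each query_log_record() to
   durable storage retained per the periods in Section 6.b.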
Authors' Addresses

   Farzaneh Badii
   Digital Medusa

   Email: farzaneh@digitalmedusa.org

   Jo Levy
   Alliance for Responsible Data Collection

   Email: josaxe@yahoo.com

   Sarah McKenna
   Sequentum

   Email: sarah.mckenna@sequentum.com