Web Scraping Best Practices
A scraper that works once is easy. A scraper that works reliably over time, handles errors gracefully, and collects data responsibly takes more thought. These are practical best practices drawn from building and maintaining scrapers at scale.
Writing reliable parsers
Use stable selectors
Prefer IDs, data attributes (data-testid, data-product-card), and semantic HTML elements over generated CSS classes. Class names like css-1a2b3c are generated by build tools and change every deployment. Data attributes and IDs are intentional and tend to stay stable across updates.
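As a sketch of the difference, this stdlib-only parser pulls text by a stable data attribute and never touches the generated class name. The HTML snippet and the `data-testid` value are illustrative, not from any real site:

```python
from html.parser import HTMLParser

class DataAttrExtractor(HTMLParser):
    """Capture the text of the first element carrying a given data attribute."""

    def __init__(self, attr, value):
        super().__init__()
        self.attr, self.value = attr, value
        self._capturing = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; match on the data attribute,
        # ignoring the generated class name entirely
        if dict(attrs).get(self.attr) == self.value:
            self._capturing = True

    def handle_data(self, data):
        if self._capturing and self.text is None:
            self.text = data.strip()
            self._capturing = False

html = '<div class="css-1a2b3c"><span data-testid="product-title">Blue Widget</span></div>'
parser = DataAttrExtractor("data-testid", "product-title")
parser.feed(html)
print(parser.text)  # Blue Widget
```

If the site's build tool regenerates `css-1a2b3c` next deployment, this extractor is unaffected.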
Handle missing elements gracefully
Never assume an element exists. Check for null before accessing text or attributes. A product page might not have a sale price, a review might not have a date, a listing might not have an image. Your parser should return null for missing fields, not crash.
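The pattern can be as simple as a helper that returns a default instead of raising. `extract_field` and the sample fields here are illustrative; the same idea applies whether your parser is BeautifulSoup, lxml, or parsel, since all of them return an empty list when a selector matches nothing:

```python
def extract_field(matches, default=None):
    """Return the first matched value, or a default -- never raise on a miss.

    `matches` is whatever your parser's query returned (a possibly empty list).
    """
    return matches[0] if matches else default

# A sale price may simply not exist on the page:
product = {
    "title": extract_field(["Blue Widget"]),
    "price": extract_field(["19.99"]),
    "sale_price": extract_field([]),  # selector matched nothing -> None
}
print(product)
# {'title': 'Blue Widget', 'price': '19.99', 'sale_price': None}
```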
Validate extracted data
After parsing, check that the data makes sense. A product price of $0.00 or $999,999 is probably a parsing error. An empty title or a date from 1970 means your selector matched the wrong element. Basic validation catches broken parsers before bad data enters your pipeline.
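A minimal validation pass might look like the following. The thresholds are illustrative assumptions; tune them to your domain:

```python
def validate_product(p):
    """Return a list of problems; an empty list means the record looks sane."""
    errors = []
    if not p.get("title"):
        errors.append("empty title")
    price = p.get("price")
    # $0.00 or an absurdly high price usually means the selector broke
    if price is None or not (0 < price < 100_000):
        errors.append(f"implausible price: {price}")
    return errors

print(validate_product({"title": "Blue Widget", "price": 19.99}))  # []
print(validate_product({"title": "", "price": 0.0}))
# ['empty title', 'implausible price: 0.0']
```

Running every parsed record through a check like this, and refusing to write records with errors, keeps a broken selector from silently polluting your dataset.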
Keep selectors separate from logic
Define your CSS selectors or XPath expressions as constants at the top of your script, not buried in parsing logic. When the site updates its HTML structure, you want to update selectors in one place rather than hunting through code.
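One way to keep that separation, sketched with a hypothetical `SELECTORS` table and a parser-agnostic `query` callable (with BeautifulSoup, for example, `query` could wrap `soup.select_one` -- an assumption about your parsing library):

```python
# All selectors live in one place; these values are illustrative.
SELECTORS = {
    "title": "[data-testid='product-title']",
    "price": "[data-testid='product-price']",
    "rating": "[itemprop='ratingValue']",
}

def parse_product(query):
    """`query` is any callable mapping a selector string to a value or None."""
    return {field: query(sel) for field, sel in SELECTORS.items()}

# Stand-in for a real parser: a dict lookup keyed by selector.
fake_query = {"[data-testid='product-title']": "Blue Widget"}.get
print(parse_product(fake_query))
# {'title': 'Blue Widget', 'price': None, 'rating': None}
```

When the site changes its markup, only the `SELECTORS` dictionary needs editing.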
Rate limiting and politeness
Add delays between requests
Even when using a scraping API with built-in proxy rotation, spacing out requests is good practice. A 1-3 second delay between requests is reasonable for most sites. For small sites with limited infrastructure, consider longer delays.
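A small sketch of that delay: computing it separately from sleeping makes it easy to test, and adding jitter avoids hitting the site on a perfectly fixed cadence. The 1-3 second range follows the guideline above:

```python
import random
import time

def next_delay(base=1.0, spread=2.0):
    """Return a delay of base..base+spread seconds (1-3s with the defaults)."""
    return base + random.uniform(0, spread)

def polite_sleep(base=1.0, spread=2.0):
    """Sleep between requests; call this in your fetch loop."""
    time.sleep(next_delay(base, spread))
```

For a small site, calling `polite_sleep(base=5.0)` between requests is a reasonable way to apply a longer delay.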
Scrape during off-peak hours
If your scraping can run at any time, schedule it during the target site's off-peak hours. A US e-commerce site has lower traffic at 3 AM Eastern than at 3 PM. Your scraping puts less strain on the site when fewer real users are browsing.
Only scrape what you need
If you need product prices, do not scrape every product page, every review page, every image page, and every related product page. Fewer requests means less cost, less chance of triggering rate limits, and less load on the target site. Define your data requirements before you start scraping.
Check robots.txt
Review the site's robots.txt file to understand what they ask crawlers to avoid. While it is not legally binding in most jurisdictions, respecting it demonstrates good faith and helps you avoid scraping pages that are likely to trigger aggressive anti-bot measures.
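Python's standard library ships a robots.txt parser. This sketch feeds it inline rules for clarity; in practice you would call `set_url()` with the site's real robots.txt URL and then `read()`:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Inline rules for the example; normally: rp.set_url(".../robots.txt"); rp.read()
rp.parse([
    "User-agent: *",
    "Disallow: /checkout/",
])

print(rp.can_fetch("my-scraper", "https://example.com/products/42"))  # True
print(rp.can_fetch("my-scraper", "https://example.com/checkout/"))    # False
```

Checking `can_fetch()` before enqueueing a URL is a cheap way to stay inside the lines the site has drawn.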
Error handling
Retry with backoff
Not every request will succeed on the first try. Implement exponential backoff for retries - wait 1 second, then 2, then 4, then 8. This handles transient failures without hammering the site. Set a maximum retry count (3-5 is usually enough) so your scraper does not get stuck on permanently failing URLs.
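A compact version of that retry loop, with the HTTP client abstracted as any callable that raises on failure (a stand-in for whatever request function you use):

```python
import time

def fetch_with_retries(fetch, url, max_retries=4, base_delay=1.0):
    """Call fetch(url), retrying with exponential backoff (1s, 2s, 4s, ...).

    After max_retries failed attempts, the last exception propagates so
    permanently failing URLs do not trap the scraper.
    """
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Backoff doubles each time: base, 2*base, 4*base, ...
            time.sleep(base_delay * 2 ** attempt)
```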
Detect blocked or empty responses
A 200 status code does not always mean success. Some sites return a CAPTCHA page, a "please verify you are human" message, or a login form instead of the actual content - all with a 200 status. Check the response content for expected elements before parsing.
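A simple heuristic check to run before parsing. `BLOCK_MARKERS` and the required-marker string are assumptions to adapt per site; pick a string that your target's real pages always contain:

```python
BLOCK_MARKERS = ("captcha", "verify you are human", "access denied")

def looks_blocked(html, required_marker):
    """Treat a 200 response as a failure if it shows block-page phrases,
    or if it lacks an element we know every real page contains."""
    lowered = html.lower()
    if any(marker in lowered for marker in BLOCK_MARKERS):
        return True
    return required_marker not in html

print(looks_blocked("<p>Please verify you are human</p>", "product-title"))  # True
print(looks_blocked("<div data-testid='product-title'>X</div>", "product-title"))  # False
```

Responses that fail this check should go back into the retry queue rather than into the parser.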
Log failures for debugging
When a request fails or parsing returns unexpected results, log the URL, status code, and enough of the response to diagnose the issue later. Saving the full HTML of failed responses to a file can help you understand what the site returned and why your parser did not find the expected data.
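One way to capture both pieces, using the stdlib `logging` module and dumping the full body to disk. The `failed_responses` directory name and the hash-based filename are arbitrary choices for the sketch:

```python
import logging
import pathlib

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scraper")

def record_failure(url, status, body, dump_dir="failed_responses"):
    """Log the failure and save the full response body for later diagnosis."""
    path = pathlib.Path(dump_dir)
    path.mkdir(exist_ok=True)
    # Hash-based filename keeps it filesystem-safe; fine for debugging dumps
    out = path / f"{abs(hash(url))}.html"
    out.write_text(body, encoding="utf-8")
    log.warning("fetch failed: %s status=%s body saved to %s", url, status, out)
    return out
```

With the HTML on disk, you can open exactly what the site returned instead of guessing from a log line.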
Separate fetching from parsing
If possible, save the raw HTML first, then parse it in a separate step. This lets you re-parse saved HTML when you fix a selector bug, without having to re-scrape the pages. It also makes debugging easier because you can inspect the exact HTML your parser received.
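A sketch of that two-step flow. The URL-to-filename slug here is a naive illustration, not a robust scheme:

```python
import pathlib

def save_page(url, html, out_dir="raw_html"):
    """Step 1: persist raw HTML so parsing can be re-run without re-fetching."""
    path = pathlib.Path(out_dir)
    path.mkdir(exist_ok=True)
    # Naive slug -- enough for a sketch, not collision-proof
    slug = url.replace("https://", "").replace("/", "_")
    out = path / f"{slug}.html"
    out.write_text(html, encoding="utf-8")
    return out

def parse_saved(parse, out_dir="raw_html"):
    """Step 2: run any parse function over every saved page."""
    return [parse(p.read_text(encoding="utf-8"))
            for p in sorted(pathlib.Path(out_dir).glob("*.html"))]
```

When you fix a selector bug, only `parse_saved` needs to run again; the fetch step (and its cost) stays untouched.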
Scaling your scraping
Start small, then expand
Test your scraper on 10 pages before running it on 10,000. Verify that your selectors work, your error handling catches edge cases, and the data looks correct. Fixing a bug after scraping 10 pages is easy. Fixing it after scraping 10,000 pages and discovering the data is wrong is expensive.
Use queues for large jobs
For large scraping jobs, use a task queue (Redis, RabbitMQ, or even a simple database table) to manage URLs. This lets you pause and resume, track progress, retry failures, and distribute work across multiple workers if needed.
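As a minimal illustration with sqlite (standing in for Redis or RabbitMQ at larger scale), a single table tracking per-URL state is enough to pause, resume, and retry:

```python
import sqlite3

def make_queue(conn):
    conn.execute("""CREATE TABLE IF NOT EXISTS urls
                    (url TEXT PRIMARY KEY, status TEXT DEFAULT 'pending')""")

def enqueue(conn, urls):
    # INSERT OR IGNORE makes re-enqueueing the same URL a no-op
    conn.executemany("INSERT OR IGNORE INTO urls (url) VALUES (?)",
                     [(u,) for u in urls])

def claim_next(conn):
    """Take one pending URL and mark it in progress; None when the queue is empty."""
    row = conn.execute(
        "SELECT url FROM urls WHERE status = 'pending' LIMIT 1").fetchone()
    if row:
        conn.execute("UPDATE urls SET status = 'in_progress' WHERE url = ?", row)
    return row[0] if row else None

def mark_done(conn, url):
    conn.execute("UPDATE urls SET status = 'done' WHERE url = ?", (url,))
```

Because state lives in the table rather than in memory, a crashed worker loses nothing: restart, re-claim whatever is still pending, and continue.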
Monitor your pipeline
Track success rates, response times, and data quality metrics. A sudden drop in success rate might mean the site changed its HTML or tightened its anti-bot measures. A gradual increase in null fields might mean a selector is becoming unreliable.
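A tiny metrics accumulator is enough to start; the fields tracked here are illustrative, and a real pipeline would typically export them to a monitoring system:

```python
from collections import Counter

class ScrapeMetrics:
    """Accumulate per-run counts that reveal a degrading scraper."""

    def __init__(self):
        self.counts = Counter()

    def record(self, ok, null_fields=0):
        self.counts["total"] += 1
        self.counts["ok"] += int(ok)
        # A rising null_fields count hints at a selector going stale
        self.counts["null_fields"] += null_fields

    def success_rate(self):
        total = self.counts["total"]
        return self.counts["ok"] / total if total else 0.0
```

Comparing `success_rate()` and the null-field count across runs turns "the data looks off" into a number you can alert on.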
Responsible data collection
Be careful with personal data
If your scraping involves personal information (names, email addresses, phone numbers), understand your obligations under GDPR, CCPA, or other applicable privacy laws. Just because data is publicly visible does not mean there are no restrictions on collecting and using it.
Do not republish copyrighted content
Extracting factual data (prices, specifications, ratings) is different from copying and republishing articles, images, or creative content. If you need to use copyrighted material, understand fair use limitations in your jurisdiction.
Review terms of service
Many websites address automated access in their terms of service. Understanding what the site's terms say helps you make informed decisions about what to scrape and how. See our page on web scraping legality for more context.
How Browser7 helps you follow best practices
Browser7 handles many of the infrastructure-level best practices automatically:
- Residential proxies reduce the chance of IP blocks, so you do not need to manage proxy rotation yourself
- Real browser rendering means you get the same HTML a real user sees, including JavaScript-loaded content
- CAPTCHA solving is available when needed, so your pipeline does not stall on challenge pages
- Geo-targeting lets you see location-specific content without maintaining proxy pools in each country
You still need to handle the data-level best practices - choosing good selectors, validating extracted data, rate limiting your requests, and using data responsibly. Browser7 solves the infrastructure problems so you can focus on these higher-level concerns.
Learn more
- Scraping Guides - these best practices applied to real websites
- Web Scraping Solutions - solving common technical challenges
- Is Web Scraping Legal? - the legal landscape for scraping