You’ve built your first scraper, tested it on simple sites, and felt that rush of success. Then you tried it on a “real” website… and it failed. You hit a CAPTCHA, got your IP banned, or received empty HTML from a JavaScript-heavy site.
Welcome to the real world of web scraping. In this technical guide, I’ll share the advanced techniques we use at Scraperscoop to handle anti-scraping measures.
Why Websites Block Scrapers (And Why You Should Care)
Before we fight the system, understand why it exists:
- Server load: Too many rapid requests can crash a site
- Competitive advantage: Companies don’t want competitors stealing their data
- Content protection: Some data is expensive to create
- User experience: Bots can distort analytics and affect real users
Ethical note: Always respect websites. If they’re aggressively blocking you, ask yourself if you should be scraping them at all.
Common Anti-Scraping Techniques
Websites use various methods to detect and block bots:
- IP-based blocking: Too many requests from one IP = ban
- CAPTCHAs: “Prove you’re human” challenges
- JavaScript challenges: Content loads only after JS execution
- Header analysis: Checking for suspicious User-Agents
- Behavior analysis: Detecting non-human patterns
- Honeypot traps: Links invisible to humans but visible to bots
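Honeypots are worth a code-level check of their own: before following links, drop anything a real user could never see. Here's a minimal sketch using BeautifulSoup; the inline-style checks are assumptions, since real sites may hide links via CSS classes or external stylesheets instead:
import requests
from bs4 import BeautifulSoup

# Hypothetical target URL
response = requests.get('https://target-site.com')
soup = BeautifulSoup(response.text, 'html.parser')

safe_links = []
for a in soup.find_all('a', href=True):
    style = (a.get('style') or '').replace(' ', '').lower()
    # Skip links hidden with inline styles or the hidden attribute -- likely honeypots
    if 'display:none' in style or 'visibility:hidden' in style or a.has_attr('hidden'):
        continue
    safe_links.append(a['href'])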
Solution 1: IP Rotation & Proxies
The Problem: You get blocked after ~100 requests from the same IP.
The Solution: Rotate through multiple IP addresses.
Types of Proxies:
Datacenter Proxies:
- Cheap and fast
- Easy to detect as proxies
- Best for: Large-scale scraping of less protected sites
Residential Proxies:
- IPs from real ISPs
- Harder to detect
- More expensive
- Best for: Scraping protected sites
Mobile Proxies:
- IPs from mobile carriers
- Most expensive
- Least likely to be blocked
- Best for: Extremely sensitive targets
Implementation Example (Python with rotating proxies):
import requests
from itertools import cycle

# List of proxies (format: http://user:pass@ip:port)
proxies = [
    'http://user1:pass1@proxy1.com:8000',
    'http://user2:pass2@proxy2.com:8000',
    'http://user3:pass3@proxy3.com:8000',
]
proxy_pool = cycle(proxies)

url = 'https://target-site.com'

for i in range(10):
    # Get the next proxy from the pool
    proxy = next(proxy_pool)
    print(f"Request #{i+1} using {proxy}")
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
        print(f"Success: {response.status_code}")
    except requests.RequestException:
        print("Failed with this proxy, trying next...")
Solution 2: Handling CAPTCHAs
The Problem: You encounter “I’m not a robot” checkboxes or image recognition challenges.
Approach 1: Avoidance (Best)
- Slow down your requests
- Mimic human behavior patterns
- Use a real browser (even headless): it executes JavaScript and keeps cookies, so it trips fewer challenges than raw HTTP requests
- Stick to residential proxies
Approach 2: Solving Services (When unavoidable)
Services like 2Captcha or Anti-Captcha solve CAPTCHAs for you (for a fee).
import requests
from twocaptcha import TwoCaptcha

solver = TwoCaptcha('YOUR_API_KEY')

# Download the CAPTCHA image and save it to disk
captcha_image_url = 'https://site.com/captcha.jpg'
response = requests.get(captcha_image_url)
with open('captcha.jpg', 'wb') as f:
    f.write(response.content)

# Send the saved image to the solving service and wait for the answer
result = solver.normal('captcha.jpg')
captcha_code = result['code']
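How you submit the solved code depends entirely on the target site's form. Continuing the snippet above, a rough sketch in which the form URL and field names are assumptions:
# Hypothetical form endpoint and field names -- adjust to the real site
payload = {
    'username': 'my_user',       # whatever other fields the form expects
    'captcha': captcha_code,     # the code returned by the solving service
}
response = requests.post('https://site.com/login', data=payload)
print(response.status_code)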
Approach 3: Manual Solving (For small scale)
Sometimes it’s easiest to just solve the occasional CAPTCHA manually.
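One simple pattern for this is to run a visible (non-headless) browser, pause the script while you solve the CAPTCHA by hand, and then let it continue. A quick sketch with Selenium, where the URL is a placeholder:
from selenium import webdriver

driver = webdriver.Chrome()  # visible window so a human can interact with it
driver.get('https://site-with-captcha.com')  # placeholder URL

# Block until a human has solved the CAPTCHA in the open browser window
input("Solve the CAPTCHA in the browser, then press Enter to continue...")

# The session now carries the solved-CAPTCHA cookies, so scraping can resume
html = driver.page_source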
Solution 3: Headless Browsers for JavaScript Sites
The Problem: Your scraper gets empty HTML because the content loads via JavaScript.
The Solution: Use a headless browser that executes JavaScript.
Selenium Example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument('--headless') # Run without GUI
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options)
# Navigate to page
driver.get('https://javascript-heavy-site.com')
# Wait for content to load
wait = WebDriverWait(driver, 10)
content = wait.until(EC.presence_of_element_located((By.ID, "dynamic-content")))
# Extract data
data = driver.find_element(By.CSS_SELECTOR, '.product-list').text
print(data)
driver.quit()
Puppeteer Example (Node.js):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Avoid detection
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36');
  await page.goto('https://javascript-heavy-site.com', { waitUntil: 'networkidle2' });

  // Wait for specific element
  await page.waitForSelector('.loaded-content');

  const data = await page.evaluate(() => {
    return document.querySelector('.product-price').innerText;
  });
  console.log(data);

  await browser.close();
})();
Solution 4: Mimicking Human Behavior
The Problem: Your scraper gets detected by behavior analysis.
The Solution: Make your bot act more human.
Techniques:
Random delays:
import random
import time
# Instead of fixed delays
time.sleep(2)
# Use random delays
time.sleep(random.uniform(1, 3))
Random scrolling:
# In Selenium
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(random.uniform(0.5, 2))
driver.execute_script("window.scrollTo(0, 500);")
Mouse movements:
from selenium.webdriver.common.action_chains import ActionChains
actions = ActionChains(driver)
element = driver.find_element(By.TAG_NAME, 'body')
actions.move_to_element(element).perform()
Realistic browsing patterns (see the sketch after this list):
- Visit multiple pages (not just the data you need)
- Sometimes go back, sometimes go forward
- Vary time spent on pages
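A rough sketch of what this can look like in Selenium; the URLs, probabilities, and timings below are illustrative assumptions, not a recipe:
import random
import time

from selenium import webdriver

driver = webdriver.Chrome()

# Hypothetical session: wander through a few pages instead of hitting only the target
pages = [
    'https://target-site.com/',
    'https://target-site.com/category/widgets',
    'https://target-site.com/product/123',
]

for url in pages:
    driver.get(url)
    time.sleep(random.uniform(2, 6))   # vary time spent on each page
    if random.random() < 0.3:          # occasionally go back and forward like a real visitor
        driver.back()
        time.sleep(random.uniform(1, 3))
        driver.forward()

driver.quit()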
Solution 5: Request Headers & Fingerprinting
The Problem: Your requests have bot-like headers.
The Solution: Send the full set of headers a real browser would send.
Good headers setup:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Cache-Control': 'max-age=0',
}
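To actually send these, attach them to a requests.Session so every request in the crawl carries them and cookies persist between requests, which also looks more natural:
import requests

session = requests.Session()
session.headers.update(headers)  # the headers dict defined above

response = session.get('https://target-site.com')
print(response.status_code)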
Advanced: The “Nuclear” Option – Full Browser Automation with Undetected Chrome
For extremely protected sites, we sometimes use specialized tools:
import undetected_chromedriver as uc
driver = uc.Chrome()
driver.get('https://heavily-protected-site.com')
# This driver is much harder to detect
Monitoring & Adaptive Strategies
The best defense is a good monitoring system (a minimal sketch follows this list):
- Success rate monitoring: Track what percentage of requests succeed
- Response analysis: Check for CAPTCHAs or blocks in responses
- Automatic switching: If one method fails, try another
- Alerting: Get notified when success rates drop
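Here is a minimal sketch of the idea, assuming a simple keyword check for block pages (real detection logic and thresholds are usually site-specific):
import requests

# Assumed phrases that indicate a block or challenge page -- tune these per site
BLOCK_MARKERS = ('captcha', 'access denied', 'unusual traffic')

def fetch(url, stats):
    """Fetch a URL and record whether the response looks successful and unblocked."""
    stats['total'] += 1
    try:
        response = requests.get(url, timeout=10)
        body = response.text.lower()
        if response.status_code == 200 and not any(marker in body for marker in BLOCK_MARKERS):
            stats['ok'] += 1
            return response
    except requests.RequestException:
        pass
    return None

stats = {'total': 0, 'ok': 0}
for url in ['https://target-site.com/page1', 'https://target-site.com/page2']:  # placeholder URLs
    fetch(url, stats)

success_rate = stats['ok'] / stats['total']
print(f"Success rate: {success_rate:.0%}")
if success_rate < 0.8:  # threshold is an assumption -- alert or switch strategies here
    print("Success rate dropping: rotate proxies, slow down, or switch methods")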
When to Give Up
Despite all these techniques, some websites are just too well-protected. If you’re facing:
- Constant blocks even with residential proxies
- Legal threats
- Advanced fingerprinting you can’t bypass
- Declining returns on time invested
…it might be time to reconsider. Can you:
- Use an official API instead?
- Purchase the data legally?
- Find an alternative data source?
- Partner with the website owner?
Our Complete Anti-Detection Stack
At Scraperscoop, we use a multi-layered approach:
- Intelligent proxy rotation (mix of residential and datacenter)
- Request fingerprint randomization
- Headless browsers with human-like behavior
- Automatic CAPTCHA solving when needed
- Continuous monitoring and adaptation
This stack handles 99% of websites, but we’re always updating as detection methods evolve.
Final Thoughts
Anti-scraping measures are an arms race. Today’s solution might not work tomorrow. The key is to:
- Stay updated on new techniques
- Have multiple strategies ready
- Always respect websites and their resources
- Know when to walk away
Need help with a particularly tough website? We specialize in handling complex anti-scraping measures. Contact us for a consultation.
Start Scraping Now!
Ready to unlock the power of data?