Effective Methods to Scrape Yandex Search Results
Yandex might not be the first search engine that comes to mind when you think of scraping, but it can offer valuable data if you know how to handle it. In this guide, we'll show you how to build a custom Yandex scraper using proxies, and how to scale it with a scraper API. Let's break down the obstacles, the best practices, and how you can access Yandex's search results like a pro.
What You Need to Know About Yandex SERPs
Yandex SERPs resemble what you'd see on Google, with a mix of organic results and paid ads. When searching for something like "iPhone," the page is divided into two sections: advertisements at the top and organic results below. Ads are clearly marked with tags like "Sponsored" or "Advertisement." Below those, you’ll find organic search results—pages ranked by Yandex based on relevance, authority, and several other factors.
The catch? Scraping Yandex results isn’t as simple as it looks.
The Challenges of Scraping Yandex
Yandex's anti-scraping measures can make your life difficult. The most notorious? CAPTCHA. If you hit the search engine too frequently, it will prompt a CAPTCHA, and once that happens, your IP might be blocked. Worse, Yandex continuously updates its anti-bot system, making manual scraping techniques a maintenance nightmare.
But don't worry. You can still scrape Yandex like a pro if you use proxies. Proxies help by rotating your IP, making you far less likely to get blocked. Combine that with a robust API, and you've got yourself a scalable solution.
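For illustration, here's what manual rotation looks like when a provider gives you a list of static endpoints rather than a single rotating gateway (the addresses below are placeholders, not real servers):

import itertools
import requests

# Hypothetical proxy endpoints; substitute your provider's real addresses
PROXY_POOL = itertools.cycle([
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
    'http://user:pass@proxy3.example.com:8000',
])

def fetch(url):
    # Each call sends the request through the next proxy in the pool
    proxy = next(PROXY_POOL)
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=15)

A rotating gateway like the one used in the steps below does this server-side, so a single endpoint is enough. With that covered, let's dive into the step-by-step process.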
Setting Up Your Environment
Start by ensuring Python is installed on your machine. If it's not, head over to the official Python website to grab the latest version. Once Python is up and running, you'll need three essential libraries: requests, BeautifulSoup, and pandas. Install them with the following command:
pip install requests pandas beautifulsoup4
- requests handles making network requests.
- BeautifulSoup is your HTML parser and data extractor.
- pandas will help you store the scraped data in an easy-to-manage CSV file.
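To confirm the environment is ready, a quick import check does the trick:

import requests
import bs4
import pandas

# Print the installed versions to confirm everything imported cleanly
print('requests', requests.__version__)
print('beautifulsoup4', bs4.__version__)
print('pandas', pandas.__version__)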
Scraping Yandex with Proxies
Alright, let’s get started on the scraping code.
Step 1: Set Up Proxies and Headers
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Replace with your proxy credentials
USERNAME = 'PROXY_USERNAME'
PASSWORD = 'PROXY_PASSWORD'

proxies = {
    'http': f'https://{USERNAME}:{PASSWORD}@pr.swiftproxy.net:7777',
    'https': f'https://{USERNAME}:{PASSWORD}@pr.swiftproxy.net:7777'
}

# Browser-like headers reduce the chance of being flagged as a bot
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:137.0) Gecko/20100101 Firefox/137.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9,ru;q=0.8',
    'Connection': 'keep-alive'
}
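Before aiming this at Yandex, it's worth a quick sanity check that traffic really flows through the proxy. The snippet below hits httpbin.org/ip, a public IP-echo service (my choice here; any equivalent endpoint works):

# The returned origin IP should belong to the proxy, not your machine
check = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=15)
print(check.json())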
Step 2: Send the GET Request
response = requests.get(
    'https://yandex.com/search/?text=what%20is%20web%20scraping',
    proxies=proxies,
    headers=headers
)
# Raise an exception if the request was blocked or failed
response.raise_for_status()
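In practice, individual requests will still occasionally hit a throttle or a CAPTCHA page. A minimal retry-with-backoff wrapper helps; treating 403 and 429 as retryable is my assumption, so tune it to what you observe:

import time

def get_with_retries(url, retries=3, backoff=5):
    # Retry transient blocks with an increasing delay; with a rotating
    # proxy endpoint, each retry typically goes out from a fresh IP
    for attempt in range(retries):
        response = requests.get(url, proxies=proxies, headers=headers, timeout=15)
        if response.status_code not in (403, 429):
            response.raise_for_status()
            return response
        time.sleep(backoff * (attempt + 1))
    raise RuntimeError(f'Still blocked after {retries} attempts: {url}')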
Step 3: Parse the Results
soup = BeautifulSoup(response.text, 'html.parser')

data = []
# Each organic result card is an <li> with the serp-item_card class
for listing in soup.select('li.serp-item_card'):
    title_el = listing.select_one('h2 > span')
    title = title_el.text if title_el else None
    link_el = listing.select_one('.organic__url')
    link = link_el.get('href') if link_el else None
    data.append({'Title': title, 'Link': link})
Step 4: Save the Results
df = pd.DataFrame(data)
df.to_csv('yandex_results.csv', index=False)
Run this code, and voilà—you’ll have a CSV file with the search results you need.
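One page is rarely enough. Yandex paginates its results with a p query parameter (zero-indexed, as far as I can tell; verify against live pages before relying on it). Here's a sketch that walks the first few pages, reusing the proxies and headers from Step 1:

import time
import urllib.parse

query = urllib.parse.quote('what is web scraping')
all_rows = []
for page in range(3):  # first three result pages
    url = f'https://yandex.com/search/?text={query}&p={page}'
    response = requests.get(url, proxies=proxies, headers=headers)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    for listing in soup.select('li.serp-item_card'):
        title_el = listing.select_one('h2 > span')
        link_el = listing.select_one('.organic__url')
        all_rows.append({
            'Title': title_el.text if title_el else None,
            'Link': link_el.get('href') if link_el else None,
        })
    time.sleep(2)  # polite delay between pages

pd.DataFrame(all_rows).to_csv('yandex_results_paginated.csv', index=False)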
Scaling with a Scraper API
As you scale up, a custom scraper can become unwieldy. That's where a scraper API comes in: it handles IP rotation, CAPTCHA bypass, and parsing, so you don't have to worry about any of the tedious stuff.
Step 1: Set Up Your API Request
import requests
import pandas as pd

# 'universal' source: fetch the given URL, here a Yandex results page
payload = {
    'source': 'universal',
    'url': 'https://yandex.com/search/?text=what%20is%20web%20scraping',
}
Step 2: Add Custom Parsing Logic
payload['parsing_instructions'] = {
    'listings': {
        # Select every organic result card on the page
        '_fns': [{'_fn': 'css', '_args': ['li.serp-item_card']}],
        '_items': {
            # For each card, pull the title text...
            'title': {
                '_fns': [
                    {'_fn': 'css_one', '_args': ['h2 > span']},
                    {'_fn': 'element_text'}
                ]
            },
            # ...and the href of the organic result link
            'link': {
                '_fns': [
                    {'_fn': 'xpath_one', '_args': ['.//a[contains(@class, "organic__url")]/@href']}
                ]
            }
        }
    }
}
Step 3: Send the API Request
# Send the job to the API, authenticating with your API credentials
response = requests.post(
    'https://realtime.swiftproxy.net/v1/queries',
    auth=('API_USERNAME', 'API_PASSWORD'),
    json=payload
)
response.raise_for_status()
Step 4: Export the Data to CSV
# Pull the parsed listings out of the API's JSON response
data = response.json()['results'][0]['content']['listings']
df = pd.DataFrame(data)
df.to_csv('yandex_results_API.csv', index=False)
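Scaling to many keywords from here is just a loop over the same payload. Here's a sketch, with the keyword list and output filename as my own placeholders:

keywords = ['what is web scraping', 'web scraping tools', 'python web scraping']

frames = []
for kw in keywords:
    # Reuse the payload from above, swapping in a new search URL each time
    payload['url'] = 'https://yandex.com/search/?text=' + requests.utils.quote(kw)
    response = requests.post(
        'https://realtime.swiftproxy.net/v1/queries',
        auth=('API_USERNAME', 'API_PASSWORD'),
        json=payload
    )
    response.raise_for_status()
    listings = response.json()['results'][0]['content']['listings']
    df = pd.DataFrame(listings)
    df['Keyword'] = kw  # track which query each row came from
    frames.append(df)

pd.concat(frames).to_csv('yandex_results_all_keywords.csv', index=False)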
Comparing Scraping Methods
- Scraping without proxies is simple to implement, but it leads to frequent IP blocks, CAPTCHA issues, and poor performance at scale.
- Using proxies helps avoid IP blocks and is better suited for high-volume scraping tasks, but it incurs the cost of proxies and adds maintenance overhead.
- A scraper API is scalable, offers IP rotation, bypasses CAPTCHA, and has a fast setup, but it comes with recurring fees and limited flexibility for more complex scraping needs.
- Fully custom solutions provide full control and avoid subscription fees, but they require a complex setup, often result in slower performance, and demand technical expertise.
Conclusion
Scraping Yandex doesn’t have to be a headache. With the right tools, like proxies or the API, you can bypass CAPTCHAs, scale your operations, and extract valuable search data at scale. Whether you’re looking for quick results or an enterprise-level solution, you now have the knowledge to scrape Yandex with confidence.