Effective Methods to Scrape Yandex Search Results

Yandex might not be the first search engine that comes to mind when you think of scraping, but it can offer valuable data if you know how to handle it. In this guide, we’ll show you how to build a custom Yandex scraper using proxies, and how to scale it with a scraper API. Let’s break down the obstacles, the best practices, and how you can access Yandex’s search results like a pro.

What You Need to Know About Yandex SERPs

Yandex SERPs resemble what you'd see on Google, with a mix of organic results and paid ads. When searching for something like "iPhone," the page is divided into two sections: advertisements at the top and organic results below. Ads are clearly marked with tags like "Sponsored" or "Advertisement." Below those, you’ll find organic search results—pages ranked by Yandex based on relevance, authority, and several other factors.
The catch? Scraping Yandex results isn’t as simple as it looks.

The Challenges of Scraping Yandex

Yandex's anti-scraping measures can make your life difficult. The most notorious? CAPTCHA. If you hit the search engine too frequently, it will prompt a CAPTCHA, and once that happens, your IP might be blocked. Worse, Yandex continuously updates its anti-bot system, making manual scraping techniques a maintenance nightmare.
But don’t worry: you can still scrape Yandex like a pro if you use proxies. Proxies rotate your IP address so that each request appears to come from a different machine, which makes blocks far less likely. Combine that with a robust scraper API, and you’ve got yourself a scalable solution.
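
To make the rotation idea concrete, here’s a minimal sketch that picks a proxy at random from a small pool on every request. The pool entries are placeholders rather than real endpoints; rotating providers typically hide the pool behind a single gateway address, which is the setup we’ll use below.

import random
import requests

# Hypothetical pool of proxy endpoints; replace with your provider's gateways.
PROXY_POOL = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
    'http://user:pass@proxy3.example.com:8000',
]

def get_with_rotation(url, **kwargs):
    # Each request goes out through a different IP, so no single
    # address accumulates enough traffic to trigger a block.
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, **kwargs)

With that intuition in place, let’s dive into the step-by-step process.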

Setting Up Your Environment

Start by ensuring Python is installed on your machine. If it's not, head over to the official Python website to grab the latest version. Once Python is up and running, you’ll need three essential libraries: requests, BeautifulSoup, and pandas. Install them with the following command:

pip install requests pandas beautifulsoup4
  • requests sends the HTTP requests through your proxies.
  • BeautifulSoup parses the HTML and extracts the data you need.
  • pandas stores the scraped data in an easy-to-manage CSV file.
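
To confirm everything installed correctly, a quick import check should print a confirmation instead of raising an ImportError:

python -c "import requests, bs4, pandas; print('All set')"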

Scraping Yandex with Proxies

Alright, let’s get started on the scraping code.

Step 1: Set Up Proxies and Headers

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Replace these placeholders with your actual proxy credentials.
USERNAME = 'PROXY_USERNAME'
PASSWORD = 'PROXY_PASSWORD'

# Route both HTTP and HTTPS traffic through the proxy gateway.
proxies = {
    'http': f'https://{USERNAME}:{PASSWORD}@pr.swiftproxy.net:7777',
    'https': f'https://{USERNAME}:{PASSWORD}@pr.swiftproxy.net:7777'
}

# Browser-like headers make the request look less like a bot.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:137.0) Gecko/20100101 Firefox/137.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9,ru;q=0.8',
    'Connection': 'keep-alive'
}
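
Before pointing this at Yandex, it’s worth confirming that the proxy actually carries your traffic. A quick sanity check against an IP-echo service such as httpbin:

# Should print the proxy's exit IP, not your own address.
check = requests.get('https://httpbin.org/ip', proxies=proxies, headers=headers, timeout=10)
print(check.json())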

Step 2: Send the GET Request

# Query Yandex through the proxy; the search phrase is URL-encoded in the text parameter.
response = requests.get(
    'https://yandex.com/search/?text=what%20is%20web%20scraping',
    proxies=proxies,
    headers=headers
)
# Fail fast if Yandex returns an error status.
response.raise_for_status()
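
In practice, some requests will still hit a CAPTCHA page or a temporary block. A small retry helper with increasing delays (the retry count and delay here are arbitrary starting points) keeps a long run from dying on the first hiccup; it reuses the proxies and headers from Step 1:

import time

def fetch_with_retries(url, retries=3, delay=5):
    # Retry transient failures, waiting longer between attempts each time.
    for attempt in range(retries):
        try:
            resp = requests.get(url, proxies=proxies, headers=headers, timeout=15)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))

response = fetch_with_retries('https://yandex.com/search/?text=what%20is%20web%20scraping')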

Step 3: Parse the Results

# Parse the returned HTML.
soup = BeautifulSoup(response.text, 'html.parser')

data = []
# Each organic result sits in an li.serp-item_card element
# (selectors current at the time of writing; Yandex reshuffles its markup often).
for listing in soup.select('li.serp-item_card'):
    title_el = listing.select_one('h2 > span')
    title = title_el.text if title_el else None
    link_el = listing.select_one('.organic__url')
    link = link_el.get('href') if link_el else None

    data.append({'Title': title, 'Link': link})

Step 4: Save the Results

# Load the results into a DataFrame and write them to CSV without the index column.
df = pd.DataFrame(data)
df.to_csv('yandex_results.csv', index=False)

Run this code and you’ll have a CSV file with the search results you need. If you want more than the first page of results, see the pagination sketch below.
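
One page is rarely enough. Yandex paginates its results with a p query parameter (zero-indexed, as far as we’ve observed; verify against live URLs before relying on it), so a short loop can collect several pages, pausing between requests to stay under the radar:

import time
from urllib.parse import quote

query = 'what is web scraping'
data = []
for page in range(3):  # first three result pages
    url = f'https://yandex.com/search/?text={quote(query)}&p={page}'
    response = requests.get(url, proxies=proxies, headers=headers)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    for listing in soup.select('li.serp-item_card'):
        title_el = listing.select_one('h2 > span')
        link_el = listing.select_one('.organic__url')
        data.append({
            'Title': title_el.text if title_el else None,
            'Link': link_el.get('href') if link_el else None,
        })
    time.sleep(2)  # brief pause between pages to avoid tripping rate limits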

Scaling with a Scraper API

As you scale up, a custom scraper can become unwieldy. That’s where a scraper API comes in: it handles IP rotation, CAPTCHA bypass, and parsing, so you don’t need to worry about any of the tedious stuff.

Step 1: Set Up Your API Request

import requests
import pandas as pd

# The payload tells the API what to fetch; the 'universal' source targets any URL you pass in.
payload = {
    'source': 'universal',
    'url': 'https://yandex.com/search/?text=what%20is%20web%20scraping',
}

Step 2: Add Custom Parsing Logic

# Describe how the API should parse the page server-side:
# each _fns entry is a pipeline of extraction functions applied in order.
payload['parsing_instructions'] = {
    'listings': {
        # Select every organic result card on the page.
        '_fns': [{'_fn': 'css', '_args': ['li.serp-item_card']}],
        '_items': {
            'title': {
                '_fns': [
                    {'_fn': 'css_one', '_args': ['h2 > span']},
                    {'_fn': 'element_text'}
                ]
            },
            'link': {
                '_fns': [
                    # Pull the href attribute straight off the result link.
                    {'_fn': 'xpath_one', '_args': ['.//a[contains(@class, "organic__url")]/@href']}
                ]
            }
        }
    }
}
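
The same pattern extends to any other field on the result card. For example, assuming the snippet text lives in an element matching .organic__content-text (a hypothetical selector; inspect the live markup to confirm), a description field would look like this:

# Hypothetical selector; confirm against the live Yandex markup before using.
payload['parsing_instructions']['listings']['_items']['description'] = {
    '_fns': [
        {'_fn': 'css_one', '_args': ['.organic__content-text']},
        {'_fn': 'element_text'}
    ]
}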

Step 3: Send the API Request

# Post the job to the API's realtime endpoint; swap in your own API credentials.
response = requests.post(
    'https://realtime.swiftproxy.net/v1/queries',
    auth=('API_USERNAME', 'API_PASSWORD'),
    json=payload
)
response.raise_for_status()

Step 4: Export the Data to CSV

# The listings arrive pre-parsed as structured JSON; no HTML handling needed.
data = response.json()['results'][0]['content']['listings']

df = pd.DataFrame(data)
df.to_csv('yandex_results_API.csv', index=False)
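
A quick preview confirms the export before you build anything on top of it:

# Sanity-check the output: row count plus the first few records.
print(f'{len(df)} results saved')
print(df.head())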

Comparing Scraping Methods

Here’s how the main approaches stack up:

  • No proxies: simple to implement, but prone to frequent IP blocks, CAPTCHA issues, and poor performance at scale.
  • With proxies: avoids IP blocks and suits high-volume scraping, but adds proxy costs and maintenance overhead.
  • Scraper API: scalable, with built-in IP rotation, CAPTCHA bypass, and a fast setup, but carries recurring fees and limited flexibility for complex scraping needs.
  • Fully custom solution: full control and no subscription fees, but a complex setup, often slower performance, and real technical expertise required.

Conclusion

Scraping Yandex doesn’t have to be a headache. With the right tools, such as proxies or a scraper API, you can bypass CAPTCHAs, scale your operations, and extract valuable search data. Whether you need quick results or an enterprise-level pipeline, you now have the knowledge to scrape Yandex with confidence.