Data Collection from Web-scraping and API for Social Scientific Research

2025/2026

Type: Mago-Lego

Delivered by: International Laboratory for Applied Network Research

Where: International Laboratory for Applied Network Research

When: 4th module

Open to: students of one campus

Instructors: Lika Kapustina

Language: English

ECTS credits: 3

Contact hours: 40


Abstract

The course introduces students to web scraping and to the main Python libraries for collecting data from HTML pages (BeautifulSoup, Requests, Selenium). It covers working with APIs, handling dynamic pages (including individual page elements) through Selenium, and using APIs and libraries to get data from social networks and instant messengers. Each block of classes is devoted to a specific way of getting data from websites, social networks, and instant messengers. At the beginning, students learn how to process JSON data, work with the HTML markup of pages, and review the basics of the pandas library. Then, in each class, students practice their data collection skills using the open VKontakte API and the Pyrogram library for collecting data from Telegram, collect open court data from the Moscow City Court website using Selenium, and practice on other sources. By the end of the course, students will understand the different tools for collecting data from web pages and through APIs, and will be able to independently collect and process data from web sources for social science research and consulting solutions.
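
As an illustration of the JSON processing step mentioned above, here is a minimal sketch that flattens a nested API-style payload into rows ready for tabular storage. The payload and its field names (`response`, `items`, `likes`) are hypothetical and chosen only for the example; a real API may structure its responses differently.

```python
import json

# Hypothetical API payload; the field names are illustrative assumptions.
RAW = """
{"response": {"count": 2, "items": [
  {"id": 1, "date": 1700000000, "text": "First post", "likes": {"count": 5}},
  {"id": 2, "date": 1700003600, "text": "Second post", "likes": {"count": 0}}
]}}
"""


def flatten_items(raw_json):
    """Turn a nested JSON payload into a list of flat dicts, one per row."""
    payload = json.loads(raw_json)
    rows = []
    for item in payload["response"]["items"]:
        rows.append({
            "id": item["id"],
            "date": item["date"],
            "text": item["text"],
            # The nested "likes" object is reduced to a single column.
            "likes": item.get("likes", {}).get("count", 0),
        })
    return rows


rows = flatten_items(RAW)
```

A list of flat dicts like `rows` can be passed directly to `pandas.DataFrame(rows)` to get a table.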

Learning Objectives

As a result of the course, students will gain an understanding of different tools for collecting data from web pages and through APIs, and learn how to independently collect and process data from web sources for social science research and consulting solutions.

Expected Learning Outcomes

Understand the syntax of the HTML markup language
Find the necessary elements in the structure of HTML pages
Parse HTML using the BeautifulSoup library in Python
Process HTML markup data in Python and save it in tabular form
Send requests to servers using the requests library in Python
Understand APIs and the specifics of sending requests and processing API data in Python
Work with third-party Python libraries for automated data collection
Manage a browser using the Selenium library in Python
Be familiar with ethical and legal standards in the field of automated data collection
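
To give a sense of the HTML parsing outcomes listed above, the following sketch uses BeautifulSoup to pull rows out of a small page fragment. The fragment, its tag structure, and the class name `article` are assumptions made up for this example and are not taken from any course material.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; the tags and class names are assumptions.
HTML = """
<ul id="news">
  <li class="article"><a href="/news/1">First headline</a><span>2025-04-01</span></li>
  <li class="article"><a href="/news/2">Second headline</a><span>2025-04-02</span></li>
</ul>
"""


def parse_articles(html):
    """Extract title, link, and date from each list item with class 'article'."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for li in soup.find_all("li", class_="article"):
        link = li.find("a")
        rows.append({
            "title": link.get_text(strip=True),
            "url": link["href"],
            "date": li.find("span").get_text(strip=True),
        })
    return rows


articles = parse_articles(HTML)
```

In practice the HTML string would typically come from a response obtained with the requests library rather than from an inline constant.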

Course Contents

Introduction to web scraping. Basics of HTML
Parsing HTML with BeautifulSoup and sending requests using the requests library
Working with APIs
Working with third-party libraries for data collection with Python
Browser management with Selenium
Ethical and regulatory challenges associated with automated data collection
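
One practical aspect of the ethical topic in the list above is checking a site's robots.txt before collecting data. The sketch below uses Python's standard `urllib.robotparser` on an inline robots.txt; the rules, the `course-bot` user agent, and the example.com URLs are hypothetical and serve only as an illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; real sites publish their own rules.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Create a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "course-bot" is an assumed user agent name for the example.
allowed = parser.can_fetch("course-bot", "https://example.com/articles/1")
blocked = parser.can_fetch("course-bot", "https://example.com/private/data")
delay = parser.crawl_delay("course-bot")
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` would fetch the actual file. Respecting robots.txt is only one part of ethical data collection and does not replace checking a site's terms of use or applicable law.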

Assessment Elements

Final Project
The final project is a complete piece of work that includes both Python code for data collection, processing, and analysis, and written elements: a presentation of the research idea, a description of the resulting data set, and a presentation of the results of solving the research problem using relevant data analysis methods. The project is completed individually or in teams of up to two people and uploaded by students to SmartLMS.
Homeworks
Two homework assignments on working with requests and BeautifulSoup, as well as Selenium.
Tests
Two tests in which students solve practical tasks in Python in the SmartLMS interface.

Interim Assessment

2025/2026 4th module
0.4 * Final Project + 0.3 * Homeworks + 0.3 * Tests
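
As a worked example of the formula above, the sketch below computes the weighted grade from the three components. The component scores are made up, and the sketch assumes each component is already expressed as a single number on a common scale; how the two homeworks and two tests are combined, and any rounding rules, are not specified here.

```python
# Weights taken from the interim assessment formula.
WEIGHTS = {"final_project": 0.4, "homeworks": 0.3, "tests": 0.3}


def interim_grade(scores):
    """Weighted sum of component scores (assumed to share one scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores for illustration only.
example = interim_grade({"final_project": 8, "homeworks": 7, "tests": 9})
```

With these hypothetical scores the grade is 0.4 * 8 + 0.3 * 7 + 0.3 * 9 = 3.2 + 2.1 + 2.7, which is approximately 8.0.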

Bibliography

Recommended Core Bibliography

Russell, M., & Klassen, M. (2020). Data Mining: Extracting Information from Facebook, Twitter, LinkedIn, Instagram, GitHub (3rd ed., in Russian). Saint Petersburg: Piter. ISBN 978-5-4461-1246-3. https://ibooks.ru/bookshelf/365266
Gauchat, J. D. (2015). HTML5: For Professionals (in Russian).
Percival, H. (2018). Python. Test-Driven Development: Obey the Testing Goat, Using Django, Selenium, and JavaScript (in Russian).
Seppe vanden Broucke, & Bart Baesens. (2018). Practical Web Scraping for Data Science : Best Practices and Examples with Python. Apress.

Recommended Additional Bibliography

Bogdanov, M. R., Vakhidova, L. V., Dumchikova, I. N., & Miniyarova, L. V. (2015). HTML5: Complete Course: A Study Guide (in Russian). Ufa: M. Akmulla Bashkir State Pedagogical University. 168 pp. Electronic text, Lan electronic library system. URL: https://e.lanbook.com/book/72507. Access for authorized users only.

Authors

Pavlova Irina Anatolevna
