2025/2026

Data Collection from Web-scraping and API for Social Scientific Research

Type: Mago-Lego
When: 4th module
Open to: students of one campus
Instructors: Lika Kapustina
Language: English
Contact hours: 40

Course Syllabus

Abstract

The course introduces students to web scraping and to the main Python libraries for collecting data from HTML pages (BeautifulSoup, Requests, Selenium). It covers working with APIs, handling dynamic pages (including individual page elements) through Selenium, and using libraries that retrieve data from social networks and instant messengers. Each block of classes is devoted to one specific way of getting data from websites, social networks, or instant messengers. At the beginning, students will learn how to process JSON data, work with the HTML markup of pages, and review the basics of the pandas library. Then, in each class, students will practice data collection using the open VKontakte API, the Pyrogram library for collecting data from Telegram, and Selenium for collecting open court data from the Moscow City Court website, and will apply the same skills to other sources. As a result of the course, students will gain an understanding of different tools for collecting data from web pages and through APIs, and learn how to independently collect and process data from web sources for social science research and consulting solutions.
Learning Objectives

  • As a result of the course, students will gain an understanding of different tools for collecting data from web pages and through APIs, and learn how to independently collect and process data from web sources for social science research and consulting solutions.
Expected Learning Outcomes

  • Understand the syntax of the HTML markup language
  • Find the necessary elements in the structure of HTML pages
  • Parse HTML using the BeautifulSoup library in Python
  • Process HTML markup data and save it in tabular form in Python
  • Send requests to servers using the requests library in Python
  • Understand APIs and the specifics of sending requests and processing API data in Python
  • Work with third-party Python libraries for automated data collection
  • Manage a browser using the Selenium library in Python
  • Be familiar with ethical and legal standards in the field of automated data collection
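
As an illustration of the outcomes on locating elements in HTML and saving them in tabular form, here is a minimal sketch. It deliberately uses Python's standard-library `html.parser` instead of BeautifulSoup so that it runs without extra installs; the sample markup, the `TableParser` class, and the `html_table_to_csv` helper are hypothetical names introduced only for this example, not course material.

```python
from html.parser import HTMLParser
import csv
import io

# Hypothetical HTML fragment standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Case</th><th>Year</th></tr>
  <tr><td>A-101</td><td>2023</td></tr>
  <tr><td>B-202</td><td>2024</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Text between tags is kept only when it belongs to a cell.
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Return the table rows as lists and as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return parser.rows, buffer.getvalue()


rows, csv_text = html_table_to_csv(SAMPLE_HTML)
print(csv_text)
```

With BeautifulSoup the same idea is usually expressed by searching the parsed tree for `tr` and `td` elements; the resulting list of rows can then be loaded into a pandas DataFrame.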
Course Contents

  • Introduction to web scraping. Basics of HTML
  • Parsing HTML with BeautifulSoup and sending requests using the requests library
  • Working with APIs
  • Working with third-party libraries for data collection with Python
  • Browser management with Selenium
  • Ethical and regulatory challenges associated with automated data collection
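
For the topic on working with APIs, the following sketch shows one common pattern: parsing a JSON response body and flattening nested fields into flat records. The response structure, its field names, and the `flatten_items` helper are assumptions made for illustration; a real API such as VKontakte's defines its own schema, so the keys would need to be adapted.

```python
import json

# Hypothetical response body; real APIs define their own field names.
RESPONSE_TEXT = """
{"response": {"count": 2, "items": [
  {"id": 1, "text": "First post", "likes": {"count": 5}},
  {"id": 2, "text": "Second post"}
]}}
"""


def flatten_items(response_text):
    """Turn a nested JSON response into a list of flat dictionaries."""
    payload = json.loads(response_text)
    items = payload.get("response", {}).get("items", [])
    records = []
    for item in items:
        records.append({
            "id": item["id"],
            "text": item.get("text", ""),
            # Missing nested fields fall back to a default value.
            "likes": item.get("likes", {}).get("count", 0),
        })
    return records


records = flatten_items(RESPONSE_TEXT)
for record in records:
    print(record)
```

A list of flat dictionaries like `records` can be passed directly to `pandas.DataFrame` to obtain a table for further analysis.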
Assessment Elements

  • non-blocking Final Project
    The project is a full-fledged piece of work that includes both Python code for data collection, processing, and analysis and textual elements: a presentation of the research idea, a description of the resulting data set, and a presentation of the results of solving the research problem using relevant data analysis methods. The project is completed individually or in teams of up to two people and is uploaded by students to SmartLMS.
  • non-blocking Homeworks
    Two homework assignments covering requests and BeautifulSoup, as well as Selenium.
  • non-blocking Tests
    Two tests with solutions to practical tasks in Python in the SmartLMS interface.
Interim Assessment

  • 2025/2026 4th module
    0.4 * Final Project + 0.3 * Homeworks + 0.3 * Tests
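
The weighting above can be checked with a short calculation. This is only a sketch: the example scores, the 10-point scale they imply, and the `interim_grade` helper are assumptions, and the syllabus does not state how the result is rounded.

```python
# Weights taken from the interim assessment formula.
WEIGHTS = {"final_project": 0.4, "homeworks": 0.3, "tests": 0.3}


def interim_grade(scores):
    """Weighted sum of the three assessment elements."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores on an assumed 10-point scale.
example = {"final_project": 8, "homeworks": 7, "tests": 9}
print(round(interim_grade(example), 2))  # 0.4*8 + 0.3*7 + 0.3*9
```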
Bibliography

Recommended Core Bibliography

  • Russell, M., & Klassen, M. (2020). Data Mining: Extracting Information from Facebook, Twitter, LinkedIn, Instagram, GitHub (3rd ed., in Russian). St. Petersburg: Piter. ISBN 978-5-4461-1246-3. https://ibooks.ru/bookshelf/365266
  • Gauchat, J. D. (2015). HTML5 for Professionals (in Russian).
  • Percival, H. (2018). Test-Driven Development with Python: Obey the Testing Goat: Using Django, Selenium, and JavaScript (in Russian).
  • vanden Broucke, S., & Baesens, B. (2018). Practical Web Scraping for Data Science: Best Practices and Examples with Python. Apress.

Recommended Additional Bibliography

  • Bogdanov, M. R., Vakhidova, L. V., Dumchikova, I. N., & Miniyarova, L. V. (2015). HTML5 Complete Course: A Study Guide (in Russian). Ufa: M. Akmulla Bashkir State Pedagogical University. 168 pp. Electronic text, Lan electronic library system. URL: https://e.lanbook.com/book/72507 (accessed: 00.00.0000). Access for authorized users.

Authors

  • Pavlova Irina Anatolevna