Data Extraction Challenges

Pre-requisites

  1. Basic understanding of HTML & CSS

  2. Understanding of Python or JavaScript (with libraries such as Cheerio or Puppeteer)

Challenges

Week 1 🟩

Target Website:

Recipe website (https://www.allrecipes.com/recipes/)

Pre-requisites

Before you begin, ensure you have the following:

  • Basic Knowledge of Web Scraping: Familiarity with HTML structure and basic web scraping concepts.

  • Programming Skills: Proficiency in Python or JavaScript.

  • Tools Installed:

    • For Python: BeautifulSoup, requests, sqlite3

    • For JavaScript: Node.js with libraries such as Puppeteer

  • Environment Setup: Have your development environment (IDE or text editor) set up and ready to code.

Description

Imagine you’re a data analyst at a startup developing a new recipe recommendation app. Your app aims to help users find the best recipes based on their dietary preferences and ingredient availability. However, to build this app, you need a substantial database of recipes. Instead of manually entering hundreds of recipes, you can use web scraping to automatically gather this data from popular recipe websites like AllRecipes. This approach not only saves time but also ensures your app has a diverse and up-to-date collection of recipes, making it more appealing to users and a stronger portfolio piece to show potential recruiters.

Problem

You are tasked with extracting the following details from AllRecipes:

  1. Recipe Titles: The name of each recipe.

  2. Ingredients: A list of ingredients required for each recipe.

  3. Ratings: The average user rating for each recipe.

Instructions

  1. Identify the Structure:

    • Use your browser's developer tools to inspect the HTML structure of the recipe pages.

    • Look for patterns in how recipes are listed and how individual details like title, ingredients, and ratings are presented within the HTML code.

    • Take notes of the HTML tags, classes, or IDs that contain the required information.

  2. Extract the Data:

    • Choose a web scraping library like BeautifulSoup (Python) or Puppeteer (JavaScript).

    • Write a script to navigate the HTML structure and extract the desired data points.

    • Ensure your script handles pagination if recipes are spread across multiple pages.

  3. Store the Data:

    • Save the scraped data in a structured format like an SQLite database.

    • Ensure each recipe entry in the database includes the title, ingredients, and rating (a minimal sketch follows this list).
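
Below is a minimal, hedged sketch of this pipeline in Python (requests, BeautifulSoup, sqlite3). The listing URL comes from this brief, but every CSS selector is a placeholder assumption: verify the real class names with your browser's developer tools (step 1) before relying on them.

    import sqlite3

    import requests
    from bs4 import BeautifulSoup

    LISTING_URL = 'https://www.allrecipes.com/recipes/'

    def get_soup(url):
        # Fetch a page and parse it; the timeout keeps the script from hanging.
        return BeautifulSoup(requests.get(url, timeout=10).text, 'html.parser')

    def scrape_recipe(url):
        soup = get_soup(url)
        # All selectors below are placeholders -- adjust them after inspecting the page.
        title = soup.select_one('h1').get_text(strip=True)
        ingredients = [li.get_text(strip=True) for li in soup.select('li.ingredient')]
        rating_tag = soup.select_one('.rating')
        rating = rating_tag.get_text(strip=True) if rating_tag else None
        return title, '; '.join(ingredients), rating

    conn = sqlite3.connect('recipes.db')
    conn.execute('CREATE TABLE IF NOT EXISTS recipes (title TEXT, ingredients TEXT, rating TEXT)')

    listing = get_soup(LISTING_URL)
    for link in listing.select('a.recipe-card'):        # placeholder selector for recipe links
        conn.execute('INSERT INTO recipes VALUES (?, ?, ?)', scrape_recipe(link['href']))

    conn.commit()
    conn.close()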

Submission Guidelines

  1. Code Submission:

    • Provide a .zip file containing your web scraping script or link to your GitHub repo.

    • Include any necessary configuration files and dependencies required to run your script.

  2. Data Submission:

    • Submit the SQLite database file (.db) containing the scraped data.

Learning Resources

Tools

Week 2 🟩

Target Website

Real Estate Directory

Objective

Scrape a list of houses for sale, extracting key details such as the title, description, and date of each listing from the first 5 pages of the website.

Pre-requisites

Before you begin, ensure you have the following:

  • Basic Knowledge of Web Scraping: Familiarity with HTML structure and basic web scraping concepts.

  • Programming Skills: Proficiency in Python.

  • Tools Installed:

    • Python libraries: BeautifulSoup, requests, pandas

  • Environment Setup: Have your development environment (IDE or text editor) set up and ready to code.

Problem Description

Imagine you are working for a real estate analytics company that aims to provide comprehensive market insights to property investors. To achieve this, you need up-to-date data on property listings, including prices, descriptions, and availability. Manually gathering this information from various real estate websites can be time-consuming and prone to errors. By creating a web scraper, you can automate the data collection process, ensuring that your company has access to the most current and accurate information. This not only saves time but also enhances the quality of your market analysis, making your insights more valuable to clients and attracting the attention of recruiters looking for innovative problem solvers.

Problem

You are tasked with extracting the following details from Nigeria Property Centre:

  1. Listing Titles: The title of each house listing.

  2. Descriptions: A brief description of the property.

  3. Listing Dates: The date each listing was posted.

  4. Other Details: Number of bedrooms, bathrooms, and toilets.

Instructions

  1. Identify the Structure:

    • Use your browser's developer tools to inspect the HTML structure of the listing pages.

    • Look for patterns in how listings are presented, including title, description, date, and other details.

  2. Extract the Data:

    • Use a web scraping library like BeautifulSoup or Scrapy in Python.

    • Write a script to navigate the HTML structure and extract the desired data points from the first 5 pages.

    • Handle pagination by identifying patterns in the URL structure or HTML code to navigate between pages (a minimal sketch follows this list).

  3. Respectful Scraping:

    • Check the website's robots.txt file and terms of service.

    • Scrape responsibly to avoid overwhelming the server with requests.

    • Implement delays between requests if necessary to prevent server overload.
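
The sketch below shows one way to walk the first 5 result pages with requests, BeautifulSoup, and pandas. The page-number query parameter and every CSS selector are assumptions about Nigeria Property Centre's markup; confirm them with your browser's developer tools before running.

    import time

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    # Assumed URL pattern -- verify it against the live site.
    BASE_URL = 'https://nigeriapropertycentre.com/for-sale?page={page}'

    rows = []
    for page in range(1, 6):                      # first 5 pages only
        html = requests.get(BASE_URL.format(page=page), timeout=10).text
        soup = BeautifulSoup(html, 'html.parser')
        for card in soup.select('div.property-list'):   # placeholder selector for one listing card
            def text(selector):
                tag = card.select_one(selector)
                return tag.get_text(strip=True) if tag else None
            rows.append({
                'title': text('h4'),
                'description': text('.description'),
                'date': text('.added-on'),
            })
        time.sleep(2)                             # be polite between page requests

    pd.DataFrame(rows).to_csv('listings.csv', index=False)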

Bonus Challenge

  • Enhance your scraper to handle different data formats that the website might present for the listings.

  • Save the extracted data in a structured format like CSV or JSON for further analysis.

Submission Guidelines

  1. Code Submission:

    • Provide a .zip file containing your web scraping script.

    • Include any necessary configuration files and dependencies required to run your script.

    • Ensure your code is well-documented with comments explaining key sections.

  2. Data Submission:

    • Submit the extracted data in a CSV or JSON file.

    • The file should include all relevant details from the first 5 pages.

Example Output

Your CSV or JSON file should contain the following fields for each listing (a hypothetical example record is shown after the list):

  • title: Text

  • description: Text

  • date: Date

  • bedrooms: Integer

  • bathrooms: Integer

  • toilets: Integer
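
For instance, a single JSON record might look like the following; all values are purely illustrative.

    {
      "title": "4 Bedroom Detached Duplex in Gwarinpa",
      "description": "Newly built duplex with a fitted kitchen and ample parking.",
      "date": "2024-05-12",
      "bedrooms": 4,
      "bathrooms": 4,
      "toilets": 5
    }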

Learning Resources

Week 3 🟩

Web Scraping Exercise: Login and Full Page Screenshot (Selenium/Puppeteer)

Pre-requisite

Before starting this exercise, ensure you have the following:

  1. Programming Languages:

    • Basic knowledge of Python for Selenium; or

    • Basic knowledge of JavaScript for Puppeteer.

  2. Software:

    • Latest version of Python installed on your machine.

    • Latest version of Node.js installed.

  3. Libraries:

    • Selenium WebDriver for Python (pip install selenium).

    • Puppeteer for Node.js (npm install puppeteer).

  4. Web Drivers:

    • ChromeDriver for Selenium. Ensure the ChromeDriver version matches your browser version.

    • Puppeteer comes with Chromium, so no additional browser installation is needed.

Problem Description

Imagine you're a software developer working for a company that needs to ensure its web application is functioning correctly. The marketing team wants to capture full-page screenshots of various authenticated user pages to showcase in a presentation. Automating this process saves time and ensures consistency. This exercise simulates this scenario by teaching you how to log into a website and capture a full-page screenshot, skills highly valued by recruiters for roles involving automation and QA testing.

Instructions

Your task is to automate the login process for the website GeeksforGeeks Auth and capture a full-page screenshot of the dashboard after login using either Selenium or Puppeteer. Here are the detailed steps:

  1. Create an Account:

    • Sign up on the target website so you have valid login credentials to use in your script.

  2. Script Development:

    • For Selenium (Python):

      • Set up Selenium WebDriver and configure it to use ChromeDriver.

      • Write a script to navigate to the login page, input your credentials, and log in.

      • After logging in, capture a full-page screenshot and save it locally.

    • For Puppeteer (Node.js):

      • Set up Puppeteer and configure it to use the built-in Chromium.

      • Write a script to navigate to the login page, input your credentials, and log in.

      • After logging in, capture a full-page screenshot and save it locally.

  3. Ensure Security:

    • Avoid hardcoding your login credentials directly in the script; read them from environment variables instead (a minimal sketch follows below).
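
Here is a minimal Selenium (Python) sketch under stated assumptions: the login URL and form locators are placeholders you must verify against the GeeksforGeeks login page, credentials are read from environment variables, and the full-page capture works by resizing a headless Chrome window to the document height.

    import os

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    # Credentials come from environment variables, never hardcoded.
    USERNAME = os.environ['GFG_USERNAME']
    PASSWORD = os.environ['GFG_PASSWORD']

    options = Options()
    options.add_argument('--headless=new')        # resizing to full page height works best headless
    driver = webdriver.Chrome(options=options)

    driver.get('https://auth.geeksforgeeks.org/')  # assumed login URL -- verify it
    # Placeholder locators: inspect the login form and adjust them.
    driver.find_element(By.NAME, 'user').send_keys(USERNAME)
    driver.find_element(By.NAME, 'pass').send_keys(PASSWORD)
    driver.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()

    driver.implicitly_wait(10)                    # crude wait for the dashboard to load

    # Resize the window to the full document height so the screenshot covers the whole page.
    height = driver.execute_script('return document.body.scrollHeight')
    driver.set_window_size(1366, height)
    driver.save_screenshot('screenshot.png')
    driver.quit()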

Submission Guidelines

When submitting your solution, ensure you include the following:

  1. Code:

    • For Selenium: Submit your Python script file (.py).

    • For Puppeteer: Submit your JavaScript file (.js).

    • Ensure your code is well-commented to explain the process and logic used.

  2. Data:

    • Include the full-page screenshot captured by your script.

    • Save the screenshot as screenshot.png and ensure it's included in your submission.

  3. README File:

    • Include a README.md file explaining how to run your script.

    • Mention any prerequisites or dependencies needed.

    • Provide instructions on setting up environment variables for login credentials.

Learning Resources

  1. Selenium (Python) Crash Course: https://www.youtube.com/watch?v=Z6_uRiUKiA0

Week 4 🟩

This challenge will test your skills in ethically scraping data while considering best practices. The target website: https://quotes.toscrape.com/


Pre-requisite

To participate in this challenge, you should have:

  • Basic understanding of web scraping and HTML structure.

  • Familiarity with at least one programming language: Python or JavaScript.

  • Experience with relevant libraries or frameworks:

    • Python: requests, BeautifulSoup, and optionally Scrapy.

    • JavaScript: Puppeteer, Cheerio, and optionally Axios.


Problem Description

Imagine you're working as a data analyst for a company that wants to create an inspiring quotes database. Your task is to scrape quotes, authors, and tags from a quotes website. A recruiter sees your well-documented, ethical approach to web scraping, noting your adherence to best practices and problem-solving skills. This demonstrates your technical expertise and ethical responsibility, making you a strong candidate for data-centric roles.


Tools

Python:

  • Use requests to handle HTTP requests.

  • Parse HTML content with BeautifulSoup.

  • Optional: Use Scrapy for more advanced scraping needs (if you are already familiar with it).

JavaScript:

  • Use Puppeteer for headless browser automation.

  • Parse HTML content with Cheerio.

  • Optional: Use Axios for making HTTP requests.

Ethical Scraping is Key

  • Adhere to the website's robots.txt to avoid overloading their servers.

  • Implement delays between requests, mimicking human browsing patterns.

  • Focus on extracting only publicly available information (quotes, authors, tags).

Beyond the Basics

  1. Structured Scraping:

    • Extract specific data like quotes, author information, and tags.

  2. Depth Control:

    • Implement a depth limit to restrict how deep your script crawls into the website's structure.


Steps:

  1. Setup and Initial Request:

    • Install necessary libraries.

    • Fetch the homepage content.

  2. Data Extraction:

    • Parse HTML to extract quotes, authors, and tags.

    • Store the data in a structured format (e.g., CSV, JSON).

  3. Implementing Delays:

    • Use appropriate methods to add delays between requests.

  4. Depth Control:

    • Limit the script to scrape a specified number of pages (a minimal sketch covering steps 1-4 follows this list).

  5. Documentation:

    • Document your approach, challenges faced, and solutions implemented.
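
As a reference point, here is a minimal Python sketch for steps 1-4. The class names used (div.quote, span.text, small.author, a.tag, li.next) reflect quotes.toscrape.com's markup at the time of writing; confirm them with your browser's developer tools.

    import json
    import time

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = 'https://quotes.toscrape.com'
    MAX_PAGES = 5                                   # page/depth limit keeps the crawl polite

    quotes, next_path, pages = [], '/', 0
    while next_path and pages < MAX_PAGES:
        soup = BeautifulSoup(requests.get(BASE_URL + next_path, timeout=10).text, 'html.parser')
        for q in soup.select('div.quote'):
            quotes.append({
                'text': q.select_one('span.text').get_text(strip=True),
                'author': q.select_one('small.author').get_text(strip=True),
                'tags': [t.get_text(strip=True) for t in q.select('a.tag')],
            })
        next_link = soup.select_one('li.next > a')  # pagination link, if present
        next_path = next_link['href'] if next_link else None
        pages += 1
        time.sleep(1)                               # delay between requests

    with open('quotes.json', 'w', encoding='utf-8') as f:
        json.dump(quotes, f, ensure_ascii=False, indent=2)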


Submission Guidelines

Code and Data:

  1. Code:

    • Organize your code into well-commented, readable files.

    • Ensure each script is in its own file and named appropriately (e.g., scrape_quotes.py, scrape_quotes.js).

    • Include a main script to demonstrate the functionality.

  2. Documentation:

    • Create a README file explaining the purpose of each script, how to run it, and any assumptions made.

    • Include comments within your code to explain your logic and thought process.

  3. Data:

    • Save the extracted data in a structured format (CSV, JSON).

    • Include a sample data file to showcase your scraping results.

  4. Submission:

    • Submit the ZIP file through Slack, or upload your project to GitHub and share a link to the repository.

Learning Resources:

  1. How to Crawl Internal & External Links: https://www.youtube.com/watch?v=I42csGcYeXw

Watch the videos irrespective of your current stack to grasp the key concepts.

Week 5 🟩

An Introduction: Bypassing Browser Detection with Selenium/Puppeteer

This challenge introduces you to browser fingerprinting and techniques to bypass basic bot detection mechanisms.

Prerequisite

Before diving into this challenge, ensure you have a foundational understanding of the following:

  • Basic web scraping concepts and techniques.

  • Familiarity with Python or JavaScript for scripting.

  • Experience with web automation tools like Selenium or Puppeteer.

  • Understanding of HTTP requests and browser behaviors.

Problem Description

Imagine you are working for a competitive intelligence team in a retail company. Your task is to scrape competitor websites to gather pricing information. However, these websites employ sophisticated bot detection mechanisms to prevent automated scraping, including browser fingerprinting. This challenge aims to simulate such a scenario, where overcoming bot detection is crucial to collecting valuable data without being blocked. Successfully completing this task demonstrates your capability to handle real-world scraping challenges, making you an attractive candidate for roles requiring data extraction and automation skills.

Instructions

  1. Browser Fingerprinting Exploration:

    • Research browser fingerprinting and how it works.

    • Understand the different elements that contribute to a browser fingerprint (e.g., user agent string, screen resolution, installed fonts, etc.).

  2. Fingerprint Manipulation (Choose One):

    • Selenium:

      • Set up a Selenium WebDriver for your preferred browser (Chrome, Firefox, etc.).

      • Explore methods provided by Selenium to modify browser fingerprint elements (a Python sketch follows this list).

    • Puppeteer:

      • Set up a Puppeteer instance to control a headless Chrome browser.

      • Utilize Puppeteer's functionalities to modify the fingerprint.

  3. Bot Detection Challenge:

    • Use your modified browser instance (Selenium/Puppeteer) to visit the target website: https://www.browserscan.net/

    • Try to interact with the website elements (e.g., click buttons, fill forms).

    • Observe the website's behavior. Does it identify you as a human or a bot?

  4. Analysis and Improvement:

    • Based on your experience, analyze how successful your fingerprint manipulation was in bypassing the bot detection.

    • Research more advanced techniques for fingerprint manipulation.
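
A minimal Selenium (Python) sketch, assuming Chrome: it overrides the user-agent string and window size and hides the default automation switches. These are only the most basic signals; sites that also check navigator.webdriver or canvas/font fingerprints will require more advanced techniques.

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    # Override a few fingerprint-related signals; the values are illustrative.
    options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36')
    options.add_argument('--window-size=1366,768')
    # Hide the "Chrome is being controlled by automated software" signals.
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    options.add_experimental_option('useAutomationExtension', False)

    driver = webdriver.Chrome(options=options)
    driver.get('https://www.browserscan.net/')
    driver.save_screenshot('browserscan_result.png')            # capture what the site reports
    print(driver.execute_script('return navigator.webdriver'))  # often still True without further patching
    driver.quit()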

Important Note:

  • Modifying your browser fingerprint might violate the terms of service of certain websites. Use this challenge for educational purposes only and respect website policies.

  • Bypassing robust bot detection mechanisms often requires sophisticated techniques beyond the scope of this challenge.

Tips:

  • Experiment with different user-agent strings and screen resolutions to see their impact.

  • Remember, some websites employ more sophisticated bot detection methods beyond basic fingerprinting.

By completing this challenge, you'll gain valuable experience in:

  • Understanding browser fingerprinting concepts.

  • Using Selenium or Puppeteer for basic browser automation.

  • Exploring techniques to potentially bypass basic bot detection mechanisms (for educational purposes only).

Submission Guidelines

  1. Code:

    • Submit your Python or JavaScript code used to set up and manipulate the browser fingerprint using Selenium or Puppeteer.

    • Include comments in your code to explain your approach and any challenges faced.

  2. Data:

    • Provide screenshots or logs of your interactions with the target website (https://www.browserscan.net/), highlighting whether you were detected as a bot or not.

    • Include a brief report (in ReadMe.md) analyzing your results and discussing potential improvements or alternative techniques for better results.

Learning Resources

Blogs

Videos

Extra Tools

Week 6 🟩

Web Crawler - Deep Dive (Building on Week 4 & 5)

Pre-requisite

Before starting this challenge, ensure you have the following skills and tools:

  1. Python Programming: Proficiency in Python, especially with handling libraries and data structures.

  2. Web Scraping Basics: Familiarity with web scraping concepts and experience using libraries such as requests and BeautifulSoup.

  3. HTML/CSS Knowledge: Basic understanding of HTML and CSS to navigate and extract web page content effectively.

  4. BFS Algorithm Understanding: Knowledge of the Breadth-First Search (BFS) algorithm for traversing data structures.

Problem Description

Imagine you're working for a news aggregation service that aims to collect and categorize articles from various news websites. You are tasked with developing a web crawler that can navigate through a news website, follow internal links to discover new articles, and collect relevant data. By building this web crawler, you'll help the company automate data collection, providing timely and comprehensive news coverage. This task demonstrates your ability to handle complex web scraping scenarios and problem-solving skills, making you an attractive candidate for roles in data engineering and web development.

Instructions:

Develop a web crawler in Python (or JavaScript) using libraries like requests and BeautifulSoup to navigate a website and discover new pages by following internal links.

The crawler should explore links up to a depth of 3 from the starting URL.

Steps:

  1. Import Libraries and Define Starting URL:

    • Import necessary libraries like requests and BeautifulSoup for making HTTP requests and parsing HTML content.

    • Define the starting URL of the website you want to crawl.

      import requests
      from bs4 import BeautifulSoup
      from collections import deque
      from urllib.parse import urljoin, urlparse

      starting_url = 'https://www.browserscan.net/'
  2. BFS (Breadth-First Search) Approach:

    • Implement a BFS algorithm to explore links in a level-by-level manner.

    • Maintain a queue (FIFO) to store URLs to be visited, initially containing the starting URL.

    • Use a set to keep track of visited URLs to avoid revisiting the same page.

      queue = deque([starting_url])
      visited = {starting_url}
      depth = {starting_url: 0}
  3. Fetching and Parsing Web Pages:

    • Utilize requests to send GET requests to URLs in the queue.

    • Parse the retrieved HTML content using BeautifulSoup to extract links.

    • Focus on internal links (same domain) relevant to your crawling purpose.

      def fetch_page(url):
          # A timeout keeps the crawler from hanging on slow pages.
          response = requests.get(url, timeout=10)
          return BeautifulSoup(response.text, 'html.parser')
  4. Depth Control and Link Extraction:

    • Keep track of the current exploration depth (level) during traversal.

    • Only add extracted links to the queue if they haven't been visited and their depth doesn't exceed 3 (configurable).

      max_depth = 3
      domain = urlparse(starting_url).netloc

      while queue:
          current_url = queue.popleft()
          current_depth = depth[current_url]
          if current_depth > max_depth:
              continue

          soup = fetch_page(current_url)
          for link in soup.find_all('a', href=True):
              # Resolve relative links against the current page's URL.
              url = urljoin(current_url, link['href'])
              # Follow internal (same-domain) links only, and skip pages already seen.
              if urlparse(url).netloc == domain and url not in visited:
                  visited.add(url)
                  queue.append(url)
                  depth[url] = current_depth + 1
  5. Error Handling and Logging:

  • Implement mechanisms to handle potential errors during the crawling process (e.g., broken links, server errors).

  • Consider logging visited URLs, encountered errors, or extracted data for debugging and analysis.

    import logging

    logging.basicConfig(level=logging.INFO)
    try:
        soup = fetch_page(current_url)
        # extract_data is a placeholder for your own parsing logic.
        data = extract_data(soup)
        logging.info(f'Successfully fetched data from {current_url}')
    except Exception as e:
        logging.error(f'Error fetching {current_url}: {e}')

Bonus Challenge:

  • Enhance the crawler to handle dynamic content loaded using JavaScript (consider libraries like Selenium for basic browser interaction).

  • Implement functionality to store the crawled data in a structured format (e.g., database, CSV); a minimal CSV sketch follows this list.

  • Explore techniques for politeness and respecting robots.txt guidelines when crawling websites.
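
For the CSV option, a small sketch that writes the discovered URLs and their crawl depth, reusing the visited set and depth dict built in the steps above:

    import csv

    with open('crawled_urls.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['url', 'depth'])
        for url in sorted(visited):
            writer.writerow([url, depth.get(url, '')])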

Submission Guidelines (Code and Data)

  1. Code Submission:

    • Provide a well-documented Python script or Jupyter Notebook that includes:

      • Web crawling and BFS implementation.

      • Functions for fetching and parsing web pages.

      • Error handling and logging mechanisms.

    • Ensure the code is clean, well-commented, and follows best practices for readability and reproducibility.

  2. Data Submission:

    • Include the data (CSV) containing the extracted URLs.

By completing this challenge, you'll gain practical experience building a web crawler in Python, understand the principles of BFS for exploring website links, learn to extract data from HTML content using BeautifulSoup, and develop skills in handling errors and managing crawling complexity. This knowledge is valuable for various data extraction and automation tasks.

Important Note:

  • Respect robots.txt guidelines when crawling websites.

  • Be mindful of website terms of service and avoid overloading servers with excessive requests.

  • This challenge is for educational purposes. Use your crawler responsibly and ethically.

Learning Resources

  1. BFS WebCrawler core of search engines: https://www.youtube.com/watch?v=bIOzv83Yo58

Week 7 🟩

Abuja Hotel Price Exploration

This challenge delves into web scraping and data visualization to create a comprehensive view of Abuja's hotel scene.

Prerequisite:

  • Basic understanding of Python or JavaScript programming.

  • Familiarity with web scraping libraries such as requests and BeautifulSoup or SelectorLib.

  • Knowledge of data cleaning and preprocessing techniques using Pandas.

  • Experience with data visualization libraries like Matplotlib, Seaborn, or Plotly.

  • Basic knowledge of web development frameworks such as Streamlit (Python) or JavaScript frameworks (e.g., React, Vue.js).

Problem Description:

Imagine you are a data analyst working for a tourism board in Abuja, Nigeria. Your task is to provide a detailed and interactive overview of the hotel landscape in the city to help tourists make informed decisions about their accommodations. This tool will be valuable for tourists, event organizers, and business travelers who need to find the best hotels based on various criteria like location, star rating, and amenities. By developing this web application, you'll not only help visitors but also contribute to the local tourism industry by highlighting the hospitality options available in Abuja.

Instructions:

Develop a web application that displays and visualizes scraped hotel data from Booking.com, focusing on hotels in Abuja, Nigeria.

Steps:

  1. Web Scraping:

    • Use libraries such as requests to fetch hotel data from Booking.com (respect robots.txt guidelines).

    • Store the scraped data in a structured format (e.g., CSV, JSON).

  2. Data Cleaning and Preprocessing:

    • Clean and pre-process the scraped data:

      • Handle missing values.

      • Standardize location names (ensure consistency).

      • Consider basic text cleaning for amenity data (remove irrelevant text).

  3. Data Visualization with Streamlit (Python) or JavaScript Framework:

    • Develop a user-friendly web application using Streamlit (Python) or a JavaScript framework of your choice (e.g., React, Vue.js).

    • Display key hotel data (name, star rating, location, amenities) in a clear and organized manner within the web app.

    • Focus on visualizing the data using interactive charts and graphs (e.g., bar charts to show the distribution of hotels by star rating across different locations, maps to pinpoint hotel locations with markers). A minimal Streamlit sketch follows these steps.

  4. Bonus Challenge:

    • Implement functionalities to filter hotels based on user preferences (e.g., location, star rating, amenities).

    • Create visualizations that compare amenities offered by hotels across different star ratings.

    • Allow users to download the scraped hotel data in a convenient format (CSV, Excel).

    • Consider deploying your web app to a cloud platform for wider accessibility.
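
As a starting point, here is a minimal Streamlit sketch. It assumes a cleaned CSV from steps 1-2; the filename and column names (name, location, star_rating, price) are illustrative and should be adapted to your own dataset.

    import pandas as pd
    import streamlit as st

    # Assumes a cleaned file produced in steps 1-2; filename and columns are illustrative.
    df = pd.read_csv('abuja_hotels_clean.csv')

    st.title('Abuja Hotel Explorer')

    locations = st.sidebar.multiselect('Location', sorted(df['location'].unique()))
    filtered = df[df['location'].isin(locations)] if locations else df

    st.dataframe(filtered[['name', 'location', 'star_rating', 'price']])
    st.bar_chart(filtered['star_rating'].value_counts().sort_index())  # hotels per star rating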

Learning Outcomes:

  • Gain hands-on experience with web scraping techniques for data extraction.

  • Understand the importance of data cleaning and pre-processing for effective visualization.

  • Learn to build a web application using Streamlit or a JavaScript framework for data presentation.

  • Develop skills in data visualization to create informative and interactive displays.

By completing this challenge, you'll gain valuable experience in web scraping, data cleaning, and data visualization. You'll build a web application that serves as a valuable resource for anyone exploring hotel options in Abuja, offering insights into the city's hospitality landscape.

Submission Guidelines:

  1. Code:

    • Submit a well-documented Jupyter notebook or Python script that includes your web scraping process, data cleaning, and visualization code.

    • Ensure your code follows best practices for readability, modularity, and commenting.

  2. Data:

    • Provide the cleaned and processed hotel data in a structured format (e.g., CSV, JSON).

    • Include a brief data dictionary explaining each column in your dataset.

  3. Web Application:

    • Share the source code for your web application.

    • Include instructions for setting up and running the web app locally.

    • If deployed, provide the URL to the live web application.

Week 8 🟩

News Aggregation Challenge: Curate Your World

Pre-requisite

Before you begin the News Aggregation Challenge, make sure you have the following prerequisites:

  1. Basic Knowledge of Python and JavaScript: Familiarity with Python for web scraping and NLP tasks, and JavaScript if you choose to use a JS framework for the web application.

  2. Web Scraping Skills: Understanding of web scraping techniques and libraries such as BeautifulSoup, Scrapy, or Selenium.

  3. NLP Basics: Basic knowledge of Natural Language Processing concepts and libraries like spaCy, NLTK, or Hugging Face Transformers (a hosted alternative: https://rapidapi.com/textanalysis/api/text-summarization).

  4. Web Development Experience: Experience with web development frameworks like Streamlit (Python) or React/Vue.js (JavaScript) for creating user interfaces.

  5. Ethical Scraping Awareness: Understanding of ethical web scraping practices, including adherence to robots.txt guidelines.

Problem Description

Imagine you are a data scientist at a startup focused on delivering personalized news experiences to users. In today's fast-paced world, people struggle to keep up with the flood of information from various news sources. Many miss out on important updates relevant to their interests, while others feel overwhelmed by the sheer volume of content.

Consider John, a busy professional who wants to stay informed about the latest technology trends and business news. With a customized news aggregator, John can effortlessly access concise summaries of relevant articles, saving time and ensuring he stays up-to-date on topics that matter to him. This challenge offers a unique opportunity to create a practical solution that not only aids individuals in managing information overload but also demonstrates your ability to apply web scraping, NLP, and web development skills to real-world problems.

Instructions

Develop a web application that collects news articles from various sources, summarizes them using NLP, and allows users to curate their own news feed based on their interests.

Part 1: Web Scraping News.google.com (Respectful Approach)

  1. Ethical Scraping:

    • Review the site's robots.txt and terms of service, keep your request rate low, and add delays between requests.

  2. Target Specific News:

    • Scrape essential elements like article titles, summaries (if available), links, and publication dates.

  3. Data Storage:

    • Store the scraped data in a structured format like CSV or JSON for efficient manipulation and summarization.

Part 2: News Summarization with NLP (Optional, Bonus Points)

  1. Text Summarization Implementation:

    • Explore pre-trained NLP models like BART or T5 for text summarization.

    • Aim for summaries that capture the essential information of the article while maintaining coherence and readability.

  2. Integration with Web Scraping:

    • Integrate the summarization model into your web scraping pipeline. Summarize scraped articles before storing them for user consumption (a minimal sketch follows this part).
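
A minimal sketch using the Hugging Face Transformers summarization pipeline; the BART checkpoint named here is one common choice rather than a requirement, and the article text is a placeholder for your scraped content.

    from transformers import pipeline

    # facebook/bart-large-cnn is one reasonable checkpoint; swap in a T5 model if preferred.
    summarizer = pipeline('summarization', model='facebook/bart-large-cnn')

    article_text = (
        'Replace this string with the full body text of a scraped article. The summarizer '
        'works best on a few hundred words of clean text.'
    )

    summary = summarizer(article_text, max_length=60, min_length=20, do_sample=False)
    print(summary[0]['summary_text'])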

Part 3: Web Application Development (Streamlit/JavaScript Framework)

  1. User Interface Design:

    • Create a user-friendly web application using Streamlit (Python) or a JavaScript framework (e.g., React, Vue.js).

  2. Interest Selection:

    • Allow users to choose their preferred news categories or keywords to personalize their feed.

  3. News Display:

    • Present scraped and (optionally) summarized articles in an organized and visually appealing manner.

  4. Interactive Features:

    • Implement features like article filtering based on user preferences, date range selection, or the ability to save specific articles (a minimal Streamlit sketch follows this list).
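
A minimal Streamlit sketch for the interest-selection and display steps. It assumes articles were already scraped to a CSV in Part 1; the filename and column names (title, summary, link, published, category) are illustrative.

    import pandas as pd
    import streamlit as st

    # Assumes articles were scraped and stored in Part 1; filename and columns are illustrative.
    articles = pd.read_csv('articles.csv', parse_dates=['published'])

    st.title('My News Feed')
    interests = st.multiselect('Pick your topics', sorted(articles['category'].unique()))

    feed = articles[articles['category'].isin(interests)] if interests else articles
    for _, row in feed.sort_values('published', ascending=False).iterrows():
        st.subheader(row['title'])
        st.write(row['summary'])
        st.markdown(f"[Read more]({row['link']}) ({row['published']:%Y-%m-%d})")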

Bonus Challenge

  1. Sentiment Analysis:

    • Integrate sentiment analysis to gauge the overall tone (positive, negative, neutral) of news articles.

  2. News Source Filtering:

    • Include a filtering option to allow users to specify which websites they want to see news from.

  3. User Preferences:

    • Allow saving user preferences for future visits, creating a personalized news experience.

Learning Outcomes

  • Gain experience with web scraping techniques for data collection.

  • Understand the importance of ethical web scraping practices.

  • Explore the application of NLP for text summarization.

  • Learn to build a user-friendly web application for news aggregation.

Submission Guidelines

Code and Data

  1. Code Submission:

    • Ensure your code is well-commented and follows best practices for readability and maintainability.

    • Include a README file explaining the project setup, structure, and usage.

    • Provide a requirements.txt file listing all the dependencies needed to run the project.

    • Ensure your code is modular and divided into logical sections such as data loading, preprocessing, model training, and web application integration.

  2. Data Submission:

    • Use the news articles data collected from Google News.

    • Ensure your dataset handling respects Google's terms of service.

    • Do not include raw data files in your submission; instead, provide instructions on how to collect the data.

  3. Web Application:

    • Include clear instructions on how to run the web application.

    • Ensure the application is user-friendly, visually appealing, and functional.

By completing this challenge, you'll gain valuable skills in web scraping, NLP, and web application development. You'll build a practical news aggregator system that empowers users to stay informed on topics that matter to them.

Week 9

Week 10

Week 11

Week 12
