Data Engineering Challenges

Week 1 🟩

Understanding NumPy

Imagine you're applying for a data scientist role at a leading tech company. As part of your portfolio, you want to demonstrate your proficiency in NumPy, a core library for numerical computing in Python. By solving real-world problems with NumPy, you showcase your ability to handle complex data manipulation tasks efficiently. These exercises not only highlight your technical skills but also your problem-solving abilities and creativity, making you a standout candidate to potential recruiters.

Problem and Instructions

Test each function by calling it with a few sample inputs; a minimal sketch of some of the functions appears after the problem list.

  1. Identity Twister: Create a function create_custom_identity(size, value, index) that takes three arguments:

    • size: Size of the square identity matrix (e.g., 4).

    • value: The value to insert at the specified index (e.g., 10).

    • index: A tuple representing the index (i, j) for the custom value.

    • The function should return a NumPy array representing the identity matrix with the specified value inserted at the given index.

  2. Element Order Enigma: Write a function check_same_elements(arr1, arr2) that takes two NumPy arrays and returns True if both arrays contain the same elements regardless of their order, False otherwise. (Hint: Explore sorting and set comparisons).

  3. Broadcast Buster: Write a function safe_add(arr1, arr2) that adds two NumPy arrays and handles potential broadcasting issues gracefully. The function should raise an error with a clear message if the shapes of the arrays are incompatible for addition.

  4. Flattening the Terrain: Write a function flatten_terrain(terrain_data) that takes a 3D NumPy array representing terrain data (e.g., height at each point) and returns a 2D NumPy array representing the flattened height map.

  5. Universal Absolute Value: Define a function absolute_all(arr) that takes a NumPy array and returns a new array with all elements converted to their absolute values.

  6. Create Random Grayscale Image: Write Python code that:

  • Imports the numpy library as np.

  • Defines the desired image dimensions (width and height) using variables.

  • Generates a random NumPy array filled with values representing grayscale intensities. Remember, grayscale values typically range from 0 (black) to 255 (white).

  • Visualization Power: Now that you have your random grayscale data, it's time to bring it to life! Import another Python library suitable for image display.

    • Matplotlib - A versatile library for data visualization, including images.
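
The following is a minimal sketch of a few of the functions above (create_custom_identity, check_same_elements, safe_add, and absolute_all); treat it as a starting point rather than a reference solution, and adapt the error handling and validation to your own needs.

    import numpy as np

    def create_custom_identity(size, value, index):
        """Identity matrix of shape (size, size) with `value` placed at `index`."""
        matrix = np.eye(size)
        matrix[index] = value          # index is a tuple (i, j)
        return matrix

    def check_same_elements(arr1, arr2):
        """True if both arrays contain the same elements, ignoring order."""
        return np.array_equal(np.sort(arr1, axis=None), np.sort(arr2, axis=None))

    def safe_add(arr1, arr2):
        """Add two arrays, raising a clear error when shapes cannot broadcast."""
        try:
            return np.add(arr1, arr2)
        except ValueError as exc:
            raise ValueError(
                f"Shapes {arr1.shape} and {arr2.shape} are incompatible for addition"
            ) from exc

    def absolute_all(arr):
        """Return a new array with every element replaced by its absolute value."""
        return np.abs(arr)

    # Quick checks with sample inputs
    print(create_custom_identity(4, 10, (1, 2)))
    print(check_same_elements(np.array([1, 2, 3]), np.array([3, 1, 2])))   # True
    print(absolute_all(np.array([-1, 2, -3])))                             # [1 2 3]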

Bonus Challenge (optional):

  • Modify your code (No. 6) to generate images with different value ranges (e.g., 0-127) and observe how it impacts the grayscale appearance.

  • Add functionality to control the minimum and maximum grayscale intensities for more creative control over your random image generation (the sketch below covers this).
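
A minimal sketch for question 6 and the bonus challenge might look like the following; the dimensions and intensity ranges are arbitrary example values, and plt.imshow with a grayscale colormap is one simple way to display the result.

    import numpy as np
    import matplotlib.pyplot as plt

    # Example dimensions (arbitrary values for illustration)
    width, height = 256, 256

    def random_grayscale(width, height, min_intensity=0, max_intensity=255):
        """Random grayscale image with values in [min_intensity, max_intensity]."""
        return np.random.randint(min_intensity, max_intensity + 1,
                                 size=(height, width), dtype=np.uint8)

    image = random_grayscale(width, height)              # full 0-255 range
    dim_image = random_grayscale(width, height, 0, 127)  # bonus: narrower range

    plt.imshow(image, cmap="gray", vmin=0, vmax=255)
    plt.axis("off")
    plt.show()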

Submission Guidelines

  1. Code Submission:

    • Submit your Python code for each function and task.

    • Ensure your code is well-documented and includes comments explaining each step.

Tools

Learning Resources

Week 2 🟩

Building an ETL Pipeline with Pandas and SQLite

This problem set is designed to introduce aspiring data engineers to the pandas library and its application in building an ETL (Extract, Transform, Load) pipeline. The goal is to load video game data from a CSV file into a SQLite database. The challenges progress from basic data manipulation to more advanced techniques.

About the Dataset: Video Games Sales Dataset

The video game sales dataset you're working with contains information on over 55,000 video games (as of April 2019). Here's a quick rundown:

  • Source: Scraped from vgchartz.com

  • Number of Records: 55,792

  • Content:

    • Game details: Name, Platform, Genre, ESRB Rating

    • Sales figures: Total Shipped copies, Global Sales (worldwide), Sales figures for North America (NA_Sales), Europe (PAL_Sales), Japan (JP_Sales), and Other regions

    • Review scores: Critic Score (out of 10), User Score (out of 10)

    • Release Year

This dataset provides a rich resource for analyzing video game sales trends, exploring correlations between genres and sales, or comparing critic and user reception.


Pre-requisites

To successfully complete this problem set, you should have the following:

  • Knowledge:

    • Basic understanding of Python programming.

    • Familiarity with the pandas library for data manipulation.

    • Basic knowledge of SQL and relational databases.

  • Software:

    • Python 3

    • pandas library

    • sqlite3 library

    • SQLAlchemy library


Problem Description

Imagine you're working for a video game market analysis company. Your company helps game developers and publishers understand sales trends, market demands, and the reception of their games across different regions. Your task is to build a robust ETL pipeline that can process large datasets, transform them into meaningful insights, and store them in a database for easy querying and analysis.

For instance, a game developer wants to know how their games have performed in different regions over the years and whether critic and user scores correlate with sales figures. By building this ETL pipeline, you enable the company to quickly answer such questions, aiding in strategic decision-making and improving the development of future games.


Set 1: Basic Data Exploration (Easy)

  1. Import libraries and load data:

    • Import pandas and sqlite3 libraries.

    • Load the video game data from the CSV file using pandas.read_csv.

    • Print the first few rows of the data to get a sense of its structure.

  2. Data summary:

    • Get basic information about the data using df.info().

    • Describe the numerical columns using df.describe().

    • Check for missing values using df.isnull().sum().

  3. Data cleaning:

    • If there are missing values, explore options for handling them (e.g., removing rows, imputing values).

    • Handle duplicates.

    • Check for and fix typos.
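
A minimal sketch of Set 1, assuming the CSV is named vgsales.csv and contains columns such as Name and Platform (substitute the actual file and column names from your dataset):

    import pandas as pd

    # 1. Load the data (file name is an assumed placeholder)
    df = pd.read_csv("vgsales.csv")
    print(df.head())

    # 2. Summarise the data
    df.info()
    print(df.describe())
    print(df.isnull().sum())

    # 3. Basic cleaning: drop duplicates and rows missing critical fields
    df = df.drop_duplicates()
    df = df.dropna(subset=["Name", "Platform"])   # assumed column names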

Set 2: Data Transformation (Medium)

  1. Create a new column:

    • Create a new column named "Total Sales" that sums the sales across different regions (NA_Sales, PAL_Sales, etc.).

  2. Data filtering:

    • Filter the data to include only games released after a specific year (e.g., 2010).

    • Filter the data to include only games with a Critic Score greater than a certain threshold (e.g., 8).

  3. Data type conversion:

    • Convert the "Year" column from string to integer format.

    • Set column data types to appropriate formats.
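
The Set 2 transformations could be sketched as follows, assuming the regional sales columns are named NA_Sales, PAL_Sales, JP_Sales, and Other_Sales, and that the year and critic-score columns are named Year and Critic_Score (check your CSV's actual headers):

    import pandas as pd

    df = pd.read_csv("vgsales.csv")   # same assumed file name as in Set 1

    # Step 3 first: make Year numeric so it can be filtered on; cast to int
    # once missing years have been dropped or imputed
    df["Year"] = pd.to_numeric(df["Year"], errors="coerce")

    # Step 1: new "Total Sales" column summed across the regional figures
    region_cols = ["NA_Sales", "PAL_Sales", "JP_Sales", "Other_Sales"]
    df["Total Sales"] = df[region_cols].sum(axis=1)

    # Step 2: filtering examples
    recent = df[df["Year"] > 2010]
    well_reviewed = df[df["Critic_Score"] > 8]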

Set 3: Building the ETL Pipeline (Advanced)

  1. Establish connection to SQLite database using SQLAlchemy:

    • Use SQLAlchemy's create_engine to create an engine for a new SQLite database (e.g., "video_games.db").

  2. Define a function for data loading:

    • Create a function that takes the pandas DataFrame as input and establishes a connection to the SQLite database.

    • Use the to_sql method from pandas to write the DataFrame to a table named "games" in the database.

  3. Load data into the database:

    • Call the data loading function with the prepared DataFrame.

  4. Bonus: Enhance the pipeline with error handling and logging capabilities.
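
A minimal sketch of the loading step, using SQLAlchemy's create_engine to point pandas at a SQLite file; the database and table names follow the examples above, and the basic logging setup is one simple way to approach the bonus item.

    import logging

    import pandas as pd
    from sqlalchemy import create_engine

    logging.basicConfig(level=logging.INFO)

    def load_to_sqlite(df, db_path="video_games.db", table_name="games"):
        """Write the DataFrame to a SQLite table, replacing it if it already exists."""
        engine = create_engine(f"sqlite:///{db_path}")
        try:
            df.to_sql(table_name, engine, if_exists="replace", index=False)
            logging.info("Loaded %d rows into '%s'", len(df), table_name)
        except Exception:
            logging.exception("Failed to load data into %s", db_path)
            raise

    # Usage with the cleaned DataFrame from the earlier sets
    df = pd.read_csv("vgsales.csv")   # assumed file name
    load_to_sqlite(df)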


Submission Guidelines

Code and Data

  1. Code:

    • Organize your code into well-commented, readable scripts or Jupyter notebooks.

    • Include clear section headers for each part of the problem set (e.g., Data Exploration, Data Transformation, ETL Pipeline).

    • Ensure all code runs without errors and produces the expected results.

  2. Data:

    • Include the CSV file used for the ETL process.

    • Provide the SQLite database file (e.g., "video_games.db") after loading the data.

  3. Documentation:

    • Create a README file that explains the steps you took, any assumptions you made, and any challenges you encountered.

    • Include instructions on how to run your scripts/notebooks.

  4. Submission:

    • Compress all files (code, data, documentation) into a single ZIP file.

    • Name the ZIP file using the format: ETL_Pipeline_YourName.zip.

    • Submit the ZIP file through Slack, or upload the project to a GitHub repository and share the repository link.

Learning Resources

  1. Another Data Cleaning (Real World): https://www.youtube.com/watch?v=iaZQF8SLHJs

Week 3 🟩

Data Cleaning with PySpark

This week, we revisit the video game data cleaning challenge (from week 2), but with a twist! Instead of Pandas, we'll leverage the power of PySpark for distributed data processing on larger datasets.

Objective: Replicate the data cleaning steps from Week 2 (using Pandas) but with PySpark. The final output should be a clean PySpark DataFrame saved as a Parquet file.

Steps:

  1. Import Libraries:

    • Import SparkSession from pyspark.sql and the functions you need from pyspark.sql.functions (e.g., col, sum).

  2. Load Data:

    • Create a SparkSession, then use spark.read.option("header", True).csv(...) to read the video game CSV file as a PySpark DataFrame.

  3. Data Exploration:

    • Utilize PySpark DataFrame methods to explore the data:

      • show: Display the first few rows.

      • describe: Get summary statistics for numerical columns.

      • filter with col: Filter based on specific conditions (e.g., year of release).

      • groupBy and agg: Calculate aggregations like total sales across regions.

  4. Data Cleaning:

    • Handle missing values using PySpark functions (e.g., fillna, dropna).

    • Remove duplicates using dropDuplicates.

    • Address inconsistencies or typos using string manipulation functions (e.g., lower, replace).

  5. Data Transformation:

    • Create a new column named "Total Sales" by summing the regional sales columns (e.g., with withColumn and a column expression).

    • Convert data types as needed using cast.

  6. Save Clean Data:

    • Save the final cleaned PySpark DataFrame as a Parquet file using write.parquet.
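
A condensed sketch of the PySpark version might look like this; the column names (Name, Platform, Genre, NA_Sales, PAL_Sales, JP_Sales, Other_Sales, Year) and the input file name are assumptions carried over from Week 2, so adjust them to match your data.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, lower

    spark = SparkSession.builder.appName("vg_cleaning").getOrCreate()

    # Load the CSV with a header row and let Spark infer column types
    df = (spark.read.option("header", True)
                    .option("inferSchema", True)
                    .csv("vgsales.csv"))          # assumed file name

    df.show(5)
    df.describe().show()

    # Cleaning: drop exact duplicates, drop rows missing key fields,
    # and normalise the Genre column's case (one example of typo handling)
    df = (df.dropDuplicates()
            .dropna(subset=["Name", "Platform"])
            .withColumn("Genre", lower(col("Genre"))))

    # Transformation: per-row total across regions and Year cast to integer.
    # "Total_Sales" uses an underscore because spaces in column names can
    # cause problems when writing Parquet.
    region_cols = ["NA_Sales", "PAL_Sales", "JP_Sales", "Other_Sales"]
    df = df.fillna(0, subset=region_cols)          # avoid null totals
    total_expr = sum(col(c) for c in region_cols)  # column expression, not an aggregation
    df = (df.withColumn("Total_Sales", total_expr)
            .withColumn("Year", col("Year").cast("int")))

    # Save the cleaned data as Parquet
    df.write.mode("overwrite").parquet("cleaned_video_games.parquet")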

Bonus Challenge:

  • Compare the performance of Pandas and PySpark for this data cleaning task on smaller and larger datasets (consider execution time and resource usage).

  • Explore advanced PySpark features for data wrangling like User-Defined Functions (UDFs) or PySpark SQL.

Remember:

  • Focus on replicating the data cleaning steps from Week 2 using PySpark functions.

  • Your code should be well-commented and structured for readability.

  • While building your ETL pipeline, consider error handling for potential issues.

Learning Outcomes:

  • Gain experience with PySpark for data cleaning on distributed datasets.

  • Understand the key differences between Pandas and PySpark for data manipulation tasks.

  • Explore the benefits of PySpark for handling large datasets efficiently.

Submission Guidelines

  1. Code:

    • Submit your Python script or Jupyter notebook used to set up and perform the data cleaning with PySpark.

    • Ensure your code is well-commented to explain your approach and any challenges faced.

  2. Data:

    • Provide the final cleaned dataset saved as a Parquet file.

  3. Report:

    • If you attempted the bonus challenge, include a comparison of performance between Pandas and PySpark.

Learning Resources

Week 4 🟩

Dockerizing a PostgreSQL Database

Pre-requisite

Before starting this challenge, ensure you have the following skills and tools:

  1. Basic Command Line Skills: Familiarity with using terminal or command prompt.

  2. Docker Basics: Basic understanding of Docker concepts and commands.

  3. SQL Knowledge: Basic knowledge of SQL commands and database operations.

  4. PostgreSQL Basics: Familiarity with PostgreSQL and using tools like psql or pgAdmin.

Problem Description

Imagine you're part of a team responsible for maintaining a scalable and portable development environment for a finance company. The company relies on PostgreSQL databases to manage critical data, but developers often face issues with inconsistent environments. By containerizing the PostgreSQL database using Docker, you can provide a consistent, reproducible, and isolated environment for development, testing, and production. This task not only showcases your ability to use modern containerization tools but also demonstrates your capability to solve practical issues in a real-world setting, making you an attractive candidate for data engineering and DevOps roles.

Instructions

In this experiment, you'll leverage Docker to create a containerized PostgreSQL database environment and interact with it to load data.

Steps:

  1. Install Docker:

    • Download and install Docker Desktop for your operating system (Windows, Mac, or Linux) based on the official Docker website instructions: Docker Desktop.

  2. Docker Basics:

    • Familiarize yourself with basic Docker concepts:

      • Images: Templates for creating containers.

      • Containers: Isolated, runnable instances of an image that share the host operating system's kernel.

    • Explore Docker commands using the terminal:

      • docker ps: List running containers.

      • docker images: List available images.

      • docker run: Create and run a container from an image.

  3. Find a PostgreSQL Image:

    • Search for a pre-built PostgreSQL image on Docker Hub, the official repository for Docker images: Docker Hub - PostgreSQL.

    • Look for reputable images with a high number of downloads and positive ratings (e.g., postgres).

  4. Run a PostgreSQL Container:

    • Use the docker run command to create a container from the chosen PostgreSQL image.

    • Consider specifying environment variables or mounting volumes for persistent data storage (optional for this challenge).

      docker run --name my_postgres -e POSTGRES_PASSWORD=mysecretpassword -d postgres
  5. Connect to PostgreSQL:

    • Utilize a PostgreSQL client tool like psql or a graphical interface (pgAdmin) to connect to the running container's PostgreSQL database.

    • Use the container's hostname or IP address and port (PostgreSQL's default is 5432) for the connection details, or run psql directly inside the container as shown below.

      docker exec -it my_postgres psql -U postgres
  6. Create a Table:

    • Once connected, write SQL commands using psql to create a table within the PostgreSQL database.

    • Define the table structure with columns and data types suitable for storing the Iris flower dataset.

      CREATE TABLE iris (
          sepal_length FLOAT,
          sepal_width FLOAT,
          petal_length FLOAT,
          petal_width FLOAT,
          species VARCHAR(50)
      );
  7. Load Iris Data:

    • Download the Iris flower dataset from the UCI Machine Learning Repository: Iris Dataset.

    • Explore options to import the CSV data into your PostgreSQL table:

      • Use the COPY command in psql to import the data. Note that the file path in COPY is resolved inside the container, so bind-mount the directory containing the CSV when starting the container (e.g., add -v /path/to/data:/data to docker run) and reference that container path.

        COPY iris FROM '/path/to/iris.csv' DELIMITER ',' CSV HEADER;
      • Explore tools like pgAdmin which might offer graphical interfaces for data import.

Important Notes:

  • Ensure you understand the security implications of running a database in a container (consider password configuration and container isolation).

  • This is a basic introduction to Docker. Explore further to learn about building custom Docker images and managing containerized applications.

Submission Guidelines (Code and Data)

  1. Code Submission:

    • Provide a well-documented Dockerfile or shell script that includes:

      • Docker commands to pull and run the PostgreSQL image.

      • Steps to connect to the PostgreSQL container and create the table.

      • Commands to import the Iris dataset into the PostgreSQL table.

    • Ensure the code is clean, well-commented, and follows best practices for readability and reproducibility.

  2. Data Submission:

    • Include the Iris dataset file used for importing data into PostgreSQL.

    • Provide a summary of any data preprocessing steps applied to the CSV file before import.

Learning Resources

Week 5 🟩

Scraping Recipes from AllRecipes

Target Website: AllRecipes

Prerequisites

Basic Knowledge of Web Scraping: Familiarity with HTML structure and basic web scraping concepts.

Programming Skills: Proficiency in Python or JavaScript.

Tools Installed:

For Python:

  • BeautifulSoup

  • requests

  • sqlite3

For JavaScript:

  • Puppeteer

  • Node.js

Environment Setup: Ensure your development environment (IDE or text editor) is set up and ready to code.

Problem Description

Imagine you’re a data analyst at a startup developing a new recipe recommendation app. Your app aims to help users find the best recipes based on their dietary preferences and ingredient availability. To build this app, you need a substantial database of recipes. Instead of manually entering hundreds of recipes, you can use web scraping to automatically gather this data from popular recipe websites like AllRecipes. This approach saves time and ensures your app has a diverse and up-to-date collection of recipes, making it more appealing to users and potential recruiters.

Instructions

You are tasked with extracting the following details from AllRecipes:

  • Recipe Titles: The name of each recipe.

  • Ingredients: A list of ingredients required for each recipe.

  • Ratings: The average user rating for each recipe.

Steps:

  1. Identify the Structure:

    • Use your browser's developer tools to inspect the HTML structure of the recipe pages.

    • Look for patterns in how recipes are listed and how individual details like title, ingredients, and ratings are presented within the HTML code.

    • Take notes of the HTML tags, classes, or IDs that contain the required information.

  2. Extract the Data:

    • Choose a web scraping library like BeautifulSoup (Python) or Puppeteer (JavaScript).

    • Write a script to navigate the HTML structure and extract the desired data points.

    • Ensure your script handles pagination if recipes are spread across multiple pages.

  3. Store the Data:

    • Save the scraped data in a structured format like an SQLite database.

    • Ensure each recipe entry in the database includes the title, ingredients, and ratings.
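
A minimal Python sketch of the scrape-and-store flow is shown below. The CSS selectors ("li.ingredient", "span.rating") and the recipe URL are hypothetical placeholders, since AllRecipes' markup changes over time; replace them with the tags, classes, or IDs you noted while inspecting the pages in your browser's developer tools.

    import sqlite3

    import requests
    from bs4 import BeautifulSoup

    def scrape_recipe(url):
        """Return (title, ingredients, rating) for a single recipe page."""
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        # The selectors below are placeholders -- replace them with the ones
        # you identified while inspecting the page.
        title = soup.select_one("h1").get_text(strip=True)
        ingredients = ", ".join(li.get_text(strip=True)
                                for li in soup.select("li.ingredient"))
        rating_tag = soup.select_one("span.rating")
        rating = float(rating_tag.get_text(strip=True)) if rating_tag else None
        return title, ingredients, rating

    def save_recipes(rows, db_path="recipes.db"):
        """Store scraped (title, ingredients, rating) rows in a SQLite database."""
        with sqlite3.connect(db_path) as conn:
            conn.execute("""CREATE TABLE IF NOT EXISTS recipes (
                                title TEXT, ingredients TEXT, rating REAL)""")
            conn.executemany("INSERT INTO recipes VALUES (?, ?, ?)", rows)

    # Usage: collect recipe URLs (e.g., from a category or search page, handling
    # pagination), scrape each one, then save the results.
    urls = ["https://www.allrecipes.com/recipe/placeholder/"]   # placeholder URL
    save_recipes([scrape_recipe(u) for u in urls])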

Submission Guidelines

Code Submission:

  • Provide a .zip file containing your web scraping script or a link to your GitHub repository.

  • Include any necessary configuration files and dependencies required to run your script.

Data Submission:

  • Submit the SQLite database file (.db) containing the scraped data.

Learning Resources

Week 6

Week 7

Week 8

Week 9

Week 10

Week 11

Week 12
