Data Science Challenges

Pre-requisites

  1. Understanding of variables, data types (especially floats and integers).

  2. Familiarity with using import statements to include libraries.

  3. Experience with basic control flow structures like if statements (optional for the bonus challenge).

Challenges

Week 1 🟩

Understanding NumPy

Imagine you're applying for a data scientist role at a leading tech company. As part of your portfolio, you want to demonstrate your proficiency in NumPy, a core library for numerical computing in Python. By solving real-world problems with NumPy, you showcase your ability to handle complex data manipulation tasks efficiently. These exercises not only highlight your technical skills but also your problem-solving abilities and creativity, making you a standout candidate to potential recruiters.

Problem and Instructions

Test each function by calling it with sample input parameters. A short sketch of two of the functions appears after the list below.

  1. Identity Twister: Create a function create_custom_identity(size, value, index) that takes three arguments:

    • size: Size of the square identity matrix (e.g., 4).

    • value: The value to insert at the specified index (e.g., 10).

    • index: A tuple representing the index (i, j) for the custom value.

    • The function should return a NumPy array representing the identity matrix with the specified value inserted at the given index.

  2. Element Order Enigma: Write a function check_same_elements(arr1, arr2) that takes two NumPy arrays and returns True if both arrays contain the same elements regardless of their order, False otherwise. (Hint: Explore sorting and set comparisons).

  3. Broadcast Buster: Write a function safe_add(arr1, arr2) that adds two NumPy arrays and handles potential broadcasting issues gracefully. The function should raise a specific error message if the shapes of the arrays are incompatible for addition.

  4. Flattening the Terrain: Write a function flatten_terrain(terrain_data) that takes a 3D NumPy array representing terrain data (e.g., height at each point) and returns a 2D NumPy array representing the flattened height map.

  5. Universal Absolute Value: Define a function absolute_all(arr) that takes a NumPy array and returns a new array with all elements converted to their absolute values.

  6. Create Random Grayscale Image: Write Python code that:

  • Imports the numpy library as np.

  • Defines the desired image dimensions (width and height) using variables.

  • Generates a random NumPy array filled with values representing grayscale intensities. Remember, grayscale values typically range from 0 (black) to 255 (white).

  • Visualization Power: Now that you have your random grayscale data, it's time to bring it to life! Import another Python library suitable for image display.

    • Matplotlib - A versatile library for data visualization, including images.
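
Below is a minimal sketch of how challenges 1 and 3 might look, assuming standard NumPy behaviour; the function names follow the problem statements above, and the test values are only illustrative.

```python
import numpy as np

def create_custom_identity(size, value, index):
    """Return a size x size identity matrix with `value` written at `index` (i, j)."""
    matrix = np.eye(size)
    matrix[index] = value  # indexing with a tuple targets a single cell
    return matrix

def safe_add(arr1, arr2):
    """Add two arrays, raising a clear error when their shapes cannot broadcast."""
    try:
        return np.add(arr1, arr2)
    except ValueError as exc:
        raise ValueError(
            f"Shapes {arr1.shape} and {arr2.shape} are incompatible for addition."
        ) from exc

# Quick checks with sample inputs
print(create_custom_identity(4, 10, (1, 2)))
print(safe_add(np.array([1, 2, 3]), np.array([10, 20, 30])))
```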

Bonus Challenge (optional):

  • Modify your code (No. 6) to generate images with different value ranges (e.g., 0-127) and observe how it impacts the grayscale appearance.

  • Add functionalities to control the minimum and maximum grayscale intensities for more creative control over your random image generation.
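
One possible sketch for the bonus challenge, assuming Matplotlib for display; the image dimensions and intensity range below are arbitrary examples.

```python
import numpy as np
import matplotlib.pyplot as plt

def random_grayscale(width, height, min_intensity=0, max_intensity=255):
    """Generate a random grayscale image within a configurable intensity range."""
    return np.random.randint(min_intensity, max_intensity + 1, size=(height, width))

# Restricting the range to 0-127 produces a visibly darker image.
image = random_grayscale(256, 256, min_intensity=0, max_intensity=127)
plt.imshow(image, cmap="gray", vmin=0, vmax=255)  # fixed vmin/vmax keeps the full 0-255 scale for comparison
plt.axis("off")
plt.show()
```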

Submission Guidelines

  1. Code Submission:

    • Submit your Python code for each function and task.

    • Ensure your code is well-documented and includes comments explaining each step.

Tools

Learning Resources

Week 2 🟩

Exploring Video Game Sales

This challenge dives into the video game sales dataset, shifting the focus from data engineering to analysis and visualization. We'll utilize libraries like pandas, matplotlib, and seaborn to uncover insights from the data.

About the Dataset: Video Games Sales Dataset

The video game sales dataset you're working with contains information on over 55,000 video games (as of April 2019). Here's a quick rundown:

  • Source: Scraped from vgchartz.com

  • Number of Records: 55,792

  • Content:

    • Game details: Name, Platform, Genre, ESRB Rating

    • Sales figures: Total Shipped copies, Global Sales (worldwide), Sales figures for North America (NA_Sales), Europe (PAL_Sales), Japan (JP_Sales), and Other regions

    • Review scores: Critic Score (out of 10), User Score (out of 10)

    • Release Year


Pre-requisites

To successfully complete this challenge, you should have the following:

  • Knowledge:

    • Basic understanding of Python programming.

    • Familiarity with the pandas library for data manipulation.

    • Basic knowledge of data visualization using matplotlib and seaborn.

  • Software:

    • Python 3

    • pandas library

    • matplotlib library

    • seaborn library


Problem Description

Imagine you're working as a data analyst for a gaming company that is preparing to launch a new video game. The company wants to understand the market trends, such as which genres are most popular, how review scores impact sales, and how sales have evolved over time. Your task is to analyze a comprehensive dataset of video game sales to uncover these insights. By providing a detailed analysis, you help the company make data-driven decisions on game development, marketing strategies, and sales forecasts.

For example, the marketing team wants to know if high critic scores correlate with increased sales to better allocate their promotional budget. The development team is interested in knowing which platforms perform best for specific genres to prioritize their development efforts. By conducting this analysis, you enable the company to make informed decisions that could lead to a successful game launch and improved market positioning.


Set 1: Data Cleaning and Exploration (Easy)

  1. Import libraries and load data:

    • Import pandas, matplotlib, and optionally seaborn.

    • Load the video game data using pandas.read_csv.

    • Print the first few rows of the data to get a sense of its structure.

  2. Data summary:

    • Get basic information about the data using df.info().

    • Describe the numerical columns using df.describe().

    • Identify and handle missing values (e.g., dropping rows, imputing values).

  3. Visualize data distribution:

    • Create histograms or density plots for key numerical columns like "Critic Score," "User Score," and "Global_Sales" using matplotlib or seaborn.
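
A minimal sketch for Set 1; the file name (vgsales.csv) and column names are assumptions, so adjust them to match your copy of the dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("vgsales.csv")  # assumed filename

print(df.head())      # first rows to get a sense of the structure
df.info()             # dtypes and non-null counts (prints directly)
print(df.describe())  # summary statistics for numerical columns

# One simple missing-value strategy: drop rows without a global sales figure.
df = df.dropna(subset=["Global_Sales"])

# Distribution of a key numerical column.
df["Global_Sales"].plot(kind="hist", bins=50, title="Global Sales distribution")
plt.xlabel("Global sales (millions)")
plt.show()
```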

Set 2: Basic Grouping and Analysis (Medium)

  1. Top Genres by Sales:

    • Group the data by "Genre" and calculate the total global sales for each genre.

    • Create a bar chart (using matplotlib or seaborn) to visualize the top-selling genres.

  2. Correlation between Scores and Sales:

    • Calculate the correlation coefficient between "Critic Score" and "Global_Sales" using pandas.

    • Discuss the interpretation of the correlation value.

  3. Sales Over Time:

    • Group the data by "Year" and calculate the average "Global_Sales" for each year.

    • Create a line chart (using matplotlib or seaborn) to visualize the trend of global sales over time.
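
One way Set 2 could look, continuing from the Set 1 sketch; the column names ("Genre", "Year", "Critic_Score", "Global_Sales") are assumptions about the file's layout.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("vgsales.csv").dropna(subset=["Global_Sales"])  # same assumed filename as Set 1

# Total global sales per genre, highest first.
top_genres = df.groupby("Genre")["Global_Sales"].sum().sort_values(ascending=False)
top_genres.head(10).plot(kind="bar", title="Top genres by total global sales")
plt.ylabel("Global sales (millions)")
plt.show()

# Correlation between critic scores and sales.
correlation = df["Critic_Score"].corr(df["Global_Sales"])
print(f"Correlation between critic score and global sales: {correlation:.2f}")

# Average global sales per release year.
df.groupby("Year")["Global_Sales"].mean().plot(kind="line", title="Average global sales by year")
plt.ylabel("Average global sales (millions)")
plt.show()
```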

Set 3: Advanced Analysis and Visualization (Advanced)

  1. Platform Comparison:

    • Analyze the sales performance of a specific genre across different platforms (e.g., PS4, XOne).

    • Create a boxplot (using seaborn) to compare the distribution of "Global_Sales" for that genre across platforms.

  2. Predicting Sales:

    • Explore building a simple linear regression model (using libraries like scikit-learn) to predict "Global_Sales" based on "Critic Score" and "User Score".

    • Evaluate the model's performance and visualize the results.

  3. Interactive Visualization (Bonus):

    • Utilize libraries like Plotly or Bokeh to create interactive visualizations that allow users to explore the data dynamically (e.g., filtering by genre, platform, or year).
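
A sketch of the first two Set 3 tasks, assuming the same column names as above; the genre and feature choices are illustrative, not prescriptive.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

df = pd.read_csv("vgsales.csv")  # assumed filename

# Compare one genre's sales distribution across platforms.
action = df[df["Genre"] == "Action"]
sns.boxplot(data=action, x="Platform", y="Global_Sales")
plt.title("Global sales of Action games by platform")
plt.xticks(rotation=90)
plt.show()

# Simple linear regression: predict global sales from review scores.
model_df = df.dropna(subset=["Critic_Score", "User_Score", "Global_Sales"])
X = model_df[["Critic_Score", "User_Score"]]
y = model_df["Global_Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on the test set:", r2_score(y_test, model.predict(X_test)))
```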


Submission Guidelines

Code and Data

  1. Code:

    • Organize your code into well-commented, readable scripts or Jupyter notebooks.

    • Include clear section headers for each part of the problem set (e.g., Data Cleaning and Exploration, Basic Grouping and Analysis, Advanced Analysis and Visualization).

    • Ensure all code runs without errors and produces the expected results.

  2. Data:

    • Include the CSV file used for the analysis.

    • Provide any additional datasets or files generated during the analysis.

  3. Documentation:

    • Create a README file that explains the steps you took, any assumptions you made, and any challenges you encountered.

    • Include instructions on how to run your scripts/notebooks.

  4. Submission:

    • Compress all files (code, data, documentation) into a single ZIP file.

    • Name the ZIP file using the format: Data_Science_Challenge_YourName.zip.

    • Submit the ZIP file through Slack, or upload it to GitHub and share the link to the repository.

Learning Resources

  1. Another Data Cleaning (Real World): https://www.youtube.com/watch?v=iaZQF8SLHJs

Week 3 🟩

Machine Learning Challenge: Classifying Iris Flowers

This challenge is designed to introduce you to the fundamentals of machine learning using the classic Iris flower dataset.

Prerequisite

Before starting this challenge, ensure you have a foundational understanding of the following:

  • Basic Python programming.

  • Fundamental concepts of machine learning and data science.

  • Experience with data manipulation libraries like Pandas and NumPy.

  • Familiarity with machine learning libraries such as Scikit-Learn.

Problem Description

Imagine you are a data scientist working for a botanical research institute. Your task is to develop a machine learning model that can accurately classify species of Iris flowers based on their physical characteristics. This model will help botanists quickly identify species in the field, aiding in research and conservation efforts. Successfully completing this task showcases your ability to apply machine learning techniques to real-world problems, making you an attractive candidate for data science and machine learning roles.

Instructions

Objective: Build a machine learning model to classify Iris flowers into three species: Iris Setosa, Iris Versicolor, and Iris Virginica. The dataset provides features like sepal and petal length/width for each flower.

Steps:

  1. Data Preparation:

    • Download the Iris flower dataset from the UCI Machine Learning Repository.

    • Use Python libraries like Pandas to import and explore the data.

    • Handle missing values (if any) through techniques like imputation or deletion.

    • Separate the data into features (sepal/petal length/width) and target variable (flower species).

    • Consider data scaling or normalization if necessary.

  2. Model Building:

    • Utilize the Logistic Regression algorithm to build a classification model.

    • Train your model on the prepared data (features and target variable).

  3. Model Evaluation:

    • Use a confusion matrix to assess model performance.

      • The confusion matrix visualizes how many flower predictions were correct (True Positives, True Negatives) and how many were classified incorrectly (False Positives, False Negatives).

    • (Optional) Calculate additional evaluation metrics like accuracy, precision, recall, and F1-score to gain a deeper understanding of your model's performance.
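
A minimal end-to-end sketch of the steps above; it loads the Iris data bundled with scikit-learn rather than downloading it from UCI, which you may still prefer to do.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# scikit-learn ships a copy of the Iris dataset; no missing values to handle here.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale the features, then fit a logistic regression classifier.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=200).fit(scaler.transform(X_train), y_train)

predictions = model.predict(scaler.transform(X_test))
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))  # accuracy, precision, recall, F1
```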

Learning Outcomes:

  • Gain hands-on experience with data preparation and machine learning model building.

  • Understand the importance of data exploration and cleaning for model performance.

  • Learn how to evaluate and interpret machine learning models using confusion matrices and other metrics.

Submission Guidelines

  1. Code:

    • Submit your Jupyter notebook used to prepare the data, build, and evaluate the machine learning model.

    • Ensure your code is well-commented to explain your approach and any challenges faced.

  2. Data:

    • Include the cleaned dataset used for training and evaluation.

    • Provide any visualizations created during the data exploration and model evaluation phases.

Learning Resources

  1. Machine Learning Crash Course: https://www.youtube.com/watch?v=b2q5OFtxm6A

  2. Machine Learning Algorithms: https://www.youtube.com/watch?v=I7NrVwm3apg

Week 4 🟩

Predicting Titanic Passenger Survival

Pre-requisite

Before diving into this challenge, ensure you have the following skills and tools:

  1. Python Programming: Familiarity with Python and basic programming concepts.

  2. Pandas: Experience with data manipulation and analysis using pandas.

  3. Scikit-learn: Understanding of machine learning concepts and implementation using scikit-learn.

  4. Basic Statistics: Knowledge of basic statistical concepts like mean, median, standard deviation, and distributions.

  5. Data Visualization: Ability to visualize data using libraries like Matplotlib or Seaborn.

Problem Description

Imagine being part of a team tasked with developing a predictive system for a cruise line. The system aims to enhance passenger safety by predicting who would survive in the unfortunate event of a disaster, based on historical data from the Titanic. This problem not only helps in honing your data science skills but also presents a practical and compelling scenario that showcases your ability to solve real-world issues with data-driven solutions. This challenge is particularly attractive to recruiters because it demonstrates proficiency in critical areas like data cleaning, feature engineering, model building, and performance evaluation, all essential for any data science role.

Instructions

Utilize the Kaggle Titanic dataset to build and compare machine learning models for predicting passenger survival.

Steps:

  1. Data Acquisition and Exploration:

    • Download the Titanic dataset from Kaggle: Titanic - Machine Learning from Disaster.

    • Use pandas to import and explore the data.

    • Understand the data structure, identify missing values, and analyze the distribution of features like age, sex, fare class, etc.

  2. Data Cleaning:

    • Handle missing values in a suitable manner (e.g., imputation, deletion).

    • Encode categorical variables (e.g., sex, embarked port) into numerical representations for machine learning models.

    • Consider feature engineering to create new features from existing ones (e.g., family size based on siblings/spouses).

  3. Data Splitting:

    • Split the cleaned data into training and testing sets using train_test_split from scikit-learn.

    • The training set will be used to train the models, and the testing set will be used for unbiased evaluation.

  4. Model Building and Comparison:

    • Implement and train the following classification models from scikit-learn:

      • Logistic Regression

      • Decision Tree Classifier

      • Naive Bayes Classifier

      • Support Vector Machine (SVM)

  5. Model Evaluation:

    • Evaluate the performance of each model on the testing set using metrics like:

      • Accuracy: Proportion of correct predictions.

      • F1-score: Harmonic mean of precision and recall.

      • Precision: Ratio of true positives to all predicted positives.

      • Recall: Ratio of true positives to all actual positives.

    • Create a pandas DataFrame to display the performance metrics for each model side-by-side for easy comparison.

  6. Analysis and Conclusion:

    • Analyze the results and compare the performance of different models based on the evaluation metrics.

    • Discuss potential reasons for performance differences and identify the best performing model for predicting passenger survival.
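
One way the model comparison might be structured, assuming the data has already been cleaned and encoded into numeric columns and saved under a hypothetical name, titanic_clean.csv, with a "Survived" target.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

df = pd.read_csv("titanic_clean.csv")  # hypothetical cleaned, numeric dataset
X, y = df.drop(columns=["Survived"]), df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, preds),
        "Precision": precision_score(y_test, preds),
        "Recall": recall_score(y_test, preds),
        "F1": f1_score(y_test, preds),
    })

# Side-by-side comparison of the four models.
print(pd.DataFrame(rows).set_index("Model").round(3))
```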

Bonus Challenge:

  • Explore feature scaling or normalization techniques to improve model performance.

  • Try different feature selection methods to identify the most impactful features for survival prediction.

  • Analyze the impact of hyperparameter tuning on model performance for each algorithm.

  • Visualize the relationship between features and survival using techniques like scatter plots or box plots.

Submission Guidelines (Code and Data)

  1. Code Submission:

    • Provide a well-documented Jupyter Notebook (.ipynb) that includes:

      • Data acquisition and exploration steps.

      • Data cleaning and preprocessing steps.

      • Model building and training code.

      • Model evaluation and comparison.

      • Analysis and conclusions.

    • Ensure the code is clean, well-commented, and follows best practices for readability and reproducibility.

  2. Data Submission:

    • Include any intermediate datasets created during the data cleaning and preprocessing steps.

    • Provide a summary of any feature engineering or transformation steps applied to the data.

Learning Resources

Week 5 🟩

Optimizing Titanic Survival Prediction

This challenge dives into feature engineering and hyperparameter tuning to improve survival prediction accuracy in the infamous Titanic disaster dataset. You'll leverage classical machine learning models and explore best practices for enhanced performance.

Prerequisite:

  • Basic understanding of Python programming.

  • Familiarity with data analysis libraries like Pandas and NumPy.

  • Knowledge of machine learning concepts and libraries such as Scikit-learn.

  • Experience with Jupyter notebooks for interactive coding and analysis.

Problem Description:

Imagine you're a data scientist working for a maritime safety organization. Your task is to develop a predictive model that can accurately determine the likelihood of survival for passengers in the event of a maritime disaster. This model could be integrated into safety protocols and used to design better evacuation procedures, ensuring more lives are saved during such unfortunate events. By tackling the Titanic survival prediction problem, you gain valuable experience in handling real-world data, uncovering critical insights, and developing models that can make a significant impact on safety measures.

Instructions:

Participate in the Kaggle Titanic competition and strive to achieve a top position on the leaderboard using classical machine learning models.

Focus on feature engineering, hyperparameter tuning, and exploring techniques implemented by high-performing participants.

Steps:

  1. Data Acquisition and Exploration:

    • Download the Titanic dataset from Kaggle.

    • Explore the data using pandas to understand its structure, identify missing values, and analyze feature distributions.

  2. Feature Engineering:

    • Go beyond basic features. Explore feature creation techniques:

      • Derive new features from existing ones (e.g., family size based on siblings/spouses).

      • Encode categorical features (e.g., sex, embarked port) into numerical representations suitable for machine learning models.

      • Consider feature scaling or normalization to improve model performance.

  3. Model Selection and Training:

    • Choose a classical machine learning model of your choice (e.g., Logistic Regression, Decision Tree, Random Forest).

    • Split the data into training and testing sets for unbiased evaluation.

    • Train the model on the training set, experimenting with different hyperparameters using techniques like GridSearchCV or RandomizedSearchCV.

  4. Hyperparameter Tuning:

    • Explore various hyperparameters specific to your chosen model.

    • [OPTIONAL] Utilize GridSearchCV or RandomizedSearchCV to efficiently evaluate different hyperparameter combinations and identify the optimal configuration for your model.

  5. Evaluation and Comparison:

    • Evaluate your model's performance on the testing set using metrics like:

      • Accuracy: Proportion of correct predictions.

      • F1-score: Harmonic mean of precision and recall.

      • Precision: Ratio of true positives to all predicted positives.

      • Recall: Ratio of true positives to all actual positives.

    • Compare your results with the Kaggle leaderboard.

  6. Improving Performance:

    • Analyze the results and identify areas for improvement.

    • Try feature selection techniques to identify the most impactful features for survival prediction.

    • Explore different machine learning models and compare their performance.

    • Learn from the approaches used by top-performing participants on Kaggle. Consider feature engineering techniques, model choices, or hyperparameter configurations they've implemented.
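
A sketch of the optional GridSearchCV step, assuming the same cleaned, numeric Titanic features as in Week 4; the parameter grid is only an example of what you might search over.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("titanic_clean.csv")  # hypothetical cleaned, numeric dataset
X, y = df.drop(columns=["Survived"]), df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 3, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test-set F1 (the chosen scoring metric):", search.score(X_test, y_test))
```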

Bonus Challenge:

  • Analyze feature importance to understand which features contribute most to the model's predictions.

  • Implement techniques to handle imbalanced class distributions (if applicable).

  • Explore model ensembling (combining predictions from multiple models) for potentially better results.

Learning Outcomes:

  • Gain practical experience with feature engineering for creating informative features from raw data.

  • Understand the importance of hyperparameter tuning and its impact on model performance.

  • Learn to evaluate and compare the performance of machine learning models.

Learning Resources

Week 6 🟩

Revamped House Price Prediction with Streamlit

Pre-requisite

Before diving into this challenge, ensure you have the following prerequisites covered:

  1. Basic Knowledge of Python: Familiarity with Python programming and libraries like Pandas, NumPy, and Scikit-Learn.

  2. Understanding of Machine Learning Concepts: Knowledge of regression techniques, feature engineering, model evaluation, and deployment.

  3. Experience with Streamlit: Basic understanding of how to create web applications using Streamlit.

  4. Access to Kaggle Account: Ability to download datasets from Kaggle. You can sign up for a free account at Kaggle.

  5. Development Environment: A setup that includes Python and necessary libraries (you might consider using Jupyter Notebook).

Problem Description

Imagine you're a data scientist at a real estate company that wants to enhance its online platform by offering an advanced tool for predicting house prices. Potential buyers and sellers could benefit greatly from accurate price predictions based on various house features. This tool would not only attract more users to the platform but also establish the company as a tech-savvy leader in the real estate market.

Consider Jane, a prospective homebuyer who's browsing your company's website. She enters details like location, number of bedrooms, and lot size into the new price prediction tool. Instantly, she receives a price estimate that helps her decide whether to pursue the property further. Such a tool provides invaluable assistance to buyers and sellers alike, making the home-buying process more transparent and data-driven.

Instructions

Develop a comprehensive data science application using Python that:

  1. Predicts House Prices:

    • Utilize the House Prices dataset from Kaggle: House Prices - Advanced Regression Techniques.

    • Employ machine learning techniques such as Random Forest and Gradient Boosting to build a predictive model.

    • Engage in feature engineering to create informative features from the available data.

  2. Integrates Streamlit:

    • Deploy your project as a web application using Streamlit.

    • Create a user-friendly interface allowing users to input house features.

    • Display the predicted house price based on user input using Streamlit.

  3. Saves and Loads Models:

    • Save your trained machine learning model using a library like Scikit-Learn's joblib.

    • Implement functionality to load the saved model within your Streamlit application, so predictions are served without retraining and the model artifact can be updated later without redeploying the entire application.

  4. Preprocessing with Persistence:

    • Preprocess the data by handling missing values and performing scaling/normalization before feeding it to your model.

    • Use joblib to save the preprocessor object alongside your model, enabling you to load both the preprocessor and model together for future predictions.
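
A condensed sketch of the save/load idea, where the preprocessing steps and the model are bundled in a single scikit-learn Pipeline and persisted together with joblib; the feature names, file names, and UI widgets are assumptions, and the instructions' approach of saving the preprocessor as a separate artifact works just as well.

```python
import joblib
import pandas as pd
import streamlit as st
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["GrLivArea", "OverallQual", "YearBuilt"]  # illustrative subset of the Kaggle columns
MODEL_PATH = "house_price_model.joblib"               # hypothetical artifact name

def train_and_save(csv_path="train.csv"):
    """Train a pipeline (imputation + scaling + model) offline and persist it with joblib."""
    df = pd.read_csv(csv_path)
    numeric = Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())])
    model = Pipeline([
        ("prep", ColumnTransformer([("num", numeric, FEATURES)])),
        ("rf", RandomForestRegressor(random_state=42)),
    ])
    model.fit(df[FEATURES], df["SalePrice"])
    joblib.dump(model, MODEL_PATH)

# --- Streamlit app: loads the saved artifact instead of retraining on every run ---
st.title("House Price Predictor")
model = joblib.load(MODEL_PATH)  # run train_and_save() once beforehand

area = st.number_input("Above-ground living area (sq ft)", value=1500)
quality = st.slider("Overall quality (1-10)", 1, 10, 5)
year = st.number_input("Year built", value=2000)

if st.button("Predict price"):
    X_new = pd.DataFrame([{"GrLivArea": area, "OverallQual": quality, "YearBuilt": year}])
    st.write(f"Estimated sale price: ${model.predict(X_new)[0]:,.0f}")
```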

Challenge Hints

  • Build a Well-Structured Pipeline: Focus on creating a reusable machine learning pipeline.

  • Modular Code Design: Break down your code into functions for data loading, preprocessing, model training, prediction, and saving/loading.

  • Leverage Streamlit: Utilize Streamlit's various components for building interactive web interfaces.

  • Address Data Issues: Ensure to handle missing values and perform feature scaling appropriately.

  • Efficient Persistence: Use joblib to persist both models and preprocessors, ensuring efficient use of trained artifacts.

Submission Guidelines

Code and Data

  1. Code Submission:

    • Ensure your code is well-commented and follows best practices for readability and maintainability.

    • Include a README file explaining the project setup, structure, and usage.

    • Provide a requirements.txt file listing all the dependencies needed to run the project.

    • Ensure your code is modular and divided into logical sections such as data loading, preprocessing, model training, and Streamlit integration.

  2. Data Submission:

    • Use the Kaggle dataset provided for the challenge.

    • Ensure your dataset handling respects Kaggle's terms of service.

    • Do not include raw data files in your submission; instead, provide instructions on how to download the dataset from Kaggle.

  3. Streamlit App:

    • Include link to the Streamlit application.

    • Ensure the Streamlit app is user-friendly, visually appealing, and functional.

By participating in this challenge, you'll enhance your data science skills by combining model building with deployment considerations. You'll build a valuable application for house price prediction and showcase your ability to leverage Streamlit for effective data science communication.

Week 7 🟩

NLP Challenge: News Article Analysis with Streamlit

Objective:

This challenge introduces you to the fundamentals of Natural Language Processing (NLP) using Python libraries like NLTK and TextBlob. You'll build a Streamlit application to perform text analysis on news articles, including EDA, sentiment analysis, and named entity recognition.

Pre-requisites:

  • Basic Python Programming: Familiarity with Python's syntax and core concepts is essential.

  • Introduction to NLP: Prior exposure to Natural Language Processing concepts will be beneficial, especially knowledge of libraries like NLTK, TextBlob, and spaCy.

  • Understanding of Web Scraping: A basic understanding of web scraping or accessing APIs to retrieve news articles is required.

  • Streamlit Basics: You should be comfortable creating simple Streamlit applications, including adding widgets and visualizations.

Problem Description:

Imagine you're a data scientist working for a media company. Your team is tasked with analyzing the vast amount of news content published daily to extract meaningful insights. For instance, you're asked to identify trends in the news, gauge public sentiment on major topics, and highlight key entities such as prominent figures and organizations. Your analysis could influence editorial decisions, help journalists craft more compelling stories, or even aid in tracking the impact of news coverage on public opinion. By solving this challenge, you demonstrate your ability to handle real-time data and extract actionable insights, making you a valuable asset to any company involved in media, finance, or public relations.

Instructions:

You are required to build a Streamlit application that analyzes news articles using NLP techniques. Follow these steps:

  1. Data Acquisition: Start by copying and pasting news article content manually (keep it simple); avoid implementing scrapers at this point. Once you have the text, preprocess it by converting it to lowercase, removing stop words, and handling punctuation.

  2. EDA with NLTK: Perform exploratory data analysis (EDA) by normalizing the text using lemmatization or stemming. Analyze the distribution of parts of speech (POS) such as nouns, verbs, and adjectives. Consider using NLTK's POS tagging for this task. Visualize your findings using Streamlit's plotting capabilities, such as bar charts for POS distribution or word clouds for commonly used words.

  3. Sentiment Analysis with TextBlob: Use TextBlob's sentiment analysis features to assess the overall sentiment (positive, negative, neutral) of the articles, focusing on the polarity and subjectivity scores TextBlob provides. Visualize the sentiment distribution with Streamlit's charts.

  4. Named Entity Recognition with spaCy: Apply spaCy's NER model to extract entities like persons, organizations, and locations from the articles. Think about how you can structure this data, perhaps displaying it in tables or lists for easy reference.

  5. Streamlit App Development: Build your Streamlit app with an intuitive interface that allows users to input their news articles or choose from your pre-loaded dataset. Ensure that the results from your EDA, sentiment analysis, and NER are displayed in a clear and organized manner. Utilize Streamlit's interactive components to make your visualizations engaging.

    Hints:

    • Consider breaking down the task into smaller functions to keep your code modular.

    • Think about adding a sidebar for additional controls, such as filtering articles by date or source.
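
A minimal sketch of the sentiment and NER pieces inside a Streamlit app, assuming spaCy's small English model has been downloaded (python -m spacy download en_core_web_sm); the layout is only a starting point.

```python
import streamlit as st
import spacy
from textblob import TextBlob

nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded

st.title("News Article Analysis")
text = st.text_area("Paste a news article here")

if text:
    # Sentiment with TextBlob: polarity in [-1, 1], subjectivity in [0, 1].
    blob = TextBlob(text)
    st.subheader("Sentiment")
    st.write(f"Polarity: {blob.sentiment.polarity:.2f}")
    st.write(f"Subjectivity: {blob.sentiment.subjectivity:.2f}")

    # Named entities with spaCy, shown as a simple table.
    doc = nlp(text)
    st.subheader("Named entities")
    st.table([{"Text": ent.text, "Label": ent.label_} for ent in doc.ents])
```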

Submission Guidelines:

  • Code:

    • Ensure your code is well-documented with comments explaining the purpose of each function.

    • Structure your project with separate modules or scripts for data acquisition, preprocessing, analysis, and the Streamlit app.

    • Include a requirements.txt file that lists all dependencies needed to run your project.

  • Data:

    • If you used a publicly available dataset, include a link to the source.

    • If you scraped the data, provide the script you used for scraping and the dataset in a .csv or .json format.

  • README:

    • A README.md file should be included, describing the project, the steps to run the app, and any additional instructions or insights.

    • Provide screenshots or a video demo of your Streamlit app in action.

    • Add a URL to a web-hosted version of your app.

By completing this challenge, you'll acquire valuable skills in NLP and data visualization, laying a strong foundation for further exploration in text analysis and natural language applications.

Learning Resources

Week 8

Week 9

Week 10

Week 11

Week 12
