10 Real World Data Science Case Studies Projects with Example
Top 10 data science case studies with examples and solutions in Python to inspire your data science learning. Last Updated: 02 Feb 2023
Data science has been a trending buzzword in recent times. With wide applications in sectors like healthcare, education, retail, transportation, media, and banking, data science is at the core of pretty much every industry out there. The possibilities are endless: from fraud detection in the finance sector to personalized recommendations in eCommerce. We have developed ten exciting data science case studies to explain how data science is leveraged across various industries to make smarter decisions and develop innovative personalized products tailored to specific customers.
Table of Contents
- Data Science Case Studies in Retail
- Data Science Case Studies in the Entertainment Industry
- Data Science Case Studies in the Travel Industry
- Data Science Case Studies in Social Media
- Data Science Case Studies in Healthcare
- Data Science Case Studies in Oil and Gas
- 10 Most Interesting Data Science Case Studies with Examples
So, without much ado, let's get started with these data science business case studies!
With humble beginnings as a simple discount retailer, today Walmart operates 10,500 stores and clubs in 24 countries, along with eCommerce websites, employing around 2.2 million people around the globe. For the fiscal year ended January 31, 2021, Walmart's total revenue was $559 billion, a growth of $35 billion driven by the expansion of its eCommerce sector. Walmart is a data-driven company that works on the principle of 'Everyday low cost' for its consumers. To achieve this goal, it heavily depends on the advances of its data science and analytics department for research and development, also known as Walmart Labs. Walmart is home to the world's largest private cloud, which can manage 2.5 petabytes of data every hour! To analyze this humongous amount of data, Walmart has created 'Data Café,' a state-of-the-art analytics hub located within its Bentonville, Arkansas headquarters. The Walmart Labs team heavily invests in building and managing technologies like cloud, data, DevOps, infrastructure, and security.
Walmart is experiencing massive digital growth as the world's largest retailer. Walmart has been leveraging big data and advances in data science to build solutions that enhance, optimize, and customize the shopping experience and serve customers better. At Walmart Labs, data scientists focus on creating data-driven solutions that power the efficiency and effectiveness of complex supply chain management processes. Here are some of the applications of data science at Walmart:
i) Personalized Customer Shopping Experience
Walmart analyses customer preferences and shopping patterns to optimize the stocking and displaying of merchandise in its stores. Big data analysis also helps it understand new item sales, decide when to discontinue products, and evaluate brand performance.
ii) Order Sourcing and On-Time Delivery Promise
Millions of customers view items on Walmart.com, and Walmart provides each customer a real-time estimated delivery date for the items purchased. Walmart runs a backend algorithm that estimates this based on the distance between the customer and the fulfillment center, inventory levels, and shipping methods available. The supply chain management system determines the optimum fulfillment center based on distance and inventory levels for every order. It also has to decide on the shipping method to minimize transportation costs while meeting the promised delivery date.
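The fulfillment-center choice described above can be sketched as a feasibility-plus-cost filter: keep only centers that have stock and can meet the promised date, then take the cheapest. The center names, stock levels, transit times, and costs below are invented for the example, not Walmart's data:

```python
# Hypothetical sketch: pick the fulfillment center that minimizes shipping
# cost while still meeting the promised delivery date.

def pick_center(centers, promised_days, qty):
    """Return the cheapest center that has stock and ships in time."""
    feasible = [
        c for c in centers
        if c["stock"] >= qty and c["transit_days"] <= promised_days
    ]
    if not feasible:
        return None  # no center can honor the promise; escalate
    return min(feasible, key=lambda c: c["ship_cost"])

centers = [
    {"name": "DallasFC",  "stock": 40, "transit_days": 2, "ship_cost": 6.50},
    {"name": "RenoFC",    "stock": 5,  "transit_days": 1, "ship_cost": 9.00},
    {"name": "AtlantaFC", "stock": 0,  "transit_days": 2, "ship_cost": 5.75},
]

best = pick_center(centers, promised_days=2, qty=3)
print(best["name"])  # DallasFC: in stock, on time, cheapest feasible option
```

The real system also weighs carrier options and consolidation across orders, but the core trade-off — cost versus the delivery promise — is the same.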
iii) Packing Optimization
Packing optimization, also known as box recommendation, is a daily task in retail and eCommerce shipping. Whenever the items of an order, or of multiple orders placed by the same customer, are picked from the shelf and are ready for packing, Walmart's recommender system picks the best-sized box that holds all the ordered items with the least in-box space wastage, within a fixed amount of time. This is the Bin Packing Problem, a classic NP-hard problem familiar to data scientists.
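Since bin packing is NP-hard, practical systems rely on fast heuristics. Here is a first-fit-decreasing sketch (one common heuristic, not Walmart's actual recommender) with made-up item volumes and box capacity:

```python
# First-fit-decreasing heuristic for the bin-packing problem: sort items
# largest-first, then place each in the first box with room, opening a new
# box only when nothing fits.

def first_fit_decreasing(volumes, capacity):
    """Pack item volumes into as few fixed-capacity boxes as possible."""
    free = []     # remaining space in each open box
    packing = []  # parallel list of item lists
    for v in sorted(volumes, reverse=True):
        for i, space in enumerate(free):
            if v <= space:
                free[i] -= v
                packing[i].append(v)
                break
        else:  # no existing box fits; open a new one
            free.append(capacity - v)
            packing.append([v])
    return packing

print(first_fit_decreasing([4, 8, 1, 4, 2, 1], capacity=10))
# packs the 6 items into 2 boxes of capacity 10
```

First-fit-decreasing is not optimal in general, but it is guaranteed to use at most roughly 22% more boxes than the optimum, which is why variants of it are popular in practice.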
Here is a link to a sales prediction project to help you understand the applications of data science in the real world. The Walmart Sales Forecasting Project uses historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and you must build a predictive model to project the sales for each department in each store. You can also try your hand at the Inventory Demand Forecasting Data Science Project to develop a machine learning model that accurately forecasts inventory demand based on historical sales data.
Amazon is an American multinational technology company based in Seattle, USA. It started as an online bookseller, but today it focuses on eCommerce, cloud computing, digital streaming, and artificial intelligence. It hosts an estimated 1,000,000,000 gigabytes of data across more than 1,400,000 servers. Through its constant innovation in data science and big data, Amazon stays ahead in understanding its customers. Here are a few data science applications at Amazon:
i) Recommendation Systems
Data science models help Amazon understand customers' needs and recommend products before the customer even searches for them; these models use collaborative filtering. Amazon draws on data from 152 million customer purchases to help users decide which products to buy. The company generates 35% of its annual sales through its recommendation systems.
Here is a Recommender System Project to help you build a recommendation system using collaborative filtering.
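To make the collaborative-filtering idea concrete, here is a toy item-based filter: it scores unseen items by their cosine similarity to the items a user already bought. The purchase matrix is fabricated for illustration; production systems work at vastly larger scale:

```python
# Toy item-based collaborative filtering: items bought by similar sets of
# users are considered similar, and the best unseen item is recommended.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recommend(purchases, user):
    """Score unseen items by similarity to the items the user already owns."""
    items = list(purchases)
    cols = {i: [purchases[i][u] for u in sorted(purchases[i])] for i in items}
    owned = [i for i in items if purchases[i][user]]
    scores = {
        i: sum(cosine(cols[i], cols[j]) for j in owned)
        for i in items if not purchases[i][user]
    }
    return max(scores, key=scores.get)

# rows: items, columns: users A-D (1 = purchased)
purchases = {
    "keyboard": {"A": 1, "B": 1, "C": 0, "D": 1},
    "mouse":    {"A": 1, "B": 1, "C": 0, "D": 0},
    "monitor":  {"A": 0, "B": 1, "C": 1, "D": 1},
    "webcam":   {"A": 0, "B": 0, "C": 1, "D": 0},
}

print(recommend(purchases, "A"))  # monitor — co-purchased with A's items
```

User A owns the keyboard and mouse; the monitor shares buyers with both, while the webcam shares none, so the monitor wins.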
ii) Retail Price Optimization
Amazon product prices are optimized based on a predictive model that determines the best price so that users do not refuse to buy due to price. The model carefully determines optimal prices by considering the customers' likelihood of purchasing the product and how the price will affect their future buying patterns. The price of a product is determined according to your activity on the website, competitors' pricing, product availability, item preferences, order history, expected profit margin, and other factors.
Check Out this Retail Price Optimization Project to build a Dynamic Pricing Model.
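At its simplest, price optimization means searching for the price that maximizes expected revenue under an estimated demand curve. The sketch below assumes a linear demand curve with invented parameters; real pricing models weigh far more signals:

```python
# Pick the revenue-maximizing price on a grid, given an assumed linear
# demand curve demand(p) = base_demand - slope * p.

def best_price(base_demand, slope, prices):
    """Return the price maximizing expected revenue = price * demand(price)."""
    def demand(p):
        return max(base_demand - slope * p, 0)
    return max(prices, key=lambda p: p * demand(p))

prices = [round(10 + 0.5 * k, 2) for k in range(41)]  # $10.00 .. $30.00
p_star = best_price(base_demand=200, slope=8, prices=prices)
print(p_star)  # 12.5 — the analytic optimum 200 / (2 * 8) lies on the grid
```

In practice the demand curve itself is learned from historical sales, and the optimization also accounts for cross-product effects and long-term customer behavior.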
iii) Fraud Detection
Being a significant eCommerce business, Amazon remains at high risk of retail fraud. As a preemptive measure, the company collects historical and real-time data for every order and uses machine learning algorithms to find transactions with a higher probability of being fraudulent. This proactive measure has helped the company restrict clients with an excessive number of product returns.
You can look at this Credit Card Fraud Detection Project to implement a fraud detection model to classify fraudulent credit card transactions.
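A minimal way to see anomaly flagging in action is a z-score rule: transactions far from a customer's typical spend get flagged for review. This is a teaching sketch with invented amounts, not Amazon's actual models:

```python
# Flag transactions whose amount deviates from the mean by more than a
# chosen number of (sample) standard deviations.
from statistics import mean, stdev

def flag_outliers(amounts, threshold=3.0):
    """Return indices of transactions more than `threshold` std devs from mean."""
    mu, sigma = mean(amounts), stdev(amounts)
    return [i for i, a in enumerate(amounts)
            if sigma and abs(a - mu) / sigma > threshold]

history = [25.0, 30.0, 27.5, 22.0, 31.0, 26.0, 950.0]
print(flag_outliers(history, threshold=2.0))  # [6] — flags the $950 order
```

Production fraud systems replace this single-feature rule with supervised classifiers trained on labeled fraud cases across many features (device, location, velocity of orders, and so on).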
Netflix started as a DVD rental service in 1997 and has since expanded into the streaming business. Headquartered in Los Gatos, California, Netflix is the largest content streaming company in the world. Currently, Netflix has over 208 million paid subscribers worldwide, and with streaming supported on thousands of smart devices, Netflix logs around 3 billion hours watched every month. The secret to this massive growth and popularity is Netflix's advanced use of data analytics and recommendation systems to provide personalized and relevant content recommendations to its users. Netflix collects data from over 100 billion events every day. Here are a few examples of how data science is applied at Netflix:
i) Personalized Recommendation System
Netflix uses over 1,300 recommendation clusters based on consumer viewing preferences to provide a personalized experience. The data Netflix collects from its users includes viewing time, platform searches for keywords, and metadata related to content abandonment, such as pause time, rewinds, and rewatches. Using this data, Netflix can predict what a viewer is likely to watch and give each user a personalized watchlist. Some of the algorithms used by the Netflix recommendation system are the Personalized Video Ranker, the Trending Now ranker, and the Continue Watching ranker.
ii) Content Development using Data Analytics
Netflix uses data science to analyze the behavior and patterns of its users to recognize themes and categories that the masses prefer to watch. This data is used to produce shows like The Umbrella Academy, Orange Is the New Black, and The Queen's Gambit. Such shows may seem like huge risks, but they are backed by data analytics that assured Netflix they would succeed with its audience. Data analytics helps Netflix come up with content that its viewers want to watch even before they know they want to watch it.
iii) Marketing Analytics for Campaigns
Netflix uses data analytics to find the right time to launch shows and ad campaigns for maximum impact on the target audience. Marketing analytics also helps produce different trailers and thumbnails for different groups of viewers. For example, the House of Cards Season 5 trailer featuring a giant American flag was launched during the American presidential elections, as it would resonate well with the audience.
Here is a Customer Segmentation Project using association rule mining to understand the primary grouping of customers based on various parameters.
In a world where purchasing music is a thing of the past and streaming is the current trend, Spotify has emerged as one of the most popular streaming platforms. With 320 million monthly users, around 4 billion playlists, and approximately 2 million podcasts, Spotify leads the pack among well-known streaming platforms like Apple Music, Wynk, Songza, and Amazon Music. The success of Spotify has depended mainly on data analytics: by analyzing massive volumes of listener data, Spotify provides real-time, personalized services to its listeners. Most of Spotify's revenue comes from paid premium subscriptions. Here are some of the data science models used by Spotify to provide enhanced services to its listeners:
i) Personalization of Content using Recommendation Systems
Spotify uses a model called BaRT to generate music recommendations for its listeners in real time. BaRT ignores any song a user listens to for less than 30 seconds, and the model is retrained every day to provide updated recommendations. A patent granted to Spotify covers an AI application that identifies a user's musical tastes based on audio signals, gender, age, and accent to make better music recommendations.
Based on listeners' taste profiles, Spotify creates daily playlists called 'Daily Mixes,' which combine songs the user has added to playlists, tracks by artists the user has included in their playlists, and new artists and songs the user might be unfamiliar with but would likely enjoy. Similar is the weekly 'Release Radar' playlist, which gathers newly released songs from artists the listener follows or has liked before.
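A recommender like this constantly balances exploiting known favorites against exploring new tracks. Here is a toy epsilon-greedy bandit illustrating that trade-off, including the "under 30 seconds counts as a skip" rule mentioned above; this is a teaching sketch, not Spotify's production BaRT system:

```python
# Epsilon-greedy bandit: track an average reward per song (1 = full listen,
# 0 = skipped before 30 seconds), then mostly play the best-known song while
# occasionally exploring a random one.
import random

def update(stats, track, listen_seconds):
    """Record reward 1 for a full listen (>= 30s), else 0."""
    reward = 1 if listen_seconds >= 30 else 0
    n, value = stats.get(track, (0, 0.0))
    stats[track] = (n + 1, value + (reward - value) / (n + 1))

def choose(stats, tracks, epsilon=0.1, rng=random):
    """Mostly exploit the best-known track; sometimes explore a random one."""
    if rng.random() < epsilon:
        return rng.choice(tracks)
    return max(tracks, key=lambda t: stats.get(t, (0, 0.0))[1])

stats = {}
update(stats, "song_a", 12)   # skipped — reward 0
update(stats, "song_a", 180)  # full listen — reward 1
update(stats, "song_b", 200)  # full listen — reward 1
print(choose(stats, ["song_a", "song_b"], epsilon=0.0))  # song_b (1.0 > 0.5)
```

With epsilon above zero, the system keeps occasionally trying songs it knows less about, which is how unfamiliar artists still make it into a Daily Mix.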
ii) Targeted Marketing through Customer Segmentation
Beyond enhancing personalized song recommendations, Spotify uses this massive dataset for targeted ad campaigns and personalized service recommendations. Spotify uses ML models to analyze listener behavior and group listeners by music preferences, age, gender, ethnicity, etc. These insights help it create ad campaigns for a specific target audience. One of its best-known campaigns was the meme-inspired ads aimed at potential target customers, which were a huge success globally.
iii) CNNs for Classification of Songs and Audio Tracks
Spotify builds audio models to evaluate songs and tracks, which helps it develop better playlists and recommendations for its users. These models allow Spotify to filter new tracks based on their lyrics and rhythms and recommend them to users who like similar tracks (collaborative filtering). Spotify also uses NLP (natural language processing) to scan articles and blogs and analyze the words used to describe songs and artists. These analytical insights help group and identify similar artists and songs, which can then be leveraged to build playlists.
Here is a Music Recommender System Project for you to start learning. We have listed another music recommendations dataset for you to use in your projects: Dataset1 . You can use this dataset of Spotify metadata to classify songs based on artist, mood, and liveliness. Plot histograms and heatmaps to get a better understanding of the dataset, then use classification algorithms like logistic regression and SVM, along with Principal Component Analysis, to generate valuable insights from the dataset.
Airbnb was born in 2007 in San Francisco and has since grown to 4 million hosts and 5.6 million listings worldwide, welcoming more than 1 billion guest arrivals. Airbnb is active in every country on the planet except Iran, Sudan, Syria, and North Korea; that is around 97.95% of the world. Using data as the voice of its customers, Airbnb analyzes the large volume of customer reviews and host inputs to understand trends across communities, rate user experiences, and make informed decisions that build a better business model. The data scientists at Airbnb are developing exciting new solutions to boost the business and find the best mapping between its customers and hosts. Airbnb's data servers serve approximately 10 million requests a day and process around one million search queries. Treating data as the voice of its customers lets Airbnb offer personalized services by creating a perfect match between guests and hosts for a supreme customer experience.
i) Recommendation Systems and Search Ranking Algorithms
Airbnb helps people find 'local experiences' in a place with the help of search algorithms that make searches and listings precise. Airbnb uses a 'listing quality score' to find homes based on the proximity to the searched location and uses previous guest reviews. Airbnb uses deep neural networks to build models that take the guest's earlier stays into account and area information to find a perfect match. The search algorithms are optimized based on guest and host preferences, rankings, pricing, and availability to understand users’ needs and provide the best match possible.
ii) Natural Language Processing for Review Analysis
Airbnb characterizes data as the voice of its customers. Customer and host reviews give direct insight into the experience, but star ratings alone cannot capture it fully. Hence Airbnb uses natural language processing to understand reviews and the sentiments behind them. The NLP models are developed using convolutional neural networks.
Practice this Sentiment Analysis Project for analyzing product reviews to understand the basic concepts of natural language processing.
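As a first taste of sentiment analysis, here is a minimal lexicon-based scorer. Airbnb's production models are convolutional neural networks, and the tiny word lists below are invented purely for illustration:

```python
# Lexicon-based sentiment: count positive words as +1 and negative words
# as -1, then map the total score to a label.

POSITIVE = {"clean", "great", "friendly", "comfortable", "lovely"}
NEGATIVE = {"dirty", "rude", "noisy", "broken", "terrible"}

def sentiment(review):
    """Score a review: positive words +1, negative words -1."""
    words = review.lower().replace(".", " ").replace(",", " ").split()
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("The host was friendly and the room was clean."))  # positive
print(sentiment("Noisy street, broken shower, terrible stay."))    # negative
```

Lexicon methods miss negation and sarcasm ("not clean at all"), which is exactly why review platforms move to learned models like the CNNs described above.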
iii) Smart Pricing using Predictive Analytics
Many hosts in the Airbnb community use the service as a supplementary income. Vacation homes and guest houses rented to customers raise local community earnings, as Airbnb guests stay 2.4 times longer and spend approximately 2.3 times more money than hotel guests, which has a significant positive impact on the local neighborhood. Airbnb uses predictive analytics to predict listing prices and help hosts set a competitive and optimal price. A host's overall profitability depends on factors like the time invested and responsiveness to changing demand across seasons. The factors that impact real-time smart pricing are the location of the listing, proximity to transport options, season, and amenities available in the neighborhood.
Here is a Price Prediction Project to help you understand the concept of predictive analysis.
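The core of price prediction is regression. Here is a one-variable ordinary-least-squares sketch predicting a nightly rate from listing capacity; the numbers are invented, and real smart-pricing models use many features (location, season, amenities, and more):

```python
# Ordinary least squares for a single feature: fit y = a + b*x in closed
# form, then use the fitted line to predict a price for a new listing.

def fit_line(xs, ys):
    """Closed-form OLS for y = a + b*x; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

guests = [2, 2, 4, 4, 6, 6]          # listing capacity
price  = [60, 70, 100, 110, 140, 150]  # observed nightly prices
a, b = fit_line(guests, price)
print(round(a + b * 5))  # 125 — predicted nightly price for a 5-guest listing
```

Moving from this to a real pricing model mostly means adding features and swapping the closed-form line for a model that can capture non-linear effects, such as gradient-boosted trees.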
Uber is the biggest global taxi service provider. As of December 2018, Uber had 91 million monthly active consumers and 3.8 million drivers, completing 14 million trips each day. Uber uses data analytics and big-data-driven technologies to optimize its business processes and provide enhanced customer service. The data science team at Uber constantly explores new technologies to provide better service. Machine learning and data analytics help Uber make data-driven decisions that enable features like ride-sharing, dynamic price surges, better customer support, and demand forecasting. Here are some of the data-science-driven products used by Uber:
i) Dynamic Pricing for Price Surges and Demand Forecasting
Uber's prices change at peak hours based on demand. Uber uses surge pricing to encourage more cab drivers to sign up with the company and meet passenger demand. When prices increase, both the driver and the passenger are informed about the surge. Uber uses a patented predictive model for price surging called 'Geosurge,' based on the demand for the ride and the location.
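The essence of surge pricing is a multiplier driven by the demand/supply ratio in an area. The thresholds and cap below are invented for illustration; Uber's patented Geosurge model is far more involved:

```python
# Toy surge multiplier: scale prices by the ratio of ride requests to
# available drivers, never below 1.0 and capped for fairness.

def surge_multiplier(ride_requests, available_drivers, cap=3.0):
    """Scale prices up when requests outstrip drivers, capped at `cap`."""
    if available_drivers == 0:
        return cap
    ratio = ride_requests / available_drivers
    return round(min(max(1.0, ratio), cap), 2)

print(surge_multiplier(30, 60))   # 1.0  — plenty of drivers, no surge
print(surge_multiplier(90, 40))   # 2.25 — demand outstrips supply
print(surge_multiplier(500, 20))  # 3.0  — capped
```

The interesting data science is in forecasting that ratio ahead of time per location, so the multiplier (and driver incentives) can kick in before riders are left waiting.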
ii) One-Click Chat
Uber has developed a machine learning and natural language processing solution called One-Click Chat, or OCC, for coordination between drivers and riders. This feature anticipates responses to commonly asked questions, making it easy for drivers to respond to customer messages with the click of just one button. One-Click Chat is built on Uber's machine learning platform, Michelangelo, and performs NLP on rider chat messages to generate appropriate responses.
iii) Customer Retention
Failure to meet customer demand for cabs could lead users to opt for other services. Uber uses machine learning models to bridge this demand-supply gap: by using prediction models to forecast demand in any location, Uber retains its customers. Uber also uses a tier-based reward system, which segments customers into different levels based on usage; the higher the level a user achieves, the better the perks. Uber also provides personalized destination suggestions based on the user's history and frequently traveled destinations.
You can take a look at this Python Chatbot Project and build a simple chatbot application to better understand the techniques used for natural language processing. You can also practice building a demand forecasting model with this project using time series analysis, or look at this project, which uses time series forecasting and clustering on a dataset containing geospatial data to forecast customer demand for Ola rides.
LinkedIn is the largest professional social networking site with nearly 800 million members in more than 200 countries worldwide. Almost 40% of the users access LinkedIn daily, clocking around 1 billion interactions per month. The data science team at LinkedIn works with this massive pool of data to generate insights to build strategies, apply algorithms and statistical inferences to optimize engineering solutions, and help the company achieve its goals. Here are some of the products developed by data scientists at LinkedIn:
i) LinkedIn Recruiter: Search Algorithms and Recommendation Systems
LinkedIn Recruiter helps recruiters build and manage a talent pool to optimize the chances of hiring candidates successfully. This sophisticated product works on search and recommendation engines. LinkedIn Recruiter handles complex queries and filters on a constantly growing large dataset, and the results delivered have to be relevant and specific. The initial search model was based on linear regression but was eventually upgraded to gradient-boosted decision trees to capture non-linear correlations in the dataset. In addition to these models, LinkedIn Recruiter also uses a Generalized Linear Mixed model to improve prediction quality and give personalized results.
ii) Recommendation Systems Personalized for News Feed
The LinkedIn news feed is the heart and soul of the professional community. A member's newsfeed is a place to discover conversations among connections, career news, posts, suggestions, photos, and videos. Every time a member visits LinkedIn, machine learning algorithms identify the best exchanges to be displayed on the feed by sorting through posts and ranking the most relevant results on top. The algorithms help LinkedIn understand member preferences and help provide personalized news feeds. The algorithms used include logistic regression, gradient boosted decision trees and neural networks for recommendation systems.
iii) CNNs to Detect Inappropriate Content
Providing a professional space where people can trust and express themselves safely has been a critical goal at LinkedIn. LinkedIn has invested heavily in building solutions to detect fake accounts and abusive behavior on its platform. Any form of spam, harassment, or inappropriate content is immediately flagged and taken down; these range from profanity to advertisements for illegal services. LinkedIn uses a machine learning model based on convolutional neural networks. This classifier trains on a dataset containing accounts labeled as either "inappropriate" or "appropriate." The inappropriate list consists of accounts containing "blocklisted" phrases or words, plus a small portion of manually reviewed accounts reported by the user community.
Here is a Text Classification Project to help you understand NLP basics for text classification. You can find a news recommendation system dataset to help you build a personalized news recommender system. You can also use this dataset to build a classifier using logistic regression, Naive Bayes, or Neural networks to classify toxic comments.
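To see how such a classifier works end to end, here is a from-scratch multinomial Naive Bayes sketch in the spirit of the "appropriate"/"inappropriate" labeling described above. The training samples are a tiny fabricated set, purely for illustration:

```python
# Multinomial Naive Bayes with Laplace smoothing: estimate per-label word
# probabilities from labeled text, then classify new text by the label with
# the highest log-probability.
from collections import Counter
from math import log

def train(samples):
    """samples: list of (text, label). Returns (counts, totals, docs, vocab)."""
    counts, totals, docs = {}, Counter(), Counter()
    vocab = set()
    for text, label in samples:
        docs[label] += 1
        words = text.lower().split()
        vocab |= set(words)
        counts.setdefault(label, Counter()).update(words)
        totals[label] += len(words)
    return counts, totals, docs, vocab

def classify(model, text):
    counts, totals, docs, vocab = model
    n_docs = sum(docs.values())
    best, best_lp = None, float("-inf")
    for label in docs:
        lp = log(docs[label] / n_docs)  # class prior
        for w in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the score
            lp += log((counts[label][w] + 1) / (totals[label] + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train([
    ("buy cheap pills now", "inappropriate"),
    ("free money click here", "inappropriate"),
    ("great article thanks for sharing", "appropriate"),
    ("congrats on the new role", "appropriate"),
])
print(classify(model, "cheap pills click here"))  # inappropriate
```

Naive Bayes is a common baseline for text moderation; the CNN approach described above replaces these bag-of-words counts with learned features but keeps the same labeled-training-data setup.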
Pfizer is a multinational pharmaceutical company headquartered in New York, USA, and one of the largest pharmaceutical companies globally, known for developing a wide range of medicines and vaccines in disciplines like immunology, oncology, cardiology, and neurology. Pfizer became a household name in 2020 when its COVID-19 vaccine became the first to receive FDA Emergency Use Authorization. In early November 2021, the CDC approved the Pfizer vaccine for kids aged 5 to 11. Pfizer has been using machine learning and artificial intelligence to develop drugs and streamline trials, which played a massive role in developing and deploying the COVID-19 vaccine. Here are a few applications of data science used by Pfizer:
i) Identifying Patients for Clinical Trials
Artificial intelligence and machine learning are used to streamline and optimize clinical trials and increase their efficiency. Natural language processing and exploratory data analysis of patient records can help identify suitable patients for clinical trials, including patients with distinct symptoms. They can also help examine interactions of potential trial members' specific biomarkers and predict drug interactions and side effects, helping avoid complications. Pfizer's AI implementation helped rapidly identify signals within the noise of millions of data points across its 44,000-candidate COVID-19 clinical trial.
ii) Supply Chain and Manufacturing
Data science and machine learning techniques help pharmaceutical companies better forecast demand for vaccines and drugs and distribute them efficiently. Machine learning models can help identify efficient supply systems by automating and optimizing production steps, which in turn helps supply drugs customized to small pools of patients in specific gene pools. Pfizer uses machine learning to predict the maintenance costs of its equipment; predictive maintenance using AI is the next big step for pharmaceutical companies to reduce costs.
iii) Drug Development
Computer simulations of proteins, tests of their interactions, and yield analysis help researchers develop and test drugs more efficiently. In 2016, Watson Health and Pfizer announced a collaboration to utilize IBM Watson for Drug Discovery to help accelerate Pfizer's research in immuno-oncology, an approach to cancer treatment that uses the body's immune system to help fight cancer. Deep learning models have recently been used for bioactivity and synthesis prediction for drugs and vaccines, in addition to molecular design. Deep learning has been a revolutionary technique for drug discovery as it factors in everything from new applications of medications to possible toxic reactions, which can save millions in drug trials.
You can create a Machine learning model to predict molecular activity to help design medicine using this dataset . You may build a CNN or a Deep neural network for this task.
Shell is a global group of energy and petrochemical companies with over 80,000 employees in around 70 countries. Shell uses advanced technologies and innovations to help build a sustainable energy future. Shell is going through a significant transition: as the world needs more, cleaner energy solutions, Shell aims to be a clean energy company by 2050, which requires substantial changes in the way energy is used. Digital technologies, including AI and machine learning, play an essential role in this transformation, enabling efficient exploration and energy production, more reliable manufacturing, more nimble trading, and a personalized customer experience. Using AI across the organization will help achieve this goal and stay competitive in the market. Here are a few applications of AI and data science used in the petrochemical industry:
i) Precision Drilling
Shell is involved in the entire oil and gas supply chain, from extracting hydrocarbons to refining fuel and retailing it to customers. Recently, Shell has applied reinforcement learning to control the drilling equipment used in extraction. Reinforcement learning works on a reward-based system driven by the outcome of the AI model. The algorithm is designed to guide the drills as they move through the subsurface, based on historical data from drilling records, including information such as drill bit sizes, temperatures, pressures, and knowledge of seismic activity. This model helps the human operator understand the environment better, leading to better and faster results with less damage to the machinery used.
ii) Efficient Charging Terminals
Due to climate change, governments have encouraged people to switch to electric vehicles to reduce carbon dioxide emissions. However, the lack of public charging terminals has deterred people from switching to electric cars. Shell uses AI to monitor and predict demand at charging terminals to provide an efficient supply. Multiple vehicles charging from a single terminal can create a considerable grid load, and demand predictions help make this process more efficient.
iii) Monitoring Service and Charging Stations
Another Shell initiative, trialed in Thailand and Singapore, uses computer vision cameras to watch out for potentially hazardous activities, like lighting cigarettes in the vicinity of the pumps while refueling. The model processes the content of the captured images, then labels and classifies them so the algorithm can alert staff and reduce the risk of fires. The model could be further trained to detect rash driving or theft in the future.
Here is a project to help you understand multiclass image classification. You can also use the Hourly Energy Consumption Dataset to build an energy consumption prediction model, for example using time series forecasting with XGBoost.
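Before reaching for XGBoost, it helps to understand forecasting with a simple autoregressive baseline: predict the next hour's load from a weighted average of recent lags. The consumption values and weights below are invented, not taken from the linked dataset:

```python
# Weighted-lag forecast: the next value is a weighted average of the most
# recent observations, with the newest observation weighted highest.

def forecast_next(series, weights=(0.6, 0.3, 0.1)):
    """Forecast the next point from the last len(weights) observations."""
    lags = series[-len(weights):][::-1]  # most recent first
    return sum(w * x for w, x in zip(weights, lags))

hourly_load = [310, 305, 330, 360, 400, 420]  # megawatts, illustrative
print(round(forecast_next(hourly_load)))  # 408 MW
```

A gradient-boosted model like XGBoost generalizes this idea: instead of fixed weights, it learns how lags, hour-of-day, and calendar features combine, but a lag baseline like this is the honest benchmark to beat.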
Zomato was founded in 2010 and is currently one of the most well-known food tech companies. Zomato offers services like restaurant discovery, home delivery, online table reservation, and online payments for dining. Zomato partners with restaurants to provide tools to acquire more customers while also providing delivery services and easy procurement of ingredients and kitchen supplies. Currently, Zomato has over 2 lakh (200,000) restaurant partners and around 1 lakh (100,000) delivery partners, and it has closed over ten crore (100 million) delivery orders to date. Zomato uses ML and AI to boost its business growth with the massive amount of data collected over the years from food orders and user consumption patterns. Here are a few applications developed by the data scientists at Zomato:
i) Personalized Recommendation System for Homepage
Zomato uses data analytics to create personalized homepages for its users. Zomato uses data science to provide order personalization, like giving recommendations to the customers for specific cuisines, locations, prices, brands, etc. Restaurant recommendations are made based on a customer's past purchases, browsing history, and what other similar customers in the vicinity are ordering. This personalized recommendation system has led to a 15% improvement in order conversions and click-through rates for Zomato.
You can use the Restaurant Recommendation Dataset to build a restaurant recommendation system to predict what restaurants customers are most likely to order from, given the customer location, restaurant information, and customer order history.
ii) Analyzing Customer Sentiment
Zomato uses natural language processing and machine learning to understand customer sentiments from social media posts and customer reviews. These help the company gauge the inclination of its customer base towards the brand. Deep learning models analyze the sentiments of brand mentions on social networking sites like Twitter, Instagram, LinkedIn, and Facebook. These analytics give the company insights that help build the brand and understand the target audience.
iii) Predicting Food Preparation Time (FPT)
Food preparation time is an essential variable in the estimated delivery time of an order placed through Zomato. It depends on numerous factors, like the number of dishes ordered, the time of day, footfall in the restaurant, and the day of the week. Accurate prediction of the food preparation time enables a better estimated delivery time, making delivery partners less likely to breach it. Zomato uses a bidirectional LSTM-based deep learning model that considers all these features and predicts the food preparation time for each order in real time.
Data scientists are companies' secret weapons when analyzing customer sentiments and behavior and leveraging it to drive conversion, loyalty, and profits. These 10 data science case studies projects with examples and solutions show you how various organizations use data science technologies to succeed and be at the top of their field! To summarize, Data Science has not only accelerated the performance of companies but has also made it possible to manage & sustain their performance with ease.
Top 8 Data Science Case Studies for Data Science Enthusiasts
- 8 Data Science Case Studies
- Data Science in Hospitality Industry
- Data Science in Healthcare
- Covid 19 and Data Science
- Data Science in Ecommerce
- Data Science in Supply Chain Management
- Data Science in Meteorology
- Data Science in Entertainment Industry
- Data Science in Banking and Finance
- Where to Find Full Data Science Case Studies?
- What Are the Skills Required for Data Scientists?
- Frequently Asked Questions (FAQs)
Data science has become popular in the last few years due to its successful application in business decision making. Data scientists use data science techniques to solve challenging real-world problems in healthcare, agriculture, manufacturing, automotive, and many other fields. To keep up, a data enthusiast needs to stay current with the latest technological advancements in AI, and an excellent way to do that is by reading industry case studies.
Let’s discuss some case studies that contain detailed and systematic data analysis of people, objects, or entities, focusing on multiple factors present in the dataset. Aspiring and practising data scientists can use comparable experiences to learn more about a sector, discover alternative ways of thinking, or find methods to improve their organization. Almost every industry uses data science in some way: insurance data scientists may use it to spot fraudulent claims, automotive data scientists to improve self-driving cars, and e-commerce data scientists to personalize the experience for their consumers. The possibilities are unlimited and largely unexplored.
We will take a look at the top eight data science case studies in this article so you can understand how businesses from many sectors have benefitted from data science to boost productivity, revenues, and more. Read on to explore more, or use the following links to go straight to the case study of your choice.
- Airbnb focuses on growth by analyzing customer voice using data science
- Qantas uses predictive analytics to mitigate losses
- Novo Nordisk is Driving innovation with NLP
- AstraZeneca harnesses data for innovation in medicine
- Johnson and Johnson uses data science to fight the Pandemic
- Amazon uses data science to personalize shopping experiences and improve customer satisfaction
Supply chain management
- UPS optimizes supply chain with big data analytics
- IMD leveraged data science to achieve a record 1.2m evacuation before cyclone 'Fani'
- Netflix uses data science to personalize the content and improve recommendations
- Spotify uses big data to deliver a rich user experience for online music streaming
Banking and Finance
- HDFC utilizes Big Data Analytics to increase income and enhance the banking experience
8 Data Science Case Studies
1. Data Science in Hospitality Industry
In the hospitality sector, data analytics assists hotels with better pricing strategies, customer analysis, brand marketing, tracking market trends, and much more.
Airbnb focuses on growth by analyzing customer voice using data science.
A famous example in this sector is the unicorn 'Airbnb', a startup that focused on data science early to grow and adapt to the market faster. The company witnessed 43,000 percent hypergrowth in as little as five years using data science. Airbnb applied data science techniques to process its data, translate it into a better understanding of the voice of the customer, and use the insights for decision making, and then scaled the approach to cover all aspects of the organization. Airbnb uses statistics to analyze and aggregate individual experiences to establish trends throughout the community, and these trends inform the business choices that help the company grow further.
Travel industry and data science
Predictive analytics benefits many areas of the travel industry. Travel companies can use data-science-powered recommendation engines to achieve higher personalization and improved user interactions, and they can cross-sell by recommending relevant products to drive sales and increase revenue. Data science is also employed in analyzing social media posts for sentiment analysis, bringing invaluable travel-related insights. Whether these views are positive, negative, or neutral helps agencies understand user demographics, the experiences their target audiences expect, and so on. These insights are essential for developing aggressive pricing strategies that draw customers and for better customizing travel packages and allied services. Travel agencies like Expedia and Booking.com use predictive analytics for personalized recommendations, product development, and effective marketing of their products. Airlines benefit from the same approach: they frequently face losses due to flight cancellations, disruptions, and delays, and data science helps them identify patterns and predict possible bottlenecks, effectively mitigating losses and improving the overall customer travel experience.
How Qantas uses predictive analytics to mitigate losses
Qantas , one of Australia's largest airlines, leverages data science to reduce losses caused by flight delays, disruptions, and cancellations. It also uses data science to provide a better travel experience for its customers by reducing the number and length of delays caused by heavy air traffic, weather conditions, or operational difficulties. Back in 2016, when heavy storms struck Australia's east coast, only 15 out of 436 Qantas flights were cancelled thanks to its predictive analytics-based system, against competitor Virgin Australia, which saw 70 of its 320 flights cancelled.
2. Data Science in Healthcare
The healthcare sector is benefiting immensely from advancements in AI. Data science, especially in medical imaging, has been helping healthcare professionals come up with better diagnoses and effective treatments for patients. Similarly, several advanced healthcare analytics tools have been developed to generate clinical insights for improving patient care. These tools also assist in defining personalized medications for patients, reducing operating costs for clinics and hospitals. Apart from medical imaging and computer vision, Natural Language Processing (NLP) is frequently used in the healthcare domain to study published textual research data.
Driving innovation with NLP: Novo Nordisk
Novo Nordisk uses the Linguamatics NLP platform to text-mine internal and external data sources, including scientific abstracts, patents, grants, news, tech transfer offices from universities worldwide, and more. These NLP queries run across sources for the key therapeutic areas of interest to the Novo Nordisk R&D community. Several NLP algorithms have been developed for topics such as safety, efficacy, randomized controlled trials, patient populations, dosing, and devices. To extend the tools' success to real-world data, Novo Nordisk employs a data pipeline and uses interactive dashboards and cloud services to visualize the standardized, structured information from the queries, exploring commercial effectiveness, market situations, potential, and gaps in product documentation. Through data science, they are able to automate the generation of insights, save time, and provide better evidence for decision making.
How AstraZeneca harnesses data for innovation in medicine
AstraZeneca is a globally known biotech company that leverages data and AI technology to discover and deliver new, effective medicines faster. Within its R&D teams, AI is used to decode big data for a better understanding of diseases like cancer, respiratory disease, and heart, kidney, and metabolic diseases so they can be treated effectively. Using data science, they can identify new targets for innovative medications. In 2021, collaborating with BenevolentAI, they selected their first two AI-generated drug targets, in Chronic Kidney Disease and Idiopathic Pulmonary Fibrosis.
Data science is also helping AstraZeneca design better clinical trials, achieve personalized medication strategies, and innovate the process of developing new medicines. Their Center for Genomics Research aims to use data science and AI to analyze around two million genomes by 2026. For imaging, they are training AI systems to check sample images for disease markers and biomarkers of effective medicines. This approach helps them analyze samples accurately and more effortlessly, and it can cut analysis time by around 30%.
AstraZeneca also utilizes AI and machine learning to optimize the process at different stages and minimize the overall time for the clinical trials by analyzing the clinical trial data. Summing up, they use data science to design smarter clinical trials, develop innovative medicines, improve drug development and patient care strategies, and many more.
Wearable technology is a multi-billion-dollar industry. With an increasing awareness about fitness and nutrition, more individuals now prefer using fitness wearables to track their routines and lifestyle choices.
Fitness wearables are convenient to use, assist users in tracking their health, and encourage them to lead a healthier lifestyle. The medical devices in this domain are beneficial since they help monitor the patient's condition and communicate in an emergency situation. The regularly used fitness trackers and smartwatches from renowned companies like Garmin, Apple, FitBit, etc., continuously collect physiological data of the individuals wearing them. These wearable providers offer user-friendly dashboards to their customers for analyzing and tracking progress in their fitness journey.
3. Covid 19 and Data Science
In the past two years of the pandemic, the power of data science has been more evident than ever. Pharmaceutical companies across the globe were able to develop COVID-19 vaccines quickly by analyzing data to understand the trends and patterns of the outbreak. Data science made it possible to track the virus in real time, predict case patterns, devise effective strategies to fight the pandemic, and much more.
How Johnson and Johnson uses data science to fight the Pandemic
The data science team at Johnson and Johnson leverages real-time data to track the spread of the virus. They built a global surveillance dashboard (granulated to county level) that helps them track the Pandemic's progress, predict potential hotspots of the virus, and narrow down the likely place where they should test its investigational COVID-19 vaccine candidate. The team works with in-country experts to determine whether official numbers are accurate and find the most valid information about case numbers, hospitalizations, mortality and testing rates, social compliance, and local policies to populate this dashboard. The team also studies the data to build models that help the company identify groups of individuals at risk of getting affected by the virus and explore effective treatments to improve patient outcomes.
4. Data Science in Ecommerce
In the e-commerce sector , big data analytics can assist in customer analysis, reduce operational costs, forecast trends for better sales, provide personalized shopping experiences to customers, and many more.
Amazon uses data science to personalize shopping experiences and improve customer satisfaction. Amazon is a globally leading eCommerce platform that offers a wide range of online shopping services. As a result, Amazon generates a massive amount of data that can be leveraged to understand consumer behavior and generate insights into competitors' strategies. Amazon uses this data to recommend products and services to its users, persuading consumers to buy and generating additional sales; recommendations of this kind account for about 35% of Amazon's yearly revenue. Additionally, Amazon collects consumer data for faster order tracking and better deliveries.
Similarly, Amazon's virtual assistant, Alexa, can converse in different languages and uses speakers and a camera to interact with users. Amazon uses the audio commands from users to improve Alexa and deliver a better user experience.
5. Data Science in Supply Chain Management
Predictive analytics and big data are driving innovation in the supply chain domain. They offer greater visibility into company operations, reduce costs and overheads, enable demand forecasting, predictive maintenance, product pricing, route optimization, and fleet management, minimize supply chain interruptions, and drive better performance overall.
Optimizing supply chain with big data analytics: UPS
UPS is a renowned package delivery and supply chain management company. Thousands of packages are delivered every day; on average, a UPS driver makes about 100 deliveries each business day, so on-time and safe package delivery is crucial to UPS's success. Hence, UPS built an optimized navigation tool, 'ORION' (On-Road Integrated Optimization and Navigation), which uses highly advanced big data processing algorithms to give drivers routes optimized for fuel, distance, and time. UPS applies supply chain data analysis to every aspect of its shipping process: data about packages and deliveries are captured through radars and sensors, and deliveries and routes are optimized using big data systems. Overall, this approach has helped UPS save 1.6 million gallons of gasoline in transportation every year, significantly reducing delivery costs.
6. Data Science in Meteorology
Weather prediction is an interesting application of data science . Businesses like aviation, agriculture and farming, construction, consumer goods, sporting events, and many more are dependent on climatic conditions. The success of these businesses is closely tied to the weather, as decisions are made after considering the weather predictions from the meteorological department.
Besides, weather forecasts are extremely helpful for individuals to manage their allergic conditions. One crucial application of weather forecasting is natural disaster prediction and risk management.
Weather forecasts begin with collecting a large amount of data on current environmental conditions (wind speed, temperature, humidity, and cloud cover captured at a specific location and time) using sensors on IoT (Internet of Things) devices and satellite imagery. This gathered data is then analyzed using an understanding of atmospheric processes, and machine learning models are built to predict upcoming weather conditions, such as rainfall or snow. Although data science cannot prevent natural calamities like floods, hurricanes, or forest fires, tracking these phenomena well ahead of their arrival is beneficial: such predictions give governments sufficient time to take the steps and measures needed to ensure the safety of the population.
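A minimal version of the "learn from past readings, extrapolate forward" step can be shown with an ordinary least-squares trend fit in plain Python. Real forecasting models are vastly more complex, so treat this purely as an illustration of the pipeline's final step:

```python
def linear_trend_forecast(series):
    """Fit y = a + b*t by ordinary least squares and extrapolate one step."""
    n = len(series)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(series) / n
    b = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, series)) \
        / sum((t - t_mean) ** 2 for t in ts)
    a = y_mean - b * t_mean
    return a + b * n  # prediction for the next time step

# Hourly temperatures (degrees C) warming steadily through the morning:
temps = [18.0, 18.5, 19.0, 19.5, 20.0]
print(linear_trend_forecast(temps))  # the series is an exact line, so 20.5
```

Production models add many more inputs (pressure, humidity, satellite imagery) and non-linear learners, but the fit-then-extrapolate structure is the same.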
IMD leveraged data science to achieve a record 1.2m evacuation before cyclone 'Fani'
Meteorological data scientists rely on satellite images to make short-term forecasts, decide whether a forecast is correct, and validate models. Machine learning is also used for pattern matching here: it can forecast future weather conditions if it recognizes a past pattern, and with dependable equipment, sensor data helps produce local forecasts from actual weather models. IMD (the India Meteorological Department) used satellite pictures to study the low-pressure zones forming off the Odisha coast (India). In April 2019, thirteen days before cyclone 'Fani' reached the area, IMD warned that a massive storm was underway, and the authorities began preparing safety measures.
It was one of the most powerful cyclones to strike India in the last 20 years, and a record 1.2 million people were evacuated in less than 48 hours, thanks to the power of data science.
7. Data Science in Entertainment Industry
Due to the Pandemic, demand for OTT (Over-the-top) media platforms has grown significantly. People prefer watching movies and web series or listening to the music of their choice at leisure in the convenience of their homes. This sudden growth in demand has given rise to stiff competition. Every platform now uses data analytics in different capacities to provide better-personalized recommendations to its subscribers and improve user experience.
How Netflix uses data science to personalize the content and improve recommendations
Netflix is an extremely popular internet television platform that streams content in several languages and caters to varied audiences. In 2006, shortly after entering the media streaming market, Netflix offered a $1 million prize to any team that could improve the prediction accuracy of its existing 'Cinematch' platform by 10%. The approach succeeded: at the end of the competition, the solution developed by the BellKor team increased prediction accuracy by 10.06%. Over 200 work hours and an ensemble of 107 algorithms produced this result, and the winning algorithms became part of the Netflix recommendation system.
Netflix also employs Ranking Algorithms to generate personalized recommendations of movies and TV Shows appealing to its users.
Spotify uses big data to deliver a rich user experience for online music streaming
Personalized online music streaming is another area where data science is used. Spotify is a well-known on-demand music service launched in 2008 that has effectively leveraged big data to create personalized experiences for each user. A huge platform with more than 24 million subscribers and a catalog of nearly 20 million songs, Spotify uses this big data and various algorithms to train machine learning models that deliver personalized content. Its 'Discover Weekly' feature generates a personalized playlist of fresh, unheard songs matching the user's taste every week, and each December the 'Wrapped' feature gives users an overview of their favorite or most frequently played songs of the year. Spotify also leverages user data to run targeted ads and grow its business. In short, Spotify combines its user data, which is big data, with some external data to deliver a high-quality user experience.
8. Data Science in Banking and Finance
Data science is extremely valuable in the banking and finance industry . It powers several high-priority applications: credit risk modeling (estimating the likelihood that a loan will be repaid), fraud detection (spotting malicious activity or irregularities in transaction patterns using machine learning), customer lifetime value (predicting bank performance based on existing and potential customers), and customer segmentation (profiling customers by behavior and characteristics to personalize offers and services). Finally, data science is also used in real-time predictive analytics (computational techniques to predict future events).
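The fraud detection idea, flagging transactions that deviate sharply from a customer's normal pattern, can be sketched with a simple z-score rule. The threshold and data here are illustrative; production systems use far richer features and models:

```python
from statistics import mean, stdev

def flag_anomalies(amounts, threshold=2.5):
    """Flag transactions whose z-score against the history exceeds the threshold."""
    mu, sigma = mean(amounts), stdev(amounts)
    return [a for a in amounts if abs(a - mu) / sigma > threshold]

# Nine ordinary card transactions and one suspicious outlier:
history = [42.0, 55.0, 48.0, 51.0, 47.0, 53.0, 49.0, 50.0, 46.0, 4900.0]
print(flag_anomalies(history))  # only the 4900.0 transaction is flagged
```

Robust statistics such as the median absolute deviation are often preferred over the mean and standard deviation here, since a single extreme transaction distorts both.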
How HDFC utilizes Big Data Analytics to increase revenues and enhance the banking experience
One of the major private banks in India, HDFC Bank , was an early adopter of AI. It started with big data analytics in 2004, intending to grow its revenue and understand its customers and markets better than its competitors. They were trendsetters back then, setting up an enterprise data warehouse so the bank could track the differentiation to be given to customers based on their relationship value with HDFC Bank. Data science and analytics have been crucial in helping HDFC Bank segment its customers and offer customized personal or commercial banking services. Its analytics engine and SaaS tools have been assisting the bank in cross-selling relevant offers to its customers; beyond regular fraud prevention, they help track customer credit histories and are behind the speedy loan approvals the bank offers.
Where to Find Full Data Science Case Studies?
Data science is a highly evolving domain with many practical applications and a huge open community. Hence, the best way to keep updated with the latest trends in this domain is by reading case studies and technical articles. Usually, companies share their success stories of how data science helped them achieve their goals to showcase their potential and benefit the greater good. Such case studies are available online on the respective company websites and dedicated technology forums like Towards Data Science or Medium.
Additionally, we can get some practical examples in recently published research papers and textbooks in data science.
What Are the Skills Required for Data Scientists?
Data scientists play an important role in the data science process, as they work on the data end to end. To work on a data science case study, a data scientist needs several skills: a good grasp of the fundamentals of data science, deep knowledge of statistics, excellent programming skills in Python or R, experience with data manipulation and analysis, the ability to create compelling data visualizations, and good knowledge of big data, machine learning, and deep learning concepts for model building and deployment. Apart from these technical skills, data scientists also need to be good storytellers and should have an analytical mind with strong communication skills.
These were some interesting data science case studies across different industries. There are many more domains where data science has exciting applications, like in the Education domain, where data can be utilized to monitor student and instructor performance, develop an innovative curriculum that is in sync with the industry expectations, etc.
Almost all companies looking to leverage the power of big data begin with a SWOT analysis to narrow down the problems they intend to solve with data science. They then need to assess their competitors to develop relevant data science tools and strategies to address these challenges. This approach allows them to differentiate themselves from their competitors and offer something unique to their customers.
With data science, the companies have become smarter and more data-driven to bring about tremendous growth. Moreover, data science has made these organizations more sustainable. Thus, the utility of data science in several sectors is clearly visible, a lot is left to be explored, and more is yet to come. Nonetheless, data science will continue to boost the performance of organizations in this age of big data.
Devashree holds an M.Eng degree in Information Technology from Germany and a background in Data Science. She likes working with statistics and discovering hidden insights in varied datasets to create stunning dashboards. She enjoys sharing her knowledge in AI by writing technical articles on various technological platforms. She loves traveling, reading fiction, solving Sudoku puzzles, and participating in coding competitions in her leisure time.
Frequently Asked Questions (FAQs)
A case study in data science requires a systematic and organized approach for solving the problem. Generally, four main steps are needed to tackle every data science case study:
- Define the problem statement and the strategy to solve it
- Gather and pre-process the data, making relevant assumptions
- Select tools and appropriate algorithms to build machine learning/deep learning models
- Make predictions, accept the solutions based on evaluation metrics, and improve the model if necessary
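The four steps above can be sketched as a skeleton pipeline in Python. Every function body below is a toy placeholder (a single-feature churn-threshold "model" on invented data) meant to be swapped for real logic:

```python
# Skeleton of the four case-study steps; each body is a toy placeholder.
def define_problem():
    return "predict customer churn"

def gather_data(problem):
    # Stand-in for loading and cleaning a real dataset:
    return [(25, 0), (61, 1), (47, 0), (73, 1)]  # (days inactive, churned?)

def build_model(data):
    # Trivially "learn" a threshold on days inactive:
    churned = [d for d, y in data if y == 1]
    stayed = [d for d, y in data if y == 0]
    return (min(churned) + max(stayed)) / 2

def evaluate(model, data):
    preds = [1 if d > model else 0 for d, _ in data]
    return sum(p == y for p, (_, y) in zip(preds, data)) / len(data)

data = gather_data(define_problem())
threshold = build_model(data)
print(threshold, evaluate(threshold, data))  # threshold 54.0, accuracy 1.0
```

In a real case study each step expands considerably: evaluation would use a held-out test set, and the loop would repeat until the metric is acceptable.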
Getting data for a case study starts with a reasonable understanding of the problem. This gives us clarity about what we expect the dataset to include. Finding relevant data for a case study requires some effort. Although it is possible to collect relevant data using traditional techniques like surveys and questionnaires, we can also find good quality data sets online on different platforms like Kaggle, UCI Machine Learning repository, Azure open data sets, Government open datasets, Google Public Datasets, Data World and so on.
Data science projects involve multiple steps to process the data and bring valuable insights. A data science project includes different steps - defining the problem statement, gathering relevant data required to solve the problem, data pre-processing, data exploration & data analysis, algorithm selection, model building, model prediction, model optimization, and communicating the results through dashboards and reports.
Feb 21, 2021
Data Science Case Studies: Solved and Explained
Data science case studies solved and explained using Python.
Solving a Data Science case study means analyzing and solving a problem statement intensively. Solving case studies will help you show unique and amazing data science use cases in your portfolio. In this article, I’m going to introduce you to 3 data science case studies solved and explained using Python.
Data Science Case Studies
If you’ve learned data science by taking a course or certification program, you’re still not guaranteed to find a job easily. The most important part of a data science interview is showing how you can apply your skills to real use cases. Below are 3 data science case studies that will help you understand how to analyze and solve a problem. All of the data science case studies mentioned below are solved and explained using Python.
Case Study 1: Text Emotions Detection
If you are interested in natural language processing, this use case is for you. The idea is to train a machine learning model to generate emojis based on input text. This model can then be used in training AI chatbots.
Use Case: A human can express emotions in many forms, such as facial expressions, gestures, speech, and text. Detecting text emotions is a content-based classification problem. Detecting a person’s emotions is a difficult task in general, and detecting them from text alone is even harder.
Recognizing emotion in text plays an important role in applications such as chatbots, customer support forums, customer reviews, etc. Your task is to train a machine learning model that identifies the emotion of a text and presents the most relevant emoji for it.
Solution: Machine Learning Project on Text Emotions Detection .
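As a toy stand-in for a trained classifier, a keyword-voting baseline shows the shape of the task. The lexicon and emoji mapping below are invented for the example:

```python
# Toy keyword lookup standing in for a trained text-emotion classifier;
# the lexicon and emoji mapping are invented for the example.
EMOTION_WORDS = {
    "happy": "joy", "great": "joy", "love": "joy",
    "sad": "sadness", "miss": "sadness",
    "angry": "anger", "furious": "anger",
}
EMOJI = {"joy": "\U0001F600", "sadness": "\U0001F622", "anger": "\U0001F620"}

def detect_emotion(text):
    """Majority vote over known emotion words; 'neutral' if none appear."""
    votes = [EMOTION_WORDS[w] for w in text.lower().split() if w in EMOTION_WORDS]
    return max(set(votes), key=votes.count) if votes else "neutral"

def text_to_emoji(text):
    return EMOJI.get(detect_emotion(text), "\U0001F610")

print(text_to_emoji("I love this, it is great"))  # joy emoji
print(text_to_emoji("I miss you so much"))        # sadness emoji
```

A real solution replaces the keyword table with a model learned from labeled examples, so it can handle negation, sarcasm, and words outside a fixed list.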
Case Study 2: Hotel Recommendation System
A hotel recommendation system typically works on collaborative filtering, making recommendations based on ratings given by other customers in the same category as the user looking for a product.
Use Case: We all plan trips, and the first thing to do when planning a trip is to find a hotel. Many websites recommend the best hotel for a trip. A hotel recommendation system aims to predict which hotel a user is most likely to choose from among all available hotels, helping the user book the best one. We can do this using customer reviews.
For example, if you are going on a business trip, the hotel recommendation system should show you the hotels that other customers have rated best for business travel. Our approach, therefore, is to build a recommendation system based on customer reviews and ratings, using the ratings and reviews given by customers who belong to the same category as the user.
Solution: Data Science Project on Hotel Recommendation System .
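The collaborative filtering idea, recommending the unrated hotel best liked by the most similar reviewer, can be sketched in plain Python. The ratings data is invented for the example:

```python
from math import sqrt

# Hypothetical star ratings left by business travellers:
ratings = {
    "alice": {"Grand Inn": 5, "City Stay": 2, "Airport Lodge": 4},
    "bob":   {"Grand Inn": 4, "City Stay": 1},
    "carol": {"City Stay": 5, "Airport Lodge": 2},
}

def similarity(u, v):
    """Cosine similarity over hotels both users rated."""
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:  # too little overlap to trust
        return 0.0
    num = sum(ratings[u][h] * ratings[v][h] for h in common)
    den = sqrt(sum(ratings[u][h] ** 2 for h in common)) * \
          sqrt(sum(ratings[v][h] ** 2 for h in common))
    return num / den

def recommend(user):
    """Suggest the hotel the most similar user rated highest among unseen ones."""
    peers = sorted((similarity(user, v), v) for v in ratings if v != user)
    best_peer = peers[-1][1]
    unseen = {h: r for h, r in ratings[best_peer].items() if h not in ratings[user]}
    return max(unseen, key=unseen.get) if unseen else None

print(recommend("bob"))  # alice rates most like bob, so her Airport Lodge wins
```

Real systems average over many similar users (weighted by similarity) rather than trusting a single nearest neighbor, but the similarity-then-rank structure is the same.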
Case Study 3: Customer Personality Analysis
Customer analysis is one of the most important tasks for a data scientist working at a product-based company. So if you want to join a product-based company, this data science case study is for you.
Use Case: Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviours and concerns of different types of customers.
You have to do an analysis that helps a business modify its product based on its target customers from different customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market it only to that segment.
Solution: Data Science Project on Customer Personality Analysis .
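One common way to find such segments is clustering. A tiny one-dimensional k-means (k=2) on annual spend illustrates the idea; the spend figures are invented for the example:

```python
def two_means(values, iters=10):
    """Tiny 1-D k-means (k=2): split values into a low and a high cluster."""
    lo, hi = min(values), max(values)        # initial cluster centers
    for _ in range(iters):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = sum(low) / len(low), sum(high) / len(high)  # recompute centers
    return low, high

annual_spend = [120, 150, 130, 900, 950, 880]  # dollars per customer
budget, premium = two_means(annual_spend)
print(premium)  # the high-spend segment to market the new product to first
```

A real customer personality analysis clusters on many features at once (income, recency, purchase categories, household size) with a library implementation of k-means, but the assign-then-update loop is exactly this.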
These three data science case studies are based on real-world problems. The first, Text Emotions Detection, is based entirely on natural language processing, and the model you train can be used in training an AI chatbot. The second, the Hotel Recommendation System, also involves NLP, but here you will learn how to generate recommendations using collaborative filtering. The last, Customer Personality Analysis, is for those who want to focus on the analysis side.
All these data science case studies are solved using Python, here are the resources where you will find these use cases solved and explained:
- Text Emotions Detection
- Hotel Recommendation System
- Customer Personality Analysis
I hope you liked this article on data science case studies solved and explained using the Python programming language. Feel free to ask your valuable questions in the comments section below.
Data in Action: 7 Data Science Case Studies Worth Reading
In this article
- Why Are Data Science Case Studies Important?
- 7 Top Data Science Case Studies
- Data Science Case Studies FAQs
The field of data science is rapidly growing and evolving. And in the next decade, new ways of automating data collection processes and deriving insights from data will boost workflow efficiencies like never before.
There’s no better way to understand the changing nature of data science than by examining some real-life examples. Whether you’re thinking about a career in data science or you’re a practicing data scientist looking to see where the field is heading, we’ve got you covered. Below, we’ll explore seven case studies from the world of data science, all of which reveal the field’s potential for both business and research.
When learning about data science, it’s important to have case studies to learn from and use as examples. Case studies are helpful tools when you want to illustrate a specific point or concept. They can show how a data science project works in real life, or serve as an example of what to avoid.
Data science case studies help students and entry-level data scientists understand how professionals have approached previous challenges. They give you ideas for how to approach your own challenges and, ultimately, aim for innovation.
Here are 7 top case studies that show how companies and organizations have approached common challenges with some seriously inventive data science solutions:
Data science is a powerful tool that can help us better understand and predict geoscience phenomena. By collecting and analyzing data, we can make data-driven decisions about how to respond to natural disasters, such as hurricanes or earthquakes.
We can also use data science to collect information about the earth’s climate, which helps us determine how quickly it will change over time and how we might mitigate some of those changes.
Tsunami Early Warning Method From Stanford University
This early warning system can provide communities with valuable information to help them prepare for and mitigate the impacts of a potential tsunami. Even so, the method has its drawbacks. Tides, currents, temperature, and salinity changes can alter the data generated by the sensors, which can result in false positives.
Data scientists are improving the warning system’s accuracy by using machine learning algorithms to detect and alert communities of possible tsunamis. This system also has implications for climate change research and disaster relief. By using data science to better understand climate change’s effects, scientists can develop strategies to reduce the risk of natural disasters and help communities better prepare for and respond to them.
Data science and machine learning technologies are revolutionizing the medical industry. With the help of these tools, doctors can detect even the slightest changes in a patient’s health indicators and predict potential diseases before they become more severe.
Google’s LYNA for Metastatic Breast Cancer Detection
Pathology is essential for diagnosing cancer, and its microscopic examination of tumors is the gold standard. This evaluation is critical in determining prognosis and treatment decisions. One key aspect of this process is detecting cancer that has metastasized from the primary site to nearby lymph nodes.
The accuracy and timeliness of identifying nodal metastases in breast cancer majorly impact treatment decisions, including radiation therapy, chemotherapy, and the potential surgical removal of additional lymph nodes.
Unfortunately, studies have shown that around 25% of metastatic lymph node staging classifications are revised upon a second pathologic review. Additionally, the detection sensitivity of small metastases on individual slides can be as low as 38% when reviewed under time constraints.
This is where data science comes into play. Google’s LYmph Node Assistant (LYNA) achieved significantly higher cancer detection rates than previously reported. LYNA was applied to pathology slides from the Camelyon Challenge and an independent dataset provided by the Naval Medical Center San Diego. This additional dataset improved the representation of the diversity of slides and artifacts seen in routine clinical practice.
The LYmph Node Assistant was incredibly successful in both datasets, correctly distinguishing slides with metastatic cancer from those without cancer 99% of the time. It could even identify areas of concern within each slide that were too small to be detected by pathologists. This could help pathologists greatly, as LYNA highlights these areas of concern for them to review, letting them reach a final diagnosis faster than ever.
Data science has become a powerful tool for logistics companies to optimize their operations. Namely, it can be used to determine the most efficient delivery routes, manage fuel usage at the most cost-effective times of day, and make more accurate predictions about supply and demand. And all of these factors help logistics companies save time and money.
UPS’s Network Planning Tools (NPT)
Network Planning Tools (NPT) is an application developed by UPS to help navigate and efficiently move packages worldwide, regardless of weather conditions. NPT utilizes machine learning to process and analyze large amounts of real-world data, allowing UPS to view all its facilities, divert packages around storms, or move a large shipment quickly.
UPS has developed NPT to comprehensively view the organization’s operations. These tools are designed to give an overview of activities at UPS facilities worldwide and package volume and distribution across its pickup and delivery network. It also enables UPS to get detailed information about shipments in transit, such as their weight, volume, and delivery deadlines.
By utilizing NPT, UPS can take advantage of various features such as routing shipments to the facilities with the most capacity, moving volume to lower-cost transportation modes, preemptively rerouting packages to alternative lanes to avoid unexpected costs and delays due to storms or other events, and creating forecasts about package volume and weight based on detailed analysis of historical data.
Like the best data science tools, NPT empowers humans to make better decisions. When a package reroutes, the app notifies an engineer in the new location about the revised plan. The engineer then evaluates various options and takes action, potentially rerouting the package based on new information. This human-driven decision gets updated in the app, helping it learn from human oversight and get smarter about routing plans. NPT also checks the engineer’s choice to ensure that it has the desired result. This allows NPT’s algorithm to improve and make smarter decisions when routing packages.
Data science has revolutionized the way businesses detect fraudulent activities. It plays a crucial role in collecting, summarizing, and analyzing customer data to identify fraudulent activity accurately.
J.P. Morgan’s Use of NLP and AI
J. P. Morgan and APG Asset Management have recently studied how data science can improve how portfolio managers and research analysts utilize information. By using data from the European Central Bank, they have been able to pinpoint multiple use cases for this technology in developing new user interfaces. Individual company employees collaborated through a single data repository to observe each other’s progress and build a prototype AI application.
This project may have overall implications for the banking industry, not just for J.P. Morgan. For example, an AI-enabled dashboard could display the most important market trends for the manager to show to their client, resulting in higher customer satisfaction as they can better understand their investments.
Contract Intelligence (COiN)
J.P. Morgan also uses machine learning and artificial intelligence in a software program called COiN (Contract Intelligence). The tool reviewed thousands of the bank’s credit contracts, utilizing image recognition to recognize patterns in these agreements. The technology employs unsupervised learning, enabling the algorithm to classify clauses into one of about 150 different “attributes” of credit contracts. For example, the algorithm can note specific patterns based on a clause’s wording or its location in the agreement.
JPMorgan’s investment in data science is motivated by the benefits of cost savings. The software can review contracts in seconds, a process that previously took lawyers over 360,000 man-hours. The algorithm not only saves time and money; it also significantly improves accuracy compared to human lawyers. Thus, the bank’s investment is not only about cutting costs but also about increasing quality.
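The public reporting on COiN describes the approach only at a high level, but the core idea of grouping contract clauses without labels can be sketched as follows. The clause texts, the TF-IDF features, and the choice of k-means are illustrative assumptions, not J.P. Morgan's actual pipeline.

```python
# Illustrative sketch only -- NOT J.P. Morgan's proprietary COiN pipeline.
# It shows the general idea of grouping contract clauses without labels,
# using TF-IDF text features and k-means clustering (hypothetical clauses).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

clauses = [
    "The borrower shall repay the principal in monthly installments.",
    "Repayment of principal is due in equal monthly installments.",
    "This agreement is governed by the laws of the State of New York.",
    "The laws of the State of Delaware govern this agreement.",
    "The lender may declare default upon any missed payment.",
    "A missed payment entitles the lender to declare an event of default.",
]

features = TfidfVectorizer(stop_words="english").fit_transform(clauses)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Clauses with similar wording should land in the same cluster ("attribute").
print(labels)
```

In production such a system would cluster many thousands of clauses into far more attributes and combine text features with the image-recognition output described above.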
Data science in e-commerce is revolutionizing the way companies do business, as it allows them to collect and analyze valuable information. Innovative technologies are also being used to boost profits with personalized recommended lists, dynamic pricing, and the ability to predict customers’ buying behavior. This gives e-commerce businesses an edge over their competitors and helps them stay ahead of the curve in an increasingly competitive market.
Personalized recommendations are an important part of the modern e-commerce experience. A study conducted by Epsilon shows that 80 percent of consumers are more likely to purchase from a brand that delivers personalized experiences. Recommendations can be used to show products that customers have shown interest in, or to suggest products similar to those they’ve viewed in the past.
Amazon’s Recommendation Engine
The Amazon recommendation engine is one of the most well-known examples of this type of technology. It uses machine learning to match users with items similar to those they’ve previously interacted with. It does this by analyzing what types of items a user purchases, browses, watches, or listens to. It then matches other users with similar interactions with similar items and recommends them based on their preferences.
Because recommendations are based on a user’s past interactions and interests, they are very likely to increase Amazon’s bottom line: people find the recommended products interesting and are more likely to buy them. And that just scratches the surface of the power of data science in e-commerce. McKinsey noted that 35% of all items purchased from Amazon and 75% of what people watch on Netflix come from product recommendation engines.
Agriculture has always been an unpredictable business. But data science is changing how farmers manage their businesses by providing access to information they didn’t have before. It helps them optimize their operations and make more money by allowing them to cut back on waste and increase productivity.
Farmers can now use technology to make better decisions about what crops should be grown in certain areas, which animals are most profitable for them to raise, and how much water is needed for certain plants at specific times of the year.
Watson Decision Platform for Agriculture
IBM is working on ways to help improve farm productivity through artificial intelligence and machine learning. They are developing an AI system called the Watson Decision Platform for Agriculture. It will provide farmers with data about their crops and soil conditions to make more informed decisions about how to grow their crops.
This machine learning model can be applied to any location, regardless of weather or growing conditions, and can be used to determine yields for past growing seasons, which is critical for validating agricultural insurance claims and risk, optimizing supply-and-demand chain logistics, and even predicting commodity prices.
In addition, the API uses weather forecast details to predict the risk of corn pests and disease outbreaks as well as the likelihood of spore transport. Farmers can use this information to reduce pesticide usage and take preventive or curative measures to avoid unexpected yield loss.
Data science can be used to improve transportation operations in several ways:
- Delivery path optimization, which reduces freight costs by evaluating various delivery paths and determining the most cost-effective one
- Dynamic price matching, to match supply to demand and maximize profits
- Warehouse optimization, to maximize the efficiency of warehouse operations
- Demand forecasting, to anticipate customer needs and prepare for them in advance
Uber’s Gairos for Dynamic Pricing
Uber, perhaps the most disruptive company in the transportation industry, has leveraged data science with the development of Gairos, a real-time data processing, storage, and querying platform. This platform facilitates streamlined and efficient data exploration at scale, enabling teams to gain insights into the Uber Marketplace.
Through the use of Gairos, Uber can gain various insights that are crucial to its operations. For example, dynamic pricing leverages demand and supply data to calculate surge multipliers at specific locations and times. Driver movement utilizes real-time demand and supply data to suggest surge and carbon-friendly routes for drivers.
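Uber's actual surge formula is not public, so the following is only a toy illustration of the principle: the multiplier grows as ride requests outstrip available drivers in an area and time window, up to a cap.

```python
# Toy demand-based surge pricing (illustrative only -- not Uber's Gairos
# pipeline or its real formula).
def surge_multiplier(requests, available_drivers, cap=3.0):
    """Scale the base fare by the demand/supply ratio, floored at 1 and capped."""
    if available_drivers == 0:
        return cap
    ratio = requests / available_drivers
    return round(min(max(ratio, 1.0), cap), 2)

print(surge_multiplier(50, 50))   # balanced market: no surge
print(surge_multiplier(90, 60))   # demand exceeds supply
print(surge_multiplier(400, 40))  # extreme demand, hits the cap
```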
We’ve got the answers to your most frequently asked questions:
How Can You Use a Case Study About Data Science?
Case studies can be an invaluable resource for students seeking to understand the data science field. They provide insights into the real-world application of data science, as well as the skills required, such as programming and statistical modeling.
How Do You Write a Data Science Case Study?
First, you should identify the problem you want to solve. This can be as simple as “I want to analyze my customer base” or more complex, such as “I want to predict which customers will respond best to this particular ad campaign.”
You’ll need to write a hypothesis that explains how your proposed solution will work. This should be clear, concise, and easy to understand.
Then you’ll need to describe your data set: what exactly are you working with? How many records are there? What kind of information do they contain? Are there any missing values?
Next, describe how you processed your data before applying machine learning methods. Did you use any preprocessing steps? Did you normalize your data? Did you remove outliers? These are all crucial steps you must include in your case study if you want others who read it later to understand what happened during processing.
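As a concrete illustration of those preprocessing steps, here is a small pandas sketch covering imputation, outlier removal, and normalization. The data, the z-score cutoff, and min-max scaling are illustrative choices for a case study write-up, not universal rules.

```python
# Toy preprocessing pipeline: impute missing values, drop outliers, normalize.
import pandas as pd

df = pd.DataFrame({"age":   [25, 31, 28, None, 27, 95],
                   "spend": [120.0, 80.0, 95.0, 110.0, None, 3000.0]})

df = df.fillna(df.median(numeric_only=True))  # impute missing values

# Drop rows more than 2 standard deviations from the mean (a loose cutoff
# chosen for this tiny sample; 3 is more common on real data).
z = (df - df.mean()) / df.std()
df = df[(z.abs() <= 2).all(axis=1)]

# Min-max normalize each column to [0, 1].
df = (df - df.min()) / (df.max() - df.min())
print(df.round(2))
```

Whatever choices you make, document them: a reader of your case study should be able to reproduce the cleaned dataset exactly.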
How Do Data Science Case Studies Impact the Industry?
Data science case studies highlight the work done by practitioners, and they can be used to educate new and existing data scientists on how to approach problems.
Case studies also help companies determine which type of data science teams they should create and how those teams should be structured. By providing valuable information about what kinds of data science projects are most successful in the real world, they help companies develop business strategies for their future collection of projects.
19 Data Science Case Study Interview Questions (with Solutions)
Data science case study interview questions are often the most difficult part of the interview process. Designed to simulate a company’s current and past projects, case study problems rigorously examine how candidates approach prompts, communicate their findings and work through roadblocks.
Practice is key for acing case study interviews in data science. By continuously practicing test problems, you will learn how to approach the case study, ask your interviewer the right questions, and formulate answers that are both illustrative of your abilities and crafted within the time constraints of the question format.
There are four main types of data science case studies:
- Product Case Studies - This type of case study tackles a specific product or feature offering, often tied to the interviewing company. Interviewers are generally looking for business sense geared toward product metrics.
- Data Analytics Case Study Questions - Data analytics case studies ask you to propose possible metrics in order to investigate an analytics problem. Additionally, you must write a SQL query to pull your proposed metrics, and then perform analysis using the data you queried, just as you would do in the role.
- Modeling and Machine Learning Case Studies - Modeling case studies are more varied and focus on assessing your intuition for building models around business problems.
- Business Case Questions - Similar to product questions, business cases tackle issues or opportunities specific to the organization that is interviewing you. Often, candidates must assess the best option for a certain business plan being proposed, and formulate a process for solving the specific problem.
How Case Study Interviews Are Conducted
Oftentimes as an interviewee, you want to know the setting and format in which to expect the above questions to be asked. Unfortunately, this is company-specific: Some prefer real-time settings, where candidates actively work through a prompt after receiving it, while others offer some period of days (say, a week) before settling in for a presentation of your findings.
It is therefore important to have a system for answering these questions that will accommodate all possible formats, such that you are prepared for any set of circumstances (we provide such a framework below).
Why Are Case Study Questions Asked?
Case studies assess your thought process in answering data science questions. Specifically, interviewers want to see that you have the ability to think on your feet, and to work through real-world problems that likely do not have a right or wrong answer. Real-world case studies that are affecting businesses are not binary; there is no black-and-white, yes-or-no answer. This is why it is important that you can demonstrate decisiveness in your investigations, as well as show your capacity to consider impacts and topics from a variety of angles. Once you are in the role, you will be dealing directly with the ambiguity at the heart of decision-making.
Perhaps most importantly, case interviews assess your ability to effectively communicate your conclusions. On the job, data scientists exchange information across teams and divisions, so a significant part of the interviewer’s focus will be on how you process and explain your answer.
Quick tip: Because case questions in data science interviews tend to be product- and company-focused, it is extremely beneficial to research current projects and developments across different divisions , as these initiatives might end up as the case study topic.
How to Answer Data Science Case Study Questions (The Framework)
There are four main steps to tackling case questions in Data Science interviews, regardless of the type: clarify, make assumptions, gather context, and provide data points and analysis.
Step 1: Clarify
Clarifying is used to gather more information. More often than not, these case studies are designed to be confusing and vague. There will be unorganized data intentionally supplemented with extraneous or omitted information, so it is the candidate’s responsibility to dig deeper, filter out bad information, and fill gaps. Interviewers will be observing how an applicant asks questions and reaches a solution.
For example, with a product question, you might take into consideration:
- What is the product?
- How does the product work?
- How does the product align with the business itself?
Step 2: Make Assumptions
When you have made sure that you have evaluated and understand the dataset, start investigating and discarding possible hypotheses. Developing insights on the product at this stage complements your ability to glean information from the dataset, and the exploration of your ideas is paramount to forming a successful hypothesis. You should be communicating your hypotheses with the interviewer, such that they can provide clarifying remarks on how the business views the product, and to help you discard unworkable lines of inquiry. If we continue to think about a product question, some important questions to evaluate and draw conclusions from include:
- Who uses the product? Why?
- What are the goals of the product?
- How does the product interact with other services or goods the company offers?
The goal of this is to reduce the scope of the problem at hand, and ask the interviewer questions upfront that allow you to tackle the meat of the problem instead of focusing on less consequential edge cases.
Step 3: Propose a Solution
Now that you have formed a hypothesis incorporating the dataset and an understanding of the business context, it is time to apply that knowledge in forming a solution. Remember, the hypothesis is simply a refined version of the problem that uses the data on hand as the basis for its solution. The solution you create can target this narrow problem, and you can have full faith that it addresses the core of the case study question.
Keep in mind that there isn’t a single expected solution, and as such, there is a certain freedom here to determine the exact path for investigation.
Step 4: Provide Data Points and Analysis
Finally, providing data points and analysis in support of your solution involves choosing and prioritizing a main metric. As with all prior steps, this must be tied back to the hypothesis and the main goal of the problem. From that foundation, it is important to trace through and analyze different examples of the main metric in order to validate the hypothesis.
Quick tip: Every case question tends to have multiple solutions. Therefore, you should absolutely consider and communicate any potential trade-offs of your chosen method. Be sure you are communicating the pros and cons of your approach.
Note: In some special cases, solutions will also be assessed on the ability to convey information in layman’s terms. Regardless of the structure, applicants should always be prepared to work through the framework outlined above in order to answer the prompt.
The Role of Effective Communication
Interviewers have discussed the data science case study portion of the interview at length, and they consistently boil success down to one main factor: effective communication.

All the analysis in the world will not help if interviewees cannot verbally work through and highlight their thought process within the case study. Interviewers at this stage of the hiring process are primed to look for well-developed soft skills and problem-solving capabilities, and demonstrating those traits is key to succeeding in this round.
To this end, the best advice possible would be to practice actively going through example case studies, such as those available in the Interview Query questions bank. Exploring different topics with a friend in an interview-like setting with cold recall (no Googling in between!) will be uncomfortable and awkward, but it will also help reveal weaknesses in fleshing out the investigation.
Don’t worry if the first few times are terrible! Developing a rhythm will help with gaining self-confidence as you become better at assessing and learning through these sessions.
Product Case Study Questions
With product data science case questions , the interviewer wants to get an idea of your product sense intuition. Specifically, these questions assess your ability to identify which metrics should be proposed in order to understand a product.
1. How would you measure the success of private stories on Instagram, where only certain close friends can see the story?
Start by answering: What is the goal of the private story feature on Instagram? You can’t evaluate “success” without knowing what the initial objective of the product was, to begin with.
One specific goal of this feature would be to drive engagement. A private story could potentially increase interactions between users, and grow awareness of the feature.
Now, what types of metrics might you propose to assess user engagement? For a high-level overview, we could look at:
- Average stories per user per day
- Average Close Friends stories per user per day
However, we would also want to further bucket our users to see the effect that Close Friends stories have on user engagement. By bucketing users by age, date joined, or another metric, we could see how engagement is affected within certain populations, giving us insight on success that could be lost if looking at the overall population.
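The bucketing idea can be made concrete with pandas; the event table, column names, and age buckets below are hypothetical stand-ins for real Instagram data.

```python
# Average Close Friends stories per user per day, split by age bucket
# (hypothetical schema and rows).
import pandas as pd

events = pd.DataFrame({
    "user_id":           [1, 1, 2, 2, 3, 3, 3, 4],
    "age_bucket":        ["18-24", "18-24", "18-24", "18-24",
                          "25-34", "25-34", "25-34", "25-34"],
    "day":               ["d1", "d2", "d1", "d2", "d1", "d2", "d3", "d1"],
    "cf_stories_posted": [3, 1, 2, 2, 0, 1, 1, 0],
})

# Stories per user per day, then averaged within each age bucket.
per_user_day = events.groupby(["age_bucket", "user_id", "day"])["cf_stories_posted"].sum()
avg_by_bucket = per_user_day.groupby("age_bucket").mean()
print(avg_by_bucket)
```

Comparing `avg_by_bucket` across populations is what surfaces effects that an overall average would hide.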
2. How would you measure the success of acquiring new users through a 30-day free trial at Netflix?
More context: Netflix is offering a promotion where users can enroll in a 30-day free trial. After 30 days, customers will automatically be charged based on their selected package. How would you measure acquisition success, and what metrics would you propose to measure the success of the free trial?
One way we can frame the concept specifically to this problem is to think about controllable inputs, external drivers, and then the observable output. Start with the major goals of Netflix:
- Acquiring new users to their subscription plan.
- Decreasing churn and increasing retention.
Looking at acquisition output metrics specifically, there are several top-level stats that we can look at, including:
- Conversion rate percentage
- Cost per free trial acquisition
- Daily conversion rate
With these conversion metrics, we would also want to bucket users by cohort. This would help us see the percentage of free users who were acquired, as well as retention by cohort.
3. How would you measure the success of Facebook Groups?
Start by considering the key function of Facebook Groups. You could say that Groups are a way for users to connect with other users through a shared interest or real-life relationship. Therefore, the user’s goal is to experience a sense of community, which will also drive our business goal of increasing user engagement.
What general engagement metrics can we associate with this value? An objective metric like Groups monthly active users would help us see whether the Facebook Groups user base is increasing or decreasing. Plus, we could monitor metrics like posting, commenting, and sharing rates.
There are other products that Groups impact, however, specifically the Newsfeed. We need to consider Newsfeed quality and examine whether updates from Groups clog up the content pipeline and whether users prioritize those updates over other Newsfeed items. This evaluation will give us a better sense of whether Groups actually contribute to higher engagement levels.
4. How would you analyze the effectiveness of a new LinkedIn chat feature that shows a “green dot” for active users?
Note: Given engineering constraints, the new feature is impossible to A/B test before release.

When you approach case study questions, remember to always clarify any vague terms. In this case, “effectiveness” is very vague. To help you define that term, you would want to first consider the goal of adding a green dot to LinkedIn chat.
5. How would you diagnose why weekly active users are up 5%, but email notification open rates are down 2%?
What assumptions can you make about the relationship between weekly active users and email open rates? With a case question like this, you would want to first answer that line of inquiry before proceeding.
Hint: Open rate can decrease when its numerator decreases (fewer people open emails) or its denominator increases (more emails are sent overall). Taking these two factors into account, what are some hypotheses we can make about our decrease in the open rate compared to our increase in weekly active users?
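A quick numeric illustration of that hint, with made-up numbers: opens can rise in absolute terms while the open rate falls, as long as sends grow faster.

```python
# Open rate = opens / emails sent. More active users can mean more emails
# sent overall, dragging the rate down even as opens increase.
opens_before, sent_before = 2_000, 10_000
opens_after,  sent_after  = 2_200, 12_500  # +10% opens, but +25% sends

rate_before = opens_before / sent_before   # 0.2
rate_after  = opens_after / sent_after     # 0.176, i.e. roughly a 2-point drop
print(rate_before, rate_after)
```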
Data Analytics Case Study Questions
Data analytics case studies ask you to dive into analytics problems. Typically these questions ask you to examine metrics trade-offs or investigate changes in metrics. In addition to proposing metrics, you also have to write SQL queries to generate the metrics, which is why they are sometimes referred to as SQL case study questions .
6. Using the provided data, generate some specific recommendations on how DoorDash can improve.
In this DoorDash analytics case study take-home question you are provided with the following dataset:
- Customer order time
- Restaurant order time
- Driver arrives at restaurant time
- Order delivered time
- Customer ID
- Amount of discount
- Amount of tip
With a dataset like this, there are numerous recommendations you can make. A good place to start is by thinking about the DoorDash marketplace, which includes drivers, riders and merchants. How could you analyze the data to increase revenue, driver/user retention and engagement in that marketplace?
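One way to start, sketched with pandas: derive stage-by-stage durations from the four timestamps to see where orders lose the most time. The column names and rows below are assumptions standing in for the real take-home dataset.

```python
# Derive delivery durations from the timestamps in the (hypothetical) schema.
import pandas as pd

orders = pd.DataFrame({
    "customer_order_time":   pd.to_datetime(["2021-01-01 12:00", "2021-01-01 12:10"]),
    "restaurant_order_time": pd.to_datetime(["2021-01-01 12:02", "2021-01-01 12:15"]),
    "driver_at_restaurant":  pd.to_datetime(["2021-01-01 12:20", "2021-01-01 12:40"]),
    "order_delivered_time":  pd.to_datetime(["2021-01-01 12:35", "2021-01-01 13:05"]),
})

# Stage-by-stage durations in minutes: long stages show where to improve.
orders["total_minutes"] = (orders["order_delivered_time"]
                           - orders["customer_order_time"]).dt.total_seconds() / 60
orders["driver_wait"]   = (orders["driver_at_restaurant"]
                           - orders["restaurant_order_time"]).dt.total_seconds() / 60
print(orders[["total_minutes", "driver_wait"]].mean())
```

From there you could segment by customer, discount, or tip amount to tie the timing analysis back to revenue and retention recommendations.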
7. After implementing a notification change, the total number of unsubscribes increases. Write a SQL query to show how unsubscribes are affecting login rates over time.
This is a Twitter data science interview question, and let’s say you implemented this new feature using an A/B test. You are provided with two tables: events (which includes login, nologin, and unsubscribe) and variants (which includes control or variant).
We are tasked with comparing multiple different variables at play here. There is the new notification system, along with its effect of creating more unsubscribes. We can also see how login rates compare for unsubscribes for each bucket of the A/B test.
Given that we want to measure two different changes, we know we have to use GROUP BY for the two variables: date and bucket variant. What comes next?
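One possible shape of the query, grouping by day and variant bucket. The prompt does not give exact schemas, so the tables and sample rows below are assumptions; the query runs against an in-memory SQLite database to keep the sketch self-contained.

```python
# Login rate per day per A/B bucket (assumed schemas based on the prompt).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events   (user_id INT, day TEXT, action TEXT);
CREATE TABLE variants (user_id INT, bucket TEXT);
INSERT INTO variants VALUES (1,'control'),(2,'variant'),(3,'variant');
INSERT INTO events VALUES
 (1,'d1','login'),(2,'d1','login'),(3,'d1','nologin'),
 (1,'d2','login'),(2,'d2','nologin'),(3,'d2','login');
""")

rows = con.execute("""
SELECT e.day, v.bucket,
       AVG(CASE WHEN e.action = 'login' THEN 1.0 ELSE 0.0 END) AS login_rate
FROM events e
JOIN variants v ON v.user_id = e.user_id
GROUP BY e.day, v.bucket
ORDER BY e.day, v.bucket
""").fetchall()
print(rows)
```

Comparing `login_rate` between buckets over time shows whether the notification change (and its unsubscribes) is dragging logins down in the variant group.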
8. Write a query to disprove the hypothesis: Data scientists who switch jobs more often end up getting promoted faster.
More context: You are provided with a table of user experiences representing each person’s past work experiences and timelines.
This question requires a bit of creative problem-solving to understand how we can prove or disprove the hypothesis.

In analyzing this dataset, we can test the hypothesis by segmenting the data scientists according to how often they have switched jobs in their careers.

For example, if we looked at data scientists who have been in the field for five years, the hypothesis would be supported if the share who became managers rose with the number of job switches:
- Never switched jobs: 10% are managers
- Switched jobs once: 20% are managers
- Switched jobs twice: 30% are managers
- Switched jobs three times: 40% are managers
9. Write a SQL query to investigate the hypothesis: Click-through rate is dependent on search result rating.
More context: You are given a table with search results on Facebook, which includes query (search term), position (the search position), and rating (human rating from 1 to 5). Each row represents a single search and includes a column has_clicked that represents whether a user clicked or not.
This question requires us to do two things: create a metric that can analyze the problem we face, and then actually compute that metric.
Think about the data we want to display to prove or disprove the hypothesis. Our output metric is CTR (clickthrough rate). If CTR is high when search result ratings are high and CTR is low when the search result ratings are low, then our hypothesis is proven. However, if the opposite is true, CTR is low when the search result ratings are high, or there is no proven correlation between the two, then our hypothesis is not proven.
With that structure in mind, we can then look at the results split into different search rating buckets. If we measure the CTR for queries that all have results rated at 1 and then measure CTR for queries that have results rated at lower than 2, etc., we can measure to see if the increase in rating is correlated with an increase in CTR.
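The prompt asks for SQL, but the same bucketed CTR computation can be sketched quickly with pandas (the rows below are made up); a SQL version would GROUP BY rating and average has_clicked in exactly the same way.

```python
# CTR per search-result rating bucket (hypothetical rows; schema per prompt).
import pandas as pd

searches = pd.DataFrame({
    "rating":      [1, 1, 2, 2, 3, 4, 5, 5],
    "has_clicked": [0, 0, 0, 1, 1, 1, 1, 1],
})

ctr_by_rating = searches.groupby("rating")["has_clicked"].mean()
print(ctr_by_rating)
# A CTR that rises with the rating bucket would support the hypothesis.
```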
Modeling and Machine Learning Case Questions
Machine learning case questions assess your ability to build models to solve business problems. These questions can range from applying machine learning to a specific case scenario to assessing the validity of a hypothetical existing model. A modeling case study requires a candidate to evaluate and explain a given part of the model-building process.
10. Describe how you would build a model to predict Uber ETAs after a rider requests a ride.
Common machine learning case study problems like this are designed to test how you would build a model. Often the question can be scoped down to specific parts of the model-building process. The example above could be broken up into:
How would you evaluate the predictions of an Uber ETA model?
What features would you use to predict the Uber ETA for ride requests?
Our recommended framework breaks down a modeling and machine learning case study to individual steps in order to tackle each one thoroughly. In each full modeling case study, you will want to go over:
- Data processing
- Feature Selection
- Model Selection
- Cross Validation
- Evaluation Metrics
- Testing and Roll Out
11. How would you build a model that sends bank customers a text message when fraudulent transactions are detected?
Additionally, the customer can approve or deny the transaction via text response.
Let’s start by understanding what kind of model needs to be built. Since we are working with fraud, every transaction either is or is not fraudulent.
Hint: This problem is a binary classification problem. Given the problem scenario, what considerations do we have to think about when first building this model? What would the bank fraud data look like?
12. How would you design the inputs and outputs for a model that detects potential bombs at a border crossing?
Additional questions: How would you test the model and measure its accuracy? Remember the definitions of precision and recall: precision = TP / (TP + FP) and recall = TP / (TP + FN).
Because we cannot afford false negatives (a missed bomb), recall should be the primary metric when assessing the model.
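The two metrics are small enough to write out directly; this sketch uses standard confusion-matrix counts (TP, FP, FN), with the example numbers invented for illustration.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP): of the items flagged, how many were real threats."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Recall = TP / (TP + FN): of the real threats, how many were flagged."""
    return tp / (tp + fn)

# For bomb detection, a missed threat (false negative) is far costlier than a
# false alarm (false positive), so the model is tuned to keep recall high
# even at the expense of precision.
high_recall = recall(tp=99, fn=1)      # 0.99: almost no threats slip through
modest_precision = precision(tp=99, fp=401)  # many false alarms, acceptable here
```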
13. Which model would you choose to predict Airbnb booking prices: Linear regression or random forest regression?
Start by answering this question: What are the main differences between linear regression and random forest?
Random forest regression is based on the ensemble machine learning technique of bagging. The two key concepts of random forests are:
- Random sampling of training observations when building trees.
- Random subsets of features for splitting nodes.
Random forest regressions also effectively discretize continuous variables, since they are built from decision trees, which split on both categorical and continuous variables.
Linear regression, on the other hand, is the standard regression technique in which relationships are modeled using a linear predictor function, the most common example represented as y = Ax + B.
Let’s see how each model is applicable to Airbnb’s bookings. One thing we need to do in the interview is to understand more context around the problem of predicting bookings. To do so, we need to understand which features are present in our dataset.
We can assume the dataset will have features like:
- Location features.
- Number of bedrooms and bathrooms.
- Private room, shared, entire home, etc.
- External demand (conferences, festivals, sporting events).
Which model would be the best fit for this feature set?
14. Using a binary classification model that pre-approves candidates for a loan, how would you give each rejected application a rejection reason?
More context: You do not have access to the feature weights. Start by thinking about the problem like this: How would the problem change if we had ten, one thousand, or ten thousand applicants that had gone through the loan qualification program?
Pretend that we have three people: Alice, Bob, and Candace, who have all applied for a loan. Simplifying the financial lending loan model, assume the only features are the total number of credit cards, the dollar amount of current debt, and credit age. Here is a scenario:
- Alice: 10 credit cards, 5 years of credit age, $20K in debt
- Bob: 10 credit cards, 5 years of credit age, $15K in debt
- Candace: 10 credit cards, 5 years of credit age, $10K in debt
If only Candace is approved, we can logically point to the fact that her $10K in debt swung the model toward approval. How did we reason this out?
If the sample size analyzed was instead thousands of people who had the same number of credit cards and credit age with varying levels of debt, we could figure out the model’s average loan acceptance rate for each numerical amount of current debt. Then we could plot these on a graph to model the y-value (average loan acceptance) versus the x-value (dollar amount of current debt). These graphs are called partial dependence plots.
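The partial-dependence idea above can be sketched in a few lines: hold the other features fixed and average the model's decision at each debt level. The cohort data here is invented for illustration.

```python
from collections import defaultdict

def partial_dependence_on_debt(applicants):
    """applicants: list of (debt_dollars, approved) pairs where the other
    features (credit cards, credit age) are held fixed. Returns the average
    approval rate per debt level -- the y-values of a partial dependence plot."""
    buckets = defaultdict(lambda: [0, 0])  # debt -> [approved, total]
    for debt, approved in applicants:
        buckets[debt][0] += int(approved)
        buckets[debt][1] += 1
    return {debt: ok / n for debt, (ok, n) in sorted(buckets.items())}

# Hypothetical cohort: same number of cards and credit age, varying debt.
cohort = [(10_000, True), (10_000, True), (15_000, True),
          (15_000, False), (20_000, False), (20_000, False)]
pdp = partial_dependence_on_debt(cohort)
```

Plotting `pdp` (debt on the x-axis, average approval on the y-axis) gives the partial dependence plot described above, and the debt level where approval drops off becomes the rejection reason.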
Business Case Questions
In data science interviews, business case study questions task you with addressing problems as they relate to the business. You might be asked about topics like estimation and calculation, as well as applying problem-solving to a larger case. One tip: Be sure to read up on the company’s products and ventures before your interview to expose yourself to possible topics.
15. How would you estimate the average lifetime value of customers at a business that has existed for just over one year?
More context: You know that the product costs $100 per month, averages 10% in monthly churn, and the average customer stays for 3.5 months.
Remember that lifetime value is defined as the predicted net revenue attributed to the entire future relationship with a customer, averaged across customers. Therefore, $100 * 3.5 = $350. But is it that simple?
Because the company is so new, the average customer length (3.5 months) is biased downward by the short maximum time anyone could have been a customer (one year). How would you then model LTV from the churn rate and product cost?
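One way around the tenure cap is a churn-based estimate. This is a minimal sketch assuming a constant monthly churn rate (a geometric lifetime model), which is an assumption the interviewer would expect you to state.

```python
def ltv_from_churn(monthly_revenue, monthly_churn):
    """Expected lifetime value under a constant monthly churn rate.
    With churn c, the expected customer lifetime is 1/c months
    (geometric model), so LTV = revenue / churn."""
    return monthly_revenue / monthly_churn

# Naive estimate from observed (censored) tenure:
naive_ltv = 100 * 3.5                 # $350, capped by the one-year history
# Churn-based estimate, not capped by how long the company has existed:
model_ltv = ltv_from_churn(100, 0.10)  # $1,000
```

The gap between the two numbers ($350 vs. $1,000) is exactly the bias the question is probing: observed tenure is censored at one year, while the churn rate implies a 10-month expected lifetime on its own.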
16. How would you go about removing duplicate product names (e.g. iPhone X vs. Apple iPhone 10) in a massive database?
The full solution to this Amazon business case question is available on YouTube.
17. What metrics would you monitor to know if a 50% discount promotion is a good idea for a ride-sharing company?
This question has no correct answer and is rather designed to test your reasoning and communication skills related to product/business cases. First, start by stating your assumptions. What are the goals of this promotion? It is likely that the goal of the discount is to grow revenue and increase retention. A few other assumptions you might make include:
- The promotion will be applied uniformly across all users.
- The 50% discount can only be used for a single ride.
How would we evaluate this pricing strategy? An A/B test between a control group (no discount) and a test group (discount) would let us compare long-term revenue against the average cost of the promotion. Using these two metrics, how could we measure whether the promotion is a good idea?
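Once the A/B test has run, the retention comparison reduces to a two-proportion z-test. This is a sketch with invented group sizes and conversion counts; in an interview you would state the significance level as an assumption.

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """z-statistic for the difference in conversion (e.g., 30-day retention)
    between control group a and test group b, using a pooled proportion."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical experiment: retention with and without the 50% discount.
z = two_proportion_ztest(conv_a=400, n_a=1000, conv_b=460, n_b=1000)
significant = abs(z) > 1.96  # ~5% two-sided significance level
```

A significant lift in retention is only half the answer: the lift still has to be weighed against the revenue given up on discounted rides, which is why the question pairs long-term revenue with promotion cost.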
18. A bank wants to create a new partner card (e.g., the Whole Foods Chase credit card). How would you determine what the next partner card should be?
More context: Say you have access to all customer spending data. With this question, there are several approaches you can take. As your first step, think about the business reason for credit card partnerships: they help increase acquisition and customer retention.
One of the simplest solutions would be to sum all transactions grouped by merchant, identifying the merchants with the highest total spend. One issue, however, is that some merchants have high spend but low transaction volume. How could we counteract this pitfall? Is transaction volume even an important factor in our credit card business? The more questions you ask, the more may spring to mind.
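Tracking spend and volume side by side makes the pitfall concrete. A minimal sketch on invented transactions; the merchant names and amounts are hypothetical.

```python
from collections import defaultdict

def merchant_stats(transactions):
    """transactions: list of (merchant, amount) pairs. Returns
    {merchant: (total_spend, n_transactions)} so candidates can be ranked
    by spend, by volume, or by a blend of the two."""
    stats = defaultdict(lambda: [0.0, 0])
    for merchant, amount in transactions:
        stats[merchant][0] += amount
        stats[merchant][1] += 1
    return {m: (round(total, 2), n) for m, (total, n) in stats.items()}

# Hypothetical spending data: a high-spend/low-volume merchant next to a
# low-spend/high-volume one -- ranking by spend alone would miss the latter.
txns = [("JetAway Travel", 2500.0), ("JetAway Travel", 3100.0),
        ("DailyMart", 40.0), ("DailyMart", 35.0), ("DailyMart", 25.0),
        ("DailyMart", 60.0)]
stats = merchant_stats(txns)
```

Here "JetAway Travel" wins on total spend while "DailyMart" wins on transaction count, which is exactly the trade-off the follow-up questions are asking you to reason about.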
19. How would you assess the value of keeping a TV show on a streaming platform like Netflix?
Say that Netflix is working on a deal to renew the streaming rights for a show like The Office, which has been on Netflix for one year. Your job is to value the benefit of keeping the show on Netflix.
Start by trying to understand the reasons why Netflix would want to renew the show. Netflix mainly has three goals for what their content should help achieve:
- Acquisition: To increase the number of subscribers.
- Retention: To increase the retention of active subscribers and keep them on as paying members.
- Revenue: To increase overall revenue.
One solution to value the benefit would be to estimate a lower and upper bound to understand the percentage of users that would be affected by The Office being removed. You could then run these percentages against your known acquisition and retention rates.
More Data Science Interview Resources
Case studies are one of the most common types of data science interview questions. Practice with the data science course from Interview Query, which includes product and machine learning modules.
Doing Data Science: A Framework and Case Study
Today’s data revolution is not just about big data, it is about data of all sizes and types. While the issues of volume and velocity presented by the ingestion of massive amounts of data remain prevalent, it is the rapidly developing challenges presented by the third v, variety, that necessitate more attention. The need for a comprehensive approach to discover, access, repurpose, and statistically integrate all the varieties of data is what has led us to the development of a data science framework that forms our foundation of doing data science. Unique features in this framework include problem identification, data discovery, data governance and ingestion, and ethics. A case study is used to illustrate the framework in action. We close with a discussion of the important role for data acumen.
Keywords: data science framework, data discovery, ethics, data acumen, workforce
In the words of Thomas Jefferson, ‘knowledge is power,’ an adage that data scientists understand too well given that data science is quickly becoming the new currency of sound decision making and policy development. But not all data are created equal.
Today’s data revolution is not just about big data, but the emergence of all sizes and types of data. Advances in information technology, computation, and statistics now make it possible to access, integrate, and analyze massive amounts of data over time and space. Further, massive repurposing (using data for purposes other than those for which it was gathered) is becoming an increasingly common practice and concern. These data are often incomplete, flawed, challenging to access, and nonrepresentative.
That predicament is driving a significant need for a data-literate population to move from simple data analytics to actually ‘doing data science.’ To bring this to bear, researchers from the University of Virginia’s (UVA) Biocomplexity Institute and Initiative have developed a research model and data science framework to help mature data science. The data science framework and associated research processes are fundamentally tied to practical problem solving, highlight data discovery as an essential but often overlooked step in most data science frameworks, and, incorporate ethical considerations as a critical feature to the research. Finally, as data are becoming the new currency across our economy, the UVA research team emphasizes the obligation of data scientists to enlighten decision makers on data acumen (literacy). The need to help consumers of research to understand the data and the role it plays in problem solving and policy development is important, as is building a data-savvy workforce interested in public good applications, such as the Data Science for the Public Good young scholars program led by the Biocomplexity Institute team.
Today’s data revolution is more about how we are ‘doing data science’ than just ‘big data analytics,’ a buzzword with little value to policymakers or communities trying to solve complex social issues. With a proper research model and framework, it is possible to bring the all data revolution to all organizations, from local, state, and federal governments to industry and nonprofit organizations, expanding its reach, application, understanding, and impact.
Data science is the quintessential translational research field that starts at the point of translation—the real problem to be solved. It involves many stakeholders and fields of practice and lends itself to team science. Data science has evolved into a powerful transdisciplinary endeavor. This article shares our development of a framework to build an understanding of what it means to just do data science.
We have learned how to do data science in a rather unique research environment within the University of Virginia’s Biocomplexity Institute, one that is an intentional collection of statisticians and social and behavioral scientists with a common interest in channeling data science to improve the impact of decision making for the public good. Our data science approach to research is based on addressing real, very applied public policy problems. It is a research model that starts with translation by working directly with the communities or stakeholders and focusing on their problems. This results in a ‘research pull’ versus a ‘research push’ to lay the research foundation for data science. Research push is the traditional research paradigm. For example, research in biology and life sciences moves from basic bench science to bedside practice. For data science, it is through working several problems in multiple domains that the synergies and overarching research needs emerge, hence a research pull.
Through our execution of multiple and diverse policy-focused case studies, synergies and research needs across the problem domains have surfaced. A data science framework has emerged and is presented in the remainder of this article along with a case study to illustrate the steps. This data science framework warrants refining scientific practices around data ethics and data acumen (literacy). A short discussion of these topics concludes the article.
2. Data Science Framework
Conceptual models are being proposed for capturing the life cycle of data science, for example, Berkeley School of Information (2019) and Berman et al. (2018). A simple Google search of ‘data science’ brings forward pages and pages of images. These figures have overlapping features and are able to nicely summarize several components of the data science process. We find it critical to go beyond the conceptual framing and have created a framework that can be operationalized for the actual practice of data science.
Our data science framework (see Figure 1) provides a comprehensive approach to data science problem solving and forms the foundation of our research (Keller, Korkmaz, Robbins, & Shipp, 2018; Keller, Lancaster, & Shipp, 2017). The process is rigorous, flexible, and iterative in that learning at each stage informs prior and subsequent stages. There are four features of our framework that deviate from other frameworks and will be described in some detail. First, we specify the problem to be addressed and keep it ever-present in the framework, hence grounding the data science research in a problem to be solved. Second, we undertake data discovery, the search for existing data sources, as a primary activity and not an afterthought. Third, governance and data ingestion play a critical role in building trust and establishing data-sharing protocols. Fourth, we actively connect data science ethics to all components of the framework.
Figure 1. Data science framework. The data science framework starts with the research question, or problem identification, and continues through the following steps: data discovery (inventory, screening, and acquisition); data ingestion and governance; data wrangling (data profiling, data preparation and linkage, and data exploration); fitness-for-use assessment; statistical modeling and analyses; communication and dissemination of results; and ethics review.
In the following, we describe the components of the data science framework. Although the framework is described in a linear fashion, it is far from a linear process as represented by a circular arrow that integrates the process. We also provide a case study example for youth obesity and physical activity in Fairfax County, Virginia, that walks through the components of the framework to demonstrate how a disciplined implementation of the steps taken to do data science ensures transparency and reproducibility of the research.
2.1. Problem Identification
Data science brings together disciplines and communities to conduct transdisciplinary research that provides new insights into current and future societal challenges (Berman et al., 2018). Data becomes a common language for communication across disciplines (Keller, 2007; Keller et al., 2017). The data science process starts with the identification of the problem. Using relevant theories and framing hypotheses is achieved through traditional literature reviews, including the review of the grey literature (e.g., government, industry, and nonprofit organization reports) to find best practices. Subject matter (domain) expertise also plays a role in translating the information acquired into understanding the underlying phenomena in the data (Box, Hunter, & Hunter, 1978). Domain knowledge provides the context to define, evaluate, and interpret the findings at each stage of the research (Leonelli, 2019; Snee, DeVeaux, & Hoerl, 2014).
Domain knowledge is critical to bringing data to bear on real problems. It can take many forms, from understanding the theory, the modeling, or the underlying changes observed in data. For example, when we repurpose local administrative data for analyses, community leaders can explain underlying factors and trends in the data that may not be apparent without contextual knowledge.
Case Study Application—Problem Identification
The Health and Human Services (HHS) of Fairfax County, Virginia, is interested in developing capacity for data-driven approaches to gain insights on current issues, such as youth obesity, by characterizing social and economic factors at the county and subcounty level and creating statistical models to inform policy options. Fairfax County is a large county (406 square miles) with 1.1 million people across all income groups and ethnicities. The obesity rate in the United States has steadily increased since the 1970s due to growing availability of food and declining physical activity that occurs as people get older. The project aims are to identify trends and activities related to obesity across geographies of interest for local policy and program development. The HHS sponsors provided insight and context in identifying geographic regions of interest for Fairfax County decision makers. Instead of using traditional census tracts to analyze subcounty trends, they requested that the analyses be based on Fairfax County high school attendance areas and political districts. As described in the following, this led to innovations in our research through the creation of synthetic information technology to align data by these geographic dimensions.
2.2. Data Discovery (Data Inventory, Screening, and Acquisition)
Data discovery is the identification of potential data sources that could be related to the specific topic of interest. Data pipelines and associated tools typically start at the point of acquisition or ingestion of the data (Weber, 2018). A unique feature of our data science framework is to start the data pipeline with data discovery. The goal of the data discovery process is to think broadly and imaginatively about all data, capturing the full potential variety of data (the third v of the data revolution) that could be useful for the problem at hand and literally assemble a list of these data sources.
An important component of doing data science is to first focus on massive repurposing of existing data in the conceptual development work. Data science methods provide opportunities to wrangle these data and bring them to bear on the research questions. In contrast to traditional research approaches, data science research allows researchers to explore all existing data sources before considering the design of new data collection. The advantage of this approach is that data collection can be directly targeted at current gaps in knowledge and information.
Khan, Uddin, and Gupta (2014) address the importance of variety in data science sources. Even within the same type of data, for example, administrative data, the problem (research question) drives its use and applicability of the information content to the issue being addressed. This level of variety drives what domain discoveries can be made (“Data Diversity,” 2019). Borgman (2019) notes that data are human constructs. Researchers and subject matter experts decide “what are data for a given purpose, how those data are to be interpreted, and what constitutes appropriate evidence.” A similar perspective is that data are “relational,” and their meaning relies on their history (how the data are born and evolve), their characteristics, and the interpretation of the data when analyzed (Leonelli, 2019).
Integrating data from disparate sources involves creating methods based on statistical principles that assess the usability of the data (United Nations Economic Commission for Europe, 2014, 2015). These integrated data sources provide the opportunity to observe the social condition and to answer questions that have been challenging to solve in the past. This highlights that the usefulness and applicability of the data vary depending on its use and domain. There are barriers to using repurposed data, which are often incomplete, challenging to access, not clean, and nonrepresentative. There may also exist restrictions on data access, data linkage, and redistribution that stem from the necessity of governance across multiple agencies and organizations. Finally, repurposed data may pose methodological issues in terms of inference or creating knowledge from data, often in the form of statistical, computational, and theoretical models (Japec et al., 2015; Keller, Shipp, & Schroeder, 2016).
When confronted over and over with data discovery and repurposing tasks, it becomes imperative to understand how data are born. To do this, we have found it useful to define data in four categories: designed, administrative, opportunity, and procedural. These definitions are given in Table 1 (Keller et al., 2017; Keller et al., 2018). The expected benefits of data discovery and repurposing are the use of timely and frequently low-cost (existing) data, large samples, and geographic granularity. The outcomes are a richer source of data to support the problem solving and a better-informed research plan. A caveat is the need to also weigh the costs of repurposing existing data against new data collection, asking whether new experiments would provide faster and less biased results than finding and repurposing data. In our experience, the benefits of repurposing existing data sources often outweigh these costs and, more importantly, provide guidance on data gaps for cost-effective development of new data collection.
Table 1. Data types.
Note. Adapted from Keller et al. (2018).
The typology of data (designed, administrative, opportunity, and procedural) provides a systematic way to think about possible data sources and a foundation for the data discovery steps. Data inventory is the process by which the data sources are first identified through brainstorming, searching, and snowballing processes (see Figure 2).
A short set of data inventory questions is used to assess the usefulness of the data sources for the research objectives of a specific problem. The process is iterative, starting with the data inventory questions to assess whether the data source meets the basic criteria for the project with respect to the type of data, recurring nature of the data, data availability for the time period needed, geographic granularity, and unit of analysis required. If the data meet the basic criteria, they undergo additional screening to document the provenance, purpose, frequency, gaps, use in prior research, and other uses of the data. We employ a ‘data map’ to help drive our data discovery process (see Figure 3). Throughout the course of the project, as new ideas and data sources are discovered, they are inventoried and screened for consideration.
The acquisition process for existing data sources depends on the type and source of the data being accessed and includes downloading data, scraping the Web, acquiring it directly from a sponsor, or purchasing data from aggregators, or other sources. It also includes the development and initiation of data sharing agreements, as necessary.
Figure 2. Data discovery filter. Data discovery is the open-ended and continuous process whereby candidate data sources are identified. Data inventory refers to the broadest, most far-reaching ‘wish list’ of information pertaining to the research questions. Data screening is an evaluative process by which eligible data sets are sifted from the larger pool of candidate data sets. Data acquisition is the process of acquiring the data from a sponsor, purchasing it, downloading it using an application programming interface (API), or scraping the web.
Case Study Application—Data Discovery
The creation of a data map highlights the types of data we want to ‘discover’ for this project (see Figure 3). This is guided by a literature review and Fairfax County subject matter experts that are part of the Community Learning Data Driven Discovery (CLD3) team for this project (Keller, Lancaster, & Shipp, 2017). This data map immediately captures the multiple units of analysis that will need to be integrated in the analysis. The data map helps the team identify potential implicit biases and ethical considerations.
Figure 3. Data map. The data map highlights the types of data desired for the study and is used as a guide for data discovery. The lists are social determinants and physical infrastructure that could affect teen behaviors. The map highlights the various units of analysis that will need to be captured and linked in the analyses. These include individuals, groups and networks of individuals, and geographic areas.
Data Inventory, Screening, and Acquisition. The data map then guides our approach to identify, inventory, and screen the data. We screened each source to assess its relevance to this project, as follows.
For surveys and administrative data:
- Are the data at the county or subcounty level? (Note: This question screened out several national sources of data that are not available at the geographic granularity needed for the study.)
- What years are the data available, i.e., are they for the same years as the American Community Survey (ACS) and Fairfax Youth Survey?
- Can we acquire and use the data in the timeframe of the project, e.g., March to September?
For place-based data:
- Is an address provided?
- Can the type of establishment be identified?
- Can we acquire and use the data in the timeframe of the project?
Following the data discovery step, we identified and acquired survey, administrative, and place-based (opportunity) data to be used in this study. These are summarized in Table 2.
The baseline data are the ACS, which provides demographic and economic data at the census block and census tract levels. We characterize the housing and rental stock in Fairfax County through the use of property tax assessment administrative records. Geo-coded place-based data are scraped from the Web and include location of grocery stores, convenience stores, restaurants (full-service and fast food), recreation centers, and other opportunities for physical activity. We also acquired Fairfax County Youth Survey aggregates (at the high school boundary level) and Fairfax Park Authority administrative data.
Table 2. Selected data sources.
2.3. Data Governance and Ingestion
Data governance is the establishment of and adherence to rules and procedures regarding data access, dissemination, and destruction. In our data science framework, access to and management of data sources is defined in consultation with the stakeholders and the university’s institutional review board (IRB). Data ingestion is the process of bringing data into the data management platform(s).
Combining disparate data sources can raise issues around privacy and confidentiality, frequently arising from the conflicting interests of researchers and sponsors working together. For clarity, privacy refers to the amount of personal information individuals allow others to access about themselves, and confidentiality is the process that data producers and researchers follow to keep individuals’ data private (National Research Council, 2007).
For some, it becomes intoxicating to think about the massive amounts of individual data records that can be linked and integrated, leading to ideas about following the behavioral patterns of specific individuals, as a social worker might want to do. This has led us to a data science guideline distinguishing between ensuring confidentiality of the data for research and policy analyses versus real-time activities such as casework (Keller et al., 2016). Casework requires identification of individuals and families for the data to be useful; policy analysis does not. For casework, information systems must be set up to ensure that only social workers have access to these private data and that approvals are granted for access. Our focus is policy analysis.
Data governance requires tools to identify, manage, interpret, and disseminate data (Leonelli, 2019). These tools are needed to facilitate decision making about different ways to handle and value data and to articulate conflicts among the data sources, shifting research priorities to consider not only publications but also data infrastructures and curation of data. Our best practices around data governance and ingestion are included as part of the training of all research team members and also captured in formal data management plans.
Modified read-write data produced from the original data sources, or code that can regenerate the modified data, are stored back on a secure server and are accessible only via secured remote access. For projects involving protected information, unless special authorization is given, researchers do not have direct access to data files. For those projects, data access is mediated by data analysis tools hosted on our own secure servers that connect to the data server via authenticated protocols (Keller, Shipp, & Schroeder, 2016).
Case Study Application—Data Governance and Ingestion. Selected variables from the data sources in Table 2 were profiled and cleaned (indicated by the asterisks). Two unique sets of data requiring careful governance were discovered and included in the study. First is the Fairfax County Youth Survey, administered to 8th, 10th, and 12th graders every year. Access to these data requires adhering to specific governance requirements, which resulted in aggregate data being provided for each school. These data include information about time spent on activities (e.g., homework, physical activity, screen time); varieties of food eaten each week; family structure and other support; and information about risky behaviors, such as use of alcohol and drugs. Second, the Fairfax County Park Authority data include usage data at its nine recreation centers, including classes taken, services used, and the location of each recreation center.
2.4. Data Wrangling
The next phases of the data science framework, data profiling to assess quality, data preparation, linkage, and exploration, can easily consume the majority of a project’s time and resources, and they contribute to assessing the quality of the data (Dasu & Johnson, 2003). Details of data wrangling are now readily available from many authors and are not repeated here (e.g., De Veaux, Hoerl, & Snee, 2016; Wickham, 2014; Wing, 2019). Assessing the quality and representativeness of the data is an iterative and important part of data wrangling (Keller, Shipp, & Schroeder, 2016).
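Data profiling of the kind described above can be as simple as counting missing and out-of-range values per field before any linkage or modeling. A minimal sketch, with hypothetical records and validity ranges:

```python
# A minimal data-profiling sketch: count missing and out-of-range entries
# per field. The records and valid ranges below are hypothetical.
records = [
    {"tract": "4001", "median_rent": 1850, "poverty_rate": 0.08},
    {"tract": "4002", "median_rent": None, "poverty_rate": 0.12},
    {"tract": "4003", "median_rent": 2100, "poverty_rate": 1.7},  # rate > 1 is suspect
]

valid_range = {"median_rent": (0, 10_000), "poverty_rate": (0.0, 1.0)}

profile = {}
for field, (lo, hi) in valid_range.items():
    values = [r[field] for r in records]
    profile[field] = {
        "missing": sum(v is None for v in values),
        "out_of_range": sum(v is not None and not lo <= v <= hi for v in values),
    }
```

Even this crude pass surfaces the kinds of issues (a missing rent, an impossible poverty rate) that data wrangling must resolve before the data can be linked and explored.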
2.5. Fitness-for-Use Assessment
Fitness-for-use of data was introduced in the 1990s from a management and industry perspective (Wang & Strong, 1996) and then extended to official statistics by Brackstone (1999). Fitness-for-use starts with assessing the constraints imposed on the data by the particular statistical methods that will be used and, if inferences are to be made, whether the data are representative of the population to which the inferences extend. This assessment extends from straightforward descriptive tabulations and visualizations to complex analyses. Finally, fitness-for-use should characterize the information content in the results.
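One concrete piece of a fitness-for-use assessment is a representativeness check: comparing a source’s category shares against known population benchmarks. The sketch below uses hypothetical age-group shares and an assumed 5-percentage-point review threshold; neither the numbers nor the threshold come from the study.

```python
# A minimal representativeness check: compare a data source's category shares
# against known population benchmarks (all numbers here are hypothetical).
benchmark = {"<18": 0.23, "18-64": 0.62, "65+": 0.15}   # e.g., ACS shares
sample_counts = {"<18": 180, "18-64": 700, "65+": 120}  # counts in the candidate source

total = sum(sample_counts.values())
max_gap = 0.0
for group, bench_share in benchmark.items():
    share = sample_counts[group] / total
    max_gap = max(max_gap, abs(share - bench_share))

# Flag the source for closer review if any group's share deviates from the
# benchmark by more than 5 percentage points (an assumed threshold).
needs_review = max_gap > 0.05
```

A source flagged this way is not necessarily unusable; the gap simply tells the analyst that inferences to the full population would require reweighting or a narrower claim.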
Case Study Application—Fitness-for-Use. After linking and exploring the data sources, a subset of data was selected for the fitness-for-use analyses to benchmark the data. We were unable to gain access to individual student-level data or to important health information (even in aggregate), such as body mass index (BMI, a combination of height and weight data). An implicit bias discussion across the team ensued, and given these limitations, the decisions on which data would be carried forward into the analyses were guided by a refocusing of the project to characterize the social, economic, and behavioral features of the individual high schools, their attendance areas, and county political districts. These characterizations could be used to target new programming and policy development.
2.6. Statistical Modeling and Analyses
Statistics and statistical modeling are key for drawing robust conclusions using incomplete information (Adhikari & DeNero, 2019). Statistics provide consistent and clear-cut words and definitions for describing the relationship between observations and conclusions. The appropriate statistical analysis is a function of the research question, the intended use of the data to support the research hypothesis, and the assumptions required for a particular statistical method (Leek & Peng, 2015). Ethical dimensions include ensuring accountability, transparency, and lack of algorithmic bias.
Case Study Application—Statistical Modeling and Analyses. We used the place-based data to calculate and map distances between home and locations of interest by political district and high school attendance area. The data include the availability of physical activity opportunities and access to healthy and unhealthy food. Figure 4 gives an example of the distances from home to locations of fast food versus farmers markets within each political district.

Figure 4. Exploratory analysis—direct aggregation of place-based data based on location of housing units. The box plots show the distance from each housing unit to a farmers market or fast food restaurant for each of the 9 Fairfax County political districts. The take-away is that people live closer to fast food restaurants than to farmers markets.

Synthetic information methods. Unlike the place-based data, the survey data do not directly align with the geographies of interest, e.g., the 9 Supervisor Districts and 24 School Attendance Areas. To realign the data, and the subsequent composite indicators, to the relevant geographies, we used synthetic information technology to impute social and economic characteristics and attach them to housing and rental units across the county. Multiple sets of representative synthetic information about the Fairfax population were constructed using iterative proportional fitting, allowing for estimation of margins of error (Beckman, Baggerly, & McKay, 1996). Some of the features of the synthetic data are an exact match to the ACS marginal tabulations, while others are generated statistically using survey data collected at varying levels of aggregation. Synthetic estimates across these multiple data sources can then be used to make inferences at resolutions not available in any single data source alone.

Creation of composite indicators.
Composite indicators are useful for combining data to create a proxy for a concept of interest, such as the relative economic position of vulnerable populations across the county (Atkinson, Cantillon, Marlier, & Nolan, 2002). Two composite indicators were created, the first to represent economically vulnerable populations and the second to represent schools that have a larger percentage of vulnerable students (see Figure 5). We defined the indicators as follows:

Economic vulnerability is the statistical combination of four factors: the percent of households with housing burden greater than 50% of household income, with no vehicle, receiving Supplemental Nutrition Assistance Program (SNAP) benefits, and in poverty.

High school vulnerability is the statistical combination of the percentages of students enrolled in Limited English Proficiency programs, receiving free and reduced-price meals, on Medicaid, receiving Temporary Assistance for Needy Families, and with migrant or homelessness experiences.

Figure 5. School and economic vulnerability indicators for Fairfax County, Virginia. Economic vulnerability indicators are mapped by the 24 high school attendance areas and by color; the darker the color, the more vulnerable the area. The overlaid circles are high school vulnerability indicators geolocated at the high school locations. The larger the circle, the higher the vulnerability of the high school population.

Figure 6 presents correlations between factors that may affect obesity.

Figure 6. Correlations of factors that may affect obesity. The factors are levels of physical activity (none or 5+ days per week), food and drink consumed during the past week, unhealthy weight loss, and food insecurity. As an example, the bottom left-hand corner shows a positive correlation between no physical activity and food insecurity.
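The text describes each indicator as a “statistical combination” of factor percentages without fixing the method. One common construction, sketched below under that assumption, standardizes each factor to a z-score (so no single factor dominates by scale) and averages them; the district rates are hypothetical.

```python
import numpy as np

# A sketch of a composite economic-vulnerability indicator, assuming z-score
# standardization followed by an unweighted average. District rates (fractions
# of households) are hypothetical illustrations, not study data.
factors = ["housing_burden_gt50", "no_vehicle", "snap", "poverty"]
rates = np.array([
    [0.30, 0.05, 0.10, 0.08],  # District A
    [0.45, 0.12, 0.22, 0.15],  # District B
    [0.20, 0.03, 0.06, 0.05],  # District C
])

# Standardize each factor across districts so the combination is scale-free.
z = (rates - rates.mean(axis=0)) / rates.std(axis=0)

# Average the standardized factors; higher values mean greater vulnerability.
vulnerability = z.mean(axis=1)
```

In practice, the weights on each factor are a substantive choice, which is why the paper later proposes seeking stakeholder input on the relative importance of the components.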
The next phase of the analyses was to build statistical models that would give insights into the relationships between physical activity and healthy eating based on information from the Youth Surveys. Based on the full suite of data, several machine learning models were used.

Fitness-for-use assessment revisited. While we were asked to examine youth obesity, we did not have access to obesity data at the subcounty or student level. Yet, we decided to move from descriptive analysis to more complex statistical modeling to assess whether the existing data could still provide useful results. First, we used Random Forest, a supervised machine learning method that builds multiple decision trees and merges them to obtain a more accurate and robust prediction. Our Random Forest models did not produce any reasonable or statistically significant predictions. Next, we used LASSO (least absolute shrinkage and selection operator), a regression analysis method that performs both variable selection and regularization (the process of adding information) to enhance the prediction accuracy and interpretability of the statistical model it produces. However, the LASSO method consistently selected the model with zero predictors, suggesting none are useful. Finally, a partial least squares regression, which reduces the predictors to a smaller set of uncorrelated components and performs least squares regression on those components rather than on the original data, performed best when no components were used, mirroring the LASSO result. Our conclusion is that more complex statistical modeling does not provide additional information beyond the (still clearly useful) descriptive analysis. As noted below, BMI data and stakeholder input to identify the relative importance of composite indicator components are needed to extend the modeling.
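The zero-predictor LASSO outcome is worth illustrating, because it is a feature of the method rather than a failure. The sketch below (not the study’s code; all data are simulated) implements LASSO by cyclic coordinate descent and shows that when predictors carry no signal, soft-thresholding drives every coefficient to exactly zero at a sufficiently large penalty.

```python
import numpy as np

# Simulated data: 8 standardized predictors unrelated to the outcome.
rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = rng.standard_normal(n)  # outcome is pure noise

def lasso_cd(X, y, lam, n_iter=100):
    """LASSO via cyclic coordinate descent with soft-thresholding."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding predictor j's current contribution.
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r / n
            z = (X[:, j] @ X[:, j]) / n
            # Soft-threshold: coefficients with |rho| below lam are set to 0.
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return beta

beta = lasso_cd(X, y, lam=0.5)
# With no real signal, every coefficient is thresholded to exactly zero,
# i.e., LASSO "selects" the model with zero predictors.
```

This mirrors the study’s finding: the penalty only retains predictors whose association with the outcome exceeds the noise floor, and here none do.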
2.7. Communication and Dissemination
Communication involves sharing data, well-documented code, working papers, and dissemination through conference presentations, publications, and social media. These steps are critical to ensure processes and findings are transparent, replicable, and reproducible (Berman et al., 2018). An important facet of this step is to tell the story of the analysis by conveying the context, purpose, and implications of the research and findings (Berinato, 2019; Wing, 2019). Visuals, case studies, and other supporting evidence reinforce the findings.
Communication and dissemination are also important for building and maintaining a community of practice. They can include dissemination through portals, databases, and repositories; workshops and conferences; and the creation of new journals (e.g., Harvard Data Science Review). Underlying communication and dissemination is preserving the privacy and ethical dimensions of the research.
Case Study Application—Communication and Dissemination. We summarized and presented our findings at each stage of the data science lifecycle, starting with the problem asked, through data discovery, profiling, exploratory analysis, fitness-for-use, and the statistical analysis. We provided new information to county officials about potential policy options and are continuing to explore how we might obtain data-sharing agreements for sensitive data, such as BMI. The data used in this study are valuable for descriptive analyses, but the fitness-for-use assessment demonstrated that the statistical models require finer-resolution, student-level data, for example, body mass index (BMI) or height and weight data, to obtain better predictive measures.

The exploratory analysis described earlier provided many useful insights for Fairfax County Health and Human Services about proximity to physical activity and healthy food options in each political district and high school attendance area. We encourage Fairfax County Health and Human Services to develop new data governance policies that allow researchers to access sensitive data while ensuring that the privacy and confidentiality of the data are maintained. Until we can access BMI or height and weight data, we propose to seek stakeholder input to develop composite indicators, such as the economic vulnerability indicator described in this example. These composite indicators would inform stakeholders and decision makers about where at-risk populations live and about changes over time in how those populations are faring from various perspectives, such as economic self-sufficiency, health, access to healthy food, and access to opportunities for physical activity.
2.8. Ethics Review
The ethics review provides a set of guiding principles to ensure dialogue on this topic throughout the lifecycle of the project. Because data science involves interdisciplinary teams, conversations around ethics can be challenging: each discipline has its own set of research integrity norms and practices. To harmonize across these fields, data science ethics must touch every component and step in the practice of data science, as shown in Figure 1. This is illustrated throughout the case study.
When acquiring and integrating data sources, ethical issues include considerations of mass surveillance, privacy, data sovereignty, and other potential consequences. Research integrity includes improving day-to-day research practices and ongoing training of all scientists to achieve “better record keeping, vetting experimental designs, techniques to reduce bias, rewards for rigorous research, and incentives for sharing data, code, and protocols—rather than narrow efforts to find and punish a few bad actors” (“Editorial: Nature Research Integrity,” 2019, p. 5). Research integrity is advanced by implementing these practices into research throughout the entire research process, not just through the IRB process.
Salganik (2017) proposes a principles-based approach to ethics that includes standards and norms around the uses of data, analysis, and interpretation, similar to the steps associated with implementing a data science framework. Similarly, the “Community Principles on Ethical Data Sharing,” formulated at a Bloomberg conference in 2017, are based on four principles—fairness, benefit, openness, and reliability (Data for Democracy, 2018). A systematic approach to implementing these principles is ensuring scientific data are FAIR:
‘Findable’ using common search tools;
‘Accessible’ so that the data and metadata can be explored;
‘Interoperable’ to compare, integrate, and analyze; and
‘Reusable’ by other researchers or the public through the availability of metadata, code, and usage licenses (Stall et al., 2019).
Underlying the FAIR principles is the idea that curating and sharing data should be credited and counted as being as important as journal publication citations (Pierce, Dev, Statham, & Bierer, 2019). The FAIR movement has taken hold in some scientific disciplines where issues surrounding confidentiality or privacy are not as prevalent. The social sciences, on the other hand, face challenges in that data access is often restricted for these reasons. However, the aim should be to develop FAIR principles across all disciplines and adapt them as necessary. This requires creating repositories, infrastructures, and tools that make FAIR practices the norm rather than the exception at both national and international levels (Stall et al., 2019).
Building on these principles, we have developed a Data Science Project Ethics Checklist (see the Appendix for an example). We have found two practices useful for instantiating ethics in every step of ‘doing data science.’ First, we require our researchers to take Institutional Review Board (IRB) and Responsible Conduct of Research training classes. Second, for each project, we develop a checklist to implement an ethical review at each stage of research that addresses the following criteria:
Balance simplicity and sufficient criteria to ensure ethical behavior and decisions.
Make ethical considerations and discussion of implicit biases an active and continuous part of the project at each stage of the research.
Seek expert help when ethical questions cannot be satisfactorily answered by the research team.
Ensure documentation, transparency, ongoing discussion, questioning, and constructive criticism throughout the project.
Incorporate ethical guidelines from relevant professional societies (for examples, see ACM Committee on Professional Ethics, 2018; American Physical Society, 2019; Committee on Professional Ethics of the American Statistical Association, 2018).
Creating the checklist is the first step for researchers to agree on a set of principles and serves as a reminder to have conversations throughout the project. This helps address the challenge of working with researchers from different disciplines and allows them to approach ethics through a variety of lenses. The Data Science Ethics Checklist given in the Appendix can be adapted to specific data science projects, with a focus on social science research. Responsible data science involves using a set of guiding principles and addressing the consequences across the data lifecycle.
Case Study Application—Ethics. Aspects of the ethics review, a continuous process, have been touched on in the earlier steps of the case study, specifically, the ethics examination of the methods used, including the choice of variables, the creation of synthetic populations, and the models used. In addition, our findings were scrutinized, vetted, and refined based on internal discussions with the team, our sponsors, Fairfax County officials, and external experts. The primary question asked throughout was whether we were introducing implicit bias into our research. We concurred that some of the findings had the potential to appear biased, such as the finding about level of physical activity by race and ethnicity. However, in this case, these findings would be important to school officials and political representatives.
3. Data Acumen
In the process of doing data science, we have learned that many consumers of this research do not have sufficient data acumen and thus can be overwhelmed by how to make use of data-driven insights. It is unrealistic to think that the majority of decision makers are data scientists. Even with domain knowledge, some literacy in data science is useful, including the underpinnings of probability and statistics to inform decision making under uncertainty (Kleinberg, Ludwig, Mullainathan, & Obermeyer, 2015).
Data acumen, traditionally referred to as data literacy, appears to have been first introduced in the 2000s as the social sciences began to embrace and use publicly open data (Prado & Marzal, 2013). We define data acumen as the ability to make good judgments about the use of data to support problem solutions. It is not only the basis of statistical and quantitative analysis; it is a critical mechanism to improve society and a necessary first step to statistical understanding. The need for policy and other decision makers with data acumen is growing in parallel with the massive repurposing of all types of data sources (Bughin, Seong, Manyika, Chui, & Joshi, 2018).
We have found it useful to conceptualize data acumen across three levels or roles (Garber, 2019). The first are the data scientists, trained in statistics, computer science, quantitative social sciences, or related fields. The second are researchers trained in a specific field, such as public health or political science, who also have a range of training in data science, obtained through a master’s degree, certificate programs, or hands-on programs such as the University of Virginia’s Data Science for the Public Good program (UVA, 2019). This second group plays a bridging role by bringing together multidisciplinary teams. The third group are the consumers of data science applications. The first and second groups may overlap with respect to skills, expertise, and application. The third group requires a basic understanding of data science, that is, they must be data literate (Garber, 2019).
Data acumen is both a baseline and overarching concept. A data literate person should conceptually understand the basics of data science, (e.g., the data science framework described in Figure 1 is a good guide), and be able to articulate questions that require data to provide evidence:
What is the problem?
What are the research questions to support the problem?
What data sources might inform the questions? Why?
How are these data born? What are the biases and ethical considerations?
What are the findings? Do they make sense? Do I trust them? How can I use them?
A data literate person understands the entire process, even if they do not have the skills to undertake the statistical research. Data acumen requires an understanding of how data are born and why that matters for evaluating the quality of the data for the research question being addressed. As many types of data are discovered and repurposed to address analytical questions, this aspect of data literacy is increasingly important. Being data literate also helps us understand why our intuition is often not right (Kahneman, 2011). We believe that building the data capacity and acumen of decision makers is an important facet of data science.
Without applications (problems), doing data science would not exist. Our data science framework and research processes are fundamentally tied to practical problem solving and can be used in diverse settings. We provide a case study of using local data to address questions raised by county officials. Some contrasting examples that make formal use of the data science framework are the application to industry supply chain synchronization and the application to measuring the value and impact of open source software (Keller et al., 2018; Pires et al., 2017).
We have highlighted data discovery as a critical but often overlooked step in most data science frameworks. Without data discovery, we would fall back on data sources that are convenient. Data discovery expands the power of data science by considering many new data sources, not only designed sources. We are also developing new behaviors by adopting a principles-based approach to ethical considerations as a critical underlying feature throughout the data science lifecycle. Each step of the data science framework involves documentation of decisions made, methods used, and findings, allowing opportunity for data repurposing and reuse, sharing, and reproducibility.
Our data science framework provides a rigorous and repeatable, yet flexible, foundation for doing data science. The framework can serve as a continually evolving roadmap for the field of data science as we work together to embrace the ever-changing data environment. It also highlights the need for supporting the development of data acumen among stakeholders, subject matter experts, and decision makers.
We would like to acknowledge our colleagues who contributed to the research projects described in this paper: Dr. Vicki Lancaster and Dr. Joshua Goldstein, both with the Social & Decision Analytics Division, Biocomplexity Institute & Initiative (BII), University of Virginia; Dr. Ian Crandell, Virginia Tech; and Dr. Emily Molfino, U.S. Census Bureau. We would also like to thank Dr. Cathie Woteki, Distinguished Institute Professor, Biocomplexity Institute & Initiative, University of Virginia, and Professor of Food Science and Human Nutrition at Iowa State University, who provided subject matter expertise and review of the research. Our sponsors, Michelle Gregory and Sophia Dutton, Office of Strategy Management, Fairfax County Health and Human Services, supported the research and provided context for many of the findings.
This research was partially supported by the U.S. Census Bureau under a contract with the MITRE Corporation; the National Science Foundation’s National Center for Science and Engineering Statistics under a cooperative agreement with the U.S. Department of Agriculture, National Agricultural Statistics Service; the U.S. Army Research Institute for the Social and Behavioral Sciences; and Fairfax County, Virginia.
ACM Committee on Professional Ethics. (2018). Association for Computing Machinery (ACM) code of ethics and professional conduct. Retrieved December 1, 2019, from https://www.acm.org/binaries/content/assets/about/acm-code-of-ethics-and-professional-conduct.pdf
Adhikari, A., & DeNero, J. (2019). The foundations of data science. Retrieved December 1, 2019, from https://www.inferentialthinking.com/chapters/intro#The-Foundations-of-Data-Science
American Physical Society. (2019). Ethics and values. Retrieved from https://www.aps.org/policy/statements/index.cfm
Atkinson, T., Cantillon, B., Marlier, E., & Nolan, B. (2002). Social indicators: The EU and social inclusion. Oxford, UK: Oxford University Press.
Beckman, R. J., Baggerly, K. A., & McKay, M. D. (1996). Creating synthetic baseline populations. Transportation Research Part A: Policy and Practice, 30 (6), 415–429. https://doi.org/10.1016/0965-8564(96)00004-3
Berinato, S. (2019). Data science and the art of persuasion: Organizations struggle to communicate the insights in all the information they’ve amassed. Here’s why, and how to fix it. Harvard Business Review, 97 (1). Retrieved from https://hbr.org/2019/01/data-science-and-the-art-of-persuasion
Borgman, C. L. (2019). The lives and after lives of data. Harvard Data Science Review , 1 (1). https://doi.org/10.1162/99608f92.9a36bdb6
Box, G. E. P., Hunter, W. G., & Hunter, J. S. (1978). Statistics for experimenters. Hoboken, NJ: Wiley, pp. 563–571.
Bughin, J., Seong, J., Manyika, J., Chui, M., & Joshi, R. (2018). Notes from the AI frontier: Modeling the impact of AI on the world economy. Stamford, CT: McKinsey Global Institute.
Berkeley School of Information. (2019). What is data science? Retrieved December 1, 2019, from https://datascience.berkeley.edu/about/what-is-data-science/
Berman, F., Rutenbar, R., Hailpern, B., Christensen, H., Davidson, S., Estrin, D.,…Szalay, A. (2018). Realizing the potential of data science. Communications of the ACM , 61 (4), 67–72. https://doi.org/10.1145/3188721
Brackstone, G. (1999). Managing data quality in a statistical agency. Survey Methodology, 25 (2), 139–150. https://repositorio.cepal.org//handle/11362/16457
Committee on Professional Ethics of the American Statistical Association. (2018). Ethical guidelines for statistical practice. Retrieved from https://www.amstat.org/asa/files/pdfs/EthicalGuidelines.pdf
Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning . Hoboken, NJ: Wiley.
Data diversity. (2019, January 11). Nature Human Behaviour, 3 , 1–2. https://doi.org/10.1038/s41562-018-0525-y
Data for Democracy. (2018). A community-engineered ethical framework for practical application in your data work. Global Data Ethics Project. Retrieved December 1, 2019, from https://www.datafordemocracy.org/documents/GDEP-Ethics-Framework-Principles-one-sheet.pdf
De Veaux, R., Hoerl, R., & Snee, R. (2016). Big data and the missing links. Statistical Analysis and Data Mining, 9 (6), 411–416. https://doi.org/10.1002/sam.11303
Editorial: Nature research integrity is much more than misconduct [Editorial]. (2019, June 6). Nature 570, 5. https://doi.org/10.1038/d41586-019-01727-0
Garber, A. (2019). Data science: What the educated citizen needs to know. Harvard Data Science Review, 1 (1). https://doi.org/10.1162/99608f92.88ba42cb
Japec, L., Kreuter, F., Berg, M., Biemer, P., Decker, P., Lampe, C., . . .Usher, A. (2015). Big data in survey research: AAPOR task force report. Public Opinion Quarterly, 79 , 839–880. https://doi.org/10.1093/poq/nfv039
Kahneman, D. (2011). Thinking, fast and slow . New York, NY: Farrar, Straus and Giroux.
Keller-McNulty, S. (2007). From data to policy: Scientific excellence is our future. Journal of the American Statistical Association, 102 (478), 395–399. https://doi.org/10.1198/016214507000000275
Keller, S. A., Shipp, S., & Schroeder, A. (2016). Does big data change the privacy landscape? A review of the issues. Annual Review of Statistics and Its Application, 3, 161–180. https://doi.org/10.1146/annurev-statistics-041715-033453
Keller, S., Korkmaz, G., Orr, M., Schroeder, A., & Shipp, S. (2017). The evolution of data quality: Understanding the transdisciplinary origins of data quality concepts and approaches. Annual Review of Statistics and Its Application, 4, 85–108. https://doi.org/10.1146/annurev-statistics-060116-054114
Keller, S., Korkmaz, G., Robbins, C., Shipp, S. (2018) Opportunities to observe and measure intangible inputs to innovation: Definitions, operationalization, and examples. Proceedings of the National Academy of Sciences (PNAS), 115 (50), 12638–12645. https://doi.org/10.1073/pnas.1800467115
Keller, S., Lancaster, V., & Shipp, S. (2017). Building capacity for data driven governance: Creating a new foundation for democracy. Statistics and Public Policy, 4 (1), 1–11. https://doi.org/10.1080/2330443X.2017.1374897
Khan, M. A., Uddin, M. F., & Gupta, N. (2014, April). Seven V's of Big Data: Understanding Big Data to extract value. In Proceedings of the 2014 Zone 1 Conference of the American Society for Engineering Education (ASEE). https://doi.org/10.1109/ASEEZone1.2014.6820689
Kleinberg, J., Ludwig, J., Mullainathan, S., & Obermeyer, Z. (2015). Prediction policy problems. American Economic Review, 105 (5), 491–495. https://doi.org/10.1257/aer.p20151023
Leek, J. T., & Peng, R. D. (2015). What is the question? Science, 347 (6228) , 1314–1315. https://doi.org/10.1126/science.aaa6146
Leonelli, S. (2019). Data governance is key to interpretation: Reconceptualizing data in data science. Harvard Data Science Review, 1 (1). https://doi.org/10.1162/99608f92.17405bb6
National Research Council. (2007). Engaging privacy and information technology in a digital age . Washington, DC: National Academies Press.
Office for Human Research Protections. (2016). Institutional Review Board (IRB) Written Procedures: Guidance for Institutions and IRBs (draft guidance issued in August 2016). Department of Health and Human Services. Washington DC. Retrieved from: https://www.hhs.gov/ohrp/regulations-and-policy/requests-for-comments/guidance-for-institutions-and-irbs/index.html
Prado, J. C., & Marzal, M. Á. (2013). Incorporating data literacy into information literacy programs: Core competencies and contents. Libri, 63 (2), 123–134. https://doi.org/10.1515/libri-2013-0010
Pierce, H. H., Dev, A., Statham, E., & Bierer, B. E. (2019, June 6). Credit data generators for data reuse. Nature, 570 (7759), 30–32. https://doi.org/10.1038/d41586-019-01715-4
Pires, B., Goldstein, J. Higdon, D., Sabin, P., Korkmaz, G., Shipp, S., ... Reese, S. (2017). A Bayesian simulation approach for supply chain synchronization . In the 2017 Winter Simulation Conference (pp . 1571–1582). New York, NY: IEEE. https://doi.org/10.1109/WSC.2017.8247898
Salganik, M. J. (2017). Bit by bit: Social research in the digital age . Princeton, NJ: Princeton University Press.
Snee, R. D., DeVeaux, R. D., & Hoerl, R. W. (2014). Follow the fundamentals. Quality Progress, 47 (1), 24–28. https://search-proquest-com.proxy01.its.virginia.edu/docview/1491963574?accountid=14678
Stall, S., Yarmey, L., Cutcher-Gershenfeld, J., Hanson, B., Lehnert, K., Nosek, B., ... & Wyborn, L. (2019, June 6). Make all scientific data FAIR. Nature, 570 (7759), 27–29. https://doi.org/10.1038/d41586-019-01720-7
United Nations Economic Commission for Europe (UNECE). (2014). A suggested framework for the quality of big data. Retrieved December 1, 2019, from https://statswiki.unece.org/download/attachments/108102944/Big%20Data%20Quality%20Framework%20-%20final-%20Jan08-2015.pdf?version=1&modificationDate=1420725063663&api=v2
United Nations Economic Commission for Europe (UNECE). (2015). Using administrative and secondary sources for official statistics: A handbook of principles and practices. Retrieved December 1, 2019, from https://unstats.un.org/unsd/EconStatKB/KnowledgebaseArticle10349.aspx
University of Virginia (UVA). (2019). Data Science for the Public Good Young Scholars Program. Retrieved December 1, 2019, from https://biocomplexity.virginia.edu/social-decision-analytics/dspg-program
Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12 (4), 5–33. https://doi.org/10.1080/07421222.1996.11518099
Weber, B. (2018, May 17). Data science for startups: Data pipelines (Part 3). Towards Data Science . Retrieved from https://towardsdatascience.com/data-science-for-startups-data-pipelines-786f6746a59a
Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59 (10), 1–23. https://doi.org/10.18637/jss.v059.i10
Wing, J. M. (2019). The data life cycle. Harvard Data Science Review, 1 (1). https://doi.org/10.1162/99608f92.e26845b4
Data Science Project Ethics Checklist
June 8, 2020: The authors added a description of, and questions about, Data Wrangling to the Ethics Checklist in the Appendix.
©2020 Sallie Ann Keller, Stephanie S. Shipp, Aaron D. Schroeder, and Gizem Korkmaz. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.