Data Engineer Projects

These projects were created as part of Codecademy's Data Engineer Career Path course. (Latest projects are added to the top of the list.)

There were several portfolio projects created as part of this course. As they were larger and more extensive programs they can be found in their own repositories:

Portfolio Project: GitHub Data Engineering Portfolio (TODO)
Portfolio Project: Subscriber Cancellations Data Pipeline (TODO)
Portfolio Project: Bike Rental Data Management (TODO)

I have previously completed a number of the projects within Python Fundamentals and all of the projects within SQL Fundamentals as part of Codecademy's Computer Science Career Path course. For completeness, I have included the projects here as well, and have marked them with a *.

Learn MongoDB
Learn Git II: Git For Deployment
Learn Git: Introduction To Version Control
Advanced Python
Learn The Command Line
Intermediate Python
Introduction To Big Data With PySpark
Getting Started Off-Platform
Data Wrangling, Cleaning And Tidying
Advanced SQL
Python Pandas
SQL Fundamentals *
Python Fundamentals
- Hacking The Fender *
- Coded Correspondence
- Thread Shed *
- Abruptly Goblins
- Scrabble *
- Basta Fazoolin'
- Getting Ready For Physics Class *
- Time Travelers Toolkit
- Carly's Clippers *
- Len's Slice *
- Gradebook *
- Sal's Shipping *
- Magic 8-Ball *
Other
- Author

Python Pandas

Page Visits Funnnel

The aim of this project was to use Pandas to analyse data regarding a websites funnel for visits.csv, cart.csv, checkout.csv and purchase.csv. The funnel describes the process of:

A user visits the website
A user adds a product to the cart
A user clicks "checkout"
A user completes a purchase

Analysis was done on each step of the funnel to see how many people continue through the funnel and ultimately make a purchase.

merge() was used to help merge various DataFrames together e.g.
```
visits_cart = visits.merge(cart, how='left')
```
isnull() was used to check which timestamps had null for that particular column, i.e. the user never went to the next funnel stage e.g.
```
null_purchases = len(checkout_purchase[checkout_purchase.purchase_time.isnull()])
```

Code & Potential Improvements

Solution URL: Page Visits Funnel
Other files:

A/B Testing For ShoeFly.com

The aim of this project was to use Pandas to analyse data from adclicks.csv. It is an A/B testing on the number of people clicking on the ads of a hypothetical website called ShoeFly.com They have two different versions of an ad, which they have placed in emails, on Facebook, Twitter, and Google. Analysis of the data for the two ads was done to see how they are performing on each of the different platforms and on each day of the week.

groupby() was used to help organise the data for analysis e.g.

ad_clicks_by_source = ad_clicks.groupby('utm_source').user_id.count().reset_index()

pivot() was used to pivot the data in order to make it more readable e.g.

clicks_pivot = clicks_by_source.pivot(
  columns='is_click',
  index='utm_source',
  values='user_id'
).reset_index()
print(clicks_pivot)

Code & Potential Improvements

Solution URL: A/B Testing For ShoeFly.com
Other files:
- adclicks.csv

I rounded the percent_clicked to 1dp in all the pivot tables where it was used e.g.

a_clicks_grouped['percent_clicked'] = round((a_clicks_grouped['user_id'] / a_clicks_grouped.groupby('day')['user_id'].transform('sum') * 100),1)

Petal Power Inventory

The aim of this project was to use Pandas to analyse data from inventory.csv. A number of columns were added to the data to enhance the information that could be extracted from the data, and also making use of lambda functions:

in_stock column: makes use of quantity column to work out whether in stock i.e. True or False.
total_value column: makes use of price and quantity columns to calculate the total value.

full_description column: makes use of and combines product_type and product_description columns to create a full description.

inventory['in_stock'] = inventory.quantity.apply(lambda quantity: True if quantity > 0 else False)

inventory['total_value'] = inventory.price * inventory.quantity

combine_lambda = lambda row: '{} - {}'.format(row.product_type, row.product_description)
inventory['full_description'] = inventory.apply(combine_lambda, axis=1)

Code & Potential Improvements

Solution URL: Petal Blossom
Other files:
- inventory.csv

SQL Fundamentals

Building An Inventory Database With PostgreSQL

The aim of this project was to set constraints on a database used for keeping track of parts, their manufacturer, location in storeroom, inventory etc, to enable data quality checks to ensure that only valid data can be entered into the database.

The database had the initial following schema:

To ensure data quality the following were utilised on the tables as appropriate:

ADD UNIQUE (column_name)
ALTER COLUMN column_name SET NOT NULL
ADD CHECK (appropriate limitation)
ADD PRIMARY KEY (column_name)
ADD FOREIGN KEY (column_name) REFERENCES table_name(column_name)

Code & Potential Improvements

Solution URL: Building An Inventory Database With PostgreSQL

Designing A Database From Scratch

The aims of this project were to:

Design a database schema on any topic.
Implement the schema using Postbird (which is an open source PostgreSQL GUI client).

I chose to design a database around a hypothetical UK secondary school. I focused the database around people who would be closely associated with the school and how they are linked together to decide on what information to add and how to organise it. I designed the following schema for the database (I've included the DBML database markup language file):

I got ChatGPT to generate hypothetical data based of the schema:

While using ChatGPT helped speed up the generation of data, it didn't keep the data consistent to what it was supposed to add for each individual. For instance, it added a staff member with a note to state they were an art teacher, but never added art teacher as a job. So there were some inconsistencies when trying to add the data that I needed to clear up when inserting data into the table.

Code & Potential Improvements

Solution URL: Designing A Database From Scratch
I added a few test queries to check that I could pull out information appropriately and as expected.

Build A Menu For Bytes Of China

The aim of this project was to design a database schema based around a fictional restaurant "Bytes of China" and perform the following tasks:

Create tables
Define relationships between tables
Designate appropriate columns as keys
Insert sample data and
Make queries from the database

The database has the following schema:

I learnt how to use dbdiagram.io to create an accurate schema and also how to validate and check keys and relationships in the database using information_schema.key_column_usage:

SELECT
   constraint_name,
   table_name,
   column_name
FROM information_schema.key_column_usage
WHERE table_name = 'restaurant';

Code & Potential Improvements

Solution URL: Build A Menu For Bytes Of China

Lyft Trip Data

The aim of this project was make queries to a database containing multiple tables of Lyft trip data information using SQL commands knowledge to date and using JOIN and UNION commands. The database tables have the following schema:

Example use of LEFT JOIN where it was used to create a trip log with the trips and its users:

SELECT trips.date,
  trips.pickup,
  trips.dropoff,
  trips.type,
  trips.cost,
  riders.first,
  riders.last,
  riders.username
FROM trips
LEFT JOIN riders
  ON trips.rider_id = riders.id;

Code & Potential Improvements

Solution URL: Lyft Trip Data

Analyse Hacker News Trends

The aim of this project was make queries to a database table of Hacker News stories information using SQL commands knowledge to date and along with strftime() function. The database table has the following schema:

Example use of strftime() where it was used to find the best time for users to post news stories to get the best scores:

SELECT strftime('%H', timestamp) AS 'Hour',
  ROUND(AVG(score), 1) AS 'Average Score',
  COUNT(*) AS 'Number of Stories'
FROM hacker_news
WHERE timestamp IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC;

Code & Potential Improvements

Solution URL: Analyse Hacker News Trends

Trends In Startups

The aim of this project was make queries using aggregate functions to a database table of startup companies information using SQL commands. The database table has the following schema:

Aggregate functions used: COUNT(), SUM(), MAX(), MIN(), AVG(), ROUND()
GROUP BY column_name;
HAVING aggregrate function conditon;

Code & Potential Improvements

Solution URL: Trends In Startups

New York Restaurants

The aim of this project was make queries to a database table of restaurant information using SQL commands. The database table has the following schema:

SELECT DISTINCT column_name FROM table_name;
WHERE column_name condition;
WHERE column_name LIKE pattern;
WHERE column_name condition AND or OR column_name condition;
WHERE column_name IS NULL;
ORDER BY column_name DESC LIMIT number;

Also used CASE, WHEN, THEN, ELSE, END AS

SELECT name,
  CASE
    WHEN review > 4.5 THEN 'Extraordinary'
    WHEN review > 4 THEN 'Excellent'
    WHEN review > 3 THEN 'Good'
    WHEN review > 2 THEN 'Fair'
    ELSE 'Poor'
  END AS 'Review'
FROM nomnom;

Code & Potential Improvements

Solution URL: New York Restaurants

Create A Table

The aim of this project was to create a friends table and add/delete data to it using basic SQL commands.

CREATE TABLE table_name (column_name data_type);
INSERT INTO table_name (column_name) VALUES (data);
SELECT column_name or * FROM table_name;
UPDATE table_name SET column_name = data;
ALTER TABLE table_name ADD COLUMN column_name data_type;
DELETE FROM table_name WHERE column_name = data;

Code & Potential Improvements

Solution URL: Create A Table

Python Fundamentals

Hacking The Fender

The aim of this project was to create a Python program that reads and writes to files. Extracting Username information from passwords.csv to create a list of user names in compromised_user.csv. Creating boss_message.json and new_passwords.csv.

Imported csv, json and os.
with open("file_name", "w") used to write files, "w" omitted when using only for reading.
csv.DictReader() used to read passwords.csv file.
json.dump(dictionary_name, json_file_name) used to write information to json file from Python dictionary.
file_object.write("text_to_add") used to write information to file.

Code & Potential Improvements

Solution URL: Hacking The Fender
Other files:

Coded Correspondence

The aim of this project was to write Python functions using Jupyter Notebook to code and decode messages using a Caesar Cipher and a Vigenère Cipher.

The Caesar Cipher is a simple offset cipher. However, I found the Vigenère Cipher more difficult to understand how to code as each letter has a different offset depending upon the key word used.

def vigenere_decode(message, key):
  decoded_message = ""
  key_phrase = ""
  key_index = 0

  for char in message:
    if key_index >= len(key):
      key_index = 0
    if char in alphabet:
      key_phrase += key[key_index]
      key_index += 1
    else:
      key_phrase += char

  for i in range(len(message)):
    if message[i] in alphabet:
      old_char_index = alphabet.index(message[i])
      offset_index = alphabet.index(key_phrase[i])
      new_char = alphabet[(old_char_index + offset_index) % 26]
      decoded_message += new_char
    else:
      decoded_message += message[i]
  return decoded_message

Code & Potential Improvements

Solution URL: Coded Correspondence

Thread Shed

The aim of this project was to create a Python program that takes a list of sales information in a string format (customer name, price, colour(s) of thread purchased and date), and then use a variety of techniques to clean up the data into easier-to-access information.

string.replace() used to help with clarifying appropriate sales transaction.
string.split() used to help split up string into appropriate sections.
string.strip() used to clear up and remove whitespace in transaction information.
for loops used to iterate through lists of transactions.
list.append to add information into smaller appropriate groups of lists e.g. customers, sales price and colour of thread purchased.
Function defined to calculate the total numbers sold for each colour thread.
print and .format() used to print out a formatted string of the number of each colour thread purchased.

Code & Potential Improvements

Solution URL: Thread Shed

Abruptly Goblins

The aim of this project was to install Jupyter Notebook and use it to write a program to work out which night the most people can attend for a gamers night.

pip install notebook and to run the notebook, I did cd file_directory_name and then jupyter notebook in the command line.

Code & Potential Improvements

Solution URL: Abruptly Goblins

Scrabble

The aim of this project was to create a Python program that processes some data from a group of friends playing scrabble. Dictionaries are used to organise players, words and points.

List comprehension used to create a dictionary from two provided lists of letters and their points.
```
letters_to_points = {key:value for key, value in zip(letters, points)}
```
Functions defined to:
- play_word() adds a new word played by a player.
  - Sets word.upper() as letters in letters_to_points dictionary are all uppercase.
  - Sets player.title() so that if names are entered differently with lowercase or uppercase letters they will still match when compared to player_to_words dictionary.
  - If player already exists, word is added to their played list in player_to_words dictionary.
  - If player doesn't exist, then the new player along with their word is added to player_to_words dictionary.
  - Calls update_point_totals() function.
- update_point_totals() updates the total points scored for the player.
  - Calls score_word() function.
  - If player already exists, points are added to their total score in player_to_points dictionary.
  - If player doesn't exist, then the new player along with their score is added to player_to_points dictionary.
- score_word() calculates and returns the points score of a word.
- play_round() initialises the program.
  - Gets player name and word from user.
  - Calls play_word() function.
  - Calls another_round() function.
- another_round() asks user whether they wish to enter another player's word.
  - Gets Y/N input from user. Input changed to uppercase.
  - If response is Y or YES, then calls play_round() function.
  - If response not Y or YES, then calls show_results() function.
- show_results() iterates through player_to_points dictionary to print out names and total scores of each player.

Code & Potential Improvements

Solution URL: Scrabble
- Remove hardcoded player's data and ask for input of name and word from user - ADDED TO CODE.
- Check whether more words to be added and scored - ADDED TO CODE.
- Show overall results of players in formatted strings - ADDED TO CODE.

Basta Fazoolin'

The aim of this project was to create a Python program that uses classes and functions to work out what menus are available at different times of day and to calculate bill costs. Classes defined for:

Menu:
- String representation for name of menu and when it is available
- calculate_bill() calculates the cost of a particular meal
Franchise:
- String representation for name of restaurant
- available_menus() works out which menus are currently available depending upon the time of day
Business takes the name of the business and a list of its franchises.

Code & Potential Improvements

Solution URL: Basta Fazoolin'

Getting Ready For Physics Class

The aim of this project was to create a Python program that uses functions and return to help calculate some fundamental physics properties.

Functions defined to calculate:
- Fahrenheit to Celsius temperatures
- Celsius to Fahrenheit temperatures
- Force
- Energy
- Work
Values to test the program are hardcoded into the program.

Code & Potential Improvements

Solution URL: Getting Ready For Physics Class
- Have the user choose which physics property to calculate.
- Have the user enter appropriate values for the physics property they have chosen to calculate.

Time Travelers Toolkit

The aim of this project was to create a Python program that uses imports and work with datetime to simulate time travel and to perform calculations using dates.

Created custom module that was imported into the main script.
Used list for possible destinations.

Use of random, decimal, datetime to carry out calculations:

base_cost = Decimal('1000.00')
current_year = current_date.year
start_year = -2000
end_year = 3500
target_year = randint(start_year, end_year)
cost_multiplier = abs(current_year - target_year)
final_cost = round(base_cost * cost_multiplier, 2)

Code & Potential Improvements

Solution URL: Time Travelers Toolkit
Other files:
- custom_module.py

Carly's Clippers

The aim of this project was to create a Python program that calculates some metrics from lists of data.

Lists of hairstyles, prices and last week's sales are hardcoded into the program.
for, list comprehensions, range(), len() and if are used to calculate average prices, decreased prices, total revenue, average daily revenue and types of haircuts that cost less than £30.

Code & Potential Improvements

Solution URL: Carly's Clippers

Len's Slice

The aim of this project was to create a Python program that takes pizzas and their prices using lists and alters the lists to organise the data.

2D lists of pizzas and prices are hardcoded into the program.
list.count(item) used to count how many pizzas are $2.
len(list) used to count the number of different kinds of pizzas.
list.sort() to sort the list in ascending order of price.
list.pop() to remove the most expensive pizza.
list.insert(index, item) to add a new pizza in appropriate position to keep price sorted in list.
list[:3] to find the cheapest three pizzas.

Code & Potential Improvements

Solution URL: Len's Slice

Gradebook

The aim of this project was to create a Python program that takes student data and organizes subjects and grades using lists.

2D lists of subjects and grades are hardcoded into the program.
list.append(item), list[index].remove(item) are used to alter subjects and grades and print() out gradebook information to the user.

Code & Potential Improvements

Solution URL: Gradebook
- Have the user input the initial subjects and grades.
- Have the user be able to alter subjects and grades.

Sal's Shipping

The aim of this project was to create a Python program that asks the user for the weight of their package and then tells them which method of shipping is cheapest and how much it will cost to ship their package using Sal’s Shippers.

weight variable is hardcoded into the program.
if, elif and else used to calculate cost shipping and print() out costs.

Code & Potential Improvements

Solution URL: Sal's Shipping
- Have the user input the packages weight.

Magic 8-Ball

The aim of this project was to create a Python program that can answer any "Yes" or "No" question with a different fortune each time it executes.

name and question variables are hardcoded into the program.
random module with randint() used to generate a random number within a specified range.
if, elif and else used to select answers and print() out appropriate response.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
python-fundamentals		python-fundamentals
python-pandas		python-pandas
sql-fundamentals		sql-fundamentals
README.md		README.md

VTickner/codecademy-data-engineer

Folders and files

Latest commit

History

Repository files navigation

Data Engineer Projects

Table of contents

Python Pandas

Page Visits Funnnel

Code & Potential Improvements

A/B Testing For ShoeFly.com

Code & Potential Improvements

Petal Power Inventory

Code & Potential Improvements

SQL Fundamentals

Building An Inventory Database With PostgreSQL

Code & Potential Improvements

Designing A Database From Scratch

Code & Potential Improvements

Build A Menu For Bytes Of China

Code & Potential Improvements

Lyft Trip Data

Code & Potential Improvements

Analyse Hacker News Trends

Code & Potential Improvements

Trends In Startups

Code & Potential Improvements

New York Restaurants

Code & Potential Improvements

Create A Table

Code & Potential Improvements

Python Fundamentals

Hacking The Fender

Code & Potential Improvements

Coded Correspondence

Code & Potential Improvements

Thread Shed

Code & Potential Improvements

Abruptly Goblins

Code & Potential Improvements

Scrabble

Code & Potential Improvements

Basta Fazoolin'

Code & Potential Improvements

Getting Ready For Physics Class

Code & Potential Improvements

Time Travelers Toolkit

Code & Potential Improvements

Carly's Clippers

Code & Potential Improvements

Len's Slice

Code & Potential Improvements

Gradebook

Code & Potential Improvements

Sal's Shipping

Code & Potential Improvements

Magic 8-Ball

Code & Potential Improvements

Other

Author

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages