These projects were created as part of Codecademy's Data Engineer Career Path course. (Latest projects are added to the top of the list.)
There were several portfolio projects created as part of this course. As they were larger and more extensive programs they can be found in their own repositories:
- Portfolio Project: GitHub Data Engineering Portfolio (TODO)
- Portfolio Project: Subscriber Cancellations Data Pipeline (TODO)
- Portfolio Project: Bike Rental Data Management (TODO)
I have previously completed a number of the projects within Python Fundamentals and all of the projects within SQL Fundamentals as part of Codecademy's Computer Science Career Path course. For completeness, I have included the projects here as well, and have marked them with a *.
- Learn MongoDB
- Learn Git II: Git For Deployment
- Learn Git: Introduction To Version Control
- Advanced Python
- Learn The Command Line
- Intermediate Python
- Introduction To Big Data With PySpark
- Getting Started Off-Platform
- Data Wrangling, Cleaning And Tidying
- Advanced SQL
- Python Pandas
- SQL Fundamentals *
- Python Fundamentals
- Other
The aim of this project was to use Pandas to analyse data regarding a websites funnel for visits.csv
, cart.csv
, checkout.csv
and purchase.csv
. The funnel describes the process of:
- A user visits the website
- A user adds a product to the cart
- A user clicks "checkout"
- A user completes a purchase
Analysis was done on each step of the funnel to see how many people continue through the funnel and ultimately make a purchase.
-
merge()
was used to help merge various DataFrames together e.g.visits_cart = visits.merge(cart, how='left')
-
isnull()
was used to check which timestamps hadnull
for that particular column, i.e. the user never went to the next funnel stage e.g.null_purchases = len(checkout_purchase[checkout_purchase.purchase_time.isnull()])
- Solution URL: Page Visits Funnel
- Other files:
The aim of this project was to use Pandas to analyse data from adclicks.csv
. It is an A/B testing on the number of people clicking on the ads of a hypothetical website called ShoeFly.com They have two different versions of an ad, which they have placed in emails, on Facebook, Twitter, and Google. Analysis of the data for the two ads was done to see how they are performing on each of the different platforms and on each day of the week.
-
groupby()
was used to help organise the data for analysis e.g.ad_clicks_by_source = ad_clicks.groupby('utm_source').user_id.count().reset_index()
-
pivot()
was used to pivot the data in order to make it more readable e.g.clicks_pivot = clicks_by_source.pivot( columns='is_click', index='utm_source', values='user_id' ).reset_index() print(clicks_pivot)
-
Solution URL: A/B Testing For ShoeFly.com
-
Other files:
-
I rounded the
percent_clicked
to 1dp in all the pivot tables where it was used e.g.a_clicks_grouped['percent_clicked'] = round((a_clicks_grouped['user_id'] / a_clicks_grouped.groupby('day')['user_id'].transform('sum') * 100),1)
The aim of this project was to use Pandas to analyse data from inventory.csv
. A number of columns were added to the data to enhance the information that could be extracted from the data, and also making use of lambda functions:
-
in_stock
column: makes use ofquantity
column to work out whether in stock i.e.True
orFalse
. -
total_value
column: makes use ofprice
andquantity
columns to calculate the total value. -
full_description
column: makes use of and combinesproduct_type
andproduct_description
columns to create a full description.inventory['in_stock'] = inventory.quantity.apply(lambda quantity: True if quantity > 0 else False) inventory['total_value'] = inventory.price * inventory.quantity combine_lambda = lambda row: '{} - {}'.format(row.product_type, row.product_description) inventory['full_description'] = inventory.apply(combine_lambda, axis=1)
- Solution URL: Petal Blossom
- Other files:
The aim of this project was to set constraints on a database used for keeping track of parts, their manufacturer, location in storeroom, inventory etc, to enable data quality checks to ensure that only valid data can be entered into the database.
The database had the initial following schema:
To ensure data quality the following were utilised on the tables as appropriate:
ADD UNIQUE (column_name)
ALTER COLUMN column_name SET NOT NULL
ADD CHECK (appropriate limitation)
ADD PRIMARY KEY (column_name)
ADD FOREIGN KEY (column_name) REFERENCES table_name(column_name)
- Solution URL: Building An Inventory Database With PostgreSQL
The aims of this project were to:
- Design a database schema on any topic.
- Implement the schema using Postbird (which is an open source PostgreSQL GUI client).
I chose to design a database around a hypothetical UK secondary school. I focused the database around people who would be closely associated with the school and how they are linked together to decide on what information to add and how to organise it. I designed the following schema for the database (I've included the DBML database markup language file):
I got ChatGPT to generate hypothetical data based of the schema:
While using ChatGPT helped speed up the generation of data, it didn't keep the data consistent to what it was supposed to add for each individual. For instance, it added a staff member with a note to state they were an art teacher, but never added art teacher as a job. So there were some inconsistencies when trying to add the data that I needed to clear up when inserting data into the table.
-
Solution URL: Designing A Database From Scratch
-
I added a few test queries to check that I could pull out information appropriately and as expected.
The aim of this project was to design a database schema based around a fictional restaurant "Bytes of China" and perform the following tasks:
- Create tables
- Define relationships between tables
- Designate appropriate columns as keys
- Insert sample data and
- Make queries from the database
The database has the following schema:
I learnt how to use dbdiagram.io to create an accurate schema and also how to validate and check keys and relationships in the database using information_schema.key_column_usage
:
SELECT
constraint_name,
table_name,
column_name
FROM information_schema.key_column_usage
WHERE table_name = 'restaurant';
- Solution URL: Build A Menu For Bytes Of China
The aim of this project was make queries to a database containing multiple tables of Lyft trip data information using SQL commands knowledge to date and using JOIN
and UNION
commands. The database tables have the following schema:
- Example use of
LEFT JOIN
where it was used to create a trip log with the trips and its users:SELECT trips.date, trips.pickup, trips.dropoff, trips.type, trips.cost, riders.first, riders.last, riders.username FROM trips LEFT JOIN riders ON trips.rider_id = riders.id;
- Solution URL: Lyft Trip Data
The aim of this project was make queries to a database table of Hacker News stories information using SQL commands knowledge to date and along with strftime()
function. The database table has the following schema:
- Example use of
strftime()
where it was used to find the best time for users to post news stories to get the best scores:SELECT strftime('%H', timestamp) AS 'Hour', ROUND(AVG(score), 1) AS 'Average Score', COUNT(*) AS 'Number of Stories' FROM hacker_news WHERE timestamp IS NOT NULL GROUP BY 1 ORDER BY 2 DESC;
- Solution URL: Analyse Hacker News Trends
The aim of this project was make queries using aggregate functions to a database table of startup companies information using SQL commands. The database table has the following schema:
- Aggregate functions used:
COUNT()
,SUM()
,MAX()
,MIN()
,AVG()
,ROUND()
GROUP BY column_name;
HAVING aggregrate function conditon;
- Solution URL: Trends In Startups
The aim of this project was make queries to a database table of restaurant information using SQL commands. The database table has the following schema:
SELECT DISTINCT column_name FROM table_name;
WHERE column_name condition;
WHERE column_name LIKE pattern;
WHERE column_name condition AND or OR column_name condition;
WHERE column_name IS NULL;
ORDER BY column_name DESC LIMIT number;
- Also used
CASE
,WHEN
,THEN
,ELSE
,END AS
SELECT name, CASE WHEN review > 4.5 THEN 'Extraordinary' WHEN review > 4 THEN 'Excellent' WHEN review > 3 THEN 'Good' WHEN review > 2 THEN 'Fair' ELSE 'Poor' END AS 'Review' FROM nomnom;
- Solution URL: New York Restaurants
The aim of this project was to create a friends table and add/delete data to it using basic SQL commands.
CREATE TABLE table_name (column_name data_type);
INSERT INTO table_name (column_name) VALUES (data);
SELECT column_name or * FROM table_name;
UPDATE table_name SET column_name = data;
ALTER TABLE table_name ADD COLUMN column_name data_type;
DELETE FROM table_name WHERE column_name = data;
- Solution URL: Create A Table
The aim of this project was to create a Python program that reads and writes to files. Extracting Username
information from passwords.csv
to create a list of user names in compromised_user.csv
. Creating boss_message.json
and new_passwords.csv
.
- Imported
csv
,json
andos
. with open("file_name", "w")
used to write files, "w" omitted when using only for reading.csv.DictReader()
used to read passwords.csv file.json.dump(dictionary_name, json_file_name)
used to write information to json file from Python dictionary.file_object.write("text_to_add")
used to write information to file.
- Solution URL: Hacking The Fender
- Other files:
The aim of this project was to write Python functions using Jupyter Notebook to code and decode messages using a Caesar Cipher and a Vigenère Cipher.
The Caesar Cipher is a simple offset cipher. However, I found the Vigenère Cipher more difficult to understand how to code as each letter has a different offset depending upon the key word used.
def vigenere_decode(message, key):
decoded_message = ""
key_phrase = ""
key_index = 0
for char in message:
if key_index >= len(key):
key_index = 0
if char in alphabet:
key_phrase += key[key_index]
key_index += 1
else:
key_phrase += char
for i in range(len(message)):
if message[i] in alphabet:
old_char_index = alphabet.index(message[i])
offset_index = alphabet.index(key_phrase[i])
new_char = alphabet[(old_char_index + offset_index) % 26]
decoded_message += new_char
else:
decoded_message += message[i]
return decoded_message
- Solution URL: Coded Correspondence
The aim of this project was to create a Python program that takes a list of sales information in a string format (customer name, price, colour(s) of thread purchased and date), and then use a variety of techniques to clean up the data into easier-to-access information.
string.replace()
used to help with clarifying appropriate sales transaction.string.split()
used to help split up string into appropriate sections.string.strip()
used to clear up and remove whitespace in transaction information.for
loops used to iterate through lists of transactions.list.append
to add information into smaller appropriate groups of lists e.g. customers, sales price and colour of thread purchased.- Function defined to calculate the total numbers sold for each colour thread.
print
and.format()
used to print out a formatted string of the number of each colour thread purchased.
- Solution URL: Thread Shed
The aim of this project was to install Jupyter Notebook and use it to write a program to work out which night the most people can attend for a gamers night.
pip install notebook
and to run the notebook, I didcd file_directory_name
and thenjupyter notebook
in the command line.
- Solution URL: Abruptly Goblins
The aim of this project was to create a Python program that processes some data from a group of friends playing scrabble. Dictionaries are used to organise players, words and points.
-
List comprehension used to create a dictionary from two provided lists of letters and their points.
letters_to_points = {key:value for key, value in zip(letters, points)}
-
Functions defined to:
-
play_word()
adds a new word played by a player.- Sets
word.upper()
as letters in letters_to_points dictionary are all uppercase. - Sets
player.title()
so that if names are entered differently with lowercase or uppercase letters they will still match when compared toplayer_to_words
dictionary. - If player already exists, word is added to their played list in player_to_words dictionary.
- If player doesn't exist, then the new player along with their word is added to player_to_words dictionary.
- Calls
update_point_totals()
function.
- Sets
-
update_point_totals()
updates the total points scored for the player.- Calls
score_word()
function. - If player already exists, points are added to their total score in player_to_points dictionary.
- If player doesn't exist, then the new player along with their score is added to player_to_points dictionary.
- Calls
-
score_word()
calculates and returns the points score of a word. -
play_round()
initialises the program.- Gets player name and word from user.
- Calls
play_word()
function. - Calls
another_round()
function.
-
another_round()
asks user whether they wish to enter another player's word.- Gets Y/N input from user. Input changed to uppercase.
- If response is
Y
orYES
, then callsplay_round()
function. - If response not
Y
orYES
, then callsshow_results()
function.
-
show_results()
iterates throughplayer_to_points
dictionary to print out names and total scores of each player.
-
- Solution URL: Scrabble
- Remove hardcoded player's data and ask for input of name and word from user - ADDED TO CODE.
- Check whether more words to be added and scored - ADDED TO CODE.
- Show overall results of players in formatted strings - ADDED TO CODE.
The aim of this project was to create a Python program that uses classes and functions to work out what menus are available at different times of day and to calculate bill costs. Classes defined for:
- Menu:
- String representation for name of menu and when it is available
calculate_bill()
calculates the cost of a particular meal
- Franchise:
- String representation for name of restaurant
available_menus()
works out which menus are currently available depending upon the time of day
- Business takes the name of the business and a list of its franchises.
- Solution URL: Basta Fazoolin'
The aim of this project was to create a Python program that uses functions and return
to help calculate some fundamental physics properties.
- Functions defined to calculate:
- Fahrenheit to Celsius temperatures
- Celsius to Fahrenheit temperatures
- Force
- Energy
- Work
- Values to test the program are hardcoded into the program.
- Solution URL: Getting Ready For Physics Class
- Have the user choose which physics property to calculate.
- Have the user enter appropriate values for the physics property they have chosen to calculate.
The aim of this project was to create a Python program that uses imports and work with datetime
to simulate time travel and to perform calculations using dates.
-
Created custom module that was imported into the main script.
-
Used list for possible destinations.
-
Use of
random
,decimal
,datetime
to carry out calculations:base_cost = Decimal('1000.00') current_year = current_date.year start_year = -2000 end_year = 3500 target_year = randint(start_year, end_year) cost_multiplier = abs(current_year - target_year) final_cost = round(base_cost * cost_multiplier, 2)
- Solution URL: Time Travelers Toolkit
- Other files:
The aim of this project was to create a Python program that calculates some metrics from lists of data.
- Lists of hairstyles, prices and last week's sales are hardcoded into the program.
for
, list comprehensions,range()
,len()
andif
are used to calculate average prices, decreased prices, total revenue, average daily revenue and types of haircuts that cost less than £30.
- Solution URL: Carly's Clippers
The aim of this project was to create a Python program that takes pizzas and their prices using lists and alters the lists to organise the data.
- 2D lists of pizzas and prices are hardcoded into the program.
list.count(item)
used to count how many pizzas are $2.len(list)
used to count the number of different kinds of pizzas.list.sort()
to sort the list in ascending order of price.list.pop()
to remove the most expensive pizza.list.insert(index, item)
to add a new pizza in appropriate position to keep price sorted in list.list[:3]
to find the cheapest three pizzas.
- Solution URL: Len's Slice
The aim of this project was to create a Python program that takes student data and organizes subjects and grades using lists.
- 2D lists of subjects and grades are hardcoded into the program.
list.append(item)
,list[index].remove(item)
are used to alter subjects and grades andprint()
out gradebook information to the user.
- Solution URL: Gradebook
- Have the user input the initial subjects and grades.
- Have the user be able to alter subjects and grades.
The aim of this project was to create a Python program that asks the user for the weight of their package and then tells them which method of shipping is cheapest and how much it will cost to ship their package using Sal’s Shippers.
weight
variable is hardcoded into the program.if
,elif
andelse
used to calculate cost shipping andprint()
out costs.
- Solution URL: Sal's Shipping
- Have the user input the packages weight.
The aim of this project was to create a Python program that can answer any "Yes" or "No" question with a different fortune each time it executes.
name
andquestion
variables are hardcoded into the program.random
module withrandint()
used to generate a random number within a specified range.if
,elif
andelse
used to select answers andprint()
out appropriate response.
- Solution URL: Magic 8-Ball
- Have the user input their name and question.
- V. Tickner