Dog Recommendation System

Published in

Web Mining [IS688, Spring 2021]

15 min read · May 3, 2021

Source: https://www.sporcle.com/games/Rackie/where-s_woof_two

Introduction

Owning a dog can be very gratifying, but it’s also a large responsibility. There are hundreds of dog breeds, each with its own unique appearance and traits, so choosing the one that fits best can be very difficult. Even for those who may have a dog in mind that they want, that dog may not be available, or there may be another dog that suits them better. To help solve this problem, I created a dog recommendation system that quickly provides dog recommendations by using either collaborative filtering or content-based filtering. The recommendation system that I created allows users to find the most popular dog breeds (using collaborative filtering), or similar dog breeds to a selected dog breed (using content-based filtering), based on the users’ preferences. People looking to own a new dog will benefit from this, especially if they aren’t that knowledgeable about different dog breeds or if they have strong preferences for certain dog traits.

1. Data Collection

I collected the data about each dog breed from https://github.com/tmfilho/akcdata/blob/master/fetch-data.ipynb, which scrapes the current data from the American Kennel Club website (https://www.akc.org/). The IPython notebook doesn’t seem to be downloadable, so I copied and pasted the content into my own Jupyter Notebook, where I continued to code the rest of my work in Python.

After a minor modification (removing the notebook’s conversion of heights from inches to cm and weights from lbs to kg), I used pandas’ .read_csv() function to read the CSV file that the copied code generates.

The DataFrame that I named df initially contained 282 rows of unique dog breeds. However, some of the rows had issues; they had multiple columns of data that had the value NaN.

There was 1 column (called “Unnamed: 0”) for the breed name, 1 column with the description, 1 column for a popularity score (1 is the most popular breed, and higher numbers correspond to less popular breeds), and 18 columns for traits.

The “min_expectancy” and “max_expectancy” columns refer to the minimum and maximum life expectancies of the dog breed, with years as the unit of measurement.

The “group” column refers to the 7 major dog groups, but there were 2 incorrect values of “Foundation Stock Service” and “Miscellaneous Class”.

The columns with “_value” in their name have values of 0.2, 0.4, 0.6, 0.8, or 1.0, which correspond to the values of the columns with “_category” in their name. Higher numbers indicate that the breed has more of the trait that the column refers to. For example, a higher shedding_value means that the breed sheds more often, which is described in words by shedding_category.

shedding_value and shedding_category sample values

2. Data Processing

Data Cleaning

My first step in processing the data was cleaning the data. I started by renaming the “Unnamed: 0” column to “breed”.

I also dropped the rows containing NaN values because they can’t be compared in my content-based filtering, and they included some wrong data, as noted in the Data Collection section. This left 186 rows of unique dog breeds in df. Dropping these rows also removed the wrong data; for example, the “group” column now contained only the 7 correct major dog groups.
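The two cleaning steps can be sketched on a toy DataFrame. The three rows below are invented for illustration and are not the AKC data; only the column name “Unnamed: 0” comes from the CSV described above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the scraped CSV (all values are invented)
toy = pd.DataFrame({
    'Unnamed: 0': ['Breed A', 'Breed B', 'Breed C'],
    'group': ['Toy Group', 'Foundation Stock Service', 'Hound Group'],
    'shedding_value': [0.2, np.nan, 0.6],
})

# Give the breed-name column a meaningful name
toy = toy.rename(columns={'Unnamed: 0': 'breed'})

# Drop every row that contains at least one NaN
toy = toy.dropna()

print(toy['breed'].tolist())
```

In this toy example, the row with the missing shedding_value is also the one with the unexpected group, so dropping it removes both problems at once.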

Creating New Columns

Next, I created new columns that would be used for recommending popular dogs and would be some of the columns used for recommending similar dogs. In this section, I will briefly explain what the columns are and what they do, and then I will elaborate and provide examples in the sections that explain how the recommendations are generated.

I first created a “high”, “medium”, and “low” column for each column that has “value” in its name. As their names indicate, they distinguish high, medium, and low values. These columns let users adjust their recommendations to their preferences in a simpler, more intuitive way. As I mentioned in the Data Collection section, the columns with “value” in their name only take the values 0.2, 0.4, 0.6, 0.8, and 1.0, so I couldn’t distribute them evenly among the 3 (high, medium, low) columns. As shown in the code snippet below, I set the “high” column to the boolean True if the corresponding “value” column was ≥ 0.8 (otherwise False), and I set the “low” column to True if the corresponding “value” column was ≤ 0.4. The only value left was 0.6, so I made the “medium” column overlap with 1 value each from the “high” and “low” columns to make it less restrictive; it is True if the corresponding “value” column is ≥ 0.4 and ≤ 0.8. While iterating through the “value” columns, I also converted each value into a 2-element list with an appended 0, which the pairwise Euclidean distance calculations covered later require.

for col in [col for col in df.columns if 'value' in col]:
    df[('high_'+col).replace('_value','')] = df[col].apply(lambda x: x >= .8)
    df[('medium_'+col).replace('_value','')] = df[col].apply(lambda x: .4 <= x <= .8)
    df[('low_'+col).replace('_value','')] = df[col].apply(lambda x: x <= .4)
    
    df[col] = df[col].apply(lambda x: [x,0])

For example, below are the columns related to shedding (which has a column with “value” in its name) again. This time the shedding_value is a list instead of a float, and the 3 right-most columns are the columns that I created that are based on the shedding_value.
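The thresholds can be checked against the five possible values. This is a minimal sketch; the names values and flags are illustrative and not part of the original notebook.

```python
import pandas as pd

# The five values that the provided "_value" columns can take
values = pd.Series([0.2, 0.4, 0.6, 0.8, 1.0])

flags = pd.DataFrame({
    'value': values,
    'low': values <= 0.4,
    'medium': (values >= 0.4) & (values <= 0.8),
    'high': values >= 0.8,
})
print(flags)
```

The values 0.4 and 0.8 each satisfy two flags, which is the intentional overlap that makes the “medium” filter less restrictive.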

I then created height, weight, and expectancy columns that provide the mean of the corresponding minimum and maximum values for each breed.

for col in ['height','weight','expectancy']:
    df[col] = (df['max_'+col] + df['min_'+col])/2

As shown in the code snippet below, I used these columns to create “high”, “medium”, and “low” columns (similar to the other columns) and to create “value” columns (like the provided “value” columns). I used pandas’ .describe() function to retrieve the 0.2, 0.33, 0.4, 0.6, 0.67, and 0.8 percentiles. The 0.33 and 0.67 percentiles split the values into high, medium, and low without overlap: above the 0.67 percentile (67%) is True for the “high” columns, below the 0.33 percentile (33%) is True for the “low” columns, and in between is True for the “medium” columns. The other percentiles set the “value” columns: a chain of lambda functions checks which percentile interval each value falls into and assigns 0.2, 0.4, 0.6, 0.8, or 1 accordingly. I then again converted the values into lists with an appended 0.

for col in ['height','weight','expectancy']:
    temp = df[col].describe(percentiles=[.2,.33,.4,.6,.67,.8])
    df['high_'+col] = df[col].apply(lambda x: x > temp['67%'])
    # Inclusive bounds so that a value exactly at a cut-off still gets a category
    df['medium_'+col] = df[col].apply(lambda x: temp['33%'] <= x <= temp['67%'])
    df['low_'+col] = df[col].apply(lambda x: x < temp['33%'])
    
    df[col+'_value'] = df[col].apply(lambda x: '1' if x >= temp['80%'] else x)
    df[col+'_value'] = df[col+'_value'].apply(lambda x: '.8' if ((type(x)!=str) and (x >= temp['60%']) and (x < temp['80%'])) else x)
    df[col+'_value'] = df[col+'_value'].apply(lambda x: '.6' if ((type(x)!=str) and (x >= temp['40%']) and (x < temp['60%'])) else x)
    df[col+'_value'] = df[col+'_value'].apply(lambda x: '.4' if ((type(x)!=str) and (x >= temp['20%']) and (x < temp['40%'])) else x)
    df[col+'_value'] = df[col+'_value'].apply(lambda x: '.2' if ((type(x)!=str) and (x < temp['20%'])) else x) 
    df[col+'_value'] = df[col+'_value'].apply(lambda x: [float(x),0])

For example, below are all the columns related to weight; the 5 right-most ones are the ones that I created.

Below I’ve printed the 0.33 and 0.67 percentiles for weight to show that they align with how most people would categorize light, medium, and heavy dogs.
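The tercile split can also be sketched with Series.quantile. The weights below are invented for illustration, so the printed cut-offs are not the ones from the AKC data. In this sketch, a value exactly equal to a cut-off counts as medium.

```python
import pandas as pd

# Hypothetical mean weights in lbs (invented, not the AKC data)
weight = pd.Series([6, 12, 20, 35, 50, 65, 80, 110, 150])

low_cut = weight.quantile(0.33)
high_cut = weight.quantile(0.67)

# Start with "medium" everywhere, then overwrite the two tails
size = pd.Series('medium', index=weight.index)
size[weight < low_cut] = 'low'
size[weight > high_cut] = 'high'

print(low_cut, high_cut)
print(size.tolist())
```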

Filtering Output Columns for Recommendations

My last step before generating the recommendations was creating a list of columns for the data I would output in the recommendations. The columns I chose were group, temperament, and all of the columns that had “min_”, “max_”, or “category” in their name. The breed and description columns would be separate because the description would otherwise get cut off due to character lengths.
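A list like this can be built with a list comprehension over the column names. The sketch below uses a hypothetical subset of the column names mentioned in this post, and the variable name columns is an assumption; output_cols is the name used later in the recommendation functions.

```python
# Hypothetical subset of the DataFrame's column names
columns = ['breed', 'description', 'temperament', 'popularity',
           'min_height', 'max_height', 'min_weight', 'max_weight',
           'min_expectancy', 'max_expectancy', 'group',
           'shedding_value', 'shedding_category', 'height', 'high_height']

# group and temperament first, then every min_/max_/category column
output_cols = ['group', 'temperament'] + [
    col for col in columns
    if any(key in col for key in ('min_', 'max_', 'category'))
]
print(output_cols)
```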

3. Recommendations of Popular Dogs

I recommended popular dogs using collaborative filtering. Collaborative filtering provides recommendations based on the preferences of many users. In this case, it was very simple because the data already has a popularity column that indicates how popular the dog breeds are.

Code for Recommendations

Below is my function that has optional parameters to filter the output of up to 10 of the most popular dogs.

def recommend_popular_dogs(group=[],low=[],medium=[],high=[]):
    if type(group) == str:
        group = [group]
    if type(low) == str:
        low = [low]
    if type(medium) == str:
        medium = [medium]
    if type(high) == str:
        high = [high]
    
    temp = df.sort_values('popularity')
    if len(group) > 0:
        temp = temp[temp['group'].isin(group)]
    if len(low) > 0:
        for col in low:
            temp = temp[temp['low_'+col]]
    if len(medium) > 0:
        for col in medium:
            temp = temp[temp['medium_'+col]]
    if len(high) > 0:
        for col in high:
            temp = temp[temp['high_'+col]]
    
    num_dogs = min(10,len(temp))
    
    for i in range(num_dogs):
        print('{}.'.format(i+1),temp['breed'].iloc[i])
    
    for i in range(num_dogs):
        print()
        print('{}.'.format(i+1),temp['breed'].iloc[i])
        print(temp['description'].iloc[i])
        print(temp[output_cols].iloc[i])

    return

The “group” parameter takes the dog group(s) that the user is interested in. It needs to be a list of strings unless the filter is for only 1 group, in which case the code converts the string into a list, as shown at the top of the function. This parameter excludes from the recommendations any dog breeds that don’t belong to the group(s), as shown in the second section of the function. I removed the “ Group” substring from the values so that the input would be easier to type.

Remove “ Group” substring from group column
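One way to strip the suffix is pandas’ string accessor. This is a sketch and not necessarily the exact code used here; the three group values are examples of the format described above.

```python
import pandas as pd

group = pd.Series(['Toy Group', 'Non-Sporting Group', 'Herding Group'])

# regex=False treats ' Group' as a literal substring
group = group.str.replace(' Group', '', regex=False)
print(group.tolist())
```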

Similarly, the “low”, “medium”, and “high” parameters keep only the dog breeds that have a True value in the corresponding columns. The user passes the column names without the “low_”, “medium_”, or “high_” prefix (for example, “weight” for low_weight). These again need to be lists of strings unless the filter is for only 1 column name.

I output the filtered most popular breeds in order from most popular to least popular. Then, for each of these breeds, I printed the breed, description, and then values for the output columns that I set in the Data Processing section.

Example 1

Below is an example recommendation with no filtering. These are simply the most popular dogs, as indicated by the popularity values.

Top 10 most popular dogs with no filters

For the sake of the length of this blog, I’m only showing the output for the 1st dog breed, the Labrador Retriever.

Beginning of data for Labrador Retriever

End of data for Labrador Retriever

Example 2

Next is an example with “Toy” as the filter for group. All the recommendations are dog breeds that belong to the Toy group. They are still in order of popularity, as evidenced by the Poodle (Toy) being number 1; it is also the only breed in this list that appeared in the recommendations with no filters.

Example 3

My last example is with filters for the “low” and “high” parameters. Most of the recommendations from the previous example are no longer recommended because they don’t match the filters. In fact, only 9 dog breeds, fewer than the usual 10, can be recommended based on these filters.

The output for each of these dog breeds doesn’t say whether it has low or high values for the columns in the parameters, but this can be verified from df, as shown below for the top 5 recommendations. They all belong to the Toy group and have True values for the respective columns.

Columns for Top 5 most popular filtered Toy dogs

4. Recommendations of Similar Dogs

People who already have a dog in mind can instead look at my recommendations of similar dogs, which use content-based filtering. Content-based filtering provides recommendations based on similarity to what the user likes, which is either inferred from the user’s previous actions or explicitly indicated. In this case, the user’s preferences are explicitly indicated through the inputs to the function.

One-Hot Encodings

Before generating the recommendations, I had to create one-hot encodings for the temperament and group columns because I otherwise couldn’t perform numeric calculations on these categorical variables. For each column of interest for each breed, the one-hot encoding starts as a list of 0’s for each unique value of the column. The 0 is replaced with 1 if the breed’s combination of values for the column contains the corresponding unique value.
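The idea can be illustrated in plain Python. The breeds and temperaments below are made up, and the unique values are sorted here only so that the positions are reproducible.

```python
# Hypothetical temperaments for three made-up breeds
breeds = {
    'Breed A': ['Friendly', 'Alert'],
    'Breed B': ['Alert', 'Loyal'],
    'Breed C': ['Playful'],
}

# One position per unique temperament
unique = sorted({t for temps in breeds.values() for t in temps})

# 1 if the breed has the temperament at that position, otherwise 0
one_hot = {name: [int(t in temps) for t in unique]
           for name, temps in breeds.items()}

print(unique)
for name, vec in one_hot.items():
    print(name, vec)
```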

Below is my code for creating the one-hot encoding for temperament. The temperament column had 3 temperaments separated by commas, so I converted those temperaments into a list. I extracted the unique temperaments by creating a list of all the temperaments separated by commas, and then converting that list into a set. I efficiently assigned the one-hot encoding by creating a list comprehension for the unique temperaments, which checked whether the breed had each of those temperaments and converted this boolean into an integer (0 or 1).

df['temperament list'] = df['temperament'].apply(lambda x: x.split(',') if type(x)==str else [])

temperament = []
for i in df['temperament list']:
    temperament.extend(i)

temperament_no_repeats = set(temperament)
df['one-hot temperament'] = df['temperament list'].apply(lambda x: [int(temperament in x) for temperament in temperament_no_repeats])

The one-hot encoding for the group column was simpler because each breed only has 1 group.

group_no_repeats = df['group'].unique()
# Compare with ==, not "in": x is a string, and 'Sporting' in 'Non-Sporting' is True
df['one-hot group'] = df['group'].apply(lambda x: [int(group == x) for group in group_no_repeats])

Code for Recommendations

Next, I imported the euclidean_distances and cosine_similarity functions from sklearn’s metrics.pairwise module. The code below also uses numpy, imported as np.

I used these functions in my code for generating the recommendations of up to 10 of the most similar dogs to a selected breed, which is provided below.

def recommend_similar_dogs(breed,group=[],low=[],medium=[],high=[],ignore=[],important=[]):
    if type(group) == str:
        group = [group]
    if type(low) == str:
        low = [low]
    if type(medium) == str:
        medium = [medium]
    if type(high) == str:
        high = [high]
    if type(ignore) == str:
        ignore = [ignore]
    if type(important) == str:
        important = [important]
    
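    # Use a list (in df's column order); newer pandas versions do not accept a set as an indexer
    temp_cols = [col for col in df.columns if col not in ignore]
    temp = df[temp_cols]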
    if len(group) > 0:
        temp = temp[(temp['breed']==breed)|(temp['group'].isin(group))]
    if len(low) > 0:
        for col in low:
            temp = temp[(temp['breed']==breed)|(temp['low_'+col])]
    if len(medium) > 0:
        for col in medium:
            temp = temp[(temp['breed']==breed)|(temp['medium_'+col])]
    if len(high) > 0:
        for col in high:
            temp = temp[(temp['breed']==breed)|(temp['high_'+col])]
    temp = temp.reset_index(drop=True)
            
    sims = np.zeros([len(temp),len(temp)])
    for col in [col for col in temp.columns if 'value' in col]:
        if col in important:
            sims += 5*(1-np.array(euclidean_distances(temp[col].tolist(),temp[col].tolist())))
        else:
            sims += (1-np.array(euclidean_distances(temp[col].tolist(),temp[col].tolist())))
            
    for col in ['one-hot temperament','one-hot group']:
        if col in important:
            sims += 5*np.array(cosine_similarity(temp[col].tolist(),temp[col].tolist()))
        else:
            sims += np.array(cosine_similarity(temp[col].tolist(),temp[col].tolist()))
    
    idx = temp[temp['breed']==breed].index
    sims = list(enumerate(sims[idx][0]))
    sims = sorted(sims, key=lambda x: x[1], reverse=True)    
    num_dogs = min(10,len(temp))
    sims = sims[:num_dogs+1]
    breed_indices = [i[0] for i in sims]
    
    n = 0
    for i in breed_indices:
        if n == 0:
            print('Selected:',temp['breed'].iloc[i])
        else:
            print('{}.'.format(n),temp['breed'].iloc[i])
        n += 1
    
    n = 0
    for i in breed_indices:
        print()
        if n == 0:
            print('Selected:',temp['breed'].iloc[i])
        else:
            print('{}.'.format(n),temp['breed'].iloc[i])
        print(temp['description'].iloc[i])
        print(temp[output_cols].iloc[i])
        n += 1

    return

As with the recommendations of the most popular dogs, this function has the optional group, low, medium, and high parameters. It requires the user to enter the breed that they want recommendations for, and it has additional optional parameters called “ignore” and “important”. The “ignore” parameter excludes the corresponding columns from the similarity calculations that are used to find which breeds to recommend. This is useful if the user doesn’t care about certain dog traits. The “important” parameter applies a weight of 5 to the corresponding columns in the similarity calculations. This is useful if the user cares more about certain dog traits but is also okay with them being different if other dog traits make up for it by being much more similar.

In the code, I calculated the pairwise Euclidean distances and cosine similarities using the sklearn functions. Because the 2nd dimension is a 0 for the Euclidean distances, Euclidean distance is equivalent to the absolute value of the difference. The reason I previously (in the Data Processing section) converted the values into lists and appended 0’s was that the sklearn pairwise Euclidean distance function requires at least 2 dimensions and performs the operation for every pair of dog breeds efficiently. I converted both the Euclidean distances and cosine similarities into numpy arrays so that math operations would be applied to each individual value of the arrays. I subtracted the Euclidean distances from 1 because distance is the opposite of similarity. I multiplied the arrays by 5 if they were indicated as important by the user. I added all of these arrays to create a matrix of similarity scores that I named sims.
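The arithmetic for a single pair of breeds can be sketched with plain numpy. The trait values and one-hot vectors below are invented, and the variable names are illustrative; sklearn’s pairwise functions perform the same kind of calculation for every pair of rows at once.

```python
import numpy as np

# Two hypothetical trait values, each padded with a 0 like the "_value" columns
a = np.array([0.2, 0.0])
b = np.array([0.8, 0.0])

# With a 0 in the 2nd dimension, the Euclidean distance is just |0.2 - 0.8|
distance = np.linalg.norm(a - b)
trait_similarity = 1 - distance

# Hypothetical one-hot temperament vectors for the same two breeds
u = np.array([1, 1, 0, 1])
v = np.array([1, 0, 0, 1])
cosine = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Weighted sum, treating the one-hot column as "important" (weight of 5)
score = trait_similarity + 5 * cosine
print(round(trait_similarity, 3), round(cosine, 3), round(score, 3))
```

Because the trait values range from 0.2 to 1, the distance for one trait is at most 0.8, so 1 minus the distance stays positive.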

In the next part of the code, I looked up the row index of the selected breed (after resetting the index, because rows had previously been dropped). From that row of the similarity matrix, I took the indices of the dog breeds with the highest similarity scores, which are the most similar based on the filtered traits.
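The ranking step can be sketched with np.argsort as an alternative to sorting enumerated pairs. The breed names and similarity scores below are hypothetical.

```python
import numpy as np

breeds = ['Breed A', 'Breed B', 'Breed C', 'Breed D']  # hypothetical names

# Hypothetical symmetric similarity matrix (row i compared with column j)
sims = np.array([
    [3.0, 1.2, 2.5, 0.7],
    [1.2, 3.0, 0.9, 2.1],
    [2.5, 0.9, 3.0, 1.4],
    [0.7, 2.1, 1.4, 3.0],
])

selected = breeds.index('Breed A')

# Indices from most to least similar; negating the scores sorts in descending order
order = np.argsort(-sims[selected], kind='stable')

# Keep the selected breed plus the top_n most similar breeds
top_n = 2
ranked = [breeds[i] for i in order[:top_n + 1]]
print(ranked)
```

The selected breed comes first here because its similarity to itself is the highest score in its row; a breed with identical traits would tie with it.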

Example 1

The output has the same format as the recommendations of the most popular dogs, but this time it also includes the selected breed. For example, below are the recommendations of dogs that are most similar to a Shiba Inu and have a high demeanor and trainability, with more importance placed on group, height, and weight.

These are interesting filters because a Shiba Inu normally has a pretty low demeanor and trainability. Because there is importance on group, height, and weight, most but not all of the recommendations have the same group of Non-Sporting, and all of the recommendations are similar in height and weight (small, close to medium).

As an example, below is the comparison for the Shiba Inu and the 1st recommendation of the American Eskimo Dog, which despite their different temperament, demeanor, and trainability (as preferred by the user), have many traits in common and even look similar besides fur color.

Data for American Eskimo Dog

Example 2

Another set of example recommendations is for a Rottweiler. These recommendations ignore energy_level and trainability, and the output consists of large dogs, like the Rottweiler.

Comparing the Rottweiler and the 1st recommendation of the Anatolian Shepherd Dog, they again have many traits in common and look similar besides fur color. In fact, many of the categorical values are exactly the same because of less filtering.

Data for Rottweiler

Data for Anatolian Shepherd Dog

Example 3

My last example recommendations are based on many preferences for a Bichon Frise and thus include only 7 recommendations instead of 10. The recommendations are small dogs, like the Bichon Frise.

Comparing the Bichon Frise and the 1st recommendation of a Coton de Tulear, they once again, despite the different grooming_frequency and expectancy (as preferred by the user), have many traits in common and even look similar when their hair is cut.

Data for Bichon Frise

Data for Coton de Tulear

Conclusion

My dog recommendation system quickly provides dog recommendations based on the dog traits that the user cares about. For the collaborative filtering recommendations, I based the recommendations on the popularity value assigned to each breed. For the content-based filtering recommendations, I combined Euclidean distances and cosine similarities to find and recommend dog breeds with the most similar traits to the selected dog breed. Both types of recommendations allow the user to filter the traits that they care about.

The main limitation of this recommendation system is that it doesn’t explicitly account for dog breed appearances, which are typically an important factor for choosing a similar dog. Characteristics like ear shape, eye shape, snout length, tail shape, and fur/hair style would be useful for such a comparison. Another limitation is that not all dog breeds are included. A few dog breeds were excluded due to the lack of correct data, and cross-bred breeds aren’t included. I hope that this data could be provided in the future, so that people can get the best dog recommendations. It would make sense to put the recommendation system on a webpage, so that users could see pictures of each dog and click to select filters.


Written by Kevin Chen