Overview

The dataset contains 344 penguin observations, with 7 key variables: species, island, bill length (mm), bill depth (mm), flipper length (mm), body mass (g), and sex. In the simplified dataset, missing values were represented as NaN. The dataset includes three penguin species—Adelie, Gentoo, and Chinstrap—found on three different islands: Biscoe, Dream, and Torgersen. There was an unequal distribution of species, with Adelie penguins being the most common at 146, followed by 119 Gentoo penguins and 68 Chinstrap penguins. The islands also had different sample sizes, with Biscoe hosting 163 penguins, Dream 123, and Torgersen 47. The ratio of male to female penguins was almost equal, with 168 males and 165 females.

The aim of this project was to explore the differences in penguins’ body weights and flipper lengths using Python packages, particularly Pandas, Seaborn, and Matplotlib. Several data visualisation techniques were used, such as descriptive statistics, scatter plots, box plots, and pair plots. For the full code, this can be accessed via the github account Nathaniel Pyle

Methodology

The first scatter plot showed that there was great variation in the body weight of penguins compared with the other physical features. As a result, body weight was chosen as the primary focus of this project to investigate its relationship with various factors.

After separating the penguins by sex, two distinct clusters emerged in both male and female groups. For male penguins, the clusters were divided into those weighing above and below 5000 grams. For females, the division was at 4000 grams.

These differences between the four samples sizes were visualised using box plots. For male penguins over 5000 grams, the median value was approximately 5500 grams compared with the median value of penguins under 5000 grams at approximately 4000 grams.

In relation to female penguins under 4000 grams, the median value was approximately 3400 grams, compared with female penguins above 4000 grams at 4700 grams.

It should be noted that there was a lot of variation in these box plots. An alternative method was used to search for a specific sample of the valid dataset. The results from this search were then plotted using both a scatter and a box plot.

Seaborn’s pair plot function was used to visualise the relationships between species, sex, and island, providing a comprehensive overview of the dataset. Additional pair plots were generated to explore the relationships between body weight and flipper length, bill length, and bill depth.

Using Seaborn’s pair plots was computationally more efficient than manually generating search queries for each subset. Manually generating queries would have required creating 18 specific queries (2 for sex, 3 for species, and 3 for islands, leading to 2 x 3 x 3 combinations), which would have been computationally inefficient.

Variation among penguins physical features: Scatter plot for male penguins Body Mass versus Flipper Length: Scatter plot for female penguins Body Mass versus Flipper Length: Box plot for male penguins 5000 grams and under: Box plot for male penguins over 5000 grams: Box plot for female penguins 4000 grams and under: Box plot for female penguins above 4000 grams: Specified scatter plot: Specified Box plot: Pair plots using species : Pair plots using sex:

Pair plots using island: Pair plots using flipper length and body mass: Pair plots using bill length and body mass: Pair plots using bill depth and body mass:

References

Data Citations