For the second CA, I began by using a dataset that was already stored in R. The dataset is called “Survival of Passengers on the Titanic.” The data summarizes who survived and who lost their lives when the Titanic sank after striking an iceberg on the night of April 14, 1912. The data is broken down into the following variables:
– Age: Adult, Child
– Gender: Female, Male
– Class: First, Second, Third, Crew
– Survived: Yes, No
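As a quick check of the structure described above, the built-in table can be inspected directly. This is a minimal sketch; note that R itself names the gender variable `Sex` and labels the passenger classes `1st`, `2nd`, and `3rd`, and the object name `total` is only illustrative.

```r
# Titanic ships with base R (package 'datasets') as a 4-way contingency table.
data(Titanic)

# Variable names and their levels, exactly as R stores them.
dimnames(Titanic)

# Total number of people covered by the table.
total <- sum(Titanic)
total
```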
To create the visual representation of the data, I used the example code, and when I ran it in R, a mosaic plot was produced, as seen below. In a mosaic plot, the area of each rectangle is proportional to the number of people in that combination of categories; the plot is split along the X-axis and the Y-axis in turn, once for each variable. The shading indicates how far each count departs from what would be expected if the variables were unrelated, with the ‘Standardized Residuals’ scale on the side showing how the colors correspond to the size of that departure. The mosaic plot allows more than one relationship to be shown at a time, with two variables on each axis in this case.
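The exact call is not reproduced in this post, so the following is an assumption: a shaded mosaic with a residual key like the one described can be drawn with base R's `mosaicplot()` and `shade = TRUE`. The title string is only illustrative.

```r
data(Titanic)

# Assumption: this approximates the example code; the post does not show the exact call.
# shade = TRUE colors each tile by its standardized residual from a log-linear
# model, which by default assumes all four variables are independent.
mosaicplot(Titanic, main = "Survival on the Titanic", shade = TRUE)
```

With the default variable order, class and age are split along the X-axis and sex and survival along the Y-axis, which matches the layout described here.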
There is a great deal of information that can be taken from this dataset. The information that can be gathered is directly related to the variables on the mosaic, with gender and survival on the Y-axis, and age and class on the X-axis. It is because there is more than one relationship being shown that there are several different ‘rectangles’ produced, all of varying size and color shading. The following is what can be learned from the information presented by the dataset:
1. The majority of those who lost their lives when the Titanic sank were males. Of those who lost their lives, 1364 were male and 126 were female. Of those who survived, 367 were male and 344 were female. This is shown below, with the dark grey representing those who passed and the light grey those who survived.
2. The majority of those who lost their lives were adults. 1438 adults and 52 children didn’t survive. 654 adults and 57 children survived the sinking. This is shown below, with the dark grey representing those who passed and the light grey those who survived.
3. There is virtually no overlap between ‘crew’ and either ‘female’ or ‘child.’ The crew included no children, and nearly all crew members were male: only 23 of the 885 were female. Most female crew members survived (20 of 23).
4. The largest group of males who did not survive was the crew (670), followed by third class (422).
5. The lowest number of deaths was among children, regardless of their class.
6. First class reported the lowest number of non-survivors.
7. No child, female or male, was lost from first class.
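The counts quoted in the list above can be checked by collapsing the four-way table onto two variables at a time. This is a small sketch using base R's `margin.table()`; the object names (`by_sex`, `by_age`, `by_class`, `crew_by_sex`) are my own.

```r
data(Titanic)

# Dimensions of Titanic are, in order: 1 = Class, 2 = Sex, 3 = Age, 4 = Survived.
by_sex   <- margin.table(Titanic, c(2, 4))  # Sex   x Survived
by_age   <- margin.table(Titanic, c(3, 4))  # Age   x Survived
by_class <- margin.table(Titanic, c(1, 4))  # Class x Survived

by_sex
by_age
by_class

# Crew members by sex, and first-class children by sex and survival.
crew_by_sex <- margin.table(Titanic, c(1, 2))["Crew", ]
crew_by_sex
Titanic["1st", , "Child", ]
```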
There is a great deal of information presented by the dataset, highlighting interesting relationships and more. Using R helps bring a dataset such as the one used here into a visual, reader-friendly format. It is much easier to create a visual when the data has already been checked and verified, as it was in this dataset. When creating a dataset from the beginning, it is important that the data is correct and is presented to R in a way that it can be used for the intended purpose.
The following highlight things I would have included had I had more time and more practice with R:
1. I think it would have been beneficial to have some sort of scale or label for each rectangle on the mosaic to indicate the total number of people each one represents. On the current mosaic, it is unclear if some rectangles are bigger than others because they are not of uniform alignment. Having the specific numbers that each rectangle represents would make the magnitude of the data more apparent. It would be easier to tell that nearly 1,500 people lost their lives on that one night.
2. I would make the ‘Standardized Residual’ key easier to understand. I know that it represents the strength of the relationship between the variables being examined, but I think this could be a very useful tool in deciphering which variables made a person more or less likely to survive. Was class more powerful than age? Was gender more powerful? You can see that males in the lower classes were less likely to survive than anyone else, but there is no telling what characteristic was the strongest in determining the outcome.
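Both wishes can be approached with base R, sketched here under some assumptions. `ftable()` prints the count behind every rectangle, and a logistic regression is one possible way (my choice, not something the mosaic plot itself does) to compare how strongly class, sex, and age relate to survival. The model below uses main effects only and ignores interactions, so it is a rough first look rather than a definitive answer; the names `counts`, `df`, and `fit` are illustrative.

```r
data(Titanic)

# 1. The count behind each rectangle, laid out as a flat table.
counts <- ftable(Titanic, row.vars = c("Class", "Sex"))
counts

# 2. One row per combination of categories, with its count in 'Freq'.
df <- as.data.frame(Titanic)

# Logistic regression for the odds of Survived == "Yes", weighting each
# row by the number of people it represents. Main effects only (assumption).
fit <- glm(Survived ~ Class + Sex + Age,
           family = binomial, data = df, weights = Freq)

# Odds ratios relative to the reference levels (1st class, Male, Child).
exp(coef(fit))

# Likelihood-ratio test for dropping each variable while keeping the others:
# a larger drop in fit suggests a stronger association with survival.
drop1(fit, test = "Chisq")
```

An odds ratio above 1 means that group had higher odds of surviving than the reference group, and comparing the rows of the `drop1()` output gives a rough ranking of the three variables.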
To conclude, here is my certificate of completion for Try R:
Happy CA 2 & Mahalanobis distance!