Data Science Initiative Introduction To R Bootcamp (part 2)

What is out there, and how much would you expect to pay for a rental unit? Okay, we know that there are certain things, like where it’s located, that we can maybe explore. But there are also things like how many bedrooms it has and how many bathrooms it has, which of course is related to square footage as well, because you can only fit so many bedrooms into 10 square feet.

So you’d kind of think that the price would be somewhat related to the square footage, and then maybe the number of bathrooms, bedrooms, location, let’s say. There are all these different things, and of course the actual type is complicated. So how do we look at this? There are various ways to look at it. Okay, what do you want to do? Oh yeah, so you can just look. Here’s the simple way to look at this: Davis, with price, square foot, bedrooms... let’s just leave it at that, and then we’ll see a relationship. Okay, not so much. Yes, please? Yes, I’ll save this and I’ll give you all the code from this at the end, but yes, please just ask me to see this again.

What are we doing here? We’ve got Davis, and we say I want all the rows but I want just these columns. This is the subsetting that we talked about yesterday.

We went through a lot yesterday, but it’s the same thing over and over and over again, because that’s what we do all the time. We just basically said: here I’m using the subsetting by name, and here I’m using the empty subsetting. I got this, and they flow by. I put names on these things for my own benefit. I’m going to get rid of these: rownames of Davis equals NULL. I’m going to say just kill them off, please.

And this will actually print differently now, so I don’t have as much text flying by on the left-hand side. So now, does it become obvious what the relationship is between all of these?
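The subsetting just described can be sketched as follows. The real Davis data are not reproduced in this transcript, so this uses a tiny made-up stand-in; the column names (`price`, `sqft`, `bedrooms`, `bathrooms`, `type`) and the row labels are assumptions for illustration only.

```r
# Hypothetical stand-in for the Davis rental data; the values, column
# names, and row labels are invented for illustration.
Davis <- data.frame(
  price     = c(1200, 1450, 1700),
  sqft      = c(650, 800, 950),
  bedrooms  = c(1, 2, 2),
  bathrooms = c(1, 1, 2),
  type      = c("apartment", "apartment", "townhouse"),
  row.names = c("listing-a", "listing-b", "listing-c")
)

# Empty row index = all rows; character vector = columns chosen by name.
sub <- Davis[, c("price", "sqft", "bedrooms", "bathrooms")]

# Drop the descriptive row names so printing falls back to 1, 2, 3.
rownames(sub) <- NULL
print(sub)
```

With the row names gone, the printed table is narrower, which makes it easier to scan across the columns.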

You can’t see a relationship here because they’re not in order, because there are four variables and we’re trying to say how does price relate to all of these things. Well, these are about the same and they’re about the same price. So we can look at bedrooms and bathrooms.

They’re all approximately the same. This one is slightly cheaper per square foot. We could look at price per square foot for the same thing. This one goes up, so it looks like an extra bathroom or an extra bedroom costs you 250 bucks, or whatever it is. Okay, fine. But these are only just three observations. What happens across 175? And by the way, if you take a look at the web page for this, the DSI R intro 18 one, I put up all 11,000 so that you can peruse them, from over the last month where I scraped it every week or so. So you can actually go and look at a much bigger data set to see how the relationship holds.

But how are you going to actually say, yep, more square feet means more price, and how much more price? Okay, now a little bit more specifically than just "plot".

There are a lot of different plot types. And then, making plots: how many of you (I always ask this question) have ever taken a class whose topic is data visualization? Yeah, I didn’t think so. Not a single person. And yet this is what we spend all our time doing, and it’s seriously hard. As I like to say, you know they say a picture is worth a thousand words, and I always ask: how long does it take you to write and carefully edit a thousand-word document? Days. And yet plots? Two seconds. I just take the defaults, and off we go. And yet there’s so much information in them, and they’re seriously hard to comprehend. So you actually have to put work in to explain what the focus is. So we have to compose a plot.

But what sort of plot will be used for this? We could just look at price. We talked about this yesterday, and again I’m going to have fun with this. We can see a histogram of Davis$price. Where’s this gone? There. Now we know that’s our histogram, so we can see a distribution. So what else do I want to do?
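As a minimal sketch of the histogram step: the prices below are simulated stand-ins for `Davis$price` (175 values, matching the number of observations mentioned in the session), not the real data.

```r
# Simulated monthly rents standing in for Davis$price; the numbers
# themselves are made up.
set.seed(1)
Davis <- data.frame(price = round(rnorm(175, mean = 1800, sd = 500)))

# hist() draws the plot and also returns the bins and counts it used.
h <- hist(Davis$price)

# Every listing falls into exactly one bin.
sum(h$counts)
```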

Oh, come on, this is the easy part. You can dream up any plot you like. I’m not going to do it, but you can dream up any plot. What do you want to do?

One of the things when you’re programming: if you don’t know what you’re trying to do, you can’t get there. You can waste a lot of time just typing random stuff and go, "That’s a nice plot, pretty," but it’s got nothing to do with what I wanted. You’re going to have to decide what you want to do.

So what do you want to do? Okay, sure. Now we go back over here, and again this is what we call base graphics. So now we’re going to put in Davis and square foot, and that’s what we get. So what do we think? There are lots of ways to plot this data. This is kind of the question I’m getting asked. This is your square foot.

Okay, I don’t like this "Davis dollar square foot" label. No.

Let’s actually go and beef it up a little bit. Let’s say that the label on the x axis, the horizontal axis, should be square foot; we can put something in. And the y axis, the y label, should be price. I’m going to assume monthly price. That may not be true, but let’s just assume it, and we can check this. And we may even put in a title. We can say main equals "Relationship between"... that’s because I’m typing in a weird editor where it just says, oh, this line continues on. Thank you very much; please yell when it is not good, because I get slashes in there when I don’t mean to. Right: between rental price and size. And now we have a nicer plot. Still not a very nice plot, but a plot. Well, it’s good enough.
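The labelled scatter plot described above might look like the following sketch. The data frame is a small made-up stand-in, the column name `sqft` is an assumption, and the exact label wording is a guess at what was typed in the session.

```r
# Hypothetical stand-in for the Davis rental data (invented values).
Davis <- data.frame(
  sqft  = c(450, 650, 800, 950, 1400, 2200),
  price = c(950, 1200, 1450, 1700, 2100, 2600)
)

# Base graphics scatter plot: x first, then y, then the annotations.
plot(Davis$sqft, Davis$price,
     xlab = "Square feet",
     ylab = "Monthly price (dollars)",   # "monthly" is an assumption
     main = "Relationship between rental price and size")
```

Without `xlab` and `ylab`, the axes are labelled with the literal expressions passed to `plot()`, which is why the default label was rejected.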

But it’s really important to actually put labels on these. Obviously, when you’re doing this just to explore something, it doesn’t really matter, because it’s only for your own consumption, and two seconds later you can still remember what the variables mean. But the second you’re actually going to show this to somebody else, you’d better start putting real labels on it, possibly with units that say this is in dollars, this is in square feet. Because again, you would never hand somebody a piece of paper that says, "Here are some random notes I took. There you go, you figure out what I meant." Likewise with a plot. And it’s not very hard to actually make these plots. So this is just plot: here’s the x, here’s the y, and here we get to dress up and annotate this thing with xlab, ylab, and main.

We might even say: you know what, I don’t really like those circles. I’m going to change the plotting character, pch, to be a plus. I’m going to change the character expansion, cex (don’t ask why these are named this way), to three because I want it to be bigger. And I’d like them all to be red, please. Okay, so I can change a lot of characteristics of a plot.

What is actually happening here? I’m just telling you this because we’re going to talk about something similar soon. What I’m simply saying is that these get recycled. We mentioned this yesterday, but I know that I was going quickly. This is a vector of 175 observations, and this should be a vector of 175 observations, so it gets recycled to match up. So we could actually put in 175 different values. That’s what we were doing when we were subsetting by the factor to change the plotting character or to change the color: we were saying, change the color of each point in the scatter plot based on some other variable. But in this case we’re saying, no, I want them all to be red. So this is easy. What do we see here?
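Here is a small sketch of the recycling idea, again on made-up data with assumed column names. A single `col` value is reused for every point; a vector with one entry per point colors each point individually.

```r
# Hypothetical stand-in for the Davis rental data (invented values).
Davis <- data.frame(
  sqft  = c(450, 650, 800, 1400, 2200),
  price = c(950, 1200, 1450, 2100, 2600),
  type  = c("apartment", "apartment", "townhouse", "house", "house")
)

# One value each for pch, cex, and col: all are recycled across the points.
plot(Davis$sqft, Davis$price, pch = "+", cex = 3, col = "red")

# What the recycling amounts to: one "red" per observation.
all_red <- rep("red", length.out = nrow(Davis))

# One color per point instead, chosen from another variable.
by_type <- ifelse(Davis$type == "house", "blue", "red")
plot(Davis$sqft, Davis$price, pch = "+", cex = 3, col = by_type)
```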

Come on, what do you see? It’s just confirming what you know: the bigger the square footage, the more you pay. Although when you get a really big house, or a big place, it’s not as obvious. But as you go up within the smaller sizes, the price goes up.

What else would you like to look at here? What else do you think explains this? I kind of mentioned the other things; so far we’ve only got to look at two things. Add more axes? How do you add more axes to this? You want a three-dimensional plot, and a four-dimensional plot? I want to see the four-dimensional plot; that one I’ll pay money for. How do I add an axis? We don’t want to do it over here, because each point has to map onto something. So one of the things I could do, as we did before, is change the color, or I could change the actual plotting character, to indicate the number of bedrooms. And then I could also maybe change the size to be the number of bathrooms.

And what about the type of this? Is this an apartment? Remember, the type variable has different values; we have different types. So we have table. Let’s just take a quick look at table of Davis$type. So we do have a lot of apartments, but we also have a few townhouses and a few houses. So maybe they’re the ones that are big.

Okay, we could just kind of check this. You know, what’s the summary? Again, it’s so easy in some regards. What do I want to do? I want to look at the square footage for all of the ones that are houses. Is that clear? I’m just basically saying, hey, I quickly want to ask: what’s the distribution of the square footage for only the ones that are houses, where type is equal to house? And I see it starts at a thousand and goes all the way up to 3,500 square feet. I’ve got 20 NAs. There’s the mean. So maybe these ones over here are houses, in which case maybe the relationship between square foot and price changes if it’s a house, not just with the size.

So how would you want to check that? Add more labels? I mean, add more axes? I need to look at more conditions, essentially; I need to look at them simultaneously. Yes, I can make them a different color. So we can do something like the following. Okay, I remember what we were doing. So let’s go back over here and just change the color.
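The two quick checks from this part of the session, counting listings by type and summarizing square footage for houses only, can be sketched like this. The data frame is a made-up stand-in with assumed column names, including one missing square footage so that the NA count shows up.

```r
# Hypothetical stand-in for the Davis rental data (invented values).
Davis <- data.frame(
  sqft = c(450, 650, 800, 950, 1400, 2200, NA),
  type = c("apartment", "apartment", "apartment", "townhouse",
           "house", "house", "house")
)

# How many listings of each type?
type_counts <- table(Davis$type)
print(type_counts)

# Square footage for houses only: a logical subset, then summary().
house_sqft <- Davis$sqft[Davis$type == "house"]
s <- summary(house_sqft)
print(s)   # includes an "NA's" entry because one value is missing
```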

Oh boy, I hate colors, because colors are hard. Who here is colorblind? There’s always somebody in the room; I know there’s somebody colorblind. So picking the colors is really hard, especially on my machine versus that machine. There are ways to pick colors carefully. There’s an R package called RColorBrewer that you can install, and then you can pull palettes for different purposes: ones that show continuous change as values increase along a continuous range, or categorical palettes that try to make sharp distinctions between different categories or groups. So we can change this, and then we also have to handle color blindness and so forth, and you get all of these things.
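As a hedged sketch of picking a categorical palette: this was not run in the session, the data are made up, and the choice of the "Set2" palette is mine. If RColorBrewer is not installed, the code falls back to three hand-picked colors.

```r
# Hypothetical stand-in for the Davis rental data (invented values).
Davis <- data.frame(
  sqft  = c(450, 650, 800, 950, 1400, 2200),
  price = c(950, 1200, 1450, 1700, 2100, 2600),
  type  = c("apartment", "apartment", "apartment", "townhouse",
            "house", "house")
)
type_f <- factor(Davis$type)

if (requireNamespace("RColorBrewer", quietly = TRUE)) {
  # Qualitative palette: distinct hues for unordered categories.
  # (A sequential palette such as "Blues" suits an increasing range.)
  pal <- RColorBrewer::brewer.pal(3, "Set2")
} else {
  # Fallback when the package is missing: three hand-picked colors.
  pal <- c("#1b9e77", "#d95f02", "#7570b3")
}

# One color per point, looked up by the listing's type.
point_col <- pal[as.integer(type_f)]
plot(Davis$sqft, Davis$price, col = point_col, pch = 19,
     xlab = "Square feet", ylab = "Price")
legend("topleft", legend = levels(type_f), col = pal, pch = 19)
```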

Okay, so let’s go over here. One of the things we can actually do, which is slightly different (we’ll just do this for the time being), answers a slightly different question. And I want you to be thinking about how we actually ask the question that we want. This is for bedrooms.

So we were just looking at square footage completely independent of bedrooms. Rather than making a complicated plot that has colors and plotting characters that indicate the number of bedrooms and the type and so forth, we could look at them just two variables at a time. We could actually just do a scatter plot of the number of bedrooms against the price. And I notice I have no labels on these axes at all, which is not good. Is that price or is it square foot? I don’t know. So this is bad news. It happens to be price, but we would have no idea.

So what’s this telling us? This is the number of bedrooms; I’ll tell you that because I didn’t put a label on it. How would we put a label on it? It’s the same old approach: xlab equals "Number of bedrooms" and ylab equals "Price", and we can put a title on it, and now it’s self-describing. This is good. What’s it telling us? Yeah, so it means more bedrooms, more money. And it’s related to square footage as well, so we’ve got to take account of that. What the heck’s going on here?
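A labelled version of the bedrooms-versus-price plot might be sketched as below. The exact call used in the session is not recorded in the transcript, so the formula form of `boxplot()` is an assumption, as are the data and the title wording.

```r
# Hypothetical stand-in for the Davis rental data (invented values).
Davis <- data.frame(
  price    = c(900, 1000, 1200, 1300, 1500, 1600, 1750, 2100, 2300),
  bedrooms = c(0, 0, 1, 1, 2, 2, 2, 3, 3)
)

# One box of prices per distinct number of bedrooms, with real labels.
b <- boxplot(price ~ bedrooms, data = Davis,
             xlab = "Number of bedrooms",
             ylab = "Price",
             main = "Rental price by number of bedrooms")

# boxplot() also returns the group names and the group sizes.
b$names
b$n
```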

These are studios, probably. But now we can actually go and take a look. We can say, hey, let’s look at Davis$type such that Davis$bedrooms is equal to zero, and let’s just do a table of this. And we’ve got two apartments that are studios. We’ve got nothing else in this category, so there are only two studios in this thing. So that’s kind of weird. This is a box, but it’s only got two observations. So it’s helpful to be able to just ask these questions. You’re all comfortable doing this? Yes? What else do we want to look at?

So box plots are really helpful. This is actually showing us the distribution of price conditional on each number of bedrooms. So if we subset on the number of bedrooms being one and we look at all the prices, we can see that in a box plot. These lines here, they’re the medians.
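The check on the zero-bedroom listings is a logical subset followed by `table()`. A minimal sketch on made-up data, with assumed column names:

```r
# Hypothetical stand-in for the Davis rental data (invented values).
Davis <- data.frame(
  type     = c("apartment", "apartment", "apartment", "townhouse", "house"),
  bedrooms = c(0, 0, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Types of the listings that have zero bedrooms.
studio_types <- table(Davis$type[Davis$bedrooms == 0])
print(studio_types)
```

A box drawn from only two observations says very little about a distribution, so this kind of count is worth checking before reading much into the plot.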

So how do you know? That’s just how a box plot is defined: this is the median, these are the 75th and 25th percentiles, and these mark the regions beyond which we would get outliers. It’s kind of weird, but they’re very helpful. Oh, hang on, this is kind of handy: this is essentially allowing us to see density plots for each group by splitting on the bedrooms. You subset on each number of bedrooms and you look at the price for that subset.

By the way, this is just the same thing as this: if I wanted to compute the median, it’s tapply, because I want to group by. So Davis$price, Davis$bedrooms, and then I want to compute the median. And lo and behold, that command is just saying: subset this thing based on the unique values of this. So we get the zero to six different bedroom numbers, so we have six different groups, and then it just computes the median of the vector for each subset.
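The group-by computation can be sketched with `tapply()` on made-up data (assumed column names, invented values). The last line repeats one group by hand to show that it is the same subsetting as before.

```r
# Hypothetical stand-in for the Davis rental data (invented values).
Davis <- data.frame(
  price    = c(900, 1000, 1200, 1300, 1500, 1600, 1750, 2100, 2300),
  bedrooms = c(0, 0, 1, 1, 2, 2, 2, 3, 3)
)

# Split price by the unique values of bedrooms, then take each median.
med <- tapply(Davis$price, Davis$bedrooms, median)
print(med)

# The same thing for a single group, done by explicit subsetting.
by_hand <- median(Davis$price[Davis$bedrooms == 2])
```

These medians are the values that the middle line of each box in the box plot marks.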

Everyone happy? Okay, how do we get rid of the zero bedrooms in the plot? Well, this is what we have, and we want to get rid of this.

Good question. How do we get rid of this? How do we get rid of the zeros? Suppose we just don’t care about this; this is goofy. Fine, we could move it over to the side, but for now, how do we...