Data Science Initiative Introduction To R Bootcamp (part 2)

Get rid of it. So this is the game we're playing. All I do is just subset.

Basically, I can go back here and say Davis, give me all the things in Davis such that Davis$bedrooms equals zero. So I'm subsetting. But that's not the guy I want; I want everything other than that. So I can put in not equal to zero, and now I get this. Let's call this DN0.

So what's the dimension of this? It's a hundred and seventy-three. That got rid of the zero-bedroom rows. Yes, the "not" means not equal to; it's the exclamation mark, sorry. So != is not equal. I could have just said == and put the not in front of the whole thing. All I'm doing is saying it's the opposite of that.

Now I've got this thing, and we can go back and draw the plot again. I have to change all of the references to Davis to be my new data set, and that's kind of annoying. And lo and behold, the zero's gone. So we can subset and only look at what we want. We could also say: show me the distributions only for the groups where we have at least five observations for that number of bedrooms, or whatever it is.
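The subsetting steps described above can be sketched roughly as follows. This is only an illustration: the small `Davis` data frame is made up, and the column names `bedrooms` and `price`, as well as the object names `DN0`, `DN0.alt` and `D5`, are assumptions rather than the names used in the session.

```r
# Hypothetical stand-in for the Davis housing data; the column names
# `bedrooms` and `price` are assumptions.
Davis <- data.frame(
  bedrooms = c(0, 0, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4),
  price    = c(90, 110, 150, 210, 220, 250, 240, 230,
               310, 300, 320, 330, 305, 315, 400)
)

# Keep every row whose bedroom count is not zero.
DN0 <- Davis[Davis$bedrooms != 0, ]

# Equivalent: test for equality, then negate the whole condition.
DN0.alt <- Davis[!(Davis$bedrooms == 0), ]
identical(DN0, DN0.alt)

dim(DN0)  # rows and columns that remain

# Keep only the bedroom counts that have at least five observations.
counts <- table(Davis$bedrooms)
big    <- names(counts)[counts >= 5]
D5     <- Davis[as.character(Davis$bedrooms) %in% big, ]
table(D5$bedrooms)
```

If the grouping column contains NAs, a logical index like this returns rows of NAs for them; wrapping the condition in which() is one way to drop those.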

What else do I want to do? Yes, sapply versus tapply. sapply is the simple apply: it just walks through an ordered collection and applies a function to each element. tapply is for table apply: it groups first and then makes subsets.

So in this case, with the tapply up here, we're saying: there are six unique values for bedrooms, zero, one, two, three, four, five. Six. Split the price vector based on the corresponding number of bedrooms. So I will do a split first and end up with six different blocks: this one, this one, this one, and then four, five and six. These are for bedrooms zero, one, two, three, four, five, or whatever it is, however many we have.

These are all the price values; we've got price everywhere. In the first one, the two observations corresponding to bedrooms equal to zero are now in the leftmost bucket, and so forth. Then what we do is call median on this group, median on this group, and median on this group.

So tapply first splits everything, and then it goes through each subgroup and applies the function. tapply is the same as a split into the different blocks and then an sapply over these blocks with the function, which is the median.
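A minimal sketch of that equivalence, using made-up data (the `Davis` values and the column names `bedrooms` and `price` are assumptions for illustration):

```r
# Hypothetical data; the column names are assumptions.
Davis <- data.frame(
  bedrooms = c(0, 0, 1, 1, 2, 2, 2, 3, 3),
  price    = c(100, 120, 150, 170, 200, 260, 230, 300, 340)
)

# One step: group price by bedrooms, then take the median of each group.
med.tapply <- tapply(Davis$price, Davis$bedrooms, median)

# Two steps: split price into one block per bedroom count ...
blocks <- split(Davis$price, Davis$bedrooms)
# ... then apply median to each block.
med.sapply <- sapply(blocks, median)

med.tapply
all(med.tapply == med.sapply)
```

Both results are named by the bedroom count, so either can be indexed by group, for example `med.sapply[["2"]]`.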

sapply over each element and call median: that's sapply. It just says do the same thing to each element, whereas tapply says, before you do the sapply, group first. Does that make sense?

So you can do an awful lot of plotting in R using things like boxplot and plot and hist and plot of a density, you name it. You can draw everything you've ever wanted to draw, and more, and you have a lot of control over this. There's a bunch of individual functions, and they all work approximately the same way: they take the same arguments. You give it the data, so the argument says here's the data I want to plot, and this says here's your y, here's your x. It may even take it in different forms. If I said boxplot of Davis$bedrooms, it would just draw one box plot. Because I do it this way, it's actually saying group by bedrooms and then draw the distribution of price. If I'd done it the other way around, with commas, it would actually do the same thing.

So it takes the description of the data in various forms, but boxplot here is the verb that basically says I want a box plot. In another case I wanted a scatter plot; in another case I wanted a histogram. So there's a variety of different plots that you can construct.

There's another way of drawing plots.
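The base-graphics calls described above might look like the sketch below. The data are made up and the column names `bedrooms` and `price` are assumptions; the labels are only examples.

```r
# Hypothetical data; the column names are assumptions.
Davis <- data.frame(
  bedrooms = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
  price    = c(150, 170, 200, 260, 230, 300, 340, 310, 420, 390)
)

pdf(NULL)  # null device, so the sketch also runs without a display

# A single vector gives a single box.
one <- boxplot(Davis$bedrooms)

# Formula form: group by bedrooms, draw the distribution of price per group.
grouped <- boxplot(price ~ bedrooms, data = Davis,
                   xlab = "Bedrooms", ylab = "Price",
                   main = "Price by bedrooms")

# Other verbs, same style of arguments.
plot(price ~ bedrooms, data = Davis)   # scatter plot, formula form
plot(Davis$bedrooms, Davis$price)      # same scatter plot, x and y with a comma
hist(Davis$price)                      # histogram
plot(density(Davis$price))             # density estimate

dev.off()
```

For plot, the formula form and the comma form give the same scatter plot. With boxplot itself, vectors separated by commas are drawn as separate boxes, so the formula is the form that groups.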

It's slightly harder to learn, and then it's easier to use to do a lot of different things, but you can do everything in it. We have three different plotting engines, or models, in R; pick whichever one you want.

There are some nice benefits to using ggplot, because if nothing else it gives you legends very, very easily. One of the things is that we're dealing here with data in R data frames. If all your data is in data frames, life is good. ggplot requires everything to be in a data frame, so you actually have to augment your data frame with extra variables and so forth. But that's fine; then you get a certain sort of uniformity.

There's a learning curve to ggplot, but there's a principle to it that comes from a book by Leland Wilkinson called The Grammar of Graphics. It's very simple, it's very obvious. We're only talking about data plots, data analysis plots. We're not talking about how you make video games, or how you do ray tracing, or CGI and stuff like that; we're talking about visualizing data.

Basically, here's what you do. There's a function called ggplot. The first thing you need to do is say library(ggplot2). Why is it ggplot2? Because the original ggplot is now deprecated. So this is the second incarnation of ggplot: same principles, different sets of commands and functions.

And behavior. ggplot is your command. So there's a principle to this: you create a plot and you tell it what data you're going to use. It may be simpler to say it this way. You have data in your data frame; you have variables. What we're going to do is map these variables, the ones of interest, not all of them, into an aesthetic. And then we're going to map these into a geometry on the plot. So a geometry is going to give us points, but we have to say what variables go with the points, and so we're mapping them into x's and y's, or we're mapping them into the size of a point, or we're mapping them into bars in a bar chart. If you ever want to create a bar chart... never create a bar chart.

So we're going to map the data to something that is a variable in our plot, which then gets mapped and rendered as an actual element on the plot itself, which here are points. This allows us to deal with certain things in a very uniform manner. We can take a plot, make it an x-y plot, and then by changing the scales we can use polar coordinates or something like that and get things plotted on a circle, and so forth.

So there is a principle behind this. Leland Wilkinson wrote down a bunch of stuff fifteen, twenty years ago that said this is a better way of thinking about making plots, rather than ad hoc stuff. They're the same plots, exactly the same plots we've been drawing for the last 50 years; it's just a different way of thinking about them. Sometimes it works and makes certain things easier; sometimes it doesn't.

What we're going to do is start with data, create a plot, and then put layers on the plot. For every layer that we create, we have a mapping of the variables to the things that we want to plot, the aesthetics. In a scatter plot, it's the x variable and the y variable; that may be all we need. For a bar chart, they may be counts: you're turning the data into counts, which is the height of each bar, in which case we're going to do a transformation, some statistical summary, to get the values that we want, map the data to the aesthetic, and then map them to bars. So there are these geometries; they're called geoms.

So these are basically the pieces: you have data, an aesthetic, a geom. Hopefully we don't have to specify the scales, but we can. Hopefully we don't have to specify any statistical summaries, but we can. And the last bit we'll talk about later on; it's not necessary at the moment.
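As a rough sketch of those pieces (data, aesthetic mapping, geom, and an optional statistical summary), assuming the ggplot2 package is installed. The data are made up and the column names `bedrooms` and `price` are assumptions.

```r
library(ggplot2)  # assumes the ggplot2 package is installed

# Hypothetical data; the column names are assumptions.
Davis <- data.frame(
  bedrooms = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
  price    = c(150, 170, 200, 260, 230, 300, 340, 310, 420, 390)
)

# Data plus aesthetic mapping: which variable plays which role in the plot.
p <- ggplot(Davis, aes(x = bedrooms, y = price))

# Add a geom layer: render the mapped values as points.
scatter <- p + geom_point()

# Same data, aesthetics and geom, drawn in polar coordinates.
polar <- scatter + coord_polar()

# A bar geom: its default statistical summary counts the rows for each
# x value, and those counts become the bar heights.
bars <- ggplot(Davis, aes(x = factor(bedrooms))) + geom_bar()
```

Typing `scatter`, `polar` or `bars` at the console draws the corresponding plot. In ggplot2 the switch to polar coordinates is made with a coordinate system (`coord_polar()`) rather than a scale.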

So here's one of the things we can do in ggplot with box plots. It thinks for a while, and it gives us a warning message that it removed twelve rows containing NAs. Let's go back over and see what we actually got. What's the difference from what we had before? The background, which you can only just see: it's got grid lines, so it appears differently. But it's the same information, and that's fine.

So it gives us basically the same thing. The labels are slightly different, so it doesn't show us one and five, which is good and bad; we can discuss the aesthetics. Now, you can't go and change the x labels in the same way we did before, which was just as arguments to boxplot, where we said xlab and ylab and main and all that stuff. You can't do that in ggplot, because it's an extra layer; you actually construct it as an extra layer. So down here we're saying: I'd like to change the label on the horizontal axis, so let me call the xlab function. We're replacing this argument to a function with a layer down here with a similar name.

You have to learn a different set of things, and it's kind of annoying when you go between the two, so a lot of people just do everything in ggplot or everything in base graphics. Pick your poison. It's really nice if you can use both. Why? Because this is actually shorter than this: you don't have to think as much, in some regards, for doing very quick plots. For doing very rich plots, ggplot gives you a different set of controls, which can be helpful. So what we're actually doing is this.

What have we got? We're basically saying ggplot, here's my data; all the variables I'm going to refer to are going to be in this data set. If we need to, we can bring in different data sets in different layers. But the nice thing is that for many, many plots I've got one data set, and all the layers I'm going to draw, like all the points and then a couple of lines, or text that annotates every point, all of that text and all of the x-y coordinates are in the same data set. So I'm going to specify it once.

So here's the data, and by the way, here are the aesthetics, here's the mapping. When we talk about aesthetics, it's a mapping of a variable to a concept in the plot. This one basically says: this is your x, this is your y, and by the way, I want to group by bedrooms. So bedrooms is on the horizontal axis, but I also want to group by it, because I want to draw my box plot for the groups, with the data split by bedrooms. And off we go.

Everyone with me? Any questions? Just yell. If you want to play along, yes, that's install.packages. If you don't have it installed, that's fine; you can just do it later.
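The box plot call walked through above (data given once, an aesthetic mapping with a grouping, a box plot geom, and labels added as layers) might look roughly like this, next to the base-graphics version for comparison. The data are made up, and the column names and label text are assumptions.

```r
# install.packages("ggplot2")  # run once if the package is not installed
library(ggplot2)

# Hypothetical data with one missing price; the column names are assumptions.
Davis <- data.frame(
  bedrooms = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
  price    = c(150, 170, 200, NA, 230, 300, 340, 310, 420, 390)
)

# Base graphics: the labels are arguments to the plotting function.
pdf(NULL)  # null device, so the sketch also runs without a display
boxplot(price ~ bedrooms, data = Davis,
        xlab = "Number of bedrooms", ylab = "Price")
dev.off()

# ggplot2: data and aesthetic mapping are given once, then layers are added.
# group = bedrooms asks for one box per bedroom count;
# xlab() and ylab() are added to the plot rather than passed as arguments.
p <- ggplot(Davis, aes(x = bedrooms, y = price, group = bedrooms)) +
  geom_boxplot() +
  xlab("Number of bedrooms") +
  ylab("Price")

p  # draws the plot; rows with a missing price are dropped with a warning
```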

install.packages of ggplot2 should go off and find the latest version, or you can point and click in RStudio, over in the packages area; that should be enough to install it. It'll pull down a bunch of other dependent packages.

Part of this is that everything needs to be in a data frame, and that means any derived variables that you want to include have to get into a data frame in some way. Sometimes we don't want to plot the raw data; we want to do some statistical operations on it and then plot the results of that. You need to put these in a data frame, and they need to be organized in a particular way. This is sometimes true of base graphics too: you sometimes do a lot of work to get the data that you want to plot, not the raw data.

But because the data frame is so critical to ggplot, there's a bunch of packages that try to aid you in making these data frames, things like dplyr and so forth. So you'll pull down a bunch of other packages for this, and you have to go off and learn dplyr and all these different things. My belief is that you can do everything you want in base R; this is just a different way of doing the same things. Pick your poison; they're both equally annoying.

Oh, okay, thank you. Yes, there's masking involved. Thank you.
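As one possible sketch of getting derived values into a data frame using only base R before handing them to ggplot: the data, the column names, and the new names `log.price` and `median.price` are all assumptions for illustration, and ggplot2 is assumed to be installed.

```r
library(ggplot2)  # assumes the ggplot2 package is installed

# Hypothetical data; the column names are assumptions.
Davis <- data.frame(
  bedrooms = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
  price    = c(150, 170, 200, 260, 230, 300, 340, 310, 420, 390)
)

# Augment the data frame with an extra, derived variable.
Davis$log.price <- log(Davis$price)

# A statistical summary, returned as a data frame: median price per bedroom count.
med <- aggregate(price ~ bedrooms, data = Davis, FUN = median)
names(med)[2] <- "median.price"
med

# The summary is itself a data frame, so it can be given to ggplot directly.
p <- ggplot(med, aes(x = bedrooms, y = median.price)) +
  geom_point() +
  geom_line()
```

Packages such as dplyr offer other ways to build the same summary; aggregate is used here only because it ships with base R.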

So this is just an aside, but it's a good thing. For what we're talking about now, this is the least of your worries: you're trying to get R to do something; you're not thinking about sharing it with anybody else. So let's just focus on that. But there is a reason why you'd want to share it with somebody else, and that is because you want help. You say, I've got this far and it's broken. Now, if I'm standing around, or you come to office hours at the DSI or whatever, you bring your laptop and it's all there; the file need never leave your machine. But if you want to send it to me, then you'd better help me make it work.

So one of the things we actually do is this: in your script you'll say readRDS from a URL, http or whatever it is. And again, this is a very good thing to do, because if you send me the code and I don't have the data, I can't help you. Maybe I can read your code and try to guess, but I don't even know what variables you have. So you'd better be able to give me the data.

Or you'd better do it in such a way that it's local, where we just say read.csv of myfile.csv, or whatever it is, and that file should be in the same directory that you send to me in some way, as opposed to C:\Users or whatever it is, where I have to change this. So that's how we get our data. And what we tend to do at the very top is say: these are the packages I need.
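A shareable script along those lines might be laid out as in the sketch below. The URL is a hypothetical placeholder (so that call is left commented out), and the file name `myfile.csv` and its contents are made up for illustration.

```r
# Packages the script needs, listed at the very top.
library(stats)  # placeholder: list each required package here, e.g. library(ggplot2)

# Option 1: read the data from a URL so anyone running the script can fetch it.
# The URL is a hypothetical placeholder. gzcon() is wrapped around url() because
# files written by saveRDS() are compressed by default.
# davis <- readRDS(gzcon(url("https://example.org/data/davis.rds")))

# Option 2: read a file that travels with the script, using a relative path.
# For this sketch the file is created first so the example runs on its own.
old.dir <- setwd(tempdir())
write.csv(data.frame(bedrooms = c(1, 2, 3), price = c(150, 230, 310)),
          "myfile.csv", row.names = FALSE)

davis <- read.csv("myfile.csv")  # works wherever the folder is copied
# An absolute path under C:/Users would have to be edited on another machine.

str(davis)
setwd(old.dir)
```

Whoever receives the folder can then run the script from that directory without editing any paths.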