Data Science Initiative Introduction To R Bootcamp (part 2)

I know you get tired after a while. Some of these are maybe near the end. In fact, I wonder where they are. Which lines do these correspond to? How would I find the line numbers? There may be some pattern to this, maybe not.

But I can just ask for which(is.na(...)) of the values. Okay, there are the line numbers. Some of them are near the end, fair enough, and some are near the beginning. So there is no pattern, and that's fine. So what do we do? Do we fix this manually edited file, or do we go back and read the real one?
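As a small illustration of that call, here is a sketch on a made-up vector. The name vals and its contents are assumptions for the example, not the course data.

```r
# Hypothetical toy column: numbers stored as text, with two non-numeric entries.
vals <- c("1.5", "2.0", "see note", "4.25", "note")

# as.numeric() turns anything it cannot parse into NA. It also warns, which is
# suppressed here because the NAs are exactly what we are looking for.
nums <- suppressWarnings(as.numeric(vals))

# is.na() gives TRUE/FALSE per element; which() turns that into positions.
bad <- which(is.na(nums))
print(bad)
```

The positions that come back are positions within the vector, which matters later when they are compared with line numbers in the file.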

Do we fix this by hand, or do we actually do something clever? I don't know that it's all that clever, but we can fix it here. You know what we could do? Our problem is this, so let's just do this: readLines() on the edited flow CSV file. I'm just going to read the lines. I'm not going to read the data; I'm going to read each entire line of text as a string. Okay, I'll call it ll, for lines.
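A minimal sketch of reading a file as raw lines, using a temporary file as a stand-in. The file contents and column names here are invented for the example.

```r
# Hypothetical stand-in for the edited CSV, written to a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c("date,flow",
             "2001-01-01,12.5",
             "2001-01-02,note: gauge offline",
             "2001-01-03,13.1"),
           path)

# readLines() returns one string per line and does no parsing at all,
# so the header is simply element 1.
ll <- readLines(path)
print(length(ll))
print(ll[1])
```

Because nothing is parsed, a malformed value cannot break the read; the cost is that the header counts as a line like any other.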

My problem is that I don't actually want the ones I'm about to find. I want to exclude the lines that contain the word "note", n-o-t-e. How do I do that? Let's not exclude them first; let's find them first. How do I find whether "note" is in a string? Let's take a look at ll. Look at this one and the one before it, so 736 and 737. Okay, why didn't I get the right answer? Why is that wrong?

Let's go to 738. Okay, there it is. Why? Because the number up here corresponds to the row number after I've skipped the header. Okay, fair enough. So that's the one. Now, how do I find that "note" when I don't know where it is? This is a very, very weird thing. One way you could do it is to break every line up into single letters and look for n followed by o followed by t followed by e. I'd have to write code to do that, and it is seriously ugly code. Let's just do this instead, and don't ask why yet. Okay, those are the lines. They are almost exactly the same line numbers as before, except that I add one to the ones I found earlier.

What does grep do? Let's look up the help for grep. It says "pattern matching and replacement": grep searches for matches to the argument pattern within each element of a character vector. What the hell is this? It's a very weird language within R, and within many other systems, and it is a way for us to search for patterns, like n followed by o followed by t followed by e. But it's really fast and really easy to use. We don't have to do the computation ourselves. And it can find really clever things, like an n followed by three o's but not followed by an x, where there must also be six digits at the end of the line. That's a very powerful pattern, but you need to express it, and for that you have to learn a very small language. It's a weird language but a very powerful one. This one is easy, though: just go find the n followed by o followed by t followed by e.
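Both kinds of search can be sketched on toy strings. The pattern language is regular expressions. The second pattern below is only one possible reading of the spoken description, and it assumes perl = TRUE so that the "not followed by" part can be written as a lookahead.

```r
# Hypothetical lines, not the course file.
ll <- c("date,flow",
        "2001-01-01,12.5",
        "2001-01-02,see note",
        "2001-01-03,13.1",
        "2001-01-04,note")

# grep() returns the positions of the elements that contain the pattern.
hits <- grep("note", ll)
print(hits)

# A more elaborate pattern: "n", three "o"s, not immediately followed by "x",
# and six digits at the very end of the string.
x <- c("nooo way 123456", "nooox 123456", "nooo way 12345")
fancy <- grepl("nooo(?!x).*[0-9]{6}$", x, perl = TRUE)
print(fancy)
```

grepl() is the companion to grep() that returns TRUE or FALSE per element instead of positions, which is convenient when combining several conditions.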

Then we're done. So one of the things I can do is this: i is equal to that. Those are the row numbers. Now if I do ll[i], note that they all have the note in them. What I actually want is not i: I want everything but i. So what I can do now is just say ll = ll[-i]. Okay, so I've basically got rid of the offending lines. I hope those are the only offending lines. We've only checked one column, so there could be more, but one step at a time. So I've done this.
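The drop step can be sketched as follows, again on invented lines. The guard for an empty i is an addition of this sketch rather than something from the session: with no matches, ll[-i] would return an empty vector instead of ll.

```r
# Hypothetical lines: one header plus three data rows.
ll <- c("date,flow",
        "2001-01-01,12.5",
        "2001-01-02,see note",
        "2001-01-03,13.1")

# Line numbers (in the file) of the offending lines.
i <- grep("note", ll)

# With a single header line, file line i corresponds to data row i - 1,
# which is why the two sets of numbers differ by one.
data_rows <- i - 1

# Negative indices mean "everything except these positions".
cleaned <- if (length(i) > 0) ll[-i] else ll
print(cleaned)
```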

Now what I could do is write them back out to a file and then read.csv that. We can do that if we want. So I can say cat(ll, file = "temp.csv"), and now I'm going to do read.csv of temp.csv. We won't bother with stringsAsFactors, and I'm going to say d2 equals this guy. What the heck is going on? Why is this so slow? I have no idea why this is so slow. That's not good. I've done something goofy. Okay, so I made a mistake. Anyone see what mistake I made? How many lines are there here?
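One way to check the line count, sketched on a toy vector rather than the course file: cat() separates its arguments with a single space by default, so a character vector written this way can be counted with readLines().

```r
# Hypothetical lines standing in for the cleaned file contents.
ll <- c("a,b", "1,2", "3,4")
tmp <- tempfile(fileext = ".csv")

# Default sep = " ": the three strings are joined by spaces.
cat(ll, file = tmp)

# warn = FALSE silences the "incomplete final line" warning, since cat()
# did not write a trailing newline.
written <- readLines(tmp, warn = FALSE)
n_lines <- length(written)
print(n_lines)
print(written)
```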

It is very unhappy: this is all on one line. Well, that wasn't good. So we made a mistake. Let's go back and say, sorry, put a newline separator in, please. Now, oh, that's much better. So what's dim of d2? It's that. Now let's take a look at head of d2, if I could spell. There we go. This is looking a bit better. Now let's do sapply(d2, class). Everything is the way we want it to be. That was a tiny little trick.
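The corrected round trip can be sketched like this. The column names and values are invented; only the sequence of calls (cat with a newline separator, read.csv, then dim, head and sapply) follows the session.

```r
# Hypothetical cleaned lines: a header plus eight numeric rows.
ll <- c("day,flow", paste(1:8, 1:8 + 0.5, sep = ","))
tmp <- tempfile(fileext = ".csv")

# sep = "\n" puts each element on its own line.
cat(ll, file = tmp, sep = "\n")

d2 <- read.csv(tmp, stringsAsFactors = FALSE)
print(dim(d2))
print(head(d2))

# With the offending lines gone, the columns should come in as numbers.
classes <- sapply(d2, class)
print(classes)
```

writeLines(ll, tmp) would be an alternative that adds the newlines itself; cat() is used here only because it is what the session used.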

We used data analysis in R to actually go and find the offending lines, rather than going through the sheets, because we didn't even know what we were looking for. We didn't know what the problem was. We just said, this isn't coming in the way we expected it. So what did we do? The strategy was: look at the thing that was causing us grief; try to find the values which weren't numbers; then identify what the characteristic of those lines was. We said, it has a note, and that's why R couldn't convert them to numbers. Then we went in and found them in the text. And then, not by hand but in R, we said: read in all the lines, throw those lines away, write them back out to a file, and then read them back in again. So we've actually fixed that file.

Hey, can you figure out how to read this one? How would you read this one? Is this where it starts? I don't know, not really. This is where it starts: this is starting to look like a comma-separated file. That was all metadata, extra stuff at the top.

We want to find the lines that start with a bang, an exclamation mark. That's common. Okay, actually, I did want to keep this line: these actually are the names of the variables, so I want to read those. Then I'll put the names of the variables here. And I want to skip anything with a note in it again, and then I go on. Who knows what's...
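A rough sketch of that plan on invented lines, assuming the metadata lines all begin with "!" and that the first remaining line holds the variable names. The layout of the real file is not shown in the transcript, so this is only one way it might go.

```r
# Hypothetical file contents: two metadata lines, a header, three data rows.
ll <- c("! station: hypothetical gauge",
        "! units: made up for this example",
        "day,flow",
        "1,10.5",
        "2,note",
        "3,11.0")

# "^!" matches a "!" at the start of a line only.
is_meta <- grepl("^!", ll)
has_note <- grepl("note", ll)
keep <- !is_meta & !has_note

# read.csv() can parse a character vector directly through its text argument,
# which avoids writing a temporary file.
d <- read.csv(text = ll[keep], stringsAsFactors = FALSE)
print(d)
print(sapply(d, class))
```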