These are good for doing string matching. They have the remarkable property that, once you have built one for your text, you can find strings in it in a time which is independent of the length of that text. That is a sort of unusual, very nice thing to be able to do. They can be used for finding common substrings of two texts, which is another useful thing to be able to do. They can be used for finding unique substrings, and there is much more you can do with them; they can be used for a variety of things. So I will introduce you to suffix trees, with a very silly idea. Sometimes the way you get into something useful is to start with a silly idea and then, by not rejecting it immediately, to question it, so that you strip the silliness away and are left with something useful. That is what we are going to try and do.
So this is the idea: index every one of the substrings of the text. Every single substring goes in the index. We can choose a point just before any character to begin the substring, and a point just after any later character to end it, so a substring lies between any two of the n plus one places in the text: pick two of those places, and that gives you a substring. So you can work out how many substrings there are and, in case you are not used to looking at these things, just let me tell you, this is big. Big, big.
Now, every substring is a prefix of some suffix of the string. So if we are looking for a particular thing, and we can find the suffixes that begin with it somehow, that would be fine, because that will find it for you. So we can use something called a digital search tree, and you will see what that is. This is a very, very common thing. It is also known as a trie, spelled t-r-i-e, the middle syllable of the word "retrieval". Whoever it was that coined that, although he will long be honoured for his contributions to information retrieval, will spend a good deal of time in purgatory before he is finally forgiven
for introducing this, because anybody who introduces something which is a tree, and sounds like a tree, but is not spelled like a tree, makes problems, and I would like to see him in my office.
So what we do is process the text to produce one member of a family of data structures. Some of them are suffix trees, which are the ones I am going to concentrate on, and there are suffix arrays, which do a lot of the same things; they have slightly different properties. There are PATRICIA trees, which you get if, instead of thinking in terms of letters, you think in terms of the bits, the zeros and ones, that make up the letters. Functionally you get the same thing. And I will just mention, as we go along, a neat little variant, which we will not actually use, but which completes the landscape. If you get nothing else out of this talk than a few terms and names that you can throw around, you will see that there are various of these that you can use on the unsuspecting.
The search time for these things is governed only by the length of the thing you want to search for; it has nothing to do with the length of the thing you are searching in. The processing time to prepare the text, that is determined by the size of the text. So you have to do a lot of work in advance, but when you are ready, you get to do things very fast.
So this is going to be good with very large corpora. We used to use this heavily with the electronic version of the Oxford English Dictionary, so it was possible to find any substring in a corpus of four hundred and fifteen million characters, which was large, at least by the standards of that day. The text that we will work on today is very much shorter, but it has got enough problems in it, or enough interest, to illustrate the whole idea. Here is a suffix tree of that. Remember, this is a silly idea. So here is one part of it. So if you are looking for anything that happens to begin with an m,
because there is only one branch with that character on it, you start here and go on looking. So if you are looking for "mi", or "mis", you go along here until you find what you want, or until you cannot go any further. If you are looking for something that begins with an s, you go down here. Remember, there is only one branch with the s on it; there is only one branch with a given letter on it out of any node. Maybe you begin to see what this is here. This is "ippi", and the dollar sign is something different from everything else, which ends it all; that is a technicality. Also "ssippi" is down here. Those are the two possibilities.
So this is a wonderful thing for searching for substrings. But it is too big to build for fifty million characters. You can see this thing is huge already; I could hardly get it on the slide. We have got to do something about that. But every substring corresponds to a path from the root to some node in this tree. So we can find everything, and that is the real point. And for any given node, each terminal below it corresponds to an occurrence of that string. So you can see that this string occurs twice in here: if we start at this node, there is this terminal and this terminal; therefore, we know that there are two occurrences of it in there. And you can see that there are four occurrences of the letter i: one, two, three, four. Okay.
So just to summarize, a little bit of math here. There are up to n times n plus one, over two, plus one more, nodes. There are n terminal nodes. And there is one slightly surprising thing: although there are huge numbers of nodes, there are at most n minus one of them that have more than one letter going out of them. That is going to be a key thing for us to work with. So at most of these nodes there is only one branch. It is not obvious that that is necessarily the case, so I hope you will see, in what I say to you, why that is so. Okay, here is what we are going to do.
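The counts just summarized can be checked on a small text. The following is a minimal sketch, not the speaker's code: it assumes the example text is "mississippi$" (suggested by the fragments quoted in the talk), and the function names are invented for illustration. It builds the uncompressed trie of all suffixes, one character per branch, counts the nodes, and counts occurrences by counting the terminals below a node.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a trie, one character per branch."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def walk(node):
    """Yield every node at or below `node`."""
    yield node
    for child in node.values():
        yield from walk(child)


def count_occurrences(root, pattern):
    """Follow `pattern` down from the root, then count the terminals below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return 0
        node = node[ch]
    return sum(1 for n in walk(node) if not n)


text = "mississippi$"  # assumed example; "$" is a unique end marker
trie = build_suffix_trie(text)
nodes = list(walk(trie))
terminals = sum(1 for n in nodes if not n)        # nodes with no branch out
branching = sum(1 for n in nodes if len(n) > 1)   # nodes with several branches out

print(len(nodes), terminals, branching)
print(count_occurrences(trie, "ss"), count_occurrences(trie, "i"))
```

The trie is quadratic in the worst case, which is the "silly" part; the point of the sketch is only that the terminals equal the number of suffixes and the branching nodes stay below that number.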
What we are going to do is start by replacing sequences of non-branching nodes with just a single branch, which has all the letters on it: a single branch to represent all of "ppi$", say. But that by itself does not get us very far, because up here we have still got this big long label. But we can replace a label like "ppi$" with a pair of positions in the text, because "ppi$" is just what lies between those two positions. So just by having two numbers, I can represent any substring of the text, and in particular every suffix of the text whatsoever. So all my branches are going to have exactly the same size of label on them, two numbers, regardless of the length of the string. If we do that, we reduce the whole tree to something that looks like this.
Ladies and gentlemen, what we have just done is exceedingly dramatic. The explosive growth that we had before has disappeared. The size of this data structure is linear in the length of the string. Its size is, for some small k, k times the length of the original string, not n squared or something else. This is a relatively modest size.
Look, how many suffixes are there? A suffix is what you get when you start anywhere you like in the text and go all the way to the end. So how many terminal nodes are there? That is easy: one for every suffix. What is the largest number of branching nodes that we can have in there? The answer is the number of terminals minus one. One way to see that is as follows. Imagine you were adding these suffixes one by one. As you add each one, you follow a path you have already laid down for a while, until you come to the point where you can no longer follow an existing path, and you have to branch out, because this is new material. From then on, it is all new material. So you create at most one branching node for every suffix you add, and only one.
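The compression step can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the construction used in the talk: the text "mississippi$" is assumed, the names `TrieNode`, `build_trie`, `compress`, `paths` and `count` are invented, and positions are taken as the places between characters, so a label `(start, end)` stands for `text[start:end]`. The sketch builds the plain trie, merges each run of non-branching nodes into one branch carrying two numbers, and checks that the root-to-terminal paths still spell exactly the suffixes.

```python
class TrieNode:
    def __init__(self, pos):
        self.pos = pos        # text position just after the character on the incoming branch
        self.children = {}    # character -> TrieNode


def build_trie(text):
    """Plain suffix trie; each node remembers where it was first reached."""
    root = TrieNode(0)
    for start in range(len(text)):
        node = root
        for pos in range(start, len(text)):
            ch = text[pos]
            if ch not in node.children:
                node.children[ch] = TrieNode(pos + 1)
            node = node.children[ch]
    return root


def compress(node):
    """Merge chains of single-branch nodes; label each branch with two numbers."""
    out = {}
    for ch, child in node.children.items():
        start = child.pos - 1
        while len(child.children) == 1:          # skip non-branching nodes
            (child,) = child.children.values()
        out[ch] = ((start, child.pos), compress(child))
    return out


def paths(tree, text, prefix=""):
    """Spell out the string on every root-to-terminal path."""
    if not tree:
        yield prefix
    for (start, end), sub in tree.values():
        yield from paths(sub, text, prefix + text[start:end])


def count(tree):
    """Return (terminals, branching nodes); after compression every inner node branches."""
    if not tree:
        return 1, 0
    leaves, branching = 0, 1
    for _label, sub in tree.values():
        sub_leaves, sub_branching = count(sub)
        leaves += sub_leaves
        branching += sub_branching
    return leaves, branching


text = "mississippi$"  # assumed example text
tree = compress(build_trie(text))
leaves, branching = count(tree)
print(leaves, branching)
```

Going through the full trie first is wasteful and is done here only to mirror the order of the explanation; the linear-time constructions avoid ever building it.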
So the most the total number of branching nodes in here can be is the number of characters minus one, and in general it will be less than that. So what you pay is a constant times the length of the text, even if we have a million characters.
So let us say we want to look up "ssipp". We look at all the branches that are available from the root, and only one of them has an s on it, so that is the one we are going to take. So here we are, and again there is only one branch we need to bother about. Now, where do we go from here? Among these branches here, we want the one that continues the string, so we can take that. And then the string runs out halfway, part of the way down this branch, and that is where we end. So we are actually on the branch labelled eight to twelve, and that takes us to position ten. That is our five characters.
Now we know the string is in there. We know one place where it is, but there might be other ones, so we have to explore the rest of the tree below that point and find all the terminals down there. Take a string that leads to two terminals: then there are two instances of it in here. We can find out where they are. The length of a branch is the difference between the second number and the first number on it, and by adding up those differences all the way along the path we get the length of the suffix. If we look at this one here, there is "ppi$" here, and then what follows? It is the end of the text. Well, this one goes with something that begins at position eight. Actually, there must be another instance of it three characters to the right in order for that to be possible. So by adding up these things all the way out to the end, we can tell where each of those substrings begins. Fine, we had these things, and we found out where they are. So take this one, which we want to track.
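The lookup and the position arithmetic described above can be sketched like this. It is a hedged illustration, not the speaker's implementation: the text "mississippi$" is assumed, the function names are invented, and the tree is built by the naive method of adding suffixes one by one (quadratic time), not by a linear-time algorithm. Branch labels are pairs of positions between characters, and the start of each occurrence is recovered by adding up branch lengths down to a terminal and subtracting from the text length.

```python
def build_suffix_tree(text):
    """Naive construction. `text` must end in a unique marker such as "$".
    A node is a dict: first character -> (start, end, child node)."""
    n = len(text)
    root = {}
    for i in range(n):
        node, pos = root, i
        while True:
            ch = text[pos]
            if ch not in node:                    # new material: add a terminal branch
                node[ch] = (pos, n, {})
                break
            start, end, child = node[ch]
            k = 0
            while start + k < end and text[start + k] == text[pos + k]:
                k += 1
            if start + k == end:                  # whole label matched: go down
                node, pos = child, pos + k
            else:                                 # mismatch inside the label: split it
                node[ch] = (start, start + k, {
                    text[start + k]: (start + k, end, child),
                    text[pos + k]: (pos + k, n, {}),
                })
                break
    return root


def find(root, text, pattern):
    """Return (node below the match, path length to that node), or None."""
    node, matched, depth = root, 0, 0
    while matched < len(pattern):
        ch = pattern[matched]
        if ch not in node:
            return None
        start, end, child = node[ch]
        label, rest = text[start:end], pattern[matched:]
        m = min(len(label), len(rest))
        if label[:m] != rest[:m]:
            return None
        matched += m
        depth += end - start                      # the pattern may end partway down a branch
        node = child
    return node, depth


def occurrences(root, text, pattern):
    """Start positions of `pattern`, from the terminals below the match."""
    hit = find(root, text, pattern)
    if hit is None:
        return []
    starts, stack = [], [hit]
    while stack:
        node, depth = stack.pop()
        if not node:                              # terminal: depth is the suffix length
            starts.append(len(text) - depth)
        for start, end, child in node.values():
            stack.append((child, depth + end - start))
    return sorted(starts)


text = "mississippi$"  # assumed example text
tree = build_suffix_tree(text)
print(occurrences(tree, text, "ssipp"))
print(occurrences(tree, text, "ss"))
```

The search touches only the branches along the pattern and the subtree below it, which is why the cost depends on the pattern length and the number of occurrences, not on the size of the text.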
Well, we add up the lengths of the branches along the path, one after another, and that tells us where in the text this one is; we do the same for the other one, and so we know where both of them are in the text.
I told you this was a beautiful thing. The beauty of this particular point can be such as to either make you excited or put you to sleep; I hope you will be somewhere in between. Observe that, because the number of branching nodes is less than the number of terminals, finding the terminals requires visiting at most two nodes for each one. And what that means is that finding all the occurrences of the string is dependent only upon how many of those occurrences there are, not on how much intervening text there is between them. So in other words, it depends on about the least expensive thing that you can imagine it depending on. You cannot find k things in less than k time, and that is all we are going to spend.
Another nice thing about these trees is that you can build them in linear time. I will go through that rather quickly, without your really knowing what in the world is going on, because if I do not, I will not have enough time to say one or two potentially interesting things, for which I would like to give myself five minutes. But I will point out here that, when you do this, you will always find pieces of the tree where the same stuff recurs. We have got eight to twelve here, and we have got eight to twelve over here as well. We can merge those nodes into one. Nobody would be any the wiser; you would search just as you were doing before, and we will have an even smaller data structure. So it will actually reduce to something like this in this particular case. This is called the minimal finite-state automaton corresponding to the tree we had.
So how do we use this for doing an alignment? Build a suffix tree for the text in each of the two languages.
Create back pointers. What I mean by that is this: for every node in there, at the moment each points to its daughters, to the things further down the line; what I want is for them to point back to their mothers as well, so that if you give me a node, I can find my way back towards the root of the tree.
Annotate the terminals with sentence numbers. Remember, each terminal corresponds to a position in the text where a string begins, so I want to know in which sentence that position occurs: I know, say, that character number forty-two thousand seven hundred and twenty-eight is in sentence number twenty-seven thousand and forty. Okay.
Then keep doing this. Find an interesting subtree in tree A. "Interesting" just means that you set some parameters; you say, for instance, that the string has got to be at least five characters long, or something of that kind. Once we have found it, we can find all the sentence numbers in it, because we built the tree to show that. Then go to tree B, which we built on the text in the other language, and which has sentence numbers on it too; corresponding sentences have the same sentence numbers in both of these. Propagate those sentence numbers back from the terminals to all the previous nodes that you would have to pass through in order to get to those terminals.
Now that means that most of the nodes in tree B will not have any sentence annotations on them, because they correspond to substrings that do not occur in any of the sentences that the interesting string from tree A occurs in. So we use only the part of the tree that has sentence numbers on it.
So now we can walk down the two sets of paths in the two trees, finding potentially corresponding strings and computing a similarity between them, and the ones that get a sufficiently high similarity are the ones we are going to be interested in. In particular, we are going to try to say, for a given interesting string in language A, what is the string in language B which is most similar to it. So align B with A if B is the interesting string most similar to A. Or,
and this, I think, is actually the better way to do it, align A and B if each is the most similar interesting string to the other. So I take an interesting string in A, I find the interesting string in the other language that is most similar to it, and I require that the string most similar to that one, coming back, should be the one I started with. So it has to work in that direction as well. I have convinced myself that that is probably the right thing to do.
What is the similarity measure? It does not matter a whole lot. I tried mutual information, I tried a couple of things, and just taking the size of the intersection and dividing by the size of the union seems to be as good as anything.
And what do I find? Well, I had some six hundred and fifty sentences from a tractor manual, which Massey Ferguson had conveniently translated into French and German, just for me. And the first things I found were already, to me, quite interesting and surprising. The most similar pair that I found on my first run was the French adverbial suffix and the English adverbial suffix. So I thought, good, we are off to a good start. But then the English suffix "s" came out corresponding to the French "les". I take it that is because French nouns are a little bit more promiscuous in the various types of suffixes that they use for plurals; on the other hand, the French really do like using articles. So if you ask what most frequently goes with a plural noun, the answer is the article. So this actually makes the correspondence come out right.
The next one was a slightly confusing one. Yes, there is a sort of semantic similarity. Where did it actually come from? It came from "anti", for which the French had their own expression. Just so, "clockwise" did indeed line up, and also with a rather long expression of that same thing in German. "Conditioning" lined up with "clim".
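The similarity measure and the two-way matching mentioned above can be sketched in a few lines. This is an assumption-laden illustration, not the speaker's program: it supposes that each interesting string has already been reduced to the set of sentence numbers in which it occurs, the function names are invented, and the sentence numbers in the toy data are made up purely to exercise the code.

```python
def similarity(a, b):
    """Size of the intersection divided by size of the union of two sentence sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(key, source, target):
    """The string in `target` whose sentence set is most similar to source[key]."""
    return max(target, key=lambda other: similarity(source[key], target[other]))


def mutual_alignments(lang_a, lang_b):
    """Keep (a, b) only when each is the other's most similar string."""
    pairs = []
    for a in lang_a:
        b = best_match(a, lang_a, lang_b)
        if best_match(b, lang_b, lang_a) == a:
            pairs.append((a, b, similarity(lang_a[a], lang_b[b])))
    return pairs


# Invented toy data: string -> set of sentence numbers containing it.
english = {"clockwise": {3, 8, 15}, "brake": {2, 8, 21, 30}}
french = {"horaire": {3, 8, 15}, "frein": {2, 21, 30}}

for a, b, score in mutual_alignments(english, french):
    print(a, b, round(score, 2))
```

Requiring the match in both directions discards pairs where a frequent string in one language attracts many partners in the other; ties in `max` are broken by dictionary order, which a real system would want to handle more carefully.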
Remember, this is only six hundred and fifty sentences from a Massey Ferguson tractor manual. A lot of my problems came from the fact that this corpus was small. "Conditioning" occurs only with "air"; it is always "air conditioning", so it always seemed to line up like that. "Lighter" always goes like that too. And "clockwise" came out in German as well, so it really has recognized this. A couple of other things: "brake", and "turn to", in the sense of turning in a given direction.
I see that I have been given my stopping orders, and I always do exactly what I am told, at some time around the time when I am told to do it. So, yeah.