How to use sentence Tokenizer with Keras

Tokenization is an important step in pre-processing text data. The Tokenizer class in Keras turns each text into a sequence of integers, one integer per word. Let's see how to use the Keras Tokenizer class to tokenize sentences.
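To make "text into a sequence of integers" concrete, here is a minimal pure-Python sketch of the idea. It is not the Keras implementation: the helper names (split_words, build_word_index, to_sequences) and the separator list are assumptions made for illustration. Words are lowercased, counted, ranked by frequency, and then replaced by their rank.

```python
from collections import Counter

# Characters treated as separators in this sketch (an assumption; the
# apostrophe is deliberately kept so that "don't" stays a single word).
SEPARATORS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text):
    """Lowercase the text, turn separator characters into spaces, and split."""
    table = str.maketrans({ch: " " for ch in SEPARATORS})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequences(texts, word_index):
    """Replace every known word with its integer index."""
    return [[word_index[w] for w in split_words(t) if w in word_index]
            for t in texts]


texts = ["The movie was great.",
         "The movie was not great, the script was bad."]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index)
print(word_index)
print(sequences)
```

In this toy run "the" and "was" are the most frequent words, so they receive the smallest indices, which is the same pattern visible in the real Tokenizer output later in this post.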

Prerequisites
  • Run the code snippets below in Google Colab or a Jupyter Notebook.

Create a text file with some sample sentences


%%writefile sample.txt

I grew up (b. 1965) watching and loving the Thunderbirds. All my mates at school watched. We played ""Thunderbirds"" before school, during lunch and after school. We all wanted to be Virgil or Scott. No one wanted to be Alan. Counting down from 5 became an art form. I took my children to see the movie hoping they would get a glimpse of what I loved as a child. How bitterly disappointing. The only high point was the snappy theme tune. Not that it could compare with the original score of the Thunderbirds. Thankfully early Saturday mornings one television channel still plays reruns of the series Gerry Anderson and his wife created. Jonatha Frakes should hand in his directors chair, his version was completely hopeless. A waste of film. Utter rubbish. A CGI remake may be acceptable but replacing marionettes with Homo sapiens subsp. sapiens was a huge error of judgment.
When I put this movie in my DVD player, and sat down with a coke and some chips, I had some expectations. I was hoping that this movie would contain some of the strong-points of the first movie: Awsome animation, good flowing story, excellent voice cast, funny comedy and a kick-ass soundtrack. But, to my disappointment, not any of this is to be found in Atlantis: Milo's Return. Had I read some reviews first, I might not have been so let down. The following paragraph will be directed to those who have seen the first movie, and who enjoyed it primarily for the points mentioned.

When the first scene appears, your in for a shock if you just picked Atlantis: Milo's Return from the display-case at your local videoshop (or whatever), and had the expectations I had. The music feels as a bad imitation of the first movie, and the voice cast has been replaced by a not so fitting one. (With the exception of a few characters, like the voice of Sweet). The actual drawings isnt that bad, but the animation in particular is a sad sight. The storyline is also pretty weak, as its more like three episodes of Schooby-Doo than the single adventurous story we got the last time. But dont misunderstand, it's not very good Schooby-Doo episodes. I didnt laugh a single time, although I might have sniggered once or twice.

To the audience who haven't seen the first movie, or don't especially care for a similar sequel, here is a fast review of this movie as a stand-alone product: If you liked schooby-doo, you might like this movie. If you didn't, you could still enjoy this movie if you have nothing else to do. And I suspect it might be a good kids movie, but I wouldn't know. It might have been better if Milo's Return had been a three-episode series on a cartoon channel, or on breakfast TV. Why do people who do not know what a particular time in the past was like feel the need to try to define that time for others? Replace Woodstock with the Civil War and the Apollo moon-landing with the Titanic sinking and you've got as realistic a flick as this formulaic soap opera populated entirely by low-life trash. Is this what kids who were too young to be allowed to go to Woodstock and who failed grade school composition do? ""I'll show those old meanies, I'll put out my own movie and prove that you don't have to know nuttin about your topic to still make money!"" Yeah, we already know that. The one thing watching this film did for me was to give me a little insight into underclass thinking. The next time I see a slut in a bar who looks like Diane Lane, I'm running the other way. It's child abuse to let parents that worthless raise kids. It's audience abuse to simply stick Woodstock and the moonlanding into a flick as if that ipso facto means the film portrays 1969. Even though I have great interest in Biblical movies, I was bored to death every minute of the movie. Everything is bad. The movie is too long, the acting is most of the time a Joke and the script is horrible. I did not get the point in mixing the story about Abraham and Noah together. So if you value your time and sanity stay away from this horror." Im a die hard Dads Army fan and nothing will ever change that. I got all the tapes, DVD's and audiobooks and every time i watch/listen to them its brand new.

The film. The film is a re run of certain episodes, Man and the hour, Enemy within the gates, Battle School and numerous others with a different edge. Introduction of a new General instead of Captain Square was a brilliant move - especially when he wouldn't cash the cheque (something that is rarely done now).

It follows through the early years of getting equipment and uniforms, starting up and training. All in all, its a great film for a boring Sunday afternoon.

Two draw backs. One is the Germans bogus dodgy accents (come one, Germans cant pronounced the letter ""W"" like us) and Two The casting of Liz Frazer instead of the familiar Janet Davis. I like Liz in other films like the carry ons but she doesn't carry it correctly in this and Janet Davis would have been the better choice. A terrible movie as everyone has said. What made me laugh was the cameo appearance by Scott McNealy, giving an award to one of the murdered programmers in front of a wall of SUN logos. McNealy is the CEO of SUN Microsystem, a company that practically defines itself by its hatred of Microsoft. They have been instrumental in filing antitrust complaints against Microsoft. So, were they silly enough to think this bad movie would add fuel to that fire?

There's no public record I see of SUN's involvement, but clearly the makers of this movie know Scott McNealy. An interesting mystery.",0 Finally watched this shocking movie last night, and what a disturbing mindf**ker it is, and unbelievably bloody and some unforgettable scenes, and a total assault on the senses. Looks like a movie from the minds of Lynch (specifically ERASERHEAD), Buttgereit, and even a little of ""Begotten"". What this guy does to his pregnant sister is beyond belief, but then again, did it really happen or is it his brain's left and right sides doing battle. That's the main theme of this piece of art, to draw a fine line between fantasy and reality, and what would happen if the right side of the brain that dreams and fantasizes overtakes the reasoning and logical left side. And the music in this movie is unbelievable, a kind of electronic score that is absolutely perfect. Even though this movie is totally shocking and pretty disgusting in some of the most extreme scenes (including hard core sex) you will ever see in any movie, I viewed it as a work of art, and loved it. And that music still amazes me, I have to try and find the soundtrack if is available. Watching ""Subconscious Cruelty"" is a real event, and not something the viewer will easily forget. And a note to gorehounds, this is a must-have.

Warning... Be careful buying this movie, because some prints have fogging on the graphic sex scenes and extreme gore, especially the copies from the Japanese release.

Import Tokenizer and set the maximum vocabulary size


from keras.preprocessing.text import Tokenizer

MAX_VOCAB_SIZE = 20000
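The vocabulary cap is passed to the Tokenizer as num_words. As far as I understand the Keras behaviour, only words whose index is below num_words survive when text is converted to integers, so the num_words - 1 most frequent words are kept. The sketch below imitates that filtering step in plain Python; apply_vocab_cap is a hypothetical helper, not a Keras function.

```python
def apply_vocab_cap(sequence, num_words):
    """Keep only indices below num_words (the most frequent words)."""
    return [idx for idx in sequence if idx < num_words]


# Smaller index means more frequent word.
full_sequence = [1, 3, 2, 5, 4, 1, 6, 2, 7]
capped = apply_vocab_cap(full_sequence, num_words=5)
print(capped)  # [1, 3, 2, 4, 1, 2]
```

With a sample file as small as the one in this post, a cap of 20000 is far above the number of distinct words, so nothing is dropped.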


Create sentence list from sample.txt


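sentences = []

# Read the file line by line; each non-empty line becomes one entry
with open('./sample.txt', 'r') as file:
  for line in file:
    if line.strip():  # skip blank lines so they do not turn into empty sequences
      sentences.append(line)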


Convert sentences into integer sequences


tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

print("max sequence length:", max(len(seq) for seq in sequences))
print("min sequence length:", min(len(seq) for seq in sequences))

for sequence, sentence in zip(sequences, sentences):
  print(sequence)
  print(sentence)


You should see output similar to the following:


max sequence length: 336
min sequence length: 69
[6, 159, 81, 160, 161, 55, 3, 162, 1, 56, 34, 35, 163, 82, 36, 83, 43, 164, 56, 165, 36, 166, 167, 3, 168, 36, 43, 34, 84, 5, 20, 169, 29, 57, 85, 24, 84, 5, 20, 170, 171, 58, 37, 172, 173, 59, 60, 174, 6, 175, 35, 176, 5, 44, 1, 7, 86, 61, 38, 87, 2, 177, 4, 25, 6, 88, 15, 2, 89, 178, 179, 180, 1, 181, 182, 90, 16, 1, 183, 91, 184, 21, 11, 14, 92, 185, 26, 1, 186, 93, 4, 1, 56, 187, 94, 188, 189, 24, 190, 95, 45, 191, 192, 4, 1, 96, 193, 194, 3, 39, 195, 196, 197, 198, 199, 200, 10, 39, 201, 202, 39, 203, 16, 204, 205, 2, 206, 4, 30, 207, 208, 2, 209, 210, 211, 20, 212, 22, 213, 214, 26, 215, 97, 216, 97, 16, 2, 217, 218, 4, 219]
I grew up (b. 1965) watching and loving the Thunderbirds. All my mates at school watched. We played ""Thunderbirds"" before school, during lunch and after school. We all wanted to be Virgil or Scott. No one wanted to be Alan. Counting down from 5 became an art form. I took my children to see the movie hoping they would get a glimpse of what I loved as a child. How bitterly disappointing. The only high point was the snappy theme tune. Not that it could compare with the original score of the Thunderbirds. Thankfully early Saturday mornings one television channel still plays reruns of the series Gerry Anderson and his wife created. Jonatha Frakes should hand in his directors chair, his version was completely hopeless. A waste of film. Utter rubbish. A CGI remake may be acceptable but replacing marionettes with Homo sapiens subsp. sapiens was a huge error of judgment.

[62, 6, 98, 9, 7, 10, 35, 220, 221, 3, 222, 58, 26, 2, 223, 3, 27, 224, 6, 40, 27, 99, 6, 16, 86, 11, 9, 7, 38, 225, 27, 4, 1, 226, 100, 4, 1, 31, 7, 227, 101, 63, 228, 64, 229, 65, 102, 230, 231, 3, 2, 232, 233, 103, 22, 5, 35, 234, 21, 104, 4, 9, 8, 5, 20, 235, 10, 105, 66, 67, 40, 6, 236, 27, 237, 31, 6, 41, 21, 13, 32, 46, 106, 58, 1, 238, 239, 47, 20, 240, 5, 107, 28, 13, 108, 1, 31, 7, 3, 28, 241, 14, 242, 33, 1, 100, 243, 12, 12, 62, 1, 31, 244, 245, 48, 10, 33, 2, 246, 17, 18, 247, 248, 105, 66, 67, 37, 1, 249, 250, 82, 48, 251, 252, 29, 253, 3, 40, 1, 99, 6, 40, 1, 68, 254, 15, 2, 49, 255, 4, 1, 31, 7, 3, 1, 65, 102, 109, 32, 256, 50, 2, 21, 46, 257, 24, 26, 1, 258, 4, 2, 259, 260, 19, 1, 65, 4, 261, 1, 262, 263, 264, 11, 49, 22, 1, 101, 10, 110, 8, 2, 265, 266, 1, 267, 8, 268, 111, 269, 15, 51, 270, 19, 112, 69, 4, 70, 71, 271, 1, 113, 272, 64, 43, 72, 1, 114, 23, 22, 273, 274, 73, 21, 275, 63, 70, 71, 69, 6, 276, 115, 2, 113, 23, 277, 6, 41, 13, 278, 279, 29, 280, 12, 12, 5, 1, 116, 28, 281, 108, 1, 31, 7, 29, 117, 74, 282, 33, 2, 283, 284, 285, 8, 2, 286, 287, 4, 9, 7, 15, 2, 288, 289, 290, 17, 18, 291, 70, 71, 18, 41, 19, 9, 7, 17, 18, 292, 18, 92, 45, 293, 9, 7, 17, 18, 13, 118, 294, 5, 52, 3, 6, 295, 14, 41, 20, 2, 63, 75, 7, 22, 6, 119, 42, 14, 41, 13, 32, 120, 17, 66, 67, 40, 32, 2, 112, 296, 96, 53, 2, 297, 95, 29, 53, 298, 299]
When I put this movie in my DVD player, and sat down with a coke and some chips, I had some expectations. I was hoping that this movie would contain some of the strong-points of the first movie: Awsome animation, good flowing story, excellent voice cast, funny comedy and a kick-ass soundtrack. But, to my disappointment, not any of this is to be found in Atlantis: Milo's Return. Had I read some reviews first, I might not have been so let down. The following paragraph will be directed to those who have seen the first movie, and who enjoyed it primarily for the points mentioned.

When the first scene appears, your in for a shock if you just picked Atlantis: Milo's Return from the display-case at your local videoshop (or whatever), and had the expectations I had. The music feels as a bad imitation of the first movie, and the voice cast has been replaced by a not so fitting one. (With the exception of a few characters, like the voice of Sweet). The actual drawings isnt that bad, but the animation in particular is a sad sight. The storyline is also pretty weak, as its more like three episodes of Schooby-Doo than the single adventurous story we got the last time. But dont misunderstand, it's not very good Schooby-Doo episodes. I didnt laugh a single time, although I might have sniggered once or twice.

To the audience who haven't seen the first movie, or don't especially care for a similar sequel, here is a fast review of this movie as a stand-alone product: If you liked schooby-doo, you might like this movie. If you didn't, you could still enjoy this movie if you have nothing else to do. And I suspect it might be a good kids movie, but I wouldn't know. It might have been better if Milo's Return had been a three-episode series on a cartoon channel, or on breakfast TV.

[300, 52, 301, 28, 52, 21, 42, 25, 2, 110, 23, 10, 1, 302, 16, 19, 303, 1, 304, 5, 121, 5, 305, 11, 23, 33, 122, 306, 76, 26, 1, 307, 308, 3, 1, 309, 310, 311, 26, 1, 312, 313, 3, 314, 72, 15, 315, 2, 123, 15, 9, 316, 317, 318, 319, 320, 50, 321, 322, 323, 8, 9, 25, 75, 28, 124, 125, 324, 5, 20, 325, 5, 326, 5, 76, 3, 28, 327, 328, 36, 329, 52, 126, 330, 107, 331, 332, 126, 98, 333, 35, 334, 7, 3, 335, 11, 18, 117, 13, 5, 42, 336, 127, 48, 337, 5, 45, 338, 339, 340, 43, 341, 42, 11, 1, 24, 342, 55, 9, 30, 77, 33, 54, 16, 5, 343, 54, 2, 128, 344, 129, 345, 346, 1, 347, 23, 6, 44, 2, 348, 10, 2, 349, 28, 130, 19, 350, 351, 352, 353, 1, 131, 354, 73, 89, 132, 5, 106, 355, 11, 356, 357, 75, 73, 116, 132, 5, 358, 359, 76, 3, 1, 360, 129, 2, 123, 15, 17, 11, 361, 362, 363, 1, 30, 364, 365]
Why do people who do not know what a particular time in the past was like feel the need to try to define that time for others? Replace Woodstock with the Civil War and the Apollo moon-landing with the Titanic sinking and you've got as realistic a flick as this formulaic soap opera populated entirely by low-life trash. Is this what kids who were too young to be allowed to go to Woodstock and who failed grade school composition do? ""I'll show those old meanies, I'll put out my own movie and prove that you don't have to know nuttin about your topic to still make money!"" Yeah, we already know that. The one thing watching this film did for me was to give me a little insight into underclass thinking. The next time I see a slut in a bar who looks like Diane Lane, I'm running the other way. It's child abuse to let parents that worthless raise kids. It's audience abuse to simply stick Woodstock and the moonlanding into a flick as if that ipso facto means the film portrays 1969.

[78, 133, 6, 13, 134, 366, 10, 367, 368, 6, 16, 369, 5, 370, 135, 371, 4, 1, 7, 372, 8, 49, 1, 7, 8, 125, 373, 1, 374, 8, 136, 4, 1, 23, 2, 375, 3, 1, 376, 8, 377, 6, 77, 21, 87, 1, 90, 10, 378, 1, 64, 127, 379, 3, 380, 381, 46, 17, 18, 382, 48, 23, 3, 383, 384, 385, 37, 9, 386]
Even though I have great interest in Biblical movies, I was bored to death every minute of the movie. Everything is bad. The movie is too long, the acting is most of the time a Joke and the script is horrible. I did not get the point in mixing the story about Abraham and Noah together. So if you value your time and sanity stay away from this horror."

[387, 2, 388, 137, 389, 390, 391, 3, 118, 47, 138, 392, 11, 6, 72, 34, 1, 393, 394, 3, 395, 3, 135, 23, 6, 396, 397, 5, 398, 51, 399, 139, 12, 12, 1, 30, 1, 30, 8, 2, 400, 401, 4, 402, 69, 403, 3, 1, 404, 405, 406, 1, 407, 140, 36, 3, 408, 122, 26, 2, 409, 410, 411, 4, 2, 139, 412, 141, 4, 413, 414, 16, 2, 415, 416, 74, 62, 417, 119, 418, 1, 419, 142, 11, 8, 420, 421, 422, 12, 12, 14, 423, 424, 1, 94, 425, 4, 426, 427, 3, 428, 429, 81, 3, 430, 34, 10, 34, 51, 2, 134, 30, 33, 2, 431, 432, 433, 12, 12, 143, 144, 434, 24, 8, 1, 145, 435, 436, 437, 438, 24, 145, 439, 440, 1, 441, 442, 19, 443, 3, 143, 1, 444, 4, 146, 445, 141, 4, 1, 446, 147, 148, 6, 19, 146, 10, 131, 447, 19, 1, 149, 448, 22, 449, 450, 149, 14, 451, 10, 9, 3, 147, 148, 38, 13, 32, 1, 120, 452]
Im a die hard Dads Army fan and nothing will ever change that. I got all the tapes, DVD's and audiobooks and every time i watch/listen to them its brand new.

The film. The film is a re run of certain episodes, Man and the hour, Enemy within the gates, Battle School and numerous others with a different edge. Introduction of a new General instead of Captain Square was a brilliant move - especially when he wouldn't cash the cheque (something that is rarely done now).

It follows through the early years of getting equipment and uniforms, starting up and training. All in all, its a great film for a boring Sunday afternoon.

Two draw backs. One is the Germans bogus dodgy accents (come one, Germans cant pronounced the letter ""W"" like us) and Two The casting of Liz Frazer instead of the familiar Janet Davis. I like Liz in other films like the carry ons but she doesn't carry it correctly in this and Janet Davis would have been the better choice.

[2, 453, 7, 15, 454, 109, 455, 25, 456, 54, 115, 16, 1, 457, 458, 50, 57, 79, 459, 59, 460, 5, 24, 4, 1, 461, 462, 10, 463, 4, 2, 464, 4, 150, 465, 79, 8, 1, 466, 4, 150, 467, 2, 468, 11, 469, 470, 471, 50, 51, 472, 4, 151, 61, 13, 32, 473, 10, 474, 475, 476, 477, 151, 46, 124, 61, 478, 479, 5, 480, 9, 49, 7, 38, 481, 482, 5, 11, 483, 12, 12, 484, 85, 485, 486, 6, 44, 4, 487, 488, 22, 489, 1, 490, 4, 9, 7, 42, 57, 79, 59, 491, 492, 493]
A terrible movie as everyone has said. What made me laugh was the cameo appearance by Scott McNealy, giving an award to one of the murdered programmers in front of a wall of SUN logos. McNealy is the CEO of SUN Microsystem, a company that practically defines itself by its hatred of Microsoft. They have been instrumental in filing antitrust complaints against Microsoft. So, were they silly enough to think this bad movie would add fuel to that fire?

There's no public record I see of SUN's involvement, but clearly the makers of this movie know Scott McNealy. An interesting mystery.",0

[494, 83, 9, 152, 7, 114, 495, 3, 25, 2, 496, 497, 498, 14, 8, 3, 499, 500, 3, 27, 501, 80, 3, 2, 502, 503, 53, 1, 504, 130, 19, 2, 7, 37, 1, 505, 4, 506, 507, 508, 509, 3, 78, 2, 128, 4, 510, 25, 9, 511, 512, 5, 39, 513, 514, 8, 515, 516, 22, 517, 518, 77, 14, 519, 153, 29, 8, 14, 39, 520, 154, 3, 155, 521, 522, 140, 523, 1, 524, 91, 4, 9, 525, 4, 60, 5, 144, 2, 526, 527, 528, 529, 3, 530, 3, 25, 38, 153, 17, 1, 155, 156, 4, 1, 531, 11, 532, 3, 533, 534, 1, 535, 3, 536, 154, 156, 3, 1, 68, 10, 9, 7, 8, 537, 2, 538, 4, 539, 93, 11, 8, 540, 541, 78, 133, 9, 7, 8, 542, 152, 3, 111, 543, 10, 27, 4, 1, 136, 157, 80, 544, 137, 545, 158, 18, 47, 138, 44, 10, 104, 7, 6, 546, 14, 15, 2, 547, 4, 60, 3, 88, 14, 3, 11, 68, 45, 548, 54, 6, 13, 5, 121, 3, 549, 1, 103, 17, 8, 550, 55, 551, 552, 8, 2, 553, 554, 3, 21, 142, 1, 555, 47, 556, 557, 3, 2, 558, 5, 559, 9, 8, 2, 560, 13, 12, 12, 561, 20, 562, 563, 9, 7, 564, 27, 565, 13, 566, 53, 1, 567, 158, 80, 3, 157, 568, 74, 1, 569, 37, 1, 570, 571]
Finally watched this shocking movie last night, and what a disturbing mindf**ker it is, and unbelievably bloody and some unforgettable scenes, and a total assault on the senses. Looks like a movie from the minds of Lynch (specifically ERASERHEAD), Buttgereit, and even a little of ""Begotten"". What this guy does to his pregnant sister is beyond belief, but then again, did it really happen or is it his brain's left and right sides doing battle. That's the main theme of this piece of art, to draw a fine line between fantasy and reality, and what would happen if the right side of the brain that dreams and fantasizes overtakes the reasoning and logical left side. And the music in this movie is unbelievable, a kind of electronic score that is absolutely perfect. Even though this movie is totally shocking and pretty disgusting in some of the most extreme scenes (including hard core sex) you will ever see in any movie, I viewed it as a work of art, and loved it. And that music still amazes me, I have to try and find the soundtrack if is available. Watching ""Subconscious Cruelty"" is a real event, and not something the viewer will easily forget. And a note to gorehounds, this is a must-have.

Warning... Be careful buying this movie, because some prints have fogging on the graphic sex scenes and extreme gore, especially the copies from the Japanese release.

View the word-to-integer mapping


word2idx = tokenizer.word_index

for i in word2idx.items():
  print(i)


You should see output similar to the following:


('the', 1)
('a', 2)
('and', 3)
('of', 4)
('to', 5)
('i', 6)
('movie', 7)
('is', 8)
('this', 9)
('in', 10)
('that', 11)
('br', 12)
('have', 13)
('it', 14)
('as', 15)
('was', 16)
('if', 17)
('you', 18)
('like', 19)
('be', 20)
('not', 21)
('but', 22)
('time', 23)
('one', 24)
....
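The mapping can also be read in the other direction to turn a sequence back into words. The following is a small sketch that assumes only a plain dictionary shaped like tokenizer.word_index; decode is a hypothetical helper, and the dictionary holds just the first ten entries from the listing above, so any other index is shown as "?".

```python
# First ten entries copied from the mapping printed above.
word2idx = {'the': 1, 'a': 2, 'and': 3, 'of': 4, 'to': 5,
            'i': 6, 'movie': 7, 'is': 8, 'this': 9, 'in': 10}

# Invert the mapping: integer -> word.
idx2word = {idx: word for word, idx in word2idx.items()}


def decode(sequence, idx2word, unknown="?"):
    """Turn a list of integers back into a space-separated string."""
    return " ".join(idx2word.get(idx, unknown) for idx in sequence)


# A short slice taken from one of the sequences printed earlier.
print(decode([4, 1, 7, 372, 8], idx2word))  # of the movie ? is
```

In a real session you would build idx2word from the full tokenizer.word_index rather than from an excerpt. The Keras Tokenizer also appears to keep a reverse mapping of its own in tokenizer.index_word, though it is worth checking that against your installed version.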

Category: TensorFlow