Analysis of the Game of Thrones Scripts

Ander Fernández Jauregui

Data Scientist and Business Intelligence.

Initial Problem

Every day a great deal of information is generated in the form of text: tweets, reviews, news, emails… This data can contain a lot of highly relevant information, but it is rarely analyzed.

In my first data science project, I tackle the problem of text analysis to extract information from some of the most widely recognized texts out there: the Game of Thrones scripts.

Topics Involved

  • Natural Language Processing
    • Sentiment analysis
    • Analysis of relationships in word usage
  • Data preparation
    • Web scraping
    • Cleaning of lexical data

Solution

To tackle this problem, we decided to create a program to scrape the Game of Thrones scripts from the web.

Once the scripts were obtained and cleaned, the texts were analyzed following the tidy methodology suggested in the book «Text Mining with R».
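
As a preview of that tidy approach, here is a minimal sketch (it assumes the tidytext package and a hypothetical "textos" data frame with one script per row in a "Texto" column; it is illustrative, not the exact code used in the project):

library(tidytext) # unnest_tokens() and the stop_words dataset
library(dplyr)

palabras_tidy <- textos %>%
  unnest_tokens(word, Texto) %>%        # one word per row (tidy format)
  anti_join(stop_words, by = "word")    # drop common English stop words

palabras_tidy %>% count(word, sort = TRUE) # most frequent words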

Project Development


1 Business Understanding

1.1 Business Objective

The company wants to analyze the text of the Game of Thrones scripts and extract as much information as possible from this still untapped source.

Note: although this project deals with Game of Thrones, the same case (perhaps an even simpler one) would be analyzing the reviews of a restaurant's diners or the opinions of a hotel's guests.

1.2 Situation Assessment

When tackling the project we run into the following limitations:

  • We do not have the Game of Thrones scripts. Instead, we will extract them from the internet, with whatever typos or errors that may entail.
  • Since this is a dataset with quite a few variables (many characters, many seasons and episodes, many houses, etc.), we will focus the analysis on the main characters as far as possible.

1.3 Data Mining Objective

In order to carry out the analysis we need to obtain a database with the following characteristics:

  • Granularity: each sentence spoken by a character.
  • Cadence: episodes.

Therefore, we should obtain a database in which each row is a sentence spoken by a character, and we should obtain this information for every episode.
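
As an illustration of that target structure, a couple of hypothetical rows (invented for the example; the column names are also just illustrative) could look like this:

library(tibble)
tribble(
  ~Season,    ~Episode,           ~Character, ~Sentence,
  "Season 1", "winter is coming", "WILL",     "I've never seen wildlings do a thing like this.",
  "Season 1", "winter is coming", "GARED",    "We should head back to the wall."
)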

2 Data Understanding

2.1 Obtaining the Data

A quick Google search for the keyword “game of thrones scripts season 1” returns the Genius page. A quick glance at its URL shows that it follows the pattern /Season-[number of season]-scripts.

Therefore, we create a loop to build those URLs.

library(rvest)   # Web scraping
library(dplyr)   # General wrangling (filtering, grouping, etc.)
library(stringr) # Text manipulation
library(qdap)    # mgsub function
library(tidyr)   # Text manipulation (unnest)
library(tm)      # Text manipulation
library(ggplot2) # Plots

url1 = NA
for(i in 1:7){
  url1[i] = paste0("https://genius.com/albums/Game-of-thrones/Season-",i,"-scripts")
}

url1
## [1] "https://genius.com/albums/Game-of-thrones/Season-1-scripts"
## [2] "https://genius.com/albums/Game-of-thrones/Season-2-scripts"
## [3] "https://genius.com/albums/Game-of-thrones/Season-3-scripts"
## [4] "https://genius.com/albums/Game-of-thrones/Season-4-scripts"
## [5] "https://genius.com/albums/Game-of-thrones/Season-5-scripts"
## [6] "https://genius.com/albums/Game-of-thrones/Season-6-scripts"
## [7] "https://genius.com/albums/Game-of-thrones/Season-7-scripts"

Inside each of those URLs we find the links that point to the page containing the script of each episode of that season. We will run another loop, but we will keep the previous link, so that for each script we have both the season and the episode name.

capitulos = NA
y = NA
for(i in 1:length(url1)) {

  x <- read_html(url1[i]) %>%
      html_nodes(".u-display_block") %>%
      html_attr('href')

  y = cbind(url1[i], x)

  if(i == 1){
    capitulos = y
  } else {
    capitulos = rbind(capitulos,y)
  }
}

We create a data frame from the data we have obtained.

capitulos<- as.data.frame(capitulos, stringsAsFactors = FALSE)
colnames(capitulos) <- c("Temporada", "Episodio")

Now that we have all the URLs, we can access each of them and obtain every script. The script is located inside the lyrics div, so we will use that class as our “node”.

for(i in 1:length(capitulos$Episodio)) {
  x <- read_html(capitulos[i,2]) %>%
    html_nodes(".lyrics") %>%
    html_text()
  y <- cbind(capitulos[i,], x)
  if(i == 1){
    textos = y
  } else {
    textos = rbind(textos,y)
  }
}
rm(y,url1,x,i)

colnames(textos) <- c("Temporada", "Episodio", "Texto")
textos$Texto <- as.character(textos$Texto)

We now have the raw scripts. What remains is to clean them, extract additional information, and put them into a format that is useful to us.

3 Data Preparation

3.0.1 Data cleaning

Obtaining the season number. The season URLs follow the pattern “https://genius.com/albums/Game-of-thrones/” + “Season-Number” + “-scripts”, so we will strip all of that out.

After doing so, we will be left with something like “Season-1”, so we will also replace the “-”.

textos$Temporada <- gsub("https://genius.com/albums/Game-of-thrones/","",textos$Temporada )
textos$Temporada <- gsub("-scripts","",textos$Temporada)
textos$Temporada <- gsub("-"," ",textos$Temporada)
textos$Temporada <- as.factor(textos$Temporada)

Obtaining the episode names. In this case, the URLs follow a different pattern: “https://genius.com/Game-of-thrones-” + “episode-name” + “-annotated”.

Once again, we proceed to delete all those patterns. After that, we will replace the “-” used to separate words.

textos$Episodio <- gsub("https://genius.com/Game-of-thrones-","",textos$Episodio)
textos$Episodio <- gsub("-annotated","",textos$Episodio)
textos$Episodio <- gsub("-"," ",textos$Episodio)
textos$Episodio <- as.factor(textos$Episodio)

3.1 Deleting extra rows

We scraped seasons 1 through 7. Each season has 10 episodes, except the seventh, which only has 7. Therefore, we should have a database with 67 observations. However, we can see that this is not the case.
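
The check that produced the message below is not shown in the original; a minimal, hypothetical version of it could be:

esperadas <- 6 * 10 + 7   # seasons 1-6 have 10 episodes each, season 7 has 7
cat("The df has", nrow(textos), "rows,", nrow(textos) - esperadas, "rows more than it should")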

## The df has 69 rows, 2 rows more than it should

Some of the season pages may contain extra URLs that we scraped but that, in the end, are not needed. To find out, we look at the episode names.

levels(textos$Episodio)
##  [1] "a golden crown"                     
##  [2] "a man without honor"                
##  [3] "and now his watch is ended"         
##  [4] "baelor"                             
##  [5] "battle of the bastards"             
##  [6] "beyond the wall"                    
##  [7] "blackwater"                         
##  [8] "blood of my blood"                  
##  [9] "book of the stranger"               
## [10] "breaker of chains"                  
## [11] "cripples bastards and broken things"
## [12] "dark wings dark words"              
## [13] "dragonstone"                        
## [14] "eastwatch"                          
## [15] "fire and blood"                     
## [16] "first of his name"                  
## [17] "garden of bones"                    
## [18] "hardhome"                           
## [19] "high sparrow"                       
## [20] "home"                               
## [21] "kill the boy"                       
## [22] "kissed by fire"                     
## [23] "lord snow"                          
## [24] "mhysa"                              
## [25] "mockingbird"                        
## [26] "mothers mercy"                      
## [27] "no one"                             
## [28] "oathbreaker"                        
## [29] "oathkeeper"                         
## [30] "season 4 preview"                   
## [31] "season 5 trailer breakdown"         
## [32] "second sons"                        
## [33] "sons of the harpy"                  
## [34] "stormborn"                          
## [35] "the bear and the maiden fair"       
## [36] "the broken man"                     
## [37] "the children"                       
## [38] "the climb"                          
## [39] "the dance of dragons"               
## [40] "the door"                           
## [41] "the dragon and the wolf"            
## [42] "the ghost of harrenhal"             
## [43] "the gift"                           
## [44] "the house of black and white"       
## [45] "the kingsroad"                      
## [46] "the laws of gods and men"           
## [47] "the lion and the rose"              
## [48] "the mountain and the viper"         
## [49] "the night lands"                    
## [50] "the north remembers"                
## [51] "the old gods and the new"           
## [52] "the pointy end"                     
## [53] "the prince of winterfell"           
## [54] "the queens justice"                 
## [55] "the rains of castamere"             
## [56] "the red woman"                      
## [57] "the spoils of war"                  
## [58] "the wars to come"                   
## [59] "the watchers on the wall"           
## [60] "the winds of winter"                
## [61] "the wolf and the lion"              
## [62] "two swords"                         
## [63] "unbowed unbent unbroken"            
## [64] "valar dohaeris"                     
## [65] "valar morghulis"                    
## [66] "walk of punishment"                 
## [67] "what is dead may never die"         
## [68] "winter is coming"                   
## [69] "you win or you die"

Indeed, we find “season 5 trailer breakdown” and “season 4 preview”. Since they are not episodes, they contain no script and are of no interest to us. We delete them.

textos$Episodio <- as.character(textos$Episodio)
textos<- textos[textos$Episodio != "season 5 trailer breakdown" & textos$Episodio != "season 4 preview", ]
textos$Episodio <- as.factor(textos$Episodio)

We will also delete those rows from the “capitulos” df, just in case we need to use it afterwards.

capitulos<- capitulos[capitulos$Episodio != "https://genius.com/Game-of-thrones-season-5-trailer-breakdown-annotated" & capitulos$Episodio != "https://genius.com/Game-of-thrones-season-4-preview-annotated", ]

Now the dataset contains 67 observations, so in that regard it is OK.

3.2 Deleting extra text

If we analyze the text, we will see that it contains many things that are not of interest, such as descriptions, changes of location, etc. Thus, we will delete all of those, so that we are left only with the person speaking and what he or she said.

Before doing any analysis we should be clear about the kind of structure we want to get, that is, how we will break up the sentences and separate the person who is talking from what he/she is saying.

After analyzing the texts, my decision (as you will see later in the notebook) has been to split by the “:”. The reason is that every time a person speaks the structure is the following: Name (Surname): text.

However, I am well aware this is not the only way. So if you find a new one, feel free to share it.
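
As a toy illustration of the idea (an invented line, not taken from the scripts), splitting on ":" separates the speaker from what is said:

strsplit("TYRION: Never forget what you are.", ":")[[1]]
# returns "TYRION" and " Never forget what you are."

The actual split, shown in section 4, wraps this same ":" separator in a regular expression so it can be applied to the whole script at once.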

3.2.1 Deleting descriptions

For a better understanding of what we have to delete, we should look at the chapters. Besides, it is important to look into the scripts of episodes from different seasons.

The reason is that, as we will see later on, the transcribers have not been consistent in the way they uploaded the scripts. Thus, if we only look at scripts from the same season, we will not get everything right. That was my case and I “wasted” a lot of time because of it, so hopefully it does not happen to you.

So, if we take a closer look we will see that there are three ways in which the descriptions have been added: in brackets, in italics, and in bold.

3.2.1.1 Deleting descriptions: brackets and parentheses

This is an easy case. We just have to find and delete the content between brackets and then between parentheses. However, before doing so we should analyze how each of them is used.

Just by analyzing the text of the first transcript we see that while brackets are used for descriptions, parentheses fulfill several purposes: 1. To express how something is said. Example: WILL (whispering): Forgive me, lord. 2. To express background music. Example: (Eerie music in background). 3. To express to whom someone is talking. Example: JON (to BRAN): Don’t look away.

This last case is quite interesting. If the transcribers have not put a “:” after the closing parenthesis, we lose the separator we use to split the text, which would lead to a poorer analysis.

Thus, we will analyze this last case. To do so, we will use regular expressions. If you are a beginner and not familiar with them yet, I encourage you to use the Regex Tester (https://spannbaueradam.shinyapps.io/r_regex_tester/), developed by Adam Spannbauer.

To do so, we will count how many times each text contains a “Name (to Name)” and subtract the number of times it contains a “Name (to Name):”. If the result is higher than 0, then we have the number of times a given text contains a “Name (to Name)” not followed by a “:”.

lengths(regmatches(textos$Texto, gregexpr("[A-z]+\\s*\\({1}[t]{1}[o]{1}\\s*[A-z]*\\){1}", textos$Texto))) - lengths(regmatches(textos$Texto, gregexpr("[A-z]+\\s*\\({1}[t]{1}[o]{1}\\s*[A-z]*\\){1}\\s*\\:", textos$Texto)))
##  [1] 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

As we can see, this issue happens two times. Besides, we know both happen in the first episode, as the position in the vector refers to the episode.

Just by having a closer look, we find the two cases where it happens: 1. ROBERT (to BRAN) 2. VISERYS (to DAENERYS)

We proceed to change those, adding a : after them.

textos$Texto <- gsub("ROBERT (to BRAN)","ROBERT: (to BRAN)", textos$Texto, fixed = TRUE)
textos$Texto <- gsub("VISERYS (to DAENERYS) ","VISERYS: (to DAENERYS) ", textos$Texto, fixed = TRUE)

If we now check how many times the previous regular expression is matched, the answer should be 0.

grep("[A-z]+\\s*\\({1}[t]{1}[o]{1}\\s*[A-z]*\\){1} ",textos$Texto)
## integer(0)

Now that we have cleaned that up, we can delete the text within brackets and parentheses. To do so, we will once again use regular expressions. Besides, I will delete another phrase I have found in episode 1 that, despite being a description, is not enclosed in brackets.

textos$Texto <- gsub("\\[.*?\\] *","",textos$Texto)
textos$Texto <- gsub("\\(.*?\\) *","",textos$Texto) 
textos$Texto <- gsub("NED nods yes, and WILL is positioned on the tree limb that serves as a block","",textos$Texto)
3.2.1.2 Deleting descriptions: italics

In order to delete the text in italics, we will grab that text through web scraping and then remove it. The natural approach would be to analyze just one page and find the class the italics fall within. However, as mentioned before, we should look at several scripts from different seasons, because the transcribers have not been very consistent.

If we do so, we find that two labels have been used: «i» and «em».

We will create one function that reads the «i» nodes and another that reads the «em» nodes, and apply them with sapply to get the results.

leer_cursiva <- function(x) {
  read_html(x) %>%
    html_nodes(".lyrics p i") %>%
    html_text()
}
descripcciones <- sapply(capitulos$Episodio, leer_cursiva)
df1 <- unlist(descripcciones)
df1 <- as.data.frame(df1, stringsAsFactors = FALSE)

leer_cursiva2 <- function(x) {
  read_html(x) %>%
    html_nodes(".lyrics p em") %>%
    html_text()
}
descripcciones2 <- sapply(capitulos$Episodio, leer_cursiva2)
df2 <- unlist(descripcciones2)
df2 <- as.data.frame(df2, stringsAsFactors = FALSE)

We will now bind both dataframes together and remove everything we created along the way, so as to keep the workspace clean.

colnames(df2) <- c("df1") # Rename the column so both data frames can be bound without problems.
df <- rbind(df1,df2, stringsAsFactors = FALSE) 
rm(descripcciones,descripcciones2, leer_cursiva, leer_cursiva2,df1,df2)

As we can see, we now have all the descriptions labeled with «i» and «em». Even though in these first 5 descriptions everything looks OK, we will check whether anything is wrong with any of the descriptions.

Bear in mind that, as we are going to delete all these descriptions, we might end up deleting something we do not want to delete, which would affect the whole process.
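
The call that printed the five descriptions below is not shown in the original; something like the following would produce it:

head(df$df1, 5)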

## [1] "Daenerys on horseback looking at Drogo and his people."                                  
## [2] "Arriving at camp, Daenerys looks exhausted from riding."                                 
## [3] "After Daenerys is escorted away, Jorah and Viserys are alone."                           
## [4] "Joffrey finding Tyrion asleep in the dog’s pen of Winterfell."                           
## [5] "Tyrion walks from the courtyard to the dining hall where his family is eating breakfast."

But how do we know that a description is not right? Well, a good proxy might be the number of words in the string. Descriptions usually have many words, so if we find some with just a few, they might be errors.

First, we create a new column that counts the number of words:

df$words = sapply(strsplit(df$df1," "), length)

Let’s see whether there are descriptions of length 1 and how they might affect us:

df %>%
  filter(words == 1)

  df1         words
1 Throne      1
2 sraw        1
3 beserk      1
4 (singing)   1
5 (screaming) 1

As we can see, there are 1,572 descriptions with just one word, out of 8,136 total descriptions. However, we can see that many of them appear between parentheses. As we have already deleted that text, it will not be much of a problem if we do not keep descriptions with just one word.
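
The figures quoted above are not computed in the code shown; a quick, hypothetical check would be:

sum(df$words == 1) # descriptions with a single word
nrow(df)           # total number of descriptions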

Thus, we discard the descriptions with a single word.

df <- df %>%
  filter(words > 1)

Now we will analyze the sentences that contain two words, just in case.

df %>%
  filter(words == 2) %>%
  select(df1) %>%
  distinct(df1) %>%
  t(.) %>%
  as.vector(.)
##   [1] "Benjen enters."           "They break."             
##   [3] "He did."                  "The K"                   
##   [5] "rd approaches."           "A witch."                
##   [7] "(To Catelyn)"             "(in Dothraki)"           
##   [9] "Bronn drinks."            "(Tyrion drinks)"         
##  [11] "Tyrion drinks."           "(to Tyrion)"             
##  [13] " you"                     "Tyrion laughs."          
##  [15] "(to Jon) "                "JON spits."              
##  [17] "Jon leaves."              "SAM leaves"              
##  [19] "Balon enters."            "BALON exits."            
##  [21] "Yara leaves."             "Shae enters."            
##  [23] "They pause."              "Theon kneels."           
##  [25] "BRONN enters."            "KOVARRO dismounts."      
##  [27] "RENLY stands."            "DAVOS exits."            
##  [29] "YARA enters."             "ARYA exits."             
##  [31] "IRRI exits."              "QUAITHE leaves."         
##  [33] "RODRIK enters."           "EXT WINTERFELL"          
##  [35] "OSHA exits."              "OSHA kneels."            
##  [37] "LITTLEFINGER enters."     "CATELYN enters"          
##  [39] "JON pauses."              "YGRITTE smiles"          
##  [41] "ARYA stops."              "ROBB exits."             
##  [43] "JORAH exits."             "XARO stands."            
##  [45] "THEON pauses."            "YARA exits."             
##  [47] "Riders appraoch."         "ARYA enters."            
##  [49] "VARYS enters."            "He sits."                
##  [51] "JAQEN considers."         "JAQEN exits."            
##  [53] "CERSEI'S CHAMBERS"        "PODRICK snickers."       
##  [55] "She sits."                "They kiss."              
##  [57] "ROOSE exits."             "JOFFREY exits."          
##  [59] "JOFFREY hesistates."      "TYWIN enters"            
##  [61] "Pycelle exits."           "Robb exits."             
##  [63] "THOEN rises."             "LUWIN enters."           
##  [65] "VARYS bows."              "JON rises."              
##  [67] "TORMUND glowers."         "She rises."              
##  [69] "Tyrion considers."        "Davos hesitates."        
##  [71] "Salladhor sits."          "Talisa approaches."      
##  [73] "Tywin sits."              "Littlefinger approaches."
##  [75] "They bow."                "Barristan approaches."   
##  [77] "CERSEI stands."           "TALISA sits."            
##  [79] "She approaches."          "PODRICK enters."         
##  [81] "VARYS approaches."        "JEOR exits."             
##  [83] "JEOR stands."             "SAM enters."             
##  [85] "BERIC enters."            "YGRITTE moans."          
##  [87] "THEY kiss."               "CERSEI exits."           
##  [89] "PODRICK extis."           "ROBB coniders."          
##  [91] "Stannis exits."           "LITTLEFINGER exits."     
##  [93] "SAM sits."                "RICKON wakes."           
##  [95] "TORMUND exits."           "ORELL exits."            
##  [97] "They laugh."              "RAMSAY stops."           
##  [99] "EDMURE stands."           "BRYDEN sits."            
## [101] "ROBB stands."             "They walk."              
## [103] "JAIME exits."             "OSHA sits."              
## [105] "MELISSANDRE exits."       "[Opening Credits]"       
## [107] "[ROBB pauses]"            "[In Valyrian]"           
## [109] "[From outside]"           "[Thunder rumbles]"       
## [111] "[To BRAN]"                "[To TORMUND]"            
## [113] "[Addressing DAARIO]"      "[ROSLIN smiles]"         
## [115] "[To RICKON]"              "[BARRISTAN shrugs]"      
## [117] "ROOSE smiles"             "[Everybody laughs]"      
## [119] "[Everybody cheers]"       "[Men laughing]"          
## [121] "[ARYA grins]"             "THEON sulks."            
## [123] "SHAE exits."              "CERSEI sits."            
## [125] "Hoofbeats approach."      "TYRION claps."           
## [127] "OBERYN enters."           "TYRION enters."          
## [129] "MAREI enters."            "MISSANDEI exists."       
## [131] "RAMSAY sits."             "MELISANDRE enters."      
## [133] "BRONN exits."             "JAIME enters."           
## [135] "SHIREEN claps."           "OBERYN considers"        
## [137] "JON exits."               "They shake."             
## [139] "MARGAERY enters."         "PODRICK exits"           
## [141] "LOCKE enters"             "LOCKE exits"             
## [143] "He stands."               "JORAH stands."           
## [145] "They embace."             "TYWIN sits."             
## [147] "RAST enters."             "Ghost approaches."       
## [149] "MORAG spits."             "DAVOS enters."           
## [151] "HIZDAHR kneels."          "Everyone sits."          
## [153] "ARYA kneels."             "BRONN sits."             
## [155] "DAARIO rises."            "DAENERYS considers."     
## [157] "OBERYN rises."            "CERSEI smiles."          
## [159] "ELLARIA scoffs."          "DORAN dies."             
## [161] "VARYS kneels."            "YOUNG BENJEN"            
## [163] "TOMMEN exits."            "VARYS sighs."            
## [165] "SANSA smiles."            "GILLY stands."           
## [167] "VARYS stands."            "TYRION nods."            
## [169] "BRIENNE approaches."      "TYRION stammers."        
## [171] "TYRION smiles."           "DAARIO chuckles."        
## [173] "THEON sobs."              "EURON drowns."           
## [175] "JORAH nods."              "They embrace."           
## [177] "HODOR chuckles."          "HODOR laughs."           
## [179] "DICKON laughs."           "ARYA smiles."            
## [181] "OLENNA stands."           "MARGAERY stands."        
## [183] "ROBETT laughs."           "EDMURE scoffs."          
## [185] "EDMURE chuckles."         "EDMURE laughs."          
## [187] "SANDOR laughs."           " projectiles"            
## [189] "RAMSAY chuckles."         "TORMUND laughs."         
## [191] "DAENERYS smiles."         "LORAS kneels."           
## [193] "LORAS whinces."           "QYBURN stands."          
## [195] "JAIME nods."              "DAARIO stands."          
## [197] "DAENERYS stands."         "DAENERYS nods."          
## [199] "SANSA exits."             "LYANNA MORMONT"          
## [201] " stands."                 "ROBETT stands."          
## [203] "LYANNA stands."           "TORMUND sits."           
## [205] "YOHN stands."             "YOHN sits."              
## [207] "JON sighs."               "JON chuckles."           
## [209] "SANSA sighs."             "ARYA drinks."            
## [211] "THOROS smiles."           "BERIC chuckles."         
## [213] "VARYS smiles."            "EURON laughs."           
## [215] "ELLARIA whimpers."        "EURON exits."            
## [217] "Tyrion exits."            "CERSEI pauses."          
## [219] "SANSA contemplates."      "THEON approaches"        
## [221] "SANSA rises."             "Varys nods."             
## [223] "Baelish smiles."          "Cersei laughs."          
## [225] "More laughs."             "Stannis leaves."         
## [227] "Grey Scale"               "Jorah nods."             
## [229] "Bronn advances."          "Theon enters."           
## [231] "Ramsay smiles."           "Sansa nods."             
## [233] "Daario laughs."           "Tormund nods."           
## [235] "Everyone drinks. "        "Tyene laughs."           
## [237] "They hug."                "Crowd jeers."            
## [239] "Sam smiles."              "Trant whimpers."         
## [241] "No response."

As you can see, one of those words is “you”. Thus we will proceed to delete this row.

df <- df[-565,]
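
A hedged aside (not part of the original code): dropping the row by its content rather than by its position is more robust if the scraping order ever changes. A hypothetical alternative would be:

df <- df[df$df1 != " you", ]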

Finally, before deleting, we will rearrange the descriptions in descending order by their number of words. The reason is simple: if two descriptions have a similar structure, such as “ stands.” and “QYBURN stands.”, and you first substitute the one with fewer words (just because you have not sorted the list in descending order), then we would not remove everything we want, as “QYBURN” would not have been removed.

df <- df%>%
  arrange(desc(words))

So, we are now ready to remove the descriptions. To do so, I will use the mgsub function from the qdap package. Unlike gsub, mgsub allows you to pass a vector of strings to substitute, which is very helpful.

textos$Texto <- mgsub(df$df1," ",textos$Texto)
rm(df)
3.2.1.3 Deleting descriptions: bold

Finally, we have to delete the descriptions in bold. Once again, if we analyze the web page we find that there are two types of labels, «b» and «h3». These labels are used to refer to several things, such as the title of the episode, the location of the scene and location changes, among others.

However, a closer look at the scripts shows that the transcribers have not been consistent with the usage of these labels: in some seasons they have used «b» for the person that is speaking. Thus, we will have to be cautious before deleting anything.

 

3.2.1.4 Getting the «b»  descriptions

This time, we will get all the info in a for loop, in a way very similar to the previous ones.

x = NA
y = NA
urls <- capitulos$Episodio
for(i in 1:length(urls)){
  x = read_html(urls[i]) %>%
    html_nodes(".lyrics p b") %>%
    html_text()
  x = as.data.frame(x, stringsAsFactors = FALSE)
  if(i == 1){
    y = x
  } else {
    y = rbind(y,x)
  }
}
##  [1] "EPISODE 1 - WINTER IS COMING" "EPISODE 2 - THE KINGSROAD"   
##  [3] "Jorah Mormont:"               "Daenerys Targaryen:"         
##  [5] "Jorah Mormont:"               "Doreah:"                     
##  [7] "Irri:"                        "Jorah Mormont:"              
##  [9] "Viserys Targaryen:"           "Jorah Mormont:"              
## [11] ""                             "Viserys Targaryen:"          
## [13] "Jorah Mormont:"               "Viserys Targaryen:"          
## [15] "Joffrey Baratheon:"           "Tyrion Lannister:"           
## [17] "Joffrey Baratheon:"           "Tyrion Lannister:"           
## [19] "Joffrey Baratheon:"           "Tyrion Lannister:"           
## [21] "Joffrey Baratheon:"           "Tyrion Lannister:"           
## [23] "Joffrey Baratheon:"           "Tyrion Lannister:"           
## [25] ""                             "Sandor Clegane:"             
## [27] "Tyrion Lannister:"            "Tyrion Lannister:"           
## [29] "Jaime Lannister:"             "Tyrion Lannister:"
##  [1] "INT. DRAGONSTONE - CHAMBER OF THE PAINTED TABLE" 
##  [2] "EPISODE 7 - THE DRAGON AND THE WOLF"             
##  [3] "EXT. Outside King's Landing"                     
##  [4] "CUT TO: Deck of Ironborn Ship."                  
##  [5] "INT. Below deck of the Ironborn Ship"            
##  [6] "INT. King's Landing"                             
##  [7] "EXT. Outside King's Landing"                     
##  [8] "CUT TO: Further back on the road."               
##  [9] "CUT TO: Outside the Dragon PIt"                  
## [10] "CUT TO: The Dragon Pit"                          
## [11] "CUT TO: King's Landing portico"                  
## [12] "CUT TO: Council Chamber."                        
## [13] "CUT TO: Side Area of the Dragon Pit"             
## [14] "CUT TO: The North"                               
## [15] "CUT TO: Winterfell interior"                     
## [16] "CUT TO: Dragonstone Map Room"                    
## [17] "CUT TO: Dragonstone Throne Room"                 
## [18] "CUT TO: The shores of King's Landing"            
## [19] "CUT TO: King's Landing Exterior"                 
## [20] "CUT TO: The Great Hall of Winterfell"            
## [21] "CUT TO: An Interior Courtyard in King's Landing."
## [22] "CUT TO: Outside King's Landing."                 
## [23] "CUT TO: Montage of King's Landing at night."     
## [24] "CUT TO: Winterfell Interior"                     
## [25] "CUT TO: Flashback"                               
## [26] "CUT TO: Boat interior"                           
## [27] "CUT TO: Flashback"                               
## [28] "CUT TO: Boat Interior"                           
## [29] "CUT TO: Winterfell Exterior"                     
## [30] "CUT TO: THE GODSWOOD OF WINTERFELL."             
## [31] "CUT TO: The Wall."

Just by analyzing the first and last 30 descriptions we realize three things: * The first season seems to use «b» to refer to the people speaking. * The last season does seem to use «b» for descriptions. * In all seasons the episode name comes in «b».

Thus, we will have to do the following: 1. Filter to keep only the «b» entries used for descriptions.

As all the episodes within a season usually follow the same pattern, it is easier to inspect the dataframe and visually check which rows to pick. It could be done with regular expressions, but I believe the “traditional” way is much faster.

cambios <- y[c(3161:3695,4004:4853),]
head(cambios)
## [1] "EPISODE 1 - THE NORTH REMEMBERS"   
## [2] "EXT. KING’S LANDING"               
## [3] "INT. SMALL COUNCIL MEETING CHAMBER"
## [4] "CUT TO: WINTERFELL"                
## [5] "EXT. OUTSIDE WINTERFELL"           
## [6] "INT. INSIDE WINTERFELL’S KEEP"

2. Filter to get the episode names. In this case, as there are several episode names and they all follow the same pattern “EPISODE - NAME”, I will use regular expressions to get them.

episodios <- y %>% filter(grepl("EPISODE.*",x))
episodios <- as.vector(episodios$x)
head(episodios)
## [1] "EPISODE 1 - WINTER IS COMING"                    
## [2] "EPISODE 2 - THE KINGSROAD"                       
## [3] "EPISODE 3 - LORD SNOW"                           
## [4] "EPISODE 4 - CRIPPLES, BASTARDS AND BROKEN THINGS"
## [5] "EPISODE 5 - THE WOLF AND THE LION"               
## [6] "EPISODE 6 - A GOLDEN CROWN"
3.2.1.5 Getting the «h3» descriptions

To get the «h3» descriptions we will repeat the previous loop, just changing b to h3.

for(i in 1:length(urls)){
  x = read_html(urls[i]) %>%
    html_nodes(".lyrics h3") %>%
    html_text()
  x = as.data.frame(x, stringsAsFactors = FALSE)
  if(i == 1){
    y = x
  } else {
    y = rbind(y,x)
  }
}

titulos <- as.vector(y$x)
head(titulos)
## [1] "EXT. BRAAVOS - SEA "       "TITLE SEQUENCE"           
## [3] "EXT. CASTLE BLACK - NIGHT" "EXT. WINTERFELL - DAY"    
## [5] "EXT. NORTHERN WILDERNESS"  "EXT. KING’S LANDING - DAY"

Now that we have all the descriptions, we will gather them in the same variable. Besides, we will count the number of words in each description. As before, we will use this to get rid of one-word descriptions and to sort them by number of words.

eliminar <- as.vector(matrix(c(titulos,episodios, cambios), byrow=TRUE)) 
eliminar <- as.data.frame(eliminar, stringsAsFactors = FALSE)
eliminar$longitud <- sapply(strsplit(eliminar$eliminar, " "), length)
eliminar %>%
  filter(longitud == 1)

  eliminar longitud
1 CREDITS  1
2 CREDITS  1
3 CREDITS  1
4 CREDITS  1
5 CREDITS  1


The one-word descriptions seem to be OK, so we will just sort the descriptions by number of words and finally (yes, we are done with descriptions ;)) we will remove them.

eliminar <- eliminar %>%
  arrange(desc(longitud))

textos$Texto <- mgsub(pattern = eliminar$eliminar, replacement = " ",
                                         text.var = textos$Texto)

rm(eliminar)

4 From text to sentences

Now that we have cleaned all the text, we reach the crux of the matter: breaking the texts into sentences and identifying the person who said each one.

As you might have noticed, the main problem here is the lack of homogeneity. To achieve a 100% perfect database, it would be necessary to check each and every script individually. Any help in that regard is very welcome.

Even though there will be room for further improvement, we will do a good job of keeping most of the sentences.

First, we must be clear about what we have to do and how we want to do it.

Regarding the “what”, we must do two things: * Split the texts into sentences. * Obtain the name of the person who said each sentence.

4.1 Splitting the phrases into sentences

The key is to separate the text where a person starts talking. The structure when a person talks is the following: > Name + (Surname) + (Surname)( ): text.

So, I have worked out the following regex:

(?=[A-z]*\s*[A-z]*\:)\:

This regex picks every “:” that has a “person talking” structure ahead of it. This will get all the conversations in the text. However, it has a little problem: the name of the person who speaks ends up one row above what he/she has said. We will fix this issue later on.

textos_por_persona <- textos %>%
  mutate(frases = strsplit(Texto, "(?=[A-z]*\\s*[A-z]*\\:)\\:", perl = TRUE)) %>%
  unnest(frases)
textos_por_persona$Texto <- NULL
textos_por_persona$frases[1:10]
##  [1] "WAYMAR ROYCE"                                                                                                                                  
##  [2] " What d’you expect? They’re savages. One lot steals a goat from another lot and before you know it, they’re ripping each other to pieces. WILL"
##  [3] " I’ve never seen wildlings do a thing like this. I’ve never seen a thing like this, not ever in my life. WAYMAR ROYCE"                         
##  [4] " How close did you get? WILL"                                                                                                                  
##  [5] " Close as any man would. GARED"                                                                                                                
##  [6] " We should head back to the wall.ROYCE"                                                                                                        
##  [7] " Do the dead frighten you? GARED"                                                                                                              
##  [8] " Our orders were to track the wildlings. We tracked them. They won’t trouble us no more. ROYCE"                                                
##  [9] " You don’t think he’ll ask us how they died? Get back on your horse. WILL"                                                                     
## [10] " Whatever did it to them could do it to us. They even killed the children. ROYCE"

As you can see, we now have the sentence with the name of the person who will speak next at the end. We will have to split this into two columns. As the name is at the end of the phrase, it will be easy to get to it using regular expressions.

However, before doing so, we will apply a simple transformation. There are many extras that are differentiated with a #, like Soldier #1, Soldier #2, and so on. Besides, we will also delete the ’, as it is not detected by the regex.

Yes, you’re right, we could include those characters in the regex, but I find it much easier to remove them. In the end, the result does not change.

textos_por_persona$frases <- gsub("'","",textos_por_persona$frases, fixed = TRUE)
textos_por_persona$frases <- gsub("#","",textos_por_persona$frases, fixed = TRUE)

Now we will split the names off into a separate column. To do so, I have used the following regex: > [[:punct:]]\s*[A-z]*\s*[A-z]*\s*[A-z]*\s*[A-z0-9]*$

It simply finds the last punctuation mark that has a name ahead of it.

textos_por_persona <- textos_por_persona %>%
  mutate(nombres = str_extract(frases, "[[:punct:]]\\s*[A-z]*\\s*[A-z]*\\s*[A-z]*\\s*[A-z0-9]*$")) %>%
  unnest(nombres)

4.2 Obtain the name of the person who said the sentence

Now that we have split the sentences, we have to clean the names of the people who said them. This is crucial if we want to make a good analysis. To do so, we will follow this structure:

  • Clean the sentences.
  • Find names that have not been split correctly.
  • Join the person with the phrase.
  • Homogenize the names.
  • Correct possible mistakes in character names.

4.2.1 Clean the sentences

As you can see, we have split the sentences, but the name still appears in the sentence. So, we will run a for loop that gsubs the name out of the sentence.

for(i in 1:length(textos_por_persona$frases)){
  if(!is.na(textos_por_persona$nombres[i])){
    textos_por_persona$frases[i] <- gsub(pattern = textos_por_persona$nombres[i],
                                         replacement = " ",
                                         x = textos_por_persona$frases[i])
  }
}
rm(i)
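
One caveat worth noting (my observation, not something raised in the original): the extracted names start with a punctuation mark, and gsub() interprets characters such as "." or "?" in the pattern as regex metacharacters. Passing fixed = TRUE treats the name as a literal string, which may be safer:

# Hypothetical variant of the substitution above, matching the name literally
gsub("? WILL", " ", "How close did you get? WILL", fixed = TRUE)
# gives "How close did you get "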

Great. One thing done. Let’s continue.

4.2.2 Find names that have not been split correctly.

colSums(is.na(textos_por_persona))
## Temporada  Episodio    frases   nombres 
##         0         0         0       449

As we can see, there are 449 names that have not been correctly split. Considering the number of phrases we have, that is not too bad, but it can be better. So, what happened?

textos_por_persona$frases[is.na(textos_por_persona$nombres)][1:30]
##  [1] "WAYMAR ROYCE"                                                                                                                                                                                                                
##  [2] " As your brother, I feel it’s my duty to warn you"                                                                                                                                                                           
##  [3] " How many times have I told you"                                                                                                                                                                                             
##  [4] " I want you to promise me"                                                                                                                                                                                                   
##  [5] "Jorah Mormont"                                                                                                                                                                                                               
##  [6] " The Dothraki have two things in abundance"                                                                                                                                                                                  
##  [7] " First lesson"                                                                                                                                                                                                               
##  [8] " Grenns father left him too... Outside a farmhouse when he was three. Pyp was caught stealing a wheel of cheese. His little sister hadnt eaten in three days. He was given a choice"                                         
##  [9] " Ill go to war with him if I have to. They can write a ballad about us "                                                                                                                                                     
## [10] " "                                                                                                                                                                                                                           
## [11] " If you rule a city and you see the horde approaching, you have two choices "                                                                                                                                                
## [12] " This summer has lasted nine. But reports from the Citadel tell us the days grow shorter. The Starks are always right eventually "                                                                                           
## [13] "Old Nan"                                                                                                                                                                                                                     
## [14] " No, the last one died many years before I was born. Ill tell you what I have seen "                                                                                                                                         
## [15] "Nothing of import, my Lord. There was one phrase he kept repeating"                                                                                                                                                          
## [16] " The truth now"                                                                                                                                                                                                              
## [17] "Knight of House Frey"                                                                                                                                                                                                        
## [18] " Wheres AryaSansa Stark"                                                                                                                                                                                                     
## [19] " The dagger foundTyrion Lannister"                                                                                                                                                                                           
## [20] ""                                                                                                                                                                                                                            
## [21] "Eddard Stark"                                                                                                                                                                                                                
## [22] " Jorah Mormont"                                                                                                                                                                                                              
## [23] " Viserys Targaryen"                                                                                                                                                                                                          
## [24] "Jaime Lannister"                                                                                                                                                                                                             
## [25] " Daenerys Stormborn"                                                                                                                                                                                                         
## [26] " Daenerys Stormborn"                                                                                                                                                                                                         
## [27] "Entering Kings Landing Syrio Forel"                                                                                                                                                                                          
## [28] "Varys"                                                                                                                                                                                                                       
## [29] " The gods were cruel when they saw fit to test my vows. They waited till l was old. What could l do when the ravens brought the news from the South"                                                                         
## [30] " l was born lucky. Tribesmen of the Vale, gather round! Stone Crows! Black Ears! Burned Men! Moon Brothers! And Painted Dogs! Your dominion over the Vale begins now! Onward, to claim what is yours! Tribesmen of the Vale "

Well, we see that there are three different cases: * The “:” was not used to separate a person from what he/she said, but rather to enumerate (“First lesson”). * The person says nothing. In this case, the sentence itself should be the name. * There was no punctuation mark at the end of the sentence. In these cases the last few words are the name of the person speaking.

Besides, we find two strange cases where there is no space between the phrase itself and who said it. We will change these two cases manually:

textos_por_persona$frases[textos_por_persona$frases == " The dagger foundTyrion Lannister"] <- " The dagger found Tyrion Lannister"
textos_por_persona$frases[textos_por_persona$frases == " Wheres AryaSansa Stark"] <- " Wheres Arya Sansa Stark"

When the person says nothing

As we have done many times before, we will count the number of words in each sentence. Then we will filter the result to see sentences with 3 words or fewer. This will help us decide whether to treat 3-word sentences as names or not.

textos_por_persona$palabras = sapply(strsplit(textos_por_persona$frases," "), length)
textos_por_persona%>%
  filter(is.na(nombres) == TRUE, palabras==3) %>%
  select(frases) %>% t(.) %>% as.vector(.)
##  [1] " First lesson"       " Jorah Mormont"      " Viserys Targaryen" 
##  [4] " Daenerys Stormborn" " Daenerys Stormborn" " Kevan Lannister"   
##  [7] " Hodor LUWIN"        " Shhh TALISA"        " She ROBB"          
## [10] " Robb ROBB"          " MALE SINGER"        " Guard 2"           
## [13] " 1,000 ORELL"        " 19 EDMURE"          " Oh TYWIN"          
## [16] " Roose BOLTON"       " CUT TO"             " 14 TYRION"         
## [19] " 19 MISSANDEI"       " Night DAVOS"        " BARRISTAN SELMY"   
## [22] " MEREEN MISSANDEI"   "A younger MELARA"    " Theon THEON"       
## [25] "THR YOUNG RODRIK"    " GREY WORM"          " JON SNOW"          
## [28] " Lady JON"           " Father ARYA"

As we can see here, except for the first case, the sentences with three words are actually names. Thus, we will turn sentences with three words or fewer into names.

We will get the row number of the first case, just to correct it later.

textos_por_persona[textos_por_persona$frases == " First lesson", ] 

    Temporada Episodio      frases       nombres palabras
404 Season 1  the kingsroad First lesson NA      3


Now we will make every phrase with three words or fewer become the name for that row.

for(i in 1:length(textos_por_persona$frases)){
  if(is.na(textos_por_persona$nombres[i]) && textos_por_persona$palabras[i] <= 3){
    textos_por_persona$nombres[i] <- textos_por_persona$frases[i]
    textos_por_persona$frases[i] <- ""
  }
}
rm(i)

Just to make sure everything is OK, we rerun the filter used beforehand.

textos_por_persona%>%
  filter(is.na(nombres) == TRUE, palabras==3) %>%
  select(frases) %>% t(.) %>% as.vector(.)
## logical(0)

Now we fix the row for the “First lesson” sentence.

textos_por_persona$nombres[404] <- NA
textos_por_persona$frases[404] <- " First lesson"

With all these changes, every case where only the name appears should have been corrected. We will check whether that is true.

textos_por_persona%>%
  filter(is.na(nombres) == TRUE) %>%
  select(frases) %>% t(.) %>% as.vector(.) %>% 
  head(.,n=15)
##  [1] " As your brother, I feel it’s my duty to warn you"                                                                                                                                  
##  [2] " How many times have I told you"                                                                                                                                                    
##  [3] " I want you to promise me"                                                                                                                                                          
##  [4] " The Dothraki have two things in abundance"                                                                                                                                         
##  [5] " First lesson"                                                                                                                                                                      
##  [6] " Grenns father left him too... Outside a farmhouse when he was three. Pyp was caught stealing a wheel of cheese. His little sister hadnt eaten in three days. He was given a choice"
##  [7] " Ill go to war with him if I have to. They can write a ballad about us "                                                                                                            
##  [8] " If you rule a city and you see the horde approaching, you have two choices "                                                                                                       
##  [9] " This summer has lasted nine. But reports from the Citadel tell us the days grow shorter. The Starks are always right eventually "                                                  
## [10] " No, the last one died many years before I was born. Ill tell you what I have seen "                                                                                                
## [11] "Nothing of import, my Lord. There was one phrase he kept repeating"                                                                                                                 
## [12] " The truth now"                                                                                                                                                                     
## [13] "Knight of House Frey"                                                                                                                                                               
## [14] " Wheres Arya Sansa Stark"                                                                                                                                                           
## [15] " The dagger found Tyrion Lannister"

As you can see, there are phrases with more than 3 words that are actually a name, such as “Knight of House Frey”. We will filter to see whether this happens more frequently and change those cases later on.

textos_por_persona%>%
  filter(is.na(nombres) == TRUE, palabras==4) %>%
  select(frases) %>% t(.) %>% as.vector(.)
##  [1] " The truth now"             "Knight of House Frey"      
##  [3] " Lord Commander JEOR"       " Lord Commander JEOR"      
##  [5] " Lady Zuriff SHAE"          " You cant TYRION"          
##  [7] " Lady Talisa TALISA"        " TYRIONS CHAMBERs VARYS"   
##  [9] " Your Grace CERSEI"         " What the DOREAH"          
## [11] " KRAZNYS MO NAKLOZ"         " 1000 men TORMUND"         
## [13] " RADZAL MO ERAZ"            " Daario Naharis MERO"      
## [15] " Here comes TYRION"         " THE VALE SANSA"           
## [17] " THE VALE ROBIN"            " Maester Faull AEMON"      
## [19] " Of course DAARIO"          " We ask again"             
## [21] " You know MARGAERY"         " ARCHMAESTERS STUDY MARWYN"
## [23] " OLENNAS QUARTERS OLENNA"
textos_por_persona$palabras = NULL
textos_por_persona[textos_por_persona$frases == "Knight of House Frey" | textos_por_persona$frases == " RADZAL MO ERAZ" | textos_por_persona$frases == " KRAZNYS MO NAKLOZ" , ]

     Temporada Episodio                            frases               nombres
1390 Season 1  cripples bastards and broken things Knight of House Frey NA
7585 Season 3  valar dohaeris                      KRAZNYS MO NAKLOZ    NA
9837 Season 3  the bear and the maiden fair        RADZAL MO ERAZ       NA

textos_por_persona$frases[1390] <- ""
textos_por_persona$nombres[1390] <- "Knight of House Frey"

textos_por_persona$frases[7584] <- ""
textos_por_persona$nombres[7584] <- " RADZAL MO ERAZ"

textos_por_persona$frases[9836] <- ""
textos_por_persona$nombres[9836] <- " KRAZNYS MO NAKLOZ"

There is no punctuation mark

In the cases where there is no punctuation mark at the end of the sentence, we have some pros and cons. On the one hand, names come either fully in uppercase or with the first letter in uppercase, which allows us to use regular expressions. On the other hand, as there is no punctuation mark, we have to tell the regex how many words to take. This will undoubtedly lead to some errors unless we work it out case by case.

The idea is to make as few errors as possible. Thus, in order to decide how many words the regex should pick up as names, we will check the phrases.

textos_por_persona%>%
  filter(is.na(nombres) == TRUE) %>%
  select(frases) %>% t(.) %>% as.vector(.) %>% 
  head(.,n=15)
##  [1] " As your brother, I feel it’s my duty to warn you"                                                                                                                                  
##  [2] " How many times have I told you"                                                                                                                                                    
##  [3] " I want you to promise me"                                                                                                                                                          
##  [4] " The Dothraki have two things in abundance"                                                                                                                                         
##  [5] " First lesson"                                                                                                                                                                      
##  [6] " Grenns father left him too... Outside a farmhouse when he was three. Pyp was caught stealing a wheel of cheese. His little sister hadnt eaten in three days. He was given a choice"
##  [7] " Ill go to war with him if I have to. They can write a ballad about us "                                                                                                            
##  [8] " If you rule a city and you see the horde approaching, you have two choices "                                                                                                       
##  [9] " This summer has lasted nine. But reports from the Citadel tell us the days grow shorter. The Starks are always right eventually "                                                  
## [10] " No, the last one died many years before I was born. Ill tell you what I have seen "                                                                                                
## [11] "Nothing of import, my Lord. There was one phrase he kept repeating"                                                                                                                 
## [12] " The truth now"                                                                                                                                                                     
## [13] " Wheres Arya Sansa Stark"                                                                                                                                                           
## [14] " The dagger found Tyrion Lannister"                                                                                                                                                 
## [15] "Entering Kings Landing Syrio Forel"
textos_por_persona%>%
  filter(is.na(nombres) == TRUE) %>%
  select(frases) %>% t(.) %>% as.vector(.) %>% 
  tail(.,n=15)
##  [1] " Men shit themselves when they die. Didnt they teach you that at fancy lad school? JAIME looks over at BRONN BRONN"
##  [2] " We are all facing a unique EURON"                                                                                 
##  [3] " Theon! I have your sister TYRION"                                                                                 
##  [4] " Just the King CERSEI"                                                                                             
##  [5] " Im pleased you bent the knee to our queen JON"                                                                    
##  [6] " I didnt come all this way to have my Hand TYRION"                                                                 
##  [7] " You killed our father TYRION"                                                                                     
##  [8] " I will not TYRION"                                                                                                
##  [9] " Shes your sister SANSA"                                                                                           
## [10] " Shes our queen HARRAG"                                                                                            
## [11] " Shes your sister THEON"                                                                                           
## [12] " Have my sister ARYA"                                                                                              
## [13] " Whatever your aunt SANSA"                                                                                         
## [14] " You told our mother LITTLEFINGER"                                                                                 
## [15] " Disobeying your queen JAIME"

As we can see, the pattern changes depending on the season. Since it seems harder to make mistakes by selecting the words that are fully uppercase, we will fix those first.

for(i in 1:length(textos_por_persona$frases)){
  if(is.na(textos_por_persona$nombres[i]) == TRUE){
    if(grepl("[A-Z]+\\s*[0-9]*\\s*$", textos_por_persona$frases[i], perl = TRUE) == TRUE){
    textos_por_persona$nombres[i] <- str_extract(textos_por_persona$frases[i], "[A-Z]+\\s*[0-9]*\\s*$")
    textos_por_persona$frases[i] <- gsub(pattern = textos_por_persona$nombres[i], replacement = "", x = textos_por_persona$frases[i])
    i = i + 1
    } else{
      i = i + 1
    }
  } else{
    i = i + 1
  } 
}  
rm(i)

We look again at the phrases that still have a missing name to see what structure they have.

textos_por_persona$frases[is.na(textos_por_persona$nombres)]
##  [1] " As your brother, I feel it’s my duty to warn you"                                                                                                                                                                                                                                                                                                                                                                                                        
##  [2] " How many times have I told you"                                                                                                                                                                                                                                                                                                                                                                                                                          
##  [3] " I want you to promise me"                                                                                                                                                                                                                                                                                                                                                                                                                                
##  [4] " The Dothraki have two things in abundance"                                                                                                                                                                                                                                                                                                                                                                                                               
##  [5] " First lesson"                                                                                                                                                                                                                                                                                                                                                                                                                                            
##  [6] " Grenns father left him too... Outside a farmhouse when he was three. Pyp was caught stealing a wheel of cheese. His little sister hadnt eaten in three days. He was given a choice"                                                                                                                                                                                                                                                                      
##  [7] " Ill go to war with him if I have to. They can write a ballad about us "                                                                                                                                                                                                                                                                                                                                                                                  
##  [8] " If you rule a city and you see the horde approaching, you have two choices "                                                                                                                                                                                                                                                                                                                                                                             
##  [9] " This summer has lasted nine. But reports from the Citadel tell us the days grow shorter. The Starks are always right eventually "                                                                                                                                                                                                                                                                                                                        
## [10] " No, the last one died many years before I was born. Ill tell you what I have seen "                                                                                                                                                                                                                                                                                                                                                                      
## [11] "Nothing of import, my Lord. There was one phrase he kept repeating"                                                                                                                                                                                                                                                                                                                                                                                       
## [12] " The truth now"                                                                                                                                                                                                                                                                                                                                                                                                                                           
## [13] " Wheres Arya Sansa Stark"                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [14] " The dagger found Tyrion Lannister"                                                                                                                                                                                                                                                                                                                                                                                                                       
## [15] "Entering Kings Landing Syrio Forel"                                                                                                                                                                                                                                                                                                                                                                                                                       
## [16] " The gods were cruel when they saw fit to test my vows. They waited till l was old. What could l do when the ravens brought the news from the South"                                                                                                                                                                                                                                                                                                      
## [17] " l was born lucky. Tribesmen of the Vale, gather round! Stone Crows! Black Ears! Burned Men! Moon Brothers! And Painted Dogs! Your dominion over the Vale begins now! Onward, to claim what is yours! Tribesmen of the Vale "                                                                                                                                                                                                                             
## [18] " I am Eddard Stark, Lord of Winterfell and Hand of the King. I come before you to confess my treason in the sight of Gods and men. I betrayed the faith of my King and the trust of my friend Robert. I swore to protect and defend his children, but before his blood was cold I plotted to murder his son and seize the Throne for myself. Let the High Septon and Baelor the Blessed bear witness to what I say"                                       
## [19] " As is mine. Youre not the young man, Salladhor. And correct me if Im wrong"                                                                                                                                                                                                                                                                                                                                                                              
## [20] " \"And who are you,\" the proud lord said \"That I must bow so low?\" \"Only a cat of a different coat \"Thats all the truth I know \"In a coat of gold or a coat of red \"A lion still has claws \"And mine are long and sharp, my lord \"As long and sharp as yours\" And so he spoke, and so he spoke That Lord of Castamere But now the rains weep oer his hall With no one there to hear Yes, now the rains weep oer his hall And not a soul to hear"
## [21] " Ive been through two slave revolts, boy. They always end the same way"                                                                                                                                                                                                                                                                                                                                                                                   
## [22] " He wants me to bend the knee. And he wants the free folk to fight for him. I’ll give him this much"                                                                                                                                                                                                                                                                                                                                                      
## [23] " The House of Black and White"                                                                                                                                                                                                                                                                                                                                                                                                                            
## [24] " There are only two like it in the world"                                                                                                                                                                                                                                                                                                                                                                                                                 
## [25] " I imagine this is strange for you. Everyone you meet has a hidden motive, and you pride yourself on slitting in and out. But I’m telling you a simple truth"                                                                                                                                                                                                                                                                                             
## [26] " None of you saw Mance die! I did. The Southern King who broke our army, Stannis, wanted to burn him alive to send him a message. Jon Snow defied that cunt’s orders. His arrow was mercy. What it did took courage, and that’s what we need today"                                                                                                                                                                                                       
## [27] " We ask again"                                                                                                                                                                                                                                                                                                                                                                                                                                            
## [28] " You men have a choice"                                                                                                                                                                                                                                                                                                                                                                                                                                   
## [29] " Ill tell you what doesnt scare me"

As we can see, only a few of these phrases actually contain names (just 3, I believe). We will fix them manually, as it will take less time than coming up with a new regular expression.

textos_por_persona[textos_por_persona$frases == " Wheres Arya Sansa Stark" | textos_por_persona$frases == " The dagger found Tyrion Lannister" | textos_por_persona$frases == "Entering Kings Landing Syrio Forel" , ]

#     Temporada  Episodio               frases                               nombres
1438  Season 1   the wolf and the lion  Wheres Arya Sansa Stark              NA
1471  Season 1   the wolf and the lion  The dagger found Tyrion Lannister    NA
2385  Season 1   the pointy end         Entering Kings Landing Syrio Forel   NA

textos_por_persona$frases[1438] <- "Wheres Arya"
textos_por_persona$nombres[1438] <- "Sansa Stark"

textos_por_persona$frases[1471] <- "The dagger found"
textos_por_persona$nombres[1471] <- "Tyrion Lannister"

textos_por_persona$frases[2385] <- "Entering Kings Landing "
textos_por_persona$nombres[2385] <- "Syrio Forel"

 The “:” was used to enumerate. 

Finally, we will fix the last problem. It arose while splitting phrases by “:”: we split some phrases we shouldn’t have, because in those cases the “:” was used to enumerate rather than to separate a name from what the person says. We will fix this by pasting the phrase and name of the following observation onto the observation with NA in the name.

for(i in 1:length(textos_por_persona$frases)){
  j = i + 1
  if(i == 1 ){
    i = i + 1
  } else if(is.na(textos_por_persona$nombres[i]) == TRUE){
    textos_por_persona$frases[i] = paste(textos_por_persona$frases[i], textos_por_persona$frases[j], sep = "")
    textos_por_persona$nombres[i] =  textos_por_persona$nombres[j]
    textos_por_persona$frases[j] <- ""
    textos_por_persona$nombres[j] <- ""
    i = i + 1
  } else{
    i = i + 1
  }
}
rm(i,j)

We check that no names are left as NA to make sure everything is now OK.

textos_por_persona$frases[is.na(textos_por_persona$nombres)]
## character(0)

4.2.3 Joining the person with the phrase

Our next step is to match each phrase with the person who said it. To do so, we first have to get rid of the rows that, as a result of the previous code, have “” in both the “nombres” and “frases” variables.

textos_por_persona <- textos_por_persona[!(textos_por_persona$nombres == "" & textos_por_persona$frases == ""),]

Now we just have to create a simple for loop to move each phrase a row up.

for(i in 1:length(textos_por_persona$frases)){
  j = i - 1
  if(i == 1){
    i = i + 1
  } else {
    textos_por_persona$frases[j] <- textos_por_persona$frases[i]
    i = i + 1
  }
}
rm(i,j)

Finally, we simply have to delete the last row.

textos_por_persona <- textos_por_persona[1:(length(textos_por_persona$Temporada)-1),]
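For reference, the same shift-and-trim could also be written without a loop; a minimal sketch with dplyr, equivalent to the loop above plus the row deletion (an alternative, not an extra step):

# Shift every phrase one row up and drop the last, now meaningless, row
textos_por_persona <- textos_por_persona %>%
  mutate(frases = lead(frases)) %>%
  head(-1)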

4.2.4 Homogenize the names.

We now have the person who says each phrase and the phrase itself. However, the names are a complete mess: we will have to clean and homogenize them. We will begin by simply removing the punctuation in the names.

textos_por_persona$nombres <- removePunctuation(textos_por_persona$nombres)

If we inspect the data frame we will notice that we now have many empty names. This probably happened because, when we removed the observations with no content in “nombres” and “frases”, there were some observations containing nothing but punctuation marks.

Besides, we can see that there are a few observations where we do have a phrase but not a name. We will check them out to decide what to do with them.

textos_por_persona[(textos_por_persona$nombres == "" & textos_por_persona$frases != ""),"frases"]
## [1] "King "                                                                                                                                                                                                                                                                                                                                                    
## [2] "He enters a tent where Ser Hugh is being tended to, he enters to speak with Barristan "                                                                                                                                                                                                                                                                   
## [3] "Entering Kings Landing "                                                                                                                                                                                                                                                                                                                                  
## [4] "tents "                                                                                                                                                                                                                                                                                                                                                   
## [5] "NIGHT "                                                                                                                                                                                                                                                                                                                                                   
## [6] "The Bloody Hand "                                                                                                                                                                                                                                                                                                                                         
## [7] "Smoke and fire are seen in the distance across the river. BRONN breaks the surface of the water and takes a breath. He pull  up an  gasps for breath. BRONN help  to the shore and drops him to the ground and collapses next to him  coughs up water and continues struggling on the ground to breathe. Both men roll onto their backs and gasp for air "
## [8] "EPISODE 6 - Script in progress The camera floats over the painted table showing detailed carvings of rivers, holdfasts, and cities. A fire burns in the hearth "

These phrases look like scene descriptions. Even though I’m not sure how they survived all the processing we have done, we will simply delete them. Thus, we will delete all observations that do not contain a name.

textos_por_persona <- textos_por_persona[textos_por_persona$nombres != "", ]

Back to homogenizing names, we will delete some punctuation marks that the removePunctuation function did not remove. Then we will convert the names to lowercase, so that casing is no longer a problem. Finally, we will remove the numbers.

borrar <- c("'S ","' ","'s",".",'" ',"-","- ","'t ","- ",".","-","-")
textos_por_persona$nombres <- mgsub(borrar,"",textos_por_persona$nombres)
textos_por_persona$nombres <- tolower(textos_por_persona$nombres)
textos_por_persona$nombres <- removeNumbers(textos_por_persona$nombres)
rm(borrar)

Now there are two main things we have to correct:

  • Different names for the same character. Example: Sandor Clegane & The Hound.
  • The usage of the surname (Daenerys & Daenerys Targaryen) and other titles and “fancy” stuff (Daenerys & Daenerys Stormborn).

As there are quite a few characters, we will only undertake this process for the main characters, as I believe they are the most interesting ones to analyze. Therefore, further improvement can be done in this regard.

Note: another point of improvement is the handful of sentences that still lack a name. As they are fewer than 10 and some seem to be descriptions, I have not worried about them, but there they are.

The usage of the surname & fancy names 

We will just create a vector with the names of the main families of the series and remove them using mgsub. For a more precise analysis, a web scraping of the family names could be undertaken (see the sketch after the code below).

In this case, I have also added a fancy name that is used quite often: “stormborn”. I know it is not a family, but the process to delete it is the exact same, so this way I write less code.

casas <- c("lannister","stark","baratheon","targaryen","mormont","tyrell","tarly","snow","greyjoy","stormborn")
textos_por_persona$nombres <- mgsub(casas,"",textos_por_persona$nombres)
rm(casas)
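As mentioned above, for a more precise analysis the list of house names could itself be scraped with rvest. A minimal sketch; the URL and the selector/pattern are assumptions (hypothetical) and would need to be checked against the real page:

# Hypothetical page and pattern: verify both before using this in earnest
url_casas <- "https://awoiaf.westeros.org/index.php/Great_Houses"
casas_scrapeadas <- read_html(url_casas) %>%
  html_nodes("a") %>%                # assumed: house names appear as links
  html_text() %>%
  tolower() %>%
  str_extract("^house [a-z]+$") %>%  # keep entries that look like "house x"
  na.omit() %>%
  gsub("house ", "", .) %>%
  unique()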

 Different names for the same character 

Once again, I have just come up with some main characters that can have different names. I’m quite a fan of the series, so I know some of them. However, this cleaning is not intended to be exhaustive, so once more, further improvement could be done.

The characters that I have found can be named differently are:

  • Sandor Clegane: Hound, The Hound.
  • Littlefinger: Petyr Baelish.
  • Sam: Samwell.
  • Ned: Eddard.
  • Varys: Lord Varys.
sandor <- c("sandor","the hound","hound")
textos_por_persona$nombres <- mgsub(sandor,"sandor",textos_por_persona$nombres)
sam <- c("samwell","sam")
textos_por_persona$nombres <- mgsub(sam,"sam",textos_por_persona$nombres)
littlefinger <- c("baelish","petyr","petyr baelish" ,"baelish","littlefinger")
textos_por_persona$nombres <- mgsub(littlefinger,"littlefinger",textos_por_persona$nombres)
ned <- c("eddard","ned")
textos_por_persona$nombres <- mgsub(ned,"ned",textos_por_persona$nombres)
varys <- c("varys","lord varys")
textos_por_persona$nombres <- mgsub(varys,"varys",textos_por_persona$nombres)
rm(sam, sandor, littlefinger, ned, varys)

We will now check which are the top 20 characters that speak the most.

textos_por_persona %>%
  group_by(nombres) %>%
  summarize(n=n()) %>%
  arrange(desc(n)) %>%
  ungroup()%>%
  top_n(20)
## Selecting by n
nombres        n
tyrion      1542
cersei       987
jon          966
daenerys     858
jaime        850
sansa        733
arya         697
sam          496

4.2.5 Correct possible mistakes in character names.

If we analyze the table above we will not find any obvious problems. However, considering the several mistakes we have found during the process (which is normal, as thousands of conversations have been transcribed), we will use some regular expressions to homogenize any spelling mistakes in the names.

This could also be done by calculating the Levenshtein distance, but as there are so many characters, that might lead to mistakes that would be tedious to identify. The regex approach is not perfect, but it is the safest.
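For reference, a minimal sketch of the Levenshtein alternative (not used here), based on base R’s adist(); the 2-edit threshold and the list of canonical names are assumptions:

canonicos   <- c("daenerys", "cersei", "tyrion", "tywin")
observados  <- unique(textos_por_persona$nombres)
distancias  <- adist(observados, canonicos)               # edit-distance matrix
mas_cercano <- canonicos[apply(distancias, 1, which.min)]
# Rename only when the closest canonical name is at most 2 edits away
corregidos  <- ifelse(apply(distancias, 1, min) <= 2, mas_cercano, observados)

The safer regex approach actually used is below.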

daenerys <- textos_por_persona %>%
  filter(grepl("d[A-z]*rys", nombres) == TRUE ) %>%
   distinct(nombres)
textos_por_persona$nombres <- mgsub(daenerys$nombres,"daenerys",textos_por_persona$nombres)

cersei <- textos_por_persona %>%
  filter(grepl("c[A-z]*ei", nombres) == TRUE ) %>%
  distinct(nombres)
textos_por_persona$nombres <- mgsub(cersei$nombres,"cersei",textos_por_persona$nombres)

tyrion <- textos_por_persona %>%
  filter(grepl("t[A-z]*ion", nombres) == TRUE ) %>%
  distinct(nombres)
textos_por_persona$nombres <- mgsub(tyrion$nombres,"tyrion",textos_por_persona$nombres)

tywin <- textos_por_persona %>%
  filter(grepl("t[A-z]*win", nombres) == TRUE ) %>%
  distinct(nombres)
textos_por_persona$nombres <- mgsub(tywin$nombres,"tywin",textos_por_persona$nombres)

rm(daenerys, tyrion, cersei, tywin)

We run the count once again to see whether the number of times each character has spoken has changed.

textos_por_persona %>%
  group_by(nombres) %>%
  summarize(n=n()) %>%
  arrange(desc(n)) %>%
  ungroup()%>%
  top_n(20)
## Selecting by n
nombres           n
tyrion         1550
cersei          994
jon             966
daenerys        868
jaime           850
sansa           733
arya            697
sam             496
littlefinger    480
davos           473

As we can see, something has indeed changed. For example, cersei has increased from 987 to 994, and daenerys from 858 to 868. These are only a few sentences, but these simple steps have still contributed to a better database.

4.3 Cleaning some sentences

We will undertake two main steps. First, we will delete some extra characters from the phrases. Then we will analyze the sentences with awkward content.

Deleting extra characters. We will follow the exact same process we used when eliminating characters from the names; I simply copy and paste the code.

borrar <- c('"',".",". ","--","- ","- ","...")
textos_por_persona$frases <- mgsub(borrar," ",textos_por_persona$frases)
rm(borrar)

Now, we will see that there are some phrases with information between asterisks. This is useless information, such as “in Valyrian”. It must be some kind of description, like the ones in brackets, that we missed earlier. Thus, we will delete it now.

textos_por_persona$frases = gsub("\\*[A-z]*\\s*[A-z]*\\s*\\**"," ",textos_por_persona$frases)

Improving some sentences. Finally, after all the substitutions we have made, we will check whether any phrase has been left in a bad state. The structure that “frightens” me is a phrase that ends up l i k e t h i s. Why? Because the replacement used in the gsub calls was a blank space.

As I haven’t come up with a regex that works properly, I have calculated both the number of characters and the number of words in each sentence and then the ratio between them. The cases mentioned before will have a ratio close to 50%, which will help us filter those phrases.

textos_por_persona$caracteres <- nchar(textos_por_persona$frases)
textos_por_persona$palabras <- sapply(strsplit(textos_por_persona$frases, " "), length)
textos_por_persona$ratio <- textos_por_persona$palabras/textos_por_persona$caracteres
boxplot(textos_por_persona$ratio, main = "Distribution of the %blanks per sentence")

As we can see there are many outliers, which will most likely be cases like the one mentioned. They appear above a ratio of 0.3-0.4, so we will filter and see what kind of cases there are.

textos_por_persona %>%
  select(frases, ratio) %>%
  arrange(desc(ratio)) %>%
  head(10)
   frases
1  ?
2  I
3  No
4  I , I
5  So am I
6  N o w a n d a l w a y s ?
7  S h a l l w e b e g i n ?
8  W h e r e a r e t h e y ? W h e r e a r e m y d r a g o n s ?
9  U n s u l l i e d ! Y o u h a v e b e e n s l a v e s a l l y o u r l i f e T o d a y y o u a r e f r e e A n y m a n w h o w i s h e s t o l e a v e m a y l e a v e , a n d n o o n e w i l l h a r m h i m I g i v e y o u m y w o r d W i l l y o u f i g h t f o r m e ? A s f r e e m e n ?
10 NO

As we can see, the suspicion was right: in a few cases we do have this problem. We will fix those phrases.

textos_por_persona$frases[textos_por_persona$frases=="N o w a n d a l w a y s ?" ] <- "Now and always?"
textos_por_persona$frases[textos_por_persona$frases=="S h a l l w e b e g i n ?" ] <- "Shall we begin?"
textos_por_persona$frases[textos_por_persona$frases=="  W h e r e a r e t h e y ? W h e r e a r e m y d r a g o n s ?" ] <- "Where are they ? Where are my dragons?"
textos_por_persona$frases[textos_por_persona$frases == "U n s u l l i e d ! Y o u h a v e b e e n s l a v e s a l l y o u r l i f e T o d a y y o u a r e f r e e A n y m a n w h o w i s h e s t o l e a v e m a y l e a v e , a n d n o o n e w i l l h a r m h i m I g i v e y o u m y w o r d W i l l y o u f i g h t f o r m e ? A s f r e e m e n ?"] <- "Unsullied! You have been slaves all your life. Today you are free. Any man who wishes to leave may leave, and no one will harm him. I give you my word. Will you fight for me? As free men?"

Finally, if you analyze the dataset you will find some observations that, despite having a name, do not have a sentence. As they are not useful, we will delete them.

## [1] "There are 281 cases with blanck sentences."
textos_por_persona<- textos_por_persona %>%
  filter(!frases == "")

5 Data Enrichment

To improve the analysis we will add some extra information. To do so, I have created a database in Excel with some interesting facts about the episodes (episode number in the series and in the season, release date, etc.). All the data is available on Wikipedia.

got <- read.csv("got.csv",sep=";", stringsAsFactors =  FALSE)

We check whether the episode names in both datasets match.

textos_por_persona%>%
  anti_join(got, by="Episodio") %>%
  select(Episodio) %>%
  distinct(.)
## Warning: Column `Episodio` joining factor and character vector, coercing
## into character vector
Episodio
cripples bastards and broken things
dark wings dark words
unbowed unbent unbroken
mothers mercy
the queens justice

We delete the punctuation symbols and join both tables. Then we check that everything is fine.

got$Episodio <- removePunctuation(got$Episodio) 
textos_por_persona <- left_join(textos_por_persona, got, by="Episodio")
## Warning: Column `Episodio` joining factor and character vector, coercing
## into character vector
textos_por_persona %>%
  filter(is.na(Emision) == TRUE) %>%
  summarize(n = n())
n
0
rm(got)

5.1 Sentence Cleaning

Before doing any analysis it is crucial to have the data cleaned. In that sense, I will:

  • Delete all punctuation marks, as they add no value.
  • Delete the possessives and contractions.
  • Make everything lowercase so that we avoid uppercase problems.
textos_por_persona$frases <- gsub('[[:punct:] ]+',' ',textos_por_persona$frases)
textos_por_persona$frases <- gsub("'s","",textos_por_persona$frases, fixed = TRUE)
textos_por_persona$frases <- tolower(textos_por_persona$frases)

5.2 Data Enrichment

We will now add some extra information that might be of interest for further analysis, such as the house of the main characters.

textos_por_persona$House = NA
stark <- c("ned","robb","jon","bran","sansa","arya")
lannister <- c("cersei","jaime","tyrion","tywin","tommen")
baratheon <- c("stannis","renly")
targaryen <- c("viserys","daenerys")

for(i in 1:length(textos_por_persona$nombres)){
  if (textos_por_persona$nombres[i] %in% stark){
    textos_por_persona$House[i] <- "stark"
    i = i + 1
  } else if (textos_por_persona$nombres[i] %in% lannister){
    textos_por_persona$House[i] <- "lannister"
    i = i + 1
  } else if (textos_por_persona$nombres[i] %in% baratheon){
    textos_por_persona$House[i] <- "baratheon"
    i = i + 1
  } else if (textos_por_persona$nombres[i] %in% targaryen){
    textos_por_persona$House[i] <- "targaryen"
    i = i + 1
  } else{
    i = i + 1
  }
}

rm(stark, targaryen, lannister, baratheon,i)

Besides, another thing that might be of interest is whether a character is a main character or not. This will enable us to run some analyses only for the important characters, as they are the ones that matter the most. We will take the 20 characters that speak the most and consider them main characters. We will exclude “man”, as it is obviously not a main character.

textos_por_persona %>%
  group_by(nombres) %>%
  filter(nombres != "man")%>%
  summarize(n=n()) %>%
  top_n(20) %>%
  arrange(desc(n)) 
## Selecting by n
nombres        n
tyrion      1537
cersei       984
jon          954
daenerys     861
jaime        842
sansa        719
arya         690
main_characters <- textos_por_persona %>%
  group_by(nombres) %>%
  filter(nombres != "man")%>%
  summarize(n=n()) %>%
  top_n(20)%>%
  select(nombres) %>%
  t(.) %>%
  as.vector(.)

Even though we could create a new column, we will keep the vector. Whenever we want to restrict an analysis to the main characters, we will just filter on this vector.
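For instance, restricting a query to the main characters would be a one-liner (a sketch):

# Keep only the rows spoken by one of the 20 main characters
textos_principales <- textos_por_persona %>%
  filter(nombres %in% main_characters)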

Finally we will create a new variable to store the season. Even though we already have the season in one variable (Temporada), it is of character type. We will create an integer variable, as it is more versatile.

textos_por_persona$Temporada_n <- as.integer(gsub("Season ","", textos_por_persona$Temporada))

6 Text Analysis

6.1 Approach Analysis

According to the book “Tidy Text Mining” there are two main ways to approach a text analysis: the tidy text format or a Document-Term Matrix (DTM).

In this case I have used the tidy text approach for a very simple reason: the DTM always crashed my session. Besides, I believe the tidy text format is much more flexible, at least for me.

So I will proceed to unnest the sentences, that is, to create one observation per token. I will do it in two ways: as unigrams (one word per token) and as bigrams (two words per token).

6.2 Unnesting

Before unnesting into single words, I will join together expressions that do have an impact on the series, such as “Iron Throne”, “Night King” or “White Walkers”, among others. As these changes would also affect the unnest into bigrams, we will duplicate the dataset.

textos_por_persona_2 <- textos_por_persona
textos_por_persona_2$frases <- gsub("iron throne","ironthrone",textos_por_persona_2$frases, fixed = TRUE)
textos_por_persona_2$frases <- gsub("seven kingdoms","sevenkingdoms",textos_por_persona_2$frases, fixed = TRUE)
textos_por_persona_2$frases <- gsub("seven kingdoms","sevenkingdoms",textos_por_persona_2$frases, fixed = TRUE)
textos_por_persona_2$frases <- gsub("night king","nightking",textos_por_persona_2$frases, fixed = TRUE)
textos_por_persona_2$frases <- gsub("white walkers","whitewalkers",textos_por_persona_2$frases, fixed = TRUE)

Now, we will do the unnest.

library(tidytext)

textos_por_persona_unnest <- unnest_tokens(textos_por_persona_2, Palabra, frases)
rm(textos_por_persona_2)

To make sure the unnest is OK, we will analyze the number of characters in each word. If we plot a boxplot, there shouldn’t be any extreme outliers.

textos_por_persona_unnest$length <- nchar(textos_por_persona_unnest$Palabra)
boxplot(textos_por_persona_unnest$length, main = "Distribution of characters in words")

We can now filter to see whether everything is OK or not.

textos_por_persona_unnest %>%
  filter(length>10) %>%
  arrange(desc(length)) %>%
  select(Palabra) %>%
  distinct(.) %>%
  head(10)
   Palabra
1  responsibilities
2  misunderstanding
3  congratulations
4  accomplishments
5  tralalalaleeday
6  straightforward
7  accomplishment

As we can see, the words with maximum length are correct. It seems that the unnest has been correctly done.

6.3 Text cleaning & Homogenizing

If we now look at the 10 most used words, we will see that none of them adds any value to the analysis. These words are known as stopwords: words used very often because they are essential for speaking.

textos_por_persona_unnest %>%
  group_by(Palabra) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  head(10)
Palabra      n
the      12169
you      10397
to        7889
i         7456
a         6116
and       5341
of        4514
your      3285
my        3163
it        2939

In order to undertake a meaningful text analysis, we will have to remove the stopwords. The tm package includes the function stopwords.

stop_words <- stopwords("en")
textos_por_persona_unnest <- textos_por_persona_unnest %>%
  filter(!Palabra %in% stop_words)

Now we rerun the count to see whether the top words have changed.

textos_por_persona_unnest %>%
  group_by(Palabra) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  head(10)
Palabra     n
will     1515
know     1212
lord     1105
one      1087
dont     1048
im        894
like      854
us        851

As you can see, even though we have removed stopwords, there are still some of them left, such as im (I’m). This is due to the preprocessing we have done. Besides, there are other words that use a different apostrophe character and so have not been removed. To fix this, we will substitute the apostrophe in the stop_words vector and rerun the code.

# Stopwords with the typographic apostrophe (’)
stop_words <- gsub("'","’", stop_words, fixed = TRUE)
textos_por_persona_unnest <- textos_por_persona_unnest %>%
  filter(!Palabra %in% stop_words)
# Stopwords without any apostrophe
stop_words <- gsub("’","", stop_words, fixed = TRUE)
textos_por_persona_unnest <- textos_por_persona_unnest %>%
  filter(!Palabra %in% stop_words)

Now that everything is fine, let’s see which are the most used words in the GoT series.

textos_por_persona_unnest %>%
  group_by(Palabra) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  head(10)
Palabra     n
will     1515
know     1212
lord     1105
one      1087
like      854
us        851

As you can see, there are still many words that add little value to the analysis, such as “us” or “one”. It seems that the tm stopwords function returns a rather short list of stopwords, so we will undertake a more aggressive stopword removal.

To do so I’ve found a website that seems to have a large list of stopwords. We will scrape the page and remove those stopwords.

url <- c("https://www.ranks.nl/stopwords")
stop_words <- read_html(url) %>%
  html_nodes(".panel-body table") %>%
  html_table(fill=TRUE)
stop_words <- stop_words[[2]]
stop_words <- unlist(stop_words, use.names=FALSE)

# We create some duplicates with the other kinds of apostrophes
stop_words2 <- gsub("'","",stop_words, fixed = TRUE)   # without any apostrophe
stop_words3 <- gsub("'","’",stop_words, fixed = TRUE)  # with the typographic apostrophe

#We subset them all
textos_por_persona_unnest <- textos_por_persona_unnest %>%
  filter(!Palabra %in% stop_words)
textos_por_persona_unnest <- textos_por_persona_unnest %>%
  filter(!Palabra %in% stop_words2)
textos_por_persona_unnest <- textos_por_persona_unnest %>%
  filter(!Palabra %in% stop_words3)

We will once again check which are the most used words in GoT series.

library(wordcloud)

textos_por_persona_unnest %>%
  group_by(Palabra) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  with(wordcloud(Palabra, n, min.freq=1, max.words = 100, random.order=FALSE, rot.per=0.35, 
                   colors=mycols10, fixed.asp = TRUE))

We will undertake a final process called word stemming. Basically it consists of cutting every word down to its root (stemming) and then reconstructing the words (completion). This, however, requires a completion dictionary, which I don’t have, and it would be very tedious to build one for all these words. Thus, I will do the stemming manually, with some regexes.
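For reference, this is roughly what the standard stemming/completion route could look like (a sketch assuming the SnowballC package for stemming and tm’s stemCompletion; not what is used here):

library(SnowballC)
raices <- wordStem(textos_por_persona_unnest$Palabra, language = "english")  # cut each word to its root
# Completion needs a dictionary and is slow on this many words, hence the manual approach below:
# completadas <- stemCompletion(unique(raices), dictionary = textos_por_persona_unnest$Palabra)

The manual regex version used in this project follows.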

textos_por_persona_unnest$Palabra <- gsub("love.*","love",textos_por_persona_unnest$Palabra)
textos_por_persona_unnest$Palabra <- gsub("[^g].*love","love",textos_por_persona_unnest$Palabra)
textos_por_persona_unnest$Palabra <- gsub("kill.*","kill",textos_por_persona_unnest$Palabra)
textos_por_persona_unnest$Palabra <- gsub("war[^nmd].*","war",textos_por_persona_unnest$Palabra)
textos_por_persona_unnest$Palabra <- gsub("destr.*","destruction",textos_por_persona_unnest$Palabra)
textos_por_persona_unnest$Palabra <- gsub("north.*","north",textos_por_persona_unnest$Palabra)
textos_por_persona_unnest$Palabra <- gsub("enem.*","enemy",textos_por_persona_unnest$Palabra)
textos_por_persona_unnest %>%
  group_by(Palabra) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  with(wordcloud(Palabra, n, min.freq=1, max.words = 100, random.order=FALSE, rot.per=0.35, 
                   colors=mycols10, fixed.asp = TRUE))

6.4 Wordclouds

6.4.1 Wordcloud of Main Characters

Now we will get the most used words of each main character in GoT. To do so, I have taken the code I would use for a single character and turned it into a function. This way I can call sapply and produce all the wordclouds at once.

character_worcloud <- function(x){
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, paste("Wordcloud of ", x))

textos_por_persona_unnest %>%
  filter(nombres == x)%>%
  group_by(Palabra) %>%
  summarize(n=n())%>%
  with(wordcloud(Palabra, n, min.freq=1, max.words = 100, random.order=FALSE, rot.per=0.35, 
                   colors=mycols10, fixed.asp = TRUE))
  }
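The function can then be applied to every main character at once, as mentioned above (a sketch):

# One wordcloud per main character; invisible() just hides sapply's return value
invisible(sapply(main_characters, character_worcloud))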

6.4.2 Wordcloud of Main Houses

Now we will undertake the same exact process to show the most used words of the main houses.

house_worcloud <- function(x){
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, paste("Wordcloud of ", x))

textos_por_persona_unnest %>%
  filter(House == x)%>%
  group_by(Palabra) %>%
  summarize(n=n())%>%
  with(wordcloud(Palabra, n, min.freq=1, max.words = 100, random.order=FALSE, rot.per=0.35, 
                   colors=mycols10, fixed.asp = TRUE))
  }
houses <- textos_por_persona_unnest %>%
  select(House) %>%
  filter(is.na(House) == FALSE) %>%
  distinct() %>%
  t(.) %>%
  as.vector(.)
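And, as before, one call produces the wordcloud of every house (a sketch):

invisible(sapply(houses, house_worcloud))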

6.5 Characters comparison

Another interesting analysis is to compare characters by how often they use certain generic words, such as love or kill. To do so, we shouldn’t use the total word counts, because the characters that speak the most would dominate the comparison. Instead, we have to compare, out of all the words each character says, how many times they have used those words.

Thus, we first have to calculate, for each character, the total number of words they have spoken.

total_words <- textos_por_persona_unnest %>%
  group_by(nombres) %>%
  summarize(total_words = n())

textos_por_persona_unnest <- left_join(textos_por_persona_unnest, total_words, by = "nombres")

rm(total_words)

Now we can compare.

Comparison between Daenerys and Cersei

par(mfrow=c(1,1))
words_compare <- c("love", "kill", "war","ironthrone", "enemy","war","destroy")
textos_por_persona_unnest%>%
  filter(nombres %in% c("daenerys", "cersei"), Palabra %in% words_compare) %>%
  group_by(nombres, Palabra, total_words) %>%
  summarize(n_use = n()) %>%
  mutate(usage = n_use/total_words*100) %>%
  ggplot(aes(reorder(Palabra,usage),usage, fill=nombres)) +
    geom_bar(stat = "identity", position = "dodge")+
    coord_flip() +
    labs(title = "Daenerys vs Cersei word usage comparison",
         x = "",
         y = "Relative Times used (%) ",
         fill = "") +
  theme(legend.position = "bottom", panel.background = element_blank()) +
  scale_fill_manual(values=c("#00263e", "#8dc8e8"))

rm(words_compare)

As simple as this graph might look, it gives you a good grasp of each character’s motives. It looks like Cersei is driven by love, which might be the reason for the wars she has created. Daenerys, however, is driven by the Iron Throne, and that is what leads her to kill.

Comparison between Tyrion and Jaime

words_compare <- c("father", "sister", "lannister","stark", "love")
textos_por_persona_unnest%>%
  filter(nombres %in% c("tyrion", "jaime"), Palabra %in% words_compare) %>%
  group_by(nombres, Palabra, total_words) %>%
  summarize(n_use = n()) %>%
  mutate(usage = n_use/total_words*100) %>%
  ggplot(aes(reorder(Palabra,usage),usage, fill=nombres)) +
    geom_bar(stat = "identity", position = "dodge")+
    coord_flip() +
    labs(title = "Tyrion vs Jaime word usage comparison",
         x = "",
         y = "Relative Times used (%) ",
         fill = "") +
  theme(legend.position = "bottom", panel.background = element_blank()) +
  scale_fill_manual(values=c("#00263e", "#8dc8e8"))

rm(words_compare)

Comparison between Sandor Clegane and Bronn

Neither of them is known for their politeness. Both like drinking and killing, and ultimately they both say “fuck” many times. Can you guess who says it more often?

words_compare <- c("fuck", "kill", "drink","die","wine")
textos_por_persona_unnest%>%
  filter(nombres %in% c("sandor", "bronn"), Palabra %in% words_compare) %>%
  group_by(nombres, Palabra, total_words) %>%
  summarize(n_use = n()) %>%
  mutate(usage = n_use/total_words*100) %>%
  ggplot(aes(reorder(Palabra,usage),usage, fill=nombres)) +
    geom_bar(stat = "identity", position = "dodge")+
    coord_flip() +
    labs(title = "Sandor vs Bronn word usage comparison",
         x = "",
         y = "Relative Times used (%) ",
         fill = "") +
  theme(legend.position = "bottom", panel.background = element_blank()) +
  scale_fill_manual(values=c("#00263e", "#8dc8e8"))

rm(words_compare)

6.7 Sentiment Analysis

Another interesting analysis is sentiment analysis. How do the characters feel in each and every episode? How “sad” or “happy” is each episode? Does the sentiment across the seasons follow a trend? Those are the questions we will answer.

6.7.1 Getting sentiment analysis

Even though there are many ways to measure sentiment, I will pick a rather simple one: the bing lexicon. It labels each word as either positive or negative, which is enough to answer all the previous questions.

sentiment <- get_sentiments("bing")
textos_por_persona_sentiment <- textos_por_persona_unnest %>%
  mutate(word = Palabra) %>%
  inner_join(sentiment, by = "word")
rm(sentiment)

6.7.2 Total Sentiment evolution

Let’s see how the sentiment has changed over the episodes.

par(mfrow=c(1,2))
textos_por_persona_sentiment %>%
  group_by(sentiment, N_serie) %>%
  summarize(n = n()) %>%
  ggplot(aes(N_serie, n, fill=sentiment) ) + 
    geom_area(position = "fill") + 
    scale_x_continuous(breaks = seq(0,67,5))+
      theme(legend.position = "bottom", panel.background = element_blank()) +
  scale_fill_manual(values=c("#00263e", "#8dc8e8"))+
    labs(title = "Episode sentiment evolution", 
         x = "Episodes",
         y = "Distribution of possitve and negative words per episode",
         fill = "")

textos_por_persona_sentiment %>%
  group_by(sentiment, N_serie) %>%
  summarize(n = n()) %>%
  ggplot(aes(N_serie, n, fill=sentiment) ) + 
    geom_area() + 
    scale_x_continuous(breaks = seq(0,67,5))+
      theme(legend.position = "bottom", panel.background = element_blank()) +
  scale_fill_manual(values=c("#00263e", "#8dc8e8"))+
    labs(title = "Episode sentiment evolution", 
         x = "Episodes",
         y = "% of possitve and negative words",
         fill="")

par(mfrow=c(1,1))

6.7.3 Character sentiment analysis

textos_por_persona_sentiment %>%
  group_by(nombres, sentiment, Temporada_n, total_words) %>%
  filter(nombres %in% main_characters) %>%
  filter(!nombres %in% c("robb","ned","tywin","davos","brienne")) %>%
  summarize(n = n()) %>%
  mutate(word_percentage = n/total_words*100) %>%
  ggplot(aes(Temporada_n, word_percentage, fill=sentiment)) + geom_area(position = "fill") + facet_wrap(.~nombres, nrow = 2) + geom_hline(yintercept=0.5, linetype = "dashed", col = "white") +
    theme(legend.position = "bottom", panel.background = element_blank()) +
  scale_fill_manual(values=c("#00263e", "#8dc8e8")) 

I have removed some characters that either have not appeared in some seasons (Davos, Brienne) or have died (Tywin, Ned, Robb), as they did not show a clear trend. We can clearly see how Sansa has a negative trend, while Arya has a more positive one. Besides, most characters are more negative than positive.

6.7.4 House sentiment analysis

textos_por_persona_sentiment %>%
  filter(House %in% houses) %>%
  group_by(House, sentiment, Temporada_n) %>%
  summarize(n = n()) %>%
  spread(sentiment, n) %>%
  mutate(Sentiment = positive/(positive+negative)) %>%
  ggplot(aes(Temporada_n, Sentiment, col=House)) + geom_line() + geom_point() + geom_hline(yintercept=0.5, linetype = "dashed") +
  scale_x_continuous(breaks=seq(1,7,1)) +
  theme(legend.position = "bottom", panel.background = element_blank()) +
  scale_color_manual(values= mycols10) +
  labs(title = "Sentiment Evolution of GoT Main Houses", x="Temporada", y="Positive sentiment (%)", col ="") 

As you can see, at the beginning all houses had a very similar sentiment distribution. However, as the seasons go by, an increase in positive sentiment for one house usually implies a decrease in positive sentiment for another.

7 Word Relationships

7.1 Creating the bigram

So far we have tokenized by single words. However, this is not always the proper way to do the analysis, because a word might have a “not” before it and thus mean just the opposite. I will fix this issue in later versions.

For the time being, we will analyze by bigrams just to get the correlation between words.

textos_por_persona_unnest_2 <- unnest_tokens(textos_por_persona, output="Words", input = frases, token = "ngrams", n=2) %>%
  separate(Words,c("Palabra1","Palabra2"), sep = " ")
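As a sketch of how the negation issue mentioned above could later be tackled with these bigrams (assuming the bing lexicon from tidytext; not part of the current analysis):

# Sentiment words that appear right after "not": candidates to have their polarity flipped
negadas <- textos_por_persona_unnest_2 %>%
  filter(Palabra1 == "not") %>%
  inner_join(get_sentiments("bing"), by = c("Palabra2" = "word")) %>%
  count(Palabra2, sentiment, sort = TRUE)
head(negadas, 10)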

7.2 Bigram Network

First we will delete the stopwords and count by both words, so that we can filter and keep the bigrams that appear the most.

stop_words <- stopwords("en")
stop_words <- gsub("'","",stop_words) # stopwords without apostrophes

# We delete both kinds of apostrophes from the bigram words
textos_por_persona_unnest_2$Palabra1 <- gsub("'","", textos_por_persona_unnest_2$Palabra1)
textos_por_persona_unnest_2$Palabra2 <- gsub("'","", textos_por_persona_unnest_2$Palabra2)

textos_por_persona_unnest_2$Palabra1 <- gsub("’","", textos_por_persona_unnest_2$Palabra1)
textos_por_persona_unnest_2$Palabra2 <- gsub("’","", textos_por_persona_unnest_2$Palabra2)

Now we will simply delete the words and group by bigram.

network <- textos_por_persona_unnest_2 %>%
  filter(!is.na(Palabra1),!is.na(Palabra2))%>%
  filter(!Palabra1 %in% stop_words) %>%
  filter(!Palabra2 %in% stop_words) %>%
  count(Palabra1,Palabra2,sort=TRUE)

head(network,20)
Palabra1   Palabra2     n
kings      landing    171
seven      kingdoms   119
jon        snow       110
nights     watch       93
castle     black       86
lord       commander   84
hodor      hodor       68
iron       throne      61
casterly   rock        59
don’t      know        54

Now we will create the graph from the bigram counts.

library(igraph)

network <- network %>%
  filter(n > 20) %>%
  graph_from_data_frame()

network
## IGRAPH c66d519 DN-- 128 92 -- 
## + attr: name (v/c), n (e/n)
## + edges from c66d519 (vertex names):
##  [1] kings   ->landing   seven   ->kingdoms  jon     ->snow     
##  [4] nights  ->watch     castle  ->black     lord    ->commander
##  [7] hodor   ->hodor     iron    ->throne    casterly->rock     
## [10] don’t   ->know      lord    ->baelish   will    ->never    
## [13] come    ->back      ned     ->stark     robb    ->stark    
## [16] lady    ->sansa     ser     ->davos     lord    ->stark    
## [19] iron    ->islands   tywin   ->lannister just    ->like     
## [22] lady    ->stark     one     ->day       white   ->walkers  
## + ... omitted several edges

Finally, we will plot the igraph object with ggraph.

library(ggraph)

set.seed(2017)

ggraph(network) +
  geom_edge_link(color = "#00263e") +
  geom_node_point( color = "#8dc8e8", size = 2) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_graph(background = "white")