Web-scraping Final Fantasy VII with Python & Beautiful Soup

Over the last week I’ve been figuring out how to web scrape, using some handy tutorials from the internet (mostly this one from Data Science Dojo). I’ve learnt a lot about Python and Beautiful Soup, and I’m here to share some of that sweet, sweet knowledge with you today!

I’m a big fan of video games, and they’re pretty heavily tied to my workplace, so I decided the best place to try out web scraping would be a (very) comprehensive video game wiki – and I chose Giant Bomb.

Just a quick little disclaimer here – I’m very much learning as I go, and mid-way through this web-scraping process I found out that Giant Bomb has an API built just for this purpose! I will be attempting to get the same information using the API soon, but for the sake of learning in steps I think it’s still very valuable to learn how to do a basic web scrape first (not everyone has fancy APIs!).

How to web-scrape (steps):

1. Find a website with some very cool information that you want √

I’m using the Giant Bomb video game wiki, and since I’m a massive Final Fantasy fan, I’m focusing my efforts on the Final Fantasy VII wiki page.

2. Find a uniform part of the website that will make scraping simple!

Web-scraping is simplest when the information is laid out in a uniform way across a website, so find a bit of information you want that looks the same on every page (or sometimes it might be repeated on the same page).

In my case, there is a “Game Details” section on every game page of the website that contains things like ‘Genre’, ‘Developer’, ‘Publisher’ etc. that should work perfectly!

eg: Final Fantasy VII Game Details:


3. Go for a cheeky dig through the html on the webpage to find where the “Game Details” are hiding in the code 

All you need to do is right click on your page in your preferred web browser (I’m using Chrome) and click “Inspect”


And then have a look in the inspection window for what part of the code the “Game Details” table is under.

  • I pressed CTRL+F to search the code and typed in ‘game details’.
  • Sure enough, I found it inside of a div with the class ‘wiki-details’.
  • And if I hover over that div in the code, it highlights the exact table I need. Awesome!


4. Load up Python and Beautiful Soup  √

Now that we have an idea which part of the html code we will be scraping our data from, it’s time to get into the fun stuff! Coding in Python! I’ve been using two programs: Sublime Text to put my code together, and a handy-dandy command window to run it.

If you don’t already have Python installed on your computer, the best way to get that set up is to download Anaconda with Python here – then you can get Sublime Text from here.

  • Open up a new file in Sublime Text and save it into a new folder on your computer. Call it anything you want! I’m calling mine ‘Giant_Bomb_Webscrape’.
  • Open the folder you saved your Sublime file into with your File Explorer, and call a command window from there. (You can do this by holding SHIFT + Right Mouse click in the folder, then clicking on “Open command window here”.)


  • So now your screen should be looking something like this:


  • Let’s start out by putting together a little bit of the code in Sublime Text.
    • First, if you haven’t done it already, set the programming language to Python. You can do this by clicking into Sublime in any spot where you can enter text, pressing CTRL+SHIFT+P, and then typing “Set Syntax: Python”. (That’s how you get Sublime to colour the text the right colours while you’re typing – if all your text is white, you need to set the syntax!)
    • Next, import the libraries:

#--Import the libraries--
#URLLib is used to open webpages
from urllib.request import urlopen
#Beautiful Soup is what we use to parse our webpage
from bs4 import BeautifulSoup as soup

Your code should now look something like this:

5. Set the URL to a variable and parse the webpage  √

Let’s really start getting into this coding business now. We’ve found our webpage, we have somewhat of an idea of where in that html code we want to look, and we’ve set up Python and installed our libraries – now let’s get digging!

First, let’s get Python to open up our Final Fantasy VII wiki page. We do that in the following order:

  1. Set the URL to a variable
  2. Open the URL
  3. Store the URL’s html code in a variable
  4. Close the URL
  5. Parse the html with BeautifulSoup

And here’s how we do that:

#Set your URL
my_url = 'https://www.giantbomb.com/final-fantasy-vii/3030-13053/'
#Open your URL
url_opener = urlopen(my_url)
#Store the URL's html in a variable
html_holder = url_opener.read()
#Close the URL
url_opener.close()
#Parse the html with BeautifulSoup
page_soup = soup(html_holder, "html.parser")

Now we are just about ready to have a look at that html, but hang on – it’s probably not a good idea to print the WHOLE webpage’s html, as it could be huge and crash our command window – and that’s not good!

Instead, let’s just look inside of the ‘div class’ that we know the ‘Game Details’ should be contained in: <div class = "wiki-details">


We can do that with a handy little Beautiful Soup method, “findAll”, chucking the result in a variable like so:

#Only look at the div class "Wiki-Details"
containers = page_soup.findAll("div",{"class":"wiki-details"})

So what this code will do is look through our page’s html, find all divs where the class is “wiki-details”, and then hold the results in the variable containers.
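If you want to see what findAll hands back without touching the live page, here’s a tiny self-contained sketch – the class name matches the wiki page, but the html fragment itself is invented for illustration:

```python
from bs4 import BeautifulSoup as soup

# Invented fragment with the same structure as the wiki page
sample_html = ('<div class="wiki-details"><p>Game Details</p></div>'
               '<div class="other-stuff"></div>')
sample_soup = soup(sample_html, "html.parser")
# findAll returns a list of every tag that matches both filters
matches = sample_soup.findAll("div", {"class": "wiki-details"})
print(len(matches))        # only one div matches
print(matches[0].p.text)
```

Note that findAll always gives you a list, even when there’s only one match – which is why we’ll be indexing into it with [0] shortly.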

So your total code should be looking like this:


 

6. Give your code a run in the command window and dig a little deeper  √

Now we move back over to the command window. First thing, type in “python” and press ENTER to make sure we’re inside of the Python interpreter. You should see something like this:

Great, we are now running in Python and ready to give our code a shot.

Now copy the code from your Sublime window and paste it into the command window. (Note that you may have to press ENTER to run the last line of copied code if you haven’t included a line break at the end of the code) 


You can tell the code above has worked because it has continued onto the next line and shown the “>>>” symbols that are ready for you to continue typing code. Now let’s check out the first of those ‘Wiki-Details’ div containers by typing the following into the command window:
containers[0]

And from running this code, we can see some very satisfying information! It looks like we are finding all the little elements of our ‘Game-Details’ table we were looking at earlier.


So it was a bit of a fluke, but our first container (containers[0]) holds all of the information we need. So go ahead and assign that to a variable so that we don’t have to keep calling it in full every time we need it:

#Hold containers[0] in a variable
container = containers[0]

7. Find each of the parts of information you want and assign them to individual containers √

We are looking for:

  • Name
  • First Release Date
  • Platform
  • Developer
  • Publisher
  • Genre
  • Theme
  • Franchises
  • Aliases

After a quick look through the code, I can see a pattern in where each of these slices of information is hiding. They are mostly contained in a ‘div’ that has a ‘data-field’ attribute assigned to it, and that ‘data-field’ holds the name of that particular slice of data.

For example, this div’s ‘data-field’ is “themes”, which is one of the slices of data we are looking for:


So hypothetically, if we just try out that “findAll” method we were using before, but as “find” – without the “All” (since there shouldn’t be more than one match) – then we should be able to pull just the “themes” div:

container.find("div",{"data-field":"themes"})

And you will find the above code pulls that exact div.
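Here’s a quick self-contained sketch of find with an attribute filter – the ‘data-field’ structure matches what’s described above, but the fragment and its values are invented:

```python
from bs4 import BeautifulSoup as soup

# Invented fragment mimicking the data-field layout on the wiki page
fragment = ('<div data-field="themes">Fantasy</div>'
            '<div data-field="genres">Role-Playing</div>')
fragment_soup = soup(fragment, "html.parser")
# find returns only the FIRST matching tag, or None when nothing matches
themes_div = fragment_soup.find("div", {"data-field": "themes"})
print(themes_div.text)  # Fantasy
```

That “or None” behaviour is worth remembering – if a page is missing a field, calling .text on the result will throw an error.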

The only two values that can’t be found through this method are the “Name” and “Aliases”, which happen to be the first and last values in the data. Neither of those ‘div’ objects has a ‘data-field’ attribute.

1. The “Name” field is contained in the very first ‘a’ tag of the container, so it can be easily found just by typing: 

container.a

2. The “Aliases” field actually has a very similar setup to the rest of the fields, only instead of a ‘data-field’ attribute, it has a class on a ‘span’ tag: 

container.find("span",{"class":"aliases"})
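Both of those special cases can be tried out on a made-up fragment too – the layout below mimics the first-‘a’-tag and aliases-span structure described above, but the html itself is invented:

```python
from bs4 import BeautifulSoup as soup

# Invented fragment: the first <a> holds the name, a span holds the aliases
fragment = ('<div><a>Final Fantasy VII</a>'
            '<span class="aliases">FFVII\nFF7</span></div>')
c = soup(fragment, "html.parser")
# .a is shorthand for "the first <a> tag inside this element"
print(c.a.text)
print(c.find("span", {"class": "aliases"}).text)
```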

So to put it all together, here’s how we would assign each of these fields to containers:

#Assign each data field to a container
name_container = container.a
release_container = container.find("div",{"data-field":"release_date"})
platform_container = container.find("div",{"data-field":"platforms"})
developer_container = container.find("div",{"data-field":"developers"})
publisher_container = container.find("div",{"data-field":"publishers"})
genre_container = container.find("div",{"data-field":"genres"})
theme_container = container.find("div",{"data-field":"themes"})
franchise_container = container.find("div",{"data-field":"franchises"})
aliases_container = container.find("span",{"class":"aliases"})

8. Scrape the data field containers until you have lovely clean data values √

I found that the best way to do this is to work down step-by-step, cutting off a little bit of the information we don’t need at each stage until the data is clean.

Here’s what I did with the “franchise_container” :


  • First you can see what the franchise_container looks like in full
  • Next, we only pull the text, which almost works if it weren’t for the line breaks in between the two values!
  • “.strip()” gets rid of the line breaks (the “\n”s) at the beginning and the end, leaving only the middle line breaks.
  • I’ve replaced the line breaks in the centre of the data with pipes “|” to split up the different values.
  • And done!

I’ve found that code pretty much works for every data field, except for the “name” data field, which never has line breaks in the case of GiantBomb.com, and the “aliases” data field, which had sneaky “\r” characters that I replaced with blanks.

So this is the code we end up with:

Name = name_container.text
First_release_date = release_container.text.strip().replace("\n","|").replace(","," -")
Platform = platform_container.text.strip().replace("\n","|")
Developer = developer_container.text.strip().replace("\n","|")
Publisher = publisher_container.text.strip().replace("\n","|")
Genre = genre_container.text.strip().replace("\n","|")
Theme = theme_container.text.strip().replace("\n","|")
Franchises = franchise_container.text.strip().replace("\n","|")
Aliases = aliases_container.text.strip().replace("\n","|").replace("\r","")
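To see the strip/replace chain in isolation, here’s a quick sketch on an invented string with the same kind of whitespace the wiki fields come with:

```python
# Invented raw text: leading/trailing whitespace plus a line break
# in the middle, just like the wiki's multi-value fields
raw_text = "\n    Final Fantasy\nMana\n    "
# strip() trims the outer whitespace; replace() turns the inner break into a pipe
clean_text = raw_text.strip().replace("\n", "|")
print(clean_text)  # Final Fantasy|Mana
```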

So let’s check in our command window and make sure everything is working.

Your full code should be looking like this now:

#--Import the libraries--
#URLLib is used to open webpages
from urllib.request import urlopen
#Beautiful Soup is what we use to parse our webpage
from bs4 import BeautifulSoup as soup

#Set your URL
my_url = 'https://www.giantbomb.com/final-fantasy-vii/3030-13053/'
#Open your URL
url_opener = urlopen(my_url)
#Store the URL's html in a variable
html_holder = url_opener.read()
#Close the URL
url_opener.close()
#Parse the html with BeautifulSoup
page_soup = soup(html_holder, "html.parser")

#Only look at the div class "Wiki-Details"
containers = page_soup.findAll("div",{"class":"wiki-details"})

#Hold containers[0] in a variable
container = containers[0]

#Assign each data field to a container
name_container = container.a
release_container = container.find("div",{"data-field":"release_date"})
platform_container = container.find("div",{"data-field":"platforms"})
developer_container = container.find("div",{"data-field":"developers"})
publisher_container = container.find("div",{"data-field":"publishers"})
genre_container = container.find("div",{"data-field":"genres"})
theme_container = container.find("div",{"data-field":"themes"})
franchise_container = container.find("div",{"data-field":"franchises"})
aliases_container = container.find("span",{"class":"aliases"})

Name = name_container.text
First_release_date = release_container.text.strip().replace("\n","|").replace(","," -")
Platform = platform_container.text.strip().replace("\n","|")
Developer = developer_container.text.strip().replace("\n","|")
Publisher = publisher_container.text.strip().replace("\n","|")
Genre = genre_container.text.strip().replace("\n","|")
Theme = theme_container.text.strip().replace("\n","|")
Franchises = franchise_container.text.strip().replace("\n","|")
Aliases = aliases_container.text.strip().replace("\n","|").replace("\r","")

Copy and paste your code into the command window. We can check the code has run correctly by double checking all our variables, like so:

Name

First_release_date

Platform

Developer

Publisher

Genre

Theme

Franchises

Aliases

And here’s what we get:


Success!!!!!

9. Export the data into a CSV √

This is my favourite part of the whole project, being able to export this wonderful data into a CSV we can use later. And it’s so simple to do!

Here’s the code:

#Export data to a CSV
filename = "Final_Fantasy_VII.csv"
f = open(filename,"w")
headers = "Name,First_release_date,Platform,Developer,Publisher,Genre,Theme,Franchises,Aliases\n"
f.write(headers)
f.write(Name +","+ First_release_date + "," + Platform +","+ Developer +","+ Publisher +","+ Genre +","+ Theme +","+ Franchises +","+ Aliases + "\n")
f.close()

And the above handy-dandy code does the following:

  • Assigns a filename,
  • Opens the file for writing,
  • Writes the headers with a line break at the end,
  • Writes our data under each field,
  • And closes the file (a very important step – you can’t open the file properly in Excel until Python has closed it!)
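As a side note, Python’s standard-library csv module can do the comma-wrangling for you – it quotes any field that contains a comma, so tricks like replacing commas with dashes aren’t needed. A minimal sketch with invented sample values (the filename is made up too):

```python
import csv

# Invented sample values; csv.writer quotes the second field because
# it contains a comma, so it won't break the columns
with open("sample_game.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Platform"])
    writer.writerow(["Final Fantasy VII", "PlayStation, PC"])

with open("sample_game.csv") as f:
    contents = f.read()
print(contents)
```

The with-blocks also close the file automatically, which covers that “very important step” for us.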

So after running that code and popping back into our original folder (the one we saved our Sublime file into and opened the command window from), we should find our Final_Fantasy_VII.csv file, ready to open in Excel!

Let me know what you think in the comments, and if you have any trouble I’ll try my best to help! The next step to this is combining it with a loop to pull multiple pages at a time and fill up a CSV with a whole bunch of games’ information, but I’ll leave that for another day!

I hope you enjoyed this tutorial, and most of all – I hope you learned something today!

– GirlvsData

 
