
DIY Data Science, Part 3: W is for Web Scraping

May 22, 2017

I’ve been quite excited to dive into web scraping. After all, my teenage self used to copy and paste HTML code like crazy to put together amateur website templates, and scraping is also maybe the most intuitive way of collecting data online – just grab what you see. In practice, it’s obviously a little trickier than that. Programming a web scraper is not that difficult per se, but I hadn’t done it in over a year and also didn’t have that much experience to begin with. This combination made my Sunday night, while fun, incredibly frustrating at times (which seems to be a common thread).

The great thing is that this little project didn’t take me long to do at all. Yes, I did feel like I aged at least two years in the process. But all in all, this was a very cost-efficient topic to cover. A few hours was all I needed to get my scraper up and running, including some basic tutorials but excluding some general reading I did earlier in the week. Speaking of reading – I found this, this and this very useful. Analytics Vidhya, you are a hero.

This week’s topic: Collecting housing data

To work through my data science alphabet, I’ve decided that I’m going to bundle several letters into small data projects. W for Web Scraping is the first week of a project with the very provisional title Exploring Singapore’s Housing Market – a title that my Master’s supervisor would definitely criticise, because exploring is a super vague verb to use in any research context.

This week, I’ve scraped data on 5000 rooms available for rent on Rent in Singapore, including each room’s title, description, price, room details, latitude and longitude of its location, district, and broader area (North, South, etc.). In future weeks, I want to collect more data on Singapore and its residential areas, look for interesting patterns, and create some neat visualisations. But for now, I’ve restricted myself to the scraping itself.

Instead of talking through what I did step by step, I’ve uploaded my code and will talk about a few things that I’ve learned in the process. Here, I’ve used the BeautifulSoup Python library, but I’ve also heard great things about Scrapy, which might be a less fiddly alternative.
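To give a flavour of what the scraper does, here is a minimal BeautifulSoup sketch. The HTML snippet and the class names (`listing`, `title`, `price`, `district`) are made up for illustration – the real site’s markup will look different – but the pattern of finding elements, pulling out their text, and dumping everything into a CSV via Pandas is the same.

```python
from bs4 import BeautifulSoup
import pandas as pd

# A tiny stand-in for one results page; the class names here are
# hypothetical, not the site's real markup.
html = """
<div class="listing">
  <h2 class="title">Cosy room in Tiong Bahru</h2>
  <span class="price">S$1,200</span>
  <span class="district">District 3</span>
</div>
<div class="listing">
  <h2 class="title">Master room near Orchard</h2>
  <span class="price">S$1,800</span>
  <span class="district">District 9</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for listing in soup.find_all("div", class_="listing"):
    rows.append({
        "title": listing.find("h2", class_="title").get_text(strip=True),
        "price": listing.find("span", class_="price").get_text(strip=True),
        "district": listing.find("span", class_="district").get_text(strip=True),
    })

# One row per room, exactly like the result below.
df = pd.DataFrame(rows)
df.to_csv("rooms.csv", index=False)
```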

This is what my result looks like – a CSV file with one row per room. Have I mentioned that I love Pandas?

[Screenshot: the resulting CSV, one row per room]

It still requires some data wrangling in Pandas, which I’m embarrassingly excited for and might cover in a little mid-week special. The room details, for example, are all in a list in the same column – I want them in separate columns so I can compare rooms. I also need to remove the HTML from the description column, and I’m sure I will notice plenty more things once I get started.
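Both of those wrangling steps can be sketched in a few lines of Pandas. The two-row DataFrame below is a hypothetical stand-in for my scraped data, but the two moves – one boolean column per room detail, and BeautifulSoup to strip the HTML out of the description – are exactly what I have in mind.

```python
import pandas as pd
from bs4 import BeautifulSoup

# A hypothetical slice of the scraped data: room details sit in one
# list-valued column, and the description still contains HTML.
df = pd.DataFrame({
    "details": [["Air-con", "Wifi"], ["Wifi", "Cooking allowed"]],
    "description": ["<p>Bright room <b>near MRT</b></p>", "<p>Quiet street</p>"],
})

# One boolean column per distinct detail, so rooms can be compared.
dummies = df["details"].str.join("|").str.get_dummies().astype(bool)
df = df.join(dummies)

# Strip the HTML tags from the description column.
df["description"] = df["description"].apply(
    lambda d: BeautifulSoup(d, "html.parser").get_text()
)
```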

Some thoughts on scraping

There are different ways to access the different elements of a page, so what works for one heading might not work for a different text field. The good thing is that anything you can see in the source code, you can scrape. Nevertheless, it took me a good hour and one frustrated tweet to figure out how to scrape one particularly stubborn element.
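To illustrate just how many routes there are to the same element, here is a small sketch on a made-up snippet of HTML – by tag name, by attribute, and by CSS selector. When one approach refuses to work on a stubborn element, another one often will.

```python
from bs4 import BeautifulSoup

# Made-up markup just to show the different access patterns.
html = """
<article>
  <h1 id="heading">W is for Web Scraping</h1>
  <div class="meta"><span data-price="1200">S$1,200</span></div>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("h1")                      # by tag name
h1_by_id = soup.find(id="heading")        # by attribute
price = soup.select_one("div.meta span")  # by CSS selector
amount = price["data-price"]              # attributes behave like a dict
```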

Some websites really don’t want you to scrape their information. Ethics seem to be a big deal when it comes to web scraping, and any basic introduction to the discipline will tell you to respect the terms and conditions. I was confronted with this right at the beginning; originally, I wanted to scrape PropertyGuru. But when I had a look at the HTML I pulled, I realised that I couldn’t see any of the rooms. Instead, there was a surprise captcha I had no way of filling out – testing my scraper for humanity. Now, there are most likely ways to avoid these tests and trick the website, but it didn’t feel right to start my web scraping journey by blatantly disrespecting Internet etiquette. I moved on to a different property website with no such obstacles.

The funny thing about web scraping is that you are really supposed to get your scraper to behave like a human; a little fake human that will click onto the next page for you, select the information you want, and download it at uebermensch speed. That’s what a scraper is doing – it’s not doing anything a human can’t do, but it does it much more efficiently. If you’re too efficient though, you might get blocked, so part of the fun seems to be to figure out a way to be just human enough in your scraping efforts.
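One common way to stay “just human enough” is to pause for a random interval between requests. The helper below is a sketch of that idea – the name `polite_get` and the one-to-three-second range are my own guesses, not anything a particular site prescribes, and `session` is assumed to be something like a `requests.Session`.

```python
import random
import time

def polite_get(session, url, min_delay=1.0, max_delay=3.0):
    """Fetch a page, then sleep a random, human-ish interval.

    `session` is assumed to expose a requests-style .get(url);
    the delay range is a guess at what counts as polite.
    """
    response = session.get(url)
    time.sleep(random.uniform(min_delay, max_delay))
    return response
```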

Pagination can be super easy (as it was in my case), or a little more tricky. If there are page numbers in the URL, it’s generally quite easy.
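In the easy case, pagination is nothing more than a loop over a URL template. The URL pattern below is hypothetical, not the site’s real one:

```python
# Hypothetical URL pattern with the page number as a query parameter.
BASE_URL = "https://example.com/rooms?page={}"

def page_urls(n_pages):
    """Build the URL for every results page, 1 through n_pages."""
    return [BASE_URL.format(page) for page in range(1, n_pages + 1)]
```

Each URL can then be fetched and parsed in turn; the harder cases are sites where the next page is loaded by JavaScript with no page number in sight.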

And last but not least, I’ve got a confession to make. I didn’t scrape all 5346 available rooms. My internet connection happened to cut out after my scraper had found exactly 5000 rooms, which delighted me so much that I didn’t realise it was an SSL error rather than divine coincidence, and I just kept going. I will rectify this before I do any further analysis – but for this week, I think 5000 rooms will do.
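For next time, one way to survive a flaky connection is to wrap each fetch in a small retry loop instead of letting a single dropped request end the run. This is just a sketch of the idea; `fetch` is assumed to be any callable like `requests.get`, and the retry counts and backoff are arbitrary.

```python
import time

def fetch_with_retries(fetch, url, retries=3, backoff=2.0):
    """Retry a flaky fetch a few times before giving up.

    A dropped connection (say, an SSL error mid-scrape) only raises
    after the final attempt, instead of silently truncating the run.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))  # back off a little longer each time
```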

 
