A recent project of mine is a Python script that gathers the text of articles from a news website, summarises them, and emails the summaries to me every day at a given hour. The technique of extracting specific data from websites is known as web scraping.
Below, I first explain the less efficient method I originally deployed, and later in the article I cover the more efficient one using an API.
Initially, the goal was to scrape through the HTML of a site (I will use The Guardian as an example) and find the relevant information within it. As explained in the Web development – HTML #1 tutorial, we can explore the HTML of any given site by right-clicking and selecting ‘Inspect’ or ‘Inspect element’ in our browser. Since the HTML contains every piece of content displayed on the website, we can use the browser’s inspect function to find the exact link, image, title or body text in the HTML.
To begin, the top part of the code simply tells BeautifulSoup, a Python HTML parsing package, to look at the ‘/business’ section of The Guardian and find all <a> HTML tags (links). If there are more than 100 links on the page (the condition could just as well be ‘if len(all_links) > 3’), it looks for the link at the 141st position, which I found to be the most recent article. I then did the same for the 142nd and 143rd positions, to get the latest three articles.
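A minimal sketch of this first approach, assuming BeautifulSoup is installed (the helper name and the inline sample page are mine; the real script fetched the live section page with the Requests library):

```python
from bs4 import BeautifulSoup

def extract_links(html):
    """Return the href of every <a> tag found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a") if a.get("href")]

# A tiny stand-in for the section page; in the real script the HTML came
# from requests.get("https://www.theguardian.com/business").text
sample_page = """
<html><body>
  <a href="/business/article-1">First</a>
  <a href="/business/article-2">Second</a>
  <a href="/business/article-3">Third</a>
</body></html>
"""

all_links = extract_links(sample_page)
if len(all_links) > 2:        # the real check was len(all_links) > 100
    latest = all_links[0:3]   # the real script took positions 141-143
print(latest)
```

The fixed positions are the fragile part: if the page layout changes, the slice points at different links, which is exactly the problem described below.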
This is not the most efficient method, but it was the method I used before switching to the easier API method below.
This is the link we scraped:
I then did the same with the titles of the articles, by scraping the <h1> element at each URL found above.
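A similar sketch for the titles, again with a made-up article page standing in for the real fetched HTML:

```python
from bs4 import BeautifulSoup

def extract_title(html):
    """Return the text of the first <h1> tag, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    return h1.get_text(strip=True) if h1 else None

# A stand-in for an article page fetched with requests.get(article_url).text:
sample_article = "<html><body><h1>Markets rally as rates hold</h1><p>...</p></body></html>"
print(extract_title(sample_article))
```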
Now, this method did in fact work quite well, but every few months the number and position of the links changed and, hence, I ended up with articles from unrelated sections (sport, lifestyle, etc.). At that point I realised there must be a better way to do this, and searched for an API.
What is an API?
First of all – no, API is not just another type of beer; it stands for ‘Application Programming Interface’. It defines what information can be requested from a service and in what form it will be provided.
When visiting a restaurant, we do not prepare our food in the restaurant’s kitchen; instead, we give our order to the waiter, who passes it to the kitchen, where the food is prepared for us. Similarly, an API takes our request for specific data without us ever having to see the original code. It passes the request to the website’s server, which sends the requested data back. Most API providers also publish online documentation (the menu in our restaurant analogy) showing what information can be accessed.
After registering with The Guardian, we are given an ‘API key’, which is how our code identifies itself to the service. In the picture above, we can see an example of The Guardian’s response structure, which we can then use in our code to request specific fields. For instance, the most recent article sits at position ‘0’ (lists in programming languages usually begin at zero), and within the ‘0’ dictionary we see a number of different items. To get the article’s URL, for example, we navigate through ‘root’ -> ‘response’ -> ‘0’ -> ‘webUrl’.
This way I navigated to the ‘webUrl’ element from the given API link (using the Requests library for Python) and printed it out. Voilà, we got our desired link.
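A sketch of that lookup using only the standard library, with an illustrative payload in the shape the Guardian Content API documents (in the live response the articles sit in a ‘results’ list inside ‘response’; ‘root’ in the picture is just the JSON viewer’s label for the top level). The real call would go through Requests with your API key:

```python
import json

# An illustrative, trimmed payload; a real request would look like
# requests.get("https://content.guardianapis.com/search",
#              params={"section": "business", "api-key": API_KEY}).json()
raw = """
{"response": {"status": "ok",
              "results": [{"webTitle": "Example article",
                           "webUrl": "https://www.theguardian.com/business/example"}]}}
"""

data = json.loads(raw)
# Navigate response -> results -> 0 -> webUrl:
web_url = data["response"]["results"][0]["webUrl"]
print(web_url)
```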
Getting the article’s summary
Once we have the link (either through the first, inefficient method or via the API), we have to get the article’s text and summarise it. I used Sumy, a Python library by Mišo Belica, which extracts summaries from HTML pages or plain text. The library includes several different summarisation methods, of which I chose ‘TextRank’ (explained here).
Here is the actual code (using TextRankSummarizer), and below it the resulting summary. The sentence count was set to 2, but it can be set to any other number.
Send the summaries to email
Using a Google App password, I give my code access to my Google account (the ‘sender’), from which it sends the article summaries to another email account (or, in my case, to the same one). I then add all the titles and summaries scraped above to the body of the email that the code sends.
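A sketch of the email step using Python’s standard smtplib and email modules; the subject line and helper names are mine, and smtp.gmail.com on port 465 is Gmail’s SSL endpoint. The actual send is wrapped in a function that is not called here, so nothing is transmitted:

```python
import smtplib
from email.message import EmailMessage

def build_email(sender, recipient, articles):
    """Assemble the daily digest; `articles` is a list of (title, summary) pairs."""
    body = "\n\n".join(f"{title}\n{summary}" for title, summary in articles)
    msg = EmailMessage()
    msg["Subject"] = "Your daily article summaries"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(body)
    return msg

def send_email(msg, app_password):
    """Send the digest via Gmail; requires a Google App password."""
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
        smtp.login(msg["From"], app_password)
        smtp.send_message(msg)

digest = build_email("me@gmail.com", "me@gmail.com",
                     [("Example title", "Example two-sentence summary.")])
print(digest["Subject"])
```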
Lastly, using the ‘time’ and ‘schedule’ Python libraries, I wrapped the whole code in the following function so that it executes itself every 24 hours.
def job():
    [ALL CODE HERE]

schedule.every(24).hours.do(job)

while True:
    schedule.run_pending()
    time.sleep(60)  # wait one minute
Since the code runs locally (on my computer), it is interrupted every time I shut down my laptop. For this reason, I simply double-click the script whenever I want the articles, instead of scheduling it as above. Another option would be to deploy the code to a virtual server (such as Heroku), where it would run non-stop.
Finally, this is the result we get in our inbox:
I am now in the process of developing an advanced version of the scraper, which would show the summaries in an interactive online app. Users could mark any summarised article as ‘interesting’ or ‘not interesting’, and the model would then gradually adjust itself to provide articles relevant to the user’s interests.