Automate your reposts with a simple Python web scraper (90 min)

In this tutorial, we’re going to create a simple web scraper. Our bot will be able to scrape valuable content on any website, even if there’s no public API provided. Later, the gathered articles can be filtered and automatically (re)posted on social media or your blog.

My example bot searches for new shopping deals related to a specific keyword on MyDealz.de and later posts the deal links to a public Facebook page. I encourage you to be creative! Don’t just blindly follow my tutorial; try to adapt it to your own use cases, other websites, and other social media platforms!

Simplified architecture of the bot

Overview: The bot lives in the looped main() function (mybot.py) that sequentially triggers the following methods. Every method will be discussed in a separate section of this tutorial.

  1. fetch_articles() | scrape.py
  2. filter_articles() | scrape.py
  3. persist_new_articles() | persist.py
  4. post_new_articles() | post.py
Looped main() function of our bot (mybot.py)
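The embedded gist doesn’t render in this text version, so here’s a minimal sketch of that loop. It assumes the module/function split from the list above; SEARCH_QUERY and SLEEP_SECONDS are placeholder names I chose, and the repo version may differ in detail.

# mybot.py -- simplified sketch of the looped main() function
# (SEARCH_QUERY and SLEEP_SECONDS are placeholders; see the GitHub repo for the real code)
import time

from scrape import fetch_articles, filter_articles
from persist import persist_new_articles
from post import post_new_articles

SEARCH_QUERY = 'voi scooter'
SLEEP_SECONDS = 30 * 60  # pause between two runs


def main():
    while True:
        articles = fetch_articles(SEARCH_QUERY)        # 1. scrape the search results
        relevant_articles = filter_articles(articles)  # 2. drop irrelevant articles
        persist_new_articles(relevant_articles)        # 3. store new articles in TinyDB
        post_new_articles()                            # 4. post unsent articles to Facebook
        time.sleep(SLEEP_SECONDS)


if __name__ == '__main__':
    main()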

Please note: I can’t provide every single detail on Medium. So please clone the repo to follow my tutorial. You can find the full source code of this project on GitHub.

We’re going to deploy our bot on an Ubuntu AWS EC2 instance in the last step of this tutorial. That’s why I’m providing all terminal commands for Ubuntu only.

1) Install Python and dependencies

First, you have to install the latest version of Python 3. Binaries for other operating systems can be found here.

sudo apt install python3

Next, you need to install the most widely used package management system for Python: pip.

wget -O get-pip.py "https://bootstrap.pypa.io/get-pip.py"
sudo python3 get-pip.py

After getting pip, please install the following packages: we’ll use requests for HTTP requests (GET/POST), beautifulsoup4 and collective.soupstrainer for parsing the result pages’ HTML content, lxml as the underlying HTML parser, and tinydb as a straightforward, flexible database.

sudo pip install requests beautifulsoup4 collective.soupstrainer lxml tinydb

2) Analyze your data source

I’m describing the process I used to analyze my bot’s target data source. This approach can be applied to nearly every website lacking a public API. However, you can’t just apply it directly to your target. You always have to adapt to the website’s peculiarities.

First of all, you have to define which topics are interesting for you and your social media/blog audience. Next, you have to search for websites with articles about these particular topics. In my case, I run a Facebook page focusing on deals and coupons for shared mobility apps (e.g. e-scooters). Where can I find this kind of deal? On MyDealz.de, one of Germany’s biggest user-generated platforms in this segment.

Second, you have to analyze how a human user would find and filter relevant content on the target website. Most platforms have a search bar or specific URLs listing articles with a particular tag (e.g. blog archive pages). Once you’ve found a potential solution, open your browser’s developer tools (Chrome: Ctrl + Shift + I), go to Network, and check Preserve log. Now, just act like a real user: on MyDealz.de, I click into the search field, type ‘voi scooter’, and finally press the submit button.

After submitting, look for a request in your network log that contains the word ‘search’ (or similar) and has ‘voi scooter’ in its payload. In my example, it’s just a simple GET request with only one parameter (q).

Find your search request in the developer console

Advanced: Some websites make their content and search functionality available to logged-in users only. In this case, you have to explore how the login works. Once you’ve figured it out, reconstruct and run the login request (see the next section and the sketch below) with all mandatory fields, like username and password. You can then reuse the same (logged-in) session object for all subsequent search requests.
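As an illustration only (the URL and form field names below are invented and will differ per site), reusing a logged-in requests session could look like this:

import requests

session = requests.Session()

# hypothetical login endpoint and field names -- reconstruct the real ones from your network log
session.post('https://www.example.com/login',
             data={'username': 'me@example.com', 'password': 'secret'})

# the session object now carries the login cookies,
# so all subsequent requests are sent as the logged-in user
raw_response = session.get('https://www.example.com/search', params={'q': 'voi scooter'})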

Key message: Try to understand how relevant information is loaded onto the target website (mostly GET/POST-Requests). Conduct experiments and figure out the particularities of your platform’s endpoints, like mandatory parameters (e. g. session token).

3) Scrape it

Once we’ve figured out how to request relevant information from our target website, we can rebuild exactly those requests in Python.

First, instantiate a new session and use it to send the search request to your target’s endpoint:

session = requests.Session()
SEARCH_URL = 'https://www.mydealz.de/search'
SEARCH_DATA = {'q': search_query}
raw_response = session.get(url=SEARCH_URL, params=SEARCH_DATA)

With BeautifulSoup we’re now going to parse the returned raw HTML response:

response = BeautifulSoup(raw_response.text, 'lxml')

On our BeautifulSoup response object, we’re now able to extract the relevant HTML elements. We’re using BeautifulSoup’s select() method to get an array of article nodes that all match a particular pattern. To extract your article elements, use CSS selectors.

But how do you find a unique CSS selector? Once again, open the developer tools (Chrome: Ctrl + Shift + I), click on Elements, use the inspector, and hover over your target articles. Now search for the smallest section possible that still holds all the article information.

Search a unique selector for the article wrapper

Try to come up with a unique CSS selector that wraps each relevant article on your result page. In my example, I’m sticking to the element name ‘article’ and class name ‘thread’. Therefore, my combined selector is ‘article.thread’.

For more information on CSS query selectors in general, please check out W3School’s CSS Selector Reference.

Tip: There’s an easy way to test your query selector directly in the browser. Just type the following JS command into the developer console. You will get back a list of HTML nodes matching your selector.

console.log(document.querySelectorAll('YOUR_SELECTOR'))

Cool! We finally have an array of HTML nodes holding the articles’ content and metadata (e. g. author, date, description).

articles = response.select('article.thread')

Now we have to find unique query selectors within the article wrapper node for each field we’re interested in. In my case, I need the deal’s title, description, and link. Again, we have to use the developer tools to find the corresponding elements holding the data and determine their unique selectors (same procedure, see above).

We can now use these selectors to iterate on the article array and parse the relevant information into native Python objects:

Parse relevant information from scraped article nodes (scrape.py)
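The gist isn’t shown inline here; a minimal sketch of that loop, assuming hypothetical field selectors like 'a.thread-link' and 'div.description' (substitute the selectors you determined yourself), could look like this:

articles_objects = []

for article in articles:
    # the field selectors below are examples -- use the ones you found in the developer tools
    title_element = article.select('a.thread-link')[0]
    articles_objects.append({
        'title': title_element.get_text(strip=True),
        'link': title_element['href'],  # read an HTML attribute via ['YOUR_ATTRIBUTE']
        'description': article.select('div.description')[0].get_text(strip=True),
    })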

Explanation: BeautifulSoup’s select() function always returns an array. In my solution, I assume there’s only one title in each article node. Therefore, I just use the only matching element at index 0. Also, you can get the value of a specific HTML element attribute with [‘YOUR_ATTRIBUTE’].

We finally got rid of all irrelevant HTML and now can work with a native article object array (articles_objects).

For your reference, here’s the full code example of fetch_articles():

Full code example: fetch_articles() (scrape.py)
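Since the gist isn’t reproduced here either, here’s a condensed sketch of what fetch_articles() could look like when the pieces above are put together (again, the field selectors are placeholders):

import requests
from bs4 import BeautifulSoup

SEARCH_URL = 'https://www.mydealz.de/search'


def fetch_articles(search_query):
    # rebuild the search request we found in the developer tools
    session = requests.Session()
    raw_response = session.get(url=SEARCH_URL, params={'q': search_query})

    # parse the raw HTML and select all article wrapper nodes
    response = BeautifulSoup(raw_response.text, 'lxml')
    articles = response.select('article.thread')

    # extract the relevant fields into native Python dicts
    articles_objects = []
    for article in articles:
        title_element = article.select('a.thread-link')[0]  # placeholder selector
        articles_objects.append({
            'title': title_element.get_text(strip=True),
            'link': title_element['href'],
        })
    return articles_objects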

Key message: Use CSS selectors to parse the scraped result page of your search request using BeautifulSoup.

4) Filter your articles

Depending on the precision of your target website’s search algorithm, you might get articles that aren’t relevant for you and your target group. That’s why we should filter them before posting.

For example, on MyDealz, the search query ‘lime scooter’ sometimes returns irrelevant deals for ‘lime soda’ or similar. Therefore, I only allow articles that contain ‘scooter’ in their titles to be published. I also keep a blacklist of disallowed words (‘soda’):

Filter your articles with keywords (scrape.py)
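A minimal sketch of such a keyword filter (WHITELIST and BLACKLIST are names I chose; the repo version may look different):

WHITELIST = ['scooter']  # at least one of these words must appear in the title
BLACKLIST = ['soda']     # none of these words may appear in the title


def filter_articles(articles_objects):
    relevant_articles = []
    for article in articles_objects:
        title = article['title'].lower()
        is_relevant = any(word in title for word in WHITELIST)
        is_blocked = any(word in title for word in BLACKLIST)
        if is_relevant and not is_blocked:
            relevant_articles.append(article)
    return relevant_articles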

Key message: Check your fetched articles’ quality! Build a list of irrelevant keywords and put them into your blacklist.

5) Persist them

After filtering, we’re now going to persist our articles using TinyDB, a simple JSON database. For each article, we first check whether it’s already in the table (key: deal title). If not, we create a new entry:

Persist new articles to database (persist.py)
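A minimal sketch of that check-then-insert logic with TinyDB (the database file name is an assumption):

from tinydb import TinyDB, Query

db = TinyDB('articles.json')  # assumed file name
Article = Query()


def persist_new_articles(articles_objects):
    for article in articles_objects:
        # the deal title serves as the key: only insert articles we haven't stored yet
        if not db.search(Article.title == article['title']):
            db.insert({'title': article['title'],
                       'link': article['link'],
                       'sent': False})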

6) Post them on social media

Now, we’re going to post the filtered articles on a Facebook page using Facebook’s Graph API. If you want to make your bot post on other social media platforms or your own blog, please refer to the official documentation of the respective API (e.g. Snap Kit, WordPress REST API).

If you haven’t already, sign up for a Facebook account and create a new page. Next, set up a new app at Facebook for Developers. Click on My Apps, Create new, and assign a display name to it. We will need this Facebook app to get permission to use Facebook’s Graph API.

Create a new Facebook App

Once you set up the app, it will show up in Facebook’s Graph API Explorer. We’re going to get a short-term access token (expires after 1 hour) for our Facebook page there. Select your app and page and copy the generated access token:

Get a short-term access token for your page

Now, we’re going to expand the lifetime of our token using the Access Token Debugger. Open the link, paste your token, and click on Debug. At the bottom of the result page, you will see a button labeled Extend Access Token. After confirming with your Facebook credentials, you get a 2-month token, which we’ll use for our postings.

Request a long-term access token for your page

If you haven’t already, please install the (unofficial) Facebook SDK for Python now:

pip install facebook-sdk

Let’s finally start coding again! The method post_new_articles() will only post the articles in our database that haven’t been marked as sent already. If the post request is successful, we’ll update the corresponding database entry to ‘sent = True’.

Post unsent article links onto the Facebook page (post.py)
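A minimal sketch of post_new_articles() using the facebook-sdk package (the token and page ID placeholders and the shared database file name are assumptions):

import facebook
from tinydb import TinyDB, Query

db = TinyDB('articles.json')
Article = Query()

ACCESS_TOKEN = 'YOUR_LONG_TERM_PAGE_ACCESS_TOKEN'  # the extended token from the previous step
PAGE_ID = 'YOUR_PAGE_ID'                           # usually the last segment of your public page URL


def post_new_articles():
    graph = facebook.GraphAPI(access_token=ACCESS_TOKEN)
    for article in db.search(Article.sent == False):
        # post the deal link to the page feed ...
        graph.put_object(parent_object=PAGE_ID, connection_name='feed',
                         message=article['title'], link=article['link'])
        # ... and mark the article as sent so it won't be posted again
        db.update({'sent': True}, Article.title == article['title'])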

Please make sure you’ve added your own access token and Facebook page ID (parent_object). Typically, you will find the page ID in the last segment of the public page URL.

That’s it! Now, debug your bot on your local machine with the following command:

python3 mybot.py

If everything works, you can follow the last section of this tutorial and deploy your bot on a remote server.

7) Deploy your bot on AWS EC2 (optional)

You can always debug and run your Python scripts on your local device. However, running your bot on a remote server is highly recommended if you want to run it frequently, e.g. multiple times a day for a couple of weeks.

I’m using a free Ubuntu EC2 VM at Amazon Web Services (AWS). Just sign up at AWS and follow their tutorial on how to set up an EC2 instance. All in all, it will take you about 10–15 minutes. The VM will be free of charge for your first 12 months at AWS (as of September 2019).

If you haven’t already, get an SSH client (e.g. OpenSSH). Generate a key pair to be able to connect to your machine via SSH. For more information, please have a look at the official documentation. Also, you will need an SCP/SFTP client like scp or FileZilla to access the remote file system.

Once you’re ready, connect to your EC2 instance via SSH.

[local]: ssh -i "key.pem" ubuntu@xxx.amazonaws.com

Now, repeat step 1) of this tutorial to set up Python 3, pip, and all the libraries on your remote server.

Next, transfer all your Python scripts to your remote server using scp:

[local]: scp -i "key.pem" -r ./deploy ubuntu@xxx.amazonaws.com:/home/ubuntu

Now, you’re ready to go! Just run the following command on the remote console:

[remote]: nohup python3 ~/deploy/mybot.py &

The bot is finally running on your server. The leading nohup keeps the application running even after the SSH session ends. The trailing & puts the application in the background and frees the remote console.

The bot is running as a background task, so stopping it is a little bit tricky: we need to search for the script name to get the process ID using grep and awk, and then kill the corresponding process. Here’s a concise one-liner:

[remote]: sudo kill $(ps aux | grep '[m]ybot.py' | awk '{print $2}')

Advanced: Register the following aliases to conveniently start and stop your bot. That way, you only have to type start-bot or stop-bot into your console to change its state.

[remote]: alias start-bot='nohup python3 ~/deploy/mybot.py &'
[remote]: alias stop-bot='sudo kill $(ps aux | grep "[m]ybot.py" | awk "{print \$2}")'

Conclusion

My tutorial shows a quick and dirty solution for a given use case. Of course, the concept of a bot reposting articles on Facebook pages (like the one described here) is not really revolutionary. However, I aimed to show that it’s relatively easy to create useful scrapers for websites not offering any public APIs. Nowadays, you don’t have to have a deep understanding of the Python language or any particular library to build robust and useful bots. And with the help of platforms like AWS, everybody can easily run them on a remote server!

Now it’s your turn! Are there any potential applications in your domain? Which social media platforms would you consider attractive in this context, besides Facebook? I’m looking forward to hearing from you in the comment section!
