LinkedIn Scraping with Python

How to set up a Python script to scrape LinkedIn Profiles


Scraping LinkedIn profiles is a very useful activity, especially for public relations and marketing tasks. Using Python you can make this process smoother, freeing your time to focus on the profiles that deserve closer attention.

At the end of the article you will find a working, fully documented Python script to scrape basic information from LinkedIn.

Contents of the Article

  1. What is Web Scraping
  2. Web Scraping with Python
  3. The LinkedIn Profile Scraping Python Script
  4. Useful Resources

1. What is Web Scraping

Web Scraping is a technique used to extract data from websites.

If you've ever copy and pasted information from a website, you've performed the same function as any web scraper, only on a microscopic, manual scale. (Source: https://scrapinghub.com/what-is-web-scraping)

Technically, it can be performed in two ways:

  • Direct HTTP requests: the best choice for static websites.
  • Driving a web browser: the best choice for dynamic websites with content loaded asynchronously or through IFrames (unfortunately not as uncommon as you may think, especially in legacy systems).

In both cases the final step is parsing the page to extract the content.

Direct HTTP Requests

As you may know, websites are just a rendering of the HTML + CSS code that the web server returns in response to a GET / POST request from your browser. As a result, a simple script can automatically send HTTP requests and parse the responses, scraping the content.

Example

From the command line, run the following instruction:

curl -X GET http://www.wikipedia.com

As a result, you will receive the response from the corresponding web server, printed directly in your terminal.


You can send the same request from any programming language, store the response and parse it accordingly (see the paragraph "Beautiful Soup" in the following section "Web Scraping with Python"), as sketched below.
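For example, a minimal sketch in Python using the third-party requests library (one option among many) could look like this:

import requests

# Send a GET request, equivalent to the curl command above
response = requests.get("http://www.wikipedia.com")

# Store the raw HTML returned by the web server
html = response.text
print(html[:500])  # print the first 500 characters as a quick check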

Driving a Web Browser

Sometimes websites load (part of) their content asynchronously. This means that the information you want to scrape may not be contained in the first HTTP response, but is loaded only after scrolling the page (as in the LinkedIn case) or after clicking a button.

To overcome this barrier, you can use a Web Browser Driver (see the paragraph "Selenium Web Driver" in the following section "Web Scraping with Python").

Web Browser Drivers let you run a real web browser, enabling your script (in Python or other languages) to emulate user behavior on the page (basically by executing JavaScript code through the browser console).

In this way you can, for example, emulate the click on a button, assuming this is useful to the scraping activity:

document.getElementById('buttonID').click()
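In Python, a browser driver object (here hypothetically named browser, as set up with Selenium in the next section) could execute that snippet like this:

# Run the JavaScript click inside the page loaded by the driver
# ('buttonID' is a placeholder for a real element id)
browser.execute_script("document.getElementById('buttonID').click()")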

2. Web Scraping with Python

Python is the perfect language for web scraping, thanks to the many libraries available through the Python package manager pip.
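For example, the two libraries discussed below can be installed as follows:

pip install selenium beautifulsoup4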

Selenium Web Driver

Selenium WebDriver is one of the best web browser drivers available for Python (see the previous paragraph "Driving a Web Browser"). It is part of Selenium, a portable framework for testing web applications.

Example: Loading the LinkedIn.com home page.
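A minimal sketch, assuming Selenium and a locally installed Chrome driver (any browser supported by Selenium works just as well):

from selenium import webdriver

# Start a real Chrome instance controlled by Selenium
# (requires chromedriver to be available on your system)
browser = webdriver.Chrome()

# Load the LinkedIn home page
browser.get("https://www.linkedin.com/")

# ... perform login, navigation and scraping here ...

browser.quit()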

Interacting with the page: how to run Javascript

If you open a LinkedIn profile page, you will realize that in order to scrape the email address it is necessary to click on the "Contact info" link, wait for a popup to load, and then, if the user has provided it, you can see the email address (and so scrape it).

Therefore we must emulate such user interaction through some JavaScript, as sketched below.
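A minimal sketch, assuming the browser object from the previous example and an already loaded profile page; the CSS selectors below are assumptions for illustration only, since LinkedIn changes its markup frequently:

import time

# Click the "Contact info" link via JavaScript (selector is an assumption)
browser.execute_script(
    "document.querySelector('a[href*=\"contact-info\"]').click()"
)

# Give the popup some time to load
time.sleep(2)

# Read the email address from the popup, if the user made it public
# (again, the selector is an assumption and may need to be adapted)
email = browser.execute_script(
    "var a = document.querySelector('.pv-contact-info a[href^=\"mailto:\"]');"
    " return a ? a.innerText : null;"
)
print(email)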

More info from Wikipedia:

Selenium accepts commands (sent in Selenese, or via a Client API) and sends them to a browser. This is implemented through a browser-specific browser driver, which sends commands to a browser and retrieves results.

Selenium WebDriver does not need a special server to execute tests: instead, the WebDriver directly starts a browser instance and controls it.

Beautiful Soup

As you can see in the previous paragraph ("Interacting with the page"), the browser.execute_script instruction can also return the value produced by the JavaScript code (as in the email scraping example). As a result, the whole scraping could be done this way.

However, if you don't need to emulate a user interaction, but just have to walk through the HTML structure, you can use a parsing library (like Beautiful Soup) that does the job for you.

This approach is especially worth exploiting when, for large-scale scraping, you design your code to run on multiple instances.

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

Reference: crummy.com.

Example: Getting the LinkedIn Profile Name
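A minimal sketch, assuming the browser object from the Selenium examples above, an already loaded profile page, and a hypothetical CSS class for the name heading (LinkedIn's markup changes over time, so the class name is an assumption):

from bs4 import BeautifulSoup

# Hand the page source rendered by Selenium over to Beautiful Soup
soup = BeautifulSoup(browser.page_source, "html.parser")

# The class below is an assumption and may need to be updated
name_tag = soup.find("h1", class_="top-card-layout__title")
profile_name = name_tag.get_text(strip=True) if name_tag else None
print(profile_name)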

3. The LinkedIn Profile Scraping Python Script

You can download my LinkedIn scraping tool, written in Python, for free here:

federicohaag/LinkedInScraping

Creates an Excel file containing the personal data and the last job position of all the provided LinkedIn profiles…

github.com

For further information, feel free to drop me a message!

4. Useful Resources

  • Looking for the best book to study Python? Here it is.
  • New to programming? I suggest you have a look here.