How to set up a Python script to scrape LinkedIn Profiles
Scraping LinkedIn profiles is useful for many tasks, especially in public relations and marketing. With Python you can automate much of the process and spend your time on the profiles that have the characteristics you care about.
At the end of the article you can find a working, fully documented Python script to scrape basic information from LinkedIn.
Contents of the Article
- What is Web Scraping
- Web Scraping with Python
- The LinkedIn Profile Scraping Python Script
- Useful Resources
1. What is Web Scraping
Web Scraping is a technique used to extract data from websites.
If you've ever copy and pasted information from a website, you've performed the same function as any web scraper, only on a microscopic, manual scale. https://scrapinghub.com/what-is-web-scraping
Technically, it can be performed in two ways:
- Direct HTTP requests: best choice for static websites.
- Driving a Web Browser: best choice for dynamic websites with asynchronously loaded content or iframes (unfortunately not as uncommon as you may think, especially in legacy systems).
In both cases the final step is parsing the page to extract the content.
Direct HTTP Requests
As you may know, a website is just a rendering of the HTML + CSS code that the web server returns in response to a GET / POST request from your browser. Consequently, a simple script can send HTTP requests automatically and parse the response, scraping the content.
From the command line, run the following instruction:
curl -X GET http://www.wikipedia.com
As a result, you will receive the response of the corresponding web server. Here you have a screenshot from a macOS terminal.
You can send the same request from any programming language, store the response and parse it accordingly (see the following paragraph "Web Scraping with Python - Beautiful Soup").
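The curl request above can be reproduced in Python. The following is a minimal sketch that uses only the standard library module urllib.request; the function name fetch_html, the User-Agent string and the 10-second timeout are illustrative assumptions, not something prescribed by this article.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Send a GET request to `url` and return the response body as text."""
    # Some servers reject requests without a User-Agent header;
    # the value used here is an arbitrary example.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html("http://www.wikipedia.com") should give you, as a Python string, roughly the HTML that a browser would receive; unlike the plain curl command, urlopen follows redirects by default. The returned string can then be handed to a parser.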
Driving a Web Browser
Sometimes websites load (part of) the content asynchronously. This means that the information you want to scrape may not be contained in the first HTTP response, but is loaded only after the page is scrolled (as in the LinkedIn case) or after a button is clicked.
To overcome this barrier, you can use a Web Browser Driver (see the following paragraph "Web Scraping with Python - Selenium Web Driver").
In this way you can, for example, emulate a click on a button, assuming this is useful to the scraping activity.
2. Web Scraping with Python
Python is well suited to web scraping, thanks to the many libraries that can be installed through the Python package manager pip.
Selenium Web Driver
Selenium Web Driver is one of the most popular web browser drivers available for Python (see the previous "Driving a Web Browser" paragraph). It's part of the Selenium framework, which is a portable framework for testing web applications.
Example: Loading the LinkedIn.com home page.
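A minimal sketch of this example follows. It assumes that Selenium is installed (pip install selenium) and that Chrome with a matching driver is available on the machine; the function names load_page and run_demo are hypothetical and chosen only for illustration.

```python
def load_page(driver, url="https://www.linkedin.com"):
    """Navigate an already-created WebDriver to `url` and return the page title."""
    driver.get(url)
    return driver.title


def run_demo():
    """Open Chrome, load the LinkedIn home page, print its title, then close the browser."""
    # Imported here so that load_page can be reused (and tested) on a
    # machine without Selenium or a browser installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    try:
        print(load_page(driver))
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

Call run_demo() to try it. Keeping the navigation logic in load_page, separate from the browser start-up, makes it easy to switch to another browser (for example webdriver.Firefox()) without touching the scraping code.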
If you open a LinkedIn profile page, you will realize that in order to scrape the email address it is necessary to click on the "Contact info" link and wait for a popup to load; then, if the user has provided it, you can see the email address (and so, eventually, scrape it).
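The click-and-wait sequence described above could be sketched with Selenium as follows. This is only an illustration under several assumptions: the driver is already logged in and positioned on a profile page, the link can be found by its visible text "Contact info", and the email appears in the popup as a mailto: link. The CSS selector and both function names are hypothetical, and LinkedIn's real markup may differ and change over time.

```python
def email_from_mailto(href):
    """Turn 'mailto:jane@example.com?subject=Hi' into 'jane@example.com'."""
    if not href or not href.startswith("mailto:"):
        return None
    # Drop the scheme and any '?subject=...' query part.
    return href[len("mailto:"):].split("?", 1)[0] or None


def read_contact_email(driver, timeout=10):
    """Click "Contact info", wait for the popup and return the email, or None."""
    # Imported here so that email_from_mailto works without Selenium installed.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Assumption: the link is identifiable by its visible text.
    driver.find_element(By.PARTIAL_LINK_TEXT, "Contact info").click()
    try:
        # Wait until a mailto: link shows up (hypothetical selector).
        link = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "a[href^='mailto:']"))
        )
    except TimeoutException:
        # The popup did not load in time, or the user provided no email address.
        return None
    return email_from_mailto(link.get_attribute("href"))
```

The explicit wait matters here: the popup content is loaded asynchronously, so reading the page immediately after the click would often find nothing.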
More info from Wikipedia:
Selenium accepts commands (sent in Selenese, or via a Client API) and sends them to a browser. This is implemented through a browser-specific browser driver, which sends commands to a browser and retrieves results.
Selenium WebDriver does not need a special server to execute tests: instead, the WebDriver directly starts a browser instance and controls it.
However, if you don't need to emulate user interaction and just have to go through the HTML structure, you can use a parsing library (like Beautiful Soup) that does the job for you.
This can be worth exploiting if, for large-scale scraping, you design your code to run on multiple instances.
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
Example: Getting the LinkedIn Profile Name
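A minimal sketch of this example follows, assuming Beautiful Soup is installed (pip install beautifulsoup4). The HTML below is a hypothetical, heavily simplified stand-in for a saved profile page, and the assumption that the name sits in the first h1 element is for illustration only; the real LinkedIn markup is more complex and changes over time.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the HTML of a profile page.
SAMPLE_HTML = """
<html>
  <head><title>Jane Doe | LinkedIn</title></head>
  <body>
    <h1 class="profile-name">  Jane Doe </h1>
  </body>
</html>
"""


def extract_profile_name(html):
    """Return the text of the first <h1> element, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1")
    if heading is None:
        return None
    # strip=True removes the whitespace surrounding the name.
    return heading.get_text(strip=True)


print(extract_profile_name(SAMPLE_HTML))  # prints: Jane Doe
```

In a real run, the html argument would be the page source obtained either from a direct HTTP request or from the browser driver (driver.page_source in Selenium).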
3. The LinkedIn Profile Scraping Python Script
You can download my LinkedIn scraping tool in Python for free here:
Creates an Excel file containing the personal data and the last job position of all the provided LinkedIn profiles.
For further information, feel free to drop me a message!
4. Useful Resources
- Looking for the best book to study Python? Here it is.
- New to programming? I suggest you have a look here.