Getting Structured Data from the Internet: Running Web Crawlers/Scrapers on a Big Data Production Scale
Utilize web scraping at scale to quickly get unlimited amounts of free data available on the web into a structured format. This book teaches you to use Python scripts to crawl through websites at scale, scrape data from HTML and JavaScript-enabled pages, and convert it into structured data formats such as CSV, Excel, or JSON, or load it into a SQL database of your choice.

This book goes beyond the basics of web scraping and covers advanced topics such as natural language processing (NLP) and text analytics to extract names of people, places, email addresses, contact details, etc., from a page at production scale using distributed big data techniques on an Amazon Web Services (AWS)-based cloud infrastructure. The book covers developing a robust data processing and ingestion pipeline on the Common Crawl corpus, a petabyte-scale web crawl dataset publicly available on AWS's Registry of Open Data.

Getting Structured Data from the Internet also includes a step-by-step tutorial on deploying your own crawlers using a production web scraping framework (such as Scrapy) and dealing with real-world issues (such as breaking Captcha, proxy IP rotation, and more). Code used in the book is provided to help you understand the concepts in practice and write your own web crawler to power your business ideas.

What You Will Learn
* Understand web scraping, its applications/uses, and how to avoid web scraping altogether by hitting publicly available REST API endpoints to directly get data
* Develop a web scraper and crawler from scratch using the lxml and Beautiful Soup libraries, and learn about scraping from JavaScript-enabled pages using Selenium
* Use AWS-based cloud computing with EC2, S3, Athena, SQS, and SNS to analyze, extract, and store useful insights from crawled pages
* Use the SQL language on PostgreSQL running on Amazon Relational Database Service (RDS) and on SQLite using SQLAlchemy
* Review scikit-learn, Gensim, and spaCy to perform NLP tasks on scraped web pages such as named entity recognition, topic clustering (K-means, agglomerative clustering), topic modeling (LDA, NMF, LSI), topic classification (naive Bayes, gradient boosting classifier), and text similarity (cosine distance-based nearest neighbors)
* Handle web archival file formats and explore Common Crawl open data on AWS
* Illustrate practical applications for web crawl data by building a website similarity tool and a technology profiler similar to builtwith.com
* Write scripts to create a backlinks database on a web scale similar to Ahrefs.com, Moz.com, Majestic.com, etc., for search engine optimization (SEO), competitor research, and determining website domain authority and ranking
* Use web crawl data to build a news sentiment analysis system or alternative financial analysis covering stock market trading signals
* Write a production-ready crawler in Python using Scrapy framework and deal with practical workarounds for Captchas, IP rotation, and more

Who This Book Is For
Primary audience: data analysts and scientists with little to no exposure to real-world data processing challenges. Secondary: experienced software developers doing web-heavy data processing who need a primer. Tertiary: business owners and startup founders who need to know more about implementation to better direct their technical team.


















Jay M. Patel





Getting Structured Data from the Internet


Running Web Crawlers/Scrapers on a Big Data Production Scale


1st ed.





Jay M. Patel, Specrom Analytics, Ahmedabad, India





Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book’s product page, located at www.apress.com/9781484265758. For more detailed information, please visit http://www.apress.com/source-code.

ISBN 978-1-4842-6575-8    e-ISBN 978-1-4842-6576-5

https://doi.org/10.1007/978-1-4842-6576-5

© Jay M. Patel 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Distributed to the book trade worldwide by Springer Science+Business Media New York, 1 New York Plaza, New York, NY 10004. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.



To those who believe “Live as if you were to die tomorrow. Learn as if you were to live forever.”

—Mahatma Gandhi.



Introduction

Web scraping, also called web crawling, is defined as using a software program or code to automate the downloading and parsing of data from the Web.

Web scraping at scale powers many successful tech startups and businesses, and they have figured out how to efficiently parse terabytes of data to extract a few megabytes of useful insights.

Many people try to distinguish web scraping from web crawling based on the scale of the number of pages fetched and indexed, with the latter being used only when it’s done for thousands of web pages. Another point of distinction commonly applied is the level of parsing performed on the web page; web scraping may mean a deeper level of data extraction with more support for JavaScript execution, filling forms, and so on. We will try to stay away from such superficial distinctions and use web scraping and web crawling interchangeably in this book, because our eventual goal is the same: find and extract data in structured format from the Web.

There are no major prerequisites for this book, and the only assumption I have made is that you are proficient in Python 3.x and are somewhat familiar with the SQL language. I suggest that you download and install the Anaconda distribution (www.anaconda.com/products/individual ) with Python version 3.6.x or higher.
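If you want to quickly confirm that your environment meets these assumptions, a minimal check like the following will do:

import sys

# The examples in this book assume Python 3.6 or higher
assert sys.version_info >= (3, 6), "Please use Python 3.6 or higher"
print(sys.version)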

We will take a big picture look in Chapter 1 by exploring how successful businesses around the world and in different domain areas are using web scraping to power their products and services. We’ll also illustrate a third-party data source that provides structured data from Reddit and see how we can apply it to gain useful business insights. We will introduce common web crawl datasets and discuss implementations for some of the web scraping applications, such as creating an email database like Hunter.io in Chapter 4, a technology profiler tool like builtwith.com, and website similarity, backlink, domain authority, and ranking databases like Ahrefs.com, Moz.com, and Alexa.com in Chapters 6 and 7. We will also discuss the steps in building a production-ready news sentiment model for alternative financial analysis in Chapter 7.

You will also find that this book is opinionated, and that’s a good thing! The last thing you want is a plain vanilla book full of code recipes with no background or opinions on which way is preferable. I hope you are reading this book to learn from the collective experience of others and not make the same mistakes I did when we first started out with crawling the Web over 15 years ago.

I spent a lot of formative years of my professional life working on projects funded by government agencies and giant companies, and the mantra was if it’s not built in house, it’s trash. Frequently, this aversion to using third-party libraries and publicly available REST APIs is for good reason from a maintainability and security standpoint. So I get why many companies and new startups prefer to develop everything from scratch, but let me tell you that’s a big mistake. The number one rule taught to me by my startup’s major investor was: pick your battles, because you can’t win them all! He should know, since he was a Vietnam War veteran who ended up having a successful career as a startup investor. Big data is such a huge battlefield, and no one team within a company can hope to ace all the different niches within it except for very few corporations. So based on this philosophy, we will extensively use popular Python libraries such as Gensim, scikit-learn, and spaCy for natural language processing (NLP) in Chapter 4, an object-relational mapper called SQLAlchemy in Chapter 5, and Scrapy in Chapter 8.

I think most businesses should rely on cloud infrastructure for their big data workloads as much as possible for faster iteration and quick identification of cost sinks or bottlenecks. Hence, we will extensively talk about a major cloud computing provider, Amazon Web Services (AWS), in Chapter 3 and go through setting up services like IAM, EC2, S3, SQS, and SNS. In Chapter 5, we will cover Amazon Relational Database Service (RDS)–based PostgreSQL, and in Chapter 7, we will discuss Amazon Athena.

You can switch to on-premises data centers once you have documented cost, traffic, uptime percentage, and other parameters. And no, I am not being paid by cloud providers, and for those readers who know my company’s technology stack, this is no contradiction. I admit that we run our own servers on premises to handle crawl data, and we also have GPU servers on premises to handle the training of our NLP models. But we have made the decision to go with our setup after doing a detailed cost analysis that included many months of data from our cloud server usage, which conclusively told us about potential cost savings.

I admit that there is some conflict of interest here because my company (Specrom Analytics) is active in the web crawling and data analytics space. So, I will try to keep mentions of any of our products to an absolute minimum, and I will also mention two to three competitors with all my product mentions.

Lastly, let me sound a note of caution and say that scraping/crawling on a big data production scale is expensive not only in terms of the developer hours required to develop and manage web crawlers; project managers also frequently underestimate the amount of computing and data resources it takes to get data clean enough to be comparable to the structured data you get from REST API endpoints.

Therefore, I almost always tell people to look hard and wide for REST APIs from official and third-party data API providers to get the data you need before you think about scraping the same from a website.

If comparable data is available through a provider, then you can dedicate resources to evaluating the quality, update frequency, cost, and so on and see if they meet your business needs. Some commercially available datasets seem incredibly expensive until you factor in computing, storage, and man-hours that go into replicating that in house.

At the very least, you should go out and research the market thoroughly and see what’s available off the shelf before you embark on a long web crawling project that can suck time out of your other projects.





Acknowledgments

I would like to thank my parents for sparking my interest in computing from a very early age and encouraging it by getting subscriptions and memberships to rather expensive (for us at the time) computing magazines and even buying a pretty powerful PC in summer 2001 when I was just a high school freshman. It served as an excellent platform to code and experiment with stuff, and it was also the first time I coded a basic web crawler after getting inspired by the ACM Queue’s search engine issue in 2004.

I would like to thank my former colleagues and friends such as Robbie, Caroline, John, Chenyi, and Gerald and the wider federal communities of practice (CoP) members for stimulating conversations that provided the initial spark for writing this book. At the end of a lot of conversations, one of us would make a remark saying “someone should write a book on that!” Well, after a few years of waiting for that someone, I took the plunge, and although it would’ve taken four more books to fit all the content on our collective wishlist, I think this one provides a great start to anyone interested in web crawling and natural language processing at scale.

I would like to thank the Common Crawl Foundation for their invaluable contributions to the web crawling community. Specifically, I want to thank Sebastian Nagel for his help and guidance over the years. I would also like to acknowledge the efforts of everyone at the Internet Archive, and in particular I would like to thank Gordon Mohr for his invaluable contributions on the Gensim listserv.

I am grateful to my employees, contractors, and clients at Specrom Analytics who were very understanding and supportive of this book project in spite of the difficult time we were going through while adapting to the new work routine due to the ongoing Covid-19 pandemic.

This book project would not have come to fruition without the support and guidance of Susan McDermott, Rita Fernando, and Laura Berendson at Apress. I would also like to thank the technical reviewer, Brian Sacash, who helped keep the book laser focused on the key topics.





Table of Contents



Chapter 1: Introduction to Web Scraping

Who uses web scraping?

Marketing and lead generation



Search engines



On-site search and recommendation



Google Ads and other pay-per-click (PPC) keyword research tools



Search engine results page (SERP) scrapers





Search engine optimization (SEO)

Relevance



Trust and authority



Estimating traffic to a site



Vertical search engines for recruitment, real estate, and travel



Brand, competitor, and price monitoring



Social listening, public relations (PR) tools, and media contacts database



Historical news databases



Web technology database



Alternative financial datasets



Miscellaneous uses





Programmatically searching user comments in Reddit



Why is web scraping essential?



How to turn web scraping into full-fledged product



Summary





Chapter 2: Web Scraping in Python Using Beautiful Soup Library

What are web pages all about?



Styling with Cascading Style Sheets (CSS)



Scraping a web page with Beautiful Soup

find() and find_all()



Scrape an ecommerce store site





XPath

Profiling XPath-based lxml





Crawling an entire site

URL normalization



Robots.txt and crawl delay



Status codes and retries



Crawl depth and crawl order



Link importance



Advanced link crawler





Getting things “dynamic” with JavaScript

Variables and data types



Functions



Conditionals and loops



HTML DOM manipulation



AJAX





Scraping JavaScript with Selenium



Scraping the US FDA warning letters database

Scraping from XHR directly





Summary





Chapter 3: Introduction to Cloud Computing and Amazon Web Services (AWS)

What is cloud computing?



List of AWS products



How to interact with AWS



AWS Identity and Access Management (IAM)

Setting up an IAM user



Setting up custom IAM policy



Setting up a new IAM role





Amazon Simple Storage Service (S3)

Creating a bucket



Accessing S3 through SDKs





Cloud storage browser



Amazon EC2

EC2 server types



Spinning your first EC2 server



Communicating with your EC2 server using SSH



Transferring files using SFTP





Amazon Simple Notification Service (SNS) and Simple Queue Service (SQS)



Scraping the US FDA warning letters database on cloud



Summary





Chapter 4: Natural Language Processing (NLP) and Text Analytics

Regular expressions



Extract email addresses using regex

Re2 regex engine





Named entity recognition (NER)

Training SpaCy NER





Exploratory data analytics for NLP

Tokenization



Advanced tokenization, stemming, and lemmatization



Punctuation removal



Ngrams



Stop word removal





Topic modeling

Latent Dirichlet allocation (LDA)



Non-negative matrix factorization (NMF)



Latent semantic indexing (LSI)





Text clustering



Text classification

Packaging text classification models



Performance decay of text classifiers





Summary





Chapter 5: Relational Databases and SQL Language

Why do we need a relational database?



What is a relational database?



Data definition language (DDL)

Sample database schema for web scraping





SQLite



DBeaver



PostgreSQL

Setting up AWS RDS PostgreSQL





SQLAlchemy



Data manipulation language (DML) and Data Query Language (DQL)

Data insertion in SQLite



Inserting other tables





Full text searching in SQLite



Data insertion in PostgreSQL



Full text searching in PostgreSQL



Why do NoSQL databases exist?



Summary





Chapter 6: Introduction to Common Crawl Datasets

WARC file format



Common crawl index



WET file format



Website similarity



WAT file format



Web technology profiler



Backlinks database



Summary





Chapter 7: Web Crawl Processing on Big Data Scale

Domain ranking and authority using Amazon Athena



Batch querying for domain ranking and authority



Processing parquet files for a common crawl index



Parsing web pages at scale



Microdata, microformat, JSON-LD, and RDFa



Parsing news articles using newspaper3k



Revisiting sentiment analysis



Scraping media outlets and journalist data



Introduction to distributed computing



Rolling your own search engine



Summary





Chapter 8: Advanced Web Crawlers

Scrapy



Advanced crawling strategies



Ethics and legality of web scraping



Proxy IP and user-agent rotation



Cloudflare



CAPTCHA solving services



Summary





Index





About the Author



Jay M. Patel



is a software developer with over ten years of experience in data mining, web crawling/scraping, machine learning, and natural language processing (NLP) projects. He is a cofounder and principal data scientist of Specrom Analytics (www.specrom.com ) providing content, email, social marketing, and social listening products and services using web crawling/scraping and advanced text mining.

Jay worked at the US Environmental Protection Agency (EPA) for five years where he designed workflows to crawl and extract useful insights from hundreds of thousands of documents that were parts of regulatory filings from companies. He also led one of the first research teams within the agency to use Apache Spark–based workflows for chemistry and bioinformatics applications such as chemical similarities and quantitative structure activity relationships. He developed recurrent neural networks and more advanced LSTM models in TensorFlow for chemical SMILES generation.

Jay graduated with a bachelor’s degree in engineering from the Institute of Chemical Technology, University of Mumbai, India, and a master of science degree from the University of Georgia, USA.

Jay serves as an editor at a Medium publication called Web Data Extraction (https://medium.com/web-data-extraction ) and also blogs about personal projects, open source packages, and experiences as a startup founder on his personal site (http://jaympatel.com ).





About the Technical Reviewer



Brian Sacash is a data scientist and Python developer in the Washington, DC area. He helps various organizations discover the best ways to extract value from data. His interests are in the areas of natural language processing, machine learning, big data, and statistical methods. Brian holds a master of science in quantitative analysis from the University of Cincinnati and a bachelor of science in physics from Ohio Northern University.





© Jay M. Patel 2020

J. M. Patel, Getting Structured Data from the Internet, https://doi.org/10.1007/978-1-4842-6576-5_1





1. Introduction to Web Scraping




Jay M. Patel1

(1)Specrom Analytics, Ahmedabad, India





In this chapter, you will learn about the common use cases for web scraping. The overall goal of this book is to take raw web crawls and transform them into structured data which can be used for providing actionable insights. We will demonstrate an application of such structured data from a REST API endpoint by performing sentiment analysis on Reddit comments. Lastly, we will talk about the different steps of the web scraping pipeline and how we are going to explore them in this book.





Who uses web scraping?


Let’s go through examples and use cases for web scraping in different industry domains. This is by no means an exhaustive listing, but I have made an effort to provide examples ranging from those that crawl a handful of websites to those that require crawling a major portion of the visible Internet (web-sized crawls).





Marketing and lead generation


Companies like Hunter.io, Voila Norbert, and FindThatLead run crawlers that index a large portion of the visible Internet, and they extract email addresses, person names, and so on to populate an email marketing and lead generation database. They provide an email address lookup service where a user can enter a domain address and retrieve the contacts listed in their database for a lookup fee of $0.0098–$0.049 per contact. As an example, let us enter my personal website’s address (jaympatel.com) and see the emails it found on that domain address (see Figure 1-1).

Figure 1-1 Hunter.io screenshot





Hunter.io also provides an email finder service where a user can enter the first and last name of a person of interest at a particular domain address, and it can predict the email address for them based on pattern matching (see Figure 1-2).

Figure 1-2 Hunter.io screenshot





Search engines


General-purpose search engines like Google, Bing, and so on run large-scale web scrapers called web crawlers which go out and grab billions of web pages and index and rank them according to various natural language processing and web graph algorithms, which not only power their core search functionality but also products like Google Ads, Google Translate, and so on. I know you may be thinking that you have no plans to start another Google, and that’s probably a wise decision, but you should be interested in ranking your business’s website higher on Google. This need to rank high enough on search engines has spawned a lot of web scraping/crawling businesses, which I will discuss in the next couple of sections.





On-site search and recommendation


Many websites use third-party providers to power the search box on their website. This is called “on-site search” in our industry, and some of the SaaS providers are Algolia, Swiftype, and Specrom.

The idea behind all of the on-site searching is simple; they run web crawlers which only target one site, and using algorithms inspired by search engines, they return search engine results pages based on search queries.

Usually, there is also a JavaScript plugin so that the users can get autocomplete for their entered queries. Pricing is usually based on the number of queries sent as well as the size of the website with a range of $20 to as high as $70 a month for a typical site.

Many websites and apps also perform on-site searching in house, and the typical technology stacks are based on Elasticsearch, Apache Solr, or Amazon CloudSearch.

A slightly different product is the content recommendation where the same crawled information is used to power a widget which shows the most similar content to the one on the current page.





Google Ads and other pay-per-click (PPC) keyword research tools


Google Ads is an online advertising platform which predominantly sells ads that are frequently known in the digital marketing field as pay-per-click (PPC) where the advertiser pays for ads based on the number of clicks received on the ads, rather than on the number of times a particular ad is shown, which is known as impressions.

Google, like most PPC advertising platforms, makes money every time a user clicks on one of their ads. Therefore, it’s in the best interest of Google to maximize the ratio of clicks per impressions or click-through rate (CTR).

However, businesses make money every time one of those clicked users takes an action such as converting into a lead by filling out a form, buying products from your ecommerce store, or personally visiting your brick-and-mortar store or restaurant. This is known as a “conversion.” A conversion value is the amount of revenue your business earns from a given conversion.

The real metric advertisers care about is the “return on ad spend” or ROAS, which can be defined as the total conversion value divided by your advertising costs. Google makes money based on the number of clicks or impressions, but an advertiser makes money based on conversions. Therefore, it’s in your best interest to write ads that don’t merely have a high click-through rate but rather a high conversion rate and a high ROAS.
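To make the arithmetic concrete, here is a toy calculation with purely hypothetical campaign numbers (the figures below are for illustration only):

# Hypothetical ad campaign numbers for illustration
impressions = 10_000
clicks = 200
cost_per_click = 2.50                    # USD
conversions = 10
revenue_per_conversion = 120.0           # conversion value in USD

ad_spend = clicks * cost_per_click                           # 500.0
ctr = clicks / impressions                                   # 0.02, i.e., a 2% click-through rate
roas = (conversions * revenue_per_conversion) / ad_spend     # 1200 / 500 = 2.4

print(f"CTR: {ctr:.1%}, ROAS: {roas:.2f}")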

ROAS is completely dependent on keywords, which can be simply defined as words or phrases entered in the search bar of a search engine like Google which triggers your ads. Keywords, or a search query as it is commonly known, will result in a results page consisting of Google Ads, followed by organic results. If we “Google” car insurance, we will see that the top two entries on the results page are Google Ads (see Figure 1-3).

Figure 1-3 Google Ads screenshot. Google and the Google logo are registered trademarks of Google LLC, used with permission





If your keywords are too broad, you’ll waste a bunch of money on irrelevant clicks. On the other hand, you can block unnecessary user clicks by creating a negative keyword list that excludes your ad being shown when a certain keyword is used as a search query.

This may sound intuitive, but the cost of running an ad on a given keyword on a cost-per-click (CPC) basis is directly proportional to what other advertisers are bidding on that keyword. Generally speaking, for transactional keywords, the CPC is directly linked to how much search traffic the keyword generates, which in turn drives up its value. If you take an example of transactional keywords for insurance such as “car insurance,” the high traffic and the buy intent make its CPC one of the highest in the industry at over $50 per click. There are certain keyword queries made of phrases with two or more words, known as long tail keywords, which may actually see lower search traffic but are pretty competitive, and the simple reason for that is that longer keywords with prepositions sometimes capture buyer intent better than one- or two-word search queries.

To accurately calculate ROAS, you need a keyword research tool to get accurate data on (1) what others are bidding in your geographical area of interest on a particular keyword, (2) the search volume associated with a particular keyword, (3) keyword suggestions so that you can find additional long tail keywords, and (4) a negative keyword list of words that, when they appear in a search query, should not trigger your ad. As an example, if someone types “free car insurance,” that is a signal that they may not buy your car insurance product, and it would be insane to spend $50 on such a click. Hence, you can choose “free” as a negative keyword, and the ad won’t be shown to anyone who puts “free” in their search query.

Google’s official keyword research tool, called Keyword Planner, included all of the data I listed here up until a few years ago when they decided to change tactics and stopped showing exact search data in favor of insanely broad ranges like 10K–100K. You can get more accurate data if you spend more money on Google Ads; in fact, they don’t show any actionable data in the Keyword Planner for new accounts who haven’t spent anything on running ad campaigns.

This led to more and more users relying on third-party keyword research providers such as Ahrefs’s Keywords Explorer (https://ahrefs.com/keywords-explorer), Ubersuggest (https://neilpatel.com/ubersuggest/), and Keywordtool.io (https://keywordtool.io/) that provide in-depth keyword research metrics. Not all of them are upfront about their data sourcing methodologies, but an open secret in the industry is that it’s coming from extensively scraping data from the official Keyword Planner and supplementing it with clickstream and search query data from a sample population across the world. These datasets are not cheap, with pricing going as high as $300/month based on how many keywords you search. However, this is still worth the price due to unique challenges in scraping Google Keyword Planner and methodological challenges of combining it in such a way to get an accurate search volume snapshot.





Search engine results page (SERP) scrapers


Many businesses want to check if their Google Ads are being correctly shown in a specific geographical area. Some others want SERP rankings for not only their page but their competitor’s pages in different geographical areas. Both of these use cases can be easily served by an API service which takes as an input a JSON with a search engine query and geographical area and returns a SERP page as a JSON. There are many providers such as SerpApi, Zenserp, serpstack, and so on, and pricing is around $30 for 5000 searches. From a technical standpoint, this is nothing but adding a proxy IP address, with CAPTCHA solving if required, to a traditional web scraping stack.





Search engine optimization (SEO)


This is a group of techniques whose sole aim is to improve organic rankings on the search engine results pages (SERPs).

There are dozens of books on SEO and even more blog posts, all describing how to improve your SERP ranking; we’ll restrict our discussions on SEO here to only those factors which directly need web scraping.

Each search engine uses their own proprietary algorithm to determine rankings, but essentially the main factors are relevance, trust, and authority. Let us go through them in greater detail.





Relevance


These are a group of factors that measure how relevant a particular page is to a given search query. You can influence the ranking for a set of keywords by including them on your page and within meta tags on your page.

Search engines rely on HTML tags called “meta” to enable sites such as Google, Facebook, and Twitter to easily find certain information not visible to normal web users. Webmasters are not required to insert these tags at all; however, doing so will not only help users on search engines and social media find information but will also increase your search rankings.

You can see these tags by right-clicking any page in your browser and clicking “View source.” As an example, let us get the source from Quandl.com; you may not yet be familiar with this website, but the information in the meta tags (meta property="og:description" and meta name="twitter:description") tells you that it is a website for datasets in the financial domain (see Figure 1-4).

Figure 1-4 Meta tags





It’s pretty easy to create a crawler to scrape your own website pages and see how effective your on-page optimization is so that search engines can “find” all the information and index it on their servers. Alternately, it’s also a good idea to scrape pages of your competitors and see what kind of text they have put in their meta tags. There are countless third-party providers offering a “freemium” audit report on your on-page optimization such as https://seositecheckup.com, https://sitechecker.pro, and www.woorank.com/.
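A minimal sketch of such a check, using the requests and Beautiful Soup libraries covered in Chapter 2 to pull the description-related meta tags from a single page (the URL here is just an example):

import requests
from bs4 import BeautifulSoup

# Print description-related meta tags for one page
html = requests.get("https://www.quandl.com").text
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all("meta"):
    key = tag.get("name") or tag.get("property") or ""
    if "description" in key:
        print(key, "->", tag.get("content"))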





Trust and authority


Obtaining a high relevance score for a given search query is important, but it is not the only factor determining your SERP rankings. The other factor in determining the quality of your site is how many other high-quality pages link to your site’s page (backlinks). The classic algorithm used at Google is called PageRank, and now even though there are a lot of other factors that go into determining SERP rankings, one of the best ways to rank higher is to get backlinks from other high-quality pages; you will hear a lot of SEO firms call this the “link juice,” which in simple terms means the benefit passed on to a site by a hyperlink.

In the early days of SEO, people used to try “black hat” techniques of manipulating these rankings by leaving a lot of spam links to their website in comment boxes, forums, and other user-generated content on high-quality websites. This rampant gaming of the system was mitigated by something known as a “nofollow” backlink, which basically meant that a webmaster could mark certain outgoing links as “nofollow” and then no link juice would pass from the high-quality site to yours. Nowadays, all outgoing hyperlinks on popular user-generated content sites like Wikipedia are marked with “nofollow,” and thankfully this has stopped the spam deluge of the 2000s. We show an example in Figure 1-5 of an external nofollow hyperlink on the Wikipedia page on PageRank; don’t worry about all the HTML tags, just focus on the <a rel="nofollow" part for now.

Figure 1-5 Nofollow HTML links
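As a small illustration, nofollow links on a page can also be detected programmatically; the sketch below uses requests and Beautiful Soup (covered in Chapter 2), with the Wikipedia URL serving only as an example:

import requests
from bs4 import BeautifulSoup

# List outgoing links marked rel="nofollow" on a single page
html = requests.get("https://en.wikipedia.org/wiki/PageRank").text
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all("a", href=True):
    rel = a.get("rel") or []   # rel is a multi-valued attribute, returned as a list
    if "nofollow" in rel:
        print(a["href"])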





Building backlinks is a constant process because if you aren’t ahead of your competitors, you can start losing your SERP ranking. Alternately, if you know your competitor’s site’s backlinks, then you can target those websites by writing compelling content and see if you can “steal” some of the backlinks to boost your SERP rankings. Indeed, all of the strategies I mention here are followed by top SEO agencies every day for their clients.

Not all backlinks are gold. If your site gets a disproportionate amount of backlinks from low-quality sites or spam farms (or link farms as they are also known), your site will also be considered “spammy,” and search engines will penalize you by dropping your ranking on SERPs. There are some black hat SEOs out there that rapidly take down rankings of their competitors’ sites by using this strategy. Thankfully, you can mitigate the damage if you identify this in time and disavow those backlinks through Google Search Console.

Until now, I think I have made the case for why it’s useful to know your site’s backlinks and why people will be willing to pay if you can give them a database where they can simply enter either their own site’s URL or a competitor’s and get all the backlinks.

Unfortunately, the only way to get all the backlinks is by crawling large portions of the Internet, just like search engines do, and that’s cost prohibitive for most businesses or SEO agencies to do themselves. However, there are a handful of companies such as Ahrefs and Moz that operate in this area. The database size for Ahrefs is about 10 PB (= 10,000 TB) according to their information page (https://ahrefs.com/big-data); the storage cost alone for this on Amazon Web Services (AWS) S3 would come out to over $200,000/month, so it’s no surprise that subscribing to this database is pricey, with the cheapest licenses starting at hundreds of dollars a month.
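A quick back-of-the-envelope check of that figure, assuming roughly $0.023 per GB-month for S3 Standard storage (actual AWS pricing varies by region and storage tier):

# Rough monthly S3 storage cost for a 10 PB index
petabytes = 10
gigabytes = petabytes * 1_000_000        # 1 PB is roughly 1,000,000 GB
price_per_gb_month = 0.023               # assumed S3 Standard rate in USD
print(gigabytes * price_per_gb_month)    # ~230,000 USD per month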

There is a free trial to the backlinks database which can be accessed here (https://ahrefs.com/backlink-checker); let us run an analysis on apress.com.

We see that Apress has over 1,500,000 pages linking back to it from about 9500 domains, and the majority of these backlinks are “dofollow” links that pass on the link juice to Apress. The other metric of interest is the domain rating (DR), which normalizes a given website’s backlink performance on a 1–100 scale; the higher the DR score, the more “link juice” passed from the target site with each backlink. If you look at Figure 1-6, the top backlink is from www.oracle.com with its DR being 92. This indicates that the page is of the highest quality, and getting such a top backlink helped Apress’s own DR immensely, which drove traffic to its pages and increased its SERP rankings.

Figure 1-6 Ahrefs screenshot





Estimating traffic to a site


Every website owner can install analytics tools such as Google Analytics and find out what kind of traffic their site gets, but you can also estimate traffic by getting a domain ranking based on backlinks and performing some clever algorithmic tricks. This is indeed what Alexa does, and apart from offering backlink and keyword research ideas, they also give pretty accurate site traffic estimates for almost all websites. Their service is pretty pricey too, with individual licenses starting at $149/month, but the underlying value of their data makes this price tag reasonable for a lot of folks. Let us query Alexa for apress.com and see what kind of information it has collected for it (see Figure 1-7).

Figure 1-7 Alexa screenshot





Their web-crawled database also provides a list of similar sites by audience overlap which seems pretty accurate since it mentions manning.com (another tech publisher) with a strong overlap score (see Figure 1-8).

Figure 1-8 Alexa screenshot





It also provides data on the number of backlinks from different domain names and the percentage of traffic received via search engines. One thing to note is that the number of backlinking domains reported by Alexa is 1600 (see Figure 1-9), whereas the Ahrefs database mentioned about 9000. Such discrepancies are common among different providers, and that just reflects the completeness of the web crawls each of these companies is undertaking. If you have a paid subscription to them, then you can get the entire list and check for omissions yourself.

Figure 1-9 Alexa screenshot showing the number of backlinks





Vertical search engines for recruitment, real estate, and travel


Websites such as indeed.com, Expedia, and Kayak all run web scrapers/crawlers to gather data focusing on a specific segment of online content, which they process further to extract more relevant information, such as the company name, city, state, and job title in the case of indeed.com, that can be used for filtering through the search results. The same is true of all search engines where web scraping is at the core of their product, and the only differentiation between them is the segment they operate in and the algorithms they use to process the HTML content and extract the fields that power the search filters.





Brand, competitor, and price monitoring


Web scraping is used by companies to monitor prices of various products on ecommerce sites as well as customer reviews, social media posts, and news articles for not just their own brands but also for their competitors. This data helps companies understand how effective their current marketing funnel has been and also lets them get ahead of any negative reviews before they cause a noticeable impact on sales. There are far too many examples in this category, but Jungle Scout, AMZAlert, AMZFinder, camelcamelcamel, and Keepa all serve a segment of this market.





Social listening, public relations (PR) tools, and media contacts database


Businesses are very interested in what their existing and potential customers are saying about them on social media websites such as Twitter, Facebook, and Reddit as well as personal blogs and niche web forums for specialized products. This data helps businesses understand how effective their current marketing funnel has been and also lets them get ahead of any negative reviews before they cause a noticeable impact on sales. Small businesses can usually get away with manually searching through these sites; however, that becomes pretty difficult for businesses with thousands of products on ecommerce sites. In such cases, they use professional tools such as Mention, Hootsuite, and Specrom, which can allow them to do bulk monitoring. Almost all of these get some fraction of data through web crawling.

In a slightly different use case, businesses also want to guide their PR efforts by querying for contact details for a small number of relevant journalists and influencers who have a good following and readership in a particular niche. The raw database remains the same as previously discussed, but in this case, the content is segmented by topics such as apparel, fashion accessories, electronics, restaurants, and so on, and the results are combined with a contacts database. A user should be able to query something like “find email addresses and phone numbers for the top ten journalists/influencers active in the food, beverage, and restaurant market in the Pittsburgh, PA area.” There are too many products out there, but some of them include Muck Rack, Specrom, Meltwater, and Cision.





Historical news databases


There is a huge demand out there for searching historical news articles by keyword and returning news titles, content body, author names, and so on in bulk to be used for competitor, regulatory, and brand monitoring. Google News allows a user to do it to some extent, but it still doesn’t quite meet the needs of this market. Aylien, Specrom Analytics, and Bing News all provide an API to programmatically access news databases, which index 10,000–30,000 sources in all major languages in near real time and archives going back at least five or more years. For some use cases, consumers want these APIs coupled to an alert system where they get automatically notified when a certain keyword is found in the news, and in those cases, these products do cross over to social listening tools described earlier.





Web technology database


Businesses want to know about all the individual tools, plugins, and software libraries which are powering individual websites. Of particular interest is knowing about what percentage of major sites run a particular plugin and if that number is stable, increasing, or decreasing.

Once you know this, there are many ways to benefit from it. For example, if you are selling a web plugin, then you can identify your competitors, their market penetration, and use their customers as potential leads for your business.

All of the data I mentioned here can be gathered by crawling millions of websites and aggregating the information found in page headers and responses by plugin type, or by displaying all plugins and tools used by a certain website. Examples include BuiltWith and SimilarTech, and basic product offerings start at around $290/month with prices going as high as a few thousand a month for searching unlimited websites/plugins.





Alternative financial datasets


Any company-specific datasets published by third-party providers consisting of data compiled and curated from nontraditional financial market sources such as social/sentiment data and social listening, web scraping, satellite imagery, geolocation to measure foot traffic, credit card transactions, online browsing data, and so on can be defined as alternative financial datasets.

These datasets are mainly used by quantitative traders or algorithmic traders who can be simply defined as traders engaged in buying/selling of securities on stock exchanges solely on the basis of computer algorithms. Now these so-called algorithms or trading strategies are rule based and coded by traders themselves, but the actual buy/sell triggers happen automatically once the strategy is put into production.

A handful of hedge funds started out with quantitative trading over 10 years ago, consuming alternative datasets that provided trading signals or triggers to power their trading strategies. Now, however, almost all institutional investors in the stock market, from small family offices to large discretionary funds, use alternative datasets to some extent.

A large majority of alternative datasets are created by applying NLP algorithms for sentiments, text classification, text summarization, named entity recognition, and so on to the web crawl data described in earlier sections, and therefore this is becoming a major revenue stream for most big data and data analytics firms, including Specrom Analytics.

You can explore all kinds of available alternative datasets on marketplaces such as Quandl, which has data samples for all the popular datasets such as web news sentiments (www.quandl.com/databases/NS1) for more than 40,000 stocks.





Miscellaneous uses


There are a lot of use cases that are hard to define and put into one of these distinct categories. In those cases, there are businesses that offer data on demand, with the ability to convert any website data into an API. Examples include Octoparse, ParseHub, Webhose.io, Diffbot, Apify, Import.io, Dashblock, and so on. There are other use cases such as security research, identity theft monitoring and protection, plagiarism detection, and so on—all of which rely on web-sized crawls.





Programmatically searching user comments in Reddit


Let’s work through an example to search through all the comments in a subreddit by accessing a free third-party database called pushshift.io and performing sentiment analysis on them using an algorithm-as-a-service platform called Algorithmia.

Aggregating sentiments from social media, news, forums, and so on represents a very common use case in alternative financial datasets, and here we are trying to just get a taste for it by doing it on one major company.

You will also learn how to communicate with web servers using the Hypertext Transfer Protocol (HTTP) methods such as GET and POST requests with authentication, which will be useful throughout this book, as there can be no web scraping/crawling without fetching the web page.

Reddit provides an official API, but there are a lot of limitations to its use compared to pushshift which has compiled the same data and made it available either through an API (https://github.com/pushshift/api) or through raw data dumps (https://files.pushshift.io/reddit/).

We will use the Python requests package to make GET calls in Python 3.x; it’s much more intuitive than the urllib in the Python standard library.

The request query is pretty simple to understand. We are searching for the keyword “Exxon” in the top stock market–related subreddit called “investing” which has about one million subscribers (see Listing 1-1). We are restricting ourselves to a maximum of 100 results and searching between August 20, 2019, and December 10, 2019, so that the request doesn’t get timed out. Users are encouraged to go through the pushshift.io documentation (https://github.com/pushshift/api) and generate their own query as a learning exercise. The time used in the query is epoch time, which has to be converted to a date (or vice versa) by using an online calculator (www.epochconverter.com/) or pd.to_datetime().

import requests

import json



test_url = 'https://api.pushshift.io/reddit/search/comment/?q=Exxon&subreddit=investing&size=100&after=1566302399&before=1575979199&sort=asc&metadata=True'



r = requests.get(url = test_url)



print("Status Code: ", r.status_code)

print("*"*20)

print(r.headers)



html_response = r.text



# Output



Status Code: 200

********************

{'Date': 'Wed, 15 Apr 2020 11:47:37 GMT', 'Content-Type': 'application/json; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Set-Cookie': '__cfduid=db18690163f5c909d973f1a67bbdc79721586951257; expires=Fri, 15-May-20 11:47:37 GMT; path=/; domain=.pushshift.io; HttpOnly; SameSite=Lax', 'cache-control': 'public, max-age=1, s-maxage=1', 'Access-Control-Allow-Origin': '*', 'CF-Cache-Status': 'EXPIRED', 'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare', 'CF-RAY': '58456ecf7ee0e3ce-ATL', 'Content-Encoding': 'gzip', 'cf-request-id': '021f4395ae0000e3ce5d928200000001'}



Listing 1-1 Calling the pushshift.io API
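The after and before epoch bounds in the query above can also be generated with pandas instead of an online calculator; a minimal sketch (the exact values depend on the time of day and timezone you pick):

import pandas as pd

# Convert calendar dates to epoch seconds for the 'after'/'before' parameters
after = int(pd.Timestamp("2019-08-20", tz="UTC").timestamp())
before = int(pd.Timestamp("2019-12-10", tz="UTC").timestamp())
print(after, before)

# Convert an epoch value back to a human-readable date
print(pd.to_datetime(1566302399, unit="s"))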





We see that the response code was 200, meaning that the request has succeeded and the response content-type is application/json. We’ll use the JSON package to read and save the raw response (see Listing 1-2).

with open("raw_pushshift_response.json", "w") as outfile:
    outfile.write(html_response)



json_dict = json.loads(html_response)

json_dict.keys()



json_dict["metadata"]



# output

{'after': 1566302399,

'agg_size': 100,

'api_version': '3.0',

'before': 1575979199,

'es_query': {'query': {'bool': {'filter': {'bool': {'must': [{'terms': {'subreddit': ['investing']}},

{'range': {'created_utc': {'gt': 1566302399}}},

{'range': {'created_utc': {'lt': 1575979199}}},

{'simple_query_string': {'default_operator': 'and',

'fields': ['body'],

'query': 'Exxon'}}],

'should': []}},

'must_not': []}},

'size': 100,

'sort': {'created_utc': 'asc'}},

'execution_time_milliseconds': 31.02,

'index': 'rc_delta2',

'metadata': 'True',

'q': 'Exxon',

'ranges': [{'range': {'created_utc': {'gt': 1566302399}}},

{'range': {'created_utc': {'lt': 1575979199}}}],

'results_returned': 71,

'shards': {'failed': 0, 'skipped': 0, 'successful': 4, 'total': 4},

'size': 100,

'sort': 'asc',

'sort_type': 'created_utc',

'subreddit': ['investing'],

'timed_out': False,

'total_results': 71}



Listing 1-2 Parsing a JSON response





We see that we only got back 71 results out of a maximum request of 100.

Let us explore the first element in our data list to see what kind of data response we are getting back (see Listing 1-3).

json_dict["data"][0]



Output:

{'all_awardings': [],

'author': 'InquisitorCOC',

'author_flair_background_color': None,

'author_flair_css_class': None,

'author_flair_richtext': [],

'author_flair_template_id': None,

'author_flair_text': None,

'author_flair_text_color': None,

'author_flair_type': 'text',

'author_fullname': 't2_mesjk',

'author_patreon_flair': False,

'body': 'Individual stocks:\n\nBoeing and Lockheed: initially languished until 1974, then really took off and gained almost 100x by the end of the decade.\n\nHewlett-Packard: volatile, but generally a consistent winner throughout the decade, gained 15x.\n\nIntel: crashed >70% during the worst of 1974, but bounced back very quickly and went on to be a multi bagger.\n\nOil stocks had done of course very well, Halliburton and Schlumberger were the low risk, low volatility, huge gain stocks of the decade. Exxon on the other hand had performed nowhere as well as these two.\n\nWashington Post: fought Nixon head on in 1973, stocks dropped big. More union troubles in 1975, but took off afterwards. Gained between 70x and 100x until 1982.\n\nOne cannot mention WaPo without mentioning Berkshire Hathaway. Buffett bought 10% in 1973, got himself elected to its board, and had been advising Cathy Graham. However, BRK was a very obscure and thinly traded stock back then, investors would have a hard time noticing it. Buffett himself said the annual meeting in 1978 all fit in one small cafeteria.\n\n\n\nOther asset classes:\n\nCommodities in general had performed exceedingly well. Gold went from 35 in 1970 all the way to 800 in 1980.\n\nReal Estate had done well. Those who had the foresight to buy in SF Bay Area did much much better than buying gold in 1970.',

'created_utc': 1566311377,

'gildings': {},

'id': 'exhpyj3',

'is_submitter': False,

'link_id': 't3_csylne',

'locked': False,

'no_follow': True,

'parent_id': 't3_csylne',

'permalink': '/r/investing/comments/csylne/what_were_the_best_investments_of_the_stagflation/exhpyj3/',

'retrieved_on': 1566311379,

'score': 1,

'send_replies': True,

'stickied': False,

'subreddit': 'investing',

'subreddit_id': 't5_2qhhq',

'total_awards_received': 0}



Listing 1-3 Viewing JSON data





You will learn more about applying NLP algorithms in Chapter 4, but for now let’s just use an algorithm as a service platform called Algorithmia where you can access a large variety of algorithms based on machine learning and AI on text analysis, image manipulation, and so on by simply sending your data over a POST call on their REST API.

This service provides 10K free credits to everyone who signs up, and an additional 5K credits per month. This should be more than sufficient for running the example in Listing 1-4, since it will consume no more than 2–3 credits per request. Using more than the allotted free credits will incur a charge based on the request amount.

Once you register with Algorithmia, please go to the API keys section in the user dashboard and generate new API keys which you will use in this example.

Usually, you need to do some text preprocessing such as getting rid of new lines, special characters, and so on to get accurate text sentiments; but in this case, let’s just take the text body and package it into a JSON format required by the sentiment analysis API (https://algorithmia.com/algorithms/specrom/GetSentimentsScorefromText).

The response is an id and a sentiment value from 0 to 1, where 0 and 1 mean very negative and very positive sentiments, respectively. A value near 0.5 indicates a neutral sentiment.

date_list = []
comment_list = []
rows_list = []

for i in range(len(json_dict["data"])):
    temp_dict = {}
    temp_dict["id"] = i
    temp_dict["text"] = json_dict["data"][i]['body']
    rows_list.append(temp_dict)
    date_list.append(json_dict["data"][i]['created_utc'])
    comment_list.append(json_dict["data"][i]['body'])

sample_dict = {}
sample_dict["documents"] = rows_list
payload = json.dumps(sample_dict)

with open("sentiments_payload.json", "w") as outfile:
    outfile.write(payload)



Listing 1-4 Creating request JSON





Creating an HTTP POST request needs a header parameter that sends over the authorization key and content type and a payload, which is a dictionary converted to JSON (see Listing 1-5).

url = 'https://api.algorithmia.com/v1/algo/specrom/GetSentimentsScorefromText/0.2.0?timeout=300'

headers = {



'Authorization': YOUR_ALGORITHMIA_KEY,

'content-type': "application/json",

'accept': "application/json"

}

response = requests.request("POST", url, data=payload, headers=headers)



print("Status Code: ", r.status_code)

print("*"*20)

print(r.headers)

# Output:

Status Code: 200

********************

{'Content-Encoding': 'gzip', 'Content-Type': 'application/json; charset=utf-8', 'Date': 'Mon, 13 Apr 2020 11:08:58 GMT', 'Strict-Transport-Security': 'max-age=86400; includeSubDomains', 'Vary': 'Accept-Encoding', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'DENY', 'Content-Length': '682', 'Connection': 'keep-alive'}



Listing 1-5 Making a POST request





Let us load the response in a pandas dataframe and look at the first row to get an idea of the output (see Listing 1-6).

import numpy as np

import pandas as pd



df_sent = pd.DataFrame(json.loads(response.text)["result"]["documents"])

df_sent.head(1)

#Output



Listing 1-6 Viewing sentiments data





   id  sentiments_score
0   0          0.523785





We should convert this score into distinct labels: positive, negative, and neutral (see Listing 1-7).

def get_sentiments(score):
    if score > 0.6:
        return 'positive'
    elif score < 0.4:
        return 'negative'
    else:
        return 'neutral'

df_sent["sentiments"] = df_sent["sentiments_score"].apply(get_sentiments)
df_sent.head(1)



#Output



Listing 1-7 Converting the sentiments score to labels





   id  sentiments_score sentiments
0   0          0.523785    neutral





Finally, let us visualize the sentiments by plotting a bar plot as shown in Listing 1-8 and then displayed in Figure 1-10.

import matplotlib.pyplot as plt

import seaborn as sns



sns.set()

%matplotlib inline



g = sns.countplot(df_sent["sentiments"])

loc, labels = plt.xticks()

g.set_xticklabels(labels, rotation=90)

g.set_title('Subreddit comments sentiment analysis')



g.set_ylabel("Count")

g.set_xlabel("Sentiments")



Listing 1-8 Plotting sentiments as a bar plot





Figure 1-10 Bar plot of sentiment analysis on subreddit comments





So it seems like the comments are overwhelmingly neutral, with some positive comments and only a couple of negative comments.

Let us switch gears and see if these sentiments have any correlation with Exxon's stock price. We will get the price data using a REST API from www.alphavantage.co; it is free to use, but you will have to register to get a key from the Alpha Vantage user dashboard (see Listing 1-9).

# Code block 1.2

# getting data from alphavantage



import requests

import json



test_url = 'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED&symbol=XOM&outputsize=full&apikey=' + API_KEY + '&datatype=csv'



r = requests.get(url = test_url)

print("Status Code: ", r.status_code)

print("*"*20)

print(r.headers)

html_response = r.text

with open("exxon_stock.csv", "w") as outfile:
    outfile.write(html_response)

# Output

Status Code: 200

********************

{'Connection': 'keep-alive', 'Server': 'gunicorn/19.7.0', 'Date': 'Thu, 16 Apr 2020 04:25:18 GMT', 'Transfer-Encoding': 'chunked', 'Vary': 'Cookie', 'X-Frame-Options': 'SAMEORIGIN', 'Allow': 'GET, HEAD, OPTIONS', 'Content-Type': 'application/x-download', 'Content-Disposition': 'attachment; filename=daily_adjusted_XOM.csv', 'Via': '1.1 vegur'}



Listing 1-9 Requesting data from the Alpha Vantage API





This includes all the available stock price data going back at least 10 years; hence, we will filter it to the date range we used for the earlier sentiments (see Listing 1-10).

import numpy as np

import pandas as pd



import matplotlib.pyplot as plt

import seaborn as sns



from dateutil import parser

datetime_obj = lambda x: parser.parse(x)



df = pd.read_csv("exxon_stock.csv", parse_dates=['timestamp'], date_parser=datetime_obj)

start_date = pd.to_datetime(date_list[0], unit="s")

end_date = pd.to_datetime(date_list[-1], unit="s")

df = df[(df["timestamp"] >= start_date) & (df["timestamp"] <= end_date)]



df.head(1)

# Output



Listing 1-10 Parsing response data





    timestamp   open   high   low  close  adjusted_close    volume  dividend_amount  split_coefficient
86  2019-12-10  69.66  70.15  68.7  69.06         68.0723  14281286              0.0                1.0





As a final step, let's plot the closing price and trading volume to see whether the stock price stayed as flat as the mostly neutral sentiments would suggest, as shown in Listing 1-11.

# Plotting stock and volume



top = plt.subplot2grid((4,4), (0, 0), rowspan=3, colspan=4)

top.plot(df['timestamp'], df['close'], label = 'Closing price')

plt.title('Exxon Close Price')

plt.legend(loc=2)

bottom = plt.subplot2grid((4,4), (3,0), rowspan=1, colspan=4)

bottom.bar(df["timestamp"], df["volume"])

plt.title('Exxon Daily Trading Volume')

plt.gcf().set_size_inches(12,8)

plt.subplots_adjust(hspace=0.75)



Listing 1-11 Plotting response data





As you can see from the plot shown in Figure 1-11, the stock price has shown considerable movement in that five-month range, with daily trading volumes orders of magnitude higher than the number of comments extracted from the subreddit. So we can safely say that sentiment analysis of comments in just one subreddit is not a good indicator of the share price movement without performing any further trend analysis.

Figure 1-11 Exxon stock prices and stock volumes





But that is hardly a surprise, since sentiment analysis only really works as a predictor if we are aggregating information from a large fraction of the visible Internet and plotting the data temporally as a time series so that it can be overlaid on stock market data.
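
As a minimal sketch of what that temporal aggregation could look like with the data we already have, the per-comment scores can be rolled up into a daily series; this assumes df_sent and date_list from Listings 1-4 and 1-6 are still in memory and aligned row by row, and it is not the commercial-grade methodology we revisit in Chapter 7.

import pandas as pd

df_ts = df_sent.copy()
# date_list holds the created_utc epoch seconds collected in Listing 1-4
df_ts["timestamp"] = pd.to_datetime(date_list, unit="s")

# average the per-comment scores into one value per day
daily_sentiment = (df_ts.set_index("timestamp")["sentiments_score"]
                        .resample("D")
                        .mean())
print(daily_sentiment.head())
# the resulting series could then be joined with the stock dataframe on its
# timestamp column before looking for any correlation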

There are plenty of other flaws with simply plotting sentiment data as we did earlier without correcting for company-specific or sector-specific biases of the authors, editors, and so on. For example, someone who is a known environmentalist might have a well-known bias against fossil fuel companies like Exxon, and any negative sentiments expressed by such an author have to be corrected for that bias before using them as a predictor for stock market analysis.

This is a perfect illustration of why we need to crawl at big data scale to generate useful insights, and why almost all datasets you will find on alternative financial dataset marketplaces like Quandl or AlternativeData.org have a significant web crawling and big data component to them, even if they are getting some fraction of their data by hitting REST API endpoints.

We will revisit this example in Chapter 7 and show you how to use big data to generate sentiments using a similar methodology to commercial data providers.





Why is web scraping essential?


So after learning about all the things publicly available (both paid and free) REST APIs can do for you, let me distill the common use cases that call for performing web scraping:

Your company works in one of the areas mentioned at the beginning of this chapter, and web scraping/crawling is part of your core business activity.



The website you want to extract data from does not provide a public API, and there are no comparable third-party APIs which provide the same set of data you need.



If there is an API, its free tier is rate limited, meaning you are capped at calling it only a certain number of times. The paid tier of the API is cost prohibitive for your intended use case, but accessing the website itself is free.



The API does not expose all the data you wish to obtain even in its paid tier, whereas the website contains that information.





How to turn web scraping into a full-fledged product


Let us break down web scraping into its individual components:

The first step is data ingestion, where all you are doing is grabbing the raw web pages from the Internet and storing them for further processing. I would argue that this is the easiest step in web crawling. We will perform web scraping and crawling using common Python-based parsing libraries in Chapter 2. We will also introduce cloud computing in Chapter 3 so that you are not restricted by the memory and computational resources of your local server. We will discuss advanced crawling strategies in Chapter 8, which will bring together everything we have learned in the book.



The second step is data processing, where we take in the raw data from web crawls and use some algorithms to extract useful information from it. In some cases, the algorithm will be as simple as traversing the HTML tree and extracting values of some tags such as the title and headline. In intermediate cases, we might have to run some pattern matching in addition to HTML parsing. For the most complicated use cases, we will have to run a gamut of NLP algorithms on raw text to extract people’s names, contact details, text summaries, and so on. We will introduce natural language processing algorithms in Chapter 4, and we will put them into action in Chapters 6 and 7 on a Common Crawl dataset.



The next step is loading the cleaned data from the preceding step into an appropriate database. For example, if your eventual product benefits from graph-based querying, then it's logical that you will load the cleaned data into a graph database such as Neo4j. On the other hand, if your product relies on providing full text search, then it's logical to use a full text search database such as Elasticsearch or Apache Solr. For the majority of other uses, a general-purpose SQL database such as MySQL or PostgreSQL works well. We will introduce databases in Chapter 5 and illustrate practical applications in Chapters 6 and 7.



The final step is exposing your database to a user client (mobile app or website) or allowing programmatic access through REST APIs. We will not cover this step; however, you can implement it using Amazon API Gateway.





Summary


We have introduced web scraping in this chapter and talked about quite a few real-world applications for it. We also discussed how to get structured data from third-party REST APIs using a Python-based library called requests.





© Jay M. Patel 2020

J. M. Patel, Getting Structured Data from the Internet, https://doi.org/10.1007/978-1-4842-6576-5_2





2. Web Scraping in Python Using Beautiful Soup Library




Jay M. Patel

Specrom Analytics, Ahmedabad, India





In this chapter, we’ll go through the basic building blocks of web pages such as HTML and CSS and demonstrate scraping structured information from them using popular Python libraries such as Beautiful Soup and lxml. Later, we’ll expand our knowledge and tackle issues that will make our scraper into a full-featured web crawler capable of fetching information from multiple web pages.

You will also learn about JavaScript and how it is used to insert dynamic content in modern web pages, and we will use Selenium to scrape information from JavaScript-enabled pages.

As a final piece, we’ll take everything we have learned and use it to scrape information from the US FDA’s warning letters database.





What are web pages all about?


All web pages are composed of HTML, which basically consists of plain text wrapped in tags that let web browsers know how to render the text. Examples of these tags include the following:

Every HTML document starts and ends with <html>...</html> tags.



By convention, <!DOCTYPE html> appears at the start of an HTML document; this declaration is not rendered by web browsers. (Actual HTML comments are wrapped in <!-- and --> and are likewise not rendered.)



<head>...</head> encloses meta-information about the document.



<body>...</body> encloses the body of the document.



<title>...</title> element specifies the title of the document.



<h1>...</h1> to <h6>...</h6> tags are used for headers.



<div>...</div> to indicate a division in an HTML document, generally used to group a set of elements.



<p>...</p> to enclose a paragraph.



<br> to set a line break.



<table>...</table> to start a table block; <tr>...</tr> is used for the rows.



<td>...</td> is used for individual cells.





<img> for images.



<a>...</a> for hyperlinks.



<ul>...</ul>, <ol>...</ol> for unordered and ordered lists, respectively; inside of these, <li>...</li> is used for each list item.





HTML tags also contain common attributes enclosed within the tags:

The href attribute defines a hyperlink and its anchor text and is enclosed by <a> tags.

<a href="https://www.jaympatel.com">Jay M. Patel's homepage</a>



The filename and location of an image are specified by the src attribute of the img tag.

<img src="https://www.jaympatel.com/book_cover.jpg">



It is very common to include width, height, and alternative text attributes in img tags for cases when the image cannot be displayed. You can also include a title attribute.

<img src="https://www.jaympatel.com/book_cover.jpg" width="500" height="600" alt="Jay's new web crawling book's cover image" title="Jay's book cover">



<html> tags also include a lang attribute.

<html lang="en-US">



A style attribute can also be included to specify a particular font color, size, and so on.

<p style="color:green">...</p>





In addition to the attributes mentioned earlier, you can also optionally specify ids and classes, for example, on h1 tags:

<h1 id="firstHeading" class="firstHeading" lang="en">Static sites are awesome</h1>





Id: A unique identifier representing a tag within the document



Class: An identifier that can annotate multiple elements in a document and represents a space-separated series of Cascading Style Sheets (CSS) class names





Classes and ids are case sensitive, start with letters, and can include alphanumeric characters, hyphens, and underscores. A class may apply to any number of instances of any element, whereas an id may only be applied to a single element within a document.

Classes and IDs are incredibly useful not only for applying styling via Cascading Style Sheets (CSS) (discussed in the next section) or using JavaScript but also for scraping useful information out of a page.

Let us create an HTML file: open your favorite text editor, copy-paste the code in Listing 2-1, and save it with a .html extension. I really like Notepad++, which is free to download, but you can use pretty much any text editor you like.

<!DOCTYPE html>

<html>

<body>



<h1 id="firstHeading" class="firstHeading" lang="en">Getting Structured Data from the Internet:</h1>



<h2>Running Web Crawlers/Scrapers on a Big Data Production Scale</h2>



<p id = "first">

Jay M. Patel

</p>



</body>



</html>



Listing 2-1 Sample HTML code





Once you have saved the file, simply double-click it, and it should open up in your browser. If you use Chrome or other major browsers like Firefox or Safari, right-click anywhere and select inspect, and then you will get the screen shown in Figure 2-1, which shows the source code you typed along with the rendered web page.

Figure 2-1 Inspecting rendered HTML in Google Chrome





Congratulations on creating your first HTML page! Let's add some styling to the page.





Styling with Cascading Style Sheets (CSS)


Cascading Style Sheets (CSS) is a style sheet language used for describing the presentation of a document written in a markup language like HTML, such as its layout, colors, and fonts. There are three ways to apply CSS styles to HTML pages:

The first is inline, inside a regular HTML tag; for example, you can change the font color with <p style="color:green;">...</p>. Using this type of styling will only affect the text enclosed by these tags. Note that inline styling takes precedence over the other methods, and it is sometimes used to override the main CSS of the page.





<!DOCTYPE html>

<html>

<head>

<link rel="stylesheet" type="text/css" href="main.css">

</head>

<body>





The second way is to create a separate CSS file and link it via a <link> tag within the main <head> of the HTML document, as shown in the preceding snippet; the browser will go out and request the CSS file whenever the page is loaded.



Style can also be applied inside of <style>...</style> tags, placed inside the <head> tag of a page.



A CSS file consists of code blocks that apply styling to individual HTML tags; in the following example, we are applying a green color and center alignment to all text enclosed in the <p> paragraph tag:





p {

color: green;

text-align: center;

}





We can use an ID as a selector so that the styling is only applied to the element with the id para1:





#para1 {

color: green;

text-align: center;

}





You can also use a class to apply the same styling across all elements with the class maincontent:





.maincontent {

color: green;

text-align: center;

}





Let's combine the two approaches for greater selectivity and apply the style only to paragraphs with the maincontent class:





p.maincontent {

color: green;

text-align: center;

}





Let us edit the preceding HTML file to add style="color:green;" to the <h1> tag. The revised HTML file with the styling block is shown in Figure 2-2.

Figure 2-2 Inspecting the HTML page with inline styling





Scraping a web page with Beautiful Soup


Beautiful Soup is a Python library primarily intended to parse and extract information from an HTML string. It comes with a variety of HTML parsers that let us extract information even from badly formatted HTML, which is unfortunately more common than one assumes. We can use the requests library we already saw in Chapter 1 to fetch the HTML page, and once we have it on our local computer, we can start playing around with Beautiful Soup objects to extract useful information. As an initial example, let's simply scrape information from the Wikipedia page for (you guessed it) web scraping!

Web pages change all the time, and that makes learning web scraping tricky: the web page needs to stay exactly the same as it was when I wrote this book so that even two or three years from now you can learn from live examples.

This is why web scraping book authors tend to host a small test website that can be used for scraping examples. I don't particularly like that approach, since toy examples don't scale very well to real-world web pages, which are full of ill-formed HTML, unclosed tags, and so on. Besides, in a few years' time, maybe the author will stop hosting the pages on their website, and then how will readers work through the examples?

Therefore, ideally, we need to scrape from snapshots of real web pages with versioning, so that a link unambiguously refers to how the web page looked at a particular date and time. Fortunately, such a resource already exists: the Internet Archive's Wayback Machine. We will be using links generated by the Wayback Machine so that you can continue to experiment and learn from this book even 5–10 years from now, since these links will stay up as long as the Internet Archive continues to exist.

It is easy enough to create a Beautiful Soup object, and in my experience, one of the easiest ways to find more information on a new object is to call the dir() on it to see all available methods and attributes.

As you can see, Beautiful Soup objects come with a long list of available methods with very intuitive names, such as findParent, findParents, findPreviousSibling, and findPreviousSiblings, which help you traverse the HTML tree (see Listing 2-2). There is no way for us to showcase all the methods here, but we will use a handful of them, and that will give you a sufficient idea of the usage patterns for the rest.

import requests

from bs4 import BeautifulSoup



test_url = 'https://web.archive.org/web/20200331040501/https://en.wikipedia.org/wiki/Web_scraping'



r = requests.get(test_url)

html_response = r.text

# creating a beautifulsoup object

soup = BeautifulSoup(html_response,'html.parser')

print(type(soup))

print("*"*20)

print(dir(soup))

# output

<class 'bs4.BeautifulSoup'>

********************

['ASCII_SPACES', 'DEFAULT_BUILDER_FEATURES', 'HTML_FORMATTERS', 'NO_PARSER_SPECIFIED_WARNING', 'ROOT_TAG_NAME', 'XML_FORMATTERS', '__bool__', '__call__', '__class__', '__contains__', '__copy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_all_strings', '_attr_value_as_string', '_attribute_checker', '_check_markup_is_url', '_feed', '_find_all', '_find_one', '_formatter_for_name', '_is_xml', '_lastRecursiveChild', '_last_descendant', '_most_recent_element', '_popToTag', '_select_debug', '_selector_combinators', '_should_pretty_print', '_tag_name_matches_and', 'append', 'attribselect_re', 'attrs', 'builder', 'can_be_empty_element', 'childGenerator', 'children', 'clear', 'contains_replacement_characters', 'contents', 'currentTag', 'current_data', 'declared_html_encoding', 'decode', 'decode_contents', 'decompose', 'descendants', 'encode', 'encode_contents', 'endData', 'extract', 'fetchNextSiblings', 'fetchParents', 'fetchPrevious', 'fetchPreviousSiblings', 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild', 'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings', 'findParent', 'findParents', 'findPrevious', 'findPreviousSibling', 'findPreviousSiblings', 'find_all', 'find_all_next', 'find_all_previous', 'find_next', 'find_next_sibling', 'find_next_siblings', 'find_parent', 'find_parents', 'find_previous', 'find_previous_sibling', 'find_previous_siblings', 'format_string', 'get', 'getText', 'get_attribute_list', 'get_text', 'handle_data', 'handle_endtag', 'handle_starttag', 'has_attr', 'has_key', 'hidden', 'index', 'insert', 'insert_after', 'insert_before', 'isSelfClosing', 'is_empty_element', 'is_xml', 'known_xml', 'markup', 'name', 'namespace', 'new_string', 'new_tag', 'next', 'nextGenerator', 'nextSibling', 'nextSiblingGenerator', 'next_element', 'next_elements', 'next_sibling', 'next_siblings', 'object_was_parsed', 'original_encoding', 'parent', 'parentGenerator', 'parents', 'parse_only', 'parserClass', 'parser_class', 'popTag', 'prefix', 'preserve_whitespace_tag_stack', 'preserve_whitespace_tags', 'prettify', 'previous', 'previousGenerator', 'previousSibling', 'previousSiblingGenerator', 'previous_element', 'previous_elements', 'previous_sibling', 'previous_siblings', 'pushTag', 'quoted_colon', 'recursiveChildGenerator', 'renderContents', 'replaceWith', 'replaceWithChildren', 'replace_with', 'replace_with_children', 'reset', 'select', 'select_one', 'setup', 'string', 'strings', 'stripped_strings', 'tagStack', 'tag_name_re', 'text', 'unwrap', 'wrap']



Listing 2-2 Parsing HTML using the BeautifulSoup library





The second major object created by the Beautiful Soup library is known as a tag object, which corresponds to an HTML/XML tag in the original document. Let us get the tag object for the h1 heading; a tag's name can be accessed via the .name attribute, and its attributes can be accessed by treating the tag like a dictionary. So, in the case shown in Listing 2-3, I can access the tag's id by simply calling first_tag["id"]; to see all available attributes, use .attrs.

first_tag = (soup.h1)

print(type(first_tag))

print("*"*20)

print(first_tag)

print("*"*20)

print(first_tag["id"])

print("*"*20)

print(first_tag.attrs)



# Output



<class 'bs4.element.Tag'>

********************

<h1 class="firstHeading" id="firstHeading" lang="en">Web scraping</h1>

********************

firstHeading

********************

{'id': 'firstHeading', 'class': ['firstHeading'], 'lang': 'en'}



Listing 2-3 Exploring BeautifulSoup objects





The last Beautiful Soup object of interest is the NavigableString type, which contains the string enclosed by HTML/XML tags. You can easily convert it to a regular Python string by calling str() on it, as shown in Listing 2-4. An analogous way to get the Python string is by simply calling the get_text() method on the tag object, and this is actually the preferred way to do it; we went through this exercise just to make you familiar with all the objects of the Beautiful Soup library.

first_string = first_tag.string

print(type(first_string))

print("*"*20)

python_string = str(first_string)

print(type(python_string), python_string)

print("*"*20)

print(type(first_tag.get_text()), first_tag.get_text())



# Output



<class 'bs4.element.NavigableString'>

********************

<class 'str'> Web scraping

********************

<class 'str'> Web scraping



Listing 2-4 Exploring BeautifulSoup objects (cont.)





find() and find_all()


These are some of the most versatile methods in Beautiful Soup; find_all() retrieves matching tags from all the nested HTML tags (called descendants), and if you pass in a list, it will retrieve all objects matching any item in the list. Let us use find_all() to get the contents enclosed by the h1 and h2 tags of the wiki page, as shown in Listing 2-5.

In contrast, the find() method will only return the first matching instance and ignore the remaining matches.

# Passing a list to the find_all method

for object in soup.find_all(['h1', 'h2']):
    print(object.get_text())

# doing the same with find()
print("*"*20)
print(soup.find(['h1','h2']).get_text())



# Output:



Web scraping

Contents

History[edit]

Techniques[edit]

Software[edit]

Legal issues[edit]

Methods to prevent web scraping[edit]

See also[edit]

References[edit]

Navigation menu

********************

Web scraping



Listing 2-5 Exploring the find_all function





Getting links from a Wikipedia page


Let's say that you are trying to scrape the anchor text and links in the "See also" section of the preceding Wikipedia page (as shown in Figure 2-3).

Figure 2-3 Screenshot of links and text you wish to scrape





The first step is to locate these links in the source code of the HTML page so as to find a class name or CSS style that can help you target them using Beautiful Soup's find() and find_all() methods. We used the Inspect tool in Chrome to find that the div class we are interested in is "div-col columns column-width."

link_div = soup.find('div', {'class':'div-col columns column-width'})

link_dict = {}

links = link_div.find_all('a')



for link in links:
    anchor_text = link.get_text()
    link_dict[anchor_text] = link['href']

print(link_dict)

# output

{'Archive.is': '/wiki/Archive.is', 'Comparison of feed aggregators': '/wiki/Comparison_of_feed_aggregators', 'Data scraping': '/wiki/Data_scraping', 'Data wrangling': '/wiki/Data_wrangling', 'Importer': '/wiki/Importer_(computing)', 'Job wrapping': '/wiki/Job_wrapping', 'Knowledge extraction': '/wiki/Knowledge_extraction', 'OpenSocial': '/wiki/OpenSocial', 'Scraper site': '/wiki/Scraper_site', 'Fake news website': '/wiki/Fake_news_website', 'Blog scraping': '/wiki/Blog_scraping', 'Spamdexing': '/wiki/Spamdexing', 'Domain name drop list': '/wiki/Domain_name_drop_list', 'Text corpus': '/wiki/Text_corpus', 'Web archiving': '/wiki/Web_archiving', 'Blog network': '/wiki/Blog_network', 'Search Engine Scraping': '/wiki/Search_Engine_Scraping', 'Web crawlers': '/wiki/Category:Web_crawlers'}



Listing 2-6 Extracting links





The first line of the code in Listing 2-6 finds the <div> tag with the class name "div-col columns column-width"; the resulting object link_div is a Beautiful Soup tag object. Next, we use this tag object and call find_all() to find all the instances of the <a> HTML tag, which encloses an anchor text and a link. Once we have a list of such Beautiful Soup tag objects, all we need to do is iterate through them to pull out the anchor text and the link, which is accessible via the "href" attribute. We load these into a Python dictionary, which you can easily save as JSON, thus extracting structured information from the scraped Wikipedia page. Note that the extracted links are relative links, but you can simply join a base URL with each of them to get an absolute URL, as sketched below.
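
A minimal way to do that join is with urljoin from the standard library rather than manual string concatenation; the base_url below is just an assumed example (the live Wikipedia URL), and for the Wayback Machine snapshot you would use the archive URL as the base instead.

from urllib.parse import urljoin

base_url = 'https://en.wikipedia.org/wiki/Web_scraping'  # assumed base; adjust for your source
absolute_links = {anchor: urljoin(base_url, href)
                  for anchor, href in link_dict.items()}

print(absolute_links['Data scraping'])
# https://en.wikipedia.org/wiki/Data_scraping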





Scrape an ecommerce store site


Extracting structured information from ecommerce websites for price and competitor monitoring is in fact one of the major use cases for web scraping.

You can view the headers your browser sends as part of each request by going to a site such as www.whatismybrowser.com. My request header's user-agent is shown in the screenshot in Figure 2-4.

Figure 2-4 Browser headers
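
For comparison, you can also check what the requests library sends when you don't set any headers yourself; one quick way (sketched below, using httpbin.org purely as an example endpoint) is to inspect the prepared request after a call.

import requests

r = requests.get('https://httpbin.org/headers')
# the User-Agent requests sends by default identifies your script as a bot,
# for example 'python-requests/2.23.0' (the version depends on your install)
print(r.request.headers['User-Agent'])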





I would encourage you to modify your requests from now on and include a headers dictionary with a user-agent so that you blend in with real humans using browsers when you are programmatically accessing sites for web scraping. There are much more advanced antiscraping measures websites can take, so this will not fool everyone, but it will get you more access than sending no headers at all. To illustrate an effective antiscraping measure, let us try to scrape Amazon.com; in Listing 2-7, all we are doing is removing scripts from the BeautifulSoup object and converting the soup object into plain text. As you can see, Amazon correctly identified that we are a robot and gave us a CAPTCHA instead of allowing us to proceed to the page.

my_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' + ' (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}



url = 'https://www.amazon.com'

rr = requests.get(url, headers = my_headers)

ht_response = rr.text

soup = BeautifulSoup(ht_response,'html.parser')

for script in soup(["script"]):
    script.extract()

soup.get_text()

# Output

"\n\n\n\n\n\n\n\n\nRobot Check\n\n\n\n\n\n\n\n\n\n\n\n\n\nEnter the characters you see below\nSorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.\n\n\n\n\n\n\n\n\n\n\nType the characters you see in this image:\n\n\n\n\n\n\n\n\nTry different image\n\n\n\n\n\n\n\n\n\n\n\nContinue shopping\n\n\n\n\n\n\n\n\n\n\n\nConditions of Use\n\n\n\n\nPrivacy Policy\n\n\n © 1996-2014, Amazon.com, Inc. or its affiliates\n \n\n\n\n\n\n\n\n"



Listing 2-7 Scraping from Amazon.com





Let us switch gears and instead try to extract all the links visible on the first page of the Apress ecommerce site. We will be using an Internet Archive snapshot (Listing 2-8). We filter the links to only those under the class named "product-information" so that our links correspond to individual book pages.

url = 'https://web.archive.org/web/20200219120507/https://www.apress.com/us/shop'

base_url = 'https://web.archive.org/web/20200219120507'

my_headers = {

    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' + ' (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'

}

r = requests.get(url, headers = my_headers)

ht_response = r.text



soup = BeautifulSoup(ht_response,'html.parser')



product_info = soup.find_all("div", {"class":"product-information"})

url_list =[]



for product in product_info:
    temp_url = base_url + str(product.parent.find('a')["href"])
    url_list.append(temp_url)



Listing 2-8 Scraping from the Apress ecommerce store





Let's take one URL from this list and extract the book name, book format, and price from it (Listing 2-9).

url = 'https://web.archive.org/web/20191018112156/https://www.apress.com/us/book/9781484249406'



my_headers = {

    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' + ' (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'

}



rr = requests.get(url, headers = my_headers)

ht_response = rr.text



temp_dict = {}

results_list = []

main_dict = {}



soup = BeautifulSoup(ht_response,'html.parser')

primary_buy = soup.find("span", {"class":"cover-type"})

temp_dict["book_type"] = primary_buy.get_text()

temp_dict["book_price"] = primary_buy.parent.find("span", {"class": "price"}).get_text().strip()

temp_dict["book_name"] = soup.find('h1').get_text()

temp_dict["url"] = url



results_list.append(temp_dict)

main_dict["extracted_products"] = results_list



print(main_dict)

# Output

{'extracted_products': [{'book_type': 'eBook', 'book_price': '$39.99', 'book_name': 'Pro .NET Benchmarking', 'url': 'https://web.archive.org/web/20191018112156/https://www.apress.com/us/book/9781484249406'}]}



Listing 2-9 Extracting structured information from a URL





Profiling Beautiful Soup parsers


We have refrained from talking about performance in the previous section since we mainly wanted you to first get an idea of the capabilities of the Beautiful Soup library.

If you look at Listing 2-9, you will immediately see that there is very little we can do about how long it takes to fetch the HTML page using the requests library, since that depends entirely on our available bandwidth and the server's response time.
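
You can check this split yourself; requests exposes the server round-trip time through the response's elapsed attribute, which you can compare against the local parse time. The following is a rough sketch, reusing the url and my_headers defined in Listing 2-9.

import time
import requests
from bs4 import BeautifulSoup

rr = requests.get(url, headers=my_headers)      # url/my_headers as in Listing 2-9
# time until the response headers arrived, dominated by network and server speed
print("fetch time:", rr.elapsed.total_seconds(), "seconds")

start = time.perf_counter()
soup = BeautifulSoup(rr.text, 'html.parser')    # local parsing cost
print("parse time:", time.perf_counter() - start, "seconds")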

So the only other thing we can profile is the Beautiful Soup library itself. It’s a powerful way to access almost any object in HTML, and it definitely has its place in the web scraping toolbox.

However, it’s the slow HTML parsing speed that makes it unviable for large-scale web crawling loads.

You can get some performance boost by switching to the lxml parser, but the gain is still small compared to parsing the DOM using XPath, as discussed in the next section.

Let's use Python's built-in profiler (cProfile) to identify the most time-consuming function calls using the default html.parser (Listing 2-10).

import cProfile



cProfile.run('''
temp_dict = {}
results_list = []
main_dict = {}
def main():
    soup = BeautifulSoup(ht_response,'html.parser')
    primary_buy = soup.find("span", {"class":"cover-type"})
    temp_dict["book_type"] = primary_buy.get_text()
    temp_dict["book_price"] = primary_buy.parent.find("span", {"class": "price"}).get_text().strip()
    temp_dict["book_name"] = soup.find('h1').get_text()
    temp_dict["url"] = url
    results_list.append(temp_dict)

    main_dict["extracted_products"] = results_list
    return(results_list)
main()''', 'restats')



#https://docs.python.org/3.6/library/profile.html

import pstats

p = pstats.Stats('restats')

p.sort_stats('cumtime').print_stats(15)



#Output



Sun Apr 19 09:39:00 2020 restats



79174 function calls (79158 primitive calls) in 0.086 seconds



Ordered by: cumulative time

List reduced from 102 to 15 due to restriction <15>



ncalls tottime percall cumtime percall filename:lineno(function)

1 0.000 0.000 0.086 0.086 {built-in method builtins.exec}

1 0.000 0.000 0.085 0.085 <string>:2(<module>)

1 0.000 0.000 0.085 0.085 <string>:5(main)

1 0.000 0.000 0.078 0.078 C:\ProgramData\Anaconda3\lib\site-packages\bs4\__init__.py:87(__init__)

1 0.000 0.000 0.078 0.078 C:\ProgramData\Anaconda3\lib\site-packages\bs4\__init__.py:285(_feed)

1 0.000 0.000 0.078 0.078 C:\ProgramData\Anaconda3\lib\site-packages\bs4\builder\_htmlparser.py:210(feed)

1 0.000 0.000 0.078 0.078 C:\ProgramData\Anaconda3\lib\html\parser.py:104(feed)

1 0.007 0.007 0.078 0.078 C:\ProgramData\Anaconda3\lib\html\parser.py:134(goahead)

715 0.008 0.000 0.045 0.000 C:\ProgramData\Anaconda3\lib\html\parser.py:301(parse_starttag)

715 0.003 0.000 0.027 0.000 C:\ProgramData\Anaconda3\lib\site-packages\bs4\builder\_htmlparser.py:79(handle_starttag)

715 0.002 0.000 0.023 0.000 C:\ProgramData\Anaconda3\lib\site-packages\bs4\__init__.py:447(handle_starttag)

619 0.003 0.000 0.015 0.000 C:\ProgramData\Anaconda3\lib\html\parser.py:386(parse_endtag)

1464 0.005 0.000 0.014 0.000 C:\ProgramData\Anaconda3\lib\site-packages\bs4\__init__.py:337(endData)

714 0.001 0.000 0.012 0.000 C:\ProgramData\Anaconda3\lib\site-packages\bs4\builder\_htmlparser.py:107(handle_endtag)

716 0.003 0.000 0.011 0.000 C:\ProgramData\Anaconda3\lib\site-packages\bs4\element.py:813(__init__)



Listing 2-10 Profiling Beautiful Soup parsers





This should print out the output consisting of the top 15 most time-consuming calls. Now, there are calls going to bs4\__init__.py that we won’t be able to optimize without a major refactoring of the library; the next top time-consuming calls are all made by html\parser.py.
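
The change itself is a single line inside the profiled main() function from Listing 2-10; everything else stays the same (the lxml package must be installed for this parser name to be available):

# swap the default parser for the lxml tree builder
soup = BeautifulSoup(ht_response, 'lxml')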

Let us profile the main() function again with the only modification being that we have switched the parser to lxml, as shown in the preceding snippet. I am only showing the output in Listing 2-11.

# Output:

Sun Apr 19 09:39:57 2020 restats



63900 function calls (63880 primitive calls) in 0.064 seconds



Ordered by: cumulative time

List reduced from 168 to 15 due to restriction <15>



ncalls tottime percall cumtime percall filename:lineno(function)

1 0.000 0.000 0.064 0.064 {built-in method builtins.exec}

1 0.000 0.000 0.063 0.063 <string>:2(<module>)

1 0.000 0.000 0.063 0.063 <string>:5(main)

1 0.000 0.000 0.058 0.058 C:\ProgramData\Anaconda3\lib\site-packages\bs4\__init__.py:87(__init__)

1 0.000 0.000 0.058 0.058 C:\ProgramData\Anaconda3\lib\site-packages\bs4\__init__.py:285(_feed)

1 0.000 0.000 0.058 0.058 C:\ProgramData\Anaconda3\lib\site-packages\bs4\builder\_lxml.py:246(feed)

2/1 0.006 0.003 0.047 0.047 src/lxml/parser.pxi:1242(feed)

715 0.001 0.000 0.026 0.000 src/lxml/saxparser.pxi:374(_handleSaxTargetStartNoNs)

715 0.000 0.000 0.024 0.000 src/lxml/saxparser.pxi:401(_callTargetSaxStart)

715 0.000 0.000 0.024 0.000 src/lxml/parsertarget.pxi:78(_handleSaxStart)

715 0.004 0.000 0.023 0.000 C:\ProgramData\Anaconda3\lib\site-packages\bs4\builder\_lxml.py:145(start)

715 0.002 0.000 0.017 0.000 C:\ProgramData\Anaconda3\lib\site-packages\bs4\__init__.py:447(handle_starttag)

715 0.001 0.000 0.011 0.000 src/lxml/saxparser.pxi:452(_handleSaxEndNoNs)

2181 0.004 0.000 0.011 0.000 C:\ProgramData\Anaconda3\lib\site-packages\bs4\__init__.py:337(endData)

715 0.000 0.000 0.010 0.000 src/lxml/parsertarget.pxi:84(_handleSaxEnd)



<pstats.Stats at 0x2a852202780>



Listing 2-11 Profiling Beautiful Soup parsers (cont.)





You can clearly see a reduction not only in the number of function calls but also in the cumulative time, and most of that time advantage comes directly from using the lxml-based parser builder\_lxml.py as the back end for Beautiful Soup.





XPath


XPath has its origins in the XSLT standard and stands for XML Path Language. Its syntax allows you to identify paths and nodes of an XML (and HTML) document. You will almost never have to write your own XPath from scratch, so we will not spend any time on the XPath syntax itself, but you are encouraged to go through the XPath 3.1 standard (www.w3.org/TR/xpath-31/) for complete details.

The most common way to find an XPath is with the help of the developer tools in Google Chrome. For example, if I want the XPath for the price of a book on the Apress site, I right-click anywhere on the page and click Inspect. Once there, click the element you want the XPath for; in our case, the price of a particular book (see Figure 2-5). Now you can click Copy and select either the abbreviated XPath or the complete XPath of the object; you can use either for web scraping.

Abbreviated XPath: //*[@id="id2"]/div/div/div/ul/li[3]/div[2]/span[2]/span



Complete XPath: /html/body/div[5]/div/div/div/div/div[3]/div/div/div/ul/li[3]/div[2]/span[2]/span





Figure 2-5 XPath for the Apress ecommerce store





We will use the XPath syntax to extract the same information, as shown in Listing 2-12.

from lxml.html import fromstring, tostring

temp_dict = {}

results_list = []

main_dict = {}

def main():
    tree = fromstring(ht_response)

    temp_dict["book_type"] = tree.xpath('//*[@id="content"]/div[2]/div[2]/div[1]/div/dl/dt[1]/span[1]/text()')[0]
    temp_dict["book_price"] = tree.xpath('//*[@id="content"]/div[2]/div[2]/div[1]/div/dl/dt[1]/span[2]/span/text()')[0].strip()
    temp_dict["book_name"] = tree.xpath('//*[@id="content"]/div[2]/div[1]/div[1]/div[1]/div[2]/h1/text()')[0]
    temp_dict["url"] = url
    results_list.append(temp_dict)

    main_dict["extracted_products"] = results_list
    return(main_dict)

main()

#Output

{'extracted_products': [{'book_name': 'Pro .NET Benchmarking',

'book_price': '$39.99',

'book_type': 'eBook',

'url': 'https://web.archive.org/web/20191018112156/https://www.apress.com/us/book/9781484249406'}]}



Listing 2-12 Using the lxml library





Profiling XPath-based lxml


Profiling the main() function from Listing 2-12 gives us an astonishing result; we get a fivefold time improvement and a drastic 160-fold reduction in the number of function calls.

Even if we end up parsing 100,000 documents of similar type, it will only take us 26.67 minutes (0.44 hrs) vs. 143.33 minutes (2.39 hrs) for Beautiful Soup.

I just wanted to put this out there so that you know that even though we are using Beautiful Soup here for examples, you should strongly consider switching to XPath-based parsing once your workload gets into parsing hundreds of thousands of web pages (see Listing 2-13).

Sun Apr 19 10:08:05 2020 restats



436 function calls in 0.016 seconds



Ordered by: cumulative time

List reduced from 103 to 15 due to restriction <15>



ncalls tottime percall cumtime percall filename:lineno(function)

1 0.000 0.000 0.016 0.016 {built-in method builtins.exec}

1 0.000 0.000 0.015 0.015 <string>:2(<module>)

1 0.000 0.000 0.015 0.015 <string>:5(main)

1 0.000 0.000 0.012 0.012 C:\ProgramData\Anaconda3\lib\site-packages\lxml\html\__init__.py:861(fromstring)

1 0.000 0.000 0.012 0.012 C:\ProgramData\Anaconda3\lib\site-packages\lxml\html\__init__.py:759(document_fromstring)

1 0.000 0.000 0.012 0.012 src/lxml/etree.pyx:3198(fromstring)

1 0.007 0.007 0.007 0.007 src/lxml/etree.pyx:354(getroot)

1 0.000 0.000 0.005 0.005 src/lxml/parser.pxi:1869(_parseMemoryDocument)

1 0.000 0.000 0.005 0.005 src/lxml/parser.pxi:1731(_parseDoc)

1 0.005 0.005 0.005 0.005 src/lxml/parser.pxi:1009(_parseUnicodeDoc)

3 0.000 0.000 0.003 0.001 src/lxml/etree.pyx:1568(xpath)

3 0.003 0.001 0.003 0.001 src/lxml/xpath.pxi:281(__call__)

3 0.000 0.000 0.000 0.000 src/lxml/xpath.pxi:252(__init__)

3 0.000 0.000 0.000 0.000 src/lxml/xpath.pxi:131(__init__)

30 0.000 0.000 0.000 0.000 src/lxml/parser.pxi:612(_forwardParserError)



Listing 2-13 Profiling the lxml library





Crawling an entire site


We will discuss some important considerations before we start crawling entire websites. Let us start by writing a naive crawler, point out its shortcomings, and fix them with specific solutions.

Essentially, we have one function called link_crawler() which takes in a seed_url, and it uses that to request the first page. Once the links are parsed, we can start loading them into the initial set of URLs to be crawled.

As we work down the list, we will see that there are pages we have already requested and parsed; to keep track of those, we have another set called seen_url_set.

We restrict the crawl size by only following links whose domain matches the seed URL; we also cap the crawl by specifying a max_n number, which limits the number of pages we fetch (see Listing 2-14). We also take care of relative links by prepending a base URL.

import requests

from bs4 import BeautifulSoup



def link_crawler(seed_url, max_n = 5000):
    my_headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' + ' (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
    }
    initial_url_set = set()
    initial_url_set.add(seed_url)
    seen_url_set = set()

    while len(initial_url_set) != 0 and len(seen_url_set) < max_n:
        temp_url = initial_url_set.pop()
        if temp_url in seen_url_set:
            continue
        else:
            seen_url_set.add(temp_url)

        r = requests.get(url = temp_url, headers = my_headers)
        st_code = r.status_code

        html_response = r.text
        soup = BeautifulSoup(html_response,'html.parser')
        links = soup.find_all('a', href=True)
        for link in links:
            if ('http' in link['href']):
                if seed_url.split(".")[1] in link['href']:
                    initial_url_set.add(link['href'])
            elif [char for char in link['href']][0] == '/':
                final_url = seed_url + link['href']
                initial_url_set.add(final_url)

    return(initial_url_set, seen_url_set)

seed_url = 'http://www.jaympatel.com'

link_crawler(seed_url)

#output:

(set(),

{'http://jaympatel.com/',

'http://jaympatel.com/2018/11/get-started-with-git-and-github-in-under-10-minutes/',

'http://jaympatel.com/2019/02/introduction-to-natural-language-processing-rule-based-methods-name-entity-recognition-ner-and-text-classification/',

'http://jaympatel.com/2019/02/introduction-to-web-scraping-in-python-using-beautiful-soup/',

'http://jaympatel.com/2019/02/natural-language-processing-nlp-term-frequency-inverse-document-frequency-tf-idf-based-vectorization-in-python/',

'http://jaympatel.com/2019/02/natural-language-processing-nlp-text-vectorization-and-bag-of-words-approach/',

'http://jaympatel.com/2019/02/natural-language-processing-nlp-word-embeddings-words2vec-glove-based-text-vectorization-in-python/',

'http://jaympatel.com/2019/02/top-data-science-interview-questions-and-answers/',

'http://jaympatel.com/2019/02/using-twitter-rest-apis-in-python-to-search-and-download-tweets-in-bulk/',

'http://jaympatel.com/2019/02/why-is-web-scraping-essential-and-who-uses-web-scraping/',

'http://jaympatel.com/2020/01/introduction-to-machine-learning-metrics/',

'http://jaympatel.com/about/',

'http://jaympatel.com/books',

'http://jaympatel.com/books/',

'http://jaympatel.com/categories/',

'http://jaympatel.com/categories/#data-mining',

'http://jaympatel.com/categories/#data-science',

'http://jaympatel.com/categories/#interviews',

'http://jaympatel.com/categories/#machine-learning',

'http://jaympatel.com/categories/#natural-language-processing',

'http://jaympatel.com/categories/#requests',

'http://jaympatel.com/categories/#sentiments',

'http://jaympatel.com/categories/#software-development',

'http://jaympatel.com/categories/#text-vectorization',

'http://jaympatel.com/categories/#twitter',

'http://jaympatel.com/categories/#web-scraping',

'http://jaympatel.com/consulting-services',

'http://jaympatel.com/consulting-services/',

'http://jaympatel.com/cv',

'http://jaympatel.com/cv/',

'http://jaympatel.com/pages/CV.pdf',

'http://jaympatel.com/tags/',

'http://jaympatel.com/tags/#coefficient-of-determination-r2',

'http://jaympatel.com/tags/#git',

'http://jaympatel.com/tags/#glove',

'http://jaympatel.com/tags/#information-criterion',

'http://jaympatel.com/tags/#language-detection',

'http://jaympatel.com/tags/#machine-learning',

'http://jaympatel.com/tags/#name-entity-recognition',

'http://jaympatel.com/tags/#p-value',

'http://jaympatel.com/tags/#regex',

'http://jaympatel.com/tags/#regression',

'http://jaympatel.com/tags/#t-test',

'http://jaympatel.com/tags/#term-frequency-inverse-document-frequency-tf-idf',

'http://jaympatel.com/tags/#text-mining',

'http://jaympatel.com/tags/#tweepy',

'http://jaympatel.com/tags/#version-control',

'http://jaympatel.com/tags/#web-scraping',

'http://jaympatel.com/tags/#word-embeddings',

'http://jaympatel.com/tags/#words2vec',

'http://www.jaympatel.com/assets/DoD_SERDP_case_study.pdf'})



Listing 2-14 Link crawler





The function in Listing 2-14 works fine for testing and educational purposes, but it has some serious shortcomings that make it entirely unsuitable for regular use. Let us go through some of the issues and see how we can make it robust enough for practical use.





URL normalization


In general, when we are setting up a crawler, we are only looking to scrape information from specific types of pages. For example, we typically exclude links that point to CSS sheets or JavaScript. You can get a much more granular idea of the filetype behind a particular link by checking the Content-Type in the response header, but this requires you to actually ping the link, which is not practical in many cases.
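
If you do decide to ping a link, a HEAD request is usually enough to read the Content-Type without downloading the whole body; the helper below is just a hypothetical sketch, and the URL is only an example.

import requests

def looks_like_html(link):
    # HEAD fetches only the response headers, not the page body
    resp = requests.head(link, allow_redirects=True, timeout=10)
    return 'text/html' in resp.headers.get('Content-Type', '')

print(looks_like_html('http://www.jaympatel.com/'))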

Another common scenario is normalizing multiple links that all in fact point to one page. These days, single-page HTML sites are becoming very common, where a user can jump through different sections of the page using anchor links. For example, all the following links point to different sections of the same page:

<a href="#pricing">Pricing</a><br />



<a href="#license-cost">License Cost</a></li>





Another way the same link may get different URLs is through Urchin Tracking Module (UTM) parameters which are commonly used for tracking campaigns in digital marketing and are pretty common on the Web. As an example, let us consider the following two URLs for Specrom Analytics with UTM parameters, with the only difference being the utm_source parameter:

www.specrom.com/?utm_source=newsletter&utm_medium=banner&utm_campaign=fall_sale&utm_term=web%20scraping%20crawling

www.specrom.com/?utm_source=google&utm_medium=banner&utm_campaign=fall_sale&utm_term=web%20scraping%20crawling

Both links point to www.specrom.com (you can verify it if you want); so if your crawler took in these URLs, you would end up with three copies of the same page, which wastes bandwidth and compute not only to fetch them but also later when you try to deduplicate your database.

There is also the question of trailing slashes; traditionally, web addresses with a trailing slash indicated folders, whereas the ones without it indicated files. This definitely doesn't hold true anymore, but we are still stuck with pages with and without slashes both pointing to the same content. Google has issued guidance for webmasters about this issue, and their preferred way is a 301 redirect from a duplicate page to the canonical one. To keep things simple, we will simply ignore trailing slashes in our code.

Therefore, you will need to incorporate URL normalization in your link crawler; in our case, we can simply exclude everything after #-[?*!@=]. You can easily accomplish this by using regular expressions or Python's string methods; but in our case, we will use the Python package tld, which has a handy parsed_url attribute that lets us get rid of fragments and queries from the URL (Listing 2-15).

from tld import get_tld

sample_url = 'http://www.specrom.com/?utm_source=google&utm_medium=banner&utm_campaign=fall_sale&utm_term=web%20scraping%20crawling'



def get_normalized_url(url):
    res = get_tld(url, as_object=True)
    path_list = [char for char in res.parsed_url.path]
    if len(path_list) == 0:
        final_url = res.parsed_url.scheme+'://'+res.parsed_url.netloc
    elif path_list[-1] == '/':
        final_string = ''.join(path_list[:-1])
        final_url = res.parsed_url.scheme+'://'+res.parsed_url.netloc+final_string
    else:
        final_url = url
    return final_url



get_normalized_url(sample_url)

#output

'http://www.specrom.com'



Listing 2-15 URL normalization





Robots.txt and crawl delay


We can use our URL link finder function to crawl the entire website; but first we will have to make some modifications to ensure that we are not overstepping the scope of legitimate crawling.

Most webmasters put a file called robots.txt at the path http://www.example.com/robots.txt, which explicitly lists out the directories and pages on their site that are OK to crawl and the parts that are off limits to crawlers. These are just suggestions, and you can scrape a website that explicitly prohibits crawling in its robots.txt file, but doing so is unethical and against the terms of use, which can open you up to a legal challenge in some jurisdictions. Some robots.txt files also t