News aggregators are some of the most useful sites on the web.
By continuously checking different news outlets across the web, they ensure that you always have the latest news at your disposal, no matter which source was the first to break the story.
But how do these news aggregators manage to always collect all the news instantly, from so many different sources at once?
How News Aggregators and Web Scraping Work
That’s what you’ll find out below. In this short article, we’ll take a closer look at the techniques news aggregators use – from RSS and Atom feeds to web crawling and scraping – to give you all the news you can possibly read.
Ready to find out? Then read on…
What are News Aggregators?
Let’s start by defining what news aggregators actually are.
Say you prefer The New York Times for the main headlines of the day, but you turn to The Economist for articles on the economy and financial topics.
In the past, this meant you had to visit both websites separately. But with a news aggregator, you can have articles from both sources combined into one platform instead.
Such a platform aggregates news items – often shortened to just a headline and a snippet as an introduction – and presents them to you in one handy overview.
If you see a headline you like, you click on the news item, and you’re directed to the original news article on the news outlet’s website.
And that’s the most basic function of a news aggregator. Now, these news aggregators often allow you to fully customize what news you receive, filtering by sources, languages, and topics.
They then curate the news for you to give you the best news experience based on your unique needs and preferences.
So how do these aggregators get their hands on all this news data?
How Do News Aggregators Work?
Many news content aggregators rely on RSS feeds, Atom feeds, and APIs provided by the news outlets themselves. This is the easiest way for the aggregator to collect news data.
Such a feed generally contains only a small snippet of the news article, consisting of just the article headline and a few sentences as a short introduction or summary.
The aggregator simply pulls information from this feed and presents it on its own website.
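To make this concrete, here’s a minimal sketch of how an aggregator might pull headlines from such a feed, using Python and the feedparser library (which handles both RSS and Atom). The feed URL is just a placeholder, not a real outlet’s feed.

```python
import feedparser  # third-party library: pip install feedparser

# Placeholder URL – swap in the RSS or Atom feed of any outlet you follow.
FEED_URL = "https://example-news-site.com/rss.xml"

def fetch_feed_items(url):
    """Pull the headline, link, and summary snippet from each entry in a feed."""
    feed = feedparser.parse(url)  # parses both RSS and Atom formats
    items = []
    for entry in feed.entries:
        items.append({
            "headline": entry.get("title", ""),
            "link": entry.get("link", ""),
            "snippet": entry.get("summary", ""),
        })
    return items

if __name__ == "__main__":
    for item in fetch_feed_items(FEED_URL):
        print(item["headline"], "->", item["link"])
```

That’s essentially all an aggregator needs to do for outlets that publish a feed: poll the feed on a schedule and store any entries it hasn’t seen before.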
However, not all news outlets provide a feed, and some news aggregators want more information than what’s provided in the feed.
In that case, they can crawl entire news sites themselves to identify new content. Yet others take their news from another aggregator – Google News – by extracting data from its feed. This used to be easy with Google’s own Google News API, but that API was deprecated a few years back.
Luckily, you can still find Google News API scrapers out there to help you do it.
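Many of those tools simply read the public RSS feeds that Google News still publishes. Here’s a rough sketch of that approach; the URL pattern and query parameters are assumptions based on the feed format at the time of writing and may change.

```python
import feedparser  # pip install feedparser

def google_news_headlines(query, language="en-US", country="US"):
    """Fetch headlines from the public Google News RSS search feed (URL format assumed)."""
    url = (
        "https://news.google.com/rss/search"
        f"?q={query}&hl={language}&gl={country}&ceid={country}:{language.split('-')[0]}"
    )
    feed = feedparser.parse(url)
    return [(entry.title, entry.link) for entry in feed.entries]

if __name__ == "__main__":
    for title, link in google_news_headlines("economy")[:5]:
        print(title, "->", link)
```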
And that leads us to one of the most common ways news aggregators gather their news: through web scraping.
By scraping loads of different news outlets or other aggregators like Google News, aggregators can present all the content they want from many different websites all at once.
What is Web Scraping?
Web scraping is the automated process of extracting data from web pages using a robot (often called a web crawler or scraper).
The robot visits a given URL or set of URLs, crawls these web pages, fetches the information on them, and scrapes (extracts) it. The scraped data is then repurposed, for example by copying it into a spreadsheet or database.
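For illustration, a bare-bones news scraper in Python might look like the sketch below, using the requests and BeautifulSoup libraries. The URL and the CSS selector are made-up placeholders – every site structures its markup differently – and the results are written to a CSV file as one example of repurposing the data.

```python
import csv
import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

URL = "https://example-news-site.com/latest"  # placeholder URL

def scrape_headlines(url):
    """Fetch a page and extract headline text and links from it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    results = []
    # Placeholder selector – real sites need their own site-specific selectors.
    for tag in soup.select("h2.headline a"):
        results.append({"headline": tag.get_text(strip=True), "link": tag.get("href")})
    return results

def save_to_csv(rows, path="headlines.csv"):
    """Repurpose the scraped data by writing it to a spreadsheet-style CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["headline", "link"])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    save_to_csv(scrape_headlines(URL))
```

A real aggregator wraps this same idea in scheduling, politeness rules (robots.txt, rate limits), and per-site parsing logic, but the core fetch-parse-store loop is the same.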
Out of the many different applications of web scraping, news aggregation might just be the most useful one.
We’ll discuss how news aggregators use web scraping in a bit. But web scraping can do a lot more than that.
Other popular uses of web scraping include:
- Competitor monitoring
- Price monitoring
- SEO monitoring
- Customer monitoring and opinion mining
- Lead generation
But now, let’s get back to news aggregators and how they further use these web scraping techniques to give you your personal news feed.
From Scraped Data to Your Personal Feed
Once the web scraper has gathered all the data the news aggregator wants (like headlines, author details, lead paragraphs, and header images), the aggregator still needs to process all this raw data.
This starts with clustering and categorizing the articles. That’s often done using a common numerical statistic such as term frequency–inverse document frequency (TF-IDF), which weights the terms that appear in each article and allows the program to cluster and categorize the data.
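As a rough illustration of that step, the sketch below uses scikit-learn’s TF-IDF vectorizer and k-means clustering to group a few made-up headlines into topics. The sample texts and the choice of two clusters are purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans  # pip install scikit-learn

# A few made-up article snippets standing in for scraped news data.
articles = [
    "Central bank raises interest rates to fight inflation",
    "Stock markets rally as inflation fears ease",
    "Championship final ends in dramatic penalty shootout",
    "Star striker transfers to rival club for record fee",
]

# Turn each article into a TF-IDF weighted term vector.
vectors = TfidfVectorizer(stop_words="english").fit_transform(articles)

# Cluster the vectors; here we assume two topics (roughly finance and sport).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

for article, label in zip(articles, kmeans.labels_):
    print(f"cluster {label}: {article}")
```

Real aggregators work at a much larger scale and often combine TF-IDF with other signals, but the principle – turn text into weighted term vectors, then group similar vectors – is the same.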
Once categorized, the news aggregator goes through all these articles to surface only the best and most relevant ones for your needs.
Lastly, you don’t want your news feed to be a bland stream of raw data and words. So the final step in the process of news aggregation is to visualize the data in an appealing, easy-to-use format.
And that’s, in a nutshell, how news aggregators use web scraping techniques to give you the latest scoop from around the world.