This tutorial explains how to scrape Google News for articles related to the topic of your choice using Python.
We want to collect the following information for each news article.
- Title : Article Headline
- Source : Original News Source or Blogger Name
- Time : Publication Date/Time
- Author : Article Author
- Link : Article Link
Step 1 : Install the following Python libraries if they are not already installed.
- requests
- BeautifulSoup
- pandas
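You can install any Python library using the command pip install library_name. Note that BeautifulSoup is published on PyPI as beautifulsoup4 and imported as bs4, so the command for it is pip install beautifulsoup4.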
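If you are unsure which of these libraries are already installed, a small check like the one below can tell you. This is only a sketch: the helper name missing_packages and the REQUIRED mapping are made up for illustration and are not part of any of the libraries.

```python
from importlib.util import find_spec

# Map each import name to the name used with pip.
# BeautifulSoup is installed as "beautifulsoup4" but imported as "bs4".
REQUIRED = {"requests": "requests", "bs4": "beautifulsoup4", "pandas": "pandas"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for import_name, pip_name in required.items()
        if find_spec(import_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Run: pip install " + " ".join(missing))
    else:
        print("All required libraries are installed.")
```

Running the script prints either a ready-to-copy pip command or a confirmation that nothing is missing.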
Step 2 : Set Search Query
The next step is to define a search query, which is the topic or term you want to find related articles for. In the code below, you can specify it in the 'query' variable.
The code below extracts titles, sources, times, authors and links from Google News for the chosen topic and stores them in a CSV file named 'news.csv'.
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Search Query
query = 'US Economy'

# Encode special characters in a text string
def encode_special_characters(text):
    encoded_text = ''
    special_characters = {'&': '%26', '=': '%3D', '+': '%2B', ' ': '%20'}  # Add more special characters as needed
    for char in text.lower():
        encoded_text += special_characters.get(char, char)
    return encoded_text

query2 = encode_special_characters(query)
url = f"https://news.google.com/search?q={query2}&hl=en-US&gl=US&ceid=US%3Aen"

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

articles = soup.find_all('article')
links = [article.find('a')['href'] for article in articles]
links = [link.replace("./articles/", "https://news.google.com/articles/") for link in links]

news_text = [article.get_text(separator='\n') for article in articles]
news_text_split = [text.split('\n') for text in news_text]

news_df = pd.DataFrame({
    'Title': [text[2] for text in news_text_split],
    'Source': [text[0] for text in news_text_split],
    'Time': [text[3] if len(text) > 3 else 'Missing' for text in news_text_split],
    'Author': [text[4].split('By ')[-1] if len(text) > 4 else 'Missing' for text in news_text_split],
    'Link': links
})

# Write to CSV
news_df.to_csv('news.csv', index=False)
- The function encode_special_characters(text) replaces special characters such as '&' with their percent-encoded equivalents (for example '%26') so that the query can be placed safely in a URL. It also converts the query to lowercase.
- The code sends a request to the Google News URL using requests.get() and parses the returned HTML using BeautifulSoup.
- It finds all the article elements in the HTML and extracts the 'Title', 'Source', 'Time', 'Author' and 'Link' information by position in each article's text. If an article has no publication time or author, that field is set to 'Missing'.
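The position-based extraction described above can be sketched on its own, without any network access. The helper names field and article_row are hypothetical, and the sample lines and link are invented for illustration. The positions (0 for source, 2 for title, 3 for time, 4 for author) are simply the ones used in the code above and may not hold if Google changes its page layout.

```python
def field(lines, index, default="Missing"):
    """Return the stripped text at lines[index], or default if absent or blank."""
    if index < len(lines) and lines[index].strip():
        return lines[index].strip()
    return default


def article_row(lines, link):
    """Build one output row from the text lines of a single article."""
    author = field(lines, 4)
    if author != "Missing":
        # Drop a leading "By " prefix, as the main script does.
        author = author.split("By ")[-1]
    return {
        "Title": field(lines, 2),
        "Source": field(lines, 0),
        "Time": field(lines, 3),
        "Author": author,
        "Link": link,
    }


# Invented sample data, shaped like the split text of one article.
sample = ["Example Times", "More", "Markets rally", "2 hours ago", "By Jane Doe"]
row = article_row(sample, "https://news.google.com/articles/abc")
print(row)

# A short article falls back to 'Missing' instead of raising IndexError.
print(article_row(["Example Times"], "https://news.google.com/articles/xyz"))
```

Unlike indexing text[2] directly, this version also returns 'Missing' for the title when an article has fewer than three lines of text.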
If you want to see the latest news from Google News instead of searching for a topic, you can replace the 'url' variable with the line below:
url = "https://news.google.com/home?hl=en-US&gl=US&ceid=US%3Aen"
The URL contains the following parameters, which you can customize according to your language and country.
- hl=en-US: Language setting for the page where "hl" stands for "host language" and "en-US" refers to US English as the language.
- gl=US: Geographical location for the content.
- ceid=US:en: Country edition, here the US edition in English. In the URL the colon is percent-encoded, so it appears as ceid=US%3Aen.
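To try other editions, the search URL can be assembled from these parameters with Python's standard urllib.parse module, which also percent-encodes the query so a hand-written encoder is not needed. This is a minimal sketch: the function name build_search_url is made up here, and the India values in the usage example are an assumption that follows the same pattern as the US values.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://news.google.com/search"


def build_search_url(query, hl="en-US", gl="US", ceid="US:en"):
    """Build a Google News search URL with percent-encoded parameters."""
    params = {"q": query, "hl": hl, "gl": gl, "ceid": ceid}
    # quote_via=quote encodes spaces as %20 (the default, quote_plus, would use '+').
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


print(build_search_url("US Economy"))

# Assumed values for an English-language India edition.
print(build_search_url("cricket", hl="en-IN", gl="IN", ceid="IN:en"))
```

The first call produces the same URL as the main script, except that the query keeps its original capitalization.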