💖 Welcome to my blog! It’s a pleasure to meet you here. On this platform, you can not only gain interesting technical insights but also experience an enjoyable and relaxed atmosphere. Whether you’re a programming newbie or a seasoned developer, you’ll find valuable knowledge here to learn and grow.
🔍 Blog content includes:
- Core Java and Microservices: Covers Java fundamentals, the JVM, concurrent programming, Redis, Kafka, Spring, and more, helping you master enterprise-level development technologies.
- Big Data Technology: Includes Hadoop (HDFS), Hive, Spark, Flink, Kafka, Redis, ECharts, Zookeeper, and related techniques.
- Development Tools: Shares usage tips for common development tools like IDEA, Git, Mac, Alfred, Typora, etc., to enhance your productivity.
- Databases and Optimization: Summarizes MySQL and other commonly used database technologies, helping you solve the practical database issues you encounter at work.
- Python and Big Data: Focuses on in-depth learning of the Python programming language, data analysis tools (such as Pandas, NumPy), and big data processing technologies, helping you master data analysis, data mining, machine learning, etc.
- Data Structures and Algorithms: Summarizes core knowledge of data structures and algorithms, enhancing your programming thinking, and helping you tackle challenges in interviews with major companies.
🌟 My goal: Continuous learning and summarization, sharing technical insights and solutions, and exploring the infinite possibilities of technology together with you! Here, I hope to grow with you, inspire each other, and become better versions of ourselves.
📣 Welcome to subscribe to this column, join me in learning, sharing, and growing in this ocean of knowledge! 💻🚀
📍Copyright Notice: All content on this blog is original and licensed under CC 4.0 BY-SA. Please cite the source when republishing.
Table of Contents
I. Basic Concepts of Web Scraping
  1. Definition of a Web Crawler
  2. Main Workflow of a Web Crawler
  3. Popular Python Tools
II. Environment Setup
  1. Install Python
  2. Install Required Libraries
III. Writing Your First Simple Crawler
  1. Complete Code Example
  2. Step-by-Step Code Explanation
IV. Improving Crawler Functionality
  1. Adding Request Headers
  2. Controlling Crawl Rate
  3. Saving Data
V. Handling Complex Web Pages
  1. Dynamically Loaded Web Pages
  2. Scraping Images or Files
VI. Important Considerations
  1. Respecting Legal and Ethical Guidelines
  2. Handling Exceptions
  3. Avoiding Excessive Request Frequency
Web scraping is a technique for automatically fetching data from web pages with a program. For beginners, writing a simple crawler in Python is an excellent introductory project: Python offers powerful libraries, such as requests and BeautifulSoup, that make it easy to extract data from web pages.
In this article, we will start from the basic concepts of web scraping, gradually implement a simple crawler capable of extracting web content, and then look at how to improve it to handle more complex scenarios. The following topics are covered:
I. Basic Concepts of Web Scraping
1. Definition of a Web Crawler
A web crawler (also known as a spider) is an automated script or program that simulates user behavior when accessing web pages and extracts specific content from them.
2. Main Workflow of a Web Crawler
Typical steps in a web crawler task include:
- Send Request: Access the target webpage over the HTTP protocol and obtain its HTML content.
- Parse Data: Parse the obtained HTML and extract the specific data we need.
- Store Data: Save the extracted data to files or a database for subsequent processing.
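As a preview of the full example in Section III, here is a minimal sketch of how these three steps map to Python code (the output filename title.txt is just a placeholder for this illustration):

```python
import requests
from bs4 import BeautifulSoup

# 1. Send the request
response = requests.get("https://example.com")

# 2. Parse the data
soup = BeautifulSoup(response.text, "html.parser")
title = soup.find("title").text

# 3. Store the data (here, a simple text file; "title.txt" is an arbitrary example name)
with open("title.txt", "w", encoding="utf-8") as f:
    f.write(title)
```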
3. Popular Python Tools
- requests: Sends HTTP requests and retrieves web content.
- BeautifulSoup: Parses HTML or XML and extracts specific content.
- re (regular expressions): Matches and extracts complex text patterns.
- pandas: Cleans and analyzes data.
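requests and BeautifulSoup are demonstrated throughout this article; as a quick illustration of where the last two tools can fit in, the sketch below (the text string is made up for the example) uses re to pull price-like patterns out of scraped text and pandas to load them into a DataFrame:

```python
import re
import pandas as pd

text = "Item A: $19.99, Item B: $5.49"        # hypothetical scraped text
prices = re.findall(r"\$\d+\.\d{2}", text)    # match patterns like $19.99
df = pd.DataFrame({"price": prices})          # load into a DataFrame for cleaning/analysis
print(df)
```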
II. Environment Setup
1. Install Python
Ensure that Python is installed on your computer (version 3.7 or higher is recommended). If it is not installed yet, download it from the official Python website (https://www.python.org).
2. Install Required Libraries
Open the command line or terminal and run the following command to install the necessary Python libraries:
```bash
pip install requests beautifulsoup4
```
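If you want to confirm that both libraries installed correctly, an optional quick check is to import them and print their versions:

```python
import requests, bs4

print(requests.__version__, bs4.__version__)
```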
- requests: for sending HTTP requests.
- beautifulsoup4: for parsing HTML data.
III. Writing Your First Simple Crawler
We will implement a simple crawler that fetches the title and main text content of a webpage.
1. Complete Code Example
The following code implements a basic crawler:
```python
import requests
from bs4 import BeautifulSoup

def simple_crawler(url):
    try:
        # 1. Send the request
        response = requests.get(url)
        response.raise_for_status()  # Check if the request was successful

        # 2. Parse the webpage content
        soup = BeautifulSoup(response.text, 'html.parser')

        # 3. Extract title and paragraph content
        title = soup.find('title').text     # Get the webpage title
        paragraphs = soup.find_all('p')     # Get all paragraph contents

        print(f"Webpage title: {title}\n")
        print("Webpage content:")
        for p in paragraphs:
            print(p.text)

    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")

# Example URL
url = "https://example.com"  # Replace with the webpage you want to crawl
simple_crawler(url)
```
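If you run this against https://example.com, the output should look roughly like the following (the exact text depends on the page at the time you run it):

```
Webpage title: Example Domain

Webpage content:
This domain is for use in illustrative examples in documents. ...
```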
2. Step-by-Step Code Explanation
1) Send the HTTP Request
```python
response = requests.get(url)
```
- The requests.get() method sends a GET request to the target URL.
- The returned response object contains everything the server sent back, including the HTML source code.
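Beyond .text, the response object exposes several attributes that are useful for debugging; a small illustrative snippet (assuming url is defined as in the example above):

```python
response = requests.get(url)
print(response.status_code)              # HTTP status code, e.g. 200
print(response.encoding)                 # the text encoding requests detected
print(response.headers["Content-Type"])  # e.g. "text/html; charset=UTF-8"
print(response.text[:200])               # first 200 characters of the HTML source
```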
2) Check the Request Status
```python
response.raise_for_status()
```
- raise_for_status() checks whether the request succeeded. If the returned HTTP status code indicates an error (such as 404 or 500), it raises an exception.
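To see this behaviour in action, you can point the request at an endpoint that deliberately returns an error; the sketch below uses httpbin.org, a public testing service, purely as an example:

```python
import requests

response = requests.get("https://httpbin.org/status/404")
try:
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    print(f"Caught HTTP error: {e}")  # e.g. "404 Client Error: NOT FOUND for url: ..."
```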
3) Parse the HTML Data
```python
soup = BeautifulSoup(response.text, 'html.parser')
```
- The BeautifulSoup class parses the HTML content and converts it into a Python object that is easier to work with.
- The second argument, 'html.parser', tells BeautifulSoup to use Python's built-in HTML parser.
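If you have the third-party lxml package installed (pip install lxml), you can optionally swap in its faster parser; this is not required for the examples in this article:

```python
soup = BeautifulSoup(response.text, "lxml")  # requires the lxml package
```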
4) Extract the Web Content
```python
title = soup.find('title').text   # Get the webpage title
paragraphs = soup.find_all('p')   # Get all paragraph tags
```
- The find('title') method returns the <title> tag; its .text attribute gives the page title.
- The find_all('p') method returns all <p> paragraph tags as a list.
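find() and find_all() also accept attribute filters, and CSS selectors are available through select(); a short illustrative sketch (the class name article-body is made up):

```python
links = soup.find_all('a', href=True)                # all <a> tags that have an href attribute
first_div = soup.find('div', class_='article-body')  # hypothetical class name
headings = soup.select('h1, h2')                     # CSS selector: all <h1> and <h2> tags

for link in links:
    print(link['href'])
```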
5) Print the Results
```python
for p in paragraphs:
    print(p.text)
```
- Iterate over the extracted paragraphs and print the text of each one.
IV. Improving Crawler Functionality
1. Adding Request Headers
Some websites detect spider programs and block access. You can simulate browser access by adding request headers.
```python
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get(url, headers=headers)
```
2. Controlling Crawl Rate
To avoid high load on the target website, add delays after each request.
```python
import time

# Pause between requests; 2 seconds is an arbitrary example value
time.sleep(2)
response = requests.get(url, headers=headers)
```
3. Saving Data
Save the crawled data to files or a database.
Save to a text file:
```python
# Save the extracted content to a text file
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(f"Webpage title: {title}\n\n")
    for p in paragraphs:
        f.write(p.text + "\n")
```
Save to CSV file:
```python
import csv

with open("output.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Paragraph content"])
    for p in paragraphs:
        writer.writerow([p.text])
```
V. Handling Complex Web Pages
1. Dynamically Loaded Web Pages
For webpages rendered with JavaScript, requests alone cannot fetch the complete content; use a browser-automation tool such as selenium or playwright instead.
Example (using selenium):
```python
from selenium import webdriver
from bs4 import BeautifulSoup

# Requires: pip install selenium (Selenium 4.6+ manages the browser driver automatically)
driver = webdriver.Chrome()
driver.get("https://example.com")   # Load the page and let JavaScript run

html = driver.page_source           # Fully rendered HTML
soup = BeautifulSoup(html, "html.parser")
print(soup.find("title").text)

driver.quit()
```
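If you prefer playwright, which this section also mentions, a roughly equivalent sketch looks like this; it assumes you have run pip install playwright followed by playwright install:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")   # Load the page and let JavaScript run
    html = page.content()              # Fully rendered HTML
    print(page.title())
    browser.close()
```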
2. Scraping Images or Files
```python
import os
import requests

# Download an image
img_url = "https://example.com/image.jpg"
response = requests.get(img_url)
with open(os.path.basename(img_url), "wb") as f:   # Saves as "image.jpg"
    f.write(response.content)
```
VI. Important Considerations
1. Respecting Legal and Ethical Guidelines
- Avoid illegal activities: ensure your scraping complies with the website's terms of service.
- Adhere to robots.txt restrictions: check the site's robots.txt file to see the crawling limitations imposed by the target website.
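Python's standard library can read robots.txt for you; a small sketch using urllib.robotparser (the URLs are placeholders):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Check whether a generic crawler ("*") may fetch a given page
print(rp.can_fetch("*", "https://example.com/some-page"))
```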
2. Handling Exceptions
Add error handling logic for failed network requests or missing data:
```python
try:
    response = requests.get(url)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
```
3. Avoiding Excessive Request Frequency
Implement delays or use proxy IPs:
```python
proxies = {
    "http": "http://123.45.67.89:8080",
    "https": "http://123.45.67.89:8080"
}
response = requests.get(url, proxies=proxies)
```