IT Log

Record various IT issues and difficulties.

Web Scraping Essential → Selenium: Detailed Explanation (Part 2)


  • (3) Forced Wait:
  • Advanced: Manually Implement Page Waiting:



    Categories:

    1. Forced Wait
      time.sleep()
      Disadvantage: it is not adaptive. If the delay is too short, the element may not have loaded yet; if it is too long, time is wasted.
    2. Implicit Wait
    3. Explicit Wait
      Waits for a specific target element or condition.

      Using explicit waits with a lambda expression:

      #! python
      # -*- coding: utf-8 -*-
      # @Time    : 2024/10/10 66:66
      # @Author  : John Gu
      # @File    : wait_demo.py
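      from selenium import webdriver
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support.wait import WebDriverWait

      driver = webdriver.Chrome()
      driver.get('https://passport.bilibili.com/login')

      # Poll every 0.5 s for up to 30 s. until() returns the first truthy
      # value produced by the lambda, here the located element.
      sms_btn = WebDriverWait(driver, 30, 0.5).until(
          lambda dv: dv.find_element(
              By.XPATH,
              '//*[@id="app"]/div[2]/div[2]/div[3]/div[1]/div[3]'
          )
      )
      sms_btn.click()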
      

      If the logic is complex, you can use a custom function:

      (Some login pages show an image CAPTCHA whose src attribute is not set immediately and only appears after a delay. In that case the following approach can be used):

      #! python
      # -*- coding: utf-8 -*-
      # @Time    : 2024/10/10 6:66
      # @Author  : GuHanZhe
      # @File    : wait_demo.py
      import time

      from selenium import webdriver
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support.wait import WebDriverWait

      driver = webdriver.Chrome()
      driver.get('https://passport.bilibili.com/login')


      def func(dv):
          # until() calls this function every 0.5 s. A falsy return value (such as
          # None) means "keep waiting"; a truthy one ends the wait and is assigned
          # to the sms_btn variable.
          print("Checking whether the attribute is available ...")
          tag = dv.find_element(
              By.XPATH,
              '//*[@id="app"]/div[2]/div[2]/div[3]/div[1]/div[3]'
          )
          # "xxx" is a placeholder for the attribute to wait for,
          # for example "src" on a CAPTCHA image.
          img_src = tag.get_attribute("xxx")
          if img_src:
              return tag
          return None


      sms_btn = WebDriverWait(driver, 30, 0.5).until(func)
      sms_btn.click()
      time.sleep(2.5)
      # quit() closes the window and also ends the driver session
      driver.quit()
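
The polling that until() applies to a custom function can be modelled without a browser. The following is a simplified sketch, not Selenium's actual source: `until`, `WaitTimeout`, `FakeDriver`, and `src_is_ready` are hypothetical names used only for illustration, and `LookupError` stands in for the NoSuchElementException that WebDriverWait ignores by default.

```python
import time


class WaitTimeout(Exception):
    """Hypothetical stand-in for Selenium's TimeoutException."""


def until(driver, method, timeout=30.0, poll_frequency=0.5, ignored=(LookupError,)):
    """Call method(driver) repeatedly until it returns a truthy value.

    Simplified model of the explicit-wait loop: a falsy result or an ignored
    exception means "try again"; running past the deadline raises WaitTimeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            value = method(driver)
            if value:
                return value
        except ignored:
            pass  # element not there yet: treat it like a falsy result
        if time.monotonic() > deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll_frequency)


class FakeDriver:
    """Illustrative driver whose attribute only appears on the third lookup."""

    def __init__(self):
        self.lookups = 0

    def get_src(self):
        self.lookups += 1
        if self.lookups == 1:
            raise LookupError("element not in the DOM yet")
        return "captcha.png" if self.lookups >= 3 else None


def src_is_ready(dv):
    # Returns None (falsy) until the attribute has a value
    return dv.get_src()


fake = FakeDriver()
result = until(fake, src_is_ready, timeout=2.0, poll_frequency=0.01)
print(result, fake.lookups)  # prints: captcha.png 3
```

The first lookup fails with an ignored exception, the second returns None, and the third returns a truthy value, which becomes the return value of the wait.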
      

      Practice 1: Get an Attribute of a Specific Element on Baidu's Homepage

      #! python
      # -*- coding: utf-8 -*-
      # @Time    : 2024/10/10 6:66
      # @Author  : GuHanZhe
      # @File    : baidu_demo.py
      from selenium import webdriver
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
      from selenium.webdriver.support.wait import WebDriverWait

      driver = webdriver.Chrome()
      url = 'https://www.baidu.com/'
      driver.get(url)
      WebDriverWait(driver, 20, 0.5).until(
          EC.presence_of_element_located(
              (By.LINK_TEXT, 'hao123')
          )
      )
      '''
      The second parameter is the maximum wait time: 20 seconds.
      The third parameter is the polling interval: the condition is checked every 0.5 seconds.
          EC.presence_of_element_located(
              (By.LINK_TEXT, 'hao123')
          )
      EC is the module of wait conditions; presence_of_element_located means the node must be present in the DOM.
      Its argument is a locator tuple, here matching the link whose text is 'hao123'.
      Every 0.5 seconds the wait looks up the link by its text. If it is found, execution continues;
      if it is still missing after 20 seconds, a TimeoutException is raised.
      '''
      content = driver.find_element(By.LINK_TEXT, 'hao123').get_attribute('href')
      print(content)
      driver.quit()
      

      Practice 2: Log In to QQ Space (Qzone)

      #! python
      # -*- coding: utf-8 -*-
      # @Time    : 2024/10/10 6:66
      # @Author  : GuHanZhe
      # @File    : qzone_login_demo.py
      import time

      from selenium import webdriver
      from selenium.common.exceptions import NoSuchElementException
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
      from selenium.webdriver.support.wait import WebDriverWait

      driver = webdriver.Chrome()
      driver.get('https://qzone.qq.com/')

      # Explicit wait: continue only once the login iframe has been located
      locator = (By.XPATH, '//div[@class="login_wrap"]/iframe')
      WebDriverWait(
          driver=driver,
          timeout=5,
          poll_frequency=0.3,
          ignored_exceptions=(NoSuchElementException,),
      ).until(EC.presence_of_element_located(locator), message='Not Found')

      # Switch to the login iframe
      fr = driver.find_element(*locator)
      driver.switch_to.frame(fr)
      driver.find_element(By.XPATH, '//*[@id="switcher_plogin"]').click()
      time.sleep(1)
      # Replace the placeholders with a real QQ number and password
      driver.find_element(By.XPATH, '//*[@id="u"]').send_keys('QQ Number')
      time.sleep(1)
      driver.find_element(By.XPATH, '//*[@id="p"]').send_keys('Password')
      time.sleep(1)
      driver.find_element(By.ID, 'login_button').click()
      time.sleep(2)
      driver.quit()
      

      expected_conditions:

      expected_conditions is a module in selenium.webdriver.support that provides a set of ready-made conditions for checking page state. Passing one of these conditions to the until() or until_not() method of WebDriverWait lets you wait for exactly the state you need.

      Waiting conditions and their meanings:

      • title_is, title_contains: Check the page title, that is, whether it equals or contains the given text.
      • presence_of_element_located, presence_of_all_elements_located: Check whether elements are present in the DOM. Both take a locator tuple, e.g. (By.ID, 'kw'). The first passes as soon as one matching element is present and returns it; the second passes once at least one match is present and returns the list of all matching elements found at that moment.
      • visibility_of_element_located, invisibility_of_element_located, visibility_of: Check whether an element is visible. The first two take a locator tuple, while the third takes a WebElement. The first and third check that the node is visible; the second checks that it is invisible or absent.
      • text_to_be_present_in_element, text_to_be_present_in_element_value: The first checks whether the text of a node contains the given string; the second checks whether the node's value attribute contains it.
      • frame_to_be_available_and_switch_to_it: Checks whether a frame is available and, if so, switches to it. The argument can be a locator tuple or a frame reference passed directly (frame name or id, index, or WebElement).

      For detailed parameters and usage of each waiting condition, refer to the expected_conditions section of the official Selenium documentation.
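
When no built-in condition fits, a custom one can follow the same pattern: a callable that takes the driver and returns a truthy value once the desired state is reached. Below is a minimal sketch under stated assumptions: `attribute_is_set`, `FakeElement`, and `FakeDriver` are hypothetical names, the "captcha" id is made up, and the fake classes exist only so the sketch runs without a browser.

```python
def attribute_is_set(locator, name):
    """Build a wait condition: the element exists and attribute `name` is non-empty.

    `locator` is a (by, value) tuple such as (By.ID, 'kw').
    """
    def _condition(driver):
        element = driver.find_element(*locator)
        # get_attribute() returns None while the attribute is missing
        return element if element.get_attribute(name) else False

    return _condition


class FakeElement:
    """Illustrative stand-in for a WebElement."""

    def __init__(self, attributes):
        self.attributes = attributes

    def get_attribute(self, name):
        return self.attributes.get(name)


class FakeDriver:
    """Illustrative stand-in for a WebDriver that holds a single element."""

    def __init__(self, element):
        self.element = element

    def find_element(self, by, value):
        return self.element


image = FakeElement({})
# "id" is the string value of By.ID; "captcha" is a made-up element id
condition = attribute_is_set(("id", "captcha"), "src")
driver = FakeDriver(image)

print(condition(driver))           # False: src has not been set yet
image.attributes["src"] = "captcha.png"
print(condition(driver) is image)  # True: the element itself is returned
```

With a real driver the same condition would be passed to the wait, for example WebDriverWait(driver, 30, 0.5).until(attribute_is_set((By.ID, 'captcha'), 'src')).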

      (2) Implicit Wait (implicitly_wait(xx))

      • An implicit wait sets a maximum time for element lookups. When find_element() cannot find the element immediately, the driver keeps retrying until the element appears or the timeout expires. If the element is found earlier, execution continues at once; if the timeout expires, a NoSuchElementException is raised. The drawbacks are that the setting applies to every lookup, so searching for an element that is genuinely absent always costs the full timeout, and that it only checks presence, not visibility or clickability. An implicit wait applies for the entire lifetime of the driver, so it only needs to be set once. Implicit and explicit waits can technically be set together, but the Selenium documentation advises against mixing them because the resulting wait times become unpredictable. By default, the implicit wait timeout is 0 seconds.

      Practice 1: Get an Attribute of a Specific Element on Baidu's Homepage

      #! python
      # -*- coding: utf-8 -*-
      # @Time    : 2024/10/10 6:66
      # @Author  : GuHanZhe
      # @File    : bd_login_demo.py
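      from selenium import webdriver
      from selenium.webdriver.common.by import By

      driver = webdriver.Chrome()
      url = 'https://www.baidu.com/'
      # Set a maximum waiting time of 10 seconds for all element lookups
      driver.implicitly_wait(10)
      driver.get(url)
      # Same lookup as in the explicit-wait version of this practice;
      # find_element() now retries for up to 10 seconds before raising NoSuchElementException
      content = driver.find_element(By.LINK_TEXT, 'hao123').get_attribute('href')
      print(content)
      driver.quit()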
      

      (3) Forced Wait (time.sleep(xx))

      A forced wait blocks the program for a fixed period of time, regardless of the state of the page, using time.sleep().
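
The trade-off between a fixed sleep and polling can be illustrated without a browser. This is a minimal sketch: `ready_after`, `forced_wait`, and `polling_wait` are hypothetical helpers, and the 0.2-second delay simulates an element that takes that long to load.

```python
import time


def ready_after(delay):
    """Return a check that becomes True once `delay` seconds have passed."""
    start = time.monotonic()
    return lambda: time.monotonic() - start >= delay


def forced_wait(is_ready, seconds):
    """Sleep for a fixed time, then report whether the resource is ready."""
    time.sleep(seconds)
    return is_ready()


def polling_wait(is_ready, timeout, interval=0.05):
    """Check repeatedly and return as soon as the resource is ready."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(interval)
    return is_ready()


# Too short: the simulated element needs 0.2 s but we only sleep 0.05 s
print(forced_wait(ready_after(0.2), 0.05))   # False

# Polling stops shortly after 0.2 s instead of using the full 2 s budget
begin = time.monotonic()
found = polling_wait(ready_after(0.2), timeout=2.0)
elapsed = time.monotonic() - begin
print(found, elapsed < 1.0)
```

A fixed sleep either fails when it is too short or wastes the difference when it is too long, whereas polling pays at most one extra interval.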

      Advanced: Manually Implement Page Waiting:

      Principle:

      • Combine the ideas of forced wait and explicit wait: in a loop, check whether the target element has loaded and exists, sleep briefly between checks, and stop after a maximum number of attempts.

      Practice 1: Automatically Scroll Down the Taobao Page to Obtain a Specific Element's Attributes

      # -*- coding: utf-8 -*-
      # @Time    : 2024/10/10
      # @Author  : Gu Han Zhe
      # @File    : bd_login_demo.py
      import time
      from selenium import webdriver
      from selenium.common.exceptions import NoSuchElementException
      from selenium.webdriver.common.by import By
      
      def wait_for_element(driver, xpath, max_attempts=30, interval=0.5):
          """
          Manually implement explicit waiting: wait for the target element to load completely or exist.
          :param driver: WebDriver instance
          :param xpath: Target element's XPATH
          :param max_attempts: Maximum number of attempts
          :param interval: Time interval between each attempt (seconds)
          :return: Returns the found element or None
          """
          for attempt in range(max_attempts):
              try:
                  element = driver.find_element(By.XPATH, xpath)
                  if element.is_displayed():
                      print(f"[INFO] Element found: {xpath} (Attempt number: {attempt + 1})")
                      return element
              except NoSuchElementException:
                  pass
              time.sleep(interval)
          print(f"[ERROR] Element not found: {xpath} (Maximum attempts: {max_attempts})")
          return None
      
      def main():
          driver = webdriver.Chrome()
          driver.get('https://www.taobao.com')
          target_xpath = '/html/body/div[12]/div/div/h3'
          element = wait_for_element(driver, target_xpath, max_attempts=30, interval=1)
          if element:
              print(f"Target element text: {element.text}")
          else:
              print("Target element not found; operation terminated.")
          driver.quit()
      
      if __name__ == "__main__":
          main()
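
The listing above polls without scrolling. On a page that loads content lazily, the target may only appear after scrolling down, so a scroll step can be added to the loop. The following is a sketch under stated assumptions: `scroll_until_found` and `FakePage` are hypothetical names, the step size and pause are arbitrary, and the fake page exists only so the example runs without a browser. It relies on find_elements(), which returns an empty list instead of raising when nothing matches, and on execute_script() to scroll.

```python
import time


def scroll_until_found(driver, xpath, max_scrolls=20, step=800, pause=0.5):
    """Scroll down step by step until an element matching xpath exists.

    Returns the first match, or None if nothing appears after max_scrolls scrolls.
    """
    for attempt in range(max_scrolls + 1):
        # By.XPATH is the string "xpath"
        matches = driver.find_elements("xpath", xpath)
        if matches:
            return matches[0]
        if attempt < max_scrolls:
            # Scroll the window down by `step` pixels, then let the page load
            driver.execute_script("window.scrollBy(0, arguments[0]);", step)
            time.sleep(pause)
    return None


class FakePage:
    """Illustrative driver: the element only 'loads' once scrolled past 2000 px."""

    def __init__(self):
        self.offset = 0

    def find_elements(self, by, value):
        return ["<h3>"] if self.offset >= 2000 else []

    def execute_script(self, script, *args):
        self.offset += args[0]


page = FakePage()
print(scroll_until_found(page, "//h3", pause=0))  # prints: <h3>
print(page.offset)                                # prints: 2400
```

With a real driver, the returned value would be a WebElement whose text or attributes can then be read as in main() above.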
      





