Post

How to handle Data, and Images(20) Web Crawling

Using Web Crawling to get Data

Lesson Notes in .ipynb file

How to handle Data, and Images(20) - Web Crawling

Topics

Web Crawler

  • Web Crawling is an automated method to retrieve data
  • Can get needed information from any service
  • Usually used with Python

Getting HTML code from specific website (1)

1
2
3
4
5
6
7
8
9
import requests

# make a request on specific URL
request = requests.get('https://hyeonukim.github.io/devblog/')

# after connecting, get website's source code
html = request.text.strip()

print(html)

Getting HTML code from specific website (2)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
import requests
from bs4 import BeautifulSoup

# make a request on specific URL
request = requests.get('https://hyeonukim.github.io/devblog/')

# after connecting, get website's source code
html = request.text.strip()

# turn HTML source code into python's BeautifulSoup variable
soup = BeautifulSoup(html, 'html.parser')

# get all the elements consisting of <a> tag
links = soup.select('a')

# for all link 
for link in links:
  # if link has 'href' attribute
  if link.has_attr('href'):
    # href attribute has text 'posts' included
    if link.get('href').find('posts') != -1:
      print(link.text)

Output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Leetcode 167. Two Sum II - Input Array Is SortedExplanation for Leetcode 167 - Two Sum II - Input Array Is Sorted, and its solution in Python.   Jan 18, 2025    Leetcode, Two Pointers, Medium 
Leetcode 15. 3SumExplanation for Leetcode 15 - 3Sum, and its solution in Python.   Jan 18, 2025    Leetcode, Two Pointers, Medium 
Leetcode 36. Valid SudokuExplanation for Leetcode 36 - Valid Sudoku, and its solution in Python.   Jan 17, 2025    Leetcode, Arrays & Hashing, Medium 
Leetcode 238. Product of Array Except SelfExplanation for Leetcode 238 - Product of Array Except Self, and its solution in Python.   Jan 17, 2025    Leetcode, Arrays & Hashing, Medium 
Leetcode 128. Longest Consecutive SequenceExplanation for Leetcode 128 - Longest Consecutive Sequence, and its solution in Python.   Jan 17, 2025    Leetcode, Arrays & Hashing, Medium 
How to handle Data, and Images(19) Matplotlib UsageDrawing different types of Graphs with Matploblib   Jan 16, 2025    Handling Data and Images, Matplotlib 
How to handle Data, and Images(18) Matplotlib IntroIntroduction to Matplotlib and how to plot a line graph and save the figure   Jan 16, 2025    Handling Data and Images, Matplotlib 
Leetcode 49. Group AnagramsExplanation for Leetcode 49 - Group Anagrams, and its solution in Python.   Jan 16, 2025    Leetcode, Arrays & Hashing, Medium 
Leetcode 347. Top K Frequent ElementsExplanation for Leetcode 347 - Top K Frequent Elements, and its solution in Python.   Jan 16, 2025    Leetcode, Arrays & Hashing, Medium 
How to handle Data, and Images(18) Pandas UsageMore in depth of some operations that Pandas offers and their usage   Jan 15, 2025    Handling Data and Images, Pandas 
Leetcode 15. 3Sum
Leetcode 167. Two Sum II - Input Array Is Sorted
Leetcode 238. Product of Array Except Self
Leetcode 128. Longest Consecutive Sequence
Leetcode 36. Valid Sudoku

Summary

  • Press f12 on website to get the HTML code
  • we can select specific class with BeautifulSoup.select
This post is licensed under CC BY 4.0 by the author.