Web Shopping Mall Crawler

This project is a crawler built in 2019 as an assignment. It was written to crawl items from G-market and GS-shop at the URLs below.

  • Gmarket: https://m.gmarket.co.kr/n/superdeal?categoryCode=400000
  • Gsshop: https://www.gsshop.com/shop/sect/sectL.gs?sectid=1378

Description

Shopping mall Crawling Python Code

Environment

  • Database : MongoDB
  • Library
    • Lxml : used for static pages
    • Selenium : used to access dynamic pages
  • Distributed processing
    • asyncio : asynchronous processing
    • motor : MongoDB driver library for distributed processing
  • Crawling targets : G Market, GS Shop
  • Chrome Driver : https://sites.google.com/a/chromium.org/chromedriver/home

Prerequisite

| Module | Version | Description |
| --- | --- | --- |
| python | 3.6 | Basic |
| lxml | 4.5.1 | Crawling |
| selenium | 3.141.0 | Crawling |
| motor | 2.4.0 | Async Mongo driver |
| pymongo | 3.12.0 | Database |
| chrome driver | Your Chrome | Dynamic Page |

Extract Process

Before crawling, some information has to be hard-coded. There is a lot to consider: the URL of the shopping mall to collect from, which category to pull product information from, and how to reach the tags that hold the product information. There are technical questions as well: how to keep this work to a minimum when a shopping mall site changes, how to handle dynamic pages, and how to handle exceptions caused by errors in the product information.

For now, the collection process is kept as simple as possible: the shopping mall URLs and product tag information are stored, and crawling is driven by that information.
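The stored URL and tag information could be sketched as a small configuration table. This is a minimal sketch, not the repository's actual layout: `SiteConfig`, its field names, the XPath strings, and the `dynamic` flags are hypothetical placeholders. Only the two URLs come from this README.

```python
from typing import NamedTuple
from urllib.parse import urlencode


class SiteConfig(NamedTuple):
    """Hypothetical per-mall settings; the names are illustrative only."""
    name: str
    base_url: str
    category_param: str  # query parameter that selects the category
    category: str
    item_xpath: str      # placeholder; must be adapted to the site's current HTML
    dynamic: bool        # assumption: True means the page needs a browser to render

    def url(self) -> str:
        """Build the category URL from the base URL and the category parameter."""
        query = urlencode({self.category_param: self.category})
        return f"{self.base_url}?{query}"


SITES = [
    SiteConfig("gmarket", "https://m.gmarket.co.kr/n/superdeal",
               "categoryCode", "400000", "//li[@class='item']/a/text()", True),
    SiteConfig("gsshop", "https://www.gsshop.com/shop/sect/sectL.gs",
               "sectid", "1378", "//li[@class='item']/a/text()", False),
]

for site in SITES:
    print(site.name, site.url())
```

Keeping the URL, category, and XPath in one place means that a change to a mall's HTML should only require editing the matching entry rather than the crawling logic.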

The code takes the URLs of the two shopping malls, with the target category passed as a URL parameter, accesses each one asynchronously, scrapes the product information using XPath, stores it in a single list, and then loads it with an asynchronous Mongo insert.
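The flow described above (crawl each mall concurrently, merge the results into one list, then insert asynchronously) might look roughly like the sketch below. It is not the project's code: `crawl_site` returns made-up items instead of fetching pages, and `FakeCollection` is an in-memory stand-in that only mimics the awaitable `insert_many` of a motor collection, so the example runs without a network or a MongoDB server.

```python
import asyncio
from typing import Dict, List

TARGETS = {
    "gmarket": "https://m.gmarket.co.kr/n/superdeal?categoryCode=400000",
    "gsshop": "https://www.gsshop.com/shop/sect/sectL.gs?sectid=1378",
}


async def crawl_site(name: str, url: str) -> List[Dict[str, str]]:
    """Stand-in for fetching one mall and extracting its items with XPath."""
    await asyncio.sleep(0)  # placeholder for the real network I/O
    return [{"mall": name, "url": url, "title": f"{name} item {i}"} for i in range(2)]


class FakeCollection:
    """In-memory stand-in with an awaitable insert_many, like a motor collection."""

    def __init__(self) -> None:
        self.docs: List[Dict[str, str]] = []

    async def insert_many(self, docs: List[Dict[str, str]]) -> None:
        self.docs.extend(docs)


async def run(collection: FakeCollection) -> int:
    # Crawl both malls concurrently; gather keeps results in argument order.
    per_site = await asyncio.gather(
        *(crawl_site(name, url) for name, url in TARGETS.items())
    )
    # Flatten the per-mall lists into one list before loading.
    items = [item for site_items in per_site for item in site_items]
    if items:  # an empty batch is skipped rather than sent to the database
        await collection.insert_many(items)
    return len(items)


collection = FakeCollection()
loop = asyncio.new_event_loop()  # asyncio.run() is unavailable on Python 3.6
try:
    inserted = loop.run_until_complete(run(collection))
finally:
    loop.close()
print(inserted)
```

With a real database, `FakeCollection()` would be replaced by a collection obtained from motor's `AsyncIOMotorClient`; the database and collection names would depend on your setup.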

Static pages are handled with lxml; dynamic pages are handled with the Selenium headless Chrome driver.
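One way to express the static/dynamic split is sketched below, under stated assumptions: the function names are hypothetical, the static loader uses the standard library's `urlopen` for the download, and the dynamic loader assumes a chromedriver binary that matches your Chrome is on `PATH`. The Selenium and lxml imports are deferred so the module can be loaded even when one of them is missing.

```python
from typing import Callable, List
from urllib.request import urlopen


def load_static(url: str) -> str:
    """Fetch raw HTML over HTTP; enough when items are in the initial response."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def load_dynamic(url: str) -> str:
    """Render the page in headless Chrome so script-generated items are present."""
    from selenium import webdriver  # deferred: only needed for dynamic pages

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


def pick_loader(dynamic: bool) -> Callable[[str], str]:
    """Choose the cheaper static loader unless the page is known to be dynamic."""
    return load_dynamic if dynamic else load_static


def extract(page_html: str, xpath: str) -> List[str]:
    """Apply an XPath that selects text or attribute values, using lxml."""
    from lxml import html  # deferred so the module loads without lxml

    tree = html.fromstring(page_html)
    return [str(value).strip() for value in tree.xpath(xpath)]
```

Both loaders return an HTML string, so the same `extract` step can follow either one, for example `extract(pick_loader(False)(url), xpath)` for a static page.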

Usage

  1. pip install -r requirements.txt
  2. install chrome
  3. download chrome driver
  4. execute

This was written a while ago, so the parts that depend on the shopping malls' product HTML will probably need to be updated. Collection and loading can be run separately by commenting out the relevant code in main.py. Selenium is used only for dynamic pages; if a page is not dynamic, using lxml alone collects faster. The chrome driver uploaded with the code is for Windows.