A Python-based web scraper and crawler that extracts structured data from several websites, including sites that check request headers or protect their login forms with CSRF tokens.
- Books to Scrape
- Scraping Course: CSRF Protected Login
- Scrape This Site: Advanced Headers Challenge
- Custom headers to bypass basic anti-bot detection
- Automatic pagination support
- CSRF token retrieval and session-based login handling
- Output data to JSON
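
The features above can be illustrated with a minimal sketch. It is not the repository's actual code: the function names, the `_token` field name, the `li.next a` selector, and the header values are all assumptions chosen for illustration, and the real sites may use different markup.

```python
import json
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical header set; real values depend on what the target site checks.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/0.1",
    "Accept": "text/html,application/xhtml+xml",
}


def extract_csrf_token(html, field_name="_token"):
    """Return the value of the hidden CSRF input, or None if it is absent.

    The field name "_token" is an assumption; inspect the login form to
    find the name the site actually uses.
    """
    soup = BeautifulSoup(html, "html.parser")
    field = soup.find("input", attrs={"name": field_name})
    return field.get("value") if field else None


def next_page_url(html, current_url):
    """Resolve the "next" pagination link against the current URL.

    Returns None when there is no next link, which ends the crawl.
    The "li.next a" selector is an assumption about the page markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one("li.next a")
    return urljoin(current_url, link["href"]) if link else None


def to_json(records):
    """Serialize scraped records as indented JSON text."""
    return json.dumps(records, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    login_html = '<form><input type="hidden" name="_token" value="abc123"></form>'
    listing_html = '<li class="next"><a href="page-2.html">next</a></li>'
    print(extract_csrf_token(login_html))
    print(next_page_url(listing_html, "https://example.com/catalogue/page-1.html"))
    print(to_json([{"name": "Example Book", "price": "10.00"}]))
```

In a live run, the token would typically be posted back together with the credentials through a single `requests.Session`, after calling `session.headers.update(DEFAULT_HEADERS)`, so that the cookies issued with the login form are sent along with the login request.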
Scraper/
├── Sync/                 # Synchronous scraping modules
│   ├── Categories.py     # Gets all book categories
│   ├── NamePrice.py      # Extracts book names and prices
│   ├── Total.py          # Extracts book names and prices by category
│
├── async.py              # Asynchronous scraping module
├── header.py             # Manages headers/user-agents
├── Login.py              # Handles login/authentication
└── Scrape.json           # Scraping output for "async.py"

- Python 3.11.9
- requests: HTTP requests
- BeautifulSoup: HTML parsing
- asyncio, aiohttp
- json
- Clone the Repository
git clone https://github.com/Argu333/Scraper.git
cd Scraper
- Install the required libraries (if not already installed)
pip install requests
pip install beautifulsoup4
pip install aiohttp
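
To confirm that the installation succeeded, a small check along these lines may help. It is a sketch rather than part of the repository; the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names. Note that `beautifulsoup4` is imported as `bs4`, so the pip name and the import name differ.

```python
from importlib.util import find_spec

# Maps pip package names to the module names used in import statements.
REQUIRED = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "aiohttp": "aiohttp",
}


def missing_packages(required=None):
    """Return the pip names of packages whose module cannot be found."""
    required = REQUIRED if required is None else required
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

`asyncio` and `json` are part of the Python standard library, so they need no separate installation.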