This project built a web scraping solution that extracts comprehensive data for each National Drug Code (NDC) from multiple sources, including SAM.gov, DailyMed, and the FDA Orange Book. The goal was an adaptable, robust pipeline that handles dynamic content, varied data formats, and anti-scraping mechanisms, so that current and accurate data is consistently available for querying and analysis.
Key Features:
- Advanced 'element-hunting' mechanism for dynamic layout handling
- PDF and Word document data extraction
- IP proxy rotation and anti-bot measures
- Scheduled bi-monthly automated scraping
- REST API integration for secure data access
Project Challenges:
- Handling Dynamic Content and Diverse Layouts: Each source displayed unique layouts and data structures. To address this, we developed an AI-powered 'element-hunting' system that intelligently detected relevant data points based on text patterns, proximity, and hierarchy.
- Extracting Data from PDFs and Word Files: Some data was embedded in downloadable files. We implemented a solution to detect and retrieve these files, using libraries like PyMuPDF, pdfplumber, and python-docx to extract relevant text, which required custom parsing to isolate the necessary information.
- Managing IP Blocking and Proxy Rotation: To avoid IP bans, we established a proxy rotation system to balance requests across multiple IPs, with dynamic delays and retry mechanisms to mimic human behavior.
- Automating Scheduled Scraping: To ensure regular updates, we set up cron jobs that triggered the scripts bi-monthly, reducing the need for manual intervention.
- Storing and Serving Data: We stored all extracted data in a PostgreSQL database and created a Django REST API to serve data securely to a React frontend.
- Frontend Responsiveness and API Integration: Building a responsive frontend that integrated seamlessly with the backend APIs posed challenges. Ensuring smooth data loading, handling API response delays, and creating a consistent, user-friendly experience across devices required optimizing both the frontend code and API response handling.
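To make the 'element-hunting' idea from the first challenge more concrete, here is a minimal sketch of label-and-proximity matching: find a text node that matches a label pattern, then take the value that follows it, whether it sits in the same node or in a nearby one. This is an illustration only, not the project's actual system. It uses Python's standard-library `html.parser` instead of Selenium or Beautiful Soup so that it runs without extra dependencies, and the names `TextNodeCollector` and `hunt_value`, the `max_distance` parameter, and the sample NDC value are all assumptions made up for the example.

```python
import re
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect non-empty text nodes in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse internal whitespace
        if text:
            self.nodes.append(text)


def hunt_value(html, label_pattern, max_distance=3):
    """Return the text closest after a label matching `label_pattern`, or None."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    label_re = re.compile(label_pattern, re.IGNORECASE)

    for i, text in enumerate(collector.nodes):
        match = label_re.search(text)
        if not match:
            continue
        # Case 1: label and value share one node, e.g. "NDC Code: 12345-6789-01".
        remainder = text[match.end():].lstrip(" :-").strip()
        if remainder:
            return remainder
        # Case 2: the value is in one of the next few text nodes (proximity).
        for candidate in collector.nodes[i + 1 : i + 1 + max_distance]:
            if any(ch.isalnum() for ch in candidate):
                return candidate
    return None


# Three made-up layouts carrying the same made-up value.
LAYOUTS = [
    "<table><tr><th>NDC Code</th><td>12345-6789-01</td></tr></table>",
    "<div><span>NDC Code:</span> <b>12345-6789-01</b></div>",
    "<p>NDC Code: 12345-6789-01</p>",
]

if __name__ == "__main__":
    for layout in LAYOUTS:
        print(hunt_value(layout, r"NDC\s*Code"))
```

The same call works across all three layouts because it keys on the label text rather than on a fixed CSS selector or XPath. A production version would presumably also weigh DOM hierarchy, as the challenge description mentions, which this sketch leaves out.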
Technologies Used:
- Python: Core language for backend data processing and automation
- Selenium: Used for automated browser interaction to handle dynamic content
- Beautiful Soup: HTML and XML parsing library for data extraction
- Django: Backend framework to build secure APIs and manage data
- React.js: Frontend framework for an interactive and user-friendly interface
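The proxy rotation with dynamic delays and retries described under Project Challenges might look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the project's implementation: `fetch_with_rotation` and its parameters are hypothetical names, the proxy addresses are placeholders, and the actual HTTP call is passed in as a `fetch(url, proxy)` function so the example stays free of third-party dependencies.

```python
import itertools
import random
import time


def fetch_with_rotation(url, proxies, fetch, max_attempts=4,
                        base_delay=1.0, sleep=time.sleep):
    """Try `url` through successive proxies, backing off between failures.

    `fetch(url, proxy)` is a caller-supplied function (an assumption of this
    sketch) that returns the response body or raises on a block or timeout.
    """
    proxy_cycle = itertools.cycle(proxies)  # round-robin over the proxy pool
    last_error = None
    for attempt in range(max_attempts):
        proxy = next(proxy_cycle)
        try:
            return fetch(url, proxy)
        except Exception as exc:  # a real scraper would catch narrower errors
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff plus random jitter, so that retries
                # are not evenly spaced.
                delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
                sleep(delay)
    raise RuntimeError(
        f"all {max_attempts} attempts failed for {url}"
    ) from last_error


if __name__ == "__main__":
    # Placeholder proxies; the first one always "blocks" in this demo.
    pool = ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]

    def fake_fetch(url, proxy):
        if proxy == pool[0]:
            raise ConnectionError("blocked")
        return f"fetched {url} via {proxy}"

    print(fetch_with_rotation("https://example.com/ndc", pool, fake_fetch,
                              base_delay=0.01))
```

Passing `fetch` and `sleep` in as arguments keeps the rotation logic testable without network access or real waiting; in practice `fetch` would wrap whichever HTTP client or browser driver the scraper uses.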