Web Scraping and Analytics Solution for Automated Tenant Screening
About Our Client
The Client is a US tenant screening company that delivers applicant reliability reports to rental service providers.
Web Scraping Needed to Automate Tenant Background Checks
The Client manually screened publicly available government databases to gather and update data on tenants’ criminal history. This approach was time-consuming and did not guarantee timely detection of changes in a tenant’s background. The Client wanted to automate the process with a web scraping system but needed expert consulting on the technical aspects and the legalities of implementing such a solution. The company turned to ScienceSoft, trusting our 35 years of experience in providing data management and analytics services and our deep knowledge of data protection regulations.
Designing Web Scraping Solution Architecture With an Analytics Module
ScienceSoft held several interviews with the Client’s team in charge of tenant data screening to elicit their requirements for the web scraping solution. Our team studied the screened databases, the retrieved data, and the ways the Client’s team uses it for reporting to rental service providers (e.g., preparing summaries of tenant criminal history, creating risk assessment reports).
As a result, ScienceSoft concluded that the Client needed not just a web scraping tool but an analytics solution that would cleanse and interpret the scraped data. Such an approach would allow the Client to automate both data gathering and reporting, which was the Client’s main goal for the project.
The Client asked ScienceSoft to create a high-level architecture of the analytics solution to help the stakeholders understand the overall functionality and benefits of the system and make a final decision on implementation. In two days, our team delivered the architecture design that would support the following processes and data flows:
Data crawling and parsing
Each government website is screened by dedicated crawlers and parsers designed to navigate the database of that specific website. The crawler maps out the database structure, and the parser extracts the relevant information (e.g., tenant name and demographics, committed offenses). If a web page contains links to other useful pages, those pages are opened and parsed like the parent page.
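To make this flow concrete, below is a minimal Python sketch of a crawler-parser pair built with requests and BeautifulSoup. The URL, CSS selectors, and field names are hypothetical illustrations, not the actual solution code; each real government site would need its own navigation logic.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Hypothetical entry point; every real government database differs.
BASE_URL = "https://records.example.gov/search?name=doe"

def parse_record(soup: BeautifulSoup) -> dict:
    """Extract the relevant fields; the CSS selectors are assumptions."""
    return {
        "name": soup.select_one(".defendant-name").get_text(strip=True),
        "offense": soup.select_one(".offense-description").get_text(strip=True),
    }

def crawl(url: str, visited: set) -> list:
    """Fetch a page, parse it, and follow in-site links to related records."""
    if url in visited:
        return []
    visited.add(url)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    records = [parse_record(soup)]
    # Linked pages are opened and parsed like the parent page.
    for link in soup.select("a.related-record[href]"):
        records += crawl(urljoin(url, link["href"]), visited)
    return records

records = crawl(BASE_URL, set())
```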
Data cleansing and upload to raw data storage
The scraped data is cleansed and standardized (e.g., unifying date formats, removing duplicate entries, transforming offense types into numerical codes) and lands in the raw data storage. The storage also features a metadata catalog that keeps additional information about the scraped data, such as the source and the date of scraping.
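As an illustration, a cleansing step of this kind might look like the following pandas sketch. The column names and offense-code mapping are assumptions, not the Client’s actual schema.

```python
import pandas as pd

# Hypothetical mapping of offense types to numerical codes.
OFFENSE_CODES = {"misdemeanor": 1, "felony": 2, "infraction": 3}

def cleanse(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    # Unify date formats: coerce every variant into a single datetime type.
    df["offense_date"] = pd.to_datetime(df["offense_date"], errors="coerce")
    # Remove duplicate entries scraped from overlapping sources.
    df = df.drop_duplicates(subset=["tenant_name", "offense_type", "offense_date"])
    # Transform offense types into numerical codes.
    df["offense_code"] = df["offense_type"].str.lower().map(OFFENSE_CODES)
    # Metadata for the catalog: the scraping date travels with the rows.
    df["scraped_at"] = pd.Timestamp.now(tz="UTC")
    return df
```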
Data analytics and reporting
The cleaned data is further transformed to fit the chosen analytics model. For instance, it can be beneficial to group offenses into crime types or merge related fields (e.g., criminal sentence start and end dates can be merged into a criminal sentence duration column). The highly structured data is stored in a data warehouse (DWH) that is optimized for automated reporting and ad hoc analytics queries.
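A transformation of this kind could be sketched as follows; again, the column names and the offense-to-crime-type grouping are hypothetical.

```python
import pandas as pd

# Hypothetical grouping of granular offenses into broader crime types.
CRIME_TYPES = {"theft": "property", "burglary": "property", "assault": "violent"}

def transform_for_dwh(clean: pd.DataFrame) -> pd.DataFrame:
    df = clean.copy()
    # Merge sentence start and end dates into a single duration column.
    df["sentence_duration_days"] = (
        pd.to_datetime(df["sentence_end"]) - pd.to_datetime(df["sentence_start"])
    ).dt.days
    # Group offenses into crime types for reporting.
    df["crime_type"] = df["offense_type"].map(CRIME_TYPES).fillna("other")
    return df.drop(columns=["sentence_start", "sentence_end"])
```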
Error reporting
The solution captures error events at every step of the data processing flow and creates error reports to enable timely identification of solution issues and their remediation.
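Below is a minimal sketch of such error capture using Python’s standard logging module; the step names and log destination are assumptions.

```python
import logging

logging.basicConfig(
    filename="pipeline_errors.log",  # assumed destination for error reports
    level=logging.ERROR,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logger = logging.getLogger("tenant_screening")

def run_step(step_name, func, *args, **kwargs):
    """Run one pipeline step; capture any error event before re-raising."""
    try:
        return func(*args, **kwargs)
    except Exception:
        # The step name ties the failure to crawling, cleansing, or loading,
        # enabling timely identification and remediation.
        logger.exception("Step '%s' failed", step_name)
        raise
```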
Data orchestration
All the described processes are scheduled and automated with a data orchestration tool. ScienceSoft recommended Apache Airflow for its rich scheduling and monitoring functionality and its cost-efficiency as an open-source tool.
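For orchestration, a minimal Airflow DAG (Airflow 2.4+ syntax) could look like the sketch below. The task bodies are placeholders for the crawling, cleansing, and DWH-loading steps described above; the DAG ID and schedule are assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder task bodies; in the real pipeline they would wrap the
# crawling, cleansing, and DWH-loading steps described above.
def scrape_sources():
    ...

def cleanse_data():
    ...

def load_dwh():
    ...

with DAG(
    dag_id="tenant_screening_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # re-screen sources daily to catch background changes
    catchup=False,
) as dag:
    scrape = PythonOperator(task_id="scrape", python_callable=scrape_sources)
    cleanse = PythonOperator(task_id="cleanse", python_callable=cleanse_data)
    load = PythonOperator(task_id="load_dwh", python_callable=load_dwh)

    scrape >> cleanse >> load
```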
Consulting on Legal and Ethical Aspects of Web Scraping
Our experts provided the following recommendations for ensuring the legality and ethical use of the solution:
- Review the terms of service of the screened websites to make sure the conditions do not explicitly prohibit scraping.
- Review the robots.txt file of the government database websites to see which parts can be accessed by web crawlers.
- Implement rate limiting to avoid overloading the servers of the scraped websites and prevent service disruption (a combined robots.txt and rate-limiting sketch follows this list).
- Keep logs of data scraping activities to facilitate future compliance audits.
- Determine which federal and state personal data protection regulations the solution will need to comply with. The final list will depend on the residency of the screened individuals (e.g., the FCRA at the federal level, the CCPA/CPRA for California, the SHIELD Act for New York, the CDPA for Virginia, the CPA for Colorado).
- Ensure solution compliance with the defined regulations (e.g., providing data collection notices that rental companies will need to distribute to the screened applicants; creating and publishing a comprehensive data collection and usage policy).
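To illustrate the robots.txt and rate-limiting recommendations above, here is a minimal Python sketch using the standard library’s robotparser; the site URL and delay value are assumptions to be tuned per source.

```python
import time
from urllib import robotparser

import requests

SITE = "https://records.example.gov"  # hypothetical target site

# Review robots.txt to see which parts crawlers may access.
robots = robotparser.RobotFileParser(SITE + "/robots.txt")
robots.read()

REQUEST_DELAY_SECONDS = 5  # simple rate limit; tune per site

def polite_get(path):
    """Fetch a page only if robots.txt allows it, pausing between requests."""
    url = SITE + path
    if not robots.can_fetch("*", url):
        return None  # the path is off-limits for crawlers
    time.sleep(REQUEST_DELAY_SECONDS)  # avoid overloading the server
    return requests.get(url, timeout=30)
```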
Web Scraping and Analytics Solution Architecture Ready in 2 Days
In two days, the Client received an architecture design for a web scraping solution with an analytics and reporting module. The solution will allow the Client to automatically scrape data on tenant criminal history from government databases and report it to rental service providers. We also advised the Client on the legal and ethical aspects of the solution. Thanks to ScienceSoft’s assistance, the Client was able to assess the feasibility of web scraping for its business and make an informed decision on implementation.
Technologies and Tools
Apache Airflow.