Lead Scoring Prediction App: Enhancing Sales and Marketing Efficiency through Data Science
In today's competitive landscape, understanding and prioritizing leads is crucial for maximizing business impact. This Lead Scoring Prediction App is a sample project that leverages machine learning to classify leads into categories such as “Hot,” “Warm,” “Medium,” “Dormant,” or “Cold.” The app helps organizations streamline their sales and marketing efforts by identifying leads with the highest conversion potential. While the data used for this app comes from an EdTech company, which is reflected in the choice of features, the methodology is adaptable and has been successfully implemented in diverse industries, including a project I designed specifically for hospitality sales.
The app allows users to input details through an intuitive form, designed to assess lead viability based on various parameters. The model considers aspects such as the lead origin (e.g., online form or lead import), last activity (e.g., email bounced, SMS sent), country knowledge (known or unknown), specialization (e.g., travel and tourism), and current occupation status. Tags like "Lost," "Ongoing," or "Unable to Reach" help provide additional granularity. It also includes an asymmetrical activity index to measure recent engagement with the company’s offerings. Each of these factors is processed by a machine learning model to generate a predictive lead score, categorizing it by potential.
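As a rough illustration of the final categorization step, the sketch below maps a predicted conversion probability to one of the five lead categories. The cutoff values are hypothetical and chosen only for illustration; the write-up names the categories but does not state the thresholds used in the app.

```python
# Hypothetical cutoffs (assumption): the source names the five categories
# but not the probability thresholds that separate them.
CATEGORY_CUTOFFS = [
    (0.80, "Hot"),
    (0.60, "Warm"),
    (0.40, "Medium"),
    (0.20, "Cold"),
]


def categorize_lead(probability: float) -> str:
    """Return the lead category for a conversion probability in [0, 1]."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Cutoffs are listed from highest to lowest, so the first match wins.
    for cutoff, label in CATEGORY_CUTOFFS:
        if probability >= cutoff:
            return label
    # Anything below the lowest cutoff falls into the lowest-priority bucket.
    return "Dormant"
```

For example, categorize_lead(0.9) falls in the "Hot" bucket under these assumed cutoffs, while categorize_lead(0.05) falls in "Dormant".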
Deployment and Accessibility
The app is not only functional but also fully deployed using Render, ensuring easy and scalable access. It runs on a Flask-based backend with a sleek, responsive frontend, making it suitable for both desktop and mobile users. Integration with the machine learning model allows for real-time lead evaluation, and the app interface is designed to be user-friendly, with clear prompts and dropdowns for ease of data input. The color scheme and layout have been carefully chosen for a professional look that balances usability and aesthetic appeal.
Behind the Scenes: A Technical Overview
The backend of this app incorporates a logistic regression model trained on labeled data to assign lead categories. The scoring methodology includes feature engineering specific to the EdTech sector, demonstrating how custom models can be tailored for different industries. For example, the hospitality version I developed includes features such as booking history, event engagement, and response time metrics, aligning the model with the nuances of hospitality sales.
This project is built with Python, leveraging libraries like scikit-learn for model building and Flask for deployment, and incorporates advanced techniques in feature selection, preprocessing, and cross-validation to optimize prediction accuracy. The app also demonstrates my ability to handle the end-to-end machine learning workflow, from data preprocessing to model training, testing, and deployment.
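To make the scoring step concrete, here is a minimal sketch of how a fitted logistic regression turns binary (dummy) features into a conversion probability. The intercept and coefficient values are hypothetical placeholders, not the fitted values from this project; only their negative signs follow the findings reported in the recommendations (these tags and an unknown occupation lower the chance of conversion). The feature name "Occupation_Unknown" is also an assumed label.

```python
import math

# Hypothetical values (assumption): the fitted intercept and coefficients
# are not given in the source. Negative signs mirror the reported direction.
INTERCEPT = 1.0
COEFFICIENTS = {
    "Tags_Lost": -2.0,
    "Tags_Unable to Reach": -1.5,
    "Tags_Ongoing": -1.0,
    "Occupation_Unknown": -0.8,  # assumed feature name
}


def conversion_probability(features: dict) -> float:
    """Apply the logistic function to the linear combination of features."""
    # Features missing from the input are treated as 0 (dummy not set).
    z = INTERCEPT + sum(
        weight * features.get(name, 0) for name, weight in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))
```

With these placeholder weights, a lead tagged as lost scores lower than a lead with none of the flags set, which is the behavior the model is expected to show.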
Sample App with Real-World Relevance
While this app serves as a showcase, it demonstrates the versatility and value of machine learning in identifying high-potential leads. The features used in this EdTech example provide an understanding of common indicators in customer behavior, while the deployed model's accuracy highlights the potential impact of AI-driven lead scoring across industries.
Whether in EdTech, hospitality, or other sectors, lead scoring tools like this empower sales and marketing teams to focus on leads with the highest probability of conversion, ultimately driving better business results and resource allocation. This app not only exemplifies technical proficiency in AI and machine learning but also reflects a data-driven approach that adapts seamlessly to industry-specific needs.
Feature Definitions:
Out of 33 features, only 8 were selected because they played a significant role in the predictive model. The glossary of these features follows:
Lead Origin: The origin identifier with which the customer was identified as a lead, such as API or Landing Page Submission. The dropdown in the app has 2 values: Lead Add Form and Lead Import.
Last Activity: The last activity performed by the prospect. The dropdown includes 4 values, selected from the original list of 19 based on their importance.
Country: This dropdown asks the user to declare whether or not the prospect's country is known.
Specialization: This dropdown lets the user choose the specialization of the prospect. Since this is a sample, only one choice has been retained from the original dataset: whether or not the prospect is from a Travel and Tourism background.
Occupation: This dropdown relates to whether the occupation of the prospect is known or not. All the individual occupation types were simplified to make it easier for the user to choose a value.
Tags: Tags assigned to customers indicating the current status of the lead. The original dataset had 26 distinct categories, of which only the 3 most important were retained to simplify the form.
Lead Quality: The two most important lead qualities were retained from the five values in the original dataset.
Activity Index: The output of the internal auto-grading system for each lead.
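The dropdown selections have to be turned into numeric inputs before the model can score them. The sketch below shows one common way to do this, one-hot (dummy) encoding, for two of the fields whose options are named in this write-up. The function name and the "Field_Option" naming pattern are assumptions for illustration, and the remaining fields are left out because their exact options are not listed here.

```python
# Dropdown options named in the write-up; other fields are omitted because
# their exact options are not listed.
FORM_OPTIONS = {
    "Lead Origin": ["Lead Add Form", "Lead Import"],
    "Tags": ["Lost", "Ongoing", "Unable to Reach"],
}


def encode_form(selection: dict) -> dict:
    """One-hot encode dropdown selections into 0/1 dummy features."""
    encoded = {}
    for field, options in FORM_OPTIONS.items():
        chosen = selection.get(field)
        if chosen is not None and chosen not in options:
            raise ValueError(f"unknown option {chosen!r} for field {field!r}")
        # One dummy column per option, set to 1 only for the chosen option.
        for option in options:
            encoded[f"{field}_{option}"] = 1 if chosen == option else 0
    return encoded
```

Calling encode_form({"Lead Origin": "Lead Import", "Tags": "Lost"}) sets "Lead Origin_Lead Import" and "Tags_Lost" to 1 and every other dummy to 0.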
Tools used:
Logistic Regression using Statsmodels in Python
Model Evaluation:
Train Accuracy: 92.01%
Test Accuracy: 89.11%
Model Sensitivity (correctly predicted "Yes" to total actual "Yes"): 89.9%
Model Specificity (correctly predicted "No" to total actual "No"): 88.6%
Precision: 83.7%
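For reference, these metrics can all be derived from the four cells of a confusion matrix using the standard definitions. The sketch below uses made-up counts purely for illustration; they are not the counts behind the figures reported above.

```python
def evaluation_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    return {
        # Share of all predictions that were correct.
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        # Share of actual "Yes" leads that were predicted "Yes".
        "sensitivity": tp / (tp + fn),
        # Share of actual "No" leads that were predicted "No".
        "specificity": tn / (tn + fp),
        # Share of predicted "Yes" leads that were actually "Yes".
        "precision": tp / (tp + fp),
    }


# Illustrative counts only (assumption), not taken from this project.
metrics = evaluation_metrics(tp=90, fp=20, tn=180, fn=10)
```

Sensitivity and precision share the same numerator but divide by different totals, which is why they are reported as separate figures.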
Sample Recommendations to business based on this project:
PRIMARY RECOMMENDATIONS
The company should focus most on Hot leads, followed by Warm, Medium, Cold and Dormant leads, in that order of importance and as per the availability of resources.
Hence, when resources are limited, it is advisable to allocate the maximum resources to calling people characterised as Hot leads.
Once the Hot leads are exhausted, callers can start to call Warm leads.
Once those are exhausted, calling should move to Medium leads. The difference between the conversion rates of Warm and Medium leads is negligible.
After exhausting all of these, callers can start focussing on Cold leads.
It is advisable to avoid calling leads that have been classified as Dormant. These leads have a negligible conversion rate of around 5%. They also make up close to 60% of the total leads obtained. Ignoring them will result in heavy savings on resource allocation.
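The calling order described above can be sketched as a simple queue builder. This is an illustrative sketch rather than part of the deployed app; the function name and the lead record layout (a dict with "name" and "category" keys) are assumptions.

```python
# Calling order from the primary recommendations; Dormant leads are left out.
CALL_PRIORITY = ["Hot", "Warm", "Medium", "Cold"]


def build_call_queue(leads: list) -> list:
    """Drop Dormant leads and order the rest by category priority."""
    rank = {category: position for position, category in enumerate(CALL_PRIORITY)}
    # Keep only leads whose category appears in the calling order.
    callable_leads = [lead for lead in leads if lead["category"] in rank]
    # sorted() is stable, so leads within a category keep their original order.
    return sorted(callable_leads, key=lambda lead: rank[lead["category"]])


# Hypothetical sample records for illustration.
sample_leads = [
    {"name": "A", "category": "Cold"},
    {"name": "B", "category": "Dormant"},
    {"name": "C", "category": "Hot"},
    {"name": "D", "category": "Warm"},
]
queue = build_call_queue(sample_leads)
```

Here the queue contains leads C, D and A in that order, and the Dormant lead B is skipped.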
ADDITIONAL OPERATIONAL SUGGESTIONS AND RECOMMENDATIONS
Tags have surfaced as the most influential feature, inversely related to the probability of lead conversion. Within Tags, after transformation, we conclude that:
‘Tags_Lost’, ‘Tags_Unable to Reach’ and ‘Tags_Ongoing’ are the top categories in Tags associated with a lead NOT being converted.
Based on the above insights, leads whose Tags fall under the following categories in the original dataset could largely be given the lowest priority: ‘Diploma holder (Not Eligible)’, ‘Not doing further education’, ‘Already a student’, ‘University not recognized’, ‘Recognition issue (DEC approval)’, ‘Lost to Others’, ‘Busy’, ‘opp hangup’, ‘Ringing’, ‘switched off’, ‘invalid number’, ‘number not provided’, ‘Wrong number given’, ‘Shall take in the next coming month’, ‘Want to take admission but has financial problems’, ‘in touch with EINS’, ‘In confusion whether part time or DLP’, ‘Still Thinking’, ‘Graduation in progress’, ‘Interested in full time MBA’, ‘Interested in other courses’.
Since a considerable number of tags are associated with the ‘Unable to Reach’ category, the recurrence of such instances can be reduced by making the contact number a mandatory field, possibly with OTP verification to ensure that a genuine number is shared. For the OTP verification service, further research will be needed to compare the savings in resource cost against the cost of the service.
Since total time spent on the website is directly related to the probability of lead conversion, it is advisable for salespeople to assess and document this factor in real time so that they can prioritize leads accordingly. An integrated sales force automation solution is a possible recommendation, depending on the firm’s financial position and projected increase in revenue and profitability.
Current Occupation has surfaced as another influential feature. When occupation status is undisclosed, unknown, or null, the probability of a lead NOT being converted is relatively higher. This could be because non-serious students are casually checking the products. The recurrence of such leads could be reduced by:
Making it mandatory for students to share their official email ID and authenticating it with an OTP. This solution has to be well thought through, since it could be a problem for employees of organizations with strict email usage policies.
Increasing the size of the Institutional Sales Team that handles B2B business.
Leads with a Lead Quality of ‘Worse’ can be completely avoided. Our analysis also showed that the success ratio for leads with a quality index of ‘Worse’ is minuscule at just 7%.
The sales team is recommended to prioritize leads with a relatively higher Asymmetrique Activity Score.
shobhit.kulshreshtha@gmail.com
Open to work on custom projects in Machine Learning and Analytics