Unsupervised Learning Approach for Identifying Retail Store Employees Using Footfall Data


Analysis of customer visits (or footfall) in the store traced via geolocation enabled devices, helps digital firms understand customers and their buying behavior better. Insights gained through geo footfall analysis help clients and advertisers make an informed decision, choose profitable regions, recognize relevant advertising opportunities, and analyze their competitors to increase the success rate. But all this information can be disingenuous if people who walk past the store without entering, and staff of the store are not excluded. Therefore, two groups of people contributing to the footfall at the store can be considered outliers - people passing by the store, and employees of the store. The behavior of these outliers is expected to be different from the actual customers.

Since the data collected by geofencing the stores and pings from the SDK of the geo-enabled devices do not contribute much in tagging these outliers exclusively, these outliers are not very evident and cannot be removed by extreme value analysis. To tackle this problem we have formulated a multivariate approach to identify and remove these outliers from our source data. As we have no labeled data that marks a footfall as an employee or customer, we are using an unsupervised outlier detection model using the DBSCAN algorithm to provide a coherent and complete dataset with the labeled outliers. In this process, different techniques were taken into consideration to handle the effectiveness of features. Features like time spent by a visitor in and around the stores compared to other locations, monthly visit frequency, daily visit frequency, etc. were dominant in tagging the outliers.
Discovering the structure of data was another key step to optimize parameters of the DBSCAN algorithm for our use case namely, epsilon and minimal points.

Finally, the evaluation was done against the results obtained with that of the k-means algorithm, which showed that DBSCAN has a higher detection rate and a low rate of false positives in discovering outliers for the given problem statement.


Soumya Jain is currently working as a data scientist II in MiQ. She has done her engineering in computer science from BIT Durg, and MTech from IIIT Bangalore in data Science specialization. She has been employed for the past 1.5 years. She has a keen interest in the field of data science and finding valuable information from a dataset and making great stories out of it is what drives me to learn more.