Chapter 2: Alpha Capture in Digital Commerce [Series]

Intelligence Node Blog
Dec 16, 2022

Introduction

Intelligence Node crawls millions of web pages daily to provide its customers with real-time, high-velocity, and accurate data. But acquiring and normalizing data at this scale, and at an affordable cost, cannot be done manually; these are rigorous processes with challenges of their own. To address them, Intelligence Node’s analytics and data science team has developed strategies built on advanced analytics and continuous R&D.

In this part of the ‘Alpha Capture in Digital Commerce’ series, we will explore “data pipeline” challenges in the context of retail and discuss practical data science applications that solve them.

How We Add Value to Data Extraction…

Textual Data Extraction — Leveraging Text to Normalize Products

a. Smart Categorization

Intelligence Node maintains 1,400 categories and uses Bidirectional Encoder Representations from Transformers (BERT) to segment data into them. We deploy a custom implementation of BERT to extract sentence vectors and use them to train a fully connected feedforward neural network that accurately classifies and maps products to our continuously evolving category tree (see below).
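A minimal sketch of the classification stage, assuming PyTorch. Random tensors stand in for BERT sentence vectors (768-dimensional for BERT-base), and the layer sizes are hypothetical; the point is the shape of the pipeline, not a production model.

```python
import torch
import torch.nn as nn

NUM_CATEGORIES = 1400   # matches the category count mentioned above
EMBED_DIM = 768         # BERT-base sentence-vector size

class CategoryClassifier(nn.Module):
    """Feedforward network mapping sentence vectors to category logits."""
    def __init__(self, embed_dim: int, num_categories: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, num_categories),  # raw logits; softmax at inference
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = CategoryClassifier(EMBED_DIM, NUM_CATEGORIES)
sentence_vectors = torch.randn(4, EMBED_DIM)  # placeholders for 4 product titles
logits = model(sentence_vectors)
predicted = logits.argmax(dim=1)              # one category id per product
```

In a real pipeline, the sentence vectors would come from the BERT encoder and the network would be trained with cross-entropy loss against the labeled category tree.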

b. Knowledge Graph Enhancement

Intelligence Node uses a zero-shot clustering approach to group relevant keywords without prior training on the domain. This allows the algorithm to capture previously unseen keywords and relate them to the correct group of similar keywords (the keywords themselves are the features). The algorithm builds a similarity matrix over the keywords and feeds it into a clustering algorithm, which places related keywords in the same bucket.

Example: In the instance below, the zero-shot clustering model has identified relevant keywords such as recycled, upcycled, environmental, and organic jeans and grouped them into the same bucket, ‘recycled jeans’.
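The similarity-matrix-then-cluster flow can be sketched as follows. This is an illustration only: the real system would compute similarity from semantic (zero-shot) embeddings, whereas character trigrams stand in here so the example stays self-contained, and the threshold is an arbitrary choice.

```python
import numpy as np

def trigrams(word: str) -> set:
    """Character trigrams of a padded, lowercased keyword."""
    w = f"  {word.lower()} "
    return {w[i:i + 3] for i in range(len(w) - 2)}

def similarity_matrix(keywords):
    """Pairwise Jaccard similarity over trigram sets."""
    grams = [trigrams(k) for k in keywords]
    n = len(keywords)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = len(grams[i] & grams[j]) / len(grams[i] | grams[j])
    return sim

def cluster(keywords, threshold=0.15):
    """Single-link grouping: a keyword joins a bucket if it is similar
    enough to any keyword already in that bucket."""
    sim = similarity_matrix(keywords)
    buckets = []
    for i, _ in enumerate(keywords):
        for bucket in buckets:
            if any(sim[i, j] >= threshold for j in bucket):
                bucket.append(i)
                break
        else:
            buckets.append([i])
    return [[keywords[j] for j in b] for b in buckets]

words = ["recycled jeans", "upcycled jeans", "organic jeans", "leather boots"]
print(cluster(words))
```

With an embedding-based similarity, semantically related but lexically different keywords (such as “environmental”) would also land in the correct bucket.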

Visual Data Extraction — Leveraging Images to Segment Products

a. Product Image Segmentation

To extract information from an image visually, Intelligence Node uses a Mask R-CNN (a CNN-based architecture) to identify a bounding box for each product visible in the image. The ResNet-backed Mask R-CNN then draws precise segmentation masks, labeling each pixel with its product category to create product-wise masks. This enables us to automatically extract every product in an image and focus on specific features of those products, free of background noise.

b. Attributes Tagger

We use an ‘attributes tagger’ to add labels to an image and deploy image segmentation to localize the part of an image containing a particular product type. A ResNet (CNN)-based classifier labels images with their corresponding attributes: an example of multi-label, multi-class classification.

The classifier leverages object localization techniques to identify a region and is currently used to fill attribute gaps where the product text is ambiguous or incomplete.
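A minimal sketch of the multi-label idea in PyTorch. A tiny CNN stands in for the ResNet backbone, and the attribute vocabulary is hypothetical; what matters is that each output unit is an independent attribute (e.g. “denim”, “long-sleeve”, “floral”), so a per-attribute sigmoid is used instead of a softmax, and one image can receive several tags at once.

```python
import torch
import torch.nn as nn

NUM_ATTRIBUTES = 8  # hypothetical attribute vocabulary size

tagger = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # stand-in for a ResNet backbone
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, NUM_ATTRIBUTES),               # one logit per attribute
)

images = torch.rand(2, 3, 64, 64)        # two placeholder product crops
probs = torch.sigmoid(tagger(images))    # independent per-attribute probabilities
tags = probs > 0.5                       # an image may activate several attributes
```

Training such a tagger would use `nn.BCEWithLogitsLoss`, which treats every attribute as its own binary classification problem.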

c. Image OCR (Optical Character Recognition)

Our in-house model applies image pre-processing (deskewing, de-projection, de-warping, denoising, etc.) to identify text regions and transform each into its most readable form. The algorithm then uses the extracted text to identify meta-information about the product, such as brand, manufacturer, contents, weight/volume, and country of origin.

To accomplish the above, we use a combination of Python libraries: OpenCV/scikit-image for preprocessing images, Tesseract for optical character recognition (reading text), and ZBar for reading QR codes and barcodes. Finally, we leverage PyTorch to upscale low-resolution images with deep learning.
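To make one of these pre-processing steps concrete, here is Otsu binarization sketched in plain NumPy: it picks a global threshold separating dark text from a light background before the OCR engine reads the image. In production this would typically be a single OpenCV call (`cv2.threshold` with `THRESH_OTSU`); the synthetic “label photo” below is fabricated for illustration.

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the grey level maximizing between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = gray.size
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0 = hist[:t].sum() / total          # weight of the "dark" class
        w1 = 1.0 - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (hist[:t] * np.arange(t)).sum() / (w0 * total)
        mu1 = (hist[t:] * np.arange(t, 256)).sum() / (w1 * total)
        var = w0 * w1 * (mu0 - mu1) ** 2     # between-class variance
        if var > best_var:
            best_t, best_var = t, var
    return best_t

# Synthetic label: dark text pixels (~30) on a light background (~220).
img = np.full((32, 32), 220, dtype=np.uint8)
img[10:22, 4:28] = 30
t = otsu_threshold(img)
binary = (img > t).astype(np.uint8) * 255   # white background, black text
```

The binarized image is what a recognizer like Tesseract actually consumes; deskewing and denoising follow the same pattern of normalizing the image before text extraction.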

Final Words

In the digital age of retail, shopper expectations evolve faster than the latest technology. Advanced analytics is no longer a ‘good-to-have’ but a ‘must-have’ for remaining relevant and competitive. In the first article of this series, we saw how Intelligence Node uses analytics and advanced technology to streamline, automate, and optimize data acquisition. But that is only the first step. Data normalization and transformation are critical to building quality data that customers can rely on. As the retail industry becomes more real-time and competitive, the velocity, variety, and volume of data will need to scale at the same rate. Through these data pipeline innovations, Intelligence Node aims to consistently provide the most accurate and comprehensive data to its clients while sharing its analytical know-how with data analytics enthusiasts everywhere.

Published by

Yasen Dimitrov

Enterprise data and analytics leader

I am excited to bring out the second chapter in the ‘Alpha Capture in Digital Commerce’ series. The first chapter was all about data acquisition in retail and the adaptive data collection techniques devised and adopted by Intelligence Node. In this chapter, we explore “data pipeline” challenges in the context of retail and discuss practical data science applications that solve them. I hope you enjoy this piece and find it useful. Let me know your thoughts in the comments section below, and whether you’d like me to address any specific data science use cases in upcoming articles.
