Marcin Ros 2018 - Original Co-Author, Co-Founder and System Architect of SensID
In the context of GDPR enforcement and rising volumes of unstructured enterprise data, organizations faced an increasing demand to identify, classify, and manage sensitive personal data across diverse systems. This paper presents SensID, a modular AI/ML-powered data governance tool designed to discover and catalog sensitive data in both structured and unstructured sources. The system applied machine learning (ML) and natural language processing (NLP) to detect sensitive personal identifiers precisely, supported data anonymization workflows, and integrated into enterprise environments through APIs and dashboards. The initial architecture and concept were developed by the author and piloted with PKP (Polish State Railways), later forming the foundation for the system's evolution under 4Semantic.
With the introduction of the General Data Protection Regulation (GDPR), enterprises were required to manage personal and sensitive data with increased transparency, traceability, and control. While structured data could be handled by traditional MDM (Master Data Management) solutions, a significant volume of valuable and regulated content remained hidden in unstructured text, documents, emails, logs, and real-time streams. SensID addressed this gap through a flexible, scalable, AI-based solution.
SensID was a data intelligence platform developed to support:
Detection and classification of personal and sensitive data (names, IDs, addresses, medical and financial information)
Structured (SQL, NoSQL) and unstructured (file systems, emails, APIs, logs) source scanning
GDPR compliance: data subject rights, consent tracking, anonymization
Dashboard-driven visualization and RESTful APIs
The system was developed using modern open-source frameworks with a modular approach to ensure flexibility and adaptability in complex IT environments.
This diagram represents the architecture of SensID. Multiple enterprise data sources were ingested, scanned, and processed using ML/NLP-based pipelines before results were indexed and visualized or accessed through external APIs.
High-Level Architecture of SensID
The system architecture was composed of the following key modules:
Data Ingestion Layer: Connectors to various enterprise sources including:
File systems: PDF, DOCX, ODF, HTML, ZIP, etc.
Databases: PostgreSQL, MySQL, Oracle, MongoDB
Email servers and content repositories
APIs and data streams (Kafka, JMS, REST, FTP)
Processing Core: Implemented sampling, scanning, parsing, and indexing. Capable of:
Preprocessing binary/unstructured content using text extraction (e.g., Tika)
Sampling or full scanning modes
Processing structured and unstructured data concurrently
ML/NLP-Based Classification Engine:
Named Entity Recognition (NER) using custom-trained models (initially in Polish)
Pattern-based recognition (regex, heuristics)
Rule engine and taxonomy management (lexicons for GDPR categories)
Sensitive Data Index:
Elasticsearch-based metadata index storing detected entity types, locations, and source identifiers
Supported multi-source correlation and contextual analysis
Dashboard & Analytics Interface:
Kibana-based dashboards
Visual insights into data distribution, sensitivity, source maps, frequency statistics
API Layer:
RESTful endpoints for scan control, results access, data export
Real-time integration with compliance tools and CRM/ERP systems
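The pattern-based recognition path and the metadata records stored in the index can be illustrated with a minimal sketch. It is not the SensID implementation: the two regular expressions, the `Finding` record, and the `scan_text` function are hypothetical names chosen for illustration, and only the PESEL control-digit rule (weights 1, 3, 7, 9 repeated over the first ten digits) is taken from the public definition of the Polish national identifier. The sketch shows how a checksum heuristic can filter regex candidates, and how each hit can be recorded as entity type, source identifier, and location without storing the raw value.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration; a production lexicon would be larger.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PESEL_RE = re.compile(r"\b\d{11}\b")

# Weights applied to the first ten PESEL digits when computing the control digit.
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def pesel_checksum_ok(candidate: str) -> bool:
    """Check the PESEL control digit to reject arbitrary 11-digit numbers."""
    digits = [int(ch) for ch in candidate]
    total = sum(w * d for w, d in zip(PESEL_WEIGHTS, digits[:10]))
    return (10 - total % 10) % 10 == digits[10]


@dataclass(frozen=True)
class Finding:
    """Metadata about one detection: type, source, and character offsets."""
    entity_type: str
    source_id: str
    start: int
    end: int


def scan_text(text: str, source_id: str) -> list[Finding]:
    """Return findings for emails and checksum-valid PESEL numbers in text."""
    findings = []
    for match in EMAIL_RE.finditer(text):
        findings.append(Finding("EMAIL", source_id, match.start(), match.end()))
    for match in PESEL_RE.finditer(text):
        # Regex gives candidates; the checksum heuristic confirms them.
        if pesel_checksum_ok(match.group()):
            findings.append(Finding("PESEL", source_id, match.start(), match.end()))
    return sorted(findings, key=lambda f: f.start)


if __name__ == "__main__":
    sample = "Contact: jan.kowalski@example.com, PESEL 44051401359, ref 12345678901"
    for finding in scan_text(sample, "doc-1"):
        print(finding)
```

In this sketch the reference number `12345678901` matches the 11-digit pattern but fails the control-digit check, so it is not reported. An NER model would complement such rules for entities that have no fixed syntax, such as personal names.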
SensID was designed to be deployed as a virtual appliance (OVA), as a Docker container, or directly on-premises, supporting:
Horizontal scaling across nodes for high availability (HA)
Automated scheduling and orchestration of scans
Integration with enterprise ESBs and messaging platforms (e.g., IBM MQ, Kafka)
Security and privacy-by-design principles were built into the tool:
Optional suppression of raw values (metadata-only mode)
Role-based access control (RBAC)
HTTPS/TLS secure access
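As a rough illustration of the metadata-only mode, the sketch below builds an index record that omits the raw value unless suppression is explicitly turned off. The function name `to_index_record`, the `metadata_only` flag, and the field names are assumptions made for this example and are not taken from the SensID code base.

```python
def to_index_record(finding: dict, metadata_only: bool = True) -> dict:
    """Build an index record; drop the raw value when metadata_only is set."""
    record = {
        "entity_type": finding["entity_type"],
        "source_id": finding["source_id"],
        "offset": finding["offset"],
        # The length is kept so reviewers can locate the value in the source.
        "length": len(finding["value"]),
    }
    if not metadata_only:
        record["value"] = finding["value"]
    return record


if __name__ == "__main__":
    hit = {
        "entity_type": "EMAIL",
        "source_id": "mailbox-7",
        "offset": 9,
        "value": "jan.kowalski@example.com",
    }
    print(to_index_record(hit))                       # no raw value stored
    print(to_index_record(hit, metadata_only=False))  # raw value included
```

Defaulting the flag to suppression reflects the privacy-by-design intent: the sensitive value reaches the index only through an explicit choice.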
To support GDPR's "right to be forgotten" and data minimization, SensID included a built-in workflow for:
Detecting columns and records containing identifying and quasi-identifying data
Exporting classification output to the ARX anonymization tool
Applying syntactic and semantic anonymization models:
k-anonymity
l-diversity
t-closeness
(epsilon, delta)-differential privacy
Workflows could be operated manually (via web-based GUI dashboards) or run automatically via parameterized profiles.
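To make the first two anonymization models concrete, the following sketch measures k-anonymity and distinct l-diversity on an already generalized table. It does not use the ARX API (ARX is a separate Java tool); the function names, column names, and sample rows are hypothetical. Here k is the size of the smallest group of records sharing the same quasi-identifier values, and l is the smallest number of distinct sensitive values found in any such group.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest quasi-identifier equivalence class."""
    classes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(classes.values()) if classes else 0


def l_diversity(rows, quasi_identifiers, sensitive):
    """Return distinct l: the fewest distinct sensitive values in any class."""
    groups = {}
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups.setdefault(key, set()).add(row[sensitive])
    return min(len(values) for values in groups.values()) if groups else 0


if __name__ == "__main__":
    # Hypothetical generalized records: age bands and truncated postal codes.
    table = [
        {"age": "30-39", "zip": "00-1**", "diagnosis": "flu"},
        {"age": "30-39", "zip": "00-1**", "diagnosis": "asthma"},
        {"age": "40-49", "zip": "00-2**", "diagnosis": "flu"},
        {"age": "40-49", "zip": "00-2**", "diagnosis": "flu"},
        {"age": "40-49", "zip": "00-2**", "diagnosis": "flu"},
    ]
    print(k_anonymity(table, ["age", "zip"]))               # 2
    print(l_diversity(table, ["age", "zip"], "diagnosis"))  # 1
```

The sample table is 2-anonymous but only 1-diverse: every record in the second group carries the same diagnosis, so group membership alone reveals it. This is the kind of gap that motivates applying l-diversity or t-closeness on top of k-anonymity.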
SensID was validated through multiple pilots, including:
PKP (Polish State Railways) – Discovery of sensitive content across mixed legacy environments
Banking sector – Consent registry and customer profile enrichment
Telco & Utilities – Automated classification of customer correspondence
Additional business applications included:
Fraud detection and investigative analytics
Customer 360 enrichment
Continuous data governance enforcement
The initial concept, architecture, and first MVP of SensID were developed between 2016 and 2018 by the author in collaboration with a co-founder. The system was bootstrapped, piloted, and presented to early clients before attracting investment interest. Upon entering negotiations with investors (Rubicon Partners), the author chose to exit the venture, with the product continuing its path under the 4Semantic brand.
The author’s contributions included:
Architecture and technology stack design
Implementation of ML/NLP pipelines and classifiers
Integration design and dashboard architecture
Pilot delivery and early client acquisition
SensID demonstrated how applied AI and pragmatic enterprise architecture could bridge the compliance gap created by modern privacy laws. Its success lay in the combination of intelligent data mining, scalable infrastructure, and real-world enterprise integration. As data privacy continues to evolve, systems like SensID offer a roadmap for future-ready data governance.
Keywords: GDPR, sensitive data detection, NLP, machine learning, data governance, data anonymization, Elasticsearch, ARX, enterprise architecture