Technological design in the project “Intelligent speech processing system in medical transcription”
Main objective and research problem
The main goal of the project is to develop a functional system that helps doctors and patients maintain electronic medical records during a medical visit, using advanced NLP/NLU algorithms for Polish. The system will significantly reduce the time spent on manual data entry, which currently causes overwork, stress and burnout among medical staff.
The key research problem is to develop a set of machine learning and NLP/NLU algorithms for Polish, which will enable transcription of medical interviews, extraction of important information from text and automatic completion of medical forms. Solving this problem will significantly improve the efficiency of doctors’ work, shorten patient queues and increase the quality of medical documentation.
Main functionalities of the system

The system provides comprehensive support for the medical documentation process through a number of key functionalities. Sound recording and playback during a medical visit are carried out by a dedicated edge device with a two-channel recording system, enabling separation of the doctor’s and patient’s voices. The system enables the creation of medical questionnaires and their flexible management, allowing for adjustment to medical specializations and individual needs of medical facilities.
An innovative solution is conducting a controlled visit using dynamic scenarios based on AI, which prompt the doctor with subsequent questions based on the analysis of the previous conversation. Real-time transcription of the conversation, performed by advanced speech recognition models, enables immediate access to the consultation text. Automatic completion of forms based on semantic analysis of the transcription eliminates the need for manual data entry by medical personnel.
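The scenario logic described above can be illustrated with a deliberately simple sketch. It is a rule-based stand-in, not the AI-driven selection the project describes: it only prompts for the first questionnaire field that the conversation has not yet filled. The field names and questions are hypothetical.

```python
from typing import Optional

# Hypothetical questionnaire: form field -> question suggested to the doctor.
QUESTIONNAIRE = {
    "chief_complaint": "What brings you in today?",
    "symptom_duration": "How long have the symptoms lasted?",
    "current_medications": "Which medications are you taking?",
}


def next_question(filled: dict) -> Optional[str]:
    """Return the question for the first field that is still empty, or None."""
    for field, question in QUESTIONNAIRE.items():
        if not filled.get(field):
            return question
    return None  # every field is filled, the scenario is complete
```

In the described system the choice of the next question would come from a language model that analyzes the conversation so far; the sketch only shows the contract: fields already filled in, next prompt out.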
The system also ensures data synchronization with the cloud, facilitating access to documentation from various locations and devices. Documents such as prescriptions, sick leave certificates or referrals are generated automatically based on the collected data. The mobile application equipped with a virtual assistant guides the patient through the entire visit process, allows for sending additional documents and provides access to the history of visits and recommendations.
Technological architecture

The system uses a microservice architecture to keep the platform flexible, efficient and scalable. Most microservices are implemented in Python with FastAPI, while the architecture leaves room to choose a different technology where a specific task calls for it. Services communicate through the Kafka message broker, which acts as a data bus and provides reliable message delivery between components.
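As an illustration of the data-bus idea, the sketch below builds a message that one service might publish for others to consume. The envelope fields and the event type name are assumptions made for this example, and the code only prepares the key and value bytes; sending them would be done with a Kafka client library.

```python
import json
import uuid
from datetime import datetime, timezone


def make_event(event_type: str, visit_id: str, payload: dict) -> tuple:
    """Build a (key, value) pair of bytes suitable for a Kafka producer."""
    envelope = {
        "event_id": str(uuid.uuid4()),  # lets consumers detect duplicates
        "event_type": event_type,       # hypothetical name, e.g. "transcript.segment"
        "visit_id": visit_id,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    key = visit_id.encode("utf-8")
    value = json.dumps(envelope, ensure_ascii=False).encode("utf-8")
    return key, value
```

Using the visit identifier as the message key is a design choice: with Kafka's default partitioning, messages that share a key land in the same partition, so the events of one visit are consumed in the order they were produced.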
Containerization using Docker provides a uniform runtime environment and flexible management of system resources. The main APIs are implemented in accordance with the REST standard and documented using OpenAPI, which facilitates integration with external systems. Real-time audio transmission uses the WebSocket protocol, ensuring low latency and responsiveness of the interface.
The entire infrastructure is built and deployed through continuous integration and deployment (CI/CD) in Azure Pipelines, which automates the build, testing and deployment of new software versions. Data security is ensured by JWT-based authentication mechanisms supported by central identity management.
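To make the JWT mechanism concrete, here is a minimal sketch of signing and verifying a token with HMAC-SHA256 (the HS256 algorithm), using only the standard library. It assumes a shared secret for simplicity; a deployment with central identity management would more likely verify tokens signed with the identity provider's asymmetric key, and would use a maintained JWT library rather than hand-written code.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: dict, secret: bytes) -> str:
    """Return a compact JWT: base64url(header).base64url(claims).base64url(signature)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Check algorithm, signature and expiry; return the claims or raise ValueError."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, signature_b64 = parts
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    # compare_digest avoids leaking information through comparison timing
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("invalid signature")
    claims = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if "exp" in claims and current >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```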
Speech recognition and natural language processing
At the heart of the system are advanced speech recognition and natural language processing technologies adapted to the specifics of the Polish language. Speech transcription is performed by the OpenAI Whisper model, which has been adapted to recognize Polish medical terminology. Speech recognition is also supported by Mozilla DeepSpeech, which makes it possible to process audio data locally without sending it to the cloud.
Natural language processing is based on models from the HuggingFace Transformers library, in particular HerBERT, a Polish BERT variant, adapted to the analysis of medical texts. Medical entity recognition, semantic classification and relation extraction are performed using the spaCy library with Polish language models. The system also uses large language models (LLMs) such as Gemma3, which are fine-tuned on specialist medical texts and then used to generate documentation and analyze the context of the conversation.
Automatic mapping of information from the transcription to form fields is performed by custom algorithms built on PyTorch, using reinforcement learning techniques. Text generation is supported by retrieval-augmented generation (RAG), using LangChain to integrate the LLMs with medical knowledge bases.
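The mapping step can be pictured with the sketch below. It is a rule-based simplification, not the learned PyTorch model described above: recognized entities are routed to form fields through a fixed lookup table, low-confidence entities are skipped, and duplicates are dropped. The entity labels, field names and threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    label: str    # entity type produced by an upstream recognizer
    text: str     # surface form found in the transcript
    score: float  # recognizer confidence in [0, 1]


# Hypothetical mapping from entity labels to form field names.
LABEL_TO_FIELD = {
    "SYMPTOM": "symptoms",
    "DRUG": "current_medications",
    "DURATION": "symptom_duration",
}


def fill_form(entities: list, min_score: float = 0.5) -> dict:
    """Collect confident, de-duplicated entity texts under their form fields."""
    form: dict = {}
    for ent in entities:
        field = LABEL_TO_FIELD.get(ent.label)
        if field is None or ent.score < min_score:
            continue  # unknown label or not confident enough to auto-fill
        values = form.setdefault(field, [])
        if ent.text not in values:
            values.append(ent.text)
    return form
```

A confidence threshold of this kind is one way to leave uncertain values for the doctor to confirm instead of writing them into the record automatically.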
Edge devices and sound processing
The system uses dedicated edge devices for sound recording and pre-processing. These devices are equipped with advanced NVIDIA systems for edge processing with AI acceleration, which allows for initial sound analysis without the need to send raw data to the cloud. The device conducts dual-channel sound recording with a sampling rate of 44100 Hz, which allows for capturing the full spectrum of human speech.
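At 44100 Hz the recording can represent frequencies up to 22050 Hz (half the sampling rate). The sketch below shows how a two-channel recording might be split into separate mono streams, assuming interleaved little-endian 16-bit PCM, which is the usual layout of stereo WAV data. Which channel carries the doctor and which the patient is an assumption of this example.

```python
import sys
from array import array

SAMPLE_RATE_HZ = 44_100  # sampling rate stated in the text


def split_channels(interleaved: bytes) -> tuple:
    """Split interleaved little-endian 16-bit stereo PCM into two mono channels."""
    samples = array("h")
    samples.frombytes(interleaved)  # raises ValueError on a truncated sample
    if sys.byteorder == "big":
        samples.byteswap()  # array uses native byte order; the data is little-endian
    if len(samples) % 2:
        raise ValueError("stereo data must contain an even number of samples")
    # Assumption: even positions hold the doctor's microphone, odd the patient's.
    return samples[0::2], samples[1::2]


def duration_seconds(channel: array) -> float:
    """Length of a mono channel in seconds at the configured sampling rate."""
    return len(channel) / SAMPLE_RATE_HZ
```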
Audio signal processing is performed using the PyAudio and librosa libraries, enabling noise filtering, volume normalization and separation of sound sources. Speaker diarization (determining who spoke when) is implemented using pyannote.audio, which allows individual utterances to be assigned precisely to the doctor and the patient. Real-time audio streaming uses WebRTC, ensuring high quality and low transmission latency.
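With one microphone channel per speaker, a first guess at who is talking can be made from signal energy alone. The sketch below is a simplified illustration of that idea and is not the pyannote.audio pipeline: it compares the RMS level of the two channels over one segment and only decides when one side is clearly louder. The 1.5 ratio is an arbitrary illustrative threshold.

```python
import math


def rms(samples: list) -> float:
    """Root-mean-square level of a segment; 0.0 for an empty segment."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def attribute_speaker(doctor_ch: list, patient_ch: list, min_ratio: float = 1.5) -> str:
    """Label a segment by the clearly louder channel, else 'unknown'."""
    d, p = rms(doctor_ch), rms(patient_ch)
    if d > 0 and d >= min_ratio * p:
        return "doctor"
    if p > 0 and p >= min_ratio * d:
        return "patient"
    return "unknown"  # silence, overlap or cross-talk: leave to the diarization model
```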
Databases and information management
The system implements a multi-layered data storage architecture, adapted to different types of medical information. Structured patient data and medical records are stored in a relational PostgreSQL database, ensuring integrity and transactionality. Audio recordings and attachments are stored in the MinIO object system, offering scalability and fault tolerance.
Semantic information is stored and retrieved using the Qdrant vector database, which enables efficient retrieval of similar documents and text fragments based on their meaning. Graph representations of medical terminology and relationships between concepts are stored in Neo4j, which allows for complex queries that take into account semantic connections.
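Retrieval by meaning rests on comparing embedding vectors. The sketch below shows the underlying idea with a brute-force cosine-similarity search in plain Python; it stands in for what a vector database does with indexes at scale and does not use the Qdrant API. The document identifiers and vectors are illustrative.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list, documents: dict, k: int = 2) -> list:
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

In the system the vectors would come from an embedding model applied to transcripts or document fragments, and the nearest neighbours would be passed on, for example as context for text generation.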
User interfaces and accessibility
The system offers diverse user interfaces adapted to the needs of different target groups. The main web interface for doctors is implemented using React with the Material-UI component library, providing a modern and intuitive look. The mobile interface for patients is built in React Native, offering a consistent user experience on iOS and Android devices.
The virtual assistant for patients is based on a conversational interface using the Gemma3 model, enabling natural interactions in Polish. Medical data and reports are visualized using D3.js and Chart.js, offering a clear presentation of complex information. The form system is implemented using React Hook Form, ensuring data validation and interface responsiveness.
The system supports accessibility for people with special needs in accordance with the WCAG 2.1 guidelines. For older and visually impaired users, text magnification, contrast enhancement and text-to-speech reading using open-source speech engines have been implemented. People with hearing disabilities benefit from automatic captions and transcription and, if necessary, automatic translation into Polish Sign Language. The system also supports people with motor impairments through voice interfaces and support for alternative input devices.
Integration with external systems
The system ensures integration with key external systems used in the Polish healthcare system. The implementation of the HL7 FHIR standard enables data exchange with the P1 e-medical documentation system, ensuring compliance with the Polish National Implementation of HL7 CDA. The system also supports the generation and transmission of e-prescriptions and e-referrals in accordance with applicable standards.
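As a small illustration of what FHIR-based exchange involves, the sketch below assembles a minimal FHIR Patient resource as JSON-ready data. The element names (resourceType, identifier, name, gender, birthDate) and the gender codes follow the FHIR specification; the identifier system URI is a placeholder for this example, not the official identifier system required by P1, and the function name is hypothetical.

```python
from datetime import date

FHIR_GENDERS = {"male", "female", "other", "unknown"}  # FHIR administrative gender codes
PESEL_SYSTEM = "urn:example:pesel"  # placeholder, not the official identifier system


def build_patient(pesel: str, family: str, given: str, birth_date: date, gender: str) -> dict:
    """Return a minimal FHIR Patient resource as a JSON-serializable dict."""
    if gender not in FHIR_GENDERS:
        raise ValueError(f"unsupported gender code: {gender!r}")
    return {
        "resourceType": "Patient",
        "identifier": [{"system": PESEL_SYSTEM, "value": pesel}],
        "name": [{"family": family, "given": [given]}],
        "gender": gender,
        "birthDate": birth_date.isoformat(),  # FHIR dates use YYYY-MM-DD
    }
```

A real integration would validate resources against the profiles published for the national system before sending them.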
Integration with external laboratory and diagnostic systems takes place via standard medical data exchange protocols, which enables automatic inclusion of test results in patient documentation. The system also offers APIs for external applications, enabling the expansion of the ecosystem and integration with telemedicine platforms.
Security and compliance with regulations
Medical data security is a priority for the system and is implemented at many levels. User authentication and authorization are based on the OAuth 2.0 and OpenID Connect standards, with the implementation of role-based access control (RBAC). Data is sent using the HTTPS protocol with TLS certificates, ensuring communication encryption.
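Role-based access control can be reduced to a lookup from roles to permissions. The sketch below shows that core check; the role names and permission strings are hypothetical, and in the described system the user's roles would come from the claims of a verified token rather than being passed in directly.

```python
# Hypothetical role -> permission table.
ROLE_PERMISSIONS = {
    "doctor": {"record:read", "record:write", "prescription:issue"},
    "nurse": {"record:read"},
    "patient": {"own_record:read"},
}


def is_allowed(roles: list, permission: str) -> bool:
    """True if any of the user's roles grants the requested permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)
```

Unknown roles grant nothing, so access is denied by default.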
Sensitive data is stored in encrypted form, using strong cryptographic algorithms. The system implements comprehensive medical data anonymization mechanisms, using the Presidio library to automatically recognize and mask personal information. All data operations are logged and monitored, enabling auditing and detection of unauthorized activities.
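The masking step can be illustrated for one kind of identifier. The sketch below is a regex-based stand-in and does not use the Presidio API: it finds 11-digit sequences, keeps only those that pass the PESEL checksum, and replaces them with a placeholder. The placeholder text is an assumption of this example.

```python
import re

_PESEL_RE = re.compile(r"(?<![0-9])[0-9]{11}(?![0-9])")
_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)  # PESEL checksum weights for the first 10 digits


def is_valid_pesel(candidate: str) -> bool:
    """Check the control digit of an 11-digit PESEL candidate."""
    digits = [int(ch) for ch in candidate]
    weighted_sum = sum(d * w for d, w in zip(digits, _WEIGHTS))
    return (10 - weighted_sum % 10) % 10 == digits[10]


def mask_pesel(text: str) -> str:
    """Replace every checksum-valid PESEL number in the text with a placeholder."""
    return _PESEL_RE.sub(
        lambda m: "<PESEL>" if is_valid_pesel(m.group()) else m.group(), text
    )
```

Checking the control digit reduces false positives on other 11-digit numbers. Names, addresses and dates need context-aware recognizers, which is the role of the dedicated library.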
The system is designed to comply with GDPR and with regulations on medical data processing: it implements patient consent management, data retention policies and mechanisms for exercising data subject rights. Penetration tests are carried out regularly using OWASP ZAP to verify the system's resistance to potential attacks.
Innovative technological aspects
The system is distinguished by a number of innovative technological solutions. Dual-channel audio recording and analysis with separation of doctor and patient voices enables precise attribution of statements and improves the quality of transcription. Dynamic medical visit scenarios, controlled by AI algorithms based on the Gemma3 model, streamline the interview process and ensure complete documentation.
Automatic extraction of medical information from transcripts uses the latest achievements in NLP for Polish, including the HerBERT and plT5 models. Personalization of questionnaires for individual patient characteristics and doctor specialization increases the efficiency of the documentation process. The patient application with a virtual assistant, built on the basis of the Gemma3 model, provides intuitive navigation through the visit process and access to recommendations.
The microservice architecture, built with current open-source tools, keeps the system scalable and flexible, so that it can be developed further and adapted to changing needs. Fine-tuning the Gemma3 model on Polish medical texts improves the quality of the generated documents and the model's understanding of context.
Comprehensive support for people with special needs, implemented using open standards and tools, makes the system accessible to a wide range of users, including older adults, people with disabilities and those who require translation. Integration with national e-medical documentation systems ensures the practical usability of the solution in the Polish healthcare system.