Commit @2c075e3eeeefc0ef016c4a04199242dc15256349 - yjyoon/whisper_streaming

Dominik Macháček 2024-08-19

Merge branch 'main' into vad-streaming-clean

conflicts merged

74b80e3

2c075e3

README.md

--- README.md

+++ README.md


 
 **Turning Whisper into Real-Time Transcription System**
 
-Demonstration paper, by Dominik Macháček, Raj Dabre, Ondřej Bojar, 2023
+Demonstration paper, by [Dominik Macháček](https://ufal.mff.cuni.cz/dominik-machacek), [Raj Dabre](https://prajdabre.github.io/), [Ondřej Bojar](https://ufal.mff.cuni.cz/ondrej-bojar), 2023
 
-Abstract:    Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in live transcription service at a multilingual conference. 
+Abstract:    Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real-time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in live transcription service at a multilingual conference. 
 
 
-Paper in proceedings: http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf
-
-Demo video: https://player.vimeo.com/video/840442741
+[Paper PDF](https://aclanthology.org/2023.ijcnlp-demo.3.pdf), [Demo video](https://player.vimeo.com/video/840442741)
 
 [Slides](http://ufallab.ms.mff.cuni.cz/~machacek/pre-prints/AACL23-2.11.2023-Turning-Whisper-oral.pdf) -- 15 minutes oral presentation at IJCNLP-AACL 2023
 
-Please, cite us. [Bibtex citation](http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/bib/2023.ijcnlp-demo.3.bib):
+Please, cite us. [ACL Anthology](https://aclanthology.org/2023.ijcnlp-demo.3/), [Bibtex citation](https://aclanthology.org/2023.ijcnlp-demo.3.bib):
 
 ```
-@InProceedings{machacek-dabre-bojar:2023:ijcnlp,
-  author    = {Macháček, Dominik  and  Dabre, Raj  and  Bojar, Ondřej},
-  title     = {Turning Whisper into Real-Time Transcription System},
-  booktitle      = {System Demonstrations},
-  month          = {November},
-  year           = {2023},
-  address        = {Bali, Indonesia},
-  publisher      = {Asian Federation of Natural Language Processing},
-  pages     = {17--24},
+@inproceedings{machacek-etal-2023-turning,
+    title = "Turning Whisper into Real-Time Transcription System",
+    author = "Mach{\'a}{\v{c}}ek, Dominik  and
+      Dabre, Raj  and
+      Bojar, Ond{\v{r}}ej",
+    editor = "Saha, Sriparna  and
+      Sujaini, Herry",
+    booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations",
+    month = nov,
+    year = "2023",
+    address = "Bali, Indonesia",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2023.ijcnlp-demo.3",
+    pages = "17--24",
 }
 ```
 
 ## Installation
 
-1) ``pip install librosa`` -- audio processing library
+1) ``pip install librosa soundfile`` -- audio processing library
 
 Note: the VAD additionally requires `pip install torch torchaudio`.
 
 2) Whisper backend.
 
-Two alternative backends are integrated. The most recommended one is [faster-whisper](https://github.com/guillaumekln/faster-whisper) with GPU support. Follow their instructions for NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install with `pip install faster-whisper`.
+ Several alternative backends are integrated. The most recommended one is [faster-whisper](https://github.com/guillaumekln/faster-whisper) with GPU support. Follow their instructions for NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install with `pip install faster-whisper`.
 
 An alternative, less restrictive but slower backend is [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped): `pip install git+https://github.com/linto-ai/whisper-timestamped`
+
+Thirdly, this software can also run on top of the [OpenAI Whisper API](https://platform.openai.com/docs/api-reference/audio/createTranscription). This option is fast and requires no GPU, so a small VM suffices, but you need to pay OpenAI for API access. Note also that, because each audio fragment is processed multiple times, the [price](https://openai.com/pricing) will be higher than the pricing page suggests, so keep an eye on costs while using it. Setting a higher chunk size reduces costs significantly.
+Install with: `pip install openai`
+
+To run with the openai-api backend, make sure that your [OpenAI API key](https://platform.openai.com/api-keys) is set in the `OPENAI_API_KEY` environment variable. For example, before running, do: `export OPENAI_API_KEY=sk-xxx` with *sk-xxx* replaced by your API key.
 
 The backend is loaded only when chosen. The unused one does not have to be installed.
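
To make the cost note above concrete, here is a rough sketch of how re-processing inflates the billed audio time. It assumes, for illustration only, that every iteration re-sends the whole audio buffer, that the buffer is emptied completely once it exceeds the trimming threshold, and that each request is rounded up to whole seconds (as the `transcribed_seconds` counter in `OpenaiApiASR.transcribe` does). `estimate_billed_seconds` is a hypothetical helper, not part of this repository, and real trimming keeps part of the buffer, so actual numbers will differ.

```python
import math


def estimate_billed_seconds(total_sec, min_chunk_sec, trim_sec):
    """Rough estimate of the audio seconds sent to the API (simplified model)."""
    billed = 0
    buffer_len = 0.0   # seconds of audio currently in the buffer
    processed = 0.0    # seconds of the input consumed so far
    while processed < total_sec:
        step = min(min_chunk_sec, total_sec - processed)
        processed += step
        buffer_len += step
        # Each iteration transcribes the whole buffer again, rounded up.
        billed += math.ceil(buffer_len)
        if buffer_len > trim_sec:
            # Assumption: trimming drops the entire buffer.
            buffer_len = 0.0
    return billed


one_minute_small_chunks = estimate_billed_seconds(60, 1, 15)
one_minute_large_chunks = estimate_billed_seconds(60, 5, 15)
print(one_minute_small_chunks, one_minute_large_chunks)
```

Under these assumptions, one minute of audio with 1-second chunks is billed as several minutes, while 5-second chunks cut that by roughly a factor of three.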
 

 
 ```
 usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large}] [--model_cache_dir MODEL_CACHE_DIR] [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}]
-                         [--backend {faster-whisper,whisper_timestamped}] [--vad] [--buffer_trimming {sentence,segment}] [--buffer_trimming_sec BUFFER_TRIMMING_SEC] [--start_at START_AT] [--offline] [--comp_unaware]
+                         [--backend {faster-whisper,whisper_timestamped,openai-api}] [--vad] [--buffer_trimming {sentence,segment}] [--buffer_trimming_sec BUFFER_TRIMMING_SEC] [--start_at START_AT] [--offline] [--comp_unaware]
                          audio_path
 
 positional arguments:

                         Source language code, e.g. en,de,cs, or 'auto' for language detection.
   --task {transcribe,translate}
                         Transcribe or translate.
-  --backend {faster-whisper,whisper_timestamped}
+  --backend {faster-whisper,whisper_timestamped,openai-api}
                         Load only this backend for Whisper processing.
   --vad                 Use VAD = voice activity detection, with the default parameters.
   --buffer_trimming {sentence,segment}

 
 This pseudocode describes the interface that we suggest for your implementation. You can implement any features that you need for your application.
 
-```
+```python
 from whisper_online import *
 
 src_lan = "en"  # source language

 
 ### Server -- real-time from mic
 
-`whisper_online_server.py` has the same model options as `whisper_online.py`, plus `--host` and `--port` of the TCP connection. See help message (`-h` option).
+`whisper_online_server.py` has the same model options as `whisper_online.py`, plus `--host` and `--port` of the TCP connection and the `--warmup-file`. See the help message (`-h` option).
 
 Client example:
 

 re-process confirmed sentence prefixes and skip them, making sure they don't
 overlap, and we limit the processing buffer window. 
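
The confirmation step behind the local agreement policy can be sketched as the longest common prefix of two consecutive hypotheses. This is a simplified illustration on plain word lists; `confirmed_prefix` is a hypothetical helper, and the repository's `HypothesisBuffer` additionally tracks timestamps and already committed words.

```python
def confirmed_prefix(previous_hypothesis, new_hypothesis):
    """Return the words on which two consecutive hypotheses agree from the start."""
    confirmed = []
    for old_word, new_word in zip(previous_hypothesis, new_hypothesis):
        if old_word != new_word:
            break  # the first disagreement ends the confirmed prefix
        confirmed.append(new_word)
    return confirmed


previous = ["turning", "whisper", "into", "a", "real"]
current = ["turning", "whisper", "into", "real-time", "transcription"]
print(confirmed_prefix(previous, current))
```

Only the agreed prefix is emitted; the remaining words stay unconfirmed until a later update repeats them.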
 
-Contributions are welcome.
-
 ### Performance evaluation
 
 [See the paper.](http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf)
 
+### Contributions
+
+Contributions are welcome. We acknowledge especially:
+
+- [The GitHub contributors](https://github.com/ufal/whisper_streaming/graphs/contributors) for their pull requests with new features and bugfixes.
+- [Nice explanation video](https://www.youtube.com/watch?v=_spinzpEeFM) -- published on 31st March 2024, note that newer updates are not included.
+- [The translation of this repo into Chinese.](https://github.com/Gloridust/whisper_streaming_CN)
+- [Ondřej Plátek](https://opla.cz/) for the paper pre-review.
+- [Peter Polák](https://ufal.mff.cuni.cz/peter-polak) for the original idea.
+- The UEDIN team of the [ELITR project](https://elitr.eu) for the original line_packet.py.
+
 
 ## Contact
 

74b80e3

2c075e3

line_packet.py

--- line_packet.py

+++ line_packet.py


 
 """Functions for sending and receiving individual lines of text over a socket.
 
-Used by marian-server-server.py to communicate with the Marian worker.
-
 A line is transmitted using one or more fixed-size packets of UTF-8 bytes
 containing:
 

 
   - Zero or more \0 bytes as required to pad the packet to PACKET_SIZE
 
+Originally from the UEDIN team of the ELITR project. 
 """
 
 PACKET_SIZE = 65536
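
The padding rule from the docstring can be sketched as follows. Only the zero-byte padding to `PACKET_SIZE` is taken from the text above; the newline terminator and the helper names `pad_line` and `unpad_line` are assumptions made for this illustration and may not match the actual functions in `line_packet.py`.

```python
PACKET_SIZE = 65536


def pad_line(line: str, packet_size: int = PACKET_SIZE) -> bytes:
    """Encode a line and pad it with zero bytes to a whole number of packets."""
    data = line.encode("utf-8") + b"\n"  # assumption: a newline ends the line
    remainder = len(data) % packet_size
    if remainder:
        data += b"\0" * (packet_size - remainder)
    return data


def unpad_line(data: bytes) -> str:
    """Strip the zero padding and the assumed newline terminator."""
    return data.rstrip(b"\0").decode("utf-8").rstrip("\n")


packet = pad_line("0 1200 hello world")
print(len(packet), repr(unpad_line(packet)))
```

Fixed-size packets let the receiver read a known number of bytes per call and detect the end of the payload from the padding.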

74b80e3

2c075e3

whisper_online.py

--- whisper_online.py

+++ whisper_online.py


 import librosa
 from functools import lru_cache
 import time
-import datetime
+import logging
 
+import io
+import soundfile as sf
+import math
+
+logger = logging.getLogger(__name__)
 
 @lru_cache
 def load_audio(fname):
-    a, _ = librosa.load(fname, sr=16000)
+    a, _ = librosa.load(fname, sr=16000, dtype=np.float32)
     return a
 
 def load_audio_chunk(fname, beg, end):

 
     def load_model(self, modelsize=None, cache_dir=None, model_dir=None):
         import whisper
+        import whisper_timestamped
         from whisper_timestamped import transcribe_timestamped
         self.transcribe_timestamped = transcribe_timestamped
         if model_dir is not None:
-            print("ignoring model_dir, not implemented",file=self.logfile)
+            logger.debug("ignoring model_dir, not implemented")
         return whisper.load_model(modelsize, download_root=cache_dir)
 
     def transcribe(self, audio, init_prompt=""):

 
     def load_model(self, modelsize=None, cache_dir=None, model_dir=None):
         from faster_whisper import WhisperModel
+#        logging.getLogger("faster_whisper").setLevel(logger.level)
         if model_dir is not None:
-            print(f"Loading whisper model from model_dir {model_dir}. modelsize and cache_dir parameters are not used.",file=self.logfile)
+            logger.debug(f"Loading whisper model from model_dir {model_dir}. modelsize and cache_dir parameters are not used.")
             model_size_or_path = model_dir
         elif modelsize is not None:
             model_size_or_path = modelsize

         self.transcribe_kargs["task"] = "translate"
 
 
+class OpenaiApiASR(ASRBase):
+    """Uses OpenAI's Whisper API for audio transcription."""
+
+    def __init__(self, lan=None, temperature=0, logfile=sys.stderr):
+        self.logfile = logfile
+
+        self.modelname = "whisper-1"  
+        self.original_language = None if lan == "auto" else lan # ISO-639-1 language code
+        self.response_format = "verbose_json" 
+        self.temperature = temperature
+
+        self.load_model()
+
+        self.use_vad_opt = False
+
+        # reset the task in set_translate_task
+        self.task = "transcribe"
+
+    def load_model(self, *args, **kwargs):
+        from openai import OpenAI
+        self.client = OpenAI()
+
+        self.transcribed_seconds = 0  # for logging how many seconds were processed by API, to know the cost
+        
+
+    def ts_words(self, segments):
+        no_speech_segments = []
+        if self.use_vad_opt:
+            for segment in segments.segments:
+                # TODO: threshold can be set from outside
+                if segment["no_speech_prob"] > 0.8:
+                    no_speech_segments.append((segment.get("start"), segment.get("end")))
+
+        o = []
+        for word in segments.words:
+            start = word.get("start")
+            end = word.get("end")
+            if any(s[0] <= start <= s[1] for s in no_speech_segments):
+                # print("Skipping word", word.get("word"), "because it's in a no-speech segment")
+                continue
+            o.append((start, end, word.get("word")))
+        return o
+
+
+    def segments_end_ts(self, res):
+        return [s["end"] for s in res.words]
+
+    def transcribe(self, audio_data, prompt=None, *args, **kwargs):
+        # Write the audio data to a buffer
+        buffer = io.BytesIO()
+        buffer.name = "temp.wav"
+        sf.write(buffer, audio_data, samplerate=16000, format='WAV', subtype='PCM_16')
+        buffer.seek(0)  # Reset buffer's position to the beginning
+
+        self.transcribed_seconds += math.ceil(len(audio_data)/16000)  # it rounds up to the whole seconds
+
+        params = {
+            "model": self.modelname,
+            "file": buffer,
+            "response_format": self.response_format,
+            "temperature": self.temperature,
+            "timestamp_granularities": ["word", "segment"]
+        }
+        if self.task != "translate" and self.original_language:
+            params["language"] = self.original_language
+        if prompt:
+            params["prompt"] = prompt
+
+        if self.task == "translate":
+            proc = self.client.audio.translations
+        else:
+            proc = self.client.audio.transcriptions
+
+        # Process transcription/translation
+        transcript = proc.create(**params)
+        logger.debug(f"OpenAI API processed accumulated {self.transcribed_seconds} seconds")
+
+        return transcript
+
+    def use_vad(self):
+        self.use_vad_opt = True
+
+    def set_translate_task(self):
+        self.task = "translate"
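
The in-memory upload used by `transcribe` above can be tried without any third-party package. The sketch below swaps `soundfile` for the standard-library `wave` module, so it is not the code from this commit, only the same idea: write 16 kHz mono PCM16 audio into a named `io.BytesIO` buffer and rewind it. `pcm16_wav_buffer` is a hypothetical helper name.

```python
import io
import struct
import wave


def pcm16_wav_buffer(samples, sample_rate=16000):
    """Write float samples in [-1, 1] as a mono PCM16 WAV into a memory buffer."""
    buffer = io.BytesIO()
    buffer.name = "temp.wav"  # a file name hints the format to the receiver
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
        wav.writeframes(struct.pack(f"<{len(ints)}h", *ints))
    buffer.seek(0)  # rewind so the next reader starts at the header
    return buffer


buf = pcm16_wav_buffer([0.0, 0.5, -0.5] * 100)
print(len(buf.getvalue()), "bytes")
```

Keeping the audio in memory avoids temporary files, which matters when a new request is sent for every chunk.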
+
+
+
 
 class HypothesisBuffer:
 

                         c = " ".join([self.commited_in_buffer[-j][2] for j in range(1,i+1)][::-1])
                         tail = " ".join(self.new[j-1][2] for j in range(1,i+1))
                         if c == tail:
-                            print("removing last",i,"words:",file=self.logfile)
+                            words = []
                             for j in range(i):
-                                print("\t",self.new.pop(0),file=self.logfile)
+                                words.append(repr(self.new.pop(0)))
+                            words_msg = " ".join(words)
+                            logger.debug(f"removing last {i} words: {words_msg}")
                             break
 
     def flush(self):

             self.transcript_buffer.last_commited_time = self.buffer_time_offset
 
         self.commited = []
-        self.last_chunked_at = 0
-
 
     def insert_audio_chunk(self, audio):
         self.audio_buffer = np.append(self.audio_buffer, audio)

         "context" is the commited text that is inside the audio buffer. It is transcribed again and skipped. It is returned only for debugging and logging reasons.
         """
         k = max(0,len(self.commited)-1)
-        while k > 0 and self.commited[k-1][1] > self.last_chunked_at:
+        while k > 0 and self.commited[k-1][1] > self.buffer_time_offset:
             k -= 1
 
         p = self.commited[:k]
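
The effect of comparing against `buffer_time_offset` can be seen in isolation with the sketch below. It copies the loop shown in the hunk into a hypothetical standalone function, `split_committed`; treating `committed[k:]` as the "context" part is an assumption based on the docstring, since the rest of the method is not shown here.

```python
def split_committed(committed, buffer_time_offset):
    """Split committed (beg, end, word) tuples into prompt and context parts."""
    k = max(0, len(committed) - 1)
    # Walk back while the previous word ends inside the current audio buffer.
    while k > 0 and committed[k - 1][1] > buffer_time_offset:
        k -= 1
    prompt_words = committed[:k]    # ended before the buffer starts
    context_words = committed[k:]   # assumed: still inside the audio buffer
    return prompt_words, context_words


committed = [(0.0, 1.0, "a"), (1.0, 2.0, "b"), (2.0, 3.0, "c"), (3.0, 4.0, "d")]
prompt_words, context_words = split_committed(committed, buffer_time_offset=2.0)
print([w for _, _, w in prompt_words], [w for _, _, w in context_words])
```

Because `buffer_time_offset` is already updated whenever the buffer is trimmed, a separate `last_chunked_at` field is not needed for this split.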

         """
 
         prompt, non_prompt = self.prompt()
-        print("PROMPT:", prompt, file=self.logfile)
-        print("CONTEXT:", non_prompt, file=self.logfile)
-        print(f"transcribing {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f} seconds from {self.buffer_time_offset:2.2f}",file=self.logfile)
+        logger.debug(f"PROMPT: {prompt}")
+        logger.debug(f"CONTEXT: {non_prompt}")
+        logger.debug(f"transcribing {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f} seconds from {self.buffer_time_offset:2.2f}")
         res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt)
 
         # transform to [(beg,end,"word1"), ...]

         self.transcript_buffer.insert(tsw, self.buffer_time_offset)
         o = self.transcript_buffer.flush()
         self.commited.extend(o)
-        print(">>>>COMPLETE NOW:",self.to_flush(o),file=self.logfile,flush=True)
-        print("INCOMPLETE:",self.to_flush(self.transcript_buffer.complete()),file=self.logfile,flush=True)
+        completed = self.to_flush(o)
+        logger.debug(f">>>>COMPLETE NOW: {completed}")
+        the_rest = self.to_flush(self.transcript_buffer.complete())
+        logger.debug(f"INCOMPLETE: {the_rest}")
 
         # there is a newly confirmed text
 

             #while k>0 and self.commited[k][1] > l:
             #    k -= 1
             #t = self.commited[k][1] 
-            print(f"chunking segment",file=self.logfile)
+            logger.debug("chunking segment")
             #self.chunk_at(t)
 
-        print(f"len of buffer now: {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f}",file=self.logfile)
+        logger.debug(f"len of buffer now: {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f}")
         return self.to_flush(o)
 
     def chunk_completed_sentence(self):
         if self.commited == []: return
-        print(self.commited,file=self.logfile)
+        logger.debug(self.commited)
         sents = self.words_to_sentences(self.commited)
         for s in sents:
-            print("\t\tSENT:",s,file=self.logfile)
+            logger.debug(f"\t\tSENT: {s}")
         if len(sents) < 2:
             return
         while len(sents) > 2:

         # we will continue with audio processing at this timestamp
         chunk_at = sents[-2][1]
 
-        print(f"--- sentence chunked at {chunk_at:2.2f}",file=self.logfile)
+        logger.debug(f"--- sentence chunked at {chunk_at:2.2f}")
         self.chunk_at(chunk_at)
 
     def chunk_completed_segment(self, res):

                 ends.pop(-1)
                 e = ends[-2]+self.buffer_time_offset
             if e <= t:
-                print(f"--- segment chunked at {e:2.2f}",file=self.logfile)
+                logger.debug(f"--- segment chunked at {e:2.2f}")
                 self.chunk_at(e)
             else:
-                print(f"--- last segment not within commited area",file=self.logfile)
+                logger.debug(f"--- last segment not within commited area")
         else:
-            print(f"--- not enough segments to chunk",file=self.logfile)
+            logger.debug(f"--- not enough segments to chunk")
 
 
 

         cut_seconds = time - self.buffer_time_offset
         self.audio_buffer = self.audio_buffer[int(cut_seconds*self.SAMPLING_RATE):]
         self.buffer_time_offset = time
-        self.last_chunked_at = time
 
     def words_to_sentences(self, words):
         """Uses self.tokenizer for sentence segmentation of words.

         """
         o = self.transcript_buffer.complete()
         f = self.to_flush(o)
-        print("last, noncommited:",f,file=self.logfile)
+        logger.debug(f"last, noncommited: {f}")
         self.buffer_time_offset += len(self.audio_buffer)/16000
         return f
 

 
     # the following languages are in Whisper, but not in wtpsplit:
     if lan in "as ba bo br bs fo haw hr ht jw lb ln lo mi nn oc sa sd sn so su sw tk tl tt".split():
-        print(f"{lan} code is not supported by wtpsplit. Going to use None lang_code option.", file=sys.stderr)
+        logger.debug(f"{lan} code is not supported by wtpsplit. Going to use None lang_code option.")
         lan = None
 
     from wtpsplit import WtP

     parser.add_argument('--model', type=str, default='large-v2', choices="tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large".split(","),help="Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir.")
     parser.add_argument('--model_cache_dir', type=str, default=None, help="Overriding the default model cache dir where models downloaded from the hub are saved")
     parser.add_argument('--model_dir', type=str, default=None, help="Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.")
-    parser.add_argument('--lan', '--language', type=str, default='en', help="Source language code, e.g. en,de,cs, or 'auto' for language detection.")
+    parser.add_argument('--lan', '--language', type=str, default='auto', help="Source language code, e.g. en,de,cs, or 'auto' for language detection.")
     parser.add_argument('--task', type=str, default='transcribe', choices=["transcribe","translate"],help="Transcribe or translate.")
-    parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped"],help='Load only this backend for Whisper processing.')
+    parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped", "openai-api"],help='Load only this backend for Whisper processing.')
     parser.add_argument('--vad', action="store_true", default=False, help='Use VAD = voice activity detection, with the default parameters.')
     parser.add_argument('--buffer_trimming', type=str, default="segment", choices=["sentence", "segment"],help='Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.')
     parser.add_argument('--buffer_trimming_sec', type=float, default=15, help='Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.')
+    parser.add_argument("-l", "--log-level", dest="log_level", choices=['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'], help="Set the log level", default='DEBUG')
 
-## main:
+def asr_factory(args, logfile=sys.stderr):
+    """
+    Creates and configures an ASR and ASR Online instance based on the specified backend and arguments.
+    """
+    backend = args.backend
+    if backend == "openai-api":
+        logger.debug("Using OpenAI API.")
+        asr = OpenaiApiASR(lan=args.lan)
+    else:
+        if backend == "faster-whisper":
+            asr_cls = FasterWhisperASR
+        else:
+            asr_cls = WhisperTimestampedASR
+
+        # Only for FasterWhisperASR and WhisperTimestampedASR
+        size = args.model
+        t = time.time()
+        logger.info(f"Loading Whisper {size} model for {args.lan}...")
+        asr = asr_cls(modelsize=size, lan=args.lan, cache_dir=args.model_cache_dir, model_dir=args.model_dir)
+        e = time.time()
+        logger.info(f"done. It took {round(e-t,2)} seconds.")
+
+    # Apply common configurations
+    if getattr(args, 'vad', False):  # Checks if VAD argument is present and True
+        logger.info("Setting VAD filter")
+        asr.use_vad()
+
+    language = args.lan
+    if args.task == "translate":
+        asr.set_translate_task()
+        tgt_language = "en"  # Whisper translates into English
+    else:
+        tgt_language = language  # Whisper transcribes in this language
+
+    # Create the tokenizer
+    if args.buffer_trimming == "sentence":
+        tokenizer = create_tokenizer(tgt_language)
+    else:
+        tokenizer = None
+
+    # Create the OnlineASRProcessor
+    online = OnlineASRProcessor(asr,tokenizer,logfile=logfile,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
+
+    return asr, online
+
+def set_logging(args,logger,other="_server"):
+    logging.basicConfig(#format='%(name)s 
+            format='%(levelname)s\t%(message)s')
+    logger.setLevel(args.log_level)
+    logging.getLogger("whisper_online"+other).setLevel(args.log_level)
+#    logging.getLogger("whisper_online_server").setLevel(args.log_level)
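
The logging pattern introduced here, a module-level logger whose level is set by name from a command-line option, can be sketched on its own as below. The logger name `demo_module` and the function `configure_logging` are hypothetical and only mirror what `set_logging` does with `args.log_level`.

```python
import logging

logger = logging.getLogger("demo_module")  # hypothetical module name


def configure_logging(level_name: str) -> None:
    """Install a simple format and set this module's logger level by name."""
    logging.basicConfig(format="%(levelname)s\t%(message)s")
    logger.setLevel(level_name)  # accepts names such as "DEBUG" or "WARNING"


configure_logging("WARNING")
logger.debug("hidden at WARNING level")
logger.error("shown at WARNING level")
```

Setting the level on the named logger rather than on the root logger keeps third-party libraries at their own default verbosity.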
+
+
 
 if __name__ == "__main__":
 

     logfile = sys.stderr
 
     if args.offline and args.comp_unaware:
-        print("No or one option from --offline and --comp_unaware are available, not both. Exiting.",file=logfile)
+        logger.error("No or one option from --offline and --comp_unaware are available, not both. Exiting.")
         sys.exit(1)
+
+#    if args.log_level:
+#        logging.basicConfig(format='whisper-%(levelname)s:%(name)s: %(message)s',
+#                            level=getattr(logging, args.log_level))
+
+    set_logging(args,logger)
 
     audio_path = args.audio_path
 
     SAMPLING_RATE = 16000
     duration = len(load_audio(audio_path))/SAMPLING_RATE
-    print("Audio duration is: %2.2f seconds" % duration, file=logfile)
+    logger.info("Audio duration is: %2.2f seconds" % duration)
 
-    size = args.model
-    language = args.lan
-
-    t = time.time()
-    print(f"Loading Whisper {size} model for {language}...",file=logfile,end=" ",flush=True)
-
-    if args.backend == "faster-whisper":
-        asr_cls = FasterWhisperASR
-    else:
-        asr_cls = WhisperTimestampedASR
-
-    asr = asr_cls(modelsize=size, lan=language, cache_dir=args.model_cache_dir, model_dir=args.model_dir)
-
-    if args.task == "translate":
-        asr.set_translate_task()
-        tgt_language = "en"  # Whisper translates into English
-    else:
-        tgt_language = language  # Whisper transcribes in this language
-
-
-    e = time.time()
-    print(f"done. It took {round(e-t,2)} seconds.",file=logfile)
-
-    if args.vad:
-        print("setting VAD filter",file=logfile)
-        asr.use_vad()
-
-    
+    asr, online = asr_factory(args, logfile=logfile)
     min_chunk = args.min_chunk_size
-    if args.buffer_trimming == "sentence":
-        tokenizer = create_tokenizer(tgt_language)
-    else:
-        tokenizer = None
-    online = OnlineASRProcessor(asr,tokenizer,logfile=logfile,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
-
 
     # load the audio into the LRU cache before we start the timer
     a = load_audio_chunk(audio_path,0,1)
 
-    # warm up the ASR, because the very first transcribe takes much more time than the other
+    # warm up the ASR because the very first transcribe takes much more time than the other
     asr.transcribe(a)
 
     beg = args.start_at

             print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),file=logfile,flush=True)
             print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),flush=True)
         else:
-            print(o,file=logfile,flush=True)
+            # No text, so no output
+            pass
 
     if args.offline: ## offline mode processing (for testing/debugging)
         a = load_audio(audio_path)
         online.insert_audio_chunk(a)
         try:
             o = online.process_iter()
-        except AssertionError:
-            print("assertion error",file=logfile)
-            pass
+        except AssertionError as e:
+            logger.error(f"assertion error: {repr(e)}")
         else:
             output_transcript(o)
         now = None

             online.insert_audio_chunk(a)
             try:
                 o = online.process_iter()
-            except AssertionError:
-                print("assertion error",file=logfile)
+            except AssertionError as e:
+                logger.error(f"assertion error: {repr(e)}")
                 pass
             else:
                 output_transcript(o, now=end)
 
-            print(f"## last processed {end:.2f}s",file=logfile,flush=True)
+            logger.debug(f"## last processed {end:.2f}s")
 
             if end >= duration:
                 break

 
             try:
                 o = online.process_iter()
-            except AssertionError:
-                print("assertion error",file=logfile)
+            except AssertionError as e:
+                logger.error(f"assertion error: {e}")
                 pass
             else:
                 output_transcript(o)
             now = time.time() - start
-            print(f"## last processed {end:.2f} s, now is {now:.2f}, the latency is {now-end:.2f}",file=logfile,flush=True)
+            logger.debug(f"## last processed {end:.2f} s, now is {now:.2f}, the latency is {now-end:.2f}")
 
             if end >= duration:
                 break

74b80e3

2c075e3

whisper_online_server.py

--- whisper_online_server.py

+++ whisper_online_server.py


 import sys
 import argparse
 import os
+import logging
+import numpy as np
+
+logger = logging.getLogger(__name__)
 parser = argparse.ArgumentParser()
 
 # server options

 parser.add_argument("--port", type=int, default=43007)
 parser.add_argument('--vac', action="store_true", default=False, help='Use VAC = voice activity controller.')
 parser.add_argument('--vac-chunk-size', type=float, default=0.04, help='VAC sample size in seconds.')
+parser.add_argument("--warmup-file", type=str, dest="warmup_file", 
+        help="The path to a speech audio wav file to warm up Whisper so that the very first chunk processing is fast. It can be e.g. https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav .")
 
 # options from whisper_online
 add_shared_args(parser)
 args = parser.parse_args()
 
+set_logging(args,logger,other="")
 
 # setting whisper object by args 
 

 
 size = args.model
 language = args.lan
+asr, online = asr_factory(args)
+min_chunk = args.min_chunk_size
 
-t = time.time()
-print(f"Loading Whisper {size} model for {language}...",file=sys.stderr,end=" ",flush=True)
-
-if args.backend == "faster-whisper":
-    from faster_whisper import WhisperModel
-    asr_cls = FasterWhisperASR
-elif args.backend == "whisper_timestamped":
-    import whisper
-    from whisper_online import WhisperTimestampedASR
-    asr_cls = WhisperTimestampedASR
+# warm up the ASR because the very first transcribe takes more time than the others. 
+# Test results in https://github.com/ufal/whisper_streaming/pull/81
+msg = "Whisper is not warmed up. The first chunk processing may take longer."
+if args.warmup_file:
+    if os.path.isfile(args.warmup_file):
+        a = load_audio_chunk(args.warmup_file,0,1)
+        asr.transcribe(a)
+        logger.info("Whisper is warmed up.")
+    else:
+        logger.critical("The warm up file is not available. "+msg)
+        sys.exit(1)
 else:
-    raise ValueError(f"Unknown {args.backend=}")
-
-asr = asr_cls(modelsize=size, lan=language, cache_dir=args.model_cache_dir, model_dir=args.model_dir)
-
-if args.task == "translate":
-    asr.set_translate_task()
-    tgt_language = "en"
-else:
-    tgt_language = language
-
-print(f"done. It took {round(time.time()-t,2)} seconds.",file=sys.stderr)
-
-if args.vad:
-    print("setting VAD filter",file=sys.stderr)
-    asr.use_vad()
-
-
-if args.buffer_trimming == "sentence":
-    tokenizer = create_tokenizer(tgt_language)
-else:
-    tokenizer = None
-if not args.vac:
-    from whisper_online import OnlineASRProcessor
-    online = OnlineASRProcessor(asr,tokenizer,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
-else:
-    from whisper_online_vac import VACOnlineASRProcessor
-    online = VACOnlineASRProcessor(args.min_chunk_size, asr,tokenizer,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
-
-
-demo_audio_path = "cs-maji-2.16k.wav"
-if os.path.exists(demo_audio_path):
-    # load the audio into the LRU cache before we start the timer
-    a = load_audio_chunk(demo_audio_path,0,1)
-
-    # TODO: it should be tested whether it's meaningful
-    # warm up the ASR, because the very first transcribe takes much more time than the other
-    asr.transcribe(a)
-else:
-    print("Whisper is not warmed up",file=sys.stderr)
-
-
+    logger.warning(msg)
 
 
 ######### Server objects
 
 import line_packet
 import socket
-
-import logging
-
 
 class Connection:
     '''it wraps conn object'''

                 break
             print("received audio:",len(raw_bytes), "bytes", raw_bytes[:10])
             sf = soundfile.SoundFile(io.BytesIO(raw_bytes), channels=1,endian="LITTLE",samplerate=SAMPLING_RATE, subtype="PCM_16",format="RAW")
-            audio, _ = librosa.load(sf,sr=SAMPLING_RATE)
+            audio, _ = librosa.load(sf,sr=SAMPLING_RATE,dtype=np.float32)
             out.append(audio)
         if not out:
             return None

             print("%1.0f %1.0f %s" % (beg,end,o[2]),flush=True,file=sys.stderr)
             return "%1.0f %1.0f %s" % (beg,end,o[2])
         else:
-            print(o,file=sys.stderr,flush=True)
+            logger.debug("No text in this segment")
             return None
 
     def send_result(self, o):

         while True:
             a = self.receive_audio_chunk()
             if a is None:
-                print("break here",file=sys.stderr)
                 break
             self.online_asr_proc.insert_audio_chunk(a)
             o = online.process_iter()
             try:
                 self.send_result(o)
             except BrokenPipeError:
-                print("broken pipe -- connection closed?",file=sys.stderr)
+                logger.info("broken pipe -- connection closed?")
                 break
 
 #        o = online.finish()  # this should be working

 
 
 
-
-# Start logging.
-level = logging.INFO
-logging.basicConfig(level=level, format='whisper-server-%(levelname)s: %(message)s')
-
 # server loop
 
 with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
     s.bind((args.host, args.port))
     s.listen(1)
-    logging.info('INFO: Listening on'+str((args.host, args.port)))
+    logger.info('Listening on'+str((args.host, args.port)))
     while True:
         conn, addr = s.accept()
-        logging.info('INFO: Connected to client on {}'.format(addr))
+        logger.info('Connected to client on {}'.format(addr))
         connection = Connection(conn)
         proc = ServerProcessor(connection, online, args.min_chunk_size)
         proc.process()
         conn.close()
-        logging.info('INFO: Connection to client closed')
-logging.info('INFO: Connection closed, terminating.')
+        logger.info('Connection to client closed')
+logger.info('Connection closed, terminating.')
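
The logging changes in this file follow one pattern: the module creates its own logger with `logging.getLogger(__name__)`, and a single setup call applies the level chosen on the command line. A minimal sketch of that pattern, assuming a hypothetical `configure_logging` helper and the logger name `demo_server` (neither exists in the repository):

```python
import io
import logging

def configure_logging(logger, level_name, stream):
    """Attach one handler to `logger` and apply a level given by name (e.g. "INFO")."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s\t%(message)s"))
    logger.handlers = [handler]   # replace handlers so repeated calls do not duplicate output
    logger.propagate = False      # keep the sketch independent of the root logger
    logger.setLevel(level_name)   # setLevel accepts level names such as "DEBUG"

logger = logging.getLogger("demo_server")  # hypothetical name, stands in for __name__
stream = io.StringIO()
configure_logging(logger, "INFO", stream)

logger.debug("No text in this segment")               # below INFO, so it is dropped
logger.info("Listening on %s", ("localhost", 43007))  # formatted lazily by logging
captured = stream.getvalue()
print(captured, end="")
```

Writing to an in-memory stream here only makes the result easy to inspect; the server itself logs through `logging.basicConfig`, as the `set_logging` call in the diff shows.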


...	...	@@ -3,43 +3,51 @@
3	3
4	4	Turning Whisper into Real-Time Transcription System
5	5
6		-Demonstration paper, by Dominik Macháček, Raj Dabre, Ondřej Bojar, 2023
	6	+Demonstration paper, by [Dominik Macháček](https://ufal.mff.cuni.cz/dominik-machacek), [Raj Dabre](https://prajdabre.github.io/), [Ondřej Bojar](https://ufal.mff.cuni.cz/ondrej-bojar), 2023
7	7
8		-Abstract: Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in live transcription service at a multilingual conference.
	8	+Abstract: Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real-time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in live transcription service at a multilingual conference.
9	9
10	10
11		-Paper in proceedings: http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf
12		-
13		-Demo video: https://player.vimeo.com/video/840442741
	11	+[Paper PDF](https://aclanthology.org/2023.ijcnlp-demo.3.pdf), [Demo video](https://player.vimeo.com/video/840442741)
14	12
15	13	[Slides](http://ufallab.ms.mff.cuni.cz/~machacek/pre-prints/AACL23-2.11.2023-Turning-Whisper-oral.pdf) -- 15 minutes oral presentation at IJCNLP-AACL 2023
16	14
17		-Please, cite us. [Bibtex citation](http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/bib/2023.ijcnlp-demo.3.bib):
	15	+Please, cite us. [ACL Anthology](https://aclanthology.org/2023.ijcnlp-demo.3/), [Bibtex citation](https://aclanthology.org/2023.ijcnlp-demo.3.bib):
18	16
19	17	```
20		-@InProceedings{machacek-dabre-bojar:2023:ijcnlp,
21		- author = {Macháček, Dominik and Dabre, Raj and Bojar, Ondřej},
22		- title = {Turning Whisper into Real-Time Transcription System},
23		- booktitle = {System Demonstrations},
24		- month = {November},
25		- year = {2023},
26		- address = {Bali, Indonesia},
27		- publisher = {Asian Federation of Natural Language Processing},
28		- pages = {17--24},
	18	+@inproceedings{machacek-etal-2023-turning,
	19	+ title = "Turning Whisper into Real-Time Transcription System",
	20	+ author = "Mach{\'a}{\v{c}}ek, Dominik and
	21	+ Dabre, Raj and
	22	+ Bojar, Ond{\v{r}}ej",
	23	+ editor = "Saha, Sriparna and
	24	+ Sujaini, Herry",
	25	+ booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations",
	26	+ month = nov,
	27	+ year = "2023",
	28	+ address = "Bali, Indonesia",
	29	+ publisher = "Association for Computational Linguistics",
	30	+ url = "https://aclanthology.org/2023.ijcnlp-demo.3",
	31	+ pages = "17--24",
29	32	}
30	33	```
31	34
32	35	## Installation
33	36
34		-1) ``pip install librosa`` -- audio processing library
	37	+1) ``pip install librosa soundfile`` -- audio processing library
35	38
36	39	Note: for the VAD I need to `pip install torch torchaudio`.
37	40
38	41	2) Whisper backend.
39	42
40		-Two alternative backends are integrated. The most recommended one is [faster-whisper](https://github.com/guillaumekln/faster-whisper) with GPU support. Follow their instructions for NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install with `pip install faster-whisper`.
	43	+ Several alternative backends are integrated. The most recommended one is [faster-whisper](https://github.com/guillaumekln/faster-whisper) with GPU support. Follow their instructions for NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install with `pip install faster-whisper`.
41	44
42	45	Alternative, less restrictive, but slower backend is [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped): `pip install git+https://github.com/linto-ai/whisper-timestamped`
	46	+
	47	+Thirdly, it's also possible to run this software from the [OpenAI Whisper API](https://platform.openai.com/docs/api-reference/audio/createTranscription). This solution is fast and requires no GPU, just a small VM will suffice, but you will need to pay OpenAI for api access. Also note that, since each audio fragment is processed multiple times, the [price](https://openai.com/pricing) will be higher than obvious from the pricing page, so keep an eye on costs while using. Setting a higher chunk-size will reduce costs significantly.
	48	+Install with: `pip install openai`
	49	+
	50	+For running with the openai-api backend, make sure that your [OpenAI api key](https://platform.openai.com/api-keys) is set in the `OPENAI_API_KEY` environment variable. For example, before running, do: `export OPENAI_API_KEY=sk-xxx` with sk-xxx replaced with your api key.
43	51
44	52	The backend is loaded only when chosen. The unused one does not have to be installed.
45	53
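
The README text above notes that, with the `openai-api` backend, each audio fragment is processed multiple times, so the cost exceeds what the pricing page suggests. The sketch below illustrates why a larger chunk size helps. It uses a deliberately simplified model that is an assumption, not the repository's exact behavior: the buffer grows by one chunk per iteration and is emptied once it exceeds the trimming threshold, and each request is rounded up to whole seconds as `transcribed_seconds` is in `OpenaiApiASR.transcribe`. The function name `estimate_billed_seconds` is hypothetical.

```python
import math

def estimate_billed_seconds(audio_seconds, chunk_seconds, trim_seconds=15.0):
    """Estimate the seconds sent to the API when a growing buffer is re-transcribed.

    Simplified model (an assumption): the buffer grows by `chunk_seconds`
    per iteration and is emptied once it exceeds `trim_seconds`.
    """
    billed = 0
    buffer_len = 0.0
    n_iterations = math.ceil(audio_seconds / chunk_seconds)
    for _ in range(n_iterations):
        buffer_len += chunk_seconds
        billed += math.ceil(buffer_len)  # each request is rounded up to whole seconds
        if buffer_len > trim_seconds:
            buffer_len = 0.0             # real trimming keeps part of the buffer
    return billed

small = estimate_billed_seconds(60, chunk_seconds=1.0)
large = estimate_billed_seconds(60, chunk_seconds=5.0)
print(small, large)
```

Under this model, one minute of audio is sent several times over with 1-second chunks and noticeably less often with 5-second chunks. The real ratio depends on how the buffer is actually trimmed.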
...	...	@@ -71,7 +79,7 @@
71	79
72	80	```
73	81	usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large}] [--model_cache_dir MODEL_CACHE_DIR] [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}]
74		- [--backend {faster-whisper,whisper_timestamped}] [--vad] [--buffer_trimming {sentence,segment}] [--buffer_trimming_sec BUFFER_TRIMMING_SEC] [--start_at START_AT] [--offline] [--comp_unaware]
	82	+ [--backend {faster-whisper,whisper_timestamped,openai-api}] [--vad] [--buffer_trimming {sentence,segment}] [--buffer_trimming_sec BUFFER_TRIMMING_SEC] [--start_at START_AT] [--offline] [--comp_unaware]
75	83	audio_path
76	84
77	85	positional arguments:
...	...	@@ -91,7 +99,7 @@
91	99	Source language code, e.g. en,de,cs, or 'auto' for language detection.
92	100	--task {transcribe,translate}
93	101	Transcribe or translate.
94		- --backend {faster-whisper,whisper_timestamped}
	102	+ --backend {faster-whisper,whisper_timestamped,openai-api}
95	103	Load only this backend for Whisper processing.
96	104	--vad Use VAD = voice activity detection, with the default parameters.
97	105	--buffer_trimming {sentence,segment}
...	...	@@ -149,7 +157,7 @@
149	157
150	158	This pseudocode describes the interface that we suggest for your implementation. You can implement any features that you need for your application.
151	159
152		-```
	160	+```python
153	161	from whisper_online import *
154	162
155	163	src_lan = "en" # source language
...	...	@@ -177,7 +185,7 @@
177	185
178	186	### Server -- real-time from mic
179	187
180		-`whisper_online_server.py` has the same model options as `whisper_online.py`, plus `--host` and `--port` of the TCP connection. See help message (`-h` option).
	188	+`whisper_online_server.py` has the same model options as `whisper_online.py`, plus `--host` and `--port` of the TCP connection and the `--warmup-file`. See the help message (`-h` option).
181	189
182	190	Client example:
183	191
...	...	@@ -218,12 +226,21 @@
218	226	re-process confirmed sentence prefixes and skip them, making sure they don't
219	227	overlap, and we limit the processing buffer window.
220	228
221		-Contributions are welcome.
222		-
223	229	### Performance evaluation
224	230
225	231	[See the paper.](http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf)
226	232
	233	+### Contributions
	234	+
	235	+Contributions are welcome. We acknowledge especially:
	236	+
	237	+- [The GitHub contributors](https://github.com/ufal/whisper_streaming/graphs/contributors) for their pull requests with new features and bugfixes.
	238	+- [Nice explanation video](https://www.youtube.com/watch?v=_spinzpEeFM) -- published on 31st March 2024, note that newer updates are not included.
	239	+- [The translation of this repo into Chinese.](https://github.com/Gloridust/whisper_streaming_CN)
	240	+- [Ondřej Plátek](https://opla.cz/) for the paper pre-review.
	241	+- [Peter Polák](https://ufal.mff.cuni.cz/peter-polak) for the original idea.
	242	+- The UEDIN team of the [ELITR project](https://elitr.eu) for the original line_packet.py.
	243	+
227	244
228	245	## Contact
229	246

...	...	@@ -2,8 +2,6 @@
2	2
3	3	"""Functions for sending and receiving individual lines of text over a socket.
4	4
5		-Used by marian-server-server.py to communicate with the Marian worker.
6		-
7	5	A line is transmitted using one or more fixed-size packets of UTF-8 bytes
8	6	containing:
9	7
...	...	@@ -11,6 +9,7 @@
11	9
12	10	- Zero or more \0 bytes as required to pad the packet to PACKET_SIZE
13	11
	12	+Originally from the UEDIN team of the ELITR project.
14	13	"""
15	14
16	15	PACKET_SIZE = 65536
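
The docstring above describes lines sent as fixed-size packets of UTF-8 bytes, padded with `\0` up to `PACKET_SIZE`. A minimal sketch of that padding idea follows. The helpers `pack_line` and `unpack_line` are hypothetical and are not the module's functions, and line termination is left out because this hunk does not show that part of the docstring.

```python
PACKET_SIZE = 8  # tiny value for the demo; the module above uses 65536

def pack_line(text, packet_size=PACKET_SIZE):
    """Split the UTF-8 bytes of `text` into packets, padding the last one with \\0."""
    data = text.encode("utf-8")
    packets = []
    for offset in range(0, len(data), packet_size):
        chunk = data[offset:offset + packet_size]
        packets.append(chunk + b"\0" * (packet_size - len(chunk)))
    return packets

def unpack_line(packets):
    """Join packets, drop the \\0 padding, and decode back to text."""
    return b"".join(packets).rstrip(b"\0").decode("utf-8")

packets = pack_line("0 1200 ahoj světe")
restored = unpack_line(packets)
print(len(packets), restored)
```

Decoding happens only after the packets are joined, so a multi-byte character split across two packets is still restored correctly.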

...	...	@@ -4,12 +4,17 @@
4	4	import librosa
5	5	from functools import lru_cache
6	6	import time
7		-import datetime
	7	+import logging
8	8
	9	+import io
	10	+import soundfile as sf
	11	+import math
	12	+
	13	+logger = logging.getLogger(__name__)
9	14
10	15	@lru_cache
11	16	def load_audio(fname):
12		- a, _ = librosa.load(fname, sr=16000)
	17	+ a, _ = librosa.load(fname, sr=16000, dtype=np.float32)
13	18	return a
14	19
15	20	def load_audio_chunk(fname, beg, end):
...	...	@@ -57,10 +62,11 @@
57	62
58	63	def load_model(self, modelsize=None, cache_dir=None, model_dir=None):
59	64	import whisper
	65	+ import whisper_timestamped
60	66	from whisper_timestamped import transcribe_timestamped
61	67	self.transcribe_timestamped = transcribe_timestamped
62	68	if model_dir is not None:
63		- print("ignoring model_dir, not implemented",file=self.logfile)
	69	+ logger.debug("ignoring model_dir, not implemented")
64	70	return whisper.load_model(modelsize, download_root=cache_dir)
65	71
66	72	def transcribe(self, audio, init_prompt=""):
...	...	@@ -99,8 +105,9 @@
99	105
100	106	def load_model(self, modelsize=None, cache_dir=None, model_dir=None):
101	107	from faster_whisper import WhisperModel
	108	+# logging.getLogger("faster_whisper").setLevel(logger.level)
102	109	if model_dir is not None:
103		- print(f"Loading whisper model from model_dir {model_dir}. modelsize and cache_dir parameters are not used.",file=self.logfile)
	110	+ logger.debug(f"Loading whisper model from model_dir {model_dir}. modelsize and cache_dir parameters are not used.")
104	111	model_size_or_path = model_dir
105	112	elif modelsize is not None:
106	113	model_size_or_path = modelsize
...	...	@@ -150,6 +157,93 @@
150	157	self.transcribe_kargs["task"] = "translate"
151	158
152	159
	160	+class OpenaiApiASR(ASRBase):
	161	+ """Uses OpenAI's Whisper API for audio transcription."""
	162	+
	163	+ def __init__(self, lan=None, temperature=0, logfile=sys.stderr):
	164	+ self.logfile = logfile
	165	+
	166	+ self.modelname = "whisper-1"
	167	+ self.original_language = None if lan == "auto" else lan # ISO-639-1 language code
	168	+ self.response_format = "verbose_json"
	169	+ self.temperature = temperature
	170	+
	171	+ self.load_model()
	172	+
	173	+ self.use_vad_opt = False
	174	+
	175	+ # reset the task in set_translate_task
	176	+ self.task = "transcribe"
	177	+
	178	+ def load_model(self, *args, **kwargs):
	179	+ from openai import OpenAI
	180	+ self.client = OpenAI()
	181	+
	182	+ self.transcribed_seconds = 0 # for logging how many seconds were processed by API, to know the cost
	183	+
	184	+
	185	+ def ts_words(self, segments):
	186	+ no_speech_segments = []
	187	+ if self.use_vad_opt:
	188	+ for segment in segments.segments:
	189	+ # TODO: threshold can be set from outside
	190	+ if segment["no_speech_prob"] > 0.8:
	191	+ no_speech_segments.append((segment.get("start"), segment.get("end")))
	192	+
	193	+ o = []
	194	+ for word in segments.words:
	195	+ start = word.get("start")
	196	+ end = word.get("end")
	197	+ if any(s[0] <= start <= s[1] for s in no_speech_segments):
	198	+ # print("Skipping word", word.get("word"), "because it's in a no-speech segment")
	199	+ continue
	200	+ o.append((start, end, word.get("word")))
	201	+ return o
	202	+
	203	+
	204	+ def segments_end_ts(self, res):
	205	+ return [s["end"] for s in res.words]
	206	+
	207	+ def transcribe(self, audio_data, prompt=None, *args, **kwargs):
	208	+ # Write the audio data to a buffer
	209	+ buffer = io.BytesIO()
	210	+ buffer.name = "temp.wav"
	211	+ sf.write(buffer, audio_data, samplerate=16000, format='WAV', subtype='PCM_16')
	212	+ buffer.seek(0) # Reset buffer's position to the beginning
	213	+
	214	+ self.transcribed_seconds += math.ceil(len(audio_data)/16000) # it rounds up to the whole seconds
	215	+
	216	+ params = {
	217	+ "model": self.modelname,
	218	+ "file": buffer,
	219	+ "response_format": self.response_format,
	220	+ "temperature": self.temperature,
	221	+ "timestamp_granularities": ["word", "segment"]
	222	+ }
	223	+ if self.task != "translate" and self.original_language:
	224	+ params["language"] = self.original_language
	225	+ if prompt:
	226	+ params["prompt"] = prompt
	227	+
	228	+ if self.task == "translate":
	229	+ proc = self.client.audio.translations
	230	+ else:
	231	+ proc = self.client.audio.transcriptions
	232	+
	233	+ # Process transcription/translation
	234	+ transcript = proc.create(**params)
	235	+ logger.debug(f"OpenAI API processed accumulated {self.transcribed_seconds} seconds")
	236	+
	237	+ return transcript
	238	+
	239	+ def use_vad(self):
	240	+ self.use_vad_opt = True
	241	+
	242	+ def set_translate_task(self):
	243	+ self.task = "translate"
	244	+
	245	+
	246	+
153	247
154	248	class HypothesisBuffer:
155	249
...	...	@@ -181,9 +275,11 @@
181	275	c = " ".join([self.commited_in_buffer[-j][2] for j in range(1,i+1)][::-1])
182	276	tail = " ".join(self.new[j-1][2] for j in range(1,i+1))
183	277	if c == tail:
184		- print("removing last",i,"words:",file=self.logfile)
	278	+ words = []
185	279	for j in range(i):
186		- print("\t",self.new.pop(0),file=self.logfile)
	280	+ words.append(repr(self.new.pop(0)))
	281	+ words_msg = " ".join(words)
	282	+ logger.debug(f"removing last {i} words: {words_msg}")
187	283	break
188	284
189	285	def flush(self):
...	...	@@ -246,8 +342,6 @@
246	342	self.transcript_buffer.last_commited_time = self.buffer_time_offset
247	343
248	344	self.commited = []
249		- self.last_chunked_at = 0
250		-
251	345
252	346	def insert_audio_chunk(self, audio):
253	347	self.audio_buffer = np.append(self.audio_buffer, audio)
...	...	@@ -257,7 +351,7 @@
257	351	"context" is the commited text that is inside the audio buffer. It is transcribed again and skipped. It is returned only for debugging and logging reasons.
258	352	"""
259	353	k = max(0,len(self.commited)-1)
260		- while k > 0 and self.commited[k-1][1] > self.last_chunked_at:
	354	+ while k > 0 and self.commited[k-1][1] > self.buffer_time_offset:
261	355	k -= 1
262	356
263	357	p = self.commited[:k]
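
The hunk above changes the boundary used to separate committed words that lie before the audio buffer from those still inside it (the "context" that is transcribed again and skipped). A small sketch of the same loop on plain data, assuming a hypothetical standalone function `split_committed` and made-up timestamps:

```python
def split_committed(committed, buffer_time_offset):
    """Split (beg, end, word) tuples at the start of the audio buffer.

    Mirrors the loop in the hunk above: walk back from the end while the
    preceding word ends after `buffer_time_offset`.
    """
    k = max(0, len(committed) - 1)
    while k > 0 and committed[k - 1][1] > buffer_time_offset:
        k -= 1
    return committed[:k], committed[k:]

# Illustrative data only: four committed words with begin/end times in seconds.
committed = [(0.0, 0.4, "hello"), (0.5, 0.9, "world"), (1.2, 1.6, "this"), (1.7, 2.0, "is")]
before, inside = split_committed(committed, buffer_time_offset=1.0)
print([w for _, _, w in before], [w for _, _, w in inside])
```

Because `k` starts at `len(committed) - 1`, the last committed word always lands in the second part, as it does in the original loop.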
...	...	@@ -278,9 +372,9 @@
278	372	"""
279	373
280	374	prompt, non_prompt = self.prompt()
281		- print("PROMPT:", prompt, file=self.logfile)
282		- print("CONTEXT:", non_prompt, file=self.logfile)
283		- print(f"transcribing {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f} seconds from {self.buffer_time_offset:2.2f}",file=self.logfile)
	375	+ logger.debug(f"PROMPT: {prompt}")
	376	+ logger.debug(f"CONTEXT: {non_prompt}")
	377	+ logger.debug(f"transcribing {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f} seconds from {self.buffer_time_offset:2.2f}")
284	378	res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt)
285	379
286	380	# transform to [(beg,end,"word1"), ...]
...	...	@@ -289,8 +383,10 @@
289	383	self.transcript_buffer.insert(tsw, self.buffer_time_offset)
290	384	o = self.transcript_buffer.flush()
291	385	self.commited.extend(o)
292		- print(">>>>COMPLETE NOW:",self.to_flush(o),file=self.logfile,flush=True)
293		- print("INCOMPLETE:",self.to_flush(self.transcript_buffer.complete()),file=self.logfile,flush=True)
	386	+ completed = self.to_flush(o)
	387	+ logger.debug(f">>>>COMPLETE NOW: {completed}")
	388	+ the_rest = self.to_flush(self.transcript_buffer.complete())
	389	+ logger.debug(f"INCOMPLETE: {the_rest}")
294	390
295	391	# there is a newly confirmed text
296	392
...	...	@@ -314,18 +410,18 @@
314	410	#while k>0 and self.commited[k][1] > l:
315	411	# k -= 1
316	412	#t = self.commited[k][1]
317		- print(f"chunking segment",file=self.logfile)
	413	+ logger.debug("chunking segment")
318	414	#self.chunk_at(t)
319	415
320		- print(f"len of buffer now: {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f}",file=self.logfile)
	416	+ logger.debug(f"len of buffer now: {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f}")
321	417	return self.to_flush(o)
322	418
323	419	def chunk_completed_sentence(self):
324	420	if self.commited == []: return
325		- print(self.commited,file=self.logfile)
	421	+ logger.debug(self.commited)
326	422	sents = self.words_to_sentences(self.commited)
327	423	for s in sents:
328		- print("\t\tSENT:",s,file=self.logfile)
	424	+ logger.debug(f"\t\tSENT: {s}")
329	425	if len(sents) < 2:
330	426	return
331	427	while len(sents) > 2:
...	...	@@ -333,7 +429,7 @@
333	429	# we will continue with audio processing at this timestamp
334	430	chunk_at = sents[-2][1]
335	431
336		- print(f"--- sentence chunked at {chunk_at:2.2f}",file=self.logfile)
	432	+ logger.debug(f"--- sentence chunked at {chunk_at:2.2f}")
337	433	self.chunk_at(chunk_at)
338	434
339	435	def chunk_completed_segment(self, res):
...	...	@@ -350,12 +446,12 @@
350	446	ends.pop(-1)
351	447	e = ends[-2]+self.buffer_time_offset
352	448	if e <= t:
353		- print(f"--- segment chunked at {e:2.2f}",file=self.logfile)
	449	+ logger.debug(f"--- segment chunked at {e:2.2f}")
354	450	self.chunk_at(e)
355	451	else:
356		- print(f"--- last segment not within commited area",file=self.logfile)
	452	+ logger.debug(f"--- last segment not within commited area")
357	453	else:
358		- print(f"--- not enough segments to chunk",file=self.logfile)
	454	+ logger.debug(f"--- not enough segments to chunk")
359	455
360	456
361	457
...	...	@@ -368,7 +464,6 @@
368	464	cut_seconds = time - self.buffer_time_offset
369	465	self.audio_buffer = self.audio_buffer[int(cut_seconds*self.SAMPLING_RATE):]
370	466	self.buffer_time_offset = time
371		- self.last_chunked_at = time
372	467
373	468	def words_to_sentences(self, words):
374	469	"""Uses self.tokenizer for sentence segmentation of words.
...	...	@@ -402,7 +497,7 @@
402	497	"""
403	498	o = self.transcript_buffer.complete()
404	499	f = self.to_flush(o)
405		- print("last, noncommited:",f,file=self.logfile)
	500	+ logger.debug(f"last, noncommited: {f}")
406	501	self.buffer_time_offset += len(self.audio_buffer)/16000
407	502	return f
408	503
...	...	@@ -443,7 +538,7 @@
443	538
444	539	# the following languages are in Whisper, but not in wtpsplit:
445	540	if lan in "as ba bo br bs fo haw hr ht jw lb ln lo mi nn oc sa sd sn so su sw tk tl tt".split():
446		- print(f"{lan} code is not supported by wtpsplit. Going to use None lang_code option.", file=sys.stderr)
	541	+ logger.debug(f"{lan} code is not supported by wtpsplit. Going to use None lang_code option.")
447	542	lan = None
448	543
449	544	from wtpsplit import WtP
...	...	@@ -463,14 +558,67 @@
463	558	parser.add_argument('--model', type=str, default='large-v2', choices="tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large".split(","),help="Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir.")
464	559	parser.add_argument('--model_cache_dir', type=str, default=None, help="Overriding the default model cache dir where models downloaded from the hub are saved")
465	560	parser.add_argument('--model_dir', type=str, default=None, help="Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.")
466		- parser.add_argument('--lan', '--language', type=str, default='en', help="Source language code, e.g. en,de,cs, or 'auto' for language detection.")
	561	+ parser.add_argument('--lan', '--language', type=str, default='auto', help="Source language code, e.g. en,de,cs, or 'auto' for language detection.")
467	562	parser.add_argument('--task', type=str, default='transcribe', choices=["transcribe","translate"],help="Transcribe or translate.")
468		- parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped"],help='Load only this backend for Whisper processing.')
	563	+ parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped", "openai-api"],help='Load only this backend for Whisper processing.')
469	564	parser.add_argument('--vad', action="store_true", default=False, help='Use VAD = voice activity detection, with the default parameters.')
470	565	parser.add_argument('--buffer_trimming', type=str, default="segment", choices=["sentence", "segment"],help='Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.')
471	566	parser.add_argument('--buffer_trimming_sec', type=float, default=15, help='Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.')
	567	+ parser.add_argument("-l", "--log-level", dest="log_level", choices=['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'], help="Set the log level", default='DEBUG')
472	568
473		-## main:
	569	+def asr_factory(args, logfile=sys.stderr):
	570	+ """
	571	+ Creates and configures an ASR and ASR Online instance based on the specified backend and arguments.
	572	+ """
	573	+ backend = args.backend
	574	+ if backend == "openai-api":
	575	+ logger.debug("Using OpenAI API.")
	576	+ asr = OpenaiApiASR(lan=args.lan)
	577	+ else:
	578	+ if backend == "faster-whisper":
	579	+ asr_cls = FasterWhisperASR
	580	+ else:
	581	+ asr_cls = WhisperTimestampedASR
	582	+
	583	+ # Only for FasterWhisperASR and WhisperTimestampedASR
	584	+ size = args.model
	585	+ t = time.time()
	586	+ logger.info(f"Loading Whisper {size} model for {args.lan}...")
	587	+ asr = asr_cls(modelsize=size, lan=args.lan, cache_dir=args.model_cache_dir, model_dir=args.model_dir)
	588	+ e = time.time()
	589	+ logger.info(f"done. It took {round(e-t,2)} seconds.")
	590	+
	591	+ # Apply common configurations
	592	+ if getattr(args, 'vad', False): # Checks if VAD argument is present and True
	593	+ logger.info("Setting VAD filter")
	594	+ asr.use_vad()
	595	+
	596	+ language = args.lan
	597	+ if args.task == "translate":
	598	+ asr.set_translate_task()
	599	+ tgt_language = "en" # Whisper translates into English
	600	+ else:
	601	+ tgt_language = language # Whisper transcribes in this language
	602	+
	603	+ # Create the tokenizer
	604	+ if args.buffer_trimming == "sentence":
	605	+ tokenizer = create_tokenizer(tgt_language)
	606	+ else:
	607	+ tokenizer = None
	608	+
	609	+ # Create the OnlineASRProcessor
	610	+ online = OnlineASRProcessor(asr,tokenizer,logfile=logfile,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
	611	+
	612	+ return asr, online
	613	+
	614	+def set_logging(args,logger,other="_server"):
	615	+ logging.basicConfig(#format='%(name)s
	616	+ format='%(levelname)s\t%(message)s')
	617	+ logger.setLevel(args.log_level)
	618	+ logging.getLogger("whisper_online"+other).setLevel(args.log_level)
	619	+# logging.getLogger("whisper_online_server").setLevel(args.log_level)
	620	+
	621	+
474	622
475	623	if __name__ == "__main__":
476	624
...	...	@@ -488,55 +636,28 @@
488	636	logfile = sys.stderr
489	637
490	638	if args.offline and args.comp_unaware:
491		- print("No or one option from --offline and --comp_unaware are available, not both. Exiting.",file=logfile)
	639	+ logger.error("No or one option from --offline and --comp_unaware are available, not both. Exiting.")
492	640	sys.exit(1)
	641	+
	642	+# if args.log_level:
	643	+# logging.basicConfig(format='whisper-%(levelname)s:%(name)s: %(message)s',
	644	+# level=getattr(logging, args.log_level))
	645	+
	646	+ set_logging(args,logger)
493	647
494	648	audio_path = args.audio_path
495	649
496	650	SAMPLING_RATE = 16000
497	651	duration = len(load_audio(audio_path))/SAMPLING_RATE
498		- print("Audio duration is: %2.2f seconds" % duration, file=logfile)
	652	+ logger.info("Audio duration is: %2.2f seconds" % duration)
499	653
500		- size = args.model
501		- language = args.lan
502		-
503		- t = time.time()
504		- print(f"Loading Whisper {size} model for {language}...",file=logfile,end=" ",flush=True)
505		-
506		- if args.backend == "faster-whisper":
507		- asr_cls = FasterWhisperASR
508		- else:
509		- asr_cls = WhisperTimestampedASR
510		-
511		- asr = asr_cls(modelsize=size, lan=language, cache_dir=args.model_cache_dir, model_dir=args.model_dir)
512		-
513		- if args.task == "translate":
514		- asr.set_translate_task()
515		- tgt_language = "en" # Whisper translates into English
516		- else:
517		- tgt_language = language # Whisper transcribes in this language
518		-
519		-
520		- e = time.time()
521		- print(f"done. It took {round(e-t,2)} seconds.",file=logfile)
522		-
523		- if args.vad:
524		- print("setting VAD filter",file=logfile)
525		- asr.use_vad()
526		-
527		-
	654	+ asr, online = asr_factory(args, logfile=logfile)
528	655	min_chunk = args.min_chunk_size
529		- if args.buffer_trimming == "sentence":
530		- tokenizer = create_tokenizer(tgt_language)
531		- else:
532		- tokenizer = None
533		- online = OnlineASRProcessor(asr,tokenizer,logfile=logfile,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
534		-
535	656
536	657	# load the audio into the LRU cache before we start the timer
537	658	a = load_audio_chunk(audio_path,0,1)
538	659
539		- # warm up the ASR, because the very first transcribe takes much more time than the other
	660	+ # warm up the ASR because the very first transcribe takes much more time than the other
540	661	asr.transcribe(a)
541	662
542	663	beg = args.start_at
...	...	@@ -555,16 +676,16 @@
555	676	print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),file=logfile,flush=True)
556	677	print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),flush=True)
557	678	else:
558		- print(o,file=logfile,flush=True)
	679	+ # No text, so no output
	680	+ pass
559	681
560	682	if args.offline: ## offline mode processing (for testing/debugging)
561	683	a = load_audio(audio_path)
562	684	online.insert_audio_chunk(a)
563	685	try:
564	686	o = online.process_iter()
565		- except AssertionError:
566		- print("assertion error",file=logfile)
567		- pass
	687	+ except AssertionError as e:
	688	+ logger.error(f"assertion error: {repr(e)}")
568	689	else:
569	690	output_transcript(o)
570	691	now = None
...	...	@@ -575,13 +696,13 @@
575	696	online.insert_audio_chunk(a)
576	697	try:
577	698	o = online.process_iter()
578		- except AssertionError:
579		- print("assertion error",file=logfile)
	699	+ except AssertionError as e:
	700	+ logger.error(f"assertion error: {repr(e)}")
580	701	pass
581	702	else:
582	703	output_transcript(o, now=end)
583	704
584		- print(f"## last processed {end:.2f}s",file=logfile,flush=True)
	705	+ logger.debug(f"## last processed {end:.2f}s")
585	706
586	707	if end >= duration:
587	708	break
...	...	@@ -607,13 +728,13 @@
607	728
608	729	try:
609	730	o = online.process_iter()
610		- except AssertionError:
611		- print("assertion error",file=logfile)
	731	+ except AssertionError as e:
	732	+ logger.error(f"assertion error: {e}")
612	733	pass
613	734	else:
614	735	output_transcript(o)
615	736	now = time.time() - start
616		- print(f"## last processed {end:.2f} s, now is {now:.2f}, the latency is {now-end:.2f}",file=logfile,flush=True)
	737	+ logger.debug(f"## last processed {end:.2f} s, now is {now:.2f}, the latency is {now-end:.2f}")
617	738
618	739	if end >= duration:
619	740	break

...	...	@@ -4,6 +4,10 @@
4	4	import sys
5	5	import argparse
6	6	import os
	7	+import logging
	8	+import numpy as np
	9	+
	10	+logger = logging.getLogger(__name__)
7	11	parser = argparse.ArgumentParser()
8	12
9	13	# server options
...	...	@@ -11,11 +15,14 @@
11	15	parser.add_argument("--port", type=int, default=43007)
12	16	parser.add_argument('--vac', action="store_true", default=False, help='Use VAC = voice activity controller.')
13	17	parser.add_argument('--vac-chunk-size', type=float, default=0.04, help='VAC sample size in seconds.')
	18	+parser.add_argument("--warmup-file", type=str, dest="warmup_file",
	19	+ help="The path to a speech audio wav file to warm up Whisper so that the very first chunk processing is fast. It can be e.g. https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav .")
14	20
15	21	# options from whisper_online
16	22	add_shared_args(parser)
17	23	args = parser.parse_args()
18	24
	25	+set_logging(args,logger,other="")
19	26
20	27	# setting whisper object by args
21	28
...	...	@@ -23,68 +30,28 @@
23	30
24	31	size = args.model
25	32	language = args.lan
	33	+asr, online = asr_factory(args)
	34	+min_chunk = args.min_chunk_size
26	35
27		-t = time.time()
28		-print(f"Loading Whisper {size} model for {language}...",file=sys.stderr,end=" ",flush=True)
29		-
30		-if args.backend == "faster-whisper":
31		- from faster_whisper import WhisperModel
32		- asr_cls = FasterWhisperASR
33		-elif args.backend == "whisper_timestamped":
34		- import whisper
35		- from whisper_online import WhisperTimestampedASR
36		- asr_cls = WhisperTimestampedASR
	36	+# warm up the ASR because the very first transcribe takes more time than the others.
	37	+# Test results in https://github.com/ufal/whisper_streaming/pull/81
	38	+msg = "Whisper is not warmed up. The first chunk processing may take longer."
	39	+if args.warmup_file:
	40	+ if os.path.isfile(args.warmup_file):
	41	+ a = load_audio_chunk(args.warmup_file,0,1)
	42	+ asr.transcribe(a)
	43	+ logger.info("Whisper is warmed up.")
	44	+ else:
	45	+ logger.critical("The warm up file is not available. "+msg)
	46	+ sys.exit(1)
37	47	else:
38		- raise ValueError(f"Unknown {args.backend=}")
39		-
40		-asr = asr_cls(modelsize=size, lan=language, cache_dir=args.model_cache_dir, model_dir=args.model_dir)
41		-
42		-if args.task == "translate":
43		- asr.set_translate_task()
44		- tgt_language = "en"
45		-else:
46		- tgt_language = language
47		-
48		-print(f"done. It took {round(time.time()-t,2)} seconds.",file=sys.stderr)
49		-
50		-if args.vad:
51		- print("setting VAD filter",file=sys.stderr)
52		- asr.use_vad()
53		-
54		-
55		-if args.buffer_trimming == "sentence":
56		- tokenizer = create_tokenizer(tgt_language)
57		-else:
58		- tokenizer = None
59		-if not args.vac:
60		- from whisper_online import OnlineASRProcessor
61		- online = OnlineASRProcessor(asr,tokenizer,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
62		-else:
63		- from whisper_online_vac import VACOnlineASRProcessor
64		- online = VACOnlineASRProcessor(args.min_chunk_size, asr,tokenizer,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
65		-
66		-
67		-demo_audio_path = "cs-maji-2.16k.wav"
68		-if os.path.exists(demo_audio_path):
69		- # load the audio into the LRU cache before we start the timer
70		- a = load_audio_chunk(demo_audio_path,0,1)
71		-
72		- # TODO: it should be tested whether it's meaningful
73		- # warm up the ASR, because the very first transcribe takes much more time than the other
74		- asr.transcribe(a)
75		-else:
76		- print("Whisper is not warmed up",file=sys.stderr)
77		-
78		-
	48	+ logger.warning(msg)
79	49
80	50
81	51	######### Server objects
82	52
83	53	import line_packet
84	54	import socket
85		-
86		-import logging
87		-
88	55
89	56	class Connection:
90	57	'''it wraps conn object'''
...	...	@@ -143,7 +110,7 @@
143	110	break
144	111	print("received audio:",len(raw_bytes), "bytes", raw_bytes[:10])
145	112	sf = soundfile.SoundFile(io.BytesIO(raw_bytes), channels=1,endian="LITTLE",samplerate=SAMPLING_RATE, subtype="PCM_16",format="RAW")
146		- audio, _ = librosa.load(sf,sr=SAMPLING_RATE)
	113	+ audio, _ = librosa.load(sf,sr=SAMPLING_RATE,dtype=np.float32)
147	114	out.append(audio)
148	115	if not out:
149	116	return None
...	...	@@ -174,7 +141,7 @@
174	141	print("%1.0f %1.0f %s" % (beg,end,o[2]),flush=True,file=sys.stderr)
175	142	return "%1.0f %1.0f %s" % (beg,end,o[2])
176	143	else:
177		- print(o,file=sys.stderr,flush=True)
	144	+ logger.debug("No text in this segment")
178	145	return None
179	146
180	147	def send_result(self, o):
...	...	@@ -188,14 +155,13 @@
188	155	while True:
189	156	a = self.receive_audio_chunk()
190	157	if a is None:
191		- print("break here",file=sys.stderr)
192	158	break
193	159	self.online_asr_proc.insert_audio_chunk(a)
194	160	o = online.process_iter()
195	161	try:
196	162	self.send_result(o)
197	163	except BrokenPipeError:
198		- print("broken pipe -- connection closed?",file=sys.stderr)
	164	+ logger.info("broken pipe -- connection closed?")
199	165	break
200	166
201	167	# o = online.finish() # this should be working
...	...	@@ -203,23 +169,18 @@
203	169
204	170
205	171
206		-
207		-# Start logging.
208		-level = logging.INFO
209		-logging.basicConfig(level=level, format='whisper-server-%(levelname)s: %(message)s')
210		-
211	172	# server loop
212	173
213	174	with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
214	175	s.bind((args.host, args.port))
215	176	s.listen(1)
216		- logging.info('INFO: Listening on'+str((args.host, args.port)))
	177	+ logger.info('Listening on'+str((args.host, args.port)))
217	178	while True:
218	179	conn, addr = s.accept()
219		- logging.info('INFO: Connected to client on {}'.format(addr))
	180	+ logger.info('Connected to client on {}'.format(addr))
220	181	connection = Connection(conn)
221	182	proc = ServerProcessor(connection, online, args.min_chunk_size)
222	183	proc.process()
223	184	conn.close()
224		- logging.info('INFO: Connection to client closed')
225		-logging.info('INFO: Connection closed, terminating.')
	185	+ logger.info('Connection to client closed')
	186	+logger.info('Connection closed, terminating.')
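
The warm-up hunk above (new lines 36-48) can be read as a three-way decision: no `--warmup-file` gives a warning, a missing file is a critical error, and an existing file triggers one throwaway transcription. A minimal sketch of that logic as a standalone function follows. The name `warm_up` and the `load_chunk` / `transcribe` parameters are hypothetical, introduced only so the logic can run without Whisper, and raising `FileNotFoundError` stands in for the commit's `sys.exit(1)`.

```python
import logging
import os

logger = logging.getLogger(__name__)

NOT_WARMED_UP = "Whisper is not warmed up. The first chunk processing may take longer."


def warm_up(warmup_file, load_chunk, transcribe):
    """Run one short transcription so that later calls are faster.

    Hypothetical helper: the commit performs these steps inline.
    Returns True if the warm-up ran, False if no file was given.
    Raises FileNotFoundError if a path was given but is not a file
    (the commit logs the same message and calls sys.exit(1) instead).
    """
    if not warmup_file:
        # No warm-up requested: only warn, the server still starts.
        logger.warning(NOT_WARMED_UP)
        return False
    if not os.path.isfile(warmup_file):
        logger.critical("The warm up file is not available. " + NOT_WARMED_UP)
        raise FileNotFoundError(warmup_file)
    # Same arguments as in the diff: load_audio_chunk(path, 0, 1).
    audio = load_chunk(warmup_file, 0, 1)
    transcribe(audio)
    logger.info("Whisper is warmed up.")
    return True
```

In the server this would presumably be called as `warm_up(args.warmup_file, load_audio_chunk, asr.transcribe)` right after `asr_factory(args)`. Passing the two callables in keeps the branch logic testable without loading a model.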
