Dominik Macháček 2023-05-19
updates:
- fix errors
- module documented
@27a4d2e2bf12ca32239120c4ea8e067e820d9fec
README.md
--- README.md
+++ README.md
@@ -16,7 +16,7 @@
 
 The backend is loaded only when chosen. The unused one does not have to be installed.
 
-## Usage
+## Usage: example entry point
 
 ```
 usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large}] [--model_cache_dir MODEL_CACHE_DIR] [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}]
@@ -49,11 +49,13 @@
 
 Example:
 
+It simulates real-time processing from a pre-recorded mono 16 kHz wav file.
+
 ```
 python3 whisper_online.py en-demo16.wav --language en --min-chunk-size 1 > out.txt
 ```
 
-## Output format
+### Output format
 
 ```
 2691.4399 300 1380 Chairman, thank you.
@@ -70,27 +72,79 @@
 
 [See description here](https://github.com/ufal/whisper_streaming/blob/d915d790a62d7be4e7392dde1480e7981eb142ae/whisper_online.py#L361)
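+
+For instance, one output line can be parsed like this (a minimal sketch; all three
+numeric fields are in milliseconds, as in the linked description):
+
+```
+line = "2691.4399 300 1380 Chairman, thank you."
+emission_ms, beg_ms, end_ms, text = line.split(" ", 3)
+print(float(emission_ms), float(beg_ms), float(end_ms), text)
+```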
 
+## Usage as a module
+
+TL;DR: use the OnlineASRProcessor object and its methods insert_audio_chunk and process_iter.
+
+The code in whisper_online.py is well commented; read it as the full documentation.
+
+
+This pseudocode describes the interface that we suggest for your own implementation. You can, for example, feed audio from a microphone or stdin, or wrap it in a server-client setup.
+
+```
+from whisper_online import *
+
+src_lan = "en"  # source language
+tgt_lan = "en"  # target language  -- same as source for ASR, "en" if translate task is used
+
+
+asr = FasterWhisperASR(src_lan, "large-v2")  # loads and wraps Whisper model
+# set options:
+# asr.set_translate_task()  # it will translate from src_lan into English
+# asr.use_vad()  # enable voice activity detection (VAD)
+
+
+online = OnlineASRProcessor(tgt_lan, asr)  # create processing object
+
+
+while audio_has_not_ended:   # processing loop:
+	a = # receive new audio chunk (and e.g. wait for min_chunk_size seconds first, ...)
+	online.insert_audio_chunk(a)
+	o = online.process_iter()
+	print(o) # do something with current partial output
+# at the end of this audio processing
+o = online.finish()
+print(o)  # do something with the last output
+
+
+online.init()  # refresh if you're going to re-use the object for the next audio
+```
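+
+For instance, a minimal runnable version of that loop, simulating the stream by
+slicing a pre-recorded 16 kHz mono wav into 1-second chunks (librosa and the fixed
+chunk size are just assumptions of this sketch, not requirements of the interface):
+
+```
+import librosa
+from whisper_online import FasterWhisperASR, OnlineASRProcessor
+
+asr = FasterWhisperASR("en", "large-v2")   # source language and model size
+online = OnlineASRProcessor("en", asr)     # target language, used for sentence segmentation
+
+audio, _ = librosa.load("en-demo16.wav", sr=16000)  # 1-D float array sampled at 16 kHz
+chunk_samples = 16000  # feed 1 second of audio per iteration
+
+for i in range(0, len(audio), chunk_samples):
+    online.insert_audio_chunk(audio[i:i + chunk_samples])
+    o = online.process_iter()
+    if o[0] is not None:   # o = (beg, end, text); beg is None while nothing is confirmed yet
+        print(o)
+
+o = online.finish()        # flush whatever remains in the buffer
+if o[0] is not None:
+    print(o)
+```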
+
 
 
 ## Background
 
-Default Whisper is intended for audio chunks of at most 30 seconds that contain one full sentence. Longer audio files must be split to shorter chunks and merged with "init prompt". In low latency simultaneous streaming mode, the simple and naive chunking fixed-sized windows does not work well, it can split a word in the middle. It is also necessary to know when the transcribt is stable, should be confirmed ("commited") and followed up, and when the future content makes the transcript clearer. 
+Default Whisper is intended for audio chunks of at most 30 seconds that contain
+one full sentence. Longer audio files must be split into shorter chunks and
+merged with an "init prompt". In low-latency simultaneous streaming mode, simple
+and naive chunking into fixed-sized windows does not work well: it can split a
+word in the middle. It is also necessary to know when the transcript is stable,
+should be confirmed ("committed") and followed up, and when future content makes
+the transcript clearer.
 
-For that, there is LocalAgreement-n policy: if n consecutive updates, each with a newly available audio stream chunk, agree on a prefix transcript, it is confirmed. (Reference: CUNI-KIT at IWSLT 2022 etc.)
+For that, there is the LocalAgreement-n policy: if n consecutive updates, each
+with a newly available audio stream chunk, agree on a transcript prefix, it is
+confirmed. (Reference: CUNI-KIT at IWSLT 2022 etc.)
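+
+As a toy illustration of the agreement test (a deliberate simplification; the
+actual buffering and timestamp handling live in whisper_online.py):
+
+```
+def agreed_prefix(prev_words, new_words):
+    """Longest common word prefix of two consecutive hypotheses (LocalAgreement-2)."""
+    common = []
+    for p, n in zip(prev_words, new_words):
+        if p != n:
+            break
+        common.append(p)
+    return common
+
+# "and so in my" is identical in both updates, so it can be confirmed ("committed")
+print(agreed_prefix("and so in my view it".split(),
+                    "and so in my few years".split()))
+```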
 
-In this project, we re-use the idea of Peter Polák from this demo: https://github.com/pe-trik/transformers/blob/online_decode/examples/pytorch/online-decoding/whisper-online-demo.py However, it doesn't do any sentence segmentation, but Whisper produces punctuation and `whisper_transcribed` makes word-level timestamps. In short: we consecutively process new audio chunks, emit the transcripts that are confirmed by 2 iterations, and scroll the audio processing buffer on a timestamp of a confirmed complete sentence. The processing audio buffer is not too long and the processing is fast.
+In this project, we re-use the idea of Peter Polák from this demo:
+https://github.com/pe-trik/transformers/blob/online_decode/examples/pytorch/online-decoding/whisper-online-demo.py
+However, that demo doesn't do any sentence segmentation, whereas here Whisper
+produces punctuation and the libraries `faster-whisper` and `whisper_timestamped`
+provide word-level timestamps. In short: we consecutively process new audio
+chunks, emit the transcripts that are confirmed by 2 iterations, and scroll the
+audio processing buffer on the timestamp of a confirmed complete sentence. This
+keeps the audio processing buffer short and the processing fast.
 
-In more detail: we use the init prompt, we handle the inaccurate timestamps, we re-process confirmed sentence prefixes and skip them, making sure they don't overlap, and we limit the processing buffer window. 
+In more detail: we use the init prompt, we handle the inaccurate timestamps, we
+re-process confirmed sentence prefixes and skip them, making sure they don't
+overlap, and we limit the processing buffer window. 
 
-This project is work in progress. Contributions are welcome.
+Contributions are welcome.
 
 ### Tests
 
 Rigorous quality and latency tests are pending.
-
-Small initial debugging shows that on a fluent monologue speech without pauses, both the quality and latency of English and German ASR is impressive. 
-
-Czech ASR tests show that multi-speaker interview with pauses and disfluencies is challenging. However, parameters should be tuned.
 
 ## Contact
 
whisper_online.py
--- whisper_online.py
+++ whisper_online.py
@@ -158,10 +158,10 @@
             a,b,t = self.new[0]
             if abs(a - self.last_commited_time) < 1:
                 if self.commited_in_buffer:
-                    # it's going to search for 1, 2 or 3 consecutive words that are identical in commited and new. If they are, they're dropped.
+                    # it's going to search for 1, 2, ..., 5 consecutive words (n-grams) that are identical in commited and new. If they are identical, they are dropped.
                     cn = len(self.commited_in_buffer)
                     nn = len(self.new)
-                    for i in range(1,min(min(cn,nn),5)+1):
+                    for i in range(1,min(min(cn,nn),5)+1):  # 5 is the maximum n-gram length that is checked
                         c = " ".join([self.commited_in_buffer[-j][2] for j in range(1,i+1)][::-1])
                         tail = " ".join(self.new[j-1][2] for j in range(1,i+1))
                         if c == tail:
@@ -204,19 +204,16 @@
 
     SAMPLING_RATE = 16000
 
-    def __init__(self, language, asr, chunk):
-        """language: lang. code
+    def __init__(self, language, asr):
+        """language: lang. code that MosesTokenizer uses for sentence segmentation
         asr: WhisperASR object
-        chunk: number of seconds for intended size of audio interval that is inserted and looped
         """
         self.language = language
         self.asr = asr
-        self.tokenizer = MosesTokenizer("en")
+        self.tokenizer = MosesTokenizer(self.language)
 
         self.init()
-
-        self.chunk = chunk
-
 
     def init(self):
         """run this when starting or restarting processing"""
@@ -436,8 +433,13 @@
     parser.add_argument('--start_at', type=float, default=0.0, help='Start processing audio at this time.')
     parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped"],help='Load only this backend for Whisper processing.')
     parser.add_argument('--offline', action="store_true", default=False, help='Offline mode.')
+    parser.add_argument('--comp_unaware', action="store_true", default=False, help='Computationally unaware simulation.')
     parser.add_argument('--vad', action="store_true", default=False, help='Use VAD = voice activity detection, with the default parameters.')
     args = parser.parse_args()
+
+    if args.offline and args.comp_unaware:
+        print("No or one option from --offline and --comp_unaware are available, not both. Exiting.",file=sys.stderr)
+        sys.exit(1)
 
     audio_path = args.audio_path
 
@@ -465,6 +467,9 @@
 
     if args.task == "translate":
         asr.set_translate_task()
+        tgt_language = "en"  # Whisper translates into English
+    else:
+        tgt_language = language  # Whisper transcribes in this language
 
 
     e = time.time()
@@ -475,7 +480,7 @@
         asr.use_vad()
 
     min_chunk = args.min_chunk_size
-    online = OnlineASRProcessor(language,asr,min_chunk)
+    online = OnlineASRProcessor(tgt_language,asr)
 
 
     # load the audio into the LRU cache before we start the timer
@@ -487,14 +492,15 @@
     beg = args.start_at
     start = time.time()-beg
 
-    def output_transcript(o):
+    def output_transcript(o, now=None):
         # output format in stdout is like:
         # 4186.3606 0 1720 Takhle to je
         # - the first three words are:
         #    - emission time from beginning of processing, in milliseconds
         #    - beg and end timestamp of the text segment, as estimated by Whisper model. The timestamps are not accurate, but they're useful anyway
         # - the next words: segment transcript
-        now = time.time()-start
+        if now is None:
+            now = time.time()-start
         if o[0] is not None:
             print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),file=sys.stderr,flush=True)
             print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),flush=True)
@@ -511,6 +517,28 @@
             pass
         else:
             output_transcript(o)
+        now = None
+    elif args.comp_unaware:  # computationally unaware mode
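+        # "Computationally unaware" simulation: each output's emission time is taken
+        # from the audio timeline (the end of the chunk just processed), so the
+        # reported latency ignores how long the model itself takes to run.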
+        end = beg + min_chunk
+        while True:
+            a = load_audio_chunk(audio_path,beg,end)
+            online.insert_audio_chunk(a)
+            try:
+                o = online.process_iter()
+            except AssertionError:
+                print("assertion error",file=sys.stderr)
+                pass
+            else:
+                output_transcript(o, now=end)
+
+            print(f"## last processed {end:.2f}s",file=sys.stderr,flush=True)
+
+            beg = end
+            end += min_chunk
+            if end >= duration:
+                break
+        now = duration
+
     else: # online = simultaneous mode
         end = 0
         while True:
@@ -530,12 +558,11 @@
             else:
                 output_transcript(o)
             now = time.time() - start
-            print(f"## last processed {end:.2f} s, now is {now:.2f}, the latency is {now-end:.2f}",file=sys.stderr)
-
-            print(file=sys.stderr,flush=True)
+            print(f"## last processed {end:.2f} s, now is {now:.2f}, the latency is {now-end:.2f}",file=sys.stderr,flush=True)
 
             if end >= duration:
                 break
+        now = None
 
     o = online.finish()
-    output_transcript(o)
+    output_transcript(o, now=now)