Dominik Macháček 2023-06-05
options readme update, and catch exception in server
@241d2f0b0a7fcb55e6a06cec33dbbe18f2761826
README.md
--- README.md
+++ README.md
@@ -12,7 +12,7 @@
 
 The most recommended backend is [faster-whisper](https://github.com/guillaumekln/faster-whisper) with GPU support. Follow their instructions for NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install with `pip install faster-whisper`.
 
-Alternative, less restrictive, but slowe backend is [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped): `pip install git+https://github.com/linto-ai/whisper-timestamped`
+An alternative, less restrictive, but slower backend is [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped): `pip install git+https://github.com/linto-ai/whisper-timestamped`
 
 The backend is loaded only when chosen. The unused one does not have to be installed.
 
@@ -22,7 +22,7 @@
 
 ```
 usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large}] [--model_cache_dir MODEL_CACHE_DIR] [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}]
-                         [--start_at START_AT] [--backend {faster-whisper,whisper_timestamped}] [--offline] [--vad]
+                         [--start_at START_AT] [--backend {faster-whisper,whisper_timestamped}] [--offline] [--comp_unaware] [--vad]
                          audio_path
 
 positional arguments:
@@ -46,7 +46,8 @@
   --backend {faster-whisper,whisper_timestamped}
                         Load only this backend for Whisper processing.
   --offline             Offline mode.
-  --vad                 Use VAD = voice activity detection, with the default parameters. 
+  --comp_unaware        Computationally unaware simulation.
+  --vad                 Use VAD = voice activity detection, with the default parameters.
 ```
 
 Example:
@@ -56,6 +57,18 @@
 ```
 python3 whisper_online.py en-demo16.wav --language en --min-chunk-size 1 > out.txt
 ```
+
+Simulation modes:
+
+- default mode, no special option: real-time simulation from file, computationally aware. The chunk size is `MIN_CHUNK_SIZE` or larger, if more audio arrived during the last update computation.
+
+- `--comp_unaware` option: computationally unaware simulation. It means that the timer that counts the emission times "stops" when the model is computing. The chunk size is always `MIN_CHUNK_SIZE`. The latency is caused only by the model being unable to confirm the output, e.g. because of language ambiguity, and not by slow hardware or a suboptimal implementation. We implement this feature to find the lower bound on latency. (See the timing sketch after this list.)
+
+- `--start_at START_AT`: Start processing audio at this time. The first update receives the whole audio up to `START_AT`. It is useful for debugging, e.g. when we observe a bug at a specific time in the audio file and want to reproduce it quickly, without a long wait.
+
+- `--offline` option: It processes the whole audio file at once, in offline mode. We implement it to find the lowest possible WER on a given audio file.
+
+
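For intuition, here is a minimal Python sketch of the two timing modes. It is not the actual `whisper_online.py` implementation; `transcribe` is a hypothetical stand-in for one model update, and `MIN_CHUNK` for `MIN_CHUNK_SIZE`:

```
import time

MIN_CHUNK = 1.0  # seconds of audio per update, standing in for MIN_CHUNK_SIZE

def simulate(audio_len, transcribe, comp_aware=True):
    consumed = 0.0      # seconds of audio fed to the model so far
    prev_compute = 0.0  # wall-clock duration of the previous update
    while consumed < audio_len:
        # default (computationally aware) mode: audio that arrived while the
        # last update was computing has piled up, so the chunk can exceed
        # MIN_CHUNK; with --comp_unaware the chunk is always MIN_CHUNK
        chunk = max(MIN_CHUNK, prev_compute) if comp_aware else MIN_CHUNK
        t0 = time.time()
        transcribe(consumed, consumed + chunk)  # one model update
        prev_compute = time.time() - t0
        consumed += chunk
```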
 
 ### Output format
 
@@ -114,7 +127,7 @@
 
 ### Server
 
-`whisper_online_server.py` has the same model options as `whisper_online.py`, plus `--host` and `--port` of the TCP connection.
+`whisper_online_server.py` has the same model options as `whisper_online.py`, plus `--host` and `--port` of the TCP connection. See the help message (`-h` option).
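
For instance, a plausible invocation, using only options that appear in this README (port 43001 matches the client example below):

```
python3 whisper_online_server.py --language en --host localhost --port 43001
```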
 
 Client example:
 
@@ -122,7 +135,7 @@
 arecord -f S16_LE -c1 -r 16000 -t raw -D default | nc localhost 43001
 ```
 
-- arecord sends realtime audio from a sound device, in raw audio format -- 16000 sampling rate, mono channel, S16\_LE -- signed 16-bit integer low endian. (use the alternative to arecord that works for you)
+- arecord sends real-time audio from a sound device (e.g. a microphone), in raw audio format -- 16000 sampling rate, mono channel, S16\_LE -- signed 16-bit integer little endian. (Use whichever alternative to arecord works for you.)
 
 - nc is netcat with the server's host and port
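
If netcat is not available, a minimal Python client can play the same role. This is a hypothetical sketch, not part of the repository: it forwards a raw S16\_LE 16 kHz mono stream from stdin to the server and prints whatever the server sends back on the same connection:

```
import socket
import sys
import threading

HOST, PORT = "localhost", 43001  # must match the server's --host/--port
CHUNK = 3200  # 0.1 s of audio: 16000 samples/s * 2 bytes/sample * 0.1 s

sock = socket.create_connection((HOST, PORT))

def print_replies():
    # the server sends its output back over the same TCP connection
    for line in sock.makefile():
        print(line, end="")

threading.Thread(target=print_replies, daemon=True).start()

# forward raw audio from stdin, e.g.:
#   arecord -f S16_LE -c1 -r 16000 -t raw | python3 client.py
while data := sys.stdin.buffer.read(CHUNK):
    sock.sendall(data)

sock.close()
```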
 
whisper_online_server.py
--- whisper_online_server.py
+++ whisper_online_server.py
@@ -20,9 +20,7 @@
 parser.add_argument('--model_dir', type=str, default=None, help="Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.")
 parser.add_argument('--lan', '--language', type=str, default='en', help="Language code for transcription, e.g. en,de,cs.")
 parser.add_argument('--task', type=str, default='transcribe', choices=["transcribe","translate"],help="Transcribe or translate.")
-parser.add_argument('--start_at', type=float, default=0.0, help='Start processing audio at this time.')
 parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped"],help='Load only this backend for Whisper processing.')
-parser.add_argument('--offline', action="store_true", default=False, help='Offline mode.')
 parser.add_argument('--vad', action="store_true", default=False, help='Use VAD = voice activity detection, with the default parameters.')
 args = parser.parse_args()
 
@@ -183,7 +181,12 @@
                 break
             self.online_asr_proc.insert_audio_chunk(a)
             o = online.process_iter()
-            self.send_result(o)
+            try:
+                self.send_result(o)
+            except BrokenPipeError:
+                print("broken pipe -- connection closed?",file=sys.stderr)
+                break
+
 #        o = online.finish()  # this should be working
 #        self.send_result(o)
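
Grounded in the hunk above: if the client disconnects mid-stream, the next `send_result` raises `BrokenPipeError`, which the server now logs to stderr and turns into a clean exit from the processing loop instead of a crash. One way to exercise it:

```
arecord -f S16_LE -c1 -r 16000 -t raw -D default | nc localhost 43001
# interrupt this pipeline (Ctrl-C): the server prints
# "broken pipe -- connection closed?" to stderr and stops the loop
```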
 