

Merge branch 'main' into online-from-factory
@f03948cad61dbe8398512368df63c6526b455a6d
--- README.md
+++ README.md
@@ -3,14 +3,12 @@
 
 **Turning Whisper into Real-Time Transcription System**
 
-Demonstration paper, by Dominik Macháček, Raj Dabre, Ondřej Bojar, 2023
+Demonstration paper, by [Dominik Macháček](https://ufal.mff.cuni.cz/dominik-machacek), [Raj Dabre](https://prajdabre.github.io/), [Ondřej Bojar](https://ufal.mff.cuni.cz/ondrej-bojar), 2023
 
-Abstract: Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in live transcription service at a multilingual conference.
+Abstract: Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real-time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in live transcription service at a multilingual conference.
 
 
-Paper in proceedings: http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf
-
-Demo video: https://player.vimeo.com/video/840442741
+[Paper PDF](https://aclanthology.org/2023.ijcnlp-demo.3.pdf), [Demo video](https://player.vimeo.com/video/840442741)
 
 [Slides](http://ufallab.ms.mff.cuni.cz/~machacek/pre-prints/AACL23-2.11.2023-Turning-Whisper-oral.pdf) -- 15 minutes oral presentation at IJCNLP-AACL 2023
 
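The abstract's "local agreement policy" means output is committed only once two successive Whisper hypotheses agree on it. A toy sketch of that idea (illustrative only, not the repository's implementation):

```python
# Toy illustration of a local agreement policy: commit only the longest
# common prefix (in words) of two consecutive hypotheses.
# Illustrative only -- not the code from whisper_online.py.
def local_agreement(prev_hyp: str, curr_hyp: str) -> str:
    agreed = []
    for prev_word, curr_word in zip(prev_hyp.split(), curr_hyp.split()):
        if prev_word != curr_word:
            break
        agreed.append(curr_word)
    return " ".join(agreed)

# The unstable tail is held back until a later hypothesis confirms it:
print(local_agreement("and so my fell", "and so my fellow americans"))  # -> "and so my"
```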
@@ -157,7 +155,7 @@
 
 This pseudocode describes the interface that we suggest for your implementation. You can implement any features that you need for your application.
 
-```
+```python
 from whisper_online import *
 
 src_lan = "en" # source language
@@ -185,7 +183,7 @@
 
 ### Server -- real-time from mic
 
-`whisper_online_server.py` has the same model options as `whisper_online.py`, plus `--host` and `--port` of the TCP connection. See help message (`-h` option).
+`whisper_online_server.py` has the same model options as `whisper_online.py`, plus `--host` and `--port` of the TCP connection, and `--warmup-file`. See the help message (`-h` option).
 
 Client example:
 
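The hunk ends just before the README's client example. For illustration, a minimal Python client sketch, assuming the server consumes raw 16 kHz mono 16-bit little-endian PCM over plain TCP and streams transcript text back (the `sample.wav` path is a placeholder):

```python
# Client sketch under the assumption that the server reads raw s16le PCM
# at 16 kHz from the TCP socket and writes transcript text back.
import socket
import soundfile as sf   # third-party: pip install soundfile

HOST, PORT = "localhost", 43007                   # server defaults from this diff
audio, sr = sf.read("sample.wav", dtype="int16")  # placeholder, 16 kHz mono
assert sr == 16000

with socket.create_connection((HOST, PORT)) as sock:
    second = 16000                         # one second of samples per send
    for i in range(0, len(audio), second):
        sock.sendall(audio[i:i + second].tobytes())
    sock.shutdown(socket.SHUT_WR)          # signal end of the audio stream
    while data := sock.recv(4096):         # print transcripts as they arrive
        print(data.decode("utf-8", errors="replace"), end="")
```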
@@ -226,12 +224,20 @@
 re-process confirmed sentence prefixes and skip them, making sure they don't
 overlap, and we limit the processing buffer window.
 
-Contributions are welcome.
-
 ### Performance evaluation
 
 [See the paper.](http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf)
 
+### Contributions
+
+Contributions are welcome. We acknowledge especially:
+
+- [The GitHub contributors](https://github.com/ufal/whisper_streaming/graphs/contributors) for their pull requests with new features and bugfixes.
+- [The translation of this repo into Chinese.](https://github.com/Gloridust/whisper_streaming_CN)
+- [Ondřej Plátek](https://opla.cz/) for the paper pre-review.
+- [Peter Polák](https://ufal.mff.cuni.cz/peter-polak) for the original idea.
+- The UEDIN team of the [ELITR project](https://elitr.eu) for the original line_packet.py.
+
 
 ## Contact
 
--- whisper_online.py
+++ whisper_online.py
@@ -626,7 +626,7 @@
 # load the audio into the LRU cache before we start the timer
 a = load_audio_chunk(audio_path,0,1)
 
-# warm up the ASR, because the very first transcribe takes much more time than the other
+# warm up the ASR because the very first transcribe takes much more time than the other
 asr.transcribe(a)
 
 beg = args.start_at
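
The reworded comment points at real first-call overhead (model and kernel initialization). A quick way to observe it, assuming the same `whisper_online` helpers; the wav path and model size are placeholders:

```python
# Measure first vs. second transcribe latency; the path and model size are
# placeholders, and the numbers will vary by hardware.
import time
from whisper_online import FasterWhisperASR, load_audio_chunk

asr = FasterWhisperASR("en", "large-v2")
a = load_audio_chunk("sample.wav", 0, 1)          # one-second chunk

t0 = time.perf_counter(); asr.transcribe(a)
t1 = time.perf_counter(); asr.transcribe(a)
t2 = time.perf_counter()
print(f"first: {t1 - t0:.2f}s, warmed: {t2 - t1:.2f}s")  # first is typically slower
```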
--- whisper_online_server.py
+++ whisper_online_server.py
@@ -10,6 +10,8 @@
 # server options
 parser.add_argument("--host", type=str, default='localhost')
 parser.add_argument("--port", type=int, default=43007)
+parser.add_argument("--warmup-file", type=str, dest="warmup_file",
+        help="The path to a speech audio wav file to warm up Whisper so that the very first chunk processing is fast. It can be e.g. https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav .")
 
 
 # options from whisper_online
@@ -26,18 +28,25 @@
 asr, online = asr_factory(args)
 min_chunk = args.min_chunk_size
 
-demo_audio_path = "cs-maji-2.16k.wav"
-if os.path.exists(demo_audio_path):
-    # load the audio into the LRU cache before we start the timer
-    a = load_audio_chunk(demo_audio_path,0,1)
 
-    # TODO: it should be tested whether it's meaningful
-    # warm up the ASR, because the very first transcribe takes much more time than the other
-    asr.transcribe(a)
+if args.buffer_trimming == "sentence":
+    tokenizer = create_tokenizer(tgt_language)
 else:
-    print("Whisper is not warmed up",file=sys.stderr)
+    tokenizer = None
+online = OnlineASRProcessor(asr,tokenizer,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
 
-
+# warm up the ASR because the very first transcribe takes more time than the others.
+# Test results in https://github.com/ufal/whisper_streaming/pull/81
+msg = "Whisper is not warmed up. The first chunk processing may take longer."
+if args.warmup_file:
+    if os.path.isfile(args.warmup_file):
+        a = load_audio_chunk(args.warmup_file,0,1)
+        asr.transcribe(a)
+        print("INFO: Whisper is warmed up.",file=sys.stderr)
+    else:
+        print("WARNING: The warm up file is not available. "+msg,file=sys.stderr)
+else:
+    print("WARNING: " + msg, file=sys.stderr)
 
 
 ######### Server objects
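
The warm-up block above runs at module level; if it grows further it could be factored into a helper. A sketch under the same assumptions (`warmup_asr` is hypothetical, not part of this PR):

```python
# Hypothetical helper mirroring the warm-up logic in this hunk; warmup_asr
# is an illustration, not a function introduced by the PR.
import os
import sys
from whisper_online import load_audio_chunk

def warmup_asr(asr, warmup_file):
    """Transcribe one second of audio so the first real chunk is fast."""
    msg = "Whisper is not warmed up. The first chunk processing may take longer."
    if not warmup_file:
        print("WARNING: " + msg, file=sys.stderr)
    elif not os.path.isfile(warmup_file):
        print("WARNING: The warm-up file is not available. " + msg, file=sys.stderr)
    else:
        asr.transcribe(load_audio_chunk(warmup_file, 0, 1))
        print("INFO: Whisper is warmed up.", file=sys.stderr)
```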