Commit @f0434308c7743f9c37a499862077dce2065813a4 - yjyoon/whisper_server

Fedir Zadniprovskyi 2024-05-27

docs: update README.md

ce5dbe5

f043430

README.md

--- README.md

+++ README.md

@@ -1,13 +1,23 @@
-## Faster Whisper Server
-`faster-whisper-server` is a web server that supports real-time transcription using WebSockets.
-- [faster-whisper](https://github.com/SYSTRAN/faster-whisper) is used as the backend. Both GPU and CPU inference are supported.
-- LocalAgreement2 ([paper](https://aclanthology.org/2023.ijcnlp-demo.3.pdf) | [original implementation](https://github.com/ufal/whisper_streaming)) algorithm is used for real-time transcription.
-- Can be deployed using Docker (Compose configuration can be found in [compose.yaml](./compose.yaml)).
-- All configuration is done through environment variables. See [config.py](./faster_whisper_server/config.py).
-- NOTE: only transcription of single channel, 16000 sample rate, raw, 16-bit little-endian audio is supported.
-- NOTE: this isn't really meant to be used as a standalone tool but rather to add transcription features to other applications.
+# Faster Whisper Server
+`faster-whisper-server` is an OpenAI API compatible transcription server which uses [faster-whisper](https://github.com/SYSTRAN/faster-whisper) as its backend.
+Features:
+- GPU and CPU support.
+- Easily deployable using Docker.
+- Configurable through environment variables (see [config.py](./faster_whisper_server/config.py)).
+- OpenAI API compatible.
+
 Please create an issue if you find a bug, have a question, or a feature suggestion.
-# Quick Start
+
+## OpenAI API Compatibility ++
+See [OpenAI API reference](https://platform.openai.com/docs/api-reference/audio) for more information.
+- Audio file transcription via `POST /v1/audio/transcriptions` endpoint.
+    - Unlike OpenAI's API, `faster-whisper-server` also supports streaming transcriptions (and translations). This is useful when you want to process large audio files and would rather receive the transcription in chunks as they are processed than wait for the whole file to be transcribed. It works similarly to how chat messages are streamed when chatting with LLMs.
+- Audio file translation via `POST /v1/audio/translations` endpoint.
+- (WIP) Live audio transcription via `WS /v1/audio/transcriptions` endpoint.
+    - LocalAgreement2 ([paper](https://aclanthology.org/2023.ijcnlp-demo.3.pdf) | [original implementation](https://github.com/ufal/whisper_streaming)) algorithm is used for live transcription.
+    - Only transcription of single channel, 16000 sample rate, raw, 16-bit little-endian audio is supported.
+
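The LocalAgreement2 policy mentioned above can be illustrated with a small sketch. Following the linked paper, the idea is that a word is only confirmed once two consecutive hypotheses for the growing audio buffer agree on it, so output is the longest common prefix of the last two hypotheses. This is a simplified illustration and not the server's code: the names `common_prefix` and `LocalAgreement2` are hypothetical, and word-level tokens compared by exact string equality are assumptions.

```python
def common_prefix(a: list[str], b: list[str]) -> list[str]:
    """Return the longest common prefix of two word lists."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class LocalAgreement2:
    """Confirm words that two consecutive hypotheses agree on (simplified)."""

    def __init__(self) -> None:
        self.confirmed: list[str] = []  # words already emitted
        self.previous: list[str] = []   # hypothesis from the previous update

    def update(self, hypothesis: list[str]) -> list[str]:
        """Feed the newest hypothesis and return the newly confirmed words."""
        agreed = common_prefix(self.previous, hypothesis)
        self.previous = hypothesis
        if len(agreed) <= len(self.confirmed):
            return []
        newly_confirmed = agreed[len(self.confirmed):]
        # Confirmed words are only ever appended, never retracted.
        self.confirmed.extend(newly_confirmed)
        return newly_confirmed
```

Under these assumptions, feeding `["One,"]` and then `["One,", "two,"]` confirms `"One,"` on the second call, since that is the first point at which two hypotheses agree on it.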
+## Quick Start
 Using Docker
 ```bash
 docker run --gpus=all --publish 8000:8000 --volume ~/.cache/huggingface:/root/.cache/huggingface fedirz/faster-whisper-server:cuda
@@ -17,42 +27,60 @@
 Using Docker Compose
 ```bash
 curl -sO https://raw.githubusercontent.com/fedirz/faster-whisper-server/master/compose.yaml
-docker compose up --detach up faster-whisper-server-cuda
+docker compose up --detach faster-whisper-server-cuda
 # or
-docker compose up --detach up faster-whisper-server-cpu
+docker compose up --detach faster-whisper-server-cpu
 ```
 ## Usage
-Streaming audio data from a microphone. [websocat](https://github.com/vi/websocat?tab=readme-ov-file#installation) installation is required.
+### OpenAI API CLI
 ```bash
-ffmpeg -loglevel quiet -f alsa -i default -ac 1 -ar 16000 -f s16le - | websocat --binary ws://0.0.0.0:8000/v1/audio/transcriptions
-# or
-arecord -f S16_LE -c1 -r 16000 -t raw -D default 2>/dev/null | websocat --binary ws://0.0.0.0:8000/v1/audio/transcriptions
+export OPENAI_API_KEY="cant-be-empty"
+export OPENAI_BASE_URL=http://localhost:8000/v1/
+```
+```bash
+openai api audio.transcriptions.create -m distil-medium.en -f audio.wav --response-format text
+
+openai api audio.translations.create -m distil-medium.en -f audio.wav --response-format verbose_json
+```
+### OpenAI API Python SDK
+```python
+from openai import OpenAI
+
+client = OpenAI(api_key="cant-be-empty", base_url="http://localhost:8000/v1/")
+
+audio_file = open("audio.wav", "rb")
+transcript = client.audio.transcriptions.create(
+    model="distil-medium.en", file=audio_file
+)
+print(transcript.text)
+```
+
+### CURL
+```bash
+# If `model` isn't specified, the default model is used
+curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav"
+curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.mp3"
+curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav" -F "streaming=true"
+curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav" -F "streaming=true" -F "model=distil-large-v3"
+# It's recommended that you always specify the language as that will reduce the transcription time
+curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav" -F "streaming=true" -F "model=distil-large-v3" -F "language=en"
+
+curl http://localhost:8000/v1/audio/translations -F "file=@audio.wav"
+```
+
+### Live Transcription
+[websocat](https://github.com/vi/websocat?tab=readme-ov-file#installation) installation is required.
+Transcribing live audio data from a microphone.
+```bash
+ffmpeg -loglevel quiet -f alsa -i default -ac 1 -ar 16000 -f s16le - | websocat --binary ws://localhost:8000/v1/audio/transcriptions
 ```
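The live endpoint only accepts single-channel, 16000 Hz, raw 16-bit little-endian audio, which is what the `ffmpeg` flags above produce. As a rough sketch, a WAV file that already has that format can be checked and stripped down to raw frames with Python's standard `wave` module (16-bit PCM data in a WAV file is already little-endian). The function name `wav_to_raw_pcm` is hypothetical, and no resampling is attempted; a mismatching file is rejected.

```python
import wave

SAMPLE_RATE = 16000  # samples per second
SAMPLE_WIDTH = 2     # bytes per sample (16-bit)
CHANNELS = 1         # mono
# 16000 samples/s * 2 bytes = 32000 bytes/s, the rate used with `pv -qL 32000`.
BYTES_PER_SECOND = SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS


def wav_to_raw_pcm(wav_file) -> bytes:
    """Return the raw PCM frames of a WAV file that has the required format.

    `wav_file` is a path string or a binary file object.
    Raises ValueError if channels, sample rate, or sample width differ.
    """
    with wave.open(wav_file, "rb") as wav:
        actual = (wav.getnchannels(), wav.getframerate(), wav.getsampwidth())
        expected = (CHANNELS, SAMPLE_RATE, SAMPLE_WIDTH)
        if actual != expected:
            raise ValueError(
                f"expected (channels, rate, width)={expected}, got {actual}"
            )
        return wav.readframes(wav.getnframes())
```

The returned bytes could then be written to a file such as `audio.raw` or piped to a WebSocket client.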
 Streaming audio data from a file.
 ```bash
-ffmpeg -loglevel quiet -f alsa -i default -ac 1 -ar 16000 -f s16le - > output.raw
+ffmpeg -loglevel quiet -f alsa -i default -ac 1 -ar 16000 -f s16le - > audio.raw
 # send all data at once
-cat output.raw | websocat --no-close --binary ws://0.0.0.0:8000/v1/audio/transcriptions
+cat audio.raw | websocat --no-close --binary ws://localhost:8000/v1/audio/transcriptions
 # Output: {"text":"One,"}{"text":"One,  two,  three,  four,  five."}{"text":"One,  two,  three,  four,  five."}%
 # streaming 16000 samples per second. each sample is 2 bytes
-cat output.raw | pv -qL 32000 | websocat --no-close --binary ws://0.0.0.0:8000/v1/audio/transcriptions
+cat audio.raw | pv -qL 32000 | websocat --no-close --binary ws://localhost:8000/v1/audio/transcriptions
 # Output: {"text":"One,"}{"text":"One,  two,"}{"text":"One,  two,  three,"}{"text":"One,  two,  three,  four,  five."}{"text":"One,  two,  three,  four,  five.  one."}%
 ```
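The sample outputs above show JSON objects printed back to back with no separator. The README does not say whether each object arrives as its own WebSocket message, so a client that captures the output as one string may need to split it. A minimal sketch using `json.JSONDecoder.raw_decode` from the standard library (the helper name `split_json_objects` is hypothetical):

```python
import json


def split_json_objects(stream: str) -> list[dict]:
    """Split a string of back-to-back JSON objects into parsed values."""
    decoder = json.JSONDecoder()
    objects = []
    pos = 0
    while pos < len(stream):
        # raw_decode does not skip leading whitespace, so step over it here.
        if stream[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(stream, pos)
        objects.append(obj)
    return objects


sample = '{"text":"One,"}{"text":"One,  two,"}{"text":"One,  two,  three,"}'
for message in split_json_objects(sample):
    print(message["text"])
```

Each object carries the transcription so far, so a consumer would typically keep only the latest `text` value rather than concatenating them.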
-Transcribing a file
-```bash
-# convert the file if it has a different format
-ffmpeg -i output.wav -ac 1 -ar 16000 -f s16le output.raw
-curl -X POST -F "file=@output.raw" http://0.0.0.0:8000/v1/audio/transcriptions
-# Output: "{\"text\":\"One,  two,  three,  four,  five.\"}"%
-```
-## Roadmap
-- [ ] Support file transcription (non-streaming) of multiple formats.
-- [ ] CLI client.
-- [ ] Separate the web server related code from the "core", and publish "core" as a package.
-- [ ] Additional documentation and code comments.
-- [ ] Write benchmarks for measuring streaming transcription performance. Possible metrics:
-    - Latency (time when transcription is sent - time between when audio has been received)
-    - Accuracy (already being measured when testing but the process can be improved)
-    - Total seconds of audio transcribed / audio duration (since each audio chunk is being processed at least twice)
-- [ ] Get the API response closer to the format used by OpenAI.
-- [ ] Integrations...

audio.wav (Binary) (added)

+++ audio.wav

Binary file is not shown

faster_whisper_server/main.py

--- faster_whisper_server/main.py

+++ faster_whisper_server/main.py

@@ -237,7 +237,6 @@
     ws: WebSocket,
     model: Annotated[Model, Query()] = config.whisper.model,
     language: Annotated[Language | None, Query()] = config.default_language,
-    prompt: Annotated[str | None, Query()] = None,
     response_format: Annotated[
         ResponseFormat, Query()
     ] = config.default_response_format,
@@ -246,7 +245,6 @@
     await ws.accept()
     transcribe_opts = {
         "language": language,
-        "initial_prompt": prompt,
         "temperature": temperature,
         "vad_filter": True,
         "condition_on_previous_text": False,

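With `prompt` removed, the query parameters visible in this hunk of the WebSocket handler are `model`, `language`, and `response_format` (`temperature` is used in `transcribe_opts`, but its declaration falls outside the lines shown). A small sketch of building the connection URL with those parameters, using only the standard library; the helper name `build_ws_url` is hypothetical and the host and port are taken from the README examples:

```python
from urllib.parse import urlencode


def build_ws_url(
    base: str = "ws://localhost:8000/v1/audio/transcriptions", **params
) -> str:
    """Append the non-None query parameters to the WebSocket endpoint URL."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base}?{query}" if query else base


url = build_ws_url(model="distil-large-v3", language="en", response_format="text")
print(url)
# ws://localhost:8000/v1/audio/transcriptions?model=distil-large-v3&language=en&response_format=text
```

The resulting URL can be passed to `websocat --binary` in place of the bare endpoint used in the README.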