

docs: update
@1bb8819b5f1d5529d504c96e620aeb8346276cd2
--- README.md
+++ README.md
@@ -3,119 +3,45 @@

 # Speaches

-`speaches` is an OpenAI API-compatible server supporting transcription, translation, and speech generation. For transcription/translation it uses [faster-whisper](https://github.com/SYSTRAN/faster-whisper) and for text-to-speech [piper](https://github.com/rhasspy/piper) is used.
+`speaches` is an OpenAI API-compatible server supporting streaming transcription, translation, and speech generation. Speech-to-Text is powered by [faster-whisper](https://github.com/SYSTRAN/faster-whisper), and Text-to-Speech by [piper](https://github.com/rhasspy/piper) and [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M). This project aims to be Ollama, but for TTS/STT models.

-Features:
+Try it out on the [HuggingFace Space](https://huggingface.co/spaces/speaches-ai/speaches)
+
+See the documentation for installation instructions and usage: [https://speaches-ai.github.io/speaches/](https://speaches-ai.github.io/speaches/)
+
+## Features:

 - GPU and CPU support.
-- Easily deployable using Docker.
-- **Configurable through environment variables (see [config.py](./src/speaches/config.py))**.
-- OpenAI API compatible.
-- Streaming support (transcription is sent via [SSE](https://en.wikipedia.org/wiki/Server-sent_events) as the audio is transcribed. You don't need to wait for the audio to fully be transcribed before receiving it).
+- [Deployable via Docker Compose / Docker](https://speaches-ai.github.io/speaches/installation/)
+- [Highly configurable](https://speaches-ai.github.io/speaches/configuration/)
+- OpenAI API compatible. All tools and SDKs that work with OpenAI's API should work with `speaches`.
+- Streaming support (transcription is sent via SSE as the audio is transcribed. You don't need to wait for the audio to fully be transcribed before receiving it).
+
+  - LocalAgreement2 ([paper](https://aclanthology.org/2023.ijcnlp-demo.3.pdf) | [original implementation](https://github.com/ufal/whisper_streaming)) algorithm is used for live transcription.
+
 - Live transcription support (audio is sent via websocket as it's generated).
 - Dynamic model loading / offloading. Just specify which model you want to use in the request and it will be loaded automatically. It will then be unloaded after a period of inactivity.
+- Text-to-Speech via `kokoro` (ranked #1 in the [TTS Arena](https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena)) and `piper` models.
+- [Coming soon](https://github.com/speaches-ai/speaches/issues/231): Audio generation (chat completions endpoint) | [OpenAI Documentation](https://platform.openai.com/docs/guides/realtime)
+  - Generate a spoken audio summary of a body of text (text in, audio out)
+  - Perform sentiment analysis on a recording (audio in, text out)
+  - Async speech to speech interactions with a model (audio in, audio out)
+- [Coming soon](https://github.com/speaches-ai/speaches/issues/115): Realtime API | [OpenAI Documentation](https://platform.openai.com/docs/guides/realtime)

 Please create an issue if you find a bug, have a question, or a feature suggestion.

-## OpenAI API Compatibility ++
+## Demo

-See [OpenAI API reference](https://platform.openai.com/docs/api-reference/audio) for more information.
+### Streaming Transcription

-- Audio file transcription via `POST /v1/audio/transcriptions` endpoint.
-  - Unlike OpenAI's API, `speaches` also supports streaming transcriptions (and translations). This is useful for when you want to process large audio files and would rather receive the transcription in chunks as they are processed, rather than waiting for the whole file to be transcribed. It works similarly to chat messages when chatting with LLMs.
-- Audio file translation via `POST /v1/audio/translations` endpoint.
-- Live audio transcription via `WS /v1/audio/transcriptions` endpoint.
-  - LocalAgreement2 ([paper](https://aclanthology.org/2023.ijcnlp-demo.3.pdf) | [original implementation](https://github.com/ufal/whisper_streaming)) algorithm is used for live transcription.
-  - Only transcription of a single channel, 16000 sample rate, raw, 16-bit little-endian audio is supported.
+TODO

-## Quick Start
+### Speech Generation

-[Hugging Face Space](https://huggingface.co/spaces/speaches-ai/speaches)
+TODO

-
+### Live Transcription (using WebSockets)

-### Using Docker Compose (Recommended)
-
-NOTE: I'm using newer Docker Compsose features. If you are using an older version of Docker Compose, you may need need to update.
-
-```bash
-curl --silent --remote-name https://raw.githubusercontent.com/speaches-ai/speaches/master/compose.yaml
-
-# for GPU support
-curl --silent --remote-name https://raw.githubusercontent.com/speaches-ai/speaches/master/compose.cuda.yaml
-docker compose --file compose.cuda.yaml up --detach
-# for CPU only (use this if you don't have a GPU, as the image is much smaller)
-curl --silent --remote-name https://raw.githubusercontent.com/speaches-ai/speaches/master/compose.cpu.yaml
-docker compose --file compose.cpu.yaml up --detach
-```
-
-### Using Docker
-
-```bash
-# for GPU support
-docker run --gpus=all --publish 8000:8000 --volume hf-hub-cache:/home/ubuntu/.cache/huggingface/hub --detach ghcr.io/speaches-ai/speaches:latest-cuda
-# for CPU only (use this if you don't have a GPU, as the image is much smaller)
-docker run --publish 8000:8000 --volume hf-hub-cache:/home/ubuntu/.cache/huggingface/hub --env WHISPER__MODEL=Systran/faster-whisper-small --detach ghcr.io/speaches-ai/speaches:latest-cpu
-```
-
-### Using Kubernetes
-
-Follow [this tutorial](https://substratus.ai/blog/deploying-faster-whisper-on-k8s)
-
-## Usage
-
-If you are looking for a step-by-step walkthrough, check out [this](https://www.youtube.com/watch?app=desktop&v=vSN-oAl6LVs) YouTube video.
-
-### OpenAI API CLI
-
-```bash
-export OPENAI_API_KEY="cant-be-empty"
-export OPENAI_BASE_URL=http://localhost:8000/v1/
-```
-
-```bash
-openai api audio.transcriptions.create -m Systran/faster-distil-whisper-large-v3 -f audio.wav --response-format text
-
-openai api audio.translations.create -m Systran/faster-distil-whisper-large-v3 -f audio.wav --response-format verbose_json
-```
-
-### OpenAI API Python SDK
-
-```python
-from pathlib import Path
-
-from openai import OpenAI
-
-client = OpenAI(base_url="http://localhost:8000/v1", api_key="cant-be-empty")
-
-with Path("audio.wav").open("rb") as f:
-    transcript = client.audio.transcriptions.create(model="Systran/faster-distil-whisper-large-v3", file=f)
-    print(transcript.text)
-```
-
-### cURL
-
-```bash
-# If `model` isn't specified, the default model is used
-curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav"
-curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.mp3"
-curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav" -F "stream=true"
-curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav" -F "model=Systran/faster-distil-whisper-large-v3"
-# It's recommended that you always specify the language as that will reduce the transcription time
-curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav" -F "language=en"
-
-curl http://localhost:8000/v1/audio/translations -F "file=@audio.wav"
-```
-
-### Live Transcription (using WebSocket)
-
-From [live-audio](./examples/live-audio) example
-
-https://github.com/fedirz/faster-whisper-server/assets/76551385/e334c124-af61-41d4-839c-874be150598f
-
-[websocat](https://github.com/vi/websocat?tab=readme-ov-file#installation) installation is required.
-Live transcription of audio data from a microphone.
-
-```bash
-ffmpeg -loglevel quiet -f alsa -i default -ac 1 -ar 16000 -f s16le - | websocat --binary ws://localhost:8000/v1/audio/transcriptions
-```
+<video width="100%" controls>
+  <source src="https://github.com/fedirz/faster-whisper-server/assets/76551385/e334c124-af61-41d4-839c-874be150598f" type="video/mp4">
+</video>
--- docs/index.md
+++ docs/index.md
@@ -4,42 +4,47 @@

 !!! note

-    These docs are a work in progress. If you have any questions, suggestions, or find a bug, please create an issue.
-
-TODO: add HuggingFace Space URL
+    These docs are a work in progress.

 # Speaches

-`speaches` is an OpenAI API-compatible server supporting transcription, translation, and speech generation. For transcription/translation it uses [faster-whisper](https://github.com/SYSTRAN/faster-whisper) and for text-to-speech [piper](https://github.com/rhasspy/piper) is used.
+`speaches` is an OpenAI API-compatible server supporting streaming transcription, translation, and speech generation. Speech-to-Text is powered by [faster-whisper](https://github.com/SYSTRAN/faster-whisper), and Text-to-Speech by [piper](https://github.com/rhasspy/piper) and [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M). This project aims to be Ollama, but for TTS/STT models.
+
+Try it out on the [HuggingFace Space](https://huggingface.co/spaces/speaches-ai/speaches)

 ## Features:

 - GPU and CPU support.
-- [Deployable via Docker Compose / Docker](./installation.md)
-- [Highly configurable](./configuration.md)
+- [Deployable via Docker Compose / Docker](https://speaches-ai.github.io/speaches/installation/)
+- [Highly configurable](https://speaches-ai.github.io/speaches/configuration/)
 - OpenAI API compatible. All tools and SDKs that work with OpenAI's API should work with `speaches`.
-- Streaming support (transcription is sent via [SSE](https://en.wikipedia.org/wiki/Server-sent_events) as the audio is transcribed. You don't need to wait for the audio to fully be transcribed before receiving it).
+- Streaming support (transcription is sent via SSE as the audio is transcribed. You don't need to wait for the audio to fully be transcribed before receiving it).
+
+  - LocalAgreement2 ([paper](https://aclanthology.org/2023.ijcnlp-demo.3.pdf) | [original implementation](https://github.com/ufal/whisper_streaming)) algorithm is used for live transcription.
+
 - Live transcription support (audio is sent via websocket as it's generated).
 - Dynamic model loading / offloading. Just specify which model you want to use in the request and it will be loaded automatically. It will then be unloaded after a period of inactivity.
-- [Text-to-speech (TTS) via `piper`]
-- (Coming soon) Audio generation (chat completions endpoint) | [OpenAI Documentation](https://platform.openai.com/docs/guides/realtime)
+- Text-to-Speech via `kokoro` (ranked #1 in the [TTS Arena](https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena)) and `piper` models.
+- [Coming soon](https://github.com/speaches-ai/speaches/issues/231): Audio generation (chat completions endpoint) | [OpenAI Documentation](https://platform.openai.com/docs/guides/realtime)
   - Generate a spoken audio summary of a body of text (text in, audio out)
   - Perform sentiment analysis on a recording (audio in, text out)
   - Async speech to speech interactions with a model (audio in, audio out)
-- (Coming soon) Realtime API | [OpenAI Documentation](https://platform.openai.com/docs/guides/realtime)
+- [Coming soon](https://github.com/speaches-ai/speaches/issues/115): Realtime API | [OpenAI Documentation](https://platform.openai.com/docs/guides/realtime)

 Please create an issue if you find a bug, have a question, or a feature suggestion.

-## OpenAI API Compatibility ++
+## Demo

-See [OpenAI API reference](https://platform.openai.com/docs/api-reference/audio) for more information.
+### Streaming Transcription

-- Audio file transcription via `POST /v1/audio/transcriptions` endpoint.
-  - Unlike OpenAI's API, `speaches` also supports streaming transcriptions (and translations). This is useful for when you want to process large audio files and would rather receive the transcription in chunks as they are processed, rather than waiting for the whole file to be transcribed. It works similarly to chat messages when chatting with LLMs.
-- Audio file translation via `POST /v1/audio/translations` endpoint.
-- Live audio transcription via `WS /v1/audio/transcriptions` endpoint.
-  - LocalAgreement2 ([paper](https://aclanthology.org/2023.ijcnlp-demo.3.pdf) | [original implementation](https://github.com/ufal/whisper_streaming)) algorithm is used for live transcription.
-  - Only transcription of a single channel, 16000 sample rate, raw, 16-bit little-endian audio is supported.
+TODO

-TODO: add a note about gradio ui
-TODO: add a note about hf space
+### Speech Generation
+
+TODO
+
+### Live Transcription (using WebSockets)
+
+<video width="100%" controls>
+  <source src="https://github.com/fedirz/faster-whisper-server/assets/76551385/e334c124-af61-41d4-839c-874be150598f" type="video/mp4">
+</video>
--- docs/usage/live-transcription.md
+++ docs/usage/live-transcription.md
@@ -6,10 +6,11 @@

 More content will be added here soon.

-TODO: fix link
-From [live-audio](./examples/live-audio) example
+From [live-audio](https://github.com/speaches-ai/speaches/tree/master/examples/live-audio) example

-https://github.com/fedirz/faster-whisper-server/assets/76551385/e334c124-af61-41d4-839c-874be150598f
+<video width="100%" controls>
+  <source src="https://github.com/fedirz/faster-whisper-server/assets/76551385/e334c124-af61-41d4-839c-874be150598f" type="video/mp4">
+</video>

 [websocat](https://github.com/vi/websocat?tab=readme-ov-file#installation) installation is required.
 Live transcription of audio data from a microphone.
--- docs/usage/speech-to-text.md
+++ docs/usage/speech-to-text.md
@@ -7,7 +7,7 @@

 !!! note

-    Before proceeding, make sure you are familiar with the [OpenAI Speech-to-Text](https://platform.openai.com/docs/guides/speech-to-text) and the relevant [OpenAI API reference](https://platform.openai.com/docs/api-reference/audio/createTranscription)
+    Before proceeding, you should be familiar with the [OpenAI Speech-to-Text](https://platform.openai.com/docs/guides/speech-to-text) and the relevant [OpenAI API reference](https://platform.openai.com/docs/api-reference/audio/createTranscription)

 ## Curl

--- docs/usage/text-to-speech.md
+++ docs/usage/text-to-speech.md
@@ -4,13 +4,9 @@

 !!! note

-    Before proceeding, make sure you are familiar with the [OpenAI Text-to-Speech](https://platform.openai.com/docs/guides/text-to-speech) and the relevant [OpenAI API reference](https://platform.openai.com/docs/api-reference/audio/createSpeech)
+    Before proceeding, you should be familiar with the [OpenAI Text-to-Speech](https://platform.openai.com/docs/guides/text-to-speech) and the relevant [OpenAI API reference](https://platform.openai.com/docs/api-reference/audio/createSpeech)

 ## Prerequisite
-
-!!! note
-
-    `rhasspy/piper-voices` audio samples can be found [here](https://rhasspy.github.io/piper-samples/)

 Download the Kokoro model and voices.

@@ -26,6 +22,10 @@
 docker cp voices.json speaches:/home/ubuntu/.cache/huggingface/hub/models--hexgrad--Kokoro-82M/snapshots/c97b7bbc3e60f447383c79b2f94fee861ff156ac/voices.json
 ```

+!!! note
+
+    `rhasspy/piper-voices` audio samples can be found [here](https://rhasspy.github.io/piper-samples/)
+
 Download the piper voices from [HuggingFace model repository](https://huggingface.co/rhasspy/piper-voices)

 ```bash
--- src/speaches/gradio_app.py
+++ src/speaches/gradio_app.py
@@ -187,10 +187,7 @@
 model_dropdown_choices.remove("rhasspy/piper-voices")
 gr.Textbox("Speech generation using `rhasspy/piper-voices` model is only supported on x86_64 machines.")

-text = gr.Textbox(
-    label="Input Text",
-    value=DEFAULT_TEXT,
-)
+text = gr.Textbox(label="Input Text", value=DEFAULT_TEXT, lines=3)
 stt_model_dropdown = gr.Dropdown(
     choices=model_dropdown_choices,
     label="Model",