

docs: update
@1bb8819b5f1d5529d504c96e620aeb8346276cd2
--- README.md
+++ README.md
@@ -3,119 +3,45 @@

 # Speaches

-`speaches` is an OpenAI API-compatible server supporting transcription, translation, and speech generation. For transcription/translation it uses [faster-whisper](https://github.com/SYSTRAN/faster-whisper) and for text-to-speech [piper](https://github.com/rhasspy/piper) is used.
+`speaches` is an OpenAI API-compatible server supporting streaming transcription, translation, and speech generation. Speech-to-Text is powered by [faster-whisper](https://github.com/SYSTRAN/faster-whisper), and Text-to-Speech by [piper](https://github.com/rhasspy/piper) and [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M). This project aims to be Ollama, but for TTS/STT models.

-Features:
+Try it out on the [HuggingFace Space](https://huggingface.co/spaces/speaches-ai/speaches)
+
+See the documentation for installation instructions and usage: [https://speaches-ai.github.io/speaches/](https://speaches-ai.github.io/speaches/)
+
+## Features:

 - GPU and CPU support.
-- Easily deployable using Docker.
-- **Configurable through environment variables (see [config.py](./src/speaches/config.py))**.
-- OpenAI API compatible.
-- Streaming support (transcription is sent via [SSE](https://en.wikipedia.org/wiki/Server-sent_events) as the audio is transcribed. You don't need to wait for the audio to fully be transcribed before receiving it).
+- [Deployable via Docker Compose / Docker](https://speaches-ai.github.io/speaches/installation/)
+- [Highly configurable](https://speaches-ai.github.io/speaches/configuration/)
+- OpenAI API compatible. All tools and SDKs that work with OpenAI's API should work with `speaches`.
+- Streaming support (transcription is sent via SSE as the audio is transcribed. You don't need to wait for the audio to fully be transcribed before receiving it).
+
+  - LocalAgreement2 ([paper](https://aclanthology.org/2023.ijcnlp-demo.3.pdf) | [original implementation](https://github.com/ufal/whisper_streaming)) algorithm is used for live transcription.
+
 - Live transcription support (audio is sent via websocket as it's generated).
 - Dynamic model loading / offloading. Just specify which model you want to use in the request and it will be loaded automatically. It will then be unloaded after a period of inactivity.
+- Text-to-Speech via `kokoro` (ranked #1 in the [TTS Arena](https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena)) and `piper` models.
+- [Coming soon](https://github.com/speaches-ai/speaches/issues/231): Audio generation (chat completions endpoint) | [OpenAI Documentation](https://platform.openai.com/docs/guides/realtime)
+  - Generate a spoken audio summary of a body of text (text in, audio out)
+  - Perform sentiment analysis on a recording (audio in, text out)
+  - Async speech to speech interactions with a model (audio in, audio out)
+- [Coming soon](https://github.com/speaches-ai/speaches/issues/115): Realtime API | [OpenAI Documentation](https://platform.openai.com/docs/guides/realtime)

 Please create an issue if you find a bug, have a question, or a feature suggestion.

-## OpenAI API Compatibility ++
+## Demo

-See [OpenAI API reference](https://platform.openai.com/docs/api-reference/audio) for more information.
+### Streaming Transcription

-- Audio file transcription via `POST /v1/audio/transcriptions` endpoint.
-  - Unlike OpenAI's API, `speaches` also supports streaming transcriptions (and translations). This is useful for when you want to process large audio files and would rather receive the transcription in chunks as they are processed, rather than waiting for the whole file to be transcribed. It works similarly to chat messages when chatting with LLMs.
-- Audio file translation via `POST /v1/audio/translations` endpoint.
-- Live audio transcription via `WS /v1/audio/transcriptions` endpoint.
-  - LocalAgreement2 ([paper](https://aclanthology.org/2023.ijcnlp-demo.3.pdf) | [original implementation](https://github.com/ufal/whisper_streaming)) algorithm is used for live transcription.
-  - Only transcription of a single channel, 16000 sample rate, raw, 16-bit little-endian audio is supported.
+TODO

-## Quick Start
+### Speech Generation

-[Hugging Face Space](https://huggingface.co/spaces/speaches-ai/speaches)
+TODO

-
+### Live Transcription (using WebSockets)

-### Using Docker Compose (Recommended)
-
-NOTE: I'm using newer Docker Compsose features. If you are using an older version of Docker Compose, you may need need to update.
-
-```bash
-curl --silent --remote-name https://raw.githubusercontent.com/speaches-ai/speaches/master/compose.yaml
-
-# for GPU support
-curl --silent --remote-name https://raw.githubusercontent.com/speaches-ai/speaches/master/compose.cuda.yaml
-docker compose --file compose.cuda.yaml up --detach
-# for CPU only (use this if you don't have a GPU, as the image is much smaller)
-curl --silent --remote-name https://raw.githubusercontent.com/speaches-ai/speaches/master/compose.cpu.yaml
-docker compose --file compose.cpu.yaml up --detach
-```
-
-### Using Docker
-
-```bash
-# for GPU support
-docker run --gpus=all --publish 8000:8000 --volume hf-hub-cache:/home/ubuntu/.cache/huggingface/hub --detach ghcr.io/speaches-ai/speaches:latest-cuda
-# for CPU only (use this if you don't have a GPU, as the image is much smaller)
-docker run --publish 8000:8000 --volume hf-hub-cache:/home/ubuntu/.cache/huggingface/hub --env WHISPER__MODEL=Systran/faster-whisper-small --detach ghcr.io/speaches-ai/speaches:latest-cpu
-```
-
-### Using Kubernetes
-
-Follow [this tutorial](https://substratus.ai/blog/deploying-faster-whisper-on-k8s)
-
-## Usage
-
-If you are looking for a step-by-step walkthrough, check out [this](https://www.youtube.com/watch?app=desktop&v=vSN-oAl6LVs) YouTube video.
-
-### OpenAI API CLI
-
-```bash
-export OPENAI_API_KEY="cant-be-empty"
-export OPENAI_BASE_URL=http://localhost:8000/v1/
-```
-
-```bash
-openai api audio.transcriptions.create -m Systran/faster-distil-whisper-large-v3 -f audio.wav --response-format text
-
-openai api audio.translations.create -m Systran/faster-distil-whisper-large-v3 -f audio.wav --response-format verbose_json
-```
-
-### OpenAI API Python SDK
-
-```python
-from pathlib import Path
-
-from openai import OpenAI
-
-client = OpenAI(base_url="http://localhost:8000/v1", api_key="cant-be-empty")
-
-with Path("audio.wav").open("rb") as f:
-    transcript = client.audio.transcriptions.create(model="Systran/faster-distil-whisper-large-v3", file=f)
-    print(transcript.text)
-```
-
-### cURL
-
-```bash
-# If `model` isn't specified, the default model is used
-curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav"
-curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.mp3"
-curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav" -F "stream=true"
-curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav" -F "model=Systran/faster-distil-whisper-large-v3"
-# It's recommended that you always specify the language as that will reduce the transcription time
-curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav" -F "language=en"
-
-curl http://localhost:8000/v1/audio/translations -F "file=@audio.wav"
-```
-
-### Live Transcription (using WebSocket)
-
-From [live-audio](./examples/live-audio) example
-
-https://github.com/fedirz/faster-whisper-server/assets/76551385/e334c124-af61-41d4-839c-874be150598f
-
-[websocat](https://github.com/vi/websocat?tab=readme-ov-file#installation) installation is required.
-Live transcription of audio data from a microphone.
-
-```bash
-ffmpeg -loglevel quiet -f alsa -i default -ac 1 -ar 16000 -f s16le - | websocat --binary ws://localhost:8000/v1/audio/transcriptions
-```
+<video width="100%" controls>
+  <source src="https://github.com/fedirz/faster-whisper-server/assets/76551385/e334c124-af61-41d4-839c-874be150598f" type="video/mp4">
+</video>
--- docs/index.md
+++ docs/index.md
@@ -4,42 +4,47 @@

 !!! note

-    These docs are a work in progress. If you have any questions, suggestions, or find a bug, please create an issue.
-
-TODO: add HuggingFace Space URL
+    These docs are a work in progress.

 # Speaches

-`speaches` is an OpenAI API-compatible server supporting transcription, translation, and speech generation. For transcription/translation it uses [faster-whisper](https://github.com/SYSTRAN/faster-whisper) and for text-to-speech [piper](https://github.com/rhasspy/piper) is used.
+`speaches` is an OpenAI API-compatible server supporting streaming transcription, translation, and speech generation. Speech-to-Text is powered by [faster-whisper](https://github.com/SYSTRAN/faster-whisper), and Text-to-Speech by [piper](https://github.com/rhasspy/piper) and [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M). This project aims to be Ollama, but for TTS/STT models.
+
+Try it out on the [HuggingFace Space](https://huggingface.co/spaces/speaches-ai/speaches)

 ## Features:

 - GPU and CPU support.
-- [Deployable via Docker Compose / Docker](./installation.md)
-- [Highly configurable](./configuration.md)
+- [Deployable via Docker Compose / Docker](https://speaches-ai.github.io/speaches/installation/)
+- [Highly configurable](https://speaches-ai.github.io/speaches/configuration/)
 - OpenAI API compatible. All tools and SDKs that work with OpenAI's API should work with `speaches`.
-- Streaming support (transcription is sent via [SSE](https://en.wikipedia.org/wiki/Server-sent_events) as the audio is transcribed. You don't need to wait for the audio to fully be transcribed before receiving it).
+- Streaming support (transcription is sent via SSE as the audio is transcribed. You don't need to wait for the audio to fully be transcribed before receiving it).
+
+  - LocalAgreement2 ([paper](https://aclanthology.org/2023.ijcnlp-demo.3.pdf) | [original implementation](https://github.com/ufal/whisper_streaming)) algorithm is used for live transcription.
+
 - Live transcription support (audio is sent via websocket as it's generated).
 - Dynamic model loading / offloading. Just specify which model you want to use in the request and it will be loaded automatically. It will then be unloaded after a period of inactivity.
-- [Text-to-speech (TTS) via `piper`]
-- (Coming soon) Audio generation (chat completions endpoint) | [OpenAI Documentation](https://platform.openai.com/docs/guides/realtime)
+- Text-to-Speech via `kokoro` (ranked #1 in the [TTS Arena](https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena)) and `piper` models.
+- [Coming soon](https://github.com/speaches-ai/speaches/issues/231): Audio generation (chat completions endpoint) | [OpenAI Documentation](https://platform.openai.com/docs/guides/realtime)
   - Generate a spoken audio summary of a body of text (text in, audio out)
   - Perform sentiment analysis on a recording (audio in, text out)
   - Async speech to speech interactions with a model (audio in, audio out)
-- (Coming soon) Realtime API | [OpenAI Documentation](https://platform.openai.com/docs/guides/realtime)
+- [Coming soon](https://github.com/speaches-ai/speaches/issues/115): Realtime API | [OpenAI Documentation](https://platform.openai.com/docs/guides/realtime)

 Please create an issue if you find a bug, have a question, or a feature suggestion.

-## OpenAI API Compatibility ++
+## Demo

-See [OpenAI API reference](https://platform.openai.com/docs/api-reference/audio) for more information.
+### Streaming Transcription

-- Audio file transcription via `POST /v1/audio/transcriptions` endpoint.
-  - Unlike OpenAI's API, `speaches` also supports streaming transcriptions (and translations). This is useful for when you want to process large audio files and would rather receive the transcription in chunks as they are processed, rather than waiting for the whole file to be transcribed. It works similarly to chat messages when chatting with LLMs.
-- Audio file translation via `POST /v1/audio/translations` endpoint.
-- Live audio transcription via `WS /v1/audio/transcriptions` endpoint.
-  - LocalAgreement2 ([paper](https://aclanthology.org/2023.ijcnlp-demo.3.pdf) | [original implementation](https://github.com/ufal/whisper_streaming)) algorithm is used for live transcription.
-  - Only transcription of a single channel, 16000 sample rate, raw, 16-bit little-endian audio is supported.
+TODO

-TODO: add a note about gradio ui
-TODO: add a note about hf space
+### Speech Generation
+
+TODO
+
+### Live Transcription (using WebSockets)
+
+<video width="100%" controls>
+  <source src="https://github.com/fedirz/faster-whisper-server/assets/76551385/e334c124-af61-41d4-839c-874be150598f" type="video/mp4">
+</video>
--- docs/usage/live-transcription.md
+++ docs/usage/live-transcription.md
@@ -6,10 +6,11 @@

 More content will be added here soon.

-TODO: fix link
-From [live-audio](./examples/live-audio) example
+From [live-audio](https://github.com/speaches-ai/speaches/tree/master/examples/live-audio) example

-https://github.com/fedirz/faster-whisper-server/assets/76551385/e334c124-af61-41d4-839c-874be150598f
+<video width="100%" controls>
+  <source src="https://github.com/fedirz/faster-whisper-server/assets/76551385/e334c124-af61-41d4-839c-874be150598f" type="video/mp4">
+</video>

 [websocat](https://github.com/vi/websocat?tab=readme-ov-file#installation) installation is required.
 Live transcription of audio data from a microphone.
--- docs/usage/speech-to-text.md
+++ docs/usage/speech-to-text.md
@@ -7,7 +7,7 @@

 !!! note

-    Before proceeding, make sure you are familiar with the [OpenAI Speech-to-Text](https://platform.openai.com/docs/guides/speech-to-text) and the relevant [OpenAI API reference](https://platform.openai.com/docs/api-reference/audio/createTranscription)
+    Before proceeding, you should be familiar with the [OpenAI Speech-to-Text](https://platform.openai.com/docs/guides/speech-to-text) and the relevant [OpenAI API reference](https://platform.openai.com/docs/api-reference/audio/createTranscription)

 ## Curl

--- docs/usage/text-to-speech.md
+++ docs/usage/text-to-speech.md
@@ -4,13 +4,9 @@

 !!! note

-    Before proceeding, make sure you are familiar with the [OpenAI Text-to-Speech](https://platform.openai.com/docs/guides/text-to-speech) and the relevant [OpenAI API reference](https://platform.openai.com/docs/api-reference/audio/createSpeech)
+    Before proceeding, you should be familiar with the [OpenAI Text-to-Speech](https://platform.openai.com/docs/guides/text-to-speech) and the relevant [OpenAI API reference](https://platform.openai.com/docs/api-reference/audio/createSpeech)

 ## Prerequisite
-
-!!! note
-
-    `rhasspy/piper-voices` audio samples can be found [here](https://rhasspy.github.io/piper-samples/)

 Download the Kokoro model and voices.

@@ -26,6 +22,10 @@
 docker cp voices.json speaches:/home/ubuntu/.cache/huggingface/hub/models--hexgrad--Kokoro-82M/snapshots/c97b7bbc3e60f447383c79b2f94fee861ff156ac/voices.json
 ```

+!!! note
+
+    `rhasspy/piper-voices` audio samples can be found [here](https://rhasspy.github.io/piper-samples/)
+
 Download the piper voices from [HuggingFace model repository](https://huggingface.co/rhasspy/piper-voices)

 ```bash
--- src/speaches/gradio_app.py
+++ src/speaches/gradio_app.py
@@ -187,10 +187,7 @@
 model_dropdown_choices.remove("rhasspy/piper-voices")
 gr.Textbox("Speech generation using `rhasspy/piper-voices` model is only supported on x86_64 machines.")

-text = gr.Textbox(
-    label="Input Text",
-    value=DEFAULT_TEXT,
-)
+text = gr.Textbox(label="Input Text", value=DEFAULT_TEXT, lines=3)
 stt_model_dropdown = gr.Dropdown(
     choices=model_dropdown_choices,
     label="Model",