Fedir Zadniprovskyi 01-12
docs: usage pages (and more)
@8f4851657b67a4ba51d7e0ea25af381dbc219e43
.pre-commit-config.yaml
--- .pre-commit-config.yaml
+++ .pre-commit-config.yaml
@@ -44,4 +44,4 @@
     rev: v1.5.0
     hooks:
       - id: detect-secrets
-        exclude: 'README.md|tests/conftest.py|docs/usage.md'
+        exclude: 'README.md|tests/conftest.py|docs/usage/*'
README.md
--- README.md
+++ README.md
@@ -78,15 +78,15 @@
 ### OpenAI API Python SDK
 
 ```python
+from pathlib import Path
+
 from openai import OpenAI
 
-client = OpenAI(api_key="cant-be-empty", base_url="http://localhost:8000/v1/")
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="cant-be-empty")
 
-audio_file = open("audio.wav", "rb")
-transcript = client.audio.transcriptions.create(
-    model="Systran/faster-distil-whisper-large-v3", file=audio_file
-)
-print(transcript.text)
+with Path("audio.wav").open("rb") as f:
+    transcript = client.audio.transcriptions.create(model="Systran/faster-distil-whisper-large-v3", file=f)
+    print(transcript.text)
 ```
 
 ### cURL
docs/introduction.md
--- docs/introduction.md
+++ docs/introduction.md
@@ -2,18 +2,26 @@
 
     Under development. I don't recommend using these docs as a reference for now.
 
+TODO: add HuggingFace Space URL
+
 # Faster Whisper Server
 
-`faster-whisper-server` is an OpenAI API-compatible transcription server which uses [faster-whisper](https://github.com/SYSTRAN/faster-whisper) as its backend.
-Features:
+`faster-whisper-server` is an OpenAI API-compatible server that supports transcription, translation, and speech generation. It uses [faster-whisper](https://github.com/SYSTRAN/faster-whisper) for transcription/translation and [piper](https://github.com/rhasspy/piper) for text-to-speech.
+
+## Features
 
 - GPU and CPU support.
-- Easily deployable using Docker.
-- **Configurable through environment variables (see [config.py](./src/faster_whisper_server/config.py))**.
-- OpenAI API compatible.
+- [Deployable via Docker Compose / Docker](./installation.md)
+- [Highly configurable](./configuration.md)
+- OpenAI API compatible. All tools and SDKs that work with OpenAI's API should work with `faster-whisper-server`.
 - Streaming support (transcription is sent via [SSE](https://en.wikipedia.org/wiki/Server-sent_events) as the audio is transcribed. You don't need to wait for the audio to fully be transcribed before receiving it).
 - Live transcription support (audio is sent via websocket as it's generated).
 - Dynamic model loading / offloading. Just specify which model you want to use in the request and it will be loaded automatically. It will then be unloaded after a period of inactivity.
+- (Coming soon) Audio generation (chat completions endpoint) | [OpenAI Documentation](https://platform.openai.com/docs/guides/audio)
+  - Generate a spoken audio summary of a body of text (text in, audio out)
+  - Perform sentiment analysis on a recording (audio in, text out)
+  - Asynchronous speech-to-speech interactions with a model (audio in, audio out)
+- (Coming soon) Realtime API | [OpenAI Documentation](https://platform.openai.com/docs/guides/realtime)
 
 Please create an issue if you find a bug, have a question, or a feature suggestion.
 
 
docs/usage.md (deleted)
--- docs/usage.md
@@ -1,86 +0,0 @@
-TODO: break this down into: transcription/translation, streaming transcription/translation, live transcription, audio generation, model listing
-TODO: add video demos for all
-TODO: add a note about OPENAI_API_KEY
-
-## Curl
-
-```bash
-curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav"
-```
-
-## Python
-
-=== "httpx"
-
-    ```python
-    import httpx
-
-    with open('audio.wav', 'rb') as f:
-        files = {'file': ('audio.wav', f)}
-        response = httpx.post('http://localhost:8000/v1/audio/transcriptions', files=files)
-
-    print(response.text)
-    ```
-
-## OpenAI SDKs
-
-=== "Python"
-
-    ```python
-    import httpx
-
-    with open('audio.wav', 'rb') as f:
-        files = {'file': ('audio.wav', f)}
-        response = httpx.post('http://localhost:8000/v1/audio/transcriptions', files=files)
-
-    print(response.text)
-    ```
-
-=== "CLI"
-
-    ```bash
-    export OPENAI_BASE_URL=http://localhost:8000/v1/
-    export OPENAI_API_KEY="cant-be-empty"
-    openai api audio.transcriptions.create -m Systran/faster-whisper-small -f audio.wav --response-format text
-    ```
-
-=== "Other"
-
-    See [OpenAI libraries](https://platform.openai.com/docs/libraries) and [OpenAI speech-to-text usage](https://platform.openai.com/docs/guides/speech-to-text).
-
-## Open WebUI
-
-### Using the UI
-
-1. Go to the [Admin Settings](http://localhost:8080/admin/settings) page
-2. Click on the "Audio" tab
-3. Update settings
-   - Speech-to-Text Engine: OpenAI
-   - API Base URL: http://faster-whisper-server:8000/v1
-   - API Key: does-not-matter-what-you-put-but-should-not-be-empty
-   - Model: Systran/faster-distil-whisper-large-v3
-4. Click "Save"
-
-### Using environment variables (Docker Compose)
-
-!!! warning
-
-    This doesn't seem to work when you've previously used the UI to set the STT engine.
-
-```yaml
-# NOTE: Some parts of the file are omitted for brevity.
-services:
-  open-webui:
-    image: ghcr.io/open-webui/open-webui:main
-    ...
-    environment:
-      ...
-      # Environment variables are documented here https://docs.openwebui.com/getting-started/env-configuration#speech-to-text
-      AUDIO_STT_ENGINE: "openai"
-      AUDIO_STT_OPENAI_API_BASE_URL: "http://faster-whisper-server:8000/v1"
-      AUDIO_STT_OPENAI_API_KEY: "does-not-matter-what-you-put-but-should-not-be-empty"
-      AUDIO_STT_MODEL: "Systran/faster-distil-whisper-large-v3"
-  faster-whisper-server:
-    image: fedirz/faster-whisper-server:latest-cuda
-    ...
-```
 
docs/usage/live-transcription.md (added)
+++ docs/usage/live-transcription.md
@@ -0,0 +1,17 @@
+## Live Transcription (using WebSocket)
+
+!!! note
+
+    More content will be added here soon.
+
+TODO: fix link
+From the [live-audio](./examples/live-audio) example
+
+https://github.com/fedirz/faster-whisper-server/assets/76551385/e334c124-af61-41d4-839c-874be150598f
+
+The command below performs live transcription of audio captured from a microphone. [websocat](https://github.com/vi/websocat?tab=readme-ov-file#installation) must be installed.
+
+```bash
+ffmpeg -loglevel quiet -f alsa -i default -ac 1 -ar 16000 -f s16le - | websocat --binary ws://localhost:8000/v1/audio/transcriptions
+```
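If you'd rather stream audio from Python, the sketch below sends raw PCM over the same websocket. It assumes the endpoint accepts 16 kHz mono s16le audio (matching the ffmpeg flags above) and uses the third-party `websockets` package, which is an assumption and not part of this project.

```python
import asyncio
from pathlib import Path

CHUNK_SAMPLES = 4000  # 0.25 s of audio at 16 kHz
BYTES_PER_SAMPLE = 2  # s16le -> 2 bytes per sample


def pcm_chunks(data: bytes, chunk_samples: int = CHUNK_SAMPLES) -> list[bytes]:
    """Split raw s16le PCM into fixed-size chunks for sending."""
    step = chunk_samples * BYTES_PER_SAMPLE
    return [data[i : i + step] for i in range(0, len(data), step)]


async def stream_pcm(path: str) -> None:
    # Imported lazily so the chunking helper works without the package installed.
    import websockets  # third-party, assumed: `pip install websockets`

    data = Path(path).read_bytes()
    async with websockets.connect("ws://localhost:8000/v1/audio/transcriptions") as ws:
        for chunk in pcm_chunks(data):
            await ws.send(chunk)
            await asyncio.sleep(0.25)  # pace the stream roughly in real time
        print(await ws.recv())  # print whatever the server sends back


# Usage (against a raw s16le file, e.g. produced with the ffmpeg flags above):
# asyncio.run(stream_pcm("audio.pcm"))
```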
 
docs/usage/open-webui-integration.md (added)
+++ docs/usage/open-webui-integration.md
@@ -0,0 +1,36 @@
+## Open WebUI
+
+### Using the UI
+
+1. Go to the [Admin Settings](http://localhost:8080/admin/settings) page
+2. Click on the "Audio" tab
+3. Update settings
+   - Speech-to-Text Engine: OpenAI
+   - API Base URL: http://faster-whisper-server:8000/v1
+   - API Key: does-not-matter-what-you-put-but-should-not-be-empty
+   - Model: Systran/faster-distil-whisper-large-v3
+4. Click "Save"
+
+### Using environment variables (Docker Compose)
+
+!!! warning
+
+    This doesn't seem to work when you've previously used the UI to set the STT engine.
+
+```yaml
+# NOTE: Some parts of the file are omitted for brevity.
+services:
+  open-webui:
+    image: ghcr.io/open-webui/open-webui:main
+    ...
+    environment:
+      ...
+      # Environment variables are documented here https://docs.openwebui.com/getting-started/env-configuration#speech-to-text
+      AUDIO_STT_ENGINE: "openai"
+      AUDIO_STT_OPENAI_API_BASE_URL: "http://faster-whisper-server:8000/v1"
+      AUDIO_STT_OPENAI_API_KEY: "does-not-matter-what-you-put-but-should-not-be-empty"
+      AUDIO_STT_MODEL: "Systran/faster-distil-whisper-large-v3"
+  faster-whisper-server:
+    image: fedirz/faster-whisper-server:latest-cuda
+    ...
+```
 
docs/usage/speech-to-text.md (added)
+++ docs/usage/speech-to-text.md
@@ -0,0 +1,54 @@
+https://platform.openai.com/docs/api-reference/audio/createTranscription
+https://platform.openai.com/docs/guides/speech-to-text
+
+TODO: add a note about automatic downloads
+TODO: add a note about api-key
+TODO: mention streaming
+TODO: add a demo
+TODO: talk about audio format
+
+## Curl
+
+```bash
+curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav"
+```
+
+## Python
+
+=== "httpx"
+
+    ```python
+    import httpx
+
+    with open('audio.wav', 'rb') as f:
+        files = {'file': ('audio.wav', f)}
+        response = httpx.post('http://localhost:8000/v1/audio/transcriptions', files=files)
+
+    print(response.text)
+    ```
+
+## OpenAI SDKs
+
+=== "Python"
+
+    ```python
+    from pathlib import Path
+
+    from openai import OpenAI
+
+    client = OpenAI(base_url="http://localhost:8000/v1", api_key="cant-be-empty")
+
+    with Path("audio.wav").open("rb") as f:
+        transcript = client.audio.transcriptions.create(model="Systran/faster-distil-whisper-large-v3", file=f)
+    print(transcript.text)
+    ```
+
+=== "CLI"
+
+    ```bash
+    export OPENAI_BASE_URL=http://localhost:8000/v1/
+    export OPENAI_API_KEY="cant-be-empty"
+    openai api audio.transcriptions.create -m Systran/faster-whisper-small -f audio.wav --response-format text
+    ```
+
+=== "Other"
+
+    See [OpenAI libraries](https://platform.openai.com/docs/libraries).
 
docs/usage/text-to-speech.md (added)
+++ docs/usage/text-to-speech.md
@@ -0,0 +1,98 @@
+!!! warning
+
+    This feature is not supported on ARM devices, only on x86_64, because I was unable to build [piper-phonemize](https://github.com/rhasspy/piper-phonemize) (my [fork](https://github.com/fedirz/piper-phonemize)) for ARM.
+
+https://platform.openai.com/docs/api-reference/audio/createSpeech
+https://platform.openai.com/docs/guides/text-to-speech
+http://localhost:8001/faster-whisper-server/api/
+TODO: add a note about automatic downloads
+TODO: add a note about api-key
+TODO: add a demo
+
+## Prerequisite
+
+Download the piper voices from [HuggingFace model repository](https://huggingface.co/rhasspy/piper-voices)
+
+```bash
+# Download all voices (~15 minutes / 7.7 GB)
+docker exec -it faster-whisper-server huggingface-cli download rhasspy/piper-voices
+# Download all English voices (~4.5 minutes)
+docker exec -it faster-whisper-server huggingface-cli download rhasspy/piper-voices --include 'en/**/*' 'voices.json'
+# Download all qualities of a specific voice (~4 seconds)
+docker exec -it faster-whisper-server huggingface-cli download rhasspy/piper-voices --include 'en/en_US/amy/**/*' 'voices.json'
+# Download specific quality of a specific voice (~2 seconds)
+docker exec -it faster-whisper-server huggingface-cli download rhasspy/piper-voices --include 'en/en_US/amy/medium/*' 'voices.json'
+```
+
+!!! note
+
+    You can find audio samples of all the available voices [here](https://rhasspy.github.io/piper-samples/)
+
+## Curl
+
+```bash
+# Generate speech from text using the default values (response_format="mp3", speed=1.0, voice="en_US-amy-medium", etc.)
+curl http://localhost:8000/v1/audio/speech --header "Content-Type: application/json" --data '{"input": "Hello World!"}' --output audio.mp3
+# Specifying the output format
+curl http://localhost:8000/v1/audio/speech --header "Content-Type: application/json" --data '{"input": "Hello World!", "response_format": "wav"}' --output audio.wav
+# Specifying the audio speed
+curl http://localhost:8000/v1/audio/speech --header "Content-Type: application/json" --data '{"input": "Hello World!", "speed": 2.0}' --output audio.mp3
+
+# List available (downloaded) voices
+curl http://localhost:8000/v1/audio/speech/voices
+# List just the voice names
+curl http://localhost:8000/v1/audio/speech/voices | jq --raw-output '.[] | .voice'
+# List just the voices in your language
+curl --silent http://localhost:8000/v1/audio/speech/voices | jq --raw-output '.[] | select(.voice | startswith("en")) | .voice'
+
+curl http://localhost:8000/v1/audio/speech --header "Content-Type: application/json" --data '{"input": "Hello World!", "voice": "en_US-ryan-high"}' --output audio.mp3
+```
+
+## Python
+
+=== "httpx"
+
+    ```python
+    from pathlib import Path
+
+    import httpx
+
+    client = httpx.Client(base_url="http://localhost:8000/")
+    res = client.post(
+        "v1/audio/speech",
+        json={
+            "model": "piper",
+            "voice": "en_US-amy-medium",
+            "input": "Hello, world!",
+            "response_format": "mp3",
+            "speed": 1,
+        },
+    ).raise_for_status()
+    with Path("output.mp3").open("wb") as f:
+        f.write(res.read())
+    ```
+
+## OpenAI SDKs
+
+=== "Python"
+
+    ```python
+    from pathlib import Path
+
+    from openai import OpenAI
+
+    openai = OpenAI(base_url="http://localhost:8000/v1", api_key="cant-be-empty")
+    res = openai.audio.speech.create(
+        model="piper",
+        voice="en_US-amy-medium",  # pyright: ignore[reportArgumentType]
+        input="Hello, world!",
+        response_format="mp3",
+        speed=1,
+    )
+    with Path("output.mp3").open("wb") as f:
+        f.write(res.response.read())
+    ```
+
+=== "Other"
+
+    See [OpenAI libraries](https://platform.openai.com/docs/libraries)
mkdocs.yml
--- mkdocs.yml
+++ mkdocs.yml
@@ -1,6 +1,10 @@
 # yaml-language-server: $schema=https://squidfunk.github.io/mkdocs-material/schema.json
+# https://www.mkdocs.org/user-guide/configuration/#configuration
 site_name: Faster Whisper Server Documentation
-repo_url: https://github.com/fedirz/faster-whisper-server
+site_url: https://fedirz.github.io/faster-whisper-server/
+repo_url: https://github.com/fedirz/faster-whisper-server/
+edit_uri: edit/master/docs/
+docs_dir: docs
 theme:
   language: en
   name: material
@@ -9,13 +13,15 @@
     primary: deep orange
     accent: indigo
   features:
-    - content.tabs.link
-    - content.code.copy
+    # https://squidfunk.github.io/mkdocs-material/setup/setting-up-navigation/
     - navigation.instant
     - navigation.instant.progress
     - navigation.instant.prefetch
+    # https://squidfunk.github.io/mkdocs-material/setup/setting-up-site-search/
     - search.highlight
     - search.share
+    - content.tabs.link
+    - content.code.copy
 plugins:
   # https://github.com/bharel/mkdocs-render-swagger-plugin
   - render_swagger
@@ -23,9 +29,13 @@
       default_handler: python
 nav:
   - Introduction: introduction.md
+  - Capabilities / Usage:
+      - Speech-to-Text: usage/speech-to-text.md
+      - Text-to-Speech: usage/text-to-speech.md
+      - Live Transcription (using WebSockets): usage/live-transcription.md
+      - Open WebUI Integration: usage/open-webui-integration.md
   - Installation: installation.md
   - Configuration: configuration.md
-  - Usage: usage.md
   - API: api.md
 markdown_extensions:
   - admonition
@@ -34,3 +44,4 @@
       alternate_style: true
   # https://github.com/mkdocs/mkdocs/issues/545
   - mdx_truly_sane_lists
+# TODO: https://github.com/oprypin/markdown-callouts