Auto-Labeling & Annotations
Every file scored by FirstHandAPI’s AI ensemble receives structured annotation metadata alongside its quality score, generated automatically. No separate labeling pipeline is needed — annotations are included in the file response at GET /v1/jobs/:id/files.
How It Works
When a submission is scored by the AI ensemble (Claude Vision + Whisper + ffprobe), the system generates annotation metadata in the same pass. Annotations are content-type-aware and include:
- Images: Object detection with confidence, OCR with per-word confidence, scene classification, face count, quality metrics, safety scores
- Audio: Transcription with word-level timestamps, speaker demographics, audio metadata (sample rate, codec, SNR)
- Video: Scene segmentation with precise timestamps, object tracking with confidence, video metadata (FPS, resolution, codecs)
Every annotation includes an annotation_model field (e.g., "claude-sonnet-4-20250514") for reproducibility tracking.
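If you track reproducibility across model versions, one approach is to group files by this field before comparing annotation statistics. A minimal sketch in TypeScript; the MinimalFile shape is a simplified stand-in for the real file response type:
// Group files by the model version that produced their annotations.
// MinimalFile is a simplified, assumed shape; see the full field tables below.
interface MinimalFile {
  id: string;
  annotations: { annotation_model: string } | null;
}

function groupByAnnotationModel(files: MinimalFile[]): Map<string, MinimalFile[]> {
  const groups = new Map<string, MinimalFile[]>();
  for (const file of files) {
    const model = file.annotations?.annotation_model ?? 'unannotated';
    const bucket = groups.get(model) ?? [];
    bucket.push(file);
    groups.set(model, bucket);
  }
  return groups;
}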
When Annotations Are Not Generated
- Policy violations (1-star): AI-generated content, content policy violations, corrupted files
- Stock photo auto-rejects: Files caught by reverse image search before Claude scoring
- Resolution pre-check failures: Images/video below min_width or min_height (rejected before scoring)
- Scoring system errors: If scoring fails entirely
In these cases, annotations will be null.
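Because annotations can be null in these cases, downstream code should check for their presence before reading annotation fields. A minimal sketch; the ScoredFile shape is a simplified assumption about the file response:
// Split a file list into annotated and unannotated files before further processing.
// ScoredFile is a simplified, assumed shape of the real file response type.
interface ScoredFile {
  id: string;
  annotations: { type: 'image' | 'audio' | 'video' } | null;
}

function splitByAnnotationPresence(files: ScoredFile[]) {
  const annotated: ScoredFile[] = [];
  const unannotated: ScoredFile[] = []; // rejected pre-scoring, policy violations, or scoring errors
  for (const file of files) {
    (file.annotations ? annotated : unannotated).push(file);
  }
  return { annotated, unannotated };
}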
Whisper Hallucination Handling
For audio and video files, Whisper transcription is protected against hallucination — silent or ambient audio that Whisper misinterprets as speech. When hallucination is detected (via per-segment no_speech_prob, avg_logprob, and compression_ratio thresholds), the transcript is suppressed and the file is scored as ambient audio. See Trust & Safety for details.
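The exact server-side cutoffs are not published here; the sketch below only illustrates how the named Whisper signals are typically combined to flag a suspect segment, using Whisper’s commonly cited default thresholds as placeholder values:
// Illustrative hallucination check over Whisper's per-segment signals.
// Threshold values are assumptions for this sketch, not FirstHandAPI's actual cutoffs.
interface WhisperSegment {
  no_speech_prob: number;    // probability the segment contains no speech
  avg_logprob: number;       // mean token log-probability
  compression_ratio: number; // text compression ratio; very high values suggest repetitive looping
}

function looksHallucinated(seg: WhisperSegment): boolean {
  return (
    seg.no_speech_prob > 0.6 ||   // Whisper itself thinks there is no speech
    seg.avg_logprob < -1.0 ||     // tokens were decoded with very low confidence
    seg.compression_ratio > 2.4   // repetitive "looping" text
  );
}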
Image EXIF Normalization
All images are automatically EXIF-rotated before scoring and delivery. The corrected image replaces the original in S3, so download_url always returns a correctly-oriented image. No client-side rotation is needed.
Image Annotations
| Field | Type | Description |
|---|---|---|
type | "image" | Content type discriminant |
objects | object[] | Detected objects with label, confidence (0.0-1.0), position, coverage |
scene.setting | string | Scene description (e.g., "indoor office") |
scene.indoor | boolean | Whether the scene is indoors |
scene.confidence | number | Scene classification confidence (0.0-1.0) |
text_extraction | object | null | OCR with full_text and per-word {text, confidence} array |
color_palette | string[] | 3-6 dominant hex color codes |
composition | string | Composition description |
face_count | integer | Number of human faces detected |
orientation | string | "landscape", "portrait", or "square" |
quality_metrics.blur_score | number | 0.0 (sharp) to 1.0 (very blurry) |
quality_metrics.exposure | string | "underexposed", "normal", or "overexposed" |
quality_metrics.noise_level | string | "low", "moderate", or "high" |
safety.nsfw_score | number | 0.0 (safe) to 1.0 (explicit) |
safety.violence_score | number | 0.0 (none) to 1.0 (graphic) |
safety.pii_detected | boolean | True if visible personal info (ID cards, documents with names) |
annotation_model | string | Model version (e.g., "claude-sonnet-4-20250514") |
{
"type": "image",
"objects": [
{ "label": "mailbox", "confidence": 0.95, "position": "center", "approximate_coverage": "25% of frame" },
{ "label": "house", "confidence": 0.88, "position": "upper-right", "approximate_coverage": "35% of frame" }
],
"scene": { "setting": "outdoor residential neighborhood", "indoor": false, "confidence": 0.92 },
"text_extraction": {
"full_text": "1234 Oak Street",
"words": [
{ "text": "1234", "confidence": 0.97 },
{ "text": "Oak", "confidence": 0.95 },
{ "text": "Street", "confidence": 0.93 }
]
},
"color_palette": ["#8B4513", "#228B22", "#87CEEB"],
"composition": "centered mailbox with residential background, natural daylight",
"face_count": 0,
"orientation": "landscape",
"quality_metrics": { "blur_score": 0.08, "exposure": "normal", "noise_level": "low" },
"safety": { "nsfw_score": 0.0, "violence_score": 0.0, "pii_detected": false },
"annotation_model": "claude-sonnet-4-20250514"
}
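These fields make simple dataset-curation filters straightforward. A minimal sketch, assuming the annotation object matches the table above; the threshold values are arbitrary examples, not recommendations:
// Keep sharp, safe, well-exposed images for a downstream training set.
// Threshold values here are arbitrary examples, not recommendations.
interface ImageAnnotations {
  type: 'image';
  face_count: number;
  quality_metrics: { blur_score: number; exposure: string; noise_level: string };
  safety: { nsfw_score: number; violence_score: number; pii_detected: boolean };
}

function passesCurationFilter(a: ImageAnnotations): boolean {
  return (
    a.quality_metrics.blur_score < 0.3 &&
    a.quality_metrics.exposure === 'normal' &&
    a.safety.nsfw_score < 0.1 &&
    a.safety.violence_score < 0.1 &&
    !a.safety.pii_detected
  );
}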
Audio Annotations
Audio annotations include technical metadata from ffprobe (sample rate, codec, channels) alongside AI-generated content analysis.
| Field | Type | Description |
|---|---|---|
type | "audio" | Content type discriminant |
duration_seconds | number | null | Actual duration from ffprobe or Whisper |
sample_rate | number | null | Sample rate in Hz (e.g., 44100) |
bitrate_kbps | number | null | Bitrate in kbps (e.g., 128) |
codec | string | null | Audio codec (e.g., "aac", "mp3") |
channels | number | null | 1 = mono, 2 = stereo |
speaker_count | integer | Estimated number of distinct speakers |
language | string | ISO 639-1 code (e.g., "en") — from Whisper when available |
topics | string[] | Topic classification labels |
keywords | string[] | Extracted keywords |
noise_level | string | "silent", "low", "moderate", or "high" |
snr_db | number | null | Estimated signal-to-noise ratio in dB |
emotion_tone | object | null | Emotion/tone classification (null for non-speech) |
room_acoustics | object | Recording environment estimation |
background_sounds | object[] | Identified background sounds with confidence |
speaker_demographics | object | null | Estimated speaker characteristics (see below) |
transcript_segments | object[] | null | Segment-level transcript with timestamps and confidence |
word_timestamps | object[] | null | Word-level timestamps from Whisper (see below) |
annotation_model | string | Model version |
Emotion & Tone
| Field | Type | Description |
|---|---|---|
| emotion_tone.primary | string | "neutral", "happy", "sad", "angry", "fearful", "surprised", "disgusted", "calm", "excited" |
| emotion_tone.confidence | number | 0.0-1.0 |
| emotion_tone.secondary | string | Optional secondary tone |
Set to null for non-speech audio (ambient, music-only).
Room Acoustics
| Field | Type | Description |
|---|---|---|
| room_acoustics.estimated_room_size | string | "small", "medium", "large", "outdoor" |
| room_acoustics.reverb_level | string | "dry", "slight", "moderate", "heavy" |
| room_acoustics.estimated_rt60_seconds | number \| null | Estimated reverb decay time. < 0.3 = treated/small, 0.3-0.6 = normal, > 1.0 = very reverberant |
Background Sounds
Each entry in background_sounds identifies a distinct non-speech sound:
| Field | Type | Description |
|---|---|---|
| label | string | e.g., "HVAC", "traffic", "fan", "birds", "keyboard", "music" |
| confidence | number | 0.0-1.0 |
| prominence | string | "faint", "noticeable", "dominant" |
Transcript Segment Confidence
Each transcript segment includes a confidence score (0.0-1.0) derived from Whisper’s per-segment avg_logprob. Use it to filter low-confidence transcriptions at scale:
{
"text": "The battery lasts about 8 hours.",
"start_seconds": 0.0,
"end_seconds": 2.34,
"confidence": 0.82
}
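For example, a minimal sketch that drops segments below a chosen confidence floor; the 0.6 cutoff is an arbitrary example, not a recommended value:
// Keep only transcript segments above a confidence floor.
// The 0.6 threshold is an arbitrary example; tune it for your dataset.
interface TranscriptSegment {
  text: string;
  start_seconds: number;
  end_seconds: number;
  confidence: number;
}

function filterSegments(segments: TranscriptSegment[] | null, minConfidence = 0.6): TranscriptSegment[] {
  return (segments ?? []).filter((s) => s.confidence >= minConfidence);
}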
Speaker Demographics (Estimated)
| Field | Values |
|---|---|
| estimated_gender | "male", "female", "unknown" |
| estimated_age_range | "child", "young_adult", "adult", "senior", "unknown" |
| accent_region | e.g., "us_general", "british", "indian", "unknown" |
These are AI estimates from voice characteristics, not verified demographics.
Word-Level Timestamps
From Whisper’s verbose_json output — precise per-word timing:
"word_timestamps": [
{ "word": "The", "start": 0.0, "end": 0.12 },
{ "word": "battery", "start": 0.12, "end": 0.56 },
{ "word": "lasts", "start": 0.56, "end": 0.89 }
]
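One common use is locating where a keyword is spoken so the surrounding audio can be clipped. A minimal sketch over the structure above:
// Find the time ranges where a keyword occurs, using Whisper word timestamps.
interface WordTimestamp {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

function findKeyword(words: WordTimestamp[], keyword: string): Array<{ start: number; end: number }> {
  const target = keyword.toLowerCase();
  return words
    .filter((w) => w.word.toLowerCase().replace(/[^a-z0-9]/g, '') === target)
    .map((w) => ({ start: w.start, end: w.end }));
}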
Full example:
{
"type": "audio",
"duration_seconds": 21.45,
"sample_rate": 44100,
"bitrate_kbps": 128,
"codec": "aac",
"channels": 1,
"speaker_count": 1,
"language": "en",
"topics": ["product review", "technology"],
"keywords": ["battery life", "screen quality"],
"noise_level": "low",
"snr_db": 32,
"emotion_tone": {
"primary": "neutral",
"confidence": 0.85,
"secondary": "calm"
},
"room_acoustics": {
"estimated_room_size": "medium",
"reverb_level": "slight",
"estimated_rt60_seconds": 0.4
},
"background_sounds": [
{ "label": "HVAC", "confidence": 0.75, "prominence": "faint" }
],
"speaker_demographics": {
"estimated_gender": "male",
"estimated_age_range": "adult",
"accent_region": "us_general"
},
"transcript_segments": [
{ "text": "The battery lasts about 8 hours.", "start_seconds": 0.0, "end_seconds": 2.34, "confidence": 0.82 },
{ "text": "Screen quality is excellent.", "start_seconds": 2.34, "end_seconds": 4.87, "confidence": 0.79 }
],
"word_timestamps": [
{ "word": "The", "start": 0.0, "end": 0.12 },
{ "word": "battery", "start": 0.12, "end": 0.56 },
{ "word": "lasts", "start": 0.56, "end": 0.89 },
{ "word": "about", "start": 0.89, "end": 1.15 },
{ "word": "8", "start": 1.15, "end": 1.32 },
{ "word": "hours", "start": 1.32, "end": 1.78 }
],
"annotation_model": "claude-sonnet-4-20250514"
}
Video Annotations
Video annotations combine visual analysis from Claude Vision, audio analysis from Whisper, and technical metadata from ffprobe.
| Field | Type | Description |
|---|---|---|
type | "video" | Content type discriminant |
duration_seconds | number | null | Duration from ffprobe |
width | number | null | Resolution width in pixels |
height | number | null | Resolution height in pixels |
fps | number | null | Frames per second |
video_codec | string | null | Video codec (e.g., "h264", "hevc") |
audio_codec | string | null | Audio codec (e.g., "aac") |
scenes | object[] | Scene segmentation with precise float timestamps and confidence |
actions | string[] | Action recognition labels |
object_tracking | object[] | Objects tracked across scenes with confidence |
keyframe_descriptions | string[] | Description per extracted keyframe |
face_count | integer | Total faces detected across keyframes |
speaker_count | integer | null | Estimated speakers (null if no audio) |
noise_level | string | null | Audio noise level (null if no audio) |
quality_metrics.blur_score | number | 0.0 (sharp) to 1.0 (blurry) — averaged across keyframes |
quality_metrics.exposure | string | "underexposed", "normal", or "overexposed" |
quality_metrics.stability | string | "stable", "moderate", or "shaky" |
safety.nsfw_score | number | 0.0-1.0 for most sensitive frame |
safety.violence_score | number | 0.0-1.0 for most sensitive frame |
transcript_segments | object[] | null | Audio transcript with precise timestamps |
word_timestamps | object[] | null | Word-level timestamps from Whisper |
annotation_model | string | Model version |
{
"type": "video",
"duration_seconds": 15.2,
"width": 1920,
"height": 1080,
"fps": 30,
"video_codec": "h264",
"audio_codec": "aac",
"scenes": [
{ "description": "Entrance area with door", "start_seconds": 0.0, "end_seconds": 5.2, "confidence": 0.88 },
{ "description": "Living room with furniture", "start_seconds": 5.2, "end_seconds": 10.5, "confidence": 0.92 },
{ "description": "Kitchen area", "start_seconds": 10.5, "end_seconds": 15.2, "confidence": 0.85 }
],
"actions": ["walking", "panning camera", "moving through rooms"],
"object_tracking": [
{ "label": "furniture", "confidence": 0.90, "appears_in_scenes": [0, 1, 2] },
{ "label": "television", "confidence": 0.88, "appears_in_scenes": [1] }
],
"keyframe_descriptions": ["Doorway with wooden frame", "Bright living room with windows", "Kitchen counter with appliances"],
"face_count": 0,
"speaker_count": 1,
"noise_level": "low",
"quality_metrics": { "blur_score": 0.1, "exposure": "normal", "stability": "stable" },
"safety": { "nsfw_score": 0.0, "violence_score": 0.0 },
"transcript_segments": [
{ "text": "Here's a walkthrough of the apartment.", "start_seconds": 1.0, "end_seconds": 3.2 }
],
"word_timestamps": [
{ "word": "Here's", "start": 1.0, "end": 1.25 },
{ "word": "a", "start": 1.25, "end": 1.35 },
{ "word": "walkthrough", "start": 1.35, "end": 1.98 }
],
"annotation_model": "claude-sonnet-4-20250514"
}
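Scene boundaries are useful for cutting a downloaded video into per-scene clips. A minimal sketch that turns the scenes array into ffmpeg trim commands; the input and output paths are placeholders, and running ffmpeg on the downloaded file is your own pipeline step, not part of the API:
// Build ffmpeg commands that cut a downloaded video into one clip per annotated scene.
// inputPath and outputPrefix are placeholders; ffmpeg execution happens in your own pipeline.
interface Scene {
  description: string;
  start_seconds: number;
  end_seconds: number;
  confidence: number;
}

function sceneClipCommands(scenes: Scene[], inputPath: string, outputPrefix: string): string[] {
  return scenes.map((scene, i) =>
    `ffmpeg -i ${inputPath} -ss ${scene.start_seconds} -to ${scene.end_seconds} -c copy ${outputPrefix}_scene${i}.mp4`
  );
}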
File-Level Metadata
In addition to annotations, each file response includes:
| Field | Type | Description |
|---|---|---|
| duration_seconds | number \| null | Actual duration (populated for audio/video via ffprobe) |
| width | number \| null | Image/video width in pixels |
| height | number \| null | Image/video height in pixels |
| content_hash | string \| null | SHA-256 hash of file bytes (for dedup + integrity verification) |
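Because content_hash is a SHA-256 over the file bytes, exact-duplicate detection reduces to a set lookup. A minimal sketch:
// Drop files whose content_hash has already been seen (exact-duplicate detection).
interface HashedFile {
  id: string;
  content_hash: string | null;
}

function dedupeByContentHash(files: HashedFile[]): HashedFile[] {
  const seen = new Set<string>();
  return files.filter((f) => {
    if (!f.content_hash) return true;           // keep files without a hash
    if (seen.has(f.content_hash)) return false; // exact duplicate
    seen.add(f.content_hash);
    return true;
  });
}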
Accessing Annotations
Via REST API
curl -H "Authorization: Bearer fh_live_..." \
    https://api.firsthandapi.com/v1/jobs/job_01JQ.../files
Via TypeScript SDK
const files = await client.getJobFiles('job_01JQ...');
for (const file of files.data) {
if (file.annotations?.type === 'image') {
console.log('Objects:', file.annotations.objects.map(o => `${o.label} (${o.confidence})`));
console.log('OCR:', file.annotations.text_extraction?.full_text);
console.log('Faces:', file.annotations.face_count);
console.log('NSFW score:', file.annotations.safety.nsfw_score);
}
if (file.annotations?.type === 'audio') {
console.log('Duration:', file.annotations.duration_seconds, 's');
console.log('Sample rate:', file.annotations.sample_rate, 'Hz');
console.log('Speaker:', file.annotations.speaker_demographics);
console.log('Words:', file.annotations.word_timestamps?.length);
}
}
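Inside the same loop, a video branch could look like the sketch below, using fields from the video annotation table above:
// Video branch for the same loop (add inside the for...of block above).
if (file.annotations?.type === 'video') {
  console.log('Resolution:', file.annotations.width, 'x', file.annotations.height);
  console.log('Scenes:', file.annotations.scenes.length);
  console.log('Actions:', file.annotations.actions.join(', '));
  console.log('Stability:', file.annotations.quality_metrics.stability);
}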
Video Keyframes
For video files, 3 representative keyframe images (at 10%, 50%, 90% of duration) are extracted and stored. The file response includes a keyframes array with pre-signed download URLs:
"keyframes": [
{ "index": 0, "download_url": "https://...", "download_url_expires_at": "2026-04-10T..." },
{ "index": 1, "download_url": "https://...", "download_url_expires_at": "2026-04-10T..." },
{ "index": 2, "download_url": "https://...", "download_url_expires_at": "2026-04-10T..." }
]
Use keyframes for video thumbnails, video-text training pairs, and previews without downloading the full video.
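A minimal sketch that downloads the middle keyframe as a thumbnail, assuming the keyframes shape shown above and a Node 18+ runtime with global fetch:
// Download the middle keyframe (index 1) for use as a thumbnail.
// Assumes Node 18+ (global fetch) and the keyframes shape shown above.
import { writeFile } from 'node:fs/promises';

interface Keyframe {
  index: number;
  download_url: string;
  download_url_expires_at: string;
}

async function saveThumbnail(keyframes: Keyframe[], outPath: string): Promise<void> {
  const middle = keyframes.find((k) => k.index === 1) ?? keyframes[0];
  if (!middle) return; // no keyframes were extracted for this video
  const res = await fetch(middle.download_url);
  const bytes = Buffer.from(await res.arrayBuffer());
  await writeFile(outPath, bytes);
}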
Provenance Metadata
Every file response includes provenance metadata alongside annotations:
| Field | Type | Description |
|---|---|---|
| content_hash | string | SHA-256 hash for dedup + integrity verification |
| captured_at | string \| null | EXIF capture timestamp (when taken, not uploaded) |
| device_info | object \| null | {device_model, device_os, app_version} from worker’s device |
| worker_region | string \| null | Derived from GPS (e.g., "US-NY", "US-CA-LA", "GB") |
These enable geographic diversity analysis, device-specific quality filtering, and dataset provenance tracking.
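For example, a minimal sketch that counts files per worker_region to gauge geographic coverage; the file shape below is a simplified assumption about the response:
// Count files per worker_region to gauge the geographic spread of a dataset.
// ProvenanceFile is a simplified, assumed shape of the real file response.
interface ProvenanceFile {
  id: string;
  worker_region: string | null;
}

function countByRegion(files: ProvenanceFile[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const file of files) {
    const region = file.worker_region ?? 'unknown';
    counts[region] = (counts[region] ?? 0) + 1;
  }
  return counts;
}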
Webhook: submission.scored
The submission.scored webhook event fires for every scored submission (approved AND rejected), including the full annotation payload. This enables:
- Rejection analytics (why are files being rejected?)
- Active learning (flag low-confidence annotations for human review)
- Real-time scoring dashboards
See Webhook Handling for setup.
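A minimal Express-style receiver sketch for rejection analytics; the event body fields used here (type, data.status, data.annotations) are assumptions, so consult Webhook Handling for the real payload shape and signature verification:
// Sketch of a webhook receiver that tallies rejections by content type.
// The payload field names below are assumptions; see Webhook Handling for the real shape.
import express from 'express';

const app = express();
app.use(express.json());

const rejectionCounts: Record<string, number> = {};

app.post('/webhooks/firsthand', (req, res) => {
  const event = req.body;
  if (event?.type === 'submission.scored' && event?.data?.status === 'rejected') {
    const contentType = event?.data?.annotations?.type ?? 'unknown';
    rejectionCounts[contentType] = (rejectionCounts[contentType] ?? 0) + 1;
  }
  res.sendStatus(200); // acknowledge promptly; do heavy work asynchronously
});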
Limitations
- Annotations are best-effort — accuracy varies with content quality and complexity
- Confidence scores are Claude’s self-assessed certainty, not calibrated probabilities
- Bounding boxes are described spatially (e.g., “center”, “upper-left”) rather than as pixel coordinates (COCO format). For precise bounding boxes, use a dedicated labeling service
- Speaker demographics are estimated from voice characteristics — not verified identity data
- Word-level confidence is not available (Whisper API does not expose per-word probability). Word timestamps are precise but lack confidence scores
- Audio fingerprinting (Chromaprint) is not available — use content_hash for dedup instead
- Transcript segments depend on Whisper availability. When the OpenAI API key is not configured, segments will be null