
Auto-Labeling & Annotations

Every file scored by FirstHandAPI’s AI ensemble comes back with structured annotation metadata alongside its quality score. No separate labeling pipeline is needed; annotations are included in the file response at GET /v1/jobs/:id/files.

How It Works

When a submission is scored by the AI ensemble (Claude Vision + Whisper + ffprobe), the system generates annotation metadata in the same pass. Annotations are content-type-aware and include:

  • Images: Object detection with confidence, OCR with per-word confidence, scene classification, face count, quality metrics, safety scores
  • Audio: Transcription with word-level timestamps, speaker demographics, audio metadata (sample rate, codec, SNR)
  • Video: Scene segmentation with precise timestamps, object tracking with confidence, video metadata (FPS, resolution, codecs)

Every annotation includes an annotation_model field (e.g., "claude-sonnet-4-20250514") for reproducibility tracking.
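
In the TypeScript SDK, the payload narrows cleanly on that type field. A minimal, abbreviated type sketch for illustration (field lists come from the tables below; the SDK may ship its own, fuller definitions):

// Abbreviated, illustrative shapes; see the field tables below for the
// full set of properties per content type.
type ImageAnnotations = {
  type: 'image';
  objects: { label: string; confidence: number }[];
  face_count: number;
  annotation_model: string;
};

type AudioAnnotations = {
  type: 'audio';
  duration_seconds: number | null;
  transcript_segments:
    { text: string; start_seconds: number; end_seconds: number; confidence: number }[] | null;
  annotation_model: string;
};

type VideoAnnotations = {
  type: 'video';
  scenes: { description: string; start_seconds: number; end_seconds: number; confidence: number }[];
  annotation_model: string;
};

// Narrow on `type` before touching content-specific fields.
type Annotations = ImageAnnotations | AudioAnnotations | VideoAnnotations;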

When Annotations Are Not Generated

  • Policy violations (1-star): AI-generated content, content policy violations, corrupted files
  • Stock photo auto-rejects: Files caught by reverse image search before Claude scoring
  • Resolution pre-check failures: Images/video below min_width or min_height (rejected before scoring)
  • Scoring system errors: If scoring fails entirely

In these cases, annotations will be null.

Whisper Hallucination Handling

For audio and video files, Whisper transcription is protected against hallucination — silent or ambient audio that Whisper misinterprets as speech. When hallucination is detected (via per-segment no_speech_prob, avg_logprob, and compression_ratio thresholds), the transcript is suppressed and the file is scored as ambient audio. See Trust & Safety for details.
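
The production thresholds are internal, but the shape of the check is roughly this (illustrative TypeScript; the cutoff values are examples, not FirstHandAPI's actual settings):

// Fields below come from Whisper's verbose_json segment output.
interface WhisperSegment {
  no_speech_prob: number;    // probability the segment contains no speech
  avg_logprob: number;       // mean token log-probability for the segment
  compression_ratio: number; // high values suggest repetitive/looping text
}

// Illustrative thresholds only; the production values are internal.
function looksHallucinated(seg: WhisperSegment): boolean {
  return seg.no_speech_prob > 0.6
    && (seg.avg_logprob < -1.0 || seg.compression_ratio > 2.4);
}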

Image EXIF Normalization

All images are automatically EXIF-rotated before scoring and delivery. The corrected image replaces the original in S3, so download_url always returns a correctly-oriented image. No client-side rotation is needed.

Image Annotations


| Field | Type | Description |
|---|---|---|
| type | "image" | Content type discriminant |
| objects | object[] | Detected objects with label, confidence (0.0-1.0), position, coverage |
| scene.setting | string | Scene description (e.g., "indoor office") |
| scene.indoor | boolean | Whether the scene is indoors |
| scene.confidence | number | Scene classification confidence (0.0-1.0) |
| text_extraction | object \| null | OCR with full_text and a per-word {text, confidence} array |
| color_palette | string[] | 3-6 dominant hex color codes |
| composition | string | Composition description |
| face_count | integer | Number of human faces detected |
| orientation | string | "landscape", "portrait", or "square" |
| quality_metrics.blur_score | number | 0.0 (sharp) to 1.0 (very blurry) |
| quality_metrics.exposure | string | "underexposed", "normal", or "overexposed" |
| quality_metrics.noise_level | string | "low", "moderate", or "high" |
| safety.nsfw_score | number | 0.0 (safe) to 1.0 (explicit) |
| safety.violence_score | number | 0.0 (none) to 1.0 (graphic) |
| safety.pii_detected | boolean | True if visible personal info (ID cards, documents with names) |
| annotation_model | string | Model version (e.g., "claude-sonnet-4-20250514") |

Example:
{
  "type": "image",
  "objects": [
    { "label": "mailbox", "confidence": 0.95, "position": "center", "approximate_coverage": "25% of frame" },
    { "label": "house", "confidence": 0.88, "position": "upper-right", "approximate_coverage": "35% of frame" }
  ],
  "scene": { "setting": "outdoor residential neighborhood", "indoor": false, "confidence": 0.92 },
  "text_extraction": {
    "full_text": "1234 Oak Street",
    "words": [
      { "text": "1234", "confidence": 0.97 },
      { "text": "Oak", "confidence": 0.95 },
      { "text": "Street", "confidence": 0.93 }
    ]
  },
  "color_palette": ["#8B4513", "#228B22", "#87CEEB"],
  "composition": "centered mailbox with residential background, natural daylight",
  "face_count": 0,
  "orientation": "landscape",
  "quality_metrics": { "blur_score": 0.08, "exposure": "normal", "noise_level": "low" },
  "safety": { "nsfw_score": 0.0, "violence_score": 0.0, "pii_detected": false },
  "annotation_model": "claude-sonnet-4-20250514"
}
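
As an example of putting these fields to work, a dataset build might keep only sharp, safe images. A sketch using the SDK client shown in Accessing Annotations below (the 0.3 and 0.01 cutoffs are arbitrary examples, not recommendations):

const files = await client.getJobFiles('job_01JQ...');

// Keep images that are sharp, clearly safe, and free of visible PII.
const cleanImages = files.data.filter((file) =>
  file.annotations?.type === 'image' &&
  file.annotations.quality_metrics.blur_score < 0.3 &&
  file.annotations.safety.nsfw_score < 0.01 &&
  !file.annotations.safety.pii_detected
);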

Audio Annotations

Audio annotations include technical metadata from ffprobe (sample rate, codec, channels) alongside AI-generated content analysis.

| Field | Type | Description |
|---|---|---|
| type | "audio" | Content type discriminant |
| duration_seconds | number \| null | Actual duration from ffprobe or Whisper |
| sample_rate | number \| null | Sample rate in Hz (e.g., 44100) |
| bitrate_kbps | number \| null | Bitrate in kbps (e.g., 128) |
| codec | string \| null | Audio codec (e.g., "aac", "mp3") |
| channels | number \| null | 1 = mono, 2 = stereo |
| speaker_count | integer | Estimated number of distinct speakers |
| language | string | ISO 639-1 code (e.g., "en"); from Whisper when available |
| topics | string[] | Topic classification labels |
| keywords | string[] | Extracted keywords |
| noise_level | string | "silent", "low", "moderate", or "high" |
| snr_db | number \| null | Estimated signal-to-noise ratio in dB |
| emotion_tone | object \| null | Emotion/tone classification (null for non-speech) |
| room_acoustics | object | Recording environment estimation |
| background_sounds | object[] | Identified background sounds with confidence |
| speaker_demographics | object \| null | Estimated speaker characteristics (see below) |
| transcript_segments | object[] \| null | Segment-level transcript with timestamps and confidence |
| word_timestamps | object[] \| null | Word-level timestamps from Whisper (see below) |
| annotation_model | string | Model version |

Emotion & Tone

| Field | Type | Description |
|---|---|---|
| emotion_tone.primary | string | "neutral", "happy", "sad", "angry", "fearful", "surprised", "disgusted", "calm", or "excited" |
| emotion_tone.confidence | number | 0.0-1.0 |
| emotion_tone.secondary | string | Optional secondary tone |

Set to null for non-speech audio (ambient, music-only).

Room Acoustics

| Field | Type | Description |
|---|---|---|
| room_acoustics.estimated_room_size | string | "small", "medium", "large", or "outdoor" |
| room_acoustics.reverb_level | string | "dry", "slight", "moderate", or "heavy" |
| room_acoustics.estimated_rt60_seconds | number \| null | Estimated reverb decay time in seconds: < 0.3 = treated/small, 0.3-0.6 = normal, > 1.0 = very reverberant |

Background Sounds

Each entry in background_sounds identifies a distinct non-speech sound:

| Field | Type | Description |
|---|---|---|
| label | string | e.g., "HVAC", "traffic", "fan", "birds", "keyboard", "music" |
| confidence | number | 0.0-1.0 |
| prominence | string | "faint", "noticeable", or "dominant" |

Transcript Segment Confidence

Each transcript segment includes a confidence score (0.0-1.0) derived from Whisper’s per-segment avg_logprob. Use it to filter low-confidence transcriptions at scale:

{
  "text": "The battery lasts about 8 hours.",
  "start_seconds": 0.0,
  "end_seconds": 2.34,
  "confidence": 0.82
}
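
For example (a sketch assuming an audio annotations object as documented above; the 0.6 cutoff is illustrative, so calibrate it against your own data):

// Keep only segments Whisper was reasonably confident about.
const reliable = (annotations.transcript_segments ?? [])
  .filter((seg) => seg.confidence >= 0.6);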

Speaker Demographics (Estimated)

| Field | Values |
|---|---|
| estimated_gender | "male", "female", "unknown" |
| estimated_age_range | "child", "young_adult", "adult", "senior", "unknown" |
| accent_region | e.g., "us_general", "british", "indian", "unknown" |

These are AI estimates from voice characteristics, not verified demographics.

Word-Level Timestamps

From Whisper’s verbose_json output — precise per-word timing:

"word_timestamps": [
  { "word": "The", "start": 0.0, "end": 0.12 },
  { "word": "battery", "start": 0.12, "end": 0.56 },
  { "word": "lasts", "start": 0.56, "end": 0.89 }
]
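
One common use is caption alignment. A sketch that groups words into fixed-size caption chunks, assuming the word_timestamps shape above:

// Group word timestamps into caption chunks of at most `maxWords` words.
type Word = { word: string; start: number; end: number };

function toCaptions(words: Word[], maxWords = 7) {
  const chunks: { text: string; start: number; end: number }[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    const slice = words.slice(i, i + maxWords);
    chunks.push({
      text: slice.map((w) => w.word).join(' '),
      start: slice[0].start,
      end: slice[slice.length - 1].end,
    });
  }
  return chunks;
}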

Full example:

{
  "type": "audio",
  "duration_seconds": 21.45,
  "sample_rate": 44100,
  "bitrate_kbps": 128,
  "codec": "aac",
  "channels": 1,
  "speaker_count": 1,
  "language": "en",
  "topics": ["product review", "technology"],
  "keywords": ["battery life", "screen quality"],
  "noise_level": "low",
  "snr_db": 32,
  "emotion_tone": {
    "primary": "neutral",
    "confidence": 0.85,
    "secondary": "calm"
  },
  "room_acoustics": {
    "estimated_room_size": "medium",
    "reverb_level": "slight",
    "estimated_rt60_seconds": 0.4
  },
  "background_sounds": [
    { "label": "HVAC", "confidence": 0.75, "prominence": "faint" }
  ],
  "speaker_demographics": {
    "estimated_gender": "male",
    "estimated_age_range": "adult",
    "accent_region": "us_general"
  },
  "transcript_segments": [
    { "text": "The battery lasts about 8 hours.", "start_seconds": 0.0, "end_seconds": 2.34, "confidence": 0.82 },
    { "text": "Screen quality is excellent.", "start_seconds": 2.34, "end_seconds": 4.87, "confidence": 0.79 }
  ],
  "word_timestamps": [
    { "word": "The", "start": 0.0, "end": 0.12 },
    { "word": "battery", "start": 0.12, "end": 0.56 },
    { "word": "lasts", "start": 0.56, "end": 0.89 },
    { "word": "about", "start": 0.89, "end": 1.15 },
    { "word": "8", "start": 1.15, "end": 1.32 },
    { "word": "hours", "start": 1.32, "end": 1.78 }
  ],
  "annotation_model": "claude-sonnet-4-20250514"
}

Video Annotations

Video annotations combine visual analysis from Claude Vision, audio analysis from Whisper, and technical metadata from ffprobe.

| Field | Type | Description |
|---|---|---|
| type | "video" | Content type discriminant |
| duration_seconds | number \| null | Duration from ffprobe |
| width | number \| null | Resolution width in pixels |
| height | number \| null | Resolution height in pixels |
| fps | number \| null | Frames per second |
| video_codec | string \| null | Video codec (e.g., "h264", "hevc") |
| audio_codec | string \| null | Audio codec (e.g., "aac") |
| scenes | object[] | Scene segmentation with precise float timestamps and confidence |
| actions | string[] | Action recognition labels |
| object_tracking | object[] | Objects tracked across scenes with confidence |
| keyframe_descriptions | string[] | Description per extracted keyframe |
| face_count | integer | Total faces detected across keyframes |
| speaker_count | integer \| null | Estimated speakers (null if no audio) |
| noise_level | string \| null | Audio noise level (null if no audio) |
| quality_metrics.blur_score | number | 0.0 (sharp) to 1.0 (blurry), averaged across keyframes |
| quality_metrics.exposure | string | "underexposed", "normal", or "overexposed" |
| quality_metrics.stability | string | "stable", "moderate", or "shaky" |
| safety.nsfw_score | number | 0.0-1.0 for the most sensitive frame |
| safety.violence_score | number | 0.0-1.0 for the most sensitive frame |
| transcript_segments | object[] \| null | Audio transcript with precise timestamps |
| word_timestamps | object[] \| null | Word-level timestamps from Whisper |
| annotation_model | string | Model version |

Example:
{
  "type": "video",
  "duration_seconds": 15.2,
  "width": 1920,
  "height": 1080,
  "fps": 30,
  "video_codec": "h264",
  "audio_codec": "aac",
  "scenes": [
    { "description": "Entrance area with door", "start_seconds": 0.0, "end_seconds": 5.2, "confidence": 0.88 },
    { "description": "Living room with furniture", "start_seconds": 5.2, "end_seconds": 10.5, "confidence": 0.92 },
    { "description": "Kitchen area", "start_seconds": 10.5, "end_seconds": 15.2, "confidence": 0.85 }
  ],
  "actions": ["walking", "panning camera", "moving through rooms"],
  "object_tracking": [
    { "label": "furniture", "confidence": 0.90, "appears_in_scenes": [0, 1, 2] },
    { "label": "television", "confidence": 0.88, "appears_in_scenes": [1] }
  ],
  "keyframe_descriptions": ["Doorway with wooden frame", "Bright living room with windows", "Kitchen counter with appliances"],
  "face_count": 0,
  "speaker_count": 1,
  "noise_level": "low",
  "quality_metrics": { "blur_score": 0.1, "exposure": "normal", "stability": "stable" },
  "safety": { "nsfw_score": 0.0, "violence_score": 0.0 },
  "transcript_segments": [
    { "text": "Here's a walkthrough of the apartment.", "start_seconds": 1.0, "end_seconds": 3.2 }
  ],
  "word_timestamps": [
    { "word": "Here's", "start": 1.0, "end": 1.25 },
    { "word": "a", "start": 1.25, "end": 1.35 },
    { "word": "walkthrough", "start": 1.35, "end": 1.98 }
  ],
  "annotation_model": "claude-sonnet-4-20250514"
}
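
A common use is pairing each scene with the transcript segments that overlap it in time, for example to build video-text training pairs. A sketch over the shapes above:

// Pair each scene with the transcript text spoken during it.
function sceneTranscripts(ann: {
  scenes: { description: string; start_seconds: number; end_seconds: number }[];
  transcript_segments: { text: string; start_seconds: number; end_seconds: number }[] | null;
}) {
  return ann.scenes.map((scene) => ({
    description: scene.description,
    transcript: (ann.transcript_segments ?? [])
      // A segment overlaps a scene when their time ranges intersect.
      .filter((seg) => seg.start_seconds < scene.end_seconds &&
                       seg.end_seconds > scene.start_seconds)
      .map((seg) => seg.text)
      .join(' '),
  }));
}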

File-Level Metadata

In addition to annotations, each file response includes:

| Field | Type | Description |
|---|---|---|
| duration_seconds | number \| null | Actual duration (populated for audio/video via ffprobe) |
| width | number \| null | Image/video width in pixels |
| height | number \| null | Image/video height in pixels |
| content_hash | string \| null | SHA-256 hash of file bytes (for dedup and integrity verification) |
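
content_hash makes cross-job deduplication straightforward. A sketch using the SDK client from the next section:

const files = await client.getJobFiles('job_01JQ...');

// Drop files whose exact bytes have already been seen.
const seen = new Set<string>();
const unique = files.data.filter((file) => {
  if (!file.content_hash) return true;           // no hash: keep, cannot dedup
  if (seen.has(file.content_hash)) return false; // identical bytes seen before
  seen.add(file.content_hash);
  return true;
});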

Accessing Annotations

Via REST API

curl -H "Authorization: Bearer fh_live_..." \
  https://api.firsthandapi.com/v1/jobs/job_01JQ.../files

Via TypeScript SDK

const files = await client.getJobFiles('job_01JQ...');
 
for (const file of files.data) {
  if (file.annotations?.type === 'image') {
    console.log('Objects:', file.annotations.objects.map(o => `${o.label} (${o.confidence})`));
    console.log('OCR:', file.annotations.text_extraction?.full_text);
    console.log('Faces:', file.annotations.face_count);
    console.log('NSFW score:', file.annotations.safety.nsfw_score);
  }
  if (file.annotations?.type === 'audio') {
    console.log('Duration:', file.annotations.duration_seconds, 's');
    console.log('Sample rate:', file.annotations.sample_rate, 'Hz');
    console.log('Speaker:', file.annotations.speaker_demographics);
    console.log('Words:', file.annotations.word_timestamps?.length);
  }
}

Video Keyframes

For video files, 3 representative keyframe images (at 10%, 50%, 90% of duration) are extracted and stored. The file response includes a keyframes array with pre-signed download URLs:

"keyframes": [
  { "index": 0, "download_url": "https://...", "download_url_expires_at": "2026-04-10T..." },
  { "index": 1, "download_url": "https://...", "download_url_expires_at": "2026-04-10T..." },
  { "index": 2, "download_url": "https://...", "download_url_expires_at": "2026-04-10T..." }
]

Use keyframes for video thumbnails, video-text training pairs, and previewing content without downloading the full video, as sketched below.
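
For instance, saving the middle keyframe as a thumbnail (a sketch; it assumes Node 18+ for the global fetch, and a file.id field for naming the output):

import { writeFile } from 'node:fs/promises';

// Save the middle keyframe (index 1, ~50% of duration) as a thumbnail.
async function saveThumbnail(file: {
  id: string;
  keyframes: { index: number; download_url: string }[];
}) {
  const mid = file.keyframes.find((k) => k.index === 1);
  if (!mid) return;
  const res = await fetch(mid.download_url); // pre-signed URL, no auth header
  await writeFile(`${file.id}-thumb.jpg`, Buffer.from(await res.arrayBuffer()));
}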

Provenance Metadata

Every file response includes provenance metadata alongside annotations:

| Field | Type | Description |
|---|---|---|
| content_hash | string | SHA-256 hash for dedup and integrity verification |
| captured_at | string \| null | EXIF capture timestamp (when taken, not uploaded) |
| device_info | object \| null | {device_model, device_os, app_version} from the worker's device |
| worker_region | string \| null | Derived from GPS (e.g., "US-NY", "US-CA-LA", "GB") |

These enable geographic diversity analysis, device-specific quality filtering, and dataset provenance tracking.
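
For instance, a quick regional tally for diversity checks (a sketch using the SDK client from Accessing Annotations; worker_region may be null):

const files = await client.getJobFiles('job_01JQ...');

// Count files per worker_region to see how geographically diverse a job is.
const byRegion = new Map<string, number>();
for (const file of files.data) {
  const region = file.worker_region ?? 'unknown';
  byRegion.set(region, (byRegion.get(region) ?? 0) + 1);
}
console.log(Object.fromEntries(byRegion)); // e.g., { "US-NY": 42, "GB": 17 }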

Webhook: submission.scored

The submission.scored webhook event fires for every scored submission, approved and rejected alike, and includes the full annotation payload. This enables:

  • Rejection analytics (why are files being rejected?)
  • Active learning (flag low-confidence annotations for human review)
  • Real-time scoring dashboards

See Webhook Handling for setup.
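
A minimal Express handler sketch; the payload field names used here (event, data.annotations) are assumptions for illustration, so check Webhook Handling for the exact envelope and for signature verification, which a real handler must perform first:

import express from 'express';

const app = express();
app.use(express.json());

app.post('/webhooks/firsthand', (req, res) => {
  const { event, data } = req.body; // field names assumed; see Webhook Handling
  if (event === 'submission.scored') {
    if (data.annotations === null) {
      // Pre-check rejection, policy violation, or scoring error
      console.log('Scored without annotations');
    } else {
      console.log(`Scored ${data.annotations.type} submission`);
    }
  }
  res.sendStatus(200);
});

app.listen(3000);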

Limitations

  • Annotations are best-effort — accuracy varies with content quality and complexity
  • Confidence scores are Claude’s self-assessed certainty, not calibrated probabilities
  • Bounding boxes are described spatially (e.g., “center”, “upper-left”) rather than as pixel coordinates (COCO format). For precise bounding boxes, use a dedicated labeling service
  • Speaker demographics are estimated from voice characteristics — not verified identity data
  • Word-level confidence is not available (Whisper API does not expose per-word probability). Word timestamps are precise but lack confidence scores
  • Audio fingerprinting (Chromaprint) is not available — use content_hash for dedup instead
  • Transcript segments depend on Whisper availability. When the OpenAI API key is not configured, segments will be null