Auto-Labeling & Annotations
Every file scored by FirstHandAPI’s AI ensemble receives structured annotation metadata alongside its quality score, generated automatically. No separate labeling pipeline is needed — annotations are included in the file response at GET /v1/jobs/:id/files.
How It Works
When a submission is scored by the AI ensemble (Claude Vision + Whisper + ffprobe), the system generates annotation metadata in the same pass. Annotations are content-type-aware and include:
- Images: Object detection with confidence, OCR with per-word confidence, scene classification, face count, quality metrics, safety scores
- Audio: Transcription with word-level timestamps, speaker demographics, audio metadata (sample rate, codec, SNR)
- Video: Scene segmentation with precise timestamps, object tracking with confidence, video metadata (FPS, resolution, codecs)
Every annotation includes an annotation_model field (e.g., "claude-sonnet-4-20250514") for reproducibility tracking.
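If you track reproducibility across model versions, one approach is to group files by this field before comparing annotation statistics. A minimal sketch in TypeScript; the MinimalFile shape is a simplified stand-in for the real file response type:
// Group files by the model version that produced their annotations.
// MinimalFile is a simplified, assumed shape; see the full field tables below.
interface MinimalFile {
  id: string;
  annotations: { annotation_model: string } | null;
}

function groupByAnnotationModel(files: MinimalFile[]): Map<string, MinimalFile[]> {
  const groups = new Map<string, MinimalFile[]>();
  for (const file of files) {
    const model = file.annotations?.annotation_model ?? 'unannotated';
    const bucket = groups.get(model) ?? [];
    bucket.push(file);
    groups.set(model, bucket);
  }
  return groups;
}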
When Annotations Are Not Generated
- Policy violations (1-star): AI-generated content, content policy violations, corrupted files
- Stock photo auto-rejects: Files caught by reverse image search before Claude scoring
- Resolution pre-check failures: Images/video below min_width or min_height (rejected before scoring)
- Scoring system errors: If scoring fails entirely
In these cases, annotations will be null.
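Because annotations can be null in these cases, downstream code should check for their presence before reading annotation fields. A minimal sketch; the ScoredFile shape is a simplified assumption about the file response:
// Split a file list into annotated and unannotated files before further processing.
// ScoredFile is a simplified, assumed shape of the real file response type.
interface ScoredFile {
  id: string;
  annotations: { type: 'image' | 'audio' | 'video' } | null;
}

function splitByAnnotationPresence(files: ScoredFile[]) {
  const annotated: ScoredFile[] = [];
  const unannotated: ScoredFile[] = []; // rejected pre-scoring, policy violations, or scoring errors
  for (const file of files) {
    (file.annotations ? annotated : unannotated).push(file);
  }
  return { annotated, unannotated };
}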
Whisper Hallucination Handling
For audio and video files, Whisper transcription is protected against hallucination — silent or ambient audio that Whisper misinterprets as speech. When hallucination is detected (via per-segment no_speech_prob, avg_logprob, and compression_ratio thresholds), the transcript is suppressed and the file is scored as ambient audio. See Trust & Safety for details.
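The exact server-side cutoffs are not published here; the sketch below only illustrates how the named Whisper signals are typically combined to flag a suspect segment, using Whisper’s commonly cited default thresholds as placeholder values:
// Illustrative hallucination check over Whisper's per-segment signals.
// Threshold values are assumptions for this sketch, not FirstHandAPI's actual cutoffs.
interface WhisperSegment {
  no_speech_prob: number;    // probability the segment contains no speech
  avg_logprob: number;       // mean token log-probability
  compression_ratio: number; // text compression ratio; very high values suggest repetitive looping
}

function looksHallucinated(seg: WhisperSegment): boolean {
  return (
    seg.no_speech_prob > 0.6 ||   // Whisper itself thinks there is no speech
    seg.avg_logprob < -1.0 ||     // tokens were decoded with very low confidence
    seg.compression_ratio > 2.4   // repetitive "looping" text
  );
}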
Image EXIF Normalization
All images are automatically EXIF-rotated before scoring and delivery. The corrected image replaces the original in S3, so download_url always returns a correctly-oriented image. No client-side rotation is needed.
Image Annotations
| Field | Type | Description |
|---|---|---|
type | "image" | Content type discriminant |
objects | object[] | Detected objects with label, confidence (0.0-1.0), position, coverage |
scene.setting | string | Scene description (e.g., "indoor office") |
scene.indoor | boolean | Whether the scene is indoors |
scene.confidence | number | Scene classification confidence (0.0-1.0) |
text_extraction | object | null | OCR with full_text and per-word {text, confidence} array |
color_palette | string[] | 3-6 dominant hex color codes |
composition | string | Composition description |
face_count | integer | Number of human faces detected |
orientation | string | "landscape", "portrait", or "square" |
quality_metrics.blur_score | number | 0.0 (sharp) to 1.0 (very blurry) |
quality_metrics.exposure | string | "underexposed", "normal", or "overexposed" |
quality_metrics.noise_level | string | "low", "moderate", or "high" |
safety.nsfw_score | number | 0.0 (safe) to 1.0 (explicit) |
safety.violence_score | number | 0.0 (none) to 1.0 (graphic) |
safety.pii_detected | boolean | True if visible personal info (ID cards, documents with names) |
annotation_model | string | Model version (e.g., "claude-sonnet-4-20250514") |
{
"type": "image",
"objects": [
{ "label": "mailbox", "confidence": 0.95, "position": "center", "approximate_coverage": "25% of frame" },
{ "label": "house", "confidence": 0.88, "position": "upper-right", "approximate_coverage": "35% of frame" }
],
"scene": { "setting": "outdoor residential neighborhood", "indoor": false, "confidence": 0.92 },
"text_extraction": {
"full_text": "1234 Oak Street",
"words": [
{ "text": "1234", "confidence": 0.97 },
{ "text": "Oak", "confidence": 0.95 },
{ "text": "Street", "confidence": 0.93 }
]
},
"color_palette": ["#8B4513", "#228B22", "#87CEEB"],
"composition": "centered mailbox with residential background, natural daylight",
"face_count": 0,
"orientation": "landscape",
"quality_metrics": { "blur_score": 0.08, "exposure": "normal", "noise_level": "low" },
"safety": { "nsfw_score": 0.0, "violence_score": 0.0, "pii_detected": false },
"annotation_model": "claude-sonnet-4-20250514"
}
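These fields make simple dataset-curation filters straightforward. A minimal sketch, assuming the annotation object matches the table above; the threshold values are arbitrary examples, not recommendations:
// Keep sharp, safe, well-exposed images for a downstream training set.
// Threshold values here are arbitrary examples, not recommendations.
interface ImageAnnotations {
  type: 'image';
  face_count: number;
  quality_metrics: { blur_score: number; exposure: string; noise_level: string };
  safety: { nsfw_score: number; violence_score: number; pii_detected: boolean };
}

function passesCurationFilter(a: ImageAnnotations): boolean {
  return (
    a.quality_metrics.blur_score < 0.3 &&
    a.quality_metrics.exposure === 'normal' &&
    a.safety.nsfw_score < 0.1 &&
    a.safety.violence_score < 0.1 &&
    !a.safety.pii_detected
  );
}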
Audio Annotations
Audio annotations include technical metadata from ffprobe (sample rate, codec, channels) alongside AI-generated content analysis.
| Field | Type | Description |
|---|---|---|
type | "audio" | Content type discriminant |
duration_seconds | number | null | Actual duration from ffprobe or Whisper |
sample_rate | number | null | Sample rate in Hz (e.g., 44100) |
bitrate_kbps | number | null | Bitrate in kbps (e.g., 128) |
codec | string | null | Audio codec (e.g., "aac", "mp3") |
channels | number | null | 1 = mono, 2 = stereo |
speaker_count | integer | Estimated number of distinct speakers |
language | string | ISO 639-1 code (e.g., "en") — from Whisper when available |
topics | string[] | Topic classification labels |
keywords | string[] | Extracted keywords |
noise_level | string | "silent", "low", "moderate", or "high" |
snr_db | number | null | Estimated signal-to-noise ratio in dB |
emotion_tone | object | null | Emotion/tone classification (null for non-speech) |
room_acoustics | object | Recording environment estimation |
background_sounds | object[] | Identified background sounds with confidence |
speaker_demographics | object | null | Estimated speaker characteristics (see below) |
transcript_segments | object[] | null | Segment-level transcript with timestamps and confidence |
word_timestamps | object[] | null | Word-level timestamps from Whisper (see below) |
annotation_model | string | Model version |
Emotion & Tone
| Field | Type | Description |
|---|---|---|
| emotion_tone.primary | string | "neutral", "happy", "sad", "angry", "fearful", "surprised", "disgusted", "calm", "excited" |
| emotion_tone.confidence | number | 0.0-1.0 |
| emotion_tone.secondary | string | Optional secondary tone |
Set to null for non-speech audio (ambient, music-only).
Room Acoustics
| Field | Type | Description |
|---|---|---|
| room_acoustics.estimated_room_size | string | "small", "medium", "large", "outdoor" |
| room_acoustics.reverb_level | string | "dry", "slight", "moderate", "heavy" |
| room_acoustics.estimated_rt60_seconds | number \| null | Estimated reverb decay time. < 0.3 = treated/small, 0.3-0.6 = normal, > 1.0 = very reverberant |
Background Sounds
Each entry in background_sounds identifies a distinct non-speech sound:
| Field | Type | Description |
|---|---|---|
| label | string | e.g., "HVAC", "traffic", "fan", "birds", "keyboard", "music" |
| confidence | number | 0.0-1.0 |
| prominence | string | "faint", "noticeable", "dominant" |
Transcript Segment Confidence
Each transcript segment includes a confidence score (0.0-1.0) derived from Whisper’s per-segment avg_logprob. Use it to filter low-confidence transcriptions at scale:
{
"text": "The battery lasts about 8 hours.",
"start_seconds": 0.0,
"end_seconds": 2.34,
"confidence": 0.82
}
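For example, a minimal sketch that drops segments below a chosen confidence floor; the 0.6 cutoff is an arbitrary example, not a recommended value:
// Keep only transcript segments above a confidence floor.
// The 0.6 threshold is an arbitrary example; tune it for your dataset.
interface TranscriptSegment {
  text: string;
  start_seconds: number;
  end_seconds: number;
  confidence: number;
}

function filterSegments(segments: TranscriptSegment[] | null, minConfidence = 0.6): TranscriptSegment[] {
  return (segments ?? []).filter((s) => s.confidence >= minConfidence);
}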
Speaker Demographics (Estimated)
| Field | Values |
|---|---|
| estimated_gender | "male", "female", "unknown" |
| estimated_age_range | "child", "young_adult", "adult", "senior", "unknown" |
| accent_region | e.g., "us_general", "british", "indian", "unknown" |
These are AI estimates from voice characteristics, not verified demographics.
Word-Level Timestamps
From Whisper’s verbose_json output — precise per-word timing:
"word_timestamps": [
{ "word": "The", "start": 0.0, "end": 0.12 },
{ "word": "battery", "start": 0.12, "end": 0.56 },
{ "word": "lasts", "start": 0.56, "end": 0.89 }
]
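One common use is locating where a keyword is spoken so the surrounding audio can be clipped. A minimal sketch over the structure above:
// Find the time ranges where a keyword occurs, using Whisper word timestamps.
interface WordTimestamp {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

function findKeyword(words: WordTimestamp[], keyword: string): Array<{ start: number; end: number }> {
  const target = keyword.toLowerCase();
  return words
    .filter((w) => w.word.toLowerCase().replace(/[^a-z0-9]/g, '') === target)
    .map((w) => ({ start: w.start, end: w.end }));
}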
Full example:
{
"type": "audio",
"duration_seconds": 21.45,
"sample_rate": 44100,
"bitrate_kbps": 128,
"codec": "aac",
"channels": 1,
"speaker_count": 1,
"language": "en",
"topics": ["product review", "technology"],
"keywords": ["battery life", "screen quality"],
"noise_level": "low",
"snr_db": 32,
"emotion_tone": {
"primary": "neutral",
"confidence": 0.85,
"secondary": "calm"
},
"room_acoustics": {
"estimated_room_size": "medium",
"reverb_level": "slight",
"estimated_rt60_seconds": 0.4
},
"background_sounds": [
{ "label": "HVAC", "confidence": 0.75, "prominence": "faint" }
],
"speaker_demographics": {
"estimated_gender": "male",
"estimated_age_range": "adult",
"accent_region": "us_general"
},
"transcript_segments": [
{ "text": "The battery lasts about 8 hours.", "start_seconds": 0.0, "end_seconds": 2.34, "confidence": 0.82 },
{ "text": "Screen quality is excellent.", "start_seconds": 2.34, "end_seconds": 4.87, "confidence": 0.79 }
],
"word_timestamps": [
{ "word": "The", "start": 0.0, "end": 0.12 },
{ "word": "battery", "start": 0.12, "end": 0.56 },
{ "word": "lasts", "start": 0.56, "end": 0.89 },
{ "word": "about", "start": 0.89, "end": 1.15 },
{ "word": "8", "start": 1.15, "end": 1.32 },
{ "word": "hours", "start": 1.32, "end": 1.78 }
],
"annotation_model": "claude-sonnet-4-20250514"
}
Video Annotations
Video annotations combine visual analysis from Claude Vision, audio analysis from Whisper, and technical metadata from ffprobe.
| Field | Type | Description |
|---|---|---|
type | "video" | Content type discriminant |
duration_seconds | number | null | Duration from ffprobe |
width | number | null | Resolution width in pixels |
height | number | null | Resolution height in pixels |
fps | number | null | Frames per second |
video_codec | string | null | Video codec (e.g., "h264", "hevc") |
audio_codec | string | null | Audio codec (e.g., "aac") |
scenes | object[] | Scene segmentation with precise float timestamps and confidence |
actions | string[] | Action recognition labels |
object_tracking | object[] | Objects tracked across scenes with confidence |
keyframe_descriptions | string[] | Description per extracted keyframe |
face_count | integer | Total faces detected across keyframes |
speaker_count | integer | null | Estimated speakers (null if no audio) |
noise_level | string | null | Audio noise level (null if no audio) |
quality_metrics.blur_score | number | 0.0 (sharp) to 1.0 (blurry) — averaged across keyframes |
quality_metrics.exposure | string | "underexposed", "normal", or "overexposed" |
quality_metrics.stability | string | "stable", "moderate", or "shaky" |
safety.nsfw_score | number | 0.0-1.0 for most sensitive frame |
safety.violence_score | number | 0.0-1.0 for most sensitive frame |
transcript_segments | object[] | null | Audio transcript with precise timestamps |
word_timestamps | object[] | null | Word-level timestamps from Whisper |
annotation_model | string | Model version |
{
"type": "video",
"duration_seconds": 15.2,
"width": 1920,
"height": 1080,
"fps": 30,
"video_codec": "h264",
"audio_codec": "aac",
"scenes": [
{ "description": "Entrance area with door", "start_seconds": 0.0, "end_seconds": 5.2, "confidence": 0.88 },
{ "description": "Living room with furniture", "start_seconds": 5.2, "end_seconds": 10.5, "confidence": 0.92 },
{ "description": "Kitchen area", "start_seconds": 10.5, "end_seconds": 15.2, "confidence": 0.85 }
],
"actions": ["walking", "panning camera", "moving through rooms"],
"object_tracking": [
{ "label": "furniture", "confidence": 0.90, "appears_in_scenes": [0, 1, 2] },
{ "label": "television", "confidence": 0.88, "appears_in_scenes": [1] }
],
"keyframe_descriptions": ["Doorway with wooden frame", "Bright living room with windows", "Kitchen counter with appliances"],
"face_count": 0,
"speaker_count": 1,
"noise_level": "low",
"quality_metrics": { "blur_score": 0.1, "exposure": "normal", "stability": "stable" },
"safety": { "nsfw_score": 0.0, "violence_score": 0.0 },
"transcript_segments": [
{ "text": "Here's a walkthrough of the apartment.", "start_seconds": 1.0, "end_seconds": 3.2 }
],
"word_timestamps": [
{ "word": "Here's", "start": 1.0, "end": 1.25 },
{ "word": "a", "start": 1.25, "end": 1.35 },
{ "word": "walkthrough", "start": 1.35, "end": 1.98 }
],
"annotation_model": "claude-sonnet-4-20250514"
}
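Scene boundaries are useful for cutting a downloaded video into per-scene clips. A minimal sketch that turns the scenes array into ffmpeg trim commands; the input and output paths are placeholders, and running ffmpeg on the downloaded file is your own pipeline step, not part of the API:
// Build ffmpeg commands that cut a downloaded video into one clip per annotated scene.
// inputPath and outputPrefix are placeholders; ffmpeg execution happens in your own pipeline.
interface Scene {
  description: string;
  start_seconds: number;
  end_seconds: number;
  confidence: number;
}

function sceneClipCommands(scenes: Scene[], inputPath: string, outputPrefix: string): string[] {
  return scenes.map((scene, i) =>
    `ffmpeg -i ${inputPath} -ss ${scene.start_seconds} -to ${scene.end_seconds} -c copy ${outputPrefix}_scene${i}.mp4`
  );
}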
File-Level Metadata
In addition to annotations, each file response includes:
| Field | Type | Description |
|---|---|---|
| duration_seconds | number \| null | Actual duration (populated for audio/video via ffprobe) |
| width | number \| null | Image/video width in pixels |
| height | number \| null | Image/video height in pixels |
| content_hash | string \| null | SHA-256 hash of file bytes (for dedup + integrity verification) |
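Because content_hash is a SHA-256 over the file bytes, exact-duplicate detection reduces to a set lookup. A minimal sketch:
// Drop files whose content_hash has already been seen (exact-duplicate detection).
interface HashedFile {
  id: string;
  content_hash: string | null;
}

function dedupeByContentHash(files: HashedFile[]): HashedFile[] {
  const seen = new Set<string>();
  return files.filter((f) => {
    if (!f.content_hash) return true;           // keep files without a hash
    if (seen.has(f.content_hash)) return false; // exact duplicate
    seen.add(f.content_hash);
    return true;
  });
}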
Accessing Annotations
Via REST API
curl -H "Authorization: Bearer fh_live_..." \
    https://api.firsthandapi.com/v1/jobs/job_01JQ.../files
Via TypeScript SDK
const files = await client.getJobFiles('job_01JQ...');
for (const file of files.data) {
if (file.annotations?.type === 'image') {
console.log('Objects:', file.annotations.objects.map(o => `${o.label} (${o.confidence})`));
console.log('OCR:', file.annotations.text_extraction?.full_text);
console.log('Faces:', file.annotations.face_count);
console.log('NSFW score:', file.annotations.safety.nsfw_score);
}
if (file.annotations?.type === 'audio') {
console.log('Duration:', file.annotations.duration_seconds, 's');
console.log('Sample rate:', file.annotations.sample_rate, 'Hz');
console.log('Speaker:', file.annotations.speaker_demographics);
console.log('Words:', file.annotations.word_timestamps?.length);
}
}
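Inside the same loop, a video branch could look like the sketch below, using fields from the video annotation table above:
// Video branch for the same loop (add inside the for...of block above).
if (file.annotations?.type === 'video') {
  console.log('Resolution:', file.annotations.width, 'x', file.annotations.height);
  console.log('Scenes:', file.annotations.scenes.length);
  console.log('Actions:', file.annotations.actions.join(', '));
  console.log('Stability:', file.annotations.quality_metrics.stability);
}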
Video Keyframes
For video files, 3 representative keyframe images (at 10%, 50%, 90% of duration) are extracted and stored. The file response includes a keyframes array with pre-signed download URLs:
"keyframes": [
{ "index": 0, "download_url": "https://...", "download_url_expires_at": "2026-04-10T..." },
{ "index": 1, "download_url": "https://...", "download_url_expires_at": "2026-04-10T..." },
{ "index": 2, "download_url": "https://...", "download_url_expires_at": "2026-04-10T..." }
]
Use keyframes for video thumbnails, video-text training pairs, and previews without downloading the full video.
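A minimal sketch that downloads the middle keyframe as a thumbnail, assuming the keyframes shape shown above and a Node 18+ runtime with global fetch:
// Download the middle keyframe (index 1) for use as a thumbnail.
// Assumes Node 18+ (global fetch) and the keyframes shape shown above.
import { writeFile } from 'node:fs/promises';

interface Keyframe {
  index: number;
  download_url: string;
  download_url_expires_at: string;
}

async function saveThumbnail(keyframes: Keyframe[], outPath: string): Promise<void> {
  const middle = keyframes.find((k) => k.index === 1) ?? keyframes[0];
  if (!middle) return; // no keyframes were extracted for this video
  const res = await fetch(middle.download_url);
  const bytes = Buffer.from(await res.arrayBuffer());
  await writeFile(outPath, bytes);
}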
Provenance Metadata
Every file response includes provenance metadata alongside annotations:
| Field | Type | Description |
|---|---|---|
| content_hash | string | SHA-256 hash for dedup + integrity verification |
| captured_at | string \| null | EXIF capture timestamp (when taken, not uploaded) |
| device_info | object \| null | {device_model, device_os, app_version} from worker’s device |
| worker_region | string \| null | Derived from GPS (e.g., "US-NY", "US-CA-LA", "GB") |
These enable geographic diversity analysis, device-specific quality filtering, and dataset provenance tracking.
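For example, a minimal sketch that counts files per worker_region to gauge geographic coverage; the file shape below is a simplified assumption about the response:
// Count files per worker_region to gauge the geographic spread of a dataset.
// ProvenanceFile is a simplified, assumed shape of the real file response.
interface ProvenanceFile {
  id: string;
  worker_region: string | null;
}

function countByRegion(files: ProvenanceFile[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const file of files) {
    const region = file.worker_region ?? 'unknown';
    counts[region] = (counts[region] ?? 0) + 1;
  }
  return counts;
}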
Webhook: submission.scored
The submission.scored webhook event fires for every scored submission (approved AND rejected), including the full annotation payload. This enables:
- Rejection analytics (why are files being rejected?)
- Active learning (flag low-confidence annotations for human review)
- Real-time scoring dashboards
See Webhook Handling for setup.
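A minimal Express-style receiver sketch for rejection analytics; the event body fields used here (type, data.status, data.annotations) are assumptions, so consult Webhook Handling for the real payload shape and signature verification:
// Sketch of a webhook receiver that tallies rejections by content type.
// The payload field names below are assumptions; see Webhook Handling for the real shape.
import express from 'express';

const app = express();
app.use(express.json());

const rejectionCounts: Record<string, number> = {};

app.post('/webhooks/firsthand', (req, res) => {
  const event = req.body;
  if (event?.type === 'submission.scored' && event?.data?.status === 'rejected') {
    const contentType = event?.data?.annotations?.type ?? 'unknown';
    rejectionCounts[contentType] = (rejectionCounts[contentType] ?? 0) + 1;
  }
  res.sendStatus(200); // acknowledge promptly; do heavy work asynchronously
});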
Limitations
- Annotations are best-effort — accuracy varies with content quality and complexity
- Confidence scores are Claude’s self-assessed certainty, not calibrated probabilities
- Bounding boxes are described spatially (e.g., “center”, “upper-left”) rather than as pixel coordinates (COCO format). For precise bounding boxes, use a dedicated labeling service
- Speaker demographics are estimated from voice characteristics — not verified identity data
- Word-level confidence is not available (Whisper API does not expose per-word probability). Word timestamps are precise but lack confidence scores
- Audio fingerprinting (Chromaprint) is not available — use content_hash for dedup instead
- Transcript segments depend on Whisper availability. When the OpenAI API key is not configured, segments will be null