Methodology
How Nile Intel collects, structures, and scores events across the Horn of Africa
Overview
Nile Intel is an automated open-source intelligence (OSINT) platform that monitors 16+ news sources covering Sudan, South Sudan, and the broader Horn of Africa. It ingests articles via RSS feeds, clusters related reporting, and uses large language models to extract structured event data — including event type, severity, actors, regions, and verification status.
The goal is to provide timely, structured situational awareness for organizations operating in or monitoring the region — at a fraction of the cost and latency of traditional intelligence services.
Transparency note: Nile Intel is an automated system. All event extractions are produced by AI models applied to publicly available news reporting. They are not editorial judgments and should be cross-referenced with primary sources for operational decisions.
Pipeline
Every article passes through a six-stage pipeline from raw RSS feed to structured, queryable event data.
Source Ingestion
RSS feeds from 16+ news sources are polled every 15 minutes. Articles are filtered for relevance to Sudan, South Sudan, and adjacent regions using keyword matching on titles and descriptions.
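A minimal sketch of the keyword relevance filter described above. The keyword list and the function name `is_relevant` are hypothetical; the actual keyword set is not given in this document.

```python
import re

# Hypothetical keyword list for illustration; the real list is not published here.
KEYWORDS = ("sudan", "south sudan", "khartoum", "juba", "darfur")

# One case-insensitive pattern that matches any keyword as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def is_relevant(title: str, description: str) -> bool:
    """Return True if any keyword appears in the title or description."""
    return _PATTERN.search(f"{title} {description}") is not None
```

Whole-word matching is a design choice in this sketch; it avoids accidental hits inside longer words, at the cost of missing inflected forms unless they are listed explicitly.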
Article Clustering
Related articles about the same event are grouped using cosine similarity on title and description text. This reduces 50+ daily articles to 10-15 distinct story clusters, preventing duplicate coverage from inflating event counts.
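The grouping step can be sketched as follows. This is an illustration under stated assumptions, not the production implementation: it uses plain bag-of-words vectors, a greedy single-pass strategy that compares each article with the first article of every existing cluster, and a hypothetical similarity threshold of 0.5. The dictionary keys `title` and `description` are also assumed.

```python
import math
import re
from collections import Counter

def vectorize(text: str) -> Counter:
    """Bag-of-words term counts over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(articles: list[dict], threshold: float = 0.5) -> list[list[dict]]:
    """Greedy clustering: join the first cluster whose seed article is similar enough."""
    clusters = []  # list of (seed_vector, members) pairs
    for article in articles:
        vec = vectorize(f"{article['title']} {article['description']}")
        for seed_vec, members in clusters:
            if cosine(vec, seed_vec) >= threshold:
                members.append(article)
                break
        else:
            # No existing cluster matched, so this article starts a new one.
            clusters.append((vec, [article]))
    return [members for _, members in clusters]
```

A TF-IDF weighting or an embedding model would likely separate stories more reliably than raw counts; the sketch keeps to the standard library for clarity.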
Extractive Summary
Each cluster gets an initial summary by selecting the longest, most detailed article description. This serves as a fast fallback when AI summarization is unavailable.
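As a rough sketch, the fallback summary can be expressed as a one-line selection. It assumes each article is a dictionary with a `description` key and treats character length as the proxy for detail; both are assumptions made for illustration.

```python
def extractive_summary(cluster: list[dict]) -> str:
    """Return the longest article description in the cluster, or '' if there is none."""
    return max((a.get("description", "") for a in cluster), key=len, default="")
```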
AI Event Extraction
A large language model (Llama 3.3 70B via Groq) reads all articles in each cluster and extracts structured fields: event type, subtype, severity (1-5), scope, country, regions, actors, verification status, and confidence score. The model also provides a rationale explaining its severity and verification decisions.
Validation & Quality Control
Each extraction is validated against a strict schema. Events that fail validation (missing required fields, invalid values) or have very low confidence (<0.3) are quarantined for review rather than published. This prevents hallucinated or poorly-supported events from entering the database.
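The validation and quarantine rule might look like the sketch below. The field names, the set of required fields, and the function names are hypothetical; only the allowed event types, the 1-5 severity range, the three verification labels, and the 0.3 confidence threshold come from this document.

```python
EVENT_TYPES = {"security", "political", "humanitarian", "economic", "infrastructure", "legal"}
VERIFICATION_LABELS = {"confirmed", "reported", "unverified"}
# Hypothetical set of required fields, for illustration only.
REQUIRED_FIELDS = ("event_type", "severity", "country", "verification_status", "confidence")
MIN_CONFIDENCE = 0.3  # extractions below this are quarantined

def validate(extraction: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record is publishable."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in extraction]
    if errors:
        return errors
    if extraction["event_type"] not in EVENT_TYPES:
        errors.append("invalid event_type")
    severity = extraction["severity"]
    if not (isinstance(severity, int) and 1 <= severity <= 5):
        errors.append("severity must be an integer from 1 to 5")
    if extraction["verification_status"] not in VERIFICATION_LABELS:
        errors.append("invalid verification_status")
    confidence = extraction["confidence"]
    if not (isinstance(confidence, (int, float)) and 0.0 <= confidence <= 1.0):
        errors.append("confidence must be a number between 0 and 1")
    elif confidence < MIN_CONFIDENCE:
        errors.append("confidence below threshold")
    return errors

def route(extraction: dict) -> str:
    """Send invalid or low-confidence extractions to quarantine instead of publishing."""
    return "quarantine" if validate(extraction) else "publish"
```

Returning the list of errors, rather than a bare boolean, keeps the reason for each quarantine decision available for later review.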
Actor Normalization
Actor names are mapped to canonical forms using a dictionary of 80+ aliases. For example: "Govt of South Sudan", "GoSS", and "South Sudan government" all resolve to "Government of South Sudan". This enables consistent querying and trend analysis.
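A dictionary lookup is enough to illustrate the idea. The three aliases below are the ones quoted above; the function name `normalize_actor` and the case and whitespace folding are assumptions of this sketch.

```python
# Alias keys are stored lowercase with single spaces; values are canonical names.
ALIASES = {
    "govt of south sudan": "Government of South Sudan",
    "goss": "Government of South Sudan",
    "south sudan government": "Government of South Sudan",
}

def normalize_actor(name: str) -> str:
    """Map a raw actor name to its canonical form, or return it trimmed if unknown."""
    key = " ".join(name.lower().split())
    return ALIASES.get(key, name.strip())
```

Unknown names pass through unchanged in this sketch, so new actors are not lost before an alias is added for them.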
Severity Scale
Every event is assigned a severity score from 1-5 based on the scope, impact, and urgency of the reported situation.
| Level | Label | Definition | Examples |
|---|---|---|---|
| 1 | Routine | Scheduled events, routine statements, standard reporting | Government press briefings, scheduled UN meetings, routine humanitarian updates |
| 2 | Notable | Localized incidents, policy changes, organizational shifts | Minor clashes with no casualties, new policy announcements, staff rotations |
| 3 | Significant | Regional displacement, major political shifts, economic disruptions | Multi-day protests, significant troop movements, trade route disruptions |
| 4 | Major | Large-scale violence, state-level crisis, major international intervention | Multi-faction clashes with casualties, large-scale displacement (10K+), state of emergency |
| 5 | Critical | War escalation, mass atrocity, national emergency | Full-scale military offensive, reported mass atrocities, capital under siege |
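For readers consuming the data programmatically, the scale above maps naturally onto an ordered enumeration. The class name `Severity` is illustrative and not part of any published Nile Intel API.

```python
from enum import IntEnum

class Severity(IntEnum):
    """Severity levels, ordered so that comparisons such as >= work directly."""
    ROUTINE = 1
    NOTABLE = 2
    SIGNIFICANT = 3
    MAJOR = 4
    CRITICAL = 5

# Example: keep only scores at Significant or above.
scores = [1, 3, 5, 2]
high_priority = [s for s in scores if s >= Severity.SIGNIFICANT]
```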
Source Tiering
Sources are classified into three reliability tiers. This classification is deterministic (not AI-assigned) and influences the verification status of extracted events.
| Tier | Criteria | Sources |
|---|---|---|
| Tier 1 | International wire services, major broadcasters, UN agencies with editorial standards and fact-checking processes | BBC Africa, Reuters, Al Jazeera, The Guardian Africa, France24, UN News, VOA |
| Tier 2 | Regional outlets with established track records, local knowledge, but potentially less editorial oversight | Radio Tamazuj, Eye Radio, Sudan Tribune, Dabanga Radio, Africanews |
| Tier 3 | Aggregators, diaspora media, or outlets with limited editorial processes | Google News aggregates, Nyamilepedia |
Verification Status
Each event receives one of three verification labels:
- Confirmed: Reported by 2+ Tier 1 sources, or corroborated by official statements/UN reports. High confidence in factual accuracy.
- Reported: Reported by at least one credible source but not independently verified. The default for most events.
- Unverified: Single-source reporting from Tier 3 sources, or where the AI extraction had low confidence. Should be treated as preliminary.
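The tier table and the three labels can be combined into a small deterministic rule, sketched below. Several details are assumptions: unknown sources default to Tier 3, an empty source list is treated as Unverified, the Confirmed check takes precedence over the low-confidence flag, and low confidence is passed in as a boolean because no numeric cutoff for it is stated here.

```python
# Tier assignments copied from the Source Tiering table.
SOURCE_TIERS = {
    "BBC Africa": 1, "Reuters": 1, "Al Jazeera": 1, "The Guardian Africa": 1,
    "France24": 1, "UN News": 1, "VOA": 1,
    "Radio Tamazuj": 2, "Eye Radio": 2, "Sudan Tribune": 2,
    "Dabanga Radio": 2, "Africanews": 2,
    "Google News": 3, "Nyamilepedia": 3,
}

def verification_status(
    sources: list[str],
    official_corroboration: bool = False,
    low_confidence: bool = False,
) -> str:
    """Derive a verification label from the distinct sources reporting an event."""
    # Assumption: a source missing from the table is treated as Tier 3.
    tiers = [SOURCE_TIERS.get(name, 3) for name in set(sources)]
    if tiers.count(1) >= 2 or official_corroboration:
        return "Confirmed"
    if not tiers or low_confidence or (len(tiers) == 1 and tiers[0] == 3):
        return "Unverified"
    return "Reported"
```

For example, `verification_status(["Reuters", "BBC Africa"])` yields "Confirmed", while a lone Tier 2 report such as `verification_status(["Eye Radio"])` yields "Reported".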
Event Classification
Events are classified into six primary types:
- Security: Armed clashes, ceasefire violations, military operations, intercommunal violence
- Political: Elections, peace talks, government formation, political protests, diplomatic developments
- Humanitarian: Displacement, food insecurity, disease outbreaks, aid delivery, protection concerns
- Economic: Trade disruptions, currency crises, oil production changes, sanctions
- Infrastructure: Road/bridge destruction, power outages, telecommunications, construction projects
- Legal: Court rulings, ICC proceedings, human rights investigations, legislative changes
Quality Control
Quarantine System
Extractions that fail validation are quarantined rather than discarded. This serves two purposes:
- Safety: Low-quality extractions never reach the public database
- Learning: Quarantined records are reviewed to improve the extraction prompt and identify systematic failure modes
Deduplication
Each article cluster is hashed based on its constituent article titles. If a cluster has already been extracted (or quarantined), it is skipped. This prevents the same event from being counted multiple times across feed refresh cycles.
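One way to implement this check is sketched below. The choice of SHA-256, the lowercasing and whitespace folding, and the sorting of titles (so that article order does not change the hash) are assumptions of this example, as are the function names.

```python
import hashlib

def cluster_key(titles: list[str]) -> str:
    """Stable hash of a cluster, independent of article order, case, and spacing."""
    normalized = sorted(" ".join(t.lower().split()) for t in titles)
    return hashlib.sha256("\n".join(normalized).encode("utf-8")).hexdigest()

def should_process(titles: list[str], seen: set[str]) -> bool:
    """Return False if this cluster was already extracted or quarantined."""
    key = cluster_key(titles)
    if key in seen:
        return False
    seen.add(key)
    return True
```

In practice the `seen` set would be persisted between feed refresh cycles, for example in the same database that stores the events.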
Event counting: Events are deduplicated across sources via clustering. All counts in the Event Archive represent unique events, not individual articles. When multiple outlets report the same event, it appears as a single event record with multi-source attribution.
Confidence Assessment
Each extraction is assessed as High, Medium, or Low confidence based on source agreement, extraction consistency, and verification status. Low-confidence extractions (below 0.3 on the internal scale) are automatically quarantined for review rather than published. Confidence levels are tracked over time to monitor extraction reliability.
Provenance
Every event record in the database includes full provenance metadata:
- Model version: Which AI model produced the extraction (e.g., llama-3.3-70b-versatile)
- Prompt version: Which extraction prompt was used (versioned for auditability)
- Source articles: URLs of all articles in the cluster that informed the extraction
- Extraction timestamp: When the extraction was performed
- Rationale: The model's explanation for its severity and verification decisions
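The provenance fields listed above could be carried in a small immutable record like the one sketched here. The class name, attribute names, and the ISO 8601 UTC timestamp format are assumptions for illustration; the actual database schema is not shown in this document.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Provenance:
    """Audit metadata attached to one extracted event (hypothetical schema)."""
    model_version: str
    prompt_version: str
    source_urls: tuple[str, ...]
    rationale: str
    # Defaults to the current UTC time in ISO 8601 format.
    extracted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Placeholder values: the prompt version label and URL are made up for the example.
record = Provenance(
    model_version="llama-3.3-70b-versatile",
    prompt_version="v1",
    source_urls=("https://example.org/article",),
    rationale="Two outlets report casualties.",
)
as_dict = asdict(record)  # plain dictionary, ready to serialize
```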
Limitations
Users should be aware of the following limitations:
- Source dependency: Nile Intel can only report on events covered by its monitored sources. Events in remote areas with no media access may not appear.
- AI extraction errors: While validated, AI extractions can misclassify event types, miscalculate severity, or miss nuances that a human analyst would catch.
- Latency: Feeds are polled every 15 minutes, so events typically appear in the database within about 15 minutes of RSS publication, plus processing time. This is faster than weekly reports but slower than social media monitoring.
- Language coverage: Currently limited to English-language sources. Arabic and local-language reporting is not directly ingested.
- No primary reporting: Nile Intel does not have correspondents or conduct original investigations. All data derives from published news articles.
Contact: For questions about methodology, data access, or partnership inquiries, reach out via the Weekly Brief subscription or the event archive alert signup.