AI Log Triage for IoT Devices — Without Sending Data to the Cloud
A client needed AI-powered incident analysis for their IoT fleet, but security policy prohibited sending device telemetry to any external service. We built it on their existing hardware.
The project started as an internal R&D tool.
The client’s engineering team was running tests on their FreeRTOS IoT devices and spending a disproportionate amount of time manually analysing test logs — scrolling through event sequences, cross-referencing error codes against documentation, trying to reconstruct what happened during a failed test run. We built them an AI-assisted triage tool to do that work automatically: feed in the logs, get back a readable summary of what went wrong and why.
For the prototype, we used a public LLM API. Fast to build, good results, the team adopted it immediately. It became part of their standard testing workflow.
That success led to the next question: could this run on production device telemetry? The answer was yes — but it hit a wall immediately. Production telemetry contained customer device data, and the client’s information security policy explicitly prohibited sending that data to any external service. OpenAI, Anthropic, Microsoft’s Azure OpenAI Service — all off the table. Even cloud-hosted models running in a nominally private Azure tenant did not satisfy the policy.
The capability was proven. The task was to rebuild it so that no data ever left their infrastructure.
Why raw logs don’t map cleanly to LLM input
Before going into the specifics: a common misconception is that AI log analysis means pasting a log file into ChatGPT or similar chat-based AI and asking “what went wrong.” That does not work on production systems. The output is vague, the model hallucinates plausible-sounding root causes, and it misses the actual failure buried in thousands of unrelated lines.
Real log analysis requires a structured pipeline that runs before the LLM ever sees anything. The raw logs need to be parsed, normalized, filtered, correlated across sources and time windows, and reduced to a focused incident packet — only then can a language model do useful reasoning over the result. The pre-processing stages are where most of the engineering effort goes: format detection, noise filtering, event correlation, incident windowing, and evidence ranking. Without this, you are asking a model to find a needle in a haystack while blindfolded.
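To make that concrete, here is a minimal sketch of the later stages chained together, in Python. The keyword filters, padding, and ranking heuristic are illustrative placeholders rather than our production pipeline, and format detection and cross-source correlation are omitted for brevity:

```python
def filter_noise(lines: list[str]) -> list[str]:
    # Noise filtering: drop periodic heartbeat and verbose debug chatter (illustrative keywords)
    return [l for l in lines if "HEARTBEAT" not in l and "DEBUG" not in l]

def find_incident_window(lines: list[str], pad: int = 10) -> list[str]:
    # Incident windowing: keep a padded slice around the first and last error-level events
    hits = [i for i, l in enumerate(lines) if "ERROR" in l or "FAIL" in l]
    if not hits:
        return lines
    return lines[max(hits[0] - pad, 0):hits[-1] + pad + 1]

def rank_evidence(lines: list[str], top_n: int = 20) -> list[str]:
    # Evidence ranking: a crude keyword score; in practice this stage carries real logic
    def score(line: str) -> int:
        return sum(kw in line for kw in ("ERROR", "FAIL", "RESET", "TIMEOUT"))
    return sorted(lines, key=score, reverse=True)[:top_n]

def build_incident_packet(raw_lines: list[str]) -> dict:
    """Reduce a raw log batch to the focused packet the model actually sees."""
    window = find_incident_window(filter_noise(raw_lines))
    return {"window": window, "evidence": rank_evidence(window)}
```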
This project was a good illustration of why the pipeline matters more than the model.
Why FreeRTOS logs make the starting point simpler
The first question was whether a locally-hosted model could realistically handle this task. The answer depends heavily on the type of device.
These devices ran FreeRTOS, not Linux. That distinction matters for the amount of pre-processing required. A Linux system produces thousands of log lines per minute — kernel scheduler events, driver messages, systemd chatter — the vast majority of which is noise. Feeding that to any model, local or cloud, is impractical without a substantial pre-processing pipeline that filters, normalizes, and correlates events before the model sees them.
FreeRTOS devices log only application-level events: task state changes, sensor readings, MQTT outcomes, error codes, connectivity events. An incident window is typically 20–80 lines of structured, purposeful output. The signal-to-noise ratio is high enough that the pre-processing stages are simpler — though they still exist. You still need format detection, timestamp normalization, and incident windowing, but the volume and noise are manageable.
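Even the "simpler" stages still mean writing a parser. Here is a hedged sketch of format detection and timestamp normalization for a FreeRTOS-style line; the line format and tick rate are assumptions for illustration, not the client's actual log schema:

```python
import re
from typing import Optional

# Hypothetical line format: "[<tick>] <TASK> <LEVEL>: <message>", e.g.
#   "[183042] mqtt_task ERROR: publish failed, rc=-2"
LINE_RE = re.compile(r"^\[(\d+)\]\s+(\S+)\s+(\w+):\s+(.*)$")
TICK_RATE_HZ = 1000  # assumed configTICK_RATE_HZ; depends on the firmware build

def parse_line(line: str) -> Optional[dict]:
    m = LINE_RE.match(line.strip())
    if not m:
        return None  # unknown format: handed to a separate fallback parser
    tick, task, level, msg = m.groups()
    return {
        "t_seconds": int(tick) / TICK_RATE_HZ,  # normalize ticks to seconds since boot
        "task": task,
        "level": level,
        "message": msg,
    }

print(parse_line("[183042] mqtt_task ERROR: publish failed, rc=-2"))
```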
This made it the right starting point. The same pipeline architecture applies to Linux log analysis — it just requires more aggressive pre-processing stages to cut through the noise before the model gets involved.
For a 7B-class model on commodity hardware, FreeRTOS logs are the sweet spot: compact input, high signal density, and tractable without GPU acceleration.
The non-obvious hard part: finding where the problem actually was
The practical engineering challenge turned out to be something we did not anticipate upfront.
Logs were not delivered in real time. The client’s telemetry pipeline batched device events and forwarded them on a fixed schedule — in production, that typically meant once per day, or on-demand when an engineer manually triggered a sync. By the time the log arrived for analysis, the device had long since recovered or moved on. The log buffer covering the incident was somewhere in the middle of the data we received, surrounded by normal-operation entries before and recovery events after.
A naive prompt — “here are the logs, what went wrong?” — produced unreliable results because the model would try to summarize the entire window, including the recovery phase, often diluting or missing the actual failure.
The solution required a two-stage approach. The first stage asked the model to locate the incident within the log — identify the timestamp range where the failure sequence began and ended, based on event patterns rather than any prior knowledge of what to look for. Only once that window was identified did we pass it to the analysis stage. Getting this first-stage prompt right took significant iteration: the model needed to distinguish a brief failure-and-recovery sequence from normal operational variance, without being told what the failure would look like.
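In code, the flow looks roughly like this. `query_model` is a placeholder for the local serving layer described further down, and the prompt wording here is heavily abbreviated; the real stage-one prompt is where the iteration went:

```python
import json

def query_model(prompt: str) -> str:
    # Placeholder: wire this to your local model server (an Ollama example appears below).
    raise NotImplementedError

def locate_incident(log_lines: list[str]) -> tuple[str, str]:
    """Stage 1: ask only for the time range of the failure sequence, nothing else."""
    prompt = (
        "The log below covers normal operation, a possible failure, and recovery.\n"
        "Return the timestamps where the failure sequence starts and ends,\n"
        'as JSON: {"start": "...", "end": "..."}. Do not explain.\n\n'
        + "\n".join(log_lines)
    )
    # Assumes the model returned valid JSON; production code validates and retries.
    window = json.loads(query_model(prompt))
    return window["start"], window["end"]

def slice_window(log_lines: list[str], start: str, end: str) -> list[str]:
    # Naive slice between the lines carrying the start and end timestamps
    hits = [i for i, l in enumerate(log_lines) if start in l or end in l]
    return log_lines[hits[0]:hits[-1] + 1] if hits else log_lines

def analyse_incident(log_lines: list[str]) -> str:
    """Stage 2: analyse only the located window, not the whole delivery batch."""
    start, end = locate_incident(log_lines)
    window = slice_window(log_lines, start, end)
    prompt = (
        "Analyse this failure sequence from an IoT device log. Summarise what went\n"
        "wrong, in what order, and the most likely root cause.\n\n" + "\n".join(window)
    )
    return query_model(prompt)
```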
This kind of prompt engineering is where most of the real work in applied AI lives. The model selection and infrastructure are secondary.
Making the answers specific, not generic
A model with no context about the client’s specific devices will give generic answers: “possible memory issue,” “connection timeout,” “check your configuration.” Technically correct, practically useless.
The gap between a generic answer and a useful one is knowledge — the client’s own firmware changelogs, device-specific error code documentation, past incident post-mortems. We built a RAG layer that indexed this documentation locally and retrieved the most relevant sections for each query before the model ran.
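Mechanically, what retrieval changes is just the prompt the model sees. A simplified sketch of that assembly step, with illustrative instructions rather than our production prompt:

```python
def build_analysis_prompt(incident_window: str, retrieved_docs: list[str]) -> str:
    # Retrieved snippets: changelog entries, error-code docs, past post-mortems
    context = "\n\n".join(f"[ref {i + 1}] {doc}" for i, doc in enumerate(retrieved_docs))
    return (
        "You are analysing a failure on an embedded IoT device.\n"
        "Use the reference material where relevant and say which reference supports each claim.\n\n"
        f"Reference material:\n{context}\n\n"
        f"Incident log window:\n{incident_window}\n\n"
        "Explain what failed, the likely cause, and a concrete remediation step."
    )
```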
The difference in output quality was substantial. Instead of “possible MQTT connection failure,” the model would identify the specific firmware version where this error pattern was introduced, reference the relevant changelog entry, and suggest the exact service restart or update path that fixed it in a previous incident. That is the difference between a demo and a tool engineers actually use.
What ran where
The model serving ran on the client’s existing on-premise server — no new hardware purchased. We chose Mistral 7B in Q4_K_M quantization, served via Ollama. On CPU-only hardware of this class, triage summaries come back in roughly 90–120 seconds. For asynchronous post-incident analysis this is acceptable; the alternative was engineers spending 20–40 minutes on manual investigation.
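For the curious, the orchestration talks to Ollama's standard local HTTP API. A minimal sketch of a single triage call, assuming a default Ollama install on the same host; the model tag and options are examples to check against whatever you actually pulled:

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local endpoint

def generate_summary(prompt: str) -> str:
    """Run one triage prompt against the locally served Mistral 7B (Q4_K_M)."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={
            "model": "mistral:7b-instruct-q4_K_M",  # exact tag depends on what you pulled
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": 0.2},  # triage wants consistency, not creativity
        },
        timeout=300,  # CPU-only inference in this class takes minutes, not seconds
    )
    resp.raise_for_status()
    return resp.json()["response"]
```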
The RAG layer ran alongside it: embeddings via nomic-embed-text, indexed into a local Chroma instance. The full stack — model server, embedding model, vector store, orchestration — ran on a single server alongside the OpenSearch cluster, with no external network calls at any point in the pipeline.
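The retrieval side is similarly compact: embeddings from nomic-embed-text via the same Ollama instance, stored and queried in an on-disk Chroma collection. Paths, collection names, and chunking are simplified for illustration:

```python
import chromadb
import requests

OLLAMA_URL = "http://localhost:11434"

def embed(text: str) -> list[float]:
    # nomic-embed-text served by the same local Ollama instance
    resp = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

client = chromadb.PersistentClient(path="./kb")             # local, on-disk index
docs = client.get_or_create_collection(name="device_docs")  # changelogs, error-code docs, post-mortems

def index_chunk(chunk_id: str, text: str, source: str) -> None:
    docs.add(ids=[chunk_id], documents=[text], metadatas=[{"source": source}],
             embeddings=[embed(text)])

def retrieve(query: str, k: int = 4) -> list[str]:
    hits = docs.query(query_embeddings=[embed(query)], n_results=k)
    return hits["documents"][0]  # the k most relevant documentation snippets
```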
What the client got
Engineers went from spending 20–40 minutes on initial triage to validating a pre-generated summary in under 5 minutes. Not every summary was correct: roughly 15% required manual review and correction, typically cases involving firmware versions not yet indexed in the RAG layer. But even those cases had the right log window and relevant documentation already surfaced, making the investigation faster regardless.
More importantly: the client’s security posture did not change. No telemetry left their network. The system was auditable, self-contained, and ran on hardware they already owned.
What about Linux devices?
This project focused on FreeRTOS, but the architecture extends to Linux-based embedded devices. The core difference is the pre-processing pipeline: Linux logs are noisier, more voluminous, and come from multiple sources (kernel ring buffer, systemd journal, application logs, crash dumps). The pipeline needs more stages to filter, parse, and correlate before the model can reason effectively — but the fundamental approach is the same.
For Linux workloads, the pre-processing pipeline becomes the primary engineering investment. The LLM reasoning stage remains similar, and open-source models in the 7B–14B range can handle the task when the input has been properly reduced to a focused incident packet. Whether you use an OSS model on-premise or a cloud API depends on your data residency requirements and the latency you can tolerate — the pipeline design works with either.
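As a hedged illustration of what "more stages" means, here is a sketch of just the first one: merging two of those sources, journald output and an application log, into a single time-ordered stream before any filtering or windowing happens. The application log format is an assumption, and real journal handling involves more than this:

```python
import json
from datetime import datetime

def journal_events(path: str):
    # Output of `journalctl -o json`: one JSON object per line,
    # with the realtime timestamp in microseconds since the epoch.
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            ts = int(entry["__REALTIME_TIMESTAMP"]) / 1_000_000
            yield ts, "journal", entry.get("MESSAGE", "")

def app_events(path: str):
    # Assumed application log format: "<ISO-8601 timestamp with offset> <message>"
    with open(path) as f:
        for line in f:
            stamp, _, msg = line.partition(" ")
            yield datetime.fromisoformat(stamp).timestamp(), "app", msg.strip()

def merged_stream(journal_path: str, app_path: str) -> list[tuple[float, str, str]]:
    # One time-ordered stream across sources; noise filtering, correlation and
    # incident windowing then run on this merged view rather than per-file.
    events = list(journal_events(journal_path)) + list(app_events(app_path))
    return sorted(events, key=lambda e: e[0])
```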
If you have a device fleet generating structured telemetry and engineers spending too much time on triage, this architecture is worth considering, whether your devices run FreeRTOS, embedded Linux, or both. Data residency requirements don’t have to be a blocker. Get in touch — we can usually tell within one conversation whether the architecture fits your setup.