Big Picture · 10 min read

The Chemistry Observability Gap

Engineering teams instrument every system they touch except the one that decides whether they ship. Behavioral team telemetry is the missing observability layer.

By QuestWorks Editorial

TL;DR

Modern organizations run rich observability for code, dashboards for individuals, and analytics for headcount. The team itself, the unit that actually produces the work, is the unobserved layer. Surveys lag, Slack volume is not signal, calendar metadata is not chemistry, and quarterly performance reviews arrive after the breakdown. The behavioral science is settled enough to act on: Woolley's collective intelligence work and Pentland's sociometric research show that turn-taking equality, social sensitivity, and network breadth predict team output more reliably than individual talent does. The frontier is not better surveys. It is team-level behavioral telemetry, aggregated and opt-in, that gives leaders an honest read on how a team coordinates.

The team that instrumented everything except itself

Picture an engineering org most readers will recognize. Datadog watches p95 latency. Sentry surfaces errors before customers see them. Linear tracks every issue from idea to merge. Jira reports sprint health back to leadership in real time. The product runs on rails, and when those rails wobble, the team knows within minutes.

Now picture the same team three weeks before a quarter-end miss. Two senior engineers have stopped speaking in standup beyond a status update. The tech lead has begun routing all decisions through one trusted peer and skipping the rest. A junior who used to ship the most code is silent on the channel and shipping nothing. Nothing in the dashboards moves. Sentry is green, Datadog is green, deploys are green. The first signal that something has gone wrong arrives in the postmortem, weeks after the fact.

The team has full observability on its product and zero observability on itself. That asymmetry is the operating problem of the decade, and it has a name: the chemistry observability gap.

The three intelligence stacks teams already run

Most organizations are already paying for three layers of intelligence about their work. Each layer answers a different question, and the gaps between them are where coordination breakdowns live.

The product and system layer watches code and infrastructure. Honeycomb's 2016 observability codification, followed by Peter Bourgon's three pillars (metrics, logs, traces) in 2017, gave engineering teams the language and instruments. Datadog, Sentry, Grafana, and Linear sit here. Output is measured in error rate, deploy frequency, p95, and lead time. None of it touches how the team coordinates.

The individual layer watches people one at a time. Lattice (roughly 3,700 customers and approximately $127M in 2024 revenue), Culture Amp, 15Five, and the broader engagement stack live here. Output is quarterly performance ratings, sentiment scores, and goal completion. Gartner's 2024 research found only 21% of HR leaders believe their organizations use talent data effectively.

The org and workforce layer watches headcount and skills. Visier's people analytics, Eightfold's talent intelligence (NTT DATA alone runs 20,000-plus profiles on Eightfold), and Workday operate here. Output is attrition risk, span of control, internal mobility, and skill gaps. The HCM market sat at $31.34B in 2024, projected to roughly double by 2032, with people analytics alone at $3.08B. The dollars are real. The unit of analysis is still the individual.

The product layer sees the work. The individual layer sees the person. The workforce layer sees the population. None of them see the team.

What is missing: the team chemistry layer

There is a fourth layer the existing stack does not produce. It answers questions like: how evenly is this team distributing decision-making? Who is speaking and who is silent? How fast does the team recover when a planned release slips or a customer escalation lands at 4pm on a Thursday? Are the right pairs of engineers talking, or is the org chart papering over a network that has fractured into two camps?

These are not soft questions. They are the questions Google's Project Aristotle spent two years answering across 180 teams, with psychological safety emerging as the most important of the five dynamics it identified. They are the questions Anita Woolley's 2010 Science paper on collective intelligence reframed as measurable: a team's collective intelligence factor, the c-factor, was only weakly correlated with members' individual intelligence. It was correlated with social sensitivity (r=0.26), turn-taking equality (lower variance in speaking turns, r=-0.41), and the proportion of women in the group (r=0.23). Riedl and colleagues replicated the c-factor, both in person and online, in PNAS in 2021.

The chemistry layer is not a metaphor. It is a measurable system. The reason most organizations do not measure it is not philosophical. It is instrument failure.

Why the current toolset cannot see chemistry

The four instruments most leaders reach for each fail in a different direction.

Engagement surveys are lagging and recency-biased. Edmondson's 1999 psychological safety scale is foundational and remains widely used; the broader migration from engagement surveys to team intelligence is the cleanest map of where the instrument still fits. The CIPD 2024 evidence review documents its acknowledged weaknesses: social desirability bias, response bias, recency bias, and the recall problem of asking people to summarize a quarter's worth of behavior in five Likert items. Gallup's read on 2024 shows global engagement falling from 23% to 21%, the first decline since 2020, with US engagement at 31%, a ten-year low, and an estimated $438B annual cost. The instrument has had decades. The trend is in the wrong direction.

Slack volume is not signal. Message count, reaction count, and channel activity are easy to measure and useless as proxies for team health. A team going quiet can mean focused execution or psychological withdrawal, and the volume number cannot tell the two apart. "Slack activity is not a chemistry signal" covers this failure mode in depth. Slack itself explicitly does not market its analytics as an evaluation surface, and the platform's terms reflect that.

Calendar metadata is not chemistry either. Microsoft Viva Insights captures meeting load, focus time, and after-hours work patterns. The data is aggregated and de-identified by design, which is the right governance posture, but the data class itself is structural, not behavioral. Knowing a team has too many meetings does not reveal who is dominating those meetings, who never speaks, or how decisions actually land.

Performance reviews are quarterly and retrospective. By the time a manager writes the review, the breakdown is months old and the data is filtered through one observer's recall.

The proof that this matters is structural. Atlassian's State of Teams 2024 found that 64% of teams lack shared goals, while teams with shared goals are 4.6x more productive. Atlassian estimates 25 billion work hours per year lost to coordination breakdowns. That 64% gap is not subtle. Shared goals are the most basic coordination question a team can answer, and the surveys and dashboards did not surface the gap until Atlassian asked teams directly.

What behavioral team telemetry actually looks like

The foundational science is older than most people realize. Sandy Pentland's MIT Human Dynamics Lab spent the 2000s instrumenting teams with sociometric badges that passively captured speech energy, turn-taking, posture, and proximity. Pentland's 2012 HBR summary reported the headline finding: those passive signals predicted team productivity without any access to what people were actually saying. The behavioral pattern alone carried the signal. Woolley's c-factor work pointed at the same destination from the lab side. The 2021 PNAS replication closed most remaining "soft science" objections by showing the effect holds in distributed, online groups.

Translated to a hybrid or distributed team in 2026, behavioral telemetry looks like this:

  • Turn-taking equality. Across decision-making moments, does conversational airtime distribute roughly evenly, or does it concentrate on one or two voices? Woolley's variance measure (r=-0.41 with c-factor) is the canonical reference.
  • Social sensitivity. Do team members read and respond to each other's affective cues, or do they steamroll? Woolley measured this with the Reading the Mind in the Eyes test; the behavioral correlate is whether the team adjusts its tempo when someone pushes back.
  • Network breadth. Are the right cross-pair conversations happening, or is the team running on two or three internal hub nodes? A team whose information flow runs through one person is fragile by definition.
  • Recovery time. When the team hits a disagreement or a planned task slips, how long does it take to converge on a new path? Recovery latency is one of the cleanest leading indicators of team health.

None of these require reading message content or rating individual performance. All of them are observable in aggregate and improve when leaders pay attention to them. Measuring team dynamics well means picking the behavioral signals that map to those four properties and ignoring the volume metrics that do not.
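To make the signals concrete, here is a minimal sketch of how the first three might be computed from an anonymized stream of speaking turns and flagged decision events. The event schema, function names, and thresholds are illustrative assumptions, not any vendor's actual pipeline.

```python
# Minimal sketch of team-level behavioral telemetry, assuming a hypothetical
# anonymized event stream from one facilitated session. The schema, names,
# and thresholds are illustrative, not any vendor's actual API.
from collections import Counter
from itertools import pairwise
from statistics import pvariance

def turn_taking_equality(turns: list[str], members: list[str]) -> float:
    """Variance of speaking-turn shares (lower = more equal airtime).

    A simplified per-session analogue of Woolley's speaking-turn variance;
    silent members count as a zero share, which is exactly the signal.
    """
    if not turns:
        return 0.0
    counts = Counter(turns)
    shares = [counts[m] / len(turns) for m in members]
    return pvariance(shares)

def network_breadth(turns: list[str], members: list[str]) -> float:
    """Fraction of possible member pairs that ever exchanged adjacent turns."""
    pairs = {frozenset(p) for p in pairwise(turns) if p[0] != p[1]}
    possible = len(members) * (len(members) - 1) / 2
    return len(pairs) / possible if possible else 0.0

def recovery_latencies(events: list[tuple[float, str]]) -> list[float]:
    """Seconds from each flagged disagreement to the next convergence event."""
    latencies, opened = [], None
    for t, kind in events:
        if kind == "disagreement" and opened is None:
            opened = t
        elif kind == "converged" and opened is not None:
            latencies.append(t - opened)
            opened = None
    return latencies
```

Run on anonymized participant IDs, nothing here touches message content, and every output is a team-level number a leader can watch week over week.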

Borrowing the observability frame from software

The reason "observability" is the right word here, not "analytics" or "measurement," comes from where the term originated. Observability is borrowed from control-systems engineering: a system is observable if its internal state can be inferred from its external outputs. Honeycomb's 2016 codification adapted the term for distributed software, where logs alone could not explain emergent failures.

The same translation applies to teams. The analogy is direct enough to be useful:

  • Surveys are user complaints. They are self-reported, intermittent, lagging, and only as useful as the user's ability to articulate the underlying problem.
  • Dashboards are the check-engine light. They tell you something is wrong after the fact. They tell you nothing about why.
  • Behavioral telemetry is traces. It captures the actual flow of activity through the system in time, in aggregate, with enough resolution to reconstruct what happened.

The discipline of observability also brought a useful methodological humility to engineering: you cannot debug what you cannot see, and you cannot improve what you cannot measure. The same humility is overdue for team systems. Team intelligence and people analytics answer different questions, and the difference matters most when a leader is trying to understand a specific team's behavior this week.

The surveillance objection, taken seriously

Every time behavioral data enters the workplace, the surveillance question arrives with it. It should. Pentland's badges, Humanyze's later product, and Microsoft's 2020 Productivity Score all drew the critique, and in each case the critique was earned by specific design choices that exposed individual-level data without sufficient consent or governance.

The mitigations are now well understood and not optional. Aggregate-only outputs at the team level. No individual scoring. Opt-in participation with clear ability to leave. Strengths-based framing, never deficit framing. A clean separation between coaching (private, individual) and reporting (aggregate, leader-visible). Published data model and retention policies.
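One way to make the aggregate-only constraint concrete in code: a reporting layer that refuses to emit any statistic for a cohort below a minimum size, in the spirit of k-anonymity. A minimal sketch with hypothetical names:

```python
# Minimal sketch of an aggregate-only reporting guard (hypothetical names).
# Per-member values enter, only suppression-checked team aggregates leave,
# so no individual-level score ever reaches a leader-facing report.
from statistics import mean

MIN_COHORT = 3  # illustrative threshold; real policies should be published

def team_report(metric_by_member: dict[str, float]) -> dict | None:
    """Return a team-level aggregate, or None if the cohort is too small."""
    values = list(metric_by_member.values())
    if len(values) < MIN_COHORT:
        return None  # suppress: aggregation would not protect individuals
    return {
        "cohort_size": len(values),           # count only, never identities
        "mean": mean(values),
        "spread": max(values) - min(values),  # dispersion without attribution
    }
```

The same guard belongs at every boundary: coaching output stays with the individual, and only the suppression-checked aggregate crosses into reporting.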

Team chemistry observability is not a softer label for surveillance. The design constraints listed above eliminate the failure modes that defined the surveillance critique. The category fails when those constraints are skipped. It succeeds when they are foundational.

The chemistry layer of the intelligence stack

The right way to think about where this fits is as a tier in a stack that already exists. Workforce intelligence (Visier) answers questions about headcount, attrition, and span. People intelligence (Eightfold) answers questions about skills, matching, and internal mobility. The missing tier above them answers questions about how a specific team is functioning together this week. Surveys, the legacy instrument, sit at the bottom as the slowest layer. Behavioral team telemetry sits on top as the fastest.

QuestWorks operates at that tier. Teams of two to five run 25-minute AI-facilitated sessions on QuestWorks' own cinematic, voice-controlled platform. The system observes how the team coordinates, in the spirit of the Woolley and Pentland canon: turn-taking patterns, recovery after friction, network breadth across the players. Outputs flow into a weekly leader-facing team health report and an individual QuestDash that surfaces strengths. Nine HeroTypes are public; coaching with HeroGPT remains private. Slack and Microsoft Teams handle install, invites, leaderboards, and coaching, while the simulation runs on QuestWorks' own platform. Participation is voluntary and not tied to performance reviews. Pricing is $14 per user per month for the first 50 companies in the Founder's Circle, $20 per user per month afterward, with a 10-day trial.

The category framing is straightforward. Team intelligence, powered by play. The instrument is behavioral. The unit of analysis is the team. The observability gap is the problem worth solving.

Frequently Asked Questions

What is team chemistry observability?

Team chemistry observability is the practice of inferring how a team is actually coordinating from the behavior it produces, rather than from the sentiment it self-reports. Borrowed from software observability, it treats team dynamics as an instrumentable system. The signals are aggregate and behavioral: turn-taking equality, social sensitivity, network breadth across the team, and recovery time after conflict or pressure. The output is a team-level signal that complements surveys and people analytics rather than replacing them.

Why aren't engagement surveys enough?

Surveys capture stated belief recalled after events have already shaped opinion. The Edmondson psychological safety scale, foundational to the field, has acknowledged exposure to social desirability bias, response bias, and recency bias. Atlassian's State of Teams 2024 found 64% of teams lacked shared goals, a structural breakdown that engagement instruments routinely failed to catch. Gartner's 2024 survey reported only 21% of HR leaders believe their organizations use talent data effectively. The instrument has had decades to improve, and the gap persists.

Is behavioral team telemetry just surveillance?

It does not have to be, and the design choices make the difference. The MIT Human Dynamics Lab's sociometric badge research and Microsoft's 2020 Productivity Score both drew the surveillance critique when individual-level views were exposed. The mitigations are now well understood: team-level aggregation by default, no individual scoring, opt-in participation, strengths-based outputs, and clear data boundaries between coaching and reporting. Surveillance is a design failure, not an inherent property of behavioral signal.

How is team intelligence different from people analytics?

People analytics measures individuals and aggregates them into population-level trends, anchored in tools like Visier, Eightfold, and Workday. Team intelligence treats the team itself as the unit of analysis. The metrics differ: coordination quality, role clarity under load, network breadth, and recovery time after conflict. The two layers are complementary. Workforce intelligence answers headcount and attrition questions; people intelligence answers skills and matching questions; team intelligence answers how a specific team functions together this week.

How does QuestWorks measure team chemistry?

QuestWorks observes how teams of two to five coordinate during 25-minute AI-facilitated sessions on its own cinematic, voice-controlled platform. The system captures behavioral signal in the spirit of the Woolley collective intelligence research: turn-taking patterns, network breadth, decision velocity, recovery after friction. Outputs flow into a weekly team-level report visible to leaders and a player-level QuestDash that is strengths-based and individual. Coaching with HeroGPT remains private. Participation is voluntary and not tied to performance reviews.

Ready to Level Up Your Team?

10-day free trial. Install in under a minute.
