3 Data Ingestion
In this chapter, we walk through how raw event logs from ultimate games are processed into structured datasets ready for analysis.
3.1 Event Processing Overview
We process each team’s event stream independently, iterating over events in chronological order. For every event, we check whether it pertains to one of our four key categories: pulls, penalties, blocks, or throws. When it does, we record a row in the corresponding table, pulling all the relevant information from the event and combining it with contextual details from the current state (such as which players are on the field, the current possession number, and score).
The game state is updated continuously during this loop. For example, a turnover increments the number of possessions in the point and updates player positioning, while a timeout can change who is on the field. This approach allows us to cleanly separate events into different structured datasets, while also capturing the temporal and tactical flow of the game.
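The loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the event-type strings, the field names, the `GameState` class, and the `process_events` function are all hypothetical, and only two kinds of state update are shown.

```python
from dataclasses import dataclass

# Hypothetical mapping from event type to output table; the real log
# schema is not specified in this chapter.
TRACKED = {"pull": "pulls", "penalty": "penalties",
           "block": "blocks", "throw": "throws"}


@dataclass
class GameState:
    on_field: tuple = ()
    possession_num: int = 1
    score: tuple = (0, 0)


def process_events(events):
    """Route tracked events into per-category tables with state context."""
    state = GameState()
    tables = {name: [] for name in TRACKED.values()}
    for event in events:  # assumed to be in chronological order
        kind = event["type"]
        # State-changing events (hypothetical names) update the context.
        if kind == "line_set":
            state.on_field = tuple(event["players"])
        elif kind == "turnover":
            state.possession_num += 1
        # Tracked events become rows, enriched with the current state.
        if kind in TRACKED:
            row = dict(event)
            row.update(on_field=state.on_field,
                       possession_num=state.possession_num,
                       score=state.score)
            tables[TRACKED[kind]].append(row)
    return tables
```

Keeping the state in one object means every table row is stamped with the same context at the moment the event is seen, which is what makes the tables comparable afterwards.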
3.1.1 Handling Derived and Ambiguous Information
Some information isn’t available directly in the event logs and needs to be inferred. For example:
- Block location is not logged explicitly. Instead, we assume the block occurred where the next thrower began their throw—if such a throw exists. This approximation generally works well, but it can introduce noise when the disc is centered as the result of a foul or repositioned after the turnover.
- Penalty locations are inferred by shifting the throw location ±10 yards in the direction of possession. This is a rough heuristic and can be inaccurate, especially on penalties that result in centering the disc.
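As an illustration of the block heuristic, the sketch below looks ahead from each block to the next throw and copies that throw's starting coordinates. The field names (`type`, `x`, `y`) and the function name are assumptions, as is the choice to stop searching once a new pull is reached, which is one way to interpret "if such a throw exists".

```python
def infer_block_locations(events):
    """Give each block the start location of the next throw, if any."""
    located = []
    for i, event in enumerate(events):
        if event["type"] != "block":
            continue
        location = None
        for later in events[i + 1:]:
            if later["type"] == "pull":
                # Assumption: a new point started, so no throw followed.
                break
            if later["type"] == "throw":
                location = (later["x"], later["y"])
                break
        located.append({**event, "location": location})
    return located
```

Blocks with no following throw keep a location of `None` rather than a guessed value, so downstream analyses can filter them out explicitly.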
3.1.2 Row Ordering and Game Reconstruction
Every event we process is assigned a unique order number, incremented as we iterate. This makes it possible to reconstruct the entire game as a single, unified timeline by combining the throws, blocks, pulls, and penalties tables into one chronological tabular view.
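A minimal sketch of this reconstruction, assuming each row carries its order number in a field named `order` (a hypothetical name) and that the tables are held as lists of dictionaries:

```python
def build_timeline(tables):
    """Merge per-category tables into one list sorted by order number."""
    rows = []
    for name, table in tables.items():
        for row in table:
            # Remember which table each row came from.
            rows.append({**row, "table": name})
    return sorted(rows, key=lambda r: r["order"])
```

Because the order number is unique per event, the sort is unambiguous and the merged view does not depend on the order in which the tables are passed in.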
3.1.3 Defensive Line Table
We build a separate defensive table that tracks which defenders are on the field for every possession. Using the state environment, we attach possession identifiers (constructed as game_quarter - quarter_point - possession_num) to each row. After the whole game is processed, this identifier is used to match each throw to the specific defensive unit it was played against.
Within each possession, we also track line changes, which help us capture substitutions or injury-related swaps that occur mid-possession. In rare cases, these updates can cause mismatches—for example, when one team records an injury and the other does not, or when possession numbers are slightly off due to human error. These mismatches are uncommon and typically resolve without significantly affecting analysis.
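The identifier and the matching step might look roughly like this. The exact string format of the identifier, the field names, and both function names are assumptions, and the sketch keeps only one defensive row per possession, so it ignores mid-possession line changes.

```python
def possession_id(game_quarter, quarter_point, possession_num):
    """Build a possession key; the separator used here is an assumption."""
    return f"{game_quarter}-{quarter_point}-{possession_num}"


def attach_defense(throws, defensive_rows):
    """Attach the defensive unit to each throw via the possession key."""
    # If a possession has several rows, the last one wins in this sketch.
    by_id = {row["possession_id"]: row["defenders"] for row in defensive_rows}
    return [{**t, "defenders": by_id.get(t["possession_id"])} for t in throws]
```

Throws whose identifier has no counterpart in the defensive table receive `None`, which is how the mismatches described above would surface in this simplified version.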
3.1.4 Estimating Time Left
Most events in the logs do not include time information, but a subset of events—like point starts and timeouts—do list the time left in the quarter. To fill in the gaps, we interpolate between these known times. For any stretch between two recorded time entries, we evenly distribute time across the throws that occurred in between, assuming each took the same duration. This gives us a reasonable approximation of the time left for each throw.
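One possible reading of this interpolation is sketched below. It assumes the rows belong to a single quarter, that unknown times are stored as `None` in a hypothetical `time_left` field (in seconds), and that the value assigned to a throw is the time left when that throw starts. The text does not specify that last convention, so it is a choice made for the example.

```python
def interpolate_time_left(rows):
    """Fill time_left for throws between two recorded times."""
    filled = [dict(r) for r in rows]  # leave the input untouched
    anchors = [i for i, r in enumerate(filled) if r["time_left"] is not None]
    for a, b in zip(anchors, anchors[1:]):
        throws = [i for i in range(a + 1, b) if filled[i]["type"] == "throw"]
        if not throws:
            continue
        t_start = filled[a]["time_left"]
        t_end = filled[b]["time_left"]
        # Each throw is assumed to take the same share of the elapsed time.
        step = (t_start - t_end) / len(throws)
        for k, i in enumerate(throws):
            filled[i]["time_left"] = t_start - k * step
    return filled
```

For example, four throws between a point start at 100 seconds and a timeout at 60 seconds are each given 10 seconds, so they start at 100, 90, 80, and 70. Throws after the last recorded time are left unfilled in this sketch.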
To improve accuracy, we run a sliding window check within each quarter to ensure that time values are monotonically decreasing. Occasionally, due to logging errors, a time might increase instead of decrease—this is often due to manual edits or delayed recording. We remove these out-of-sequence time entries to preserve consistency.
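The details of the sliding window are not given here, so the sketch below substitutes a simpler single-pass filter: a recorded time is dropped (set to `None`) whenever it is larger than the most recent time that was kept. The function name and the list-of-numbers input are assumptions.

```python
def drop_out_of_sequence_times(times):
    """Blank out recorded times that increase within a quarter."""
    cleaned = []
    last_kept = None
    for t in times:
        if t is not None and last_kept is not None and t > last_kept:
            # Time left went up: treat as a logging error and discard.
            cleaned.append(None)
        else:
            cleaned.append(t)
            if t is not None:
                last_kept = t
    return cleaned
```

A greedy pass like this trusts earlier entries over later ones, so an erroneously low value would cause the correct entries after it to be discarded. A windowed check can be more forgiving in that case.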
Finally, for double overtime—where time is often not recorded—we set the time left to 0 for all events in that period. This avoids introducing misleading values into analyses that depend on time pressure or end-of-quarter strategy.
3.1.5 Career Stats Aggregation
Once all game events are processed, we use the player stats to aggregate career stats for each player by expanding and summarizing individual performances across all available years. We generate player-level summaries for throwing, receiving, blocking, and other key contributions.
For simplicity and consistency, we limit this aggregation to games played from 2021 onward, when play-by-play data began to be recorded. This cutoff ensures that the data is standardized in terms of formatting, event definitions, and completeness. Earlier seasons often have inconsistencies or missing fields that make direct comparisons unreliable.
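A minimal sketch of the aggregation with the 2021 cutoff, assuming per-game player rows shaped as dictionaries with hypothetical `player_id`, `year`, and `stats` fields, where `stats` maps a stat name to a count:

```python
from collections import defaultdict


def aggregate_career_stats(game_stats, first_year=2021):
    """Sum per-game stats into career totals, skipping earlier seasons."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in game_stats:
        if row["year"] < first_year:
            continue  # seasons before the play-by-play cutoff are excluded
        for stat, value in row["stats"].items():
            totals[row["player_id"]][stat] += value
    # Convert nested defaultdicts to plain dicts for the caller.
    return {player: dict(stats) for player, stats in totals.items()}
```

Only counting stats are summed here. Rate stats such as completion percentage would need to be recomputed from the summed counts rather than averaged across games.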
3.2 Final Output
After processing the raw event data through the steps outlined in this chapter, the final output consists of several structured tables. The main ones are summarized below:
Throws Table: Contains detailed records of each throw, including its thrower, receiver, throw angle, throw distance, and relevant contextual features such as the defensive line. It also flags drops and stalls as turnovers.
Blocks Table: Captures block events and approximates block locations based on thrower starting positions.
Pulls Table: Details the pull events, providing information on the puller, pull location, and hang time.
Penalties Table: Tracks penalty events and includes the inferred event location.