Crowdsourced Air Traffic Data from the OpenSky Network 2019–20

The OpenSky Network is a non-profit association that crowdsources the global collection of live air traffic control data broadcast by airplanes and makes it available to researchers. OpenSky’s data has been used by over a hundred academic groups in the past five years, with popular research applications ranging from improved weather forecasting to climate analysis. With the COVID-19 outbreak, the demand for live and historic aircraft flight data has surged further. Researchers around the world use air traffic data to comprehend the spread of the pandemic and analyze the effects of the global containment measures on economies, climate and other systems. With this work, we present a comprehensive air traffic dataset, derived and enriched from the full OpenSky data and made publicly available for the first time (Olive et al. (2020), DOI: https://doi.org/10.5281/zenodo.3928564). It spans all flights seen by the network’s more than 3000 members between 1 January 2019 and 1 July 2020. Overall, the archive includes 41,900,660 flights, from 160,737 aircraft, which were seen to frequent 13,934 airports in 127 countries.


Introduction
In this paper, we present a dataset of global flight movements derived from crowdsourced air traffic control data collected by the OpenSky Network (Schäfer et al. (2014)). These data are widely used in many fields, including several areas pertaining to Earth System Sciences. With the spread of COVID-19, they are also widely used to understand the pandemic and its effects.
OpenSky flight data has regularly been used in analyzing environmental issues such as noise emissions (Tengzelius and Abom (2019)) or black carbon particle emissions (Zhang et al. (2019)), to name but a few. In the wake of the pandemic, OpenSky has received a surge of more than 70 requests for air traffic data specifically related to COVID-19. The research behind these requests falls largely into two areas: epidemiological modelling and understanding the systemic impact of the pandemic.
The first category, modelling of the possible spread of COVID-19, was of crucial interest in the early stages of the pandemic and will again gain importance for estimating travel safety in the future. The utility of flight data for this purpose was illustrated, for example, in widely circulated studies such as Bogoch et al. (2020), but flight data have been known to be useful in the context of pandemics for much longer (e.g., Mao et al. (2015)).
The second main category comprises the analysis of the socio-ecological impact of COVID-19 and of the measures implemented to fight it. It uses flights, for example, as an indicator of economic activity (at a given airport, in a region, or globally), as illustrated in Miller et al. (2020). Examples of such use of data provided by OpenSky can be found in Bank of England, Monetary Policy Committee (2020), International Monetary Fund (2020) or United Nations Department of Economic and Social Affairs (2020).
Flight data can further be used to understand the impact of the sudden drop in air traffic on many global systems. For example, Lecocq et al. (2020) recently employed OpenSky data to analyze the impact of COVID-19 mitigation measures on high-frequency seismic noise, and we received several requests relating to research specifically on the impact of COVID-19. The present dataset, available at https://doi.org/10.5281/zenodo.3928563, was created to make it easier for researchers to access air traffic data for their own systemic analyses.

Background
Crowdsourced research projects are a form of 'citizen science' whereby members of the public can join larger scientific efforts by contributing to smaller tasks. In the past, such efforts have taken many forms, including attempting to detect extra-terrestrial signals (UC Berkley (2019)) or exploring protein folding for medical purposes (Pande (2019)). Typically, the projects form distributed computing networks with results being fed to a central server.
In a parallel development, software-defined radios (SDRs) have become readily available and affordable over the past decade. SDR devices present a significant change to traditional radios, in that wireless technologies can be implemented as separate pieces of software and run on the same hardware. This has greatly reduced the barriers to entry, so many more users can now take part in wireless projects such as crowdsourced sensor networks at little cost. This development has given rise to several global crowdsourced flight tracking efforts, from commercial to enthusiast and research use.
The concept of flight tracking itself is based on several radar technologies. Traditionally, these were expensive and inaccurate non-cooperative radars developed for military purposes. With the explosive growth of global civil aviation, however, more accurate cooperative radar technologies have been deployed to ensure safety and efficiency of the airspace.
For this dataset of flight movements, we use the data broadcast by aircraft with the modern Automatic Dependent Surveillance-Broadcast (ADS-B) protocol. This data includes position, velocity, identification and flight status information broadcast up to twice a second (see Schäfer et al. (2014)). The protocol is being made mandatory in many airspaces as of 2020, resulting in broad equipage among larger aircraft from industrialized countries and emerging economies as described by Schäfer et al. (2016). As a non-profit organisation, OpenSky grants researchers from academic and other institutions direct access to its database on request (Schäfer et al. (2014)).
Along with the global sensor coverage, the database has initially grown exponentially since its inception in 2014 (see Fig. 1) and currently comprises over 23 trillion messages, taking up around 2 Petabytes. In peak pre-pandemic times, almost 100,000 flights were tracked per day. The raw data available in the Impala database has been used in more than 100 academic publications as of July 2020. However, despite available application programming interfaces and third-party tools, access to this data requires a significant investment of time and resources to understand the availability and underlying structure of the database. With this dataset and its accompanying descriptor, we want to address this accessibility issue and make a relevant part of the OpenSky Network flight metadata accessible to all researchers.

Crowdsourced Collection
The raw data used to generate the dataset was recorded by more than 3000 crowdsourced sensors of the OpenSky Network. As the data comes from a crowdsourced system of receivers, it faces the numerous challenges and difficulties found in such an organically grown, non-controlled set of receivers. However, it is the only feasible option for the large-scale collection of open research data, as collecting data from a synchronized and controlled deployment would be less flexible and less widely applicable, in particular for a non-profit research endeavour. Conversely, due to the high sensor density and high level of redundancy in the OpenSky Network, many well-covered regions of this data achieve the quality of controlled deployments on a nation-wide level in many countries.
The true coverage of the network, i.e. actually received positions of airplanes, is illustrated in Fig. 3, both for 1 January 2019 and during the pandemic on 1 May 2020. Historic coverage for any given day is visible on https://opensky-network.org/network/facts.

Derivation of Flights
We define a flight for the purpose of this dataset as the time between the first received ADS-B contact of one specific aircraft and the last. A flight must last at least 15 minutes. If a flight leaves OpenSky's coverage range for more than 10 minutes, it is in principle considered finished at the point of last contact. To prevent counting flights multiple times if they return into the coverage range after more than 10 minutes (e.g., for any flight over the Atlantic Ocean), we apply a simple check: if time, distance and reported velocity match, the two parts are considered segments of the same flight. If not, it is assumed that the aircraft has landed at some point.
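The segmentation rule can be illustrated with a minimal sketch. It is not the pipeline used to produce the dataset: the record layout (`time`, `lat`, `lon`, `velocity`), the relative tolerance of 30 % and the great-circle plausibility test are assumptions made for illustration, since the text does not specify how the match of time, distance and velocity is evaluated.

```python
import math

GAP_S = 10 * 60           # coverage gap after which a flight is provisionally closed
MIN_DURATION_S = 15 * 60  # minimum duration of a flight
TOLERANCE = 0.3           # hypothetical relative tolerance for the plausibility check
EARTH_RADIUS_M = 6371000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 positions."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def same_flight(prev, nxt, tolerance=TOLERANCE):
    """Assumed check: does the distance covered during the gap match
    the last reported velocity (m/s) multiplied by the gap duration?"""
    dt = nxt["time"] - prev["time"]
    expected = prev["velocity"] * dt
    actual = haversine_m(prev["lat"], prev["lon"], nxt["lat"], nxt["lon"])
    return abs(actual - expected) <= tolerance * expected


def split_flights(reports):
    """Split the time-sorted reports of ONE aircraft into flights."""
    if not reports:
        return []
    flights, current = [], [reports[0]]
    for prev, nxt in zip(reports, reports[1:]):
        gap = nxt["time"] - prev["time"]
        # A long gap only ends the flight if the reappearance is implausible.
        if gap > GAP_S and not same_flight(prev, nxt):
            flights.append(current)
            current = []
        current.append(nxt)
    flights.append(current)
    # Drop anything shorter than the minimum flight duration.
    return [f for f in flights if f[-1]["time"] - f[0]["time"] >= MIN_DURATION_S]
```

In this sketch, an aircraft that reappears where its last reported velocity would have taken it stays in the same flight, whereas one that reappears near its last known position after a long gap starts a new flight.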
The destination airport candidates are derived from these identified flight trajectories as follows. If the last position seen is above 2500 meters, no candidate is defined and the value is set to 'NULL'. Otherwise, the descending trajectory is extrapolated towards the ground and the Cartesian distance to the closest airports is computed. If there is no airport within 10 kilometers, the value is set to 'NULL'. Otherwise, the closest identified airport is listed as the destination airport. The procedure applies in reverse for the origin airport candidates.
We note that this approach is necessarily an extrapolation and airports may in some cases be wrongly identified if contact is lost before the aircraft reaches the ground, in particular where several airports are close by.
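The airport matching described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the linear extrapolation from the last two positions, the local equirectangular projection used for the Cartesian distance, and the airport codes `AAA`, `BBB` and `CCC` in the usage note are all hypothetical.

```python
import math

MAX_ALT_M = 2500.0     # above this altitude no candidate is assigned
MAX_DIST_M = 10_000.0  # maximum distance between ground point and airport
EARTH_RADIUS_M = 6371000.0


def to_xy(lat, lon, lat0, lon0):
    """Local equirectangular projection around (lat0, lon0), in meters."""
    x = math.radians(lon - lon0) * math.cos(math.radians(lat0)) * EARTH_RADIUS_M
    y = math.radians(lat - lat0) * EARTH_RADIUS_M
    return x, y


def extrapolate_to_ground(p_prev, p_last):
    """Extend the last descending segment linearly until altitude 0.
    Points are (lat, lon, altitude_m). If the aircraft is not descending,
    the last seen position is used unchanged."""
    lat1, lon1, alt1 = p_prev
    lat2, lon2, alt2 = p_last
    if alt2 >= alt1 or alt2 <= 0:
        return lat2, lon2
    k = alt2 / (alt1 - alt2)  # how many more segments until the ground
    return lat2 + k * (lat2 - lat1), lon2 + k * (lon2 - lon1)


def destination_candidate(p_prev, p_last, airports):
    """Return the code of the closest airport, or None (the 'NULL' case).
    `airports` maps a code to a (lat, lon) tuple."""
    if p_last[2] > MAX_ALT_M:
        return None
    lat_g, lon_g = extrapolate_to_ground(p_prev, p_last)
    best, best_d = None, MAX_DIST_M
    for code, (lat, lon) in airports.items():
        d = math.hypot(*to_xy(lat, lon, lat_g, lon_g))
        if d <= best_d:
            best, best_d = code, d
    return best
```

For example, `destination_candidate((0.0, 0.0, 1000.0), (0.0, 0.05, 500.0), {"AAA": (0.0, 0.10)})` extrapolates the descent to longitude 0.10 and returns `"AAA"`. Passing the first two positions of a flight in reverse order gives a comparable sketch for the origin candidate.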

Data Cleaning
To make the data accessible and meet the requirements, complex pre-processing is needed to abstract from most system aspects, reduce the data volume, and eliminate the need to understand all system aspects in order to use the data. Moreover, the information quality needs to be assessed and indicated, allowing researchers to choose subsets that match their own requirements. Therefore, we performed the following processing steps to prepare the unstructured OpenSky Network data and create a well-defined dataset for scientific analysis.

Decoding
Decoding ADS-B correctly is a complex task. Although libraries and tutorials such as Sun et al. (2019) exist, it remains a tedious task that requires a deep understanding of the underlying link layer technology Mode S. Moreover, the sheer volume of data collected by OpenSky (about 120 GB of raw data per hour) makes this process challenging and resource-intensive.
Therefore, we relieve researchers from this burden by providing readily decoded information such as position in WGS84 coordinates, altitude information in meters, and the unique aircraft identifier as a 24-bit hexadecimal number.
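As a small illustration of the identifier format, the following sketch converts between the hexadecimal text form and an integer and checks the 24-bit range. The function names and the zero-padded lowercase output are assumptions made for this example, not a description of how the dataset files are written.

```python
def parse_icao24(text):
    """Parse a 24-bit aircraft identifier given as hexadecimal text."""
    value = int(text.strip(), 16)
    if not 0 <= value < 2 ** 24:
        raise ValueError(f"not a 24-bit identifier: {text!r}")
    return value


def format_icao24(value):
    """Format an identifier as six hexadecimal digits (assumed lowercase)."""
    if not 0 <= value < 2 ** 24:
        raise ValueError(f"not a 24-bit identifier: {value!r}")
    return f"{value:06x}"
```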

Timestamps
Timestamps are provided in different resolutions and units, depending on the receiving sensor type. For the purposes of this dataset, we use the time at which the messages were received at the server, with one-second precision, which we deem more than sufficient for the intended macro use cases.

Deduplication
OpenSky's raw data is merely a long list of single measurements by single sensors. However, as most localization algorithms rely on signals being received by multiple receivers, we grouped multiple receptions belonging to the same transmission based on their continuous timestamp and signal payload. This process is called deduplication. Note that although most position reports are unique, a small number of falsely grouped measurements remains as noise in the data.
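The grouping idea can be sketched as follows. The one-second window, the tuple layout `(timestamp, sensor_id, payload)` and the payload strings in the usage note are assumptions for illustration; the text does not state the actual grouping threshold.

```python
WINDOW_S = 1.0  # hypothetical maximum spread of receptions of one transmission


def deduplicate(receptions, window_s=WINDOW_S):
    """Group receptions of the same transmission.

    `receptions` is an iterable of (timestamp, sensor_id, payload) tuples.
    Receptions with an identical payload whose timestamps lie within
    `window_s` of the first reception of a group are merged into it.
    """
    groups = []
    latest = {}  # payload -> most recent group with that payload
    for ts, sensor, payload in sorted(receptions):
        group = latest.get(payload)
        if group is None or ts - group["time"] > window_s:
            group = {"payload": payload, "time": ts, "sensors": []}
            latest[payload] = group
            groups.append(group)
        group["sensors"].append(sensor)
    return groups
```

Called with two receptions of the payload `"8d4840d6"` at 10.00 s and 10.02 s from sensors `"s1"` and `"s2"`, the sketch returns a single group listing both sensors. A later reception of the same payload at 15.0 s starts a new group. Identical payloads sent in quick succession would be merged incorrectly, which corresponds to the falsely grouped measurements noted above.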

Quality Assurance
Crowdsourcing creates potential issues regarding the quality and integrity of location and timing information of certain aircraft and sensors. Such issues can range from faults in the transmission chain (i.e., aircraft transponder, ground station) to malicious injection of falsified aircraft data. To allow researchers to ignore these effects while still preserving them as a potential subject of research, OpenSky offers integrity checks to verify and judge the data correctness (see Schäfer et al. (2018)). We also note that the abstracted nature of this dataset makes it more robust to any issues in the first place, as low-quality data will be averaged out over time by the many involved receivers.

Data Enrichment
We use the OpenSky aircraft database to add aircraft types to our flight data, and access publicly available open application programming interfaces (APIs) to match the commercial flight identifier, where available.
8. lastseen: The UTC timestamp of the last message received by the OpenSky Network.
9. day: The UTC day of the last message received by the OpenSky Network.
In the following, we provide some statistics showing that our flights dataset reflects the air traffic reality, as well as different time series showing the effect of the COVID-19 pandemic at different airports and for different airlines.
In a similar fashion, Figure 5 shows COVID-19's normalized impact on different airlines across the globe. Among noticeable trends, we can identify: sharply decreasing patterns for all regular airlines in March, with stronger effects for European airlines compared to American and Asian airlines; almost all low-cost airlines practically stopped all business activities (with the exception of the Japanese airline Peach); a very slow recovery for most airlines and regions beginning in May and June, with some rebounding more strongly, for example Air New Zealand (ANZ); cargo airlines show no negative impact of the crisis, and some may even show a slight upward trend.
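Time series of this kind can be derived from the flight records with a few lines of code. The sketch below is only an illustration: apart from `day`, the field name used in the usage note (`origin`) is an assumption, and dividing by the mean daily count over a baseline period is one plausible normalization, not necessarily the one used for the figures.

```python
from collections import Counter, defaultdict


def daily_counts(flights, group_field):
    """Count flights per day for each value of `group_field`.

    `flights` is an iterable of dicts with a 'day' entry (ISO date string).
    Records whose group value is None (the 'NULL' case) are skipped.
    """
    counts = defaultdict(Counter)
    for flight in flights:
        group = flight.get(group_field)
        if group is None:
            continue
        counts[group][flight["day"]] += 1
    return counts


def normalized_series(day_counts, baseline_days):
    """Divide each daily count by the mean count over `baseline_days`."""
    baseline = sum(day_counts.get(d, 0) for d in baseline_days) / len(baseline_days)
    if baseline == 0:
        return {}
    return {day: n / baseline for day, n in sorted(day_counts.items())}
```

With records such as `{"day": "2020-01-01", "origin": "AAA"}`, calling `daily_counts(flights, "origin")` yields the departures per day for each origin candidate, and `normalized_series` expresses them relative to a pre-pandemic baseline.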

Figure 2 illustrates the principle of OpenSky in the abstract: The data is broadcast by ADS-B-equipped aircraft and received by crowdsourced receivers on the ground, which have typical ranges of 100-500 km in a line-of-sight environment. The data is then sent to the OpenSky Network, where it is processed and stored in a Cloudera Impala database. The network records the payloads of all 1090 MHz secondary surveillance radar downlink transmissions of aircraft along with the timestamps and signal strength indicators provided by each sensor on signal reception. This data collection includes the exact aircraft locations broadcast at 2 Hz by transponders using the ADS-B technology.

Figure 2. High-level illustration of the flight data crowdsourcing process, including map of active receivers on July 1, 2020. © OpenStreetMap contributors 2020. Distributed under a Creative Commons BY-SA License.

Figure 4 shows a time series of airport activity (as measured by departures) in four different regions based on data from 1 January to 30 April 2020. The impact of the pandemic (or rather the measures to contain it) can be seen clearly in all four. For example, the data shows:
- a slow decrease from February in several East-Asian airports (even earlier in Hong Kong);
- European airports decreasing sharply from early March onward;

Figure 4. Comparison of flight numbers at various airports as seen by the OpenSky Network during 2020.

Table 1. Overview of the dataset files and content metadata.

Table 2. Flight distribution in the dataset in January 2020.

Table 2 shows the distribution of the top 25 aircraft types in the flight dataset over one month (January 2020). Overall, the top models are dominated by the four largest commercial aircraft manufacturers: Boeing with 8 different types accounting for 663,308 flights; Airbus with 7 models and 766,499 flights; Embraer with 4 models (120,163 flights) and Bombardier (3

Table 3. Top 20 airports based on recorded flight destinations in January 2020.

Table 3 shows the distribution of the top 20 airports in the flight dataset in January 2020 (based on recorded flight destinations). Reflecting both global air traffic realities and OpenSky's coverage focus, 13 of these airports are in the United States, including the 7 busiest with regard to landings. Several of the major hubs in Europe (Frankfurt, London Heathrow, Paris Charles de Gaulle) and Asia (Kuala Lumpur, Dubai and Delhi) make up the remaining six.