OnTime

    • import from raw data
    • download of prepared partitions

    Downloading data:

    Creating a table:

    1. `Year` UInt16,
    2. `Quarter` UInt8,
    3. `Month` UInt8,
    4. `DayofMonth` UInt8,
    5. `DayOfWeek` UInt8,
    6. `FlightDate` Date,
    7. `UniqueCarrier` FixedString(7),
    8. `AirlineID` Int32,
    9. `Carrier` FixedString(2),
    10. `TailNum` String,
    11. `FlightNum` String,
    12. `OriginAirportID` Int32,
    13. `OriginAirportSeqID` Int32,
    14. `OriginCityMarketID` Int32,
    15. `Origin` FixedString(5),
    16. `OriginCityName` String,
    17. `OriginState` FixedString(2),
    18. `OriginStateFips` String,
    19. `OriginStateName` String,
    20. `OriginWac` Int32,
    21. `DestAirportID` Int32,
    22. `DestAirportSeqID` Int32,
    23. `DestCityMarketID` Int32,
    24. `Dest` FixedString(5),
    25. `DestCityName` String,
    26. `DestState` FixedString(2),
    27. `DestStateFips` String,
    28. `DestStateName` String,
    29. `DestWac` Int32,
    30. `CRSDepTime` Int32,
    31. `DepTime` Int32,
    32. `DepDelay` Int32,
    33. `DepDelayMinutes` Int32,
    34. `DepDel15` Int32,
    35. `DepartureDelayGroups` String,
    36. `DepTimeBlk` String,
    37. `TaxiOut` Int32,
    38. `WheelsOff` Int32,
    39. `WheelsOn` Int32,
    40. `TaxiIn` Int32,
    41. `CRSArrTime` Int32,
    42. `ArrTime` Int32,
    43. `ArrDelay` Int32,
    44. `ArrDelayMinutes` Int32,
    45. `ArrDel15` Int32,
    46. `ArrivalDelayGroups` Int32,
    47. `ArrTimeBlk` String,
    48. `Cancelled` UInt8,
    49. `CancellationCode` FixedString(1),
    50. `Diverted` UInt8,
    51. `CRSElapsedTime` Int32,
    52. `ActualElapsedTime` Int32,
    53. `AirTime` Int32,
    54. `Flights` Int32,
    55. `Distance` Int32,
    56. `DistanceGroup` UInt8,
    57. `CarrierDelay` Int32,
    58. `WeatherDelay` Int32,
    59. `NASDelay` Int32,
    60. `SecurityDelay` Int32,
    61. `LateAircraftDelay` Int32,
    62. `FirstDepTime` String,
    63. `TotalAddGTime` String,
    64. `LongestAddGTime` String,
    65. `DivAirportLandings` String,
    66. `DivReachedDest` String,
    67. `DivActualElapsedTime` String,
    68. `DivArrDelay` String,
    69. `DivDistance` String,
    70. `Div1Airport` String,
    71. `Div1AirportID` Int32,
    72. `Div1AirportSeqID` Int32,
    73. `Div1WheelsOn` String,
    74. `Div1TotalGTime` String,
    75. `Div1LongestGTime` String,
    76. `Div1TailNum` String,
    77. `Div2AirportID` Int32,
    78. `Div2AirportSeqID` Int32,
    79. `Div2WheelsOn` String,
    80. `Div2TotalGTime` String,
    81. `Div2LongestGTime` String,
    82. `Div2WheelsOff` String,
    83. `Div2TailNum` String,
    84. `Div3Airport` String,
    85. `Div3AirportID` Int32,
    86. `Div3AirportSeqID` Int32,
    87. `Div3WheelsOn` String,
    88. `Div3TotalGTime` String,
    89. `Div3LongestGTime` String,
    90. `Div3WheelsOff` String,
    91. `Div3TailNum` String,
    92. `Div4Airport` String,
    93. `Div4AirportID` Int32,
    94. `Div4AirportSeqID` Int32,
    95. `Div4WheelsOn` String,
    96. `Div4TotalGTime` String,
    97. `Div4LongestGTime` String,
    98. `Div4WheelsOff` String,
    99. `Div4TailNum` String,
    100. `Div5Airport` String,
    101. `Div5AirportID` Int32,
    102. `Div5AirportSeqID` Int32,
    103. `Div5WheelsOn` String,
    104. `Div5TotalGTime` String,
    105. `Div5LongestGTime` String,
    106. `Div5WheelsOff` String,
    107. `Div5TailNum` String
    108. ) ENGINE = MergeTree
    109. PARTITION BY Year
    110. ORDER BY (Carrier, FlightDate)
    111. SETTINGS index_granularity = 8192;

    Loading data with multiple threads:

    1. ls -1 *.zip | xargs -I{} -P $(nproc) bash -c "echo {}; unzip -cq {} '*.csv' | sed 's/\.00//g' | clickhouse-client --input_format_with_names_use_header=0 --query='INSERT INTO ontime FORMAT CSVWithNames'"

    (if you will have memory shortage or other issues on your server, remove the -P $(nproc) part)

    1. $ curl -O https://datasets.clickhouse.tech/ontime/partitions/ontime.tar
    2. $ tar xvf ontime.tar -C /var/lib/clickhouse # path to ClickHouse data directory
    3. $ # check permissions of unpacked data, fix if required
    4. $ sudo service clickhouse-server restart
    5. $ clickhouse-client --query "select count(*) from datasets.ontime"

    Info

    If you will run the queries described below, you have to use the full table name, datasets.ontime.

    Q0.

    1. SELECT avg(c1)
    2. FROM
    3. (
    4. SELECT Year, Month, count(*) AS c1
    5. FROM ontime
    6. GROUP BY Year, Month
    7. );
    1. SELECT DayOfWeek, count(*) AS c
    2. FROM ontime
    3. WHERE Year>=2000 AND Year<=2008
    4. GROUP BY DayOfWeek
    5. ORDER BY c DESC;

    Q2. The number of flights delayed by more than 10 minutes, grouped by the day of the week, for 2000-2008

    Q3. The number of delays by the airport for 2000-2008

    1. SELECT Origin, count(*) AS c
    2. FROM ontime
    3. WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
    4. GROUP BY Origin
    5. ORDER BY c DESC
    6. LIMIT 10;

    Q4. The number of delays by carrier for 2007

    1. SELECT Carrier, count(*)
    2. FROM ontime
    3. WHERE DepDelay>10 AND Year=2007
    4. GROUP BY Carrier
    5. ORDER BY count(*) DESC;

    Q5. The percentage of delays by carrier for 2007

    1. SELECT Carrier, c, c2, c*100/c2 as c3
    2. FROM
    3. (
    4. SELECT
    5. Carrier,
    6. count(*) AS c
    7. FROM ontime
    8. WHERE DepDelay>10
    9. AND Year=2007
    10. GROUP BY Carrier
    11. )
    12. (
    13. SELECT
    14. Carrier,
    15. FROM ontime
    16. WHERE Year=2007
    17. GROUP BY Carrier
    18. ) USING Carrier
    19. ORDER BY c3 DESC;

    Better version of the same query:

    1. SELECT Carrier, avg(DepDelay>10)*100 AS c3
    2. FROM ontime
    3. WHERE Year=2007
    4. GROUP BY Carrier
    5. ORDER BY c3 DESC

    Q6. The previous request for a broader range of years, 2000-2008

    1. SELECT Carrier, c, c2, c*100/c2 as c3
    2. FROM
    3. (
    4. SELECT
    5. Carrier,
    6. count(*) AS c
    7. FROM ontime
    8. WHERE DepDelay>10
    9. AND Year>=2000 AND Year<=2008
    10. GROUP BY Carrier
    11. )
    12. JOIN
    13. (
    14. SELECT
    15. Carrier,
    16. count(*) AS c2
    17. FROM ontime
    18. WHERE Year>=2000 AND Year<=2008
    19. GROUP BY Carrier
    20. ) USING Carrier
    21. ORDER BY c3 DESC;

    Better version of the same query:

    1. SELECT Year, c1/c2
    2. FROM
    3. (
    4. select
    5. Year,
    6. count(*)*100 as c1
    7. from ontime
    8. WHERE DepDelay>10
    9. GROUP BY Year
    10. )
    11. JOIN
    12. (
    13. select
    14. Year,
    15. count(*) as c2
    16. from ontime
    17. GROUP BY Year
    18. ) USING (Year)
    19. ORDER BY Year;

    Better version of the same query:

    1. SELECT Year, avg(DepDelay>10)*100
    2. FROM ontime
    3. GROUP BY Year
    4. ORDER BY Year;

    Q8. The most popular destinations by the number of directly connected cities for various year ranges

    1. SELECT DestCityName, uniqExact(OriginCityName) AS u
    2. FROM ontime
    3. WHERE Year >= 2000 and Year <= 2010
    4. GROUP BY DestCityName
    5. ORDER BY u DESC LIMIT 10;

    Q9.

    1. SELECT Year, count(*) AS c1
    2. FROM ontime
    3. GROUP BY Year;

    Q10.

    1. SELECT
    2. min(Year), max(Year), Carrier, count(*) AS cnt,
    3. sum(ArrDelayMinutes>30) AS flights_delayed,
    4. round(sum(ArrDelayMinutes>30)/count(*),2) AS rate
    5. FROM ontime
    6. WHERE
    7. DayOfWeek NOT IN (6,7) AND OriginState NOT IN ('AK', 'HI', 'PR', 'VI')
    8. AND DestState NOT IN ('AK', 'HI', 'PR', 'VI')
    9. AND FlightDate < '2010-01-01'
    10. GROUP by Carrier
    11. HAVING cnt>100000 and max(Year)>1990
    12. LIMIT 1000;

    Bonus:

    You can also play with the data in Playground, example.

    This performance test was created by Vadim Tkachenko. See: