OnTime

    • import from raw data
    • download of prepared partitions

    Downloading data:

    (from https://github.com/Percona-Lab/ontime-airline-performance/blob/master/download.sh )

    Creating a table:

    1. `Year` UInt16,
    2. `Quarter` UInt8,
    3. `Month` UInt8,
    4. `DayofMonth` UInt8,
    5. `DayOfWeek` UInt8,
    6. `FlightDate` Date,
    7. `UniqueCarrier` FixedString(7),
    8. `AirlineID` Int32,
    9. `Carrier` FixedString(2),
    10. `TailNum` String,
    11. `FlightNum` String,
    12. `OriginAirportID` Int32,
    13. `OriginAirportSeqID` Int32,
    14. `OriginCityMarketID` Int32,
    15. `Origin` FixedString(5),
    16. `OriginCityName` String,
    17. `OriginState` FixedString(2),
    18. `OriginStateFips` String,
    19. `OriginStateName` String,
    20. `OriginWac` Int32,
    21. `DestAirportID` Int32,
    22. `DestAirportSeqID` Int32,
    23. `DestCityMarketID` Int32,
    24. `Dest` FixedString(5),
    25. `DestCityName` String,
    26. `DestState` FixedString(2),
    27. `DestStateFips` String,
    28. `DestStateName` String,
    29. `DestWac` Int32,
    30. `CRSDepTime` Int32,
    31. `DepTime` Int32,
    32. `DepDelay` Int32,
    33. `DepDelayMinutes` Int32,
    34. `DepDel15` Int32,
    35. `DepartureDelayGroups` String,
    36. `DepTimeBlk` String,
    37. `TaxiOut` Int32,
    38. `WheelsOff` Int32,
    39. `WheelsOn` Int32,
    40. `TaxiIn` Int32,
    41. `CRSArrTime` Int32,
    42. `ArrTime` Int32,
    43. `ArrDelay` Int32,
    44. `ArrDelayMinutes` Int32,
    45. `ArrDel15` Int32,
    46. `ArrivalDelayGroups` Int32,
    47. `ArrTimeBlk` String,
    48. `Cancelled` UInt8,
    49. `CancellationCode` FixedString(1),
    50. `Diverted` UInt8,
    51. `CRSElapsedTime` Int32,
    52. `ActualElapsedTime` Int32,
    53. `AirTime` Int32,
    54. `Flights` Int32,
    55. `Distance` Int32,
    56. `DistanceGroup` UInt8,
    57. `CarrierDelay` Int32,
    58. `WeatherDelay` Int32,
    59. `NASDelay` Int32,
    60. `SecurityDelay` Int32,
    61. `LateAircraftDelay` Int32,
    62. `FirstDepTime` String,
    63. `TotalAddGTime` String,
    64. `LongestAddGTime` String,
    65. `DivAirportLandings` String,
    66. `DivReachedDest` String,
    67. `DivActualElapsedTime` String,
    68. `DivArrDelay` String,
    69. `DivDistance` String,
    70. `Div1Airport` String,
    71. `Div1AirportID` Int32,
    72. `Div1AirportSeqID` Int32,
    73. `Div1WheelsOn` String,
    74. `Div1TotalGTime` String,
    75. `Div1WheelsOff` String,
    76. `Div2Airport` String,
    77. `Div2AirportID` Int32,
    78. `Div2AirportSeqID` Int32,
    79. `Div2WheelsOn` String,
    80. `Div2TotalGTime` String,
    81. `Div2LongestGTime` String,
    82. `Div2WheelsOff` String,
    83. `Div2TailNum` String,
    84. `Div3Airport` String,
    85. `Div3AirportID` Int32,
    86. `Div3AirportSeqID` Int32,
    87. `Div3WheelsOn` String,
    88. `Div3TotalGTime` String,
    89. `Div3LongestGTime` String,
    90. `Div3WheelsOff` String,
    91. `Div3TailNum` String,
    92. `Div4Airport` String,
    93. `Div4AirportID` Int32,
    94. `Div4AirportSeqID` Int32,
    95. `Div4WheelsOn` String,
    96. `Div4TotalGTime` String,
    97. `Div4LongestGTime` String,
    98. `Div4WheelsOff` String,
    99. `Div4TailNum` String,
    100. `Div5Airport` String,
    101. `Div5AirportID` Int32,
    102. `Div5AirportSeqID` Int32,
    103. `Div5WheelsOn` String,
    104. `Div5TotalGTime` String,
    105. `Div5LongestGTime` String,
    106. `Div5WheelsOff` String,
    107. `Div5TailNum` String
    108. ) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192)

    Loading data:

    1. $ for i in *.zip; do echo $i; unzip -cq $i '*.csv' | sed 's/\.00//g' | clickhouse-client --host=example-perftest01j --query="INSERT INTO ontime FORMAT CSVWithNames"; done
    1. $ curl -O https://clickhouse-datasets.s3.yandex.net/ontime/partitions/ontime.tar
    2. $ tar xvf ontime.tar -C /var/lib/clickhouse # path to ClickHouse data directory
    3. $ # check permissions of unpacked data, fix if required
    4. $ sudo service clickhouse-server restart
    5. $ clickhouse-client --query "select count(*) from datasets.ontime"

    Info

    If you will run queries described below, you have to use full table name,
    datasets.ontime.

    1. SELECT avg(c1)
    2. FROM
    3. (
    4. SELECT Year, Month, count(*) AS c1
    5. FROM ontime
    6. GROUP BY Year, Month
    7. );

    Q1. The number of flights per day from the year 2000 to 2008

    1. SELECT DayOfWeek, count(*) AS c
    2. FROM ontime
    3. WHERE Year>=2000 AND Year<=2008
    4. GROUP BY DayOfWeek
    5. ORDER BY c DESC;

    Q2. The number of flights delayed by more than 10 minutes, grouped by the day of the week, for 2000-2008

    Q3. The number of delays by airport for 2000-2008

    1. SELECT Origin, count(*) AS c
    2. FROM ontime
    3. WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
    4. GROUP BY Origin
    5. ORDER BY c DESC
    6. LIMIT 10;

    Q4. The number of delays by carrier for 2007

    1. SELECT Carrier, count(*)
    2. FROM ontime
    3. WHERE DepDelay>10 AND Year=2007
    4. GROUP BY Carrier
    5. ORDER BY count(*) DESC;

    Q5. The percentage of delays by carrier for 2007

    1. SELECT Carrier, c, c2, c*100/c2 as c3
    2. FROM
    3. (
    4. SELECT
    5. Carrier,
    6. count(*) AS c
    7. FROM ontime
    8. WHERE DepDelay>10
    9. AND Year=2007
    10. GROUP BY Carrier
    11. )
    12. ANY INNER JOIN
    13. (
    14. Carrier,
    15. count(*) AS c2
    16. WHERE Year=2007
    17. GROUP BY Carrier
    18. ) USING Carrier
    19. ORDER BY c3 DESC;

    Better version of the same query:

    1. SELECT Carrier, avg(DepDelay>10)*100 AS c3
    2. FROM ontime
    3. WHERE Year=2007
    4. GROUP BY Carrier
    5. ORDER BY Carrier
    1. SELECT Carrier, c, c2, c*100/c2 as c3
    2. FROM
    3. (
    4. SELECT
    5. Carrier,
    6. count(*) AS c
    7. FROM ontime
    8. WHERE DepDelay>10
    9. AND Year>=2000 AND Year<=2008
    10. GROUP BY Carrier
    11. )
    12. ANY INNER JOIN
    13. (
    14. SELECT
    15. Carrier,
    16. count(*) AS c2
    17. FROM ontime
    18. WHERE Year>=2000 AND Year<=2008
    19. GROUP BY Carrier
    20. ) USING Carrier
    21. ORDER BY c3 DESC;

    Better version of the same query:

    Q7. Percentage of flights delayed for more than 10 minutes, by year

    1. SELECT Year, c1/c2
    2. FROM
    3. (
    4. select
    5. Year,
    6. count(*)*100 as c1
    7. from ontime
    8. WHERE DepDelay>10
    9. GROUP BY Year
    10. )
    11. ANY INNER JOIN
    12. (
    13. select
    14. Year,
    15. count(*) as c2
    16. from ontime
    17. GROUP BY Year
    18. ) USING (Year)
    19. ORDER BY Year;

    Better version of the same query:

    1. SELECT Year, avg(DepDelay>10)
    2. FROM ontime
    3. GROUP BY Year
    4. ORDER BY Year;

    Q8. The most popular destinations by the number of directly connected cities for various year ranges

    1. SELECT DestCityName, uniqExact(OriginCityName) AS u
    2. FROM ontime
    3. WHERE Year >= 2000 and Year <= 2010
    4. GROUP BY DestCityName
    5. ORDER BY u DESC LIMIT 10;

    Q9.

    1. SELECT Year, count(*) AS c1
    2. FROM ontime
    3. GROUP BY Year;

    Q10.

    1. SELECT
    2. min(Year), max(Year), Carrier, count(*) AS cnt,
    3. sum(ArrDelayMinutes>30) AS flights_delayed,
    4. round(sum(ArrDelayMinutes>30)/count(*),2) AS rate
    5. FROM ontime
    6. WHERE
    7. DayOfWeek NOT IN (6,7) AND OriginState NOT IN ('AK', 'HI', 'PR', 'VI')
    8. AND DestState NOT IN ('AK', 'HI', 'PR', 'VI')
    9. AND FlightDate < '2010-01-01'
    10. GROUP by Carrier
    11. HAVING cnt>100000 and max(Year)>1990
    12. ORDER by rate DESC

    This performance test was created by Vadim Tkachenko. See: