Scraping my school timetable


>> Who needs a website?


Manually navigating to my school website to get my timetable is getting boring. Logging in every time is worse. What if you could store your cookies somewhere, reverse engineer some APIs and have a workable CLI program that contains it all?

It might just work.

How? Well, I did just show you. I’ll go over it one by one.

How to scrape?

I used a TOML file to avoid hardcoding data. The syntax is really simple to understand, and V comes with a built-in module for dealing with this stuff. Here is a sample of my own file (with specifics omitted).

[cookies]
cpsdid                       = '*** uuid ***'
cpssid_cesis_catholic_edu_au = '*** uuid ***'
ASP_NET_SessionId            = '*** uuid ***'
SamlSessionIndex             = '****_*** uuid ***'

[panel]
sizes   = [8, 10, 5, 6]
padding = 5
margin  = 3
gap     = 0

[advanced]
user_agent='Mozilla/5.0 Gecko/20100101 Firefox/99.0'
url='https://****.compass.education/Services/Calendar.svc/GetCalendarEventsByUser?'
origin='https://****.compass.education'
userid=0000
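For the curious, reading a file like this from V could look roughly like the sketch below. It is not the program's actual loading code: the inline `sample` text and its values are made up, and only the key names come from the config above. The sample uses `1234` for `userid` because TOML integers cannot have leading zeros.

```v
import toml

// Made-up stand-in for the real file; only the key names match the config above.
sample := "
[cookies]
cpsdid = 'placeholder'

[panel]
sizes   = [8, 10, 5, 6]
padding = 5

[advanced]
user_agent = 'Mozilla/5.0 Gecko/20100101 Firefox/99.0'
userid     = 1234
"

// parse_text takes the TOML as a string; toml.parse_file does the same for a path.
doc := toml.parse_text(sample) or { panic(err) }

// Dotted keys reach into tables, and each value is converted on request.
user_agent := doc.value('advanced.user_agent').string()
userid := doc.value('advanced.userid').int()
padding := doc.value('panel.padding').int()
sizes := doc.value('panel.sizes').array().map(it.int())
cpsdid := doc.value('cookies.cpsdid').string()

println(user_agent)
println(userid)
println(padding)
println(sizes)
println(cpsdid)
```

Nothing has to be declared up front, so adding a new key to the file only costs one more `doc.value(...)` call.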

Four cookies, a convincing user agent and an origin URL. All of them are needed to get this to work smoothly, because the endpoints do not like any kind of change at all. Might as well make the request exactly how they like it. What request exactly?

It’s a long one.

result := os.execute("curl --compressed --silent \'$config.url\' -X POST 
-H \'User-Agent: $config.user_agent\' 
-H \'Accept: */*\' 
-H \'Accept-Language: en-US,en;q=0.5\' 
-H \'Accept-Encoding: gzip, deflate, br\' 
-H \'Content-Type: application/json\' 
-H \'X-Requested-With: XMLHttpRequest\' 
-H \'Origin: $config.origin\' 
-H \'DNT: 1\' 
-H \'Connection: keep-alive\' 
-H \'Referer: $config.origin\' 
-H \'Cookie: cpsdid=${config.cookies["cpsdid"]}; cpssid_cesis.catholic.edu.au=${config.cookies["cpssid_cesis_catholic_edu_au"]}; ASP.NET_SessionId=${config.cookies["ASP_NET_SessionId"]}; SamlSessionIndex=${config.cookies["SamlSessionIndex"]}\' 
-H \'Sec-Fetch-Dest: empty\' 
-H \'Sec-Fetch-Mode: cors\' 
-H \'Sec-Fetch-Site: same-origin\' 
-H \'Sec-GPC: 1\' 
-H \'Pragma: no-cache\' 
-H \'Cache-Control: no-cache\' 
-H \'TE: trailers\' 
--data-raw \'{\"userId\":${config.user_id_str},\"homePage\":true,\"activityId\":null,\"locationId\":null,\"staffIds\":null,\"startDate\":\"${formatted_date}\",\"endDate\":\"${formatted_date}\",\"page\":1,\"start\":0,\"limit\":25}\' --output - 
")

Like I said, do everything the endpoints say. Otherwise you’ll get an angry status code from Cloudflare. I use curl here because of its simplicity and how it deals with compressed data. The first time round writing this, I spent many hours banging my head against the wall trying to interpret JSON data from a compressed binary stream. Not fun. Just put --compressed in the command’s args and curl takes care of the rest.
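If the wall of -H flags hurts to look at, one way to tidy it up is to keep the headers in a list and generate the flags from it. This is only a sketch of that idea, not what the program does: `header_flags` is a hypothetical helper, and the three headers are just a subset of the ones in the real request.

```v
// header_flags is a hypothetical helper: it turns plain header lines into
// the repeated -H '...' flags that curl expects.
fn header_flags(headers []string) string {
	mut flags := []string{}
	for header in headers {
		flags << "-H '${header}'"
	}
	return flags.join(' ')
}

// The masked URL from the config sample; the real one comes from the TOML file.
url := 'https://****.compass.education/Services/Calendar.svc/GetCalendarEventsByUser?'
headers := [
	'Accept: */*',
	'Content-Type: application/json',
	'X-Requested-With: XMLHttpRequest',
]

flags := header_flags(headers)
cmd := "curl --compressed --silent '${url}' -X POST ${flags}"
println(cmd)
```

The resulting string is what would be handed to os.execute. Note that this does no shell escaping, so a header value containing a single quote would break the command.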

V’s default json module is not flexible at all. To avoid creating structs upon structs for the complete JSON tree, I opted for x.json2, the alternative parser written in pure V. This allowed me to be a lot more dynamic, converting values into concrete types only when I needed them (lazy evaluation).

// import x.json2

if result.exit_code != 0 {
	cli_fatals([
		"Curl exited with a non-zero error code! (${result.exit_code})"
		result.output
	])
}

entries := (json2.raw_decode(result.output) or {
	cli_fatals([
		"Failed to decode data!"
		"err: $err"
		"Returned data may be garbled?"
		"data: $result.output"
	])
}).as_map()["d"] or {
	cli_fatals([
		"Cannot find root of parsed data!"
		"Returned data may be garbled?"
		"data: $result.output"
	])
}.arr()

Boilerplate aside, this nightmarish piece of code takes the JSON data under d and transforms it into an array of nodes.

longTitle and longTitleWithoutTime are exactly what we are looking for. If this was C, I would be weeping seeing the amount of string manipulation in this next part. I’ll save you the trouble, just know that taking two hours off the time because of a timezone issue added 40 lines of garbage. But hey, it works fine, right? Let’s just hope they don’t change or, god forbid, innovate on their APIs.
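Pulling one of those title fields out of the node array could look something like this. It is a minimal sketch: the response text and the title values are made up, and only the d key and the longTitleWithoutTime field name come from the real data. It uses the same raw_decode call as above, which newer V releases may have changed.

```v
import x.json2

// Made-up response in the same shape: an object whose "d" key holds an array of events.
sample := '{"d":[{"longTitleWithoutTime":"Maths - Room 12"},{"longTitleWithoutTime":"English - Room 3"}]}'

root := json2.raw_decode(sample) or { panic(err) }
events := (root.as_map()['d'] or { panic('no "d" key in response') }).arr()

mut titles := []string{}
for event in events {
	fields := event.as_map()
	// Skip any node that happens to lack the field instead of crashing.
	title := fields['longTitleWithoutTime'] or { continue }
	titles << title.str()
}
println(titles)
```

Every value stays a generic json2.Any until .str() is called on it, which is the "convert only when needed" part.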

The “frame”

The frame is what really makes it look cool. Proper padding, margins and font alignment. Is it really a TUI program if the UI doesn’t look cool?

Since you’ve already seen it above, I’ll show its drawing process in action.

pub fn print(s string) {
	// ....
	// ....
	} $else {
+		C.usleep(10000)
		_write_buf_to_fd(1, s.str, s.len)
+		C.fflush(C.stdout)
	}
}

I’ve added a couple of C functions into V’s builtin print function (above). One for sleeping 10000 microseconds (0.01 seconds) and another for flushing stdout. These additions show it best!

It first generates the border, then iterates row by row filling in the required data, changing font colours as it goes. You see the yellow text on the right? That indicates a room or, in this case, a teacher change. Little hints like that help a lot.
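The slowed-down drawing does not strictly need a patched builtin. A rough equivalent, assuming each piece of the frame is printed as its own chunk, is below. `slow_print` is a hypothetical helper and the little box is made up, not the real frame.

```v
import time

// slow_print is a hypothetical helper, not part of the original program.
// It prints each chunk, flushes stdout so the chunk shows up immediately,
// then pauses before the next one.
fn slow_print(chunks []string, delay time.Duration) {
	for chunk in chunks {
		print(chunk)
		flush_stdout()
		time.sleep(delay)
	}
}

// 10 milliseconds per chunk is the same pause as C.usleep(10000),
// because usleep counts in microseconds.
slow_print(['+------+\n', '| 9:00 |\n', '+------+\n'], 10 * time.millisecond)
```

Without the flush, the output can sit in a buffer and appear all at once, which defeats the point of the delay.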

What about the -67 inside the program’s arguments?

Forgot about that one. I don’t exactly have school right now, so I’m stepping back 67 days to when I did. It can even see records in the future too. Omitting the optional argument just defaults to no day offset. After finally writing out the frame, it prints out the time it took for the HTTP request to complete.
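The offset handling boils down to something like the sketch below. Both function names are hypothetical, and the YYYY-MM-DD output is an assumption, since the date format the endpoint actually wants is not shown in this post.

```v
import os
import time

// parse_offset reads the optional day offset (for example -67) from the
// first argument and falls back to 0 when it is missing.
fn parse_offset(args []string) int {
	if args.len > 1 {
		return args[1].int()
	}
	return 0
}

// date_with_offset shifts `base` by `offset` days and formats it as YYYY-MM-DD.
// The format is an assumption, not necessarily what the endpoint expects.
fn date_with_offset(base time.Time, offset int) string {
	return base.add_days(offset).ymd()
}

offset := parse_offset(os.args)
println(date_with_offset(time.now(), offset))
```

A positive offset works the same way, which is how looking at future days comes for free.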

The end

I originally wrote this program a while back, all in one go. It was incredibly messy and really had to be restructured. I did all of that before writing this post. Source code here if you want to check it out. Unless you go to the same school as me and happen to be using Linux, you probably won’t get much out of this. It was a cool concept either way, and I personally use this a lot.

Thanks for reading!