What I’m super curious to untangle using traceroute is how connected the internet is, really. It seems like a pretty silly question, since obviously every page on the internet is theoretically reachable from every other address, even if that connection is extremely distant. But what I want to discover is to what degree infrastructure is shared. For example, how much does the path from my computer to amazon.com have in common with the path to, say, reddit.com? The assumption would be that the outgoing leg will be pretty consistent, since the path from my router up to a large enough parent network shouldn’t change much. Taking it further, I’d assume that the next step would inevitably be a Tier 1 service provider, maybe several, followed by a path down through some more obscure IPs, and finally to the destination. But I suppose it’s possible that Verizon, my service provider, has such a strong hold on my outgoing traffic that it dominates most of the infrastructure, only handing off to other providers toward the end of the journey. So let’s find out.
I started, as all good projects do, by slapping together a really rudimentary Chrome extension to capture the hostnames of the sites I visited over the course of a few hours. Sure, I probably could have just pulled my browsing history, but then I wouldn’t have had an excuse to learn more about building Chrome extensions. Here’s the history I gathered from my hacky extension:
www.amazon.com www.reddit.com inbox.google.com www.google.com stackoverflow.com www.indeed.com www.glassdoor.com github.com plasmic-reflection.com panel.dreamhost.com www.instructables.com cloud.digitalocean.com en.wikipedia.org serverfault.com support.dnsimple.com jcharry.com www.digitalocean.com www.justajot.com developer.chrome.com www.elladagan.com expressjs.com blog.jcharry.com www.w3schools.com www.htmlgoodies.com www.mapbox.com plnkr.co www.hoursforteams.com localhost docs.google.com developer.mozilla.org www.justfood.org brooklynfoodcoalition.org www.added-value.org bkrot.org i.stack.imgur.com www.formget.com facebook.github.io speakerdeck.com medium.com jamesknelson.com byjoeybaker.com www.seamless.com www.chase.com chaseonline.chase.com secure01b.chase.com icons8.com printingcode.runemadsen.com itp.nyu.edu www.patrickhebron.com natureofcode.com www.red3d.com matterport.com www.goodreads.com consciouscat.net www.weruva.com purrfectcatdiet.com assets.runemadsen.com css-tricks.com fontawesome.io jonsuh.com codepen.io callmenick.com opentype.js.org www.useragentman.com www.sh-streetfood.org www.fulcrumapp.com www.facebook.com www.yougetsignal.com www.isen.com submarine-cable-map-2016.telegeography.com www.wired.com
All in all, that’s almost 70 hosts, each of which should be unique.
After that, I used a bash script to run traceroute against each hostname and save the resulting trace to a text file.
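The original script isn’t shown, so here’s a minimal sketch of what that gathering step might look like, written in Python rather than bash. The host list, flags, and file layout are my assumptions, not the actual setup:

```python
import subprocess
from pathlib import Path

# Hypothetical subset of the host list gathered above
HOSTS = ["www.amazon.com", "www.reddit.com", "www.goodreads.com"]

def trace_cmd(host, max_hops=30):
    """Build the traceroute invocation for a single host."""
    return ["traceroute", "-m", str(max_hops), host]

def gather(hosts, out_dir="traces"):
    """Run traceroute for each host and save its raw output to a text file."""
    Path(out_dir).mkdir(exist_ok=True)
    for host in hosts:
        result = subprocess.run(trace_cmd(host), capture_output=True, text=True)
        Path(out_dir, f"{host}.txt").write_text(result.stdout)

# gather(HOSTS)  # uncomment to actually collect the traces
```

Saving each trace to its own file keeps the raw output around, so the parsing can be reworked later without re-running every traceroute.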
Once I had all the traces, just looking through them manually started to form an interesting picture.
Let’s look at two side by side:
On the left is amazon.com and on the right, goodreads.com. Immediately it’s clear that the path from my router hits some Verizon servers, then something named alter.net, which after some investigation seems to be Verizon-owned hardware, and then, in both cases, hits Amazon’s servers. My guess is that goodreads.com is hosted on Amazon. Interesting. Also, the trace to amazon.com itself starts timing out after hitting some Amazon servers. My guess here is that we’re hitting higher-level Amazon servers, the ones used for directing traffic, but as soon as we get directed to more specific servers, those used for hosting the website itself, for example, firewalls probably kick in to protect the actual content. That could explain why every attempt to trace anything past the 13th hop timed out.
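One way to make the “how much path is shared” comparison concrete is to count how many leading hops two traces have in common before they diverge. The hop lists below are invented for illustration, not the real amazon.com and goodreads.com traces:

```python
def shared_prefix(a, b):
    """Count how many leading hops two traces share before diverging."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Made-up hop lists: same router, same Verizon leg, then a split
amazon    = ["192.168.1.1", "100.41.0.1", "140.222.3.9", "54.239.1.1"]
goodreads = ["192.168.1.1", "100.41.0.1", "140.222.3.9", "52.93.4.2"]
print(shared_prefix(amazon, goodreads))  # 3 hops in common before diverging
```

Run over every pair of traces, a count like this would turn the eyeball comparison into an actual measure of shared infrastructure.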
Let’s look at one more pair:
Weruva.com vs Seamless.com. Food for cats and food for humans.
Seamless travels through Verizon servers to a few Comcast servers, then starts timing out completely. Nothing to see. Weruva at least finished its trace: it went through Verizon, either headed cross-country to Washington state (or just down to DC) to hit Level 3, then passed through a small company called XLhost, where Weruva.com is presumably hosted.
Taking this information, I used D3.js to make a simple visualization of the paths traveled. First I used a Python script to parse the data from the txt files into a JSON file containing the trace for each site. Then, once it was loaded into D3, I could manipulate the data. In the visualization seen here (the link above goes to a dynamic version), the circles represent the IP addresses of specific servers, while the red circles indicate terminal nodes. Now, a terminal node doesn’t always mean that the trace found its way to its end destination. Many of the sites timed out before the full trace could resolve, so in those cases a red circle indicates an incomplete trace. Let’s look for some patterns.
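The parsing script itself isn’t shown, but the gist might look something like this sketch: pull the IP out of each traceroute line (recording timed-out hops as nulls) and dump the result as JSON. The regex and the sample trace are my assumptions about standard traceroute output, not the original code:

```python
import json
import re

# Matches " 3  host.name (203.0.113.9)  9.8 ms ..." or a timed-out " 2  * * *"
HOP_RE = re.compile(r"^\s*(\d+)\s+(?:\S+\s+\((\d+\.\d+\.\d+\.\d+)\)|(\*))")

def parse_trace(text):
    """Return the ordered list of hop IPs; None marks a timed-out hop."""
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if m:
            hops.append(m.group(2))  # None when the hop line was "* * *"
    return hops

sample = """traceroute to example.com (93.184.216.34), 30 hops max
 1  router.home (192.168.1.1)  1.2 ms  1.1 ms  1.0 ms
 2  * * *
 3  ae0.example.net (203.0.113.9)  9.8 ms  9.7 ms  9.6 ms
"""
print(json.dumps({"example.com": parse_trace(sample)}))
# {"example.com": ["192.168.1.1", null, "203.0.113.9"]}
```

A hostname-to-hop-list JSON object like this is easy to feed straight into D3 for drawing the node-and-path view.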
It would seem that there are more unique IPs encountered between steps 6 and 12 than anywhere else. Naively this might imply that there are many paths to a designated endpoint, but more likely it’s because many sites resolve completely within that number of steps, meaning there are fewer cases where we ever even need to go past it. Another caveat is that a lot of traceroutes got lost around this range: they hit firewalls, or otherwise failed to resolve an IP, for the rest of the trace. It also appears to be fairly common for a trace to hit many very similar IPs right next to each other. Looking at this trace:
it seems to linger on several IPs, i.e. 54.239.x.x or 52.93.4.x, for a while. I’m not sure exactly what’s going on here, but it might be an artefact of the way traceroute works rather than evidence of communication with all those unique IPs. As it turns out, all those IPs belong to Amazon, so the path is either bouncing its way around Amazon’s servers for some reason, or traceroute is unable to tell us the exact path traveled. Either way, I don’t yet understand enough about the internet to fully grasp what’s happening here.