Traceroute

Introduction

What I’m super curious to untangle using traceroute is how connected the internet really is. It seems like a pretty silly question, since every page on the internet is theoretically linked to every other address, even if that connection is extremely distant. But what I want to discover is to what degree the infrastructure is shared. For example, how much does the path from my computer to amazon.com have in common with the path to, say, reddit.com? My assumption is that the first few hops will be pretty consistent, since the path from my router to a large enough parent network shouldn’t change much. Taking it further, I’d assume that the next step would inevitably be a Tier 1 service provider, maybe several, followed by a path down some more obscure IPs, and finally to the destination. But I suppose it’s possible that Verizon, my service provider, has such a strong hold on my outgoing traffic that it dominates most of the infrastructure, only passing off to other providers towards the end of the journey. So let’s find out.

LINK TO PROJECT

Documentation

I started, as all good projects do, by slapping together a really rudimentary Chrome extension to capture the hostnames of the sites I visited over the course of a few hours. Sure, I probably could have just pulled my browsing history, but then I wouldn’t have an excuse to learn more about how to build Chrome extensions. Here’s the history I gathered from my hacky extension:

www.amazon.com
www.reddit.com
inbox.google.com
www.google.com
stackoverflow.com
www.indeed.com
www.glassdoor.com
github.com
plasmic-reflection.com
panel.dreamhost.com
www.instructables.com
cloud.digitalocean.com
en.wikipedia.org
serverfault.com
support.dnsimple.com
jcharry.com
www.digitalocean.com
www.justajot.com
developer.chrome.com
www.elladagan.com
expressjs.com
blog.jcharry.com
www.w3schools.com
www.htmlgoodies.com
www.mapbox.com
plnkr.co
www.hoursforteams.com
localhost
docs.google.com
developer.mozilla.org
www.justfood.org
brooklynfoodcoalition.org
www.added-value.org
bkrot.org
i.stack.imgur.com
www.formget.com
facebook.github.io
speakerdeck.com
medium.com
jamesknelson.com
byjoeybaker.com
www.seamless.com
www.chase.com
chaseonline.chase.com
secure01b.chase.com
icons8.com
printingcode.runemadsen.com
itp.nyu.edu
www.patrickhebron.com
natureofcode.com
www.red3d.com
matterport.com
www.goodreads.com
consciouscat.net
www.weruva.com
purrfectcatdiet.com
assets.runemadsen.com
css-tricks.com
fontawesome.io
jonsuh.com
codepen.io
callmenick.com
opentype.js.org
www.useragentman.com
www.sh-streetfood.org
www.fulcrumapp.com
www.facebook.com
www.yougetsignal.com
www.isen.com
submarine-cable-map-2016.telegeography.com
www.wired.com

All in all, it’s about 70 hosts, each of which should be unique.
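
The extension itself isn’t shown here, so this is not its code. As a rough Python sketch of the same idea (reducing a pile of visited URLs to a list of unique hostnames), something like the following would work. The function name `unique_hostnames` and the sample URLs are made up for illustration.

```python
from urllib.parse import urlsplit


def unique_hostnames(urls):
    """Return each hostname once, in the order it was first visited."""
    seen = {}  # a dict keeps insertion order, so it doubles as an ordered set
    for url in urls:
        host = urlsplit(url).hostname  # drops scheme, port, path and query
        if host and host not in seen:
            seen[host] = None
    return list(seen)


# Hypothetical sample of visited URLs
visited = [
    "https://www.amazon.com/some/page",
    "https://www.amazon.com/another/page",
    "http://localhost:3000/",
    "https://github.com/",
]
hosts = unique_hostnames(visited)
```

Pulling the hostname out with `urlsplit` rather than string slicing means ports and paths never sneak into the list.
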
After that, using a bash script to run traceroute, I was able to save the traces for each hostname.
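
The original bash script isn’t reproduced in this post. As a minimal sketch of the same step, written in Python instead of bash, the loop below runs traceroute once per hostname and saves each trace to its own text file. The file name `hosts.txt`, the `traces` directory, and the function names are assumptions for illustration; the only traceroute option used is `-m`, which caps the number of hops.

```python
import subprocess
from pathlib import Path


def build_command(host, max_hops=30):
    """Build the traceroute invocation; -m sets the maximum number of hops."""
    return ["traceroute", "-m", str(max_hops), host]


def trace_path(out_dir, host):
    """Return the text file that will hold the trace for one host."""
    return Path(out_dir) / f"{host}.txt"


def run_traces(hosts_file, out_dir):
    """Run traceroute for every hostname listed (one per line) in hosts_file."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for line in Path(hosts_file).read_text().splitlines():
        host = line.strip()
        if not host:
            continue  # skip blank lines
        result = subprocess.run(build_command(host), capture_output=True, text=True)
        # Save whatever traceroute printed, even if the trace never completed
        trace_path(out_dir, host).write_text(result.stdout)
```

Calling `run_traces("hosts.txt", "traces")` would then leave one file per host, e.g. `traces/www.amazon.com.txt`. Expect it to be slow: every hop that times out costs a few seconds.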

Once I had all the traces, just looking at them manually started to form an interesting picture.

Let’s look at two side by side:
[Screenshot: traceroute output for amazon.com (left) and goodreads.com (right)]
On the left is amazon.com and on the right, goodreads.com. Immediately it’s clear that in both cases the path from my router hits some Verizon servers, then something with the name alter.net, which after some investigation seems to be Verizon-owned hardware, and then Amazon’s servers. My guess is that goodreads.com is hosted on Amazon. Interesting. Also, the trace to amazon.com itself starts timing out after hitting some Amazon servers. My guess here is that we’re hitting higher-level Amazon servers, those used for directing traffic, but as soon as we get directed to more specific servers, those used for hosting their website, for example, they probably have firewalls set up to add some protection to their actual content. That could explain why every attempt to trace anything after the 13th hop timed out.

Let’s look at one more pair:
[Screenshot: traceroute output for weruva.com and seamless.com]
Weruva.com vs. Seamless.com. Food for cats and food for humans.
Seamless travels through Verizon servers to a few Comcast servers, then starts timing out completely. Nothing to see. Weruva at least finished its trace: it went through Verizon, then either headed cross-country to Washington state or just went down to Washington, DC, to hit Level 3, and then through a small company called XLhost, where Weruva.com is presumably hosted.

Taking this information, I used D3.js to make a simple visualization of the paths traveled. I first used a Python script to parse the data from the txt files into a JSON file containing the trace for each site. Then, once it was loaded in D3, I could manipulate the data. In the visualization as seen here (the link above goes to a dynamic version), each circle represents the IP address of a specific server, and a red circle indicates a terminal node. Now, a terminal node doesn’t always mean that the trace found its way to its end destination. Many of the traces timed out before reaching the destination, so in those cases a red circle indicates an incomplete trace. Let’s look for some patterns.
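
The parsing script isn’t included in the post, so the following is only a sketch of what that txt-to-JSON step might look like, not the original code. It assumes the default traceroute output, where each hop line starts with the hop number and each responding address appears in parentheses after its hostname. The output shape (a list of `{"hop": ..., "ips": [...]}` entries per host) and the function names are assumptions chosen for illustration.

```python
import json
import re

# A hop line starts with the hop number, e.g. " 3  host.example.net (10.0.0.1)  5.0 ms"
HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")
# Responding addresses are printed in parentheses after the hostname
IP_IN_PARENS = re.compile(r"\((\d{1,3}(?:\.\d{1,3}){3})\)")


def parse_trace(text):
    """Turn raw traceroute output into a list of {"hop": n, "ips": [...]} dicts."""
    hops = []
    for line in text.splitlines():
        if line.startswith("traceroute"):
            continue  # header line, e.g. "traceroute to example.com (...)"
        match = HOP_LINE.match(line)
        if match:
            hops.append({"hop": int(match.group(1)), "ips": []})
            rest = match.group(2)
        elif hops:
            rest = line  # continuation line: another responder for the same hop
        else:
            continue
        for ip in IP_IN_PARENS.findall(rest):
            if ip not in hops[-1]["ips"]:
                hops[-1]["ips"].append(ip)
    return hops


def traces_to_json(raw_traces):
    """Convert {hostname: raw traceroute text} into one JSON document."""
    parsed = {host: parse_trace(text) for host, text in raw_traces.items()}
    return json.dumps(parsed, indent=2)
```

A hop that printed only `* * *` ends up with an empty `ips` list, which is one way to tell a timed-out hop from a responding one when drawing the terminal nodes.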

It would seem that there are more unique IPs encountered between steps 6 and 12 than anywhere else. Naively this might imply that there are many paths to travel to a designated endpoint, but more likely it’s because many traces resolve completely within that number of steps, meaning there are fewer cases where we ever need to go past it. Another caveat is that a lot of traceroutes got lost around this area: they never fully resolved the address, because they ran into firewalls or something else that kept them from finding an IP for the rest of the trace. It also appears to be fairly common for a trace to hit many very similar IPs right next to each other. Looking at this trace:
[Screenshot: traceroute output with a run of consecutive hops through similar IP addresses]
it seems like it lingers on several similar IPs, i.e. 54.239.x.x or 52.93.4.x, for a while. I’m not sure exactly what’s going on here, but it might be an artefact of the way traceroute works rather than demonstrating communication with all those unique IPs. As it turns out, all those IPs belong to Amazon, so the path is either bouncing its way around Amazon’s servers for some reason, or something’s happening with traceroute where it’s unable to tell us the exact path traveled. Either way, I don’t yet understand enough about the internet to fully grasp what’s happening here.
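
Both observations above (unique IPs per step, and runs of similar-looking IPs) can be checked with a few lines of Python rather than by eye. This is a sketch under assumptions: it expects the traces as a dict of hostname to a list of `{"hop": n, "ips": [...]}` entries, which is a hypothetical shape rather than the project’s actual JSON, and the sample addresses are made up. Grouping by a /16 prefix is only a rough heuristic for "similar" addresses; it says nothing certain about who owns them.

```python
import ipaddress
from collections import Counter, defaultdict


def unique_ips_per_hop(traces):
    """Map hop number -> number of distinct IPs seen at that hop across all traces."""
    seen = defaultdict(set)
    for hops in traces.values():
        for hop in hops:
            seen[hop["hop"]].update(hop["ips"])
    return {number: len(ips) for number, ips in sorted(seen.items())}


def count_by_prefix(ips, prefix_len=16):
    """Count how many addresses fall into each network of the given prefix length."""
    counts = Counter()
    for ip in ips:
        # strict=False lets us pass a host address and get its enclosing network
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        counts[str(network)] += 1
    return counts


# Made-up example data in the assumed shape
example_traces = {
    "a.example": [{"hop": 1, "ips": ["192.168.1.1"]}, {"hop": 2, "ips": ["54.239.110.5"]}],
    "b.example": [{"hop": 1, "ips": ["192.168.1.1"]}, {"hop": 2, "ips": ["52.93.4.1"]}],
}
per_hop = unique_ips_per_hop(example_traces)
prefixes = count_by_prefix(["54.239.110.5", "54.239.111.9", "52.93.4.1", "52.93.4.8"])
```

If one prefix dominates a long stretch of consecutive hops, that at least confirms the "lingering" pattern numerically, even if it doesn’t explain why it happens.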
