Char-RNN: Generating HTML

Using this char-rnn TensorFlow implementation to generate HTML feels almost too easy, but I was curious whether a network could learn not only the syntax of HTML but also any underlying commonalities between web pages. The theory is that much of the internet looks the same…

A first pass produced the following output:

Note: The raw HTML output is at the bottom of the post.

What’s sort of fascinating is that I didn’t touch up the output in any way. I simply saved it to a file, opened it in a browser, and it rendered without a hitch. First and foremost, this tells me that browsers (I only tested it with Chrome) are very good at rendering broken or bad HTML. Secondly, the output actually has structure. The page even looks like it has sections: a main image with a caption, a small body, and a comments section. This tells me that the HTML I trained on was probably heavy with blogs or sites that have articles and comment sections.

The raw HTML does kind of look like HTML. It’s indented, and tags are always named properly, although they don’t necessarily close properly. Class names are usually believable, and links sometimes point to real websites.


I started by crawling the web with a little Python-based web crawler. It loads a page of my choosing, finds all the links on that page, makes an HTTP request to each link, and saves the responses to a file. Of course, I tried to respect robots.txt files and only crawled pages that allowed crawling.
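Here is a minimal sketch of what such a one-level crawler might look like, using only the Python standard library. It is not the original crawler: the user agent string, the function names, and the choice to skip hosts whose robots.txt can't be fetched are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "html-corpus-bot"  # hypothetical name, not from the original crawler


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) targets of the <a href> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            url = urljoin(self.base_url, href)  # resolve relative links
            if urlparse(url).scheme in ("http", "https"):
                self.links.append(url)


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def allowed(url, robots_cache):
    """Check robots.txt for the URL's host, fetching it at most once per host."""
    parts = urlparse(url)
    root = f"{parts.scheme}://{parts.netloc}"
    if root not in robots_cache:
        rp = RobotFileParser(root + "/robots.txt")
        try:
            rp.read()
        except OSError:
            rp = None  # assumption: an unreachable robots.txt means "don't crawl"
        robots_cache[root] = rp
    rp = robots_cache[root]
    return rp is not None and rp.can_fetch(USER_AGENT, url)


def fetch(url):
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_url, out_path):
    """Fetch the seed page, then append every allowed linked page to out_path."""
    robots_cache = {}
    if not allowed(seed_url, robots_cache):
        return
    seed_html = fetch(seed_url)
    with open(out_path, "a", encoding="utf-8") as out:
        for link in extract_links(seed_html, seed_url):
            if not allowed(link, robots_cache):
                continue
            try:
                out.write(fetch(link) + "\n")
            except OSError:
                continue  # skip pages that fail to load
```

Calling something like `crawl("https://example.com/", "pages.html")` would append each linked page to one file. A real crawl would probably also want deduplication and a polite delay between requests.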

Once I had a significant amount of HTML (about 150,000 lines in total), I ran it through Beautiful Soup to remove script tags and comments. I didn’t want the possibility of a line of JavaScript running, even though the odds of the network writing valid JavaScript are probably super slim. Finally, I concatenated all the HTML into one huge file and trained a char-rnn network with 3 layers. Training took about 2.5 full days on my little machine.
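The cleaning step can be sketched roughly like this with Beautiful Soup. The function names `clean_html` and `build_corpus` are hypothetical, and re-indenting with `prettify()` is an assumption rather than something I know the original pipeline did.

```python
from bs4 import BeautifulSoup, Comment


def clean_html(raw_html):
    """Return the page with <script> elements and HTML comments removed."""
    soup = BeautifulSoup(raw_html, "html.parser")

    # Drop every <script> element together with its contents.
    for script in soup.find_all("script"):
        script.decompose()

    # Comments are a special kind of string node in Beautiful Soup.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Assumption: re-indent the markup so the training text is consistently formatted.
    return soup.prettify()


def build_corpus(page_paths, corpus_path):
    """Clean each saved page and concatenate the results into one training file."""
    with open(corpus_path, "w", encoding="utf-8") as corpus:
        for path in page_paths:
            with open(path, encoding="utf-8", errors="replace") as page:
                corpus.write(clean_html(page.read()) + "\n")
```

The resulting single file is the kind of plain-text input a character-level model trains on.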


So, the output leaves something to be desired. The actual text inside HTML elements is a garbled mess. While the network learned the general structure of HTML, it didn’t properly learn how to close tags, although for browsers unterminated tags don’t appear to be a dealbreaker. And of course, any links (images or otherwise) are usually made up, even though they often look like real links, and don’t lead to a real page or resource.
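One rough way to put a number on the tag-closing problem is to count unbalanced tags in a generated sample. This is a crude heuristic sketch, not something from the original experiment: `tag_balance` is a hypothetical helper, and it counts elements like `<li>` or `<p>` as unclosed even where HTML allows the closing tag to be omitted.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are ignored.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagBalance(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags
        self.unclosed = 0  # opened tags that were skipped over by a later close
        self.stray = 0     # closing tags with no matching open tag

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag not in self.stack:
            self.stray += 1
            return
        # Pop back to the matching open tag; anything above it was never closed.
        while self.stack:
            if self.stack.pop() == tag:
                break
            self.unclosed += 1


def tag_balance(html):
    parser = TagBalance()
    parser.feed(html)
    parser.close()
    # Tags still on the stack at the end of the input were never closed either.
    return {"unclosed": parser.unclosed + len(parser.stack),
            "stray_closing": parser.stray}
```

Comparing these counts between a slice of the training data and a generated sample of the same length would give a simple measure of how far off the network is.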

I wonder whether a larger network and significantly more data would improve the output. I guess that’s the next step.

Raw HTML Output

got to Kgeloutions.*)</a> | <a href="http://minimalistbaker.loadnaps-175893475#72523}.linked(,to-item" class="entry-dia-button sidebar-tag');background-image:right"><div class="referenced-itemTrokel="" display_project\"=""><a href="">Entring"/>
        <p class="single-language-twitter"><a href="">Ansistic $bs</a></li>
        <li><a href="">".ruubeek">Phesing Swtogy (1 Rilebritory</a>.

            Yout look Us to my finner, Ebsololity.
            ) />

            Triving you’re untild for the agion, gluten free book around im! curined already :)</p>
        <p><img alt="" class="avatar oftule"><div class="sidebar"><img alt="mapm" class="uppent-sideloader-item ickin-screen-reader-bubble-whitp/" height="700" sizes="(max-width: 200px) 100vw, 200px" src="">
                <div class="Enter-container " data-medium-how-possers="">Verified, food “Thankscater=' font-familying cateller">Slacy Tomato consore You Geload wide another Lead (93 minuting (video release to Top protein=</a></h2><span class="entry-heam-write-title"><a href=" a-dark .sidebar simple--image .segment__east-taggone" id="placeholder"></div>
                    <link href="//"/></a><h1 class="ProfileTweet-actionCount">

                    <ul><li class="overflow-minimalistbox="img-left-strokey-aial-flow-discussion-region" itemprop="author"><header><h5>Edital Newsletters</span>
                                Thanks News Google?</div>

                            <img alt="" class="avatar avatar-96 photo" height="36" argfirger-id=6936671466 09 2004w" name="email">Largwout you</small></span></faceprg="typen-title"><li class="img--">Theve of Gifferspome, Ly:</b>ddey"><img class="cakein.topnav">At youngeh</a></p>
<div class="post-logo" onclick="'send', 'event', 'Group EndSegglewing that Socundab look respont</a>)</option>
    <option class="level-0" value="183"><label">Cspeak</button>
    <div class="jpg-twitter"><use xlink:href="#iconset-stion=starter.jp_p9App0uGG" data-hessod="26210200095210" display="pex" title="">Renima and trahrous attache $                            </div>
        <p class="summary">Tutubeeds.<br/>
    </div></div><div class="ad-promotions-content"></div>
<div class="blog">
    <div class="">nernual*Best Prumot for 40 with a mume, Gumming $more abouts</a>
<span class="button-text" name="aidget-medium">
    <ul class="f-dropdown filterfullvifext-container" id="post-iconstore-more" title="7"/></path></span></span><div class="icon--7 #554;border-top">
<link href="" title="">Monning Ember</title><meta content=",fl_progressive,g_center,h_180,q_80/wallocdum_st.png" width="100"/></a><h2 class="entry-title"><a href="" itemprop="url">October 23, 2016 at 9:10 am</a></time></p> </header>
<div class="comment-content" itemprop="text">
    <p>I husband that’s a recipe obtoget powder sample fresh gluten free tablerial be cheese well.
        <div class="ERRatingComment">
            <div class="ERRatingCommentInner" style="width:100%"></div>
<div class="comment-reply"><a aria-label="Reply to Glututor-Humazimote" class="interbrute">.sttmm">Srimed Dritter</a></p>
<p>Nots,   I use one</li></ul></div></section>
<div id="wpr_madets">
    <div class="footer_firstp_text">0 mediumi" class=""><a href="/">All/OpiodwardMine Newslamse" class="nav-clock" data-medium-1"><div class="widget-wrap"> <div class="text-close" data-sizes="auto" src="">Carl pine the Onested shot Chickin</a></h3>
            <p><small class="post-listFill="submil-buttonNending">
                    <div class="project-cupset_strop-in-thumb">
                        <img src="Preferenc " data-blog-id="668818235" data-position="Powp":6408">Gremoducheol</a>, <a href="","co32_"> </div><footer class="entry-footer">
                        Thanks )HT Kict,")</span></span></a><a class="referenced-item smoothieUvgext" id="post-1866">
                <div class="potbox 8000 +0000-10wliblimg_region-profileSize-areader-then-tarce">
                    <div class="page-text">Closes</a></li>
                <li><a href="https://wiki.omaution_1608551" terdule="section">
                        <div class="largin-request span-id"/>
                            <link href="" rel="stylesheet" type="text/css"/>
                            <link href="" target="_blank"></a></li><li class="social-twitter"><a href="" role="banerable-twitter" role="manivial-page-expander">
                            <h3>[viagram">Dandah" class="comment-reply-link" href="#comment-12676" onclick='return addComment.moveForm( "comment-312118", "421370", "respond", "8165" )'">
                                <picture alt="Vegan Pauce">
                                    <a class="popular-menu lazy-gooks bood-engthe category-with-sate" id="no">
                                        <p><strong>OYD Plugia</span> <span class="says">says</span> </p>
                                        <p class="comment-meta"><time class="comment-time" datetime="2015-15-16T16:38:02+00:00" itemprop="datePublished"><a class="comment-time-link" href="" itemprop="url">October 3, 2015 at 11:23 pm</a></time></p> </header>
                                    <div class="comment-content" itemprop="text">
                                        <p>Are – should I modules to the feesing or called the enood with plave.<br/>
                                            2,"webad');">Darging</span> <span class="says">says</span> </p>
                                    <p class="comment-meta"><time class="comment-time" datetime="2014-12-13T10:34:12+00:00" itemprop="datePublished"><a class="comment-time-link" href="" itemprop="url">October 28, 2016 at 6:26 pm</a></time></p> </header>
                                <div class="comment-content" itemprop="text">
                                    <p>I rame of my bun-helf themt public gluten free normal amazing to 1/6 mine-tit will bake you go to pepponning of it in us oats for my friend: Jurning.  I’m that’s pull the recipe after the protein crusts for chipnent to that the outsagl to soluturuw a licellas!" class="alignleft" height="200" sizes="(max-width: 200px) 100vw, 200px" src=" sx-spriticaszia Sauing%2FOc.zE ipHhere, category-sny+DegectHriver-btn:ipen-translateal-antertinations+xml" id="pane-and-reeo": 93A60746731#81B9B100E_;pata-orget=Click.releash/werkick-con-twitter-imon":"Nachurpharen@alarlow":Grecie-314 Mia" class="fcrective">Inteside anyone Weirld Custio Blog">Into Cordonlin
                                    <span class="uricitform"><a href="" title="Peeken</a></h2> </header>
                                <div class="entry-content">
                                    <p><time class="comment-meta"><time class="comment-time" datetime="2015-07-28T01:12:28+00:00" itemprop="datePublished"><a class="comment-time-link" href="" itemprop="url">January 29, 2013 at 5:51 pm</a></time></p> </header>
                                    <div class="comment-content" itemprop="text">
                                        <p>Are thing, Vorsue!!!! Thanks abilitation. appeine bindo :)</p>
                                    <div class="comment-reply"><a aria-label="Reply to Douaz-(" href="" caresEndate" class="dutrowscoon"><label for="isporacing">11 mellFat</span></li><li>1 and through enter Land the the lamnle better.*"virection*)</a>
                                        <a href="#enterread published_thementNous_Billistor+vp.pe_16 video-id="docKobscuter » FBS" rel="nofollow"><span class="thumb-menu data-footer js-tool" data-srcset=""> <time alt="Present/st.pnd-python262206359257857313532%2PBAN" jsname="" href=""previcon.png" rel="apple-touch-id-4144overviarpoad-keal-menu yt-uix-segblegineGion"><li class='\"filter' data-filter='\"student\"' data-slug='\"" title="Permalink to Meliss!</a> / *</div><div id="js_related(" id="dexveread">
                                                    <section class="folver">
                                                        <div class="flufface">
                                                    </input></inpuce><a class="logo" property="twitter-ration-navcaces2/" rel="category tag">Supper copys (Mard to this Privacon Demosspecicified</span></a>
                                            <div class="ssbp-axmang">Tlengmugal</span>
                                <div class="ssbp-engonglignerniz" data-grame="" role="term">3" class="entry-meta">
                                    width: 38; fadra-transformet"></span><span class="screen-list">
                                    hough. Python.</p>
                                <p>Chandelle">viet egg</a>
                                <ul aria-hidden="true" class="entry-image js_inlink .pinyoutpix">

                                <searing aria-hidden="true" class="entry-image-link" href="" onclick=";return false;" type="application/juston-2" id="does-to-to-tag-viagra-media">Huaps 8/5 As $DTEn Hance</h5></div>
<div class="pancel-item" role="presentation"><label foote="heiofigation Baker" property="og:type"/>
        <meta content="#55633563398" property="og:type">
            <path d="M823.5 466.8c3.3.5-.0-.0-.7-.1-.5-.6-.2-.2-.5-.1-.3-.5-.1-.5-.8-.3-.6.5-.5-.1-2.8-.1h-.56.2l.2-1.3-.2-.6.6-1.4h-15V9605.6z" id="SVGID_1635_"></path></div></div></div></section></span></comme><span class="_homt-programeposo">Ro'%= js onCodn</a>.
 .twost-item with 15-PtCushier LieppoothG">** <a href="" href="mailto" method="#11V3621" rel="path-dessert-204,00">
     <div class="page-linkbutton"></div>
 <h2 class="entry-title"><a href=""><div class="meta meta--giant"></span>
 <div class="say-meth_select_on iframe--yra-wrapper" dick="X1048867" data-lavel="+nowertitlenous" id="id-menuly">
