A few months ago, I released a site called Feldot. It was a novel website discovery application that has unfortunately failed to grow. Because it failed, I’ve decided to discuss its inception, design choices, and why I think it failed.
Eight months ago now, I created a toy called Randomsite (blog post here). I used a port scanner to find random web servers on the internet and stuck them in a SQLite database. I learned the very basics of a web application framework called Django, just enough to get something functional to redirect visitors, and I set the site up so that people were redirected to random web servers when they clicked the link.
After posting the site on Hacker News and getting reupped by dang (thanks btw), the post took off and I ended up with thousands of people and bots checking out my site. I remember monitoring the tail of the nginx access logs and watching connections fly by, a sense of mild euphoria flowing through my veins. I had created plenty of software before: various tools and toys since I was 14. However, this was the first time people actually used something I created, and it made the entire experience 100x more enjoyable. It ignited something in me.
Not only did it get seen, but I got feedback on what needed to be better. Originally, the software added servers to the database even when they returned error messages, often enough that people complained. With that feedback, I removed many of those sites and immediately saw a huge spike in errors from my site. I broke something! I frantically scrambled, fixed a few bugs in my code (the main one involved “random” selection from the database that assumed there weren’t gaps in the id field), and drastically improved the experience by using user feedback. The entire experience was incredible, and I wanted to do it again.
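To make that bug concrete, here is a minimal sketch of how it can happen and one way around it. I'm not reproducing the original code here: the `sites` table, its columns, and the function names are all assumptions made up for illustration.

```python
import random
import sqlite3


def make_demo_db():
    """Build an in-memory table with gaps in the ids (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sites (id INTEGER PRIMARY KEY, url TEXT)")
    conn.executemany(
        "INSERT INTO sites (id, url) VALUES (?, ?)",
        [(i, f"http://site{i}.example") for i in range(1, 6)],
    )
    # Removing rows leaves holes at ids 2 and 4.
    conn.execute("DELETE FROM sites WHERE id IN (2, 4)")
    return conn


def random_site_naive(conn):
    """Fragile: assumes every id from 1 to MAX(id) still exists."""
    (max_id,) = conn.execute("SELECT MAX(id) FROM sites").fetchone()
    wanted = random.randint(1, max_id)
    row = conn.execute("SELECT url FROM sites WHERE id = ?", (wanted,)).fetchone()
    return row[0] if row else None  # None whenever the chosen id was deleted


def random_site(conn):
    """Pick uniformly among the rows that actually exist."""
    (count,) = conn.execute("SELECT COUNT(*) FROM sites").fetchone()
    if count == 0:
        return None
    offset = random.randrange(count)
    row = conn.execute(
        "SELECT url FROM sites ORDER BY id LIMIT 1 OFFSET ?", (offset,)
    ).fetchone()
    return row[0]
```

The offset version has to skip over rows on every call, so it gets slower on big tables, but it never lands on a hole.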
I thought for a long time about how to make novel website discovery interesting, social, unique, and better than what I already had made. I decided I wanted a reddit-like site, but instead of having posts link to URLs like reddit.com/r/funny, they would instead link to only domain names like reddit.com. An issue here is the difficulty of new-site discovery. I knew I had to combine the site with a tool that made new sites easy to find. I got to work.
A big issue with the old site was that end users connected directly to an IP address, so their requests carried no domain name. Many web servers rely on the domain name in the request to decide which site to serve. Nginx, the web server I used, calls this name-based virtual hosting, and other web servers have something similar. This configuration allows many websites to be hosted on a single IP address, saving on cost and making better use of computer resources. Since my toy left out server names, it excluded every single website in this configuration, which is most of them. To find most websites, I needed access to the zone files, which list the registered domain names under each top-level domain. I had to jump through some hoops and cut through red tape, but after a few weeks, I ended up with the ICANN zone files for the most popular (in the US) top-level domains, including all .com names. The fun work could begin.
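As a rough illustration of why a bare IP address misses most sites, here is a toy model of name-based virtual hosting. It is not how nginx is implemented, and the hostnames, page strings, and `route` function are hypothetical. The idea is only that the server picks a site from the `Host` header of the request and falls back to a default when the name is unknown.

```python
# Hypothetical sites that all share one IP address.
SITES = {
    "example.com": "Welcome to example.com",
    "example.org": "Welcome to example.org",
}
DEFAULT_PAGE = "Default server page"


def route(host_header):
    """Return the page a name-based virtual host setup would serve."""
    if not host_header:
        return DEFAULT_PAGE
    # Drop an optional ":port" suffix and ignore case.
    hostname = host_header.split(":", 1)[0].lower()
    return SITES.get(hostname, DEFAULT_PAGE)
```

A visitor sent to `http://203.0.113.7/` presents the IP address as the host, so `route("203.0.113.7")` returns the default page and neither of the real sites is ever reached.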
I randomized and filtered the zone file data so that only the domain names were placed into a SQLite database. I then made a series of Python and shell scripts that pulled the next row from the database and checked whether a web server answered at that domain. If it responded, didn’t return an error, didn’t have a bunch of numbers in the name, and passed through several other filters, I saved its URL, IP address, and the first 100 bytes of HTML data to another database.
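The filtering step might look roughly like the sketch below. The network fetch is left out so the logic stands on its own, and the digit cutoff, the status check, and the function names are assumptions for illustration. Only the 100-byte snippet size comes from the description above.

```python
MAX_DIGITS = 3       # hypothetical cutoff for "a bunch of numbers"
SNIPPET_BYTES = 100  # how much of the HTML to keep


def has_many_digits(domain):
    """True when the domain name contains more digits than the cutoff."""
    return sum(ch.isdigit() for ch in domain) > MAX_DIGITS


def keep_site(domain, ip, status, body):
    """Return a record to save, or None if the site should be skipped.

    `status` is the HTTP status code, or None when nothing responded.
    `body` is the raw response body as bytes.
    """
    if status is None:           # no web server answered
        return None
    if status >= 400:            # server answered with an error
        return None
    if has_many_digits(domain):  # looks machine-generated
        return None
    return {"url": domain, "ip": ip, "snippet": body[:SNIPPET_BYTES]}
```

Anything that returns `None` is dropped, and everything else is ready to be written to the second database.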
After several thousand sites were saved, I noticed an issue on review. There were a lot of spam sites that needed to be filtered out, but exact-match-and-delete scripts didn’t work because there were small variations between a lot of the sites. Long story short, I used a Python module called difflib to do a fuzzy comparison against known spam sites and deleted any site that scored within a certain threshold. This comparison was computationally expensive, so I parallelized it with the multiprocessing module so that it wouldn’t take a few weeks to complete.
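A minimal sketch of that approach, assuming the comparison ran on the saved HTML snippets. The spam samples, the 0.9 threshold, and the function names are placeholders I picked for the example, not the values actually used.

```python
from difflib import SequenceMatcher
from multiprocessing import Pool

# Hypothetical snippets taken from sites already identified as spam.
KNOWN_SPAM = [
    "<title>Buy cheap pills online today</title>",
    "<title>Domain parked, click here</title>",
]
THRESHOLD = 0.9  # assumed cutoff; 1.0 means the strings are identical


def spam_score(snippet):
    """Highest similarity ratio between the snippet and any known spam."""
    return max(SequenceMatcher(None, snippet, spam).ratio() for spam in KNOWN_SPAM)


def is_spam(snippet):
    """True when the snippet is close enough to a known spam sample."""
    return spam_score(snippet) >= THRESHOLD


def filter_spam(snippets, workers=4):
    """Score snippets in parallel and return only the ones that are not spam."""
    with Pool(workers) as pool:
        flags = pool.map(is_spam, snippets, chunksize=64)
    return [s for s, flagged in zip(snippets, flags) if not flagged]
```

When running this as a script, the call to `filter_spam` should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module cleanly.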
Eventually I ended up with a list of 100,000 interesting sites with a good enough signal-to-noise ratio to get the site started. The 100,000 sites were loaded into the PostgreSQL database, and the explore section of the site read through them sequentially. That just left the reddit-like front page.
I won’t go too much into the creation of the reddit-like front page. The ranking of the most recent 10,000 posts is recalculated periodically as a function of time, up/downvotes, and moderator inputs, and the resulting order of the sites is cached in Redis. Loading the front page queries Redis, with an offset based on what page you are on. There are plenty of posts on how reddit was created, so feel free to check those out.
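For a rough idea of what that looks like, here is a small sketch. The scoring formula is not the one the site uses. It is an assumption loosely modelled on the widely described reddit "hot" ranking, with a made-up `mod_boost` term standing in for moderator input. The page size and all names are hypothetical too, and a plain Python list stands in for the Redis cache.

```python
import math

PER_PAGE = 25  # hypothetical page size


def score(ups, downs, created_ts, mod_boost=0.0):
    """Combine votes, post time (Unix seconds), and a moderator adjustment."""
    net = ups - downs
    sign = (net > 0) - (net < 0)
    magnitude = math.log10(max(abs(net), 1))
    # Newer posts get a steadily larger time bonus.
    return sign * magnitude + created_ts / 45000 + mod_boost


def rank(posts):
    """Return post ids ordered from highest to lowest score."""
    ordered = sorted(
        posts,
        key=lambda p: score(p["ups"], p["downs"], p["created"], p.get("mod_boost", 0.0)),
        reverse=True,
    )
    return [p["id"] for p in ordered]


def page_bounds(page, per_page=PER_PAGE):
    """Inclusive start and stop indexes for a 1-based page number."""
    start = (page - 1) * per_page
    return start, start + per_page - 1


def front_page(cached_ids, page):
    """Read one page from the cached order (a list standing in for Redis)."""
    start, stop = page_bounds(page)
    return cached_ids[start:stop + 1]
```

With a real Redis list, the same inclusive bounds could be handed to the `LRANGE` command, since its stop index is inclusive as well.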
The site was posted, and it took off about as well as a rock takes off into space. Looking at the site, it isn’t surprising. Making the site, I tried to focus entirely on UX, making the site simple and easy to use on mobile, and making it extremely fast to respond. Looking back, I think more of a focus on appearance and UI would have done some good.
This site depends heavily on network effects: there need to be enough people posting interesting content to keep the site self-sustaining. Ultimately, this site did not get there. I take solace in the fact that while the site didn’t take off, I learned enough in the process that I could finally make a long, rambling blog post of my own.