We all make mistakes, and sometimes they have real consequences. Here are three programming mistakes of mine that did, and what I learned from each.
Publishing the wrong code
I originally published URList, a free Chrome extension, in 2017. It had been working fine for years, accumulating 75 weekly active users without me promoting it. I made it for myself, to make some tasks at work easier, but it is cool that other people are using it.
75 weekly active users is not a lot, but it is enough to earn you a one-star review as soon as a broken version goes live, which is exactly what happened.
The only reason I updated it at all was a warning message from the Chrome store saying that URList was requesting permissions it didn’t need, namely localStorage and activeTab. Without really thinking about it, I pulled the repo and edited the extension’s manifest file so that it would no longer request those two permissions.
Silly me. The repo wasn’t up to date with the live version, so publishing from it rolled the extension back to an earlier release that was missing features. The only copy of the live code was on a laptop that would no longer turn on. And worse, it turned out the Chrome store’s automated message was wrong: I actually did need the localStorage permission, so removing it stopped the extension from working at all.
As is so often the case, this wasn’t just one mistake but a series of mistakes that compounded to earn me my first one-star review. I hadn’t touched the URList code in years, and instead of checking it to verify the warning, I just went with what the Chrome store said. I also wasn’t as meticulous in 2017 about keeping repos up to date as I should have been.
After I realized the extension wasn’t working, I fixed it and republished it to the Chrome store. If I had taken the time to test the out of date version, and to test it without localStorage, I would have seen these problems and avoided giving people a bad experience.
The lessons I take from this are:
- Don’t overreact to automated warnings.
- Take a moment to relearn what your code is doing if you haven’t looked at it in years.
- Never blindly publish from a repo, assuming it is up to date. Test it locally first.
Letting bots wreak havoc
A little over six years ago, I built a website full stack for the first time. I wanted to build a site with entirely user-generated content. Over time, people started creating pages on the site, and it was really cool to see what they created.
After the site had been up for about a year, one day I noticed a massive number of pages being created at a rate of one per minute. They were all in Japanese. I ran the text through a translator and, without going into detail, it was further confirmation that I was dealing with web spam.
This is a very easily avoidable mistake, but at that time I was brand new to full stack development. The only spam protections I had used up to this point were built-in tools like Akismet. I was familiar with techniques to prevent spam, but I knew my site didn’t have a huge audience and so wasn’t concerned.
My first response to the spambot was the hackiest, least scalable approach possible: I went into my server file and added a line of code that blocked that user’s account specifically. Oh yeah, I also hadn’t coded up my own admin dashboard at that point, which would have let me block a user account without making edits on the backend.
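For illustration, a hard-coded block like that looks roughly like this. My actual stack was different; this is just a minimal sketch assuming a Flask-style server, with a made-up current_username() helper and account name.

from flask import Flask, request, abort

app = Flask(__name__)

def current_username():
    # Hypothetical helper; in the real app the username came from the session.
    return request.cookies.get("username", "")

@app.before_request
def block_spammer():
    # The "fix": refuse every request from the one account doing the spamming.
    if current_username() == "spam_account_123":
        abort(403)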
This worked for a couple of hours. The bot came back with a new account and started spamming again with a vengeance. Since it took so long to return, I wondered if I was dealing with a person on the other end making edits to their bot in real time.
Next, I checked my log files to see whether the IP address was the same for all these requests to my site, and it was. So, I again did the most hacky and unscalable thing and added a line of code to my server file that blocked that IP address.
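The IP block was the same idea, one more hard-coded check bolted onto the same hypothetical server sketch (the address below is a placeholder, not the bot’s real one):

BLOCKED_IPS = {"203.0.113.42"}  # placeholder address

@app.before_request
def block_spammer_ip():
    if request.remote_addr in BLOCKED_IPS:
        abort(403)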
Again, this worked for some time. Later that morning the bot was back. It took a little longer to come back this time, which makes me think the person on the other end needed to acquire a pool of IP addresses they could cycle through, and edited their bot to try different IPs in the pool.
Finally, I added a CAPTCHA, specifically Google’s reCAPTCHA, to the new-account flow. That did the trick, but a lot of avoidable damage had already been done.
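The server-side half of that check is small. Here is a minimal sketch of verifying a reCAPTCHA v2 token at signup, assuming the form posts the token (v2 sends it as g-recaptcha-response) and that the requests library is available; the secret key is a placeholder.

import requests  # third-party HTTP library

RECAPTCHA_SECRET = "placeholder-secret-key"

def recaptcha_passed(token, remote_ip=None):
    """Ask Google's siteverify endpoint whether the submitted token is valid."""
    resp = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": RECAPTCHA_SECRET, "response": token, "remoteip": remote_ip},
        timeout=5,
    )
    return resp.json().get("success", False)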
Since it had taken me over a day to notice the spam, there were now hundreds of spam pages on my site. The search engines had picked up some of them. For about a week my site was showing up in Google for the Japanese words for online casinos and erectile dysfunction.
The takeaways from this experience were:
- User generated content is a double edged sword. Your site’s quality depends not only on the quality of the content people are creating, but also on how well you ensure they are actually people.
- If you don’t follow best practices, spambots will eventually find a way in.
- Even if you don’t expect much traffic, your authentication workflow should be built as if you do.
Captchas, by the way, are not the only method to prevent spam. It was just the simplest solution in this case. Automated IP blocks, user moderation, and rate limiting, for example, are other ways to prevent this from happening.
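To give a flavor of the rate-limiting option, here is a tiny in-memory, per-IP sliding-window sketch; a real deployment would want shared state (Redis, for example) and tuned limits.

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_PER_WINDOW = 10

_recent = defaultdict(deque)  # ip -> timestamps of that IP's recent requests

def allow_request(ip):
    """Return True if this IP is still under the limit for the sliding window."""
    now = time.time()
    stamps = _recent[ip]
    # Drop timestamps that have aged out of the window.
    while stamps and now - stamps[0] > WINDOW_SECONDS:
        stamps.popleft()
    if len(stamps) >= MAX_PER_WINDOW:
        return False
    stamps.append(now)
    return True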
I have to admit it was fun hacking against somebody out there in the world. It made my day interesting. I hope that person is doing okay.
Crawling impolitely
Treating mistakes as a necessary, even healthy, part of business is a cliché by now: “fail faster,” “move fast and break things.”
That mindset, while useful, is a privilege of the people building their own product.
Those working on someone else’s product, on the other hand, may find it harder to take their mistakes in stride, because a mistake could get them fired. This next one didn’t get me fired, but it could have.
I’ve built plenty of web crawlers, and I’ve come up against plenty of automated systems designed to stop them. Using a VPN, rotating through a pool of proxy IPs like my friend above, spoofing the User-Agent header, using a headless browser: these are all useful for keeping your bot from getting blocked.
But the best way to not get blocked is the simplest: slow it down.
Or better yet, slow it down for varied lengths of time, for example:
from random import uniform
from time import sleep

def humanize(x, y):
    """Pause for a random amount of time, usually between x and y seconds,
    occasionally much longer, so requests don't arrive on a fixed cadence."""
    n = round(uniform(1, 10))
    if n <= 4:
        # Most of the time: a normal pause.
        sleep(uniform(x, y))
    elif n <= 7:
        # Sometimes: a bit longer.
        sleep(uniform(x, y * 2))
    elif n < 10:
        # Occasionally: noticeably longer.
        sleep(uniform(x * 2, y * 3))
    else:
        # Rarely: a long break, as if the "user" stepped away.
        sleep(uniform(x * 3, y * 10))
Slowing your crawlers down not only makes them appear more human, which helps you go undetected, but also reduces the load on the target site’s servers. By reducing the server load you are, as they say, crawling “politely.”
Generally speaking you can make more requests per second for sites that have high traffic, because they have the servers available to handle more requests. They still may use automated IP blocks, however, so even though their servers have plenty of capacity, they may still block you if you don’t use proxies and rotate them.
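If you do rotate proxies, the mechanics are simple with a library like requests; the proxy URLs below are placeholders for whatever pool you actually have.

import random
import requests

# Placeholder pool; real entries would come from whatever proxy provider you use.
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
]

def fetch(url):
    # Pick a random proxy for each request so traffic isn't tied to one address.
    proxy = random.choice(PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)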
All that to say, when I made this mistake, I was well aware that it is a best practice to crawl politely and that I shouldn’t crawl from my real IP address.
I did neither of those things.
I wanted to get to the deliverable as soon as possible, and I knew that my client’s website was massive (in both the number of pages and traffic to those pages) so I erred on the side of more requests per second.
Lo and behold, my client’s servers stopped responding to my requests. I figured this was just a temporary block: I would simply slow my crawler down, restart it later in the day, and there would be no issue.
Nope. And not only that, there were other people in my company working on the client’s site and they couldn’t access it on their laptops. Oops.
They had to go through a VPN to view their own client’s website. The wifi was shared among the whole building, so probably nobody else in the building could access the site either, which in theory could have hurt my client’s sales. But realistically I doubt it had any financial impact.
At that point I was hoping it was a 24-hour block, but really I had no way of knowing. I was just thinking positive. I told people it was my fault, and that either we would come back in tomorrow and it would be like nothing had happened, or I would need to get on a call and have them manually remove our IP address from their blacklist.
The next day, still nothing. So, hat in hand, I got on a call with their dev team. I explained my mistake, gave them our IP address, and they whitelisted us.
The things I take away from this experience:
- If you can, have your clients whitelist your IP address and/or user-agent (and use a unique UA that identifies you) if you plan to crawl their sites; see the sketch after this list.
- Err on the side of being too polite to web servers, even if you’re in a hurry.
- Never use your real IP address when crawling. Always use a proxy, or just use a VPN.
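For what it’s worth, combining an identifying User-Agent with a polite delay is only a few lines; the bot name and contact address below are made up, and humanize is the delay helper from earlier.

import requests

HEADERS = {
    # A unique, honest User-Agent so the client can whitelist, or at least
    # identify, your crawler. The name and email here are placeholders.
    "User-Agent": "ExampleAuditBot/1.0 (+mailto:dev@example.com)",
}

def polite_get(url):
    humanize(2, 5)  # the randomized delay helper defined earlier in this post
    return requests.get(url, headers=HEADERS, timeout=10)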
I’m sure I’ve made other mistakes, and I’m sure I will make more. The important thing is that we learn from our own mistakes, and that we share our experiences so we can learn from each other. What are some mistakes you’ve made, and what did you learn?