Although I feel his pain, this is a very good article on how strict types and proper unit tests can save a bad day
I’m not so sure strict types would have helped any more here. There were already strict types, just on the database layer (not null boolean). Had the strict types been on the PHP level, the requests would have still blown up, just in the PHP layer.
And the “error page” would still have been that same easily misunderstood one.
Unit tests, integration tests, acceptance tests and a QA process. I firmly believe any development agency or team should have at least one full-time QA person. If not, then you need to budget for the fact that bugs like this will go into the wild frequently.
I am now more of an ops person and in my late thirties, but I absolutely know what that nausea feels like, that sickly prickly sinking feeling.
I am not one to toot my own horn (I have plenty of flaws) but I consider it one of my best traits that I always look for any given problem on my end first. I currently work for a place where that is appreciated and where I can own a mistake. I now know that I never want to work anywhere where they punish people for making an honest mistake. Like you I know they are a fact of life.
This attitude is good because you get things done without communication going back and forth. If you always exclude any possibility that the cause is on your own end before pointing the blame, you get problems fixed and you always have an answer when someone else points the finger back – I’m sure you’ve been in the situation where another supplier has wrongly pointed the finger at you, and like me, you probably know that feeling of: yeah I used to be that guy.
Great post Brendt. I know I didn’t tell you anything you don’t know already. But I wanted to share regardless. Hopefully there are people out there who can prevent a few mistakes like this by reading your post and other comments people will have here.
So… no one from sales rang up marketing / engineering to say that leads have dried up? I think 100+ missing leads in one month should have been caught in some weekly roll up?
These are the new leads. These are the Glengarry leads. And to you, they’re gold. And you don’t get them. Because to give them to you is just throwing them away. They’re for closers.
Great article. Usually (in my experience) there are several things contributing. Sure, this little typo caused a month’s loss in leads, but technically speaking so did the lack of error reporting (and escalating to a human being) in production.
Truth is, running a business/website is a team effort. So is writing code. I might be a great developer, but when I’ve spent an entire day writing, testing, red-green refactoring that same code… The thing I did wrong in the morning of that day will still be wrong at the end of the day. I don’t know what I don’t know. I don’t see what I don’t see. That’s why I don’t just value the input of my team, but depend on it. Two sets of eyes see more than one.
Also, brendt, if I can nitpick. Perhaps have a coworker review a blogpost? 🙂 There’s a “client complaint” that should be “client complained”, there’s a “think to problem” that should be “think the problem”, “a unknown” should be “an unknown” and “an lasting impact” that should be “a lasting impact”. Sorry, just trying to help.
Thanks, I wrote this post early morning and some errors slipped through… I really appreciate it!
and escalating to a human being
I think I know where it comes from. In almost every place I’ve worked, there were “habitual” errors in the error log. Being non-critical, they stayed in the logs for years, as there were always more pressing matters to attend to. Such errors have a tendency to add up, and as a result you have quite polluted logs, which makes it much harder to spot a critical error. And although the problem described in the article could have been overlooked due to different issues, it clearly shows that this kind of technical debt must be paid off before it’s too late.
Wow, the way this was initially handled definitely rang true for me. I’ve definitely had a few of these where initially I just refused to imagine it could be my mistake. Nice article on how to actually handle it (and the feelings of guilt and shame after).
Philosophical lesson aside, I’d highly recommend installing a spell checker in your IDE to catch these sorts of errors as you type. Spell checking isn’t one of those features people tend to consider “essential” in their IDEs but I’ve found them to be invaluable in my development flow.
Mine runs on-type and highlights unknown words (works with symbol names, understands different casing schemes to denote word boundaries etc) with a distinctive blue line, so I know it’s a spelling typo and not an actual coding error.
Yes, it’s annoying when it doesn’t recognise certain keywords, names or abbreviations, such as “Spatie”, but as long as you’re adding/ignoring those as you go, you’ll be able to very easily spot and rectify spelling mistakes in variable names, strings, arrays etc and avoid a lot of spelling-related bugs in your code altogether.
For those using vscode I’d recommend “Code Spell Checker” (cSpell) by Street Side Software.
I feel for this guy, I had a somewhat similar issue, in some parts worse, and in others not as bad.
In my junior days, I worked for a small company where I was one of two developers (me being junior, the other was mid level.. we were also the only IT as well). Since we were small, us two developers rarely worked on the same projects.
I was tasked with setting up a very basic conference registration page for a client (luckily the owner’s wife) that required payment to be collected. Now they didn’t want their clients to leave the page when entering payment and wanted the payment form to be integrated into the webpage. PCI compliance? What the hell is that? SSL? Is that a mistype of SQL… You get it, junior guy who knows very little about security, compliance, etc.
So we collected CC#s, and I decided to use Paypal as the gateway. I collected the CC#s, expiry date, names, CVV, and used the Paypal API to submit the payments.. Tested using Paypal sandbox, all is good. We had a hard launch date at the end of the month. I finished it early, released it a few days early.. all good. A couple of days after launch day our client emails us saying, I know you guys launched, I see the registration list… how come we aren’t receiving any payments in Paypal? I thought, that is odd. I mean nothing can be wrong on my end, right? I tested it with the Paypal sandbox and it worked perfectly! I even tested it with live Paypal with a real CC number with $1 passed in and it accepted it.. So I said, well maybe there is some sort of holding period on Paypal? Maybe you need to wait a few days? They said yeah that could be it. A few days later, they contact us again and say still no payments. Okay, fine, I’ll look at my code.
Oops. Forgot to switch the Paypal API from sandbox to live. I notified our client about this… Luckily we did have all of the contact details regarding the registration in the database, but they had to go ahead and call up hundreds of people to manually collect payment card details.
One thing the owner was always awesome about was that he had a motto of I will never yell at you for making a mistake for the first time, but you better well learn from it and make sure it never happens again because then I will yell at you. Well how can I learn from this mistake so it doesn’t happen again? Oh easy, store the CC details in a log file on the server so I have them if I need to manually run them…
Luckily we never got hacked, and I was smart enough to not store it in a web-accessible location. After the conference registration was closed I did delete the log file. I quickly did learn about environment variables as well as a lot more about security….
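For anyone wondering what the environment-variable lesson might look like in practice, here is a minimal sketch. It assumes two hypothetical variables, `APP_ENV` and `PAYMENT_GATEWAY_MODE`, and a hypothetical helper `resolve_gateway_mode`; none of these come from PayPal's SDK or from the story above. The idea is simply that the app refuses to start when production is pointed at the sandbox, instead of silently accepting test payments.

```python
import os

# Hypothetical mode labels; a real gateway SDK has its own settings.
GATEWAY_MODES = {"sandbox", "live"}


def resolve_gateway_mode(env=None):
    """Pick the payment gateway mode from the environment, failing loudly."""
    env = os.environ if env is None else env
    app_env = env.get("APP_ENV", "development")
    mode = env.get("PAYMENT_GATEWAY_MODE")

    if mode not in GATEWAY_MODES:
        # No silent default: a missing or misspelled value stops startup.
        raise RuntimeError(
            f"PAYMENT_GATEWAY_MODE must be one of {sorted(GATEWAY_MODES)}, got {mode!r}"
        )

    if app_env == "production" and mode != "live":
        # The exact mistake from the story: live site, sandbox gateway.
        raise RuntimeError("Refusing to start: production is pointed at the sandbox gateway")

    return mode
```

Calling this once at startup turns a forgotten switch into a failed deploy rather than a week of missing payments.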
Good read, it’s fair to say we’ve all been in similar situations. It looks like you had the sense to look back and understand the multiple areas where things should have been better or different to avoid or fix the issue.
The main one that stood out to me wasn’t the technical points around form libraries or release process etc, it was that you blamed others before even looking at the issue yourself. That’s probably just a function of your junior experience and it’s a valuable lesson that some people don’t ever learn; make sure it’s not your fault before blaming others. Just spending 2 minutes to fill in the form yourself would have shown you the blank page and should have immediately pointed to something other than wrong email address or bad analytics JS and pointed you towards the server error logs.
Think we have all been there at some point or another.
It’s a good example of where automated testing on pull requests can really help catch things like this.
Nothing ever goes wrong for the one who does nothing! And believe me, I used to know people like that during my career.
Anyway, don’t feel bad, it happens, and consider this a lesson learned! Once, but not twice 🙂
The problem here was not the bug. Bugs happen. The specifics of how the bug happened are not really that important. The problem here was lack of testing (automated or otherwise) and of monitoring.
Good writeup 👍
First lesson I learned when hired at a legit, big time dev agency as a new mid level dev was to stop using magic strings. No matter what, if you see yourself typing a string into a variable or array index, you need to move it to a constant immediately.
Also make small object classes with properties for everything, and stop returning arrays with named indexes.
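A rough sketch of both habits, in Python rather than PHP. The field names, the `Lead` class and `lead_from_form` are all made up for illustration and are not taken from the article. The key names are typed once as constants, and the result is a small typed object instead of an array with named indexes.

```python
from dataclasses import dataclass

# Each form key is spelled exactly once; mistyping the constant name
# elsewhere is a NameError instead of a silently missing value.
FIELD_EMAIL = "email"
FIELD_NEWSLETTER = "newsletter_opt_in"


@dataclass(frozen=True)
class Lead:
    """Small value object replacing an array with named indexes."""
    email: str
    newsletter_opt_in: bool


def lead_from_form(form: dict) -> Lead:
    """Turn raw form input into a typed object.

    A missing email raises KeyError. An unchecked checkbox is simply
    absent from the form data, so it maps to False.
    """
    return Lead(
        email=form[FIELD_EMAIL],
        newsletter_opt_in=form.get(FIELD_NEWSLETTER) == "on",
    )
```

With this, a misspelled property such as `lead.emial` raises an `AttributeError` straight away, where a misspelled array key would usually just give you nothing.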
As your manager said, it was a team failure. You shouldn’t hit yourself too hard but I understand your position. We all have been at some point in our careers in similar situations.
If I can add a comment on that, I’d say it’s the process of the company you worked (or work) at that is problematic.
Code reviews and demos to QAs are mandatory. So are monitoring and logging solutions with custom alerts. The process of delivering features is as important as (or even more important than) your underlying technology stack and automated testing solutions.
What I check the most in my projects is that error reports (exceptions, warnings, etc.) from production get sent to me automatically by email. Having tests is great, checking code is also good, but catching the unexpected problems in production has been most beneficial, as that is where errors have a big impact, and you want to know as much about each error as possible.
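One way this kind of setup could look, sketched in Python with the standard library's `logging.handlers.SMTPHandler`. The mail host and addresses are placeholders, and `make_mail_handler` and `report_errors` are hypothetical helper names, not anything from my actual projects.

```python
import functools
import logging
import logging.handlers


def make_mail_handler():
    """Build a handler that emails ERROR-level records (placeholder host and addresses)."""
    handler = logging.handlers.SMTPHandler(
        mailhost=("smtp.example.com", 25),
        fromaddr="app@example.com",
        toaddrs=["dev@example.com"],
        subject="Production error",
    )
    handler.setLevel(logging.ERROR)
    return handler


def report_errors(logger):
    """Decorator: log any uncaught exception with its traceback, then re-raise."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # logger.exception logs at ERROR level and attaches the traceback.
                logger.exception("Unhandled error in %s", func.__name__)
                raise
        return wrapper
    return decorate
```

In production you would attach the handler once with `logger.addHandler(make_mail_handler())` and wrap the entry points, such as the function that saves a form submission. The exception still propagates, so the user sees the same error page, but a human hears about it the same minute.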
I know this feeling exactly. When I was a junior dev, years ago, I was put on a project far over my head, which was just a “simple” sign up form for a beta. We knew the form would get millions of submissions, so we spent a massive amount of time setting up AWS infrastructure with autoscaling, az and region fail over, and a distributed mongo database with automatic fail over and read replicas. I performance tested the thing up to 15k submissions per second; it was bulletproof. I worked nights and weekends on that project, learned JavaScript on that project, and pushed my AWS skills to the limit.
The client also wanted all the data encrypted in movement (ssl) and at rest, so you couldn’t just pull open the database and look at the submissions, it was all just encrypted nonsense, and that was our downfall.
I was working on the app entirely alone with a little bit of oversight from a senior dev, and I had spent so much time just building the damn thing, I had little time to test it, and no one else tested it end-to-end, meaning submitting the data, extracting and decrypting the data, and comparing it.
After about 1 million submissions (which we got in the first few hours) we ran a test decryption utility I had built and sent the data securely to the client. It was missing 3 fields (of about 25) due to a typo, which they noticed immediately.
I stayed at work until 3am that day, patching the issue and proving to the client we were now capturing all the data.
And you know what the issue was? A simple typo in my mongoose model definition. After hundreds of hours of building the app, performance testing it, ensuring it was not going to go down under any circumstances, we missed such a simple thing.
The first lesson I learned from that was typos happen, we’re not machines. My company failed to provide proper resources for accurate testing, we should have had a test plan, and a dedicated technical QA resource that could test the app end-to-end, but we didn’t. I had to get 1000 things right and not miss, and one thing got missed.
We almost got sued, the client refused to pay the last invoice, and we ate a large portion of the cost of that app build because of it.
But that’s not all. During the project post mortem it was discovered that the project was undersold by 3x its cost. Whoever sold the project to the client to meet their performance specifications simply didn’t understand how much work was involved, and just saw “sign up form, no problem”. It’s because of this that we didn’t have the resources for testing, it’s because of this a junior dev (me) was assigned to the project to keep the cost low, and the senior dev was minimally involved. The harsh reality is what the client was asking for was far and above what they were willing to pay, and that should have been caught before we started anything. Had we reduced the scope, maybe we would have been able to test it better. Who knows.
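For what it's worth, the end-to-end check we never ran (submit the data, read it back, compare) does not need to be big. Here is a minimal sketch in Python. It assumes a store that silently drops any field the schema does not declare. The field names are invented, the typo is deliberate, and the encryption step is left out.

```python
# Hypothetical form fields the client expects to receive.
EXPECTED_FIELDS = {"first_name", "last_name", "email"}

# Hypothetical schema with a deliberate typo ("lastname"),
# standing in for a mistyped model definition.
SCHEMA_FIELDS = {"first_name", "lastname", "email"}


def store(submission: dict, schema_fields: set) -> dict:
    """Mimic a store that silently drops keys the schema does not declare."""
    return {key: value for key, value in submission.items() if key in schema_fields}


def missing_after_round_trip(submission: dict, schema_fields: set) -> set:
    """Return the submitted fields that did not survive storage unchanged."""
    stored = store(submission, schema_fields)
    return {
        key for key in submission
        if key not in stored or stored[key] != submission[key]
    }
```

Running one sample submission through this before launch would have flagged `last_name` as lost, which is exactly the class of bug that a load test at 15k submissions per second never touches.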