Showing posts with label refactoring.

Monday, May 23, 2016

Generic Software Solutions Are Unicorns

How many times have you been working on a platform and recognized the need for some underlying technology that ties together several disparate components in your system? The conversation usually goes something like this:
We've been building component A for some time and now we need to build components B and C. A's been doing well, but it has a lot of warts and tech debt that we'd like to fix. If only we had a generic system X, we could build B and C faster and reduce redundancy between A, B, and C.
On the surface that reasoning seems sound. You're not dreaming up an ideal system and looking for problems it can solve. You have real world problems you're trying to solve for components A, B, and C. So what's wrong with setting out to build a generic system to solve your problems?

You aren't providing business value along the way


By building a new generic system instead of refactoring your existing system, you're increasing your opportunity costs. Every day you spend building software that doesn't get into customers' hands is a day of missed opportunity for feedback. You aren't able to learn whether your solution actually fits the customer need. You aren't able to discover the unknowns in your system that only surface in production. You aren't able to validate the many assumptions you've had to make along the way. Building successful software requires a closed feedback loop between you and your customers. The longer you go without closing that loop, the more risk you add that you're building the wrong thing.

Generic systems come with too many unknowns


A generic system needs to be able to solve problems it doesn't even know about. But often, these aren't problems that you have today or will likely have in the near future. Building your system in such a way that it can be delivered in small pieces that provide immediate business value allows you to make sure you're solving the correct problems today, and building a foundation that can be refactored and abstracted tomorrow to solve different problems that are unknown today.

It's easier to do the wrong thing than it is to do the right thing


When building a system for which you don't know the full requirements, it's easier to do the wrong thing than the correct thing. That's because your possibilities for what you could build are almost infinite, and at best you have an educated guess as to what you will need in the future. This means you're going to have to make assumptions that can't be validated. It's very possible that you'll make design decisions based on these assumptions that are difficult to go back and change. Building only the components you need today ensures that your system only needs to change for the problems you have, not the ones you think you're going to have.

Refactoring existing systems allows you to get the more generic solution that you need 


As you build the systems and components you need, you'll be able to start identifying the parts of your system that are needed across components. These common components are the real foundation of the more generic solution you need. Building them to solve your specific problems today means that they will only do what they need to do (i.e. you won't have spent time building features that aren't used). This will create a virtuous cycle of refactoring and adding more functionality as you need it. You'll get the correct level of abstraction for your components because you'll only be abstracting them when you have a real world need for the abstraction.

Monday, June 8, 2015

Minimizing the risk of bugs

Software development is a craft that requires practice, hard work and dedication. It's a craft that involves many edge cases and unintended consequences. As awful as it sounds, we've all got bugs in our code. The goal of writing software should not be to write software with no bugs as this is unattainable. Instead the goal should be to minimize the risk of high severity bugs.

Every change to your software is an opportunity to introduce new bugs. Following these tips will help you minimize the risk of introducing bugs into your system.

Test Your Software


Duh, right? Testing your software may sound like a no-brainer but you'd be surprised by how many times people break builds and introduce buggy software just because they didn't test their code. In my previous post on Testing Your Software Properly I provided a checklist of tests that you should run before you commit your code.

Without proper tests you cannot have confidence that you aren't regressing an old bug or introducing a new one into the system. Tests are the foundation for confidence in any change you make.

Reduce the surface area of your change


The more lines of code you change with every commit the higher the risk of bugs. This is where encapsulation, SOLID principles, and refactoring become so important.

It's important to encapsulate your software into individual, loosely coupled pieces. If a change in one class causes cascading changes throughout code unrelated to your change, you're likely introducing new bugs.

If you follow the Single Responsibility Principle from SOLID, your classes will have one, and only one, reason to change. This reduces the surface area of your change because you will not be touching code in unrelated modules.
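To make this concrete, here is a small hedged sketch (the class names are hypothetical, not from the post) of splitting responsibilities so that formatting and persistence each have exactly one reason to change:

```java
// Hypothetical example: formatting and persistence are separated so
// a storage change never forces a change to formatting code.
class Report {
    private final String title;
    private final String body;

    Report(String title, String body) {
        this.title = title;
        this.body = body;
    }

    String getTitle() { return this.title; }
    String getBody() { return this.body; }
}

// Single responsibility: turn a Report into display text.
class ReportFormatter {
    String format(Report report) {
        return report.getTitle() + "\n" + report.getBody();
    }
}

// Single responsibility: persistence. Kept in memory for the sketch;
// a real store might write to disk or a database.
class ReportStore {
    private final java.util.Map<String, Report> reports = new java.util.HashMap<>();

    void save(Report report) { this.reports.put(report.getTitle(), report); }
    Report load(String title) { return this.reports.get(title); }
}
```

If a requirement later changes how reports are stored, only ReportStore changes; the formatter and its tests are untouched.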

Following principles like DRY and YAGNI will lead to more robust code that is flexible and easy to change.

Keeping your code simple is one of the keys to reducing the surface area of your change.

Reduce the complexity of the code


Overly complex code leads to bugs for many reasons:

  • The code is fragile because the learning curve is steep.
  • It's easier to do the wrong thing than it is the correct thing when making a change in the code.
  • The code is not readable, often forcing you to make multiple context switches to understand a single workflow.

What are some signs that your code is too complex?
  • The patterns in the code are not clear, obvious, and/or discoverable.
  • You have multiple levels of indirection that support one workflow or use case.
  • You have an abstraction that fronts a single concrete implementation.
  • Your code does not have clear boundaries.
  • You have highly interdependent modules.
  • Your code is not highly cohesive.
  • Your code has rigid rules that are not enforceable in their individual units but only as a whole.

Monday, May 25, 2015

Testing your software properly

Here's a general checklist of the type of tests that you can use to ensure that you're testing your software properly before you ship it. This isn't an exhaustive list but can be used as a starting point for you to write a more exhaustive list that's right for your software.

Compile before you commit


Good testing starts with making sure your code actually compiles. Yes, people actually check in code without compiling it first. This is just a dumb mistake and is 100% avoidable. 

Run a clean build before you commit


A somewhat non-obvious thing to check when compiling is that you're not linking against a stale cached object. Compilers often cache compiled objects, track when dependencies change, and only recompile a dependent object when they believe it has actually changed. A stale cached object can be linked against your code and make your software appear to work, even though the object has changed and broken functionality. Doing a clean build before you commit ensures that an object you depend on hasn't changed in a way that breaks your code integration.

Happy Path Tests


Happy path tests should be your minimum bar when committing code. Happy path tests ensure that your code works as intended when used in the way it was designed. These tests can be thought of as functional tests. They test the functionality of the software and ensure that the software meets the business requirement.

Negative Path Tests


Negative path tests ensure that your software is resilient to change. Negative path testing includes using your code in ways for which it was not intended. Common tests include sending in null object parameters and testing upper and lower bounds of parameters. Negative path testing also includes testing that your software properly handles exceptions and throws the proper exceptions.
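As a hedged illustration (the AgeParser class and its bounds are invented for this sketch, not taken from the post), a few negative path tests might look like:

```java
// Hypothetical code under test: parses an age, rejecting bad input.
class AgeParser {
    static int parse(String text) {
        if (text == null) {
            throw new IllegalArgumentException("text is null");
        }
        int age = Integer.parseInt(text);
        if (age < 0 || age > 150) {
            throw new IllegalArgumentException("age out of range: " + age);
        }
        return age;
    }
}

class NegativePathTests {
    // Asserts that an action throws IllegalArgumentException.
    static void assertThrows(Runnable action) {
        try {
            action.run();
        } catch (IllegalArgumentException expected) {
            return; // the proper exception was thrown
        }
        throw new AssertionError("expected IllegalArgumentException");
    }

    public static void main(String[] args) {
        assertThrows(() -> AgeParser.parse(null));  // null parameter
        assertThrows(() -> AgeParser.parse("-1"));  // below lower bound
        assertThrows(() -> AgeParser.parse("151")); // above upper bound
        System.out.println("negative path tests passed");
    }
}
```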

White Box Tests


White box tests ensure that your code works from the inside out. These tests require knowledge of, and often access to, the internals of the object. White box tests usually test the fitness of particular private methods and algorithms.

Black Box Tests


Black box tests ensure that your code works from the outside in. These are a set of tests that verify your objects work from the consumer's perspective. They include making sure the object can be created and initialized, that method calls work according to spec, and that the code does not misbehave from the caller's perspective.

Life-cycle Tests


You should test how your objects function across the various stages of their life-cycle. The key to life-cycle tests is making sure that your objects manage state properly. Life-cycle tests are also useful for making sure that you don't have any memory leaks in your code due to life-cycle changes.

Life-cycle testing includes testing the creation, destruction, concurrency, and serialization of your objects. Two life-cycle areas that tend to cause bugs are failing to test when object state is saved or restored, and failing to test the object in a multi-threaded environment.

Integration Tests


Do you understand how your software works in the context of the larger system of components that use and are used by your software? Integration tests allow you to make sure that your software works end to end in the system as a whole. 


Monday, May 18, 2015

When Not To Refactor

Refactoring software is a crucial part of extending the life of software. Refactoring contributes to enhancing the maintainability of the software by incrementally improving the design, readability and modularity of the components. But not much has been said about when not to refactor software.

Don't refactor code unless you need to change the code for a business reason.


One of the common mistakes I often see with regards to refactoring is when people refactor code that doesn't need it under the guise of making it better. The argument usually goes something like "this needs to be more abstract", "I wrote this code a long time ago and it is crappy", "this code is too complex" or something along those lines.

You should only refactor code when you are already in the code to make a change that supports the business. That may sound counterintuitive, but one of the worst things we can do is change code that has no reason to change, however crappy, unreadable, or complex it may be.

Valid business reasons to change code include (but are not limited to):

  • Adding new functionality.
  • Extending existing functionality.
  • Making measurable performance improvements.
  • Adding a layer of abstraction in order to support a new use case.
  • Modularizing a particular object so that it can be reused in another part of the system.

Adding new functionality or Extending existing functionality


This is where the Boy Scout rule comes into play: if you are already in the code for another reason, you should clean up the code, even if you didn't make the mess.

Making measurable performance improvements


This one is probably self-explanatory, but it's important to note that performance improvements will usually require some level of refactoring.

Adding a layer of abstraction in order to support a new use case


This is an important one to understand. Often people will over generalize code at the beginning. This leads to overly complex designs and less readable code. If we follow the rule of not creating a layer of abstraction until we have at least two or three use cases for the code then there will come a point when you need to refactor the code in order to provide a layer of abstraction that doesn't already exist.

Until that second or third use case comes about the code should not be generalized. You don't have enough information about future uses of the code to get the abstraction correct. You may get lucky and guess at the future abstraction but you don't want to run your business on guesses and luck.
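A hedged sketch of how this plays out (the exporter names are hypothetical): start concrete, and extract the abstraction only when a second real use case arrives to shape it:

```java
// Version 1: one use case, so the code stays concrete.
class CsvExporter {
    String export(java.util.List<String> rows) {
        return String.join(",", rows);
    }
}

// Later, a second real use case (JSON) arrives. Now the abstraction
// is extracted, shaped by two concrete implementations rather than
// a guess about the future.
interface Exporter {
    String export(java.util.List<String> rows);
}

class CsvExporterV2 implements Exporter {
    public String export(java.util.List<String> rows) {
        return String.join(",", rows);
    }
}

class JsonExporter implements Exporter {
    public String export(java.util.List<String> rows) {
        return "[\"" + String.join("\",\"", rows) + "\"]";
    }
}
```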

Modularizing a particular object so that it can be reused in another part of the system


Code reuse is one of the most important tenets of object-oriented programming. When we identify code that is not specific to a particular object or package AND is needed in some other part of the system, we should refactor that code into its own module. It's important to do this ONLY when the code is actually needed in another part of the system.

Don't refactor code without tests


In order to refactor code safely you should have unit and integration tests for the existing functionality. I would also argue that you should write tests for the new functionality before you refactor. This helps you understand the proper way to refactor the code, because it forces you to define how the refactored code will be used from a consumer's standpoint.

If tests don't exist for the existing functionality, write them first, before you start refactoring. This helps ensure that you don't introduce a new bug or regress an old one while refactoring.
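One way to apply this is a characterization test that pins the current behavior, surprising quirks included, before you touch the code. The LegacyPricer class below is invented for the sketch:

```java
// Hypothetical legacy method slated for refactoring.
class LegacyPricer {
    static double price(int quantity) {
        double total = quantity * 9.99;
        if (quantity > 10) {
            total = total * 0.9; // undocumented bulk discount
        }
        return total;
    }
}

class CharacterizationTests {
    public static void main(String[] args) {
        // Pin the current behavior -- the surprising discount
        // included -- so the refactor can be verified against it.
        assert LegacyPricer.price(1) == 9.99;
        assert LegacyPricer.price(20) == 20 * 9.99 * 0.9;
        System.out.println("characterization tests passed");
    }
}
```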


Monday, October 13, 2014

The fallacy of the re-write

I've been in the software industry for a decade and a half and have worked on dozens of projects. Many of the systems I have worked on were considered legacy systems. As with any system, but even more so with legacy systems, developers get frustrated with the system's inflexibility. Inevitably this leads to the developers decreeing that if they could only re-write the system, all their problems would be solved. Unfortunately, most product owners will eventually give in to these cries and commission a re-write.

I'm here to tell you today (as both a developer and a manager) giving in to this urge IS NOT going to solve your problems. What it is going to do is grind your production to a halt and make your customers unhappy. This will have downstream effects on the team as the pressure to produce builds and builds and builds.

So why is a re-write not a viable solution?


Re-writes are usually based on a few commonly held (but false) beliefs in the software industry.

  • If we start the project over from scratch we won't carry the problems from the old system into the new.
  • If we start the project over from scratch we can use the latest and greatest technologies that are incompatible with our current technology stack.
  • If we start the project over from scratch we can move faster and produce results quicker.

Why are these fallacies? If we dig a little deeper, we will see that a ground-up re-write is more likely to introduce problems in the new system than to solve problems in the old one. What is typically glossed over is the fact that the current architecture is doing a lot of things correctly. How do I know this? Because it's the architecture that is in production right now, running your business.

Let's take each of these fallacies one by one.
If we start the project over from scratch we won't carry the problems from the old system into the new.
This statement can really be broken down into two parts. The first part says that there are problems in the architecture that prevent you from extending the code and because you're now aware of those problems you can re-architect the software so that those problems no longer exist. The second part says that you won't carry over existing bugs into the new system. The second part of this statement is really related to the second fallacy, so we'll cover it when we cover that fallacy.

Because it is true that re-writing a system with knowledge of the current architectural problems can help you avoid current pain points most people are quick to accept this statement without challenge. There are many different times in the life-cycle of a product when problems arise. Some arise as bugs when writing the software. These can typically be rooted out with some sort of unit testing. The next class of problems crop up when integrating each of the pieces of the system together. You can create integration tests to help reduce the amount of integration bugs but often there are integration bugs that don't show up in pre-production environments. These tend to be caused by the dynamic nature of content. Because the new system is a re-write of the old system it will be more difficult to use real inputs/outputs from the old system to test the integration of the new system. Because of this you're likely to introduce problems in the new system that don't already exist in the old system. Because the new system won't be in production till it's done, these new architectural problems are not likely to be found till your new system is in production.
If we start the project over from scratch we can use the latest and greatest technologies that are incompatible with our current technology stack.
On the surface this statement is likely true. What this statement hides is similar to what's hidden in the previous statement. New technologies mean new bugs and new problems. Again it is likely that many of these problems won't surface till the new system is in production because, as anyone who has worked in the industry for at least a few years knows, production traffic is always different from simulated traffic. You run into different race conditions and bugs simply because of the random nature of production traffic.
If we start the project over from scratch we can move faster and produce results quicker.
The final fallacy is usually the one that most companies hang their hat on even if they acknowledge that a re-write from the ground up will introduce new bugs and problems and re-introduce existing bugs and problems. The reason is because they believe that their knowledge of the existing system should help them to only solve problems that need to be solved which leads to the system being built much faster.

The fallacy in this statement is more subtle but much more severe than the others, because until your new system performs all the functions of your old system, the old system is superior from a business-value perspective. In fact, it isn't until the new system has 100% feature parity with the old system that it starts to provide the same business value as the legacy system, let alone more. Some will try to gain business value from the new system earlier by switching over before there is 100% feature parity. But by doing this you're offering your customers less value for the same amount of money, time, and/or investment.

This visual does a good job of illustrating the feature parity problem.



What is the solution then?

Are you saying I'm stuck with my current architecture and technology stack? NO! The best way to upgrade your technology stack is to do an in-place re-write. By doing this you help mitigate the problems presented in a ground up re-write. What does an in-place re-write look like?


By segregating and replacing parts of your architecture, you're reducing the surface area of change. This gives you a well-defined contract for both the input and output of the system, as well as the workflow.

An in-place re-write has another huge benefit over a ground-up re-write: it allows you to validate your new system in production as you would any new feature of the system. This allows you to find bugs sooner, as well as validate the workflow and feature parity.

Another benefit of an in-place re-write is that you can decommission parts of the legacy system as you go without ever having to do a big (and scary) "flip of the switch" from the old system to the new system.

Most importantly, your customers do not suffer when you do an in-place re-write as you are not ever taking away features from your customers. Even better, you can prioritize giving your customers new features earlier by implementing them on the new system even before you've finished porting the entire old system over.





Monday, June 16, 2014

Back To The Basics: The Hash Table CRUD Operations

In my previous post Back To The Basics: The Hash Table I described the Hash Table data structure, provided a basic template for the class in Java, as well as walked you through the hashing algorithm. In this post we’ll take a look at the CRUD operations. Actually we'll really only talk about addition, retrieval, and deletion because adding and updating in a Hash Table are the same operation.

Let's take a look at the add, get, and remove operations first in their simplest form.

private void add(String key, Object value) {
    int index = this.getKeyIndex(key);
    values[index] = value;
}

private Object get(String key) {
    int index = this.getKeyIndex(key);
    return values[index];
}

private Object remove(String key) {
    int index = this.getKeyIndex(key);
    Object value = values[index];
    values[index] = null;
    return value;
}

Looking at these three operations, we can now see why a Hash Table in practice provides near constant time addition and retrieval operations. The only non-constant-time work performed is the generation of the hash code, but that operation can be optimized and its resulting value cached (if necessary), effectively making the add and get operations approach O(1).

What you may notice by looking at these operations is that they don't handle collisions. The add operation simply overwrites the existing value associated with the hash, regardless of whether it's actually associated with the key. The get operation returns whatever value is stored at the index generated by the getKeyIndex method. And, like the add operation, the remove operation removes whatever is associated with the hash.

We can try to optimize our hashing algorithms as much as we want, but we'll never truly be able to avoid collisions 100% of the time. The reason is that we'd have to know all the possible items that could be hashed in order to create a truly perfect hashing algorithm. Since this is not really possible for a general-purpose Hash Table, it's important that we add some code that can handle collisions when hashing the key.

The most straightforward way to handle hash collisions is to store the values at each index in a Linked List. Instead of storing the value associated with a particular key directly in the backing array, each array slot holds a Linked List containing all the key/value pairs that map to a particular hash. This adds a little complexity to our Hash Table, but allows us to avoid putting it into an unexpected state when a collision occurs.

The first thing we need to do to handle collisions is define an object that can represent a key/value pair stored in the Linked List for each slot. Since we're using Java, we're going to create a class that implements Map.Entry<K, V>.

final class HashEntry<K,V> implements Map.Entry<K,V> {
    private final K key;
    private V value;

    public HashEntry(K key, V value) {
        this.key = key;
        this.value = value;
    }

    @Override
    public K getKey() { return this.key; }

    @Override
    public V getValue() { return this.value; }

    @Override
    public V setValue(V newValue) {
        V oldValue = this.value;
        this.value = newValue;
        return oldValue;
    }
}

The next thing we need to do is update our Hash Table class definition to reflect the fact that our backing array contains a Linked List of our Hash Entry key/value pairs instead of an array of objects. Our new class definition looks like:

public class HashTable {
    private LinkedList<HashEntry<String,Object>>[] values;
    private final int tableSize = 1543;

    @SuppressWarnings("unchecked")
    public HashTable() {
        this.values = (LinkedList<HashEntry<String,Object>>[]) new LinkedList[tableSize];
    }
}

Now we can update our add, get, and remove methods to support using a Linked List as the backing array. We're going to start with the get method because we'll define a helper method that the addition operation can also take advantage of.

Our new get method is fundamentally the same as our previous get method, with the only exception being that we've pushed the actual retrieval logic into its own method named findEntry. First it tries to find the entry; if found, it returns the entry's value, otherwise it returns null. The new get method looks like:

private Object get(String key) {
    int index = this.getKeyIndex(key);
    HashEntry<String, Object> entry = this.findEntry(values[index], key);
    if(entry != null)
        return entry.getValue();
    return null;
}

The findEntry method simply iterates through the linked list attempting to find an existing item with the given key. If it finds one it returns it, otherwise it returns null:

private HashEntry<String, Object> findEntry(LinkedList<HashEntry<String, Object>> list, String key) {
    if(list == null)
        return null;

    Iterator<HashEntry<String, Object>> iterator = list.iterator();
    while(iterator.hasNext()) {
        HashEntry<String, Object> entry = iterator.next();
        if(entry.getKey().equals(key))
            return entry;
    }
    return null;
}

Our new add method is also the same as our previous add method, with two exceptions. First, we've added logic to create the Linked List if it doesn't already exist. Second, we've pushed the actual addition logic into its own addValueToList method, which allows add to maintain a single responsibility. The new add method looks like:

private void add(String key, Object value) {
    int index = this.getKeyIndex(key);
    LinkedList<HashEntry<String,Object>> list = this.values[index];
    if(list == null) {
        list = new LinkedList<HashEntry<String, Object>>();
        this.values[index] = list;
    }
    this.addValueToList(list, key, value);
}

The addValueToList method is not very complicated either. It tries to get a handle on an existing entry using the findEntry method we've already defined. If it does not find an entry, it adds the new key/value pair to the list; otherwise it updates the existing entry's value. Here's what the addValueToList method looks like:

private void addValueToList(LinkedList<HashEntry<String, Object>> list, String key, Object value) {
    HashEntry<String, Object> entry = this.findEntry(list, key);
    if(entry == null) {
        list.add(new HashEntry<String, Object>(key, value));
    } else {
        entry.setValue(value);
    }
}

Finally, our new remove method also benefits from our findEntry method. It first tries to find the existing entry. If found, it removes the entry from the list and returns its value; otherwise it returns null. The new remove method looks like:

private Object remove(String key) {
    int index = this.getKeyIndex(key);
    LinkedList<HashEntry<String, Object>> list = values[index];
    HashEntry<String, Object> entry = this.findEntry(list, key);

    if(list != null && entry != null) {
        list.remove(entry);
        return entry.getValue();
    }

    return null;
}

Wrap Up

There are a few additional things to consider when creating or using Hash Tables. Concurrency can become an issue when the hash table is used in a multi-threaded environment. If you use a hash table in a multi-threaded environment you'll need to make sure that the implementation you use is thread safe.

Another thing to consider when creating or using Hash Tables is that, because the Hash Table relies on a fixed-size array as its backing object store, it may eventually need to grow as more items are added. The cost of performing a resize operation is typically O(N), because each item in the existing array needs to be copied into a new, larger array. This can be done all at once or incrementally.
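To see where the O(N) cost comes from, here's a hedged sketch of a grow-and-rehash step. For brevity it uses open addressing rather than the chained table above, and the names and doubling policy are invented; a chained table would re-index each list entry the same way.

```java
// Sketch: an open-addressed store with a grow-and-rehash step.
class ResizableStore {
    private String[] keys = new String[4];
    private Object[] vals = new Object[4];
    private int count = 0;

    public void add(String key, Object value) {
        if (count == keys.length) {
            grow(); // keep at least one empty slot so probing terminates
        }
        int i = indexFor(key, keys.length);
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % keys.length; // linear probe to the next slot
        }
        if (keys[i] == null) {
            count++;
        }
        keys[i] = key;
        vals[i] = value;
    }

    public Object get(String key) {
        int i = indexFor(key, keys.length);
        while (keys[i] != null) {
            if (keys[i].equals(key)) {
                return vals[i];
            }
            i = (i + 1) % keys.length;
        }
        return null;
    }

    // O(N) resize: every existing entry is re-indexed against the
    // new, larger array length.
    private void grow() {
        String[] oldKeys = keys;
        Object[] oldVals = vals;
        keys = new String[oldKeys.length * 2];
        vals = new Object[keys.length];
        count = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != null) {
                add(oldKeys[i], oldVals[i]);
            }
        }
    }

    private static int indexFor(String key, int length) {
        // Mask the sign bit so a negative hashCode can't go negative.
        return (key.hashCode() & 0x7fffffff) % length;
    }
}
```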

Lastly, you'll also need to consider how you are going to handle null keys and/or null values. Some implementations don't allow null keys and/or null values.

Monday, June 9, 2014

Back To The Basics: The Hash Table

A few weeks ago I began a series entitled Back To The Basics. This series is intended to go over various software data structures that are helpful to understand as a software engineer. I started the series with the Binary Tree. The next data structure I'd like to take a look at in this series is the Hash Table.

Hash Tables are very important data structures to understand. At their most basic level they provide associative-array functionality with near constant time addition, deletion, and retrieval. Whether a Hash Table actually delivers near constant time operations depends on how good the hashing algorithm is. The goal of the hashing algorithm is to avoid collisions, where two different values produce the same hash. When a collision is encountered, the Hash Table strays farther from constant time operations.

The ability to provide near constant time operations makes the Hash Table ideal for use as a cache or an index as items can be inserted and retrieved with very little overhead.

While most modern programming languages and frameworks provide several optimized implementations of the Hash Table, let's take a look at what it would take to build a Hash Table from scratch which can handle adding, removing, and retrieving values.

Let's first start by defining our HashTable class. The most basic definition of a Hash Table will define the underlying array which will store our values and our constructor will initialize the store. The important thing to pay attention to here is the size of our initial array. We're going to initialize it to a size using a prime number because it will help us evenly distribute values within our backing store. The reason for this will become more apparent when we create our hashing function.

public class HashTable {
    private Object[] values;
    private final int tableSize = 1543;

    public HashTable() {
        this.values = new Object[tableSize];
    }
}

Now that we have our initial class definition let's start by creating a method which returns the index we should use to store the value for a given key.

private int getKeyIndex(String key) {
    // Mask off the sign bit so a negative hashCode can't produce a
    // negative (out-of-bounds) index.
    return (key.hashCode() & 0x7fffffff) % this.values.length;
}

The getKeyIndex method is pretty simple: it gets a hash code for the key, then returns an index based on modding the hash code by the length of the array that holds our values. We return a mod so that the index is guaranteed to be within the bounds of our array. And because we've used a prime number for our array length, we get a better distribution of hashes in the array, so they're not clustered together.

Right now the getKeyIndex method is relatively boring and hides the magic of the actual hashing algorithm, because our keys are strings and we're able to rely on the String class's hashCode method. But what if there were no hashCode method on String? How would we go about writing our own?

A good hashing algorithm will try to provide uniform distribution of values in order to avoid collisions which will increase the cost of the operation. The cost is increased because for every collision you have to resolve it somehow. The work it takes to resolve the collision is relative to the number of collisions for that key and the algorithm used to resolve the collisions. One way to resolve the collisions is to store the values in a Linked List and iterate through the list till we find the item with our hash code. This will cause our resolve algorithm to have an O(n) cost where n is the number of collisions for that key.

Now let's build our string hash algorithm. Because strings are made up of individual characters and those characters have an integer value we could create a hashing function that just sums the character values. For example:

private int hash(String key) {
    int hash = 0;
    for (int index = 0; index < key.length(); index++) {
        hash += key.charAt(index);
    }
    return hash;
}

This has a problem in that it's likely to cause collisions. For example let's say we wanted to hash the string "dog". We'd start our hash with an initial value of 0. Then we'd iterate through the characters in the string summing them as we go. For dog this would produce a hash of 314 ((d)100 + (o)111 + (g)103). Now let's say we wanted to create a hash of the string "cav". This would also create a hash of 314 ((c)99 + (a)97 + (v)118). This is a collision, which is something we want to avoid.
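You can reproduce that collision with a quick sketch (the class name here is mine):

```java
// Demonstrates the collision described above with the naive
// character-sum hash: "dog" and "cav" both sum to 314.
public class NaiveHashDemo {
    static int sumHash(String key) {
        int hash = 0;
        for (int i = 0; i < key.length(); i++) {
            hash += key.charAt(i); // 'd'=100, 'o'=111, 'g'=103, ...
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(sumHash("dog")); // 314
        System.out.println(sumHash("cav")); // 314 -- a collision
    }
}
```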

The good thing is that there's a pretty simple way to help avoid these collisions. If we change our algorithm slightly to multiply the current hash value by a constant at each step then we can avoid them. Notice that I multiply the constant by the current hash value during each iteration. This makes each character's contribution depend on its position within the string, which decreases our likelihood of a collision.
  
private int hashCode(String key) {
    int constant = 31;
    int hash = 0;
    for (int index = 0; index < key.length(); index++) {
        hash = (constant * hash) + key.charAt(index);
    }
    return hash;
}

With our new algorithm dog will hash to 99644 and cav will hash to 98264. Because we've added a constant that is multiplied against the current hash value at each iteration we've increased the likelihood that the resulting hash code is unique.
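You can verify those two values with a quick sketch (the class name here is mine):

```java
// Verifies the multiplied-constant hash: "dog" -> 99644 and
// "cav" -> 98264, so the earlier collision is gone.
public class BetterHashDemo {
    static int hash(String key) {
        int constant = 31;
        int hash = 0;
        for (int i = 0; i < key.length(); i++) {
            hash = (constant * hash) + key.charAt(i);
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(hash("dog")); // 99644
        System.out.println(hash("cav")); // 98264
    }
}
```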

Note: The above algorithm is loosely based on the hashCode method in OpenJDK's String class. If you're curious as to why the constant value of 31 was chosen, check out Effective Java by Joshua Bloch. This Java Bytes blog post goes into even more depth for the ultra-curious.


Monday, April 21, 2014

Coding Standards Revisited: Tips For More Readable Code

In my previous post, Coding Standards Revisited: My Language Agnostic Coding Standards, I talked about some non-traditional language agnostic coding standards that I believe apply to all code on all (modern) frameworks, languages, and platforms. In today's post I want to talk specifically about some standards that can increase (or decrease if ignored) the readability of your code. As I've argued before, maintainability is crucial in the craft of software development.

So let's talk about a few readability coding standards that I like to use. When adhered to they greatly increase the readability, and therefore maintainability, of your code.

Meaningful Variable Names

Use meaningful names. x is not meaningful. client is not meaningful. provider is not meaningful. Choose names that are specific to the domain you are in. Choose names that clearly define the value of the variable. For example, rowIndex is more meaningful than x.

Shorthand Variable Names

DO NOT use shorthand for variable names. Shorthand assumes that you have a shared context from which the shorthand was generated. It becomes difficult for members of other teams or new team members to read your code if you use shorthand.

Private Variables

DO NOT use _ to denote private variables. The English language doesn't use _'s to start words. So starting variable names with _ causes the brain to do more work in recognizing the pattern. I would strongly suggest casing your private variables the same as function variables and differentiating them using keywords like this or self if your language supports them.

Constants

If there is one place I am willing to break from the language standard for my own standard it's with constants. I usually use all caps for constants whether they're private or public. I've found that even developers that haven't been traditionally exposed to this style can figure it out intuitively pretty quickly.

With that said, if your language does provide a standard for how constants are defined, you should adhere to that standard first.

Variable Scope

I make all member variables private, even variables that may be subject to change from outside influence. Exposing member variables as public or even protected means that the state of your class can change without the class having an opportunity to respond to the state change. This is often the cause of bugs in the system (hard-to-find bugs at that).

Define get or set methods for your class if you really have to expose the value of a member variable. Having a set method allows the class to control the change, giving it an opportunity to keep its internal state consistent.

Some languages provide syntactic sugar for get/set methods, which should be preferred over more explicit get/set methods.
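A minimal sketch of that idea (the class and field names are mine, purely illustrative): the setter validates the change, so the class keeps its internal state consistent instead of exposing the raw field.

```java
// Hypothetical example: the field stays private, and the set method
// gives the class a chance to respond to (and reject) a state change.
public class Thermostat {
    private int targetTemperature = 20;

    public int getTargetTemperature() {
        return targetTemperature;
    }

    public void setTargetTemperature(int degrees) {
        // Keep internal state consistent: reject values outside a safe range
        if (degrees < 5 || degrees > 35) {
            throw new IllegalArgumentException("out of safe range: " + degrees);
        }
        targetTemperature = degrees;
    }
}
```

If the field were public, any caller could push the object into an invalid state without the class ever knowing.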

Curly Braces

Use curly braces according to the standards of the language you're writing in. Don't just arbitrarily put curly braces on their own line or on the line of the block they're defining. Different languages have different standards for curly brace style. In fact, some languages support curly braces but use the convention of not including them except in specific scenarios.

It's better to deal with the discomfort of looking at unfamiliar curly braces than it is to be non-standard. In my experience it really only takes a week or two to get used to the curly brace style of the language you're using.

Whitespace

There's nothing more annoying than checking in code for a one-line change only to notice that the diff shows 500 lines changed. What happened? You, or the person who last touched this file, used a different standard for whitespace. Nowadays most IDEs auto-format the code for you. So even if you don't change a line of code, the whitespace may change to bring the file into conformance with whatever your IDE preferences are set to.

Define your use of whitespace such that a standard developer for that language would expect it. If there is no guidance for your language on whitespace, choose the default of the most common editor or IDE for that language.

Monday, April 14, 2014

Coding Standards Revisited: My Language Agnostic Coding Standards

In my previous post, Coding Standards Revisited: Writing Code That Lasts, I talked about how to approach a code review from a slightly different perspective. Today I'd like to talk about approaching coding standards from a slightly different perspective.

As I mentioned in my previous post I find a lot of value in framework, language, or platform specific coding standards. But I do not believe they tell the whole story. Here's a short list of non-traditional coding standards that I believe apply to all code on all (modern) frameworks, languages, and platforms.

Naming


Use meaningful variable, method, parameter, and class names. DO NOT use shorthand. A person that does not know how to read code should be able to tell you what the variable, method, parameter, or class is doing just from the name.

Separate out your concerns

A method or class should have one reason to change and ONLY one reason to change. If a particular method in that class has more than 10 - 20 lines it's probably doing too much. There are very few exceptions to this. Methods and classes should be distinct features that overlap in functionality as little as possible.

Cohesiveness

Write highly cohesive classes. The methods in a class need to have a lot in common. The methods in a class should act on the same data. The methods in a class should all be related to the data. I.e. an image manipulation class that has code to physically save the image to disk is not cohesive… it's coupled. There should be a class that deals with saving things to disk and a class that deals with image manipulation.
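A sketch of that split (all names here are mine, purely illustrative): one class acts only on image data, and a separate class owns the disk concern.

```java
// Hypothetical example of the split described above: the manipulator
// acts only on image data; disk I/O lives in its own class.
class Image {
    int[] pixels;
    Image(int[] pixels) { this.pixels = pixels; }
}

class ImageManipulator {
    // Highly cohesive: every method here acts on image data
    Image invert(Image image) {
        int[] out = new int[image.pixels.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = 255 - image.pixels[i];
        }
        return new Image(out);
    }
}

class ImageStore {
    // Disk concerns belong here, not in the manipulator
    void save(Image image, java.nio.file.Path path) throws java.io.IOException {
        StringBuilder sb = new StringBuilder();
        for (int p : image.pixels) sb.append(p).append(' ');
        java.nio.file.Files.writeString(path, sb.toString());
    }
}
```

Now a change to the storage format can't ripple into the image-processing logic, and vice versa.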

Dependency Management

Dependencies should be added by reference as much as possible and injected into the classes that depend on them. This allows the dependencies to be created at the correct level of abstraction. 

High-level modules should not depend on low-level modules. Both should depend on abstractions. Abstractions should not depend upon details. Details should depend upon abstractions.
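A minimal constructor-injection sketch (class and interface names are mine, not a prescribed design): the high-level class depends on an abstraction, and the concrete low-level dependency is created at a higher level and injected.

```java
import java.util.HashMap;
import java.util.Map;

// The abstraction both sides depend on
interface Storage {
    void save(String name, String contents);
}

// A concrete low-level detail, created outside the class that uses it
class InMemoryStorage implements Storage {
    final Map<String, String> files = new HashMap<>();
    public void save(String name, String contents) { files.put(name, contents); }
}

// High-level module: depends only on the Storage abstraction
class ReportGenerator {
    private final Storage storage; // injected, not constructed here

    ReportGenerator(Storage storage) { this.storage = storage; }

    void generate(String name) {
        storage.save(name, "report body for " + name);
    }
}
```

Swapping InMemoryStorage for a disk- or network-backed implementation requires no change to ReportGenerator.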

Don't Start By Abstracting

Only abstract as you need to. The first (and I would argue second) implementation of something should come in the form of concrete classes. Only when you run into a scenario where you need to interchange these classes should you then abstract.

Code Duplication

Follow the Don't Repeat Yourself (DRY) principle. Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.

Commented Out Code

Don't comment code out and leave it. Delete it; you can always go back to it using source control.

Only Write Enough Code To Satisfy What You Need Now

Don't write a single line of code for something in the future (i.e. "I'm gonna need a class that…."). Only write code that is actually used right now for what you're doing. Usually for me this means starting with a test class. In the case of a UI I usually start by writing static UI code and then work backwards, making the code dynamic as I go and ensuring that each piece of dynamic code has a set of tests associated with it.

Monday, April 7, 2014

Coding Standards Revisited: Writing Code That Lasts

Over the last several weeks I've been writing a series of posts on Software Craftsmanship. While I don't think I'm completely finished with the topic (nor do I ever hope to be) I thought I would take the opportunity to switch gears and talk about coding standards.

Typically when people think about coding standards they may think of Google's Java Style Guide, GNU's C++ Coding Conventions, MSDN's C# Coding Conventions, Apple's Objective-C Conventions, Android's Style Guidelines for Contributors, and so on. There's practical guidance out there for most languages and frameworks. These conventions provide you with several advantages. They help you write code that other developers in that language or on that platform will recognize and therefore be able to maintain more easily. They help you code for compiler or run-time optimization. And they help you navigate some of the trickier elements of the language. What these guides don't generally do is give you a good framework for how to write code that will last.

My Software Craftsmanship series was a good start down this path, but now let's dive a little deeper and see if we can't identify some additional practical tools that you can use to help you design and write code that will last. Ultimately that should be our goal: to write code that lasts. Code that lasts may not be elegant and it may have flaws, but ultimately it has proven useful. So useful that it has stuck around. Sometimes you find code that lasts just because others are afraid to touch it for fear that it will break. I would argue that, while that code may have bad style or convention, it has lasted because it did its most important task well; it did the one thing it was written for and has proven useful for that task.

Traditionally when we evaluate code fitness we look at how well the developer adhered to a certain style guideline or how well they adhered to a language or platform's best practices. We often look for places where patterns have emerged in the code and try to optimize our implementation around those patterns. We look for poor memory utilization or inefficiencies in dealing with large data sets. These are great things that MUST be part of a code review. But these things alone don't tell the full story of whether or not the code will last.

So here are a few tips from me on what I like to also look for during a code review that I think identifies a pattern of code that will last.

What would a standard ______ expect?

Fill in that blank above with whatever framework, language, or platform the code was written for. So often it's the case that we let other framework, language, or platform coding standards creep into unrelated code. A standard developer for that framework, language, or platform should be able to open the code and see a style that they're familiar with. They should recognize patterns that are specific to that framework, language, or platform. The code should flow in a way that's optimized for the way that framework, language, or platform expects to run the code.

A few ways I often see this break down are:
  • Adding a dependency on a third party tool that auto-manages code style (like ReSharper)
  • Adding a dependency on a particular IDE (like Eclipse) when the framework, language, or platform is IDE agnostic (like C++ or Java)
  • Not using an IDE built for the framework, language, or platform. I.e. Xcode for iOS or Visual Studio for Microsoft .NET (though I would argue NOT for Mono)
  • Curly braces, tabs, and spaces. Let any Java, C#, or Ruby developer read the others' code and you'll be able to hear the complaints from down the hall.
  • Programmatically defining UI elements in code when the framework or platform provides built in mechanisms to define such elements.
Often these dependencies are added with good intentions. They're added because the organization wants to increase productivity. Or they're added because the organization believes that conformance to a style is important for the cohesiveness of the code, which it is. Conformance to a style other than what a standard developer of that framework, language, or platform would expect is not cohesive. Cohesiveness should be determined both by how self cohesive it is as well as how cohesive it is within the ecosystem it is built for.

What external dependencies has this code been coupled to? Are they necessary?

Developers become very fond of particular pieces of code or particular third party libraries. We're taught in school and at work that we need to focus on re-use. Often this is misinterpreted to only mean reference other code. But occasionally, or often, it's the case that in order to reuse existing code without adding unnecessary dependencies we need to refactor the code we're referencing into a library or module. This allows us to manage the dependencies of our new code correctly. Often we reference the correct code but add dependencies on other packages or classes that are unnecessary because we fail to refactor the dependencies as their consumption changes.

When this code changes what affect will that have on other non-related code?

This is a smell that you've either got something messed up with your class encapsulation or your code organization. You should expect direct consumers of your code to change as your interfaces change. But you should not expect unrelated parts of the codebase to have to change. One way to test this is to simply change a public method signature and compile. See what no longer compiles. Are all the places you have to change reasonable?

How discoverable are the features of this code? Do I have to read the code to understand what it does?

This one is one of the more subjective items on the list. But I still think they're questions that absolutely must be asked. Your first pass during a code review should be to look at how the code is used. If during this pass you find yourself asking things like "why do you need to pass that?" or "why is this call necessary?" then you're probably looking at code that could use better method naming or is written with the wrong level of encapsulation.

Monday, March 24, 2014

Software Craftsmanship: Going Dark (Or why you need a flashlight)

At the last few places I've been employed I've given a similar presentation on the topic of Going Dark. I think it's time to unleash this topic to all three of you out on the interwebs. The term Going Dark may or may not be new to you. Hopefully by the end of this post you'll understand why it's bad, how best to identify it, and how best to avoid it.

Taking a look back at the last few posts I've done on Software Craftsmanship you've probably noticed a pattern emerge in what it means to treat software as a craft. I'll lump them into a category called Software Development Ideals.  So what are those ideals?

  • Develop software for maintainability first
  • Reduce the waste in our development cycle
  • Reduce the complexity of the software
  • Introduce fewer bugs as more features are introduced
  • Write better code faster

Before we talk about how Going Dark is the antithesis of the ideals I think it's important to understand the symptoms associated with Going Dark. Those symptoms are:

  • Working in isolation on a specific feature or problem for long periods of time
  • Waiting to check in code till the feature is "complete"
  • Failing to communicate or collaborate on software features as feature development “starts going bad”
  • Constantly having to explain your work before and after you commit it to your repository
  • Leads, management, or other teams members feeling like what was built wasn't quite what they expected

"Developers who work for long periods -- and by long I mean more than a day -- without checking anything into source control are setting themselves up for some serious integration headaches down the line." — Jeff Atwood, Coding Horror

"Developers often put off checking in. They put it off because they don't want to affect other people too early and they don't want to get blamed for breaking the build. But this leads to other problems such as losing work or not being able to go back to previous versions." — Damon Poole, author of "Do It Yourself Agile"

Okay, so now we're able to identify the symptoms of someone who has gone dark. Let's talk about why not treating the problem is actually costing you time, energy, and wasted cycles. There are several problems associated with someone who has gone dark.

  • There's more code to integrate on each check-in
  • Code that is written in isolation is full of assumptions about the other software with which it will be integrated
  • Each assumption is a potential issue that will only be discovered when the software is integrated
  • The more code you have to integrate the higher the likelihood of introducing bugs
  • The longer you wait to check in your code the higher the likelihood of breaking a piece of NEW code you've written before you check it in
  • When you're on the wrong path you end up spending more time going down that path rather than having your path corrected sooner
  • There's no way to "gut check" your progress
  • Closes your code off from scrutiny (which may seem uncomfortable but ALWAYS makes your code better)

That list alone should scare you. My guess is that you can probably relate to it. You've probably participated in "merge parties" or spent half a day or a full day trying to re-merge your changes after someone else beat you to committing and has changed the topology of the code. Maybe you've come in one day to find that there's a new bug in the system because a merge went bad and something got missed or accidentally added.

The great thing is that there are some pretty simple changes that can be made. The first is to start working out how to make lots of small changes to the system which can be verified along the way. One good method for doing this is some form of Test Driven Development. Once you've started down the path of thinking about your code in bite-sized pieces, all of which have an associated test, you can start to think about checking in those bite-sized pieces frequently. I like to think of this as Checking In Early and Often. There are a lot of benefits to checking in early and often, and you'll notice that they are solutions to the problems presented by Going Dark.

  • Change set is much smaller which means it's much more manageable and has a lower impact on the code base as a whole
  • When you start integration of your changes the chances of having changed the same method or function as someone else is much smaller
  • Less of a chance of having to reapply your changes on top of someone else's just because they beat you to the check-in
  • When others refactor, you obtain the benefits quicker with less integration work
  • Reduces the likelihood of anyone's changes being lost or overwritten by a bad integration
  • Easier to find bugs because there's less code to search through
  • Easier to fix bugs because there's less code that could be possibly causing the bugs
  • Helps to reduce unnecessary code (YAGNI) as you're just writing the code you need to do the task you're on

Go forth into the light and help others come out of their darkness!

Monday, March 17, 2014

Software Craftsmanship: Don't Repeat Yourself

In the software world copy and paste is the enemy. It causes bugs. It slows down development. It duplicates functionality. But more importantly it decreases maintainability.

It causes bugs and slows down development because you have multiple places to maintain the same source of knowledge. If you apply a change to one place and not the others you will have a bug in your system. And the difficulty of finding and fixing this bug increases with every new copy, as there become more and more code paths that rely on that duplicated knowledge.

The antidote to our copy and paste problem can be summed up with Andy Hunt and Dave Thomas's acronym DRY (don't repeat yourself). The DRY principle states that "every piece of knowledge must have a single, unambiguous, authoritative representation within a system." That statement is chock full of goodness that can be overlooked, so let's break it down.

The simplest part of that statement is single. That one word in and of itself sounds simple, but what it really means is that we have a responsibility to make sure that each source of knowledge exists in only one place. This doesn't mean that you create a "tools" library for all common code, as is often the (misunderstood) approach of people trying to implement the DRY principle. It means that you need to make an effort to constantly refactor your code such that you increase the reusability of each source of knowledge.
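As a small sketch of a single source of knowledge (the class, constant, and method names are mine, purely illustrative): the tax rule lives in exactly one place, so callers can't drift apart the way copy-pasted formulas do.

```java
// Hypothetical example: the sales-tax rule is defined once and reused,
// rather than re-derived (and eventually mis-copied) at every call site.
public class Pricing {
    private static final double SALES_TAX_RATE = 0.08; // single source of knowledge

    public static double withTax(double price) {
        return price * (1 + SALES_TAX_RATE);
    }

    public static double invoiceTotal(double[] prices) {
        double total = 0;
        for (double p : prices) {
            total += withTax(p); // reuse the rule, don't repeat it
        }
        return total;
    }
}
```

If the rate changes, it changes in one place; every code path that uses it stays correct.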

Each source of knowledge needs to be associated with a clear understanding of how and when to be used. When a source of knowledge in your code can be interpreted to do many things it is an ambiguous source of knowledge. Each source of knowledge should be unambiguous.

When a source of knowledge in your code is one of many ways to do the same thing it is not authoritative. What this means practically is that consumers of your code cannot come to rely on its outcome. Each implementation may come with different nuances or bugs. This requires the consumer of your code to also understand how your code's execution differs from the other code that does the same thing. Having one representation of each source of knowledge means that representation is authoritative and its outcome can be relied upon.

One of the positive side effects of the DRY principle is that it will force you to organize your code better. You'll start to organize your code such that a change in one area of code will not cause other non-related areas to also change.

Monday, March 10, 2014

Software Craftsmanship: You aren't going to need it

One enemy of software maintainability is the tendency of software engineers to add functionality in the system which does not provide any direct business value or is not part of the requirements. This functionality can be user facing but is often functionality that comes about because the engineer goes through some sort of "what if" scenarios.

For instance if an engineer is writing a piece of software that processes PDF documents they may be tempted to add some hooks that will allow it to process yet to be decided upon documents in the future. Often the reasoning for this is that it provides greater flexibility for the software and you should do that work while you're already in that particular module.

I honestly think that one big reason we see engineers do this is that they're afraid to only provide the business value they're being asked to provide. I believe that often these engineers feel that if they're not building their software to do more than it was originally designed to do that they're not a great developer.

I'm not 100% sure where this mentality came from, but my hunch is that it was exacerbated in the late 90's and early 2000's when the majority of companies were using the Waterfall design process to build their software. This was due to the fact that a lot of software was being packaged and shipped rather than being distributed as a service.

The waterfall process takes the approach that there is a start and an end to a software project, and once the end is reached the software design doesn't change. One reason this process fails for software is that requirements are always changing. The more a software product is used, the more its users refine their understanding of how they need to use the software and what they can do with it.

Because software was being packaged and shipped using the waterfall process this promoted the need to get as much functionality into the code as possible in order for the software to be able to provide the unknown functionality to the user. The problem with this is that the software no longer does a few things really well but instead does several things not so well.

Now that we're well into an age in software where even packaged software can be updated automatically online there's really no excuse or reason to hold on to this fallacy. Yet we still see it just as prevalent in today's software industry as it was back in the late 90's and early 2000's.

There's a philosophy that's been created to try to combat this archaic way of thinking which is summed up in the acronym YAGNI (you aren't gonna need it). YAGNI is your permission to build something that does only what it's supposed to do and does it well.

YAGNI is built on the assumption that if you need it you'll know enough about it to build it out fully. You'll have enough information to write some tests to document its functionality. It'll be valuable to the business, which means it will be used and relied on.

Monday, March 3, 2014

Software Craftsmanship: Simplicity

In my previous post Software Craftsmanship: The need to understand maintainability I made the case for needing to understand software maintainability. One of the cornerstones to maintainability is simplicity. The simpler a design or implementation is the less difficult it is to maintain as its intent and purpose is clear to the maintainer.

In the software world we have an acronym which jovially sums up the desire to keep a design or implementation simple. That acronym is KISS, which stands for "keep it simple, stupid". While this is an oft-used acronym, I rarely run across a user who explains what they actually mean by it.

So what does it mean to keep it simple? While most people I've worked with over the years have had slightly different opinions on simplicity I do think it's possible to come up with some suggestions that most of us can agree on.
  • Use meaningful names
  • Reduce the number of lines of code and nested statements
  • Test your code
Meaningful Names

Take this snippet of code as an example

for (int i = 0; i < j; i++)
{
    this.process(a[i]);
}

What's that code doing? No matter how much time you spend looking at it your guess is just that... a guess. How comfortable would you be if I asked you to change it?

Now let's just change the names of things and see how much simpler we can make this code.

for (int dinnerGuest = 0; dinnerGuest < numberOfGuestsToNotify; dinnerGuest++)
{
    this.notifyDinnerGuestOfVenueChange(guests[dinnerGuest]);
}

Now tell me what that code is doing. I bet you figured it out in less than five seconds. What if I asked you to modify this code now. Would you feel more comfortable modifying it?

Reducing The Number Of Lines Of Code And Nested Statements

The more lines of code you have to read at any one time means that you have to keep the entire context of that code in mind when trying to modify it. This is especially difficult to do when the code is doing multiple things. Every loop and every conditional forces the modifier to have to try to keep state in his/her mind while tracing through all the possible code paths. The more conditionals and loops in any set of code the more possible paths the maintainer is going to have to juggle.

One way to keep a complex algorithm simple is to encapsulate the pieces of the algorithm into their own methods. Separate out each conditional or loop body into its own method and you're helping the maintainer understand the flow of the algorithm without having to already know the algorithm. The benefit is that when going into a piece of code to maintain it, the code provides a more readable and understandable blueprint for the maintainer. This increases the likelihood of a successful modification without any new bugs.
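A sketch of that technique (the domain, class, and method names are mine, purely illustrative): each branch of the loop is pulled into a named method, so the top-level flow reads like a blueprint.

```java
// Hypothetical example: conditionals and loop bodies extracted into
// named methods so the algorithm's flow is readable at the top level.
public class InvoiceProcessor {
    private static final int PROCESSING_FEE = 2; // hypothetical flat fee per charge

    public int process(int[] amounts) {
        int total = 0;
        for (int amount : amounts) {
            if (isRefund(amount)) {
                total += applyRefund(amount);
            } else {
                total += applyCharge(amount);
            }
        }
        return total;
    }

    private boolean isRefund(int amount) { return amount < 0; }

    private int applyRefund(int amount) { return amount; } // refunds are already negative

    private int applyCharge(int amount) { return amount + PROCESSING_FEE; }
}
```

A maintainer can follow process() without holding the refund and fee rules in their head; each rule is in its own small, testable method.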

Test Your Code

One of the best ways I know of to keep code simple is to test it. Using a technique called Test Driven Development (TDD) allows you to think about your code from the standpoint of the business value it's supposed to provide first and its technical merit second. Technical implementation is EXTREMELY important, but it is secondary to providing business value.

Once your code can provide business value and you have a way to ensure that it always provides that business value (i.e. your tests) you are free to improve upon the technical implementation using a technique called refactoring.

While this has not been an exhaustive list of all the things you can do to keep your software simple my hope is that it provides you with direction on where to start.

Monday, December 2, 2013

Why I don't use a traditional IDE...unless I have to

I don't really think I'm a luddite when it comes to technology, but there is one piece of technology that I don't think I will ever truly embrace: the modern IDE.

I feel I need to make a disclaimer from the start. I don't think it's wrong to use IDEs, and in fact I'd argue that there are certain times when it's important to use one. But I do think that becoming dependent on the modern IDE makes you a poorer software engineer.

I think I should start with why I keep saying modern IDE and not just IDE. In general an IDE is any development environment that gives you the ability to write, build, and debug software. With this definition a text editor wouldn't be considered an IDE, but an argument could be made that VIM is because of its built-in ability to run shell commands.

In contrast, here are a few features that most modern IDEs provide:
  • Intelligent code completion
  • WYSIWYG UI editors
  • Proprietary solution/project formats
Those all sound like great add-ons, don't they? While most of them can be very useful tools, here's why I think some of them are more harmful than helpful to the average software engineer.

Intelligent code completion

Most developers I know will argue day and night that if you take away their intelligent code completion that they'll significantly slow down as an engineer. Intelligent code completion allows you to start typing a word and the IDE will provide you a drop-down of options applicable to your context. Sounds great doesn't it? Here's why it's not.

It discourages the engineer from learning the API, framework, SDK, or toolkit they're working with. All you have to know is the keyword you're looking for and the code completion will do the rest. The designers of the API/SDK/framework you're using put things in certain places for a reason. The more you understand about the structure of the API/SDK/framework, the better you understand how its designers want you to use it. This becomes very important when trying to do something new with a framework, as you have a better understanding of how to compose objects in the framework to build more robust pieces. If you don't understand the intended use of objects in the API/SDK/framework you may, as is often the case, try to shove a square peg through a round hole.

WYSIWYG UI Editor

This is actually the least harmful feature modern IDEs provide. I added it to my list because, like intelligent code completion, it discourages the engineer from learning how the UI framework works. I do believe that after an engineer has really learned the UI framework, and understood why things are done the way they are in that framework, a WYSIWYG editor can be very helpful, as it abstracts away a lot of the annoyances of UI layout and composition.

Proprietary Solutions/Projects

This is one of the more evil features of a modern IDE. Most modern IDEs have their own way of organizing the resources and dependencies required to build a project. That in and of itself is not a bad thing. The problem arises when the rules for organizing, and the file format used to describe the organization, are closed. This means that if you use the IDE's built-in solutions/projects, you can ONLY use that IDE with your project.

Software engineers are a fickle group. I challenge you to find a team of developers whose development environments are exactly the same. That is, the same tools, the same configuration, the same defaults, and so on. You won't find it. Even the best organizations that try to standardize their development environments are usually fighting an uphill battle. This is because, as engineers, we all approach software engineering slightly differently. The configuration we use, the defaults we choose, and the tools we have installed are things that help us, individually, become better engineers. But a tool that makes me a better engineer isn't necessarily going to make you a better engineer.

Another big problem with proprietary solutions/projects is the loss of flexibility in using best-in-class continuous integration servers. When the IDE has a closed solution/project structure, it becomes more difficult to use third-party tools to build your application. You may find plugins that allow the server to integrate with your particular IDE, but unless it's an official plugin provided by the IDE's vendor, it's a hack at best, because the closed nature of the solution/project format means that anything can change between versions of the IDE. This poses a problem if you want to stay current with both the latest version of your IDE and the latest version of your integration server.

An IDE should be a means to an end, not the end itself. If you can't build and distribute your program without a particular IDE, you will be fighting an uphill battle when trying to use best-in-class continuous integration servers, onboard new engineers, share your code with people outside your group, or open source your software.

What this means in the real world

My goal with this post isn't to get you to stop using IDEs. It's to get you to understand the trade-offs of using a particular IDE so you're prepared to handle the downsides. At the end of the day there are going to be certain frameworks (like iOS) which are not built to be developed outside a particular IDE. But if we raise enough awareness about what we want to do outside these IDEs, we can get framework designers to provide more robust development environments, and we will be better software engineers.

Monday, November 25, 2013

Splitting a Git repository into multiple repositories

Today I thought I would pass along a helpful code organization tip. Occasionally I've run across the need to split an existing git repository into multiple repositories while keeping the history intact for each split-out repository. One common scenario where this arises is when you want to refactor a piece of code or submodule out of an existing project into its own library for reuse.

Splitting an existing git repository into multiple repositories is actually pretty straightforward if you use git's subtree command. A git subtree is simply a subfolder within the existing repository that you can commit to, branch, and merge. The easiest way to explain how to do this is with an example.


Let's pretend we have a project called MyProj that is really made up of two sub-projects, ProjA and ProjB, which we want to split into their own repositories. The first thing we need to do is make sure we're in the directory of the git repository we want to split up.

$ cd /path/to/MyProj

I like to remove the origin remote so I don't accidentally push something to origin. This allows me to always start over if I mess something up.

$ git remote rm origin

Now we can split ProjA and ProjB into their own subtrees. We're going to use the -b argument, which tells git to create a new branch for the split subtree with its own complete history.

$ git subtree split -P relative/folder/for/ProjA -b ProjA
$ git subtree split -P relative/folder/for/ProjB -b ProjB

For me, the easiest thing to do at this point is to create a new empty git repository for each project, fetch the split branch into it, add the new remote repository, and push to the origin remote's master branch. This presupposes that you've already created new empty remote repositories for ProjA and ProjB.

Here I'm going to create the new local repository for ProjA as a sibling folder of the original project. Creating the new ProjB repository is exactly the same process.

$ cd ..
$ mkdir ProjARepo
$ cd ProjARepo

Before we do anything with the ProjA subtree, we need to initialize our new empty git repository.

$ git init

Now that we have an empty git repository we can fetch the ProjA branch from the origin MyProj repository.

$ git fetch ../MyProj ProjA
$ git checkout -b master FETCH_HEAD

The last thing we have to do is add the origin remote for the new repository and push our changes to its master branch.

$ git remote add origin git@github.com:ProjA.git
$ git push -u origin master

And there you have it: separate repositories for ProjA and ProjB, each with its complete history. At this point you can remove their folders from MyProj, or remove MyProj altogether if ProjA and ProjB were the only things in the original repository.
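The steps above can be collected into a small helper so you don't have to retype them for each sub-project. This is just a sketch based on the commands in this post; the `split_subtree` name is my own, it assumes `git subtree` is available in your git install, and it stops short of the remote-add/push step, which stays manual as shown above:

```shell
# Split one subfolder of an existing repository into a new repository,
# preserving its history. Usage: split_subtree <src-repo> <subfolder> <dest-dir>
split_subtree() {
    src=$(cd "$1" && pwd)    # absolute path, so fetch works from inside dest
    subdir=$2                # relative folder to split out (e.g. relative/folder/for/ProjA)
    dest=$3                  # directory for the new local repository
    branch=$(basename "$subdir")

    # Split the subfolder into its own branch with its complete history (-b, as above).
    git -C "$src" subtree split -P "$subdir" -b "$branch"

    # Create the new empty repository, fetch the split branch, and check it
    # out as master.
    mkdir -p "$dest"
    git -C "$dest" init
    git -C "$dest" fetch "$src" "$branch"
    git -C "$dest" checkout -b master FETCH_HEAD
}
```

Running it once for ProjA and once for ProjB reproduces the example, after which you add each repository's origin remote and push as shown earlier.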