I'll be ranting about software engineering here.

 

Benchmarking strategies

From time to time, software will fail to perform and scale as desired. The Developers usually have an idea of the problem but may not have precise details.  More importantly, the Developers may not have a good idea of what to fix.  In such cases, running a Benchmark is a good idea.

A Benchmark is a test of the performance and scalability of the software system.  Benchmarks answer questions like:

  • Why is the system slow?
  • How many records per second will the system process?
  • What is the transaction latency from input to output in a system?
  • What happens if the input load is doubled?
  • Does the latest release of the software degrade or increase performance?
  • What kind of hardware is required for a Production deployment?

For small-scale software, Benchmarks are pretty simple.  A performance profiler, e.g. Profile from Dynamic Memory Solutions, can be used to determine where a particular process is spending its time.  For scalability, the system can be flooded with records until something snaps.

Large-scale software is much more complex.  A software system may contain dozens of major pieces.  Consider all the complexity of using a mobile phone to make a purchase via the web browser.  At least three giant systems are involved: the telephone network, the internet and the credit card processing. Each of those systems is composed of hundreds of subsystems.  If your browsing is slow and times out often, which system is to blame?  Benchmarks can help to answer this question because they can determine the maximal load that each system can sustain.

When setting up a Benchmark there are several elements to consider:

  • What software is being tested?  This is obvious but it is easy to get distracted and to start focusing on other components.
  • What are the specific goals?  Some examples might be maximum scalability with a specific set of hardware, or min/average/max time per record.
  • What hardware is available?  As a goal, it is best to Benchmark on hardware that is similar to Production.  The choice in hardware can vary the performance and scalability numbers dramatically. 
  • What are the current numbers?  These may not be known but some details are likely hanging around in the shop.  For example, the simple statement ‘Production processes 100M records per day’ gives some important information.
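As a rough illustration of how much a single number like that tells you, here is a minimal sketch that turns a daily volume into an average per-second rate.  The function name is my own invention for this example, and the result is only an average: real traffic is bursty, so the peak rate a Benchmark must hit will be higher.

```cpp
// Seconds in one day, used to convert a daily volume to a per-second rate.
constexpr double kSecondsPerDay = 24.0 * 60.0 * 60.0;

// Hypothetical helper: average records per second for a given daily volume.
constexpr double records_per_second(double records_per_day)
{
    return records_per_day / kSecondsPerDay;
}
```

For 100M records per day this works out to roughly 1,157 records per second on average, which is a useful first target for the Benchmark.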

When setting up a Benchmark for a complex system, keep these caveats in mind.

  • Have a clear understanding of the hardware and software deployment. This is basic.
  • Understand that the network can be a bottleneck, particularly in geographically distributed systems.  Have a competent Network Admin on the team.
  • Performance and Scalability are almost always I/O bound.  Pay careful attention to the setup of disk arrays and other I/O devices.  Parallelism pays off big in this area.
  • Databases tend to be bottlenecks.  The database is a complex system itself.  A competent DBA armed with a full toolbox is a must.
  • Make sure to have a hardware System Admin handy.  Tweaking the hardware is often a must during a Benchmark.  Most shops only allow Sys Admins to make changes so make sure they are available.  A bunch of time can be wasted waiting on System Admins.

Benchmarks are a great way to gain focus on a project.  Gaining deep insight into performance, scalability and Architecture often results in modest changes with large ROI.

Low latency, highly scalable, robust systems (cont)

In my last article, we were discussing some elements of low latency, highly scalable systems.  Such systems are very complicated but have some common elements.

  • Low latency.  This means that the wall-clock time to process a particular message through the system is small and predictable.  This is the exact opposite of a batch system where the last record will have significantly different processing time versus the first.
  • Highly scalable.  This means that the system can process more and more records without significant software changes. As the system grows, more hardware and network resources are required.  
  • Robust.  Most of these systems demand 24x7 uptime. Never having the system in a maintenance window is a big engineering challenge. 
  • High performance.  A minimum of CPU and other system resources are spent on processing each message.
  • Massive volume.  Modern systems are processing thousands of messages per second (mps).  I’ve worked with systems that process more than 50K mps.
  • Heavy multi-threading.  A particular process may have thousands of threads.  This is not mandatory but is common practice.

We covered latency and scalability in the last article.  Let’s dig into Robustness.

Robustness is the ability of a software system to continue operating under adverse conditions.  In hardware systems, Robustness is termed Fault Tolerance.  There are many parallels between hardware and software but big differences as well.

Robustness under predictable error conditions is relatively straightforward.  Modern realtime systems must be robust under all conditions, which is much more complicated.  Some examples:

  • A process runs out of a resource like disk space
  • A local network outage occurs
  • Power failure to some part of the system
  • Failure of an interface to another 3rd party system
  • Complete failure of connectivity to the internet backbone

As Software Engineers, we are used to thinking about processes having trouble.  Network trouble is often considered in designs.  I’ve never seen a software design document that discussed the more catastrophic errors.  In a very real sense, there is nothing that can be done in software about these issues. How is a software system supposed to cope under such duress?  

The answer lies in the complex field of Disaster Recovery (DR).  It used to be that DR was a reactive strategy for failure of some part of a software/hardware system.  In modern systems, DR and Robustness are merging into a proactive strategy.  In this model, Robustness is used to avoid a disaster altogether.

Some common strategies:

  1. Redundancy.  In both a hardware and software sense, every piece of the system must have a reliable backup. In some cases, multiple backups will be required.  To avoid network outages, use multiple independent networks.  To avoid a single point of failure, locate the software system in multiple, geographically disparate data centers.  Creation of a fully redundant system is a complex Architecture and Operations challenge to say the least.
  2. Pipelining.  Once again, creating multiple parallel processing paths yields an advantage. If one of the pipelines fails, others can continue without interruption.
  3. Rolling upgrades.  Some parts of the system can be shut down while the rest of the system continues to run.  Upgrading part of the system to the latest release is a great way to test while managing risk.
  4. Cascade resistance.  Areas of the software system must in some measure be isolated from one another to avoid a cascading failure.  A cascade happens when the failure of one component causes the failure of another component.  The Architecture must account for and avoid this dire issue.
  5. Self-stabilization.  This is a cool idea. A system is self-stabilizing if it can be in any state and return to a known good state.   Let’s imagine a gaming system that has thousands of online players.  A hardware failure causes a big chunk of the players to be dropped.    In a self-stabilizing system, the players will be able to rejoin with no loss of integrity and the system overall will continue without significant interruption.
  6. No single point of failure.  This is a facet of Redundancy but worth mentioning.  At every level in the software and hardware, single points of failure must be avoided.  This is harder than it sounds.  I had a recent example where I used a monitoring thread on multiple pipelines.  The monitoring thread watched for threads that were past a timeout limit.  I coded a bug that caused the monitoring thread to hang when the database went down.  The hang of the monitoring thread caused all the worker threads to hang because each worker thread had to register with the monitor.   The whole realtime system stopped, grrr. The solution was to have a monitoring thread per database, avoiding a single point of failure.
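To make the monitoring story a bit more concrete, here is a minimal sketch of the idea, not the code from that project.  The Watchdog class and its method names are hypothetical.  The assumptions are that there is one Watchdog instance per database and that workers only record a timestamp, so registering never waits on the database itself.

```cpp
#include <chrono>
#include <map>
#include <mutex>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical per-database watchdog. Workers report heartbeats; a monitor
// thread (not shown) periodically asks which workers are overdue.
class Watchdog
{
public:
    explicit Watchdog(std::chrono::milliseconds timeout) : timeout_(timeout) {}

    // Record that a worker is alive. Only touches an in-memory map, so it
    // cannot hang on the monitored resource.
    void heartbeat(const std::string& worker, Clock::time_point now)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        last_seen_[worker] = now;
    }

    // Return the workers whose last heartbeat is older than the timeout.
    std::vector<std::string> overdue(Clock::time_point now) const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<std::string> result;
        for (const auto& entry : last_seen_)
        {
            if (now - entry.second > timeout_)
                result.push_back(entry.first);
        }
        return result;
    }

private:
    std::chrono::milliseconds timeout_;
    mutable std::mutex mutex_;
    std::map<std::string, Clock::time_point> last_seen_;
};
```

With one instance per database, a stuck monitor for one database leaves the workers on the other databases untouched.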

Robustness is a key facet of any modern real time system.  As Effective Software Engineers, we should strive to consider Robustness at all points in the Architecture.

    Effective Software Engineering -- Process

    An overview of the Process of creating software. 

    Latency in the Financial world

    I’ve just started a new project in the Finance sector.  The goal is to lower the latency of a trading system into the low microseconds range. The goal is very challenging when compared to traditional Telecom sector latencies.

    I’ve come across some new software and hardware that I find very interesting.

    1. RDMA.  Remote Direct Memory Access is a strategy that allows one host to directly access the RAM of another host.  This is a cool idea as it avoids wasting CPU cycles. There are several network card manufacturers, including Mellanox, that support RDMA.  It’s a bit tricky to keep the hosts synchronized but very high speed.
    2. Solarflare OpenOnload.  This is a user space TCP stack that integrates nicely with the Solarflare NIC.  By having the stack in user space, context switching can be avoided.  When microseconds count, context switches should be avoided when possible.
    3. InfiniBand.  This network technology has been around awhile.  This is the first chance that I’ve had to use it.  So far, solutions based on InfiniBand have shown low latency in the lab.  We have a messaging system that takes a mere 10us to send a message to another host.
    4. RDTSC.  Under Linux x86, the rdtsc assembler instruction can be used to get clock ticks.  It can be used as a timer by subtracting two samples and dividing by the CPU frequency.  There is a glitch for multithreaded applications in that two threads may be running on different CPUs and thus be out of sync.
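Here is a minimal sketch of the RDTSC timer idea, assuming GCC or Clang on x86 where the __rdtsc() intrinsic is available from x86intrin.h.  The helper names are my own, and the tick rate is an input you must supply (by calibration, for example); the sketch assumes it is constant.

```cpp
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

// Read the time stamp counter. On non-x86 builds this sketch returns 0,
// since the instruction does not exist there.
inline std::uint64_t read_ticks()
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return 0;
#endif
}

// Convert the difference between two samples to microseconds.
// ticks_per_second is an assumption supplied by the caller.
inline double ticks_to_micros(std::uint64_t start, std::uint64_t end,
                              double ticks_per_second)
{
    return static_cast<double>(end - start) * 1e6 / ticks_per_second;
}
```

Take both samples on the same thread, ideally pinned to one CPU, to sidestep the out-of-sync glitch mentioned above.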

    I’m interested to see what Architecture decisions will be required to meet the stiff latency goals.  The hardware is cool stuff but the software is my key interest.  I’ll keep you updated as the project progresses.

    Low latency, highly scalable, robust systems

    As software systems become more integrated into every aspect of our lives, there is increasing demand for low latency, highly scalable systems. Telecommunications, gaming, social networking and financial trading systems are good examples.

    Many of these systems are ‘messaging’ based.  In this context, a ‘message’ is a small bit of work that the system must process. Depending on the market segment, the contents of the message will vary.  Typically, a particular message doesn’t require much processing but the volume of messages is huge.

    Let’s explore some of the facets of such systems.

    • Low latency.  This means that the wall-clock time to process a particular message through the system is small and predictable.  This is the exact opposite of a batch system where the last record will have significantly different processing time versus the first.
    • Highly scalable.  This means that the system can process more and more records without significant software changes. As the system grows, more hardware and network resources are required.  
    • Robust.  Most of these systems demand 24x7 uptime. Never having the system in a maintenance window is a big engineering challenge. 
    • High performance.  A minimum of CPU and other system resources are spent on processing each message.
    • Massive volume.  Modern systems are processing thousands of messages per second (mps).  I’ve worked with systems that process more than 50K mps.
    • Heavy multi-threading.  A particular process may have thousands of threads.  This is not mandatory but is common practice.

    When we throw all these requirements into a single project, Effective Software Engineers start to get excited!  Whole books are written on each of these areas but let’s hit the highlights.

    How is Low Latency achieved?

    • Solid Architecture modeling.  A detailed understanding of the processing steps must reside in multiple heads.  
    • Understanding latency requirements.  ‘Low’ is a vague word in this context.  Some applications mean microseconds by ‘low latency’ while others mean tens of seconds.  Understanding the requirements for a particular segment is vital.
    • Isolation.  Don’t have a bunch of other software running on your low latency system.  This is harder than you might imagine when Virtual Machines, Disk Arrays and Network components are considered.
    • Pipelining.  Multiple pipelines must execute in parallel.  This is a key Architecture Pattern.  Pipelines help by avoiding interference between messages.  Note that pipelines need not be a rigid set of processes.  A common architecture is a ‘grid’ of components passing work back and forth.
    • Realtime based components.  To achieve low latency, each component in the system must have predictable latency.   Complex databases are an example of a component with unpredictable latency.  
    • Minimize complex business logic interaction.  Each message should be handled in the same way if possible.  If some messages get feature A while others get feature B then a message that gets both A and B may blow the latency window.  Many of the software systems under discussion have complex business logic so this is a major issue.
    • Divide business logic into realtime and non-realtime.  Typically, the low latency requirement isn’t for all processing.  For example, reports can often be produced in minutes or hours rather than seconds.
    • Move errors aside.  If a particular message cannot be properly processed then it must be shunted aside in an orderly manner without slowing the overall system down.
    • Asynchronous processing. If various parts of the system are waiting on other parts, a stall in one or more pipelines can impact latency throughout the system.  Asynch processing from one layer to another works best.
    • Physical Dependencies.  If the latency requirement is 1 millisecond and the disk drive is taking 5 milliseconds to persist the message, the system will quickly fail. The ‘black box’ model of a software system must be examined for real world bottlenecks.
    • Under-load the system.  Set a target of 40% usage for any particular resource.  This low number will help to manage bursts and backlogs.
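The ‘move errors aside’ point is easy to show in code.  This is a minimal sketch under my own assumptions: the Message struct, the process_batch function and the error queue are all hypothetical names, and the handler simply returns false when a message cannot be processed.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <vector>

// Hypothetical message type for the example.
struct Message
{
    int id;
    std::string payload;
};

// Run the handler on every message. A failed message is pushed onto the
// error queue and the loop moves on; nothing is retried inline, so one bad
// message does not delay the ones behind it.
inline std::size_t process_batch(const std::vector<Message>& input,
                                 const std::function<bool(const Message&)>& handler,
                                 std::deque<Message>& error_queue)
{
    std::size_t processed = 0;
    for (const Message& message : input)
    {
        if (handler(message))
            ++processed;
        else
            error_queue.push_back(message);
    }
    return processed;
}
```

The error queue can then be drained by a separate, non-realtime process that has all the time it needs.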

    How is High Scalability achieved?

    • Pipelining is again the key. Having a set of parallel resources that process messages is a must.  When volume grows, the pieces of the pipeline that are under pressure can be expanded.
    • Avoid serialization.  Try not to have all the pipelines accessing a particular resource.  The Database is the most common example. Everyone trying to insert into the database at once is sure to limit scalability.
    • Minimize cross-talk.  If messages depend on one another in some way then scalability is likely to be limited.  For example, consider a duplicate check in a message stream. If each message must be checked against every other message then scalability will suffer.  A variety of cache strategies seek to address this very problem.
    • Asynchronous processing.  Synchronous processing can quickly fill up all the pipelines if a resource is slow. This limits scalability as the number of incoming messages is determined by the number of pipelines.  Multi-threading strategies attempt to solve this problem.  Large numbers of threads can act as a large set of pipelines in some cases.  Asynch is much easier where possible.
    • Physical dependencies.  Similar to latency, disk, network and other resources can behave differently from the theoretical model.  If the difference is large, unexpected scalability bottlenecks will arise.
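As one simple illustration of the duplicate check mentioned above, here is a sketch that remembers message keys in a hash set, so each check is a single lookup rather than a comparison against every earlier message.  The DuplicateFilter class is a hypothetical name, and the sketch assumes each message carries a unique string key.  It never evicts anything, so a real cache would also need an expiry policy.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical duplicate filter keyed on a message identifier.
class DuplicateFilter
{
public:
    // Returns true the first time a key is seen and false for any repeat.
    bool first_seen(const std::string& key)
    {
        return seen_.insert(key).second;
    }

private:
    std::unordered_set<std::string> seen_;
};
```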

    The best case is hardware based scalability.  In this model, new hardware is added as the volume grows. In a perfect world, the software system integrates the new hardware into the pipelines automatically.  The scalability problem has largely been solved for comparatively simple cases e.g. a web search.  Based on these working models, even more complex software systems are being fielded with massive scalability.  This is one area where Software Engineering has made big strides.

    Enough for now, I’ll hit the other areas in my next rant :)

    Avoiding Memory Leaks in C/C++.

    C and C++ have a serious flaw in heap management.  The whole concept of a programmer driven, globally managed heap is broken.  The design of the malloc libraries has led to countless hours of debugging and billions of dollars of losses.  Random heap corruption is the worst issue but Memory Leaks are a major irritant as well.

    Before digging in, let’s define the parts.  The heap is a large block of memory that may be used dynamically during a process’s lifetime.  Allocations in the heap are variable sized and may be non-sequential.  Unlike the stack, the heap may become fragmented as interior blocks are deallocated.

    In C, blocks in the heap are allocated with the malloc library. The ‘new’ operator in C++ is typically a wrapper on malloc with some additional code to call the constructor. The allocation strategy is actually fine.  The compiler handles the details so it is hard to screw up a call to malloc/new.

    A memory leak occurs when the last pointer to a block in the heap is lost before free or delete is called.  Over time, a leak will cause a program to run out of memory and/or start page swapping.  Leaks can be very hard to find as there is no immediate consequence.

    A big problem is ownership.  When a block is allocated, the programmer must understand the deallocation strategy.  This maxim must be true every time.  Memory Leaks are generally created when this rule is broken.  If ownership is unclear then the block will be leaked or, worse, deallocated twice.  Understanding ownership is the key to avoiding memory leaks.

    There are several cases for ownership.

    In-function ownership is the simplest.  The programmer just needs to call delete at the end of the function/method.

    int foo()
    {
        Object* obj = new Object();
        // other stuff  …
        delete obj;
        return 0;
    }

    Loop ownership gets a bit more complicated.  In the fragment below a leak will occur when an error is encountered.

    int foo()
    {
        Object* obj = NULL;
        while (true)
        {
            obj = new Object;
            if (error)   // 'error' stands in for any failure check
                break;   // leak: obj is never deleted on this path
            delete obj;
        }
        return 0;
    }
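One way to close the hole in the loop above is to let a scoped owner do the delete.  The sketch below uses std::unique_ptr, which is my substitution rather than anything from the original code.  The Object struct with its instance counter and the fail_at parameter are hypothetical, added only so the behavior can be checked.

```cpp
#include <memory>

// Hypothetical stand-in for Object that counts live instances.
struct Object
{
    static int live;
    Object() { ++live; }
    ~Object() { --live; }
};
int Object::live = 0;

// Handles up to max_items objects; fail_at simulates the error case.
int foo(int max_items, int fail_at)
{
    int processed = 0;
    for (int i = 0; i < max_items; ++i)
    {
        std::unique_ptr<Object> obj(new Object);
        if (i == fail_at)
            break;      // obj is destroyed as it goes out of scope
        ++processed;
    }                   // obj is also destroyed after every normal pass
    return processed;
}
```

Because the owner lives inside the loop body, every exit path from an iteration releases the block.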

    Return value ownership can be more complicated.  The value can be returned via parameter or return value.

    Object* foo(Object*& returnObj)
    {
        Object* obj = new Object();
        returnObj = new Object();
        return obj;
    }

    int main()
    {
        Object* otherObj = NULL;
        Object* localObj = foo(otherObj);
        return 0;  // both objects leaked
    }

    In both of these cases, the ‘main’ function must deallocate the objects at some point. This case is probably the most common cause of leaks.  Oftentimes the function is in a library and the programmer doesn’t clearly understand the ownership situation.
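A sketch of how the ownership hand-off can be made explicit in the signature, again using std::unique_ptr as my own substitution.  The Object struct with a live counter and the caller function are hypothetical, there only to make the effect visible.

```cpp
#include <memory>

// Hypothetical stand-in for Object that counts live instances.
struct Object
{
    static int live;
    Object() { ++live; }
    ~Object() { --live; }
};
int Object::live = 0;

// Both results are handed over as unique_ptr, so the caller visibly owns them.
std::unique_ptr<Object> foo(std::unique_ptr<Object>& returnObj)
{
    std::unique_ptr<Object> obj(new Object());
    returnObj.reset(new Object());
    return obj;
}

int caller()
{
    std::unique_ptr<Object> otherObj;
    std::unique_ptr<Object> localObj = foo(otherObj);
    return Object::live;   // both objects are alive here
}                          // and both are destroyed here
```

A library that returns raw pointers can still be wrapped this way at the call site, as long as its documentation says the caller owns the result.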

    A smaller but common case is a complex ctor/dtor combination.

    class Object
    {
      public:
        Object() { subObject = new SubObject; }
        ~Object() { };   // subObject is never deleted

      private:
        SubObject* subObject;
    };

    In this case, subObject will leak every time the destructor is called for an Object instance. The error is obvious in this case but real world examples can be confounding in complexity. 
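A minimal sketch of the matching destructor follows.  The SubObject counter is a hypothetical addition for checking the result, and disabling copies is my own choice: with a raw owning pointer, a copied Object would otherwise delete the same SubObject twice.

```cpp
// Hypothetical stand-in for SubObject that counts live instances.
struct SubObject
{
    static int live;
    SubObject() { ++live; }
    ~SubObject() { --live; }
};
int SubObject::live = 0;

class Object
{
  public:
    Object() : subObject(new SubObject) {}
    ~Object() { delete subObject; }   // releases what the ctor allocated

    // Forbid copying so two Objects never own the same SubObject.
    Object(const Object&) = delete;
    Object& operator=(const Object&) = delete;

  private:
    SubObject* subObject;
};
```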

    Another common case for memory leaks is multiple execution paths.  In complex code, the deallocation of a block may be inadvertently skipped.  Error handling is often involved in this case.  If the error is very uncommon then only a small leak will result.  In a repetitive error case, however, a high volume system can exhaust or overtax memory resources in just a few minutes.

    For complex code that has poor Cohesion, leaks via multiple execution paths can be a major problem to find and fix.  Error cases must be carefully evaluated to avoid this situation. The various ‘smart’ classes can help but are no silver bullet.

    In complex programs, finding memory leaks via code inspection is basically hopeless.  An automated tool like Dynamic Memory Solutions Leak Check is a must. The tool should be used in Unit Test cases and on full end to end testing.  This step is usually skipped, and leaks show up in production environments.

    There is a computer science concept called a Memory Pool.  In a Memory Pool, allocation is done as a heap but deallocation is done all at once.  

    Structuring your C++ classes to allocate from a Memory Pool can be a great way to avoid leaks.  This idea works if a bunch of objects are created, used and then can be destroyed as a group.  The key is that a pointer in the Memory Pool is always maintained until the objects are destroyed.

    For C, it is even easier since no dtors need be called.  In this case, the programmer can allocate from the pool but does not call a deallocate routine.  Once the pool is ready to be discarded, a single call frees all the blocks at once.
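Here is a minimal sketch of the pool idea for raw blocks, the no-dtor case.  The MemoryPool class and its method names are hypothetical, and it simply wraps malloc/free rather than carving blocks out of one big region as a production pool might.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical pool: hands out raw blocks and keeps every pointer, so a
// block cannot be lost even if the caller drops its own copy.
class MemoryPool
{
  public:
    MemoryPool() = default;
    MemoryPool(const MemoryPool&) = delete;
    MemoryPool& operator=(const MemoryPool&) = delete;
    ~MemoryPool() { release(); }

    // Allocate one block. There is deliberately no per-block free.
    void* allocate(std::size_t bytes)
    {
        blocks_.push_back(nullptr);          // reserve the slot first
        blocks_.back() = std::malloc(bytes);
        if (blocks_.back() == nullptr)
        {
            blocks_.pop_back();
            return nullptr;
        }
        return blocks_.back();
    }

    // Free every block in a single call.
    void release()
    {
        for (void* block : blocks_)
            std::free(block);
        blocks_.clear();
    }

    std::size_t block_count() const { return blocks_.size(); }

  private:
    std::vector<void*> blocks_;
};
```

Since the destructor calls release(), a pool that lives on the stack cleans up its blocks on every exit path, including error returns.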

    Memory Leaks in C/C++ are a real irritant.  Use the techniques described here and your team can mitigate the problem.

    Oddity in C++ exceptions

    I’ve always disliked exception handling in C++.  The language gives only a half-baked implementation.  The implementation has several shortcomings:

    1. You can never be sure when a ‘catch’ is needed.  Since a library call can throw, your program must guard against ‘rogue’ throws.
    2. You’re supposed to throw in ctors but not in dtors.  This is strange and speaks of a broken design.
    3. Programmers use exceptions to avoid difficult error handling.  This is true of any language and most irritating.  Local error handling is best.  For errors that need to bubble up, return codes work well.   

    I now have another reason to dislike exceptions. I recently ran across a way to bumble a call to new. It involves the broken exception handling in C++ constructors.

    It seems that a pointer assignment can be skipped if a ctor throws an exception.  I am sure this is documented somewhere in the specification but it is clearly counter-intuitive.  ‘new’ should guarantee assignment of NULL on failure, perhaps by assigning the pointer twice, once to NULL and once to the return value.

    Here is an example:

    #include <stdlib.h>
    #include <iostream>
    #include <string>

    using namespace std;

    class broken
    {
      public:
        broken() { throw string("ctor failed"); }
        int value;
    };

    int main(int argc, char* argv[])
    {
      broken* brokenPtr = (broken*)0xdeadbeef;
      try
      {
        brokenPtr = new broken;  // assignment is skipped on the throw!
      }
      catch (...)
      {
        cout << brokenPtr << endl;
        if (brokenPtr)
        {
          // brokenPtr still set to 0xdeadbeef!
          brokenPtr->value = 10; // may crash
          delete brokenPtr; // corrupts heap
        }
      }
      return 0;
    }

    Following the maxim, ‘initialize all variables’ will avoid this trouble.  In the above example, I should have set the brokenPtr to NULL.  

    In my real world example, the situation was more complex with the variable being used in a loop.  In this case, the stale value was the previous (free’d) value of the pointer. I fixed the issue by explicitly assigning the pointer to NULL just before the call to new.  It looks odd in the code to have consecutive lines assigning to the same variable.
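For comparison, here is a minimal sketch of the fix, with the pointer reset just before the call to new.  The try_create helper is a hypothetical name, and I throw std::runtime_error instead of a string purely to keep the example short.  When a ctor throws, the new-expression itself releases the memory it allocated, so there is nothing to delete in the handler.

```cpp
#include <stdexcept>

class broken
{
  public:
    broken() { throw std::runtime_error("ctor failed"); }
    int value;
};

// Hypothetical helper: returns true only if construction succeeded.
bool try_create(broken*& ptr)
{
    ptr = nullptr;          // looks redundant, but the next assignment
    try                     // never happens if the ctor throws
    {
        ptr = new broken;
    }
    catch (const std::exception&)
    {
        // ptr is still nullptr here, so later checks are safe.
    }
    return ptr != nullptr;
}
```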

    Java’s looking better every day :)


    Software Engineering Process

    The number of steps in a properly engineered software project is truly astonishing.  It really is no surprise that steps are skipped and projects fail.  Here is a list of most of the steps that I have been involved in:

    1. Idea.  Someone must think of a product.  Some ideas are small, e.g. a new iPhone App, while some are quite big, e.g. put a man on Mars.  Almost all product ideas will involve software in some way.
    2. Product Management.  Not to be confused with Project Management.  A Product Manager decides what direction to take a software project.  The evolution of Windows is a good example of Product Management.  Decisions made here make the difference between success and failure for products with a long life.
    3. Project Management.  Someone has to keep things on schedule; this is the Project Manager.  For profit driven businesses, Project Management is absolutely critical.  Project Managers think about time, dates and resources.
    4. Contract details.  Software Engineers don’t want to think about business details but it really matters.  Contracts often specify broad requirements but also payments and penalties.  Focusing on the bits that make the most money is actually a pretty good idea.  Avoiding giant penalties in the form of SLAs and delivery milestones is vital for success.     
    5. Requirements.  What should the software do?  This is a much harder question than one might expect.  Perhaps you’ve had the ‘RFP’ experience where a client sends a huge document full of requirements.  I’ve worked on RFPs that were more than 1000 pages. 
    6. Analysis.  This is the phase between Requirements and Design.  It usually involves refining requirements, discussing gaps in existing software, kicking around Design ideas.  Some other items that come up are Use Cases, Business Modeling, consideration of 3rd party products.  
    7. Architecture.  Architecture means a variety of things in software engineering. In general, it is the overarching organization of the software. Some Architecture questions might be C++ vs Java?  Linux vs Solaris?  Rewrite or Reuse?  
      An Architect can also be the person who ties the business to the software.  For example, an Architect works with the Product Manager to think about features that meet the business needs of a changing marketplace.  The Architect may also be broadly in charge of the traditional SDLC pieces.  Effective Software Engineers are often Architects.
    8. Design.  Turning Requirements and Analysis into a software based solution is Design.  This step is often smeared with Programming.  A good design will have a dramatic impact on software costs and success.  
    9. Human factors.  Really part of design but generally so bad that it deserves its own step.  Clunky but functional software often fails.
    10. Physical dependencies.  Another part of Design and Architecture that is often overlooked.  How the disks, network, CPUs, threads, etc are organized can dramatically impact the functionality of a large software system.   
    11. Programming.  Finally we get to Programming.  This step is the big focus for Developers but actually has less impact on success than several other steps.  One of my rants is the excessive focus on programming by Developers.  In this step, we create the software but in some sense it is an afterthought to Architecture and Design.  If Architecture and Design are good then poor Programming will not cause a project to fail.  Conversely, poor Architecture and Design cannot be corrected by outstanding Programming.  Effective Software Engineers are outstanding Programmers but also much more. I’ll write another article on this issue.
    12. Testing.  Testing is vital, difficult and often skipped.  Poorly tested software is invariably broken in a variety of ways.  Effective Software Engineers understand the value of testing and make sure it happens regardless of schedule pressure.  I’ll write more on this important topic.
    13. Performance and Scalability.  Anyone can program a slow, batch system.  It takes an Effective Software Engineer to create a real time, high performance, low latency, highly scalable system.  I have a Design Pattern on this topic that I’ll write about.
    14. Configuration Management.  The software needs to be built, packaged and delivered properly.  On the surface, this appears easy but it is actually quite tricky to get right on large projects.  Given a system with 500 Developers, multiple internal and 3rd party components, multiple active releases and geographic distribution, CM gets confusing to say the least.
    15. Implementation.  As an example, consider a complicated software system that must be deployed at several Data Centers.  What hardware should it run on?  How is the software configured? What about 3rd party licenses?  What is the power and cooling situation in the Data Center?  Is physical space available?  What will the Network topology look like? How much disk space is required?
    16. Disaster Recovery.  More and more software systems are expected to be up 24x7.  What happens if a tsunami wipes out the local power plant and power goes off at the Data Center for months?  On a much smaller scale, what happens if an application crashes?  DR is getting more and more focus.
    17. Deployment.  Pulling some software off the Internet to your Mac and hitting install is easy enough.  What if the software is part of a robot system that is being deployed on Mars?  In-field software upgrades are required.  If the software Deployment fails then a billion dollars is history.
    18. Operations.  Running software can be very expensive and complex.  There are huge rewards for making software easy to Operate.  This might include things like Alarms, Robustness, useful GUIs, user manuals.  Actual Operations includes 24x7 staffing, in-service upgrades, performance monitoring, fault monitoring, maintenance windows, client communications and customer service.
    19. Maintenance.  Software is always broken or incomplete.  Start back at the top :)
    20. End of Life.  Eventually no one will use the software.  Software can have a surprisingly long life cycle however.  I suspect that there is some 1960s software running on a mainframe somewhere.  Unix itself was developed in the 1970s so in some sense Linux is ancient software.

    Books have been written on each of these steps. No person can be an expert in the full Software Engineering Process.  The Effective Software Engineer is familiar with all of the pieces however and works to ensure success at every point.