Recently I wrote about some of the perceived risks of Cloud services and how to mitigate them. Today I’m going to look at some important considerations around overall security, focusing on system design, firewalling, network layouts, encryption and password policies of simple or complex Cloud systems. Although this is written on the basis of an iCloudHosting Cloud solution it is valid for any hosting service, even legacy systems like dedicated servers.
The security of a hosted solution is only as strong as its weakest link. Often development, testing and personal solutions have common data backends or access privileges to confidential and sensitive company data. In small and medium sized businesses, where systems are often developed on-the-fly by independent individuals, data security policies can be misinterpreted, ignored and forgotten. This is why it’s important to design an infrastructure that allows for human error but limits the exposure to non-critical systems.
Layout and Design
In a legacy hosting environment where systems are purchased to last for 3-5 years there are often buffer resources available on servers. It is inevitably enticing for individuals to utilise this spare capacity for short-term experimentation and hobbies but this can often result in:
Unsecured programs running alongside company data, exposing routes in to malicious individuals
Erroneous configurations affecting the performance of key routines and subsystems
Unmanageable and complex servers that are difficult to administer when things go wrong
The best policy is to design an infrastructure that is role-specific, segregates core systems from staging and hobby environments and provides individuals with a flexible test bed that is clear and manageable. One of the benefits of a Cloud environment is that there are no restrictions on server specifications. Physical hardware limitations don’t come into play, allowing for servers with non-standard RAM, CPU and Storage allocation. This allows for a much more streamlined and efficient use of resources but also permits a “one role per server” attitude to be easily achieved. When designing a client’s Cloud we always look at how we can separate out databases from web applications, terminal servers from data stores, development data from live data, and so on.
Having a dedicated database server is a much more efficient use of resource and allows system administrators to analyse and troubleshoot performance easily. However from a security position, the database server doesn’t need to be accessible from the internet so by giving this role its own Cloud server you allow yourself to also move it into a secure DMZ behind a firewall, only allowing access from the front end web servers, and nothing else. Designing your systems like this means that when the boss’s blog server gets compromised for using a lightweight dictionary password it has no more access to your critical company data than anyone else in the world, even though it’s on the same Cloud.
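As a rough illustration of that "web servers only" rule, here is a minimal Python sketch of a default-deny check. The subnet `10.0.1.0/24`, the port `3306` and the function name `db_firewall_allows` are assumptions made up for the example, not part of any real firewall product or of our configuration.

```python
from ipaddress import ip_address, ip_network

# Hypothetical layout: the front-end web servers sit in 10.0.1.0/24
# and the database listens on port 3306. Both values are assumptions.
WEB_TIER = ip_network("10.0.1.0/24")
DB_PORT = 3306


def db_firewall_allows(source_ip: str, dest_port: int) -> bool:
    """Default deny: only the web tier may reach the database port."""
    return dest_port == DB_PORT and ip_address(source_ip) in WEB_TIER


# A web server may query the database...
print(db_firewall_allows("10.0.1.15", 3306))
# ...but a compromised blog server on another subnet may not,
# and neither may anyone on the public internet.
print(db_firewall_allows("10.0.2.7", 3306))
print(db_firewall_allows("203.0.113.9", 3306))
```

The point of the sketch is the shape of the rule rather than the tooling: everything is refused unless it matches the single path that the role actually needs.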
Firewalls
Until someone sees the log of an active firewall they never believe the volume of port scans and brute force attacks that a server on a public connection receives every minute of every day. Within hours you can expect a server without a firewall to have received a myriad of dictionary attacks on SSH and Remote Desktop ports and yet it is still commonplace to discover a server that has been compromised because of it.
It’s important to firewall systems at every level but within reason of everyday use. There always needs to be a balance between security and practicality, otherwise users cut corners to circumvent over-zealous restrictions, putting the whole system in jeopardy.
The boss’s blog doesn’t need four levels of passwords, a VPN and a randomly generated key just to edit it. However, the back end database and datastores do. Deleting data and removing database rows should require an additional level of authentication beyond that needed to view or read it, and should only be possible for nominated users, from predefined IP addresses, over an encrypted connection.
Hardware, software and appliance firewalls all have their place too. Software firewalls are often dismissed as lightweight but they have progressed significantly over recent years and should no longer be lumped in with a “Windows Firewall” or similar. We would recommend using a sensible combination of solutions in tandem, disseminating administrative access to these firewalls according to role and experience. After all, it is important that the development team can provide access at short notice to someone working from home but only to the testing version of the website, not the live version and definitely not to the core business datastore. Firewalling is about security and blocking people from accessing things they shouldn’t but it must also be about flexibility and performance.
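To make the idea of "more checks for destructive actions" concrete, here is a small Python sketch. Everything in it is hypothetical: the user name, the approved network `198.51.100.0/28`, the `Request` fields and the assumption that a plain read only needs an already-authenticated session.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

# Hypothetical policy data, invented for this example.
NOMINATED_USERS = {"dba_alice"}
APPROVED_NETWORKS = [ip_network("198.51.100.0/28")]


@dataclass(frozen=True)
class Request:
    user: str
    source_ip: str
    encrypted: bool         # did the request arrive over TLS or a VPN?
    second_factor_ok: bool  # did the user pass the extra authentication step?
    action: str             # "read" or "delete"


def is_permitted(req: Request) -> bool:
    """Reads need a normal session; deletes must pass every extra check."""
    if req.action == "read":
        return True  # assumes ordinary login has already happened
    if req.action == "delete":
        from_approved_ip = any(
            ip_address(req.source_ip) in net for net in APPROVED_NETWORKS
        )
        return (
            req.user in NOMINATED_USERS
            and from_approved_ip
            and req.encrypted
            and req.second_factor_ok
        )
    return False  # unknown actions are refused by default
```

A delete from `dba_alice` at `198.51.100.5` over an encrypted link with a second factor passes; change any one of those four conditions and it is refused, while reads remain friction-free.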
Password Policies
I imagine that there are many system administrators out there who remove the password complexity restrictions built into Windows as one of their first tasks, or alternatively keep a list in a text file, e-mail inbox or written on a notepad in their top drawer. This is a result of getting the balance between security and access wrong. Passwords should get more complex in direct proportion to the level of access they provide. More specifically, authentication should be two or three-factor for critical systems. In a Cloud environment it is simple to provision additional authentication and encryption services such as RADIUS and VPN servers, so to rely solely on alphanumerical strings is really a more-than-outdated concept. Complex passwords are essential but consider who has access to them and how people will continue to work in an emergency when key company contacts are unavailable. What is the contingency for the Senior Sys Admin being away on holiday when something fails? Where are the emergency keys to get access and what is the protocol?
In Summary
There are many important considerations when designing a secure system but it’s essential to remember the user when doing so. Security cannot impede business operations and must not stall or frustrate users unnecessarily, otherwise they will find ways to circumvent the processes designed to protect them. Of course, the iCloudHosting team work every day with existing and new customers to design and improve security so if you’d like some advice and assistance, just give us a call!
A little while ago, Gartner reported that many IT Procurement and IT Managers were overlooking basic and key risks in their evaluation of new Cloud services. Allegedly they are doing this either through a lack of understanding of potential risks or because those risks are being hidden by canny Cloud hosts.
Interestingly, some of the key risks highlighted in the report are things that iCloudHosting customers don’t need to worry about. I thought it might be interesting to go through them and explain what we’re doing to ensure that companies aren’t compromising themselves in the pursuit of the significant benefits of Cloud services:
Varying Costs
According to the report, “Cloud solutions often appear to have lower initial and switching costs than traditional solutions, but include hidden costs”. I appreciate this is probably true for some Cloud service providers but we operate a very clean, simple model. We don’t charge extra for technical support and a client’s pricing will remain the same for the full duration of their contract. I’ve always found that my clients appreciate this way of working because they can have a clear plan of their costs for the next 2-5 years and they know that as they grow and use more resources their granular cost actually goes down.
Data-handling Policies and Procedures
I’m quietly confident that high on every IT Manager’s list of risks and fears about Cloud services will be the security of their data. To be honest, I couldn’t agree more. A company’s data is its business and without it, it can’t function. The critical thing to do is to outline and consider what could cause these fears to become a reality and then to put in place systems and processes that mitigate them.
There are 4 ways in which we protect our clients’ data:
1. Data Corruption
There is a lot of really good information that I won’t try and repeat here about how we have designed our backend data stores to prevent data corruption, specifically silent data corruption which can occur in any RAID setup. If you thought that having a nice big SAN with RAID 10 meant you couldn’t suffer from data corruption then you’d be massively mistaken. We utilise checksums at a block level to ensure persistent continuity.
2. Accidental or malicious deletion of data
However good your systems are you can’t mitigate against human failure. Regardless of the experience and intelligence of your team it is almost inevitable that mistakes happen and files are deleted. This is why we take full, hourly backups of all our Cloud instances. That way, however much you delete or change incorrectly you can always restore that missing file or folder. Better than that, you can restore a whole Cloud instance to how it was 30 days ago if you need.
3. Primary site failures
We use Tier-IV aligned Data Centres with 2N power, resilient networks, fire protection, CCTV and onsite security but we don’t think that is enough in every situation. iCloudHosting run a Cloud in a secondary site over 100 km away from the primary location, vastly exceeding the stringent FSA regulations on DR (Disaster Recovery) sites. We keep all of our daily backups in our secondary site and design systems which can fail over as quickly as the client requires without any data loss.
4. Data access for the Cloud provider or other Cloud users
Cloud services carry exactly the same level of risk as any other outsourced IT function. If your data is not on your own equipment then there is inherent risk in that. However, it is not risk that cannot be managed. Data and disk encryption is a common means of ensuring that even if someone had access to a physical hard disk drive, the data held on it would be completely useless. What is the client’s password policy? Often when a service is provisioned, the Cloud provider will maintain a secure copy of administrator passwords but there’s no reason that the client cannot immediately change these, providing the Cloud provider with a limited access account to perform routine maintenance and ad hoc support. This is a regular occurrence for us and an even more regular fear for prospective clients. The important thing to remember is to discuss the matter with your Cloud provider and to plan accordingly.
Contracts Do Not Have Clear Service Commitments
I’ve not seen a serious hosting contract that doesn’t clearly outline a provider’s obligations and importantly the SLA (Service Level Agreement) in many years. Our documentation clearly states the 100% uptime that we guarantee all our contracted clients and it forms the basis of the hosting agreement. For the duration of the contract these commitments remain valid and true even if contracts for new clients are altered. To act in any other way breaks the UK’s Contract Law and would serve for a quickly-tarnished reputation! Beyond the contract and the SLA, both the Cloud provider and the client should be confident that the service can achieve the guaranteed uptime.
Summary
I think it’s really important that reports like Gartner’s highlight these risks so that prospective Cloud clients can really consider the service that they’re entrusting their business to. At iCloudHosting we’ve designed an infrastructure from the ground up to meet the demands of 100% uptime, but we continue to work with new and existing clients on a daily basis to help them grow their business in a sustainable, risk-free manner. In my opinion, making the right choice is not just a task for contract-negotiation time but an ongoing process that you should feel your hosting provider supports you with.
Nginx (pronounced Engine-X) is an alternative, open-source web server which is designed to be lightweight and high-performance, and since its release in 2004 it has seen a steady increase in uptake. It’s stable, easy to configure and uses an asynchronous architecture to ensure that, under heavy load, page requests are returned quickly to the end user.
Having landed $3 million in funding in October, December saw Nginx leapfrog Microsoft’s IIS web server to become the world’s second most popular choice behind Apache. According to Netcraft, Nginx now hosts 22.2 million active sites, a 1.9 million increase on November.
Nginx Performance in cPanel
If you’d like to try out this lightweight hard-hitter please get in touch; we’d be happy to set you up with an Nginx Cloud server for you to test.
Earlier this week we had some amazing cloud activity in the UK. “Altocumulus Lenticularis” cloud formations are regularly mistaken for UFOs and are very common in the Himalayas and Andes but not here in the UK.
Incredibly interesting to see that Northwestern University in the USA have released details of a lithium-ion battery that can hold up to 10 times the charge and recharge 10 times faster than current models. What this can mean for the technology industry is fascinating.
The capabilities of touchscreen tablets, next generation mobile telephones and ultra-portable laptops all rely heavily on the power they can source from slimline batteries but what could we achieve with server equipment? Would it start to make sense to move DC UPS away from mains source and store it onboard the chassis if you could keep it powered for hours rather than minutes in a space-efficient manner? Whereas it used to be that space was the key factor in building out new sites, it is of course power that is the important consideration now. Could we become more frivolous with our rack usage to get onboard per-server power resilience?
What about individual components? Maybe volatile storage would benefit from extended power sources in much the same way as DRAM performs now?
Bill Gates released the stat that we currently have enough batteries globally to provide 10 minutes of the world’s power. I’ve not seen how that was calculated but it’s a lovely stat if it’s even vaguely accurate. Maybe the electric car will be able to get us all the way home soon!
The tragic scenes in Thailand of relentless flooding have hit all of our screens this week. Reports today suggest that the flood waters are continuing to rise outside of the flood defences that new Prime Minister Yingluck Shinawatra is using to keep the heart of Bangkok (home to 12 million) mostly dry.
Over 400 Thai residents have lost their lives in the flooding but so far none of those reported have been in Bangkok itself, suggesting that the Prime Minister’s efforts are being successful.
However, the water levels along the Chao Phraya (the river that flows straight through Bangkok) are at their highest levels in days and some of the largest manufacturing plants in the world are based alongside it. Some have already suffered flooding; those that haven’t have closed because it is not safe, or even possible, for their employees to travel to work. The Thai Government has put out an official recommendation that those living in Bangkok escape the dangers of the rising water and go on holiday.
One such manufacturing plant affected by the floods is responsible for producing around 25% of the world’s hard disk drives every year; Western Digital have two Manufacturing Centres in Thailand, and in the last 24 months have increased their workforce by more than 20% to over 40,000 employees.
One of the Western Digital Manufacturing plants has already suffered from flooding
Companies such as Nidec (who manufacture more than 70% of all hard drive motors) have also been forced to temporarily suspend operations at all three of their plants in Thailand.
Not only does this massively impact Western Digital and Nidec as businesses, but the knock-on effects are already being seen here in the UK and Europe with stocks of Hard Drives thinning and restock timeframes becoming longer and vaguer. For those companies such as us who rely on the timely availability of computing components to be able to expand and grow their infrastructure in keeping with client demand, it has been essential to almost “hedge” on disk drives, stocking up before the “drought”!
I imagine that there will be a lot of companies who, in the next 6 to 12 months, find it difficult to provision services with the speed that they would normally do and find that their margins get squeezed as their costs rise. The flooding itself has a massive impact on the communities of Thailand, but the repercussions will be felt globally for a long while to come.
Without the underlying data, nothing else matters. Backing up your data is critical but what about silent data corruption?
The biggest problem with silent data corruption is that it is completely silent. You receive no error notifications or alerts and your systems continue on their merry way having written an erroneous bit of data to your drive. And what’s the downside? Maybe just a lost file or maybe a corrupted operating system. Who knows?
CERN’s Large Hadron Collider creates around 15 million gigabytes (15 petabytes) of data every year. That data is used to make conclusions about how the universe was created. So erroneous data would have a considerable impact on those conclusions. That’s why they ran some considerable tests to check the error rate in their data. When they checked 8.7 TB of user data for corruption – 33,700 files – they found 22 corrupted files, or 1 in every 1500 files.
That’s actually quite a considerable number when you think about it. It means that if you have a 1 TB disk in your home machine, full of 4MB MP3s (legally purchased of course!), you could expect around 170 of them to be corrupted, unusable, lost forever.
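For anyone who wants to check the arithmetic, here is a short Python sketch. It assumes decimal units (1 TB = 10^12 bytes, 4 MB = 4 × 10^6 bytes) and assumes that corruption strikes files at the same rounded rate of 1 in 1500; the function name is made up for the example.

```python
def expected_corrupted_files(disk_bytes: int, file_bytes: int,
                             files_per_corruption: float) -> float:
    """Expected number of corrupted files on a disk full of equal-sized files."""
    file_count = disk_bytes // file_bytes
    return file_count / files_per_corruption


# The CERN figures quoted above: 22 corrupted files out of 33,700 checked.
observed_rate = 33_700 / 22  # a little over 1 in 1500

# A 1 TB disk full of 4 MB MP3s, using the rounded 1-in-1500 rate.
estimate = expected_corrupted_files(10**12, 4 * 10**6, 1500)

print(f"one corrupted file in every {observed_rate:.0f}")
print(f"expected corrupted MP3s: {estimate:.0f}")
```

A 1 TB disk holds 250,000 such files, so the rounded rate gives roughly 167 corrupted files, which is where the "around 170" figure comes from.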
“But surely this isn’t going to be a problem with Raid?” I hear you say? Actually one of the biggest culprits is a very common Raid configuration that we all know and love:
Raid 5
Raid 5 is very good. It creates a parity bit for the data that it writes to disk. It stripes that parity bit across the array. The parity value is chosen so that the XOR across all the disks is zero. That way if a disk is lost, you can recalculate what was on the disk by reversing the calculation.
However, every time you update the data on the disk, you must also update the parity bit. One way that errors occur is when the data is updated and power is lost before the parity bit is written. This leaves you with a mismatch between your check data and the real, underlying data. The only way that is ever going to be fixed is if the data is completely written over and the parity bit updated correctly. Otherwise, if you were to recreate the data from the parity check, you’d recreate data that was incorrect and you’d never know. This is called the Raid 5 Write Hole.
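The XOR trick is easier to see in a few lines of Python. This is only a sketch of the parity arithmetic on three tiny in-memory "disks"; it ignores striping and everything else a real Raid controller does, and the helper name `xor_blocks` is invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equally sized byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data "disks" and the parity calculated from them.
data = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data)

# XOR across every disk, parity included, comes out as all zeros.
all_zero = xor_blocks(data + [parity]) == bytes(4)

# Lose the middle disk, then rebuild it from the survivors and the parity.
rebuilt = xor_blocks([data[0], data[2], parity])

print(all_zero)
print(rebuilt == data[1])
```

Because XOR is its own inverse, XOR-ing the surviving disks with the parity cancels everything except the missing block.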
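The write hole can be reproduced with the same kind of toy model. The Python sketch below simulates a data write that lands without its matching parity update, followed later by the failure of a different disk. The disk contents and function names are made up, and a real array is far more involved than this.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equally sized byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def simulate_write_hole():
    disks = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(disks)  # parity is consistent at this point

    # Disk 0 is updated, then "power is lost" before the parity is rewritten.
    disks[0] = b"ZZZZ"
    stale_parity = parity

    # Later, disk 1 fails and is rebuilt from the survivors plus stale parity.
    original = disks[1]
    rebuilt = xor_blocks([disks[0], disks[2], stale_parity])
    return original, rebuilt


original, rebuilt = simulate_write_hole()
# No exception is raised anywhere, yet the rebuilt block is wrong.
print(original == rebuilt)
```

Nothing in the rebuild step can tell that the parity was stale, which is exactly why the corruption is silent.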
Because of the way this all ties together, there is also a significant performance impact to this way of working. If an update is made to the disk that is smaller than the size of a single Raid stripe, the whole stripe must be re-read in order to recalculate the parity bit for that stripe. You were making a very small write and that has incurred a much larger “read” and another, much larger “write”.
There are solutions to these problems but they are all expensive and definitely don’t meet the description of Redundant Array of Inexpensive Disks!
ZFS
ZFS is a transactional filesystem in exactly the same way as most database systems are transactional. All data is committed at once to the disk to prevent “writes” being partially completed. Each block of data has its own checksum which is saved at its pointer. This means that whenever a block of data is accessed it is compared with its checksum. If this is found to differ then the filesystem can heal the block of data so that it is correct before being used.
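The checksum-and-heal idea can be sketched in Python as follows. This is a toy model, not ZFS's on-disk format: it assumes two mirrored copies of each block so that there is something to heal from, uses SHA-256 purely for illustration, and the `MirroredBlock` class is invented for the example.

```python
import hashlib


def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


class MirroredBlock:
    """Toy model: the 'pointer' stores the checksum, the data has two copies."""

    def __init__(self, data: bytes):
        self.expected = checksum(data)  # kept with the pointer, not the data
        self.copies = [bytearray(data), bytearray(data)]

    def read(self) -> bytes:
        # Find a copy whose contents still match the stored checksum.
        good = next(
            (bytes(c) for c in self.copies if checksum(bytes(c)) == self.expected),
            None,
        )
        if good is None:
            raise IOError("every copy failed its checksum")
        # Heal any copy that no longer matches before handing the data back.
        for i, copy in enumerate(self.copies):
            if checksum(bytes(copy)) != self.expected:
                self.copies[i] = bytearray(good)
        return good


block = MirroredBlock(b"customer records")
block.copies[0][0] ^= 0xFF  # flip bits silently in one copy
data = block.read()         # the bad copy is detected and repaired on read
print(data == b"customer records", bytes(block.copies[0]) == data)
```

Keeping the checksum away from the data it describes is what lets a read tell a good copy from a bad one, rather than trusting whatever the disk returns.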
If ZFS is used with Hardware RAID then it can’t guarantee that at a hardware level you won’t end up with the same “hole” as I outlined for Raid 5. So it needs to be in complete control of underlying disks and the system should only use HBA (Host Bus Adapters) to access the drives.
ZFS isn’t a new technology by any means. We’ve been utilising it for many years. However, it is still relatively unknown and underused for what is effectively a Redundant Array of Inexpensive Disks!
Following Blackberry’s 20-hour blackout yesterday they have suffered further problems today, only hours after restoring service to their EMEA customer base.
The crash in the Slough DC has caused varying levels of disruption to business and personal Blackberry users alike across Europe, the Middle East and Africa, affecting E-mail and SMS messaging facilities.
RIM, who own Blackberry, have given very little information to their customers about the cause of the issues and have merely apologised for the “inconvenience” caused. Given the impact to end users it is a shame that RIM haven’t taken a more adult approach to their communications. Considering their dwindling market share to other smartphone manufacturers it is even more disappointing that the approach taken hasn’t involved greater disclosure.
By not releasing a clear and explanatory statement they also put all their Customer Service reps into the firing line. It’s one thing to control Comms but quite another to show your own staff such low amounts of respect for something they have no control over.
In my long experience in the hosting industry the problem rarely stems from the service impact but from the way in which it is handled and communicated. If you treat your clients and customers with contempt for wanting to know what’s going on, expect derision, mockery, hearsay, assumption, conjecture, exaggeration and amplification in return.
Right now RIM might as well have turned their whole DC off deliberately and be sitting, eating a Ploughman’s in the Three Tuns for all we know. Set the record straight, be open and let us all know what’s gone wrong and afterwards tell us how you’re going to prevent it happening again. Treat us like the adults that we are and we’ll understand and stop guessing.
BT suffered a massive power failure yesterday afternoon after a major exchange in Birmingham went down. The outage affected thousands of business customers’ office connections and left them without internet connectivity for their staff. So what can you do to prevent this happening to you?
“It’s very important to design your infrastructure in proportion to the risks, and the impact of those risks becoming a reality.”
Today, we’ve seen a surge of companies looking to move their hosting into the Cloud and away from their offices. And why wouldn’t you host your services in a site that is designed from the ground up to have multiple power and network connections? Our Data Centre, for instance, has multiple power sources, each supported by N+1 UPS and Diesel Generators, meaning that even if power were lost on both power feeds, our services would still be live.
In yesterday’s example at least those business customers affected would know that their websites, eCommerce systems and online services were available to their clients. Add in a secondary internet connection such as a 3G dongle and suddenly your staff can still provide personal service as well.
Ultimately, it doesn’t take a lot of planning to beat the problems imposed on your business by the fallout of a power cut and it doesn’t cost much either.
Maybe it’s worth looking at how your business would cope in the event of a power or network outage?
On the day when £64 billion was wiped off the UK’s top 100 companies it’s easy to lose perspective on the quantum of the issue and how it affects small and medium sized businesses.
Large drops in the FTSE 100 are heavily reported and affect consumer confidence in an already tough environment. Consumers expect that, with less in their pockets to spend, companies should be keener with their pricing, offering goods and services at a lower rate in the face of rising inflation. According to the Business Inflation Guide (BIG), published by insurer MORE TH>N BUSINESS, there has been a sharp rise in costs incurred by small businesses at a rate of 3.6 per cent in the first quarter of 2011. Businesses are being squeezed from both sides.
Business rates are set to rise in Scotland by around 23% in the next handful of years according to new reports out this week. Credit costs are also on the rise along with energy and fuel prices.
I think we’re really stretching things to hope that the Chinese will come to our aid. As far as I know they’ve only shown interest in investing where there are long-term political gains to support the financial benefits. Going down this route only means we’ll be further mortgaging our futures… Something that got us in today’s mess!
Individuals are also obviously affected. In the States it has just been announced that health insurance is to rise 5.4% next year; this side of the pond, individuals are paying more VAT, and even the price of bread has risen by 38% in the last three years!
However, there are some good stories out there. Easyjet have revised upwards their profit forecasts on the back of an increase in business users, indicating a switch in mindset by those that sign off the expenses. This week I tweeted about Forbes magazine’s article professing that Cloud Computing May be a Shot in the Arm our Economy Needs and actually I’m very lucky to work in an industry that, despite the economic slump, is very much in a boom. I suppose it makes perfect sense why that would be the case as well. After all, outsourcing your hosting has been more economic for companies for over a decade and now, with Cloud, it’s even more cost effective, even more reliable and even simpler to manage! And that has to be the theme for success surely? If you want to make a success of yourself and your business in an economic slump, review your costs and cut wisely. They say that every economic downturn breeds entrepreneurs, but I don’t believe it has to be limited to the few in this way. We can all look at how we’ve run things over the last decade and revise our budgets to meet the changes we face as the economy slowly recovers.