Business Continuity and Disaster Recovery Management

Purpose

The purpose of this procedure is to protect the critical business activities of the organization by minimizing the risk of a disruptive incident and by being prepared to manage such an incident in a controlled way. There is a need to ensure that the plans in place are workable and can be implemented in an emergency situation. Beyond learning the lessons of real incidents that have taken place, the only way to ensure this is to carry out an appropriate program of testing. 

Scope

This procedure describes a Business Continuity Management Plan (BCMP) and IT Disaster Recovery (ITDR), establishes a system for emergency preparedness and disaster management, and sets a schedule for such testing activities. It also includes testing in conjunction with the Information Security Continuity Procedure. This procedure serves as the company's guideline for such preparation.

Responsibility

The Chief Technology Officer (CTO) is the Business Head and is responsible for planning and coordinating all the activities related to the Business Continuity Planning documentation and identifying needs for documentation. The CTO shall ensure that documented systems are followed by all employees and necessary records are maintained in our company. The CTO is also responsible for preparation, approval and distribution of the Business Continuity Management Plan.

The VP IT & Operations, CISO, serves as the ITDR Coordinator, is responsible for providing resources and for updating risks and production facilities, and can perform duties as required by the CTO. All persons in the department are responsible for implementing the system documents applicable to them. Other leaders are added as needed. Members include:

Member Title           BC/ITDR Member      Phone Number                       Member Contact Information
CTO, Co-Founder        Vineet Joshi        303-596-5519                       vineet@cloud-elements.com
VP IT & Ops, CISO      Ed Fuller           720-244-5270                       ed@cloud-elements.com
CFO                    Mark Vellequette    303-817-8446                       mark.v@cloud-elements.com
Product Management     Vacant
MD                     Ramana Lashkar      +91-9381538882, +1-214-912-8433    ramana@cloud-elements.com

Description of Activity

Cloud Elements has configured the AWS production environments with high availability for each tier, with servers in each tier distributed across AWS Availability Zones. We have also set up a passive DR environment for the US production environment in the AWS Ohio region, which guarantees the continuity of the delivered services even through unplanned emergency situations. In the case of a disaster in the primary AWS region of Oregon, the targets for failing over to the Ohio DR environment are an RPO of 4 hours and an RTO of 8 hours.
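The RPO and RTO figures above can be checked against the timeline of a real or simulated failover. The following is a minimal sketch in Python, assuming both figures are upper bounds (at most 4 hours of data loss, at most 8 hours of downtime). The function name, parameter names, and timestamps are illustrative and are not taken from the Cloud Elements playbooks.

```python
from datetime import datetime, timedelta

# Targets from this procedure, treated here as maximums (an assumption).
RPO_TARGET = timedelta(hours=4)
RTO_TARGET = timedelta(hours=8)


def check_recovery_targets(last_backup, outage_start, service_restored):
    """Return (data_loss_window, downtime, rpo_met, rto_met) for one failover.

    data_loss_window: time between the last usable backup and the outage.
    downtime: time between the outage and service being restored in DR.
    """
    data_loss_window = outage_start - last_backup
    downtime = service_restored - outage_start
    return (
        data_loss_window,
        downtime,
        data_loss_window <= RPO_TARGET,
        downtime <= RTO_TARGET,
    )


if __name__ == "__main__":
    # Hypothetical timestamps, e.g. from a table talk exercise.
    loss, down, rpo_met, rto_met = check_recovery_targets(
        last_backup=datetime(2024, 1, 10, 6, 0),
        outage_start=datetime(2024, 1, 10, 9, 30),
        service_restored=datetime(2024, 1, 10, 16, 0),
    )
    print(f"data loss window: {loss}, RPO met: {rpo_met}")
    print(f"downtime: {down}, RTO met: {rto_met}")
```

Recording these two durations in each test report makes it easy to see whether the DR environment is still meeting the stated targets over time.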

Based on the AWS infrastructure facilities and the Cloud Elements offices, the following are considered the company's major and minor risks:

Major Risk

  1. AWS region goes down
  2. Earthquake
  3. Client information is stolen or leaked out
  4. Unidentified ransomware

Minor Risk

  1. Communication cable failure 
  2. Power supply failure
  3. Fire

The details of all the above events, including occurrence probability, impact, controls, person responsible, and the time and level required to restore, are given in Annex-1 of this procedure.

Major Risk

  1. Server Failure: Cloud Elements depends on AWS-hosted infrastructure to manage the transfer and transformation of sensitive customer data. A failure of multiple servers would have a major impact on Cloud Elements and its customers.
  2. Database / Instance corruption or failure: The Cloud Elements operations on-call team will use playbooks directly related to recovery and restoration of customer data.
    Playbooks:
    https://github.com/cloud-elements/soba/tree/master/deploy/playbooks/db-disaster-recovery
    https://github.com/cloud-elements/soba/wiki/Playbook
  3. Earthquake: The Cloud Elements server storage location is situated in a high-intensity earthquake zone. This is treated as a driving factor, and appropriate actions are taken.
  4. Client information is stolen and leaked out: Any client information is found to have been stolen.
  5. Key persons have left the organization: Because Cloud Elements is a small company, employee turnover will lead to a loss of knowledge and information.

Minor Risk

  1. Data connection failure: If an internet service provider's cable is broken at the Denver (Industry RiNo), Dallas, or Hyderabad office, or any other major fault disrupts Cloud Elements' communication with the AWS environments, there will be no damaging impact on customers.
  2. Unidentified virus attack: Despite having the latest virus protection software, it is possible that Cloud Elements could be hacked or that a new virus could be loaded onto the servers.
  3. Power supply failure: Power failure to the office space would lead to all employees working from home. This should have minimal impact on the company.
  4. Fire: A company fire would have minimal impact on the company’s ability to function and deliver customer data and support.

Major Risk Preventive Measures Taken to Reduce Occurrence Probability 

This Business Continuity Management Plan covers all situations affecting our infrastructure hosted by AWS. Appropriate steps will be taken to ensure a backup is activated in a timely manner to avoid loss of availability for the customer. As part of the Cloud Elements backup policy, AWS servers are kept available in multiple regions throughout the country, which allows quick recovery if the current server location is damaged. These preventive measures are taken in coordination with AWS, our server provider, to ensure the reliability of our data during an earthquake or long-term server outage.

We have implemented the ISO/IEC 27001 security management system, and the necessary information security controls are followed to prevent leakage of any client information.

HR policies and practices are followed with due care to motivate employees and to keep employee turnover low. In addition, implementing ISO/IEC 27001 makes the organization more system-driven with respect to quality of work and information security.

Minor Risk Preventive Measures Taken to Reduce Occurrence Probability

Industry RiNo, our host tenant, provides the ISP for the Cloud Elements Denver office. The building has both CenturyLink and Comcast as providers; if CenturyLink is down, the service automatically switches to Comcast.

The Dallas office building has fiber from AT&T, LOGIX, Level 3, Spectrum and XO Communications. Cloud Elements is signed up with LOGIX in its Dallas suite. The building also provides three different Wi-Fi networks for use by tenants if LOGIX goes down. If there is a LOGIX outage into the building, the building Wi-Fi will fail over to Spectrum. [ActON and ActoN Back-up - Hyderabad]

The Information Security Management System and procedures are followed to take care of any intentional virus attack on our server or such threats. Also, disaster recovery plans are followed in case of such situations.

Power failures are handled by allowing employees to work from home during catastrophic events in the headquarters office space.

This plan is also implemented in case of fires in the vicinity of the office. Due care is taken to ensure that such a fire does not spread easily to our premises, and proper precautions have been taken at Cloud Elements for this purpose. Fire extinguishers and fire alarm systems are available at the office premises, and people are trained to use them.

Measures Taken to Mitigate Risk

Adequate measures have been taken in terms of the following:

  1. Sprinkler system is installed throughout the office and fire extinguishers have been provided in the office.
  2. The entire area has been declared a no-smoking zone for the staff and outsiders who visit the company.
  3. All electrical connections from the main board of the electricity provider and all lines to various departments have been properly made.
  4. Necessary precautions against short circuits in the system have been taken.
  5. Proper access ways and exits have been made prominent so that persons can easily get out in event of a fire.
  6. Access control is provided to protect those in the building.
  7. The staff and other people have been trained to handle such situations.
  8. A licensed version of antivirus software is used, and the latest definitions are updated periodically on all computers.
  9. All equipment and servers are purchased from branded suppliers.

Responsibilities & Events

In Case of any Major or Minor Fire / Earthquake

  1. Anyone who notices a fire should, at a minimum, shout “FIRE”.
  2. As per the training they have been given, staff should try to douse the fire using the fire extinguishers and other available material.
  3. If the fire cannot be brought under control, they must call 911.
  4. If there is any suspicion of mischief, they are authorized to contact the police.
  5. Once this is done and minimum precautions have been taken, the CTO should be informed immediately.
  6. If any persons are trapped in the fire, an attempt should be made to evacuate them.
  7. Next come the computers and other infrastructure in the vicinity of the fire; if these need to be moved, that should be done as early as possible to prevent damage.
  8. For the information processing facilities, once personnel have been evacuated, the servers and storage media should be recovered as quickly and efficiently as possible.
  9. On reaching the scene of the fire, the CTO should take charge and start the evacuation process at the earliest.
  10. The CTO should keep the CEO informed of all developments at the site and oversee the operations.
  11. The hierarchy during any fire incident or other incident follows the reporting structure of the company's organization chart.

In Case of Main AWS Server Location Damage

  1. If the AWS main server location is damaged, notifications from AWS and the Ops Rotation will go to the VP, IT & Ops, and the CTO is informed immediately.
  2. Immediately following the notification, the VP, IT & Ops/CTO will ensure the AWS backup location has commenced restoring all necessary data.
  3. Any loss of data and any operational or performance-related problems are verified and rectified.
  4. The VP, IT & Ops and CTO will keep the Business Heads informed and oversee the operations.
  5. The hierarchy during any fire incident or other incident follows the reporting structure of the company's organization chart.

Other Events

For all other events, details are given in Annex-1, along with the responsibility for performing each task.

Fallback Actions

After the evacuation process is complete the CTO has the complete authority and responsibility to undertake the startup of the essential business activities. For this:

  1. An emergency meeting of the functional heads takes place.
  2. The destruction is quantified in terms of damage and damage potential.
  3. Steps are discussed, identified and implemented.
  4. If required, temporary locations are decided.
  5. Liaison with the local government authority is decided.
  6. Insurance issues are discussed.
  7. Startup of complete operations is decided with the downtime agreed upon.

At Cloud Elements, a downtime of 24 hours has been deemed critical; in any emergency, operations in critical areas such as the information processing facilities should be restarted within that time frame.

The suggested actions are as follows:

  1. The AWS disaster recovery location is agreed and finalized.
  2. Servers are updated with the most recent backups/AMI.
  3. The data validation is completed once the data has been restored through the backups.
  4. The CTO will direct Commercial Account Managers to inform customers about the loss caused by the downtime and explain the criticality of the situation. The entire operation is completed within the stipulated period.
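The data validation step in the list above can be made repeatable by comparing the restored data against a manifest recorded when the backup was taken. The following Python sketch shows one way to do this, under the assumption that such a manifest of per-table fingerprints exists. It is not the method used in the linked playbooks, and the table name and helper functions are hypothetical.

```python
import hashlib


def fingerprint(rows):
    """Order-independent SHA-256 fingerprint of an iterable of row tuples."""
    digests = sorted(
        hashlib.sha256(repr(row).encode("utf-8")).hexdigest() for row in rows
    )
    return hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()


def validate_restore(backup_manifest, restored_tables):
    """Compare restored tables with the manifest recorded at backup time.

    backup_manifest: dict mapping table name to expected fingerprint.
    restored_tables: dict mapping table name to a list of row tuples.
    Returns a list of human-readable problems; an empty list means a match.
    """
    problems = []
    for name, expected in backup_manifest.items():
        rows = restored_tables.get(name)
        if rows is None:
            problems.append(f"{name}: missing after restore")
        elif fingerprint(rows) != expected:
            problems.append(f"{name}: contents differ from backup manifest")
    return problems


if __name__ == "__main__":
    # Hypothetical table contents used only to demonstrate the check.
    original = {"accounts": [(1, "acme"), (2, "globex")]}
    manifest = {name: fingerprint(rows) for name, rows in original.items()}
    restored = {"accounts": [(2, "globex"), (1, "acme")]}  # same rows, new order
    print(validate_restore(manifest, restored) or "restore validated")
```

Any problems returned by a check of this kind would be recorded in the test or incident report and rectified before customers are told that service has resumed.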

Resumption Procedures

  1. The VP, IT & Ops and CTO are responsible for the resumption of the operations of the AWS backup location.
  2. The VP, IT & Ops or CTO, having a proper understanding of the whole situation in terms of the infrastructure available after a disaster or emergency, will notify customer(s) regarding the incident(s).
  3. The engineering and customer service teams will begin the data management work of the organization after validation of all previous data from the backup sources.
    Playbooks:
    https://github.com/cloud-elements/soba/tree/master/deploy/playbooks/db-disaster-recovery
    https://github.com/cloud-elements/soba/wiki/Playbook
  4. If data supplied by the customer has been lost, we will request it again from the customer to expedite startup of the whole system.
  5. The knowledge process outsourcing employees in turn take charge of, and are responsible for, their operations within the organization.

Insurance

Cloud Elements has a comprehensive insurance policy for all its equipment and office machinery including cyber insurance. 

Testing for Business Continuity / IT Disaster Recovery

Fundamental Principles

There are a number of fundamental principles that will be adhered to in creating and implementing a schedule of testing of business continuity plans in accordance with the ISO/IEC 27001:2013 standard for information security. 

These are: 

  • The tests must be consistent with the scope and objectives of the ISMS 
  • They must be based on appropriate scenarios that are well planned with clearly defined aims and objectives 
  • Taken over time, they must validate the whole of Cloud Elements business continuity arrangements 
  • Disruption to business operations should be minimized 
  • Post exercise reports should be produced 
  • They should be reviewed, and improvements identified 
  • They must be conducted at planned intervals and upon significant change within the organization or its operating environment

The business continuity plan is prepared and implemented as per our standard template. As part of the ISMS, each test that is carried out will be planned in detail on an individual basis, and a post-test report will be produced in accordance with the above principles using the BCP test report.

Table Talk Testing of Business Continuity / IT Disaster Recovery Plans

Testing will be done by the CTO/CISO and will consider any events suggested by the Business Head(s). Necessary records of the testing are maintained. The long-term backup(s) are restored once every quarter, and at a minimum annually, and verified for availability of data as part of our business continuity arrangements.
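The restore schedule above can be tracked with a simple date check. The following Python sketch assumes a quarter is counted as 92 days and the annual limit as 365 days. Both thresholds, the function name, and the status strings are illustrative assumptions rather than part of this procedure.

```python
from datetime import date, timedelta

# Assumed thresholds for the quarterly target and the annual minimum.
QUARTERLY = timedelta(days=92)
ANNUAL = timedelta(days=365)


def restore_test_status(last_restore_test, today):
    """Classify how overdue the long-term backup restore test is."""
    age = today - last_restore_test
    if age > ANNUAL:
        return "overdue: annual minimum missed"
    if age > QUARTERLY:
        return "due: quarterly target missed"
    return "ok"


if __name__ == "__main__":
    # Hypothetical dates; in practice the last test date would come from
    # the most recent test report.
    print(restore_test_status(date(2024, 1, 15), date(2024, 3, 1)))
```

A status other than "ok" would prompt the CTO/CISO to schedule the next restore test and record the result.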

Awareness

Engineers are adequately trained on the Business Continuity / IT Disaster Recovery Plan as well as on the major and minor risks. The awareness and education activities are carried out at the company site as per the established training schedules of the ISMS.

Responsibilities

Because Cloud Elements is a small organization, the CEO delegates responsibilities as required. Every decision concerning the BC/ITDR Plan is made by the CTO with the full knowledge, advice and guidance of the CEO.

The emergency preparedness plans and disaster recovery plans are owned by the CTO and the VP, IT & Ops, CISO, and follow-up actions on them are the responsibility of the functional head(s).

Major goals of the BCP / IT DR

  • To minimize interruptions to the normal operations.
  • To limit the extent of disruption and damage.
  • To minimize the economic impact of the interruption.
  • To establish alternative means of operation in advance.
  • To train personnel in emergency procedures.
  • To provide for smooth and rapid restoration of service.

Maintenance of the Plan

The plan is discussed at the Management review meetings and updated, as required, after each test of the Plan. The plan's personnel names, telephone numbers and other details are updated based on changes in process, activity and persons, as well as on experience gained in the process.

Strategic Third Parties

Amazon Web Services (AWS), IPaaS Representative
Greg Foss
(m) 518.796.2120
gregfoss@amazon.com

References

Policy 4        IT Systems And Network Operations Policy
PM-3            Procedure For Corrective Action
SP-20           Information Security Continuity

Enclosures

Annex-1        Table for events and business continuity management
Annex-2        Business Continuity of Operation Procedure  

Forms

FM-14        Compliance / Security Training and BCM-ITDR Training
FM-15        Business Continuity / IT Disaster Recovery Test Report