EIR-OPS-029: Failsafe Entered

Warning

This procedure should NOT be used for initial AOS. If failsafe is identified as the current boot image during the EIR-OPS-004: Initial AOS procedure, the Operator will be directed to the EIR-OPS-005: Failsafe at Initial AOS procedure, NOT this procedure!

Warning

TC authentication is disabled at boot to reduce the risk of losing communication with the spacecraft, so it is currently disabled. The Operator should consider following the EIR-OPS-009: Enable TC Authentication procedure ASAP to re-enable TC authentication and prevent replay attacks.

Objective

To identify the most probable cause for the OBC unexpectedly booting the failsafe image.

Introduction

To unexpectedly boot the failsafe image when the previous boot image was a primary image, a reboot must occur within the first 2 hours of the primary image being booted (i.e. before the image gets marked as stable).

Using this procedure, the Operator will downlink data from the spacecraft to determine the most likely chain of events that led to the OBC rebooting and booting the failsafe image. This will allow the Operator to properly assess what further analysis/health checks should then be performed.

Procedure

This procedure contains the following sub-procedures:

A. Raising the Failsafe Image Alarm
B. Initial Checks
C. Downlinking Data
D. Data Analysis (After the Communication Pass)
E. Returning to Primary

Note

Communication with the spacecraft is required for Sections B, C and E.

Important

Following the Initial Checks, Section C (Downlinking Data) should be referred to for each communication pass. Data downlink should be the priority of the Operator during each pass while trying to establish the cause of the rebooting/booting failsafe. The analysis detailed in Section D of this procedure should only be performed in parallel by other members of the team or outside of the communication pass.

A. Raising the Failsafe Image Alarm

A.1.

Ensure that the Senior Operations and/or Systems Teams have been alerted that the spacecraft has unexpectedly booted the failsafe image.

B. Initial Checks

Important

You are about to send the first TC of this procedure. Have you completed the ‘Start a Communication Pass’ procedure? A Communication Pass must be started prior to carrying out the operations planned for the pass, even when operating in the failsafe image!

B.1.

If the failsafe image has previously been booted and its Separation Sequence finished, proceed immediately to the next step.
Else, if this is the first time the failsafe image has been booted on-orbit:
- The image’s Separation Sequence will now be cycling through resistor burn and between-burn-wait states.
- Therefore, ONLY IF full antenna deployment has previously been confirmed as part of EIR-OPS-004: Initial AOS, Invoke the mission.SeparationSequence.SeparationSequenceFinish action now.

TC Details
MCS Operation	`Invoke`
Action/Param Name	`mission.SeparationSequence.SeparationSequenceFinish`
Data Expected with TC	No
TM Details
Data Expected from TC	No ( + ACK )

B.2.

Get the mission.SeparationSequence.state parameter.
Ensure that the returned state is 0x42 (hex) / 66 (dec).

TC Details
MCS Operation	`Get`
Action/Param Name	`mission.SeparationSequence.state`
Data Expected with TC	No
TM Details
Data Expected from TC	`state` ( + ACK )
Data Size	1 byte
Data Info	the current state of the Separation Sequence
Allowed Value(s)	00 - 09 or 42 (hex)
Expected Value(s)	42 (hex) / 66 (dec)
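
As a minimal sketch of the B.2 check, the returned byte can be validated against the allowed and expected values in the table above. The function name and the way the byte is obtained from the MCS are assumptions made purely for illustration; only the numeric values come from this procedure.

```python
# Hypothetical helper, not part of the MCS. Values are taken from the B.2 table.
FINISHED_STATE = 0x42  # 66 (dec)
ALLOWED_STATES = set(range(0x00, 0x0A)) | {FINISHED_STATE}


def check_separation_state(state: int) -> str:
    """Classify a 1-byte mission.SeparationSequence.state value."""
    if state not in ALLOWED_STATES:
        return "invalid"
    if state == FINISHED_STATE:
        return "finished"
    # 0x00 - 0x09: the Separation Sequence is still running.
    return "in progress"
```

Only a "finished" result satisfies this step; any other result means the Separation Sequence has not exited as expected.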

B.3.

On exit of the Separation Sequence, all PDMs should be powered OFF. To confirm this, Get the platform.EPS.actualSwitchStates parameter with First row = 0 and Last row = 9.
Ensure that all 0s (excluding row 7/PDM 8) are returned.

Caution

The FSS draws parasitic power on row 7/PDM 8, so row 7/PDM 8 of EPS.actualSwitchStates will always be returned as 1 (ON), even if row 7/PDM 8 of EPS.expectedSwitchStates is set to 0 (OFF).

TC Details
MCS Operation	`Get`
Action/Param Name	`platform.EPS.actualSwitchStates`
Data Expected with TC	Yes
Data Size	2 bytes, 2 bytes
Data Info	`First row`, `Last row`
Allowed Value(s)	0-9, 0-9
Expected Value(s)	0, 9
TM Details
Data Expected from TC	List of switch states ( + ACK )
Data Size	List[0:10] of Booleans
Data Info	If all 0, all PDMs are off
Allowed Value(s)	0000000000 (all PDMs OFF) - 1111111111 (all PDMs ON)
Expected Value(s)	0000000100 (all PDMs OFF, except for the FSS PDM/PDM 8)
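
The B.3 comparison can be sketched as below. This is an illustrative helper only: it assumes the returned list is ordered from row 0 to row 9 (consistent with the expected value 0000000100), and the function name is hypothetical.

```python
# Hypothetical helper, not part of the MCS.
FSS_ROW = 7  # row 7 / PDM 8, always reported ON due to the FSS parasitic draw


def unexpected_pdms_on(switch_states):
    """Return the rows (0-based) reported ON, ignoring the FSS row."""
    if len(switch_states) != 10:
        raise ValueError("expected 10 switch states (rows 0-9)")
    return [row for row, state in enumerate(switch_states)
            if state and row != FSS_ROW]
```

An empty list means every PDM other than PDM 8 is OFF, which is the expected result after the Separation Sequence has exited.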

B.4.

Get the parameter platform.ADCS.adcsModeState to determine the current ADCS mode and state.
Ensure 0x00000000 (i.e. Standby Mode/Nadir State) is returned.

TC Details
MCS Operation	`Get`
Action/Param Name	`platform.ADCS.adcsModeState`
Data Expected with TC	No
TM Details
Data Expected from TC	`adcsModeState` ( + ACK )
Data Size	4 bytes
Data Info	The current mode (2 most significant bytes) and state (2 least significant bytes) of the ADCS
Allowed Value(s)	See tables below
Expected Value(s)	00000000

Where…

`adcsMode` (hex)	ADCS Mode
0000	Standby (Default)
0001	Detumble
0002	Spin Stabilised
5550	Test

`adcsState` (hex)	ADCS State
0000	Nadir (Default)
AAA8	Test
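
To illustrate how the 4-byte value splits into the two tables above, the sketch below decodes it into a mode and a state. It assumes the value has already been read as an unsigned 32-bit integer with the mode in the upper 16 bits; the function and dictionary names are hypothetical.

```python
# Hypothetical decoder, not part of the MCS. Lookup values are from the tables above.
ADCS_MODES = {0x0000: "Standby", 0x0001: "Detumble",
              0x0002: "Spin Stabilised", 0x5550: "Test"}
ADCS_STATES = {0x0000: "Nadir", 0xAAA8: "Test"}


def decode_adcs_mode_state(value: int):
    """Split adcsModeState into (mode name, state name)."""
    mode = (value >> 16) & 0xFFFF   # 2 most significant bytes
    state = value & 0xFFFF          # 2 least significant bytes
    return ADCS_MODES.get(mode, "unknown"), ADCS_STATES.get(state, "unknown")
```

For the expected value 00000000, this returns Standby and Nadir. An "unknown" result indicates a value outside the tables and should be treated as unexpected.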

C. Downlinking Data

C.1.

For the remainder of the pass, downlink data from on-board storage according to EIR-OPS-011: Downlink Data From Storage.
For the first passes following the unexpected boot of failsafe, it is recommended that the Operator downlinks data according to the priorities listed in the table below.
However, when assigning downlink time/priorities, some NEW rows of Event and HK data should also always be included to allow assessment of the most current state of the spacecraft.

Note

In this table the Operator is advised to downlink ‘some’ rows of a particular data type and whether the OLDest or NEWest rows of data in storage should be given preference. This is done with the assumption that the time constraint of the communication pass will not allow the Operator to get all the desired data downlinked in a single pass.

Important

Only MRAM channels can be accessed while in failsafe. MRAM storage has been configured to have 1) channels for logging data generated while operating in failsafe and 2) channels containing a buffer of the most recently logged data by a primary image. Refer to ROW to determine what channels exist in MRAM.

Warning

absRowsLogged is maintained by the logger components, so no absRowsLogged data will be available for the primary image MRAM channels while operating in failsafe. To request data from these primary image MRAM channels, the Operator should instead use the channels’ numRows when calculating First row and Last row for the downlink.

Priority	What?	Why?
Highest	Some NEW rows of FAILSAFE `HK`	to determine the current state of the spacecraft
	Some OLD rows of FAILSAFE `Event`	may provide useful insight into the nature of the reboot(s) that led to failsafe
	Some NEW rows of PRIMARY `HK`	to determine the state of the spacecraft when last in primary1
Lowest	Some NEW rows of PRIMARY `Event`	may provide useful insight into why reboot(s) occurred while in primary1
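
As a rough sketch of how First row and Last row might be derived from a channel's numRows when requesting 'some' OLD or NEW rows, consider the helper below. It rests on a loud assumption that is not stated in this procedure: rows are indexed from 0 (oldest) to numRows - 1 (newest). The actual indexing convention should be confirmed against EIR-OPS-011 before use, and the function name is hypothetical.

```python
# Hypothetical helper. ASSUMPTION: row 0 is the oldest row and
# row num_rows - 1 is the newest; confirm against EIR-OPS-011.
def row_window(num_rows: int, count: int, newest: bool):
    """Return (first_row, last_row) for up to `count` rows, or None if empty."""
    if num_rows <= 0 or count <= 0:
        return None
    count = min(count, num_rows)  # never request more rows than exist
    if newest:
        return num_rows - count, num_rows - 1
    return 0, count - 1
```

Under that assumption, requesting the 10 newest rows of a 100-row channel gives rows 90 to 99, and the 10 oldest gives rows 0 to 9.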

C.2.

If the communication pass is over, proceed to Section D. For later passes, and while the reason for failsafe is still being assessed, the Operator should return to this section and continue to downlink data from the above table, as well as any additional data desired as a result of the analysis carried out in Section D.

D. Data Analysis (After the Communication Pass)

Note

The analysis to be carried out by the team is very dependent on the findings as well as what data was successfully downlinked in Section C. Therefore, rather than a strict set of instructions, this section instead provides information to help guide the Operator in their analyses. Also note that in addition to any data downlinked by the UCD GS, data obtained via the amateur radio community may also be used to support the analysis/findings.

SPACECRAFT HEALTH CHECK

D.1.

Any ‘NEW rows of FAILSAFE HK data’ downlinked should now be checked to assess the current state of the spacecraft and its subsystems. Other than the fact that failsafe is the current boot image, do any of the other HK parameters give cause for concern? e.g.:
- Are the battery bus voltage levels nominal?
- Are the various EPS and/or battery reset counters as expected given their pre-launch values?
- Has the temperature of the CMC Power Amplifier stayed within expected/acceptable limits since RF transmissions were enabled?

Tip

This information should be used to assist with the ‘FAILSAFE BOOTED ANALYSIS’ below.

Tip

In addition to the most recent value of each parameter, check how the values changed with time. Use Grafana to help with this.

D.2.

The Operator should also assess whether the failsafe image has been stable since booted. To do this:
- If the full failsafe Event log has been downlinked, search it for occurrences of the Separation Sequence ‘StateFunctionComplete’ event with event data = 0x00 (i.e. the Separation Sequence Init State) or 0x42 (i.e. the Finished State). If failsafe has been stable since booted, at most one of each event (i.e. with data = 0x00 or 0x42) should be observed.
- If the full Event log has NOT YET been downlinked but some ‘NEW rows of FAILSAFE HK data’ and some ‘NEW rows of PRIMARY HK data’ were retrieved:
  - Use the most recent On-Board Time (OBT) and uptime parameter values in the ‘NEW rows of FAILSAFE HK data’ to determine the OBT of the last reboot.
  - If this OBT is roughly consistent with the last OBT parameter value in the ‘NEW rows of PRIMARY HK data’, then failsafe has likely been stable since booted.
If multiple reboots have occurred since failsafe was booted, the Operator should investigate this in parallel to the below analysis, which is more focused on the nature of the reboots that led to failsafe as opposed to reboots while operating in failsafe. However, the same analysis largely applies and should be considered prior to proceeding to Section E.
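
The HK-based stability check in D.2 reduces to a subtraction and a comparison, sketched below. The sketch assumes OBT and uptime are both expressed in seconds and that OBT is continuous across the reboot. The tolerance is a hypothetical placeholder for "roughly consistent" and should be chosen by the team, for example no smaller than the HK logging interval.

```python
# Hypothetical helpers. ASSUMPTION: OBT and uptime are in seconds.
def last_boot_obt(obt_s: float, uptime_s: float) -> float:
    """OBT at which the currently running image was booted."""
    return obt_s - uptime_s


def failsafe_likely_stable(failsafe_obt_s, failsafe_uptime_s,
                           last_primary_obt_s, tolerance_s=300.0):
    """True if the last boot roughly coincides with the last primary HK row."""
    boot_obt = last_boot_obt(failsafe_obt_s, failsafe_uptime_s)
    return abs(boot_obt - last_primary_obt_s) <= tolerance_s
```

A False result suggests that failsafe itself has rebooted at least once since it was first entered.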

FAILSAFE BOOTED ANALYSIS

D.3.

The Operator should first assess the time-line of the reboot(s) that led to failsafe. To do this, take note of the most recent core.OBT.uptime in the ‘NEW rows of PRIMARY HK data’, and consider the following possibilities:
- If this core.OBT.uptime is >2 hours, failsafe was booted as a result of:
  - A reboot + a failed attempt to boot back into the previously operating primary image, or
  - A reboot where the primary image was not marked as stable even though >2 hours of operating in the image had passed.
  Both scenarios require an assessment of the initial reboot. Additionally, however, both scenarios also require some anomalous/unexpected (software?) behaviour. Therefore, if either scenario has occurred, the Software Engineer should be contacted for support.
- If this core.OBT.uptime is <2 hours AND >2 hours had elapsed since on-orbit deployment, failsafe was booted as a result of more than one reboot sometime after launch, where the first reboot booted back into the primary image.
  
  In this case, ‘NEW rows of PRIMARY HK data’ and ‘NEW rows of PRIMARY Event data’ should be searched for further evidence of the first reboot into the primary image (e.g. did uptime reset?, are there multiple occurrences of the Separation Sequence StateFunctionComplete event with event data = 0x00?).
- If this core.OBT.uptime is <2 hours AND <2 hours had elapsed since on-orbit deployment, failsafe was booted as a result of a single reboot sometime after launch.
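
The three D.3 cases can be written as a small decision function, shown here as an illustrative sketch only. It assumes both inputs are in seconds and that the time since on-orbit deployment is evaluated at the time of the last primary HK row. The names and return labels are hypothetical, and values of exactly 2 hours fall through to the later branches because the procedure only states strict inequalities.

```python
# Hypothetical classifier for the D.3 time-line cases.
STABLE_AFTER_S = 2 * 60 * 60  # the primary image is marked stable after 2 hours

ANOMALOUS = "anomalous"            # uptime > 2 h: contact the Software Engineer
MULTIPLE_REBOOTS = "multiple"      # more than one reboot after launch
SINGLE_REBOOT = "single"           # a single reboot after launch


def classify_timeline(primary_uptime_s: float, since_deployment_s: float) -> str:
    """Map the last primary uptime and time since deployment to a D.3 case."""
    if primary_uptime_s > STABLE_AFTER_S:
        return ANOMALOUS
    if since_deployment_s > STABLE_AFTER_S:
        return MULTIPLE_REBOOTS
    return SINGLE_REBOOT
```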

D.4.

Although it is highly unlikely that the Operator would not know about commands sent to the spacecraft that would lead to reboots/failsafe, the possibility that the above has resulted from GS commands should still be ruled out.
Therefore, using the MCS/GS logs, verify that the following TCs were not sent to the spacecraft since the image was last as expected (i.e. not the failsafe image):
- Set : platform.obc.OBC.imageIsStable
- Set : platform.obc.OBC.imagePriority
- Invoke : platform.obc.OBC.reset
- Invoke : platform.EPS.cycleBus
If any of these commands were sent to the spacecraft since the image was last as expected, check if the timing of the commands is consistent with the timing of reboots that led to failsafe.
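
If the MCS/GS logs can be exported, the D.4 search could be automated along the lines below. The log format is an assumption: each entry is taken to be a (timestamp, operation, name) tuple, which may not match the real MCS/GS log layout. Only the four TC names come from this procedure.

```python
from datetime import datetime

# TCs listed in D.4 that can lead to reboots or to failsafe being booted.
REBOOT_RELATED_TCS = {
    ("Set", "platform.obc.OBC.imageIsStable"),
    ("Set", "platform.obc.OBC.imagePriority"),
    ("Invoke", "platform.obc.OBC.reset"),
    ("Invoke", "platform.EPS.cycleBus"),
}


def find_reboot_related_tcs(log_entries, since: datetime):
    """Return entries at or after `since` that match a D.4 TC.

    ASSUMPTION: each entry is a (timestamp, operation, name) tuple.
    """
    return [entry for entry in log_entries
            if entry[0] >= since and (entry[1], entry[2]) in REBOOT_RELATED_TCS]
```

Here `since` would be the time at which the image was last known to be as expected. Any entries returned should then be compared against the timing of the reboots established in D.3.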

D.5.

Assuming the reboots were not commanded, the Operator should now further assess the nature of any reboots by searching the Event logs (i.e. the ‘OLD rows of FAILSAFE Event data’ and ‘NEW rows of PRIMARY Event data’) for ‘EPSInitialised’ events around the times of the reboots.
If this event is observed, a full spacecraft power-cycle led to the reboot.
Else, an OBC reset occurred.

D.6.

If a full spacecraft power-cycle occurred, the Operator should now assess the ‘NEW rows of PRIMARY HK data’ and ‘NEW rows of PRIMARY Event data’ to determine if there is evidence that low battery conditions caused the reboot(s). In particular, the Operator should:
- Search the HK data for a decrease in the battery bus voltage to ~6.144 V, and
- Search the Event log for the ‘LowVoltageExceptionBATSafe’ event.
If evidence that low battery conditions caused the reboot(s) is found, the Operator should now consider using the EIR-OPS-026: Low Battery Fault Analysis procedure to assist further analysis.
If a full spacecraft power-cycle did not occur OR if a power-cycle did occur but there is no evidence of low battery issues, the Operator should now consider using the EIR-OPS-027: Reboot Fault Analysis procedure to assist further analysis.
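
The decisions in D.5 and D.6 can be summarised in a short sketch that suggests which follow-on procedure to consider. Several points are assumptions: the caller has already restricted the events to those around the times of the reboots, either indicator (the event or the voltage drop) counts as evidence of low battery conditions, and the voltage margin is a hypothetical stand-in for "~6.144 V". The function name is also hypothetical.

```python
# Hypothetical triage helper for D.5/D.6.
def next_procedure(event_names, bus_voltages_v,
                   low_voltage_v=6.144, margin_v=0.05):
    """Suggest the follow-on procedure from event names and bus voltages."""
    # D.5: an EPSInitialised event indicates a full spacecraft power-cycle;
    # otherwise an OBC reset occurred.
    if "EPSInitialised" not in event_names:
        return "EIR-OPS-027"
    # D.6: look for evidence of low battery conditions.
    low_event = "LowVoltageExceptionBATSafe" in event_names
    low_voltage = any(v <= low_voltage_v + margin_v for v in bus_voltages_v)
    if low_event or low_voltage:
        return "EIR-OPS-026"
    return "EIR-OPS-027"
```

The result is only a pointer to EIR-OPS-026: Low Battery Fault Analysis or EIR-OPS-027: Reboot Fault Analysis; the team's judgement of the downlinked data takes precedence.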

D.7.

When the team have completed their analysis and wish to leave the failsafe image, Section E should be carried out.

E. Returning to Primary

Warning

This section of the procedure should ONLY be carried out following the close-out of Section D and ONLY IF the decision has been made to proceed with booting back into a primary mission image.

E.1.

If it is decided that a new image must be uplinked to the spacecraft prior to booting a primary image, the Operator should first follow EIR-OPS-023: Upload OBC Image before proceeding.
Else, proceed immediately to the next step.

E.2.

The Operator should now follow the EIR-OPS-024: Boot Into OBC Image procedure to boot the primary image of choice (i.e. primary1 or primary2).
If the primary image is successfully booted and is stable (i.e. no reboots to failsafe), the Operator should assess:
- Whether any new image commissioning is required, and/or
- Whether nominal operations (i.e. Nominal Mode with the experiment running and data logging on-going) can be initiated (see EIR-OPS-012: Set Up Nominal Operations for details).

END OF PROCEDURE