EIR-OPS-029: Failsafe Entered
Warning
This procedure should NOT be used for initial AOS. If failsafe is identified as the current boot image during the InitialAOS procedure, the Operator will be directed to the FailsafeInitialAOS procedure, NOT this procedure!
Warning
TC authentication is disabled at boot to reduce the risk of losing communication with the spacecraft, so it is currently disabled following this unexpected boot. The Operator should consider following the EnableAuthentication procedure ASAP to re-enable TC authentication and prevent replay attacks.
Objective
To identify the most probable cause for the OBC unexpectedly booting the failsafe image.
Introduction
To unexpectedly boot the failsafe image when the previous boot image was a primary image, a reboot must occur within the first 2 hours of the primary image being booted (i.e. before the image gets marked as stable).
Using this procedure, the Operator will downlink data from the spacecraft to determine the most likely chain of events that led to the OBC rebooting and booting the failsafe image. This will allow the Operator to properly assess what further analysis/health checks should then be performed.
Procedure
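This procedure contains the following sub-procedures: A. Raising the Failsafe Image Alarm; B. Initial Checks; C. Downlinking Data; D. Data Analysis (After the Communication Pass); E. Returning to Primary.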
Note
Communication with the spacecraft is required for Sections B, C and E.
Important
Following the Initial Checks, Section C (Downlinking Data) should be referred to for each communication pass. Data downlink should be the priority of the Operator during each pass while trying to establish the cause of the rebooting/booting failsafe. The analysis detailed in Section D of this procedure should only be performed in parallel by other members of the team or outside of the communication pass.
A. Raising the Failsafe Image Alarm
A.1.
Ensure that the Senior Operations and/or Systems Teams have been alerted that the spacecraft has unexpectedly booted the failsafe image.
B. Initial Checks
Important
You are about to send the first TC of this procedure - Have you completed the ‘Start a Communication Pass’ procedure? A Communication Pass must be started prior to carrying out the operations planned for the pass, even when operating in the failsafe image!
B.1.
If the failsafe image has previously been booted and its Separation Sequence finished, proceed immediately to the next step.
Else, if this is the first time the failsafe image has been booted on-orbit:
The image’s Separation Sequence will now be cycling through resistor burn and between-burn-wait states.
Therefore, ONLY IF full antenna deployment has previously been confirmed as part of InitialAOS,
Invoke the mission.SeparationSequence.SeparationSequenceFinish action now.
| TC Details | |
|---|---|
| MCS Operation | |
| Action/Param Name | |
| Data Expected with TC | No |
| TM Details | |
| Data Expected from TC | No ( + ACK ) |
B.2.
Get the mission.SeparationSequence.state parameter. Ensure that the returned state is 0x42 (hex) / 66 (dec).
| TC Details | |
|---|---|
| MCS Operation | |
| Action/Param Name | |
| Data Expected with TC | No |
| TM Details | |
| Data Expected from TC | |
| Data Size | 1 byte |
| Data Info | the current state of the Separation Sequence |
| Allowed Value(s) | 00 - 09 or 42 (hex) |
| Expected Value(s) | 42 (hex) / 66 (dec) |
B.3.
On exit of the Separation Sequence, all PDMs should be powered OFF. To confirm this,
Get the platform.EPS.actualSwitchStates parameter with First row = 0 and Last row = 9. Ensure that all 0s (excluding row 7/PDM 8) are returned.
Caution
The FSS draws parasitic power on row 7/PDM 8 of EPS.actualSwitchStates, so this row will always be returned as 1 (ON), even if row 7/PDM 8 of EPS.expectedSwitchStates is set to 0 (OFF).
| TC Details | |
|---|---|
| MCS Operation | |
| Action/Param Name | |
| Data Expected with TC | Yes |
| Data Size | 2 bytes, 2 bytes |
| Data Info | |
| Allowed Value(s) | 0-9, 0-9 |
| Expected Value(s) | 0, 9 |
| TM Details | |
| Data Expected from TC | List of switch states ( + ACK ) |
| Data Size | List[0:10] of Booleans |
| Data Info | If all 0, all PDMs are off |
| Allowed Value(s) | 0000000000 (all PDMs OFF) - 1111111111 (all PDMs ON) |
| Expected Value(s) | 0000000100 (all PDMs OFF, except for the FSS PDM/PDM 8) |
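The check in B.3 can be sketched in a few lines of Python. This is only an illustration: the function name `check_pdms_off` and the representation of the response as a list of ten 0/1 values are assumptions, not part of the MCS.

```python
FSS_ROW = 7  # row 7 / PDM 8 always reads 1 (see the Caution in B.3)


def check_pdms_off(switch_states, ignore_rows=(FSS_ROW,)):
    """Return (ok, offending_rows) for the 10 values of EPS.actualSwitchStates.

    Hypothetical helper: assumes the response has already been parsed into
    a list of ten 0/1 values, indexed by row (0 to 9).
    """
    if len(switch_states) != 10:
        raise ValueError("expected 10 switch states (rows 0 to 9)")
    offending = [
        row
        for row, state in enumerate(switch_states)
        if state and row not in ignore_rows
    ]
    return (not offending, offending)


# Expected response from B.3: everything OFF except the FSS row.
print(check_pdms_off([0, 0, 0, 0, 0, 0, 0, 1, 0, 0]))
# A response with row 2 unexpectedly ON.
print(check_pdms_off([0, 0, 1, 0, 0, 0, 0, 1, 0, 0]))
```

Returning the offending rows, rather than a bare pass/fail, makes it easier to report which PDM needs attention.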
B.4.
Get the parameter platform.ADCS.adcsModeState to determine the current ADCS mode and state. Ensure 0x00000000 (i.e. Standby Mode/Nadir State) is returned.
| TC Details | |
|---|---|
| MCS Operation | |
| Action/Param Name | |
| Data Expected with TC | No |
| TM Details | |
| Data Expected from TC | |
| Data Size | 4 bytes |
| Data Info | The current mode (2 MSB) and state (2 LSB) of the ADCS |
| Allowed Value(s) | See tables below |
| Expected Value(s) | 00000000 |
Where…

| | ADCS Mode |
|---|---|
| 0000 | Standby (Default) |
| 0001 | Detumble |
| 0002 | Spin Stabilised |
| 5550 | Test |

| | ADCS State |
|---|---|
| 0000 | Nadir (Default) |
| AAA8 | Test |
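As an illustration of how the 4-byte adcsModeState value splits into mode and state, the following Python sketch decodes it using the two tables above. It assumes that "2 MSB" and "2 LSB" refer to the two most and least significant bytes and that the value arrives in big-endian order; the function name `decode_adcs_mode_state` is hypothetical.

```python
# Lookup tables copied from the ADCS Mode and ADCS State tables in B.4.
ADCS_MODES = {
    0x0000: "Standby (Default)",
    0x0001: "Detumble",
    0x0002: "Spin Stabilised",
    0x5550: "Test",
}
ADCS_STATES = {
    0x0000: "Nadir (Default)",
    0xAAA8: "Test",
}


def decode_adcs_mode_state(raw):
    """Split 4 raw bytes into (mode_name, state_name).

    Assumption: big-endian, mode in the first two bytes, state in the last two.
    """
    if len(raw) != 4:
        raise ValueError("adcsModeState is expected to be 4 bytes")
    mode = int.from_bytes(raw[:2], "big")
    state = int.from_bytes(raw[2:], "big")
    mode_name = ADCS_MODES.get(mode, "Unknown (0x%04X)" % mode)
    state_name = ADCS_STATES.get(state, "Unknown (0x%04X)" % state)
    return mode_name, state_name


# The value expected in B.4.
print(decode_adcs_mode_state(bytes.fromhex("00000000")))
```

Values that do not appear in either table are reported as unknown rather than raising, so an unexpected response can still be logged in full.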
C. Downlinking Data
C.1.
For the remainder of the pass, downlink data from on-board storage according to EIR-OPS-011: Downlink Data From Storage.
For the first passes following the unexpected boot of failsafe, it is recommended that the Operator downlinks data according to the priorities listed in the table below.
However, when assigning downlink time/priorities, some NEW rows of Event and HK data should also always be included to allow assessment of the most current state of the spacecraft.
Note
In this table the Operator is advised to downlink ‘some’ rows of a particular data type and whether the OLDest or NEWest rows of data in storage should be given preference. This is done with the assumption that the time constraint of the communication pass will not allow the Operator to get all the desired data downlinked in a single pass.
Important
Only MRAM channels can be accessed while in failsafe. MRAM storage has been configured to have 1) channels for logging data generated while operating in failsafe and 2) channels containing a buffer of the most recently logged data by a primary image. Refer to ROW to determine what channels exist in MRAM.
Warning
absRowsLogged is maintained by the logger components, so no absRowsLogged data will be available for the primary image MRAM channels while operating in failsafe. To request data from these primary image MRAM channels, the Operator should instead use the channels’ numRows when calculating First row and Last row for the downlink.
| Priority | What? | Why? |
|---|---|---|
| Highest | Some NEW rows of FAILSAFE | to determine the current state of the spacecraft |
| | Some OLD rows of FAILSAFE | may provide useful insight into the nature of the reboot(s) that led to failsafe |
| | Some NEW rows of PRIMARY | to determine the state of the spacecraft when last in primary1 |
| Lowest | Some NEW rows of PRIMARY | may provide useful insight into why reboot(s) occurred while in primary1 |
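The Warning above asks the Operator to derive First row and Last row from a channel's numRows. A minimal Python sketch of that arithmetic follows. It rests on assumptions that should be checked against EIR-OPS-011 before use: rows are taken to be indexed from 0, Last row is taken to be inclusive, and row 0 is taken to be the oldest row in the channel. The function name `row_range` is hypothetical.

```python
def row_range(num_rows, count, newest=True):
    """Return (first_row, last_row) for a downlink of `count` rows.

    Assumptions (not confirmed by this procedure): zero-based row indices,
    inclusive Last row, and row 0 being the oldest row in the channel.
    """
    if num_rows <= 0 or count <= 0:
        raise ValueError("num_rows and count must be positive")
    count = min(count, num_rows)  # cannot request more rows than exist
    if newest:
        return num_rows - count, num_rows - 1
    return 0, count - 1


# Newest 10 rows of a channel reporting numRows = 100.
print(row_range(100, 10))
# Oldest 10 rows of the same channel.
print(row_range(100, 10, newest=False))
```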
C.2.
When a communication pass is over, proceed to Section D. During later passes, while the reason for failsafe is still being assessed, the Operator should return to this section and continue to downlink data from the above table, as well as any additional data requested as a result of the analysis carried out in Section D.
D. Data Analysis (After the Communication Pass)
Note
The analysis to be carried out by the team is very dependent on the findings as well as what data was successfully downlinked in Section C. Therefore, rather than a strict set of instructions, this section instead provides information to help guide the Operator in their analyses. Also note that in addition to any data downlinked by the UCD GS, data obtained via the amateur radio community may also be used to support the analysis/findings.
SPACECRAFT HEALTH CHECK
D.1.
Any ‘NEW rows of FAILSAFE HK data’ downlinked should now be checked to assess the current state of the spacecraft and its subsystems. Other than the fact that failsafe is the current boot image, do any other HK parameters give cause for concern? For example:
- Are the battery bus voltage levels nominal?
- Are the various EPS and/or battery reset counters as expected given their pre-launch values?
- Has the temperature of the CMC Power Amplifier stayed within expected/acceptable limits since RF transmissions were enabled?
Tip
This information should be used to assist with the ‘FAILSAFE BOOTED ANALYSIS’ below.
Tip
In addition to the most recent value of each parameter, check how the values changed with time. Use Grafana to help with this.
D.2.
The Operator should also assess whether the failsafe image has been stable since booted. To do this:
- If the full failsafe Event log has been downlinked, search it for occurrences of the Separation Sequence ‘StateFunctionComplete’ event with event data = 0x00 (i.e. the Separation Sequence Init State) or 0x42 (i.e. the Finished State). If failsafe has been stable since booted, at most one of each event (i.e. one with data = 0x00 and one with data = 0x42) should be observed.
- If the full Event log has NOT YET been downlinked but some ‘NEW rows of FAILSAFE HK data’ and some ‘NEW rows of PRIMARY HK data’ were retrieved:
  - Use the most recent On-Board Time (OBT) and uptime parameter values in the ‘NEW rows of FAILSAFE HK data’ to determine the OBT of the last reboot.
  - If this OBT is roughly consistent with the last OBT parameter value in the ‘NEW rows of PRIMARY HK data’, then failsafe has likely been stable since booted.
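Both stability checks in D.2 can be sketched as follows. This is an illustration under several assumptions: OBT and uptime are taken to be in seconds, OBT is taken to keep counting across reboots, events are taken to be already filtered to the Separation Sequence and represented as `(name, data)` tuples, and the 600 s tolerance for "roughly consistent" is a placeholder, not a value from this procedure. All function names are hypothetical.

```python
from collections import Counter


def last_boot_obt(obt_s, uptime_s):
    """OBT at which the current image booted (OBT minus uptime)."""
    return obt_s - uptime_s


def stable_from_hk(failsafe_obt_s, failsafe_uptime_s, last_primary_obt_s,
                   tolerance_s=600):
    """True if the failsafe boot time roughly matches the last primary OBT.

    tolerance_s is a placeholder; choose it from the HK logging period.
    """
    boot = last_boot_obt(failsafe_obt_s, failsafe_uptime_s)
    return abs(boot - last_primary_obt_s) <= tolerance_s


def stable_from_events(separation_sequence_events):
    """True if 0x00 and 0x42 each appear at most once in StateFunctionComplete."""
    counts = Counter(
        data
        for name, data in separation_sequence_events
        if name == "StateFunctionComplete" and data in (0x00, 0x42)
    )
    return all(counts[value] <= 1 for value in (0x00, 0x42))


# Failsafe booted at OBT 7000 s; primary last logged at OBT 6900 s.
print(stable_from_hk(10_000, 3_000, 6_900))
# Two Init State events suggest a reboot while in failsafe.
print(stable_from_events([("StateFunctionComplete", 0x00),
                          ("StateFunctionComplete", 0x00)]))
```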
If multiple reboots have occurred since failsafe has been booted, the Operator should investigate this in parallel to the below analysis, which is more focused on the nature of the reboots that led to failsafe as opposed to reboots while operating in failsafe. However, the same analysis largely applies and should be considered prior to proceeding to Section D.
FAILSAFE BOOTED ANALYSIS
D.3.
The Operator should first assess the timeline of the reboot(s) that led to failsafe. To do this, take note of the most recent core.OBT.uptime in the ‘NEW rows of PRIMARY HK data’, and consider the following possibilities:
- If this core.OBT.uptime is >2 hours, failsafe was booted as a result of:
  - A reboot + a failed attempt to boot back into the previously operating primary image, or
  - A reboot where the primary image was not marked as stable even though >2 hours of operating in the image had passed.

  Both scenarios require an assessment of the initial reboot. Additionally, however, both scenarios also require some anomalous/unexpected (software?) behaviour. Therefore, if either scenario has occurred, the Software Engineer should be contacted for support.
- If this core.OBT.uptime is <2 hours AND >2 hours had elapsed since on-orbit deployment, failsafe was booted as a result of more than one reboot sometime after launch, where the first reboot booted back into the primary image. In this case, ‘NEW rows of PRIMARY HK data’ and ‘NEW rows of PRIMARY Event data’ should be searched for further evidence of the first reboot into the primary image (e.g. did uptime reset? Are there multiple occurrences of the Separation Sequence StateFunctionComplete event with event data = 0x00?).
- If this core.OBT.uptime is <2 hours AND <2 hours had elapsed since on-orbit deployment, failsafe was booted as a result of a single reboot sometime after launch.
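The three cases in D.3 amount to a small decision tree, sketched below in Python. The function name `classify_timeline` and the use of seconds are assumptions. The procedure does not say how to treat a value of exactly 2 hours, so this sketch arbitrarily groups it with the "<2 hours" cases.

```python
TWO_HOURS_S = 2 * 60 * 60  # stability threshold from the Introduction


def classify_timeline(primary_uptime_s, since_deployment_s):
    """Map the last primary uptime and time since deployment to a D.3 case.

    Assumptions: both inputs are in seconds; exactly 2 hours is treated
    as "not greater than 2 hours" (the procedure leaves this undefined).
    """
    if primary_uptime_s > TWO_HOURS_S:
        return ("reboot plus failed primary boot, or image never marked "
                "stable: contact the Software Engineer")
    if since_deployment_s > TWO_HOURS_S:
        return "more than one reboot, the first back into the primary image"
    return "single reboot shortly after deployment"


# Primary had been up 3 hours before the reboot.
print(classify_timeline(3 * 3600, 48 * 3600))
# Primary had been up 20 minutes, two days after deployment.
print(classify_timeline(20 * 60, 48 * 3600))
```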
D.4.
Although it is highly unlikely that the Operator would not know about commands sent to the spacecraft that would lead to reboots/failsafe, the possibility that the above has resulted from GS commands should still be ruled out.
Therefore, using the MCS/GS logs, verify that the following TCs were not sent to the spacecraft since the image was last as expected (i.e. not the failsafe image):
- Set: platform.obc.OBC.imageIsStable
- Set: platform.obc.OBC.imagePriority
- Invoke: platform.obc.OBC.reset
- Invoke: platform.EPS.cycleBus
If any of these commands were sent to the spacecraft since the image was last as expected, check if the timing of the commands is consistent with the timing of reboots that led to failsafe.
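A search of the command history for the four TCs listed in D.4 could look like the Python sketch below. The log format is entirely hypothetical: each entry is assumed to be a `(timestamp, operation, name)` tuple with comparable timestamps, which will need adapting to whatever the MCS/GS logs actually contain.

```python
# The four TCs named in D.4, as (operation, parameter/action name) pairs.
WATCHED_TCS = {
    ("Set", "platform.obc.OBC.imageIsStable"),
    ("Set", "platform.obc.OBC.imagePriority"),
    ("Invoke", "platform.obc.OBC.reset"),
    ("Invoke", "platform.EPS.cycleBus"),
}


def find_watched_tcs(log_entries, since):
    """Return watched TCs sent at or after `since`.

    Hypothetical log format: (timestamp, operation, name) tuples.
    """
    return [
        entry
        for entry in log_entries
        if entry[0] >= since and (entry[1], entry[2]) in WATCHED_TCS
    ]


example_log = [
    (100, "Get", "platform.EPS.actualSwitchStates"),
    (250, "Invoke", "platform.obc.OBC.reset"),
    (300, "Get", "mission.SeparationSequence.state"),
]
print(find_watched_tcs(example_log, since=200))
```

Any entries returned can then be compared by hand against the reboot times established in D.3.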
D.5.
Assuming the reboots were not commanded, the Operator should now further assess the nature of any reboots by searching the Event logs (i.e. the ‘OLD rows of FAILSAFE Event data’ and ‘NEW rows of PRIMARY Event data’) for ‘EPSInitialised’ events around the times of the reboots.
- If this event is observed, a full spacecraft power-cycle led to the reboot.
- Else, an OBC reset occurred.
D.6.
If a full spacecraft power-cycle occurred, the Operator should now assess the ‘NEW rows of PRIMARY HK data’ and ‘NEW rows of PRIMARY Event data’ to determine if there is evidence that low battery conditions caused the reboot(s). In particular, the Operator should:
- Search the HK data for a decrease in the battery bus voltage to ~6.144V, and
- Search the Event log for the ‘LowVoltageExceptionBATSafe’ event.
If evidence that low battery conditions caused the reboot(s) is found, the Operator should now consider using the lowbatanalysis procedure to assist further analysis.
If a full spacecraft power-cycle did not occur OR if a power-cycle did occur but there is no evidence of low battery issues, the Operator should now consider using the reset_obcreboot procedure to assist further analysis.
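Steps D.5 and D.6 together select the follow-up procedure. The Python sketch below captures that selection under stated assumptions: the inputs are findings the Operator has already extracted from the logs, either piece of low battery evidence is treated as sufficient, and the 0.05 V margin used to interpret "~6.144V" is a placeholder rather than a value from this procedure. The function name `select_follow_up` is hypothetical.

```python
LOW_BATTERY_BUS_V = 6.144  # bus voltage quoted in D.6


def select_follow_up(eps_initialised_seen, min_bus_voltage_v=None,
                     low_voltage_event_seen=False, margin_v=0.05):
    """Return (reboot_type, follow_up_procedure) per D.5 and D.6.

    margin_v is a placeholder for how close to 6.144 V counts as "~6.144V".
    """
    if not eps_initialised_seen:
        return ("OBC reset", "reset_obcreboot")
    voltage_evidence = (
        min_bus_voltage_v is not None
        and min_bus_voltage_v <= LOW_BATTERY_BUS_V + margin_v
    )
    if voltage_evidence or low_voltage_event_seen:
        return ("power-cycle, low battery suspected", "lowbatanalysis")
    return ("power-cycle, no low battery evidence", "reset_obcreboot")


# EPSInitialised observed and the bus dipped to 6.15 V before the reboot.
print(select_follow_up(True, min_bus_voltage_v=6.15))
# No EPSInitialised event near the reboot time.
print(select_follow_up(False))
```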
D.7.
When the team have completed their analysis and wish to leave the failsafe image, Section E should be carried out.
E. Returning to Primary
Warning
This section of the procedure should ONLY be carried out following the close-out of Section D and ONLY IF the decision has been made to proceed with booting back into a primary mission image.
E.1.
If it is decided that a new image must be uplinked to the spacecraft prior to booting a primary image, the Operator should first follow uploadimage before proceeding.
Else, proceed immediately to the next step.
E.2.
The Operator should now follow the bootswimage procedure to boot the primary image of choice (i.e. primary1 or primary2).
If the primary image is successfully booted and is stable (i.e. no reboots to failsafe), the Operator should assess:
Whether any new image commissioning is required, and/or
Whether nominal operations (i.e. Nominal Mode with the experiment running and data logging on-going) can be initiated (see firstTimeNom for details).
END OF PROCEDURE