Update explainer text for corruption detection header extension.

Bug: webrtc:358039777 Change-Id: I6a1cffc2a5797d154bfecb50c60b4c05d4943426 Reviewed-on: https://webrtc-review.googlesource.com/c/src/+/360661 Commit-Queue: Erik Språng <sprang@webrtc.org> Auto-Submit: Erik Språng <sprang@webrtc.org> Reviewed-by: Fanny Linderborg <linderborg@webrtc.org> Cr-Commit-Position: refs/heads/main@{#42862}
2024-08-26 14:33:40 +02:00 · 2024-08-26 14:33:40 +02:00 · c1a0d233d0
commit c1a0d233d0
parent fd6f4b4e51
3 changed files with 357 additions and 42 deletions
--- a/docs/native-code/rtp-hdrext/README.md
+++ b/docs/native-code/rtp-hdrext/README.md
@ -1,13 +1,15 @@
 # Experimental RTP header extensions

-The following subpages define experiemental RTP header extensions:
+The following subpages define experimental RTP header extensions:

-  * [abs-send-time](abs-send-time/README.md)
  * [abs-capture-time](abs-capture-time/README.md)
+  * [abs-send-time](abs-send-time/README.md)
  * [color-space](color-space/README.md)
+  * [corruption-detection](corruption-detection/README.md)
+  * [inband-cn](inband-cn/README.md)
  * [playout-delay](playout-delay/README.md)
  * [transport-wide-cc-02](transport-wide-cc-02/README.md)
  * [video-content-type](video-content-type/README.md)
+  * [video-frame-tracking-id](video-frame-tracking-id/README.md)
+  * [video-layers-allocation00](video-layers-allocation00/README.md)
  * [video-timing](video-timing/README.md)
-  * [inband-cn](inband-cn/README.md)
-  * [video-layers-allocation00](video-layes-allocation00/README.md)
--- a/docs/native-code/rtp-hdrext/corruption-detection/README.md
+++ b/docs/native-code/rtp-hdrext/corruption-detection/README.md
@ -0,0 +1,349 @@
+# Corruption Detection
+
+** Name: **
+"Corruption Detection"; "Extension for Automatic Detection of Video Corruptions"
+** Formal name: **
+<http://www.webrtc.org/experiments/rtp-hdrext/corruption-detection>
+** Status: ** This extension is defined here to allow for experimentation.
+** Contact: ** <sprang@google.com>
+
+NOTE: This explainer is a work in progress and may change without notice.
+
+The Corruption Detection (sometimes referred to as automatic corruption
+detection or ACD) extension is intended to be a part of a system that allows
+estimating a likelihood that a video transmission is in a valid state. That is,
+the input to the video encoder on the send side corresponds to the output of the
+video decoder on the receive side with the only difference being the expected
+distortions from lossy compression. 
+
+The goal is to be able to detect outright coding errors caused by things such as
+bugs in encoder/decoders, malformed packetization data, incorrect relay
+decisions in SFU-type servers, incorrect handling of packet loss/reordering, and
+so forth. We want to accomplish this with a high signal-to-noise ratio while
+consuming a minimum of resources in terms of bandwidth and/or computation. It
+should be noted that it is _not_ a goal to be able to e.g. gauge general video
+quality using this method.
+
+This explainer contains two parts:
+
+1) A definition of the RTP header extension itself and how it is to be parsed.
+2) The intended usage and implementation details for a WebRTC sender and
+   receiver respectively.
+
+If this extension has been negotiated, all the client behavior outlined in this
+doc MUST be adhered to.
+
+## RTP Header Extension Format
+
+### Data Layout Overview
+
+The message format of the header extension:
+
+      0                   1                   2                   3
+      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |B|  seq index  |    std dev    | Y err | UV err|    sample 0   |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |    sample 1   |   sample 2    |    …   up to sample <=12
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+### Data Layout Details
+
+* B (1 bit): If the sequence number should be interpreted as the MSB or LSB
+  of the full size 14 bit sequence index described in the next point.
+* seq index (7 bits): The index into the Halton sequence (used to locate
+  where the samples should be drawn from).
+  * If B is set: the 7 most significant bits of the true index. The 7 least
+    significant bits of the true index shall be interpreted as 0. This is
+    because this is the point where we can guarantee that the sender and
+    receiver has the same full index. B MUST be set on keyframes. On droppable
+    frames B MUST NOT be set.
+  * If B is not set: The 7 LSB of the true index. The 7 most significant bits
+    should be inferred based on the most recent message.
+* std dev (8 bits):  The standard deviation of the Gaussian filter used
+  to weigh the samples. The value is scaled using a linear map:
+  0 = 0.0 to 255 = 40.0. A std dev of 0 is interpreted as directly using
+  just the sample value at the desired coordinate, without any weighting.
+* Y err (4 bits): The allowed error for the luma channel.
+* UV err (4 bits): The allowed error for the chroma channels.
+* Sample N (8 bits): The N:th filtered sample from the input image. Each
+  sample represents a new point in one of the image planes, the plane and
+  coordinates being determined by index into the Halton sequence (starting at
+  seq# index and is incremented by one for each sample). Each sample has gone
+  through a Gaussian filter with the std dev specified above. The samples
+  have been floored to the nearest integer.
+
+A special case is the so-called "synchronization" message. Such a message
+only contains the first byte. They are used to keep the sender and receiver in
+sync even if no "full" message has been received for a while. Such messages
+MUST NOT be sent on droppable frames.
+
+### A note on encryption
+
+Privacy and security are core parts of nearly every WebRTC-based application,
+which means that some sort of encryption needs to be present. The most common
+form of encryption is SRTP, defined in RFC 3711. However, as mentioned in
+section 9.4 of that RFC, RTP header extensions are considered part of the header
+and are thus not encrypted.
+
+The automatic corruption detection header extension is different from most other
+header extensions in that it provides not only metadata about the media stream
+being transmitted but in practice comprises an extremely sparse representation
+of the actual video stream itself. Given a static scene and enough time, a crude
+image of the encrypted video can rather trivially be constructed.
+
+As such, most applications should use this extension with SRTP only if
+additional security is present to protect it. That could be for example in the
+form of explicit header extension encryption provided by RFC 6904/RFC 9335, or
+by encapsulating the entire RTP stream in an additional layer such as IPSec. 
+
+## Usage & Guidelines
+
+In this section we’ll first look at a general overview of the intended usage of
+this header extensions, followed by more details around the expected
+implementation.
+
+### Overview
+
+The premise of the extension described here is that we can validate the state of
+the video pipeline by quasi-randomly selecting a few samples from the raw input
+frame to an encoder, and then checking them against the output of a decoder.
+Assuming that a lossless codec is used we can follow these steps:
+
+1) In an image that is to be encoded, quasi-randomly select N sampling positions
+   and store the samples values for those positions from the raw input image. 
+2) Encode the image, and attach the selected sample values to the RTP packets
+   containing the encoded bitstream of that image.
+3) Transmit the RTP packets to a remote receiver.
+4) At the receiver, collect the attached sample values from the RTP packets when
+   assembling the frame, and then pass the bitstream to a decoder.
+5) Using the same quasi-random sequence as in (1), calculate the corresponding N
+   sampling positions.
+6) Take the output of the decoder and check the values of the samples from the
+   RTP packets. If they differ significantly, it is likely that an image
+   corruption has occurred.
+
+Lossless encoding is however rarely used in practice, and that introduces
+problems for the above algorithm.
+
+* Quantization causes values to be different from the desired value.
+* Whole blocks of pixels might be shifted somewhat due to inaccuracies in motion
+  vectors.
+* Inaccuracies caused by in-loop or post-process filtering.
+* etc.
+
+We must therefore take these distortions into consideration, as they are merely
+a natural side-effect of the compression and their effect is not to be
+considered an “invalid state”. We aim to accomplish this using two tools.
+
+First, instead of a sample being a single raw sample value let it be a filtered
+one: a weighted average of samples in the vicinity of the desired location, with
+the weights being a 2D Gaussian centered at that location and the variance
+adjusted depending on the magnitude of the expected distortions
+(higher distortion => higher variance). This smoothes out inaccuracies caused by
+both quantization and motion compensation.
+
+Secondly, even with a very large filter kernel the new sample might not converge
+towards the exact desired value. For that reason, set an “allowed error
+threshold” that removes small magnitude differences. Since chroma and luma
+channels have different scales, separate error thresholds are available for
+them.
+
+### Sequence Index Handling
+
+The quasi-random sequence of choice for this extension is a 2D
+[Halton Sequence](https://en.wikipedia.org/wiki/Halton_sequence). 
+
+The index into the Halton Sequence is indicated by the header extension and
+results in a 14 bit unsigned integer which on overflow will wrap around back to
+0.
+
+For each sample contained within the extension, the sequence index should be
+considered to be incremented by one. Thus the sequence index at the start of the
+header should be considered “the sequence index for the next sample to be
+drawn”.
+
+The ACD extension may be sent containing either the 7 most significant bits
+(B = true) or the 7 least significant bits (B = false) of the sequence index.
+
+Key-frames MUST be populated with the ACD extension, and those MUST use B = true
+indicating only the 7 most significant bits are transmitted.
+
+The sender may choose any arbitrary starting point. The biggest reason to not
+always start with (B = true, seq index = 0) is that with frequent/periodic
+keyframes you might end up always sampling the same small subset of image
+locations over and over.
+
+If B = false and the LSB seq index + number of samples exceeds the capacity of
+the 7-bit field (i.e. > 0x7F), then the most significant bits of the 14 bit
+sequence counter should be considered to be implicitly incremented by the
+overflow.
+
+Delta-frames may be encoded as “droppable” or “non-droppable”. Consider for
+example temporal layering using the
+[L1T3](https://www.w3.org/TR/webrtc-svc/#L1T3*) mode. In that scenario,
+key-frames and all T0 frames are non-droppable, while all T1 and T2 frames are
+droppable.
+
+For non-droppable frames, B MAY be set to true even though there is often little
+utility for it.
+For droppable frames B MUST NOT be set to true, since a receiver could otherwise
+easily end up out of sync with the sender.
+
+A receiver must store a state containing the last sequence index used. If an ACD
+extension is receiver with B = false but the LSB does not match the last known
+sequence index state, this indicates that an instrumented frame has been
+dropped. The receiver should recover from this by incrementing the last known
+sequence index until the 7 least significant bits match.
+
+Because of this, the sender MUST send ACD messages on non-droppable frames such
+that the delta between their sequence indexing (from the last sample of the
+previous packet to the first of the next) indexing does not exceed 0x7F. A
+synchronization message may be used for this purpose if there is no wish to
+instrument the non-droppable frame.
+
+It is not required to add the ACD extension to every frame. Indeed, for
+performance reasons it may be reasonable to only instrument a small subset of
+frames, for example using only one frame per second.
+
+Additionally, when encoding a structure that has independent decode targets
+(e.g. L3T3_KEY) - the sender should generate an independent stream ACD sequence
+per target resolution so that a receiver can validate the state of the
+sub-stream they receive.
+
+// TODO: Add concrete examples.
+
+### Sample Selection
+
+As mentioned above, a Halton Sequence is used to generate sampling coordinates.
+Base 2 is used for selecting the rows, and base 3 is used for selecting columns.
+
+Each sample in the ACD extension represents a single image sample, meaning it
+belongs to a single channel rather than e.g. being an RGB pixel.
+
+The initial version of the ACD extension supports only the I420 chroma
+subsampling format. When determining which plane a location belongs to, it is
+easiest to visualize it as the chroma planes being “stacked” to the side of the
+luma plane:
+
+------+---+
+|      | U |
+  Y   +---+
+|      | V |
+------+---+
+
+In pseudo code:
+```
+  row = GetHaltonSequence(seq_index, /*base=*/2) * image_height;
+  col = GetHaltonSequence(seq_index, /*base=*/3) * image_width * 1.5;
+
+  if (col < image_width) {
+    HandleSample(Y_PLANE, row, col);
+  } else if (row < image_height / 2) {
+    HandleSample(U_PLANE, row, col - image_width);
+  } else {
+    HandleSample(V_PLANE, row - (image_height / 2), col - image_width);
+  }
+  
+  seq_index++;
+```
+Support for other layout types may be added in later versions of this extension.
+
+Note that the image dimensions are not explicitly a part of the ACD extension -
+that has to be inferred from the raw image itself.
+
+### Sample Filtering
+
+As mentioned above, when filtering a sample we create a weighted average around
+the desired location. Only samples in the same plane are considered. The
+weighting consists of a 2D Gaussian centered on the desired location, with the
+standard deviation specified in the ACD extension header.
+
+If the standard deviation is specified as 0.0 - we consider only a singular
+sample. Otherwise, we first determine a cutoff distance below which the weights
+are considered too small to matter. For now, we have set the weight cutoff to
+0.2 - meaning the maximum distance from the center sample we need to consider is
+max_d = ceil(sqrt(-2.0 * ln(0.2) * stddev^2) - 1.
+
+Any samples outside the plane are considered to have weight 0.
+
+In pseudo-code, that means we get the following:
+```  
+  sample_sum = 0;
+  weight_sum = 0;
+  for (y = max(0, row - max_d) to min(plane_height, row + max_d) {
+    for (x = max(0, col - max_d) to min(plane_width, col + max_d) {
+	weight = e^(-1 * ((y - row)^2 + (x - col)^2) / (2 * stddev^2));
+	sample_sum += SampleAt(x, y) * weight;
+	weight_sum += weight;
+    }
+  }
+  filtered_sample = sample_sum / weight_sum;
+```
+### Receive Side Considerations
+
+When a frame has been decoded and an ACD message is present, the receiver
+performs the following steps:
+
+* Update the sequence index so that it is consistent with the ACD message.
+* Calculate the sample positions from the Halton sequence.
+* Filter each sample of the decoded image using the standard deviation provided
+  in the ACD message.
+
+We then need to compare the actual samples present in the ACD message and the
+samples generated from the locally decoded frame, and take the allowed error
+into account:
+
+```
+for (i = 0 to num_samples) {
+  // Allowed error from ACD message, depending on which plane sample i is in.
+  allowed_error = SampleType(i) == Y_PLANE ? Y_ERR : UV_ERR;
+  delta_i = max(0, abs(RemoteSample(i) - LocalSample(i)) - allowed_error);
+}
+```
+
+It is then up to the receiver how to interpret these deltas. A suggested method
+is to calculate a “corruption score” by calculating sum(delta(i)^2), where
+delta(i) is the delta for i:th sample in the message, and then scaling and
+capping that result to a maximum of 1.0. By squaring the sample, we make sure
+that even singular samples that are way outside their expected values cause a
+noticeable shift in the score. Another possible way is to calculate the distance
+and cap it using a sigmoid function.
+
+This extension message format does not make recommendations about what a
+receiver should do with the corruption scores, but some possibilities are:
+
+* Expose it as a statistics connected to the video receive stream. Let the
+  application decide what to do with the information.
+* Let the WebRTC application use a corruption signal to take proactive measures.
+  E.g. request a key-frame in order to recover, or try to switch to another
+  codec type or implementation.
+
+### Determining Filter Settings & Error Thresholds
+
+It is up to the sender to estimate how large the filter kernel and the allowed
+error thresholds should be.
+
+One method to do this is to analyze example outputs from different encoders and
+map the average frame QP to suitable settings. There will of course have to be
+different such mapping for e.g. AV1 compared to VP8 - but it’s also possible to
+get “tighter” values with knowledge of the exact implementation used. E.g. a
+mapping designed just for libaom encoder version X running with speed setting Y.
+
+Another method is to use the actual reconstructor state from the encoder. That
+of course means the encoder has to expose that state, which is not common.
+A benefit of doing it that way is that the filter size and allowed error can be
+very small (really only post-processing could introduce distortions in that
+scenario). A drawback is if the reconstructed state already contains corruption
+due to an encoder bug - then we would not be able to detect that corruption at
+all.
+
+There are also possibly more accurate but probably much more costly alternatives
+as well, such as training an ML model to determine the settings based on both
+the content of the source frame and any metadata present in the encoded
+bitstream. 
+
+Regardless of method, the implementation at the send side SHOULD strive to set
+the filter size and error thresholds such that 99.5% of filtered samples end up
+with a delta <= the error threshold for that plane, based on a representative
+set of test clips and bandwidth constraints.
--- a/modules/rtp_rtcp/source/corruption_detection_extension.cc
+++ b/modules/rtp_rtcp/source/corruption_detection_extension.cc
@ -27,44 +27,8 @@ constexpr double kMaxValueForStdDev = 40.0;

 }  // namespace

-// The message format of the header extension:
-//
-//   0                   1                   2                   3
-//   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
-//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
-//  |B| seq# index  |  kernel size  | Y err | UV err|    sample 0   |
-//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
-//  |    sample 1   |   sample 2    |    …   up to sample <=12
-//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
-//
-// * B (1 bit): If the sequence number should be interpreted as the MSB or LSB
-//   of the full size 14 bit sequence index described in the next point.
-// * seq# index (7 bits): The index into the Halton sequence (used to locate
-//   where the samples should be drawn from).
-//   * If B is set: the 7 most significant bits of the true index. The 7 least
-//     significant bits of the true index shall be interpreted as 0. This is
-//     because this is the point where we can guarantee that the sender and
-//     receiver has the same full index). For this reason, B must only be set
-//     for key frames.
-//   * If B is not set: The 7 LSB of the true index. The 7 most significant bits
-//     should be inferred based on the most recent message.
-// * kernel size (8 bits):  The standard deviation of the gaussian filter used
-//   to weigh the samples. The value is scaled using a linear map:
-//   0 = 0.0 to 255 = 40.0. A kernel size of 0 is interpreted as directly using
-//   just the sample value at the desired coordinate, without any weighting.
-// * Y err (4 bits): The allowed error for the luma channel.
-// * UV err (4 bits): The allowed error for the chroma channels.
-// * Sample N (8 bits): The N:th filtered sample from the input image. Each
-//   sample represents a new point in one of the image planes, the plane and
-//   coordinates being determined by index into the Halton sequence (starting at
-//   seq# index and is incremented by one for each sample). Each sample has gone
-//   through a Gaussian filter with the kernel size specified above. The samples
-//   have been floored to the nearest integer.
-//
-// A special case is so called "synchronization" messages. These are messages
-// that only contains the first byte. They always have B set and are used to
-// keep the sender and receiver in sync even if no "full" messages have been
-// sent for a while.
+// A description of the extension can be found at
+// http://www.webrtc.org/experiments/rtp-hdrext/corruption-detection

 bool CorruptionDetectionExtension::Parse(rtc::ArrayView<const uint8_t> data,
                                         CorruptionDetectionMessage* message) {