Update explainer text for corruption detection header extension.
Bug: webrtc:358039777 Change-Id: I6a1cffc2a5797d154bfecb50c60b4c05d4943426 Reviewed-on: https://webrtc-review.googlesource.com/c/src/+/360661 Commit-Queue: Erik Språng <sprang@webrtc.org> Auto-Submit: Erik Språng <sprang@webrtc.org> Reviewed-by: Fanny Linderborg <linderborg@webrtc.org> Cr-Commit-Position: refs/heads/main@{#42862}
This commit is contained in:
parent
fd6f4b4e51
commit
c1a0d233d0
@ -1,13 +1,15 @@
|
||||
# Experimental RTP header extensions
|
||||
|
||||
The following subpages define experiemental RTP header extensions:
|
||||
The following subpages define experimental RTP header extensions:
|
||||
|
||||
* [abs-send-time](abs-send-time/README.md)
|
||||
* [abs-capture-time](abs-capture-time/README.md)
|
||||
* [abs-send-time](abs-send-time/README.md)
|
||||
* [color-space](color-space/README.md)
|
||||
* [corruption-detection](corruption-detection/README.md)
|
||||
* [inband-cn](inband-cn/README.md)
|
||||
* [playout-delay](playout-delay/README.md)
|
||||
* [transport-wide-cc-02](transport-wide-cc-02/README.md)
|
||||
* [video-content-type](video-content-type/README.md)
|
||||
* [video-frame-tracking-id](video-frame-tracking-id/README.md)
|
||||
* [video-layers-allocation00](video-layers-allocation00/README.md)
|
||||
* [video-timing](video-timing/README.md)
|
||||
* [inband-cn](inband-cn/README.md)
|
||||
* [video-layers-allocation00](video-layes-allocation00/README.md)
|
||||
|
||||
349
docs/native-code/rtp-hdrext/corruption-detection/README.md
Normal file
349
docs/native-code/rtp-hdrext/corruption-detection/README.md
Normal file
@ -0,0 +1,349 @@
|
||||
# Corruption Detection
|
||||
|
||||
** Name: **
|
||||
"Corruption Detection"; "Extension for Automatic Detection of Video Corruptions"
|
||||
** Formal name: **
|
||||
<http://www.webrtc.org/experiments/rtp-hdrext/corruption-detection>
|
||||
** Status: ** This extension is defined here to allow for experimentation.
|
||||
** Contact: ** <sprang@google.com>
|
||||
|
||||
NOTE: This explainer is a work in progress and may change without notice.
|
||||
|
||||
The Corruption Detection (sometimes referred to as automatic corruption
|
||||
detection or ACD) extension is intended to be a part of a system that allows
|
||||
estimating a likelihood that a video transmission is in a valid state. That is,
|
||||
the input to the video encoder on the send side corresponds to the output of the
|
||||
video decoder on the receive side with the only difference being the expected
|
||||
distortions from lossy compression.
|
||||
|
||||
The goal is to be able to detect outright coding errors caused by things such as
|
||||
bugs in encoder/decoders, malformed packetization data, incorrect relay
|
||||
decisions in SFU-type servers, incorrect handling of packet loss/reordering, and
|
||||
so forth. We want to accomplish this with a high signal-to-noise ratio while
|
||||
consuming a minimum of resources in terms of bandwidth and/or computation. It
|
||||
should be noted that it is _not_ a goal to be able to e.g. gauge general video
|
||||
quality using this method.
|
||||
|
||||
This explainer contains two parts:
|
||||
|
||||
1) A definition of the RTP header extension itself and how it is to be parsed.
|
||||
2) The intended usage and implementation details for a WebRTC sender and
|
||||
receiver respectively.
|
||||
|
||||
If this extension has been negotiated, all the client behavior outlined in this
|
||||
doc MUST be adhered to.
|
||||
|
||||
## RTP Header Extension Format
|
||||
|
||||
### Data Layout Overview
|
||||
|
||||
The message format of the header extension:
|
||||
|
||||
0 1 2 3
|
||||
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
|
||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
||||
|B| seq index | std dev | Y err | UV err| sample 0 |
|
||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
||||
| sample 1 | sample 2 | … up to sample <=12
|
||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
||||
|
||||
### Data Layout Details
|
||||
|
||||
* B (1 bit): If the sequence number should be interpreted as the MSB or LSB
|
||||
of the full size 14 bit sequence index described in the next point.
|
||||
* seq index (7 bits): The index into the Halton sequence (used to locate
|
||||
where the samples should be drawn from).
|
||||
* If B is set: the 7 most significant bits of the true index. The 7 least
|
||||
significant bits of the true index shall be interpreted as 0. This is
|
||||
because this is the point where we can guarantee that the sender and
|
||||
receiver has the same full index. B MUST be set on keyframes. On droppable
|
||||
frames B MUST NOT be set.
|
||||
* If B is not set: The 7 LSB of the true index. The 7 most significant bits
|
||||
should be inferred based on the most recent message.
|
||||
* std dev (8 bits): The standard deviation of the Gaussian filter used
|
||||
to weigh the samples. The value is scaled using a linear map:
|
||||
0 = 0.0 to 255 = 40.0. A std dev of 0 is interpreted as directly using
|
||||
just the sample value at the desired coordinate, without any weighting.
|
||||
* Y err (4 bits): The allowed error for the luma channel.
|
||||
* UV err (4 bits): The allowed error for the chroma channels.
|
||||
* Sample N (8 bits): The N:th filtered sample from the input image. Each
|
||||
sample represents a new point in one of the image planes, the plane and
|
||||
coordinates being determined by index into the Halton sequence (starting at
|
||||
seq# index and is incremented by one for each sample). Each sample has gone
|
||||
through a Gaussian filter with the std dev specified above. The samples
|
||||
have been floored to the nearest integer.
|
||||
|
||||
A special case is the so-called "synchronization" message. Such a message
|
||||
only contains the first byte. They are used to keep the sender and receiver in
|
||||
sync even if no "full" message has been received for a while. Such messages
|
||||
MUST NOT be sent on droppable frames.
|
||||
|
||||
### A note on encryption
|
||||
|
||||
Privacy and security are core parts of nearly every WebRTC-based application,
|
||||
which means that some sort of encryption needs to be present. The most common
|
||||
form of encryption is SRTP, defined in RFC 3711. However, as mentioned in
|
||||
section 9.4 of that RFC, RTP header extensions are considered part of the header
|
||||
and are thus not encrypted.
|
||||
|
||||
The automatic corruption detection header extension is different from most other
|
||||
header extensions in that it provides not only metadata about the media stream
|
||||
being transmitted but in practice comprises an extremely sparse representation
|
||||
of the actual video stream itself. Given a static scene and enough time, a crude
|
||||
image of the encrypted video can rather trivially be constructed.
|
||||
|
||||
As such, most applications should use this extension with SRTP only if
|
||||
additional security is present to protect it. That could be for example in the
|
||||
form of explicit header extension encryption provided by RFC 6904/RFC 9335, or
|
||||
by encapsulating the entire RTP stream in an additional layer such as IPSec.
|
||||
|
||||
## Usage & Guidelines
|
||||
|
||||
In this section we’ll first look at a general overview of the intended usage of
|
||||
this header extensions, followed by more details around the expected
|
||||
implementation.
|
||||
|
||||
### Overview
|
||||
|
||||
The premise of the extension described here is that we can validate the state of
|
||||
the video pipeline by quasi-randomly selecting a few samples from the raw input
|
||||
frame to an encoder, and then checking them against the output of a decoder.
|
||||
Assuming that a lossless codec is used we can follow these steps:
|
||||
|
||||
1) In an image that is to be encoded, quasi-randomly select N sampling positions
|
||||
and store the samples values for those positions from the raw input image.
|
||||
2) Encode the image, and attach the selected sample values to the RTP packets
|
||||
containing the encoded bitstream of that image.
|
||||
3) Transmit the RTP packets to a remote receiver.
|
||||
4) At the receiver, collect the attached sample values from the RTP packets when
|
||||
assembling the frame, and then pass the bitstream to a decoder.
|
||||
5) Using the same quasi-random sequence as in (1), calculate the corresponding N
|
||||
sampling positions.
|
||||
6) Take the output of the decoder and check the values of the samples from the
|
||||
RTP packets. If they differ significantly, it is likely that an image
|
||||
corruption has occurred.
|
||||
|
||||
Lossless encoding is however rarely used in practice, and that introduces
|
||||
problems for the above algorithm.
|
||||
|
||||
* Quantization causes values to be different from the desired value.
|
||||
* Whole blocks of pixels might be shifted somewhat due to inaccuracies in motion
|
||||
vectors.
|
||||
* Inaccuracies caused by in-loop or post-process filtering.
|
||||
* etc.
|
||||
|
||||
We must therefore take these distortions into consideration, as they are merely
|
||||
a natural side-effect of the compression and their effect is not to be
|
||||
considered an “invalid state”. We aim to accomplish this using two tools.
|
||||
|
||||
First, instead of a sample being a single raw sample value let it be a filtered
|
||||
one: a weighted average of samples in the vicinity of the desired location, with
|
||||
the weights being a 2D Gaussian centered at that location and the variance
|
||||
adjusted depending on the magnitude of the expected distortions
|
||||
(higher distortion => higher variance). This smoothes out inaccuracies caused by
|
||||
both quantization and motion compensation.
|
||||
|
||||
Secondly, even with a very large filter kernel the new sample might not converge
|
||||
towards the exact desired value. For that reason, set an “allowed error
|
||||
threshold” that removes small magnitude differences. Since chroma and luma
|
||||
channels have different scales, separate error thresholds are available for
|
||||
them.
|
||||
|
||||
### Sequence Index Handling
|
||||
|
||||
The quasi-random sequence of choice for this extension is a 2D
|
||||
[Halton Sequence](https://en.wikipedia.org/wiki/Halton_sequence).
|
||||
|
||||
The index into the Halton Sequence is indicated by the header extension and
|
||||
results in a 14 bit unsigned integer which on overflow will wrap around back to
|
||||
0.
|
||||
|
||||
For each sample contained within the extension, the sequence index should be
|
||||
considered to be incremented by one. Thus the sequence index at the start of the
|
||||
header should be considered “the sequence index for the next sample to be
|
||||
drawn”.
|
||||
|
||||
The ACD extension may be sent containing either the 7 most significant bits
|
||||
(B = true) or the 7 least significant bits (B = false) of the sequence index.
|
||||
|
||||
Key-frames MUST be populated with the ACD extension, and those MUST use B = true
|
||||
indicating only the 7 most significant bits are transmitted.
|
||||
|
||||
The sender may choose any arbitrary starting point. The biggest reason to not
|
||||
always start with (B = true, seq index = 0) is that with frequent/periodic
|
||||
keyframes you might end up always sampling the same small subset of image
|
||||
locations over and over.
|
||||
|
||||
If B = false and the LSB seq index + number of samples exceeds the capacity of
|
||||
the 7-bit field (i.e. > 0x7F), then the most significant bits of the 14 bit
|
||||
sequence counter should be considered to be implicitly incremented by the
|
||||
overflow.
|
||||
|
||||
Delta-frames may be encoded as “droppable” or “non-droppable”. Consider for
|
||||
example temporal layering using the
|
||||
[L1T3](https://www.w3.org/TR/webrtc-svc/#L1T3*) mode. In that scenario,
|
||||
key-frames and all T0 frames are non-droppable, while all T1 and T2 frames are
|
||||
droppable.
|
||||
|
||||
For non-droppable frames, B MAY be set to true even though there is often little
|
||||
utility for it.
|
||||
For droppable frames B MUST NOT be set to true, since a receiver could otherwise
|
||||
easily end up out of sync with the sender.
|
||||
|
||||
A receiver must store a state containing the last sequence index used. If an ACD
|
||||
extension is receiver with B = false but the LSB does not match the last known
|
||||
sequence index state, this indicates that an instrumented frame has been
|
||||
dropped. The receiver should recover from this by incrementing the last known
|
||||
sequence index until the 7 least significant bits match.
|
||||
|
||||
Because of this, the sender MUST send ACD messages on non-droppable frames such
|
||||
that the delta between their sequence indexing (from the last sample of the
|
||||
previous packet to the first of the next) indexing does not exceed 0x7F. A
|
||||
synchronization message may be used for this purpose if there is no wish to
|
||||
instrument the non-droppable frame.
|
||||
|
||||
It is not required to add the ACD extension to every frame. Indeed, for
|
||||
performance reasons it may be reasonable to only instrument a small subset of
|
||||
frames, for example using only one frame per second.
|
||||
|
||||
Additionally, when encoding a structure that has independent decode targets
|
||||
(e.g. L3T3_KEY) - the sender should generate an independent stream ACD sequence
|
||||
per target resolution so that a receiver can validate the state of the
|
||||
sub-stream they receive.
|
||||
|
||||
// TODO: Add concrete examples.
|
||||
|
||||
### Sample Selection
|
||||
|
||||
As mentioned above, a Halton Sequence is used to generate sampling coordinates.
|
||||
Base 2 is used for selecting the rows, and base 3 is used for selecting columns.
|
||||
|
||||
Each sample in the ACD extension represents a single image sample, meaning it
|
||||
belongs to a single channel rather than e.g. being an RGB pixel.
|
||||
|
||||
The initial version of the ACD extension supports only the I420 chroma
|
||||
subsampling format. When determining which plane a location belongs to, it is
|
||||
easiest to visualize it as the chroma planes being “stacked” to the side of the
|
||||
luma plane:
|
||||
|
||||
+------+---+
|
||||
| | U |
|
||||
+ Y +---+
|
||||
| | V |
|
||||
+------+---+
|
||||
|
||||
In pseudo code:
|
||||
```
|
||||
row = GetHaltonSequence(seq_index, /*base=*/2) * image_height;
|
||||
col = GetHaltonSequence(seq_index, /*base=*/3) * image_width * 1.5;
|
||||
|
||||
if (col < image_width) {
|
||||
HandleSample(Y_PLANE, row, col);
|
||||
} else if (row < image_height / 2) {
|
||||
HandleSample(U_PLANE, row, col - image_width);
|
||||
} else {
|
||||
HandleSample(V_PLANE, row - (image_height / 2), col - image_width);
|
||||
}
|
||||
|
||||
seq_index++;
|
||||
```
|
||||
Support for other layout types may be added in later versions of this extension.
|
||||
|
||||
Note that the image dimensions are not explicitly a part of the ACD extension -
|
||||
that has to be inferred from the raw image itself.
|
||||
|
||||
### Sample Filtering
|
||||
|
||||
As mentioned above, when filtering a sample we create a weighted average around
|
||||
the desired location. Only samples in the same plane are considered. The
|
||||
weighting consists of a 2D Gaussian centered on the desired location, with the
|
||||
standard deviation specified in the ACD extension header.
|
||||
|
||||
If the standard deviation is specified as 0.0 - we consider only a singular
|
||||
sample. Otherwise, we first determine a cutoff distance below which the weights
|
||||
are considered too small to matter. For now, we have set the weight cutoff to
|
||||
0.2 - meaning the maximum distance from the center sample we need to consider is
|
||||
max_d = ceil(sqrt(-2.0 * ln(0.2) * stddev^2) - 1.
|
||||
|
||||
Any samples outside the plane are considered to have weight 0.
|
||||
|
||||
In pseudo-code, that means we get the following:
|
||||
```
|
||||
sample_sum = 0;
|
||||
weight_sum = 0;
|
||||
for (y = max(0, row - max_d) to min(plane_height, row + max_d) {
|
||||
for (x = max(0, col - max_d) to min(plane_width, col + max_d) {
|
||||
weight = e^(-1 * ((y - row)^2 + (x - col)^2) / (2 * stddev^2));
|
||||
sample_sum += SampleAt(x, y) * weight;
|
||||
weight_sum += weight;
|
||||
}
|
||||
}
|
||||
filtered_sample = sample_sum / weight_sum;
|
||||
```
|
||||
### Receive Side Considerations
|
||||
|
||||
When a frame has been decoded and an ACD message is present, the receiver
|
||||
performs the following steps:
|
||||
|
||||
* Update the sequence index so that it is consistent with the ACD message.
|
||||
* Calculate the sample positions from the Halton sequence.
|
||||
* Filter each sample of the decoded image using the standard deviation provided
|
||||
in the ACD message.
|
||||
|
||||
We then need to compare the actual samples present in the ACD message and the
|
||||
samples generated from the locally decoded frame, and take the allowed error
|
||||
into account:
|
||||
|
||||
```
|
||||
for (i = 0 to num_samples) {
|
||||
// Allowed error from ACD message, depending on which plane sample i is in.
|
||||
allowed_error = SampleType(i) == Y_PLANE ? Y_ERR : UV_ERR;
|
||||
delta_i = max(0, abs(RemoteSample(i) - LocalSample(i)) - allowed_error);
|
||||
}
|
||||
```
|
||||
|
||||
It is then up to the receiver how to interpret these deltas. A suggested method
|
||||
is to calculate a “corruption score” by calculating sum(delta(i)^2), where
|
||||
delta(i) is the delta for i:th sample in the message, and then scaling and
|
||||
capping that result to a maximum of 1.0. By squaring the sample, we make sure
|
||||
that even singular samples that are way outside their expected values cause a
|
||||
noticeable shift in the score. Another possible way is to calculate the distance
|
||||
and cap it using a sigmoid function.
|
||||
|
||||
This extension message format does not make recommendations about what a
|
||||
receiver should do with the corruption scores, but some possibilities are:
|
||||
|
||||
* Expose it as a statistics connected to the video receive stream. Let the
|
||||
application decide what to do with the information.
|
||||
* Let the WebRTC application use a corruption signal to take proactive measures.
|
||||
E.g. request a key-frame in order to recover, or try to switch to another
|
||||
codec type or implementation.
|
||||
|
||||
### Determining Filter Settings & Error Thresholds
|
||||
|
||||
It is up to the sender to estimate how large the filter kernel and the allowed
|
||||
error thresholds should be.
|
||||
|
||||
One method to do this is to analyze example outputs from different encoders and
|
||||
map the average frame QP to suitable settings. There will of course have to be
|
||||
different such mapping for e.g. AV1 compared to VP8 - but it’s also possible to
|
||||
get “tighter” values with knowledge of the exact implementation used. E.g. a
|
||||
mapping designed just for libaom encoder version X running with speed setting Y.
|
||||
|
||||
Another method is to use the actual reconstructor state from the encoder. That
|
||||
of course means the encoder has to expose that state, which is not common.
|
||||
A benefit of doing it that way is that the filter size and allowed error can be
|
||||
very small (really only post-processing could introduce distortions in that
|
||||
scenario). A drawback is if the reconstructed state already contains corruption
|
||||
due to an encoder bug - then we would not be able to detect that corruption at
|
||||
all.
|
||||
|
||||
There are also possibly more accurate but probably much more costly alternatives
|
||||
as well, such as training an ML model to determine the settings based on both
|
||||
the content of the source frame and any metadata present in the encoded
|
||||
bitstream.
|
||||
|
||||
Regardless of method, the implementation at the send side SHOULD strive to set
|
||||
the filter size and error thresholds such that 99.5% of filtered samples end up
|
||||
with a delta <= the error threshold for that plane, based on a representative
|
||||
set of test clips and bandwidth constraints.
|
||||
@ -27,44 +27,8 @@ constexpr double kMaxValueForStdDev = 40.0;
|
||||
|
||||
} // namespace
|
||||
|
||||
// The message format of the header extension:
|
||||
//
|
||||
// 0 1 2 3
|
||||
// 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
|
||||
// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
||||
// |B| seq# index | kernel size | Y err | UV err| sample 0 |
|
||||
// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
||||
// | sample 1 | sample 2 | … up to sample <=12
|
||||
// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
||||
//
|
||||
// * B (1 bit): If the sequence number should be interpreted as the MSB or LSB
|
||||
// of the full size 14 bit sequence index described in the next point.
|
||||
// * seq# index (7 bits): The index into the Halton sequence (used to locate
|
||||
// where the samples should be drawn from).
|
||||
// * If B is set: the 7 most significant bits of the true index. The 7 least
|
||||
// significant bits of the true index shall be interpreted as 0. This is
|
||||
// because this is the point where we can guarantee that the sender and
|
||||
// receiver has the same full index). For this reason, B must only be set
|
||||
// for key frames.
|
||||
// * If B is not set: The 7 LSB of the true index. The 7 most significant bits
|
||||
// should be inferred based on the most recent message.
|
||||
// * kernel size (8 bits): The standard deviation of the gaussian filter used
|
||||
// to weigh the samples. The value is scaled using a linear map:
|
||||
// 0 = 0.0 to 255 = 40.0. A kernel size of 0 is interpreted as directly using
|
||||
// just the sample value at the desired coordinate, without any weighting.
|
||||
// * Y err (4 bits): The allowed error for the luma channel.
|
||||
// * UV err (4 bits): The allowed error for the chroma channels.
|
||||
// * Sample N (8 bits): The N:th filtered sample from the input image. Each
|
||||
// sample represents a new point in one of the image planes, the plane and
|
||||
// coordinates being determined by index into the Halton sequence (starting at
|
||||
// seq# index and is incremented by one for each sample). Each sample has gone
|
||||
// through a Gaussian filter with the kernel size specified above. The samples
|
||||
// have been floored to the nearest integer.
|
||||
//
|
||||
// A special case is so called "synchronization" messages. These are messages
|
||||
// that only contains the first byte. They always have B set and are used to
|
||||
// keep the sender and receiver in sync even if no "full" messages have been
|
||||
// sent for a while.
|
||||
// A description of the extension can be found at
|
||||
// http://www.webrtc.org/experiments/rtp-hdrext/corruption-detection
|
||||
|
||||
bool CorruptionDetectionExtension::Parse(rtc::ArrayView<const uint8_t> data,
|
||||
CorruptionDetectionMessage* message) {
|
||||
|
||||
Loading…
x
Reference in New Issue
Block a user