Content for TR 22.874 Word version: 18.2.0



6.2 Real time media editing with on-board AI inference

6.2.1 Description

The smartphone is the device that people most commonly carry and use for audio and video recording. It has also become the primary device for exchanging media content with friends and family and for publishing on social media. High-end smartphones embed increasingly powerful CPUs and GPUs, and even dedicated AI hardware accelerators. As camera and picture/video quality become differentiators among high-end smartphones, AI/ML models that enhance photos locally are emerging on these devices. AI accelerators are expected to enable the execution of complex AI/ML models directly on end users' connected devices: not only photo enhancement but also high-quality audio and video content analysis and enhancement are expected to be executed locally on smartphones. Smartphones will consequently become devices for editing media content prior to sharing it over the network. With the advent of 5G, new services will emerge that rely on on-demand downloads of large AI/ML models to be executed in (near) real time on the end-user device. Depending on the service, the environment, the user's preferences, the device characteristics, etc., these DNN models will need to be adapted or updated under strict latency constraints, which prevents all of them from being stored locally in advance.

Media content analysis combines tasks such as object detection, segmentation, face recognition, people counting and human activity tracking. Table 6.2.1-1 lists examples of commonly used DNN models for object detection, with their respective sizes.

Table 6.2.1-1: Sizes of typical object detection models

Model for object detection	Number of parameters (Million)	Size of the model (MByte), 32-bit parameters	Size of the model (MByte), 8-bit parameters
MobileNet	3.2	12.8	3.2
DarkNet	20	80	20
SE ResNet	26	104	26
Inception v4	41	164	41
YOLONet	64	256	64
VGGNet	134	536	134
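
The sizes in Table 6.2.1-1 follow from the parameter count and the parameter width. The following Python sketch reproduces that arithmetic. The function name `model_size_mbyte` is illustrative and not taken from the TR; the sketch assumes 1 MByte = 10^6 bytes and that a model consists only of its parameters (no metadata or container overhead).

```python
def model_size_mbyte(params_million: float, bits_per_param: int) -> float:
    """Return the model size in MByte, assuming 1 MByte = 10**6 bytes."""
    bytes_per_param = bits_per_param / 8
    # Millions of parameters times bytes per parameter gives millions of bytes.
    return params_million * bytes_per_param


# (model, parameters in millions) as listed in Table 6.2.1-1
OBJECT_DETECTION_MODELS = [
    ("MobileNet", 3.2),
    ("DarkNet", 20),
    ("SE ResNet", 26),
    ("Inception v4", 41),
    ("YOLONet", 64),
    ("VGGNet", 134),
]

for name, params in OBJECT_DETECTION_MODELS:
    print(
        f"{name}: {model_size_mbyte(params, 32):.1f} MByte (32-bit), "
        f"{model_size_mbyte(params, 8):.1f} MByte (8-bit)"
    )
```

Moving from 32-bit to 8-bit parameters divides the size by four, which is why the 8-bit column equals the parameter count in millions.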

Media content editing combines tasks such as audio and video quality improvement, language translation and face anonymisation. Table 6.2.1-2 and Table 6.2.1-3 list examples of commonly used DNN models for image super-resolution and for video super-resolution.

Table 6.2.1-2: Sizes of typical image super-resolution models

Model for image super-resolution	Number of parameters (Million)	Size of the model (MByte), 32-bit parameters	Size of the model (MByte), 8-bit parameters
RCAN	15.44	61.78	15.44
SAN	15.71	62.82	15.71
RDN	22.12	88.48	22.12
EDSR	40.73	162.92	40.73
OISR-RK3	41.91	167.64	41.91

Table 6.2.1-3: Sizes of typical video super-resolution models

Model for video super-resolution	Number of parameters (Million)	Size of the model (MByte), 32-bit parameters	Size of the model (MByte), 8-bit parameters
RBPN/4-PF	12.7	50.8	12.7
RBPN/6-PF	12.7	50.8	12.7
VSR-DUF	6.8	27.2	6.8
DRDVSR	0.7	2.8	0.7

Two settings are considered for the use case:

Independent user: a person takes a video or starts a video call on their UE in a noisy environment with difficult lighting conditions, and automatic tagging of the scene and objects is embedded in the video.
Crowd: during a large event, such as a live concert, several thousand people use their UEs to film or photograph the band at the same time, and request additional information on the concert such as band discography, lyrics, artist facial recognition and instrument/equipment brand. In this context, UEs request the download of DNN models to improve the capture and recording of the concert, and to provide the information requested by people attending the concert. Several DNN models can be requested by each UE, e.g. to execute the following tasks: image shooting and video optimization, artist face recognition, musical instrument identification, audio improvement and lyrics generation. Given the heterogeneous fleet of UEs, thousands of DNN models required by the application/service can be requested for download. These DNN models are adapted or updated to each UE's operating system type and version, hardware characteristics and environment.

6.2.2 Pre-conditions

The setting for this use case is as follows. Alice is attending a crowded live concert. She is eager to get movie clips and pictures as a souvenir of the concert, but she is worried about the difficult capture conditions: light and sound vary widely, with limited light on the spectators and too much light on the stage. The stereo audio is variable and poorly balanced depending on where she stands in the audience, and the background is very noisy.

Alice would like to store good-quality movie clips and pictures on her private account on the internet for the future, and also to post photos and videos tagged with the artist name and other relevant information during the concert. As Alice is also an amateur musician, she wants to get detailed real-time information about the structure of the song, the lyrics and the instruments.

The pre-conditions are:

Alice is attending a crowded live concert.
Her UE is registered to the 5G network.
Applications on Alice's smartphone can rely on fine-tuned machine learning models that are available via the network that covers the concert hall. These models are:
1. An ML model improving photo capture for this concert hall (a model specially fine-tuned for this concert place).
2. An ML model improving audio capture for this concert hall (a model specially fine-tuned for this concert place).
3. An ML model specialized in the discography and the lyrics of the band.
4. An ML model specialized in face recognition of the artists.
5. An ML model specialized in musical instrument identification.
NOTE: The models listed above are examples for this use case and are not an exhaustive list.

6.2.3 Service Flows

Shortly after the start of the concert, Alice, like most of the fans, launches the camera application on her mobile phone to film the scene and to get additional information about the band, the individual artists, the songs or the instruments.
She points her device's camera towards the scene.
The environment is very dark with strong light spots. To be fully functional and to render the best user experience, the camera application downloads ad-hoc ML models.
The proposed ML models perform very well in this environment but are also very large.
The ML models can be used for the whole capture, especially if the environment remains stable. If the environment changes substantially, or better ML models become available, the ML models can be progressively updated, subject to some operating rules (such as a maximum number of ML model updates per second). In all cases, the camera application continues working seamlessly.
The audio and video streams are captured, improved in quality, processed to extract and display additional information, and either stored in real-time on the mobile phone itself or provided as a live stream.
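
The operating rule mentioned in the service flow (a maximum number of ML model updates per second) could be enforced on the UE side with a simple sliding-window limiter. The sketch below is one possible way to do this and is not specified by the TR; the class name `ModelUpdateLimiter`, its method and the limit of 2 updates per second are all hypothetical.

```python
from collections import deque


class ModelUpdateLimiter:
    """Accept at most `max_updates_per_second` model updates in any 1 s window."""

    def __init__(self, max_updates_per_second: int):
        self.max_updates = max_updates_per_second
        self._accepted = deque()  # timestamps (in seconds) of accepted updates

    def allow(self, now_s: float) -> bool:
        # Forget accepted updates that are at least 1 s old.
        while self._accepted and now_s - self._accepted[0] >= 1.0:
            self._accepted.popleft()
        if len(self._accepted) < self.max_updates:
            self._accepted.append(now_s)
            return True
        # Limit reached: the application keeps using the current model.
        return False


limiter = ModelUpdateLimiter(max_updates_per_second=2)  # hypothetical limit
for t in (0.0, 0.1, 0.2, 1.05):
    print(f"t={t:.2f} s -> update allowed: {limiter.allow(t)}")
```

A rejected update does not interrupt capture: the application simply continues with the model already loaded, which matches the requirement that the camera application keeps working seamlessly.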

6.2.4 Post-conditions

Alice can see that, even in harsh light conditions and with the noisy background, the photos and videos are great, additional information is provided and everything is correctly tagged as requested.

The post-conditions are:

Photos and videos are either stored on the mobile phone in the improved high quality, ready to be uploaded and shared on social media, or directly shared on social media.
Audio recording is high quality, with ambient noise reduction and improved stereo balance.
Additional information about the band, song/lyrics, instruments, etc. is displayed on the mobile phone and stored in the media recordings' metadata.
Alice can visualize additional information and upload the photos and videos on her social network(s) with the associated tags and information provided by the models, and also store the above on her personal media server.

6.2.5 Existing features partly or fully covering the use case functionality

The performance requirements for high data rate and traffic density scenarios are found in clause 7.1 of TS 22.261 [4]. The scenario "Broadband access in a crowd" is relevant for the use case of very dense crowds, for example at stadiums or concerts. In addition to a very high connection density, the users can share their experience, i.e. what they see and hear. This can put a higher requirement on the uplink than on the downlink.

This new use case has some requirements on the downlink not covered by the existing requirements, see Table 6.2.5-1.

Table 6.2.5-1: Excerpt from TS 22.261 [4] Table 7.1-1

	Scenario	Experienced data rate (DL)	Experienced data rate (UL)	Area traffic capacity (DL)	Area traffic capacity (UL)	Overall user density	Activity factor	UE speed	Coverage
4	Broadband access in a crowd	25 Mbit/s	50 Mbit/s	[3.75] Tbit/s/km²	[7.5] Tbit/s/km²	[500 000]/km²	30%	Pedestrians	Confined area
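
To illustrate why the existing downlink figure does not cover this use case, the sketch below estimates how long a model download would take at the 25 Mbit/s experienced DL data rate of Table 6.2.5-1. The function name `download_time_s` is illustrative; the model sizes are taken from Table 6.2.1-1, and the calculation assumes 1 MByte = 8 Mbit and ignores protocol overhead.

```python
def download_time_s(model_size_mbyte: float, dl_rate_mbit_s: float) -> float:
    """Return the transfer time in seconds, ignoring protocol overhead."""
    return model_size_mbyte * 8 / dl_rate_mbit_s


EXISTING_DL_RATE_MBIT_S = 25  # "Broadband access in a crowd", Table 6.2.5-1

# Model sizes in MByte taken from Table 6.2.1-1
for label, size in [
    ("MobileNet, 8-bit", 3.2),
    ("YOLONet, 8-bit", 64),
    ("VGGNet, 32-bit", 536),
]:
    seconds = download_time_s(size, EXISTING_DL_RATE_MBIT_S)
    print(f"{label}: {size} MByte -> {seconds:.2f} s")
```

Under these assumptions, even the smallest listed model takes about 1 s to download and a 64 MByte model takes about 20 s, which is well above the latency values considered for this use case in clause 6.2.6.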

6.2.6 Potential New Requirements needed to support the use case

6.2.6.1 Introduction

Potential new requirements needed to support the use case result from the following parameters:

AI/ML model size.
Accuracy of the model.
Latency constraint of the application or service.
Number of concurrent downloads, i.e. the number of UEs requesting AI/ML model downloads within the same time window.

The number of concurrent downloads further depends on the density of UEs in the covered area and on the size of the covered area.

Table 6.2.6.1-1, Table 6.2.6.1-2 and Table 6.2.6.1-3 contain KPIs for different aspects of the real-time media editing use case.

Table 6.2.6.1-1: Typical sizes of AI/ML models for the UC

AI/ML Model	Number of parameters (Million)	Size of the AI/ML model (MByte)	Comments
MobileNet	3.2	3.2	8-bit parameters
MobileNet	3.2	12.8	32-bit parameters
RCAN	15.44	15.44	8-bit parameters
DarkNet	20	20	8-bit parameters
Inception v4	41	41	8-bit parameters
RCAN	15.44	61.78	32-bit parameters
YOLONet	64	64	8-bit parameters
DarkNet	20	80	32-bit parameters
VGGNet	134	134	8-bit parameters
Inception v4	41	164	32-bit parameters
YOLONet	64	256	32-bit parameters
VGGNet	134	536	32-bit parameters

From Table 6.2.6.1-1, the AI/ML models currently available for this use case have sizes that vary from 3.2 MB to 536 MB.

As indicated in the use case in clause 6.1, the size of AI/ML models can be reduced prior to transmission with dedicated model compression techniques. Conversely, AI/ML models with more neural network layers and more complex architectures are emerging to solve more complex tasks and to improve accuracy, a trend that is expected to continue in the coming years. Typical model sizes in the range of 3 MB to 500 MB appear to be a reasonable compromise to consider for this use case.

In the following, two categories are considered for AI/ML model sizes:

AI/ML model sizes below 64 MB, which can be associated with models optimized for fast transmission,
AI/ML model sizes below 500 MB (64 MB to 500 MB in Table 6.2.6.1-2), which can be associated with models optimized for higher accuracy.

Maximum latency as a function of the application or service:

Videocall service: end-to-end latency below 200 ms,
Video recording, video streaming, and object recognition applications: latency below 1 s.

The user experienced DL data rates that result from the above AI/ML model sizes and maximum latency values are summarized in Table 6.2.6.1-2.

Table 6.2.6.1-2: UC AI/ML model download - single UE - KPIs

UC model download	AI/ML model size	Latency: maximum	User experienced DL data rate
Single Model / Single UE	[3 MB - 64 MB]	<1 s	[24 Mb/s ~ 512 Mb/s]
Single Model / Single UE	[3 MB - 64 MB]	<200 ms	[120 Mb/s ~ 2.56 Gb/s]
Single Model / Single UE	[64 MB - 500 MB]	<1 s	[512 Mb/s ~ 4 Gb/s]
Single Model / Single UE	[64 MB - 500 MB]	<200 ms	[2.56 Gb/s ~ 20 Gb/s]
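
The data rates in Table 6.2.6.1-2 can be checked by dividing the model size by the maximum latency. The sketch below does this for the four rows of the table. The function names are illustrative; the calculation assumes 1 MB = 8 Mb, that the whole latency budget is available for the transfer, and no protocol overhead.

```python
def required_dl_rate_mbit_s(model_size_mbyte: float, max_latency_ms: float) -> float:
    """Return the DL data rate (Mb/s) needed to deliver the model in time."""
    return model_size_mbyte * 8 * 1000 / max_latency_ms


def format_rate(rate_mbit_s: float) -> str:
    """Render a rate in Mb/s, switching to Gb/s from 1000 Mb/s upwards."""
    if rate_mbit_s >= 1000:
        return f"{rate_mbit_s / 1000:g} Gb/s"
    return f"{rate_mbit_s:g} Mb/s"


# (smallest size, largest size) in MB and maximum latency in ms, per table row
ROWS = [((3, 64), 1000), ((3, 64), 200), ((64, 500), 1000), ((64, 500), 200)]

for (low, high), latency_ms in ROWS:
    low_rate = required_dl_rate_mbit_s(low, latency_ms)
    high_rate = required_dl_rate_mbit_s(high, latency_ms)
    print(
        f"[{low} MB - {high} MB], <{latency_ms} ms: "
        f"[{format_rate(low_rate)} ~ {format_rate(high_rate)}]"
    )
```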

As indicated above, the number of concurrent downloads is a further parameter for determining potential new requirements. It corresponds to the maximum number of UEs requesting an AI/ML model download in the same time window and in the same covered area/cell.

The case of a concert hall is an illustration of the scenario, "Broadband access in a crowd" from TS 22.261. This scenario assumes an overall user density of 500 000 UE / km2 (i.e. 0.5 UE / m2) and an activity factor of 30 %.

In the concert hall case, it is also assumed that only some of the UEs intend to request AI/ML model downloads, and that only a subset of these will request an AI/ML model download during the same time window in the same cell. The activity factor is therefore estimated at 1 % (i.e. the percentage of UEs requesting an AI/ML model download within the same time window).

The typical number of UEs in a concert hall, taken as the number of seats, varies from ~1000 to ~5000.

Based on these figures and UE activity assumption, the number of concurrent downloads is estimated as given in Table 6.2.6.1-3.

Table 6.2.6.1-3: Estimated number of concurrent downloads

Number of UEs	Estimated area (density of 0.5 UE / m²)	Activity Factor	Number of concurrent downloads in the same cell
1000	2000 m²	1 %	10
5000	10000 m²	1 %	50
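
The figures in Table 6.2.6.1-3 follow directly from the assumed UE density and activity factor. A minimal sketch of that arithmetic is given below; the function names are illustrative, and the default values are the 0.5 UE / m² density and 1 % activity factor assumed in this clause.

```python
def estimated_area_m2(n_ues: int, density_ue_per_m2: float = 0.5) -> float:
    """Return the area (m²) occupied by n_ues at the given UE density."""
    return n_ues / density_ue_per_m2


def concurrent_downloads(n_ues: int, activity_factor_percent: float = 1.0) -> int:
    """Return the number of UEs downloading a model in the same time window."""
    return round(n_ues * activity_factor_percent / 100)


for n_ues in (1000, 5000):
    print(
        f"{n_ues} UEs: {estimated_area_m2(n_ues):.0f} m², "
        f"{concurrent_downloads(n_ues)} concurrent downloads"
    )
```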

From Table 6.2.6.1-2 and Table 6.2.6.1-3, requirements on the covered area are estimated as follows:

Table 6.2.6.1-4: Estimated covered area DL data rate requirements

Number of UEs	Activity Factor	Number of concurrent downloads	AI/ML model size	Latency: maximum	User experienced DL data rate	Aggregated user experienced DL data rate for covered area
1000	1 %	10	[3 MB - 64 MB]	<1 s	[24 Mb/s ~ 512 Mb/s]	[240 Mb/s ~ 5.12 Gb/s]
			[3 MB - 64 MB]	<200 ms	[120 Mb/s ~ 2.56 Gb/s]	[1.2 Gb/s ~ 25.6 Gb/s]
			[64 MB - 500 MB]	<1 s	[512 Mb/s ~ 4 Gb/s]	[5.12 Gb/s ~ 40 Gb/s]
5000	1 %	50	[3 MB - 64 MB]	<1 s	[24 Mb/s ~ 512 Mb/s]	[1.2 Gb/s ~ 25.6 Gb/s]
			[3 MB - 64 MB]	<200 ms	[120 Mb/s ~ 2.56 Gb/s]	[6 Gb/s ~ 128 Gb/s]
			[64 MB - 500 MB]	<1 s	[512 Mb/s ~ 4 Gb/s]	[25.6 Gb/s ~ 200 Gb/s]
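
The aggregated values in Table 6.2.6.1-4 are the per-UE data rates multiplied by the number of concurrent downloads. The sketch below recomputes them. The function name is illustrative; the per-UE rates (in Mb/s) are copied from Table 6.2.6.1-2, and the calculation assumes that every concurrent download is a separate unicast transfer.

```python
def aggregated_rate_mbit_s(n_concurrent: int, per_ue_rate_mbit_s: float) -> float:
    """Return the total DL rate (Mb/s) for n_concurrent unicast downloads."""
    return n_concurrent * per_ue_rate_mbit_s


# (model size range, maximum latency) -> (lowest, highest) per-UE rate in Mb/s
PER_UE_RATES = {
    ("[3 MB - 64 MB]", "<1 s"): (24, 512),
    ("[3 MB - 64 MB]", "<200 ms"): (120, 2560),
    ("[64 MB - 500 MB]", "<1 s"): (512, 4000),
}

for n_concurrent in (10, 50):
    for (size_range, latency), (low, high) in PER_UE_RATES.items():
        total_low = aggregated_rate_mbit_s(n_concurrent, low) / 1000
        total_high = aggregated_rate_mbit_s(n_concurrent, high) / 1000
        print(
            f"{n_concurrent} downloads, {size_range}, {latency}: "
            f"[{total_low:g} Gb/s ~ {total_high:g} Gb/s]"
        )
```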

Another approach to estimating the number of concurrent downloads is to count the number of different AI/ML models requested by UEs instead of the number of UEs requesting AI/ML models. The AI/ML models can then be broadcast/multicast to multiple UEs. The number of different AI/ML models depends on the accuracy expectations for the AI/ML models, the execution environments and the hardware characteristics of the end devices. Even when the number of UEs requesting AI/ML models is very high, the number of different AI/ML models can remain smaller. This approach is well suited to very large crowds.

The number of concurrent downloads when transmitted in broadcast/multicast to many UEs can be estimated between 1 (i.e. all UEs request the same AI/ML model) and 50 (i.e. all UEs request different AI/ML models).
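
The difference between the two counting approaches can be sketched as follows: with unicast, every request is a separate download, whereas with broadcast/multicast only the number of distinct models matters. The function name and the model identifiers are hypothetical and are used only for illustration.

```python
def transmissions_needed(requested_models: list, multicast: bool) -> int:
    """Return the number of concurrent model transmissions in the cell.

    requested_models holds one model identifier per requesting UE.
    """
    if multicast:
        # One broadcast/multicast transmission per distinct model.
        return len(set(requested_models))
    # One unicast transmission per requesting UE.
    return len(requested_models)


# 50 UEs requesting the same (hypothetical) model
same_model = ["photo-enhancer"] * 50
# 50 UEs each requesting a different (hypothetical) model variant
different_models = [f"photo-enhancer-variant-{i}" for i in range(50)]

print(transmissions_needed(same_model, multicast=True))
print(transmissions_needed(different_models, multicast=True))
print(transmissions_needed(same_model, multicast=False))
```

The two multicast results correspond to the bounds of 1 and 50 given above for 50 requesting UEs.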

6.2.6.2 Potential KPI Requirements

Independent user:

[P.R.6.2-I-001]

The 5G system shall support the download of AI/ML models with a latency below 1 s and a user experienced data rate of 512 Mb/s.

[P.R.6.2-I-002]

The 5G system shall support the download of AI/ML models with a latency below 1 s and a user experienced data rate of 4 Gb/s.

Crowd:

[P.R.6.2-C-001]

The 5G system shall support the parallel download of up to 50 AI/ML models with a latency below 1 s.

[P.R.6.2-C-002]

The 5G system should support the functionality to broadcast/multicast the same AI/ML model to many UEs with a latency below 1 s.