Abstract: Acoustic scene perception spans what the sound is, when it occurs, where it is in direction and distance, and how it sounds in loudness and reverberation. While audio language models excel in sound recognition, single-channel input fundamentally limits spatial understanding. This work presents Sci-Phi, a spatial audio large language model with dual spatial and spectral encoders that estimates a complete parameter set for all sound sources and the surrounding environment. Learning from over 4,000 hours of synthetic first-order ambisonics recordings and their metadata, Sci-Phi enumerates and describes up to four sound sources in one pass, alongside background noise and room characteristics. We evaluate the model with a carefully designed permutation-invariant protocol and 15 metrics covering content, location, timing, loudness, and reverberation, and analyze its robustness across various signal-to-noise ratios, reverberation levels, and challenging cases such as spatially, temporally, or semantically overlapping sound sources. Notably, Sci-Phi generalizes to real room impulse responses with only minor performance degradation. Overall, this work establishes the first audio LLM capable of full spatial-scene description, with strong potential for real-world deployment.
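The permutation-invariant protocol mentioned in the abstract implies that predicted sources have to be paired with ground-truth sources before per-source metrics are computed, because the model may list sources in any order. The following is a minimal sketch of such a pairing, not the authors' protocol: it assumes a brute-force search over assignments (feasible for at most four sources) and a hypothetical cost that adds the onset-time and distance differences. The names `match_sources` and `toy_cost` are illustrative, and the matching criterion actually used for Sci-Phi is not stated here.

```python
from itertools import permutations


def match_sources(pred, truth, cost):
    """Return (total_cost, pairs) for the cheapest one-to-one assignment.

    pairs is a list of (pred_index, truth_index); surplus items on the
    longer side are left unmatched.
    """
    # Permute indices of the longer list so each item of the shorter list gets a partner.
    swap = len(pred) > len(truth)
    short, long_ = (truth, pred) if swap else (pred, truth)
    best_cost, best_pairs = float("inf"), []
    for perm in permutations(range(len(long_)), len(short)):
        pairs = [(j, i) if swap else (i, j) for i, j in enumerate(perm)]
        total = sum(cost(pred[p], truth[t]) for p, t in pairs)
        if total < best_cost:
            best_cost, best_pairs = total, pairs
    return best_cost, best_pairs


def toy_cost(p, t):
    # Hypothetical cost: onset difference (s) plus distance difference (m).
    return abs(p["start"] - t["start"]) + abs(p["distance"] - t["distance"])


# Toy data: the ground truth lists the same three sources in a different order.
pred = [
    {"start": 5.3, "distance": 0.8},
    {"start": 0.0, "distance": 3.0},
    {"start": 8.0, "distance": 1.2},
]
truth = [
    {"start": 7.8, "distance": 1.6},
    {"start": 5.2, "distance": 1.3},
    {"start": 0.1, "distance": 4.4},
]

total, pairs = match_sources(pred, truth, toy_cost)
print(pairs)
```

Once the pairs are fixed, any per-source metric (label, timing, direction, loudness) can be computed on matched pairs, and unmatched sources can be counted separately.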
Full Scene Description (from Synthetic RIR Test Set)
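Each source tuple in the examples ends with a C50 value in dB. This page does not say how C50 is computed. Assuming it is the standard room-acoustics clarity index, it is the ratio of early energy (first 50 ms) to late energy of the room impulse response $h(t)$ between source and receiver:

```latex
C_{50} = 10 \log_{10}
  \frac{\int_{0}^{50\,\mathrm{ms}} h^{2}(t)\,dt}
       {\int_{50\,\mathrm{ms}}^{\infty} h^{2}(t)\,dt}
  \ \mathrm{dB}
```

Under that assumption, a larger C50 means more direct and early energy relative to late reverberation, while RT60 is the time the sound field takes to decay by 60 dB.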
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=1000m^3; RT60=0.4s; n_src=4. noise_label: ambient sound; noise_loudness=-45dB. Sound label: (time, direction, distance, loudness, C50): English female speech with transcript “I can understand why they have gone away.”: (5.3s-8.2s, upper front-left, 0.8m, -18dB, 22dB); Nature:(0.0s-10.0s, horizontal front-right, 3.0m, -25dB, 13dB); fingers on teeth:(8.0s-8.1s, horizontal back-left, 1.2m, -28dB, 19dB); kick bass drum:(0.0s-10.0s, horizontal front-right, 3.4m, -33dB, 12dB). | room_volume=1900m^3; RT60=0.4s; n_src=4. noise_label: ambient electronic hum;noise_loudness=-48dB. Sound label: (time, direction, distance, loudness, C50): English female speech with transcript “I can understand why they have gone.”: (5.2s-8.5s, upper front-left, 1.3m, -19dB, 19dB); Bonapartes Gull:(0.1s-8.5s, horizontal front-right, 4.4m, -25dB, 12dB); baseball bat swing:(7.8s-8.5s, horizontal back-left, 1.6m, -28dB, 19dB); axe chopping:(0.0s-7.5s, horizontal front-right, 4.4m, -35dB, 11dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=1500m^3; RT60=0.3s; n_src=3. noise_label: ambient sounds;noise_loudness=-59dB. Sound label: (time, direction, distance, loudness, C50): English female speech with transcript “Of course, on a diet like this one, I wouldn't recommend.”: (4.1s-8.2s, upper front-left, 0.8m, -33dB, 27dB); English female speech with transcript “That goes without saying.”: (4.1s-6.4s, lower back-right, 1.6m, -33dB, 22dB); Nature:(0.0s-10.0s, horizontal front-left, 2.0m, -42dB, 20dB). | room_volume=1900m^3; RT60=0.3s; n_src=4. noise_label: computer powering down;noise_loudness=-58dB. Sound label: (time, direction, distance, loudness, C50): English female speech with transcript “Of course, on Tuesday, United were beaten despite this.”: (4.0s-7.9s, upper front-left, 0.8m, -33dB, 24dB); censor beep:(4.7s-5.4s, horizontal back-left, 2.0m, -36dB, 18dB); English male speech with transcript “That doesn't happen in Europe.”: (3.6s-6.2s, horizontal back-right, 2.3m, -36dB, 17dB); rattlesnake rattle:(0.0s-10.0s, horizontal front-left, 2.9m, -42dB, 14dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=200m^3; RT60=0.5s; n_src=4. noise_label: rain;noise_loudness=-47dB. Sound label: (time, direction, distance, loudness, C50): female voice:(4.7s-6.6s, lower front-right, 1.1m, -25dB, 10dB); applause:(2.0s-7.8s, lower back-right, 2.4m, -26dB, 7dB); kettle pouring:(1.2s-9.4s, horizontal front-left, 0.8m, -34dB, 12dB); English female speech with transcript “The briefcase held the day's knives.”: (1.8s-4.7s, horizontal front, 1.0m, -37dB, 11dB). | room_volume=200m^3; RT60=0.6s; n_src=4. noise_label: storm;noise_loudness=-49dB. Sound label: (time, direction, distance, loudness, C50): English female speech with transcript “You must be ready to play anyone.”: (3.8s-6.8s, lower front-right, 0.8m, -26dB, 11dB); audience applause:(1.7s-8.3s, lower back-right, 2.2m, -27dB, 6dB); hot water pouring:(1.1s-8.9s, horizontal front-left, 0.8m, -36dB, 10dB); English female speech with transcript “Everyone is taking a breath and waiting.”: (1.5s-5.4s, horizontal front-right, 1.4m, -37dB, 8dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=800m^3; RT60=0.7s; n_src=4. noise_label: rain;noise_loudness=-55dB. Sound label: (time, direction, distance, loudness, C50): animal growling:(2.6s-6.9s, upper front, 0.8m, -15dB, 17dB); dog barking, dog growling, dog whimpering:(0.0s-10.0s, horizontal left, 2.4m, -26dB, 10dB); robotic voice:(0.6s-4.2s, horizontal front, 5.5m, -28dB, 6dB); English female speech with transcript “I should think so too.”: (4.5s-6.8s, horizontal front-right, 2.4m, -32dB, 9dB). | room_volume=600m^3; RT60=0.7s; n_src=4. noise_label: thunder;noise_loudness=-54dB. Sound label: (time, direction, distance, loudness, C50): zombie, demon:(2.8s-6.0s, upper front, 0.7m, -13dB, 16dB); doberman pincher, barking:(0.0s-10.0s, horizontal left, 1.9m, -25dB, 10dB); woosh, slow motion effect:(0.5s-4.0s, horizontal front, 6.1m, -26dB, 6dB); English female speech with transcript “I should think so, too.”: (4.6s-6.8s, horizontal front-right, 2.5m, -32dB, 9dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=400m^3; RT60=0.6s; n_src=4. noise_label: environmental sounds;noise_loudness=-52dB. Sound label: (time, direction, distance, loudness, C50): soda can opening:(3.6s-4.3s, horizontal right, 0.8m, -27dB, 15dB); ringtone:(0.0s-10.0s, upper front-left, 0.8m, -29dB, 15dB); Nature:(0.0s-10.0s, horizontal front-right, 0.8m, -37dB, 15dB); French speech with transcript “Il est le père de František Kaberle et Tomáš Kaberle.”: (3.7s-8.4s, lower back-left, 2.1m, -42dB, 9dB). | room_volume=700m^3; RT60=0.5s; n_src=4. noise_label: bell;noise_loudness=-54dB. Sound label: (time, direction, distance, loudness, C50): beer can opening:(3.6s-5.7s, horizontal right, 1.4m, -29dB, 13dB); phone ringing:(0.0s-10.0s, upper front-left, 1.3m, -30dB, 14dB); bear growling, bear roaring:(0.0s-10.0s, horizontal front-right, 1.2m, -39dB, 14dB); voice, dice rolling:(4.6s-9.6s, lower back, 3.4m, -42dB, 8dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=1500m^3; RT60=0.6s; n_src=3. noise_label: not present;noise_loudness=None. Sound label: (time, direction, distance, loudness, C50): English male speech with transcript “I've lost my head.”: (2.2s-5.6s, horizontal back, 2.6m, -38dB, 13dB); English female speech with transcript “A final agreement has not yet been completed.”: (1.2s-4.6s, upper left, 1.0m, -41dB, 20dB); English male speech with transcript “There was no time to mark.”: (6.2s-8.2s, lower right, 2.8m, -41dB, 13dB). | room_volume=1000m^3; RT60=0.6s; n_src=3. noise_label: not present;noise_loudness=None. Sound label: (time, direction, distance, loudness, C50): English male speech with transcript “I lost my head.”: (3.1s-5.6s, horizontal back, 2.6m, -38dB, 12dB); English female speech with transcript “A final agreement has not yet been completed.”: (1.3s-4.4s, upper left, 0.9m, -40dB, 19dB); English female speech with transcript “There was no time scale.”: (6.2s-8.4s, lower right, 2.5m, -42dB, 12dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=500m^3; RT60=0.6s; n_src=3. noise_label: ambient bathroom sounds;noise_loudness=-55dB. Sound label: (time, direction, distance, loudness, C50): aluminium foil tearing:(6.3s-6.8s, horizontal back, 1.2m, -25dB, 14dB); English male speech with transcript “We were in different places, and we talked for a while.”: (0.5s-4.1s, lower back-left, 0.8m, -28dB, 16dB); fart:(8.6s-9.9s, horizontal front-left, 2.4m, -43dB, 10dB). | room_volume=500m^3; RT60=0.6s; n_src=3. noise_label: horn;noise_loudness=-55dB. Sound label: (time, direction, distance, loudness, C50): patting or tapping:(6.3s-6.6s, horizontal back-right, 1.6m, -26dB, 12dB); English male speech with transcript “We were in different places, usually in cellars.”: (0.5s-4.6s, lower back-left, 0.8m, -28dB, 16dB); fart:(8.6s-10.0s, horizontal front-left, 2.7m, -44dB, 10dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=500m^3; RT60=0.8s; n_src=3. noise_label: fireworks;noise_loudness=-68dB. Sound label: (time, direction, distance, loudness, C50): explosion-like sound:(5.7s-9.1s, horizontal front, 1.1m, -45dB, 10dB); water pouring:(0.0s-10.0s, upper front, 1.4m, -48dB, 9dB); English male speech with transcript “I always fall asleep when I'm doing something.”: (5.2s-8.6s, lower right, 1.1m, -48dB, 10dB). | room_volume=900m^3; RT60=0.9s; n_src=3. noise_label: priest walking in hard sole shoes;noise_loudness=-66dB. Sound label: (time, direction, distance, loudness, C50): time bomb:(1.8s-8.3s, horizontal front, 1.6m, -47dB, 10dB); pouring drink:(0.0s-10.0s, upper front, 2.1m, -48dB, 9dB); English male speech with transcript “I always felt that I was in control of the match.”: (5.3s-8.9s, lower front-right, 2.1m, -48dB, 9dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=1300m^3; RT60=0.6s; n_src=3. noise_label: not present;noise_loudness=None. Sound label: (time, direction, distance, loudness, C50): button click:(1.9s-2.4s, horizontal right, 7.0m, -32dB, 6dB); coin drop:(7.7s-8.5s, upper back-right, 0.8m, -35dB, 18dB); metal band:(1.5s-3.4s, horizontal back-left, 2.5m, -42dB, 10dB). | room_volume=1000m^3; RT60=0.7s; n_src=3. noise_label: not present;noise_loudness=None. Sound label: (time, direction, distance, loudness, C50): button click:(1.9s-6.9s, horizontal right, 9.5m, -34dB, 6dB); metal impact:(7.8s-8.5s, horizontal back-right, 0.7m, -35dB, 18dB); boxing bell:(1.6s-3.3s, horizontal back-left, 2.6m, -41dB, 9dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=500m^3; RT60=0.8s; n_src=3. noise_label: not present;noise_loudness=None. Sound label: (time, direction, distance, loudness, C50): human scream:(0.1s-7.1s, horizontal front-right, 1.1m, -31dB, 12dB); drum, snare:(1.8s-2.0s, upper back-right, 0.9m, -42dB, 13dB); English speech with transcript “A thousand years ago, the city was the center of an ancient civilisation.”: (4.7s-8.9s, lower back-left, 1.8m, -44dB, 9dB). | room_volume=600m^3; RT60=0.8s; n_src=3. noise_label: not present;noise_loudness=None. Sound label: (time, direction, distance, loudness, C50): crow:(0.4s-6.4s, horizontal front-right, 1.2m, -31dB, 12dB); punch:(1.8s-2.2s, upper back-right, 1.1m, -42dB, 13dB); English male speech with transcript “A thousand years ago the church was a powerful force in Europe.”: (4.6s-8.8s, lower back-left, 2.3m, -44dB, 9dB). |
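The description strings in the examples above follow a regular layout: scene-level fields first, then one `label:(time, direction, distance, loudness, C50)` tuple per source. Scoring them field by field requires turning the text back into structured values. The sketch below is one possible parser, written only from the format visible on this page. `parse_scene` and its field names are hypothetical, and it assumes that labels and transcripts contain no semicolons.

```python
import re

# One source entry: "<label>:(<start>s-<end>s, <direction>, <distance>m, <loudness>dB, <C50>dB)"
SOURCE_RE = re.compile(
    r"(?P<label>[^;]+?):\s*"
    r"\((?P<start>[\d.]+)s-(?P<end>[\d.]+)s,\s*"
    r"(?P<direction>[^,]+),\s*"
    r"(?P<distance>[\d.]+)m,\s*"
    r"(?P<loudness>-?[\d.]+)dB,\s*"
    r"(?P<c50>-?[\d.]+)dB\)"
)


def _number(pattern, text):
    """Return the first captured number as a float, or None (e.g. for 'Unknown' or 'None')."""
    match = re.search(pattern, text)
    return float(match.group(1)) if match else None


def parse_scene(text):
    """Split a description string into scene-level fields and a list of sources."""
    # Everything before the "(time, ..., C50):" legend is scene-level metadata.
    head, _, body = text.partition("C50):")
    noise = re.search(r"noise_label:\s*([^;]+);", head)
    n_src = re.search(r"n_src=(\d+)", head)
    scene = {
        "room_volume_m3": _number(r"room_volume=([\d.]+)m\^3", head),
        "rt60_s": _number(r"RT60=([\d.]+)s", head),
        "n_src": int(n_src.group(1)) if n_src else None,
        "noise_label": noise.group(1).strip() if noise else None,
        "noise_loudness_db": _number(r"noise_loudness=(-?[\d.]+)dB", head),
        "sources": [],
    }
    for m in SOURCE_RE.finditer(body):
        scene["sources"].append({
            "label": m.group("label").strip(),
            "start_s": float(m.group("start")),
            "end_s": float(m.group("end")),
            "direction": m.group("direction").strip(),
            "distance_m": float(m.group("distance")),
            "loudness_db": float(m.group("loudness")),
            "c50_db": float(m.group("c50")),
        })
    return scene


example = (
    "room_volume=1300m^3; RT60=0.6s; n_src=3. noise_label: not present;"
    "noise_loudness=None. Sound label: (time, direction, distance, loudness, C50): "
    "button click:(1.9s-2.4s, horizontal right, 7.0m, -32dB, 6dB); "
    "coin drop:(7.7s-8.5s, upper back-right, 0.8m, -35dB, 18dB); "
    "metal band:(1.5s-3.4s, horizontal back-left, 2.5m, -42dB, 10dB)."
)
scene = parse_scene(example)
print(scene["n_src"], [s["label"] for s in scene["sources"]])
```

Fields reported as `Unknown` or `None` (for example, the room volume in the real-RIR ground truth) come back as `None`, so they can be skipped when computing errors.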
Full Scene Description (from Real RIR Test Set)
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=1000m^3; RT60=0.2s; n_src=4. noise_label: ambient noise;noise_loudness=-50dB. Sound label: (time, direction, distance, loudness, C50): bricks falling:(2.7s-4.5s, horizontal front, 1.4m, -22dB, 22dB); dough hook tapping:(1.3s-2.6s, horizontal back-right, 0.8m, -25dB, 27dB); coin sound effect:(7.9s-8.6s, horizontal back, 2.0m, -30dB, 20dB); English female speech with transcript “However, no further action was taken by police.”: (3.7s-7.8s, horizontal back-right, 0.8m, -37dB, 27dB). | room_volume=Unknown; RT60=0.2s; n_src=4. noise_label: ambient noise;noise_loudness=-72dB. Sound label: (time, direction, distance, loudness, C50): toy train:(2.7s-4.4s, horizontal front, 1.5m, -21dB, 28dB); wood block:(1.3s-2.1s, horizontal back-right, 0.8m, -22dB, 21dB); bicycle bell:(7.9s-8.5s, horizontal back, 0.8m, -26dB, 17dB); English male speech with transcript “However, no further action was taken by police.”: (3.6s-7.3s, horizontal right, 1.5m, -37dB, 26dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=700m^3; RT60=0.2s; n_src=3. noise_label: not present;noise_loudness=None. Sound label: (time, direction, distance, loudness, C50): guitar:(2.2s-4.6s, horizontal back-left, 1.0m, -27dB, 25dB); English male speech with transcript “Only one person can claim the credit.”: (3.0s-6.1s, horizontal front, 0.8m, -33dB, 27dB); shaker:(5.2s-7.2s, horizontal back-left, 1.0m, -35dB, 25dB). | room_volume=Unknown; RT60=0.6s; n_src=4. noise_label: ambient noise;noise_loudness=-80dB. Sound label: (time, direction, distance, loudness, C50): piano:(2.2s-4.4s, horizontal back-left, 0.8m, -26dB, 27dB); English male speech with transcript “Only one person can claim the credit.”: (2.9s-6.0s, horizontal front, 1.5m, -33dB, 22dB); English female speech with transcript “We deserved the three points.”: (3.0s-6.8s, horizontal back-left, 0.8m, -38dB, 27dB); shaker:(5.1s-7.2s, horizontal back-left, 0.8m, -38dB, 24dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=700m^3; RT60=0.6s; n_src=4. noise_label: ambient room noise;noise_loudness=-64dB. Sound label: (time, direction, distance, loudness, C50): Bongo, Congo:(1.9s-2.1s, horizontal front, 0.8m, -35dB, 16dB); clapping:(1.6s-3.4s, horizontal left, 0.7m, -40dB, 17dB); human speech:(5.8s-8.4s, lower back-right, 1.5m, -46dB, 12dB); ticking clock sound, voice:(4.5s-6.4s, horizontal back-right, 0.8m, -49dB, 16dB). | room_volume=Unknown; RT60=1.3s; n_src=4. noise_label: ambient noise;noise_loudness=-77dB. Sound label: (time, direction, distance, loudness, C50): hit drum:(1.8s-2.2s, horizontal front, 1.5m, -36dB, 13dB); hand clap:(1.7s-3.3s, horizontal left, 0.8m, -39dB, 16dB); English female speech with transcript “I am about protecting the state pension.”: (5.1s-8.7s, horizontal back-right, 1.5m, -45dB, 13dB); wood block:(4.6s-5.8s, horizontal back-right, 0.8m, -46dB, 17dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=700m^3; RT60=0.3s; n_src=3. noise_label: ambient environmental sounds;noise_loudness=-55dB. Sound label: (time, direction, distance, loudness, C50): female voice:(0.0s-10.0s, horizontal front, 1.1m, -28dB, 20dB); English speech with transcript “Or maybe it's the other way around.”: (3.9s-6.5s, horizontal front-left, 0.7m, -35dB, 23dB); Military:(0.0s-10.0s, upper front-right, 1.4m, -38dB, 18dB). | room_volume=Unknown; RT60=0.5s; n_src=4. noise_label: ambient noise;noise_loudness=-62dB. Sound label: (time, direction, distance, loudness, C50): English female speech with transcript “Anyone remaining after that will be targeted.”: (0.6s-5.0s, horizontal front, 1.5m, -26dB, 16dB); English female speech with transcript “Or maybe it's the other way around.”: (3.4s-6.4s, horizontal left, 0.8m, -33dB, 22dB); hand clap:(0.6s-2.3s, horizontal front, 0.8m, -34dB, 18dB); toy train:(0.0s-4.5s, horizontal front-right, 1.5m, -34dB, 14dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=400m^3; RT60=0.5s; n_src=4. noise_label: ambient noise;noise_loudness=-49dB. Sound label: (time, direction, distance, loudness, C50): English speech with transcript “We don't want to be too intrusive.”: (7.1s-9.8s, horizontal right, 0.8m, -20dB, 15dB); English female speech with transcript “I guess they just can't help it.”: (7.7s-10.0s, upper back-left, 2.0m, -24dB, 9dB); Congo drum:(6.5s-6.6s, horizontal front, 1.1m, -26dB, 12dB); castanet:(0.4s-2.4s, horizontal back, 1.8m, -38dB, 9dB). | room_volume=Unknown; RT60=0.7s; n_src=4. noise_label: ambient noise;noise_loudness=-65dB. Sound label: (time, direction, distance, loudness, C50): English female speech with transcript “We don't want to be too intrusive.”: (7.1s-9.8s, horizontal right, 0.8m, -20dB, 16dB); English male speech with transcript “That is my preference.”: (7.7s-10.0s, horizontal back-left, 1.5m, -25dB, 11dB); hit drum:(6.5s-6.8s, horizontal front, 1.5m, -26dB, 11dB); wood block:(0.5s-1.9s, horizontal back, 1.5m, -37dB, 9dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=2200m^3; RT60=0.5s; n_src=3. noise_label: not present;noise_loudness=None. Sound label: (time, direction, distance, loudness, C50): tapping of broom, concrete floor, wooden broom scrape, woody thump:(8.3s-9.2s, upper front-right, 0.7m, -32dB, 23dB); sax baritone:(3.5s-6.9s, horizontal left, 1.9m, -33dB, 16dB); English speech with transcript “So, in a sense, it was a selfless act.”: (0.0s-7.2s, horizontal front, 0.8m, -37dB, 22dB). | room_volume=Unknown; RT60=0.7s; n_src=3. noise_label: ambient noise;noise_loudness=-75dB. Sound label: (time, direction, distance, loudness, C50): wood block:(8.3s-9.1s, horizontal front-right, 0.8m, -27dB, 27dB); organ:(4.1s-5.9s, horizontal left, 1.5m, -30dB, 18dB); English male speech with transcript “So, in a sense, it was a government subsidy.”: (3.1s-7.1s, horizontal front, 1.5m, -36dB, 21dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=1000m^3; RT60=0.4s; n_src=4. noise_label: thunder;noise_loudness=-72dB. Sound label: (time, direction, distance, loudness, C50): English speech with transcript “It's a delightful idea, but the implementation is extremely complicated.”: (4.4s-7.3s, horizontal front, 0.8m, -42dB, 19dB); English female speech with transcript “Party is up for it!”: (3.2s-5.5s, horizontal back-left, 0.7m, -43dB, 20dB); English speech with transcript “That'll be the case on Tuesday.”: (7.5s-9.7s, horizontal left, 1.3m, -50dB, 16dB); footsteps on carpet:(0.0s-10.0s, horizontal front-left, 1.6m, -59dB, 15dB). | room_volume=Unknown; RT60=0.7s; n_src=3. noise_label: ambient noise;noise_loudness=-65dB. Sound label: (time, direction, distance, loudness, C50): English female speech with transcript “The party is up for it.”: (2.9s-4.9s, horizontal back-left, 0.8m, -40dB, 19dB); English female speech with transcript “It's a delightful idea, but a distancing one.”: (4.3s-7.7s, horizontal front, 1.5m, -41dB, 16dB); English female speech with transcript “That will be the case on Tuesday.”: (7.5s-9.8s, horizontal left, 1.5m, -47dB, 12dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=500m^3; RT60=0.4s; n_src=3. noise_label: ambient background noise;noise_loudness=-55dB. Sound label: (time, direction, distance, loudness, C50): English female speech with transcript “We want to sort it out.”: (5.0s-7.3s, horizontal right, 1.6m, -35dB, 12dB); English female speech with transcript “You need a trademark.”: (7.4s-10.0s, horizontal left, 0.7m, -37dB, 18dB); English speech with transcript “The man was obviously desperate enough to hire a private thief.”: (0.2s-4.2s, horizontal front, 1.2m, -44dB, 14dB). | room_volume=Unknown; RT60=0.6s; n_src=3. noise_label: ambient noise;noise_loudness=-66dB. Sound label: (time, direction, distance, loudness, C50): English male speech with transcript “We want to sort it out.”: (5.0s-7.3s, horizontal right, 1.5m, -35dB, 13dB); English female speech with transcript “You need a trademark.”: (7.4s-9.6s, horizontal left, 0.8m, -37dB, 17dB); English female speech with transcript “The man was obviously desperate to get away from the police.”: (0.0s-4.0s, horizontal front, 1.5m, -42dB, 12dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=700m^3; RT60=0.4s; n_src=3. noise_label: ambient sounds;noise_loudness=-55dB. Sound label: (time, direction, distance, loudness, C50): brass bell:(2.4s-3.1s, horizontal left, 0.7m, -28dB, 20dB); Military:(2.5s-8.6s, horizontal front, 1.1m, -30dB, 17dB); percussion:(4.1s-6.0s, horizontal front, 0.8m, -40dB, 19dB). | room_volume=Unknown; RT60=1.7s; n_src=3. noise_label: ambient noise;noise_loudness=-85dB. Sound label: (time, direction, distance, loudness, C50): bicycle bell:(2.4s-3.3s, horizontal left, 0.8m, -29dB, 19dB); toy train:(3.1s-7.3s, horizontal front, 1.5m, -30dB, 14dB); hand clap:(4.1s-5.6s, horizontal front, 1.5m, -38dB, 15dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=500m^3; RT60=0.4s; n_src=1. noise_label: car horn;noise_loudness=-35dB. Sound label: (time, direction, distance, loudness, C50): English speech with transcript “That is giving me great confidence.”: (2.8s-9.9s, horizontal left, 0.7m, -40dB, 19dB). | room_volume=Unknown; RT60=0.6s; n_src=2. noise_label: ambient noise;noise_loudness=-76dB. Sound label: (time, direction, distance, loudness, C50): organ:(2.9s-5.0s, horizontal front-left, 1.5m, -22dB, 11dB); English female speech with transcript “That has given me great confidence.”: (3.2s-6.1s, horizontal left, 0.8m, -36dB, 17dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=1500m^3; RT60=0.2s; n_src=2. noise_label: ambient noise;noise_loudness=-55dB. Sound label: (time, direction, distance, loudness, C50): hi hat:(7.9s-9.9s, horizontal front, 1.8m, -28dB, 22dB); English female speech with transcript “They had to be cut from the wreckage.”: (1.9s-6.2s, upper back-left, 0.8m, -39dB, 28dB). | room_volume=Unknown; RT60=0.2s; n_src=2. noise_label: ambient noise;noise_loudness=-72dB. Sound label: (time, direction, distance, loudness, C50): shaker:(7.9s-10.0s, horizontal front, 1.5m, -29dB, 25dB); English female speech with transcript “They had to be cut from the wreckage.”: (2.0s-6.2s, horizontal back-left, 0.8m, -37dB, 30dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=1000m^3; RT60=0.6s; n_src=2. noise_label: ambient room sounds;noise_loudness=-55dB. Sound label: (time, direction, distance, loudness, C50): electronic alarm:(5.1s-9.7s, horizontal front, 0.8m, -34dB, 17dB); English speech with transcript “That was never her agenda.”: (0.1s-9.6s, horizontal left, 0.7m, -42dB, 18dB). | room_volume=Unknown; RT60=0.7s; n_src=2. noise_label: ambient noise;noise_loudness=-67dB. Sound label: (time, direction, distance, loudness, C50): security buzzer:(6.1s-8.4s, horizontal front-right, 1.5m, -33dB, 18dB); English female speech with transcript “That was never their agenda.”: (2.9s-5.8s, horizontal left, 1.5m, -42dB, 17dB). |
Sci-Phi's Description | Ground-truth Description |
---|---|
room_volume=200m^3; RT60=0.5s; n_src=3. noise_label: stairs;noise_loudness=-79dB. Sound label: (time, direction, distance, loudness, C50): English male speech with transcript “He is an extraordinary writer on so many levels.”: (5.0s-9.7s, horizontal left, 1.1m, -51dB, 10dB); bell:(4.3s-9.8s, horizontal front, 0.8m, -53dB, 12dB); Nature:(0.0s-10.0s, horizontal left, 1.1m, -64dB, 10dB). | room_volume=Unknown; RT60=0.6s; n_src=2. noise_label: ambient noise;noise_loudness=-74dB. Sound label: (time, direction, distance, loudness, C50): English male speech with transcript “He is an extraordinary writer on so many levels.”: (4.9s-9.3s, horizontal left, 1.5m, -49dB, 9dB); metallophone:(6.2s-9.0s, horizontal front, 1.5m, -51dB, 10dB). |
Open-ended Question Answering (from Real RIR Test Set)
Sci-Phi's Answer | Ground-truth Answer |
---|---|
A low-reverberant large-sized room with high SNR. | A low-reverberant unknown-sized room with high SNR. |
(Notice that there are multiple bells at different locations.)
Sci-Phi's Answer | Ground-truth Answer |
---|---|
Bell:(3.8s-4.6s,horizontal front,1.0m). | Bicycle bell:(3.8s-4.6s,horizontal front,1.5m). |
Sci-Phi's Answer | Ground-truth Answer |
---|---|
Baby. | Baby; Bronx cheer. |
Sci-Phi's Answer | Ground-truth Answer |
---|---|
Horizontal back. | Horizontal back. |
Sci-Phi's Answer | Ground-truth Answer |
---|---|
No audio sources in lower left found. | No audio sources in lower left found. |