It is possible to achieve something similar using Lightship VPS. Instead of scanning the entire building, you would scan a specific point of interest where the user begins the experience.

Users will still need to localize against specific ground-level meshes, rather than pointing their phone at the structure from an arbitrary location.

For large structures, you'll need to find to-scale models of the structure, or create one yourself using photogrammetry or a NeRF.
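If you build the model yourself, note that photogrammetry output often comes out in arbitrary units, so it usually has to be rescaled before it is truly to-scale. Here is a minimal sketch of that step in plain Python, assuming you know the real-world height of the structure. The function names, the Y-up axis convention, and the numbers are all hypothetical and for illustration only. None of this is part of the Lightship API.

```python
from typing import List, Tuple

Vertex = Tuple[float, float, float]


def uniform_scale_factor(vertices: List[Vertex], real_height_m: float, up_axis: int = 1) -> float:
    """Return the factor that makes the model's height equal real_height_m.

    Assumes the model is upright along `up_axis` (1 = Y-up, an assumption).
    """
    heights = [v[up_axis] for v in vertices]
    model_height = max(heights) - min(heights)
    if model_height <= 0:
        raise ValueError("model has no extent along the up axis")
    return real_height_m / model_height


def scale_vertices(vertices: List[Vertex], factor: float) -> List[Vertex]:
    """Apply the same factor to every axis so proportions are preserved."""
    return [(x * factor, y * factor, z * factor) for x, y, z in vertices]


# Hypothetical example: a model 2.5 units tall for a structure assumed to be 50 m tall.
model = [(0.0, 0.0, 0.0), (1.0, 2.5, 0.0), (0.5, 1.0, 1.0)]
factor = uniform_scale_factor(model, real_height_m=50.0)
scaled = scale_vertices(model, factor)
```

In practice you would apply the resulting factor as a uniform scale on the model in your 3D tool or engine rather than rewriting vertices by hand.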

For reference, here is a mesh I used to localize against a large tower and the final experience, using a to-scale model (created with screenshots from Apple Maps and photogrammetry) for occlusion.

coit-tower-mesh.png

coit-tower.MOV

You might also consider integrating an external VPS, such as Immersal, designed specifically for city-scale AR.